DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
@ 2015-12-02  3:43 Yuanhan Liu
  2015-12-02  3:43 ` [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
                   ` (6 more replies)
  0 siblings, 7 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-02  3:43 UTC (permalink / raw)
  To: dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

This patch set adds the initial vhost-user live migration support.

The major task behind that is to log pages we touched during
live migration. So, this patch is basically about adding vhost
log support, and using it.

Patchset
========
- Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
  the dirty memory bitmap is.
    
- Patch 2 introduces a vhost_log_write() helper function to log
  pages we are gonna change.

- Patch 3 logs changes we made to used vring.

- Patch 4 sets log_fhmfd protocol feature bit, which actually
  enables the vhost-user live migration support.

A simple test guide (on same host)
==================================

The following test is based on OVS + DPDK. And here is guide
to setup OVS + DPDK:

    http://wiki.qemu.org/Features/vhost-user-ovs-dpdk

1. start ovs-vswitchd

2. Add two ovs vhost-user port, say vhost0 and vhost1

3. Start a VM1 to connect to vhost0. Here is my example:

   $QEMU -enable-kvm -m 1024 -smp 4 \
       -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
       -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
       -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
       -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
       -numa node,memdev=mem -mem-prealloc \
       -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
       -hda fc-19-i386.img \
       -monitor telnet::3333,server,nowait -curses

4. run "ping $host" inside VM1

5. Start VM2 to connect to vhost0, and marking it as the target
   of live migration (by adding -incoming tcp:0:4444 option)

   $QEMU -enable-kvm -m 1024 -smp 4 \
       -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
       -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
       -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
       -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
       -numa node,memdev=mem -mem-prealloc \
       -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
       -hda fc-19-i386.img \
       -monitor telnet::3334,server,nowait -curses \
       -incoming tcp:0:4444 

6. connect to VM1 monitor, and start migration:

   > migrate tcp:0:4444

7. After a while, you will find that VM1 has been migrated to VM2,
   and the "ping" command continues running, perfectly.


Note: this patch set has mostly been based on Victor Kaplansk's demo
work (vhost-user-bridge) at QEMU project. I was thinking to add Victor
as the co-author. Victor, what do you think of that? :)

Comments are welcome!

---
Yuanhan Liu (4):
  vhost: handle VHOST_USER_SET_LOG_BASE request
  vhost: introduce vhost_log_write
  vhost: log vring changes
  vhost: enable log_shmfd protocol feature

 lib/librte_vhost/rte_virtio_net.h             | 35 ++++++++++++++
 lib/librte_vhost/vhost_rxtx.c                 | 70 ++++++++++++++++++---------
 lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++-
 lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 +++
 lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++
 lib/librte_vhost/vhost_user/virtio-net-user.h |  5 +-
 lib/librte_vhost/virtio-net.c                 |  4 ++
 7 files changed, 145 insertions(+), 26 deletions(-)

-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02  3:43 [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
@ 2015-12-02  3:43 ` Yuanhan Liu
  2015-12-02 13:53   ` Panu Matilainen
  2015-12-08  5:57   ` Xie, Huawei
  2015-12-02  3:43 ` [dpdk-dev] [PATCH 2/4] vhost: introduce vhost_log_write Yuanhan Liu
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-02  3:43 UTC (permalink / raw)
  To: dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
vhost-user) where we should log dirty pages, and how big the log
buffer is.

This request introduces a new payload:

	typedef struct VhostUserLog {
		uint64_t mmap_size;
		uint64_t mmap_offset;
	} VhostUserLog;

Also, a fd is delivered from QEMU by ancillary data.

With those info given, an area of memory is mmaped, assigned
to dev->log_base, for logging dirty pages.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/rte_virtio_net.h             |  2 ++
 lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++++-
 lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++++
 lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++++++++++++
 lib/librte_vhost/vhost_user/virtio-net-user.h |  1 +
 5 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 5687452..416dac2 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -127,6 +127,8 @@ struct virtio_net {
 #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
 	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
 	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
+	uint64_t		log_size;	/**< Size of log area */
+	uint8_t			*log_base;	/**< Where dirty pages are logged */
 	void			*priv;		/**< private context */
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
 } __rte_cache_aligned;
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 2dc0547..76bcac2 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -388,7 +388,12 @@ vserver_message_handler(int connfd, void *dat, int *remove)
 		break;
 
 	case VHOST_USER_SET_LOG_BASE:
-		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
+		user_set_log_base(ctx, &msg);
+
+		/* it needs a reply */
+		msg.size = sizeof(msg.payload.u64);
+		send_vhost_message(connfd, &msg);
+		break;
 	case VHOST_USER_SET_LOG_FD:
 		close(msg.fds[0]);
 		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.h b/lib/librte_vhost/vhost_user/vhost-net-user.h
index 38637cc..6d252a3 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.h
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.h
@@ -83,6 +83,11 @@ typedef struct VhostUserMemory {
 	VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
 } VhostUserMemory;
 
+typedef struct VhostUserLog {
+	uint64_t mmap_size;
+	uint64_t mmap_offset;
+} VhostUserLog;
+
 typedef struct VhostUserMsg {
 	VhostUserRequest request;
 
@@ -97,6 +102,7 @@ typedef struct VhostUserMsg {
 		struct vhost_vring_state state;
 		struct vhost_vring_addr addr;
 		VhostUserMemory memory;
+		VhostUserLog    log;
 	} payload;
 	int fds[VHOST_MEMORY_MAX_NREGIONS];
 } __attribute((packed)) VhostUserMsg;
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
index 2934d1c..1d705fd 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -365,3 +365,47 @@ user_set_protocol_features(struct vhost_device_ctx ctx,
 
 	dev->protocol_features = protocol_features;
 }
+
+int
+user_set_log_base(struct vhost_device_ctx ctx,
+		 struct VhostUserMsg *msg)
+{
+	struct virtio_net *dev;
+	int fd = msg->fds[0];
+	uint64_t size, off;
+	void *addr;
+
+	dev = get_device(ctx);
+	if (!dev)
+		return -1;
+
+	if (fd < 0) {
+		RTE_LOG(ERR, VHOST_CONFIG, "invalid log fd: %d\n", fd);
+		return -1;
+	}
+
+	if (msg->size != sizeof(VhostUserLog)) {
+		RTE_LOG(ERR, VHOST_CONFIG,
+			"invalid log base msg size: %"PRId32" != %d\n",
+			msg->size, (int)sizeof(VhostUserLog));
+		return -1;
+	}
+
+	size = msg->payload.log.mmap_size;
+	off  = msg->payload.log.mmap_offset;
+	RTE_LOG(INFO, VHOST_CONFIG,
+		"log mmap size: %"PRId64", offset: %"PRId64"\n",
+		size, off);
+
+	addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
+	if (addr == MAP_FAILED) {
+		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
+		return -1;
+	}
+
+	/* TODO: unmap on stop */
+	dev->log_base = addr;
+	dev->log_size = size;
+
+	return 0;
+}
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index b82108d..013cf38 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
 
 void user_set_protocol_features(struct vhost_device_ctx ctx,
 				uint64_t protocol_features);
+int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
 
 int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
 
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH 2/4] vhost: introduce vhost_log_write
  2015-12-02  3:43 [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
  2015-12-02  3:43 ` [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
@ 2015-12-02  3:43 ` Yuanhan Liu
  2015-12-02 13:53   ` Victor Kaplansky
  2015-12-02  3:43 ` [dpdk-dev] [PATCH 3/4] vhost: log vring changes Yuanhan Liu
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-02  3:43 UTC (permalink / raw)
  To: dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

Introduce vhost_log_write() helper function to log the dirty pages we
touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
log is presented by 1 bit.

Therefore, vhost_log_write() simply finds the right bit for related
page we are gonna change, and set it to 1. dev->log_base denotes the
start of the dirty page bitmap.

The page address is biased by log_guest_addr, which is derived from
SET_VRING_ADDR request as part of the vring related addresses.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/rte_virtio_net.h | 34 ++++++++++++++++++++++++++++++++++
 lib/librte_vhost/virtio-net.c     |  4 ++++
 2 files changed, 38 insertions(+)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 416dac2..191c1be 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -40,6 +40,7 @@
  */
 
 #include <stdint.h>
+#include <linux/vhost.h>
 #include <linux/virtio_ring.h>
 #include <linux/virtio_net.h>
 #include <sys/eventfd.h>
@@ -59,6 +60,8 @@ struct rte_mbuf;
 /* Backend value set by guest. */
 #define VIRTIO_DEV_STOPPED -1
 
+#define VHOST_LOG_PAGE	4096
+
 
 /* Enum for virtqueue management. */
 enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
@@ -82,6 +85,7 @@ struct vhost_virtqueue {
 	struct vring_desc	*desc;			/**< Virtqueue descriptor ring. */
 	struct vring_avail	*avail;			/**< Virtqueue available ring. */
 	struct vring_used	*used;			/**< Virtqueue used ring. */
+	uint64_t		log_guest_addr;		/**< Physical address of used ring, for logging */
 	uint32_t		size;			/**< Size of descriptor ring. */
 	uint32_t		backend;		/**< Backend value to determine if device should started/stopped. */
 	uint16_t		vhost_hlen;		/**< Vhost header length (varies depending on RX merge buffers. */
@@ -203,6 +207,36 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
 	return vhost_va;
 }
 
+static inline void __attribute__((always_inline))
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+	/* TODO: to make it atomic? */
+	log_base[page / 8] |= 1 << (page % 8);
+}
+
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		uint64_t offset, uint64_t len)
+{
+	uint64_t addr = vq->log_guest_addr;
+	uint64_t page;
+
+	if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+		     !dev->log_base || !len))
+		return;
+
+	addr += offset;
+	if (dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8))
+		return;
+
+	page = addr / VHOST_LOG_PAGE;
+	while (page * VHOST_LOG_PAGE < addr + len) {
+		vhost_log_page(dev->log_base, page);
+		page += VHOST_LOG_PAGE;
+	}
+}
+
+
 /**
  *  Disable features in feature_mask. Returns 0 on success.
  */
diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index 8364938..4481827 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -666,12 +666,16 @@ set_vring_addr(struct vhost_device_ctx ctx, struct vhost_vring_addr *addr)
 		return -1;
 	}
 
+	vq->log_guest_addr = addr->log_guest_addr;
+
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address desc: %p\n",
 			dev->device_fh, vq->desc);
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address avail: %p\n",
 			dev->device_fh, vq->avail);
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address used: %p\n",
 			dev->device_fh, vq->used);
+	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") log_guest_addr: %p\n",
+			dev->device_fh, (void *)(uintptr_t)vq->log_guest_addr);
 
 	return 0;
 }
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH 3/4] vhost: log vring changes
  2015-12-02  3:43 [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
  2015-12-02  3:43 ` [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
  2015-12-02  3:43 ` [dpdk-dev] [PATCH 2/4] vhost: introduce vhost_log_write Yuanhan Liu
@ 2015-12-02  3:43 ` Yuanhan Liu
  2015-12-02 14:07   ` Victor Kaplansky
  2015-12-02  3:43 ` [dpdk-dev] [PATCH 4/4] vhost: enable log_shmfd protocol feature Yuanhan Liu
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-02  3:43 UTC (permalink / raw)
  To: dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

Invoking vhost_log_write() to mark corresponding page as dirty while
updating used vring.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost_rxtx.c | 74 +++++++++++++++++++++++++++++--------------
 1 file changed, 50 insertions(+), 24 deletions(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index 9322ce6..d4805d8 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -129,6 +129,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 		uint32_t offset = 0, vb_offset = 0;
 		uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
 		uint8_t hdr = 0, uncompleted_pkt = 0;
+		uint16_t idx;
 
 		/* Get descriptor from available ring */
 		desc = &vq->desc[head[packet_success]];
@@ -200,16 +201,22 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 		}
 
 		/* Update used ring with desc information */
-		vq->used->ring[res_cur_idx & (vq->size - 1)].id =
-							head[packet_success];
+		idx = res_cur_idx & (vq->size - 1);
+		vq->used->ring[idx].id = head[packet_success];
 
 		/* Drop the packet if it is uncompleted */
 		if (unlikely(uncompleted_pkt == 1))
-			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
-							vq->vhost_hlen;
+			vq->used->ring[idx].len = vq->vhost_hlen;
 		else
-			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
-							pkt_len + vq->vhost_hlen;
+			vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
+
+		/*
+		 * to defer the update to when updating used->idx,
+		 * and batch them?
+		 */
+		vhost_log_write(dev, vq,
+			offsetof(struct vring_used, ring[idx]),
+			sizeof(vq->used->ring[idx]));
 
 		res_cur_idx++;
 		packet_success++;
@@ -236,6 +243,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 	*(volatile uint16_t *)&vq->used->idx += count;
 	vq->last_used_idx = res_end_idx;
+	vhost_log_write(dev, vq,
+		offsetof(struct vring_used, idx),
+		sizeof(vq->used->idx));
 
 	/* flush used->idx update before we read avail->flags. */
 	rte_mb();
@@ -265,6 +275,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 	uint32_t seg_avail;
 	uint32_t vb_avail;
 	uint32_t cpy_len, entry_len;
+	uint16_t idx;
 
 	if (pkt == NULL)
 		return 0;
@@ -302,16 +313,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 	entry_len = vq->vhost_hlen;
 
 	if (vb_avail == 0) {
-		uint32_t desc_idx =
-			vq->buf_vec[vec_idx].desc_idx;
+		uint32_t desc_idx = vq->buf_vec[vec_idx].desc_idx;
+
+		if ((vq->desc[desc_idx].flags & VRING_DESC_F_NEXT) == 0) {
+			idx = cur_idx & (vq->size - 1);
 
-		if ((vq->desc[desc_idx].flags
-			& VRING_DESC_F_NEXT) == 0) {
 			/* Update used ring with desc information */
-			vq->used->ring[cur_idx & (vq->size - 1)].id
-				= vq->buf_vec[vec_idx].desc_idx;
-			vq->used->ring[cur_idx & (vq->size - 1)].len
-				= entry_len;
+			vq->used->ring[idx].id = vq->buf_vec[vec_idx].desc_idx;
+			vq->used->ring[idx].len = entry_len;
+
+			vhost_log_write(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 
 			entry_len = 0;
 			cur_idx++;
@@ -354,10 +367,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 			if ((vq->desc[vq->buf_vec[vec_idx].desc_idx].flags &
 				VRING_DESC_F_NEXT) == 0) {
 				/* Update used ring with desc information */
-				vq->used->ring[cur_idx & (vq->size - 1)].id
+				idx = cur_idx & (vq->size - 1);
+				vq->used->ring[idx].id
 					= vq->buf_vec[vec_idx].desc_idx;
-				vq->used->ring[cur_idx & (vq->size - 1)].len
-					= entry_len;
+				vq->used->ring[idx].len = entry_len;
+				vhost_log_write(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 				entry_len = 0;
 				cur_idx++;
 				entry_success++;
@@ -390,16 +406,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 
 					if ((vq->desc[desc_idx].flags &
 						VRING_DESC_F_NEXT) == 0) {
-						uint16_t wrapped_idx =
-							cur_idx & (vq->size - 1);
+						idx = cur_idx & (vq->size - 1);
 						/*
 						 * Update used ring with the
 						 * descriptor information
 						 */
-						vq->used->ring[wrapped_idx].id
+						vq->used->ring[idx].id
 							= desc_idx;
-						vq->used->ring[wrapped_idx].len
+						vq->used->ring[idx].len
 							= entry_len;
+						vhost_log_write(dev, vq,
+							offsetof(struct vring_used, ring[idx]),
+							sizeof(vq->used->ring[idx]));
 						entry_success++;
 						entry_len = 0;
 						cur_idx++;
@@ -422,10 +440,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 				 * This whole packet completes.
 				 */
 				/* Update used ring with desc information */
-				vq->used->ring[cur_idx & (vq->size - 1)].id
+				idx = cur_idx & (vq->size - 1);
+				vq->used->ring[idx].id
 					= vq->buf_vec[vec_idx].desc_idx;
-				vq->used->ring[cur_idx & (vq->size - 1)].len
-					= entry_len;
+				vq->used->ring[idx].len = entry_len;
+				vhost_log_write(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 				entry_success++;
 				break;
 			}
@@ -658,6 +679,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		/* Update used index buffer information. */
 		vq->used->ring[used_idx].id = head[entry_success];
 		vq->used->ring[used_idx].len = 0;
+		vhost_log_write(dev, vq,
+				offsetof(struct vring_used, ring[used_idx]),
+				sizeof(vq->used->ring[used_idx]));
 
 		/* Allocate an mbuf and populate the structure. */
 		m = rte_pktmbuf_alloc(mbuf_pool);
@@ -778,6 +802,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 
 	rte_compiler_barrier();
 	vq->used->idx += entry_success;
+	vhost_log_write(dev, vq, offsetof(struct vring_used, idx),
+			sizeof(vq->used->idx));
 	/* Kick guest if required. */
 	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
 		eventfd_write(vq->callfd, (eventfd_t)1);
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH 4/4] vhost: enable log_shmfd protocol feature
  2015-12-02  3:43 [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
                   ` (2 preceding siblings ...)
  2015-12-02  3:43 ` [dpdk-dev] [PATCH 3/4] vhost: log vring changes Yuanhan Liu
@ 2015-12-02  3:43 ` Yuanhan Liu
  2015-12-02 14:10 ` [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support Victor Kaplansky
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-02  3:43 UTC (permalink / raw)
  To: dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

To claim that we support vhost-user live migration support:
SET_LOG_BASE request will be send only when this feature flag
is set.

Besides this flag, we actually need another feature flag set
to make vhost-user live migration work: VHOST_F_LOG_ALL.
Which, however, has been enabled long time ago.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost_user/virtio-net-user.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index 013cf38..a3a889d 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -38,8 +38,10 @@
 #include "vhost-net-user.h"
 
 #define VHOST_USER_PROTOCOL_F_MQ	0
+#define VHOST_USER_PROTOCOL_F_LOG_SHMFD	1
 
-#define VHOST_USER_PROTOCOL_FEATURES	(1ULL << VHOST_USER_PROTOCOL_F_MQ)
+#define VHOST_USER_PROTOCOL_FEATURES	((1ULL << VHOST_USER_PROTOCOL_F_MQ) | \
+					 (1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD))
 
 int user_set_mem_table(struct vhost_device_ctx, struct VhostUserMsg *);
 
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 2/4] vhost: introduce vhost_log_write
  2015-12-02  3:43 ` [dpdk-dev] [PATCH 2/4] vhost: introduce vhost_log_write Yuanhan Liu
@ 2015-12-02 13:53   ` Victor Kaplansky
  2015-12-02 14:39     ` Yuanhan Liu
  2015-12-09  3:33     ` Xie, Huawei
  0 siblings, 2 replies; 136+ messages in thread
From: Victor Kaplansky @ 2015-12-02 13:53 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 11:43:11AM +0800, Yuanhan Liu wrote:
> Introduce vhost_log_write() helper function to log the dirty pages we
> touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
> log is presented by 1 bit.
> 
> Therefore, vhost_log_write() simply finds the right bit for related
> page we are gonna change, and set it to 1. dev->log_base denotes the
> start of the dirty page bitmap.
> 
> The page address is biased by log_guest_addr, which is derived from
> SET_VRING_ADDR request as part of the vring related addresses.
> 
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  lib/librte_vhost/rte_virtio_net.h | 34 ++++++++++++++++++++++++++++++++++
>  lib/librte_vhost/virtio-net.c     |  4 ++++
>  2 files changed, 38 insertions(+)
> 
> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> index 416dac2..191c1be 100644
> --- a/lib/librte_vhost/rte_virtio_net.h
> +++ b/lib/librte_vhost/rte_virtio_net.h
> @@ -40,6 +40,7 @@
>   */
>  
>  #include <stdint.h>
> +#include <linux/vhost.h>
>  #include <linux/virtio_ring.h>
>  #include <linux/virtio_net.h>
>  #include <sys/eventfd.h>
> @@ -59,6 +60,8 @@ struct rte_mbuf;
>  /* Backend value set by guest. */
>  #define VIRTIO_DEV_STOPPED -1
>  
> +#define VHOST_LOG_PAGE	4096
> +
>  
>  /* Enum for virtqueue management. */
>  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> @@ -82,6 +85,7 @@ struct vhost_virtqueue {
>  	struct vring_desc	*desc;			/**< Virtqueue descriptor ring. */
>  	struct vring_avail	*avail;			/**< Virtqueue available ring. */
>  	struct vring_used	*used;			/**< Virtqueue used ring. */
> +	uint64_t		log_guest_addr;		/**< Physical address of used ring, for logging */
>  	uint32_t		size;			/**< Size of descriptor ring. */
>  	uint32_t		backend;		/**< Backend value to determine if device should started/stopped. */
>  	uint16_t		vhost_hlen;		/**< Vhost header length (varies depending on RX merge buffers. */
> @@ -203,6 +207,36 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
>  	return vhost_va;
>  }
>  
> +static inline void __attribute__((always_inline))
> +vhost_log_page(uint8_t *log_base, uint64_t page)
> +{
> +	/* TODO: to make it atomic? */
> +	log_base[page / 8] |= 1 << (page % 8);

I think the atomic OR operation is necessary only if there can be
more than one vhost-user back-end updating the guest's memory
simultaneously. However probably it is pretty safe to perform
regular OR operation, since rings are not shared between
back-end. What about buffers pointed by descriptors?  To be on
the safe side, I would use a GCC built-in function
__sync_fetch_and_or(). 

> +}
> +
> +static inline void __attribute__((always_inline))
> +vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
> +		uint64_t offset, uint64_t len)
> +{
> +	uint64_t addr = vq->log_guest_addr;
> +	uint64_t page;
> +
> +	if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> +		     !dev->log_base || !len))
> +		return;

Isn't "likely" more appropriate in above, since the whole
expression is expected to be true most of the time?

> +
> +	addr += offset;
> +	if (dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8))
> +		return;
> +
> +	page = addr / VHOST_LOG_PAGE;
> +	while (page * VHOST_LOG_PAGE < addr + len) {
> +		vhost_log_page(dev->log_base, page);
> +		page += VHOST_LOG_PAGE;
> +	}
> +}
> +
> +
>  /**
>   *  Disable features in feature_mask. Returns 0 on success.
>   */
> diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> index 8364938..4481827 100644
> --- a/lib/librte_vhost/virtio-net.c
> +++ b/lib/librte_vhost/virtio-net.c
> @@ -666,12 +666,16 @@ set_vring_addr(struct vhost_device_ctx ctx, struct vhost_vring_addr *addr)
>  		return -1;
>  	}
>  
> +	vq->log_guest_addr = addr->log_guest_addr;
> +
>  	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address desc: %p\n",
>  			dev->device_fh, vq->desc);
>  	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address avail: %p\n",
>  			dev->device_fh, vq->avail);
>  	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address used: %p\n",
>  			dev->device_fh, vq->used);
> +	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") log_guest_addr: %p\n",
> +			dev->device_fh, (void *)(uintptr_t)vq->log_guest_addr);
>  
>  	return 0;
>  }
> -- 
> 1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02  3:43 ` [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
@ 2015-12-02 13:53   ` Panu Matilainen
  2015-12-02 14:31     ` Yuanhan Liu
  2015-12-06 23:07     ` Thomas Monjalon
  2015-12-08  5:57   ` Xie, Huawei
  1 sibling, 2 replies; 136+ messages in thread
From: Panu Matilainen @ 2015-12-02 13:53 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

On 12/02/2015 05:43 AM, Yuanhan Liu wrote:
> VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
> vhost-user) where we should log dirty pages, and how big the log
> buffer is.
>
> This request introduces a new payload:
>
> 	typedef struct VhostUserLog {
> 		uint64_t mmap_size;
> 		uint64_t mmap_offset;
> 	} VhostUserLog;
>
> Also, a fd is delivered from QEMU by ancillary data.
>
> With those info given, an area of memory is mmaped, assigned
> to dev->log_base, for logging dirty pages.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>   lib/librte_vhost/rte_virtio_net.h             |  2 ++
>   lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++++-
>   lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++++
>   lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++++++++++++
>   lib/librte_vhost/vhost_user/virtio-net-user.h |  1 +
>   5 files changed, 59 insertions(+), 1 deletion(-)
>
> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> index 5687452..416dac2 100644
> --- a/lib/librte_vhost/rte_virtio_net.h
> +++ b/lib/librte_vhost/rte_virtio_net.h
> @@ -127,6 +127,8 @@ struct virtio_net {
>   #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
>   	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
>   	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
> +	uint64_t		log_size;	/**< Size of log area */
> +	uint8_t			*log_base;	/**< Where dirty pages are logged */
>   	void			*priv;		/**< private context */
>   	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
>   } __rte_cache_aligned;

This (and other changes in patch 2 breaks the librte_vhost ABI again, so 
you'd need to at least add a deprecation note to 2.2 to be able to do it 
in 2.3 at all according to the ABI policy.

Perhaps a better option would be adding some padding to the structs now 
for 2.2 since the vhost ABI is broken there anyway. That would at least 
give a chance to keep it compatible from 2.2 to 2.3.

	- Panu -

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 3/4] vhost: log vring changes
  2015-12-02  3:43 ` [dpdk-dev] [PATCH 3/4] vhost: log vring changes Yuanhan Liu
@ 2015-12-02 14:07   ` Victor Kaplansky
  2015-12-02 14:38     ` Yuanhan Liu
  2015-12-09  2:45     ` Xie, Huawei
  0 siblings, 2 replies; 136+ messages in thread
From: Victor Kaplansky @ 2015-12-02 14:07 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 11:43:12AM +0800, Yuanhan Liu wrote:
> Invoking vhost_log_write() to mark corresponding page as dirty while
> updating used vring.

Looks good, thanks!

I didn't find where you log the dirty pages in result of data
written to the buffers pointed by the descriptors in RX vring.
AFAIU, the buffers of RX queue reside in guest's memory and have
to be marked as dirty if they are written. What do you say?

-- Victor

> 
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  lib/librte_vhost/vhost_rxtx.c | 74 +++++++++++++++++++++++++++++--------------
>  1 file changed, 50 insertions(+), 24 deletions(-)
> 
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> index 9322ce6..d4805d8 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c
> @@ -129,6 +129,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  		uint32_t offset = 0, vb_offset = 0;
>  		uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
>  		uint8_t hdr = 0, uncompleted_pkt = 0;
> +		uint16_t idx;
>  
>  		/* Get descriptor from available ring */
>  		desc = &vq->desc[head[packet_success]];
> @@ -200,16 +201,22 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  		}
>  
>  		/* Update used ring with desc information */
> -		vq->used->ring[res_cur_idx & (vq->size - 1)].id =
> -							head[packet_success];
> +		idx = res_cur_idx & (vq->size - 1);
> +		vq->used->ring[idx].id = head[packet_success];
>  
>  		/* Drop the packet if it is uncompleted */
>  		if (unlikely(uncompleted_pkt == 1))
> -			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> -							vq->vhost_hlen;
> +			vq->used->ring[idx].len = vq->vhost_hlen;
>  		else
> -			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> -							pkt_len + vq->vhost_hlen;
> +			vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
> +
> +		/*
> +		 * to defer the update to when updating used->idx,
> +		 * and batch them?
> +		 */
> +		vhost_log_write(dev, vq,
> +			offsetof(struct vring_used, ring[idx]),
> +			sizeof(vq->used->ring[idx]));
>  
>  		res_cur_idx++;
>  		packet_success++;
> @@ -236,6 +243,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  
>  	*(volatile uint16_t *)&vq->used->idx += count;
>  	vq->last_used_idx = res_end_idx;
> +	vhost_log_write(dev, vq,
> +		offsetof(struct vring_used, idx),
> +		sizeof(vq->used->idx));
>  
>  	/* flush used->idx update before we read avail->flags. */
>  	rte_mb();
> @@ -265,6 +275,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  	uint32_t seg_avail;
>  	uint32_t vb_avail;
>  	uint32_t cpy_len, entry_len;
> +	uint16_t idx;
>  
>  	if (pkt == NULL)
>  		return 0;
> @@ -302,16 +313,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  	entry_len = vq->vhost_hlen;
>  
>  	if (vb_avail == 0) {
> -		uint32_t desc_idx =
> -			vq->buf_vec[vec_idx].desc_idx;
> +		uint32_t desc_idx = vq->buf_vec[vec_idx].desc_idx;
> +
> +		if ((vq->desc[desc_idx].flags & VRING_DESC_F_NEXT) == 0) {
> +			idx = cur_idx & (vq->size - 1);
>  
> -		if ((vq->desc[desc_idx].flags
> -			& VRING_DESC_F_NEXT) == 0) {
>  			/* Update used ring with desc information */
> -			vq->used->ring[cur_idx & (vq->size - 1)].id
> -				= vq->buf_vec[vec_idx].desc_idx;
> -			vq->used->ring[cur_idx & (vq->size - 1)].len
> -				= entry_len;
> +			vq->used->ring[idx].id = vq->buf_vec[vec_idx].desc_idx;
> +			vq->used->ring[idx].len = entry_len;
> +
> +			vhost_log_write(dev, vq,
> +					offsetof(struct vring_used, ring[idx]),
> +					sizeof(vq->used->ring[idx]));
>  
>  			entry_len = 0;
>  			cur_idx++;
> @@ -354,10 +367,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  			if ((vq->desc[vq->buf_vec[vec_idx].desc_idx].flags &
>  				VRING_DESC_F_NEXT) == 0) {
>  				/* Update used ring with desc information */
> -				vq->used->ring[cur_idx & (vq->size - 1)].id
> +				idx = cur_idx & (vq->size - 1);
> +				vq->used->ring[idx].id
>  					= vq->buf_vec[vec_idx].desc_idx;
> -				vq->used->ring[cur_idx & (vq->size - 1)].len
> -					= entry_len;
> +				vq->used->ring[idx].len = entry_len;
> +				vhost_log_write(dev, vq,
> +					offsetof(struct vring_used, ring[idx]),
> +					sizeof(vq->used->ring[idx]));
>  				entry_len = 0;
>  				cur_idx++;
>  				entry_success++;
> @@ -390,16 +406,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  
>  					if ((vq->desc[desc_idx].flags &
>  						VRING_DESC_F_NEXT) == 0) {
> -						uint16_t wrapped_idx =
> -							cur_idx & (vq->size - 1);
> +						idx = cur_idx & (vq->size - 1);
>  						/*
>  						 * Update used ring with the
>  						 * descriptor information
>  						 */
> -						vq->used->ring[wrapped_idx].id
> +						vq->used->ring[idx].id
>  							= desc_idx;
> -						vq->used->ring[wrapped_idx].len
> +						vq->used->ring[idx].len
>  							= entry_len;
> +						vhost_log_write(dev, vq,
> +							offsetof(struct vring_used, ring[idx]),
> +							sizeof(vq->used->ring[idx]));
>  						entry_success++;
>  						entry_len = 0;
>  						cur_idx++;
> @@ -422,10 +440,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  				 * This whole packet completes.
>  				 */
>  				/* Update used ring with desc information */
> -				vq->used->ring[cur_idx & (vq->size - 1)].id
> +				idx = cur_idx & (vq->size - 1);
> +				vq->used->ring[idx].id
>  					= vq->buf_vec[vec_idx].desc_idx;
> -				vq->used->ring[cur_idx & (vq->size - 1)].len
> -					= entry_len;
> +				vq->used->ring[idx].len = entry_len;
> +				vhost_log_write(dev, vq,
> +					offsetof(struct vring_used, ring[idx]),
> +					sizeof(vq->used->ring[idx]));
>  				entry_success++;
>  				break;
>  			}
> @@ -658,6 +679,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  		/* Update used index buffer information. */
>  		vq->used->ring[used_idx].id = head[entry_success];
>  		vq->used->ring[used_idx].len = 0;
> +		vhost_log_write(dev, vq,
> +				offsetof(struct vring_used, ring[used_idx]),
> +				sizeof(vq->used->ring[used_idx]));
>  
>  		/* Allocate an mbuf and populate the structure. */
>  		m = rte_pktmbuf_alloc(mbuf_pool);
> @@ -778,6 +802,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  
>  	rte_compiler_barrier();
>  	vq->used->idx += entry_success;
> +	vhost_log_write(dev, vq, offsetof(struct vring_used, idx),
> +			sizeof(vq->used->idx));
>  	/* Kick guest if required. */
>  	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
>  		eventfd_write(vq->callfd, (eventfd_t)1);
> -- 
> 1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-02  3:43 [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
                   ` (3 preceding siblings ...)
  2015-12-02  3:43 ` [dpdk-dev] [PATCH 4/4] vhost: enable log_shmfd protocol feature Yuanhan Liu
@ 2015-12-02 14:10 ` Victor Kaplansky
  2015-12-02 14:33   ` Yuanhan Liu
  2015-12-09  3:41 ` Xie, Huawei
  2015-12-17  3:11 ` [dpdk-dev] [PATCH v2 0/6] " Yuanhan Liu
  6 siblings, 1 reply; 136+ messages in thread
From: Victor Kaplansky @ 2015-12-02 14:10 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 11:43:09AM +0800, Yuanhan Liu wrote:
> This patch set adds the initial vhost-user live migration support.
> 
> The major task behind that is to log pages we touched during
> live migration. So, this patch is basically about adding vhost
> log support, and using it.
> 
> Patchset
> ========
> - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
>   the dirty memory bitmap is.
>     
> - Patch 2 introduces a vhost_log_write() helper function to log
>   pages we are gonna change.
> 
> - Patch 3 logs changes we made to used vring.
> 
> - Patch 4 sets log_fhmfd protocol feature bit, which actually
>   enables the vhost-user live migration support.
> 
> A simple test guide (on same host)
> ==================================
> 
> The following test is based on OVS + DPDK. And here is guide
> to setup OVS + DPDK:
> 
>     http://wiki.qemu.org/Features/vhost-user-ovs-dpdk
> 
> 1. start ovs-vswitchd
> 
> 2. Add two ovs vhost-user port, say vhost0 and vhost1
> 
> 3. Start a VM1 to connect to vhost0. Here is my example:
> 
>    $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3333,server,nowait -curses
> 
> 4. run "ping $host" inside VM1
> 
> 5. Start VM2 to connect to vhost0, and marking it as the target
>    of live migration (by adding -incoming tcp:0:4444 option)
> 
>    $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3334,server,nowait -curses \
>        -incoming tcp:0:4444 
> 
> 6. connect to VM1 monitor, and start migration:
> 
>    > migrate tcp:0:4444
> 
> 7. After a while, you will find that VM1 has been migrated to VM2,
>    and the "ping" command continues running, perfectly.
> 
> 
> Note: this patch set has mostly been based on Victor Kaplansk's demo
> work (vhost-user-bridge) at QEMU project. I was thinking to add Victor
> as the co-author. Victor, what do you think of that? :)

Thanks for adding me to credits list!
-- Victor

> 
> Comments are welcome!
> 
> ---
> Yuanhan Liu (4):
>   vhost: handle VHOST_USER_SET_LOG_BASE request
>   vhost: introduce vhost_log_write
>   vhost: log vring changes
>   vhost: enable log_shmfd protocol feature
> 
>  lib/librte_vhost/rte_virtio_net.h             | 35 ++++++++++++++
>  lib/librte_vhost/vhost_rxtx.c                 | 70 ++++++++++++++++++---------
>  lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++-
>  lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 +++
>  lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++
>  lib/librte_vhost/vhost_user/virtio-net-user.h |  5 +-
>  lib/librte_vhost/virtio-net.c                 |  4 ++
>  7 files changed, 145 insertions(+), 26 deletions(-)
> 
> -- 
> 1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 13:53   ` Panu Matilainen
@ 2015-12-02 14:31     ` Yuanhan Liu
  2015-12-02 14:48       ` Panu Matilainen
  2015-12-02 16:38       ` Thomas Monjalon
  2015-12-06 23:07     ` Thomas Monjalon
  1 sibling, 2 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-02 14:31 UTC (permalink / raw)
  To: Panu Matilainen, Thomas Monjalon
  Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 03:53:45PM +0200, Panu Matilainen wrote:
> On 12/02/2015 05:43 AM, Yuanhan Liu wrote:
> >VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
> >vhost-user) where we should log dirty pages, and how big the log
> >buffer is.
> >
> >This request introduces a new payload:
> >
> >	typedef struct VhostUserLog {
> >		uint64_t mmap_size;
> >		uint64_t mmap_offset;
> >	} VhostUserLog;
> >
> >Also, a fd is delivered from QEMU by ancillary data.
> >
> >With those info given, an area of memory is mmaped, assigned
> >to dev->log_base, for logging dirty pages.
> >
> >Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >---
> >  lib/librte_vhost/rte_virtio_net.h             |  2 ++
> >  lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++++-
> >  lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++++
> >  lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++++++++++++
> >  lib/librte_vhost/vhost_user/virtio-net-user.h |  1 +
> >  5 files changed, 59 insertions(+), 1 deletion(-)
> >
> >diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> >index 5687452..416dac2 100644
> >--- a/lib/librte_vhost/rte_virtio_net.h
> >+++ b/lib/librte_vhost/rte_virtio_net.h
> >@@ -127,6 +127,8 @@ struct virtio_net {
> >  #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
> >  	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
> >  	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
> >+	uint64_t		log_size;	/**< Size of log area */
> >+	uint8_t			*log_base;	/**< Where dirty pages are logged */
> >  	void			*priv;		/**< private context */
> >  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
> >  } __rte_cache_aligned;
> 
> This (and other changes in patch 2 breaks the librte_vhost ABI
> again, so you'd need to at least add a deprecation note to 2.2 to be
> able to do it in 2.3 at all according to the ABI policy.

I was thinking that adding a new field (instead of renaming it or
removing it) isn't an ABI break. So, I was wrong?

> 
> Perhaps a better option would be adding some padding to the structs
> now for 2.2 since the vhost ABI is broken there anyway. That would
> at least give a chance to keep it compatible from 2.2 to 2.3.

It will not be compatible, unless we add exact same fields (not
something like uint8_t pad[xx]). Otherwise, the pad field renaming
is also an ABI break, right?

Thomas, should I write an ABI deprecation note? Can I make it for
v2.2 release If I make one tomorrow? (Sorry that I'm not awared
of that it would be an ABI break).

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-02 14:10 ` [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support Victor Kaplansky
@ 2015-12-02 14:33   ` Yuanhan Liu
  0 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-02 14:33 UTC (permalink / raw)
  To: Victor Kaplansky; +Cc: dev, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 04:10:56PM +0200, Victor Kaplansky wrote:
...
> > Note: this patch set has mostly been based on Victor Kaplansk's demo
> > work (vhost-user-bridge) at QEMU project. I was thinking to add Victor
> > as the co-author. Victor, what do you think of that? :)
> 
> Thanks for adding me to credits list!

Great, I will add your signed-off-by since v2. Will that be okay to you?

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 3/4] vhost: log vring changes
  2015-12-02 14:07   ` Victor Kaplansky
@ 2015-12-02 14:38     ` Yuanhan Liu
  2015-12-02 15:58       ` Victor Kaplansky
  2015-12-09  2:45     ` Xie, Huawei
  1 sibling, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-02 14:38 UTC (permalink / raw)
  To: Victor Kaplansky; +Cc: dev, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 04:07:02PM +0200, Victor Kaplansky wrote:
> On Wed, Dec 02, 2015 at 11:43:12AM +0800, Yuanhan Liu wrote:
> > Invoking vhost_log_write() to mark corresponding page as dirty while
> > updating used vring.
> 
> Looks good, thanks!
> 
> I didn't find where you log the dirty pages in result of data
> written to the buffers pointed by the descriptors in RX vring.
> AFAIU, the buffers of RX queue reside in guest's memory and have
> to be marked as dirty if they are written. What do you say?

Yeah, we should. I got a question then: why log_guest_addr is set
to the physical address of used vring in guest? I mean, apparently,
we need log more changes other than used vring only.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 2/4] vhost: introduce vhost_log_write
  2015-12-02 13:53   ` Victor Kaplansky
@ 2015-12-02 14:39     ` Yuanhan Liu
  2015-12-09  3:33     ` Xie, Huawei
  1 sibling, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-02 14:39 UTC (permalink / raw)
  To: Victor Kaplansky; +Cc: dev, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 03:53:01PM +0200, Victor Kaplansky wrote:
> On Wed, Dec 02, 2015 at 11:43:11AM +0800, Yuanhan Liu wrote:
> > Introduce vhost_log_write() helper function to log the dirty pages we
> > touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
> > log is presented by 1 bit.
> > 
> > Therefore, vhost_log_write() simply finds the right bit for related
> > page we are gonna change, and set it to 1. dev->log_base denotes the
> > start of the dirty page bitmap.
> > 
> > The page address is biased by log_guest_addr, which is derived from
> > SET_VRING_ADDR request as part of the vring related addresses.
> > 
> > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> > ---
> >  lib/librte_vhost/rte_virtio_net.h | 34 ++++++++++++++++++++++++++++++++++
> >  lib/librte_vhost/virtio-net.c     |  4 ++++
> >  2 files changed, 38 insertions(+)
> > 
> > diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> > index 416dac2..191c1be 100644
> > --- a/lib/librte_vhost/rte_virtio_net.h
> > +++ b/lib/librte_vhost/rte_virtio_net.h
> > @@ -40,6 +40,7 @@
> >   */
> >  
> >  #include <stdint.h>
> > +#include <linux/vhost.h>
> >  #include <linux/virtio_ring.h>
> >  #include <linux/virtio_net.h>
> >  #include <sys/eventfd.h>
> > @@ -59,6 +60,8 @@ struct rte_mbuf;
> >  /* Backend value set by guest. */
> >  #define VIRTIO_DEV_STOPPED -1
> >  
> > +#define VHOST_LOG_PAGE	4096
> > +
> >  
> >  /* Enum for virtqueue management. */
> >  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> > @@ -82,6 +85,7 @@ struct vhost_virtqueue {
> >  	struct vring_desc	*desc;			/**< Virtqueue descriptor ring. */
> >  	struct vring_avail	*avail;			/**< Virtqueue available ring. */
> >  	struct vring_used	*used;			/**< Virtqueue used ring. */
> > +	uint64_t		log_guest_addr;		/**< Physical address of used ring, for logging */
> >  	uint32_t		size;			/**< Size of descriptor ring. */
> >  	uint32_t		backend;		/**< Backend value to determine if device should started/stopped. */
> >  	uint16_t		vhost_hlen;		/**< Vhost header length (varies depending on RX merge buffers. */
> > @@ -203,6 +207,36 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
> >  	return vhost_va;
> >  }
> >  
> > +static inline void __attribute__((always_inline))
> > +vhost_log_page(uint8_t *log_base, uint64_t page)
> > +{
> > +	/* TODO: to make it atomic? */
> > +	log_base[page / 8] |= 1 << (page % 8);
> 
> I think the atomic OR operation is necessary only if there can be
> more than one vhost-user back-end updating the guest's memory
> simultaneously. However probably it is pretty safe to perform
> regular OR operation, since rings are not shared between
> back-end. What about buffers pointed by descriptors?  To be on
> the safe side, I would use a GCC built-in function
> __sync_fetch_and_or(). 

The build has to be passed not only for gcc, but for icc and clang as
well.

> 
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > +		uint64_t offset, uint64_t len)
> > +{
> > +	uint64_t addr = vq->log_guest_addr;
> > +	uint64_t page;
> > +
> > +	if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > +		     !dev->log_base || !len))
> > +		return;
> 
> Isn't "likely" more appropriate in above, since the whole
> expression is expected to be true most of the time?

Sorry, it's a typo, and thanks for the catching.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 14:31     ` Yuanhan Liu
@ 2015-12-02 14:48       ` Panu Matilainen
  2015-12-02 15:09         ` Yuanhan Liu
  2015-12-02 16:38       ` Thomas Monjalon
  1 sibling, 1 reply; 136+ messages in thread
From: Panu Matilainen @ 2015-12-02 14:48 UTC (permalink / raw)
  To: Yuanhan Liu, Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 12/02/2015 04:31 PM, Yuanhan Liu wrote:
> On Wed, Dec 02, 2015 at 03:53:45PM +0200, Panu Matilainen wrote:
>> On 12/02/2015 05:43 AM, Yuanhan Liu wrote:
>>> VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
>>> vhost-user) where we should log dirty pages, and how big the log
>>> buffer is.
>>>
>>> This request introduces a new payload:
>>>
>>> 	typedef struct VhostUserLog {
>>> 		uint64_t mmap_size;
>>> 		uint64_t mmap_offset;
>>> 	} VhostUserLog;
>>>
>>> Also, a fd is delivered from QEMU by ancillary data.
>>>
>>> With those info given, an area of memory is mmaped, assigned
>>> to dev->log_base, for logging dirty pages.
>>>
>>> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>>> ---
>>>   lib/librte_vhost/rte_virtio_net.h             |  2 ++
>>>   lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++++-
>>>   lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++++
>>>   lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++++++++++++
>>>   lib/librte_vhost/vhost_user/virtio-net-user.h |  1 +
>>>   5 files changed, 59 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
>>> index 5687452..416dac2 100644
>>> --- a/lib/librte_vhost/rte_virtio_net.h
>>> +++ b/lib/librte_vhost/rte_virtio_net.h
>>> @@ -127,6 +127,8 @@ struct virtio_net {
>>>   #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
>>>   	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
>>>   	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
>>> +	uint64_t		log_size;	/**< Size of log area */
>>> +	uint8_t			*log_base;	/**< Where dirty pages are logged */
>>>   	void			*priv;		/**< private context */
>>>   	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
>>>   } __rte_cache_aligned;
>>
>> This (and other changes in patch 2 breaks the librte_vhost ABI
>> again, so you'd need to at least add a deprecation note to 2.2 to be
>> able to do it in 2.3 at all according to the ABI policy.
>
> I was thinking that adding a new field (instead of renaming it or
> removing it) isn't an ABI break. So, I was wrong?

Adding or removing a field in the middle of a public struct is always an 
ABI break. Adding to the end often is too, but not always.
Renaming a field is an API break but not an ABI break - the compiler 
cares but the cpu does not.

>>
>> Perhaps a better option would be adding some padding to the structs
>> now for 2.2 since the vhost ABI is broken there anyway. That would
>> at least give a chance to keep it compatible from 2.2 to 2.3.
>
> It will not be compatible, unless we add exact same fields (not
> something like uint8_t pad[xx]). Otherwise, the pad field renaming
> is also an ABI break, right?

There's no ABI (or API) break in changing reserved unused fields to 
something else, as long as care is taken with sizes and alignment. In 
any case padding is best added to the end of a struct to minimize risks 
and keep things simple.

	- Panu -

>
> Thomas, should I write an ABI deprecation note? Can I make it for
> v2.2 release If I make one tomorrow? (Sorry that I'm not awared
> of that it would be an ABI break).
>
> 	--yliu
>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 14:48       ` Panu Matilainen
@ 2015-12-02 15:09         ` Yuanhan Liu
  2015-12-02 16:58           ` Panu Matilainen
  0 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-02 15:09 UTC (permalink / raw)
  To: Panu Matilainen; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 04:48:14PM +0200, Panu Matilainen wrote:
...
> >>>diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> >>>index 5687452..416dac2 100644
> >>>--- a/lib/librte_vhost/rte_virtio_net.h
> >>>+++ b/lib/librte_vhost/rte_virtio_net.h
> >>>@@ -127,6 +127,8 @@ struct virtio_net {
> >>>  #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
> >>>  	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
> >>>  	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
> >>>+	uint64_t		log_size;	/**< Size of log area */
> >>>+	uint8_t			*log_base;	/**< Where dirty pages are logged */
> >>>  	void			*priv;		/**< private context */
> >>>  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
> >>>  } __rte_cache_aligned;
> >>
> >>This (and other changes in patch 2 breaks the librte_vhost ABI
> >>again, so you'd need to at least add a deprecation note to 2.2 to be
> >>able to do it in 2.3 at all according to the ABI policy.
> >
> >I was thinking that adding a new field (instead of renaming it or
> >removing it) isn't an ABI break. So, I was wrong?
> 
> Adding or removing a field in the middle of a public struct is
> always an ABI break. Adding to the end often is too, but not always.
> Renaming a field is an API break but not an ABI break - the compiler
> cares but the cpu does not.

Good to know. Thanks.

> 
> >>
> >>Perhaps a better option would be adding some padding to the structs
> >>now for 2.2 since the vhost ABI is broken there anyway. That would
> >>at least give a chance to keep it compatible from 2.2 to 2.3.
> >
> >It will not be compatible, unless we add exact same fields (not
> >something like uint8_t pad[xx]). Otherwise, the pad field renaming
> >is also an ABI break, right?
> 
> There's no ABI (or API) break in changing reserved unused fields to
> something else, as long as care is taken with sizes and alignment.

as long as we don't reference the reserved unused fields?

> In any case padding is best added to the end of a struct to minimize
> risks and keep things simple.

The thing is that isn't it a bit aweful to (always) add pads to
the end of a struct, especially when you don't know how many
need to be padded?

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 3/4] vhost: log vring changes
  2015-12-02 14:38     ` Yuanhan Liu
@ 2015-12-02 15:58       ` Victor Kaplansky
  2015-12-02 16:26         ` Michael S. Tsirkin
  0 siblings, 1 reply; 136+ messages in thread
From: Victor Kaplansky @ 2015-12-02 15:58 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 10:38:02PM +0800, Yuanhan Liu wrote:
> On Wed, Dec 02, 2015 at 04:07:02PM +0200, Victor Kaplansky wrote:
> > On Wed, Dec 02, 2015 at 11:43:12AM +0800, Yuanhan Liu wrote:
> > > Invoking vhost_log_write() to mark corresponding page as dirty while
> > > updating used vring.
> > 
> > Looks good, thanks!
> > 
> > I didn't find where you log the dirty pages in result of data
> > written to the buffers pointed by the descriptors in RX vring.
> > AFAIU, the buffers of RX queue reside in guest's memory and have
> > to be marked as dirty if they are written. What do you say?
> 
> Yeah, we should. I got a question then: why log_guest_addr is set
> to the physical address of used vring in guest? I mean, apparently,
> we need log more changes other than used vring only.

The physical address of used vring sent to the back-end, since
otherwise back-end has to perform virtual to physical
translation, and we want to avoid this. The dirty buffers has to
be marked as well, but their guest's physical address is known
directly from the descriptors.

> 
> 	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 3/4] vhost: log vring changes
  2015-12-02 15:58       ` Victor Kaplansky
@ 2015-12-02 16:26         ` Michael S. Tsirkin
  2015-12-03  2:31           ` Yuanhan Liu
  0 siblings, 1 reply; 136+ messages in thread
From: Michael S. Tsirkin @ 2015-12-02 16:26 UTC (permalink / raw)
  To: Victor Kaplansky; +Cc: dev

On Wed, Dec 02, 2015 at 05:58:24PM +0200, Victor Kaplansky wrote:
> On Wed, Dec 02, 2015 at 10:38:02PM +0800, Yuanhan Liu wrote:
> > On Wed, Dec 02, 2015 at 04:07:02PM +0200, Victor Kaplansky wrote:
> > > On Wed, Dec 02, 2015 at 11:43:12AM +0800, Yuanhan Liu wrote:
> > > > Invoking vhost_log_write() to mark corresponding page as dirty while
> > > > updating used vring.
> > > 
> > > Looks good, thanks!
> > > 
> > > I didn't find where you log the dirty pages in result of data
> > > written to the buffers pointed by the descriptors in RX vring.
> > > AFAIU, the buffers of RX queue reside in guest's memory and have
> > > to be marked as dirty if they are written. What do you say?
> > 
> > Yeah, we should. I got a question then: why log_guest_addr is set
> > to the physical address of used vring in guest? I mean, apparently,
> > we need log more changes other than used vring only.
> 
> The physical address of used vring sent to the back-end, since
> otherwise back-end has to perform virtual to physical
> translation, and we want to avoid this. The dirty buffers has to
> be marked as well, but their guest's physical address is known
> directly from the descriptors.

Yes, people wanted to be able to do multiple physical
addresses to one virtual so you do not want to translate
virt to phys.

> > 
> > 	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 14:31     ` Yuanhan Liu
  2015-12-02 14:48       ` Panu Matilainen
@ 2015-12-02 16:38       ` Thomas Monjalon
  2015-12-03  1:49         ` Yuanhan Liu
  1 sibling, 1 reply; 136+ messages in thread
From: Thomas Monjalon @ 2015-12-02 16:38 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2015-12-02 22:31, Yuanhan Liu:
> Thomas, should I write an ABI deprecation note? Can I make it for
> v2.2 release If I make one tomorrow? (Sorry that I'm not awared
> of that it would be an ABI break).

As Panu suggested, it would be better to reserve some room now
in 2.2 which already breaks vhost ABI.
Maybe we have a chance to keep the same vhost ABI in 2.3.

The 2.2 release will probably be closed in less than 2 weeks.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 15:09         ` Yuanhan Liu
@ 2015-12-02 16:58           ` Panu Matilainen
  2015-12-02 17:24             ` Michael S. Tsirkin
  0 siblings, 1 reply; 136+ messages in thread
From: Panu Matilainen @ 2015-12-02 16:58 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 12/02/2015 05:09 PM, Yuanhan Liu wrote:
> On Wed, Dec 02, 2015 at 04:48:14PM +0200, Panu Matilainen wrote:
> ...
>>>>> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
>>>>> index 5687452..416dac2 100644
>>>>> --- a/lib/librte_vhost/rte_virtio_net.h
>>>>> +++ b/lib/librte_vhost/rte_virtio_net.h
>>>>> @@ -127,6 +127,8 @@ struct virtio_net {
>>>>>   #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
>>>>>   	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
>>>>>   	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
>>>>> +	uint64_t		log_size;	/**< Size of log area */
>>>>> +	uint8_t			*log_base;	/**< Where dirty pages are logged */
>>>>>   	void			*priv;		/**< private context */
>>>>>   	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
>>>>>   } __rte_cache_aligned;
>>>>
>>>> This (and other changes in patch 2 breaks the librte_vhost ABI
>>>> again, so you'd need to at least add a deprecation note to 2.2 to be
>>>> able to do it in 2.3 at all according to the ABI policy.
>>>
>>> I was thinking that adding a new field (instead of renaming it or
>>> removing it) isn't an ABI break. So, I was wrong?
>>
>> Adding or removing a field in the middle of a public struct is
>> always an ABI break. Adding to the end often is too, but not always.
>> Renaming a field is an API break but not an ABI break - the compiler
>> cares but the cpu does not.
>
> Good to know. Thanks.
>
>>
>>>>
>>>> Perhaps a better option would be adding some padding to the structs
>>>> now for 2.2 since the vhost ABI is broken there anyway. That would
>>>> at least give a chance to keep it compatible from 2.2 to 2.3.
>>>
>>> It will not be compatible, unless we add exact same fields (not
>>> something like uint8_t pad[xx]). Otherwise, the pad field renaming
>>> is also an ABI break, right?
>>
>> There's no ABI (or API) break in changing reserved unused fields to
>> something else, as long as care is taken with sizes and alignment.
>
> as long as we don't reference the reserved unused fields?

That would be the definition of an unused field I think :)
Call it "reserved" if you want, it doesn't really matter as long as its 
clear its something you shouldn't be using.

>
>> In any case padding is best added to the end of a struct to minimize
>> risks and keep things simple.
>
> The thing is that isn't it a bit aweful to (always) add pads to
> the end of a struct, especially when you don't know how many
> need to be padded?

Then you pad for what you think you need, plus a bit extra, and maybe 
some more for others who might want to extend it. What is a reasonable 
amount needs deciding case by case - if a struct is alloced in the 
millions then be (very) conservative, but if there are one or 50 such 
structs within an app lifetime then who cares if its bit larger?

And yeah padding may be annoying, but that's pretty much the only option 
in a project where most of the structs are out in the open.

	- Panu -

>
> 	--yliu
>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 16:58           ` Panu Matilainen
@ 2015-12-02 17:24             ` Michael S. Tsirkin
  0 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2015-12-02 17:24 UTC (permalink / raw)
  To: Panu Matilainen; +Cc: dev, Victor Kaplansky

On Wed, Dec 02, 2015 at 06:58:03PM +0200, Panu Matilainen wrote:
> On 12/02/2015 05:09 PM, Yuanhan Liu wrote:
> >On Wed, Dec 02, 2015 at 04:48:14PM +0200, Panu Matilainen wrote:
> >...
> >>>>>diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> >>>>>index 5687452..416dac2 100644
> >>>>>--- a/lib/librte_vhost/rte_virtio_net.h
> >>>>>+++ b/lib/librte_vhost/rte_virtio_net.h
> >>>>>@@ -127,6 +127,8 @@ struct virtio_net {
> >>>>>  #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
> >>>>>  	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
> >>>>>  	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
> >>>>>+	uint64_t		log_size;	/**< Size of log area */
> >>>>>+	uint8_t			*log_base;	/**< Where dirty pages are logged */
> >>>>>  	void			*priv;		/**< private context */
> >>>>>  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
> >>>>>  } __rte_cache_aligned;
> >>>>
> >>>>This (and other changes in patch 2 breaks the librte_vhost ABI
> >>>>again, so you'd need to at least add a deprecation note to 2.2 to be
> >>>>able to do it in 2.3 at all according to the ABI policy.
> >>>
> >>>I was thinking that adding a new field (instead of renaming it or
> >>>removing it) isn't an ABI break. So, I was wrong?
> >>
> >>Adding or removing a field in the middle of a public struct is
> >>always an ABI break. Adding to the end often is too, but not always.
> >>Renaming a field is an API break but not an ABI break - the compiler
> >>cares but the cpu does not.
> >
> >Good to know. Thanks.
> >
> >>
> >>>>
> >>>>Perhaps a better option would be adding some padding to the structs
> >>>>now for 2.2 since the vhost ABI is broken there anyway. That would
> >>>>at least give a chance to keep it compatible from 2.2 to 2.3.
> >>>
> >>>It will not be compatible, unless we add exact same fields (not
> >>>something like uint8_t pad[xx]). Otherwise, the pad field renaming
> >>>is also an ABI break, right?
> >>
> >>There's no ABI (or API) break in changing reserved unused fields to
> >>something else, as long as care is taken with sizes and alignment.
> >
> >as long as we don't reference the reserved unused fields?
> 
> That would be the definition of an unused field I think :)
> Call it "reserved" if you want, it doesn't really matter as long as its
> clear its something you shouldn't be using.
> 
> >
> >>In any case padding is best added to the end of a struct to minimize
> >>risks and keep things simple.
> >
> >The thing is that isn't it a bit aweful to (always) add pads to
> >the end of a struct, especially when you don't know how many
> >need to be padded?
> 
> Then you pad for what you think you need, plus a bit extra, and maybe some
> more for others who might want to extend it. What is a reasonable amount
> needs deciding case by case - if a struct is alloced in the millions then be
> (very) conservative, but if there are one or 50 such structs within an app
> lifetime then who cares if its bit larger?
> 
> And yeah padding may be annoying, but that's pretty much the only option in
> a project where most of the structs are out in the open.
> 
> 	- Panu -

Functions versioning is another option.
For a sufficiently widely used struct, it's a lot of work, padding
is easier.  But it might be better than breaking ABI
if e.g. you didn't pad enough.

> >
> >	--yliu
> >

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 16:38       ` Thomas Monjalon
@ 2015-12-03  1:49         ` Yuanhan Liu
  0 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-03  1:49 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 05:38:10PM +0100, Thomas Monjalon wrote:
> 2015-12-02 22:31, Yuanhan Liu:
> > Thomas, should I write an ABI deprecation note? Can I make it for
> > v2.2 release If I make one tomorrow? (Sorry that I'm not awared
> > of that it would be an ABI break).
> 
> As Panu suggested, it would be better to reserve some room now
> in 2.2 which already breaks vhost ABI.
> Maybe we have a chance to keep the same vhost ABI in 2.3.


Got it. I will cook up one now.

	--yliu
> 
> The 2.2 release will probably be closed in less than 2 weeks.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 3/4] vhost: log vring changes
  2015-12-02 16:26         ` Michael S. Tsirkin
@ 2015-12-03  2:31           ` Yuanhan Liu
  0 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-03  2:31 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, Victor Kaplansky

On Wed, Dec 02, 2015 at 06:26:37PM +0200, Michael S. Tsirkin wrote:
> On Wed, Dec 02, 2015 at 05:58:24PM +0200, Victor Kaplansky wrote:
> > On Wed, Dec 02, 2015 at 10:38:02PM +0800, Yuanhan Liu wrote:
> > > On Wed, Dec 02, 2015 at 04:07:02PM +0200, Victor Kaplansky wrote:
> > > > On Wed, Dec 02, 2015 at 11:43:12AM +0800, Yuanhan Liu wrote:
> > > > > Invoking vhost_log_write() to mark corresponding page as dirty while
> > > > > updating used vring.
> > > > 
> > > > Looks good, thanks!
> > > > 
> > > > I didn't find where you log the dirty pages in result of data
> > > > written to the buffers pointed by the descriptors in RX vring.
> > > > AFAIU, the buffers of RX queue reside in guest's memory and have
> > > > to be marked as dirty if they are written. What do you say?
> > > 
> > > Yeah, we should. I got a question then: why log_guest_addr is set
> > > to the physical address of used vring in guest? I mean, apparently,
> > > we need log more changes other than used vring only.
> > 
> > The physical address of used vring sent to the back-end, since
> > otherwise back-end has to perform virtual to physical
> > translation, and we want to avoid this. The dirty buffers has to
> > be marked as well, but their guest's physical address is known
> > directly from the descriptors.
> 
> Yes, people wanted to be able to do multiple physical
> addresses to one virtual so you do not want to translate
> virt to phys.

Thank you both, it's clear to me then.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 13:53   ` Panu Matilainen
  2015-12-02 14:31     ` Yuanhan Liu
@ 2015-12-06 23:07     ` Thomas Monjalon
  2015-12-07  2:00       ` Yuanhan Liu
  2015-12-07  6:29       ` Panu Matilainen
  1 sibling, 2 replies; 136+ messages in thread
From: Thomas Monjalon @ 2015-12-06 23:07 UTC (permalink / raw)
  To: Panu Matilainen, Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2015-12-02 15:53, Panu Matilainen:
> This (and other changes in patch 2 breaks the librte_vhost ABI again, so 
> you'd need to at least add a deprecation note to 2.2 to be able to do it 
> in 2.3 at all according to the ABI policy.
> 
> Perhaps a better option would be adding some padding to the structs now 
> for 2.2 since the vhost ABI is broken there anyway. That would at least 
> give a chance to keep it compatible from 2.2 to 2.3.

Please could you point where the vhost ABI is broken in 2.2?

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-06 23:07     ` Thomas Monjalon
@ 2015-12-07  2:00       ` Yuanhan Liu
  2015-12-07  2:03         ` Thomas Monjalon
  2015-12-07  6:29       ` Panu Matilainen
  1 sibling, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-07  2:00 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Mon, Dec 07, 2015 at 12:07:28AM +0100, Thomas Monjalon wrote:
> 2015-12-02 15:53, Panu Matilainen:
> > This (and other changes in patch 2 breaks the librte_vhost ABI again, so 
> > you'd need to at least add a deprecation note to 2.2 to be able to do it 
> > in 2.3 at all according to the ABI policy.
> > 
> > Perhaps a better option would be adding some padding to the structs now 
> > for 2.2 since the vhost ABI is broken there anyway. That would at least 
> > give a chance to keep it compatible from 2.2 to 2.3.
> 
> Please could you point where the vhost ABI is broken in 2.2?

Thomas, here are the changes to rte_virtio_net.h:


$ git diff 381316f6a225139d22d39b5ab8d50c40607924ca..19d4d7ef2a216b5418d8edb5b004d1a58bba3cc1 \
      -- lib/librte_vhost/rte_virtio_net.h >
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index e3a21e5..426a70d 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -89,6 +89,7 @@ struct vhost_virtqueue {
 	volatile uint16_t	last_used_idx_res;	/**< Used for multiple devices reserving buffers. */
 	int			callfd;			/**< Used to notify the guest (trigger interrupt). */
 	int			kickfd;			/**< Currently unused as polling mode is enabled. */
+	int			enabled;
 	struct buf_vector	buf_vec[BUF_VECTOR_MAX];	/**< for scatter RX. */
 } __rte_cache_aligned;
 
@@ -96,7 +97,6 @@ struct vhost_virtqueue {
  * Device structure contains all configuration information relating to the device.
  */
 struct virtio_net {
-	struct vhost_virtqueue	*virtqueue[VIRTIO_QNUM];	/**< Contains all virtqueue information. */
 	struct virtio_memory	*mem;		/**< QEMU memory and memory region information. */
 	uint64_t		features;	/**< Negotiated feature set. */
 	uint64_t		protocol_features;	/**< Negotiated protocol feature set. */
@@ -104,7 +104,9 @@ struct virtio_net {
 	uint32_t		flags;		/**< Device flags. Only used to check if device is running on data core. */
 #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
 	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
+	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
 	void			*priv;		/**< private context */
+	struct vhost_virtqueue	*virtqueue[VIRTIO_NET_CTRL_MQ_VQ_PAIRS_MAX];	/**< Contains all virtqueue information. */
 } __rte_cache_aligned;
 
 /**
@@ -131,7 +133,7 @@ struct virtio_memory {
 };
 
 /**
- * Device operations to add/remove device.
+ * Device and vring operations.
  *
  * Make sure to set VIRTIO_DEV_RUNNING to the device flags in new_device and
  * remove it in destroy_device.
@@ -140,12 +142,18 @@ struct virtio_memory {
 struct virtio_net_device_ops {
 	int (*new_device)(struct virtio_net *);	/**< Add device. */
 	void (*destroy_device)(volatile struct virtio_net *);	/**< Remove device. */
+
+	int (*vring_state_changed)(struct virtio_net *dev, uint16_t queue_id, int enable);	/**< triggered when a vring is enabled or disabled */
 };
 
 static inline uint16_t __attribute__((always_inline))
 rte_vring_available_entries(struct virtio_net *dev, uint16_t queue_id)
 {
 	struct vhost_virtqueue *vq = dev->virtqueue[queue_id];
+
+	if (!vq->enabled)
+		return 0;
+
 	return *(volatile uint16_t *)&vq->avail->idx - vq->last_used_idx_res;
 }
 

	--yliu 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07  2:00       ` Yuanhan Liu
@ 2015-12-07  2:03         ` Thomas Monjalon
  2015-12-07  2:18           ` Yuanhan Liu
  0 siblings, 1 reply; 136+ messages in thread
From: Thomas Monjalon @ 2015-12-07  2:03 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2015-12-07 10:00, Yuanhan Liu:
> On Mon, Dec 07, 2015 at 12:07:28AM +0100, Thomas Monjalon wrote:
> > 2015-12-02 15:53, Panu Matilainen:
> > > This (and other changes in patch 2 breaks the librte_vhost ABI again, so 
> > > you'd need to at least add a deprecation note to 2.2 to be able to do it 
> > > in 2.3 at all according to the ABI policy.
> > > 
> > > Perhaps a better option would be adding some padding to the structs now 
> > > for 2.2 since the vhost ABI is broken there anyway. That would at least 
> > > give a chance to keep it compatible from 2.2 to 2.3.
> > 
> > Please could you point where the vhost ABI is broken in 2.2?
> 
> Thomas, here are the changes to rte_virtio_net.h:
> 
> 
> $ git diff 381316f6a225139d22d39b5ab8d50c40607924ca..19d4d7ef2a216b5418d8edb5b004d1a58bba3cc1 \
>       -- lib/librte_vhost/rte_virtio_net.h >
[...]

The problem is that the changes are not noticed in the release notes
and the LIBABIVER is still 1.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07  2:03         ` Thomas Monjalon
@ 2015-12-07  2:18           ` Yuanhan Liu
  2015-12-07  2:49             ` Thomas Monjalon
  0 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-07  2:18 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Mon, Dec 07, 2015 at 03:03:24AM +0100, Thomas Monjalon wrote:
> 2015-12-07 10:00, Yuanhan Liu:
> > On Mon, Dec 07, 2015 at 12:07:28AM +0100, Thomas Monjalon wrote:
> > > 2015-12-02 15:53, Panu Matilainen:
> > > > This (and other changes in patch 2 breaks the librte_vhost ABI again, so 
> > > > you'd need to at least add a deprecation note to 2.2 to be able to do it 
> > > > in 2.3 at all according to the ABI policy.
> > > > 
> > > > Perhaps a better option would be adding some padding to the structs now 
> > > > for 2.2 since the vhost ABI is broken there anyway. That would at least 
> > > > give a chance to keep it compatible from 2.2 to 2.3.
> > > 
> > > Please could you point where the vhost ABI is broken in 2.2?
> > 
> > Thomas, here are the changes to rte_virtio_net.h:
> > 
> > 
> > $ git diff 381316f6a225139d22d39b5ab8d50c40607924ca..19d4d7ef2a216b5418d8edb5b004d1a58bba3cc1 \
> >       -- lib/librte_vhost/rte_virtio_net.h >
> [...]
> 
> The problem is that the changes are not noticed in the release notes
> and the LIBABIVER is still 1.

Yeah, my bad. Firstly, I was not aware of it's an ABI change. Secondly,
I was landed to this team in the middle of v2.2 release, so that I have
limited experience of how those works in DPDK community.

Anyway, it's my fault. I should have realized that in the first time.
Should I send a patch to update LIBABIVER to 2 and update release note
now?

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07  2:18           ` Yuanhan Liu
@ 2015-12-07  2:49             ` Thomas Monjalon
  0 siblings, 0 replies; 136+ messages in thread
From: Thomas Monjalon @ 2015-12-07  2:49 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2015-12-07 10:18, Yuanhan Liu:
> On Mon, Dec 07, 2015 at 03:03:24AM +0100, Thomas Monjalon wrote:
> > 2015-12-07 10:00, Yuanhan Liu:
> > > On Mon, Dec 07, 2015 at 12:07:28AM +0100, Thomas Monjalon wrote:
> > > > 2015-12-02 15:53, Panu Matilainen:
> > > > > This (and other changes in patch 2 breaks the librte_vhost ABI again, so 
> > > > > you'd need to at least add a deprecation note to 2.2 to be able to do it 
> > > > > in 2.3 at all according to the ABI policy.
> > > > > 
> > > > > Perhaps a better option would be adding some padding to the structs now 
> > > > > for 2.2 since the vhost ABI is broken there anyway. That would at least 
> > > > > give a chance to keep it compatible from 2.2 to 2.3.
> > > > 
> > > > Please could you point where the vhost ABI is broken in 2.2?
> > > 
> > > Thomas, here are the changes to rte_virtio_net.h:
> > > 
> > > 
> > > $ git diff 381316f6a225139d22d39b5ab8d50c40607924ca..19d4d7ef2a216b5418d8edb5b004d1a58bba3cc1 \
> > >       -- lib/librte_vhost/rte_virtio_net.h >
> > [...]
> > 
> > The problem is that the changes are not noticed in the release notes
> > and the LIBABIVER is still 1.
> 
> Yeah, my bad. Firstly, I was not aware of it's an ABI change. Secondly,
> I was landed to this team in the middle of v2.2 release, so that I have
> limited experience of how those works in DPDK community.
> 
> Anyway, it's my fault. I should have realized that in the first time.

No it's not your fault, and it does not matter who is responsible.

> Should I send a patch to update LIBABIVER to 2 and update release note
> now?

Yes today or tomorrow please.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-06 23:07     ` Thomas Monjalon
  2015-12-07  2:00       ` Yuanhan Liu
@ 2015-12-07  6:29       ` Panu Matilainen
  2015-12-07 11:28         ` Thomas Monjalon
  1 sibling, 1 reply; 136+ messages in thread
From: Panu Matilainen @ 2015-12-07  6:29 UTC (permalink / raw)
  To: Thomas Monjalon, Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 12/07/2015 01:07 AM, Thomas Monjalon wrote:
> 2015-12-02 15:53, Panu Matilainen:
>> This (and other changes in patch 2 breaks the librte_vhost ABI again, so
>> you'd need to at least add a deprecation note to 2.2 to be able to do it
>> in 2.3 at all according to the ABI policy.
>>
>> Perhaps a better option would be adding some padding to the structs now
>> for 2.2 since the vhost ABI is broken there anyway. That would at least
>> give a chance to keep it compatible from 2.2 to 2.3.
>
> Please could you point where the vhost ABI is broken in 2.2?
>

The vhost ABI break was announced for DPDK 2.2 in commit 
3c848bd7b1c6f4f681b833322a748fdefbb5fb2d:

> commit 3c848bd7b1c6f4f681b833322a748fdefbb5fb2d
> Author: Ouyang Changchun <changchun.ouyang@intel.com>
> Date:   Tue Jun 16 09:38:43 2015 +0800
>
>     doc: announce ABI changes for vhost-user multiple queues
>
>     It announces the planned ABI changes for vhost-user multiple
>     queues feature on v2.2.
>
>     Signed-off-by: Changchun Ouyang <changchun.ouyang@intel.com>
>     Acked-by: Neil Horman <nhorman@tuxdriver.com>

So the ABI process was properly followed, except for actually bumping 
LIBABIVER. Bumping LIBABIVER is mentioned in 
doc/guides/contributing/versioning.rst but it doesn't specify *when* 
this should be done, eg should the first patch breaking the ABI bump it 
or should it done be shortly before the next stable release, or 
something else. As it is, it seems a bit too easy to simply forget.

	- Panu -

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07  6:29       ` Panu Matilainen
@ 2015-12-07 11:28         ` Thomas Monjalon
  2015-12-07 11:41           ` Panu Matilainen
  0 siblings, 1 reply; 136+ messages in thread
From: Thomas Monjalon @ 2015-12-07 11:28 UTC (permalink / raw)
  To: Panu Matilainen; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2015-12-07 08:29, Panu Matilainen:
> On 12/07/2015 01:07 AM, Thomas Monjalon wrote:
> > 2015-12-02 15:53, Panu Matilainen:
> >> This (and other changes in patch 2 breaks the librte_vhost ABI again, so
> >> you'd need to at least add a deprecation note to 2.2 to be able to do it
> >> in 2.3 at all according to the ABI policy.
> >>
> >> Perhaps a better option would be adding some padding to the structs now
> >> for 2.2 since the vhost ABI is broken there anyway. That would at least
> >> give a chance to keep it compatible from 2.2 to 2.3.
> >
> > Please could you point where the vhost ABI is broken in 2.2?
> >
> 
> The vhost ABI break was announced for DPDK 2.2 in commit 
> 3c848bd7b1c6f4f681b833322a748fdefbb5fb2d:
[...]
> So the ABI process was properly followed, except for actually bumping 
> LIBABIVER. Bumping LIBABIVER is mentioned in 
> doc/guides/contributing/versioning.rst but it doesn't specify *when* 
> this should be done, eg should the first patch breaking the ABI bump it 
> or should it done be shortly before the next stable release, or 
> something else. As it is, it seems a bit too easy to simply forget.

I thought it was not needed to explicitly say that commits must be atomic
and we do not have to wait to do the required changes.
In this case, I've missed it when reviewing the vhost patches breaking the
ABI.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07 11:28         ` Thomas Monjalon
@ 2015-12-07 11:41           ` Panu Matilainen
  2015-12-07 13:55             ` Thomas Monjalon
  0 siblings, 1 reply; 136+ messages in thread
From: Panu Matilainen @ 2015-12-07 11:41 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 12/07/2015 01:28 PM, Thomas Monjalon wrote:
> 2015-12-07 08:29, Panu Matilainen:
>> On 12/07/2015 01:07 AM, Thomas Monjalon wrote:
>>> 2015-12-02 15:53, Panu Matilainen:
>>>> This (and other changes in patch 2 breaks the librte_vhost ABI again, so
>>>> you'd need to at least add a deprecation note to 2.2 to be able to do it
>>>> in 2.3 at all according to the ABI policy.
>>>>
>>>> Perhaps a better option would be adding some padding to the structs now
>>>> for 2.2 since the vhost ABI is broken there anyway. That would at least
>>>> give a chance to keep it compatible from 2.2 to 2.3.
>>>
>>> Please could you point where the vhost ABI is broken in 2.2?
>>>
>>
>> The vhost ABI break was announced for DPDK 2.2 in commit
>> 3c848bd7b1c6f4f681b833322a748fdefbb5fb2d:
> [...]
>> So the ABI process was properly followed, except for actually bumping
>> LIBABIVER. Bumping LIBABIVER is mentioned in
>> doc/guides/contributing/versioning.rst but it doesn't specify *when*
>> this should be done, eg should the first patch breaking the ABI bump it
>> or should it done be shortly before the next stable release, or
>> something else. As it is, it seems a bit too easy to simply forget.
>
> I thought it was not needed to explicitly say that commits must be atomic
> and we do not have to wait to do the required changes.

The "problem" is that during a development cycle, an ABI could be broken 
several times but LIBABIVER should only be bumped once. So ABI breaking 
commits will often not be atomic wrt LIBABIVER, no matter which way its 
done.

For example libtool recommendation is that library versions are updated 
only just before public releases:
https://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html#Updating-version-info

	- Panu -

> In this case, I've missed it when reviewing the vhost patches breaking the
> ABI.
>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07 11:41           ` Panu Matilainen
@ 2015-12-07 13:55             ` Thomas Monjalon
  2015-12-07 16:48               ` Panu Matilainen
  0 siblings, 1 reply; 136+ messages in thread
From: Thomas Monjalon @ 2015-12-07 13:55 UTC (permalink / raw)
  To: Panu Matilainen; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2015-12-07 13:41, Panu Matilainen:
> On 12/07/2015 01:28 PM, Thomas Monjalon wrote:
> > 2015-12-07 08:29, Panu Matilainen:
> >> On 12/07/2015 01:07 AM, Thomas Monjalon wrote:
> >>> 2015-12-02 15:53, Panu Matilainen:
> >> The vhost ABI break was announced for DPDK 2.2 in commit
> >> 3c848bd7b1c6f4f681b833322a748fdefbb5fb2d:
> > [...]
> >> So the ABI process was properly followed, except for actually bumping
> >> LIBABIVER. Bumping LIBABIVER is mentioned in
> >> doc/guides/contributing/versioning.rst but it doesn't specify *when*
> >> this should be done, eg should the first patch breaking the ABI bump it
> >> or should it done be shortly before the next stable release, or
> >> something else. As it is, it seems a bit too easy to simply forget.
> >
> > I thought it was not needed to explicitly say that commits must be atomic
> > and we do not have to wait to do the required changes.
> 
> The "problem" is that during a development cycle, an ABI could be broken 
> several times but LIBABIVER should only be bumped once. So ABI breaking 
> commits will often not be atomic wrt LIBABIVER, no matter which way its 
> done.

If the ABI version has already been changed, there should be a merge conflict.
I think it's better to manage a conflict than forget to update the version.

> For example libtool recommendation is that library versions are updated 
> only just before public releases:
> https://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html#Updating-version-info

Interesting link. It makes me think that we do not manage ABI break when
downgrading the library (case of only new API keeping the ABI number).

> > In this case, I've missed it when reviewing the vhost patches breaking the
> > ABI.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07 13:55             ` Thomas Monjalon
@ 2015-12-07 16:48               ` Panu Matilainen
  2015-12-07 17:47                 ` Thomas Monjalon
  0 siblings, 1 reply; 136+ messages in thread
From: Panu Matilainen @ 2015-12-07 16:48 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 12/07/2015 03:55 PM, Thomas Monjalon wrote:
> 2015-12-07 13:41, Panu Matilainen:
>> On 12/07/2015 01:28 PM, Thomas Monjalon wrote:
>>> 2015-12-07 08:29, Panu Matilainen:
>>>> On 12/07/2015 01:07 AM, Thomas Monjalon wrote:
>>>>> 2015-12-02 15:53, Panu Matilainen:
>>>> The vhost ABI break was announced for DPDK 2.2 in commit
>>>> 3c848bd7b1c6f4f681b833322a748fdefbb5fb2d:
>>> [...]
>>>> So the ABI process was properly followed, except for actually bumping
>>>> LIBABIVER. Bumping LIBABIVER is mentioned in
>>>> doc/guides/contributing/versioning.rst but it doesn't specify *when*
>>>> this should be done, eg should the first patch breaking the ABI bump it
>>>> or should it done be shortly before the next stable release, or
>>>> something else. As it is, it seems a bit too easy to simply forget.
>>>
>>> I thought it was not needed to explicitly say that commits must be atomic
>>> and we do not have to wait to do the required changes.

Heh, now that I look more carefully... it IS documented, line 38 of 
contributing/versioning.rst:

 > ABI versions are set at the time of major release labeling, and the
 > ABI may change multiple times, without warning, between the last
 > release label and the HEAD label of the git tree.

>> The "problem" is that during a development cycle, an ABI could be broken
>> several times but LIBABIVER should only be bumped once. So ABI breaking
>> commits will often not be atomic wrt LIBABIVER, no matter which way its
>> done.
>
> If the ABI version has already been changed, there should be a merge conflict.
> I think it's better to manage a conflict than forget to update the version.

What I'm thinking of is something that would tie LIBABIVER to the 
deprecation announcement in a way that could be easily checked 
(programmatically and manually). As it is now, its quite non-trivial to 
figure what LIBABIVER *should* be for a given library at a given point - 
you need to dig up deprecation.rst history and Makefile history and 
whatnot, and its all quite error-prone.

>> For example libtool recommendation is that library versions are updated
>> only just before public releases:
>> https://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html#Updating-version-info
>
> Interesting link. It makes me think that we do not manage ABI break when
> downgrading the library (case of only new API keeping the ABI number).

Hmm, not quite sure what you mean here, but full libtool-style 
versioning is not really needed with symbol versioning.

	- Panu -

>
>>> In this case, I've missed it when reviewing the vhost patches breaking the
>>> ABI.
>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07 16:48               ` Panu Matilainen
@ 2015-12-07 17:47                 ` Thomas Monjalon
  0 siblings, 0 replies; 136+ messages in thread
From: Thomas Monjalon @ 2015-12-07 17:47 UTC (permalink / raw)
  To: Panu Matilainen; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2015-12-07 18:48, Panu Matilainen:
> On 12/07/2015 03:55 PM, Thomas Monjalon wrote:
> > 2015-12-07 13:41, Panu Matilainen:
> >> On 12/07/2015 01:28 PM, Thomas Monjalon wrote:
> >>> 2015-12-07 08:29, Panu Matilainen:
> >>>> On 12/07/2015 01:07 AM, Thomas Monjalon wrote:
> >>>>> 2015-12-02 15:53, Panu Matilainen:
> >>>> The vhost ABI break was announced for DPDK 2.2 in commit
> >>>> 3c848bd7b1c6f4f681b833322a748fdefbb5fb2d:
> >>> [...]
> >>>> So the ABI process was properly followed, except for actually bumping
> >>>> LIBABIVER. Bumping LIBABIVER is mentioned in
> >>>> doc/guides/contributing/versioning.rst but it doesn't specify *when*
> >>>> this should be done, eg should the first patch breaking the ABI bump it
> >>>> or should it done be shortly before the next stable release, or
> >>>> something else. As it is, it seems a bit too easy to simply forget.
> >>>
> >>> I thought it was not needed to explicitly say that commits must be atomic
> >>> and we do not have to wait to do the required changes.
> 
> Heh, now that I look more carefully... it IS documented, line 38 of 
> contributing/versioning.rst:
> 
>  > ABI versions are set at the time of major release labeling, and the
>  > ABI may change multiple times, without warning, between the last
>  > release label and the HEAD label of the git tree.

Interesting :)

> >> The "problem" is that during a development cycle, an ABI could be broken
> >> several times but LIBABIVER should only be bumped once. So ABI breaking
> >> commits will often not be atomic wrt LIBABIVER, no matter which way its
> >> done.
> >
> > If the ABI version has already been changed, there should be a merge conflict.
> > I think it's better to manage a conflict than forget to update the version.
> 
> What I'm thinking of is something that would tie LIBABIVER to the 
> deprecation announcement in a way that could be easily checked 
> (programmatically and manually). As it is now, its quite non-trivial to 
> figure what LIBABIVER *should* be for a given library at a given point - 
> you need to dig up deprecation.rst history and Makefile history and 
> whatnot, and its all quite error-prone.

Yes clearly we need a safer process.
You are welcome to suggest one.
I like the idea of being based on some "parse-able" RST changes.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02  3:43 ` [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
  2015-12-02 13:53   ` Panu Matilainen
@ 2015-12-08  5:57   ` Xie, Huawei
  2015-12-08  7:25     ` Yuanhan Liu
  1 sibling, 1 reply; 136+ messages in thread
From: Xie, Huawei @ 2015-12-08  5:57 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

On 12/2/2015 11:40 AM, Yuanhan Liu wrote:
[...]
> +
> +	addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
> +	if (addr == MAP_FAILED) {
> +		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
> +		return -1;
> +	}
Yuanhan:
mmap could fail with non-zero offset for huge page based mapping. Check
our workaround in user_set_mem_table.
I think you have done the validation, but i guess off is zero here.
> +
> +	/* TODO: unmap on stop */
> +	dev->log_base = addr;
> +	dev->log_size = size;
> +
> +	return 0;
> +}
> diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
> index b82108d..013cf38 100644
> --- a/lib/librte_vhost/vhost_user/virtio-net-user.h
> +++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
> @@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
>  
>  void user_set_protocol_features(struct vhost_device_ctx ctx,
>  				uint64_t protocol_features);
> +int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
>  
>  int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
>  


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-08  5:57   ` Xie, Huawei
@ 2015-12-08  7:25     ` Yuanhan Liu
  0 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-08  7:25 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Tue, Dec 08, 2015 at 05:57:54AM +0000, Xie, Huawei wrote:
> On 12/2/2015 11:40 AM, Yuanhan Liu wrote:
> [...]
> > +
> > +	addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
> > +	if (addr == MAP_FAILED) {
> > +		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
> > +		return -1;
> > +	}
> Yuanhan:
> mmap could fail with non-zero offset for huge page based mapping. Check
> our workaround in user_set_mem_table.
> I think you have done the validation, but i guess off is zero here.

Yes, off is zero. And thanks for the remind; will fix it in next version.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 3/4] vhost: log vring changes
  2015-12-02 14:07   ` Victor Kaplansky
  2015-12-02 14:38     ` Yuanhan Liu
@ 2015-12-09  2:45     ` Xie, Huawei
  1 sibling, 0 replies; 136+ messages in thread
From: Xie, Huawei @ 2015-12-09  2:45 UTC (permalink / raw)
  To: Victor Kaplansky, Yuanhan Liu; +Cc: dev, Michael S. Tsirkin

On 12/2/2015 10:09 PM, Victor Kaplansky wrote:
> On Wed, Dec 02, 2015 at 11:43:12AM +0800, Yuanhan Liu wrote:
>> Invoking vhost_log_write() to mark corresponding page as dirty while
>> updating used vring.
> Looks good, thanks!
>
> I didn't find where you log the dirty pages in result of data
> written to the buffers pointed by the descriptors in RX vring.
> AFAIU, the buffers of RX queue reside in guest's memory and have
> to be marked as dirty if they are written. What do you say?
Yes, that is actually the majority of the work.
Besides, in the first version, we could temporarily ignore zero copy,
which is much more complicated, as we have no idea when the page has
been accessed.
-- Huawei
>
> -- Victor
>
>


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 2/4] vhost: introduce vhost_log_write
  2015-12-02 13:53   ` Victor Kaplansky
  2015-12-02 14:39     ` Yuanhan Liu
@ 2015-12-09  3:33     ` Xie, Huawei
  2015-12-09  3:42       ` Yuanhan Liu
  1 sibling, 1 reply; 136+ messages in thread
From: Xie, Huawei @ 2015-12-09  3:33 UTC (permalink / raw)
  To: Victor Kaplansky, Yuanhan Liu; +Cc: dev, Michael S. Tsirkin

On 12/2/2015 9:53 PM, Victor Kaplansky wrote:
> On Wed, Dec 02, 2015 at 11:43:11AM +0800, Yuanhan Liu wrote:
>> Introduce vhost_log_write() helper function to log the dirty pages we
>> touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
>> log is presented by 1 bit.
>>
>> Therefore, vhost_log_write() simply finds the right bit for related
>> page we are gonna change, and set it to 1. dev->log_base denotes the
>> start of the dirty page bitmap.
>>
>> The page address is biased by log_guest_addr, which is derived from
>> SET_VRING_ADDR request as part of the vring related addresses.
>>
>> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>> ---
>>  lib/librte_vhost/rte_virtio_net.h | 34 ++++++++++++++++++++++++++++++++++
>>  lib/librte_vhost/virtio-net.c     |  4 ++++
>>  2 files changed, 38 insertions(+)
>>
>> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
>> index 416dac2..191c1be 100644
>> --- a/lib/librte_vhost/rte_virtio_net.h
>> +++ b/lib/librte_vhost/rte_virtio_net.h
>> @@ -40,6 +40,7 @@
>>   */
>>  
>>  #include <stdint.h>
>> +#include <linux/vhost.h>
>>  #include <linux/virtio_ring.h>
>>  #include <linux/virtio_net.h>
>>  #include <sys/eventfd.h>
>> @@ -59,6 +60,8 @@ struct rte_mbuf;
>>  /* Backend value set by guest. */
>>  #define VIRTIO_DEV_STOPPED -1
>>  
>> +#define VHOST_LOG_PAGE	4096
>> +
>>  
>>  /* Enum for virtqueue management. */
>>  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
>> @@ -82,6 +85,7 @@ struct vhost_virtqueue {
>>  	struct vring_desc	*desc;			/**< Virtqueue descriptor ring. */
>>  	struct vring_avail	*avail;			/**< Virtqueue available ring. */
>>  	struct vring_used	*used;			/**< Virtqueue used ring. */
>> +	uint64_t		log_guest_addr;		/**< Physical address of used ring, for logging */
>>  	uint32_t		size;			/**< Size of descriptor ring. */
>>  	uint32_t		backend;		/**< Backend value to determine if device should started/stopped. */
>>  	uint16_t		vhost_hlen;		/**< Vhost header length (varies depending on RX merge buffers. */
>> @@ -203,6 +207,36 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
>>  	return vhost_va;
>>  }
>>  
>> +static inline void __attribute__((always_inline))
>> +vhost_log_page(uint8_t *log_base, uint64_t page)
>> +{
>> +	/* TODO: to make it atomic? */
>> +	log_base[page / 8] |= 1 << (page % 8);
> I think the atomic OR operation is necessary only if there can be
> more than one vhost-user back-end updating the guest's memory
> simultaneously. However probably it is pretty safe to perform
> regular OR operation, since rings are not shared between
> back-end. What about buffers pointed by descriptors?  To be on
> the safe side, I would use a GCC built-in function
> __sync_fetch_and_or(). 
>
>> +}
>> +
>> +static inline void __attribute__((always_inline))
>> +vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
>> +		uint64_t offset, uint64_t len)
>> +{
>> +	uint64_t addr = vq->log_guest_addr;
>> +	uint64_t page;
>> +
>> +	if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
>> +		     !dev->log_base || !len))
>> +		return;
> Isn't "likely" more appropriate in above, since the whole
> expression is expected to be true most of the time?
Victor:
So we are not always logging, what is the message that tells the backend
the migration is started?
[...]


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-02  3:43 [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
                   ` (4 preceding siblings ...)
  2015-12-02 14:10 ` [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support Victor Kaplansky
@ 2015-12-09  3:41 ` Xie, Huawei
  2015-12-17  3:11 ` [dpdk-dev] [PATCH v2 0/6] " Yuanhan Liu
  6 siblings, 0 replies; 136+ messages in thread
From: Xie, Huawei @ 2015-12-09  3:41 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

On 12/2/2015 11:40 AM, Yuanhan Liu wrote:
> This patch set adds the initial vhost-user live migration support.
>
> The major task behind that is to log pages we touched during
> live migration. So, this patch is basically about adding vhost
> log support, and using it.
>
> Patchset
> ========
> - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
>   the dirty memory bitmap is.
>     
> - Patch 2 introduces a vhost_log_write() helper function to log
>   pages we are gonna change.
>
> - Patch 3 logs changes we made to used vring.
>
> - Patch 4 sets log_fhmfd protocol feature bit, which actually
>   enables the vhost-user live migration support.
>
> A simple test guide (on same host)
> ==================================
>
> The following test is based on OVS + DPDK. And here is guide
> to setup OVS + DPDK:
>
>     http://wiki.qemu.org/Features/vhost-user-ovs-dpdk
>
> 1. start ovs-vswitchd
>
> 2. Add two ovs vhost-user port, say vhost0 and vhost1
>
> 3. Start a VM1 to connect to vhost0. Here is my example:
>
>    $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3333,server,nowait -curses
>
> 4. run "ping $host" inside VM1
>
> 5. Start VM2 to connect to vhost0, and marking it as the target
>    of live migration (by adding -incoming tcp:0:4444 option)
>
>    $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3334,server,nowait -curses \
>        -incoming tcp:0:4444 
>
> 6. connect to VM1 monitor, and start migration:
>
>    > migrate tcp:0:4444
>
> 7. After a while, you will find that VM1 has been migrated to VM2,
>    and the "ping" command continues running, perfectly.
Is there some formal verification that migration is truly successful? At
least that the memory we care in our vhost-user case has been migrated
successfully?
For instance, we miss logging guest RX buffers in this patch set, but we
have no idea.

[...]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 2/4] vhost: introduce vhost_log_write
  2015-12-09  3:33     ` Xie, Huawei
@ 2015-12-09  3:42       ` Yuanhan Liu
  2015-12-09  5:44         ` Xie, Huawei
  0 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-09  3:42 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Wed, Dec 09, 2015 at 03:33:16AM +0000, Xie, Huawei wrote:
...
> >> +static inline void __attribute__((always_inline))
> >> +vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
> >> +		uint64_t offset, uint64_t len)
> >> +{
> >> +	uint64_t addr = vq->log_guest_addr;
> >> +	uint64_t page;
> >> +
> >> +	if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> >> +		     !dev->log_base || !len))
> >> +		return;
> > Isn't "likely" more appropriate in above, since the whole
> > expression is expected to be true most of the time?
> Victor:
> So we are not always logging, what is the message that tells the backend
> the migration is started?

When log starts, VHOST_USER_SET_FEATURES request will be sent again,
with VHOST_F_LOG_ALL feature bit set.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 2/4] vhost: introduce vhost_log_write
  2015-12-09  3:42       ` Yuanhan Liu
@ 2015-12-09  5:44         ` Xie, Huawei
  2015-12-09  8:41           ` Yuanhan Liu
  0 siblings, 1 reply; 136+ messages in thread
From: Xie, Huawei @ 2015-12-09  5:44 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 12/9/2015 11:41 AM, Yuanhan Liu wrote:
> On Wed, Dec 09, 2015 at 03:33:16AM +0000, Xie, Huawei wrote:
> ...
>>>> +static inline void __attribute__((always_inline))
>>>> +vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
>>>> +		uint64_t offset, uint64_t len)
>>>> +{
>>>> +	uint64_t addr = vq->log_guest_addr;
>>>> +	uint64_t page;
>>>> +
>>>> +	if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
>>>> +		     !dev->log_base || !len))
>>>> +		return;
>>> Isn't "likely" more appropriate in above, since the whole
>>> expression is expected to be true most of the time?
>> Victor:
>> So we are not always logging, what is the message that tells the backend
>> the migration is started?
> When log starts, VHOST_USER_SET_FEATURES request will be sent again,
> with VHOST_F_LOG_ALL feature bit set.
As the VHOST_USER_SET_FEATURES handling and rx/tx runs asynchronously,
we have to make sure we don't miss logging anything when this feature is
set. For example, I doubt like in virtio_dev_rx, is the dev->features
volatile?
> 	--yliu
>


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 2/4] vhost: introduce vhost_log_write
  2015-12-09  5:44         ` Xie, Huawei
@ 2015-12-09  8:41           ` Yuanhan Liu
  0 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-09  8:41 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Wed, Dec 09, 2015 at 05:44:11AM +0000, Xie, Huawei wrote:
> On 12/9/2015 11:41 AM, Yuanhan Liu wrote:
> > On Wed, Dec 09, 2015 at 03:33:16AM +0000, Xie, Huawei wrote:
> > ...
> >>>> +static inline void __attribute__((always_inline))
> >>>> +vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
> >>>> +		uint64_t offset, uint64_t len)
> >>>> +{
> >>>> +	uint64_t addr = vq->log_guest_addr;
> >>>> +	uint64_t page;
> >>>> +
> >>>> +	if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> >>>> +		     !dev->log_base || !len))
> >>>> +		return;
> >>> Isn't "likely" more appropriate in above, since the whole
> >>> expression is expected to be true most of the time?
> >> Victor:
> >> So we are not always logging, what is the message that tells the backend
> >> the migration is started?
> > When log starts, VHOST_USER_SET_FEATURES request will be sent again,
> > with VHOST_F_LOG_ALL feature bit set.
> As the VHOST_USER_SET_FEATURES handling and rx/tx runs asynchronously,
> we have to make sure we don't miss logging anything when this feature is
> set.

That's a good remind. Thanks.

> For example, I doubt like in virtio_dev_rx, is the dev->features
> volatile?

No, it is not volatile.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v2 0/6] vhost-user live migration support
  2015-12-02  3:43 [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
                   ` (5 preceding siblings ...)
  2015-12-09  3:41 ` Xie, Huawei
@ 2015-12-17  3:11 ` Yuanhan Liu
  2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
                     ` (8 more replies)
  6 siblings, 9 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-17  3:11 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Yang Maggie, Victor Kaplansky

This patch set adds the vhost-user live migration support.

The major task behind that is to log pages we touched during
live migration, including used vring and desc buffer. So, this
patch set is basically about adding vhost log support, and
using it.

Patchset
========
- Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
  the dirty memory bitmap is.
    
- Patch 2 introduces a vhost_log_write() helper function to log
  pages we are gonna change.

- Patch 3 logs changes we made to used vring.

- Patch 4 logs changes we made to vring desc buffer.

- Patch 5 and 6 add some feature bits related to live migration.


A simple test guide (on same host)
==================================

The following test is based on OVS + DPDK (check [0] for
how to setup OVS + DPDK):

    [0]: http://wiki.qemu.org/Features/vhost-user-ovs-dpdk

Here is the rough test guide:

1. start ovs-vswitchd

2. Add two ovs vhost-user port, say vhost0 and vhost1

3. Start a VM1 to connect to vhost0. Here is my example:

   $ $QEMU -enable-kvm -m 1024 -smp 4 \
       -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
       -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
       -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
       -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
       -numa node,memdev=mem -mem-prealloc \
       -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
       -hda fc-19-i386.img \
       -monitor telnet::3333,server,nowait -curses

4. run "ping $host" inside VM1

5. Start VM2 to connect to vhost0, and marking it as the target
   of live migration (by adding -incoming tcp:0:4444 option)

   $ $QEMU -enable-kvm -m 1024 -smp 4 \
       -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
       -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
       -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
       -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
       -numa node,memdev=mem -mem-prealloc \
       -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
       -hda fc-19-i386.img \
       -monitor telnet::3334,server,nowait -curses \
       -incoming tcp:0:4444 

6. connect to VM1 monitor, and start migration:

   > migrate tcp:0:4444

7. After a while, you will find that VM1 has been migrated to VM2,
   and the "ping" command continues running, perfectly.


Cc: Chen Zhihui <zhihui.chen@intel.com>
Cc: Yang Maggie <maggie.yang@intel.com>
---
Yuanhan Liu (6):
  vhost: handle VHOST_USER_SET_LOG_BASE request
  vhost: introduce vhost_log_write
  vhost: log used vring changes
  vhost: log vring desc buffer changes
  vhost: claim that we support GUEST_ANNOUNCE feature
  vhost: enable log_shmfd protocol feature

 lib/librte_vhost/rte_virtio_net.h             | 36 ++++++++++-
 lib/librte_vhost/vhost_rxtx.c                 | 88 +++++++++++++++++++--------
 lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++-
 lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++
 lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++
 lib/librte_vhost/vhost_user/virtio-net-user.h |  5 +-
 lib/librte_vhost/virtio-net.c                 |  5 ++
 7 files changed, 165 insertions(+), 30 deletions(-)

-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-17  3:11 ` [dpdk-dev] [PATCH v2 0/6] " Yuanhan Liu
@ 2015-12-17  3:11   ` Yuanhan Liu
  2015-12-21 15:32     ` Xie, Huawei
  2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write Yuanhan Liu
                     ` (7 subsequent siblings)
  8 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-17  3:11 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
vhost-user) where we should log dirty pages, and how big the log
buffer is.

This request introduces a new payload:

    typedef struct VhostUserLog {
            uint64_t mmap_size;
            uint64_t mmap_offset;
    } VhostUserLog;

Also, a fd is delivered from QEMU by ancillary data.

With those info given, an area of memory is mmaped, assigned
to dev->log_base, for logging dirty pages.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com
---

v2: workaround mmap issue when offset is not zero
---
 lib/librte_vhost/rte_virtio_net.h             |  4 ++-
 lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++--
 lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++++
 lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++++++++++++++
 lib/librte_vhost/vhost_user/virtio-net-user.h |  1 +
 5 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 10dcb90..8acee02 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -129,7 +129,9 @@ struct virtio_net {
 	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
 	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
 	void			*priv;		/**< private context */
-	uint64_t		reserved[64];	/**< Reserve some spaces for future extension. */
+	uint64_t		log_size;	/**< Size of log area */
+	uint64_t		log_base;	/**< Where dirty pages are logged */
+	uint64_t		reserved[62];	/**< Reserve some spaces for future extension. */
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
 } __rte_cache_aligned;
 
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 8b7a448..32ad6f6 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -388,9 +388,12 @@ vserver_message_handler(int connfd, void *dat, int *remove)
 		break;
 
 	case VHOST_USER_SET_LOG_BASE:
-		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
-		break;
+		user_set_log_base(ctx, &msg);
 
+		/* it needs a reply */
+		msg.size = sizeof(msg.payload.u64);
+		send_vhost_message(connfd, &msg);
+		break;
 	case VHOST_USER_SET_LOG_FD:
 		close(msg.fds[0]);
 		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.h b/lib/librte_vhost/vhost_user/vhost-net-user.h
index 38637cc..6d252a3 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.h
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.h
@@ -83,6 +83,11 @@ typedef struct VhostUserMemory {
 	VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
 } VhostUserMemory;
 
+typedef struct VhostUserLog {
+	uint64_t mmap_size;
+	uint64_t mmap_offset;
+} VhostUserLog;
+
 typedef struct VhostUserMsg {
 	VhostUserRequest request;
 
@@ -97,6 +102,7 @@ typedef struct VhostUserMsg {
 		struct vhost_vring_state state;
 		struct vhost_vring_addr addr;
 		VhostUserMemory memory;
+		VhostUserLog    log;
 	} payload;
 	int fds[VHOST_MEMORY_MAX_NREGIONS];
 } __attribute((packed)) VhostUserMsg;
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
index 2934d1c..b77c9b3 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -365,3 +365,51 @@ user_set_protocol_features(struct vhost_device_ctx ctx,
 
 	dev->protocol_features = protocol_features;
 }
+
+int
+user_set_log_base(struct vhost_device_ctx ctx,
+		 struct VhostUserMsg *msg)
+{
+	struct virtio_net *dev;
+	int fd = msg->fds[0];
+	uint64_t size, off;
+	void *addr;
+
+	dev = get_device(ctx);
+	if (!dev)
+		return -1;
+
+	if (fd < 0) {
+		RTE_LOG(ERR, VHOST_CONFIG, "invalid log fd: %d\n", fd);
+		return -1;
+	}
+
+	if (msg->size != sizeof(VhostUserLog)) {
+		RTE_LOG(ERR, VHOST_CONFIG,
+			"invalid log base msg size: %"PRId32" != %d\n",
+			msg->size, (int)sizeof(VhostUserLog));
+		return -1;
+	}
+
+	size = msg->payload.log.mmap_size;
+	off  = msg->payload.log.mmap_offset;
+	RTE_LOG(INFO, VHOST_CONFIG,
+		"log mmap size: %"PRId64", offset: %"PRId64"\n",
+		size, off);
+
+	/*
+	 * mmap from 0 to workaround a hugepage mmap bug: mmap will be
+	 * failed when offset is not page size aligned.
+	 */
+	addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (addr == MAP_FAILED) {
+		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
+		return -1;
+	}
+
+	/* TODO: unmap on stop */
+	dev->log_base = (uint64_t)(uintptr_t)addr + off;
+	dev->log_size = size;
+
+	return 0;
+}
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index b82108d..013cf38 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
 
 void user_set_protocol_features(struct vhost_device_ctx ctx,
 				uint64_t protocol_features);
+int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
 
 int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
 
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-17  3:11 ` [dpdk-dev] [PATCH v2 0/6] " Yuanhan Liu
  2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
@ 2015-12-17  3:11   ` Yuanhan Liu
  2015-12-21 15:06     ` Xie, Huawei
  2015-12-22  5:11     ` Peter Xu
  2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 3/6] vhost: log used vring changes Yuanhan Liu
                     ` (6 subsequent siblings)
  8 siblings, 2 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-17  3:11 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

Introduce vhost_log_write() helper function to log the dirty pages we
touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
log is presented by 1 bit.

Therefore, vhost_log_write() simply finds the right bit for related
page we are gonna change, and set it to 1. dev->log_base denotes the
start of the dirty page bitmap.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com
---
 lib/librte_vhost/rte_virtio_net.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 8acee02..5726683 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -40,6 +40,7 @@
  */
 
 #include <stdint.h>
+#include <linux/vhost.h>
 #include <linux/virtio_ring.h>
 #include <linux/virtio_net.h>
 #include <sys/eventfd.h>
@@ -59,6 +60,8 @@ struct rte_mbuf;
 /* Backend value set by guest. */
 #define VIRTIO_DEV_STOPPED -1
 
+#define VHOST_LOG_PAGE	4096
+
 
 /* Enum for virtqueue management. */
 enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
@@ -205,6 +208,32 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
 	return vhost_va;
 }
 
+static inline void __attribute__((always_inline))
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+	log_base[page / 8] |= 1 << (page % 8);
+}
+
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
+{
+	uint64_t page;
+
+	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+		   !dev->log_base || !len))
+		return;
+
+	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+		return;
+
+	page = addr / VHOST_LOG_PAGE;
+	while (page * VHOST_LOG_PAGE < addr + len) {
+		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+		page += VHOST_LOG_PAGE;
+	}
+}
+
+
 /**
  *  Disable features in feature_mask. Returns 0 on success.
  */
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v2 3/6] vhost: log used vring changes
  2015-12-17  3:11 ` [dpdk-dev] [PATCH v2 0/6] " Yuanhan Liu
  2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
  2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write Yuanhan Liu
@ 2015-12-17  3:11   ` Yuanhan Liu
  2015-12-22  6:55     ` Peter Xu
  2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 4/6] vhost: log vring desc buffer changes Yuanhan Liu
                     ` (5 subsequent siblings)
  8 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-17  3:11 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

Introducing a vhost_log_write() wrapper, vhost_log_used_vring, to
log used vring changes.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com
---
 lib/librte_vhost/rte_virtio_net.h |  3 +-
 lib/librte_vhost/vhost_rxtx.c     | 80 +++++++++++++++++++++++++++------------
 lib/librte_vhost/virtio-net.c     |  4 ++
 3 files changed, 62 insertions(+), 25 deletions(-)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 5726683..0f83719 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -93,7 +93,8 @@ struct vhost_virtqueue {
 	int			callfd;			/**< Used to notify the guest (trigger interrupt). */
 	int			kickfd;			/**< Currently unused as polling mode is enabled. */
 	int			enabled;
-	uint64_t		reserved[16];		/**< Reserve some spaces for future extension. */
+	uint64_t		log_guest_addr;		/**< Physical address of used ring, for logging */
+	uint64_t		reserved[15];		/**< Reserve some spaces for future extension. */
 	struct buf_vector	buf_vec[BUF_VECTOR_MAX];	/**< for scatter RX. */
 } __rte_cache_aligned;
 
diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index bbf3fac..f305acd 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -49,6 +49,16 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t qp_nb)
 	return (is_tx ^ (idx & 1)) == 0 && idx < qp_nb * VIRTIO_QNUM;
 }
 
+static inline void __attribute__((always_inline))
+vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		     uint64_t offset, uint64_t len)
+{
+	uint64_t addr;
+
+	addr = vq->log_guest_addr + offset;
+	vhost_log_write(dev, addr, len);
+}
+
 /**
  * This function adds buffers to the virtio devices RX virtqueue. Buffers can
  * be received from the physical port or from another virtio device. A packet
@@ -129,6 +139,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 		uint32_t offset = 0, vb_offset = 0;
 		uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
 		uint8_t hdr = 0, uncompleted_pkt = 0;
+		uint16_t idx;
 
 		/* Get descriptor from available ring */
 		desc = &vq->desc[head[packet_success]];
@@ -200,16 +211,18 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 		}
 
 		/* Update used ring with desc information */
-		vq->used->ring[res_cur_idx & (vq->size - 1)].id =
-							head[packet_success];
+		idx = res_cur_idx & (vq->size - 1);
+		vq->used->ring[idx].id = head[packet_success];
 
 		/* Drop the packet if it is uncompleted */
 		if (unlikely(uncompleted_pkt == 1))
-			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
-							vq->vhost_hlen;
+			vq->used->ring[idx].len = vq->vhost_hlen;
 		else
-			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
-							pkt_len + vq->vhost_hlen;
+			vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
+
+		vhost_log_used_vring(dev, vq,
+			offsetof(struct vring_used, ring[idx]),
+			sizeof(vq->used->ring[idx]));
 
 		res_cur_idx++;
 		packet_success++;
@@ -236,6 +249,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 	*(volatile uint16_t *)&vq->used->idx += count;
 	vq->last_used_idx = res_end_idx;
+	vhost_log_used_vring(dev, vq,
+		offsetof(struct vring_used, idx),
+		sizeof(vq->used->idx));
 
 	/* flush used->idx update before we read avail->flags. */
 	rte_mb();
@@ -265,6 +281,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 	uint32_t seg_avail;
 	uint32_t vb_avail;
 	uint32_t cpy_len, entry_len;
+	uint16_t idx;
 
 	if (pkt == NULL)
 		return 0;
@@ -302,16 +319,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 	entry_len = vq->vhost_hlen;
 
 	if (vb_avail == 0) {
-		uint32_t desc_idx =
-			vq->buf_vec[vec_idx].desc_idx;
+		uint32_t desc_idx = vq->buf_vec[vec_idx].desc_idx;
+
+		if ((vq->desc[desc_idx].flags & VRING_DESC_F_NEXT) == 0) {
+			idx = cur_idx & (vq->size - 1);
 
-		if ((vq->desc[desc_idx].flags
-			& VRING_DESC_F_NEXT) == 0) {
 			/* Update used ring with desc information */
-			vq->used->ring[cur_idx & (vq->size - 1)].id
-				= vq->buf_vec[vec_idx].desc_idx;
-			vq->used->ring[cur_idx & (vq->size - 1)].len
-				= entry_len;
+			vq->used->ring[idx].id = vq->buf_vec[vec_idx].desc_idx;
+			vq->used->ring[idx].len = entry_len;
+
+			vhost_log_used_vring(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 
 			entry_len = 0;
 			cur_idx++;
@@ -354,10 +373,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 			if ((vq->desc[vq->buf_vec[vec_idx].desc_idx].flags &
 				VRING_DESC_F_NEXT) == 0) {
 				/* Update used ring with desc information */
-				vq->used->ring[cur_idx & (vq->size - 1)].id
+				idx = cur_idx & (vq->size - 1);
+				vq->used->ring[idx].id
 					= vq->buf_vec[vec_idx].desc_idx;
-				vq->used->ring[cur_idx & (vq->size - 1)].len
-					= entry_len;
+				vq->used->ring[idx].len = entry_len;
+				vhost_log_used_vring(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 				entry_len = 0;
 				cur_idx++;
 				entry_success++;
@@ -390,16 +412,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 
 					if ((vq->desc[desc_idx].flags &
 						VRING_DESC_F_NEXT) == 0) {
-						uint16_t wrapped_idx =
-							cur_idx & (vq->size - 1);
+						idx = cur_idx & (vq->size - 1);
 						/*
 						 * Update used ring with the
 						 * descriptor information
 						 */
-						vq->used->ring[wrapped_idx].id
+						vq->used->ring[idx].id
 							= desc_idx;
-						vq->used->ring[wrapped_idx].len
+						vq->used->ring[idx].len
 							= entry_len;
+						vhost_log_used_vring(dev, vq,
+							offsetof(struct vring_used, ring[idx]),
+							sizeof(vq->used->ring[idx]));
 						entry_success++;
 						entry_len = 0;
 						cur_idx++;
@@ -422,10 +446,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 				 * This whole packet completes.
 				 */
 				/* Update used ring with desc information */
-				vq->used->ring[cur_idx & (vq->size - 1)].id
+				idx = cur_idx & (vq->size - 1);
+				vq->used->ring[idx].id
 					= vq->buf_vec[vec_idx].desc_idx;
-				vq->used->ring[cur_idx & (vq->size - 1)].len
-					= entry_len;
+				vq->used->ring[idx].len = entry_len;
+				vhost_log_used_vring(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 				entry_success++;
 				break;
 			}
@@ -653,6 +680,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		/* Update used index buffer information. */
 		vq->used->ring[used_idx].id = head[entry_success];
 		vq->used->ring[used_idx].len = 0;
+		vhost_log_used_vring(dev, vq,
+				offsetof(struct vring_used, ring[used_idx]),
+				sizeof(vq->used->ring[used_idx]));
 
 		/* Allocate an mbuf and populate the structure. */
 		m = rte_pktmbuf_alloc(mbuf_pool);
@@ -773,6 +803,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 
 	rte_compiler_barrier();
 	vq->used->idx += entry_success;
+	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
+			sizeof(vq->used->idx));
 	/* Kick guest if required. */
 	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
 		eventfd_write(vq->callfd, (eventfd_t)1);
diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index de78a0f..03044f6 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -666,12 +666,16 @@ set_vring_addr(struct vhost_device_ctx ctx, struct vhost_vring_addr *addr)
 		return -1;
 	}
 
+	vq->log_guest_addr = addr->log_guest_addr;
+
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address desc: %p\n",
 			dev->device_fh, vq->desc);
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address avail: %p\n",
 			dev->device_fh, vq->avail);
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address used: %p\n",
 			dev->device_fh, vq->used);
+	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") log_guest_addr: %"PRIx64"\n",
+			dev->device_fh, vq->log_guest_addr);
 
 	return 0;
 }
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v2 4/6] vhost: log vring desc buffer changes
  2015-12-17  3:11 ` [dpdk-dev] [PATCH v2 0/6] " Yuanhan Liu
                     ` (2 preceding siblings ...)
  2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 3/6] vhost: log used vring changes Yuanhan Liu
@ 2015-12-17  3:11   ` Yuanhan Liu
  2015-12-17  3:12   ` [dpdk-dev] [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-17  3:11 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

Every time we copy a buf to vring desc, we need to log it.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com
---
 lib/librte_vhost/vhost_rxtx.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index f305acd..c2d514b 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -71,7 +71,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 	struct rte_mbuf **pkts, uint32_t count)
 {
 	struct vhost_virtqueue *vq;
-	struct vring_desc *desc;
+	struct vring_desc *desc, *hdr_desc;
 	struct rte_mbuf *buff;
 	/* The virtio_hdr is initialised to 0. */
 	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
@@ -153,6 +153,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 		/* Copy virtio_hdr to packet and increment buffer address */
 		buff_hdr_addr = buff_addr;
+		hdr_desc = desc;
 
 		/*
 		 * If the descriptors are chained the header and data are
@@ -177,6 +178,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 			rte_memcpy((void *)(uintptr_t)(buff_addr + vb_offset),
 				rte_pktmbuf_mtod_offset(buff, const void *, offset),
 				len_to_cpy);
+			vhost_log_write(dev, desc->addr + vb_offset, len_to_cpy);
 			PRINT_PACKET(dev, (uintptr_t)(buff_addr + vb_offset),
 				len_to_cpy, 0);
 
@@ -232,6 +234,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 		rte_memcpy((void *)(uintptr_t)buff_hdr_addr,
 			(const void *)&virtio_hdr, vq->vhost_hlen);
+		vhost_log_write(dev, hdr_desc->addr, vq->vhost_hlen);
 
 		PRINT_PACKET(dev, (uintptr_t)buff_hdr_addr, vq->vhost_hlen, 1);
 
@@ -309,6 +312,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 
 	rte_memcpy((void *)(uintptr_t)vb_hdr_addr,
 		(const void *)&virtio_hdr, vq->vhost_hlen);
+	vhost_log_write(dev, vq->buf_vec[vec_idx].buf_addr, vq->vhost_hlen);
 
 	PRINT_PACKET(dev, (uintptr_t)vb_hdr_addr, vq->vhost_hlen, 1);
 
@@ -353,6 +357,8 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 		rte_memcpy((void *)(uintptr_t)(vb_addr + vb_offset),
 			rte_pktmbuf_mtod_offset(pkt, const void *, seg_offset),
 			cpy_len);
+		vhost_log_write(dev, vq->buf_vec[vec_idx].buf_addr + vb_offset,
+			cpy_len);
 
 		PRINT_PACKET(dev,
 			(uintptr_t)(vb_addr + vb_offset),
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature
  2015-12-17  3:11 ` [dpdk-dev] [PATCH v2 0/6] " Yuanhan Liu
                     ` (3 preceding siblings ...)
  2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 4/6] vhost: log vring desc buffer changes Yuanhan Liu
@ 2015-12-17  3:12   ` Yuanhan Liu
  2015-12-22  8:11     ` Peter Xu
  2015-12-17  3:12   ` [dpdk-dev] [PATCH v2 6/6] vhost: enable log_shmfd protocol feature Yuanhan Liu
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-17  3:12 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

It's actually a feature already enabled in Linux kernel. What we need to
do is simply to claim that we support such feature, and nothing else.

With that, the guest will send GARP messages after live migration.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/virtio-net.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index 03044f6..0ba5045 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
 #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
 				(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
 				(1ULL << VIRTIO_NET_F_CTRL_RX) | \
+				(1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
 				(VHOST_SUPPORTS_MQ)            | \
 				(1ULL << VIRTIO_F_VERSION_1)   | \
 				(1ULL << VHOST_F_LOG_ALL)      | \
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v2 6/6] vhost: enable log_shmfd protocol feature
  2015-12-17  3:11 ` [dpdk-dev] [PATCH v2 0/6] " Yuanhan Liu
                     ` (4 preceding siblings ...)
  2015-12-17  3:12   ` [dpdk-dev] [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
@ 2015-12-17  3:12   ` Yuanhan Liu
  2015-12-17 12:08   ` [dpdk-dev] [PATCH v2 0/6] vhost-user live migration support Iremonger, Bernard
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-17  3:12 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

To claim that we support vhost-user live migration support:
SET_LOG_BASE request will be send only when this feature flag
is set.

Besides this flag, we actually need another feature flag set
to make vhost-user live migration work: VHOST_F_LOG_ALL.
Which, however, has been enabled long time ago.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost_user/virtio-net-user.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index 013cf38..a3a889d 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -38,8 +38,10 @@
 #include "vhost-net-user.h"
 
 #define VHOST_USER_PROTOCOL_F_MQ	0
+#define VHOST_USER_PROTOCOL_F_LOG_SHMFD	1
 
-#define VHOST_USER_PROTOCOL_FEATURES	(1ULL << VHOST_USER_PROTOCOL_F_MQ)
+#define VHOST_USER_PROTOCOL_FEATURES	((1ULL << VHOST_USER_PROTOCOL_F_MQ) | \
+					 (1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD))
 
 int user_set_mem_table(struct vhost_device_ctx, struct VhostUserMsg *);
 
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/6] vhost-user live migration support
  2015-12-17  3:11 ` [dpdk-dev] [PATCH v2 0/6] " Yuanhan Liu
                     ` (5 preceding siblings ...)
  2015-12-17  3:12   ` [dpdk-dev] [PATCH v2 6/6] vhost: enable log_shmfd protocol feature Yuanhan Liu
@ 2015-12-17 12:08   ` Iremonger, Bernard
  2015-12-17 12:45     ` Yuanhan Liu
  2015-12-21  8:17   ` Pavel Fedin
  2016-01-29  4:57   ` [dpdk-dev] [PATCH v3 0/8] " Yuanhan Liu
  8 siblings, 1 reply; 136+ messages in thread
From: Iremonger, Bernard @ 2015-12-17 12:08 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Michael S. Tsirkin, Yang, Maggie, Victor Kaplansky

Hi Yuanhan,

> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Thursday, December 17, 2015 3:12 AM
> To: dev@dpdk.org
> Cc: Xie, Huawei <huawei.xie@intel.com>; Michael S. Tsirkin
> <mst@redhat.com>; Victor Kaplansky <vkaplans@redhat.com>; Iremonger,
> Bernard <bernard.iremonger@intel.com>; Pavel Fedin
> <p.fedin@samsung.com>; Peter Xu <peterx@redhat.com>; Yuanhan Liu
> <yuanhan.liu@linux.intel.com>; Chen, Zhihui <zhihui.chen@intel.com>;
> Yang, Maggie <maggie.yang@intel.com>
> Subject: [PATCH v2 0/6] vhost-user live migration support
> 
> This patch set adds the vhost-user live migration support.
> 
> The major task behind that is to log pages we touched during live migration,
> including used vring and desc buffer. So, this patch set is basically about
> adding vhost log support, and using it.
> 
> Patchset
> ========
> - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
>   the dirty memory bitmap is.
> 
> - Patch 2 introduces a vhost_log_write() helper function to log
>   pages we are gonna change.
> 
> - Patch 3 logs changes we made to used vring.
> 
> - Patch 4 logs changes we made to vring desc buffer.
> 
> - Patch 5 and 6 add some feature bits related to live migration.
> 
>
 
The follow test guide should probably be added the DPDK doc files.
It could be added to the sample app guide or the programmers guide.
There is already a Vhost Library  section in the programmers guide and
A Vhost Sample Application section in the sample app guide.


> A simple test guide (on same host)
> ==================================
> 
> The following test is based on OVS + DPDK (check [0] for how to setup OVS +
> DPDK):
> 
>     [0]: http://wiki.qemu.org/Features/vhost-user-ovs-dpdk
> 
> Here is the rough test guide:
> 
> 1. start ovs-vswitchd
> 
> 2. Add two ovs vhost-user port, say vhost0 and vhost1
> 
> 3. Start a VM1 to connect to vhost0. Here is my example:
> 
>    $ $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-
> path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3333,server,nowait -curses
> 
> 4. run "ping $host" inside VM1
> 
> 5. Start VM2 to connect to vhost0, and marking it as the target
>    of live migration (by adding -incoming tcp:0:4444 option)
> 
>    $ $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-
> path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3334,server,nowait -curses \
>        -incoming tcp:0:4444
> 
> 6. connect to VM1 monitor, and start migration:
> 
>    > migrate tcp:0:4444
> 
> 7. After a while, you will find that VM1 has been migrated to VM2,
>    and the "ping" command continues running, perfectly.
> 
> 
> Cc: Chen Zhihui <zhihui.chen@intel.com>
> Cc: Yang Maggie <maggie.yang@intel.com>
> ---
> Yuanhan Liu (6):
>   vhost: handle VHOST_USER_SET_LOG_BASE request
>   vhost: introduce vhost_log_write
>   vhost: log used vring changes
>   vhost: log vring desc buffer changes
>   vhost: claim that we support GUEST_ANNOUNCE feature
>   vhost: enable log_shmfd protocol feature
> 
>  lib/librte_vhost/rte_virtio_net.h             | 36 ++++++++++-
>  lib/librte_vhost/vhost_rxtx.c                 | 88 +++++++++++++++++++--------
>  lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++-
> lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++
> lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++
> lib/librte_vhost/vhost_user/virtio-net-user.h |  5 +-
>  lib/librte_vhost/virtio-net.c                 |  5 ++
>  7 files changed, 165 insertions(+), 30 deletions(-)
> 
> --
> 1.9.0

Regards,

Bernard.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/6] vhost-user live migration support
  2015-12-17 12:08   ` [dpdk-dev] [PATCH v2 0/6] vhost-user live migration support Iremonger, Bernard
@ 2015-12-17 12:45     ` Yuanhan Liu
  0 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-17 12:45 UTC (permalink / raw)
  To: Iremonger, Bernard
  Cc: Michael S. Tsirkin, dev, Yang, Maggie, Victor Kaplansky

On Thu, Dec 17, 2015 at 12:08:13PM +0000, Iremonger, Bernard wrote:
> Hi Yuanhan,
> 
> > -----Original Message-----
> > From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> > Sent: Thursday, December 17, 2015 3:12 AM
> > To: dev@dpdk.org
> > Cc: Xie, Huawei <huawei.xie@intel.com>; Michael S. Tsirkin
> > <mst@redhat.com>; Victor Kaplansky <vkaplans@redhat.com>; Iremonger,
> > Bernard <bernard.iremonger@intel.com>; Pavel Fedin
> > <p.fedin@samsung.com>; Peter Xu <peterx@redhat.com>; Yuanhan Liu
> > <yuanhan.liu@linux.intel.com>; Chen, Zhihui <zhihui.chen@intel.com>;
> > Yang, Maggie <maggie.yang@intel.com>
> > Subject: [PATCH v2 0/6] vhost-user live migration support
> > 
> > This patch set adds the vhost-user live migration support.
> > 
> > The major task behind that is to log pages we touched during live migration,
> > including used vring and desc buffer. So, this patch set is basically about
> > adding vhost log support, and using it.
> > 
> > Patchset
> > ========
> > - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
> >   the dirty memory bitmap is.
> > 
> > - Patch 2 introduces a vhost_log_write() helper function to log
> >   pages we are gonna change.
> > 
> > - Patch 3 logs changes we made to used vring.
> > 
> > - Patch 4 logs changes we made to vring desc buffer.
> > 
> > - Patch 5 and 6 add some feature bits related to live migration.
> > 
> >
>  
> The follow test guide should probably be added the DPDK doc files.

Yes, but not this one, which is a fare rough one. The official one
should do live migration between two hosts.

> It could be added to the sample app guide or the programmers guide.
> There is already a Vhost Library  section in the programmers guide and
> A Vhost Sample Application section in the sample app guide.

We may do it after the validation from validation team.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/6] vhost-user live migration support
  2015-12-17  3:11 ` [dpdk-dev] [PATCH v2 0/6] " Yuanhan Liu
                     ` (6 preceding siblings ...)
  2015-12-17 12:08   ` [dpdk-dev] [PATCH v2 0/6] vhost-user live migration support Iremonger, Bernard
@ 2015-12-21  8:17   ` Pavel Fedin
  2016-01-29  4:57   ` [dpdk-dev] [PATCH v3 0/8] " Yuanhan Liu
  8 siblings, 0 replies; 136+ messages in thread
From: Pavel Fedin @ 2015-12-21  8:17 UTC (permalink / raw)
  To: 'Yuanhan Liu', dev
  Cc: 'Michael S. Tsirkin', 'Yang Maggie',
	'Victor Kaplansky'

 Works fine.

 Tested-by: Pavel Fedin <p.fedin@samsung.com>

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Thursday, December 17, 2015 6:12 AM
> To: dev@dpdk.org
> Cc: huawei.xie@intel.com; Michael S. Tsirkin; Victor Kaplansky; Iremonger Bernard; Pavel
> Fedin; Peter Xu; Yuanhan Liu; Chen Zhihui; Yang Maggie
> Subject: [PATCH v2 0/6] vhost-user live migration support
> 
> This patch set adds the vhost-user live migration support.
> 
> The major task behind that is to log pages we touched during
> live migration, including used vring and desc buffer. So, this
> patch set is basically about adding vhost log support, and
> using it.
> 
> Patchset
> ========
> - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
>   the dirty memory bitmap is.
> 
> - Patch 2 introduces a vhost_log_write() helper function to log
>   pages we are gonna change.
> 
> - Patch 3 logs changes we made to used vring.
> 
> - Patch 4 logs changes we made to vring desc buffer.
> 
> - Patch 5 and 6 add some feature bits related to live migration.
> 
> 
> A simple test guide (on same host)
> ==================================
> 
> The following test is based on OVS + DPDK (check [0] for
> how to setup OVS + DPDK):
> 
>     [0]: http://wiki.qemu.org/Features/vhost-user-ovs-dpdk
> 
> Here is the rough test guide:
> 
> 1. start ovs-vswitchd
> 
> 2. Add two ovs vhost-user port, say vhost0 and vhost1
> 
> 3. Start a VM1 to connect to vhost0. Here is my example:
> 
>    $ $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3333,server,nowait -curses
> 
> 4. run "ping $host" inside VM1
> 
> 5. Start VM2 to connect to vhost0, and marking it as the target
>    of live migration (by adding -incoming tcp:0:4444 option)
> 
>    $ $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3334,server,nowait -curses \
>        -incoming tcp:0:4444
> 
> 6. connect to VM1 monitor, and start migration:
> 
>    > migrate tcp:0:4444
> 
> 7. After a while, you will find that VM1 has been migrated to VM2,
>    and the "ping" command continues running, perfectly.
> 
> 
> Cc: Chen Zhihui <zhihui.chen@intel.com>
> Cc: Yang Maggie <maggie.yang@intel.com>
> ---
> Yuanhan Liu (6):
>   vhost: handle VHOST_USER_SET_LOG_BASE request
>   vhost: introduce vhost_log_write
>   vhost: log used vring changes
>   vhost: log vring desc buffer changes
>   vhost: claim that we support GUEST_ANNOUNCE feature
>   vhost: enable log_shmfd protocol feature
> 
>  lib/librte_vhost/rte_virtio_net.h             | 36 ++++++++++-
>  lib/librte_vhost/vhost_rxtx.c                 | 88 +++++++++++++++++++--------
>  lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++-
>  lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++
>  lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++
>  lib/librte_vhost/vhost_user/virtio-net-user.h |  5 +-
>  lib/librte_vhost/virtio-net.c                 |  5 ++
>  7 files changed, 165 insertions(+), 30 deletions(-)
> 
> --
> 1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write Yuanhan Liu
@ 2015-12-21 15:06     ` Xie, Huawei
  2015-12-22  2:40       ` Yuanhan Liu
  2015-12-22  5:11     ` Peter Xu
  1 sibling, 1 reply; 136+ messages in thread
From: Xie, Huawei @ 2015-12-21 15:06 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

On 12/17/2015 11:11 AM, Yuanhan Liu wrote:
> Introduce vhost_log_write() helper function to log the dirty pages we
> touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
> log is presented by 1 bit.
>
> Therefore, vhost_log_write() simply finds the right bit for related
> page we are gonna change, and set it to 1. dev->log_base denotes the
> start of the dirty page bitmap.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Signed-off-by: Victor Kaplansky <victork@redhat.com
> ---
>  lib/librte_vhost/rte_virtio_net.h | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
>
> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> index 8acee02..5726683 100644
> --- a/lib/librte_vhost/rte_virtio_net.h
> +++ b/lib/librte_vhost/rte_virtio_net.h
> @@ -40,6 +40,7 @@
>   */
>  
>  #include <stdint.h>
> +#include <linux/vhost.h>
>  #include <linux/virtio_ring.h>
>  #include <linux/virtio_net.h>
>  #include <sys/eventfd.h>
> @@ -59,6 +60,8 @@ struct rte_mbuf;
>  /* Backend value set by guest. */
>  #define VIRTIO_DEV_STOPPED -1
>  
> +#define VHOST_LOG_PAGE	4096
> +
>  
>  /* Enum for virtqueue management. */
>  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> @@ -205,6 +208,32 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
>  	return vhost_va;
>  }
>  
> +static inline void __attribute__((always_inline))
> +vhost_log_page(uint8_t *log_base, uint64_t page)
> +{
> +	log_base[page / 8] |= 1 << (page % 8);
> +}
> +
Those logging functions are not supposed to be API. Could we move them
into an internal header file?
> +static inline void __attribute__((always_inline))
> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> +{
> +	uint64_t page;
> +
Before we log, we need memory barrier to make sure updates are in place.
> +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> +		   !dev->log_base || !len))
> +		return;
> +
> +	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
> +		return;
> +
> +	page = addr / VHOST_LOG_PAGE;
> +	while (page * VHOST_LOG_PAGE < addr + len) {
Let us have a page_end var to make the code simpler?
> +		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
> +		page += VHOST_LOG_PAGE;
page += 1?
> +	}
> +}
> +
> +
>  /**
>   *  Disable features in feature_mask. Returns 0 on success.
>   */


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
@ 2015-12-21 15:32     ` Xie, Huawei
  2015-12-22  2:25       ` Yuanhan Liu
  0 siblings, 1 reply; 136+ messages in thread
From: Xie, Huawei @ 2015-12-21 15:32 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

On 12/17/2015 11:11 AM, Yuanhan Liu wrote:
> VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
> vhost-user) where we should log dirty pages, and how big the log
> buffer is.
>
> This request introduces a new payload:
>
>     typedef struct VhostUserLog {
>             uint64_t mmap_size;
>             uint64_t mmap_offset;
>     } VhostUserLog;
>
> Also, a fd is delivered from QEMU by ancillary data.
>
> With those info given, an area of memory is mmaped, assigned
> to dev->log_base, for logging dirty pages.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Signed-off-by: Victor Kaplansky <victork@redhat.com
> ---
>
> v2: workaround mmap issue when offset is not zero
> ---
>  lib/librte_vhost/rte_virtio_net.h             |  4 ++-
>  lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++--
>  lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++++
>  lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++++++++++++++
>  lib/librte_vhost/vhost_user/virtio-net-user.h |  1 +
>  5 files changed, 63 insertions(+), 3 deletions(-)
>
> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> index 10dcb90..8acee02 100644
> --- a/lib/librte_vhost/rte_virtio_net.h
> +++ b/lib/librte_vhost/rte_virtio_net.h
> @@ -129,7 +129,9 @@ struct virtio_net {
>  	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
>  	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
>  	void			*priv;		/**< private context */
> -	uint64_t		reserved[64];	/**< Reserve some spaces for future extension. */
> +	uint64_t		log_size;	/**< Size of log area */
> +	uint64_t		log_base;	/**< Where dirty pages are logged */
> +	uint64_t		reserved[62];	/**< Reserve some spaces for future extension. */
>  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
>  } __rte_cache_aligned;
>  
> diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
> index 8b7a448..32ad6f6 100644
> --- a/lib/librte_vhost/vhost_user/vhost-net-user.c
> +++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
> @@ -388,9 +388,12 @@ vserver_message_handler(int connfd, void *dat, int *remove)
>  		break;
>  
>  	case VHOST_USER_SET_LOG_BASE:
> -		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
> -		break;
> +		user_set_log_base(ctx, &msg);
>  
> +		/* it needs a reply */
> +		msg.size = sizeof(msg.payload.u64);
> +		send_vhost_message(connfd, &msg);
> +		break;
>  	case VHOST_USER_SET_LOG_FD:
>  		close(msg.fds[0]);
>  		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
> diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.h b/lib/librte_vhost/vhost_user/vhost-net-user.h
> index 38637cc..6d252a3 100644
> --- a/lib/librte_vhost/vhost_user/vhost-net-user.h
> +++ b/lib/librte_vhost/vhost_user/vhost-net-user.h
> @@ -83,6 +83,11 @@ typedef struct VhostUserMemory {
>  	VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
>  } VhostUserMemory;
>  
> +typedef struct VhostUserLog {
> +	uint64_t mmap_size;
> +	uint64_t mmap_offset;
> +} VhostUserLog;
> +
>  typedef struct VhostUserMsg {
>  	VhostUserRequest request;
>  
> @@ -97,6 +102,7 @@ typedef struct VhostUserMsg {
>  		struct vhost_vring_state state;
>  		struct vhost_vring_addr addr;
>  		VhostUserMemory memory;
> +		VhostUserLog    log;
>  	} payload;
>  	int fds[VHOST_MEMORY_MAX_NREGIONS];
>  } __attribute((packed)) VhostUserMsg;
> diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
> index 2934d1c..b77c9b3 100644
> --- a/lib/librte_vhost/vhost_user/virtio-net-user.c
> +++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
> @@ -365,3 +365,51 @@ user_set_protocol_features(struct vhost_device_ctx ctx,
>  
>  	dev->protocol_features = protocol_features;
>  }
> +
> +int
> +user_set_log_base(struct vhost_device_ctx ctx,
> +		 struct VhostUserMsg *msg)
> +{
> +	struct virtio_net *dev;
> +	int fd = msg->fds[0];
> +	uint64_t size, off;
> +	void *addr;
> +
> +	dev = get_device(ctx);
> +	if (!dev)
> +		return -1;
> +
> +	if (fd < 0) {
> +		RTE_LOG(ERR, VHOST_CONFIG, "invalid log fd: %d\n", fd);
> +		return -1;
> +	}
> +
> +	if (msg->size != sizeof(VhostUserLog)) {
> +		RTE_LOG(ERR, VHOST_CONFIG,
> +			"invalid log base msg size: %"PRId32" != %d\n",
> +			msg->size, (int)sizeof(VhostUserLog));
> +		return -1;
> +	}
> +
> +	size = msg->payload.log.mmap_size;
> +	off  = msg->payload.log.mmap_offset;
> +	RTE_LOG(INFO, VHOST_CONFIG,
> +		"log mmap size: %"PRId64", offset: %"PRId64"\n",
> +		size, off);
> +
> +	/*
> +	 * mmap from 0 to workaround a hugepage mmap bug: mmap will be
> +	 * failed when offset is not page size aligned.
> +	 */
s /will be failed/will fail/
mmap will fail when offset is not zero.
Also we only know this workaround is for hugetlbfs. Not sure of other
tmpfs, so mention hugetlbfs here.
> +	addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> +	if (addr == MAP_FAILED) {
> +		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
> +		return -1;
> +	}
> +
> +	/* TODO: unmap on stop */
> +	dev->log_base = (uint64_t)(uintptr_t)addr + off;
(uint64_t)(uintptr_t)RTE_PTR_ADD(addr, off)?
> +	dev->log_size = size;
> +
> +	return 0;
> +}
> diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
> index b82108d..013cf38 100644
> --- a/lib/librte_vhost/vhost_user/virtio-net-user.h
> +++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
> @@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
>  
>  void user_set_protocol_features(struct vhost_device_ctx ctx,
>  				uint64_t protocol_features);
> +int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
>  
>  int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
>  


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-21 15:32     ` Xie, Huawei
@ 2015-12-22  2:25       ` Yuanhan Liu
  2015-12-22  2:41         ` Xie, Huawei
  0 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-22  2:25 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Mon, Dec 21, 2015 at 03:32:53PM +0000, Xie, Huawei wrote:
> > +
> > +	/*
> > +	 * mmap from 0 to workaround a hugepage mmap bug: mmap will be
> > +	 * failed when offset is not page size aligned.
> > +	 */
> s /will be failed/will fail/
> mmap will fail when offset is not zero.
> Also we only know this workaround is for hugetlbfs. Not sure of other
> tmpfs, so mention hugetlbfs here.

I have already mentioned "to workaround a __hugepage__ mmap bug"; it's
not enough?

> > +	addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> > +	if (addr == MAP_FAILED) {
> > +		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
> > +		return -1;
> > +	}
> > +
> > +	/* TODO: unmap on stop */
> > +	dev->log_base = (uint64_t)(uintptr_t)addr + off;
> (uint64_t)(uintptr_t)RTE_PTR_ADD(addr, off)?

No, addr is of (void *) type, we should cast it to uint64_t type first,
before adding it with "off".

	--yliu

> > +	dev->log_size = size;
> > +
> > +	return 0;
> > +}
> > diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
> > index b82108d..013cf38 100644
> > --- a/lib/librte_vhost/vhost_user/virtio-net-user.h
> > +++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
> > @@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
> >  
> >  void user_set_protocol_features(struct vhost_device_ctx ctx,
> >  				uint64_t protocol_features);
> > +int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
> >  
> >  int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
> >  
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-21 15:06     ` Xie, Huawei
@ 2015-12-22  2:40       ` Yuanhan Liu
  2015-12-22  2:45         ` Xie, Huawei
  0 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-22  2:40 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Mon, Dec 21, 2015 at 03:06:43PM +0000, Xie, Huawei wrote:
> On 12/17/2015 11:11 AM, Yuanhan Liu wrote:
> > Introduce vhost_log_write() helper function to log the dirty pages we
> > touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
> > log is presented by 1 bit.
> >
> > Therefore, vhost_log_write() simply finds the right bit for related
> > page we are gonna change, and set it to 1. dev->log_base denotes the
> > start of the dirty page bitmap.
> >
> > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> > Signed-off-by: Victor Kaplansky <victork@redhat.com
> > ---
> >  lib/librte_vhost/rte_virtio_net.h | 29 +++++++++++++++++++++++++++++
> >  1 file changed, 29 insertions(+)
> >
> > diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> > index 8acee02..5726683 100644
> > --- a/lib/librte_vhost/rte_virtio_net.h
> > +++ b/lib/librte_vhost/rte_virtio_net.h
> > @@ -40,6 +40,7 @@
> >   */
> >  
> >  #include <stdint.h>
> > +#include <linux/vhost.h>
> >  #include <linux/virtio_ring.h>
> >  #include <linux/virtio_net.h>
> >  #include <sys/eventfd.h>
> > @@ -59,6 +60,8 @@ struct rte_mbuf;
> >  /* Backend value set by guest. */
> >  #define VIRTIO_DEV_STOPPED -1
> >  
> > +#define VHOST_LOG_PAGE	4096
> > +
> >  
> >  /* Enum for virtqueue management. */
> >  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> > @@ -205,6 +208,32 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
> >  	return vhost_va;
> >  }
> >  
> > +static inline void __attribute__((always_inline))
> > +vhost_log_page(uint8_t *log_base, uint64_t page)
> > +{
> > +	log_base[page / 8] |= 1 << (page % 8);
> > +}
> > +
> Those logging functions are not supposed to be API. Could we move them
> into an internal header file?

Agreed. I should have put them into vhost_rxtx.c

> > +static inline void __attribute__((always_inline))
> > +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> > +{
> > +	uint64_t page;
> > +
> Before we log, we need memory barrier to make sure updates are in place.
> > +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > +		   !dev->log_base || !len))
> > +		return;

Put a memory barrier inside set_features()?

I see no var dependence here, why putting a barrier then? We are
accessing and modifying same var, doesn't the cache MESI protocol
will get rid of your concerns?

> > +
> > +	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
> > +		return;
> > +
> > +	page = addr / VHOST_LOG_PAGE;
> > +	while (page * VHOST_LOG_PAGE < addr + len) {
> Let us have a page_end var to make the code simpler?

Could do that.


> > +		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
> > +		page += VHOST_LOG_PAGE;
> page += 1?

Oops, right.

	--yliu

> > +	}
> > +}
> > +
> > +
> >  /**
> >   *  Disable features in feature_mask. Returns 0 on success.
> >   */
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-22  2:25       ` Yuanhan Liu
@ 2015-12-22  2:41         ` Xie, Huawei
  2015-12-22  2:55           ` Yuanhan Liu
  0 siblings, 1 reply; 136+ messages in thread
From: Xie, Huawei @ 2015-12-22  2:41 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky



> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Tuesday, December 22, 2015 10:26 AM
> To: Xie, Huawei
> Cc: dev@dpdk.org; Michael S. Tsirkin; Victor Kaplansky; Iremonger,
> Bernard; Pavel Fedin; Peter Xu
> Subject: Re: [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE
> request
> 
> On Mon, Dec 21, 2015 at 03:32:53PM +0000, Xie, Huawei wrote:
> > > +
> > > +	/*
> > > +	 * mmap from 0 to workaround a hugepage mmap bug: mmap will be
> > > +	 * failed when offset is not page size aligned.
> > > +	 */
> > s /will be failed/will fail/
> > mmap will fail when offset is not zero.
I mistake for 4KB page size. Please check if huge page size align is enough.
> > Also we only know this workaround is for hugetlbfs. Not sure of
> other
> > tmpfs, so mention hugetlbfs here.
> 
> I have already mentioned "to workaround a __hugepage__ mmap bug"; it's
> not enough?
Yes.
> 
> > > +	addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
> 0);
> > > +	if (addr == MAP_FAILED) {
> > > +		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
> > > +		return -1;
> > > +	}
> > > +
> > > +	/* TODO: unmap on stop */
> > > +	dev->log_base = (uint64_t)(uintptr_t)addr + off;
> > (uint64_t)(uintptr_t)RTE_PTR_ADD(addr, off)?
> 
> No, addr is of (void *) type, we should cast it to uint64_t type first,
> before adding it with "off".
> 
> 	--yliu
RTE_PTR_ADD is the DPDK interface for pointer arithmetic operation.
> 
> > > +	dev->log_size = size;
> > > +
> > > +	return 0;
> > > +}
> > > diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h
> b/lib/librte_vhost/vhost_user/virtio-net-user.h
> > > index b82108d..013cf38 100644
> > > --- a/lib/librte_vhost/vhost_user/virtio-net-user.h
> > > +++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
> > > @@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx,
> struct VhostUserMsg *);
> > >
> > >  void user_set_protocol_features(struct vhost_device_ctx ctx,
> > >  				uint64_t protocol_features);
> > > +int user_set_log_base(struct vhost_device_ctx ctx, struct
> VhostUserMsg *);
> > >
> > >  int user_get_vring_base(struct vhost_device_ctx, struct
> vhost_vring_state *);
> > >
> >

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-22  2:40       ` Yuanhan Liu
@ 2015-12-22  2:45         ` Xie, Huawei
  2015-12-22  3:04           ` Yuanhan Liu
  0 siblings, 1 reply; 136+ messages in thread
From: Xie, Huawei @ 2015-12-22  2:45 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On 12/22/2015 10:40 AM, Yuanhan Liu wrote:
> On Mon, Dec 21, 2015 at 03:06:43PM +0000, Xie, Huawei wrote:
>> On 12/17/2015 11:11 AM, Yuanhan Liu wrote:
>>> Introduce vhost_log_write() helper function to log the dirty pages we
>>> touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
>>> log is presented by 1 bit.
>>>
>>> Therefore, vhost_log_write() simply finds the right bit for related
>>> page we are gonna change, and set it to 1. dev->log_base denotes the
>>> start of the dirty page bitmap.
>>>
>>> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>>> Signed-off-by: Victor Kaplansky <victork@redhat.com
>>> ---
>>>  lib/librte_vhost/rte_virtio_net.h | 29 +++++++++++++++++++++++++++++
>>>  1 file changed, 29 insertions(+)
>>>
>>> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
>>> index 8acee02..5726683 100644
>>> --- a/lib/librte_vhost/rte_virtio_net.h
>>> +++ b/lib/librte_vhost/rte_virtio_net.h
>>> @@ -40,6 +40,7 @@
>>>   */
>>>  
>>>  #include <stdint.h>
>>> +#include <linux/vhost.h>
>>>  #include <linux/virtio_ring.h>
>>>  #include <linux/virtio_net.h>
>>>  #include <sys/eventfd.h>
>>> @@ -59,6 +60,8 @@ struct rte_mbuf;
>>>  /* Backend value set by guest. */
>>>  #define VIRTIO_DEV_STOPPED -1
>>>  
>>> +#define VHOST_LOG_PAGE	4096
>>> +
>>>  
>>>  /* Enum for virtqueue management. */
>>>  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
>>> @@ -205,6 +208,32 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
>>>  	return vhost_va;
>>>  }
>>>  
>>> +static inline void __attribute__((always_inline))
>>> +vhost_log_page(uint8_t *log_base, uint64_t page)
>>> +{
>>> +	log_base[page / 8] |= 1 << (page % 8);
>>> +}
>>> +
>> Those logging functions are not supposed to be API. Could we move them
>> into an internal header file?
> Agreed. I should have put them into vhost_rxtx.c
>
>>> +static inline void __attribute__((always_inline))
>>> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
>>> +{
>>> +	uint64_t page;
>>> +
>> Before we log, we need memory barrier to make sure updates are in place.
>>> +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
>>> +		   !dev->log_base || !len))
>>> +		return;
> Put a memory barrier inside set_features()?
>
> I see no var dependence here, why putting a barrier then? We are
> accessing and modifying same var, doesn't the cache MESI protocol
> will get rid of your concerns?
This fence isn't about feature var. It is to ensure that updates to the
guest buffer are committed before the logging.
For IA strong memory model, compiler barrier is enough. For other weak
memory model, fence is required.
>>> +
>>> +	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
>>> +		return;
>>> +
>>> +	page = addr / VHOST_LOG_PAGE;
>>> +	while (page * VHOST_LOG_PAGE < addr + len) {
>> Let us have a page_end var to make the code simpler?
> Could do that.
>
>
>>> +		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
>>> +		page += VHOST_LOG_PAGE;
>> page += 1?
> Oops, right.
>
> 	--yliu
>
>>> +	}
>>> +}
>>> +
>>> +
>>>  /**
>>>   *  Disable features in feature_mask. Returns 0 on success.
>>>   */


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-22  2:41         ` Xie, Huawei
@ 2015-12-22  2:55           ` Yuanhan Liu
  0 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-22  2:55 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Tue, Dec 22, 2015 at 02:41:43AM +0000, Xie, Huawei wrote:
> 
> 
> > -----Original Message-----
> > From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> > Sent: Tuesday, December 22, 2015 10:26 AM
> > To: Xie, Huawei
> > Cc: dev@dpdk.org; Michael S. Tsirkin; Victor Kaplansky; Iremonger,
> > Bernard; Pavel Fedin; Peter Xu
> > Subject: Re: [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE
> > request
> > 
> > On Mon, Dec 21, 2015 at 03:32:53PM +0000, Xie, Huawei wrote:
> > > > +
> > > > +	/*
> > > > +	 * mmap from 0 to workaround a hugepage mmap bug: mmap will be
> > > > +	 * failed when offset is not page size aligned.
> > > > +	 */
> > > s /will be failed/will fail/
> > > mmap will fail when offset is not zero.
> I mistake for 4KB page size.

Didn't follow you.

> Please check if huge page size align is enough.

It should be. However, I don't think we need bother to do that:
first of all, it happened on few specific old kernels. And, "off"
here is kind of guaranteed to be 0. Last, even it's not, mmaping
it from 0 will resolve that.

> > > Also we only know this workaround is for hugetlbfs. Not sure of
> > other
> > > tmpfs, so mention hugetlbfs here.
> > 
> > I have already mentioned "to workaround a __hugepage__ mmap bug"; it's
> > not enough?
> Yes.
> > 
> > > > +	addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
> > 0);
> > > > +	if (addr == MAP_FAILED) {
> > > > +		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
> > > > +		return -1;
> > > > +	}
> > > > +
> > > > +	/* TODO: unmap on stop */
> > > > +	dev->log_base = (uint64_t)(uintptr_t)addr + off;
> > > (uint64_t)(uintptr_t)RTE_PTR_ADD(addr, off)?
> > 
> > No, addr is of (void *) type, we should cast it to uint64_t type first,
> > before adding it with "off".
> > 
> > 	--yliu
> RTE_PTR_ADD is the DPDK interface for pointer arithmetic operation.

log_base is with "uint64_t" type, RTE_PTR_ADD() returns (void*), so it
won't work here.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-22  2:45         ` Xie, Huawei
@ 2015-12-22  3:04           ` Yuanhan Liu
  2015-12-22  7:02             ` Xie, Huawei
  0 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-22  3:04 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Tue, Dec 22, 2015 at 02:45:52AM +0000, Xie, Huawei wrote:
> >>> +static inline void __attribute__((always_inline))
> >>> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> >>> +{
> >>> +	uint64_t page;
> >>> +
> >> Before we log, we need memory barrier to make sure updates are in place.
> >>> +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> >>> +		   !dev->log_base || !len))
> >>> +		return;
> > Put a memory barrier inside set_features()?
> >
> > I see no var dependence here, why putting a barrier then? We are
> > accessing and modifying same var, doesn't the cache MESI protocol
> > will get rid of your concerns?
> This fence isn't about feature var. It is to ensure that updates to the
> guest buffer are committed before the logging.

Oh.., I was thinking you were talking about the "dev->features" field
concurrent access and modify you mentioned from V1.

> For IA strong memory model, compiler barrier is enough. For other weak
> memory model, fence is required.
> >>> +
> >>> +	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
> >>> +		return;

So that I should put a "rte_mb()" __here__?

	--yliu
> >>> +
> >>> +	page = addr / VHOST_LOG_PAGE;
> >>> +	while (page * VHOST_LOG_PAGE < addr + len) {
> >> Let us have a page_end var to make the code simpler?
> > Could do that.
> >
> >
> >>> +		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
> >>> +		page += VHOST_LOG_PAGE;
> >> page += 1?
> > Oops, right.
> >
> > 	--yliu
> >
> >>> +	}
> >>> +}
> >>> +
> >>> +
> >>>  /**
> >>>   *  Disable features in feature_mask. Returns 0 on success.
> >>>   */
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write Yuanhan Liu
  2015-12-21 15:06     ` Xie, Huawei
@ 2015-12-22  5:11     ` Peter Xu
  2015-12-22  6:09       ` Yuanhan Liu
  1 sibling, 1 reply; 136+ messages in thread
From: Peter Xu @ 2015-12-22  5:11 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Thu, Dec 17, 2015 at 11:11:57AM +0800, Yuanhan Liu wrote:
> +static inline void __attribute__((always_inline))
> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> +{
> +	uint64_t page;
> +
> +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> +		   !dev->log_base || !len))
> +		return;
> +
> +	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))

Should it be "<="?

Peter

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-22  5:11     ` Peter Xu
@ 2015-12-22  6:09       ` Yuanhan Liu
  0 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-22  6:09 UTC (permalink / raw)
  To: Peter Xu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Tue, Dec 22, 2015 at 01:11:02PM +0800, Peter Xu wrote:
> On Thu, Dec 17, 2015 at 11:11:57AM +0800, Yuanhan Liu wrote:
> > +static inline void __attribute__((always_inline))
> > +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> > +{
> > +	uint64_t page;
> > +
> > +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > +		   !dev->log_base || !len))
> > +		return;
> > +
> > +	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
> 
> Should it be "<="?

Right, thanks for catching it.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/6] vhost: log used vring changes
  2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 3/6] vhost: log used vring changes Yuanhan Liu
@ 2015-12-22  6:55     ` Peter Xu
  2015-12-22  7:07       ` Xie, Huawei
  2015-12-22  7:13       ` Yuanhan Liu
  0 siblings, 2 replies; 136+ messages in thread
From: Peter Xu @ 2015-12-22  6:55 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Thu, Dec 17, 2015 at 11:11:58AM +0800, Yuanhan Liu wrote:
> +static inline void __attribute__((always_inline))
> +vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
> +		     uint64_t offset, uint64_t len)
> +{

One thing optional: I feel it a little bit confusing regarding to
the helper function name here, for the reasons:

1. It more sounds like "logging all the vrings we used", however,
   what I understand is that, here we are logging dirty pages for
   guest memory. Or say, there is merely nothing to do directly with
   vring (although many vring ops might call this function, we are
   only passing [buf, len] pairs).

2. It may also let people think of "vring_used", which is part of
   virtio protocol. However, it does not mean it too.

I would suggest a better name without confusion, like
vhost_log_dirty_range() or anything else to avoid those keywords.

> +	uint64_t addr;
> +
> +	addr = vq->log_guest_addr + offset;
> +	vhost_log_write(dev, addr, len);

Optional too: since addr is only used once, would it cleaner using
one line? Like:

vhost_log_write(dev, vq->log_guest_addr + offset, len);

> +}
> +
>  /**
>   * This function adds buffers to the virtio devices RX virtqueue. Buffers can
>   * be received from the physical port or from another virtio device. A packet
> @@ -129,6 +139,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  		uint32_t offset = 0, vb_offset = 0;
>  		uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
>  		uint8_t hdr = 0, uncompleted_pkt = 0;
> +		uint16_t idx;
>  
>  		/* Get descriptor from available ring */
>  		desc = &vq->desc[head[packet_success]];
> @@ -200,16 +211,18 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  		}
>  
>  		/* Update used ring with desc information */
> -		vq->used->ring[res_cur_idx & (vq->size - 1)].id =
> -							head[packet_success];
> +		idx = res_cur_idx & (vq->size - 1);
> +		vq->used->ring[idx].id = head[packet_success];
>  
>  		/* Drop the packet if it is uncompleted */
>  		if (unlikely(uncompleted_pkt == 1))
> -			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> -							vq->vhost_hlen;
> +			vq->used->ring[idx].len = vq->vhost_hlen;
>  		else
> -			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> -							pkt_len + vq->vhost_hlen;
> +			vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
> +
> +		vhost_log_used_vring(dev, vq,
> +			offsetof(struct vring_used, ring[idx]),
> +			sizeof(vq->used->ring[idx]));

Got a question here:

I see that we are logging down changes when we are marking
used_vring. Do we need to log down buffer copy in rte_memcpy() too?
I am not sure whether I understand it correctly, it seems that this
is part of DPDK API ops to deliver data to the guest (from, e.g.,
OVS?), when we do rte_memcpy(), we seems to be modifying guest
memory too. Am I wrong?

Peter

>  
>  		res_cur_idx++;
>  		packet_success++;
> @@ -236,6 +249,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  
>  	*(volatile uint16_t *)&vq->used->idx += count;
>  	vq->last_used_idx = res_end_idx;
> +	vhost_log_used_vring(dev, vq,
> +		offsetof(struct vring_used, idx),
> +		sizeof(vq->used->idx));
>  
>  	/* flush used->idx update before we read avail->flags. */
>  	rte_mb();
> @@ -265,6 +281,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  	uint32_t seg_avail;
>  	uint32_t vb_avail;
>  	uint32_t cpy_len, entry_len;
> +	uint16_t idx;
>  
>  	if (pkt == NULL)
>  		return 0;
> @@ -302,16 +319,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  	entry_len = vq->vhost_hlen;
>  
>  	if (vb_avail == 0) {
> -		uint32_t desc_idx =
> -			vq->buf_vec[vec_idx].desc_idx;
> +		uint32_t desc_idx = vq->buf_vec[vec_idx].desc_idx;
> +
> +		if ((vq->desc[desc_idx].flags & VRING_DESC_F_NEXT) == 0) {
> +			idx = cur_idx & (vq->size - 1);
>  
> -		if ((vq->desc[desc_idx].flags
> -			& VRING_DESC_F_NEXT) == 0) {
>  			/* Update used ring with desc information */
> -			vq->used->ring[cur_idx & (vq->size - 1)].id
> -				= vq->buf_vec[vec_idx].desc_idx;
> -			vq->used->ring[cur_idx & (vq->size - 1)].len
> -				= entry_len;
> +			vq->used->ring[idx].id = vq->buf_vec[vec_idx].desc_idx;
> +			vq->used->ring[idx].len = entry_len;
> +
> +			vhost_log_used_vring(dev, vq,
> +					offsetof(struct vring_used, ring[idx]),
> +					sizeof(vq->used->ring[idx]));
>  
>  			entry_len = 0;
>  			cur_idx++;
> @@ -354,10 +373,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  			if ((vq->desc[vq->buf_vec[vec_idx].desc_idx].flags &
>  				VRING_DESC_F_NEXT) == 0) {
>  				/* Update used ring with desc information */
> -				vq->used->ring[cur_idx & (vq->size - 1)].id
> +				idx = cur_idx & (vq->size - 1);
> +				vq->used->ring[idx].id
>  					= vq->buf_vec[vec_idx].desc_idx;
> -				vq->used->ring[cur_idx & (vq->size - 1)].len
> -					= entry_len;
> +				vq->used->ring[idx].len = entry_len;
> +				vhost_log_used_vring(dev, vq,
> +					offsetof(struct vring_used, ring[idx]),
> +					sizeof(vq->used->ring[idx]));
>  				entry_len = 0;
>  				cur_idx++;
>  				entry_success++;
> @@ -390,16 +412,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  
>  					if ((vq->desc[desc_idx].flags &
>  						VRING_DESC_F_NEXT) == 0) {
> -						uint16_t wrapped_idx =
> -							cur_idx & (vq->size - 1);
> +						idx = cur_idx & (vq->size - 1);
>  						/*
>  						 * Update used ring with the
>  						 * descriptor information
>  						 */
> -						vq->used->ring[wrapped_idx].id
> +						vq->used->ring[idx].id
>  							= desc_idx;
> -						vq->used->ring[wrapped_idx].len
> +						vq->used->ring[idx].len
>  							= entry_len;
> +						vhost_log_used_vring(dev, vq,
> +							offsetof(struct vring_used, ring[idx]),
> +							sizeof(vq->used->ring[idx]));
>  						entry_success++;
>  						entry_len = 0;
>  						cur_idx++;
> @@ -422,10 +446,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  				 * This whole packet completes.
>  				 */
>  				/* Update used ring with desc information */
> -				vq->used->ring[cur_idx & (vq->size - 1)].id
> +				idx = cur_idx & (vq->size - 1);
> +				vq->used->ring[idx].id
>  					= vq->buf_vec[vec_idx].desc_idx;
> -				vq->used->ring[cur_idx & (vq->size - 1)].len
> -					= entry_len;
> +				vq->used->ring[idx].len = entry_len;
> +				vhost_log_used_vring(dev, vq,
> +					offsetof(struct vring_used, ring[idx]),
> +					sizeof(vq->used->ring[idx]));
>  				entry_success++;
>  				break;
>  			}
> @@ -653,6 +680,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  		/* Update used index buffer information. */
>  		vq->used->ring[used_idx].id = head[entry_success];
>  		vq->used->ring[used_idx].len = 0;
> +		vhost_log_used_vring(dev, vq,
> +				offsetof(struct vring_used, ring[used_idx]),
> +				sizeof(vq->used->ring[used_idx]));
>  
>  		/* Allocate an mbuf and populate the structure. */
>  		m = rte_pktmbuf_alloc(mbuf_pool);
> @@ -773,6 +803,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  
>  	rte_compiler_barrier();
>  	vq->used->idx += entry_success;
> +	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
> +			sizeof(vq->used->idx));
>  	/* Kick guest if required. */
>  	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
>  		eventfd_write(vq->callfd, (eventfd_t)1);
> diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> index de78a0f..03044f6 100644
> --- a/lib/librte_vhost/virtio-net.c
> +++ b/lib/librte_vhost/virtio-net.c
> @@ -666,12 +666,16 @@ set_vring_addr(struct vhost_device_ctx ctx, struct vhost_vring_addr *addr)
>  		return -1;
>  	}
>  
> +	vq->log_guest_addr = addr->log_guest_addr;
> +
>  	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address desc: %p\n",
>  			dev->device_fh, vq->desc);
>  	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address avail: %p\n",
>  			dev->device_fh, vq->avail);
>  	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address used: %p\n",
>  			dev->device_fh, vq->used);
> +	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") log_guest_addr: %"PRIx64"\n",
> +			dev->device_fh, vq->log_guest_addr);
>  
>  	return 0;
>  }
> -- 
> 1.9.0
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-22  3:04           ` Yuanhan Liu
@ 2015-12-22  7:02             ` Xie, Huawei
  0 siblings, 0 replies; 136+ messages in thread
From: Xie, Huawei @ 2015-12-22  7:02 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On 12/22/2015 11:03 AM, Yuanhan Liu wrote:
> On Tue, Dec 22, 2015 at 02:45:52AM +0000, Xie, Huawei wrote:
>>>>> +static inline void __attribute__((always_inline))
>>>>> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
>>>>> +{
>>>>> +	uint64_t page;
>>>>> +
>>>> Before we log, we need memory barrier to make sure updates are in place.
>>>>> +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
>>>>> +		   !dev->log_base || !len))
>>>>> +		return;
>>> Put a memory barrier inside set_features()?
>>>
>>> I see no var dependence here, why putting a barrier then? We are
>>> accessing and modifying same var, doesn't the cache MESI protocol
>>> will get rid of your concerns?
>> This fence isn't about feature var. It is to ensure that updates to the
>> guest buffer are committed before the logging.
> Oh.., I was thinking you were talking about the "dev->features" field
> concurrent access and modify you mentioned from V1.
>
>> For IA strong memory model, compiler barrier is enough. For other weak
>> memory model, fence is required.
>>>>> +
>>>>> +	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
>>>>> +		return;
> So that I should put a "rte_mb()" __here__?
>
> 	--yliu

I find that we already have the arch dependent version of rte_smp_wmb()
        --huawei
>>>>> +
>>>>> +	page = addr / VHOST_LOG_PAGE;
>>>>> +	while (page * VHOST_LOG_PAGE < addr + len) {
>>>> Let us have a page_end var to make the code simpler?
>>> Could do that.
>>>
>>>
>>>>> +		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
>>>>> +		page += VHOST_LOG_PAGE;
>>>> page += 1?
>>> Oops, right.
>>>
>>> 	--yliu
>>>
>>>>> +	}
>>>>> +}
>>>>> +
>>>>> +
>>>>>  /**
>>>>>   *  Disable features in feature_mask. Returns 0 on success.
>>>>>   */


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/6] vhost: log used vring changes
  2015-12-22  6:55     ` Peter Xu
@ 2015-12-22  7:07       ` Xie, Huawei
  2015-12-22  7:59         ` Peter Xu
  2015-12-22  7:13       ` Yuanhan Liu
  1 sibling, 1 reply; 136+ messages in thread
From: Xie, Huawei @ 2015-12-22  7:07 UTC (permalink / raw)
  To: Peter Xu, Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 12/22/2015 2:56 PM, Peter Xu wrote:
> On Thu, Dec 17, 2015 at 11:11:58AM +0800, Yuanhan Liu wrote:
>> +static inline void __attribute__((always_inline))
>> +vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
>> +		     uint64_t offset, uint64_t len)
>> +{
[...]
> Got a question here:
>
> I see that we are logging down changes when we are marking
> used_vring. Do we need to log down buffer copy in rte_memcpy() too?
> I am not sure whether I understand it correctly, it seems that this
> is part of DPDK API ops to deliver data to the guest (from, e.g.,
> OVS?), when we do rte_memcpy(), we seems to be modifying guest
> memory too. Am I wrong?
>
> Peter

desc buffer logging isn't included in v1, but in the patch 4 of this
patch set, and actually it is the major work in vhost live migration.
    --huawei

[...]


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/6] vhost: log used vring changes
  2015-12-22  6:55     ` Peter Xu
  2015-12-22  7:07       ` Xie, Huawei
@ 2015-12-22  7:13       ` Yuanhan Liu
  2015-12-22  8:01         ` Peter Xu
  1 sibling, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-22  7:13 UTC (permalink / raw)
  To: Peter Xu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Tue, Dec 22, 2015 at 02:55:52PM +0800, Peter Xu wrote:
> On Thu, Dec 17, 2015 at 11:11:58AM +0800, Yuanhan Liu wrote:
> > +static inline void __attribute__((always_inline))
> > +vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > +		     uint64_t offset, uint64_t len)
> > +{
> 
> One thing optional: I feel it a little bit confusing regarding to
> the helper function name here, for the reasons:
> 
> 1. It more sounds like "logging all the vrings we used", however,
>    what I understand is that, here we are logging dirty pages for
>    guest memory. Or say, there is merely nothing to do directly with
>    vring (although many vring ops might call this function, we are
>    only passing [buf, len] pairs).
> 
> 2. It may also let people think of "vring_used", which is part of
>    virtio protocol. However, it does not mean it too.

Yes, it does. Here log_guest_addr is set to physical address of
vring_used. (check code at vhost_virtqueue_set_addr() of qemu).

     627 static int vhost_virtqueue_set_addr(struct vhost_dev *dev,
     628                                     struct vhost_virtqueue *vq,
     629                                     unsigned idx, bool enable_log)
     630 {
     631     struct vhost_vring_addr addr = {
     632         .index = idx,
     633         .desc_user_addr = (uint64_t)(unsigned long)vq->desc,
     634         .avail_user_addr = (uint64_t)(unsigned long)vq->avail,
     635         .used_user_addr = (uint64_t)(unsigned long)vq->used,
==>  636         .log_guest_addr = vq->used_phys,
     637         .flags = enable_log ? (1 << VHOST_VRING_F_LOG) : 0,
     638     };

So, this function does log changes to used vring.

> 
> I would suggest a better name without confusion, like
> vhost_log_dirty_range() or anything else to avoid those keywords.
> 
> > +	uint64_t addr;
> > +
> > +	addr = vq->log_guest_addr + offset;
> > +	vhost_log_write(dev, addr, len);
> 
> Optional too: since addr is only used once, would it cleaner using
> one line? Like:

Yes, it is. Will fix it.

> 
> vhost_log_write(dev, vq->log_guest_addr + offset, len);
> 
> > +}
> > +
> >  /**
> >   * This function adds buffers to the virtio devices RX virtqueue. Buffers can
> >   * be received from the physical port or from another virtio device. A packet
> > @@ -129,6 +139,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
> >  		uint32_t offset = 0, vb_offset = 0;
> >  		uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
> >  		uint8_t hdr = 0, uncompleted_pkt = 0;
> > +		uint16_t idx;
> >  
> >  		/* Get descriptor from available ring */
> >  		desc = &vq->desc[head[packet_success]];
> > @@ -200,16 +211,18 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
> >  		}
> >  
> >  		/* Update used ring with desc information */
> > -		vq->used->ring[res_cur_idx & (vq->size - 1)].id =
> > -							head[packet_success];
> > +		idx = res_cur_idx & (vq->size - 1);
> > +		vq->used->ring[idx].id = head[packet_success];
> >  
> >  		/* Drop the packet if it is uncompleted */
> >  		if (unlikely(uncompleted_pkt == 1))
> > -			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> > -							vq->vhost_hlen;
> > +			vq->used->ring[idx].len = vq->vhost_hlen;
> >  		else
> > -			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> > -							pkt_len + vq->vhost_hlen;
> > +			vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
> > +
> > +		vhost_log_used_vring(dev, vq,
> > +			offsetof(struct vring_used, ring[idx]),
> > +			sizeof(vq->used->ring[idx]));
> 
> Got a question here:
> 
> I see that we are logging down changes when we are marking
> used_vring. Do we need to log down buffer copy in rte_memcpy() too?
> I am not sure whether I understand it correctly, it seems that this
> is part of DPDK API ops to deliver data to the guest (from, e.g.,
> OVS?), when we do rte_memcpy(), we seems to be modifying guest
> memory too. Am I wrong?

It's done in the next patch.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/6] vhost: log used vring changes
  2015-12-22  7:07       ` Xie, Huawei
@ 2015-12-22  7:59         ` Peter Xu
  0 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2015-12-22  7:59 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Tue, Dec 22, 2015 at 07:07:25AM +0000, Xie, Huawei wrote:
> On 12/22/2015 2:56 PM, Peter Xu wrote:
> > Got a question here:
> >
> > I see that we are logging down changes when we are marking
> > used_vring. Do we need to log down buffer copy in rte_memcpy() too?
> > I am not sure whether I understand it correctly, it seems that this
> > is part of DPDK API ops to deliver data to the guest (from, e.g.,
> > OVS?), when we do rte_memcpy(), we seems to be modifying guest
> > memory too. Am I wrong?
> >
> > Peter
> 
> desc buffer logging isn't included in v1, but in the patch 4 of this
> patch set, and actually it is the major work in vhost live migration.

Yes, it is. Thanks to point out.

Peter

>     --huawei
> 
> [...]
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/6] vhost: log used vring changes
  2015-12-22  7:13       ` Yuanhan Liu
@ 2015-12-22  8:01         ` Peter Xu
  0 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2015-12-22  8:01 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Tue, Dec 22, 2015 at 03:13:49PM +0800, Yuanhan Liu wrote:
> On Tue, Dec 22, 2015 at 02:55:52PM +0800, Peter Xu wrote:
> > On Thu, Dec 17, 2015 at 11:11:58AM +0800, Yuanhan Liu wrote:
> > > +static inline void __attribute__((always_inline))
> > > +vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > > +		     uint64_t offset, uint64_t len)
> > > +{
> > 
> > One thing optional: I feel it a little bit confusing regarding to
> > the helper function name here, for the reasons:
> > 
> > 1. It more sounds like "logging all the vrings we used", however,
> >    what I understand is that, here we are logging dirty pages for
> >    guest memory. Or say, there is merely nothing to do directly with
> >    vring (although many vring ops might call this function, we are
> >    only passing [buf, len] pairs).
> > 
> > 2. It may also let people think of "vring_used", which is part of
> >    virtio protocol. However, it does not mean it too.
> 
> Yes, it does. Here log_guest_addr is set to physical address of
> vring_used. (check code at vhost_virtqueue_set_addr() of qemu).
> 
>      627 static int vhost_virtqueue_set_addr(struct vhost_dev *dev,
>      628                                     struct vhost_virtqueue *vq,
>      629                                     unsigned idx, bool enable_log)
>      630 {
>      631     struct vhost_vring_addr addr = {
>      632         .index = idx,
>      633         .desc_user_addr = (uint64_t)(unsigned long)vq->desc,
>      634         .avail_user_addr = (uint64_t)(unsigned long)vq->avail,
>      635         .used_user_addr = (uint64_t)(unsigned long)vq->used,
> ==>  636         .log_guest_addr = vq->used_phys,
>      637         .flags = enable_log ? (1 << VHOST_VRING_F_LOG) : 0,
>      638     };
> 
> So, this function does log changes to used vring.

Yes. I have made a mistake.

Thanks!
Peter

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature
  2015-12-17  3:12   ` [dpdk-dev] [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
@ 2015-12-22  8:11     ` Peter Xu
  2015-12-22  8:21       ` Yuanhan Liu
  2015-12-22  8:24       ` Pavel Fedin
  0 siblings, 2 replies; 136+ messages in thread
From: Peter Xu @ 2015-12-22  8:11 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Thu, Dec 17, 2015 at 11:12:00AM +0800, Yuanhan Liu wrote:
> It's actually a feature already enabled in Linux kernel. What we need to
> do is simply to claim that we support such feature, and nothing else.
> 
> With that, the guest will send GARP messages after live migration.
> 
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  lib/librte_vhost/virtio-net.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> index 03044f6..0ba5045 100644
> --- a/lib/librte_vhost/virtio-net.c
> +++ b/lib/librte_vhost/virtio-net.c
> @@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
>  #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
>  				(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
>  				(1ULL << VIRTIO_NET_F_CTRL_RX) | \
> +				(1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \

Do we really need this? I can understand when guest declare with
this VIRTIO_NET_F_GUEST_ANNOUNCE flag. With that, guest itself will
handle the announcement after migration. However, how could I
understand if it's declared by a vhost-user backend? What does it
mean?

In vhost-user.txt (in QEMU repo docs/specs/), the only place that
mentioned this is SEND_RARP:

 * VHOST_USER_SEND_RARP

      Id: 19
      Equivalent ioctl: N/A
      Master payload: u64

      Ask vhost user backend to broadcast a fake RARP to notify the migration
      is terminated for guest that does not support GUEST_ANNOUNCE.
	  ...

Here, it mention the GUEST_ANNOUNCE since when guest support this,
we do not need to send SEND_RARP to vhost-user backend again. It
never explain what does it mean when vhost-user declaring to have
this flag...

Thanks.
Peter

>  				(VHOST_SUPPORTS_MQ)            | \
>  				(1ULL << VIRTIO_F_VERSION_1)   | \
>  				(1ULL << VHOST_F_LOG_ALL)      | \
> -- 
> 1.9.0
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature
  2015-12-22  8:11     ` Peter Xu
@ 2015-12-22  8:21       ` Yuanhan Liu
  2015-12-22  8:24       ` Pavel Fedin
  1 sibling, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-22  8:21 UTC (permalink / raw)
  To: Peter Xu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Tue, Dec 22, 2015 at 04:11:08PM +0800, Peter Xu wrote:
> On Thu, Dec 17, 2015 at 11:12:00AM +0800, Yuanhan Liu wrote:
> > It's actually a feature already enabled in Linux kernel. What we need to
> > do is simply to claim that we support such feature, and nothing else.
> > 
> > With that, the guest will send GARP messages after live migration.
> > 
> > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> > ---
> >  lib/librte_vhost/virtio-net.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> > index 03044f6..0ba5045 100644
> > --- a/lib/librte_vhost/virtio-net.c
> > +++ b/lib/librte_vhost/virtio-net.c
> > @@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
> >  #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
> >  				(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
> >  				(1ULL << VIRTIO_NET_F_CTRL_RX) | \
> > +				(1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
> 
> Do we really need this? I can understand when guest declare with
> this VIRTIO_NET_F_GUEST_ANNOUNCE flag. With that, guest itself will
> handle the announcement after migration. However, how could I
> understand if it's declared by a vhost-user backend? What does it
> mean?
> 
> In vhost-user.txt (in QEMU repo docs/specs/), the only place that
> mentioned this is SEND_RARP:
> 
>  * VHOST_USER_SEND_RARP
> 
>       Id: 19
>       Equivalent ioctl: N/A
>       Master payload: u64
> 
>       Ask vhost user backend to broadcast a fake RARP to notify the migration
>       is terminated for guest that does not support GUEST_ANNOUNCE.
> 	  ...
> 
> Here, it mention the GUEST_ANNOUNCE since when guest support this,
> we do not need to send SEND_RARP to vhost-user backend again. It
> never explain what does it mean when vhost-user declaring to have
> this flag...

The actually work is done in QEMU and guest kernel. QEMU sends a config
interrupt to guest kernel when such flag is enabled after live migration
(see QEMU code virtio_net_load()). Guest kernel generates the GARP
once it receives the interrupt (see the kernel code virtnet_config_changed_work()).

Therefore, we could simply claim that we support VIRTIO_NET_F_GUEST_ANNOUNCE
feature and do nothing here.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature
  2015-12-22  8:11     ` Peter Xu
  2015-12-22  8:21       ` Yuanhan Liu
@ 2015-12-22  8:24       ` Pavel Fedin
  1 sibling, 0 replies; 136+ messages in thread
From: Pavel Fedin @ 2015-12-22  8:24 UTC (permalink / raw)
  To: 'Peter Xu', 'Yuanhan Liu'
  Cc: dev, 'Victor Kaplansky', 'Michael S. Tsirkin'

 Hello!

> > diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> > index 03044f6..0ba5045 100644
> > --- a/lib/librte_vhost/virtio-net.c
> > +++ b/lib/librte_vhost/virtio-net.c
> > @@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
> >  #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
> >  				(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
> >  				(1ULL << VIRTIO_NET_F_CTRL_RX) | \
> > +				(1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
> 
> Do we really need this? I can understand when guest declare with
> this VIRTIO_NET_F_GUEST_ANNOUNCE flag. With that, guest itself will
> handle the announcement after migration. However, how could I
> understand if it's declared by a vhost-user backend?

 I guess the documentation is unclear. This is due to way how qemu works. It queries vhost-user backend for the features, then offers them to the guest. The guest then responds with features FROM THE SUGGESTED SET, which it supports. So, if the backend does not claim to support this feature, qemu will not offer it to the guest, therefore the guest will not try to activate it.
 I think this is done because this feature is only useful for migration. If vhost-user backend does not support migration, it needs neither VHOST_USER_SEND_RARP nor guest-side announce.
 Actually, i was thinking about patching qemu once, but... The changeset seemed too complicated, and i imagined the situation described in the above sentence, so decided to abandon it.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v3 0/8] vhost-user live migration support
  2015-12-17  3:11 ` [dpdk-dev] [PATCH v2 0/6] " Yuanhan Liu
                     ` (7 preceding siblings ...)
  2015-12-21  8:17   ` Pavel Fedin
@ 2016-01-29  4:57   ` Yuanhan Liu
  2016-01-29  4:57     ` [dpdk-dev] [PATCH v3 1/8] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
                       ` (9 more replies)
  8 siblings, 10 replies; 136+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:57 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

This patch set adds the vhost-user live migration support.

The major task behind that is to log pages we touched during
live migration, including used vring and desc buffer. So, this
patch set is basically about adding vhost log support, and
using it.

Another important thing is that you need notify the switches
about the VM location change after migration is done. GUEST_ANNOUNCE
feature is for that, which sends an GARP message after migration.
For older kernel (<= v3.4) without GUEST_ANNOUNCE support,
we construct and broadcast a RARP message, with the mac address
from VHOST_USER_SEND_RARP payload.

Patchset
========
- Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
  the dirty memory bitmap is.
    
- Patch 2 introduces a vhost_log_write() helper function to log
  pages we are gonna change.

- Patch 3 logs changes we made to used vring.

- Patch 4 logs changes we made to vring desc buffer.

- Patch 5 and 7 add some feature bits related to live migration.

- patch 6 does the RARP construction and broadcast job.


A simple test guide (on same host)
==================================

The following test is based on OVS + DPDK (check [0] for
how to setup OVS + DPDK):

    [0]: http://wiki.qemu.org/Features/vhost-user-ovs-dpdk

Here is the rough test guide:

1. start ovs-vswitchd

2. Add two ovs vhost-user port, say vhost0 and vhost1

3. Start a VM1 to connect to vhost0. Here is my example:

   $ $QEMU -enable-kvm -m 1024 -smp 4 \
       -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
       -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
       -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
       -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
       -numa node,memdev=mem -mem-prealloc \
       -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
       -hda fc-19-i386.img \
       -monitor telnet::3333,server,nowait -curses

4. run "ping $host" inside VM1

5. Start VM2 to connect to vhost0, and marking it as the target
   of live migration (by adding -incoming tcp:0:4444 option)

   $ $QEMU -enable-kvm -m 1024 -smp 4 \
       -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
       -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
       -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
       -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
       -numa node,memdev=mem -mem-prealloc \
       -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
       -hda fc-19-i386.img \
       -monitor telnet::3334,server,nowait -curses \
       -incoming tcp:0:4444 

6. connect to VM1 monitor, and start migration:

   > migrate tcp:0:4444

7. After a while, you will find that VM1 has been migrated to VM2,
   and the "ping" command continues running, perfectly.




---
Yuanhan Liu (8):
  vhost: handle VHOST_USER_SET_LOG_BASE request
  vhost: introduce vhost_log_write
  vhost: log used vring changes
  vhost: log vring desc buffer changes
  vhost: claim that we support GUEST_ANNOUNCE feature
  vhost: handle VHOST_USER_SEND_RARP request
  vhost: enable log_shmfd protocol feature
  vhost: remove duplicate header include

 doc/guides/rel_notes/release_2_3.rst          |   2 +
 lib/librte_vhost/rte_virtio_net.h             |   9 +-
 lib/librte_vhost/vhost_rxtx.c                 | 114 +++++++++++++----
 lib/librte_vhost/vhost_user/vhost-net-user.c  |  11 +-
 lib/librte_vhost/vhost_user/vhost-net-user.h  |   7 ++
 lib/librte_vhost/vhost_user/virtio-net-user.c | 174 +++++++++++++++++++++++++-
 lib/librte_vhost/vhost_user/virtio-net-user.h |   8 +-
 lib/librte_vhost/virtio-net.c                 |   5 +
 8 files changed, 299 insertions(+), 31 deletions(-)

-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v3 1/8] vhost: handle VHOST_USER_SET_LOG_BASE request
  2016-01-29  4:57   ` [dpdk-dev] [PATCH v3 0/8] " Yuanhan Liu
@ 2016-01-29  4:57     ` Yuanhan Liu
  2016-01-29  4:57     ` [dpdk-dev] [PATCH v3 2/8] vhost: introduce vhost_log_write Yuanhan Liu
                       ` (8 subsequent siblings)
  9 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:57 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
vhost-user) where we should log dirty pages, and how big the log
buffer is.

This request introduces a new payload:

    typedef struct VhostUserLog {
            uint64_t mmap_size;
            uint64_t mmap_offset;
    } VhostUserLog;

Also, a fd is delivered from QEMU by ancillary data.

With those info given, an area of memory is mmaped, assigned
to dev->log_base, for logging dirty pages.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
---
 lib/librte_vhost/rte_virtio_net.h             |  4 ++-
 lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++--
 lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++++
 lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++++++++++++++
 lib/librte_vhost/vhost_user/virtio-net-user.h |  1 +
 5 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 10dcb90..8acee02 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -129,7 +129,9 @@ struct virtio_net {
 	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
 	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
 	void			*priv;		/**< private context */
-	uint64_t		reserved[64];	/**< Reserve some spaces for future extension. */
+	uint64_t		log_size;	/**< Size of log area */
+	uint64_t		log_base;	/**< Where dirty pages are logged */
+	uint64_t		reserved[62];	/**< Reserve some spaces for future extension. */
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
 } __rte_cache_aligned;
 
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 8b7a448..32ad6f6 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -388,9 +388,12 @@ vserver_message_handler(int connfd, void *dat, int *remove)
 		break;
 
 	case VHOST_USER_SET_LOG_BASE:
-		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
-		break;
+		user_set_log_base(ctx, &msg);
 
+		/* it needs a reply */
+		msg.size = sizeof(msg.payload.u64);
+		send_vhost_message(connfd, &msg);
+		break;
 	case VHOST_USER_SET_LOG_FD:
 		close(msg.fds[0]);
 		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.h b/lib/librte_vhost/vhost_user/vhost-net-user.h
index 38637cc..6d252a3 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.h
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.h
@@ -83,6 +83,11 @@ typedef struct VhostUserMemory {
 	VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
 } VhostUserMemory;
 
+typedef struct VhostUserLog {
+	uint64_t mmap_size;
+	uint64_t mmap_offset;
+} VhostUserLog;
+
 typedef struct VhostUserMsg {
 	VhostUserRequest request;
 
@@ -97,6 +102,7 @@ typedef struct VhostUserMsg {
 		struct vhost_vring_state state;
 		struct vhost_vring_addr addr;
 		VhostUserMemory memory;
+		VhostUserLog    log;
 	} payload;
 	int fds[VHOST_MEMORY_MAX_NREGIONS];
 } __attribute((packed)) VhostUserMsg;
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
index 2934d1c..0f3b163 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -365,3 +365,51 @@ user_set_protocol_features(struct vhost_device_ctx ctx,
 
 	dev->protocol_features = protocol_features;
 }
+
+int
+user_set_log_base(struct vhost_device_ctx ctx,
+		 struct VhostUserMsg *msg)
+{
+	struct virtio_net *dev;
+	int fd = msg->fds[0];
+	uint64_t size, off;
+	void *addr;
+
+	dev = get_device(ctx);
+	if (!dev)
+		return -1;
+
+	if (fd < 0) {
+		RTE_LOG(ERR, VHOST_CONFIG, "invalid log fd: %d\n", fd);
+		return -1;
+	}
+
+	if (msg->size != sizeof(VhostUserLog)) {
+		RTE_LOG(ERR, VHOST_CONFIG,
+			"invalid log base msg size: %"PRId32" != %d\n",
+			msg->size, (int)sizeof(VhostUserLog));
+		return -1;
+	}
+
+	size = msg->payload.log.mmap_size;
+	off  = msg->payload.log.mmap_offset;
+	RTE_LOG(INFO, VHOST_CONFIG,
+		"log mmap size: %"PRId64", offset: %"PRId64"\n",
+		size, off);
+
+	/*
+	 * mmap from 0 to workaround a hugepage mmap bug: mmap will
+	 * fail when offset is not page size aligned.
+	 */
+	addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (addr == MAP_FAILED) {
+		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
+		return -1;
+	}
+
+	/* TODO: unmap on stop */
+	dev->log_base = (uint64_t)(uintptr_t)addr + off;
+	dev->log_size = size;
+
+	return 0;
+}
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index b82108d..013cf38 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
 
 void user_set_protocol_features(struct vhost_device_ctx ctx,
 				uint64_t protocol_features);
+int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
 
 int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
 
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v3 2/8] vhost: introduce vhost_log_write
  2016-01-29  4:57   ` [dpdk-dev] [PATCH v3 0/8] " Yuanhan Liu
  2016-01-29  4:57     ` [dpdk-dev] [PATCH v3 1/8] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
@ 2016-01-29  4:57     ` Yuanhan Liu
  2016-02-19 14:26       ` Thomas Monjalon
  2016-01-29  4:57     ` [dpdk-dev] [PATCH v3 3/8] vhost: log used vring changes Yuanhan Liu
                       ` (7 subsequent siblings)
  9 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:57 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

Introduce vhost_log_write() helper function to log the dirty pages we
touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
log is presented by 1 bit.

Therefore, vhost_log_write() simply finds the right bit for related
page we are gonna change, and set it to 1. dev->log_base denotes the
start of the dirty page bitmap.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
---

v3: - move the two functions inside vhost_rxtx.c

    - add a memory barrier before logging, to make sure guest memory
      updates are committed before logging.

    - Fix an off-by-one bug
---
 lib/librte_vhost/rte_virtio_net.h |  2 ++
 lib/librte_vhost/vhost_rxtx.c     | 29 +++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 8acee02..2c891ae 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -40,6 +40,7 @@
  */
 
 #include <stdint.h>
+#include <linux/vhost.h>
 #include <linux/virtio_ring.h>
 #include <linux/virtio_net.h>
 #include <sys/eventfd.h>
@@ -205,6 +206,7 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
 	return vhost_va;
 }
 
+
 /**
  *  Disable features in feature_mask. Returns 0 on success.
  */
diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index bbf3fac..d485364 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -42,6 +42,35 @@
 #include "vhost-net.h"
 
 #define MAX_PKT_BURST 32
+#define VHOST_LOG_PAGE	4096
+
+static inline void __attribute__((always_inline))
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+	log_base[page / 8] |= 1 << (page % 8);
+}
+
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
+{
+	uint64_t page;
+
+	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+		   !dev->log_base || !len))
+		return;
+
+	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+		return;
+
+	/* To make sure guest memory updates are committed before logging */
+	rte_smp_wmb();
+
+	page = addr / VHOST_LOG_PAGE;
+	while (page * VHOST_LOG_PAGE < addr + len) {
+		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+		page += 1;
+	}
+}
 
 static bool
 is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t qp_nb)
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v3 3/8] vhost: log used vring changes
  2016-01-29  4:57   ` [dpdk-dev] [PATCH v3 0/8] " Yuanhan Liu
  2016-01-29  4:57     ` [dpdk-dev] [PATCH v3 1/8] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
  2016-01-29  4:57     ` [dpdk-dev] [PATCH v3 2/8] vhost: introduce vhost_log_write Yuanhan Liu
@ 2016-01-29  4:57     ` Yuanhan Liu
  2016-01-29  4:57     ` [dpdk-dev] [PATCH v3 4/8] vhost: log vring desc buffer changes Yuanhan Liu
                       ` (6 subsequent siblings)
  9 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:57 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

Every time we update virtio used ring, we need to log it. And it's
been done by a new vhost_log_write() wrapper, vhost_log_used_vring().

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
---

v3: remove the unnecessary var "addr" iat vhost_log_used_vring()
---
 lib/librte_vhost/rte_virtio_net.h |  3 +-
 lib/librte_vhost/vhost_rxtx.c     | 77 +++++++++++++++++++++++++++------------
 lib/librte_vhost/virtio-net.c     |  4 ++
 3 files changed, 59 insertions(+), 25 deletions(-)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 2c891ae..4a2303a 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -91,7 +91,8 @@ struct vhost_virtqueue {
 	int			callfd;			/**< Used to notify the guest (trigger interrupt). */
 	int			kickfd;			/**< Currently unused as polling mode is enabled. */
 	int			enabled;
-	uint64_t		reserved[16];		/**< Reserve some spaces for future extension. */
+	uint64_t		log_guest_addr;		/**< Physical address of used ring, for logging */
+	uint64_t		reserved[15];		/**< Reserve some spaces for future extension. */
 	struct buf_vector	buf_vec[BUF_VECTOR_MAX];	/**< for scatter RX. */
 } __rte_cache_aligned;
 
diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index d485364..9f7b1f8 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -72,6 +72,13 @@ vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
 	}
 }
 
+static inline void __attribute__((always_inline))
+vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		     uint64_t offset, uint64_t len)
+{
+	vhost_log_write(dev, vq->log_guest_addr + offset, len);
+}
+
 static bool
 is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t qp_nb)
 {
@@ -158,6 +165,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 		uint32_t offset = 0, vb_offset = 0;
 		uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
 		uint8_t hdr = 0, uncompleted_pkt = 0;
+		uint16_t idx;
 
 		/* Get descriptor from available ring */
 		desc = &vq->desc[head[packet_success]];
@@ -229,16 +237,18 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 		}
 
 		/* Update used ring with desc information */
-		vq->used->ring[res_cur_idx & (vq->size - 1)].id =
-							head[packet_success];
+		idx = res_cur_idx & (vq->size - 1);
+		vq->used->ring[idx].id = head[packet_success];
 
 		/* Drop the packet if it is uncompleted */
 		if (unlikely(uncompleted_pkt == 1))
-			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
-							vq->vhost_hlen;
+			vq->used->ring[idx].len = vq->vhost_hlen;
 		else
-			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
-							pkt_len + vq->vhost_hlen;
+			vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
+
+		vhost_log_used_vring(dev, vq,
+			offsetof(struct vring_used, ring[idx]),
+			sizeof(vq->used->ring[idx]));
 
 		res_cur_idx++;
 		packet_success++;
@@ -265,6 +275,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 	*(volatile uint16_t *)&vq->used->idx += count;
 	vq->last_used_idx = res_end_idx;
+	vhost_log_used_vring(dev, vq,
+		offsetof(struct vring_used, idx),
+		sizeof(vq->used->idx));
 
 	/* flush used->idx update before we read avail->flags. */
 	rte_mb();
@@ -294,6 +307,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 	uint32_t seg_avail;
 	uint32_t vb_avail;
 	uint32_t cpy_len, entry_len;
+	uint16_t idx;
 
 	if (pkt == NULL)
 		return 0;
@@ -331,16 +345,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 	entry_len = vq->vhost_hlen;
 
 	if (vb_avail == 0) {
-		uint32_t desc_idx =
-			vq->buf_vec[vec_idx].desc_idx;
+		uint32_t desc_idx = vq->buf_vec[vec_idx].desc_idx;
+
+		if ((vq->desc[desc_idx].flags & VRING_DESC_F_NEXT) == 0) {
+			idx = cur_idx & (vq->size - 1);
 
-		if ((vq->desc[desc_idx].flags
-			& VRING_DESC_F_NEXT) == 0) {
 			/* Update used ring with desc information */
-			vq->used->ring[cur_idx & (vq->size - 1)].id
-				= vq->buf_vec[vec_idx].desc_idx;
-			vq->used->ring[cur_idx & (vq->size - 1)].len
-				= entry_len;
+			vq->used->ring[idx].id = vq->buf_vec[vec_idx].desc_idx;
+			vq->used->ring[idx].len = entry_len;
+
+			vhost_log_used_vring(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 
 			entry_len = 0;
 			cur_idx++;
@@ -383,10 +399,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 			if ((vq->desc[vq->buf_vec[vec_idx].desc_idx].flags &
 				VRING_DESC_F_NEXT) == 0) {
 				/* Update used ring with desc information */
-				vq->used->ring[cur_idx & (vq->size - 1)].id
+				idx = cur_idx & (vq->size - 1);
+				vq->used->ring[idx].id
 					= vq->buf_vec[vec_idx].desc_idx;
-				vq->used->ring[cur_idx & (vq->size - 1)].len
-					= entry_len;
+				vq->used->ring[idx].len = entry_len;
+				vhost_log_used_vring(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 				entry_len = 0;
 				cur_idx++;
 				entry_success++;
@@ -419,16 +438,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 
 					if ((vq->desc[desc_idx].flags &
 						VRING_DESC_F_NEXT) == 0) {
-						uint16_t wrapped_idx =
-							cur_idx & (vq->size - 1);
+						idx = cur_idx & (vq->size - 1);
 						/*
 						 * Update used ring with the
 						 * descriptor information
 						 */
-						vq->used->ring[wrapped_idx].id
+						vq->used->ring[idx].id
 							= desc_idx;
-						vq->used->ring[wrapped_idx].len
+						vq->used->ring[idx].len
 							= entry_len;
+						vhost_log_used_vring(dev, vq,
+							offsetof(struct vring_used, ring[idx]),
+							sizeof(vq->used->ring[idx]));
 						entry_success++;
 						entry_len = 0;
 						cur_idx++;
@@ -451,10 +472,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 				 * This whole packet completes.
 				 */
 				/* Update used ring with desc information */
-				vq->used->ring[cur_idx & (vq->size - 1)].id
+				idx = cur_idx & (vq->size - 1);
+				vq->used->ring[idx].id
 					= vq->buf_vec[vec_idx].desc_idx;
-				vq->used->ring[cur_idx & (vq->size - 1)].len
-					= entry_len;
+				vq->used->ring[idx].len = entry_len;
+				vhost_log_used_vring(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 				entry_success++;
 				break;
 			}
@@ -682,6 +706,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		/* Update used index buffer information. */
 		vq->used->ring[used_idx].id = head[entry_success];
 		vq->used->ring[used_idx].len = 0;
+		vhost_log_used_vring(dev, vq,
+				offsetof(struct vring_used, ring[used_idx]),
+				sizeof(vq->used->ring[used_idx]));
 
 		/* Allocate an mbuf and populate the structure. */
 		m = rte_pktmbuf_alloc(mbuf_pool);
@@ -802,6 +829,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 
 	rte_compiler_barrier();
 	vq->used->idx += entry_success;
+	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
+			sizeof(vq->used->idx));
 	/* Kick guest if required. */
 	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
 		eventfd_write(vq->callfd, (eventfd_t)1);
diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index de78a0f..03044f6 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -666,12 +666,16 @@ set_vring_addr(struct vhost_device_ctx ctx, struct vhost_vring_addr *addr)
 		return -1;
 	}
 
+	vq->log_guest_addr = addr->log_guest_addr;
+
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address desc: %p\n",
 			dev->device_fh, vq->desc);
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address avail: %p\n",
 			dev->device_fh, vq->avail);
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address used: %p\n",
 			dev->device_fh, vq->used);
+	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") log_guest_addr: %"PRIx64"\n",
+			dev->device_fh, vq->log_guest_addr);
 
 	return 0;
 }
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v3 4/8] vhost: log vring desc buffer changes
  2016-01-29  4:57   ` [dpdk-dev] [PATCH v3 0/8] " Yuanhan Liu
                       ` (2 preceding siblings ...)
  2016-01-29  4:57     ` [dpdk-dev] [PATCH v3 3/8] vhost: log used vring changes Yuanhan Liu
@ 2016-01-29  4:57     ` Yuanhan Liu
  2016-01-29  4:58     ` [dpdk-dev] [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
                       ` (5 subsequent siblings)
  9 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:57 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

Every time we copy a buf to vring desc, we need to log it.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
---
 lib/librte_vhost/vhost_rxtx.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index 9f7b1f8..9c6cc00 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -97,7 +97,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 	struct rte_mbuf **pkts, uint32_t count)
 {
 	struct vhost_virtqueue *vq;
-	struct vring_desc *desc;
+	struct vring_desc *desc, *hdr_desc;
 	struct rte_mbuf *buff;
 	/* The virtio_hdr is initialised to 0. */
 	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
@@ -179,6 +179,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 		/* Copy virtio_hdr to packet and increment buffer address */
 		buff_hdr_addr = buff_addr;
+		hdr_desc = desc;
 
 		/*
 		 * If the descriptors are chained the header and data are
@@ -203,6 +204,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 			rte_memcpy((void *)(uintptr_t)(buff_addr + vb_offset),
 				rte_pktmbuf_mtod_offset(buff, const void *, offset),
 				len_to_cpy);
+			vhost_log_write(dev, desc->addr + vb_offset, len_to_cpy);
 			PRINT_PACKET(dev, (uintptr_t)(buff_addr + vb_offset),
 				len_to_cpy, 0);
 
@@ -258,6 +260,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 		rte_memcpy((void *)(uintptr_t)buff_hdr_addr,
 			(const void *)&virtio_hdr, vq->vhost_hlen);
+		vhost_log_write(dev, hdr_desc->addr, vq->vhost_hlen);
 
 		PRINT_PACKET(dev, (uintptr_t)buff_hdr_addr, vq->vhost_hlen, 1);
 
@@ -335,6 +338,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 
 	rte_memcpy((void *)(uintptr_t)vb_hdr_addr,
 		(const void *)&virtio_hdr, vq->vhost_hlen);
+	vhost_log_write(dev, vq->buf_vec[vec_idx].buf_addr, vq->vhost_hlen);
 
 	PRINT_PACKET(dev, (uintptr_t)vb_hdr_addr, vq->vhost_hlen, 1);
 
@@ -379,6 +383,8 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 		rte_memcpy((void *)(uintptr_t)(vb_addr + vb_offset),
 			rte_pktmbuf_mtod_offset(pkt, const void *, seg_offset),
 			cpy_len);
+		vhost_log_write(dev, vq->buf_vec[vec_idx].buf_addr + vb_offset,
+			cpy_len);
 
 		PRINT_PACKET(dev,
 			(uintptr_t)(vb_addr + vb_offset),
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature
  2016-01-29  4:57   ` [dpdk-dev] [PATCH v3 0/8] " Yuanhan Liu
                       ` (3 preceding siblings ...)
  2016-01-29  4:57     ` [dpdk-dev] [PATCH v3 4/8] vhost: log vring desc buffer changes Yuanhan Liu
@ 2016-01-29  4:58     ` Yuanhan Liu
  2016-03-11 12:39       ` Olivier MATZ
  2016-01-29  4:58     ` [dpdk-dev] [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request Yuanhan Liu
                       ` (4 subsequent siblings)
  9 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:58 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

It's actually a feature already enabled in Linux kernel (since v3.5).
What we need to do is simply to claim that we support such feature,
and nothing else.

With that, the guest will send an ARP message after live migration
to notify the switches about the new location of migrated VM.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
---
 lib/librte_vhost/virtio-net.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index 03044f6..0ba5045 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
 #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
 				(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
 				(1ULL << VIRTIO_NET_F_CTRL_RX) | \
+				(1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
 				(VHOST_SUPPORTS_MQ)            | \
 				(1ULL << VIRTIO_F_VERSION_1)   | \
 				(1ULL << VHOST_F_LOG_ALL)      | \
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request
  2016-01-29  4:57   ` [dpdk-dev] [PATCH v3 0/8] " Yuanhan Liu
                       ` (4 preceding siblings ...)
  2016-01-29  4:58     ` [dpdk-dev] [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
@ 2016-01-29  4:58     ` Yuanhan Liu
  2016-02-19  6:11       ` Tan, Jianfeng
  2016-01-29  4:58     ` [dpdk-dev] [PATCH v3 7/8] vhost: enable log_shmfd protocol feature Yuanhan Liu
                       ` (3 subsequent siblings)
  9 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:58 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

While in former patch we enabled GUEST_ANNOUNCE feature, so that the
guest OS will broadcast a GARP message after migration to notify the
switch about the new location of migrated VM, the thing is that
GUEST_ANNOUNCE is enabled since kernel v3.5 only. For older kernel,
VHOST_USER_SEND_RARP request comes to rescue.

The payload of this new request is the mac address of the migrated VM,
with that, we could construct a RARP message, and then broadcast it
to host interfaces.

That's how this patch works:

- list all interfaces, with the help of SIOCGIFCONF ioctl command

- construct an RARP message and broadcast it

Cc: Thibaut Collet <thibaut.collet@6wind.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---

  Note that this patch did take effect in my test:

  - it indeed updated target vswitch's mac learning table. (with
    the "ovs fdb/show bridge" command)

  - the ping request packets after migration were indeeded flowed
    to the target (but not the source) host's vswitch. (with tcpdump
    command)

  However, I still saw ping lost. I asked help from Thibaut, the
  original author of the VHOST_USER_SEND_RARP request, he suggested
  that it might be an issue of my network topo, or ovs settings, which
  is likely, regarding to what I observed above.

  Anyway, I'd like to send this out, hopefully someone knows what's
  wrong there if there any.  In the meantime, I will do more debugs.
---
 lib/librte_vhost/vhost_user/vhost-net-user.c  |   4 +
 lib/librte_vhost/vhost_user/vhost-net-user.h  |   1 +
 lib/librte_vhost/vhost_user/virtio-net-user.c | 125 ++++++++++++++++++++++++++
 lib/librte_vhost/vhost_user/virtio-net-user.h |   5 +-
 4 files changed, 134 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 32ad6f6..cb18396 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -100,6 +100,7 @@ static const char *vhost_message_str[VHOST_USER_MAX] = {
 	[VHOST_USER_SET_PROTOCOL_FEATURES]  = "VHOST_USER_SET_PROTOCOL_FEATURES",
 	[VHOST_USER_GET_QUEUE_NUM]  = "VHOST_USER_GET_QUEUE_NUM",
 	[VHOST_USER_SET_VRING_ENABLE]  = "VHOST_USER_SET_VRING_ENABLE",
+	[VHOST_USER_SEND_RARP]  = "VHOST_USER_SEND_RARP",
 };
 
 /**
@@ -437,6 +438,9 @@ vserver_message_handler(int connfd, void *dat, int *remove)
 	case VHOST_USER_SET_VRING_ENABLE:
 		user_set_vring_enable(ctx, &msg.payload.state);
 		break;
+	case VHOST_USER_SEND_RARP:
+		user_send_rarp(&msg);
+		break;
 
 	default:
 		break;
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.h b/lib/librte_vhost/vhost_user/vhost-net-user.h
index 6d252a3..e3bb413 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.h
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.h
@@ -67,6 +67,7 @@ typedef enum VhostUserRequest {
 	VHOST_USER_SET_PROTOCOL_FEATURES = 16,
 	VHOST_USER_GET_QUEUE_NUM = 17,
 	VHOST_USER_SET_VRING_ENABLE = 18,
+	VHOST_USER_SEND_RARP = 19,
 	VHOST_USER_MAX
 } VhostUserRequest;
 
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
index 0f3b163..cda330d 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -34,11 +34,18 @@
 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include <string.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <net/ethernet.h>
+#include <netinet/in.h>
+#include <netinet/if_ether.h>
+#include <linux/if_packet.h>
 
 #include <rte_common.h>
 #include <rte_log.h>
@@ -413,3 +420,121 @@ user_set_log_base(struct vhost_device_ctx ctx,
 
 	return 0;
 }
+
+#define RARP_BUF_SIZE	64
+
+static void
+make_rarp_packet(uint8_t *buf, uint8_t *mac)
+{
+	struct ether_header *eth_hdr;
+	struct ether_arp *rarp;
+
+	/* Ethernet header. */
+	eth_hdr = (struct ether_header *)buf;
+	memset(&eth_hdr->ether_dhost, 0xff, ETH_ALEN);
+	memcpy(&eth_hdr->ether_shost, mac,  ETH_ALEN);
+	eth_hdr->ether_type = htons(ETH_P_RARP);
+
+	/* RARP header. */
+	rarp = (struct ether_arp *)(eth_hdr + 1);
+	rarp->ea_hdr.ar_hrd = htons(ARPHRD_ETHER);
+	rarp->ea_hdr.ar_pro = htons(ETHERTYPE_IP);
+	rarp->ea_hdr.ar_hln = ETH_ALEN;
+	rarp->ea_hdr.ar_pln = 4;
+	rarp->ea_hdr.ar_op  = htons(ARPOP_RREQUEST);
+
+	memcpy(&rarp->arp_sha, mac, ETH_ALEN);
+	memset(&rarp->arp_spa, 0x00, 4);
+	memcpy(&rarp->arp_tha, mac, 6);
+	memset(&rarp->arp_tpa, 0x00, 4);
+}
+
+
+static void
+send_rarp(const char *ifname, uint8_t *rarp)
+{
+	int fd;
+	struct ifreq ifr;
+	struct sockaddr_ll addr;
+
+	fd = socket(AF_PACKET, SOCK_RAW, 0);
+	if (fd < 0) {
+		perror("socket failed");
+		return;
+	}
+
+	memset(&ifr, 0, sizeof(struct ifreq));
+	strncpy(ifr.ifr_name, ifname, IFNAMSIZ);
+	if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0) {
+		perror("failed to get interface index");
+		close(fd);
+		return;
+	}
+
+	addr.sll_ifindex = ifr.ifr_ifindex;
+	addr.sll_halen   = ETH_ALEN;
+
+	if (sendto(fd, rarp, RARP_BUF_SIZE, 0,
+		   (const struct sockaddr*)&addr, sizeof(addr)) < 0) {
+		perror("send rarp packet failed");
+	}
+}
+
+
+/*
+ * Broadcast a RARP message to all interfaces, to update
+ * switch's mac table
+ */
+int
+user_send_rarp(struct VhostUserMsg *msg)
+{
+	uint8_t *mac = (uint8_t *)&msg->payload.u64;
+	uint8_t rarp[RARP_BUF_SIZE];
+	struct ifconf ifc = {0, };
+	struct ifreq *ifr;
+	int nr = 16;
+	int fd;
+	uint32_t i;
+
+	RTE_LOG(DEBUG, VHOST_CONFIG,
+		":: mac: %02x:%02x:%02x:%02x:%02x:%02x\n",
+		mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
+
+	make_rarp_packet(rarp, mac);
+
+	/*
+	 * Get all interfaces
+	 */
+	fd = socket(AF_INET, SOCK_DGRAM, 0);
+	if (fd < 0) {
+		perror("failed to create AF_INET socket");
+		return -1;
+	}
+
+again:
+	ifc.ifc_len = sizeof(*ifr) * nr;
+	ifc.ifc_buf = realloc(ifc.ifc_buf, ifc.ifc_len);
+
+	if (ioctl(fd, SIOCGIFCONF, &ifc) < 0) {
+		perror("failed at SIOCGIFCONF");
+		close(fd);
+		return -1;
+	}
+
+	if (ifc.ifc_len == (int)sizeof(struct ifreq) * nr) {
+		/*
+		 * current ifc_buf is not big enough to hold
+		 * all interfaces; double it and try again.
+		 */
+		nr *= 2;
+		goto again;
+	}
+
+	ifr = (struct ifreq *)ifc.ifc_buf;
+	for (i = 0; i < ifc.ifc_len / sizeof(struct ifreq); i++)
+		send_rarp(ifr[i].ifr_name, rarp);
+
+	close(fd);
+
+	return 0;
+}
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index 013cf38..1e9ff9a 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -38,8 +38,10 @@
 #include "vhost-net-user.h"
 
 #define VHOST_USER_PROTOCOL_F_MQ	0
+#define VHOST_USER_PROTOCOL_F_RARP	2
 
-#define VHOST_USER_PROTOCOL_FEATURES	(1ULL << VHOST_USER_PROTOCOL_F_MQ)
+#define VHOST_USER_PROTOCOL_FEATURES	((1ULL << VHOST_USER_PROTOCOL_F_MQ) | \
+					 (1ULL << VHOST_USER_PROTOCOL_F_RARP))
 
 int user_set_mem_table(struct vhost_device_ctx, struct VhostUserMsg *);
 
@@ -50,6 +52,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
 void user_set_protocol_features(struct vhost_device_ctx ctx,
 				uint64_t protocol_features);
 int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
+int user_send_rarp(struct VhostUserMsg *);
 
 int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
 
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v3 7/8] vhost: enable log_shmfd protocol feature
  2016-01-29  4:57   ` [dpdk-dev] [PATCH v3 0/8] " Yuanhan Liu
                       ` (5 preceding siblings ...)
  2016-01-29  4:58     ` [dpdk-dev] [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request Yuanhan Liu
@ 2016-01-29  4:58     ` Yuanhan Liu
  2016-01-29  4:58     ` [dpdk-dev] [PATCH v3 8/8] vhost: remove duplicate header include Yuanhan Liu
                       ` (2 subsequent siblings)
  9 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:58 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

To claim that we support vhost-user live migration support:
SET_LOG_BASE request will be send only when this feature flag
is set.

Besides this flag, we actually need another feature flag set
to make vhost-user live migration work: VHOST_F_LOG_ALL.
Which, however, has been enabled long time ago.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
---
 doc/guides/rel_notes/release_2_3.rst          | 2 ++
 lib/librte_vhost/vhost_user/virtio-net-user.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/doc/guides/rel_notes/release_2_3.rst b/doc/guides/rel_notes/release_2_3.rst
index 99de186..f2c9e41 100644
--- a/doc/guides/rel_notes/release_2_3.rst
+++ b/doc/guides/rel_notes/release_2_3.rst
@@ -4,6 +4,8 @@ DPDK Release 2.3
 New Features
 ------------
 
+* **Added vhost-user live migration support.**
+
 
 Resolved Issues
 ---------------
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index 1e9ff9a..28213f3 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -38,9 +38,11 @@
 #include "vhost-net-user.h"
 
 #define VHOST_USER_PROTOCOL_F_MQ	0
+#define VHOST_USER_PROTOCOL_F_LOG_SHMFD	1
 #define VHOST_USER_PROTOCOL_F_RARP	2
 
 #define VHOST_USER_PROTOCOL_FEATURES	((1ULL << VHOST_USER_PROTOCOL_F_MQ) | \
+					 (1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD) |\
 					 (1ULL << VHOST_USER_PROTOCOL_F_RARP))
 
 int user_set_mem_table(struct vhost_device_ctx, struct VhostUserMsg *);
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH v3 8/8] vhost: remove duplicate header include
  2016-01-29  4:57   ` [dpdk-dev] [PATCH v3 0/8] " Yuanhan Liu
                       ` (6 preceding siblings ...)
  2016-01-29  4:58     ` [dpdk-dev] [PATCH v3 7/8] vhost: enable log_shmfd protocol feature Yuanhan Liu
@ 2016-01-29  4:58     ` Yuanhan Liu
  2016-02-01 15:54     ` [dpdk-dev] [PATCH v3 0/8] vhost-user live migration support Iremonger, Bernard
  2016-02-19 15:01     ` Thomas Monjalon
  9 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:58 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

unistd.h has been included twice; remove one.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost_user/virtio-net-user.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
index cda330d..4270c98 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -39,7 +39,6 @@
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
-#include <unistd.h>
 #include <sys/ioctl.h>
 #include <sys/socket.h>
 #include <net/ethernet.h>
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/8] vhost-user live migration support
  2016-01-29  4:57   ` [dpdk-dev] [PATCH v3 0/8] " Yuanhan Liu
                       ` (7 preceding siblings ...)
  2016-01-29  4:58     ` [dpdk-dev] [PATCH v3 8/8] vhost: remove duplicate header include Yuanhan Liu
@ 2016-02-01 15:54     ` Iremonger, Bernard
  2016-02-02  1:53       ` Yuanhan Liu
  2016-02-19 15:01     ` Thomas Monjalon
  9 siblings, 1 reply; 136+ messages in thread
From: Iremonger, Bernard @ 2016-02-01 15:54 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

Hi Yuanhan,

<snip>

> A simple test guide (on same host)
> ==================================
> 
> The following test is based on OVS + DPDK (check [0] for how to setup OVS +
> DPDK):
> 
>     [0]: http://wiki.qemu.org/Features/vhost-user-ovs-dpdk
> 
> Here is the rough test guide:
> 
> 1. start ovs-vswitchd
> 
> 2. Add two ovs vhost-user port, say vhost0 and vhost1
> 
> 3. Start a VM1 to connect to vhost0. Here is my example:
> 
>    $ $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-
> path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3333,server,nowait -curses
> 
> 4. run "ping $host" inside VM1
> 
> 5. Start VM2 to connect to vhost0, and marking it as the target
>    of live migration (by adding -incoming tcp:0:4444 option)
> 
>    $ $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-
> path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3334,server,nowait -curses \
>        -incoming tcp:0:4444
> 
> 6. connect to VM1 monitor, and start migration:
> 
>    > migrate tcp:0:4444
> 
> 7. After a while, you will find that VM1 has been migrated to VM2,
>    and the "ping" command continues running, perfectly.
> 

The above test guide should probably be added to the DPDK doc files.
It could be added to the sample app guide or the programmers guide.
There is already a Vhost Library section in the programmers guide and
a Vhost Sample Application section in the sample app guide.
The Vhost Sample Application section might be the best place to add it.

It would be useful to add some documentation on the Vhost Logging feature Itself. 

Regards,

Bernard.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/8] vhost-user live migration support
  2016-02-01 15:54     ` [dpdk-dev] [PATCH v3 0/8] vhost-user live migration support Iremonger, Bernard
@ 2016-02-02  1:53       ` Yuanhan Liu
  0 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2016-02-02  1:53 UTC (permalink / raw)
  To: Iremonger, Bernard; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Mon, Feb 01, 2016 at 03:54:05PM +0000, Iremonger, Bernard wrote:
> Hi Yuanhan,
> 
> <snip>
> 
> > A simple test guide (on same host)
> > ==================================
> > 
> > The following test is based on OVS + DPDK (check [0] for how to setup OVS +
> > DPDK):
...
> 
> The above test guide should probably be added to the DPDK doc files.
> It could be added to the sample app guide or the programmers guide.
> There is already a Vhost Library section in the programmers guide and
> a Vhost Sample Application section in the sample app guide.

Hi Iremonger,

You had same ask in my last version, and I was aware of that while
preparing this version. The reason I didn't include it is the same
as I replied to you last time: it's a very rough test guide; the
formal test guide would be with 2 hosts, and also you have to establish
a connection to the VM before migration and make sure it still works
right after the migration.

So, I will consider to add such formal test plan after validation is
done from validation team.

> The Vhost Sample Application section might be the best place to add it.

I don't think that's the right place: we can't do the live migration
test with the vhost-switch example: it includes some hardware features,
such as VLAN. Those info would be lost after migration.

And that's the reason I choose OVS.

> 
> It would be useful to add some documentation on the Vhost Logging feature Itself. 

Good suggestion; I will think what I can do for it.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request
  2016-01-29  4:58     ` [dpdk-dev] [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request Yuanhan Liu
@ 2016-02-19  6:11       ` Tan, Jianfeng
  2016-02-19  7:03         ` Yuanhan Liu
  0 siblings, 1 reply; 136+ messages in thread
From: Tan, Jianfeng @ 2016-02-19  6:11 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

Hi Yuanhan,

On 1/29/2016 12:58 PM, Yuanhan Liu wrote:
> While in former patch we enabled GUEST_ANNOUNCE feature, so that the
> guest OS will broadcast a GARP message after migration to notify the
> switch about the new location of migrated VM, the thing is that
> GUEST_ANNOUNCE is enabled since kernel v3.5 only. For older kernel,
> VHOST_USER_SEND_RARP request comes to rescue.
>
> The payload of this new request is the mac address of the migrated VM,
> with that, we could construct a RARP message, and then broadcast it
> to host interfaces.
>
> That's how this patch works:
>
> - list all interfaces, with the help of SIOCGIFCONF ioctl command
>
> - construct an RARP message and broadcast it
>
> Cc: Thibaut Collet <thibaut.collet@6wind.com>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
...
> +
> +/*
> + * Broadcast a RARP message to all interfaces, to update
> + * switch's mac table
> + */
> +int
> +user_send_rarp(struct VhostUserMsg *msg)
> +{
> +	uint8_t *mac = (uint8_t *)&msg->payload.u64;
> +	uint8_t rarp[RARP_BUF_SIZE];
> +	struct ifconf ifc = {0, };
> +	struct ifreq *ifr;
> +	int nr = 16;
> +	int fd;
> +	uint32_t i;
> +
> +	RTE_LOG(DEBUG, VHOST_CONFIG,
> +		":: mac: %02x:%02x:%02x:%02x:%02x:%02x\n",
> +		mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
> +
> +	make_rarp_packet(rarp, mac);
> +
> +	/*
> +	 * Get all interfaces
> +	 */
> +	fd = socket(AF_INET, SOCK_DGRAM, 0);
> +	if (fd < 0) {
> +		perror("failed to create AF_INET socket");
> +		return -1;
> +	}
> +
> +again:
> +	ifc.ifc_len = sizeof(*ifr) * nr;
> +	ifc.ifc_buf = realloc(ifc.ifc_buf, ifc.ifc_len);
> +
> +	if (ioctl(fd, SIOCGIFCONF, &ifc) < 0) {
> +		perror("failed at SIOCGIFCONF");
> +		close(fd);
> +		return -1;
> +	}
> +
> +	if (ifc.ifc_len == (int)sizeof(struct ifreq) * nr) {
> +		/*
> +		 * current ifc_buf is not big enough to hold
> +		 * all interfaces; double it and try again.
> +		 */
> +		nr *= 2;
> +		goto again;
> +	}
> +
> +	ifr = (struct ifreq *)ifc.ifc_buf;
> +	for (i = 0; i < ifc.ifc_len / sizeof(struct ifreq); i++)
> +		send_rarp(ifr[i].ifr_name, rarp);
> +
> +	close(fd);
> +
> +	return 0;
> +}

 From how you implement user_send_rarp(), if I understand it correctly, 
it broadcasts this ARP packets to all host interfaces, which I don't 
think it's appropriate. This ARP packets should be sent to it's own L2 
networking. You should not make the hypothesis that all interfaces 
maintained in the kernel are in the same L2 networking. Even worse, this 
could bring problems when used in overlay networking, in which two VM in 
two different overlay networking, can have same MAC address.

What I suggest here is to move user_send_rarp() to 
rte_vhost_dequeue_burst() using a flag to control, so that this arp 
packet can be broadcasted in its own L2 network.

Thanks,
Jianfeng

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request
  2016-02-19  6:11       ` Tan, Jianfeng
@ 2016-02-19  7:03         ` Yuanhan Liu
  2016-02-19  8:55           ` Yuanhan Liu
  2016-02-22 14:36           ` [dpdk-dev] [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array Yuanhan Liu
  0 siblings, 2 replies; 136+ messages in thread
From: Yuanhan Liu @ 2016-02-19  7:03 UTC (permalink / raw)
  To: Tan, Jianfeng; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Fri, Feb 19, 2016 at 02:11:36PM +0800, Tan, Jianfeng wrote:
> Hi Yuanhan,
> 
> On 1/29/2016 12:58 PM, Yuanhan Liu wrote:
> >While in former patch we enabled GUEST_ANNOUNCE feature, so that the
> >guest OS will broadcast a GARP message after migration to notify the
> >switch about the new location of migrated VM, the thing is that
> >GUEST_ANNOUNCE is enabled since kernel v3.5 only. For older kernel,
> >VHOST_USER_SEND_RARP request comes to rescue.
> >
> >The payload of this new request is the mac address of the migrated VM,
> >with that, we could construct a RARP message, and then broadcast it
> >to host interfaces.
> >
> >That's how this patch works:
> >
> >- list all interfaces, with the help of SIOCGIFCONF ioctl command
> >
> >- construct an RARP message and broadcast it
> >
> >Cc: Thibaut Collet <thibaut.collet@6wind.com>
> >Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >---
> ...
> >+
> >+/*
> >+ * Broadcast a RARP message to all interfaces, to update
> >+ * switch's mac table
> >+ */
> >+int
> >+user_send_rarp(struct VhostUserMsg *msg)
> >+{
> >+	uint8_t *mac = (uint8_t *)&msg->payload.u64;
> >+	uint8_t rarp[RARP_BUF_SIZE];
> >+	struct ifconf ifc = {0, };
> >+	struct ifreq *ifr;
> >+	int nr = 16;
> >+	int fd;
> >+	uint32_t i;
> >+
> >+	RTE_LOG(DEBUG, VHOST_CONFIG,
> >+		":: mac: %02x:%02x:%02x:%02x:%02x:%02x\n",
> >+		mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
> >+
> >+	make_rarp_packet(rarp, mac);
> >+
> >+	/*
> >+	 * Get all interfaces
> >+	 */
> >+	fd = socket(AF_INET, SOCK_DGRAM, 0);
> >+	if (fd < 0) {
> >+		perror("failed to create AF_INET socket");
> >+		return -1;
> >+	}
> >+
> >+again:
> >+	ifc.ifc_len = sizeof(*ifr) * nr;
> >+	ifc.ifc_buf = realloc(ifc.ifc_buf, ifc.ifc_len);
> >+
> >+	if (ioctl(fd, SIOCGIFCONF, &ifc) < 0) {
> >+		perror("failed at SIOCGIFCONF");
> >+		close(fd);
> >+		return -1;
> >+	}
> >+
> >+	if (ifc.ifc_len == (int)sizeof(struct ifreq) * nr) {
> >+		/*
> >+		 * current ifc_buf is not big enough to hold
> >+		 * all interfaces; double it and try again.
> >+		 */
> >+		nr *= 2;
> >+		goto again;
> >+	}
> >+
> >+	ifr = (struct ifreq *)ifc.ifc_buf;
> >+	for (i = 0; i < ifc.ifc_len / sizeof(struct ifreq); i++)
> >+		send_rarp(ifr[i].ifr_name, rarp);
> >+
> >+	close(fd);
> >+
> >+	return 0;
> >+}
> 
> From how you implement user_send_rarp(), if I understand it correctly, it
> broadcasts this ARP packets to all host interfaces, which I don't think it's
> appropriate. This ARP packets should be sent to it's own L2 networking. You
> should not make the hypothesis that all interfaces maintained in the kernel
> are in the same L2 networking. Even worse, this could bring problems when
> used in overlay networking, in which two VM in two different overlay
> networking, can have same MAC address.
> 
> What I suggest here is to move user_send_rarp() to rte_vhost_dequeue_burst()
> using a flag to control, so that this arp packet can be broadcasted in its
> own L2 network.

I have thought of that, too. It was given up because SEND_RARP request was
handled in different thread from rte_vhost_dequeue_burst(), leading to the
fact that the RARP packet will not be broadcasted immediately after migration
is done: it will be broadcasted only when rte_vhost_dequeue_burst() is invoked.

I was thinking the delay might be a problem. While thinking it twice, it
doesn't look like one then. As GUEST_ANNOUNCE is also broadcasted by
rte_vhost_dequeue_burst(); it's enqueued by guest kernel though. And
judging that we are polling mode driver, it won't be an issue then.

So, thanks. I will give it a quick try; it should work.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request
  2016-02-19  7:03         ` Yuanhan Liu
@ 2016-02-19  8:55           ` Yuanhan Liu
  2016-02-22 14:36           ` [dpdk-dev] [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array Yuanhan Liu
  1 sibling, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2016-02-19  8:55 UTC (permalink / raw)
  To: Tan, Jianfeng; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Fri, Feb 19, 2016 at 03:03:26PM +0800, Yuanhan Liu wrote:
> On Fri, Feb 19, 2016 at 02:11:36PM +0800, Tan, Jianfeng wrote:
> > What I suggest here is to move user_send_rarp() to rte_vhost_dequeue_burst()
> > using a flag to control, so that this arp packet can be broadcasted in its
> > own L2 network.
> 
> I have thought of that, too. It was given up because SEND_RARP request was
> handled in different thread from rte_vhost_dequeue_burst(), leading to the
> fact that the RARP packet will not be broadcasted immediately after migration
> is done: it will be broadcasted only when rte_vhost_dequeue_burst() is invoked.
> 
> I was thinking the delay might be a problem. While thinking it twice, it
> doesn't look like one then. As GUEST_ANNOUNCE is also broadcasted by
> rte_vhost_dequeue_burst(); it's enqueued by guest kernel though. And
> judging that we are polling mode driver, it won't be an issue then.
> 
> So, thanks. I will give it a quick try; it should work.

It worked like a charm :) Thanks.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/8] vhost: introduce vhost_log_write
  2016-01-29  4:57     ` [dpdk-dev] [PATCH v3 2/8] vhost: introduce vhost_log_write Yuanhan Liu
@ 2016-02-19 14:26       ` Thomas Monjalon
  2016-02-22  6:59         ` Yuanhan Liu
  0 siblings, 1 reply; 136+ messages in thread
From: Thomas Monjalon @ 2016-02-19 14:26 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2016-01-29 12:57, Yuanhan Liu:
> Introduce vhost_log_write() helper function to log the dirty pages we
> touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
> log is presented by 1 bit.
> 
> Therefore, vhost_log_write() simply finds the right bit for related
> page we are gonna change, and set it to 1. dev->log_base denotes the
> start of the dirty page bitmap.
> 
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Signed-off-by: Victor Kaplansky <victork@redhat.com>
> Tested-by: Pavel Fedin <p.fedin@samsung.com>
[...]
> +static inline void __attribute__((always_inline))
> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)

lib/librte_vhost/vhost_rxtx.c:59:1: error: unused function 'vhost_log_write'

I think it's better to squash with the next patch.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/8] vhost-user live migration support
  2016-01-29  4:57   ` [dpdk-dev] [PATCH v3 0/8] " Yuanhan Liu
                       ` (8 preceding siblings ...)
  2016-02-01 15:54     ` [dpdk-dev] [PATCH v3 0/8] vhost-user live migration support Iremonger, Bernard
@ 2016-02-19 15:01     ` Thomas Monjalon
  2016-02-22  7:08       ` Yuanhan Liu
  9 siblings, 1 reply; 136+ messages in thread
From: Thomas Monjalon @ 2016-02-19 15:01 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2016-01-29 12:57, Yuanhan Liu:
> This patch set adds the vhost-user live migration support.
> 
> The major task behind that is to log pages we touched during
> live migration, including used vring and desc buffer. So, this
> patch set is basically about adding vhost log support, and
> using it.
> 
> Another important thing is that you need notify the switches
> about the VM location change after migration is done. GUEST_ANNOUNCE
> feature is for that, which sends an GARP message after migration.
> For older kernel (<= v3.4) without GUEST_ANNOUNCE support,
> we construct and broadcast a RARP message, with the mac address
> from VHOST_USER_SEND_RARP payload.
> 
> Patchset
> ========
> - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
>   the dirty memory bitmap is.
>     
> - Patch 2 introduces a vhost_log_write() helper function to log
>   pages we are gonna change.
> 
> - Patch 3 logs changes we made to used vring.
> 
> - Patch 4 logs changes we made to vring desc buffer.
> 
> - Patch 5 and 7 add some feature bits related to live migration.
> 
> - patch 6 does the RARP construction and broadcast job.

Patches 2 and 3 have been merged to avoid a compilation error.
Applied, thanks

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/8] vhost: introduce vhost_log_write
  2016-02-19 14:26       ` Thomas Monjalon
@ 2016-02-22  6:59         ` Yuanhan Liu
  0 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2016-02-22  6:59 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Fri, Feb 19, 2016 at 03:26:36PM +0100, Thomas Monjalon wrote:
> 2016-01-29 12:57, Yuanhan Liu:
> > Introduce vhost_log_write() helper function to log the dirty pages we
> > touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
> > log is presented by 1 bit.
> > 
> > Therefore, vhost_log_write() simply finds the right bit for related
> > page we are gonna change, and set it to 1. dev->log_base denotes the
> > start of the dirty page bitmap.
> > 
> > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> > Signed-off-by: Victor Kaplansky <victork@redhat.com>
> > Tested-by: Pavel Fedin <p.fedin@samsung.com>
> [...]
> > +static inline void __attribute__((always_inline))
> > +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> 
> lib/librte_vhost/vhost_rxtx.c:59:1: error: unused function 'vhost_log_write'

Oops. This functions was defined in a header file. I then moved it to
vhost_rxtx.c, but I forgot to do the compile test :(

> I think it's better to squash with the next patch.

Yes, and thanks!

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/8] vhost-user live migration support
  2016-02-19 15:01     ` Thomas Monjalon
@ 2016-02-22  7:08       ` Yuanhan Liu
  2016-02-22  9:56         ` Thomas Monjalon
  0 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2016-02-22  7:08 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Fri, Feb 19, 2016 at 04:01:25PM +0100, Thomas Monjalon wrote:
> 2016-01-29 12:57, Yuanhan Liu:
> > This patch set adds the vhost-user live migration support.
> > 
> > The major task behind that is to log pages we touched during
> > live migration, including used vring and desc buffer. So, this
> > patch set is basically about adding vhost log support, and
> > using it.
> > 
> > Another important thing is that you need notify the switches
> > about the VM location change after migration is done. GUEST_ANNOUNCE
> > feature is for that, which sends an GARP message after migration.
> > For older kernel (<= v3.4) without GUEST_ANNOUNCE support,
> > we construct and broadcast a RARP message, with the mac address
> > from VHOST_USER_SEND_RARP payload.
> > 
> > Patchset
> > ========
> > - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
> >   the dirty memory bitmap is.
> >     
> > - Patch 2 introduces a vhost_log_write() helper function to log
> >   pages we are gonna change.
> > 
> > - Patch 3 logs changes we made to used vring.
> > 
> > - Patch 4 logs changes we made to vring desc buffer.
> > 
> > - Patch 5 and 7 add some feature bits related to live migration.
> > 
> > - patch 6 does the RARP construction and broadcast job.
> 
> Patches 2 and 3 have been merged to avoid a compilation error.
> Applied, thanks

Actually, there was a ongoing discussion about patch 6, the handling
of VHOST_USER_SEND_RARP request:

  http://dpdk.org/ml/archives/dev/2016-February/033539.html

Maybe you have seen that and I didn't make it clear; my bad. Since you
have already applied it. I will make a standalone patch, and try to
send it out today.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/8] vhost-user live migration support
  2016-02-22  7:08       ` Yuanhan Liu
@ 2016-02-22  9:56         ` Thomas Monjalon
  2016-02-22 14:24           ` Yuanhan Liu
  0 siblings, 1 reply; 136+ messages in thread
From: Thomas Monjalon @ 2016-02-22  9:56 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2016-02-22 15:08, Yuanhan Liu:
> On Fri, Feb 19, 2016 at 04:01:25PM +0100, Thomas Monjalon wrote:
> > 2016-01-29 12:57, Yuanhan Liu:
> > > This patch set adds the vhost-user live migration support.
> > > 
> > > The major task behind that is to log pages we touched during
> > > live migration, including used vring and desc buffer. So, this
> > > patch set is basically about adding vhost log support, and
> > > using it.
> > > 
> > > Another important thing is that you need notify the switches
> > > about the VM location change after migration is done. GUEST_ANNOUNCE
> > > feature is for that, which sends an GARP message after migration.
> > > For older kernel (<= v3.4) without GUEST_ANNOUNCE support,
> > > we construct and broadcast a RARP message, with the mac address
> > > from VHOST_USER_SEND_RARP payload.
> > > 
> > > Patchset
> > > ========
> > > - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
> > >   the dirty memory bitmap is.
> > >     
> > > - Patch 2 introduces a vhost_log_write() helper function to log
> > >   pages we are gonna change.
> > > 
> > > - Patch 3 logs changes we made to used vring.
> > > 
> > > - Patch 4 logs changes we made to vring desc buffer.
> > > 
> > > - Patch 5 and 7 add some feature bits related to live migration.
> > > 
> > > - patch 6 does the RARP construction and broadcast job.
> > 
> > Patches 2 and 3 have been merged to avoid a compilation error.
> > Applied, thanks
> 
> Actually, there was a ongoing discussion about patch 6, the handling
> of VHOST_USER_SEND_RARP request:
> 
>   http://dpdk.org/ml/archives/dev/2016-February/033539.html
> 
> Maybe you have seen that and I didn't make it clear; my bad. Since you
> have already applied it. I will make a standalone patch, and try to
> send it out today.

Yes I've wrongly understood there was no problem.
The series would have not been applied if you had said that a new version
was needed or if you had set the patchwork status to "Changes Requested".
Sorry, we'll do better next times ;)

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/8] vhost-user live migration support
  2016-02-22  9:56         ` Thomas Monjalon
@ 2016-02-22 14:24           ` Yuanhan Liu
  0 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2016-02-22 14:24 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Mon, Feb 22, 2016 at 10:56:42AM +0100, Thomas Monjalon wrote:
> > > Patches 2 and 3 have been merged to avoid a compilation error.
> > > Applied, thanks
> > 
> > Actually, there was a ongoing discussion about patch 6, the handling
> > of VHOST_USER_SEND_RARP request:
> > 
> >   http://dpdk.org/ml/archives/dev/2016-February/033539.html
> > 
> > Maybe you have seen that and I didn't make it clear; my bad. Since you
> > have already applied it. I will make a standalone patch, and try to
> > send it out today.
> 
> Yes I've wrongly understood there was no problem.
> The series would have not been applied if you had said that a new version
> was needed or if you had set the patchwork status to "Changes Requested".

Yeah, that's my bad. I meant to send it out soon, say, last Friday
night. However, I didn't make it. For the patchwork stuff, I'm
still gettig used to it. So, thanks for the tip. I will bear that
in mind next time.

> Sorry, we'll do better next times ;)

Yeah :)

A new patch will be sent out soon. Please consider to apply it. Or,
maybe we can hold it for a while, to see if someone else have comments.
I will ping you to apply it next week if no comments.

Thanks.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [dpdk-dev] [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array
  2016-02-19  7:03         ` Yuanhan Liu
  2016-02-19  8:55           ` Yuanhan Liu
@ 2016-02-22 14:36           ` Yuanhan Liu
  2016-02-24  8:15             ` Qiu, Michael
  2016-02-29 15:56             ` Thomas Monjalon
  1 sibling, 2 replies; 136+ messages in thread
From: Yuanhan Liu @ 2016-02-22 14:36 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

Broadcast RARP packet by injecting it to receiving mbuf array at
rte_vhost_dequeue_burst().

Commit 33226236a35e ("vhost: handle request to send RARP") iterates
all host interfaces and then broadcast it by all of them.  It did
notify the switches about the new location of the migrated VM, however,
the mac learning table in the target host is wrong (at least in my
test with OVS):

    $ ovs-appctl fdb/show ovsbr0
     port  VLAN  MAC                Age
        1     0  b6:3c:72:71:cd:4d   10
    LOCAL     0  b6:3c:72:71:cd:4e   10
    LOCAL     0  52:54:00:12:34:68    9
        1     0  56:f6:64:2c:bc:c0    1

Where 52:54:00:12:34:68 is the mac of the VM. As you can see from the
above, the port learned is "LOCAL", which is the "ovsbr0" port. That
is reasonable, since we indeed send the pkt by the "ovsbr0" interface.

The wrong mac table lead all the packets to the VM go to the "ovsbr0"
in the end, which ends up with all packets being lost, until the guest
send a ARP quest (or reply) to refresh the mac learning table.

Jianfeng then came up with a solution I have thought of firstly but NAKed
by myself, concerning it has potential issues [0]. The solution is as title
stated: broadcast the RARP packet by injecting it to the receiving mbuf
arrays at rte_vhost_dequeue_burst(). The re-bring of that idea made me
think it twice; it looked like a false concern to me then. And I had done
a rough verification: it worked as expected.

[0]: http://dpdk.org/ml/archives/dev/2016-February/033527.html

Another note is that while preparing this version, I found that DPDK has
some ARP related structures and macros defined. So, use them instead of
the one from standard header files here.

Cc: Thibaut Collet <thibaut.collet@6wind.com>
Suggested-by: Jianfeng Tan <jianfeng.tan@intel.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/rte_virtio_net.h             |   5 +-
 lib/librte_vhost/vhost_rxtx.c                 |  80 +++++++++++++++-
 lib/librte_vhost/vhost_user/vhost-net-user.c  |   2 +-
 lib/librte_vhost/vhost_user/virtio-net-user.c | 128 ++++----------------------
 lib/librte_vhost/vhost_user/virtio-net-user.h |   2 +-
 5 files changed, 104 insertions(+), 113 deletions(-)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 4a2303a..7d1fde2 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -49,6 +49,7 @@
 
 #include <rte_memory.h>
 #include <rte_mempool.h>
+#include <rte_ether.h>
 
 struct rte_mbuf;
 
@@ -133,7 +134,9 @@ struct virtio_net {
 	void			*priv;		/**< private context */
 	uint64_t		log_size;	/**< Size of log area */
 	uint64_t		log_base;	/**< Where dirty pages are logged */
-	uint64_t		reserved[62];	/**< Reserve some spaces for future extension. */
+	struct ether_addr	mac;		/**< MAC address */
+	rte_atomic16_t		broadcast_rarp;	/**< A flag to tell if we need broadcast rarp packet */
+	uint64_t		reserved[61];	/**< Reserve some spaces for future extension. */
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
 } __rte_cache_aligned;
 
diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index 12ce0cc..9d23eb1 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -43,6 +43,7 @@
 #include <rte_tcp.h>
 #include <rte_udp.h>
 #include <rte_sctp.h>
+#include <rte_arp.h>
 
 #include "vhost-net.h"
 
@@ -761,11 +762,50 @@ vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
 	}
 }
 
+#define RARP_PKT_SIZE	64
+
+static int
+make_rarp_packet(struct rte_mbuf *rarp_mbuf, const struct ether_addr *mac)
+{
+	struct ether_hdr *eth_hdr;
+	struct arp_hdr  *rarp;
+
+	if (rarp_mbuf->buf_len < 64) {
+		RTE_LOG(WARNING, VHOST_DATA,
+			"failed to make RARP; mbuf size too small %u (< %d)\n",
+			rarp_mbuf->buf_len, RARP_PKT_SIZE);
+		return -1;
+	}
+
+	/* Ethernet header. */
+	eth_hdr = rte_pktmbuf_mtod_offset(rarp_mbuf, struct ether_hdr *, 0);
+	memset(eth_hdr->d_addr.addr_bytes, 0xff, ETHER_ADDR_LEN);
+	ether_addr_copy(mac, &eth_hdr->s_addr);
+	eth_hdr->ether_type = htons(ETHER_TYPE_RARP);
+
+	/* RARP header. */
+	rarp = (struct arp_hdr *)(eth_hdr + 1);
+	rarp->arp_hrd = htons(ARP_HRD_ETHER);
+	rarp->arp_pro = htons(ETHER_TYPE_IPv4);
+	rarp->arp_hln = ETHER_ADDR_LEN;
+	rarp->arp_pln = 4;
+	rarp->arp_op  = htons(ARP_OP_REVREQUEST);
+
+	ether_addr_copy(mac, &rarp->arp_data.arp_sha);
+	ether_addr_copy(mac, &rarp->arp_data.arp_tha);
+	memset(&rarp->arp_data.arp_sip, 0x00, 4);
+	memset(&rarp->arp_data.arp_tip, 0x00, 4);
+
+	rarp_mbuf->pkt_len  = rarp_mbuf->data_len = RARP_PKT_SIZE;
+
+	return 0;
+}
+
 uint16_t
 rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
 {
-	struct rte_mbuf *m, *prev;
+	struct rte_mbuf *m, *prev, *rarp_mbuf = NULL;
 	struct vhost_virtqueue *vq;
 	struct vring_desc *desc;
 	uint64_t vb_addr = 0;
@@ -788,11 +828,34 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 	if (unlikely(vq->enabled == 0))
 		return 0;
 
+	/*
+	 * Construct a RARP broadcast packet, and inject it to the "pkts"
+	 * array, to looks like that guest actually send such packet.
+	 *
+	 * Check user_send_rarp() for more information.
+	 */
+	if (unlikely(rte_atomic16_cmpset((volatile uint16_t *)
+					 &dev->broadcast_rarp.cnt, 1, 0))) {
+		rarp_mbuf = rte_pktmbuf_alloc(mbuf_pool);
+		if (rarp_mbuf == NULL) {
+			RTE_LOG(ERR, VHOST_DATA,
+				"Failed to allocate memory for mbuf.\n");
+			return 0;
+		}
+
+		if (make_rarp_packet(rarp_mbuf, &dev->mac)) {
+			rte_pktmbuf_free(rarp_mbuf);
+			rarp_mbuf = NULL;
+		} else {
+			count -= 1;
+		}
+	}
+
 	avail_idx =  *((volatile uint16_t *)&vq->avail->idx);
 
 	/* If there are no available buffers then return. */
 	if (vq->last_used_idx == avail_idx)
-		return 0;
+		goto out;
 
 	LOG_DEBUG(VHOST_DATA, "%s (%"PRIu64")\n", __func__,
 		dev->device_fh);
@@ -983,8 +1046,21 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 	vq->used->idx += entry_success;
 	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
 			sizeof(vq->used->idx));
+
 	/* Kick guest if required. */
 	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
 		eventfd_write(vq->callfd, (eventfd_t)1);
+
+out:
+	if (unlikely(rarp_mbuf != NULL)) {
+		/*
+		 * Inject it to the head of "pkts" array, so that switch's mac
+		 * learning table will get updated first.
+		 */
+		memmove(&pkts[1], pkts, entry_success * sizeof(m));
+		pkts[0] = rarp_mbuf;
+		entry_success += 1;
+	}
+
 	return entry_success;
 }
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index de7eecb..df2bd64 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -437,7 +437,7 @@ vserver_message_handler(int connfd, void *dat, int *remove)
 		user_set_vring_enable(ctx, &msg.payload.state);
 		break;
 	case VHOST_USER_SEND_RARP:
-		user_send_rarp(&msg);
+		user_send_rarp(ctx, &msg);
 		break;
 
 	default:
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
index 68b24f4..65b5652 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -39,12 +39,6 @@
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
-#include <sys/ioctl.h>
-#include <sys/socket.h>
-#include <net/ethernet.h>
-#include <netinet/in.h>
-#include <netinet/if_ether.h>
-#include <linux/if_packet.h>
 
 #include <rte_common.h>
 #include <rte_log.h>
@@ -415,120 +409,38 @@ user_set_log_base(struct vhost_device_ctx ctx,
 	return 0;
 }
 
-#define RARP_BUF_SIZE	64
-
-static void
-make_rarp_packet(uint8_t *buf, uint8_t *mac)
-{
-	struct ether_header *eth_hdr;
-	struct ether_arp *rarp;
-
-	/* Ethernet header. */
-	eth_hdr = (struct ether_header *)buf;
-	memset(&eth_hdr->ether_dhost, 0xff, ETH_ALEN);
-	memcpy(&eth_hdr->ether_shost, mac,  ETH_ALEN);
-	eth_hdr->ether_type = htons(ETH_P_RARP);
-
-	/* RARP header. */
-	rarp = (struct ether_arp *)(eth_hdr + 1);
-	rarp->ea_hdr.ar_hrd = htons(ARPHRD_ETHER);
-	rarp->ea_hdr.ar_pro = htons(ETHERTYPE_IP);
-	rarp->ea_hdr.ar_hln = ETH_ALEN;
-	rarp->ea_hdr.ar_pln = 4;
-	rarp->ea_hdr.ar_op  = htons(ARPOP_RREQUEST);
-
-	memcpy(&rarp->arp_sha, mac, ETH_ALEN);
-	memset(&rarp->arp_spa, 0x00, 4);
-	memcpy(&rarp->arp_tha, mac, 6);
-	memset(&rarp->arp_tpa, 0x00, 4);
-}
-
-
-static void
-send_rarp(const char *ifname, uint8_t *rarp)
-{
-	int fd;
-	struct ifreq ifr;
-	struct sockaddr_ll addr;
-
-	fd = socket(AF_PACKET, SOCK_RAW, 0);
-	if (fd < 0) {
-		perror("socket failed");
-		return;
-	}
-
-	memset(&ifr, 0, sizeof(struct ifreq));
-	strncpy(ifr.ifr_name, ifname, IFNAMSIZ);
-	if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0) {
-		perror("failed to get interface index");
-		close(fd);
-		return;
-	}
-
-	addr.sll_ifindex = ifr.ifr_ifindex;
-	addr.sll_halen   = ETH_ALEN;
-
-	if (sendto(fd, rarp, RARP_BUF_SIZE, 0,
-		   (const struct sockaddr*)&addr, sizeof(addr)) < 0) {
-		perror("send rarp packet failed");
-	}
-}
-
-
 /*
- * Broadcast a RARP message to all interfaces, to update
- * switch's mac table
+ * An rarp packet is constructed and broadcasted to notify switches about
+ * the new location of the migrated VM, so that packets from outside will
+ * not be lost after migration.
+ *
+ * However, we don't actually "send" a rarp packet here, instead, we set
+ * a flag 'broadcast_rarp' to let rte_vhost_dequeue_burst() inject it.
  */
 int
-user_send_rarp(struct VhostUserMsg *msg)
+user_send_rarp(struct vhost_device_ctx ctx, struct VhostUserMsg *msg)
 {
+	struct virtio_net *dev;
 	uint8_t *mac = (uint8_t *)&msg->payload.u64;
-	uint8_t rarp[RARP_BUF_SIZE];
-	struct ifconf ifc = {0, };
-	struct ifreq *ifr;
-	int nr = 16;
-	int fd;
-	uint32_t i;
+
+	dev = get_device(ctx);
+	if (!dev)
+		return -1;
 
 	RTE_LOG(DEBUG, VHOST_CONFIG,
 		":: mac: %02x:%02x:%02x:%02x:%02x:%02x\n",
 		mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
-
-	make_rarp_packet(rarp, mac);
+	memcpy(dev->mac.addr_bytes, mac, 6);
 
 	/*
-	 * Get all interfaces
+	 * Set the flag to inject a RARP broadcast packet at
+	 * rte_vhost_dequeue_burst().
+	 *
+	 * rte_smp_wmb() is for making sure the mac is copied
+	 * before the flag is set.
 	 */
-	fd = socket(AF_INET, SOCK_DGRAM, 0);
-	if (fd < 0) {
-		perror("failed to create AF_INET socket");
-		return -1;
-	}
-
-again:
-	ifc.ifc_len = sizeof(*ifr) * nr;
-	ifc.ifc_buf = realloc(ifc.ifc_buf, ifc.ifc_len);
-
-	if (ioctl(fd, SIOCGIFCONF, &ifc) < 0) {
-		perror("failed at SIOCGIFCONF");
-		close(fd);
-		return -1;
-	}
-
-	if (ifc.ifc_len == (int)sizeof(struct ifreq) * nr) {
-		/*
-		 * current ifc_buf is not big enough to hold
-		 * all interfaces; double it and try again.
-		 */
-		nr *= 2;
-		goto again;
-	}
-
-	ifr = (struct ifreq *)ifc.ifc_buf;
-	for (i = 0; i < ifc.ifc_len / sizeof(struct ifreq); i++)
-		send_rarp(ifr[i].ifr_name, rarp);
-
-	close(fd);
+	rte_smp_wmb();
+	rte_atomic16_set(&dev->broadcast_rarp, 1);
 
 	return 0;
 }
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index 559bb46..cefec16 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -54,7 +54,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
 void user_set_protocol_features(struct vhost_device_ctx ctx,
 				uint64_t protocol_features);
 int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
-int user_send_rarp(struct VhostUserMsg *);
+int user_send_rarp(struct vhost_device_ctx ctx, struct VhostUserMsg *);
 
 int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
 
-- 
1.9.0

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array
  2016-02-22 14:36           ` [dpdk-dev] [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array Yuanhan Liu
@ 2016-02-24  8:15             ` Qiu, Michael
  2016-02-24  8:28               ` Yuanhan Liu
  2016-02-29 15:56             ` Thomas Monjalon
  1 sibling, 1 reply; 136+ messages in thread
From: Qiu, Michael @ 2016-02-24  8:15 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

On 2/22/2016 10:35 PM, Yuanhan Liu wrote:
> Broadcast RARP packet by injecting it to receiving mbuf array at
> rte_vhost_dequeue_burst().
>
> Commit 33226236a35e ("vhost: handle request to send RARP") iterates
> all host interfaces and then broadcast it by all of them.  It did
> notify the switches about the new location of the migrated VM, however,
> the mac learning table in the target host is wrong (at least in my
> test with OVS):
>
>     $ ovs-appctl fdb/show ovsbr0
>      port  VLAN  MAC                Age
>         1     0  b6:3c:72:71:cd:4d   10
>     LOCAL     0  b6:3c:72:71:cd:4e   10
>     LOCAL     0  52:54:00:12:34:68    9
>         1     0  56:f6:64:2c:bc:c0    1
>
> Where 52:54:00:12:34:68 is the mac of the VM. As you can see from the
> above, the port learned is "LOCAL", which is the "ovsbr0" port. That
> is reasonable, since we indeed send the pkt by the "ovsbr0" interface.
>
> The wrong mac table lead all the packets to the VM go to the "ovsbr0"
> in the end, which ends up with all packets being lost, until the guest
> send a ARP quest (or reply) to refresh the mac learning table.
>
> Jianfeng then came up with a solution I have thought of firstly but NAKed

Is it suitable to mention someone in the commit log?

Thanks,
Michael
> by myself, concerning it has potential issues [0]. The solution is as title
> stated: broadcast the RARP packet by injecting it to the receiving mbuf
> arrays at rte_vhost_dequeue_burst(). The re-bring of that idea made me
> think it twice; it looked like a false concern to me then. And I had done
> a rough verification: it worked as expected.
>
> [0]: http://dpdk.org/ml/archives/dev/2016-February/033527.html
>
> Another note is that while preparing this version, I found that DPDK has
> some ARP related structures and macros defined. So, use them instead of
> the one from standard header files here.
>
> Cc: Thibaut Collet <thibaut.collet@6wind.com>
> Suggested-by: Jianfeng Tan <jianfeng.tan@intel.com>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  lib/librte_vhost/rte_virtio_net.h             |   5 +-
>  lib/librte_vhost/vhost_rxtx.c                 |  80 +++++++++++++++-
>  lib/librte_vhost/vhost_user/vhost-net-user.c  |   2 +-
>  lib/librte_vhost/vhost_user/virtio-net-user.c | 128 ++++----------------------
>  lib/librte_vhost/vhost_user/virtio-net-user.h |   2 +-
>  5 files changed, 104 insertions(+), 113 deletions(-)
>
> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> index 4a2303a..7d1fde2 100644
> --- a/lib/librte_vhost/rte_virtio_net.h
> +++ b/lib/librte_vhost/rte_virtio_net.h
> @@ -49,6 +49,7 @@
>  
>  #include <rte_memory.h>
>  #include <rte_mempool.h>
> +#include <rte_ether.h>
>  
>  struct rte_mbuf;
>  
> @@ -133,7 +134,9 @@ struct virtio_net {
>  	void			*priv;		/**< private context */
>  	uint64_t		log_size;	/**< Size of log area */
>  	uint64_t		log_base;	/**< Where dirty pages are logged */
> -	uint64_t		reserved[62];	/**< Reserve some spaces for future extension. */
> +	struct ether_addr	mac;		/**< MAC address */
> +	rte_atomic16_t		broadcast_rarp;	/**< A flag to tell if we need broadcast rarp packet */
> +	uint64_t		reserved[61];	/**< Reserve some spaces for future extension. */
>  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
>  } __rte_cache_aligned;
>  
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> index 12ce0cc..9d23eb1 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c
> @@ -43,6 +43,7 @@
>  #include <rte_tcp.h>
>  #include <rte_udp.h>
>  #include <rte_sctp.h>
> +#include <rte_arp.h>
>  
>  #include "vhost-net.h"
>  
> @@ -761,11 +762,50 @@ vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
>  	}
>  }
>  
> +#define RARP_PKT_SIZE	64
> +
> +static int
> +make_rarp_packet(struct rte_mbuf *rarp_mbuf, const struct ether_addr *mac)
> +{
> +	struct ether_hdr *eth_hdr;
> +	struct arp_hdr  *rarp;
> +
> +	if (rarp_mbuf->buf_len < 64) {
> +		RTE_LOG(WARNING, VHOST_DATA,
> +			"failed to make RARP; mbuf size too small %u (< %d)\n",
> +			rarp_mbuf->buf_len, RARP_PKT_SIZE);
> +		return -1;
> +	}
> +
> +	/* Ethernet header. */
> +	eth_hdr = rte_pktmbuf_mtod_offset(rarp_mbuf, struct ether_hdr *, 0);
> +	memset(eth_hdr->d_addr.addr_bytes, 0xff, ETHER_ADDR_LEN);
> +	ether_addr_copy(mac, &eth_hdr->s_addr);
> +	eth_hdr->ether_type = htons(ETHER_TYPE_RARP);
> +
> +	/* RARP header. */
> +	rarp = (struct arp_hdr *)(eth_hdr + 1);
> +	rarp->arp_hrd = htons(ARP_HRD_ETHER);
> +	rarp->arp_pro = htons(ETHER_TYPE_IPv4);
> +	rarp->arp_hln = ETHER_ADDR_LEN;
> +	rarp->arp_pln = 4;
> +	rarp->arp_op  = htons(ARP_OP_REVREQUEST);
> +
> +	ether_addr_copy(mac, &rarp->arp_data.arp_sha);
> +	ether_addr_copy(mac, &rarp->arp_data.arp_tha);
> +	memset(&rarp->arp_data.arp_sip, 0x00, 4);
> +	memset(&rarp->arp_data.arp_tip, 0x00, 4);
> +
> +	rarp_mbuf->pkt_len  = rarp_mbuf->data_len = RARP_PKT_SIZE;
> +
> +	return 0;
> +}
> +
>  uint16_t
>  rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
>  {
> -	struct rte_mbuf *m, *prev;
> +	struct rte_mbuf *m, *prev, *rarp_mbuf = NULL;
>  	struct vhost_virtqueue *vq;
>  	struct vring_desc *desc;
>  	uint64_t vb_addr = 0;
> @@ -788,11 +828,34 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  	if (unlikely(vq->enabled == 0))
>  		return 0;
>  
> +	/*
> +	 * Construct a RARP broadcast packet, and inject it to the "pkts"
> +	 * array, to looks like that guest actually send such packet.
> +	 *
> +	 * Check user_send_rarp() for more information.
> +	 */
> +	if (unlikely(rte_atomic16_cmpset((volatile uint16_t *)
> +					 &dev->broadcast_rarp.cnt, 1, 0))) {
> +		rarp_mbuf = rte_pktmbuf_alloc(mbuf_pool);
> +		if (rarp_mbuf == NULL) {
> +			RTE_LOG(ERR, VHOST_DATA,
> +				"Failed to allocate memory for mbuf.\n");
> +			return 0;
> +		}
> +
> +		if (make_rarp_packet(rarp_mbuf, &dev->mac)) {
> +			rte_pktmbuf_free(rarp_mbuf);
> +			rarp_mbuf = NULL;
> +		} else {
> +			count -= 1;
> +		}
> +	}
> +
>  	avail_idx =  *((volatile uint16_t *)&vq->avail->idx);
>  
>  	/* If there are no available buffers then return. */
>  	if (vq->last_used_idx == avail_idx)
> -		return 0;
> +		goto out;
>  
>  	LOG_DEBUG(VHOST_DATA, "%s (%"PRIu64")\n", __func__,
>  		dev->device_fh);
> @@ -983,8 +1046,21 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  	vq->used->idx += entry_success;
>  	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
>  			sizeof(vq->used->idx));
> +
>  	/* Kick guest if required. */
>  	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
>  		eventfd_write(vq->callfd, (eventfd_t)1);
> +
> +out:
> +	if (unlikely(rarp_mbuf != NULL)) {
> +		/*
> +		 * Inject it to the head of "pkts" array, so that switch's mac
> +		 * learning table will get updated first.
> +		 */
> +		memmove(&pkts[1], pkts, entry_success * sizeof(m));
> +		pkts[0] = rarp_mbuf;
> +		entry_success += 1;
> +	}
> +
>  	return entry_success;
>  }
> diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
> index de7eecb..df2bd64 100644
> --- a/lib/librte_vhost/vhost_user/vhost-net-user.c
> +++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
> @@ -437,7 +437,7 @@ vserver_message_handler(int connfd, void *dat, int *remove)
>  		user_set_vring_enable(ctx, &msg.payload.state);
>  		break;
>  	case VHOST_USER_SEND_RARP:
> -		user_send_rarp(&msg);
> +		user_send_rarp(ctx, &msg);
>  		break;
>  
>  	default:
> diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
> index 68b24f4..65b5652 100644
> --- a/lib/librte_vhost/vhost_user/virtio-net-user.c
> +++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
> @@ -39,12 +39,6 @@
>  #include <sys/mman.h>
>  #include <sys/types.h>
>  #include <sys/stat.h>
> -#include <sys/ioctl.h>
> -#include <sys/socket.h>
> -#include <net/ethernet.h>
> -#include <netinet/in.h>
> -#include <netinet/if_ether.h>
> -#include <linux/if_packet.h>
>  
>  #include <rte_common.h>
>  #include <rte_log.h>
> @@ -415,120 +409,38 @@ user_set_log_base(struct vhost_device_ctx ctx,
>  	return 0;
>  }
>  
> -#define RARP_BUF_SIZE	64
> -
> -static void
> -make_rarp_packet(uint8_t *buf, uint8_t *mac)
> -{
> -	struct ether_header *eth_hdr;
> -	struct ether_arp *rarp;
> -
> -	/* Ethernet header. */
> -	eth_hdr = (struct ether_header *)buf;
> -	memset(&eth_hdr->ether_dhost, 0xff, ETH_ALEN);
> -	memcpy(&eth_hdr->ether_shost, mac,  ETH_ALEN);
> -	eth_hdr->ether_type = htons(ETH_P_RARP);
> -
> -	/* RARP header. */
> -	rarp = (struct ether_arp *)(eth_hdr + 1);
> -	rarp->ea_hdr.ar_hrd = htons(ARPHRD_ETHER);
> -	rarp->ea_hdr.ar_pro = htons(ETHERTYPE_IP);
> -	rarp->ea_hdr.ar_hln = ETH_ALEN;
> -	rarp->ea_hdr.ar_pln = 4;
> -	rarp->ea_hdr.ar_op  = htons(ARPOP_RREQUEST);
> -
> -	memcpy(&rarp->arp_sha, mac, ETH_ALEN);
> -	memset(&rarp->arp_spa, 0x00, 4);
> -	memcpy(&rarp->arp_tha, mac, 6);
> -	memset(&rarp->arp_tpa, 0x00, 4);
> -}
> -
> -
> -static void
> -send_rarp(const char *ifname, uint8_t *rarp)
> -{
> -	int fd;
> -	struct ifreq ifr;
> -	struct sockaddr_ll addr;
> -
> -	fd = socket(AF_PACKET, SOCK_RAW, 0);
> -	if (fd < 0) {
> -		perror("socket failed");
> -		return;
> -	}
> -
> -	memset(&ifr, 0, sizeof(struct ifreq));
> -	strncpy(ifr.ifr_name, ifname, IFNAMSIZ);
> -	if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0) {
> -		perror("failed to get interface index");
> -		close(fd);
> -		return;
> -	}
> -
> -	addr.sll_ifindex = ifr.ifr_ifindex;
> -	addr.sll_halen   = ETH_ALEN;
> -
> -	if (sendto(fd, rarp, RARP_BUF_SIZE, 0,
> -		   (const struct sockaddr*)&addr, sizeof(addr)) < 0) {
> -		perror("send rarp packet failed");
> -	}
> -}
> -
> -
>  /*
> - * Broadcast a RARP message to all interfaces, to update
> - * switch's mac table
> + * An rarp packet is constructed and broadcasted to notify switches about
> + * the new location of the migrated VM, so that packets from outside will
> + * not be lost after migration.
> + *
> + * However, we don't actually "send" a rarp packet here, instead, we set
> + * a flag 'broadcast_rarp' to let rte_vhost_dequeue_burst() inject it.
>   */
>  int
> -user_send_rarp(struct VhostUserMsg *msg)
> +user_send_rarp(struct vhost_device_ctx ctx, struct VhostUserMsg *msg)
>  {
> +	struct virtio_net *dev;
>  	uint8_t *mac = (uint8_t *)&msg->payload.u64;
> -	uint8_t rarp[RARP_BUF_SIZE];
> -	struct ifconf ifc = {0, };
> -	struct ifreq *ifr;
> -	int nr = 16;
> -	int fd;
> -	uint32_t i;
> +
> +	dev = get_device(ctx);
> +	if (!dev)
> +		return -1;
>  
>  	RTE_LOG(DEBUG, VHOST_CONFIG,
>  		":: mac: %02x:%02x:%02x:%02x:%02x:%02x\n",
>  		mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
> -
> -	make_rarp_packet(rarp, mac);
> +	memcpy(dev->mac.addr_bytes, mac, 6);
>  
>  	/*
> -	 * Get all interfaces
> +	 * Set the flag to inject a RARP broadcast packet at
> +	 * rte_vhost_dequeue_burst().
> +	 *
> +	 * rte_smp_wmb() is for making sure the mac is copied
> +	 * before the flag is set.
>  	 */
> -	fd = socket(AF_INET, SOCK_DGRAM, 0);
> -	if (fd < 0) {
> -		perror("failed to create AF_INET socket");
> -		return -1;
> -	}
> -
> -again:
> -	ifc.ifc_len = sizeof(*ifr) * nr;
> -	ifc.ifc_buf = realloc(ifc.ifc_buf, ifc.ifc_len);
> -
> -	if (ioctl(fd, SIOCGIFCONF, &ifc) < 0) {
> -		perror("failed at SIOCGIFCONF");
> -		close(fd);
> -		return -1;
> -	}
> -
> -	if (ifc.ifc_len == (int)sizeof(struct ifreq) * nr) {
> -		/*
> -		 * current ifc_buf is not big enough to hold
> -		 * all interfaces; double it and try again.
> -		 */
> -		nr *= 2;
> -		goto again;
> -	}
> -
> -	ifr = (struct ifreq *)ifc.ifc_buf;
> -	for (i = 0; i < ifc.ifc_len / sizeof(struct ifreq); i++)
> -		send_rarp(ifr[i].ifr_name, rarp);
> -
> -	close(fd);
> +	rte_smp_wmb();
> +	rte_atomic16_set(&dev->broadcast_rarp, 1);
>  
>  	return 0;
>  }
> diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
> index 559bb46..cefec16 100644
> --- a/lib/librte_vhost/vhost_user/virtio-net-user.h
> +++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
> @@ -54,7 +54,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
>  void user_set_protocol_features(struct vhost_device_ctx ctx,
>  				uint64_t protocol_features);
>  int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
> -int user_send_rarp(struct VhostUserMsg *);
> +int user_send_rarp(struct vhost_device_ctx ctx, struct VhostUserMsg *);
>  
>  int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
>  


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array
  2016-02-24  8:15             ` Qiu, Michael
@ 2016-02-24  8:28               ` Yuanhan Liu
  2016-02-25  7:55                 ` Qiu, Michael
  0 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2016-02-24  8:28 UTC (permalink / raw)
  To: Qiu, Michael; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Wed, Feb 24, 2016 at 08:15:36AM +0000, Qiu, Michael wrote:
> On 2/22/2016 10:35 PM, Yuanhan Liu wrote:
> > Broadcast RARP packet by injecting it to receiving mbuf array at
> > rte_vhost_dequeue_burst().
> >
> > Commit 33226236a35e ("vhost: handle request to send RARP") iterates
> > all host interfaces and then broadcast it by all of them.  It did
> > notify the switches about the new location of the migrated VM, however,
> > the mac learning table in the target host is wrong (at least in my
> > test with OVS):
> >
> >     $ ovs-appctl fdb/show ovsbr0
> >      port  VLAN  MAC                Age
> >         1     0  b6:3c:72:71:cd:4d   10
> >     LOCAL     0  b6:3c:72:71:cd:4e   10
> >     LOCAL     0  52:54:00:12:34:68    9
> >         1     0  56:f6:64:2c:bc:c0    1
> >
> > Where 52:54:00:12:34:68 is the mac of the VM. As you can see from the
> > above, the port learned is "LOCAL", which is the "ovsbr0" port. That
> > is reasonable, since we indeed send the pkt by the "ovsbr0" interface.
> >
> > The wrong mac table lead all the packets to the VM go to the "ovsbr0"
> > in the end, which ends up with all packets being lost, until the guest
> > send a ARP quest (or reply) to refresh the mac learning table.
> >
> > Jianfeng then came up with a solution I have thought of firstly but NAKed
> 
> Is it suitable to mention someone in the commit log?

Why it's not? It's not a secret name or something like that after all :)

On the other hand, it's way of thanking Jianfeng's contribution to this
patch.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array
  2016-02-24  8:28               ` Yuanhan Liu
@ 2016-02-25  7:55                 ` Qiu, Michael
  0 siblings, 0 replies; 136+ messages in thread
From: Qiu, Michael @ 2016-02-25  7:55 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 2/24/2016 4:27 PM, Yuanhan Liu wrote:
> On Wed, Feb 24, 2016 at 08:15:36AM +0000, Qiu, Michael wrote:
>> On 2/22/2016 10:35 PM, Yuanhan Liu wrote:
>>> Broadcast RARP packet by injecting it to receiving mbuf array at
>>> rte_vhost_dequeue_burst().
>>>
>>> Commit 33226236a35e ("vhost: handle request to send RARP") iterates
>>> all host interfaces and then broadcast it by all of them.  It did
>>> notify the switches about the new location of the migrated VM, however,
>>> the mac learning table in the target host is wrong (at least in my
>>> test with OVS):
>>>
>>>     $ ovs-appctl fdb/show ovsbr0
>>>      port  VLAN  MAC                Age
>>>         1     0  b6:3c:72:71:cd:4d   10
>>>     LOCAL     0  b6:3c:72:71:cd:4e   10
>>>     LOCAL     0  52:54:00:12:34:68    9
>>>         1     0  56:f6:64:2c:bc:c0    1
>>>
>>> Where 52:54:00:12:34:68 is the mac of the VM. As you can see from the
>>> above, the port learned is "LOCAL", which is the "ovsbr0" port. That
>>> is reasonable, since we indeed send the pkt by the "ovsbr0" interface.
>>>
>>> The wrong mac table lead all the packets to the VM go to the "ovsbr0"
>>> in the end, which ends up with all packets being lost, until the guest
>>> send a ARP quest (or reply) to refresh the mac learning table.
>>>
>>> Jianfeng then came up with a solution I have thought of firstly but NAKed
>> Is it suitable to mention someone in the commit log?
> Why it's not? It's not a secret name or something like that after all :)
>
> On the other hand, it's way of thanking Jianfeng's contribution to this
> patch.

OK, I've never seen this fashion before, forgive me.

Thanks,
Michael
>
> 	--yliu
>


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array
  2016-02-22 14:36           ` [dpdk-dev] [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array Yuanhan Liu
  2016-02-24  8:15             ` Qiu, Michael
@ 2016-02-29 15:56             ` Thomas Monjalon
  1 sibling, 0 replies; 136+ messages in thread
From: Thomas Monjalon @ 2016-02-29 15:56 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

2016-02-22 22:36, Yuanhan Liu:
> The wrong mac table lead all the packets to the VM go to the "ovsbr0"
> in the end, which ends up with all packets being lost, until the guest
> send a ARP quest (or reply) to refresh the mac learning table.
> 
> Jianfeng then came up with a solution I have thought of firstly but NAKed
> by myself, concerning it has potential issues [0]. The solution is as title
> stated: broadcast the RARP packet by injecting it to the receiving mbuf
> arrays at rte_vhost_dequeue_burst(). The re-bring of that idea made me
> think it twice; it looked like a false concern to me then. And I had done
> a rough verification: it worked as expected.
> 
> [0]: http://dpdk.org/ml/archives/dev/2016-February/033527.html
> 
> Another note is that while preparing this version, I found that DPDK has
> some ARP related structures and macros defined. So, use them instead of
> the one from standard header files here.
> 
> Cc: Thibaut Collet <thibaut.collet@6wind.com>
> Suggested-by: Jianfeng Tan <jianfeng.tan@intel.com>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

Applied, thanks

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature
  2016-01-29  4:58     ` [dpdk-dev] [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
@ 2016-03-11 12:39       ` Olivier MATZ
  2016-03-11 13:16         ` Thomas Monjalon
  0 siblings, 1 reply; 136+ messages in thread
From: Olivier MATZ @ 2016-03-11 12:39 UTC (permalink / raw)
  To: Yuanhan Liu, dev

Hi Yuanhan,

On 01/29/2016 05:58 AM, Yuanhan Liu wrote:
> It's actually a feature already enabled in Linux kernel (since v3.5).
> What we need to do is simply to claim that we support such feature,
> and nothing else.
> 
> With that, the guest will send an ARP message after live migration
> to notify the switches about the new location of migrated VM.
> 
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Tested-by: Pavel Fedin <p.fedin@samsung.com>
> ---
>  lib/librte_vhost/virtio-net.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> index 03044f6..0ba5045 100644
> --- a/lib/librte_vhost/virtio-net.c
> +++ b/lib/librte_vhost/virtio-net.c
> @@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
>  #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
>  				(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
>  				(1ULL << VIRTIO_NET_F_CTRL_RX) | \
> +				(1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
>  				(VHOST_SUPPORTS_MQ)            | \
>  				(1ULL << VIRTIO_F_VERSION_1)   | \
>  				(1ULL << VHOST_F_LOG_ALL)      | \
> 

I'm trying to compile the dpdk on an debian-7, and it fails due
to this patch. Indeed, the define VIRTIO_NET_F_GUEST_ANNOUNCE is
not present in /usr/include/linux/virtio_net.h on this distribution.

I'm wondering if the librte_vhost shouldn't embed its own version
of virtio_net.h instead of relying on the one from the distribution.
It seems it has been done this way for the virtio guest PMD in
dpdk.org/drivers/net/virtio/virtio_pci.h.

What do you think?


Thanks,
Olivier

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature
  2016-03-11 12:39       ` Olivier MATZ
@ 2016-03-11 13:16         ` Thomas Monjalon
  2016-03-11 13:22           ` Olivier MATZ
  0 siblings, 1 reply; 136+ messages in thread
From: Thomas Monjalon @ 2016-03-11 13:16 UTC (permalink / raw)
  To: Olivier MATZ; +Cc: dev

2016-03-11 13:39, Olivier MATZ:
> I'm trying to compile the dpdk on an debian-7, and it fails due
> to this patch. Indeed, the define VIRTIO_NET_F_GUEST_ANNOUNCE is
> not present in /usr/include/linux/virtio_net.h on this distribution.

It will be fixed by this patch:
http://dpdk.org/dev/patchwork/patch/11195/

> I'm wondering if the librte_vhost shouldn't embed its own version
> of virtio_net.h instead of relying on the one from the distribution.
> It seems it has been done this way for the virtio guest PMD in
> dpdk.org/drivers/net/virtio/virtio_pci.h.
> 
> What do you think?

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature
  2016-03-11 13:16         ` Thomas Monjalon
@ 2016-03-11 13:22           ` Olivier MATZ
  0 siblings, 0 replies; 136+ messages in thread
From: Olivier MATZ @ 2016-03-11 13:22 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev



On 03/11/2016 02:16 PM, Thomas Monjalon wrote:
> 2016-03-11 13:39, Olivier MATZ:
>> I'm trying to compile the dpdk on an debian-7, and it fails due
>> to this patch. Indeed, the define VIRTIO_NET_F_GUEST_ANNOUNCE is
>> not present in /usr/include/linux/virtio_net.h on this distribution.
> 
> It will be fixed by this patch:
> http://dpdk.org/dev/patchwork/patch/11195/

OK, thanks Thomas

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-16 12:43                                 ` Pavel Fedin
@ 2015-12-16 13:00                                   ` Yuanhan Liu
  0 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-16 13:00 UTC (permalink / raw)
  To: Pavel Fedin; +Cc: dev, 'Victor Kaplansky'

On Wed, Dec 16, 2015 at 03:43:06PM +0300, Pavel Fedin wrote:
> rYR8N8f/ookveMRL7BfPnj5lw+EJZd+uG+v/lZnBuWidyQ4r
> 	g586/P1rPsQw8p6wT+M7LnqvMLZM9eWq2ht53Bd5liqxFGckGmoxFxUnAgC5sFKthAIAAA==
> Status: O
> Content-Length: 4853
> Lines: 66
> 
>  Hello!
> 
> > However, I'm more curious about the ping loss? Did you still see
> > that? And to be more specific, have the wireshark captured the
> > GRAP from the guest?
> 
>  Yes, everything is fine.

Great!

> 
> root@nfv_test_x86_64 /var/log/libvirt/qemu # tshark -i ovs-br0
> Running as user "root" and group "root". This could be dangerous.
> Capturing on 'ovs-br0'
>   1   0.000000 RealtekU_3b:83:1a -> Broadcast    ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
>   2   0.000024 fe80::5054:ff:fe3b:831a -> ff02::1      ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
> 52:54:00:3b:83:1a
>   3   0.049490 RealtekU_3b:83:1a -> Broadcast    ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
>   4   0.049497 fe80::5054:ff:fe3b:831a -> ff02::1      ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
> 52:54:00:3b:83:1a
>   5   0.199485 RealtekU_3b:83:1a -> Broadcast    ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
>   6   0.199492 fe80::5054:ff:fe3b:831a -> ff02::1      ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
> 52:54:00:3b:83:1a
>   7   0.449500 RealtekU_3b:83:1a -> Broadcast    ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
>   8   0.449508 fe80::5054:ff:fe3b:831a -> ff02::1      ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
> 52:54:00:3b:83:1a
>   9   0.517229  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=70/17920, ttl=64
>  10   0.517277  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=70/17920, ttl=64 (request in 9)
>  11   0.799521 RealtekU_3b:83:1a -> Broadcast    ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
>  12   0.799553 fe80::5054:ff:fe3b:831a -> ff02::1      ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
> 52:54:00:3b:83:1a
>  13   1.517210  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=71/18176, ttl=64
>  14   1.517238  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=71/18176, ttl=64 (request in 13)
>  15   2.517219  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=72/18432, ttl=64
>  16   2.517256  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=72/18432, ttl=64 (request in 15)
>  17   3.517497  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=73/18688, ttl=64
>  18   3.517518  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=73/18688, ttl=64 (request in 17)
>  19   4.517219  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=74/18944, ttl=64
>  20   4.517237  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=74/18944, ttl=64 (request in 19)
>  21   5.517222  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=75/19200, ttl=64
>  22   5.517242  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=75/19200, ttl=64 (request in 21)
>  23   6.517235  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=76/19456, ttl=64
>  24   6.517256  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=76/19456, ttl=64 (request in 23)
>  25   6.531466 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 Who has 192.168.6.2?  Tell 192.168.6.1
>  26   6.531619 RealtekU_3b:83:1a -> be:e1:71:c1:47:4d ARP 42 192.168.6.2 is at 52:54:00:3b:83:1a
>  27   7.517212  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=77/19712, ttl=64
>  28   7.517229  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=77/19712, ttl=64 (request in 27)
> 
>  But there's one important detail here. Any replicated network interfaces (LOCAL port in my example) should be fully cloned on both
> hosts, including MAC addresses. Otherwise after the migration the guest continues to send packets to old MAC, and, obvious, there's
> still ping loss until it redoes the ARP for its ping target.

I see. And here I care more about whether we can get the GARP from the
target guest just after the migration. If you can, everything should
be fine.
> 
> >  And what's the output of 'grep virtio /proc/interrupts' inside guest?
> 
> 11:          0          0          0          0   IO-APIC  11-fasteoi   uhci_hcd:usb1, virtio3
>  24:          0          0          0          0   PCI-MSI 114688-edge      virtio2-config
>  25:       3544          0          0          0   PCI-MSI 114689-edge      virtio2-req.0
>  26:         10          0          0          0   PCI-MSI 49152-edge      virtio0-config

The GUEST_ANNOUNCE has indeed been triggered. That's great! I just have
no idea why I can't get any config IRQ from the guest after the migration.
(I can for migratin inside one same host, but not on two hosts).

In my first tries, I just got an error message telling me that the MSI is
just lost. I then found it may because I'm using a customized guest kernel.
I then switched to the kernel shipped by Fedora 22, I no longer see such
error, but I still don't see such interrupt generated inside the guest, either.

It might still be an issue on my side. Even it's not, it's likely a KVM
bug, but not from vhost-user. And glad it works on your side :)

So, I will send v2 tomorow.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-16 12:08                               ` Yuanhan Liu
@ 2015-12-16 12:43                                 ` Pavel Fedin
  2015-12-16 13:00                                   ` Yuanhan Liu
  0 siblings, 1 reply; 136+ messages in thread
From: Pavel Fedin @ 2015-12-16 12:43 UTC (permalink / raw)
  To: 'Yuanhan Liu'; +Cc: dev, 'Victor Kaplansky'

 Hello!

> However, I'm more curious about the ping loss? Did you still see
> that? And to be more specific, have the wireshark captured the
> GRAP from the guest?

 Yes, everything is fine.

root@nfv_test_x86_64 /var/log/libvirt/qemu # tshark -i ovs-br0
Running as user "root" and group "root". This could be dangerous.
Capturing on 'ovs-br0'
  1   0.000000 RealtekU_3b:83:1a -> Broadcast    ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
  2   0.000024 fe80::5054:ff:fe3b:831a -> ff02::1      ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
52:54:00:3b:83:1a
  3   0.049490 RealtekU_3b:83:1a -> Broadcast    ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
  4   0.049497 fe80::5054:ff:fe3b:831a -> ff02::1      ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
52:54:00:3b:83:1a
  5   0.199485 RealtekU_3b:83:1a -> Broadcast    ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
  6   0.199492 fe80::5054:ff:fe3b:831a -> ff02::1      ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
52:54:00:3b:83:1a
  7   0.449500 RealtekU_3b:83:1a -> Broadcast    ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
  8   0.449508 fe80::5054:ff:fe3b:831a -> ff02::1      ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
52:54:00:3b:83:1a
  9   0.517229  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=70/17920, ttl=64
 10   0.517277  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=70/17920, ttl=64 (request in 9)
 11   0.799521 RealtekU_3b:83:1a -> Broadcast    ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
 12   0.799553 fe80::5054:ff:fe3b:831a -> ff02::1      ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
52:54:00:3b:83:1a
 13   1.517210  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=71/18176, ttl=64
 14   1.517238  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=71/18176, ttl=64 (request in 13)
 15   2.517219  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=72/18432, ttl=64
 16   2.517256  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=72/18432, ttl=64 (request in 15)
 17   3.517497  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=73/18688, ttl=64
 18   3.517518  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=73/18688, ttl=64 (request in 17)
 19   4.517219  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=74/18944, ttl=64
 20   4.517237  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=74/18944, ttl=64 (request in 19)
 21   5.517222  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=75/19200, ttl=64
 22   5.517242  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=75/19200, ttl=64 (request in 21)
 23   6.517235  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=76/19456, ttl=64
 24   6.517256  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=76/19456, ttl=64 (request in 23)
 25   6.531466 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 Who has 192.168.6.2?  Tell 192.168.6.1
 26   6.531619 RealtekU_3b:83:1a -> be:e1:71:c1:47:4d ARP 42 192.168.6.2 is at 52:54:00:3b:83:1a
 27   7.517212  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x04af, seq=77/19712, ttl=64
 28   7.517229  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x04af, seq=77/19712, ttl=64 (request in 27)

 But there's one important detail here. Any replicated network interfaces (LOCAL port in my example) should be fully cloned on both
hosts, including MAC addresses. Otherwise after the migration the guest continues to send packets to old MAC, and, obvious, there's
still ping loss until it redoes the ARP for its ping target.

>  And what's the output of 'grep virtio /proc/interrupts' inside guest?

11:          0          0          0          0   IO-APIC  11-fasteoi   uhci_hcd:usb1, virtio3
 24:          0          0          0          0   PCI-MSI 114688-edge      virtio2-config
 25:       3544          0          0          0   PCI-MSI 114689-edge      virtio2-req.0
 26:         10          0          0          0   PCI-MSI 49152-edge      virtio0-config
 27:        852          0          0          0   PCI-MSI 49153-edge      virtio0-input.0
 28:          3          0          0          0   PCI-MSI 49154-edge      virtio0-output.0
 29:         10          0          0          0   PCI-MSI 65536-edge      virtio1-config
 30:        172          0          0          0   PCI-MSI 65537-edge      virtio1-input.0
 31:          1          0          0          0   PCI-MSI 65538-edge      virtio1-output.0

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-16 11:57                             ` Pavel Fedin
@ 2015-12-16 12:08                               ` Yuanhan Liu
  2015-12-16 12:43                                 ` Pavel Fedin
  0 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-16 12:08 UTC (permalink / raw)
  To: Pavel Fedin; +Cc: dev, 'Victor Kaplansky'

On Wed, Dec 16, 2015 at 02:57:15PM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > I can reproduce your issue on my side with above patch (and only when
> > F_GUEST_ANNOUNCE is not set at DPDK vhost lib). TBH, I don't know
> > why that happened, the cause could be subtle, and I don't think it's
> > worthwhile to dig it, especially it's not the right way to do it.
> 
>  May be not right, may be it can be done... Actually, i found what was wrong. qemu tries to feed features back to vhost-user via
> VHOST_USER_SET_FEATURES, and DPDK barfs on the unknown bit. More tweaking is needed for qemu to do the trick correctly.
> 
> > So, would you please try to set the F_GUEST_ANNOUNCE flag on DPDK vhost
> > lib side, as my early diff showed and have another test?
> 
>  Tried it, works fine, thank you.

Thanks for the test.

However, I'm more curious about the ping loss? Did you still see
that? And to be more specific, have the wireshark captured the
GRAP from the guest?  And what's the output of 'grep virtio /proc/interrupts'
inside guest?

	--yliu


>  I have almost implemented the workaround in qemu... However now i start to think that you are right. Theoretically, the application
> may want to suppress GUEST_ANNOUNCE for some reason. So, let it stay this way. Please include this bit into your v2.
> 
> Kind regards,
> Pavel Fedin
> Expert Engineer
> Samsung Electronics Research center Russia
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-16  7:28                           ` Yuanhan Liu
@ 2015-12-16 11:57                             ` Pavel Fedin
  2015-12-16 12:08                               ` Yuanhan Liu
  0 siblings, 1 reply; 136+ messages in thread
From: Pavel Fedin @ 2015-12-16 11:57 UTC (permalink / raw)
  To: 'Yuanhan Liu'; +Cc: dev, 'Victor Kaplansky'

 Hello!

> I can reproduce your issue on my side with above patch (and only when
> F_GUEST_ANNOUNCE is not set at DPDK vhost lib). TBH, I don't know
> why that happened, the cause could be subtle, and I don't think it's
> worthwhile to dig it, especially it's not the right way to do it.

 May be not right, may be it can be done... Actually, i found what was wrong. qemu tries to feed features back to vhost-user via
VHOST_USER_SET_FEATURES, and DPDK barfs on the unknown bit. More tweaking is needed for qemu to do the trick correctly.

> So, would you please try to set the F_GUEST_ANNOUNCE flag on DPDK vhost
> lib side, as my early diff showed and have another test?

 Tried it, works fine, thank you.
 I have almost implemented the workaround in qemu... However now i start to think that you are right. Theoretically, the application
may want to suppress GUEST_ANNOUNCE for some reason. So, let it stay this way. Please include this bit into your v2.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15 14:58                         ` Pavel Fedin
@ 2015-12-16  7:28                           ` Yuanhan Liu
  2015-12-16 11:57                             ` Pavel Fedin
  0 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-16  7:28 UTC (permalink / raw)
  To: Pavel Fedin; +Cc: dev, 'Victor Kaplansky'

On Tue, Dec 15, 2015 at 05:58:28PM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > No idea. Maybe you have changed some other configures (such as of ovs)
> > without notice? Or, the ovs bridge interface resets?
> 
>  I don't touch the ovs at all. Just shut down the guest, rebuild the qemu, reinstall it, run the guest.
> 
> > 
> > BTW, would you please try my v1 patch set with above diff applied to
> > see if the ping loss is still there. You might also want to run tcpdump
> > with the dest host ovs bridge, to see if GARP is actually sent.
> 
>  Retested with wireshark running on the host. I used my qemu patch instead, but it should not matter at all:
> --- cut ---
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 1b6c5ac..5ca2987 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -480,7 +480,12 @@ static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64)
> 
>  static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features)
>  {
> -    return vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features);
> +    int ret = vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features);
> +
> +    if (!ret) {
> +        virtio_add_feature(features, VIRTIO_NET_F_GUEST_ANNOUNCE);
> +    }
> +    return ret;
>  }
> 
>  static int vhost_user_set_owner(struct vhost_dev *dev)
> --- cut ---
> 
>  So, here are both wireshark captures on the host side:

Pavel,

I can reproduce your issue on my side with above patch (and only when
F_GUEST_ANNOUNCE is not set at DPDK vhost lib). TBH, I don't know
why that happened, the cause could be subtle, and I don't think it's
worthwhile to dig it, especially it's not the right way to do it.

So, would you please try to set the F_GUEST_ANNOUNCE flag on DPDK vhost
lib side, as my early diff showed and have another test?

On the other hand, I failed to find two identical server, the two closet
I found are E5-2695 and E5-2699, However, the MSI lost fatal bug still
occurred. I'm out of thoughts what could be the root cause. I'm asking
help from som KVM gurus; hopefully they could shine some lights on.
Meanwhile, I may need try to debug it.

Since you don't meet such issue, I'd hope you could have a test and
tell me how it works :)

Thanks.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-16  2:38                     ` Peter Xu
  2015-12-16  2:50                       ` Yuanhan Liu
@ 2015-12-16  7:05                       ` Pavel Fedin
  1 sibling, 0 replies; 136+ messages in thread
From: Pavel Fedin @ 2015-12-16  7:05 UTC (permalink / raw)
  To: 'Peter Xu', 'Thibaut Collet'
  Cc: dev, 'Victor Kaplansky'

 Hello!

> 1. if vhost-user backend (or say, DPDK) supports GUEST_ANNOUNCE, and
>    send another RARP (or say, GARP, I will use RARP as example),
>    then there will be two RARP later on the line, right? (since the
>    QEMU one is sent unconditionally from qemu_announce_self).

 qemu_announce_self() is NOT unconditional. It applies only to emulated physical NICs and bypasses virtio/vhost. So it will not send anything at all for vhost-user.

> 2. if the only thing vhost-user backend is to send another same RARP
>    when got SEND_RARP request, why would it bother if QEMU will
>    unconditionally send one?

 See above, it won't send one.
 It looks to me like qemu_announce_self() is just a poor man's solution which even doesn't always work (because GARP should reassociate an existing IP with new MAC, shouldn't it? and qemu doesn't know the IP and just sets both src and dst to 0.0.0.0).

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-16  2:38                     ` Peter Xu
@ 2015-12-16  2:50                       ` Yuanhan Liu
  2015-12-16  7:05                       ` Pavel Fedin
  1 sibling, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-16  2:50 UTC (permalink / raw)
  To: Peter Xu; +Cc: dev, Victor Kaplansky

On Wed, Dec 16, 2015 at 10:38:03AM +0800, Peter Xu wrote:
> On Tue, Dec 15, 2015 at 04:07:57PM +0100, Thibaut Collet wrote:
> > After a migration, to avoid netwotk outage, all interfaces of the guest
> > must send a packet to update switches mapping (ideally a GARP).
> > As some interfaces do not do it QEMU does it in behalf of the guest by
> > sending a RARP (his RARP is not forged by the guest but by QEMU). This is
> > the qemu_self_announce purpose that "spoofs" a RARP to all backend of guest
> > ethernet interfaces. For vhost-user backend, QEMU can not do it directly
> > and asks to the vhost-user backend to do it with the VHOST_USER_SEND_RARP
> > request that contains the MAC address of the guest interface.
> > 
> > Thibaut.
> 
> Hi, Thibaut,
> 
> Thanks for the explaination.
> 
> Two more questions:
> 
> 1. if vhost-user backend (or say, DPDK) supports GUEST_ANNOUNCE, and
>    send another RARP (or say, GARP, I will use RARP as example),
>    then there will be two RARP later on the line, right? (since the
>    QEMU one is sent unconditionally from qemu_announce_self).

The one send by qemu_announce_self() will be caught by
vhost_user_receive(), which ends up invoking vhost_user_migration_done().
And it will be dropped when VIRTIO_NET_F_GUEST_ANNOUNCE is negotiated
there.

> 2. if the only thing vhost-user backend is to send another same RARP
>    when got SEND_RARP request, why would it bother if QEMU will
>    unconditionally send one? (or say, I still do not know why we
>    need this SEND_RARP request, if the vhost-user backend is going
>    to do the same thing again as QEMU already does)

Because that one is caught by vhost-user, and vhost-user just relays
it to the backend when necessary (say when GUEST_ANNOUNCE is not
supported)?

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15 15:07                   ` Thibaut Collet
  2015-12-15 15:36                     ` Pavel Fedin
@ 2015-12-16  2:38                     ` Peter Xu
  2015-12-16  2:50                       ` Yuanhan Liu
  2015-12-16  7:05                       ` Pavel Fedin
  1 sibling, 2 replies; 136+ messages in thread
From: Peter Xu @ 2015-12-16  2:38 UTC (permalink / raw)
  To: Thibaut Collet; +Cc: dev, Victor Kaplansky

On Tue, Dec 15, 2015 at 04:07:57PM +0100, Thibaut Collet wrote:
> After a migration, to avoid netwotk outage, all interfaces of the guest
> must send a packet to update switches mapping (ideally a GARP).
> As some interfaces do not do it QEMU does it in behalf of the guest by
> sending a RARP (his RARP is not forged by the guest but by QEMU). This is
> the qemu_self_announce purpose that "spoofs" a RARP to all backend of guest
> ethernet interfaces. For vhost-user backend, QEMU can not do it directly
> and asks to the vhost-user backend to do it with the VHOST_USER_SEND_RARP
> request that contains the MAC address of the guest interface.
> 
> Thibaut.

Hi, Thibaut,

Thanks for the explaination.

Two more questions:

1. if vhost-user backend (or say, DPDK) supports GUEST_ANNOUNCE, and
   send another RARP (or say, GARP, I will use RARP as example),
   then there will be two RARP later on the line, right? (since the
   QEMU one is sent unconditionally from qemu_announce_self).

2. if the only thing vhost-user backend is to send another same RARP
   when got SEND_RARP request, why would it bother if QEMU will
   unconditionally send one? (or say, I still do not know why we
   need this SEND_RARP request, if the vhost-user backend is going
   to do the same thing again as QEMU already does)

Thanks in advance.
Peter

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15 15:07                   ` Thibaut Collet
@ 2015-12-15 15:36                     ` Pavel Fedin
  2015-12-16  2:38                     ` Peter Xu
  1 sibling, 0 replies; 136+ messages in thread
From: Pavel Fedin @ 2015-12-15 15:36 UTC (permalink / raw)
  To: 'Thibaut Collet', 'Yuanhan Liu'
  Cc: dev, 'Victor Kaplansky'

 Hello!
 
> After a migration, to avoid netwotk outage, all interfaces of the guest must send a packet to update switches mapping (ideally a GARP).
> As some interfaces do not do it QEMU does it in behalf of the guest by sending a RARP (his RARP is not forged by the guest but by QEMU). This is the
> qemu_self_announce purpose that "spoofs" a RARP to all backend of guest ethernet interfaces. For vhost-user backend, QEMU can not do it directly

 Aha, see it now. qemu_announce_self() uses qemu_foreach_nic(), which actually iterates only over NET_CLIENT_OPTIONS_KIND_NIC interfaces. I expect these are fully emulated hardware controllers. virtio uses another type (see enum NetClientOptionsKind).
 So, we can happily ignore qemu_announce_self(), it does not do anything for us. Thanks for pointing it out.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15 13:18                 ` Yuanhan Liu
@ 2015-12-15 15:07                   ` Thibaut Collet
  2015-12-15 15:36                     ` Pavel Fedin
  2015-12-16  2:38                     ` Peter Xu
  0 siblings, 2 replies; 136+ messages in thread
From: Thibaut Collet @ 2015-12-15 15:07 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky

On Tue, Dec 15, 2015 at 2:18 PM, Yuanhan Liu <yuanhan.liu@linux.intel.com>
wrote:

> On Tue, Dec 15, 2015 at 12:47:47PM +0100, Thibaut Collet wrote:
> > On Tue, Dec 15, 2015 at 12:43 PM, Thibaut Collet <
> thibaut.collet@6wind.com>
> > wrote:
> >
> > >
> > >
> > > On Tue, Dec 15, 2015 at 11:05 AM, Peter Xu <peterx@redhat.com> wrote:
> > >
> > >> On Tue, Dec 15, 2015 at 11:45:56AM +0300, Pavel Fedin wrote:
> > >> >  To tell the truth, i don't know. I am also learning qemu internals
> on
> > >> the fly. Indeed, i see that it should announce itself. But
> > >> > this brings up a question: why do we need special announce
> procedure in
> > >> vhost-user then?
> > >>
> > >> I have the same question. Here is my guess...
> > >>
> > >> In customized networks, maybe people are not using ARP at all? When
> > >> we use DPDK, we directly pass through the network logic inside
> > >> kernel itself. So logically all the network protocols could be
> > >> customized by the user of it. In the customized network, maybe there
> > >> is some other protocol (rather than RARP) that would do the same
> > >> thing as what ARP/RARP does. So, this SEND_RARP request could give
> > >> the vhost-user backend a chance to format its own announce packet
> > >> and broadcast (in the SEND_RARP request, the guest's mac address
> > >> will be appended).
> > >>
> > >> CCing Victor to better know the truth...
> > >>
> > >> Peter
> > >>
> > >
>
> Hey Thibaut,
>
> First of all, thanks a lot for your lengthy explanation.
>
> > > Hi,
> > >
> > > After a migration, to avoid network outage, the guest must announce its
> > > new location to the L2 layer, typically with a GARP. Otherwise requests
> > > sent to the guest arrive to the old host until a ARP request is sent
> (after
> > > 30 seconds) or the guest sends some data.
> > >
> > > QEMU implementation of self announce after a migration with a vhost
> > > backend is the following:
> > >  - If the VIRTIO_GUEST_ANNOUNCE feature has been negotiated the guest
> > > sends automatically a GARP.
>
> I'm kind of clear how VIRTIO_GUEST_ANNOUNCE works so far, except that I
> met a bug, which I will describe in another email.
>
> > >  - Else if the vhost backend implements VHOST_USER_SEND_RARP this
> request
> > > is sent to the vhost backend. When this message is received the vhost
> > > backend must act as it receives a RARP from the guest (purpose of this
> RARP
>
> Can you be more specific about this? Say, what kind of acts the vhost
> backend should do exactly?
>
> > > is to update switches' MAC->port maaping as a GARP).
>
> Isn't it vhost library is not aware of swtich at all? How could we
> update switches's MAC-port mapping inside vhost library?
>
> > This RARP is a false
> > > one, created by the vhost backend,
>
> I'm a bit confused now. You were just saying "vhost backend must act
> as it __recevives__ a RARP from the guest", and you are now saying
> "the RARP is a false one __created__ by the vhost backend".
>
> Thanks.
>
>         --yliu
>

After a migration, to avoid netwotk outage, all interfaces of the guest
must send a packet to update switches mapping (ideally a GARP).
As some interfaces do not do it QEMU does it in behalf of the guest by
sending a RARP (his RARP is not forged by the guest but by QEMU). This is
the qemu_self_announce purpose that "spoofs" a RARP to all backend of guest
ethernet interfaces. For vhost-user backend, QEMU can not do it directly
and asks to the vhost-user backend to do it with the VHOST_USER_SEND_RARP
request that contains the MAC address of the guest interface.

Thibaut.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15 13:59                       ` Yuanhan Liu
@ 2015-12-15 14:58                         ` Pavel Fedin
  2015-12-16  7:28                           ` Yuanhan Liu
  0 siblings, 1 reply; 136+ messages in thread
From: Pavel Fedin @ 2015-12-15 14:58 UTC (permalink / raw)
  To: 'Yuanhan Liu'; +Cc: dev, 'Victor Kaplansky'

 Hello!

> No idea. Maybe you have changed some other configures (such as of ovs)
> without notice? Or, the ovs bridge interface resets?

 I don't touch the ovs at all. Just shut down the guest, rebuild the qemu, reinstall it, run the guest.

> 
> BTW, would you please try my v1 patch set with above diff applied to
> see if the ping loss is still there. You might also want to run tcpdump
> with the dest host ovs bridge, to see if GARP is actually sent.

 Retested with wireshark running on the host. I used my qemu patch instead, but it should not matter at all:
--- cut ---
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 1b6c5ac..5ca2987 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -480,7 +480,12 @@ static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64)

 static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features)
 {
-    return vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features);
+    int ret = vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features);
+
+    if (!ret) {
+        virtio_add_feature(features, VIRTIO_NET_F_GUEST_ANNOUNCE);
+    }
+    return ret;
 }

 static int vhost_user_set_owner(struct vhost_dev *dev)
--- cut ---

 So, here are both wireshark captures on the host side:

1. Without the patch

root@nfv_test_x86_64 / # tshark -i ovs-br0
Running as user "root" and group "root". This could be dangerous.
Capturing on 'ovs-br0'
  1   0.000000           :: -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
  2   0.003304 RealtekU_3b:83:1a -> Broadcast    ARP 42 Gratuitous ARP for 192.168.6.2 (Reply)
  3   0.669957           :: -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
  4   0.858957           :: -> ff02::1:ff3b:831a ICMPv6 78 Neighbor Solicitation for fe80::5054:ff:fe3b:831a
  5   1.858968 fe80::5054:ff:fe3b:831a -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
  6   2.300948 fe80::5054:ff:fe3b:831a -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
  7   2.527088 fe80::5054:ff:fe3b:831a -> ff02::2      ICMPv6 62 Router Solicitation
  8   2.527800 RealtekU_3b:83:1a -> Broadcast    ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
  9   6.526814 fe80::5054:ff:fe3b:831a -> ff02::2      ICMPv6 62 Router Solicitation
 10  10.526993 fe80::5054:ff:fe3b:831a -> ff02::2      ICMPv6 62 Router Solicitation
 11  15.984632 RealtekU_3b:83:1a -> Broadcast    ARP 42 Who has 192.168.6.1?  Tell 192.168.6.2
 12  15.984643 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
 13  15.984772  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x0477, seq=1/256, ttl=64
 14  15.984798  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x0477, seq=1/256, ttl=64 (request in 13)
 15  16.984970  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x0477, seq=2/512, ttl=64
 16  16.984991  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x0477, seq=2/512, ttl=64 (request in 15)
 17  17.984956  192.168.6.2 -> 192.168.6.1  ICMP 98 Echo (ping) request  id=0x0477, seq=3/768, ttl=64
 18  17.984975  192.168.6.1 -> 192.168.6.2  ICMP 98 Echo (ping) reply    id=0x0477, seq=3/768, ttl=64 (request in 17)
 19  20.994535 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 Who has 192.168.6.2?  Tell 192.168.6.1
 20  20.994637 RealtekU_3b:83:1a -> be:e1:71:c1:47:4d ARP 42 192.168.6.2 is at 52:54:00:3b:83:1a
^C20 packets captured

2. With the patch

root@nfv_test_x86_64 / # tshark -i ovs-br0
Running as user "root" and group "root". This could be dangerous.
Capturing on 'ovs-br0'
  1   0.000000           :: -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
  2   0.000969 RealtekU_3b:83:1a -> Broadcast    ARP 42 Gratuitous ARP for 192.168.6.2 (Reply)
  3   0.156966           :: -> ff02::1:ff3b:831a ICMPv6 78 Neighbor Solicitation for fe80::5054:ff:fe3b:831a
  4   0.536948           :: -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
  5   1.156968 fe80::5054:ff:fe3b:831a -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
  6   1.312708 fe80::5054:ff:fe3b:831a -> ff02::2      ICMPv6 62 Router Solicitation
  7   1.629960 fe80::5054:ff:fe3b:831a -> ff02::16     ICMPv6 90 Multicast Listener Report Message v2
  8   2.314713 RealtekU_3b:83:1a -> Broadcast    ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
  9   5.313333 fe80::5054:ff:fe3b:831a -> ff02::2      ICMPv6 62 Router Solicitation
 10   9.315486 fe80::5054:ff:fe3b:831a -> ff02::2      ICMPv6 62 Router Solicitation
 11  21.536450 RealtekU_3b:83:1a -> Broadcast    ARP 42 Who has 192.168.6.1?  Tell 192.168.6.2
 12  21.536461 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
 13  22.538937 RealtekU_3b:83:1a -> Broadcast    ARP 42 Who has 192.168.6.1?  Tell 192.168.6.2
 14  22.538943 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
 15  23.540937 RealtekU_3b:83:1a -> Broadcast    ARP 42 Who has 192.168.6.1?  Tell 192.168.6.2
 16  23.540942 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
 17  25.537519 RealtekU_3b:83:1a -> Broadcast    ARP 42 Who has 192.168.6.1?  Tell 192.168.6.2
 18  25.537525 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
 19  26.538939 RealtekU_3b:83:1a -> Broadcast    ARP 42 Who has 192.168.6.1?  Tell 192.168.6.2
 20  26.538944 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
 21  27.540937 RealtekU_3b:83:1a -> Broadcast    ARP 42 Who has 192.168.6.1?  Tell 192.168.6.2
 22  27.540942 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
 23  29.538475 RealtekU_3b:83:1a -> Broadcast    ARP 42 Who has 192.168.6.1?  Tell 192.168.6.2
 24  29.538482 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
 25  30.538935 RealtekU_3b:83:1a -> Broadcast    ARP 42 Who has 192.168.6.1?  Tell 192.168.6.2
 26  30.538941 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
 27  31.540935 RealtekU_3b:83:1a -> Broadcast    ARP 42 Who has 192.168.6.1?  Tell 192.168.6.2
 28  31.540941 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
^C28 packets captured

 Obviously, the guest simply doesn't read incoming packets. ifconfig for the interface on guest side shows:

RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 9 overruns 0 frame 9

 BTW, number 9 exactly matches the number of ARP replies from the host. The question is - why? Looks like guest behavior changes
somehow. Is it a bug in guest? It's very strange, because in these sessions i see only one difference in IPv6 packets:

  4   0.858957           :: -> ff02::1:ff3b:831a ICMPv6 78 Neighbor Solicitation for fe80::5054:ff:fe3b:831a

This is present in session #1 and missing from session #2. Can it affect the whole thing somehow? But i don't even use IPv6.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15 13:48                     ` Pavel Fedin
@ 2015-12-15 13:59                       ` Yuanhan Liu
  2015-12-15 14:58                         ` Pavel Fedin
  0 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-15 13:59 UTC (permalink / raw)
  To: Pavel Fedin; +Cc: dev, 'Victor Kaplansky'

On Tue, Dec 15, 2015 at 04:48:12PM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > >  Wrong. I tried to unconditionally enforce it in qemu (my guest does support it), and the
> > link stopped working at all. I don't understand why.
> > 
> > I'm wondering how did you do that? Why do you need enforece it in QEMU?
> > Isn't it already supported so far?
> 
>  I mean - qemu first asks vhost-user server (ovs/DPDK in our case) about capabilities, then negotiates them with the guest. And DPDK
> doesn't report VIRTIO_NET_F_GUEST_ANNOUNCE, so i just ORed this flag in qemu before the negotiation with guest (because indeed my
> logic says that the host should not do anything special about it). So the overall effect is the same as in your patch

I see.

> 
> > diff --git a/lib/librte_vhost/virtio-net.c
> > b/lib/librte_vhost/virtio-net.c
> > index 03044f6..0ba5045 100644
> > --- a/lib/librte_vhost/virtio-net.c
> > +++ b/lib/librte_vhost/virtio-net.c
> > @@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
> >  #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
> >                                 (1ULL << VIRTIO_NET_F_CTRL_VQ) | \
> >                                 (1ULL << VIRTIO_NET_F_CTRL_RX) | \
> > +                               (1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
> >                                 (VHOST_SUPPORTS_MQ)            | \
> >                                 (1ULL << VIRTIO_F_VERSION_1)   | \
> >                                 (1ULL << VHOST_F_LOG_ALL)      | \
> 
>  But i was somehow wrong and this causes the whole thing to stop working instead. Even after just booting up the network doesn't
> work and PINGs do not pass.

No idea. Maybe you have changed some other configures (such as of ovs)
without notice? Or, the ovs bridge interface resets?

BTW, would you please try my v1 patch set with above diff applied to
see if the ping loss is still there. You might also want to run tcpdump
with the dest host ovs bridge, to see if GARP is actually sent.

> 
> > However, I found the GARP is not sent out at all, due to an error
> > I met and reported before:
> > 
> >     KVM: injection failed, MSI lost (Operation not permitted)

I was thinking that may be caused by the difference of my two hosts (a
desktop and a server). I will try to find two similar hosts tomorrow
to do more tests. Besides that, it'd be great if you could do a more
test with above diff applied.

	--yliu
> 
>  Interesting, i don't have this problem here. Some bug in your kernel/hardware?
> 
> > One thing worth noting is that it happened only when I did live migration
> > on two different hosts (the two hosts happened to be using a same old
> > kernel: v3.11.10).  It works pretty well on same host. So, seems like
> > a KVM bug then?
> 
>  3.18.9 here and no this problem.
> 
> Kind regards,
> Pavel Fedin
> Expert Engineer
> Samsung Electronics Research center Russia
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15 13:36                   ` Yuanhan Liu
@ 2015-12-15 13:48                     ` Pavel Fedin
  2015-12-15 13:59                       ` Yuanhan Liu
  0 siblings, 1 reply; 136+ messages in thread
From: Pavel Fedin @ 2015-12-15 13:48 UTC (permalink / raw)
  To: 'Yuanhan Liu'; +Cc: dev, 'Victor Kaplansky'

 Hello!

> >  Wrong. I tried to unconditionally enforce it in qemu (my guest does support it), and the
> link stopped working at all. I don't understand why.
> 
> I'm wondering how did you do that? Why do you need enforece it in QEMU?
> Isn't it already supported so far?

 I mean - qemu first asks vhost-user server (ovs/DPDK in our case) about capabilities, then negotiates them with the guest. And DPDK
doesn't report VIRTIO_NET_F_GUEST_ANNOUNCE, so i just ORed this flag in qemu before the negotiation with guest (because indeed my
logic says that the host should not do anything special about it). So the overall effect is the same as in your patch

> diff --git a/lib/librte_vhost/virtio-net.c
> b/lib/librte_vhost/virtio-net.c
> index 03044f6..0ba5045 100644
> --- a/lib/librte_vhost/virtio-net.c
> +++ b/lib/librte_vhost/virtio-net.c
> @@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
>  #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
>                                 (1ULL << VIRTIO_NET_F_CTRL_VQ) | \
>                                 (1ULL << VIRTIO_NET_F_CTRL_RX) | \
> +                               (1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
>                                 (VHOST_SUPPORTS_MQ)            | \
>                                 (1ULL << VIRTIO_F_VERSION_1)   | \
>                                 (1ULL << VHOST_F_LOG_ALL)      | \

 But i was somehow wrong and this causes the whole thing to stop working instead. Even after just booting up the network doesn't
work and PINGs do not pass.

> However, I found the GARP is not sent out at all, due to an error
> I met and reported before:
> 
>     KVM: injection failed, MSI lost (Operation not permitted)

 Interesting, i don't have this problem here. Some bug in your kernel/hardware?

> One thing worth noting is that it happened only when I did live migration
> on two different hosts (the two hosts happened to be using a same old
> kernel: v3.11.10).  It works pretty well on same host. So, seems like
> a KVM bug then?

 3.18.9 here and no this problem.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15 12:24                 ` Pavel Fedin
@ 2015-12-15 13:36                   ` Yuanhan Liu
  2015-12-15 13:48                     ` Pavel Fedin
  0 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-15 13:36 UTC (permalink / raw)
  To: Pavel Fedin; +Cc: dev, 'Victor Kaplansky'

On Tue, Dec 15, 2015 at 03:24:48PM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > After a migration, to avoid network outage, the guest must announce its new location to the L2 layer, typically with a GARP. Otherwise requests sent to
> > the guest arrive to the old host until a ARP request is sent (after 30 seconds) or the guest sends some data.
> > QEMU implementation of self announce after a migration with a vhost backend is the following:
> > - If the VIRTIO_GUEST_ANNOUNCE feature has been negotiated the guest sends automatically a GARP.
> > - Else if the vhost backend implements VHOST_USER_SEND_RARP this request is sent to the vhost backend. When this message is received the vhost backend
> > must act as it receives a RARP from the guest (purpose of this RARP is to update switches' MAC->port maaping as a GARP). This RARP is a false one,
> > created by the vhost backend,
> > - Else nothing is done and we have a network outage until a ARP is sent or the guest sends some data.
> 
>  But what is qemu_announce_self() then? It's just unconditionally triggered after migration, but indeed sends some strange thing.
> 
> > VIRTIO_GUEST_ANNOUNCE feature is negotiated if:
> >  - the vhost backend announces the support of this feature. Maybe QEMU can be updated to support unconditionnaly this feature
> 
>  Wrong. I tried to unconditionally enforce it in qemu (my guest does support it), and the link stopped working at all. I don't understand why.

I'm wondering how did you do that? Why do you need enforece it in QEMU?
Isn't it already supported so far?

Actually, what's we need to do is to add such feature bit in vhost
library, to claim we support it so that the the guest will send a 
gratuitous ARP when migration is done (check virtio_net_load()).

----
diff --git a/lib/librte_vhost/virtio-net.c
b/lib/librte_vhost/virtio-net.c
index 03044f6..0ba5045 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
 #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
                                (1ULL << VIRTIO_NET_F_CTRL_VQ) | \
                                (1ULL << VIRTIO_NET_F_CTRL_RX) | \
+                               (1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
                                (VHOST_SUPPORTS_MQ)            | \
                                (1ULL << VIRTIO_F_VERSION_1)   | \
                                (1ULL << VHOST_F_LOG_ALL)      | \

However, I found the GARP is not sent out at all, due to an error
I met and reported before:

    KVM: injection failed, MSI lost (Operation not permitted)

Which happened at the time QEMU is about to send the interrupt to the 
guest for announce event. However, it failed, hence no GARP was received.

One thing worth noting is that it happened only when I did live migration
on two different hosts (the two hosts happened to be using a same old 
kernel: v3.11.10).  It works pretty well on same host. So, seems like
a KVM bug then?

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15 11:47               ` Thibaut Collet
  2015-12-15 12:24                 ` Pavel Fedin
@ 2015-12-15 13:18                 ` Yuanhan Liu
  2015-12-15 15:07                   ` Thibaut Collet
  1 sibling, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-15 13:18 UTC (permalink / raw)
  To: Thibaut Collet; +Cc: dev, Victor Kaplansky

On Tue, Dec 15, 2015 at 12:47:47PM +0100, Thibaut Collet wrote:
> On Tue, Dec 15, 2015 at 12:43 PM, Thibaut Collet <thibaut.collet@6wind.com>
> wrote:
> 
> >
> >
> > On Tue, Dec 15, 2015 at 11:05 AM, Peter Xu <peterx@redhat.com> wrote:
> >
> >> On Tue, Dec 15, 2015 at 11:45:56AM +0300, Pavel Fedin wrote:
> >> >  To tell the truth, i don't know. I am also learning qemu internals on
> >> the fly. Indeed, i see that it should announce itself. But
> >> > this brings up a question: why do we need special announce procedure in
> >> vhost-user then?
> >>
> >> I have the same question. Here is my guess...
> >>
> >> In customized networks, maybe people are not using ARP at all? When
> >> we use DPDK, we directly pass through the network logic inside
> >> kernel itself. So logically all the network protocols could be
> >> customized by the user of it. In the customized network, maybe there
> >> is some other protocol (rather than RARP) that would do the same
> >> thing as what ARP/RARP does. So, this SEND_RARP request could give
> >> the vhost-user backend a chance to format its own announce packet
> >> and broadcast (in the SEND_RARP request, the guest's mac address
> >> will be appended).
> >>
> >> CCing Victor to better know the truth...
> >>
> >> Peter
> >>
> >

Hey Thibaut,

First of all, thanks a lot for your lengthy explanation.

> > Hi,
> >
> > After a migration, to avoid network outage, the guest must announce its
> > new location to the L2 layer, typically with a GARP. Otherwise requests
> > sent to the guest arrive to the old host until a ARP request is sent (after
> > 30 seconds) or the guest sends some data.
> >
> > QEMU implementation of self announce after a migration with a vhost
> > backend is the following:
> >  - If the VIRTIO_GUEST_ANNOUNCE feature has been negotiated the guest
> > sends automatically a GARP.

I'm kind of clear how VIRTIO_GUEST_ANNOUNCE works so far, except that I
met a bug, which I will describe in another email.

> >  - Else if the vhost backend implements VHOST_USER_SEND_RARP this request
> > is sent to the vhost backend. When this message is received the vhost
> > backend must act as it receives a RARP from the guest (purpose of this RARP

Can you be more specific about this? Say, what kind of acts the vhost
backend should do exactly?

> > is to update switches' MAC->port maaping as a GARP).

Isn't it vhost library is not aware of swtich at all? How could we
update switches's MAC-port mapping inside vhost library?

> This RARP is a false
> > one, created by the vhost backend,

I'm a bit confused now. You were just saying "vhost backend must act
as it __recevives__ a RARP from the guest", and you are now saying
"the RARP is a false one __created__ by the vhost backend".

Thanks.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15 11:47               ` Thibaut Collet
@ 2015-12-15 12:24                 ` Pavel Fedin
  2015-12-15 13:36                   ` Yuanhan Liu
  2015-12-15 13:18                 ` Yuanhan Liu
  1 sibling, 1 reply; 136+ messages in thread
From: Pavel Fedin @ 2015-12-15 12:24 UTC (permalink / raw)
  To: 'Thibaut Collet', 'Peter Xu'
  Cc: dev, 'Victor Kaplansky'

 Hello!

> After a migration, to avoid network outage, the guest must announce its new location to the L2 layer, typically with a GARP. Otherwise requests sent to
> the guest arrive to the old host until a ARP request is sent (after 30 seconds) or the guest sends some data.
> QEMU implementation of self announce after a migration with a vhost backend is the following:
> - If the VIRTIO_GUEST_ANNOUNCE feature has been negotiated the guest sends automatically a GARP.
> - Else if the vhost backend implements VHOST_USER_SEND_RARP this request is sent to the vhost backend. When this message is received the vhost backend
> must act as it receives a RARP from the guest (purpose of this RARP is to update switches' MAC->port maaping as a GARP). This RARP is a false one,
> created by the vhost backend,
> - Else nothing is done and we have a network outage until a ARP is sent or the guest sends some data.

 But what is qemu_announce_self() then? It's just unconditionally triggered after migration, but indeed sends some strange thing.

> VIRTIO_GUEST_ANNOUNCE feature is negotiated if:
>  - the vhost backend announces the support of this feature. Maybe QEMU can be updated to support unconditionnaly this feature

 Wrong. I tried to unconditionally enforce it in qemu (my guest does support it), and the link stopped working at all. I don't understand why.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15 11:43             ` Thibaut Collet
@ 2015-12-15 11:47               ` Thibaut Collet
  2015-12-15 12:24                 ` Pavel Fedin
  2015-12-15 13:18                 ` Yuanhan Liu
  0 siblings, 2 replies; 136+ messages in thread
From: Thibaut Collet @ 2015-12-15 11:47 UTC (permalink / raw)
  To: Peter Xu; +Cc: dev, Victor Kaplansky

On Tue, Dec 15, 2015 at 12:43 PM, Thibaut Collet <thibaut.collet@6wind.com>
wrote:

>
>
> On Tue, Dec 15, 2015 at 11:05 AM, Peter Xu <peterx@redhat.com> wrote:
>
>> On Tue, Dec 15, 2015 at 11:45:56AM +0300, Pavel Fedin wrote:
>> >  To tell the truth, i don't know. I am also learning qemu internals on
>> the fly. Indeed, i see that it should announce itself. But
>> > this brings up a question: why do we need special announce procedure in
>> vhost-user then?
>>
>> I have the same question. Here is my guess...
>>
>> In customized networks, maybe people are not using ARP at all? When
>> we use DPDK, we directly pass through the network logic inside
>> kernel itself. So logically all the network protocols could be
>> customized by the user of it. In the customized network, maybe there
>> is some other protocol (rather than RARP) that would do the same
>> thing as what ARP/RARP does. So, this SEND_RARP request could give
>> the vhost-user backend a chance to format its own announce packet
>> and broadcast (in the SEND_RARP request, the guest's mac address
>> will be appended).
>>
>> CCing Victor to better know the truth...
>>
>> Peter
>>
>
>
> Hi,
>
> After a migration, to avoid network outage, the guest must announce its
> new location to the L2 layer, typically with a GARP. Otherwise requests
> sent to the guest arrive to the old host until a ARP request is sent (after
> 30 seconds) or the guest sends some data.
>
> QEMU implementation of self announce after a migration with a vhost
> backend is the following:
>  - If the VIRTIO_GUEST_ANNOUNCE feature has been negotiated the guest
> sends automatically a GARP.
>  - Else if the vhost backend implements VHOST_USER_SEND_RARP this request
> is sent to the vhost backend. When this message is received the vhost
> backend must act as it receives a RARP from the guest (purpose of this RARP
> is to update switches' MAC->port maaping as a GARP). This RARP is a false
> one, created by the vhost backend,
>  - Else nothing is done and we have a network outage until a ARP is sent
> or the guest sends some data.
>
>
> VIRTIO_GUEST_ANNOUNCE feature is negotiated if:
>   - the vhost backend announces the support of this feature. Maybe QEMU
> can be updated to support unconditionnaly this feature
>   - the virtio driver of the guest implements this feature. It is not the
> case for old kernel or dpdk virtio pmd.
>
> Regarding dpdk to have a migration of vhost interface with limited network
> outage we have to:
>
>   - Implement management VHOST_USER_SEND_RARP request to emulate a fake
> RARP for guest
>
> To do that we have to consider two kinds of guest:
>   1. Guest with virtio driver implementing VIRTIO_GUEST_ANNOUNCE feature
>   2. Guest with virtio driver that does not have the VIRTIO_GUEST_ANNOUNCE
> feature. This is the case with old kernel or guest running a dpdk (virtio
> pmd of dpdk does not have this feature)
>
> Guest with VIRTIO_GUEST_ANNOUNCE feature sends automatically some GARP
> after a migration if this feature has been negotiated. So the only thing to
> do it is to negotiate the VIRTIO_GUEST_ANNOUNCE feature between QEMU, vhost
> backend and the guest.
> For this kind of guest the vhost-backend must announce the support of
> VIRTIO_GUEST_ANNOUNCE feature. As vhost-backend has no particular action to
> do in this case the support of VIRTIO_GUEST_ANNOUNCE feature can be
> unconditionally set in QEMU in the future.
>
> For guest without VIRTIO_GUEST_ANNOUNCE feature we have to send a fake
> RARP: QEMU knows the MAC address of the guest and can create and broadcast
> a RARP. But in case of vhost-backend QEMU is not able to broadcast this
> fake RARP and must ask to the vhost backend to do it through the
> VHOST_USER_SEND_RARP request. When the vhost backend receives this message
> it must create a fake RARP message (as done by QEMU) and do the appropriate
> operation as this message has been sent by the guest through the virtio
> rings.
>
>
> To solve this point 2 solutions are implemented:
>  - After the migration the guest automatically sends GARP. This solution
> occurs if VIRTIO_GUEST_ANNOUNCE feature has been negotiated between QEMU
> and the guest.
>          * VIRTIO_GUEST_ANNOUNCE
>


Sorry my previous message will be sent by error (it is a draft with rework
in progress)

The full explanation are:

Hi,

After a migration, to avoid network outage, the guest must announce its new
location to the L2 layer, typically with a GARP. Otherwise requests sent to
the guest arrive to the old host until a ARP request is sent (after 30
seconds) or the guest sends some data.

QEMU implementation of self announce after a migration with a vhost backend
is the following:
 - If the VIRTIO_GUEST_ANNOUNCE feature has been negotiated the guest sends
automatically a GARP.
 - Else if the vhost backend implements VHOST_USER_SEND_RARP this request
is sent to the vhost backend. When this message is received the vhost
backend must act as it receives a RARP from the guest (purpose of this RARP
is to update switches' MAC->port maaping as a GARP). This RARP is a false
one, created by the vhost backend,
 - Else nothing is done and we have a network outage until a ARP is sent or
the guest sends some data.


VIRTIO_GUEST_ANNOUNCE feature is negotiated if:
  - the vhost backend announces the support of this feature. Maybe QEMU can
be updated to support unconditionnaly this feature
  - the virtio driver of the guest implements this feature. It is not the
case for old kernel or dpdk virtio pmd.

Regarding dpdk to have a migration of vhost interface with limited network
outage we have to:
  - In the vhost pmd
        * Announce supports of VIRTIO_GUEST_ANNOUNCE feature
        * Implement management of VHOST_USER_SEND_RARP request to emulate a
fake RARP if the VIRTIO_GUEST_ANNOUNCE feature is not implemented by the
guest

  - In the virtio pmd
        * Support VIRTIO_GUEST_ANNOUNCE feature to avoid RARP emission by
the host after a migration.


Hope this explanation will help

Regards.

Thibaut.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15 10:05           ` Peter Xu
@ 2015-12-15 11:43             ` Thibaut Collet
  2015-12-15 11:47               ` Thibaut Collet
  0 siblings, 1 reply; 136+ messages in thread
From: Thibaut Collet @ 2015-12-15 11:43 UTC (permalink / raw)
  To: Peter Xu; +Cc: dev, Victor Kaplansky

On Tue, Dec 15, 2015 at 11:05 AM, Peter Xu <peterx@redhat.com> wrote:

> On Tue, Dec 15, 2015 at 11:45:56AM +0300, Pavel Fedin wrote:
> >  To tell the truth, i don't know. I am also learning qemu internals on
> the fly. Indeed, i see that it should announce itself. But
> > this brings up a question: why do we need special announce procedure in
> vhost-user then?
>
> I have the same question. Here is my guess...
>
> In customized networks, maybe people are not using ARP at all? When
> we use DPDK, we directly pass through the network logic inside
> kernel itself. So logically all the network protocols could be
> customized by the user of it. In the customized network, maybe there
> is some other protocol (rather than RARP) that would do the same
> thing as what ARP/RARP does. So, this SEND_RARP request could give
> the vhost-user backend a chance to format its own announce packet
> and broadcast (in the SEND_RARP request, the guest's mac address
> will be appended).
>
> CCing Victor to better know the truth...
>
> Peter
>


Hi,

After a migration, to avoid network outage, the guest must announce its new
location to the L2 layer, typically with a GARP. Otherwise requests sent to
the guest arrive to the old host until a ARP request is sent (after 30
seconds) or the guest sends some data.

QEMU implementation of self announce after a migration with a vhost backend
is the following:
 - If the VIRTIO_GUEST_ANNOUNCE feature has been negotiated the guest sends
automatically a GARP.
 - Else if the vhost backend implements VHOST_USER_SEND_RARP this request
is sent to the vhost backend. When this message is received the vhost
backend must act as it receives a RARP from the guest (purpose of this RARP
is to update switches' MAC->port maaping as a GARP). This RARP is a false
one, created by the vhost backend,
 - Else nothing is done and we have a network outage until a ARP is sent or
the guest sends some data.


VIRTIO_GUEST_ANNOUNCE feature is negotiated if:
  - the vhost backend announces the support of this feature. Maybe QEMU can
be updated to support unconditionnaly this feature
  - the virtio driver of the guest implements this feature. It is not the
case for old kernel or dpdk virtio pmd.

Regarding dpdk to have a migration of vhost interface with limited network
outage we have to:

  - Implement management VHOST_USER_SEND_RARP request to emulate a fake
RARP for guest

To do that we have to consider two kinds of guest:
  1. Guest with virtio driver implementing VIRTIO_GUEST_ANNOUNCE feature
  2. Guest with virtio driver that does not have the VIRTIO_GUEST_ANNOUNCE
feature. This is the case with old kernel or guest running a dpdk (virtio
pmd of dpdk does not have this feature)

Guest with VIRTIO_GUEST_ANNOUNCE feature sends automatically some GARP
after a migration if this feature has been negotiated. So the only thing to
do it is to negotiate the VIRTIO_GUEST_ANNOUNCE feature between QEMU, vhost
backend and the guest.
For this kind of guest the vhost-backend must announce the support of
VIRTIO_GUEST_ANNOUNCE feature. As vhost-backend has no particular action to
do in this case the support of VIRTIO_GUEST_ANNOUNCE feature can be
unconditionally set in QEMU in the future.

For guest without VIRTIO_GUEST_ANNOUNCE feature we have to send a fake
RARP: QEMU knows the MAC address of the guest and can create and broadcast
a RARP. But in case of vhost-backend QEMU is not able to broadcast this
fake RARP and must ask to the vhost backend to do it through the
VHOST_USER_SEND_RARP request. When the vhost backend receives this message
it must create a fake RARP message (as done by QEMU) and do the appropriate
operation as this message has been sent by the guest through the virtio
rings.


To solve this point 2 solutions are implemented:
 - After the migration the guest automatically sends GARP. This solution
occurs if VIRTIO_GUEST_ANNOUNCE feature has been negotiated between QEMU
and the guest.
         * VIRTIO_GUEST_ANNOUNCE

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15  8:45         ` Pavel Fedin
  2015-12-15  8:56           ` Yuanhan Liu
@ 2015-12-15 10:05           ` Peter Xu
  2015-12-15 11:43             ` Thibaut Collet
  1 sibling, 1 reply; 136+ messages in thread
From: Peter Xu @ 2015-12-15 10:05 UTC (permalink / raw)
  To: Pavel Fedin; +Cc: dev, Victor Kaplansky

On Tue, Dec 15, 2015 at 11:45:56AM +0300, Pavel Fedin wrote:
>  To tell the truth, i don't know. I am also learning qemu internals on the fly. Indeed, i see that it should announce itself. But
> this brings up a question: why do we need special announce procedure in vhost-user then?

I have the same question. Here is my guess...

In customized networks, maybe people are not using ARP at all? When
we use DPDK, we directly pass through the network logic inside
kernel itself. So logically all the network protocols could be
customized by the user of it. In the customized network, maybe there
is some other protocol (rather than RARP) that would do the same
thing as what ARP/RARP does. So, this SEND_RARP request could give
the vhost-user backend a chance to format its own announce packet
and broadcast (in the SEND_RARP request, the guest's mac address
will be appended).

CCing Victor to better know the truth...

Peter

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15  8:23       ` Yuanhan Liu
  2015-12-15  8:45         ` Pavel Fedin
@ 2015-12-15  9:42         ` Peter Xu
  1 sibling, 0 replies; 136+ messages in thread
From: Peter Xu @ 2015-12-15  9:42 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev

On Tue, Dec 15, 2015 at 04:23:24PM +0800, Yuanhan Liu wrote:
> On Mon, Dec 14, 2015 at 11:58:42AM +0800, Peter Xu wrote:
> > If ping to guest from outside, when the migration finishes on the
> > target side of qemu, qemu_self_announce() will be called.
> 
> It's supposed to see some ARP requests if I run tcpdump against
> with the ovs bridge, right? However, in fact, I saw nothing from
> tcpdump on the target host when the migration is done.
> 
> I mean I do find that qemu_annouce_self composes an ARP
> broadcoast request, but it seems that I didn't catch it on
> the target host.
> 
> Something wrong, or someting I missed?

AFAIK, it should be RARP rather than ARP request. However, sorry
that I do not know the reason for its lossing either.

Btw, I did a very basic live migration using v1 patchset locally
today (one host, two QEMU instances attach to the same vhost-user
socket), it's working too on my host. :)

Thanks.
Peter

> 
> 	--yliu
> 
> > Although
> > we might see a warning like "Vhost user backend fails to broadcast
> > fake RARP" (notify is done by hacking vhost_user_receive(), even if
> > notify fails, things will still move on), QEMU should still send a
> > RARP onto the link.
> > 
> > Not sure whether I missed anything.
> > 
> > Thanks.
> > Peter

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15  8:56           ` Yuanhan Liu
@ 2015-12-15  9:04             ` Pavel Fedin
  0 siblings, 0 replies; 136+ messages in thread
From: Pavel Fedin @ 2015-12-15  9:04 UTC (permalink / raw)
  To: 'Yuanhan Liu', 'Thibaut Collet'; +Cc: dev

 Hello!

> Note quite sure. I found Thibaut submitted a patch to send
> VHOST_USER_SEND_RARP request after migration is done months
> ago. Thibaut, would you please elaborate it a bit more what
> should be done on vhost-user backend? To construct a gratuitous
> ARP request and broadcast it?

 By the way, some more info for you all.
1. I've just examined qemu_announce_self() and i see that IPs are all set to 0 in the packet it generates. It's quite logical
because qemu has no idea what address is used by the guest, even more, theoretically it could be not IPv4 at all. But then - how can
it work at all, and what's the use for this packet?
2. I tried to work around if by adding VIRTIO_NET_F_GUEST_ANNOUNCE. I expected that the guest will see it and make announcement by
itself. But result was quite the opposite - PING stopped working at all, right from the beginning, even without migration.

 Can local qemu/DPDK/etc gurus give some explanation?

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15  8:45         ` Pavel Fedin
@ 2015-12-15  8:56           ` Yuanhan Liu
  2015-12-15  9:04             ` Pavel Fedin
  2015-12-15 10:05           ` Peter Xu
  1 sibling, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-15  8:56 UTC (permalink / raw)
  To: Pavel Fedin, Thibaut Collet; +Cc: dev

On Tue, Dec 15, 2015 at 11:45:56AM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > I mean I do find that qemu_annouce_self composes an ARP
> > broadcoast request, but it seems that I didn't catch it on
> > the target host.
> > 
> > Something wrong, or someting I missed?
> 
>  To tell the truth, i don't know. I am also learning qemu internals on the fly. Indeed, i see that it should announce itself.

I was acutally asking Peter. Sorry for not making it clear and
thanks for your reply, anyway :)

> But
> this brings up a question: why do we need special announce procedure in vhost-user then?

Note quite sure. I found Thibaut submitted a patch to send
VHOST_USER_SEND_RARP request after migration is done months
ago. Thibaut, would you please elaborate it a bit more what
should be done on vhost-user backend? To construct a gratuitous
ARP request and broadcast it?

>  I think you can add some debug output and see how it works in realtime. This is what i normally do when i don't understand in which
> sequence things happen.

Thanks.

	--yliu
> 
> Kind regards,
> Pavel Fedin
> Expert Engineer
> Samsung Electronics Research center Russia
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-15  8:23       ` Yuanhan Liu
@ 2015-12-15  8:45         ` Pavel Fedin
  2015-12-15  8:56           ` Yuanhan Liu
  2015-12-15 10:05           ` Peter Xu
  2015-12-15  9:42         ` Peter Xu
  1 sibling, 2 replies; 136+ messages in thread
From: Pavel Fedin @ 2015-12-15  8:45 UTC (permalink / raw)
  To: 'Yuanhan Liu', 'Peter Xu'; +Cc: dev

 Hello!

> I mean I do find that qemu_annouce_self composes an ARP
> broadcoast request, but it seems that I didn't catch it on
> the target host.
> 
> Something wrong, or someting I missed?

 To tell the truth, i don't know. I am also learning qemu internals on the fly. Indeed, i see that it should announce itself. But
this brings up a question: why do we need special announce procedure in vhost-user then?
 I think you can add some debug output and see how it works in realtime. This is what i normally do when i don't understand in which
sequence things happen.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-14  3:58     ` Peter Xu
  2015-12-14  7:30       ` Pavel Fedin
@ 2015-12-15  8:23       ` Yuanhan Liu
  2015-12-15  8:45         ` Pavel Fedin
  2015-12-15  9:42         ` Peter Xu
  1 sibling, 2 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-15  8:23 UTC (permalink / raw)
  To: Peter Xu; +Cc: dev

On Mon, Dec 14, 2015 at 11:58:42AM +0800, Peter Xu wrote:
> On Fri, Dec 11, 2015 at 01:22:23PM +0300, Pavel Fedin wrote:
> >  BTW, it works, and it was my bad. openvswitch was configured incorrectly on the other side, vhost port number was different for
> > some reason, while ruleset was the same. I reconfigured it and now everything migrates correctly, except increased downtime because
> > of missing GARP (the guest misses some PINGs, then it retries ARP, which brings the link back up).
> 
> Hi,
> 
> When doing the ping, was it from the guest (to another host) or to
> the guest (from another host)?
> 
> In any case, I still could not understand why the ping loss happened
> in this test.
> 
> If ping from guest, no ARP refresh is required at all?
> 
> If ping to guest from outside, when the migration finishes on the
> target side of qemu, qemu_self_announce() will be called.

It's supposed to see some ARP requests if I run tcpdump against
with the ovs bridge, right? However, in fact, I saw nothing from
tcpdump on the target host when the migration is done.

I mean I do find that qemu_annouce_self composes an ARP
broadcoast request, but it seems that I didn't catch it on
the target host.

Something wrong, or someting I missed?

	--yliu

> Although
> we might see a warning like "Vhost user backend fails to broadcast
> fake RARP" (notify is done by hacking vhost_user_receive(), even if
> notify fails, things will still move on), QEMU should still send a
> RARP onto the link.
> 
> Not sure whether I missed anything.
> 
> Thanks.
> Peter

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-14 13:21                 ` Yuanhan Liu
  2015-12-14 13:28                   ` Peter Xu
@ 2015-12-14 14:54                   ` Pavel Fedin
  1 sibling, 0 replies; 136+ messages in thread
From: Pavel Fedin @ 2015-12-14 14:54 UTC (permalink / raw)
  To: 'Yuanhan Liu', 'Peter Xu'
  Cc: dev, 'Victor Kaplansky', 'Michael S. Tsirkin'

 Hello!

> > I _guess_ the problem for ping might be: guest ARP entry for
> > 192.168.100.1 is not updated. Or say, after guest migrated to host2
> > from host1, guest is still trying to send packet to host1's NIC (no
> > one is telling it to update, right?), so no one is responding the
> > ping. When the entry is expired, guest will resend the ARP request,
> > and host2 will respond this time, with mac address on host2 provided
> > this time. After that, ping works again.
> 
> Peter,
> 
> Thanks for your input, and that sounds reasonable. You just reminded
> me that the host1's NIC is indeed different with host2's NIC: the ovs
> bridge mac address is different.

 Yes, this is indeed what is happening, and actually i already wrote about it. In wireshark it looks exactly like that: the some
PINGs are sent without replies, then the guest redoes ARP, PING replies resume.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-14 13:28                   ` Peter Xu
@ 2015-12-14 13:51                     ` Yuanhan Liu
  0 siblings, 0 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-14 13:51 UTC (permalink / raw)
  To: Peter Xu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Mon, Dec 14, 2015 at 09:28:08PM +0800, Peter Xu wrote:
> On Mon, Dec 14, 2015 at 09:21:15PM +0800, Yuanhan Liu wrote:
> > Peter,
> > 
> > Thanks for your input, and that sounds reasonable. You just reminded
> > me that the host1's NIC is indeed different with host2's NIC: the ovs
> > bridge mac address is different.
> > 
> > I then had a quick try, setting the two ovs bridge with same mac
> > address, and it works like a charm: the gap is gone :)
> 
> Good to know that. :)
> 
> I will try to do some tests too using the patchset. Not sure whether
> I will encounter the same KVM warning (seems related to APIC,
> however still could not tell more than that). Will update you if
> there is anything helpful.

Appreciate that! It'd be good if you could test with my v2 patch set:
hopefully I could cook it up by the end of tomorrow.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-14 13:21                 ` Yuanhan Liu
@ 2015-12-14 13:28                   ` Peter Xu
  2015-12-14 13:51                     ` Yuanhan Liu
  2015-12-14 14:54                   ` Pavel Fedin
  1 sibling, 1 reply; 136+ messages in thread
From: Peter Xu @ 2015-12-14 13:28 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Mon, Dec 14, 2015 at 09:21:15PM +0800, Yuanhan Liu wrote:
> Peter,
> 
> Thanks for your input, and that sounds reasonable. You just reminded
> me that the host1's NIC is indeed different with host2's NIC: the ovs
> bridge mac address is different.
> 
> I then had a quick try, setting the two ovs bridge with same mac
> address, and it works like a charm: the gap is gone :)

Good to know that. :)

I will try to do some tests too using the patchset. Not sure whether
I will encounter the same KVM warning (seems related to APIC,
however still could not tell more than that). Will update you if
there is anything helpful.

Peter

> 
> 	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-14 13:00               ` Peter Xu
@ 2015-12-14 13:21                 ` Yuanhan Liu
  2015-12-14 13:28                   ` Peter Xu
  2015-12-14 14:54                   ` Pavel Fedin
  0 siblings, 2 replies; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-14 13:21 UTC (permalink / raw)
  To: Peter Xu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Mon, Dec 14, 2015 at 09:00:22PM +0800, Peter Xu wrote:
> On Mon, Dec 14, 2015 at 08:09:37PM +0800, Yuanhan Liu wrote:
> > It seems that we have exactly the same test environment set up: I have
> > one server (where I normally do vhost test there) and one desktop (my
> > dev box), 
> > 
> > On both hosts, there is an ovs bridge, with IP address 192.168.100.1
> > assigned manually. Later, I started a VM on the server, and manually
> > assigned IP to 192.168.100.10. I then run "ping 192.168.100.1" for
> > live migration testing.
> > 
> > The migration to my desktop somehow works (even though there are some
> > bugs in this patch set), however, I did see what Pavel saw: about 12
> > packets has been lost, which means about 12 seconds the network is not
> > working well.
> 
> Hi, Yuanhan,
> 
> I _guess_ the problem for ping might be: guest ARP entry for
> 192.168.100.1 is not updated. Or say, after guest migrated to host2
> from host1, guest is still trying to send packet to host1's NIC (no
> one is telling it to update, right?), so no one is responding the
> ping. When the entry is expired, guest will resend the ARP request,
> and host2 will respond this time, with mac address on host2 provided
> this time. After that, ping works again.

Peter,

Thanks for your input, and that sounds reasonable. You just reminded
me that the host1's NIC is indeed different with host2's NIC: the ovs
bridge mac address is different.

I then had a quick try, setting the two ovs bridge with same mac
address, and it works like a charm: the gap is gone :)

	--yliu
> 
> (not familiar with OVS too, so am just taking it as a "vritual"
> switch)
> 
> Thanks.
> Peter

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-14 12:09             ` Yuanhan Liu
@ 2015-12-14 13:00               ` Peter Xu
  2015-12-14 13:21                 ` Yuanhan Liu
  0 siblings, 1 reply; 136+ messages in thread
From: Peter Xu @ 2015-12-14 13:00 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Mon, Dec 14, 2015 at 08:09:37PM +0800, Yuanhan Liu wrote:
> It seems that we have exactly the same test environment set up: I have
> one server (where I normally do vhost test there) and one desktop (my
> dev box), 
> 
> On both hosts, there is an ovs bridge, with IP address 192.168.100.1
> assigned manually. Later, I started a VM on the server, and manually
> assigned IP to 192.168.100.10. I then run "ping 192.168.100.1" for
> live migration testing.
> 
> The migration to my desktop somehow works (even though there are some
> bugs in this patch set), however, I did see what Pavel saw: about 12
> packets has been lost, which means about 12 seconds the network is not
> working well.

Hi, Yuanhan,

I _guess_ the problem for ping might be: guest ARP entry for
192.168.100.1 is not updated. Or say, after guest migrated to host2
from host1, guest is still trying to send packet to host1's NIC (no
one is telling it to update, right?), so no one is responding the
ping. When the entry is expired, guest will resend the ARP request,
and host2 will respond this time, with mac address on host2 provided
this time. After that, ping works again.

(not familiar with OVS too, so am just taking it as a "vritual"
switch)

Thanks.
Peter

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-14  9:46           ` Pavel Fedin
  2015-12-14 10:09             ` Peter Xu
@ 2015-12-14 12:09             ` Yuanhan Liu
  2015-12-14 13:00               ` Peter Xu
  1 sibling, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-14 12:09 UTC (permalink / raw)
  To: Pavel Fedin; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Mon, Dec 14, 2015 at 12:46:57PM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > > Host<------->openVSwitch<----------->guest
> > >       LOCAL               vhostuser
> > >
> > >  So, in order to migrate the guest, i simply replicated this setup on both hosts, with the
> > same IPs on host side. And on both hosts i set up the following ruleset for openvswitch:
> > 
> > Regarding to "with the same IPs on host side": do you mean that you
> > configured the same IP on two hosts in the intranet?
> 
>  No intranet. You can think of it as an isolated network between the host and guest, and that's all. I just assigned an IP to ovs' LOCAL interface on both hosts, and these ovs instances knew nothing about each other, neither they forwarded packets between each other. I didn't want to make things overcomplicated and decided not to mess with host's own connection to the intranet, just something that sits on the other side of vhost-user and replies to PINGs was perfectly OK for me.

Pavel,

It seems that we have exactly the same test environment set up: I have
one server (where I normally do vhost test there) and one desktop (my
dev box), 

On both hosts, there is an ovs bridge, with IP address 192.168.100.1
assigned manually. Later, I started a VM on the server, and manually
assigned IP to 192.168.100.10. I then run "ping 192.168.100.1" for
live migration testing.

The migration to my desktop somehow works (even though there are some
bugs in this patch set), however, I did see what Pavel saw: about 12
packets has been lost, which means about 12 seconds the network is not
working well.

Besides that, there was always an error message from the target host
after the migration:

    KVM: injection failed, MSI lost (Operation not permitted)

Firstly, I've very limited knowledge of OVS, therefore I'm not sure
this kind of live migration test env is setup rightly or not. I'd
appreciate if anyone could shine some lights on it. Anyway, I'm digging
the code to see if I can find something abnormal there.

> > I think this
> > does not matter if we are testing it functionally (whether live
> > migration could work), However I would still perfer to try ping
> > another host (say, host3) inside the intranet. What do you think?
> 
>  Yes, perhaps this would be better test, may be next time i'll do it.

Again, appreciate your testing!

> Anyway, IIRC, PATCH v2 is coming.

Hopefully, I could fix this gap this week and send out v2.

	--yliu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-14  9:46           ` Pavel Fedin
@ 2015-12-14 10:09             ` Peter Xu
  2015-12-14 12:09             ` Yuanhan Liu
  1 sibling, 0 replies; 136+ messages in thread
From: Peter Xu @ 2015-12-14 10:09 UTC (permalink / raw)
  To: Pavel Fedin; +Cc: dev

On Mon, Dec 14, 2015 at 12:46:57PM +0300, Pavel Fedin wrote:
> > Regarding to "with the same IPs on host side": do you mean that you
> > configured the same IP on two hosts in the intranet?
> 
>  No intranet. You can think of it as an isolated network between the host and guest, and that's all. I just assigned an IP to ovs' LOCAL interface on both hosts, and these ovs instances knew nothing about each other, neither they forwarded packets between each other. I didn't want to make things overcomplicated and decided not to mess with host's own connection to the intranet, just something that sits on the other side of vhost-user and replies to PINGs was perfectly OK for me.

I see.

> 
> > I think this
> > does not matter if we are testing it functionally (whether live
> > migration could work), However I would still perfer to try ping
> > another host (say, host3) inside the intranet. What do you think?
> 
>  Yes, perhaps this would be better test, may be next time i'll do it. Anyway, IIRC, PATCH v2 is coming.

Agreed.

Thanks!
Peter

> 
> Kind regards,
> Pavel Fedin
> Expert Engineer
> Samsung Electronics Research center Russia
> 
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-14  9:04         ` Peter Xu
@ 2015-12-14  9:46           ` Pavel Fedin
  2015-12-14 10:09             ` Peter Xu
  2015-12-14 12:09             ` Yuanhan Liu
  0 siblings, 2 replies; 136+ messages in thread
From: Pavel Fedin @ 2015-12-14  9:46 UTC (permalink / raw)
  To: 'Peter Xu'; +Cc: dev

 Hello!

> > Host<------->openVSwitch<----------->guest
> >       LOCAL               vhostuser
> >
> >  So, in order to migrate the guest, i simply replicated this setup on both hosts, with the
> same IPs on host side. And on both hosts i set up the following ruleset for openvswitch:
> 
> Regarding to "with the same IPs on host side": do you mean that you
> configured the same IP on two hosts in the intranet?

 No intranet. You can think of it as an isolated network between the host and guest, and that's all. I just assigned an IP to ovs' LOCAL interface on both hosts, and these ovs instances knew nothing about each other, neither they forwarded packets between each other. I didn't want to make things overcomplicated and decided not to mess with host's own connection to the intranet, just something that sits on the other side of vhost-user and replies to PINGs was perfectly OK for me.

> I think this
> does not matter if we are testing it functionally (whether live
> migration could work), However I would still perfer to try ping
> another host (say, host3) inside the intranet. What do you think?

 Yes, perhaps this would be better test, may be next time i'll do it. Anyway, IIRC, PATCH v2 is coming.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-14  7:30       ` Pavel Fedin
@ 2015-12-14  9:04         ` Peter Xu
  2015-12-14  9:46           ` Pavel Fedin
  0 siblings, 1 reply; 136+ messages in thread
From: Peter Xu @ 2015-12-14  9:04 UTC (permalink / raw)
  To: Pavel Fedin; +Cc: dev

On Mon, Dec 14, 2015 at 10:30:54AM +0300, Pavel Fedin wrote:
>  Hello!

Hi, Pavel!

> 
> > When doing the ping, was it from the guest (to another host) or to
> > the guest (from another host)?
> > 
> > In any case, I still could not understand why the ping loss happened
> > in this test.
> > 
> > If ping from guest, no ARP refresh is required at all?
> 
>  ping from guest to host.
> 
>  Ok, my setup was:
> 
> Host<------->openVSwitch<----------->guest
>       LOCAL               vhostuser
> 
>  So, in order to migrate the guest, i simply replicated this setup on both hosts, with the same IPs on host side. And on both hosts i set up the following ruleset for openvswitch:

Regarding to "with the same IPs on host side": do you mean that you
configured the same IP on two hosts in the intranet? I think this
does not matter if we are testing it functionally (whether live
migration could work), However I would still perfer to try ping
another host (say, host3) inside the intranet. What do you think?

When pinging host3, I assume there should have no ping loss. Also,
should have no loss too in the revert direction (reason as in
previous mail).

> 
> ovs-ofctl add-flow ovs-br0 in_port=1,actions=output:LOCAL
> ovs-ofctl add-flow ovs-br0 in_port=LOCAL,actions=output:1
> 
>  And on the second host, for some reason, vhostuser port got no 2 in the database instead of 1. Probably because first i added wrong port, then added correct one, then removed the wrong one. So, as i wrote before - please don't worry, the patch works fine, it was totally my lame fault.

Yes, thanks to let me know that the patch is working. Actually what
I am interested in is the down time that when host3 ping guest from
outside during migration. Would you please let me know the result if
you are doing such tests in the future? And please just ignore this
if there is no requirement on your side.

Thanks!
Peter

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-14  3:58     ` Peter Xu
@ 2015-12-14  7:30       ` Pavel Fedin
  2015-12-14  9:04         ` Peter Xu
  2015-12-15  8:23       ` Yuanhan Liu
  1 sibling, 1 reply; 136+ messages in thread
From: Pavel Fedin @ 2015-12-14  7:30 UTC (permalink / raw)
  To: 'Peter Xu', yuanhan.liu; +Cc: dev

 Hello!

> When doing the ping, was it from the guest (to another host) or to
> the guest (from another host)?
> 
> In any case, I still could not understand why the ping loss happened
> in this test.
> 
> If ping from guest, no ARP refresh is required at all?

 ping from guest to host.

 Ok, my setup was:

Host<------->openVSwitch<----------->guest
      LOCAL               vhostuser

 So, in order to migrate the guest, i simply replicated this setup on both hosts, with the same IPs on host side. And on both hosts i set up the following ruleset for openvswitch:

ovs-ofctl add-flow ovs-br0 in_port=1,actions=output:LOCAL
ovs-ofctl add-flow ovs-br0 in_port=LOCAL,actions=output:1

 And on the second host, for some reason, vhostuser port got no 2 in the database instead of 1. Probably because first i added wrong port, then added correct one, then removed the wrong one. So, as i wrote before - please don't worry, the patch works fine, it was totally my lame fault.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-11 10:22   ` Pavel Fedin
@ 2015-12-14  3:58     ` Peter Xu
  2015-12-14  7:30       ` Pavel Fedin
  2015-12-15  8:23       ` Yuanhan Liu
  0 siblings, 2 replies; 136+ messages in thread
From: Peter Xu @ 2015-12-14  3:58 UTC (permalink / raw)
  To: Pavel Fedin, yuanhan.liu; +Cc: dev

On Fri, Dec 11, 2015 at 01:22:23PM +0300, Pavel Fedin wrote:
>  BTW, it works, and it was my bad. openvswitch was configured incorrectly on the other side, vhost port number was different for
> some reason, while ruleset was the same. I reconfigured it and now everything migrates correctly, except increased downtime because
> of missing GARP (the guest misses some PINGs, then it retries ARP, which brings the link back up).

Hi,

When doing the ping, was it from the guest (to another host) or to
the guest (from another host)?

In any case, I still could not understand why the ping loss happened
in this test.

If ping from guest, no ARP refresh is required at all?

If ping to guest from outside, when the migration finishes on the
target side of qemu, qemu_self_announce() will be called. Although
we might see a warning like "Vhost user backend fails to broadcast
fake RARP" (notify is done by hacking vhost_user_receive(), even if
notify fails, things will still move on), QEMU should still send a
RARP onto the link.

Not sure whether I missed anything.

Thanks.
Peter

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-11  9:49 ` Yuanhan Liu
@ 2015-12-11 10:22   ` Pavel Fedin
  2015-12-14  3:58     ` Peter Xu
  0 siblings, 1 reply; 136+ messages in thread
From: Pavel Fedin @ 2015-12-11 10:22 UTC (permalink / raw)
  To: 'Yuanhan Liu'; +Cc: dev

 Hello!

> On Fri, Dec 11, 2015 at 11:26:55AM +0300, Pavel Fedin wrote:
> >  Hello!
> >
> >  I am currently testing this patchset with qemu and have problems.
> 
> Hi,
> 
> Thanks for testing!

 Not at all :)

 BTW, it works, and it was my bad. openvswitch was configured incorrectly on the other side, vhost port number was different for
some reason, while ruleset was the same. I reconfigured it and now everything migrates correctly, except increased downtime because
of missing GARP (the guest misses some PINGs, then it retries ARP, which brings the link back up).

 Tested-by: Pavel Fedin <p.fedin@samsung.com>

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-11  8:26 [dpdk-dev] [PATCH 0/4 for 2.3] " Pavel Fedin
@ 2015-12-11  9:49 ` Yuanhan Liu
  2015-12-11 10:22   ` Pavel Fedin
  0 siblings, 1 reply; 136+ messages in thread
From: Yuanhan Liu @ 2015-12-11  9:49 UTC (permalink / raw)
  To: Pavel Fedin; +Cc: dev

On Fri, Dec 11, 2015 at 11:26:55AM +0300, Pavel Fedin wrote:
>  Hello!
> 
>  I am currently testing this patchset with qemu and have problems.

Hi,

Thanks for testing!
> 
>  The guest migrates correctly, but after the migration it cries in the log:
> 
> Vhost user backend fails to broadcast fake RARP

Yes, because I have enabled it yet on DPDK side, and I was intended to
try it in the v2 patchset, which is likely to be sent out next week.

	--yliu

> 
>  and pinging the (new) host doesn't work. When i migrate it back to the old host, the network resumes working.
> 
>  I have analyzed the code, and this problem happens because we support neither VHOST_USER_PROTOCOL_F_RARP, nor
> VIRTIO_NET_F_GUEST_ANNOUNCE. Since the latter seems to be related only to guest, i simply enabled it in qemu by force, and after
> this the network doesn't work at all.
>  Can anybody help me and explain how the thing works? I expected that gratuitous ARP packets are harmless, but they seem to break
> things somehow. And what was used for testing the implementation?
> 
> Kind regards,
> Pavel Fedin
> Expert Engineer
> Samsung Electronics Research center Russia
> 
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
@ 2015-12-11  8:26 Pavel Fedin
  2015-12-11  9:49 ` Yuanhan Liu
  0 siblings, 1 reply; 136+ messages in thread
From: Pavel Fedin @ 2015-12-11  8:26 UTC (permalink / raw)
  To: dev

 Hello!

 I am currently testing this patchset with qemu and have problems.

 The guest migrates correctly, but after the migration it cries in the log:

Vhost user backend fails to broadcast fake RARP

 and pinging the (new) host doesn't work. When i migrate it back to the old host, the network resumes working.

 I have analyzed the code, and this problem happens because we support neither VHOST_USER_PROTOCOL_F_RARP, nor
VIRTIO_NET_F_GUEST_ANNOUNCE. Since the latter seems to be related only to guest, i simply enabled it in qemu by force, and after
this the network doesn't work at all.
 Can anybody help me and explain how the thing works? I expected that gratuitous ARP packets are harmless, but they seem to break
things somehow. And what was used for testing the implementation?

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 136+ messages in thread

end of thread, other threads:[~2016-03-11 13:22 UTC | newest]

Thread overview: 136+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-02  3:43 [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
2015-12-02  3:43 ` [dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
2015-12-02 13:53   ` Panu Matilainen
2015-12-02 14:31     ` Yuanhan Liu
2015-12-02 14:48       ` Panu Matilainen
2015-12-02 15:09         ` Yuanhan Liu
2015-12-02 16:58           ` Panu Matilainen
2015-12-02 17:24             ` Michael S. Tsirkin
2015-12-02 16:38       ` Thomas Monjalon
2015-12-03  1:49         ` Yuanhan Liu
2015-12-06 23:07     ` Thomas Monjalon
2015-12-07  2:00       ` Yuanhan Liu
2015-12-07  2:03         ` Thomas Monjalon
2015-12-07  2:18           ` Yuanhan Liu
2015-12-07  2:49             ` Thomas Monjalon
2015-12-07  6:29       ` Panu Matilainen
2015-12-07 11:28         ` Thomas Monjalon
2015-12-07 11:41           ` Panu Matilainen
2015-12-07 13:55             ` Thomas Monjalon
2015-12-07 16:48               ` Panu Matilainen
2015-12-07 17:47                 ` Thomas Monjalon
2015-12-08  5:57   ` Xie, Huawei
2015-12-08  7:25     ` Yuanhan Liu
2015-12-02  3:43 ` [dpdk-dev] [PATCH 2/4] vhost: introduce vhost_log_write Yuanhan Liu
2015-12-02 13:53   ` Victor Kaplansky
2015-12-02 14:39     ` Yuanhan Liu
2015-12-09  3:33     ` Xie, Huawei
2015-12-09  3:42       ` Yuanhan Liu
2015-12-09  5:44         ` Xie, Huawei
2015-12-09  8:41           ` Yuanhan Liu
2015-12-02  3:43 ` [dpdk-dev] [PATCH 3/4] vhost: log vring changes Yuanhan Liu
2015-12-02 14:07   ` Victor Kaplansky
2015-12-02 14:38     ` Yuanhan Liu
2015-12-02 15:58       ` Victor Kaplansky
2015-12-02 16:26         ` Michael S. Tsirkin
2015-12-03  2:31           ` Yuanhan Liu
2015-12-09  2:45     ` Xie, Huawei
2015-12-02  3:43 ` [dpdk-dev] [PATCH 4/4] vhost: enable log_shmfd protocol feature Yuanhan Liu
2015-12-02 14:10 ` [dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support Victor Kaplansky
2015-12-02 14:33   ` Yuanhan Liu
2015-12-09  3:41 ` Xie, Huawei
2015-12-17  3:11 ` [dpdk-dev] [PATCH v2 0/6] " Yuanhan Liu
2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
2015-12-21 15:32     ` Xie, Huawei
2015-12-22  2:25       ` Yuanhan Liu
2015-12-22  2:41         ` Xie, Huawei
2015-12-22  2:55           ` Yuanhan Liu
2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write Yuanhan Liu
2015-12-21 15:06     ` Xie, Huawei
2015-12-22  2:40       ` Yuanhan Liu
2015-12-22  2:45         ` Xie, Huawei
2015-12-22  3:04           ` Yuanhan Liu
2015-12-22  7:02             ` Xie, Huawei
2015-12-22  5:11     ` Peter Xu
2015-12-22  6:09       ` Yuanhan Liu
2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 3/6] vhost: log used vring changes Yuanhan Liu
2015-12-22  6:55     ` Peter Xu
2015-12-22  7:07       ` Xie, Huawei
2015-12-22  7:59         ` Peter Xu
2015-12-22  7:13       ` Yuanhan Liu
2015-12-22  8:01         ` Peter Xu
2015-12-17  3:11   ` [dpdk-dev] [PATCH v2 4/6] vhost: log vring desc buffer changes Yuanhan Liu
2015-12-17  3:12   ` [dpdk-dev] [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
2015-12-22  8:11     ` Peter Xu
2015-12-22  8:21       ` Yuanhan Liu
2015-12-22  8:24       ` Pavel Fedin
2015-12-17  3:12   ` [dpdk-dev] [PATCH v2 6/6] vhost: enable log_shmfd protocol feature Yuanhan Liu
2015-12-17 12:08   ` [dpdk-dev] [PATCH v2 0/6] vhost-user live migration support Iremonger, Bernard
2015-12-17 12:45     ` Yuanhan Liu
2015-12-21  8:17   ` Pavel Fedin
2016-01-29  4:57   ` [dpdk-dev] [PATCH v3 0/8] " Yuanhan Liu
2016-01-29  4:57     ` [dpdk-dev] [PATCH v3 1/8] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
2016-01-29  4:57     ` [dpdk-dev] [PATCH v3 2/8] vhost: introduce vhost_log_write Yuanhan Liu
2016-02-19 14:26       ` Thomas Monjalon
2016-02-22  6:59         ` Yuanhan Liu
2016-01-29  4:57     ` [dpdk-dev] [PATCH v3 3/8] vhost: log used vring changes Yuanhan Liu
2016-01-29  4:57     ` [dpdk-dev] [PATCH v3 4/8] vhost: log vring desc buffer changes Yuanhan Liu
2016-01-29  4:58     ` [dpdk-dev] [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
2016-03-11 12:39       ` Olivier MATZ
2016-03-11 13:16         ` Thomas Monjalon
2016-03-11 13:22           ` Olivier MATZ
2016-01-29  4:58     ` [dpdk-dev] [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request Yuanhan Liu
2016-02-19  6:11       ` Tan, Jianfeng
2016-02-19  7:03         ` Yuanhan Liu
2016-02-19  8:55           ` Yuanhan Liu
2016-02-22 14:36           ` [dpdk-dev] [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array Yuanhan Liu
2016-02-24  8:15             ` Qiu, Michael
2016-02-24  8:28               ` Yuanhan Liu
2016-02-25  7:55                 ` Qiu, Michael
2016-02-29 15:56             ` Thomas Monjalon
2016-01-29  4:58     ` [dpdk-dev] [PATCH v3 7/8] vhost: enable log_shmfd protocol feature Yuanhan Liu
2016-01-29  4:58     ` [dpdk-dev] [PATCH v3 8/8] vhost: remove duplicate header include Yuanhan Liu
2016-02-01 15:54     ` [dpdk-dev] [PATCH v3 0/8] vhost-user live migration support Iremonger, Bernard
2016-02-02  1:53       ` Yuanhan Liu
2016-02-19 15:01     ` Thomas Monjalon
2016-02-22  7:08       ` Yuanhan Liu
2016-02-22  9:56         ` Thomas Monjalon
2016-02-22 14:24           ` Yuanhan Liu
2015-12-11  8:26 [dpdk-dev] [PATCH 0/4 for 2.3] " Pavel Fedin
2015-12-11  9:49 ` Yuanhan Liu
2015-12-11 10:22   ` Pavel Fedin
2015-12-14  3:58     ` Peter Xu
2015-12-14  7:30       ` Pavel Fedin
2015-12-14  9:04         ` Peter Xu
2015-12-14  9:46           ` Pavel Fedin
2015-12-14 10:09             ` Peter Xu
2015-12-14 12:09             ` Yuanhan Liu
2015-12-14 13:00               ` Peter Xu
2015-12-14 13:21                 ` Yuanhan Liu
2015-12-14 13:28                   ` Peter Xu
2015-12-14 13:51                     ` Yuanhan Liu
2015-12-14 14:54                   ` Pavel Fedin
2015-12-15  8:23       ` Yuanhan Liu
2015-12-15  8:45         ` Pavel Fedin
2015-12-15  8:56           ` Yuanhan Liu
2015-12-15  9:04             ` Pavel Fedin
2015-12-15 10:05           ` Peter Xu
2015-12-15 11:43             ` Thibaut Collet
2015-12-15 11:47               ` Thibaut Collet
2015-12-15 12:24                 ` Pavel Fedin
2015-12-15 13:36                   ` Yuanhan Liu
2015-12-15 13:48                     ` Pavel Fedin
2015-12-15 13:59                       ` Yuanhan Liu
2015-12-15 14:58                         ` Pavel Fedin
2015-12-16  7:28                           ` Yuanhan Liu
2015-12-16 11:57                             ` Pavel Fedin
2015-12-16 12:08                               ` Yuanhan Liu
2015-12-16 12:43                                 ` Pavel Fedin
2015-12-16 13:00                                   ` Yuanhan Liu
2015-12-15 13:18                 ` Yuanhan Liu
2015-12-15 15:07                   ` Thibaut Collet
2015-12-15 15:36                     ` Pavel Fedin
2015-12-16  2:38                     ` Peter Xu
2015-12-16  2:50                       ` Yuanhan Liu
2015-12-16  7:05                       ` Pavel Fedin
2015-12-15  9:42         ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).