* [dpdk-dev] [PATCH v1 0/2] introduce asynchronous data path for vhost
@ 2020-06-11 10:02 patrick.fu
  2020-06-11 10:02 ` [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path registration API patrick.fu
                   ` (6 more replies)
  0 siblings, 7 replies; 36+ messages in thread
From: patrick.fu @ 2020-06-11 10:02 UTC
  To: dev, maxime.coquelin, chenbo.xia, zhihong.wang, xiaolong.ye
  Cc: patrick.fu, cheng1.jiang, cunming.liang
From: Patrick Fu <patrick.fu@intel.com>

Large memory copies usually consume a major share of CPU cycles and
become the hot spot in the vhost-user enqueue operation. To offload
these expensive memory operations from the CPU, this patch set
proposes to leverage DMA engines, e.g. I/OAT, a DMA engine in Intel
processors, to accelerate large copies.

Large copies are offloaded from the CPU to the DMA engine
asynchronously. The CPU just submits copy jobs to the DMA engine
without waiting for them to complete. Thus, there is no CPU
intervention during data transfer; we can save precious CPU cycles and
improve the overall throughput of vhost-user based applications, like
OVS. During packet transmission, large copies are offloaded to the DMA
engine and small copies are performed by the CPU, because of the
startup overhead associated with the DMA engine.

This patch set constructs a general framework that applications can
leverage to attach DMA channels to vhost-user transmit queues. Four
new RTE APIs are introduced to the vhost library for applications to
register and use the asynchronous data path. In addition, two new DMA
operation callbacks are defined, through which the vhost-user
asynchronous data path interacts with DMA hardware. Currently only the
enqueue operation for split queue is implemented, but the framework
can be extended to support dequeue and packed queue.

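
As a rough illustration of the CPU/DMA split described above, the
following standalone sketch classifies copies by length. The names
choose_copy_path() and dma_bytes() are hypothetical and exist only for
this example; the comparison (length >= threshold goes to DMA) mirrors
the check used in patch 2/2, and the threshold itself is whatever the
application registers for the queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not part of the vhost library. */
enum copy_path { COPY_BY_CPU, COPY_BY_DMA };

/* A copy is offloaded once its length reaches the threshold. */
static inline enum copy_path
choose_copy_path(uint32_t len, uint32_t threshold)
{
	return len >= threshold ? COPY_BY_DMA : COPY_BY_CPU;
}

/* Sum the bytes that would be offloaded for a list of copy lengths. */
static inline uint64_t
dma_bytes(const uint32_t *lens, size_t n, uint32_t threshold)
{
	uint64_t total = 0;
	size_t i;

	assert(lens != NULL || n == 0);

	for (i = 0; i < n; i++)
		if (choose_copy_path(lens[i], threshold) == COPY_BY_DMA)
			total += lens[i];

	return total;
}
```

With a threshold of 256 bytes, a 64-byte copy stays on the CPU while
256-byte and 1500-byte copies would be handed to the DMA engine.
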
Patrick Fu (2):
  vhost: introduce async data path registration API
  vhost: introduce async enqueue for split ring
 lib/librte_vhost/Makefile          |   3 +-
 lib/librte_vhost/rte_vhost.h       |   1 +
 lib/librte_vhost/rte_vhost_async.h | 172 ++++++++++++
 lib/librte_vhost/socket.c          |  20 ++
 lib/librte_vhost/vhost.c           |  74 ++++-
 lib/librte_vhost/vhost.h           |  30 ++-
 lib/librte_vhost/vhost_user.c      |  28 +-
 lib/librte_vhost/virtio_net.c      | 538 ++++++++++++++++++++++++++++++++++++-
 8 files changed, 857 insertions(+), 9 deletions(-)
 create mode 100644 lib/librte_vhost/rte_vhost_async.h
-- 
1.8.3.1
* [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path registration API
  2020-06-11 10:02 [dpdk-dev] [PATCH v1 0/2] introduce asynchronous data path for vhost patrick.fu
@ 2020-06-11 10:02 ` patrick.fu
  2020-06-18  5:50   ` Liu, Yong
                     ` (2 more replies)
  2020-06-11 10:02 ` [dpdk-dev] [PATCH v1 2/2] vhost: introduce async enqueue for split ring patrick.fu
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 36+ messages in thread
From: patrick.fu @ 2020-06-11 10:02 UTC
  To: dev, maxime.coquelin, chenbo.xia, zhihong.wang, xiaolong.ye
  Cc: patrick.fu, cheng1.jiang, cunming.liang
From: Patrick <patrick.fu@intel.com>
This patch introduces registration/un-registration APIs
for the async data path, together with all required data
structures and DMA callback function prototypes.
Signed-off-by: Patrick <patrick.fu@intel.com>
---
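
The "features" argument of rte_vhost_async_channel_register() is
documented in this patch as a 32-bit word with b0 = inorder and
b16 - b27 = packet length threshold. The patch itself decodes it through
a bit-field union; since bit-field layout is implementation-defined in
C, an application might prefer to build the word with explicit shifts.
The sketch below is one way to do that. The macro and function names are
assumptions made for this example and are not defined by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names; bit positions follow the doc comment. */
#define ASYNC_F_INORDER        (1u << 0)   /* b0: in-order transfer */
#define ASYNC_F_THRESHOLD_SFT  16          /* b16 - b27: threshold */
#define ASYNC_F_THRESHOLD_MSK  0xFFFu      /* 12-bit field */

/* Build the feature word from an in-order flag and a length threshold. */
static inline uint32_t
async_features_pack(int inorder, uint32_t threshold)
{
	assert(threshold <= ASYNC_F_THRESHOLD_MSK);

	return (inorder ? ASYNC_F_INORDER : 0u) |
	       (threshold << ASYNC_F_THRESHOLD_SFT);
}

/* Extract the packet length threshold from a feature word. */
static inline uint32_t
async_features_threshold(uint32_t features)
{
	return (features >> ASYNC_F_THRESHOLD_SFT) & ASYNC_F_THRESHOLD_MSK;
}

/* Return non-zero when the in-order bit is set. */
static inline int
async_features_inorder(uint32_t features)
{
	return (features & ASYNC_F_INORDER) != 0;
}
```

Note that registration in this patch returns -1 unless the inorder bit
is set, so a caller would normally pass 1 for the first argument.
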
 lib/librte_vhost/Makefile          |   3 +-
 lib/librte_vhost/rte_vhost.h       |   1 +
 lib/librte_vhost/rte_vhost_async.h | 134 +++++++++++++++++++++++++++++++++++++
 lib/librte_vhost/socket.c          |  20 ++++++
 lib/librte_vhost/vhost.c           |  74 +++++++++++++++++++-
 lib/librte_vhost/vhost.h           |  30 ++++++++-
 lib/librte_vhost/vhost_user.c      |  28 ++++++--
 7 files changed, 283 insertions(+), 7 deletions(-)
 create mode 100644 lib/librte_vhost/rte_vhost_async.h
diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
index e592795..3aed094 100644
--- a/lib/librte_vhost/Makefile
+++ b/lib/librte_vhost/Makefile
@@ -41,7 +41,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
 					vhost_user.c virtio_net.c vdpa.c
 
 # install includes
-SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h \
+						rte_vhost_async.h
 
 # only compile vhost crypto when cryptodev is enabled
 ifeq ($(CONFIG_RTE_LIBRTE_CRYPTODEV),y)
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index d43669f..cec4d07 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -35,6 +35,7 @@
 #define RTE_VHOST_USER_EXTBUF_SUPPORT	(1ULL << 5)
 /* support only linear buffers (no chained mbufs) */
 #define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
+#define RTE_VHOST_USER_ASYNC_COPY	(1ULL << 7)
 
 /** Protocol features. */
 #ifndef VHOST_USER_PROTOCOL_F_MQ
diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
new file mode 100644
index 0000000..82f2ebe
--- /dev/null
+++ b/lib/librte_vhost/rte_vhost_async.h
@@ -0,0 +1,134 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#ifndef _RTE_VHOST_ASYNC_H_
+#define _RTE_VHOST_ASYNC_H_
+
+#include "rte_vhost.h"
+
+/**
+ * iovec iterator
+ */
+struct iov_it {
+	/** offset to the first byte of interesting data */
+	size_t offset;
+	/** total bytes of data in this iterator */
+	size_t count;
+	/** pointer to the iovec array */
+	struct iovec *iov;
+	/** number of iovec in this iterator */
+	unsigned long nr_segs;
+};
+
+/**
+ * dma transfer descriptor pair
+ */
+struct dma_trans_desc {
+	/** source memory iov_it */
+	struct iov_it *src;
+	/** destination memory iov_it */
+	struct iov_it *dst;
+};
+
+/**
+ * dma transfer status
+ */
+struct dma_trans_status {
+	/** An array of application specific data for source memory */
+	uintptr_t *src_opaque_data;
+	/** An array of application specific data for destination memory */
+	uintptr_t *dst_opaque_data;
+};
+
+/**
+ * dma operation callbacks to be implemented by applications
+ */
+struct rte_vhost_async_channel_ops {
+	/**
+	 * instruct a DMA channel to perform copies for a batch of packets
+	 *
+	 * @param vid
+	 *  id of vhost device to perform data copies
+	 * @param queue_id
+	 *  queue id to perform data copies
+	 * @param descs
+	 *  an array of DMA transfer memory descriptors
+	 * @param opaque_data
+	 *  opaque data pair sent to the DMA engine
+	 * @param count
+	 *  number of elements in the "descs" array
+	 * @return
+	 *  -1 on failure, number of descs processed on success
+	 */
+	int (*transfer_data)(int vid, uint16_t queue_id,
+		struct dma_trans_desc *descs,
+		struct dma_trans_status *opaque_data,
+		uint16_t count);
+	/**
+	 * check copy-completed packets from a DMA channel
+	 * @param vid
+	 *  id of vhost device to check copy completion
+	 * @param queue_id
+	 *  queue id to check copy completion
+	 * @param opaque_data
+	 *  buffer to receive the opaque data pair from DMA engine
+	 * @param max_packets
+	 *  max number of packets could be completed
+	 * @return
+	 *  -1 on failure, number of iov segments completed on success
+	 */
+	int (*check_completed_copies)(int vid, uint16_t queue_id,
+		struct dma_trans_status *opaque_data,
+		uint16_t max_packets);
+};
+
+/**
+ *  dma channel feature bit definition
+ */
+struct dma_channel_features {
+	union {
+		uint32_t intval;
+		struct {
+			uint32_t inorder:1;
+			uint32_t resvd0115:15;
+			uint32_t threshold:12;
+			uint32_t resvd2831:4;
+		};
+	};
+};
+
+/**
+ * register a dma channel for vhost
+ *
+ * @param vid
+ *  vhost device id DMA channel to be attached to
+ * @param queue_id
+ *  vhost queue id DMA channel to be attached to
+ * @param features
+ *  DMA channel feature bit
+ *    b0       : DMA supports inorder data transfer
+ *    b1  - b15: reserved
+ *    b16 - b27: Packet length threshold for DMA transfer
+ *    b28 - b31: reserved
+ * @param ops
+ *  DMA operation callbacks
+ * @return
+ *  0 on success, -1 on failures
+ */
+int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
+	uint32_t features, struct rte_vhost_async_channel_ops *ops);
+
+/**
+ * unregister a dma channel for vhost
+ *
+ * @param vid
+ *  vhost device id DMA channel to be detached
+ * @param queue_id
+ *  vhost queue id DMA channel to be detached
+ * @return
+ *  0 on success, -1 on failures
+ */
+int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
+
+#endif /* _RTE_VDPA_H_ */
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 0a66ef9..f817783 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -42,6 +42,7 @@ struct vhost_user_socket {
 	bool use_builtin_virtio_net;
 	bool extbuf;
 	bool linearbuf;
+	bool async_copy;
 
 	/*
 	 * The "supported_features" indicates the feature bits the
@@ -210,6 +211,7 @@ struct vhost_user {
 	size_t size;
 	struct vhost_user_connection *conn;
 	int ret;
+	struct virtio_net *dev;
 
 	if (vsocket == NULL)
 		return;
@@ -241,6 +243,13 @@ struct vhost_user {
 	if (vsocket->linearbuf)
 		vhost_enable_linearbuf(vid);
 
+	if (vsocket->async_copy) {
+		dev = get_device(vid);
+
+		if (dev)
+			dev->async_copy = 1;
+	}
+
 	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
 
 	if (vsocket->notify_ops->new_connection) {
@@ -891,6 +900,17 @@ struct vhost_user_reconnect_list {
 		goto out_mutex;
 	}
 
+	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
+
+	if (vsocket->async_copy &&
+		(flags & (RTE_VHOST_USER_IOMMU_SUPPORT |
+		RTE_VHOST_USER_POSTCOPY_SUPPORT))) {
+		VHOST_LOG_CONFIG(ERR, "error: enabling async copy and IOMMU "
+			"or post-copy feature simultaneously is not "
+			"supported\n");
+		goto out_mutex;
+	}
+
 	/*
 	 * Set the supported features correctly for the builtin vhost-user
 	 * net driver.
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 0266318..e6b688a 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -332,8 +332,13 @@
 {
 	if (vq_is_packed(dev))
 		rte_free(vq->shadow_used_packed);
-	else
+	else {
 		rte_free(vq->shadow_used_split);
+		if (vq->async_pkts_pending)
+			rte_free(vq->async_pkts_pending);
+		if (vq->async_pending_info)
+			rte_free(vq->async_pending_info);
+	}
 	rte_free(vq->batch_copy_elems);
 	rte_mempool_free(vq->iotlb_pool);
 	rte_free(vq);
@@ -1527,3 +1532,70 @@ int rte_vhost_extern_callback_register(int vid,
 	if (vhost_data_log_level >= 0)
 		rte_log_set_level(vhost_data_log_level, RTE_LOG_WARNING);
 }
+
+int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
+					uint32_t features,
+					struct rte_vhost_async_channel_ops *ops)
+{
+	struct vhost_virtqueue *vq;
+	struct virtio_net *dev = get_device(vid);
+	struct dma_channel_features f;
+
+	if (dev == NULL || ops == NULL)
+		return -1;
+
+	f.intval = features;
+
+	vq = dev->virtqueue[queue_id];
+
+	if (vq == NULL)
+		return -1;
+
+	/* packed queue and out-of-order DMA are not supported */
+	if (vq_is_packed(dev) || !f.inorder)
+		return -1;
+
+	if (ops->check_completed_copies == NULL ||
+		ops->transfer_data == NULL)
+		return -1;
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	vq->async_ops.check_completed_copies = ops->check_completed_copies;
+	vq->async_ops.transfer_data = ops->transfer_data;
+
+	vq->async_inorder = f.inorder;
+	vq->async_threshold = f.threshold;
+
+	vq->async_registered = true;
+
+	rte_spinlock_unlock(&vq->access_lock);
+
+	return 0;
+}
+
+int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id)
+{
+	struct vhost_virtqueue *vq;
+	struct virtio_net *dev = get_device(vid);
+
+	if (dev == NULL)
+		return -1;
+
+	vq = dev->virtqueue[queue_id];
+
+	if (vq == NULL)
+		return -1;
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	vq->async_ops.transfer_data = NULL;
+	vq->async_ops.check_completed_copies = NULL;
+
+	vq->async_registered = false;
+
+	rte_spinlock_unlock(&vq->access_lock);
+
+	return 0;
+}
+
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index df98d15..a7fbe23 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -23,6 +23,8 @@
 #include "rte_vhost.h"
 #include "rte_vdpa.h"
 
+#include "rte_vhost_async.h"
+
 /* Used to indicate that the device is running on a data core */
 #define VIRTIO_DEV_RUNNING 1
 /* Used to indicate that the device is ready to operate */
@@ -39,6 +41,11 @@
 
 #define VHOST_LOG_CACHE_NR 32
 
+#define MAX_PKT_BURST 32
+
+#define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
+#define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
+
 #define PACKED_DESC_ENQUEUE_USED_FLAG(w)	\
 	((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \
 		VRING_DESC_F_WRITE)
@@ -200,6 +207,25 @@ struct vhost_virtqueue {
 	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_list;
 	int				iotlb_cache_nr;
 	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_pending_list;
+
+	/* operation callbacks for async dma */
+	struct rte_vhost_async_channel_ops	async_ops;
+
+	struct iov_it it_pool[VHOST_MAX_ASYNC_IT];
+	struct iovec vec_pool[VHOST_MAX_ASYNC_VEC];
+
+	/* async data transfer status */
+	uintptr_t	**async_pkts_pending;
+	#define		ASYNC_PENDING_INFO_N_MSK 0xFFFF
+	#define		ASYNC_PENDING_INFO_N_SFT 16
+	uint64_t	*async_pending_info;
+	uint16_t	async_pkts_idx;
+	uint16_t	async_pkts_inflight_n;
+
+	/* vq async features */
+	bool		async_inorder;
+	bool		async_registered;
+	uint16_t	async_threshold;
 } __rte_cache_aligned;
 
 /* Old kernels have no such macros defined */
@@ -353,6 +379,7 @@ struct virtio_net {
 	int16_t			broadcast_rarp;
 	uint32_t		nr_vring;
 	int			dequeue_zero_copy;
+	int			async_copy;
 	int			extbuf;
 	int			linearbuf;
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
@@ -702,7 +729,8 @@ uint64_t translate_log_addr(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	/* Don't kick guest if we don't reach index specified by guest. */
 	if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
 		uint16_t old = vq->signalled_used;
-		uint16_t new = vq->last_used_idx;
+		uint16_t new = vq->async_pkts_inflight_n ?
+					vq->used->idx : vq->last_used_idx;
 		bool signalled_used_valid = vq->signalled_used_valid;
 
 		vq->signalled_used = new;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 84bebad..d7600bf 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -464,12 +464,25 @@
 	} else {
 		if (vq->shadow_used_split)
 			rte_free(vq->shadow_used_split);
+		if (vq->async_pkts_pending)
+			rte_free(vq->async_pkts_pending);
+		if (vq->async_pending_info)
+			rte_free(vq->async_pending_info);
+
 		vq->shadow_used_split = rte_malloc(NULL,
 				vq->size * sizeof(struct vring_used_elem),
 				RTE_CACHE_LINE_SIZE);
-		if (!vq->shadow_used_split) {
+		vq->async_pkts_pending = rte_malloc(NULL,
+				vq->size * sizeof(uintptr_t),
+				RTE_CACHE_LINE_SIZE);
+		vq->async_pending_info = rte_malloc(NULL,
+				vq->size * sizeof(uint64_t),
+				RTE_CACHE_LINE_SIZE);
+		if (!vq->shadow_used_split ||
+			!vq->async_pkts_pending ||
+			!vq->async_pending_info) {
 			VHOST_LOG_CONFIG(ERR,
-					"failed to allocate memory for shadow used ring.\n");
+					"failed to allocate memory for vq internal data.\n");
 			return RTE_VHOST_MSG_RESULT_ERR;
 		}
 	}
@@ -1147,7 +1160,8 @@
 			goto err_mmap;
 		}
 
-		populate = (dev->dequeue_zero_copy) ? MAP_POPULATE : 0;
+		populate = (dev->dequeue_zero_copy || dev->async_copy) ?
+			MAP_POPULATE : 0;
 		mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
 				 MAP_SHARED | populate, fd, 0);
 
@@ -1162,7 +1176,7 @@
 		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
 				      mmap_offset;
 
-		if (dev->dequeue_zero_copy)
+		if (dev->dequeue_zero_copy || dev->async_copy)
 			if (add_guest_pages(dev, reg, alignment) < 0) {
 				VHOST_LOG_CONFIG(ERR,
 					"adding guest pages to region %u failed.\n",
@@ -1945,6 +1959,12 @@ static int vhost_user_set_vring_err(struct virtio_net **pdev __rte_unused,
 	} else {
 		rte_free(vq->shadow_used_split);
 		vq->shadow_used_split = NULL;
+		if (vq->async_pkts_pending)
+			rte_free(vq->async_pkts_pending);
+		if (vq->async_pending_info)
+			rte_free(vq->async_pending_info);
+		vq->async_pkts_pending = NULL;
+		vq->async_pending_info = NULL;
 	}
 
 	rte_free(vq->batch_copy_elems);
-- 
1.8.3.1
* [dpdk-dev] [PATCH v1 2/2] vhost: introduce async enqueue for split ring
  2020-06-11 10:02 [dpdk-dev] [PATCH v1 0/2] introduce asynchronous data path for vhost patrick.fu
  2020-06-11 10:02 ` [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path registration API patrick.fu
@ 2020-06-11 10:02 ` patrick.fu
  2020-06-18  6:56   ` Liu, Yong
                     ` (2 more replies)
  2020-06-26 14:42 ` [dpdk-dev] [PATCH v1 0/2] introduce asynchronous data path for vhost Maxime Coquelin
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 36+ messages in thread
From: patrick.fu @ 2020-06-11 10:02 UTC
  To: dev, maxime.coquelin, chenbo.xia, zhihong.wang, xiaolong.ye
  Cc: patrick.fu, cheng1.jiang, cunming.liang
From: Patrick <patrick.fu@intel.com>
This patch implements the async enqueue data path for split ring.
Signed-off-by: Patrick <patrick.fu@intel.com>
---
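
The submit path in this patch records one 64-bit word per packet in
async_pending_info: the low 16 bits hold the number of descriptors used
and the upper bits hold the number of iovec segments still owned by the
DMA engine. The poll path then walks the ring from the oldest in-flight
slot. The standalone sketch below restates both ideas; the function
names are hypothetical, and the slot computation is a simplified modular
form that assumes a power-of-two ring size rather than a copy of the
expression used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Same mask and shift values as ASYNC_PENDING_INFO_N_MSK/_SFT. */
#define PENDING_INFO_N_MSK 0xFFFF
#define PENDING_INFO_N_SFT 16

/* Pack descriptor count (low 16 bits) and segment count (upper bits). */
static inline uint64_t
pending_info_pack(uint16_t n_descs, uint64_t n_segs)
{
	return (uint64_t)n_descs | (n_segs << PENDING_INFO_N_SFT);
}

/* Number of descriptors consumed by the packet. */
static inline uint16_t
pending_info_descs(uint64_t info)
{
	return (uint16_t)(info & PENDING_INFO_N_MSK);
}

/* Number of segments still waiting for DMA completion. */
static inline uint64_t
pending_info_segs(uint64_t info)
{
	return info >> PENDING_INFO_N_SFT;
}

/* Index of the oldest in-flight slot in a power-of-two sized ring. */
static inline uint16_t
oldest_inflight_slot(uint16_t next_idx, uint16_t inflight,
		uint16_t ring_size)
{
	assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);
	assert(inflight <= ring_size);

	return (uint16_t)((uint16_t)(next_idx - inflight) & (ring_size - 1));
}
```

A packet copied entirely by the CPU has a segment count of zero, so it
can be reported as completed without consulting the DMA engine.
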
 lib/librte_vhost/rte_vhost_async.h |  38 +++
 lib/librte_vhost/virtio_net.c      | 538 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 574 insertions(+), 2 deletions(-)
diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
index 82f2ebe..efcba0a 100644
--- a/lib/librte_vhost/rte_vhost_async.h
+++ b/lib/librte_vhost/rte_vhost_async.h
@@ -131,4 +131,42 @@ int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
  */
 int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
 
+/**
+ * This function submits enqueue data to the DMA engine. It does not
+ * guarantee that the transfer has completed upon return. Applications
+ * should poll transfer status with rte_vhost_poll_enqueue_completed().
+ *
+ * @param vid
+ *  id of vhost device to enqueue data
+ * @param queue_id
+ *  queue id to enqueue data
+ * @param pkts
+ *  array of packets to be enqueued
+ * @param count
+ *  packets num to be enqueued
+ * @return
+ *  num of packets enqueued
+ */
+uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count);
+
+/**
+ * This function checks the DMA completion status for a specific vhost
+ * device queue. Packets whose copy (enqueue) operation has finished
+ * are returned in an array.
+ *
+ * @param vid
+ *  id of vhost device to check copy completion
+ * @param queue_id
+ *  queue id to check copy completion
+ * @param pkts
+ *  empty array to receive the returned packet pointers
+ * @param count
+ *  size of the packet array
+ * @return
+ *  num of packets returned
+ */
+uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count);
+
 #endif /* _RTE_VDPA_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 751c1f3..cf9f884 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -17,14 +17,15 @@
 #include <rte_arp.h>
 #include <rte_spinlock.h>
 #include <rte_malloc.h>
+#include <rte_vhost_async.h>
 
 #include "iotlb.h"
 #include "vhost.h"
 
-#define MAX_PKT_BURST 32
-
 #define MAX_BATCH_LEN 256
 
+#define VHOST_ASYNC_BATCH_THRESHOLD 8
+
 static  __rte_always_inline bool
 rxvq_is_mergeable(struct virtio_net *dev)
 {
@@ -117,6 +118,35 @@
 }
 
 static __rte_always_inline void
+async_flush_shadow_used_ring_split(struct virtio_net *dev,
+	struct vhost_virtqueue *vq)
+{
+	uint16_t used_idx = vq->last_used_idx & (vq->size - 1);
+
+	if (used_idx + vq->shadow_used_idx <= vq->size) {
+		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0,
+					  vq->shadow_used_idx);
+	} else {
+		uint16_t size;
+
+		/* update used ring interval [used_idx, vq->size] */
+		size = vq->size - used_idx;
+		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0, size);
+
+		/* update the left half used ring interval [0, left_size] */
+		do_flush_shadow_used_ring_split(dev, vq, 0, size,
+					  vq->shadow_used_idx - size);
+	}
+	vq->last_used_idx += vq->shadow_used_idx;
+
+	rte_smp_wmb();
+
+	vhost_log_cache_sync(dev, vq);
+
+	vq->shadow_used_idx = 0;
+}
+
+static __rte_always_inline void
 update_shadow_used_ring_split(struct vhost_virtqueue *vq,
 			 uint16_t desc_idx, uint32_t len)
 {
@@ -905,6 +935,199 @@
 	return error;
 }
 
+static __rte_always_inline void
+async_fill_vec(struct iovec *v, void *base, size_t len)
+{
+	v->iov_base = base;
+	v->iov_len = len;
+}
+
+static __rte_always_inline void
+async_fill_it(struct iov_it *it, size_t count,
+	struct iovec *vec, unsigned long nr_seg)
+{
+	it->offset = 0;
+	it->count = count;
+
+	if (count) {
+		it->iov = vec;
+		it->nr_segs = nr_seg;
+	} else {
+		it->iov = 0;
+		it->nr_segs = 0;
+	}
+}
+
+static __rte_always_inline void
+async_fill_des(struct dma_trans_desc *desc,
+	struct iov_it *src, struct iov_it *dst)
+{
+	desc->src = src;
+	desc->dst = dst;
+}
+
+static __rte_always_inline int
+async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
+			struct rte_mbuf *m, struct buf_vector *buf_vec,
+			uint16_t nr_vec, uint16_t num_buffers,
+			struct iovec *src_iovec, struct iovec *dst_iovec,
+			struct iov_it *src_it, struct iov_it *dst_it)
+{
+	uint32_t vec_idx = 0;
+	uint32_t mbuf_offset, mbuf_avail;
+	uint32_t buf_offset, buf_avail;
+	uint64_t buf_addr, buf_iova, buf_len;
+	uint32_t cpy_len, cpy_threshold;
+	uint64_t hdr_addr;
+	struct rte_mbuf *hdr_mbuf;
+	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
+	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
+	int error = 0;
+
+	uint32_t tlen = 0;
+	int tvec_idx = 0;
+	void *hpa;
+
+	if (unlikely(m == NULL)) {
+		error = -1;
+		goto out;
+	}
+
+	cpy_threshold = vq->async_threshold;
+
+	buf_addr = buf_vec[vec_idx].buf_addr;
+	buf_iova = buf_vec[vec_idx].buf_iova;
+	buf_len = buf_vec[vec_idx].buf_len;
+
+	if (unlikely(buf_len < dev->vhost_hlen && nr_vec <= 1)) {
+		error = -1;
+		goto out;
+	}
+
+	hdr_mbuf = m;
+	hdr_addr = buf_addr;
+	if (unlikely(buf_len < dev->vhost_hlen))
+		hdr = &tmp_hdr;
+	else
+		hdr = (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)hdr_addr;
+
+	VHOST_LOG_DATA(DEBUG, "(%d) RX: num merge buffers %d\n",
+		dev->vid, num_buffers);
+
+	if (unlikely(buf_len < dev->vhost_hlen)) {
+		buf_offset = dev->vhost_hlen - buf_len;
+		vec_idx++;
+		buf_addr = buf_vec[vec_idx].buf_addr;
+		buf_iova = buf_vec[vec_idx].buf_iova;
+		buf_len = buf_vec[vec_idx].buf_len;
+		buf_avail = buf_len - buf_offset;
+	} else {
+		buf_offset = dev->vhost_hlen;
+		buf_avail = buf_len - dev->vhost_hlen;
+	}
+
+	mbuf_avail  = rte_pktmbuf_data_len(m);
+	mbuf_offset = 0;
+
+	while (mbuf_avail != 0 || m->next != NULL) {
+		/* done with current buf, get the next one */
+		if (buf_avail == 0) {
+			vec_idx++;
+			if (unlikely(vec_idx >= nr_vec)) {
+				error = -1;
+				goto out;
+			}
+
+			buf_addr = buf_vec[vec_idx].buf_addr;
+			buf_iova = buf_vec[vec_idx].buf_iova;
+			buf_len = buf_vec[vec_idx].buf_len;
+
+			buf_offset = 0;
+			buf_avail  = buf_len;
+		}
+
+		/* done with current mbuf, get the next one */
+		if (mbuf_avail == 0) {
+			m = m->next;
+
+			mbuf_offset = 0;
+			mbuf_avail  = rte_pktmbuf_data_len(m);
+		}
+
+		if (hdr_addr) {
+			virtio_enqueue_offload(hdr_mbuf, &hdr->hdr);
+			if (rxvq_is_mergeable(dev))
+				ASSIGN_UNLESS_EQUAL(hdr->num_buffers,
+						num_buffers);
+
+			if (unlikely(hdr == &tmp_hdr)) {
+				copy_vnet_hdr_to_desc(dev, vq, buf_vec, hdr);
+			} else {
+				PRINT_PACKET(dev, (uintptr_t)hdr_addr,
+						dev->vhost_hlen, 0);
+				vhost_log_cache_write_iova(dev, vq,
+						buf_vec[0].buf_iova,
+						dev->vhost_hlen);
+			}
+
+			hdr_addr = 0;
+		}
+
+		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
+
+		if (unlikely(cpy_len >= cpy_threshold)) {
+			hpa = (void *)(uintptr_t)gpa_to_hpa(dev,
+					buf_iova + buf_offset, cpy_len);
+
+			if (unlikely(!hpa)) {
+				error = -1;
+				goto out;
+			}
+
+			async_fill_vec(src_iovec + tvec_idx,
+				(void *)(uintptr_t)rte_pktmbuf_iova_offset(m,
+						mbuf_offset), cpy_len);
+
+			async_fill_vec(dst_iovec + tvec_idx, hpa, cpy_len);
+
+			tlen += cpy_len;
+			tvec_idx++;
+		} else {
+			if (unlikely(vq->batch_copy_nb_elems >= vq->size)) {
+				rte_memcpy(
+				(void *)((uintptr_t)(buf_addr + buf_offset)),
+				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
+				cpy_len);
+
+				PRINT_PACKET(dev,
+					(uintptr_t)(buf_addr + buf_offset),
+					cpy_len, 0);
+			} else {
+				batch_copy[vq->batch_copy_nb_elems].dst =
+				(void *)((uintptr_t)(buf_addr + buf_offset));
+				batch_copy[vq->batch_copy_nb_elems].src =
+				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset);
+				batch_copy[vq->batch_copy_nb_elems].log_addr =
+					buf_iova + buf_offset;
+				batch_copy[vq->batch_copy_nb_elems].len =
+					cpy_len;
+				vq->batch_copy_nb_elems++;
+			}
+		}
+
+		mbuf_avail  -= cpy_len;
+		mbuf_offset += cpy_len;
+		buf_avail  -= cpy_len;
+		buf_offset += cpy_len;
+	}
+
+out:
+	async_fill_it(src_it, tlen, src_iovec, tvec_idx);
+	async_fill_it(dst_it, tlen, dst_iovec, tvec_idx);
+
+	return error;
+}
+
 static __rte_always_inline int
 vhost_enqueue_single_packed(struct virtio_net *dev,
 			    struct vhost_virtqueue *vq,
@@ -1236,6 +1459,317 @@
 	return virtio_dev_rx(dev, queue_id, pkts, count);
 }
 
+static __rte_always_inline void
+virtio_dev_rx_async_submit_split_err(struct virtio_net *dev,
+	struct vhost_virtqueue *vq, uint16_t queue_id,
+	uint16_t last_idx, uint16_t shadow_idx)
+{
+	while (vq->async_pkts_inflight_n) {
+		int er = vq->async_ops.check_completed_copies(dev->vid,
+			queue_id, 0, MAX_PKT_BURST);
+
+		if (er < 0) {
+			vq->async_pkts_inflight_n = 0;
+			break;
+		}
+
+		vq->async_pkts_inflight_n -= er;
+	}
+
+	vq->shadow_used_idx = shadow_idx;
+	vq->last_avail_idx = last_idx;
+}
+
+static __rte_noinline uint32_t
+virtio_dev_rx_async_submit_split(struct virtio_net *dev,
+	struct vhost_virtqueue *vq, uint16_t queue_id,
+	struct rte_mbuf **pkts, uint32_t count)
+{
+	uint32_t pkt_idx = 0, pkt_burst_idx = 0;
+	uint16_t num_buffers;
+	struct buf_vector buf_vec[BUF_VECTOR_MAX];
+	uint16_t avail_head, last_idx, shadow_idx;
+
+	struct iov_it *it_pool = vq->it_pool;
+	struct iovec *vec_pool = vq->vec_pool;
+	struct dma_trans_desc tdes[MAX_PKT_BURST];
+	struct iovec *src_iovec = vec_pool;
+	struct iovec *dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
+	struct iov_it *src_it = it_pool;
+	struct iov_it *dst_it = it_pool + 1;
+	uint16_t n_free_slot, slot_idx;
+	int n_pkts = 0;
+
+	avail_head = *((volatile uint16_t *)&vq->avail->idx);
+	last_idx = vq->last_avail_idx;
+	shadow_idx = vq->shadow_used_idx;
+
+	/*
+	 * The ordering between avail index and
+	 * desc reads needs to be enforced.
+	 */
+	rte_smp_rmb();
+
+	rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
+
+	for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
+		uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
+		uint16_t nr_vec = 0;
+
+		if (unlikely(reserve_avail_buf_split(dev, vq,
+						pkt_len, buf_vec, &num_buffers,
+						avail_head, &nr_vec) < 0)) {
+			VHOST_LOG_DATA(DEBUG,
+				"(%d) failed to get enough desc from vring\n",
+				dev->vid);
+			vq->shadow_used_idx -= num_buffers;
+			break;
+		}
+
+		VHOST_LOG_DATA(DEBUG, "(%d) current index %d | end index %d\n",
+			dev->vid, vq->last_avail_idx,
+			vq->last_avail_idx + num_buffers);
+
+		if (async_mbuf_to_desc(dev, vq, pkts[pkt_idx],
+				buf_vec, nr_vec, num_buffers,
+				src_iovec, dst_iovec, src_it, dst_it) < 0) {
+			vq->shadow_used_idx -= num_buffers;
+			break;
+		}
+
+		slot_idx = (vq->async_pkts_idx + pkt_idx) & (vq->size - 1);
+		if (src_it->count) {
+			async_fill_des(&tdes[pkt_burst_idx], src_it, dst_it);
+			pkt_burst_idx++;
+			vq->async_pending_info[slot_idx] =
+				num_buffers | (src_it->nr_segs << 16);
+			src_iovec += src_it->nr_segs;
+			dst_iovec += dst_it->nr_segs;
+			src_it += 2;
+			dst_it += 2;
+		} else {
+			vq->async_pending_info[slot_idx] = num_buffers;
+			vq->async_pkts_inflight_n++;
+		}
+
+		vq->last_avail_idx += num_buffers;
+
+		if (pkt_burst_idx >= VHOST_ASYNC_BATCH_THRESHOLD ||
+				(pkt_idx == count - 1 && pkt_burst_idx)) {
+			n_pkts = vq->async_ops.transfer_data(dev->vid,
+					queue_id, tdes, 0, pkt_burst_idx);
+			src_iovec = vec_pool;
+			dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
+			src_it = it_pool;
+			dst_it = it_pool + 1;
+
+			if (unlikely(n_pkts < (int)pkt_burst_idx)) {
+				vq->async_pkts_inflight_n +=
+					n_pkts > 0 ? n_pkts : 0;
+				virtio_dev_rx_async_submit_split_err(dev,
+					vq, queue_id, last_idx, shadow_idx);
+				return 0;
+			}
+
+			pkt_burst_idx = 0;
+			vq->async_pkts_inflight_n += n_pkts;
+		}
+	}
+
+	if (pkt_burst_idx) {
+		n_pkts = vq->async_ops.transfer_data(dev->vid,
+				queue_id, tdes, 0, pkt_burst_idx);
+		if (unlikely(n_pkts < (int)pkt_burst_idx)) {
+			vq->async_pkts_inflight_n += n_pkts > 0 ? n_pkts : 0;
+			virtio_dev_rx_async_submit_split_err(dev, vq, queue_id,
+				last_idx, shadow_idx);
+			return 0;
+		}
+
+		vq->async_pkts_inflight_n += n_pkts;
+	}
+
+	do_data_copy_enqueue(dev, vq);
+
+	n_free_slot = vq->size - vq->async_pkts_idx;
+	if (n_free_slot > pkt_idx) {
+		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
+			pkts, pkt_idx * sizeof(uintptr_t));
+		vq->async_pkts_idx += pkt_idx;
+	} else {
+		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
+			pkts, n_free_slot * sizeof(uintptr_t));
+		rte_memcpy(&vq->async_pkts_pending[0],
+			&pkts[n_free_slot],
+			(pkt_idx - n_free_slot) * sizeof(uintptr_t));
+		vq->async_pkts_idx = pkt_idx - n_free_slot;
+	}
+
+	if (likely(vq->shadow_used_idx))
+		async_flush_shadow_used_ring_split(dev, vq);
+
+	return pkt_idx;
+}
+
+uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count)
+{
+	struct virtio_net *dev = get_device(vid);
+	struct vhost_virtqueue *vq;
+	uint16_t n_pkts_cpl, n_pkts_put = 0, n_descs = 0;
+	uint16_t start_idx, pkts_idx, vq_size;
+	uint64_t *async_pending_info;
+
+	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
+	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
+		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
+			dev->vid, __func__, queue_id);
+		return 0;
+	}
+
+	vq = dev->virtqueue[queue_id];
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	pkts_idx = vq->async_pkts_idx;
+	async_pending_info = vq->async_pending_info;
+	vq_size = vq->size;
+	start_idx = pkts_idx > vq->async_pkts_inflight_n ?
+		pkts_idx - vq->async_pkts_inflight_n :
+		(vq_size - vq->async_pkts_inflight_n + pkts_idx) &
+		(vq_size - 1);
+
+	n_pkts_cpl =
+		vq->async_ops.check_completed_copies(vid, queue_id, 0, count);
+
+	rte_smp_wmb();
+
+	while (likely(((start_idx + n_pkts_put) & (vq_size - 1)) != pkts_idx)) {
+		uint64_t info = async_pending_info[
+			(start_idx + n_pkts_put) & (vq_size - 1)];
+		uint64_t n_segs;
+		n_pkts_put++;
+		n_descs += info & ASYNC_PENDING_INFO_N_MSK;
+		n_segs = info >> ASYNC_PENDING_INFO_N_SFT;
+
+		if (n_segs) {
+			if (!n_pkts_cpl || n_pkts_cpl < n_segs) {
+				n_pkts_put--;
+				n_descs -= info & ASYNC_PENDING_INFO_N_MSK;
+				if (n_pkts_cpl) {
+					async_pending_info[
+						(start_idx + n_pkts_put) &
+						(vq_size - 1)] =
+					((n_segs - n_pkts_cpl) <<
+					 ASYNC_PENDING_INFO_N_SFT) |
+					(info & ASYNC_PENDING_INFO_N_MSK);
+					n_pkts_cpl = 0;
+				}
+				break;
+			}
+			n_pkts_cpl -= n_segs;
+		}
+	}
+
+	if (n_pkts_put) {
+		vq->async_pkts_inflight_n -= n_pkts_put;
+		*(volatile uint16_t *)&vq->used->idx += n_descs;
+
+		vhost_vring_call_split(dev, vq);
+	}
+
+	if (start_idx + n_pkts_put <= vq_size) {
+		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
+			n_pkts_put * sizeof(uintptr_t));
+	} else {
+		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
+			(vq_size - start_idx) * sizeof(uintptr_t));
+		rte_memcpy(&pkts[vq_size - start_idx], vq->async_pkts_pending,
+			(n_pkts_put - vq_size + start_idx) * sizeof(uintptr_t));
+	}
+
+	rte_spinlock_unlock(&vq->access_lock);
+
+	return n_pkts_put;
+}
+
+static __rte_always_inline uint32_t
+virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id,
+	struct rte_mbuf **pkts, uint32_t count)
+{
+	struct vhost_virtqueue *vq;
+	uint32_t nb_tx = 0;
+	bool drawback = false;
+
+	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
+	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
+		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
+			dev->vid, __func__, queue_id);
+		return 0;
+	}
+
+	vq = dev->virtqueue[queue_id];
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	if (unlikely(vq->enabled == 0))
+		goto out_access_unlock;
+
+	if (unlikely(!vq->async_registered)) {
+		drawback = true;
+		goto out_access_unlock;
+	}
+
+	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+		vhost_user_iotlb_rd_lock(vq);
+
+	if (unlikely(vq->access_ok == 0))
+		if (unlikely(vring_translate(dev, vq) < 0))
+			goto out;
+
+	count = RTE_MIN((uint32_t)MAX_PKT_BURST, count);
+	if (count == 0)
+		goto out;
+
+	/* TODO: packed queue not implemented */
+	if (vq_is_packed(dev))
+		nb_tx = 0;
+	else
+		nb_tx = virtio_dev_rx_async_submit_split(dev,
+				vq, queue_id, pkts, count);
+
+out:
+	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+		vhost_user_iotlb_rd_unlock(vq);
+
+out_access_unlock:
+	rte_spinlock_unlock(&vq->access_lock);
+
+	if (drawback)
+		return rte_vhost_enqueue_burst(dev->vid, queue_id, pkts, count);
+
+	return nb_tx;
+}
+
+uint16_t
+rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (!dev)
+		return 0;
+
+	if (unlikely(!(dev->flags & VIRTIO_DEV_BUILTIN_VIRTIO_NET))) {
+		VHOST_LOG_DATA(ERR,
+			"(%d) %s: built-in vhost net backend is disabled.\n",
+			dev->vid, __func__);
+		return 0;
+	}
+
+	return virtio_dev_rx_async_submit(dev, queue_id, pkts, count);
+}
+
 static inline bool
 virtio_net_with_host_offload(struct virtio_net *dev)
 {
-- 
1.8.3.1
* Re: [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path registration API
  2020-06-11 10:02 ` [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path registration API patrick.fu
@ 2020-06-18  5:50   ` Liu, Yong
  2020-06-18  9:08     ` Fu, Patrick
  2020-06-25 13:42     ` Maxime Coquelin
  2020-06-26 14:28   ` Maxime Coquelin
  2020-06-26 14:44   ` Maxime Coquelin
  2 siblings, 2 replies; 36+ messages in thread
From: Liu, Yong @ 2020-06-18  5:50 UTC (permalink / raw)
  To: Fu, Patrick
  Cc: Fu, Patrick, Jiang, Cheng1, Liang, Cunming, dev, maxime.coquelin,
	Xia, Chenbo, Wang, Zhihong, Ye, Xiaolong
Thanks, Patrick. Some comments are inline.
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of patrick.fu@intel.com
> Sent: Thursday, June 11, 2020 6:02 PM
> To: dev@dpdk.org; maxime.coquelin@redhat.com; Xia, Chenbo
> <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; Ye,
> Xiaolong <xiaolong.ye@intel.com>
> Cc: Fu, Patrick <patrick.fu@intel.com>; Jiang, Cheng1
> <cheng1.jiang@intel.com>; Liang, Cunming <cunming.liang@intel.com>
> Subject: [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path
> registration API
> 
> From: Patrick <patrick.fu@intel.com>
> 
> This patch introduces registration/un-registration APIs
> for async data path together with all required data
> structures and DMA callback function proto-types.
> 
> Signed-off-by: Patrick <patrick.fu@intel.com>
> ---
>  lib/librte_vhost/Makefile          |   3 +-
>  lib/librte_vhost/rte_vhost.h       |   1 +
>  lib/librte_vhost/rte_vhost_async.h | 134
> +++++++++++++++++++++++++++++++++++++
>  lib/librte_vhost/socket.c          |  20 ++++++
>  lib/librte_vhost/vhost.c           |  74 +++++++++++++++++++-
>  lib/librte_vhost/vhost.h           |  30 ++++++++-
>  lib/librte_vhost/vhost_user.c      |  28 ++++++--
>  7 files changed, 283 insertions(+), 7 deletions(-)
>  create mode 100644 lib/librte_vhost/rte_vhost_async.h
> 
> diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> index e592795..3aed094 100644
> --- a/lib/librte_vhost/Makefile
> +++ b/lib/librte_vhost/Makefile
> @@ -41,7 +41,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c
> iotlb.c socket.c vhost.c \
>  					vhost_user.c virtio_net.c vdpa.c
> 
>  # install includes
> -SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h
> +SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h
> \
> +						rte_vhost_async.h
> 
Hi Patrick,
Please also update the meson build for the newly added file.
Thanks,
Marvin
>  # only compile vhost crypto when cryptodev is enabled
>  ifeq ($(CONFIG_RTE_LIBRTE_CRYPTODEV),y)
> diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
> index d43669f..cec4d07 100644
> --- a/lib/librte_vhost/rte_vhost.h
> +++ b/lib/librte_vhost/rte_vhost.h
> @@ -35,6 +35,7 @@
>  #define RTE_VHOST_USER_EXTBUF_SUPPORT	(1ULL << 5)
>  /* support only linear buffers (no chained mbufs) */
>  #define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
> +#define RTE_VHOST_USER_ASYNC_COPY	(1ULL << 7)
> 
>  /** Protocol features. */
>  #ifndef VHOST_USER_PROTOCOL_F_MQ
> diff --git a/lib/librte_vhost/rte_vhost_async.h
> b/lib/librte_vhost/rte_vhost_async.h
> new file mode 100644
> index 0000000..82f2ebe
> --- /dev/null
> +++ b/lib/librte_vhost/rte_vhost_async.h
> @@ -0,0 +1,134 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2018 Intel Corporation
> + */
s/2018/2020/ 
> +
> +#ifndef _RTE_VHOST_ASYNC_H_
> +#define _RTE_VHOST_ASYNC_H_
> +
> +#include "rte_vhost.h"
> +
> +/**
> + * iovec iterator
> + */
> +struct iov_it {
> +	/** offset to the first byte of interesting data */
> +	size_t offset;
> +	/** total bytes of data in this iterator */
> +	size_t count;
> +	/** pointer to the iovec array */
> +	struct iovec *iov;
> +	/** number of iovec in this iterator */
> +	unsigned long nr_segs;
> +};
Patrick,
I think naming the structure with "it" is too generic to be understood; please use a more meaningful name such as "iov_iter".
> +
> +/**
> + * dma transfer descriptor pair
> + */
> +struct dma_trans_desc {
> +	/** source memory iov_it */
> +	struct iov_it *src;
> +	/** destination memory iov_it */
> +	struct iov_it *dst;
> +};
> +
This patch series is named async copy, and DMA is just one async copy method, provided by the underlying hardware.
IMHO, the structure would be better named "async_copy_desc", which matches the overall concept.
> +/**
> + * dma transfer status
> + */
> +struct dma_trans_status {
> +	/** An array of application specific data for source memory */
> +	uintptr_t *src_opaque_data;
> +	/** An array of application specific data for destination memory */
> +	uintptr_t *dst_opaque_data;
> +};
> +
Same as the previous comment.
> +/**
> + * dma operation callbacks to be implemented by applications
> + */
> +struct rte_vhost_async_channel_ops {
> +	/**
> +	 * instruct a DMA channel to perform copies for a batch of packets
> +	 *
> +	 * @param vid
> +	 *  id of vhost device to perform data copies
> +	 * @param queue_id
> +	 *  queue id to perform data copies
> +	 * @param descs
> +	 *  an array of DMA transfer memory descriptors
> +	 * @param opaque_data
> +	 *  opaque data pair sending to DMA engine
> +	 * @param count
> +	 *  number of elements in the "descs" array
> +	 * @return
> +	 *  -1 on failure, number of descs processed on success
> +	 */
> +	int (*transfer_data)(int vid, uint16_t queue_id,
> +		struct dma_trans_desc *descs,
> +		struct dma_trans_status *opaque_data,
> +		uint16_t count);
> +	/**
> +	 * check copy-completed packets from a DMA channel
> +	 * @param vid
> +	 *  id of vhost device to check copy completion
> +	 * @param queue_id
> +	 *  queue id to check copyp completion
> +	 * @param opaque_data
> +	 *  buffer to receive the opaque data pair from DMA engine
> +	 * @param max_packets
> +	 *  max number of packets could be completed
> +	 * @return
> +	 *  -1 on failure, number of iov segments completed on success
> +	 */
> +	int (*check_completed_copies)(int vid, uint16_t queue_id,
> +		struct dma_trans_status *opaque_data,
> +		uint16_t max_packets);
> +};
> +
> +/**
> + *  dma channel feature bit definition
> + */
> +struct dma_channel_features {
> +	union {
> +		uint32_t intval;
> +		struct {
> +			uint32_t inorder:1;
> +			uint32_t resvd0115:15;
> +			uint32_t threshold:12;
> +			uint32_t resvd2831:4;
> +		};
> +	};
> +};
> +
Naming the feature bits "intval" may cause confusion; why not name the field after its meaning, e.g. "engine_features"?
I'm not sure whether a name like "resvd0115" matches the DPDK coding style. As far as I know, DPDK uses resvd_0 and resvd_1 for two reserved fields.
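To illustrate the naming suggestion, here is a minimal sketch, not the patch's code. The names async_channel_features, engine_features, resvd_0, resvd_1 and the async_features_* helpers are hypothetical; only the intended bit positions (b0 for in-order, b16 - b27 for the threshold) are taken from the patch. The helpers use explicit shifts because bit-field layout is implementation-defined in C, so they do not depend on how the compiler orders the fields.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical renaming of the feature word; comments give the bit
 * positions as documented in the patch. Anonymous members need C11. */
struct async_channel_features {
	union {
		uint32_t engine_features;	/* whole feature word */
		struct {
			uint32_t inorder:1;	/* b0 */
			uint32_t resvd_0:15;	/* b1 - b15 */
			uint32_t threshold:12;	/* b16 - b27 */
			uint32_t resvd_1:4;	/* b28 - b31 */
		};
	};
};

#define ASYNC_FEAT_THRESHOLD_SHIFT 16
#define ASYNC_FEAT_THRESHOLD_MASK  0xFFFu	/* 12 bits */

/* Build the 32-bit feature word passed at registration time. */
static inline uint32_t
async_features_pack(int inorder, uint32_t threshold)
{
	assert(threshold <= ASYNC_FEAT_THRESHOLD_MASK);
	return (inorder ? 1u : 0u) |
		(threshold << ASYNC_FEAT_THRESHOLD_SHIFT);
}

/* Extract the packet length threshold (b16 - b27). */
static inline uint32_t
async_features_threshold(uint32_t features)
{
	return (features >> ASYNC_FEAT_THRESHOLD_SHIFT) &
		ASYNC_FEAT_THRESHOLD_MASK;
}

/* Extract the in-order flag (b0). */
static inline int
async_features_inorder(uint32_t features)
{
	return (int)(features & 1u);
}
```

With helpers like these, an application could build the features argument without knowing the union layout, e.g. async_features_pack(1, 256) for an in-order channel with a 256-byte threshold.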
> +/**
> + * register a dma channel for vhost
> + *
> + * @param vid
> + *  vhost device id DMA channel to be attached to
> + * @param queue_id
> + *  vhost queue id DMA channel to be attached to
> + * @param features
> + *  DMA channel feature bit
> + *    b0       : DMA supports inorder data transfer
> + *    b1  - b15: reserved
> + *    b16 - b27: Packet length threshold for DMA transfer
> + *    b28 - b31: reserved
> + * @param ops
> + *  DMA operation callbacks
> + * @return
> + *  0 on success, -1 on failures
> + */
> +int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
> +	uint32_t features, struct rte_vhost_async_channel_ops *ops);
> +
> +/**
> + * unregister a dma channel for vhost
> + *
> + * @param vid
> + *  vhost device id DMA channel to be detached
> + * @param queue_id
> + *  vhost queue id DMA channel to be detached
> + * @return
> + *  0 on success, -1 on failures
> + */
> +int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
> +
> +#endif /* _RTE_VDPA_H_ */
> diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
> index 0a66ef9..f817783 100644
> --- a/lib/librte_vhost/socket.c
> +++ b/lib/librte_vhost/socket.c
> @@ -42,6 +42,7 @@ struct vhost_user_socket {
>  	bool use_builtin_virtio_net;
>  	bool extbuf;
>  	bool linearbuf;
> +	bool async_copy;
> 
>  	/*
>  	 * The "supported_features" indicates the feature bits the
> @@ -210,6 +211,7 @@ struct vhost_user {
>  	size_t size;
>  	struct vhost_user_connection *conn;
>  	int ret;
> +	struct virtio_net *dev;
> 
>  	if (vsocket == NULL)
>  		return;
> @@ -241,6 +243,13 @@ struct vhost_user {
>  	if (vsocket->linearbuf)
>  		vhost_enable_linearbuf(vid);
> 
> +	if (vsocket->async_copy) {
> +		dev = get_device(vid);
> +
> +		if (dev)
> +			dev->async_copy = 1;
> +	}
> +
IMHO, the user can choose which queues use async copy, since backend hardware resources are limited.
So should the async_copy enable flag be saved in the virtqueue structure?
>  	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
> 
>  	if (vsocket->notify_ops->new_connection) {
> @@ -891,6 +900,17 @@ struct vhost_user_reconnect_list {
>  		goto out_mutex;
>  	}
> 
> +	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
> +
> +	if (vsocket->async_copy &&
> +		(flags & (RTE_VHOST_USER_IOMMU_SUPPORT |
> +		RTE_VHOST_USER_POSTCOPY_SUPPORT))) {
> +		VHOST_LOG_CONFIG(ERR, "error: enabling async copy and
> IOMMU "
> +			"or post-copy feature simultaneously is not "
> +			"supported\n");
> +		goto out_mutex;
> +	}
> +
>  	/*
>  	 * Set the supported features correctly for the builtin vhost-user
>  	 * net driver.
> diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> index 0266318..e6b688a 100644
> --- a/lib/librte_vhost/vhost.c
> +++ b/lib/librte_vhost/vhost.c
> @@ -332,8 +332,13 @@
>  {
>  	if (vq_is_packed(dev))
>  		rte_free(vq->shadow_used_packed);
> -	else
> +	else {
>  		rte_free(vq->shadow_used_split);
> +		if (vq->async_pkts_pending)
> +			rte_free(vq->async_pkts_pending);
> +		if (vq->async_pending_info)
> +			rte_free(vq->async_pending_info);
> +	}
>  	rte_free(vq->batch_copy_elems);
>  	rte_mempool_free(vq->iotlb_pool);
>  	rte_free(vq);
> @@ -1527,3 +1532,70 @@ int rte_vhost_extern_callback_register(int vid,
>  	if (vhost_data_log_level >= 0)
>  		rte_log_set_level(vhost_data_log_level,
> RTE_LOG_WARNING);
>  }
> +
> +int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
> +					uint32_t features,
> +					struct rte_vhost_async_channel_ops
> *ops)
> +{
> +	struct vhost_virtqueue *vq;
> +	struct virtio_net *dev = get_device(vid);
> +	struct dma_channel_features f;
> +
> +	if (dev == NULL || ops == NULL)
> +		return -1;
> +
> +	f.intval = features;
> +
> +	vq = dev->virtqueue[queue_id];
> +
> +	if (vq == NULL)
> +		return -1;
> +
> +	/** packed queue is not supported */
> +	if (vq_is_packed(dev) || !f.inorder)
> +		return -1;
> +
Virtio already has an in_order concept; the two names are so similar that they can easily be mixed up. Please consider how to distinguish them.
> +	if (ops->check_completed_copies == NULL ||
> +		ops->transfer_data == NULL)
> +		return -1;
> +
The error conditions above are unlikely to be true; wrapping them in the unlikely() macro may make that clearer.
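A minimal sketch of what I mean, under stated assumptions: struct async_ops_sketch, stub_cb and async_ops_validate are hypothetical stand-ins with simplified callback signatures, not the patch's types. The fallback definition of unlikely() mirrors the usual __builtin_expect form and only applies when the DPDK header has not already defined it; it requires GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* Fallback only; inside DPDK this comes from rte_branch_prediction.h. */
#ifndef unlikely
#define unlikely(x) __builtin_expect(!!(x), 0)
#endif

/* Hypothetical, simplified stand-in for the async ops structure. */
struct async_ops_sketch {
	int (*transfer_data)(int vid, uint16_t queue_id);
	int (*check_completed_copies)(int vid, uint16_t queue_id);
};

/* Stub callback used only to exercise the success path. */
static int
stub_cb(int vid, uint16_t queue_id)
{
	(void)vid;
	(void)queue_id;
	return 0;
}

/* Return 0 if both callbacks are set, -1 otherwise. The error
 * branches are marked as cold so the reader sees the expected path. */
static int
async_ops_validate(const struct async_ops_sketch *ops)
{
	if (unlikely(ops == NULL))
		return -1;

	if (unlikely(ops->check_completed_copies == NULL ||
		     ops->transfer_data == NULL))
		return -1;

	return 0;
}
```

Registration is a control-path call, so the benefit here is mostly readability rather than measurable performance.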
> +	rte_spinlock_lock(&vq->access_lock);
> +
> +	vq->async_ops.check_completed_copies = ops-
> >check_completed_copies;
> +	vq->async_ops.transfer_data = ops->transfer_data;
> +
> +	vq->async_inorder = f.inorder;
> +	vq->async_threshold = f.threshold;
> +
> +	vq->async_registered = true;
> +
> +	rte_spinlock_unlock(&vq->access_lock);
> +
> +	return 0;
> +}
> +
> +int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id)
> +{
> +	struct vhost_virtqueue *vq;
> +	struct virtio_net *dev = get_device(vid);
> +
> +	if (dev == NULL)
> +		return -1;
> +
> +	vq = dev->virtqueue[queue_id];
> +
> +	if (vq == NULL)
> +		return -1;
> +
> +	rte_spinlock_lock(&vq->access_lock);
> +
> +	vq->async_ops.transfer_data = NULL;
> +	vq->async_ops.check_completed_copies = NULL;
> +
> +	vq->async_registered = false;
> +
> +	rte_spinlock_unlock(&vq->access_lock);
> +
> +	return 0;
> +}
> +
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index df98d15..a7fbe23 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -23,6 +23,8 @@
>  #include "rte_vhost.h"
>  #include "rte_vdpa.h"
> 
> +#include "rte_vhost_async.h"
> +
>  /* Used to indicate that the device is running on a data core */
>  #define VIRTIO_DEV_RUNNING 1
>  /* Used to indicate that the device is ready to operate */
> @@ -39,6 +41,11 @@
> 
>  #define VHOST_LOG_CACHE_NR 32
> 
> +#define MAX_PKT_BURST 32
> +
> +#define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
> +#define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
> +
>  #define PACKED_DESC_ENQUEUE_USED_FLAG(w)	\
>  	((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED |
> VRING_DESC_F_WRITE) : \
>  		VRING_DESC_F_WRITE)
> @@ -200,6 +207,25 @@ struct vhost_virtqueue {
>  	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_list;
>  	int				iotlb_cache_nr;
>  	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_pending_list;
> +
> +	/* operation callbacks for async dma */
> +	struct rte_vhost_async_channel_ops	async_ops;
> +
> +	struct iov_it it_pool[VHOST_MAX_ASYNC_IT];
> +	struct iovec vec_pool[VHOST_MAX_ASYNC_VEC];
> +
> +	/* async data transfer status */
> +	uintptr_t	**async_pkts_pending;
> +	#define		ASYNC_PENDING_INFO_N_MSK 0xFFFF
> +	#define		ASYNC_PENDING_INFO_N_SFT 16
> +	uint64_t	*async_pending_info;
> +	uint16_t	async_pkts_idx;
> +	uint16_t	async_pkts_inflight_n;
> +
> +	/* vq async features */
> +	bool		async_inorder;
> +	bool		async_registered;
> +	uint16_t	async_threshold;
>  } __rte_cache_aligned;
> 
>  /* Old kernels have no such macros defined */
> @@ -353,6 +379,7 @@ struct virtio_net {
>  	int16_t			broadcast_rarp;
>  	uint32_t		nr_vring;
>  	int			dequeue_zero_copy;
> +	int			async_copy;
>  	int			extbuf;
>  	int			linearbuf;
>  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
> @@ -702,7 +729,8 @@ uint64_t translate_log_addr(struct virtio_net *dev,
> struct vhost_virtqueue *vq,
>  	/* Don't kick guest if we don't reach index specified by guest. */
>  	if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
>  		uint16_t old = vq->signalled_used;
> -		uint16_t new = vq->last_used_idx;
> +		uint16_t new = vq->async_pkts_inflight_n ?
> +					vq->used->idx:vq->last_used_idx;
>  		bool signalled_used_valid = vq->signalled_used_valid;
> 
>  		vq->signalled_used = new;
> diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
> index 84bebad..d7600bf 100644
> --- a/lib/librte_vhost/vhost_user.c
> +++ b/lib/librte_vhost/vhost_user.c
> @@ -464,12 +464,25 @@
>  	} else {
>  		if (vq->shadow_used_split)
>  			rte_free(vq->shadow_used_split);
> +		if (vq->async_pkts_pending)
> +			rte_free(vq->async_pkts_pending);
> +		if (vq->async_pending_info)
> +			rte_free(vq->async_pending_info);
> +
>  		vq->shadow_used_split = rte_malloc(NULL,
>  				vq->size * sizeof(struct vring_used_elem),
>  				RTE_CACHE_LINE_SIZE);
> -		if (!vq->shadow_used_split) {
> +		vq->async_pkts_pending = rte_malloc(NULL,
> +				vq->size * sizeof(uintptr_t),
> +				RTE_CACHE_LINE_SIZE);
> +		vq->async_pending_info = rte_malloc(NULL,
> +				vq->size * sizeof(uint64_t),
> +				RTE_CACHE_LINE_SIZE);
> +		if (!vq->shadow_used_split ||
> +			!vq->async_pkts_pending ||
> +			!vq->async_pending_info) {
>  			VHOST_LOG_CONFIG(ERR,
> -					"failed to allocate memory for
> shadow used ring.\n");
> +					"failed to allocate memory for vq
> internal data.\n");
If async copy is not enabled, there is no need to allocate the related structures.
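Roughly, the allocation could be guarded as in the sketch below. This is only an illustration under assumptions: struct vq_sketch and vq_alloc_async are hypothetical, the async_copy flag is shown on the queue although the patch keeps it in struct virtio_net, and libc calloc()/free() stand in for rte_malloc()/rte_free().

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, reduced view of the virtqueue fields involved. */
struct vq_sketch {
	uint16_t size;			/* ring size */
	int async_copy;			/* non-zero if async copy is enabled */
	uintptr_t *async_pkts_pending;	/* per-slot packet pointers */
	uint64_t *async_pending_info;	/* per-slot descriptor/segment info */
};

/* Release any previous async arrays, then allocate new ones only when
 * async copy is enabled. Returns 0 on success, -1 on allocation failure. */
static int
vq_alloc_async(struct vq_sketch *vq)
{
	free(vq->async_pkts_pending);
	free(vq->async_pending_info);
	vq->async_pkts_pending = NULL;
	vq->async_pending_info = NULL;

	if (!vq->async_copy)
		return 0;	/* nothing to allocate on the sync path */

	vq->async_pkts_pending = calloc(vq->size, sizeof(uintptr_t));
	vq->async_pending_info = calloc(vq->size, sizeof(uint64_t));
	if (!vq->async_pkts_pending || !vq->async_pending_info) {
		free(vq->async_pkts_pending);
		free(vq->async_pending_info);
		vq->async_pkts_pending = NULL;
		vq->async_pending_info = NULL;
		return -1;
	}

	return 0;
}
```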
>  			return RTE_VHOST_MSG_RESULT_ERR;
>  		}
>  	}
> @@ -1147,7 +1160,8 @@
>  			goto err_mmap;
>  		}
> 
> -		populate = (dev->dequeue_zero_copy) ? MAP_POPULATE : 0;
> +		populate = (dev->dequeue_zero_copy || dev->async_copy) ?
> +			MAP_POPULATE : 0;
>  		mmap_addr = mmap(NULL, mmap_size, PROT_READ |
> PROT_WRITE,
>  				 MAP_SHARED | populate, fd, 0);
> 
> @@ -1162,7 +1176,7 @@
>  		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
>  				      mmap_offset;
> 
> -		if (dev->dequeue_zero_copy)
> +		if (dev->dequeue_zero_copy || dev->async_copy)
>  			if (add_guest_pages(dev, reg, alignment) < 0) {
>  				VHOST_LOG_CONFIG(ERR,
>  					"adding guest pages to region %u
> failed.\n",
> @@ -1945,6 +1959,12 @@ static int vhost_user_set_vring_err(struct
> virtio_net **pdev __rte_unused,
>  	} else {
>  		rte_free(vq->shadow_used_split);
>  		vq->shadow_used_split = NULL;
> +		if (vq->async_pkts_pending)
> +			rte_free(vq->async_pkts_pending);
> +		if (vq->async_pending_info)
> +			rte_free(vq->async_pending_info);
> +		vq->async_pkts_pending = NULL;
> +		vq->async_pending_info = NULL;
>  	}
> 
>  	rte_free(vq->batch_copy_elems);
> --
> 1.8.3.1
* Re: [dpdk-dev] [PATCH v1 2/2] vhost: introduce async enqueue for split ring
  2020-06-11 10:02 ` [dpdk-dev] [PATCH v1 2/2] vhost: introduce async enqueue for split ring patrick.fu
@ 2020-06-18  6:56   ` Liu, Yong
  2020-06-18 11:36     ` Fu, Patrick
  2020-06-26 14:39   ` Maxime Coquelin
  2020-06-26 14:46   ` Maxime Coquelin
  2 siblings, 1 reply; 36+ messages in thread
From: Liu, Yong @ 2020-06-18  6:56 UTC (permalink / raw)
  To: Fu, Patrick
  Cc: Fu, Patrick, Jiang, Cheng1, Liang, Cunming, dev, maxime.coquelin,
	Xia, Chenbo, Wang, Zhihong, Ye, Xiaolong
Thanks, Patrick. Some comments are inline.
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of patrick.fu@intel.com
> Sent: Thursday, June 11, 2020 6:02 PM
> To: dev@dpdk.org; maxime.coquelin@redhat.com; Xia, Chenbo
> <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; Ye,
> Xiaolong <xiaolong.ye@intel.com>
> Cc: Fu, Patrick <patrick.fu@intel.com>; Jiang, Cheng1
> <cheng1.jiang@intel.com>; Liang, Cunming <cunming.liang@intel.com>
> Subject: [dpdk-dev] [PATCH v1 2/2] vhost: introduce async enqueue for split
> ring
> 
> From: Patrick <patrick.fu@intel.com>
> 
> This patch implement async enqueue data path for split ring.
> 
> Signed-off-by: Patrick <patrick.fu@intel.com>
> ---
>  lib/librte_vhost/rte_vhost_async.h |  38 +++
>  lib/librte_vhost/virtio_net.c      | 538
> ++++++++++++++++++++++++++++++++++++-
>  2 files changed, 574 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_vhost/rte_vhost_async.h
> b/lib/librte_vhost/rte_vhost_async.h
> index 82f2ebe..efcba0a 100644
> --- a/lib/librte_vhost/rte_vhost_async.h
> +++ b/lib/librte_vhost/rte_vhost_async.h
> @@ -131,4 +131,42 @@ int rte_vhost_async_channel_register(int vid,
> uint16_t queue_id,
>   */
>  int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
> 
> +/**
> + * This function submit enqueue data to DMA. This function has no
> + * guranttee to the transfer completion upon return. Applications should
> + * poll transfer status by rte_vhost_poll_enqueue_completed()
> + *
> + * @param vid
> + *  id of vhost device to enqueue data
> + * @param queue_id
> + *  queue id to enqueue data
> + * @param pkts
> + *  array of packets to be enqueued
> + * @param count
> + *  packets num to be enqueued
> + * @return
> + *  num of packets enqueued
> + */
> +uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
> +		struct rte_mbuf **pkts, uint16_t count);
> +
> +/**
> + * This function check DMA completion status for a specific vhost
> + * device queue. Packets which finish copying (enqueue) operation
> + * will be returned in an array.
> + *
> + * @param vid
> + *  id of vhost device to enqueue data
> + * @param queue_id
> + *  queue id to enqueue data
> + * @param pkts
> + *  blank array to get return packet pointer
> + * @param count
> + *  size of the packet array
> + * @return
> + *  num of packets returned
> + */
> +uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
> +		struct rte_mbuf **pkts, uint16_t count);
> +
>  #endif /* _RTE_VDPA_H_ */
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 751c1f3..cf9f884 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -17,14 +17,15 @@
>  #include <rte_arp.h>
>  #include <rte_spinlock.h>
>  #include <rte_malloc.h>
> +#include <rte_vhost_async.h>
> 
>  #include "iotlb.h"
>  #include "vhost.h"
> 
> -#define MAX_PKT_BURST 32
> -
>  #define MAX_BATCH_LEN 256
> 
> +#define VHOST_ASYNC_BATCH_THRESHOLD 8
> +
>  static  __rte_always_inline bool
>  rxvq_is_mergeable(struct virtio_net *dev)
>  {
> @@ -117,6 +118,35 @@
>  }
> 
>  static __rte_always_inline void
> +async_flush_shadow_used_ring_split(struct virtio_net *dev,
> +	struct vhost_virtqueue *vq)
> +{
> +	uint16_t used_idx = vq->last_used_idx & (vq->size - 1);
> +
> +	if (used_idx + vq->shadow_used_idx <= vq->size) {
> +		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0,
> +					  vq->shadow_used_idx);
> +	} else {
> +		uint16_t size;
> +
> +		/* update used ring interval [used_idx, vq->size] */
> +		size = vq->size - used_idx;
> +		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0, size);
> +
> +		/* update the left half used ring interval [0, left_size] */
> +		do_flush_shadow_used_ring_split(dev, vq, 0, size,
> +					  vq->shadow_used_idx - size);
> +	}
> +	vq->last_used_idx += vq->shadow_used_idx;
> +
> +	rte_smp_wmb();
> +
> +	vhost_log_cache_sync(dev, vq);
> +
> +	vq->shadow_used_idx = 0;
> +}
> +
> +static __rte_always_inline void
>  update_shadow_used_ring_split(struct vhost_virtqueue *vq,
>  			 uint16_t desc_idx, uint32_t len)
>  {
> @@ -905,6 +935,199 @@
>  	return error;
>  }
> 
> +static __rte_always_inline void
> +async_fill_vec(struct iovec *v, void *base, size_t len)
> +{
> +	v->iov_base = base;
> +	v->iov_len = len;
> +}
> +
> +static __rte_always_inline void
> +async_fill_it(struct iov_it *it, size_t count,
> +	struct iovec *vec, unsigned long nr_seg)
> +{
> +	it->offset = 0;
> +	it->count = count;
> +
> +	if (count) {
> +		it->iov = vec;
> +		it->nr_segs = nr_seg;
> +	} else {
> +		it->iov = 0;
> +		it->nr_segs = 0;
> +	}
> +}
> +
> +static __rte_always_inline void
> +async_fill_des(struct dma_trans_desc *desc,
> +	struct iov_it *src, struct iov_it *dst)
> +{
> +	desc->src = src;
> +	desc->dst = dst;
> +}
> +
> +static __rte_always_inline int
> +async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> +			struct rte_mbuf *m, struct buf_vector *buf_vec,
> +			uint16_t nr_vec, uint16_t num_buffers,
> +			struct iovec *src_iovec, struct iovec *dst_iovec,
> +			struct iov_it *src_it, struct iov_it *dst_it)
> +{
There are too many arguments in this function; please check whether this impacts performance.
> +	uint32_t vec_idx = 0;
> +	uint32_t mbuf_offset, mbuf_avail;
> +	uint32_t buf_offset, buf_avail;
> +	uint64_t buf_addr, buf_iova, buf_len;
> +	uint32_t cpy_len, cpy_threshold;
> +	uint64_t hdr_addr;
> +	struct rte_mbuf *hdr_mbuf;
> +	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
> +	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
> +	int error = 0;
> +
> +	uint32_t tlen = 0;
> +	int tvec_idx = 0;
> +	void *hpa;
> +
> +	if (unlikely(m == NULL)) {
> +		error = -1;
> +		goto out;
> +	}
> +
> +	cpy_threshold = vq->async_threshold;
> +
> +	buf_addr = buf_vec[vec_idx].buf_addr;
> +	buf_iova = buf_vec[vec_idx].buf_iova;
> +	buf_len = buf_vec[vec_idx].buf_len;
> +
> +	if (unlikely(buf_len < dev->vhost_hlen && nr_vec <= 1)) {
> +		error = -1;
> +		goto out;
> +	}
> +
> +	hdr_mbuf = m;
> +	hdr_addr = buf_addr;
> +	if (unlikely(buf_len < dev->vhost_hlen))
> +		hdr = &tmp_hdr;
> +	else
> +		hdr = (struct virtio_net_hdr_mrg_rxbuf
> *)(uintptr_t)hdr_addr;
> +
> +	VHOST_LOG_DATA(DEBUG, "(%d) RX: num merge buffers %d\n",
> +		dev->vid, num_buffers);
> +
> +	if (unlikely(buf_len < dev->vhost_hlen)) {
> +		buf_offset = dev->vhost_hlen - buf_len;
> +		vec_idx++;
> +		buf_addr = buf_vec[vec_idx].buf_addr;
> +		buf_iova = buf_vec[vec_idx].buf_iova;
> +		buf_len = buf_vec[vec_idx].buf_len;
> +		buf_avail = buf_len - buf_offset;
> +	} else {
> +		buf_offset = dev->vhost_hlen;
> +		buf_avail = buf_len - dev->vhost_hlen;
> +	}
> +
> +	mbuf_avail  = rte_pktmbuf_data_len(m);
> +	mbuf_offset = 0;
> +
> +	while (mbuf_avail != 0 || m->next != NULL) {
> +		/* done with current buf, get the next one */
> +		if (buf_avail == 0) {
> +			vec_idx++;
> +			if (unlikely(vec_idx >= nr_vec)) {
> +				error = -1;
> +				goto out;
> +			}
> +
> +			buf_addr = buf_vec[vec_idx].buf_addr;
> +			buf_iova = buf_vec[vec_idx].buf_iova;
> +			buf_len = buf_vec[vec_idx].buf_len;
> +
> +			buf_offset = 0;
> +			buf_avail  = buf_len;
> +		}
> +
> +		/* done with current mbuf, get the next one */
> +		if (mbuf_avail == 0) {
> +			m = m->next;
> +
> +			mbuf_offset = 0;
> +			mbuf_avail  = rte_pktmbuf_data_len(m);
> +		}
> +
> +		if (hdr_addr) {
> +			virtio_enqueue_offload(hdr_mbuf, &hdr->hdr);
> +			if (rxvq_is_mergeable(dev))
> +				ASSIGN_UNLESS_EQUAL(hdr->num_buffers,
> +						num_buffers);
> +
> +			if (unlikely(hdr == &tmp_hdr)) {
> +				copy_vnet_hdr_to_desc(dev, vq, buf_vec,
> hdr);
> +			} else {
> +				PRINT_PACKET(dev, (uintptr_t)hdr_addr,
> +						dev->vhost_hlen, 0);
> +				vhost_log_cache_write_iova(dev, vq,
> +						buf_vec[0].buf_iova,
> +						dev->vhost_hlen);
> +			}
> +
> +			hdr_addr = 0;
> +		}
> +
> +		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
> +
> +		if (unlikely(cpy_len >= cpy_threshold)) {
> +			hpa = (void *)(uintptr_t)gpa_to_hpa(dev,
> +					buf_iova + buf_offset, cpy_len);
I have one question here. If the user has called the async copy API directly, should the vhost library still check the copy threshold for software fallback?
If software fallback is needed, IMHO it would be more suitable to handle it in the copy device driver.
IMHO, the cost of checking and fixing up the virtio header in the async copy function will be too high.
Since this is the async copy datapath, would it be possible to eliminate the cost of calculating the segmented addresses?
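For reference, the per-segment decision I am asking about boils down to the sketch below. The names copy_path, select_copy_path and count_async_segs are hypothetical; the only thing taken from the patch is the comparison itself (cpy_len >= cpy_threshold selects the async path, anything shorter is copied by the CPU).

```c
#include <stdint.h>

/* Hypothetical labels for the two copy paths. */
enum copy_path {
	COPY_PATH_CPU = 0,	/* small segment, copied by the CPU */
	COPY_PATH_ASYNC = 1	/* large segment, handed to the copy engine */
};

/* Same comparison as in the patch: segments at or above the
 * threshold go to the async engine. */
static inline enum copy_path
select_copy_path(uint32_t cpy_len, uint32_t threshold)
{
	return cpy_len >= threshold ? COPY_PATH_ASYNC : COPY_PATH_CPU;
}

/* Count how many of the n segment lengths would be offloaded. */
static inline uint32_t
count_async_segs(const uint32_t *seg_len, uint32_t n, uint32_t threshold)
{
	uint32_t i, n_async = 0;

	for (i = 0; i < n; i++)
		if (select_copy_path(seg_len[i], threshold) == COPY_PATH_ASYNC)
			n_async++;

	return n_async;
}
```

If this check lived in the copy device driver instead, the vhost library would pass every segment through and the driver would make the same comparison internally.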
> +
> +			if (unlikely(!hpa)) {
> +				error = -1;
> +				goto out;
> +			}
> +
> +			async_fill_vec(src_iovec + tvec_idx,
> +				(void
> *)(uintptr_t)rte_pktmbuf_iova_offset(m,
> +						mbuf_offset), cpy_len);
> +
> +			async_fill_vec(dst_iovec + tvec_idx, hpa, cpy_len);
> +
> +			tlen += cpy_len;
> +			tvec_idx++;
> +		} else {
> +			if (unlikely(vq->batch_copy_nb_elems >= vq->size)) {
> +				rte_memcpy(
> +				(void *)((uintptr_t)(buf_addr + buf_offset)),
> +				rte_pktmbuf_mtod_offset(m, void *,
> mbuf_offset),
> +				cpy_len);
> +
> +				PRINT_PACKET(dev,
> +					(uintptr_t)(buf_addr + buf_offset),
> +					cpy_len, 0);
> +			} else {
> +				batch_copy[vq->batch_copy_nb_elems].dst =
> +				(void *)((uintptr_t)(buf_addr + buf_offset));
> +				batch_copy[vq->batch_copy_nb_elems].src =
> +				rte_pktmbuf_mtod_offset(m, void *,
> mbuf_offset);
> +				batch_copy[vq-
> >batch_copy_nb_elems].log_addr =
> +					buf_iova + buf_offset;
> +				batch_copy[vq->batch_copy_nb_elems].len =
> +					cpy_len;
> +				vq->batch_copy_nb_elems++;
> +			}
> +		}
> +
> +		mbuf_avail  -= cpy_len;
> +		mbuf_offset += cpy_len;
> +		buf_avail  -= cpy_len;
> +		buf_offset += cpy_len;
> +	}
> +
> +out:
> +	async_fill_it(src_it, tlen, src_iovec, tvec_idx);
> +	async_fill_it(dst_it, tlen, dst_iovec, tvec_idx);
> +
> +	return error;
> +}
> +
>  static __rte_always_inline int
>  vhost_enqueue_single_packed(struct virtio_net *dev,
>  			    struct vhost_virtqueue *vq,
> @@ -1236,6 +1459,317 @@
>  	return virtio_dev_rx(dev, queue_id, pkts, count);
>  }
> 
> +static __rte_always_inline void
> +virtio_dev_rx_async_submit_split_err(struct virtio_net *dev,
> +	struct vhost_virtqueue *vq, uint16_t queue_id,
> +	uint16_t last_idx, uint16_t shadow_idx)
> +{
> +	while (vq->async_pkts_inflight_n) {
> +		int er = vq->async_ops.check_completed_copies(dev->vid,
> +			queue_id, 0, MAX_PKT_BURST);
> +
> +		if (er < 0) {
> +			vq->async_pkts_inflight_n = 0;
> +			break;
> +		}
> +
> +		vq->async_pkts_inflight_n -= er;
> +	}
> +
> +	vq->shadow_used_idx = shadow_idx;
> +	vq->last_avail_idx = last_idx;
> +}
> +
> +static __rte_noinline uint32_t
> +virtio_dev_rx_async_submit_split(struct virtio_net *dev,
> +	struct vhost_virtqueue *vq, uint16_t queue_id,
> +	struct rte_mbuf **pkts, uint32_t count)
> +{
> +	uint32_t pkt_idx = 0, pkt_burst_idx = 0;
> +	uint16_t num_buffers;
> +	struct buf_vector buf_vec[BUF_VECTOR_MAX];
> +	uint16_t avail_head, last_idx, shadow_idx;
> +
> +	struct iov_it *it_pool = vq->it_pool;
> +	struct iovec *vec_pool = vq->vec_pool;
> +	struct dma_trans_desc tdes[MAX_PKT_BURST];
> +	struct iovec *src_iovec = vec_pool;
> +	struct iovec *dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
> +	struct iov_it *src_it = it_pool;
> +	struct iov_it *dst_it = it_pool + 1;
> +	uint16_t n_free_slot, slot_idx;
> +	int n_pkts = 0;
> +
> +	avail_head = *((volatile uint16_t *)&vq->avail->idx);
> +	last_idx = vq->last_avail_idx;
> +	shadow_idx = vq->shadow_used_idx;
> +
> +	/*
> +	 * The ordering between avail index and
> +	 * desc reads needs to be enforced.
> +	 */
> +	rte_smp_rmb();
> +
> +	rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
> +
> +	for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
> +		uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
> +		uint16_t nr_vec = 0;
> +
> +		if (unlikely(reserve_avail_buf_split(dev, vq,
> +						pkt_len, buf_vec,
> &num_buffers,
> +						avail_head, &nr_vec) < 0)) {
> +			VHOST_LOG_DATA(DEBUG,
> +				"(%d) failed to get enough desc from
> vring\n",
> +				dev->vid);
> +			vq->shadow_used_idx -= num_buffers;
> +			break;
> +		}
> +
> +		VHOST_LOG_DATA(DEBUG, "(%d) current index %d | end
> index %d\n",
> +			dev->vid, vq->last_avail_idx,
> +			vq->last_avail_idx + num_buffers);
> +
> +		if (async_mbuf_to_desc(dev, vq, pkts[pkt_idx],
> +				buf_vec, nr_vec, num_buffers,
> +				src_iovec, dst_iovec, src_it, dst_it) < 0) {
> +			vq->shadow_used_idx -= num_buffers;
> +			break;
> +		}
> +
> +		slot_idx = (vq->async_pkts_idx + pkt_idx) & (vq->size - 1);
> +		if (src_it->count) {
> +			async_fill_des(&tdes[pkt_burst_idx], src_it, dst_it);
> +			pkt_burst_idx++;
> +			vq->async_pending_info[slot_idx] =
> +				num_buffers | (src_it->nr_segs << 16);
> +			src_iovec += src_it->nr_segs;
> +			dst_iovec += dst_it->nr_segs;
> +			src_it += 2;
> +			dst_it += 2;
> +		} else {
> +			vq->async_pending_info[slot_idx] = num_buffers;
> +			vq->async_pkts_inflight_n++;
> +		}
> +
> +		vq->last_avail_idx += num_buffers;
> +
> +		if (pkt_burst_idx >= VHOST_ASYNC_BATCH_THRESHOLD ||
> +				(pkt_idx == count - 1 && pkt_burst_idx)) {
> +			n_pkts = vq->async_ops.transfer_data(dev->vid,
> +					queue_id, tdes, 0, pkt_burst_idx);
> +			src_iovec = vec_pool;
> +			dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
> +			src_it = it_pool;
> +			dst_it = it_pool + 1;
> +
> +			if (unlikely(n_pkts < (int)pkt_burst_idx)) {
> +				vq->async_pkts_inflight_n +=
> +					n_pkts > 0 ? n_pkts : 0;
> +				virtio_dev_rx_async_submit_split_err(dev,
> +					vq, queue_id, last_idx, shadow_idx);
> +				return 0;
> +			}
> +
> +			pkt_burst_idx = 0;
> +			vq->async_pkts_inflight_n += n_pkts;
> +		}
> +	}
> +
> +	if (pkt_burst_idx) {
> +		n_pkts = vq->async_ops.transfer_data(dev->vid,
> +				queue_id, tdes, 0, pkt_burst_idx);
> +		if (unlikely(n_pkts <= (int)pkt_burst_idx)) {
> +			vq->async_pkts_inflight_n += n_pkts > 0 ? n_pkts : 0;
> +			virtio_dev_rx_async_submit_split_err(dev, vq, queue_id,
> +			last_idx, shadow_idx);
> +			return 0;
> +		}
> +
> +		vq->async_pkts_inflight_n += n_pkts;
> +	}
> +
> +	do_data_copy_enqueue(dev, vq);
> +
> +	n_free_slot = vq->size - vq->async_pkts_idx;
> +	if (n_free_slot > pkt_idx) {
> +		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
> +			pkts, pkt_idx * sizeof(uintptr_t));
> +		vq->async_pkts_idx += pkt_idx;
> +	} else {
> +		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
> +			pkts, n_free_slot * sizeof(uintptr_t));
> +		rte_memcpy(&vq->async_pkts_pending[0],
> +			&pkts[n_free_slot],
> +			(pkt_idx - n_free_slot) * sizeof(uintptr_t));
> +		vq->async_pkts_idx = pkt_idx - n_free_slot;
> +	}
> +
> +	if (likely(vq->shadow_used_idx))
> +		async_flush_shadow_used_ring_split(dev, vq);
> +
> +	return pkt_idx;
> +}
> +
> +uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
> +		struct rte_mbuf **pkts, uint16_t count)
> +{
> +	struct virtio_net *dev = get_device(vid);
> +	struct vhost_virtqueue *vq;
> +	uint16_t n_pkts_cpl, n_pkts_put = 0, n_descs = 0;
> +	uint16_t start_idx, pkts_idx, vq_size;
> +	uint64_t *async_pending_info;
> +
> +	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
> +	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
> +		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
> +			dev->vid, __func__, queue_id);
> +		return 0;
> +	}
> +
> +	vq = dev->virtqueue[queue_id];
> +
This should check whether the device or queue supports async copy; vq->async_pending_info is NULL if async copy is not enabled on the queue.
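A minimal sketch of the kind of guard meant here, using stand-in types rather than the real vhost structures. struct mock_vq and poll_completed_guarded() are hypothetical names for illustration only, and the clamping at the end is just a placeholder for the real completion logic:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the async fields of struct vhost_virtqueue. */
struct mock_vq {
	bool async_registered;        /* set when an async channel is registered */
	uint64_t *async_pending_info; /* NULL when async copy is not enabled */
	uint16_t async_pkts_inflight_n;
};

/*
 * Return how many packets may be polled, or 0 when the queue has no
 * async channel. The early return keeps the caller from dereferencing
 * a NULL async_pending_info.
 */
static uint16_t
poll_completed_guarded(const struct mock_vq *vq, uint16_t count)
{
	assert(vq != NULL);

	if (!vq->async_registered || vq->async_pending_info == NULL)
		return 0;

	/* Placeholder: never report more than what is in flight. */
	return count < vq->async_pkts_inflight_n ?
		count : vq->async_pkts_inflight_n;
}
```

In the real function the same check would presumably sit right after vq is looked up and before the lock is taken, but that placement is a guess.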
> +	rte_spinlock_lock(&vq->access_lock);
> +
> +	pkts_idx = vq->async_pkts_idx;
> +	async_pending_info = vq->async_pending_info;
> +	vq_size = vq->size;
> +	start_idx = pkts_idx > vq->async_pkts_inflight_n ?
> +		pkts_idx - vq->async_pkts_inflight_n :
> +		(vq_size - vq->async_pkts_inflight_n + pkts_idx) &
> +		(vq_size - 1);
> +
> +	n_pkts_cpl =
> +		vq->async_ops.check_completed_copies(vid, queue_id, 0, count);
> +
> +	rte_smp_wmb();
> +
> +	while (likely(((start_idx + n_pkts_put) & (vq_size - 1)) != pkts_idx)) {
> +		uint64_t info = async_pending_info[
> +			(start_idx + n_pkts_put) & (vq_size - 1)];
> +		uint64_t n_segs;
> +		n_pkts_put++;
> +		n_descs += info & ASYNC_PENDING_INFO_N_MSK;
> +		n_segs = info >> ASYNC_PENDING_INFO_N_SFT;
> +
> +		if (n_segs) {
> +			if (!n_pkts_cpl || n_pkts_cpl < n_segs) {
> +				n_pkts_put--;
> +				n_descs -= info & ASYNC_PENDING_INFO_N_MSK;
> +				if (n_pkts_cpl) {
> +					async_pending_info[
> +						(start_idx + n_pkts_put) &
> +						(vq_size - 1)] =
> +					((n_segs - n_pkts_cpl) <<
> +					 ASYNC_PENDING_INFO_N_SFT) |
> +					(info & ASYNC_PENDING_INFO_N_MSK);
> +					n_pkts_cpl = 0;
> +				}
> +				break;
> +			}
> +			n_pkts_cpl -= n_segs;
> +		}
> +	}
> +
> +	if (n_pkts_put) {
> +		vq->async_pkts_inflight_n -= n_pkts_put;
> +		*(volatile uint16_t *)&vq->used->idx += n_descs;
> +
> +		vhost_vring_call_split(dev, vq);
> +	}
> +
> +	if (start_idx + n_pkts_put <= vq_size) {
> +		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
> +			n_pkts_put * sizeof(uintptr_t));
> +	} else {
> +		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
> +			(vq_size - start_idx) * sizeof(uintptr_t));
> +		rte_memcpy(&pkts[vq_size - start_idx], vq->async_pkts_pending,
> +			(n_pkts_put - vq_size + start_idx) * sizeof(uintptr_t));
> +	}
> +
> +	rte_spinlock_unlock(&vq->access_lock);
> +
> +	return n_pkts_put;
> +}
> +
> +static __rte_always_inline uint32_t
> +virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id,
> +	struct rte_mbuf **pkts, uint32_t count)
> +{
> +	struct vhost_virtqueue *vq;
> +	uint32_t nb_tx = 0;
> +	bool drawback = false;
> +
> +	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
> +	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
> +		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
> +			dev->vid, __func__, queue_id);
> +		return 0;
> +	}
> +
> +	vq = dev->virtqueue[queue_id];
> +
> +	rte_spinlock_lock(&vq->access_lock);
> +
> +	if (unlikely(vq->enabled == 0))
> +		goto out_access_unlock;
> +
> +	if (unlikely(!vq->async_registered)) {
> +		drawback = true;
> +		goto out_access_unlock;
> +	}
> +
> +	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
> +		vhost_user_iotlb_rd_lock(vq);
> +
> +	if (unlikely(vq->access_ok == 0))
> +		if (unlikely(vring_translate(dev, vq) < 0))
> +			goto out;
> +
> +	count = RTE_MIN((uint32_t)MAX_PKT_BURST, count);
> +	if (count == 0)
> +		goto out;
> +
> +	/* TODO: packed queue not implemented */
> +	if (vq_is_packed(dev))
> +		nb_tx = 0;
> +	else
> +		nb_tx = virtio_dev_rx_async_submit_split(dev,
> +				vq, queue_id, pkts, count);
> +
> +out:
> +	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
> +		vhost_user_iotlb_rd_unlock(vq);
> +
> +out_access_unlock:
> +	rte_spinlock_unlock(&vq->access_lock);
> +
> +	if (drawback)
> +		return rte_vhost_enqueue_burst(dev->vid, queue_id, pkts, count);
> +
> +	return nb_tx;
> +}
> +
> +uint16_t
> +rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
> +		struct rte_mbuf **pkts, uint16_t count)
> +{
> +	struct virtio_net *dev = get_device(vid);
> +
> +	if (!dev)
> +		return 0;
> +
> +	if (unlikely(!(dev->flags & VIRTIO_DEV_BUILTIN_VIRTIO_NET))) {
> +		VHOST_LOG_DATA(ERR,
> +			"(%d) %s: built-in vhost net backend is disabled.\n",
> +			dev->vid, __func__);
> +		return 0;
> +	}
> +
> +	return virtio_dev_rx_async_submit(dev, queue_id, pkts, count);
> +}
> +
>  static inline bool
>  virtio_net_with_host_offload(struct virtio_net *dev)
>  {
> --
> 1.8.3.1
* Re: [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path registration API
  2020-06-18  5:50   ` Liu, Yong
@ 2020-06-18  9:08     ` Fu, Patrick
  2020-06-19  0:40       ` Liu, Yong
  2020-06-25 13:42     ` Maxime Coquelin
  1 sibling, 1 reply; 36+ messages in thread
From: Fu, Patrick @ 2020-06-18  9:08 UTC (permalink / raw)
  To: Liu, Yong
  Cc: Jiang, Cheng1, Liang, Cunming, dev, maxime.coquelin, Xia, Chenbo,
	Wang, Zhihong, Ye, Xiaolong
> -----Original Message-----
> From: Liu, Yong <yong.liu@intel.com>
> Sent: Thursday, June 18, 2020 1:51 PM
> To: Fu, Patrick <patrick.fu@intel.com>
> Cc: Fu, Patrick <patrick.fu@intel.com>; Jiang, Cheng1
> <cheng1.jiang@intel.com>; Liang, Cunming <cunming.liang@intel.com>;
> dev@dpdk.org; maxime.coquelin@redhat.com; Xia, Chenbo
> <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; Ye,
> Xiaolong <xiaolong.ye@intel.com>
> Subject: RE: [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path
> registration API
> 
> Thanks, Patrick. Some comments are inline.
> 
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of patrick.fu@intel.com
> > Sent: Thursday, June 11, 2020 6:02 PM
> > To: dev@dpdk.org; maxime.coquelin@redhat.com; Xia, Chenbo
> > <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; Ye,
> > Xiaolong <xiaolong.ye@intel.com>
> > Cc: Fu, Patrick <patrick.fu@intel.com>; Jiang, Cheng1
> > <cheng1.jiang@intel.com>; Liang, Cunming <cunming.liang@intel.com>
> > Subject: [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path
> > registration API
> >
> > From: Patrick <patrick.fu@intel.com>
> >
> > This patch introduces registration/un-registration APIs for async data
> > path together with all required data structures and DMA callback
> > function proto-types.
> >
> > Signed-off-by: Patrick <patrick.fu@intel.com>
> > ---
> >  lib/librte_vhost/Makefile          |   3 +-
> >  lib/librte_vhost/rte_vhost.h       |   1 +
> >  lib/librte_vhost/rte_vhost_async.h | 134 +++++++++++++++++++++++++++++++++++++
> >  lib/librte_vhost/socket.c          |  20 ++++++
> >  lib/librte_vhost/vhost.c           |  74 +++++++++++++++++++-
> >  lib/librte_vhost/vhost.h           |  30 ++++++++-
> >  lib/librte_vhost/vhost_user.c      |  28 ++++++--
> >  7 files changed, 283 insertions(+), 7 deletions(-)
> >  create mode 100644 lib/librte_vhost/rte_vhost_async.h
> >
> > diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> > index e592795..3aed094 100644
> > --- a/lib/librte_vhost/Makefile
> > +++ b/lib/librte_vhost/Makefile
> > @@ -41,7 +41,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
> >  					vhost_user.c virtio_net.c vdpa.c
> >
> >  # install includes
> > -SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h
> > +SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h \
> > +						rte_vhost_async.h
> >
> Hi Patrick,
> Please also update meson build for newly added file.
> 
> Thanks,
> Marvin
> 
> >  # only compile vhost crypto when cryptodev is enabled
> >  ifeq ($(CONFIG_RTE_LIBRTE_CRYPTODEV),y)
> > diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
> > index d43669f..cec4d07 100644
> > --- a/lib/librte_vhost/rte_vhost.h
> > +++ b/lib/librte_vhost/rte_vhost.h
> > @@ -35,6 +35,7 @@
> >  #define RTE_VHOST_USER_EXTBUF_SUPPORT	(1ULL << 5)
> >  /* support only linear buffers (no chained mbufs) */
> >  #define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
> > +#define RTE_VHOST_USER_ASYNC_COPY	(1ULL << 7)
> >
> >  /** Protocol features. */
> >  #ifndef VHOST_USER_PROTOCOL_F_MQ
> > diff --git a/lib/librte_vhost/rte_vhost_async.h
> > b/lib/librte_vhost/rte_vhost_async.h
> > new file mode 100644
> > index 0000000..82f2ebe
> > --- /dev/null
> > +++ b/lib/librte_vhost/rte_vhost_async.h
> > @@ -0,0 +1,134 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2018 Intel Corporation
> > + */
> 
> s/2018/2020/
> 
> > +
> > +#ifndef _RTE_VHOST_ASYNC_H_
> > +#define _RTE_VHOST_ASYNC_H_
> > +
> > +#include "rte_vhost.h"
> > +
> > +/**
> > + * iovec iterator
> > + */
> > +struct iov_it {
> > +	/** offset to the first byte of interesting data */
> > +	size_t offset;
> > +	/** total bytes of data in this iterator */
> > +	size_t count;
> > +	/** pointer to the iovec array */
> > +	struct iovec *iov;
> > +	/** number of iovec in this iterator */
> > +	unsigned long nr_segs;
> > +};
> 
> Patrick,
> I think a structure named "it" is too generic to understand; please use a
> more meaningful name like "iov_iter".
> 
> > +
> > +/**
> > + * dma transfer descriptor pair
> > + */
> > +struct dma_trans_desc {
> > +	/** source memory iov_it */
> > +	struct iov_it *src;
> > +	/** destination memory iov_it */
> > +	struct iov_it *dst;
> > +};
> > +
> 
> This patch series is about async copy, and DMA is just one async copy
> method, supplied by the underlying hardware.
> IMHO, the structure would be better named "async_copy_desc", which matches
> the overall concept.
> 
> > +/**
> > + * dma transfer status
> > + */
> > +struct dma_trans_status {
> > +	/** An array of application specific data for source memory */
> > +	uintptr_t *src_opaque_data;
> > +	/** An array of application specific data for destination memory */
> > +	uintptr_t *dst_opaque_data;
> > +};
> > +
> Same as the previous comment.
> 
> > +/**
> > + * dma operation callbacks to be implemented by applications
> > + */
> > +struct rte_vhost_async_channel_ops {
> > +	/**
> > +	 * instruct a DMA channel to perform copies for a batch of packets
> > +	 *
> > +	 * @param vid
> > +	 *  id of vhost device to perform data copies
> > +	 * @param queue_id
> > +	 *  queue id to perform data copies
> > +	 * @param descs
> > +	 *  an array of DMA transfer memory descriptors
> > +	 * @param opaque_data
> > +	 *  opaque data pair sending to DMA engine
> > +	 * @param count
> > +	 *  number of elements in the "descs" array
> > +	 * @return
> > +	 *  -1 on failure, number of descs processed on success
> > +	 */
> > +	int (*transfer_data)(int vid, uint16_t queue_id,
> > +		struct dma_trans_desc *descs,
> > +		struct dma_trans_status *opaque_data,
> > +		uint16_t count);
> > +	/**
> > +	 * check copy-completed packets from a DMA channel
> > +	 * @param vid
> > +	 *  id of vhost device to check copy completion
> > +	 * @param queue_id
> > +	 *  queue id to check copy completion
> > +	 * @param opaque_data
> > +	 *  buffer to receive the opaque data pair from DMA engine
> > +	 * @param max_packets
> > +	 *  max number of packets could be completed
> > +	 * @return
> > +	 *  -1 on failure, number of iov segments completed on success
> > +	 */
> > +	int (*check_completed_copies)(int vid, uint16_t queue_id,
> > +		struct dma_trans_status *opaque_data,
> > +		uint16_t max_packets);
> > +};
> > +
> > +/**
> > + *  dma channel feature bit definition
> > + */
> > +struct dma_channel_features {
> > +	union {
> > +		uint32_t intval;
> > +		struct {
> > +			uint32_t inorder:1;
> > +			uint32_t resvd0115:15;
> > +			uint32_t threshold:12;
> > +			uint32_t resvd2831:4;
> > +		};
> > +	};
> > +};
> > +
> 
> Naming the feature bits "intval" may cause confusion; why not name the field
> for its meaning, like "engine_features"?
> I'm not sure whether the format "resvd0115" matches DPDK coding style. As I
> recall, DPDK uses resvd_0 and resvd_1 for two reserved elements.
> 
I will address the comments above in the v2 patch.
> > +		if (dev)
> > +			dev->async_copy = 1;
> > +	}
> > +
> 
> IMHO, the user can choose which queues utilize async copy, as backend hardware
> resources are limited.
> So should the async_copy enable flag be saved in the virtqueue structure?
> 
We have a per-queue flag that identifies whether async is enabled on a specific queue.
The "async_copy" flag is a device-level flag that identifies the async capability of a vhost device.
It is necessary because we rely on it to do initialization work if the vhost backend needs to support async mode on any of its queues.
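As a rough illustration of how the two flags relate, here is a minimal sketch with stand-in types. struct mock_dev, struct mock_queue and both helper functions are hypothetical names; rejecting registration when the device-level flag is clear is an assumption of the sketch, not something the v1 patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_NR_QUEUES 4

/* Hypothetical per-queue state: is an async channel attached? */
struct mock_queue {
	bool async_registered;
};

/* Hypothetical device state: can any queue use async copy at all? */
struct mock_dev {
	int async_copy;
	struct mock_queue q[MOCK_NR_QUEUES];
};

/* Attach an async channel to one queue; needs the device-level capability. */
static int
mock_channel_register(struct mock_dev *dev, uint16_t qid)
{
	assert(dev != NULL);

	if (!dev->async_copy || qid >= MOCK_NR_QUEUES)
		return -1;

	dev->q[qid].async_registered = true;
	return 0;
}

/* A queue takes the async path only when both flags are set. */
static bool
mock_queue_uses_async(const struct mock_dev *dev, uint16_t qid)
{
	assert(dev != NULL);

	return qid < MOCK_NR_QUEUES && dev->async_copy &&
		dev->q[qid].async_registered;
}
```

With this split, device-wide setup can key off async_copy once, while each queue opts in separately.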
> >  	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
> >
> >  	if (vsocket->notify_ops->new_connection) {
> > @@ -891,6 +900,17 @@ struct vhost_user_reconnect_list {
> >  		goto out_mutex;
> >  	}
> >
> > +	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
> > +
> > +	if (vsocket->async_copy &&
> > +		(flags & (RTE_VHOST_USER_IOMMU_SUPPORT |
> > +		RTE_VHOST_USER_POSTCOPY_SUPPORT))) {
> > +		VHOST_LOG_CONFIG(ERR, "error: enabling async copy and IOMMU "
> > +			"or post-copy feature simultaneously is not "
> > +			"supported\n");
> > +		goto out_mutex;
> > +	}
> > +
> >  	/*
> >  	 * Set the supported features correctly for the builtin vhost-user
> >  	 * net driver.
> > diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c index
> > 0266318..e6b688a 100644
> > --- a/lib/librte_vhost/vhost.c
> > +++ b/lib/librte_vhost/vhost.c
> > @@ -332,8 +332,13 @@
> >  {
> >  	if (vq_is_packed(dev))
> >  		rte_free(vq->shadow_used_packed);
> > -	else
> > +	else {
> >  		rte_free(vq->shadow_used_split);
> > +		if (vq->async_pkts_pending)
> > +			rte_free(vq->async_pkts_pending);
> > +		if (vq->async_pending_info)
> > +			rte_free(vq->async_pending_info);
> > +	}
> >  	rte_free(vq->batch_copy_elems);
> >  	rte_mempool_free(vq->iotlb_pool);
> >  	rte_free(vq);
> > @@ -1527,3 +1532,70 @@ int rte_vhost_extern_callback_register(int vid,
> >  	if (vhost_data_log_level >= 0)
> >  		rte_log_set_level(vhost_data_log_level,
> > RTE_LOG_WARNING);
> >  }
> > +
> > +int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
> > +					uint32_t features,
> > +					struct rte_vhost_async_channel_ops *ops)
> > +{
> > +	struct vhost_virtqueue *vq;
> > +	struct virtio_net *dev = get_device(vid);
> > +	struct dma_channel_features f;
> > +
> > +	if (dev == NULL || ops == NULL)
> > +		return -1;
> > +
> > +	f.intval = features;
> > +
> > +	vq = dev->virtqueue[queue_id];
> > +
> > +	if (vq == NULL)
> > +		return -1;
> > +
> > +	/** packed queue is not supported */
> > +	if (vq_is_packed(dev) || !f.inorder)
> > +		return -1;
> > +
> Virtio already has an in_order concept; the two names are so alike that they
> can easily be mixed up. Please consider how to distinguish them.
> 
What about "async_inorder"?
> > +	if (ops->check_completed_copies == NULL ||
> > +		ops->transfer_data == NULL)
> > +		return -1;
> > +
> 
> The previous error is unlikely to occur; the unlikely() macro may help make
> that clear.
> 
Will update in v2 patch
> > +	rte_spinlock_lock(&vq->access_lock);
> > +
> > +	vq->async_ops.check_completed_copies = ops->check_completed_copies;
> > +	vq->async_ops.transfer_data = ops->transfer_data;
> > +
> > +	vq->async_inorder = f.inorder;
> > +	vq->async_threshold = f.threshold;
> > +
> > +	vq->async_registered = true;
> > +
> > +	rte_spinlock_unlock(&vq->access_lock);
> > +
> > +	return 0;
> > +}
> > +
> > +int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id) {
> > +	struct vhost_virtqueue *vq;
> > +	struct virtio_net *dev = get_device(vid);
> > +
> > +	if (dev == NULL)
> > +		return -1;
> > +
> > +	vq = dev->virtqueue[queue_id];
> > +
> > +	if (vq == NULL)
> > +		return -1;
> > +
> > +	rte_spinlock_lock(&vq->access_lock);
> > +
> > +	vq->async_ops.transfer_data = NULL;
> > +	vq->async_ops.check_completed_copies = NULL;
> > +
> > +	vq->async_registered = false;
> > +
> > +	rte_spinlock_unlock(&vq->access_lock);
> > +
> > +	return 0;
> > +}
> > +
> > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h index
> > df98d15..a7fbe23 100644
> > --- a/lib/librte_vhost/vhost.h
> > +++ b/lib/librte_vhost/vhost.h
> > @@ -23,6 +23,8 @@
> >  #include "rte_vhost.h"
> >  #include "rte_vdpa.h"
> >
> > +#include "rte_vhost_async.h"
> > +
> >  /* Used to indicate that the device is running on a data core */
> >  #define VIRTIO_DEV_RUNNING 1
> >  /* Used to indicate that the device is ready to operate */
> > @@ -39,6 +41,11 @@
> >
> >  #define VHOST_LOG_CACHE_NR 32
> >
> > +#define MAX_PKT_BURST 32
> > +
> > +#define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
> > +#define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
> > +
> >  #define PACKED_DESC_ENQUEUE_USED_FLAG(w)	\
> >  	((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED |
> > VRING_DESC_F_WRITE) : \
> >  		VRING_DESC_F_WRITE)
> > @@ -200,6 +207,25 @@ struct vhost_virtqueue {
> >  	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_list;
> >  	int				iotlb_cache_nr;
> >  	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_pending_list;
> > +
> > +	/* operation callbacks for async dma */
> > +	struct rte_vhost_async_channel_ops	async_ops;
> > +
> > +	struct iov_it it_pool[VHOST_MAX_ASYNC_IT];
> > +	struct iovec vec_pool[VHOST_MAX_ASYNC_VEC];
> > +
> > +	/* async data transfer status */
> > +	uintptr_t	**async_pkts_pending;
> > +	#define		ASYNC_PENDING_INFO_N_MSK 0xFFFF
> > +	#define		ASYNC_PENDING_INFO_N_SFT 16
> > +	uint64_t	*async_pending_info;
> > +	uint16_t	async_pkts_idx;
> > +	uint16_t	async_pkts_inflight_n;
> > +
> > +	/* vq async features */
> > +	bool		async_inorder;
> > +	bool		async_registered;
> > +	uint16_t	async_threshold;
> >  } __rte_cache_aligned;
> >
> >  /* Old kernels have no such macros defined */
> > @@ -353,6 +379,7 @@ struct virtio_net {
> >  	int16_t			broadcast_rarp;
> >  	uint32_t		nr_vring;
> >  	int			dequeue_zero_copy;
> > +	int			async_copy;
> >  	int			extbuf;
> >  	int			linearbuf;
> >  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
> > @@ -702,7 +729,8 @@ uint64_t translate_log_addr(struct virtio_net
> > *dev, struct vhost_virtqueue *vq,
> >  	/* Don't kick guest if we don't reach index specified by guest. */
> >  	if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
> >  		uint16_t old = vq->signalled_used;
> > -		uint16_t new = vq->last_used_idx;
> > +		uint16_t new = vq->async_pkts_inflight_n ?
> > +					vq->used->idx:vq->last_used_idx;
> >  		bool signalled_used_valid = vq->signalled_used_valid;
> >
> >  		vq->signalled_used = new;
> > diff --git a/lib/librte_vhost/vhost_user.c
> > b/lib/librte_vhost/vhost_user.c index 84bebad..d7600bf 100644
> > --- a/lib/librte_vhost/vhost_user.c
> > +++ b/lib/librte_vhost/vhost_user.c
> > @@ -464,12 +464,25 @@
> >  	} else {
> >  		if (vq->shadow_used_split)
> >  			rte_free(vq->shadow_used_split);
> > +		if (vq->async_pkts_pending)
> > +			rte_free(vq->async_pkts_pending);
> > +		if (vq->async_pending_info)
> > +			rte_free(vq->async_pending_info);
> > +
> >  		vq->shadow_used_split = rte_malloc(NULL,
> >  				vq->size * sizeof(struct vring_used_elem),
> >  				RTE_CACHE_LINE_SIZE);
> > -		if (!vq->shadow_used_split) {
> > +		vq->async_pkts_pending = rte_malloc(NULL,
> > +				vq->size * sizeof(uintptr_t),
> > +				RTE_CACHE_LINE_SIZE);
> > +		vq->async_pending_info = rte_malloc(NULL,
> > +				vq->size * sizeof(uint64_t),
> > +				RTE_CACHE_LINE_SIZE);
> > +		if (!vq->shadow_used_split ||
> > +			!vq->async_pkts_pending ||
> > +			!vq->async_pending_info) {
> >  			VHOST_LOG_CONFIG(ERR,
> > -					"failed to allocate memory for shadow used ring.\n");
> > +					"failed to allocate memory for vq internal data.\n");
> 
> If async copy is not enabled, there is no need to allocate the related structures.
> 
Will update in v2 patch
Thanks,
Patrick
* Re: [dpdk-dev] [PATCH v1 2/2] vhost: introduce async enqueue for split ring
  2020-06-18  6:56   ` Liu, Yong
@ 2020-06-18 11:36     ` Fu, Patrick
  0 siblings, 0 replies; 36+ messages in thread
From: Fu, Patrick @ 2020-06-18 11:36 UTC (permalink / raw)
  To: Liu, Yong
  Cc: Jiang, Cheng1, Liang, Cunming, dev, maxime.coquelin, Xia, Chenbo,
	Wang, Zhihong, Ye, Xiaolong
Hi,
> -----Original Message-----
> From: Liu, Yong <yong.liu@intel.com>
> Sent: Thursday, June 18, 2020 2:57 PM
> To: Fu, Patrick <patrick.fu@intel.com>
> Cc: Fu, Patrick <patrick.fu@intel.com>; Jiang, Cheng1
> <cheng1.jiang@intel.com>; Liang, Cunming <cunming.liang@intel.com>;
> dev@dpdk.org; maxime.coquelin@redhat.com; Xia, Chenbo
> <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; Ye,
> Xiaolong <xiaolong.ye@intel.com>
> Subject: RE: [dpdk-dev] [PATCH v1 2/2] vhost: introduce async enqueue for
> split ring
> 
> Thanks, Patrick. Some comments are inline.
> 
> >
> > From: Patrick <patrick.fu@intel.com>
> >
> > This patch implements the async enqueue data path for split ring.
> >
> > Signed-off-by: Patrick <patrick.fu@intel.com>
> > ---
> >  lib/librte_vhost/rte_vhost_async.h |  38 +++
> >  lib/librte_vhost/virtio_net.c      | 538
> > ++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 574 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/librte_vhost/rte_vhost_async.h
> > b/lib/librte_vhost/rte_vhost_async.h
> > index 82f2ebe..efcba0a 100644
> > --- a/lib/librte_vhost/rte_vhost_async.h
> > +++ b/lib/librte_vhost/rte_vhost_async.h
> > +
> > +static __rte_always_inline int
> > +async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > +			struct rte_mbuf *m, struct buf_vector *buf_vec,
> > +			uint16_t nr_vec, uint16_t num_buffers,
> > +			struct iovec *src_iovec, struct iovec *dst_iovec,
> > +			struct iov_it *src_it, struct iov_it *dst_it) {
> 
> There are too many arguments in this function; please check whether this will
> impact performance.
> 
I guess src_iovec & dst_iovec could be removed from the parameter list. 
> > +	uint32_t vec_idx = 0;
> > +	uint32_t mbuf_offset, mbuf_avail;
> > +	uint32_t buf_offset, buf_avail;
> > +	uint64_t buf_addr, buf_iova, buf_len;
> > +	uint32_t cpy_len, cpy_threshold;
> > +	uint64_t hdr_addr;
> > +	struct rte_mbuf *hdr_mbuf;
> > +	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
> > +	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
> > +	int error = 0;
> > +
> > +	uint32_t tlen = 0;
> > +	int tvec_idx = 0;
> > +	void *hpa;
> > +
> > +	if (unlikely(m == NULL)) {
> > +		error = -1;
> > +		goto out;
> > +	}
> > +
> > +	cpy_threshold = vq->async_threshold;
> > +
> > +	buf_addr = buf_vec[vec_idx].buf_addr;
> > +	buf_iova = buf_vec[vec_idx].buf_iova;
> > +	buf_len = buf_vec[vec_idx].buf_len;
> > +
> > +	if (unlikely(buf_len < dev->vhost_hlen && nr_vec <= 1)) {
> > +		error = -1;
> > +		goto out;
> > +	}
> > +
> > +	hdr_mbuf = m;
> > +	hdr_addr = buf_addr;
> > +	if (unlikely(buf_len < dev->vhost_hlen))
> > +		hdr = &tmp_hdr;
> > +	else
> > +		hdr = (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)hdr_addr;
> > +
> > +	VHOST_LOG_DATA(DEBUG, "(%d) RX: num merge buffers %d\n",
> > +		dev->vid, num_buffers);
> > +
> > +	if (unlikely(buf_len < dev->vhost_hlen)) {
> > +		buf_offset = dev->vhost_hlen - buf_len;
> > +		vec_idx++;
> > +		buf_addr = buf_vec[vec_idx].buf_addr;
> > +		buf_iova = buf_vec[vec_idx].buf_iova;
> > +		buf_len = buf_vec[vec_idx].buf_len;
> > +		buf_avail = buf_len - buf_offset;
> > +	} else {
> > +		buf_offset = dev->vhost_hlen;
> > +		buf_avail = buf_len - dev->vhost_hlen;
> > +	}
> > +
> > +	mbuf_avail  = rte_pktmbuf_data_len(m);
> > +	mbuf_offset = 0;
> > +
> > +	while (mbuf_avail != 0 || m->next != NULL) {
> > +		/* done with current buf, get the next one */
> > +		if (buf_avail == 0) {
> > +			vec_idx++;
> > +			if (unlikely(vec_idx >= nr_vec)) {
> > +				error = -1;
> > +				goto out;
> > +			}
> > +
> > +			buf_addr = buf_vec[vec_idx].buf_addr;
> > +			buf_iova = buf_vec[vec_idx].buf_iova;
> > +			buf_len = buf_vec[vec_idx].buf_len;
> > +
> > +			buf_offset = 0;
> > +			buf_avail  = buf_len;
> > +		}
> > +
> > +		/* done with current mbuf, get the next one */
> > +		if (mbuf_avail == 0) {
> > +			m = m->next;
> > +
> > +			mbuf_offset = 0;
> > +			mbuf_avail  = rte_pktmbuf_data_len(m);
> > +		}
> > +
> > +		if (hdr_addr) {
> > +			virtio_enqueue_offload(hdr_mbuf, &hdr->hdr);
> > +			if (rxvq_is_mergeable(dev))
> > +				ASSIGN_UNLESS_EQUAL(hdr->num_buffers,
> > +						num_buffers);
> > +
> > +			if (unlikely(hdr == &tmp_hdr)) {
> > +				copy_vnet_hdr_to_desc(dev, vq, buf_vec, hdr);
> > +			} else {
> > +				PRINT_PACKET(dev, (uintptr_t)hdr_addr,
> > +						dev->vhost_hlen, 0);
> > +				vhost_log_cache_write_iova(dev, vq,
> > +						buf_vec[0].buf_iova,
> > +						dev->vhost_hlen);
> > +			}
> > +
> > +			hdr_addr = 0;
> > +		}
> > +
> > +		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
> > +
> > +		if (unlikely(cpy_len >= cpy_threshold)) {
> > +			hpa = (void *)(uintptr_t)gpa_to_hpa(dev,
> > +					buf_iova + buf_offset, cpy_len);
> 
> I have one question here. If the user has called async copy directly, should
> the vhost library still check the copy threshold for software fallback?
> If software fallback is needed, IMHO it would be more suitable to handle it in
> the copy device driver.
> 
Technically, we can delegate the threshold judgement from vhost to the application callbacks.
However, this would significantly increase the complexity of the callback implementations, which would have to
maintain correct ordering between DMA-copied data and CPU-copied data. Meanwhile, it does not actually
boost performance compared with the current vhost implementation. Considering that this threshold is a
generic design, I would still prefer to keep it in vhost.
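To make the threshold decision concrete, here is a minimal sketch of the split between CPU and DMA copies. It is an illustration under assumptions, not the patch code: struct mock_dma_job and copy_segment() are hypothetical names, and the CPU copy is done immediately with memcpy() where the patch batches small copies for later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of one copy handed to the DMA engine. */
struct mock_dma_job {
	void *dst;
	const void *src;
	size_t len;
};

/*
 * Copy one segment. Segments of at least `threshold` bytes are queued in
 * jobs[] for the DMA engine; shorter ones are copied by the CPU at once.
 * The caller must size jobs[] for the worst case.
 * Returns 1 if a DMA job was queued, 0 if the CPU did the copy.
 */
static int
copy_segment(void *dst, const void *src, size_t len, size_t threshold,
		struct mock_dma_job *jobs, size_t *n_jobs)
{
	assert(jobs != NULL && n_jobs != NULL);

	if (len >= threshold) {
		jobs[*n_jobs].dst = dst;
		jobs[*n_jobs].src = src;
		jobs[*n_jobs].len = len;
		(*n_jobs)++;
		return 1;
	}

	memcpy(dst, src, len);
	return 0;
}
```

Keeping this decision in one place inside vhost means a callback only ever sees the large segments, so it does not have to interleave its own CPU copies with DMA completions.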
> IMHO, the cost of checking and fixing the virtio header in the async copy
> function will be too high.
> Since this is the async copy datapath, would it be possible to eliminate the
> cost of calculating segmented addresses?
> 
Yes, I believe the async data path certainly brings new opportunities to apply more optimizations.
However, for now, settling the overall async framework is probably the priority.
Thanks,
Patrick
* Re: [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path registration API
  2020-06-18  9:08     ` Fu, Patrick
@ 2020-06-19  0:40       ` Liu, Yong
  0 siblings, 0 replies; 36+ messages in thread
From: Liu, Yong @ 2020-06-19  0:40 UTC (permalink / raw)
  To: Fu, Patrick
  Cc: Jiang, Cheng1, Liang, Cunming, dev, maxime.coquelin, Xia, Chenbo,
	Wang, Zhihong, Ye, Xiaolong
> -----Original Message-----
> From: Fu, Patrick <patrick.fu@intel.com>
> Sent: Thursday, June 18, 2020 5:09 PM
> To: Liu, Yong <yong.liu@intel.com>
> Cc: Jiang, Cheng1 <cheng1.jiang@intel.com>; Liang, Cunming
> <cunming.liang@intel.com>; dev@dpdk.org; maxime.coquelin@redhat.com;
> Xia, Chenbo <chenbo.xia@intel.com>; Wang, Zhihong
> <zhihong.wang@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>
> Subject: RE: [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path
> registration API
> 
> 
> 
> > -----Original Message-----
> > From: Liu, Yong <yong.liu@intel.com>
> > Sent: Thursday, June 18, 2020 1:51 PM
> > To: Fu, Patrick <patrick.fu@intel.com>
> > Cc: Fu, Patrick <patrick.fu@intel.com>; Jiang, Cheng1
> > <cheng1.jiang@intel.com>; Liang, Cunming <cunming.liang@intel.com>;
> > dev@dpdk.org; maxime.coquelin@redhat.com; Xia, Chenbo
> > <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; Ye,
> > Xiaolong <xiaolong.ye@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path
> > registration API
> >
> > Thanks, Patrick. Some comments are inline.
> >
> > > -----Original Message-----
> > > From: dev <dev-bounces@dpdk.org> On Behalf Of patrick.fu@intel.com
> > > Sent: Thursday, June 11, 2020 6:02 PM
> > > To: dev@dpdk.org; maxime.coquelin@redhat.com; Xia, Chenbo
> > > <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>;
> Ye,
> > > Xiaolong <xiaolong.ye@intel.com>
> > > Cc: Fu, Patrick <patrick.fu@intel.com>; Jiang, Cheng1
> > > <cheng1.jiang@intel.com>; Liang, Cunming <cunming.liang@intel.com>
> > > Subject: [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path
> > > registration API
> > >
> > > From: Patrick <patrick.fu@intel.com>
> > >
> > > This patch introduces registration/un-registration APIs for async data
> > > path together with all required data structures and DMA callback
> > > function proto-types.
> > >
> > > Signed-off-by: Patrick <patrick.fu@intel.com>
> > > ---
> > >  lib/librte_vhost/Makefile          |   3 +-
> > >  lib/librte_vhost/rte_vhost.h       |   1 +
> > >  lib/librte_vhost/rte_vhost_async.h | 134
> > > +++++++++++++++++++++++++++++++++++++
> > >  lib/librte_vhost/socket.c          |  20 ++++++
> > >  lib/librte_vhost/vhost.c           |  74 +++++++++++++++++++-
> > >  lib/librte_vhost/vhost.h           |  30 ++++++++-
> > >  lib/librte_vhost/vhost_user.c      |  28 ++++++--
> > >  7 files changed, 283 insertions(+), 7 deletions(-)  create mode
> > > 100644 lib/librte_vhost/rte_vhost_async.h
> > >
> > > diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> > > index e592795..3aed094 100644
> > > --- a/lib/librte_vhost/Makefile
> > > +++ b/lib/librte_vhost/Makefile
> > > @@ -41,7 +41,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c
> > iotlb.c
> > > socket.c vhost.c \
> > >  vhost_user.c virtio_net.c vdpa.c
> > >
> > >  # install includes
> > > -SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h
> > rte_vdpa.h
> > > +SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h
> > rte_vdpa.h
> > > \
> > > +rte_vhost_async.h
> > >
> > Hi Patrick,
> > Please also update meson build for newly added file.
> >
> > Thanks,
> > Marvin
> >
> > >  # only compile vhost crypto when cryptodev is enabled
> > >  ifeq ($(CONFIG_RTE_LIBRTE_CRYPTODEV),y)
> > > diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
> > > index d43669f..cec4d07 100644
> > > --- a/lib/librte_vhost/rte_vhost.h
> > > +++ b/lib/librte_vhost/rte_vhost.h
> > > @@ -35,6 +35,7 @@
> > >  #define RTE_VHOST_USER_EXTBUF_SUPPORT	(1ULL << 5)
> > >  /* support only linear buffers (no chained mbufs) */
> > >  #define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
> > > +#define RTE_VHOST_USER_ASYNC_COPY	(1ULL << 7)
> > >
> > >  /** Protocol features. */
> > >  #ifndef VHOST_USER_PROTOCOL_F_MQ
> > > diff --git a/lib/librte_vhost/rte_vhost_async.h
> > > b/lib/librte_vhost/rte_vhost_async.h
> > > new file mode 100644
> > > index 0000000..82f2ebe
> > > --- /dev/null
> > > +++ b/lib/librte_vhost/rte_vhost_async.h
> > > @@ -0,0 +1,134 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright(c) 2018 Intel Corporation
> > > + */
> >
> > s/2018/2020/
> >
> > > +
> > > +#ifndef _RTE_VHOST_ASYNC_H_
> > > +#define _RTE_VHOST_ASYNC_H_
> > > +
> > > +#include "rte_vhost.h"
> > > +
> > > +/**
> > > + * iovec iterator
> > > + */
> > > +struct iov_it {
> > > +/** offset to the first byte of interesting data */
> > > +size_t offset;
> > > +/** total bytes of data in this iterator */
> > > +size_t count;
> > > +/** pointer to the iovec array */
> > > +struct iovec *iov;
> > > +/** number of iovec in this iterator */
> > > +unsigned long nr_segs;
> > > +};
> >
> > Patrick,
> > I think a structure named "it" is too generic to be understood; please use
> > a more meaningful name like "iov_iter".
> >
> > > +
> > > +/**
> > > + * dma transfer descriptor pair
> > > + */
> > > +struct dma_trans_desc {
> > > +/** source memory iov_it */
> > > +struct iov_it *src;
> > > +/** destination memory iov_it */
> > > +struct iov_it *dst;
> > > +};
> > > +
> >
> > This patch series is named async copy, and DMA is just one async copy
> > method, supplied by the underlying hardware.
> > IMHO, the structure would be better named "async_copy_desc", which matches
> > the overall concept.
> >
> > > +/**
> > > + * dma transfer status
> > > + */
> > > +struct dma_trans_status {
> > > +/** An array of application specific data for source memory */
> > > +uintptr_t *src_opaque_data;
> > > +/** An array of application specific data for destination memory */
> > > +uintptr_t *dst_opaque_data;
> > > +};
> > > +
> > Same as the previous comment.
> >
> > > +/**
> > > + * dma operation callbacks to be implemented by applications
> > > + */
> > > +struct rte_vhost_async_channel_ops {
> > > +/**
> > > + * instruct a DMA channel to perform copies for a batch of packets
> > > + *
> > > + * @param vid
> > > + *  id of vhost device to perform data copies
> > > + * @param queue_id
> > > + *  queue id to perform data copies
> > > + * @param descs
> > > + *  an array of DMA transfer memory descriptors
> > > + * @param opaque_data
> > > + *  opaque data pair sending to DMA engine
> > > + * @param count
> > > + *  number of elements in the "descs" array
> > > + * @return
> > > + *  -1 on failure, number of descs processed on success
> > > + */
> > > +int (*transfer_data)(int vid, uint16_t queue_id,
> > > +struct dma_trans_desc *descs,
> > > +struct dma_trans_status *opaque_data,
> > > +uint16_t count);
> > > +/**
> > > + * check copy-completed packets from a DMA channel
> > > + * @param vid
> > > + *  id of vhost device to check copy completion
> > > + * @param queue_id
> > > + *  queue id to check copyp completion
> > > + * @param opaque_data
> > > + *  buffer to receive the opaque data pair from DMA engine
> > > + * @param max_packets
> > > + *  max number of packets could be completed
> > > + * @return
> > > + *  -1 on failure, number of iov segments completed on success
> > > + */
> > > +int (*check_completed_copies)(int vid, uint16_t queue_id,
> > > +struct dma_trans_status *opaque_data,
> > > +uint16_t max_packets);
> > > +};
> > > +
> > > +/**
> > > + *  dma channel feature bit definition
> > > + */
> > > +struct dma_channel_features {
> > > +union {
> > > +uint32_t intval;
> > > +struct {
> > > +uint32_t inorder:1;
> > > +uint32_t resvd0115:15;
> > > +uint32_t threshold:12;
> > > +uint32_t resvd2831:4;
> > > +};
> > > +};
> > > +};
> > > +
> >
> > Naming the feature bits "intval" may cause confusion; why not name the field
> > after its meaning, e.g. "engine_features"?
> > I'm not sure whether the "resvd0115" format matches the DPDK coding style.
> > As far as I recall, DPDK uses resvd_0 and resvd_1 for two reserved elements.
> >
> I will address the comments above in the v2 patch.
> 
> > > +if (dev)
> > > +dev->async_copy = 1;
> > > +}
> > > +
> >
> > IMHO, the user can choose which queues use async copy, since backend
> > hardware resources are limited.
> > So should the async_copy enable flag be saved in the virtqueue structure?
> >
> We have a per-queue flag to identify the enabling status of a specific queue.
> The "async_copy" flag is a dev-level flag which identifies the async
> capability of a vhost device.
> This is necessary because we rely on this flag to do initialization work if
> the vhost backend needs to support async mode on any of its queues.
> 
Got it. How about renaming this variable to "async_copy_enabled", which is more closely aligned with its real meaning?
Thanks,
Marvin
> > >  VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
> > >
> > >  if (vsocket->notify_ops->new_connection) {
> > > @@ -891,6 +900,17 @@ struct vhost_user_reconnect_list {
> > >  goto out_mutex;
> > >  }
> > >
> > > +vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
> > > +
> > > +if (vsocket->async_copy &&
> > > +(flags & (RTE_VHOST_USER_IOMMU_SUPPORT |
> > > +RTE_VHOST_USER_POSTCOPY_SUPPORT))) {
> > > +VHOST_LOG_CONFIG(ERR, "error: enabling async copy and
> > > IOMMU "
> > > +"or post-copy feature simultaneously is not "
> > > +"supported\n");
> > > +goto out_mutex;
> > > +}
> > > +
> > >  /*
> > >   * Set the supported features correctly for the builtin vhost-user
> > >   * net driver.
> > > diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> > > index 0266318..e6b688a 100644
> > > --- a/lib/librte_vhost/vhost.c
> > > +++ b/lib/librte_vhost/vhost.c
> > > @@ -332,8 +332,13 @@
> > >  {
> > >  if (vq_is_packed(dev))
> > >  rte_free(vq->shadow_used_packed);
> > > -else
> > > +else {
> > >  rte_free(vq->shadow_used_split);
> > > +if (vq->async_pkts_pending)
> > > +rte_free(vq->async_pkts_pending);
> > > +if (vq->async_pending_info)
> > > +rte_free(vq->async_pending_info);
> > > +}
> > >  rte_free(vq->batch_copy_elems);
> > >  rte_mempool_free(vq->iotlb_pool);
> > >  rte_free(vq);
> > > @@ -1527,3 +1532,70 @@ int rte_vhost_extern_callback_register(int vid,
> > >  if (vhost_data_log_level >= 0)
> > >  rte_log_set_level(vhost_data_log_level,
> > > RTE_LOG_WARNING);
> > >  }
> > > +
> > > +int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
> > > +uint32_t features,
> > > +struct rte_vhost_async_channel_ops
> > > *ops)
> > > +{
> > > +struct vhost_virtqueue *vq;
> > > +struct virtio_net *dev = get_device(vid);
> > > +struct dma_channel_features f;
> > > +
> > > +if (dev == NULL || ops == NULL)
> > > +return -1;
> > > +
> > > +f.intval = features;
> > > +
> > > +vq = dev->virtqueue[queue_id];
> > > +
> > > +if (vq == NULL)
> > > +return -1;
> > > +
> > > +/** packed queue is not supported */
> > > +if (vq_is_packed(dev) || !f.inorder)
> > > +return -1;
> > > +
> > Virtio already has an in_order concept; the two names are so alike that
> > they can easily be mixed up. Please consider how to distinguish them.
> >
> What about "async_inorder"?
Looks good to me.
> 
> > > +if (ops->check_completed_copies == NULL ||
> > > +ops->transfer_data == NULL)
> > > +return -1;
> > > +
> >
> > The preceding error conditions are unlikely to be true; wrapping them in
> > the unlikely() macro may make that clearer.
> >
> Will update in v2 patch
> 
> > > +rte_spinlock_lock(&vq->access_lock);
> > > +
> > > +vq->async_ops.check_completed_copies = ops->check_completed_copies;
> > > +vq->async_ops.transfer_data = ops->transfer_data;
> > > +
> > > +vq->async_inorder = f.inorder;
> > > +vq->async_threshold = f.threshold;
> > > +
> > > +vq->async_registered = true;
> > > +
> > > +rte_spinlock_unlock(&vq->access_lock);
> > > +
> > > +return 0;
> > > +}
> > > +
> > > +int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id)
> > > +{
> > > +struct vhost_virtqueue *vq;
> > > +struct virtio_net *dev = get_device(vid);
> > > +
> > > +if (dev == NULL)
> > > +return -1;
> > > +
> > > +vq = dev->virtqueue[queue_id];
> > > +
> > > +if (vq == NULL)
> > > +return -1;
> > > +
> > > +rte_spinlock_lock(&vq->access_lock);
> > > +
> > > +vq->async_ops.transfer_data = NULL;
> > > +vq->async_ops.check_completed_copies = NULL;
> > > +
> > > +vq->async_registered = false;
> > > +
> > > +rte_spinlock_unlock(&vq->access_lock);
> > > +
> > > +return 0;
> > > +}
> > > +
> > > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > > index df98d15..a7fbe23 100644
> > > --- a/lib/librte_vhost/vhost.h
> > > +++ b/lib/librte_vhost/vhost.h
> > > @@ -23,6 +23,8 @@
> > >  #include "rte_vhost.h"
> > >  #include "rte_vdpa.h"
> > >
> > > +#include "rte_vhost_async.h"
> > > +
> > >  /* Used to indicate that the device is running on a data core */
> > >  #define VIRTIO_DEV_RUNNING 1
> > >  /* Used to indicate that the device is ready to operate */
> > > @@ -39,6 +41,11 @@
> > >
> > >  #define VHOST_LOG_CACHE_NR 32
> > >
> > > +#define MAX_PKT_BURST 32
> > > +
> > > +#define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
> > > +#define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
> > > +
> > >  #define PACKED_DESC_ENQUEUE_USED_FLAG(w)\
> > >  ((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED |
> > > VRING_DESC_F_WRITE) : \
> > >  VRING_DESC_F_WRITE)
> > > @@ -200,6 +207,25 @@ struct vhost_virtqueue {
> > >  	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_list;
> > >  	int				iotlb_cache_nr;
> > >  	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_pending_list;
> > > +
> > > +	/* operation callbacks for async dma */
> > > +	struct rte_vhost_async_channel_ops	async_ops;
> > > +
> > > +	struct iov_it it_pool[VHOST_MAX_ASYNC_IT];
> > > +	struct iovec vec_pool[VHOST_MAX_ASYNC_VEC];
> > > +
> > > +	/* async data transfer status */
> > > +	uintptr_t	**async_pkts_pending;
> > > +	#define		ASYNC_PENDING_INFO_N_MSK 0xFFFF
> > > +	#define		ASYNC_PENDING_INFO_N_SFT 16
> > > +	uint64_t	*async_pending_info;
> > > +	uint16_t	async_pkts_idx;
> > > +	uint16_t	async_pkts_inflight_n;
> > > +
> > > +	/* vq async features */
> > > +	bool		async_inorder;
> > > +	bool		async_registered;
> > > +	uint16_t	async_threshold;
> > >  } __rte_cache_aligned;
> > >
> > >  /* Old kernels have no such macros defined */
> > > @@ -353,6 +379,7 @@ struct virtio_net {
> > >  	int16_t			broadcast_rarp;
> > >  	uint32_t		nr_vring;
> > >  	int			dequeue_zero_copy;
> > > +	int			async_copy;
> > >  	int			extbuf;
> > >  	int			linearbuf;
> > >  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
> > > @@ -702,7 +729,8 @@ uint64_t translate_log_addr(struct virtio_net
> > > *dev, struct vhost_virtqueue *vq,
> > >  /* Don't kick guest if we don't reach index specified by guest. */
> > >  if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
> > >  uint16_t old = vq->signalled_used;
> > > -uint16_t new = vq->last_used_idx;
> > > +uint16_t new = vq->async_pkts_inflight_n ?
> > > +vq->used->idx:vq->last_used_idx;
> > >  bool signalled_used_valid = vq->signalled_used_valid;
> > >
> > >  vq->signalled_used = new;
> > > diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
> > > index 84bebad..d7600bf 100644
> > > --- a/lib/librte_vhost/vhost_user.c
> > > +++ b/lib/librte_vhost/vhost_user.c
> > > @@ -464,12 +464,25 @@
> > >  } else {
> > >  if (vq->shadow_used_split)
> > >  rte_free(vq->shadow_used_split);
> > > +if (vq->async_pkts_pending)
> > > +rte_free(vq->async_pkts_pending);
> > > +if (vq->async_pending_info)
> > > +rte_free(vq->async_pending_info);
> > > +
> > >  vq->shadow_used_split = rte_malloc(NULL,
> > >  vq->size * sizeof(struct vring_used_elem),
> > >  RTE_CACHE_LINE_SIZE);
> > > -if (!vq->shadow_used_split) {
> > > +vq->async_pkts_pending = rte_malloc(NULL,
> > > +vq->size * sizeof(uintptr_t),
> > > +RTE_CACHE_LINE_SIZE);
> > > +vq->async_pending_info = rte_malloc(NULL,
> > > +vq->size * sizeof(uint64_t),
> > > +RTE_CACHE_LINE_SIZE);
> > > +if (!vq->shadow_used_split ||
> > > +!vq->async_pkts_pending ||
> > > +!vq->async_pending_info) {
> > >  VHOST_LOG_CONFIG(ERR,
> > > -"failed to allocate memory for
> > > shadow used ring.\n");
> > > +"failed to allocate memory for vq
> > > internal data.\n");
> >
> > If async copy is not enabled, there is no need to allocate the related
> > structures.
> >
> Will update in v2 patch
> 
> Thanks,
> 
> Patrick
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path registration API
  2020-06-18  5:50   ` Liu, Yong
  2020-06-18  9:08     ` Fu, Patrick
@ 2020-06-25 13:42     ` Maxime Coquelin
  1 sibling, 0 replies; 36+ messages in thread
From: Maxime Coquelin @ 2020-06-25 13:42 UTC (permalink / raw)
  To: Liu, Yong, Fu, Patrick
  Cc: Jiang, Cheng1, Liang, Cunming, dev, Xia, Chenbo, Wang, Zhihong,
	Ye, Xiaolong
On 6/18/20 7:50 AM, Liu, Yong wrote:
> Thanks, Patrick. Some comments are inline.
> 
>> -----Original Message-----
>> From: dev <dev-bounces@dpdk.org> On Behalf Of patrick.fu@intel.com
>> Sent: Thursday, June 11, 2020 6:02 PM
>> To: dev@dpdk.org; maxime.coquelin@redhat.com; Xia, Chenbo
>> <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; Ye,
>> Xiaolong <xiaolong.ye@intel.com>
>> Cc: Fu, Patrick <patrick.fu@intel.com>; Jiang, Cheng1
>> <cheng1.jiang@intel.com>; Liang, Cunming <cunming.liang@intel.com>
>> Subject: [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path
>> registration API
>>
>> From: Patrick <patrick.fu@intel.com>
>>
>> This patch introduces registration/un-registration APIs
>> for async data path together with all required data
>> structures and DMA callback function proto-types.
>>
>> Signed-off-by: Patrick <patrick.fu@intel.com>
>> ---
>>  lib/librte_vhost/Makefile          |   3 +-
>>  lib/librte_vhost/rte_vhost.h       |   1 +
>>  lib/librte_vhost/rte_vhost_async.h | 134
>> +++++++++++++++++++++++++++++++++++++
>>  lib/librte_vhost/socket.c          |  20 ++++++
>>  lib/librte_vhost/vhost.c           |  74 +++++++++++++++++++-
>>  lib/librte_vhost/vhost.h           |  30 ++++++++-
>>  lib/librte_vhost/vhost_user.c      |  28 ++++++--
>>  7 files changed, 283 insertions(+), 7 deletions(-)
>>  create mode 100644 lib/librte_vhost/rte_vhost_async.h
>>
>> diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
>> index e592795..3aed094 100644
>> --- a/lib/librte_vhost/Makefile
>> +++ b/lib/librte_vhost/Makefile
>> @@ -41,7 +41,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c
>> iotlb.c socket.c vhost.c \
>>  					vhost_user.c virtio_net.c vdpa.c
>>
>>  # install includes
>> -SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h
>> +SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h
>> \
>> +						rte_vhost_async.h
>>
> Hi Patrick,
> Please also update meson build for newly added file.
> 
> Thanks,
> Marvin
> 
>>  # only compile vhost crypto when cryptodev is enabled
>>  ifeq ($(CONFIG_RTE_LIBRTE_CRYPTODEV),y)
>> diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
>> index d43669f..cec4d07 100644
>> --- a/lib/librte_vhost/rte_vhost.h
>> +++ b/lib/librte_vhost/rte_vhost.h
>> @@ -35,6 +35,7 @@
>>  #define RTE_VHOST_USER_EXTBUF_SUPPORT	(1ULL << 5)
>>  /* support only linear buffers (no chained mbufs) */
>>  #define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
>> +#define RTE_VHOST_USER_ASYNC_COPY	(1ULL << 7)
>>
>>  /** Protocol features. */
>>  #ifndef VHOST_USER_PROTOCOL_F_MQ
>> diff --git a/lib/librte_vhost/rte_vhost_async.h
>> b/lib/librte_vhost/rte_vhost_async.h
>> new file mode 100644
>> index 0000000..82f2ebe
>> --- /dev/null
>> +++ b/lib/librte_vhost/rte_vhost_async.h
>> @@ -0,0 +1,134 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2018 Intel Corporation
>> + */
> 
> s/2018/2020/ 
> 
>> +
>> +#ifndef _RTE_VHOST_ASYNC_H_
>> +#define _RTE_VHOST_ASYNC_H_
>> +
>> +#include "rte_vhost.h"
>> +
>> +/**
>> + * iovec iterator
>> + */
>> +struct iov_it {
>> +	/** offset to the first byte of interesting data */
>> +	size_t offset;
>> +	/** total bytes of data in this iterator */
>> +	size_t count;
>> +	/** pointer to the iovec array */
>> +	struct iovec *iov;
>> +	/** number of iovec in this iterator */
>> +	unsigned long nr_segs;
>> +};
> 
> Patrick,
> I think a structure named "it" is too generic to be understood; please use a more meaningful name like "iov_iter".
I would also add that, being part of the Vhost API, it needs to be
prefixed with rte_vhost_.
This comment applies to all structures in this header.
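To illustrate the naming only, here is a minimal sketch of how the iterator and descriptor could look with both suggestions applied. The names rte_vhost_iov_iter, rte_vhost_async_desc and the helper sketch_iov_iter_total_len are hypothetical, not taken from the patch; v2 may well choose differently.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical name: "rte_vhost_" prefix added, "it" spelled out as "iter". */
struct rte_vhost_iov_iter {
	size_t offset;          /* offset to the first byte of interesting data */
	size_t count;           /* total bytes of data in this iterator */
	struct iovec *iov;      /* pointer to the iovec array */
	unsigned long nr_segs;  /* number of iovec in this iterator */
};

/* Hypothetical name for what the patch calls struct dma_trans_desc. */
struct rte_vhost_async_desc {
	struct rte_vhost_iov_iter *src;  /* source memory iterator */
	struct rte_vhost_iov_iter *dst;  /* destination memory iterator */
};

/* Illustration-only helper: sum the lengths of all segments in an iterator. */
static size_t
sketch_iov_iter_total_len(const struct rte_vhost_iov_iter *it)
{
	size_t total = 0;
	unsigned long i;

	assert(it != NULL);
	for (i = 0; i < it->nr_segs; i++)
		total += it->iov[i].iov_len;
	return total;
}
```

The field layout is unchanged from the patch; only the type names differ.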
>> +
>> +/**
>> + * dma transfer descriptor pair
>> + */
>> +struct dma_trans_desc {
>> +	/** source memory iov_it */
>> +	struct iov_it *src;
>> +	/** destination memory iov_it */
>> +	struct iov_it *dst;
>> +};
>> +
> 
> This patch series is named async copy, and DMA is just one async copy method, supplied by the underlying hardware.
> IMHO, the structure would be better named "async_copy_desc", which matches the overall concept.
> 
>> +/**
>> + * dma transfer status
>> + */
>> +struct dma_trans_status {
>> +	/** An array of application specific data for source memory */
>> +	uintptr_t *src_opaque_data;
>> +	/** An array of application specific data for destination memory */
>> +	uintptr_t *dst_opaque_data;
>> +};
>> +
> Same as the previous comment.
> 
>> +/**
>> + * dma operation callbacks to be implemented by applications
>> + */
>> +struct rte_vhost_async_channel_ops {
>> +	/**
>> +	 * instruct a DMA channel to perform copies for a batch of packets
>> +	 *
>> +	 * @param vid
>> +	 *  id of vhost device to perform data copies
>> +	 * @param queue_id
>> +	 *  queue id to perform data copies
>> +	 * @param descs
>> +	 *  an array of DMA transfer memory descriptors
>> +	 * @param opaque_data
>> +	 *  opaque data pair sending to DMA engine
>> +	 * @param count
>> +	 *  number of elements in the "descs" array
>> +	 * @return
>> +	 *  -1 on failure, number of descs processed on success
>> +	 */
>> +	int (*transfer_data)(int vid, uint16_t queue_id,
>> +		struct dma_trans_desc *descs,
>> +		struct dma_trans_status *opaque_data,
>> +		uint16_t count);
>> +	/**
>> +	 * check copy-completed packets from a DMA channel
>> +	 * @param vid
>> +	 *  id of vhost device to check copy completion
>> +	 * @param queue_id
>> +	 *  queue id to check copyp completion
>> +	 * @param opaque_data
>> +	 *  buffer to receive the opaque data pair from DMA engine
>> +	 * @param max_packets
>> +	 *  max number of packets could be completed
>> +	 * @return
>> +	 *  -1 on failure, number of iov segments completed on success
>> +	 */
>> +	int (*check_completed_copies)(int vid, uint16_t queue_id,
>> +		struct dma_trans_status *opaque_data,
>> +		uint16_t max_packets);
>> +};
>> +
>> +/**
>> + *  dma channel feature bit definition
>> + */
>> +struct dma_channel_features {
>> +	union {
>> +		uint32_t intval;
>> +		struct {
>> +			uint32_t inorder:1;
>> +			uint32_t resvd0115:15;
>> +			uint32_t threshold:12;
>> +			uint32_t resvd2831:4;
>> +		};
>> +	};
>> +};
>> +
> 
> Naming the feature bits "intval" may cause confusion; why not name the field after its meaning, e.g. "engine_features"?
> I'm not sure whether the "resvd0115" format matches the DPDK coding style. As far as I recall, DPDK uses resvd_0 and resvd_1 for two reserved elements.
> 
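For reference, the bit layout documented for the "features" argument (b0: inorder, b16 - b27: threshold) can also be decoded with explicit shifts and masks, which avoids relying on compiler-specific bit-field ordering. This is only a sketch; the macro and function names below are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bit positions follow the doc comment in the patch. */
#define SKETCH_ASYNC_FEAT_INORDER_BIT      0
#define SKETCH_ASYNC_FEAT_THRESHOLD_SHIFT  16
#define SKETCH_ASYNC_FEAT_THRESHOLD_MASK   0xFFFu  /* 12 bits: b16 - b27 */

/* b0: the channel completes copies in submission order */
static inline bool
sketch_async_feat_inorder(uint32_t features)
{
	return (features >> SKETCH_ASYNC_FEAT_INORDER_BIT) & 1u;
}

/* b16 - b27: packet length threshold for offloading a copy */
static inline uint16_t
sketch_async_feat_threshold(uint32_t features)
{
	return (uint16_t)((features >> SKETCH_ASYNC_FEAT_THRESHOLD_SHIFT) &
			SKETCH_ASYNC_FEAT_THRESHOLD_MASK);
}

/* Build a features word; the threshold must fit in 12 bits. */
static inline uint32_t
sketch_async_feat_pack(bool inorder, uint16_t threshold)
{
	assert(threshold <= SKETCH_ASYNC_FEAT_THRESHOLD_MASK);
	return ((uint32_t)inorder << SKETCH_ASYNC_FEAT_INORDER_BIT) |
		((uint32_t)threshold << SKETCH_ASYNC_FEAT_THRESHOLD_SHIFT);
}
```

An application would pass the packed word as the "features" argument when registering a channel; the reserved bits are left at zero.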
>> +/**
>> + * register a dma channel for vhost
>> + *
>> + * @param vid
>> + *  vhost device id DMA channel to be attached to
>> + * @param queue_id
>> + *  vhost queue id DMA channel to be attached to
>> + * @param features
>> + *  DMA channel feature bit
>> + *    b0       : DMA supports inorder data transfer
>> + *    b1  - b15: reserved
>> + *    b16 - b27: Packet length threshold for DMA transfer
>> + *    b28 - b31: reserved
>> + * @param ops
>> + *  DMA operation callbacks
>> + * @return
>> + *  0 on success, -1 on failures
>> + */
>> +int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
>> +	uint32_t features, struct rte_vhost_async_channel_ops *ops);
>> +
>> +/**
>> + * unregister a dma channel for vhost
>> + *
>> + * @param vid
>> + *  vhost device id DMA channel to be detached
>> + * @param queue_id
>> + *  vhost queue id DMA channel to be detached
>> + * @return
>> + *  0 on success, -1 on failures
>> + */
>> +int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
>> +
>> +#endif /* _RTE_VDPA_H_ */
>> diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
>> index 0a66ef9..f817783 100644
>> --- a/lib/librte_vhost/socket.c
>> +++ b/lib/librte_vhost/socket.c
>> @@ -42,6 +42,7 @@ struct vhost_user_socket {
>>  	bool use_builtin_virtio_net;
>>  	bool extbuf;
>>  	bool linearbuf;
>> +	bool async_copy;
>>
>>  	/*
>>  	 * The "supported_features" indicates the feature bits the
>> @@ -210,6 +211,7 @@ struct vhost_user {
>>  	size_t size;
>>  	struct vhost_user_connection *conn;
>>  	int ret;
>> +	struct virtio_net *dev;
>>
>>  	if (vsocket == NULL)
>>  		return;
>> @@ -241,6 +243,13 @@ struct vhost_user {
>>  	if (vsocket->linearbuf)
>>  		vhost_enable_linearbuf(vid);
>>
>> +	if (vsocket->async_copy) {
>> +		dev = get_device(vid);
>> +
>> +		if (dev)
>> +			dev->async_copy = 1;
>> +	}
>> +
> 
> IMHO, the user can choose which queues use async copy, since backend hardware resources are limited.
> So should the async_copy enable flag be saved in the virtqueue structure?
> 
>>  	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
>>
>>  	if (vsocket->notify_ops->new_connection) {
>> @@ -891,6 +900,17 @@ struct vhost_user_reconnect_list {
>>  		goto out_mutex;
>>  	}
>>
>> +	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
>> +
>> +	if (vsocket->async_copy &&
>> +		(flags & (RTE_VHOST_USER_IOMMU_SUPPORT |
>> +		RTE_VHOST_USER_POSTCOPY_SUPPORT))) {
>> +		VHOST_LOG_CONFIG(ERR, "error: enabling async copy and
>> IOMMU "
>> +			"or post-copy feature simultaneously is not "
>> +			"supported\n");
>> +		goto out_mutex;
>> +	}
>> +
>>  	/*
>>  	 * Set the supported features correctly for the builtin vhost-user
>>  	 * net driver.
>> diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
>> index 0266318..e6b688a 100644
>> --- a/lib/librte_vhost/vhost.c
>> +++ b/lib/librte_vhost/vhost.c
>> @@ -332,8 +332,13 @@
>>  {
>>  	if (vq_is_packed(dev))
>>  		rte_free(vq->shadow_used_packed);
>> -	else
>> +	else {
>>  		rte_free(vq->shadow_used_split);
>> +		if (vq->async_pkts_pending)
>> +			rte_free(vq->async_pkts_pending);
>> +		if (vq->async_pending_info)
>> +			rte_free(vq->async_pending_info);
>> +	}
>>  	rte_free(vq->batch_copy_elems);
>>  	rte_mempool_free(vq->iotlb_pool);
>>  	rte_free(vq);
>> @@ -1527,3 +1532,70 @@ int rte_vhost_extern_callback_register(int vid,
>>  	if (vhost_data_log_level >= 0)
>>  		rte_log_set_level(vhost_data_log_level,
>> RTE_LOG_WARNING);
>>  }
>> +
>> +int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
>> +					uint32_t features,
>> +					struct rte_vhost_async_channel_ops
>> *ops)
>> +{
>> +	struct vhost_virtqueue *vq;
>> +	struct virtio_net *dev = get_device(vid);
>> +	struct dma_channel_features f;
>> +
>> +	if (dev == NULL || ops == NULL)
>> +		return -1;
>> +
>> +	f.intval = features;
>> +
>> +	vq = dev->virtqueue[queue_id];
>> +
>> +	if (vq == NULL)
>> +		return -1;
>> +
>> +	/** packed queue is not supported */
>> +	if (vq_is_packed(dev) || !f.inorder)
>> +		return -1;
>> +
> Virtio already has an in_order concept; the two names are so alike that they can easily be mixed up. Please consider how to distinguish them.
> 
>> +	if (ops->check_completed_copies == NULL ||
>> +		ops->transfer_data == NULL)
>> +		return -1;
>> +
> 
> The preceding error conditions are unlikely to be true; wrapping them in the unlikely() macro may make that clearer.
> 
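A minimal sketch of the suggested pattern follows. It assumes a GCC/Clang-style __builtin_expect and uses a local stand-in macro rather than DPDK's own unlikely(); the struct and function names are hypothetical and only mirror the shape of the check in the patch.

```c
#include <stddef.h>

/* Local stand-in for DPDK's unlikely(); assumes GCC/Clang __builtin_expect. */
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical, trimmed-down stand-in for struct rte_vhost_async_channel_ops. */
struct sketch_async_ops {
	int (*transfer_data)(void);
	int (*check_completed_copies)(void);
};

/* Reject a NULL ops table or missing callbacks, marking those paths as cold. */
static int
sketch_validate_ops(const struct sketch_async_ops *ops)
{
	if (sketch_unlikely(ops == NULL))
		return -1;

	if (sketch_unlikely(ops->transfer_data == NULL ||
			ops->check_completed_copies == NULL))
		return -1;

	return 0;
}

/* Placeholder callback used only to populate the table in examples. */
static int
sketch_noop(void)
{
	return 0;
}
```

The hint does not change behaviour; it only tells the compiler and the reader which branch is the error path.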
>> +	rte_spinlock_lock(&vq->access_lock);
>> +
>> +	vq->async_ops.check_completed_copies = ops->check_completed_copies;
>> +	vq->async_ops.transfer_data = ops->transfer_data;
>> +
>> +	vq->async_inorder = f.inorder;
>> +	vq->async_threshold = f.threshold;
>> +
>> +	vq->async_registered = true;
>> +
>> +	rte_spinlock_unlock(&vq->access_lock);
>> +
>> +	return 0;
>> +}
>> +
>> +int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id)
>> +{
>> +	struct vhost_virtqueue *vq;
>> +	struct virtio_net *dev = get_device(vid);
>> +
>> +	if (dev == NULL)
>> +		return -1;
>> +
>> +	vq = dev->virtqueue[queue_id];
>> +
>> +	if (vq == NULL)
>> +		return -1;
>> +
>> +	rte_spinlock_lock(&vq->access_lock);
>> +
>> +	vq->async_ops.transfer_data = NULL;
>> +	vq->async_ops.check_completed_copies = NULL;
>> +
>> +	vq->async_registered = false;
>> +
>> +	rte_spinlock_unlock(&vq->access_lock);
>> +
>> +	return 0;
>> +}
>> +
>> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
>> index df98d15..a7fbe23 100644
>> --- a/lib/librte_vhost/vhost.h
>> +++ b/lib/librte_vhost/vhost.h
>> @@ -23,6 +23,8 @@
>>  #include "rte_vhost.h"
>>  #include "rte_vdpa.h"
>>
>> +#include "rte_vhost_async.h"
>> +
>>  /* Used to indicate that the device is running on a data core */
>>  #define VIRTIO_DEV_RUNNING 1
>>  /* Used to indicate that the device is ready to operate */
>> @@ -39,6 +41,11 @@
>>
>>  #define VHOST_LOG_CACHE_NR 32
>>
>> +#define MAX_PKT_BURST 32
>> +
>> +#define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
>> +#define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
>> +
>>  #define PACKED_DESC_ENQUEUE_USED_FLAG(w)	\
>>  	((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED |
>> VRING_DESC_F_WRITE) : \
>>  		VRING_DESC_F_WRITE)
>> @@ -200,6 +207,25 @@ struct vhost_virtqueue {
>>  	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_list;
>>  	int				iotlb_cache_nr;
>>  	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_pending_list;
>> +
>> +	/* operation callbacks for async dma */
>> +	struct rte_vhost_async_channel_ops	async_ops;
>> +
>> +	struct iov_it it_pool[VHOST_MAX_ASYNC_IT];
>> +	struct iovec vec_pool[VHOST_MAX_ASYNC_VEC];
>> +
>> +	/* async data transfer status */
>> +	uintptr_t	**async_pkts_pending;
>> +	#define		ASYNC_PENDING_INFO_N_MSK 0xFFFF
>> +	#define		ASYNC_PENDING_INFO_N_SFT 16
>> +	uint64_t	*async_pending_info;
>> +	uint16_t	async_pkts_idx;
>> +	uint16_t	async_pkts_inflight_n;
>> +
>> +	/* vq async features */
>> +	bool		async_inorder;
>> +	bool		async_registered;
>> +	uint16_t	async_threshold;
>>  } __rte_cache_aligned;
>>
>>  /* Old kernels have no such macros defined */
>> @@ -353,6 +379,7 @@ struct virtio_net {
>>  	int16_t			broadcast_rarp;
>>  	uint32_t		nr_vring;
>>  	int			dequeue_zero_copy;
>> +	int			async_copy;
>>  	int			extbuf;
>>  	int			linearbuf;
>>  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
>> @@ -702,7 +729,8 @@ uint64_t translate_log_addr(struct virtio_net *dev,
>> struct vhost_virtqueue *vq,
>>  	/* Don't kick guest if we don't reach index specified by guest. */
>>  	if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
>>  		uint16_t old = vq->signalled_used;
>> -		uint16_t new = vq->last_used_idx;
>> +		uint16_t new = vq->async_pkts_inflight_n ?
>> +					vq->used->idx:vq->last_used_idx;
>>  		bool signalled_used_valid = vq->signalled_used_valid;
>>
>>  		vq->signalled_used = new;
>> diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
>> index 84bebad..d7600bf 100644
>> --- a/lib/librte_vhost/vhost_user.c
>> +++ b/lib/librte_vhost/vhost_user.c
>> @@ -464,12 +464,25 @@
>>  	} else {
>>  		if (vq->shadow_used_split)
>>  			rte_free(vq->shadow_used_split);
>> +		if (vq->async_pkts_pending)
>> +			rte_free(vq->async_pkts_pending);
>> +		if (vq->async_pending_info)
>> +			rte_free(vq->async_pending_info);
>> +
>>  		vq->shadow_used_split = rte_malloc(NULL,
>>  				vq->size * sizeof(struct vring_used_elem),
>>  				RTE_CACHE_LINE_SIZE);
>> -		if (!vq->shadow_used_split) {
>> +		vq->async_pkts_pending = rte_malloc(NULL,
>> +				vq->size * sizeof(uintptr_t),
>> +				RTE_CACHE_LINE_SIZE);
>> +		vq->async_pending_info = rte_malloc(NULL,
>> +				vq->size * sizeof(uint64_t),
>> +				RTE_CACHE_LINE_SIZE);
>> +		if (!vq->shadow_used_split ||
>> +			!vq->async_pkts_pending ||
>> +			!vq->async_pending_info) {
>>  			VHOST_LOG_CONFIG(ERR,
>> -					"failed to allocate memory for
>> shadow used ring.\n");
>> +					"failed to allocate memory for vq
>> internal data.\n");
> 
> If async copy is not enabled, there is no need to allocate the related structures.
> 
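One possible shape for this, as a sketch only: allocate the async bookkeeping arrays only when the device-level flag is set. The struct and function names are hypothetical, and calloc()/free() stand in for rte_malloc()/rte_free() so that the snippet is self-contained.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for struct vhost_virtqueue. */
struct sketch_vq {
	uint16_t size;                    /* ring size, assumed non-zero */
	uintptr_t **async_pkts_pending;
	uint64_t *async_pending_info;
};

/* Release any async arrays and reset the pointers. */
static void
sketch_free_async(struct sketch_vq *vq)
{
	free(vq->async_pkts_pending);
	free(vq->async_pending_info);
	vq->async_pkts_pending = NULL;
	vq->async_pending_info = NULL;
}

/* (Re)allocate async arrays only if async copy is enabled for the device. */
static int
sketch_alloc_async(struct sketch_vq *vq, bool async_copy)
{
	sketch_free_async(vq);

	if (!async_copy)
		return 0;  /* sync path: nothing to allocate */

	vq->async_pkts_pending = calloc(vq->size,
			sizeof(*vq->async_pkts_pending));
	vq->async_pending_info = calloc(vq->size,
			sizeof(*vq->async_pending_info));
	if (vq->async_pkts_pending == NULL ||
			vq->async_pending_info == NULL) {
		sketch_free_async(vq);
		return -1;
	}

	return 0;
}
```

With this shape the shadow used ring allocation stays unconditional, and a failure to allocate the async arrays is reported only when async copy was actually requested.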
>>  			return RTE_VHOST_MSG_RESULT_ERR;
>>  		}
>>  	}
>> @@ -1147,7 +1160,8 @@
>>  			goto err_mmap;
>>  		}
>>
>> -		populate = (dev->dequeue_zero_copy) ? MAP_POPULATE : 0;
>> +		populate = (dev->dequeue_zero_copy || dev->async_copy) ?
>> +			MAP_POPULATE : 0;
>>  		mmap_addr = mmap(NULL, mmap_size, PROT_READ |
>> PROT_WRITE,
>>  				 MAP_SHARED | populate, fd, 0);
>>
>> @@ -1162,7 +1176,7 @@
>>  		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
>>  				      mmap_offset;
>>
>> -		if (dev->dequeue_zero_copy)
>> +		if (dev->dequeue_zero_copy || dev->async_copy)
>>  			if (add_guest_pages(dev, reg, alignment) < 0) {
>>  				VHOST_LOG_CONFIG(ERR,
>>  					"adding guest pages to region %u
>> failed.\n",
>> @@ -1945,6 +1959,12 @@ static int vhost_user_set_vring_err(struct
>> virtio_net **pdev __rte_unused,
>>  	} else {
>>  		rte_free(vq->shadow_used_split);
>>  		vq->shadow_used_split = NULL;
>> +		if (vq->async_pkts_pending)
>> +			rte_free(vq->async_pkts_pending);
>> +		if (vq->async_pending_info)
>> +			rte_free(vq->async_pending_info);
>> +		vq->async_pkts_pending = NULL;
>> +		vq->async_pending_info = NULL;
>>  	}
>>
>>  	rte_free(vq->batch_copy_elems);
>> --
>> 1.8.3.1
> 
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path registration API
  2020-06-11 10:02 ` [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path registration API patrick.fu
  2020-06-18  5:50   ` Liu, Yong
@ 2020-06-26 14:28   ` Maxime Coquelin
  2020-06-29  1:15     ` Fu, Patrick
  2020-06-26 14:44   ` Maxime Coquelin
  2 siblings, 1 reply; 36+ messages in thread
From: Maxime Coquelin @ 2020-06-26 14:28 UTC (permalink / raw)
  To: patrick.fu, dev, chenbo.xia, zhihong.wang, xiaolong.ye
  Cc: cheng1.jiang, cunming.liang
On 6/11/20 12:02 PM, patrick.fu@intel.com wrote:
> From: Patrick <patrick.fu@intel.com>
> 
> This patch introduces registration/un-registration APIs
> for async data path together with all required data
> structures and DMA callback function proto-types.
> 
> Signed-off-by: Patrick <patrick.fu@intel.com>
> ---
>  lib/librte_vhost/Makefile          |   3 +-
>  lib/librte_vhost/rte_vhost.h       |   1 +
>  lib/librte_vhost/rte_vhost_async.h | 134 +++++++++++++++++++++++++++++++++++++
>  lib/librte_vhost/socket.c          |  20 ++++++
>  lib/librte_vhost/vhost.c           |  74 +++++++++++++++++++-
>  lib/librte_vhost/vhost.h           |  30 ++++++++-
>  lib/librte_vhost/vhost_user.c      |  28 ++++++--
>  7 files changed, 283 insertions(+), 7 deletions(-)
>  create mode 100644 lib/librte_vhost/rte_vhost_async.h
> 
> diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> index e592795..3aed094 100644
> --- a/lib/librte_vhost/Makefile
> +++ b/lib/librte_vhost/Makefile
> @@ -41,7 +41,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
>  					vhost_user.c virtio_net.c vdpa.c
>  
>  # install includes
> -SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h
> +SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h \
> +						rte_vhost_async.h
>  
>  # only compile vhost crypto when cryptodev is enabled
>  ifeq ($(CONFIG_RTE_LIBRTE_CRYPTODEV),y)
> diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
> index d43669f..cec4d07 100644
> --- a/lib/librte_vhost/rte_vhost.h
> +++ b/lib/librte_vhost/rte_vhost.h
> @@ -35,6 +35,7 @@
>  #define RTE_VHOST_USER_EXTBUF_SUPPORT	(1ULL << 5)
>  /* support only linear buffers (no chained mbufs) */
>  #define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
> +#define RTE_VHOST_USER_ASYNC_COPY	(1ULL << 7)
>  
>  /** Protocol features. */
>  #ifndef VHOST_USER_PROTOCOL_F_MQ
> diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
> new file mode 100644
> index 0000000..82f2ebe
> --- /dev/null
> +++ b/lib/librte_vhost/rte_vhost_async.h
> @@ -0,0 +1,134 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2018 Intel Corporation
> + */
> +
> +#ifndef _RTE_VHOST_ASYNC_H_
> +#define _RTE_VHOST_ASYNC_H_
> +
> +#include "rte_vhost.h"
> +
> +/**
> + * iovec iterator
> + */
> +struct iov_it {
> +	/** offset to the first byte of interesting data */
> +	size_t offset;
> +	/** total bytes of data in this iterator */
> +	size_t count;
> +	/** pointer to the iovec array */
> +	struct iovec *iov;
> +	/** number of iovec in this iterator */
> +	unsigned long nr_segs;
> +};
> +
> +/**
> + * dma transfer descriptor pair
> + */
> +struct dma_trans_desc {
> +	/** source memory iov_it */
> +	struct iov_it *src;
> +	/** destination memory iov_it */
> +	struct iov_it *dst;
> +};
> +
> +/**
> + * dma transfer status
> + */
> +struct dma_trans_status {
> +	/** An array of application specific data for source memory */
> +	uintptr_t *src_opaque_data;
> +	/** An array of application specific data for destination memory */
> +	uintptr_t *dst_opaque_data;
> +};
> +
> +/**
> + * dma operation callbacks to be implemented by applications
> + */
> +struct rte_vhost_async_channel_ops {
> +	/**
> +	 * instruct a DMA channel to perform copies for a batch of packets
> +	 *
> +	 * @param vid
> +	 *  id of vhost device to perform data copies
> +	 * @param queue_id
> +	 *  queue id to perform data copies
> +	 * @param descs
> +	 *  an array of DMA transfer memory descriptors
> +	 * @param opaque_data
> +	 *  opaque data pair sending to DMA engine
> +	 * @param count
> +	 *  number of elements in the "descs" array
> +	 * @return
> +	 *  -1 on failure, number of descs processed on success
> +	 */
> +	int (*transfer_data)(int vid, uint16_t queue_id,
> +		struct dma_trans_desc *descs,
> +		struct dma_trans_status *opaque_data,
> +		uint16_t count);
> +	/**
> +	 * check copy-completed packets from a DMA channel
> +	 * @param vid
> +	 *  id of vhost device to check copy completion
> +	 * @param queue_id
> +	 *  queue id to check copy completion
> +	 * @param opaque_data
> +	 *  buffer to receive the opaque data pair from DMA engine
> +	 * @param max_packets
> +	 *  max number of packets could be completed
> +	 * @return
> +	 *  -1 on failure, number of iov segments completed on success
> +	 */
> +	int (*check_completed_copies)(int vid, uint16_t queue_id,
> +		struct dma_trans_status *opaque_data,
> +		uint16_t max_packets);
> +};
> +
> +/**
> + *  dma channel feature bit definition
> + */
> +struct dma_channel_features {
> +	union {
> +		uint32_t intval;
> +		struct {
> +			uint32_t inorder:1;
> +			uint32_t resvd0115:15;
> +			uint32_t threshold:12;
> +			uint32_t resvd2831:4;
> +		};
> +	};
> +};
> +
> +/**
> + * register a dma channel for vhost
> + *
> + * @param vid
> + *  vhost device id DMA channel to be attached to
> + * @param queue_id
> + *  vhost queue id DMA channel to be attached to
> + * @param features
> + *  DMA channel feature bit
> + *    b0       : DMA supports inorder data transfer
> + *    b1  - b15: reserved
> + *    b16 - b27: Packet length threshold for DMA transfer
> + *    b28 - b31: reserved
> + * @param ops
> + *  DMA operation callbacks
> + * @return
> + *  0 on success, -1 on failures
> + */
> +int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
> +	uint32_t features, struct rte_vhost_async_channel_ops *ops);
> +
> +/**
> + * unregister a dma channel for vhost
> + *
> + * @param vid
> + *  vhost device id DMA channel to be detached
> + * @param queue_id
> + *  vhost queue id DMA channel to be detached
> + * @return
> + *  0 on success, -1 on failures
> + */
> +int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
> +
> +#endif /* _RTE_VDPA_H_ */
> diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
> index 0a66ef9..f817783 100644
> --- a/lib/librte_vhost/socket.c
> +++ b/lib/librte_vhost/socket.c
> @@ -42,6 +42,7 @@ struct vhost_user_socket {
>  	bool use_builtin_virtio_net;
>  	bool extbuf;
>  	bool linearbuf;
> +	bool async_copy;
>  
>  	/*
>  	 * The "supported_features" indicates the feature bits the
> @@ -210,6 +211,7 @@ struct vhost_user {
>  	size_t size;
>  	struct vhost_user_connection *conn;
>  	int ret;
> +	struct virtio_net *dev;
>  
>  	if (vsocket == NULL)
>  		return;
> @@ -241,6 +243,13 @@ struct vhost_user {
>  	if (vsocket->linearbuf)
>  		vhost_enable_linearbuf(vid);
>  
> +	if (vsocket->async_copy) {
> +		dev = get_device(vid);
> +
> +		if (dev)
> +			dev->async_copy = 1;
> +	}
> +
>  	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
>  
>  	if (vsocket->notify_ops->new_connection) {
> @@ -891,6 +900,17 @@ struct vhost_user_reconnect_list {
>  		goto out_mutex;
>  	}
>  
> +	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
> +
> +	if (vsocket->async_copy &&
> +		(flags & (RTE_VHOST_USER_IOMMU_SUPPORT |
> +		RTE_VHOST_USER_POSTCOPY_SUPPORT))) {
> +		VHOST_LOG_CONFIG(ERR, "error: enabling async copy and IOMMU "
> +			"or post-copy feature simultaneously is not "
> +			"supported\n");
> +		goto out_mutex;
> +	}
> +
Have you ensured compatibility with the linearbuf feature (TSO)?
You will want to do the same kind of check if it is not compatible.
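If linearbuf turns out not to be compatible, the check could look roughly
like the standalone sketch below. The EX_ names and the helper are
hypothetical; only the two bit positions come from the rte_vhost.h hunk
quoted above, and whether linearbuf really conflicts with async copy is an
assumption that still needs to be confirmed.

```c
#include <stdint.h>
#include <stdbool.h>

/* Bit positions copied from the quoted rte_vhost.h hunk. */
#define EX_LINEARBUF_SUPPORT	(1ULL << 6)
#define EX_ASYNC_COPY		(1ULL << 7)

/*
 * Hypothetical helper: returns true when async copy is requested together
 * with a feature it is assumed not to support (here: linearbuf only).
 */
static bool
ex_async_copy_conflicts(uint64_t flags)
{
	/* Nothing to check when async copy was not requested. */
	if (!(flags & EX_ASYNC_COPY))
		return false;

	return (flags & EX_LINEARBUF_SUPPORT) != 0;
}
```

In the patch itself this would presumably sit next to the IOMMU/post-copy
check, logging with VHOST_LOG_CONFIG(ERR, ...) before "goto out_mutex".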
>  	/*
>  	 * Set the supported features correctly for the builtin vhost-user
>  	 * net driver.
> diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> index 0266318..e6b688a 100644
> --- a/lib/librte_vhost/vhost.c
> +++ b/lib/librte_vhost/vhost.c
> @@ -332,8 +332,13 @@
>  {
>  	if (vq_is_packed(dev))
>  		rte_free(vq->shadow_used_packed);
> -	else
> +	else {
>  		rte_free(vq->shadow_used_split);
> +		if (vq->async_pkts_pending)
> +			rte_free(vq->async_pkts_pending);
> +		if (vq->async_pending_info)
> +			rte_free(vq->async_pending_info);
> +	}
>  	rte_free(vq->batch_copy_elems);
>  	rte_mempool_free(vq->iotlb_pool);
>  	rte_free(vq);
> @@ -1527,3 +1532,70 @@ int rte_vhost_extern_callback_register(int vid,
>  	if (vhost_data_log_level >= 0)
>  		rte_log_set_level(vhost_data_log_level, RTE_LOG_WARNING);
>  }
> +
> +int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
> +					uint32_t features,
> +					struct rte_vhost_async_channel_ops *ops)
> +{
> +	struct vhost_virtqueue *vq;
> +	struct virtio_net *dev = get_device(vid);
> +	struct dma_channel_features f;
> +
> +	if (dev == NULL || ops == NULL)
> +		return -1;
> +
> +	f.intval = features;
> +
> +	vq = dev->virtqueue[queue_id];
> +
> +	if (vq == NULL)
> +		return -1;
> +
> +	/** packed queue is not supported */
> +	if (vq_is_packed(dev) || !f.inorder)
> +		return -1;
You might want to print an error message to help the user understand
why it fails.
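As a rough standalone sketch of what I mean: the helper names and message
strings below are made up, and the bit layout (b0 = inorder, b16 - b27 =
threshold) is taken from the doc comment in the quoted header. It decodes
the feature word with explicit shifts and returns a reason string the
caller can log.

```c
#include <stdint.h>
#include <stddef.h>

/* Layout from the quoted header comment: b0 inorder, b16-b27 threshold. */
#define EX_F_INORDER		(1u << 0)
#define EX_F_THRESHOLD_SFT	16
#define EX_F_THRESHOLD_MSK	0xFFFu

/*
 * Hypothetical helper: returns NULL when registration may proceed,
 * otherwise a human-readable reason that the caller can log.
 */
static const char *
ex_async_register_reason(int vq_is_packed, uint32_t features)
{
	if (vq_is_packed)
		return "async copy is not supported on packed rings";
	if (!(features & EX_F_INORDER))
		return "async copy requires an in-order DMA channel";
	return NULL;
}

/* Extract the threshold field (bits 16..27) from the feature word. */
static uint16_t
ex_async_threshold(uint32_t features)
{
	return (features >> EX_F_THRESHOLD_SFT) & EX_F_THRESHOLD_MSK;
}
```

In rte_vhost_async_channel_register() the returned string would simply be
passed to VHOST_LOG_CONFIG(ERR, ...) before returning -1.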
> +
> +	if (ops->check_completed_copies == NULL ||
> +		ops->transfer_data == NULL)
> +		return -1;
> +
> +	rte_spinlock_lock(&vq->access_lock);
> +
> +	vq->async_ops.check_completed_copies = ops->check_completed_copies;
> +	vq->async_ops.transfer_data = ops->transfer_data;
> +
> +	vq->async_inorder = f.inorder;
> +	vq->async_threshold = f.threshold;
> +
> +	vq->async_registered = true;
> +
> +	rte_spinlock_unlock(&vq->access_lock);
> +
> +	return 0;
> +}
> +
> +int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id)
> +{
> +	struct vhost_virtqueue *vq;
> +	struct virtio_net *dev = get_device(vid);
> +
> +	if (dev == NULL)
> +		return -1;
> +
> +	vq = dev->virtqueue[queue_id];
> +
> +	if (vq == NULL)
> +		return -1;
> +
> +	rte_spinlock_lock(&vq->access_lock);
We might want to wait until all async transfers are done before
unregistering?
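A minimal sketch of such a drain step, assuming the completion callback
reports how many in-flight copies finished on each poll. The struct and
function names are stand-ins for illustration, not the real vhost types,
and the fake callback only exists to make the example self-contained.

```c
#include <stdint.h>

/* Minimal stand-in for the per-queue async state; not the real vq. */
struct ex_async_vq {
	uint16_t inflight;	/* copies submitted but not yet completed */
	/* returns number of copies completed by this poll, or -1 on error */
	int (*check_completed)(struct ex_async_vq *vq, uint16_t max);
};

/*
 * Poll the completion callback until nothing is in flight.
 * Returns 0 once drained, -1 on callback error or when max_polls
 * polls were not enough.
 */
static int
ex_async_drain(struct ex_async_vq *vq, unsigned int max_polls)
{
	while (vq->inflight != 0) {
		int done;

		if (max_polls-- == 0)
			return -1;

		done = vq->check_completed(vq, vq->inflight);
		if (done < 0)
			return -1;
		if (done > vq->inflight)
			done = vq->inflight;	/* never underflow */

		vq->inflight -= (uint16_t)done;
	}

	return 0;
}

/* Fake DMA channel for illustration: completes at most 3 copies per poll. */
static int
ex_fake_check_completed(struct ex_async_vq *vq, uint16_t max)
{
	(void)vq;
	return max < 3 ? max : 3;
}
```

Unregistering could then either refuse to proceed (return -1) while the
drain fails, or leave the decision to the application; the bounded poll
count is just one way to avoid spinning forever on a stuck channel.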
> +
> +	vq->async_ops.transfer_data = NULL;
> +	vq->async_ops.check_completed_copies = NULL;
> +
> +	vq->async_registered = false;
> +
> +	rte_spinlock_unlock(&vq->access_lock);
> +
> +	return 0;
> +}
> +
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index df98d15..a7fbe23 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -23,6 +23,8 @@
>  #include "rte_vhost.h"
>  #include "rte_vdpa.h"
>  
> +#include "rte_vhost_async.h"
> +
>  /* Used to indicate that the device is running on a data core */
>  #define VIRTIO_DEV_RUNNING 1
>  /* Used to indicate that the device is ready to operate */
> @@ -39,6 +41,11 @@
>  
>  #define VHOST_LOG_CACHE_NR 32
>  
> +#define MAX_PKT_BURST 32
> +
> +#define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
> +#define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
> +
>  #define PACKED_DESC_ENQUEUE_USED_FLAG(w)	\
>  	((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \
>  		VRING_DESC_F_WRITE)
> @@ -200,6 +207,25 @@ struct vhost_virtqueue {
>  	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_list;
>  	int				iotlb_cache_nr;
>  	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_pending_list;
> +
> +	/* operation callbacks for async dma */
> +	struct rte_vhost_async_channel_ops	async_ops;
> +
> +	struct iov_it it_pool[VHOST_MAX_ASYNC_IT];
> +	struct iovec vec_pool[VHOST_MAX_ASYNC_VEC];
> +
> +	/* async data transfer status */
> +	uintptr_t	**async_pkts_pending;
> +	#define		ASYNC_PENDING_INFO_N_MSK 0xFFFF
> +	#define		ASYNC_PENDING_INFO_N_SFT 16
> +	uint64_t	*async_pending_info;
> +	uint16_t	async_pkts_idx;
> +	uint16_t	async_pkts_inflight_n;
> +
> +	/* vq async features */
> +	bool		async_inorder;
> +	bool		async_registered;
> +	uint16_t	async_threshold;
>  } __rte_cache_aligned;
>  
>  /* Old kernels have no such macros defined */
> @@ -353,6 +379,7 @@ struct virtio_net {
>  	int16_t			broadcast_rarp;
>  	uint32_t		nr_vring;
>  	int			dequeue_zero_copy;
> +	int			async_copy;
>  	int			extbuf;
>  	int			linearbuf;
>  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
> @@ -702,7 +729,8 @@ uint64_t translate_log_addr(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	/* Don't kick guest if we don't reach index specified by guest. */
>  	if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
>  		uint16_t old = vq->signalled_used;
> -		uint16_t new = vq->last_used_idx;
> +		uint16_t new = vq->async_pkts_inflight_n ?
> +					vq->used->idx:vq->last_used_idx;
>  		bool signalled_used_valid = vq->signalled_used_valid;
>  
>  		vq->signalled_used = new;
> diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
> index 84bebad..d7600bf 100644
> --- a/lib/librte_vhost/vhost_user.c
> +++ b/lib/librte_vhost/vhost_user.c
> @@ -464,12 +464,25 @@
>  	} else {
>  		if (vq->shadow_used_split)
>  			rte_free(vq->shadow_used_split);
> +		if (vq->async_pkts_pending)
> +			rte_free(vq->async_pkts_pending);
> +		if (vq->async_pending_info)
> +			rte_free(vq->async_pending_info);
> +
>  		vq->shadow_used_split = rte_malloc(NULL,
>  				vq->size * sizeof(struct vring_used_elem),
>  				RTE_CACHE_LINE_SIZE);
> -		if (!vq->shadow_used_split) {
> +		vq->async_pkts_pending = rte_malloc(NULL,
> +				vq->size * sizeof(uintptr_t),
> +				RTE_CACHE_LINE_SIZE);
> +		vq->async_pending_info = rte_malloc(NULL,
> +				vq->size * sizeof(uint64_t),
> +				RTE_CACHE_LINE_SIZE);
> +		if (!vq->shadow_used_split ||
> +			!vq->async_pkts_pending ||
> +			!vq->async_pending_info) {
>  			VHOST_LOG_CONFIG(ERR,
> -					"failed to allocate memory for shadow used ring.\n");
> +					"failed to allocate memory for vq internal data.\n");
>  			return RTE_VHOST_MSG_RESULT_ERR;
>  		}
>  	}
> @@ -1147,7 +1160,8 @@
>  			goto err_mmap;
>  		}
>  
> -		populate = (dev->dequeue_zero_copy) ? MAP_POPULATE : 0;
> +		populate = (dev->dequeue_zero_copy || dev->async_copy) ?
> +			MAP_POPULATE : 0;
>  		mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
>  				 MAP_SHARED | populate, fd, 0);
>  
> @@ -1162,7 +1176,7 @@
>  		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
>  				      mmap_offset;
>  
> -		if (dev->dequeue_zero_copy)
> +		if (dev->dequeue_zero_copy || dev->async_copy)
>  			if (add_guest_pages(dev, reg, alignment) < 0) {
>  				VHOST_LOG_CONFIG(ERR,
>  					"adding guest pages to region %u failed.\n",
> @@ -1945,6 +1959,12 @@ static int vhost_user_set_vring_err(struct virtio_net **pdev __rte_unused,
>  	} else {
>  		rte_free(vq->shadow_used_split);
>  		vq->shadow_used_split = NULL;
> +		if (vq->async_pkts_pending)
> +			rte_free(vq->async_pkts_pending);
> +		if (vq->async_pending_info)
> +			rte_free(vq->async_pending_info);
> +		vq->async_pkts_pending = NULL;
> +		vq->async_pending_info = NULL;
>  	}
>  
>  	rte_free(vq->batch_copy_elems);
> 
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 2/2] vhost: introduce async enqueue for split ring
  2020-06-11 10:02 ` [dpdk-dev] [PATCH v1 2/2] vhost: introduce async enqueue for split ring patrick.fu
  2020-06-18  6:56   ` Liu, Yong
@ 2020-06-26 14:39   ` Maxime Coquelin
  2020-06-26 14:46   ` Maxime Coquelin
  2 siblings, 0 replies; 36+ messages in thread
From: Maxime Coquelin @ 2020-06-26 14:39 UTC (permalink / raw)
  To: patrick.fu, dev, chenbo.xia, zhihong.wang, xiaolong.ye
  Cc: cheng1.jiang, cunming.liang
On 6/11/20 12:02 PM, patrick.fu@intel.com wrote:
> From: Patrick <patrick.fu@intel.com>
> 
> This patch implement async enqueue data path for split ring.
A more detailed commit message would be helpful, since the cover
letter won't be in the git history. Duplicating the relevant parts of
the cover letter would be sufficient.
> Signed-off-by: Patrick <patrick.fu@intel.com>
> ---
>  lib/librte_vhost/rte_vhost_async.h |  38 +++
>  lib/librte_vhost/virtio_net.c      | 538 ++++++++++++++++++++++++++++++++++++-
>  2 files changed, 574 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
> index 82f2ebe..efcba0a 100644
> --- a/lib/librte_vhost/rte_vhost_async.h
> +++ b/lib/librte_vhost/rte_vhost_async.h
> @@ -131,4 +131,42 @@ int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
>   */
>  int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
>  
> +/**
> + * This function submits enqueue data to the DMA engine. It does not
> + * guarantee that the transfer has completed upon return. Applications should
> + * poll the transfer status with rte_vhost_poll_enqueue_completed()
> + *
> + * @param vid
> + *  id of vhost device to enqueue data
> + * @param queue_id
> + *  queue id to enqueue data
> + * @param pkts
> + *  array of packets to be enqueued
> + * @param count
> + *  packets num to be enqueued
> + * @return
> + *  num of packets enqueued
> + */
> +uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
> +		struct rte_mbuf **pkts, uint16_t count);
> +
> +/**
> + * This function check DMA completion status for a specific vhost
> + * device queue. Packets which finish copying (enqueue) operation
> + * will be returned in an array.
> + *
> + * @param vid
> + *  id of vhost device to enqueue data
> + * @param queue_id
> + *  queue id to enqueue data
> + * @param pkts
> + *  blank array to get return packet pointer
> + * @param count
> + *  size of the packet array
> + * @return
> + *  num of packets returned
> + */
> +uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
> +		struct rte_mbuf **pkts, uint16_t count);
> +
>  #endif /* _RTE_VDPA_H_ */
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 751c1f3..cf9f884 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -17,14 +17,15 @@
>  #include <rte_arp.h>
>  #include <rte_spinlock.h>
>  #include <rte_malloc.h>
> +#include <rte_vhost_async.h>
>  
>  #include "iotlb.h"
>  #include "vhost.h"
>  
> -#define MAX_PKT_BURST 32
> -
>  #define MAX_BATCH_LEN 256
>  
> +#define VHOST_ASYNC_BATCH_THRESHOLD 8
> +
>  static  __rte_always_inline bool
>  rxvq_is_mergeable(struct virtio_net *dev)
>  {
> @@ -117,6 +118,35 @@
>  }
>  
>  static __rte_always_inline void
> +async_flush_shadow_used_ring_split(struct virtio_net *dev,
> +	struct vhost_virtqueue *vq)
> +{
> +	uint16_t used_idx = vq->last_used_idx & (vq->size - 1);
> +
> +	if (used_idx + vq->shadow_used_idx <= vq->size) {
> +		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0,
> +					  vq->shadow_used_idx);
> +	} else {
> +		uint16_t size;
> +
> +		/* update used ring interval [used_idx, vq->size] */
> +		size = vq->size - used_idx;
> +		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0, size);
> +
> +		/* update the left half used ring interval [0, left_size] */
> +		do_flush_shadow_used_ring_split(dev, vq, 0, size,
> +					  vq->shadow_used_idx - size);
> +	}
> +	vq->last_used_idx += vq->shadow_used_idx;
> +
> +	rte_smp_wmb();
> +
> +	vhost_log_cache_sync(dev, vq);
> +
> +	vq->shadow_used_idx = 0;
> +}
> +
> +static __rte_always_inline void
>  update_shadow_used_ring_split(struct vhost_virtqueue *vq,
>  			 uint16_t desc_idx, uint32_t len)
>  {
> @@ -905,6 +935,199 @@
>  	return error;
>  }
>  
> +static __rte_always_inline void
> +async_fill_vec(struct iovec *v, void *base, size_t len)
> +{
> +	v->iov_base = base;
> +	v->iov_len = len;
> +}
> +
> +static __rte_always_inline void
> +async_fill_it(struct iov_it *it, size_t count,
> +	struct iovec *vec, unsigned long nr_seg)
> +{
> +	it->offset = 0;
> +	it->count = count;
> +
> +	if (count) {
> +		it->iov = vec;
> +		it->nr_segs = nr_seg;
> +	} else {
> +		it->iov = 0;
> +		it->nr_segs = 0;
> +	}
> +}
> +
> +static __rte_always_inline void
> +async_fill_des(struct dma_trans_desc *desc,
> +	struct iov_it *src, struct iov_it *dst)
> +{
> +	desc->src = src;
> +	desc->dst = dst;
> +}
> +
> +static __rte_always_inline int
> +async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> +			struct rte_mbuf *m, struct buf_vector *buf_vec,
> +			uint16_t nr_vec, uint16_t num_buffers,
> +			struct iovec *src_iovec, struct iovec *dst_iovec,
> +			struct iov_it *src_it, struct iov_it *dst_it)
> +{
> +	uint32_t vec_idx = 0;
> +	uint32_t mbuf_offset, mbuf_avail;
> +	uint32_t buf_offset, buf_avail;
> +	uint64_t buf_addr, buf_iova, buf_len;
> +	uint32_t cpy_len, cpy_threshold;
> +	uint64_t hdr_addr;
> +	struct rte_mbuf *hdr_mbuf;
> +	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
> +	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
> +	int error = 0;
> +
> +	uint32_t tlen = 0;
> +	int tvec_idx = 0;
> +	void *hpa;
> +
> +	if (unlikely(m == NULL)) {
> +		error = -1;
> +		goto out;
> +	}
> +
> +	cpy_threshold = vq->async_threshold;
> +
> +	buf_addr = buf_vec[vec_idx].buf_addr;
> +	buf_iova = buf_vec[vec_idx].buf_iova;
> +	buf_len = buf_vec[vec_idx].buf_len;
> +
> +	if (unlikely(buf_len < dev->vhost_hlen && nr_vec <= 1)) {
> +		error = -1;
> +		goto out;
> +	}
> +
> +	hdr_mbuf = m;
> +	hdr_addr = buf_addr;
> +	if (unlikely(buf_len < dev->vhost_hlen))
> +		hdr = &tmp_hdr;
> +	else
> +		hdr = (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)hdr_addr;
> +
> +	VHOST_LOG_DATA(DEBUG, "(%d) RX: num merge buffers %d\n",
> +		dev->vid, num_buffers);
> +
> +	if (unlikely(buf_len < dev->vhost_hlen)) {
> +		buf_offset = dev->vhost_hlen - buf_len;
> +		vec_idx++;
> +		buf_addr = buf_vec[vec_idx].buf_addr;
> +		buf_iova = buf_vec[vec_idx].buf_iova;
> +		buf_len = buf_vec[vec_idx].buf_len;
> +		buf_avail = buf_len - buf_offset;
> +	} else {
> +		buf_offset = dev->vhost_hlen;
> +		buf_avail = buf_len - dev->vhost_hlen;
> +	}
> +
> +	mbuf_avail  = rte_pktmbuf_data_len(m);
> +	mbuf_offset = 0;
> +
> +	while (mbuf_avail != 0 || m->next != NULL) {
> +		/* done with current buf, get the next one */
> +		if (buf_avail == 0) {
> +			vec_idx++;
> +			if (unlikely(vec_idx >= nr_vec)) {
> +				error = -1;
> +				goto out;
> +			}
> +
> +			buf_addr = buf_vec[vec_idx].buf_addr;
> +			buf_iova = buf_vec[vec_idx].buf_iova;
> +			buf_len = buf_vec[vec_idx].buf_len;
> +
> +			buf_offset = 0;
> +			buf_avail  = buf_len;
> +		}
> +
> +		/* done with current mbuf, get the next one */
> +		if (mbuf_avail == 0) {
> +			m = m->next;
> +
> +			mbuf_offset = 0;
> +			mbuf_avail  = rte_pktmbuf_data_len(m);
> +		}
> +
> +		if (hdr_addr) {
> +			virtio_enqueue_offload(hdr_mbuf, &hdr->hdr);
> +			if (rxvq_is_mergeable(dev))
> +				ASSIGN_UNLESS_EQUAL(hdr->num_buffers,
> +						num_buffers);
> +
> +			if (unlikely(hdr == &tmp_hdr)) {
> +				copy_vnet_hdr_to_desc(dev, vq, buf_vec, hdr);
> +			} else {
> +				PRINT_PACKET(dev, (uintptr_t)hdr_addr,
> +						dev->vhost_hlen, 0);
> +				vhost_log_cache_write_iova(dev, vq,
> +						buf_vec[0].buf_iova,
> +						dev->vhost_hlen);
> +			}
> +
> +			hdr_addr = 0;
> +		}
> +
> +		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
> +
> +		if (unlikely(cpy_len >= cpy_threshold)) {
> +			hpa = (void *)(uintptr_t)gpa_to_hpa(dev,
> +					buf_iova + buf_offset, cpy_len);
> +
> +			if (unlikely(!hpa)) {
> +				error = -1;
> +				goto out;
> +			}
> +
> +			async_fill_vec(src_iovec + tvec_idx,
> +				(void *)(uintptr_t)rte_pktmbuf_iova_offset(m,
> +						mbuf_offset), cpy_len);
> +
> +			async_fill_vec(dst_iovec + tvec_idx, hpa, cpy_len);
> +
> +			tlen += cpy_len;
> +			tvec_idx++;
> +		} else {
> +			if (unlikely(vq->batch_copy_nb_elems >= vq->size)) {
> +				rte_memcpy(
> +				(void *)((uintptr_t)(buf_addr + buf_offset)),
> +				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
> +				cpy_len);
> +
> +				PRINT_PACKET(dev,
> +					(uintptr_t)(buf_addr + buf_offset),
> +					cpy_len, 0);
> +			} else {
> +				batch_copy[vq->batch_copy_nb_elems].dst =
> +				(void *)((uintptr_t)(buf_addr + buf_offset));
> +				batch_copy[vq->batch_copy_nb_elems].src =
> +				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset);
> +				batch_copy[vq->batch_copy_nb_elems].log_addr =
> +					buf_iova + buf_offset;
> +				batch_copy[vq->batch_copy_nb_elems].len =
> +					cpy_len;
> +				vq->batch_copy_nb_elems++;
> +			}
> +		}
> +
> +		mbuf_avail  -= cpy_len;
> +		mbuf_offset += cpy_len;
> +		buf_avail  -= cpy_len;
> +		buf_offset += cpy_len;
> +	}
> +
> +out:
> +	async_fill_it(src_it, tlen, src_iovec, tvec_idx);
> +	async_fill_it(dst_it, tlen, dst_iovec, tvec_idx);
> +
> +	return error;
> +}
> +
>  static __rte_always_inline int
>  vhost_enqueue_single_packed(struct virtio_net *dev,
>  			    struct vhost_virtqueue *vq,
> @@ -1236,6 +1459,317 @@
>  	return virtio_dev_rx(dev, queue_id, pkts, count);
>  }
>  
> +static __rte_always_inline void
> +virtio_dev_rx_async_submit_split_err(struct virtio_net *dev,
> +	struct vhost_virtqueue *vq, uint16_t queue_id,
> +	uint16_t last_idx, uint16_t shadow_idx)
> +{
> +	while (vq->async_pkts_inflight_n) {
> +		int er = vq->async_ops.check_completed_copies(dev->vid,
> +			queue_id, 0, MAX_PKT_BURST);
> +
> +		if (er < 0) {
> +			vq->async_pkts_inflight_n = 0;
> +			break;
> +		}
> +
> +		vq->async_pkts_inflight_n -= er;
> +	}
> +
> +	vq->shadow_used_idx = shadow_idx;
> +	vq->last_avail_idx = last_idx;
> +}
> +
> +static __rte_noinline uint32_t
> +virtio_dev_rx_async_submit_split(struct virtio_net *dev,
> +	struct vhost_virtqueue *vq, uint16_t queue_id,
> +	struct rte_mbuf **pkts, uint32_t count)
> +{
> +	uint32_t pkt_idx = 0, pkt_burst_idx = 0;
> +	uint16_t num_buffers;
> +	struct buf_vector buf_vec[BUF_VECTOR_MAX];
> +	uint16_t avail_head, last_idx, shadow_idx;
> +
> +	struct iov_it *it_pool = vq->it_pool;
> +	struct iovec *vec_pool = vq->vec_pool;
> +	struct dma_trans_desc tdes[MAX_PKT_BURST];
> +	struct iovec *src_iovec = vec_pool;
> +	struct iovec *dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
> +	struct iov_it *src_it = it_pool;
> +	struct iov_it *dst_it = it_pool + 1;
> +	uint16_t n_free_slot, slot_idx;
> +	int n_pkts = 0;
> +
> +	avail_head = *((volatile uint16_t *)&vq->avail->idx);
> +	last_idx = vq->last_avail_idx;
> +	shadow_idx = vq->shadow_used_idx;
> +
> +	/*
> +	 * The ordering between avail index and
> +	 * desc reads needs to be enforced.
> +	 */
> +	rte_smp_rmb();
> +
> +	rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
> +
> +	for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
> +		uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
> +		uint16_t nr_vec = 0;
> +
> +		if (unlikely(reserve_avail_buf_split(dev, vq,
> +						pkt_len, buf_vec, &num_buffers,
> +						avail_head, &nr_vec) < 0)) {
> +			VHOST_LOG_DATA(DEBUG,
> +				"(%d) failed to get enough desc from vring\n",
> +				dev->vid);
> +			vq->shadow_used_idx -= num_buffers;
> +			break;
> +		}
> +
> +		VHOST_LOG_DATA(DEBUG, "(%d) current index %d | end index %d\n",
> +			dev->vid, vq->last_avail_idx,
> +			vq->last_avail_idx + num_buffers);
> +
> +		if (async_mbuf_to_desc(dev, vq, pkts[pkt_idx],
> +				buf_vec, nr_vec, num_buffers,
> +				src_iovec, dst_iovec, src_it, dst_it) < 0) {
> +			vq->shadow_used_idx -= num_buffers;
> +			break;
> +		}
> +
> +		slot_idx = (vq->async_pkts_idx + pkt_idx) & (vq->size - 1);
> +		if (src_it->count) {
> +			async_fill_des(&tdes[pkt_burst_idx], src_it, dst_it);
> +			pkt_burst_idx++;
> +			vq->async_pending_info[slot_idx] =
> +				num_buffers | (src_it->nr_segs << 16);
> +			src_iovec += src_it->nr_segs;
> +			dst_iovec += dst_it->nr_segs;
> +			src_it += 2;
> +			dst_it += 2;
> +		} else {
> +			vq->async_pending_info[slot_idx] = num_buffers;
> +			vq->async_pkts_inflight_n++;
> +		}
> +
> +		vq->last_avail_idx += num_buffers;
> +
> +		if (pkt_burst_idx >= VHOST_ASYNC_BATCH_THRESHOLD ||
> +				(pkt_idx == count - 1 && pkt_burst_idx)) {
> +			n_pkts = vq->async_ops.transfer_data(dev->vid,
> +					queue_id, tdes, 0, pkt_burst_idx);
> +			src_iovec = vec_pool;
> +			dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
> +			src_it = it_pool;
> +			dst_it = it_pool + 1;
> +
> +			if (unlikely(n_pkts < (int)pkt_burst_idx)) {
> +				vq->async_pkts_inflight_n +=
> +					n_pkts > 0 ? n_pkts : 0;
> +				virtio_dev_rx_async_submit_split_err(dev,
> +					vq, queue_id, last_idx, shadow_idx);
> +				return 0;
> +			}
> +
> +			pkt_burst_idx = 0;
> +			vq->async_pkts_inflight_n += n_pkts;
> +		}
> +	}
> +
> +	if (pkt_burst_idx) {
> +		n_pkts = vq->async_ops.transfer_data(dev->vid,
> +				queue_id, tdes, 0, pkt_burst_idx);
> +		if (unlikely(n_pkts <= (int)pkt_burst_idx)) {
> +			vq->async_pkts_inflight_n += n_pkts > 0 ? n_pkts : 0;
> +			virtio_dev_rx_async_submit_split_err(dev, vq, queue_id,
> +			last_idx, shadow_idx);
> +			return 0;
> +		}
> +
> +		vq->async_pkts_inflight_n += n_pkts;
> +	}
> +
> +	do_data_copy_enqueue(dev, vq);
> +
> +	n_free_slot = vq->size - vq->async_pkts_idx;
> +	if (n_free_slot > pkt_idx) {
> +		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
> +			pkts, pkt_idx * sizeof(uintptr_t));
> +		vq->async_pkts_idx += pkt_idx;
> +	} else {
> +		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
> +			pkts, n_free_slot * sizeof(uintptr_t));
> +		rte_memcpy(&vq->async_pkts_pending[0],
> +			&pkts[n_free_slot],
> +			(pkt_idx - n_free_slot) * sizeof(uintptr_t));
> +		vq->async_pkts_idx = pkt_idx - n_free_slot;
> +	}
> +
> +	if (likely(vq->shadow_used_idx))
> +		async_flush_shadow_used_ring_split(dev, vq);
> +
> +	return pkt_idx;
> +}
> +
> +uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
> +		struct rte_mbuf **pkts, uint16_t count)
> +{
> +	struct virtio_net *dev = get_device(vid);
> +	struct vhost_virtqueue *vq;
> +	uint16_t n_pkts_cpl, n_pkts_put = 0, n_descs = 0;
> +	uint16_t start_idx, pkts_idx, vq_size;
> +	uint64_t *async_pending_info;
> +
> +	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
> +	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
> +		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
> +			dev->vid, __func__, queue_id);
> +		return 0;
> +	}
> +
> +	vq = dev->virtqueue[queue_id];
> +
> +	rte_spinlock_lock(&vq->access_lock);
> +
> +	pkts_idx = vq->async_pkts_idx;
> +	async_pending_info = vq->async_pending_info;
> +	vq_size = vq->size;
> +	start_idx = pkts_idx > vq->async_pkts_inflight_n ?
> +		pkts_idx - vq->async_pkts_inflight_n :
> +		(vq_size - vq->async_pkts_inflight_n + pkts_idx) &
> +		(vq_size - 1);
> +
> +	n_pkts_cpl =
> +		vq->async_ops.check_completed_copies(vid, queue_id, 0, count);
> +
> +	rte_smp_wmb();
> +
> +	while (likely(((start_idx + n_pkts_put) & (vq_size - 1)) != pkts_idx)) {
> +		uint64_t info = async_pending_info[
> +			(start_idx + n_pkts_put) & (vq_size - 1)];
> +		uint64_t n_segs;
> +		n_pkts_put++;
> +		n_descs += info & ASYNC_PENDING_INFO_N_MSK;
> +		n_segs = info >> ASYNC_PENDING_INFO_N_SFT;
> +
> +		if (n_segs) {
> +			if (!n_pkts_cpl || n_pkts_cpl < n_segs) {
> +				n_pkts_put--;
> +				n_descs -= info & ASYNC_PENDING_INFO_N_MSK;
> +				if (n_pkts_cpl) {
> +					async_pending_info[
> +						(start_idx + n_pkts_put) &
> +						(vq_size - 1)] =
> +					((n_segs - n_pkts_cpl) <<
> +					 ASYNC_PENDING_INFO_N_SFT) |
> +					(info & ASYNC_PENDING_INFO_N_MSK);
> +					n_pkts_cpl = 0;
> +				}
> +				break;
> +			}
> +			n_pkts_cpl -= n_segs;
> +		}
> +	}
> +
> +	if (n_pkts_put) {
> +		vq->async_pkts_inflight_n -= n_pkts_put;
> +		*(volatile uint16_t *)&vq->used->idx += n_descs;
> +
> +		vhost_vring_call_split(dev, vq);
> +	}
> +
> +	if (start_idx + n_pkts_put <= vq_size) {
> +		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
> +			n_pkts_put * sizeof(uintptr_t));
> +	} else {
> +		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
> +			(vq_size - start_idx) * sizeof(uintptr_t));
> +		rte_memcpy(&pkts[vq_size - start_idx], vq->async_pkts_pending,
> +			(n_pkts_put - vq_size + start_idx) * sizeof(uintptr_t));
> +	}
> +
> +	rte_spinlock_unlock(&vq->access_lock);
> +
> +	return n_pkts_put;
> +}
> +
> +static __rte_always_inline uint32_t
> +virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id,
> +	struct rte_mbuf **pkts, uint32_t count)
> +{
> +	struct vhost_virtqueue *vq;
> +	uint32_t nb_tx = 0;
> +	bool drawback = false;
> +
> +	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
> +	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
> +		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
> +			dev->vid, __func__, queue_id);
> +		return 0;
> +	}
> +
> +	vq = dev->virtqueue[queue_id];
> +
> +	rte_spinlock_lock(&vq->access_lock);
> +
> +	if (unlikely(vq->enabled == 0))
> +		goto out_access_unlock;
> +
> +	if (unlikely(!vq->async_registered)) {
> +		drawback = true;
Maybe name this "fallback" instead of "drawback"?
> +		goto out_access_unlock;
> +	}
> +
> +	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
> +		vhost_user_iotlb_rd_lock(vq);
> +
> +	if (unlikely(vq->access_ok == 0))
> +		if (unlikely(vring_translate(dev, vq) < 0))
> +			goto out;
> +
> +	count = RTE_MIN((uint32_t)MAX_PKT_BURST, count);
> +	if (count == 0)
> +		goto out;
> +
> +	/* TODO: packed queue not implemented */
> +	if (vq_is_packed(dev))
> +		nb_tx = 0;
> +	else
> +		nb_tx = virtio_dev_rx_async_submit_split(dev,
> +				vq, queue_id, pkts, count);
> +
> +out:
> +	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
> +		vhost_user_iotlb_rd_unlock(vq);
> +
> +out_access_unlock:
> +	rte_spinlock_unlock(&vq->access_lock);
> +
> +	if (drawback)
> +		return rte_vhost_enqueue_burst(dev->vid, queue_id, pkts, count);
> +
> +	return nb_tx;
> +}
> +
> +uint16_t
> +rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
> +		struct rte_mbuf **pkts, uint16_t count)
> +{
> +	struct virtio_net *dev = get_device(vid);
> +
> +	if (!dev)
> +		return 0;
> +
> +	if (unlikely(!(dev->flags & VIRTIO_DEV_BUILTIN_VIRTIO_NET))) {
> +		VHOST_LOG_DATA(ERR,
> +			"(%d) %s: built-in vhost net backend is disabled.\n",
> +			dev->vid, __func__);
> +		return 0;
> +	}
> +
> +	return virtio_dev_rx_async_submit(dev, queue_id, pkts, count);
> +}
> +
>  static inline bool
>  virtio_net_with_host_offload(struct virtio_net *dev)
>  {
> 
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 0/2] introduce asynchronous data path for vhost
  2020-06-11 10:02 [dpdk-dev] [PATCH v1 0/2] introduce asynchronous data path for vhost patrick.fu
  2020-06-11 10:02 ` [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path registration API patrick.fu
  2020-06-11 10:02 ` [dpdk-dev] [PATCH v1 2/2] vhost: introduce async enqueue for split ring patrick.fu
@ 2020-06-26 14:42 ` Maxime Coquelin
  2020-07-03 10:27 ` [dpdk-dev] [PATCH v3 " patrick.fu
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 36+ messages in thread
From: Maxime Coquelin @ 2020-06-26 14:42 UTC (permalink / raw)
  To: patrick.fu, dev, chenbo.xia, zhihong.wang, xiaolong.ye
  Cc: cheng1.jiang, cunming.liang
Hi Patrick,
On 6/11/20 12:02 PM, patrick.fu@intel.com wrote:
> From: Patrick Fu <patrick.fu@intel.com>
> 
> Performing large memory copies usually takes up a major part of CPU
> cycles and becomes the hot spot in vhost-user enqueue operation. To
> offload expensive memory operations from the CPU, this patch set
> proposes to leverage DMA engines, e.g., I/OAT, a DMA engine in the
> Intel's processor, to accelerate large copies.
> 
> Large copies are offloaded from the CPU to the DMA in an asynchronous
> manner. The CPU just submits copy jobs to the DMA but without waiting
> for its copy completion. Thus, there is no CPU intervention during
> data transfer; we can save precious CPU cycles and improve the overall
> throughput for vhost-user based applications, like OVS. During packet
> transmission, it offloads large copies to the DMA and performs small
> copies by the CPU, due to startup overheads associated with the DMA.
> 
> This patch set construct a general framework that applications can
> leverage to attach DMA channels with vhost-user transmit queues. Four
> new RTE APIs are introduced to vhost library for applications to
> register and use the asynchronous data path. In addition, two new DMA
> operation callbacks are defined, by which vhost-user asynchronous data
> path can interact with DMA hardware. Currently only enqueue operation
> for split queue is implemented, but the frame is flexible to extend
> support for dequeue & packed queue.
Thanks for this big rework of the Vhost DMA series.
Overall it looks good to me, and it is consistent with the design you
suggested a few months back.
I don't see a big risk in integrating it in v20.08 once the few comments
are taken into account.
I'll try to make another pass before you send a v2.
Maxime
> Patrick Fu (2):
>   vhost: introduce async data path registration API
>   vhost: introduce async enqueue for split ring
> 
>  lib/librte_vhost/Makefile          |   3 +-
>  lib/librte_vhost/rte_vhost.h       |   1 +
>  lib/librte_vhost/rte_vhost_async.h | 172 ++++++++++++
>  lib/librte_vhost/socket.c          |  20 ++
>  lib/librte_vhost/vhost.c           |  74 ++++-
>  lib/librte_vhost/vhost.h           |  30 ++-
>  lib/librte_vhost/vhost_user.c      |  28 +-
>  lib/librte_vhost/virtio_net.c      | 538 ++++++++++++++++++++++++++++++++++++-
>  8 files changed, 857 insertions(+), 9 deletions(-)
>  create mode 100644 lib/librte_vhost/rte_vhost_async.h
> 
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path registration API
  2020-06-11 10:02 ` [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path registration API patrick.fu
  2020-06-18  5:50   ` Liu, Yong
  2020-06-26 14:28   ` Maxime Coquelin
@ 2020-06-26 14:44   ` Maxime Coquelin
  2 siblings, 0 replies; 36+ messages in thread
From: Maxime Coquelin @ 2020-06-26 14:44 UTC (permalink / raw)
  To: patrick.fu, dev, chenbo.xia, zhihong.wang, xiaolong.ye
  Cc: cheng1.jiang, cunming.liang
On 6/11/20 12:02 PM, patrick.fu@intel.com wrote:
> From: Patrick <patrick.fu@intel.com>
> 
> This patch introduces registration/un-registration APIs
> for async data path together with all required data
> structures and DMA callback function proto-types.
> 
> Signed-off-by: Patrick <patrick.fu@intel.com>
> ---
>  lib/librte_vhost/Makefile          |   3 +-
>  lib/librte_vhost/rte_vhost.h       |   1 +
>  lib/librte_vhost/rte_vhost_async.h | 134 +++++++++++++++++++++++++++++++++++++
>  lib/librte_vhost/socket.c          |  20 ++++++
>  lib/librte_vhost/vhost.c           |  74 +++++++++++++++++++-
>  lib/librte_vhost/vhost.h           |  30 ++++++++-
>  lib/librte_vhost/vhost_user.c      |  28 ++++++--
>  7 files changed, 283 insertions(+), 7 deletions(-)
>  create mode 100644 lib/librte_vhost/rte_vhost_async.h
> 
> +/**
> + * register a dma channel for vhost
> + *
> + * @param vid
> + *  vhost device id DMA channel to be attached to
> + * @param queue_id
> + *  vhost queue id DMA channel to be attached to
> + * @param features
> + *  DMA channel feature bit
> + *    b0       : DMA supports inorder data transfer
> + *    b1  - b15: reserved
> + *    b16 - b27: Packet length threshold for DMA transfer
> + *    b28 - b31: reserved
> + * @param ops
> + *  DMA operation callbacks
> + * @return
> + *  0 on success, -1 on failures
> + */
> +int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
> +	uint32_t features, struct rte_vhost_async_channel_ops *ops);
> +
> +/**
> + * unregister a dma channel for vhost
> + *
> + * @param vid
> + *  vhost device id DMA channel to be detached
> + * @param queue_id
> + *  vhost queue id DMA channel to be detached
> + * @return
> + *  0 on success, -1 on failures
> + */
> +int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
These new APIs need to be tagged as experimental. We'll need a few
releases before considering them stable.
You need to add them to rte_vhost_version.map too.
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 2/2] vhost: introduce async enqueue for split ring
  2020-06-11 10:02 ` [dpdk-dev] [PATCH v1 2/2] vhost: introduce async enqueue for split ring patrick.fu
  2020-06-18  6:56   ` Liu, Yong
  2020-06-26 14:39   ` Maxime Coquelin
@ 2020-06-26 14:46   ` Maxime Coquelin
  2020-06-29  1:25     ` Fu, Patrick
  2 siblings, 1 reply; 36+ messages in thread
From: Maxime Coquelin @ 2020-06-26 14:46 UTC (permalink / raw)
  To: patrick.fu, dev, chenbo.xia, zhihong.wang, xiaolong.ye
  Cc: cheng1.jiang, cunming.liang
On 6/11/20 12:02 PM, patrick.fu@intel.com wrote:
> From: Patrick <patrick.fu@intel.com>
> 
> This patch implement async enqueue data path for split ring.
> 
> Signed-off-by: Patrick <patrick.fu@intel.com>
> ---
>  lib/librte_vhost/rte_vhost_async.h |  38 +++
>  lib/librte_vhost/virtio_net.c      | 538 ++++++++++++++++++++++++++++++++++++-
>  2 files changed, 574 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
> index 82f2ebe..efcba0a 100644
> --- a/lib/librte_vhost/rte_vhost_async.h
> +++ b/lib/librte_vhost/rte_vhost_async.h
> @@ -131,4 +131,42 @@ int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
>   */
>  int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
>  
> +/**
> + * This function submits enqueue data to DMA. This function gives no
> + * guarantee of transfer completion upon return. Applications should
> + * poll transfer status by rte_vhost_poll_enqueue_completed()
> + *
> + * @param vid
> + *  id of vhost device to enqueue data
> + * @param queue_id
> + *  queue id to enqueue data
> + * @param pkts
> + *  array of packets to be enqueued
> + * @param count
> + *  packets num to be enqueued
> + * @return
> + *  num of packets enqueued
> + */
> +uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
> +		struct rte_mbuf **pkts, uint16_t count);
> +
> +/**
> + * This function check DMA completion status for a specific vhost
> + * device queue. Packets which finish copying (enqueue) operation
> + * will be returned in an array.
> + *
> + * @param vid
> + *  id of vhost device to enqueue data
> + * @param queue_id
> + *  queue id to enqueue data
> + * @param pkts
> + *  blank array to get return packet pointer
> + * @param count
> + *  size of the packet array
> + * @return
> + *  num of packets returned
> + */
> +uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
> +		struct rte_mbuf **pkts, uint16_t count);
> +
These new APIs need to be tagged as experimental. We'll need a few
releases before considering them stable.
You need to add them to rte_vhost_version.map too.
>  #endif /* _RTE_VDPA_H_ */
You need to fix the comment here (/* _RTE_VHOST_ASYNC_H_ */)
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path registration API
  2020-06-26 14:28   ` Maxime Coquelin
@ 2020-06-29  1:15     ` Fu, Patrick
  0 siblings, 0 replies; 36+ messages in thread
From: Fu, Patrick @ 2020-06-29  1:15 UTC (permalink / raw)
  To: Maxime Coquelin, dev, Xia, Chenbo, Wang, Zhihong
  Cc: Jiang, Cheng1, Liang, Cunming
Hi Maxime,
> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Friday, June 26, 2020 10:29 PM
> To: Fu, Patrick <patrick.fu@intel.com>; dev@dpdk.org; Xia, Chenbo
> <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; Ye,
> Xiaolong <xiaolong.ye@intel.com>
> Cc: Jiang, Cheng1 <cheng1.jiang@intel.com>; Liang, Cunming
> <cunming.liang@intel.com>
> Subject: Re: [PATCH v1 1/2] vhost: introduce async data path registration API
> 
> 
> 
> On 6/11/20 12:02 PM, patrick.fu@intel.com wrote:
> > From: Patrick <patrick.fu@intel.com>
> >
> > diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
> > index 0a66ef9..f817783 100644
> > --- a/lib/librte_vhost/socket.c
> > +++ b/lib/librte_vhost/socket.c
> > @@ -42,6 +42,7 @@ struct vhost_user_socket {
> >  	bool use_builtin_virtio_net;
> >  	bool extbuf;
> >  	bool linearbuf;
> > +	bool async_copy;
> >
> >  	/*
> >  	 * The "supported_features" indicates the feature bits the
> > @@ -210,6 +211,7 @@ struct vhost_user {
> >  	size_t size;
> >  	struct vhost_user_connection *conn;
> >  	int ret;
> > +	struct virtio_net *dev;
> >
> >  	if (vsocket == NULL)
> >  		return;
> > @@ -241,6 +243,13 @@ struct vhost_user {
> >  	if (vsocket->linearbuf)
> >  		vhost_enable_linearbuf(vid);
> >
> > +	if (vsocket->async_copy) {
> > +		dev = get_device(vid);
> > +
> > +		if (dev)
> > +			dev->async_copy = 1;
> > +	}
> > +
> >  	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
> >
> >  	if (vsocket->notify_ops->new_connection) {
> > @@ -891,6 +900,17 @@ struct vhost_user_reconnect_list {
> >  		goto out_mutex;
> >  	}
> >
> > +	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
> > +
> > +	if (vsocket->async_copy &&
> > +		(flags & (RTE_VHOST_USER_IOMMU_SUPPORT |
> > +		RTE_VHOST_USER_POSTCOPY_SUPPORT))) {
> > +		VHOST_LOG_CONFIG(ERR, "error: enabling async copy and IOMMU "
> > +			"or post-copy feature simultaneously is not "
> > +			"supported\n");
> > +		goto out_mutex;
> > +	}
> > +
> 
> Have you ensure compatibility with the linearbuf feature (TSO)?
> You will want to do same kind of check if not compatible.
> 
I think this concern comes primarily from the dequeue side. As the current patch covers enqueue only,
the linearbuf check doesn't seem to be necessary. For a future dequeue implementation, we may
need to add an additional feature bit to let the application decide whether the async callback is
compatible with linearbuf. A real hardware DMA engine should usually support linearbuf;
a pure software backend (something like dequeue-zero-copy) may not.
> >  	/*
> >  	 * Set the supported features correctly for the builtin vhost-user
> >  	 * net driver.
> > diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> > index 0266318..e6b688a 100644
> > --- a/lib/librte_vhost/vhost.c
> > +++ b/lib/librte_vhost/vhost.c
> > @@ -332,8 +332,13 @@
> >  {
> >  	if (vq_is_packed(dev))
> >  		rte_free(vq->shadow_used_packed);
> > -	else
> > +	else {
> >  		rte_free(vq->shadow_used_split);
> > +		if (vq->async_pkts_pending)
> > +			rte_free(vq->async_pkts_pending);
> > +		if (vq->async_pending_info)
> > +			rte_free(vq->async_pending_info);
> > +	}
> >  	rte_free(vq->batch_copy_elems);
> >  	rte_mempool_free(vq->iotlb_pool);
> >  	rte_free(vq);
> > @@ -1527,3 +1532,70 @@ int rte_vhost_extern_callback_register(int vid,
> >  	if (vhost_data_log_level >= 0)
> >  		rte_log_set_level(vhost_data_log_level, RTE_LOG_WARNING);
> >  }
> > +
> > +int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
> > +					uint32_t features,
> > +					struct rte_vhost_async_channel_ops *ops)
> > +{
> > +	struct vhost_virtqueue *vq;
> > +	struct virtio_net *dev = get_device(vid);
> > +	struct dma_channel_features f;
> > +
> > +	if (dev == NULL || ops == NULL)
> > +		return -1;
> > +
> > +	f.intval = features;
> > +
> > +	vq = dev->virtqueue[queue_id];
> > +
> > +	if (vq == NULL)
> > +		return -1;
> > +
> > +	/** packed queue is not supported */
> > +	if (vq_is_packed(dev) || !f.inorder)
> > +		return -1;
> 
> You might want to print an error message to help the user understanding
> why it fails.
> 
Will update in v2 patch
> > +
> > +	if (ops->check_completed_copies == NULL ||
> > +		ops->transfer_data == NULL)
> > +		return -1;
> > +
> > +	rte_spinlock_lock(&vq->access_lock);
> > +
> > +	vq->async_ops.check_completed_copies = ops->check_completed_copies;
> > +	vq->async_ops.transfer_data = ops->transfer_data;
> > +
> > +	vq->async_inorder = f.inorder;
> > +	vq->async_threshold = f.threshold;
> > +
> > +	vq->async_registered = true;
> > +
> > +	rte_spinlock_unlock(&vq->access_lock);
> > +
> > +	return 0;
> > +}
> > +
> > +int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id)
> > +{
> > +	struct vhost_virtqueue *vq;
> > +	struct virtio_net *dev = get_device(vid);
> > +
> > +	if (dev == NULL)
> > +		return -1;
> > +
> > +	vq = dev->virtqueue[queue_id];
> > +
> > +	if (vq == NULL)
> > +		return -1;
> > +
> > +	rte_spinlock_lock(&vq->access_lock);
> 
> We might want to wait until all async transfers are done before unregistering?
> 
This could be a little bit tricky. In our async enqueue API design, applications send mbufs to the DMA engine
in one API call and get the finished mbufs back from another API call. If we want to drain the DMA buffer in the
unregistration API, we need to either return the mbufs from unregistration, or save them and return them somewhere else.
I'm wondering whether it's better to just report an error from unregistration when in-flight packets exist, and rely on
the application to poll out all in-flight packets.
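For illustration, a minimal sketch of the application-side behaviour proposed
here: poll out the in-flight packets first, then unregister. It assumes that
unregistration fails while packets are still in flight. The mock_* types and
functions are hypothetical stand-ins for the vhost library calls, written only
so that the example is self-contained; they are not the real API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DRAIN_BURST 32

/* Hypothetical stand-in for a vhost queue with async copies in flight. */
struct mock_queue {
	uint32_t inflight;
};

/*
 * Stand-in for rte_vhost_poll_enqueue_completed(): returns up to
 * 'count' completed packets in 'pkts' and reports how many.
 */
static uint16_t
mock_poll_completed(struct mock_queue *q, void **pkts, uint16_t count)
{
	uint16_t n = q->inflight < count ? (uint16_t)q->inflight : count;
	uint16_t i;

	for (i = 0; i < n; i++)
		pkts[i] = NULL; /* a real backend would return mbuf pointers */
	q->inflight -= n;
	return n;
}

/* Stand-in for unregistration: refuses while packets are in flight. */
static int
mock_unregister(struct mock_queue *q)
{
	return q->inflight ? -1 : 0;
}

/*
 * Application side: 'outstanding' is the application's own count of
 * packets submitted but not yet polled back. Poll (bounded by
 * 'max_polls') until it reaches zero, then try to unregister.
 */
static int
drain_and_unregister(struct mock_queue *q, uint32_t outstanding,
		unsigned int max_polls)
{
	void *done[DRAIN_BURST];
	unsigned int polls;

	for (polls = 0; outstanding > 0 && polls < max_polls; polls++) {
		uint16_t n = mock_poll_completed(q, done, DRAIN_BURST);

		assert(n <= outstanding);
		/* a real application would free or recycle done[0..n-1] */
		outstanding -= n;
	}

	return mock_unregister(q);
}
```

The poll loop is bounded so that a stalled DMA channel cannot hang the caller;
in that case the sketch simply propagates the unregistration error.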
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 2/2] vhost: introduce async enqueue for split ring
  2020-06-26 14:46   ` Maxime Coquelin
@ 2020-06-29  1:25     ` Fu, Patrick
  0 siblings, 0 replies; 36+ messages in thread
From: Fu, Patrick @ 2020-06-29  1:25 UTC (permalink / raw)
  To: Maxime Coquelin, dev, Xia, Chenbo, Wang, Zhihong, Ye, Xiaolong
  Cc: Jiang, Cheng1, Liang, Cunming
Hi Maxime,
> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Friday, June 26, 2020 10:46 PM
> To: Fu, Patrick <patrick.fu@intel.com>; dev@dpdk.org; Xia, Chenbo
> <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; Ye,
> Xiaolong <xiaolong.ye@intel.com>
> Cc: Jiang, Cheng1 <cheng1.jiang@intel.com>; Liang, Cunming
> <cunming.liang@intel.com>
> Subject: Re: [PATCH v1 2/2] vhost: introduce async enqueue for split ring
> 
> 
> 
> On 6/11/20 12:02 PM, patrick.fu@intel.com wrote:
> > From: Patrick <patrick.fu@intel.com>
> >
> > This patch implement async enqueue data path for split ring.
> >
> > Signed-off-by: Patrick <patrick.fu@intel.com>
> > ---
> >  lib/librte_vhost/rte_vhost_async.h |  38 +++
> >  lib/librte_vhost/virtio_net.c      | 538 ++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 574 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
> > index 82f2ebe..efcba0a 100644
> > --- a/lib/librte_vhost/rte_vhost_async.h
> > +++ b/lib/librte_vhost/rte_vhost_async.h
> > @@ -131,4 +131,42 @@ int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
> >   */
> >  int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
> >
> > +/**
> > + * This function submits enqueue data to DMA. This function gives no
> > + * guarantee of transfer completion upon return. Applications should
> > + * poll transfer status by rte_vhost_poll_enqueue_completed()
> > + *
> > + * @param vid
> > + *  id of vhost device to enqueue data
> > + * @param queue_id
> > + *  queue id to enqueue data
> > + * @param pkts
> > + *  array of packets to be enqueued
> > + * @param count
> > + *  packets num to be enqueued
> > + * @return
> > + *  num of packets enqueued
> > + */
> > +uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
> > +		struct rte_mbuf **pkts, uint16_t count);
> > +
> > +/**
> > + * This function check DMA completion status for a specific vhost
> > + * device queue. Packets which finish copying (enqueue) operation
> > + * will be returned in an array.
> > + *
> > + * @param vid
> > + *  id of vhost device to enqueue data
> > + * @param queue_id
> > + *  queue id to enqueue data
> > + * @param pkts
> > + *  blank array to get return packet pointer
> > + * @param count
> > + *  size of the packet array
> > + * @return
> > + *  num of packets returned
> > + */
> > +uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
> > +		struct rte_mbuf **pkts, uint16_t count);
> > +
> 
> These new APIs need to be tagged as experimental. We'll need a few releases
> before considering them stable.
> 
> You need to add them to rte_vhost_version.map too.
> 
> >  #endif /* _RTE_VDPA_H_ */
> You need to fix the comment here (/* _RTE_VHOST_ASYNC_H_ */)
I will update in the v2 version
Thanks,
Patrick
^ permalink raw reply	[flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v3 0/2] introduce asynchronous data path for vhost
  2020-06-11 10:02 [dpdk-dev] [PATCH v1 0/2] introduce asynchronous data path for vhost patrick.fu
                   ` (2 preceding siblings ...)
  2020-06-26 14:42 ` [dpdk-dev] [PATCH v1 0/2] introduce asynchronous data path for vhost Maxime Coquelin
@ 2020-07-03 10:27 ` patrick.fu
  2020-07-03 10:27   ` [dpdk-dev] [PATCH v3 1/2] vhost: introduce async enqueue registration API patrick.fu
  2020-07-03 10:27   ` [dpdk-dev] [PATCH v3 2/2] vhost: introduce async enqueue for split ring patrick.fu
  2020-07-03 12:21 ` [dpdk-dev] [PATCH v4 0/2] introduce asynchronous data path for vhost patrick.fu
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 36+ messages in thread
From: patrick.fu @ 2020-07-03 10:27 UTC (permalink / raw)
  To: dev, maxime.coquelin, chenbo.xia, zhihong.wang
  Cc: patrick.fu, yinan.wang, cheng1.jiang, cunming.liang
From: Patrick Fu <patrick.fu@intel.com>
Performing large memory copies usually takes up a major part of CPU
cycles and becomes the hot spot in vhost-user enqueue operation. To
offload expensive memory operations from the CPU, this patch set
proposes to leverage DMA engines, e.g., I/OAT, a DMA engine in Intel
processors, to accelerate large copies.
Large copies are offloaded from the CPU to the DMA in an asynchronous
manner. The CPU just submits copy jobs to the DMA without waiting for
copy completion. Thus, there is no CPU intervention during
data transfer; we can save precious CPU cycles and improve the overall
throughput for vhost-user based applications, like OVS. During packet
transmission, the data path offloads large copies to the DMA and performs
small copies by the CPU, due to startup overheads associated with the DMA.
This patch set constructs a general framework that applications can
leverage to attach DMA channels to vhost-user transmit queues. Four
new RTE APIs are introduced to the vhost library for applications to
register and use the asynchronous data path. In addition, two new DMA
operation callbacks are defined, by which the vhost-user asynchronous data
path can interact with DMA hardware. Currently only the enqueue operation
for split queue is implemented, but the framework is flexible enough to
be extended to support packed queue.
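As an illustration of the large-copy/small-copy split described above, here is
a minimal sketch (not part of the patch) of how an application might build the
32-bit features word passed to rte_vhost_async_channel_register(). The bit
positions follow the API comment in patch 1/2 (b0: in-order, b16 - b27: packet
length threshold). The macro names, the helper names, and the "offload when
length >= threshold" policy are assumptions made for this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bit positions follow the API comment. */
#define ASYNC_F_INORDER		(UINT32_C(1) << 0)
#define ASYNC_F_THRESHOLD_SHIFT	16
#define ASYNC_F_THRESHOLD_MASK	UINT32_C(0xFFF)	/* b16 - b27: 12 bits */

/* Build the features word from an in-order flag and a length threshold. */
static uint32_t
async_features_pack(bool inorder, uint32_t threshold)
{
	uint32_t f = inorder ? ASYNC_F_INORDER : 0;

	assert(threshold <= ASYNC_F_THRESHOLD_MASK);
	f |= (threshold & ASYNC_F_THRESHOLD_MASK) << ASYNC_F_THRESHOLD_SHIFT;
	return f;
}

/* Extract the packet length threshold from a features word. */
static uint32_t
async_features_threshold(uint32_t features)
{
	return (features >> ASYNC_F_THRESHOLD_SHIFT) & ASYNC_F_THRESHOLD_MASK;
}

/* Report whether the in-order bit is set. */
static bool
async_features_inorder(uint32_t features)
{
	return (features & ASYNC_F_INORDER) != 0;
}

/*
 * Assumed policy for this sketch: copies at or above the threshold go
 * to the DMA engine, shorter ones stay on the CPU.
 */
static bool
copy_goes_to_dma(uint32_t features, uint32_t pkt_len)
{
	return pkt_len >= async_features_threshold(features);
}
```

Explicit shifts and masks are used because C bit-field layout is
implementation-defined. With these helpers, async_features_pack(true, 256)
yields 0x01000001.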
v2:
update meson file for new header file
update rte_vhost_version.map to include new APIs
rename async APIs/structures to be prefixed with "rte_vhost"
rename some variables/structures for readability
correct minor typo in comments/license statements
refine memory allocation logic for vq internal buffer
add error message printing in some failure cases
check inflight async packets in unregistration API call
mark new APIs as experimental
v3:
use atomic_xxx() functions in updating ring index
fix a bug in async enqueue failure handling
Patrick Fu (2):
  vhost: introduce async enqueue registration API
  vhost: introduce async enqueue for split ring
 lib/librte_vhost/Makefile              |   2 +-
 lib/librte_vhost/meson.build           |   2 +-
 lib/librte_vhost/rte_vhost.h           |   1 +
 lib/librte_vhost/rte_vhost_async.h     | 176 +++++++++++
 lib/librte_vhost/rte_vhost_version.map |   4 +
 lib/librte_vhost/socket.c              |  20 ++
 lib/librte_vhost/vhost.c               | 127 +++++++-
 lib/librte_vhost/vhost.h               |  30 +-
 lib/librte_vhost/vhost_user.c          |  23 +-
 lib/librte_vhost/virtio_net.c          | 539 ++++++++++++++++++++++++++++++++-
 10 files changed, 915 insertions(+), 9 deletions(-)
 create mode 100644 lib/librte_vhost/rte_vhost_async.h
-- 
1.8.3.1
^ permalink raw reply	[flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v3 1/2] vhost: introduce async enqueue registration API
  2020-07-03 10:27 ` [dpdk-dev] [PATCH v3 " patrick.fu
@ 2020-07-03 10:27   ` patrick.fu
  2020-07-03 10:27   ` [dpdk-dev] [PATCH v3 2/2] vhost: introduce async enqueue for split ring patrick.fu
  1 sibling, 0 replies; 36+ messages in thread
From: patrick.fu @ 2020-07-03 10:27 UTC (permalink / raw)
  To: dev, maxime.coquelin, chenbo.xia, zhihong.wang
  Cc: patrick.fu, yinan.wang, cheng1.jiang, cunming.liang
From: Patrick Fu <patrick.fu@intel.com>
This patch introduces registration/un-registration APIs
for the vhost async data enqueue operation. Together with
the registration API implementations, the data structures
and async callback functions required for the async enqueue
data path are also defined.
Signed-off-by: Patrick Fu <patrick.fu@intel.com>
---
 lib/librte_vhost/Makefile              |   2 +-
 lib/librte_vhost/meson.build           |   2 +-
 lib/librte_vhost/rte_vhost.h           |   1 +
 lib/librte_vhost/rte_vhost_async.h     | 136 +++++++++++++++++++++++++++++++++
 lib/librte_vhost/rte_vhost_version.map |   4 +
 lib/librte_vhost/socket.c              |  20 +++++
 lib/librte_vhost/vhost.c               | 127 +++++++++++++++++++++++++++++-
 lib/librte_vhost/vhost.h               |  30 +++++++-
 lib/librte_vhost/vhost_user.c          |  23 +++++-
 9 files changed, 338 insertions(+), 7 deletions(-)
 create mode 100644 lib/librte_vhost/rte_vhost_async.h
diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
index b7ff7dc..4f2f3e4 100644
--- a/lib/librte_vhost/Makefile
+++ b/lib/librte_vhost/Makefile
@@ -42,7 +42,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h \
-						rte_vdpa_dev.h
+						rte_vdpa_dev.h rte_vhost_async.h
 
 # only compile vhost crypto when cryptodev is enabled
 ifeq ($(CONFIG_RTE_LIBRTE_CRYPTODEV),y)
diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
index 882a0ea..cc9aa65 100644
--- a/lib/librte_vhost/meson.build
+++ b/lib/librte_vhost/meson.build
@@ -22,5 +22,5 @@ sources = files('fd_man.c', 'iotlb.c', 'socket.c', 'vdpa.c',
 		'vhost.c', 'vhost_user.c',
 		'virtio_net.c', 'vhost_crypto.c')
 headers = files('rte_vhost.h', 'rte_vdpa.h', 'rte_vdpa_dev.h',
-		'rte_vhost_crypto.h')
+		'rte_vhost_crypto.h', 'rte_vhost_async.h')
 deps += ['ethdev', 'cryptodev', 'hash', 'pci']
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index 8a5c332..f93f959 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -35,6 +35,7 @@
 #define RTE_VHOST_USER_EXTBUF_SUPPORT	(1ULL << 5)
 /* support only linear buffers (no chained mbufs) */
 #define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
+#define RTE_VHOST_USER_ASYNC_COPY	(1ULL << 7)
 
 /* Features. */
 #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
new file mode 100644
index 0000000..d5a5927
--- /dev/null
+++ b/lib/librte_vhost/rte_vhost_async.h
@@ -0,0 +1,136 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_VHOST_ASYNC_H_
+#define _RTE_VHOST_ASYNC_H_
+
+#include "rte_vhost.h"
+
+/**
+ * iovec iterator
+ */
+struct rte_vhost_iov_iter {
+	/** offset to the first byte of interesting data */
+	size_t offset;
+	/** total bytes of data in this iterator */
+	size_t count;
+	/** pointer to the iovec array */
+	struct iovec *iov;
+	/** number of iovec in this iterator */
+	unsigned long nr_segs;
+};
+
+/**
+ * dma transfer descriptor pair
+ */
+struct rte_vhost_async_desc {
+	/** source memory iov_iter */
+	struct rte_vhost_iov_iter *src;
+	/** destination memory iov_iter */
+	struct rte_vhost_iov_iter *dst;
+};
+
+/**
+ * dma transfer status
+ */
+struct rte_vhost_async_status {
+	/** An array of application specific data for source memory */
+	uintptr_t *src_opaque_data;
+	/** An array of application specific data for destination memory */
+	uintptr_t *dst_opaque_data;
+};
+
+/**
+ * dma operation callbacks to be implemented by applications
+ */
+struct rte_vhost_async_channel_ops {
+	/**
+	 * instruct async engines to perform copies for a batch of packets
+	 *
+	 * @param vid
+	 *  id of vhost device to perform data copies
+	 * @param queue_id
+	 *  queue id to perform data copies
+	 * @param descs
+	 *  an array of DMA transfer memory descriptors
+	 * @param opaque_data
+	 *  opaque data pair sending to DMA engine
+	 * @param count
+	 *  number of elements in the "descs" array
+	 * @return
+	 *  -1 on failure, number of descs processed on success
+	 */
+	int (*transfer_data)(int vid, uint16_t queue_id,
+		struct rte_vhost_async_desc *descs,
+		struct rte_vhost_async_status *opaque_data,
+		uint16_t count);
+	/**
+	 * check copy-completed packets from the async engine
+	 * @param vid
+	 *  id of vhost device to check copy completion
+	 * @param queue_id
+	 *  queue id to check copy completion
+	 * @param opaque_data
+	 *  buffer to receive the opaque data pair from DMA engine
+	 * @param max_packets
+	 *  max number of packets could be completed
+	 * @return
+	 *  -1 on failure, number of iov segments completed on success
+	 */
+	int (*check_completed_copies)(int vid, uint16_t queue_id,
+		struct rte_vhost_async_status *opaque_data,
+		uint16_t max_packets);
+};
+
+/**
+ *  dma channel feature bit definition
+ */
+struct rte_vhost_async_features {
+	union {
+		uint32_t intval;
+		struct {
+			uint32_t async_inorder:1;
+			uint32_t resvd_0:15;
+			uint32_t async_threshold:12;
+			uint32_t resvd_1:4;
+		};
+	};
+};
+
+/**
+ * register an async channel for vhost
+ *
+ * @param vid
+ *  vhost device id async channel to be attached to
+ * @param queue_id
+ *  vhost queue id async channel to be attached to
+ * @param features
+ *  DMA channel feature bit
+ *    b0       : DMA supports inorder data transfer
+ *    b1  - b15: reserved
+ *    b16 - b27: Packet length threshold for DMA transfer
+ *    b28 - b31: reserved
+ * @param ops
+ *  DMA operation callbacks
+ * @return
+ *  0 on success, -1 on failures
+ */
+__rte_experimental
+int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
+	uint32_t features, struct rte_vhost_async_channel_ops *ops);
+
+/**
+ * unregister a dma channel for vhost
+ *
+ * @param vid
+ *  vhost device id DMA channel to be detached
+ * @param queue_id
+ *  vhost queue id DMA channel to be detached
+ * @return
+ *  0 on success, -1 on failures
+ */
+__rte_experimental
+int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
+
+#endif /* _RTE_VHOST_ASYNC_H_ */
diff --git a/lib/librte_vhost/rte_vhost_version.map b/lib/librte_vhost/rte_vhost_version.map
index 8678440..13ec53b 100644
--- a/lib/librte_vhost/rte_vhost_version.map
+++ b/lib/librte_vhost/rte_vhost_version.map
@@ -71,4 +71,8 @@ EXPERIMENTAL {
 	rte_vdpa_get_queue_num;
 	rte_vdpa_get_features;
 	rte_vdpa_get_protocol_features;
+	rte_vhost_async_channel_register;
+	rte_vhost_async_channel_unregister;
+	rte_vhost_submit_enqueue_burst;
+	rte_vhost_poll_enqueue_completed;
 };
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 49267ce..698b44e 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -42,6 +42,7 @@ struct vhost_user_socket {
 	bool use_builtin_virtio_net;
 	bool extbuf;
 	bool linearbuf;
+	bool async_copy;
 
 	/*
 	 * The "supported_features" indicates the feature bits the
@@ -205,6 +206,7 @@ struct vhost_user {
 	size_t size;
 	struct vhost_user_connection *conn;
 	int ret;
+	struct virtio_net *dev;
 
 	if (vsocket == NULL)
 		return;
@@ -236,6 +238,13 @@ struct vhost_user {
 	if (vsocket->linearbuf)
 		vhost_enable_linearbuf(vid);
 
+	if (vsocket->async_copy) {
+		dev = get_device(vid);
+
+		if (dev)
+			dev->async_copy = 1;
+	}
+
 	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
 
 	if (vsocket->notify_ops->new_connection) {
@@ -881,6 +890,17 @@ struct rte_vdpa_device *
 		goto out_mutex;
 	}
 
+	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
+
+	if (vsocket->async_copy &&
+		(flags & (RTE_VHOST_USER_IOMMU_SUPPORT |
+		RTE_VHOST_USER_POSTCOPY_SUPPORT))) {
+		VHOST_LOG_CONFIG(ERR, "error: enabling async copy and IOMMU "
+			"or post-copy feature simultaneously is not "
+			"supported\n");
+		goto out_mutex;
+	}
+
 	/*
 	 * Set the supported features correctly for the builtin vhost-user
 	 * net driver.
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 0d822d6..58ee3ef 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -332,8 +332,13 @@
 {
 	if (vq_is_packed(dev))
 		rte_free(vq->shadow_used_packed);
-	else
+	else {
 		rte_free(vq->shadow_used_split);
+		if (vq->async_pkts_pending)
+			rte_free(vq->async_pkts_pending);
+		if (vq->async_pending_info)
+			rte_free(vq->async_pending_info);
+	}
 	rte_free(vq->batch_copy_elems);
 	rte_mempool_free(vq->iotlb_pool);
 	rte_free(vq);
@@ -1522,3 +1527,123 @@ int rte_vhost_extern_callback_register(int vid,
 	if (vhost_data_log_level >= 0)
 		rte_log_set_level(vhost_data_log_level, RTE_LOG_WARNING);
 }
+
+int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
+					uint32_t features,
+					struct rte_vhost_async_channel_ops *ops)
+{
+	struct vhost_virtqueue *vq;
+	struct virtio_net *dev = get_device(vid);
+	struct rte_vhost_async_features f;
+
+	if (dev == NULL || ops == NULL)
+		return -1;
+
+	f.intval = features;
+
+	vq = dev->virtqueue[queue_id];
+
+	if (unlikely(vq == NULL || !dev->async_copy))
+		return -1;
+
+	/** packed queue is not supported */
+	if (unlikely(vq_is_packed(dev) || !f.async_inorder)) {
+		VHOST_LOG_CONFIG(ERR,
+			"async copy is not supported on packed queue or non-inorder mode "
+			"(vid %d, qid: %d)\n", vid, queue_id);
+		return -1;
+	}
+
+	if (unlikely(ops->check_completed_copies == NULL ||
+		ops->transfer_data == NULL))
+		return -1;
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	if (unlikely(vq->async_registered)) {
+		VHOST_LOG_CONFIG(ERR,
+			"async register failed: channel already registered "
+			"(vid %d, qid: %d)\n", vid, queue_id);
+		goto reg_out;
+	}
+
+	vq->async_pkts_pending = rte_malloc(NULL,
+			vq->size * sizeof(uintptr_t),
+			RTE_CACHE_LINE_SIZE);
+	vq->async_pending_info = rte_malloc(NULL,
+			vq->size * sizeof(uint64_t),
+			RTE_CACHE_LINE_SIZE);
+	if (!vq->async_pkts_pending || !vq->async_pending_info) {
+		if (vq->async_pkts_pending)
+			rte_free(vq->async_pkts_pending);
+
+		if (vq->async_pending_info)
+			rte_free(vq->async_pending_info);
+
+		VHOST_LOG_CONFIG(ERR,
+				"async register failed: cannot allocate memory for vq data "
+				"(vid %d, qid: %d)\n", vid, queue_id);
+		goto reg_out;
+	}
+
+	vq->async_ops.check_completed_copies = ops->check_completed_copies;
+	vq->async_ops.transfer_data = ops->transfer_data;
+
+	vq->async_inorder = f.async_inorder;
+	vq->async_threshold = f.async_threshold;
+
+	vq->async_registered = true;
+
+reg_out:
+	rte_spinlock_unlock(&vq->access_lock);
+
+	return 0;
+}
+
+int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id)
+{
+	struct vhost_virtqueue *vq;
+	struct virtio_net *dev = get_device(vid);
+	int ret = -1;
+
+	if (dev == NULL)
+		return ret;
+
+	vq = dev->virtqueue[queue_id];
+
+	if (vq == NULL)
+		return ret;
+
+	ret = 0;
+	rte_spinlock_lock(&vq->access_lock);
+
+	if (!vq->async_registered)
+		goto out;
+
+	if (vq->async_pkts_inflight_n) {
+		VHOST_LOG_CONFIG(ERR, "Failed to unregister async channel. "
+			"async inflight packets must be completed before unregistration.\n");
+		ret = -1;
+		goto out;
+	}
+
+	if (vq->async_pkts_pending) {
+		rte_free(vq->async_pkts_pending);
+		vq->async_pkts_pending = 0;
+	}
+
+	if (vq->async_pending_info) {
+		rte_free(vq->async_pending_info);
+		vq->async_pending_info = 0;
+	}
+
+	vq->async_ops.transfer_data = NULL;
+	vq->async_ops.check_completed_copies = NULL;
+	vq->async_registered = false;
+
+out:
+	rte_spinlock_unlock(&vq->access_lock);
+
+	return ret;
+}
+
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 0344636..f373198 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -24,6 +24,8 @@
 #include "rte_vdpa.h"
 #include "rte_vdpa_dev.h"
 
+#include "rte_vhost_async.h"
+
 /* Used to indicate that the device is running on a data core */
 #define VIRTIO_DEV_RUNNING 1
 /* Used to indicate that the device is ready to operate */
@@ -40,6 +42,11 @@
 
 #define VHOST_LOG_CACHE_NR 32
 
+#define MAX_PKT_BURST 32
+
+#define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
+#define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
+
 #define PACKED_DESC_ENQUEUE_USED_FLAG(w)	\
 	((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \
 		VRING_DESC_F_WRITE)
@@ -202,6 +209,25 @@ struct vhost_virtqueue {
 	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_list;
 	int				iotlb_cache_nr;
 	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_pending_list;
+
+	/* operation callbacks for async dma */
+	struct rte_vhost_async_channel_ops	async_ops;
+
+	struct rte_vhost_iov_iter it_pool[VHOST_MAX_ASYNC_IT];
+	struct iovec vec_pool[VHOST_MAX_ASYNC_VEC];
+
+	/* async data transfer status */
+	uintptr_t	**async_pkts_pending;
+	#define		ASYNC_PENDING_INFO_N_MSK 0xFFFF
+	#define		ASYNC_PENDING_INFO_N_SFT 16
+	uint64_t	*async_pending_info;
+	uint16_t	async_pkts_idx;
+	uint16_t	async_pkts_inflight_n;
+
+	/* vq async features */
+	bool		async_inorder;
+	bool		async_registered;
+	uint16_t	async_threshold;
 } __rte_cache_aligned;
 
 #define VHOST_MAX_VRING			0x100
@@ -338,6 +364,7 @@ struct virtio_net {
 	int16_t			broadcast_rarp;
 	uint32_t		nr_vring;
 	int			dequeue_zero_copy;
+	int			async_copy;
 	int			extbuf;
 	int			linearbuf;
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
@@ -683,7 +710,8 @@ uint64_t translate_log_addr(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	/* Don't kick guest if we don't reach index specified by guest. */
 	if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
 		uint16_t old = vq->signalled_used;
-		uint16_t new = vq->last_used_idx;
+		uint16_t new = vq->async_pkts_inflight_n ?
+					vq->used->idx:vq->last_used_idx;
 		bool signalled_used_valid = vq->signalled_used_valid;
 
 		vq->signalled_used = new;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 6039a8f..aa86055 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -476,12 +476,14 @@
 	} else {
 		if (vq->shadow_used_split)
 			rte_free(vq->shadow_used_split);
+
 		vq->shadow_used_split = rte_malloc(NULL,
 				vq->size * sizeof(struct vring_used_elem),
 				RTE_CACHE_LINE_SIZE);
+
 		if (!vq->shadow_used_split) {
 			VHOST_LOG_CONFIG(ERR,
-					"failed to allocate memory for shadow used ring.\n");
+					"failed to allocate memory for vq internal data.\n");
 			return RTE_VHOST_MSG_RESULT_ERR;
 		}
 	}
@@ -1166,7 +1168,8 @@
 			goto err_mmap;
 		}
 
-		populate = (dev->dequeue_zero_copy) ? MAP_POPULATE : 0;
+		populate = (dev->dequeue_zero_copy || dev->async_copy) ?
+			MAP_POPULATE : 0;
 		mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
 				 MAP_SHARED | populate, fd, 0);
 
@@ -1181,7 +1184,7 @@
 		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
 				      mmap_offset;
 
-		if (dev->dequeue_zero_copy)
+		if (dev->dequeue_zero_copy || dev->async_copy)
 			if (add_guest_pages(dev, reg, alignment) < 0) {
 				VHOST_LOG_CONFIG(ERR,
 					"adding guest pages to region %u failed.\n",
@@ -1979,6 +1982,12 @@ static int vhost_user_set_vring_err(struct virtio_net **pdev __rte_unused,
 	} else {
 		rte_free(vq->shadow_used_split);
 		vq->shadow_used_split = NULL;
+		if (vq->async_pkts_pending)
+			rte_free(vq->async_pkts_pending);
+		if (vq->async_pending_info)
+			rte_free(vq->async_pending_info);
+		vq->async_pkts_pending = NULL;
+		vq->async_pending_info = NULL;
 	}
 
 	rte_free(vq->batch_copy_elems);
@@ -2012,6 +2021,14 @@ static int vhost_user_set_vring_err(struct virtio_net **pdev __rte_unused,
 		"set queue enable: %d to qp idx: %d\n",
 		enable, index);
 
+	if (!enable && dev->virtqueue[index]->async_registered) {
+		if (dev->virtqueue[index]->async_pkts_inflight_n) {
+			VHOST_LOG_CONFIG(ERR, "failed to disable vring. "
+			"async inflight packets must be completed first\n");
+			return RTE_VHOST_MSG_RESULT_ERR;
+		}
+	}
+
 	/* On disable, rings have to be stopped being processed. */
 	if (!enable && dev->dequeue_zero_copy)
 		drain_zmbuf_list(dev->virtqueue[index]);
-- 
1.8.3.1
* [dpdk-dev] [PATCH v3 2/2] vhost: introduce async enqueue for split ring
  2020-07-03 10:27 ` [dpdk-dev] [PATCH v3 " patrick.fu
  2020-07-03 10:27   ` [dpdk-dev] [PATCH v3 1/2] vhost: introduce async enqueue registration API patrick.fu
@ 2020-07-03 10:27   ` patrick.fu
  1 sibling, 0 replies; 36+ messages in thread
From: patrick.fu @ 2020-07-03 10:27 UTC (permalink / raw)
  To: dev, maxime.coquelin, chenbo.xia, zhihong.wang
  Cc: patrick.fu, yinan.wang, cheng1.jiang, cunming.liang
From: Patrick Fu <patrick.fu@intel.com>
This patch implements the async enqueue data path for split ring.
Two new async data path APIs are defined, by which applications
can submit packets to and poll completed packets from async
engines. The async enqueue data path leverages callback functions
registered by applications to work with the async engine.
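The per-packet bookkeeping used by the poll path can be illustrated with
a minimal, self-contained sketch. It assumes the layout implied by the
ASYNC_PENDING_INFO_N_MSK and ASYNC_PENDING_INFO_N_SFT definitions in
vhost.h (descriptor count in the low 16 bits, outstanding segment count
above them). The helper names are hypothetical and are not part of the
patch.

```c
#include <assert.h>
#include <stdint.h>

/* Mirror ASYNC_PENDING_INFO_N_MSK / ASYNC_PENDING_INFO_N_SFT from vhost.h */
#define PENDING_INFO_N_MSK 0xFFFF
#define PENDING_INFO_N_SFT 16

/* Hypothetical helper: pack descriptor and segment counts into one word. */
static uint64_t
pending_info_pack(uint16_t n_descs, uint64_t n_segs)
{
	return (uint64_t)n_descs | (n_segs << PENDING_INFO_N_SFT);
}

/* Number of ring descriptors the packet occupies. */
static uint16_t
pending_info_descs(uint64_t info)
{
	return (uint16_t)(info & PENDING_INFO_N_MSK);
}

/* Number of DMA segments still outstanding for the packet. */
static uint64_t
pending_info_segs(uint64_t info)
{
	return info >> PENDING_INFO_N_SFT;
}

/*
 * Hypothetical helper: record that "done" segments have completed while
 * the packet as a whole is still in flight. The descriptor count is
 * preserved; only the outstanding segment count shrinks.
 */
static uint64_t
pending_info_complete(uint64_t info, uint64_t done)
{
	uint64_t n_segs = pending_info_segs(info);

	assert(done <= n_segs);
	return pending_info_pack(pending_info_descs(info), n_segs - done);
}
```

A packet is reported to the application only once its outstanding
segment count reaches zero, which is why a partial completion rewrites
the word instead of releasing the slot.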
Signed-off-by: Patrick Fu <patrick.fu@intel.com>
---
 lib/librte_vhost/rte_vhost_async.h |  40 +++
 lib/librte_vhost/virtio_net.c      | 539 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 577 insertions(+), 2 deletions(-)
diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
index d5a5927..c8ad8db 100644
--- a/lib/librte_vhost/rte_vhost_async.h
+++ b/lib/librte_vhost/rte_vhost_async.h
@@ -133,4 +133,44 @@ int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
 __rte_experimental
 int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
 
+/**
+ * This function submits enqueue data to the async engine. It does not
+ * guarantee that the transfer has completed upon return. Applications
+ * should poll transfer status with rte_vhost_poll_enqueue_completed().
+ *
+ * @param vid
+ *  id of vhost device to enqueue data
+ * @param queue_id
+ *  queue id to enqueue data
+ * @param pkts
+ *  array of packets to be enqueued
+ * @param count
+ *  packets num to be enqueued
+ * @return
+ *  num of packets enqueued
+ */
+__rte_experimental
+uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count);
+
+/**
+ * This function checks async completion status for a specific vhost
+ * device queue. Packets whose copy (enqueue) operation has finished
+ * are returned in an array.
+ *
+ * @param vid
+ *  id of vhost device to check for completed packets
+ * @param queue_id
+ *  queue id to check for completed packets
+ * @param pkts
+ *  empty array that receives the completed packet pointers
+ * @param count
+ *  size of the packet array
+ * @return
+ *  num of packets returned
+ */
+__rte_experimental
+uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count);
+
 #endif /* _RTE_VHOST_ASYNC_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 751c1f3..b38ac15 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -17,14 +17,15 @@
 #include <rte_arp.h>
 #include <rte_spinlock.h>
 #include <rte_malloc.h>
+#include <rte_vhost_async.h>
 
 #include "iotlb.h"
 #include "vhost.h"
 
-#define MAX_PKT_BURST 32
-
 #define MAX_BATCH_LEN 256
 
+#define VHOST_ASYNC_BATCH_THRESHOLD 32
+
 static  __rte_always_inline bool
 rxvq_is_mergeable(struct virtio_net *dev)
 {
@@ -117,6 +118,35 @@
 }
 
 static __rte_always_inline void
+async_flush_shadow_used_ring_split(struct virtio_net *dev,
+	struct vhost_virtqueue *vq)
+{
+	uint16_t used_idx = vq->last_used_idx & (vq->size - 1);
+
+	if (used_idx + vq->shadow_used_idx <= vq->size) {
+		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0,
+					  vq->shadow_used_idx);
+	} else {
+		uint16_t size;
+
+		/* update used ring interval [used_idx, vq->size] */
+		size = vq->size - used_idx;
+		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0, size);
+
+		/* update the left half used ring interval [0, left_size] */
+		do_flush_shadow_used_ring_split(dev, vq, 0, size,
+					  vq->shadow_used_idx - size);
+	}
+	vq->last_used_idx += vq->shadow_used_idx;
+
+	rte_smp_wmb();
+
+	vhost_log_cache_sync(dev, vq);
+
+	vq->shadow_used_idx = 0;
+}
+
+static __rte_always_inline void
 update_shadow_used_ring_split(struct vhost_virtqueue *vq,
 			 uint16_t desc_idx, uint32_t len)
 {
@@ -905,6 +935,200 @@
 	return error;
 }
 
+static __rte_always_inline void
+async_fill_vec(struct iovec *v, void *base, size_t len)
+{
+	v->iov_base = base;
+	v->iov_len = len;
+}
+
+static __rte_always_inline void
+async_fill_it(struct rte_vhost_iov_iter *it, size_t count,
+	struct iovec *vec, unsigned long nr_seg)
+{
+	it->offset = 0;
+	it->count = count;
+
+	if (count) {
+		it->iov = vec;
+		it->nr_segs = nr_seg;
+	} else {
+		it->iov = 0;
+		it->nr_segs = 0;
+	}
+}
+
+static __rte_always_inline void
+async_fill_des(struct rte_vhost_async_desc *desc,
+	struct rte_vhost_iov_iter *src, struct rte_vhost_iov_iter *dst)
+{
+	desc->src = src;
+	desc->dst = dst;
+}
+
+static __rte_always_inline int
+async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
+			struct rte_mbuf *m, struct buf_vector *buf_vec,
+			uint16_t nr_vec, uint16_t num_buffers,
+			struct iovec *src_iovec, struct iovec *dst_iovec,
+			struct rte_vhost_iov_iter *src_it,
+			struct rte_vhost_iov_iter *dst_it)
+{
+	uint32_t vec_idx = 0;
+	uint32_t mbuf_offset, mbuf_avail;
+	uint32_t buf_offset, buf_avail;
+	uint64_t buf_addr, buf_iova, buf_len;
+	uint32_t cpy_len, cpy_threshold;
+	uint64_t hdr_addr;
+	struct rte_mbuf *hdr_mbuf;
+	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
+	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
+	int error = 0;
+
+	uint32_t tlen = 0;
+	int tvec_idx = 0;
+	void *hpa;
+
+	if (unlikely(m == NULL)) {
+		error = -1;
+		goto out;
+	}
+
+	cpy_threshold = vq->async_threshold;
+
+	buf_addr = buf_vec[vec_idx].buf_addr;
+	buf_iova = buf_vec[vec_idx].buf_iova;
+	buf_len = buf_vec[vec_idx].buf_len;
+
+	if (unlikely(buf_len < dev->vhost_hlen && nr_vec <= 1)) {
+		error = -1;
+		goto out;
+	}
+
+	hdr_mbuf = m;
+	hdr_addr = buf_addr;
+	if (unlikely(buf_len < dev->vhost_hlen))
+		hdr = &tmp_hdr;
+	else
+		hdr = (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)hdr_addr;
+
+	VHOST_LOG_DATA(DEBUG, "(%d) RX: num merge buffers %d\n",
+		dev->vid, num_buffers);
+
+	if (unlikely(buf_len < dev->vhost_hlen)) {
+		buf_offset = dev->vhost_hlen - buf_len;
+		vec_idx++;
+		buf_addr = buf_vec[vec_idx].buf_addr;
+		buf_iova = buf_vec[vec_idx].buf_iova;
+		buf_len = buf_vec[vec_idx].buf_len;
+		buf_avail = buf_len - buf_offset;
+	} else {
+		buf_offset = dev->vhost_hlen;
+		buf_avail = buf_len - dev->vhost_hlen;
+	}
+
+	mbuf_avail  = rte_pktmbuf_data_len(m);
+	mbuf_offset = 0;
+
+	while (mbuf_avail != 0 || m->next != NULL) {
+		/* done with current buf, get the next one */
+		if (buf_avail == 0) {
+			vec_idx++;
+			if (unlikely(vec_idx >= nr_vec)) {
+				error = -1;
+				goto out;
+			}
+
+			buf_addr = buf_vec[vec_idx].buf_addr;
+			buf_iova = buf_vec[vec_idx].buf_iova;
+			buf_len = buf_vec[vec_idx].buf_len;
+
+			buf_offset = 0;
+			buf_avail  = buf_len;
+		}
+
+		/* done with current mbuf, get the next one */
+		if (mbuf_avail == 0) {
+			m = m->next;
+
+			mbuf_offset = 0;
+			mbuf_avail  = rte_pktmbuf_data_len(m);
+		}
+
+		if (hdr_addr) {
+			virtio_enqueue_offload(hdr_mbuf, &hdr->hdr);
+			if (rxvq_is_mergeable(dev))
+				ASSIGN_UNLESS_EQUAL(hdr->num_buffers,
+						num_buffers);
+
+			if (unlikely(hdr == &tmp_hdr)) {
+				copy_vnet_hdr_to_desc(dev, vq, buf_vec, hdr);
+			} else {
+				PRINT_PACKET(dev, (uintptr_t)hdr_addr,
+						dev->vhost_hlen, 0);
+				vhost_log_cache_write_iova(dev, vq,
+						buf_vec[0].buf_iova,
+						dev->vhost_hlen);
+			}
+
+			hdr_addr = 0;
+		}
+
+		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
+
+		if (unlikely(cpy_len >= cpy_threshold)) {
+			hpa = (void *)(uintptr_t)gpa_to_hpa(dev,
+					buf_iova + buf_offset, cpy_len);
+
+			if (unlikely(!hpa)) {
+				error = -1;
+				goto out;
+			}
+
+			async_fill_vec(src_iovec + tvec_idx,
+				(void *)(uintptr_t)rte_pktmbuf_iova_offset(m,
+						mbuf_offset), cpy_len);
+
+			async_fill_vec(dst_iovec + tvec_idx, hpa, cpy_len);
+
+			tlen += cpy_len;
+			tvec_idx++;
+		} else {
+			if (unlikely(vq->batch_copy_nb_elems >= vq->size)) {
+				rte_memcpy(
+				(void *)((uintptr_t)(buf_addr + buf_offset)),
+				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
+				cpy_len);
+
+				PRINT_PACKET(dev,
+					(uintptr_t)(buf_addr + buf_offset),
+					cpy_len, 0);
+			} else {
+				batch_copy[vq->batch_copy_nb_elems].dst =
+				(void *)((uintptr_t)(buf_addr + buf_offset));
+				batch_copy[vq->batch_copy_nb_elems].src =
+				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset);
+				batch_copy[vq->batch_copy_nb_elems].log_addr =
+					buf_iova + buf_offset;
+				batch_copy[vq->batch_copy_nb_elems].len =
+					cpy_len;
+				vq->batch_copy_nb_elems++;
+			}
+		}
+
+		mbuf_avail  -= cpy_len;
+		mbuf_offset += cpy_len;
+		buf_avail  -= cpy_len;
+		buf_offset += cpy_len;
+	}
+
+out:
+	async_fill_it(src_it, tlen, src_iovec, tvec_idx);
+	async_fill_it(dst_it, tlen, dst_iovec, tvec_idx);
+
+	return error;
+}
+
 static __rte_always_inline int
 vhost_enqueue_single_packed(struct virtio_net *dev,
 			    struct vhost_virtqueue *vq,
@@ -1236,6 +1460,317 @@
 	return virtio_dev_rx(dev, queue_id, pkts, count);
 }
 
+static __rte_always_inline void
+virtio_dev_rx_async_submit_split_err(struct virtio_net *dev,
+	struct vhost_virtqueue *vq, uint16_t queue_id,
+	uint16_t last_idx, uint16_t shadow_idx)
+{
+	while (vq->async_pkts_inflight_n) {
+		int er = vq->async_ops.check_completed_copies(dev->vid,
+			queue_id, 0, MAX_PKT_BURST);
+
+		if (er < 0) {
+			vq->async_pkts_inflight_n = 0;
+			break;
+		}
+
+		vq->async_pkts_inflight_n -= er;
+	}
+
+	vq->shadow_used_idx = shadow_idx;
+	vq->last_avail_idx = last_idx;
+}
+
+static __rte_noinline uint32_t
+virtio_dev_rx_async_submit_split(struct virtio_net *dev,
+	struct vhost_virtqueue *vq, uint16_t queue_id,
+	struct rte_mbuf **pkts, uint32_t count)
+{
+	uint32_t pkt_idx = 0, pkt_burst_idx = 0;
+	uint16_t num_buffers;
+	struct buf_vector buf_vec[BUF_VECTOR_MAX];
+	uint16_t avail_head, last_idx, shadow_idx;
+
+	struct rte_vhost_iov_iter *it_pool = vq->it_pool;
+	struct iovec *vec_pool = vq->vec_pool;
+	struct rte_vhost_async_desc tdes[MAX_PKT_BURST];
+	struct iovec *src_iovec = vec_pool;
+	struct iovec *dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
+	struct rte_vhost_iov_iter *src_it = it_pool;
+	struct rte_vhost_iov_iter *dst_it = it_pool + 1;
+	uint16_t n_free_slot, slot_idx;
+	int n_pkts = 0;
+
+	avail_head = *((volatile uint16_t *)&vq->avail->idx);
+	last_idx = vq->last_avail_idx;
+	shadow_idx = vq->shadow_used_idx;
+
+	/*
+	 * The ordering between avail index and
+	 * desc reads needs to be enforced.
+	 */
+	rte_smp_rmb();
+
+	rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
+
+	for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
+		uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
+		uint16_t nr_vec = 0;
+
+		if (unlikely(reserve_avail_buf_split(dev, vq,
+						pkt_len, buf_vec, &num_buffers,
+						avail_head, &nr_vec) < 0)) {
+			VHOST_LOG_DATA(DEBUG,
+				"(%d) failed to get enough desc from vring\n",
+				dev->vid);
+			vq->shadow_used_idx -= num_buffers;
+			break;
+		}
+
+		VHOST_LOG_DATA(DEBUG, "(%d) current index %d | end index %d\n",
+			dev->vid, vq->last_avail_idx,
+			vq->last_avail_idx + num_buffers);
+
+		if (async_mbuf_to_desc(dev, vq, pkts[pkt_idx],
+				buf_vec, nr_vec, num_buffers,
+				src_iovec, dst_iovec, src_it, dst_it) < 0) {
+			vq->shadow_used_idx -= num_buffers;
+			break;
+		}
+
+		slot_idx = (vq->async_pkts_idx + pkt_idx) & (vq->size - 1);
+		if (src_it->count) {
+			async_fill_des(&tdes[pkt_burst_idx], src_it, dst_it);
+			pkt_burst_idx++;
+			vq->async_pending_info[slot_idx] =
+				num_buffers | (src_it->nr_segs << 16);
+			src_iovec += src_it->nr_segs;
+			dst_iovec += dst_it->nr_segs;
+			src_it += 2;
+			dst_it += 2;
+		} else {
+			vq->async_pending_info[slot_idx] = num_buffers;
+			vq->async_pkts_inflight_n++;
+		}
+
+		vq->last_avail_idx += num_buffers;
+
+		if (pkt_burst_idx >= VHOST_ASYNC_BATCH_THRESHOLD ||
+				(pkt_idx == count - 1 && pkt_burst_idx)) {
+			n_pkts = vq->async_ops.transfer_data(dev->vid,
+					queue_id, tdes, 0, pkt_burst_idx);
+			src_iovec = vec_pool;
+			dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
+			src_it = it_pool;
+			dst_it = it_pool + 1;
+
+			if (unlikely(n_pkts < (int)pkt_burst_idx)) {
+				vq->async_pkts_inflight_n +=
+					n_pkts > 0 ? n_pkts : 0;
+				virtio_dev_rx_async_submit_split_err(dev,
+					vq, queue_id, last_idx, shadow_idx);
+				return 0;
+			}
+
+			pkt_burst_idx = 0;
+			vq->async_pkts_inflight_n += n_pkts;
+		}
+	}
+
+	if (pkt_burst_idx) {
+		n_pkts = vq->async_ops.transfer_data(dev->vid,
+				queue_id, tdes, 0, pkt_burst_idx);
+		if (unlikely(n_pkts <= (int)pkt_burst_idx)) {
+			vq->async_pkts_inflight_n += n_pkts > 0 ? n_pkts : 0;
+			virtio_dev_rx_async_submit_split_err(dev, vq, queue_id,
+			last_idx, shadow_idx);
+			return 0;
+		}
+
+		vq->async_pkts_inflight_n += n_pkts;
+	}
+
+	do_data_copy_enqueue(dev, vq);
+
+	n_free_slot = vq->size - vq->async_pkts_idx;
+	if (n_free_slot > pkt_idx) {
+		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
+			pkts, pkt_idx * sizeof(uintptr_t));
+		vq->async_pkts_idx += pkt_idx;
+	} else {
+		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
+			pkts, n_free_slot * sizeof(uintptr_t));
+		rte_memcpy(&vq->async_pkts_pending[0],
+			&pkts[n_free_slot],
+			(pkt_idx - n_free_slot) * sizeof(uintptr_t));
+		vq->async_pkts_idx = pkt_idx - n_free_slot;
+	}
+
+	if (likely(vq->shadow_used_idx))
+		async_flush_shadow_used_ring_split(dev, vq);
+
+	return pkt_idx;
+}
+
+uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count)
+{
+	struct virtio_net *dev = get_device(vid);
+	struct vhost_virtqueue *vq;
+	uint16_t n_pkts_cpl, n_pkts_put = 0, n_descs = 0;
+	uint16_t start_idx, pkts_idx, vq_size;
+	uint64_t *async_pending_info;
+
+	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
+	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
+		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
+			dev->vid, __func__, queue_id);
+		return 0;
+	}
+
+	vq = dev->virtqueue[queue_id];
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	pkts_idx = vq->async_pkts_idx;
+	async_pending_info = vq->async_pending_info;
+	vq_size = vq->size;
+	start_idx = pkts_idx > vq->async_pkts_inflight_n ?
+		pkts_idx - vq->async_pkts_inflight_n :
+		(vq_size - vq->async_pkts_inflight_n + pkts_idx) &
+		(vq_size - 1);
+
+	n_pkts_cpl =
+		vq->async_ops.check_completed_copies(vid, queue_id, 0, count);
+
+	rte_smp_wmb();
+
+	while (likely(((start_idx + n_pkts_put) & (vq_size - 1)) != pkts_idx)) {
+		uint64_t info = async_pending_info[
+			(start_idx + n_pkts_put) & (vq_size - 1)];
+		uint64_t n_segs;
+		n_pkts_put++;
+		n_descs += info & ASYNC_PENDING_INFO_N_MSK;
+		n_segs = info >> ASYNC_PENDING_INFO_N_SFT;
+
+		if (n_segs) {
+			if (!n_pkts_cpl || n_pkts_cpl < n_segs) {
+				n_pkts_put--;
+				n_descs -= info & ASYNC_PENDING_INFO_N_MSK;
+				if (n_pkts_cpl) {
+					async_pending_info[
+						(start_idx + n_pkts_put) &
+						(vq_size - 1)] =
+					((n_segs - n_pkts_cpl) <<
+					 ASYNC_PENDING_INFO_N_SFT) |
+					(info & ASYNC_PENDING_INFO_N_MSK);
+					n_pkts_cpl = 0;
+				}
+				break;
+			}
+			n_pkts_cpl -= n_segs;
+		}
+	}
+
+	if (n_pkts_put) {
+		vq->async_pkts_inflight_n -= n_pkts_put;
+		*(volatile uint16_t *)&vq->used->idx += n_descs;
+
+		vhost_vring_call_split(dev, vq);
+	}
+
+	if (start_idx + n_pkts_put <= vq_size) {
+		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
+			n_pkts_put * sizeof(uintptr_t));
+	} else {
+		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
+			(vq_size - start_idx) * sizeof(uintptr_t));
+		rte_memcpy(&pkts[vq_size - start_idx], vq->async_pkts_pending,
+			(n_pkts_put - vq_size + start_idx) * sizeof(uintptr_t));
+	}
+
+	rte_spinlock_unlock(&vq->access_lock);
+
+	return n_pkts_put;
+}
+
+static __rte_always_inline uint32_t
+virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id,
+	struct rte_mbuf **pkts, uint32_t count)
+{
+	struct vhost_virtqueue *vq;
+	uint32_t nb_tx = 0;
+	bool drawback = false;
+
+	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
+	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
+		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
+			dev->vid, __func__, queue_id);
+		return 0;
+	}
+
+	vq = dev->virtqueue[queue_id];
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	if (unlikely(vq->enabled == 0))
+		goto out_access_unlock;
+
+	if (unlikely(!vq->async_registered)) {
+		drawback = true;
+		goto out_access_unlock;
+	}
+
+	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+		vhost_user_iotlb_rd_lock(vq);
+
+	if (unlikely(vq->access_ok == 0))
+		if (unlikely(vring_translate(dev, vq) < 0))
+			goto out;
+
+	count = RTE_MIN((uint32_t)MAX_PKT_BURST, count);
+	if (count == 0)
+		goto out;
+
+	/* TODO: packed queue not implemented */
+	if (vq_is_packed(dev))
+		nb_tx = 0;
+	else
+		nb_tx = virtio_dev_rx_async_submit_split(dev,
+				vq, queue_id, pkts, count);
+
+out:
+	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+		vhost_user_iotlb_rd_unlock(vq);
+
+out_access_unlock:
+	rte_spinlock_unlock(&vq->access_lock);
+
+	if (drawback)
+		return rte_vhost_enqueue_burst(dev->vid, queue_id, pkts, count);
+
+	return nb_tx;
+}
+
+uint16_t
+rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (!dev)
+		return 0;
+
+	if (unlikely(!(dev->flags & VIRTIO_DEV_BUILTIN_VIRTIO_NET))) {
+		VHOST_LOG_DATA(ERR,
+			"(%d) %s: built-in vhost net backend is disabled.\n",
+			dev->vid, __func__);
+		return 0;
+	}
+
+	return virtio_dev_rx_async_submit(dev, queue_id, pkts, count);
+}
+
 static inline bool
 virtio_net_with_host_offload(struct virtio_net *dev)
 {
-- 
1.8.3.1
* [dpdk-dev] [PATCH v4 0/2] introduce asynchronous data path for vhost
  2020-06-11 10:02 [dpdk-dev] [PATCH v1 0/2] introduce asynchronous data path for vhost patrick.fu
                   ` (3 preceding siblings ...)
  2020-07-03 10:27 ` [dpdk-dev] [PATCH v3 " patrick.fu
@ 2020-07-03 12:21 ` patrick.fu
  2020-07-03 12:21   ` [dpdk-dev] [PATCH v4 1/2] vhost: introduce async enqueue registration API patrick.fu
  2020-07-03 12:21   ` [dpdk-dev] [PATCH v4 2/2] vhost: introduce async enqueue for split ring patrick.fu
  2020-07-06 11:53 ` [dpdk-dev] [PATCH v5 0/2] introduce asynchronous data path for vhost patrick.fu
  2020-07-07  5:07 ` [dpdk-dev] [PATCH v6 0/2] introduce asynchronous data path for vhost patrick.fu
  6 siblings, 2 replies; 36+ messages in thread
From: patrick.fu @ 2020-07-03 12:21 UTC (permalink / raw)
  To: dev, maxime.coquelin, chenbo.xia, zhihong.wang
  Cc: patrick.fu, yinan.wang, cheng1.jiang, cunming.liang
From: Patrick Fu <patrick.fu@intel.com>
Performing large memory copies usually takes up a major part of CPU
cycles and becomes the hot spot in the vhost-user enqueue operation. To
offload expensive memory operations from the CPU, this patch set
proposes to leverage DMA engines, e.g., I/OAT, a DMA engine in Intel
processors, to accelerate large copies.
Large copies are offloaded from the CPU to the DMA in an asynchronous
manner. The CPU just submits copy jobs to the DMA without waiting
for copy completion. Thus, there is no CPU intervention during
data transfer; we can save precious CPU cycles and improve the overall
throughput for vhost-user based applications, like OVS. During packet
transmission, the data path offloads large copies to the DMA and
performs small copies on the CPU, because of the startup overhead
associated with the DMA.
This patch set constructs a general framework that applications can
leverage to attach DMA channels to vhost-user transmit queues. Four
new RTE APIs are introduced to the vhost library for applications to
register and use the asynchronous data path. In addition, two new DMA
operation callbacks are defined, by which the vhost-user asynchronous
data path can interact with DMA hardware. Currently only the enqueue
operation for split queue is implemented, but the framework is flexible
enough to be extended to support packed queue.
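The split between CPU and DMA copies described above can be sketched as
follows. This is a minimal, self-contained model and not the code in
the patch: struct copy_job and dispatch_copy() are hypothetical names,
and falling back to a CPU copy when the job table is full is an
assumption made only to keep the example well defined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one DMA copy job; not a DPDK type. */
struct copy_job {
	void *dst;
	const void *src;
	size_t len;
};

/*
 * Route one copy. Segments at or above the threshold are queued for the
 * DMA engine; smaller ones are copied by the CPU immediately, since the
 * DMA startup overhead would outweigh the benefit.
 * Returns 1 if the copy was queued for DMA, 0 if the CPU performed it.
 */
static int
dispatch_copy(void *dst, const void *src, size_t len, size_t threshold,
	struct copy_job *jobs, size_t *nr_jobs, size_t max_jobs)
{
	assert(*nr_jobs <= max_jobs);

	/* Assumption: fall back to the CPU when the job table is full. */
	if (len >= threshold && *nr_jobs < max_jobs) {
		jobs[*nr_jobs].dst = dst;
		jobs[*nr_jobs].src = src;
		jobs[*nr_jobs].len = len;
		(*nr_jobs)++;
		return 1;
	}

	memcpy(dst, src, len);
	return 0;
}
```

In the patch set itself the threshold is supplied per queue at
registration time, and the queued jobs are handed to the application's
transfer_data callback in batches.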
v2:
update meson file for new header file
update rte_vhost_version.map to include new APIs
rename async APIs/structures to be prefixed with "rte_vhost"
rename some variables/structures for readability
correct minor typo in comments/license statements
refine memory allocation logic for vq internal buffer
add error message printing in some failure cases
check inflight async packets in unregistration API call
mark new APIs as experimental
v3:
use atomic_xxx() functions in updating ring index
fix a bug in async enqueue failure handling
v4:
part of the fix intended in the v3 patch was missed; this patch
adds all those fixes
Patrick Fu (2):
  vhost: introduce async enqueue registration API
  vhost: introduce async enqueue for split ring
 lib/librte_vhost/Makefile              |   2 +-
 lib/librte_vhost/meson.build           |   2 +-
 lib/librte_vhost/rte_vhost.h           |   1 +
 lib/librte_vhost/rte_vhost_async.h     | 176 +++++++++++
 lib/librte_vhost/rte_vhost_version.map |   4 +
 lib/librte_vhost/socket.c              |  20 ++
 lib/librte_vhost/vhost.c               | 127 +++++++-
 lib/librte_vhost/vhost.h               |  30 +-
 lib/librte_vhost/vhost_user.c          |  23 +-
 lib/librte_vhost/virtio_net.c          | 554 ++++++++++++++++++++++++++++++++-
 10 files changed, 930 insertions(+), 9 deletions(-)
 create mode 100644 lib/librte_vhost/rte_vhost_async.h
-- 
1.8.3.1
* [dpdk-dev] [PATCH v4 1/2] vhost: introduce async enqueue registration API
  2020-07-03 12:21 ` [dpdk-dev] [PATCH v4 0/2] introduce asynchronous data path for vhost patrick.fu
@ 2020-07-03 12:21   ` patrick.fu
  2020-07-06  3:05     ` Liu, Yong
  2020-07-03 12:21   ` [dpdk-dev] [PATCH v4 2/2] vhost: introduce async enqueue for split ring patrick.fu
  1 sibling, 1 reply; 36+ messages in thread
From: patrick.fu @ 2020-07-03 12:21 UTC (permalink / raw)
  To: dev, maxime.coquelin, chenbo.xia, zhihong.wang
  Cc: patrick.fu, yinan.wang, cheng1.jiang, cunming.liang
From: Patrick Fu <patrick.fu@intel.com>
This patch introduces registration/un-registration APIs
for the vhost async data enqueue operation. Together with
the registration API implementations, the data structures
and async callback functions required for the async enqueue
data path are also defined.
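As an illustration of the "features" word passed at registration, the
following self-contained sketch builds and decodes it with explicit
shifts, following the bit layout documented in rte_vhost_async.h (b0:
in-order, b16 - b27: copy length threshold). The macro and function
names are hypothetical. The patch itself decodes the word through a
bit-field union, and shifts are used here only because bit-field layout
is implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; positions follow the documented feature layout. */
#define ASYNC_F_INORDER		(1u << 0)
#define ASYNC_F_THRESHOLD_SHIFT	16
#define ASYNC_F_THRESHOLD_MASK	0xFFFu	/* 12 bits: b16 - b27 */

/* Build the 32-bit features word from its two documented fields. */
static uint32_t
async_features_pack(int inorder, uint32_t threshold)
{
	assert(threshold <= ASYNC_F_THRESHOLD_MASK);

	return (inorder ? ASYNC_F_INORDER : 0) |
		(threshold << ASYNC_F_THRESHOLD_SHIFT);
}

/* Extract the copy length threshold (b16 - b27). */
static uint32_t
async_features_threshold(uint32_t features)
{
	return (features >> ASYNC_F_THRESHOLD_SHIFT) & ASYNC_F_THRESHOLD_MASK;
}

/* Extract the in-order flag (b0). */
static int
async_features_inorder(uint32_t features)
{
	return (features & ASYNC_F_INORDER) != 0;
}
```

For example, an in-order channel with a 256-byte threshold would be
registered with the value 0x01000001. Note that registration in this
patch rejects channels that do not set the in-order bit.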
Signed-off-by: Patrick Fu <patrick.fu@intel.com>
---
 lib/librte_vhost/Makefile              |   2 +-
 lib/librte_vhost/meson.build           |   2 +-
 lib/librte_vhost/rte_vhost.h           |   1 +
 lib/librte_vhost/rte_vhost_async.h     | 136 +++++++++++++++++++++++++++++++++
 lib/librte_vhost/rte_vhost_version.map |   4 +
 lib/librte_vhost/socket.c              |  20 +++++
 lib/librte_vhost/vhost.c               | 127 +++++++++++++++++++++++++++++-
 lib/librte_vhost/vhost.h               |  30 +++++++-
 lib/librte_vhost/vhost_user.c          |  23 +++++-
 9 files changed, 338 insertions(+), 7 deletions(-)
 create mode 100644 lib/librte_vhost/rte_vhost_async.h
diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
index b7ff7dc..4f2f3e4 100644
--- a/lib/librte_vhost/Makefile
+++ b/lib/librte_vhost/Makefile
@@ -42,7 +42,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h \
-						rte_vdpa_dev.h
+						rte_vdpa_dev.h rte_vhost_async.h
 
 # only compile vhost crypto when cryptodev is enabled
 ifeq ($(CONFIG_RTE_LIBRTE_CRYPTODEV),y)
diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
index 882a0ea..cc9aa65 100644
--- a/lib/librte_vhost/meson.build
+++ b/lib/librte_vhost/meson.build
@@ -22,5 +22,5 @@ sources = files('fd_man.c', 'iotlb.c', 'socket.c', 'vdpa.c',
 		'vhost.c', 'vhost_user.c',
 		'virtio_net.c', 'vhost_crypto.c')
 headers = files('rte_vhost.h', 'rte_vdpa.h', 'rte_vdpa_dev.h',
-		'rte_vhost_crypto.h')
+		'rte_vhost_crypto.h', 'rte_vhost_async.h')
 deps += ['ethdev', 'cryptodev', 'hash', 'pci']
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index 8a5c332..f93f959 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -35,6 +35,7 @@
 #define RTE_VHOST_USER_EXTBUF_SUPPORT	(1ULL << 5)
 /* support only linear buffers (no chained mbufs) */
 #define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
+#define RTE_VHOST_USER_ASYNC_COPY	(1ULL << 7)
 
 /* Features. */
 #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
new file mode 100644
index 0000000..d5a5927
--- /dev/null
+++ b/lib/librte_vhost/rte_vhost_async.h
@@ -0,0 +1,136 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_VHOST_ASYNC_H_
+#define _RTE_VHOST_ASYNC_H_
+
+#include "rte_vhost.h"
+
+/**
+ * iovec iterator
+ */
+struct rte_vhost_iov_iter {
+	/** offset to the first byte of interesting data */
+	size_t offset;
+	/** total bytes of data in this iterator */
+	size_t count;
+	/** pointer to the iovec array */
+	struct iovec *iov;
+	/** number of iovec in this iterator */
+	unsigned long nr_segs;
+};
+
+/**
+ * dma transfer descriptor pair
+ */
+struct rte_vhost_async_desc {
+	/** source memory iov_iter */
+	struct rte_vhost_iov_iter *src;
+	/** destination memory iov_iter */
+	struct rte_vhost_iov_iter *dst;
+};
+
+/**
+ * dma transfer status
+ */
+struct rte_vhost_async_status {
+	/** An array of application specific data for source memory */
+	uintptr_t *src_opaque_data;
+	/** An array of application specific data for destination memory */
+	uintptr_t *dst_opaque_data;
+};
+
+/**
+ * dma operation callbacks to be implemented by applications
+ */
+struct rte_vhost_async_channel_ops {
+	/**
+	 * instruct async engines to perform copies for a batch of packets
+	 *
+	 * @param vid
+	 *  id of vhost device to perform data copies
+	 * @param queue_id
+	 *  queue id to perform data copies
+	 * @param descs
+	 *  an array of DMA transfer memory descriptors
+	 * @param opaque_data
+	 *  opaque data pair sent to the DMA engine
+	 * @param count
+	 *  number of elements in the "descs" array
+	 * @return
+	 *  -1 on failure, number of descs processed on success
+	 */
+	int (*transfer_data)(int vid, uint16_t queue_id,
+		struct rte_vhost_async_desc *descs,
+		struct rte_vhost_async_status *opaque_data,
+		uint16_t count);
+	/**
+	 * check copy-completed packets from the async engine
+	 * @param vid
+	 *  id of vhost device to check copy completion
+	 * @param queue_id
+	 *  queue id to check copy completion
+	 * @param opaque_data
+	 *  buffer to receive the opaque data pair from DMA engine
+	 * @param max_packets
+	 *  max number of packets that could be completed
+	 * @return
+	 *  -1 on failure, number of iov segments completed on success
+	 */
+	int (*check_completed_copies)(int vid, uint16_t queue_id,
+		struct rte_vhost_async_status *opaque_data,
+		uint16_t max_packets);
+};
+
+/**
+ *  dma channel feature bit definition
+ */
+struct rte_vhost_async_features {
+	union {
+		uint32_t intval;
+		struct {
+			uint32_t async_inorder:1;
+			uint32_t resvd_0:15;
+			uint32_t async_threshold:12;
+			uint32_t resvd_1:4;
+		};
+	};
+};
+
+/**
+ * register an async channel for vhost
+ *
+ * @param vid
+ *  vhost device id async channel to be attached to
+ * @param queue_id
+ *  vhost queue id async channel to be attached to
+ * @param features
+ *  DMA channel feature bit
+ *    b0       : DMA supports inorder data transfer
+ *    b1  - b15: reserved
+ *    b16 - b27: Packet length threshold for DMA transfer
+ *    b28 - b31: reserved
+ * @param ops
+ *  DMA operation callbacks
+ * @return
+ *  0 on success, -1 on failures
+ */
+__rte_experimental
+int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
+	uint32_t features, struct rte_vhost_async_channel_ops *ops);
+
+/**
+ * unregister a dma channel for vhost
+ *
+ * @param vid
+ *  vhost device id DMA channel to be detached
+ * @param queue_id
+ *  vhost queue id DMA channel to be detached
+ * @return
+ *  0 on success, -1 on failures
+ */
+__rte_experimental
+int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
+
+#endif /* _RTE_VHOST_ASYNC_H_ */
diff --git a/lib/librte_vhost/rte_vhost_version.map b/lib/librte_vhost/rte_vhost_version.map
index 8678440..13ec53b 100644
--- a/lib/librte_vhost/rte_vhost_version.map
+++ b/lib/librte_vhost/rte_vhost_version.map
@@ -71,4 +71,8 @@ EXPERIMENTAL {
 	rte_vdpa_get_queue_num;
 	rte_vdpa_get_features;
 	rte_vdpa_get_protocol_features;
+	rte_vhost_async_channel_register;
+	rte_vhost_async_channel_unregister;
+	rte_vhost_submit_enqueue_burst;
+	rte_vhost_poll_enqueue_completed;
 };
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 49267ce..698b44e 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -42,6 +42,7 @@ struct vhost_user_socket {
 	bool use_builtin_virtio_net;
 	bool extbuf;
 	bool linearbuf;
+	bool async_copy;
 
 	/*
 	 * The "supported_features" indicates the feature bits the
@@ -205,6 +206,7 @@ struct vhost_user {
 	size_t size;
 	struct vhost_user_connection *conn;
 	int ret;
+	struct virtio_net *dev;
 
 	if (vsocket == NULL)
 		return;
@@ -236,6 +238,13 @@ struct vhost_user {
 	if (vsocket->linearbuf)
 		vhost_enable_linearbuf(vid);
 
+	if (vsocket->async_copy) {
+		dev = get_device(vid);
+
+		if (dev)
+			dev->async_copy = 1;
+	}
+
 	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
 
 	if (vsocket->notify_ops->new_connection) {
@@ -881,6 +890,17 @@ struct rte_vdpa_device *
 		goto out_mutex;
 	}
 
+	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
+
+	if (vsocket->async_copy &&
+		(flags & (RTE_VHOST_USER_IOMMU_SUPPORT |
+		RTE_VHOST_USER_POSTCOPY_SUPPORT))) {
+		VHOST_LOG_CONFIG(ERR, "error: enabling async copy and IOMMU "
+			"or post-copy feature simultaneously is not "
+			"supported\n");
+		goto out_mutex;
+	}
+
 	/*
 	 * Set the supported features correctly for the builtin vhost-user
 	 * net driver.
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 0d822d6..58ee3ef 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -332,8 +332,13 @@
 {
 	if (vq_is_packed(dev))
 		rte_free(vq->shadow_used_packed);
-	else
+	else {
 		rte_free(vq->shadow_used_split);
+		if (vq->async_pkts_pending)
+			rte_free(vq->async_pkts_pending);
+		if (vq->async_pending_info)
+			rte_free(vq->async_pending_info);
+	}
 	rte_free(vq->batch_copy_elems);
 	rte_mempool_free(vq->iotlb_pool);
 	rte_free(vq);
@@ -1522,3 +1527,123 @@ int rte_vhost_extern_callback_register(int vid,
 	if (vhost_data_log_level >= 0)
 		rte_log_set_level(vhost_data_log_level, RTE_LOG_WARNING);
 }
+
+int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
+					uint32_t features,
+					struct rte_vhost_async_channel_ops *ops)
+{
+	struct vhost_virtqueue *vq;
+	struct virtio_net *dev = get_device(vid);
+	struct rte_vhost_async_features f;
+
+	if (dev == NULL || ops == NULL)
+		return -1;
+
+	f.intval = features;
+
+	vq = dev->virtqueue[queue_id];
+
+	if (unlikely(vq == NULL || !dev->async_copy))
+		return -1;
+
+	/** packed queue is not supported */
+	if (unlikely(vq_is_packed(dev) || !f.async_inorder)) {
+		VHOST_LOG_CONFIG(ERR,
+			"async copy is not supported on packed queue or non-inorder mode "
+			"(vid %d, qid: %d)\n", vid, queue_id);
+		return -1;
+	}
+
+	if (unlikely(ops->check_completed_copies == NULL ||
+		ops->transfer_data == NULL))
+		return -1;
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	if (unlikely(vq->async_registered)) {
+		VHOST_LOG_CONFIG(ERR,
+			"async register failed: channel already registered "
+			"(vid %d, qid: %d)\n", vid, queue_id);
+		goto reg_out;
+	}
+
+	vq->async_pkts_pending = rte_malloc(NULL,
+			vq->size * sizeof(uintptr_t),
+			RTE_CACHE_LINE_SIZE);
+	vq->async_pending_info = rte_malloc(NULL,
+			vq->size * sizeof(uint64_t),
+			RTE_CACHE_LINE_SIZE);
+	if (!vq->async_pkts_pending || !vq->async_pending_info) {
+		if (vq->async_pkts_pending)
+			rte_free(vq->async_pkts_pending);
+
+		if (vq->async_pending_info)
+			rte_free(vq->async_pending_info);
+
+		VHOST_LOG_CONFIG(ERR,
+				"async register failed: cannot allocate memory for vq data "
+				"(vid %d, qid: %d)\n", vid, queue_id);
+		goto reg_out;
+	}
+
+	vq->async_ops.check_completed_copies = ops->check_completed_copies;
+	vq->async_ops.transfer_data = ops->transfer_data;
+
+	vq->async_inorder = f.async_inorder;
+	vq->async_threshold = f.async_threshold;
+
+	vq->async_registered = true;
+
+reg_out:
+	rte_spinlock_unlock(&vq->access_lock);
+
+	return 0;
+}
+
+int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id)
+{
+	struct vhost_virtqueue *vq;
+	struct virtio_net *dev = get_device(vid);
+	int ret = -1;
+
+	if (dev == NULL)
+		return ret;
+
+	vq = dev->virtqueue[queue_id];
+
+	if (vq == NULL)
+		return ret;
+
+	ret = 0;
+	rte_spinlock_lock(&vq->access_lock);
+
+	if (!vq->async_registered)
+		goto out;
+
+	if (vq->async_pkts_inflight_n) {
+		VHOST_LOG_CONFIG(ERR, "Failed to unregister async channel. "
+			"async inflight packets must be completed before unregistration.\n");
+		ret = -1;
+		goto out;
+	}
+
+	if (vq->async_pkts_pending) {
+		rte_free(vq->async_pkts_pending);
+		vq->async_pkts_pending = 0;
+	}
+
+	if (vq->async_pending_info) {
+		rte_free(vq->async_pending_info);
+		vq->async_pending_info = 0;
+	}
+
+	vq->async_ops.transfer_data = NULL;
+	vq->async_ops.check_completed_copies = NULL;
+	vq->async_registered = false;
+
+out:
+	rte_spinlock_unlock(&vq->access_lock);
+
+	return ret;
+}
+
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 0344636..f373198 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -24,6 +24,8 @@
 #include "rte_vdpa.h"
 #include "rte_vdpa_dev.h"
 
+#include "rte_vhost_async.h"
+
 /* Used to indicate that the device is running on a data core */
 #define VIRTIO_DEV_RUNNING 1
 /* Used to indicate that the device is ready to operate */
@@ -40,6 +42,11 @@
 
 #define VHOST_LOG_CACHE_NR 32
 
+#define MAX_PKT_BURST 32
+
+#define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
+#define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
+
 #define PACKED_DESC_ENQUEUE_USED_FLAG(w)	\
 	((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \
 		VRING_DESC_F_WRITE)
@@ -202,6 +209,25 @@ struct vhost_virtqueue {
 	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_list;
 	int				iotlb_cache_nr;
 	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_pending_list;
+
+	/* operation callbacks for async dma */
+	struct rte_vhost_async_channel_ops	async_ops;
+
+	struct rte_vhost_iov_iter it_pool[VHOST_MAX_ASYNC_IT];
+	struct iovec vec_pool[VHOST_MAX_ASYNC_VEC];
+
+	/* async data transfer status */
+	uintptr_t	**async_pkts_pending;
+	#define		ASYNC_PENDING_INFO_N_MSK 0xFFFF
+	#define		ASYNC_PENDING_INFO_N_SFT 16
+	uint64_t	*async_pending_info;
+	uint16_t	async_pkts_idx;
+	uint16_t	async_pkts_inflight_n;
+
+	/* vq async features */
+	bool		async_inorder;
+	bool		async_registered;
+	uint16_t	async_threshold;
 } __rte_cache_aligned;
 
 #define VHOST_MAX_VRING			0x100
@@ -338,6 +364,7 @@ struct virtio_net {
 	int16_t			broadcast_rarp;
 	uint32_t		nr_vring;
 	int			dequeue_zero_copy;
+	int			async_copy;
 	int			extbuf;
 	int			linearbuf;
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
@@ -683,7 +710,8 @@ uint64_t translate_log_addr(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	/* Don't kick guest if we don't reach index specified by guest. */
 	if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
 		uint16_t old = vq->signalled_used;
-		uint16_t new = vq->last_used_idx;
+		uint16_t new = vq->async_pkts_inflight_n ?
+					vq->used->idx:vq->last_used_idx;
 		bool signalled_used_valid = vq->signalled_used_valid;
 
 		vq->signalled_used = new;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 6039a8f..aa86055 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -476,12 +476,14 @@
 	} else {
 		if (vq->shadow_used_split)
 			rte_free(vq->shadow_used_split);
+
 		vq->shadow_used_split = rte_malloc(NULL,
 				vq->size * sizeof(struct vring_used_elem),
 				RTE_CACHE_LINE_SIZE);
+
 		if (!vq->shadow_used_split) {
 			VHOST_LOG_CONFIG(ERR,
-					"failed to allocate memory for shadow used ring.\n");
+					"failed to allocate memory for vq internal data.\n");
 			return RTE_VHOST_MSG_RESULT_ERR;
 		}
 	}
@@ -1166,7 +1168,8 @@
 			goto err_mmap;
 		}
 
-		populate = (dev->dequeue_zero_copy) ? MAP_POPULATE : 0;
+		populate = (dev->dequeue_zero_copy || dev->async_copy) ?
+			MAP_POPULATE : 0;
 		mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
 				 MAP_SHARED | populate, fd, 0);
 
@@ -1181,7 +1184,7 @@
 		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
 				      mmap_offset;
 
-		if (dev->dequeue_zero_copy)
+		if (dev->dequeue_zero_copy || dev->async_copy)
 			if (add_guest_pages(dev, reg, alignment) < 0) {
 				VHOST_LOG_CONFIG(ERR,
 					"adding guest pages to region %u failed.\n",
@@ -1979,6 +1982,12 @@ static int vhost_user_set_vring_err(struct virtio_net **pdev __rte_unused,
 	} else {
 		rte_free(vq->shadow_used_split);
 		vq->shadow_used_split = NULL;
+		if (vq->async_pkts_pending)
+			rte_free(vq->async_pkts_pending);
+		if (vq->async_pending_info)
+			rte_free(vq->async_pending_info);
+		vq->async_pkts_pending = NULL;
+		vq->async_pending_info = NULL;
 	}
 
 	rte_free(vq->batch_copy_elems);
@@ -2012,6 +2021,14 @@ static int vhost_user_set_vring_err(struct virtio_net **pdev __rte_unused,
 		"set queue enable: %d to qp idx: %d\n",
 		enable, index);
 
+	if (!enable && dev->virtqueue[index]->async_registered) {
+		if (dev->virtqueue[index]->async_pkts_inflight_n) {
+			VHOST_LOG_CONFIG(ERR, "failed to disable vring. "
+			"async inflight packets must be completed first\n");
+			return RTE_VHOST_MSG_RESULT_ERR;
+		}
+	}
+
 	/* On disable, rings have to be stopped being processed. */
 	if (!enable && dev->dequeue_zero_copy)
 		drain_zmbuf_list(dev->virtqueue[index]);
-- 
1.8.3.1
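
A note for readers following the thread: the "features" argument of rte_vhost_async_channel_register() is a packed 32-bit word (b0 = in-order, b16 - b27 = packet length threshold, per the header comment in this patch). The following is only a minimal sketch of how an application might build that word with explicit shifts instead of relying on compiler bit-field layout. The macro and helper names (ASYNC_F_*, async_features_make, async_features_threshold) are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions copied from the rte_vhost_async_channel_register() comment. */
#define ASYNC_F_INORDER        (1u << 0)  /* b0: DMA supports in-order transfer */
#define ASYNC_F_THRESHOLD_SFT  16         /* b16 - b27: packet length threshold */
#define ASYNC_F_THRESHOLD_MSK  0xFFFu     /* 12-bit field */

/* Hypothetical helper: pack the in-order flag and threshold into one word. */
static inline uint32_t
async_features_make(int inorder, uint32_t threshold)
{
	/* The threshold field is only 12 bits wide. */
	assert(threshold <= ASYNC_F_THRESHOLD_MSK);

	return (inorder ? ASYNC_F_INORDER : 0) |
		((threshold & ASYNC_F_THRESHOLD_MSK) << ASYNC_F_THRESHOLD_SFT);
}

/* Hypothetical helper: read the threshold back out of a features word. */
static inline uint32_t
async_features_threshold(uint32_t features)
{
	return (features >> ASYNC_F_THRESHOLD_SFT) & ASYNC_F_THRESHOLD_MSK;
}
```

With these helpers, async_features_make(1, 256) yields the value that would be passed as "features". Note that the registration function in this patch rejects a features word whose in-order bit is clear.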
* [dpdk-dev] [PATCH v4 2/2] vhost: introduce async enqueue for split ring
  2020-07-03 12:21 ` [dpdk-dev] [PATCH v4 0/2] introduce asynchronous data path for vhost patrick.fu
  2020-07-03 12:21   ` [dpdk-dev] [PATCH v4 1/2] vhost: introduce async enqueue registration API patrick.fu
@ 2020-07-03 12:21   ` patrick.fu
  1 sibling, 0 replies; 36+ messages in thread
From: patrick.fu @ 2020-07-03 12:21 UTC (permalink / raw)
  To: dev, maxime.coquelin, chenbo.xia, zhihong.wang
  Cc: patrick.fu, yinan.wang, cheng1.jiang, cunming.liang
From: Patrick Fu <patrick.fu@intel.com>
This patch implements the async enqueue data path for split ring.
Two new async data path APIs are defined, by which applications
can submit packets to, and poll completed packets from, async
engines. The async enqueue data path leverages callback functions
registered by applications to work with the async engine.
Signed-off-by: Patrick Fu <patrick.fu@intel.com>
---
 lib/librte_vhost/rte_vhost_async.h |  40 +++
 lib/librte_vhost/virtio_net.c      | 554 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 592 insertions(+), 2 deletions(-)
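
To illustrate the callback contract that the data path relies on, here is a minimal, hedged sketch of a software-only backend: transfer_data copies every iovec segment and returns the number of descriptors it accepted, and check_completed_copies reports finished segments, capped by its last argument. This is an assumption-laden toy, not a real DMA driver. In particular, the patch fills iov_base with IOVA / host physical addresses, whereas this sketch treats iov_base as an ordinary virtual pointer so that memcpy() can stand in for the engine. The structures are local copies of the ones in rte_vhost_async.h so that the sketch builds without DPDK; the sw_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Local copies of the structures introduced by this patch set. */
struct rte_vhost_iov_iter {
	size_t offset;
	size_t count;
	struct iovec *iov;
	unsigned long nr_segs;
};

struct rte_vhost_async_desc {
	struct rte_vhost_iov_iter *src;
	struct rte_vhost_iov_iter *dst;
};

struct rte_vhost_async_status; /* unused here: the library passes 0 */

/* Hypothetical state: segments copied but not yet reported as complete. */
static unsigned int sw_done_segs;

/* "Submit" a batch: copy synchronously, segment by segment. */
static int
sw_transfer_data(int vid, uint16_t queue_id,
		struct rte_vhost_async_desc *descs,
		struct rte_vhost_async_status *opaque_data, uint16_t count)
{
	uint16_t i;
	unsigned long s;

	(void)vid;
	(void)queue_id;
	(void)opaque_data;

	for (i = 0; i < count; i++) {
		struct rte_vhost_iov_iter *src = descs[i].src;
		struct rte_vhost_iov_iter *dst = descs[i].dst;

		/* The enqueue path builds src and dst with equal segment counts. */
		assert(src->nr_segs == dst->nr_segs);

		for (s = 0; s < src->nr_segs; s++) {
			/* ASSUMPTION: iov_base is a virtual address in this toy. */
			memcpy(dst->iov[s].iov_base, src->iov[s].iov_base,
				src->iov[s].iov_len);
			sw_done_segs++;
		}
	}

	return count; /* number of descs processed */
}

/* Report completed iov segments, at most max_packets per call. */
static int
sw_check_completed_copies(int vid, uint16_t queue_id,
		struct rte_vhost_async_status *opaque_data,
		uint16_t max_packets)
{
	unsigned int n = sw_done_segs < max_packets ?
		sw_done_segs : max_packets;

	(void)vid;
	(void)queue_id;
	(void)opaque_data;

	sw_done_segs -= n;
	return (int)n;
}
```

In a real application the two functions would be placed in a struct rte_vhost_async_channel_ops and handed to rte_vhost_async_channel_register(); capping the completion count by max_packets is this sketch's reading of the header comment, not something the patch states explicitly.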
diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
index d5a5927..c8ad8db 100644
--- a/lib/librte_vhost/rte_vhost_async.h
+++ b/lib/librte_vhost/rte_vhost_async.h
@@ -133,4 +133,44 @@ int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
 __rte_experimental
 int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
 
+/**
+ * This function submits enqueue data to the async engine. It does not
+ * guarantee transfer completion upon return. Applications should poll
+ * the transfer status with rte_vhost_poll_enqueue_completed().
+ *
+ * @param vid
+ *  id of vhost device to enqueue data
+ * @param queue_id
+ *  queue id to enqueue data
+ * @param pkts
+ *  array of packets to be enqueued
+ * @param count
+ *  packets num to be enqueued
+ * @return
+ *  num of packets enqueued
+ */
+__rte_experimental
+uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count);
+
+/**
+ * This function checks the async completion status for a specific vhost
+ * device queue. Packets that have finished the copy (enqueue) operation
+ * are returned in an array.
+ *
+ * @param vid
+ *  id of vhost device to enqueue data
+ * @param queue_id
+ *  queue id to enqueue data
+ * @param pkts
+ *  blank array to get return packet pointer
+ * @param count
+ *  size of the packet array
+ * @return
+ *  num of packets returned
+ */
+__rte_experimental
+uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count);
+
 #endif /* _RTE_VHOST_ASYNC_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 751c1f3..a57250d 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -17,14 +17,15 @@
 #include <rte_arp.h>
 #include <rte_spinlock.h>
 #include <rte_malloc.h>
+#include <rte_vhost_async.h>
 
 #include "iotlb.h"
 #include "vhost.h"
 
-#define MAX_PKT_BURST 32
-
 #define MAX_BATCH_LEN 256
 
+#define VHOST_ASYNC_BATCH_THRESHOLD 32
+
 static  __rte_always_inline bool
 rxvq_is_mergeable(struct virtio_net *dev)
 {
@@ -117,6 +118,35 @@
 }
 
 static __rte_always_inline void
+async_flush_shadow_used_ring_split(struct virtio_net *dev,
+	struct vhost_virtqueue *vq)
+{
+	uint16_t used_idx = vq->last_used_idx & (vq->size - 1);
+
+	if (used_idx + vq->shadow_used_idx <= vq->size) {
+		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0,
+					  vq->shadow_used_idx);
+	} else {
+		uint16_t size;
+
+		/* update used ring interval [used_idx, vq->size] */
+		size = vq->size - used_idx;
+		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0, size);
+
+		/* update the left half used ring interval [0, left_size] */
+		do_flush_shadow_used_ring_split(dev, vq, 0, size,
+					  vq->shadow_used_idx - size);
+	}
+	vq->last_used_idx += vq->shadow_used_idx;
+
+	rte_smp_wmb();
+
+	vhost_log_cache_sync(dev, vq);
+
+	vq->shadow_used_idx = 0;
+}
+
+static __rte_always_inline void
 update_shadow_used_ring_split(struct vhost_virtqueue *vq,
 			 uint16_t desc_idx, uint32_t len)
 {
@@ -905,6 +935,200 @@
 	return error;
 }
 
+static __rte_always_inline void
+async_fill_vec(struct iovec *v, void *base, size_t len)
+{
+	v->iov_base = base;
+	v->iov_len = len;
+}
+
+static __rte_always_inline void
+async_fill_it(struct rte_vhost_iov_iter *it, size_t count,
+	struct iovec *vec, unsigned long nr_seg)
+{
+	it->offset = 0;
+	it->count = count;
+
+	if (count) {
+		it->iov = vec;
+		it->nr_segs = nr_seg;
+	} else {
+		it->iov = 0;
+		it->nr_segs = 0;
+	}
+}
+
+static __rte_always_inline void
+async_fill_des(struct rte_vhost_async_desc *desc,
+	struct rte_vhost_iov_iter *src, struct rte_vhost_iov_iter *dst)
+{
+	desc->src = src;
+	desc->dst = dst;
+}
+
+static __rte_always_inline int
+async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
+			struct rte_mbuf *m, struct buf_vector *buf_vec,
+			uint16_t nr_vec, uint16_t num_buffers,
+			struct iovec *src_iovec, struct iovec *dst_iovec,
+			struct rte_vhost_iov_iter *src_it,
+			struct rte_vhost_iov_iter *dst_it)
+{
+	uint32_t vec_idx = 0;
+	uint32_t mbuf_offset, mbuf_avail;
+	uint32_t buf_offset, buf_avail;
+	uint64_t buf_addr, buf_iova, buf_len;
+	uint32_t cpy_len, cpy_threshold;
+	uint64_t hdr_addr;
+	struct rte_mbuf *hdr_mbuf;
+	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
+	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
+	int error = 0;
+
+	uint32_t tlen = 0;
+	int tvec_idx = 0;
+	void *hpa;
+
+	if (unlikely(m == NULL)) {
+		error = -1;
+		goto out;
+	}
+
+	cpy_threshold = vq->async_threshold;
+
+	buf_addr = buf_vec[vec_idx].buf_addr;
+	buf_iova = buf_vec[vec_idx].buf_iova;
+	buf_len = buf_vec[vec_idx].buf_len;
+
+	if (unlikely(buf_len < dev->vhost_hlen && nr_vec <= 1)) {
+		error = -1;
+		goto out;
+	}
+
+	hdr_mbuf = m;
+	hdr_addr = buf_addr;
+	if (unlikely(buf_len < dev->vhost_hlen))
+		hdr = &tmp_hdr;
+	else
+		hdr = (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)hdr_addr;
+
+	VHOST_LOG_DATA(DEBUG, "(%d) RX: num merge buffers %d\n",
+		dev->vid, num_buffers);
+
+	if (unlikely(buf_len < dev->vhost_hlen)) {
+		buf_offset = dev->vhost_hlen - buf_len;
+		vec_idx++;
+		buf_addr = buf_vec[vec_idx].buf_addr;
+		buf_iova = buf_vec[vec_idx].buf_iova;
+		buf_len = buf_vec[vec_idx].buf_len;
+		buf_avail = buf_len - buf_offset;
+	} else {
+		buf_offset = dev->vhost_hlen;
+		buf_avail = buf_len - dev->vhost_hlen;
+	}
+
+	mbuf_avail  = rte_pktmbuf_data_len(m);
+	mbuf_offset = 0;
+
+	while (mbuf_avail != 0 || m->next != NULL) {
+		/* done with current buf, get the next one */
+		if (buf_avail == 0) {
+			vec_idx++;
+			if (unlikely(vec_idx >= nr_vec)) {
+				error = -1;
+				goto out;
+			}
+
+			buf_addr = buf_vec[vec_idx].buf_addr;
+			buf_iova = buf_vec[vec_idx].buf_iova;
+			buf_len = buf_vec[vec_idx].buf_len;
+
+			buf_offset = 0;
+			buf_avail  = buf_len;
+		}
+
+		/* done with current mbuf, get the next one */
+		if (mbuf_avail == 0) {
+			m = m->next;
+
+			mbuf_offset = 0;
+			mbuf_avail  = rte_pktmbuf_data_len(m);
+		}
+
+		if (hdr_addr) {
+			virtio_enqueue_offload(hdr_mbuf, &hdr->hdr);
+			if (rxvq_is_mergeable(dev))
+				ASSIGN_UNLESS_EQUAL(hdr->num_buffers,
+						num_buffers);
+
+			if (unlikely(hdr == &tmp_hdr)) {
+				copy_vnet_hdr_to_desc(dev, vq, buf_vec, hdr);
+			} else {
+				PRINT_PACKET(dev, (uintptr_t)hdr_addr,
+						dev->vhost_hlen, 0);
+				vhost_log_cache_write_iova(dev, vq,
+						buf_vec[0].buf_iova,
+						dev->vhost_hlen);
+			}
+
+			hdr_addr = 0;
+		}
+
+		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
+
+		if (unlikely(cpy_len >= cpy_threshold)) {
+			hpa = (void *)(uintptr_t)gpa_to_hpa(dev,
+					buf_iova + buf_offset, cpy_len);
+
+			if (unlikely(!hpa)) {
+				error = -1;
+				goto out;
+			}
+
+			async_fill_vec(src_iovec + tvec_idx,
+				(void *)(uintptr_t)rte_pktmbuf_iova_offset(m,
+						mbuf_offset), cpy_len);
+
+			async_fill_vec(dst_iovec + tvec_idx, hpa, cpy_len);
+
+			tlen += cpy_len;
+			tvec_idx++;
+		} else {
+			if (unlikely(vq->batch_copy_nb_elems >= vq->size)) {
+				rte_memcpy(
+				(void *)((uintptr_t)(buf_addr + buf_offset)),
+				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
+				cpy_len);
+
+				PRINT_PACKET(dev,
+					(uintptr_t)(buf_addr + buf_offset),
+					cpy_len, 0);
+			} else {
+				batch_copy[vq->batch_copy_nb_elems].dst =
+				(void *)((uintptr_t)(buf_addr + buf_offset));
+				batch_copy[vq->batch_copy_nb_elems].src =
+				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset);
+				batch_copy[vq->batch_copy_nb_elems].log_addr =
+					buf_iova + buf_offset;
+				batch_copy[vq->batch_copy_nb_elems].len =
+					cpy_len;
+				vq->batch_copy_nb_elems++;
+			}
+		}
+
+		mbuf_avail  -= cpy_len;
+		mbuf_offset += cpy_len;
+		buf_avail  -= cpy_len;
+		buf_offset += cpy_len;
+	}
+
+out:
+	async_fill_it(src_it, tlen, src_iovec, tvec_idx);
+	async_fill_it(dst_it, tlen, dst_iovec, tvec_idx);
+
+	return error;
+}
+
 static __rte_always_inline int
 vhost_enqueue_single_packed(struct virtio_net *dev,
 			    struct vhost_virtqueue *vq,
@@ -1236,6 +1460,332 @@
 	return virtio_dev_rx(dev, queue_id, pkts, count);
 }
 
+static __rte_always_inline uint16_t
+virtio_dev_rx_async_get_info_idx(uint16_t pkts_idx,
+	uint16_t vq_size, uint16_t n_inflight)
+{
+	return pkts_idx > n_inflight ? (pkts_idx - n_inflight) :
+		(vq_size - n_inflight + pkts_idx) & (vq_size - 1);
+}
+
+static __rte_always_inline void
+virtio_dev_rx_async_submit_split_err(struct virtio_net *dev,
+	struct vhost_virtqueue *vq, uint16_t queue_id,
+	uint16_t last_idx, uint16_t shadow_idx)
+{
+	uint16_t start_idx, pkts_idx, vq_size;
+	uint64_t *async_pending_info;
+
+	pkts_idx = vq->async_pkts_idx;
+	async_pending_info = vq->async_pending_info;
+	vq_size = vq->size;
+	start_idx = virtio_dev_rx_async_get_info_idx(pkts_idx,
+		vq_size, vq->async_pkts_inflight_n);
+
+	while (likely((start_idx & (vq_size - 1)) != pkts_idx)) {
+		uint64_t n_seg =
+			async_pending_info[(start_idx) & (vq_size - 1)] >>
+			ASYNC_PENDING_INFO_N_SFT;
+
+		while (n_seg)
+			n_seg -= vq->async_ops.check_completed_copies(dev->vid,
+				queue_id, 0, 1);
+	}
+
+	vq->async_pkts_inflight_n = 0;
+
+	vq->shadow_used_idx = shadow_idx;
+	vq->last_avail_idx = last_idx;
+}
+
+static __rte_noinline uint32_t
+virtio_dev_rx_async_submit_split(struct virtio_net *dev,
+	struct vhost_virtqueue *vq, uint16_t queue_id,
+	struct rte_mbuf **pkts, uint32_t count)
+{
+	uint32_t pkt_idx = 0, pkt_burst_idx = 0;
+	uint16_t num_buffers;
+	struct buf_vector buf_vec[BUF_VECTOR_MAX];
+	uint16_t avail_head, last_idx, shadow_idx;
+
+	struct rte_vhost_iov_iter *it_pool = vq->it_pool;
+	struct iovec *vec_pool = vq->vec_pool;
+	struct rte_vhost_async_desc tdes[MAX_PKT_BURST];
+	struct iovec *src_iovec = vec_pool;
+	struct iovec *dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
+	struct rte_vhost_iov_iter *src_it = it_pool;
+	struct rte_vhost_iov_iter *dst_it = it_pool + 1;
+	uint16_t n_free_slot, slot_idx;
+	int n_pkts = 0;
+
+	avail_head = __atomic_load_n(&vq->avail->idx, __ATOMIC_ACQUIRE);
+	last_idx = vq->last_avail_idx;
+	shadow_idx = vq->shadow_used_idx;
+
+	/*
+	 * The ordering between avail index and
+	 * desc reads needs to be enforced.
+	 */
+	rte_smp_rmb();
+
+	rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
+
+	for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
+		uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
+		uint16_t nr_vec = 0;
+
+		if (unlikely(reserve_avail_buf_split(dev, vq,
+						pkt_len, buf_vec, &num_buffers,
+						avail_head, &nr_vec) < 0)) {
+			VHOST_LOG_DATA(DEBUG,
+				"(%d) failed to get enough desc from vring\n",
+				dev->vid);
+			vq->shadow_used_idx -= num_buffers;
+			break;
+		}
+
+		VHOST_LOG_DATA(DEBUG, "(%d) current index %d | end index %d\n",
+			dev->vid, vq->last_avail_idx,
+			vq->last_avail_idx + num_buffers);
+
+		if (async_mbuf_to_desc(dev, vq, pkts[pkt_idx],
+				buf_vec, nr_vec, num_buffers,
+				src_iovec, dst_iovec, src_it, dst_it) < 0) {
+			vq->shadow_used_idx -= num_buffers;
+			break;
+		}
+
+		slot_idx = (vq->async_pkts_idx + pkt_idx) & (vq->size - 1);
+		if (src_it->count) {
+			async_fill_des(&tdes[pkt_burst_idx], src_it, dst_it);
+			pkt_burst_idx++;
+			vq->async_pending_info[slot_idx] =
+				num_buffers | (src_it->nr_segs << 16);
+			src_iovec += src_it->nr_segs;
+			dst_iovec += dst_it->nr_segs;
+			src_it += 2;
+			dst_it += 2;
+		} else {
+			vq->async_pending_info[slot_idx] = num_buffers;
+			vq->async_pkts_inflight_n++;
+		}
+
+		vq->last_avail_idx += num_buffers;
+
+		if (pkt_burst_idx >= VHOST_ASYNC_BATCH_THRESHOLD ||
+				(pkt_idx == count - 1 && pkt_burst_idx)) {
+			n_pkts = vq->async_ops.transfer_data(dev->vid,
+					queue_id, tdes, 0, pkt_burst_idx);
+			src_iovec = vec_pool;
+			dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
+			src_it = it_pool;
+			dst_it = it_pool + 1;
+
+			if (unlikely(n_pkts < (int)pkt_burst_idx)) {
+				vq->async_pkts_inflight_n +=
+					n_pkts > 0 ? n_pkts : 0;
+				virtio_dev_rx_async_submit_split_err(dev,
+					vq, queue_id, last_idx, shadow_idx);
+				return 0;
+			}
+
+			pkt_burst_idx = 0;
+			vq->async_pkts_inflight_n += n_pkts;
+		}
+	}
+
+	if (pkt_burst_idx) {
+		n_pkts = vq->async_ops.transfer_data(dev->vid,
+				queue_id, tdes, 0, pkt_burst_idx);
+		if (unlikely(n_pkts < (int)pkt_burst_idx)) {
+			vq->async_pkts_inflight_n += n_pkts > 0 ? n_pkts : 0;
+			virtio_dev_rx_async_submit_split_err(dev, vq, queue_id,
+				last_idx, shadow_idx);
+			return 0;
+		}
+
+		vq->async_pkts_inflight_n += n_pkts;
+	}
+
+	do_data_copy_enqueue(dev, vq);
+
+	n_free_slot = vq->size - vq->async_pkts_idx;
+	if (n_free_slot > pkt_idx) {
+		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
+			pkts, pkt_idx * sizeof(uintptr_t));
+		vq->async_pkts_idx += pkt_idx;
+	} else {
+		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
+			pkts, n_free_slot * sizeof(uintptr_t));
+		rte_memcpy(&vq->async_pkts_pending[0],
+			&pkts[n_free_slot],
+			(pkt_idx - n_free_slot) * sizeof(uintptr_t));
+		vq->async_pkts_idx = pkt_idx - n_free_slot;
+	}
+
+	if (likely(vq->shadow_used_idx))
+		async_flush_shadow_used_ring_split(dev, vq);
+
+	return pkt_idx;
+}
+
+uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count)
+{
+	struct virtio_net *dev = get_device(vid);
+	struct vhost_virtqueue *vq;
+	uint16_t n_pkts_cpl, n_pkts_put = 0, n_descs = 0;
+	uint16_t start_idx, pkts_idx, vq_size;
+	uint64_t *async_pending_info;
+
+	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
+	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
+		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
+			dev->vid, __func__, queue_id);
+		return 0;
+	}
+
+	vq = dev->virtqueue[queue_id];
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	pkts_idx = vq->async_pkts_idx;
+	async_pending_info = vq->async_pending_info;
+	vq_size = vq->size;
+	start_idx = virtio_dev_rx_async_get_info_idx(pkts_idx,
+		vq_size, vq->async_pkts_inflight_n);
+
+	n_pkts_cpl =
+		vq->async_ops.check_completed_copies(vid, queue_id, 0, count);
+
+	rte_smp_wmb();
+
+	while (likely(((start_idx + n_pkts_put) & (vq_size - 1)) != pkts_idx)) {
+		uint64_t info = async_pending_info[
+			(start_idx + n_pkts_put) & (vq_size - 1)];
+		uint64_t n_segs;
+		n_pkts_put++;
+		n_descs += info & ASYNC_PENDING_INFO_N_MSK;
+		n_segs = info >> ASYNC_PENDING_INFO_N_SFT;
+
+		if (n_segs) {
+			if (!n_pkts_cpl || n_pkts_cpl < n_segs) {
+				n_pkts_put--;
+				n_descs -= info & ASYNC_PENDING_INFO_N_MSK;
+				if (n_pkts_cpl) {
+					async_pending_info[
+						(start_idx + n_pkts_put) &
+						(vq_size - 1)] =
+					((n_segs - n_pkts_cpl) <<
+					 ASYNC_PENDING_INFO_N_SFT) |
+					(info & ASYNC_PENDING_INFO_N_MSK);
+					n_pkts_cpl = 0;
+				}
+				break;
+			}
+			n_pkts_cpl -= n_segs;
+		}
+	}
+
+	if (n_pkts_put) {
+		vq->async_pkts_inflight_n -= n_pkts_put;
+		__atomic_add_fetch(&vq->used->idx, n_descs, __ATOMIC_RELEASE);
+
+		vhost_vring_call_split(dev, vq);
+	}
+
+	if (start_idx + n_pkts_put <= vq_size) {
+		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
+			n_pkts_put * sizeof(uintptr_t));
+	} else {
+		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
+			(vq_size - start_idx) * sizeof(uintptr_t));
+		rte_memcpy(&pkts[vq_size - start_idx], vq->async_pkts_pending,
+			(n_pkts_put - vq_size + start_idx) * sizeof(uintptr_t));
+	}
+
+	rte_spinlock_unlock(&vq->access_lock);
+
+	return n_pkts_put;
+}
+
+static __rte_always_inline uint32_t
+virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id,
+	struct rte_mbuf **pkts, uint32_t count)
+{
+	struct vhost_virtqueue *vq;
+	uint32_t nb_tx = 0;
+	bool drawback = false;
+
+	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
+	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
+		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
+			dev->vid, __func__, queue_id);
+		return 0;
+	}
+
+	vq = dev->virtqueue[queue_id];
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	if (unlikely(vq->enabled == 0))
+		goto out_access_unlock;
+
+	if (unlikely(!vq->async_registered)) {
+		drawback = true;
+		goto out_access_unlock;
+	}
+
+	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+		vhost_user_iotlb_rd_lock(vq);
+
+	if (unlikely(vq->access_ok == 0))
+		if (unlikely(vring_translate(dev, vq) < 0))
+			goto out;
+
+	count = RTE_MIN((uint32_t)MAX_PKT_BURST, count);
+	if (count == 0)
+		goto out;
+
+	/* TODO: packed queue not implemented */
+	if (vq_is_packed(dev))
+		nb_tx = 0;
+	else
+		nb_tx = virtio_dev_rx_async_submit_split(dev,
+				vq, queue_id, pkts, count);
+
+out:
+	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+		vhost_user_iotlb_rd_unlock(vq);
+
+out_access_unlock:
+	rte_spinlock_unlock(&vq->access_lock);
+
+	if (drawback)
+		return rte_vhost_enqueue_burst(dev->vid, queue_id, pkts, count);
+
+	return nb_tx;
+}
+
+uint16_t
+rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (!dev)
+		return 0;
+
+	if (unlikely(!(dev->flags & VIRTIO_DEV_BUILTIN_VIRTIO_NET))) {
+		VHOST_LOG_DATA(ERR,
+			"(%d) %s: built-in vhost net backend is disabled.\n",
+			dev->vid, __func__);
+		return 0;
+	}
+
+	return virtio_dev_rx_async_submit(dev, queue_id, pkts, count);
+}
+
 static inline bool
 virtio_net_with_host_offload(struct virtio_net *dev)
 {
-- 
1.8.3.1
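
For readers tracing the bookkeeping in this patch: each slot of async_pending_info packs the descriptor count into the low 16 bits and the number of outstanding iov segments above them, and the oldest in-flight slot is found by stepping back n_inflight entries from async_pkts_idx in a power-of-two ring. The sketch below restates both pieces in isolation. The two macros and the index expression are taken from the patch; the helper names (pending_info_pack, pending_info_descs, pending_info_segs, inflight_start_idx) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ASYNC_PENDING_INFO_N_MSK 0xFFFF
#define ASYNC_PENDING_INFO_N_SFT 16

/* Pack: low 16 bits = used descriptors, upper bits = pending iov segments. */
static inline uint64_t
pending_info_pack(uint16_t n_descs, uint64_t n_segs)
{
	return (uint64_t)n_descs | (n_segs << ASYNC_PENDING_INFO_N_SFT);
}

static inline uint16_t
pending_info_descs(uint64_t info)
{
	return (uint16_t)(info & ASYNC_PENDING_INFO_N_MSK);
}

static inline uint64_t
pending_info_segs(uint64_t info)
{
	return info >> ASYNC_PENDING_INFO_N_SFT;
}

/*
 * Index of the oldest in-flight packet, same expression as
 * virtio_dev_rx_async_get_info_idx() in the patch. The mask only
 * works when vq_size is a power of two.
 */
static inline uint16_t
inflight_start_idx(uint16_t pkts_idx, uint16_t vq_size, uint16_t n_inflight)
{
	assert(vq_size != 0 && (vq_size & (vq_size - 1)) == 0);

	return pkts_idx > n_inflight ? (pkts_idx - n_inflight) :
		(vq_size - n_inflight + pkts_idx) & (vq_size - 1);
}
```

For example, with a 256-entry ring, a write index of 2 and 5 packets in flight, the oldest entry sits at slot 253, which is the wrap-around case the masked branch handles.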
* Re: [dpdk-dev] [PATCH v4 1/2] vhost: introduce async enqueue registration API
  2020-07-03 12:21   ` [dpdk-dev] [PATCH v4 1/2] vhost: introduce async enqueue registration API patrick.fu
@ 2020-07-06  3:05     ` Liu, Yong
  2020-07-06  9:08       ` Fu, Patrick
  0 siblings, 1 reply; 36+ messages in thread
From: Liu, Yong @ 2020-07-06  3:05 UTC (permalink / raw)
  To: Fu, Patrick, dev, maxime.coquelin, Xia, Chenbo, Wang, Zhihong
  Cc: Fu, Patrick, Wang, Yinan, Jiang, Cheng1, Liang, Cunming
Hi Patrick,
Few comments are inline,  others are fine to me.
Regards,
Marvin
> diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> index 0d822d6..58ee3ef 100644
> --- a/lib/librte_vhost/vhost.c
> +++ b/lib/librte_vhost/vhost.c
> @@ -332,8 +332,13 @@
>  {
>  	if (vq_is_packed(dev))
>  		rte_free(vq->shadow_used_packed);
> -	else
> +	else {
>  		rte_free(vq->shadow_used_split);
> +		if (vq->async_pkts_pending)
> +			rte_free(vq->async_pkts_pending);
> +		if (vq->async_pending_info)
> +			rte_free(vq->async_pending_info);
The pointers are not set to NULL here, and the async feature fields are not reset to 0.
> +int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id)
> +{
> +	struct vhost_virtqueue *vq;
> +	struct virtio_net *dev = get_device(vid);
> +	int ret = -1;
> +
> +	if (dev == NULL)
> +		return ret;
> +
> +	vq = dev->virtqueue[queue_id];
> +
> +	if (vq == NULL)
> +		return ret;
> +
> +	ret = 0;
> +	rte_spinlock_lock(&vq->access_lock);
> +
> +	if (!vq->async_registered)
> +		goto out;
> +
> +	if (vq->async_pkts_inflight_n) {
> +		VHOST_LOG_CONFIG(ERR, "Failed to unregister async channel. "
> +			"async inflight packets must be completed before unregistration.\n");
> +		ret = -1;
> +		goto out;
> +	}
> +
> +	if (vq->async_pkts_pending) {
> +		rte_free(vq->async_pkts_pending);
> +		vq->async_pkts_pending = 0;
> +	}
> +
> +	if (vq->async_pending_info) {
> +		rte_free(vq->async_pending_info);
> +		vq->async_pending_info = 0;
> +	}
> +
Please unify the check-and-free logic for the async pending pointers, and set the pointers to NULL rather than 0.
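Something along these lines, as a rough sketch only. The struct and helper names are hypothetical and not part of the patch, and free() stands in for rte_free(), which likewise does nothing when passed NULL, so no separate pointer check is needed:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the two async buffers kept in
 * struct vhost_virtqueue; only the fields relevant here are shown.
 */
struct vq_async_bufs {
	uintptr_t **async_pkts_pending;
	uint64_t *async_pending_info;
};

/*
 * Release both async buffers and reset the pointers to NULL.
 * free() is used in place of rte_free() to keep the sketch
 * self-contained; both accept NULL, so the call is safe to repeat.
 */
static void
vq_async_bufs_release(struct vq_async_bufs *vq)
{
	free(vq->async_pkts_pending);
	vq->async_pkts_pending = NULL;

	free(vq->async_pending_info);
	vq->async_pending_info = NULL;
}
```

A single helper like this could then be shared by every path that frees these buffers, so the check, free and reset steps cannot drift apart.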
> +	vq->async_ops.transfer_data = NULL;
> +	vq->async_ops.check_completed_copies = NULL;
> +	vq->async_registered = false;
> +
> +out:
> +	rte_spinlock_unlock(&vq->access_lock);
> +
> +	return ret;
> +}
> +
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v4 1/2] vhost: introduce async enqueue registration API
  2020-07-06  3:05     ` Liu, Yong
@ 2020-07-06  9:08       ` Fu, Patrick
  0 siblings, 0 replies; 36+ messages in thread
From: Fu, Patrick @ 2020-07-06  9:08 UTC (permalink / raw)
  To: Liu, Yong, dev, maxime.coquelin, Xia, Chenbo, Wang, Zhihong
  Cc: Wang, Yinan, Jiang, Cheng1, Liang, Cunming
Hi,
> -----Original Message-----
> From: Liu, Yong <yong.liu@intel.com>
> Sent: Monday, July 6, 2020 11:06 AM
> To: Fu, Patrick <patrick.fu@intel.com>; dev@dpdk.org;
> maxime.coquelin@redhat.com; Xia, Chenbo <chenbo.xia@intel.com>; Wang,
> Zhihong <zhihong.wang@intel.com>
> Cc: Fu, Patrick <patrick.fu@intel.com>; Wang, Yinan
> <yinan.wang@intel.com>; Jiang, Cheng1 <cheng1.jiang@intel.com>; Liang,
> Cunming <cunming.liang@intel.com>
> Subject: RE: [dpdk-dev] [PATCH v4 1/2] vhost: introduce async enqueue
> registration API
> 
> Hi Patrick,
> A few comments are inline; the others look fine to me.
> 
> Regards,
> Marvin
> 
> > diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> > index 0d822d6..58ee3ef 100644
> > --- a/lib/librte_vhost/vhost.c
> > +++ b/lib/librte_vhost/vhost.c
> > @@ -332,8 +332,13 @@
> >  {
> >  	if (vq_is_packed(dev))
> >  		rte_free(vq->shadow_used_packed);
> > -	else
> > +	else {
> >  		rte_free(vq->shadow_used_split);
> > +		if (vq->async_pkts_pending)
> > +			rte_free(vq->async_pkts_pending);
> > +		if (vq->async_pending_info)
> > +			rte_free(vq->async_pending_info);
> 
> The pointers are not reset and the feature fields are not set to 0 after the free.
> 
This free is part of the vq free operation itself (the vq is released right afterwards), so it should be safe to leave the vq member pointers as they are.
> > +	if (vq->async_pkts_pending) {
> > +		rte_free(vq->async_pkts_pending);
> > +		vq->async_pkts_pending = 0;
> > +	}
> > +
> > +	if (vq->async_pending_info) {
> > +		rte_free(vq->async_pending_info);
> > +		vq->async_pending_info = 0;
> > +	}
> > +
> 
> Please unify the check-and-free logic for the async pending pointers,
> and set the pointers to NULL rather than 0.
> 
Will change it in the next patch version
Thanks,
Patrick
^ permalink raw reply	[flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v5 0/2] introduce asynchronous data path for vhost
  2020-06-11 10:02 [dpdk-dev] [PATCH v1 0/2] introduce asynchronous data path for vhost patrick.fu
                   ` (4 preceding siblings ...)
  2020-07-03 12:21 ` [dpdk-dev] [PATCH v4 0/2] introduce asynchronous data path for vhost patrick.fu
@ 2020-07-06 11:53 ` patrick.fu
  2020-07-06 11:53   ` [dpdk-dev] [PATCH v5 1/2] vhost: introduce async enqueue registration API patrick.fu
  2020-07-06 11:53   ` [dpdk-dev] [PATCH v5 2/2] vhost: introduce async enqueue for split ring patrick.fu
  2020-07-07  5:07 ` [dpdk-dev] [PATCH v6 0/2] introduce asynchronous data path for vhost patrick.fu
  6 siblings, 2 replies; 36+ messages in thread
From: patrick.fu @ 2020-07-06 11:53 UTC (permalink / raw)
  To: dev, maxime.coquelin, chenbo.xia, zhihong.wang
  Cc: patrick.fu, yinan.wang, cheng1.jiang, cunming.liang
From: Patrick Fu <patrick.fu@intel.com>
Performing large memory copies usually takes up a major part of CPU
cycles and becomes the hot spot in the vhost-user enqueue operation. To
offload expensive memory operations from the CPU, this patch set
proposes to leverage DMA engines, e.g., I/OAT, a DMA engine in Intel
processors, to accelerate large copies.
Large copies are offloaded from the CPU to the DMA engine in an
asynchronous manner. The CPU just submits copy jobs to the DMA engine
without waiting for their completion. Thus, there is no CPU
intervention during data transfer; we can save precious CPU cycles and
improve the overall throughput for vhost-user based applications, like
OVS. During packet transmission, the async data path offloads large
copies to the DMA engine and performs small copies on the CPU, due to
the startup overheads associated with the DMA engine.
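As a rough illustration of the large/small split described above, the per-segment decision can be sketched as a plain length comparison. The enum and function names are hypothetical and not part of the patch set; the comparison itself mirrors the "cpy_len >= cpy_threshold" check in async_mbuf_to_desc() in patch 2/2:

```c
#include <stdint.h>

/* Hypothetical labels for the two copy paths. */
enum copy_path {
	COPY_BY_CPU,	/* small copy: done (or batched) on the CPU */
	COPY_BY_DMA	/* large copy: handed to the async engine */
};

/*
 * Pick the copy path for one segment. Segments at or above the
 * threshold go to the DMA engine; shorter ones stay on the CPU,
 * since DMA startup overhead would outweigh the benefit.
 */
static enum copy_path
choose_copy_path(uint32_t cpy_len, uint32_t threshold)
{
	return cpy_len >= threshold ? COPY_BY_DMA : COPY_BY_CPU;
}
```

With a threshold of 256 bytes, for instance, a 64-byte segment would stay on the CPU while a 1500-byte segment would be offloaded.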
This patch set constructs a general framework that applications can
leverage to attach DMA channels to vhost-user transmit queues. Four
new RTE APIs are introduced to the vhost library for applications to
register and use the asynchronous data path. In addition, two new DMA
operation callbacks are defined, by which the vhost-user asynchronous
data path can interact with DMA hardware. Currently only the enqueue
operation for split queue is implemented, but the framework can be
extended to support packed queue.
v2:
update meson file for new header file
update rte_vhost_version.map to include new APIs
rename async APIs/structures to be prefixed with "rte_vhost"
rename some variables/structures for readability
correct minor typo in comments/license statements
refine memory allocation logic for vq internal buffer
add error message printing in some failure cases
check inflight async packets in unregistration API call
mark new APIs as experimental
v3:
use atomic_xxx() functions in updating ring index
fix a bug in async enqueue failure handling
v4:
part of the fix intended in the v3 patch was missed; this patch
adds all those fixes
v5:
minor changes on some function/variable names
reset CPU batch copy packet count when async enqueue error
occurs
disable virtio log feature in async copy mode
minor optimization on async shadow index flush
Patrick Fu (2):
  vhost: introduce async enqueue registration API
  vhost: introduce async enqueue for split ring
 lib/librte_vhost/Makefile              |   2 +-
 lib/librte_vhost/meson.build           |   2 +-
 lib/librte_vhost/rte_vhost.h           |   1 +
 lib/librte_vhost/rte_vhost_async.h     | 176 ++++++++
 lib/librte_vhost/rte_vhost_version.map |   4 +
 lib/librte_vhost/socket.c              |  27 ++
 lib/librte_vhost/vhost.c               | 127 +++++-
 lib/librte_vhost/vhost.h               |  30 +-
 lib/librte_vhost/vhost_user.c          |  23 +-
 lib/librte_vhost/virtio_net.c          | 551 ++++++++++++++++++++++++-
 10 files changed, 934 insertions(+), 9 deletions(-)
 create mode 100644 lib/librte_vhost/rte_vhost_async.h
-- 
2.18.4
^ permalink raw reply	[flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v5 1/2] vhost: introduce async enqueue registration API
  2020-07-06 11:53 ` [dpdk-dev] [PATCH v5 0/2] introduce asynchronous data path for vhost patrick.fu
@ 2020-07-06 11:53   ` patrick.fu
  2020-07-06 11:53   ` [dpdk-dev] [PATCH v5 2/2] vhost: introduce async enqueue for split ring patrick.fu
  1 sibling, 0 replies; 36+ messages in thread
From: patrick.fu @ 2020-07-06 11:53 UTC (permalink / raw)
  To: dev, maxime.coquelin, chenbo.xia, zhihong.wang
  Cc: patrick.fu, yinan.wang, cheng1.jiang, cunming.liang
From: Patrick Fu <patrick.fu@intel.com>
This patch introduces registration/un-registration APIs
for the vhost async data enqueue operation. Together with
the registration API implementations, the data structures
and async callback functions required for the async enqueue
data path are also defined.
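For reference, a minimal sketch of how an application might build the 32-bit "features" word passed to rte_vhost_async_channel_register(), following the bit layout documented in rte_vhost_async.h (b0: in-order, b16 - b27: packet length threshold). The macro and helper names are hypothetical; explicit shifts are used here instead of the bit-field struct only to keep the sketch independent of compiler bit-field layout:

```c
#include <stdint.h>

/* Hypothetical names for the documented bit positions. */
#define ASYNC_INORDER_BIT	(1u << 0)	/* b0 */
#define ASYNC_THRESHOLD_SHIFT	16		/* b16 - b27 */
#define ASYNC_THRESHOLD_MASK	0xFFFu		/* 12-bit field */

/* Pack the in-order flag and the copy-length threshold. */
static uint32_t
pack_async_features(int inorder, uint32_t threshold)
{
	uint32_t features = 0;

	if (inorder)
		features |= ASYNC_INORDER_BIT;
	/* values wider than 12 bits are truncated by the mask */
	features |= (threshold & ASYNC_THRESHOLD_MASK)
			<< ASYNC_THRESHOLD_SHIFT;

	return features;
}

/* Recover the threshold from a packed features word. */
static uint32_t
unpack_async_threshold(uint32_t features)
{
	return (features >> ASYNC_THRESHOLD_SHIFT) & ASYNC_THRESHOLD_MASK;
}
```

Since the threshold field is 12 bits wide, the largest value it can carry is 4095.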
Signed-off-by: Patrick Fu <patrick.fu@intel.com>
---
 lib/librte_vhost/Makefile              |   2 +-
 lib/librte_vhost/meson.build           |   2 +-
 lib/librte_vhost/rte_vhost.h           |   1 +
 lib/librte_vhost/rte_vhost_async.h     | 136 +++++++++++++++++++++++++
 lib/librte_vhost/rte_vhost_version.map |   4 +
 lib/librte_vhost/socket.c              |  27 +++++
 lib/librte_vhost/vhost.c               | 127 ++++++++++++++++++++++-
 lib/librte_vhost/vhost.h               |  30 +++++-
 lib/librte_vhost/vhost_user.c          |  23 ++++-
 9 files changed, 345 insertions(+), 7 deletions(-)
 create mode 100644 lib/librte_vhost/rte_vhost_async.h
diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
index b7ff7dc4b..4f2f3e47d 100644
--- a/lib/librte_vhost/Makefile
+++ b/lib/librte_vhost/Makefile
@@ -42,7 +42,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h \
-						rte_vdpa_dev.h
+						rte_vdpa_dev.h rte_vhost_async.h
 
 # only compile vhost crypto when cryptodev is enabled
 ifeq ($(CONFIG_RTE_LIBRTE_CRYPTODEV),y)
diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
index 882a0eaf4..cc9aa65c6 100644
--- a/lib/librte_vhost/meson.build
+++ b/lib/librte_vhost/meson.build
@@ -22,5 +22,5 @@ sources = files('fd_man.c', 'iotlb.c', 'socket.c', 'vdpa.c',
 		'vhost.c', 'vhost_user.c',
 		'virtio_net.c', 'vhost_crypto.c')
 headers = files('rte_vhost.h', 'rte_vdpa.h', 'rte_vdpa_dev.h',
-		'rte_vhost_crypto.h')
+		'rte_vhost_crypto.h', 'rte_vhost_async.h')
 deps += ['ethdev', 'cryptodev', 'hash', 'pci']
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index 8a5c332c8..f93f9595a 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -35,6 +35,7 @@ extern "C" {
 #define RTE_VHOST_USER_EXTBUF_SUPPORT	(1ULL << 5)
 /* support only linear buffers (no chained mbufs) */
 #define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
+#define RTE_VHOST_USER_ASYNC_COPY	(1ULL << 7)
 
 /* Features. */
 #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
new file mode 100644
index 000000000..d5a59279a
--- /dev/null
+++ b/lib/librte_vhost/rte_vhost_async.h
@@ -0,0 +1,136 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_VHOST_ASYNC_H_
+#define _RTE_VHOST_ASYNC_H_
+
+#include "rte_vhost.h"
+
+/**
+ * iovec iterator
+ */
+struct rte_vhost_iov_iter {
+	/** offset to the first byte of interesting data */
+	size_t offset;
+	/** total bytes of data in this iterator */
+	size_t count;
+	/** pointer to the iovec array */
+	struct iovec *iov;
+	/** number of iovec in this iterator */
+	unsigned long nr_segs;
+};
+
+/**
+ * dma transfer descriptor pair
+ */
+struct rte_vhost_async_desc {
+	/** source memory iov_iter */
+	struct rte_vhost_iov_iter *src;
+	/** destination memory iov_iter */
+	struct rte_vhost_iov_iter *dst;
+};
+
+/**
+ * dma transfer status
+ */
+struct rte_vhost_async_status {
+	/** An array of application specific data for source memory */
+	uintptr_t *src_opaque_data;
+	/** An array of application specific data for destination memory */
+	uintptr_t *dst_opaque_data;
+};
+
+/**
+ * dma operation callbacks to be implemented by applications
+ */
+struct rte_vhost_async_channel_ops {
+	/**
+	 * instruct async engines to perform copies for a batch of packets
+	 *
+	 * @param vid
+	 *  id of vhost device to perform data copies
+	 * @param queue_id
+	 *  queue id to perform data copies
+	 * @param descs
+	 *  an array of DMA transfer memory descriptors
+	 * @param opaque_data
+	 *  opaque data pair sending to DMA engine
+	 * @param count
+	 *  number of elements in the "descs" array
+	 * @return
+	 *  -1 on failure, number of descs processed on success
+	 */
+	int (*transfer_data)(int vid, uint16_t queue_id,
+		struct rte_vhost_async_desc *descs,
+		struct rte_vhost_async_status *opaque_data,
+		uint16_t count);
+	/**
+	 * check copy-completed packets from the async engine
+	 * @param vid
+	 *  id of vhost device to check copy completion
+	 * @param queue_id
+	 *  queue id to check copy completion
+	 * @param opaque_data
+	 *  buffer to receive the opaque data pair from DMA engine
+	 * @param max_packets
+	 *  max number of packets could be completed
+	 * @return
+	 *  -1 on failure, number of iov segments completed on success
+	 */
+	int (*check_completed_copies)(int vid, uint16_t queue_id,
+		struct rte_vhost_async_status *opaque_data,
+		uint16_t max_packets);
+};
+
+/**
+ *  dma channel feature bit definition
+ */
+struct rte_vhost_async_features {
+	union {
+		uint32_t intval;
+		struct {
+			uint32_t async_inorder:1;
+			uint32_t resvd_0:15;
+			uint32_t async_threshold:12;
+			uint32_t resvd_1:4;
+		};
+	};
+};
+
+/**
+ * register an async channel for vhost
+ *
+ * @param vid
+ *  vhost device id async channel to be attached to
+ * @param queue_id
+ *  vhost queue id async channel to be attached to
+ * @param features
+ *  DMA channel feature bit
+ *    b0       : DMA supports inorder data transfer
+ *    b1  - b15: reserved
+ *    b16 - b27: Packet length threshold for DMA transfer
+ *    b28 - b31: reserved
+ * @param ops
+ *  DMA operation callbacks
+ * @return
+ *  0 on success, -1 on failures
+ */
+__rte_experimental
+int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
+	uint32_t features, struct rte_vhost_async_channel_ops *ops);
+
+/**
+ * unregister a dma channel for vhost
+ *
+ * @param vid
+ *  vhost device id DMA channel to be detached
+ * @param queue_id
+ *  vhost queue id DMA channel to be detached
+ * @return
+ *  0 on success, -1 on failures
+ */
+__rte_experimental
+int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
+
+#endif /* _RTE_VHOST_ASYNC_H_ */
diff --git a/lib/librte_vhost/rte_vhost_version.map b/lib/librte_vhost/rte_vhost_version.map
index 86784405a..13ec53b63 100644
--- a/lib/librte_vhost/rte_vhost_version.map
+++ b/lib/librte_vhost/rte_vhost_version.map
@@ -71,4 +71,8 @@ EXPERIMENTAL {
 	rte_vdpa_get_queue_num;
 	rte_vdpa_get_features;
 	rte_vdpa_get_protocol_features;
+	rte_vhost_async_channel_register;
+	rte_vhost_async_channel_unregister;
+	rte_vhost_submit_enqueue_burst;
+	rte_vhost_poll_enqueue_completed;
 };
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 49267cebf..c4626d2c4 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -42,6 +42,7 @@ struct vhost_user_socket {
 	bool use_builtin_virtio_net;
 	bool extbuf;
 	bool linearbuf;
+	bool async_copy;
 
 	/*
 	 * The "supported_features" indicates the feature bits the
@@ -205,6 +206,7 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 	size_t size;
 	struct vhost_user_connection *conn;
 	int ret;
+	struct virtio_net *dev;
 
 	if (vsocket == NULL)
 		return;
@@ -236,6 +238,13 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 	if (vsocket->linearbuf)
 		vhost_enable_linearbuf(vid);
 
+	if (vsocket->async_copy) {
+		dev = get_device(vid);
+
+		if (dev)
+			dev->async_copy = 1;
+	}
+
 	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
 
 	if (vsocket->notify_ops->new_connection) {
@@ -881,6 +890,17 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
 		goto out_mutex;
 	}
 
+	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
+
+	if (vsocket->async_copy &&
+		(flags & (RTE_VHOST_USER_IOMMU_SUPPORT |
+		RTE_VHOST_USER_POSTCOPY_SUPPORT))) {
+		VHOST_LOG_CONFIG(ERR, "error: enabling async copy and IOMMU "
+			"or post-copy feature simultaneously is not "
+			"supported\n");
+		goto out_mutex;
+	}
+
 	/*
 	 * Set the supported features correctly for the builtin vhost-user
 	 * net driver.
@@ -931,6 +951,13 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
 			~(1ULL << VHOST_USER_PROTOCOL_F_PAGEFAULT);
 	}
 
+	if (vsocket->async_copy) {
+		vsocket->supported_features &= ~(1ULL << VHOST_F_LOG_ALL);
+		vsocket->features &= ~(1ULL << VHOST_F_LOG_ALL);
+		VHOST_LOG_CONFIG(INFO,
+			"Logging feature is disabled in async copy mode\n");
+	}
+
 	/*
 	 * We'll not be able to receive a buffer from guest in linear mode
 	 * without external buffer if it will not fit in a single mbuf, which is
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 0d822d6a3..a11385f39 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -332,8 +332,13 @@ free_vq(struct virtio_net *dev, struct vhost_virtqueue *vq)
 {
 	if (vq_is_packed(dev))
 		rte_free(vq->shadow_used_packed);
-	else
+	else {
 		rte_free(vq->shadow_used_split);
+		if (vq->async_pkts_pending)
+			rte_free(vq->async_pkts_pending);
+		if (vq->async_pending_info)
+			rte_free(vq->async_pending_info);
+	}
 	rte_free(vq->batch_copy_elems);
 	rte_mempool_free(vq->iotlb_pool);
 	rte_free(vq);
@@ -1522,3 +1527,123 @@ RTE_INIT(vhost_log_init)
 	if (vhost_data_log_level >= 0)
 		rte_log_set_level(vhost_data_log_level, RTE_LOG_WARNING);
 }
+
+int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
+					uint32_t features,
+					struct rte_vhost_async_channel_ops *ops)
+{
+	struct vhost_virtqueue *vq;
+	struct virtio_net *dev = get_device(vid);
+	struct rte_vhost_async_features f;
+
+	if (dev == NULL || ops == NULL)
+		return -1;
+
+	f.intval = features;
+
+	vq = dev->virtqueue[queue_id];
+
+	if (unlikely(vq == NULL || !dev->async_copy))
+		return -1;
+
+	/* packed queue is not supported */
+	if (unlikely(vq_is_packed(dev) || !f.async_inorder)) {
+		VHOST_LOG_CONFIG(ERR,
+			"async copy is not supported on packed queue or non-inorder mode "
+			"(vid %d, qid: %d)\n", vid, queue_id);
+		return -1;
+	}
+
+	if (unlikely(ops->check_completed_copies == NULL ||
+		ops->transfer_data == NULL))
+		return -1;
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	if (unlikely(vq->async_registered)) {
+		VHOST_LOG_CONFIG(ERR,
+			"async register failed: channel already registered "
+			"(vid %d, qid: %d)\n", vid, queue_id);
+		goto reg_out;
+	}
+
+	vq->async_pkts_pending = rte_malloc(NULL,
+			vq->size * sizeof(uintptr_t),
+			RTE_CACHE_LINE_SIZE);
+	vq->async_pending_info = rte_malloc(NULL,
+			vq->size * sizeof(uint64_t),
+			RTE_CACHE_LINE_SIZE);
+	if (!vq->async_pkts_pending || !vq->async_pending_info) {
+		if (vq->async_pkts_pending)
+			rte_free(vq->async_pkts_pending);
+
+		if (vq->async_pending_info)
+			rte_free(vq->async_pending_info);
+
+		VHOST_LOG_CONFIG(ERR,
+				"async register failed: cannot allocate memory for vq data "
+				"(vid %d, qid: %d)\n", vid, queue_id);
+		goto reg_out;
+	}
+
+	vq->async_ops.check_completed_copies = ops->check_completed_copies;
+	vq->async_ops.transfer_data = ops->transfer_data;
+
+	vq->async_inorder = f.async_inorder;
+	vq->async_threshold = f.async_threshold;
+
+	vq->async_registered = true;
+
+reg_out:
+	rte_spinlock_unlock(&vq->access_lock);
+
+	return 0;
+}
+
+int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id)
+{
+	struct vhost_virtqueue *vq;
+	struct virtio_net *dev = get_device(vid);
+	int ret = -1;
+
+	if (dev == NULL)
+		return ret;
+
+	vq = dev->virtqueue[queue_id];
+
+	if (vq == NULL)
+		return ret;
+
+	ret = 0;
+	rte_spinlock_lock(&vq->access_lock);
+
+	if (!vq->async_registered)
+		goto out;
+
+	if (vq->async_pkts_inflight_n) {
+		VHOST_LOG_CONFIG(ERR, "Failed to unregister async channel. "
+			"async inflight packets must be completed before unregistration.\n");
+		ret = -1;
+		goto out;
+	}
+
+	if (vq->async_pkts_pending) {
+		rte_free(vq->async_pkts_pending);
+		vq->async_pkts_pending = NULL;
+	}
+
+	if (vq->async_pending_info) {
+		rte_free(vq->async_pending_info);
+		vq->async_pending_info = NULL;
+	}
+
+	vq->async_ops.transfer_data = NULL;
+	vq->async_ops.check_completed_copies = NULL;
+	vq->async_registered = false;
+
+out:
+	rte_spinlock_unlock(&vq->access_lock);
+
+	return ret;
+}
+
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 034463699..f3731982b 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -24,6 +24,8 @@
 #include "rte_vdpa.h"
 #include "rte_vdpa_dev.h"
 
+#include "rte_vhost_async.h"
+
 /* Used to indicate that the device is running on a data core */
 #define VIRTIO_DEV_RUNNING 1
 /* Used to indicate that the device is ready to operate */
@@ -40,6 +42,11 @@
 
 #define VHOST_LOG_CACHE_NR 32
 
+#define MAX_PKT_BURST 32
+
+#define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
+#define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
+
 #define PACKED_DESC_ENQUEUE_USED_FLAG(w)	\
 	((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \
 		VRING_DESC_F_WRITE)
@@ -202,6 +209,25 @@ struct vhost_virtqueue {
 	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_list;
 	int				iotlb_cache_nr;
 	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_pending_list;
+
+	/* operation callbacks for async dma */
+	struct rte_vhost_async_channel_ops	async_ops;
+
+	struct rte_vhost_iov_iter it_pool[VHOST_MAX_ASYNC_IT];
+	struct iovec vec_pool[VHOST_MAX_ASYNC_VEC];
+
+	/* async data transfer status */
+	uintptr_t	**async_pkts_pending;
+	#define		ASYNC_PENDING_INFO_N_MSK 0xFFFF
+	#define		ASYNC_PENDING_INFO_N_SFT 16
+	uint64_t	*async_pending_info;
+	uint16_t	async_pkts_idx;
+	uint16_t	async_pkts_inflight_n;
+
+	/* vq async features */
+	bool		async_inorder;
+	bool		async_registered;
+	uint16_t	async_threshold;
 } __rte_cache_aligned;
 
 #define VHOST_MAX_VRING			0x100
@@ -338,6 +364,7 @@ struct virtio_net {
 	int16_t			broadcast_rarp;
 	uint32_t		nr_vring;
 	int			dequeue_zero_copy;
+	int			async_copy;
 	int			extbuf;
 	int			linearbuf;
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
@@ -683,7 +710,8 @@ vhost_vring_call_split(struct virtio_net *dev, struct vhost_virtqueue *vq)
 	/* Don't kick guest if we don't reach index specified by guest. */
 	if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
 		uint16_t old = vq->signalled_used;
-		uint16_t new = vq->last_used_idx;
+		uint16_t new = vq->async_pkts_inflight_n ?
+					vq->used->idx:vq->last_used_idx;
 		bool signalled_used_valid = vq->signalled_used_valid;
 
 		vq->signalled_used = new;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 6039a8fdb..aa8605523 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -476,12 +476,14 @@ vhost_user_set_vring_num(struct virtio_net **pdev,
 	} else {
 		if (vq->shadow_used_split)
 			rte_free(vq->shadow_used_split);
+
 		vq->shadow_used_split = rte_malloc(NULL,
 				vq->size * sizeof(struct vring_used_elem),
 				RTE_CACHE_LINE_SIZE);
+
 		if (!vq->shadow_used_split) {
 			VHOST_LOG_CONFIG(ERR,
-					"failed to allocate memory for shadow used ring.\n");
+					"failed to allocate memory for vq internal data.\n");
 			return RTE_VHOST_MSG_RESULT_ERR;
 		}
 	}
@@ -1166,7 +1168,8 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
 			goto err_mmap;
 		}
 
-		populate = (dev->dequeue_zero_copy) ? MAP_POPULATE : 0;
+		populate = (dev->dequeue_zero_copy || dev->async_copy) ?
+			MAP_POPULATE : 0;
 		mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
 				 MAP_SHARED | populate, fd, 0);
 
@@ -1181,7 +1184,7 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
 		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
 				      mmap_offset;
 
-		if (dev->dequeue_zero_copy)
+		if (dev->dequeue_zero_copy || dev->async_copy)
 			if (add_guest_pages(dev, reg, alignment) < 0) {
 				VHOST_LOG_CONFIG(ERR,
 					"adding guest pages to region %u failed.\n",
@@ -1979,6 +1982,12 @@ vhost_user_get_vring_base(struct virtio_net **pdev,
 	} else {
 		rte_free(vq->shadow_used_split);
 		vq->shadow_used_split = NULL;
+		if (vq->async_pkts_pending)
+			rte_free(vq->async_pkts_pending);
+		if (vq->async_pending_info)
+			rte_free(vq->async_pending_info);
+		vq->async_pkts_pending = NULL;
+		vq->async_pending_info = NULL;
 	}
 
 	rte_free(vq->batch_copy_elems);
@@ -2012,6 +2021,14 @@ vhost_user_set_vring_enable(struct virtio_net **pdev,
 		"set queue enable: %d to qp idx: %d\n",
 		enable, index);
 
+	if (!enable && dev->virtqueue[index]->async_registered) {
+		if (dev->virtqueue[index]->async_pkts_inflight_n) {
+			VHOST_LOG_CONFIG(ERR, "failed to disable vring. "
+			"async inflight packets must be completed first\n");
+			return RTE_VHOST_MSG_RESULT_ERR;
+		}
+	}
+
 	/* On disable, rings have to be stopped being processed. */
 	if (!enable && dev->dequeue_zero_copy)
 		drain_zmbuf_list(dev->virtqueue[index]);
-- 
2.18.4
^ permalink raw reply	[flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v5 2/2] vhost: introduce async enqueue for split ring
  2020-07-06 11:53 ` [dpdk-dev] [PATCH v5 0/2] introduce asynchronous data path for vhost patrick.fu
  2020-07-06 11:53   ` [dpdk-dev] [PATCH v5 1/2] vhost: introduce async enqueue registration API patrick.fu
@ 2020-07-06 11:53   ` patrick.fu
  1 sibling, 0 replies; 36+ messages in thread
From: patrick.fu @ 2020-07-06 11:53 UTC (permalink / raw)
  To: dev, maxime.coquelin, chenbo.xia, zhihong.wang
  Cc: patrick.fu, yinan.wang, cheng1.jiang, cunming.liang
From: Patrick Fu <patrick.fu@intel.com>
This patch implements the async enqueue data path for split ring.
Two new async data path APIs are defined, by which applications
can submit and poll packets to/from async engines. The async
enqueue data path leverages callback functions registered by
applications to work with the async engine.
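The intended submit/poll usage can be sketched with a small model of the per-queue in-flight accounting. This is not the library code: the struct and function names are hypothetical, and the model assumes every in-flight copy has completed by the time it is polled. It only illustrates that submitted packets stay in flight until polled back, and that unregistration is refused while any remain:

```c
#include <stdint.h>

/* Hypothetical model of one async-enabled queue. */
struct async_queue_model {
	int registered;		/* async channel attached */
	uint16_t inflight;	/* submitted but not yet polled back */
};

/* Model of a submit call: packets become in-flight. */
static uint16_t
model_submit(struct async_queue_model *q, uint16_t count)
{
	q->inflight = (uint16_t)(q->inflight + count);
	return count;
}

/* Model of a poll call: return at most max_pkts completed packets. */
static uint16_t
model_poll(struct async_queue_model *q, uint16_t max_pkts)
{
	uint16_t done = q->inflight < max_pkts ? q->inflight : max_pkts;

	q->inflight = (uint16_t)(q->inflight - done);
	return done;
}

/* Model of unregister: fails while packets are still in flight. */
static int
model_unregister(struct async_queue_model *q)
{
	if (!q->registered)
		return 0;
	if (q->inflight)
		return -1;
	q->registered = 0;
	return 0;
}
```

In other words, an application is expected to keep polling until everything it submitted has been returned before it detaches the channel.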
Signed-off-by: Patrick Fu <patrick.fu@intel.com>
---
 lib/librte_vhost/rte_vhost_async.h |  40 +++
 lib/librte_vhost/virtio_net.c      | 551 ++++++++++++++++++++++++++++-
 2 files changed, 589 insertions(+), 2 deletions(-)
diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
index d5a59279a..c8ad8dbc7 100644
--- a/lib/librte_vhost/rte_vhost_async.h
+++ b/lib/librte_vhost/rte_vhost_async.h
@@ -133,4 +133,44 @@ int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
 __rte_experimental
 int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
 
+/**
+ * This function submits enqueue data to the async engine. It makes
+ * no guarantee that the transfer has completed upon return. Applications
+ * should poll transfer status by rte_vhost_poll_enqueue_completed()
+ *
+ * @param vid
+ *  id of vhost device to enqueue data
+ * @param queue_id
+ *  queue id to enqueue data
+ * @param pkts
+ *  array of packets to be enqueued
+ * @param count
+ *  packets num to be enqueued
+ * @return
+ *  num of packets enqueued
+ */
+__rte_experimental
+uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count);
+
+/**
+ * This function checks async completion status for a specific vhost
+ * device queue. Packets which finish copying (enqueue) operation
+ * will be returned in an array.
+ *
+ * @param vid
+ *  id of vhost device to check enqueue completion
+ * @param queue_id
+ *  queue id to check enqueue completion
+ * @param pkts
+ *  blank array to get return packet pointer
+ * @param count
+ *  size of the packet array
+ * @return
+ *  num of packets returned
+ */
+__rte_experimental
+uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count);
+
 #endif /* _RTE_VHOST_ASYNC_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 751c1f373..236498f71 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -17,14 +17,15 @@
 #include <rte_arp.h>
 #include <rte_spinlock.h>
 #include <rte_malloc.h>
+#include <rte_vhost_async.h>
 
 #include "iotlb.h"
 #include "vhost.h"
 
-#define MAX_PKT_BURST 32
-
 #define MAX_BATCH_LEN 256
 
+#define VHOST_ASYNC_BATCH_THRESHOLD 32
+
 static  __rte_always_inline bool
 rxvq_is_mergeable(struct virtio_net *dev)
 {
@@ -116,6 +117,31 @@ flush_shadow_used_ring_split(struct virtio_net *dev, struct vhost_virtqueue *vq)
 		sizeof(vq->used->idx));
 }
 
+static __rte_always_inline void
+async_flush_shadow_used_ring_split(struct virtio_net *dev,
+	struct vhost_virtqueue *vq)
+{
+	uint16_t used_idx = vq->last_used_idx & (vq->size - 1);
+
+	if (used_idx + vq->shadow_used_idx <= vq->size) {
+		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0,
+					  vq->shadow_used_idx);
+	} else {
+		uint16_t size;
+
+		/* update used ring interval [used_idx, vq->size] */
+		size = vq->size - used_idx;
+		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0, size);
+
+		/* update the left half used ring interval [0, left_size] */
+		do_flush_shadow_used_ring_split(dev, vq, 0, size,
+					  vq->shadow_used_idx - size);
+	}
+
+	vq->last_used_idx += vq->shadow_used_idx;
+	vq->shadow_used_idx = 0;
+}
+
 static __rte_always_inline void
 update_shadow_used_ring_split(struct vhost_virtqueue *vq,
 			 uint16_t desc_idx, uint32_t len)
@@ -905,6 +931,200 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return error;
 }
 
+static __rte_always_inline void
+async_fill_vec(struct iovec *v, void *base, size_t len)
+{
+	v->iov_base = base;
+	v->iov_len = len;
+}
+
+static __rte_always_inline void
+async_fill_iter(struct rte_vhost_iov_iter *it, size_t count,
+	struct iovec *vec, unsigned long nr_seg)
+{
+	it->offset = 0;
+	it->count = count;
+
+	if (count) {
+		it->iov = vec;
+		it->nr_segs = nr_seg;
+	} else {
+		it->iov = 0;
+		it->nr_segs = 0;
+	}
+}
+
+static __rte_always_inline void
+async_fill_desc(struct rte_vhost_async_desc *desc,
+	struct rte_vhost_iov_iter *src, struct rte_vhost_iov_iter *dst)
+{
+	desc->src = src;
+	desc->dst = dst;
+}
+
+static __rte_always_inline int
+async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
+			struct rte_mbuf *m, struct buf_vector *buf_vec,
+			uint16_t nr_vec, uint16_t num_buffers,
+			struct iovec *src_iovec, struct iovec *dst_iovec,
+			struct rte_vhost_iov_iter *src_it,
+			struct rte_vhost_iov_iter *dst_it)
+{
+	uint32_t vec_idx = 0;
+	uint32_t mbuf_offset, mbuf_avail;
+	uint32_t buf_offset, buf_avail;
+	uint64_t buf_addr, buf_iova, buf_len;
+	uint32_t cpy_len, cpy_threshold;
+	uint64_t hdr_addr;
+	struct rte_mbuf *hdr_mbuf;
+	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
+	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
+	int error = 0;
+
+	uint32_t tlen = 0;
+	int tvec_idx = 0;
+	void *hpa;
+
+	if (unlikely(m == NULL)) {
+		error = -1;
+		goto out;
+	}
+
+	cpy_threshold = vq->async_threshold;
+
+	buf_addr = buf_vec[vec_idx].buf_addr;
+	buf_iova = buf_vec[vec_idx].buf_iova;
+	buf_len = buf_vec[vec_idx].buf_len;
+
+	if (unlikely(buf_len < dev->vhost_hlen && nr_vec <= 1)) {
+		error = -1;
+		goto out;
+	}
+
+	hdr_mbuf = m;
+	hdr_addr = buf_addr;
+	if (unlikely(buf_len < dev->vhost_hlen))
+		hdr = &tmp_hdr;
+	else
+		hdr = (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)hdr_addr;
+
+	VHOST_LOG_DATA(DEBUG, "(%d) RX: num merge buffers %d\n",
+		dev->vid, num_buffers);
+
+	if (unlikely(buf_len < dev->vhost_hlen)) {
+		buf_offset = dev->vhost_hlen - buf_len;
+		vec_idx++;
+		buf_addr = buf_vec[vec_idx].buf_addr;
+		buf_iova = buf_vec[vec_idx].buf_iova;
+		buf_len = buf_vec[vec_idx].buf_len;
+		buf_avail = buf_len - buf_offset;
+	} else {
+		buf_offset = dev->vhost_hlen;
+		buf_avail = buf_len - dev->vhost_hlen;
+	}
+
+	mbuf_avail  = rte_pktmbuf_data_len(m);
+	mbuf_offset = 0;
+
+	while (mbuf_avail != 0 || m->next != NULL) {
+		/* done with current buf, get the next one */
+		if (buf_avail == 0) {
+			vec_idx++;
+			if (unlikely(vec_idx >= nr_vec)) {
+				error = -1;
+				goto out;
+			}
+
+			buf_addr = buf_vec[vec_idx].buf_addr;
+			buf_iova = buf_vec[vec_idx].buf_iova;
+			buf_len = buf_vec[vec_idx].buf_len;
+
+			buf_offset = 0;
+			buf_avail  = buf_len;
+		}
+
+		/* done with current mbuf, get the next one */
+		if (mbuf_avail == 0) {
+			m = m->next;
+
+			mbuf_offset = 0;
+			mbuf_avail  = rte_pktmbuf_data_len(m);
+		}
+
+		if (hdr_addr) {
+			virtio_enqueue_offload(hdr_mbuf, &hdr->hdr);
+			if (rxvq_is_mergeable(dev))
+				ASSIGN_UNLESS_EQUAL(hdr->num_buffers,
+						num_buffers);
+
+			if (unlikely(hdr == &tmp_hdr)) {
+				copy_vnet_hdr_to_desc(dev, vq, buf_vec, hdr);
+			} else {
+				PRINT_PACKET(dev, (uintptr_t)hdr_addr,
+						dev->vhost_hlen, 0);
+				vhost_log_cache_write_iova(dev, vq,
+						buf_vec[0].buf_iova,
+						dev->vhost_hlen);
+			}
+
+			hdr_addr = 0;
+		}
+
+		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
+
+		if (unlikely(cpy_len >= cpy_threshold)) {
+			hpa = (void *)(uintptr_t)gpa_to_hpa(dev,
+					buf_iova + buf_offset, cpy_len);
+
+			if (unlikely(!hpa)) {
+				error = -1;
+				goto out;
+			}
+
+			async_fill_vec(src_iovec + tvec_idx,
+				(void *)(uintptr_t)rte_pktmbuf_iova_offset(m,
+						mbuf_offset), cpy_len);
+
+			async_fill_vec(dst_iovec + tvec_idx, hpa, cpy_len);
+
+			tlen += cpy_len;
+			tvec_idx++;
+		} else {
+			if (unlikely(vq->batch_copy_nb_elems >= vq->size)) {
+				rte_memcpy(
+				(void *)((uintptr_t)(buf_addr + buf_offset)),
+				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
+				cpy_len);
+
+				PRINT_PACKET(dev,
+					(uintptr_t)(buf_addr + buf_offset),
+					cpy_len, 0);
+			} else {
+				batch_copy[vq->batch_copy_nb_elems].dst =
+				(void *)((uintptr_t)(buf_addr + buf_offset));
+				batch_copy[vq->batch_copy_nb_elems].src =
+				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset);
+				batch_copy[vq->batch_copy_nb_elems].log_addr =
+					buf_iova + buf_offset;
+				batch_copy[vq->batch_copy_nb_elems].len =
+					cpy_len;
+				vq->batch_copy_nb_elems++;
+			}
+		}
+
+		mbuf_avail  -= cpy_len;
+		mbuf_offset += cpy_len;
+		buf_avail  -= cpy_len;
+		buf_offset += cpy_len;
+	}
+
+out:
+	async_fill_iter(src_it, tlen, src_iovec, tvec_idx);
+	async_fill_iter(dst_it, tlen, dst_iovec, tvec_idx);
+
+	return error;
+}
+
 static __rte_always_inline int
 vhost_enqueue_single_packed(struct virtio_net *dev,
 			    struct vhost_virtqueue *vq,
@@ -1236,6 +1456,333 @@ rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
 	return virtio_dev_rx(dev, queue_id, pkts, count);
 }
 
+static __rte_always_inline uint16_t
+virtio_dev_rx_async_get_info_idx(uint16_t pkts_idx,
+	uint16_t vq_size, uint16_t n_inflight)
+{
+	return pkts_idx > n_inflight ? (pkts_idx - n_inflight) :
+		(vq_size - n_inflight + pkts_idx) & (vq_size - 1);
+}
+
+static __rte_always_inline void
+virtio_dev_rx_async_submit_split_err(struct virtio_net *dev,
+	struct vhost_virtqueue *vq, uint16_t queue_id,
+	uint16_t last_idx, uint16_t shadow_idx)
+{
+	uint16_t start_idx, pkts_idx, vq_size;
+	uint64_t *async_pending_info;
+
+	pkts_idx = vq->async_pkts_idx;
+	async_pending_info = vq->async_pending_info;
+	vq_size = vq->size;
+	start_idx = virtio_dev_rx_async_get_info_idx(pkts_idx,
+		vq_size, vq->async_pkts_inflight_n);
+
+	while (likely((start_idx & (vq_size - 1)) != pkts_idx)) {
+		uint64_t n_seg =
+			async_pending_info[(start_idx) & (vq_size - 1)] >>
+			ASYNC_PENDING_INFO_N_SFT;
+
+		while (n_seg)
+			n_seg -= vq->async_ops.check_completed_copies(dev->vid,
+				queue_id, 0, 1);
+	}
+
+	vq->async_pkts_inflight_n = 0;
+	vq->batch_copy_nb_elems = 0;
+
+	vq->shadow_used_idx = shadow_idx;
+	vq->last_avail_idx = last_idx;
+}
+
+static __rte_noinline uint32_t
+virtio_dev_rx_async_submit_split(struct virtio_net *dev,
+	struct vhost_virtqueue *vq, uint16_t queue_id,
+	struct rte_mbuf **pkts, uint32_t count)
+{
+	uint32_t pkt_idx = 0, pkt_burst_idx = 0;
+	uint16_t num_buffers;
+	struct buf_vector buf_vec[BUF_VECTOR_MAX];
+	uint16_t avail_head, last_idx, shadow_idx;
+
+	struct rte_vhost_iov_iter *it_pool = vq->it_pool;
+	struct iovec *vec_pool = vq->vec_pool;
+	struct rte_vhost_async_desc tdes[MAX_PKT_BURST];
+	struct iovec *src_iovec = vec_pool;
+	struct iovec *dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
+	struct rte_vhost_iov_iter *src_it = it_pool;
+	struct rte_vhost_iov_iter *dst_it = it_pool + 1;
+	uint16_t n_free_slot, slot_idx;
+	int n_pkts = 0;
+
+	avail_head = __atomic_load_n(&vq->avail->idx, __ATOMIC_ACQUIRE);
+	last_idx = vq->last_avail_idx;
+	shadow_idx = vq->shadow_used_idx;
+
+	/*
+	 * The ordering between avail index and
+	 * desc reads needs to be enforced.
+	 */
+	rte_smp_rmb();
+
+	rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
+
+	for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
+		uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
+		uint16_t nr_vec = 0;
+
+		if (unlikely(reserve_avail_buf_split(dev, vq,
+						pkt_len, buf_vec, &num_buffers,
+						avail_head, &nr_vec) < 0)) {
+			VHOST_LOG_DATA(DEBUG,
+				"(%d) failed to get enough desc from vring\n",
+				dev->vid);
+			vq->shadow_used_idx -= num_buffers;
+			break;
+		}
+
+		VHOST_LOG_DATA(DEBUG, "(%d) current index %d | end index %d\n",
+			dev->vid, vq->last_avail_idx,
+			vq->last_avail_idx + num_buffers);
+
+		if (async_mbuf_to_desc(dev, vq, pkts[pkt_idx],
+				buf_vec, nr_vec, num_buffers,
+				src_iovec, dst_iovec, src_it, dst_it) < 0) {
+			vq->shadow_used_idx -= num_buffers;
+			break;
+		}
+
+		slot_idx = (vq->async_pkts_idx + pkt_idx) & (vq->size - 1);
+		if (src_it->count) {
+			async_fill_desc(&tdes[pkt_burst_idx], src_it, dst_it);
+			pkt_burst_idx++;
+			vq->async_pending_info[slot_idx] =
+				num_buffers | (src_it->nr_segs << 16);
+			src_iovec += src_it->nr_segs;
+			dst_iovec += dst_it->nr_segs;
+			src_it += 2;
+			dst_it += 2;
+		} else {
+			vq->async_pending_info[slot_idx] = num_buffers;
+			vq->async_pkts_inflight_n++;
+		}
+
+		vq->last_avail_idx += num_buffers;
+
+		if (pkt_burst_idx >= VHOST_ASYNC_BATCH_THRESHOLD ||
+				(pkt_idx == count - 1 && pkt_burst_idx)) {
+			n_pkts = vq->async_ops.transfer_data(dev->vid,
+					queue_id, tdes, 0, pkt_burst_idx);
+			src_iovec = vec_pool;
+			dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
+			src_it = it_pool;
+			dst_it = it_pool + 1;
+
+			if (unlikely(n_pkts < (int)pkt_burst_idx)) {
+				vq->async_pkts_inflight_n +=
+					n_pkts > 0 ? n_pkts : 0;
+				virtio_dev_rx_async_submit_split_err(dev,
+					vq, queue_id, last_idx, shadow_idx);
+				return 0;
+			}
+
+			pkt_burst_idx = 0;
+			vq->async_pkts_inflight_n += n_pkts;
+		}
+	}
+
+	if (pkt_burst_idx) {
+		n_pkts = vq->async_ops.transfer_data(dev->vid,
+				queue_id, tdes, 0, pkt_burst_idx);
+		if (unlikely(n_pkts < (int)pkt_burst_idx)) {
+			vq->async_pkts_inflight_n += n_pkts > 0 ? n_pkts : 0;
+			virtio_dev_rx_async_submit_split_err(dev, vq, queue_id,
+				last_idx, shadow_idx);
+			return 0;
+		}
+
+		vq->async_pkts_inflight_n += n_pkts;
+	}
+
+	do_data_copy_enqueue(dev, vq);
+
+	n_free_slot = vq->size - vq->async_pkts_idx;
+	if (n_free_slot > pkt_idx) {
+		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
+			pkts, pkt_idx * sizeof(uintptr_t));
+		vq->async_pkts_idx += pkt_idx;
+	} else {
+		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
+			pkts, n_free_slot * sizeof(uintptr_t));
+		rte_memcpy(&vq->async_pkts_pending[0],
+			&pkts[n_free_slot],
+			(pkt_idx - n_free_slot) * sizeof(uintptr_t));
+		vq->async_pkts_idx = pkt_idx - n_free_slot;
+	}
+
+	if (likely(vq->shadow_used_idx))
+		async_flush_shadow_used_ring_split(dev, vq);
+
+	return pkt_idx;
+}
+
+uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count)
+{
+	struct virtio_net *dev = get_device(vid);
+	struct vhost_virtqueue *vq;
+	uint16_t n_pkts_cpl, n_pkts_put = 0, n_descs = 0;
+	uint16_t start_idx, pkts_idx, vq_size;
+	uint64_t *async_pending_info;
+
+	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
+	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
+		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
+			dev->vid, __func__, queue_id);
+		return 0;
+	}
+
+	vq = dev->virtqueue[queue_id];
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	pkts_idx = vq->async_pkts_idx;
+	async_pending_info = vq->async_pending_info;
+	vq_size = vq->size;
+	start_idx = virtio_dev_rx_async_get_info_idx(pkts_idx,
+		vq_size, vq->async_pkts_inflight_n);
+
+	n_pkts_cpl =
+		vq->async_ops.check_completed_copies(vid, queue_id, 0, count);
+
+	rte_smp_wmb();
+
+	while (likely(((start_idx + n_pkts_put) & (vq_size - 1)) != pkts_idx)) {
+		uint64_t info = async_pending_info[
+			(start_idx + n_pkts_put) & (vq_size - 1)];
+		uint64_t n_segs;
+		n_pkts_put++;
+		n_descs += info & ASYNC_PENDING_INFO_N_MSK;
+		n_segs = info >> ASYNC_PENDING_INFO_N_SFT;
+
+		if (n_segs) {
+			if (!n_pkts_cpl || n_pkts_cpl < n_segs) {
+				n_pkts_put--;
+				n_descs -= info & ASYNC_PENDING_INFO_N_MSK;
+				if (n_pkts_cpl) {
+					async_pending_info[
+						(start_idx + n_pkts_put) &
+						(vq_size - 1)] =
+					((n_segs - n_pkts_cpl) <<
+					 ASYNC_PENDING_INFO_N_SFT) |
+					(info & ASYNC_PENDING_INFO_N_MSK);
+					n_pkts_cpl = 0;
+				}
+				break;
+			}
+			n_pkts_cpl -= n_segs;
+		}
+	}
+
+	if (n_pkts_put) {
+		vq->async_pkts_inflight_n -= n_pkts_put;
+		__atomic_add_fetch(&vq->used->idx, n_descs, __ATOMIC_RELEASE);
+
+		vhost_vring_call_split(dev, vq);
+	}
+
+	if (start_idx + n_pkts_put <= vq_size) {
+		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
+			n_pkts_put * sizeof(uintptr_t));
+	} else {
+		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
+			(vq_size - start_idx) * sizeof(uintptr_t));
+		rte_memcpy(&pkts[vq_size - start_idx], vq->async_pkts_pending,
+			(n_pkts_put - vq_size + start_idx) * sizeof(uintptr_t));
+	}
+
+	rte_spinlock_unlock(&vq->access_lock);
+
+	return n_pkts_put;
+}
+
+static __rte_always_inline uint32_t
+virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id,
+	struct rte_mbuf **pkts, uint32_t count)
+{
+	struct vhost_virtqueue *vq;
+	uint32_t nb_tx = 0;
+	bool drawback = false;
+
+	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
+	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
+		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
+			dev->vid, __func__, queue_id);
+		return 0;
+	}
+
+	vq = dev->virtqueue[queue_id];
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	if (unlikely(vq->enabled == 0))
+		goto out_access_unlock;
+
+	if (unlikely(!vq->async_registered)) {
+		drawback = true;
+		goto out_access_unlock;
+	}
+
+	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+		vhost_user_iotlb_rd_lock(vq);
+
+	if (unlikely(vq->access_ok == 0))
+		if (unlikely(vring_translate(dev, vq) < 0))
+			goto out;
+
+	count = RTE_MIN((uint32_t)MAX_PKT_BURST, count);
+	if (count == 0)
+		goto out;
+
+	/* TODO: packed queue not implemented */
+	if (vq_is_packed(dev))
+		nb_tx = 0;
+	else
+		nb_tx = virtio_dev_rx_async_submit_split(dev,
+				vq, queue_id, pkts, count);
+
+out:
+	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+		vhost_user_iotlb_rd_unlock(vq);
+
+out_access_unlock:
+	rte_spinlock_unlock(&vq->access_lock);
+
+	if (drawback)
+		return rte_vhost_enqueue_burst(dev->vid, queue_id, pkts, count);
+
+	return nb_tx;
+}
+
+uint16_t
+rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (!dev)
+		return 0;
+
+	if (unlikely(!(dev->flags & VIRTIO_DEV_BUILTIN_VIRTIO_NET))) {
+		VHOST_LOG_DATA(ERR,
+			"(%d) %s: built-in vhost net backend is disabled.\n",
+			dev->vid, __func__);
+		return 0;
+	}
+
+	return virtio_dev_rx_async_submit(dev, queue_id, pkts, count);
+}
+
 static inline bool
 virtio_net_with_host_offload(struct virtio_net *dev)
 {
-- 
2.18.4
* [dpdk-dev] [PATCH v6 0/2] introduce asynchronous data path for vhost
  2020-06-11 10:02 [dpdk-dev] [PATCH v1 0/2] introduce asynchronous data path for vhost patrick.fu
                   ` (5 preceding siblings ...)
  2020-07-06 11:53 ` [dpdk-dev] [PATCH v5 0/2] introduce asynchronous data path for vhost patrick.fu
@ 2020-07-07  5:07 ` patrick.fu
  2020-07-07  5:07   ` [dpdk-dev] [PATCH v6 1/2] vhost: introduce async enqueue registration API patrick.fu
                     ` (3 more replies)
  6 siblings, 4 replies; 36+ messages in thread
From: patrick.fu @ 2020-07-07  5:07 UTC (permalink / raw)
  To: dev, maxime.coquelin, chenbo.xia, zhihong.wang
  Cc: patrick.fu, yinan.wang, cheng1.jiang, cunming.liang
From: Patrick Fu <patrick.fu@intel.com>
Performing large memory copies usually takes up a major part of CPU
cycles and becomes the hot spot in vhost-user enqueue operation. To
offload expensive memory operations from the CPU, this patch set
proposes to leverage DMA engines, e.g., I/OAT, a DMA engine in
Intel processors, to accelerate large copies.
Large copies are offloaded from the CPU to the DMA in an asynchronous
manner. The CPU just submits copy jobs to the DMA engine, without waiting
for the copies to complete. Thus, there is no CPU intervention during
data transfer; we can save precious CPU cycles and improve the overall
throughput for vhost-user based applications, like OVS. During packet
transmission, vhost offloads large copies to the DMA and performs small
copies by the CPU, due to startup overheads associated with the DMA.
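The split between DMA and CPU copies described above can be sketched as
follows. This is a minimal illustration, not code from the patch:
`copy_job`, `dma_queue` and `dispatch_copy` are hypothetical names, the
in-memory queue only stands in for a real DMA channel, and the threshold
is whatever value the application chooses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical job record; a real engine would take I/O addresses. */
struct copy_job {
	void *dst;
	const void *src;
	size_t len;
};

#define DMA_QUEUE_DEPTH 32

/* Hypothetical stand-in for the submission queue of a DMA channel. */
struct dma_queue {
	struct copy_job jobs[DMA_QUEUE_DEPTH];
	size_t nb_jobs;
};

/*
 * Copy on the CPU when the segment is shorter than the threshold,
 * otherwise queue it for the DMA engine and return without waiting.
 * Returns 1 if the copy was offloaded, 0 if the CPU did it,
 * -1 if the queue is full.
 */
static int
dispatch_copy(struct dma_queue *q, size_t threshold,
	      void *dst, const void *src, size_t len)
{
	assert(q != NULL);

	if (len < threshold) {
		memcpy(dst, src, len);
		return 0;
	}

	if (q->nb_jobs >= DMA_QUEUE_DEPTH)
		return -1;

	q->jobs[q->nb_jobs].dst = dst;
	q->jobs[q->nb_jobs].src = src;
	q->jobs[q->nb_jobs].len = len;
	q->nb_jobs++;
	return 1;
}
```

With a threshold of 256 bytes, a 64-byte segment is copied immediately by
the CPU, while a 256-byte segment is only queued and must be reaped later
through a completion poll.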
This patch set constructs a general framework that applications can
leverage to attach DMA channels to vhost-user transmit queues. Four
new RTE APIs are introduced to vhost library for applications to
register and use the asynchronous data path. In addition, two new DMA
operation callbacks are defined, by which vhost-user asynchronous data
path can interact with DMA hardware. Currently only enqueue operation
for split queue is implemented, but the framework is flexible to extend
support for packed queue.
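As an illustration of the registration step, the sketch below packs the
32-bit `features` word passed to rte_vhost_async_channel_register(),
following the bit layout documented in rte_vhost_async.h in patch 1/2
(bit 0: in-order flag, bits 16-27: packet length threshold). The helper
and macro names are hypothetical; only the bit positions are taken from
the patch. Explicit shifts are used here on the assumption that they are
easier to reason about than compiler-dependent bit-field ordering.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; bit positions follow the rte_vhost_async.h comment. */
#define ASYNC_F_INORDER_SHIFT   0       /* b0: DMA copies complete in order */
#define ASYNC_F_THRESHOLD_SHIFT 16      /* b16 - b27: length threshold */
#define ASYNC_F_THRESHOLD_MASK  0xFFFu  /* 12 bits wide */

/* Build the features word from an in-order flag and a threshold. */
static uint32_t
async_features_pack(int inorder, uint32_t threshold)
{
	/* The threshold field is 12 bits wide, so 4095 is the maximum. */
	assert(threshold <= ASYNC_F_THRESHOLD_MASK);

	return ((uint32_t)(inorder != 0) << ASYNC_F_INORDER_SHIFT) |
	       ((threshold & ASYNC_F_THRESHOLD_MASK) <<
			ASYNC_F_THRESHOLD_SHIFT);
}

/* Extract the threshold field back out of a features word. */
static uint32_t
async_features_threshold(uint32_t features)
{
	return (features >> ASYNC_F_THRESHOLD_SHIFT) &
		ASYNC_F_THRESHOLD_MASK;
}
```

An application would then pass the result as the third argument, e.g.
rte_vhost_async_channel_register(vid, queue_id,
async_features_pack(1, 256), &ops), where `ops` holds its two callbacks.
The in-order bit has to be set, since registration is rejected for
non-inorder channels in this version.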
v2:
update meson file for new header file
update rte_vhost_version.map to include new APIs
rename async APIs/structures to be prefixed with "rte_vhost"
rename some variables/structures for readability
correct minor typo in comments/license statements
refine memory allocation logic for vq internal buffer
add error message printing in some failure cases
check inflight async packets in unregistration API call
mark new APIs as experimental
v3:
use atomic_xxx() functions in updating ring index
fix a bug in async enqueue failure handling
v4:
part of the fix intended in the v3 patch was missed; this patch
adds all those fixes
v5:
minor changes on some function/variable names
reset CPU batch copy packet count when async enqueue error
occurs
disable virtio log feature in async copy mode
minor optimization on async shadow index flush
v6:
add some background introduction in the commit message
Patrick Fu (2):
  vhost: introduce async enqueue registration API
  vhost: introduce async enqueue for split ring
 lib/librte_vhost/Makefile              |   2 +-
 lib/librte_vhost/meson.build           |   2 +-
 lib/librte_vhost/rte_vhost.h           |   1 +
 lib/librte_vhost/rte_vhost_async.h     | 176 ++++++++
 lib/librte_vhost/rte_vhost_version.map |   4 +
 lib/librte_vhost/socket.c              |  27 ++
 lib/librte_vhost/vhost.c               | 127 +++++-
 lib/librte_vhost/vhost.h               |  30 +-
 lib/librte_vhost/vhost_user.c          |  23 +-
 lib/librte_vhost/virtio_net.c          | 551 ++++++++++++++++++++++++-
 10 files changed, 934 insertions(+), 9 deletions(-)
 create mode 100644 lib/librte_vhost/rte_vhost_async.h
-- 
2.18.4
* [dpdk-dev] [PATCH v6 1/2] vhost: introduce async enqueue registration API
  2020-07-07  5:07 ` [dpdk-dev] [PATCH v6 0/2] introduce asynchronous data path for vhost patrick.fu
@ 2020-07-07  5:07   ` patrick.fu
  2020-07-07  8:22     ` Xia, Chenbo
  2020-07-07  5:07   ` [dpdk-dev] [PATCH v6 2/2] vhost: introduce async enqueue for split ring patrick.fu
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: patrick.fu @ 2020-07-07  5:07 UTC (permalink / raw)
  To: dev, maxime.coquelin, chenbo.xia, zhihong.wang
  Cc: patrick.fu, yinan.wang, cheng1.jiang, cunming.liang
From: Patrick Fu <patrick.fu@intel.com>
Performing large memory copies usually takes up a major part of CPU
cycles and becomes the hot spot in vhost-user enqueue operation. To
offload the large copies from CPU to the DMA devices, asynchronous
APIs are introduced, with which the CPU just submits copy jobs to
the DMA without waiting for the copies to complete. Thus, there is
no CPU intervention during data transfer. We can save precious CPU
cycles and improve the overall throughput for vhost-user based
applications. This patch introduces registration/un-registration
APIs for vhost async data enqueue operation. Together with the
implementation of the registration APIs, data structures and the prototypes
of the async callback functions required for the async enqueue data path
are also defined.
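For illustration only, a pair of callbacks in the spirit of
transfer_data and check_completed_copies could look like the sketch
below. It is not part of this patch: the structures are local mirrors of
those declared in rte_vhost_async.h, the vid, queue_id and opaque_data
arguments are omitted, all names ending in `_sketch` are hypothetical, and
memcpy() stands in for a DMA engine. In the real data path the iovec
entries carry I/O addresses, so this only works on plain virtual
addresses in a test harness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Local mirror of struct rte_vhost_iov_iter. */
struct iov_iter_sketch {
	size_t offset;
	size_t count;
	struct iovec *iov;
	unsigned long nr_segs;
};

/* Local mirror of struct rte_vhost_async_desc. */
struct async_desc_sketch {
	struct iov_iter_sketch *src;
	struct iov_iter_sketch *dst;
};

/* Segments copied but not yet reported as completed. */
static uint32_t segs_done;

/* transfer_data-style callback: copy every segment of every descriptor. */
static int
transfer_data_sketch(struct async_desc_sketch *descs, uint16_t count)
{
	unsigned long s;
	uint16_t i;

	for (i = 0; i < count; i++) {
		struct iov_iter_sketch *src = descs[i].src;
		struct iov_iter_sketch *dst = descs[i].dst;

		assert(src->nr_segs == dst->nr_segs);
		for (s = 0; s < src->nr_segs; s++) {
			assert(src->iov[s].iov_len == dst->iov[s].iov_len);
			memcpy(dst->iov[s].iov_base, src->iov[s].iov_base,
			       src->iov[s].iov_len);
			segs_done++;
		}
	}

	return count; /* number of descriptors accepted */
}

/* check_completed_copies-style callback: report finished segments. */
static int
check_completed_sketch(uint16_t max_segs)
{
	uint32_t n = segs_done < max_segs ? segs_done : max_segs;

	segs_done -= n;
	return (int)n;
}
```

A real implementation would enqueue the segments to a DMA channel in the
first callback and poll the channel for completions in the second.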
Signed-off-by: Patrick Fu <patrick.fu@intel.com>
---
 lib/librte_vhost/Makefile              |   2 +-
 lib/librte_vhost/meson.build           |   2 +-
 lib/librte_vhost/rte_vhost.h           |   1 +
 lib/librte_vhost/rte_vhost_async.h     | 136 +++++++++++++++++++++++++
 lib/librte_vhost/rte_vhost_version.map |   4 +
 lib/librte_vhost/socket.c              |  27 +++++
 lib/librte_vhost/vhost.c               | 127 ++++++++++++++++++++++-
 lib/librte_vhost/vhost.h               |  30 +++++-
 lib/librte_vhost/vhost_user.c          |  23 ++++-
 9 files changed, 345 insertions(+), 7 deletions(-)
 create mode 100644 lib/librte_vhost/rte_vhost_async.h
diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
index b7ff7dc4b..4f2f3e47d 100644
--- a/lib/librte_vhost/Makefile
+++ b/lib/librte_vhost/Makefile
@@ -42,7 +42,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h \
-						rte_vdpa_dev.h
+						rte_vdpa_dev.h rte_vhost_async.h
 
 # only compile vhost crypto when cryptodev is enabled
 ifeq ($(CONFIG_RTE_LIBRTE_CRYPTODEV),y)
diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
index 882a0eaf4..cc9aa65c6 100644
--- a/lib/librte_vhost/meson.build
+++ b/lib/librte_vhost/meson.build
@@ -22,5 +22,5 @@ sources = files('fd_man.c', 'iotlb.c', 'socket.c', 'vdpa.c',
 		'vhost.c', 'vhost_user.c',
 		'virtio_net.c', 'vhost_crypto.c')
 headers = files('rte_vhost.h', 'rte_vdpa.h', 'rte_vdpa_dev.h',
-		'rte_vhost_crypto.h')
+		'rte_vhost_crypto.h', 'rte_vhost_async.h')
 deps += ['ethdev', 'cryptodev', 'hash', 'pci']
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index 8a5c332c8..f93f9595a 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -35,6 +35,7 @@ extern "C" {
 #define RTE_VHOST_USER_EXTBUF_SUPPORT	(1ULL << 5)
 /* support only linear buffers (no chained mbufs) */
 #define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
+#define RTE_VHOST_USER_ASYNC_COPY	(1ULL << 7)
 
 /* Features. */
 #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
new file mode 100644
index 000000000..d5a59279a
--- /dev/null
+++ b/lib/librte_vhost/rte_vhost_async.h
@@ -0,0 +1,136 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_VHOST_ASYNC_H_
+#define _RTE_VHOST_ASYNC_H_
+
+#include "rte_vhost.h"
+
+/**
+ * iovec iterator
+ */
+struct rte_vhost_iov_iter {
+	/** offset to the first byte of interesting data */
+	size_t offset;
+	/** total bytes of data in this iterator */
+	size_t count;
+	/** pointer to the iovec array */
+	struct iovec *iov;
+	/** number of iovec in this iterator */
+	unsigned long nr_segs;
+};
+
+/**
+ * dma transfer descriptor pair
+ */
+struct rte_vhost_async_desc {
+	/** source memory iov_iter */
+	struct rte_vhost_iov_iter *src;
+	/** destination memory iov_iter */
+	struct rte_vhost_iov_iter *dst;
+};
+
+/**
+ * dma transfer status
+ */
+struct rte_vhost_async_status {
+	/** An array of application specific data for source memory */
+	uintptr_t *src_opaque_data;
+	/** An array of application specific data for destination memory */
+	uintptr_t *dst_opaque_data;
+};
+
+/**
+ * dma operation callbacks to be implemented by applications
+ */
+struct rte_vhost_async_channel_ops {
+	/**
+	 * instruct async engines to perform copies for a batch of packets
+	 *
+	 * @param vid
+	 *  id of vhost device to perform data copies
+	 * @param queue_id
+	 *  queue id to perform data copies
+	 * @param descs
+	 *  an array of DMA transfer memory descriptors
+	 * @param opaque_data
+	 *  opaque data pair sending to DMA engine
+	 * @param count
+	 *  number of elements in the "descs" array
+	 * @return
+	 *  -1 on failure, number of descs processed on success
+	 */
+	int (*transfer_data)(int vid, uint16_t queue_id,
+		struct rte_vhost_async_desc *descs,
+		struct rte_vhost_async_status *opaque_data,
+		uint16_t count);
+	/**
+	 * check copy-completed packets from the async engine
+	 * @param vid
+	 *  id of vhost device to check copy completion
+	 * @param queue_id
+	 *  queue id to check copy completion
+	 * @param opaque_data
+	 *  buffer to receive the opaque data pair from DMA engine
+	 * @param max_packets
+	 *  max number of packets that could be completed
+	 * @return
+	 *  -1 on failure, number of iov segments completed on success
+	 */
+	int (*check_completed_copies)(int vid, uint16_t queue_id,
+		struct rte_vhost_async_status *opaque_data,
+		uint16_t max_packets);
+};
+
+/**
+ *  dma channel feature bit definition
+ */
+struct rte_vhost_async_features {
+	union {
+		uint32_t intval;
+		struct {
+			uint32_t async_inorder:1;
+			uint32_t resvd_0:15;
+			uint32_t async_threshold:12;
+			uint32_t resvd_1:4;
+		};
+	};
+};
+
+/**
+ * register an async channel for vhost
+ *
+ * @param vid
+ *  vhost device id async channel to be attached to
+ * @param queue_id
+ *  vhost queue id async channel to be attached to
+ * @param features
+ *  DMA channel feature bit
+ *    b0       : DMA supports inorder data transfer
+ *    b1  - b15: reserved
+ *    b16 - b27: Packet length threshold for DMA transfer
+ *    b28 - b31: reserved
+ * @param ops
+ *  DMA operation callbacks
+ * @return
+ *  0 on success, -1 on failures
+ */
+__rte_experimental
+int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
+	uint32_t features, struct rte_vhost_async_channel_ops *ops);
+
+/**
+ * unregister a dma channel for vhost
+ *
+ * @param vid
+ *  vhost device id DMA channel to be detached
+ * @param queue_id
+ *  vhost queue id DMA channel to be detached
+ * @return
+ *  0 on success, -1 on failures
+ */
+__rte_experimental
+int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
+
+#endif /* _RTE_VHOST_ASYNC_H_ */
diff --git a/lib/librte_vhost/rte_vhost_version.map b/lib/librte_vhost/rte_vhost_version.map
index 86784405a..13ec53b63 100644
--- a/lib/librte_vhost/rte_vhost_version.map
+++ b/lib/librte_vhost/rte_vhost_version.map
@@ -71,4 +71,8 @@ EXPERIMENTAL {
 	rte_vdpa_get_queue_num;
 	rte_vdpa_get_features;
 	rte_vdpa_get_protocol_features;
+	rte_vhost_async_channel_register;
+	rte_vhost_async_channel_unregister;
+	rte_vhost_submit_enqueue_burst;
+	rte_vhost_poll_enqueue_completed;
 };
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 49267cebf..c4626d2c4 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -42,6 +42,7 @@ struct vhost_user_socket {
 	bool use_builtin_virtio_net;
 	bool extbuf;
 	bool linearbuf;
+	bool async_copy;
 
 	/*
 	 * The "supported_features" indicates the feature bits the
@@ -205,6 +206,7 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 	size_t size;
 	struct vhost_user_connection *conn;
 	int ret;
+	struct virtio_net *dev;
 
 	if (vsocket == NULL)
 		return;
@@ -236,6 +238,13 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 	if (vsocket->linearbuf)
 		vhost_enable_linearbuf(vid);
 
+	if (vsocket->async_copy) {
+		dev = get_device(vid);
+
+		if (dev)
+			dev->async_copy = 1;
+	}
+
 	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
 
 	if (vsocket->notify_ops->new_connection) {
@@ -881,6 +890,17 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
 		goto out_mutex;
 	}
 
+	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
+
+	if (vsocket->async_copy &&
+		(flags & (RTE_VHOST_USER_IOMMU_SUPPORT |
+		RTE_VHOST_USER_POSTCOPY_SUPPORT))) {
+		VHOST_LOG_CONFIG(ERR, "error: enabling async copy and IOMMU "
+			"or post-copy feature simultaneously is not "
+			"supported\n");
+		goto out_mutex;
+	}
+
 	/*
 	 * Set the supported features correctly for the builtin vhost-user
 	 * net driver.
@@ -931,6 +951,13 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
 			~(1ULL << VHOST_USER_PROTOCOL_F_PAGEFAULT);
 	}
 
+	if (vsocket->async_copy) {
+		vsocket->supported_features &= ~(1ULL << VHOST_F_LOG_ALL);
+		vsocket->features &= ~(1ULL << VHOST_F_LOG_ALL);
+		VHOST_LOG_CONFIG(INFO,
+			"Logging feature is disabled in async copy mode\n");
+	}
+
 	/*
 	 * We'll not be able to receive a buffer from guest in linear mode
 	 * without external buffer if it will not fit in a single mbuf, which is
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 0d822d6a3..a11385f39 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -332,8 +332,13 @@ free_vq(struct virtio_net *dev, struct vhost_virtqueue *vq)
 {
 	if (vq_is_packed(dev))
 		rte_free(vq->shadow_used_packed);
-	else
+	else {
 		rte_free(vq->shadow_used_split);
+		if (vq->async_pkts_pending)
+			rte_free(vq->async_pkts_pending);
+		if (vq->async_pending_info)
+			rte_free(vq->async_pending_info);
+	}
 	rte_free(vq->batch_copy_elems);
 	rte_mempool_free(vq->iotlb_pool);
 	rte_free(vq);
@@ -1522,3 +1527,123 @@ RTE_INIT(vhost_log_init)
 	if (vhost_data_log_level >= 0)
 		rte_log_set_level(vhost_data_log_level, RTE_LOG_WARNING);
 }
+
+int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
+					uint32_t features,
+					struct rte_vhost_async_channel_ops *ops)
+{
+	struct vhost_virtqueue *vq;
+	struct virtio_net *dev = get_device(vid);
+	struct rte_vhost_async_features f;
+
+	if (dev == NULL || ops == NULL)
+		return -1;
+
+	f.intval = features;
+
+	vq = dev->virtqueue[queue_id];
+
+	if (unlikely(vq == NULL || !dev->async_copy))
+		return -1;
+
+	/* packed queue is not supported */
+	if (unlikely(vq_is_packed(dev) || !f.async_inorder)) {
+		VHOST_LOG_CONFIG(ERR,
+			"async copy is not supported on packed queue or non-inorder mode "
+			"(vid %d, qid: %d)\n", vid, queue_id);
+		return -1;
+	}
+
+	if (unlikely(ops->check_completed_copies == NULL ||
+		ops->transfer_data == NULL))
+		return -1;
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	if (unlikely(vq->async_registered)) {
+		VHOST_LOG_CONFIG(ERR,
+			"async register failed: channel already registered "
+			"(vid %d, qid: %d)\n", vid, queue_id);
+		goto reg_out;
+	}
+
+	vq->async_pkts_pending = rte_malloc(NULL,
+			vq->size * sizeof(uintptr_t),
+			RTE_CACHE_LINE_SIZE);
+	vq->async_pending_info = rte_malloc(NULL,
+			vq->size * sizeof(uint64_t),
+			RTE_CACHE_LINE_SIZE);
+	if (!vq->async_pkts_pending || !vq->async_pending_info) {
+		if (vq->async_pkts_pending)
+			rte_free(vq->async_pkts_pending);
+
+		if (vq->async_pending_info)
+			rte_free(vq->async_pending_info);
+
+		VHOST_LOG_CONFIG(ERR,
+				"async register failed: cannot allocate memory for vq data "
+				"(vid %d, qid: %d)\n", vid, queue_id);
+		goto reg_out;
+	}
+
+	vq->async_ops.check_completed_copies = ops->check_completed_copies;
+	vq->async_ops.transfer_data = ops->transfer_data;
+
+	vq->async_inorder = f.async_inorder;
+	vq->async_threshold = f.async_threshold;
+
+	vq->async_registered = true;
+
+reg_out:
+	rte_spinlock_unlock(&vq->access_lock);
+
+	return 0;
+}
+
+int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id)
+{
+	struct vhost_virtqueue *vq;
+	struct virtio_net *dev = get_device(vid);
+	int ret = -1;
+
+	if (dev == NULL)
+		return ret;
+
+	vq = dev->virtqueue[queue_id];
+
+	if (vq == NULL)
+		return ret;
+
+	ret = 0;
+	rte_spinlock_lock(&vq->access_lock);
+
+	if (!vq->async_registered)
+		goto out;
+
+	if (vq->async_pkts_inflight_n) {
+		VHOST_LOG_CONFIG(ERR, "Failed to unregister async channel. "
+			"async inflight packets must be completed before unregistration.\n");
+		ret = -1;
+		goto out;
+	}
+
+	if (vq->async_pkts_pending) {
+		rte_free(vq->async_pkts_pending);
+		vq->async_pkts_pending = NULL;
+	}
+
+	if (vq->async_pending_info) {
+		rte_free(vq->async_pending_info);
+		vq->async_pending_info = NULL;
+	}
+
+	vq->async_ops.transfer_data = NULL;
+	vq->async_ops.check_completed_copies = NULL;
+	vq->async_registered = false;
+
+out:
+	rte_spinlock_unlock(&vq->access_lock);
+
+	return ret;
+}
+
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 034463699..f3731982b 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -24,6 +24,8 @@
 #include "rte_vdpa.h"
 #include "rte_vdpa_dev.h"
 
+#include "rte_vhost_async.h"
+
 /* Used to indicate that the device is running on a data core */
 #define VIRTIO_DEV_RUNNING 1
 /* Used to indicate that the device is ready to operate */
@@ -40,6 +42,11 @@
 
 #define VHOST_LOG_CACHE_NR 32
 
+#define MAX_PKT_BURST 32
+
+#define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
+#define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
+
 #define PACKED_DESC_ENQUEUE_USED_FLAG(w)	\
 	((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \
 		VRING_DESC_F_WRITE)
@@ -202,6 +209,25 @@ struct vhost_virtqueue {
 	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_list;
 	int				iotlb_cache_nr;
 	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_pending_list;
+
+	/* operation callbacks for async dma */
+	struct rte_vhost_async_channel_ops	async_ops;
+
+	struct rte_vhost_iov_iter it_pool[VHOST_MAX_ASYNC_IT];
+	struct iovec vec_pool[VHOST_MAX_ASYNC_VEC];
+
+	/* async data transfer status */
+	uintptr_t	**async_pkts_pending;
+	#define		ASYNC_PENDING_INFO_N_MSK 0xFFFF
+	#define		ASYNC_PENDING_INFO_N_SFT 16
+	uint64_t	*async_pending_info;
+	uint16_t	async_pkts_idx;
+	uint16_t	async_pkts_inflight_n;
+
+	/* vq async features */
+	bool		async_inorder;
+	bool		async_registered;
+	uint16_t	async_threshold;
 } __rte_cache_aligned;
 
 #define VHOST_MAX_VRING			0x100
@@ -338,6 +364,7 @@ struct virtio_net {
 	int16_t			broadcast_rarp;
 	uint32_t		nr_vring;
 	int			dequeue_zero_copy;
+	int			async_copy;
 	int			extbuf;
 	int			linearbuf;
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
@@ -683,7 +710,8 @@ vhost_vring_call_split(struct virtio_net *dev, struct vhost_virtqueue *vq)
 	/* Don't kick guest if we don't reach index specified by guest. */
 	if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
 		uint16_t old = vq->signalled_used;
-		uint16_t new = vq->last_used_idx;
+		uint16_t new = vq->async_pkts_inflight_n ?
+					vq->used->idx:vq->last_used_idx;
 		bool signalled_used_valid = vq->signalled_used_valid;
 
 		vq->signalled_used = new;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 6039a8fdb..aa8605523 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -476,12 +476,14 @@ vhost_user_set_vring_num(struct virtio_net **pdev,
 	} else {
 		if (vq->shadow_used_split)
 			rte_free(vq->shadow_used_split);
+
 		vq->shadow_used_split = rte_malloc(NULL,
 				vq->size * sizeof(struct vring_used_elem),
 				RTE_CACHE_LINE_SIZE);
+
 		if (!vq->shadow_used_split) {
 			VHOST_LOG_CONFIG(ERR,
-					"failed to allocate memory for shadow used ring.\n");
+					"failed to allocate memory for vq internal data.\n");
 			return RTE_VHOST_MSG_RESULT_ERR;
 		}
 	}
@@ -1166,7 +1168,8 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
 			goto err_mmap;
 		}
 
-		populate = (dev->dequeue_zero_copy) ? MAP_POPULATE : 0;
+		populate = (dev->dequeue_zero_copy || dev->async_copy) ?
+			MAP_POPULATE : 0;
 		mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
 				 MAP_SHARED | populate, fd, 0);
 
@@ -1181,7 +1184,7 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
 		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
 				      mmap_offset;
 
-		if (dev->dequeue_zero_copy)
+		if (dev->dequeue_zero_copy || dev->async_copy)
 			if (add_guest_pages(dev, reg, alignment) < 0) {
 				VHOST_LOG_CONFIG(ERR,
 					"adding guest pages to region %u failed.\n",
@@ -1979,6 +1982,12 @@ vhost_user_get_vring_base(struct virtio_net **pdev,
 	} else {
 		rte_free(vq->shadow_used_split);
 		vq->shadow_used_split = NULL;
+		if (vq->async_pkts_pending)
+			rte_free(vq->async_pkts_pending);
+		if (vq->async_pending_info)
+			rte_free(vq->async_pending_info);
+		vq->async_pkts_pending = NULL;
+		vq->async_pending_info = NULL;
 	}
 
 	rte_free(vq->batch_copy_elems);
@@ -2012,6 +2021,14 @@ vhost_user_set_vring_enable(struct virtio_net **pdev,
 		"set queue enable: %d to qp idx: %d\n",
 		enable, index);
 
+	if (!enable && dev->virtqueue[index]->async_registered) {
+		if (dev->virtqueue[index]->async_pkts_inflight_n) {
+			VHOST_LOG_CONFIG(ERR, "failed to disable vring. "
+			"async inflight packets must be completed first\n");
+			return RTE_VHOST_MSG_RESULT_ERR;
+		}
+	}
+
 	/* On disable, rings have to be stopped being processed. */
 	if (!enable && dev->dequeue_zero_copy)
 		drain_zmbuf_list(dev->virtqueue[index]);
-- 
2.18.4
* [dpdk-dev] [PATCH v6 2/2] vhost: introduce async enqueue for split ring
  2020-07-07  5:07 ` [dpdk-dev] [PATCH v6 0/2] introduce asynchronous data path for vhost patrick.fu
  2020-07-07  5:07   ` [dpdk-dev] [PATCH v6 1/2] vhost: introduce async enqueue registration API patrick.fu
@ 2020-07-07  5:07   ` patrick.fu
  2020-07-07  8:22     ` Xia, Chenbo
  2020-07-07 16:45   ` [dpdk-dev] [PATCH v6 0/2] introduce asynchronous data path for vhost Ferruh Yigit
  2020-07-20 13:26   ` Maxime Coquelin
  3 siblings, 1 reply; 36+ messages in thread
From: patrick.fu @ 2020-07-07  5:07 UTC (permalink / raw)
  To: dev, maxime.coquelin, chenbo.xia, zhihong.wang
  Cc: patrick.fu, yinan.wang, cheng1.jiang, cunming.liang
From: Patrick Fu <patrick.fu@intel.com>
This patch implements the async enqueue data path for split ring. Two new
async data path APIs are defined, by which applications can submit
packets to the async engine and poll it for completed packets. The async
engine can be either a physical DMA device or a software-emulated backend.
The async enqueue data path leverages callback functions registered by
applications to work with the async engine.
Signed-off-by: Patrick Fu <patrick.fu@intel.com>
---
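Note (not part of the patch): as a rough illustration of the callback
contract this data path relies on, here is a minimal sketch of a
software-emulated backend. The names iov_iter_sketch, async_desc_sketch,
sw_transfer_data and sw_check_completed_copies are hypothetical local
stand-ins; they mirror rte_vhost_iov_iter, rte_vhost_async_desc and the two
callbacks from patch 1/2, with opaque_data reduced to a void pointer. The
sketch also assumes that iov_base is directly addressable by the CPU, which
the real data path does not guarantee, since it hands IOVA and host physical
addresses to the engine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Local stand-in for struct rte_vhost_iov_iter (hypothetical name). */
struct iov_iter_sketch {
	size_t offset;          /* offset to the first byte of data */
	size_t count;           /* total bytes covered by the iterator */
	struct iovec *iov;      /* segment array */
	unsigned long nr_segs;  /* number of segments */
};

/* Local stand-in for struct rte_vhost_async_desc (hypothetical name). */
struct async_desc_sketch {
	struct iov_iter_sketch *src;
	struct iov_iter_sketch *dst;
};

/* Segments copied but not yet reported as completed. */
static uint32_t sw_segs_done;

/*
 * transfer_data-style callback: copy every segment with memcpy() at
 * submission time. Returns the number of descs processed.
 */
static int
sw_transfer_data(int vid, uint16_t queue_id,
	struct async_desc_sketch *descs, void *opaque_data, uint16_t count)
{
	uint16_t i;
	unsigned long s;

	(void)vid;
	(void)queue_id;
	(void)opaque_data;

	for (i = 0; i < count; i++) {
		struct iov_iter_sketch *src = descs[i].src;
		struct iov_iter_sketch *dst = descs[i].dst;

		/* the enqueue path fills src and dst with matching segments */
		assert(src->nr_segs == dst->nr_segs);

		for (s = 0; s < src->nr_segs; s++) {
			memcpy(dst->iov[s].iov_base, src->iov[s].iov_base,
				src->iov[s].iov_len);
			sw_segs_done++;
		}
	}

	return count;
}

/*
 * check_completed_copies-style callback: report up to max_segs finished
 * iov segments (segments, not packets) and forget about them.
 */
static int
sw_check_completed_copies(int vid, uint16_t queue_id,
	void *opaque_data, uint16_t max_segs)
{
	uint32_t n = sw_segs_done < max_segs ? sw_segs_done : max_segs;

	(void)vid;
	(void)queue_id;
	(void)opaque_data;

	sw_segs_done -= n;
	return (int)n;
}
```

Because the copy happens inside the submit callback, every segment is already
complete by the time the poll callback runs; a real DMA backend would instead
queue the work and report completions later.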
 lib/librte_vhost/rte_vhost_async.h |  40 +++
 lib/librte_vhost/virtio_net.c      | 551 ++++++++++++++++++++++++++++-
 2 files changed, 589 insertions(+), 2 deletions(-)
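Side note on the per-packet bookkeeping used by the submit and poll paths in
this patch: each async_pending_info slot keeps the descriptor count in its
low bits and the number of outstanding iov segments from bit 16 upward, and
the oldest in-flight slot is derived from async_pkts_idx and
async_pkts_inflight_n. The sketch below restates that arithmetic in
standalone form. The helper names are hypothetical, and the 0xFFFF mask is an
assumption (ASYNC_PENDING_INFO_N_MSK is defined in patch 1/2); only the shift
of 16 is visible in this diff. The start index is written as a single masked
subtraction, which should match the two-branch form in the patch as long as
the ring size is a power of two and pkts_idx is below it.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: low 16 bits = descriptor count, upper bits = segments. */
#define PENDING_INFO_N_SFT 16
#define PENDING_INFO_N_MSK 0xFFFFu

/* Pack descriptor count and outstanding segment count into one slot. */
static inline uint64_t
pending_info_pack(uint16_t n_descs, uint64_t n_segs)
{
	return (uint64_t)n_descs | (n_segs << PENDING_INFO_N_SFT);
}

/* Number of used-ring descriptors consumed by the packet. */
static inline uint16_t
pending_info_descs(uint64_t info)
{
	return (uint16_t)(info & PENDING_INFO_N_MSK);
}

/* Number of iov segments still owned by the async engine. */
static inline uint64_t
pending_info_segs(uint64_t info)
{
	return info >> PENDING_INFO_N_SFT;
}

/* Index of the oldest in-flight packet in a power-of-two ring. */
static inline uint16_t
inflight_start_idx(uint16_t pkts_idx, uint16_t ring_size, uint16_t n_inflight)
{
	/* masking only works for power-of-two ring sizes */
	assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);

	return (uint16_t)(pkts_idx - n_inflight) & (ring_size - 1);
}
```

For example, with a 256-entry ring, a write index of 2 and 5 packets in
flight, the oldest in-flight slot wraps around to index 253.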
diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
index d5a59279a..c8ad8dbc7 100644
--- a/lib/librte_vhost/rte_vhost_async.h
+++ b/lib/librte_vhost/rte_vhost_async.h
@@ -133,4 +133,44 @@ int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
 __rte_experimental
 int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
 
+/**
+ * This function submits enqueue data to the async engine. Completion of
+ * the transfer is not guaranteed upon return. Applications should poll
+ * the transfer status with rte_vhost_poll_enqueue_completed().
+ *
+ * @param vid
+ *  id of vhost device to enqueue data
+ * @param queue_id
+ *  queue id to enqueue data
+ * @param pkts
+ *  array of packets to be enqueued
+ * @param count
+ *  packets num to be enqueued
+ * @return
+ *  num of packets enqueued
+ */
+__rte_experimental
+uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count);
+
+/**
+ * This function checks the async completion status of a specific vhost
+ * device queue. Packets whose copy (enqueue) operation has finished
+ * are returned in an array.
+ *
+ * @param vid
+ *  id of vhost device to check copy completion
+ * @param queue_id
+ *  queue id to check copy completion
+ * @param pkts
+ *  blank array to get return packet pointer
+ * @param count
+ *  size of the packet array
+ * @return
+ *  num of packets returned
+ */
+__rte_experimental
+uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count);
+
 #endif /* _RTE_VHOST_ASYNC_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 751c1f373..236498f71 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -17,14 +17,15 @@
 #include <rte_arp.h>
 #include <rte_spinlock.h>
 #include <rte_malloc.h>
+#include <rte_vhost_async.h>
 
 #include "iotlb.h"
 #include "vhost.h"
 
-#define MAX_PKT_BURST 32
-
 #define MAX_BATCH_LEN 256
 
+#define VHOST_ASYNC_BATCH_THRESHOLD 32
+
 static  __rte_always_inline bool
 rxvq_is_mergeable(struct virtio_net *dev)
 {
@@ -116,6 +117,31 @@ flush_shadow_used_ring_split(struct virtio_net *dev, struct vhost_virtqueue *vq)
 		sizeof(vq->used->idx));
 }
 
+static __rte_always_inline void
+async_flush_shadow_used_ring_split(struct virtio_net *dev,
+	struct vhost_virtqueue *vq)
+{
+	uint16_t used_idx = vq->last_used_idx & (vq->size - 1);
+
+	if (used_idx + vq->shadow_used_idx <= vq->size) {
+		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0,
+					  vq->shadow_used_idx);
+	} else {
+		uint16_t size;
+
+		/* update used ring interval [used_idx, vq->size] */
+		size = vq->size - used_idx;
+		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0, size);
+
+		/* update the left half used ring interval [0, left_size] */
+		do_flush_shadow_used_ring_split(dev, vq, 0, size,
+					  vq->shadow_used_idx - size);
+	}
+
+	vq->last_used_idx += vq->shadow_used_idx;
+	vq->shadow_used_idx = 0;
+}
+
 static __rte_always_inline void
 update_shadow_used_ring_split(struct vhost_virtqueue *vq,
 			 uint16_t desc_idx, uint32_t len)
@@ -905,6 +931,200 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return error;
 }
 
+static __rte_always_inline void
+async_fill_vec(struct iovec *v, void *base, size_t len)
+{
+	v->iov_base = base;
+	v->iov_len = len;
+}
+
+static __rte_always_inline void
+async_fill_iter(struct rte_vhost_iov_iter *it, size_t count,
+	struct iovec *vec, unsigned long nr_seg)
+{
+	it->offset = 0;
+	it->count = count;
+
+	if (count) {
+		it->iov = vec;
+		it->nr_segs = nr_seg;
+	} else {
+		it->iov = 0;
+		it->nr_segs = 0;
+	}
+}
+
+static __rte_always_inline void
+async_fill_desc(struct rte_vhost_async_desc *desc,
+	struct rte_vhost_iov_iter *src, struct rte_vhost_iov_iter *dst)
+{
+	desc->src = src;
+	desc->dst = dst;
+}
+
+static __rte_always_inline int
+async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
+			struct rte_mbuf *m, struct buf_vector *buf_vec,
+			uint16_t nr_vec, uint16_t num_buffers,
+			struct iovec *src_iovec, struct iovec *dst_iovec,
+			struct rte_vhost_iov_iter *src_it,
+			struct rte_vhost_iov_iter *dst_it)
+{
+	uint32_t vec_idx = 0;
+	uint32_t mbuf_offset, mbuf_avail;
+	uint32_t buf_offset, buf_avail;
+	uint64_t buf_addr, buf_iova, buf_len;
+	uint32_t cpy_len, cpy_threshold;
+	uint64_t hdr_addr;
+	struct rte_mbuf *hdr_mbuf;
+	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
+	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
+	int error = 0;
+
+	uint32_t tlen = 0;
+	int tvec_idx = 0;
+	void *hpa;
+
+	if (unlikely(m == NULL)) {
+		error = -1;
+		goto out;
+	}
+
+	cpy_threshold = vq->async_threshold;
+
+	buf_addr = buf_vec[vec_idx].buf_addr;
+	buf_iova = buf_vec[vec_idx].buf_iova;
+	buf_len = buf_vec[vec_idx].buf_len;
+
+	if (unlikely(buf_len < dev->vhost_hlen && nr_vec <= 1)) {
+		error = -1;
+		goto out;
+	}
+
+	hdr_mbuf = m;
+	hdr_addr = buf_addr;
+	if (unlikely(buf_len < dev->vhost_hlen))
+		hdr = &tmp_hdr;
+	else
+		hdr = (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)hdr_addr;
+
+	VHOST_LOG_DATA(DEBUG, "(%d) RX: num merge buffers %d\n",
+		dev->vid, num_buffers);
+
+	if (unlikely(buf_len < dev->vhost_hlen)) {
+		buf_offset = dev->vhost_hlen - buf_len;
+		vec_idx++;
+		buf_addr = buf_vec[vec_idx].buf_addr;
+		buf_iova = buf_vec[vec_idx].buf_iova;
+		buf_len = buf_vec[vec_idx].buf_len;
+		buf_avail = buf_len - buf_offset;
+	} else {
+		buf_offset = dev->vhost_hlen;
+		buf_avail = buf_len - dev->vhost_hlen;
+	}
+
+	mbuf_avail  = rte_pktmbuf_data_len(m);
+	mbuf_offset = 0;
+
+	while (mbuf_avail != 0 || m->next != NULL) {
+		/* done with current buf, get the next one */
+		if (buf_avail == 0) {
+			vec_idx++;
+			if (unlikely(vec_idx >= nr_vec)) {
+				error = -1;
+				goto out;
+			}
+
+			buf_addr = buf_vec[vec_idx].buf_addr;
+			buf_iova = buf_vec[vec_idx].buf_iova;
+			buf_len = buf_vec[vec_idx].buf_len;
+
+			buf_offset = 0;
+			buf_avail  = buf_len;
+		}
+
+		/* done with current mbuf, get the next one */
+		if (mbuf_avail == 0) {
+			m = m->next;
+
+			mbuf_offset = 0;
+			mbuf_avail  = rte_pktmbuf_data_len(m);
+		}
+
+		if (hdr_addr) {
+			virtio_enqueue_offload(hdr_mbuf, &hdr->hdr);
+			if (rxvq_is_mergeable(dev))
+				ASSIGN_UNLESS_EQUAL(hdr->num_buffers,
+						num_buffers);
+
+			if (unlikely(hdr == &tmp_hdr)) {
+				copy_vnet_hdr_to_desc(dev, vq, buf_vec, hdr);
+			} else {
+				PRINT_PACKET(dev, (uintptr_t)hdr_addr,
+						dev->vhost_hlen, 0);
+				vhost_log_cache_write_iova(dev, vq,
+						buf_vec[0].buf_iova,
+						dev->vhost_hlen);
+			}
+
+			hdr_addr = 0;
+		}
+
+		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
+
+		if (unlikely(cpy_len >= cpy_threshold)) {
+			hpa = (void *)(uintptr_t)gpa_to_hpa(dev,
+					buf_iova + buf_offset, cpy_len);
+
+			if (unlikely(!hpa)) {
+				error = -1;
+				goto out;
+			}
+
+			async_fill_vec(src_iovec + tvec_idx,
+				(void *)(uintptr_t)rte_pktmbuf_iova_offset(m,
+						mbuf_offset), cpy_len);
+
+			async_fill_vec(dst_iovec + tvec_idx, hpa, cpy_len);
+
+			tlen += cpy_len;
+			tvec_idx++;
+		} else {
+			if (unlikely(vq->batch_copy_nb_elems >= vq->size)) {
+				rte_memcpy(
+				(void *)((uintptr_t)(buf_addr + buf_offset)),
+				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
+				cpy_len);
+
+				PRINT_PACKET(dev,
+					(uintptr_t)(buf_addr + buf_offset),
+					cpy_len, 0);
+			} else {
+				batch_copy[vq->batch_copy_nb_elems].dst =
+				(void *)((uintptr_t)(buf_addr + buf_offset));
+				batch_copy[vq->batch_copy_nb_elems].src =
+				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset);
+				batch_copy[vq->batch_copy_nb_elems].log_addr =
+					buf_iova + buf_offset;
+				batch_copy[vq->batch_copy_nb_elems].len =
+					cpy_len;
+				vq->batch_copy_nb_elems++;
+			}
+		}
+
+		mbuf_avail  -= cpy_len;
+		mbuf_offset += cpy_len;
+		buf_avail  -= cpy_len;
+		buf_offset += cpy_len;
+	}
+
+out:
+	async_fill_iter(src_it, tlen, src_iovec, tvec_idx);
+	async_fill_iter(dst_it, tlen, dst_iovec, tvec_idx);
+
+	return error;
+}
+
 static __rte_always_inline int
 vhost_enqueue_single_packed(struct virtio_net *dev,
 			    struct vhost_virtqueue *vq,
@@ -1236,6 +1456,333 @@ rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
 	return virtio_dev_rx(dev, queue_id, pkts, count);
 }
 
+static __rte_always_inline uint16_t
+virtio_dev_rx_async_get_info_idx(uint16_t pkts_idx,
+	uint16_t vq_size, uint16_t n_inflight)
+{
+	return pkts_idx > n_inflight ? (pkts_idx - n_inflight) :
+		(vq_size - n_inflight + pkts_idx) & (vq_size - 1);
+}
+
+static __rte_always_inline void
+virtio_dev_rx_async_submit_split_err(struct virtio_net *dev,
+	struct vhost_virtqueue *vq, uint16_t queue_id,
+	uint16_t last_idx, uint16_t shadow_idx)
+{
+	uint16_t start_idx, pkts_idx, vq_size;
+	uint64_t *async_pending_info;
+
+	pkts_idx = vq->async_pkts_idx;
+	async_pending_info = vq->async_pending_info;
+	vq_size = vq->size;
+	start_idx = virtio_dev_rx_async_get_info_idx(pkts_idx,
+		vq_size, vq->async_pkts_inflight_n);
+
+	while (likely((start_idx & (vq_size - 1)) != pkts_idx)) {
+		uint64_t n_seg =
+			async_pending_info[(start_idx) & (vq_size - 1)] >>
+			ASYNC_PENDING_INFO_N_SFT;
+
+		while (n_seg)
+			n_seg -= vq->async_ops.check_completed_copies(dev->vid,
+				queue_id, 0, 1);
+	}
+
+	vq->async_pkts_inflight_n = 0;
+	vq->batch_copy_nb_elems = 0;
+
+	vq->shadow_used_idx = shadow_idx;
+	vq->last_avail_idx = last_idx;
+}
+
+static __rte_noinline uint32_t
+virtio_dev_rx_async_submit_split(struct virtio_net *dev,
+	struct vhost_virtqueue *vq, uint16_t queue_id,
+	struct rte_mbuf **pkts, uint32_t count)
+{
+	uint32_t pkt_idx = 0, pkt_burst_idx = 0;
+	uint16_t num_buffers;
+	struct buf_vector buf_vec[BUF_VECTOR_MAX];
+	uint16_t avail_head, last_idx, shadow_idx;
+
+	struct rte_vhost_iov_iter *it_pool = vq->it_pool;
+	struct iovec *vec_pool = vq->vec_pool;
+	struct rte_vhost_async_desc tdes[MAX_PKT_BURST];
+	struct iovec *src_iovec = vec_pool;
+	struct iovec *dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
+	struct rte_vhost_iov_iter *src_it = it_pool;
+	struct rte_vhost_iov_iter *dst_it = it_pool + 1;
+	uint16_t n_free_slot, slot_idx;
+	int n_pkts = 0;
+
+	avail_head = __atomic_load_n(&vq->avail->idx, __ATOMIC_ACQUIRE);
+	last_idx = vq->last_avail_idx;
+	shadow_idx = vq->shadow_used_idx;
+
+	/*
+	 * The ordering between avail index and
+	 * desc reads needs to be enforced.
+	 */
+	rte_smp_rmb();
+
+	rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
+
+	for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
+		uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
+		uint16_t nr_vec = 0;
+
+		if (unlikely(reserve_avail_buf_split(dev, vq,
+						pkt_len, buf_vec, &num_buffers,
+						avail_head, &nr_vec) < 0)) {
+			VHOST_LOG_DATA(DEBUG,
+				"(%d) failed to get enough desc from vring\n",
+				dev->vid);
+			vq->shadow_used_idx -= num_buffers;
+			break;
+		}
+
+		VHOST_LOG_DATA(DEBUG, "(%d) current index %d | end index %d\n",
+			dev->vid, vq->last_avail_idx,
+			vq->last_avail_idx + num_buffers);
+
+		if (async_mbuf_to_desc(dev, vq, pkts[pkt_idx],
+				buf_vec, nr_vec, num_buffers,
+				src_iovec, dst_iovec, src_it, dst_it) < 0) {
+			vq->shadow_used_idx -= num_buffers;
+			break;
+		}
+
+		slot_idx = (vq->async_pkts_idx + pkt_idx) & (vq->size - 1);
+		if (src_it->count) {
+			async_fill_desc(&tdes[pkt_burst_idx], src_it, dst_it);
+			pkt_burst_idx++;
+			vq->async_pending_info[slot_idx] =
+				num_buffers | (src_it->nr_segs << 16);
+			src_iovec += src_it->nr_segs;
+			dst_iovec += dst_it->nr_segs;
+			src_it += 2;
+			dst_it += 2;
+		} else {
+			vq->async_pending_info[slot_idx] = num_buffers;
+			vq->async_pkts_inflight_n++;
+		}
+
+		vq->last_avail_idx += num_buffers;
+
+		if (pkt_burst_idx >= VHOST_ASYNC_BATCH_THRESHOLD ||
+				(pkt_idx == count - 1 && pkt_burst_idx)) {
+			n_pkts = vq->async_ops.transfer_data(dev->vid,
+					queue_id, tdes, 0, pkt_burst_idx);
+			src_iovec = vec_pool;
+			dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
+			src_it = it_pool;
+			dst_it = it_pool + 1;
+
+			if (unlikely(n_pkts < (int)pkt_burst_idx)) {
+				vq->async_pkts_inflight_n +=
+					n_pkts > 0 ? n_pkts : 0;
+				virtio_dev_rx_async_submit_split_err(dev,
+					vq, queue_id, last_idx, shadow_idx);
+				return 0;
+			}
+
+			pkt_burst_idx = 0;
+			vq->async_pkts_inflight_n += n_pkts;
+		}
+	}
+
+	if (pkt_burst_idx) {
+		n_pkts = vq->async_ops.transfer_data(dev->vid,
+				queue_id, tdes, 0, pkt_burst_idx);
+		if (unlikely(n_pkts < (int)pkt_burst_idx)) {
+			vq->async_pkts_inflight_n += n_pkts > 0 ? n_pkts : 0;
+			virtio_dev_rx_async_submit_split_err(dev, vq, queue_id,
+				last_idx, shadow_idx);
+			return 0;
+		}
+
+		vq->async_pkts_inflight_n += n_pkts;
+	}
+
+	do_data_copy_enqueue(dev, vq);
+
+	n_free_slot = vq->size - vq->async_pkts_idx;
+	if (n_free_slot > pkt_idx) {
+		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
+			pkts, pkt_idx * sizeof(uintptr_t));
+		vq->async_pkts_idx += pkt_idx;
+	} else {
+		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
+			pkts, n_free_slot * sizeof(uintptr_t));
+		rte_memcpy(&vq->async_pkts_pending[0],
+			&pkts[n_free_slot],
+			(pkt_idx - n_free_slot) * sizeof(uintptr_t));
+		vq->async_pkts_idx = pkt_idx - n_free_slot;
+	}
+
+	if (likely(vq->shadow_used_idx))
+		async_flush_shadow_used_ring_split(dev, vq);
+
+	return pkt_idx;
+}
+
+uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count)
+{
+	struct virtio_net *dev = get_device(vid);
+	struct vhost_virtqueue *vq;
+	uint16_t n_pkts_cpl, n_pkts_put = 0, n_descs = 0;
+	uint16_t start_idx, pkts_idx, vq_size;
+	uint64_t *async_pending_info;
+
+	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
+	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
+		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
+			dev->vid, __func__, queue_id);
+		return 0;
+	}
+
+	vq = dev->virtqueue[queue_id];
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	pkts_idx = vq->async_pkts_idx;
+	async_pending_info = vq->async_pending_info;
+	vq_size = vq->size;
+	start_idx = virtio_dev_rx_async_get_info_idx(pkts_idx,
+		vq_size, vq->async_pkts_inflight_n);
+
+	n_pkts_cpl =
+		vq->async_ops.check_completed_copies(vid, queue_id, 0, count);
+
+	rte_smp_wmb();
+
+	while (likely(((start_idx + n_pkts_put) & (vq_size - 1)) != pkts_idx)) {
+		uint64_t info = async_pending_info[
+			(start_idx + n_pkts_put) & (vq_size - 1)];
+		uint64_t n_segs;
+		n_pkts_put++;
+		n_descs += info & ASYNC_PENDING_INFO_N_MSK;
+		n_segs = info >> ASYNC_PENDING_INFO_N_SFT;
+
+		if (n_segs) {
+			if (!n_pkts_cpl || n_pkts_cpl < n_segs) {
+				n_pkts_put--;
+				n_descs -= info & ASYNC_PENDING_INFO_N_MSK;
+				if (n_pkts_cpl) {
+					async_pending_info[
+						(start_idx + n_pkts_put) &
+						(vq_size - 1)] =
+					((n_segs - n_pkts_cpl) <<
+					 ASYNC_PENDING_INFO_N_SFT) |
+					(info & ASYNC_PENDING_INFO_N_MSK);
+					n_pkts_cpl = 0;
+				}
+				break;
+			}
+			n_pkts_cpl -= n_segs;
+		}
+	}
+
+	if (n_pkts_put) {
+		vq->async_pkts_inflight_n -= n_pkts_put;
+		__atomic_add_fetch(&vq->used->idx, n_descs, __ATOMIC_RELEASE);
+
+		vhost_vring_call_split(dev, vq);
+	}
+
+	if (start_idx + n_pkts_put <= vq_size) {
+		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
+			n_pkts_put * sizeof(uintptr_t));
+	} else {
+		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
+			(vq_size - start_idx) * sizeof(uintptr_t));
+		rte_memcpy(&pkts[vq_size - start_idx], vq->async_pkts_pending,
+			(n_pkts_put - vq_size + start_idx) * sizeof(uintptr_t));
+	}
+
+	rte_spinlock_unlock(&vq->access_lock);
+
+	return n_pkts_put;
+}
+
+static __rte_always_inline uint32_t
+virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id,
+	struct rte_mbuf **pkts, uint32_t count)
+{
+	struct vhost_virtqueue *vq;
+	uint32_t nb_tx = 0;
+	bool drawback = false;
+
+	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
+	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
+		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
+			dev->vid, __func__, queue_id);
+		return 0;
+	}
+
+	vq = dev->virtqueue[queue_id];
+
+	rte_spinlock_lock(&vq->access_lock);
+
+	if (unlikely(vq->enabled == 0))
+		goto out_access_unlock;
+
+	if (unlikely(!vq->async_registered)) {
+		drawback = true;
+		goto out_access_unlock;
+	}
+
+	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+		vhost_user_iotlb_rd_lock(vq);
+
+	if (unlikely(vq->access_ok == 0))
+		if (unlikely(vring_translate(dev, vq) < 0))
+			goto out;
+
+	count = RTE_MIN((uint32_t)MAX_PKT_BURST, count);
+	if (count == 0)
+		goto out;
+
+	/* TODO: packed queue not implemented */
+	if (vq_is_packed(dev))
+		nb_tx = 0;
+	else
+		nb_tx = virtio_dev_rx_async_submit_split(dev,
+				vq, queue_id, pkts, count);
+
+out:
+	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+		vhost_user_iotlb_rd_unlock(vq);
+
+out_access_unlock:
+	rte_spinlock_unlock(&vq->access_lock);
+
+	if (drawback)
+		return rte_vhost_enqueue_burst(dev->vid, queue_id, pkts, count);
+
+	return nb_tx;
+}
+
+uint16_t
+rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
+		struct rte_mbuf **pkts, uint16_t count)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (!dev)
+		return 0;
+
+	if (unlikely(!(dev->flags & VIRTIO_DEV_BUILTIN_VIRTIO_NET))) {
+		VHOST_LOG_DATA(ERR,
+			"(%d) %s: built-in vhost net backend is disabled.\n",
+			dev->vid, __func__);
+		return 0;
+	}
+
+	return virtio_dev_rx_async_submit(dev, queue_id, pkts, count);
+}
+
 static inline bool
 virtio_net_with_host_offload(struct virtio_net *dev)
 {
-- 
2.18.4
* Re: [dpdk-dev] [PATCH v6 1/2] vhost: introduce async enqueue registration API
  2020-07-07  5:07   ` [dpdk-dev] [PATCH v6 1/2] vhost: introduce async enqueue registration API patrick.fu
@ 2020-07-07  8:22     ` Xia, Chenbo
  0 siblings, 0 replies; 36+ messages in thread
From: Xia, Chenbo @ 2020-07-07  8:22 UTC (permalink / raw)
  To: Fu, Patrick, dev, maxime.coquelin, Wang, Zhihong
  Cc: Wang, Yinan, Jiang, Cheng1, Liang, Cunming
> -----Original Message-----
> From: Fu, Patrick <patrick.fu@intel.com>
> Sent: Tuesday, July 7, 2020 1:07 PM
> To: dev@dpdk.org; maxime.coquelin@redhat.com; Xia, Chenbo
> <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
> Cc: Fu, Patrick <patrick.fu@intel.com>; Wang, Yinan <yinan.wang@intel.com>;
> Jiang, Cheng1 <cheng1.jiang@intel.com>; Liang, Cunming
> <cunming.liang@intel.com>
> Subject: [PATCH v6 1/2] vhost: introduce async enqueue registration API
> 
> From: Patrick Fu <patrick.fu@intel.com>
> 
> Performing large memory copies usually takes up a major part of CPU cycles and
> becomes the hot spot in vhost-user enqueue operation. To offload the large
> copies from CPU to the DMA devices, asynchronous APIs are introduced, with
> which the CPU just submits copy jobs to the DMA but without waiting for its
> copy completion. Thus, there is no CPU intervention during data transfer. We
> can save precious CPU cycles and improve the overall throughput for vhost-user
> based applications. This patch introduces registration/un-registration APIs for
> vhost async data enqueue operation. Together with the registration APIs
> implementations, data structures and the prototype of the async callback
> functions required for async enqueue data path are also defined.
> 
> Signed-off-by: Patrick Fu <patrick.fu@intel.com>
> ---
>  lib/librte_vhost/Makefile              |   2 +-
>  lib/librte_vhost/meson.build           |   2 +-
>  lib/librte_vhost/rte_vhost.h           |   1 +
>  lib/librte_vhost/rte_vhost_async.h     | 136 +++++++++++++++++++++++++
>  lib/librte_vhost/rte_vhost_version.map |   4 +
>  lib/librte_vhost/socket.c              |  27 +++++
>  lib/librte_vhost/vhost.c               | 127 ++++++++++++++++++++++-
>  lib/librte_vhost/vhost.h               |  30 +++++-
>  lib/librte_vhost/vhost_user.c          |  23 ++++-
>  9 files changed, 345 insertions(+), 7 deletions(-)
>  create mode 100644 lib/librte_vhost/rte_vhost_async.h
> 
> diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> index b7ff7dc4b..4f2f3e47d 100644
> --- a/lib/librte_vhost/Makefile
> +++ b/lib/librte_vhost/Makefile
> @@ -42,7 +42,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
> 
>  # install includes
>  SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h \
> -						rte_vdpa_dev.h
> +						rte_vdpa_dev.h rte_vhost_async.h
> 
>  # only compile vhost crypto when cryptodev is enabled
>  ifeq ($(CONFIG_RTE_LIBRTE_CRYPTODEV),y)
> diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> index 882a0eaf4..cc9aa65c6 100644
> --- a/lib/librte_vhost/meson.build
> +++ b/lib/librte_vhost/meson.build
> @@ -22,5 +22,5 @@ sources = files('fd_man.c', 'iotlb.c', 'socket.c', 'vdpa.c',
>  		'vhost.c', 'vhost_user.c',
>  		'virtio_net.c', 'vhost_crypto.c')
>  headers = files('rte_vhost.h', 'rte_vdpa.h', 'rte_vdpa_dev.h',
> -		'rte_vhost_crypto.h')
> +		'rte_vhost_crypto.h', 'rte_vhost_async.h')
>  deps += ['ethdev', 'cryptodev', 'hash', 'pci']
> diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
> index 8a5c332c8..f93f9595a 100644
> --- a/lib/librte_vhost/rte_vhost.h
> +++ b/lib/librte_vhost/rte_vhost.h
> @@ -35,6 +35,7 @@ extern "C" {
>  #define RTE_VHOST_USER_EXTBUF_SUPPORT	(1ULL << 5)
>  /* support only linear buffers (no chained mbufs) */
>  #define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
> +#define RTE_VHOST_USER_ASYNC_COPY	(1ULL << 7)
> 
>  /* Features. */
>  #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
> diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
> new file mode 100644
> index 000000000..d5a59279a
> --- /dev/null
> +++ b/lib/librte_vhost/rte_vhost_async.h
> @@ -0,0 +1,136 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_VHOST_ASYNC_H_
> +#define _RTE_VHOST_ASYNC_H_
> +
> +#include "rte_vhost.h"
> +
> +/**
> + * iovec iterator
> + */
> +struct rte_vhost_iov_iter {
> +	/** offset to the first byte of interesting data */
> +	size_t offset;
> +	/** total bytes of data in this iterator */
> +	size_t count;
> +	/** pointer to the iovec array */
> +	struct iovec *iov;
> +	/** number of iovec in this iterator */
> +	unsigned long nr_segs;
> +};
> +
> +/**
> + * dma transfer descriptor pair
> + */
> +struct rte_vhost_async_desc {
> +	/** source memory iov_iter */
> +	struct rte_vhost_iov_iter *src;
> +	/** destination memory iov_iter */
> +	struct rte_vhost_iov_iter *dst;
> +};
> +
> +/**
> + * dma transfer status
> + */
> +struct rte_vhost_async_status {
> +	/** An array of application specific data for source memory */
> +	uintptr_t *src_opaque_data;
> +	/** An array of application specific data for destination memory */
> +	uintptr_t *dst_opaque_data;
> +};
> +
> +/**
> + * dma operation callbacks to be implemented by applications
> + */
> +struct rte_vhost_async_channel_ops {
> +	/**
> +	 * instruct async engines to perform copies for a batch of packets
> +	 *
> +	 * @param vid
> +	 *  id of vhost device to perform data copies
> +	 * @param queue_id
> +	 *  queue id to perform data copies
> +	 * @param descs
> +	 *  an array of DMA transfer memory descriptors
> +	 * @param opaque_data
> +	 *  opaque data pair sending to DMA engine
> +	 * @param count
> +	 *  number of elements in the "descs" array
> +	 * @return
> +	 *  -1 on failure, number of descs processed on success
> +	 */
> +	int (*transfer_data)(int vid, uint16_t queue_id,
> +		struct rte_vhost_async_desc *descs,
> +		struct rte_vhost_async_status *opaque_data,
> +		uint16_t count);
> +	/**
> +	 * check copy-completed packets from the async engine
> +	 * @param vid
> +	 *  id of vhost device to check copy completion
> +	 * @param queue_id
> +	 *  queue id to check copy completion
> +	 * @param opaque_data
> +	 *  buffer to receive the opaque data pair from DMA engine
> +	 * @param max_packets
> +	 *  max number of packets could be completed
> +	 * @return
> +	 *  -1 on failure, number of iov segments completed on success
> +	 */
> +	int (*check_completed_copies)(int vid, uint16_t queue_id,
> +		struct rte_vhost_async_status *opaque_data,
> +		uint16_t max_packets);
> +};
> +
> +/**
> + *  dma channel feature bit definition
> + */
> +struct rte_vhost_async_features {
> +	union {
> +		uint32_t intval;
> +		struct {
> +			uint32_t async_inorder:1;
> +			uint32_t resvd_0:15;
> +			uint32_t async_threshold:12;
> +			uint32_t resvd_1:4;
> +		};
> +	};
> +};
> +
> +/**
> + * register an async channel for vhost
> + *
> + * @param vid
> + *  vhost device id async channel to be attached to
> + * @param queue_id
> + *  vhost queue id async channel to be attached to
> + * @param features
> + *  DMA channel feature bit
> + *    b0       : DMA supports inorder data transfer
> + *    b1  - b15: reserved
> + *    b16 - b27: Packet length threshold for DMA transfer
> + *    b28 - b31: reserved
> + * @param ops
> + *  DMA operation callbacks
> + * @return
> + *  0 on success, -1 on failures
> + */
> +__rte_experimental
> +int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
> +	uint32_t features, struct rte_vhost_async_channel_ops *ops);
> +
> +/**
> + * unregister a dma channel for vhost
> + *
> + * @param vid
> + *  vhost device id DMA channel to be detached
> + * @param queue_id
> + *  vhost queue id DMA channel to be detached
> + * @return
> + *  0 on success, -1 on failures
> + */
> +__rte_experimental
> +int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
> +
> +#endif /* _RTE_VHOST_ASYNC_H_ */
> diff --git a/lib/librte_vhost/rte_vhost_version.map b/lib/librte_vhost/rte_vhost_version.map
> index 86784405a..13ec53b63 100644
> --- a/lib/librte_vhost/rte_vhost_version.map
> +++ b/lib/librte_vhost/rte_vhost_version.map
> @@ -71,4 +71,8 @@ EXPERIMENTAL {
>  	rte_vdpa_get_queue_num;
>  	rte_vdpa_get_features;
>  	rte_vdpa_get_protocol_features;
> +	rte_vhost_async_channel_register;
> +	rte_vhost_async_channel_unregister;
> +	rte_vhost_submit_enqueue_burst;
> +	rte_vhost_poll_enqueue_completed;
>  };
> diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
> index 49267cebf..c4626d2c4 100644
> --- a/lib/librte_vhost/socket.c
> +++ b/lib/librte_vhost/socket.c
> @@ -42,6 +42,7 @@ struct vhost_user_socket {
>  	bool use_builtin_virtio_net;
>  	bool extbuf;
>  	bool linearbuf;
> +	bool async_copy;
> 
>  	/*
>  	 * The "supported_features" indicates the feature bits the
> @@ -205,6 +206,7 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
>  	size_t size;
>  	struct vhost_user_connection *conn;
>  	int ret;
> +	struct virtio_net *dev;
> 
>  	if (vsocket == NULL)
>  		return;
> @@ -236,6 +238,13 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
>  	if (vsocket->linearbuf)
>  		vhost_enable_linearbuf(vid);
> 
> +	if (vsocket->async_copy) {
> +		dev = get_device(vid);
> +
> +		if (dev)
> +			dev->async_copy = 1;
> +	}
> +
>  	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
> 
>  	if (vsocket->notify_ops->new_connection) {
> @@ -881,6 +890,17 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
>  		goto out_mutex;
>  	}
> 
> +	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
> +
> +	if (vsocket->async_copy &&
> +		(flags & (RTE_VHOST_USER_IOMMU_SUPPORT |
> +		RTE_VHOST_USER_POSTCOPY_SUPPORT))) {
> +		VHOST_LOG_CONFIG(ERR, "error: enabling async copy and IOMMU "
> +			"or post-copy feature simultaneously is not "
> +			"supported\n");
> +		goto out_mutex;
> +	}
> +
>  	/*
>  	 * Set the supported features correctly for the builtin vhost-user
>  	 * net driver.
> @@ -931,6 +951,13 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
>  			~(1ULL << VHOST_USER_PROTOCOL_F_PAGEFAULT);
>  	}
> 
> +	if (vsocket->async_copy) {
> +		vsocket->supported_features &= ~(1ULL << VHOST_F_LOG_ALL);
> +		vsocket->features &= ~(1ULL << VHOST_F_LOG_ALL);
> +		VHOST_LOG_CONFIG(INFO,
> +			"Logging feature is disabled in async copy mode\n");
> +	}
> +
>  	/*
>  	 * We'll not be able to receive a buffer from guest in linear mode
>  	 * without external buffer if it will not fit in a single mbuf, which is
> diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> index 0d822d6a3..a11385f39 100644
> --- a/lib/librte_vhost/vhost.c
> +++ b/lib/librte_vhost/vhost.c
> @@ -332,8 +332,13 @@ free_vq(struct virtio_net *dev, struct vhost_virtqueue *vq)
>  {
>  	if (vq_is_packed(dev))
>  		rte_free(vq->shadow_used_packed);
> -	else
> +	else {
>  		rte_free(vq->shadow_used_split);
> +		if (vq->async_pkts_pending)
> +			rte_free(vq->async_pkts_pending);
> +		if (vq->async_pending_info)
> +			rte_free(vq->async_pending_info);
> +	}
>  	rte_free(vq->batch_copy_elems);
>  	rte_mempool_free(vq->iotlb_pool);
>  	rte_free(vq);
> @@ -1522,3 +1527,123 @@ RTE_INIT(vhost_log_init)
>  	if (vhost_data_log_level >= 0)
>  		rte_log_set_level(vhost_data_log_level, RTE_LOG_WARNING);
>  }
> +
> +int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
> +					uint32_t features,
> +					struct rte_vhost_async_channel_ops *ops)
> +{
> +	struct vhost_virtqueue *vq;
> +	struct virtio_net *dev = get_device(vid);
> +	struct rte_vhost_async_features f;
> +
> +	if (dev == NULL || ops == NULL)
> +		return -1;
> +
> +	f.intval = features;
> +
> +	vq = dev->virtqueue[queue_id];
> +
> +	if (unlikely(vq == NULL || !dev->async_copy))
> +		return -1;
> +
> +	/* packed queue is not supported */
> +	if (unlikely(vq_is_packed(dev) || !f.async_inorder)) {
> +		VHOST_LOG_CONFIG(ERR,
> +			"async copy is not supported on packed queue or non-inorder mode "
> +			"(vid %d, qid: %d)\n", vid, queue_id);
> +		return -1;
> +	}
> +
> +	if (unlikely(ops->check_completed_copies == NULL ||
> +		ops->transfer_data == NULL))
> +		return -1;
> +
> +	rte_spinlock_lock(&vq->access_lock);
> +
> +	if (unlikely(vq->async_registered)) {
> +		VHOST_LOG_CONFIG(ERR,
> +			"async register failed: channel already registered "
> +			"(vid %d, qid: %d)\n", vid, queue_id);
> +		goto reg_out;
> +	}
> +
> +	vq->async_pkts_pending = rte_malloc(NULL,
> +			vq->size * sizeof(uintptr_t),
> +			RTE_CACHE_LINE_SIZE);
> +	vq->async_pending_info = rte_malloc(NULL,
> +			vq->size * sizeof(uint64_t),
> +			RTE_CACHE_LINE_SIZE);
> +	if (!vq->async_pkts_pending || !vq->async_pending_info) {
> +		if (vq->async_pkts_pending)
> +			rte_free(vq->async_pkts_pending);
> +
> +		if (vq->async_pending_info)
> +			rte_free(vq->async_pending_info);
> +
> +		VHOST_LOG_CONFIG(ERR,
> +				"async register failed: cannot allocate memory for vq data "
> +				"(vid %d, qid: %d)\n", vid, queue_id);
> +		goto reg_out;
> +	}
> +
> +	vq->async_ops.check_completed_copies = ops->check_completed_copies;
> +	vq->async_ops.transfer_data = ops->transfer_data;
> +
> +	vq->async_inorder = f.async_inorder;
> +	vq->async_threshold = f.async_threshold;
> +
> +	vq->async_registered = true;
> +
> +reg_out:
> +	rte_spinlock_unlock(&vq->access_lock);
> +
> +	return 0;
> +}
> +
> +int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id)
> +{
> +	struct vhost_virtqueue *vq;
> +	struct virtio_net *dev = get_device(vid);
> +	int ret = -1;
> +
> +	if (dev == NULL)
> +		return ret;
> +
> +	vq = dev->virtqueue[queue_id];
> +
> +	if (vq == NULL)
> +		return ret;
> +
> +	ret = 0;
> +	rte_spinlock_lock(&vq->access_lock);
> +
> +	if (!vq->async_registered)
> +		goto out;
> +
> +	if (vq->async_pkts_inflight_n) {
> +		VHOST_LOG_CONFIG(ERR, "Failed to unregister async channel. "
> +			"async inflight packets must be completed before unregistration.\n");
> +		ret = -1;
> +		goto out;
> +	}
> +
> +	if (vq->async_pkts_pending) {
> +		rte_free(vq->async_pkts_pending);
> +		vq->async_pkts_pending = NULL;
> +	}
> +
> +	if (vq->async_pending_info) {
> +		rte_free(vq->async_pending_info);
> +		vq->async_pending_info = NULL;
> +	}
> +
> +	vq->async_ops.transfer_data = NULL;
> +	vq->async_ops.check_completed_copies = NULL;
> +	vq->async_registered = false;
> +
> +out:
> +	rte_spinlock_unlock(&vq->access_lock);
> +
> +	return ret;
> +}
> +
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index 034463699..f3731982b 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -24,6 +24,8 @@
>  #include "rte_vdpa.h"
>  #include "rte_vdpa_dev.h"
> 
> +#include "rte_vhost_async.h"
> +
>  /* Used to indicate that the device is running on a data core */
>  #define VIRTIO_DEV_RUNNING 1
>  /* Used to indicate that the device is ready to operate */
> @@ -40,6 +42,11 @@
> 
>  #define VHOST_LOG_CACHE_NR 32
> 
> +#define MAX_PKT_BURST 32
> +
> +#define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
> +#define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
> +
>  #define PACKED_DESC_ENQUEUE_USED_FLAG(w)	\
>  	((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \
>  		VRING_DESC_F_WRITE)
> @@ -202,6 +209,25 @@ struct vhost_virtqueue {
>  	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_list;
>  	int				iotlb_cache_nr;
>  	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_pending_list;
> +
> +	/* operation callbacks for async dma */
> +	struct rte_vhost_async_channel_ops	async_ops;
> +
> +	struct rte_vhost_iov_iter it_pool[VHOST_MAX_ASYNC_IT];
> +	struct iovec vec_pool[VHOST_MAX_ASYNC_VEC];
> +
> +	/* async data transfer status */
> +	uintptr_t	**async_pkts_pending;
> +	#define		ASYNC_PENDING_INFO_N_MSK 0xFFFF
> +	#define		ASYNC_PENDING_INFO_N_SFT 16
> +	uint64_t	*async_pending_info;
> +	uint16_t	async_pkts_idx;
> +	uint16_t	async_pkts_inflight_n;
> +
> +	/* vq async features */
> +	bool		async_inorder;
> +	bool		async_registered;
> +	uint16_t	async_threshold;
>  } __rte_cache_aligned;
> 
>  #define VHOST_MAX_VRING			0x100
> @@ -338,6 +364,7 @@ struct virtio_net {
>  	int16_t			broadcast_rarp;
>  	uint32_t		nr_vring;
>  	int			dequeue_zero_copy;
> +	int			async_copy;
>  	int			extbuf;
>  	int			linearbuf;
>  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
> @@ -683,7 +710,8 @@ vhost_vring_call_split(struct virtio_net *dev, struct vhost_virtqueue *vq)
>  	/* Don't kick guest if we don't reach index specified by guest. */
>  	if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
>  		uint16_t old = vq->signalled_used;
> -		uint16_t new = vq->last_used_idx;
> +		uint16_t new = vq->async_pkts_inflight_n ?
> +					vq->used->idx:vq->last_used_idx;
>  		bool signalled_used_valid = vq->signalled_used_valid;
> 
>  		vq->signalled_used = new;
> diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
> index 6039a8fdb..aa8605523 100644
> --- a/lib/librte_vhost/vhost_user.c
> +++ b/lib/librte_vhost/vhost_user.c
> @@ -476,12 +476,14 @@ vhost_user_set_vring_num(struct virtio_net **pdev,
>  	} else {
>  		if (vq->shadow_used_split)
>  			rte_free(vq->shadow_used_split);
> +
>  		vq->shadow_used_split = rte_malloc(NULL,
>  				vq->size * sizeof(struct vring_used_elem),
>  				RTE_CACHE_LINE_SIZE);
> +
>  		if (!vq->shadow_used_split) {
>  			VHOST_LOG_CONFIG(ERR,
> -					"failed to allocate memory for shadow used ring.\n");
> +					"failed to allocate memory for vq internal data.\n");
>  			return RTE_VHOST_MSG_RESULT_ERR;
>  		}
>  	}
> @@ -1166,7 +1168,8 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
>  			goto err_mmap;
>  		}
> 
> -		populate = (dev->dequeue_zero_copy) ? MAP_POPULATE : 0;
> +		populate = (dev->dequeue_zero_copy || dev->async_copy) ?
> +			MAP_POPULATE : 0;
>  		mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
>  				 MAP_SHARED | populate, fd, 0);
> 
> @@ -1181,7 +1184,7 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
>  		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
>  				      mmap_offset;
> 
> -		if (dev->dequeue_zero_copy)
> +		if (dev->dequeue_zero_copy || dev->async_copy)
>  			if (add_guest_pages(dev, reg, alignment) < 0) {
>  				VHOST_LOG_CONFIG(ERR,
>  					"adding guest pages to region %u failed.\n",
> @@ -1979,6 +1982,12 @@ vhost_user_get_vring_base(struct virtio_net **pdev,
>  	} else {
>  		rte_free(vq->shadow_used_split);
>  		vq->shadow_used_split = NULL;
> +		if (vq->async_pkts_pending)
> +			rte_free(vq->async_pkts_pending);
> +		if (vq->async_pending_info)
> +			rte_free(vq->async_pending_info);
> +		vq->async_pkts_pending = NULL;
> +		vq->async_pending_info = NULL;
>  	}
> 
>  	rte_free(vq->batch_copy_elems);
> @@ -2012,6 +2021,14 @@ vhost_user_set_vring_enable(struct virtio_net **pdev,
>  		"set queue enable: %d to qp idx: %d\n",
>  		enable, index);
> 
> +	if (!enable && dev->virtqueue[index]->async_registered) {
> +		if (dev->virtqueue[index]->async_pkts_inflight_n) {
> +			VHOST_LOG_CONFIG(ERR, "failed to disable vring. "
> +			"async inflight packets must be completed first\n");
> +			return RTE_VHOST_MSG_RESULT_ERR;
> +		}
> +	}
> +
>  	/* On disable, rings have to be stopped being processed. */
>  	if (!enable && dev->dequeue_zero_copy)
>  		drain_zmbuf_list(dev->virtqueue[index]);
> --
> 2.18.4
Reviewed-by: Chenbo Xia <chenbo.xia@intel.com>
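For readers following the status bookkeeping added to struct vhost_virtqueue, here is a minimal sketch of how one 64-bit async_pending_info entry can be packed and unpacked with ASYNC_PENDING_INFO_N_MSK and ASYNC_PENDING_INFO_N_SFT. The helper names (pack_pending_info, pending_info_descs, pending_info_segs) are hypothetical and not part of the patch. The layout is inferred from how patch 2/2 writes "num_buffers | (src_it->nr_segs << 16)": descriptor count in the low 16 bits, pending segment count above it.

```c
#include <stdint.h>

/* Same values as the macros added to vhost.h by this patch. */
#define ASYNC_PENDING_INFO_N_MSK 0xFFFF
#define ASYNC_PENDING_INFO_N_SFT 16

/* Hypothetical helper: combine descriptor count and segment count. */
static inline uint64_t
pack_pending_info(uint16_t n_descs, uint64_t n_segs)
{
	return (uint64_t)n_descs | (n_segs << ASYNC_PENDING_INFO_N_SFT);
}

/* Hypothetical helper: number of descriptors used by the packet. */
static inline uint16_t
pending_info_descs(uint64_t info)
{
	return (uint16_t)(info & ASYNC_PENDING_INFO_N_MSK);
}

/* Hypothetical helper: number of async segments still pending. */
static inline uint64_t
pending_info_segs(uint64_t info)
{
	return info >> ASYNC_PENDING_INFO_N_SFT;
}
```

An entry whose segment count is zero corresponds to a packet that was copied entirely by the CPU, which is how the completion path in patch 2/2 appears to treat it.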
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v6 2/2] vhost: introduce async enqueue for split ring
  2020-07-07  5:07   ` [dpdk-dev] [PATCH v6 2/2] vhost: introduce async enqueue for split ring patrick.fu
@ 2020-07-07  8:22     ` Xia, Chenbo
  0 siblings, 0 replies; 36+ messages in thread
From: Xia, Chenbo @ 2020-07-07  8:22 UTC (permalink / raw)
  To: Fu, Patrick, dev, maxime.coquelin, Wang, Zhihong
  Cc: Wang, Yinan, Jiang, Cheng1, Liang, Cunming
> -----Original Message-----
> From: Fu, Patrick <patrick.fu@intel.com>
> Sent: Tuesday, July 7, 2020 1:07 PM
> To: dev@dpdk.org; maxime.coquelin@redhat.com; Xia, Chenbo
> <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
> Cc: Fu, Patrick <patrick.fu@intel.com>; Wang, Yinan <yinan.wang@intel.com>;
> Jiang, Cheng1 <cheng1.jiang@intel.com>; Liang, Cunming
> <cunming.liang@intel.com>
> Subject: [PATCH v6 2/2] vhost: introduce async enqueue for split ring
> 
> From: Patrick Fu <patrick.fu@intel.com>
> 
> This patch implements the async enqueue data path for split ring. Two new
> async data path APIs are defined, by which applications can submit packets
> to, and poll completed packets from, async engines. The async engine can be
> either a physical DMA device or a software-emulated backend.
> The async enqueue data path leverages callback functions registered by
> applications to work with the async engine.
> 
> Signed-off-by: Patrick Fu <patrick.fu@intel.com>
> ---
>  lib/librte_vhost/rte_vhost_async.h |  40 +++
>  lib/librte_vhost/virtio_net.c      | 551 ++++++++++++++++++++++++++++-
>  2 files changed, 589 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_vhost/rte_vhost_async.h b/lib/librte_vhost/rte_vhost_async.h
> index d5a59279a..c8ad8dbc7 100644
> --- a/lib/librte_vhost/rte_vhost_async.h
> +++ b/lib/librte_vhost/rte_vhost_async.h
> @@ -133,4 +133,44 @@ int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
>  __rte_experimental
>  int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
> 
> +/**
> + * This function submits enqueue data to the async engine. It gives
> + * no guarantee that the transfer has completed upon return. Applications
> + * should poll the transfer status with rte_vhost_poll_enqueue_completed().
> + *
> + * @param vid
> + *  id of vhost device to enqueue data
> + * @param queue_id
> + *  queue id to enqueue data
> + * @param pkts
> + *  array of packets to be enqueued
> + * @param count
> + *  packets num to be enqueued
> + * @return
> + *  num of packets enqueued
> + */
> +__rte_experimental
> +uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
> +		struct rte_mbuf **pkts, uint16_t count);
> +
> +/**
> + * This function checks the async completion status for a specific vhost
> + * device queue. Packets that have finished the copy (enqueue) operation
> + * are returned in an array.
> + *
> + * @param vid
> + *  id of vhost device to enqueue data
> + * @param queue_id
> + *  queue id to enqueue data
> + * @param pkts
> + *  blank array to get return packet pointer
> + * @param count
> + *  size of the packet array
> + * @return
> + *  num of packets returned
> + */
> +__rte_experimental
> +uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
> +		struct rte_mbuf **pkts, uint16_t count);
> +
>  #endif /* _RTE_VHOST_ASYNC_H_ */
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 751c1f373..236498f71 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -17,14 +17,15 @@
>  #include <rte_arp.h>
>  #include <rte_spinlock.h>
>  #include <rte_malloc.h>
> +#include <rte_vhost_async.h>
> 
>  #include "iotlb.h"
>  #include "vhost.h"
> 
> -#define MAX_PKT_BURST 32
> -
>  #define MAX_BATCH_LEN 256
> 
> +#define VHOST_ASYNC_BATCH_THRESHOLD 32
> +
>  static  __rte_always_inline bool
>  rxvq_is_mergeable(struct virtio_net *dev)
>  {
> @@ -116,6 +117,31 @@ flush_shadow_used_ring_split(struct virtio_net *dev, struct vhost_virtqueue *vq)
>  		sizeof(vq->used->idx));
>  }
> 
> +static __rte_always_inline void
> +async_flush_shadow_used_ring_split(struct virtio_net *dev,
> +	struct vhost_virtqueue *vq)
> +{
> +	uint16_t used_idx = vq->last_used_idx & (vq->size - 1);
> +
> +	if (used_idx + vq->shadow_used_idx <= vq->size) {
> +		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0,
> +					  vq->shadow_used_idx);
> +	} else {
> +		uint16_t size;
> +
> +		/* update used ring interval [used_idx, vq->size] */
> +		size = vq->size - used_idx;
> +		do_flush_shadow_used_ring_split(dev, vq, used_idx, 0, size);
> +
> +		/* update the left half used ring interval [0, left_size] */
> +		do_flush_shadow_used_ring_split(dev, vq, 0, size,
> +					  vq->shadow_used_idx - size);
> +	}
> +
> +	vq->last_used_idx += vq->shadow_used_idx;
> +	vq->shadow_used_idx = 0;
> +}
> +
>  static __rte_always_inline void
>  update_shadow_used_ring_split(struct vhost_virtqueue *vq,
>  			 uint16_t desc_idx, uint32_t len)
> @@ -905,6 +931,200 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	return error;
>  }
> 
> +static __rte_always_inline void
> +async_fill_vec(struct iovec *v, void *base, size_t len)
> +{
> +	v->iov_base = base;
> +	v->iov_len = len;
> +}
> +
> +static __rte_always_inline void
> +async_fill_iter(struct rte_vhost_iov_iter *it, size_t count,
> +	struct iovec *vec, unsigned long nr_seg)
> +{
> +	it->offset = 0;
> +	it->count = count;
> +
> +	if (count) {
> +		it->iov = vec;
> +		it->nr_segs = nr_seg;
> +	} else {
> +		it->iov = 0;
> +		it->nr_segs = 0;
> +	}
> +}
> +
> +static __rte_always_inline void
> +async_fill_desc(struct rte_vhost_async_desc *desc,
> +	struct rte_vhost_iov_iter *src, struct rte_vhost_iov_iter *dst)
> +{
> +	desc->src = src;
> +	desc->dst = dst;
> +}
> +
> +static __rte_always_inline int
> +async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> +			struct rte_mbuf *m, struct buf_vector *buf_vec,
> +			uint16_t nr_vec, uint16_t num_buffers,
> +			struct iovec *src_iovec, struct iovec *dst_iovec,
> +			struct rte_vhost_iov_iter *src_it,
> +			struct rte_vhost_iov_iter *dst_it)
> +{
> +	uint32_t vec_idx = 0;
> +	uint32_t mbuf_offset, mbuf_avail;
> +	uint32_t buf_offset, buf_avail;
> +	uint64_t buf_addr, buf_iova, buf_len;
> +	uint32_t cpy_len, cpy_threshold;
> +	uint64_t hdr_addr;
> +	struct rte_mbuf *hdr_mbuf;
> +	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
> +	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
> +	int error = 0;
> +
> +	uint32_t tlen = 0;
> +	int tvec_idx = 0;
> +	void *hpa;
> +
> +	if (unlikely(m == NULL)) {
> +		error = -1;
> +		goto out;
> +	}
> +
> +	cpy_threshold = vq->async_threshold;
> +
> +	buf_addr = buf_vec[vec_idx].buf_addr;
> +	buf_iova = buf_vec[vec_idx].buf_iova;
> +	buf_len = buf_vec[vec_idx].buf_len;
> +
> +	if (unlikely(buf_len < dev->vhost_hlen && nr_vec <= 1)) {
> +		error = -1;
> +		goto out;
> +	}
> +
> +	hdr_mbuf = m;
> +	hdr_addr = buf_addr;
> +	if (unlikely(buf_len < dev->vhost_hlen))
> +		hdr = &tmp_hdr;
> +	else
> +		hdr = (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)hdr_addr;
> +
> +	VHOST_LOG_DATA(DEBUG, "(%d) RX: num merge buffers %d\n",
> +		dev->vid, num_buffers);
> +
> +	if (unlikely(buf_len < dev->vhost_hlen)) {
> +		buf_offset = dev->vhost_hlen - buf_len;
> +		vec_idx++;
> +		buf_addr = buf_vec[vec_idx].buf_addr;
> +		buf_iova = buf_vec[vec_idx].buf_iova;
> +		buf_len = buf_vec[vec_idx].buf_len;
> +		buf_avail = buf_len - buf_offset;
> +	} else {
> +		buf_offset = dev->vhost_hlen;
> +		buf_avail = buf_len - dev->vhost_hlen;
> +	}
> +
> +	mbuf_avail  = rte_pktmbuf_data_len(m);
> +	mbuf_offset = 0;
> +
> +	while (mbuf_avail != 0 || m->next != NULL) {
> +		/* done with current buf, get the next one */
> +		if (buf_avail == 0) {
> +			vec_idx++;
> +			if (unlikely(vec_idx >= nr_vec)) {
> +				error = -1;
> +				goto out;
> +			}
> +
> +			buf_addr = buf_vec[vec_idx].buf_addr;
> +			buf_iova = buf_vec[vec_idx].buf_iova;
> +			buf_len = buf_vec[vec_idx].buf_len;
> +
> +			buf_offset = 0;
> +			buf_avail  = buf_len;
> +		}
> +
> +		/* done with current mbuf, get the next one */
> +		if (mbuf_avail == 0) {
> +			m = m->next;
> +
> +			mbuf_offset = 0;
> +			mbuf_avail  = rte_pktmbuf_data_len(m);
> +		}
> +
> +		if (hdr_addr) {
> +			virtio_enqueue_offload(hdr_mbuf, &hdr->hdr);
> +			if (rxvq_is_mergeable(dev))
> +				ASSIGN_UNLESS_EQUAL(hdr->num_buffers,
> +						num_buffers);
> +
> +			if (unlikely(hdr == &tmp_hdr)) {
> +				copy_vnet_hdr_to_desc(dev, vq, buf_vec, hdr);
> +			} else {
> +				PRINT_PACKET(dev, (uintptr_t)hdr_addr,
> +						dev->vhost_hlen, 0);
> +				vhost_log_cache_write_iova(dev, vq,
> +						buf_vec[0].buf_iova,
> +						dev->vhost_hlen);
> +			}
> +
> +			hdr_addr = 0;
> +		}
> +
> +		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
> +
> +		if (unlikely(cpy_len >= cpy_threshold)) {
> +			hpa = (void *)(uintptr_t)gpa_to_hpa(dev,
> +					buf_iova + buf_offset, cpy_len);
> +
> +			if (unlikely(!hpa)) {
> +				error = -1;
> +				goto out;
> +			}
> +
> +			async_fill_vec(src_iovec + tvec_idx,
> +				(void *)(uintptr_t)rte_pktmbuf_iova_offset(m,
> +						mbuf_offset), cpy_len);
> +
> +			async_fill_vec(dst_iovec + tvec_idx, hpa, cpy_len);
> +
> +			tlen += cpy_len;
> +			tvec_idx++;
> +		} else {
> +			if (unlikely(vq->batch_copy_nb_elems >= vq->size)) {
> +				rte_memcpy(
> +				(void *)((uintptr_t)(buf_addr + buf_offset)),
> +				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
> +				cpy_len);
> +
> +				PRINT_PACKET(dev,
> +					(uintptr_t)(buf_addr + buf_offset),
> +					cpy_len, 0);
> +			} else {
> +				batch_copy[vq->batch_copy_nb_elems].dst =
> +				(void *)((uintptr_t)(buf_addr + buf_offset));
> +				batch_copy[vq->batch_copy_nb_elems].src =
> +				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset);
> +				batch_copy[vq->batch_copy_nb_elems].log_addr =
> +					buf_iova + buf_offset;
> +				batch_copy[vq->batch_copy_nb_elems].len =
> +					cpy_len;
> +				vq->batch_copy_nb_elems++;
> +			}
> +		}
> +
> +		mbuf_avail  -= cpy_len;
> +		mbuf_offset += cpy_len;
> +		buf_avail  -= cpy_len;
> +		buf_offset += cpy_len;
> +	}
> +
> +out:
> +	async_fill_iter(src_it, tlen, src_iovec, tvec_idx);
> +	async_fill_iter(dst_it, tlen, dst_iovec, tvec_idx);
> +
> +	return error;
> +}
> +
>  static __rte_always_inline int
>  vhost_enqueue_single_packed(struct virtio_net *dev,
>  			    struct vhost_virtqueue *vq,
> @@ -1236,6 +1456,333 @@ rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
>  	return virtio_dev_rx(dev, queue_id, pkts, count);
>  }
> 
> +static __rte_always_inline uint16_t
> +virtio_dev_rx_async_get_info_idx(uint16_t pkts_idx,
> +	uint16_t vq_size, uint16_t n_inflight)
> +{
> +	return pkts_idx > n_inflight ? (pkts_idx - n_inflight) :
> +		(vq_size - n_inflight + pkts_idx) & (vq_size - 1);
> +}
> +
> +static __rte_always_inline void
> +virtio_dev_rx_async_submit_split_err(struct virtio_net *dev,
> +	struct vhost_virtqueue *vq, uint16_t queue_id,
> +	uint16_t last_idx, uint16_t shadow_idx)
> +{
> +	uint16_t start_idx, pkts_idx, vq_size;
> +	uint64_t *async_pending_info;
> +
> +	pkts_idx = vq->async_pkts_idx;
> +	async_pending_info = vq->async_pending_info;
> +	vq_size = vq->size;
> +	start_idx = virtio_dev_rx_async_get_info_idx(pkts_idx,
> +		vq_size, vq->async_pkts_inflight_n);
> +
> +	while (likely((start_idx & (vq_size - 1)) != pkts_idx)) {
> +		uint64_t n_seg =
> +			async_pending_info[(start_idx) & (vq_size - 1)] >>
> +			ASYNC_PENDING_INFO_N_SFT;
> +
> +		while (n_seg)
> +			n_seg -= vq->async_ops.check_completed_copies(dev->vid,
> +				queue_id, 0, 1);
> +	}
> +
> +	vq->async_pkts_inflight_n = 0;
> +	vq->batch_copy_nb_elems = 0;
> +
> +	vq->shadow_used_idx = shadow_idx;
> +	vq->last_avail_idx = last_idx;
> +}
> +
> +static __rte_noinline uint32_t
> +virtio_dev_rx_async_submit_split(struct virtio_net *dev,
> +	struct vhost_virtqueue *vq, uint16_t queue_id,
> +	struct rte_mbuf **pkts, uint32_t count)
> +{
> +	uint32_t pkt_idx = 0, pkt_burst_idx = 0;
> +	uint16_t num_buffers;
> +	struct buf_vector buf_vec[BUF_VECTOR_MAX];
> +	uint16_t avail_head, last_idx, shadow_idx;
> +
> +	struct rte_vhost_iov_iter *it_pool = vq->it_pool;
> +	struct iovec *vec_pool = vq->vec_pool;
> +	struct rte_vhost_async_desc tdes[MAX_PKT_BURST];
> +	struct iovec *src_iovec = vec_pool;
> +	struct iovec *dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
> +	struct rte_vhost_iov_iter *src_it = it_pool;
> +	struct rte_vhost_iov_iter *dst_it = it_pool + 1;
> +	uint16_t n_free_slot, slot_idx;
> +	int n_pkts = 0;
> +
> +	avail_head = __atomic_load_n(&vq->avail->idx, __ATOMIC_ACQUIRE);
> +	last_idx = vq->last_avail_idx;
> +	shadow_idx = vq->shadow_used_idx;
> +
> +	/*
> +	 * The ordering between avail index and
> +	 * desc reads needs to be enforced.
> +	 */
> +	rte_smp_rmb();
> +
> +	rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
> +
> +	for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
> +		uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
> +		uint16_t nr_vec = 0;
> +
> +		if (unlikely(reserve_avail_buf_split(dev, vq,
> +						pkt_len, buf_vec, &num_buffers,
> +						avail_head, &nr_vec) < 0)) {
> +			VHOST_LOG_DATA(DEBUG,
> +				"(%d) failed to get enough desc from vring\n",
> +				dev->vid);
> +			vq->shadow_used_idx -= num_buffers;
> +			break;
> +		}
> +
> +		VHOST_LOG_DATA(DEBUG, "(%d) current index %d | end index %d\n",
> +			dev->vid, vq->last_avail_idx,
> +			vq->last_avail_idx + num_buffers);
> +
> +		if (async_mbuf_to_desc(dev, vq, pkts[pkt_idx],
> +				buf_vec, nr_vec, num_buffers,
> +				src_iovec, dst_iovec, src_it, dst_it) < 0) {
> +			vq->shadow_used_idx -= num_buffers;
> +			break;
> +		}
> +
> +		slot_idx = (vq->async_pkts_idx + pkt_idx) & (vq->size - 1);
> +		if (src_it->count) {
> +			async_fill_desc(&tdes[pkt_burst_idx], src_it, dst_it);
> +			pkt_burst_idx++;
> +			vq->async_pending_info[slot_idx] =
> +				num_buffers | (src_it->nr_segs << 16);
> +			src_iovec += src_it->nr_segs;
> +			dst_iovec += dst_it->nr_segs;
> +			src_it += 2;
> +			dst_it += 2;
> +		} else {
> +			vq->async_pending_info[slot_idx] = num_buffers;
> +			vq->async_pkts_inflight_n++;
> +		}
> +
> +		vq->last_avail_idx += num_buffers;
> +
> +		if (pkt_burst_idx >= VHOST_ASYNC_BATCH_THRESHOLD ||
> +				(pkt_idx == count - 1 && pkt_burst_idx)) {
> +			n_pkts = vq->async_ops.transfer_data(dev->vid,
> +					queue_id, tdes, 0, pkt_burst_idx);
> +			src_iovec = vec_pool;
> +			dst_iovec = vec_pool + (VHOST_MAX_ASYNC_VEC >> 1);
> +			src_it = it_pool;
> +			dst_it = it_pool + 1;
> +
> +			if (unlikely(n_pkts < (int)pkt_burst_idx)) {
> +				vq->async_pkts_inflight_n +=
> +					n_pkts > 0 ? n_pkts : 0;
> +				virtio_dev_rx_async_submit_split_err(dev,
> +					vq, queue_id, last_idx, shadow_idx);
> +				return 0;
> +			}
> +
> +			pkt_burst_idx = 0;
> +			vq->async_pkts_inflight_n += n_pkts;
> +		}
> +	}
> +
> +	if (pkt_burst_idx) {
> +		n_pkts = vq->async_ops.transfer_data(dev->vid,
> +				queue_id, tdes, 0, pkt_burst_idx);
> +		if (unlikely(n_pkts < (int)pkt_burst_idx)) {
> +			vq->async_pkts_inflight_n += n_pkts > 0 ? n_pkts : 0;
> +			virtio_dev_rx_async_submit_split_err(dev, vq, queue_id,
> +				last_idx, shadow_idx);
> +			return 0;
> +		}
> +
> +		vq->async_pkts_inflight_n += n_pkts;
> +	}
> +
> +	do_data_copy_enqueue(dev, vq);
> +
> +	n_free_slot = vq->size - vq->async_pkts_idx;
> +	if (n_free_slot > pkt_idx) {
> +		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
> +			pkts, pkt_idx * sizeof(uintptr_t));
> +		vq->async_pkts_idx += pkt_idx;
> +	} else {
> +		rte_memcpy(&vq->async_pkts_pending[vq->async_pkts_idx],
> +			pkts, n_free_slot * sizeof(uintptr_t));
> +		rte_memcpy(&vq->async_pkts_pending[0],
> +			&pkts[n_free_slot],
> +			(pkt_idx - n_free_slot) * sizeof(uintptr_t));
> +		vq->async_pkts_idx = pkt_idx - n_free_slot;
> +	}
> +
> +	if (likely(vq->shadow_used_idx))
> +		async_flush_shadow_used_ring_split(dev, vq);
> +
> +	return pkt_idx;
> +}
> +
> +uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
> +		struct rte_mbuf **pkts, uint16_t count)
> +{
> +	struct virtio_net *dev = get_device(vid);
> +	struct vhost_virtqueue *vq;
> +	uint16_t n_pkts_cpl, n_pkts_put = 0, n_descs = 0;
> +	uint16_t start_idx, pkts_idx, vq_size;
> +	uint64_t *async_pending_info;
> +
> +	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
> +	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
> +		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
> +			dev->vid, __func__, queue_id);
> +		return 0;
> +	}
> +
> +	vq = dev->virtqueue[queue_id];
> +
> +	rte_spinlock_lock(&vq->access_lock);
> +
> +	pkts_idx = vq->async_pkts_idx;
> +	async_pending_info = vq->async_pending_info;
> +	vq_size = vq->size;
> +	start_idx = virtio_dev_rx_async_get_info_idx(pkts_idx,
> +		vq_size, vq->async_pkts_inflight_n);
> +
> +	n_pkts_cpl =
> +		vq->async_ops.check_completed_copies(vid, queue_id, 0, count);
> +
> +	rte_smp_wmb();
> +
> +	while (likely(((start_idx + n_pkts_put) & (vq_size - 1)) != pkts_idx)) {
> +		uint64_t info = async_pending_info[
> +			(start_idx + n_pkts_put) & (vq_size - 1)];
> +		uint64_t n_segs;
> +		n_pkts_put++;
> +		n_descs += info & ASYNC_PENDING_INFO_N_MSK;
> +		n_segs = info >> ASYNC_PENDING_INFO_N_SFT;
> +
> +		if (n_segs) {
> +			if (!n_pkts_cpl || n_pkts_cpl < n_segs) {
> +				n_pkts_put--;
> +				n_descs -= info & ASYNC_PENDING_INFO_N_MSK;
> +				if (n_pkts_cpl) {
> +					async_pending_info[
> +						(start_idx + n_pkts_put) &
> +						(vq_size - 1)] =
> +					((n_segs - n_pkts_cpl) <<
> +					 ASYNC_PENDING_INFO_N_SFT) |
> +					(info & ASYNC_PENDING_INFO_N_MSK);
> +					n_pkts_cpl = 0;
> +				}
> +				break;
> +			}
> +			n_pkts_cpl -= n_segs;
> +		}
> +	}
> +
> +	if (n_pkts_put) {
> +		vq->async_pkts_inflight_n -= n_pkts_put;
> +		__atomic_add_fetch(&vq->used->idx, n_descs, __ATOMIC_RELEASE);
> +
> +		vhost_vring_call_split(dev, vq);
> +	}
> +
> +	if (start_idx + n_pkts_put <= vq_size) {
> +		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
> +			n_pkts_put * sizeof(uintptr_t));
> +	} else {
> +		rte_memcpy(pkts, &vq->async_pkts_pending[start_idx],
> +			(vq_size - start_idx) * sizeof(uintptr_t));
> +		rte_memcpy(&pkts[vq_size - start_idx], vq->async_pkts_pending,
> +			(n_pkts_put - vq_size + start_idx) * sizeof(uintptr_t));
> +	}
> +
> +	rte_spinlock_unlock(&vq->access_lock);
> +
> +	return n_pkts_put;
> +}
> +
> +static __rte_always_inline uint32_t
> +virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id,
> +	struct rte_mbuf **pkts, uint32_t count)
> +{
> +	struct vhost_virtqueue *vq;
> +	uint32_t nb_tx = 0;
> +	bool drawback = false;
> +
> +	VHOST_LOG_DATA(DEBUG, "(%d) %s\n", dev->vid, __func__);
> +	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->nr_vring))) {
> +		VHOST_LOG_DATA(ERR, "(%d) %s: invalid virtqueue idx %d.\n",
> +			dev->vid, __func__, queue_id);
> +		return 0;
> +	}
> +
> +	vq = dev->virtqueue[queue_id];
> +
> +	rte_spinlock_lock(&vq->access_lock);
> +
> +	if (unlikely(vq->enabled == 0))
> +		goto out_access_unlock;
> +
> +	if (unlikely(!vq->async_registered)) {
> +		drawback = true;
> +		goto out_access_unlock;
> +	}
> +
> +	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
> +		vhost_user_iotlb_rd_lock(vq);
> +
> +	if (unlikely(vq->access_ok == 0))
> +		if (unlikely(vring_translate(dev, vq) < 0))
> +			goto out;
> +
> +	count = RTE_MIN((uint32_t)MAX_PKT_BURST, count);
> +	if (count == 0)
> +		goto out;
> +
> +	/* TODO: packed queue not implemented */
> +	if (vq_is_packed(dev))
> +		nb_tx = 0;
> +	else
> +		nb_tx = virtio_dev_rx_async_submit_split(dev,
> +				vq, queue_id, pkts, count);
> +
> +out:
> +	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
> +		vhost_user_iotlb_rd_unlock(vq);
> +
> +out_access_unlock:
> +	rte_spinlock_unlock(&vq->access_lock);
> +
> +	if (drawback)
> +		return rte_vhost_enqueue_burst(dev->vid, queue_id, pkts, count);
> +
> +	return nb_tx;
> +}
> +
> +uint16_t
> +rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
> +		struct rte_mbuf **pkts, uint16_t count)
> +{
> +	struct virtio_net *dev = get_device(vid);
> +
> +	if (!dev)
> +		return 0;
> +
> +	if (unlikely(!(dev->flags & VIRTIO_DEV_BUILTIN_VIRTIO_NET))) {
> +		VHOST_LOG_DATA(ERR,
> +			"(%d) %s: built-in vhost net backend is disabled.\n",
> +			dev->vid, __func__);
> +		return 0;
> +	}
> +
> +	return virtio_dev_rx_async_submit(dev, queue_id, pkts, count);
> +}
> +
>  static inline bool
>  virtio_net_with_host_offload(struct virtio_net *dev)
>  {
> --
> 2.18.4
Reviewed-by: Chenbo Xia <chenbo.xia@intel.com>
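As an aid to reading the completion path, here is a minimal standalone sketch of the start-index arithmetic performed by virtio_dev_rx_async_get_info_idx(). The function name async_start_idx is hypothetical. It assumes, as the masking in the patch implies, that the ring size is a power of two.

```c
#include <stdint.h>

/*
 * Given the next free slot (pkts_idx) and the number of packets still
 * in flight, return the slot of the oldest in-flight packet.
 * Assumption: vq_size is a power of two, so (vq_size - 1) is a valid mask.
 */
static inline uint16_t
async_start_idx(uint16_t pkts_idx, uint16_t vq_size, uint16_t n_inflight)
{
	if (pkts_idx > n_inflight)
		return (uint16_t)(pkts_idx - n_inflight);

	/* The in-flight window wraps around the end of the ring. */
	return (uint16_t)((vq_size - n_inflight + pkts_idx) & (vq_size - 1));
}
```

With a 256-entry ring, pkts_idx = 10 and 4 packets in flight give a start slot of 6, while pkts_idx = 2 and 4 in flight wrap around to slot 254.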
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v6 0/2] introduce asynchronous data path for vhost
  2020-07-07  5:07 ` [dpdk-dev] [PATCH v6 0/2] introduce asynchronous data path for vhost patrick.fu
  2020-07-07  5:07   ` [dpdk-dev] [PATCH v6 1/2] vhost: introduce async enqueue registration API patrick.fu
  2020-07-07  5:07   ` [dpdk-dev] [PATCH v6 2/2] vhost: introduce async enqueue for split ring patrick.fu
@ 2020-07-07 16:45   ` Ferruh Yigit
  2020-07-20 13:26   ` Maxime Coquelin
  3 siblings, 0 replies; 36+ messages in thread
From: Ferruh Yigit @ 2020-07-07 16:45 UTC (permalink / raw)
  To: patrick.fu, dev, maxime.coquelin, chenbo.xia, zhihong.wang
  Cc: yinan.wang, cheng1.jiang, cunming.liang
On 7/7/2020 6:07 AM, patrick.fu@intel.com wrote:
> From: Patrick Fu <patrick.fu@intel.com>
> 
> Performing large memory copies usually takes up a major part of CPU
> cycles and becomes the hot spot in the vhost-user enqueue operation.
> To offload expensive memory operations from the CPU, this patch set
> proposes to leverage DMA engines, e.g., I/OAT, a DMA engine in Intel
> processors, to accelerate large copies.
> 
> Large copies are offloaded from the CPU to the DMA engine in an
> asynchronous manner. The CPU just submits copy jobs to the DMA engine
> without waiting for them to complete. Thus, there is no CPU
> intervention during data transfer; we can save precious CPU cycles and
> improve the overall throughput for vhost-user based applications, like
> OVS. During packet transmission, large copies are offloaded to the DMA
> engine and small copies are performed by the CPU, because of the
> startup overhead associated with the DMA engine.
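
A minimal sketch of the size-based split described above, for illustration only. It mirrors the "cpy_len >= cpy_threshold" test in async_mbuf_to_desc(), but the names copy_path, choose_copy_path and classify_segments are hypothetical and do not appear in the series. The threshold value is whatever the application registered for the queue.

```c
#include <stddef.h>
#include <stdint.h>

enum copy_path { COPY_BY_CPU, COPY_BY_DMA };

/* Copies at or above the threshold go to the async (DMA) engine. */
static inline enum copy_path
choose_copy_path(uint32_t cpy_len, uint32_t threshold)
{
	return cpy_len >= threshold ? COPY_BY_DMA : COPY_BY_CPU;
}

/* Count how many segments of a packet would take each path. */
static inline void
classify_segments(const uint32_t *seg_lens, size_t n, uint32_t threshold,
		size_t *n_cpu, size_t *n_dma)
{
	size_t i;

	*n_cpu = 0;
	*n_dma = 0;
	for (i = 0; i < n; i++) {
		if (choose_copy_path(seg_lens[i], threshold) == COPY_BY_DMA)
			(*n_dma)++;
		else
			(*n_cpu)++;
	}
}
```

With a threshold of 256 bytes, segments of 64, 256 and 1500 bytes would be classified as one CPU copy and two DMA copies.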
> 
> This patch set constructs a general framework that applications can
> leverage to attach DMA channels to vhost-user transmit queues. Four
> new RTE APIs are introduced to the vhost library for applications to
> register and use the asynchronous data path. In addition, two new DMA
> operation callbacks are defined, by which the vhost-user asynchronous
> data path can interact with DMA hardware. Currently only the enqueue
> operation for split queue is implemented, but the framework can be
> extended to support packed queue.
> 
> v2:
> update meson file for new header file
> update rte_vhost_version.map to include new APIs
> rename async APIs/structures to be prefixed with "rte_vhost"
> rename some variables/structures for readability
> correct minor typo in comments/license statements
> refine memory allocation logic for vq internal buffer
> add error message printing in some failure cases
> check inflight async packets in unregistration API call
> mark new APIs as experimental
> 
> v3:
> use atomic_xxx() functions in updating ring index
> fix a bug in async enqueue failure handling
> 
> v4:
> part of the fix intended in the v3 patch was missed; this patch
> adds all those fixes
> 
> v5:
> minor changes on some function/variable names
> reset CPU batch copy packet count when async enqueue error
> occurs
> disable virtio log feature in async copy mode
> minor optimization on async shadow index flush
> 
> v6:
> add some background introduction in the commit message
> 
> Patrick Fu (2):
>   vhost: introduce async enqueue registration API
>   vhost: introduce async enqueue for split ring
> 
Reviewed-by: Chenbo Xia <chenbo.xia@intel.com>
Series applied to dpdk-next-net/master, thanks.
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v6 0/2] introduce asynchronous data path for vhost
  2020-07-07  5:07 ` [dpdk-dev] [PATCH v6 0/2] introduce asynchronous data path for vhost patrick.fu
                     ` (2 preceding siblings ...)
  2020-07-07 16:45   ` [dpdk-dev] [PATCH v6 0/2] introduce asynchronous data path for vhost Ferruh Yigit
@ 2020-07-20 13:26   ` Maxime Coquelin
  2020-07-21  2:28     ` Fu, Patrick
  3 siblings, 1 reply; 36+ messages in thread
From: Maxime Coquelin @ 2020-07-20 13:26 UTC (permalink / raw)
  To: patrick.fu, dev, chenbo.xia, zhihong.wang
  Cc: yinan.wang, cheng1.jiang, cunming.liang
Hi Patrick,
Thanks for the series.
I think we are missing a chapter in the Vhost lib documentation to explain what
this new API is about.
Do you think you can write something by -rc3?
Thanks in advance,
Maxime
On 7/7/20 7:07 AM, patrick.fu@intel.com wrote:
> From: Patrick Fu <patrick.fu@intel.com>
> 
> Performing large memory copies usually takes up a major part of CPU
> cycles and becomes the hot spot in vhost-user enqueue operation. To
> offload expensive memory operations from the CPU, this patch set
> proposes to leverage DMA engines, e.g., I/OAT, a DMA engine in
> Intel processors, to accelerate large copies.
> 
> Large copies are offloaded from the CPU to the DMA in an asynchronous
> manner. The CPU just submits copy jobs to the DMA without waiting
> for their completion. Thus, there is no CPU intervention during
> data transfer; we can save precious CPU cycles and improve the overall
> throughput for vhost-user based applications, like OVS. During packet
> transmission, the data path offloads large copies to the DMA and
> performs small copies on the CPU, because of the startup overhead
> associated with the DMA.
> 
> This patch set constructs a general framework that applications can
> leverage to attach DMA channels with vhost-user transmit queues. Four
> new RTE APIs are introduced to vhost library for applications to
> register and use the asynchronous data path. In addition, two new DMA
> operation callbacks are defined, by which vhost-user asynchronous data
> path can interact with DMA hardware. Currently only enqueue operation
> for split queue is implemented, but the framework is flexible to extend
> support for packed queue.
> 
> v2:
> update meson file for new header file
> update rte_vhost_version.map to include new APIs
> rename async APIs/structures to be prefixed with "rte_vhost"
> rename some variables/structures for readability
> correct minor typo in comments/license statements
> refine memory allocation logic for vq internal buffer
> add error message printing in some failure cases
> check inflight async packets in unregistration API call
> mark new APIs as experimental
> 
> v3:
> use atomic_xxx() functions in updating ring index
> fix a bug in async enqueue failure handling
> 
> v4:
> part of the fix intended in the v3 patch was missed; this patch
> adds all those fixes
> 
> v5:
> minor changes on some function/variable names
> reset CPU batch copy packet count when async enqueue error
> occurs
> disable virtio log feature in async copy mode
> minor optimization on async shadow index flush
> 
> v6:
> add some background introduction in the commit message
> 
> Patrick Fu (2):
>   vhost: introduce async enqueue registration API
>   vhost: introduce async enqueue for split ring
> 
>  lib/librte_vhost/Makefile              |   2 +-
>  lib/librte_vhost/meson.build           |   2 +-
>  lib/librte_vhost/rte_vhost.h           |   1 +
>  lib/librte_vhost/rte_vhost_async.h     | 176 ++++++++
>  lib/librte_vhost/rte_vhost_version.map |   4 +
>  lib/librte_vhost/socket.c              |  27 ++
>  lib/librte_vhost/vhost.c               | 127 +++++-
>  lib/librte_vhost/vhost.h               |  30 +-
>  lib/librte_vhost/vhost_user.c          |  23 +-
>  lib/librte_vhost/virtio_net.c          | 551 ++++++++++++++++++++++++-
>  10 files changed, 934 insertions(+), 9 deletions(-)
>  create mode 100644 lib/librte_vhost/rte_vhost_async.h
> 
* Re: [dpdk-dev] [PATCH v6 0/2] introduce asynchronous data path for vhost
  2020-07-20 13:26   ` Maxime Coquelin
@ 2020-07-21  2:28     ` Fu, Patrick
  2020-07-21  8:28       ` Maxime Coquelin
  0 siblings, 1 reply; 36+ messages in thread
From: Fu, Patrick @ 2020-07-21  2:28 UTC (permalink / raw)
  To: Maxime Coquelin, dev, Xia, Chenbo, Wang, Zhihong
  Cc: Wang, Yinan, Jiang, Cheng1, Liang, Cunming
Hi Maxime, 
> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Monday, July 20, 2020 9:27 PM
> To: Fu, Patrick <patrick.fu@intel.com>; dev@dpdk.org; Xia, Chenbo
> <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
> Cc: Wang, Yinan <yinan.wang@intel.com>; Jiang, Cheng1
> <cheng1.jiang@intel.com>; Liang, Cunming <cunming.liang@intel.com>
> Subject: Re: [PATCH v6 0/2] introduce asynchronous data path for vhost
> 
> Hi Patrick,
> 
> Thanks for the series.
> I think we are missing a chapter in the Vhost lib documentation to explain what this
> new API is about.
> 
> Do you think you can write something by -rc3?
Yes, I'm preparing the doc now. There is still a slight change to the API prototype that I would like to propose, so I will send both the doc and the change in a day or two.
Thanks,
Patrick
* Re: [dpdk-dev] [PATCH v6 0/2] introduce asynchronous data path for vhost
  2020-07-21  2:28     ` Fu, Patrick
@ 2020-07-21  8:28       ` Maxime Coquelin
  0 siblings, 0 replies; 36+ messages in thread
From: Maxime Coquelin @ 2020-07-21  8:28 UTC (permalink / raw)
  To: Fu, Patrick, dev, Xia, Chenbo, Wang, Zhihong
  Cc: Wang, Yinan, Jiang, Cheng1, Liang, Cunming
On 7/21/20 4:28 AM, Fu, Patrick wrote:
> Hi Maxime, 
> 
>> -----Original Message-----
>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Sent: Monday, July 20, 2020 9:27 PM
>> To: Fu, Patrick <patrick.fu@intel.com>; dev@dpdk.org; Xia, Chenbo
>> <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
>> Cc: Wang, Yinan <yinan.wang@intel.com>; Jiang, Cheng1
>> <cheng1.jiang@intel.com>; Liang, Cunming <cunming.liang@intel.com>
>> Subject: Re: [PATCH v6 0/2] introduce asynchronous data path for vhost
>>
>> Hi Patrick,
>>
>> Thanks for the series.
>> I think we are missing a chapter in the Vhost lib documentation to explain what this
>> new API is about.
>>
>> Do you think you can write something by -rc3?
> 
> Yes, I'm preparing the doc now. There is still a slight change to the API prototype that I would like to propose, so I will send both the doc and the change in a day or two.
> 
> Thanks,
> 
> Patrick
> 
Thanks Patrick!
Thread overview: 36+ messages
2020-06-11 10:02 [dpdk-dev] [PATCH v1 0/2] introduce asynchronous data path for vhost patrick.fu
2020-06-11 10:02 ` [dpdk-dev] [PATCH v1 1/2] vhost: introduce async data path registration API patrick.fu
2020-06-18  5:50   ` Liu, Yong
2020-06-18  9:08     ` Fu, Patrick
2020-06-19  0:40       ` Liu, Yong
2020-06-25 13:42     ` Maxime Coquelin
2020-06-26 14:28   ` Maxime Coquelin
2020-06-29  1:15     ` Fu, Patrick
2020-06-26 14:44   ` Maxime Coquelin
2020-06-11 10:02 ` [dpdk-dev] [PATCH v1 2/2] vhost: introduce async enqueue for split ring patrick.fu
2020-06-18  6:56   ` Liu, Yong
2020-06-18 11:36     ` Fu, Patrick
2020-06-26 14:39   ` Maxime Coquelin
2020-06-26 14:46   ` Maxime Coquelin
2020-06-29  1:25     ` Fu, Patrick
2020-06-26 14:42 ` [dpdk-dev] [PATCH v1 0/2] introduce asynchronous data path for vhost Maxime Coquelin
2020-07-03 10:27 ` [dpdk-dev] [PATCH v3 " patrick.fu
2020-07-03 10:27   ` [dpdk-dev] [PATCH v3 1/2] vhost: introduce async enqueue registration API patrick.fu
2020-07-03 10:27   ` [dpdk-dev] [PATCH v3 2/2] vhost: introduce async enqueue for split ring patrick.fu
2020-07-03 12:21 ` [dpdk-dev] [PATCH v4 0/2] introduce asynchronous data path for vhost patrick.fu
2020-07-03 12:21   ` [dpdk-dev] [PATCH v4 1/2] vhost: introduce async enqueue registration API patrick.fu
2020-07-06  3:05     ` Liu, Yong
2020-07-06  9:08       ` Fu, Patrick
2020-07-03 12:21   ` [dpdk-dev] [PATCH v4 2/2] vhost: introduce async enqueue for split ring patrick.fu
2020-07-06 11:53 ` [dpdk-dev] [PATCH v5 0/2] introduce asynchronous data path for vhost patrick.fu
2020-07-06 11:53   ` [dpdk-dev] [PATCH v5 1/2] vhost: introduce async enqueue registration API patrick.fu
2020-07-06 11:53   ` [dpdk-dev] [PATCH v5 2/2] vhost: introduce async enqueue for split ring patrick.fu
2020-07-07  5:07 ` [dpdk-dev] [PATCH v6 0/2] introduce asynchronous data path for vhost patrick.fu
2020-07-07  5:07   ` [dpdk-dev] [PATCH v6 1/2] vhost: introduce async enqueue registration API patrick.fu
2020-07-07  8:22     ` Xia, Chenbo
2020-07-07  5:07   ` [dpdk-dev] [PATCH v6 2/2] vhost: introduce async enqueue for split ring patrick.fu
2020-07-07  8:22     ` Xia, Chenbo
2020-07-07 16:45   ` [dpdk-dev] [PATCH v6 0/2] introduce asynchronous data path for vhost Ferruh Yigit
2020-07-20 13:26   ` Maxime Coquelin
2020-07-21  2:28     ` Fu, Patrick
2020-07-21  8:28       ` Maxime Coquelin