* [RFC 0/1] integrate dmadev in vhost
@ 2021-11-22 10:54 Jiayu Hu
  2021-11-22 10:54 ` [RFC 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu
  ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Jiayu Hu @ 2021-11-22 10:54 UTC (permalink / raw)
  To: dev
  Cc: maxime.coquelin, i.maximets, chenbo.xia, bruce.richardson,
      harry.van.haaren, john.mcnamara, sunil.pai.g, Jiayu Hu

Since dmadev was introduced in 21.11, to avoid the overhead of the vhost
DMA abstraction layer and simplify application logic, this patch
integrates dmadev in vhost. To enable the flexibility of using DMA
devices in different function modules, not limited to vhost, vhost
doesn't manage DMA devices. Applications, like OVS, need to manage and
configure DMA devices and tell vhost what DMA device to use in every
dataplane function call.

In addition, vhost supports M:N mapping between vrings and DMA virtual
channels. Specifically, one vring can use multiple different DMA channels
and one DMA channel can be shared by multiple vrings at the same time.
The reason for enabling one vring to use multiple DMA channels is that
more than one dataplane thread may enqueue packets to the same vring,
each with its own DMA virtual channel. Besides, the number of DMA devices
is limited. For the purpose of scaling, it's necessary to support sharing
DMA channels among vrings.

As DMA acceleration is only enabled for the enqueue path, the new
dataplane functions are:

1). rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id,
    dma_vchan):
    Get descriptors and submit copies to the DMA virtual channel for the
    packets that need to be sent to the VM.

2). rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id,
    dma_vchan):
    Check completed DMA copies from the given DMA virtual channel and
    write back corresponding descriptors to the vring.
OVS needs to call rte_vhost_poll_enqueue_completed to clean in-flight
copies submitted by previous calls, and it can be called inside the
rxq_recv function, so that it doesn't require a big change in the OVS
datapath. For example:

netdev_dpdk_vhost_rxq_recv()
{
    ...
    qid = rxq->queue_id * VIRTIO_QNUM + VIRTIO_RXQ;
    rte_vhost_poll_enqueue_completed(vid, qid, ...);
}

Jiayu Hu (1):
  vhost: integrate dmadev in asynchronous datapath

 doc/guides/prog_guide/vhost_lib.rst |  63 ++++----
 examples/vhost/ioat.c               | 218 ----------------------------
 examples/vhost/ioat.h               |  63 --------
 examples/vhost/main.c               | 144 +++++++++++++++---
 examples/vhost/main.h               |  12 ++
 examples/vhost/meson.build          |   6 +-
 lib/vhost/meson.build               |   3 +-
 lib/vhost/rte_vhost_async.h         |  73 +++------
 lib/vhost/vhost.c                   |  37 ++---
 lib/vhost/vhost.h                   |  45 +++++-
 lib/vhost/virtio_net.c              | 198 ++++++++++++++++++++-----
 11 files changed, 410 insertions(+), 452 deletions(-)
 delete mode 100644 examples/vhost/ioat.c
 delete mode 100644 examples/vhost/ioat.h

--
2.25.1

^ permalink raw reply	[flat|nested] 31+ messages in thread
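[Editor's note] The way the patch matches DMA copy completions back to packets, when one packet is split into several DMA copies, can be sketched in plain C. This is an illustrative simulation only: the ring size mirrors the patch's VHOST_ASYNC_DMA_TRACK_RING_SIZE, the actual DMA submission and rte_dma_completed() polling are elided, and the helper names (track_packet, complete_copies) are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Ring size mirrors VHOST_ASYNC_DMA_TRACK_RING_SIZE in the patch. */
#define TRACK_RING_SIZE 4096u
#define TRACK_RING_MASK (TRACK_RING_SIZE - 1u)

/*
 * One slot per submitted DMA copy. Only the slot of a packet's LAST
 * segment stores the address of that packet's completion flag; the
 * other slots stay NULL, like the metadata[] array in the patch's
 * async_dma_info.
 */
static bool *metadata[TRACK_RING_SIZE];

/* Record a packet split into nr_segs copies starting at ring index first. */
static void
track_packet(uint32_t first, uint16_t nr_segs, bool *cmpl_flag)
{
	uint16_t i;

	for (i = 0; i + 1 < nr_segs; i++)
		metadata[(first + i) & TRACK_RING_MASK] = NULL;
	metadata[(first + nr_segs - 1) & TRACK_RING_MASK] = cmpl_flag;
}

/*
 * Consume nr_done completed copies starting at ring index start (what
 * polling the DMA virtual channel would report). A non-NULL slot means
 * all copies of that packet have finished, so set its completion flag.
 * Returns the number of packets completed.
 */
static uint16_t
complete_copies(uint32_t start, uint16_t nr_done)
{
	uint16_t i, nr_pkts = 0;

	for (i = 0; i < nr_done; i++) {
		bool *flag = metadata[(start + i) & TRACK_RING_MASK];

		if (flag != NULL) {
			*flag = true;
			nr_pkts++;
		}
	}
	return nr_pkts;
}
```

Because only the last segment's slot carries the flag address, the poll side can walk completed copy slots in order and recover packet boundaries without per-copy bookkeeping; this relies on the DMA channel completing copies in order (the RTE_VHOST_ASYNC_INORDER capability the series requires).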
* [RFC 1/1] vhost: integrate dmadev in asynchronous datapath 2021-11-22 10:54 [RFC 0/1] integrate dmadev in vhost Jiayu Hu @ 2021-11-22 10:54 ` Jiayu Hu 2021-12-24 10:39 ` Maxime Coquelin 2021-12-03 3:49 ` [RFC 0/1] integrate dmadev in vhost fengchengwen 2021-12-30 21:55 ` [PATCH v1 " Jiayu Hu 2 siblings, 1 reply; 31+ messages in thread From: Jiayu Hu @ 2021-11-22 10:54 UTC (permalink / raw) To: dev Cc: maxime.coquelin, i.maximets, chenbo.xia, bruce.richardson, harry.van.haaren, john.mcnamara, sunil.pai.g, Jiayu Hu Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA abstraction layer and simplify application logics, this patch integrates dmadev in asynchronous data path. Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> --- doc/guides/prog_guide/vhost_lib.rst | 63 ++++---- examples/vhost/ioat.c | 218 ---------------------------- examples/vhost/ioat.h | 63 -------- examples/vhost/main.c | 144 +++++++++++++++--- examples/vhost/main.h | 12 ++ examples/vhost/meson.build | 6 +- lib/vhost/meson.build | 3 +- lib/vhost/rte_vhost_async.h | 73 +++------- lib/vhost/vhost.c | 37 ++--- lib/vhost/vhost.h | 45 +++++- lib/vhost/virtio_net.c | 198 ++++++++++++++++++++----- 11 files changed, 410 insertions(+), 452 deletions(-) delete mode 100644 examples/vhost/ioat.c delete mode 100644 examples/vhost/ioat.h diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst index 76f5d303c9..32969a1c41 100644 --- a/doc/guides/prog_guide/vhost_lib.rst +++ b/doc/guides/prog_guide/vhost_lib.rst @@ -113,8 +113,8 @@ The following is an overview of some key Vhost API functions: the async capability. Only packets enqueued/dequeued by async APIs are processed through the async data path. - Currently this feature is only implemented on split ring enqueue data - path. + Currently this feature is only implemented on split and packed ring + enqueue data path. It is disabled by default. 
@@ -218,11 +218,10 @@ The following is an overview of some key Vhost API functions: Enable or disable zero copy feature of the vhost crypto backend. -* ``rte_vhost_async_channel_register(vid, queue_id, config, ops)`` +* ``rte_vhost_async_channel_register(vid, queue_id, config)`` Register an async copy device channel for a vhost queue after vring - is enabled. Following device ``config`` must be specified together - with the registration: + is enabled. * ``features`` @@ -235,21 +234,7 @@ The following is an overview of some key Vhost API functions: Currently, only ``RTE_VHOST_ASYNC_INORDER`` capable device is supported by vhost. - Applications must provide following ``ops`` callbacks for vhost lib to - work with the async copy devices: - - * ``transfer_data(vid, queue_id, descs, opaque_data, count)`` - - vhost invokes this function to submit copy data to the async devices. - For non-async_inorder capable devices, ``opaque_data`` could be used - for identifying the completed packets. - - * ``check_completed_copies(vid, queue_id, opaque_data, max_packets)`` - - vhost invokes this function to get the copy data completed by async - devices. - -* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id, config, ops)`` +* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id, config)`` Register an async copy device channel for a vhost queue without performing any locking. @@ -277,18 +262,13 @@ The following is an overview of some key Vhost API functions: This function is only safe to call in vhost callback functions (i.e., struct rte_vhost_device_ops). -* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, comp_pkts, comp_count)`` +* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id, dma_vchan)`` Submit an enqueue request to transmit ``count`` packets from host to guest - by async data path. 
Successfully enqueued packets can be transfer completed - or being occupied by DMA engines; transfer completed packets are returned in - ``comp_pkts``, but others are not guaranteed to finish, when this API - call returns. - - Applications must not free the packets submitted for enqueue until the - packets are completed. + by async data path. Applications must not free the packets submitted for + enqueue until the packets are completed. -* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count)`` +* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id, dma_vchan)`` Poll enqueue completion status from async data path. Completed packets are returned to applications through ``pkts``. @@ -298,7 +278,7 @@ The following is an overview of some key Vhost API functions: This function returns the amount of in-flight packets for the vhost queue using async acceleration. -* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count)`` +* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count, dma_id, dma_vchan)`` Clear inflight packets which are submitted to DMA engine in vhost async data path. Completed packets are returned to applications through ``pkts``. @@ -442,3 +422,26 @@ Finally, a set of device ops is defined for device specific operations: * ``get_notify_area`` Called to get the notify area info of the queue. + +Vhost asynchronous data path +---------------------------- + +Vhost asynchronous data path leverages DMA devices to offload memory +copies from the CPU and it is implemented in an asynchronous way. It +enables applications, like OVS, to save CPU cycles and hide memory copy +overhead, thus achieving higher throughput. + +Vhost doesn't manage DMA devices and applications, like OVS, need to +manage and configure DMA devices. Applications need to tell vhost what +DMA devices to use in every data path function call. 
This design enables +the flexibility for applications to dynamically use DMA channels in +different function modules, not limited in vhost. + +In addition, vhost supports M:N mapping between vrings and DMA virtual +channels. Specifically, one vring can use multiple different DMA channels +and one DMA channel can be shared by multiple vrings at the same time. +The reason of enabling one vring to use multiple DMA channels is that +it's possible that more than one dataplane threads enqueue packets to +the same vring with their own DMA virtual channels. Besides, the number +of DMA devices is limited. For the purpose of scaling, it's necessary to +support sharing DMA channels among vrings. diff --git a/examples/vhost/ioat.c b/examples/vhost/ioat.c deleted file mode 100644 index 9aeeb12fd9..0000000000 --- a/examples/vhost/ioat.c +++ /dev/null @@ -1,218 +0,0 @@ -/* SPDX-License-Identifier: BSD-3-Clause - * Copyright(c) 2010-2020 Intel Corporation - */ - -#include <sys/uio.h> -#ifdef RTE_RAW_IOAT -#include <rte_rawdev.h> -#include <rte_ioat_rawdev.h> - -#include "ioat.h" -#include "main.h" - -struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; - -struct packet_tracker { - unsigned short size_track[MAX_ENQUEUED_SIZE]; - unsigned short next_read; - unsigned short next_write; - unsigned short last_remain; - unsigned short ioat_space; -}; - -struct packet_tracker cb_tracker[MAX_VHOST_DEVICE]; - -int -open_ioat(const char *value) -{ - struct dma_for_vhost *dma_info = dma_bind; - char *input = strndup(value, strlen(value) + 1); - char *addrs = input; - char *ptrs[2]; - char *start, *end, *substr; - int64_t vid, vring_id; - struct rte_ioat_rawdev_config config; - struct rte_rawdev_info info = { .dev_private = &config }; - char name[32]; - int dev_id; - int ret = 0; - uint16_t i = 0; - char *dma_arg[MAX_VHOST_DEVICE]; - int args_nr; - - while (isblank(*addrs)) - addrs++; - if (*addrs == '\0') { - ret = -1; - goto out; - } - - /* process DMA devices within bracket. 
*/ - addrs++; - substr = strtok(addrs, ";]"); - if (!substr) { - ret = -1; - goto out; - } - args_nr = rte_strsplit(substr, strlen(substr), - dma_arg, MAX_VHOST_DEVICE, ','); - if (args_nr <= 0) { - ret = -1; - goto out; - } - while (i < args_nr) { - char *arg_temp = dma_arg[i]; - uint8_t sub_nr; - sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); - if (sub_nr != 2) { - ret = -1; - goto out; - } - - start = strstr(ptrs[0], "txd"); - if (start == NULL) { - ret = -1; - goto out; - } - - start += 3; - vid = strtol(start, &end, 0); - if (end == start) { - ret = -1; - goto out; - } - - vring_id = 0 + VIRTIO_RXQ; - if (rte_pci_addr_parse(ptrs[1], - &(dma_info + vid)->dmas[vring_id].addr) < 0) { - ret = -1; - goto out; - } - - rte_pci_device_name(&(dma_info + vid)->dmas[vring_id].addr, - name, sizeof(name)); - dev_id = rte_rawdev_get_dev_id(name); - if (dev_id == (uint16_t)(-ENODEV) || - dev_id == (uint16_t)(-EINVAL)) { - ret = -1; - goto out; - } - - if (rte_rawdev_info_get(dev_id, &info, sizeof(config)) < 0 || - strstr(info.driver_name, "ioat") == NULL) { - ret = -1; - goto out; - } - - (dma_info + vid)->dmas[vring_id].dev_id = dev_id; - (dma_info + vid)->dmas[vring_id].is_valid = true; - config.ring_size = IOAT_RING_SIZE; - config.hdls_disable = true; - if (rte_rawdev_configure(dev_id, &info, sizeof(config)) < 0) { - ret = -1; - goto out; - } - rte_rawdev_start(dev_id); - cb_tracker[dev_id].ioat_space = IOAT_RING_SIZE - 1; - dma_info->nr++; - i++; - } -out: - free(input); - return ret; -} - -int32_t -ioat_transfer_data_cb(int vid, uint16_t queue_id, - struct rte_vhost_iov_iter *iov_iter, - struct rte_vhost_async_status *opaque_data, uint16_t count) -{ - uint32_t i_iter; - uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2 + VIRTIO_RXQ].dev_id; - struct rte_vhost_iov_iter *iter = NULL; - unsigned long i_seg; - unsigned short mask = MAX_ENQUEUED_SIZE - 1; - unsigned short write = cb_tracker[dev_id].next_write; - - if (!opaque_data) { - for (i_iter = 0; 
i_iter < count; i_iter++) { - iter = iov_iter + i_iter; - i_seg = 0; - if (cb_tracker[dev_id].ioat_space < iter->nr_segs) - break; - while (i_seg < iter->nr_segs) { - rte_ioat_enqueue_copy(dev_id, - (uintptr_t)(iter->iov[i_seg].src_addr), - (uintptr_t)(iter->iov[i_seg].dst_addr), - iter->iov[i_seg].len, - 0, - 0); - i_seg++; - } - write &= mask; - cb_tracker[dev_id].size_track[write] = iter->nr_segs; - cb_tracker[dev_id].ioat_space -= iter->nr_segs; - write++; - } - } else { - /* Opaque data is not supported */ - return -1; - } - /* ring the doorbell */ - rte_ioat_perform_ops(dev_id); - cb_tracker[dev_id].next_write = write; - return i_iter; -} - -int32_t -ioat_check_completed_copies_cb(int vid, uint16_t queue_id, - struct rte_vhost_async_status *opaque_data, - uint16_t max_packets) -{ - if (!opaque_data) { - uintptr_t dump[255]; - int n_seg; - unsigned short read, write; - unsigned short nb_packet = 0; - unsigned short mask = MAX_ENQUEUED_SIZE - 1; - unsigned short i; - - uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2 - + VIRTIO_RXQ].dev_id; - n_seg = rte_ioat_completed_ops(dev_id, 255, NULL, NULL, dump, dump); - if (n_seg < 0) { - RTE_LOG(ERR, - VHOST_DATA, - "fail to poll completed buf on IOAT device %u", - dev_id); - return 0; - } - if (n_seg == 0) - return 0; - - cb_tracker[dev_id].ioat_space += n_seg; - n_seg += cb_tracker[dev_id].last_remain; - - read = cb_tracker[dev_id].next_read; - write = cb_tracker[dev_id].next_write; - for (i = 0; i < max_packets; i++) { - read &= mask; - if (read == write) - break; - if (n_seg >= cb_tracker[dev_id].size_track[read]) { - n_seg -= cb_tracker[dev_id].size_track[read]; - read++; - nb_packet++; - } else { - break; - } - } - cb_tracker[dev_id].next_read = read; - cb_tracker[dev_id].last_remain = n_seg; - return nb_packet; - } - /* Opaque data is not supported */ - return -1; -} - -#endif /* RTE_RAW_IOAT */ diff --git a/examples/vhost/ioat.h b/examples/vhost/ioat.h deleted file mode 100644 index d9bf717e8d..0000000000 --- 
a/examples/vhost/ioat.h +++ /dev/null @@ -1,63 +0,0 @@ -/* SPDX-License-Identifier: BSD-3-Clause - * Copyright(c) 2010-2020 Intel Corporation - */ - -#ifndef _IOAT_H_ -#define _IOAT_H_ - -#include <rte_vhost.h> -#include <rte_pci.h> -#include <rte_vhost_async.h> - -#define MAX_VHOST_DEVICE 1024 -#define IOAT_RING_SIZE 4096 -#define MAX_ENQUEUED_SIZE 4096 - -struct dma_info { - struct rte_pci_addr addr; - uint16_t dev_id; - bool is_valid; -}; - -struct dma_for_vhost { - struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; - uint16_t nr; -}; - -#ifdef RTE_RAW_IOAT -int open_ioat(const char *value); - -int32_t -ioat_transfer_data_cb(int vid, uint16_t queue_id, - struct rte_vhost_iov_iter *iov_iter, - struct rte_vhost_async_status *opaque_data, uint16_t count); - -int32_t -ioat_check_completed_copies_cb(int vid, uint16_t queue_id, - struct rte_vhost_async_status *opaque_data, - uint16_t max_packets); -#else -static int open_ioat(const char *value __rte_unused) -{ - return -1; -} - -static int32_t -ioat_transfer_data_cb(int vid __rte_unused, uint16_t queue_id __rte_unused, - struct rte_vhost_iov_iter *iov_iter __rte_unused, - struct rte_vhost_async_status *opaque_data __rte_unused, - uint16_t count __rte_unused) -{ - return -1; -} - -static int32_t -ioat_check_completed_copies_cb(int vid __rte_unused, - uint16_t queue_id __rte_unused, - struct rte_vhost_async_status *opaque_data __rte_unused, - uint16_t max_packets __rte_unused) -{ - return -1; -} -#endif -#endif /* _IOAT_H_ */ diff --git a/examples/vhost/main.c b/examples/vhost/main.c index 33d023aa39..16a02b9219 100644 --- a/examples/vhost/main.c +++ b/examples/vhost/main.c @@ -24,8 +24,9 @@ #include <rte_ip.h> #include <rte_tcp.h> #include <rte_pause.h> +#include <rte_dmadev.h> +#include <rte_vhost_async.h> -#include "ioat.h" #include "main.h" #ifndef MAX_QUEUES @@ -57,6 +58,11 @@ #define INVALID_PORT_ID 0xFF +#define MAX_VHOST_DEVICE 1024 +#define DMA_RING_SIZE 4096 + +struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; 
+ /* mask of enabled ports */ static uint32_t enabled_port_mask = 0; @@ -199,10 +205,113 @@ struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * MAX_VHOST_DEVICE]; static inline int open_dma(const char *value) { - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) - return open_ioat(value); + struct dma_for_vhost *dma_info = dma_bind; + char *input = strndup(value, strlen(value) + 1); + char *addrs = input; + char *ptrs[2]; + char *start, *end, *substr; + int64_t vid, vring_id; + + struct rte_dma_info info; + struct rte_dma_conf dev_config = { .nb_vchans = 1 }; + struct rte_dma_vchan_conf qconf = { + .direction = RTE_DMA_DIR_MEM_TO_MEM, + .nb_desc = DMA_RING_SIZE + }; + + int dev_id; + int ret = 0; + uint16_t i = 0; + char *dma_arg[MAX_VHOST_DEVICE]; + int args_nr; + + while (isblank(*addrs)) + addrs++; + if (*addrs == '\0') { + ret = -1; + goto out; + } + + /* process DMA devices within bracket. */ + addrs++; + substr = strtok(addrs, ";]"); + if (!substr) { + ret = -1; + goto out; + } + + args_nr = rte_strsplit(substr, strlen(substr), + dma_arg, MAX_VHOST_DEVICE, ','); + if (args_nr <= 0) { + ret = -1; + goto out; + } + + while (i < args_nr) { + char *arg_temp = dma_arg[i]; + uint8_t sub_nr; + + sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); + if (sub_nr != 2) { + ret = -1; + goto out; + } + + start = strstr(ptrs[0], "txd"); + if (start == NULL) { + ret = -1; + goto out; + } + + start += 3; + vid = strtol(start, &end, 0); + if (end == start) { + ret = -1; + goto out; + } + + vring_id = 0 + VIRTIO_RXQ; + + dev_id = rte_dma_get_dev_id_by_name(ptrs[1]); + if (dev_id < 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to find DMA %s.\n", ptrs[1]); + ret = -1; + goto out; + } + + if (rte_dma_configure(dev_id, &dev_config) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to configure DMA %d.\n", dev_id); + ret = -1; + goto out; + } + + if (rte_dma_vchan_setup(dev_id, 0, &qconf) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to set up DMA %d.\n", dev_id); + ret = 
-1; + goto out; + } - return -1; + rte_dma_info_get(dev_id, &info); + if (info.nb_vchans != 1) { + RTE_LOG(ERR, VHOST_CONFIG, "DMA %d has no queues.\n", dev_id); + ret = -1; + goto out; + } + + if (rte_dma_start(dev_id) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to start DMA %u.\n", dev_id); + ret = -1; + goto out; + } + + (dma_info + vid)->dmas[vring_id].dev_id = dev_id; + (dma_info + vid)->dmas[vring_id].is_valid = true; + dma_info->nr++; + i++; + } +out: + free(input); + return ret; } /* @@ -841,9 +950,10 @@ complete_async_pkts(struct vhost_dev *vdev) { struct rte_mbuf *p_cpl[MAX_PKT_BURST]; uint16_t complete_count; + uint16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; complete_count = rte_vhost_poll_enqueue_completed(vdev->vid, - VIRTIO_RXQ, p_cpl, MAX_PKT_BURST); + VIRTIO_RXQ, p_cpl, MAX_PKT_BURST, dma_id, 0); if (complete_count) { free_pkts(p_cpl, complete_count); __atomic_sub_fetch(&vdev->pkts_inflight, complete_count, __ATOMIC_SEQ_CST); @@ -880,6 +990,7 @@ drain_vhost(struct vhost_dev *vdev) uint32_t buff_idx = rte_lcore_id() * MAX_VHOST_DEVICE + vdev->vid; uint16_t nr_xmit = vhost_txbuff[buff_idx]->len; struct rte_mbuf **m = vhost_txbuff[buff_idx]->m_table; + uint16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; if (builtin_net_driver) { ret = vs_enqueue_pkts(vdev, VIRTIO_RXQ, m, nr_xmit); @@ -887,7 +998,7 @@ drain_vhost(struct vhost_dev *vdev) uint16_t enqueue_fail = 0; complete_async_pkts(vdev); - ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit); + ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit, dma_id, 0); __atomic_add_fetch(&vdev->pkts_inflight, ret, __ATOMIC_SEQ_CST); enqueue_fail = nr_xmit - ret; @@ -1213,10 +1324,11 @@ drain_eth_rx(struct vhost_dev *vdev) pkts, rx_count); } else if (async_vhost_driver) { uint16_t enqueue_fail = 0; + uint16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; complete_async_pkts(vdev); enqueue_count = rte_vhost_submit_enqueue_burst(vdev->vid, - 
VIRTIO_RXQ, pkts, rx_count); + VIRTIO_RXQ, pkts, rx_count, dma_id, 0); __atomic_add_fetch(&vdev->pkts_inflight, enqueue_count, __ATOMIC_SEQ_CST); enqueue_fail = rx_count - enqueue_count; @@ -1389,11 +1501,12 @@ destroy_device(int vid) if (async_vhost_driver) { uint16_t n_pkt = 0; + uint16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; struct rte_mbuf *m_cpl[vdev->pkts_inflight]; while (vdev->pkts_inflight) { n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, VIRTIO_RXQ, - m_cpl, vdev->pkts_inflight); + m_cpl, vdev->pkts_inflight, dma_id, 0); free_pkts(m_cpl, n_pkt); __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, __ATOMIC_SEQ_CST); } @@ -1470,18 +1583,10 @@ new_device(int vid) if (async_vhost_driver) { struct rte_vhost_async_config config = {0}; - struct rte_vhost_async_channel_ops channel_ops; - - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) { - channel_ops.transfer_data = ioat_transfer_data_cb; - channel_ops.check_completed_copies = - ioat_check_completed_copies_cb; - config.features = RTE_VHOST_ASYNC_INORDER; + config.features = RTE_VHOST_ASYNC_INORDER; - return rte_vhost_async_channel_register(vid, VIRTIO_RXQ, - config, &channel_ops); - } + return rte_vhost_async_channel_register(vid, VIRTIO_RXQ, config); } return 0; @@ -1505,11 +1610,12 @@ vring_state_changed(int vid, uint16_t queue_id, int enable) if (async_vhost_driver) { if (!enable) { uint16_t n_pkt = 0; + uint16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; struct rte_mbuf *m_cpl[vdev->pkts_inflight]; while (vdev->pkts_inflight) { n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, queue_id, - m_cpl, vdev->pkts_inflight); + m_cpl, vdev->pkts_inflight, dma_id, 0); free_pkts(m_cpl, n_pkt); __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, __ATOMIC_SEQ_CST); } diff --git a/examples/vhost/main.h b/examples/vhost/main.h index e7b1ac60a6..609fb406aa 100644 --- a/examples/vhost/main.h +++ b/examples/vhost/main.h @@ -8,6 +8,7 @@ #include <sys/queue.h> #include <rte_ether.h> +#include <rte_pci.h> /* 
Macros for printing using RTE_LOG */ #define RTE_LOGTYPE_VHOST_CONFIG RTE_LOGTYPE_USER1 @@ -79,6 +80,17 @@ struct lcore_info { struct vhost_dev_tailq_list vdev_list; }; +struct dma_info { + struct rte_pci_addr addr; + uint16_t dev_id; + bool is_valid; +}; + +struct dma_for_vhost { + struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; + uint16_t nr; +}; + /* we implement non-extra virtio net features */ #define VIRTIO_NET_FEATURES 0 diff --git a/examples/vhost/meson.build b/examples/vhost/meson.build index 3efd5e6540..87a637f83f 100644 --- a/examples/vhost/meson.build +++ b/examples/vhost/meson.build @@ -12,13 +12,9 @@ if not is_linux endif deps += 'vhost' +deps += 'dmadev' allow_experimental_apis = true sources = files( 'main.c', 'virtio_net.c', ) - -if dpdk_conf.has('RTE_RAW_IOAT') - deps += 'raw_ioat' - sources += files('ioat.c') -endif diff --git a/lib/vhost/meson.build b/lib/vhost/meson.build index cdb37a4814..8107329400 100644 --- a/lib/vhost/meson.build +++ b/lib/vhost/meson.build @@ -33,7 +33,8 @@ headers = files( 'rte_vhost_async.h', 'rte_vhost_crypto.h', ) + driver_sdk_headers = files( 'vdpa_driver.h', ) -deps += ['ethdev', 'cryptodev', 'hash', 'pci'] +deps += ['ethdev', 'cryptodev', 'hash', 'pci', 'dmadev'] diff --git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h index a87ea6ba37..0594ae5fc5 100644 --- a/lib/vhost/rte_vhost_async.h +++ b/lib/vhost/rte_vhost_async.h @@ -36,48 +36,6 @@ struct rte_vhost_async_status { uintptr_t *dst_opaque_data; }; -/** - * dma operation callbacks to be implemented by applications - */ -struct rte_vhost_async_channel_ops { - /** - * instruct async engines to perform copies for a batch of packets - * - * @param vid - * id of vhost device to perform data copies - * @param queue_id - * queue id to perform data copies - * @param iov_iter - * an array of IOV iterators - * @param opaque_data - * opaque data pair sending to DMA engine - * @param count - * number of elements in the "descs" array - * @return - * number 
of IOV iterators processed, negative value means error - */ - int32_t (*transfer_data)(int vid, uint16_t queue_id, - struct rte_vhost_iov_iter *iov_iter, - struct rte_vhost_async_status *opaque_data, - uint16_t count); - /** - * check copy-completed packets from the async engine - * @param vid - * id of vhost device to check copy completion - * @param queue_id - * queue id to check copy completion - * @param opaque_data - * buffer to receive the opaque data pair from DMA engine - * @param max_packets - * max number of packets could be completed - * @return - * number of async descs completed, negative value means error - */ - int32_t (*check_completed_copies)(int vid, uint16_t queue_id, - struct rte_vhost_async_status *opaque_data, - uint16_t max_packets); -}; - /** * async channel features */ @@ -102,15 +60,12 @@ struct rte_vhost_async_config { * vhost queue id async channel to be attached to * @param config * Async channel configuration structure - * @param ops - * Async channel operation callbacks * @return * 0 on success, -1 on failures */ __rte_experimental int rte_vhost_async_channel_register(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops); + struct rte_vhost_async_config config); /** * Unregister an async channel for a vhost queue @@ -136,8 +91,6 @@ int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id); * vhost device id async channel to be attached to * @param queue_id * vhost queue id async channel to be attached to - * @param config - * Async channel configuration * @param ops * Async channel operation callbacks * @return @@ -145,8 +98,7 @@ int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id); */ __rte_experimental int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops); + struct rte_vhost_async_config config); /** * Unregister an async channel for a vhost queue 
without performing any @@ -179,12 +131,17 @@ int rte_vhost_async_channel_unregister_thread_unsafe(int vid, * array of packets to be enqueued * @param count * packets num to be enqueued + * @param dma_id + * the identifier of the DMA device + * @param dma_vchan + * the identifier of virtual DMA channel * @return * num of packets enqueued */ __rte_experimental uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count); + struct rte_mbuf **pkts, uint16_t count, uint16_t dma_id, + uint16_t dma_vchan); /** * This function checks async completion status for a specific vhost @@ -199,12 +156,17 @@ uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, * blank array to get return packet pointer * @param count * size of the packet array + * @param dma_id + * the identifier of the DMA device + * @param dma_vchan + * the identifier of virtual DMA channel * @return * num of packets returned */ __rte_experimental uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count); + struct rte_mbuf **pkts, uint16_t count, uint16_t dma_id, + uint16_t dma_vchan); /** * This function returns the amount of in-flight packets for the vhost @@ -235,11 +197,16 @@ int rte_vhost_async_get_inflight(int vid, uint16_t queue_id); * Blank array to get return packet pointer * @param count * Size of the packet array + * @param dma_id + * the identifier of the DMA device + * @param dma_vchan + * the identifier of virtual DMA channel * @return * Number of packets returned */ __rte_experimental uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count); + struct rte_mbuf **pkts, uint16_t count, uint16_t dma_id, + uint16_t dma_vchan); #endif /* _RTE_VHOST_ASYNC_H_ */ diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c index 13a9bb9dd1..595cf63b8d 100644 --- a/lib/vhost/vhost.c +++ b/lib/vhost/vhost.c @@ -344,6 +344,7 @@ 
vhost_free_async_mem(struct vhost_virtqueue *vq) return; rte_free(vq->async->pkts_info); + rte_free(vq->async->pkts_cmpl_flag); rte_free(vq->async->buffers_packed); vq->async->buffers_packed = NULL; @@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid, } static __rte_always_inline int -async_channel_register(int vid, uint16_t queue_id, - struct rte_vhost_async_channel_ops *ops) +async_channel_register(int vid, uint16_t queue_id) { struct virtio_net *dev = get_device(vid); struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; @@ -1656,6 +1656,14 @@ async_channel_register(int vid, uint16_t queue_id, goto out_free_async; } + async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size * sizeof(bool), + RTE_CACHE_LINE_SIZE, node); + if (!async->pkts_cmpl_flag) { + VHOST_LOG_CONFIG(ERR, "failed to allocate async pkts_cmpl_flag (vid %d, qid: %d)\n", + vid, queue_id); + goto out_free_async; + } + if (vq_is_packed(dev)) { async->buffers_packed = rte_malloc_socket(NULL, vq->size * sizeof(struct vring_used_elem_packed), @@ -1676,9 +1684,6 @@ async_channel_register(int vid, uint16_t queue_id, } } - async->ops.check_completed_copies = ops->check_completed_copies; - async->ops.transfer_data = ops->transfer_data; - vq->async = async; return 0; @@ -1692,14 +1697,13 @@ async_channel_register(int vid, uint16_t queue_id, int rte_vhost_async_channel_register(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops) + struct rte_vhost_async_config config) { struct vhost_virtqueue *vq; struct virtio_net *dev = get_device(vid); int ret; - if (dev == NULL || ops == NULL) + if (dev == NULL) return -1; if (queue_id >= VHOST_MAX_VRING) @@ -1717,12 +1721,8 @@ rte_vhost_async_channel_register(int vid, uint16_t queue_id, return -1; } - if (unlikely(ops->check_completed_copies == NULL || - ops->transfer_data == NULL)) - return -1; - rte_spinlock_lock(&vq->access_lock); - ret = async_channel_register(vid, queue_id, ops); + ret = 
async_channel_register(vid, queue_id); rte_spinlock_unlock(&vq->access_lock); return ret; @@ -1730,13 +1730,12 @@ rte_vhost_async_channel_register(int vid, uint16_t queue_id, int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops) + struct rte_vhost_async_config config) { struct vhost_virtqueue *vq; struct virtio_net *dev = get_device(vid); - if (dev == NULL || ops == NULL) + if (dev == NULL) return -1; if (queue_id >= VHOST_MAX_VRING) @@ -1754,11 +1753,7 @@ rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, return -1; } - if (unlikely(ops->check_completed_copies == NULL || - ops->transfer_data == NULL)) - return -1; - - return async_channel_register(vid, queue_id, ops); + return async_channel_register(vid, queue_id); } int diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h index 7085e0885c..974e495b56 100644 --- a/lib/vhost/vhost.h +++ b/lib/vhost/vhost.h @@ -51,6 +51,11 @@ #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) #define VHOST_MAX_ASYNC_VEC 2048 +/* DMA device copy operation tracking ring size. */ +#define VHOST_ASYNC_DMA_TRACK_RING_SIZE (uint32_t)4096 +#define VHOST_ASYNC_DMA_TRACK_RING_MASK (VHOST_ASYNC_DMA_TRACK_RING_SIZE - 1) +#define VHOST_ASYNC_DMA_BATCHING_SIZE 32 + #define PACKED_DESC_ENQUEUE_USED_FLAG(w) \ ((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \ VRING_DESC_F_WRITE) @@ -119,6 +124,29 @@ struct vring_used_elem_packed { uint32_t count; }; +struct async_dma_info { + /* circular array to track copy metadata */ + bool *metadata[VHOST_ASYNC_DMA_TRACK_RING_SIZE]; + + /* batching copies before a DMA doorbell */ + uint16_t nr_batching; + + /** + * DMA virtual channel lock. Although it is able to bind DMA + * virtual channels to data plane threads, vhost control plane + * thread could call data plane functions too, thus causing + * DMA device contention. 
+ * + * For example, in the VM exit case, the vhost control plane thread + * needs to clear in-flight packets before disabling the vring, but + * there could be another data plane thread enqueuing packets to the + * same vring with the same DMA virtual channel. As dmadev PMD + * functions are lock-free, the control plane and data plane threads + * could operate on the same DMA virtual channel at the same time. + */ + rte_spinlock_t dma_lock; +}; + /** * inflight async packet information */ @@ -129,9 +157,6 @@ struct async_inflight_info { }; struct vhost_async { - /* operation callbacks for DMA */ - struct rte_vhost_async_channel_ops ops; - struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; uint16_t iter_idx; @@ -139,8 +164,22 @@ struct vhost_async { /* data transfer status */ struct async_inflight_info *pkts_info; + /** + * packet reorder array. "true" indicates that the DMA + * device has completed all copies for the packet. + * + * Note that this array could be written by multiple + * threads at the same time. For example, two threads + * enqueue packets to the same virtqueue with their + * own DMA devices. However, since offloading is done + * on a per-packet basis, each packet flag will only be + * written by one thread. And a single-byte write is + * atomic, so no lock is needed. + */ + bool *pkts_cmpl_flag; uint16_t pkts_idx; uint16_t pkts_inflight_n; + union { struct vring_used_elem *descs_split; struct vring_used_elem_packed *buffers_packed; diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c index b3d954aab4..95ecfeb64b 100644 --- a/lib/vhost/virtio_net.c +++ b/lib/vhost/virtio_net.c @@ -11,6 +11,7 @@ #include <rte_net.h> #include <rte_ether.h> #include <rte_ip.h> +#include <rte_dmadev.h> #include <rte_vhost.h> #include <rte_tcp.h> #include <rte_udp.h> @@ -25,6 +26,9 @@ #define MAX_BATCH_LEN 256 +/* DMA device copy operation tracking array.
*/ +static struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; + static __rte_always_inline bool rxvq_is_mergeable(struct virtio_net *dev) { @@ -43,6 +47,108 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring) return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring; } +static uint16_t +vhost_async_dma_transfer(struct vhost_virtqueue *vq, uint16_t dma_id, + uint16_t dma_vchan, uint16_t head_idx, + struct rte_vhost_iov_iter *pkts, uint16_t nr_pkts) +{ + struct async_dma_info *dma_info = &dma_copy_track[dma_id]; + uint16_t dma_space_left = rte_dma_burst_capacity(dma_id, dma_vchan); + uint16_t pkt_idx = 0; + + rte_spinlock_lock(&dma_info->dma_lock); + + while (pkt_idx < nr_pkts) { + struct rte_vhost_iovec *iov = pkts[pkt_idx].iov; + int copy_idx = 0; + uint16_t nr_segs = pkts[pkt_idx].nr_segs; + uint16_t i; + + if (unlikely(dma_space_left < nr_segs)) { + goto out; + } + + for (i = 0; i < nr_segs; i++) { + copy_idx = rte_dma_copy(dma_id, dma_vchan, + (rte_iova_t)iov[i].src_addr, + (rte_iova_t)iov[i].dst_addr, + iov[i].len, RTE_DMA_OP_FLAG_LLC); + if (unlikely(copy_idx < 0)) { + VHOST_LOG_DATA(ERR, "DMA device %u (%u) copy failed\n", + dma_id, dma_vchan); + dma_info->nr_batching += i; + goto out; + } + + dma_info->metadata[copy_idx & VHOST_ASYNC_DMA_TRACK_RING_MASK] = NULL; + } + + /** + * Only store the packet completion flag address in the last + * copy's slot; other slots are set to NULL.
+ */ + dma_info->metadata[copy_idx & VHOST_ASYNC_DMA_TRACK_RING_MASK] = + &vq->async->pkts_cmpl_flag[head_idx % vq->size]; + + dma_info->nr_batching += nr_segs; + if (unlikely(dma_info->nr_batching > VHOST_ASYNC_DMA_BATCHING_SIZE)) { + rte_dma_submit(dma_id, dma_vchan); + dma_info->nr_batching = 0; + } + + dma_space_left -= nr_segs; + pkt_idx++; + head_idx++; + } + +out: + if (dma_info->nr_batching > 0) { + rte_dma_submit(dma_id, dma_vchan); + dma_info->nr_batching = 0; + } + rte_spinlock_unlock(&dma_info->dma_lock); + + return pkt_idx; +} + +static uint16_t +vhost_async_dma_check_completed(uint16_t dma_id, uint16_t dma_vchan, uint16_t max_pkts) +{ + struct async_dma_info *dma_info = &dma_copy_track[dma_id]; + uint16_t last_idx = 0; + uint16_t nr_copies; + uint16_t copy_idx; + uint16_t i; + + rte_spinlock_lock(&dma_info->dma_lock); + + nr_copies = rte_dma_completed(dma_id, dma_vchan, max_pkts, &last_idx, NULL); + if (nr_copies == 0) { + goto out; + } + + copy_idx = last_idx - nr_copies + 1; + for (i = 0; i < nr_copies; i++) { + bool *flag; + + flag = dma_info->metadata[copy_idx & VHOST_ASYNC_DMA_TRACK_RING_MASK]; + if (flag) { + /** + * Mark the packet flag as received. The flag + * could belong to another virtqueue, but the write + * is atomic.
+ */ + *flag = true; + dma_info->metadata[copy_idx & VHOST_ASYNC_DMA_TRACK_RING_MASK] = NULL; + } + copy_idx++; + } + +out: + rte_spinlock_unlock(&dma_info->dma_lock); + return nr_copies; +} + static inline void do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq) { @@ -1451,7 +1557,8 @@ store_dma_desc_info_packed(struct vring_used_elem_packed *s_ring, static __rte_noinline uint32_t virtio_dev_rx_async_submit_split(struct virtio_net *dev, struct vhost_virtqueue *vq, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) + struct rte_mbuf **pkts, uint32_t count, uint16_t dma_id, + uint16_t dma_vchan) { struct buf_vector buf_vec[BUF_VECTOR_MAX]; uint32_t pkt_idx = 0; @@ -1463,6 +1570,7 @@ virtio_dev_rx_async_submit_split(struct virtio_net *dev, uint32_t pkt_err = 0; int32_t n_xfer; uint16_t slot_idx = 0; + uint16_t head_idx = async->pkts_idx & (vq->size - 1); /* * The ordering between avail index and desc reads need to be enforced. @@ -1503,17 +1611,16 @@ virtio_dev_rx_async_submit_split(struct virtio_net *dev, if (unlikely(pkt_idx == 0)) return 0; - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx); - if (unlikely(n_xfer < 0)) { - VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer data for queue id %d.\n", - dev->vid, __func__, queue_id); - n_xfer = 0; - } + n_xfer = vhost_async_dma_transfer(vq, dma_id, dma_vchan, head_idx, async->iov_iter, + pkt_idx); pkt_err = pkt_idx - n_xfer; if (unlikely(pkt_err)) { uint16_t num_descs = 0; + VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer %u packets for queue %u.\n", + dev->vid, __func__, pkt_err, queue_id); + /* update number of completed packets */ pkt_idx = n_xfer; @@ -1658,11 +1765,12 @@ dma_error_handler_packed(struct vhost_virtqueue *vq, uint16_t slot_idx, static __rte_noinline uint32_t virtio_dev_rx_async_submit_packed(struct virtio_net *dev, struct vhost_virtqueue *vq, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) + struct rte_mbuf **pkts, uint32_t 
count, uint16_t dma_id, + uint16_t dma_vchan) { uint32_t pkt_idx = 0; uint32_t remained = count; - int32_t n_xfer; + uint16_t n_xfer; uint16_t num_buffers; uint16_t num_descs; @@ -1670,6 +1778,7 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, struct async_inflight_info *pkts_info = async->pkts_info; uint32_t pkt_err = 0; uint16_t slot_idx = 0; + uint16_t head_idx = async->pkts_idx % vq->size; do { rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]); @@ -1694,19 +1803,17 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, if (unlikely(pkt_idx == 0)) return 0; - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx); - if (unlikely(n_xfer < 0)) { - VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer data for queue id %d.\n", - dev->vid, __func__, queue_id); - n_xfer = 0; - } - - pkt_err = pkt_idx - n_xfer; + n_xfer = vhost_async_dma_transfer(vq, dma_id, dma_vchan, head_idx, + async->iov_iter, pkt_idx); async_iter_reset(async); - if (unlikely(pkt_err)) + pkt_err = pkt_idx - n_xfer; + if (unlikely(pkt_err)) { + VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer %u packets for queue %u.\n", + dev->vid, __func__, pkt_err, queue_id); dma_error_handler_packed(vq, slot_idx, pkt_err, &pkt_idx); + } if (likely(vq->shadow_used_idx)) { /* keep used descriptors. 
*/ @@ -1826,28 +1933,37 @@ write_back_completed_descs_packed(struct vhost_virtqueue *vq, static __rte_always_inline uint16_t vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, uint16_t dma_id, + uint16_t dma_vchan) { struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; struct vhost_async *async = vq->async; struct async_inflight_info *pkts_info = async->pkts_info; - int32_t n_cpl; + uint16_t nr_cpl_copies, nr_cpl_pkts = 0; uint16_t n_descs = 0, n_buffers = 0; uint16_t start_idx, from, i; - n_cpl = async->ops.check_completed_copies(dev->vid, queue_id, 0, count); - if (unlikely(n_cpl < 0)) { - VHOST_LOG_DATA(ERR, "(%d) %s: failed to check completed copies for queue id %d.\n", - dev->vid, __func__, queue_id); + nr_cpl_copies = vhost_async_dma_check_completed(dma_id, dma_vchan, count); + if (nr_cpl_copies == 0) return 0; - } - if (n_cpl == 0) - return 0; + /** + * The order of updating packet completion flag needs to be + * enforced. 
+ */ + rte_atomic_thread_fence(__ATOMIC_RELEASE); start_idx = async_get_first_inflight_pkt_idx(vq); - for (i = 0; i < n_cpl; i++) { + /* Calculate the number of copy completed packets */ + from = start_idx; + while (vq->async->pkts_cmpl_flag[from]) { + vq->async->pkts_cmpl_flag[from] = false; + from = (from + 1) % vq->size; + nr_cpl_pkts++; + } + + for (i = 0; i < nr_cpl_pkts; i++) { from = (start_idx + i) % vq->size; /* Only used with packed ring */ n_buffers += pkts_info[from].nr_buffers; @@ -1856,7 +1972,7 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, pkts[i] = pkts_info[from].mbuf; } - async->pkts_inflight_n -= n_cpl; + async->pkts_inflight_n -= nr_cpl_pkts; if (likely(vq->enabled && vq->access_ok)) { if (vq_is_packed(dev)) { @@ -1877,12 +1993,13 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, } } - return n_cpl; + return nr_cpl_pkts; } uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, uint16_t dma_id, + uint16_t dma_vchan) { struct virtio_net *dev = get_device(vid); struct vhost_virtqueue *vq; @@ -1908,7 +2025,7 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, rte_spinlock_lock(&vq->access_lock); - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, dma_vchan); rte_spinlock_unlock(&vq->access_lock); @@ -1917,7 +2034,8 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, uint16_t dma_id, + uint16_t dma_vchan) { struct virtio_net *dev = get_device(vid); struct vhost_virtqueue *vq; @@ -1941,14 +2059,15 @@ rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, return 0; } - n_pkts_cpl = vhost_poll_enqueue_completed(dev, 
queue_id, pkts, count); + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, dma_vchan); return n_pkts_cpl; } static __rte_always_inline uint32_t virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) + struct rte_mbuf **pkts, uint32_t count, uint16_t dma_id, + uint16_t dma_vchan) { struct vhost_virtqueue *vq; uint32_t nb_tx = 0; @@ -1980,10 +2099,10 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, if (vq_is_packed(dev)) nb_tx = virtio_dev_rx_async_submit_packed(dev, vq, queue_id, - pkts, count); + pkts, count, dma_id, dma_vchan); else nb_tx = virtio_dev_rx_async_submit_split(dev, vq, queue_id, - pkts, count); + pkts, count, dma_id, dma_vchan); out: if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) @@ -1997,7 +2116,8 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, uint16_t dma_id, + uint16_t dma_vchan) { struct virtio_net *dev = get_device(vid); @@ -2011,7 +2131,7 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, return 0; } - return virtio_dev_rx_async_submit(dev, queue_id, pkts, count); + return virtio_dev_rx_async_submit(dev, queue_id, pkts, count, dma_id, dma_vchan); } static inline bool -- 2.25.1 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC 1/1] vhost: integrate dmadev in asynchronous datapath 2021-11-22 10:54 ` [RFC 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu @ 2021-12-24 10:39 ` Maxime Coquelin 2021-12-28 1:15 ` Hu, Jiayu 0 siblings, 1 reply; 31+ messages in thread From: Maxime Coquelin @ 2021-12-24 10:39 UTC (permalink / raw) To: Jiayu Hu, dev Cc: i.maximets, chenbo.xia, bruce.richardson, harry.van.haaren, john.mcnamara, sunil.pai.g Hi Jiayu, This is a first review; I need to spend more time on the series to understand it well. Do you have a prototype of the OVS part, to help us grasp what the full integration would look like? On 11/22/21 11:54, Jiayu Hu wrote: > Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA > abstraction layer and simplify application logics, this patch integrates > dmadev in asynchronous data path. > > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> > Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> > --- > doc/guides/prog_guide/vhost_lib.rst | 63 ++++---- > examples/vhost/ioat.c | 218 ---------------------------- > examples/vhost/ioat.h | 63 -------- > examples/vhost/main.c | 144 +++++++++++++++--- > examples/vhost/main.h | 12 ++ > examples/vhost/meson.build | 6 +- > lib/vhost/meson.build | 3 +- > lib/vhost/rte_vhost_async.h | 73 +++------- > lib/vhost/vhost.c | 37 ++--- > lib/vhost/vhost.h | 45 +++++- > lib/vhost/virtio_net.c | 198 ++++++++++++++++++++----- > 11 files changed, 410 insertions(+), 452 deletions(-) > delete mode 100644 examples/vhost/ioat.c > delete mode 100644 examples/vhost/ioat.h > > diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst > index 76f5d303c9..32969a1c41 100644 > --- a/doc/guides/prog_guide/vhost_lib.rst > +++ b/doc/guides/prog_guide/vhost_lib.rst > @@ -113,8 +113,8 @@ The following is an overview of some key Vhost API functions: > the async capability. Only packets enqueued/dequeued by async APIs are > processed through the async data path.
> > - Currently this feature is only implemented on split ring enqueue data > - path. > + Currently this feature is only implemented on split and packed ring > + enqueue data path. That's not related to the topic of this patch; you may move it to a dedicated patch in v1. > > It is disabled by default. > > @@ -218,11 +218,10 @@ The following is an overview of some key Vhost API functions: > > Enable or disable zero copy feature of the vhost crypto backend. > > -* ``rte_vhost_async_channel_register(vid, queue_id, config, ops)`` > +* ``rte_vhost_async_channel_register(vid, queue_id, config)`` > > Register an async copy device channel for a vhost queue after vring > - is enabled. Following device ``config`` must be specified together > - with the registration: > + is enabled. > > * ``features`` > > @@ -235,21 +234,7 @@ The following is an overview of some key Vhost API functions: > Currently, only ``RTE_VHOST_ASYNC_INORDER`` capable device is > supported by vhost. > > - Applications must provide following ``ops`` callbacks for vhost lib to > - work with the async copy devices: > - > - * ``transfer_data(vid, queue_id, descs, opaque_data, count)`` > - > - vhost invokes this function to submit copy data to the async devices. > - For non-async_inorder capable devices, ``opaque_data`` could be used > - for identifying the completed packets. > - > - * ``check_completed_copies(vid, queue_id, opaque_data, max_packets)`` > - > - vhost invokes this function to get the copy data completed by async > - devices. > - > -* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id, config, ops)`` > +* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id, config)`` > > Register an async copy device channel for a vhost queue without > performing any locking. > @@ -277,18 +262,13 @@ The following is an overview of some key Vhost API functions: > This function is only safe to call in vhost callback functions > (i.e., struct rte_vhost_device_ops).
> > -* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, comp_pkts, comp_count)`` > +* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id, dma_vchan)`` > > Submit an enqueue request to transmit ``count`` packets from host to guest > - by async data path. Successfully enqueued packets can be transfer completed > - or being occupied by DMA engines; transfer completed packets are returned in > - ``comp_pkts``, but others are not guaranteed to finish, when this API > - call returns. > - > - Applications must not free the packets submitted for enqueue until the > - packets are completed. > + by async data path. Applications must not free the packets submitted for > + enqueue until the packets are completed. > > -* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count)`` > +* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id, dma_vchan)`` > > Poll enqueue completion status from async data path. Completed packets > are returned to applications through ``pkts``. > @@ -298,7 +278,7 @@ The following is an overview of some key Vhost API functions: > This function returns the amount of in-flight packets for the vhost > queue using async acceleration. > > -* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count)`` > +* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count, dma_id, dma_vchan)`` > > Clear inflight packets which are submitted to DMA engine in vhost async data > path. Completed packets are returned to applications through ``pkts``. > @@ -442,3 +422,26 @@ Finally, a set of device ops is defined for device specific operations: > * ``get_notify_area`` > > Called to get the notify area info of the queue. > + > +Vhost asynchronous data path > +---------------------------- > + > +Vhost asynchronous data path leverages DMA devices to offload memory > +copies from the CPU and it is implemented in an asynchronous way. 
It > +enables applcations, like OVS, to save CPU cycles and hide memory copy > +overhead, thus achieving higher throughput. > + > +Vhost doesn't manage DMA devices and applications, like OVS, need to > +manage and configure DMA devices. Applications need to tell vhost what > +DMA devices to use in every data path function call. This design enables > +the flexibility for applications to dynamically use DMA channels in > +different function modules, not limited in vhost. > + > +In addition, vhost supports M:N mapping between vrings and DMA virtual > +channels. Specifically, one vring can use multiple different DMA channels > +and one DMA channel can be shared by multiple vrings at the same time. > +The reason of enabling one vring to use multiple DMA channels is that > +it's possible that more than one dataplane threads enqueue packets to > +the same vring with their own DMA virtual channels. Besides, the number > +of DMA devices is limited. For the purpose of scaling, it's necessary to > +support sharing DMA channels among vrings. > diff --git a/examples/vhost/ioat.c b/examples/vhost/ioat.c > deleted file mode 100644 > index 9aeeb12fd9..0000000000 > --- a/examples/vhost/ioat.c > +++ /dev/null Nice to see platform-specific code not being necessary for the application. 
> diff --git a/examples/vhost/main.c b/examples/vhost/main.c > index 33d023aa39..16a02b9219 100644 > --- a/examples/vhost/main.c > +++ b/examples/vhost/main.c > @@ -24,8 +24,9 @@ > #include <rte_ip.h> > #include <rte_tcp.h> > #include <rte_pause.h> > +#include <rte_dmadev.h> > +#include <rte_vhost_async.h> > > -#include "ioat.h" > #include "main.h" > > #ifndef MAX_QUEUES > @@ -57,6 +58,11 @@ > > #define INVALID_PORT_ID 0xFF > > +#define MAX_VHOST_DEVICE 1024 > +#define DMA_RING_SIZE 4096 > + > +struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; > + > /* mask of enabled ports */ > static uint32_t enabled_port_mask = 0; > > @@ -199,10 +205,113 @@ struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * MAX_VHOST_DEVICE]; > static inline int > open_dma(const char *value) > { > - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) > - return open_ioat(value); > + struct dma_for_vhost *dma_info = dma_bind; > + char *input = strndup(value, strlen(value) + 1); > + char *addrs = input; > + char *ptrs[2]; > + char *start, *end, *substr; > + int64_t vid, vring_id; > + > + struct rte_dma_info info; > + struct rte_dma_conf dev_config = { .nb_vchans = 1 }; > + struct rte_dma_vchan_conf qconf = { > + .direction = RTE_DMA_DIR_MEM_TO_MEM, > + .nb_desc = DMA_RING_SIZE > + }; > + > + int dev_id; > + int ret = 0; > + uint16_t i = 0; > + char *dma_arg[MAX_VHOST_DEVICE]; > + int args_nr; > + > + while (isblank(*addrs)) > + addrs++; > + if (*addrs == '\0') { > + ret = -1; > + goto out; > + } > + > + /* process DMA devices within bracket. 
*/ > + addrs++; > + substr = strtok(addrs, ";]"); > + if (!substr) { > + ret = -1; > + goto out; > + } > + > + args_nr = rte_strsplit(substr, strlen(substr), > + dma_arg, MAX_VHOST_DEVICE, ','); > + if (args_nr <= 0) { > + ret = -1; > + goto out; > + } > + > + while (i < args_nr) { > + char *arg_temp = dma_arg[i]; > + uint8_t sub_nr; > + > + sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); > + if (sub_nr != 2) { > + ret = -1; > + goto out; > + } > + > + start = strstr(ptrs[0], "txd"); > + if (start == NULL) { > + ret = -1; > + goto out; > + } > + > + start += 3; > + vid = strtol(start, &end, 0); > + if (end == start) { > + ret = -1; > + goto out; > + } > + > + vring_id = 0 + VIRTIO_RXQ; > + > + dev_id = rte_dma_get_dev_id_by_name(ptrs[1]); > + if (dev_id < 0) { > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to find DMA %s.\n", ptrs[1]); > + ret = -1; > + goto out; > + } > + > + if (rte_dma_configure(dev_id, &dev_config) != 0) { > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to configure DMA %d.\n", dev_id); > + ret = -1; > + goto out; > + } > + > + if (rte_dma_vchan_setup(dev_id, 0, &qconf) != 0) { > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to set up DMA %d.\n", dev_id); > + ret = -1; > + goto out; > + } > > - return -1; > + rte_dma_info_get(dev_id, &info); > + if (info.nb_vchans != 1) { > + RTE_LOG(ERR, VHOST_CONFIG, "DMA %d has no queues.\n", dev_id); > + ret = -1; > + goto out; > + } > + > + if (rte_dma_start(dev_id) != 0) { > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to start DMA %u.\n", dev_id); > + ret = -1; > + goto out; > + } > + > + (dma_info + vid)->dmas[vring_id].dev_id = dev_id; > + (dma_info + vid)->dmas[vring_id].is_valid = true; This is_valid field is never used AFAICT, my understanding is that it had been added to differentiate between not valid and first dmadev, where dev_id will be both zero. Either make use of is_valid in the code, or change dev_id to int, and initialize it with -1 value. 
> + dma_info->nr++; > + i++; > + } > +out: > + free(input); > + return ret; > } > > /* > @@ -841,9 +950,10 @@ complete_async_pkts(struct vhost_dev *vdev) > { > struct rte_mbuf *p_cpl[MAX_PKT_BURST]; > uint16_t complete_count; > + uint16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; > > complete_count = rte_vhost_poll_enqueue_completed(vdev->vid, > - VIRTIO_RXQ, p_cpl, MAX_PKT_BURST); > + VIRTIO_RXQ, p_cpl, MAX_PKT_BURST, dma_id, 0); > if (complete_count) { > free_pkts(p_cpl, complete_count); > __atomic_sub_fetch(&vdev->pkts_inflight, complete_count, __ATOMIC_SEQ_CST); > @@ -880,6 +990,7 @@ drain_vhost(struct vhost_dev *vdev) > uint32_t buff_idx = rte_lcore_id() * MAX_VHOST_DEVICE + vdev->vid; > uint16_t nr_xmit = vhost_txbuff[buff_idx]->len; > struct rte_mbuf **m = vhost_txbuff[buff_idx]->m_table; > + uint16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; > > if (builtin_net_driver) { > ret = vs_enqueue_pkts(vdev, VIRTIO_RXQ, m, nr_xmit); > @@ -887,7 +998,7 @@ drain_vhost(struct vhost_dev *vdev) > uint16_t enqueue_fail = 0; > > complete_async_pkts(vdev); > - ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit); > + ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit, dma_id, 0); > __atomic_add_fetch(&vdev->pkts_inflight, ret, __ATOMIC_SEQ_CST); > > enqueue_fail = nr_xmit - ret; > @@ -1213,10 +1324,11 @@ drain_eth_rx(struct vhost_dev *vdev) > pkts, rx_count); > } else if (async_vhost_driver) { > uint16_t enqueue_fail = 0; > + uint16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; > > complete_async_pkts(vdev); > enqueue_count = rte_vhost_submit_enqueue_burst(vdev->vid, > - VIRTIO_RXQ, pkts, rx_count); > + VIRTIO_RXQ, pkts, rx_count, dma_id, 0); > __atomic_add_fetch(&vdev->pkts_inflight, enqueue_count, __ATOMIC_SEQ_CST); > > enqueue_fail = rx_count - enqueue_count; > @@ -1389,11 +1501,12 @@ destroy_device(int vid) > > if (async_vhost_driver) { > uint16_t n_pkt = 0; > + uint16_t dma_id = 
dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; > struct rte_mbuf *m_cpl[vdev->pkts_inflight]; > > while (vdev->pkts_inflight) { > n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, VIRTIO_RXQ, > - m_cpl, vdev->pkts_inflight); > + m_cpl, vdev->pkts_inflight, dma_id, 0); > free_pkts(m_cpl, n_pkt); > __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, __ATOMIC_SEQ_CST); > } > @@ -1470,18 +1583,10 @@ new_device(int vid) > > if (async_vhost_driver) { > struct rte_vhost_async_config config = {0}; > - struct rte_vhost_async_channel_ops channel_ops; > - > - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) { > - channel_ops.transfer_data = ioat_transfer_data_cb; > - channel_ops.check_completed_copies = > - ioat_check_completed_copies_cb; > > - config.features = RTE_VHOST_ASYNC_INORDER; > + config.features = RTE_VHOST_ASYNC_INORDER; > > - return rte_vhost_async_channel_register(vid, VIRTIO_RXQ, > - config, &channel_ops); > - } > + return rte_vhost_async_channel_register(vid, VIRTIO_RXQ, config); > } > > return 0; > @@ -1505,11 +1610,12 @@ vring_state_changed(int vid, uint16_t queue_id, int enable) > if (async_vhost_driver) { > if (!enable) { > uint16_t n_pkt = 0; > + uint16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; > struct rte_mbuf *m_cpl[vdev->pkts_inflight]; > > while (vdev->pkts_inflight) { > n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, queue_id, > - m_cpl, vdev->pkts_inflight); > + m_cpl, vdev->pkts_inflight, dma_id, 0); > free_pkts(m_cpl, n_pkt); > __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, __ATOMIC_SEQ_CST); > } > diff --git a/examples/vhost/main.h b/examples/vhost/main.h > index e7b1ac60a6..609fb406aa 100644 > --- a/examples/vhost/main.h > +++ b/examples/vhost/main.h > @@ -8,6 +8,7 @@ > #include <sys/queue.h> > > #include <rte_ether.h> > +#include <rte_pci.h> > > /* Macros for printing using RTE_LOG */ > #define RTE_LOGTYPE_VHOST_CONFIG RTE_LOGTYPE_USER1 > @@ -79,6 +80,17 @@ struct lcore_info { > struct vhost_dev_tailq_list vdev_list; > }; > > +struct 
dma_info { > + struct rte_pci_addr addr; > + uint16_t dev_id; > + bool is_valid; > +}; > + > +struct dma_for_vhost { > + struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; > + uint16_t nr; > +}; > + > /* we implement non-extra virtio net features */ > #define VIRTIO_NET_FEATURES 0 > > diff --git a/examples/vhost/meson.build b/examples/vhost/meson.build > index 3efd5e6540..87a637f83f 100644 > --- a/examples/vhost/meson.build > +++ b/examples/vhost/meson.build > @@ -12,13 +12,9 @@ if not is_linux > endif > > deps += 'vhost' > +deps += 'dmadev' > allow_experimental_apis = true > sources = files( > 'main.c', > 'virtio_net.c', > ) > - > -if dpdk_conf.has('RTE_RAW_IOAT') > - deps += 'raw_ioat' > - sources += files('ioat.c') > -endif > diff --git a/lib/vhost/meson.build b/lib/vhost/meson.build > index cdb37a4814..8107329400 100644 > --- a/lib/vhost/meson.build > +++ b/lib/vhost/meson.build > @@ -33,7 +33,8 @@ headers = files( > 'rte_vhost_async.h', > 'rte_vhost_crypto.h', > ) > + > driver_sdk_headers = files( > 'vdpa_driver.h', > ) > -deps += ['ethdev', 'cryptodev', 'hash', 'pci'] > +deps += ['ethdev', 'cryptodev', 'hash', 'pci', 'dmadev'] > diff --git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h > index a87ea6ba37..0594ae5fc5 100644 > --- a/lib/vhost/rte_vhost_async.h > +++ b/lib/vhost/rte_vhost_async.h > @@ -36,48 +36,6 @@ struct rte_vhost_async_status { > uintptr_t *dst_opaque_data; > }; > > -/** > - * dma operation callbacks to be implemented by applications > - */ > -struct rte_vhost_async_channel_ops { > - /** > - * instruct async engines to perform copies for a batch of packets > - * > - * @param vid > - * id of vhost device to perform data copies > - * @param queue_id > - * queue id to perform data copies > - * @param iov_iter > - * an array of IOV iterators > - * @param opaque_data > - * opaque data pair sending to DMA engine > - * @param count > - * number of elements in the "descs" array > - * @return > - * number of IOV iterators processed, 
negative value means error > - */ > - int32_t (*transfer_data)(int vid, uint16_t queue_id, > - struct rte_vhost_iov_iter *iov_iter, > - struct rte_vhost_async_status *opaque_data, > - uint16_t count); > - /** > - * check copy-completed packets from the async engine > - * @param vid > - * id of vhost device to check copy completion > - * @param queue_id > - * queue id to check copy completion > - * @param opaque_data > - * buffer to receive the opaque data pair from DMA engine > - * @param max_packets > - * max number of packets could be completed > - * @return > - * number of async descs completed, negative value means error > - */ > - int32_t (*check_completed_copies)(int vid, uint16_t queue_id, > - struct rte_vhost_async_status *opaque_data, > - uint16_t max_packets); > -}; > - > /** > * async channel features > */ > @@ -102,15 +60,12 @@ struct rte_vhost_async_config { > * vhost queue id async channel to be attached to > * @param config > * Async channel configuration structure > - * @param ops > - * Async channel operation callbacks > * @return > * 0 on success, -1 on failures > */ > __rte_experimental > int rte_vhost_async_channel_register(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops); > + struct rte_vhost_async_config config); > > /** > * Unregister an async channel for a vhost queue > @@ -136,8 +91,6 @@ int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id); > * vhost device id async channel to be attached to > * @param queue_id > * vhost queue id async channel to be attached to > - * @param config > - * Async channel configuration > * @param ops > * Async channel operation callbacks > * @return > @@ -145,8 +98,7 @@ int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id); > */ > __rte_experimental > int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops); > + 
struct rte_vhost_async_config config); > > /** > * Unregister an async channel for a vhost queue without performing any > @@ -179,12 +131,17 @@ int rte_vhost_async_channel_unregister_thread_unsafe(int vid, > * array of packets to be enqueued > * @param count > * packets num to be enqueued > + * @param dma_id > + * the identifier of the DMA device > + * @param dma_vchan > + * the identifier of virtual DMA channel > * @return > * num of packets enqueued > */ > __rte_experimental > uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count); > + struct rte_mbuf **pkts, uint16_t count, uint16_t dma_id, > + uint16_t dma_vchan); > > /** > * This function checks async completion status for a specific vhost > @@ -199,12 +156,17 @@ uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > * blank array to get return packet pointer > * @param count > * size of the packet array > + * @param dma_id > + * the identifier of the DMA device > + * @param dma_vchan > + * the identifier of virtual DMA channel > * @return > * num of packets returned > */ > __rte_experimental > uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count); > + struct rte_mbuf **pkts, uint16_t count, uint16_t dma_id, > + uint16_t dma_vchan); > > /** > * This function returns the amount of in-flight packets for the vhost > @@ -235,11 +197,16 @@ int rte_vhost_async_get_inflight(int vid, uint16_t queue_id); > * Blank array to get return packet pointer > * @param count > * Size of the packet array > + * @param dma_id > + * the identifier of the DMA device > + * @param dma_vchan > + * the identifier of virtual DMA channel > * @return > * Number of packets returned > */ > __rte_experimental > uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count); > + struct rte_mbuf **pkts, uint16_t count, uint16_t dma_id, > + uint16_t dma_vchan); > 
> #endif /* _RTE_VHOST_ASYNC_H_ */ > diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c > index 13a9bb9dd1..595cf63b8d 100644 > --- a/lib/vhost/vhost.c > +++ b/lib/vhost/vhost.c > @@ -344,6 +344,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq) > return; > > rte_free(vq->async->pkts_info); > + rte_free(vq->async->pkts_cmpl_flag); > > rte_free(vq->async->buffers_packed); > vq->async->buffers_packed = NULL; > @@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid, > } > > static __rte_always_inline int > -async_channel_register(int vid, uint16_t queue_id, > - struct rte_vhost_async_channel_ops *ops) > +async_channel_register(int vid, uint16_t queue_id) > { > struct virtio_net *dev = get_device(vid); > struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; > @@ -1656,6 +1656,14 @@ async_channel_register(int vid, uint16_t queue_id, > goto out_free_async; > } > > + async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size * sizeof(bool), > + RTE_CACHE_LINE_SIZE, node); > + if (!async->pkts_cmpl_flag) { > + VHOST_LOG_CONFIG(ERR, "failed to allocate async pkts_cmpl_flag (vid %d, qid: %d)\n", > + vid, queue_id); > + goto out_free_async; > + } > + > if (vq_is_packed(dev)) { > async->buffers_packed = rte_malloc_socket(NULL, > vq->size * sizeof(struct vring_used_elem_packed), > @@ -1676,9 +1684,6 @@ async_channel_register(int vid, uint16_t queue_id, > } > } > > - async->ops.check_completed_copies = ops->check_completed_copies; > - async->ops.transfer_data = ops->transfer_data; > - > vq->async = async; > > return 0; > @@ -1692,14 +1697,13 @@ async_channel_register(int vid, uint16_t queue_id, > > int > rte_vhost_async_channel_register(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops) > + struct rte_vhost_async_config config) > { > struct vhost_virtqueue *vq; > struct virtio_net *dev = get_device(vid); > int ret; > > - if (dev == NULL || ops == NULL) > + if (dev == NULL) > return -1; > > if (queue_id 
>= VHOST_MAX_VRING) > @@ -1717,12 +1721,8 @@ rte_vhost_async_channel_register(int vid, uint16_t queue_id, > return -1; > } > > - if (unlikely(ops->check_completed_copies == NULL || > - ops->transfer_data == NULL)) > - return -1; > - > rte_spinlock_lock(&vq->access_lock); > - ret = async_channel_register(vid, queue_id, ops); > + ret = async_channel_register(vid, queue_id); > rte_spinlock_unlock(&vq->access_lock); > > return ret; > @@ -1730,13 +1730,12 @@ rte_vhost_async_channel_register(int vid, uint16_t queue_id, > > int > rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops) > + struct rte_vhost_async_config config) > { > struct vhost_virtqueue *vq; > struct virtio_net *dev = get_device(vid); > > - if (dev == NULL || ops == NULL) > + if (dev == NULL) > return -1; > > if (queue_id >= VHOST_MAX_VRING) > @@ -1754,11 +1753,7 @@ rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, > return -1; > } > > - if (unlikely(ops->check_completed_copies == NULL || > - ops->transfer_data == NULL)) > - return -1; > - > - return async_channel_register(vid, queue_id, ops); > + return async_channel_register(vid, queue_id); > } > > int > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h > index 7085e0885c..974e495b56 100644 > --- a/lib/vhost/vhost.h > +++ b/lib/vhost/vhost.h > @@ -51,6 +51,11 @@ > #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) > #define VHOST_MAX_ASYNC_VEC 2048 > > +/* DMA device copy operation tracking ring size. */ > +#define VHOST_ASYNC_DMA_TRACK_RING_SIZE (uint32_t)4096 How is this value chosen? Is that specific to your hardware? > +#define VHOST_ASYNC_DMA_TRACK_RING_MASK (VHOST_ASYNC_DMA_TRACK_RING_SIZE - 1) > +#define VHOST_ASYNC_DMA_BATCHING_SIZE 32 > + > #define PACKED_DESC_ENQUEUE_USED_FLAG(w) \ > ((w) ? 
(VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \ > VRING_DESC_F_WRITE) > @@ -119,6 +124,29 @@ struct vring_used_elem_packed { > uint32_t count; > }; > > +struct async_dma_info { > + /* circular array to track copy metadata */ > + bool *metadata[VHOST_ASYNC_DMA_TRACK_RING_SIZE]; > + > + /* batching copies before a DMA doorbell */ > + uint16_t nr_batching; > + > + /** > + * DMA virtual channel lock. Although it is able to bind DMA > + * virtual channels to data plane threads, vhost control plane > + * thread could call data plane functions too, thus causing > + * DMA device contention. > + * > + * For example, in VM exit case, vhost control plane thread needs > + * to clear in-flight packets before disable vring, but there could > + * be anotther data plane thread is enqueuing packets to the same > + * vring with the same DMA virtual channel. But dmadev PMD functions > + * are lock-free, so the control plane and data plane threads > + * could operate the same DMA virtual channel at the same time. > + */ > + rte_spinlock_t dma_lock; > +}; > + > /** > * inflight async packet information > */ > @@ -129,9 +157,6 @@ struct async_inflight_info { > }; > > struct vhost_async { > - /* operation callbacks for DMA */ > - struct rte_vhost_async_channel_ops ops; > - > struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; > struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; > uint16_t iter_idx; > @@ -139,8 +164,22 @@ struct vhost_async { > > /* data transfer status */ > struct async_inflight_info *pkts_info; > + /** > + * packet reorder array. "true" indicates that DMA > + * device completes all copies for the packet. > + * > + * Note that this arry could be written by multiple array > + * threads at the same time. For example, two threads > + * enqueue packets to the same virtqueue with their > + * own DMA devices. However, since offloading is > + * per-packet basis, each packet flag will only be > + * written by one thread. 
And single byte write is > + * atomic, so no lock is needed. > + */ > + bool *pkts_cmpl_flag; > uint16_t pkts_idx; > uint16_t pkts_inflight_n; > + > union { > struct vring_used_elem *descs_split; > struct vring_used_elem_packed *buffers_packed; > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c > index b3d954aab4..95ecfeb64b 100644 > --- a/lib/vhost/virtio_net.c > +++ b/lib/vhost/virtio_net.c > @@ -11,6 +11,7 @@ > #include <rte_net.h> > #include <rte_ether.h> > #include <rte_ip.h> > +#include <rte_dmadev.h> > #include <rte_vhost.h> > #include <rte_tcp.h> > #include <rte_udp.h> > @@ -25,6 +26,9 @@ > > #define MAX_BATCH_LEN 256 > > +/* DMA device copy operation tracking array. */ > +static struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > + > static __rte_always_inline bool > rxvq_is_mergeable(struct virtio_net *dev) > { > @@ -43,6 +47,108 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring) > return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring; > } > > +static uint16_t > +vhost_async_dma_transfer(struct vhost_virtqueue *vq, uint16_t dma_id, > + uint16_t dma_vchan, uint16_t head_idx, > + struct rte_vhost_iov_iter *pkts, uint16_t nr_pkts) > +{ > + struct async_dma_info *dma_info = &dma_copy_track[dma_id]; > + uint16_t dma_space_left = rte_dma_burst_capacity(dma_id, 0); > + uint16_t pkt_idx = 0; > + > + rte_spinlock_lock(&dma_info->dma_lock); > + > + while (pkt_idx < nr_pkts) { A for loop would be preferred here. 
> + struct rte_vhost_iovec *iov = pkts[pkt_idx].iov; > + int copy_idx = 0; > + uint16_t nr_segs = pkts[pkt_idx].nr_segs; > + uint16_t i; > + > + if (unlikely(dma_space_left < nr_segs)) { > + goto out; > + } > + > + for (i = 0; i < nr_segs; i++) { > + copy_idx = rte_dma_copy(dma_id, dma_vchan, > + (rte_iova_t)iov[i].src_addr, > + (rte_iova_t)iov[i].dst_addr, > + iov[i].len, RTE_DMA_OP_FLAG_LLC); > + if (unlikely(copy_idx < 0)) { > + VHOST_LOG_DATA(ERR, "DMA device %u (%u) copy failed\n", > + dma_id, dma_vchan); > + dma_info->nr_batching += i; > + goto out; > + } > + > + dma_info->metadata[copy_idx & VHOST_ASYNC_DMA_TRACK_RING_MASK] = NULL; > + } > + > + /** > + * Only store packet completion flag address in the last copy's > + * slot, and other slots are set to NULL. > + */ > + dma_info->metadata[copy_idx & VHOST_ASYNC_DMA_TRACK_RING_MASK] = > + &vq->async->pkts_cmpl_flag[head_idx % vq->size]; > + > + dma_info->nr_batching += nr_segs; > + if (unlikely(dma_info->nr_batching > VHOST_ASYNC_DMA_BATCHING_SIZE)) { > + rte_dma_submit(dma_id, 0); > + dma_info->nr_batching = 0; > + } > + > + dma_space_left -= nr_segs; > + pkt_idx++; > + head_idx++; > + } > + > +out: > + if (dma_info->nr_batching > 0) { > + rte_dma_submit(dma_id, 0); > + dma_info->nr_batching = 0; > + } > + rte_spinlock_unlock(&dma_info->dma_lock); At first sight, that looks like a lot of things being done while the spinlock is held. But maybe there will only be contention in some corner cases? 
> + > + return pkt_idx; > +} > + > +static uint16_t > +vhost_async_dma_check_completed(uint16_t dma_id, uint16_t dma_vchan, uint16_t max_pkts) > +{ > + struct async_dma_info *dma_info = &dma_copy_track[dma_id]; > + uint16_t last_idx = 0; > + uint16_t nr_copies; > + uint16_t copy_idx; > + uint16_t i; > + > + rte_spinlock_lock(&dma_info->dma_lock); > + > + nr_copies = rte_dma_completed(dma_id, dma_vchan, max_pkts, &last_idx, NULL); > + if (nr_copies == 0) { > + goto out; > + } > + > + copy_idx = last_idx - nr_copies + 1; > + for (i = 0; i < nr_copies; i++) { > + bool *flag; > + > + flag = dma_info->metadata[copy_idx & VHOST_ASYNC_DMA_TRACK_RING_MASK]; > + if (flag) { > + /** > + * Mark the packet flag as received. The flag > + * could belong to another virtqueue but write > + * is atomic. > + */ > + *flag = true; > + dma_info->metadata[copy_idx & VHOST_ASYNC_DMA_TRACK_RING_MASK] = NULL; > + } > + copy_idx++; > + } > + > +out: > + rte_spinlock_unlock(&dma_info->dma_lock); > + return nr_copies; > +} > + > static inline void > do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq) > { > @@ -1451,7 +1557,8 @@ store_dma_desc_info_packed(struct vring_used_elem_packed *s_ring, > static __rte_noinline uint32_t > virtio_dev_rx_async_submit_split(struct virtio_net *dev, > struct vhost_virtqueue *vq, uint16_t queue_id, > - struct rte_mbuf **pkts, uint32_t count) > + struct rte_mbuf **pkts, uint32_t count, uint16_t dma_id, > + uint16_t dma_vchan) > { > struct buf_vector buf_vec[BUF_VECTOR_MAX]; > uint32_t pkt_idx = 0; > @@ -1463,6 +1570,7 @@ virtio_dev_rx_async_submit_split(struct virtio_net *dev, > uint32_t pkt_err = 0; > int32_t n_xfer; > uint16_t slot_idx = 0; > + uint16_t head_idx = async->pkts_idx & (vq->size - 1); > > /* > * The ordering between avail index and desc reads need to be enforced. 
> @@ -1503,17 +1611,16 @@ virtio_dev_rx_async_submit_split(struct virtio_net *dev, > if (unlikely(pkt_idx == 0)) > return 0; > > - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx); > - if (unlikely(n_xfer < 0)) { > - VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer data for queue id %d.\n", > - dev->vid, __func__, queue_id); > - n_xfer = 0; > - } > + n_xfer = vhost_async_dma_transfer(vq, dma_id, dma_vchan, head_idx, async->iov_iter, > + pkt_idx); > > pkt_err = pkt_idx - n_xfer; > if (unlikely(pkt_err)) { > uint16_t num_descs = 0; > > + VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer %u packets for queue %u.\n", > + dev->vid, __func__, pkt_err, queue_id); > + > /* update number of completed packets */ > pkt_idx = n_xfer; > > @@ -1658,11 +1765,12 @@ dma_error_handler_packed(struct vhost_virtqueue *vq, uint16_t slot_idx, > static __rte_noinline uint32_t > virtio_dev_rx_async_submit_packed(struct virtio_net *dev, > struct vhost_virtqueue *vq, uint16_t queue_id, > - struct rte_mbuf **pkts, uint32_t count) > + struct rte_mbuf **pkts, uint32_t count, uint16_t dma_id, > + uint16_t dma_vchan) > { > uint32_t pkt_idx = 0; > uint32_t remained = count; > - int32_t n_xfer; > + uint16_t n_xfer; > uint16_t num_buffers; > uint16_t num_descs; > > @@ -1670,6 +1778,7 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, > struct async_inflight_info *pkts_info = async->pkts_info; > uint32_t pkt_err = 0; > uint16_t slot_idx = 0; > + uint16_t head_idx = async->pkts_idx % vq->size; > > do { > rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]); > @@ -1694,19 +1803,17 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, > if (unlikely(pkt_idx == 0)) > return 0; > > - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx); > - if (unlikely(n_xfer < 0)) { > - VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer data for queue id %d.\n", > - dev->vid, __func__, queue_id); > - n_xfer = 0; > - } > - > - pkt_err = pkt_idx 
- n_xfer; > + n_xfer = vhost_async_dma_transfer(vq, dma_id, dma_vchan, head_idx, > + async->iov_iter, pkt_idx); > > async_iter_reset(async); > > - if (unlikely(pkt_err)) > + pkt_err = pkt_idx - n_xfer; > + if (unlikely(pkt_err)) { > + VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer %u packets for queue %u.\n", > + dev->vid, __func__, pkt_err, queue_id); > dma_error_handler_packed(vq, slot_idx, pkt_err, &pkt_idx); > + } > > if (likely(vq->shadow_used_idx)) { > /* keep used descriptors. */ > @@ -1826,28 +1933,37 @@ write_back_completed_descs_packed(struct vhost_virtqueue *vq, > > static __rte_always_inline uint16_t > vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, uint16_t dma_id, > + uint16_t dma_vchan) > { > struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; > struct vhost_async *async = vq->async; > struct async_inflight_info *pkts_info = async->pkts_info; > - int32_t n_cpl; > + uint16_t nr_cpl_copies, nr_cpl_pkts = 0; > uint16_t n_descs = 0, n_buffers = 0; > uint16_t start_idx, from, i; > > - n_cpl = async->ops.check_completed_copies(dev->vid, queue_id, 0, count); > - if (unlikely(n_cpl < 0)) { > - VHOST_LOG_DATA(ERR, "(%d) %s: failed to check completed copies for queue id %d.\n", > - dev->vid, __func__, queue_id); > + nr_cpl_copies = vhost_async_dma_check_completed(dma_id, dma_vchan, count); > + if (nr_cpl_copies == 0) > return 0; > - } > > - if (n_cpl == 0) > - return 0; > + /** > + * The order of updating packet completion flag needs to be > + * enforced. 
> + */ > + rte_atomic_thread_fence(__ATOMIC_RELEASE); > > start_idx = async_get_first_inflight_pkt_idx(vq); > > - for (i = 0; i < n_cpl; i++) { > + /* Calculate the number of copy completed packets */ > + from = start_idx; > + while (vq->async->pkts_cmpl_flag[from]) { > + vq->async->pkts_cmpl_flag[from] = false; > + from = (from + 1) % vq->size; > + nr_cpl_pkts++; > + } > + > + for (i = 0; i < nr_cpl_pkts; i++) { > from = (start_idx + i) % vq->size; > /* Only used with packed ring */ > n_buffers += pkts_info[from].nr_buffers; > @@ -1856,7 +1972,7 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, > pkts[i] = pkts_info[from].mbuf; > } > > - async->pkts_inflight_n -= n_cpl; > + async->pkts_inflight_n -= nr_cpl_pkts; > > if (likely(vq->enabled && vq->access_ok)) { > if (vq_is_packed(dev)) { > @@ -1877,12 +1993,13 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, > } > } > > - return n_cpl; > + return nr_cpl_pkts; > } > > uint16_t > rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, uint16_t dma_id, > + uint16_t dma_vchan) > { > struct virtio_net *dev = get_device(vid); > struct vhost_virtqueue *vq; > @@ -1908,7 +2025,7 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > > rte_spinlock_lock(&vq->access_lock); > > - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); > + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, dma_vchan); > > rte_spinlock_unlock(&vq->access_lock); > > @@ -1917,7 +2034,8 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > > uint16_t > rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, uint16_t dma_id, > + uint16_t dma_vchan) > { > struct virtio_net *dev = get_device(vid); > struct vhost_virtqueue *vq; > @@ -1941,14 +2059,15 @@ 
rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, > return 0; > } > > - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); > + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, dma_vchan); > > return n_pkts_cpl; > } > > static __rte_always_inline uint32_t > virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, > - struct rte_mbuf **pkts, uint32_t count) > + struct rte_mbuf **pkts, uint32_t count, uint16_t dma_id, > + uint16_t dma_vchan) > { > struct vhost_virtqueue *vq; > uint32_t nb_tx = 0; > @@ -1980,10 +2099,10 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, > > if (vq_is_packed(dev)) > nb_tx = virtio_dev_rx_async_submit_packed(dev, vq, queue_id, > - pkts, count); > + pkts, count, dma_id, dma_vchan); > else > nb_tx = virtio_dev_rx_async_submit_split(dev, vq, queue_id, > - pkts, count); > + pkts, count, dma_id, dma_vchan); > > out: > if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) > @@ -1997,7 +2116,8 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, > > uint16_t > rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, uint16_t dma_id, > + uint16_t dma_vchan) > { > struct virtio_net *dev = get_device(vid); > > @@ -2011,7 +2131,7 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > return 0; > } > > - return virtio_dev_rx_async_submit(dev, queue_id, pkts, count); > + return virtio_dev_rx_async_submit(dev, queue_id, pkts, count, dma_id, dma_vchan); > } > > static inline bool > ^ permalink raw reply [flat|nested] 31+ messages in thread
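To make the metadata-ring scheme in vhost_async_dma_transfer()/vhost_async_dma_check_completed() easier to follow, below is a minimal standalone sketch of the same bookkeeping, with the dmadev enqueue simulated by a plain counter. All names and sizes here are illustrative, not part of the patch; the point is that only a packet's last copy records the completion-flag pointer, so the packet is marked done only when its final segment completes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for the patch's per-device tracking state. */
#define TRACK_RING_SIZE 8u	/* power of two, 4096 in the patch */
#define TRACK_RING_MASK (TRACK_RING_SIZE - 1)

static bool *metadata[TRACK_RING_SIZE];	/* slot -> completion flag, or NULL */
static uint16_t next_copy_idx;		/* simulated dmadev ring index */

/* Simulated rte_dma_copy(): returns a monotonically increasing index. */
static int sim_dma_copy(void)
{
	return next_copy_idx++;
}

/*
 * Enqueue one packet made of nr_segs copies (nr_segs >= 1). Only the
 * last copy's slot records the packet's completion flag; the other
 * slots are set to NULL, mirroring vhost_async_dma_transfer().
 */
static void track_submit_pkt(uint16_t nr_segs, bool *cmpl_flag)
{
	int copy_idx = 0;
	uint16_t i;

	for (i = 0; i < nr_segs; i++) {
		copy_idx = sim_dma_copy();
		metadata[copy_idx & TRACK_RING_MASK] = NULL;
	}
	metadata[copy_idx & TRACK_RING_MASK] = cmpl_flag;
}

/*
 * Consume nr_copies completed copies ending at last_idx, as
 * rte_dma_completed() would report them, and mark owning packets done.
 */
static void track_complete(uint16_t last_idx, uint16_t nr_copies)
{
	uint16_t copy_idx = last_idx - nr_copies + 1;
	uint16_t i;

	for (i = 0; i < nr_copies; i++) {
		bool *flag = metadata[copy_idx & TRACK_RING_MASK];

		if (flag != NULL) {
			*flag = true;
			metadata[copy_idx & TRACK_RING_MASK] = NULL;
		}
		copy_idx++;
	}
}
```

With two packets of 3 and 2 segments, completing only the first three copies marks packet 0 done while packet 1 stays in flight.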
* RE: [RFC 1/1] vhost: integrate dmadev in asynchronous datapath 2021-12-24 10:39 ` Maxime Coquelin @ 2021-12-28 1:15 ` Hu, Jiayu 2022-01-03 10:26 ` Maxime Coquelin 0 siblings, 1 reply; 31+ messages in thread From: Hu, Jiayu @ 2021-12-28 1:15 UTC (permalink / raw) To: Maxime Coquelin, dev Cc: i.maximets, Xia, Chenbo, Richardson, Bruce, Van Haaren, Harry, Mcnamara, John, Pai G, Sunil Hi Maxime, Thanks for your comments, and some replies are inline. Thanks, Jiayu > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Friday, December 24, 2021 6:40 PM > To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org > Cc: i.maximets@ovn.org; Xia, Chenbo <chenbo.xia@intel.com>; Richardson, > Bruce <bruce.richardson@intel.com>; Van Haaren, Harry > <harry.van.haaren@intel.com>; Mcnamara, John > <john.mcnamara@intel.com>; Pai G, Sunil <sunil.pai.g@intel.com> > Subject: Re: [RFC 1/1] vhost: integrate dmadev in asynchronous datapath > > Hi Jiayu, > > This is a first review, I need to spend more time on the series to understand > it well. Do you have a prototype of the OVS part, so that it helps us to grasp > how the full integration would look like? I think OVS patch will be sent soon. And we will send the deq side implementation too. > > On 11/22/21 11:54, Jiayu Hu wrote: > > Since dmadev is introduced in 21.11, to avoid the overhead of vhost > > DMA abstraction layer and simplify application logics, this patch > > integrates dmadev in asynchronous data path. 
> > > > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> > > Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> > > --- > > doc/guides/prog_guide/vhost_lib.rst | 63 ++++---- > > examples/vhost/ioat.c | 218 ---------------------------- > > examples/vhost/ioat.h | 63 -------- > > examples/vhost/main.c | 144 +++++++++++++++--- > > examples/vhost/main.h | 12 ++ > > examples/vhost/meson.build | 6 +- > > lib/vhost/meson.build | 3 +- > > lib/vhost/rte_vhost_async.h | 73 +++------- > > lib/vhost/vhost.c | 37 ++--- > > lib/vhost/vhost.h | 45 +++++- > > lib/vhost/virtio_net.c | 198 ++++++++++++++++++++----- > > 11 files changed, 410 insertions(+), 452 deletions(-) > > delete mode 100644 examples/vhost/ioat.c > > delete mode 100644 examples/vhost/ioat.h > > > > diff --git a/doc/guides/prog_guide/vhost_lib.rst > > b/doc/guides/prog_guide/vhost_lib.rst > > index 76f5d303c9..32969a1c41 100644 > > --- a/doc/guides/prog_guide/vhost_lib.rst > > +++ b/doc/guides/prog_guide/vhost_lib.rst > > @@ -113,8 +113,8 @@ The following is an overview of some key Vhost API > functions: > > the async capability. Only packets enqueued/dequeued by async APIs > are > > processed through the async data path. > > > > - Currently this feature is only implemented on split ring enqueue data > > - path. > > + Currently this feature is only implemented on split and packed ring > > + enqueue data path. > > That's not related to the topic of this patch, you may move it in a dedicated > patch in v1. Sure, will remove later. > > > > > diff --git a/examples/vhost/ioat.c b/examples/vhost/ioat.c deleted > > file mode 100644 index 9aeeb12fd9..0000000000 > > --- a/examples/vhost/ioat.c > > +++ /dev/null > > Nice to see platform-specific code not being necessary for the > application. 
> > > diff --git a/examples/vhost/main.c b/examples/vhost/main.c > > index 33d023aa39..16a02b9219 100644 > > --- a/examples/vhost/main.c > > +++ b/examples/vhost/main.c > > @@ -199,10 +205,113 @@ struct vhost_bufftable > *vhost_txbuff[RTE_MAX_LCORE * MAX_VHOST_DEVICE]; > > static inline int > > open_dma(const char *value) > > { > > + if (rte_dma_start(dev_id) != 0) { > > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to start > DMA %u.\n", dev_id); > > + ret = -1; > > + goto out; > > + } > > + > > + (dma_info + vid)->dmas[vring_id].dev_id = dev_id; > > + (dma_info + vid)->dmas[vring_id].is_valid = true; > > This is_valid field is never used AFAICT, my understanding is that it > had been added to differentiate between not valid and first dmadev, > where dev_id will be both zero. Either make use of is_valid in the code, > or change dev_id to int, and initialize it with -1 value. Right, I will change it later. Thanks for reminder. > > > + dma_info->nr++; > > + i++; > > + } > > +out: > > + free(input); > > + return ret; > > } > > > > /* > > diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c > > index 13a9bb9dd1..595cf63b8d 100644 > > --- a/lib/vhost/vhost.c > > +++ b/lib/vhost/vhost.c > > @@ -344,6 +344,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq) > > return; > > > > rte_free(vq->async->pkts_info); > > + rte_free(vq->async->pkts_cmpl_flag); > > > > rte_free(vq->async->buffers_packed); > > vq->async->buffers_packed = NULL; > > @@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid, > > } > > > > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h > > index 7085e0885c..974e495b56 100644 > > --- a/lib/vhost/vhost.h > > +++ b/lib/vhost/vhost.h > > @@ -51,6 +51,11 @@ > > #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) > > #define VHOST_MAX_ASYNC_VEC 2048 > > > > +/* DMA device copy operation tracking ring size. */ > > +#define VHOST_ASYNC_DMA_TRACK_RING_SIZE (uint32_t)4096 > > How is this value chosen? Is that specific to your hardware? Yes. 
But in fact, this value should be equal to or greater than the vchan descriptor number, and it should be dynamic. In addition, the context tracking array "dma_copy_track" should be per-vchan, rather than per-device, although existing DMA devices support at most one vchan. I have reworked this part so that it can be configured dynamically by users. > > > +#define VHOST_ASYNC_DMA_TRACK_RING_MASK > (VHOST_ASYNC_DMA_TRACK_RING_SIZE - 1) > > +#define VHOST_ASYNC_DMA_BATCHING_SIZE 32 > > + > > #define PACKED_DESC_ENQUEUE_USED_FLAG(w) \ > > ((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | > VRING_DESC_F_WRITE) : \ > > VRING_DESC_F_WRITE) > > @@ -119,6 +124,29 @@ struct vring_used_elem_packed { > > uint32_t count; > > }; > > > > +struct async_dma_info { > > + /* circular array to track copy metadata */ > > + bool *metadata[VHOST_ASYNC_DMA_TRACK_RING_SIZE]; > > + > > + /* batching copies before a DMA doorbell */ > > + uint16_t nr_batching; > > + > > + /** > > + * DMA virtual channel lock. Although it is able to bind DMA > > + * virtual channels to data plane threads, vhost control plane > > + * thread could call data plane functions too, thus causing > > + * DMA device contention. > > + * > > + * For example, in VM exit case, vhost control plane thread needs > > + * to clear in-flight packets before disable vring, but there could > > + * be anotther data plane thread is enqueuing packets to the same > > + * vring with the same DMA virtual channel. But dmadev PMD > functions > > + * are lock-free, so the control plane and data plane threads > > + * could operate the same DMA virtual channel at the same time. 
> > + */ > > + rte_spinlock_t dma_lock; > > +}; > > + > > /** > > * inflight async packet information > > */ > > @@ -129,9 +157,6 @@ struct async_inflight_info { > > }; > > > > struct vhost_async { > > - /* operation callbacks for DMA */ > > - struct rte_vhost_async_channel_ops ops; > > - > > struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; > > struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; > > uint16_t iter_idx; > > @@ -139,8 +164,22 @@ struct vhost_async { > > > > /* data transfer status */ > > struct async_inflight_info *pkts_info; > > + /** > > + * packet reorder array. "true" indicates that DMA > > + * device completes all copies for the packet. > > + * > > + * Note that this arry could be written by multiple > > array I will change later. > > > + * threads at the same time. For example, two threads > > + * enqueue packets to the same virtqueue with their > > + * own DMA devices. However, since offloading is > > + * per-packet basis, each packet flag will only be > > + * written by one thread. And single byte write is > > + * atomic, so no lock is needed. > > + */ > > + bool *pkts_cmpl_flag; > > uint16_t pkts_idx; > > uint16_t pkts_inflight_n; > > + > > union { > > struct vring_used_elem *descs_split; > > struct vring_used_elem_packed *buffers_packed; > > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c > > index b3d954aab4..95ecfeb64b 100644 > > --- a/lib/vhost/virtio_net.c > > +++ b/lib/vhost/virtio_net.c > > @@ -11,6 +11,7 @@ > > #include <rte_net.h> > > #include <rte_ether.h> > > #include <rte_ip.h> > > +#include <rte_dmadev.h> > > #include <rte_vhost.h> > > #include <rte_tcp.h> > > #include <rte_udp.h> > > @@ -25,6 +26,9 @@ > > > > #define MAX_BATCH_LEN 256 > > > > +/* DMA device copy operation tracking array. 
*/ > > +static struct async_dma_info > dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > > + > > static __rte_always_inline bool > > rxvq_is_mergeable(struct virtio_net *dev) > > { > > @@ -43,6 +47,108 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, > uint32_t nr_vring) > > return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring; > > } > > > > +static uint16_t > > +vhost_async_dma_transfer(struct vhost_virtqueue *vq, uint16_t dma_id, > > + uint16_t dma_vchan, uint16_t head_idx, > > + struct rte_vhost_iov_iter *pkts, uint16_t nr_pkts) > > +{ > > + struct async_dma_info *dma_info = &dma_copy_track[dma_id]; > > + uint16_t dma_space_left = rte_dma_burst_capacity(dma_id, 0); > > + uint16_t pkt_idx = 0; > > + > > + rte_spinlock_lock(&dma_info->dma_lock); > > + > > + while (pkt_idx < nr_pkts) { > > A for loop would be prefered here. Sure, I will change later. > > > + struct rte_vhost_iovec *iov = pkts[pkt_idx].iov; > > + int copy_idx = 0; > > + uint16_t nr_segs = pkts[pkt_idx].nr_segs; > > + uint16_t i; > > + > > + if (unlikely(dma_space_left < nr_segs)) { > > + goto out; > > + } > > + > > + for (i = 0; i < nr_segs; i++) { > > + copy_idx = rte_dma_copy(dma_id, dma_vchan, > > + (rte_iova_t)iov[i].src_addr, > > + (rte_iova_t)iov[i].dst_addr, > > + iov[i].len, RTE_DMA_OP_FLAG_LLC); > > + if (unlikely(copy_idx < 0)) { > > + VHOST_LOG_DATA(ERR, "DMA device %u > (%u) copy failed\n", > > + dma_id, dma_vchan); > > + dma_info->nr_batching += i; > > + goto out; > > + } > > + > > + dma_info->metadata[copy_idx & > VHOST_ASYNC_DMA_TRACK_RING_MASK] = NULL; > > + } > > + > > + /** > > + * Only store packet completion flag address in the last copy's > > + * slot, and other slots are set to NULL. 
> > + */ > > + dma_info->metadata[copy_idx & > VHOST_ASYNC_DMA_TRACK_RING_MASK] = > > + &vq->async->pkts_cmpl_flag[head_idx % vq->size]; > > + > > + dma_info->nr_batching += nr_segs; > > + if (unlikely(dma_info->nr_batching > > VHOST_ASYNC_DMA_BATCHING_SIZE)) { > > + rte_dma_submit(dma_id, 0); > > + dma_info->nr_batching = 0; > > + } > > + > > + dma_space_left -= nr_segs; > > + pkt_idx++; > > + head_idx++; > > + } > > + > > +out: > > + if (dma_info->nr_batching > 0) { > > + rte_dma_submit(dma_id, 0); > > + dma_info->nr_batching = 0; > > + } > > + rte_spinlock_unlock(&dma_info->dma_lock); > > At first sight, that looks like a lot of things being done while the > spinlock is held. But maybe there will only be contention in some corner > cases? I think the answer is yes. A typical model is to bind DMA devices to data-path threads, so there is no DMA sharing in the data path. One case of contention happens between the control plane and data plane threads, but it only occurs when a vring needs to stop, etc. So the contention should not happen very often. Thanks, Jiayu ^ permalink raw reply [flat|nested] 31+ messages in thread
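The pkts_cmpl_flag reorder array discussed in this message can also be exercised in isolation. The sketch below (illustrative, with no vhost dependencies) mirrors the scan in vhost_poll_enqueue_completed(): completed packets are harvested strictly in ring order, so a packet that finished out of order stays pending until every packet before it completes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Count and clear in-order completed packets starting from the oldest
 * in-flight index. A gap (false) stops the scan, which is what keeps
 * descriptor write-back in ring order. The caller must guarantee that
 * fewer than ring_size packets are in flight, as vhost does.
 */
static uint16_t count_in_order_completed(bool *pkts_cmpl_flag,
		uint16_t start_idx, uint16_t ring_size)
{
	uint16_t from = start_idx;
	uint16_t nr_cpl_pkts = 0;

	while (pkts_cmpl_flag[from]) {
		pkts_cmpl_flag[from] = false;
		from = (from + 1) % ring_size;
		nr_cpl_pkts++;
	}
	return nr_cpl_pkts;
}
```

For example, with flags {true, true, false, true}, a scan from index 0 harvests two packets and leaves the out-of-order completion at index 3 pending.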
* Re: [RFC 1/1] vhost: integrate dmadev in asynchronous datapath 2021-12-28 1:15 ` Hu, Jiayu @ 2022-01-03 10:26 ` Maxime Coquelin 2022-01-06 5:46 ` Hu, Jiayu 0 siblings, 1 reply; 31+ messages in thread From: Maxime Coquelin @ 2022-01-03 10:26 UTC (permalink / raw) To: Hu, Jiayu, dev Cc: i.maximets, Xia, Chenbo, Richardson, Bruce, Van Haaren, Harry, Mcnamara, John, Pai G, Sunil Hi Jiayu, On 12/28/21 02:15, Hu, Jiayu wrote: > Hi Maxime, > > Thanks for your comments, and some replies are inline. > > Thanks, > Jiayu > >> -----Original Message----- >> From: Maxime Coquelin <maxime.coquelin@redhat.com> >> Sent: Friday, December 24, 2021 6:40 PM >> To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org >> Cc: i.maximets@ovn.org; Xia, Chenbo <chenbo.xia@intel.com>; Richardson, >> Bruce <bruce.richardson@intel.com>; Van Haaren, Harry >> <harry.van.haaren@intel.com>; Mcnamara, John >> <john.mcnamara@intel.com>; Pai G, Sunil <sunil.pai.g@intel.com> >> Subject: Re: [RFC 1/1] vhost: integrate dmadev in asynchronous datapath >> >> Hi Jiayu, >> >> This is a first review, I need to spend more time on the series to understand >> it well. Do you have a prototype of the OVS part, so that it helps us to grasp >> how the full integration would look like? > > I think OVS patch will be sent soon. And we will send the deq side implementation too. > >> >> On 11/22/21 11:54, Jiayu Hu wrote: >>> Since dmadev is introduced in 21.11, to avoid the overhead of vhost >>> DMA abstraction layer and simplify application logics, this patch >>> integrates dmadev in asynchronous data path. 
>>> >>> Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> >>> Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> >>> --- >>> doc/guides/prog_guide/vhost_lib.rst | 63 ++++---- >>> examples/vhost/ioat.c | 218 ---------------------------- >>> examples/vhost/ioat.h | 63 -------- >>> examples/vhost/main.c | 144 +++++++++++++++--- >>> examples/vhost/main.h | 12 ++ >>> examples/vhost/meson.build | 6 +- >>> lib/vhost/meson.build | 3 +- >>> lib/vhost/rte_vhost_async.h | 73 +++------- >>> lib/vhost/vhost.c | 37 ++--- >>> lib/vhost/vhost.h | 45 +++++- >>> lib/vhost/virtio_net.c | 198 ++++++++++++++++++++----- >>> 11 files changed, 410 insertions(+), 452 deletions(-) >>> delete mode 100644 examples/vhost/ioat.c >>> delete mode 100644 examples/vhost/ioat.h >>> ... >>> diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c >>> index 13a9bb9dd1..595cf63b8d 100644 >>> --- a/lib/vhost/vhost.c >>> +++ b/lib/vhost/vhost.c >>> @@ -344,6 +344,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq) >>> return; >>> >>> rte_free(vq->async->pkts_info); >>> + rte_free(vq->async->pkts_cmpl_flag); >>> >>> rte_free(vq->async->buffers_packed); >>> vq->async->buffers_packed = NULL; >>> @@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid, >>> } >>> >>> diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h >>> index 7085e0885c..974e495b56 100644 >>> --- a/lib/vhost/vhost.h >>> +++ b/lib/vhost/vhost.h >>> @@ -51,6 +51,11 @@ >>> #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) >>> #define VHOST_MAX_ASYNC_VEC 2048 >>> >>> +/* DMA device copy operation tracking ring size. */ >>> +#define VHOST_ASYNC_DMA_TRACK_RING_SIZE (uint32_t)4096 >> >> How is this value chosen? Is that specific to your hardware? > > Yes. But in fact, this value should be equal to or greater than vchan > desc number, and it should be dynamic. In addition, the context tracking > array " dma_copy_track" should be per-vchan basis, rather than per-device, > although existed DMA devices only supports 1 vchan at most. 
> > I have reworked this part which can be configured by users dynamically. Wouldn't it be better to use the max_desc value from struct rte_dma_info? ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [RFC 1/1] vhost: integrate dmadev in asynchronous datapath 2022-01-03 10:26 ` Maxime Coquelin @ 2022-01-06 5:46 ` Hu, Jiayu 0 siblings, 0 replies; 31+ messages in thread From: Hu, Jiayu @ 2022-01-06 5:46 UTC (permalink / raw) To: Maxime Coquelin, dev Cc: i.maximets, Xia, Chenbo, Richardson, Bruce, Van Haaren, Harry, Mcnamara, John, Pai G, Sunil Hi Maxime, > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Monday, January 3, 2022 6:26 PM > To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org > Cc: i.maximets@ovn.org; Xia, Chenbo <chenbo.xia@intel.com>; Richardson, > Bruce <bruce.richardson@intel.com>; Van Haaren, Harry > <harry.van.haaren@intel.com>; Mcnamara, John > <john.mcnamara@intel.com>; Pai G, Sunil <sunil.pai.g@intel.com> > Subject: Re: [RFC 1/1] vhost: integrate dmadev in asynchronous datapath > > Hi Jiayu, > > On 12/28/21 02:15, Hu, Jiayu wrote: > > Hi Maxime, > > > > Thanks for your comments, and some replies are inline. > > > > Thanks, > > Jiayu > > > >> -----Original Message----- > >> From: Maxime Coquelin <maxime.coquelin@redhat.com> > >> Sent: Friday, December 24, 2021 6:40 PM > >> To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org > >> Cc: i.maximets@ovn.org; Xia, Chenbo <chenbo.xia@intel.com>; > >> Richardson, Bruce <bruce.richardson@intel.com>; Van Haaren, Harry > >> <harry.van.haaren@intel.com>; Mcnamara, John > >> <john.mcnamara@intel.com>; Pai G, Sunil <sunil.pai.g@intel.com> > >> Subject: Re: [RFC 1/1] vhost: integrate dmadev in asynchronous > >> datapath > >> > >> Hi Jiayu, > >> > >> This is a first review, I need to spend more time on the series to > >> understand it well. Do you have a prototype of the OVS part, so that > >> it helps us to grasp how the full integration would look like? > > > > I think OVS patch will be sent soon. And we will send the deq side > implementation too. 
> > > >> > >> On 11/22/21 11:54, Jiayu Hu wrote: > >>> Since dmadev is introduced in 21.11, to avoid the overhead of vhost > >>> DMA abstraction layer and simplify application logics, this patch > >>> integrates dmadev in asynchronous data path. > >>> > >>> Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> > >>> Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> > >>> --- > >>> doc/guides/prog_guide/vhost_lib.rst | 63 ++++---- > >>> examples/vhost/ioat.c | 218 ---------------------------- > >>> examples/vhost/ioat.h | 63 -------- > >>> examples/vhost/main.c | 144 +++++++++++++++--- > >>> examples/vhost/main.h | 12 ++ > >>> examples/vhost/meson.build | 6 +- > >>> lib/vhost/meson.build | 3 +- > >>> lib/vhost/rte_vhost_async.h | 73 +++------- > >>> lib/vhost/vhost.c | 37 ++--- > >>> lib/vhost/vhost.h | 45 +++++- > >>> lib/vhost/virtio_net.c | 198 ++++++++++++++++++++----- > >>> 11 files changed, 410 insertions(+), 452 deletions(-) > >>> delete mode 100644 examples/vhost/ioat.c > >>> delete mode 100644 examples/vhost/ioat.h > >>> > > ... > > >>> diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c index > >>> 13a9bb9dd1..595cf63b8d 100644 > >>> --- a/lib/vhost/vhost.c > >>> +++ b/lib/vhost/vhost.c > >>> @@ -344,6 +344,7 @@ vhost_free_async_mem(struct vhost_virtqueue > *vq) > >>> return; > >>> > >>> rte_free(vq->async->pkts_info); > >>> + rte_free(vq->async->pkts_cmpl_flag); > >>> > >>> rte_free(vq->async->buffers_packed); > >>> vq->async->buffers_packed = NULL; @@ -1626,8 +1627,7 @@ > >>> rte_vhost_extern_callback_register(int vid, > >>> } > >>> > >>> diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h index > >>> 7085e0885c..974e495b56 100644 > >>> --- a/lib/vhost/vhost.h > >>> +++ b/lib/vhost/vhost.h > >>> @@ -51,6 +51,11 @@ > >>> #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) > >>> #define VHOST_MAX_ASYNC_VEC 2048 > >>> > >>> +/* DMA device copy operation tracking ring size. */ #define > >>> +VHOST_ASYNC_DMA_TRACK_RING_SIZE (uint32_t)4096 > >> > >> How is this value chosen? 
Is that specific to your hardware? > > > > Yes. But in fact, this value should be equal to or greater than the vchan > > desc number, and it should be dynamic. In addition, the context > > tracking array "dma_copy_track" should be on a per-vchan basis, rather > > than per-device, although existing DMA devices only support 1 vchan at > most. > > > > I have reworked this part which can be configured by users dynamically. > > Wouldn't it be better to use the max_desc value from struct > rte_dma_info? Yes, you are right. I will use this structure in the next version. Thanks, Jiayu ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC 0/1] integrate dmadev in vhost 2021-11-22 10:54 [RFC 0/1] integrate dmadev in vhost Jiayu Hu 2021-11-22 10:54 ` [RFC 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu @ 2021-12-03 3:49 ` fengchengwen 2021-12-30 21:55 ` [PATCH v1 " Jiayu Hu 2 siblings, 0 replies; 31+ messages in thread From: fengchengwen @ 2021-12-03 3:49 UTC (permalink / raw) To: Jiayu Hu, dev Cc: maxime.coquelin, i.maximets, chenbo.xia, bruce.richardson, harry.van.haaren, john.mcnamara, sunil.pai.g Hi Jiayu, I notice that examples/vhost relies on VMDQ. Could examples/vhost provide options that do not depend on VMDQ? That way, many more network adapters could be used. Thanks. On 2021/11/22 18:54, Jiayu Hu wrote: > Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA > abstraction layer and simplify application logics, this patch integrates > dmadev in vhost. > > To enable the flexibility of using DMA devices in different function > modules, not limited in vhost, vhost doesn't manage DMA devices. > Applications, like OVS, need to manage and configure DMA devices and > tell vhost what DMA device to use in every dataplane function call. > > In addition, vhost supports M:N mapping between vrings and DMA virtual > channels. Specifically, one vring can use multiple different DMA channels > and one DMA channel can be shared by multiple vrings at the same time. > The reason for enabling one vring to use multiple DMA channels is that > it's possible that more than one dataplane thread may enqueue packets to > the same vring with their own DMA virtual channels. Besides, the number > of DMA devices is limited. For the purpose of scaling, it's necessary to > support sharing DMA channels among vrings. > ... ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v1 0/1] integrate dmadev in vhost 2021-11-22 10:54 [RFC 0/1] integrate dmadev in vhost Jiayu Hu 2021-11-22 10:54 ` [RFC 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu 2021-12-03 3:49 ` [RFC 0/1] integrate dmadev in vhost fengchengwen @ 2021-12-30 21:55 ` Jiayu Hu 2021-12-30 21:55 ` [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu 2022-01-24 16:40 ` [PATCH v2 0/1] integrate dmadev in vhost Jiayu Hu 2 siblings, 2 replies; 31+ messages in thread From: Jiayu Hu @ 2021-12-30 21:55 UTC (permalink / raw) To: dev Cc: maxime.coquelin, i.maximets, chenbo.xia, bruce.richardson, harry.van.haaren, sunil.pai.g, john.mcnamara, xuan.ding, cheng1.jiang, liangma, Jiayu Hu Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA abstraction layer and simplify application logics, this patch integrates dmadev in vhost. To enable the flexibility of using DMA devices in different function modules, not limited in vhost, vhost doesn't manage DMA devices. Applications, like OVS, need to manage and configure DMA devices and tell vhost what DMA device to use in every dataplane function call. In addition, vhost supports M:N mapping between vrings and DMA virtual channels. Specifically, one vring can use multiple different DMA channels and one DMA channel can be shared by multiple vrings at the same time. The reason for enabling one vring to use multiple DMA channels is that it's possible that more than one dataplane thread may enqueue packets to the same vring with their own DMA virtual channels. Besides, the number of DMA devices is limited. For the purpose of scaling, it's necessary to support sharing DMA channels among vrings. As DMA acceleration is enabled only for the enqueue path, the new dataplane functions are: 1). rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id, dma_vchan): Get descriptors and submit copies to DMA virtual channel for the packets that need to be sent to the VM. 2).
rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id, dma_vchan): Check completed DMA copies from the given DMA virtual channel and write back corresponding descriptors to vring. OVS needs to call rte_vhost_poll_enqueue_completed to clean in-flight copies on previous call and it can be called inside rxq_recv function, so that it doesn't require big change in OVS datapath. For example: netdev_dpdk_vhost_rxq_recv() { ... qid = rxq->queue_id * VIRTIO_QNUM + VIRTIO_RXQ; rte_vhost_poll_enqueue_completed(vid, qid, ...); } Change log ========== rfc -> v1: - remove useless code - support dynamic DMA vchannel ring size (rte_vhost_async_dma_configure) - fix several bugs - fix typo and coding style issues - replace "while" with "for" - update programmer guide - support share dma among vhost in vhost example - remove "--dma-type" in vhost example Jiayu Hu (1): vhost: integrate dmadev in asynchronous datapath doc/guides/prog_guide/vhost_lib.rst | 70 ++++----- examples/vhost/Makefile | 2 +- examples/vhost/ioat.c | 218 -------------------------- examples/vhost/ioat.h | 63 -------- examples/vhost/main.c | 230 +++++++++++++++++++++++----- examples/vhost/main.h | 11 ++ examples/vhost/meson.build | 6 +- lib/vhost/meson.build | 3 +- lib/vhost/rte_vhost_async.h | 121 +++++---------- lib/vhost/version.map | 3 + lib/vhost/vhost.c | 130 +++++++++++----- lib/vhost/vhost.h | 53 ++++++- lib/vhost/virtio_net.c | 206 +++++++++++++++++++------ 13 files changed, 587 insertions(+), 529 deletions(-) delete mode 100644 examples/vhost/ioat.c delete mode 100644 examples/vhost/ioat.h -- 2.25.1 ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath 2021-12-30 21:55 ` [PATCH v1 " Jiayu Hu @ 2021-12-30 21:55 ` Jiayu Hu 2021-12-31 0:55 ` Liang Ma ` (2 more replies) 2022-01-24 16:40 ` [PATCH v2 0/1] integrate dmadev in vhost Jiayu Hu 1 sibling, 3 replies; 31+ messages in thread From: Jiayu Hu @ 2021-12-30 21:55 UTC (permalink / raw) To: dev Cc: maxime.coquelin, i.maximets, chenbo.xia, bruce.richardson, harry.van.haaren, sunil.pai.g, john.mcnamara, xuan.ding, cheng1.jiang, liangma, Jiayu Hu Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA abstraction layer and simplify application logics, this patch integrates dmadev in asynchronous data path. Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> --- doc/guides/prog_guide/vhost_lib.rst | 70 ++++----- examples/vhost/Makefile | 2 +- examples/vhost/ioat.c | 218 -------------------------- examples/vhost/ioat.h | 63 -------- examples/vhost/main.c | 230 +++++++++++++++++++++++----- examples/vhost/main.h | 11 ++ examples/vhost/meson.build | 6 +- lib/vhost/meson.build | 3 +- lib/vhost/rte_vhost_async.h | 121 +++++---------- lib/vhost/version.map | 3 + lib/vhost/vhost.c | 130 +++++++++++----- lib/vhost/vhost.h | 53 ++++++- lib/vhost/virtio_net.c | 206 +++++++++++++++++++------ 13 files changed, 587 insertions(+), 529 deletions(-) delete mode 100644 examples/vhost/ioat.c delete mode 100644 examples/vhost/ioat.h diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst index 76f5d303c9..bdce7cbf02 100644 --- a/doc/guides/prog_guide/vhost_lib.rst +++ b/doc/guides/prog_guide/vhost_lib.rst @@ -218,38 +218,12 @@ The following is an overview of some key Vhost API functions: Enable or disable zero copy feature of the vhost crypto backend. 
-* ``rte_vhost_async_channel_register(vid, queue_id, config, ops)`` +* ``rte_vhost_async_channel_register(vid, queue_id)`` Register an async copy device channel for a vhost queue after vring - is enabled. Following device ``config`` must be specified together - with the registration: + is enabled. - * ``features`` - - This field is used to specify async copy device features. - - ``RTE_VHOST_ASYNC_INORDER`` represents the async copy device can - guarantee the order of copy completion is the same as the order - of copy submission. - - Currently, only ``RTE_VHOST_ASYNC_INORDER`` capable device is - supported by vhost. - - Applications must provide following ``ops`` callbacks for vhost lib to - work with the async copy devices: - - * ``transfer_data(vid, queue_id, descs, opaque_data, count)`` - - vhost invokes this function to submit copy data to the async devices. - For non-async_inorder capable devices, ``opaque_data`` could be used - for identifying the completed packets. - - * ``check_completed_copies(vid, queue_id, opaque_data, max_packets)`` - - vhost invokes this function to get the copy data completed by async - devices. - -* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id, config, ops)`` +* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id)`` Register an async copy device channel for a vhost queue without performing any locking. @@ -277,18 +251,13 @@ The following is an overview of some key Vhost API functions: This function is only safe to call in vhost callback functions (i.e., struct rte_vhost_device_ops). -* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, comp_pkts, comp_count)`` +* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id, dma_vchan)`` Submit an enqueue request to transmit ``count`` packets from host to guest - by async data path. 
Successfully enqueued packets can be transfer completed - or being occupied by DMA engines; transfer completed packets are returned in - ``comp_pkts``, but others are not guaranteed to finish, when this API - call returns. + by async data path. Applications must not free the packets submitted for + enqueue until the packets are completed. - Applications must not free the packets submitted for enqueue until the - packets are completed. - -* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count)`` +* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id, dma_vchan)`` Poll enqueue completion status from async data path. Completed packets are returned to applications through ``pkts``. @@ -298,7 +267,7 @@ The following is an overview of some key Vhost API functions: This function returns the amount of in-flight packets for the vhost queue using async acceleration. -* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count)`` +* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count, dma_id, dma_vchan)`` Clear inflight packets which are submitted to DMA engine in vhost async data path. Completed packets are returned to applications through ``pkts``. @@ -442,3 +411,26 @@ Finally, a set of device ops is defined for device specific operations: * ``get_notify_area`` Called to get the notify area info of the queue. + +Vhost asynchronous data path +---------------------------- + +Vhost asynchronous data path leverages DMA devices to offload memory +copies from the CPU and it is implemented in an asynchronous way. It +enables applcations, like OVS, to save CPU cycles and hide memory copy +overhead, thus achieving higher throughput. + +Vhost doesn't manage DMA devices and applications, like OVS, need to +manage and configure DMA devices. Applications need to tell vhost what +DMA devices to use in every data path function call. 
This design enables +the flexibility for applications to dynamically use DMA channels in +different function modules, not limited in vhost. + +In addition, vhost supports M:N mapping between vrings and DMA virtual +channels. Specifically, one vring can use multiple different DMA channels +and one DMA channel can be shared by multiple vrings at the same time. +The reason of enabling one vring to use multiple DMA channels is that +it's possible that more than one dataplane threads enqueue packets to +the same vring with their own DMA virtual channels. Besides, the number +of DMA devices is limited. For the purpose of scaling, it's necessary to +support sharing DMA channels among vrings. diff --git a/examples/vhost/Makefile b/examples/vhost/Makefile index 587ea2ab47..975a5dfe40 100644 --- a/examples/vhost/Makefile +++ b/examples/vhost/Makefile @@ -5,7 +5,7 @@ APP = vhost-switch # all source are stored in SRCS-y -SRCS-y := main.c virtio_net.c ioat.c +SRCS-y := main.c virtio_net.c PKGCONF ?= pkg-config diff --git a/examples/vhost/ioat.c b/examples/vhost/ioat.c deleted file mode 100644 index 9aeeb12fd9..0000000000 --- a/examples/vhost/ioat.c +++ /dev/null @@ -1,218 +0,0 @@ -/* SPDX-License-Identifier: BSD-3-Clause - * Copyright(c) 2010-2020 Intel Corporation - */ - -#include <sys/uio.h> -#ifdef RTE_RAW_IOAT -#include <rte_rawdev.h> -#include <rte_ioat_rawdev.h> - -#include "ioat.h" -#include "main.h" - -struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; - -struct packet_tracker { - unsigned short size_track[MAX_ENQUEUED_SIZE]; - unsigned short next_read; - unsigned short next_write; - unsigned short last_remain; - unsigned short ioat_space; -}; - -struct packet_tracker cb_tracker[MAX_VHOST_DEVICE]; - -int -open_ioat(const char *value) -{ - struct dma_for_vhost *dma_info = dma_bind; - char *input = strndup(value, strlen(value) + 1); - char *addrs = input; - char *ptrs[2]; - char *start, *end, *substr; - int64_t vid, vring_id; - struct rte_ioat_rawdev_config config; - struct 
rte_rawdev_info info = { .dev_private = &config }; - char name[32]; - int dev_id; - int ret = 0; - uint16_t i = 0; - char *dma_arg[MAX_VHOST_DEVICE]; - int args_nr; - - while (isblank(*addrs)) - addrs++; - if (*addrs == '\0') { - ret = -1; - goto out; - } - - /* process DMA devices within bracket. */ - addrs++; - substr = strtok(addrs, ";]"); - if (!substr) { - ret = -1; - goto out; - } - args_nr = rte_strsplit(substr, strlen(substr), - dma_arg, MAX_VHOST_DEVICE, ','); - if (args_nr <= 0) { - ret = -1; - goto out; - } - while (i < args_nr) { - char *arg_temp = dma_arg[i]; - uint8_t sub_nr; - sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); - if (sub_nr != 2) { - ret = -1; - goto out; - } - - start = strstr(ptrs[0], "txd"); - if (start == NULL) { - ret = -1; - goto out; - } - - start += 3; - vid = strtol(start, &end, 0); - if (end == start) { - ret = -1; - goto out; - } - - vring_id = 0 + VIRTIO_RXQ; - if (rte_pci_addr_parse(ptrs[1], - &(dma_info + vid)->dmas[vring_id].addr) < 0) { - ret = -1; - goto out; - } - - rte_pci_device_name(&(dma_info + vid)->dmas[vring_id].addr, - name, sizeof(name)); - dev_id = rte_rawdev_get_dev_id(name); - if (dev_id == (uint16_t)(-ENODEV) || - dev_id == (uint16_t)(-EINVAL)) { - ret = -1; - goto out; - } - - if (rte_rawdev_info_get(dev_id, &info, sizeof(config)) < 0 || - strstr(info.driver_name, "ioat") == NULL) { - ret = -1; - goto out; - } - - (dma_info + vid)->dmas[vring_id].dev_id = dev_id; - (dma_info + vid)->dmas[vring_id].is_valid = true; - config.ring_size = IOAT_RING_SIZE; - config.hdls_disable = true; - if (rte_rawdev_configure(dev_id, &info, sizeof(config)) < 0) { - ret = -1; - goto out; - } - rte_rawdev_start(dev_id); - cb_tracker[dev_id].ioat_space = IOAT_RING_SIZE - 1; - dma_info->nr++; - i++; - } -out: - free(input); - return ret; -} - -int32_t -ioat_transfer_data_cb(int vid, uint16_t queue_id, - struct rte_vhost_iov_iter *iov_iter, - struct rte_vhost_async_status *opaque_data, uint16_t count) -{ - 
uint32_t i_iter; - uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2 + VIRTIO_RXQ].dev_id; - struct rte_vhost_iov_iter *iter = NULL; - unsigned long i_seg; - unsigned short mask = MAX_ENQUEUED_SIZE - 1; - unsigned short write = cb_tracker[dev_id].next_write; - - if (!opaque_data) { - for (i_iter = 0; i_iter < count; i_iter++) { - iter = iov_iter + i_iter; - i_seg = 0; - if (cb_tracker[dev_id].ioat_space < iter->nr_segs) - break; - while (i_seg < iter->nr_segs) { - rte_ioat_enqueue_copy(dev_id, - (uintptr_t)(iter->iov[i_seg].src_addr), - (uintptr_t)(iter->iov[i_seg].dst_addr), - iter->iov[i_seg].len, - 0, - 0); - i_seg++; - } - write &= mask; - cb_tracker[dev_id].size_track[write] = iter->nr_segs; - cb_tracker[dev_id].ioat_space -= iter->nr_segs; - write++; - } - } else { - /* Opaque data is not supported */ - return -1; - } - /* ring the doorbell */ - rte_ioat_perform_ops(dev_id); - cb_tracker[dev_id].next_write = write; - return i_iter; -} - -int32_t -ioat_check_completed_copies_cb(int vid, uint16_t queue_id, - struct rte_vhost_async_status *opaque_data, - uint16_t max_packets) -{ - if (!opaque_data) { - uintptr_t dump[255]; - int n_seg; - unsigned short read, write; - unsigned short nb_packet = 0; - unsigned short mask = MAX_ENQUEUED_SIZE - 1; - unsigned short i; - - uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2 - + VIRTIO_RXQ].dev_id; - n_seg = rte_ioat_completed_ops(dev_id, 255, NULL, NULL, dump, dump); - if (n_seg < 0) { - RTE_LOG(ERR, - VHOST_DATA, - "fail to poll completed buf on IOAT device %u", - dev_id); - return 0; - } - if (n_seg == 0) - return 0; - - cb_tracker[dev_id].ioat_space += n_seg; - n_seg += cb_tracker[dev_id].last_remain; - - read = cb_tracker[dev_id].next_read; - write = cb_tracker[dev_id].next_write; - for (i = 0; i < max_packets; i++) { - read &= mask; - if (read == write) - break; - if (n_seg >= cb_tracker[dev_id].size_track[read]) { - n_seg -= cb_tracker[dev_id].size_track[read]; - read++; - nb_packet++; - } else { - break; - } - } - 
cb_tracker[dev_id].next_read = read; - cb_tracker[dev_id].last_remain = n_seg; - return nb_packet; - } - /* Opaque data is not supported */ - return -1; -} - -#endif /* RTE_RAW_IOAT */ diff --git a/examples/vhost/ioat.h b/examples/vhost/ioat.h deleted file mode 100644 index d9bf717e8d..0000000000 --- a/examples/vhost/ioat.h +++ /dev/null @@ -1,63 +0,0 @@ -/* SPDX-License-Identifier: BSD-3-Clause - * Copyright(c) 2010-2020 Intel Corporation - */ - -#ifndef _IOAT_H_ -#define _IOAT_H_ - -#include <rte_vhost.h> -#include <rte_pci.h> -#include <rte_vhost_async.h> - -#define MAX_VHOST_DEVICE 1024 -#define IOAT_RING_SIZE 4096 -#define MAX_ENQUEUED_SIZE 4096 - -struct dma_info { - struct rte_pci_addr addr; - uint16_t dev_id; - bool is_valid; -}; - -struct dma_for_vhost { - struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; - uint16_t nr; -}; - -#ifdef RTE_RAW_IOAT -int open_ioat(const char *value); - -int32_t -ioat_transfer_data_cb(int vid, uint16_t queue_id, - struct rte_vhost_iov_iter *iov_iter, - struct rte_vhost_async_status *opaque_data, uint16_t count); - -int32_t -ioat_check_completed_copies_cb(int vid, uint16_t queue_id, - struct rte_vhost_async_status *opaque_data, - uint16_t max_packets); -#else -static int open_ioat(const char *value __rte_unused) -{ - return -1; -} - -static int32_t -ioat_transfer_data_cb(int vid __rte_unused, uint16_t queue_id __rte_unused, - struct rte_vhost_iov_iter *iov_iter __rte_unused, - struct rte_vhost_async_status *opaque_data __rte_unused, - uint16_t count __rte_unused) -{ - return -1; -} - -static int32_t -ioat_check_completed_copies_cb(int vid __rte_unused, - uint16_t queue_id __rte_unused, - struct rte_vhost_async_status *opaque_data __rte_unused, - uint16_t max_packets __rte_unused) -{ - return -1; -} -#endif -#endif /* _IOAT_H_ */ diff --git a/examples/vhost/main.c b/examples/vhost/main.c index 33d023aa39..44073499bc 100644 --- a/examples/vhost/main.c +++ b/examples/vhost/main.c @@ -24,8 +24,9 @@ #include <rte_ip.h> #include 
<rte_tcp.h> #include <rte_pause.h> +#include <rte_dmadev.h> +#include <rte_vhost_async.h> -#include "ioat.h" #include "main.h" #ifndef MAX_QUEUES @@ -56,6 +57,14 @@ #define RTE_TEST_TX_DESC_DEFAULT 512 #define INVALID_PORT_ID 0xFF +#define INVALID_DMA_ID -1 + +#define MAX_VHOST_DEVICE 1024 +#define DMA_RING_SIZE 4096 + +struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; +struct rte_vhost_async_dma_info dma_config[RTE_DMADEV_DEFAULT_MAX]; +static int dma_count; /* mask of enabled ports */ static uint32_t enabled_port_mask = 0; @@ -96,8 +105,6 @@ static int builtin_net_driver; static int async_vhost_driver; -static char *dma_type; - /* Specify timeout (in useconds) between retries on RX. */ static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US; /* Specify the number of retries on RX. */ @@ -196,13 +203,134 @@ struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * MAX_VHOST_DEVICE]; #define MBUF_TABLE_DRAIN_TSC ((rte_get_tsc_hz() + US_PER_S - 1) \ / US_PER_S * BURST_TX_DRAIN_US) +static inline bool +is_dma_configured(int16_t dev_id) +{ + int i; + + for (i = 0; i < dma_count; i++) { + if (dma_config[i].dev_id == dev_id) { + return true; + } + } + return false; +} + static inline int open_dma(const char *value) { - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) - return open_ioat(value); + struct dma_for_vhost *dma_info = dma_bind; + char *input = strndup(value, strlen(value) + 1); + char *addrs = input; + char *ptrs[2]; + char *start, *end, *substr; + int64_t vid, vring_id; + + struct rte_dma_info info; + struct rte_dma_conf dev_config = { .nb_vchans = 1 }; + struct rte_dma_vchan_conf qconf = { + .direction = RTE_DMA_DIR_MEM_TO_MEM, + .nb_desc = DMA_RING_SIZE + }; + + int dev_id; + int ret = 0; + uint16_t i = 0; + char *dma_arg[MAX_VHOST_DEVICE]; + int args_nr; + + while (isblank(*addrs)) + addrs++; + if (*addrs == '\0') { + ret = -1; + goto out; + } + + /* process DMA devices within bracket. 
*/ + addrs++; + substr = strtok(addrs, ";]"); + if (!substr) { + ret = -1; + goto out; + } + + args_nr = rte_strsplit(substr, strlen(substr), + dma_arg, MAX_VHOST_DEVICE, ','); + if (args_nr <= 0) { + ret = -1; + goto out; + } + + while (i < args_nr) { + char *arg_temp = dma_arg[i]; + uint8_t sub_nr; + + sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); + if (sub_nr != 2) { + ret = -1; + goto out; + } + + start = strstr(ptrs[0], "txd"); + if (start == NULL) { + ret = -1; + goto out; + } + + start += 3; + vid = strtol(start, &end, 0); + if (end == start) { + ret = -1; + goto out; + } + + vring_id = 0 + VIRTIO_RXQ; + + dev_id = rte_dma_get_dev_id_by_name(ptrs[1]); + if (dev_id < 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to find DMA %s.\n", ptrs[1]); + ret = -1; + goto out; + } else if (is_dma_configured(dev_id)) { + goto done; + } + + if (rte_dma_configure(dev_id, &dev_config) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to configure DMA %d.\n", dev_id); + ret = -1; + goto out; + } + + if (rte_dma_vchan_setup(dev_id, 0, &qconf) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to set up DMA %d.\n", dev_id); + ret = -1; + goto out; + } - return -1; + rte_dma_info_get(dev_id, &info); + if (info.nb_vchans != 1) { + RTE_LOG(ERR, VHOST_CONFIG, "DMA %d has no queues.\n", dev_id); + ret = -1; + goto out; + } + + if (rte_dma_start(dev_id) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to start DMA %u.\n", dev_id); + ret = -1; + goto out; + } + + dma_config[dma_count].dev_id = dev_id; + dma_config[dma_count].max_vchans = 1; + dma_config[dma_count++].max_desc = DMA_RING_SIZE; + +done: + (dma_info + vid)->dmas[vring_id].dev_id = dev_id; + i++; + } +out: + free(input); + return ret; } /* @@ -500,8 +628,6 @@ enum { OPT_CLIENT_NUM, #define OPT_BUILTIN_NET_DRIVER "builtin-net-driver" OPT_BUILTIN_NET_DRIVER_NUM, -#define OPT_DMA_TYPE "dma-type" - OPT_DMA_TYPE_NUM, #define OPT_DMAS "dmas" OPT_DMAS_NUM, }; @@ -539,8 +665,6 @@ us_vhost_parse_args(int argc, char **argv) NULL, 
OPT_CLIENT_NUM}, {OPT_BUILTIN_NET_DRIVER, no_argument, NULL, OPT_BUILTIN_NET_DRIVER_NUM}, - {OPT_DMA_TYPE, required_argument, - NULL, OPT_DMA_TYPE_NUM}, {OPT_DMAS, required_argument, NULL, OPT_DMAS_NUM}, {NULL, 0, 0, 0}, @@ -661,10 +785,6 @@ us_vhost_parse_args(int argc, char **argv) } break; - case OPT_DMA_TYPE_NUM: - dma_type = optarg; - break; - case OPT_DMAS_NUM: if (open_dma(optarg) == -1) { RTE_LOG(INFO, VHOST_CONFIG, @@ -841,9 +961,10 @@ complete_async_pkts(struct vhost_dev *vdev) { struct rte_mbuf *p_cpl[MAX_PKT_BURST]; uint16_t complete_count; + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; complete_count = rte_vhost_poll_enqueue_completed(vdev->vid, - VIRTIO_RXQ, p_cpl, MAX_PKT_BURST); + VIRTIO_RXQ, p_cpl, MAX_PKT_BURST, dma_id, 0); if (complete_count) { free_pkts(p_cpl, complete_count); __atomic_sub_fetch(&vdev->pkts_inflight, complete_count, __ATOMIC_SEQ_CST); @@ -883,11 +1004,12 @@ drain_vhost(struct vhost_dev *vdev) if (builtin_net_driver) { ret = vs_enqueue_pkts(vdev, VIRTIO_RXQ, m, nr_xmit); - } else if (async_vhost_driver) { + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { uint16_t enqueue_fail = 0; + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; complete_async_pkts(vdev); - ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit); + ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit, dma_id, 0); __atomic_add_fetch(&vdev->pkts_inflight, ret, __ATOMIC_SEQ_CST); enqueue_fail = nr_xmit - ret; @@ -905,7 +1027,7 @@ drain_vhost(struct vhost_dev *vdev) __ATOMIC_SEQ_CST); } - if (!async_vhost_driver) + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) free_pkts(m, nr_xmit); } @@ -1211,12 +1333,13 @@ drain_eth_rx(struct vhost_dev *vdev) if (builtin_net_driver) { enqueue_count = vs_enqueue_pkts(vdev, VIRTIO_RXQ, pkts, rx_count); - } else if (async_vhost_driver) { + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { uint16_t enqueue_fail = 0; + int16_t 
dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; complete_async_pkts(vdev); enqueue_count = rte_vhost_submit_enqueue_burst(vdev->vid, - VIRTIO_RXQ, pkts, rx_count); + VIRTIO_RXQ, pkts, rx_count, dma_id, 0); __atomic_add_fetch(&vdev->pkts_inflight, enqueue_count, __ATOMIC_SEQ_CST); enqueue_fail = rx_count - enqueue_count; @@ -1235,7 +1358,7 @@ drain_eth_rx(struct vhost_dev *vdev) __ATOMIC_SEQ_CST); } - if (!async_vhost_driver) + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) free_pkts(pkts, rx_count); } @@ -1387,18 +1510,20 @@ destroy_device(int vid) "(%d) device has been removed from data core\n", vdev->vid); - if (async_vhost_driver) { + if (dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled) { uint16_t n_pkt = 0; + int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; struct rte_mbuf *m_cpl[vdev->pkts_inflight]; while (vdev->pkts_inflight) { n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, VIRTIO_RXQ, - m_cpl, vdev->pkts_inflight); + m_cpl, vdev->pkts_inflight, dma_id, 0); free_pkts(m_cpl, n_pkt); __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, __ATOMIC_SEQ_CST); } rte_vhost_async_channel_unregister(vid, VIRTIO_RXQ); + dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = false; } rte_free(vdev); @@ -1468,20 +1593,14 @@ new_device(int vid) "(%d) device has been added to data core %d\n", vid, vdev->coreid); - if (async_vhost_driver) { - struct rte_vhost_async_config config = {0}; - struct rte_vhost_async_channel_ops channel_ops; - - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) { - channel_ops.transfer_data = ioat_transfer_data_cb; - channel_ops.check_completed_copies = - ioat_check_completed_copies_cb; - - config.features = RTE_VHOST_ASYNC_INORDER; + if (dma_bind[vid].dmas[VIRTIO_RXQ].dev_id != INVALID_DMA_ID) { + int ret; - return rte_vhost_async_channel_register(vid, VIRTIO_RXQ, - config, &channel_ops); + ret = rte_vhost_async_channel_register(vid, VIRTIO_RXQ); + if (ret == 0) { + dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = true; } + return 
ret; } return 0; @@ -1502,14 +1621,15 @@ vring_state_changed(int vid, uint16_t queue_id, int enable) if (queue_id != VIRTIO_RXQ) return 0; - if (async_vhost_driver) { + if (dma_bind[vid].dmas[queue_id].async_enabled) { if (!enable) { uint16_t n_pkt = 0; + int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; struct rte_mbuf *m_cpl[vdev->pkts_inflight]; while (vdev->pkts_inflight) { n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, queue_id, - m_cpl, vdev->pkts_inflight); + m_cpl, vdev->pkts_inflight, dma_id, 0); free_pkts(m_cpl, n_pkt); __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, __ATOMIC_SEQ_CST); } @@ -1657,6 +1777,25 @@ create_mbuf_pool(uint16_t nr_port, uint32_t nr_switch_core, uint32_t mbuf_size, rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n"); } +static void +init_dma(void) +{ + int i; + + for (i = 0; i < MAX_VHOST_DEVICE; i++) { + int j; + + for (j = 0; j < RTE_MAX_QUEUES_PER_PORT * 2; j++) { + dma_bind[i].dmas[j].dev_id = INVALID_DMA_ID; + dma_bind[i].dmas[j].async_enabled = false; + } + } + + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) { + dma_config[i].dev_id = INVALID_DMA_ID; + } +} + /* * Main function, does initialisation and calls the per-lcore functions. */ @@ -1679,6 +1818,9 @@ main(int argc, char *argv[]) argc -= ret; argv += ret; + /* initialize dma structures */ + init_dma(); + /* parse app arguments */ ret = us_vhost_parse_args(argc, argv); if (ret < 0) @@ -1754,6 +1896,20 @@ main(int argc, char *argv[]) if (client_mode) flags |= RTE_VHOST_USER_CLIENT; + if (async_vhost_driver) { + if (rte_vhost_async_dma_configure(dma_config, dma_count) < 0) { + RTE_LOG(ERR, VHOST_PORT, "Failed to configure DMA in vhost.\n"); + for (i = 0; i < dma_count; i++) { + if (dma_config[i].dev_id != INVALID_DMA_ID) { + rte_dma_stop(dma_config[i].dev_id); + dma_config[i].dev_id = INVALID_DMA_ID; + } + } + dma_count = 0; + async_vhost_driver = false; + } + } + /* Register vhost user driver to handle vhost messages. 
*/ for (i = 0; i < nb_sockets; i++) { char *file = socket_files + i * PATH_MAX; diff --git a/examples/vhost/main.h b/examples/vhost/main.h index e7b1ac60a6..b4a453e77e 100644 --- a/examples/vhost/main.h +++ b/examples/vhost/main.h @@ -8,6 +8,7 @@ #include <sys/queue.h> #include <rte_ether.h> +#include <rte_pci.h> /* Macros for printing using RTE_LOG */ #define RTE_LOGTYPE_VHOST_CONFIG RTE_LOGTYPE_USER1 @@ -79,6 +80,16 @@ struct lcore_info { struct vhost_dev_tailq_list vdev_list; }; +struct dma_info { + struct rte_pci_addr addr; + int16_t dev_id; + bool async_enabled; +}; + +struct dma_for_vhost { + struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; +}; + /* we implement non-extra virtio net features */ #define VIRTIO_NET_FEATURES 0 diff --git a/examples/vhost/meson.build b/examples/vhost/meson.build index 3efd5e6540..87a637f83f 100644 --- a/examples/vhost/meson.build +++ b/examples/vhost/meson.build @@ -12,13 +12,9 @@ if not is_linux endif deps += 'vhost' +deps += 'dmadev' allow_experimental_apis = true sources = files( 'main.c', 'virtio_net.c', ) - -if dpdk_conf.has('RTE_RAW_IOAT') - deps += 'raw_ioat' - sources += files('ioat.c') -endif diff --git a/lib/vhost/meson.build b/lib/vhost/meson.build index cdb37a4814..8107329400 100644 --- a/lib/vhost/meson.build +++ b/lib/vhost/meson.build @@ -33,7 +33,8 @@ headers = files( 'rte_vhost_async.h', 'rte_vhost_crypto.h', ) + driver_sdk_headers = files( 'vdpa_driver.h', ) -deps += ['ethdev', 'cryptodev', 'hash', 'pci'] +deps += ['ethdev', 'cryptodev', 'hash', 'pci', 'dmadev'] diff --git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h index a87ea6ba37..23a7a2d8b3 100644 --- a/lib/vhost/rte_vhost_async.h +++ b/lib/vhost/rte_vhost_async.h @@ -27,70 +27,12 @@ struct rte_vhost_iov_iter { }; /** - * dma transfer status + * DMA device information */ -struct rte_vhost_async_status { - /** An array of application specific data for source memory */ - uintptr_t *src_opaque_data; - /** An array of application specific data 
for destination memory */ - uintptr_t *dst_opaque_data; -}; - -/** - * dma operation callbacks to be implemented by applications - */ -struct rte_vhost_async_channel_ops { - /** - * instruct async engines to perform copies for a batch of packets - * - * @param vid - * id of vhost device to perform data copies - * @param queue_id - * queue id to perform data copies - * @param iov_iter - * an array of IOV iterators - * @param opaque_data - * opaque data pair sending to DMA engine - * @param count - * number of elements in the "descs" array - * @return - * number of IOV iterators processed, negative value means error - */ - int32_t (*transfer_data)(int vid, uint16_t queue_id, - struct rte_vhost_iov_iter *iov_iter, - struct rte_vhost_async_status *opaque_data, - uint16_t count); - /** - * check copy-completed packets from the async engine - * @param vid - * id of vhost device to check copy completion - * @param queue_id - * queue id to check copy completion - * @param opaque_data - * buffer to receive the opaque data pair from DMA engine - * @param max_packets - * max number of packets could be completed - * @return - * number of async descs completed, negative value means error - */ - int32_t (*check_completed_copies)(int vid, uint16_t queue_id, - struct rte_vhost_async_status *opaque_data, - uint16_t max_packets); -}; - -/** - * async channel features - */ -enum { - RTE_VHOST_ASYNC_INORDER = 1U << 0, -}; - -/** - * async channel configuration - */ -struct rte_vhost_async_config { - uint32_t features; - uint32_t rsvd[2]; +struct rte_vhost_async_dma_info { + int16_t dev_id; /* DMA device ID */ + uint16_t max_vchans; /* max number of vchan */ + uint16_t max_desc; /* max desc number of vchan */ }; /** @@ -100,17 +42,11 @@ struct rte_vhost_async_config { * vhost device id async channel to be attached to * @param queue_id * vhost queue id async channel to be attached to - * @param config - * Async channel configuration structure - * @param ops - * Async channel operation 
callbacks * @return * 0 on success, -1 on failures */ __rte_experimental -int rte_vhost_async_channel_register(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops); +int rte_vhost_async_channel_register(int vid, uint16_t queue_id); /** * Unregister an async channel for a vhost queue @@ -136,17 +72,11 @@ int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id); * vhost device id async channel to be attached to * @param queue_id * vhost queue id async channel to be attached to - * @param config - * Async channel configuration - * @param ops - * Async channel operation callbacks * @return * 0 on success, -1 on failures */ __rte_experimental -int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops); +int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id); /** * Unregister an async channel for a vhost queue without performing any @@ -179,12 +109,17 @@ int rte_vhost_async_channel_unregister_thread_unsafe(int vid, * array of packets to be enqueued * @param count * packets num to be enqueued + * @param dma_id + * the identifier of the DMA device + * @param vchan + * the identifier of virtual DMA channel * @return * num of packets enqueued */ __rte_experimental uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count); + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan); /** * This function checks async completion status for a specific vhost @@ -199,12 +134,17 @@ uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, * blank array to get return packet pointer * @param count * size of the packet array + * @param dma_id + * the identifier of the DMA device + * @param vchan + * the identifier of virtual DMA channel * @return * num of packets returned */ __rte_experimental uint16_t 
rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count); + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan); /** * This function returns the amount of in-flight packets for the vhost * @@ -235,11 +175,32 @@ int rte_vhost_async_get_inflight(int vid, uint16_t queue_id); * Blank array to get return packet pointer * @param count * Size of the packet array + * @param dma_id + * the identifier of the DMA device + * @param vchan + * the identifier of virtual DMA channel * @return * Number of packets returned */ __rte_experimental uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count); + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan); +/** + * The DMA vChannels used in the asynchronous data path must be configured + * first, so this function needs to be called before enabling DMA + * acceleration for any vring. If this function fails, the asynchronous + * data path cannot be enabled for any vring afterwards.
+ * + * @param dmas + * DMA information + * @param count + * Element number of 'dmas' + * @return + * 0 on success, and -1 on failure + */ +__rte_experimental +int rte_vhost_async_dma_configure(struct rte_vhost_async_dma_info *dmas, + uint16_t count); #endif /* _RTE_VHOST_ASYNC_H_ */ diff --git a/lib/vhost/version.map b/lib/vhost/version.map index a7ef7f1976..1202ba9c1a 100644 --- a/lib/vhost/version.map +++ b/lib/vhost/version.map @@ -84,6 +84,9 @@ EXPERIMENTAL { # added in 21.11 rte_vhost_get_monitor_addr; + + # added in 22.03 + rte_vhost_async_dma_configure; }; INTERNAL { diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c index 13a9bb9dd1..32f37f4851 100644 --- a/lib/vhost/vhost.c +++ b/lib/vhost/vhost.c @@ -344,6 +344,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq) return; rte_free(vq->async->pkts_info); + rte_free(vq->async->pkts_cmpl_flag); rte_free(vq->async->buffers_packed); vq->async->buffers_packed = NULL; @@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid, } static __rte_always_inline int -async_channel_register(int vid, uint16_t queue_id, - struct rte_vhost_async_channel_ops *ops) +async_channel_register(int vid, uint16_t queue_id) { struct virtio_net *dev = get_device(vid); struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; @@ -1656,6 +1656,14 @@ async_channel_register(int vid, uint16_t queue_id, goto out_free_async; } + async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size * sizeof(bool), + RTE_CACHE_LINE_SIZE, node); + if (!async->pkts_cmpl_flag) { + VHOST_LOG_CONFIG(ERR, "failed to allocate async pkts_cmpl_flag (vid %d, qid: %d)\n", + vid, queue_id); + goto out_free_async; + } + if (vq_is_packed(dev)) { async->buffers_packed = rte_malloc_socket(NULL, vq->size * sizeof(struct vring_used_elem_packed), @@ -1676,9 +1684,6 @@ async_channel_register(int vid, uint16_t queue_id, } } - async->ops.check_completed_copies = ops->check_completed_copies; - async->ops.transfer_data = ops->transfer_data; - vq->async = async; return 0; @@ 
-1691,15 +1696,13 @@ async_channel_register(int vid, uint16_t queue_id, } int -rte_vhost_async_channel_register(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops) +rte_vhost_async_channel_register(int vid, uint16_t queue_id) { struct vhost_virtqueue *vq; struct virtio_net *dev = get_device(vid); int ret; - if (dev == NULL || ops == NULL) + if (dev == NULL) return -1; if (queue_id >= VHOST_MAX_VRING) @@ -1710,33 +1713,20 @@ rte_vhost_async_channel_register(int vid, uint16_t queue_id, if (unlikely(vq == NULL || !dev->async_copy)) return -1; - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { - VHOST_LOG_CONFIG(ERR, - "async copy is not supported on non-inorder mode " - "(vid %d, qid: %d)\n", vid, queue_id); - return -1; - } - - if (unlikely(ops->check_completed_copies == NULL || - ops->transfer_data == NULL)) - return -1; - rte_spinlock_lock(&vq->access_lock); - ret = async_channel_register(vid, queue_id, ops); + ret = async_channel_register(vid, queue_id); rte_spinlock_unlock(&vq->access_lock); return ret; } int -rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops) +rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id) { struct vhost_virtqueue *vq; struct virtio_net *dev = get_device(vid); - if (dev == NULL || ops == NULL) + if (dev == NULL) return -1; if (queue_id >= VHOST_MAX_VRING) @@ -1747,18 +1737,7 @@ rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, if (unlikely(vq == NULL || !dev->async_copy)) return -1; - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { - VHOST_LOG_CONFIG(ERR, - "async copy is not supported on non-inorder mode " - "(vid %d, qid: %d)\n", vid, queue_id); - return -1; - } - - if (unlikely(ops->check_completed_copies == NULL || - ops->transfer_data == NULL)) - return -1; - - return async_channel_register(vid, 
queue_id, ops); + return async_channel_register(vid, queue_id); } int @@ -1835,6 +1814,83 @@ rte_vhost_async_channel_unregister_thread_unsafe(int vid, uint16_t queue_id) return 0; } +static __rte_always_inline void +vhost_free_async_dma_mem(void) +{ + uint16_t i; + + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) { + struct async_dma_info *dma = &dma_copy_track[i]; + int16_t j; + + if (dma->max_vchans == 0) { + continue; + } + + for (j = 0; j < dma->max_vchans; j++) { + rte_free(dma->vchans[j].metadata); + } + rte_free(dma->vchans); + dma->vchans = NULL; + dma->max_vchans = 0; + } +} + +int +rte_vhost_async_dma_configure(struct rte_vhost_async_dma_info *dmas, uint16_t count) +{ + uint16_t i; + + if (!dmas) { + VHOST_LOG_CONFIG(ERR, "Invalid DMA configuration parameter.\n"); + return -1; + } + + for (i = 0; i < count; i++) { + struct async_dma_vchan_info *vchans; + int16_t dev_id; + uint16_t max_vchans; + uint16_t max_desc; + uint16_t j; + + dev_id = dmas[i].dev_id; + max_vchans = dmas[i].max_vchans; + max_desc = dmas[i].max_desc; + + if (!rte_is_power_of_2(max_desc)) { + max_desc = rte_align32pow2(max_desc); + } + + vchans = rte_zmalloc(NULL, sizeof(struct async_dma_vchan_info) * max_vchans, + RTE_CACHE_LINE_SIZE); + if (vchans == NULL) { + VHOST_LOG_CONFIG(ERR, "Failed to allocate vchans for dma-%d." 
+ " Cannot enable async data-path.\n", dev_id); + vhost_free_async_dma_mem(); + return -1; + } + + for (j = 0; j < max_vchans; j++) { + vchans[j].metadata = rte_zmalloc(NULL, sizeof(bool *) * max_desc, + RTE_CACHE_LINE_SIZE); + if (!vchans[j].metadata) { + VHOST_LOG_CONFIG(ERR, "Failed to allocate metadata for " + "dma-%d vchan-%u\n", dev_id, j); + vhost_free_async_dma_mem(); + return -1; + } + + vchans[j].ring_size = max_desc; + vchans[j].ring_mask = max_desc - 1; + } + + dma_copy_track[dev_id].vchans = vchans; + dma_copy_track[dev_id].max_vchans = max_vchans; + } + + return 0; +} + int rte_vhost_async_get_inflight(int vid, uint16_t queue_id) { diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h index 7085e0885c..d9bda34e11 100644 --- a/lib/vhost/vhost.h +++ b/lib/vhost/vhost.h @@ -19,6 +19,7 @@ #include <rte_ether.h> #include <rte_rwlock.h> #include <rte_malloc.h> +#include <rte_dmadev.h> #include "rte_vhost.h" #include "rte_vdpa.h" @@ -50,6 +51,7 @@ #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) #define VHOST_MAX_ASYNC_VEC 2048 +#define VHOST_ASYNC_DMA_BATCHING_SIZE 32 #define PACKED_DESC_ENQUEUE_USED_FLAG(w) \ ((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \ @@ -119,6 +121,41 @@ struct vring_used_elem_packed { uint32_t count; }; +struct async_dma_vchan_info { + /* circular array to track copy metadata */ + bool **metadata; + + /* max elements in 'metadata' */ + uint16_t ring_size; + /* ring index mask for 'metadata' */ + uint16_t ring_mask; + + /* batching copies before a DMA doorbell */ + uint16_t nr_batching; + + /** + * DMA virtual channel lock. Although it is able to bind DMA + * virtual channels to data plane threads, vhost control plane + * thread could call data plane functions too, thus causing + * DMA device contention. 
+ * For example, in a VM exit case, the vhost control plane thread needs + to clear in-flight packets before disabling the vring, but another + data plane thread could be enqueuing packets to the same + vring with the same DMA virtual channel. But dmadev PMD functions + are lock-free, so the control plane and data plane threads + could operate on the same DMA virtual channel at the same time. + */ + rte_spinlock_t dma_lock; +}; + +struct async_dma_info { + uint16_t max_vchans; + struct async_dma_vchan_info *vchans; +}; + +extern struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; + /** * inflight async packet information */ @@ -129,9 +166,6 @@ struct async_inflight_info { }; struct vhost_async { - /* operation callbacks for DMA */ - struct rte_vhost_async_channel_ops ops; - struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; uint16_t iter_idx; @@ -139,6 +173,19 @@ struct vhost_async { /* data transfer status */ struct async_inflight_info *pkts_info; + /** + * packet reorder array. "true" indicates that the DMA + * device has completed all copies for the packet. + * + * Note that this array could be written by multiple + * threads at the same time. For example, two threads + * enqueue packets to the same virtqueue with their + * own DMA devices. However, since offloading is done on + * a per-packet basis, each packet flag will only be + * written by one thread. And a single-byte write is + * atomic, so no lock is needed. + */ + bool *pkts_cmpl_flag; uint16_t pkts_idx; uint16_t pkts_inflight_n; union { diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c index b3d954aab4..9f81fc9733 100644 --- a/lib/vhost/virtio_net.c +++ b/lib/vhost/virtio_net.c @@ -11,6 +11,7 @@ #include <rte_net.h> #include <rte_ether.h> #include <rte_ip.h> +#include <rte_dmadev.h> #include <rte_vhost.h> #include <rte_tcp.h> #include <rte_udp.h> @@ -25,6 +26,9 @@ #define MAX_BATCH_LEN 256 +/* DMA device copy operation tracking array.
*/ +struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; + static __rte_always_inline bool rxvq_is_mergeable(struct virtio_net *dev) { @@ -43,6 +47,108 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring) return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring; } +static __rte_always_inline uint16_t +vhost_async_dma_transfer(struct vhost_virtqueue *vq, int16_t dma_id, + uint16_t vchan, uint16_t head_idx, + struct rte_vhost_iov_iter *pkts, uint16_t nr_pkts) +{ + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan]; + uint16_t ring_mask = dma_info->ring_mask; + uint16_t pkt_idx; + + rte_spinlock_lock(&dma_info->dma_lock); + + for (pkt_idx = 0; pkt_idx < nr_pkts; pkt_idx++) { + struct rte_vhost_iovec *iov = pkts[pkt_idx].iov; + int copy_idx = 0; + uint16_t nr_segs = pkts[pkt_idx].nr_segs; + uint16_t i; + + if (rte_dma_burst_capacity(dma_id, vchan) < nr_segs) { + goto out; + } + + for (i = 0; i < nr_segs; i++) { + /** + * We have checked the available space before submit copies to DMA + * vChannel, so we don't handle error here. + */ + copy_idx = rte_dma_copy(dma_id, vchan, (rte_iova_t)iov[i].src_addr, + (rte_iova_t)iov[i].dst_addr, iov[i].len, + RTE_DMA_OP_FLAG_LLC); + + /** + * Only store packet completion flag address in the last copy's + * slot, and other slots are set to NULL. 
+ */ + if (unlikely(i == (nr_segs - 1))) { + dma_info->metadata[copy_idx & ring_mask] = + &vq->async->pkts_cmpl_flag[head_idx % vq->size]; + } + } + + dma_info->nr_batching += nr_segs; + if (unlikely(dma_info->nr_batching >= VHOST_ASYNC_DMA_BATCHING_SIZE)) { + rte_dma_submit(dma_id, vchan); + dma_info->nr_batching = 0; + } + + head_idx++; + } + +out: + if (dma_info->nr_batching > 0) { + rte_dma_submit(dma_id, vchan); + dma_info->nr_batching = 0; + } + rte_spinlock_unlock(&dma_info->dma_lock); + + return pkt_idx; +} + +static __rte_always_inline uint16_t +vhost_async_dma_check_completed(int16_t dma_id, uint16_t vchan, uint16_t max_pkts) +{ + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan]; + uint16_t ring_mask = dma_info->ring_mask; + uint16_t last_idx = 0; + uint16_t nr_copies; + uint16_t copy_idx; + uint16_t i; + + rte_spinlock_lock(&dma_info->dma_lock); + + /** + * Since all memory is pinned and addresses should be valid, + * we don't check errors. + */ + nr_copies = rte_dma_completed(dma_id, vchan, max_pkts, &last_idx, NULL); + if (nr_copies == 0) { + goto out; + } + + copy_idx = last_idx - nr_copies + 1; + for (i = 0; i < nr_copies; i++) { + bool *flag; + + flag = dma_info->metadata[copy_idx & ring_mask]; + if (flag) { + /** + * Mark the packet flag as received. The flag + * could belong to another virtqueue but write + * is atomic. 
+ */ + *flag = true; + dma_info->metadata[copy_idx & ring_mask] = NULL; + } + copy_idx++; + } + +out: + rte_spinlock_unlock(&dma_info->dma_lock); + return nr_copies; +} + static inline void do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq) { @@ -1449,9 +1555,9 @@ store_dma_desc_info_packed(struct vring_used_elem_packed *s_ring, } static __rte_noinline uint32_t -virtio_dev_rx_async_submit_split(struct virtio_net *dev, - struct vhost_virtqueue *vq, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) +virtio_dev_rx_async_submit_split(struct virtio_net *dev, struct vhost_virtqueue *vq, + uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count, + int16_t dma_id, uint16_t vchan) { struct buf_vector buf_vec[BUF_VECTOR_MAX]; uint32_t pkt_idx = 0; @@ -1503,17 +1609,16 @@ virtio_dev_rx_async_submit_split(struct virtio_net *dev, if (unlikely(pkt_idx == 0)) return 0; - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx); - if (unlikely(n_xfer < 0)) { - VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer data for queue id %d.\n", - dev->vid, __func__, queue_id); - n_xfer = 0; - } + n_xfer = vhost_async_dma_transfer(vq, dma_id, vchan, async->pkts_idx, async->iov_iter, + pkt_idx); pkt_err = pkt_idx - n_xfer; if (unlikely(pkt_err)) { uint16_t num_descs = 0; + VHOST_LOG_DATA(DEBUG, "(%d) %s: failed to transfer %u packets for queue %u.\n", + dev->vid, __func__, pkt_err, queue_id); + /* update number of completed packets */ pkt_idx = n_xfer; @@ -1656,13 +1761,13 @@ dma_error_handler_packed(struct vhost_virtqueue *vq, uint16_t slot_idx, } static __rte_noinline uint32_t -virtio_dev_rx_async_submit_packed(struct virtio_net *dev, - struct vhost_virtqueue *vq, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) +virtio_dev_rx_async_submit_packed(struct virtio_net *dev, struct vhost_virtqueue *vq, + uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count, + int16_t dma_id, uint16_t vchan) { uint32_t pkt_idx = 0; 
uint32_t remained = count; - int32_t n_xfer; + uint16_t n_xfer; uint16_t num_buffers; uint16_t num_descs; @@ -1670,6 +1775,7 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, struct async_inflight_info *pkts_info = async->pkts_info; uint32_t pkt_err = 0; uint16_t slot_idx = 0; + uint16_t head_idx = async->pkts_idx % vq->size; do { rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]); @@ -1694,19 +1800,17 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, if (unlikely(pkt_idx == 0)) return 0; - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx); - if (unlikely(n_xfer < 0)) { - VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer data for queue id %d.\n", - dev->vid, __func__, queue_id); - n_xfer = 0; - } - - pkt_err = pkt_idx - n_xfer; + n_xfer = vhost_async_dma_transfer(vq, dma_id, vchan, head_idx, + async->iov_iter, pkt_idx); async_iter_reset(async); - if (unlikely(pkt_err)) + pkt_err = pkt_idx - n_xfer; + if (unlikely(pkt_err)) { + VHOST_LOG_DATA(DEBUG, "(%d) %s: failed to transfer %u packets for queue %u.\n", + dev->vid, __func__, pkt_err, queue_id); dma_error_handler_packed(vq, slot_idx, pkt_err, &pkt_idx); + } if (likely(vq->shadow_used_idx)) { /* keep used descriptors. 
*/ @@ -1826,28 +1930,37 @@ write_back_completed_descs_packed(struct vhost_virtqueue *vq, static __rte_always_inline uint16_t vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan) { struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; struct vhost_async *async = vq->async; struct async_inflight_info *pkts_info = async->pkts_info; - int32_t n_cpl; + uint16_t nr_cpl_pkts = 0; uint16_t n_descs = 0, n_buffers = 0; uint16_t start_idx, from, i; - n_cpl = async->ops.check_completed_copies(dev->vid, queue_id, 0, count); - if (unlikely(n_cpl < 0)) { - VHOST_LOG_DATA(ERR, "(%d) %s: failed to check completed copies for queue id %d.\n", - dev->vid, __func__, queue_id); - return 0; - } - - if (n_cpl == 0) - return 0; + /* Check completed copies for the given DMA vChannel */ + vhost_async_dma_check_completed(dma_id, vchan, count); start_idx = async_get_first_inflight_pkt_idx(vq); - for (i = 0; i < n_cpl; i++) { + /** + * Calculate the number of copy completed packets. + * Note that there may be completed packets even if + * no copies are reported done by the given DMA vChannel, + * as DMA vChannels could be shared by other threads. 
+ */ + from = start_idx; + while (vq->async->pkts_cmpl_flag[from] && count--) { + vq->async->pkts_cmpl_flag[from] = false; + from++; + if (from >= vq->size) + from -= vq->size; + nr_cpl_pkts++; + } + + for (i = 0; i < nr_cpl_pkts; i++) { from = (start_idx + i) % vq->size; /* Only used with packed ring */ n_buffers += pkts_info[from].nr_buffers; @@ -1856,7 +1969,7 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, pkts[i] = pkts_info[from].mbuf; } - async->pkts_inflight_n -= n_cpl; + async->pkts_inflight_n -= nr_cpl_pkts; if (likely(vq->enabled && vq->access_ok)) { if (vq_is_packed(dev)) { @@ -1877,12 +1990,13 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, } } - return n_cpl; + return nr_cpl_pkts; } uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan) { struct virtio_net *dev = get_device(vid); struct vhost_virtqueue *vq; @@ -1908,7 +2022,7 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, rte_spinlock_lock(&vq->access_lock); - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, vchan); rte_spinlock_unlock(&vq->access_lock); @@ -1917,7 +2031,8 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan) { struct virtio_net *dev = get_device(vid); struct vhost_virtqueue *vq; @@ -1941,14 +2056,14 @@ rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, return 0; } - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, vchan); return n_pkts_cpl; } static __rte_always_inline 
uint32_t virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) + struct rte_mbuf **pkts, uint32_t count, int16_t dma_id, uint16_t vchan) { struct vhost_virtqueue *vq; uint32_t nb_tx = 0; @@ -1980,10 +2095,10 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, if (vq_is_packed(dev)) nb_tx = virtio_dev_rx_async_submit_packed(dev, vq, queue_id, - pkts, count); + pkts, count, dma_id, vchan); else nb_tx = virtio_dev_rx_async_submit_split(dev, vq, queue_id, - pkts, count); + pkts, count, dma_id, vchan); out: if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) @@ -1997,7 +2112,8 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan) { struct virtio_net *dev = get_device(vid); @@ -2011,7 +2127,7 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, return 0; } - return virtio_dev_rx_async_submit(dev, queue_id, pkts, count); + return virtio_dev_rx_async_submit(dev, queue_id, pkts, count, dma_id, vchan); } static inline bool -- 2.25.1 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath 2021-12-30 21:55 ` [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu @ 2021-12-31 0:55 ` Liang Ma 2022-01-14 6:30 ` Xia, Chenbo 2022-01-20 17:00 ` Maxime Coquelin 2 siblings, 0 replies; 31+ messages in thread From: Liang Ma @ 2021-12-31 0:55 UTC (permalink / raw) To: Jiayu Hu Cc: dev, maxime.coquelin, i.maximets, chenbo.xia, bruce.richardson, harry.van.haaren, sunil.pai.g, john.mcnamara, xuan.ding, cheng1.jiang On Thu, Dec 30, 2021 at 04:55:05PM -0500, Jiayu Hu wrote: > Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA > abstraction layer and simplify application logics, this patch integrates > dmadev in asynchronous data path. > > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> > Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> > --- > doc/guides/prog_guide/vhost_lib.rst | 70 ++++----- > examples/vhost/Makefile | 2 +- > examples/vhost/ioat.c | 218 -------------------------- > examples/vhost/ioat.h | 63 -------- > examples/vhost/main.c | 230 +++++++++++++++++++++++----- > examples/vhost/main.h | 11 ++ > examples/vhost/meson.build | 6 +- > lib/vhost/meson.build | 3 +- > lib/vhost/rte_vhost_async.h | 121 +++++---------- > lib/vhost/version.map | 3 + > lib/vhost/vhost.c | 130 +++++++++++----- > lib/vhost/vhost.h | 53 ++++++- > lib/vhost/virtio_net.c | 206 +++++++++++++++++++------ > 13 files changed, 587 insertions(+), 529 deletions(-) > delete mode 100644 examples/vhost/ioat.c > delete mode 100644 examples/vhost/ioat.h > > diff --git a/examples/vhost/main.c b/examples/vhost/main.c > index 33d023aa39..44073499bc 100644 > --- a/examples/vhost/main.c > +++ b/examples/vhost/main.c > @@ -24,8 +24,9 @@ > #include <rte_ip.h> > #include <rte_tcp.h> > #include <rte_pause.h> > +#include <rte_dmadev.h> > +#include <rte_vhost_async.h> > > -#include "ioat.h" > #include "main.h" > > #ifndef MAX_QUEUES > @@ -56,6 +57,14 @@ > #define RTE_TEST_TX_DESC_DEFAULT 512 > > 
#define INVALID_PORT_ID 0xFF > +#define INVALID_DMA_ID -1 > + > +#define MAX_VHOST_DEVICE 1024 > +#define DMA_RING_SIZE 4096 > + > +struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; > +struct rte_vhost_async_dma_info dma_config[RTE_DMADEV_DEFAULT_MAX]; > +static int dma_count; > > /* mask of enabled ports */ > static uint32_t enabled_port_mask = 0; > @@ -96,8 +105,6 @@ static int builtin_net_driver; > > static int async_vhost_driver; > > -static char *dma_type; > - > /* Specify timeout (in useconds) between retries on RX. */ > static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US; > /* Specify the number of retries on RX. */ > @@ -196,13 +203,134 @@ struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * MAX_VHOST_DEVICE]; > #define MBUF_TABLE_DRAIN_TSC ((rte_get_tsc_hz() + US_PER_S - 1) \ > / US_PER_S * BURST_TX_DRAIN_US) > > +static inline bool > +is_dma_configured(int16_t dev_id) > +{ > + int i; > + > + for (i = 0; i < dma_count; i++) { > + if (dma_config[i].dev_id == dev_id) { > + return true; > + } > + } > + return false; > +} > + > static inline int > open_dma(const char *value) > { > - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) > - return open_ioat(value); > + struct dma_for_vhost *dma_info = dma_bind; > + char *input = strndup(value, strlen(value) + 1); > + char *addrs = input; > + char *ptrs[2]; > + char *start, *end, *substr; > + int64_t vid, vring_id; > + > + struct rte_dma_info info; > + struct rte_dma_conf dev_config = { .nb_vchans = 1 }; > + struct rte_dma_vchan_conf qconf = { > + .direction = RTE_DMA_DIR_MEM_TO_MEM, > + .nb_desc = DMA_RING_SIZE > + }; > + > + int dev_id; > + int ret = 0; > + uint16_t i = 0; > + char *dma_arg[MAX_VHOST_DEVICE]; > + int args_nr; > + > + while (isblank(*addrs)) > + addrs++; > + if (*addrs == '\0') { > + ret = -1; > + goto out; > + } > + > + /* process DMA devices within bracket. 
*/ > + addrs++; > + substr = strtok(addrs, ";]"); > + if (!substr) { > + ret = -1; > + goto out; > + } > + > + args_nr = rte_strsplit(substr, strlen(substr), > + dma_arg, MAX_VHOST_DEVICE, ','); > + if (args_nr <= 0) { > + ret = -1; > + goto out; > + } > + > + while (i < args_nr) { > + char *arg_temp = dma_arg[i]; > + uint8_t sub_nr; > + > + sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); > + if (sub_nr != 2) { > + ret = -1; > + goto out; > + } > + > + start = strstr(ptrs[0], "txd"); Hi Jiayu, it looks like the parameter checking ignores the "rxd" case? I think if the patch enables enqueue/dequeue at the same time, "rxd" is needed for the DMAs parameters. Regards Liang ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath 2021-12-30 21:55 ` [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu 2021-12-31 0:55 ` Liang Ma @ 2022-01-14 6:30 ` Xia, Chenbo 2022-01-17 5:39 ` Hu, Jiayu 2022-01-20 17:00 ` Maxime Coquelin 2 siblings, 1 reply; 31+ messages in thread From: Xia, Chenbo @ 2022-01-14 6:30 UTC (permalink / raw) To: Hu, Jiayu, dev Cc: maxime.coquelin, i.maximets, Richardson, Bruce, Van Haaren, Harry, Pai G, Sunil, Mcnamara, John, Ding, Xuan, Jiang, Cheng1, liangma Hi Jiayu, This is first round of review, I'll spend time on OVS patches later and look back. > -----Original Message----- > From: Hu, Jiayu <jiayu.hu@intel.com> > Sent: Friday, December 31, 2021 5:55 AM > To: dev@dpdk.org > Cc: maxime.coquelin@redhat.com; i.maximets@ovn.org; Xia, Chenbo > <chenbo.xia@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Van > Haaren, Harry <harry.van.haaren@intel.com>; Pai G, Sunil > <sunil.pai.g@intel.com>; Mcnamara, John <john.mcnamara@intel.com>; Ding, Xuan > <xuan.ding@intel.com>; Jiang, Cheng1 <cheng1.jiang@intel.com>; > liangma@liangbit.com; Hu, Jiayu <jiayu.hu@intel.com> > Subject: [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath > > Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA > abstraction layer and simplify application logics, this patch integrates > dmadev in asynchronous data path. 
> > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> > Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> > --- > doc/guides/prog_guide/vhost_lib.rst | 70 ++++----- > examples/vhost/Makefile | 2 +- > examples/vhost/ioat.c | 218 -------------------------- > examples/vhost/ioat.h | 63 -------- > examples/vhost/main.c | 230 +++++++++++++++++++++++----- > examples/vhost/main.h | 11 ++ > examples/vhost/meson.build | 6 +- > lib/vhost/meson.build | 3 +- > lib/vhost/rte_vhost_async.h | 121 +++++---------- > lib/vhost/version.map | 3 + > lib/vhost/vhost.c | 130 +++++++++++----- > lib/vhost/vhost.h | 53 ++++++- > lib/vhost/virtio_net.c | 206 +++++++++++++++++++------ > 13 files changed, 587 insertions(+), 529 deletions(-) > delete mode 100644 examples/vhost/ioat.c > delete mode 100644 examples/vhost/ioat.h > > diff --git a/doc/guides/prog_guide/vhost_lib.rst > b/doc/guides/prog_guide/vhost_lib.rst > index 76f5d303c9..bdce7cbf02 100644 > --- a/doc/guides/prog_guide/vhost_lib.rst > +++ b/doc/guides/prog_guide/vhost_lib.rst > @@ -218,38 +218,12 @@ The following is an overview of some key Vhost API > functions: > > Enable or disable zero copy feature of the vhost crypto backend. > > -* ``rte_vhost_async_channel_register(vid, queue_id, config, ops)`` > +* ``rte_vhost_async_channel_register(vid, queue_id)`` > > Register an async copy device channel for a vhost queue after vring Since dmadev is here, let's just use 'DMA device' instead of 'copy device' > - is enabled. Following device ``config`` must be specified together > - with the registration: > + is enabled. > > - * ``features`` > - > - This field is used to specify async copy device features. > - > - ``RTE_VHOST_ASYNC_INORDER`` represents the async copy device can > - guarantee the order of copy completion is the same as the order > - of copy submission. > - > - Currently, only ``RTE_VHOST_ASYNC_INORDER`` capable device is > - supported by vhost. 
> - > - Applications must provide following ``ops`` callbacks for vhost lib to > - work with the async copy devices: > - > - * ``transfer_data(vid, queue_id, descs, opaque_data, count)`` > - > - vhost invokes this function to submit copy data to the async devices. > - For non-async_inorder capable devices, ``opaque_data`` could be used > - for identifying the completed packets. > - > - * ``check_completed_copies(vid, queue_id, opaque_data, max_packets)`` > - > - vhost invokes this function to get the copy data completed by async > - devices. > - > -* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id, config, > ops)`` > +* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id)`` > > Register an async copy device channel for a vhost queue without > performing any locking. > @@ -277,18 +251,13 @@ The following is an overview of some key Vhost API > functions: > This function is only safe to call in vhost callback functions > (i.e., struct rte_vhost_device_ops). > > -* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, comp_pkts, > comp_count)`` > +* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id, > dma_vchan)`` > > Submit an enqueue request to transmit ``count`` packets from host to guest > - by async data path. Successfully enqueued packets can be transfer completed > - or being occupied by DMA engines; transfer completed packets are returned > in > - ``comp_pkts``, but others are not guaranteed to finish, when this API > - call returns. > + by async data path. Applications must not free the packets submitted for > + enqueue until the packets are completed. > > - Applications must not free the packets submitted for enqueue until the > - packets are completed. > - > -* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count)`` > +* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id, > dma_vchan)`` > > Poll enqueue completion status from async data path. 
Completed packets > are returned to applications through ``pkts``. > @@ -298,7 +267,7 @@ The following is an overview of some key Vhost API > functions: > This function returns the amount of in-flight packets for the vhost > queue using async acceleration. > > -* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count)`` > +* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count, dma_id, > dma_vchan)`` > > Clear inflight packets which are submitted to DMA engine in vhost async > data > path. Completed packets are returned to applications through ``pkts``. > @@ -442,3 +411,26 @@ Finally, a set of device ops is defined for device > specific operations: > * ``get_notify_area`` > > Called to get the notify area info of the queue. > + > +Vhost asynchronous data path > +---------------------------- > + > +Vhost asynchronous data path leverages DMA devices to offload memory > +copies from the CPU and it is implemented in an asynchronous way. It > +enables applcations, like OVS, to save CPU cycles and hide memory copy > +overhead, thus achieving higher throughput. > + > +Vhost doesn't manage DMA devices and applications, like OVS, need to > +manage and configure DMA devices. Applications need to tell vhost what > +DMA devices to use in every data path function call. This design enables > +the flexibility for applications to dynamically use DMA channels in > +different function modules, not limited in vhost. > + > +In addition, vhost supports M:N mapping between vrings and DMA virtual > +channels. Specifically, one vring can use multiple different DMA channels > +and one DMA channel can be shared by multiple vrings at the same time. > +The reason of enabling one vring to use multiple DMA channels is that > +it's possible that more than one dataplane threads enqueue packets to > +the same vring with their own DMA virtual channels. Besides, the number > +of DMA devices is limited. 
For the purpose of scaling, it's necessary to > +support sharing DMA channels among vrings. > diff --git a/examples/vhost/Makefile b/examples/vhost/Makefile > index 587ea2ab47..975a5dfe40 100644 > --- a/examples/vhost/Makefile > +++ b/examples/vhost/Makefile > @@ -5,7 +5,7 @@ > APP = vhost-switch > > # all source are stored in SRCS-y > -SRCS-y := main.c virtio_net.c ioat.c > +SRCS-y := main.c virtio_net.c > > PKGCONF ?= pkg-config > > diff --git a/examples/vhost/ioat.c b/examples/vhost/ioat.c > deleted file mode 100644 > index 9aeeb12fd9..0000000000 > --- a/examples/vhost/ioat.c > +++ /dev/null > @@ -1,218 +0,0 @@ > -/* SPDX-License-Identifier: BSD-3-Clause > - * Copyright(c) 2010-2020 Intel Corporation > - */ > - > -#include <sys/uio.h> > -#ifdef RTE_RAW_IOAT > -#include <rte_rawdev.h> > -#include <rte_ioat_rawdev.h> > - > -#include "ioat.h" > -#include "main.h" > - > -struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; > - > -struct packet_tracker { > - unsigned short size_track[MAX_ENQUEUED_SIZE]; > - unsigned short next_read; > - unsigned short next_write; > - unsigned short last_remain; > - unsigned short ioat_space; > -}; > - > -struct packet_tracker cb_tracker[MAX_VHOST_DEVICE]; > - > -int > -open_ioat(const char *value) > -{ > - struct dma_for_vhost *dma_info = dma_bind; > - char *input = strndup(value, strlen(value) + 1); > - char *addrs = input; > - char *ptrs[2]; > - char *start, *end, *substr; > - int64_t vid, vring_id; > - struct rte_ioat_rawdev_config config; > - struct rte_rawdev_info info = { .dev_private = &config }; > - char name[32]; > - int dev_id; > - int ret = 0; > - uint16_t i = 0; > - char *dma_arg[MAX_VHOST_DEVICE]; > - int args_nr; > - > - while (isblank(*addrs)) > - addrs++; > - if (*addrs == '\0') { > - ret = -1; > - goto out; > - } > - > - /* process DMA devices within bracket. 
*/ > - addrs++; > - substr = strtok(addrs, ";]"); > - if (!substr) { > - ret = -1; > - goto out; > - } > - args_nr = rte_strsplit(substr, strlen(substr), > - dma_arg, MAX_VHOST_DEVICE, ','); > - if (args_nr <= 0) { > - ret = -1; > - goto out; > - } > - while (i < args_nr) { > - char *arg_temp = dma_arg[i]; > - uint8_t sub_nr; > - sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); > - if (sub_nr != 2) { > - ret = -1; > - goto out; > - } > - > - start = strstr(ptrs[0], "txd"); > - if (start == NULL) { > - ret = -1; > - goto out; > - } > - > - start += 3; > - vid = strtol(start, &end, 0); > - if (end == start) { > - ret = -1; > - goto out; > - } > - > - vring_id = 0 + VIRTIO_RXQ; > - if (rte_pci_addr_parse(ptrs[1], > - &(dma_info + vid)->dmas[vring_id].addr) < 0) { > - ret = -1; > - goto out; > - } > - > - rte_pci_device_name(&(dma_info + vid)->dmas[vring_id].addr, > - name, sizeof(name)); > - dev_id = rte_rawdev_get_dev_id(name); > - if (dev_id == (uint16_t)(-ENODEV) || > - dev_id == (uint16_t)(-EINVAL)) { > - ret = -1; > - goto out; > - } > - > - if (rte_rawdev_info_get(dev_id, &info, sizeof(config)) < 0 || > - strstr(info.driver_name, "ioat") == NULL) { > - ret = -1; > - goto out; > - } > - > - (dma_info + vid)->dmas[vring_id].dev_id = dev_id; > - (dma_info + vid)->dmas[vring_id].is_valid = true; > - config.ring_size = IOAT_RING_SIZE; > - config.hdls_disable = true; > - if (rte_rawdev_configure(dev_id, &info, sizeof(config)) < 0) { > - ret = -1; > - goto out; > - } > - rte_rawdev_start(dev_id); > - cb_tracker[dev_id].ioat_space = IOAT_RING_SIZE - 1; > - dma_info->nr++; > - i++; > - } > -out: > - free(input); > - return ret; > -} > - > -int32_t > -ioat_transfer_data_cb(int vid, uint16_t queue_id, > - struct rte_vhost_iov_iter *iov_iter, > - struct rte_vhost_async_status *opaque_data, uint16_t count) > -{ > - uint32_t i_iter; > - uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2 + VIRTIO_RXQ].dev_id; > - struct rte_vhost_iov_iter *iter = NULL; > - 
unsigned long i_seg; > - unsigned short mask = MAX_ENQUEUED_SIZE - 1; > - unsigned short write = cb_tracker[dev_id].next_write; > - > - if (!opaque_data) { > - for (i_iter = 0; i_iter < count; i_iter++) { > - iter = iov_iter + i_iter; > - i_seg = 0; > - if (cb_tracker[dev_id].ioat_space < iter->nr_segs) > - break; > - while (i_seg < iter->nr_segs) { > - rte_ioat_enqueue_copy(dev_id, > - (uintptr_t)(iter->iov[i_seg].src_addr), > - (uintptr_t)(iter->iov[i_seg].dst_addr), > - iter->iov[i_seg].len, > - 0, > - 0); > - i_seg++; > - } > - write &= mask; > - cb_tracker[dev_id].size_track[write] = iter->nr_segs; > - cb_tracker[dev_id].ioat_space -= iter->nr_segs; > - write++; > - } > - } else { > - /* Opaque data is not supported */ > - return -1; > - } > - /* ring the doorbell */ > - rte_ioat_perform_ops(dev_id); > - cb_tracker[dev_id].next_write = write; > - return i_iter; > -} > - > -int32_t > -ioat_check_completed_copies_cb(int vid, uint16_t queue_id, > - struct rte_vhost_async_status *opaque_data, > - uint16_t max_packets) > -{ > - if (!opaque_data) { > - uintptr_t dump[255]; > - int n_seg; > - unsigned short read, write; > - unsigned short nb_packet = 0; > - unsigned short mask = MAX_ENQUEUED_SIZE - 1; > - unsigned short i; > - > - uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2 > - + VIRTIO_RXQ].dev_id; > - n_seg = rte_ioat_completed_ops(dev_id, 255, NULL, NULL, dump, > dump); > - if (n_seg < 0) { > - RTE_LOG(ERR, > - VHOST_DATA, > - "fail to poll completed buf on IOAT device %u", > - dev_id); > - return 0; > - } > - if (n_seg == 0) > - return 0; > - > - cb_tracker[dev_id].ioat_space += n_seg; > - n_seg += cb_tracker[dev_id].last_remain; > - > - read = cb_tracker[dev_id].next_read; > - write = cb_tracker[dev_id].next_write; > - for (i = 0; i < max_packets; i++) { > - read &= mask; > - if (read == write) > - break; > - if (n_seg >= cb_tracker[dev_id].size_track[read]) { > - n_seg -= cb_tracker[dev_id].size_track[read]; > - read++; > - nb_packet++; > - } else { > - 
break; > - } > - } > - cb_tracker[dev_id].next_read = read; > - cb_tracker[dev_id].last_remain = n_seg; > - return nb_packet; > - } > - /* Opaque data is not supported */ > - return -1; > -} > - > -#endif /* RTE_RAW_IOAT */ > diff --git a/examples/vhost/ioat.h b/examples/vhost/ioat.h > deleted file mode 100644 > index d9bf717e8d..0000000000 > --- a/examples/vhost/ioat.h > +++ /dev/null > @@ -1,63 +0,0 @@ > -/* SPDX-License-Identifier: BSD-3-Clause > - * Copyright(c) 2010-2020 Intel Corporation > - */ > - > -#ifndef _IOAT_H_ > -#define _IOAT_H_ > - > -#include <rte_vhost.h> > -#include <rte_pci.h> > -#include <rte_vhost_async.h> > - > -#define MAX_VHOST_DEVICE 1024 > -#define IOAT_RING_SIZE 4096 > -#define MAX_ENQUEUED_SIZE 4096 > - > -struct dma_info { > - struct rte_pci_addr addr; > - uint16_t dev_id; > - bool is_valid; > -}; > - > -struct dma_for_vhost { > - struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; > - uint16_t nr; > -}; > - > -#ifdef RTE_RAW_IOAT > -int open_ioat(const char *value); > - > -int32_t > -ioat_transfer_data_cb(int vid, uint16_t queue_id, > - struct rte_vhost_iov_iter *iov_iter, > - struct rte_vhost_async_status *opaque_data, uint16_t count); > - > -int32_t > -ioat_check_completed_copies_cb(int vid, uint16_t queue_id, > - struct rte_vhost_async_status *opaque_data, > - uint16_t max_packets); > -#else > -static int open_ioat(const char *value __rte_unused) > -{ > - return -1; > -} > - > -static int32_t > -ioat_transfer_data_cb(int vid __rte_unused, uint16_t queue_id __rte_unused, > - struct rte_vhost_iov_iter *iov_iter __rte_unused, > - struct rte_vhost_async_status *opaque_data __rte_unused, > - uint16_t count __rte_unused) > -{ > - return -1; > -} > - > -static int32_t > -ioat_check_completed_copies_cb(int vid __rte_unused, > - uint16_t queue_id __rte_unused, > - struct rte_vhost_async_status *opaque_data __rte_unused, > - uint16_t max_packets __rte_unused) > -{ > - return -1; > -} > -#endif > -#endif /* _IOAT_H_ */ > diff --git 
a/examples/vhost/main.c b/examples/vhost/main.c > index 33d023aa39..44073499bc 100644 > --- a/examples/vhost/main.c > +++ b/examples/vhost/main.c > @@ -24,8 +24,9 @@ > #include <rte_ip.h> > #include <rte_tcp.h> > #include <rte_pause.h> > +#include <rte_dmadev.h> > +#include <rte_vhost_async.h> > > -#include "ioat.h" > #include "main.h" > > #ifndef MAX_QUEUES > @@ -56,6 +57,14 @@ > #define RTE_TEST_TX_DESC_DEFAULT 512 > > #define INVALID_PORT_ID 0xFF > +#define INVALID_DMA_ID -1 > + > +#define MAX_VHOST_DEVICE 1024 > +#define DMA_RING_SIZE 4096 > + > +struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; > +struct rte_vhost_async_dma_info dma_config[RTE_DMADEV_DEFAULT_MAX]; > +static int dma_count; > > /* mask of enabled ports */ > static uint32_t enabled_port_mask = 0; > @@ -96,8 +105,6 @@ static int builtin_net_driver; > > static int async_vhost_driver; > > -static char *dma_type; > - > /* Specify timeout (in useconds) between retries on RX. */ > static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US; > /* Specify the number of retries on RX. 
*/ > @@ -196,13 +203,134 @@ struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * > MAX_VHOST_DEVICE]; > #define MBUF_TABLE_DRAIN_TSC ((rte_get_tsc_hz() + US_PER_S - 1) \ > / US_PER_S * BURST_TX_DRAIN_US) > > +static inline bool > +is_dma_configured(int16_t dev_id) > +{ > + int i; > + > + for (i = 0; i < dma_count; i++) { > + if (dma_config[i].dev_id == dev_id) { > + return true; > + } > + } > + return false; > +} > + > static inline int > open_dma(const char *value) > { > - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) > - return open_ioat(value); > + struct dma_for_vhost *dma_info = dma_bind; > + char *input = strndup(value, strlen(value) + 1); > + char *addrs = input; > + char *ptrs[2]; > + char *start, *end, *substr; > + int64_t vid, vring_id; > + > + struct rte_dma_info info; > + struct rte_dma_conf dev_config = { .nb_vchans = 1 }; > + struct rte_dma_vchan_conf qconf = { > + .direction = RTE_DMA_DIR_MEM_TO_MEM, > + .nb_desc = DMA_RING_SIZE > + }; > + > + int dev_id; > + int ret = 0; > + uint16_t i = 0; > + char *dma_arg[MAX_VHOST_DEVICE]; > + int args_nr; > + > + while (isblank(*addrs)) > + addrs++; > + if (*addrs == '\0') { > + ret = -1; > + goto out; > + } > + > + /* process DMA devices within bracket. 
*/ > + addrs++; > + substr = strtok(addrs, ";]"); > + if (!substr) { > + ret = -1; > + goto out; > + } > + > + args_nr = rte_strsplit(substr, strlen(substr), > + dma_arg, MAX_VHOST_DEVICE, ','); > + if (args_nr <= 0) { > + ret = -1; > + goto out; > + } > + > + while (i < args_nr) { > + char *arg_temp = dma_arg[i]; > + uint8_t sub_nr; > + > + sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); > + if (sub_nr != 2) { > + ret = -1; > + goto out; > + } > + > + start = strstr(ptrs[0], "txd"); > + if (start == NULL) { > + ret = -1; > + goto out; > + } > + > + start += 3; > + vid = strtol(start, &end, 0); > + if (end == start) { > + ret = -1; > + goto out; > + } > + > + vring_id = 0 + VIRTIO_RXQ; No need to introduce vring_id, it's always VIRTIO_RXQ > + > + dev_id = rte_dma_get_dev_id_by_name(ptrs[1]); > + if (dev_id < 0) { > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to find DMA %s.\n", > ptrs[1]); > + ret = -1; > + goto out; > + } else if (is_dma_configured(dev_id)) { > + goto done; > + } > + Please call rte_dma_info_get before configure to make sure info.max_vchans >=1 > + if (rte_dma_configure(dev_id, &dev_config) != 0) { > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to configure DMA %d.\n", > dev_id); > + ret = -1; > + goto out; > + } > + > + if (rte_dma_vchan_setup(dev_id, 0, &qconf) != 0) { > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to set up DMA %d.\n", > dev_id); > + ret = -1; > + goto out; > + } > > - return -1; > + rte_dma_info_get(dev_id, &info); > + if (info.nb_vchans != 1) { > + RTE_LOG(ERR, VHOST_CONFIG, "DMA %d has no queues.\n", > dev_id); Then the above means the number of vchan is not configured. 
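The ordering suggested above could look roughly like this — a self-contained sketch with stub functions in place of rte_dma_info_get()/rte_dma_configure() (the stubs and the dma_probe() name are illustrative assumptions only), querying capabilities before any configuration is attempted:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct rte_dma_info. */
struct dma_info_stub { uint16_t max_vchans; };

/* Pretend device 0 reports one vchan and device 1 reports none. */
static int info_get_stub(int dev_id, struct dma_info_stub *info)
{
	info->max_vchans = (dev_id == 0) ? 1 : 0;
	return 0;
}

static int configure_stub(int dev_id) { (void)dev_id; return 0; }

/* Probe order: query capabilities first, bail out before configuring
 * a device that cannot provide even one vchan. */
static int dma_probe(int dev_id)
{
	struct dma_info_stub info;

	if (info_get_stub(dev_id, &info) != 0)
		return -1;
	if (info.max_vchans < 1)	/* the check requested in review */
		return -1;
	return configure_stub(dev_id);
}
```

This way the "DMA %d has no queues" error can never fire after a successful configure: a device with no vchans is rejected up front.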
> + ret = -1; > + goto out; > + } > + > + if (rte_dma_start(dev_id) != 0) { > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to start DMA %u.\n", > dev_id); > + ret = -1; > + goto out; > + } > + > + dma_config[dma_count].dev_id = dev_id; > + dma_config[dma_count].max_vchans = 1; > + dma_config[dma_count++].max_desc = DMA_RING_SIZE; > + > +done: > + (dma_info + vid)->dmas[vring_id].dev_id = dev_id; > + i++; > + } > +out: > + free(input); > + return ret; > } > > /* > @@ -500,8 +628,6 @@ enum { > OPT_CLIENT_NUM, > #define OPT_BUILTIN_NET_DRIVER "builtin-net-driver" > OPT_BUILTIN_NET_DRIVER_NUM, > -#define OPT_DMA_TYPE "dma-type" > - OPT_DMA_TYPE_NUM, > #define OPT_DMAS "dmas" > OPT_DMAS_NUM, > }; > @@ -539,8 +665,6 @@ us_vhost_parse_args(int argc, char **argv) > NULL, OPT_CLIENT_NUM}, > {OPT_BUILTIN_NET_DRIVER, no_argument, > NULL, OPT_BUILTIN_NET_DRIVER_NUM}, > - {OPT_DMA_TYPE, required_argument, > - NULL, OPT_DMA_TYPE_NUM}, > {OPT_DMAS, required_argument, > NULL, OPT_DMAS_NUM}, > {NULL, 0, 0, 0}, > @@ -661,10 +785,6 @@ us_vhost_parse_args(int argc, char **argv) > } > break; > > - case OPT_DMA_TYPE_NUM: > - dma_type = optarg; > - break; > - > case OPT_DMAS_NUM: > if (open_dma(optarg) == -1) { > RTE_LOG(INFO, VHOST_CONFIG, > @@ -841,9 +961,10 @@ complete_async_pkts(struct vhost_dev *vdev) > { > struct rte_mbuf *p_cpl[MAX_PKT_BURST]; > uint16_t complete_count; > + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; > > complete_count = rte_vhost_poll_enqueue_completed(vdev->vid, > - VIRTIO_RXQ, p_cpl, MAX_PKT_BURST); > + VIRTIO_RXQ, p_cpl, MAX_PKT_BURST, dma_id, 0); > if (complete_count) { > free_pkts(p_cpl, complete_count); > __atomic_sub_fetch(&vdev->pkts_inflight, complete_count, > __ATOMIC_SEQ_CST); > @@ -883,11 +1004,12 @@ drain_vhost(struct vhost_dev *vdev) > > if (builtin_net_driver) { > ret = vs_enqueue_pkts(vdev, VIRTIO_RXQ, m, nr_xmit); > - } else if (async_vhost_driver) { > + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { > uint16_t 
enqueue_fail = 0; > + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; > > complete_async_pkts(vdev); > - ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, > nr_xmit); > + ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, > nr_xmit, dma_id, 0); > __atomic_add_fetch(&vdev->pkts_inflight, ret, __ATOMIC_SEQ_CST); > > enqueue_fail = nr_xmit - ret; > @@ -905,7 +1027,7 @@ drain_vhost(struct vhost_dev *vdev) > __ATOMIC_SEQ_CST); > } > > - if (!async_vhost_driver) > + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) > free_pkts(m, nr_xmit); > } > > @@ -1211,12 +1333,13 @@ drain_eth_rx(struct vhost_dev *vdev) > if (builtin_net_driver) { > enqueue_count = vs_enqueue_pkts(vdev, VIRTIO_RXQ, > pkts, rx_count); > - } else if (async_vhost_driver) { > + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { > uint16_t enqueue_fail = 0; > + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; > > complete_async_pkts(vdev); > enqueue_count = rte_vhost_submit_enqueue_burst(vdev->vid, > - VIRTIO_RXQ, pkts, rx_count); > + VIRTIO_RXQ, pkts, rx_count, dma_id, 0); > __atomic_add_fetch(&vdev->pkts_inflight, enqueue_count, > __ATOMIC_SEQ_CST); > > enqueue_fail = rx_count - enqueue_count; > @@ -1235,7 +1358,7 @@ drain_eth_rx(struct vhost_dev *vdev) > __ATOMIC_SEQ_CST); > } > > - if (!async_vhost_driver) > + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) > free_pkts(pkts, rx_count); > } > > @@ -1387,18 +1510,20 @@ destroy_device(int vid) > "(%d) device has been removed from data core\n", > vdev->vid); > > - if (async_vhost_driver) { > + if (dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled) { > uint16_t n_pkt = 0; > + int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; > struct rte_mbuf *m_cpl[vdev->pkts_inflight]; > > while (vdev->pkts_inflight) { > n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, VIRTIO_RXQ, > - m_cpl, vdev->pkts_inflight); > + m_cpl, vdev->pkts_inflight, dma_id, 0); > free_pkts(m_cpl, n_pkt); > 
__atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, > __ATOMIC_SEQ_CST); > } > > rte_vhost_async_channel_unregister(vid, VIRTIO_RXQ); > + dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = false; > } > > rte_free(vdev); > @@ -1468,20 +1593,14 @@ new_device(int vid) > "(%d) device has been added to data core %d\n", > vid, vdev->coreid); > > - if (async_vhost_driver) { > - struct rte_vhost_async_config config = {0}; > - struct rte_vhost_async_channel_ops channel_ops; > - > - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) { > - channel_ops.transfer_data = ioat_transfer_data_cb; > - channel_ops.check_completed_copies = > - ioat_check_completed_copies_cb; > - > - config.features = RTE_VHOST_ASYNC_INORDER; > + if (dma_bind[vid].dmas[VIRTIO_RXQ].dev_id != INVALID_DMA_ID) { > + int ret; > > - return rte_vhost_async_channel_register(vid, VIRTIO_RXQ, > - config, &channel_ops); > + ret = rte_vhost_async_channel_register(vid, VIRTIO_RXQ); > + if (ret == 0) { > + dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = true; > } > + return ret; > } > > return 0; > @@ -1502,14 +1621,15 @@ vring_state_changed(int vid, uint16_t queue_id, int > enable) > if (queue_id != VIRTIO_RXQ) > return 0; > > - if (async_vhost_driver) { > + if (dma_bind[vid].dmas[queue_id].async_enabled) { > if (!enable) { > uint16_t n_pkt = 0; > + int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; > struct rte_mbuf *m_cpl[vdev->pkts_inflight]; > > while (vdev->pkts_inflight) { > n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, > queue_id, > - m_cpl, vdev->pkts_inflight); > + m_cpl, vdev->pkts_inflight, dma_id, > 0); > free_pkts(m_cpl, n_pkt); > __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, > __ATOMIC_SEQ_CST); > } > @@ -1657,6 +1777,25 @@ create_mbuf_pool(uint16_t nr_port, uint32_t > nr_switch_core, uint32_t mbuf_size, > rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n"); > } > > +static void > +init_dma(void) > +{ > + int i; > + > + for (i = 0; i < MAX_VHOST_DEVICE; i++) { > + int j; > + > + for (j = 0; j 
< RTE_MAX_QUEUES_PER_PORT * 2; j++) { > + dma_bind[i].dmas[j].dev_id = INVALID_DMA_ID; > + dma_bind[i].dmas[j].async_enabled = false; > + } > + } > + > + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) { > + dma_config[i].dev_id = INVALID_DMA_ID; > + } > +} > + > /* > * Main function, does initialisation and calls the per-lcore functions. > */ > @@ -1679,6 +1818,9 @@ main(int argc, char *argv[]) > argc -= ret; > argv += ret; > > + /* initialize dma structures */ > + init_dma(); > + > /* parse app arguments */ > ret = us_vhost_parse_args(argc, argv); > if (ret < 0) > @@ -1754,6 +1896,20 @@ main(int argc, char *argv[]) > if (client_mode) > flags |= RTE_VHOST_USER_CLIENT; > > + if (async_vhost_driver) { > + if (rte_vhost_async_dma_configure(dma_config, dma_count) < 0) { > + RTE_LOG(ERR, VHOST_PORT, "Failed to configure DMA in > vhost.\n"); > + for (i = 0; i < dma_count; i++) { > + if (dma_config[i].dev_id != INVALID_DMA_ID) { > + rte_dma_stop(dma_config[i].dev_id); > + dma_config[i].dev_id = INVALID_DMA_ID; > + } > + } > + dma_count = 0; > + async_vhost_driver = false; > + } > + } > + > /* Register vhost user driver to handle vhost messages. 
*/ > for (i = 0; i < nb_sockets; i++) { > char *file = socket_files + i * PATH_MAX; > diff --git a/examples/vhost/main.h b/examples/vhost/main.h > index e7b1ac60a6..b4a453e77e 100644 > --- a/examples/vhost/main.h > +++ b/examples/vhost/main.h > @@ -8,6 +8,7 @@ > #include <sys/queue.h> > > #include <rte_ether.h> > +#include <rte_pci.h> > > /* Macros for printing using RTE_LOG */ > #define RTE_LOGTYPE_VHOST_CONFIG RTE_LOGTYPE_USER1 > @@ -79,6 +80,16 @@ struct lcore_info { > struct vhost_dev_tailq_list vdev_list; > }; > > +struct dma_info { > + struct rte_pci_addr addr; > + int16_t dev_id; > + bool async_enabled; > +}; > + > +struct dma_for_vhost { > + struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; > +}; > + > /* we implement non-extra virtio net features */ > #define VIRTIO_NET_FEATURES 0 > > diff --git a/examples/vhost/meson.build b/examples/vhost/meson.build > index 3efd5e6540..87a637f83f 100644 > --- a/examples/vhost/meson.build > +++ b/examples/vhost/meson.build > @@ -12,13 +12,9 @@ if not is_linux > endif > > deps += 'vhost' > +deps += 'dmadev' > allow_experimental_apis = true > sources = files( > 'main.c', > 'virtio_net.c', > ) > - > -if dpdk_conf.has('RTE_RAW_IOAT') > - deps += 'raw_ioat' > - sources += files('ioat.c') > -endif > diff --git a/lib/vhost/meson.build b/lib/vhost/meson.build > index cdb37a4814..8107329400 100644 > --- a/lib/vhost/meson.build > +++ b/lib/vhost/meson.build > @@ -33,7 +33,8 @@ headers = files( > 'rte_vhost_async.h', > 'rte_vhost_crypto.h', > ) > + > driver_sdk_headers = files( > 'vdpa_driver.h', > ) > -deps += ['ethdev', 'cryptodev', 'hash', 'pci'] > +deps += ['ethdev', 'cryptodev', 'hash', 'pci', 'dmadev'] > diff --git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h > index a87ea6ba37..23a7a2d8b3 100644 > --- a/lib/vhost/rte_vhost_async.h > +++ b/lib/vhost/rte_vhost_async.h > @@ -27,70 +27,12 @@ struct rte_vhost_iov_iter { > }; > > /** > - * dma transfer status > + * DMA device information > */ > -struct 
rte_vhost_async_status { > - /** An array of application specific data for source memory */ > - uintptr_t *src_opaque_data; > - /** An array of application specific data for destination memory */ > - uintptr_t *dst_opaque_data; > -}; > - > -/** > - * dma operation callbacks to be implemented by applications > - */ > -struct rte_vhost_async_channel_ops { > - /** > - * instruct async engines to perform copies for a batch of packets > - * > - * @param vid > - * id of vhost device to perform data copies > - * @param queue_id > - * queue id to perform data copies > - * @param iov_iter > - * an array of IOV iterators > - * @param opaque_data > - * opaque data pair sending to DMA engine > - * @param count > - * number of elements in the "descs" array > - * @return > - * number of IOV iterators processed, negative value means error > - */ > - int32_t (*transfer_data)(int vid, uint16_t queue_id, > - struct rte_vhost_iov_iter *iov_iter, > - struct rte_vhost_async_status *opaque_data, > - uint16_t count); > - /** > - * check copy-completed packets from the async engine > - * @param vid > - * id of vhost device to check copy completion > - * @param queue_id > - * queue id to check copy completion > - * @param opaque_data > - * buffer to receive the opaque data pair from DMA engine > - * @param max_packets > - * max number of packets could be completed > - * @return > - * number of async descs completed, negative value means error > - */ > - int32_t (*check_completed_copies)(int vid, uint16_t queue_id, > - struct rte_vhost_async_status *opaque_data, > - uint16_t max_packets); > -}; > - > -/** > - * async channel features > - */ > -enum { > - RTE_VHOST_ASYNC_INORDER = 1U << 0, > -}; > - > -/** > - * async channel configuration > - */ > -struct rte_vhost_async_config { > - uint32_t features; > - uint32_t rsvd[2]; > +struct rte_vhost_async_dma_info { > + int16_t dev_id; /* DMA device ID */ > + uint16_t max_vchans; /* max number of vchan */ > + uint16_t max_desc; /* max desc number 
of vchan */ > }; > > /** > @@ -100,17 +42,11 @@ struct rte_vhost_async_config { > * vhost device id async channel to be attached to > * @param queue_id > * vhost queue id async channel to be attached to > - * @param config > - * Async channel configuration structure > - * @param ops > - * Async channel operation callbacks > * @return > * 0 on success, -1 on failures > */ > __rte_experimental > -int rte_vhost_async_channel_register(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops); > +int rte_vhost_async_channel_register(int vid, uint16_t queue_id); > > /** > * Unregister an async channel for a vhost queue > @@ -136,17 +72,11 @@ int rte_vhost_async_channel_unregister(int vid, uint16_t > queue_id); > * vhost device id async channel to be attached to > * @param queue_id > * vhost queue id async channel to be attached to > - * @param config > - * Async channel configuration > - * @param ops > - * Async channel operation callbacks > * @return > * 0 on success, -1 on failures > */ > __rte_experimental > -int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops); > +int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > queue_id); > > /** > * Unregister an async channel for a vhost queue without performing any > @@ -179,12 +109,17 @@ int rte_vhost_async_channel_unregister_thread_unsafe(int > vid, > * array of packets to be enqueued > * @param count > * packets num to be enqueued > + * @param dma_id > + * the identifier of the DMA device > + * @param vchan > + * the identifier of virtual DMA channel > * @return > * num of packets enqueued > */ > __rte_experimental > uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count); > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan); All dma_id in the API should be 
uint16_t. Otherwise you need to check if valid. > > /** > * This function checks async completion status for a specific vhost > @@ -199,12 +134,17 @@ uint16_t rte_vhost_submit_enqueue_burst(int vid, > uint16_t queue_id, > * blank array to get return packet pointer > * @param count > * size of the packet array > + * @param dma_id > + * the identifier of the DMA device > + * @param vchan > + * the identifier of virtual DMA channel > * @return > * num of packets returned > */ > __rte_experimental > uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count); > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan); > > /** > * This function returns the amount of in-flight packets for the vhost > @@ -235,11 +175,32 @@ int rte_vhost_async_get_inflight(int vid, uint16_t > queue_id); > * Blank array to get return packet pointer > * @param count > * Size of the packet array > + * @param dma_id > + * the identifier of the DMA device > + * @param vchan > + * the identifier of virtual DMA channel > * @return > * Number of packets returned > */ > __rte_experimental > uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count); > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan); > +/** > + * The DMA vChannels used in asynchronous data path must be configured > + * first. So this function needs to be called before enabling DMA > + * acceleration for vring. If this function fails, asynchronous data path > + * cannot be enabled for any vring further. > + * > + * @param dmas > + * DMA information > + * @param count > + * Element number of 'dmas' > + * @return > + * 0 on success, and -1 on failure > + */ > +__rte_experimental > +int rte_vhost_async_dma_configure(struct rte_vhost_async_dma_info *dmas, > + uint16_t count); I think based on current design, vhost can use every vchan if user app let it. 
So the max_desc and max_vchans can just be got from dmadev APIs? Then there's no need to introduce the new ABI struct rte_vhost_async_dma_info. And about max_desc, I see the dmadev lib, you can get vchan's max_desc but you may use a nb_desc (<= max_desc) to configure vchanl. And IIUC, vhost wants to know the nb_desc instead of max_desc? > > #endif /* _RTE_VHOST_ASYNC_H_ */ > diff --git a/lib/vhost/version.map b/lib/vhost/version.map > index a7ef7f1976..1202ba9c1a 100644 > --- a/lib/vhost/version.map > +++ b/lib/vhost/version.map > @@ -84,6 +84,9 @@ EXPERIMENTAL { > > # added in 21.11 > rte_vhost_get_monitor_addr; > + > + # added in 22.03 > + rte_vhost_async_dma_configure; > }; > > INTERNAL { > diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c > index 13a9bb9dd1..32f37f4851 100644 > --- a/lib/vhost/vhost.c > +++ b/lib/vhost/vhost.c > @@ -344,6 +344,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq) > return; > > rte_free(vq->async->pkts_info); > + rte_free(vq->async->pkts_cmpl_flag); > > rte_free(vq->async->buffers_packed); > vq->async->buffers_packed = NULL; > @@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid, > } > > static __rte_always_inline int > -async_channel_register(int vid, uint16_t queue_id, > - struct rte_vhost_async_channel_ops *ops) > +async_channel_register(int vid, uint16_t queue_id) > { > struct virtio_net *dev = get_device(vid); > struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; > @@ -1656,6 +1656,14 @@ async_channel_register(int vid, uint16_t queue_id, > goto out_free_async; > } > > + async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size * sizeof(bool), > + RTE_CACHE_LINE_SIZE, node); > + if (!async->pkts_cmpl_flag) { > + VHOST_LOG_CONFIG(ERR, "failed to allocate async pkts_cmpl_flag > (vid %d, qid: %d)\n", > + vid, queue_id); qid: %u > + goto out_free_async; > + } > + > if (vq_is_packed(dev)) { > async->buffers_packed = rte_malloc_socket(NULL, > vq->size * sizeof(struct vring_used_elem_packed), > @@ -1676,9 +1684,6 @@ 
async_channel_register(int vid, uint16_t queue_id, > } > } > > - async->ops.check_completed_copies = ops->check_completed_copies; > - async->ops.transfer_data = ops->transfer_data; > - > vq->async = async; > > return 0; > @@ -1691,15 +1696,13 @@ async_channel_register(int vid, uint16_t queue_id, > } > > int > -rte_vhost_async_channel_register(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops) > +rte_vhost_async_channel_register(int vid, uint16_t queue_id) > { > struct vhost_virtqueue *vq; > struct virtio_net *dev = get_device(vid); > int ret; > > - if (dev == NULL || ops == NULL) > + if (dev == NULL) > return -1; > > if (queue_id >= VHOST_MAX_VRING) > @@ -1710,33 +1713,20 @@ rte_vhost_async_channel_register(int vid, uint16_t > queue_id, > if (unlikely(vq == NULL || !dev->async_copy)) > return -1; > > - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { > - VHOST_LOG_CONFIG(ERR, > - "async copy is not supported on non-inorder mode " > - "(vid %d, qid: %d)\n", vid, queue_id); > - return -1; > - } > - > - if (unlikely(ops->check_completed_copies == NULL || > - ops->transfer_data == NULL)) > - return -1; > - > rte_spinlock_lock(&vq->access_lock); > - ret = async_channel_register(vid, queue_id, ops); > + ret = async_channel_register(vid, queue_id); > rte_spinlock_unlock(&vq->access_lock); > > return ret; > } > > int > -rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops) > +rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id) > { > struct vhost_virtqueue *vq; > struct virtio_net *dev = get_device(vid); > > - if (dev == NULL || ops == NULL) > + if (dev == NULL) > return -1; > > if (queue_id >= VHOST_MAX_VRING) > @@ -1747,18 +1737,7 @@ rte_vhost_async_channel_register_thread_unsafe(int vid, > uint16_t queue_id, > if (unlikely(vq == NULL || !dev->async_copy)) > return 
-1; > > - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { > - VHOST_LOG_CONFIG(ERR, > - "async copy is not supported on non-inorder mode " > - "(vid %d, qid: %d)\n", vid, queue_id); > - return -1; > - } > - > - if (unlikely(ops->check_completed_copies == NULL || > - ops->transfer_data == NULL)) > - return -1; > - > - return async_channel_register(vid, queue_id, ops); > + return async_channel_register(vid, queue_id); > } > > int > @@ -1835,6 +1814,83 @@ rte_vhost_async_channel_unregister_thread_unsafe(int > vid, uint16_t queue_id) > return 0; > } > > +static __rte_always_inline void > +vhost_free_async_dma_mem(void) > +{ > + uint16_t i; > + > + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) { > + struct async_dma_info *dma = &dma_copy_track[i]; > + int16_t j; > + > + if (dma->max_vchans == 0) { > + continue; > + } > + > + for (j = 0; j < dma->max_vchans; j++) { > + rte_free(dma->vchans[j].metadata); > + } > + rte_free(dma->vchans); > + dma->vchans = NULL; > + dma->max_vchans = 0; > + } > +} > + > +int > +rte_vhost_async_dma_configure(struct rte_vhost_async_dma_info *dmas, uint16_t > count) > +{ > + uint16_t i; > + > + if (!dmas) { > + VHOST_LOG_CONFIG(ERR, "Invalid DMA configuration parameter.\n"); > + return -1; > + } > + > + for (i = 0; i < count; i++) { > + struct async_dma_vchan_info *vchans; > + int16_t dev_id; > + uint16_t max_vchans; > + uint16_t max_desc; > + uint16_t j; > + > + dev_id = dmas[i].dev_id; > + max_vchans = dmas[i].max_vchans; > + max_desc = dmas[i].max_desc; > + > + if (!rte_is_power_of_2(max_desc)) { > + max_desc = rte_align32pow2(max_desc); > + } I think when aligning to a power of 2, it should not exceed max_desc? And based on the above comment, if this max_desc is the nb_desc configured for the vchan, you should just make sure the nb_desc is a power of 2. 
> + > + vchans = rte_zmalloc(NULL, sizeof(struct async_dma_vchan_info) * > max_vchans, > + RTE_CACHE_LINE_SIZE); > + if (vchans == NULL) { > + VHOST_LOG_CONFIG(ERR, "Failed to allocate vchans for dma- > %d." > + " Cannot enable async data-path.\n", dev_id); > + vhost_free_async_dma_mem(); > + return -1; > + } > + > + for (j = 0; j < max_vchans; j++) { > + vchans[j].metadata = rte_zmalloc(NULL, sizeof(bool *) * > max_desc, > + RTE_CACHE_LINE_SIZE); > + if (!vchans[j].metadata) { > + VHOST_LOG_CONFIG(ERR, "Failed to allocate metadata for > " > + "dma-%d vchan-%u\n", dev_id, j); > + vhost_free_async_dma_mem(); > + return -1; > + } > + > + vchans[j].ring_size = max_desc; > + vchans[j].ring_mask = max_desc - 1; > + } > + > + dma_copy_track[dev_id].vchans = vchans; > + dma_copy_track[dev_id].max_vchans = max_vchans; > + } > + > + return 0; > +} > + > int > rte_vhost_async_get_inflight(int vid, uint16_t queue_id) > { > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h > index 7085e0885c..d9bda34e11 100644 > --- a/lib/vhost/vhost.h > +++ b/lib/vhost/vhost.h > @@ -19,6 +19,7 @@ > #include <rte_ether.h> > #include <rte_rwlock.h> > #include <rte_malloc.h> > +#include <rte_dmadev.h> > > #include "rte_vhost.h" > #include "rte_vdpa.h" > @@ -50,6 +51,7 @@ > > #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) > #define VHOST_MAX_ASYNC_VEC 2048 > +#define VHOST_ASYNC_DMA_BATCHING_SIZE 32 > > #define PACKED_DESC_ENQUEUE_USED_FLAG(w) \ > ((w) ? 
(VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \ > @@ -119,6 +121,41 @@ struct vring_used_elem_packed { > uint32_t count; > }; > > +struct async_dma_vchan_info { > + /* circular array to track copy metadata */ > + bool **metadata; If the metadata will only be flags, maybe just use some name called XXX_flag > + > + /* max elements in 'metadata' */ > + uint16_t ring_size; > + /* ring index mask for 'metadata' */ > + uint16_t ring_mask; > + > + /* batching copies before a DMA doorbell */ > + uint16_t nr_batching; > + > + /** > + * DMA virtual channel lock. Although it is able to bind DMA > + * virtual channels to data plane threads, vhost control plane > + * thread could call data plane functions too, thus causing > + * DMA device contention. > + * > + * For example, in VM exit case, vhost control plane thread needs > + * to clear in-flight packets before disable vring, but there could > + * be anotther data plane thread is enqueuing packets to the same > + * vring with the same DMA virtual channel. But dmadev PMD functions > + * are lock-free, so the control plane and data plane threads > + * could operate the same DMA virtual channel at the same time. > + */ > + rte_spinlock_t dma_lock; > +}; > + > +struct async_dma_info { > + uint16_t max_vchans; > + struct async_dma_vchan_info *vchans; > +}; > + > +extern struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > + > /** > * inflight async packet information > */ > @@ -129,9 +166,6 @@ struct async_inflight_info { > }; > > struct vhost_async { > - /* operation callbacks for DMA */ > - struct rte_vhost_async_channel_ops ops; > - > struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; > struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; > uint16_t iter_idx; > @@ -139,6 +173,19 @@ struct vhost_async { > > /* data transfer status */ > struct async_inflight_info *pkts_info; > + /** > + * packet reorder array. "true" indicates that DMA > + * device completes all copies for the packet. 
> + * > + * Note that this array could be written by multiple > + * threads at the same time. For example, two threads > + * enqueue packets to the same virtqueue with their > + * own DMA devices. However, since offloading is > + * per-packet basis, each packet flag will only be > + * written by one thread. And single byte write is > + * atomic, so no lock is needed. > + */ > + bool *pkts_cmpl_flag; > uint16_t pkts_idx; > uint16_t pkts_inflight_n; > union { > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c > index b3d954aab4..9f81fc9733 100644 > --- a/lib/vhost/virtio_net.c > +++ b/lib/vhost/virtio_net.c > @@ -11,6 +11,7 @@ > #include <rte_net.h> > #include <rte_ether.h> > #include <rte_ip.h> > +#include <rte_dmadev.h> > #include <rte_vhost.h> > #include <rte_tcp.h> > #include <rte_udp.h> > @@ -25,6 +26,9 @@ > > #define MAX_BATCH_LEN 256 > > +/* DMA device copy operation tracking array. */ > +struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > + > static __rte_always_inline bool > rxvq_is_mergeable(struct virtio_net *dev) > { > @@ -43,6 +47,108 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t > nr_vring) > return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring; > } > > +static __rte_always_inline uint16_t > +vhost_async_dma_transfer(struct vhost_virtqueue *vq, int16_t dma_id, > + uint16_t vchan, uint16_t head_idx, > + struct rte_vhost_iov_iter *pkts, uint16_t nr_pkts) > +{ > + struct async_dma_vchan_info *dma_info = > &dma_copy_track[dma_id].vchans[vchan]; > + uint16_t ring_mask = dma_info->ring_mask; > + uint16_t pkt_idx; > + > + rte_spinlock_lock(&dma_info->dma_lock); > + > + for (pkt_idx = 0; pkt_idx < nr_pkts; pkt_idx++) { > + struct rte_vhost_iovec *iov = pkts[pkt_idx].iov; > + int copy_idx = 0; > + uint16_t nr_segs = pkts[pkt_idx].nr_segs; > + uint16_t i; > + > + if (rte_dma_burst_capacity(dma_id, vchan) < nr_segs) { > + goto out; > + } > + > + for (i = 0; i < nr_segs; i++) { > + /** > + * We have checked the available space 
before submit copies > to DMA > + * vChannel, so we don't handle error here. > + */ > + copy_idx = rte_dma_copy(dma_id, vchan, > (rte_iova_t)iov[i].src_addr, > + (rte_iova_t)iov[i].dst_addr, iov[i].len, > + RTE_DMA_OP_FLAG_LLC); This assumes rte_dma_copy will always succeed if there's available space. But the API doxygen says: * @return * - 0..UINT16_MAX: index of enqueued job. * - -ENOSPC: if no space left to enqueue. * - other values < 0 on failure. So it should consider other vendor-specific errors. Thanks, Chenbo > + > + /** > + * Only store packet completion flag address in the last > copy's > + * slot, and other slots are set to NULL. > + */ > + if (unlikely(i == (nr_segs - 1))) { > + dma_info->metadata[copy_idx & ring_mask] = > + &vq->async->pkts_cmpl_flag[head_idx % vq->size]; > + } > + } > + > + dma_info->nr_batching += nr_segs; > + if (unlikely(dma_info->nr_batching >= > VHOST_ASYNC_DMA_BATCHING_SIZE)) { > + rte_dma_submit(dma_id, vchan); > + dma_info->nr_batching = 0; > + } > + > + head_idx++; > + } > + > +out: > + if (dma_info->nr_batching > 0) { > + rte_dma_submit(dma_id, vchan); > + dma_info->nr_batching = 0; > + } > + rte_spinlock_unlock(&dma_info->dma_lock); > + > + return pkt_idx; > +} > + > +static __rte_always_inline uint16_t > +vhost_async_dma_check_completed(int16_t dma_id, uint16_t vchan, uint16_t > max_pkts) > +{ > + struct async_dma_vchan_info *dma_info = > &dma_copy_track[dma_id].vchans[vchan]; > + uint16_t ring_mask = dma_info->ring_mask; > + uint16_t last_idx = 0; > + uint16_t nr_copies; > + uint16_t copy_idx; > + uint16_t i; > + > + rte_spinlock_lock(&dma_info->dma_lock); > + > + /** > + * Since all memory is pinned and addresses should be valid, > + * we don't check errors. 
> + */ > + nr_copies = rte_dma_completed(dma_id, vchan, max_pkts, &last_idx, NULL); > + if (nr_copies == 0) { > + goto out; > + } > + > + copy_idx = last_idx - nr_copies + 1; > + for (i = 0; i < nr_copies; i++) { > + bool *flag; > + > + flag = dma_info->metadata[copy_idx & ring_mask]; > + if (flag) { > + /** > + * Mark the packet flag as received. The flag > + * could belong to another virtqueue but write > + * is atomic. > + */ > + *flag = true; > + dma_info->metadata[copy_idx & ring_mask] = NULL; > + } > + copy_idx++; > + } > + > +out: > + rte_spinlock_unlock(&dma_info->dma_lock); > + return nr_copies; > +} > + > static inline void > do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq) > { > @@ -1449,9 +1555,9 @@ store_dma_desc_info_packed(struct vring_used_elem_packed > *s_ring, > } > > static __rte_noinline uint32_t > -virtio_dev_rx_async_submit_split(struct virtio_net *dev, > - struct vhost_virtqueue *vq, uint16_t queue_id, > - struct rte_mbuf **pkts, uint32_t count) > +virtio_dev_rx_async_submit_split(struct virtio_net *dev, struct > vhost_virtqueue *vq, > + uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count, > + int16_t dma_id, uint16_t vchan) > { > struct buf_vector buf_vec[BUF_VECTOR_MAX]; > uint32_t pkt_idx = 0; > @@ -1503,17 +1609,16 @@ virtio_dev_rx_async_submit_split(struct virtio_net > *dev, > if (unlikely(pkt_idx == 0)) > return 0; > > - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, > pkt_idx); > - if (unlikely(n_xfer < 0)) { > - VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer data for queue > id %d.\n", > - dev->vid, __func__, queue_id); > - n_xfer = 0; > - } > + n_xfer = vhost_async_dma_transfer(vq, dma_id, vchan, async->pkts_idx, > async->iov_iter, > + pkt_idx); > > pkt_err = pkt_idx - n_xfer; > if (unlikely(pkt_err)) { > uint16_t num_descs = 0; > > + VHOST_LOG_DATA(DEBUG, "(%d) %s: failed to transfer %u packets for > queue %u.\n", > + dev->vid, __func__, pkt_err, queue_id); > + > /* update 
number of completed packets */ > pkt_idx = n_xfer; > > @@ -1656,13 +1761,13 @@ dma_error_handler_packed(struct vhost_virtqueue *vq, > uint16_t slot_idx, > } > > static __rte_noinline uint32_t > -virtio_dev_rx_async_submit_packed(struct virtio_net *dev, > - struct vhost_virtqueue *vq, uint16_t queue_id, > - struct rte_mbuf **pkts, uint32_t count) > +virtio_dev_rx_async_submit_packed(struct virtio_net *dev, struct > vhost_virtqueue *vq, > + uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count, > + int16_t dma_id, uint16_t vchan) > { > uint32_t pkt_idx = 0; > uint32_t remained = count; > - int32_t n_xfer; > + uint16_t n_xfer; > uint16_t num_buffers; > uint16_t num_descs; > > @@ -1670,6 +1775,7 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, > struct async_inflight_info *pkts_info = async->pkts_info; > uint32_t pkt_err = 0; > uint16_t slot_idx = 0; > + uint16_t head_idx = async->pkts_idx % vq->size; > > do { > rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]); > @@ -1694,19 +1800,17 @@ virtio_dev_rx_async_submit_packed(struct virtio_net > *dev, > if (unlikely(pkt_idx == 0)) > return 0; > > - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, > pkt_idx); > - if (unlikely(n_xfer < 0)) { > - VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer data for queue > id %d.\n", > - dev->vid, __func__, queue_id); > - n_xfer = 0; > - } > - > - pkt_err = pkt_idx - n_xfer; > + n_xfer = vhost_async_dma_transfer(vq, dma_id, vchan, head_idx, > + async->iov_iter, pkt_idx); > > async_iter_reset(async); > > - if (unlikely(pkt_err)) > + pkt_err = pkt_idx - n_xfer; > + if (unlikely(pkt_err)) { > + VHOST_LOG_DATA(DEBUG, "(%d) %s: failed to transfer %u packets for > queue %u.\n", > + dev->vid, __func__, pkt_err, queue_id); > dma_error_handler_packed(vq, slot_idx, pkt_err, &pkt_idx); > + } > > if (likely(vq->shadow_used_idx)) { > /* keep used descriptors. 
*/ > @@ -1826,28 +1930,37 @@ write_back_completed_descs_packed(struct > vhost_virtqueue *vq, > > static __rte_always_inline uint16_t > vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan) > { > struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; > struct vhost_async *async = vq->async; > struct async_inflight_info *pkts_info = async->pkts_info; > - int32_t n_cpl; > + uint16_t nr_cpl_pkts = 0; > uint16_t n_descs = 0, n_buffers = 0; > uint16_t start_idx, from, i; > > - n_cpl = async->ops.check_completed_copies(dev->vid, queue_id, 0, count); > - if (unlikely(n_cpl < 0)) { > - VHOST_LOG_DATA(ERR, "(%d) %s: failed to check completed copies for > queue id %d.\n", > - dev->vid, __func__, queue_id); > - return 0; > - } > - > - if (n_cpl == 0) > - return 0; > + /* Check completed copies for the given DMA vChannel */ > + vhost_async_dma_check_completed(dma_id, vchan, count); > > start_idx = async_get_first_inflight_pkt_idx(vq); > > - for (i = 0; i < n_cpl; i++) { > + /** > + * Calculate the number of copy completed packets. > + * Note that there may be completed packets even if > + * no copies are reported done by the given DMA vChannel, > + * as DMA vChannels could be shared by other threads. 
> + */ > + from = start_idx; > + while (vq->async->pkts_cmpl_flag[from] && count--) { > + vq->async->pkts_cmpl_flag[from] = false; > + from++; > + if (from >= vq->size) > + from -= vq->size; > + nr_cpl_pkts++; > + } > + > + for (i = 0; i < nr_cpl_pkts; i++) { > from = (start_idx + i) % vq->size; > /* Only used with packed ring */ > n_buffers += pkts_info[from].nr_buffers; > @@ -1856,7 +1969,7 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, > uint16_t queue_id, > pkts[i] = pkts_info[from].mbuf; > } > > - async->pkts_inflight_n -= n_cpl; > + async->pkts_inflight_n -= nr_cpl_pkts; > > if (likely(vq->enabled && vq->access_ok)) { > if (vq_is_packed(dev)) { > @@ -1877,12 +1990,13 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, > uint16_t queue_id, > } > } > > - return n_cpl; > + return nr_cpl_pkts; > } > > uint16_t > rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan) > { > struct virtio_net *dev = get_device(vid); > struct vhost_virtqueue *vq; > @@ -1908,7 +2022,7 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t > queue_id, > > rte_spinlock_lock(&vq->access_lock); > > - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); > + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, > dma_id, vchan); > > rte_spinlock_unlock(&vq->access_lock); > > @@ -1917,7 +2031,8 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t > queue_id, > > uint16_t > rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan) > { > struct virtio_net *dev = get_device(vid); > struct vhost_virtqueue *vq; > @@ -1941,14 +2056,14 @@ rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t > queue_id, > return 0; > } > > - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); > 
+ n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, > dma_id, vchan); > > return n_pkts_cpl; > } > > static __rte_always_inline uint32_t > virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, > - struct rte_mbuf **pkts, uint32_t count) > + struct rte_mbuf **pkts, uint32_t count, int16_t dma_id, uint16_t vchan) > { > struct vhost_virtqueue *vq; > uint32_t nb_tx = 0; > @@ -1980,10 +2095,10 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, > uint16_t queue_id, > > if (vq_is_packed(dev)) > nb_tx = virtio_dev_rx_async_submit_packed(dev, vq, queue_id, > - pkts, count); > + pkts, count, dma_id, vchan); > else > nb_tx = virtio_dev_rx_async_submit_split(dev, vq, queue_id, > - pkts, count); > + pkts, count, dma_id, vchan); > > out: > if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) > @@ -1997,7 +2112,8 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, > uint16_t queue_id, > > uint16_t > rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan) > { > struct virtio_net *dev = get_device(vid); > > @@ -2011,7 +2127,7 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t > queue_id, > return 0; > } > > - return virtio_dev_rx_async_submit(dev, queue_id, pkts, count); > + return virtio_dev_rx_async_submit(dev, queue_id, pkts, count, dma_id, > vchan); > } > > static inline bool > -- > 2.25.1 ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath 2022-01-14 6:30 ` Xia, Chenbo @ 2022-01-17 5:39 ` Hu, Jiayu 2022-01-19 2:18 ` Xia, Chenbo 0 siblings, 1 reply; 31+ messages in thread From: Hu, Jiayu @ 2022-01-17 5:39 UTC (permalink / raw) To: Xia, Chenbo, dev Cc: maxime.coquelin, i.maximets, Richardson, Bruce, Van Haaren, Harry, Pai G, Sunil, Mcnamara, John, Ding, Xuan, Jiang, Cheng1, liangma Hi Chenbo, Please see replies inline. Thanks, Jiayu > -----Original Message----- > From: Xia, Chenbo <chenbo.xia@intel.com> > > diff --git a/examples/vhost/main.c b/examples/vhost/main.c > > index 33d023aa39..44073499bc 100644 > > --- a/examples/vhost/main.c > > +++ b/examples/vhost/main.c > > @@ -24,8 +24,9 @@ > > #include <rte_ip.h> > > #include <rte_tcp.h> > > #include <rte_pause.h> > > +#include <rte_dmadev.h> > > +#include <rte_vhost_async.h> > > > > -#include "ioat.h" > > #include "main.h" > > > > #ifndef MAX_QUEUES > > @@ -56,6 +57,14 @@ > > #define RTE_TEST_TX_DESC_DEFAULT 512 > > > > #define INVALID_PORT_ID 0xFF > > +#define INVALID_DMA_ID -1 > > + > > +#define MAX_VHOST_DEVICE 1024 > > +#define DMA_RING_SIZE 4096 > > + > > +struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; > > +struct rte_vhost_async_dma_info > dma_config[RTE_DMADEV_DEFAULT_MAX]; > > +static int dma_count; > > > > /* mask of enabled ports */ > > static uint32_t enabled_port_mask = 0; > > @@ -96,8 +105,6 @@ static int builtin_net_driver; > > > > static int async_vhost_driver; > > > > -static char *dma_type; > > - > > /* Specify timeout (in useconds) between retries on RX. */ > > static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US; > > /* Specify the number of retries on RX. 
*/ > > @@ -196,13 +203,134 @@ struct vhost_bufftable > *vhost_txbuff[RTE_MAX_LCORE * > > MAX_VHOST_DEVICE]; > > #define MBUF_TABLE_DRAIN_TSC ((rte_get_tsc_hz() + US_PER_S - 1) \ > > / US_PER_S * BURST_TX_DRAIN_US) > > > > +static inline bool > > +is_dma_configured(int16_t dev_id) > > +{ > > + int i; > > + > > + for (i = 0; i < dma_count; i++) { > > + if (dma_config[i].dev_id == dev_id) { > > + return true; > > + } > > + } > > + return false; > > +} > > + > > static inline int > > open_dma(const char *value) > > { > > - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) > > - return open_ioat(value); > > + struct dma_for_vhost *dma_info = dma_bind; > > + char *input = strndup(value, strlen(value) + 1); > > + char *addrs = input; > > + char *ptrs[2]; > > + char *start, *end, *substr; > > + int64_t vid, vring_id; > > + > > + struct rte_dma_info info; > > + struct rte_dma_conf dev_config = { .nb_vchans = 1 }; > > + struct rte_dma_vchan_conf qconf = { > > + .direction = RTE_DMA_DIR_MEM_TO_MEM, > > + .nb_desc = DMA_RING_SIZE > > + }; > > + > > + int dev_id; > > + int ret = 0; > > + uint16_t i = 0; > > + char *dma_arg[MAX_VHOST_DEVICE]; > > + int args_nr; > > + > > + while (isblank(*addrs)) > > + addrs++; > > + if (*addrs == '\0') { > > + ret = -1; > > + goto out; > > + } > > + > > + /* process DMA devices within bracket. 
*/ > > + addrs++; > > + substr = strtok(addrs, ";]"); > > + if (!substr) { > > + ret = -1; > > + goto out; > > + } > > + > > + args_nr = rte_strsplit(substr, strlen(substr), > > + dma_arg, MAX_VHOST_DEVICE, ','); > > + if (args_nr <= 0) { > > + ret = -1; > > + goto out; > > + } > > + > > + while (i < args_nr) { > > + char *arg_temp = dma_arg[i]; > > + uint8_t sub_nr; > > + > > + sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); > > + if (sub_nr != 2) { > > + ret = -1; > > + goto out; > > + } > > + > > + start = strstr(ptrs[0], "txd"); > > + if (start == NULL) { > > + ret = -1; > > + goto out; > > + } > > + > > + start += 3; > > + vid = strtol(start, &end, 0); > > + if (end == start) { > > + ret = -1; > > + goto out; > > + } > > + > > + vring_id = 0 + VIRTIO_RXQ; > > No need to introduce vring_id, it's always VIRTIO_RXQ I will remove it later. > > > + > > + dev_id = rte_dma_get_dev_id_by_name(ptrs[1]); > > + if (dev_id < 0) { > > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to find > DMA %s.\n", > > ptrs[1]); > > + ret = -1; > > + goto out; > > + } else if (is_dma_configured(dev_id)) { > > + goto done; > > + } > > + > > Please call rte_dma_info_get before configure to make sure > info.max_vchans >=1 Do you suggest to use "rte_dma_info_get() and info.max_vchans=0" to indicate the device is not configured, rather than using is_dma_configure()? > > > + if (rte_dma_configure(dev_id, &dev_config) != 0) { > > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to configure > DMA %d.\n", > > dev_id); > > + ret = -1; > > + goto out; > > + } > > + > > + if (rte_dma_vchan_setup(dev_id, 0, &qconf) != 0) { > > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to set up > DMA %d.\n", > > dev_id); > > + ret = -1; > > + goto out; > > + } > > > > - return -1; > > + rte_dma_info_get(dev_id, &info); > > + if (info.nb_vchans != 1) { > > + RTE_LOG(ERR, VHOST_CONFIG, "DMA %d has no > queues.\n", > > dev_id); > > Then the above means the number of vchan is not configured. 
> > > + ret = -1; > > + goto out; > > + } > > + > > + if (rte_dma_start(dev_id) != 0) { > > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to start > DMA %u.\n", > > dev_id); > > + ret = -1; > > + goto out; > > + } > > + > > + dma_config[dma_count].dev_id = dev_id; > > + dma_config[dma_count].max_vchans = 1; > > + dma_config[dma_count++].max_desc = DMA_RING_SIZE; > > + > > +done: > > + (dma_info + vid)->dmas[vring_id].dev_id = dev_id; > > + i++; > > + } > > +out: > > + free(input); > > + return ret; > > } > > > > /* > > @@ -500,8 +628,6 @@ enum { > > OPT_CLIENT_NUM, > > #define OPT_BUILTIN_NET_DRIVER "builtin-net-driver" > > OPT_BUILTIN_NET_DRIVER_NUM, > > -#define OPT_DMA_TYPE "dma-type" > > - OPT_DMA_TYPE_NUM, > > #define OPT_DMAS "dmas" > > OPT_DMAS_NUM, > > }; > > @@ -539,8 +665,6 @@ us_vhost_parse_args(int argc, char **argv) > > NULL, OPT_CLIENT_NUM}, > > {OPT_BUILTIN_NET_DRIVER, no_argument, > > NULL, OPT_BUILTIN_NET_DRIVER_NUM}, > > - {OPT_DMA_TYPE, required_argument, > > - NULL, OPT_DMA_TYPE_NUM}, > > {OPT_DMAS, required_argument, > > NULL, OPT_DMAS_NUM}, > > {NULL, 0, 0, 0}, > > @@ -661,10 +785,6 @@ us_vhost_parse_args(int argc, char **argv) > > } > > break; > > > > - case OPT_DMA_TYPE_NUM: > > - dma_type = optarg; > > - break; > > - > > case OPT_DMAS_NUM: > > if (open_dma(optarg) == -1) { > > RTE_LOG(INFO, VHOST_CONFIG, > > @@ -841,9 +961,10 @@ complete_async_pkts(struct vhost_dev *vdev) > > { > > struct rte_mbuf *p_cpl[MAX_PKT_BURST]; > > uint16_t complete_count; > > + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; > > > > complete_count = rte_vhost_poll_enqueue_completed(vdev->vid, > > - VIRTIO_RXQ, p_cpl, > MAX_PKT_BURST); > > + VIRTIO_RXQ, p_cpl, MAX_PKT_BURST, > dma_id, 0); > > if (complete_count) { > > free_pkts(p_cpl, complete_count); > > __atomic_sub_fetch(&vdev->pkts_inflight, complete_count, > > __ATOMIC_SEQ_CST); > > @@ -883,11 +1004,12 @@ drain_vhost(struct vhost_dev *vdev) > > > > if (builtin_net_driver) { > > ret = vs_enqueue_pkts(vdev, 
VIRTIO_RXQ, m, nr_xmit); > > - } else if (async_vhost_driver) { > > + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { > > uint16_t enqueue_fail = 0; > > + int16_t dma_id = dma_bind[vdev- > >vid].dmas[VIRTIO_RXQ].dev_id; > > > > complete_async_pkts(vdev); > > - ret = rte_vhost_submit_enqueue_burst(vdev->vid, > VIRTIO_RXQ, m, > > nr_xmit); > > + ret = rte_vhost_submit_enqueue_burst(vdev->vid, > VIRTIO_RXQ, m, > > nr_xmit, dma_id, 0); > > __atomic_add_fetch(&vdev->pkts_inflight, ret, > __ATOMIC_SEQ_CST); > > > > enqueue_fail = nr_xmit - ret; > > @@ -905,7 +1027,7 @@ drain_vhost(struct vhost_dev *vdev) > > __ATOMIC_SEQ_CST); > > } > > > > - if (!async_vhost_driver) > > + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) > > free_pkts(m, nr_xmit); > > } > > > > @@ -1211,12 +1333,13 @@ drain_eth_rx(struct vhost_dev *vdev) > > if (builtin_net_driver) { > > enqueue_count = vs_enqueue_pkts(vdev, VIRTIO_RXQ, > > pkts, rx_count); > > - } else if (async_vhost_driver) { > > + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { > > uint16_t enqueue_fail = 0; > > + int16_t dma_id = dma_bind[vdev- > >vid].dmas[VIRTIO_RXQ].dev_id; > > > > complete_async_pkts(vdev); > > enqueue_count = rte_vhost_submit_enqueue_burst(vdev- > >vid, > > - VIRTIO_RXQ, pkts, rx_count); > > + VIRTIO_RXQ, pkts, rx_count, dma_id, > 0); > > __atomic_add_fetch(&vdev->pkts_inflight, enqueue_count, > > __ATOMIC_SEQ_CST); > > > > enqueue_fail = rx_count - enqueue_count; > > @@ -1235,7 +1358,7 @@ drain_eth_rx(struct vhost_dev *vdev) > > __ATOMIC_SEQ_CST); > > } > > > > - if (!async_vhost_driver) > > + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) > > free_pkts(pkts, rx_count); > > } > > > > @@ -1387,18 +1510,20 @@ destroy_device(int vid) > > "(%d) device has been removed from data core\n", > > vdev->vid); > > > > - if (async_vhost_driver) { > > + if (dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled) { > > uint16_t n_pkt = 0; > > + int16_t dma_id = 
dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; > > struct rte_mbuf *m_cpl[vdev->pkts_inflight]; > > > > while (vdev->pkts_inflight) { > > n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, > VIRTIO_RXQ, > > - m_cpl, vdev->pkts_inflight); > > + m_cpl, vdev->pkts_inflight, > dma_id, 0); > > free_pkts(m_cpl, n_pkt); > > __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, > > __ATOMIC_SEQ_CST); > > } > > > > rte_vhost_async_channel_unregister(vid, VIRTIO_RXQ); > > + dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = false; > > } > > > > rte_free(vdev); > > @@ -1468,20 +1593,14 @@ new_device(int vid) > > "(%d) device has been added to data core %d\n", > > vid, vdev->coreid); > > > > - if (async_vhost_driver) { > > - struct rte_vhost_async_config config = {0}; > > - struct rte_vhost_async_channel_ops channel_ops; > > - > > - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) { > > - channel_ops.transfer_data = ioat_transfer_data_cb; > > - channel_ops.check_completed_copies = > > - ioat_check_completed_copies_cb; > > - > > - config.features = RTE_VHOST_ASYNC_INORDER; > > + if (dma_bind[vid].dmas[VIRTIO_RXQ].dev_id != INVALID_DMA_ID) { > > + int ret; > > > > - return rte_vhost_async_channel_register(vid, > VIRTIO_RXQ, > > - config, &channel_ops); > > + ret = rte_vhost_async_channel_register(vid, VIRTIO_RXQ); > > + if (ret == 0) { > > + dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = > true; > > } > > + return ret; > > } > > > > return 0; > > @@ -1502,14 +1621,15 @@ vring_state_changed(int vid, uint16_t > queue_id, int > > enable) > > if (queue_id != VIRTIO_RXQ) > > return 0; > > > > - if (async_vhost_driver) { > > + if (dma_bind[vid].dmas[queue_id].async_enabled) { > > if (!enable) { > > uint16_t n_pkt = 0; > > + int16_t dma_id = > dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; > > struct rte_mbuf *m_cpl[vdev->pkts_inflight]; > > > > while (vdev->pkts_inflight) { > > n_pkt = > rte_vhost_clear_queue_thread_unsafe(vid, > > queue_id, > > - m_cpl, vdev- > >pkts_inflight); > > + m_cpl, vdev- > 
>pkts_inflight, dma_id, > > 0); > > free_pkts(m_cpl, n_pkt); > > __atomic_sub_fetch(&vdev->pkts_inflight, > n_pkt, > > __ATOMIC_SEQ_CST); > > } > > @@ -1657,6 +1777,25 @@ create_mbuf_pool(uint16_t nr_port, uint32_t > > nr_switch_core, uint32_t mbuf_size, > > rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n"); > > } > > > > +static void > > +init_dma(void) > > +{ > > + int i; > > + > > + for (i = 0; i < MAX_VHOST_DEVICE; i++) { > > + int j; > > + > > + for (j = 0; j < RTE_MAX_QUEUES_PER_PORT * 2; j++) { > > + dma_bind[i].dmas[j].dev_id = INVALID_DMA_ID; > > + dma_bind[i].dmas[j].async_enabled = false; > > + } > > + } > > + > > + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) { > > + dma_config[i].dev_id = INVALID_DMA_ID; > > + } > > +} > > + > > /* > > * Main function, does initialisation and calls the per-lcore functions. > > */ > > @@ -1679,6 +1818,9 @@ main(int argc, char *argv[]) > > argc -= ret; > > argv += ret; > > > > + /* initialize dma structures */ > > + init_dma(); > > + > > /* parse app arguments */ > > ret = us_vhost_parse_args(argc, argv); > > if (ret < 0) > > @@ -1754,6 +1896,20 @@ main(int argc, char *argv[]) > > if (client_mode) > > flags |= RTE_VHOST_USER_CLIENT; > > > > + if (async_vhost_driver) { > > + if (rte_vhost_async_dma_configure(dma_config, dma_count) > < 0) { > > + RTE_LOG(ERR, VHOST_PORT, "Failed to configure > DMA in > > vhost.\n"); > > + for (i = 0; i < dma_count; i++) { > > + if (dma_config[i].dev_id != INVALID_DMA_ID) > { > > + rte_dma_stop(dma_config[i].dev_id); > > + dma_config[i].dev_id = > INVALID_DMA_ID; > > + } > > + } > > + dma_count = 0; > > + async_vhost_driver = false; > > + } > > + } > > + > > /* Register vhost user driver to handle vhost messages. 
*/ > > for (i = 0; i < nb_sockets; i++) { > > char *file = socket_files + i * PATH_MAX; > > diff --git a/examples/vhost/main.h b/examples/vhost/main.h > > index e7b1ac60a6..b4a453e77e 100644 > > --- a/examples/vhost/main.h > > +++ b/examples/vhost/main.h > > @@ -8,6 +8,7 @@ > > #include <sys/queue.h> > > > > #include <rte_ether.h> > > +#include <rte_pci.h> > > > > /* Macros for printing using RTE_LOG */ > > #define RTE_LOGTYPE_VHOST_CONFIG RTE_LOGTYPE_USER1 > > @@ -79,6 +80,16 @@ struct lcore_info { > > struct vhost_dev_tailq_list vdev_list; > > }; > > > > +struct dma_info { > > + struct rte_pci_addr addr; > > + int16_t dev_id; > > + bool async_enabled; > > +}; > > + > > +struct dma_for_vhost { > > + struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; > > +}; > > + > > /* we implement non-extra virtio net features */ > > #define VIRTIO_NET_FEATURES 0 > > > > diff --git a/examples/vhost/meson.build b/examples/vhost/meson.build > > index 3efd5e6540..87a637f83f 100644 > > --- a/examples/vhost/meson.build > > +++ b/examples/vhost/meson.build > > @@ -12,13 +12,9 @@ if not is_linux > > endif > > > > deps += 'vhost' > > +deps += 'dmadev' > > allow_experimental_apis = true > > sources = files( > > 'main.c', > > 'virtio_net.c', > > ) > > - > > -if dpdk_conf.has('RTE_RAW_IOAT') > > - deps += 'raw_ioat' > > - sources += files('ioat.c') > > -endif > > diff --git a/lib/vhost/meson.build b/lib/vhost/meson.build > > index cdb37a4814..8107329400 100644 > > --- a/lib/vhost/meson.build > > +++ b/lib/vhost/meson.build > > @@ -33,7 +33,8 @@ headers = files( > > 'rte_vhost_async.h', > > 'rte_vhost_crypto.h', > > ) > > + > > driver_sdk_headers = files( > > 'vdpa_driver.h', > > ) > > -deps += ['ethdev', 'cryptodev', 'hash', 'pci'] > > +deps += ['ethdev', 'cryptodev', 'hash', 'pci', 'dmadev'] > > diff --git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h > > index a87ea6ba37..23a7a2d8b3 100644 > > --- a/lib/vhost/rte_vhost_async.h > > +++ b/lib/vhost/rte_vhost_async.h > > @@ 
-27,70 +27,12 @@ struct rte_vhost_iov_iter { > > }; > > > > /** > > - * dma transfer status > > + * DMA device information > > */ > > -struct rte_vhost_async_status { > > - /** An array of application specific data for source memory */ > > - uintptr_t *src_opaque_data; > > - /** An array of application specific data for destination memory */ > > - uintptr_t *dst_opaque_data; > > -}; > > - > > -/** > > - * dma operation callbacks to be implemented by applications > > - */ > > -struct rte_vhost_async_channel_ops { > > - /** > > - * instruct async engines to perform copies for a batch of packets > > - * > > - * @param vid > > - * id of vhost device to perform data copies > > - * @param queue_id > > - * queue id to perform data copies > > - * @param iov_iter > > - * an array of IOV iterators > > - * @param opaque_data > > - * opaque data pair sending to DMA engine > > - * @param count > > - * number of elements in the "descs" array > > - * @return > > - * number of IOV iterators processed, negative value means error > > - */ > > - int32_t (*transfer_data)(int vid, uint16_t queue_id, > > - struct rte_vhost_iov_iter *iov_iter, > > - struct rte_vhost_async_status *opaque_data, > > - uint16_t count); > > - /** > > - * check copy-completed packets from the async engine > > - * @param vid > > - * id of vhost device to check copy completion > > - * @param queue_id > > - * queue id to check copy completion > > - * @param opaque_data > > - * buffer to receive the opaque data pair from DMA engine > > - * @param max_packets > > - * max number of packets could be completed > > - * @return > > - * number of async descs completed, negative value means error > > - */ > > - int32_t (*check_completed_copies)(int vid, uint16_t queue_id, > > - struct rte_vhost_async_status *opaque_data, > > - uint16_t max_packets); > > -}; > > - > > -/** > > - * async channel features > > - */ > > -enum { > > - RTE_VHOST_ASYNC_INORDER = 1U << 0, > > -}; > > - > > -/** > > - * async channel configuration 
> > - */ > > -struct rte_vhost_async_config { > > - uint32_t features; > > - uint32_t rsvd[2]; > > +struct rte_vhost_async_dma_info { > > + int16_t dev_id; /* DMA device ID */ > > + uint16_t max_vchans; /* max number of vchan */ > > + uint16_t max_desc; /* max desc number of vchan */ > > }; > > > > /** > > @@ -100,17 +42,11 @@ struct rte_vhost_async_config { > > * vhost device id async channel to be attached to > > * @param queue_id > > * vhost queue id async channel to be attached to > > - * @param config > > - * Async channel configuration structure > > - * @param ops > > - * Async channel operation callbacks > > * @return > > * 0 on success, -1 on failures > > */ > > __rte_experimental > > -int rte_vhost_async_channel_register(int vid, uint16_t queue_id, > > - struct rte_vhost_async_config config, > > - struct rte_vhost_async_channel_ops *ops); > > +int rte_vhost_async_channel_register(int vid, uint16_t queue_id); > > > > /** > > * Unregister an async channel for a vhost queue > > @@ -136,17 +72,11 @@ int rte_vhost_async_channel_unregister(int vid, > uint16_t > > queue_id); > > * vhost device id async channel to be attached to > > * @param queue_id > > * vhost queue id async channel to be attached to > > - * @param config > > - * Async channel configuration > > - * @param ops > > - * Async channel operation callbacks > > * @return > > * 0 on success, -1 on failures > > */ > > __rte_experimental > > -int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > queue_id, > > - struct rte_vhost_async_config config, > > - struct rte_vhost_async_channel_ops *ops); > > +int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > > queue_id); > > > > /** > > * Unregister an async channel for a vhost queue without performing any > > @@ -179,12 +109,17 @@ int > rte_vhost_async_channel_unregister_thread_unsafe(int > > vid, > > * array of packets to be enqueued > > * @param count > > * packets num to be enqueued > > + * @param dma_id > > + * the 
identifier of the DMA device > > + * @param vchan > > + * the identifier of virtual DMA channel > > * @return > > * num of packets enqueued > > */ > > __rte_experimental > > uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > > - struct rte_mbuf **pkts, uint16_t count); > > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > > + uint16_t vchan); > > All dma_id in the API should be uint16_t. Otherwise you need to check if valid. Yes, you are right. Although dma_id is defined as int16_t and DMA library checks if it is valid, vhost doesn't handle DMA failure and we need to make sure dma_id is valid before using it. And even if vhost handles DMA error, a better place to check invalid dma_id is before passing it to DMA library too. I will add the check later. > > > > > /** > > * This function checks async completion status for a specific vhost > > @@ -199,12 +134,17 @@ uint16_t rte_vhost_submit_enqueue_burst(int > vid, > > uint16_t queue_id, > > * blank array to get return packet pointer > > * @param count > > * size of the packet array > > + * @param dma_id > > + * the identifier of the DMA device > > + * @param vchan > > + * the identifier of virtual DMA channel > > * @return > > * num of packets returned > > */ > > __rte_experimental > > uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > > - struct rte_mbuf **pkts, uint16_t count); > > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > > + uint16_t vchan); > > > > /** > > * This function returns the amount of in-flight packets for the vhost > > @@ -235,11 +175,32 @@ int rte_vhost_async_get_inflight(int vid, uint16_t > > queue_id); > > * Blank array to get return packet pointer > > * @param count > > * Size of the packet array > > + * @param dma_id > > + * the identifier of the DMA device > > + * @param vchan > > + * the identifier of virtual DMA channel > > * @return > > * Number of packets returned > > */ > > __rte_experimental > > uint16_t 
rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, > > - struct rte_mbuf **pkts, uint16_t count); > > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > > + uint16_t vchan); > > +/** > > + * The DMA vChannels used in asynchronous data path must be > configured > > + * first. So this function needs to be called before enabling DMA > > + * acceleration for vring. If this function fails, asynchronous data path > > + * cannot be enabled for any vring further. > > + * > > + * @param dmas > > + * DMA information > > + * @param count > > + * Element number of 'dmas' > > + * @return > > + * 0 on success, and -1 on failure > > + */ > > +__rte_experimental > > +int rte_vhost_async_dma_configure(struct rte_vhost_async_dma_info > *dmas, > > + uint16_t count); > > I think based on the current design, vhost can use every vchan if the user app lets it. > So max_desc and max_vchans can simply be obtained from the dmadev APIs? Then > there's > no need to introduce the new ABI struct rte_vhost_async_dma_info. Yes, there is no need to introduce struct rte_vhost_async_dma_info. We can either use struct rte_dma_info, which is suggested by Maxime, or query the dma library via the device id. Since dma device configuration is left to applications, I prefer to use rte_dma_info directly. What do you think? > > And about max_desc, I see in the dmadev lib that you can get a vchan's max_desc, > but you > may use a nb_desc (<= max_desc) to configure the vchan. And IIUC, vhost wants > to > know nb_desc instead of max_desc? True, nb_desc is better than max_desc. But the dma library doesn't provide a function to query nb_desc for every vchannel. And rte_dma_info cannot be used in rte_vhost_async_dma_configure() if vhost uses nb_desc. So the only way is to require users to provide nb_desc for every vchannel, and that would introduce a new struct. Is it really needed?
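The nb_desc question above can be sidestepped on the application side: since the application itself called rte_dma_vchan_setup() with a ring size of its own choosing, it already knows nb_desc and can hand that value to vhost. A minimal sketch, with a locally defined struct mirroring the rte_vhost_async_dma_info proposed in the patch; the helper make_arg() and the DMA_RING_SIZE value are illustrative assumptions, not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Locally mirrors the rte_vhost_async_dma_info struct from the patch. */
struct async_dma_info_arg {
	int16_t dev_id;       /* DMA device ID */
	uint16_t max_vchans;  /* number of vchans the app configured */
	uint16_t max_desc;    /* the app's own nb_desc, not the HW limit */
};

/* Illustrative: the ring size the app passed to rte_dma_vchan_setup(). */
#define DMA_RING_SIZE 4096

/* make_arg() is a hypothetical helper: because the application configured
 * the device itself, it can fill in nb_desc without querying dmadev. */
static struct async_dma_info_arg make_arg(int16_t dev_id)
{
	struct async_dma_info_arg a = {
		.dev_id = dev_id,
		.max_vchans = 1,           /* one vchan, as in the example app */
		.max_desc = DMA_RING_SIZE, /* ring size actually configured */
	};
	return a;
}
```

This keeps the configured ring size in one place on the application side, so no new query API is needed in the dma library.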
> > > > > #endif /* _RTE_VHOST_ASYNC_H_ */ > > diff --git a/lib/vhost/version.map b/lib/vhost/version.map > > index a7ef7f1976..1202ba9c1a 100644 > > --- a/lib/vhost/version.map > > +++ b/lib/vhost/version.map > > @@ -84,6 +84,9 @@ EXPERIMENTAL { > > > > # added in 21.11 > > rte_vhost_get_monitor_addr; > > + > > + # added in 22.03 > > + rte_vhost_async_dma_configure; > > }; > > > > INTERNAL { > > diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c > > index 13a9bb9dd1..32f37f4851 100644 > > --- a/lib/vhost/vhost.c > > +++ b/lib/vhost/vhost.c > > @@ -344,6 +344,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq) > > return; > > > > rte_free(vq->async->pkts_info); > > + rte_free(vq->async->pkts_cmpl_flag); > > > > rte_free(vq->async->buffers_packed); > > vq->async->buffers_packed = NULL; > > @@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid, > > } > > > > static __rte_always_inline int > > -async_channel_register(int vid, uint16_t queue_id, > > - struct rte_vhost_async_channel_ops *ops) > > +async_channel_register(int vid, uint16_t queue_id) > > { > > struct virtio_net *dev = get_device(vid); > > struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; > > @@ -1656,6 +1656,14 @@ async_channel_register(int vid, uint16_t > queue_id, > > goto out_free_async; > > } > > > > + async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size * > sizeof(bool), > > + RTE_CACHE_LINE_SIZE, node); > > + if (!async->pkts_cmpl_flag) { > > + VHOST_LOG_CONFIG(ERR, "failed to allocate async > pkts_cmpl_flag > > (vid %d, qid: %d)\n", > > + vid, queue_id); > > qid: %u > > > + goto out_free_async; > > + } > > + > > if (vq_is_packed(dev)) { > > async->buffers_packed = rte_malloc_socket(NULL, > > vq->size * sizeof(struct > vring_used_elem_packed), > > @@ -1676,9 +1684,6 @@ async_channel_register(int vid, uint16_t > queue_id, > > } > > } > > > > - async->ops.check_completed_copies = ops- > >check_completed_copies; > > - async->ops.transfer_data = ops->transfer_data; > > - > > 
vq->async = async; > > > > return 0; > > @@ -1691,15 +1696,13 @@ async_channel_register(int vid, uint16_t > queue_id, > > } > > > > int > > -rte_vhost_async_channel_register(int vid, uint16_t queue_id, > > - struct rte_vhost_async_config config, > > - struct rte_vhost_async_channel_ops *ops) > > +rte_vhost_async_channel_register(int vid, uint16_t queue_id) > > { > > struct vhost_virtqueue *vq; > > struct virtio_net *dev = get_device(vid); > > int ret; > > > > - if (dev == NULL || ops == NULL) > > + if (dev == NULL) > > return -1; > > > > if (queue_id >= VHOST_MAX_VRING) > > @@ -1710,33 +1713,20 @@ rte_vhost_async_channel_register(int vid, > uint16_t > > queue_id, > > if (unlikely(vq == NULL || !dev->async_copy)) > > return -1; > > > > - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { > > - VHOST_LOG_CONFIG(ERR, > > - "async copy is not supported on non-inorder mode " > > - "(vid %d, qid: %d)\n", vid, queue_id); > > - return -1; > > - } > > - > > - if (unlikely(ops->check_completed_copies == NULL || > > - ops->transfer_data == NULL)) > > - return -1; > > - > > rte_spinlock_lock(&vq->access_lock); > > - ret = async_channel_register(vid, queue_id, ops); > > + ret = async_channel_register(vid, queue_id); > > rte_spinlock_unlock(&vq->access_lock); > > > > return ret; > > } > > > > int > > -rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > queue_id, > > - struct rte_vhost_async_config config, > > - struct rte_vhost_async_channel_ops *ops) > > +rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > queue_id) > > { > > struct vhost_virtqueue *vq; > > struct virtio_net *dev = get_device(vid); > > > > - if (dev == NULL || ops == NULL) > > + if (dev == NULL) > > return -1; > > > > if (queue_id >= VHOST_MAX_VRING) > > @@ -1747,18 +1737,7 @@ > rte_vhost_async_channel_register_thread_unsafe(int vid, > > uint16_t queue_id, > > if (unlikely(vq == NULL || !dev->async_copy)) > > return -1; > > > > - if (unlikely(!(config.features & 
RTE_VHOST_ASYNC_INORDER))) { > > - VHOST_LOG_CONFIG(ERR, > > - "async copy is not supported on non-inorder mode " > > - "(vid %d, qid: %d)\n", vid, queue_id); > > - return -1; > > - } > > - > > - if (unlikely(ops->check_completed_copies == NULL || > > - ops->transfer_data == NULL)) > > - return -1; > > - > > - return async_channel_register(vid, queue_id, ops); > > + return async_channel_register(vid, queue_id); > > } > > > > int > > @@ -1835,6 +1814,83 @@ > rte_vhost_async_channel_unregister_thread_unsafe(int > > vid, uint16_t queue_id) > > return 0; > > } > > > > +static __rte_always_inline void > > +vhost_free_async_dma_mem(void) > > +{ > > + uint16_t i; > > + > > + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) { > > + struct async_dma_info *dma = &dma_copy_track[i]; > > + int16_t j; > > + > > + if (dma->max_vchans == 0) { > > + continue; > > + } > > + > > + for (j = 0; j < dma->max_vchans; j++) { > > + rte_free(dma->vchans[j].metadata); > > + } > > + rte_free(dma->vchans); > > + dma->vchans = NULL; > > + dma->max_vchans = 0; > > + } > > +} > > + > > +int > > +rte_vhost_async_dma_configure(struct rte_vhost_async_dma_info *dmas, > uint16_t > > count) > > +{ > > + uint16_t i; > > + > > + if (!dmas) { > > + VHOST_LOG_CONFIG(ERR, "Invalid DMA configuration > parameter.\n"); > > + return -1; > > + } > > + > > + for (i = 0; i < count; i++) { > > + struct async_dma_vchan_info *vchans; > > + int16_t dev_id; > > + uint16_t max_vchans; > > + uint16_t max_desc; > > + uint16_t j; > > + > > + dev_id = dmas[i].dev_id; > > + max_vchans = dmas[i].max_vchans; > > + max_desc = dmas[i].max_desc; > > + > > + if (!rte_is_power_of_2(max_desc)) { > > + max_desc = rte_align32pow2(max_desc); > > + } > > I think when aligning to a power of 2, it should not exceed max_desc? The aligned max_desc is used to allocate the context tracking array. We only need to guarantee that the array size for every vchannel is >= max_desc, so it is OK to have a greater array size than max_desc. 
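The reply above can be sketched in plain C. Here align32pow2() is a local stand-in for DPDK's rte_align32pow2(): it rounds the ring size up to the next power of two, so a cheap bitwise mask can replace a modulo when indexing the tracking array; because rounding only ever grows the array, every slot up to max_desc stays addressable:

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for rte_align32pow2(): round v up to the next power of two. */
static uint32_t align32pow2(uint32_t v)
{
	v--;
	v |= v >> 1;
	v |= v >> 2;
	v |= v >> 4;
	v |= v >> 8;
	v |= v >> 16;
	return v + 1;
}

/* With a power-of-two ring size, "idx & ring_mask" wraps the same way as
 * "idx % ring_size" but without a division. */
static uint16_t ring_next(uint16_t idx, uint16_t ring_mask)
{
	return (idx + 1) & ring_mask;
}
```

For example, a max_desc of 3000 yields a 4096-entry tracking array with mask 4095, which is larger than, and therefore safe for, the configured descriptor count.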
> And based on the above comment, if this max_desc is the nb_desc configured for the > vchan, you should just make sure that nb_desc is a power of 2. > > > + > > + vchans = rte_zmalloc(NULL, sizeof(struct > async_dma_vchan_info) * > > max_vchans, > > + RTE_CACHE_LINE_SIZE); > > + if (vchans == NULL) { > > + VHOST_LOG_CONFIG(ERR, "Failed to allocate vchans > for dma- > > %d." > > + " Cannot enable async data-path.\n", > dev_id); > > + vhost_free_async_dma_mem(); > > + return -1; > > + } > > + > > + for (j = 0; j < max_vchans; j++) { > > + vchans[j].metadata = rte_zmalloc(NULL, sizeof(bool *) > * > > max_desc, > > + RTE_CACHE_LINE_SIZE); > > + if (!vchans[j].metadata) { > > + VHOST_LOG_CONFIG(ERR, "Failed to allocate > metadata for > > " > > + "dma-%d vchan-%u\n", > dev_id, j); > > + vhost_free_async_dma_mem(); > > + return -1; > > + } > > + > > + vchans[j].ring_size = max_desc; > > + vchans[j].ring_mask = max_desc - 1; > > + } > > + > > + dma_copy_track[dev_id].vchans = vchans; > > + dma_copy_track[dev_id].max_vchans = max_vchans; > > + } > > + > > + return 0; > > +} > > + > > int > > rte_vhost_async_get_inflight(int vid, uint16_t queue_id) > > { > > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h > > index 7085e0885c..d9bda34e11 100644 > > --- a/lib/vhost/vhost.h > > +++ b/lib/vhost/vhost.h > > @@ -19,6 +19,7 @@ > > #include <rte_ether.h> > > #include <rte_rwlock.h> > > #include <rte_malloc.h> > > +#include <rte_dmadev.h> > > > > #include "rte_vhost.h" > > #include "rte_vdpa.h" > > @@ -50,6 +51,7 @@ > > > > #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) > > #define VHOST_MAX_ASYNC_VEC 2048 > > +#define VHOST_ASYNC_DMA_BATCHING_SIZE 32 > > > > #define PACKED_DESC_ENQUEUE_USED_FLAG(w) \ > > ((w) ?
(VRING_DESC_F_AVAIL | VRING_DESC_F_USED | > VRING_DESC_F_WRITE) : \ > > @@ -119,6 +121,41 @@ struct vring_used_elem_packed { > > uint32_t count; > > }; > > > > +struct async_dma_vchan_info { > > + /* circular array to track copy metadata */ > > + bool **metadata; > > If the metadata will only be flags, maybe just use a > name like XXX_flag. Sure, I will rename it. > > > + > > + /* max elements in 'metadata' */ > > + uint16_t ring_size; > > + /* ring index mask for 'metadata' */ > > + uint16_t ring_mask; > > + > > + /* batching copies before a DMA doorbell */ > > + uint16_t nr_batching; > > + > > + /** > > + * DMA virtual channel lock. Although it is possible to bind DMA > > + * virtual channels to data plane threads, the vhost control plane > > + * thread could call data plane functions too, thus causing > > + * DMA device contention. > > + * > > + * For example, in the VM exit case, the vhost control plane thread needs > > + * to clear in-flight packets before disabling the vring, but there could > > + * be another data plane thread enqueuing packets to the same > > + * vring with the same DMA virtual channel. But dmadev PMD > functions > > + * are lock-free, so the control plane and data plane threads > > + * could operate on the same DMA virtual channel at the same time. 
> > + */ > > + rte_spinlock_t dma_lock; > > +}; > > + > > +struct async_dma_info { > > + uint16_t max_vchans; > > + struct async_dma_vchan_info *vchans; > > +}; > > + > > +extern struct async_dma_info > dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > > + > > /** > > * inflight async packet information > > */ > > @@ -129,9 +166,6 @@ struct async_inflight_info { > > }; > > > > struct vhost_async { > > - /* operation callbacks for DMA */ > > - struct rte_vhost_async_channel_ops ops; > > - > > struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; > > struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; > > uint16_t iter_idx; > > @@ -139,6 +173,19 @@ struct vhost_async { > > > > /* data transfer status */ > > struct async_inflight_info *pkts_info; > > + /** > > + * packet reorder array. "true" indicates that DMA > > + * device completes all copies for the packet. > > + * > > + * Note that this array could be written by multiple > > + * threads at the same time. For example, two threads > > + * enqueue packets to the same virtqueue with their > > + * own DMA devices. However, since offloading is > > + * per-packet basis, each packet flag will only be > > + * written by one thread. And single byte write is > > + * atomic, so no lock is needed. > > + */ > > + bool *pkts_cmpl_flag; > > uint16_t pkts_idx; > > uint16_t pkts_inflight_n; > > union { > > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c > > index b3d954aab4..9f81fc9733 100644 > > --- a/lib/vhost/virtio_net.c > > +++ b/lib/vhost/virtio_net.c > > @@ -11,6 +11,7 @@ > > #include <rte_net.h> > > #include <rte_ether.h> > > #include <rte_ip.h> > > +#include <rte_dmadev.h> > > #include <rte_vhost.h> > > #include <rte_tcp.h> > > #include <rte_udp.h> > > @@ -25,6 +26,9 @@ > > > > #define MAX_BATCH_LEN 256 > > > > +/* DMA device copy operation tracking array. 
*/ > > +struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > > + > > static __rte_always_inline bool > > rxvq_is_mergeable(struct virtio_net *dev) > > { > > @@ -43,6 +47,108 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, > uint32_t > > nr_vring) > > return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring; > > } > > > > +static __rte_always_inline uint16_t > > +vhost_async_dma_transfer(struct vhost_virtqueue *vq, int16_t dma_id, > > + uint16_t vchan, uint16_t head_idx, > > + struct rte_vhost_iov_iter *pkts, uint16_t nr_pkts) > > +{ > > + struct async_dma_vchan_info *dma_info = > > &dma_copy_track[dma_id].vchans[vchan]; > > + uint16_t ring_mask = dma_info->ring_mask; > > + uint16_t pkt_idx; > > + > > + rte_spinlock_lock(&dma_info->dma_lock); > > + > > + for (pkt_idx = 0; pkt_idx < nr_pkts; pkt_idx++) { > > + struct rte_vhost_iovec *iov = pkts[pkt_idx].iov; > > + int copy_idx = 0; > > + uint16_t nr_segs = pkts[pkt_idx].nr_segs; > > + uint16_t i; > > + > > + if (rte_dma_burst_capacity(dma_id, vchan) < nr_segs) { > > + goto out; > > + } > > + > > + for (i = 0; i < nr_segs; i++) { > > + /** > > + * We have checked the available space before > submit copies > > to DMA > > + * vChannel, so we don't handle error here. > > + */ > > + copy_idx = rte_dma_copy(dma_id, vchan, > > (rte_iova_t)iov[i].src_addr, > > + (rte_iova_t)iov[i].dst_addr, iov[i].len, > > + RTE_DMA_OP_FLAG_LLC); > > This assumes rte_dma_copy will always succeed if there's available space. > > But the API doxygen says: > > * @return > * - 0..UINT16_MAX: index of enqueued job. > * - -ENOSPC: if no space left to enqueue. > * - other values < 0 on failure. > > So it should consider other vendor-specific errors. Error handling is not free here. Specifically, SW fallback is a way to handle failed copy operations. But it requires vhost to track VA for every source and destination buffer for every copy. DMA library uses IOVA, so vhost only prepares IOVA for copies of every packet in async data-path. 
In the case of IOVA as PA, the prepared IOVAs cannot be used for the SW fallback, which means vhost would need to store the VA for every copy of every packet too, even when no error will ever happen or IOVA is VA. Given that the only usable DMA engines in vhost today are CBDMA and DSA, is it worth the cost for "future HW"? If other vendors' HW appears in the future, is it OK to add the support later? Or is there any way to get the VA from an IOVA? Thanks, Jiayu > > Thanks, > Chenbo > > ^ permalink raw reply [flat|nested] 31+ messages in thread
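The error-handling debate above (ring-full versus vendor-specific failure) can be made concrete. In the sketch below, dma_copy_mock() is a local stand-in for rte_dma_copy(), backed by a fixed-capacity counter: it returns a job index on success, -ENOSPC when the ring is full, or another negative value on a hard failure. The loop stops batching on -ENOSPC, leaving remaining segments for a later retry, and only treats other negative returns as errors:

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for rte_dma_copy(): success returns the job index,
 * -ENOSPC signals a full ring. Four slots are available here. */
static int ring_slots = 4;

static int dma_copy_mock(void)
{
	if (ring_slots == 0)
		return -ENOSPC;
	return 4 - ring_slots--; /* job index 0..3 */
}

/* Submission loop sketch: -ENOSPC means "ring full, retry later",
 * any other negative return is a vendor-specific hard failure. */
static int submit(int nr_segs, int *submitted)
{
	*submitted = 0;
	for (int i = 0; i < nr_segs; i++) {
		int ret = dma_copy_mock();
		if (ret == -ENOSPC)
			break;      /* leave the rest for a later call */
		if (ret < 0)
			return ret; /* hard failure: caller must handle */
		(*submitted)++;
	}
	return 0;
}
```

Distinguishing the two cases keeps the cheap "retry later" path separate from the expensive SW-fallback question discussed above.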
* RE: [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath 2022-01-17 5:39 ` Hu, Jiayu @ 2022-01-19 2:18 ` Xia, Chenbo 0 siblings, 0 replies; 31+ messages in thread From: Xia, Chenbo @ 2022-01-19 2:18 UTC (permalink / raw) To: Hu, Jiayu, dev Cc: maxime.coquelin, i.maximets, Richardson, Bruce, Van Haaren, Harry, Pai G, Sunil, Mcnamara, John, Ding, Xuan, Jiang, Cheng1, liangma > -----Original Message----- > From: Hu, Jiayu <jiayu.hu@intel.com> > Sent: Monday, January 17, 2022 1:40 PM > To: Xia, Chenbo <chenbo.xia@intel.com>; dev@dpdk.org > Cc: maxime.coquelin@redhat.com; i.maximets@ovn.org; Richardson, Bruce > <bruce.richardson@intel.com>; Van Haaren, Harry <harry.van.haaren@intel.com>; > Pai G, Sunil <sunil.pai.g@intel.com>; Mcnamara, John <john.mcnamara@intel.com>; > Ding, Xuan <xuan.ding@intel.com>; Jiang, Cheng1 <cheng1.jiang@intel.com>; > liangma@liangbit.com > Subject: RE: [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath > > Hi Chenbo, > > Please see replies inline. 
> > Thanks, > Jiayu > > > -----Original Message----- > > From: Xia, Chenbo <chenbo.xia@intel.com> > > > diff --git a/examples/vhost/main.c b/examples/vhost/main.c > > > index 33d023aa39..44073499bc 100644 > > > --- a/examples/vhost/main.c > > > +++ b/examples/vhost/main.c > > > @@ -24,8 +24,9 @@ > > > #include <rte_ip.h> > > > #include <rte_tcp.h> > > > #include <rte_pause.h> > > > +#include <rte_dmadev.h> > > > +#include <rte_vhost_async.h> > > > > > > -#include "ioat.h" > > > #include "main.h" > > > > > > #ifndef MAX_QUEUES > > > @@ -56,6 +57,14 @@ > > > #define RTE_TEST_TX_DESC_DEFAULT 512 > > > > > > #define INVALID_PORT_ID 0xFF > > > +#define INVALID_DMA_ID -1 > > > + > > > +#define MAX_VHOST_DEVICE 1024 > > > +#define DMA_RING_SIZE 4096 > > > + > > > +struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; > > > +struct rte_vhost_async_dma_info > > dma_config[RTE_DMADEV_DEFAULT_MAX]; > > > +static int dma_count; > > > > > > /* mask of enabled ports */ > > > static uint32_t enabled_port_mask = 0; > > > @@ -96,8 +105,6 @@ static int builtin_net_driver; > > > > > > static int async_vhost_driver; > > > > > > -static char *dma_type; > > > - > > > /* Specify timeout (in useconds) between retries on RX. */ > > > static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US; > > > /* Specify the number of retries on RX. 
*/ > > > @@ -196,13 +203,134 @@ struct vhost_bufftable > > *vhost_txbuff[RTE_MAX_LCORE * > > > MAX_VHOST_DEVICE]; > > > #define MBUF_TABLE_DRAIN_TSC((rte_get_tsc_hz() + US_PER_S - 1) \ > > > / US_PER_S * BURST_TX_DRAIN_US) > > > > > > +static inline bool > > > +is_dma_configured(int16_t dev_id) > > > +{ > > > +int i; > > > + > > > +for (i = 0; i < dma_count; i++) { > > > +if (dma_config[i].dev_id == dev_id) { > > > +return true; > > > +} > > > +} > > > +return false; > > > +} > > > + > > > static inline int > > > open_dma(const char *value) > > > { > > > -if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) > > > -return open_ioat(value); > > > +struct dma_for_vhost *dma_info = dma_bind; > > > +char *input = strndup(value, strlen(value) + 1); > > > +char *addrs = input; > > > +char *ptrs[2]; > > > +char *start, *end, *substr; > > > +int64_t vid, vring_id; > > > + > > > +struct rte_dma_info info; > > > +struct rte_dma_conf dev_config = { .nb_vchans = 1 }; > > > +struct rte_dma_vchan_conf qconf = { > > > +.direction = RTE_DMA_DIR_MEM_TO_MEM, > > > +.nb_desc = DMA_RING_SIZE > > > +}; > > > + > > > +int dev_id; > > > +int ret = 0; > > > +uint16_t i = 0; > > > +char *dma_arg[MAX_VHOST_DEVICE]; > > > +int args_nr; > > > + > > > +while (isblank(*addrs)) > > > +addrs++; > > > +if (*addrs == '\0') { > > > +ret = -1; > > > +goto out; > > > +} > > > + > > > +/* process DMA devices within bracket. 
*/ > > > +addrs++; > > > +substr = strtok(addrs, ";]"); > > > +if (!substr) { > > > +ret = -1; > > > +goto out; > > > +} > > > + > > > +args_nr = rte_strsplit(substr, strlen(substr), > > > +dma_arg, MAX_VHOST_DEVICE, ','); > > > +if (args_nr <= 0) { > > > +ret = -1; > > > +goto out; > > > +} > > > + > > > +while (i < args_nr) { > > > +char *arg_temp = dma_arg[i]; > > > +uint8_t sub_nr; > > > + > > > +sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); > > > +if (sub_nr != 2) { > > > +ret = -1; > > > +goto out; > > > +} > > > + > > > +start = strstr(ptrs[0], "txd"); > > > +if (start == NULL) { > > > +ret = -1; > > > +goto out; > > > +} > > > + > > > +start += 3; > > > +vid = strtol(start, &end, 0); > > > +if (end == start) { > > > +ret = -1; > > > +goto out; > > > +} > > > + > > > +vring_id = 0 + VIRTIO_RXQ; > > > > No need to introduce vring_id, it's always VIRTIO_RXQ > > I will remove it later. > > > > > > + > > > +dev_id = rte_dma_get_dev_id_by_name(ptrs[1]); > > > +if (dev_id < 0) { > > > +RTE_LOG(ERR, VHOST_CONFIG, "Fail to find > > DMA %s.\n", > > > ptrs[1]); > > > +ret = -1; > > > +goto out; > > > +} else if (is_dma_configured(dev_id)) { > > > +goto done; > > > +} > > > + > > > > Please call rte_dma_info_get before configure to make sure > > info.max_vchans >=1 > > Do you suggest using "rte_dma_info_get() and info.max_vchans=0" to indicate > the device is not configured, rather than using is_dma_configured()? No, I mean when you configure the dmadev with one vchan, make sure it does have at least one vchan, even though the 'vchan == 0' case can hardly happen. Just like the function call sequence in test_dmadev_instance in test_dmadev.c. 
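The bring-up order Chenbo refers to can be sketched with stand-ins: info_get(), configure(), vchan_setup(), and start() below are local mocks for the corresponding rte_dma_* calls, and only the call sequence plus the max_vchans check from the review comment are modeled:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of a dmadev: capability limit vs. configured state. */
struct dma_info { uint16_t max_vchans; uint16_t nb_vchans; };

static struct dma_info dev = { .max_vchans = 2, .nb_vchans = 0 };

static int info_get(struct dma_info *info) { *info = dev; return 0; }
static int configure(uint16_t nb_vchans) { dev.nb_vchans = nb_vchans; return 0; }
static int vchan_setup(uint16_t vchan) { return vchan < dev.nb_vchans ? 0 : -1; }
static int start(void) { return dev.nb_vchans ? 0 : -1; }

/* Bring-up order suggested in the review: query capabilities first,
 * refuse devices that cannot provide a single vchan, then
 * configure -> vchan_setup -> start. */
static int bring_up(void)
{
	struct dma_info info;

	if (info_get(&info) != 0 || info.max_vchans < 1)
		return -1; /* device cannot provide even one vchan */
	if (configure(1) != 0)
		return -1;
	if (vchan_setup(0) != 0)
		return -1;
	return start();
}
```

Checking max_vchans before configuring makes the "device has no queues" failure explicit instead of discovering it after configuration, which mirrors the sequence used in test_dmadev.c.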
> > > > > > +if (rte_dma_configure(dev_id, &dev_config) != 0) { > > > +RTE_LOG(ERR, VHOST_CONFIG, "Fail to configure > > DMA %d.\n", > > > dev_id); > > > +ret = -1; > > > +goto out; > > > +} > > > + > > > +if (rte_dma_vchan_setup(dev_id, 0, &qconf) != 0) { > > > +RTE_LOG(ERR, VHOST_CONFIG, "Fail to set up > > DMA %d.\n", > > > dev_id); > > > +ret = -1; > > > +goto out; > > > +} > > > > > > -return -1; > > > +rte_dma_info_get(dev_id, &info); > > > +if (info.nb_vchans != 1) { > > > +RTE_LOG(ERR, VHOST_CONFIG, "DMA %d has no > > queues.\n", > > > dev_id); > > > > Then the above means the number of vchan is not configured. > > > > > +ret = -1; > > > +goto out; > > > +} > > > + > > > +if (rte_dma_start(dev_id) != 0) { > > > +RTE_LOG(ERR, VHOST_CONFIG, "Fail to start > > DMA %u.\n", > > > dev_id); > > > +ret = -1; > > > +goto out; > > > +} > > > + > > > +dma_config[dma_count].dev_id = dev_id; > > > +dma_config[dma_count].max_vchans = 1; > > > +dma_config[dma_count++].max_desc = DMA_RING_SIZE; > > > + > > > +done: > > > +(dma_info + vid)->dmas[vring_id].dev_id = dev_id; > > > +i++; > > > +} > > > +out: > > > +free(input); > > > +return ret; > > > } > > > > > > /* > > > @@ -500,8 +628,6 @@ enum { > > > OPT_CLIENT_NUM, > > > #define OPT_BUILTIN_NET_DRIVER "builtin-net-driver" > > > OPT_BUILTIN_NET_DRIVER_NUM, > > > -#define OPT_DMA_TYPE "dma-type" > > > -OPT_DMA_TYPE_NUM, > > > #define OPT_DMAS "dmas" > > > OPT_DMAS_NUM, > > > }; > > > @@ -539,8 +665,6 @@ us_vhost_parse_args(int argc, char **argv) > > > NULL, OPT_CLIENT_NUM}, > > > {OPT_BUILTIN_NET_DRIVER, no_argument, > > > NULL, OPT_BUILTIN_NET_DRIVER_NUM}, > > > -{OPT_DMA_TYPE, required_argument, > > > -NULL, OPT_DMA_TYPE_NUM}, > > > {OPT_DMAS, required_argument, > > > NULL, OPT_DMAS_NUM}, > > > {NULL, 0, 0, 0}, > > > @@ -661,10 +785,6 @@ us_vhost_parse_args(int argc, char **argv) > > > } > > > break; > > > > > > -case OPT_DMA_TYPE_NUM: > > > -dma_type = optarg; > > > -break; > > > - > > > case OPT_DMAS_NUM: > > > if 
(open_dma(optarg) == -1) { > > > RTE_LOG(INFO, VHOST_CONFIG, > > > @@ -841,9 +961,10 @@ complete_async_pkts(struct vhost_dev *vdev) > > > { > > > struct rte_mbuf *p_cpl[MAX_PKT_BURST]; > > > uint16_t complete_count; > > > +int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; > > > > > > complete_count = rte_vhost_poll_enqueue_completed(vdev->vid, > > > -VIRTIO_RXQ, p_cpl, > > MAX_PKT_BURST); > > > +VIRTIO_RXQ, p_cpl, MAX_PKT_BURST, > > dma_id, 0); > > > if (complete_count) { > > > free_pkts(p_cpl, complete_count); > > > __atomic_sub_fetch(&vdev->pkts_inflight, complete_count, > > > __ATOMIC_SEQ_CST); > > > @@ -883,11 +1004,12 @@ drain_vhost(struct vhost_dev *vdev) > > > > > > if (builtin_net_driver) { > > > ret = vs_enqueue_pkts(vdev, VIRTIO_RXQ, m, nr_xmit); > > > -} else if (async_vhost_driver) { > > > +} else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { > > > uint16_t enqueue_fail = 0; > > > +int16_t dma_id = dma_bind[vdev- > > >vid].dmas[VIRTIO_RXQ].dev_id; > > > > > > complete_async_pkts(vdev); > > > -ret = rte_vhost_submit_enqueue_burst(vdev->vid, > > VIRTIO_RXQ, m, > > > nr_xmit); > > > +ret = rte_vhost_submit_enqueue_burst(vdev->vid, > > VIRTIO_RXQ, m, > > > nr_xmit, dma_id, 0); > > > __atomic_add_fetch(&vdev->pkts_inflight, ret, > > __ATOMIC_SEQ_CST); > > > > > > enqueue_fail = nr_xmit - ret; > > > @@ -905,7 +1027,7 @@ drain_vhost(struct vhost_dev *vdev) > > > __ATOMIC_SEQ_CST); > > > } > > > > > > -if (!async_vhost_driver) > > > +if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) > > > free_pkts(m, nr_xmit); > > > } > > > > > > @@ -1211,12 +1333,13 @@ drain_eth_rx(struct vhost_dev *vdev) > > > if (builtin_net_driver) { > > > enqueue_count = vs_enqueue_pkts(vdev, VIRTIO_RXQ, > > > pkts, rx_count); > > > -} else if (async_vhost_driver) { > > > +} else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { > > > uint16_t enqueue_fail = 0; > > > +int16_t dma_id = dma_bind[vdev- > > >vid].dmas[VIRTIO_RXQ].dev_id; > > > > > > 
complete_async_pkts(vdev); > > > enqueue_count = rte_vhost_submit_enqueue_burst(vdev- > > >vid, > > > -VIRTIO_RXQ, pkts, rx_count); > > > +VIRTIO_RXQ, pkts, rx_count, dma_id, > > 0); > > > __atomic_add_fetch(&vdev->pkts_inflight, enqueue_count, > > > __ATOMIC_SEQ_CST); > > > > > > enqueue_fail = rx_count - enqueue_count; > > > @@ -1235,7 +1358,7 @@ drain_eth_rx(struct vhost_dev *vdev) > > > __ATOMIC_SEQ_CST); > > > } > > > > > > -if (!async_vhost_driver) > > > +if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) > > > free_pkts(pkts, rx_count); > > > } > > > > > > @@ -1387,18 +1510,20 @@ destroy_device(int vid) > > > "(%d) device has been removed from data core\n", > > > vdev->vid); > > > > > > -if (async_vhost_driver) { > > > +if (dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled) { > > > uint16_t n_pkt = 0; > > > +int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; > > > struct rte_mbuf *m_cpl[vdev->pkts_inflight]; > > > > > > while (vdev->pkts_inflight) { > > > n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, > > VIRTIO_RXQ, > > > -m_cpl, vdev->pkts_inflight); > > > +m_cpl, vdev->pkts_inflight, > > dma_id, 0); > > > free_pkts(m_cpl, n_pkt); > > > __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, > > > __ATOMIC_SEQ_CST); > > > } > > > > > > rte_vhost_async_channel_unregister(vid, VIRTIO_RXQ); > > > +dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = false; > > > } > > > > > > rte_free(vdev); > > > @@ -1468,20 +1593,14 @@ new_device(int vid) > > > "(%d) device has been added to data core %d\n", > > > vid, vdev->coreid); > > > > > > -if (async_vhost_driver) { > > > -struct rte_vhost_async_config config = {0}; > > > -struct rte_vhost_async_channel_ops channel_ops; > > > - > > > -if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) { > > > -channel_ops.transfer_data = ioat_transfer_data_cb; > > > -channel_ops.check_completed_copies = > > > -ioat_check_completed_copies_cb; > > > - > > > -config.features = RTE_VHOST_ASYNC_INORDER; > > > +if 
(dma_bind[vid].dmas[VIRTIO_RXQ].dev_id != INVALID_DMA_ID) { > > > +int ret; > > > > > > -return rte_vhost_async_channel_register(vid, > > VIRTIO_RXQ, > > > -config, &channel_ops); > > > +ret = rte_vhost_async_channel_register(vid, VIRTIO_RXQ); > > > +if (ret == 0) { > > > +dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = > > true; > > > } > > > +return ret; > > > } > > > > > > return 0; > > > @@ -1502,14 +1621,15 @@ vring_state_changed(int vid, uint16_t > > queue_id, int > > > enable) > > > if (queue_id != VIRTIO_RXQ) > > > return 0; > > > > > > -if (async_vhost_driver) { > > > +if (dma_bind[vid].dmas[queue_id].async_enabled) { > > > if (!enable) { > > > uint16_t n_pkt = 0; > > > +int16_t dma_id = > > dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; > > > struct rte_mbuf *m_cpl[vdev->pkts_inflight]; > > > > > > while (vdev->pkts_inflight) { > > > n_pkt = > > rte_vhost_clear_queue_thread_unsafe(vid, > > > queue_id, > > > -m_cpl, vdev- > > >pkts_inflight); > > > +m_cpl, vdev- > > >pkts_inflight, dma_id, > > > 0); > > > free_pkts(m_cpl, n_pkt); > > > __atomic_sub_fetch(&vdev->pkts_inflight, > > n_pkt, > > > __ATOMIC_SEQ_CST); > > > } > > > @@ -1657,6 +1777,25 @@ create_mbuf_pool(uint16_t nr_port, uint32_t > > > nr_switch_core, uint32_t mbuf_size, > > > rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n"); > > > } > > > > > > +static void > > > +init_dma(void) > > > +{ > > > +int i; > > > + > > > +for (i = 0; i < MAX_VHOST_DEVICE; i++) { > > > +int j; > > > + > > > +for (j = 0; j < RTE_MAX_QUEUES_PER_PORT * 2; j++) { > > > +dma_bind[i].dmas[j].dev_id = INVALID_DMA_ID; > > > +dma_bind[i].dmas[j].async_enabled = false; > > > +} > > > +} > > > + > > > +for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) { > > > +dma_config[i].dev_id = INVALID_DMA_ID; > > > +} > > > +} > > > + > > > /* > > > * Main function, does initialisation and calls the per-lcore functions. 
> > > */ > > > @@ -1679,6 +1818,9 @@ main(int argc, char *argv[]) > > > argc -= ret; > > > argv += ret; > > > > > > +/* initialize dma structures */ > > > +init_dma(); > > > + > > > /* parse app arguments */ > > > ret = us_vhost_parse_args(argc, argv); > > > if (ret < 0) > > > @@ -1754,6 +1896,20 @@ main(int argc, char *argv[]) > > > if (client_mode) > > > flags |= RTE_VHOST_USER_CLIENT; > > > > > > +if (async_vhost_driver) { > > > +if (rte_vhost_async_dma_configure(dma_config, dma_count) > > < 0) { > > > +RTE_LOG(ERR, VHOST_PORT, "Failed to configure > > DMA in > > > vhost.\n"); > > > +for (i = 0; i < dma_count; i++) { > > > +if (dma_config[i].dev_id != INVALID_DMA_ID) > > { > > > +rte_dma_stop(dma_config[i].dev_id); > > > +dma_config[i].dev_id = > > INVALID_DMA_ID; > > > +} > > > +} > > > +dma_count = 0; > > > +async_vhost_driver = false; > > > +} > > > +} > > > + > > > /* Register vhost user driver to handle vhost messages. */ > > > for (i = 0; i < nb_sockets; i++) { > > > char *file = socket_files + i * PATH_MAX; > > > diff --git a/examples/vhost/main.h b/examples/vhost/main.h > > > index e7b1ac60a6..b4a453e77e 100644 > > > --- a/examples/vhost/main.h > > > +++ b/examples/vhost/main.h > > > @@ -8,6 +8,7 @@ > > > #include <sys/queue.h> > > > > > > #include <rte_ether.h> > > > +#include <rte_pci.h> > > > > > > /* Macros for printing using RTE_LOG */ > > > #define RTE_LOGTYPE_VHOST_CONFIG RTE_LOGTYPE_USER1 > > > @@ -79,6 +80,16 @@ struct lcore_info { > > > struct vhost_dev_tailq_list vdev_list; > > > }; > > > > > > +struct dma_info { > > > +struct rte_pci_addr addr; > > > +int16_t dev_id; > > > +bool async_enabled; > > > +}; > > > + > > > +struct dma_for_vhost { > > > +struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; > > > +}; > > > + > > > /* we implement non-extra virtio net features */ > > > #define VIRTIO_NET_FEATURES0 > > > > > > diff --git a/examples/vhost/meson.build b/examples/vhost/meson.build > > > index 3efd5e6540..87a637f83f 100644 > > > --- 
a/examples/vhost/meson.build > > > +++ b/examples/vhost/meson.build > > > @@ -12,13 +12,9 @@ if not is_linux > > > endif > > > > > > deps += 'vhost' > > > +deps += 'dmadev' > > > allow_experimental_apis = true > > > sources = files( > > > 'main.c', > > > 'virtio_net.c', > > > ) > > > - > > > -if dpdk_conf.has('RTE_RAW_IOAT') > > > - deps += 'raw_ioat' > > > - sources += files('ioat.c') > > > -endif > > > diff --git a/lib/vhost/meson.build b/lib/vhost/meson.build > > > index cdb37a4814..8107329400 100644 > > > --- a/lib/vhost/meson.build > > > +++ b/lib/vhost/meson.build > > > @@ -33,7 +33,8 @@ headers = files( > > > 'rte_vhost_async.h', > > > 'rte_vhost_crypto.h', > > > ) > > > + > > > driver_sdk_headers = files( > > > 'vdpa_driver.h', > > > ) > > > -deps += ['ethdev', 'cryptodev', 'hash', 'pci'] > > > +deps += ['ethdev', 'cryptodev', 'hash', 'pci', 'dmadev'] > > > diff --git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h > > > index a87ea6ba37..23a7a2d8b3 100644 > > > --- a/lib/vhost/rte_vhost_async.h > > > +++ b/lib/vhost/rte_vhost_async.h > > > @@ -27,70 +27,12 @@ struct rte_vhost_iov_iter { > > > }; > > > > > > /** > > > - * dma transfer status > > > + * DMA device information > > > */ > > > -struct rte_vhost_async_status { > > > -/** An array of application specific data for source memory */ > > > -uintptr_t *src_opaque_data; > > > -/** An array of application specific data for destination memory */ > > > -uintptr_t *dst_opaque_data; > > > -}; > > > - > > > -/** > > > - * dma operation callbacks to be implemented by applications > > > - */ > > > -struct rte_vhost_async_channel_ops { > > > -/** > > > - * instruct async engines to perform copies for a batch of packets > > > - * > > > - * @param vid > > > - * id of vhost device to perform data copies > > > - * @param queue_id > > > - * queue id to perform data copies > > > - * @param iov_iter > > > - * an array of IOV iterators > > > - * @param opaque_data > > > - * opaque data pair sending to DMA 
engine > > > - * @param count > > > - * number of elements in the "descs" array > > > - * @return > > > - * number of IOV iterators processed, negative value means error > > > - */ > > > -int32_t (*transfer_data)(int vid, uint16_t queue_id, > > > -struct rte_vhost_iov_iter *iov_iter, > > > -struct rte_vhost_async_status *opaque_data, > > > -uint16_t count); > > > -/** > > > - * check copy-completed packets from the async engine > > > - * @param vid > > > - * id of vhost device to check copy completion > > > - * @param queue_id > > > - * queue id to check copy completion > > > - * @param opaque_data > > > - * buffer to receive the opaque data pair from DMA engine > > > - * @param max_packets > > > - * max number of packets could be completed > > > - * @return > > > - * number of async descs completed, negative value means error > > > - */ > > > -int32_t (*check_completed_copies)(int vid, uint16_t queue_id, > > > -struct rte_vhost_async_status *opaque_data, > > > -uint16_t max_packets); > > > -}; > > > - > > > -/** > > > - * async channel features > > > - */ > > > -enum { > > > -RTE_VHOST_ASYNC_INORDER = 1U << 0, > > > -}; > > > - > > > -/** > > > - * async channel configuration > > > - */ > > > -struct rte_vhost_async_config { > > > -uint32_t features; > > > -uint32_t rsvd[2]; > > > +struct rte_vhost_async_dma_info { > > > +int16_t dev_id;/* DMA device ID */ > > > +uint16_t max_vchans;/* max number of vchan */ > > > +uint16_t max_desc;/* max desc number of vchan */ > > > }; > > > > > > /** > > > @@ -100,17 +42,11 @@ struct rte_vhost_async_config { > > > * vhost device id async channel to be attached to > > > * @param queue_id > > > * vhost queue id async channel to be attached to > > > - * @param config > > > - * Async channel configuration structure > > > - * @param ops > > > - * Async channel operation callbacks > > > * @return > > > * 0 on success, -1 on failures > > > */ > > > __rte_experimental > > > -int rte_vhost_async_channel_register(int vid, uint16_t 
queue_id, > > > -struct rte_vhost_async_config config, > > > -struct rte_vhost_async_channel_ops *ops); > > > +int rte_vhost_async_channel_register(int vid, uint16_t queue_id); > > > > > > /** > > > * Unregister an async channel for a vhost queue > > > @@ -136,17 +72,11 @@ int rte_vhost_async_channel_unregister(int vid, > > uint16_t > > > queue_id); > > > * vhost device id async channel to be attached to > > > * @param queue_id > > > * vhost queue id async channel to be attached to > > > - * @param config > > > - * Async channel configuration > > > - * @param ops > > > - * Async channel operation callbacks > > > * @return > > > * 0 on success, -1 on failures > > > */ > > > __rte_experimental > > > -int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > > queue_id, > > > -struct rte_vhost_async_config config, > > > -struct rte_vhost_async_channel_ops *ops); > > > +int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > > > queue_id); > > > > > > /** > > > * Unregister an async channel for a vhost queue without performing any > > > @@ -179,12 +109,17 @@ int > > rte_vhost_async_channel_unregister_thread_unsafe(int > > > vid, > > > * array of packets to be enqueued > > > * @param count > > > * packets num to be enqueued > > > + * @param dma_id > > > + * the identifier of the DMA device > > > + * @param vchan > > > + * the identifier of virtual DMA channel > > > * @return > > > * num of packets enqueued > > > */ > > > __rte_experimental > > > uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > > > -struct rte_mbuf **pkts, uint16_t count); > > > +struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > > > +uint16_t vchan); > > > > All dma_id in the API should be uint16_t. Otherwise you need to check if > valid. > > Yes, you are right. Although dma_id is defined as int16_t and DMA library > checks > if it is valid, vhost doesn't handle DMA failure and we need to make sure > dma_id > is valid before using it. 
And even if vhost handles DMA error, a better place > to check > invalid dma_id is before passing it to DMA library too. I will add the check > later. > > > > > > > > > /** > > > * This function checks async completion status for a specific vhost > > > @@ -199,12 +134,17 @@ uint16_t rte_vhost_submit_enqueue_burst(int > > vid, > > > uint16_t queue_id, > > > * blank array to get return packet pointer > > > * @param count > > > * size of the packet array > > > + * @param dma_id > > > + * the identifier of the DMA device > > > + * @param vchan > > > + * the identifier of virtual DMA channel > > > * @return > > > * num of packets returned > > > */ > > > __rte_experimental > > > uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > > > -struct rte_mbuf **pkts, uint16_t count); > > > +struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > > > +uint16_t vchan); > > > > > > /** > > > * This function returns the amount of in-flight packets for the vhost > > > @@ -235,11 +175,32 @@ int rte_vhost_async_get_inflight(int vid, uint16_t > > > queue_id); > > > * Blank array to get return packet pointer > > > * @param count > > > * Size of the packet array > > > + * @param dma_id > > > + * the identifier of the DMA device > > > + * @param vchan > > > + * the identifier of virtual DMA channel > > > * @return > > > * Number of packets returned > > > */ > > > __rte_experimental > > > uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, > > > -struct rte_mbuf **pkts, uint16_t count); > > > +struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > > > +uint16_t vchan); > > > +/** > > > + * The DMA vChannels used in asynchronous data path must be > > configured > > > + * first. So this function needs to be called before enabling DMA > > > + * acceleration for vring. If this function fails, asynchronous data path > > > + * cannot be enabled for any vring further. 
> > > + * > > > + * @param dmas > > > + * DMA information > > > + * @param count > > > + * Element number of 'dmas' > > > + * @return > > > + * 0 on success, and -1 on failure > > > + */ > > > +__rte_experimental > > > +int rte_vhost_async_dma_configure(struct rte_vhost_async_dma_info > > *dmas, > > > +uint16_t count); > > > > I think based on current design, vhost can use every vchan if user app let > it. > > So the max_desc and max_vchans can just be got from dmadev APIs? Then > > there's > > no need to introduce the new ABI struct rte_vhost_async_dma_info. > > Yes, no need to introduce struct rte_vhost_async_dma_info. We can either use > struct rte_dma_info which is suggested by Maxime, or query from dma library > via device id. Since dma device configuration is left to applications, I > prefer to > use rte_dma_info directly. How do you think? If you only use rte_dma_info as input param, you will also need to call dmadev API to get dmadev ID in rte_vhost_async_dma_configure (Or you add both rte_dma_info and dmadev ID). So I suggest to only use dmadev ID as input. > > > > > And about max_desc, I see the dmadev lib, you can get vchan's max_desc > > but you > > may use a nb_desc (<= max_desc) to configure vchanl. And IIUC, vhost wants > > to > > know the nb_desc instead of max_desc? > > True, nb_desc is better than max_desc. But dma library doesn’t provide > function > to query nb_desc for every vchannel. And rte_dma_info cannot be used in > rte_vhost_async_dma_configure(), if vhost uses nb_desc. So the only way is > to require users to provide nb_desc for every vchannel, and it will introduce > a new struct. Is it really needed? > Since now dmadev lib does not provide a way to query real nb_desc for a vchanl, so I think we can just use max_desc. But ideally, if dmadev lib provides such a way, the configured nb_desc and nb_vchanl should be used to configure vhost lib. @Bruce, should you add such a way in dmadev lib? 
As users now do not know the real configured nb_desc of vchanl. > > > > > > > > #endif /* _RTE_VHOST_ASYNC_H_ */ > > > diff --git a/lib/vhost/version.map b/lib/vhost/version.map > > > index a7ef7f1976..1202ba9c1a 100644 > > > --- a/lib/vhost/version.map > > > +++ b/lib/vhost/version.map > > > @@ -84,6 +84,9 @@ EXPERIMENTAL { > > > > > > # added in 21.11 > > > rte_vhost_get_monitor_addr; > > > + > > > +# added in 22.03 > > > +rte_vhost_async_dma_configure; > > > }; > > > > > > INTERNAL { > > > diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c > > > index 13a9bb9dd1..32f37f4851 100644 > > > --- a/lib/vhost/vhost.c > > > +++ b/lib/vhost/vhost.c > > > @@ -344,6 +344,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq) > > > return; > > > > > > rte_free(vq->async->pkts_info); > > > +rte_free(vq->async->pkts_cmpl_flag); > > > > > > rte_free(vq->async->buffers_packed); > > > vq->async->buffers_packed = NULL; > > > @@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid, > > > } > > > > > > static __rte_always_inline int > > > -async_channel_register(int vid, uint16_t queue_id, > > > -struct rte_vhost_async_channel_ops *ops) > > > +async_channel_register(int vid, uint16_t queue_id) > > > { > > > struct virtio_net *dev = get_device(vid); > > > struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; > > > @@ -1656,6 +1656,14 @@ async_channel_register(int vid, uint16_t > > queue_id, > > > goto out_free_async; > > > } > > > > > > +async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size * > > sizeof(bool), > > > +RTE_CACHE_LINE_SIZE, node); > > > +if (!async->pkts_cmpl_flag) { > > > +VHOST_LOG_CONFIG(ERR, "failed to allocate async > > pkts_cmpl_flag > > > (vid %d, qid: %d)\n", > > > +vid, queue_id); > > > > qid: %u > > > > > +goto out_free_async; > > > +} > > > + > > > if (vq_is_packed(dev)) { > > > async->buffers_packed = rte_malloc_socket(NULL, > > > vq->size * sizeof(struct > > vring_used_elem_packed), > > > @@ -1676,9 +1684,6 @@ async_channel_register(int vid, 
uint16_t > > queue_id, > > > } > > > } > > > > > > -async->ops.check_completed_copies = ops- > > >check_completed_copies; > > > -async->ops.transfer_data = ops->transfer_data; > > > - > > > vq->async = async; > > > > > > return 0; > > > @@ -1691,15 +1696,13 @@ async_channel_register(int vid, uint16_t > > queue_id, > > > } > > > > > > int > > > -rte_vhost_async_channel_register(int vid, uint16_t queue_id, > > > -struct rte_vhost_async_config config, > > > -struct rte_vhost_async_channel_ops *ops) > > > +rte_vhost_async_channel_register(int vid, uint16_t queue_id) > > > { > > > struct vhost_virtqueue *vq; > > > struct virtio_net *dev = get_device(vid); > > > int ret; > > > > > > -if (dev == NULL || ops == NULL) > > > +if (dev == NULL) > > > return -1; > > > > > > if (queue_id >= VHOST_MAX_VRING) > > > @@ -1710,33 +1713,20 @@ rte_vhost_async_channel_register(int vid, > > uint16_t > > > queue_id, > > > if (unlikely(vq == NULL || !dev->async_copy)) > > > return -1; > > > > > > -if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { > > > -VHOST_LOG_CONFIG(ERR, > > > -"async copy is not supported on non-inorder mode " > > > -"(vid %d, qid: %d)\n", vid, queue_id); > > > -return -1; > > > -} > > > - > > > -if (unlikely(ops->check_completed_copies == NULL || > > > -ops->transfer_data == NULL)) > > > -return -1; > > > - > > > rte_spinlock_lock(&vq->access_lock); > > > -ret = async_channel_register(vid, queue_id, ops); > > > +ret = async_channel_register(vid, queue_id); > > > rte_spinlock_unlock(&vq->access_lock); > > > > > > return ret; > > > } > > > > > > int > > > -rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > > queue_id, > > > -struct rte_vhost_async_config config, > > > -struct rte_vhost_async_channel_ops *ops) > > > +rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > > queue_id) > > > { > > > struct vhost_virtqueue *vq; > > > struct virtio_net *dev = get_device(vid); > > > > > > -if (dev == NULL || ops == NULL) > > > +if (dev 
== NULL) > > > return -1; > > > > > > if (queue_id >= VHOST_MAX_VRING) > > > @@ -1747,18 +1737,7 @@ > > rte_vhost_async_channel_register_thread_unsafe(int vid, > > > uint16_t queue_id, > > > if (unlikely(vq == NULL || !dev->async_copy)) > > > return -1; > > > > > > -if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { > > > -VHOST_LOG_CONFIG(ERR, > > > -"async copy is not supported on non-inorder mode " > > > -"(vid %d, qid: %d)\n", vid, queue_id); > > > -return -1; > > > -} > > > - > > > -if (unlikely(ops->check_completed_copies == NULL || > > > -ops->transfer_data == NULL)) > > > -return -1; > > > - > > > -return async_channel_register(vid, queue_id, ops); > > > +return async_channel_register(vid, queue_id); > > > } > > > > > > int > > > @@ -1835,6 +1814,83 @@ > > rte_vhost_async_channel_unregister_thread_unsafe(int > > > vid, uint16_t queue_id) > > > return 0; > > > } > > > > > > +static __rte_always_inline void > > > +vhost_free_async_dma_mem(void) > > > +{ > > > +uint16_t i; > > > + > > > +for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) { > > > +struct async_dma_info *dma = &dma_copy_track[i]; > > > +int16_t j; > > > + > > > +if (dma->max_vchans == 0) { > > > +continue; > > > +} > > > + > > > +for (j = 0; j < dma->max_vchans; j++) { > > > +rte_free(dma->vchans[j].metadata); > > > +} > > > +rte_free(dma->vchans); > > > +dma->vchans = NULL; > > > +dma->max_vchans = 0; > > > +} > > > +} > > > + > > > +int > > > +rte_vhost_async_dma_configure(struct rte_vhost_async_dma_info *dmas, > > uint16_t > > > count) > > > +{ > > > +uint16_t i; > > > + > > > +if (!dmas) { > > > +VHOST_LOG_CONFIG(ERR, "Invalid DMA configuration > > parameter.\n"); > > > +return -1; > > > +} > > > + > > > +for (i = 0; i < count; i++) { > > > +struct async_dma_vchan_info *vchans; > > > +int16_t dev_id; > > > +uint16_t max_vchans; > > > +uint16_t max_desc; > > > +uint16_t j; > > > + > > > +dev_id = dmas[i].dev_id; > > > +max_vchans = dmas[i].max_vchans; > > > +max_desc = dmas[i].max_desc; 
> > > + > > > +if (!rte_is_power_of_2(max_desc)) { > > > +max_desc = rte_align32pow2(max_desc); > > > +} > > > > I think when aligning to power of 2, it should exceed not max_desc? > > Aligned max_desc is used to allocate context tracking array. We only need > to guarantee the size of the array for every vchannel is >= max_desc. So it's > OK to have greater array size than max_desc. > > > And based on above comment, if this max_desc is nb_desc configured for > > vchanl, you should just make sure the nb_desc be power-of-2. > > > > > + > > > +vchans = rte_zmalloc(NULL, sizeof(struct > > async_dma_vchan_info) * > > > max_vchans, > > > +RTE_CACHE_LINE_SIZE); > > > +if (vchans == NULL) { > > > +VHOST_LOG_CONFIG(ERR, "Failed to allocate vchans > > for dma- > > > %d." > > > +" Cannot enable async data-path.\n", > > dev_id); > > > +vhost_free_async_dma_mem(); > > > +return -1; > > > +} > > > + > > > +for (j = 0; j < max_vchans; j++) { > > > +vchans[j].metadata = rte_zmalloc(NULL, sizeof(bool *) > > * > > > max_desc, > > > +RTE_CACHE_LINE_SIZE); > > > +if (!vchans[j].metadata) { > > > +VHOST_LOG_CONFIG(ERR, "Failed to allocate > > metadata for > > > " > > > +"dma-%d vchan-%u\n", > > dev_id, j); > > > +vhost_free_async_dma_mem(); > > > +return -1; > > > +} > > > + > > > +vchans[j].ring_size = max_desc; > > > +vchans[j].ring_mask = max_desc - 1; > > > +} > > > + > > > +dma_copy_track[dev_id].vchans = vchans; > > > +dma_copy_track[dev_id].max_vchans = max_vchans; > > > +} > > > + > > > +return 0; > > > +} > > > + > > > int > > > rte_vhost_async_get_inflight(int vid, uint16_t queue_id) > > > { > > > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h > > > index 7085e0885c..d9bda34e11 100644 > > > --- a/lib/vhost/vhost.h > > > +++ b/lib/vhost/vhost.h > > > @@ -19,6 +19,7 @@ > > > #include <rte_ether.h> > > > #include <rte_rwlock.h> > > > #include <rte_malloc.h> > > > +#include <rte_dmadev.h> > > > > > > #include "rte_vhost.h" > > > #include "rte_vdpa.h" > > > @@ -50,6 +51,7 @@ > 
> > > > > #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) > > > #define VHOST_MAX_ASYNC_VEC 2048 > > > +#define VHOST_ASYNC_DMA_BATCHING_SIZE 32 > > > > > > #define PACKED_DESC_ENQUEUE_USED_FLAG(w)\ > > > ((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | > > VRING_DESC_F_WRITE) : \ > > > @@ -119,6 +121,41 @@ struct vring_used_elem_packed { > > > uint32_t count; > > > }; > > > > > > +struct async_dma_vchan_info { > > > +/* circular array to track copy metadata */ > > > +bool **metadata; > > > > If the metadata will only be flags, maybe just use some > > name called XXX_flag > > Sure, I will rename it. > > > > > > + > > > +/* max elements in 'metadata' */ > > > +uint16_t ring_size; > > > +/* ring index mask for 'metadata' */ > > > +uint16_t ring_mask; > > > + > > > +/* batching copies before a DMA doorbell */ > > > +uint16_t nr_batching; > > > + > > > +/** > > > + * DMA virtual channel lock. Although it is able to bind DMA > > > + * virtual channels to data plane threads, vhost control plane > > > + * thread could call data plane functions too, thus causing > > > + * DMA device contention. > > > + * > > > + * For example, in VM exit case, vhost control plane thread needs > > > + * to clear in-flight packets before disable vring, but there could > > > + * be anotther data plane thread is enqueuing packets to the same > > > + * vring with the same DMA virtual channel. But dmadev PMD > > functions > > > + * are lock-free, so the control plane and data plane threads > > > + * could operate the same DMA virtual channel at the same time. 
> > > + */ > > > +rte_spinlock_t dma_lock; > > > +}; > > > + > > > +struct async_dma_info { > > > +uint16_t max_vchans; > > > +struct async_dma_vchan_info *vchans; > > > +}; > > > + > > > +extern struct async_dma_info > > dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > > > + > > > /** > > > * inflight async packet information > > > */ > > > @@ -129,9 +166,6 @@ struct async_inflight_info { > > > }; > > > > > > struct vhost_async { > > > -/* operation callbacks for DMA */ > > > -struct rte_vhost_async_channel_ops ops; > > > - > > > struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; > > > struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; > > > uint16_t iter_idx; > > > @@ -139,6 +173,19 @@ struct vhost_async { > > > > > > /* data transfer status */ > > > struct async_inflight_info *pkts_info; > > > +/** > > > + * packet reorder array. "true" indicates that DMA > > > + * device completes all copies for the packet. > > > + * > > > + * Note that this array could be written by multiple > > > + * threads at the same time. For example, two threads > > > + * enqueue packets to the same virtqueue with their > > > + * own DMA devices. However, since offloading is > > > + * per-packet basis, each packet flag will only be > > > + * written by one thread. And single byte write is > > > + * atomic, so no lock is needed. > > > + */ > > > +bool *pkts_cmpl_flag; > > > uint16_t pkts_idx; > > > uint16_t pkts_inflight_n; > > > union { > > > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c > > > index b3d954aab4..9f81fc9733 100644 > > > --- a/lib/vhost/virtio_net.c > > > +++ b/lib/vhost/virtio_net.c > > > @@ -11,6 +11,7 @@ > > > #include <rte_net.h> > > > #include <rte_ether.h> > > > #include <rte_ip.h> > > > +#include <rte_dmadev.h> > > > #include <rte_vhost.h> > > > #include <rte_tcp.h> > > > #include <rte_udp.h> > > > @@ -25,6 +26,9 @@ > > > > > > #define MAX_BATCH_LEN 256 > > > > > > +/* DMA device copy operation tracking array. 
*/ > > > +struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > > > + > > > static __rte_always_inline bool > > > rxvq_is_mergeable(struct virtio_net *dev) > > > { > > > @@ -43,6 +47,108 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, > > uint32_t > > > nr_vring) > > > return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring; > > > } > > > > > > +static __rte_always_inline uint16_t > > > +vhost_async_dma_transfer(struct vhost_virtqueue *vq, int16_t dma_id, > > > +uint16_t vchan, uint16_t head_idx, > > > +struct rte_vhost_iov_iter *pkts, uint16_t nr_pkts) > > > +{ > > > +struct async_dma_vchan_info *dma_info = > > > &dma_copy_track[dma_id].vchans[vchan]; > > > +uint16_t ring_mask = dma_info->ring_mask; > > > +uint16_t pkt_idx; > > > + > > > +rte_spinlock_lock(&dma_info->dma_lock); > > > + > > > +for (pkt_idx = 0; pkt_idx < nr_pkts; pkt_idx++) { > > > +struct rte_vhost_iovec *iov = pkts[pkt_idx].iov; > > > +int copy_idx = 0; > > > +uint16_t nr_segs = pkts[pkt_idx].nr_segs; > > > +uint16_t i; > > > + > > > +if (rte_dma_burst_capacity(dma_id, vchan) < nr_segs) { > > > +goto out; > > > +} > > > + > > > +for (i = 0; i < nr_segs; i++) { > > > +/** > > > + * We have checked the available space before > > submit copies > > > to DMA > > > + * vChannel, so we don't handle error here. > > > + */ > > > +copy_idx = rte_dma_copy(dma_id, vchan, > > > (rte_iova_t)iov[i].src_addr, > > > +(rte_iova_t)iov[i].dst_addr, iov[i].len, > > > +RTE_DMA_OP_FLAG_LLC); > > > > This assumes rte_dma_copy will always succeed if there's available space. > > > > But the API doxygen says: > > > > * @return > > * - 0..UINT16_MAX: index of enqueued job. > > * - -ENOSPC: if no space left to enqueue. > > * - other values < 0 on failure. > > > > So it should consider other vendor-specific errors. > > Error handling is not free here. Specifically, SW fallback is a way to handle > failed > copy operations. But it requires vhost to track VA for every source and > destination > buffer for every copy. 
DMA library uses IOVA, so vhost only prepares IOVA for > copies of > every packet in async data-path. In the case of IOVA as PA, the prepared IOVAs > cannot > be used as SW fallback, which means vhost needs to store VA for every copy of > every > packet too, even if there no errors will happen or IOVA is VA. > > I am thinking that the only usable DMA engines in vhost are CBDMA and DSA, is > it worth > the cost for "future HW"? If there will be other vendor's HW in future, is it > OK to add the > support later? Or is there any way to get VA from IOVA? Let's investigate how much performance drop the error handling will bring and see... Thanks, Chenbo > > Thanks, > Jiayu > > > > Thanks, > > Chenbo > > > > > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath 2021-12-30 21:55 ` [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu 2021-12-31 0:55 ` Liang Ma 2022-01-14 6:30 ` Xia, Chenbo @ 2022-01-20 17:00 ` Maxime Coquelin 2022-01-21 1:56 ` Hu, Jiayu 2 siblings, 1 reply; 31+ messages in thread From: Maxime Coquelin @ 2022-01-20 17:00 UTC (permalink / raw) To: Jiayu Hu, dev Cc: i.maximets, chenbo.xia, bruce.richardson, harry.van.haaren, sunil.pai.g, john.mcnamara, xuan.ding, cheng1.jiang, liangma Hi Jiayu, On 12/30/21 22:55, Jiayu Hu wrote: > Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA > abstraction layer and simplify application logics, this patch integrates > dmadev in asynchronous data path. > > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> > Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> > --- > doc/guides/prog_guide/vhost_lib.rst | 70 ++++----- > examples/vhost/Makefile | 2 +- > examples/vhost/ioat.c | 218 -------------------------- > examples/vhost/ioat.h | 63 -------- > examples/vhost/main.c | 230 +++++++++++++++++++++++----- > examples/vhost/main.h | 11 ++ > examples/vhost/meson.build | 6 +- > lib/vhost/meson.build | 3 +- > lib/vhost/rte_vhost_async.h | 121 +++++---------- > lib/vhost/version.map | 3 + > lib/vhost/vhost.c | 130 +++++++++++----- > lib/vhost/vhost.h | 53 ++++++- > lib/vhost/virtio_net.c | 206 +++++++++++++++++++------ > 13 files changed, 587 insertions(+), 529 deletions(-) > delete mode 100644 examples/vhost/ioat.c > delete mode 100644 examples/vhost/ioat.h > diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst > index 76f5d303c9..bdce7cbf02 100644 > --- a/doc/guides/prog_guide/vhost_lib.rst > +++ b/doc/guides/prog_guide/vhost_lib.rst > @@ -218,38 +218,12 @@ The following is an overview of some key Vhost API functions: > > Enable or disable zero copy feature of the vhost crypto backend. 
> > -* ``rte_vhost_async_channel_register(vid, queue_id, config, ops)`` > +* ``rte_vhost_async_channel_register(vid, queue_id)`` > > Register an async copy device channel for a vhost queue after vring > - is enabled. Following device ``config`` must be specified together > - with the registration: > + is enabled. > > - * ``features`` > - > - This field is used to specify async copy device features. > - > - ``RTE_VHOST_ASYNC_INORDER`` represents the async copy device can > - guarantee the order of copy completion is the same as the order > - of copy submission. > - > - Currently, only ``RTE_VHOST_ASYNC_INORDER`` capable device is > - supported by vhost. > - > - Applications must provide following ``ops`` callbacks for vhost lib to > - work with the async copy devices: > - > - * ``transfer_data(vid, queue_id, descs, opaque_data, count)`` > - > - vhost invokes this function to submit copy data to the async devices. > - For non-async_inorder capable devices, ``opaque_data`` could be used > - for identifying the completed packets. > - > - * ``check_completed_copies(vid, queue_id, opaque_data, max_packets)`` > - > - vhost invokes this function to get the copy data completed by async > - devices. > - > -* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id, config, ops)`` > +* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id)`` > > Register an async copy device channel for a vhost queue without > performing any locking. > @@ -277,18 +251,13 @@ The following is an overview of some key Vhost API functions: > This function is only safe to call in vhost callback functions > (i.e., struct rte_vhost_device_ops). > > -* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, comp_pkts, comp_count)`` > +* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id, dma_vchan)`` > > Submit an enqueue request to transmit ``count`` packets from host to guest > - by async data path. 
Successfully enqueued packets can be transfer completed > - or being occupied by DMA engines; transfer completed packets are returned in > - ``comp_pkts``, but others are not guaranteed to finish, when this API > - call returns. > + by async data path. Applications must not free the packets submitted for > + enqueue until the packets are completed. > > - Applications must not free the packets submitted for enqueue until the > - packets are completed. > - > -* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count)`` > +* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id, dma_vchan)`` > > Poll enqueue completion status from async data path. Completed packets > are returned to applications through ``pkts``. > @@ -298,7 +267,7 @@ The following is an overview of some key Vhost API functions: > This function returns the amount of in-flight packets for the vhost > queue using async acceleration. > > -* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count)`` > +* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count, dma_id, dma_vchan)`` > > Clear inflight packets which are submitted to DMA engine in vhost async data > path. Completed packets are returned to applications through ``pkts``. > @@ -442,3 +411,26 @@ Finally, a set of device ops is defined for device specific operations: > * ``get_notify_area`` > > Called to get the notify area info of the queue. > + > +Vhost asynchronous data path > +---------------------------- > + > +Vhost asynchronous data path leverages DMA devices to offload memory > +copies from the CPU and it is implemented in an asynchronous way. It > +enables applcations, like OVS, to save CPU cycles and hide memory copy s/applcations/applications/ > +overhead, thus achieving higher throughput. > + > +Vhost doesn't manage DMA devices and applications, like OVS, need to > +manage and configure DMA devices. Applications need to tell vhost what > +DMA devices to use in every data path function call. 
This design enables > +the flexibility for applications to dynamically use DMA channels in > +different function modules, not limited in vhost. > + > +In addition, vhost supports M:N mapping between vrings and DMA virtual > +channels. Specifically, one vring can use multiple different DMA channels > +and one DMA channel can be shared by multiple vrings at the same time. > +The reason of enabling one vring to use multiple DMA channels is that > +it's possible that more than one dataplane threads enqueue packets to > +the same vring with their own DMA virtual channels. Besides, the number > +of DMA devices is limited. For the purpose of scaling, it's necessary to > +support sharing DMA channels among vrings. > diff --git a/examples/vhost/Makefile b/examples/vhost/Makefile > index 587ea2ab47..975a5dfe40 100644 > --- a/examples/vhost/Makefile > +++ b/examples/vhost/Makefile > @@ -5,7 +5,7 @@ > APP = vhost-switch > > # all source are stored in SRCS-y > -SRCS-y := main.c virtio_net.c ioat.c > +SRCS-y := main.c virtio_net.c > > PKGCONF ?= pkg-config > > diff --git a/examples/vhost/ioat.c b/examples/vhost/ioat.c > deleted file mode 100644 > index 9aeeb12fd9..0000000000 > --- a/examples/vhost/ioat.c > +++ /dev/null > @@ -1,218 +0,0 @@ > -/* SPDX-License-Identifier: BSD-3-Clause > - * Copyright(c) 2010-2020 Intel Corporation > - */ > - > -#include <sys/uio.h> > -#ifdef RTE_RAW_IOAT > -#include <rte_rawdev.h> > -#include <rte_ioat_rawdev.h> > - > -#include "ioat.h" > -#include "main.h" > - > -struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; > - > -struct packet_tracker { > - unsigned short size_track[MAX_ENQUEUED_SIZE]; > - unsigned short next_read; > - unsigned short next_write; > - unsigned short last_remain; > - unsigned short ioat_space; > -}; > - > -struct packet_tracker cb_tracker[MAX_VHOST_DEVICE]; > - > -int > -open_ioat(const char *value) > -{ > - struct dma_for_vhost *dma_info = dma_bind; > - char *input = strndup(value, strlen(value) + 1); > - char *addrs = input; 
> - char *ptrs[2]; > - char *start, *end, *substr; > - int64_t vid, vring_id; > - struct rte_ioat_rawdev_config config; > - struct rte_rawdev_info info = { .dev_private = &config }; > - char name[32]; > - int dev_id; > - int ret = 0; > - uint16_t i = 0; > - char *dma_arg[MAX_VHOST_DEVICE]; > - int args_nr; > - > - while (isblank(*addrs)) > - addrs++; > - if (*addrs == '\0') { > - ret = -1; > - goto out; > - } > - > - /* process DMA devices within bracket. */ > - addrs++; > - substr = strtok(addrs, ";]"); > - if (!substr) { > - ret = -1; > - goto out; > - } > - args_nr = rte_strsplit(substr, strlen(substr), > - dma_arg, MAX_VHOST_DEVICE, ','); > - if (args_nr <= 0) { > - ret = -1; > - goto out; > - } > - while (i < args_nr) { > - char *arg_temp = dma_arg[i]; > - uint8_t sub_nr; > - sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); > - if (sub_nr != 2) { > - ret = -1; > - goto out; > - } > - > - start = strstr(ptrs[0], "txd"); > - if (start == NULL) { > - ret = -1; > - goto out; > - } > - > - start += 3; > - vid = strtol(start, &end, 0); > - if (end == start) { > - ret = -1; > - goto out; > - } > - > - vring_id = 0 + VIRTIO_RXQ; > - if (rte_pci_addr_parse(ptrs[1], > - &(dma_info + vid)->dmas[vring_id].addr) < 0) { > - ret = -1; > - goto out; > - } > - > - rte_pci_device_name(&(dma_info + vid)->dmas[vring_id].addr, > - name, sizeof(name)); > - dev_id = rte_rawdev_get_dev_id(name); > - if (dev_id == (uint16_t)(-ENODEV) || > - dev_id == (uint16_t)(-EINVAL)) { > - ret = -1; > - goto out; > - } > - > - if (rte_rawdev_info_get(dev_id, &info, sizeof(config)) < 0 || > - strstr(info.driver_name, "ioat") == NULL) { > - ret = -1; > - goto out; > - } > - > - (dma_info + vid)->dmas[vring_id].dev_id = dev_id; > - (dma_info + vid)->dmas[vring_id].is_valid = true; > - config.ring_size = IOAT_RING_SIZE; > - config.hdls_disable = true; > - if (rte_rawdev_configure(dev_id, &info, sizeof(config)) < 0) { > - ret = -1; > - goto out; > - } > - rte_rawdev_start(dev_id); > - 
cb_tracker[dev_id].ioat_space = IOAT_RING_SIZE - 1; > - dma_info->nr++; > - i++; > - } > -out: > - free(input); > - return ret; > -} > - > -int32_t > -ioat_transfer_data_cb(int vid, uint16_t queue_id, > - struct rte_vhost_iov_iter *iov_iter, > - struct rte_vhost_async_status *opaque_data, uint16_t count) > -{ > - uint32_t i_iter; > - uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2 + VIRTIO_RXQ].dev_id; > - struct rte_vhost_iov_iter *iter = NULL; > - unsigned long i_seg; > - unsigned short mask = MAX_ENQUEUED_SIZE - 1; > - unsigned short write = cb_tracker[dev_id].next_write; > - > - if (!opaque_data) { > - for (i_iter = 0; i_iter < count; i_iter++) { > - iter = iov_iter + i_iter; > - i_seg = 0; > - if (cb_tracker[dev_id].ioat_space < iter->nr_segs) > - break; > - while (i_seg < iter->nr_segs) { > - rte_ioat_enqueue_copy(dev_id, > - (uintptr_t)(iter->iov[i_seg].src_addr), > - (uintptr_t)(iter->iov[i_seg].dst_addr), > - iter->iov[i_seg].len, > - 0, > - 0); > - i_seg++; > - } > - write &= mask; > - cb_tracker[dev_id].size_track[write] = iter->nr_segs; > - cb_tracker[dev_id].ioat_space -= iter->nr_segs; > - write++; > - } > - } else { > - /* Opaque data is not supported */ > - return -1; > - } > - /* ring the doorbell */ > - rte_ioat_perform_ops(dev_id); > - cb_tracker[dev_id].next_write = write; > - return i_iter; > -} > - > -int32_t > -ioat_check_completed_copies_cb(int vid, uint16_t queue_id, > - struct rte_vhost_async_status *opaque_data, > - uint16_t max_packets) > -{ > - if (!opaque_data) { > - uintptr_t dump[255]; > - int n_seg; > - unsigned short read, write; > - unsigned short nb_packet = 0; > - unsigned short mask = MAX_ENQUEUED_SIZE - 1; > - unsigned short i; > - > - uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2 > - + VIRTIO_RXQ].dev_id; > - n_seg = rte_ioat_completed_ops(dev_id, 255, NULL, NULL, dump, dump); > - if (n_seg < 0) { > - RTE_LOG(ERR, > - VHOST_DATA, > - "fail to poll completed buf on IOAT device %u", > - dev_id); > - return 0; > - } > - if 
(n_seg == 0) > - return 0; > - > - cb_tracker[dev_id].ioat_space += n_seg; > - n_seg += cb_tracker[dev_id].last_remain; > - > - read = cb_tracker[dev_id].next_read; > - write = cb_tracker[dev_id].next_write; > - for (i = 0; i < max_packets; i++) { > - read &= mask; > - if (read == write) > - break; > - if (n_seg >= cb_tracker[dev_id].size_track[read]) { > - n_seg -= cb_tracker[dev_id].size_track[read]; > - read++; > - nb_packet++; > - } else { > - break; > - } > - } > - cb_tracker[dev_id].next_read = read; > - cb_tracker[dev_id].last_remain = n_seg; > - return nb_packet; > - } > - /* Opaque data is not supported */ > - return -1; > -} > - > -#endif /* RTE_RAW_IOAT */ > diff --git a/examples/vhost/ioat.h b/examples/vhost/ioat.h > deleted file mode 100644 > index d9bf717e8d..0000000000 > --- a/examples/vhost/ioat.h > +++ /dev/null > @@ -1,63 +0,0 @@ > -/* SPDX-License-Identifier: BSD-3-Clause > - * Copyright(c) 2010-2020 Intel Corporation > - */ > - > -#ifndef _IOAT_H_ > -#define _IOAT_H_ > - > -#include <rte_vhost.h> > -#include <rte_pci.h> > -#include <rte_vhost_async.h> > - > -#define MAX_VHOST_DEVICE 1024 > -#define IOAT_RING_SIZE 4096 > -#define MAX_ENQUEUED_SIZE 4096 > - > -struct dma_info { > - struct rte_pci_addr addr; > - uint16_t dev_id; > - bool is_valid; > -}; > - > -struct dma_for_vhost { > - struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; > - uint16_t nr; > -}; > - > -#ifdef RTE_RAW_IOAT > -int open_ioat(const char *value); > - > -int32_t > -ioat_transfer_data_cb(int vid, uint16_t queue_id, > - struct rte_vhost_iov_iter *iov_iter, > - struct rte_vhost_async_status *opaque_data, uint16_t count); > - > -int32_t > -ioat_check_completed_copies_cb(int vid, uint16_t queue_id, > - struct rte_vhost_async_status *opaque_data, > - uint16_t max_packets); > -#else > -static int open_ioat(const char *value __rte_unused) > -{ > - return -1; > -} > - > -static int32_t > -ioat_transfer_data_cb(int vid __rte_unused, uint16_t queue_id __rte_unused, > - struct 
rte_vhost_iov_iter *iov_iter __rte_unused, > - struct rte_vhost_async_status *opaque_data __rte_unused, > - uint16_t count __rte_unused) > -{ > - return -1; > -} > - > -static int32_t > -ioat_check_completed_copies_cb(int vid __rte_unused, > - uint16_t queue_id __rte_unused, > - struct rte_vhost_async_status *opaque_data __rte_unused, > - uint16_t max_packets __rte_unused) > -{ > - return -1; > -} > -#endif > -#endif /* _IOAT_H_ */ > diff --git a/examples/vhost/main.c b/examples/vhost/main.c > index 33d023aa39..44073499bc 100644 > --- a/examples/vhost/main.c > +++ b/examples/vhost/main.c > @@ -24,8 +24,9 @@ > #include <rte_ip.h> > #include <rte_tcp.h> > #include <rte_pause.h> > +#include <rte_dmadev.h> > +#include <rte_vhost_async.h> > > -#include "ioat.h" > #include "main.h" > > #ifndef MAX_QUEUES > @@ -56,6 +57,14 @@ > #define RTE_TEST_TX_DESC_DEFAULT 512 > > #define INVALID_PORT_ID 0xFF > +#define INVALID_DMA_ID -1 > + > +#define MAX_VHOST_DEVICE 1024

It is better to define RTE_VHOST_MAX_DEVICES in rte_vhost.h, and use it both here and in the vhost library, so that the two always stay aligned.

> +#define DMA_RING_SIZE 4096

As in the previous review, the DMA ring size should not be hard-coded. Please use the DMA lib helpers to get the ring size here too. Looking at the dmadev library, I would just use the max value provided by the device. What would be the downside of doing so? Looking at currently available DMA device drivers, this is their max_desc value:
- CNXK: 15
- DPAA: 64
- HISI: 8192
- IDXD: 4192
- IOAT: 4192
- Skeleton: 8192

So hardcoding to 4096 will prevent some DMA devices from being used, and will not take full advantage of the capabilities of others.
> +struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; > +struct rte_vhost_async_dma_info dma_config[RTE_DMADEV_DEFAULT_MAX]; > +static int dma_count; > > /* mask of enabled ports */ > static uint32_t enabled_port_mask = 0; > @@ -96,8 +105,6 @@ static int builtin_net_driver; > > static int async_vhost_driver; > > -static char *dma_type; > - > /* Specify timeout (in useconds) between retries on RX. */ > static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US; > /* Specify the number of retries on RX. */ > @@ -196,13 +203,134 @@ struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * MAX_VHOST_DEVICE]; > #define MBUF_TABLE_DRAIN_TSC ((rte_get_tsc_hz() + US_PER_S - 1) \ > / US_PER_S * BURST_TX_DRAIN_US) > > +static inline bool > +is_dma_configured(int16_t dev_id) > +{ > + int i; > + > + for (i = 0; i < dma_count; i++) { > + if (dma_config[i].dev_id == dev_id) { > + return true; > + } > + } No need for braces for both the loop and the if. > + return false; > +} > + > static inline int > open_dma(const char *value) > { > - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) > - return open_ioat(value); > + struct dma_for_vhost *dma_info = dma_bind; > + char *input = strndup(value, strlen(value) + 1); > + char *addrs = input; > + char *ptrs[2]; > + char *start, *end, *substr; > + int64_t vid, vring_id; > + > + struct rte_dma_info info; > + struct rte_dma_conf dev_config = { .nb_vchans = 1 }; > + struct rte_dma_vchan_conf qconf = { > + .direction = RTE_DMA_DIR_MEM_TO_MEM, > + .nb_desc = DMA_RING_SIZE > + }; > + > + int dev_id; > + int ret = 0; > + uint16_t i = 0; > + char *dma_arg[MAX_VHOST_DEVICE]; > + int args_nr; > + > + while (isblank(*addrs)) > + addrs++; > + if (*addrs == '\0') { > + ret = -1; > + goto out; > + } > + > + /* process DMA devices within bracket. 
*/ > + addrs++; > + substr = strtok(addrs, ";]"); > + if (!substr) { > + ret = -1; > + goto out; > + } > + > + args_nr = rte_strsplit(substr, strlen(substr), > + dma_arg, MAX_VHOST_DEVICE, ','); > + if (args_nr <= 0) { > + ret = -1; > + goto out; > + } > + > + while (i < args_nr) { > + char *arg_temp = dma_arg[i]; > + uint8_t sub_nr; > + > + sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); > + if (sub_nr != 2) { > + ret = -1; > + goto out; > + } > + > + start = strstr(ptrs[0], "txd"); > + if (start == NULL) { > + ret = -1; > + goto out; > + } > + > + start += 3; > + vid = strtol(start, &end, 0); > + if (end == start) { > + ret = -1; > + goto out; > + } > + > + vring_id = 0 + VIRTIO_RXQ; > + > + dev_id = rte_dma_get_dev_id_by_name(ptrs[1]); > + if (dev_id < 0) { > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to find DMA %s.\n", ptrs[1]); > + ret = -1; > + goto out; > + } else if (is_dma_configured(dev_id)) { > + goto done; > + } > + > + if (rte_dma_configure(dev_id, &dev_config) != 0) { > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to configure DMA %d.\n", dev_id); > + ret = -1; > + goto out; > + } > + > + if (rte_dma_vchan_setup(dev_id, 0, &qconf) != 0) { > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to set up DMA %d.\n", dev_id); > + ret = -1; > + goto out; > + } > > - return -1; > + rte_dma_info_get(dev_id, &info); > + if (info.nb_vchans != 1) { > + RTE_LOG(ERR, VHOST_CONFIG, "DMA %d has no queues.\n", dev_id); > + ret = -1; > + goto out; > + }

So, I would call rte_dma_info_get() before calling rte_dma_vchan_setup(), to get a descriptor count that actually works with the DMA device.
> + > + if (rte_dma_start(dev_id) != 0) { > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to start DMA %u.\n", dev_id); > + ret = -1; > + goto out; > + } > + > + dma_config[dma_count].dev_id = dev_id; > + dma_config[dma_count].max_vchans = 1; > + dma_config[dma_count++].max_desc = DMA_RING_SIZE; > + > +done: > + (dma_info + vid)->dmas[vring_id].dev_id = dev_id; > + i++; > + } > +out: > + free(input); > + return ret; > } > > /* > @@ -500,8 +628,6 @@ enum { > OPT_CLIENT_NUM, > #define OPT_BUILTIN_NET_DRIVER "builtin-net-driver" > OPT_BUILTIN_NET_DRIVER_NUM, > -#define OPT_DMA_TYPE "dma-type" > - OPT_DMA_TYPE_NUM, > #define OPT_DMAS "dmas" > OPT_DMAS_NUM, > }; > @@ -539,8 +665,6 @@ us_vhost_parse_args(int argc, char **argv) > NULL, OPT_CLIENT_NUM}, > {OPT_BUILTIN_NET_DRIVER, no_argument, > NULL, OPT_BUILTIN_NET_DRIVER_NUM}, > - {OPT_DMA_TYPE, required_argument, > - NULL, OPT_DMA_TYPE_NUM}, > {OPT_DMAS, required_argument, > NULL, OPT_DMAS_NUM}, > {NULL, 0, 0, 0}, > @@ -661,10 +785,6 @@ us_vhost_parse_args(int argc, char **argv) > } > break; > > - case OPT_DMA_TYPE_NUM: > - dma_type = optarg; > - break; > - > case OPT_DMAS_NUM: > if (open_dma(optarg) == -1) { > RTE_LOG(INFO, VHOST_CONFIG, > @@ -841,9 +961,10 @@ complete_async_pkts(struct vhost_dev *vdev) > { > struct rte_mbuf *p_cpl[MAX_PKT_BURST]; > uint16_t complete_count; > + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; > > complete_count = rte_vhost_poll_enqueue_completed(vdev->vid, > - VIRTIO_RXQ, p_cpl, MAX_PKT_BURST); > + VIRTIO_RXQ, p_cpl, MAX_PKT_BURST, dma_id, 0); > if (complete_count) { > free_pkts(p_cpl, complete_count); > __atomic_sub_fetch(&vdev->pkts_inflight, complete_count, __ATOMIC_SEQ_CST); > @@ -883,11 +1004,12 @@ drain_vhost(struct vhost_dev *vdev) > > if (builtin_net_driver) { > ret = vs_enqueue_pkts(vdev, VIRTIO_RXQ, m, nr_xmit); > - } else if (async_vhost_driver) { > + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { > uint16_t enqueue_fail = 0; > + int16_t dma_id = 
dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; > > complete_async_pkts(vdev); > - ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit); > + ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit, dma_id, 0); > __atomic_add_fetch(&vdev->pkts_inflight, ret, __ATOMIC_SEQ_CST); > > enqueue_fail = nr_xmit - ret; > @@ -905,7 +1027,7 @@ drain_vhost(struct vhost_dev *vdev) > __ATOMIC_SEQ_CST); > } > > - if (!async_vhost_driver) > + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) > free_pkts(m, nr_xmit); > } > > @@ -1211,12 +1333,13 @@ drain_eth_rx(struct vhost_dev *vdev) > if (builtin_net_driver) { > enqueue_count = vs_enqueue_pkts(vdev, VIRTIO_RXQ, > pkts, rx_count); > - } else if (async_vhost_driver) { > + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { > uint16_t enqueue_fail = 0; > + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; > > complete_async_pkts(vdev); > enqueue_count = rte_vhost_submit_enqueue_burst(vdev->vid, > - VIRTIO_RXQ, pkts, rx_count); > + VIRTIO_RXQ, pkts, rx_count, dma_id, 0); > __atomic_add_fetch(&vdev->pkts_inflight, enqueue_count, __ATOMIC_SEQ_CST); > > enqueue_fail = rx_count - enqueue_count; > @@ -1235,7 +1358,7 @@ drain_eth_rx(struct vhost_dev *vdev) > __ATOMIC_SEQ_CST); > } > > - if (!async_vhost_driver) > + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) > free_pkts(pkts, rx_count); > } > > @@ -1387,18 +1510,20 @@ destroy_device(int vid) > "(%d) device has been removed from data core\n", > vdev->vid); > > - if (async_vhost_driver) { > + if (dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled) { > uint16_t n_pkt = 0; > + int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; > struct rte_mbuf *m_cpl[vdev->pkts_inflight]; > > while (vdev->pkts_inflight) { > n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, VIRTIO_RXQ, > - m_cpl, vdev->pkts_inflight); > + m_cpl, vdev->pkts_inflight, dma_id, 0); > free_pkts(m_cpl, n_pkt); > __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, 
__ATOMIC_SEQ_CST); > } > > rte_vhost_async_channel_unregister(vid, VIRTIO_RXQ); > + dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = false; > } > > rte_free(vdev); > @@ -1468,20 +1593,14 @@ new_device(int vid) > "(%d) device has been added to data core %d\n", > vid, vdev->coreid); > > - if (async_vhost_driver) { > - struct rte_vhost_async_config config = {0}; > - struct rte_vhost_async_channel_ops channel_ops; > - > - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) { > - channel_ops.transfer_data = ioat_transfer_data_cb; > - channel_ops.check_completed_copies = > - ioat_check_completed_copies_cb; > - > - config.features = RTE_VHOST_ASYNC_INORDER; > + if (dma_bind[vid].dmas[VIRTIO_RXQ].dev_id != INVALID_DMA_ID) { > + int ret; > > - return rte_vhost_async_channel_register(vid, VIRTIO_RXQ, > - config, &channel_ops); > + ret = rte_vhost_async_channel_register(vid, VIRTIO_RXQ); > + if (ret == 0) { > + dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = true; > } > + return ret; > } > > return 0; > @@ -1502,14 +1621,15 @@ vring_state_changed(int vid, uint16_t queue_id, int enable) > if (queue_id != VIRTIO_RXQ) > return 0; > > - if (async_vhost_driver) { > + if (dma_bind[vid].dmas[queue_id].async_enabled) { > if (!enable) { > uint16_t n_pkt = 0; > + int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; > struct rte_mbuf *m_cpl[vdev->pkts_inflight]; > > while (vdev->pkts_inflight) { > n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, queue_id, > - m_cpl, vdev->pkts_inflight); > + m_cpl, vdev->pkts_inflight, dma_id, 0); > free_pkts(m_cpl, n_pkt); > __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, __ATOMIC_SEQ_CST); > } > @@ -1657,6 +1777,25 @@ create_mbuf_pool(uint16_t nr_port, uint32_t nr_switch_core, uint32_t mbuf_size, > rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n"); > } > > +static void > +init_dma(void) > +{ > + int i; > + > + for (i = 0; i < MAX_VHOST_DEVICE; i++) { > + int j; > + > + for (j = 0; j < RTE_MAX_QUEUES_PER_PORT * 2; j++) { > + 
dma_bind[i].dmas[j].dev_id = INVALID_DMA_ID; > + dma_bind[i].dmas[j].async_enabled = false; > + } > + } > + > + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) { > + dma_config[i].dev_id = INVALID_DMA_ID; > + } > +} > + > /* > * Main function, does initialisation and calls the per-lcore functions. > */ > @@ -1679,6 +1818,9 @@ main(int argc, char *argv[]) > argc -= ret; > argv += ret; > > + /* initialize dma structures */ > + init_dma(); > + > /* parse app arguments */ > ret = us_vhost_parse_args(argc, argv); > if (ret < 0) > @@ -1754,6 +1896,20 @@ main(int argc, char *argv[]) > if (client_mode) > flags |= RTE_VHOST_USER_CLIENT; > > + if (async_vhost_driver) {

You should be able to get rid of async_vhost_driver and instead rely on dma_count.

> + if (rte_vhost_async_dma_configure(dma_config, dma_count) < 0) { > + RTE_LOG(ERR, VHOST_PORT, "Failed to configure DMA in vhost.\n"); > + for (i = 0; i < dma_count; i++) { > + if (dma_config[i].dev_id != INVALID_DMA_ID) { > + rte_dma_stop(dma_config[i].dev_id); > + dma_config[i].dev_id = INVALID_DMA_ID; > + } > + } > + dma_count = 0; > + async_vhost_driver = false;

Let's just exit the app if DMAs were provided on the command line but cannot be used.

> + } > + } > + > /* Register vhost user driver to handle vhost messages.
*/ > for (i = 0; i < nb_sockets; i++) { > char *file = socket_files + i * PATH_MAX; > diff --git a/examples/vhost/main.h b/examples/vhost/main.h > index e7b1ac60a6..b4a453e77e 100644 > --- a/examples/vhost/main.h > +++ b/examples/vhost/main.h > @@ -8,6 +8,7 @@ > #include <sys/queue.h> > > #include <rte_ether.h> > +#include <rte_pci.h> > > /* Macros for printing using RTE_LOG */ > #define RTE_LOGTYPE_VHOST_CONFIG RTE_LOGTYPE_USER1 > @@ -79,6 +80,16 @@ struct lcore_info { > struct vhost_dev_tailq_list vdev_list; > }; > > +struct dma_info { > + struct rte_pci_addr addr; > + int16_t dev_id; > + bool async_enabled; > +}; > + > +struct dma_for_vhost { > + struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; > +}; > + > /* we implement non-extra virtio net features */ > #define VIRTIO_NET_FEATURES 0 > > diff --git a/examples/vhost/meson.build b/examples/vhost/meson.build > index 3efd5e6540..87a637f83f 100644 > --- a/examples/vhost/meson.build > +++ b/examples/vhost/meson.build > @@ -12,13 +12,9 @@ if not is_linux > endif > > deps += 'vhost' > +deps += 'dmadev' > allow_experimental_apis = true > sources = files( > 'main.c', > 'virtio_net.c', > ) > - > -if dpdk_conf.has('RTE_RAW_IOAT') > - deps += 'raw_ioat' > - sources += files('ioat.c') > -endif > diff --git a/lib/vhost/meson.build b/lib/vhost/meson.build > index cdb37a4814..8107329400 100644 > --- a/lib/vhost/meson.build > +++ b/lib/vhost/meson.build > @@ -33,7 +33,8 @@ headers = files( > 'rte_vhost_async.h', > 'rte_vhost_crypto.h', > ) > + > driver_sdk_headers = files( > 'vdpa_driver.h', > ) > -deps += ['ethdev', 'cryptodev', 'hash', 'pci'] > +deps += ['ethdev', 'cryptodev', 'hash', 'pci', 'dmadev'] > diff --git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h > index a87ea6ba37..23a7a2d8b3 100644 > --- a/lib/vhost/rte_vhost_async.h > +++ b/lib/vhost/rte_vhost_async.h > @@ -27,70 +27,12 @@ struct rte_vhost_iov_iter { > }; > > /** > - * dma transfer status > + * DMA device information > */ > -struct 
rte_vhost_async_status { > - /** An array of application specific data for source memory */ > - uintptr_t *src_opaque_data; > - /** An array of application specific data for destination memory */ > - uintptr_t *dst_opaque_data; > -}; > - > -/** > - * dma operation callbacks to be implemented by applications > - */ > -struct rte_vhost_async_channel_ops { > - /** > - * instruct async engines to perform copies for a batch of packets > - * > - * @param vid > - * id of vhost device to perform data copies > - * @param queue_id > - * queue id to perform data copies > - * @param iov_iter > - * an array of IOV iterators > - * @param opaque_data > - * opaque data pair sending to DMA engine > - * @param count > - * number of elements in the "descs" array > - * @return > - * number of IOV iterators processed, negative value means error > - */ > - int32_t (*transfer_data)(int vid, uint16_t queue_id, > - struct rte_vhost_iov_iter *iov_iter, > - struct rte_vhost_async_status *opaque_data, > - uint16_t count); > - /** > - * check copy-completed packets from the async engine > - * @param vid > - * id of vhost device to check copy completion > - * @param queue_id > - * queue id to check copy completion > - * @param opaque_data > - * buffer to receive the opaque data pair from DMA engine > - * @param max_packets > - * max number of packets could be completed > - * @return > - * number of async descs completed, negative value means error > - */ > - int32_t (*check_completed_copies)(int vid, uint16_t queue_id, > - struct rte_vhost_async_status *opaque_data, > - uint16_t max_packets); > -}; > - > -/** > - * async channel features > - */ > -enum { > - RTE_VHOST_ASYNC_INORDER = 1U << 0, > -}; > - > -/** > - * async channel configuration > - */ > -struct rte_vhost_async_config { > - uint32_t features; > - uint32_t rsvd[2]; > +struct rte_vhost_async_dma_info { > + int16_t dev_id; /* DMA device ID */ > + uint16_t max_vchans; /* max number of vchan */ > + uint16_t max_desc; /* max desc number 
of vchan */ > }; > > /** > @@ -100,17 +42,11 @@ struct rte_vhost_async_config { > * vhost device id async channel to be attached to > * @param queue_id > * vhost queue id async channel to be attached to > - * @param config > - * Async channel configuration structure > - * @param ops > - * Async channel operation callbacks > * @return > * 0 on success, -1 on failures > */ > __rte_experimental > -int rte_vhost_async_channel_register(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops); > +int rte_vhost_async_channel_register(int vid, uint16_t queue_id); > > /** > * Unregister an async channel for a vhost queue > @@ -136,17 +72,11 @@ int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id); > * vhost device id async channel to be attached to > * @param queue_id > * vhost queue id async channel to be attached to > - * @param config > - * Async channel configuration > - * @param ops > - * Async channel operation callbacks > * @return > * 0 on success, -1 on failures > */ > __rte_experimental > -int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops); > +int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id); > > /** > * Unregister an async channel for a vhost queue without performing any > @@ -179,12 +109,17 @@ int rte_vhost_async_channel_unregister_thread_unsafe(int vid, > * array of packets to be enqueued > * @param count > * packets num to be enqueued > + * @param dma_id > + * the identifier of the DMA device > + * @param vchan > + * the identifier of virtual DMA channel > * @return > * num of packets enqueued > */ > __rte_experimental > uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count); > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan); Maybe using vchan_id would be clearer 
for the API user, same comment below. > > /** > * This function checks async completion status for a specific vhost > @@ -199,12 +134,17 @@ uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > * blank array to get return packet pointer > * @param count > * size of the packet array > + * @param dma_id > + * the identifier of the DMA device > + * @param vchan > + * the identifier of virtual DMA channel > * @return > * num of packets returned > */ > __rte_experimental > uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count); > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan); > > /** > * This function returns the amount of in-flight packets for the vhost > @@ -235,11 +175,32 @@ int rte_vhost_async_get_inflight(int vid, uint16_t queue_id); > * Blank array to get return packet pointer > * @param count > * Size of the packet array > + * @param dma_id > + * the identifier of the DMA device > + * @param vchan > + * the identifier of virtual DMA channel > * @return > * Number of packets returned > */ > __rte_experimental > uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count); > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan); > +/** > + * The DMA vChannels used in asynchronous data path must be configured > + * first. So this function needs to be called before enabling DMA > + * acceleration for vring. If this function fails, asynchronous data path > + * cannot be enabled for any vring further. 
> + * > + * @param dmas > + * DMA information > + * @param count > + * Element number of 'dmas' > + * @return > + * 0 on success, and -1 on failure > + */ > +__rte_experimental > +int rte_vhost_async_dma_configure(struct rte_vhost_async_dma_info *dmas, > + uint16_t count); > > #endif /* _RTE_VHOST_ASYNC_H_ */ > diff --git a/lib/vhost/version.map b/lib/vhost/version.map > index a7ef7f1976..1202ba9c1a 100644 > --- a/lib/vhost/version.map > +++ b/lib/vhost/version.map > @@ -84,6 +84,9 @@ EXPERIMENTAL { > > # added in 21.11 > rte_vhost_get_monitor_addr; > + > + # added in 22.03 > + rte_vhost_async_dma_configure; > }; > > INTERNAL { > diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c > index 13a9bb9dd1..32f37f4851 100644 > --- a/lib/vhost/vhost.c > +++ b/lib/vhost/vhost.c > @@ -344,6 +344,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq) > return; > > rte_free(vq->async->pkts_info); > + rte_free(vq->async->pkts_cmpl_flag); > > rte_free(vq->async->buffers_packed); > vq->async->buffers_packed = NULL; > @@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid, > } > > static __rte_always_inline int > -async_channel_register(int vid, uint16_t queue_id, > - struct rte_vhost_async_channel_ops *ops) > +async_channel_register(int vid, uint16_t queue_id) > { > struct virtio_net *dev = get_device(vid); > struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; > @@ -1656,6 +1656,14 @@ async_channel_register(int vid, uint16_t queue_id, > goto out_free_async; > } > > + async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size * sizeof(bool), > + RTE_CACHE_LINE_SIZE, node); > + if (!async->pkts_cmpl_flag) { > + VHOST_LOG_CONFIG(ERR, "failed to allocate async pkts_cmpl_flag (vid %d, qid: %d)\n", > + vid, queue_id); > + goto out_free_async; > + } > + > if (vq_is_packed(dev)) { > async->buffers_packed = rte_malloc_socket(NULL, > vq->size * sizeof(struct vring_used_elem_packed), > @@ -1676,9 +1684,6 @@ async_channel_register(int vid, uint16_t queue_id, > } > } > > - 
async->ops.check_completed_copies = ops->check_completed_copies; > - async->ops.transfer_data = ops->transfer_data; > - > vq->async = async; > > return 0; > @@ -1691,15 +1696,13 @@ async_channel_register(int vid, uint16_t queue_id, > } > > int > -rte_vhost_async_channel_register(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops) > +rte_vhost_async_channel_register(int vid, uint16_t queue_id) > { > struct vhost_virtqueue *vq; > struct virtio_net *dev = get_device(vid); > int ret; > > - if (dev == NULL || ops == NULL) > + if (dev == NULL) > return -1; > > if (queue_id >= VHOST_MAX_VRING) > @@ -1710,33 +1713,20 @@ rte_vhost_async_channel_register(int vid, uint16_t queue_id, > if (unlikely(vq == NULL || !dev->async_copy)) > return -1; > > - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { > - VHOST_LOG_CONFIG(ERR, > - "async copy is not supported on non-inorder mode " > - "(vid %d, qid: %d)\n", vid, queue_id); > - return -1; > - } > - > - if (unlikely(ops->check_completed_copies == NULL || > - ops->transfer_data == NULL)) > - return -1; > - > rte_spinlock_lock(&vq->access_lock); > - ret = async_channel_register(vid, queue_id, ops); > + ret = async_channel_register(vid, queue_id); > rte_spinlock_unlock(&vq->access_lock); > > return ret; > } > > int > -rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops) > +rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id) > { > struct vhost_virtqueue *vq; > struct virtio_net *dev = get_device(vid); > > - if (dev == NULL || ops == NULL) > + if (dev == NULL) > return -1; > > if (queue_id >= VHOST_MAX_VRING) > @@ -1747,18 +1737,7 @@ rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, > if (unlikely(vq == NULL || !dev->async_copy)) > return -1; > > - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { 
> - VHOST_LOG_CONFIG(ERR, > - "async copy is not supported on non-inorder mode " > - "(vid %d, qid: %d)\n", vid, queue_id); > - return -1; > - } > - > - if (unlikely(ops->check_completed_copies == NULL || > - ops->transfer_data == NULL)) > - return -1; > - > - return async_channel_register(vid, queue_id, ops); > + return async_channel_register(vid, queue_id); > } > > int > @@ -1835,6 +1814,83 @@ rte_vhost_async_channel_unregister_thread_unsafe(int vid, uint16_t queue_id) > return 0; > } > > +static __rte_always_inline void > +vhost_free_async_dma_mem(void) > +{ > + uint16_t i; > + > + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) { > + struct async_dma_info *dma = &dma_copy_track[i]; > + int16_t j; > + > + if (dma->max_vchans == 0) { > + continue; > + } > + > + for (j = 0; j < dma->max_vchans; j++) { > + rte_free(dma->vchans[j].metadata); > + } > + rte_free(dma->vchans); > + dma->vchans = NULL; > + dma->max_vchans = 0; > + } > +} > + > +int > +rte_vhost_async_dma_configure(struct rte_vhost_async_dma_info *dmas, uint16_t count) > +{ > + uint16_t i; > + > + if (!dmas) { > + VHOST_LOG_CONFIG(ERR, "Invalid DMA configuration parameter.\n"); > + return -1; > + } > + > + for (i = 0; i < count; i++) { > + struct async_dma_vchan_info *vchans; > + int16_t dev_id; > + uint16_t max_vchans; > + uint16_t max_desc; > + uint16_t j; > + > + dev_id = dmas[i].dev_id; > + max_vchans = dmas[i].max_vchans; > + max_desc = dmas[i].max_desc; > + > + if (!rte_is_power_of_2(max_desc)) { > + max_desc = rte_align32pow2(max_desc); > + } That will be problematic with CNXK driver that reports 15 as max_desc. > + > + vchans = rte_zmalloc(NULL, sizeof(struct async_dma_vchan_info) * max_vchans, > + RTE_CACHE_LINE_SIZE); > + if (vchans == NULL) { > + VHOST_LOG_CONFIG(ERR, "Failed to allocate vchans for dma-%d." 
> + " Cannot enable async data-path.\n", dev_id); > + vhost_free_async_dma_mem(); > + return -1; > + } > + > + for (j = 0; j < max_vchans; j++) { > + vchans[j].metadata = rte_zmalloc(NULL, sizeof(bool *) * max_desc, > + RTE_CACHE_LINE_SIZE); That's quite a huge allocation (4096 * 8B per channel). > + if (!vchans[j].metadata) { > + VHOST_LOG_CONFIG(ERR, "Failed to allocate metadata for " > + "dma-%d vchan-%u\n", dev_id, j); > + vhost_free_async_dma_mem(); > + return -1; > + } > + > + vchans[j].ring_size = max_desc; > + vchans[j].ring_mask = max_desc - 1; > + } > + > + dma_copy_track[dev_id].vchans = vchans; > + dma_copy_track[dev_id].max_vchans = max_vchans; > + } > + > + return 0; > +} > + > int > rte_vhost_async_get_inflight(int vid, uint16_t queue_id) > { > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h > index 7085e0885c..d9bda34e11 100644 > --- a/lib/vhost/vhost.h > +++ b/lib/vhost/vhost.h > @@ -19,6 +19,7 @@ > #include <rte_ether.h> > #include <rte_rwlock.h> > #include <rte_malloc.h> > +#include <rte_dmadev.h> > > #include "rte_vhost.h" > #include "rte_vdpa.h" > @@ -50,6 +51,7 @@ > > #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) > #define VHOST_MAX_ASYNC_VEC 2048 > +#define VHOST_ASYNC_DMA_BATCHING_SIZE 32 > > #define PACKED_DESC_ENQUEUE_USED_FLAG(w) \ > ((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \ > @@ -119,6 +121,41 @@ struct vring_used_elem_packed { > uint32_t count; > }; > > +struct async_dma_vchan_info { > + /* circular array to track copy metadata */ > + bool **metadata; > + > + /* max elements in 'metadata' */ > + uint16_t ring_size; > + /* ring index mask for 'metadata' */ > + uint16_t ring_mask; Given cnxk, we cannot use a mask as the ring size may not be a pow2. > + > + /* batching copies before a DMA doorbell */ > + uint16_t nr_batching; > + > + /** > + * DMA virtual channel lock. 
Although it is possible to bind DMA > + * virtual channels to data plane threads, the vhost control plane > + * thread could call data plane functions too, thus causing > + * DMA device contention. > + * > + * For example, in the VM exit case, the vhost control plane thread needs > + * to clear in-flight packets before disabling the vring, but another > + * data plane thread could be enqueuing packets to the same > + * vring with the same DMA virtual channel. But dmadev PMD functions > + * are lock-free, so the control plane and data plane threads > + * could operate on the same DMA virtual channel at the same time. > + */ > + rte_spinlock_t dma_lock; > +}; > + > +struct async_dma_info { > + uint16_t max_vchans; > + struct async_dma_vchan_info *vchans; > +}; > + > +extern struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > + > /** > * inflight async packet information > */ > @@ -129,9 +166,6 @@ struct async_inflight_info { > }; > > struct vhost_async { > - /* operation callbacks for DMA */ > - struct rte_vhost_async_channel_ops ops; > - > struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; > struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; > uint16_t iter_idx; > @@ -139,6 +173,19 @@ struct vhost_async { > > /* data transfer status */ > struct async_inflight_info *pkts_info; > + /** > + * packet reorder array. "true" indicates that the DMA > + * device has completed all copies for the packet. > + * > + * Note that this array could be written by multiple > + * threads at the same time. For example, two threads > + * enqueue packets to the same virtqueue with their > + * own DMA devices. However, since offloading is done > + * on a per-packet basis, each packet flag will only be > + * written by one thread. And a single-byte write is > + * atomic, so no lock is needed. > + */ The vq->access_lock is held by the threads (directly or indirectly) anyway, right?
> + bool *pkts_cmpl_flag; > uint16_t pkts_idx; > uint16_t pkts_inflight_n; > union { > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c > index b3d954aab4..9f81fc9733 100644 > --- a/lib/vhost/virtio_net.c > +++ b/lib/vhost/virtio_net.c > @@ -11,6 +11,7 @@ > #include <rte_net.h> > #include <rte_ether.h> > #include <rte_ip.h> > +#include <rte_dmadev.h> > #include <rte_vhost.h> > #include <rte_tcp.h> > #include <rte_udp.h> > @@ -25,6 +26,9 @@ > > #define MAX_BATCH_LEN 256 > > +/* DMA device copy operation tracking array. */ > +struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > + > static __rte_always_inline bool > rxvq_is_mergeable(struct virtio_net *dev) > { > @@ -43,6 +47,108 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring) > return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring; > } > > +static __rte_always_inline uint16_t > +vhost_async_dma_transfer(struct vhost_virtqueue *vq, int16_t dma_id, > + uint16_t vchan, uint16_t head_idx, > + struct rte_vhost_iov_iter *pkts, uint16_t nr_pkts) > +{ > + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan]; > + uint16_t ring_mask = dma_info->ring_mask; > + uint16_t pkt_idx; > + > + rte_spinlock_lock(&dma_info->dma_lock); > + > + for (pkt_idx = 0; pkt_idx < nr_pkts; pkt_idx++) { > + struct rte_vhost_iovec *iov = pkts[pkt_idx].iov; > + int copy_idx = 0; > + uint16_t nr_segs = pkts[pkt_idx].nr_segs; > + uint16_t i; > + > + if (rte_dma_burst_capacity(dma_id, vchan) < nr_segs) { > + goto out; > + } > + > + for (i = 0; i < nr_segs; i++) { > + /** > + * We have checked the available space before submit copies to DMA > + * vChannel, so we don't handle error here. > + */ > + copy_idx = rte_dma_copy(dma_id, vchan, (rte_iova_t)iov[i].src_addr, > + (rte_iova_t)iov[i].dst_addr, iov[i].len, > + RTE_DMA_OP_FLAG_LLC); > + > + /** > + * Only store packet completion flag address in the last copy's > + * slot, and other slots are set to NULL. 
> + */ > + if (unlikely(i == (nr_segs - 1))) { I don't think using unlikely() is appropriate here, as single-segment packets are more the norm than the exception, isn't it? > + dma_info->metadata[copy_idx & ring_mask] = dma_info->metadata[copy_idx % ring_size] instead. > + &vq->async->pkts_cmpl_flag[head_idx % vq->size]; > + } > + } > + > + dma_info->nr_batching += nr_segs; > + if (unlikely(dma_info->nr_batching >= VHOST_ASYNC_DMA_BATCHING_SIZE)) { > + rte_dma_submit(dma_id, vchan); > + dma_info->nr_batching = 0; > + } > + > + head_idx++; > + } > + > +out: > + if (dma_info->nr_batching > 0) { > + rte_dma_submit(dma_id, vchan); > + dma_info->nr_batching = 0; > + } > + rte_spinlock_unlock(&dma_info->dma_lock); > + > + return pkt_idx; > +} > + > +static __rte_always_inline uint16_t > +vhost_async_dma_check_completed(int16_t dma_id, uint16_t vchan, uint16_t max_pkts) > +{ > + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan]; > + uint16_t ring_mask = dma_info->ring_mask; > + uint16_t last_idx = 0; > + uint16_t nr_copies; > + uint16_t copy_idx; > + uint16_t i; > + > + rte_spinlock_lock(&dma_info->dma_lock); > + > + /** > + * Since all memory is pinned and addresses should be valid, > + * we don't check errors. Please check for errors to ease debugging; for now you can add a debug print, if an error is set, that prints rte_errno. And once we have my Vhost statistics series in, I could add a counter for it. The DMA device could be in a bad state for other reasons than unpinned memory or invalid addresses. > + */ > + nr_copies = rte_dma_completed(dma_id, vchan, max_pkts, &last_idx, NULL); Are you sure max_pkts is valid here, as a packet could contain several segments?
> + if (nr_copies == 0) { > + goto out; > + } > + > + copy_idx = last_idx - nr_copies + 1; > + for (i = 0; i < nr_copies; i++) { > + bool *flag; > + > + flag = dma_info->metadata[copy_idx & ring_mask]; dma_info->metadata[copy_idx % ring_size] > + if (flag) { > + /** > + * Mark the packet flag as received. The flag > + * could belong to another virtqueue but write > + * is atomic. > + */ > + *flag = true; > + dma_info->metadata[copy_idx & ring_mask] = NULL; dma_info->metadata[copy_idx % ring_size] > + } > + copy_idx++; > + } > + > +out: > + rte_spinlock_unlock(&dma_info->dma_lock); > + return nr_copies; > +} > + > static inline void > do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq) > { > @@ -1449,9 +1555,9 @@ store_dma_desc_info_packed(struct vring_used_elem_packed *s_ring, > } > > static __rte_noinline uint32_t > -virtio_dev_rx_async_submit_split(struct virtio_net *dev, > - struct vhost_virtqueue *vq, uint16_t queue_id, > - struct rte_mbuf **pkts, uint32_t count) > +virtio_dev_rx_async_submit_split(struct virtio_net *dev, struct vhost_virtqueue *vq, > + uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count, > + int16_t dma_id, uint16_t vchan) > { > struct buf_vector buf_vec[BUF_VECTOR_MAX]; > uint32_t pkt_idx = 0; > @@ -1503,17 +1609,16 @@ virtio_dev_rx_async_submit_split(struct virtio_net *dev, > if (unlikely(pkt_idx == 0)) > return 0; > > - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx); > - if (unlikely(n_xfer < 0)) { > - VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer data for queue id %d.\n", > - dev->vid, __func__, queue_id); > - n_xfer = 0; > - } > + n_xfer = vhost_async_dma_transfer(vq, dma_id, vchan, async->pkts_idx, async->iov_iter, > + pkt_idx); > > pkt_err = pkt_idx - n_xfer; > if (unlikely(pkt_err)) { > uint16_t num_descs = 0; > > + VHOST_LOG_DATA(DEBUG, "(%d) %s: failed to transfer %u packets for queue %u.\n", > + dev->vid, __func__, pkt_err, queue_id); > + > /* update number of 
completed packets */ > pkt_idx = n_xfer; > > @@ -1656,13 +1761,13 @@ dma_error_handler_packed(struct vhost_virtqueue *vq, uint16_t slot_idx, > } > > static __rte_noinline uint32_t > -virtio_dev_rx_async_submit_packed(struct virtio_net *dev, > - struct vhost_virtqueue *vq, uint16_t queue_id, > - struct rte_mbuf **pkts, uint32_t count) > +virtio_dev_rx_async_submit_packed(struct virtio_net *dev, struct vhost_virtqueue *vq, > + uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count, > + int16_t dma_id, uint16_t vchan) > { > uint32_t pkt_idx = 0; > uint32_t remained = count; > - int32_t n_xfer; > + uint16_t n_xfer; > uint16_t num_buffers; > uint16_t num_descs; > > @@ -1670,6 +1775,7 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, > struct async_inflight_info *pkts_info = async->pkts_info; > uint32_t pkt_err = 0; > uint16_t slot_idx = 0; > + uint16_t head_idx = async->pkts_idx % vq->size; > > do { > rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]); > @@ -1694,19 +1800,17 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, > if (unlikely(pkt_idx == 0)) > return 0; > > - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx); > - if (unlikely(n_xfer < 0)) { > - VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer data for queue id %d.\n", > - dev->vid, __func__, queue_id); > - n_xfer = 0; > - } > - > - pkt_err = pkt_idx - n_xfer; > + n_xfer = vhost_async_dma_transfer(vq, dma_id, vchan, head_idx, > + async->iov_iter, pkt_idx); > > async_iter_reset(async); > > - if (unlikely(pkt_err)) > + pkt_err = pkt_idx - n_xfer; > + if (unlikely(pkt_err)) { > + VHOST_LOG_DATA(DEBUG, "(%d) %s: failed to transfer %u packets for queue %u.\n", > + dev->vid, __func__, pkt_err, queue_id); > dma_error_handler_packed(vq, slot_idx, pkt_err, &pkt_idx); > + } > > if (likely(vq->shadow_used_idx)) { > /* keep used descriptors. 
*/ > @@ -1826,28 +1930,37 @@ write_back_completed_descs_packed(struct vhost_virtqueue *vq, > > static __rte_always_inline uint16_t > vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan) > { > struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; > struct vhost_async *async = vq->async; > struct async_inflight_info *pkts_info = async->pkts_info; > - int32_t n_cpl; > + uint16_t nr_cpl_pkts = 0; > uint16_t n_descs = 0, n_buffers = 0; > uint16_t start_idx, from, i; > > - n_cpl = async->ops.check_completed_copies(dev->vid, queue_id, 0, count); > - if (unlikely(n_cpl < 0)) { > - VHOST_LOG_DATA(ERR, "(%d) %s: failed to check completed copies for queue id %d.\n", > - dev->vid, __func__, queue_id); > - return 0; > - } > - > - if (n_cpl == 0) > - return 0; > + /* Check completed copies for the given DMA vChannel */ > + vhost_async_dma_check_completed(dma_id, vchan, count); > > start_idx = async_get_first_inflight_pkt_idx(vq); > > - for (i = 0; i < n_cpl; i++) { > + /** > + * Calculate the number of copy completed packets. > + * Note that there may be completed packets even if > + * no copies are reported done by the given DMA vChannel, > + * as DMA vChannels could be shared by other threads. 
> + */ > + from = start_idx; > + while (vq->async->pkts_cmpl_flag[from] && count--) { > + vq->async->pkts_cmpl_flag[from] = false; > + from++; > + if (from >= vq->size) > + from -= vq->size; > + nr_cpl_pkts++; > + } > + > + for (i = 0; i < nr_cpl_pkts; i++) { > from = (start_idx + i) % vq->size; > /* Only used with packed ring */ > n_buffers += pkts_info[from].nr_buffers; > @@ -1856,7 +1969,7 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, > pkts[i] = pkts_info[from].mbuf; > } > > - async->pkts_inflight_n -= n_cpl; > + async->pkts_inflight_n -= nr_cpl_pkts; > > if (likely(vq->enabled && vq->access_ok)) { > if (vq_is_packed(dev)) { > @@ -1877,12 +1990,13 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, > } > } > > - return n_cpl; > + return nr_cpl_pkts; > } > > uint16_t > rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan) > { > struct virtio_net *dev = get_device(vid); > struct vhost_virtqueue *vq; > @@ -1908,7 +2022,7 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > > rte_spinlock_lock(&vq->access_lock); > > - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); > + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, vchan); > > rte_spinlock_unlock(&vq->access_lock); > > @@ -1917,7 +2031,8 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > > uint16_t > rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan) > { > struct virtio_net *dev = get_device(vid); > struct vhost_virtqueue *vq; > @@ -1941,14 +2056,14 @@ rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, > return 0; > } > > - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); > + n_pkts_cpl 
= vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, vchan); > > return n_pkts_cpl; > } > > static __rte_always_inline uint32_t > virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, > - struct rte_mbuf **pkts, uint32_t count) > + struct rte_mbuf **pkts, uint32_t count, int16_t dma_id, uint16_t vchan) > { > struct vhost_virtqueue *vq; > uint32_t nb_tx = 0; > @@ -1980,10 +2095,10 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, > > if (vq_is_packed(dev)) > nb_tx = virtio_dev_rx_async_submit_packed(dev, vq, queue_id, > - pkts, count); > + pkts, count, dma_id, vchan); > else > nb_tx = virtio_dev_rx_async_submit_split(dev, vq, queue_id, > - pkts, count); > + pkts, count, dma_id, vchan); > > out: > if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) > @@ -1997,7 +2112,8 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, > > uint16_t > rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan) > { > struct virtio_net *dev = get_device(vid); > > @@ -2011,7 +2127,7 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > return 0; > } > > - return virtio_dev_rx_async_submit(dev, queue_id, pkts, count); > + return virtio_dev_rx_async_submit(dev, queue_id, pkts, count, dma_id, vchan); > } > > static inline bool ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath 2022-01-20 17:00 ` Maxime Coquelin @ 2022-01-21 1:56 ` Hu, Jiayu 0 siblings, 0 replies; 31+ messages in thread From: Hu, Jiayu @ 2022-01-21 1:56 UTC (permalink / raw) To: Maxime Coquelin, dev Cc: i.maximets, Xia, Chenbo, Richardson, Bruce, Van Haaren, Harry, Pai G, Sunil, Mcnamara, John, Ding, Xuan, Jiang, Cheng1, liangma Hi Maxime, Thanks for your comments, and please see replies inline. Thanks, Jiayu > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Friday, January 21, 2022 1:00 AM > To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org > Cc: i.maximets@ovn.org; Xia, Chenbo <chenbo.xia@intel.com>; Richardson, > Bruce <bruce.richardson@intel.com>; Van Haaren, Harry > <harry.van.haaren@intel.com>; Pai G, Sunil <sunil.pai.g@intel.com>; > Mcnamara, John <john.mcnamara@intel.com>; Ding, Xuan > <xuan.ding@intel.com>; Jiang, Cheng1 <cheng1.jiang@intel.com>; > liangma@liangbit.com > Subject: Re: [PATCH v1 1/1] vhost: integrate dmadev in asynchronous > datapath > > Hi Jiayu, > > On 12/30/21 22:55, Jiayu Hu wrote: > > Since dmadev was introduced in 21.11, to avoid the overhead of the vhost DMA > > abstraction layer and simplify application logic, this patch integrates > > dmadev in asynchronous data path.
> > > > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> > > Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> > > --- > > doc/guides/prog_guide/vhost_lib.rst | 70 ++++----- > > examples/vhost/Makefile | 2 +- > > examples/vhost/ioat.c | 218 -------------------------- > > examples/vhost/ioat.h | 63 -------- > > examples/vhost/main.c | 230 +++++++++++++++++++++++----- > > examples/vhost/main.h | 11 ++ > > examples/vhost/meson.build | 6 +- > > lib/vhost/meson.build | 3 +- > > lib/vhost/rte_vhost_async.h | 121 +++++---------- > > lib/vhost/version.map | 3 + > > lib/vhost/vhost.c | 130 +++++++++++----- > > lib/vhost/vhost.h | 53 ++++++- > > lib/vhost/virtio_net.c | 206 +++++++++++++++++++------ > > 13 files changed, 587 insertions(+), 529 deletions(-) > > delete mode 100644 examples/vhost/ioat.c > > delete mode 100644 examples/vhost/ioat.h > > > > > diff --git a/examples/vhost/main.c b/examples/vhost/main.c > > index 33d023aa39..44073499bc 100644 > > --- a/examples/vhost/main.c > > +++ b/examples/vhost/main.c > > @@ -24,8 +24,9 @@ > > #include <rte_ip.h> > > #include <rte_tcp.h> > > #include <rte_pause.h> > > +#include <rte_dmadev.h> > > +#include <rte_vhost_async.h> > > > > -#include "ioat.h" > > #include "main.h" > > > > #ifndef MAX_QUEUES > > @@ -56,6 +57,14 @@ > > #define RTE_TEST_TX_DESC_DEFAULT 512 > > > > #define INVALID_PORT_ID 0xFF > > +#define INVALID_DMA_ID -1 > > + > > +#define MAX_VHOST_DEVICE 1024 > > It is better to define RTE_VHOST_MAX_DEVICES in rte_vhost.h, > and use it here in the vhost library so that it will always be aligned > with the Vhost library. OK, I will rename "MAX_VHOST_DEVICE" to "RTE_MAX_VHOST_DEVICE" in the vhost library, and use it instead. > > > +#define DMA_RING_SIZE 4096 > As per the previous review, the DMA ring size should not be hard-coded. > Please use the DMA lib helpers to get the ring size also here. Yes, it shouldn't be hardcoded.
But from my perspective, I don't want to give users any implication that the max ring size is always the best choice. It's like the choice of NIC ring size, where the default is 2048 in OVS. I think it's better to use a default value and check if it's OK for the DMA devices used. > > Looking at the dmadev library, I would just use the max value provided > by the device. What would be the downside of doing so? Using the max ring size may make the memory footprint a bit larger, the same as a large vhost ring size would, IMO. But it's not a measurable "cost", and I am not sure if it's a cost or not TBH. Because the mempool has a cache and usually allocates recently used mbufs, cache misses should not become much worse because of a large ring size. But it depends on lots of factors, and it's hard to tell if it hurts performance or not. > > Looking at currently available DMA device drivers, this is their > max_desc value: > - CNXK: 15 > - DPAA: 64 > - HISI: 8192 > - IDXD: 4192 > - IOAT: 4192 > - Skeleton: 8192 > > So hardcoding to 4192 will prevent some DMA devices from being used, and > will not take full advantage of the capabilities of others. Good catch. The vhost example should be compatible with all DMA devices too. I will change later. > > > +struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; > > +struct rte_vhost_async_dma_info > dma_config[RTE_DMADEV_DEFAULT_MAX]; > > +static int dma_count; > > > > /* mask of enabled ports */ > > static uint32_t enabled_port_mask = 0; > > @@ -96,8 +105,6 @@ static int builtin_net_driver; > > > > static int async_vhost_driver; > > > > -static char *dma_type; > > - > > /* Specify timeout (in useconds) between retries on RX. */ > > static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US; > > /* Specify the number of retries on RX.
*/ > > @@ -196,13 +203,134 @@ struct vhost_bufftable > *vhost_txbuff[RTE_MAX_LCORE * MAX_VHOST_DEVICE]; > > #define MBUF_TABLE_DRAIN_TSC ((rte_get_tsc_hz() + US_PER_S - 1) \ > > / US_PER_S * BURST_TX_DRAIN_US) > > > > +static inline bool > > +is_dma_configured(int16_t dev_id) > > +{ > > + int i; > > + > > + for (i = 0; i < dma_count; i++) { > > + if (dma_config[i].dev_id == dev_id) { > > + return true; > > + } > > + } > > No need for braces for both the loop and the if. Sure, I will change in the next version. > > > + return false; > > +} > > + > > static inline int > > open_dma(const char *value) > > { > > - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) > > - return open_ioat(value); > > + struct dma_for_vhost *dma_info = dma_bind; > > + char *input = strndup(value, strlen(value) + 1); > > + char *addrs = input; > > + char *ptrs[2]; > > + char *start, *end, *substr; > > + int64_t vid, vring_id; > > + > > + struct rte_dma_info info; > > + struct rte_dma_conf dev_config = { .nb_vchans = 1 }; > > + struct rte_dma_vchan_conf qconf = { > > + .direction = RTE_DMA_DIR_MEM_TO_MEM, > > + .nb_desc = DMA_RING_SIZE > > + }; > > + > > + int dev_id; > > + int ret = 0; > > + uint16_t i = 0; > > + char *dma_arg[MAX_VHOST_DEVICE]; > > + int args_nr; > > + > > + while (isblank(*addrs)) > > + addrs++; > > + if (*addrs == '\0') { > > + ret = -1; > > + goto out; > > + } > > + > > + /* process DMA devices within bracket. 
*/ > > + addrs++; > > + substr = strtok(addrs, ";]"); > > + if (!substr) { > > + ret = -1; > > + goto out; > > + } > > + > > + args_nr = rte_strsplit(substr, strlen(substr), > > + dma_arg, MAX_VHOST_DEVICE, ','); > > + if (args_nr <= 0) { > > + ret = -1; > > + goto out; > > + } > > + > > + while (i < args_nr) { > > + char *arg_temp = dma_arg[i]; > > + uint8_t sub_nr; > > + > > + sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); > > + if (sub_nr != 2) { > > + ret = -1; > > + goto out; > > + } > > + > > + start = strstr(ptrs[0], "txd"); > > + if (start == NULL) { > > + ret = -1; > > + goto out; > > + } > > + > > + start += 3; > > + vid = strtol(start, &end, 0); > > + if (end == start) { > > + ret = -1; > > + goto out; > > + } > > + > > + vring_id = 0 + VIRTIO_RXQ; > > + > > + dev_id = rte_dma_get_dev_id_by_name(ptrs[1]); > > + if (dev_id < 0) { > > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to find > DMA %s.\n", ptrs[1]); > > + ret = -1; > > + goto out; > > + } else if (is_dma_configured(dev_id)) { > > + goto done; > > + } > > + > > + if (rte_dma_configure(dev_id, &dev_config) != 0) { > > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to configure > DMA %d.\n", dev_id); > > + ret = -1; > > + goto out; > > + } > > + > > + if (rte_dma_vchan_setup(dev_id, 0, &qconf) != 0) { > > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to set up > DMA %d.\n", dev_id); > > + ret = -1; > > + goto out; > > + } > > > > - return -1; > > + rte_dma_info_get(dev_id, &info); > > + if (info.nb_vchans != 1) { > > + RTE_LOG(ERR, VHOST_CONFIG, "DMA %d has no > queues.\n", dev_id); > > + ret = -1; > > + goto out; > > + } > > So, I would call rte_dma_info_get() before calling rte_dma_vchan_setup() > to get desc number value working with the DMA device. Right, I will change later. 
> > > + > > + if (rte_dma_start(dev_id) != 0) { > > + RTE_LOG(ERR, VHOST_CONFIG, "Fail to start > DMA %u.\n", dev_id); > > + ret = -1; > > + goto out; > > + } > > + > > + dma_config[dma_count].dev_id = dev_id; > > + dma_config[dma_count].max_vchans = 1; > > + dma_config[dma_count++].max_desc = DMA_RING_SIZE; > > + > > +done: > > + (dma_info + vid)->dmas[vring_id].dev_id = dev_id; > > + i++; > > + } > > +out: > > + free(input); > > + return ret; > > } > > > > /* > > @@ -500,8 +628,6 @@ enum { > > OPT_CLIENT_NUM, > > #define OPT_BUILTIN_NET_DRIVER "builtin-net-driver" > > OPT_BUILTIN_NET_DRIVER_NUM, > > -#define OPT_DMA_TYPE "dma-type" > > - OPT_DMA_TYPE_NUM, > > #define OPT_DMAS "dmas" > > OPT_DMAS_NUM, > > }; > > @@ -539,8 +665,6 @@ us_vhost_parse_args(int argc, char **argv) > > NULL, OPT_CLIENT_NUM}, > > {OPT_BUILTIN_NET_DRIVER, no_argument, > > NULL, OPT_BUILTIN_NET_DRIVER_NUM}, > > - {OPT_DMA_TYPE, required_argument, > > - NULL, OPT_DMA_TYPE_NUM}, > > {OPT_DMAS, required_argument, > > NULL, OPT_DMAS_NUM}, > > {NULL, 0, 0, 0}, > > @@ -661,10 +785,6 @@ us_vhost_parse_args(int argc, char **argv) > > } > > break; > > > > - case OPT_DMA_TYPE_NUM: > > - dma_type = optarg; > > - break; > > - > > case OPT_DMAS_NUM: > > if (open_dma(optarg) == -1) { > > RTE_LOG(INFO, VHOST_CONFIG, > > @@ -841,9 +961,10 @@ complete_async_pkts(struct vhost_dev *vdev) > > { > > struct rte_mbuf *p_cpl[MAX_PKT_BURST]; > > uint16_t complete_count; > > + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; > > > > complete_count = rte_vhost_poll_enqueue_completed(vdev->vid, > > - VIRTIO_RXQ, p_cpl, > MAX_PKT_BURST); > > + VIRTIO_RXQ, p_cpl, MAX_PKT_BURST, > dma_id, 0); > > if (complete_count) { > > free_pkts(p_cpl, complete_count); > > __atomic_sub_fetch(&vdev->pkts_inflight, complete_count, > __ATOMIC_SEQ_CST); > > @@ -883,11 +1004,12 @@ drain_vhost(struct vhost_dev *vdev) > > > > if (builtin_net_driver) { > > ret = vs_enqueue_pkts(vdev, VIRTIO_RXQ, m, nr_xmit); > > - } else if 
(async_vhost_driver) { > > + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { > > uint16_t enqueue_fail = 0; > > + int16_t dma_id = dma_bind[vdev- > >vid].dmas[VIRTIO_RXQ].dev_id; > > > > complete_async_pkts(vdev); > > - ret = rte_vhost_submit_enqueue_burst(vdev->vid, > VIRTIO_RXQ, m, nr_xmit); > > + ret = rte_vhost_submit_enqueue_burst(vdev->vid, > VIRTIO_RXQ, m, nr_xmit, dma_id, 0); > > __atomic_add_fetch(&vdev->pkts_inflight, ret, > __ATOMIC_SEQ_CST); > > > > enqueue_fail = nr_xmit - ret; > > @@ -905,7 +1027,7 @@ drain_vhost(struct vhost_dev *vdev) > > __ATOMIC_SEQ_CST); > > } > > > > - if (!async_vhost_driver) > > + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) > > free_pkts(m, nr_xmit); > > } > > > > @@ -1211,12 +1333,13 @@ drain_eth_rx(struct vhost_dev *vdev) > > if (builtin_net_driver) { > > enqueue_count = vs_enqueue_pkts(vdev, VIRTIO_RXQ, > > pkts, rx_count); > > - } else if (async_vhost_driver) { > > + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { > > uint16_t enqueue_fail = 0; > > + int16_t dma_id = dma_bind[vdev- > >vid].dmas[VIRTIO_RXQ].dev_id; > > > > complete_async_pkts(vdev); > > enqueue_count = rte_vhost_submit_enqueue_burst(vdev- > >vid, > > - VIRTIO_RXQ, pkts, rx_count); > > + VIRTIO_RXQ, pkts, rx_count, dma_id, > 0); > > __atomic_add_fetch(&vdev->pkts_inflight, enqueue_count, > __ATOMIC_SEQ_CST); > > > > enqueue_fail = rx_count - enqueue_count; > > @@ -1235,7 +1358,7 @@ drain_eth_rx(struct vhost_dev *vdev) > > __ATOMIC_SEQ_CST); > > } > > > > - if (!async_vhost_driver) > > + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) > > free_pkts(pkts, rx_count); > > } > > > > @@ -1387,18 +1510,20 @@ destroy_device(int vid) > > "(%d) device has been removed from data core\n", > > vdev->vid); > > > > - if (async_vhost_driver) { > > + if (dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled) { > > uint16_t n_pkt = 0; > > + int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; > > struct rte_mbuf 
*m_cpl[vdev->pkts_inflight]; > > > > while (vdev->pkts_inflight) { > > n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, > VIRTIO_RXQ, > > - m_cpl, vdev->pkts_inflight); > > + m_cpl, vdev->pkts_inflight, > dma_id, 0); > > free_pkts(m_cpl, n_pkt); > > __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, > __ATOMIC_SEQ_CST); > > } > > > > rte_vhost_async_channel_unregister(vid, VIRTIO_RXQ); > > + dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = false; > > } > > > > rte_free(vdev); > > @@ -1468,20 +1593,14 @@ new_device(int vid) > > "(%d) device has been added to data core %d\n", > > vid, vdev->coreid); > > > > - if (async_vhost_driver) { > > - struct rte_vhost_async_config config = {0}; > > - struct rte_vhost_async_channel_ops channel_ops; > > - > > - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) { > > - channel_ops.transfer_data = ioat_transfer_data_cb; > > - channel_ops.check_completed_copies = > > - ioat_check_completed_copies_cb; > > - > > - config.features = RTE_VHOST_ASYNC_INORDER; > > + if (dma_bind[vid].dmas[VIRTIO_RXQ].dev_id != INVALID_DMA_ID) { > > + int ret; > > > > - return rte_vhost_async_channel_register(vid, > VIRTIO_RXQ, > > - config, &channel_ops); > > + ret = rte_vhost_async_channel_register(vid, VIRTIO_RXQ); > > + if (ret == 0) { > > + dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = > true; > > } > > + return ret; > > } > > > > return 0; > > @@ -1502,14 +1621,15 @@ vring_state_changed(int vid, uint16_t > queue_id, int enable) > > if (queue_id != VIRTIO_RXQ) > > return 0; > > > > - if (async_vhost_driver) { > > + if (dma_bind[vid].dmas[queue_id].async_enabled) { > > if (!enable) { > > uint16_t n_pkt = 0; > > + int16_t dma_id = > dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; > > struct rte_mbuf *m_cpl[vdev->pkts_inflight]; > > > > while (vdev->pkts_inflight) { > > n_pkt = > rte_vhost_clear_queue_thread_unsafe(vid, queue_id, > > - m_cpl, vdev- > >pkts_inflight); > > + m_cpl, vdev- > >pkts_inflight, dma_id, 0); > > free_pkts(m_cpl, n_pkt); > > 
__atomic_sub_fetch(&vdev->pkts_inflight, > n_pkt, __ATOMIC_SEQ_CST); > > } > > @@ -1657,6 +1777,25 @@ create_mbuf_pool(uint16_t nr_port, uint32_t > nr_switch_core, uint32_t mbuf_size, > > rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n"); > > } > > > > +static void > > +init_dma(void) > > +{ > > + int i; > > + > > + for (i = 0; i < MAX_VHOST_DEVICE; i++) { > > + int j; > > + > > + for (j = 0; j < RTE_MAX_QUEUES_PER_PORT * 2; j++) { > > + dma_bind[i].dmas[j].dev_id = INVALID_DMA_ID; > > + dma_bind[i].dmas[j].async_enabled = false; > > + } > > + } > > + > > + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) { > > + dma_config[i].dev_id = INVALID_DMA_ID; > > + } > > +} > > + > > /* > > * Main function, does initialisation and calls the per-lcore functions. > > */ > > @@ -1679,6 +1818,9 @@ main(int argc, char *argv[]) > > argc -= ret; > > argv += ret; > > > > + /* initialize dma structures */ > > + init_dma(); > > + > > /* parse app arguments */ > > ret = us_vhost_parse_args(argc, argv); > > if (ret < 0) > > @@ -1754,6 +1896,20 @@ main(int argc, char *argv[]) > > if (client_mode) > > flags |= RTE_VHOST_USER_CLIENT; > > > > + if (async_vhost_driver) { > > You should be able to get rid of async_vhost_driver and instead rely on > dma_count. Yes, I will remove it. > > > + if (rte_vhost_async_dma_configure(dma_config, dma_count) > < 0) { > > + RTE_LOG(ERR, VHOST_PORT, "Failed to configure > DMA in vhost.\n"); > > + for (i = 0; i < dma_count; i++) { > > + if (dma_config[i].dev_id != INVALID_DMA_ID) > { > > + rte_dma_stop(dma_config[i].dev_id); > > + dma_config[i].dev_id = > INVALID_DMA_ID; > > + } > > + } > > + dma_count = 0; > > + async_vhost_driver = false; > > Let's just exit the app if DMAs were provided on the command line but cannot > be used. OK, I will change later. > > > + } > > + } > > + > > /* Register vhost user driver to handle vhost messages. 
*/ > > for (i = 0; i < nb_sockets; i++) { > > char *file = socket_files + i * PATH_MAX; > > diff --git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h > > index a87ea6ba37..23a7a2d8b3 100644 > > --- a/lib/vhost/rte_vhost_async.h > > +++ b/lib/vhost/rte_vhost_async.h > > @@ -27,70 +27,12 @@ struct rte_vhost_iov_iter { > > }; > > > > /** > > - * dma transfer status > > + * DMA device information > > */ > > -struct rte_vhost_async_status { > > - /** An array of application specific data for source memory */ > > - uintptr_t *src_opaque_data; > > - /** An array of application specific data for destination memory */ > > - uintptr_t *dst_opaque_data; > > -}; > > - > > -/** > > - * dma operation callbacks to be implemented by applications > > - */ > > -struct rte_vhost_async_channel_ops { > > - /** > > - * instruct async engines to perform copies for a batch of packets > > - * > > - * @param vid > > - * id of vhost device to perform data copies > > - * @param queue_id > > - * queue id to perform data copies > > - * @param iov_iter > > - * an array of IOV iterators > > - * @param opaque_data > > - * opaque data pair sending to DMA engine > > - * @param count > > - * number of elements in the "descs" array > > - * @return > > - * number of IOV iterators processed, negative value means error > > - */ > > - int32_t (*transfer_data)(int vid, uint16_t queue_id, > > - struct rte_vhost_iov_iter *iov_iter, > > - struct rte_vhost_async_status *opaque_data, > > - uint16_t count); > > - /** > > - * check copy-completed packets from the async engine > > - * @param vid > > - * id of vhost device to check copy completion > > - * @param queue_id > > - * queue id to check copy completion > > - * @param opaque_data > > - * buffer to receive the opaque data pair from DMA engine > > - * @param max_packets > > - * max number of packets could be completed > > - * @return > > - * number of async descs completed, negative value means error > > - */ > > - int32_t 
(*check_completed_copies)(int vid, uint16_t queue_id, > > - struct rte_vhost_async_status *opaque_data, > > - uint16_t max_packets); > > -}; > > - > > -/** > > - * async channel features > > - */ > > -enum { > > - RTE_VHOST_ASYNC_INORDER = 1U << 0, > > -}; > > - > > -/** > > - * async channel configuration > > - */ > > -struct rte_vhost_async_config { > > - uint32_t features; > > - uint32_t rsvd[2]; > > +struct rte_vhost_async_dma_info { > > + int16_t dev_id; /* DMA device ID */ > > + uint16_t max_vchans; /* max number of vchan */ > > + uint16_t max_desc; /* max desc number of vchan */ > > }; > > > > /** > > @@ -100,17 +42,11 @@ struct rte_vhost_async_config { > > * vhost device id async channel to be attached to > > * @param queue_id > > * vhost queue id async channel to be attached to > > - * @param config > > - * Async channel configuration structure > > - * @param ops > > - * Async channel operation callbacks > > * @return > > * 0 on success, -1 on failures > > */ > > __rte_experimental > > -int rte_vhost_async_channel_register(int vid, uint16_t queue_id, > > - struct rte_vhost_async_config config, > > - struct rte_vhost_async_channel_ops *ops); > > +int rte_vhost_async_channel_register(int vid, uint16_t queue_id); > > > > /** > > * Unregister an async channel for a vhost queue > > @@ -136,17 +72,11 @@ int rte_vhost_async_channel_unregister(int vid, > uint16_t queue_id); > > * vhost device id async channel to be attached to > > * @param queue_id > > * vhost queue id async channel to be attached to > > - * @param config > > - * Async channel configuration > > - * @param ops > > - * Async channel operation callbacks > > * @return > > * 0 on success, -1 on failures > > */ > > __rte_experimental > > -int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > queue_id, > > - struct rte_vhost_async_config config, > > - struct rte_vhost_async_channel_ops *ops); > > +int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > queue_id); > > > > 
/** > > * Unregister an async channel for a vhost queue without performing any > > @@ -179,12 +109,17 @@ int > rte_vhost_async_channel_unregister_thread_unsafe(int vid, > > * array of packets to be enqueued > > * @param count > > * packets num to be enqueued > > + * @param dma_id > > + * the identifier of the DMA device > > + * @param vchan > > + * the identifier of virtual DMA channel > > * @return > > * num of packets enqueued > > */ > > __rte_experimental > > uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > > - struct rte_mbuf **pkts, uint16_t count); > > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > > + uint16_t vchan); > > Maybe using vchan_id would be clearer for the API user, same comment > below. Yes, I will change later. > > > > > /** > > * This function checks async completion status for a specific vhost > > @@ -199,12 +134,17 @@ uint16_t rte_vhost_submit_enqueue_burst(int > vid, uint16_t queue_id, > > * blank array to get return packet pointer > > * @param count > > * size of the packet array > > + * @param dma_id > > + * the identifier of the DMA device > > + * @param vchan > > + * the identifier of virtual DMA channel > > * @return > > * num of packets returned > > */ > > __rte_experimental > > uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > > - struct rte_mbuf **pkts, uint16_t count); > > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > > + uint16_t vchan); > > > > /** > > * This function returns the amount of in-flight packets for the vhost > > @@ -235,11 +175,32 @@ int rte_vhost_async_get_inflight(int vid, uint16_t > queue_id); > > * Blank array to get return packet pointer > > * @param count > > * Size of the packet array > > + * @param dma_id > > + * the identifier of the DMA device > > + * @param vchan > > + * the identifier of virtual DMA channel > > * @return > > * Number of packets returned > > */ > > __rte_experimental > > uint16_t rte_vhost_clear_queue_thread_unsafe(int 
vid, uint16_t queue_id, > > - struct rte_mbuf **pkts, uint16_t count); > > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > > + uint16_t vchan); > > +/** > > + * The DMA vChannels used in asynchronous data path must be > configured > > + * first. So this function needs to be called before enabling DMA > > + * acceleration for vring. If this function fails, asynchronous data path > > + * cannot be enabled for any vring further. > > + * > > + * @param dmas > > + * DMA information > > + * @param count > > + * Element number of 'dmas' > > + * @return > > + * 0 on success, and -1 on failure > > + */ > > +__rte_experimental > > +int rte_vhost_async_dma_configure(struct rte_vhost_async_dma_info > *dmas, > > + uint16_t count); > > > > #endif /* _RTE_VHOST_ASYNC_H_ */ > > diff --git a/lib/vhost/version.map b/lib/vhost/version.map > > index a7ef7f1976..1202ba9c1a 100644 > > --- a/lib/vhost/version.map > > +++ b/lib/vhost/version.map > > @@ -84,6 +84,9 @@ EXPERIMENTAL { > > > > # added in 21.11 > > rte_vhost_get_monitor_addr; > > + > > + # added in 22.03 > > + rte_vhost_async_dma_configure; > > }; > > > > INTERNAL { > > diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c > > index 13a9bb9dd1..32f37f4851 100644 > > --- a/lib/vhost/vhost.c > > +++ b/lib/vhost/vhost.c > > @@ -344,6 +344,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq) > > return; > > > > rte_free(vq->async->pkts_info); > > + rte_free(vq->async->pkts_cmpl_flag); > > > > rte_free(vq->async->buffers_packed); > > vq->async->buffers_packed = NULL; > > @@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid, > > } > > > > static __rte_always_inline int > > -async_channel_register(int vid, uint16_t queue_id, > > - struct rte_vhost_async_channel_ops *ops) > > +async_channel_register(int vid, uint16_t queue_id) > > { > > struct virtio_net *dev = get_device(vid); > > struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; > > @@ -1656,6 +1656,14 @@ async_channel_register(int vid, uint16_t > 
queue_id, > > goto out_free_async; > > } > > > > + async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size * > sizeof(bool), > > + RTE_CACHE_LINE_SIZE, node); > > + if (!async->pkts_cmpl_flag) { > > + VHOST_LOG_CONFIG(ERR, "failed to allocate async > pkts_cmpl_flag (vid %d, qid: %d)\n", > > + vid, queue_id); > > + goto out_free_async; > > + } > > + > > if (vq_is_packed(dev)) { > > async->buffers_packed = rte_malloc_socket(NULL, > > vq->size * sizeof(struct > vring_used_elem_packed), > > @@ -1676,9 +1684,6 @@ async_channel_register(int vid, uint16_t > queue_id, > > } > > } > > > > - async->ops.check_completed_copies = ops- > >check_completed_copies; > > - async->ops.transfer_data = ops->transfer_data; > > - > > vq->async = async; > > > > return 0; > > @@ -1691,15 +1696,13 @@ async_channel_register(int vid, uint16_t > queue_id, > > } > > > > int > > -rte_vhost_async_channel_register(int vid, uint16_t queue_id, > > - struct rte_vhost_async_config config, > > - struct rte_vhost_async_channel_ops *ops) > > +rte_vhost_async_channel_register(int vid, uint16_t queue_id) > > { > > struct vhost_virtqueue *vq; > > struct virtio_net *dev = get_device(vid); > > int ret; > > > > - if (dev == NULL || ops == NULL) > > + if (dev == NULL) > > return -1; > > > > if (queue_id >= VHOST_MAX_VRING) > > @@ -1710,33 +1713,20 @@ rte_vhost_async_channel_register(int vid, > uint16_t queue_id, > > if (unlikely(vq == NULL || !dev->async_copy)) > > return -1; > > > > - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { > > - VHOST_LOG_CONFIG(ERR, > > - "async copy is not supported on non-inorder mode " > > - "(vid %d, qid: %d)\n", vid, queue_id); > > - return -1; > > - } > > - > > - if (unlikely(ops->check_completed_copies == NULL || > > - ops->transfer_data == NULL)) > > - return -1; > > - > > rte_spinlock_lock(&vq->access_lock); > > - ret = async_channel_register(vid, queue_id, ops); > > + ret = async_channel_register(vid, queue_id); > > rte_spinlock_unlock(&vq->access_lock); > > 
> > return ret; > > } > > > > int > > -rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > queue_id, > > - struct rte_vhost_async_config config, > > - struct rte_vhost_async_channel_ops *ops) > > +rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > queue_id) > > { > > struct vhost_virtqueue *vq; > > struct virtio_net *dev = get_device(vid); > > > > - if (dev == NULL || ops == NULL) > > + if (dev == NULL) > > return -1; > > > > if (queue_id >= VHOST_MAX_VRING) > > @@ -1747,18 +1737,7 @@ > rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, > > if (unlikely(vq == NULL || !dev->async_copy)) > > return -1; > > > > - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { > > - VHOST_LOG_CONFIG(ERR, > > - "async copy is not supported on non-inorder mode " > > - "(vid %d, qid: %d)\n", vid, queue_id); > > - return -1; > > - } > > - > > - if (unlikely(ops->check_completed_copies == NULL || > > - ops->transfer_data == NULL)) > > - return -1; > > - > > - return async_channel_register(vid, queue_id, ops); > > + return async_channel_register(vid, queue_id); > > } > > > > int > > @@ -1835,6 +1814,83 @@ > rte_vhost_async_channel_unregister_thread_unsafe(int vid, uint16_t > queue_id) > > return 0; > > } > > > > +static __rte_always_inline void > > +vhost_free_async_dma_mem(void) > > +{ > > + uint16_t i; > > + > > + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) { > > + struct async_dma_info *dma = &dma_copy_track[i]; > > + int16_t j; > > + > > + if (dma->max_vchans == 0) { > > + continue; > > + } > > + > > + for (j = 0; j < dma->max_vchans; j++) { > > + rte_free(dma->vchans[j].metadata); > > + } > > + rte_free(dma->vchans); > > + dma->vchans = NULL; > > + dma->max_vchans = 0; > > + } > > +} > > + > > +int > > +rte_vhost_async_dma_configure(struct rte_vhost_async_dma_info *dmas, > uint16_t count) > > +{ > > + uint16_t i; > > + > > + if (!dmas) { > > + VHOST_LOG_CONFIG(ERR, "Invalid DMA configuration > parameter.\n"); > > + 
return -1; > > + } > > + > > + for (i = 0; i < count; i++) { > > + struct async_dma_vchan_info *vchans; > > + int16_t dev_id; > > + uint16_t max_vchans; > > + uint16_t max_desc; > > + uint16_t j; > > + > > + dev_id = dmas[i].dev_id; > > + max_vchans = dmas[i].max_vchans; > > + max_desc = dmas[i].max_desc; > > + > > + if (!rte_is_power_of_2(max_desc)) { > > + max_desc = rte_align32pow2(max_desc); > > + } > > That will be problematic with CNXK driver that reports 15 as max_desc. The max_desc used here is to allocate a circular array for the vchannel, and its size must be >= the vchan SW ring size, which is configured by rte_dma_vchan_setup(). So I think rte_align32pow2(15) will not become a problem for CNXK. > > > + > > + vchans = rte_zmalloc(NULL, sizeof(struct > async_dma_vchan_info) * max_vchans, > > + RTE_CACHE_LINE_SIZE); > > + if (vchans == NULL) { > > + VHOST_LOG_CONFIG(ERR, "Failed to allocate vchans > for dma-%d." > > + " Cannot enable async data-path.\n", > dev_id); > > + vhost_free_async_dma_mem(); > > + return -1; > > + } > > + > > + for (j = 0; j < max_vchans; j++) { > > + vchans[j].metadata = rte_zmalloc(NULL, sizeof(bool *) > * max_desc, > > + RTE_CACHE_LINE_SIZE); > > That's quite a huge allocation (4096 * 8B per channel). The fact is that max_vchans is 1 for all available DMA devices in DPDK, so the real size is 4096*8*num_dma_devices. Like Chenbo suggested, it would save memory if the DMA library provided an API to query the vchannel's real ring size. But for now, to keep the API compatible with later changes, I think Chenbo's suggestion of using dma_id and querying max_desc in vhost is good, since it's easy to change vhost to use the real ring size instead of the max ring size if the DMA library provides this API in the future. 
> > > + if (!vchans[j].metadata) { > > + VHOST_LOG_CONFIG(ERR, "Failed to allocate > metadata for " > > + "dma-%d vchan-%u\n", > dev_id, j); > > + vhost_free_async_dma_mem(); > > + return -1; > > + } > > + > > + vchans[j].ring_size = max_desc; > > + vchans[j].ring_mask = max_desc - 1; > > + } > > + > > + dma_copy_track[dev_id].vchans = vchans; > > + dma_copy_track[dev_id].max_vchans = max_vchans; > > + } > > + > > + return 0; > > +} > > + > > int > > rte_vhost_async_get_inflight(int vid, uint16_t queue_id) > > { > > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h > > index 7085e0885c..d9bda34e11 100644 > > --- a/lib/vhost/vhost.h > > +++ b/lib/vhost/vhost.h > > @@ -19,6 +19,7 @@ > > #include <rte_ether.h> > > #include <rte_rwlock.h> > > #include <rte_malloc.h> > > +#include <rte_dmadev.h> > > > > #include "rte_vhost.h" > > #include "rte_vdpa.h" > > @@ -50,6 +51,7 @@ > > > > #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) > > #define VHOST_MAX_ASYNC_VEC 2048 > > +#define VHOST_ASYNC_DMA_BATCHING_SIZE 32 > > > > #define PACKED_DESC_ENQUEUE_USED_FLAG(w) \ > > ((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | > VRING_DESC_F_WRITE) : \ > > @@ -119,6 +121,41 @@ struct vring_used_elem_packed { > > uint32_t count; > > }; > > > > +struct async_dma_vchan_info { > > + /* circular array to track copy metadata */ > > + bool **metadata; > > + > > + /* max elements in 'metadata' */ > > + uint16_t ring_size; > > + /* ring index mask for 'metadata' */ > > + uint16_t ring_mask; > > Given cnxk, we cannot use a mask as the ring size may not be a pow2. The ring_mask is not for device vchan ring, but for circular buffer which is used to track copy context. The size of circular buffer only needs >= vchan ring size. So if we can guarantee the size of circular buffer is pow2, ring_mask can be used. > > > + > > + /* batching copies before a DMA doorbell */ > > + uint16_t nr_batching; > > + > > + /** > > + * DMA virtual channel lock. 
Although it is able to bind DMA > > + * virtual channels to data plane threads, vhost control plane > > + * thread could call data plane functions too, thus causing > > + * DMA device contention. > > + * > > + * For example, in VM exit case, vhost control plane thread needs > > + * to clear in-flight packets before disable vring, but there could > > + * be anotther data plane thread is enqueuing packets to the same > > + * vring with the same DMA virtual channel. But dmadev PMD > functions > > + * are lock-free, so the control plane and data plane threads > > + * could operate the same DMA virtual channel at the same time. > > + */ > > + rte_spinlock_t dma_lock; > > +}; > > + > > +struct async_dma_info { > > + uint16_t max_vchans; > > + struct async_dma_vchan_info *vchans; > > +}; > > + > > +extern struct async_dma_info > dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > > + > > /** > > * inflight async packet information > > */ > > @@ -129,9 +166,6 @@ struct async_inflight_info { > > }; > > > > struct vhost_async { > > - /* operation callbacks for DMA */ > > - struct rte_vhost_async_channel_ops ops; > > - > > struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; > > struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; > > uint16_t iter_idx; > > @@ -139,6 +173,19 @@ struct vhost_async { > > > > /* data transfer status */ > > struct async_inflight_info *pkts_info; > > + /** > > + * packet reorder array. "true" indicates that DMA > > + * device completes all copies for the packet. > > + * > > + * Note that this array could be written by multiple > > + * threads at the same time. For example, two threads > > + * enqueue packets to the same virtqueue with their > > + * own DMA devices. However, since offloading is > > + * per-packet basis, each packet flag will only be > > + * written by one thread. And single byte write is > > + * atomic, so no lock is needed. > > + */ > > The vq->access_lock is held by the threads (directly or indirectly) > anyway, right? 
Yes, threads will acquire vq->access_lock, but think about the case of sharing one DMA device between 2 vrings. It's possible to poll completed copies from DMA0 that belong to vring1 when a thread calls poll_enqueue_completed for vring0, as vring0 and vring1 share DMA0. The vq->access_lock will not protect vring1 in this case. I think the comment is not accurate... In the case where two threads enqueue packets to the same vring, they will not write the pkts_cmpl_flag array simultaneously in submit_enqueue_burst, but they may in poll_enqueue_completed. In addition, since offloading is done on a per-packet basis, each element in pkts_cmpl_flag will only be written by one thread. Therefore, it's possible that multiple threads write pkts_cmpl_flag at the same time, but they write different elements. I will update the comment if the above explanation looks reasonable to you. > > > + bool *pkts_cmpl_flag; > > uint16_t pkts_idx; > > uint16_t pkts_inflight_n; > > union { > > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c > > index b3d954aab4..9f81fc9733 100644 > > --- a/lib/vhost/virtio_net.c > > +++ b/lib/vhost/virtio_net.c > > @@ -11,6 +11,7 @@ > > #include <rte_net.h> > > #include <rte_ether.h> > > #include <rte_ip.h> > > +#include <rte_dmadev.h> > > #include <rte_vhost.h> > > #include <rte_tcp.h> > > #include <rte_udp.h> > > @@ -25,6 +26,9 @@ > > > > #define MAX_BATCH_LEN 256 > > > > +/* DMA device copy operation tracking array. 
*/ > > +struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > > + > > static __rte_always_inline bool > > rxvq_is_mergeable(struct virtio_net *dev) > > { > > @@ -43,6 +47,108 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, > uint32_t nr_vring) > > return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring; > > } > > > > +static __rte_always_inline uint16_t > > +vhost_async_dma_transfer(struct vhost_virtqueue *vq, int16_t dma_id, > > + uint16_t vchan, uint16_t head_idx, > > + struct rte_vhost_iov_iter *pkts, uint16_t nr_pkts) > > +{ > > + struct async_dma_vchan_info *dma_info = > &dma_copy_track[dma_id].vchans[vchan]; > > + uint16_t ring_mask = dma_info->ring_mask; > > + uint16_t pkt_idx; > > + > > + rte_spinlock_lock(&dma_info->dma_lock); > > + > > + for (pkt_idx = 0; pkt_idx < nr_pkts; pkt_idx++) { > > + struct rte_vhost_iovec *iov = pkts[pkt_idx].iov; > > + int copy_idx = 0; > > + uint16_t nr_segs = pkts[pkt_idx].nr_segs; > > + uint16_t i; > > + > > + if (rte_dma_burst_capacity(dma_id, vchan) < nr_segs) { > > + goto out; > > + } > > + > > + for (i = 0; i < nr_segs; i++) { > > + /** > > + * We have checked the available space before > submit copies to DMA > > + * vChannel, so we don't handle error here. > > + */ > > + copy_idx = rte_dma_copy(dma_id, vchan, > (rte_iova_t)iov[i].src_addr, > > + (rte_iova_t)iov[i].dst_addr, iov[i].len, > > + RTE_DMA_OP_FLAG_LLC); > > + > > + /** > > + * Only store packet completion flag address in the > last copy's > > + * slot, and other slots are set to NULL. > > + */ > > + if (unlikely(i == (nr_segs - 1))) { > > I don't think using unlikely() is appropriate here, as single-segment > packets are more the norm than the exception, aren't they? Yes, good suggestion, and I will change later. > > > + dma_info->metadata[copy_idx & ring_mask] > = > dma_info->metadata[copy_idx % ring_size] instead. 
> > + &vq->async- >pkts_cmpl_flag[head_idx % vq->size]; > > + } > > + } > > + > > + dma_info->nr_batching += nr_segs; > > + if (unlikely(dma_info->nr_batching >= > VHOST_ASYNC_DMA_BATCHING_SIZE)) { > > + rte_dma_submit(dma_id, vchan); > > + dma_info->nr_batching = 0; > > + } > > + > > + head_idx++; > > + } > > + > > +out: > > + if (dma_info->nr_batching > 0) { > > + rte_dma_submit(dma_id, vchan); > > + dma_info->nr_batching = 0; > > + } > > + rte_spinlock_unlock(&dma_info->dma_lock); > > + > > + return pkt_idx; > > +} > > + > > +static __rte_always_inline uint16_t > > +vhost_async_dma_check_completed(int16_t dma_id, uint16_t vchan, > uint16_t max_pkts) > > +{ > > + struct async_dma_vchan_info *dma_info = > &dma_copy_track[dma_id].vchans[vchan]; > > + uint16_t ring_mask = dma_info->ring_mask; > > + uint16_t last_idx = 0; > > + uint16_t nr_copies; > > + uint16_t copy_idx; > > + uint16_t i; > > + > > + rte_spinlock_lock(&dma_info->dma_lock); > > + > > + /** > > + * Since all memory is pinned and addresses should be valid, > > + * we don't check errors. > > Please, check for errors to ease debugging, you can for now add a debug > print if error is set which would print rte_errno. And once we have my > Vhost statistics series in, I could add a counter for it. > > The DMA device could be in a bad state for other reasons than unpinned > memory or invalid addresses. Sure, thanks for your suggestions. I will add later. In addition, for the error handling in rte_dma_copy(), I avoid the ENOSPC error by checking the capacity before calling rte_dma_copy(). When other errors happen, it's possible that some copies of a packet have been enqueued to the DMA device while the rest fail. The DMA library doesn't have a cancel API to drop copies enqueued to the DMA SW ring but not submitted to the DMA device, so the only way to handle errors in this partial offloading case is to store the VA for each IOVA segment and do a SW copy for the failed ones. 
IOVA can be PA, so the IOVA and VA segments for the same packet would be different. What do you think? > > > + */ > > + nr_copies = rte_dma_completed(dma_id, vchan, max_pkts, &last_idx, > NULL); > > Are you sure max_pkts is valid here as a packet could contain several > segments? No, it's not the best choice for SG packets, but it will not cause errors, only hurt performance. I have added a factor to mitigate the performance impact for SG packets. Specifically, the number of copies to poll is calculated by factor*max_pkts, and factor is set to 1 by default but users can change it via rte_vhost_async_dma_configure(). For example, if the average packet is 4KB, the factor can be set to 2 (4KB/2KB, where 2KB is the mbuf size). Then the number of copies to poll from the DMA device is 2*32, rather than 32 as in the current design. > > > + if (nr_copies == 0) { > > + goto out; > > + } > > + > > + copy_idx = last_idx - nr_copies + 1; > > + for (i = 0; i < nr_copies; i++) { > > + bool *flag; > > + > > + flag = dma_info->metadata[copy_idx & ring_mask]; > > dma_info->metadata[copy_idx % ring_size] > > > + if (flag) { > > + /** > > + * Mark the packet flag as received. The flag > > + * could belong to another virtqueue but write > > + * is atomic. > > + */ > > + *flag = true; > > + dma_info->metadata[copy_idx & ring_mask] = NULL; > > dma_info->metadata[copy_idx % ring_size] Please see comments above. Thanks, Jiayu ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v2 0/1] integrate dmadev in vhost 2021-12-30 21:55 ` [PATCH v1 " Jiayu Hu 2021-12-30 21:55 ` [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu @ 2022-01-24 16:40 ` Jiayu Hu 2022-01-24 16:40 ` [PATCH v2 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu 1 sibling, 1 reply; 31+ messages in thread From: Jiayu Hu @ 2022-01-24 16:40 UTC (permalink / raw) To: dev Cc: maxime.coquelin, i.maximets, chenbo.xia, bruce.richardson, harry.van.haaren, sunil.pai.g, john.mcnamara, xuan.ding, cheng1.jiang, liangma, Jiayu Hu Since dmadev was introduced in 21.11, to avoid the overhead of a vhost DMA abstraction layer and to simplify application logic, this patch integrates dmadev in vhost. To enable the flexibility of using DMA devices in different function modules, not limited to vhost, vhost doesn't manage DMA devices. Applications, like OVS, need to manage and configure DMA devices and tell vhost which DMA device to use in every dataplane function call. In addition, vhost supports M:N mapping between vrings and DMA virtual channels. Specifically, one vring can use multiple different DMA channels and one DMA channel can be shared by multiple vrings at the same time. The reason for enabling one vring to use multiple DMA channels is that more than one dataplane thread may enqueue packets to the same vring, each with its own DMA virtual channel. Besides, the number of DMA devices is limited; for the purpose of scaling, it's necessary to support sharing DMA channels among vrings. As DMA acceleration is only enabled for the enqueue path, the new dataplane functions are: 1). rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id, dma_vchan): Get descriptors and submit copies to the DMA virtual channel for the packets that need to be sent to the VM. 2). 
rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id, dma_vchan): Check completed DMA copies from the given DMA virtual channel and write back corresponding descriptors to the vring. OVS needs to call rte_vhost_poll_enqueue_completed to clean up in-flight copies from the previous call, and it can be called inside the rxq_recv function, so it doesn't require a big change in the OVS datapath. For example: netdev_dpdk_vhost_rxq_recv() { ... qid = rxq->queue_id * VIRTIO_QNUM + VIRTIO_RXQ; rte_vhost_poll_enqueue_completed(vid, qid, ...); } Change log ========== v1 -> v2: - add SW fallback if rte_dma_copy() reports an error - print an error if rte_dma_completed() reports an error - add poll_factor when calling rte_dma_completed() for scatter-gather packets - use trylock instead of lock in rte_vhost_poll_enqueue_completed() - check if dma_id and vchan_id are valid - input dma_id in rte_vhost_async_dma_configure() - remove useless code, braces and hardcoding in vhost example - redefine MAX_VHOST_DEVICE to RTE_MAX_VHOST_DEVICE - update doc and comments rfc -> v1: - remove useless code - support dynamic DMA vchannel ring size (rte_vhost_async_dma_configure) - fix several bugs - fix typo and coding style issues - replace "while" with "for" - update programmer guide - support sharing DMA among vhost devices in vhost example - remove "--dma-type" in vhost example Jiayu Hu (1): vhost: integrate dmadev in asynchronous datapath doc/guides/prog_guide/vhost_lib.rst | 95 ++++----- examples/vhost/Makefile | 2 +- examples/vhost/ioat.c | 218 -------------------- examples/vhost/ioat.h | 63 ------ examples/vhost/main.c | 255 ++++++++++++++++++----- examples/vhost/main.h | 11 + examples/vhost/meson.build | 6 +- lib/vhost/meson.build | 2 +- lib/vhost/rte_vhost.h | 2 + lib/vhost/rte_vhost_async.h | 132 +++++------- lib/vhost/version.map | 3 + lib/vhost/vhost.c | 148 ++++++++++---- lib/vhost/vhost.h | 64 +++++- lib/vhost/vhost_user.c | 2 + lib/vhost/virtio_net.c | 305 +++++++++++++++++++++++----- 15 files changed, 744 
insertions(+), 564 deletions(-) delete mode 100644 examples/vhost/ioat.c delete mode 100644 examples/vhost/ioat.h -- 2.25.1 ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v2 1/1] vhost: integrate dmadev in asynchronous datapath 2022-01-24 16:40 ` [PATCH v2 0/1] integrate dmadev in vhost Jiayu Hu @ 2022-01-24 16:40 ` Jiayu Hu 2022-02-03 13:04 ` Maxime Coquelin 2022-02-08 10:40 ` [PATCH v3 0/1] integrate dmadev in vhost Jiayu Hu 0 siblings, 2 replies; 31+ messages in thread From: Jiayu Hu @ 2022-01-24 16:40 UTC (permalink / raw) To: dev Cc: maxime.coquelin, i.maximets, chenbo.xia, bruce.richardson, harry.van.haaren, sunil.pai.g, john.mcnamara, xuan.ding, cheng1.jiang, liangma, Jiayu Hu Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA abstraction layer and simplify application logics, this patch integrates dmadev in asynchronous data path. Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> --- doc/guides/prog_guide/vhost_lib.rst | 95 ++++----- examples/vhost/Makefile | 2 +- examples/vhost/ioat.c | 218 -------------------- examples/vhost/ioat.h | 63 ------ examples/vhost/main.c | 255 ++++++++++++++++++----- examples/vhost/main.h | 11 + examples/vhost/meson.build | 6 +- lib/vhost/meson.build | 2 +- lib/vhost/rte_vhost.h | 2 + lib/vhost/rte_vhost_async.h | 132 +++++------- lib/vhost/version.map | 3 + lib/vhost/vhost.c | 148 ++++++++++---- lib/vhost/vhost.h | 64 +++++- lib/vhost/vhost_user.c | 2 + lib/vhost/virtio_net.c | 305 +++++++++++++++++++++++----- 15 files changed, 744 insertions(+), 564 deletions(-) delete mode 100644 examples/vhost/ioat.c delete mode 100644 examples/vhost/ioat.h diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst index 76f5d303c9..acc10ea851 100644 --- a/doc/guides/prog_guide/vhost_lib.rst +++ b/doc/guides/prog_guide/vhost_lib.rst @@ -106,12 +106,11 @@ The following is an overview of some key Vhost API functions: - ``RTE_VHOST_USER_ASYNC_COPY`` Asynchronous data path will be enabled when this flag is set. 
Async data - path allows applications to register async copy devices (typically - hardware DMA channels) to the vhost queues. Vhost leverages the copy - device registered to free CPU from memory copy operations. A set of - async data path APIs are defined for DPDK applications to make use of - the async capability. Only packets enqueued/dequeued by async APIs are - processed through the async data path. + path allows applications to register DMA channels to the vhost queues. + Vhost leverages the registered DMA devices to free CPU from memory copy + operations. A set of async data path APIs are defined for DPDK applications + to make use of the async capability. Only packets enqueued/dequeued by + async APIs are processed through the async data path. Currently this feature is only implemented on split ring enqueue data path. @@ -218,52 +217,30 @@ The following is an overview of some key Vhost API functions: Enable or disable zero copy feature of the vhost crypto backend. -* ``rte_vhost_async_channel_register(vid, queue_id, config, ops)`` +* ``rte_vhost_async_dma_configure(dmas_id, count, poll_factor)`` - Register an async copy device channel for a vhost queue after vring - is enabled. Following device ``config`` must be specified together - with the registration: + Tell vhost which DMA devices are going to be used. This function needs to + be called before registering the async data path for any vring. - * ``features`` +* ``rte_vhost_async_channel_register(vid, queue_id)`` - This field is used to specify async copy device features. + Register async DMA acceleration for a vhost queue after vring is enabled. - ``RTE_VHOST_ASYNC_INORDER`` represents the async copy device can - guarantee the order of copy completion is the same as the order - of copy submission. +* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id)`` - Currently, only ``RTE_VHOST_ASYNC_INORDER`` capable device is - supported by vhost. 
- - Applications must provide following ``ops`` callbacks for vhost lib to - work with the async copy devices: - - * ``transfer_data(vid, queue_id, descs, opaque_data, count)`` - - vhost invokes this function to submit copy data to the async devices. - For non-async_inorder capable devices, ``opaque_data`` could be used - for identifying the completed packets. - - * ``check_completed_copies(vid, queue_id, opaque_data, max_packets)`` - - vhost invokes this function to get the copy data completed by async - devices. - -* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id, config, ops)`` - - Register an async copy device channel for a vhost queue without - performing any locking. + Register async DMA acceleration for a vhost queue without performing + any locking. This function is only safe to call in vhost callback functions (i.e., struct rte_vhost_device_ops). * ``rte_vhost_async_channel_unregister(vid, queue_id)`` - Unregister the async copy device channel from a vhost queue. + Unregister the async DMA acceleration from a vhost queue. Unregistration will fail, if the vhost queue has in-flight packets that are not completed. - Unregister async copy devices in vring_state_changed() may + Unregister async DMA acceleration in vring_state_changed() may fail, as this API tries to acquire the spinlock of vhost queue. The recommended way is to unregister async copy devices for all vhost queues in destroy_device(), when a @@ -271,24 +248,19 @@ The following is an overview of some key Vhost API functions: * ``rte_vhost_async_channel_unregister_thread_unsafe(vid, queue_id)`` - Unregister the async copy device channel for a vhost queue without - performing any locking. + Unregister async DMA acceleration for a vhost queue without performing + any locking. This function is only safe to call in vhost callback functions (i.e., struct rte_vhost_device_ops). 
-* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, comp_pkts, comp_count)`` +* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id, vchan_id)`` Submit an enqueue request to transmit ``count`` packets from host to guest - by async data path. Successfully enqueued packets can be transfer completed - or being occupied by DMA engines; transfer completed packets are returned in - ``comp_pkts``, but others are not guaranteed to finish, when this API - call returns. + by async data path. Applications must not free the packets submitted for + enqueue until the packets are completed. - Applications must not free the packets submitted for enqueue until the - packets are completed. - -* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count)`` +* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id, vchan_id)`` Poll enqueue completion status from async data path. Completed packets are returned to applications through ``pkts``. @@ -298,7 +270,7 @@ The following is an overview of some key Vhost API functions: This function returns the amount of in-flight packets for the vhost queue using async acceleration. -* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count)`` +* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count, dma_id, vchan_id)`` Clear inflight packets which are submitted to DMA engine in vhost async data path. Completed packets are returned to applications through ``pkts``. @@ -442,3 +414,26 @@ Finally, a set of device ops is defined for device specific operations: * ``get_notify_area`` Called to get the notify area info of the queue. + +Vhost asynchronous data path +---------------------------- + +Vhost asynchronous data path leverages DMA devices to offload memory +copies from the CPU and it is implemented in an asynchronous way. It +enables applications, like OVS, to save CPU cycles and hide memory copy +overhead, thus achieving higher throughput. 
+ +Vhost doesn't manage DMA devices; applications, like OVS, need to +manage and configure DMA devices. Applications need to tell vhost which +DMA devices to use in every data path function call. This design enables +the flexibility for applications to dynamically use DMA channels in +different function modules, not limited to vhost. + +In addition, vhost supports M:N mapping between vrings and DMA virtual +channels. Specifically, one vring can use multiple different DMA channels +and one DMA channel can be shared by multiple vrings at the same time. +The reason for enabling one vring to use multiple DMA channels is that +it's possible that more than one dataplane thread enqueues packets to +the same vring with their own DMA virtual channels. Besides, the number +of DMA devices is limited. For the purpose of scaling, it's necessary to +support sharing DMA channels among vrings. diff --git a/examples/vhost/Makefile b/examples/vhost/Makefile index 587ea2ab47..975a5dfe40 100644 --- a/examples/vhost/Makefile +++ b/examples/vhost/Makefile @@ -5,7 +5,7 @@ APP = vhost-switch # all source are stored in SRCS-y -SRCS-y := main.c virtio_net.c ioat.c +SRCS-y := main.c virtio_net.c PKGCONF ?= pkg-config diff --git a/examples/vhost/ioat.c b/examples/vhost/ioat.c deleted file mode 100644 index 9aeeb12fd9..0000000000 --- a/examples/vhost/ioat.c +++ /dev/null @@ -1,218 +0,0 @@ -/* SPDX-License-Identifier: BSD-3-Clause - * Copyright(c) 2010-2020 Intel Corporation - */ - -#include <sys/uio.h> -#ifdef RTE_RAW_IOAT -#include <rte_rawdev.h> -#include <rte_ioat_rawdev.h> - -#include "ioat.h" -#include "main.h" - -struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; - -struct packet_tracker { - unsigned short size_track[MAX_ENQUEUED_SIZE]; - unsigned short next_read; - unsigned short next_write; - unsigned short last_remain; - unsigned short ioat_space; -}; - -struct packet_tracker cb_tracker[MAX_VHOST_DEVICE]; - -int -open_ioat(const char *value) -{ - struct dma_for_vhost *dma_info = 
dma_bind; - char *input = strndup(value, strlen(value) + 1); - char *addrs = input; - char *ptrs[2]; - char *start, *end, *substr; - int64_t vid, vring_id; - struct rte_ioat_rawdev_config config; - struct rte_rawdev_info info = { .dev_private = &config }; - char name[32]; - int dev_id; - int ret = 0; - uint16_t i = 0; - char *dma_arg[MAX_VHOST_DEVICE]; - int args_nr; - - while (isblank(*addrs)) - addrs++; - if (*addrs == '\0') { - ret = -1; - goto out; - } - - /* process DMA devices within bracket. */ - addrs++; - substr = strtok(addrs, ";]"); - if (!substr) { - ret = -1; - goto out; - } - args_nr = rte_strsplit(substr, strlen(substr), - dma_arg, MAX_VHOST_DEVICE, ','); - if (args_nr <= 0) { - ret = -1; - goto out; - } - while (i < args_nr) { - char *arg_temp = dma_arg[i]; - uint8_t sub_nr; - sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); - if (sub_nr != 2) { - ret = -1; - goto out; - } - - start = strstr(ptrs[0], "txd"); - if (start == NULL) { - ret = -1; - goto out; - } - - start += 3; - vid = strtol(start, &end, 0); - if (end == start) { - ret = -1; - goto out; - } - - vring_id = 0 + VIRTIO_RXQ; - if (rte_pci_addr_parse(ptrs[1], - &(dma_info + vid)->dmas[vring_id].addr) < 0) { - ret = -1; - goto out; - } - - rte_pci_device_name(&(dma_info + vid)->dmas[vring_id].addr, - name, sizeof(name)); - dev_id = rte_rawdev_get_dev_id(name); - if (dev_id == (uint16_t)(-ENODEV) || - dev_id == (uint16_t)(-EINVAL)) { - ret = -1; - goto out; - } - - if (rte_rawdev_info_get(dev_id, &info, sizeof(config)) < 0 || - strstr(info.driver_name, "ioat") == NULL) { - ret = -1; - goto out; - } - - (dma_info + vid)->dmas[vring_id].dev_id = dev_id; - (dma_info + vid)->dmas[vring_id].is_valid = true; - config.ring_size = IOAT_RING_SIZE; - config.hdls_disable = true; - if (rte_rawdev_configure(dev_id, &info, sizeof(config)) < 0) { - ret = -1; - goto out; - } - rte_rawdev_start(dev_id); - cb_tracker[dev_id].ioat_space = IOAT_RING_SIZE - 1; - dma_info->nr++; - i++; - } -out: - 
free(input); - return ret; -} - -int32_t -ioat_transfer_data_cb(int vid, uint16_t queue_id, - struct rte_vhost_iov_iter *iov_iter, - struct rte_vhost_async_status *opaque_data, uint16_t count) -{ - uint32_t i_iter; - uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2 + VIRTIO_RXQ].dev_id; - struct rte_vhost_iov_iter *iter = NULL; - unsigned long i_seg; - unsigned short mask = MAX_ENQUEUED_SIZE - 1; - unsigned short write = cb_tracker[dev_id].next_write; - - if (!opaque_data) { - for (i_iter = 0; i_iter < count; i_iter++) { - iter = iov_iter + i_iter; - i_seg = 0; - if (cb_tracker[dev_id].ioat_space < iter->nr_segs) - break; - while (i_seg < iter->nr_segs) { - rte_ioat_enqueue_copy(dev_id, - (uintptr_t)(iter->iov[i_seg].src_addr), - (uintptr_t)(iter->iov[i_seg].dst_addr), - iter->iov[i_seg].len, - 0, - 0); - i_seg++; - } - write &= mask; - cb_tracker[dev_id].size_track[write] = iter->nr_segs; - cb_tracker[dev_id].ioat_space -= iter->nr_segs; - write++; - } - } else { - /* Opaque data is not supported */ - return -1; - } - /* ring the doorbell */ - rte_ioat_perform_ops(dev_id); - cb_tracker[dev_id].next_write = write; - return i_iter; -} - -int32_t -ioat_check_completed_copies_cb(int vid, uint16_t queue_id, - struct rte_vhost_async_status *opaque_data, - uint16_t max_packets) -{ - if (!opaque_data) { - uintptr_t dump[255]; - int n_seg; - unsigned short read, write; - unsigned short nb_packet = 0; - unsigned short mask = MAX_ENQUEUED_SIZE - 1; - unsigned short i; - - uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2 - + VIRTIO_RXQ].dev_id; - n_seg = rte_ioat_completed_ops(dev_id, 255, NULL, NULL, dump, dump); - if (n_seg < 0) { - RTE_LOG(ERR, - VHOST_DATA, - "fail to poll completed buf on IOAT device %u", - dev_id); - return 0; - } - if (n_seg == 0) - return 0; - - cb_tracker[dev_id].ioat_space += n_seg; - n_seg += cb_tracker[dev_id].last_remain; - - read = cb_tracker[dev_id].next_read; - write = cb_tracker[dev_id].next_write; - for (i = 0; i < max_packets; i++) { - 
read &= mask; - if (read == write) - break; - if (n_seg >= cb_tracker[dev_id].size_track[read]) { - n_seg -= cb_tracker[dev_id].size_track[read]; - read++; - nb_packet++; - } else { - break; - } - } - cb_tracker[dev_id].next_read = read; - cb_tracker[dev_id].last_remain = n_seg; - return nb_packet; - } - /* Opaque data is not supported */ - return -1; -} - -#endif /* RTE_RAW_IOAT */ diff --git a/examples/vhost/ioat.h b/examples/vhost/ioat.h deleted file mode 100644 index d9bf717e8d..0000000000 --- a/examples/vhost/ioat.h +++ /dev/null @@ -1,63 +0,0 @@ -/* SPDX-License-Identifier: BSD-3-Clause - * Copyright(c) 2010-2020 Intel Corporation - */ - -#ifndef _IOAT_H_ -#define _IOAT_H_ - -#include <rte_vhost.h> -#include <rte_pci.h> -#include <rte_vhost_async.h> - -#define MAX_VHOST_DEVICE 1024 -#define IOAT_RING_SIZE 4096 -#define MAX_ENQUEUED_SIZE 4096 - -struct dma_info { - struct rte_pci_addr addr; - uint16_t dev_id; - bool is_valid; -}; - -struct dma_for_vhost { - struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; - uint16_t nr; -}; - -#ifdef RTE_RAW_IOAT -int open_ioat(const char *value); - -int32_t -ioat_transfer_data_cb(int vid, uint16_t queue_id, - struct rte_vhost_iov_iter *iov_iter, - struct rte_vhost_async_status *opaque_data, uint16_t count); - -int32_t -ioat_check_completed_copies_cb(int vid, uint16_t queue_id, - struct rte_vhost_async_status *opaque_data, - uint16_t max_packets); -#else -static int open_ioat(const char *value __rte_unused) -{ - return -1; -} - -static int32_t -ioat_transfer_data_cb(int vid __rte_unused, uint16_t queue_id __rte_unused, - struct rte_vhost_iov_iter *iov_iter __rte_unused, - struct rte_vhost_async_status *opaque_data __rte_unused, - uint16_t count __rte_unused) -{ - return -1; -} - -static int32_t -ioat_check_completed_copies_cb(int vid __rte_unused, - uint16_t queue_id __rte_unused, - struct rte_vhost_async_status *opaque_data __rte_unused, - uint16_t max_packets __rte_unused) -{ - return -1; -} -#endif -#endif /* _IOAT_H_ */ 
diff --git a/examples/vhost/main.c b/examples/vhost/main.c index 590a77c723..b2c272059e 100644 --- a/examples/vhost/main.c +++ b/examples/vhost/main.c @@ -24,8 +24,9 @@ #include <rte_ip.h> #include <rte_tcp.h> #include <rte_pause.h> +#include <rte_dmadev.h> +#include <rte_vhost_async.h> -#include "ioat.h" #include "main.h" #ifndef MAX_QUEUES @@ -56,6 +57,13 @@ #define RTE_TEST_TX_DESC_DEFAULT 512 #define INVALID_PORT_ID 0xFF +#define INVALID_DMA_ID -1 + +#define DMA_RING_SIZE 4096 + +struct dma_for_vhost dma_bind[RTE_MAX_VHOST_DEVICE]; +int16_t dmas_id[RTE_DMADEV_DEFAULT_MAX]; +static int dma_count; /* mask of enabled ports */ static uint32_t enabled_port_mask = 0; @@ -94,10 +102,6 @@ static int client_mode; static int builtin_net_driver; -static int async_vhost_driver; - -static char *dma_type; - /* Specify timeout (in useconds) between retries on RX. */ static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US; /* Specify the number of retries on RX. */ @@ -191,18 +195,150 @@ struct mbuf_table lcore_tx_queue[RTE_MAX_LCORE]; * Every data core maintains a TX buffer for every vhost device, * which is used for batch pkts enqueue for higher performance. 
*/ -struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * MAX_VHOST_DEVICE]; +struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * RTE_MAX_VHOST_DEVICE]; #define MBUF_TABLE_DRAIN_TSC ((rte_get_tsc_hz() + US_PER_S - 1) \ / US_PER_S * BURST_TX_DRAIN_US) +static inline bool +is_dma_configured(int16_t dev_id) +{ + int i; + + for (i = 0; i < dma_count; i++) + if (dmas_id[i] == dev_id) + return true; + return false; +} + static inline int open_dma(const char *value) { - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) - return open_ioat(value); + struct dma_for_vhost *dma_info = dma_bind; + char *input = strndup(value, strlen(value) + 1); + char *addrs = input; + char *ptrs[2]; + char *start, *end, *substr; + int64_t vid; + + struct rte_dma_info info; + struct rte_dma_conf dev_config = { .nb_vchans = 1 }; + struct rte_dma_vchan_conf qconf = { + .direction = RTE_DMA_DIR_MEM_TO_MEM, + .nb_desc = DMA_RING_SIZE + }; + + int dev_id; + int ret = 0; + uint16_t i = 0; + char *dma_arg[RTE_MAX_VHOST_DEVICE]; + int args_nr; + + while (isblank(*addrs)) + addrs++; + if (*addrs == '\0') { + ret = -1; + goto out; + } + + /* process DMA devices within bracket. 
*/ + addrs++; + substr = strtok(addrs, ";]"); + if (!substr) { + ret = -1; + goto out; + } + + args_nr = rte_strsplit(substr, strlen(substr), dma_arg, RTE_MAX_VHOST_DEVICE, ','); + if (args_nr <= 0) { + ret = -1; + goto out; + } + + while (i < args_nr) { + char *arg_temp = dma_arg[i]; + uint8_t sub_nr; + + sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); + if (sub_nr != 2) { + ret = -1; + goto out; + } + + start = strstr(ptrs[0], "txd"); + if (start == NULL) { + ret = -1; + goto out; + } + + start += 3; + vid = strtol(start, &end, 0); + if (end == start) { + ret = -1; + goto out; + } + + dev_id = rte_dma_get_dev_id_by_name(ptrs[1]); + if (dev_id < 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to find DMA %s.\n", ptrs[1]); + ret = -1; + goto out; + } + + /* DMA device is already configured, so skip */ + if (is_dma_configured(dev_id)) + goto done; + + if (rte_dma_info_get(dev_id, &info) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Error with rte_dma_info_get()\n"); + ret = -1; + goto out; + } + + if (info.max_vchans < 1) { + RTE_LOG(ERR, VHOST_CONFIG, "No channels available on device %d\n", dev_id); + ret = -1; + goto out; + } - return -1; + if (rte_dma_configure(dev_id, &dev_config) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to configure DMA %d.\n", dev_id); + ret = -1; + goto out; + } + + /* Check the max desc supported by DMA device */ + rte_dma_info_get(dev_id, &info); + if (info.nb_vchans != 1) { + RTE_LOG(ERR, VHOST_CONFIG, "No configured queues reported by DMA %d.\n", + dev_id); + ret = -1; + goto out; + } + + qconf.nb_desc = RTE_MIN(DMA_RING_SIZE, info.max_desc); + + if (rte_dma_vchan_setup(dev_id, 0, &qconf) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to set up DMA %d.\n", dev_id); + ret = -1; + goto out; + } + + if (rte_dma_start(dev_id) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to start DMA %u.\n", dev_id); + ret = -1; + goto out; + } + + dmas_id[dma_count++] = dev_id; + +done: + (dma_info + vid)->dmas[VIRTIO_RXQ].dev_id = dev_id; + i++; + } +out: + 
free(input); + return ret; } /* @@ -500,8 +636,6 @@ enum { OPT_CLIENT_NUM, #define OPT_BUILTIN_NET_DRIVER "builtin-net-driver" OPT_BUILTIN_NET_DRIVER_NUM, -#define OPT_DMA_TYPE "dma-type" - OPT_DMA_TYPE_NUM, #define OPT_DMAS "dmas" OPT_DMAS_NUM, }; @@ -539,8 +673,6 @@ us_vhost_parse_args(int argc, char **argv) NULL, OPT_CLIENT_NUM}, {OPT_BUILTIN_NET_DRIVER, no_argument, NULL, OPT_BUILTIN_NET_DRIVER_NUM}, - {OPT_DMA_TYPE, required_argument, - NULL, OPT_DMA_TYPE_NUM}, {OPT_DMAS, required_argument, NULL, OPT_DMAS_NUM}, {NULL, 0, 0, 0}, @@ -661,10 +793,6 @@ us_vhost_parse_args(int argc, char **argv) } break; - case OPT_DMA_TYPE_NUM: - dma_type = optarg; - break; - case OPT_DMAS_NUM: if (open_dma(optarg) == -1) { RTE_LOG(INFO, VHOST_CONFIG, @@ -672,7 +800,6 @@ us_vhost_parse_args(int argc, char **argv) us_vhost_usage(prgname); return -1; } - async_vhost_driver = 1; break; case OPT_CLIENT_NUM: @@ -841,9 +968,10 @@ complete_async_pkts(struct vhost_dev *vdev) { struct rte_mbuf *p_cpl[MAX_PKT_BURST]; uint16_t complete_count; + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; complete_count = rte_vhost_poll_enqueue_completed(vdev->vid, - VIRTIO_RXQ, p_cpl, MAX_PKT_BURST); + VIRTIO_RXQ, p_cpl, MAX_PKT_BURST, dma_id, 0); if (complete_count) { free_pkts(p_cpl, complete_count); __atomic_sub_fetch(&vdev->pkts_inflight, complete_count, __ATOMIC_SEQ_CST); @@ -877,17 +1005,18 @@ static __rte_always_inline void drain_vhost(struct vhost_dev *vdev) { uint16_t ret; - uint32_t buff_idx = rte_lcore_id() * MAX_VHOST_DEVICE + vdev->vid; + uint32_t buff_idx = rte_lcore_id() * RTE_MAX_VHOST_DEVICE + vdev->vid; uint16_t nr_xmit = vhost_txbuff[buff_idx]->len; struct rte_mbuf **m = vhost_txbuff[buff_idx]->m_table; if (builtin_net_driver) { ret = vs_enqueue_pkts(vdev, VIRTIO_RXQ, m, nr_xmit); - } else if (async_vhost_driver) { + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { uint16_t enqueue_fail = 0; + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; 
complete_async_pkts(vdev); - ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit); + ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit, dma_id, 0); __atomic_add_fetch(&vdev->pkts_inflight, ret, __ATOMIC_SEQ_CST); enqueue_fail = nr_xmit - ret; @@ -905,7 +1034,7 @@ drain_vhost(struct vhost_dev *vdev) __ATOMIC_SEQ_CST); } - if (!async_vhost_driver) + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) free_pkts(m, nr_xmit); } @@ -921,7 +1050,7 @@ drain_vhost_table(void) if (unlikely(vdev->remove == 1)) continue; - vhost_txq = vhost_txbuff[lcore_id * MAX_VHOST_DEVICE + vhost_txq = vhost_txbuff[lcore_id * RTE_MAX_VHOST_DEVICE + vdev->vid]; cur_tsc = rte_rdtsc(); @@ -970,7 +1099,7 @@ virtio_tx_local(struct vhost_dev *vdev, struct rte_mbuf *m) return 0; } - vhost_txq = vhost_txbuff[lcore_id * MAX_VHOST_DEVICE + dst_vdev->vid]; + vhost_txq = vhost_txbuff[lcore_id * RTE_MAX_VHOST_DEVICE + dst_vdev->vid]; vhost_txq->m_table[vhost_txq->len++] = m; if (enable_stats) { @@ -1211,12 +1340,13 @@ drain_eth_rx(struct vhost_dev *vdev) if (builtin_net_driver) { enqueue_count = vs_enqueue_pkts(vdev, VIRTIO_RXQ, pkts, rx_count); - } else if (async_vhost_driver) { + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { uint16_t enqueue_fail = 0; + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; complete_async_pkts(vdev); enqueue_count = rte_vhost_submit_enqueue_burst(vdev->vid, - VIRTIO_RXQ, pkts, rx_count); + VIRTIO_RXQ, pkts, rx_count, dma_id, 0); __atomic_add_fetch(&vdev->pkts_inflight, enqueue_count, __ATOMIC_SEQ_CST); enqueue_fail = rx_count - enqueue_count; @@ -1235,7 +1365,7 @@ drain_eth_rx(struct vhost_dev *vdev) __ATOMIC_SEQ_CST); } - if (!async_vhost_driver) + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) free_pkts(pkts, rx_count); } @@ -1357,7 +1487,7 @@ destroy_device(int vid) } for (i = 0; i < RTE_MAX_LCORE; i++) - rte_free(vhost_txbuff[i * MAX_VHOST_DEVICE + vid]); + rte_free(vhost_txbuff[i * 
RTE_MAX_VHOST_DEVICE + vid]); if (builtin_net_driver) vs_vhost_net_remove(vdev); @@ -1387,18 +1517,20 @@ destroy_device(int vid) "(%d) device has been removed from data core\n", vdev->vid); - if (async_vhost_driver) { + if (dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled) { uint16_t n_pkt = 0; + int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; struct rte_mbuf *m_cpl[vdev->pkts_inflight]; while (vdev->pkts_inflight) { n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, VIRTIO_RXQ, - m_cpl, vdev->pkts_inflight); + m_cpl, vdev->pkts_inflight, dma_id, 0); free_pkts(m_cpl, n_pkt); __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, __ATOMIC_SEQ_CST); } rte_vhost_async_channel_unregister(vid, VIRTIO_RXQ); + dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = false; } rte_free(vdev); @@ -1425,12 +1557,12 @@ new_device(int vid) vdev->vid = vid; for (i = 0; i < RTE_MAX_LCORE; i++) { - vhost_txbuff[i * MAX_VHOST_DEVICE + vid] + vhost_txbuff[i * RTE_MAX_VHOST_DEVICE + vid] = rte_zmalloc("vhost bufftable", sizeof(struct vhost_bufftable), RTE_CACHE_LINE_SIZE); - if (vhost_txbuff[i * MAX_VHOST_DEVICE + vid] == NULL) { + if (vhost_txbuff[i * RTE_MAX_VHOST_DEVICE + vid] == NULL) { RTE_LOG(INFO, VHOST_DATA, "(%d) couldn't allocate memory for vhost TX\n", vid); return -1; @@ -1468,20 +1600,13 @@ new_device(int vid) "(%d) device has been added to data core %d\n", vid, vdev->coreid); - if (async_vhost_driver) { - struct rte_vhost_async_config config = {0}; - struct rte_vhost_async_channel_ops channel_ops; - - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) { - channel_ops.transfer_data = ioat_transfer_data_cb; - channel_ops.check_completed_copies = - ioat_check_completed_copies_cb; - - config.features = RTE_VHOST_ASYNC_INORDER; + if (dma_bind[vid].dmas[VIRTIO_RXQ].dev_id != INVALID_DMA_ID) { + int ret; - return rte_vhost_async_channel_register(vid, VIRTIO_RXQ, - config, &channel_ops); - } + ret = rte_vhost_async_channel_register(vid, VIRTIO_RXQ); + if (ret == 0) + 
dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = true; + return ret; } return 0; @@ -1502,14 +1627,15 @@ vring_state_changed(int vid, uint16_t queue_id, int enable) if (queue_id != VIRTIO_RXQ) return 0; - if (async_vhost_driver) { + if (dma_bind[vid].dmas[queue_id].async_enabled) { if (!enable) { uint16_t n_pkt = 0; + int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; struct rte_mbuf *m_cpl[vdev->pkts_inflight]; while (vdev->pkts_inflight) { n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, queue_id, - m_cpl, vdev->pkts_inflight); + m_cpl, vdev->pkts_inflight, dma_id, 0); free_pkts(m_cpl, n_pkt); __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, __ATOMIC_SEQ_CST); } @@ -1657,6 +1783,24 @@ create_mbuf_pool(uint16_t nr_port, uint32_t nr_switch_core, uint32_t mbuf_size, rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n"); } +static void +reset_dma(void) +{ + int i; + + for (i = 0; i < RTE_MAX_VHOST_DEVICE; i++) { + int j; + + for (j = 0; j < RTE_MAX_QUEUES_PER_PORT * 2; j++) { + dma_bind[i].dmas[j].dev_id = INVALID_DMA_ID; + dma_bind[i].dmas[j].async_enabled = false; + } + } + + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) + dmas_id[i] = INVALID_DMA_ID; +} + /* * Main function, does initialisation and calls the per-lcore functions. */ @@ -1679,6 +1823,9 @@ main(int argc, char *argv[]) argc -= ret; argv += ret; + /* initialize dma structures */ + reset_dma(); + /* parse app arguments */ ret = us_vhost_parse_args(argc, argv); if (ret < 0) @@ -1754,11 +1901,21 @@ main(int argc, char *argv[]) if (client_mode) flags |= RTE_VHOST_USER_CLIENT; + if (dma_count) { + if (rte_vhost_async_dma_configure(dmas_id, dma_count, 1) < 0) { + RTE_LOG(ERR, VHOST_PORT, "Failed to configure DMA in vhost.\n"); + for (i = 0; i < dma_count; i++) + if (dmas_id[i] >= 0) + rte_dma_stop(dmas_id[i]); + rte_exit(EXIT_FAILURE, "Cannot use given DMA devices\n"); + } + } + /* Register vhost user driver to handle vhost messages. 
*/ for (i = 0; i < nb_sockets; i++) { char *file = socket_files + i * PATH_MAX; - if (async_vhost_driver) + if (dma_count) flags = flags | RTE_VHOST_USER_ASYNC_COPY; ret = rte_vhost_driver_register(file, flags); diff --git a/examples/vhost/main.h b/examples/vhost/main.h index e7b1ac60a6..b4a453e77e 100644 --- a/examples/vhost/main.h +++ b/examples/vhost/main.h @@ -8,6 +8,7 @@ #include <sys/queue.h> #include <rte_ether.h> +#include <rte_pci.h> /* Macros for printing using RTE_LOG */ #define RTE_LOGTYPE_VHOST_CONFIG RTE_LOGTYPE_USER1 @@ -79,6 +80,16 @@ struct lcore_info { struct vhost_dev_tailq_list vdev_list; }; +struct dma_info { + struct rte_pci_addr addr; + int16_t dev_id; + bool async_enabled; +}; + +struct dma_for_vhost { + struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; +}; + /* we implement non-extra virtio net features */ #define VIRTIO_NET_FEATURES 0 diff --git a/examples/vhost/meson.build b/examples/vhost/meson.build index 3efd5e6540..87a637f83f 100644 --- a/examples/vhost/meson.build +++ b/examples/vhost/meson.build @@ -12,13 +12,9 @@ if not is_linux endif deps += 'vhost' +deps += 'dmadev' allow_experimental_apis = true sources = files( 'main.c', 'virtio_net.c', ) - -if dpdk_conf.has('RTE_RAW_IOAT') - deps += 'raw_ioat' - sources += files('ioat.c') -endif diff --git a/lib/vhost/meson.build b/lib/vhost/meson.build index cdb37a4814..bc7272053b 100644 --- a/lib/vhost/meson.build +++ b/lib/vhost/meson.build @@ -36,4 +36,4 @@ headers = files( driver_sdk_headers = files( 'vdpa_driver.h', ) -deps += ['ethdev', 'cryptodev', 'hash', 'pci'] +deps += ['ethdev', 'cryptodev', 'hash', 'pci', 'dmadev'] diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h index b454c05868..15c37dd26e 100644 --- a/lib/vhost/rte_vhost.h +++ b/lib/vhost/rte_vhost.h @@ -113,6 +113,8 @@ extern "C" { #define VHOST_USER_F_PROTOCOL_FEATURES 30 #endif +#define RTE_MAX_VHOST_DEVICE 1024 + struct rte_vdpa_device; /** diff --git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h 
index a87ea6ba37..758a80f403 100644 --- a/lib/vhost/rte_vhost_async.h +++ b/lib/vhost/rte_vhost_async.h @@ -26,73 +26,6 @@ struct rte_vhost_iov_iter { unsigned long nr_segs; }; -/** - * dma transfer status - */ -struct rte_vhost_async_status { - /** An array of application specific data for source memory */ - uintptr_t *src_opaque_data; - /** An array of application specific data for destination memory */ - uintptr_t *dst_opaque_data; -}; - -/** - * dma operation callbacks to be implemented by applications - */ -struct rte_vhost_async_channel_ops { - /** - * instruct async engines to perform copies for a batch of packets - * - * @param vid - * id of vhost device to perform data copies - * @param queue_id - * queue id to perform data copies - * @param iov_iter - * an array of IOV iterators - * @param opaque_data - * opaque data pair sending to DMA engine - * @param count - * number of elements in the "descs" array - * @return - * number of IOV iterators processed, negative value means error - */ - int32_t (*transfer_data)(int vid, uint16_t queue_id, - struct rte_vhost_iov_iter *iov_iter, - struct rte_vhost_async_status *opaque_data, - uint16_t count); - /** - * check copy-completed packets from the async engine - * @param vid - * id of vhost device to check copy completion - * @param queue_id - * queue id to check copy completion - * @param opaque_data - * buffer to receive the opaque data pair from DMA engine - * @param max_packets - * max number of packets could be completed - * @return - * number of async descs completed, negative value means error - */ - int32_t (*check_completed_copies)(int vid, uint16_t queue_id, - struct rte_vhost_async_status *opaque_data, - uint16_t max_packets); -}; - -/** - * async channel features - */ -enum { - RTE_VHOST_ASYNC_INORDER = 1U << 0, -}; - -/** - * async channel configuration - */ -struct rte_vhost_async_config { - uint32_t features; - uint32_t rsvd[2]; -}; - /** * Register an async channel for a vhost queue * @@ -100,17 
+33,11 @@ struct rte_vhost_async_config { * vhost device id async channel to be attached to * @param queue_id * vhost queue id async channel to be attached to - * @param config - * Async channel configuration structure - * @param ops - * Async channel operation callbacks * @return * 0 on success, -1 on failures */ __rte_experimental -int rte_vhost_async_channel_register(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops); +int rte_vhost_async_channel_register(int vid, uint16_t queue_id); /** * Unregister an async channel for a vhost queue @@ -136,17 +63,11 @@ int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id); * vhost device id async channel to be attached to * @param queue_id * vhost queue id async channel to be attached to - * @param config - * Async channel configuration - * @param ops - * Async channel operation callbacks * @return * 0 on success, -1 on failures */ __rte_experimental -int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops); +int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id); /** * Unregister an async channel for a vhost queue without performing any @@ -179,12 +100,17 @@ int rte_vhost_async_channel_unregister_thread_unsafe(int vid, * array of packets to be enqueued * @param count * packets num to be enqueued + * @param dma_id + * the identifier of the DMA device + * @param vchan_id + * the identifier of virtual DMA channel * @return * num of packets enqueued */ __rte_experimental uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count); + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id); /** * This function checks async completion status for a specific vhost @@ -199,12 +125,17 @@ uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, * blank array 
to get return packet pointer * @param count * size of the packet array + * @param dma_id + * the identifier of the DMA device + * @param vchan_id + * the identifier of virtual DMA channel * @return * num of packets returned */ __rte_experimental uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count); + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id); /** * This function returns the amount of in-flight packets for the vhost * @@ -235,11 +166,44 @@ int rte_vhost_async_get_inflight(int vid, uint16_t queue_id); * Blank array to get return packet pointer * @param count * Size of the packet array + * @param dma_id + * the identifier of the DMA device + * @param vchan_id + * the identifier of virtual DMA channel * @return * Number of packets returned */ __rte_experimental uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count); + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id); +/** + * The DMA vChannels used in the asynchronous data path must be configured + * first. So this function needs to be called before enabling DMA + * acceleration for a vring. If this function fails, the asynchronous data path + * cannot be enabled for any vring afterwards. + * + * DMA devices used in the data path must belong to the DMA devices given in this + * function. But users are free to use the DMA devices given in this function + * in non-vhost scenarios, as long as they guarantee that no vhost copies are + * offloaded to them at the same time. + * + * @param dmas_id + * DMA ID array + * @param count + * Element number of 'dmas_id' + * @param poll_factor + * For large or scatter-gather packets, one packet would consist of + * several small buffers. In this case, vhost will issue several DMA copy + * operations for the packet.
Therefore, the number of copies to + * check by rte_dma_completed() is calculated by "nb_pkts_to_poll * + * poll_factor" and used in rte_vhost_poll_enqueue_completed(). The + * default value of "poll_factor" is 1. + * @return + * 0 on success, and -1 on failure + */ +__rte_experimental +int rte_vhost_async_dma_configure(int16_t *dmas_id, uint16_t count, + uint16_t poll_factor); #endif /* _RTE_VHOST_ASYNC_H_ */ diff --git a/lib/vhost/version.map b/lib/vhost/version.map index a7ef7f1976..1202ba9c1a 100644 --- a/lib/vhost/version.map +++ b/lib/vhost/version.map @@ -84,6 +84,9 @@ EXPERIMENTAL { # added in 21.11 rte_vhost_get_monitor_addr; + + # added in 22.03 + rte_vhost_async_dma_configure; }; INTERNAL { diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c index 13a9bb9dd1..c408cee63e 100644 --- a/lib/vhost/vhost.c +++ b/lib/vhost/vhost.c @@ -25,7 +25,7 @@ #include "vhost.h" #include "vhost_user.h" -struct virtio_net *vhost_devices[MAX_VHOST_DEVICE]; +struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE]; pthread_mutex_t vhost_dev_lock = PTHREAD_MUTEX_INITIALIZER; /* Called with iotlb_lock read-locked */ @@ -344,6 +344,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq) return; rte_free(vq->async->pkts_info); + rte_free(vq->async->pkts_cmpl_flag); rte_free(vq->async->buffers_packed); vq->async->buffers_packed = NULL; @@ -667,12 +668,12 @@ vhost_new_device(void) int i; pthread_mutex_lock(&vhost_dev_lock); - for (i = 0; i < MAX_VHOST_DEVICE; i++) { + for (i = 0; i < RTE_MAX_VHOST_DEVICE; i++) { if (vhost_devices[i] == NULL) break; } - if (i == MAX_VHOST_DEVICE) { + if (i == RTE_MAX_VHOST_DEVICE) { VHOST_LOG_CONFIG(ERR, "Failed to find a free slot for new device.\n"); pthread_mutex_unlock(&vhost_dev_lock); @@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid, } static __rte_always_inline int -async_channel_register(int vid, uint16_t queue_id, - struct rte_vhost_async_channel_ops *ops) +async_channel_register(int vid, uint16_t queue_id) { struct virtio_net
*dev = get_device(vid); struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; @@ -1656,6 +1656,14 @@ async_channel_register(int vid, uint16_t queue_id, goto out_free_async; } + async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size * sizeof(bool), + RTE_CACHE_LINE_SIZE, node); + if (!async->pkts_cmpl_flag) { + VHOST_LOG_CONFIG(ERR, "failed to allocate async pkts_cmpl_flag (vid %d, qid: %d)\n", + vid, queue_id); + goto out_free_async; + } + if (vq_is_packed(dev)) { async->buffers_packed = rte_malloc_socket(NULL, vq->size * sizeof(struct vring_used_elem_packed), @@ -1676,9 +1684,6 @@ async_channel_register(int vid, uint16_t queue_id, } } - async->ops.check_completed_copies = ops->check_completed_copies; - async->ops.transfer_data = ops->transfer_data; - vq->async = async; return 0; @@ -1691,15 +1696,13 @@ async_channel_register(int vid, uint16_t queue_id, } int -rte_vhost_async_channel_register(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops) +rte_vhost_async_channel_register(int vid, uint16_t queue_id) { struct vhost_virtqueue *vq; struct virtio_net *dev = get_device(vid); int ret; - if (dev == NULL || ops == NULL) + if (dev == NULL) return -1; if (queue_id >= VHOST_MAX_VRING) @@ -1710,33 +1713,20 @@ rte_vhost_async_channel_register(int vid, uint16_t queue_id, if (unlikely(vq == NULL || !dev->async_copy)) return -1; - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { - VHOST_LOG_CONFIG(ERR, - "async copy is not supported on non-inorder mode " - "(vid %d, qid: %d)\n", vid, queue_id); - return -1; - } - - if (unlikely(ops->check_completed_copies == NULL || - ops->transfer_data == NULL)) - return -1; - rte_spinlock_lock(&vq->access_lock); - ret = async_channel_register(vid, queue_id, ops); + ret = async_channel_register(vid, queue_id); rte_spinlock_unlock(&vq->access_lock); return ret; } int -rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, - struct 
rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops) +rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id) { struct vhost_virtqueue *vq; struct virtio_net *dev = get_device(vid); - if (dev == NULL || ops == NULL) + if (dev == NULL) return -1; if (queue_id >= VHOST_MAX_VRING) @@ -1747,18 +1737,7 @@ rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, if (unlikely(vq == NULL || !dev->async_copy)) return -1; - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { - VHOST_LOG_CONFIG(ERR, - "async copy is not supported on non-inorder mode " - "(vid %d, qid: %d)\n", vid, queue_id); - return -1; - } - - if (unlikely(ops->check_completed_copies == NULL || - ops->transfer_data == NULL)) - return -1; - - return async_channel_register(vid, queue_id, ops); + return async_channel_register(vid, queue_id); } int @@ -1835,6 +1814,95 @@ rte_vhost_async_channel_unregister_thread_unsafe(int vid, uint16_t queue_id) return 0; } +static __rte_always_inline void +vhost_free_async_dma_mem(void) +{ + uint16_t i; + + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) { + struct async_dma_info *dma = &dma_copy_track[i]; + int16_t j; + + if (dma->max_vchans == 0) + continue; + + for (j = 0; j < dma->max_vchans; j++) + rte_free(dma->vchans[j].pkts_completed_flag); + + rte_free(dma->vchans); + dma->vchans = NULL; + dma->max_vchans = 0; + } +} + +int +rte_vhost_async_dma_configure(int16_t *dmas_id, uint16_t count, uint16_t poll_factor) +{ + uint16_t i; + + if (!dmas_id) { + VHOST_LOG_CONFIG(ERR, "Invalid DMA configuration parameter.\n"); + return -1; + } + + if (poll_factor == 0) { + VHOST_LOG_CONFIG(ERR, "Invalid DMA poll factor %u\n", poll_factor); + return -1; + } + dma_poll_factor = poll_factor; + + for (i = 0; i < count; i++) { + struct async_dma_vchan_info *vchans; + struct rte_dma_info info; + uint16_t max_vchans; + uint16_t max_desc; + uint16_t j; + + if (!rte_dma_is_valid(dmas_id[i])) { + VHOST_LOG_CONFIG(ERR, "DMA %d is 
not found. Cannot enable async" " data-path.\n", dmas_id[i]); + vhost_free_async_dma_mem(); + return -1; + } + + rte_dma_info_get(dmas_id[i], &info); + + max_vchans = info.max_vchans; + max_desc = info.max_desc; + + if (!rte_is_power_of_2(max_desc)) + max_desc = rte_align32pow2(max_desc); + + vchans = rte_zmalloc(NULL, sizeof(struct async_dma_vchan_info) * max_vchans, + RTE_CACHE_LINE_SIZE); + if (vchans == NULL) { + VHOST_LOG_CONFIG(ERR, "Failed to allocate vchans for dma-%d." + " Cannot enable async data-path.\n", dmas_id[i]); + vhost_free_async_dma_mem(); + return -1; + } + + for (j = 0; j < max_vchans; j++) { + vchans[j].pkts_completed_flag = rte_zmalloc(NULL, sizeof(bool *) * max_desc, + RTE_CACHE_LINE_SIZE); + if (!vchans[j].pkts_completed_flag) { + VHOST_LOG_CONFIG(ERR, "Failed to allocate pkts_completed_flag for " + "dma-%d vchan-%u\n", dmas_id[i], j); + vhost_free_async_dma_mem(); + return -1; + } + + vchans[j].ring_size = max_desc; + vchans[j].ring_mask = max_desc - 1; + } + + dma_copy_track[dmas_id[i]].vchans = vchans; + dma_copy_track[dmas_id[i]].max_vchans = max_vchans; + } + + return 0; +} + int rte_vhost_async_get_inflight(int vid, uint16_t queue_id) { diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h index 7085e0885c..475843fec0 100644 --- a/lib/vhost/vhost.h +++ b/lib/vhost/vhost.h @@ -19,6 +19,7 @@ #include <rte_ether.h> #include <rte_rwlock.h> #include <rte_malloc.h> +#include <rte_dmadev.h> #include "rte_vhost.h" #include "rte_vdpa.h" @@ -50,6 +51,7 @@ #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) #define VHOST_MAX_ASYNC_VEC 2048 +#define VHOST_ASYNC_DMA_BATCHING_SIZE 32 #define PACKED_DESC_ENQUEUE_USED_FLAG(w) \ ((w) ?
(VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \ @@ -119,6 +121,42 @@ struct vring_used_elem_packed { uint32_t count; }; +struct async_dma_vchan_info { + /* circular array to track if packet copy completes */ + bool **pkts_completed_flag; + + /* max elements in 'pkts_completed_flag' */ + uint16_t ring_size; + /* ring index mask for 'pkts_completed_flag' */ + uint16_t ring_mask; + + /* batching copies before a DMA doorbell */ + uint16_t nr_batching; + + /** + * DMA virtual channel lock. Although it is possible to bind DMA + * virtual channels to data plane threads, the vhost control plane + * thread could call data plane functions too, thus causing + * DMA device contention. + * + * For example, in the VM exit case, the vhost control plane thread + * needs to clear in-flight packets before disabling the vring, but + * another data plane thread could be enqueuing packets to the same + * vring with the same DMA virtual channel. As dmadev PMD functions + * are lock-free, the control plane and data plane threads + * could operate on the same DMA virtual channel at the same time. + */ + rte_spinlock_t dma_lock; +}; + +struct async_dma_info { + uint16_t max_vchans; + struct async_dma_vchan_info *vchans; +}; + +extern struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; +extern uint16_t dma_poll_factor; + /** * inflight async packet information */ @@ -129,9 +167,6 @@ struct async_inflight_info { }; struct vhost_async { - /* operation callbacks for DMA */ - struct rte_vhost_async_channel_ops ops; - struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; uint16_t iter_idx; @@ -139,6 +174,25 @@ struct vhost_async { /* data transfer status */ struct async_inflight_info *pkts_info; + /** + * Packet reorder array. "true" indicates that the DMA device + * has completed all copies for the packet. + * + * Note that this array could be written by multiple threads + * simultaneously.
For example, in the case where thread0 and + * thread1 receive packets from the NIC and then enqueue them to + * vring0 and vring1 with their own DMA devices DMA0 and DMA1, it's + * possible for thread0 to get completed copies belonging to + * vring1 from DMA0, while thread0 is calling rte_vhost_poll + * _enqueue_completed() for vring0 and thread1 is calling + * rte_vhost_submit_enqueue_burst() for vring1. In this case, + * vq->access_lock cannot protect pkts_cmpl_flag of vring1. + * + * However, since offloading is done on a per-packet basis, each + * packet flag will only be written by one thread. And a single-byte + * write is atomic, so no lock for pkts_cmpl_flag is needed. + */ + bool *pkts_cmpl_flag; uint16_t pkts_idx; uint16_t pkts_inflight_n; union { @@ -198,6 +252,7 @@ struct vhost_virtqueue { /* Record packed ring first dequeue desc index */ uint16_t shadow_last_used_idx; + uint16_t batch_copy_max_elems; uint16_t batch_copy_nb_elems; struct batch_copy_elem *batch_copy_elems; int numa_node; @@ -568,8 +623,7 @@ extern int vhost_data_log_level; #define PRINT_PACKET(device, addr, size, header) do {} while (0) #endif -#define MAX_VHOST_DEVICE 1024 -extern struct virtio_net *vhost_devices[MAX_VHOST_DEVICE]; +extern struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE]; #define VHOST_BINARY_SEARCH_THRESH 256 diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c index 5eb1dd6812..3147e72f04 100644 --- a/lib/vhost/vhost_user.c +++ b/lib/vhost/vhost_user.c @@ -527,6 +527,8 @@ vhost_user_set_vring_num(struct virtio_net **pdev, return RTE_VHOST_MSG_RESULT_ERR; } + vq->batch_copy_max_elems = vq->size; + return RTE_VHOST_MSG_RESULT_OK; } diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c index b3d954aab4..305f6cd562 100644 --- a/lib/vhost/virtio_net.c +++ b/lib/vhost/virtio_net.c @@ -11,6 +11,7 @@ #include <rte_net.h> #include <rte_ether.h> #include <rte_ip.h> +#include <rte_dmadev.h> #include <rte_vhost.h> #include <rte_tcp.h> #include <rte_udp.h> @@ -25,6 +26,10 @@ #define
MAX_BATCH_LEN 256 +/* DMA device copy operation tracking array. */ +struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; +uint16_t dma_poll_factor = 1; + static __rte_always_inline bool rxvq_is_mergeable(struct virtio_net *dev) { @@ -43,6 +48,140 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring) return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring; } +static __rte_always_inline uint16_t +vhost_async_dma_transfer(struct vhost_virtqueue *vq, int16_t dma_id, + uint16_t vchan_id, uint16_t head_idx, + struct rte_vhost_iov_iter *pkts, uint16_t nr_pkts) +{ + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan_id]; + uint16_t ring_mask = dma_info->ring_mask; + uint16_t pkt_idx, bce_idx = 0; + + rte_spinlock_lock(&dma_info->dma_lock); + + for (pkt_idx = 0; pkt_idx < nr_pkts; pkt_idx++) { + struct rte_vhost_iovec *iov = pkts[pkt_idx].iov; + int copy_idx, last_copy_idx = 0; + uint16_t nr_segs = pkts[pkt_idx].nr_segs; + uint16_t nr_sw_copy = 0; + uint16_t i; + + if (rte_dma_burst_capacity(dma_id, vchan_id) < nr_segs) + goto out; + + for (i = 0; i < nr_segs; i++) { + /* Fallback to SW copy if error happens */ + copy_idx = rte_dma_copy(dma_id, vchan_id, (rte_iova_t)iov[i].src_addr, + (rte_iova_t)iov[i].dst_addr, iov[i].len, + RTE_DMA_OP_FLAG_LLC); + if (unlikely(copy_idx < 0)) { + /* Find corresponding VA pair and do SW copy */ + rte_memcpy(vq->batch_copy_elems[bce_idx].dst, + vq->batch_copy_elems[bce_idx].src, + vq->batch_copy_elems[bce_idx].len); + nr_sw_copy++; + + /** + * All copies of the packet are performed + * by the CPU, set the packet completion flag + * to true, as all copies are done. + */ + if (nr_sw_copy == nr_segs) { + vq->async->pkts_cmpl_flag[head_idx % vq->size] = true; + break; + } else if (i == (nr_segs - 1)) { + /** + * A part of copies of current packet + * are enqueued to the DMA successfully + * but the last copy fails, store the + * packet completion flag address + * in the last DMA copy slot. 
+ */ + dma_info->pkts_completed_flag[last_copy_idx & ring_mask] = + &vq->async->pkts_cmpl_flag[head_idx % vq->size]; + break; + } + } else + last_copy_idx = copy_idx; + + bce_idx++; + + /** + * Only store packet completion flag address in the last copy's + * slot, and other slots are set to NULL. + */ + if (i == (nr_segs - 1)) { + dma_info->pkts_completed_flag[copy_idx & ring_mask] = + &vq->async->pkts_cmpl_flag[head_idx % vq->size]; + } + } + + dma_info->nr_batching += nr_segs; + if (unlikely(dma_info->nr_batching >= VHOST_ASYNC_DMA_BATCHING_SIZE)) { + rte_dma_submit(dma_id, vchan_id); + dma_info->nr_batching = 0; + } + + head_idx++; + } + +out: + if (dma_info->nr_batching > 0) { + rte_dma_submit(dma_id, vchan_id); + dma_info->nr_batching = 0; + } + rte_spinlock_unlock(&dma_info->dma_lock); + vq->batch_copy_nb_elems = 0; + + return pkt_idx; +} + +static __rte_always_inline uint16_t +vhost_async_dma_check_completed(int16_t dma_id, uint16_t vchan_id, uint16_t max_pkts) +{ + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan_id]; + uint16_t ring_mask = dma_info->ring_mask; + uint16_t last_idx = 0; + uint16_t nr_copies; + uint16_t copy_idx; + uint16_t i; + bool has_error = false; + + rte_spinlock_lock(&dma_info->dma_lock); + + /** + * Print error log for debugging, if DMA reports error during + * DMA transfer. We do not handle error in vhost level. + */ + nr_copies = rte_dma_completed(dma_id, vchan_id, max_pkts, &last_idx, &has_error); + if (unlikely(has_error)) { + VHOST_LOG_DATA(ERR, "dma %d vchannel %u reports error in rte_dma_completed()\n", + dma_id, vchan_id); + } else if (nr_copies == 0) + goto out; + + copy_idx = last_idx - nr_copies + 1; + for (i = 0; i < nr_copies; i++) { + bool *flag; + + flag = dma_info->pkts_completed_flag[copy_idx & ring_mask]; + if (flag) { + /** + * Mark the packet flag as received. The flag + * could belong to another virtqueue but write + * is atomic. 
+ */ + *flag = true; + dma_info->pkts_completed_flag[copy_idx & ring_mask] = NULL; + } + copy_idx++; + } + +out: + rte_spinlock_unlock(&dma_info->dma_lock); + return nr_copies; +} + static inline void do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq) { @@ -865,12 +1004,13 @@ async_iter_reset(struct vhost_async *async) static __rte_always_inline int async_mbuf_to_desc_seg(struct virtio_net *dev, struct vhost_virtqueue *vq, struct rte_mbuf *m, uint32_t mbuf_offset, - uint64_t buf_iova, uint32_t cpy_len) + uint64_t buf_addr, uint64_t buf_iova, uint32_t cpy_len) { struct vhost_async *async = vq->async; uint64_t mapped_len; uint32_t buf_offset = 0; void *hpa; + struct batch_copy_elem *bce = vq->batch_copy_elems; while (cpy_len) { hpa = (void *)(uintptr_t)gpa_to_first_hpa(dev, @@ -886,6 +1026,31 @@ async_mbuf_to_desc_seg(struct virtio_net *dev, struct vhost_virtqueue *vq, hpa, (size_t)mapped_len))) return -1; + /** + * Keep VA for all IOVA segments for falling back to SW + * copy in case of rte_dma_copy() error. 
+ */ + if (unlikely(vq->batch_copy_nb_elems >= vq->batch_copy_max_elems)) { + struct batch_copy_elem *tmp; + uint16_t nb_elems = 2 * vq->batch_copy_max_elems; + + VHOST_LOG_DATA(DEBUG, "(%d) %s: run out of batch_copy_elems, " + "and realloc double elements.\n", dev->vid, __func__); + tmp = rte_realloc_socket(vq->batch_copy_elems, nb_elems * sizeof(*tmp), + RTE_CACHE_LINE_SIZE, vq->numa_node); + if (!tmp) { + VHOST_LOG_DATA(ERR, "Failed to re-alloc batch_copy_elems\n"); + return -1; + } + + vq->batch_copy_max_elems = nb_elems; + vq->batch_copy_elems = tmp; + bce = tmp; + } + bce[vq->batch_copy_nb_elems].dst = (void *)((uintptr_t)(buf_addr + buf_offset)); + bce[vq->batch_copy_nb_elems].src = rte_pktmbuf_mtod_offset(m, void *, mbuf_offset); + bce[vq->batch_copy_nb_elems++].len = mapped_len; + cpy_len -= (uint32_t)mapped_len; mbuf_offset += (uint32_t)mapped_len; buf_offset += (uint32_t)mapped_len; @@ -901,7 +1066,8 @@ sync_mbuf_to_desc_seg(struct virtio_net *dev, struct vhost_virtqueue *vq, { struct batch_copy_elem *batch_copy = vq->batch_copy_elems; - if (likely(cpy_len > MAX_BATCH_LEN || vq->batch_copy_nb_elems >= vq->size)) { + if (likely(cpy_len > MAX_BATCH_LEN || + vq->batch_copy_nb_elems >= vq->batch_copy_max_elems)) { rte_memcpy((void *)((uintptr_t)(buf_addr)), rte_pktmbuf_mtod_offset(m, void *, mbuf_offset), cpy_len); @@ -1020,8 +1186,10 @@ mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq, if (is_async) { if (async_mbuf_to_desc_seg(dev, vq, m, mbuf_offset, + buf_addr + buf_offset, buf_iova + buf_offset, cpy_len) < 0) goto error; + } else { sync_mbuf_to_desc_seg(dev, vq, m, mbuf_offset, buf_addr + buf_offset, @@ -1449,9 +1617,9 @@ store_dma_desc_info_packed(struct vring_used_elem_packed *s_ring, } static __rte_noinline uint32_t -virtio_dev_rx_async_submit_split(struct virtio_net *dev, - struct vhost_virtqueue *vq, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) +virtio_dev_rx_async_submit_split(struct virtio_net *dev, struct 
vhost_virtqueue *vq, + uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count, + int16_t dma_id, uint16_t vchan_id) { struct buf_vector buf_vec[BUF_VECTOR_MAX]; uint32_t pkt_idx = 0; @@ -1503,17 +1671,16 @@ virtio_dev_rx_async_submit_split(struct virtio_net *dev, if (unlikely(pkt_idx == 0)) return 0; - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx); - if (unlikely(n_xfer < 0)) { - VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer data for queue id %d.\n", - dev->vid, __func__, queue_id); - n_xfer = 0; - } + n_xfer = vhost_async_dma_transfer(vq, dma_id, vchan_id, async->pkts_idx, async->iov_iter, + pkt_idx); pkt_err = pkt_idx - n_xfer; if (unlikely(pkt_err)) { uint16_t num_descs = 0; + VHOST_LOG_DATA(DEBUG, "(%d) %s: failed to transfer %u packets for queue %u.\n", + dev->vid, __func__, pkt_err, queue_id); + /* update number of completed packets */ pkt_idx = n_xfer; @@ -1656,13 +1823,13 @@ dma_error_handler_packed(struct vhost_virtqueue *vq, uint16_t slot_idx, } static __rte_noinline uint32_t -virtio_dev_rx_async_submit_packed(struct virtio_net *dev, - struct vhost_virtqueue *vq, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) +virtio_dev_rx_async_submit_packed(struct virtio_net *dev, struct vhost_virtqueue *vq, + uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count, + int16_t dma_id, uint16_t vchan_id) { uint32_t pkt_idx = 0; uint32_t remained = count; - int32_t n_xfer; + uint16_t n_xfer; uint16_t num_buffers; uint16_t num_descs; @@ -1670,6 +1837,7 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, struct async_inflight_info *pkts_info = async->pkts_info; uint32_t pkt_err = 0; uint16_t slot_idx = 0; + uint16_t head_idx = async->pkts_idx % vq->size; do { rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]); @@ -1694,19 +1862,17 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, if (unlikely(pkt_idx == 0)) return 0; - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, 
pkt_idx); - if (unlikely(n_xfer < 0)) { - VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer data for queue id %d.\n", - dev->vid, __func__, queue_id); - n_xfer = 0; - } - - pkt_err = pkt_idx - n_xfer; + n_xfer = vhost_async_dma_transfer(vq, dma_id, vchan_id, head_idx, + async->iov_iter, pkt_idx); async_iter_reset(async); - if (unlikely(pkt_err)) + pkt_err = pkt_idx - n_xfer; + if (unlikely(pkt_err)) { + VHOST_LOG_DATA(DEBUG, "(%d) %s: failed to transfer %u packets for queue %u.\n", + dev->vid, __func__, pkt_err, queue_id); dma_error_handler_packed(vq, slot_idx, pkt_err, &pkt_idx); + } if (likely(vq->shadow_used_idx)) { /* keep used descriptors. */ @@ -1826,28 +1992,43 @@ write_back_completed_descs_packed(struct vhost_virtqueue *vq, static __rte_always_inline uint16_t vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id) { struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; struct vhost_async *async = vq->async; struct async_inflight_info *pkts_info = async->pkts_info; - int32_t n_cpl; + uint32_t max_count; + uint16_t nr_cpl_pkts = 0; uint16_t n_descs = 0, n_buffers = 0; uint16_t start_idx, from, i; - n_cpl = async->ops.check_completed_copies(dev->vid, queue_id, 0, count); - if (unlikely(n_cpl < 0)) { - VHOST_LOG_DATA(ERR, "(%d) %s: failed to check completed copies for queue id %d.\n", - dev->vid, __func__, queue_id); - return 0; + /* Check completed copies for the given DMA vChannel */ + max_count = count * dma_poll_factor; + vhost_async_dma_check_completed(dma_id, vchan_id, max_count <= UINT16_MAX ? max_count : + UINT16_MAX); + + start_idx = async_get_first_inflight_pkt_idx(vq); + + /** + * Calculate the number of copy completed packets. + * Note that there may be completed packets even if + * no copies are reported done by the given DMA vChannel, + * as DMA vChannels could be shared by other threads. 
+ */ + from = start_idx; + while (vq->async->pkts_cmpl_flag[from] && count--) { + vq->async->pkts_cmpl_flag[from] = false; + from++; + if (from >= vq->size) + from -= vq->size; + nr_cpl_pkts++; } - if (n_cpl == 0) + if (nr_cpl_pkts == 0) return 0; - start_idx = async_get_first_inflight_pkt_idx(vq); - - for (i = 0; i < n_cpl; i++) { + for (i = 0; i < nr_cpl_pkts; i++) { from = (start_idx + i) % vq->size; /* Only used with packed ring */ n_buffers += pkts_info[from].nr_buffers; @@ -1856,7 +2037,7 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, pkts[i] = pkts_info[from].mbuf; } - async->pkts_inflight_n -= n_cpl; + async->pkts_inflight_n -= nr_cpl_pkts; if (likely(vq->enabled && vq->access_ok)) { if (vq_is_packed(dev)) { @@ -1877,12 +2058,13 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, } } - return n_cpl; + return nr_cpl_pkts; } uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id) { struct virtio_net *dev = get_device(vid); struct vhost_virtqueue *vq; @@ -1906,9 +2088,20 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, return 0; } - rte_spinlock_lock(&vq->access_lock); + if (unlikely(!dma_copy_track[dma_id].vchans || + vchan_id > dma_copy_track[dma_id].max_vchans)) { + VHOST_LOG_DATA(ERR, "(%d) %s: invalid DMA %d vchan %u.\n", + dev->vid, __func__, dma_id, vchan_id); + return 0; + } - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); + if (!rte_spinlock_trylock(&vq->access_lock)) { + VHOST_LOG_CONFIG(DEBUG, "Failed to poll completed packets from queue id %u. 
" + "virt queue busy.\n", queue_id); + return 0; + } + + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, vchan_id); rte_spinlock_unlock(&vq->access_lock); @@ -1917,7 +2110,8 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id) { struct virtio_net *dev = get_device(vid); struct vhost_virtqueue *vq; @@ -1941,14 +2135,21 @@ rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, return 0; } - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); + if (unlikely(!dma_copy_track[dma_id].vchans || + vchan_id > dma_copy_track[dma_id].max_vchans)) { + VHOST_LOG_DATA(ERR, "(%d) %s: invalid DMA %d vchan %u.\n", + dev->vid, __func__, dma_id, vchan_id); + return 0; + } + + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, vchan_id); return n_pkts_cpl; } static __rte_always_inline uint32_t virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) + struct rte_mbuf **pkts, uint32_t count, int16_t dma_id, uint16_t vchan_id) { struct vhost_virtqueue *vq; uint32_t nb_tx = 0; @@ -1960,6 +2161,13 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, return 0; } + if (unlikely(!dma_copy_track[dma_id].vchans || + vchan_id > dma_copy_track[dma_id].max_vchans)) { + VHOST_LOG_DATA(ERR, "(%d) %s: invalid DMA %d vchan %u.\n", dev->vid, __func__, + dma_id, vchan_id); + return 0; + } + vq = dev->virtqueue[queue_id]; rte_spinlock_lock(&vq->access_lock); @@ -1980,10 +2188,10 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, if (vq_is_packed(dev)) nb_tx = virtio_dev_rx_async_submit_packed(dev, vq, queue_id, - pkts, count); + pkts, count, dma_id, vchan_id); else nb_tx = virtio_dev_rx_async_submit_split(dev, vq, queue_id, 
- pkts, count); + pkts, count, dma_id, vchan_id); out: if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) @@ -1997,7 +2205,8 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id) { struct virtio_net *dev = get_device(vid); @@ -2011,7 +2220,7 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, return 0; } - return virtio_dev_rx_async_submit(dev, queue_id, pkts, count); + return virtio_dev_rx_async_submit(dev, queue_id, pkts, count, dma_id, vchan_id); } static inline bool @@ -2369,7 +2578,7 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq, cpy_len = RTE_MIN(buf_avail, mbuf_avail); if (likely(cpy_len > MAX_BATCH_LEN || - vq->batch_copy_nb_elems >= vq->size || + vq->batch_copy_nb_elems >= vq->batch_copy_max_elems || (hdr && cur == m))) { rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset), -- 2.25.1 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/1] vhost: integrate dmadev in asynchronous datapath 2022-01-24 16:40 ` [PATCH v2 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu @ 2022-02-03 13:04 ` Maxime Coquelin 2022-02-07 1:34 ` Hu, Jiayu 2022-02-08 10:40 ` [PATCH v3 0/1] integrate dmadev in vhost Jiayu Hu 1 sibling, 1 reply; 31+ messages in thread From: Maxime Coquelin @ 2022-02-03 13:04 UTC (permalink / raw) To: Jiayu Hu, dev Cc: i.maximets, chenbo.xia, bruce.richardson, harry.van.haaren, sunil.pai.g, john.mcnamara, xuan.ding, cheng1.jiang, liangma Hi Jiayu, On 1/24/22 17:40, Jiayu Hu wrote: > Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA > abstraction layer and simplify application logics, this patch integrates > dmadev in asynchronous data path. > > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> > Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> > --- > doc/guides/prog_guide/vhost_lib.rst | 95 ++++----- > examples/vhost/Makefile | 2 +- > examples/vhost/ioat.c | 218 -------------------- > examples/vhost/ioat.h | 63 ------ > examples/vhost/main.c | 255 ++++++++++++++++++----- > examples/vhost/main.h | 11 + > examples/vhost/meson.build | 6 +- > lib/vhost/meson.build | 2 +- > lib/vhost/rte_vhost.h | 2 + > lib/vhost/rte_vhost_async.h | 132 +++++------- > lib/vhost/version.map | 3 + > lib/vhost/vhost.c | 148 ++++++++++---- > lib/vhost/vhost.h | 64 +++++- > lib/vhost/vhost_user.c | 2 + > lib/vhost/virtio_net.c | 305 +++++++++++++++++++++++----- > 15 files changed, 744 insertions(+), 564 deletions(-) > delete mode 100644 examples/vhost/ioat.c > delete mode 100644 examples/vhost/ioat.h > When you rebase to the next version, please ensure to rework all the logs to follow the new standard: VHOST_LOG_CONFIG(ERR,"(%s) .....", dev->ifname, ...); > git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h > index a87ea6ba37..758a80f403 100644 > --- a/lib/vhost/rte_vhost_async.h > +++ b/lib/vhost/rte_vhost_async.h > @@ -26,73 +26,6 @@ struct 
rte_vhost_iov_iter { > unsigned long nr_segs; > }; > > -/** > - * dma transfer status > - */ > -struct rte_vhost_async_status { > - /** An array of application specific data for source memory */ > - uintptr_t *src_opaque_data; > - /** An array of application specific data for destination memory */ > - uintptr_t *dst_opaque_data; > -}; > - > -/** > - * dma operation callbacks to be implemented by applications > - */ > -struct rte_vhost_async_channel_ops { > - /** > - * instruct async engines to perform copies for a batch of packets > - * > - * @param vid > - * id of vhost device to perform data copies > - * @param queue_id > - * queue id to perform data copies > - * @param iov_iter > - * an array of IOV iterators > - * @param opaque_data > - * opaque data pair sending to DMA engine > - * @param count > - * number of elements in the "descs" array > - * @return > - * number of IOV iterators processed, negative value means error > - */ > - int32_t (*transfer_data)(int vid, uint16_t queue_id, > - struct rte_vhost_iov_iter *iov_iter, > - struct rte_vhost_async_status *opaque_data, > - uint16_t count); > - /** > - * check copy-completed packets from the async engine > - * @param vid > - * id of vhost device to check copy completion > - * @param queue_id > - * queue id to check copy completion > - * @param opaque_data > - * buffer to receive the opaque data pair from DMA engine > - * @param max_packets > - * max number of packets could be completed > - * @return > - * number of async descs completed, negative value means error > - */ > - int32_t (*check_completed_copies)(int vid, uint16_t queue_id, > - struct rte_vhost_async_status *opaque_data, > - uint16_t max_packets); > -}; > - > -/** > - * async channel features > - */ > -enum { > - RTE_VHOST_ASYNC_INORDER = 1U << 0, > -}; > - > -/** > - * async channel configuration > - */ > -struct rte_vhost_async_config { > - uint32_t features; > - uint32_t rsvd[2]; > -}; > - > /** > * Register an async channel for a vhost queue > 
* > @@ -100,17 +33,11 @@ struct rte_vhost_async_config { > * vhost device id async channel to be attached to > * @param queue_id > * vhost queue id async channel to be attached to > - * @param config > - * Async channel configuration structure > - * @param ops > - * Async channel operation callbacks > * @return > * 0 on success, -1 on failures > */ > __rte_experimental > -int rte_vhost_async_channel_register(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops); > +int rte_vhost_async_channel_register(int vid, uint16_t queue_id); > > /** > * Unregister an async channel for a vhost queue > @@ -136,17 +63,11 @@ int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id); > * vhost device id async channel to be attached to > * @param queue_id > * vhost queue id async channel to be attached to > - * @param config > - * Async channel configuration > - * @param ops > - * Async channel operation callbacks > * @return > * 0 on success, -1 on failures > */ > __rte_experimental > -int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops); > +int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id); > > /** > * Unregister an async channel for a vhost queue without performing any > @@ -179,12 +100,17 @@ int rte_vhost_async_channel_unregister_thread_unsafe(int vid, > * array of packets to be enqueued > * @param count > * packets num to be enqueued > + * @param dma_id > + * the identifier of the DMA device > + * @param vchan_id > + * the identifier of virtual DMA channel > * @return > * num of packets enqueued > */ > __rte_experimental > uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count); > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan_id); > > /** > * This function checks async completion 
status for a specific vhost > @@ -199,12 +125,17 @@ uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > * blank array to get return packet pointer > * @param count > * size of the packet array > + * @param dma_id > + * the identifier of the DMA device > + * @param vchan_id > + * the identifier of virtual DMA channel > * @return > * num of packets returned > */ > __rte_experimental > uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count); > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan_id); > > /** > * This function returns the amount of in-flight packets for the vhost > @@ -235,11 +166,44 @@ int rte_vhost_async_get_inflight(int vid, uint16_t queue_id); > * Blank array to get return packet pointer > * @param count > * Size of the packet array > + * @param dma_id > + * the identifier of the DMA device > + * @param vchan_id > + * the identifier of virtual DMA channel > * @return > * Number of packets returned > */ > __rte_experimental > uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count); > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan_id); > +/** > + * The DMA vChannels used in the asynchronous data path must be configured > + * first. So this function needs to be called before enabling DMA > + * acceleration for any vring. If this function fails, the asynchronous data path > + * cannot be enabled for any vring afterwards. > + * > + * DMA devices used in the data path must belong to the DMA devices given to this > + * function. But users are free to use the DMA devices given to this function > + * in non-vhost scenarios, provided it is guaranteed that no vhost copies are > + * offloaded to them at the same time.
> + * > + * @param dmas_id > + * DMA ID array > + * @param count > + * Element number of 'dmas_id' > + * @param poll_factor > + * For large or scatter-gather packets, one packet would consist of > + * small buffers. In this case, vhost will issue several DMA copy > + * operations for the packet. Therefore, the number of copies to > + * check by rte_dma_completed() is calculated by "nb_pkts_to_poll * > + * poll_factor" and used in rte_vhost_poll_enqueue_completed(). The > + * default value of "poll_factor" is 1. > + * @return > + * 0 on success, and -1 on failure > + */ > +__rte_experimental > +int rte_vhost_async_dma_configure(int16_t *dmas_id, uint16_t count, > + uint16_t poll_factor); > > #endif /* _RTE_VHOST_ASYNC_H_ */ > diff --git a/lib/vhost/version.map b/lib/vhost/version.map > index a7ef7f1976..1202ba9c1a 100644 > --- a/lib/vhost/version.map > +++ b/lib/vhost/version.map > @@ -84,6 +84,9 @@ EXPERIMENTAL { > > # added in 21.11 > rte_vhost_get_monitor_addr; > + > + # added in 22.03 > + rte_vhost_async_dma_configure; > }; > > INTERNAL { > diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c > index 13a9bb9dd1..c408cee63e 100644 > --- a/lib/vhost/vhost.c > +++ b/lib/vhost/vhost.c > @@ -25,7 +25,7 @@ > #include "vhost.h" > #include "vhost_user.h" > > -struct virtio_net *vhost_devices[MAX_VHOST_DEVICE]; > +struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE]; > pthread_mutex_t vhost_dev_lock = PTHREAD_MUTEX_INITIALIZER; > > /* Called with iotlb_lock read-locked */ > @@ -344,6 +344,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq) > return; > > rte_free(vq->async->pkts_info); > + rte_free(vq->async->pkts_cmpl_flag); > > rte_free(vq->async->buffers_packed); > vq->async->buffers_packed = NULL; > @@ -667,12 +668,12 @@ vhost_new_device(void) > int i; > > pthread_mutex_lock(&vhost_dev_lock); > - for (i = 0; i < MAX_VHOST_DEVICE; i++) { > + for (i = 0; i < RTE_MAX_VHOST_DEVICE; i++) { > if (vhost_devices[i] == NULL) > break; > } > > - if (i == MAX_VHOST_DEVICE) { > +
if (i == RTE_MAX_VHOST_DEVICE) { > VHOST_LOG_CONFIG(ERR, > "Failed to find a free slot for new device.\n"); > pthread_mutex_unlock(&vhost_dev_lock); > @@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid, > } > > static __rte_always_inline int > -async_channel_register(int vid, uint16_t queue_id, > - struct rte_vhost_async_channel_ops *ops) > +async_channel_register(int vid, uint16_t queue_id) > { > struct virtio_net *dev = get_device(vid); > struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; > @@ -1656,6 +1656,14 @@ async_channel_register(int vid, uint16_t queue_id, > goto out_free_async; > } > > + async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size * sizeof(bool), > + RTE_CACHE_LINE_SIZE, node); > + if (!async->pkts_cmpl_flag) { > + VHOST_LOG_CONFIG(ERR, "failed to allocate async pkts_cmpl_flag (vid %d, qid: %d)\n", > + vid, queue_id); > + goto out_free_async; > + } > + > if (vq_is_packed(dev)) { > async->buffers_packed = rte_malloc_socket(NULL, > vq->size * sizeof(struct vring_used_elem_packed), > @@ -1676,9 +1684,6 @@ async_channel_register(int vid, uint16_t queue_id, > } > } > > - async->ops.check_completed_copies = ops->check_completed_copies; > - async->ops.transfer_data = ops->transfer_data; > - > vq->async = async; > > return 0; > @@ -1691,15 +1696,13 @@ async_channel_register(int vid, uint16_t queue_id, > } > > int > -rte_vhost_async_channel_register(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops) > +rte_vhost_async_channel_register(int vid, uint16_t queue_id) > { > struct vhost_virtqueue *vq; > struct virtio_net *dev = get_device(vid); > int ret; > > - if (dev == NULL || ops == NULL) > + if (dev == NULL) > return -1; > > if (queue_id >= VHOST_MAX_VRING) > @@ -1710,33 +1713,20 @@ rte_vhost_async_channel_register(int vid, uint16_t queue_id, > if (unlikely(vq == NULL || !dev->async_copy)) > return -1; > > - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { > 
- VHOST_LOG_CONFIG(ERR, > - "async copy is not supported on non-inorder mode " > - "(vid %d, qid: %d)\n", vid, queue_id); > - return -1; > - } > - > - if (unlikely(ops->check_completed_copies == NULL || > - ops->transfer_data == NULL)) > - return -1; > - > rte_spinlock_lock(&vq->access_lock); > - ret = async_channel_register(vid, queue_id, ops); > + ret = async_channel_register(vid, queue_id); > rte_spinlock_unlock(&vq->access_lock); > > return ret; > } > > int > -rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops) > +rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id) > { > struct vhost_virtqueue *vq; > struct virtio_net *dev = get_device(vid); > > - if (dev == NULL || ops == NULL) > + if (dev == NULL) > return -1; > > if (queue_id >= VHOST_MAX_VRING) > @@ -1747,18 +1737,7 @@ rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, > if (unlikely(vq == NULL || !dev->async_copy)) > return -1; > > - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { > - VHOST_LOG_CONFIG(ERR, > - "async copy is not supported on non-inorder mode " > - "(vid %d, qid: %d)\n", vid, queue_id); > - return -1; > - } > - > - if (unlikely(ops->check_completed_copies == NULL || > - ops->transfer_data == NULL)) > - return -1; > - > - return async_channel_register(vid, queue_id, ops); > + return async_channel_register(vid, queue_id); > } > > int > @@ -1835,6 +1814,95 @@ rte_vhost_async_channel_unregister_thread_unsafe(int vid, uint16_t queue_id) > return 0; > } > > +static __rte_always_inline void > +vhost_free_async_dma_mem(void) > +{ > + uint16_t i; > + > + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) { > + struct async_dma_info *dma = &dma_copy_track[i]; > + int16_t j; > + > + if (dma->max_vchans == 0) > + continue; > + > + for (j = 0; j < dma->max_vchans; j++) > + rte_free(dma->vchans[j].pkts_completed_flag); > + > + 
rte_free(dma->vchans); > + dma->vchans = NULL; > + dma->max_vchans = 0; > + } > +} > + > +int > +rte_vhost_async_dma_configure(int16_t *dmas_id, uint16_t count, uint16_t poll_factor) I'm not a fan of the poll_factor, I think it is too complex for the user to know what value to set. Also, I would prefer that the API register only one DMA channel at a time and let the application call it multiple times. Doing that, the user can still use the DMA channels that could be registered. > +{ > + uint16_t i; > + > + if (!dmas_id) { > + VHOST_LOG_CONFIG(ERR, "Invalid DMA configuration parameter.\n"); > + return -1; > + } > + > + if (poll_factor == 0) { > + VHOST_LOG_CONFIG(ERR, "Invalid DMA poll factor %u\n", poll_factor); > + return -1; > + } > + dma_poll_factor = poll_factor; > + > + for (i = 0; i < count; i++) { > + struct async_dma_vchan_info *vchans; > + struct rte_dma_info info; > + uint16_t max_vchans; > + uint16_t max_desc; > + uint16_t j; > + > + if (!rte_dma_is_valid(dmas_id[i])) { > + VHOST_LOG_CONFIG(ERR, "DMA %d is not found. Cannot enable async" > + " data-path.\n", dmas_id[i]); > + vhost_free_async_dma_mem(); > + return -1; > + } > + > + rte_dma_info_get(dmas_id[i], &info); > + > + max_vchans = info.max_vchans; > + max_desc = info.max_desc; > + > + if (!rte_is_power_of_2(max_desc)) > + max_desc = rte_align32pow2(max_desc); > + > + vchans = rte_zmalloc(NULL, sizeof(struct async_dma_vchan_info) * max_vchans, > + RTE_CACHE_LINE_SIZE); > + if (vchans == NULL) { > + VHOST_LOG_CONFIG(ERR, "Failed to allocate vchans for dma-%d."
> + " Cannot enable async data-path.\n", dmas_id[i]); > + vhost_free_async_dma_mem(); > + return -1; > + } > + > + for (j = 0; j < max_vchans; j++) { > + vchans[j].pkts_completed_flag = rte_zmalloc(NULL, sizeof(bool *) * max_desc, > + RTE_CACHE_LINE_SIZE); > + if (!vchans[j].pkts_completed_flag) { > + VHOST_LOG_CONFIG(ERR, "Failed to allocate pkts_completed_flag for " > + "dma-%d vchan-%u\n", dmas_id[i], j); > + vhost_free_async_dma_mem(); > + return -1; > + } > + > + vchans[j].ring_size = max_desc; > + vchans[j].ring_mask = max_desc - 1; > + } > + > + dma_copy_track[dmas_id[i]].vchans = vchans; > + dma_copy_track[dmas_id[i]].max_vchans = max_vchans; > + } > + > + return 0; > +} > + > int > rte_vhost_async_get_inflight(int vid, uint16_t queue_id) > { > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h > index 7085e0885c..475843fec0 100644 > --- a/lib/vhost/vhost.h > +++ b/lib/vhost/vhost.h > @@ -19,6 +19,7 @@ > #include <rte_ether.h> > #include <rte_rwlock.h> > #include <rte_malloc.h> > +#include <rte_dmadev.h> > > #include "rte_vhost.h" > #include "rte_vdpa.h" > @@ -50,6 +51,7 @@ > > #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) > #define VHOST_MAX_ASYNC_VEC 2048 > +#define VHOST_ASYNC_DMA_BATCHING_SIZE 32 > > #define PACKED_DESC_ENQUEUE_USED_FLAG(w) \ > ((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \ > @@ -119,6 +121,42 @@ struct vring_used_elem_packed { > uint32_t count; > }; > > +struct async_dma_vchan_info { > + /* circular array to track if packet copy completes */ > + bool **pkts_completed_flag; > + > + /* max elements in 'metadata' */ > + uint16_t ring_size; > + /* ring index mask for 'metadata' */ > + uint16_t ring_mask; > + > + /* batching copies before a DMA doorbell */ > + uint16_t nr_batching; > + > + /** > + * DMA virtual channel lock. Although it is able to bind DMA > + * virtual channels to data plane threads, vhost control plane > + * thread could call data plane functions too, thus causing > + * DMA device contention. 
> + * > + * For example, in the VM exit case, the vhost control plane thread needs > + * to clear in-flight packets before disabling the vring, but there could > + * be another data plane thread enqueuing packets to the same > + * vring with the same DMA virtual channel. And dmadev PMD functions > + * are lock-free, so the control plane and data plane threads > + * could operate on the same DMA virtual channel at the same time. > + */ > + rte_spinlock_t dma_lock; > +}; > + > +struct async_dma_info { > + uint16_t max_vchans; > + struct async_dma_vchan_info *vchans; > +}; > + > +extern struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > +extern uint16_t dma_poll_factor; > + > /** > * inflight async packet information > */ > @@ -129,9 +167,6 @@ struct async_inflight_info { > }; > > struct vhost_async { > - /* operation callbacks for DMA */ > - struct rte_vhost_async_channel_ops ops; > - > struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; > struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; > uint16_t iter_idx; > @@ -139,6 +174,25 @@ struct vhost_async { > > /* data transfer status */ > struct async_inflight_info *pkts_info; > + /** > + * Packet reorder array. "true" indicates that the DMA device > + * has completed all copies for the packet. > + * > + * Note that this array could be written by multiple threads > + * simultaneously. For example, in the case where thread0 and > + * thread1 receive packets from the NIC and then enqueue them to > + * vring0 and vring1 with their own DMA devices DMA0 and DMA1, it's > + * possible for thread0 to get completed copies belonging to > + * vring1 from DMA0, while thread0 is calling rte_vhost_poll > + * _enqueue_completed() for vring0 and thread1 is calling > + * rte_vhost_submit_enqueue_burst() for vring1. In this case, > + * vq->access_lock cannot protect pkts_cmpl_flag of vring1. > + * > + * However, since offloading is done on a per-packet basis, each packet > + * flag will only be written by one thread.
And single byte > + * write is atomic, so no lock for pkts_cmpl_flag is needed. > + */ > + bool *pkts_cmpl_flag; > uint16_t pkts_idx; > uint16_t pkts_inflight_n; > union { > @@ -198,6 +252,7 @@ struct vhost_virtqueue { > /* Record packed ring first dequeue desc index */ > uint16_t shadow_last_used_idx; > > + uint16_t batch_copy_max_elems; > uint16_t batch_copy_nb_elems; > struct batch_copy_elem *batch_copy_elems; > int numa_node; > @@ -568,8 +623,7 @@ extern int vhost_data_log_level; > #define PRINT_PACKET(device, addr, size, header) do {} while (0) > #endif > > -#define MAX_VHOST_DEVICE 1024 > -extern struct virtio_net *vhost_devices[MAX_VHOST_DEVICE]; > +extern struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE]; > > #define VHOST_BINARY_SEARCH_THRESH 256 > > diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c > index 5eb1dd6812..3147e72f04 100644 > --- a/lib/vhost/vhost_user.c > +++ b/lib/vhost/vhost_user.c > @@ -527,6 +527,8 @@ vhost_user_set_vring_num(struct virtio_net **pdev, > return RTE_VHOST_MSG_RESULT_ERR; > } > > + vq->batch_copy_max_elems = vq->size; > + I don't understand the point of this new field. But it can be removed anyway if we agree to drop the SW fallback. > return RTE_VHOST_MSG_RESULT_OK; > } > > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c > index b3d954aab4..305f6cd562 100644 > --- a/lib/vhost/virtio_net.c > +++ b/lib/vhost/virtio_net.c > @@ -11,6 +11,7 @@ > #include <rte_net.h> > #include <rte_ether.h> > #include <rte_ip.h> > +#include <rte_dmadev.h> > #include <rte_vhost.h> > #include <rte_tcp.h> > #include <rte_udp.h> > @@ -25,6 +26,10 @@ > > #define MAX_BATCH_LEN 256 > > +/* DMA device copy operation tracking array. 
*/ > +struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > +uint16_t dma_poll_factor = 1; > + > static __rte_always_inline bool > rxvq_is_mergeable(struct virtio_net *dev) > { > @@ -43,6 +48,140 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring) > return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring; > } > > +static __rte_always_inline uint16_t > +vhost_async_dma_transfer(struct vhost_virtqueue *vq, int16_t dma_id, > + uint16_t vchan_id, uint16_t head_idx, > + struct rte_vhost_iov_iter *pkts, uint16_t nr_pkts) > +{ > + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan_id]; > + uint16_t ring_mask = dma_info->ring_mask; > + uint16_t pkt_idx, bce_idx = 0; > + > + rte_spinlock_lock(&dma_info->dma_lock); > + > + for (pkt_idx = 0; pkt_idx < nr_pkts; pkt_idx++) { > + struct rte_vhost_iovec *iov = pkts[pkt_idx].iov; > + int copy_idx, last_copy_idx = 0; > + uint16_t nr_segs = pkts[pkt_idx].nr_segs; > + uint16_t nr_sw_copy = 0; > + uint16_t i; > + > + if (rte_dma_burst_capacity(dma_id, vchan_id) < nr_segs) > + goto out; I would consider introducing a vhost_async_dma_transfer_one function to avoid nesting too many loops and make the code cleaner. > + for (i = 0; i < nr_segs; i++) { > + /* Fall back to SW copy if an error happens */ > + copy_idx = rte_dma_copy(dma_id, vchan_id, (rte_iova_t)iov[i].src_addr, > + (rte_iova_t)iov[i].dst_addr, iov[i].len, > + RTE_DMA_OP_FLAG_LLC); > + if (unlikely(copy_idx < 0)) { The DMA channel is protected by a lock, and we check the capacity before initiating the copy. So I don't expect rte_dma_copy() to fail because of a lack of capacity. If an error happens, that is a serious one. So, I wonder whether having a SW fallback makes sense. Code would be much simpler if we just exit early if an error happens. Logging an error message instead would help debugging. Certainly with rate limiting, so as not to flood the log file.
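To make the rate-limiting suggestion concrete, here is a minimal, self-contained sketch of an interval-based limiter. All names here are illustrative, not part of DPDK; a real implementation would take its timestamps from something like rte_get_timer_cycles() rather than a caller-supplied counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: allow at most `burst` log messages per
 * `interval` ticks of a caller-supplied monotonic clock. */
struct rl_state {
	uint64_t window_start; /* start of the current interval */
	uint64_t interval;     /* interval length, in clock ticks */
	uint32_t burst;        /* max messages allowed per interval */
	uint32_t emitted;      /* messages emitted in this interval */
	uint64_t suppressed;   /* messages dropped since start */
};

/* Returns 1 if the message may be logged, 0 if it must be dropped. */
static int
rl_check(struct rl_state *rl, uint64_t now)
{
	if (now - rl->window_start >= rl->interval) {
		/* New interval: reset the per-window budget. */
		rl->window_start = now;
		rl->emitted = 0;
	}
	if (rl->emitted < rl->burst) {
		rl->emitted++;
		return 1;
	}
	rl->suppressed++;
	return 0;
}
```

The error path would then do `if (rl_check(&rl, now)) VHOST_LOG_DATA(ERR, ...)`, so a persistent DMA failure produces a bounded trickle of messages instead of one per segment.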
> + /* Find corresponding VA pair and do SW copy */ > + rte_memcpy(vq->batch_copy_elems[bce_idx].dst, > + vq->batch_copy_elems[bce_idx].src, > + vq->batch_copy_elems[bce_idx].len); > + nr_sw_copy++; > + > + /** > + * All copies of the packet are performed > + * by the CPU, set the packet completion flag > + * to true, as all copies are done. > + */ I think it would be better to move this out of the loop, to avoid doing the check for every segment while only the last one has a chance to match. > + if (nr_sw_copy == nr_segs) { > + vq->async->pkts_cmpl_flag[head_idx % vq->size] = true; > + break; > + } else if (i == (nr_segs - 1)) { > + /** > + * Part of the copies of the current packet > + * were enqueued to the DMA successfully > + * but the last copy failed; store the > + * packet completion flag address > + * in the last DMA copy slot. > + */ > + dma_info->pkts_completed_flag[last_copy_idx & ring_mask] = > + &vq->async->pkts_cmpl_flag[head_idx % vq->size]; > + break; > + } > + } else > + last_copy_idx = copy_idx; Braces on the else as you have braces for the if statement. > + > + bce_idx++; > + > + /** > + * Only store the packet completion flag address in the last copy's > + * slot; other slots are set to NULL. > + */ > + if (i == (nr_segs - 1)) { > + dma_info->pkts_completed_flag[copy_idx & ring_mask] = > + &vq->async->pkts_cmpl_flag[head_idx % vq->size]; > + } > + } > + > + dma_info->nr_batching += nr_segs; > + if (unlikely(dma_info->nr_batching >= VHOST_ASYNC_DMA_BATCHING_SIZE)) { > + rte_dma_submit(dma_id, vchan_id); > + dma_info->nr_batching = 0; > + } I wonder whether we could just remove this submit. I don't expect completions to happen between two packets as the DMA channel is protected by a lock, so my understanding is once the DMA ring is full, we just end up exiting early because DMA channel capacity is checked for every packet. Removing it may improve performance a (very) little bit, but will certainly make the code simpler to follow.
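The batching behaviour under discussion is easy to see in isolation: copies accumulate per packet, a "doorbell" is rung when the threshold is reached, and any remainder is flushed on the way out. A self-contained mock (the names and the submit counter are hypothetical stand-ins for rte_dma_submit()):

```c
#include <assert.h>
#include <stdint.h>

#define BATCHING_SIZE 32 /* mirrors VHOST_ASYNC_DMA_BATCHING_SIZE */

struct batch_state {
	uint16_t nr_batching; /* copies accumulated since last doorbell */
	uint32_t nr_submits;  /* doorbells rung so far */
};

/* Enqueue nr_segs copies for one packet; ring the doorbell once the
 * batch threshold is reached, as vhost_async_dma_transfer() does. */
static void
enqueue_pkt(struct batch_state *bs, uint16_t nr_segs)
{
	bs->nr_batching += nr_segs;
	if (bs->nr_batching >= BATCHING_SIZE) {
		bs->nr_submits++; /* stands in for rte_dma_submit() */
		bs->nr_batching = 0;
	}
}

/* Flush any remainder, as done on the "out:" path before unlocking. */
static void
flush(struct batch_state *bs)
{
	if (bs->nr_batching > 0) {
		bs->nr_submits++;
		bs->nr_batching = 0;
	}
}
```

With this shape, 64 single-segment packets cost two doorbells instead of 64; Maxime's point is that since the final flush always runs under the same lock, the mid-loop doorbell may be redundant.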
> + > + head_idx++; > + } > + > +out: > + if (dma_info->nr_batching > 0) { if (likely(...)) > + rte_dma_submit(dma_id, vchan_id); > + dma_info->nr_batching = 0; > + } > + rte_spinlock_unlock(&dma_info->dma_lock); > + vq->batch_copy_nb_elems = 0; > + > + return pkt_idx; > +} > + > +static __rte_always_inline uint16_t > +vhost_async_dma_check_completed(int16_t dma_id, uint16_t vchan_id, uint16_t max_pkts) > +{ > + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan_id]; > + uint16_t ring_mask = dma_info->ring_mask; > + uint16_t last_idx = 0; > + uint16_t nr_copies; > + uint16_t copy_idx; > + uint16_t i; > + bool has_error = false; > + > + rte_spinlock_lock(&dma_info->dma_lock); > + > + /** > + * Print an error log for debugging if the DMA reports an error > + * during the transfer. We do not handle errors at the vhost level. > + */ > + nr_copies = rte_dma_completed(dma_id, vchan_id, max_pkts, &last_idx, &has_error); > + if (unlikely(has_error)) { > + VHOST_LOG_DATA(ERR, "dma %d vchannel %u reports error in rte_dma_completed()\n", > + dma_id, vchan_id); I wonder if rate limiting would be needed here too. > + } else if (nr_copies == 0) > + goto out; > + > + copy_idx = last_idx - nr_copies + 1; > + for (i = 0; i < nr_copies; i++) { > + bool *flag; > + > + flag = dma_info->pkts_completed_flag[copy_idx & ring_mask]; > + if (flag) { > + /** > + * Mark the packet flag as received. The flag > + * could belong to another virtqueue, but the write > + * is atomic.
> + */ > + *flag = true; > + dma_info->pkts_completed_flag[copy_idx & ring_mask] = NULL; > + } > + copy_idx++; > + } > + > +out: > + rte_spinlock_unlock(&dma_info->dma_lock); > + return nr_copies; > +} > + > static inline void > do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq) > { > @@ -865,12 +1004,13 @@ async_iter_reset(struct vhost_async *async) > static __rte_always_inline int > async_mbuf_to_desc_seg(struct virtio_net *dev, struct vhost_virtqueue *vq, > struct rte_mbuf *m, uint32_t mbuf_offset, > - uint64_t buf_iova, uint32_t cpy_len) > + uint64_t buf_addr, uint64_t buf_iova, uint32_t cpy_len) > { > struct vhost_async *async = vq->async; > uint64_t mapped_len; > uint32_t buf_offset = 0; > void *hpa; > + struct batch_copy_elem *bce = vq->batch_copy_elems; > > while (cpy_len) { > hpa = (void *)(uintptr_t)gpa_to_first_hpa(dev, > @@ -886,6 +1026,31 @@ async_mbuf_to_desc_seg(struct virtio_net *dev, struct vhost_virtqueue *vq, > hpa, (size_t)mapped_len))) > return -1; > > + /** > + * Keep VA for all IOVA segments for falling back to SW > + * copy in case of an rte_dma_copy() error. > + */ As said below, I think we could get rid of the SW fallback. But in case we don't, I think it would be preferable to change the rte_vhost_iovec struct to have both the IOVA and the VA, which would make the code simpler. Also, while looking at this, I notice the structs rte_vhost_iov_iter and rte_vhost_iovec are still part of the Vhost API, but that should no longer be necessary since applications no longer need to know about them.
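Going back to vhost_async_dma_check_completed() quoted above: the first completed slot is derived as `last_idx - nr_copies + 1`, and the walk relies on uint16_t wrap-around plus the power-of-two ring mask; only a packet's last-copy slot carries a flag pointer. A self-contained sketch of that index arithmetic (mock names and a tiny 8-entry ring, not the actual DPDK code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8 /* power of two, as rte_align32pow2() enforces */
#define RING_MASK (RING_SIZE - 1)

/* Flag pointers live only in each packet's last-copy slot; other
 * slots stay NULL, as in the vhost patch. */
static bool *pkts_completed_flag[RING_SIZE];

/* Walk the [last_idx - nr_copies + 1 .. last_idx] range and mark the
 * packets whose final copy completed. Returns packets marked done. */
static uint16_t
mark_completed(uint16_t last_idx, uint16_t nr_copies)
{
	uint16_t copy_idx = last_idx - nr_copies + 1; /* wraps safely on uint16_t */
	uint16_t done = 0, i;

	for (i = 0; i < nr_copies; i++) {
		bool *flag = pkts_completed_flag[copy_idx & RING_MASK];

		if (flag != NULL) {
			*flag = true; /* single-byte write: atomic, per the patch */
			pkts_completed_flag[copy_idx & RING_MASK] = NULL;
			done++;
		}
		copy_idx++;
	}
	return done;
}
```

The wrap case is the interesting one: with `last_idx = 0` and `nr_copies = 2`, the start index underflows to 65535, which the mask folds back into the ring, so a completion batch that straddles index 0 is handled with no special casing.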
> + if (unlikely(vq->batch_copy_nb_elems >= vq->batch_copy_max_elems)) { > + struct batch_copy_elem *tmp; > + uint16_t nb_elems = 2 * vq->batch_copy_max_elems; > + > + VHOST_LOG_DATA(DEBUG, "(%d) %s: run out of batch_copy_elems, " > + "and realloc double elements.\n", dev->vid, __func__); > + tmp = rte_realloc_socket(vq->batch_copy_elems, nb_elems * sizeof(*tmp), > + RTE_CACHE_LINE_SIZE, vq->numa_node); > + if (!tmp) { > + VHOST_LOG_DATA(ERR, "Failed to re-alloc batch_copy_elems\n"); > + return -1; > + } > + > + vq->batch_copy_max_elems = nb_elems; > + vq->batch_copy_elems = tmp; > + bce = tmp; > + } > + bce[vq->batch_copy_nb_elems].dst = (void *)((uintptr_t)(buf_addr + buf_offset)); > + bce[vq->batch_copy_nb_elems].src = rte_pktmbuf_mtod_offset(m, void *, mbuf_offset); > + bce[vq->batch_copy_nb_elems++].len = mapped_len; > + > cpy_len -= (uint32_t)mapped_len; > mbuf_offset += (uint32_t)mapped_len; > buf_offset += (uint32_t)mapped_len; > @@ -901,7 +1066,8 @@ sync_mbuf_to_desc_seg(struct virtio_net *dev, struct vhost_virtqueue *vq, > { > struct batch_copy_elem *batch_copy = vq->batch_copy_elems; > > - if (likely(cpy_len > MAX_BATCH_LEN || vq->batch_copy_nb_elems >= vq->size)) { > + if (likely(cpy_len > MAX_BATCH_LEN || > + vq->batch_copy_nb_elems >= vq->batch_copy_max_elems)) { > rte_memcpy((void *)((uintptr_t)(buf_addr)), > rte_pktmbuf_mtod_offset(m, void *, mbuf_offset), > cpy_len); > @@ -1020,8 +1186,10 @@ mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq, > > if (is_async) { > if (async_mbuf_to_desc_seg(dev, vq, m, mbuf_offset, > + buf_addr + buf_offset, > buf_iova + buf_offset, cpy_len) < 0) > goto error; > + Remove new line. > } else { > sync_mbuf_to_desc_seg(dev, vq, m, mbuf_offset, > buf_addr + buf_offset, > @@ -1449,9 +1617,9 @@ store_dma_desc_info_packed(struct vring_used_elem_packed *s_ring, > } > Regards, Maxime ^ permalink raw reply [flat|nested] 31+ messages in thread
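Before moving on to the follow-up mail: the batch_copy_elems handling in the hunk above is the usual geometric-growth pattern, doubling the array when it fills. A minimal sketch with plain realloc() standing in for rte_realloc_socket() (the struct and function names are illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct batch_copy_elem. */
struct copy_elem {
	void *dst;
	void *src;
	uint32_t len;
};

struct copy_vec {
	struct copy_elem *elems;
	uint16_t nb;  /* elements in use (batch_copy_nb_elems) */
	uint16_t max; /* capacity (batch_copy_max_elems) */
};

/* Append one element, doubling the array when full, mirroring the
 * rte_realloc_socket() path in async_mbuf_to_desc_seg(). */
static int
copy_vec_push(struct copy_vec *v, struct copy_elem e)
{
	if (v->nb >= v->max) {
		uint16_t new_max = 2 * v->max;
		struct copy_elem *tmp = realloc(v->elems, new_max * sizeof(*tmp));

		if (tmp == NULL)
			return -1; /* caller takes its error path */
		v->elems = tmp;
		v->max = new_max;
	}
	v->elems[v->nb++] = e;
	return 0;
}
```

Doubling keeps the amortized cost of an append constant, which matters here because the fallback bookkeeping runs per IOVA segment on the hot enqueue path.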
* RE: [PATCH v2 1/1] vhost: integrate dmadev in asynchronous datapath 2022-02-03 13:04 ` Maxime Coquelin @ 2022-02-07 1:34 ` Hu, Jiayu 0 siblings, 0 replies; 31+ messages in thread From: Hu, Jiayu @ 2022-02-07 1:34 UTC (permalink / raw) To: Maxime Coquelin, dev Cc: i.maximets, Xia, Chenbo, Richardson, Bruce, Van Haaren, Harry, Pai G, Sunil, Mcnamara, John, Ding, Xuan, Jiang, Cheng1, liangma Hi Maxime, Thanks for your comments. Please see replies inline. > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Thursday, February 3, 2022 9:04 PM > To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org > Cc: i.maximets@ovn.org; Xia, Chenbo <chenbo.xia@intel.com>; Richardson, > Bruce <bruce.richardson@intel.com>; Van Haaren, Harry > <harry.van.haaren@intel.com>; Pai G, Sunil <sunil.pai.g@intel.com>; > Mcnamara, John <john.mcnamara@intel.com>; Ding, Xuan > <xuan.ding@intel.com>; Jiang, Cheng1 <cheng1.jiang@intel.com>; > liangma@liangbit.com > Subject: Re: [PATCH v2 1/1] vhost: integrate dmadev in asynchronous > datapath > > Hi Jiayu, > > On 1/24/22 17:40, Jiayu Hu wrote: > > Since dmadev is introduced in 21.11, to avoid the overhead of vhost > > DMA abstraction layer and simplify application logics, this patch > > integrates dmadev in asynchronous data path. 
> > > > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> > > Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> > > --- > > doc/guides/prog_guide/vhost_lib.rst | 95 ++++----- > > examples/vhost/Makefile | 2 +- > > examples/vhost/ioat.c | 218 -------------------- > > examples/vhost/ioat.h | 63 ------ > > examples/vhost/main.c | 255 ++++++++++++++++++----- > > examples/vhost/main.h | 11 + > > examples/vhost/meson.build | 6 +- > > lib/vhost/meson.build | 2 +- > > lib/vhost/rte_vhost.h | 2 + > > lib/vhost/rte_vhost_async.h | 132 +++++------- > > lib/vhost/version.map | 3 + > > lib/vhost/vhost.c | 148 ++++++++++---- > > lib/vhost/vhost.h | 64 +++++- > > lib/vhost/vhost_user.c | 2 + > > lib/vhost/virtio_net.c | 305 +++++++++++++++++++++++----- > > 15 files changed, 744 insertions(+), 564 deletions(-) > > delete mode 100644 examples/vhost/ioat.c > > delete mode 100644 examples/vhost/ioat.h > > > > When you rebase to the next version, please ensure to rework all the logs to > follow the new standard: > VHOST_LOG_CONFIG(ERR,"(%s) .....", dev->ifname, ...); Sure, will do. 
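For illustration, the log convention Maxime refers to above, prefixing every message with the device's ifname via "(%s)", could be sketched as follows. The macro body here is hypothetical (snprintf() into a buffer stands in for the real logger); only the "(%s)" prefix convention comes from the review.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static char log_buf[256];

#define ERR "ERR"
/* Hypothetical mock of VHOST_LOG_CONFIG: format into a buffer so the
 * resulting message can be inspected. */
#define VHOST_LOG_CONFIG(level, fmt, ...) \
	snprintf(log_buf, sizeof(log_buf), "VHOST_%s: " fmt, level, __VA_ARGS__)

struct vhost_dev { const char *ifname; };

/* Per the new standard, dev->ifname is always the first argument,
 * matching a leading "(%s) " in the format string. */
static void
log_example(struct vhost_dev *dev, int queue_id)
{
	VHOST_LOG_CONFIG(ERR, "(%s) failed to allocate async data for queue %d\n",
			 dev->ifname, queue_id);
}
```

The benefit of the convention is that every message is attributable to a specific vhost port, which matters once several devices share one log.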
> > > git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h index > > a87ea6ba37..758a80f403 100644 > > --- a/lib/vhost/rte_vhost_async.h > > +++ b/lib/vhost/rte_vhost_async.h > > @@ -26,73 +26,6 @@ struct rte_vhost_iov_iter { > > unsigned long nr_segs; > > }; > > > > -/** > > - * dma transfer status > > - */ > > -struct rte_vhost_async_status { > > - /** An array of application specific data for source memory */ > > - uintptr_t *src_opaque_data; > > - /** An array of application specific data for destination memory */ > > - uintptr_t *dst_opaque_data; > > -}; > > - > > -/** > > - * dma operation callbacks to be implemented by applications > > - */ > > -struct rte_vhost_async_channel_ops { > > - /** > > - * instruct async engines to perform copies for a batch of packets > > - * > > - * @param vid > > - * id of vhost device to perform data copies > > - * @param queue_id > > - * queue id to perform data copies > > - * @param iov_iter > > - * an array of IOV iterators > > - * @param opaque_data > > - * opaque data pair sending to DMA engine > > - * @param count > > - * number of elements in the "descs" array > > - * @return > > - * number of IOV iterators processed, negative value means error > > - */ > > - int32_t (*transfer_data)(int vid, uint16_t queue_id, > > - struct rte_vhost_iov_iter *iov_iter, > > - struct rte_vhost_async_status *opaque_data, > > - uint16_t count); > > - /** > > - * check copy-completed packets from the async engine > > - * @param vid > > - * id of vhost device to check copy completion > > - * @param queue_id > > - * queue id to check copy completion > > - * @param opaque_data > > - * buffer to receive the opaque data pair from DMA engine > > - * @param max_packets > > - * max number of packets could be completed > > - * @return > > - * number of async descs completed, negative value means error > > - */ > > - int32_t (*check_completed_copies)(int vid, uint16_t queue_id, > > - struct rte_vhost_async_status *opaque_data, > > - 
uint16_t max_packets); > > -}; > > - > > -/** > > - * async channel features > > - */ > > -enum { > > - RTE_VHOST_ASYNC_INORDER = 1U << 0, > > -}; > > - > > -/** > > - * async channel configuration > > - */ > > -struct rte_vhost_async_config { > > - uint32_t features; > > - uint32_t rsvd[2]; > > -}; > > - > > /** > > * Register an async channel for a vhost queue > > * > > @@ -100,17 +33,11 @@ struct rte_vhost_async_config { > > * vhost device id async channel to be attached to > > * @param queue_id > > * vhost queue id async channel to be attached to > > - * @param config > > - * Async channel configuration structure > > - * @param ops > > - * Async channel operation callbacks > > * @return > > * 0 on success, -1 on failures > > */ > > __rte_experimental > > -int rte_vhost_async_channel_register(int vid, uint16_t queue_id, > > - struct rte_vhost_async_config config, > > - struct rte_vhost_async_channel_ops *ops); > > +int rte_vhost_async_channel_register(int vid, uint16_t queue_id); > > > > /** > > * Unregister an async channel for a vhost queue @@ -136,17 +63,11 > > @@ int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id); > > * vhost device id async channel to be attached to > > * @param queue_id > > * vhost queue id async channel to be attached to > > - * @param config > > - * Async channel configuration > > - * @param ops > > - * Async channel operation callbacks > > * @return > > * 0 on success, -1 on failures > > */ > > __rte_experimental > > -int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > queue_id, > > - struct rte_vhost_async_config config, > > - struct rte_vhost_async_channel_ops *ops); > > +int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > > +queue_id); > > > > /** > > * Unregister an async channel for a vhost queue without performing > > any @@ -179,12 +100,17 @@ int > rte_vhost_async_channel_unregister_thread_unsafe(int vid, > > * array of packets to be enqueued > > * @param count > > * packets 
num to be enqueued > > + * @param dma_id > > + * the identifier of the DMA device > > + * @param vchan_id > > + * the identifier of virtual DMA channel > > * @return > > * num of packets enqueued > > */ > > __rte_experimental > > uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > > - struct rte_mbuf **pkts, uint16_t count); > > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > > + uint16_t vchan_id); > > > > /** > > * This function checks async completion status for a specific vhost > > @@ -199,12 +125,17 @@ uint16_t rte_vhost_submit_enqueue_burst(int > vid, uint16_t queue_id, > > * blank array to get return packet pointer > > * @param count > > * size of the packet array > > + * @param dma_id > > + * the identifier of the DMA device > > + * @param vchan_id > > + * the identifier of virtual DMA channel > > * @return > > * num of packets returned > > */ > > __rte_experimental > > uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > > - struct rte_mbuf **pkts, uint16_t count); > > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > > + uint16_t vchan_id); > > > > /** > > * This function returns the amount of in-flight packets for the > > vhost @@ -235,11 +166,44 @@ int rte_vhost_async_get_inflight(int vid, > uint16_t queue_id); > > * Blank array to get return packet pointer > > * @param count > > * Size of the packet array > > + * @param dma_id > > + * the identifier of the DMA device > > + * @param vchan_id > > + * the identifier of virtual DMA channel > > * @return > > * Number of packets returned > > */ > > __rte_experimental > > uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, > > - struct rte_mbuf **pkts, uint16_t count); > > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > > + uint16_t vchan_id); > > +/** > > + * The DMA vChannels used in asynchronous data path must be > > +configured > > + * first. 
So this function needs to be called before enabling DMA > > + * acceleration for vring. If this function fails, asynchronous data > > +path > > + * cannot be enabled for any vring further. > > + * > > + * DMA devices used in data-path must belong to DMA devices given in > > +this > > + * function. But users are free to use DMA devices given in the > > +function > > + * in non-vhost scenarios, only if guarantee no copies in vhost are > > + * offloaded to them at the same time. > > + * > > + * @param dmas_id > > + * DMA ID array > > + * @param count > > + * Element number of 'dmas_id' > > + * @param poll_factor > > + * For large or scatter-gather packets, one packet would consist of > > + * small buffers. In this case, vhost will issue several DMA copy > > + * operations for the packet. Therefore, the number of copies to > > + * check by rte_dma_completed() is calculated by "nb_pkts_to_poll * > > + * poll_factor" and used in rte_vhost_poll_enqueue_completed(). The > > + * default value of "poll_factor" is 1. 
> > + * @return > > + * 0 on success, and -1 on failure > > + */ > > +__rte_experimental > > +int rte_vhost_async_dma_configure(int16_t *dmas_id, uint16_t count, > > + uint16_t poll_factor); > > > > #endif /* _RTE_VHOST_ASYNC_H_ */ > > diff --git a/lib/vhost/version.map b/lib/vhost/version.map index > > a7ef7f1976..1202ba9c1a 100644 > > --- a/lib/vhost/version.map > > +++ b/lib/vhost/version.map > > @@ -84,6 +84,9 @@ EXPERIMENTAL { > > > > # added in 21.11 > > rte_vhost_get_monitor_addr; > > + > > + # added in 22.03 > > + rte_vhost_async_dma_configure; > > }; > > > > INTERNAL { > > diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c index > > 13a9bb9dd1..c408cee63e 100644 > > --- a/lib/vhost/vhost.c > > +++ b/lib/vhost/vhost.c > > @@ -25,7 +25,7 @@ > > #include "vhost.h" > > #include "vhost_user.h" > > > > -struct virtio_net *vhost_devices[MAX_VHOST_DEVICE]; > > +struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE]; > > pthread_mutex_t vhost_dev_lock = PTHREAD_MUTEX_INITIALIZER; > > > > /* Called with iotlb_lock read-locked */ @@ -344,6 +344,7 @@ > > vhost_free_async_mem(struct vhost_virtqueue *vq) > > return; > > > > rte_free(vq->async->pkts_info); > > + rte_free(vq->async->pkts_cmpl_flag); > > > > rte_free(vq->async->buffers_packed); > > vq->async->buffers_packed = NULL; > > @@ -667,12 +668,12 @@ vhost_new_device(void) > > int i; > > > > pthread_mutex_lock(&vhost_dev_lock); > > - for (i = 0; i < MAX_VHOST_DEVICE; i++) { > > + for (i = 0; i < RTE_MAX_VHOST_DEVICE; i++) { > > if (vhost_devices[i] == NULL) > > break; > > } > > > > - if (i == MAX_VHOST_DEVICE) { > > + if (i == RTE_MAX_VHOST_DEVICE) { > > VHOST_LOG_CONFIG(ERR, > > "Failed to find a free slot for new device.\n"); > > pthread_mutex_unlock(&vhost_dev_lock); > > @@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid, > > } > > > > static __rte_always_inline int > > -async_channel_register(int vid, uint16_t queue_id, > > - struct rte_vhost_async_channel_ops *ops) > > +async_channel_register(int 
vid, uint16_t queue_id) > > { > > struct virtio_net *dev = get_device(vid); > > struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; @@ -1656,6 > > +1656,14 @@ async_channel_register(int vid, uint16_t queue_id, > > goto out_free_async; > > } > > > > + async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size * > sizeof(bool), > > + RTE_CACHE_LINE_SIZE, node); > > + if (!async->pkts_cmpl_flag) { > > + VHOST_LOG_CONFIG(ERR, "failed to allocate async > pkts_cmpl_flag (vid %d, qid: %d)\n", > > + vid, queue_id); > > + goto out_free_async; > > + } > > + > > if (vq_is_packed(dev)) { > > async->buffers_packed = rte_malloc_socket(NULL, > > vq->size * sizeof(struct > vring_used_elem_packed), @@ -1676,9 > > +1684,6 @@ async_channel_register(int vid, uint16_t queue_id, > > } > > } > > > > - async->ops.check_completed_copies = ops- > >check_completed_copies; > > - async->ops.transfer_data = ops->transfer_data; > > - > > vq->async = async; > > > > return 0; > > @@ -1691,15 +1696,13 @@ async_channel_register(int vid, uint16_t > queue_id, > > } > > > > int > > -rte_vhost_async_channel_register(int vid, uint16_t queue_id, > > - struct rte_vhost_async_config config, > > - struct rte_vhost_async_channel_ops *ops) > > +rte_vhost_async_channel_register(int vid, uint16_t queue_id) > > { > > struct vhost_virtqueue *vq; > > struct virtio_net *dev = get_device(vid); > > int ret; > > > > - if (dev == NULL || ops == NULL) > > + if (dev == NULL) > > return -1; > > > > if (queue_id >= VHOST_MAX_VRING) > > @@ -1710,33 +1713,20 @@ rte_vhost_async_channel_register(int vid, > uint16_t queue_id, > > if (unlikely(vq == NULL || !dev->async_copy)) > > return -1; > > > > - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { > > - VHOST_LOG_CONFIG(ERR, > > - "async copy is not supported on non-inorder mode " > > - "(vid %d, qid: %d)\n", vid, queue_id); > > - return -1; > > - } > > - > > - if (unlikely(ops->check_completed_copies == NULL || > > - ops->transfer_data == NULL)) > > - return -1; 
> > - > > rte_spinlock_lock(&vq->access_lock); > > - ret = async_channel_register(vid, queue_id, ops); > > + ret = async_channel_register(vid, queue_id); > > rte_spinlock_unlock(&vq->access_lock); > > > > return ret; > > } > > > > int > > -rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > queue_id, > > - struct rte_vhost_async_config config, > > - struct rte_vhost_async_channel_ops *ops) > > +rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t > > +queue_id) > > { > > struct vhost_virtqueue *vq; > > struct virtio_net *dev = get_device(vid); > > > > - if (dev == NULL || ops == NULL) > > + if (dev == NULL) > > return -1; > > > > if (queue_id >= VHOST_MAX_VRING) > > @@ -1747,18 +1737,7 @@ > rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, > > if (unlikely(vq == NULL || !dev->async_copy)) > > return -1; > > > > - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { > > - VHOST_LOG_CONFIG(ERR, > > - "async copy is not supported on non-inorder mode " > > - "(vid %d, qid: %d)\n", vid, queue_id); > > - return -1; > > - } > > - > > - if (unlikely(ops->check_completed_copies == NULL || > > - ops->transfer_data == NULL)) > > - return -1; > > - > > - return async_channel_register(vid, queue_id, ops); > > + return async_channel_register(vid, queue_id); > > } > > > > int > > @@ -1835,6 +1814,95 @@ > rte_vhost_async_channel_unregister_thread_unsafe(int vid, uint16_t > queue_id) > > return 0; > > } > > > > +static __rte_always_inline void > > +vhost_free_async_dma_mem(void) > > +{ > > + uint16_t i; > > + > > + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) { > > + struct async_dma_info *dma = &dma_copy_track[i]; > > + int16_t j; > > + > > + if (dma->max_vchans == 0) > > + continue; > > + > > + for (j = 0; j < dma->max_vchans; j++) > > + rte_free(dma->vchans[j].pkts_completed_flag); > > + > > + rte_free(dma->vchans); > > + dma->vchans = NULL; > > + dma->max_vchans = 0; > > + } > > +} > > + > > +int > > 
+rte_vhost_async_dma_configure(int16_t *dmas_id, uint16_t count, > > +uint16_t poll_factor) > > I'm not a fan of the poll_factor, I think it is too complex for the user to know > what value he should set. It seems so, and users need to know DMA offloading details before setting it. The simple way is setting the max copies to check by rte_dma_completed to 9728/2048*32, where 9728 is equal to VIRTIO_MAX_RX_PKTLEN and 32 is MAX_PKT_BURST. It should be able to cover most of cases. > > Also, I would like that the API only registers one DMA channel at a time and > let the application call it multiple times. Doing that, user can still use the DMA > channels that could be registered. Sure, I will change it. > > > +{ > > + uint16_t i; > > + > > + if (!dmas_id) { > > + VHOST_LOG_CONFIG(ERR, "Invalid DMA configuration > parameter.\n"); > > + return -1; > > + } > > + > > + if (poll_factor == 0) { > > + VHOST_LOG_CONFIG(ERR, "Invalid DMA poll factor %u\n", > poll_factor); > > + return -1; > > + } > > + dma_poll_factor = poll_factor; > > + > > + for (i = 0; i < count; i++) { > > + struct async_dma_vchan_info *vchans; > > + struct rte_dma_info info; > > + uint16_t max_vchans; > > + uint16_t max_desc; > > + uint16_t j; > > + > > + if (!rte_dma_is_valid(dmas_id[i])) { > > + VHOST_LOG_CONFIG(ERR, "DMA %d is not found. > Cannot enable async" > > + " data-path\n.", dmas_id[i]); > > + vhost_free_async_dma_mem(); > > + return -1; > > + } > > + > > + rte_dma_info_get(dmas_id[i], &info); > > + > > + max_vchans = info.max_vchans; > > + max_desc = info.max_desc; > > + > > + if (!rte_is_power_of_2(max_desc)) > > + max_desc = rte_align32pow2(max_desc); > > + > > + vchans = rte_zmalloc(NULL, sizeof(struct > async_dma_vchan_info) * max_vchans, > > + RTE_CACHE_LINE_SIZE); > > + if (vchans == NULL) { > > + VHOST_LOG_CONFIG(ERR, "Failed to allocate vchans > for dma-%d." 
> > + " Cannot enable async data-path.\n", > dmas_id[i]); > > + vhost_free_async_dma_mem(); > > + return -1; > > + } > > + > > + for (j = 0; j < max_vchans; j++) { > > + vchans[j].pkts_completed_flag = rte_zmalloc(NULL, > sizeof(bool *) * max_desc, > > + RTE_CACHE_LINE_SIZE); > > + if (!vchans[j].pkts_completed_flag) { > > + VHOST_LOG_CONFIG(ERR, "Failed to allocate > pkts_completed_flag for " > > + "dma-%d vchan-%u\n", > dmas_id[i], j); > > + vhost_free_async_dma_mem(); > > + return -1; > > + } > > + > > + vchans[j].ring_size = max_desc; > > + vchans[j].ring_mask = max_desc - 1; > > + } > > + > > + dma_copy_track[dmas_id[i]].vchans = vchans; > > + dma_copy_track[dmas_id[i]].max_vchans = max_vchans; > > + } > > + > > + return 0; > > +} > > + > > int > > rte_vhost_async_get_inflight(int vid, uint16_t queue_id) > > { > > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h index > > 7085e0885c..475843fec0 100644 > > --- a/lib/vhost/vhost.h > > +++ b/lib/vhost/vhost.h > > @@ -19,6 +19,7 @@ > > #include <rte_ether.h> > > #include <rte_rwlock.h> > > #include <rte_malloc.h> > > +#include <rte_dmadev.h> > > > > #include "rte_vhost.h" > > #include "rte_vdpa.h" > > @@ -50,6 +51,7 @@ > > > > #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) > > #define VHOST_MAX_ASYNC_VEC 2048 > > +#define VHOST_ASYNC_DMA_BATCHING_SIZE 32 > > > > #define PACKED_DESC_ENQUEUE_USED_FLAG(w) \ > > ((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | > > VRING_DESC_F_WRITE) : \ @@ -119,6 +121,42 @@ struct > vring_used_elem_packed { > > uint32_t count; > > }; > > > > +struct async_dma_vchan_info { > > + /* circular array to track if packet copy completes */ > > + bool **pkts_completed_flag; > > + > > + /* max elements in 'metadata' */ > > + uint16_t ring_size; > > + /* ring index mask for 'metadata' */ > > + uint16_t ring_mask; > > + > > + /* batching copies before a DMA doorbell */ > > + uint16_t nr_batching; > > + > > + /** > > + * DMA virtual channel lock. 
Although it is able to bind DMA > > + * virtual channels to data plane threads, vhost control plane > > + * thread could call data plane functions too, thus causing > > + * DMA device contention. > > + * > > + * For example, in VM exit case, vhost control plane thread needs > > + * to clear in-flight packets before disable vring, but there could > > + * be another data plane thread enqueuing packets to the same > > + * vring with the same DMA virtual channel. But dmadev PMD > functions > > + * are lock-free, so the control plane and data plane threads > > + * could operate the same DMA virtual channel at the same time. > > + */ > > + rte_spinlock_t dma_lock; > > +}; > > + > > +struct async_dma_info { > > + uint16_t max_vchans; > > + struct async_dma_vchan_info *vchans; }; > > + > > +extern struct async_dma_info > dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > > +extern uint16_t dma_poll_factor; > > + > > /** > > * inflight async packet information > > */ > > @@ -129,9 +167,6 @@ struct async_inflight_info { > > }; > > > > struct vhost_async { > > - /* operation callbacks for DMA */ > > - struct rte_vhost_async_channel_ops ops; > > - > > struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; > > struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; > > uint16_t iter_idx; > > @@ -139,6 +174,25 @@ struct vhost_async { > > > > /* data transfer status */ > > struct async_inflight_info *pkts_info; > > + /** > > + * Packet reorder array. "true" indicates that DMA device > > + * completes all copies for the packet. > > + * > > + * Note that this array could be written by multiple threads > > + * simultaneously. 
For example, in the case of thread0 and > > + * thread1 RX packets from NIC and then enqueue packets to > > + * vring0 and vring1 with own DMA device DMA0 and DMA1, it's > > + * possible for thread0 to get completed copies belonging to > > + * vring1 from DMA0, while thread0 is calling rte_vhost_poll > > + * _enqueue_completed() for vring0 and thread1 is calling > > + * rte_vhost_submit_enqueue_burst() for vring1. In this case, > > + * vq->access_lock cannot protect pkts_cmpl_flag of vring1. > > + * > > + * However, since offloading is per-packet basis, each packet > > + * flag will only be written by one thread. And single byte > > + * write is atomic, so no lock for pkts_cmpl_flag is needed. > > + */ > > + bool *pkts_cmpl_flag; > > uint16_t pkts_idx; > > uint16_t pkts_inflight_n; > > union { > > @@ -198,6 +252,7 @@ struct vhost_virtqueue { > > /* Record packed ring first dequeue desc index */ > > uint16_t shadow_last_used_idx; > > > > + uint16_t batch_copy_max_elems; > > uint16_t batch_copy_nb_elems; > > struct batch_copy_elem *batch_copy_elems; > > int numa_node; > > @@ -568,8 +623,7 @@ extern int vhost_data_log_level; > > #define PRINT_PACKET(device, addr, size, header) do {} while (0) > > #endif > > > > -#define MAX_VHOST_DEVICE 1024 > > -extern struct virtio_net *vhost_devices[MAX_VHOST_DEVICE]; > > +extern struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE]; > > > > #define VHOST_BINARY_SEARCH_THRESH 256 > > > > diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c index > > 5eb1dd6812..3147e72f04 100644 > > --- a/lib/vhost/vhost_user.c > > +++ b/lib/vhost/vhost_user.c > > @@ -527,6 +527,8 @@ vhost_user_set_vring_num(struct virtio_net > **pdev, > > return RTE_VHOST_MSG_RESULT_ERR; > > } > > > > + vq->batch_copy_max_elems = vq->size; > > + > > I don't understand the point of this new field. But it can be removed anyway > if we agree to drop the SW fallback. This is for handling lacking of batch_copy elements in SW fallback. 
> > > return RTE_VHOST_MSG_RESULT_OK; > > } > > > > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c index > > b3d954aab4..305f6cd562 100644 > > --- a/lib/vhost/virtio_net.c > > +++ b/lib/vhost/virtio_net.c > > @@ -11,6 +11,7 @@ > > #include <rte_net.h> > > #include <rte_ether.h> > > #include <rte_ip.h> > > +#include <rte_dmadev.h> > > #include <rte_vhost.h> > > #include <rte_tcp.h> > > #include <rte_udp.h> > > @@ -25,6 +26,10 @@ > > > > #define MAX_BATCH_LEN 256 > > > > +/* DMA device copy operation tracking array. */ struct async_dma_info > > +dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > > +uint16_t dma_poll_factor = 1; > > + > > static __rte_always_inline bool > > rxvq_is_mergeable(struct virtio_net *dev) > > { > > @@ -43,6 +48,140 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, > uint32_t nr_vring) > > return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring; > > } > > > > +static __rte_always_inline uint16_t > > +vhost_async_dma_transfer(struct vhost_virtqueue *vq, int16_t dma_id, > > + uint16_t vchan_id, uint16_t head_idx, > > + struct rte_vhost_iov_iter *pkts, uint16_t nr_pkts) { > > + struct async_dma_vchan_info *dma_info = > &dma_copy_track[dma_id].vchans[vchan_id]; > > + uint16_t ring_mask = dma_info->ring_mask; > > + uint16_t pkt_idx, bce_idx = 0; > > + > > + rte_spinlock_lock(&dma_info->dma_lock); > > + > > + for (pkt_idx = 0; pkt_idx < nr_pkts; pkt_idx++) { > > + struct rte_vhost_iovec *iov = pkts[pkt_idx].iov; > > + int copy_idx, last_copy_idx = 0; > > + uint16_t nr_segs = pkts[pkt_idx].nr_segs; > > + uint16_t nr_sw_copy = 0; > > + uint16_t i; > > + > > + if (rte_dma_burst_capacity(dma_id, vchan_id) < nr_segs) > > + goto out; > > I would consider introducing a vhost_async_dma_transfer_one function to > avoid nesting too much loops and make the code cleaner. Sure, I will add it. 
> > > + for (i = 0; i < nr_segs; i++) { > > + /* Fallback to SW copy if error happens */ > > + copy_idx = rte_dma_copy(dma_id, vchan_id, > (rte_iova_t)iov[i].src_addr, > > + (rte_iova_t)iov[i].dst_addr, iov[i].len, > > + RTE_DMA_OP_FLAG_LLC); > > + if (unlikely(copy_idx < 0)) { > > The DMA channel is protected by a lock, and we check the capacity before > initiating the copy. > So I don't expect rte_dma_copy() to fail because of lack of capacity. If an > error happens, that is a serious one. > > So, I wonder whether having a SW fallback makes sense. Code would be > much simpler if we just exit early if an error happens. Logging an error > message instead would help debugging. Certainly with rate limiting not to > flood the log file. That's correct. If error really happens in this case, it means DMA definitely has something wrong. Better to stop async data-path and debug. SW fallback may hide this serious issue. If no objections, I will remove SW fallback but add an error log. > > > + /* Find corresponding VA pair and do SW > copy */ > > + rte_memcpy(vq- > >batch_copy_elems[bce_idx].dst, > > + vq- > >batch_copy_elems[bce_idx].src, > > + vq- > >batch_copy_elems[bce_idx].len); > > + nr_sw_copy++; > > + > > + /** > > + * All copies of the packet are performed > > + * by the CPU, set the packet completion flag > > + * to true, as all copies are done. > > + */ > > I think it would be better moved out of the loop to avoid doing the check for > every segment while only the last one has a chance to match. I didn't get the point. How to get rid of the loop? Do you suggest to do SW copy for all left copies once one DMA error happens (if SW copy is kept)? 
> > > + if (nr_sw_copy == nr_segs) { > > + vq->async->pkts_cmpl_flag[head_idx % > vq->size] = true; > > + break; > > + } else if (i == (nr_segs - 1)) { > > + /** > > + * A part of copies of current packet > > + * are enqueued to the DMA > successfully > > + * but the last copy fails, store the > > + * packet completion flag address > > + * in the last DMA copy slot. > > + */ > > + dma_info- > >pkts_completed_flag[last_copy_idx & ring_mask] = > > + &vq->async- > >pkts_cmpl_flag[head_idx % vq->size]; > > + break; > > + } > > + } else > > + last_copy_idx = copy_idx; > > Braces on the else as you have braces for the if statement. > > > + > > + bce_idx++; > > + > > + /** > > + * Only store packet completion flag address in the > last copy's > > + * slot, and other slots are set to NULL. > > + */ > > + if (i == (nr_segs - 1)) { > > + dma_info->pkts_completed_flag[copy_idx & > ring_mask] = > > + &vq->async- > >pkts_cmpl_flag[head_idx % vq->size]; > > + } > > + } > > + > > + dma_info->nr_batching += nr_segs; > > + if (unlikely(dma_info->nr_batching >= > VHOST_ASYNC_DMA_BATCHING_SIZE)) { > > + rte_dma_submit(dma_id, vchan_id); > > + dma_info->nr_batching = 0; > > + } > > I wonder whether we could just remove this submit. > I don't expect completions to happen between two packets as the DMA > channel is protected by a lock, so my understanding is once the DMA ring is > full, we just end-up exiting early because DMA channel capacity is checked > for every packet. > > Removing it will maybe improve performance a (very) little bit, but will > certainly make the code simpler to follow. Good suggestion. I will remove it. > > > + > > + head_idx++; > > + } > > + > > +out: > > + if (dma_info->nr_batching > 0) { > > if (likely(...)) Sure, I will add later. 
> > > + rte_dma_submit(dma_id, vchan_id); > > + dma_info->nr_batching = 0; > > + } > > + rte_spinlock_unlock(&dma_info->dma_lock); > > + vq->batch_copy_nb_elems = 0; > > + > > + return pkt_idx; > > +} > > + > > +static __rte_always_inline uint16_t > > +vhost_async_dma_check_completed(int16_t dma_id, uint16_t vchan_id, > > +uint16_t max_pkts) { > > + struct async_dma_vchan_info *dma_info = > &dma_copy_track[dma_id].vchans[vchan_id]; > > + uint16_t ring_mask = dma_info->ring_mask; > > + uint16_t last_idx = 0; > > + uint16_t nr_copies; > > + uint16_t copy_idx; > > + uint16_t i; > > + bool has_error = false; > > + > > + rte_spinlock_lock(&dma_info->dma_lock); > > + > > + /** > > + * Print error log for debugging, if DMA reports error during > > + * DMA transfer. We do not handle error in vhost level. > > + */ > > + nr_copies = rte_dma_completed(dma_id, vchan_id, max_pkts, > &last_idx, &has_error); > > + if (unlikely(has_error)) { > > + VHOST_LOG_DATA(ERR, "dma %d vchannel %u reports error > in rte_dma_completed()\n", > > + dma_id, vchan_id); > > I wonder if rate limiting would make sense here. Sure, I will avoid log flooding. > > > + } else if (nr_copies == 0) > > + goto out; > > + > > + copy_idx = last_idx - nr_copies + 1; > > + for (i = 0; i < nr_copies; i++) { > > + bool *flag; > > + > > + flag = dma_info->pkts_completed_flag[copy_idx & > ring_mask]; > > + if (flag) { > > + /** > > + * Mark the packet flag as received. The flag > > + * could belong to another virtqueue but write > > + * is atomic. 
> > + */ > > + *flag = true; > > + dma_info->pkts_completed_flag[copy_idx & > ring_mask] = NULL; > > + } > > + copy_idx++; > > + } > > + > > +out: > > + rte_spinlock_unlock(&dma_info->dma_lock); > > + return nr_copies; > > +} > > + > > static inline void > > do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq) > > { > > @@ -865,12 +1004,13 @@ async_iter_reset(struct vhost_async *async) > > static __rte_always_inline int > > async_mbuf_to_desc_seg(struct virtio_net *dev, struct vhost_virtqueue > *vq, > > struct rte_mbuf *m, uint32_t mbuf_offset, > > - uint64_t buf_iova, uint32_t cpy_len) > > + uint64_t buf_addr, uint64_t buf_iova, uint32_t cpy_len) > > { > > struct vhost_async *async = vq->async; > > uint64_t mapped_len; > > uint32_t buf_offset = 0; > > void *hpa; > > + struct batch_copy_elem *bce = vq->batch_copy_elems; > > > > while (cpy_len) { > > hpa = (void *)(uintptr_t)gpa_to_first_hpa(dev, > > @@ -886,6 +1026,31 @@ async_mbuf_to_desc_seg(struct virtio_net *dev, > struct vhost_virtqueue *vq, > > hpa, (size_t)mapped_len))) > > return -1; > > > > + /** > > + * Keep VA for all IOVA segments for falling back to SW > > + * copy in case of rte_dma_copy() error. > > + */ > > As said below, I think we could get rid off the SW fallback. Like the replies above, a better way for me is to remove SW fallback. > But in case we didn't, I think it would be prefferable to change the > rte_vhost_iovec struct to have both the iova and the VA, that would make > the code simpler. > > Also, while looking at this, I notice the structs rte_vhost_iov_iter and > rte_vhost_iovec are still part of the Vhost API, but it should not be necessary > now since application no more need to know about it. Good catch, and I will change it later. 
> > > + if (unlikely(vq->batch_copy_nb_elems >= vq- > >batch_copy_max_elems)) { > > + struct batch_copy_elem *tmp; > > + uint16_t nb_elems = 2 * vq->batch_copy_max_elems; > > + > > + VHOST_LOG_DATA(DEBUG, "(%d) %s: run out of > batch_copy_elems, " > > + "and realloc double elements.\n", > dev->vid, __func__); > > + tmp = rte_realloc_socket(vq->batch_copy_elems, > nb_elems * sizeof(*tmp), > > + RTE_CACHE_LINE_SIZE, vq- > >numa_node); > > + if (!tmp) { > > + VHOST_LOG_DATA(ERR, "Failed to re-alloc > batch_copy_elems\n"); > > + return -1; > > + } > > + > > + vq->batch_copy_max_elems = nb_elems; > > + vq->batch_copy_elems = tmp; > > + bce = tmp; > > + } > > + bce[vq->batch_copy_nb_elems].dst = (void > *)((uintptr_t)(buf_addr + buf_offset)); > > + bce[vq->batch_copy_nb_elems].src = > rte_pktmbuf_mtod_offset(m, void *, mbuf_offset); > > + bce[vq->batch_copy_nb_elems++].len = mapped_len; > > + > > cpy_len -= (uint32_t)mapped_len; > > mbuf_offset += (uint32_t)mapped_len; > > buf_offset += (uint32_t)mapped_len; @@ -901,7 +1066,8 > @@ > > sync_mbuf_to_desc_seg(struct virtio_net *dev, struct vhost_virtqueue *vq, > > { > > struct batch_copy_elem *batch_copy = vq->batch_copy_elems; > > > > - if (likely(cpy_len > MAX_BATCH_LEN || vq->batch_copy_nb_elems >= > vq->size)) { > > + if (likely(cpy_len > MAX_BATCH_LEN || > > + vq->batch_copy_nb_elems >= vq- > >batch_copy_max_elems)) { > > rte_memcpy((void *)((uintptr_t)(buf_addr)), > > rte_pktmbuf_mtod_offset(m, void *, > mbuf_offset), > > cpy_len); > > @@ -1020,8 +1186,10 @@ mbuf_to_desc(struct virtio_net *dev, struct > > vhost_virtqueue *vq, > > > > if (is_async) { > > if (async_mbuf_to_desc_seg(dev, vq, m, mbuf_offset, > > + buf_addr + buf_offset, > > buf_iova + buf_offset, > cpy_len) < 0) > > goto error; > > + > > Remove new line. I will remove it later. Thanks, Jiayu ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v3 0/1] integrate dmadev in vhost 2022-01-24 16:40 ` [PATCH v2 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu 2022-02-03 13:04 ` Maxime Coquelin @ 2022-02-08 10:40 ` Jiayu Hu 2022-02-08 10:40 ` [PATCH v3 1/1] vhost: integrate dmadev in asynchronous data-path Jiayu Hu 1 sibling, 1 reply; 31+ messages in thread From: Jiayu Hu @ 2022-02-08 10:40 UTC (permalink / raw) To: dev Cc: maxime.coquelin, i.maximets, chenbo.xia, xuan.ding, cheng1.jiang, liangma, Jiayu Hu Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA abstraction layer and simplify application logics, this patch integrates dmadev in vhost. To enable the flexibility of using DMA devices in different function modules, not limited in vhost, vhost doesn't manage DMA devices. Applications, like OVS, need to manage and configure DMA devices and tell vhost what DMA device to use in every dataplane function call. In addition, vhost supports M:N mapping between vrings and DMA virtual channels. Specifically, one vring can use multiple different DMA channels and one DMA channel can be shared by multiple vrings at the same time. The reason of enabling one vring to use multiple DMA channels is that it's possible that more than one dataplane threads enqueue packets to the same vring with their own DMA virtual channels. Besides, the number of DMA devices is limited. For the purpose of scaling, it's necessary to support sharing DMA channels among vrings. As only enqueue path is enabled DMA acceleration, the new dataplane functions are like: 1). rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id, dma_vchan): Get descriptors and submit copies to DMA virtual channel for the packets that need to be send to VM. 2). rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id, dma_vchan): Check completed DMA copies from the given DMA virtual channel and write back corresponding descriptors to vring. 
OVS needs to call rte_vhost_poll_enqueue_completed to clean in-flight copies on previous call and it can be called inside rxq_recv function, so that it doesn't require big change in OVS datapath. For example: netdev_dpdk_vhost_rxq_recv() { ... qid = rxq->queue_id * VIRTIO_QNUM + VIRTIO_RXQ; rte_vhost_poll_enqueue_completed(vid, qid, ...); } Change log ========== v2 -> v3: - remove SW fallback - remove middle-packet dma submit - refactor rte_async_dma_configure() and remove poll_factor - introduce vhost_async_dma_transfer_one() - rename rte_vhost_iov_iter and rte_vhost_iovec and place them in vhost.h - refactor LOG format - print error log for rte_dma_copy() failure with avoiding log flood - avoid log flood for rte_dma_completed() failure - fix some typo and update comment and doc v1 -> v2: - add SW fallback if rte_dma_copy() reports error - print error if rte_dma_completed() reports error - add poll_factor while calling rte_dma_completed() for scatter-gather packets - use trylock instead of lock in rte_vhost_poll_enqueue_completed() - check if dma_id and vchan_id valid - input dma_id in rte_vhost_async_dma_configure() - remove useless code, brace and hardcode in vhost example - redefine MAX_VHOST_DEVICE to RTE_MAX_VHOST_DEVICE - update doc and comments rfc -> v1: - remove useless code - support dynamic DMA vchannel ring size (rte_vhost_async_dma_configure) - fix several bugs - fix typo and coding style issues - replace "while" with "for" - update programmer guide - support share dma among vhost in vhost example - remove "--dma-type" in vhost example Jiayu Hu (1): vhost: integrate dmadev in asynchronous data-path doc/guides/prog_guide/vhost_lib.rst | 97 +++++----- examples/vhost/Makefile | 2 +- examples/vhost/ioat.c | 218 ---------------------- examples/vhost/ioat.h | 63 ------- examples/vhost/main.c | 252 +++++++++++++++++++++----- examples/vhost/main.h | 11 ++ examples/vhost/meson.build | 6 +- lib/vhost/meson.build | 2 +- lib/vhost/rte_vhost.h | 2 +
lib/vhost/rte_vhost_async.h | 145 ++++----------- lib/vhost/version.map | 3 + lib/vhost/vhost.c | 122 +++++++++---- lib/vhost/vhost.h | 85 ++++++++- lib/vhost/virtio_net.c | 271 +++++++++++++++++++++++----- 14 files changed, 689 insertions(+), 590 deletions(-) delete mode 100644 examples/vhost/ioat.c delete mode 100644 examples/vhost/ioat.h -- 2.25.1 ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v3 1/1] vhost: integrate dmadev in asynchronous data-path 2022-02-08 10:40 ` [PATCH v3 0/1] integrate dmadev in vhost Jiayu Hu @ 2022-02-08 10:40 ` Jiayu Hu 2022-02-08 17:46 ` Maxime Coquelin 2022-02-09 12:51 ` [PATCH v4 0/1] integrate dmadev in vhost Jiayu Hu 0 siblings, 2 replies; 31+ messages in thread From: Jiayu Hu @ 2022-02-08 10:40 UTC (permalink / raw) To: dev Cc: maxime.coquelin, i.maximets, chenbo.xia, xuan.ding, cheng1.jiang, liangma, Jiayu Hu, Sunil Pai G Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA abstraction layer and simplify application logics, this patch integrates dmadev in asynchronous data path. Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> --- doc/guides/prog_guide/vhost_lib.rst | 97 +++++----- examples/vhost/Makefile | 2 +- examples/vhost/ioat.c | 218 ---------------------- examples/vhost/ioat.h | 63 ------- examples/vhost/main.c | 252 +++++++++++++++++++++----- examples/vhost/main.h | 11 ++ examples/vhost/meson.build | 6 +- lib/vhost/meson.build | 2 +- lib/vhost/rte_vhost.h | 2 + lib/vhost/rte_vhost_async.h | 145 ++++----------- lib/vhost/version.map | 3 + lib/vhost/vhost.c | 122 +++++++++---- lib/vhost/vhost.h | 85 ++++++++- lib/vhost/virtio_net.c | 271 +++++++++++++++++++++++----- 14 files changed, 689 insertions(+), 590 deletions(-) delete mode 100644 examples/vhost/ioat.c delete mode 100644 examples/vhost/ioat.h diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst index f72ce75909..a5f7861366 100644 --- a/doc/guides/prog_guide/vhost_lib.rst +++ b/doc/guides/prog_guide/vhost_lib.rst @@ -106,12 +106,11 @@ The following is an overview of some key Vhost API functions: - ``RTE_VHOST_USER_ASYNC_COPY`` Asynchronous data path will be enabled when this flag is set. Async data - path allows applications to register async copy devices (typically - hardware DMA channels) to the vhost queues. 
Vhost leverages the copy - device registered to free CPU from memory copy operations. A set of - async data path APIs are defined for DPDK applications to make use of - the async capability. Only packets enqueued/dequeued by async APIs are - processed through the async data path. + path allows applications to register DMA channels to the vhost queues. + Vhost leverages the registered DMA devices to free CPU from memory copy + operations. A set of async data path APIs are defined for DPDK applications + to make use of the async capability. Only packets enqueued/dequeued by + async APIs are processed through the async data path. Currently this feature is only implemented on split ring enqueue data path. @@ -218,52 +217,30 @@ The following is an overview of some key Vhost API functions: Enable or disable zero copy feature of the vhost crypto backend. -* ``rte_vhost_async_channel_register(vid, queue_id, config, ops)`` +* ``rte_vhost_async_dma_configure(dma_id, vchan_id)`` - Register an async copy device channel for a vhost queue after vring - is enabled. Following device ``config`` must be specified together - with the registration: + Tell vhost which DMA vChannel it is going to use. This function needs to + be called before registering the async data path for a vring. - * ``features`` +* ``rte_vhost_async_channel_register(vid, queue_id)`` - This field is used to specify async copy device features. + Register async DMA acceleration for a vhost queue after vring is enabled. - ``RTE_VHOST_ASYNC_INORDER`` represents the async copy device can - guarantee the order of copy completion is the same as the order - of copy submission. +* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id)`` - Currently, only ``RTE_VHOST_ASYNC_INORDER`` capable device is - supported by vhost. 
- - Applications must provide following ``ops`` callbacks for vhost lib to - work with the async copy devices: - - * ``transfer_data(vid, queue_id, descs, opaque_data, count)`` - - vhost invokes this function to submit copy data to the async devices. - For non-async_inorder capable devices, ``opaque_data`` could be used - for identifying the completed packets. - - * ``check_completed_copies(vid, queue_id, opaque_data, max_packets)`` - - vhost invokes this function to get the copy data completed by async - devices. - -* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id, config, ops)`` - - Register an async copy device channel for a vhost queue without - performing any locking. + Register async DMA acceleration for a vhost queue without performing + any locking. This function is only safe to call in vhost callback functions (i.e., struct rte_vhost_device_ops). * ``rte_vhost_async_channel_unregister(vid, queue_id)`` - Unregister the async copy device channel from a vhost queue. + Unregister the async DMA acceleration from a vhost queue. Unregistration will fail, if the vhost queue has in-flight packets that are not completed. - Unregister async copy devices in vring_state_changed() may + Unregister async DMA acceleration in vring_state_changed() may fail, as this API tries to acquire the spinlock of vhost queue. The recommended way is to unregister async copy devices for all vhost queues in destroy_device(), when a @@ -271,24 +248,19 @@ The following is an overview of some key Vhost API functions: * ``rte_vhost_async_channel_unregister_thread_unsafe(vid, queue_id)`` - Unregister the async copy device channel for a vhost queue without - performing any locking. + Unregister async DMA acceleration for a vhost queue without performing + any locking. This function is only safe to call in vhost callback functions (i.e., struct rte_vhost_device_ops). 
-* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, comp_pkts, comp_count)`` +* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id, vchan_id)`` Submit an enqueue request to transmit ``count`` packets from host to guest - by async data path. Successfully enqueued packets can be transfer completed - or being occupied by DMA engines; transfer completed packets are returned in - ``comp_pkts``, but others are not guaranteed to finish, when this API - call returns. + by async data path. Applications must not free the packets submitted for + enqueue until the packets are completed. - Applications must not free the packets submitted for enqueue until the - packets are completed. - -* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count)`` +* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id, vchan_id)`` Poll enqueue completion status from async data path. Completed packets are returned to applications through ``pkts``. @@ -298,7 +270,7 @@ The following is an overview of some key Vhost API functions: This function returns the amount of in-flight packets for the vhost queue using async acceleration. -* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count)`` +* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count, dma_id, vchan_id)`` Clear inflight packets which are submitted to DMA engine in vhost async data path. Completed packets are returned to applications through ``pkts``. @@ -443,6 +415,29 @@ Finally, a set of device ops is defined for device specific operations: Called to get the notify area info of the queue. +Vhost asynchronous data path +---------------------------- + +Vhost asynchronous data path leverages DMA devices to offload memory +copies from the CPU and it is implemented in an asynchronous way. It +enables applications, like OVS, to save CPU cycles and hide memory copy +overhead, thus achieving higher throughput. 
+ +Vhost doesn't manage DMA devices; applications, like OVS, need to +manage and configure DMA devices. Applications need to tell vhost which +DMA devices to use in every data path function call. This design gives +applications the flexibility to dynamically use DMA channels in +different function modules, not limited to vhost. + +In addition, vhost supports M:N mapping between vrings and DMA virtual +channels. Specifically, one vring can use multiple different DMA channels +and one DMA channel can be shared by multiple vrings at the same time. +The reason for enabling one vring to use multiple DMA channels is that +more than one data-plane thread may enqueue packets to the same vring, +each with its own DMA virtual channel. Besides, the number +of DMA devices is limited. For the purpose of scaling, it's necessary to +support sharing DMA channels among vrings. + Recommended IOVA mode in async datapath --------------------------------------- @@ -450,4 +445,4 @@ When DMA devices are bound to vfio driver, VA mode is recommended. For PA mode, page by page mapping may exceed IOMMU's max capability, better to use 1G guest hugepage. -For uio driver, any vfio related error message can be ignored. \ No newline at end of file +For uio driver, any vfio related error message can be ignored. 
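[Editor's note: pulling the documented call flow together, an application would drive the new dmadev-based async path roughly as below. This is a minimal sketch, not part of the patch: the helper names, ``MAX_PKT_BURST``, and the error handling are illustrative, and the DMA device is assumed to be already probed by EAL.]

```c
#include <rte_dmadev.h>
#include <rte_mbuf.h>
#include <rte_vhost.h>
#include <rte_vhost_async.h>

#define DMA_RING_SIZE 4096
#define MAX_PKT_BURST 32

/* One-time setup: the application owns the DMA device, so it configures
 * and starts it itself, then tells vhost about the vChannel before
 * enabling async acceleration on the vring. */
static int
setup_async_path(int16_t dma_id, int vid, uint16_t queue_id)
{
	struct rte_dma_conf dev_conf = { .nb_vchans = 1 };
	struct rte_dma_vchan_conf qconf = {
		.direction = RTE_DMA_DIR_MEM_TO_MEM,
		.nb_desc = DMA_RING_SIZE,
	};

	if (rte_dma_configure(dma_id, &dev_conf) != 0 ||
	    rte_dma_vchan_setup(dma_id, 0, &qconf) != 0 ||
	    rte_dma_start(dma_id) != 0)
		return -1;

	/* Tell vhost which DMA vChannel it is going to use... */
	if (rte_vhost_async_dma_configure(dma_id, 0) < 0)
		return -1;

	/* ...then register async acceleration for the vring. */
	return rte_vhost_async_channel_register(vid, queue_id);
}

/* Per-burst data path: reap completed copies first, then submit new ones.
 * Mbufs stay owned by vhost from submit until they are returned by a
 * later rte_vhost_poll_enqueue_completed() call. */
static uint16_t
async_enqueue_step(int vid, uint16_t queue_id, int16_t dma_id,
		struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	struct rte_mbuf *cpl[MAX_PKT_BURST];
	uint16_t n_cpl, n_enq;

	n_cpl = rte_vhost_poll_enqueue_completed(vid, queue_id, cpl,
			MAX_PKT_BURST, dma_id, 0);
	rte_pktmbuf_free_bulk(cpl, n_cpl);

	n_enq = rte_vhost_submit_enqueue_burst(vid, queue_id, pkts,
			nb_pkts, dma_id, 0);
	/* pkts[n_enq..nb_pkts-1] were not accepted and may be retried. */
	return n_enq;
}
```

Note the ordering: because completions are drained inside the same per-burst step (as complete_async_pkts() does in examples/vhost/main.c), no separate completion thread is needed.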
diff --git a/examples/vhost/Makefile b/examples/vhost/Makefile index 587ea2ab47..975a5dfe40 100644 --- a/examples/vhost/Makefile +++ b/examples/vhost/Makefile @@ -5,7 +5,7 @@ APP = vhost-switch # all source are stored in SRCS-y -SRCS-y := main.c virtio_net.c ioat.c +SRCS-y := main.c virtio_net.c PKGCONF ?= pkg-config diff --git a/examples/vhost/ioat.c b/examples/vhost/ioat.c deleted file mode 100644 index 9aeeb12fd9..0000000000 --- a/examples/vhost/ioat.c +++ /dev/null @@ -1,218 +0,0 @@ -/* SPDX-License-Identifier: BSD-3-Clause - * Copyright(c) 2010-2020 Intel Corporation - */ - -#include <sys/uio.h> -#ifdef RTE_RAW_IOAT -#include <rte_rawdev.h> -#include <rte_ioat_rawdev.h> - -#include "ioat.h" -#include "main.h" - -struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; - -struct packet_tracker { - unsigned short size_track[MAX_ENQUEUED_SIZE]; - unsigned short next_read; - unsigned short next_write; - unsigned short last_remain; - unsigned short ioat_space; -}; - -struct packet_tracker cb_tracker[MAX_VHOST_DEVICE]; - -int -open_ioat(const char *value) -{ - struct dma_for_vhost *dma_info = dma_bind; - char *input = strndup(value, strlen(value) + 1); - char *addrs = input; - char *ptrs[2]; - char *start, *end, *substr; - int64_t vid, vring_id; - struct rte_ioat_rawdev_config config; - struct rte_rawdev_info info = { .dev_private = &config }; - char name[32]; - int dev_id; - int ret = 0; - uint16_t i = 0; - char *dma_arg[MAX_VHOST_DEVICE]; - int args_nr; - - while (isblank(*addrs)) - addrs++; - if (*addrs == '\0') { - ret = -1; - goto out; - } - - /* process DMA devices within bracket. 
*/ - addrs++; - substr = strtok(addrs, ";]"); - if (!substr) { - ret = -1; - goto out; - } - args_nr = rte_strsplit(substr, strlen(substr), - dma_arg, MAX_VHOST_DEVICE, ','); - if (args_nr <= 0) { - ret = -1; - goto out; - } - while (i < args_nr) { - char *arg_temp = dma_arg[i]; - uint8_t sub_nr; - sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); - if (sub_nr != 2) { - ret = -1; - goto out; - } - - start = strstr(ptrs[0], "txd"); - if (start == NULL) { - ret = -1; - goto out; - } - - start += 3; - vid = strtol(start, &end, 0); - if (end == start) { - ret = -1; - goto out; - } - - vring_id = 0 + VIRTIO_RXQ; - if (rte_pci_addr_parse(ptrs[1], - &(dma_info + vid)->dmas[vring_id].addr) < 0) { - ret = -1; - goto out; - } - - rte_pci_device_name(&(dma_info + vid)->dmas[vring_id].addr, - name, sizeof(name)); - dev_id = rte_rawdev_get_dev_id(name); - if (dev_id == (uint16_t)(-ENODEV) || - dev_id == (uint16_t)(-EINVAL)) { - ret = -1; - goto out; - } - - if (rte_rawdev_info_get(dev_id, &info, sizeof(config)) < 0 || - strstr(info.driver_name, "ioat") == NULL) { - ret = -1; - goto out; - } - - (dma_info + vid)->dmas[vring_id].dev_id = dev_id; - (dma_info + vid)->dmas[vring_id].is_valid = true; - config.ring_size = IOAT_RING_SIZE; - config.hdls_disable = true; - if (rte_rawdev_configure(dev_id, &info, sizeof(config)) < 0) { - ret = -1; - goto out; - } - rte_rawdev_start(dev_id); - cb_tracker[dev_id].ioat_space = IOAT_RING_SIZE - 1; - dma_info->nr++; - i++; - } -out: - free(input); - return ret; -} - -int32_t -ioat_transfer_data_cb(int vid, uint16_t queue_id, - struct rte_vhost_iov_iter *iov_iter, - struct rte_vhost_async_status *opaque_data, uint16_t count) -{ - uint32_t i_iter; - uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2 + VIRTIO_RXQ].dev_id; - struct rte_vhost_iov_iter *iter = NULL; - unsigned long i_seg; - unsigned short mask = MAX_ENQUEUED_SIZE - 1; - unsigned short write = cb_tracker[dev_id].next_write; - - if (!opaque_data) { - for (i_iter = 0; 
i_iter < count; i_iter++) { - iter = iov_iter + i_iter; - i_seg = 0; - if (cb_tracker[dev_id].ioat_space < iter->nr_segs) - break; - while (i_seg < iter->nr_segs) { - rte_ioat_enqueue_copy(dev_id, - (uintptr_t)(iter->iov[i_seg].src_addr), - (uintptr_t)(iter->iov[i_seg].dst_addr), - iter->iov[i_seg].len, - 0, - 0); - i_seg++; - } - write &= mask; - cb_tracker[dev_id].size_track[write] = iter->nr_segs; - cb_tracker[dev_id].ioat_space -= iter->nr_segs; - write++; - } - } else { - /* Opaque data is not supported */ - return -1; - } - /* ring the doorbell */ - rte_ioat_perform_ops(dev_id); - cb_tracker[dev_id].next_write = write; - return i_iter; -} - -int32_t -ioat_check_completed_copies_cb(int vid, uint16_t queue_id, - struct rte_vhost_async_status *opaque_data, - uint16_t max_packets) -{ - if (!opaque_data) { - uintptr_t dump[255]; - int n_seg; - unsigned short read, write; - unsigned short nb_packet = 0; - unsigned short mask = MAX_ENQUEUED_SIZE - 1; - unsigned short i; - - uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2 - + VIRTIO_RXQ].dev_id; - n_seg = rte_ioat_completed_ops(dev_id, 255, NULL, NULL, dump, dump); - if (n_seg < 0) { - RTE_LOG(ERR, - VHOST_DATA, - "fail to poll completed buf on IOAT device %u", - dev_id); - return 0; - } - if (n_seg == 0) - return 0; - - cb_tracker[dev_id].ioat_space += n_seg; - n_seg += cb_tracker[dev_id].last_remain; - - read = cb_tracker[dev_id].next_read; - write = cb_tracker[dev_id].next_write; - for (i = 0; i < max_packets; i++) { - read &= mask; - if (read == write) - break; - if (n_seg >= cb_tracker[dev_id].size_track[read]) { - n_seg -= cb_tracker[dev_id].size_track[read]; - read++; - nb_packet++; - } else { - break; - } - } - cb_tracker[dev_id].next_read = read; - cb_tracker[dev_id].last_remain = n_seg; - return nb_packet; - } - /* Opaque data is not supported */ - return -1; -} - -#endif /* RTE_RAW_IOAT */ diff --git a/examples/vhost/ioat.h b/examples/vhost/ioat.h deleted file mode 100644 index d9bf717e8d..0000000000 --- 
a/examples/vhost/ioat.h +++ /dev/null @@ -1,63 +0,0 @@ -/* SPDX-License-Identifier: BSD-3-Clause - * Copyright(c) 2010-2020 Intel Corporation - */ - -#ifndef _IOAT_H_ -#define _IOAT_H_ - -#include <rte_vhost.h> -#include <rte_pci.h> -#include <rte_vhost_async.h> - -#define MAX_VHOST_DEVICE 1024 -#define IOAT_RING_SIZE 4096 -#define MAX_ENQUEUED_SIZE 4096 - -struct dma_info { - struct rte_pci_addr addr; - uint16_t dev_id; - bool is_valid; -}; - -struct dma_for_vhost { - struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; - uint16_t nr; -}; - -#ifdef RTE_RAW_IOAT -int open_ioat(const char *value); - -int32_t -ioat_transfer_data_cb(int vid, uint16_t queue_id, - struct rte_vhost_iov_iter *iov_iter, - struct rte_vhost_async_status *opaque_data, uint16_t count); - -int32_t -ioat_check_completed_copies_cb(int vid, uint16_t queue_id, - struct rte_vhost_async_status *opaque_data, - uint16_t max_packets); -#else -static int open_ioat(const char *value __rte_unused) -{ - return -1; -} - -static int32_t -ioat_transfer_data_cb(int vid __rte_unused, uint16_t queue_id __rte_unused, - struct rte_vhost_iov_iter *iov_iter __rte_unused, - struct rte_vhost_async_status *opaque_data __rte_unused, - uint16_t count __rte_unused) -{ - return -1; -} - -static int32_t -ioat_check_completed_copies_cb(int vid __rte_unused, - uint16_t queue_id __rte_unused, - struct rte_vhost_async_status *opaque_data __rte_unused, - uint16_t max_packets __rte_unused) -{ - return -1; -} -#endif -#endif /* _IOAT_H_ */ diff --git a/examples/vhost/main.c b/examples/vhost/main.c index 590a77c723..5cc21de594 100644 --- a/examples/vhost/main.c +++ b/examples/vhost/main.c @@ -24,8 +24,9 @@ #include <rte_ip.h> #include <rte_tcp.h> #include <rte_pause.h> +#include <rte_dmadev.h> +#include <rte_vhost_async.h> -#include "ioat.h" #include "main.h" #ifndef MAX_QUEUES @@ -56,6 +57,13 @@ #define RTE_TEST_TX_DESC_DEFAULT 512 #define INVALID_PORT_ID 0xFF +#define INVALID_DMA_ID -1 + +#define DMA_RING_SIZE 4096 + +struct 
dma_for_vhost dma_bind[RTE_MAX_VHOST_DEVICE]; +int16_t dmas_id[RTE_DMADEV_DEFAULT_MAX]; +static int dma_count; /* mask of enabled ports */ static uint32_t enabled_port_mask = 0; @@ -94,10 +102,6 @@ static int client_mode; static int builtin_net_driver; -static int async_vhost_driver; - -static char *dma_type; - /* Specify timeout (in useconds) between retries on RX. */ static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US; /* Specify the number of retries on RX. */ @@ -191,18 +195,150 @@ struct mbuf_table lcore_tx_queue[RTE_MAX_LCORE]; * Every data core maintains a TX buffer for every vhost device, * which is used for batch pkts enqueue for higher performance. */ -struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * MAX_VHOST_DEVICE]; +struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * RTE_MAX_VHOST_DEVICE]; #define MBUF_TABLE_DRAIN_TSC ((rte_get_tsc_hz() + US_PER_S - 1) \ / US_PER_S * BURST_TX_DRAIN_US) +static inline bool +is_dma_configured(int16_t dev_id) +{ + int i; + + for (i = 0; i < dma_count; i++) + if (dmas_id[i] == dev_id) + return true; + return false; +} + static inline int open_dma(const char *value) { - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) - return open_ioat(value); + struct dma_for_vhost *dma_info = dma_bind; + char *input = strndup(value, strlen(value) + 1); + char *addrs = input; + char *ptrs[2]; + char *start, *end, *substr; + int64_t vid; + + struct rte_dma_info info; + struct rte_dma_conf dev_config = { .nb_vchans = 1 }; + struct rte_dma_vchan_conf qconf = { + .direction = RTE_DMA_DIR_MEM_TO_MEM, + .nb_desc = DMA_RING_SIZE + }; + + int dev_id; + int ret = 0; + uint16_t i = 0; + char *dma_arg[RTE_MAX_VHOST_DEVICE]; + int args_nr; + + while (isblank(*addrs)) + addrs++; + if (*addrs == '\0') { + ret = -1; + goto out; + } + + /* process DMA devices within bracket. 
*/ + addrs++; + substr = strtok(addrs, ";]"); + if (!substr) { + ret = -1; + goto out; + } + + args_nr = rte_strsplit(substr, strlen(substr), dma_arg, RTE_MAX_VHOST_DEVICE, ','); + if (args_nr <= 0) { + ret = -1; + goto out; + } + + while (i < args_nr) { + char *arg_temp = dma_arg[i]; + uint8_t sub_nr; + + sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); + if (sub_nr != 2) { + ret = -1; + goto out; + } + + start = strstr(ptrs[0], "txd"); + if (start == NULL) { + ret = -1; + goto out; + } + + start += 3; + vid = strtol(start, &end, 0); + if (end == start) { + ret = -1; + goto out; + } + + dev_id = rte_dma_get_dev_id_by_name(ptrs[1]); + if (dev_id < 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to find DMA %s.\n", ptrs[1]); + ret = -1; + goto out; + } + + /* DMA device is already configured, so skip */ + if (is_dma_configured(dev_id)) + goto done; + + if (rte_dma_info_get(dev_id, &info) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Error with rte_dma_info_get()\n"); + ret = -1; + goto out; + } + + if (info.max_vchans < 1) { + RTE_LOG(ERR, VHOST_CONFIG, "No channels available on device %d\n", dev_id); + ret = -1; + goto out; + } - return -1; + if (rte_dma_configure(dev_id, &dev_config) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to configure DMA %d.\n", dev_id); + ret = -1; + goto out; + } + + /* Check the max desc supported by DMA device */ + rte_dma_info_get(dev_id, &info); + if (info.nb_vchans != 1) { + RTE_LOG(ERR, VHOST_CONFIG, "No configured queues reported by DMA %d.\n", + dev_id); + ret = -1; + goto out; + } + + qconf.nb_desc = RTE_MIN(DMA_RING_SIZE, info.max_desc); + + if (rte_dma_vchan_setup(dev_id, 0, &qconf) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to set up DMA %d.\n", dev_id); + ret = -1; + goto out; + } + + if (rte_dma_start(dev_id) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to start DMA %u.\n", dev_id); + ret = -1; + goto out; + } + + dmas_id[dma_count++] = dev_id; + +done: + (dma_info + vid)->dmas[VIRTIO_RXQ].dev_id = dev_id; + i++; + } +out: + 
free(input); + return ret; } /* @@ -500,8 +636,6 @@ enum { OPT_CLIENT_NUM, #define OPT_BUILTIN_NET_DRIVER "builtin-net-driver" OPT_BUILTIN_NET_DRIVER_NUM, -#define OPT_DMA_TYPE "dma-type" - OPT_DMA_TYPE_NUM, #define OPT_DMAS "dmas" OPT_DMAS_NUM, }; @@ -539,8 +673,6 @@ us_vhost_parse_args(int argc, char **argv) NULL, OPT_CLIENT_NUM}, {OPT_BUILTIN_NET_DRIVER, no_argument, NULL, OPT_BUILTIN_NET_DRIVER_NUM}, - {OPT_DMA_TYPE, required_argument, - NULL, OPT_DMA_TYPE_NUM}, {OPT_DMAS, required_argument, NULL, OPT_DMAS_NUM}, {NULL, 0, 0, 0}, @@ -661,10 +793,6 @@ us_vhost_parse_args(int argc, char **argv) } break; - case OPT_DMA_TYPE_NUM: - dma_type = optarg; - break; - case OPT_DMAS_NUM: if (open_dma(optarg) == -1) { RTE_LOG(INFO, VHOST_CONFIG, @@ -672,7 +800,6 @@ us_vhost_parse_args(int argc, char **argv) us_vhost_usage(prgname); return -1; } - async_vhost_driver = 1; break; case OPT_CLIENT_NUM: @@ -841,9 +968,10 @@ complete_async_pkts(struct vhost_dev *vdev) { struct rte_mbuf *p_cpl[MAX_PKT_BURST]; uint16_t complete_count; + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; complete_count = rte_vhost_poll_enqueue_completed(vdev->vid, - VIRTIO_RXQ, p_cpl, MAX_PKT_BURST); + VIRTIO_RXQ, p_cpl, MAX_PKT_BURST, dma_id, 0); if (complete_count) { free_pkts(p_cpl, complete_count); __atomic_sub_fetch(&vdev->pkts_inflight, complete_count, __ATOMIC_SEQ_CST); @@ -877,17 +1005,18 @@ static __rte_always_inline void drain_vhost(struct vhost_dev *vdev) { uint16_t ret; - uint32_t buff_idx = rte_lcore_id() * MAX_VHOST_DEVICE + vdev->vid; + uint32_t buff_idx = rte_lcore_id() * RTE_MAX_VHOST_DEVICE + vdev->vid; uint16_t nr_xmit = vhost_txbuff[buff_idx]->len; struct rte_mbuf **m = vhost_txbuff[buff_idx]->m_table; if (builtin_net_driver) { ret = vs_enqueue_pkts(vdev, VIRTIO_RXQ, m, nr_xmit); - } else if (async_vhost_driver) { + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { uint16_t enqueue_fail = 0; + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; 
complete_async_pkts(vdev); - ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit); + ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit, dma_id, 0); __atomic_add_fetch(&vdev->pkts_inflight, ret, __ATOMIC_SEQ_CST); enqueue_fail = nr_xmit - ret; @@ -905,7 +1034,7 @@ drain_vhost(struct vhost_dev *vdev) __ATOMIC_SEQ_CST); } - if (!async_vhost_driver) + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) free_pkts(m, nr_xmit); } @@ -921,7 +1050,7 @@ drain_vhost_table(void) if (unlikely(vdev->remove == 1)) continue; - vhost_txq = vhost_txbuff[lcore_id * MAX_VHOST_DEVICE + vhost_txq = vhost_txbuff[lcore_id * RTE_MAX_VHOST_DEVICE + vdev->vid]; cur_tsc = rte_rdtsc(); @@ -970,7 +1099,7 @@ virtio_tx_local(struct vhost_dev *vdev, struct rte_mbuf *m) return 0; } - vhost_txq = vhost_txbuff[lcore_id * MAX_VHOST_DEVICE + dst_vdev->vid]; + vhost_txq = vhost_txbuff[lcore_id * RTE_MAX_VHOST_DEVICE + dst_vdev->vid]; vhost_txq->m_table[vhost_txq->len++] = m; if (enable_stats) { @@ -1211,12 +1340,13 @@ drain_eth_rx(struct vhost_dev *vdev) if (builtin_net_driver) { enqueue_count = vs_enqueue_pkts(vdev, VIRTIO_RXQ, pkts, rx_count); - } else if (async_vhost_driver) { + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { uint16_t enqueue_fail = 0; + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; complete_async_pkts(vdev); enqueue_count = rte_vhost_submit_enqueue_burst(vdev->vid, - VIRTIO_RXQ, pkts, rx_count); + VIRTIO_RXQ, pkts, rx_count, dma_id, 0); __atomic_add_fetch(&vdev->pkts_inflight, enqueue_count, __ATOMIC_SEQ_CST); enqueue_fail = rx_count - enqueue_count; @@ -1235,7 +1365,7 @@ drain_eth_rx(struct vhost_dev *vdev) __ATOMIC_SEQ_CST); } - if (!async_vhost_driver) + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) free_pkts(pkts, rx_count); } @@ -1357,7 +1487,7 @@ destroy_device(int vid) } for (i = 0; i < RTE_MAX_LCORE; i++) - rte_free(vhost_txbuff[i * MAX_VHOST_DEVICE + vid]); + rte_free(vhost_txbuff[i * 
RTE_MAX_VHOST_DEVICE + vid]); if (builtin_net_driver) vs_vhost_net_remove(vdev); @@ -1387,18 +1517,20 @@ destroy_device(int vid) "(%d) device has been removed from data core\n", vdev->vid); - if (async_vhost_driver) { + if (dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled) { uint16_t n_pkt = 0; + int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; struct rte_mbuf *m_cpl[vdev->pkts_inflight]; while (vdev->pkts_inflight) { n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, VIRTIO_RXQ, - m_cpl, vdev->pkts_inflight); + m_cpl, vdev->pkts_inflight, dma_id, 0); free_pkts(m_cpl, n_pkt); __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, __ATOMIC_SEQ_CST); } rte_vhost_async_channel_unregister(vid, VIRTIO_RXQ); + dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = false; } rte_free(vdev); @@ -1425,12 +1557,12 @@ new_device(int vid) vdev->vid = vid; for (i = 0; i < RTE_MAX_LCORE; i++) { - vhost_txbuff[i * MAX_VHOST_DEVICE + vid] + vhost_txbuff[i * RTE_MAX_VHOST_DEVICE + vid] = rte_zmalloc("vhost bufftable", sizeof(struct vhost_bufftable), RTE_CACHE_LINE_SIZE); - if (vhost_txbuff[i * MAX_VHOST_DEVICE + vid] == NULL) { + if (vhost_txbuff[i * RTE_MAX_VHOST_DEVICE + vid] == NULL) { RTE_LOG(INFO, VHOST_DATA, "(%d) couldn't allocate memory for vhost TX\n", vid); return -1; @@ -1468,20 +1600,13 @@ new_device(int vid) "(%d) device has been added to data core %d\n", vid, vdev->coreid); - if (async_vhost_driver) { - struct rte_vhost_async_config config = {0}; - struct rte_vhost_async_channel_ops channel_ops; - - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) { - channel_ops.transfer_data = ioat_transfer_data_cb; - channel_ops.check_completed_copies = - ioat_check_completed_copies_cb; - - config.features = RTE_VHOST_ASYNC_INORDER; + if (dma_bind[vid].dmas[VIRTIO_RXQ].dev_id != INVALID_DMA_ID) { + int ret; - return rte_vhost_async_channel_register(vid, VIRTIO_RXQ, - config, &channel_ops); - } + ret = rte_vhost_async_channel_register(vid, VIRTIO_RXQ); + if (ret == 0) + 
dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = true; + return ret; } return 0; @@ -1502,14 +1627,15 @@ vring_state_changed(int vid, uint16_t queue_id, int enable) if (queue_id != VIRTIO_RXQ) return 0; - if (async_vhost_driver) { + if (dma_bind[vid].dmas[queue_id].async_enabled) { if (!enable) { uint16_t n_pkt = 0; + int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; struct rte_mbuf *m_cpl[vdev->pkts_inflight]; while (vdev->pkts_inflight) { n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, queue_id, - m_cpl, vdev->pkts_inflight); + m_cpl, vdev->pkts_inflight, dma_id, 0); free_pkts(m_cpl, n_pkt); __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, __ATOMIC_SEQ_CST); } @@ -1657,6 +1783,24 @@ create_mbuf_pool(uint16_t nr_port, uint32_t nr_switch_core, uint32_t mbuf_size, rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n"); } +static void +reset_dma(void) +{ + int i; + + for (i = 0; i < RTE_MAX_VHOST_DEVICE; i++) { + int j; + + for (j = 0; j < RTE_MAX_QUEUES_PER_PORT * 2; j++) { + dma_bind[i].dmas[j].dev_id = INVALID_DMA_ID; + dma_bind[i].dmas[j].async_enabled = false; + } + } + + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) + dmas_id[i] = INVALID_DMA_ID; +} + /* * Main function, does initialisation and calls the per-lcore functions. */ @@ -1679,6 +1823,9 @@ main(int argc, char *argv[]) argc -= ret; argv += ret; + /* initialize dma structures */ + reset_dma(); + /* parse app arguments */ ret = us_vhost_parse_args(argc, argv); if (ret < 0) @@ -1754,11 +1901,18 @@ main(int argc, char *argv[]) if (client_mode) flags |= RTE_VHOST_USER_CLIENT; + for (i = 0; i < dma_count; i++) { + if (rte_vhost_async_dma_configure(dmas_id[i], 0) < 0) { + RTE_LOG(ERR, VHOST_PORT, "Failed to configure DMA in vhost.\n"); + rte_exit(EXIT_FAILURE, "Cannot use given DMA device\n"); + } + } + /* Register vhost user driver to handle vhost messages. 
*/ for (i = 0; i < nb_sockets; i++) { char *file = socket_files + i * PATH_MAX; - if (async_vhost_driver) + if (dma_count) flags = flags | RTE_VHOST_USER_ASYNC_COPY; ret = rte_vhost_driver_register(file, flags); diff --git a/examples/vhost/main.h b/examples/vhost/main.h index e7b1ac60a6..b4a453e77e 100644 --- a/examples/vhost/main.h +++ b/examples/vhost/main.h @@ -8,6 +8,7 @@ #include <sys/queue.h> #include <rte_ether.h> +#include <rte_pci.h> /* Macros for printing using RTE_LOG */ #define RTE_LOGTYPE_VHOST_CONFIG RTE_LOGTYPE_USER1 @@ -79,6 +80,16 @@ struct lcore_info { struct vhost_dev_tailq_list vdev_list; }; +struct dma_info { + struct rte_pci_addr addr; + int16_t dev_id; + bool async_enabled; +}; + +struct dma_for_vhost { + struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; +}; + /* we implement non-extra virtio net features */ #define VIRTIO_NET_FEATURES 0 diff --git a/examples/vhost/meson.build b/examples/vhost/meson.build index 3efd5e6540..87a637f83f 100644 --- a/examples/vhost/meson.build +++ b/examples/vhost/meson.build @@ -12,13 +12,9 @@ if not is_linux endif deps += 'vhost' +deps += 'dmadev' allow_experimental_apis = true sources = files( 'main.c', 'virtio_net.c', ) - -if dpdk_conf.has('RTE_RAW_IOAT') - deps += 'raw_ioat' - sources += files('ioat.c') -endif diff --git a/lib/vhost/meson.build b/lib/vhost/meson.build index cdb37a4814..bc7272053b 100644 --- a/lib/vhost/meson.build +++ b/lib/vhost/meson.build @@ -36,4 +36,4 @@ headers = files( driver_sdk_headers = files( 'vdpa_driver.h', ) -deps += ['ethdev', 'cryptodev', 'hash', 'pci'] +deps += ['ethdev', 'cryptodev', 'hash', 'pci', 'dmadev'] diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h index b454c05868..15c37dd26e 100644 --- a/lib/vhost/rte_vhost.h +++ b/lib/vhost/rte_vhost.h @@ -113,6 +113,8 @@ extern "C" { #define VHOST_USER_F_PROTOCOL_FEATURES 30 #endif +#define RTE_MAX_VHOST_DEVICE 1024 + struct rte_vdpa_device; /** diff --git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h 
index a87ea6ba37..3424d2681a 100644 --- a/lib/vhost/rte_vhost_async.h +++ b/lib/vhost/rte_vhost_async.h @@ -5,94 +5,6 @@ #ifndef _RTE_VHOST_ASYNC_H_ #define _RTE_VHOST_ASYNC_H_ -#include "rte_vhost.h" - -/** - * iovec - */ -struct rte_vhost_iovec { - void *src_addr; - void *dst_addr; - size_t len; -}; - -/** - * iovec iterator - */ -struct rte_vhost_iov_iter { - /** pointer to the iovec array */ - struct rte_vhost_iovec *iov; - /** number of iovec in this iterator */ - unsigned long nr_segs; -}; - -/** - * dma transfer status - */ -struct rte_vhost_async_status { - /** An array of application specific data for source memory */ - uintptr_t *src_opaque_data; - /** An array of application specific data for destination memory */ - uintptr_t *dst_opaque_data; -}; - -/** - * dma operation callbacks to be implemented by applications - */ -struct rte_vhost_async_channel_ops { - /** - * instruct async engines to perform copies for a batch of packets - * - * @param vid - * id of vhost device to perform data copies - * @param queue_id - * queue id to perform data copies - * @param iov_iter - * an array of IOV iterators - * @param opaque_data - * opaque data pair sending to DMA engine - * @param count - * number of elements in the "descs" array - * @return - * number of IOV iterators processed, negative value means error - */ - int32_t (*transfer_data)(int vid, uint16_t queue_id, - struct rte_vhost_iov_iter *iov_iter, - struct rte_vhost_async_status *opaque_data, - uint16_t count); - /** - * check copy-completed packets from the async engine - * @param vid - * id of vhost device to check copy completion - * @param queue_id - * queue id to check copy completion - * @param opaque_data - * buffer to receive the opaque data pair from DMA engine - * @param max_packets - * max number of packets could be completed - * @return - * number of async descs completed, negative value means error - */ - int32_t (*check_completed_copies)(int vid, uint16_t queue_id, - struct 
rte_vhost_async_status *opaque_data, - uint16_t max_packets); -}; - -/** - * async channel features - */ -enum { - RTE_VHOST_ASYNC_INORDER = 1U << 0, -}; - -/** - * async channel configuration - */ -struct rte_vhost_async_config { - uint32_t features; - uint32_t rsvd[2]; -}; - /** * Register an async channel for a vhost queue * @@ -100,17 +12,11 @@ struct rte_vhost_async_config { * vhost device id async channel to be attached to * @param queue_id * vhost queue id async channel to be attached to - * @param config - * Async channel configuration structure - * @param ops - * Async channel operation callbacks * @return * 0 on success, -1 on failures */ __rte_experimental -int rte_vhost_async_channel_register(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops); +int rte_vhost_async_channel_register(int vid, uint16_t queue_id); /** * Unregister an async channel for a vhost queue @@ -136,17 +42,11 @@ int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id); * vhost device id async channel to be attached to * @param queue_id * vhost queue id async channel to be attached to - * @param config - * Async channel configuration - * @param ops - * Async channel operation callbacks * @return * 0 on success, -1 on failures */ __rte_experimental -int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops); +int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id); /** * Unregister an async channel for a vhost queue without performing any @@ -179,12 +79,17 @@ int rte_vhost_async_channel_unregister_thread_unsafe(int vid, * array of packets to be enqueued * @param count * packets num to be enqueued + * @param dma_id + * the identifier of the DMA device + * @param vchan_id + * the identifier of virtual DMA channel * @return * num of packets enqueued */ __rte_experimental uint16_t 
rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count); + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id); /** * This function checks async completion status for a specific vhost @@ -199,12 +104,17 @@ uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, * blank array to get return packet pointer * @param count * size of the packet array + * @param dma_id + * the identifier of the DMA device + * @param vchan_id + * the identifier of virtual DMA channel * @return * num of packets returned */ __rte_experimental uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count); + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id); /** * This function returns the amount of in-flight packets for the vhost @@ -235,11 +145,36 @@ int rte_vhost_async_get_inflight(int vid, uint16_t queue_id); * Blank array to get return packet pointer * @param count * Size of the packet array + * @param dma_id + * the identifier of the DMA device + * @param vchan_id + * the identifier of virtual DMA channel * @return * Number of packets returned */ __rte_experimental uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count); + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id); +/** + * The DMA vChannels used in the asynchronous data path must be configured + * first. So this function needs to be called before enabling DMA + * acceleration for a vring. If this function fails, the given DMA vChannel + * cannot be used in the asynchronous data path. + * + * DMA devices used in the data path must be among those given to this + * function. Users are free to use these DMA devices in non-vhost + * scenarios as well, provided it is guaranteed that no vhost copies are + * offloaded to them at the same time.
+ * + * @param dma_id + * the identifier of DMA device + * @param vchan_id + * the identifier of virtual DMA channel + * @return + * 0 on success, and -1 on failure + */ +__rte_experimental +int rte_vhost_async_dma_configure(int16_t dma_id, uint16_t vchan_id); #endif /* _RTE_VHOST_ASYNC_H_ */ diff --git a/lib/vhost/version.map b/lib/vhost/version.map index a7ef7f1976..1202ba9c1a 100644 --- a/lib/vhost/version.map +++ b/lib/vhost/version.map @@ -84,6 +84,9 @@ EXPERIMENTAL { # added in 21.11 rte_vhost_get_monitor_addr; + + # added in 22.03 + rte_vhost_async_dma_configure; }; INTERNAL { diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c index f59ca6c157..6261487f3d 100644 --- a/lib/vhost/vhost.c +++ b/lib/vhost/vhost.c @@ -25,7 +25,7 @@ #include "vhost.h" #include "vhost_user.h" -struct virtio_net *vhost_devices[MAX_VHOST_DEVICE]; +struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE]; pthread_mutex_t vhost_dev_lock = PTHREAD_MUTEX_INITIALIZER; /* Called with iotlb_lock read-locked */ @@ -343,6 +343,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq) return; rte_free(vq->async->pkts_info); + rte_free(vq->async->pkts_cmpl_flag); rte_free(vq->async->buffers_packed); vq->async->buffers_packed = NULL; @@ -665,12 +666,12 @@ vhost_new_device(void) int i; pthread_mutex_lock(&vhost_dev_lock); - for (i = 0; i < MAX_VHOST_DEVICE; i++) { + for (i = 0; i < RTE_MAX_VHOST_DEVICE; i++) { if (vhost_devices[i] == NULL) break; } - if (i == MAX_VHOST_DEVICE) { + if (i == RTE_MAX_VHOST_DEVICE) { VHOST_LOG_CONFIG(ERR, "failed to find a free slot for new device.\n"); pthread_mutex_unlock(&vhost_dev_lock); return -1; @@ -1621,8 +1622,7 @@ rte_vhost_extern_callback_register(int vid, } static __rte_always_inline int -async_channel_register(int vid, uint16_t queue_id, - struct rte_vhost_async_channel_ops *ops) +async_channel_register(int vid, uint16_t queue_id) { struct virtio_net *dev = get_device(vid); struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; @@ -1651,6 +1651,14 @@ 
async_channel_register(int vid, uint16_t queue_id, goto out_free_async; } + async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size * sizeof(bool), RTE_CACHE_LINE_SIZE, + node); + if (!async->pkts_cmpl_flag) { + VHOST_LOG_CONFIG(ERR, "(%s) failed to allocate async pkts_cmpl_flag (qid: %d)\n", + dev->ifname, queue_id); + goto out_free_async; + } + if (vq_is_packed(dev)) { async->buffers_packed = rte_malloc_socket(NULL, vq->size * sizeof(struct vring_used_elem_packed), @@ -1671,9 +1679,6 @@ async_channel_register(int vid, uint16_t queue_id, } } - async->ops.check_completed_copies = ops->check_completed_copies; - async->ops.transfer_data = ops->transfer_data; - vq->async = async; return 0; @@ -1686,15 +1691,13 @@ async_channel_register(int vid, uint16_t queue_id, } int -rte_vhost_async_channel_register(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops) +rte_vhost_async_channel_register(int vid, uint16_t queue_id) { struct vhost_virtqueue *vq; struct virtio_net *dev = get_device(vid); int ret; - if (dev == NULL || ops == NULL) + if (dev == NULL) return -1; if (queue_id >= VHOST_MAX_VRING) @@ -1705,33 +1708,20 @@ rte_vhost_async_channel_register(int vid, uint16_t queue_id, if (unlikely(vq == NULL || !dev->async_copy)) return -1; - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { - VHOST_LOG_CONFIG(ERR, - "(%s) async copy is not supported on non-inorder mode (qid: %d)\n", - dev->ifname, queue_id); - return -1; - } - - if (unlikely(ops->check_completed_copies == NULL || - ops->transfer_data == NULL)) - return -1; - rte_spinlock_lock(&vq->access_lock); - ret = async_channel_register(vid, queue_id, ops); + ret = async_channel_register(vid, queue_id); rte_spinlock_unlock(&vq->access_lock); return ret; } int -rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops) 
+rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id) { struct vhost_virtqueue *vq; struct virtio_net *dev = get_device(vid); - if (dev == NULL || ops == NULL) + if (dev == NULL) return -1; if (queue_id >= VHOST_MAX_VRING) @@ -1742,18 +1732,7 @@ rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, if (unlikely(vq == NULL || !dev->async_copy)) return -1; - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { - VHOST_LOG_CONFIG(ERR, - "(%s) async copy is not supported on non-inorder mode (qid: %d)\n", - dev->ifname, queue_id); - return -1; - } - - if (unlikely(ops->check_completed_copies == NULL || - ops->transfer_data == NULL)) - return -1; - - return async_channel_register(vid, queue_id, ops); + return async_channel_register(vid, queue_id); } int @@ -1832,6 +1811,69 @@ rte_vhost_async_channel_unregister_thread_unsafe(int vid, uint16_t queue_id) return 0; } +int +rte_vhost_async_dma_configure(int16_t dma_id, uint16_t vchan_id) +{ + struct rte_dma_info info; + void *pkts_cmpl_flag_addr; + uint16_t max_desc; + + if (!rte_dma_is_valid(dma_id)) { + VHOST_LOG_CONFIG(ERR, "DMA %d is not found. Cannot use it in vhost\n", dma_id); + return -1; + } + + rte_dma_info_get(dma_id, &info); + if (vchan_id >= info.max_vchans) { + VHOST_LOG_CONFIG(ERR, "Invalid vChannel ID. Cannot use DMA %d vChannel %u for " + "vhost\n", dma_id, vchan_id); + return -1; + } + + if (!dma_copy_track[dma_id].vchans) { + struct async_dma_vchan_info *vchans; + + vchans = rte_zmalloc(NULL, sizeof(struct async_dma_vchan_info) * info.max_vchans, + RTE_CACHE_LINE_SIZE); + if (vchans == NULL) { + VHOST_LOG_CONFIG(ERR, "Failed to allocate vchans, Cannot use DMA %d " + "vChannel %u for vhost.\n", dma_id, vchan_id); + return -1; + } + + dma_copy_track[dma_id].vchans = vchans; + } + + if (dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr) { + VHOST_LOG_CONFIG(INFO, "DMA %d vChannel %u has registered in vhost. 
Ignore\n", + dma_id, vchan_id); + return 0; + } + + max_desc = info.max_desc; + if (!rte_is_power_of_2(max_desc)) + max_desc = rte_align32pow2(max_desc); + + pkts_cmpl_flag_addr = rte_zmalloc(NULL, sizeof(bool *) * max_desc, RTE_CACHE_LINE_SIZE); + if (!pkts_cmpl_flag_addr) { + VHOST_LOG_CONFIG(ERR, "Failed to allocate pkts_cmpl_flag_addr for DMA %d " + "vChannel %u. Cannot use it for vhost\n", dma_id, vchan_id); + + if (dma_copy_track[dma_id].nr_vchans == 0) { + rte_free(dma_copy_track[dma_id].vchans); + dma_copy_track[dma_id].vchans = NULL; + } + return -1; + } + + dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr = pkts_cmpl_flag_addr; + dma_copy_track[dma_id].vchans[vchan_id].ring_size = max_desc; + dma_copy_track[dma_id].vchans[vchan_id].ring_mask = max_desc - 1; + dma_copy_track[dma_id].nr_vchans++; + + return 0; +} + int rte_vhost_async_get_inflight(int vid, uint16_t queue_id) { diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h index b3f0c1d07c..1c2ee29600 100644 --- a/lib/vhost/vhost.h +++ b/lib/vhost/vhost.h @@ -19,6 +19,7 @@ #include <rte_ether.h> #include <rte_rwlock.h> #include <rte_malloc.h> +#include <rte_dmadev.h> #include "rte_vhost.h" #include "rte_vdpa.h" @@ -50,6 +51,9 @@ #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) #define VHOST_MAX_ASYNC_VEC 2048 +#define VIRTIO_MAX_RX_PKTLEN 9728U +#define VHOST_DMA_MAX_COPY_COMPLETE ((VIRTIO_MAX_RX_PKTLEN / RTE_MBUF_DEFAULT_DATAROOM) \ + * MAX_PKT_BURST) #define PACKED_DESC_ENQUEUE_USED_FLAG(w) \ ((w) ? 
(VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \ @@ -119,6 +123,58 @@ struct vring_used_elem_packed { uint32_t count; }; +/** + * iovec + */ +struct vhost_iovec { + void *src_addr; + void *dst_addr; + size_t len; +}; + +/** + * iovec iterator + */ +struct vhost_iov_iter { + /** pointer to the iovec array */ + struct vhost_iovec *iov; + /** number of iovec in this iterator */ + unsigned long nr_segs; +}; + +struct async_dma_vchan_info { + /* circular array to track if packet copy completes */ + bool **pkts_cmpl_flag_addr; + + /* max elements in 'pkts_cmpl_flag_addr' */ + uint16_t ring_size; + /* ring index mask for 'pkts_cmpl_flag_addr' */ + uint16_t ring_mask; + + /** + * DMA virtual channel lock. Although DMA virtual channels can be + * bound to data plane threads, the vhost control plane thread can + * call data plane functions too, thus causing DMA device contention. + * + * For example, in the VM exit case, the vhost control plane thread + * needs to clear in-flight packets before disabling the vring, while + * another data plane thread could be enqueuing packets to the same + * vring with the same DMA virtual channel. As dmadev PMD functions + * are lock-free, the control plane and data plane threads could + * operate on the same DMA virtual channel at the same time.
+ */ + rte_spinlock_t dma_lock; +}; + +struct async_dma_info { + struct async_dma_vchan_info *vchans; + /* number of registered virtual channels */ + uint16_t nr_vchans; +}; + +extern struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; + /** * inflight async packet information */ @@ -129,16 +185,32 @@ struct async_inflight_info { }; struct vhost_async { - /* operation callbacks for DMA */ - struct rte_vhost_async_channel_ops ops; - - struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; - struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; + struct vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; + struct vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; uint16_t iter_idx; uint16_t iovec_idx; /* data transfer status */ struct async_inflight_info *pkts_info; + /** + * Packet reorder array. "true" indicates that the DMA device + * has completed all copies for the packet. + * + * Note that this array could be written by multiple threads + * simultaneously. For example, suppose thread0 and thread1 + * receive packets from the NIC and enqueue them to vring0 and + * vring1 with their own DMA devices DMA0 and DMA1. It's + * possible for thread0 to get completed copies belonging to + * vring1 from DMA0, while thread0 is calling rte_vhost_poll + * _enqueue_completed() for vring0 and thread1 is calling + * rte_vhost_submit_enqueue_burst() for vring1. In this case, + * vq->access_lock cannot protect pkts_cmpl_flag of vring1. + * + * However, since offloading is done on a per-packet basis, + * each packet flag will only be written by one thread, and a + * single-byte write is atomic, so no lock for pkts_cmpl_flag + * is needed.
+ */ + bool *pkts_cmpl_flag; uint16_t pkts_idx; uint16_t pkts_inflight_n; union { @@ -568,8 +640,7 @@ extern int vhost_data_log_level; #define PRINT_PACKET(device, addr, size, header) do {} while (0) #endif -#define MAX_VHOST_DEVICE 1024 -extern struct virtio_net *vhost_devices[MAX_VHOST_DEVICE]; +extern struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE]; #define VHOST_BINARY_SEARCH_THRESH 256 diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c index f19713137c..cc4e2504ac 100644 --- a/lib/vhost/virtio_net.c +++ b/lib/vhost/virtio_net.c @@ -11,6 +11,7 @@ #include <rte_net.h> #include <rte_ether.h> #include <rte_ip.h> +#include <rte_dmadev.h> #include <rte_vhost.h> #include <rte_tcp.h> #include <rte_udp.h> @@ -25,6 +26,9 @@ #define MAX_BATCH_LEN 256 +/* DMA device copy operation tracking array. */ +struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; + static __rte_always_inline bool rxvq_is_mergeable(struct virtio_net *dev) { @@ -43,6 +47,136 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring) return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring; } +static __rte_always_inline int64_t +vhost_async_dma_transfer_one(struct virtio_net *dev, struct vhost_virtqueue *vq, + int16_t dma_id, uint16_t vchan_id, uint16_t flag_idx, + struct vhost_iov_iter *pkt) +{ + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan_id]; + uint16_t ring_mask = dma_info->ring_mask; + static bool vhost_async_dma_copy_log; + + + struct vhost_iovec *iov = pkt->iov; + int copy_idx = 0; + uint32_t nr_segs = pkt->nr_segs; + uint16_t i; + + if (rte_dma_burst_capacity(dma_id, vchan_id) < nr_segs) + return -1; + + for (i = 0; i < nr_segs; i++) { + copy_idx = rte_dma_copy(dma_id, vchan_id, (rte_iova_t)iov[i].src_addr, + (rte_iova_t)iov[i].dst_addr, iov[i].len, RTE_DMA_OP_FLAG_LLC); + /** + * Since all memory is pinned and DMA vChannel + * ring has enough space, failure should be a + * rare case. 
If failure happens, it means DMA + * device encounters serious errors; in this + * case, please stop async data-path and check + * what has happened to DMA device. + */ + if (unlikely(copy_idx < 0)) { + if (!vhost_async_dma_copy_log) { + VHOST_LOG_DATA(ERR, "(%s) DMA %d vChannel %u reports error in " + "rte_dma_copy(). Please stop async data-path and " + "debug what has happened to DMA device\n", + dev->ifname, dma_id, vchan_id); + vhost_async_dma_copy_log = true; + } + return -1; + } + } + + /** + * Only store packet completion flag address in the last copy's + * slot, and other slots are set to NULL. + */ + dma_info->pkts_cmpl_flag_addr[copy_idx & ring_mask] = &vq->async->pkts_cmpl_flag[flag_idx]; + + return nr_segs; +} + +static __rte_always_inline uint16_t +vhost_async_dma_transfer(struct virtio_net *dev, struct vhost_virtqueue *vq, + int16_t dma_id, uint16_t vchan_id, uint16_t head_idx, + struct vhost_iov_iter *pkts, uint16_t nr_pkts) +{ + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan_id]; + int64_t ret, nr_copies = 0; + uint16_t pkt_idx; + + rte_spinlock_lock(&dma_info->dma_lock); + + for (pkt_idx = 0; pkt_idx < nr_pkts; pkt_idx++) { + ret = vhost_async_dma_transfer_one(dev, vq, dma_id, vchan_id, head_idx, &pkts[pkt_idx]); + if (unlikely(ret < 0)) + break; + + nr_copies += ret; + head_idx++; + if (head_idx >= vq->size) + head_idx -= vq->size; + } + + if (likely(nr_copies > 0)) + rte_dma_submit(dma_id, vchan_id); + + rte_spinlock_unlock(&dma_info->dma_lock); + + return pkt_idx; +} + +static __rte_always_inline uint16_t +vhost_async_dma_check_completed(struct virtio_net *dev, int16_t dma_id, uint16_t vchan_id, + uint16_t max_pkts) +{ + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan_id]; + uint16_t ring_mask = dma_info->ring_mask; + uint16_t last_idx = 0; + uint16_t nr_copies; + uint16_t copy_idx; + uint16_t i; + bool has_error = false; + static bool vhost_async_dma_complete_log; + + 
rte_spinlock_lock(&dma_info->dma_lock); + + /** + * Print error log for debugging, if DMA reports error during + * DMA transfer. We do not handle error in vhost level. + */ + nr_copies = rte_dma_completed(dma_id, vchan_id, max_pkts, &last_idx, &has_error); + if (unlikely(!vhost_async_dma_complete_log && has_error)) { + VHOST_LOG_DATA(ERR, "(%s) DMA %d vChannel %u reports error in " + "rte_dma_completed()\n", dev->ifname, dma_id, vchan_id); + vhost_async_dma_complete_log = true; + } else if (nr_copies == 0) { + goto out; + } + + copy_idx = last_idx - nr_copies + 1; + for (i = 0; i < nr_copies; i++) { + bool *flag; + + flag = dma_info->pkts_cmpl_flag_addr[copy_idx & ring_mask]; + if (flag) { + /** + * Mark the packet flag as received. The flag + * could belong to another virtqueue but write + * is atomic. + */ + *flag = true; + dma_info->pkts_cmpl_flag_addr[copy_idx & ring_mask] = NULL; + } + copy_idx++; + } + +out: + rte_spinlock_unlock(&dma_info->dma_lock); + return nr_copies; +} + static inline void do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq) { @@ -794,7 +928,7 @@ copy_vnet_hdr_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq, static __rte_always_inline int async_iter_initialize(struct virtio_net *dev, struct vhost_async *async) { - struct rte_vhost_iov_iter *iter; + struct vhost_iov_iter *iter; if (unlikely(async->iovec_idx >= VHOST_MAX_ASYNC_VEC)) { VHOST_LOG_DATA(ERR, "(%s) no more async iovec available\n", dev->ifname); @@ -812,8 +946,8 @@ static __rte_always_inline int async_iter_add_iovec(struct virtio_net *dev, struct vhost_async *async, void *src, void *dst, size_t len) { - struct rte_vhost_iov_iter *iter; - struct rte_vhost_iovec *iovec; + struct vhost_iov_iter *iter; + struct vhost_iovec *iovec; if (unlikely(async->iovec_idx >= VHOST_MAX_ASYNC_VEC)) { static bool vhost_max_async_vec_log; @@ -848,7 +982,7 @@ async_iter_finalize(struct vhost_async *async) static __rte_always_inline void async_iter_cancel(struct 
vhost_async *async) { - struct rte_vhost_iov_iter *iter; + struct vhost_iov_iter *iter; iter = async->iov_iter + async->iter_idx; async->iovec_idx -= iter->nr_segs; @@ -1448,9 +1582,9 @@ store_dma_desc_info_packed(struct vring_used_elem_packed *s_ring, } static __rte_noinline uint32_t -virtio_dev_rx_async_submit_split(struct virtio_net *dev, - struct vhost_virtqueue *vq, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) +virtio_dev_rx_async_submit_split(struct virtio_net *dev, struct vhost_virtqueue *vq, + uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count, + int16_t dma_id, uint16_t vchan_id) { struct buf_vector buf_vec[BUF_VECTOR_MAX]; uint32_t pkt_idx = 0; @@ -1460,7 +1594,7 @@ virtio_dev_rx_async_submit_split(struct virtio_net *dev, struct vhost_async *async = vq->async; struct async_inflight_info *pkts_info = async->pkts_info; uint32_t pkt_err = 0; - int32_t n_xfer; + uint16_t n_xfer; uint16_t slot_idx = 0; /* @@ -1502,17 +1636,16 @@ virtio_dev_rx_async_submit_split(struct virtio_net *dev, if (unlikely(pkt_idx == 0)) return 0; - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx); - if (unlikely(n_xfer < 0)) { - VHOST_LOG_DATA(ERR, "(%s) %s: failed to transfer data for queue id %d.\n", - dev->ifname, __func__, queue_id); - n_xfer = 0; - } + n_xfer = vhost_async_dma_transfer(dev, vq, dma_id, vchan_id, async->pkts_idx, + async->iov_iter, pkt_idx); pkt_err = pkt_idx - n_xfer; if (unlikely(pkt_err)) { uint16_t num_descs = 0; + VHOST_LOG_DATA(DEBUG, "(%s) %s: failed to transfer %u packets for queue %u.\n", + dev->ifname, __func__, pkt_err, queue_id); + /* update number of completed packets */ pkt_idx = n_xfer; @@ -1655,13 +1788,13 @@ dma_error_handler_packed(struct vhost_virtqueue *vq, uint16_t slot_idx, } static __rte_noinline uint32_t -virtio_dev_rx_async_submit_packed(struct virtio_net *dev, - struct vhost_virtqueue *vq, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) 
+virtio_dev_rx_async_submit_packed(struct virtio_net *dev, struct vhost_virtqueue *vq, + uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count, + int16_t dma_id, uint16_t vchan_id) { uint32_t pkt_idx = 0; uint32_t remained = count; - int32_t n_xfer; + uint16_t n_xfer; uint16_t num_buffers; uint16_t num_descs; @@ -1693,19 +1826,17 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, if (unlikely(pkt_idx == 0)) return 0; - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx); - if (unlikely(n_xfer < 0)) { - VHOST_LOG_DATA(ERR, "(%s) %s: failed to transfer data for queue id %d.\n", - dev->ifname, __func__, queue_id); - n_xfer = 0; - } - - pkt_err = pkt_idx - n_xfer; + n_xfer = vhost_async_dma_transfer(dev, vq, dma_id, vchan_id, async->pkts_idx, async->iov_iter, + pkt_idx); async_iter_reset(async); - if (unlikely(pkt_err)) + pkt_err = pkt_idx - n_xfer; + if (unlikely(pkt_err)) { + VHOST_LOG_DATA(DEBUG, "(%s) %s: failed to transfer %u packets for queue %u.\n", + dev->ifname, __func__, pkt_err, queue_id); dma_error_handler_packed(vq, slot_idx, pkt_err, &pkt_idx); + } if (likely(vq->shadow_used_idx)) { /* keep used descriptors. 
*/ @@ -1825,28 +1956,40 @@ write_back_completed_descs_packed(struct vhost_virtqueue *vq, static __rte_always_inline uint16_t vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id) { struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; struct vhost_async *async = vq->async; struct async_inflight_info *pkts_info = async->pkts_info; - int32_t n_cpl; + uint16_t nr_cpl_pkts = 0; uint16_t n_descs = 0, n_buffers = 0; uint16_t start_idx, from, i; - n_cpl = async->ops.check_completed_copies(dev->vid, queue_id, 0, count); - if (unlikely(n_cpl < 0)) { - VHOST_LOG_DATA(ERR, "(%s) %s: failed to check completed copies for queue id %d.\n", - dev->ifname, __func__, queue_id); - return 0; + /* Check completed copies for the given DMA vChannel */ + vhost_async_dma_check_completed(dev, dma_id, vchan_id, VHOST_DMA_MAX_COPY_COMPLETE); + + start_idx = async_get_first_inflight_pkt_idx(vq); + /** + * Calculate the number of copy completed packets. + * Note that there may be completed packets even if + * no copies are reported done by the given DMA vChannel. + * For example, multiple data plane threads enqueue packets + * to the same virtqueue with their own DMA vChannels.
+ */ + from = start_idx; + while (vq->async->pkts_cmpl_flag[from] && count--) { + vq->async->pkts_cmpl_flag[from] = false; + from++; + if (from >= vq->size) + from -= vq->size; + nr_cpl_pkts++; } - if (n_cpl == 0) + if (nr_cpl_pkts == 0) return 0; - start_idx = async_get_first_inflight_pkt_idx(vq); - - for (i = 0; i < n_cpl; i++) { + for (i = 0; i < nr_cpl_pkts; i++) { from = (start_idx + i) % vq->size; /* Only used with packed ring */ n_buffers += pkts_info[from].nr_buffers; @@ -1855,7 +1998,7 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, pkts[i] = pkts_info[from].mbuf; } - async->pkts_inflight_n -= n_cpl; + async->pkts_inflight_n -= nr_cpl_pkts; if (likely(vq->enabled && vq->access_ok)) { if (vq_is_packed(dev)) { @@ -1876,12 +2019,13 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, } } - return n_cpl; + return nr_cpl_pkts; } uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id) { struct virtio_net *dev = get_device(vid); struct vhost_virtqueue *vq; @@ -1905,9 +2049,20 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, return 0; } - rte_spinlock_lock(&vq->access_lock); + if (unlikely(!dma_copy_track[dma_id].vchans || + !dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr)) { + VHOST_LOG_DATA(ERR, "(%s) %s: invalid DMA %d vChannel %u.\n", dev->ifname, __func__, + dma_id, vchan_id); + return 0; + } - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); + if (!rte_spinlock_trylock(&vq->access_lock)) { + VHOST_LOG_DATA(DEBUG, "(%s) failed to poll completed packets from queue id %u. 
" + "virtqueue busy.\n", dev->ifname, queue_id); + return 0; + } + + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, vchan_id); rte_spinlock_unlock(&vq->access_lock); @@ -1916,7 +2071,8 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id) { struct virtio_net *dev = get_device(vid); struct vhost_virtqueue *vq; @@ -1940,14 +2096,21 @@ rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, return 0; } - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); + if (unlikely(!dma_copy_track[dma_id].vchans || + !dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr)) { + VHOST_LOG_DATA(ERR, "(%s) %s: invalid DMA %d vChannel %u.\n", dev->ifname, __func__, + dma_id, vchan_id); + return 0; + } + + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, vchan_id); return n_pkts_cpl; } static __rte_always_inline uint32_t virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) + struct rte_mbuf **pkts, uint32_t count, int16_t dma_id, uint16_t vchan_id) { struct vhost_virtqueue *vq; uint32_t nb_tx = 0; @@ -1959,6 +2122,13 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, return 0; } + if (unlikely(!dma_copy_track[dma_id].vchans || + !dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr)) { + VHOST_LOG_DATA(ERR, "(%s) %s: invalid DMA %d vChannel %u.\n", dev->ifname, __func__, + dma_id, vchan_id); + return 0; + } + vq = dev->virtqueue[queue_id]; rte_spinlock_lock(&vq->access_lock); @@ -1979,10 +2149,10 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, if (vq_is_packed(dev)) nb_tx = virtio_dev_rx_async_submit_packed(dev, vq, queue_id, - pkts, count); + pkts, count, dma_id, vchan_id); else 
nb_tx = virtio_dev_rx_async_submit_split(dev, vq, queue_id, - pkts, count); + pkts, count, dma_id, vchan_id); out: if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) @@ -1996,7 +2166,8 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id) { struct virtio_net *dev = get_device(vid); @@ -2009,7 +2180,7 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, return 0; } - return virtio_dev_rx_async_submit(dev, queue_id, pkts, count); + return virtio_dev_rx_async_submit(dev, queue_id, pkts, count, dma_id, vchan_id); } static inline bool -- 2.25.1 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v3 1/1] vhost: integrate dmadev in asynchronous data-path 2022-02-08 10:40 ` [PATCH v3 1/1] vhost: integrate dmadev in asynchronous data-path Jiayu Hu @ 2022-02-08 17:46 ` Maxime Coquelin 2022-02-09 12:51 ` [PATCH v4 0/1] integrate dmadev in vhost Jiayu Hu 1 sibling, 0 replies; 31+ messages in thread From: Maxime Coquelin @ 2022-02-08 17:46 UTC (permalink / raw) To: Jiayu Hu, dev Cc: i.maximets, chenbo.xia, xuan.ding, cheng1.jiang, liangma, Sunil Pai G Hi Jiayu, On 2/8/22 11:40, Jiayu Hu wrote: > Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA > abstraction layer and simplify application logics, this patch integrates > dmadev in asynchronous data path. > > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> > Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> > --- > doc/guides/prog_guide/vhost_lib.rst | 97 +++++----- > examples/vhost/Makefile | 2 +- > examples/vhost/ioat.c | 218 ---------------------- > examples/vhost/ioat.h | 63 ------- > examples/vhost/main.c | 252 +++++++++++++++++++++----- > examples/vhost/main.h | 11 ++ > examples/vhost/meson.build | 6 +- > lib/vhost/meson.build | 2 +- > lib/vhost/rte_vhost.h | 2 + > lib/vhost/rte_vhost_async.h | 145 ++++----------- > lib/vhost/version.map | 3 + > lib/vhost/vhost.c | 122 +++++++++---- > lib/vhost/vhost.h | 85 ++++++++- > lib/vhost/virtio_net.c | 271 +++++++++++++++++++++++----- > 14 files changed, 689 insertions(+), 590 deletions(-) > delete mode 100644 examples/vhost/ioat.c > delete mode 100644 examples/vhost/ioat.h Thanks for the quick follow-up patch. I notice checkpatch is complaining for a few too long lines, I can fix them by myself if the rest is good, otherwise remember to run checkpatch for the next iteration. 
> --- a/lib/vhost/vhost.c > +++ b/lib/vhost/vhost.c > @@ -25,7 +25,7 @@ > #include "vhost.h" > #include "vhost_user.h" > > -struct virtio_net *vhost_devices[MAX_VHOST_DEVICE]; > +struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE]; > pthread_mutex_t vhost_dev_lock = PTHREAD_MUTEX_INITIALIZER; > > /* Called with iotlb_lock read-locked */ > @@ -343,6 +343,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq) > return; > > rte_free(vq->async->pkts_info); > + rte_free(vq->async->pkts_cmpl_flag); > > rte_free(vq->async->buffers_packed); > vq->async->buffers_packed = NULL; > @@ -665,12 +666,12 @@ vhost_new_device(void) > int i; > > pthread_mutex_lock(&vhost_dev_lock); > - for (i = 0; i < MAX_VHOST_DEVICE; i++) { > + for (i = 0; i < RTE_MAX_VHOST_DEVICE; i++) { > if (vhost_devices[i] == NULL) > break; > } > > - if (i == MAX_VHOST_DEVICE) { > + if (i == RTE_MAX_VHOST_DEVICE) { > VHOST_LOG_CONFIG(ERR, "failed to find a free slot for new device.\n"); > pthread_mutex_unlock(&vhost_dev_lock); > return -1; > @@ -1621,8 +1622,7 @@ rte_vhost_extern_callback_register(int vid, > } > > static __rte_always_inline int > -async_channel_register(int vid, uint16_t queue_id, > - struct rte_vhost_async_channel_ops *ops) > +async_channel_register(int vid, uint16_t queue_id) > { > struct virtio_net *dev = get_device(vid); > struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; > @@ -1651,6 +1651,14 @@ async_channel_register(int vid, uint16_t queue_id, > goto out_free_async; > } > > + async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size * sizeof(bool), RTE_CACHE_LINE_SIZE, > + node); > + if (!async->pkts_cmpl_flag) { > + VHOST_LOG_CONFIG(ERR, "(%s) failed to allocate async pkts_cmpl_flag (qid: %d)\n", > + dev->ifname, queue_id); > + goto out_free_async; > + } > + > if (vq_is_packed(dev)) { > async->buffers_packed = rte_malloc_socket(NULL, > vq->size * sizeof(struct vring_used_elem_packed), > @@ -1671,9 +1679,6 @@ async_channel_register(int vid, uint16_t queue_id, > } > } > > - 
async->ops.check_completed_copies = ops->check_completed_copies; > - async->ops.transfer_data = ops->transfer_data; > - > vq->async = async; > > return 0; > @@ -1686,15 +1691,13 @@ async_channel_register(int vid, uint16_t queue_id, > } > > int > -rte_vhost_async_channel_register(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops) > +rte_vhost_async_channel_register(int vid, uint16_t queue_id) > { > struct vhost_virtqueue *vq; > struct virtio_net *dev = get_device(vid); > int ret; > > - if (dev == NULL || ops == NULL) > + if (dev == NULL) > return -1; > > if (queue_id >= VHOST_MAX_VRING) > @@ -1705,33 +1708,20 @@ rte_vhost_async_channel_register(int vid, uint16_t queue_id, > if (unlikely(vq == NULL || !dev->async_copy)) > return -1; > > - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { > - VHOST_LOG_CONFIG(ERR, > - "(%s) async copy is not supported on non-inorder mode (qid: %d)\n", > - dev->ifname, queue_id); > - return -1; > - } > - > - if (unlikely(ops->check_completed_copies == NULL || > - ops->transfer_data == NULL)) > - return -1; > - > rte_spinlock_lock(&vq->access_lock); > - ret = async_channel_register(vid, queue_id, ops); > + ret = async_channel_register(vid, queue_id); > rte_spinlock_unlock(&vq->access_lock); > > return ret; > } > > int > -rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_vhost_async_config config, > - struct rte_vhost_async_channel_ops *ops) > +rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id) > { > struct vhost_virtqueue *vq; > struct virtio_net *dev = get_device(vid); > > - if (dev == NULL || ops == NULL) > + if (dev == NULL) > return -1; > > if (queue_id >= VHOST_MAX_VRING) > @@ -1742,18 +1732,7 @@ rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, > if (unlikely(vq == NULL || !dev->async_copy)) > return -1; > > - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) 
{ > - VHOST_LOG_CONFIG(ERR, > - "(%s) async copy is not supported on non-inorder mode (qid: %d)\n", > - dev->ifname, queue_id); > - return -1; > - } > - > - if (unlikely(ops->check_completed_copies == NULL || > - ops->transfer_data == NULL)) > - return -1; > - > - return async_channel_register(vid, queue_id, ops); > + return async_channel_register(vid, queue_id); > } > > int > @@ -1832,6 +1811,69 @@ rte_vhost_async_channel_unregister_thread_unsafe(int vid, uint16_t queue_id) > return 0; > } > > +int > +rte_vhost_async_dma_configure(int16_t dma_id, uint16_t vchan_id) > +{ > + struct rte_dma_info info; > + void *pkts_cmpl_flag_addr; > + uint16_t max_desc; > + > + if (!rte_dma_is_valid(dma_id)) { > + VHOST_LOG_CONFIG(ERR, "DMA %d is not found. Cannot use it in vhost\n", dma_id); > + return -1; > + } > + > + rte_dma_info_get(dma_id, &info); > + if (vchan_id >= info.max_vchans) { > + VHOST_LOG_CONFIG(ERR, "Invalid vChannel ID. Cannot use DMA %d vChannel %u for " > + "vhost\n", dma_id, vchan_id); > + return -1; > + } > + > + if (!dma_copy_track[dma_id].vchans) { > + struct async_dma_vchan_info *vchans; > + > + vchans = rte_zmalloc(NULL, sizeof(struct async_dma_vchan_info) * info.max_vchans, > + RTE_CACHE_LINE_SIZE); > + if (vchans == NULL) { > + VHOST_LOG_CONFIG(ERR, "Failed to allocate vchans, Cannot use DMA %d " > + "vChannel %u for vhost.\n", dma_id, vchan_id); Please remove the "cannot use in Vhost" here and above and below. The messages are already prefixed with VHOST_CONFIG, so this is redundant. > + return -1; > + } > + > + dma_copy_track[dma_id].vchans = vchans; > + } > + > + if (dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr) { > + VHOST_LOG_CONFIG(INFO, "DMA %d vChannel %u has registered in vhost. 
Ignore\n", > + dma_id, vchan_id); "DMA %d vChannel %u already registered.\n" > + return 0; > + } > + > + max_desc = info.max_desc; > + if (!rte_is_power_of_2(max_desc)) > + max_desc = rte_align32pow2(max_desc); > + > + pkts_cmpl_flag_addr = rte_zmalloc(NULL, sizeof(bool *) * max_desc, RTE_CACHE_LINE_SIZE); > + if (!pkts_cmpl_flag_addr) { > + VHOST_LOG_CONFIG(ERR, "Failed to allocate pkts_cmpl_flag_addr for DMA %d " > + "vChannel %u. Cannot use it for vhost\n", dma_id, vchan_id); > + > + if (dma_copy_track[dma_id].nr_vchans == 0) { > + rte_free(dma_copy_track[dma_id].vchans); > + dma_copy_track[dma_id].vchans = NULL; > + } > + return -1; > + } > + > + dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr = pkts_cmpl_flag_addr; > + dma_copy_track[dma_id].vchans[vchan_id].ring_size = max_desc; > + dma_copy_track[dma_id].vchans[vchan_id].ring_mask = max_desc - 1; > + dma_copy_track[dma_id].nr_vchans++; > + > + return 0; > +} > + > int > rte_vhost_async_get_inflight(int vid, uint16_t queue_id) > { > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h > index b3f0c1d07c..1c2ee29600 100644 > --- a/lib/vhost/vhost.h > +++ b/lib/vhost/vhost.h > @@ -19,6 +19,7 @@ > #include <rte_ether.h> > #include <rte_rwlock.h> > #include <rte_malloc.h> > +#include <rte_dmadev.h> > > #include "rte_vhost.h" > #include "rte_vdpa.h" > @@ -50,6 +51,9 @@ > > #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) > #define VHOST_MAX_ASYNC_VEC 2048 > +#define VIRTIO_MAX_RX_PKTLEN 9728U > +#define VHOST_DMA_MAX_COPY_COMPLETE ((VIRTIO_MAX_RX_PKTLEN / RTE_MBUF_DEFAULT_DATAROOM) \ > + * MAX_PKT_BURST) > > #define PACKED_DESC_ENQUEUE_USED_FLAG(w) \ > ((w) ?
(VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \ > @@ -119,6 +123,58 @@ struct vring_used_elem_packed { > uint32_t count; > }; > > +/** > + * iovec > + */ > +struct vhost_iovec { > + void *src_addr; > + void *dst_addr; > + size_t len; > +}; > + > +/** > + * iovec iterator > + */ > +struct vhost_iov_iter { > + /** pointer to the iovec array */ > + struct vhost_iovec *iov; > + /** number of iovec in this iterator */ > + unsigned long nr_segs; > +}; > + > +struct async_dma_vchan_info { > + /* circular array to track if packet copy completes */ > + bool **pkts_cmpl_flag_addr; > + > + /* max elements in 'pkts_cmpl_flag_addr' */ > + uint16_t ring_size; > + /* ring index mask for 'pkts_cmpl_flag_addr' */ > + uint16_t ring_mask; > + > + /** > + * DMA virtual channel lock. Although DMA virtual channels can be > + * bound to data plane threads, the vhost control plane thread > + * could call data plane functions too, thus causing DMA device > + * contention. > + * > + * For example, in the VM exit case, the vhost control plane thread > + * needs to clear in-flight packets before disabling the vring, but > + * another data plane thread could be enqueuing packets to the same > + * vring with the same DMA virtual channel. As dmadev PMD functions > + * are lock-free, the control plane and data plane threads could > + * operate the same DMA virtual channel at the same time.
> + */ > + rte_spinlock_t dma_lock; > +}; > + > +struct async_dma_info { > + struct async_dma_vchan_info *vchans; > + /* number of registered virtual channels */ > + uint16_t nr_vchans; > +}; > + > +extern struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > + > /** > * inflight async packet information > */ > @@ -129,16 +185,32 @@ struct async_inflight_info { > }; > > struct vhost_async { > - /* operation callbacks for DMA */ > - struct rte_vhost_async_channel_ops ops; > - > - struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; > - struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; > + struct vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; > + struct vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; > uint16_t iter_idx; > uint16_t iovec_idx; > > /* data transfer status */ > struct async_inflight_info *pkts_info; > + /** > + * Packet reorder array. "true" indicates that the DMA device > + * has completed all copies for the packet. > + * > + * Note that this array could be written by multiple threads > + * simultaneously. For example, in the case where thread0 and > + * thread1 receive packets from the NIC and then enqueue them to > + * vring0 and vring1 with their own DMA devices DMA0 and DMA1, it's > + * possible for thread0 to get completed copies belonging to > + * vring1 from DMA0, while thread0 is calling rte_vhost_poll > + * _enqueue_completed() for vring0 and thread1 is calling > + * rte_vhost_submit_enqueue_burst() for vring1. In this case, > + * vq->access_lock cannot protect pkts_cmpl_flag of vring1. > + > + * However, since offloading is done on a per-packet basis, each > + * packet flag will only be written by one thread. And a single byte > + * write is atomic, so no lock for pkts_cmpl_flag is needed.
> + */ > + bool *pkts_cmpl_flag; > uint16_t pkts_idx; > uint16_t pkts_inflight_n; > union { > @@ -568,8 +640,7 @@ extern int vhost_data_log_level; > #define PRINT_PACKET(device, addr, size, header) do {} while (0) > #endif > > -#define MAX_VHOST_DEVICE 1024 > -extern struct virtio_net *vhost_devices[MAX_VHOST_DEVICE]; > +extern struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE]; > > #define VHOST_BINARY_SEARCH_THRESH 256 > > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c > index f19713137c..cc4e2504ac 100644 > --- a/lib/vhost/virtio_net.c > +++ b/lib/vhost/virtio_net.c > @@ -11,6 +11,7 @@ > #include <rte_net.h> > #include <rte_ether.h> > #include <rte_ip.h> > +#include <rte_dmadev.h> > #include <rte_vhost.h> > #include <rte_tcp.h> > #include <rte_udp.h> > @@ -25,6 +26,9 @@ > > #define MAX_BATCH_LEN 256 > > +/* DMA device copy operation tracking array. */ > +struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; > + > static __rte_always_inline bool > rxvq_is_mergeable(struct virtio_net *dev) > { > @@ -43,6 +47,136 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring) > return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring; > } > > +static __rte_always_inline int64_t > +vhost_async_dma_transfer_one(struct virtio_net *dev, struct vhost_virtqueue *vq, > + int16_t dma_id, uint16_t vchan_id, uint16_t flag_idx, > + struct vhost_iov_iter *pkt) > +{ > + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan_id]; > + uint16_t ring_mask = dma_info->ring_mask; > + static bool vhost_async_dma_copy_log; > + > + > + struct vhost_iovec *iov = pkt->iov; > + int copy_idx = 0; > + uint32_t nr_segs = pkt->nr_segs; > + uint16_t i; > + > + if (rte_dma_burst_capacity(dma_id, vchan_id) < nr_segs) > + return -1; > + > + for (i = 0; i < nr_segs; i++) { > + copy_idx = rte_dma_copy(dma_id, vchan_id, (rte_iova_t)iov[i].src_addr, > + (rte_iova_t)iov[i].dst_addr, iov[i].len, RTE_DMA_OP_FLAG_LLC); > + /** > + * Since all memory is 
pinned and DMA vChannel > + * ring has enough space, failure should be a > + * rare case. If failure happens, it means the DMA > + * device encountered serious errors; in this > + * case, please stop the async data path and check > + * what has happened to the DMA device. > + */ > + if (unlikely(copy_idx < 0)) { > + if (!vhost_async_dma_copy_log) { > + VHOST_LOG_DATA(ERR, "(%s) DMA %d vChannel %u reports error in " > + "rte_dma_copy(). Please stop async data-path and " > + "debug what has happened to DMA device\n", > + dev->ifname, dma_id, vchan_id); Please try to make the log message shorter. Something like "DMA copy failed for channel <dev id>:<chan id>". > + vhost_async_dma_copy_log = true; > + } > + return -1; > + } > + } > + > + /** > + * Only store packet completion flag address in the last copy's > + * slot, and other slots are set to NULL. > + */ > + dma_info->pkts_cmpl_flag_addr[copy_idx & ring_mask] = &vq->async->pkts_cmpl_flag[flag_idx]; > + > + return nr_segs; > +} > + > +static __rte_always_inline uint16_t > +vhost_async_dma_transfer(struct virtio_net *dev, struct vhost_virtqueue *vq, > + int16_t dma_id, uint16_t vchan_id, uint16_t head_idx, > + struct vhost_iov_iter *pkts, uint16_t nr_pkts) > +{ > + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan_id]; > + int64_t ret, nr_copies = 0; > + uint16_t pkt_idx; > + > + rte_spinlock_lock(&dma_info->dma_lock); > + > + for (pkt_idx = 0; pkt_idx < nr_pkts; pkt_idx++) { > + ret = vhost_async_dma_transfer_one(dev, vq, dma_id, vchan_id, head_idx, &pkts[pkt_idx]); > + if (unlikely(ret < 0)) > + break; > + > + nr_copies += ret; > + head_idx++; > + if (head_idx >= vq->size) > + head_idx -= vq->size; > + } > + > + if (likely(nr_copies > 0)) > + rte_dma_submit(dma_id, vchan_id); > + > + rte_spinlock_unlock(&dma_info->dma_lock); > + > + return pkt_idx; > +} Thanks for reworking vhost_async_dma_transfer()! It is much cleaner & easier to understand IMHO.
> +static __rte_always_inline uint16_t > +vhost_async_dma_check_completed(struct virtio_net *dev, int16_t dma_id, uint16_t vchan_id, > + uint16_t max_pkts) > +{ > + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan_id]; > + uint16_t ring_mask = dma_info->ring_mask; > + uint16_t last_idx = 0; > + uint16_t nr_copies; > + uint16_t copy_idx; > + uint16_t i; > + bool has_error = false; > + static bool vhost_async_dma_complete_log; > + > + rte_spinlock_lock(&dma_info->dma_lock); > + > + /** > + * Print error log for debugging, if DMA reports error during > + * DMA transfer. We do not handle error in vhost level. > + */ > + nr_copies = rte_dma_completed(dma_id, vchan_id, max_pkts, &last_idx, &has_error); > + if (unlikely(!vhost_async_dma_complete_log && has_error)) { > + VHOST_LOG_DATA(ERR, "(%s) DMA %d vChannel %u reports error in " > + "rte_dma_completed()\n", dev->ifname, dma_id, vchan_id); "(%s) DMA completion failure on channel %d:%d" > + vhost_async_dma_complete_log = true; > + } else if (nr_copies == 0) { > + goto out; > + } > + > + copy_idx = last_idx - nr_copies + 1; > + for (i = 0; i < nr_copies; i++) { > + bool *flag; > + > + flag = dma_info->pkts_cmpl_flag_addr[copy_idx & ring_mask]; > + if (flag) { > + /** > + * Mark the packet flag as received. The flag > + * could belong to another virtqueue but write > + * is atomic. 
> + */ > + *flag = true; > + dma_info->pkts_cmpl_flag_addr[copy_idx & ring_mask] = NULL; > + } > + copy_idx++; > + } > + > +out: > + rte_spinlock_unlock(&dma_info->dma_lock); > + return nr_copies; > +} > + > static inline void > do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq) > { > @@ -794,7 +928,7 @@ copy_vnet_hdr_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq, > static __rte_always_inline int > async_iter_initialize(struct virtio_net *dev, struct vhost_async *async) > { > - struct rte_vhost_iov_iter *iter; > + struct vhost_iov_iter *iter; > > if (unlikely(async->iovec_idx >= VHOST_MAX_ASYNC_VEC)) { > VHOST_LOG_DATA(ERR, "(%s) no more async iovec available\n", dev->ifname); > @@ -812,8 +946,8 @@ static __rte_always_inline int > async_iter_add_iovec(struct virtio_net *dev, struct vhost_async *async, > void *src, void *dst, size_t len) > { > - struct rte_vhost_iov_iter *iter; > - struct rte_vhost_iovec *iovec; > + struct vhost_iov_iter *iter; > + struct vhost_iovec *iovec; > > if (unlikely(async->iovec_idx >= VHOST_MAX_ASYNC_VEC)) { > static bool vhost_max_async_vec_log; > @@ -848,7 +982,7 @@ async_iter_finalize(struct vhost_async *async) > static __rte_always_inline void > async_iter_cancel(struct vhost_async *async) > { > - struct rte_vhost_iov_iter *iter; > + struct vhost_iov_iter *iter; > > iter = async->iov_iter + async->iter_idx; > async->iovec_idx -= iter->nr_segs; > @@ -1448,9 +1582,9 @@ store_dma_desc_info_packed(struct vring_used_elem_packed *s_ring, > } > > static __rte_noinline uint32_t > -virtio_dev_rx_async_submit_split(struct virtio_net *dev, > - struct vhost_virtqueue *vq, uint16_t queue_id, > - struct rte_mbuf **pkts, uint32_t count) > +virtio_dev_rx_async_submit_split(struct virtio_net *dev, struct vhost_virtqueue *vq, > + uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count, > + int16_t dma_id, uint16_t vchan_id) > { > struct buf_vector buf_vec[BUF_VECTOR_MAX]; > uint32_t pkt_idx = 0; > @@ -1460,7 +1594,7 
@@ virtio_dev_rx_async_submit_split(struct virtio_net *dev, > struct vhost_async *async = vq->async; > struct async_inflight_info *pkts_info = async->pkts_info; > uint32_t pkt_err = 0; > - int32_t n_xfer; > + uint16_t n_xfer; > uint16_t slot_idx = 0; > > /* > @@ -1502,17 +1636,16 @@ virtio_dev_rx_async_submit_split(struct virtio_net *dev, > if (unlikely(pkt_idx == 0)) > return 0; > > - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx); > - if (unlikely(n_xfer < 0)) { > - VHOST_LOG_DATA(ERR, "(%s) %s: failed to transfer data for queue id %d.\n", > - dev->ifname, __func__, queue_id); > - n_xfer = 0; > - } > + n_xfer = vhost_async_dma_transfer(dev, vq, dma_id, vchan_id, async->pkts_idx, > + async->iov_iter, pkt_idx); > > pkt_err = pkt_idx - n_xfer; > if (unlikely(pkt_err)) { > uint16_t num_descs = 0; > > + VHOST_LOG_DATA(DEBUG, "(%s) %s: failed to transfer %u packets for queue %u.\n", > + dev->ifname, __func__, pkt_err, queue_id); > + > /* update number of completed packets */ > pkt_idx = n_xfer; > > @@ -1655,13 +1788,13 @@ dma_error_handler_packed(struct vhost_virtqueue *vq, uint16_t slot_idx, > } > > static __rte_noinline uint32_t > -virtio_dev_rx_async_submit_packed(struct virtio_net *dev, > - struct vhost_virtqueue *vq, uint16_t queue_id, > - struct rte_mbuf **pkts, uint32_t count) > +virtio_dev_rx_async_submit_packed(struct virtio_net *dev, struct vhost_virtqueue *vq, > + uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count, > + int16_t dma_id, uint16_t vchan_id) > { > uint32_t pkt_idx = 0; > uint32_t remained = count; > - int32_t n_xfer; > + uint16_t n_xfer; > uint16_t num_buffers; > uint16_t num_descs; > > @@ -1693,19 +1826,17 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, > if (unlikely(pkt_idx == 0)) > return 0; > > - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx); > - if (unlikely(n_xfer < 0)) { > - VHOST_LOG_DATA(ERR, "(%s) %s: failed to transfer data for queue id 
%d.\n", > - dev->ifname, __func__, queue_id); > - n_xfer = 0; > - } > - > - pkt_err = pkt_idx - n_xfer; > + n_xfer = vhost_async_dma_transfer(dev, vq, dma_id, vchan_id, async->pkts_idx, async->iov_iter, > + pkt_idx); > > async_iter_reset(async); > > - if (unlikely(pkt_err)) > + pkt_err = pkt_idx - n_xfer; > + if (unlikely(pkt_err)) { > + VHOST_LOG_DATA(DEBUG, "(%s) %s: failed to transfer %u packets for queue %u.\n", > + dev->ifname, __func__, pkt_err, queue_id); > dma_error_handler_packed(vq, slot_idx, pkt_err, &pkt_idx); > + } > > if (likely(vq->shadow_used_idx)) { > /* keep used descriptors. */ > @@ -1825,28 +1956,40 @@ write_back_completed_descs_packed(struct vhost_virtqueue *vq, > > static __rte_always_inline uint16_t > vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan_id) > { > struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; > struct vhost_async *async = vq->async; > struct async_inflight_info *pkts_info = async->pkts_info; > - int32_t n_cpl; > + uint16_t nr_cpl_pkts = 0; > uint16_t n_descs = 0, n_buffers = 0; > uint16_t start_idx, from, i; > > - n_cpl = async->ops.check_completed_copies(dev->vid, queue_id, 0, count); > - if (unlikely(n_cpl < 0)) { > - VHOST_LOG_DATA(ERR, "(%s) %s: failed to check completed copies for queue id %d.\n", > - dev->ifname, __func__, queue_id); > - return 0; > + /* Check completed copies for the given DMA vChannel */ > + vhost_async_dma_check_completed(dev, dma_id, vchan_id, VHOST_DMA_MAX_COPY_COMPLETE); > + > + start_idx = async_get_first_inflight_pkt_idx(vq); > + /** > + * Calculate the number of copy completed packets. > + * Note that there may be completed packets even if > + * no copies are reported done by the given DMA vChannel. > + * For example, multiple data plane threads enqueue packets > + * to the same virtqueue with their own DMA vChannels.
> + */ > + from = start_idx; > + while (vq->async->pkts_cmpl_flag[from] && count--) { > + vq->async->pkts_cmpl_flag[from] = false; > + from++; > + if (from >= vq->size) > + from -= vq->size; > + nr_cpl_pkts++; > } > > - if (n_cpl == 0) > + if (nr_cpl_pkts == 0) > return 0; > > - start_idx = async_get_first_inflight_pkt_idx(vq); > - > - for (i = 0; i < n_cpl; i++) { > + for (i = 0; i < nr_cpl_pkts; i++) { > from = (start_idx + i) % vq->size; > /* Only used with packed ring */ > n_buffers += pkts_info[from].nr_buffers; > @@ -1855,7 +1998,7 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, > pkts[i] = pkts_info[from].mbuf; > } > > - async->pkts_inflight_n -= n_cpl; > + async->pkts_inflight_n -= nr_cpl_pkts; > > if (likely(vq->enabled && vq->access_ok)) { > if (vq_is_packed(dev)) { > @@ -1876,12 +2019,13 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, > } > } > > - return n_cpl; > + return nr_cpl_pkts; > } > > uint16_t > rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan_id) > { > struct virtio_net *dev = get_device(vid); > struct vhost_virtqueue *vq; > @@ -1905,9 +2049,20 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > return 0; > } > > - rte_spinlock_lock(&vq->access_lock); > + if (unlikely(!dma_copy_track[dma_id].vchans || > + !dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr)) { > + VHOST_LOG_DATA(ERR, "(%s) %s: invalid DMA %d vChannel %u.\n", dev->ifname, __func__, > + dma_id, vchan_id); > + return 0; > + } > > - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); > + if (!rte_spinlock_trylock(&vq->access_lock)) { > + VHOST_LOG_DATA(DEBUG, "(%s) failed to poll completed packets from queue id %u. " > + "virtqueue busy.\n", dev->ifname, queue_id); Please try to make it shorter so that the string fits in a single line. 
> + return 0; > + } > + > + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, vchan_id); > > rte_spinlock_unlock(&vq->access_lock); > > @@ -1916,7 +2071,8 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, > > uint16_t > rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan_id) > { > struct virtio_net *dev = get_device(vid); > struct vhost_virtqueue *vq; > @@ -1940,14 +2096,21 @@ rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, > return 0; > } > > - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); > + if (unlikely(!dma_copy_track[dma_id].vchans || > + !dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr)) { > + VHOST_LOG_DATA(ERR, "(%s) %s: invalid DMA %d vChannel %u.\n", dev->ifname, __func__, > + dma_id, vchan_id); > + return 0; > + } > + > + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, vchan_id); > > return n_pkts_cpl; > } > > static __rte_always_inline uint32_t > virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, > - struct rte_mbuf **pkts, uint32_t count) > + struct rte_mbuf **pkts, uint32_t count, int16_t dma_id, uint16_t vchan_id) > { > struct vhost_virtqueue *vq; > uint32_t nb_tx = 0; > @@ -1959,6 +2122,13 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, > return 0; > } > > + if (unlikely(!dma_copy_track[dma_id].vchans || > + !dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr)) { > + VHOST_LOG_DATA(ERR, "(%s) %s: invalid DMA %d vChannel %u.\n", dev->ifname, __func__, > + dma_id, vchan_id); > + return 0; > + } > + > vq = dev->virtqueue[queue_id]; > > rte_spinlock_lock(&vq->access_lock); > @@ -1979,10 +2149,10 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, > > if (vq_is_packed(dev)) > nb_tx = virtio_dev_rx_async_submit_packed(dev, vq, 
queue_id, > - pkts, count); > + pkts, count, dma_id, vchan_id); > else > nb_tx = virtio_dev_rx_async_submit_split(dev, vq, queue_id, > - pkts, count); > + pkts, count, dma_id, vchan_id); > > out: > if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) > @@ -1996,7 +2166,8 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, > > uint16_t > rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > - struct rte_mbuf **pkts, uint16_t count) > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, > + uint16_t vchan_id) > { > struct virtio_net *dev = get_device(vid); > > @@ -2009,7 +2180,7 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, > return 0; > } > > - return virtio_dev_rx_async_submit(dev, queue_id, pkts, count); > + return virtio_dev_rx_async_submit(dev, queue_id, pkts, count, dma_id, vchan_id); > } > > static inline bool Overall, it looks good to me, only cosmetic issues to be fixed. It is too late for -rc1, but please send a new version and I'll pick it for -rc2. Thanks, Maxime ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v4 0/1] integrate dmadev in vhost 2022-02-08 10:40 ` [PATCH v3 1/1] vhost: integrate dmadev in asynchronous data-path Jiayu Hu 2022-02-08 17:46 ` Maxime Coquelin @ 2022-02-09 12:51 ` Jiayu Hu 2022-02-09 12:51 ` [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path Jiayu Hu 1 sibling, 1 reply; 31+ messages in thread From: Jiayu Hu @ 2022-02-09 12:51 UTC (permalink / raw) To: dev Cc: maxime.coquelin, i.maximets, chenbo.xia, xuan.ding, cheng1.jiang, liangma, Jiayu Hu Since dmadev was introduced in 21.11, to avoid the overhead of the vhost DMA abstraction layer and simplify application logic, this patch integrates dmadev in vhost. To enable the flexibility of using DMA devices in different function modules, not limited to vhost, vhost doesn't manage DMA devices. Applications, like OVS, need to manage and configure DMA devices and tell vhost what DMA device to use in every dataplane function call. In addition, vhost supports M:N mapping between vrings and DMA virtual channels. Specifically, one vring can use multiple different DMA channels and one DMA channel can be shared by multiple vrings at the same time. The reason for enabling one vring to use multiple DMA channels is that it's possible that more than one dataplane thread enqueues packets to the same vring with its own DMA virtual channel. Besides, the number of DMA devices is limited. For the purpose of scaling, it's necessary to support sharing DMA channels among vrings. As DMA acceleration is only enabled for the enqueue path, the new dataplane functions are: 1). rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id, dma_vchan): Get descriptors and submit copies to the DMA virtual channel for the packets that need to be sent to the VM. 2). rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id, dma_vchan): Check completed DMA copies from the given DMA virtual channel and write back corresponding descriptors to the vring.
OVS needs to call rte_vhost_poll_enqueue_completed to clean in-flight copies from previous calls and it can be called inside the rxq_recv function, so that it doesn't require a big change in the OVS datapath. For example: netdev_dpdk_vhost_rxq_recv() { ... qid = rxq->queue_id * VIRTIO_QNUM + VIRTIO_RXQ; rte_vhost_poll_enqueue_completed(vid, qid, ...); } Change log ========== v3 -> v4: - fix coding style issues - optimize and shorten logs - optimize doc v2 -> v3: - remove SW fallback - remove middle-packet dma submit - refactor rte_async_dma_configure() and remove poll_factor - introduce vhost_async_dma_transfer_one() - rename rte_vhost_iov_iter and rte_vhost_iovec and place them in vhost.h - refactor LOG format - print error log for rte_dma_copy() failure while avoiding log flood - avoid log flood for rte_dma_completed() failure - fix some typos and update comments and doc v1 -> v2: - add SW fallback if rte_dma_copy() reports error - print error if rte_dma_completed() reports error - add poll_factor when calling rte_dma_completed() for scatter-gather packets - use trylock instead of lock in rte_vhost_poll_enqueue_completed() - check if dma_id and vchan_id are valid - input dma_id in rte_vhost_async_dma_configure() - remove useless code, braces and hardcoding in vhost example - redefine MAX_VHOST_DEVICE to RTE_MAX_VHOST_DEVICE - update doc and comments rfc -> v1: - remove useless code - support dynamic DMA vchannel ring size (rte_vhost_async_dma_configure) - fix several bugs - fix typos and coding style issues - replace "while" with "for" - update programmer guide - support sharing DMA among vhost devices in vhost example - remove "--dma-type" in vhost example
+- lib/vhost/meson.build | 2 +- lib/vhost/rte_vhost.h | 2 + lib/vhost/rte_vhost_async.h | 145 ++++----------- lib/vhost/version.map | 3 + lib/vhost/vhost.c | 121 ++++++++---- lib/vhost/vhost.h | 85 ++++++++- lib/vhost/virtio_net.c | 277 ++++++++++++++++++++++------ 14 files changed, 693 insertions(+), 595 deletions(-) delete mode 100644 examples/vhost/ioat.c delete mode 100644 examples/vhost/ioat.h -- 2.25.1 ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path 2022-02-09 12:51 ` [PATCH v4 0/1] integrate dmadev in vhost Jiayu Hu @ 2022-02-09 12:51 ` Jiayu Hu 2022-02-10 7:58 ` Yang, YvonneX ` (4 more replies) 0 siblings, 5 replies; 31+ messages in thread From: Jiayu Hu @ 2022-02-09 12:51 UTC (permalink / raw) To: dev Cc: maxime.coquelin, i.maximets, chenbo.xia, xuan.ding, cheng1.jiang, liangma, Jiayu Hu, Sunil Pai G Since dmadev was introduced in 21.11, to avoid the overhead of the vhost DMA abstraction layer and simplify application logic, this patch integrates dmadev in the asynchronous data path. Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> --- doc/guides/prog_guide/vhost_lib.rst | 100 +++++----- examples/vhost/Makefile | 2 +- examples/vhost/ioat.c | 218 ---------------------- examples/vhost/ioat.h | 63 ------- examples/vhost/main.c | 253 ++++++++++++++++++++----- examples/vhost/main.h | 11 ++ examples/vhost/meson.build | 6 +- lib/vhost/meson.build | 2 +- lib/vhost/rte_vhost.h | 2 + lib/vhost/rte_vhost_async.h | 145 ++++----------- lib/vhost/version.map | 3 + lib/vhost/vhost.c | 121 ++++++++---- lib/vhost/vhost.h | 85 ++++++++- lib/vhost/virtio_net.c | 277 ++++++++++++++++++++++------ 14 files changed, 693 insertions(+), 595 deletions(-) delete mode 100644 examples/vhost/ioat.c delete mode 100644 examples/vhost/ioat.h diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst index f72ce75909..b27a9a8a0d 100644 --- a/doc/guides/prog_guide/vhost_lib.rst +++ b/doc/guides/prog_guide/vhost_lib.rst @@ -105,13 +105,13 @@ The following is an overview of some key Vhost API functions: - ``RTE_VHOST_USER_ASYNC_COPY`` - Asynchronous data path will be enabled when this flag is set. Async data - path allows applications to register async copy devices (typically - hardware DMA channels) to the vhost queues. Vhost leverages the copy - device registered to free CPU from memory copy operations.
A set of - async data path APIs are defined for DPDK applications to make use of - the async capability. Only packets enqueued/dequeued by async APIs are - processed through the async data path. + Asynchronous data path will be enabled when this flag is set. Async + data path allows applications to enable DMA acceleration for vhost + queues. Vhost leverages the registered DMA channels to free CPU from + memory copy operations in data path. A set of async data path APIs are + defined for DPDK applications to make use of the async capability. Only + packets enqueued/dequeued by async APIs are processed through the async + data path. Currently this feature is only implemented on split ring enqueue data path. @@ -218,52 +218,30 @@ The following is an overview of some key Vhost API functions: Enable or disable zero copy feature of the vhost crypto backend. -* ``rte_vhost_async_channel_register(vid, queue_id, config, ops)`` +* ``rte_vhost_async_dma_configure(dma_id, vchan_id)`` - Register an async copy device channel for a vhost queue after vring - is enabled. Following device ``config`` must be specified together - with the registration: + Tell vhost which DMA vChannel it is going to use. This function needs to + be called before registering the async data path for a vring. - * ``features`` +* ``rte_vhost_async_channel_register(vid, queue_id)`` - This field is used to specify async copy device features. + Register async DMA acceleration for a vhost queue after vring is enabled. - ``RTE_VHOST_ASYNC_INORDER`` represents the async copy device can - guarantee the order of copy completion is the same as the order - of copy submission. +* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id)`` - Currently, only ``RTE_VHOST_ASYNC_INORDER`` capable device is - supported by vhost.
- - Applications must provide following ``ops`` callbacks for vhost lib to - work with the async copy devices: - - * ``transfer_data(vid, queue_id, descs, opaque_data, count)`` - - vhost invokes this function to submit copy data to the async devices. - For non-async_inorder capable devices, ``opaque_data`` could be used - for identifying the completed packets. - - * ``check_completed_copies(vid, queue_id, opaque_data, max_packets)`` - - vhost invokes this function to get the copy data completed by async - devices. - -* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id, config, ops)`` - - Register an async copy device channel for a vhost queue without - performing any locking. + Register async DMA acceleration for a vhost queue without performing + any locking. This function is only safe to call in vhost callback functions (i.e., struct rte_vhost_device_ops). * ``rte_vhost_async_channel_unregister(vid, queue_id)`` - Unregister the async copy device channel from a vhost queue. + Unregister the async DMA acceleration from a vhost queue. Unregistration will fail, if the vhost queue has in-flight packets that are not completed. - Unregister async copy devices in vring_state_changed() may + Unregister async DMA acceleration in vring_state_changed() may fail, as this API tries to acquire the spinlock of vhost queue. The recommended way is to unregister async copy devices for all vhost queues in destroy_device(), when a @@ -271,24 +249,19 @@ The following is an overview of some key Vhost API functions: * ``rte_vhost_async_channel_unregister_thread_unsafe(vid, queue_id)`` - Unregister the async copy device channel for a vhost queue without - performing any locking. + Unregister async DMA acceleration for a vhost queue without performing + any locking. This function is only safe to call in vhost callback functions (i.e., struct rte_vhost_device_ops). 
-* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, comp_pkts, comp_count)`` +* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id, vchan_id)`` Submit an enqueue request to transmit ``count`` packets from host to guest - by async data path. Successfully enqueued packets can be transfer completed - or being occupied by DMA engines; transfer completed packets are returned in - ``comp_pkts``, but others are not guaranteed to finish, when this API - call returns. + by async data path. Applications must not free the packets submitted for + enqueue until the packets are completed. - Applications must not free the packets submitted for enqueue until the - packets are completed. - -* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count)`` +* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id, vchan_id)`` Poll enqueue completion status from async data path. Completed packets are returned to applications through ``pkts``. @@ -298,7 +271,7 @@ The following is an overview of some key Vhost API functions: This function returns the amount of in-flight packets for the vhost queue using async acceleration. -* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count)`` +* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count, dma_id, vchan_id)`` Clear inflight packets which are submitted to DMA engine in vhost async data path. Completed packets are returned to applications through ``pkts``. @@ -443,6 +416,29 @@ Finally, a set of device ops is defined for device specific operations: Called to get the notify area info of the queue. +Vhost asynchronous data path +---------------------------- + +Vhost asynchronous data path leverages DMA devices to offload memory +copies from the CPU and it is implemented in an asynchronous way. It +enables applications, like OVS, to save CPU cycles and hide memory copy +overhead, thus achieving higher throughput. 
+ +Vhost doesn't manage DMA devices; applications, like OVS, need to +manage and configure them. Applications need to tell vhost which +DMA device to use in every data path function call. This design gives +applications the flexibility to dynamically use DMA channels in +different function modules, not limited to vhost. + +In addition, vhost supports M:N mapping between vrings and DMA virtual +channels. Specifically, one vring can use multiple different DMA channels +and one DMA channel can be shared by multiple vrings at the same time. +One vring may use multiple DMA channels because more than one data +plane thread may enqueue packets to the same vring, each with its own +DMA virtual channel. Conversely, because the number of DMA devices is +limited, sharing DMA channels among vrings is necessary for the purpose +of scaling. + Recommended IOVA mode in async datapath --------------------------------------- @@ -450,4 +446,4 @@ When DMA devices are bound to vfio driver, VA mode is recommended. For PA mode, page by page mapping may exceed IOMMU's max capability, better to use 1G guest hugepage. -For uio driver, any vfio related error message can be ignored. \ No newline at end of file +For uio driver, any vfio related error message can be ignored. 
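The ownership rule documented above (packets handed to the submit call must not be freed until the poll call returns them) can be sketched with a toy in-flight ring. Everything below is an illustrative mock of the contract, not the real DPDK API; `toy_submit`/`toy_poll` merely stand in for `rte_vhost_submit_enqueue_burst()` and `rte_vhost_poll_enqueue_completed()`:

```c
#include <assert.h>
#include <stddef.h>

#define RING_SZ 8                       /* toy in-flight capacity per queue */

struct inflight_ring {
    void *pkts[RING_SZ];
    unsigned head;                      /* next completion to hand back */
    unsigned tail;                      /* next free submit slot */
};

/* Like the submit API, may accept fewer packets than 'count' when the
 * in-flight ring is full; accepted packets stay owned by "vhost". */
static unsigned toy_submit(struct inflight_ring *r, void **pkts, unsigned count)
{
    unsigned n = 0;

    while (n < count && r->tail - r->head < RING_SZ)
        r->pkts[r->tail++ % RING_SZ] = pkts[n++];
    return n;
}

/* Like the poll API: hand back up to 'done' packets in submission order
 * (per-vchan copies complete in order); only these may now be freed. */
static unsigned toy_poll(struct inflight_ring *r, void **pkts, unsigned max,
                         unsigned done)
{
    unsigned n = 0;

    while (n < max && n < done && r->head != r->tail)
        pkts[n++] = r->pkts[r->head++ % RING_SZ];
    return n;
}
```

The point of the sketch is the flow an application such as OVS follows: poll to reclaim completed packets, then submit a new burst, freeing only what poll returned.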
diff --git a/examples/vhost/Makefile b/examples/vhost/Makefile index 587ea2ab47..975a5dfe40 100644 --- a/examples/vhost/Makefile +++ b/examples/vhost/Makefile @@ -5,7 +5,7 @@ APP = vhost-switch # all source are stored in SRCS-y -SRCS-y := main.c virtio_net.c ioat.c +SRCS-y := main.c virtio_net.c PKGCONF ?= pkg-config diff --git a/examples/vhost/ioat.c b/examples/vhost/ioat.c deleted file mode 100644 index 9aeeb12fd9..0000000000 --- a/examples/vhost/ioat.c +++ /dev/null @@ -1,218 +0,0 @@ -/* SPDX-License-Identifier: BSD-3-Clause - * Copyright(c) 2010-2020 Intel Corporation - */ - -#include <sys/uio.h> -#ifdef RTE_RAW_IOAT -#include <rte_rawdev.h> -#include <rte_ioat_rawdev.h> - -#include "ioat.h" -#include "main.h" - -struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE]; - -struct packet_tracker { - unsigned short size_track[MAX_ENQUEUED_SIZE]; - unsigned short next_read; - unsigned short next_write; - unsigned short last_remain; - unsigned short ioat_space; -}; - -struct packet_tracker cb_tracker[MAX_VHOST_DEVICE]; - -int -open_ioat(const char *value) -{ - struct dma_for_vhost *dma_info = dma_bind; - char *input = strndup(value, strlen(value) + 1); - char *addrs = input; - char *ptrs[2]; - char *start, *end, *substr; - int64_t vid, vring_id; - struct rte_ioat_rawdev_config config; - struct rte_rawdev_info info = { .dev_private = &config }; - char name[32]; - int dev_id; - int ret = 0; - uint16_t i = 0; - char *dma_arg[MAX_VHOST_DEVICE]; - int args_nr; - - while (isblank(*addrs)) - addrs++; - if (*addrs == '\0') { - ret = -1; - goto out; - } - - /* process DMA devices within bracket. 
*/ - addrs++; - substr = strtok(addrs, ";]"); - if (!substr) { - ret = -1; - goto out; - } - args_nr = rte_strsplit(substr, strlen(substr), - dma_arg, MAX_VHOST_DEVICE, ','); - if (args_nr <= 0) { - ret = -1; - goto out; - } - while (i < args_nr) { - char *arg_temp = dma_arg[i]; - uint8_t sub_nr; - sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); - if (sub_nr != 2) { - ret = -1; - goto out; - } - - start = strstr(ptrs[0], "txd"); - if (start == NULL) { - ret = -1; - goto out; - } - - start += 3; - vid = strtol(start, &end, 0); - if (end == start) { - ret = -1; - goto out; - } - - vring_id = 0 + VIRTIO_RXQ; - if (rte_pci_addr_parse(ptrs[1], - &(dma_info + vid)->dmas[vring_id].addr) < 0) { - ret = -1; - goto out; - } - - rte_pci_device_name(&(dma_info + vid)->dmas[vring_id].addr, - name, sizeof(name)); - dev_id = rte_rawdev_get_dev_id(name); - if (dev_id == (uint16_t)(-ENODEV) || - dev_id == (uint16_t)(-EINVAL)) { - ret = -1; - goto out; - } - - if (rte_rawdev_info_get(dev_id, &info, sizeof(config)) < 0 || - strstr(info.driver_name, "ioat") == NULL) { - ret = -1; - goto out; - } - - (dma_info + vid)->dmas[vring_id].dev_id = dev_id; - (dma_info + vid)->dmas[vring_id].is_valid = true; - config.ring_size = IOAT_RING_SIZE; - config.hdls_disable = true; - if (rte_rawdev_configure(dev_id, &info, sizeof(config)) < 0) { - ret = -1; - goto out; - } - rte_rawdev_start(dev_id); - cb_tracker[dev_id].ioat_space = IOAT_RING_SIZE - 1; - dma_info->nr++; - i++; - } -out: - free(input); - return ret; -} - -int32_t -ioat_transfer_data_cb(int vid, uint16_t queue_id, - struct rte_vhost_iov_iter *iov_iter, - struct rte_vhost_async_status *opaque_data, uint16_t count) -{ - uint32_t i_iter; - uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2 + VIRTIO_RXQ].dev_id; - struct rte_vhost_iov_iter *iter = NULL; - unsigned long i_seg; - unsigned short mask = MAX_ENQUEUED_SIZE - 1; - unsigned short write = cb_tracker[dev_id].next_write; - - if (!opaque_data) { - for (i_iter = 0; 
i_iter < count; i_iter++) { - iter = iov_iter + i_iter; - i_seg = 0; - if (cb_tracker[dev_id].ioat_space < iter->nr_segs) - break; - while (i_seg < iter->nr_segs) { - rte_ioat_enqueue_copy(dev_id, - (uintptr_t)(iter->iov[i_seg].src_addr), - (uintptr_t)(iter->iov[i_seg].dst_addr), - iter->iov[i_seg].len, - 0, - 0); - i_seg++; - } - write &= mask; - cb_tracker[dev_id].size_track[write] = iter->nr_segs; - cb_tracker[dev_id].ioat_space -= iter->nr_segs; - write++; - } - } else { - /* Opaque data is not supported */ - return -1; - } - /* ring the doorbell */ - rte_ioat_perform_ops(dev_id); - cb_tracker[dev_id].next_write = write; - return i_iter; -} - -int32_t -ioat_check_completed_copies_cb(int vid, uint16_t queue_id, - struct rte_vhost_async_status *opaque_data, - uint16_t max_packets) -{ - if (!opaque_data) { - uintptr_t dump[255]; - int n_seg; - unsigned short read, write; - unsigned short nb_packet = 0; - unsigned short mask = MAX_ENQUEUED_SIZE - 1; - unsigned short i; - - uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2 - + VIRTIO_RXQ].dev_id; - n_seg = rte_ioat_completed_ops(dev_id, 255, NULL, NULL, dump, dump); - if (n_seg < 0) { - RTE_LOG(ERR, - VHOST_DATA, - "fail to poll completed buf on IOAT device %u", - dev_id); - return 0; - } - if (n_seg == 0) - return 0; - - cb_tracker[dev_id].ioat_space += n_seg; - n_seg += cb_tracker[dev_id].last_remain; - - read = cb_tracker[dev_id].next_read; - write = cb_tracker[dev_id].next_write; - for (i = 0; i < max_packets; i++) { - read &= mask; - if (read == write) - break; - if (n_seg >= cb_tracker[dev_id].size_track[read]) { - n_seg -= cb_tracker[dev_id].size_track[read]; - read++; - nb_packet++; - } else { - break; - } - } - cb_tracker[dev_id].next_read = read; - cb_tracker[dev_id].last_remain = n_seg; - return nb_packet; - } - /* Opaque data is not supported */ - return -1; -} - -#endif /* RTE_RAW_IOAT */ diff --git a/examples/vhost/ioat.h b/examples/vhost/ioat.h deleted file mode 100644 index d9bf717e8d..0000000000 --- 
a/examples/vhost/ioat.h +++ /dev/null @@ -1,63 +0,0 @@ -/* SPDX-License-Identifier: BSD-3-Clause - * Copyright(c) 2010-2020 Intel Corporation - */ - -#ifndef _IOAT_H_ -#define _IOAT_H_ - -#include <rte_vhost.h> -#include <rte_pci.h> -#include <rte_vhost_async.h> - -#define MAX_VHOST_DEVICE 1024 -#define IOAT_RING_SIZE 4096 -#define MAX_ENQUEUED_SIZE 4096 - -struct dma_info { - struct rte_pci_addr addr; - uint16_t dev_id; - bool is_valid; -}; - -struct dma_for_vhost { - struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; - uint16_t nr; -}; - -#ifdef RTE_RAW_IOAT -int open_ioat(const char *value); - -int32_t -ioat_transfer_data_cb(int vid, uint16_t queue_id, - struct rte_vhost_iov_iter *iov_iter, - struct rte_vhost_async_status *opaque_data, uint16_t count); - -int32_t -ioat_check_completed_copies_cb(int vid, uint16_t queue_id, - struct rte_vhost_async_status *opaque_data, - uint16_t max_packets); -#else -static int open_ioat(const char *value __rte_unused) -{ - return -1; -} - -static int32_t -ioat_transfer_data_cb(int vid __rte_unused, uint16_t queue_id __rte_unused, - struct rte_vhost_iov_iter *iov_iter __rte_unused, - struct rte_vhost_async_status *opaque_data __rte_unused, - uint16_t count __rte_unused) -{ - return -1; -} - -static int32_t -ioat_check_completed_copies_cb(int vid __rte_unused, - uint16_t queue_id __rte_unused, - struct rte_vhost_async_status *opaque_data __rte_unused, - uint16_t max_packets __rte_unused) -{ - return -1; -} -#endif -#endif /* _IOAT_H_ */ diff --git a/examples/vhost/main.c b/examples/vhost/main.c index 590a77c723..3e784f5c6f 100644 --- a/examples/vhost/main.c +++ b/examples/vhost/main.c @@ -24,8 +24,9 @@ #include <rte_ip.h> #include <rte_tcp.h> #include <rte_pause.h> +#include <rte_dmadev.h> +#include <rte_vhost_async.h> -#include "ioat.h" #include "main.h" #ifndef MAX_QUEUES @@ -56,6 +57,13 @@ #define RTE_TEST_TX_DESC_DEFAULT 512 #define INVALID_PORT_ID 0xFF +#define INVALID_DMA_ID -1 + +#define DMA_RING_SIZE 4096 + +struct 
dma_for_vhost dma_bind[RTE_MAX_VHOST_DEVICE]; +int16_t dmas_id[RTE_DMADEV_DEFAULT_MAX]; +static int dma_count; /* mask of enabled ports */ static uint32_t enabled_port_mask = 0; @@ -94,10 +102,6 @@ static int client_mode; static int builtin_net_driver; -static int async_vhost_driver; - -static char *dma_type; - /* Specify timeout (in useconds) between retries on RX. */ static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US; /* Specify the number of retries on RX. */ @@ -191,18 +195,150 @@ struct mbuf_table lcore_tx_queue[RTE_MAX_LCORE]; * Every data core maintains a TX buffer for every vhost device, * which is used for batch pkts enqueue for higher performance. */ -struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * MAX_VHOST_DEVICE]; +struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * RTE_MAX_VHOST_DEVICE]; #define MBUF_TABLE_DRAIN_TSC ((rte_get_tsc_hz() + US_PER_S - 1) \ / US_PER_S * BURST_TX_DRAIN_US) +static inline bool +is_dma_configured(int16_t dev_id) +{ + int i; + + for (i = 0; i < dma_count; i++) + if (dmas_id[i] == dev_id) + return true; + return false; +} + static inline int open_dma(const char *value) { - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) - return open_ioat(value); + struct dma_for_vhost *dma_info = dma_bind; + char *input = strndup(value, strlen(value) + 1); + char *addrs = input; + char *ptrs[2]; + char *start, *end, *substr; + int64_t vid; + + struct rte_dma_info info; + struct rte_dma_conf dev_config = { .nb_vchans = 1 }; + struct rte_dma_vchan_conf qconf = { + .direction = RTE_DMA_DIR_MEM_TO_MEM, + .nb_desc = DMA_RING_SIZE + }; + + int dev_id; + int ret = 0; + uint16_t i = 0; + char *dma_arg[RTE_MAX_VHOST_DEVICE]; + int args_nr; + + while (isblank(*addrs)) + addrs++; + if (*addrs == '\0') { + ret = -1; + goto out; + } + + /* process DMA devices within bracket. 
*/ + addrs++; + substr = strtok(addrs, ";]"); + if (!substr) { + ret = -1; + goto out; + } + + args_nr = rte_strsplit(substr, strlen(substr), dma_arg, RTE_MAX_VHOST_DEVICE, ','); + if (args_nr <= 0) { + ret = -1; + goto out; + } + + while (i < args_nr) { + char *arg_temp = dma_arg[i]; + uint8_t sub_nr; + + sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@'); + if (sub_nr != 2) { + ret = -1; + goto out; + } + + start = strstr(ptrs[0], "txd"); + if (start == NULL) { + ret = -1; + goto out; + } + + start += 3; + vid = strtol(start, &end, 0); + if (end == start) { + ret = -1; + goto out; + } + + dev_id = rte_dma_get_dev_id_by_name(ptrs[1]); + if (dev_id < 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to find DMA %s.\n", ptrs[1]); + ret = -1; + goto out; + } + + /* DMA device is already configured, so skip */ + if (is_dma_configured(dev_id)) + goto done; + + if (rte_dma_info_get(dev_id, &info) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Error with rte_dma_info_get()\n"); + ret = -1; + goto out; + } + + if (info.max_vchans < 1) { + RTE_LOG(ERR, VHOST_CONFIG, "No channels available on device %d\n", dev_id); + ret = -1; + goto out; + } - return -1; + if (rte_dma_configure(dev_id, &dev_config) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to configure DMA %d.\n", dev_id); + ret = -1; + goto out; + } + + /* Check the max desc supported by DMA device */ + rte_dma_info_get(dev_id, &info); + if (info.nb_vchans != 1) { + RTE_LOG(ERR, VHOST_CONFIG, "No configured queues reported by DMA %d.\n", + dev_id); + ret = -1; + goto out; + } + + qconf.nb_desc = RTE_MIN(DMA_RING_SIZE, info.max_desc); + + if (rte_dma_vchan_setup(dev_id, 0, &qconf) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to set up DMA %d.\n", dev_id); + ret = -1; + goto out; + } + + if (rte_dma_start(dev_id) != 0) { + RTE_LOG(ERR, VHOST_CONFIG, "Fail to start DMA %u.\n", dev_id); + ret = -1; + goto out; + } + + dmas_id[dma_count++] = dev_id; + +done: + (dma_info + vid)->dmas[VIRTIO_RXQ].dev_id = dev_id; + i++; + } +out: + 
free(input); + return ret; } /* @@ -500,8 +636,6 @@ enum { OPT_CLIENT_NUM, #define OPT_BUILTIN_NET_DRIVER "builtin-net-driver" OPT_BUILTIN_NET_DRIVER_NUM, -#define OPT_DMA_TYPE "dma-type" - OPT_DMA_TYPE_NUM, #define OPT_DMAS "dmas" OPT_DMAS_NUM, }; @@ -539,8 +673,6 @@ us_vhost_parse_args(int argc, char **argv) NULL, OPT_CLIENT_NUM}, {OPT_BUILTIN_NET_DRIVER, no_argument, NULL, OPT_BUILTIN_NET_DRIVER_NUM}, - {OPT_DMA_TYPE, required_argument, - NULL, OPT_DMA_TYPE_NUM}, {OPT_DMAS, required_argument, NULL, OPT_DMAS_NUM}, {NULL, 0, 0, 0}, @@ -661,10 +793,6 @@ us_vhost_parse_args(int argc, char **argv) } break; - case OPT_DMA_TYPE_NUM: - dma_type = optarg; - break; - case OPT_DMAS_NUM: if (open_dma(optarg) == -1) { RTE_LOG(INFO, VHOST_CONFIG, @@ -672,7 +800,6 @@ us_vhost_parse_args(int argc, char **argv) us_vhost_usage(prgname); return -1; } - async_vhost_driver = 1; break; case OPT_CLIENT_NUM: @@ -841,9 +968,10 @@ complete_async_pkts(struct vhost_dev *vdev) { struct rte_mbuf *p_cpl[MAX_PKT_BURST]; uint16_t complete_count; + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; complete_count = rte_vhost_poll_enqueue_completed(vdev->vid, - VIRTIO_RXQ, p_cpl, MAX_PKT_BURST); + VIRTIO_RXQ, p_cpl, MAX_PKT_BURST, dma_id, 0); if (complete_count) { free_pkts(p_cpl, complete_count); __atomic_sub_fetch(&vdev->pkts_inflight, complete_count, __ATOMIC_SEQ_CST); @@ -877,17 +1005,18 @@ static __rte_always_inline void drain_vhost(struct vhost_dev *vdev) { uint16_t ret; - uint32_t buff_idx = rte_lcore_id() * MAX_VHOST_DEVICE + vdev->vid; + uint32_t buff_idx = rte_lcore_id() * RTE_MAX_VHOST_DEVICE + vdev->vid; uint16_t nr_xmit = vhost_txbuff[buff_idx]->len; struct rte_mbuf **m = vhost_txbuff[buff_idx]->m_table; if (builtin_net_driver) { ret = vs_enqueue_pkts(vdev, VIRTIO_RXQ, m, nr_xmit); - } else if (async_vhost_driver) { + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { uint16_t enqueue_fail = 0; + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; 
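The `--dmas [txd0@0000:00:04.0,...]` parsing inside `open_dma()` above can be reduced to a libc-only sketch. `parse_txd_arg()` is a hypothetical helper standing in for the `strstr()`/`strtol()` steps of the real code (which also uses `rte_strsplit()` to split the bracketed list first):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse one "txd<vid>@<device-name>" argument in place.
 * Returns the vid encoded after "txd", or -1 on malformed input;
 * on success, *dev_name points into 'arg' at the device name. */
static long parse_txd_arg(char *arg, char **dev_name)
{
    char *at = strchr(arg, '@');
    if (at == NULL)
        return -1;
    *at = '\0';                       /* split "txdN" from the device name */
    *dev_name = at + 1;

    char *start = strstr(arg, "txd");
    if (start == NULL)
        return -1;

    char *end;
    long vid = strtol(start + 3, &end, 0);
    if (end == start + 3)             /* no digits after "txd" */
        return -1;
    return vid;
}
```

As in the patch, the vid selects the `dma_bind[]` slot and the device name is then resolved to a dmadev id (`rte_dma_get_dev_id_by_name()` in the real code).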
complete_async_pkts(vdev); - ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit); + ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit, dma_id, 0); __atomic_add_fetch(&vdev->pkts_inflight, ret, __ATOMIC_SEQ_CST); enqueue_fail = nr_xmit - ret; @@ -905,7 +1034,7 @@ drain_vhost(struct vhost_dev *vdev) __ATOMIC_SEQ_CST); } - if (!async_vhost_driver) + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) free_pkts(m, nr_xmit); } @@ -921,8 +1050,7 @@ drain_vhost_table(void) if (unlikely(vdev->remove == 1)) continue; - vhost_txq = vhost_txbuff[lcore_id * MAX_VHOST_DEVICE - + vdev->vid]; + vhost_txq = vhost_txbuff[lcore_id * RTE_MAX_VHOST_DEVICE + vdev->vid]; cur_tsc = rte_rdtsc(); if (unlikely(cur_tsc - vhost_txq->pre_tsc @@ -970,7 +1098,7 @@ virtio_tx_local(struct vhost_dev *vdev, struct rte_mbuf *m) return 0; } - vhost_txq = vhost_txbuff[lcore_id * MAX_VHOST_DEVICE + dst_vdev->vid]; + vhost_txq = vhost_txbuff[lcore_id * RTE_MAX_VHOST_DEVICE + dst_vdev->vid]; vhost_txq->m_table[vhost_txq->len++] = m; if (enable_stats) { @@ -1211,12 +1339,13 @@ drain_eth_rx(struct vhost_dev *vdev) if (builtin_net_driver) { enqueue_count = vs_enqueue_pkts(vdev, VIRTIO_RXQ, pkts, rx_count); - } else if (async_vhost_driver) { + } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) { uint16_t enqueue_fail = 0; + int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id; complete_async_pkts(vdev); enqueue_count = rte_vhost_submit_enqueue_burst(vdev->vid, - VIRTIO_RXQ, pkts, rx_count); + VIRTIO_RXQ, pkts, rx_count, dma_id, 0); __atomic_add_fetch(&vdev->pkts_inflight, enqueue_count, __ATOMIC_SEQ_CST); enqueue_fail = rx_count - enqueue_count; @@ -1235,7 +1364,7 @@ drain_eth_rx(struct vhost_dev *vdev) __ATOMIC_SEQ_CST); } - if (!async_vhost_driver) + if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) free_pkts(pkts, rx_count); } @@ -1357,7 +1486,7 @@ destroy_device(int vid) } for (i = 0; i < RTE_MAX_LCORE; i++) - rte_free(vhost_txbuff[i 
* MAX_VHOST_DEVICE + vid]); + rte_free(vhost_txbuff[i * RTE_MAX_VHOST_DEVICE + vid]); if (builtin_net_driver) vs_vhost_net_remove(vdev); @@ -1387,18 +1516,20 @@ destroy_device(int vid) "(%d) device has been removed from data core\n", vdev->vid); - if (async_vhost_driver) { + if (dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled) { uint16_t n_pkt = 0; + int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; struct rte_mbuf *m_cpl[vdev->pkts_inflight]; while (vdev->pkts_inflight) { n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, VIRTIO_RXQ, - m_cpl, vdev->pkts_inflight); + m_cpl, vdev->pkts_inflight, dma_id, 0); free_pkts(m_cpl, n_pkt); __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, __ATOMIC_SEQ_CST); } rte_vhost_async_channel_unregister(vid, VIRTIO_RXQ); + dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = false; } rte_free(vdev); @@ -1425,12 +1556,12 @@ new_device(int vid) vdev->vid = vid; for (i = 0; i < RTE_MAX_LCORE; i++) { - vhost_txbuff[i * MAX_VHOST_DEVICE + vid] + vhost_txbuff[i * RTE_MAX_VHOST_DEVICE + vid] = rte_zmalloc("vhost bufftable", sizeof(struct vhost_bufftable), RTE_CACHE_LINE_SIZE); - if (vhost_txbuff[i * MAX_VHOST_DEVICE + vid] == NULL) { + if (vhost_txbuff[i * RTE_MAX_VHOST_DEVICE + vid] == NULL) { RTE_LOG(INFO, VHOST_DATA, "(%d) couldn't allocate memory for vhost TX\n", vid); return -1; @@ -1468,20 +1599,13 @@ new_device(int vid) "(%d) device has been added to data core %d\n", vid, vdev->coreid); - if (async_vhost_driver) { - struct rte_vhost_async_config config = {0}; - struct rte_vhost_async_channel_ops channel_ops; - - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) { - channel_ops.transfer_data = ioat_transfer_data_cb; - channel_ops.check_completed_copies = - ioat_check_completed_copies_cb; - - config.features = RTE_VHOST_ASYNC_INORDER; + if (dma_bind[vid].dmas[VIRTIO_RXQ].dev_id != INVALID_DMA_ID) { + int ret; - return rte_vhost_async_channel_register(vid, VIRTIO_RXQ, - config, &channel_ops); - } + ret = 
rte_vhost_async_channel_register(vid, VIRTIO_RXQ); + if (ret == 0) + dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = true; + return ret; } return 0; @@ -1502,14 +1626,15 @@ vring_state_changed(int vid, uint16_t queue_id, int enable) if (queue_id != VIRTIO_RXQ) return 0; - if (async_vhost_driver) { + if (dma_bind[vid].dmas[queue_id].async_enabled) { if (!enable) { uint16_t n_pkt = 0; + int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id; struct rte_mbuf *m_cpl[vdev->pkts_inflight]; while (vdev->pkts_inflight) { n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, queue_id, - m_cpl, vdev->pkts_inflight); + m_cpl, vdev->pkts_inflight, dma_id, 0); free_pkts(m_cpl, n_pkt); __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, __ATOMIC_SEQ_CST); } @@ -1657,6 +1782,24 @@ create_mbuf_pool(uint16_t nr_port, uint32_t nr_switch_core, uint32_t mbuf_size, rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n"); } +static void +reset_dma(void) +{ + int i; + + for (i = 0; i < RTE_MAX_VHOST_DEVICE; i++) { + int j; + + for (j = 0; j < RTE_MAX_QUEUES_PER_PORT * 2; j++) { + dma_bind[i].dmas[j].dev_id = INVALID_DMA_ID; + dma_bind[i].dmas[j].async_enabled = false; + } + } + + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) + dmas_id[i] = INVALID_DMA_ID; +} + /* * Main function, does initialisation and calls the per-lcore functions. */ @@ -1679,6 +1822,9 @@ main(int argc, char *argv[]) argc -= ret; argv += ret; + /* initialize dma structures */ + reset_dma(); + /* parse app arguments */ ret = us_vhost_parse_args(argc, argv); if (ret < 0) @@ -1754,11 +1900,18 @@ main(int argc, char *argv[]) if (client_mode) flags |= RTE_VHOST_USER_CLIENT; + for (i = 0; i < dma_count; i++) { + if (rte_vhost_async_dma_configure(dmas_id[i], 0) < 0) { + RTE_LOG(ERR, VHOST_PORT, "Failed to configure DMA in vhost.\n"); + rte_exit(EXIT_FAILURE, "Cannot use given DMA device\n"); + } + } + /* Register vhost user driver to handle vhost messages. 
*/ for (i = 0; i < nb_sockets; i++) { char *file = socket_files + i * PATH_MAX; - if (async_vhost_driver) + if (dma_count) flags = flags | RTE_VHOST_USER_ASYNC_COPY; ret = rte_vhost_driver_register(file, flags); diff --git a/examples/vhost/main.h b/examples/vhost/main.h index e7b1ac60a6..b4a453e77e 100644 --- a/examples/vhost/main.h +++ b/examples/vhost/main.h @@ -8,6 +8,7 @@ #include <sys/queue.h> #include <rte_ether.h> +#include <rte_pci.h> /* Macros for printing using RTE_LOG */ #define RTE_LOGTYPE_VHOST_CONFIG RTE_LOGTYPE_USER1 @@ -79,6 +80,16 @@ struct lcore_info { struct vhost_dev_tailq_list vdev_list; }; +struct dma_info { + struct rte_pci_addr addr; + int16_t dev_id; + bool async_enabled; +}; + +struct dma_for_vhost { + struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2]; +}; + /* we implement non-extra virtio net features */ #define VIRTIO_NET_FEATURES 0 diff --git a/examples/vhost/meson.build b/examples/vhost/meson.build index 3efd5e6540..87a637f83f 100644 --- a/examples/vhost/meson.build +++ b/examples/vhost/meson.build @@ -12,13 +12,9 @@ if not is_linux endif deps += 'vhost' +deps += 'dmadev' allow_experimental_apis = true sources = files( 'main.c', 'virtio_net.c', ) - -if dpdk_conf.has('RTE_RAW_IOAT') - deps += 'raw_ioat' - sources += files('ioat.c') -endif diff --git a/lib/vhost/meson.build b/lib/vhost/meson.build index cdb37a4814..bc7272053b 100644 --- a/lib/vhost/meson.build +++ b/lib/vhost/meson.build @@ -36,4 +36,4 @@ headers = files( driver_sdk_headers = files( 'vdpa_driver.h', ) -deps += ['ethdev', 'cryptodev', 'hash', 'pci'] +deps += ['ethdev', 'cryptodev', 'hash', 'pci', 'dmadev'] diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h index b454c05868..15c37dd26e 100644 --- a/lib/vhost/rte_vhost.h +++ b/lib/vhost/rte_vhost.h @@ -113,6 +113,8 @@ extern "C" { #define VHOST_USER_F_PROTOCOL_FEATURES 30 #endif +#define RTE_MAX_VHOST_DEVICE 1024 + struct rte_vdpa_device; /** diff --git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h 
index a87ea6ba37..15d4c51fc1 100644 --- a/lib/vhost/rte_vhost_async.h +++ b/lib/vhost/rte_vhost_async.h @@ -5,94 +5,6 @@ #ifndef _RTE_VHOST_ASYNC_H_ #define _RTE_VHOST_ASYNC_H_ -#include "rte_vhost.h" - -/** - * iovec - */ -struct rte_vhost_iovec { - void *src_addr; - void *dst_addr; - size_t len; -}; - -/** - * iovec iterator - */ -struct rte_vhost_iov_iter { - /** pointer to the iovec array */ - struct rte_vhost_iovec *iov; - /** number of iovec in this iterator */ - unsigned long nr_segs; -}; - -/** - * dma transfer status - */ -struct rte_vhost_async_status { - /** An array of application specific data for source memory */ - uintptr_t *src_opaque_data; - /** An array of application specific data for destination memory */ - uintptr_t *dst_opaque_data; -}; - -/** - * dma operation callbacks to be implemented by applications - */ -struct rte_vhost_async_channel_ops { - /** - * instruct async engines to perform copies for a batch of packets - * - * @param vid - * id of vhost device to perform data copies - * @param queue_id - * queue id to perform data copies - * @param iov_iter - * an array of IOV iterators - * @param opaque_data - * opaque data pair sending to DMA engine - * @param count - * number of elements in the "descs" array - * @return - * number of IOV iterators processed, negative value means error - */ - int32_t (*transfer_data)(int vid, uint16_t queue_id, - struct rte_vhost_iov_iter *iov_iter, - struct rte_vhost_async_status *opaque_data, - uint16_t count); - /** - * check copy-completed packets from the async engine - * @param vid - * id of vhost device to check copy completion - * @param queue_id - * queue id to check copy completion - * @param opaque_data - * buffer to receive the opaque data pair from DMA engine - * @param max_packets - * max number of packets could be completed - * @return - * number of async descs completed, negative value means error - */ - int32_t (*check_completed_copies)(int vid, uint16_t queue_id, - struct 
rte_vhost_async_status *opaque_data, - uint16_t max_packets); -}; - -/** - * async channel features - */ -enum { - RTE_VHOST_ASYNC_INORDER = 1U << 0, -}; - -/** - * async channel configuration - */ -struct rte_vhost_async_config { - uint32_t features; - uint32_t rsvd[2]; -}; - /** * Register an async channel for a vhost queue * @@ -100,17 +12,11 @@ struct rte_vhost_async_config { * vhost device id async channel to be attached to * @param queue_id * vhost queue id async channel to be attached to - * @param config - * Async channel configuration structure - * @param ops - * Async channel operation callbacks * @return * 0 on success, -1 on failures */ __rte_experimental -int rte_vhost_async_channel_register(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops); +int rte_vhost_async_channel_register(int vid, uint16_t queue_id); /** * Unregister an async channel for a vhost queue @@ -136,17 +42,11 @@ int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id); * vhost device id async channel to be attached to * @param queue_id * vhost queue id async channel to be attached to - * @param config - * Async channel configuration - * @param ops - * Async channel operation callbacks * @return * 0 on success, -1 on failures */ __rte_experimental -int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops); +int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id); /** * Unregister an async channel for a vhost queue without performing any @@ -179,12 +79,17 @@ int rte_vhost_async_channel_unregister_thread_unsafe(int vid, * array of packets to be enqueued * @param count * packets num to be enqueued + * @param dma_id + * the identifier of DMA device + * @param vchan_id + * the identifier of virtual DMA channel * @return * num of packets enqueued */ __rte_experimental uint16_t 
rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count); + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id); /** * This function checks async completion status for a specific vhost @@ -199,12 +104,17 @@ uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, * blank array to get return packet pointer * @param count * size of the packet array + * @param dma_id + * the identifier of DMA device + * @param vchan_id + * the identifier of virtual DMA channel * @return * num of packets returned */ __rte_experimental uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count); + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id); /** * This function returns the amount of in-flight packets for the vhost @@ -235,11 +145,36 @@ int rte_vhost_async_get_inflight(int vid, uint16_t queue_id); * Blank array to get return packet pointer * @param count * Size of the packet array + * @param dma_id + * the identifier of DMA device + * @param vchan_id + * the identifier of virtual DMA channel * @return * Number of packets returned */ __rte_experimental uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count); + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id); +/** + * The DMA vChannels used in asynchronous data path must be configured + * first. So this function needs to be called before enabling DMA + * acceleration for vring. If this function fails, the given DMA vChannel + * cannot be used in asynchronous data path. + * + * DMA devices used in data-path must belong to DMA devices given in this + * function. But users are free to use DMA devices given in the function + * in non-vhost scenarios, only if guarantee no copies in vhost are + * offloaded to them at the same time. 
+ * + * @param dma_id + * the identifier of DMA device + * @param vchan_id + * the identifier of virtual DMA channel + * @return + * 0 on success, and -1 on failure + */ +__rte_experimental +int rte_vhost_async_dma_configure(int16_t dma_id, uint16_t vchan_id); #endif /* _RTE_VHOST_ASYNC_H_ */ diff --git a/lib/vhost/version.map b/lib/vhost/version.map index a7ef7f1976..1202ba9c1a 100644 --- a/lib/vhost/version.map +++ b/lib/vhost/version.map @@ -84,6 +84,9 @@ EXPERIMENTAL { # added in 21.11 rte_vhost_get_monitor_addr; + + # added in 22.03 + rte_vhost_async_dma_configure; }; INTERNAL { diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c index f59ca6c157..6bcb716de0 100644 --- a/lib/vhost/vhost.c +++ b/lib/vhost/vhost.c @@ -25,7 +25,7 @@ #include "vhost.h" #include "vhost_user.h" -struct virtio_net *vhost_devices[MAX_VHOST_DEVICE]; +struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE]; pthread_mutex_t vhost_dev_lock = PTHREAD_MUTEX_INITIALIZER; /* Called with iotlb_lock read-locked */ @@ -343,6 +343,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq) return; rte_free(vq->async->pkts_info); + rte_free(vq->async->pkts_cmpl_flag); rte_free(vq->async->buffers_packed); vq->async->buffers_packed = NULL; @@ -665,12 +666,12 @@ vhost_new_device(void) int i; pthread_mutex_lock(&vhost_dev_lock); - for (i = 0; i < MAX_VHOST_DEVICE; i++) { + for (i = 0; i < RTE_MAX_VHOST_DEVICE; i++) { if (vhost_devices[i] == NULL) break; } - if (i == MAX_VHOST_DEVICE) { + if (i == RTE_MAX_VHOST_DEVICE) { VHOST_LOG_CONFIG(ERR, "failed to find a free slot for new device.\n"); pthread_mutex_unlock(&vhost_dev_lock); return -1; @@ -1621,8 +1622,7 @@ rte_vhost_extern_callback_register(int vid, } static __rte_always_inline int -async_channel_register(int vid, uint16_t queue_id, - struct rte_vhost_async_channel_ops *ops) +async_channel_register(int vid, uint16_t queue_id) { struct virtio_net *dev = get_device(vid); struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; @@ -1651,6 +1651,14 @@ 
async_channel_register(int vid, uint16_t queue_id, goto out_free_async; } + async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size * sizeof(bool), + RTE_CACHE_LINE_SIZE, node); + if (!async->pkts_cmpl_flag) { + VHOST_LOG_CONFIG(ERR, "(%s) failed to allocate async pkts_cmpl_flag (qid: %d)\n", + dev->ifname, queue_id); + goto out_free_async; + } + if (vq_is_packed(dev)) { async->buffers_packed = rte_malloc_socket(NULL, vq->size * sizeof(struct vring_used_elem_packed), @@ -1671,9 +1679,6 @@ async_channel_register(int vid, uint16_t queue_id, } } - async->ops.check_completed_copies = ops->check_completed_copies; - async->ops.transfer_data = ops->transfer_data; - vq->async = async; return 0; @@ -1686,15 +1691,13 @@ async_channel_register(int vid, uint16_t queue_id, } int -rte_vhost_async_channel_register(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops) +rte_vhost_async_channel_register(int vid, uint16_t queue_id) { struct vhost_virtqueue *vq; struct virtio_net *dev = get_device(vid); int ret; - if (dev == NULL || ops == NULL) + if (dev == NULL) return -1; if (queue_id >= VHOST_MAX_VRING) @@ -1705,33 +1708,20 @@ rte_vhost_async_channel_register(int vid, uint16_t queue_id, if (unlikely(vq == NULL || !dev->async_copy)) return -1; - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { - VHOST_LOG_CONFIG(ERR, - "(%s) async copy is not supported on non-inorder mode (qid: %d)\n", - dev->ifname, queue_id); - return -1; - } - - if (unlikely(ops->check_completed_copies == NULL || - ops->transfer_data == NULL)) - return -1; - rte_spinlock_lock(&vq->access_lock); - ret = async_channel_register(vid, queue_id, ops); + ret = async_channel_register(vid, queue_id); rte_spinlock_unlock(&vq->access_lock); return ret; } int -rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, - struct rte_vhost_async_config config, - struct rte_vhost_async_channel_ops *ops) 
+rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id) { struct vhost_virtqueue *vq; struct virtio_net *dev = get_device(vid); - if (dev == NULL || ops == NULL) + if (dev == NULL) return -1; if (queue_id >= VHOST_MAX_VRING) @@ -1742,18 +1732,7 @@ rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id, if (unlikely(vq == NULL || !dev->async_copy)) return -1; - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) { - VHOST_LOG_CONFIG(ERR, - "(%s) async copy is not supported on non-inorder mode (qid: %d)\n", - dev->ifname, queue_id); - return -1; - } - - if (unlikely(ops->check_completed_copies == NULL || - ops->transfer_data == NULL)) - return -1; - - return async_channel_register(vid, queue_id, ops); + return async_channel_register(vid, queue_id); } int @@ -1832,6 +1811,68 @@ rte_vhost_async_channel_unregister_thread_unsafe(int vid, uint16_t queue_id) return 0; } +int +rte_vhost_async_dma_configure(int16_t dma_id, uint16_t vchan_id) +{ + struct rte_dma_info info; + void *pkts_cmpl_flag_addr; + uint16_t max_desc; + + if (!rte_dma_is_valid(dma_id)) { + VHOST_LOG_CONFIG(ERR, "DMA %d is not found.\n", dma_id); + return -1; + } + + rte_dma_info_get(dma_id, &info); + if (vchan_id >= info.max_vchans) { + VHOST_LOG_CONFIG(ERR, "Invalid DMA %d vChannel %u.\n", dma_id, vchan_id); + return -1; + } + + if (!dma_copy_track[dma_id].vchans) { + struct async_dma_vchan_info *vchans; + + vchans = rte_zmalloc(NULL, sizeof(struct async_dma_vchan_info) * info.max_vchans, + RTE_CACHE_LINE_SIZE); + if (vchans == NULL) { + VHOST_LOG_CONFIG(ERR, "Failed to allocate vchans for DMA %d vChannel %u.\n", + dma_id, vchan_id); + return -1; + } + + dma_copy_track[dma_id].vchans = vchans; + } + + if (dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr) { + VHOST_LOG_CONFIG(INFO, "DMA %d vChannel %u already registered.\n", dma_id, + vchan_id); + return 0; + } + + max_desc = info.max_desc; + if (!rte_is_power_of_2(max_desc)) + max_desc = 
rte_align32pow2(max_desc); + + pkts_cmpl_flag_addr = rte_zmalloc(NULL, sizeof(bool *) * max_desc, RTE_CACHE_LINE_SIZE); + if (!pkts_cmpl_flag_addr) { + VHOST_LOG_CONFIG(ERR, "Failed to allocate pkts_cmpl_flag_addr for DMA %d " + "vChannel %u.\n", dma_id, vchan_id); + + if (dma_copy_track[dma_id].nr_vchans == 0) { + rte_free(dma_copy_track[dma_id].vchans); + dma_copy_track[dma_id].vchans = NULL; + } + return -1; + } + + dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr = pkts_cmpl_flag_addr; + dma_copy_track[dma_id].vchans[vchan_id].ring_size = max_desc; + dma_copy_track[dma_id].vchans[vchan_id].ring_mask = max_desc - 1; + dma_copy_track[dma_id].nr_vchans++; + + return 0; +} + int rte_vhost_async_get_inflight(int vid, uint16_t queue_id) { diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h index b3f0c1d07c..1c2ee29600 100644 --- a/lib/vhost/vhost.h +++ b/lib/vhost/vhost.h @@ -19,6 +19,7 @@ #include <rte_ether.h> #include <rte_rwlock.h> #include <rte_malloc.h> +#include <rte_dmadev.h> #include "rte_vhost.h" #include "rte_vdpa.h" @@ -50,6 +51,9 @@ #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST) #define VHOST_MAX_ASYNC_VEC 2048 +#define VIRTIO_MAX_RX_PKTLEN 9728U +#define VHOST_DMA_MAX_COPY_COMPLETE ((VIRTIO_MAX_RX_PKTLEN / RTE_MBUF_DEFAULT_DATAROOM) \ + * MAX_PKT_BURST) #define PACKED_DESC_ENQUEUE_USED_FLAG(w) \ ((w) ? 
(VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \ @@ -119,6 +123,58 @@ struct vring_used_elem_packed { uint32_t count; }; +/** + * iovec + */ +struct vhost_iovec { + void *src_addr; + void *dst_addr; + size_t len; +}; + +/** + * iovec iterator + */ +struct vhost_iov_iter { + /** pointer to the iovec array */ + struct vhost_iovec *iov; + /** number of iovec in this iterator */ + unsigned long nr_segs; +}; + +struct async_dma_vchan_info { + /* circular array to track if packet copy completes */ + bool **pkts_cmpl_flag_addr; + + /* max elements in 'pkts_cmpl_flag_addr' */ + uint16_t ring_size; + /* ring index mask for 'pkts_cmpl_flag_addr' */ + uint16_t ring_mask; + + /** + * DMA virtual channel lock. Although it is possible to bind DMA + * virtual channels to data plane threads, the vhost control plane + * thread may call data plane functions too, thus causing + * DMA device contention. + * + * For example, in the VM exit case, the vhost control plane thread + * needs to clear in-flight packets before disabling the vring, while + * another data plane thread could be enqueuing packets to the same + * vring with the same DMA virtual channel. As dmadev PMD functions + * are lock-free, the control plane and data plane threads could + * operate the same DMA virtual channel at the same time. 
+ */ + rte_spinlock_t dma_lock; +}; + +struct async_dma_info { + struct async_dma_vchan_info *vchans; + /* number of registered virtual channels */ + uint16_t nr_vchans; +}; + +extern struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; + /** * inflight async packet information */ @@ -129,16 +185,32 @@ struct async_inflight_info { }; struct vhost_async { - /* operation callbacks for DMA */ - struct rte_vhost_async_channel_ops ops; - - struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; - struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; + struct vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT]; + struct vhost_iovec iovec[VHOST_MAX_ASYNC_VEC]; uint16_t iter_idx; uint16_t iovec_idx; /* data transfer status */ struct async_inflight_info *pkts_info; + /** + * Packet reorder array. "true" indicates that the DMA device + * has completed all copies for the packet. + * + * Note that this array could be written by multiple threads + * simultaneously. For example, in the case where thread0 and + * thread1 receive packets from the NIC and then enqueue them to + * vring0 and vring1 with their own DMA devices DMA0 and DMA1, it's + * possible for thread0 to get completed copies belonging to + * vring1 from DMA0, while thread0 is calling rte_vhost_poll + * _enqueue_completed() for vring0 and thread1 is calling + * rte_vhost_submit_enqueue_burst() for vring1. In this case, + * vq->access_lock cannot protect pkts_cmpl_flag of vring1. + * + * However, since offloading is done on a per-packet basis, each + * packet flag will only be written by one thread, and a single-byte + * write is atomic, so no lock for pkts_cmpl_flag is needed. 
+ */ + bool *pkts_cmpl_flag; uint16_t pkts_idx; uint16_t pkts_inflight_n; union { @@ -568,8 +640,7 @@ extern int vhost_data_log_level; #define PRINT_PACKET(device, addr, size, header) do {} while (0) #endif -#define MAX_VHOST_DEVICE 1024 -extern struct virtio_net *vhost_devices[MAX_VHOST_DEVICE]; +extern struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE]; #define VHOST_BINARY_SEARCH_THRESH 256 diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c index f19713137c..886e076b28 100644 --- a/lib/vhost/virtio_net.c +++ b/lib/vhost/virtio_net.c @@ -11,6 +11,7 @@ #include <rte_net.h> #include <rte_ether.h> #include <rte_ip.h> +#include <rte_dmadev.h> #include <rte_vhost.h> #include <rte_tcp.h> #include <rte_udp.h> @@ -25,6 +26,9 @@ #define MAX_BATCH_LEN 256 +/* DMA device copy operation tracking array. */ +struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX]; + static __rte_always_inline bool rxvq_is_mergeable(struct virtio_net *dev) { @@ -43,6 +47,135 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring) return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring; } +static __rte_always_inline int64_t +vhost_async_dma_transfer_one(struct virtio_net *dev, struct vhost_virtqueue *vq, + int16_t dma_id, uint16_t vchan_id, uint16_t flag_idx, + struct vhost_iov_iter *pkt) +{ + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan_id]; + uint16_t ring_mask = dma_info->ring_mask; + static bool vhost_async_dma_copy_log; + + + struct vhost_iovec *iov = pkt->iov; + int copy_idx = 0; + uint32_t nr_segs = pkt->nr_segs; + uint16_t i; + + if (rte_dma_burst_capacity(dma_id, vchan_id) < nr_segs) + return -1; + + for (i = 0; i < nr_segs; i++) { + copy_idx = rte_dma_copy(dma_id, vchan_id, (rte_iova_t)iov[i].src_addr, + (rte_iova_t)iov[i].dst_addr, iov[i].len, RTE_DMA_OP_FLAG_LLC); + /** + * Since all memory is pinned and DMA vChannel + * ring has enough space, failure should be a + * rare case. 
If failure happens, it means DMA + * device encounters serious errors; in this + * case, please stop async data-path and check + * what has happened to DMA device. + */ + if (unlikely(copy_idx < 0)) { + if (!vhost_async_dma_copy_log) { + VHOST_LOG_DATA(ERR, "(%s) DMA copy failed for channel %d:%u\n", + dev->ifname, dma_id, vchan_id); + vhost_async_dma_copy_log = true; + } + return -1; + } + } + + /** + * Only store packet completion flag address in the last copy's + * slot, and other slots are set to NULL. + */ + dma_info->pkts_cmpl_flag_addr[copy_idx & ring_mask] = &vq->async->pkts_cmpl_flag[flag_idx]; + + return nr_segs; +} + +static __rte_always_inline uint16_t +vhost_async_dma_transfer(struct virtio_net *dev, struct vhost_virtqueue *vq, + int16_t dma_id, uint16_t vchan_id, uint16_t head_idx, + struct vhost_iov_iter *pkts, uint16_t nr_pkts) +{ + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan_id]; + int64_t ret, nr_copies = 0; + uint16_t pkt_idx; + + rte_spinlock_lock(&dma_info->dma_lock); + + for (pkt_idx = 0; pkt_idx < nr_pkts; pkt_idx++) { + ret = vhost_async_dma_transfer_one(dev, vq, dma_id, vchan_id, head_idx, + &pkts[pkt_idx]); + if (unlikely(ret < 0)) + break; + + nr_copies += ret; + head_idx++; + if (head_idx >= vq->size) + head_idx -= vq->size; + } + + if (likely(nr_copies > 0)) + rte_dma_submit(dma_id, vchan_id); + + rte_spinlock_unlock(&dma_info->dma_lock); + + return pkt_idx; +} + +static __rte_always_inline uint16_t +vhost_async_dma_check_completed(struct virtio_net *dev, int16_t dma_id, uint16_t vchan_id, + uint16_t max_pkts) +{ + struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan_id]; + uint16_t ring_mask = dma_info->ring_mask; + uint16_t last_idx = 0; + uint16_t nr_copies; + uint16_t copy_idx; + uint16_t i; + bool has_error = false; + static bool vhost_async_dma_complete_log; + + rte_spinlock_lock(&dma_info->dma_lock); + + /** + * Print error log for debugging, if DMA reports error during + * 
DMA transfer. We do not handle error in vhost level. + */ + nr_copies = rte_dma_completed(dma_id, vchan_id, max_pkts, &last_idx, &has_error); + if (unlikely(!vhost_async_dma_complete_log && has_error)) { + VHOST_LOG_DATA(ERR, "(%s) DMA completion failure on channel %d:%u\n", dev->ifname, + dma_id, vchan_id); + vhost_async_dma_complete_log = true; + } else if (nr_copies == 0) { + goto out; + } + + copy_idx = last_idx - nr_copies + 1; + for (i = 0; i < nr_copies; i++) { + bool *flag; + + flag = dma_info->pkts_cmpl_flag_addr[copy_idx & ring_mask]; + if (flag) { + /** + * Mark the packet flag as received. The flag + * could belong to another virtqueue but write + * is atomic. + */ + *flag = true; + dma_info->pkts_cmpl_flag_addr[copy_idx & ring_mask] = NULL; + } + copy_idx++; + } + +out: + rte_spinlock_unlock(&dma_info->dma_lock); + return nr_copies; +} + static inline void do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq) { @@ -794,7 +927,7 @@ copy_vnet_hdr_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq, static __rte_always_inline int async_iter_initialize(struct virtio_net *dev, struct vhost_async *async) { - struct rte_vhost_iov_iter *iter; + struct vhost_iov_iter *iter; if (unlikely(async->iovec_idx >= VHOST_MAX_ASYNC_VEC)) { VHOST_LOG_DATA(ERR, "(%s) no more async iovec available\n", dev->ifname); @@ -812,8 +945,8 @@ static __rte_always_inline int async_iter_add_iovec(struct virtio_net *dev, struct vhost_async *async, void *src, void *dst, size_t len) { - struct rte_vhost_iov_iter *iter; - struct rte_vhost_iovec *iovec; + struct vhost_iov_iter *iter; + struct vhost_iovec *iovec; if (unlikely(async->iovec_idx >= VHOST_MAX_ASYNC_VEC)) { static bool vhost_max_async_vec_log; @@ -848,7 +981,7 @@ async_iter_finalize(struct vhost_async *async) static __rte_always_inline void async_iter_cancel(struct vhost_async *async) { - struct rte_vhost_iov_iter *iter; + struct vhost_iov_iter *iter; iter = async->iov_iter + async->iter_idx; 
async->iovec_idx -= iter->nr_segs; @@ -1448,9 +1581,9 @@ store_dma_desc_info_packed(struct vring_used_elem_packed *s_ring, } static __rte_noinline uint32_t -virtio_dev_rx_async_submit_split(struct virtio_net *dev, - struct vhost_virtqueue *vq, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) +virtio_dev_rx_async_submit_split(struct virtio_net *dev, struct vhost_virtqueue *vq, + uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count, + int16_t dma_id, uint16_t vchan_id) { struct buf_vector buf_vec[BUF_VECTOR_MAX]; uint32_t pkt_idx = 0; @@ -1460,7 +1593,7 @@ virtio_dev_rx_async_submit_split(struct virtio_net *dev, struct vhost_async *async = vq->async; struct async_inflight_info *pkts_info = async->pkts_info; uint32_t pkt_err = 0; - int32_t n_xfer; + uint16_t n_xfer; uint16_t slot_idx = 0; /* @@ -1502,17 +1635,16 @@ virtio_dev_rx_async_submit_split(struct virtio_net *dev, if (unlikely(pkt_idx == 0)) return 0; - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx); - if (unlikely(n_xfer < 0)) { - VHOST_LOG_DATA(ERR, "(%s) %s: failed to transfer data for queue id %d.\n", - dev->ifname, __func__, queue_id); - n_xfer = 0; - } + n_xfer = vhost_async_dma_transfer(dev, vq, dma_id, vchan_id, async->pkts_idx, + async->iov_iter, pkt_idx); pkt_err = pkt_idx - n_xfer; if (unlikely(pkt_err)) { uint16_t num_descs = 0; + VHOST_LOG_DATA(DEBUG, "(%s) %s: failed to transfer %u packets for queue %u.\n", + dev->ifname, __func__, pkt_err, queue_id); + /* update number of completed packets */ pkt_idx = n_xfer; @@ -1655,13 +1787,13 @@ dma_error_handler_packed(struct vhost_virtqueue *vq, uint16_t slot_idx, } static __rte_noinline uint32_t -virtio_dev_rx_async_submit_packed(struct virtio_net *dev, - struct vhost_virtqueue *vq, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) +virtio_dev_rx_async_submit_packed(struct virtio_net *dev, struct vhost_virtqueue *vq, + uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count, + int16_t 
dma_id, uint16_t vchan_id) { uint32_t pkt_idx = 0; uint32_t remained = count; - int32_t n_xfer; + uint16_t n_xfer; uint16_t num_buffers; uint16_t num_descs; @@ -1693,19 +1825,17 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev, if (unlikely(pkt_idx == 0)) return 0; - n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx); - if (unlikely(n_xfer < 0)) { - VHOST_LOG_DATA(ERR, "(%s) %s: failed to transfer data for queue id %d.\n", - dev->ifname, __func__, queue_id); - n_xfer = 0; - } - - pkt_err = pkt_idx - n_xfer; + n_xfer = vhost_async_dma_transfer(dev, vq, dma_id, vchan_id, async->pkts_idx, + async->iov_iter, pkt_idx); async_iter_reset(async); - if (unlikely(pkt_err)) + pkt_err = pkt_idx - n_xfer; + if (unlikely(pkt_err)) { + VHOST_LOG_DATA(DEBUG, "(%s) %s: failed to transfer %u packets for queue %u.\n", + dev->ifname, __func__, pkt_err, queue_id); dma_error_handler_packed(vq, slot_idx, pkt_err, &pkt_idx); + } if (likely(vq->shadow_used_idx)) { /* keep used descriptors. 
*/ @@ -1825,28 +1955,40 @@ write_back_completed_descs_packed(struct vhost_virtqueue *vq, static __rte_always_inline uint16_t vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id) { struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; struct vhost_async *async = vq->async; struct async_inflight_info *pkts_info = async->pkts_info; - int32_t n_cpl; + uint16_t nr_cpl_pkts = 0; uint16_t n_descs = 0, n_buffers = 0; uint16_t start_idx, from, i; - n_cpl = async->ops.check_completed_copies(dev->vid, queue_id, 0, count); - if (unlikely(n_cpl < 0)) { - VHOST_LOG_DATA(ERR, "(%s) %s: failed to check completed copies for queue id %d.\n", - dev->ifname, __func__, queue_id); - return 0; + /* Check completed copies for the given DMA vChannel */ + vhost_async_dma_check_completed(dev, dma_id, vchan_id, VHOST_DMA_MAX_COPY_COMPLETE); + + start_idx = async_get_first_inflight_pkt_idx(vq); + /** + * Calculate the number of copy completed packets. + * Note that there may be completed packets even if + * no copies are reported done by the given DMA vChannel, + * as it's possible that a virtqueue uses multiple DMA + * vChannels. 
+ */ + from = start_idx; + while (vq->async->pkts_cmpl_flag[from] && count--) { + vq->async->pkts_cmpl_flag[from] = false; + from++; + if (from >= vq->size) + from -= vq->size; + nr_cpl_pkts++; } - if (n_cpl == 0) + if (nr_cpl_pkts == 0) return 0; - start_idx = async_get_first_inflight_pkt_idx(vq); - - for (i = 0; i < n_cpl; i++) { + for (i = 0; i < nr_cpl_pkts; i++) { from = (start_idx + i) % vq->size; /* Only used with packed ring */ n_buffers += pkts_info[from].nr_buffers; @@ -1855,7 +1997,7 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, pkts[i] = pkts_info[from].mbuf; } - async->pkts_inflight_n -= n_cpl; + async->pkts_inflight_n -= nr_cpl_pkts; if (likely(vq->enabled && vq->access_ok)) { if (vq_is_packed(dev)) { @@ -1876,12 +2018,13 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id, } } - return n_cpl; + return nr_cpl_pkts; } uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id) { struct virtio_net *dev = get_device(vid); struct vhost_virtqueue *vq; @@ -1897,18 +2040,30 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, return 0; } + if (unlikely(!dma_copy_track[dma_id].vchans || + !dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr)) { + VHOST_LOG_DATA(ERR, "(%s) %s: invalid channel %d:%u.\n", dev->ifname, __func__, + dma_id, vchan_id); + return 0; + } + vq = dev->virtqueue[queue_id]; - if (unlikely(!vq->async)) { - VHOST_LOG_DATA(ERR, "(%s) %s: async not registered for queue id %d.\n", - dev->ifname, __func__, queue_id); + if (!rte_spinlock_trylock(&vq->access_lock)) { + VHOST_LOG_DATA(DEBUG, "(%s) %s: virtqueue %u is busy.\n", dev->ifname, __func__, + queue_id); return 0; } - rte_spinlock_lock(&vq->access_lock); + if (unlikely(!vq->async)) { + VHOST_LOG_DATA(ERR, "(%s) %s: async not registered for virtqueue %d.\n", + dev->ifname, __func__, queue_id); 
+ goto out; + } - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, vchan_id); +out: rte_spinlock_unlock(&vq->access_lock); return n_pkts_cpl; @@ -1916,7 +2071,8 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id) { struct virtio_net *dev = get_device(vid); struct vhost_virtqueue *vq; @@ -1940,14 +2096,21 @@ rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, return 0; } - n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count); + if (unlikely(!dma_copy_track[dma_id].vchans || + !dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr)) { + VHOST_LOG_DATA(ERR, "(%s) %s: invalid channel %d:%u.\n", dev->ifname, __func__, + dma_id, vchan_id); + return 0; + } + + n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, vchan_id); return n_pkts_cpl; } static __rte_always_inline uint32_t virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) + struct rte_mbuf **pkts, uint32_t count, int16_t dma_id, uint16_t vchan_id) { struct vhost_virtqueue *vq; uint32_t nb_tx = 0; @@ -1959,6 +2122,13 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, return 0; } + if (unlikely(!dma_copy_track[dma_id].vchans || + !dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr)) { + VHOST_LOG_DATA(ERR, "(%s) %s: invalid channel %d:%u.\n", dev->ifname, __func__, + dma_id, vchan_id); + return 0; + } + vq = dev->virtqueue[queue_id]; rte_spinlock_lock(&vq->access_lock); @@ -1979,10 +2149,10 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, if (vq_is_packed(dev)) nb_tx = virtio_dev_rx_async_submit_packed(dev, vq, queue_id, - pkts, count); + 
pkts, count, dma_id, vchan_id); else nb_tx = virtio_dev_rx_async_submit_split(dev, vq, queue_id, - pkts, count); + pkts, count, dma_id, vchan_id); out: if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) @@ -1996,7 +2166,8 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id, uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id, + uint16_t vchan_id) { struct virtio_net *dev = get_device(vid); @@ -2009,7 +2180,7 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, return 0; } - return virtio_dev_rx_async_submit(dev, queue_id, pkts, count); + return virtio_dev_rx_async_submit(dev, queue_id, pkts, count, dma_id, vchan_id); } static inline bool -- 2.25.1 ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path 2022-02-09 12:51 ` [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path Jiayu Hu @ 2022-02-10 7:58 ` Yang, YvonneX 2022-02-10 13:44 ` Maxime Coquelin ` (3 subsequent siblings) 4 siblings, 0 replies; 31+ messages in thread From: Yang, YvonneX @ 2022-02-10 7:58 UTC (permalink / raw) To: Hu, Jiayu, dev Cc: maxime.coquelin, i.maximets, Xia, Chenbo, Ding, Xuan, Jiang, Cheng1, liangma, Hu, Jiayu, Pai G, Sunil > -----Original Message----- > From: Jiayu Hu <jiayu.hu@intel.com> > Sent: Wednesday, February 9, 2022 8:52 PM > To: dev@dpdk.org > Cc: maxime.coquelin@redhat.com; i.maximets@ovn.org; Xia, Chenbo > <chenbo.xia@intel.com>; Ding, Xuan <xuan.ding@intel.com>; Jiang, Cheng1 > <cheng1.jiang@intel.com>; liangma@liangbit.com; Hu, Jiayu > <jiayu.hu@intel.com>; Pai G, Sunil <sunil.pai.g@intel.com> > Subject: [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path > > Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA > abstraction layer and simplify application logics, this patch integrates > dmadev in asynchronous data path. > > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> > Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> > --- Tested-by: Yvonne Yang <yvonnex.yang@intel.com> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path 2022-02-09 12:51 ` [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path Jiayu Hu 2022-02-10 7:58 ` Yang, YvonneX @ 2022-02-10 13:44 ` Maxime Coquelin 2022-02-10 15:14 ` Maxime Coquelin ` (2 subsequent siblings) 4 siblings, 0 replies; 31+ messages in thread From: Maxime Coquelin @ 2022-02-10 13:44 UTC (permalink / raw) To: Jiayu Hu, dev Cc: i.maximets, chenbo.xia, xuan.ding, cheng1.jiang, liangma, Sunil Pai G On 2/9/22 13:51, Jiayu Hu wrote: > Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA > abstraction layer and simplify application logics, this patch integrates > dmadev in asynchronous data path. > > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> > Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> > --- > doc/guides/prog_guide/vhost_lib.rst | 100 +++++----- > examples/vhost/Makefile | 2 +- > examples/vhost/ioat.c | 218 ---------------------- > examples/vhost/ioat.h | 63 ------- > examples/vhost/main.c | 253 ++++++++++++++++++++----- > examples/vhost/main.h | 11 ++ > examples/vhost/meson.build | 6 +- > lib/vhost/meson.build | 2 +- > lib/vhost/rte_vhost.h | 2 + > lib/vhost/rte_vhost_async.h | 145 ++++----------- > lib/vhost/version.map | 3 + > lib/vhost/vhost.c | 121 ++++++++---- > lib/vhost/vhost.h | 85 ++++++++- > lib/vhost/virtio_net.c | 277 ++++++++++++++++++++++------ > 14 files changed, 693 insertions(+), 595 deletions(-) > delete mode 100644 examples/vhost/ioat.c > delete mode 100644 examples/vhost/ioat.h > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Thanks, Maxime ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path 2022-02-09 12:51 ` [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path Jiayu Hu 2022-02-10 7:58 ` Yang, YvonneX 2022-02-10 13:44 ` Maxime Coquelin @ 2022-02-10 15:14 ` Maxime Coquelin 2022-02-10 20:50 ` Ferruh Yigit 2022-02-10 20:56 ` Ferruh Yigit 4 siblings, 0 replies; 31+ messages in thread From: Maxime Coquelin @ 2022-02-10 15:14 UTC (permalink / raw) To: Jiayu Hu, dev Cc: i.maximets, chenbo.xia, xuan.ding, cheng1.jiang, liangma, Sunil Pai G On 2/9/22 13:51, Jiayu Hu wrote: > Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA > abstraction layer and simplify application logics, this patch integrates > dmadev in asynchronous data path. > > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> > Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> > --- > doc/guides/prog_guide/vhost_lib.rst | 100 +++++----- > examples/vhost/Makefile | 2 +- > examples/vhost/ioat.c | 218 ---------------------- > examples/vhost/ioat.h | 63 ------- > examples/vhost/main.c | 253 ++++++++++++++++++++----- > examples/vhost/main.h | 11 ++ > examples/vhost/meson.build | 6 +- > lib/vhost/meson.build | 2 +- > lib/vhost/rte_vhost.h | 2 + > lib/vhost/rte_vhost_async.h | 145 ++++----------- > lib/vhost/version.map | 3 + > lib/vhost/vhost.c | 121 ++++++++---- > lib/vhost/vhost.h | 85 ++++++++- > lib/vhost/virtio_net.c | 277 ++++++++++++++++++++++------ > 14 files changed, 693 insertions(+), 595 deletions(-) > delete mode 100644 examples/vhost/ioat.c > delete mode 100644 examples/vhost/ioat.h > Applied to dpdk-next-virtio/main. Thanks, Maxime ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path 2022-02-09 12:51 ` [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path Jiayu Hu ` (2 preceding siblings ...) 2022-02-10 15:14 ` Maxime Coquelin @ 2022-02-10 20:50 ` Ferruh Yigit 2022-02-10 21:01 ` Maxime Coquelin 2022-02-10 20:56 ` Ferruh Yigit 4 siblings, 1 reply; 31+ messages in thread From: Ferruh Yigit @ 2022-02-10 20:50 UTC (permalink / raw) To: maxime.coquelin, dpdklab, Ali Alnubani Cc: i.maximets, chenbo.xia, xuan.ding, cheng1.jiang, liangma, Sunil Pai G, Jiayu Hu, dev, ci, Thomas Monjalon, David Marchand, Aaron Conole On 2/9/2022 12:51 PM, Jiayu Hu wrote: > Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA > abstraction layer and simplify application logics, this patch integrates > dmadev in asynchronous data path. > > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> > Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> CI not run on this patch because it failed to apply [1]. (There is a build error with this patch but CI seems not able to catch it because not able to apply it...) Patch seems applied on top of 'main' repo [2]. Maxime, Ali, Lab, In which tree should vhost (lib/vhost) patches be applied and tested? main, next-net-virtio or next-net? [1] http://mails.dpdk.org/archives/test-report/2022-February/258064.html [2] https://lab.dpdk.org/results/dashboard/patchsets/20962/ Applied on dpdk (0dff3f26d6faad4e51f75e5245f0387ee9bb0c6d) ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path 2022-02-10 20:50 ` Ferruh Yigit @ 2022-02-10 21:01 ` Maxime Coquelin 0 siblings, 0 replies; 31+ messages in thread From: Maxime Coquelin @ 2022-02-10 21:01 UTC (permalink / raw) To: Ferruh Yigit, dpdklab, Ali Alnubani Cc: i.maximets, chenbo.xia, xuan.ding, cheng1.jiang, liangma, Sunil Pai G, Jiayu Hu, dev, ci, Thomas Monjalon, David Marchand, Aaron Conole On 2/10/22 21:50, Ferruh Yigit wrote: > On 2/9/2022 12:51 PM, Jiayu Hu wrote: >> Since dmadev is introduced in 21.11, to avoid the overhead of vhost DMA >> abstraction layer and simplify application logics, this patch integrates >> dmadev in asynchronous data path. >> >> Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> >> Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> > > CI not run on this patch because it failed to apply [1]. > (There is a build error with this patch but CI seems not able to catch > it because not able to apply it...) > > Patch seems applied on top of 'main' repo [2]. > > Maxime, Ali, Lab, > In which tree should vhost (lib/vhost) patches be applied and tested? > main, next-net-virtio or next-net? It should be on top of dpdk-next-virtio tree, as mentioned in the MAINTAINERS file. > > > [1] > http://mails.dpdk.org/archives/test-report/2022-February/258064.html > > > > [2] > https://lab.dpdk.org/results/dashboard/patchsets/20962/ > Applied on dpdk (0dff3f26d6faad4e51f75e5245f0387ee9bb0c6d) > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path
  2022-02-09 12:51 ` [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path Jiayu Hu
                     ` (3 preceding siblings ...)
  2022-02-10 20:50 ` Ferruh Yigit
@ 2022-02-10 20:56 ` Ferruh Yigit
  2022-02-10 21:00   ` Maxime Coquelin
  4 siblings, 1 reply; 31+ messages in thread
From: Ferruh Yigit @ 2022-02-10 20:56 UTC (permalink / raw)
  To: Jiayu Hu, dev
  Cc: maxime.coquelin, i.maximets, chenbo.xia, xuan.ding, cheng1.jiang,
	liangma, Sunil Pai G

On 2/9/2022 12:51 PM, Jiayu Hu wrote:
> Since dmadev was introduced in 21.11, to avoid the overhead of the vhost
> DMA abstraction layer and simplify application logic, this patch
> integrates dmadev in the asynchronous data path.
>
> Signed-off-by: Jiayu Hu <jiayu.hu@intel.com>
> Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com>

The patch gives a build error with './devtools/test-meson-builds.sh' [1],
for the minimum build test [2].

This seems to be because the new header file (rte_vhost_async.h) is
included by 'buildtools/chkincs' and is missing the includes it depends
on.

Fixed in next-net by adding the includes [3];
please confirm the latest patch in next-net:

[1]
In file included from buildtools/chkincs/chkincs.p/rte_vhost_async.c:1:
/opt/dpdk_maintain/self/dpdk/lib/vhost/rte_vhost_async.h:18:19: error: expected ‘;’ before ‘int’
   18 | __rte_experimental
      |                   ^
      |                   ;
   19 | int rte_vhost_async_channel_register(int vid, uint16_t queue_id);
      |     ~~~
/opt/dpdk_maintain/self/dpdk/lib/vhost/rte_vhost_async.h:19:47: error: unknown type name ‘uint16_t’
   19 | int rte_vhost_async_channel_register(int vid, uint16_t queue_id);
      |                                               ^~~~~~~~
/opt/dpdk_maintain/self/dpdk/lib/vhost/rte_vhost_async.h:1:1: note: ‘uint16_t’ is defined in header ‘<stdint.h>’; did you forget to ‘#include <stdint.h>’?
  +++ |+#include <stdint.h>
    1 | /* SPDX-License-Identifier: BSD-3-Clause
/opt/dpdk_maintain/self/dpdk/lib/vhost/rte_vhost_async.h:31:19: error: expected ‘;’ before ‘int’
   31 | __rte_experimental
      |                   ^
      |                   ;
   32 | int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
      |     ~~~

In file included from buildtools/chkincs/chkincs.p/rte_vhost_async.c:1:
/opt/dpdk_maintain/self/dpdk/lib/vhost/rte_vhost_async.h:95:24: error: ‘struct rte_mbuf’ declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
   95 |                 struct rte_mbuf **pkts, uint16_t count, int16_t dma_id,
      |

[2]
meson -Dexamples=all --buildtype=debugoptimized --werror
--default-library=shared -Ddisable_libs=*
-Denable_drivers=bus/vdev,mempool/ring,net/null
/opt/dpdk_maintain/self/dpdk/devtools/.. ./build-mini

[3]
diff --git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h
index 11e6cfa7cb8d..b202c5540e5b 100644
--- a/lib/vhost/rte_vhost_async.h
+++ b/lib/vhost/rte_vhost_async.h
@@ -5,6 +5,11 @@
 #ifndef _RTE_VHOST_ASYNC_H_
 #define _RTE_VHOST_ASYNC_H_

+#include <stdint.h>
+
+#include <rte_compat.h>
+#include <rte_mbuf.h>
+
 /**
  * Register an async channel for a vhost queue
  *

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path
  2022-02-10 20:56 ` Ferruh Yigit
@ 2022-02-10 21:00   ` Maxime Coquelin
  0 siblings, 0 replies; 31+ messages in thread
From: Maxime Coquelin @ 2022-02-10 21:00 UTC (permalink / raw)
  To: Ferruh Yigit, Jiayu Hu, dev
  Cc: i.maximets, chenbo.xia, xuan.ding, cheng1.jiang, liangma, Sunil Pai G

Hi Ferruh,

On 2/10/22 21:56, Ferruh Yigit wrote:
> On 2/9/2022 12:51 PM, Jiayu Hu wrote:
>> Since dmadev was introduced in 21.11, to avoid the overhead of the vhost
>> DMA abstraction layer and simplify application logic, this patch
>> integrates dmadev in the asynchronous data path.
>>
>> Signed-off-by: Jiayu Hu <jiayu.hu@intel.com>
>> Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com>
>
> The patch gives a build error with './devtools/test-meson-builds.sh' [1],
> for the minimum build test [2].

Sorry, I didn't run this script, so I didn't face this issue.

> This seems to be because the new header file (rte_vhost_async.h) is
> included by 'buildtools/chkincs' and is missing the includes it depends
> on.
>
> Fixed in next-net by adding the includes [3];
> please confirm the latest patch in next-net:

I agree with the changes you suggest.

Thanks,
Maxime

> [1]
> In file included from buildtools/chkincs/chkincs.p/rte_vhost_async.c:1:
> /opt/dpdk_maintain/self/dpdk/lib/vhost/rte_vhost_async.h:18:19: error: expected ‘;’ before ‘int’
>    18 | __rte_experimental
>       |                   ^
>       |                   ;
>    19 | int rte_vhost_async_channel_register(int vid, uint16_t queue_id);
>       |     ~~~
> /opt/dpdk_maintain/self/dpdk/lib/vhost/rte_vhost_async.h:19:47: error: unknown type name ‘uint16_t’
>    19 | int rte_vhost_async_channel_register(int vid, uint16_t queue_id);
>       |                                               ^~~~~~~~
> /opt/dpdk_maintain/self/dpdk/lib/vhost/rte_vhost_async.h:1:1: note: ‘uint16_t’ is defined in header ‘<stdint.h>’; did you forget to ‘#include <stdint.h>’?
>   +++ |+#include <stdint.h>
>     1 | /* SPDX-License-Identifier: BSD-3-Clause
> /opt/dpdk_maintain/self/dpdk/lib/vhost/rte_vhost_async.h:31:19: error: expected ‘;’ before ‘int’
>    31 | __rte_experimental
>       |                   ^
>       |                   ;
>    32 | int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
>       |     ~~~
>
> In file included from buildtools/chkincs/chkincs.p/rte_vhost_async.c:1:
> /opt/dpdk_maintain/self/dpdk/lib/vhost/rte_vhost_async.h:95:24: error: ‘struct rte_mbuf’ declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
>    95 |                 struct rte_mbuf **pkts, uint16_t count, int16_t dma_id,
>       |
>
> [2]
> meson -Dexamples=all --buildtype=debugoptimized --werror
> --default-library=shared -Ddisable_libs=*
> -Denable_drivers=bus/vdev,mempool/ring,net/null
> /opt/dpdk_maintain/self/dpdk/devtools/.. ./build-mini
>
> [3]
> diff --git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h
> index 11e6cfa7cb8d..b202c5540e5b 100644
> --- a/lib/vhost/rte_vhost_async.h
> +++ b/lib/vhost/rte_vhost_async.h
> @@ -5,6 +5,11 @@
>  #ifndef _RTE_VHOST_ASYNC_H_
>  #define _RTE_VHOST_ASYNC_H_
>
> +#include <stdint.h>
> +
> +#include <rte_compat.h>
> +#include <rte_mbuf.h>
> +
>  /**
>   * Register an async channel for a vhost queue
>   *
>

^ permalink raw reply	[flat|nested] 31+ messages in thread
end of thread, other threads:[~2022-02-10 21:02 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-22 10:54 [RFC 0/1] integrate dmadev in vhost Jiayu Hu
2021-11-22 10:54 ` [RFC 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu
2021-12-24 10:39   ` Maxime Coquelin
2021-12-28  1:15     ` Hu, Jiayu
2022-01-03 10:26       ` Maxime Coquelin
2022-01-06  5:46         ` Hu, Jiayu
2021-12-03  3:49   ` [RFC 0/1] integrate dmadev in vhost fengchengwen
2021-12-30 21:55 ` [PATCH v1 " Jiayu Hu
2021-12-30 21:55   ` [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu
2021-12-31  0:55     ` Liang Ma
2022-01-14  6:30     ` Xia, Chenbo
2022-01-17  5:39       ` Hu, Jiayu
2022-01-19  2:18         ` Xia, Chenbo
2022-01-20 17:00           ` Maxime Coquelin
2022-01-21  1:56             ` Hu, Jiayu
2022-01-24 16:40 ` [PATCH v2 0/1] integrate dmadev in vhost Jiayu Hu
2022-01-24 16:40   ` [PATCH v2 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu
2022-02-03 13:04     ` Maxime Coquelin
2022-02-07  1:34       ` Hu, Jiayu
2022-02-08 10:40 ` [PATCH v3 0/1] integrate dmadev in vhost Jiayu Hu
2022-02-08 10:40   ` [PATCH v3 1/1] vhost: integrate dmadev in asynchronous data-path Jiayu Hu
2022-02-08 17:46     ` Maxime Coquelin
2022-02-09 12:51 ` [PATCH v4 0/1] integrate dmadev in vhost Jiayu Hu
2022-02-09 12:51   ` [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path Jiayu Hu
2022-02-10  7:58     ` Yang, YvonneX
2022-02-10 13:44     ` Maxime Coquelin
2022-02-10 15:14     ` Maxime Coquelin
2022-02-10 20:50     ` Ferruh Yigit
2022-02-10 21:01       ` Maxime Coquelin
2022-02-10 20:56     ` Ferruh Yigit
2022-02-10 21:00       ` Maxime Coquelin