DPDK patches and discussions
 help / color / Atom feed
* [dpdk-dev] [PATCH v1 0/5] vhost add vectorized data path
@ 2020-08-19  3:24 Marvin Liu
  2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 1/5] vhost: " Marvin Liu
                   ` (4 more replies)
  0 siblings, 5 replies; 36+ messages in thread
From: Marvin Liu @ 2020-08-19  3:24 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Packed ring format is imported since virtio spec 1.1. All descriptors
are compacted into one single ring when packed ring format is on. It is
straight forward that ring operations can be accelerated by utilizing
SIMD instructions. 

This patch set will introduce vectorized data path in vhost library. If
vectorized option is on, operations like descs check, descs writeback,
address translation will be accelerated by SIMD instructions. Vhost
application can choose whether using vectorized acceleration, it is 
like external buffer and zero copy features. 

If platform or ring format not support vectorized function, vhost will
fallback to use default batch function. There will be no impact in current
data path.

Marvin Liu (5):
  vhost: add vectorized data path
  vhost: reuse packed ring functions
  vhost: prepare memory regions addresses
  vhost: add packed ring vectorized dequeue
  vhost: add packed ring vectorized enqueue

 doc/guides/nics/vhost.rst           |   5 +
 doc/guides/prog_guide/vhost_lib.rst |  12 ++
 drivers/net/vhost/rte_eth_vhost.c   |  17 +-
 lib/librte_vhost/Makefile           |  13 ++
 lib/librte_vhost/meson.build        |  16 ++
 lib/librte_vhost/rte_vhost.h        |   1 +
 lib/librte_vhost/socket.c           |   5 +
 lib/librte_vhost/vhost.c            |  11 ++
 lib/librte_vhost/vhost.h            | 235 ++++++++++++++++++++++
 lib/librte_vhost/vhost_user.c       |  11 ++
 lib/librte_vhost/vhost_vec_avx.c    | 292 ++++++++++++++++++++++++++++
 lib/librte_vhost/virtio_net.c       | 257 ++++--------------------
 12 files changed, 659 insertions(+), 216 deletions(-)
 create mode 100644 lib/librte_vhost/vhost_vec_avx.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v1 1/5] vhost: add vectorized data path
  2020-08-19  3:24 [dpdk-dev] [PATCH v1 0/5] vhost add vectorized data path Marvin Liu
@ 2020-08-19  3:24 ` " Marvin Liu
  2020-09-21  6:48   ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
  2020-10-09  8:14   ` [dpdk-dev] [PATCH v3 " Marvin Liu
  2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 2/5] vhost: reuse packed ring functions Marvin Liu
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 36+ messages in thread
From: Marvin Liu @ 2020-08-19  3:24 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Packed ring operations are split into batch and single functions for
performance perspective. Ring operations in batch function can be
accelerated by SIMD instructions like AVX512.

So introduce vectorized parameter in vhost. Vectorized data path can be
selected if platform and ring format matched requirements. Otherwise
will fallback to original data path.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/doc/guides/nics/vhost.rst b/doc/guides/nics/vhost.rst
index d36f3120b2..efdaf4de09 100644
--- a/doc/guides/nics/vhost.rst
+++ b/doc/guides/nics/vhost.rst
@@ -64,6 +64,11 @@ The user can specify below arguments in `--vdev` option.
     It is used to enable external buffer support in vhost library.
     (Default: 0 (disabled))
 
+#.  ``vectorized``:
+
+    It is used to enable vectorized data path support in vhost library.
+    (Default: 0 (disabled))
+
 Vhost PMD event handling
 ------------------------
 
diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index b892eec67a..d5d421441c 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -162,6 +162,18 @@ The following is an overview of some key Vhost API functions:
 
     It is disabled by default.
 
+ - ``RTE_VHOST_USER_VECTORIZED``
+    Vectorized data path will used when this flag is set. When packed ring
+    enabled, available descriptors are stored from frontend driver in sequence.
+    SIMD instructions like AVX can be used to handle multiple descriptors
+    simultaneously. Thus can accelerate the throughput of ring operations.
+
+    * Only packed ring has vectorized data path.
+
+    * Will fallback to normal datapath if no vectorization support.
+
+    It is disabled by default.
+
 * ``rte_vhost_driver_set_features(path, features)``
 
   This function sets the feature bits the vhost-user driver supports. The
diff --git a/drivers/net/vhost/rte_eth_vhost.c b/drivers/net/vhost/rte_eth_vhost.c
index e55278af69..2ba5a2a076 100644
--- a/drivers/net/vhost/rte_eth_vhost.c
+++ b/drivers/net/vhost/rte_eth_vhost.c
@@ -35,6 +35,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
 #define ETH_VHOST_VIRTIO_NET_F_HOST_TSO "tso"
 #define ETH_VHOST_LINEAR_BUF  "linear-buffer"
 #define ETH_VHOST_EXT_BUF  "ext-buffer"
+#define ETH_VHOST_VECTORIZED "vectorized"
 #define VHOST_MAX_PKT_BURST 32
 
 static const char *valid_arguments[] = {
@@ -47,6 +48,7 @@ static const char *valid_arguments[] = {
 	ETH_VHOST_VIRTIO_NET_F_HOST_TSO,
 	ETH_VHOST_LINEAR_BUF,
 	ETH_VHOST_EXT_BUF,
+	ETH_VHOST_VECTORIZED,
 	NULL
 };
 
@@ -1507,6 +1509,7 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
 	int tso = 0;
 	int linear_buf = 0;
 	int ext_buf = 0;
+	int vectorized = 0;
 	struct rte_eth_dev *eth_dev;
 	const char *name = rte_vdev_device_name(dev);
 
@@ -1626,6 +1629,17 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
 			flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
 	}
 
+	if (rte_kvargs_count(kvlist, ETH_VHOST_VECTORIZED) == 1) {
+		ret = rte_kvargs_process(kvlist,
+				ETH_VHOST_VECTORIZED,
+				&open_int, &vectorized);
+		if (ret < 0)
+			goto out_free;
+
+		if (vectorized == 1)
+			flags |= RTE_VHOST_USER_VECTORIZED;
+	}
+
 	if (dev->device.numa_node == SOCKET_ID_ANY)
 		dev->device.numa_node = rte_socket_id();
 
@@ -1679,4 +1693,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_vhost,
 	"postcopy-support=<0|1> "
 	"tso=<0|1> "
 	"linear-buffer=<0|1> "
-	"ext-buffer=<0|1>");
+	"ext-buffer=<0|1> "
+	"vectorized=<0|1>");
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index a94c84134d..c7f946c6c1 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -36,6 +36,7 @@ extern "C" {
 /* support only linear buffers (no chained mbufs) */
 #define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
 #define RTE_VHOST_USER_ASYNC_COPY	(1ULL << 7)
+#define RTE_VHOST_USER_VECTORIZED	(1ULL << 8)
 
 /* Features. */
 #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 73e1dca95e..cc11244693 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -43,6 +43,7 @@ struct vhost_user_socket {
 	bool extbuf;
 	bool linearbuf;
 	bool async_copy;
+	bool vectorized;
 
 	/*
 	 * The "supported_features" indicates the feature bits the
@@ -245,6 +246,9 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 			dev->async_copy = 1;
 	}
 
+	if (vsocket->vectorized)
+		vhost_enable_vectorized(vid);
+
 	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
 
 	if (vsocket->notify_ops->new_connection) {
@@ -881,6 +885,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
 	vsocket->dequeue_zero_copy = flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
 	vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
 	vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
+	vsocket->vectorized = flags & RTE_VHOST_USER_VECTORIZED;
 
 	if (vsocket->dequeue_zero_copy &&
 	    (flags & RTE_VHOST_USER_IOMMU_SUPPORT)) {
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 8f20a0818f..50bf033a9d 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -752,6 +752,17 @@ vhost_enable_linearbuf(int vid)
 	dev->linearbuf = 1;
 }
 
+void
+vhost_enable_vectorized(int vid)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (dev == NULL)
+		return;
+
+	dev->vectorized = 1;
+}
+
 int
 rte_vhost_get_mtu(int vid, uint16_t *mtu)
 {
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 632f66d532..b556eb3bf6 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -383,6 +383,7 @@ struct virtio_net {
 	int			async_copy;
 	int			extbuf;
 	int			linearbuf;
+	int                     vectorized;
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
 	struct inflight_mem_info *inflight_info;
 #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
@@ -721,6 +722,7 @@ void vhost_enable_dequeue_zero_copy(int vid);
 void vhost_set_builtin_virtio_net(int vid, bool enable);
 void vhost_enable_extbuf(int vid);
 void vhost_enable_linearbuf(int vid);
+void vhost_enable_vectorized(int vid);
 int vhost_enable_guest_notification(struct virtio_net *dev,
 		struct vhost_virtqueue *vq, int enable);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v1 2/5] vhost: reuse packed ring functions
  2020-08-19  3:24 [dpdk-dev] [PATCH v1 0/5] vhost add vectorized data path Marvin Liu
  2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 1/5] vhost: " Marvin Liu
@ 2020-08-19  3:24 ` Marvin Liu
  2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 3/5] vhost: prepare memory regions addresses Marvin Liu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-08-19  3:24 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Move parse_ethernet, offload, extbuf functions to header file. These
functions will be reused by vhost vectorized path.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index b556eb3bf6..5a5c945551 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -20,6 +20,10 @@
 #include <rte_rwlock.h>
 #include <rte_malloc.h>
 
+#include <rte_ip.h>
+#include <rte_tcp.h>
+#include <rte_udp.h>
+#include <rte_sctp.h>
 #include "rte_vhost.h"
 #include "rte_vdpa.h"
 #include "rte_vdpa_dev.h"
@@ -905,4 +909,215 @@ put_zmbuf(struct zcopy_mbuf *zmbuf)
 	zmbuf->in_use = 0;
 }
 
+static  __rte_always_inline bool
+virtio_net_is_inorder(struct virtio_net *dev)
+{
+	return dev->features & (1ULL << VIRTIO_F_IN_ORDER);
+}
+
+static __rte_always_inline void
+parse_ethernet(struct rte_mbuf *m, uint16_t *l4_proto, void **l4_hdr)
+{
+	struct rte_ipv4_hdr *ipv4_hdr;
+	struct rte_ipv6_hdr *ipv6_hdr;
+	void *l3_hdr = NULL;
+	struct rte_ether_hdr *eth_hdr;
+	uint16_t ethertype;
+
+	eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
+
+	m->l2_len = sizeof(struct rte_ether_hdr);
+	ethertype = rte_be_to_cpu_16(eth_hdr->ether_type);
+
+	if (ethertype == RTE_ETHER_TYPE_VLAN) {
+		struct rte_vlan_hdr *vlan_hdr =
+			(struct rte_vlan_hdr *)(eth_hdr + 1);
+
+		m->l2_len += sizeof(struct rte_vlan_hdr);
+		ethertype = rte_be_to_cpu_16(vlan_hdr->eth_proto);
+	}
+
+	l3_hdr = (char *)eth_hdr + m->l2_len;
+
+	switch (ethertype) {
+	case RTE_ETHER_TYPE_IPV4:
+		ipv4_hdr = l3_hdr;
+		*l4_proto = ipv4_hdr->next_proto_id;
+		m->l3_len = (ipv4_hdr->version_ihl & 0x0f) * 4;
+		*l4_hdr = (char *)l3_hdr + m->l3_len;
+		m->ol_flags |= PKT_TX_IPV4;
+		break;
+	case RTE_ETHER_TYPE_IPV6:
+		ipv6_hdr = l3_hdr;
+		*l4_proto = ipv6_hdr->proto;
+		m->l3_len = sizeof(struct rte_ipv6_hdr);
+		*l4_hdr = (char *)l3_hdr + m->l3_len;
+		m->ol_flags |= PKT_TX_IPV6;
+		break;
+	default:
+		m->l3_len = 0;
+		*l4_proto = 0;
+		*l4_hdr = NULL;
+		break;
+	}
+}
+
+static __rte_always_inline bool
+virtio_net_with_host_offload(struct virtio_net *dev)
+{
+	if (dev->features &
+			((1ULL << VIRTIO_NET_F_CSUM) |
+			 (1ULL << VIRTIO_NET_F_HOST_ECN) |
+			 (1ULL << VIRTIO_NET_F_HOST_TSO4) |
+			 (1ULL << VIRTIO_NET_F_HOST_TSO6) |
+			 (1ULL << VIRTIO_NET_F_HOST_UFO)))
+		return true;
+
+	return false;
+}
+
+static __rte_always_inline void
+vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
+{
+	uint16_t l4_proto = 0;
+	void *l4_hdr = NULL;
+	struct rte_tcp_hdr *tcp_hdr = NULL;
+
+	if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
+		return;
+
+	parse_ethernet(m, &l4_proto, &l4_hdr);
+	if (hdr->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		if (hdr->csum_start == (m->l2_len + m->l3_len)) {
+			switch (hdr->csum_offset) {
+			case (offsetof(struct rte_tcp_hdr, cksum)):
+				if (l4_proto == IPPROTO_TCP)
+					m->ol_flags |= PKT_TX_TCP_CKSUM;
+				break;
+			case (offsetof(struct rte_udp_hdr, dgram_cksum)):
+				if (l4_proto == IPPROTO_UDP)
+					m->ol_flags |= PKT_TX_UDP_CKSUM;
+				break;
+			case (offsetof(struct rte_sctp_hdr, cksum)):
+				if (l4_proto == IPPROTO_SCTP)
+					m->ol_flags |= PKT_TX_SCTP_CKSUM;
+				break;
+			default:
+				break;
+			}
+		}
+	}
+
+	if (l4_hdr && hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+		case VIRTIO_NET_HDR_GSO_TCPV4:
+		case VIRTIO_NET_HDR_GSO_TCPV6:
+			tcp_hdr = l4_hdr;
+			m->ol_flags |= PKT_TX_TCP_SEG;
+			m->tso_segsz = hdr->gso_size;
+			m->l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
+			break;
+		case VIRTIO_NET_HDR_GSO_UDP:
+			m->ol_flags |= PKT_TX_UDP_SEG;
+			m->tso_segsz = hdr->gso_size;
+			m->l4_len = sizeof(struct rte_udp_hdr);
+			break;
+		default:
+			VHOST_LOG_DATA(WARNING,
+				"unsupported gso type %u.\n", hdr->gso_type);
+			break;
+		}
+	}
+}
+
+static void
+virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
+{
+	rte_free(opaque);
+}
+
+static int
+virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
+{
+	struct rte_mbuf_ext_shared_info *shinfo = NULL;
+	uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
+	uint16_t buf_len;
+	rte_iova_t iova;
+	void *buf;
+
+	/* Try to use pkt buffer to store shinfo to reduce the amount of memory
+	 * required, otherwise store shinfo in the new buffer.
+	 */
+	if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
+		shinfo = rte_pktmbuf_mtod(pkt,
+					  struct rte_mbuf_ext_shared_info *);
+	else {
+		total_len += sizeof(*shinfo) + sizeof(uintptr_t);
+		total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
+	}
+
+	if (unlikely(total_len > UINT16_MAX))
+		return -ENOSPC;
+
+	buf_len = total_len;
+	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
+	if (unlikely(buf == NULL))
+		return -ENOMEM;
+
+	/* Initialize shinfo */
+	if (shinfo) {
+		shinfo->free_cb = virtio_dev_extbuf_free;
+		shinfo->fcb_opaque = buf;
+		rte_mbuf_ext_refcnt_set(shinfo, 1);
+	} else {
+		shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
+					      virtio_dev_extbuf_free, buf);
+		if (unlikely(shinfo == NULL)) {
+			rte_free(buf);
+			VHOST_LOG_DATA(ERR, "Failed to init shinfo\n");
+			return -1;
+		}
+	}
+
+	iova = rte_malloc_virt2iova(buf);
+	rte_pktmbuf_attach_extbuf(pkt, buf, iova, buf_len, shinfo);
+	rte_pktmbuf_reset_headroom(pkt);
+
+	return 0;
+}
+
+/*
+ * Allocate a host supported pktmbuf.
+ */
+static __rte_always_inline struct rte_mbuf *
+virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
+			 uint32_t data_len)
+{
+	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
+
+	if (unlikely(pkt == NULL)) {
+		VHOST_LOG_DATA(ERR,
+			"Failed to allocate memory for mbuf.\n");
+		return NULL;
+	}
+
+	if (rte_pktmbuf_tailroom(pkt) >= data_len)
+		return pkt;
+
+	/* attach an external buffer if supported */
+	if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
+		return pkt;
+
+	/* check if chained buffers are allowed */
+	if (!dev->linearbuf)
+		return pkt;
+
+	/* Data doesn't fit into the buffer and the host supports
+	 * only linear buffers
+	 */
+	rte_pktmbuf_free(pkt);
+
+	return NULL;
+}
+
 #endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index bd9303c8a9..6107662685 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -32,12 +32,6 @@ rxvq_is_mergeable(struct virtio_net *dev)
 	return dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF);
 }
 
-static  __rte_always_inline bool
-virtio_net_is_inorder(struct virtio_net *dev)
-{
-	return dev->features & (1ULL << VIRTIO_F_IN_ORDER);
-}
-
 static bool
 is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring)
 {
@@ -1804,121 +1798,6 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
 	return virtio_dev_rx_async_submit(dev, queue_id, pkts, count);
 }
 
-static inline bool
-virtio_net_with_host_offload(struct virtio_net *dev)
-{
-	if (dev->features &
-			((1ULL << VIRTIO_NET_F_CSUM) |
-			 (1ULL << VIRTIO_NET_F_HOST_ECN) |
-			 (1ULL << VIRTIO_NET_F_HOST_TSO4) |
-			 (1ULL << VIRTIO_NET_F_HOST_TSO6) |
-			 (1ULL << VIRTIO_NET_F_HOST_UFO)))
-		return true;
-
-	return false;
-}
-
-static void
-parse_ethernet(struct rte_mbuf *m, uint16_t *l4_proto, void **l4_hdr)
-{
-	struct rte_ipv4_hdr *ipv4_hdr;
-	struct rte_ipv6_hdr *ipv6_hdr;
-	void *l3_hdr = NULL;
-	struct rte_ether_hdr *eth_hdr;
-	uint16_t ethertype;
-
-	eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
-
-	m->l2_len = sizeof(struct rte_ether_hdr);
-	ethertype = rte_be_to_cpu_16(eth_hdr->ether_type);
-
-	if (ethertype == RTE_ETHER_TYPE_VLAN) {
-		struct rte_vlan_hdr *vlan_hdr =
-			(struct rte_vlan_hdr *)(eth_hdr + 1);
-
-		m->l2_len += sizeof(struct rte_vlan_hdr);
-		ethertype = rte_be_to_cpu_16(vlan_hdr->eth_proto);
-	}
-
-	l3_hdr = (char *)eth_hdr + m->l2_len;
-
-	switch (ethertype) {
-	case RTE_ETHER_TYPE_IPV4:
-		ipv4_hdr = l3_hdr;
-		*l4_proto = ipv4_hdr->next_proto_id;
-		m->l3_len = (ipv4_hdr->version_ihl & 0x0f) * 4;
-		*l4_hdr = (char *)l3_hdr + m->l3_len;
-		m->ol_flags |= PKT_TX_IPV4;
-		break;
-	case RTE_ETHER_TYPE_IPV6:
-		ipv6_hdr = l3_hdr;
-		*l4_proto = ipv6_hdr->proto;
-		m->l3_len = sizeof(struct rte_ipv6_hdr);
-		*l4_hdr = (char *)l3_hdr + m->l3_len;
-		m->ol_flags |= PKT_TX_IPV6;
-		break;
-	default:
-		m->l3_len = 0;
-		*l4_proto = 0;
-		*l4_hdr = NULL;
-		break;
-	}
-}
-
-static __rte_always_inline void
-vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
-{
-	uint16_t l4_proto = 0;
-	void *l4_hdr = NULL;
-	struct rte_tcp_hdr *tcp_hdr = NULL;
-
-	if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
-		return;
-
-	parse_ethernet(m, &l4_proto, &l4_hdr);
-	if (hdr->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-		if (hdr->csum_start == (m->l2_len + m->l3_len)) {
-			switch (hdr->csum_offset) {
-			case (offsetof(struct rte_tcp_hdr, cksum)):
-				if (l4_proto == IPPROTO_TCP)
-					m->ol_flags |= PKT_TX_TCP_CKSUM;
-				break;
-			case (offsetof(struct rte_udp_hdr, dgram_cksum)):
-				if (l4_proto == IPPROTO_UDP)
-					m->ol_flags |= PKT_TX_UDP_CKSUM;
-				break;
-			case (offsetof(struct rte_sctp_hdr, cksum)):
-				if (l4_proto == IPPROTO_SCTP)
-					m->ol_flags |= PKT_TX_SCTP_CKSUM;
-				break;
-			default:
-				break;
-			}
-		}
-	}
-
-	if (l4_hdr && hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
-		switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
-		case VIRTIO_NET_HDR_GSO_TCPV4:
-		case VIRTIO_NET_HDR_GSO_TCPV6:
-			tcp_hdr = l4_hdr;
-			m->ol_flags |= PKT_TX_TCP_SEG;
-			m->tso_segsz = hdr->gso_size;
-			m->l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
-			break;
-		case VIRTIO_NET_HDR_GSO_UDP:
-			m->ol_flags |= PKT_TX_UDP_SEG;
-			m->tso_segsz = hdr->gso_size;
-			m->l4_len = sizeof(struct rte_udp_hdr);
-			break;
-		default:
-			VHOST_LOG_DATA(WARNING,
-				"unsupported gso type %u.\n", hdr->gso_type);
-			break;
-		}
-	}
-}
-
 static __rte_noinline void
 copy_vnet_hdr_from_desc(struct virtio_net_hdr *hdr,
 		struct buf_vector *buf_vec)
@@ -2145,96 +2024,6 @@ get_zmbuf(struct vhost_virtqueue *vq)
 	return NULL;
 }
 
-static void
-virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
-{
-	rte_free(opaque);
-}
-
-static int
-virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
-{
-	struct rte_mbuf_ext_shared_info *shinfo = NULL;
-	uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
-	uint16_t buf_len;
-	rte_iova_t iova;
-	void *buf;
-
-	/* Try to use pkt buffer to store shinfo to reduce the amount of memory
-	 * required, otherwise store shinfo in the new buffer.
-	 */
-	if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
-		shinfo = rte_pktmbuf_mtod(pkt,
-					  struct rte_mbuf_ext_shared_info *);
-	else {
-		total_len += sizeof(*shinfo) + sizeof(uintptr_t);
-		total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
-	}
-
-	if (unlikely(total_len > UINT16_MAX))
-		return -ENOSPC;
-
-	buf_len = total_len;
-	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
-	if (unlikely(buf == NULL))
-		return -ENOMEM;
-
-	/* Initialize shinfo */
-	if (shinfo) {
-		shinfo->free_cb = virtio_dev_extbuf_free;
-		shinfo->fcb_opaque = buf;
-		rte_mbuf_ext_refcnt_set(shinfo, 1);
-	} else {
-		shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
-					      virtio_dev_extbuf_free, buf);
-		if (unlikely(shinfo == NULL)) {
-			rte_free(buf);
-			VHOST_LOG_DATA(ERR, "Failed to init shinfo\n");
-			return -1;
-		}
-	}
-
-	iova = rte_malloc_virt2iova(buf);
-	rte_pktmbuf_attach_extbuf(pkt, buf, iova, buf_len, shinfo);
-	rte_pktmbuf_reset_headroom(pkt);
-
-	return 0;
-}
-
-/*
- * Allocate a host supported pktmbuf.
- */
-static __rte_always_inline struct rte_mbuf *
-virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
-			 uint32_t data_len)
-{
-	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
-
-	if (unlikely(pkt == NULL)) {
-		VHOST_LOG_DATA(ERR,
-			"Failed to allocate memory for mbuf.\n");
-		return NULL;
-	}
-
-	if (rte_pktmbuf_tailroom(pkt) >= data_len)
-		return pkt;
-
-	/* attach an external buffer if supported */
-	if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
-		return pkt;
-
-	/* check if chained buffers are allowed */
-	if (!dev->linearbuf)
-		return pkt;
-
-	/* Data doesn't fit into the buffer and the host supports
-	 * only linear buffers
-	 */
-	rte_pktmbuf_free(pkt);
-
-	return NULL;
-}
-
 static __rte_noinline uint16_t
 virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v1 3/5] vhost: prepare memory regions addresses
  2020-08-19  3:24 [dpdk-dev] [PATCH v1 0/5] vhost add vectorized data path Marvin Liu
  2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 1/5] vhost: " Marvin Liu
  2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 2/5] vhost: reuse packed ring functions Marvin Liu
@ 2020-08-19  3:24 ` Marvin Liu
  2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
  2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
  4 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-08-19  3:24 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Prepare memory regions guest physical addresses for vectorized data
path. These information will be utilized by SIMD instructions to find
matched region index.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 5a5c945551..4a81f18f01 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -52,6 +52,8 @@
 
 #define ASYNC_MAX_POLL_SEG 255
 
+#define MAX_NREGIONS 8
+
 #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
 #define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
 
@@ -375,6 +377,8 @@ struct inflight_mem_info {
 struct virtio_net {
 	/* Frontend (QEMU) memory and memory region information */
 	struct rte_vhost_memory	*mem;
+	uint64_t		regions_low_addrs[MAX_NREGIONS];
+	uint64_t		regions_high_addrs[MAX_NREGIONS];
 	uint64_t		features;
 	uint64_t		protocol_features;
 	int			vid;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index c3c924faec..89e75e9e71 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -1291,6 +1291,17 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
 		}
 	}
 
+	RTE_BUILD_BUG_ON(VHOST_MEMORY_MAX_NREGIONS != 8);
+	if (dev->vectorized) {
+		for (i = 0; i < memory->nregions; i++) {
+			dev->regions_low_addrs[i] =
+				memory->regions[i].guest_phys_addr;
+			dev->regions_high_addrs[i] =
+				memory->regions[i].guest_phys_addr +
+				memory->regions[i].memory_size;
+		}
+	}
+
 	for (i = 0; i < dev->nr_vring; i++) {
 		struct vhost_virtqueue *vq = dev->virtqueue[i];
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
  2020-08-19  3:24 [dpdk-dev] [PATCH v1 0/5] vhost add vectorized data path Marvin Liu
                   ` (2 preceding siblings ...)
  2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 3/5] vhost: prepare memory regions addresses Marvin Liu
@ 2020-08-19  3:24 ` Marvin Liu
  2020-09-18 13:44   ` Maxime Coquelin
  2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
  4 siblings, 1 reply; 36+ messages in thread
From: Marvin Liu @ 2020-08-19  3:24 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Optimize vhost packed ring dequeue path with SIMD instructions. Four
descriptors status check and writeback are batched handled with AVX512
instructions. Address translation operations are also accelerated by
AVX512 instructions.

If platform or compiler not support vectorization, will fallback to
default path.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
index 4f2f3e47da..c0cd7d498f 100644
--- a/lib/librte_vhost/Makefile
+++ b/lib/librte_vhost/Makefile
@@ -31,6 +31,13 @@ CFLAGS += -DVHOST_ICC_UNROLL_PRAGMA
 endif
 endif
 
+ifneq ($(FORCE_DISABLE_AVX512), y)
+        CC_AVX512_SUPPORT=\
+        $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
+        sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
+        grep -q AVX512 && echo 1)
+endif
+
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST_NUMA),y)
 LDLIBS += -lnuma
 endif
@@ -40,6 +47,12 @@ LDLIBS += -lrte_eal -lrte_mempool -lrte_mbuf -lrte_ethdev -lrte_net
 SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
 					vhost_user.c virtio_net.c vdpa.c
 
+ifeq ($(CC_AVX512_SUPPORT), 1)
+CFLAGS += -DCC_AVX512_SUPPORT
+SRCS-$(CONFIG_RTE_LIBRTE_VHOST) += vhost_vec_avx.c
+CFLAGS_vhost_vec_avx.o += -mavx512f -mavx512bw -mavx512vl
+endif
+
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h \
 						rte_vdpa_dev.h rte_vhost_async.h
diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
index cc9aa65c67..c1481802d7 100644
--- a/lib/librte_vhost/meson.build
+++ b/lib/librte_vhost/meson.build
@@ -8,6 +8,22 @@ endif
 if has_libnuma == 1
 	dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
 endif
+
+if arch_subdir == 'x86'
+        if not machine_args.contains('-mno-avx512f')
+                if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
+                        cflags += ['-DCC_AVX512_SUPPORT']
+                        vhost_avx512_lib = static_library('vhost_avx512_lib',
+                                              'vhost_vec_avx.c',
+                                              dependencies: [static_rte_eal, static_rte_mempool,
+                                                  static_rte_mbuf, static_rte_ethdev, static_rte_net],
+                                              include_directories: includes,
+                                              c_args: [cflags, '-mavx512f', '-mavx512bw', '-mavx512vl'])
+                        objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
+                endif
+        endif
+endif
+
 if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
 	cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
 elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 4a81f18f01..fc7daf2145 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
 	return NULL;
 }
 
+int
+vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
+				 struct vhost_virtqueue *vq,
+				 struct rte_mempool *mbuf_pool,
+				 struct rte_mbuf **pkts,
+				 uint16_t avail_idx,
+				 uintptr_t *desc_addrs,
+				 uint16_t *ids);
 #endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
new file mode 100644
index 0000000000..e8361d18fa
--- /dev/null
+++ b/lib/librte_vhost/vhost_vec_avx.c
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2016 Intel Corporation
+ */
+#include <stdint.h>
+
+#include "vhost.h"
+
+#define BYTE_SIZE 8
+/* reference count offset in mbuf rearm data */
+#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
+	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+/* segment number offset in mbuf rearm data */
+#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
+	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+
+/* default rearm data */
+#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
+	1ULL << REFCNT_BITS_OFFSET)
+
+#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc, flags) / \
+	sizeof(uint16_t))
+
+#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
+	sizeof(uint16_t))
+#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
+	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
+	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2)  | \
+	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
+
+#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
+	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
+
+#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL | VRING_DESC_F_USED) \
+	<< FLAGS_BITS_OFFSET)
+#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
+#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
+	FLAGS_BITS_OFFSET)
+
+#define DESC_FLAGS_POS 0xaa
+#define MBUF_LENS_POS 0x6666
+
+int
+vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
+				 struct vhost_virtqueue *vq,
+				 struct rte_mempool *mbuf_pool,
+				 struct rte_mbuf **pkts,
+				 uint16_t avail_idx,
+				 uintptr_t *desc_addrs,
+				 uint16_t *ids)
+{
+	struct vring_packed_desc *descs = vq->desc_packed;
+	uint32_t descs_status;
+	void *desc_addr;
+	uint16_t i;
+	uint8_t cmp_low, cmp_high, cmp_result;
+	uint64_t lens[PACKED_BATCH_SIZE];
+
+	if (unlikely(avail_idx & PACKED_BATCH_MASK))
+		return -1;
+
+	/* load 4 descs */
+	desc_addr = &vq->desc_packed[avail_idx];
+	__m512i desc_vec = _mm512_loadu_si512(desc_addr);
+
+	/* burst check four status */
+	__m512i avail_flag_vec;
+	if (vq->avail_wrap_counter)
+#if defined(RTE_ARCH_I686)
+		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG, 0x0,
+					PACKED_FLAGS_MASK, 0x0);
+#else
+		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+					PACKED_AVAIL_FLAG);
+
+#endif
+	else
+#if defined(RTE_ARCH_I686)
+		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
+					0x0, PACKED_AVAIL_FLAG_WRAP, 0x0);
+#else
+		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+					PACKED_AVAIL_FLAG_WRAP);
+#endif
+
+	descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
+		_MM_CMPINT_NE);
+	if (descs_status & BATCH_FLAGS_MASK)
+		return -1;
+
+	/* check buffer fit into one region & translate address */
+	__m512i regions_low_addrs =
+		_mm512_loadu_si512((void *)&dev->regions_low_addrs);
+	__m512i regions_high_addrs =
+		_mm512_loadu_si512((void *)&dev->regions_high_addrs);
+	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		uint64_t addr_low = descs[avail_idx + i].addr;
+		uint64_t addr_high = addr_low + descs[avail_idx + i].len;
+		__m512i low_addr_vec = _mm512_set1_epi64(addr_low);
+		__m512i high_addr_vec = _mm512_set1_epi64(addr_high);
+
+		cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
+				regions_low_addrs, _MM_CMPINT_NLT);
+		cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
+				regions_high_addrs, _MM_CMPINT_LT);
+		cmp_result = cmp_low & cmp_high;
+		int index = __builtin_ctz(cmp_result);
+		if (unlikely((uint32_t)index >= dev->mem->nregions))
+			goto free_buf;
+
+		desc_addrs[i] = addr_low +
+			dev->mem->regions[index].host_user_addr -
+			dev->mem->regions[index].guest_phys_addr;
+		lens[i] = descs[avail_idx + i].len;
+		rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
+
+		pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool, lens[i]);
+		if (!pkts[i])
+			goto free_buf;
+	}
+
+	if (unlikely(virtio_net_is_inorder(dev))) {
+		ids[PACKED_BATCH_SIZE - 1] =
+			descs[avail_idx + PACKED_BATCH_SIZE - 1].id;
+	} else {
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
+			ids[i] = descs[avail_idx + i].id;
+	}
+
+	uint64_t addrs[PACKED_BATCH_SIZE << 1];
+	/* store mbuf data_len, pkt_len */
+	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		addrs[i << 1] = (uint64_t)pkts[i]->rx_descriptor_fields1;
+		addrs[(i << 1) + 1] = (uint64_t)pkts[i]->rx_descriptor_fields1
+					+ sizeof(uint64_t);
+	}
+
+	/* save pkt_len and data_len into mbufs */
+	__m512i value_vec = _mm512_maskz_shuffle_epi32(MBUF_LENS_POS, desc_vec,
+					0xAA);
+	__m512i offsets_vec = _mm512_maskz_set1_epi32(MBUF_LENS_POS,
+					(uint32_t)-12);
+	value_vec = _mm512_add_epi32(value_vec, offsets_vec);
+	__m512i vindex = _mm512_loadu_si512((void *)addrs);
+	_mm512_i64scatter_epi64(0, vindex, value_vec, 1);
+
+	return 0;
+free_buf:
+	for (i = 0; i < PACKED_BATCH_SIZE; i++)
+		rte_pktmbuf_free(pkts[i]);
+
+	return -1;
+}
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 6107662685..e4d2e2e7d6 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2249,6 +2249,28 @@ vhost_reserve_avail_batch_packed(struct virtio_net *dev,
 	return -1;
 }
 
+static __rte_always_inline int
+vhost_handle_avail_batch_packed(struct virtio_net *dev,
+				 struct vhost_virtqueue *vq,
+				 struct rte_mempool *mbuf_pool,
+				 struct rte_mbuf **pkts,
+				 uint16_t avail_idx,
+				 uintptr_t *desc_addrs,
+				 uint16_t *ids)
+{
+	if (unlikely(dev->vectorized))
+#ifdef CC_AVX512_SUPPORT
+		return vhost_reserve_avail_batch_packed_avx(dev, vq, mbuf_pool,
+				pkts, avail_idx, desc_addrs, ids);
+#else
+		return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool,
+				pkts, avail_idx, desc_addrs, ids);
+
+#endif
+	return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
+			avail_idx, desc_addrs, ids);
+}
+
 static __rte_always_inline int
 virtio_dev_tx_batch_packed(struct virtio_net *dev,
 			   struct vhost_virtqueue *vq,
@@ -2261,8 +2283,9 @@ virtio_dev_tx_batch_packed(struct virtio_net *dev,
 	uint16_t ids[PACKED_BATCH_SIZE];
 	uint16_t i;
 
-	if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
-					     avail_idx, desc_addrs, ids))
+
+	if (vhost_handle_avail_batch_packed(dev, vq, mbuf_pool, pkts,
+		avail_idx, desc_addrs, ids))
 		return -1;
 
 	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v1 5/5] vhost: add packed ring vectorized enqueue
  2020-08-19  3:24 [dpdk-dev] [PATCH v1 0/5] vhost add vectorized data path Marvin Liu
                   ` (3 preceding siblings ...)
  2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
@ 2020-08-19  3:24 ` Marvin Liu
  4 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-08-19  3:24 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Optimize vhost packed ring enqueue path with SIMD instructions. Four
descriptors status and length are batched handled with AVX512
instructions. Address translation operations are also accelerated
by AVX512 instructions.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index fc7daf2145..b78b2c5c1b 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -1132,4 +1132,10 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
 				 uint16_t avail_idx,
 				 uintptr_t *desc_addrs,
 				 uint16_t *ids);
+
+int
+virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
+			       struct vhost_virtqueue *vq,
+			       struct rte_mbuf **pkts);
+
 #endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
index e8361d18fa..12b902253a 100644
--- a/lib/librte_vhost/vhost_vec_avx.c
+++ b/lib/librte_vhost/vhost_vec_avx.c
@@ -35,9 +35,15 @@
 #define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
 #define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
 	FLAGS_BITS_OFFSET)
+#define PACKED_WRITE_AVAIL_FLAG (PACKED_AVAIL_FLAG | \
+	((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
+#define PACKED_WRITE_AVAIL_FLAG_WRAP (PACKED_AVAIL_FLAG_WRAP | \
+	((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
 
 #define DESC_FLAGS_POS 0xaa
 #define MBUF_LENS_POS 0x6666
+#define DESC_LENS_POS 0x4444
+#define DESC_LENS_FLAGS_POS 0xB0B0B0B0
 
 int
 vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
@@ -150,3 +156,137 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
 
 	return -1;
 }
+
+int
+virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
+			       struct vhost_virtqueue *vq,
+			       struct rte_mbuf **pkts)
+{
+	struct vring_packed_desc *descs = vq->desc_packed;
+	uint16_t avail_idx = vq->last_avail_idx;
+	uint64_t desc_addrs[PACKED_BATCH_SIZE];
+	uint32_t buf_offset = dev->vhost_hlen;
+	uint32_t desc_status;
+	uint64_t lens[PACKED_BATCH_SIZE];
+	uint16_t i;
+	void *desc_addr;
+	uint8_t cmp_low, cmp_high, cmp_result;
+
+	if (unlikely(avail_idx & PACKED_BATCH_MASK))
+		return -1;
+
+	/* check refcnt and nb_segs */
+	__m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
+
+	/* load four mbufs rearm data */
+	__m256i mbufs = _mm256_set_epi64x(
+				*pkts[3]->rearm_data,
+				*pkts[2]->rearm_data,
+				*pkts[1]->rearm_data,
+				*pkts[0]->rearm_data);
+
+	uint16_t cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
+	if (cmp & MBUF_LENS_POS)
+		return -1;
+
+	/* check desc status */
+	desc_addr = &vq->desc_packed[avail_idx];
+	__m512i desc_vec = _mm512_loadu_si512(desc_addr);
+
+	__m512i avail_flag_vec;
+	__m512i used_flag_vec;
+	if (vq->avail_wrap_counter) {
+#if defined(RTE_ARCH_I686)
+		avail_flag_vec = _mm512_set4_epi64(PACKED_WRITE_AVAIL_FLAG,
+					0x0, PACKED_WRITE_AVAIL_FLAG, 0x0);
+		used_flag_vec = _mm512_set4_epi64(PACKED_FLAGS_MASK, 0x0,
+					PACKED_FLAGS_MASK, 0x0);
+#else
+		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+					PACKED_WRITE_AVAIL_FLAG);
+		used_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+					PACKED_FLAGS_MASK);
+#endif
+	} else {
+#if defined(RTE_ARCH_I686)
+		avail_flag_vec = _mm512_set4_epi64(
+					PACKED_WRITE_AVAIL_FLAG_WRAP, 0x0,
+					PACKED_WRITE_AVAIL_FLAG, 0x0);
+		used_flag_vec = _mm512_set4_epi64(0x0, 0x0, 0x0, 0x0);
+#else
+		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+					PACKED_WRITE_AVAIL_FLAG_WRAP);
+		used_flag_vec = _mm512_setzero_epi32();
+#endif
+	}
+
+	desc_status = _mm512_mask_cmp_epu16_mask(BATCH_FLAGS_MASK, desc_vec,
+				avail_flag_vec, _MM_CMPINT_NE);
+	if (desc_status)
+		return -1;
+
+	/* check buffer fit into one region & translate address */
+	__m512i regions_low_addrs =
+		_mm512_loadu_si512((void *)&dev->regions_low_addrs);
+	__m512i regions_high_addrs =
+		_mm512_loadu_si512((void *)&dev->regions_high_addrs);
+	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		uint64_t addr_low = descs[avail_idx + i].addr;
+		uint64_t addr_high = addr_low + descs[avail_idx + i].len;
+		__m512i low_addr_vec = _mm512_set1_epi64(addr_low);
+		__m512i high_addr_vec = _mm512_set1_epi64(addr_high);
+
+		cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
+				regions_low_addrs, _MM_CMPINT_NLT);
+		cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
+				regions_high_addrs, _MM_CMPINT_LT);
+		cmp_result = cmp_low & cmp_high;
+		int index = __builtin_ctz(cmp_result);
+		if (unlikely((uint32_t)index >= dev->mem->nregions))
+			return -1;
+
+		desc_addrs[i] = addr_low +
+			dev->mem->regions[index].host_user_addr -
+			dev->mem->regions[index].guest_phys_addr;
+		rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *, 0));
+	}
+
+	/* check length is enough */
+	__m512i pkt_lens = _mm512_set_epi32(
+			0, pkts[3]->pkt_len, 0, 0,
+			0, pkts[2]->pkt_len, 0, 0,
+			0, pkts[1]->pkt_len, 0, 0,
+			0, pkts[0]->pkt_len, 0, 0);
+
+	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(DESC_LENS_POS,
+					dev->vhost_hlen);
+	__m512i buf_len_vec = _mm512_add_epi32(pkt_lens, mbuf_len_offset);
+	uint16_t lens_cmp = _mm512_mask_cmp_epu32_mask(DESC_LENS_POS,
+				desc_vec, buf_len_vec, _MM_CMPINT_LT);
+	if (lens_cmp)
+		return -1;
+
+	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		rte_memcpy((void *)(uintptr_t)(desc_addrs[i] + buf_offset),
+			   rte_pktmbuf_mtod_offset(pkts[i], void *, 0),
+			   pkts[i]->pkt_len);
+	}
+
+	if (unlikely((dev->features & (1ULL << VHOST_F_LOG_ALL)))) {
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			lens[i] = descs[avail_idx + i].len;
+			vhost_log_cache_write_iova(dev, vq,
+				descs[avail_idx + i].addr, lens[i]);
+		}
+	}
+
+	vq_inc_last_avail_packed(vq, PACKED_BATCH_SIZE);
+	vq_inc_last_used_packed(vq, PACKED_BATCH_SIZE);
+	/* save len and flags, skip addr and id */
+	__m512i desc_updated = _mm512_mask_add_epi16(desc_vec,
+					DESC_LENS_FLAGS_POS, buf_len_vec,
+					used_flag_vec);
+	_mm512_storeu_si512(desc_addr, desc_updated);
+
+	return 0;
+}
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index e4d2e2e7d6..5c56a8d6ff 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -1354,6 +1354,21 @@ virtio_dev_rx_single_packed(struct virtio_net *dev,
 	return 0;
 }
 
+static __rte_always_inline int
+virtio_dev_rx_handle_batch_packed(struct virtio_net *dev,
+			   struct vhost_virtqueue *vq,
+			   struct rte_mbuf **pkts)
+
+{
+	if (unlikely(dev->vectorized))
+#ifdef CC_AVX512_SUPPORT
+		return virtio_dev_rx_batch_packed_avx(dev, vq, pkts);
+#else
+		return virtio_dev_rx_batch_packed(dev, vq, pkts);
+#endif
+	return virtio_dev_rx_batch_packed(dev, vq, pkts);
+}
+
 static __rte_noinline uint32_t
 virtio_dev_rx_packed(struct virtio_net *dev,
 		     struct vhost_virtqueue *__rte_restrict vq,
@@ -1367,8 +1382,8 @@ virtio_dev_rx_packed(struct virtio_net *dev,
 		rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
 
 		if (remained >= PACKED_BATCH_SIZE) {
-			if (!virtio_dev_rx_batch_packed(dev, vq,
-							&pkts[pkt_idx])) {
+			if (!virtio_dev_rx_handle_batch_packed(dev, vq,
+				&pkts[pkt_idx])) {
 				pkt_idx += PACKED_BATCH_SIZE;
 				remained -= PACKED_BATCH_SIZE;
 				continue;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
  2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
@ 2020-09-18 13:44   ` Maxime Coquelin
  2020-09-21  6:26     ` Liu, Yong
  0 siblings, 1 reply; 36+ messages in thread
From: Maxime Coquelin @ 2020-09-18 13:44 UTC (permalink / raw)
  To: Marvin Liu, chenbo.xia, zhihong.wang; +Cc: dev



On 8/19/20 5:24 AM, Marvin Liu wrote:
> Optimize vhost packed ring dequeue path with SIMD instructions. Four
> descriptors status check and writeback are batched handled with AVX512
> instructions. Address translation operations are also accelerated by
> AVX512 instructions.
> 
> If platform or compiler not support vectorization, will fallback to
> default path.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 
> diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> index 4f2f3e47da..c0cd7d498f 100644
> --- a/lib/librte_vhost/Makefile
> +++ b/lib/librte_vhost/Makefile
> @@ -31,6 +31,13 @@ CFLAGS += -DVHOST_ICC_UNROLL_PRAGMA
>  endif
>  endif
>  
> +ifneq ($(FORCE_DISABLE_AVX512), y)
> +        CC_AVX512_SUPPORT=\
> +        $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
> +        sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
> +        grep -q AVX512 && echo 1)
> +endif
> +
>  ifeq ($(CONFIG_RTE_LIBRTE_VHOST_NUMA),y)
>  LDLIBS += -lnuma
>  endif
> @@ -40,6 +47,12 @@ LDLIBS += -lrte_eal -lrte_mempool -lrte_mbuf -lrte_ethdev -lrte_net
>  SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
>  					vhost_user.c virtio_net.c vdpa.c
>  
> +ifeq ($(CC_AVX512_SUPPORT), 1)
> +CFLAGS += -DCC_AVX512_SUPPORT
> +SRCS-$(CONFIG_RTE_LIBRTE_VHOST) += vhost_vec_avx.c
> +CFLAGS_vhost_vec_avx.o += -mavx512f -mavx512bw -mavx512vl
> +endif
> +
>  # install includes
>  SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h \
>  						rte_vdpa_dev.h rte_vhost_async.h
> diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> index cc9aa65c67..c1481802d7 100644
> --- a/lib/librte_vhost/meson.build
> +++ b/lib/librte_vhost/meson.build
> @@ -8,6 +8,22 @@ endif
>  if has_libnuma == 1
>  	dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
>  endif
> +
> +if arch_subdir == 'x86'
> +        if not machine_args.contains('-mno-avx512f')
> +                if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
> +                        cflags += ['-DCC_AVX512_SUPPORT']
> +                        vhost_avx512_lib = static_library('vhost_avx512_lib',
> +                                              'vhost_vec_avx.c',
> +                                              dependencies: [static_rte_eal, static_rte_mempool,
> +                                                  static_rte_mbuf, static_rte_ethdev, static_rte_net],
> +                                              include_directories: includes,
> +                                              c_args: [cflags, '-mavx512f', '-mavx512bw', '-mavx512vl'])
> +                        objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
> +                endif
> +        endif
> +endif
> +
>  if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
>  	cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
>  elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index 4a81f18f01..fc7daf2145 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
>  	return NULL;
>  }
>  
> +int
> +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> +				 struct vhost_virtqueue *vq,
> +				 struct rte_mempool *mbuf_pool,
> +				 struct rte_mbuf **pkts,
> +				 uint16_t avail_idx,
> +				 uintptr_t *desc_addrs,
> +				 uint16_t *ids);
>  #endif /* _VHOST_NET_CDEV_H_ */
> diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
> new file mode 100644
> index 0000000000..e8361d18fa
> --- /dev/null
> +++ b/lib/librte_vhost/vhost_vec_avx.c
> @@ -0,0 +1,152 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2016 Intel Corporation
> + */
> +#include <stdint.h>
> +
> +#include "vhost.h"
> +
> +#define BYTE_SIZE 8
> +/* reference count offset in mbuf rearm data */
> +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> +/* segment number offset in mbuf rearm data */
> +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> +
> +/* default rearm data */
> +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> +	1ULL << REFCNT_BITS_OFFSET)
> +
> +#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc, flags) / \
> +	sizeof(uint16_t))
> +
> +#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
> +	sizeof(uint16_t))
> +#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
> +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
> +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2)  | \
> +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
> +
> +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> +
> +#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL | VRING_DESC_F_USED) \
> +	<< FLAGS_BITS_OFFSET)
> +#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
> +#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> +	FLAGS_BITS_OFFSET)
> +
> +#define DESC_FLAGS_POS 0xaa
> +#define MBUF_LENS_POS 0x6666
> +
> +int
> +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> +				 struct vhost_virtqueue *vq,
> +				 struct rte_mempool *mbuf_pool,
> +				 struct rte_mbuf **pkts,
> +				 uint16_t avail_idx,
> +				 uintptr_t *desc_addrs,
> +				 uint16_t *ids)
> +{
> +	struct vring_packed_desc *descs = vq->desc_packed;
> +	uint32_t descs_status;
> +	void *desc_addr;
> +	uint16_t i;
> +	uint8_t cmp_low, cmp_high, cmp_result;
> +	uint64_t lens[PACKED_BATCH_SIZE];
> +
> +	if (unlikely(avail_idx & PACKED_BATCH_MASK))
> +		return -1;
> +
> +	/* load 4 descs */
> +	desc_addr = &vq->desc_packed[avail_idx];
> +	__m512i desc_vec = _mm512_loadu_si512(desc_addr);
> +
> +	/* burst check four status */
> +	__m512i avail_flag_vec;
> +	if (vq->avail_wrap_counter)
> +#if defined(RTE_ARCH_I686)
> +		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG, 0x0,
> +					PACKED_FLAGS_MASK, 0x0);
> +#else
> +		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> +					PACKED_AVAIL_FLAG);
> +
> +#endif
> +	else
> +#if defined(RTE_ARCH_I686)
> +		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
> +					0x0, PACKED_AVAIL_FLAG_WRAP, 0x0);
> +#else
> +		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> +					PACKED_AVAIL_FLAG_WRAP);
> +#endif
> +
> +	descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
> +		_MM_CMPINT_NE);
> +	if (descs_status & BATCH_FLAGS_MASK)
> +		return -1;
> +
> +	/* check buffer fit into one region & translate address */
> +	__m512i regions_low_addrs =
> +		_mm512_loadu_si512((void *)&dev->regions_low_addrs);
> +	__m512i regions_high_addrs =
> +		_mm512_loadu_si512((void *)&dev->regions_high_addrs);
> +	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> +		uint64_t addr_low = descs[avail_idx + i].addr;
> +		uint64_t addr_high = addr_low + descs[avail_idx + i].len;
> +		__m512i low_addr_vec = _mm512_set1_epi64(addr_low);
> +		__m512i high_addr_vec = _mm512_set1_epi64(addr_high);
> +
> +		cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
> +				regions_low_addrs, _MM_CMPINT_NLT);
> +		cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
> +				regions_high_addrs, _MM_CMPINT_LT);
> +		cmp_result = cmp_low & cmp_high;
> +		int index = __builtin_ctz(cmp_result);
> +		if (unlikely((uint32_t)index >= dev->mem->nregions))
> +			goto free_buf;
> +
> +		desc_addrs[i] = addr_low +
> +			dev->mem->regions[index].host_user_addr -
> +			dev->mem->regions[index].guest_phys_addr;
> +		lens[i] = descs[avail_idx + i].len;
> +		rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> +
> +		pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool, lens[i]);
> +		if (!pkts[i])
> +			goto free_buf;
> +	}

The above does not support vIOMMU, isn't it?

The more the packed datapath evolves, the more it gets optimized for a
very specific configuration.

In v19.11, indirect descriptors and chained buffers are handled as a
fallback. And now vIOMMU support is handled as a fallback.

I personnally don't like the path it is taking as it is adding a lot of
complexity on top of that.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
  2020-09-18 13:44   ` Maxime Coquelin
@ 2020-09-21  6:26     ` Liu, Yong
  2020-09-21  7:47       ` Liu, Yong
  0 siblings, 1 reply; 36+ messages in thread
From: Liu, Yong @ 2020-09-21  6:26 UTC (permalink / raw)
  To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Friday, September 18, 2020 9:45 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
> 
> 
> 
> On 8/19/20 5:24 AM, Marvin Liu wrote:
> > Optimize vhost packed ring dequeue path with SIMD instructions. Four
> > descriptors status check and writeback are batched handled with AVX512
> > instructions. Address translation operations are also accelerated by
> > AVX512 instructions.
> >
> > If platform or compiler not support vectorization, will fallback to
> > default path.
> >
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> > index 4f2f3e47da..c0cd7d498f 100644
> > --- a/lib/librte_vhost/Makefile
> > +++ b/lib/librte_vhost/Makefile
> > @@ -31,6 +31,13 @@ CFLAGS += -DVHOST_ICC_UNROLL_PRAGMA
> >  endif
> >  endif
> >
> > +ifneq ($(FORCE_DISABLE_AVX512), y)
> > +        CC_AVX512_SUPPORT=\
> > +        $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
> > +        sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
> > +        grep -q AVX512 && echo 1)
> > +endif
> > +
> >  ifeq ($(CONFIG_RTE_LIBRTE_VHOST_NUMA),y)
> >  LDLIBS += -lnuma
> >  endif
> > @@ -40,6 +47,12 @@ LDLIBS += -lrte_eal -lrte_mempool -lrte_mbuf -
> lrte_ethdev -lrte_net
> >  SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
> >  					vhost_user.c virtio_net.c vdpa.c
> >
> > +ifeq ($(CC_AVX512_SUPPORT), 1)
> > +CFLAGS += -DCC_AVX512_SUPPORT
> > +SRCS-$(CONFIG_RTE_LIBRTE_VHOST) += vhost_vec_avx.c
> > +CFLAGS_vhost_vec_avx.o += -mavx512f -mavx512bw -mavx512vl
> > +endif
> > +
> >  # install includes
> >  SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h
> rte_vdpa.h \
> >  						rte_vdpa_dev.h
> rte_vhost_async.h
> > diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> > index cc9aa65c67..c1481802d7 100644
> > --- a/lib/librte_vhost/meson.build
> > +++ b/lib/librte_vhost/meson.build
> > @@ -8,6 +8,22 @@ endif
> >  if has_libnuma == 1
> >  	dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
> >  endif
> > +
> > +if arch_subdir == 'x86'
> > +        if not machine_args.contains('-mno-avx512f')
> > +                if cc.has_argument('-mavx512f') and cc.has_argument('-
> mavx512vl') and cc.has_argument('-mavx512bw')
> > +                        cflags += ['-DCC_AVX512_SUPPORT']
> > +                        vhost_avx512_lib = static_library('vhost_avx512_lib',
> > +                                              'vhost_vec_avx.c',
> > +                                              dependencies: [static_rte_eal,
> static_rte_mempool,
> > +                                                  static_rte_mbuf, static_rte_ethdev,
> static_rte_net],
> > +                                              include_directories: includes,
> > +                                              c_args: [cflags, '-mavx512f', '-mavx512bw', '-
> mavx512vl'])
> > +                        objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
> > +                endif
> > +        endif
> > +endif
> > +
> >  if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> >  	cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> >  elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > index 4a81f18f01..fc7daf2145 100644
> > --- a/lib/librte_vhost/vhost.h
> > +++ b/lib/librte_vhost/vhost.h
> > @@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net
> *dev, struct rte_mempool *mp,
> >  	return NULL;
> >  }
> >
> > +int
> > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > +				 struct vhost_virtqueue *vq,
> > +				 struct rte_mempool *mbuf_pool,
> > +				 struct rte_mbuf **pkts,
> > +				 uint16_t avail_idx,
> > +				 uintptr_t *desc_addrs,
> > +				 uint16_t *ids);
> >  #endif /* _VHOST_NET_CDEV_H_ */
> > diff --git a/lib/librte_vhost/vhost_vec_avx.c
> b/lib/librte_vhost/vhost_vec_avx.c
> > new file mode 100644
> > index 0000000000..e8361d18fa
> > --- /dev/null
> > +++ b/lib/librte_vhost/vhost_vec_avx.c
> > @@ -0,0 +1,152 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2016 Intel Corporation
> > + */
> > +#include <stdint.h>
> > +
> > +#include "vhost.h"
> > +
> > +#define BYTE_SIZE 8
> > +/* reference count offset in mbuf rearm data */
> > +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> > +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +/* segment number offset in mbuf rearm data */
> > +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> > +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +
> > +/* default rearm data */
> > +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> > +	1ULL << REFCNT_BITS_OFFSET)
> > +
> > +#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc,
> flags) / \
> > +	sizeof(uint16_t))
> > +
> > +#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
> > +	sizeof(uint16_t))
> > +#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
> > +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
> > +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2)  |
> \
> > +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
> > +
> > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> > +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> > +
> > +#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL |
> VRING_DESC_F_USED) \
> > +	<< FLAGS_BITS_OFFSET)
> > +#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) <<
> FLAGS_BITS_OFFSET)
> > +#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> > +	FLAGS_BITS_OFFSET)
> > +
> > +#define DESC_FLAGS_POS 0xaa
> > +#define MBUF_LENS_POS 0x6666
> > +
> > +int
> > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > +				 struct vhost_virtqueue *vq,
> > +				 struct rte_mempool *mbuf_pool,
> > +				 struct rte_mbuf **pkts,
> > +				 uint16_t avail_idx,
> > +				 uintptr_t *desc_addrs,
> > +				 uint16_t *ids)
> > +{
> > +	struct vring_packed_desc *descs = vq->desc_packed;
> > +	uint32_t descs_status;
> > +	void *desc_addr;
> > +	uint16_t i;
> > +	uint8_t cmp_low, cmp_high, cmp_result;
> > +	uint64_t lens[PACKED_BATCH_SIZE];
> > +
> > +	if (unlikely(avail_idx & PACKED_BATCH_MASK))
> > +		return -1;
> > +
> > +	/* load 4 descs */
> > +	desc_addr = &vq->desc_packed[avail_idx];
> > +	__m512i desc_vec = _mm512_loadu_si512(desc_addr);
> > +
> > +	/* burst check four status */
> > +	__m512i avail_flag_vec;
> > +	if (vq->avail_wrap_counter)
> > +#if defined(RTE_ARCH_I686)
> > +		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG,
> 0x0,
> > +					PACKED_FLAGS_MASK, 0x0);
> > +#else
> > +		avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > +					PACKED_AVAIL_FLAG);
> > +
> > +#endif
> > +	else
> > +#if defined(RTE_ARCH_I686)
> > +		avail_flag_vec =
> _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
> > +					0x0, PACKED_AVAIL_FLAG_WRAP,
> 0x0);
> > +#else
> > +		avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > +					PACKED_AVAIL_FLAG_WRAP);
> > +#endif
> > +
> > +	descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
> > +		_MM_CMPINT_NE);
> > +	if (descs_status & BATCH_FLAGS_MASK)
> > +		return -1;
> > +
> > +	/* check buffer fit into one region & translate address */
> > +	__m512i regions_low_addrs =
> > +		_mm512_loadu_si512((void *)&dev->regions_low_addrs);
> > +	__m512i regions_high_addrs =
> > +		_mm512_loadu_si512((void *)&dev->regions_high_addrs);
> > +	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +		uint64_t addr_low = descs[avail_idx + i].addr;
> > +		uint64_t addr_high = addr_low + descs[avail_idx + i].len;
> > +		__m512i low_addr_vec = _mm512_set1_epi64(addr_low);
> > +		__m512i high_addr_vec = _mm512_set1_epi64(addr_high);
> > +
> > +		cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
> > +				regions_low_addrs, _MM_CMPINT_NLT);
> > +		cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
> > +				regions_high_addrs, _MM_CMPINT_LT);
> > +		cmp_result = cmp_low & cmp_high;
> > +		int index = __builtin_ctz(cmp_result);
> > +		if (unlikely((uint32_t)index >= dev->mem->nregions))
> > +			goto free_buf;
> > +
> > +		desc_addrs[i] = addr_low +
> > +			dev->mem->regions[index].host_user_addr -
> > +			dev->mem->regions[index].guest_phys_addr;
> > +		lens[i] = descs[avail_idx + i].len;
> > +		rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > +
> > +		pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool, lens[i]);
> > +		if (!pkts[i])
> > +			goto free_buf;
> > +	}
> 
> The above does not support vIOMMU, isn't it?
> 
> The more the packed datapath evolves, the more it gets optimized for a
> very specific configuration.
> 
> In v19.11, indirect descriptors and chained buffers are handled as a
> fallback. And now vIOMMU support is handled as a fallback.
> 

Hi Maxime,
Thanks for figuring out the feature miss. First version patch is lack of vIOMMU supporting.
V2 patch will fix the feature gap between vectorized function and original batch function.
So there will be no additional fallback introduced in vectorized patch set.

IMHO, current packed optimization introduced complexity is for handling that gap between performance aimed frontend (like PMD) and normal network traffic (like TCP).
Vectorized datapath is focusing in enhancing the performance of batched function.  From function point of view, there will no difference between vectorized batched function and original batched function. 
Current packed ring path will remain the same if vectorized option is not enable. So I think the complexity won't increase too much. If there's any concern, please let me known. 

BTW, vectorized path can help performance a lot when vIOMMU enabled. 

Regards,
Marvin

> I personnally don't like the path it is taking as it is adding a lot of
> complexity on top of that.
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v2 0/5] vhost add vectorized data path
  2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 1/5] vhost: " Marvin Liu
@ 2020-09-21  6:48   ` " Marvin Liu
  2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 1/5] vhost: " Marvin Liu
                       ` (5 more replies)
  2020-10-09  8:14   ` [dpdk-dev] [PATCH v3 " Marvin Liu
  1 sibling, 6 replies; 36+ messages in thread
From: Marvin Liu @ 2020-09-21  6:48 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Packed ring format is imported since virtio spec 1.1. All descriptors
are compacted into one single ring when packed ring format is on. It is
straight forward that ring operations can be accelerated by utilizing
SIMD instructions. 

This patch set will introduce vectorized data path in vhost library. If
vectorized option is on, operations like descs check, descs writeback,
address translation will be accelerated by SIMD instructions. Vhost
application can choose whether using vectorized acceleration, it is 
like external buffer and zero copy features. 

If platform or ring format not support vectorized function, vhost will
fallback to use default batch function. There will be no impact in current
data path.

v2:
* add vIOMMU support
* add dequeue offloading
* rebase code

Marvin Liu (5):
  vhost: add vectorized data path
  vhost: reuse packed ring functions
  vhost: prepare memory regions addresses
  vhost: add packed ring vectorized dequeue
  vhost: add packed ring vectorized enqueue

 doc/guides/nics/vhost.rst           |   5 +
 doc/guides/prog_guide/vhost_lib.rst |  12 +
 drivers/net/vhost/rte_eth_vhost.c   |  17 +-
 lib/librte_vhost/meson.build        |  16 ++
 lib/librte_vhost/rte_vhost.h        |   1 +
 lib/librte_vhost/socket.c           |   5 +
 lib/librte_vhost/vhost.c            |  11 +
 lib/librte_vhost/vhost.h            | 235 +++++++++++++++++++
 lib/librte_vhost/vhost_user.c       |  11 +
 lib/librte_vhost/vhost_vec_avx.c    | 338 ++++++++++++++++++++++++++++
 lib/librte_vhost/virtio_net.c       | 257 ++++-----------------
 11 files changed, 692 insertions(+), 216 deletions(-)
 create mode 100644 lib/librte_vhost/vhost_vec_avx.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v2 1/5] vhost: add vectorized data path
  2020-09-21  6:48   ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
@ 2020-09-21  6:48     ` " Marvin Liu
  2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 2/5] vhost: reuse packed ring functions Marvin Liu
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-09-21  6:48 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Packed ring operations are split into batch and single functions for
performance perspective. Ring operations in batch function can be
accelerated by SIMD instructions like AVX512.

So introduce vectorized parameter in vhost. Vectorized data path can be
selected if platform and ring format matched requirements. Otherwise
will fallback to original data path.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/doc/guides/nics/vhost.rst b/doc/guides/nics/vhost.rst
index d36f3120b2..efdaf4de09 100644
--- a/doc/guides/nics/vhost.rst
+++ b/doc/guides/nics/vhost.rst
@@ -64,6 +64,11 @@ The user can specify below arguments in `--vdev` option.
     It is used to enable external buffer support in vhost library.
     (Default: 0 (disabled))
 
+#.  ``vectorized``:
+
+    It is used to enable vectorized data path support in vhost library.
+    (Default: 0 (disabled))
+
 Vhost PMD event handling
 ------------------------
 
diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index b892eec67a..d5d421441c 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -162,6 +162,18 @@ The following is an overview of some key Vhost API functions:
 
     It is disabled by default.
 
+ - ``RTE_VHOST_USER_VECTORIZED``
+    Vectorized data path will used when this flag is set. When packed ring
+    enabled, available descriptors are stored from frontend driver in sequence.
+    SIMD instructions like AVX can be used to handle multiple descriptors
+    simultaneously. Thus can accelerate the throughput of ring operations.
+
+    * Only packed ring has vectorized data path.
+
+    * Will fallback to normal datapath if no vectorization support.
+
+    It is disabled by default.
+
 * ``rte_vhost_driver_set_features(path, features)``
 
   This function sets the feature bits the vhost-user driver supports. The
diff --git a/drivers/net/vhost/rte_eth_vhost.c b/drivers/net/vhost/rte_eth_vhost.c
index e55278af69..2ba5a2a076 100644
--- a/drivers/net/vhost/rte_eth_vhost.c
+++ b/drivers/net/vhost/rte_eth_vhost.c
@@ -35,6 +35,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
 #define ETH_VHOST_VIRTIO_NET_F_HOST_TSO "tso"
 #define ETH_VHOST_LINEAR_BUF  "linear-buffer"
 #define ETH_VHOST_EXT_BUF  "ext-buffer"
+#define ETH_VHOST_VECTORIZED "vectorized"
 #define VHOST_MAX_PKT_BURST 32
 
 static const char *valid_arguments[] = {
@@ -47,6 +48,7 @@ static const char *valid_arguments[] = {
 	ETH_VHOST_VIRTIO_NET_F_HOST_TSO,
 	ETH_VHOST_LINEAR_BUF,
 	ETH_VHOST_EXT_BUF,
+	ETH_VHOST_VECTORIZED,
 	NULL
 };
 
@@ -1507,6 +1509,7 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
 	int tso = 0;
 	int linear_buf = 0;
 	int ext_buf = 0;
+	int vectorized = 0;
 	struct rte_eth_dev *eth_dev;
 	const char *name = rte_vdev_device_name(dev);
 
@@ -1626,6 +1629,17 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
 			flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
 	}
 
+	if (rte_kvargs_count(kvlist, ETH_VHOST_VECTORIZED) == 1) {
+		ret = rte_kvargs_process(kvlist,
+				ETH_VHOST_VECTORIZED,
+				&open_int, &vectorized);
+		if (ret < 0)
+			goto out_free;
+
+		if (vectorized == 1)
+			flags |= RTE_VHOST_USER_VECTORIZED;
+	}
+
 	if (dev->device.numa_node == SOCKET_ID_ANY)
 		dev->device.numa_node = rte_socket_id();
 
@@ -1679,4 +1693,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_vhost,
 	"postcopy-support=<0|1> "
 	"tso=<0|1> "
 	"linear-buffer=<0|1> "
-	"ext-buffer=<0|1>");
+	"ext-buffer=<0|1> "
+	"vectorized=<0|1>");
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index a94c84134d..c7f946c6c1 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -36,6 +36,7 @@ extern "C" {
 /* support only linear buffers (no chained mbufs) */
 #define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
 #define RTE_VHOST_USER_ASYNC_COPY	(1ULL << 7)
+#define RTE_VHOST_USER_VECTORIZED	(1ULL << 8)
 
 /* Features. */
 #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 73e1dca95e..cc11244693 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -43,6 +43,7 @@ struct vhost_user_socket {
 	bool extbuf;
 	bool linearbuf;
 	bool async_copy;
+	bool vectorized;
 
 	/*
 	 * The "supported_features" indicates the feature bits the
@@ -245,6 +246,9 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 			dev->async_copy = 1;
 	}
 
+	if (vsocket->vectorized)
+		vhost_enable_vectorized(vid);
+
 	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
 
 	if (vsocket->notify_ops->new_connection) {
@@ -881,6 +885,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
 	vsocket->dequeue_zero_copy = flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
 	vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
 	vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
+	vsocket->vectorized = flags & RTE_VHOST_USER_VECTORIZED;
 
 	if (vsocket->dequeue_zero_copy &&
 	    (flags & RTE_VHOST_USER_IOMMU_SUPPORT)) {
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 8f20a0818f..50bf033a9d 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -752,6 +752,17 @@ vhost_enable_linearbuf(int vid)
 	dev->linearbuf = 1;
 }
 
+void
+vhost_enable_vectorized(int vid)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (dev == NULL)
+		return;
+
+	dev->vectorized = 1;
+}
+
 int
 rte_vhost_get_mtu(int vid, uint16_t *mtu)
 {
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 632f66d532..b556eb3bf6 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -383,6 +383,7 @@ struct virtio_net {
 	int			async_copy;
 	int			extbuf;
 	int			linearbuf;
+	int                     vectorized;
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
 	struct inflight_mem_info *inflight_info;
 #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
@@ -721,6 +722,7 @@ void vhost_enable_dequeue_zero_copy(int vid);
 void vhost_set_builtin_virtio_net(int vid, bool enable);
 void vhost_enable_extbuf(int vid);
 void vhost_enable_linearbuf(int vid);
+void vhost_enable_vectorized(int vid);
 int vhost_enable_guest_notification(struct virtio_net *dev,
 		struct vhost_virtqueue *vq, int enable);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v2 2/5] vhost: reuse packed ring functions
  2020-09-21  6:48   ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
  2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 1/5] vhost: " Marvin Liu
@ 2020-09-21  6:48     ` Marvin Liu
  2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 3/5] vhost: prepare memory regions addresses Marvin Liu
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-09-21  6:48 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Move parse_ethernet, offload, extbuf functions to header file. These
functions will be reused by vhost vectorized path.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index b556eb3bf6..5a5c945551 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -20,6 +20,10 @@
 #include <rte_rwlock.h>
 #include <rte_malloc.h>
 
+#include <rte_ip.h>
+#include <rte_tcp.h>
+#include <rte_udp.h>
+#include <rte_sctp.h>
 #include "rte_vhost.h"
 #include "rte_vdpa.h"
 #include "rte_vdpa_dev.h"
@@ -905,4 +909,215 @@ put_zmbuf(struct zcopy_mbuf *zmbuf)
 	zmbuf->in_use = 0;
 }
 
+static  __rte_always_inline bool
+virtio_net_is_inorder(struct virtio_net *dev)
+{
+	return dev->features & (1ULL << VIRTIO_F_IN_ORDER);
+}
+
+static __rte_always_inline void
+parse_ethernet(struct rte_mbuf *m, uint16_t *l4_proto, void **l4_hdr)
+{
+	struct rte_ipv4_hdr *ipv4_hdr;
+	struct rte_ipv6_hdr *ipv6_hdr;
+	void *l3_hdr = NULL;
+	struct rte_ether_hdr *eth_hdr;
+	uint16_t ethertype;
+
+	eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
+
+	m->l2_len = sizeof(struct rte_ether_hdr);
+	ethertype = rte_be_to_cpu_16(eth_hdr->ether_type);
+
+	if (ethertype == RTE_ETHER_TYPE_VLAN) {
+		struct rte_vlan_hdr *vlan_hdr =
+			(struct rte_vlan_hdr *)(eth_hdr + 1);
+
+		m->l2_len += sizeof(struct rte_vlan_hdr);
+		ethertype = rte_be_to_cpu_16(vlan_hdr->eth_proto);
+	}
+
+	l3_hdr = (char *)eth_hdr + m->l2_len;
+
+	switch (ethertype) {
+	case RTE_ETHER_TYPE_IPV4:
+		ipv4_hdr = l3_hdr;
+		*l4_proto = ipv4_hdr->next_proto_id;
+		m->l3_len = (ipv4_hdr->version_ihl & 0x0f) * 4;
+		*l4_hdr = (char *)l3_hdr + m->l3_len;
+		m->ol_flags |= PKT_TX_IPV4;
+		break;
+	case RTE_ETHER_TYPE_IPV6:
+		ipv6_hdr = l3_hdr;
+		*l4_proto = ipv6_hdr->proto;
+		m->l3_len = sizeof(struct rte_ipv6_hdr);
+		*l4_hdr = (char *)l3_hdr + m->l3_len;
+		m->ol_flags |= PKT_TX_IPV6;
+		break;
+	default:
+		m->l3_len = 0;
+		*l4_proto = 0;
+		*l4_hdr = NULL;
+		break;
+	}
+}
+
+static __rte_always_inline bool
+virtio_net_with_host_offload(struct virtio_net *dev)
+{
+	if (dev->features &
+			((1ULL << VIRTIO_NET_F_CSUM) |
+			 (1ULL << VIRTIO_NET_F_HOST_ECN) |
+			 (1ULL << VIRTIO_NET_F_HOST_TSO4) |
+			 (1ULL << VIRTIO_NET_F_HOST_TSO6) |
+			 (1ULL << VIRTIO_NET_F_HOST_UFO)))
+		return true;
+
+	return false;
+}
+
+static __rte_always_inline void
+vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
+{
+	uint16_t l4_proto = 0;
+	void *l4_hdr = NULL;
+	struct rte_tcp_hdr *tcp_hdr = NULL;
+
+	if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
+		return;
+
+	parse_ethernet(m, &l4_proto, &l4_hdr);
+	if (hdr->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		if (hdr->csum_start == (m->l2_len + m->l3_len)) {
+			switch (hdr->csum_offset) {
+			case (offsetof(struct rte_tcp_hdr, cksum)):
+				if (l4_proto == IPPROTO_TCP)
+					m->ol_flags |= PKT_TX_TCP_CKSUM;
+				break;
+			case (offsetof(struct rte_udp_hdr, dgram_cksum)):
+				if (l4_proto == IPPROTO_UDP)
+					m->ol_flags |= PKT_TX_UDP_CKSUM;
+				break;
+			case (offsetof(struct rte_sctp_hdr, cksum)):
+				if (l4_proto == IPPROTO_SCTP)
+					m->ol_flags |= PKT_TX_SCTP_CKSUM;
+				break;
+			default:
+				break;
+			}
+		}
+	}
+
+	if (l4_hdr && hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+		case VIRTIO_NET_HDR_GSO_TCPV4:
+		case VIRTIO_NET_HDR_GSO_TCPV6:
+			tcp_hdr = l4_hdr;
+			m->ol_flags |= PKT_TX_TCP_SEG;
+			m->tso_segsz = hdr->gso_size;
+			m->l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
+			break;
+		case VIRTIO_NET_HDR_GSO_UDP:
+			m->ol_flags |= PKT_TX_UDP_SEG;
+			m->tso_segsz = hdr->gso_size;
+			m->l4_len = sizeof(struct rte_udp_hdr);
+			break;
+		default:
+			VHOST_LOG_DATA(WARNING,
+				"unsupported gso type %u.\n", hdr->gso_type);
+			break;
+		}
+	}
+}
+
+static void
+virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
+{
+	rte_free(opaque);
+}
+
+static int
+virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
+{
+	struct rte_mbuf_ext_shared_info *shinfo = NULL;
+	uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
+	uint16_t buf_len;
+	rte_iova_t iova;
+	void *buf;
+
+	/* Try to use pkt buffer to store shinfo to reduce the amount of memory
+	 * required, otherwise store shinfo in the new buffer.
+	 */
+	if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
+		shinfo = rte_pktmbuf_mtod(pkt,
+					  struct rte_mbuf_ext_shared_info *);
+	else {
+		total_len += sizeof(*shinfo) + sizeof(uintptr_t);
+		total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
+	}
+
+	if (unlikely(total_len > UINT16_MAX))
+		return -ENOSPC;
+
+	buf_len = total_len;
+	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
+	if (unlikely(buf == NULL))
+		return -ENOMEM;
+
+	/* Initialize shinfo */
+	if (shinfo) {
+		shinfo->free_cb = virtio_dev_extbuf_free;
+		shinfo->fcb_opaque = buf;
+		rte_mbuf_ext_refcnt_set(shinfo, 1);
+	} else {
+		shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
+					      virtio_dev_extbuf_free, buf);
+		if (unlikely(shinfo == NULL)) {
+			rte_free(buf);
+			VHOST_LOG_DATA(ERR, "Failed to init shinfo\n");
+			return -1;
+		}
+	}
+
+	iova = rte_malloc_virt2iova(buf);
+	rte_pktmbuf_attach_extbuf(pkt, buf, iova, buf_len, shinfo);
+	rte_pktmbuf_reset_headroom(pkt);
+
+	return 0;
+}
+
+/*
+ * Allocate a host supported pktmbuf.
+ */
+static __rte_always_inline struct rte_mbuf *
+virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
+			 uint32_t data_len)
+{
+	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
+
+	if (unlikely(pkt == NULL)) {
+		VHOST_LOG_DATA(ERR,
+			"Failed to allocate memory for mbuf.\n");
+		return NULL;
+	}
+
+	if (rte_pktmbuf_tailroom(pkt) >= data_len)
+		return pkt;
+
+	/* attach an external buffer if supported */
+	if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
+		return pkt;
+
+	/* check if chained buffers are allowed */
+	if (!dev->linearbuf)
+		return pkt;
+
+	/* Data doesn't fit into the buffer and the host supports
+	 * only linear buffers
+	 */
+	rte_pktmbuf_free(pkt);
+
+	return NULL;
+}
+
 #endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index bd9303c8a9..6107662685 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -32,12 +32,6 @@ rxvq_is_mergeable(struct virtio_net *dev)
 	return dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF);
 }
 
-static  __rte_always_inline bool
-virtio_net_is_inorder(struct virtio_net *dev)
-{
-	return dev->features & (1ULL << VIRTIO_F_IN_ORDER);
-}
-
 static bool
 is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring)
 {
@@ -1804,121 +1798,6 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
 	return virtio_dev_rx_async_submit(dev, queue_id, pkts, count);
 }
 
-static inline bool
-virtio_net_with_host_offload(struct virtio_net *dev)
-{
-	if (dev->features &
-			((1ULL << VIRTIO_NET_F_CSUM) |
-			 (1ULL << VIRTIO_NET_F_HOST_ECN) |
-			 (1ULL << VIRTIO_NET_F_HOST_TSO4) |
-			 (1ULL << VIRTIO_NET_F_HOST_TSO6) |
-			 (1ULL << VIRTIO_NET_F_HOST_UFO)))
-		return true;
-
-	return false;
-}
-
-static void
-parse_ethernet(struct rte_mbuf *m, uint16_t *l4_proto, void **l4_hdr)
-{
-	struct rte_ipv4_hdr *ipv4_hdr;
-	struct rte_ipv6_hdr *ipv6_hdr;
-	void *l3_hdr = NULL;
-	struct rte_ether_hdr *eth_hdr;
-	uint16_t ethertype;
-
-	eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
-
-	m->l2_len = sizeof(struct rte_ether_hdr);
-	ethertype = rte_be_to_cpu_16(eth_hdr->ether_type);
-
-	if (ethertype == RTE_ETHER_TYPE_VLAN) {
-		struct rte_vlan_hdr *vlan_hdr =
-			(struct rte_vlan_hdr *)(eth_hdr + 1);
-
-		m->l2_len += sizeof(struct rte_vlan_hdr);
-		ethertype = rte_be_to_cpu_16(vlan_hdr->eth_proto);
-	}
-
-	l3_hdr = (char *)eth_hdr + m->l2_len;
-
-	switch (ethertype) {
-	case RTE_ETHER_TYPE_IPV4:
-		ipv4_hdr = l3_hdr;
-		*l4_proto = ipv4_hdr->next_proto_id;
-		m->l3_len = (ipv4_hdr->version_ihl & 0x0f) * 4;
-		*l4_hdr = (char *)l3_hdr + m->l3_len;
-		m->ol_flags |= PKT_TX_IPV4;
-		break;
-	case RTE_ETHER_TYPE_IPV6:
-		ipv6_hdr = l3_hdr;
-		*l4_proto = ipv6_hdr->proto;
-		m->l3_len = sizeof(struct rte_ipv6_hdr);
-		*l4_hdr = (char *)l3_hdr + m->l3_len;
-		m->ol_flags |= PKT_TX_IPV6;
-		break;
-	default:
-		m->l3_len = 0;
-		*l4_proto = 0;
-		*l4_hdr = NULL;
-		break;
-	}
-}
-
-static __rte_always_inline void
-vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
-{
-	uint16_t l4_proto = 0;
-	void *l4_hdr = NULL;
-	struct rte_tcp_hdr *tcp_hdr = NULL;
-
-	if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
-		return;
-
-	parse_ethernet(m, &l4_proto, &l4_hdr);
-	if (hdr->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-		if (hdr->csum_start == (m->l2_len + m->l3_len)) {
-			switch (hdr->csum_offset) {
-			case (offsetof(struct rte_tcp_hdr, cksum)):
-				if (l4_proto == IPPROTO_TCP)
-					m->ol_flags |= PKT_TX_TCP_CKSUM;
-				break;
-			case (offsetof(struct rte_udp_hdr, dgram_cksum)):
-				if (l4_proto == IPPROTO_UDP)
-					m->ol_flags |= PKT_TX_UDP_CKSUM;
-				break;
-			case (offsetof(struct rte_sctp_hdr, cksum)):
-				if (l4_proto == IPPROTO_SCTP)
-					m->ol_flags |= PKT_TX_SCTP_CKSUM;
-				break;
-			default:
-				break;
-			}
-		}
-	}
-
-	if (l4_hdr && hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
-		switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
-		case VIRTIO_NET_HDR_GSO_TCPV4:
-		case VIRTIO_NET_HDR_GSO_TCPV6:
-			tcp_hdr = l4_hdr;
-			m->ol_flags |= PKT_TX_TCP_SEG;
-			m->tso_segsz = hdr->gso_size;
-			m->l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
-			break;
-		case VIRTIO_NET_HDR_GSO_UDP:
-			m->ol_flags |= PKT_TX_UDP_SEG;
-			m->tso_segsz = hdr->gso_size;
-			m->l4_len = sizeof(struct rte_udp_hdr);
-			break;
-		default:
-			VHOST_LOG_DATA(WARNING,
-				"unsupported gso type %u.\n", hdr->gso_type);
-			break;
-		}
-	}
-}
-
 static __rte_noinline void
 copy_vnet_hdr_from_desc(struct virtio_net_hdr *hdr,
 		struct buf_vector *buf_vec)
@@ -2145,96 +2024,6 @@ get_zmbuf(struct vhost_virtqueue *vq)
 	return NULL;
 }
 
-static void
-virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
-{
-	rte_free(opaque);
-}
-
-static int
-virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
-{
-	struct rte_mbuf_ext_shared_info *shinfo = NULL;
-	uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
-	uint16_t buf_len;
-	rte_iova_t iova;
-	void *buf;
-
-	/* Try to use pkt buffer to store shinfo to reduce the amount of memory
-	 * required, otherwise store shinfo in the new buffer.
-	 */
-	if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
-		shinfo = rte_pktmbuf_mtod(pkt,
-					  struct rte_mbuf_ext_shared_info *);
-	else {
-		total_len += sizeof(*shinfo) + sizeof(uintptr_t);
-		total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
-	}
-
-	if (unlikely(total_len > UINT16_MAX))
-		return -ENOSPC;
-
-	buf_len = total_len;
-	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
-	if (unlikely(buf == NULL))
-		return -ENOMEM;
-
-	/* Initialize shinfo */
-	if (shinfo) {
-		shinfo->free_cb = virtio_dev_extbuf_free;
-		shinfo->fcb_opaque = buf;
-		rte_mbuf_ext_refcnt_set(shinfo, 1);
-	} else {
-		shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
-					      virtio_dev_extbuf_free, buf);
-		if (unlikely(shinfo == NULL)) {
-			rte_free(buf);
-			VHOST_LOG_DATA(ERR, "Failed to init shinfo\n");
-			return -1;
-		}
-	}
-
-	iova = rte_malloc_virt2iova(buf);
-	rte_pktmbuf_attach_extbuf(pkt, buf, iova, buf_len, shinfo);
-	rte_pktmbuf_reset_headroom(pkt);
-
-	return 0;
-}
-
-/*
- * Allocate a host supported pktmbuf.
- */
-static __rte_always_inline struct rte_mbuf *
-virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
-			 uint32_t data_len)
-{
-	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
-
-	if (unlikely(pkt == NULL)) {
-		VHOST_LOG_DATA(ERR,
-			"Failed to allocate memory for mbuf.\n");
-		return NULL;
-	}
-
-	if (rte_pktmbuf_tailroom(pkt) >= data_len)
-		return pkt;
-
-	/* attach an external buffer if supported */
-	if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
-		return pkt;
-
-	/* check if chained buffers are allowed */
-	if (!dev->linearbuf)
-		return pkt;
-
-	/* Data doesn't fit into the buffer and the host supports
-	 * only linear buffers
-	 */
-	rte_pktmbuf_free(pkt);
-
-	return NULL;
-}
-
 static __rte_noinline uint16_t
 virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v2 3/5] vhost: prepare memory regions addresses
  2020-09-21  6:48   ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
  2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 1/5] vhost: " Marvin Liu
  2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 2/5] vhost: reuse packed ring functions Marvin Liu
@ 2020-09-21  6:48     ` Marvin Liu
  2020-10-06 15:06       ` Maxime Coquelin
  2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
                       ` (2 subsequent siblings)
  5 siblings, 1 reply; 36+ messages in thread
From: Marvin Liu @ 2020-09-21  6:48 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Prepare memory regions guest physical addresses for vectorized data
path. These information will be utilized by SIMD instructions to find
matched region index.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 5a5c945551..4a81f18f01 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -52,6 +52,8 @@
 
 #define ASYNC_MAX_POLL_SEG 255
 
+#define MAX_NREGIONS 8
+
 #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
 #define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
 
@@ -375,6 +377,8 @@ struct inflight_mem_info {
 struct virtio_net {
 	/* Frontend (QEMU) memory and memory region information */
 	struct rte_vhost_memory	*mem;
+	uint64_t		regions_low_addrs[MAX_NREGIONS];
+	uint64_t		regions_high_addrs[MAX_NREGIONS];
 	uint64_t		features;
 	uint64_t		protocol_features;
 	int			vid;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index c3c924faec..89e75e9e71 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -1291,6 +1291,17 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
 		}
 	}
 
+	RTE_BUILD_BUG_ON(VHOST_MEMORY_MAX_NREGIONS != 8);
+	if (dev->vectorized) {
+		for (i = 0; i < memory->nregions; i++) {
+			dev->regions_low_addrs[i] =
+				memory->regions[i].guest_phys_addr;
+			dev->regions_high_addrs[i] =
+				memory->regions[i].guest_phys_addr +
+				memory->regions[i].memory_size;
+		}
+	}
+
 	for (i = 0; i < dev->nr_vring; i++) {
 		struct vhost_virtqueue *vq = dev->virtqueue[i];
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue
  2020-09-21  6:48   ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
                       ` (2 preceding siblings ...)
  2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 3/5] vhost: prepare memory regions addresses Marvin Liu
@ 2020-09-21  6:48     ` Marvin Liu
  2020-10-06 14:59       ` Maxime Coquelin
  2020-10-06 15:18       ` Maxime Coquelin
  2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
  2020-10-06 13:34     ` [dpdk-dev] [PATCH v2 0/5] vhost add vectorized data path Maxime Coquelin
  5 siblings, 2 replies; 36+ messages in thread
From: Marvin Liu @ 2020-09-21  6:48 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Optimize vhost packed ring dequeue path with SIMD instructions. Four
descriptors status check and writeback are batched handled with AVX512
instructions. Address translation operations are also accelerated by
AVX512 instructions.

If platform or compiler not support vectorization, will fallback to
default path.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
index cc9aa65c67..c1481802d7 100644
--- a/lib/librte_vhost/meson.build
+++ b/lib/librte_vhost/meson.build
@@ -8,6 +8,22 @@ endif
 if has_libnuma == 1
 	dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
 endif
+
+if arch_subdir == 'x86'
+        if not machine_args.contains('-mno-avx512f')
+                if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
+                        cflags += ['-DCC_AVX512_SUPPORT']
+                        vhost_avx512_lib = static_library('vhost_avx512_lib',
+                                              'vhost_vec_avx.c',
+                                              dependencies: [static_rte_eal, static_rte_mempool,
+                                                  static_rte_mbuf, static_rte_ethdev, static_rte_net],
+                                              include_directories: includes,
+                                              c_args: [cflags, '-mavx512f', '-mavx512bw', '-mavx512vl'])
+                        objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
+                endif
+        endif
+endif
+
 if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
 	cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
 elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 4a81f18f01..fc7daf2145 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
 	return NULL;
 }
 
+int
+vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
+				 struct vhost_virtqueue *vq,
+				 struct rte_mempool *mbuf_pool,
+				 struct rte_mbuf **pkts,
+				 uint16_t avail_idx,
+				 uintptr_t *desc_addrs,
+				 uint16_t *ids);
 #endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
new file mode 100644
index 0000000000..dc5322d002
--- /dev/null
+++ b/lib/librte_vhost/vhost_vec_avx.c
@@ -0,0 +1,181 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2016 Intel Corporation
+ */
+#include <stdint.h>
+
+#include "vhost.h"
+
+#define BYTE_SIZE 8
+/* reference count offset in mbuf rearm data */
+#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
+	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+/* segment number offset in mbuf rearm data */
+#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
+	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+
+/* default rearm data */
+#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
+	1ULL << REFCNT_BITS_OFFSET)
+
+#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc, flags) / \
+	sizeof(uint16_t))
+
+#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
+	sizeof(uint16_t))
+#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
+	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
+	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2)  | \
+	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
+
+#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
+	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
+
+#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL | VRING_DESC_F_USED) \
+	<< FLAGS_BITS_OFFSET)
+#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
+#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
+	FLAGS_BITS_OFFSET)
+
+#define DESC_FLAGS_POS 0xaa
+#define MBUF_LENS_POS 0x6666
+
+int
+vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
+				 struct vhost_virtqueue *vq,
+				 struct rte_mempool *mbuf_pool,
+				 struct rte_mbuf **pkts,
+				 uint16_t avail_idx,
+				 uintptr_t *desc_addrs,
+				 uint16_t *ids)
+{
+	struct vring_packed_desc *descs = vq->desc_packed;
+	uint32_t descs_status;
+	void *desc_addr;
+	uint16_t i;
+	uint8_t cmp_low, cmp_high, cmp_result;
+	uint64_t lens[PACKED_BATCH_SIZE];
+	struct virtio_net_hdr *hdr;
+
+	if (unlikely(avail_idx & PACKED_BATCH_MASK))
+		return -1;
+
+	/* load 4 descs */
+	desc_addr = &vq->desc_packed[avail_idx];
+	__m512i desc_vec = _mm512_loadu_si512(desc_addr);
+
+	/* burst check four status */
+	__m512i avail_flag_vec;
+	if (vq->avail_wrap_counter)
+#if defined(RTE_ARCH_I686)
+		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG, 0x0,
+					PACKED_FLAGS_MASK, 0x0);
+#else
+		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+					PACKED_AVAIL_FLAG);
+
+#endif
+	else
+#if defined(RTE_ARCH_I686)
+		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
+					0x0, PACKED_AVAIL_FLAG_WRAP, 0x0);
+#else
+		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+					PACKED_AVAIL_FLAG_WRAP);
+#endif
+
+	descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
+		_MM_CMPINT_NE);
+	if (descs_status & BATCH_FLAGS_MASK)
+		return -1;
+
+	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			uint64_t size = (uint64_t)descs[avail_idx + i].len;
+			desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
+				descs[avail_idx + i].addr, &size,
+				VHOST_ACCESS_RO);
+
+			if (!desc_addrs[i])
+				goto free_buf;
+			lens[i] = descs[avail_idx + i].len;
+			rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
+
+			pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
+					lens[i]);
+			if (!pkts[i])
+				goto free_buf;
+		}
+	} else {
+		/* check buffer fit into one region & translate address */
+		__m512i regions_low_addrs =
+			_mm512_loadu_si512((void *)&dev->regions_low_addrs);
+		__m512i regions_high_addrs =
+			_mm512_loadu_si512((void *)&dev->regions_high_addrs);
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			uint64_t addr_low = descs[avail_idx + i].addr;
+			uint64_t addr_high = addr_low +
+						descs[avail_idx + i].len;
+			__m512i low_addr_vec = _mm512_set1_epi64(addr_low);
+			__m512i high_addr_vec = _mm512_set1_epi64(addr_high);
+
+			cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
+					regions_low_addrs, _MM_CMPINT_NLT);
+			cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
+					regions_high_addrs, _MM_CMPINT_LT);
+			cmp_result = cmp_low & cmp_high;
+			int index = __builtin_ctz(cmp_result);
+			if (unlikely((uint32_t)index >= dev->mem->nregions))
+				goto free_buf;
+
+			desc_addrs[i] = addr_low +
+				dev->mem->regions[index].host_user_addr -
+				dev->mem->regions[index].guest_phys_addr;
+			lens[i] = descs[avail_idx + i].len;
+			rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
+
+			pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
+					lens[i]);
+			if (!pkts[i])
+				goto free_buf;
+		}
+	}
+
+	if (virtio_net_with_host_offload(dev)) {
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = (struct virtio_net_hdr *)(desc_addrs[i]);
+			vhost_dequeue_offload(hdr, pkts[i]);
+		}
+	}
+
+	if (unlikely(virtio_net_is_inorder(dev))) {
+		ids[PACKED_BATCH_SIZE - 1] =
+			descs[avail_idx + PACKED_BATCH_SIZE - 1].id;
+	} else {
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
+			ids[i] = descs[avail_idx + i].id;
+	}
+
+	uint64_t addrs[PACKED_BATCH_SIZE << 1];
+	/* store mbuf data_len, pkt_len */
+	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		addrs[i << 1] = (uint64_t)pkts[i]->rx_descriptor_fields1;
+		addrs[(i << 1) + 1] = (uint64_t)pkts[i]->rx_descriptor_fields1
+					+ sizeof(uint64_t);
+	}
+
+	/* save pkt_len and data_len into mbufs */
+	__m512i value_vec = _mm512_maskz_shuffle_epi32(MBUF_LENS_POS, desc_vec,
+					0xAA);
+	__m512i offsets_vec = _mm512_maskz_set1_epi32(MBUF_LENS_POS,
+					(uint32_t)-12);
+	value_vec = _mm512_add_epi32(value_vec, offsets_vec);
+	__m512i vindex = _mm512_loadu_si512((void *)addrs);
+	_mm512_i64scatter_epi64(0, vindex, value_vec, 1);
+
+	return 0;
+free_buf:
+	for (i = 0; i < PACKED_BATCH_SIZE; i++)
+		rte_pktmbuf_free(pkts[i]);
+
+	return -1;
+}
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 6107662685..e4d2e2e7d6 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2249,6 +2249,28 @@ vhost_reserve_avail_batch_packed(struct virtio_net *dev,
 	return -1;
 }
 
+static __rte_always_inline int
+vhost_handle_avail_batch_packed(struct virtio_net *dev,
+				 struct vhost_virtqueue *vq,
+				 struct rte_mempool *mbuf_pool,
+				 struct rte_mbuf **pkts,
+				 uint16_t avail_idx,
+				 uintptr_t *desc_addrs,
+				 uint16_t *ids)
+{
+	if (unlikely(dev->vectorized))
+#ifdef CC_AVX512_SUPPORT
+		return vhost_reserve_avail_batch_packed_avx(dev, vq, mbuf_pool,
+				pkts, avail_idx, desc_addrs, ids);
+#else
+		return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool,
+				pkts, avail_idx, desc_addrs, ids);
+
+#endif
+	return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
+			avail_idx, desc_addrs, ids);
+}
+
 static __rte_always_inline int
 virtio_dev_tx_batch_packed(struct virtio_net *dev,
 			   struct vhost_virtqueue *vq,
@@ -2261,8 +2283,9 @@ virtio_dev_tx_batch_packed(struct virtio_net *dev,
 	uint16_t ids[PACKED_BATCH_SIZE];
 	uint16_t i;
 
-	if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
-					     avail_idx, desc_addrs, ids))
+
+	if (vhost_handle_avail_batch_packed(dev, vq, mbuf_pool, pkts,
+		avail_idx, desc_addrs, ids))
 		return -1;
 
 	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v2 5/5] vhost: add packed ring vectorized enqueue
  2020-09-21  6:48   ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
                       ` (3 preceding siblings ...)
  2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
@ 2020-09-21  6:48     ` Marvin Liu
  2020-10-06 15:00       ` Maxime Coquelin
  2020-10-06 13:34     ` [dpdk-dev] [PATCH v2 0/5] vhost add vectorized data path Maxime Coquelin
  5 siblings, 1 reply; 36+ messages in thread
From: Marvin Liu @ 2020-09-21  6:48 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Optimize vhost packed ring enqueue path with SIMD instructions. Four
descriptors status and length are batched handled with AVX512
instructions. Address translation operations are also accelerated
by AVX512 instructions.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index fc7daf2145..b78b2c5c1b 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -1132,4 +1132,10 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
 				 uint16_t avail_idx,
 				 uintptr_t *desc_addrs,
 				 uint16_t *ids);
+
+int
+virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
+			       struct vhost_virtqueue *vq,
+			       struct rte_mbuf **pkts);
+
 #endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
index dc5322d002..7d2250ed86 100644
--- a/lib/librte_vhost/vhost_vec_avx.c
+++ b/lib/librte_vhost/vhost_vec_avx.c
@@ -35,9 +35,15 @@
 #define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
 #define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
 	FLAGS_BITS_OFFSET)
+#define PACKED_WRITE_AVAIL_FLAG (PACKED_AVAIL_FLAG | \
+	((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
+#define PACKED_WRITE_AVAIL_FLAG_WRAP (PACKED_AVAIL_FLAG_WRAP | \
+	((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
 
 #define DESC_FLAGS_POS 0xaa
 #define MBUF_LENS_POS 0x6666
+#define DESC_LENS_POS 0x4444
+#define DESC_LENS_FLAGS_POS 0xB0B0B0B0
 
 int
 vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
@@ -179,3 +185,154 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
 
 	return -1;
 }
+
+int
+virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
+			       struct vhost_virtqueue *vq,
+			       struct rte_mbuf **pkts)
+{
+	struct vring_packed_desc *descs = vq->desc_packed;
+	uint16_t avail_idx = vq->last_avail_idx;
+	uint64_t desc_addrs[PACKED_BATCH_SIZE];
+	uint32_t buf_offset = dev->vhost_hlen;
+	uint32_t desc_status;
+	uint64_t lens[PACKED_BATCH_SIZE];
+	uint16_t i;
+	void *desc_addr;
+	uint8_t cmp_low, cmp_high, cmp_result;
+
+	if (unlikely(avail_idx & PACKED_BATCH_MASK))
+		return -1;
+
+	/* check refcnt and nb_segs */
+	__m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
+
+	/* load four mbufs rearm data */
+	__m256i mbufs = _mm256_set_epi64x(
+				*pkts[3]->rearm_data,
+				*pkts[2]->rearm_data,
+				*pkts[1]->rearm_data,
+				*pkts[0]->rearm_data);
+
+	uint16_t cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
+	if (cmp & MBUF_LENS_POS)
+		return -1;
+
+	/* check desc status */
+	desc_addr = &vq->desc_packed[avail_idx];
+	__m512i desc_vec = _mm512_loadu_si512(desc_addr);
+
+	__m512i avail_flag_vec;
+	__m512i used_flag_vec;
+	if (vq->avail_wrap_counter) {
+#if defined(RTE_ARCH_I686)
+		avail_flag_vec = _mm512_set4_epi64(PACKED_WRITE_AVAIL_FLAG,
+					0x0, PACKED_WRITE_AVAIL_FLAG, 0x0);
+		used_flag_vec = _mm512_set4_epi64(PACKED_FLAGS_MASK, 0x0,
+					PACKED_FLAGS_MASK, 0x0);
+#else
+		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+					PACKED_WRITE_AVAIL_FLAG);
+		used_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+					PACKED_FLAGS_MASK);
+#endif
+	} else {
+#if defined(RTE_ARCH_I686)
+		avail_flag_vec = _mm512_set4_epi64(
+					PACKED_WRITE_AVAIL_FLAG_WRAP, 0x0,
+					PACKED_WRITE_AVAIL_FLAG, 0x0);
+		used_flag_vec = _mm512_set4_epi64(0x0, 0x0, 0x0, 0x0);
+#else
+		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+					PACKED_WRITE_AVAIL_FLAG_WRAP);
+		used_flag_vec = _mm512_setzero_epi32();
+#endif
+	}
+
+	desc_status = _mm512_mask_cmp_epu16_mask(BATCH_FLAGS_MASK, desc_vec,
+				avail_flag_vec, _MM_CMPINT_NE);
+	if (desc_status)
+		return -1;
+
+	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			uint64_t size = (uint64_t)descs[avail_idx + i].len;
+			desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
+				descs[avail_idx + i].addr, &size,
+				VHOST_ACCESS_RW);
+
+			if (!desc_addrs[i])
+				return -1;
+
+			rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *,
+					0));
+		}
+	} else {
+		/* check buffer fit into one region & translate address */
+		__m512i regions_low_addrs =
+			_mm512_loadu_si512((void *)&dev->regions_low_addrs);
+		__m512i regions_high_addrs =
+			_mm512_loadu_si512((void *)&dev->regions_high_addrs);
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			uint64_t addr_low = descs[avail_idx + i].addr;
+			uint64_t addr_high = addr_low +
+						descs[avail_idx + i].len;
+			__m512i low_addr_vec = _mm512_set1_epi64(addr_low);
+			__m512i high_addr_vec = _mm512_set1_epi64(addr_high);
+
+			cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
+					regions_low_addrs, _MM_CMPINT_NLT);
+			cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
+					regions_high_addrs, _MM_CMPINT_LT);
+			cmp_result = cmp_low & cmp_high;
+			int index = __builtin_ctz(cmp_result);
+			if (unlikely((uint32_t)index >= dev->mem->nregions))
+				return -1;
+
+			desc_addrs[i] = addr_low +
+				dev->mem->regions[index].host_user_addr -
+				dev->mem->regions[index].guest_phys_addr;
+			rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *,
+					0));
+		}
+	}
+
+	/* check length is enough */
+	__m512i pkt_lens = _mm512_set_epi32(
+			0, pkts[3]->pkt_len, 0, 0,
+			0, pkts[2]->pkt_len, 0, 0,
+			0, pkts[1]->pkt_len, 0, 0,
+			0, pkts[0]->pkt_len, 0, 0);
+
+	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(DESC_LENS_POS,
+					dev->vhost_hlen);
+	__m512i buf_len_vec = _mm512_add_epi32(pkt_lens, mbuf_len_offset);
+	uint16_t lens_cmp = _mm512_mask_cmp_epu32_mask(DESC_LENS_POS,
+				desc_vec, buf_len_vec, _MM_CMPINT_LT);
+	if (lens_cmp)
+		return -1;
+
+	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		rte_memcpy((void *)(uintptr_t)(desc_addrs[i] + buf_offset),
+			   rte_pktmbuf_mtod_offset(pkts[i], void *, 0),
+			   pkts[i]->pkt_len);
+	}
+
+	if (unlikely((dev->features & (1ULL << VHOST_F_LOG_ALL)))) {
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			lens[i] = descs[avail_idx + i].len;
+			vhost_log_cache_write_iova(dev, vq,
+				descs[avail_idx + i].addr, lens[i]);
+		}
+	}
+
+	vq_inc_last_avail_packed(vq, PACKED_BATCH_SIZE);
+	vq_inc_last_used_packed(vq, PACKED_BATCH_SIZE);
+	/* save len and flags, skip addr and id */
+	__m512i desc_updated = _mm512_mask_add_epi16(desc_vec,
+					DESC_LENS_FLAGS_POS, buf_len_vec,
+					used_flag_vec);
+	_mm512_storeu_si512(desc_addr, desc_updated);
+
+	return 0;
+}
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index e4d2e2e7d6..5c56a8d6ff 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -1354,6 +1354,21 @@ virtio_dev_rx_single_packed(struct virtio_net *dev,
 	return 0;
 }
 
+static __rte_always_inline int
+virtio_dev_rx_handle_batch_packed(struct virtio_net *dev,
+			   struct vhost_virtqueue *vq,
+			   struct rte_mbuf **pkts)
+
+{
+	if (unlikely(dev->vectorized))
+#ifdef CC_AVX512_SUPPORT
+		return virtio_dev_rx_batch_packed_avx(dev, vq, pkts);
+#else
+		return virtio_dev_rx_batch_packed(dev, vq, pkts);
+#endif
+	return virtio_dev_rx_batch_packed(dev, vq, pkts);
+}
+
 static __rte_noinline uint32_t
 virtio_dev_rx_packed(struct virtio_net *dev,
 		     struct vhost_virtqueue *__rte_restrict vq,
@@ -1367,8 +1382,8 @@ virtio_dev_rx_packed(struct virtio_net *dev,
 		rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
 
 		if (remained >= PACKED_BATCH_SIZE) {
-			if (!virtio_dev_rx_batch_packed(dev, vq,
-							&pkts[pkt_idx])) {
+			if (!virtio_dev_rx_handle_batch_packed(dev, vq,
+				&pkts[pkt_idx])) {
 				pkt_idx += PACKED_BATCH_SIZE;
 				remained -= PACKED_BATCH_SIZE;
 				continue;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
  2020-09-21  6:26     ` Liu, Yong
@ 2020-09-21  7:47       ` Liu, Yong
  0 siblings, 0 replies; 36+ messages in thread
From: Liu, Yong @ 2020-09-21  7:47 UTC (permalink / raw)
  To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev



> -----Original Message-----
> From: Liu, Yong
> Sent: Monday, September 21, 2020 2:27 PM
> To: 'Maxime Coquelin' <maxime.coquelin@redhat.com>; Xia, Chenbo
> <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: RE: [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
> 
> 
> 
> > -----Original Message-----
> > From: Maxime Coquelin <maxime.coquelin@redhat.com>
> > Sent: Friday, September 18, 2020 9:45 PM
> > To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> > Wang, Zhihong <zhihong.wang@intel.com>
> > Cc: dev@dpdk.org
> > Subject: Re: [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
> >
> >
> >
> > On 8/19/20 5:24 AM, Marvin Liu wrote:
> > > Optimize vhost packed ring dequeue path with SIMD instructions. Four
> > > descriptors status check and writeback are batched handled with
> AVX512
> > > instructions. Address translation operations are also accelerated by
> > > AVX512 instructions.
> > >
> > > If platform or compiler not support vectorization, will fallback to
> > > default path.
> > >
> > > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> > >
> > > diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> > > index 4f2f3e47da..c0cd7d498f 100644
> > > --- a/lib/librte_vhost/Makefile
> > > +++ b/lib/librte_vhost/Makefile
> > > @@ -31,6 +31,13 @@ CFLAGS += -DVHOST_ICC_UNROLL_PRAGMA
> > >  endif
> > >  endif
> > >
> > > +ifneq ($(FORCE_DISABLE_AVX512), y)
> > > +        CC_AVX512_SUPPORT=\
> > > +        $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
> > > +        sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' |
> \
> > > +        grep -q AVX512 && echo 1)
> > > +endif
> > > +
> > >  ifeq ($(CONFIG_RTE_LIBRTE_VHOST_NUMA),y)
> > >  LDLIBS += -lnuma
> > >  endif
> > > @@ -40,6 +47,12 @@ LDLIBS += -lrte_eal -lrte_mempool -lrte_mbuf -
> > lrte_ethdev -lrte_net
> > >  SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c
> \
> > >  					vhost_user.c virtio_net.c vdpa.c
> > >
> > > +ifeq ($(CC_AVX512_SUPPORT), 1)
> > > +CFLAGS += -DCC_AVX512_SUPPORT
> > > +SRCS-$(CONFIG_RTE_LIBRTE_VHOST) += vhost_vec_avx.c
> > > +CFLAGS_vhost_vec_avx.o += -mavx512f -mavx512bw -mavx512vl
> > > +endif
> > > +
> > >  # install includes
> > >  SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h
> > rte_vdpa.h \
> > >  						rte_vdpa_dev.h
> > rte_vhost_async.h
> > > diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> > > index cc9aa65c67..c1481802d7 100644
> > > --- a/lib/librte_vhost/meson.build
> > > +++ b/lib/librte_vhost/meson.build
> > > @@ -8,6 +8,22 @@ endif
> > >  if has_libnuma == 1
> > >  	dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
> > >  endif
> > > +
> > > +if arch_subdir == 'x86'
> > > +        if not machine_args.contains('-mno-avx512f')
> > > +                if cc.has_argument('-mavx512f') and cc.has_argument('-
> > mavx512vl') and cc.has_argument('-mavx512bw')
> > > +                        cflags += ['-DCC_AVX512_SUPPORT']
> > > +                        vhost_avx512_lib = static_library('vhost_avx512_lib',
> > > +                                              'vhost_vec_avx.c',
> > > +                                              dependencies: [static_rte_eal,
> > static_rte_mempool,
> > > +                                                  static_rte_mbuf, static_rte_ethdev,
> > static_rte_net],
> > > +                                              include_directories: includes,
> > > +                                              c_args: [cflags, '-mavx512f', '-mavx512bw', '-
> > mavx512vl'])
> > > +                        objs +=
> vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
> > > +                endif
> > > +        endif
> > > +endif
> > > +
> > >  if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> > >  	cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> > >  elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> > > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > > index 4a81f18f01..fc7daf2145 100644
> > > --- a/lib/librte_vhost/vhost.h
> > > +++ b/lib/librte_vhost/vhost.h
> > > @@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net
> > *dev, struct rte_mempool *mp,
> > >  	return NULL;
> > >  }
> > >
> > > +int
> > > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > > +				 struct vhost_virtqueue *vq,
> > > +				 struct rte_mempool *mbuf_pool,
> > > +				 struct rte_mbuf **pkts,
> > > +				 uint16_t avail_idx,
> > > +				 uintptr_t *desc_addrs,
> > > +				 uint16_t *ids);
> > >  #endif /* _VHOST_NET_CDEV_H_ */
> > > diff --git a/lib/librte_vhost/vhost_vec_avx.c
> > b/lib/librte_vhost/vhost_vec_avx.c
> > > new file mode 100644
> > > index 0000000000..e8361d18fa
> > > --- /dev/null
> > > +++ b/lib/librte_vhost/vhost_vec_avx.c
> > > @@ -0,0 +1,152 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright(c) 2010-2016 Intel Corporation
> > > + */
> > > +#include <stdint.h>
> > > +
> > > +#include "vhost.h"
> > > +
> > > +#define BYTE_SIZE 8
> > > +/* reference count offset in mbuf rearm data */
> > > +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> > > +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > > +/* segment number offset in mbuf rearm data */
> > > +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> > > +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > > +
> > > +/* default rearm data */
> > > +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> > > +	1ULL << REFCNT_BITS_OFFSET)
> > > +
> > > +#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct
> vring_packed_desc,
> > flags) / \
> > > +	sizeof(uint16_t))
> > > +
> > > +#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
> > > +	sizeof(uint16_t))
> > > +#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
> > > +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
> > > +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2)  |
> > \
> > > +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
> > > +
> > > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) -
> \
> > > +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> > > +
> > > +#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL |
> > VRING_DESC_F_USED) \
> > > +	<< FLAGS_BITS_OFFSET)
> > > +#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) <<
> > FLAGS_BITS_OFFSET)
> > > +#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) <<
> \
> > > +	FLAGS_BITS_OFFSET)
> > > +
> > > +#define DESC_FLAGS_POS 0xaa
> > > +#define MBUF_LENS_POS 0x6666
> > > +
> > > +int
> > > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > > +				 struct vhost_virtqueue *vq,
> > > +				 struct rte_mempool *mbuf_pool,
> > > +				 struct rte_mbuf **pkts,
> > > +				 uint16_t avail_idx,
> > > +				 uintptr_t *desc_addrs,
> > > +				 uint16_t *ids)
> > > +{
> > > +	struct vring_packed_desc *descs = vq->desc_packed;
> > > +	uint32_t descs_status;
> > > +	void *desc_addr;
> > > +	uint16_t i;
> > > +	uint8_t cmp_low, cmp_high, cmp_result;
> > > +	uint64_t lens[PACKED_BATCH_SIZE];
> > > +
> > > +	if (unlikely(avail_idx & PACKED_BATCH_MASK))
> > > +		return -1;
> > > +
> > > +	/* load 4 descs */
> > > +	desc_addr = &vq->desc_packed[avail_idx];
> > > +	__m512i desc_vec = _mm512_loadu_si512(desc_addr);
> > > +
> > > +	/* burst check four status */
> > > +	__m512i avail_flag_vec;
> > > +	if (vq->avail_wrap_counter)
> > > +#if defined(RTE_ARCH_I686)
> > > +		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG,
> > 0x0,
> > > +					PACKED_FLAGS_MASK, 0x0);
> > > +#else
> > > +		avail_flag_vec =
> > _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > > +					PACKED_AVAIL_FLAG);
> > > +
> > > +#endif
> > > +	else
> > > +#if defined(RTE_ARCH_I686)
> > > +		avail_flag_vec =
> > _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
> > > +					0x0, PACKED_AVAIL_FLAG_WRAP,
> > 0x0);
> > > +#else
> > > +		avail_flag_vec =
> > _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > > +					PACKED_AVAIL_FLAG_WRAP);
> > > +#endif
> > > +
> > > +	descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
> > > +		_MM_CMPINT_NE);
> > > +	if (descs_status & BATCH_FLAGS_MASK)
> > > +		return -1;
> > > +
> > > +	/* check buffer fit into one region & translate address */
> > > +	__m512i regions_low_addrs =
> > > +		_mm512_loadu_si512((void *)&dev->regions_low_addrs);
> > > +	__m512i regions_high_addrs =
> > > +		_mm512_loadu_si512((void *)&dev->regions_high_addrs);
> > > +	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > > +		uint64_t addr_low = descs[avail_idx + i].addr;
> > > +		uint64_t addr_high = addr_low + descs[avail_idx + i].len;
> > > +		__m512i low_addr_vec = _mm512_set1_epi64(addr_low);
> > > +		__m512i high_addr_vec = _mm512_set1_epi64(addr_high);
> > > +
> > > +		cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
> > > +				regions_low_addrs, _MM_CMPINT_NLT);
> > > +		cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
> > > +				regions_high_addrs, _MM_CMPINT_LT);
> > > +		cmp_result = cmp_low & cmp_high;
> > > +		int index = __builtin_ctz(cmp_result);
> > > +		if (unlikely((uint32_t)index >= dev->mem->nregions))
> > > +			goto free_buf;
> > > +
> > > +		desc_addrs[i] = addr_low +
> > > +			dev->mem->regions[index].host_user_addr -
> > > +			dev->mem->regions[index].guest_phys_addr;
> > > +		lens[i] = descs[avail_idx + i].len;
> > > +		rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > > +
> > > +		pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool, lens[i]);
> > > +		if (!pkts[i])
> > > +			goto free_buf;
> > > +	}
> >
> > The above does not support vIOMMU, isn't it?
> >
> > The more the packed datapath evolves, the more it gets optimized for a
> > very specific configuration.
> >
> > In v19.11, indirect descriptors and chained buffers are handled as a
> > fallback. And now vIOMMU support is handled as a fallback.
> >
> 
> Hi Maxime,
> Thanks for figuring out the feature miss. First version patch is lack of
> vIOMMU supporting.
> V2 patch will fix the feature gap between vectorized function and original
> batch function.
> So there will be no additional fallback introduced in vectorized patch set.
> 
> IMHO, current packed optimization introduced complexity is for handling
> that gap between performance aimed frontend (like PMD) and normal
> network traffic (like TCP).
> Vectorized datapath is focusing in enhancing the performance of batched
> function.  From function point of view, there will no difference between
> vectorized batched function and original batched function.
> Current packed ring path will remain the same if vectorized option is not
> enable. So I think the complexity won't increase too much. If there's any
> concern, please let me known.
> 
> BTW, vectorized path can help performance a lot when vIOMMU enabled.
> 

After double check, most performance difference came from runtime setting. 
Performance gain of vectorized path is not so obvious. 

> Regards,
> Marvin
> 
> > I personnally don't like the path it is taking as it is adding a lot of
> > complexity on top of that.
> >


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/5] vhost add vectorized data path
  2020-09-21  6:48   ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
                       ` (4 preceding siblings ...)
  2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
@ 2020-10-06 13:34     ` Maxime Coquelin
  2020-10-08  6:20       ` Liu, Yong
  5 siblings, 1 reply; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-06 13:34 UTC (permalink / raw)
  To: Marvin Liu, chenbo.xia, zhihong.wang; +Cc: dev

Hi,

On 9/21/20 8:48 AM, Marvin Liu wrote:
> Packed ring format is imported since virtio spec 1.1. All descriptors
> are compacted into one single ring when packed ring format is on. It is
> straight forward that ring operations can be accelerated by utilizing
> SIMD instructions. 
> 
> This patch set will introduce vectorized data path in vhost library. If
> vectorized option is on, operations like descs check, descs writeback,
> address translation will be accelerated by SIMD instructions. Vhost
> application can choose whether using vectorized acceleration, it is 
> like external buffer and zero copy features. 
> 
> If platform or ring format not support vectorized function, vhost will
> fallback to use default batch function. There will be no impact in current
> data path.

As a pre-requisite, I'd like some performance numbers in both loopback
and PVP to figure out if adding such complexity is worth it, given we
will have to support it for at least one year.

Thanks,
Maxime

> v2:
> * add vIOMMU support
> * add dequeue offloading
> * rebase code
> 
> Marvin Liu (5):
>   vhost: add vectorized data path
>   vhost: reuse packed ring functions
>   vhost: prepare memory regions addresses
>   vhost: add packed ring vectorized dequeue
>   vhost: add packed ring vectorized enqueue
> 
>  doc/guides/nics/vhost.rst           |   5 +
>  doc/guides/prog_guide/vhost_lib.rst |  12 +
>  drivers/net/vhost/rte_eth_vhost.c   |  17 +-
>  lib/librte_vhost/meson.build        |  16 ++
>  lib/librte_vhost/rte_vhost.h        |   1 +
>  lib/librte_vhost/socket.c           |   5 +
>  lib/librte_vhost/vhost.c            |  11 +
>  lib/librte_vhost/vhost.h            | 235 +++++++++++++++++++
>  lib/librte_vhost/vhost_user.c       |  11 +
>  lib/librte_vhost/vhost_vec_avx.c    | 338 ++++++++++++++++++++++++++++
>  lib/librte_vhost/virtio_net.c       | 257 ++++-----------------
>  11 files changed, 692 insertions(+), 216 deletions(-)
>  create mode 100644 lib/librte_vhost/vhost_vec_avx.c
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue
  2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
@ 2020-10-06 14:59       ` Maxime Coquelin
  2020-10-08  7:05         ` Liu, Yong
  2020-10-06 15:18       ` Maxime Coquelin
  1 sibling, 1 reply; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-06 14:59 UTC (permalink / raw)
  To: Marvin Liu, chenbo.xia, zhihong.wang; +Cc: dev



On 9/21/20 8:48 AM, Marvin Liu wrote:
> Optimize vhost packed ring dequeue path with SIMD instructions. Four
> descriptors status check and writeback are batched handled with AVX512
> instructions. Address translation operations are also accelerated by
> AVX512 instructions.
> 
> If platform or compiler not support vectorization, will fallback to
> default path.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 
> diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> index cc9aa65c67..c1481802d7 100644
> --- a/lib/librte_vhost/meson.build
> +++ b/lib/librte_vhost/meson.build
> @@ -8,6 +8,22 @@ endif
>  if has_libnuma == 1
>  	dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
>  endif
> +
> +if arch_subdir == 'x86'
> +        if not machine_args.contains('-mno-avx512f')
> +                if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
> +                        cflags += ['-DCC_AVX512_SUPPORT']
> +                        vhost_avx512_lib = static_library('vhost_avx512_lib',
> +                                              'vhost_vec_avx.c',
> +                                              dependencies: [static_rte_eal, static_rte_mempool,
> +                                                  static_rte_mbuf, static_rte_ethdev, static_rte_net],
> +                                              include_directories: includes,
> +                                              c_args: [cflags, '-mavx512f', '-mavx512bw', '-mavx512vl'])
> +                        objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
> +                endif
> +        endif
> +endif

Not a Meson expert, but wonder how I can disable CC_AVX512_SUPPORT.
I checked the DPDK doc, but I could not find how to pass -mno-avx512f to
the machine_args.

> +
>  if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
>  	cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
>  elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index 4a81f18f01..fc7daf2145 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
>  	return NULL;
>  }
>  
> +int
> +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> +				 struct vhost_virtqueue *vq,
> +				 struct rte_mempool *mbuf_pool,
> +				 struct rte_mbuf **pkts,
> +				 uint16_t avail_idx,
> +				 uintptr_t *desc_addrs,
> +				 uint16_t *ids);
>  #endif /* _VHOST_NET_CDEV_H_ */
> diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
> new file mode 100644
> index 0000000000..dc5322d002
> --- /dev/null
> +++ b/lib/librte_vhost/vhost_vec_avx.c

For consistency it should be prefixed with virtio_net, not vhost.

> @@ -0,0 +1,181 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2016 Intel Corporation
> + */
> +#include <stdint.h>
> +
> +#include "vhost.h"
> +
> +#define BYTE_SIZE 8
> +/* reference count offset in mbuf rearm data */
> +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> +/* segment number offset in mbuf rearm data */
> +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> +
> +/* default rearm data */
> +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> +	1ULL << REFCNT_BITS_OFFSET)
> +
> +#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc, flags) / \
> +	sizeof(uint16_t))
> +
> +#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
> +	sizeof(uint16_t))
> +#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
> +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
> +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2)  | \
> +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
> +
> +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> +
> +#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL | VRING_DESC_F_USED) \
> +	<< FLAGS_BITS_OFFSET)
> +#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
> +#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> +	FLAGS_BITS_OFFSET)
> +
> +#define DESC_FLAGS_POS 0xaa
> +#define MBUF_LENS_POS 0x6666
> +
> +int
> +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> +				 struct vhost_virtqueue *vq,
> +				 struct rte_mempool *mbuf_pool,
> +				 struct rte_mbuf **pkts,
> +				 uint16_t avail_idx,
> +				 uintptr_t *desc_addrs,
> +				 uint16_t *ids)
> +{
> +	struct vring_packed_desc *descs = vq->desc_packed;
> +	uint32_t descs_status;
> +	void *desc_addr;
> +	uint16_t i;
> +	uint8_t cmp_low, cmp_high, cmp_result;
> +	uint64_t lens[PACKED_BATCH_SIZE];
> +	struct virtio_net_hdr *hdr;
> +
> +	if (unlikely(avail_idx & PACKED_BATCH_MASK))
> +		return -1;
> +
> +	/* load 4 descs */
> +	desc_addr = &vq->desc_packed[avail_idx];
> +	__m512i desc_vec = _mm512_loadu_si512(desc_addr);

Unlike split ring, packed ring specification does not mandate the ring
size to be a power of two. So checking  avail_idx is aligned on 64 bytes
is not enough given a descriptor is 16 bytes.

You need to also check against ring size to prevent out of bounds
accesses.

I see the non vectorized batch processing you introduced in v19.11 also
do that wrong assumption. Please fix it.

Also, I wonder whether it is assumed that &vq->desc_packed[avail_idx];
is aligned on a cache-line. Meaning, does below intrinsics have such a
requirement?

> +	/* burst check four status */
> +	__m512i avail_flag_vec;
> +	if (vq->avail_wrap_counter)
> +#if defined(RTE_ARCH_I686)
> +		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG, 0x0,
> +					PACKED_FLAGS_MASK, 0x0);
> +#else
> +		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> +					PACKED_AVAIL_FLAG);
> +
> +#endif
> +	else
> +#if defined(RTE_ARCH_I686)
> +		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
> +					0x0, PACKED_AVAIL_FLAG_WRAP, 0x0);
> +#else
> +		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> +					PACKED_AVAIL_FLAG_WRAP);
> +#endif
> +
> +	descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
> +		_MM_CMPINT_NE);
> +	if (descs_status & BATCH_FLAGS_MASK)
> +		return -1;
> +
> +	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
> +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> +			uint64_t size = (uint64_t)descs[avail_idx + i].len;
> +			desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
> +				descs[avail_idx + i].addr, &size,
> +				VHOST_ACCESS_RO);
> +
> +			if (!desc_addrs[i])
> +				goto free_buf;
> +			lens[i] = descs[avail_idx + i].len;
> +			rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> +
> +			pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> +					lens[i]);
> +			if (!pkts[i])
> +				goto free_buf;
> +		}
> +	} else {
> +		/* check buffer fit into one region & translate address */
> +		__m512i regions_low_addrs =
> +			_mm512_loadu_si512((void *)&dev->regions_low_addrs);
> +		__m512i regions_high_addrs =
> +			_mm512_loadu_si512((void *)&dev->regions_high_addrs);
> +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> +			uint64_t addr_low = descs[avail_idx + i].addr;
> +			uint64_t addr_high = addr_low +
> +						descs[avail_idx + i].len;
> +			__m512i low_addr_vec = _mm512_set1_epi64(addr_low);
> +			__m512i high_addr_vec = _mm512_set1_epi64(addr_high);
> +
> +			cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
> +					regions_low_addrs, _MM_CMPINT_NLT);
> +			cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
> +					regions_high_addrs, _MM_CMPINT_LT);
> +			cmp_result = cmp_low & cmp_high;
> +			int index = __builtin_ctz(cmp_result);
> +			if (unlikely((uint32_t)index >= dev->mem->nregions))
> +				goto free_buf;
> +
> +			desc_addrs[i] = addr_low +
> +				dev->mem->regions[index].host_user_addr -
> +				dev->mem->regions[index].guest_phys_addr;
> +			lens[i] = descs[avail_idx + i].len;
> +			rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> +
> +			pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> +					lens[i]);
> +			if (!pkts[i])
> +				goto free_buf;
> +		}
> +	}
> +
> +	if (virtio_net_with_host_offload(dev)) {
> +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> +			hdr = (struct virtio_net_hdr *)(desc_addrs[i]);
> +			vhost_dequeue_offload(hdr, pkts[i]);
> +		}
> +	}
> +
> +	if (unlikely(virtio_net_is_inorder(dev))) {
> +		ids[PACKED_BATCH_SIZE - 1] =
> +			descs[avail_idx + PACKED_BATCH_SIZE - 1].id;

Isn't in-order a likely case? Maybe just remove the unlikely.

> +	} else {
> +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> +			ids[i] = descs[avail_idx + i].id;
> +	}
> +
> +	uint64_t addrs[PACKED_BATCH_SIZE << 1];
> +	/* store mbuf data_len, pkt_len */
> +	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> +		addrs[i << 1] = (uint64_t)pkts[i]->rx_descriptor_fields1;
> +		addrs[(i << 1) + 1] = (uint64_t)pkts[i]->rx_descriptor_fields1
> +					+ sizeof(uint64_t);
> +	}
> +
> +	/* save pkt_len and data_len into mbufs */
> +	__m512i value_vec = _mm512_maskz_shuffle_epi32(MBUF_LENS_POS, desc_vec,
> +					0xAA);
> +	__m512i offsets_vec = _mm512_maskz_set1_epi32(MBUF_LENS_POS,
> +					(uint32_t)-12);
> +	value_vec = _mm512_add_epi32(value_vec, offsets_vec);
> +	__m512i vindex = _mm512_loadu_si512((void *)addrs);
> +	_mm512_i64scatter_epi64(0, vindex, value_vec, 1);
> +
> +	return 0;
> +free_buf:
> +	for (i = 0; i < PACKED_BATCH_SIZE; i++)
> +		rte_pktmbuf_free(pkts[i]);
> +
> +	return -1;
> +}
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 6107662685..e4d2e2e7d6 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -2249,6 +2249,28 @@ vhost_reserve_avail_batch_packed(struct virtio_net *dev,
>  	return -1;
>  }
>  
> +static __rte_always_inline int
> +vhost_handle_avail_batch_packed(struct virtio_net *dev,
> +				 struct vhost_virtqueue *vq,
> +				 struct rte_mempool *mbuf_pool,
> +				 struct rte_mbuf **pkts,
> +				 uint16_t avail_idx,
> +				 uintptr_t *desc_addrs,
> +				 uint16_t *ids)
> +{
> +	if (unlikely(dev->vectorized))
> +#ifdef CC_AVX512_SUPPORT
> +		return vhost_reserve_avail_batch_packed_avx(dev, vq, mbuf_pool,
> +				pkts, avail_idx, desc_addrs, ids);
> +#else
> +		return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool,
> +				pkts, avail_idx, desc_addrs, ids);
> +
> +#endif
> +	return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> +			avail_idx, desc_addrs, ids);
> +}


It should be as below to not have any performance impact when
CC_AVX512_SUPPORT is not set:

#ifdef CC_AVX512_SUPPORT
	if (unlikely(dev->vectorized))
		return vhost_reserve_avail_batch_packed_avx(dev, vq, mbuf_pool,
			pkts, avail_idx, desc_addrs, ids);
#else
	return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
		avail_idx, desc_addrs, ids);
#endif
> +
>  static __rte_always_inline int
>  virtio_dev_tx_batch_packed(struct virtio_net *dev,
>  			   struct vhost_virtqueue *vq,
> @@ -2261,8 +2283,9 @@ virtio_dev_tx_batch_packed(struct virtio_net *dev,
>  	uint16_t ids[PACKED_BATCH_SIZE];
>  	uint16_t i;
>  
> -	if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> -					     avail_idx, desc_addrs, ids))
> +
> +	if (vhost_handle_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> +		avail_idx, desc_addrs, ids))
>  		return -1;
>  
>  	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v2 5/5] vhost: add packed ring vectorized enqueue
  2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
@ 2020-10-06 15:00       ` Maxime Coquelin
  2020-10-08  7:09         ` Liu, Yong
  0 siblings, 1 reply; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-06 15:00 UTC (permalink / raw)
  To: Marvin Liu, chenbo.xia, zhihong.wang; +Cc: dev



On 9/21/20 8:48 AM, Marvin Liu wrote:
> Optimize vhost packed ring enqueue path with SIMD instructions. Four
> descriptors status and length are batched handled with AVX512
> instructions. Address translation operations are also accelerated
> by AVX512 instructions.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index fc7daf2145..b78b2c5c1b 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -1132,4 +1132,10 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
>  				 uint16_t avail_idx,
>  				 uintptr_t *desc_addrs,
>  				 uint16_t *ids);
> +
> +int
> +virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
> +			       struct vhost_virtqueue *vq,
> +			       struct rte_mbuf **pkts);
> +
>  #endif /* _VHOST_NET_CDEV_H_ */
> diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
> index dc5322d002..7d2250ed86 100644
> --- a/lib/librte_vhost/vhost_vec_avx.c
> +++ b/lib/librte_vhost/vhost_vec_avx.c
> @@ -35,9 +35,15 @@
>  #define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
>  #define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
>  	FLAGS_BITS_OFFSET)
> +#define PACKED_WRITE_AVAIL_FLAG (PACKED_AVAIL_FLAG | \
> +	((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
> +#define PACKED_WRITE_AVAIL_FLAG_WRAP (PACKED_AVAIL_FLAG_WRAP | \
> +	((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
>  
>  #define DESC_FLAGS_POS 0xaa
>  #define MBUF_LENS_POS 0x6666
> +#define DESC_LENS_POS 0x4444
> +#define DESC_LENS_FLAGS_POS 0xB0B0B0B0
>  
>  int
>  vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> @@ -179,3 +185,154 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
>  
>  	return -1;
>  }
> +
> +int
> +virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
> +			       struct vhost_virtqueue *vq,
> +			       struct rte_mbuf **pkts)
> +{
> +	struct vring_packed_desc *descs = vq->desc_packed;
> +	uint16_t avail_idx = vq->last_avail_idx;
> +	uint64_t desc_addrs[PACKED_BATCH_SIZE];
> +	uint32_t buf_offset = dev->vhost_hlen;
> +	uint32_t desc_status;
> +	uint64_t lens[PACKED_BATCH_SIZE];
> +	uint16_t i;
> +	void *desc_addr;
> +	uint8_t cmp_low, cmp_high, cmp_result;
> +
> +	if (unlikely(avail_idx & PACKED_BATCH_MASK))
> +		return -1;

Same comment as for patch 4. Packed ring size may not be a pow2.

> +	/* check refcnt and nb_segs */
> +	__m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
> +
> +	/* load four mbufs rearm data */
> +	__m256i mbufs = _mm256_set_epi64x(
> +				*pkts[3]->rearm_data,
> +				*pkts[2]->rearm_data,
> +				*pkts[1]->rearm_data,
> +				*pkts[0]->rearm_data);
> +
> +	uint16_t cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
> +	if (cmp & MBUF_LENS_POS)
> +		return -1;
> +
> +	/* check desc status */
> +	desc_addr = &vq->desc_packed[avail_idx];
> +	__m512i desc_vec = _mm512_loadu_si512(desc_addr);
> +
> +	__m512i avail_flag_vec;
> +	__m512i used_flag_vec;
> +	if (vq->avail_wrap_counter) {
> +#if defined(RTE_ARCH_I686)

Is supporting AVX512 on i686 really useful/necessary?

> +		avail_flag_vec = _mm512_set4_epi64(PACKED_WRITE_AVAIL_FLAG,
> +					0x0, PACKED_WRITE_AVAIL_FLAG, 0x0);
> +		used_flag_vec = _mm512_set4_epi64(PACKED_FLAGS_MASK, 0x0,
> +					PACKED_FLAGS_MASK, 0x0);
> +#else
> +		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> +					PACKED_WRITE_AVAIL_FLAG);
> +		used_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> +					PACKED_FLAGS_MASK);
> +#endif
> +	} else {
> +#if defined(RTE_ARCH_I686)
> +		avail_flag_vec = _mm512_set4_epi64(
> +					PACKED_WRITE_AVAIL_FLAG_WRAP, 0x0,
> +					PACKED_WRITE_AVAIL_FLAG, 0x0);
> +		used_flag_vec = _mm512_set4_epi64(0x0, 0x0, 0x0, 0x0);
> +#else
> +		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> +					PACKED_WRITE_AVAIL_FLAG_WRAP);
> +		used_flag_vec = _mm512_setzero_epi32();
> +#endif
> +	}
> +
> +	desc_status = _mm512_mask_cmp_epu16_mask(BATCH_FLAGS_MASK, desc_vec,
> +				avail_flag_vec, _MM_CMPINT_NE);
> +	if (desc_status)
> +		return -1;
> +
> +	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
> +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> +			uint64_t size = (uint64_t)descs[avail_idx + i].len;
> +			desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
> +				descs[avail_idx + i].addr, &size,
> +				VHOST_ACCESS_RW);
> +
> +			if (!desc_addrs[i])
> +				return -1;
> +
> +			rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *,
> +					0));
> +		}
> +	} else {
> +		/* check buffer fit into one region & translate address */
> +		__m512i regions_low_addrs =
> +			_mm512_loadu_si512((void *)&dev->regions_low_addrs);
> +		__m512i regions_high_addrs =
> +			_mm512_loadu_si512((void *)&dev->regions_high_addrs);
> +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> +			uint64_t addr_low = descs[avail_idx + i].addr;
> +			uint64_t addr_high = addr_low +
> +						descs[avail_idx + i].len;
> +			__m512i low_addr_vec = _mm512_set1_epi64(addr_low);
> +			__m512i high_addr_vec = _mm512_set1_epi64(addr_high);
> +
> +			cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
> +					regions_low_addrs, _MM_CMPINT_NLT);
> +			cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
> +					regions_high_addrs, _MM_CMPINT_LT);
> +			cmp_result = cmp_low & cmp_high;
> +			int index = __builtin_ctz(cmp_result);
> +			if (unlikely((uint32_t)index >= dev->mem->nregions))
> +				return -1;
> +
> +			desc_addrs[i] = addr_low +
> +				dev->mem->regions[index].host_user_addr -
> +				dev->mem->regions[index].guest_phys_addr;
> +			rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *,
> +					0));
> +		}
> +	}
> +
> +	/* check length is enough */
> +	__m512i pkt_lens = _mm512_set_epi32(
> +			0, pkts[3]->pkt_len, 0, 0,
> +			0, pkts[2]->pkt_len, 0, 0,
> +			0, pkts[1]->pkt_len, 0, 0,
> +			0, pkts[0]->pkt_len, 0, 0);
> +
> +	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(DESC_LENS_POS,
> +					dev->vhost_hlen);
> +	__m512i buf_len_vec = _mm512_add_epi32(pkt_lens, mbuf_len_offset);
> +	uint16_t lens_cmp = _mm512_mask_cmp_epu32_mask(DESC_LENS_POS,
> +				desc_vec, buf_len_vec, _MM_CMPINT_LT);
> +	if (lens_cmp)
> +		return -1;
> +
> +	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> +		rte_memcpy((void *)(uintptr_t)(desc_addrs[i] + buf_offset),
> +			   rte_pktmbuf_mtod_offset(pkts[i], void *, 0),
> +			   pkts[i]->pkt_len);
> +	}
> +
> +	if (unlikely((dev->features & (1ULL << VHOST_F_LOG_ALL)))) {
> +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> +			lens[i] = descs[avail_idx + i].len;
> +			vhost_log_cache_write_iova(dev, vq,
> +				descs[avail_idx + i].addr, lens[i]);
> +		}
> +	}
> +
> +	vq_inc_last_avail_packed(vq, PACKED_BATCH_SIZE);
> +	vq_inc_last_used_packed(vq, PACKED_BATCH_SIZE);
> +	/* save len and flags, skip addr and id */
> +	__m512i desc_updated = _mm512_mask_add_epi16(desc_vec,
> +					DESC_LENS_FLAGS_POS, buf_len_vec,
> +					used_flag_vec);
> +	_mm512_storeu_si512(desc_addr, desc_updated);
> +
> +	return 0;
> +}
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index e4d2e2e7d6..5c56a8d6ff 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -1354,6 +1354,21 @@ virtio_dev_rx_single_packed(struct virtio_net *dev,
>  	return 0;
>  }
>  
> +static __rte_always_inline int
> +virtio_dev_rx_handle_batch_packed(struct virtio_net *dev,
> +			   struct vhost_virtqueue *vq,
> +			   struct rte_mbuf **pkts)
> +
> +{
> +	if (unlikely(dev->vectorized))
> +#ifdef CC_AVX512_SUPPORT
> +		return virtio_dev_rx_batch_packed_avx(dev, vq, pkts);
> +#else
> +		return virtio_dev_rx_batch_packed(dev, vq, pkts);
> +#endif
> +	return virtio_dev_rx_batch_packed(dev, vq, pkts);

It should be as below to not have any performance impact when
CC_AVX512_SUPPORT is not set:

#ifdef CC_AVX512_SUPPORT
	if (unlikely(dev->vectorized))
		return virtio_dev_rx_batch_packed_avx(dev, vq, pkts);
#else
	return virtio_dev_rx_batch_packed(dev, vq, pkts);
#endif

> +}
> +
>  static __rte_noinline uint32_t
>  virtio_dev_rx_packed(struct virtio_net *dev,
>  		     struct vhost_virtqueue *__rte_restrict vq,
> @@ -1367,8 +1382,8 @@ virtio_dev_rx_packed(struct virtio_net *dev,
>  		rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
>  
>  		if (remained >= PACKED_BATCH_SIZE) {
> -			if (!virtio_dev_rx_batch_packed(dev, vq,
> -							&pkts[pkt_idx])) {
> +			if (!virtio_dev_rx_handle_batch_packed(dev, vq,
> +				&pkts[pkt_idx])) {
>  				pkt_idx += PACKED_BATCH_SIZE;
>  				remained -= PACKED_BATCH_SIZE;
>  				continue;
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/5] vhost: prepare memory regions addresses
  2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 3/5] vhost: prepare memory regions addresses Marvin Liu
@ 2020-10-06 15:06       ` Maxime Coquelin
  0 siblings, 0 replies; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-06 15:06 UTC (permalink / raw)
  To: Marvin Liu, chenbo.xia, zhihong.wang; +Cc: dev



On 9/21/20 8:48 AM, Marvin Liu wrote:
> Prepare memory regions guest physical addresses for vectorized data
> path. These information will be utilized by SIMD instructions to find
> matched region index.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index 5a5c945551..4a81f18f01 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -52,6 +52,8 @@
>  
>  #define ASYNC_MAX_POLL_SEG 255
>  
> +#define MAX_NREGIONS 8
> +
>  #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
>  #define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
>  
> @@ -375,6 +377,8 @@ struct inflight_mem_info {
>  struct virtio_net {
>  	/* Frontend (QEMU) memory and memory region information */
>  	struct rte_vhost_memory	*mem;
> +	uint64_t		regions_low_addrs[MAX_NREGIONS];
> +	uint64_t		regions_high_addrs[MAX_NREGIONS];

It eats two cache lines, so it would be better to have it in a dedicated
structure dynamically allocated.

It would be better for non-vectorized path, as it will avoid polluting
cache with useless data in its case. And it would be better for
vectorized path too, as when the DP will need to use it, it will use
exactly two cache lines instead of 3three.

>  	uint64_t		features;
>  	uint64_t		protocol_features;
>  	int			vid;
> diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
> index c3c924faec..89e75e9e71 100644
> --- a/lib/librte_vhost/vhost_user.c
> +++ b/lib/librte_vhost/vhost_user.c
> @@ -1291,6 +1291,17 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
>  		}
>  	}
>  
> +	RTE_BUILD_BUG_ON(VHOST_MEMORY_MAX_NREGIONS != 8);
> +	if (dev->vectorized) {
> +		for (i = 0; i < memory->nregions; i++) {
> +			dev->regions_low_addrs[i] =
> +				memory->regions[i].guest_phys_addr;
> +			dev->regions_high_addrs[i] =
> +				memory->regions[i].guest_phys_addr +
> +				memory->regions[i].memory_size;
> +		}
> +	}
> +
>  	for (i = 0; i < dev->nr_vring; i++) {
>  		struct vhost_virtqueue *vq = dev->virtqueue[i];
>  
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue
  2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
  2020-10-06 14:59       ` Maxime Coquelin
@ 2020-10-06 15:18       ` Maxime Coquelin
  2020-10-09  7:59         ` Liu, Yong
  1 sibling, 1 reply; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-06 15:18 UTC (permalink / raw)
  To: Marvin Liu, chenbo.xia, zhihong.wang; +Cc: dev



On 9/21/20 8:48 AM, Marvin Liu wrote:
> Optimize vhost packed ring dequeue path with SIMD instructions. Four
> descriptors status check and writeback are batched handled with AVX512
> instructions. Address translation operations are also accelerated by
> AVX512 instructions.
> 
> If platform or compiler not support vectorization, will fallback to
> default path.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 
> diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> index cc9aa65c67..c1481802d7 100644
> --- a/lib/librte_vhost/meson.build
> +++ b/lib/librte_vhost/meson.build
> @@ -8,6 +8,22 @@ endif
>  if has_libnuma == 1
>  	dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
>  endif
> +
> +if arch_subdir == 'x86'
> +        if not machine_args.contains('-mno-avx512f')
> +                if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
> +                        cflags += ['-DCC_AVX512_SUPPORT']
> +                        vhost_avx512_lib = static_library('vhost_avx512_lib',
> +                                              'vhost_vec_avx.c',
> +                                              dependencies: [static_rte_eal, static_rte_mempool,
> +                                                  static_rte_mbuf, static_rte_ethdev, static_rte_net],
> +                                              include_directories: includes,
> +                                              c_args: [cflags, '-mavx512f', '-mavx512bw', '-mavx512vl'])
> +                        objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
> +                endif
> +        endif
> +endif
> +
>  if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
>  	cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
>  elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index 4a81f18f01..fc7daf2145 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
>  	return NULL;
>  }
>  
> +int
> +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> +				 struct vhost_virtqueue *vq,
> +				 struct rte_mempool *mbuf_pool,
> +				 struct rte_mbuf **pkts,
> +				 uint16_t avail_idx,
> +				 uintptr_t *desc_addrs,
> +				 uint16_t *ids);
>  #endif /* _VHOST_NET_CDEV_H_ */
> diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
> new file mode 100644
> index 0000000000..dc5322d002
> --- /dev/null
> +++ b/lib/librte_vhost/vhost_vec_avx.c
> @@ -0,0 +1,181 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2016 Intel Corporation
> + */
> +#include <stdint.h>
> +
> +#include "vhost.h"
> +
> +#define BYTE_SIZE 8
> +/* reference count offset in mbuf rearm data */
> +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> +/* segment number offset in mbuf rearm data */
> +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> +
> +/* default rearm data */
> +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> +	1ULL << REFCNT_BITS_OFFSET)
> +
> +#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc, flags) / \
> +	sizeof(uint16_t))
> +
> +#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
> +	sizeof(uint16_t))
> +#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
> +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
> +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2)  | \
> +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
> +
> +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> +
> +#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL | VRING_DESC_F_USED) \
> +	<< FLAGS_BITS_OFFSET)
> +#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
> +#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> +	FLAGS_BITS_OFFSET)
> +
> +#define DESC_FLAGS_POS 0xaa
> +#define MBUF_LENS_POS 0x6666
> +
> +int
> +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> +				 struct vhost_virtqueue *vq,
> +				 struct rte_mempool *mbuf_pool,
> +				 struct rte_mbuf **pkts,
> +				 uint16_t avail_idx,
> +				 uintptr_t *desc_addrs,
> +				 uint16_t *ids)
> +{
> +	struct vring_packed_desc *descs = vq->desc_packed;
> +	uint32_t descs_status;
> +	void *desc_addr;
> +	uint16_t i;
> +	uint8_t cmp_low, cmp_high, cmp_result;
> +	uint64_t lens[PACKED_BATCH_SIZE];
> +	struct virtio_net_hdr *hdr;
> +
> +	if (unlikely(avail_idx & PACKED_BATCH_MASK))
> +		return -1;
> +
> +	/* load 4 descs */
> +	desc_addr = &vq->desc_packed[avail_idx];
> +	__m512i desc_vec = _mm512_loadu_si512(desc_addr);
> +
> +	/* burst check four status */
> +	__m512i avail_flag_vec;
> +	if (vq->avail_wrap_counter)
> +#if defined(RTE_ARCH_I686)
> +		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG, 0x0,
> +					PACKED_FLAGS_MASK, 0x0);
> +#else
> +		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> +					PACKED_AVAIL_FLAG);
> +
> +#endif
> +	else
> +#if defined(RTE_ARCH_I686)
> +		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
> +					0x0, PACKED_AVAIL_FLAG_WRAP, 0x0);
> +#else
> +		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> +					PACKED_AVAIL_FLAG_WRAP);
> +#endif
> +
> +	descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
> +		_MM_CMPINT_NE);
> +	if (descs_status & BATCH_FLAGS_MASK)
> +		return -1;
> +


Also, please try to factorize code to avoid duplication between Tx and
Rx paths for desc address translation:
> +	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
> +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> +			uint64_t size = (uint64_t)descs[avail_idx + i].len;
> +			desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
> +				descs[avail_idx + i].addr, &size,
> +				VHOST_ACCESS_RO);
> +
> +			if (!desc_addrs[i])
> +				goto free_buf;
> +			lens[i] = descs[avail_idx + i].len;
> +			rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> +
> +			pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> +					lens[i]);
> +			if (!pkts[i])
> +				goto free_buf;
> +		}
> +	} else {> +		/* check buffer fit into one region & translate address */
> +		__m512i regions_low_addrs =
> +			_mm512_loadu_si512((void *)&dev->regions_low_addrs);
> +		__m512i regions_high_addrs =
> +			_mm512_loadu_si512((void *)&dev->regions_high_addrs);
> +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> +			uint64_t addr_low = descs[avail_idx + i].addr;
> +			uint64_t addr_high = addr_low +
> +						descs[avail_idx + i].len;
> +			__m512i low_addr_vec = _mm512_set1_epi64(addr_low);
> +			__m512i high_addr_vec = _mm512_set1_epi64(addr_high);
> +
> +			cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
> +					regions_low_addrs, _MM_CMPINT_NLT);
> +			cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
> +					regions_high_addrs, _MM_CMPINT_LT);
> +			cmp_result = cmp_low & cmp_high;
> +			int index = __builtin_ctz(cmp_result);
> +			if (unlikely((uint32_t)index >= dev->mem->nregions))
> +				goto free_buf;
> +
> +			desc_addrs[i] = addr_low +
> +				dev->mem->regions[index].host_user_addr -
> +				dev->mem->regions[index].guest_phys_addr;
> +			lens[i] = descs[avail_idx + i].len;
> +			rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> +
> +			pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> +					lens[i]);
> +			if (!pkts[i])
> +				goto free_buf;
> +		}
> +	}
> +
> +	if (virtio_net_with_host_offload(dev)) {
> +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> +			hdr = (struct virtio_net_hdr *)(desc_addrs[i]);
> +			vhost_dequeue_offload(hdr, pkts[i]);
> +		}
> +	}
> +
> +	if (unlikely(virtio_net_is_inorder(dev))) {
> +		ids[PACKED_BATCH_SIZE - 1] =
> +			descs[avail_idx + PACKED_BATCH_SIZE - 1].id;
> +	} else {
> +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> +			ids[i] = descs[avail_idx + i].id;
> +	}
> +
> +	uint64_t addrs[PACKED_BATCH_SIZE << 1];
> +	/* store mbuf data_len, pkt_len */
> +	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> +		addrs[i << 1] = (uint64_t)pkts[i]->rx_descriptor_fields1;
> +		addrs[(i << 1) + 1] = (uint64_t)pkts[i]->rx_descriptor_fields1
> +					+ sizeof(uint64_t);
> +	}
> +
> +	/* save pkt_len and data_len into mbufs */
> +	__m512i value_vec = _mm512_maskz_shuffle_epi32(MBUF_LENS_POS, desc_vec,
> +					0xAA);
> +	__m512i offsets_vec = _mm512_maskz_set1_epi32(MBUF_LENS_POS,
> +					(uint32_t)-12);
> +	value_vec = _mm512_add_epi32(value_vec, offsets_vec);
> +	__m512i vindex = _mm512_loadu_si512((void *)addrs);
> +	_mm512_i64scatter_epi64(0, vindex, value_vec, 1);
> +
> +	return 0;
> +free_buf:
> +	for (i = 0; i < PACKED_BATCH_SIZE; i++)
> +		rte_pktmbuf_free(pkts[i]);
> +
> +	return -1;
> +}
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 6107662685..e4d2e2e7d6 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -2249,6 +2249,28 @@ vhost_reserve_avail_batch_packed(struct virtio_net *dev,
>  	return -1;
>  }
>  
> +static __rte_always_inline int
> +vhost_handle_avail_batch_packed(struct virtio_net *dev,
> +				 struct vhost_virtqueue *vq,
> +				 struct rte_mempool *mbuf_pool,
> +				 struct rte_mbuf **pkts,
> +				 uint16_t avail_idx,
> +				 uintptr_t *desc_addrs,
> +				 uint16_t *ids)
> +{
> +	if (unlikely(dev->vectorized))
> +#ifdef CC_AVX512_SUPPORT
> +		return vhost_reserve_avail_batch_packed_avx(dev, vq, mbuf_pool,
> +				pkts, avail_idx, desc_addrs, ids);
> +#else
> +		return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool,
> +				pkts, avail_idx, desc_addrs, ids);
> +
> +#endif
> +	return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> +			avail_idx, desc_addrs, ids);
> +}
> +
>  static __rte_always_inline int
>  virtio_dev_tx_batch_packed(struct virtio_net *dev,
>  			   struct vhost_virtqueue *vq,
> @@ -2261,8 +2283,9 @@ virtio_dev_tx_batch_packed(struct virtio_net *dev,
>  	uint16_t ids[PACKED_BATCH_SIZE];
>  	uint16_t i;
>  
> -	if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> -					     avail_idx, desc_addrs, ids))
> +
> +	if (vhost_handle_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> +		avail_idx, desc_addrs, ids))
>  		return -1;
>  
>  	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/5] vhost add vectorized data path
  2020-10-06 13:34     ` [dpdk-dev] [PATCH v2 0/5] vhost add vectorized data path Maxime Coquelin
@ 2020-10-08  6:20       ` Liu, Yong
  0 siblings, 0 replies; 36+ messages in thread
From: Liu, Yong @ 2020-10-08  6:20 UTC (permalink / raw)
  To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Tuesday, October 6, 2020 9:34 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v2 0/5] vhost add vectorized data path
> 
> Hi,
> 
> On 9/21/20 8:48 AM, Marvin Liu wrote:
> > Packed ring format is imported since virtio spec 1.1. All descriptors
> > are compacted into one single ring when packed ring format is on. It is
> > straight forward that ring operations can be accelerated by utilizing
> > SIMD instructions.
> >
> > This patch set will introduce vectorized data path in vhost library. If
> > vectorized option is on, operations like descs check, descs writeback,
> > address translation will be accelerated by SIMD instructions. Vhost
> > application can choose whether using vectorized acceleration, it is
> > like external buffer and zero copy features.
> >
> > If platform or ring format not support vectorized function, vhost will
> > fallback to use default batch function. There will be no impact in current
> > data path.
> 
> As a pre-requisite, I'd like some performance numbers in both loopback
> and PVP to figure out if adding such complexity is worth it, given we
> will have to support it for at least one year.
> 


Thanks for suggestion, will add some reference numbers in next version.

> Thanks,
> Maxime
> 
> > v2:
> > * add vIOMMU support
> > * add dequeue offloading
> > * rebase code
> >
> > Marvin Liu (5):
> >   vhost: add vectorized data path
> >   vhost: reuse packed ring functions
> >   vhost: prepare memory regions addresses
> >   vhost: add packed ring vectorized dequeue
> >   vhost: add packed ring vectorized enqueue
> >
> >  doc/guides/nics/vhost.rst           |   5 +
> >  doc/guides/prog_guide/vhost_lib.rst |  12 +
> >  drivers/net/vhost/rte_eth_vhost.c   |  17 +-
> >  lib/librte_vhost/meson.build        |  16 ++
> >  lib/librte_vhost/rte_vhost.h        |   1 +
> >  lib/librte_vhost/socket.c           |   5 +
> >  lib/librte_vhost/vhost.c            |  11 +
> >  lib/librte_vhost/vhost.h            | 235 +++++++++++++++++++
> >  lib/librte_vhost/vhost_user.c       |  11 +
> >  lib/librte_vhost/vhost_vec_avx.c    | 338
> ++++++++++++++++++++++++++++
> >  lib/librte_vhost/virtio_net.c       | 257 ++++-----------------
> >  11 files changed, 692 insertions(+), 216 deletions(-)
> >  create mode 100644 lib/librte_vhost/vhost_vec_avx.c
> >


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue
  2020-10-06 14:59       ` Maxime Coquelin
@ 2020-10-08  7:05         ` Liu, Yong
  0 siblings, 0 replies; 36+ messages in thread
From: Liu, Yong @ 2020-10-08  7:05 UTC (permalink / raw)
  To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Tuesday, October 6, 2020 10:59 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v2 4/5] vhost: add packed ring vectorized dequeue
> 
> 
> 
> On 9/21/20 8:48 AM, Marvin Liu wrote:
> > Optimize vhost packed ring dequeue path with SIMD instructions. Four
> > descriptors status check and writeback are batched handled with AVX512
> > instructions. Address translation operations are also accelerated by
> > AVX512 instructions.
> >
> > If platform or compiler not support vectorization, will fallback to
> > default path.
> >
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> > index cc9aa65c67..c1481802d7 100644
> > --- a/lib/librte_vhost/meson.build
> > +++ b/lib/librte_vhost/meson.build
> > @@ -8,6 +8,22 @@ endif
> >  if has_libnuma == 1
> >  	dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
> >  endif
> > +
> > +if arch_subdir == 'x86'
> > +        if not machine_args.contains('-mno-avx512f')
> > +                if cc.has_argument('-mavx512f') and cc.has_argument('-
> mavx512vl') and cc.has_argument('-mavx512bw')
> > +                        cflags += ['-DCC_AVX512_SUPPORT']
> > +                        vhost_avx512_lib = static_library('vhost_avx512_lib',
> > +                                              'vhost_vec_avx.c',
> > +                                              dependencies: [static_rte_eal,
> static_rte_mempool,
> > +                                                  static_rte_mbuf, static_rte_ethdev,
> static_rte_net],
> > +                                              include_directories: includes,
> > +                                              c_args: [cflags, '-mavx512f', '-mavx512bw', '-
> mavx512vl'])
> > +                        objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
> > +                endif
> > +        endif
> > +endif
> 
> Not a Meson expert, but wonder how I can disable CC_AVX512_SUPPORT.
> I checked the DPDK doc, but I could not find how to pass -mno-avx512f to
> the machine_args.

Hi Maxime,
By now mno-avx512f flag will be set only if binutils check script found issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90028.
So avx512 code will be built-in if compiler support that. There's alternative way is that introduce one new option in meson build. 

Thanks,
Marvin

> 
> > +
> >  if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> >  	cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> >  elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > index 4a81f18f01..fc7daf2145 100644
> > --- a/lib/librte_vhost/vhost.h
> > +++ b/lib/librte_vhost/vhost.h
> > @@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net
> *dev, struct rte_mempool *mp,
> >  	return NULL;
> >  }
> >
> > +int
> > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > +				 struct vhost_virtqueue *vq,
> > +				 struct rte_mempool *mbuf_pool,
> > +				 struct rte_mbuf **pkts,
> > +				 uint16_t avail_idx,
> > +				 uintptr_t *desc_addrs,
> > +				 uint16_t *ids);
> >  #endif /* _VHOST_NET_CDEV_H_ */
> > diff --git a/lib/librte_vhost/vhost_vec_avx.c
> b/lib/librte_vhost/vhost_vec_avx.c
> > new file mode 100644
> > index 0000000000..dc5322d002
> > --- /dev/null
> > +++ b/lib/librte_vhost/vhost_vec_avx.c
> 
> For consistency it should be prefixed with virtio_net, not vhost.
> 
> > @@ -0,0 +1,181 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2016 Intel Corporation
> > + */
> > +#include <stdint.h>
> > +
> > +#include "vhost.h"
> > +
> > +#define BYTE_SIZE 8
> > +/* reference count offset in mbuf rearm data */
> > +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> > +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +/* segment number offset in mbuf rearm data */
> > +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> > +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +
> > +/* default rearm data */
> > +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> > +	1ULL << REFCNT_BITS_OFFSET)
> > +
> > +#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc,
> flags) / \
> > +	sizeof(uint16_t))
> > +
> > +#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
> > +	sizeof(uint16_t))
> > +#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
> > +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
> > +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2)  |
> \
> > +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
> > +
> > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> > +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> > +
> > +#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL |
> VRING_DESC_F_USED) \
> > +	<< FLAGS_BITS_OFFSET)
> > +#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) <<
> FLAGS_BITS_OFFSET)
> > +#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> > +	FLAGS_BITS_OFFSET)
> > +
> > +#define DESC_FLAGS_POS 0xaa
> > +#define MBUF_LENS_POS 0x6666
> > +
> > +int
> > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > +				 struct vhost_virtqueue *vq,
> > +				 struct rte_mempool *mbuf_pool,
> > +				 struct rte_mbuf **pkts,
> > +				 uint16_t avail_idx,
> > +				 uintptr_t *desc_addrs,
> > +				 uint16_t *ids)
> > +{
> > +	struct vring_packed_desc *descs = vq->desc_packed;
> > +	uint32_t descs_status;
> > +	void *desc_addr;
> > +	uint16_t i;
> > +	uint8_t cmp_low, cmp_high, cmp_result;
> > +	uint64_t lens[PACKED_BATCH_SIZE];
> > +	struct virtio_net_hdr *hdr;
> > +
> > +	if (unlikely(avail_idx & PACKED_BATCH_MASK))
> > +		return -1;
> > +
> > +	/* load 4 descs */
> > +	desc_addr = &vq->desc_packed[avail_idx];
> > +	__m512i desc_vec = _mm512_loadu_si512(desc_addr);
> 
> Unlike split ring, packed ring specification does not mandate the ring
> size to be a power of two. So checking  avail_idx is aligned on 64 bytes
> is not enough given a descriptor is 16 bytes.
> 
> You need to also check against ring size to prevent out of bounds
> accesses.
> 
> I see the non vectorized batch processing you introduced in v19.11 also
> do that wrong assumption. Please fix it.
> 
> Also, I wonder whether it is assumed that &vq->desc_packed[avail_idx];
> is aligned on a cache-line. Meaning, does below intrinsics have such a
> requirement?
> 

Got, packed ring size may arbitrary number. In v19.11 batch handling function has already checked available index not oversized. 
I forgot that in vectorized path, will fix it in next release. 

In vectorized path, loading function mm512_loadu_si512 do not need cache-aligned memory. So no special requirement is needed. 

> > +	/* burst check four status */
> > +	__m512i avail_flag_vec;
> > +	if (vq->avail_wrap_counter)
> > +#if defined(RTE_ARCH_I686)
> > +		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG,
> 0x0,
> > +					PACKED_FLAGS_MASK, 0x0);
> > +#else
> > +		avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > +					PACKED_AVAIL_FLAG);
> > +
> > +#endif
> > +	else
> > +#if defined(RTE_ARCH_I686)
> > +		avail_flag_vec =
> _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
> > +					0x0, PACKED_AVAIL_FLAG_WRAP,
> 0x0);
> > +#else
> > +		avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > +					PACKED_AVAIL_FLAG_WRAP);
> > +#endif
> > +
> > +	descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
> > +		_MM_CMPINT_NE);
> > +	if (descs_status & BATCH_FLAGS_MASK)
> > +		return -1;
> > +
> > +	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
> > +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +			uint64_t size = (uint64_t)descs[avail_idx + i].len;
> > +			desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
> > +				descs[avail_idx + i].addr, &size,
> > +				VHOST_ACCESS_RO);
> > +
> > +			if (!desc_addrs[i])
> > +				goto free_buf;
> > +			lens[i] = descs[avail_idx + i].len;
> > +			rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > +
> > +			pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> > +					lens[i]);
> > +			if (!pkts[i])
> > +				goto free_buf;
> > +		}
> > +	} else {
> > +		/* check buffer fit into one region & translate address */
> > +		__m512i regions_low_addrs =
> > +			_mm512_loadu_si512((void *)&dev-
> >regions_low_addrs);
> > +		__m512i regions_high_addrs =
> > +			_mm512_loadu_si512((void *)&dev-
> >regions_high_addrs);
> > +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +			uint64_t addr_low = descs[avail_idx + i].addr;
> > +			uint64_t addr_high = addr_low +
> > +						descs[avail_idx + i].len;
> > +			__m512i low_addr_vec =
> _mm512_set1_epi64(addr_low);
> > +			__m512i high_addr_vec =
> _mm512_set1_epi64(addr_high);
> > +
> > +			cmp_low =
> _mm512_cmp_epi64_mask(low_addr_vec,
> > +					regions_low_addrs,
> _MM_CMPINT_NLT);
> > +			cmp_high =
> _mm512_cmp_epi64_mask(high_addr_vec,
> > +					regions_high_addrs,
> _MM_CMPINT_LT);
> > +			cmp_result = cmp_low & cmp_high;
> > +			int index = __builtin_ctz(cmp_result);
> > +			if (unlikely((uint32_t)index >= dev->mem->nregions))
> > +				goto free_buf;
> > +
> > +			desc_addrs[i] = addr_low +
> > +				dev->mem->regions[index].host_user_addr -
> > +				dev->mem->regions[index].guest_phys_addr;
> > +			lens[i] = descs[avail_idx + i].len;
> > +			rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > +
> > +			pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> > +					lens[i]);
> > +			if (!pkts[i])
> > +				goto free_buf;
> > +		}
> > +	}
> > +
> > +	if (virtio_net_with_host_offload(dev)) {
> > +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +			hdr = (struct virtio_net_hdr *)(desc_addrs[i]);
> > +			vhost_dequeue_offload(hdr, pkts[i]);
> > +		}
> > +	}
> > +
> > +	if (unlikely(virtio_net_is_inorder(dev))) {
> > +		ids[PACKED_BATCH_SIZE - 1] =
> > +			descs[avail_idx + PACKED_BATCH_SIZE - 1].id;
> 
> Isn't in-order a likely case? Maybe just remove the unlikely.
> 
In_order option is depended on feature negotiation , will remove unlikely. 

> > +	} else {
> > +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> > +			ids[i] = descs[avail_idx + i].id;
> > +	}
> > +
> > +	uint64_t addrs[PACKED_BATCH_SIZE << 1];
> > +	/* store mbuf data_len, pkt_len */
> > +	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +		addrs[i << 1] = (uint64_t)pkts[i]->rx_descriptor_fields1;
> > +		addrs[(i << 1) + 1] = (uint64_t)pkts[i]->rx_descriptor_fields1
> > +					+ sizeof(uint64_t);
> > +	}
> > +
> > +	/* save pkt_len and data_len into mbufs */
> > +	__m512i value_vec =
> _mm512_maskz_shuffle_epi32(MBUF_LENS_POS, desc_vec,
> > +					0xAA);
> > +	__m512i offsets_vec = _mm512_maskz_set1_epi32(MBUF_LENS_POS,
> > +					(uint32_t)-12);
> > +	value_vec = _mm512_add_epi32(value_vec, offsets_vec);
> > +	__m512i vindex = _mm512_loadu_si512((void *)addrs);
> > +	_mm512_i64scatter_epi64(0, vindex, value_vec, 1);
> > +
> > +	return 0;
> > +free_buf:
> > +	for (i = 0; i < PACKED_BATCH_SIZE; i++)
> > +		rte_pktmbuf_free(pkts[i]);
> > +
> > +	return -1;
> > +}
> > diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> > index 6107662685..e4d2e2e7d6 100644
> > --- a/lib/librte_vhost/virtio_net.c
> > +++ b/lib/librte_vhost/virtio_net.c
> > @@ -2249,6 +2249,28 @@ vhost_reserve_avail_batch_packed(struct
> virtio_net *dev,
> >  	return -1;
> >  }
> >
> > +static __rte_always_inline int
> > +vhost_handle_avail_batch_packed(struct virtio_net *dev,
> > +				 struct vhost_virtqueue *vq,
> > +				 struct rte_mempool *mbuf_pool,
> > +				 struct rte_mbuf **pkts,
> > +				 uint16_t avail_idx,
> > +				 uintptr_t *desc_addrs,
> > +				 uint16_t *ids)
> > +{
> > +	if (unlikely(dev->vectorized))
> > +#ifdef CC_AVX512_SUPPORT
> > +		return vhost_reserve_avail_batch_packed_avx(dev, vq,
> mbuf_pool,
> > +				pkts, avail_idx, desc_addrs, ids);
> > +#else
> > +		return vhost_reserve_avail_batch_packed(dev, vq,
> mbuf_pool,
> > +				pkts, avail_idx, desc_addrs, ids);
> > +
> > +#endif
> > +	return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> > +			avail_idx, desc_addrs, ids);
> > +}
> 
> 
> It should be as below to not have any performance impact when
> CC_AVX512_SUPPORT is not set:
> 
> #ifdef CC_AVX512_SUPPORT
> 	if (unlikely(dev->vectorized))
> 		return vhost_reserve_avail_batch_packed_avx(dev, vq,
> mbuf_pool,
> 			pkts, avail_idx, desc_addrs, ids);
> #else
> 	return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> 		avail_idx, desc_addrs, ids);
> #endif

Got, will change in next release. 

> > +
> >  static __rte_always_inline int
> >  virtio_dev_tx_batch_packed(struct virtio_net *dev,
> >  			   struct vhost_virtqueue *vq,
> > @@ -2261,8 +2283,9 @@ virtio_dev_tx_batch_packed(struct virtio_net
> *dev,
> >  	uint16_t ids[PACKED_BATCH_SIZE];
> >  	uint16_t i;
> >
> > -	if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> > -					     avail_idx, desc_addrs, ids))
> > +
> > +	if (vhost_handle_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> > +		avail_idx, desc_addrs, ids))
> >  		return -1;
> >
> >  	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> >


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v2 5/5] vhost: add packed ring vectorized enqueue
  2020-10-06 15:00       ` Maxime Coquelin
@ 2020-10-08  7:09         ` Liu, Yong
  0 siblings, 0 replies; 36+ messages in thread
From: Liu, Yong @ 2020-10-08  7:09 UTC (permalink / raw)
  To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Tuesday, October 6, 2020 11:00 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v2 5/5] vhost: add packed ring vectorized enqueue
> 
> 
> 
> On 9/21/20 8:48 AM, Marvin Liu wrote:
> > Optimize vhost packed ring enqueue path with SIMD instructions. Four
> > descriptors status and length are batched handled with AVX512
> > instructions. Address translation operations are also accelerated
> > by AVX512 instructions.
> >
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > index fc7daf2145..b78b2c5c1b 100644
> > --- a/lib/librte_vhost/vhost.h
> > +++ b/lib/librte_vhost/vhost.h
> > @@ -1132,4 +1132,10 @@ vhost_reserve_avail_batch_packed_avx(struct
> virtio_net *dev,
> >  				 uint16_t avail_idx,
> >  				 uintptr_t *desc_addrs,
> >  				 uint16_t *ids);
> > +
> > +int
> > +virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
> > +			       struct vhost_virtqueue *vq,
> > +			       struct rte_mbuf **pkts);
> > +
> >  #endif /* _VHOST_NET_CDEV_H_ */
> > diff --git a/lib/librte_vhost/vhost_vec_avx.c
> b/lib/librte_vhost/vhost_vec_avx.c
> > index dc5322d002..7d2250ed86 100644
> > --- a/lib/librte_vhost/vhost_vec_avx.c
> > +++ b/lib/librte_vhost/vhost_vec_avx.c
> > @@ -35,9 +35,15 @@
> >  #define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) <<
> FLAGS_BITS_OFFSET)
> >  #define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> >  	FLAGS_BITS_OFFSET)
> > +#define PACKED_WRITE_AVAIL_FLAG (PACKED_AVAIL_FLAG | \
> > +	((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
> > +#define PACKED_WRITE_AVAIL_FLAG_WRAP
> (PACKED_AVAIL_FLAG_WRAP | \
> > +	((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
> >
> >  #define DESC_FLAGS_POS 0xaa
> >  #define MBUF_LENS_POS 0x6666
> > +#define DESC_LENS_POS 0x4444
> > +#define DESC_LENS_FLAGS_POS 0xB0B0B0B0
> >
> >  int
> >  vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > @@ -179,3 +185,154 @@ vhost_reserve_avail_batch_packed_avx(struct
> virtio_net *dev,
> >
> >  	return -1;
> >  }
> > +
> > +int
> > +virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
> > +			       struct vhost_virtqueue *vq,
> > +			       struct rte_mbuf **pkts)
> > +{
> > +	struct vring_packed_desc *descs = vq->desc_packed;
> > +	uint16_t avail_idx = vq->last_avail_idx;
> > +	uint64_t desc_addrs[PACKED_BATCH_SIZE];
> > +	uint32_t buf_offset = dev->vhost_hlen;
> > +	uint32_t desc_status;
> > +	uint64_t lens[PACKED_BATCH_SIZE];
> > +	uint16_t i;
> > +	void *desc_addr;
> > +	uint8_t cmp_low, cmp_high, cmp_result;
> > +
> > +	if (unlikely(avail_idx & PACKED_BATCH_MASK))
> > +		return -1;
> 
> Same comment as for patch 4. Packed ring size may not be a pow2.
> 
Thanks, will fix in next version.

> > +	/* check refcnt and nb_segs */
> > +	__m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
> > +
> > +	/* load four mbufs rearm data */
> > +	__m256i mbufs = _mm256_set_epi64x(
> > +				*pkts[3]->rearm_data,
> > +				*pkts[2]->rearm_data,
> > +				*pkts[1]->rearm_data,
> > +				*pkts[0]->rearm_data);
> > +
> > +	uint16_t cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
> > +	if (cmp & MBUF_LENS_POS)
> > +		return -1;
> > +
> > +	/* check desc status */
> > +	desc_addr = &vq->desc_packed[avail_idx];
> > +	__m512i desc_vec = _mm512_loadu_si512(desc_addr);
> > +
> > +	__m512i avail_flag_vec;
> > +	__m512i used_flag_vec;
> > +	if (vq->avail_wrap_counter) {
> > +#if defined(RTE_ARCH_I686)
> 
> Is supporting AVX512 on i686 really useful/necessary?
> 
It is useless for function point of view.  Here is for successful compilation if enabled i686 build. 

> > +		avail_flag_vec =
> _mm512_set4_epi64(PACKED_WRITE_AVAIL_FLAG,
> > +					0x0, PACKED_WRITE_AVAIL_FLAG,
> 0x0);
> > +		used_flag_vec = _mm512_set4_epi64(PACKED_FLAGS_MASK,
> 0x0,
> > +					PACKED_FLAGS_MASK, 0x0);
> > +#else
> > +		avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > +					PACKED_WRITE_AVAIL_FLAG);
> > +		used_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > +					PACKED_FLAGS_MASK);
> > +#endif
> > +	} else {
> > +#if defined(RTE_ARCH_I686)
> > +		avail_flag_vec = _mm512_set4_epi64(
> > +					PACKED_WRITE_AVAIL_FLAG_WRAP,
> 0x0,
> > +					PACKED_WRITE_AVAIL_FLAG, 0x0);
> > +		used_flag_vec = _mm512_set4_epi64(0x0, 0x0, 0x0, 0x0);
> > +#else
> > +		avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > +					PACKED_WRITE_AVAIL_FLAG_WRAP);
> > +		used_flag_vec = _mm512_setzero_epi32();
> > +#endif
> > +	}
> > +
> > +	desc_status =
> _mm512_mask_cmp_epu16_mask(BATCH_FLAGS_MASK, desc_vec,
> > +				avail_flag_vec, _MM_CMPINT_NE);
> > +	if (desc_status)
> > +		return -1;
> > +
> > +	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
> > +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +			uint64_t size = (uint64_t)descs[avail_idx + i].len;
> > +			desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
> > +				descs[avail_idx + i].addr, &size,
> > +				VHOST_ACCESS_RW);
> > +
> > +			if (!desc_addrs[i])
> > +				return -1;
> > +
> > +			rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void
> *,
> > +					0));
> > +		}
> > +	} else {
> > +		/* check buffer fit into one region & translate address */
> > +		__m512i regions_low_addrs =
> > +			_mm512_loadu_si512((void *)&dev-
> >regions_low_addrs);
> > +		__m512i regions_high_addrs =
> > +			_mm512_loadu_si512((void *)&dev-
> >regions_high_addrs);
> > +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +			uint64_t addr_low = descs[avail_idx + i].addr;
> > +			uint64_t addr_high = addr_low +
> > +						descs[avail_idx + i].len;
> > +			__m512i low_addr_vec =
> _mm512_set1_epi64(addr_low);
> > +			__m512i high_addr_vec =
> _mm512_set1_epi64(addr_high);
> > +
> > +			cmp_low =
> _mm512_cmp_epi64_mask(low_addr_vec,
> > +					regions_low_addrs,
> _MM_CMPINT_NLT);
> > +			cmp_high =
> _mm512_cmp_epi64_mask(high_addr_vec,
> > +					regions_high_addrs,
> _MM_CMPINT_LT);
> > +			cmp_result = cmp_low & cmp_high;
> > +			int index = __builtin_ctz(cmp_result);
> > +			if (unlikely((uint32_t)index >= dev->mem->nregions))
> > +				return -1;
> > +
> > +			desc_addrs[i] = addr_low +
> > +				dev->mem->regions[index].host_user_addr -
> > +				dev->mem->regions[index].guest_phys_addr;
> > +			rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void
> *,
> > +					0));
> > +		}
> > +	}
> > +
> > +	/* check length is enough */
> > +	__m512i pkt_lens = _mm512_set_epi32(
> > +			0, pkts[3]->pkt_len, 0, 0,
> > +			0, pkts[2]->pkt_len, 0, 0,
> > +			0, pkts[1]->pkt_len, 0, 0,
> > +			0, pkts[0]->pkt_len, 0, 0);
> > +
> > +	__m512i mbuf_len_offset =
> _mm512_maskz_set1_epi32(DESC_LENS_POS,
> > +					dev->vhost_hlen);
> > +	__m512i buf_len_vec = _mm512_add_epi32(pkt_lens,
> mbuf_len_offset);
> > +	uint16_t lens_cmp =
> _mm512_mask_cmp_epu32_mask(DESC_LENS_POS,
> > +				desc_vec, buf_len_vec, _MM_CMPINT_LT);
> > +	if (lens_cmp)
> > +		return -1;
> > +
> > +	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +		rte_memcpy((void *)(uintptr_t)(desc_addrs[i] + buf_offset),
> > +			   rte_pktmbuf_mtod_offset(pkts[i], void *, 0),
> > +			   pkts[i]->pkt_len);
> > +	}
> > +
> > +	if (unlikely((dev->features & (1ULL << VHOST_F_LOG_ALL)))) {
> > +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +			lens[i] = descs[avail_idx + i].len;
> > +			vhost_log_cache_write_iova(dev, vq,
> > +				descs[avail_idx + i].addr, lens[i]);
> > +		}
> > +	}
> > +
> > +	vq_inc_last_avail_packed(vq, PACKED_BATCH_SIZE);
> > +	vq_inc_last_used_packed(vq, PACKED_BATCH_SIZE);
> > +	/* save len and flags, skip addr and id */
> > +	__m512i desc_updated = _mm512_mask_add_epi16(desc_vec,
> > +					DESC_LENS_FLAGS_POS, buf_len_vec,
> > +					used_flag_vec);
> > +	_mm512_storeu_si512(desc_addr, desc_updated);
> > +
> > +	return 0;
> > +}
> > diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> > index e4d2e2e7d6..5c56a8d6ff 100644
> > --- a/lib/librte_vhost/virtio_net.c
> > +++ b/lib/librte_vhost/virtio_net.c
> > @@ -1354,6 +1354,21 @@ virtio_dev_rx_single_packed(struct virtio_net
> *dev,
> >  	return 0;
> >  }
> >
> > +static __rte_always_inline int
> > +virtio_dev_rx_handle_batch_packed(struct virtio_net *dev,
> > +			   struct vhost_virtqueue *vq,
> > +			   struct rte_mbuf **pkts)
> > +
> > +{
> > +	if (unlikely(dev->vectorized))
> > +#ifdef CC_AVX512_SUPPORT
> > +		return virtio_dev_rx_batch_packed_avx(dev, vq, pkts);
> > +#else
> > +		return virtio_dev_rx_batch_packed(dev, vq, pkts);
> > +#endif
> > +	return virtio_dev_rx_batch_packed(dev, vq, pkts);
> 
> It should be as below to not have any performance impact when
> CC_AVX512_SUPPORT is not set:
> 
> #ifdef CC_AVX512_SUPPORT
> 	if (unlikely(dev->vectorized))
> 		return virtio_dev_rx_batch_packed_avx(dev, vq, pkts);
> #else
> 	return virtio_dev_rx_batch_packed(dev, vq, pkts);
> #endif
> 
Got, will fix in next version.

> > +}
> > +
> >  static __rte_noinline uint32_t
> >  virtio_dev_rx_packed(struct virtio_net *dev,
> >  		     struct vhost_virtqueue *__rte_restrict vq,
> > @@ -1367,8 +1382,8 @@ virtio_dev_rx_packed(struct virtio_net *dev,
> >  		rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
> >
> >  		if (remained >= PACKED_BATCH_SIZE) {
> > -			if (!virtio_dev_rx_batch_packed(dev, vq,
> > -							&pkts[pkt_idx])) {
> > +			if (!virtio_dev_rx_handle_batch_packed(dev, vq,
> > +				&pkts[pkt_idx])) {
> >  				pkt_idx += PACKED_BATCH_SIZE;
> >  				remained -= PACKED_BATCH_SIZE;
> >  				continue;
> >


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue
  2020-10-06 15:18       ` Maxime Coquelin
@ 2020-10-09  7:59         ` Liu, Yong
  0 siblings, 0 replies; 36+ messages in thread
From: Liu, Yong @ 2020-10-09  7:59 UTC (permalink / raw)
  To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Tuesday, October 6, 2020 11:19 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v2 4/5] vhost: add packed ring vectorized dequeue
> 
> 
> 
> On 9/21/20 8:48 AM, Marvin Liu wrote:
> > Optimize vhost packed ring dequeue path with SIMD instructions. Four
> > descriptors status check and writeback are batched handled with AVX512
> > instructions. Address translation operations are also accelerated by
> > AVX512 instructions.
> >
> > If platform or compiler not support vectorization, will fallback to
> > default path.
> >
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> > index cc9aa65c67..c1481802d7 100644
> > --- a/lib/librte_vhost/meson.build
> > +++ b/lib/librte_vhost/meson.build
> > @@ -8,6 +8,22 @@ endif
> >  if has_libnuma == 1
> >  	dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
> >  endif
> > +
> > +if arch_subdir == 'x86'
> > +        if not machine_args.contains('-mno-avx512f')
> > +                if cc.has_argument('-mavx512f') and cc.has_argument('-
> mavx512vl') and cc.has_argument('-mavx512bw')
> > +                        cflags += ['-DCC_AVX512_SUPPORT']
> > +                        vhost_avx512_lib = static_library('vhost_avx512_lib',
> > +                                              'vhost_vec_avx.c',
> > +                                              dependencies: [static_rte_eal,
> static_rte_mempool,
> > +                                                  static_rte_mbuf, static_rte_ethdev,
> static_rte_net],
> > +                                              include_directories: includes,
> > +                                              c_args: [cflags, '-mavx512f', '-mavx512bw', '-
> mavx512vl'])
> > +                        objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
> > +                endif
> > +        endif
> > +endif
> > +
> >  if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> >  	cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> >  elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > index 4a81f18f01..fc7daf2145 100644
> > --- a/lib/librte_vhost/vhost.h
> > +++ b/lib/librte_vhost/vhost.h
> > @@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net
> *dev, struct rte_mempool *mp,
> >  	return NULL;
> >  }
> >
> > +int
> > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > +				 struct vhost_virtqueue *vq,
> > +				 struct rte_mempool *mbuf_pool,
> > +				 struct rte_mbuf **pkts,
> > +				 uint16_t avail_idx,
> > +				 uintptr_t *desc_addrs,
> > +				 uint16_t *ids);
> >  #endif /* _VHOST_NET_CDEV_H_ */
> > diff --git a/lib/librte_vhost/vhost_vec_avx.c
> b/lib/librte_vhost/vhost_vec_avx.c
> > new file mode 100644
> > index 0000000000..dc5322d002
> > --- /dev/null
> > +++ b/lib/librte_vhost/vhost_vec_avx.c
> > @@ -0,0 +1,181 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2016 Intel Corporation
> > + */
> > +#include <stdint.h>
> > +
> > +#include "vhost.h"
> > +
> > +#define BYTE_SIZE 8
> > +/* reference count offset in mbuf rearm data */
> > +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> > +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +/* segment number offset in mbuf rearm data */
> > +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> > +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +
> > +/* default rearm data */
> > +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> > +	1ULL << REFCNT_BITS_OFFSET)
> > +
> > +#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc,
> flags) / \
> > +	sizeof(uint16_t))
> > +
> > +#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
> > +	sizeof(uint16_t))
> > +#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
> > +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
> > +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2)  |
> \
> > +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
> > +
> > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> > +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> > +
> > +#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL |
> VRING_DESC_F_USED) \
> > +	<< FLAGS_BITS_OFFSET)
> > +#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) <<
> FLAGS_BITS_OFFSET)
> > +#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> > +	FLAGS_BITS_OFFSET)
> > +
> > +#define DESC_FLAGS_POS 0xaa
> > +#define MBUF_LENS_POS 0x6666
> > +
> > +int
> > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > +				 struct vhost_virtqueue *vq,
> > +				 struct rte_mempool *mbuf_pool,
> > +				 struct rte_mbuf **pkts,
> > +				 uint16_t avail_idx,
> > +				 uintptr_t *desc_addrs,
> > +				 uint16_t *ids)
> > +{
> > +	struct vring_packed_desc *descs = vq->desc_packed;
> > +	uint32_t descs_status;
> > +	void *desc_addr;
> > +	uint16_t i;
> > +	uint8_t cmp_low, cmp_high, cmp_result;
> > +	uint64_t lens[PACKED_BATCH_SIZE];
> > +	struct virtio_net_hdr *hdr;
> > +
> > +	if (unlikely(avail_idx & PACKED_BATCH_MASK))
> > +		return -1;
> > +
> > +	/* load 4 descs */
> > +	desc_addr = &vq->desc_packed[avail_idx];
> > +	__m512i desc_vec = _mm512_loadu_si512(desc_addr);
> > +
> > +	/* burst check four status */
> > +	__m512i avail_flag_vec;
> > +	if (vq->avail_wrap_counter)
> > +#if defined(RTE_ARCH_I686)
> > +		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG,
> 0x0,
> > +					PACKED_FLAGS_MASK, 0x0);
> > +#else
> > +		avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > +					PACKED_AVAIL_FLAG);
> > +
> > +#endif
> > +	else
> > +#if defined(RTE_ARCH_I686)
> > +		avail_flag_vec =
> _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
> > +					0x0, PACKED_AVAIL_FLAG_WRAP,
> 0x0);
> > +#else
> > +		avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > +					PACKED_AVAIL_FLAG_WRAP);
> > +#endif
> > +
> > +	descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
> > +		_MM_CMPINT_NE);
> > +	if (descs_status & BATCH_FLAGS_MASK)
> > +		return -1;
> > +
> 
> 
> Also, please try to factorize code to avoid duplication between Tx and
> Rx paths for desc address translation:

Hi Maxime,
I  have factorized the translation function in Rx and Tx paths, but there's a few performance drop after the change.
Since vectorized datapath is focusing on performance, I'd like to keep current implementation. 

Thanks,
Marvin

> > +	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
> > +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +			uint64_t size = (uint64_t)descs[avail_idx + i].len;
> > +			desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
> > +				descs[avail_idx + i].addr, &size,
> > +				VHOST_ACCESS_RO);
> > +
> > +			if (!desc_addrs[i])
> > +				goto free_buf;
> > +			lens[i] = descs[avail_idx + i].len;
> > +			rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > +
> > +			pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> > +					lens[i]);
> > +			if (!pkts[i])
> > +				goto free_buf;
> > +		}
> > +	} else {> +		/* check buffer fit into one region &
> translate address */
> > +		__m512i regions_low_addrs =
> > +			_mm512_loadu_si512((void *)&dev-
> >regions_low_addrs);
> > +		__m512i regions_high_addrs =
> > +			_mm512_loadu_si512((void *)&dev-
> >regions_high_addrs);
> > +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +			uint64_t addr_low = descs[avail_idx + i].addr;
> > +			uint64_t addr_high = addr_low +
> > +						descs[avail_idx + i].len;
> > +			__m512i low_addr_vec =
> _mm512_set1_epi64(addr_low);
> > +			__m512i high_addr_vec =
> _mm512_set1_epi64(addr_high);
> > +
> > +			cmp_low =
> _mm512_cmp_epi64_mask(low_addr_vec,
> > +					regions_low_addrs,
> _MM_CMPINT_NLT);
> > +			cmp_high =
> _mm512_cmp_epi64_mask(high_addr_vec,
> > +					regions_high_addrs,
> _MM_CMPINT_LT);
> > +			cmp_result = cmp_low & cmp_high;
> > +			int index = __builtin_ctz(cmp_result);
> > +			if (unlikely((uint32_t)index >= dev->mem->nregions))
> > +				goto free_buf;
> > +
> > +			desc_addrs[i] = addr_low +
> > +				dev->mem->regions[index].host_user_addr -
> > +				dev->mem->regions[index].guest_phys_addr;
> > +			lens[i] = descs[avail_idx + i].len;
> > +			rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > +
> > +			pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> > +					lens[i]);
> > +			if (!pkts[i])
> > +				goto free_buf;
> > +		}
> > +	}
> > +
> > +	if (virtio_net_with_host_offload(dev)) {
> > +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +			hdr = (struct virtio_net_hdr *)(desc_addrs[i]);
> > +			vhost_dequeue_offload(hdr, pkts[i]);
> > +		}
> > +	}
> > +
> > +	if (unlikely(virtio_net_is_inorder(dev))) {
> > +		ids[PACKED_BATCH_SIZE - 1] =
> > +			descs[avail_idx + PACKED_BATCH_SIZE - 1].id;
> > +	} else {
> > +		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> > +			ids[i] = descs[avail_idx + i].id;
> > +	}
> > +
> > +	uint64_t addrs[PACKED_BATCH_SIZE << 1];
> > +	/* store mbuf data_len, pkt_len */
> > +	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +		addrs[i << 1] = (uint64_t)pkts[i]->rx_descriptor_fields1;
> > +		addrs[(i << 1) + 1] = (uint64_t)pkts[i]->rx_descriptor_fields1
> > +					+ sizeof(uint64_t);
> > +	}
> > +
> > +	/* save pkt_len and data_len into mbufs */
> > +	__m512i value_vec =
> _mm512_maskz_shuffle_epi32(MBUF_LENS_POS, desc_vec,
> > +					0xAA);
> > +	__m512i offsets_vec = _mm512_maskz_set1_epi32(MBUF_LENS_POS,
> > +					(uint32_t)-12);
> > +	value_vec = _mm512_add_epi32(value_vec, offsets_vec);
> > +	__m512i vindex = _mm512_loadu_si512((void *)addrs);
> > +	_mm512_i64scatter_epi64(0, vindex, value_vec, 1);
> > +
> > +	return 0;
> > +free_buf:
> > +	for (i = 0; i < PACKED_BATCH_SIZE; i++)
> > +		rte_pktmbuf_free(pkts[i]);
> > +
> > +	return -1;
> > +}
> > diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> > index 6107662685..e4d2e2e7d6 100644
> > --- a/lib/librte_vhost/virtio_net.c
> > +++ b/lib/librte_vhost/virtio_net.c
> > @@ -2249,6 +2249,28 @@ vhost_reserve_avail_batch_packed(struct
> virtio_net *dev,
> >  	return -1;
> >  }
> >
> > +static __rte_always_inline int
> > +vhost_handle_avail_batch_packed(struct virtio_net *dev,
> > +				 struct vhost_virtqueue *vq,
> > +				 struct rte_mempool *mbuf_pool,
> > +				 struct rte_mbuf **pkts,
> > +				 uint16_t avail_idx,
> > +				 uintptr_t *desc_addrs,
> > +				 uint16_t *ids)
> > +{
> > +	if (unlikely(dev->vectorized))
> > +#ifdef CC_AVX512_SUPPORT
> > +		return vhost_reserve_avail_batch_packed_avx(dev, vq,
> mbuf_pool,
> > +				pkts, avail_idx, desc_addrs, ids);
> > +#else
> > +		return vhost_reserve_avail_batch_packed(dev, vq,
> mbuf_pool,
> > +				pkts, avail_idx, desc_addrs, ids);
> > +
> > +#endif
> > +	return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> > +			avail_idx, desc_addrs, ids);
> > +}
> > +
> >  static __rte_always_inline int
> >  virtio_dev_tx_batch_packed(struct virtio_net *dev,
> >  			   struct vhost_virtqueue *vq,
> > @@ -2261,8 +2283,9 @@ virtio_dev_tx_batch_packed(struct virtio_net
> *dev,
> >  	uint16_t ids[PACKED_BATCH_SIZE];
> >  	uint16_t i;
> >
> > -	if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> > -					     avail_idx, desc_addrs, ids))
> > +
> > +	if (vhost_handle_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> > +		avail_idx, desc_addrs, ids))
> >  		return -1;
> >
> >  	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> >


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path
  2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 1/5] vhost: " Marvin Liu
  2020-09-21  6:48   ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
@ 2020-10-09  8:14   ` " Marvin Liu
  2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 1/5] vhost: " Marvin Liu
                       ` (5 more replies)
  1 sibling, 6 replies; 36+ messages in thread
From: Marvin Liu @ 2020-10-09  8:14 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Packed ring format is imported since virtio spec 1.1. All descriptors
are compacted into one single ring when packed ring format is on. It is
straight forward that ring operations can be accelerated by utilizing
SIMD instructions. 

This patch set will introduce vectorized data path in vhost library. If
vectorized option is on, operations like descs check, descs writeback,
address translation will be accelerated by SIMD instructions. On skylake
server, it can bring 6% performance gain in loopback case and around 4%
performance gain in PvP case.

Vhost application can choose whether using vectorized acceleration, just
like external buffer feature. If platform or ring format not support
vectorized function, vhost will fallback to use default batch function.
There will be no impact in current data path.

v3:
* rename vectorized datapath file
* eliminate the impact when avx512 disabled
* dynamically allocate memory regions structure
* remove unlikely hint for in_order

v2:
* add vIOMMU support
* add dequeue offloading
* rebase code

Marvin Liu (5):
  vhost: add vectorized data path
  vhost: reuse packed ring functions
  vhost: prepare memory regions addresses
  vhost: add packed ring vectorized dequeue
  vhost: add packed ring vectorized enqueue

 doc/guides/nics/vhost.rst           |   5 +
 doc/guides/prog_guide/vhost_lib.rst |  12 +
 drivers/net/vhost/rte_eth_vhost.c   |  17 +-
 lib/librte_vhost/meson.build        |  16 ++
 lib/librte_vhost/rte_vhost.h        |   1 +
 lib/librte_vhost/socket.c           |   5 +
 lib/librte_vhost/vhost.c            |  11 +
 lib/librte_vhost/vhost.h            | 239 +++++++++++++++++++
 lib/librte_vhost/vhost_user.c       |  26 +++
 lib/librte_vhost/virtio_net.c       | 258 ++++-----------------
 lib/librte_vhost/virtio_net_avx.c   | 344 ++++++++++++++++++++++++++++
 11 files changed, 718 insertions(+), 216 deletions(-)
 create mode 100644 lib/librte_vhost/virtio_net_avx.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v3 1/5] vhost: add vectorized data path
  2020-10-09  8:14   ` [dpdk-dev] [PATCH v3 " Marvin Liu
@ 2020-10-09  8:14     ` " Marvin Liu
  2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 2/5] vhost: reuse packed ring functions Marvin Liu
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-10-09  8:14 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Packed ring operations are split into batch and single functions for
performance perspective. Ring operations in batch function can be
accelerated by SIMD instructions like AVX512.

So introduce vectorized parameter in vhost. Vectorized data path can be
selected if platform and ring format matched requirements. Otherwise
will fallback to original data path.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/doc/guides/nics/vhost.rst b/doc/guides/nics/vhost.rst
index d36f3120b..efdaf4de0 100644
--- a/doc/guides/nics/vhost.rst
+++ b/doc/guides/nics/vhost.rst
@@ -64,6 +64,11 @@ The user can specify below arguments in `--vdev` option.
     It is used to enable external buffer support in vhost library.
     (Default: 0 (disabled))
 
+#.  ``vectorized``:
+
+    It is used to enable vectorized data path support in vhost library.
+    (Default: 0 (disabled))
+
 Vhost PMD event handling
 ------------------------
 
diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index ba4c62aeb..5ef3844a0 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -118,6 +118,18 @@ The following is an overview of some key Vhost API functions:
 
     It is disabled by default.
 
+ - ``RTE_VHOST_USER_VECTORIZED``
+    Vectorized data path will used when this flag is set. When packed ring
+    enabled, available descriptors are stored from frontend driver in sequence.
+    SIMD instructions like AVX can be used to handle multiple descriptors
+    simultaneously. Thus can accelerate the throughput of ring operations.
+
+    * Only packed ring has vectorized data path.
+
+    * Will fallback to normal datapath if no vectorization support.
+
+    It is disabled by default.
+
 * ``rte_vhost_driver_set_features(path, features)``
 
   This function sets the feature bits the vhost-user driver supports. The
diff --git a/drivers/net/vhost/rte_eth_vhost.c b/drivers/net/vhost/rte_eth_vhost.c
index 66efecb32..8f71054ad 100644
--- a/drivers/net/vhost/rte_eth_vhost.c
+++ b/drivers/net/vhost/rte_eth_vhost.c
@@ -34,6 +34,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
 #define ETH_VHOST_VIRTIO_NET_F_HOST_TSO "tso"
 #define ETH_VHOST_LINEAR_BUF  "linear-buffer"
 #define ETH_VHOST_EXT_BUF  "ext-buffer"
+#define ETH_VHOST_VECTORIZED "vectorized"
 #define VHOST_MAX_PKT_BURST 32
 
 static const char *valid_arguments[] = {
@@ -45,6 +46,7 @@ static const char *valid_arguments[] = {
 	ETH_VHOST_VIRTIO_NET_F_HOST_TSO,
 	ETH_VHOST_LINEAR_BUF,
 	ETH_VHOST_EXT_BUF,
+	ETH_VHOST_VECTORIZED,
 	NULL
 };
 
@@ -1509,6 +1511,7 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
 	int tso = 0;
 	int linear_buf = 0;
 	int ext_buf = 0;
+	int vectorized = 0;
 	struct rte_eth_dev *eth_dev;
 	const char *name = rte_vdev_device_name(dev);
 
@@ -1618,6 +1621,17 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
 			flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
 	}
 
+	if (rte_kvargs_count(kvlist, ETH_VHOST_VECTORIZED) == 1) {
+		ret = rte_kvargs_process(kvlist,
+				ETH_VHOST_VECTORIZED,
+				&open_int, &vectorized);
+		if (ret < 0)
+			goto out_free;
+
+		if (vectorized == 1)
+			flags |= RTE_VHOST_USER_VECTORIZED;
+	}
+
 	if (dev->device.numa_node == SOCKET_ID_ANY)
 		dev->device.numa_node = rte_socket_id();
 
@@ -1666,4 +1680,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_vhost,
 	"postcopy-support=<0|1> "
 	"tso=<0|1> "
 	"linear-buffer=<0|1> "
-	"ext-buffer=<0|1>");
+	"ext-buffer=<0|1> "
+	"vectorized=<0|1>");
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index 010f16086..c49c1aca2 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -36,6 +36,7 @@ extern "C" {
 /* support only linear buffers (no chained mbufs) */
 #define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
 #define RTE_VHOST_USER_ASYNC_COPY	(1ULL << 7)
+#define RTE_VHOST_USER_VECTORIZED	(1ULL << 8)
 
 /* Features. */
 #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 0169d3648..e492c8c87 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -42,6 +42,7 @@ struct vhost_user_socket {
 	bool extbuf;
 	bool linearbuf;
 	bool async_copy;
+	bool vectorized;
 
 	/*
 	 * The "supported_features" indicates the feature bits the
@@ -241,6 +242,9 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 			dev->async_copy = 1;
 	}
 
+	if (vsocket->vectorized)
+		vhost_enable_vectorized(vid);
+
 	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
 
 	if (vsocket->notify_ops->new_connection) {
@@ -876,6 +880,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
 	vsocket->vdpa_dev = NULL;
 	vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
 	vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
+	vsocket->vectorized = flags & RTE_VHOST_USER_VECTORIZED;
 	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
 
 	if (vsocket->async_copy &&
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index c7cd34e42..4b5ef10a8 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -738,6 +738,17 @@ vhost_enable_linearbuf(int vid)
 	dev->linearbuf = 1;
 }
 
+void
+vhost_enable_vectorized(int vid)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (dev == NULL)
+		return;
+
+	dev->vectorized = 1;
+}
+
 int
 rte_vhost_get_mtu(int vid, uint16_t *mtu)
 {
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 20ccdc9bd..87583c0b6 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -363,6 +363,7 @@ struct virtio_net {
 	int			async_copy;
 	int			extbuf;
 	int			linearbuf;
+	int                     vectorized;
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
 	struct inflight_mem_info *inflight_info;
 #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
@@ -700,6 +701,7 @@ void vhost_set_ifname(int, const char *if_name, unsigned int if_len);
 void vhost_set_builtin_virtio_net(int vid, bool enable);
 void vhost_enable_extbuf(int vid);
 void vhost_enable_linearbuf(int vid);
+void vhost_enable_vectorized(int vid);
 int vhost_enable_guest_notification(struct virtio_net *dev,
 		struct vhost_virtqueue *vq, int enable);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v3 2/5] vhost: reuse packed ring functions
  2020-10-09  8:14   ` [dpdk-dev] [PATCH v3 " Marvin Liu
  2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 1/5] vhost: " Marvin Liu
@ 2020-10-09  8:14     ` Marvin Liu
  2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 3/5] vhost: prepare memory regions addresses Marvin Liu
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-10-09  8:14 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Move parse_ethernet, offload, extbuf functions to header file. These
functions will be reused by vhost vectorized path.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 87583c0b6..12b7699cf 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -20,6 +20,10 @@
 #include <rte_rwlock.h>
 #include <rte_malloc.h>
 
+#include <rte_ip.h>
+#include <rte_tcp.h>
+#include <rte_udp.h>
+#include <rte_sctp.h>
 #include "rte_vhost.h"
 #include "rte_vdpa.h"
 #include "rte_vdpa_dev.h"
@@ -878,4 +882,214 @@ mbuf_is_consumed(struct rte_mbuf *m)
 	return true;
 }
 
+static  __rte_always_inline bool
+virtio_net_is_inorder(struct virtio_net *dev)
+{
+	return dev->features & (1ULL << VIRTIO_F_IN_ORDER);
+}
+
+static void
+parse_ethernet(struct rte_mbuf *m, uint16_t *l4_proto, void **l4_hdr)
+{
+	struct rte_ipv4_hdr *ipv4_hdr;
+	struct rte_ipv6_hdr *ipv6_hdr;
+	void *l3_hdr = NULL;
+	struct rte_ether_hdr *eth_hdr;
+	uint16_t ethertype;
+
+	eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
+
+	m->l2_len = sizeof(struct rte_ether_hdr);
+	ethertype = rte_be_to_cpu_16(eth_hdr->ether_type);
+
+	if (ethertype == RTE_ETHER_TYPE_VLAN) {
+		struct rte_vlan_hdr *vlan_hdr =
+			(struct rte_vlan_hdr *)(eth_hdr + 1);
+
+		m->l2_len += sizeof(struct rte_vlan_hdr);
+		ethertype = rte_be_to_cpu_16(vlan_hdr->eth_proto);
+	}
+
+	l3_hdr = (char *)eth_hdr + m->l2_len;
+
+	switch (ethertype) {
+	case RTE_ETHER_TYPE_IPV4:
+		ipv4_hdr = l3_hdr;
+		*l4_proto = ipv4_hdr->next_proto_id;
+		m->l3_len = (ipv4_hdr->version_ihl & 0x0f) * 4;
+		*l4_hdr = (char *)l3_hdr + m->l3_len;
+		m->ol_flags |= PKT_TX_IPV4;
+		break;
+	case RTE_ETHER_TYPE_IPV6:
+		ipv6_hdr = l3_hdr;
+		*l4_proto = ipv6_hdr->proto;
+		m->l3_len = sizeof(struct rte_ipv6_hdr);
+		*l4_hdr = (char *)l3_hdr + m->l3_len;
+		m->ol_flags |= PKT_TX_IPV6;
+		break;
+	default:
+		m->l3_len = 0;
+		*l4_proto = 0;
+		*l4_hdr = NULL;
+		break;
+	}
+}
+
+static inline bool
+virtio_net_with_host_offload(struct virtio_net *dev)
+{
+	if (dev->features &
+			((1ULL << VIRTIO_NET_F_CSUM) |
+			 (1ULL << VIRTIO_NET_F_HOST_ECN) |
+			 (1ULL << VIRTIO_NET_F_HOST_TSO4) |
+			 (1ULL << VIRTIO_NET_F_HOST_TSO6) |
+			 (1ULL << VIRTIO_NET_F_HOST_UFO)))
+		return true;
+
+	return false;
+}
+
+static __rte_always_inline void
+vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
+{
+	uint16_t l4_proto = 0;
+	void *l4_hdr = NULL;
+	struct rte_tcp_hdr *tcp_hdr = NULL;
+
+	if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
+		return;
+
+	parse_ethernet(m, &l4_proto, &l4_hdr);
+	if (hdr->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		if (hdr->csum_start == (m->l2_len + m->l3_len)) {
+			switch (hdr->csum_offset) {
+			case (offsetof(struct rte_tcp_hdr, cksum)):
+				if (l4_proto == IPPROTO_TCP)
+					m->ol_flags |= PKT_TX_TCP_CKSUM;
+				break;
+			case (offsetof(struct rte_udp_hdr, dgram_cksum)):
+				if (l4_proto == IPPROTO_UDP)
+					m->ol_flags |= PKT_TX_UDP_CKSUM;
+				break;
+			case (offsetof(struct rte_sctp_hdr, cksum)):
+				if (l4_proto == IPPROTO_SCTP)
+					m->ol_flags |= PKT_TX_SCTP_CKSUM;
+				break;
+			default:
+				break;
+			}
+		}
+	}
+
+	if (l4_hdr && hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+		case VIRTIO_NET_HDR_GSO_TCPV4:
+		case VIRTIO_NET_HDR_GSO_TCPV6:
+			tcp_hdr = l4_hdr;
+			m->ol_flags |= PKT_TX_TCP_SEG;
+			m->tso_segsz = hdr->gso_size;
+			m->l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
+			break;
+		case VIRTIO_NET_HDR_GSO_UDP:
+			m->ol_flags |= PKT_TX_UDP_SEG;
+			m->tso_segsz = hdr->gso_size;
+			m->l4_len = sizeof(struct rte_udp_hdr);
+			break;
+		default:
+			VHOST_LOG_DATA(WARNING,
+				"unsupported gso type %u.\n", hdr->gso_type);
+			break;
+		}
+	}
+}
+
+static void
+virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
+{
+	rte_free(opaque);
+}
+
+static int
+virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
+{
+	struct rte_mbuf_ext_shared_info *shinfo = NULL;
+	uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
+	uint16_t buf_len;
+	rte_iova_t iova;
+	void *buf;
+
+	/* Try to use pkt buffer to store shinfo to reduce the amount of memory
+	 * required, otherwise store shinfo in the new buffer.
+	 */
+	if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
+		shinfo = rte_pktmbuf_mtod(pkt,
+					  struct rte_mbuf_ext_shared_info *);
+	else {
+		total_len += sizeof(*shinfo) + sizeof(uintptr_t);
+		total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
+	}
+
+	if (unlikely(total_len > UINT16_MAX))
+		return -ENOSPC;
+
+	buf_len = total_len;
+	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
+	if (unlikely(buf == NULL))
+		return -ENOMEM;
+
+	/* Initialize shinfo */
+	if (shinfo) {
+		shinfo->free_cb = virtio_dev_extbuf_free;
+		shinfo->fcb_opaque = buf;
+		rte_mbuf_ext_refcnt_set(shinfo, 1);
+	} else {
+		shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
+					      virtio_dev_extbuf_free, buf);
+		if (unlikely(shinfo == NULL)) {
+			rte_free(buf);
+			VHOST_LOG_DATA(ERR, "Failed to init shinfo\n");
+			return -1;
+		}
+	}
+
+	iova = rte_malloc_virt2iova(buf);
+	rte_pktmbuf_attach_extbuf(pkt, buf, iova, buf_len, shinfo);
+	rte_pktmbuf_reset_headroom(pkt);
+
+	return 0;
+}
+
+/*
+ * Allocate a host supported pktmbuf.
+ */
+static __rte_always_inline struct rte_mbuf *
+virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
+			 uint32_t data_len)
+{
+	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
+
+	if (unlikely(pkt == NULL)) {
+		VHOST_LOG_DATA(ERR,
+			"Failed to allocate memory for mbuf.\n");
+		return NULL;
+	}
+
+	if (rte_pktmbuf_tailroom(pkt) >= data_len)
+		return pkt;
+
+	/* attach an external buffer if supported */
+	if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
+		return pkt;
+
+	/* check if chained buffers are allowed */
+	if (!dev->linearbuf)
+		return pkt;
+
+	/* Data doesn't fit into the buffer and the host supports
+	 * only linear buffers
+	 */
+	rte_pktmbuf_free(pkt);
+
+	return NULL;
+}
 #endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 0a0bea1a5..9757ed053 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -32,12 +32,6 @@ rxvq_is_mergeable(struct virtio_net *dev)
 	return dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF);
 }
 
-static  __rte_always_inline bool
-virtio_net_is_inorder(struct virtio_net *dev)
-{
-	return dev->features & (1ULL << VIRTIO_F_IN_ORDER);
-}
-
 static bool
 is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring)
 {
@@ -1804,121 +1798,6 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
 	return virtio_dev_rx_async_submit(dev, queue_id, pkts, count);
 }
 
-static inline bool
-virtio_net_with_host_offload(struct virtio_net *dev)
-{
-	if (dev->features &
-			((1ULL << VIRTIO_NET_F_CSUM) |
-			 (1ULL << VIRTIO_NET_F_HOST_ECN) |
-			 (1ULL << VIRTIO_NET_F_HOST_TSO4) |
-			 (1ULL << VIRTIO_NET_F_HOST_TSO6) |
-			 (1ULL << VIRTIO_NET_F_HOST_UFO)))
-		return true;
-
-	return false;
-}
-
-static void
-parse_ethernet(struct rte_mbuf *m, uint16_t *l4_proto, void **l4_hdr)
-{
-	struct rte_ipv4_hdr *ipv4_hdr;
-	struct rte_ipv6_hdr *ipv6_hdr;
-	void *l3_hdr = NULL;
-	struct rte_ether_hdr *eth_hdr;
-	uint16_t ethertype;
-
-	eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
-
-	m->l2_len = sizeof(struct rte_ether_hdr);
-	ethertype = rte_be_to_cpu_16(eth_hdr->ether_type);
-
-	if (ethertype == RTE_ETHER_TYPE_VLAN) {
-		struct rte_vlan_hdr *vlan_hdr =
-			(struct rte_vlan_hdr *)(eth_hdr + 1);
-
-		m->l2_len += sizeof(struct rte_vlan_hdr);
-		ethertype = rte_be_to_cpu_16(vlan_hdr->eth_proto);
-	}
-
-	l3_hdr = (char *)eth_hdr + m->l2_len;
-
-	switch (ethertype) {
-	case RTE_ETHER_TYPE_IPV4:
-		ipv4_hdr = l3_hdr;
-		*l4_proto = ipv4_hdr->next_proto_id;
-		m->l3_len = (ipv4_hdr->version_ihl & 0x0f) * 4;
-		*l4_hdr = (char *)l3_hdr + m->l3_len;
-		m->ol_flags |= PKT_TX_IPV4;
-		break;
-	case RTE_ETHER_TYPE_IPV6:
-		ipv6_hdr = l3_hdr;
-		*l4_proto = ipv6_hdr->proto;
-		m->l3_len = sizeof(struct rte_ipv6_hdr);
-		*l4_hdr = (char *)l3_hdr + m->l3_len;
-		m->ol_flags |= PKT_TX_IPV6;
-		break;
-	default:
-		m->l3_len = 0;
-		*l4_proto = 0;
-		*l4_hdr = NULL;
-		break;
-	}
-}
-
-static __rte_always_inline void
-vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
-{
-	uint16_t l4_proto = 0;
-	void *l4_hdr = NULL;
-	struct rte_tcp_hdr *tcp_hdr = NULL;
-
-	if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
-		return;
-
-	parse_ethernet(m, &l4_proto, &l4_hdr);
-	if (hdr->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-		if (hdr->csum_start == (m->l2_len + m->l3_len)) {
-			switch (hdr->csum_offset) {
-			case (offsetof(struct rte_tcp_hdr, cksum)):
-				if (l4_proto == IPPROTO_TCP)
-					m->ol_flags |= PKT_TX_TCP_CKSUM;
-				break;
-			case (offsetof(struct rte_udp_hdr, dgram_cksum)):
-				if (l4_proto == IPPROTO_UDP)
-					m->ol_flags |= PKT_TX_UDP_CKSUM;
-				break;
-			case (offsetof(struct rte_sctp_hdr, cksum)):
-				if (l4_proto == IPPROTO_SCTP)
-					m->ol_flags |= PKT_TX_SCTP_CKSUM;
-				break;
-			default:
-				break;
-			}
-		}
-	}
-
-	if (l4_hdr && hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
-		switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
-		case VIRTIO_NET_HDR_GSO_TCPV4:
-		case VIRTIO_NET_HDR_GSO_TCPV6:
-			tcp_hdr = l4_hdr;
-			m->ol_flags |= PKT_TX_TCP_SEG;
-			m->tso_segsz = hdr->gso_size;
-			m->l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
-			break;
-		case VIRTIO_NET_HDR_GSO_UDP:
-			m->ol_flags |= PKT_TX_UDP_SEG;
-			m->tso_segsz = hdr->gso_size;
-			m->l4_len = sizeof(struct rte_udp_hdr);
-			break;
-		default:
-			VHOST_LOG_DATA(WARNING,
-				"unsupported gso type %u.\n", hdr->gso_type);
-			break;
-		}
-	}
-}
-
 static __rte_noinline void
 copy_vnet_hdr_from_desc(struct virtio_net_hdr *hdr,
 		struct buf_vector *buf_vec)
@@ -2083,96 +1962,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return error;
 }
 
-static void
-virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
-{
-	rte_free(opaque);
-}
-
-static int
-virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
-{
-	struct rte_mbuf_ext_shared_info *shinfo = NULL;
-	uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
-	uint16_t buf_len;
-	rte_iova_t iova;
-	void *buf;
-
-	/* Try to use pkt buffer to store shinfo to reduce the amount of memory
-	 * required, otherwise store shinfo in the new buffer.
-	 */
-	if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
-		shinfo = rte_pktmbuf_mtod(pkt,
-					  struct rte_mbuf_ext_shared_info *);
-	else {
-		total_len += sizeof(*shinfo) + sizeof(uintptr_t);
-		total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
-	}
-
-	if (unlikely(total_len > UINT16_MAX))
-		return -ENOSPC;
-
-	buf_len = total_len;
-	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
-	if (unlikely(buf == NULL))
-		return -ENOMEM;
-
-	/* Initialize shinfo */
-	if (shinfo) {
-		shinfo->free_cb = virtio_dev_extbuf_free;
-		shinfo->fcb_opaque = buf;
-		rte_mbuf_ext_refcnt_set(shinfo, 1);
-	} else {
-		shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
-					      virtio_dev_extbuf_free, buf);
-		if (unlikely(shinfo == NULL)) {
-			rte_free(buf);
-			VHOST_LOG_DATA(ERR, "Failed to init shinfo\n");
-			return -1;
-		}
-	}
-
-	iova = rte_malloc_virt2iova(buf);
-	rte_pktmbuf_attach_extbuf(pkt, buf, iova, buf_len, shinfo);
-	rte_pktmbuf_reset_headroom(pkt);
-
-	return 0;
-}
-
-/*
- * Allocate a host supported pktmbuf.
- */
-static __rte_always_inline struct rte_mbuf *
-virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
-			 uint32_t data_len)
-{
-	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
-
-	if (unlikely(pkt == NULL)) {
-		VHOST_LOG_DATA(ERR,
-			"Failed to allocate memory for mbuf.\n");
-		return NULL;
-	}
-
-	if (rte_pktmbuf_tailroom(pkt) >= data_len)
-		return pkt;
-
-	/* attach an external buffer if supported */
-	if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
-		return pkt;
-
-	/* check if chained buffers are allowed */
-	if (!dev->linearbuf)
-		return pkt;
-
-	/* Data doesn't fit into the buffer and the host supports
-	 * only linear buffers
-	 */
-	rte_pktmbuf_free(pkt);
-
-	return NULL;
-}
-
 static __rte_noinline uint16_t
 virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v3 3/5] vhost: prepare memory regions addresses
  2020-10-09  8:14   ` [dpdk-dev] [PATCH v3 " Marvin Liu
  2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 1/5] vhost: " Marvin Liu
  2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 2/5] vhost: reuse packed ring functions Marvin Liu
@ 2020-10-09  8:14     ` Marvin Liu
  2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-10-09  8:14 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Prepare memory regions guest physical addresses for vectorized data
path. These information will be utilized by SIMD instructions to find
matched region index.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 12b7699cf..a19fe9423 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -52,6 +52,8 @@
 
 #define ASYNC_MAX_POLL_SEG 255
 
+#define MAX_NREGIONS 8
+
 #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
 #define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
 
@@ -349,6 +351,11 @@ struct inflight_mem_info {
 	uint64_t	size;
 };
 
+struct mem_regions_range {
+	uint64_t regions_low_addrs[MAX_NREGIONS];
+	uint64_t regions_high_addrs[MAX_NREGIONS];
+};
+
 /**
  * Device structure contains all configuration information relating
  * to the device.
@@ -356,6 +363,7 @@ struct inflight_mem_info {
 struct virtio_net {
 	/* Frontend (QEMU) memory and memory region information */
 	struct rte_vhost_memory	*mem;
+	struct mem_regions_range *regions_range;
 	uint64_t		features;
 	uint64_t		protocol_features;
 	int			vid;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 4deceb3e0..2d2a2a1a3 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -185,6 +185,11 @@ vhost_backend_cleanup(struct virtio_net *dev)
 		dev->inflight_info = NULL;
 	}
 
+	if (dev->regions_range) {
+		free(dev->regions_range);
+		dev->regions_range = NULL;
+	}
+
 	if (dev->slave_req_fd >= 0) {
 		close(dev->slave_req_fd);
 		dev->slave_req_fd = -1;
@@ -1230,6 +1235,27 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
 		}
 	}
 
+	RTE_BUILD_BUG_ON(VHOST_MEMORY_MAX_NREGIONS != 8);
+	if (dev->vectorized) {
+		if (dev->regions_range == NULL) {
+			dev->regions_range = calloc(1,
+					sizeof(struct mem_regions_range));
+			if (!dev->regions_range) {
+				VHOST_LOG_CONFIG(ERR,
+					"failed to alloc dev vectorized area\n");
+				return RTE_VHOST_MSG_RESULT_ERR;
+			}
+		}
+
+		for (i = 0; i < memory->nregions; i++) {
+			dev->regions_range->regions_low_addrs[i] =
+				memory->regions[i].guest_phys_addr;
+			dev->regions_range->regions_high_addrs[i] =
+				memory->regions[i].guest_phys_addr +
+				memory->regions[i].memory_size;
+		}
+	}
+
 	for (i = 0; i < dev->nr_vring; i++) {
 		struct vhost_virtqueue *vq = dev->virtqueue[i];
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v3 4/5] vhost: add packed ring vectorized dequeue
  2020-10-09  8:14   ` [dpdk-dev] [PATCH v3 " Marvin Liu
                       ` (2 preceding siblings ...)
  2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 3/5] vhost: prepare memory regions addresses Marvin Liu
@ 2020-10-09  8:14     ` Marvin Liu
  2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
  2020-10-12  8:21     ` [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path Maxime Coquelin
  5 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-10-09  8:14 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Optimize vhost packed ring dequeue path with SIMD instructions. Four
descriptors status check and writeback are batched handled with AVX512
instructions. Address translation operations are also accelerated by
AVX512 instructions.

If platform or compiler not support vectorization, will fallback to
default path.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
index cc9aa65c6..5eadcbae4 100644
--- a/lib/librte_vhost/meson.build
+++ b/lib/librte_vhost/meson.build
@@ -8,6 +8,22 @@ endif
 if has_libnuma == 1
 	dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
 endif
+
+if arch_subdir == 'x86'
+        if not machine_args.contains('-mno-avx512f')
+                if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
+                        cflags += ['-DCC_AVX512_SUPPORT']
+                        vhost_avx512_lib = static_library('vhost_avx512_lib',
+                                              'virtio_net_avx.c',
+                                              dependencies: [static_rte_eal, static_rte_mempool,
+                                                  static_rte_mbuf, static_rte_ethdev, static_rte_net],
+                                              include_directories: includes,
+                                              c_args: [cflags, '-mavx512f', '-mavx512bw', '-mavx512vl'])
+                        objs += vhost_avx512_lib.extract_objects('virtio_net_avx.c')
+                endif
+        endif
+endif
+
 if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
 	cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
 elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index a19fe9423..b270c424b 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -1100,4 +1100,15 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
 
 	return NULL;
 }
+
+int
+vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
+				 struct vhost_virtqueue *vq,
+				 struct rte_mempool *mbuf_pool,
+				 struct rte_mbuf **pkts,
+				 uint16_t avail_idx,
+				 uintptr_t *desc_addrs,
+				 uint16_t *ids);
+
+
 #endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 9757ed053..3bc6b9b20 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2136,6 +2136,28 @@ vhost_reserve_avail_batch_packed(struct virtio_net *dev,
 	return -1;
 }
 
+static __rte_always_inline int
+vhost_handle_avail_batch_packed(struct virtio_net *dev,
+				 struct vhost_virtqueue *vq,
+				 struct rte_mempool *mbuf_pool,
+				 struct rte_mbuf **pkts,
+				 uint16_t avail_idx,
+				 uintptr_t *desc_addrs,
+				 uint16_t *ids)
+{
+#ifdef CC_AVX512_SUPPORT
+	if (unlikely(dev->vectorized))
+		return vhost_reserve_avail_batch_packed_avx(dev, vq, mbuf_pool,
+				pkts, avail_idx, desc_addrs, ids);
+	else
+		return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool,
+				pkts, avail_idx, desc_addrs, ids);
+#else
+	return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
+			avail_idx, desc_addrs, ids);
+#endif
+}
+
 static __rte_always_inline int
 virtio_dev_tx_batch_packed(struct virtio_net *dev,
 			   struct vhost_virtqueue *vq,
@@ -2148,8 +2170,9 @@ virtio_dev_tx_batch_packed(struct virtio_net *dev,
 	uint16_t ids[PACKED_BATCH_SIZE];
 	uint16_t i;
 
-	if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
-					     avail_idx, desc_addrs, ids))
+
+	if (vhost_handle_avail_batch_packed(dev, vq, mbuf_pool, pkts,
+		avail_idx, desc_addrs, ids))
 		return -1;
 
 	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
diff --git a/lib/librte_vhost/virtio_net_avx.c b/lib/librte_vhost/virtio_net_avx.c
new file mode 100644
index 000000000..e10b2a285
--- /dev/null
+++ b/lib/librte_vhost/virtio_net_avx.c
@@ -0,0 +1,184 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2016 Intel Corporation
+ */
+#include <stdint.h>
+
+#include "vhost.h"
+
+#define BYTE_SIZE 8
+/* reference count offset in mbuf rearm data */
+#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
+	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+/* segment number offset in mbuf rearm data */
+#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
+	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+
+/* default rearm data */
+#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
+	1ULL << REFCNT_BITS_OFFSET)
+
+#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc, flags) / \
+	sizeof(uint16_t))
+
+#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
+	sizeof(uint16_t))
+#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
+	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
+	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2)  | \
+	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
+
+#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
+	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
+
+#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL | VRING_DESC_F_USED) \
+	<< FLAGS_BITS_OFFSET)
+#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
+#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
+	FLAGS_BITS_OFFSET)
+
+#define DESC_FLAGS_POS 0xaa
+#define MBUF_LENS_POS 0x6666
+
+int
+vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
+				 struct vhost_virtqueue *vq,
+				 struct rte_mempool *mbuf_pool,
+				 struct rte_mbuf **pkts,
+				 uint16_t avail_idx,
+				 uintptr_t *desc_addrs,
+				 uint16_t *ids)
+{
+	struct vring_packed_desc *descs = vq->desc_packed;
+	uint32_t descs_status;
+	void *desc_addr;
+	uint16_t i;
+	uint8_t cmp_low, cmp_high, cmp_result;
+	uint64_t lens[PACKED_BATCH_SIZE];
+	struct virtio_net_hdr *hdr;
+
+	if (unlikely(avail_idx & PACKED_BATCH_MASK))
+		return -1;
+	if (unlikely((avail_idx + PACKED_BATCH_SIZE) > vq->size))
+		return -1;
+
+	/* load 4 descs */
+	desc_addr = &vq->desc_packed[avail_idx];
+	__m512i desc_vec = _mm512_loadu_si512(desc_addr);
+
+	/* burst check four status */
+	__m512i avail_flag_vec;
+	if (vq->avail_wrap_counter)
+#if defined(RTE_ARCH_I686)
+		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG, 0x0,
+					PACKED_FLAGS_MASK, 0x0);
+#else
+		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+					PACKED_AVAIL_FLAG);
+
+#endif
+	else
+#if defined(RTE_ARCH_I686)
+		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
+					0x0, PACKED_AVAIL_FLAG_WRAP, 0x0);
+#else
+		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+					PACKED_AVAIL_FLAG_WRAP);
+#endif
+
+	descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
+		_MM_CMPINT_NE);
+	if (descs_status & BATCH_FLAGS_MASK)
+		return -1;
+
+	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			uint64_t size = (uint64_t)descs[avail_idx + i].len;
+			desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
+				descs[avail_idx + i].addr, &size,
+				VHOST_ACCESS_RO);
+
+			if (!desc_addrs[i])
+				goto free_buf;
+			lens[i] = descs[avail_idx + i].len;
+			rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
+
+			pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
+					lens[i]);
+			if (!pkts[i])
+				goto free_buf;
+		}
+	} else {
+		/* check buffer fit into one region & translate address */
+		struct mem_regions_range *range = dev->regions_range;
+		__m512i regions_low_addrs =
+			_mm512_loadu_si512((void *)&range->regions_low_addrs);
+		__m512i regions_high_addrs =
+			_mm512_loadu_si512((void *)&range->regions_high_addrs);
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			uint64_t addr_low = descs[avail_idx + i].addr;
+			uint64_t addr_high = addr_low +
+						descs[avail_idx + i].len;
+			__m512i low_addr_vec = _mm512_set1_epi64(addr_low);
+			__m512i high_addr_vec = _mm512_set1_epi64(addr_high);
+
+			cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
+					regions_low_addrs, _MM_CMPINT_NLT);
+			cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
+					regions_high_addrs, _MM_CMPINT_LT);
+			cmp_result = cmp_low & cmp_high;
+			int index = __builtin_ctz(cmp_result);
+			if (unlikely((uint32_t)index >= dev->mem->nregions))
+				goto free_buf;
+
+			desc_addrs[i] = addr_low +
+				dev->mem->regions[index].host_user_addr -
+				dev->mem->regions[index].guest_phys_addr;
+			lens[i] = descs[avail_idx + i].len;
+			rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
+
+			pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
+					lens[i]);
+			if (!pkts[i])
+				goto free_buf;
+		}
+	}
+
+	if (virtio_net_with_host_offload(dev)) {
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = (struct virtio_net_hdr *)(desc_addrs[i]);
+			vhost_dequeue_offload(hdr, pkts[i]);
+		}
+	}
+
+	if (virtio_net_is_inorder(dev)) {
+		ids[PACKED_BATCH_SIZE - 1] =
+			descs[avail_idx + PACKED_BATCH_SIZE - 1].id;
+	} else {
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
+			ids[i] = descs[avail_idx + i].id;
+	}
+
+	uint64_t addrs[PACKED_BATCH_SIZE << 1];
+	/* store mbuf data_len, pkt_len */
+	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		addrs[i << 1] = (uint64_t)pkts[i]->rx_descriptor_fields1;
+		addrs[(i << 1) + 1] = (uint64_t)pkts[i]->rx_descriptor_fields1
+					+ sizeof(uint64_t);
+	}
+
+	/* save pkt_len and data_len into mbufs */
+	__m512i value_vec = _mm512_maskz_shuffle_epi32(MBUF_LENS_POS, desc_vec,
+					0xAA);
+	__m512i offsets_vec = _mm512_maskz_set1_epi32(MBUF_LENS_POS,
+					(uint32_t)-12);
+	value_vec = _mm512_add_epi32(value_vec, offsets_vec);
+	__m512i vindex = _mm512_loadu_si512((void *)addrs);
+	_mm512_i64scatter_epi64(0, vindex, value_vec, 1);
+
+	return 0;
+free_buf:
+	for (i = 0; i < PACKED_BATCH_SIZE; i++)
+		rte_pktmbuf_free(pkts[i]);
+
+	return -1;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [dpdk-dev] [PATCH v3 5/5] vhost: add packed ring vectorized enqueue
  2020-10-09  8:14   ` [dpdk-dev] [PATCH v3 " Marvin Liu
                       ` (3 preceding siblings ...)
  2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
@ 2020-10-09  8:14     ` Marvin Liu
  2020-10-12  8:21     ` [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path Maxime Coquelin
  5 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-10-09  8:14 UTC (permalink / raw)
  To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu

Optimize vhost packed ring enqueue path with SIMD instructions. Four
descriptors status and length are batched handled with AVX512
instructions. Address translation operations are also accelerated
by AVX512 instructions.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index b270c424b..84dc289e9 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -1110,5 +1110,9 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
 				 uintptr_t *desc_addrs,
 				 uint16_t *ids);
 
+int
+virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
+			       struct vhost_virtqueue *vq,
+			       struct rte_mbuf **pkts);
 
 #endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 3bc6b9b20..3e49c88ac 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -1354,6 +1354,22 @@ virtio_dev_rx_single_packed(struct virtio_net *dev,
 	return 0;
 }
 
+static __rte_always_inline int
+virtio_dev_rx_handle_batch_packed(struct virtio_net *dev,
+			   struct vhost_virtqueue *vq,
+			   struct rte_mbuf **pkts)
+
+{
+#ifdef CC_AVX512_SUPPORT
+	if (unlikely(dev->vectorized))
+		return virtio_dev_rx_batch_packed_avx(dev, vq, pkts);
+	else
+		return virtio_dev_rx_batch_packed(dev, vq, pkts);
+#else
+	return virtio_dev_rx_batch_packed(dev, vq, pkts);
+#endif
+}
+
 static __rte_noinline uint32_t
 virtio_dev_rx_packed(struct virtio_net *dev,
 		     struct vhost_virtqueue *__rte_restrict vq,
@@ -1367,8 +1383,8 @@ virtio_dev_rx_packed(struct virtio_net *dev,
 		rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
 
 		if (remained >= PACKED_BATCH_SIZE) {
-			if (!virtio_dev_rx_batch_packed(dev, vq,
-							&pkts[pkt_idx])) {
+			if (!virtio_dev_rx_handle_batch_packed(dev, vq,
+				&pkts[pkt_idx])) {
 				pkt_idx += PACKED_BATCH_SIZE;
 				remained -= PACKED_BATCH_SIZE;
 				continue;
diff --git a/lib/librte_vhost/virtio_net_avx.c b/lib/librte_vhost/virtio_net_avx.c
index e10b2a285..aa47b15ae 100644
--- a/lib/librte_vhost/virtio_net_avx.c
+++ b/lib/librte_vhost/virtio_net_avx.c
@@ -35,9 +35,15 @@
 #define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
 #define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
 	FLAGS_BITS_OFFSET)
+#define PACKED_WRITE_AVAIL_FLAG (PACKED_AVAIL_FLAG | \
+	((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
+#define PACKED_WRITE_AVAIL_FLAG_WRAP (PACKED_AVAIL_FLAG_WRAP | \
+	((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
 
 #define DESC_FLAGS_POS 0xaa
 #define MBUF_LENS_POS 0x6666
+#define DESC_LENS_POS 0x4444
+#define DESC_LENS_FLAGS_POS 0xB0B0B0B0
 
 int
 vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
@@ -182,3 +188,157 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
 
 	return -1;
 }
+
+int
+virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
+			       struct vhost_virtqueue *vq,
+			       struct rte_mbuf **pkts)
+{
+	struct vring_packed_desc *descs = vq->desc_packed;
+	uint16_t avail_idx = vq->last_avail_idx;
+	uint64_t desc_addrs[PACKED_BATCH_SIZE];
+	uint32_t buf_offset = dev->vhost_hlen;
+	uint32_t desc_status;
+	uint64_t lens[PACKED_BATCH_SIZE];
+	uint16_t i;
+	void *desc_addr;
+	uint8_t cmp_low, cmp_high, cmp_result;
+
+	if (unlikely(avail_idx & PACKED_BATCH_MASK))
+		return -1;
+	if (unlikely((avail_idx + PACKED_BATCH_SIZE) > vq->size))
+		return -1;
+
+	/* check refcnt and nb_segs */
+	__m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
+
+	/* load four mbufs rearm data */
+	__m256i mbufs = _mm256_set_epi64x(
+				*pkts[3]->rearm_data,
+				*pkts[2]->rearm_data,
+				*pkts[1]->rearm_data,
+				*pkts[0]->rearm_data);
+
+	uint16_t cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
+	if (cmp & MBUF_LENS_POS)
+		return -1;
+
+	/* check desc status */
+	desc_addr = &vq->desc_packed[avail_idx];
+	__m512i desc_vec = _mm512_loadu_si512(desc_addr);
+
+	__m512i avail_flag_vec;
+	__m512i used_flag_vec;
+	if (vq->avail_wrap_counter) {
+#if defined(RTE_ARCH_I686)
+		avail_flag_vec = _mm512_set4_epi64(PACKED_WRITE_AVAIL_FLAG,
+					0x0, PACKED_WRITE_AVAIL_FLAG, 0x0);
+		used_flag_vec = _mm512_set4_epi64(PACKED_FLAGS_MASK, 0x0,
+					PACKED_FLAGS_MASK, 0x0);
+#else
+		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+					PACKED_WRITE_AVAIL_FLAG);
+		used_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+					PACKED_FLAGS_MASK);
+#endif
+	} else {
+#if defined(RTE_ARCH_I686)
+		avail_flag_vec = _mm512_set4_epi64(
+					PACKED_WRITE_AVAIL_FLAG_WRAP, 0x0,
+					PACKED_WRITE_AVAIL_FLAG, 0x0);
+		used_flag_vec = _mm512_set4_epi64(0x0, 0x0, 0x0, 0x0);
+#else
+		avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+					PACKED_WRITE_AVAIL_FLAG_WRAP);
+		used_flag_vec = _mm512_setzero_epi32();
+#endif
+	}
+
+	desc_status = _mm512_mask_cmp_epu16_mask(BATCH_FLAGS_MASK, desc_vec,
+				avail_flag_vec, _MM_CMPINT_NE);
+	if (desc_status)
+		return -1;
+
+	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			uint64_t size = (uint64_t)descs[avail_idx + i].len;
+			desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
+				descs[avail_idx + i].addr, &size,
+				VHOST_ACCESS_RW);
+
+			if (!desc_addrs[i])
+				return -1;
+
+			rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *,
+					0));
+		}
+	} else {
+		/* check buffer fit into one region & translate address */
+		struct mem_regions_range *range = dev->regions_range;
+		__m512i regions_low_addrs =
+			_mm512_loadu_si512((void *)&range->regions_low_addrs);
+		__m512i regions_high_addrs =
+			_mm512_loadu_si512((void *)&range->regions_high_addrs);
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			uint64_t addr_low = descs[avail_idx + i].addr;
+			uint64_t addr_high = addr_low +
+						descs[avail_idx + i].len;
+			__m512i low_addr_vec = _mm512_set1_epi64(addr_low);
+			__m512i high_addr_vec = _mm512_set1_epi64(addr_high);
+
+			cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
+					regions_low_addrs, _MM_CMPINT_NLT);
+			cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
+					regions_high_addrs, _MM_CMPINT_LT);
+			cmp_result = cmp_low & cmp_high;
+			int index = __builtin_ctz(cmp_result);
+			if (unlikely((uint32_t)index >= dev->mem->nregions))
+				return -1;
+
+			desc_addrs[i] = addr_low +
+				dev->mem->regions[index].host_user_addr -
+				dev->mem->regions[index].guest_phys_addr;
+			rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *,
+					0));
+		}
+	}
+
+	/* check length is enough */
+	__m512i pkt_lens = _mm512_set_epi32(
+			0, pkts[3]->pkt_len, 0, 0,
+			0, pkts[2]->pkt_len, 0, 0,
+			0, pkts[1]->pkt_len, 0, 0,
+			0, pkts[0]->pkt_len, 0, 0);
+
+	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(DESC_LENS_POS,
+					dev->vhost_hlen);
+	__m512i buf_len_vec = _mm512_add_epi32(pkt_lens, mbuf_len_offset);
+	uint16_t lens_cmp = _mm512_mask_cmp_epu32_mask(DESC_LENS_POS,
+				desc_vec, buf_len_vec, _MM_CMPINT_LT);
+	if (lens_cmp)
+		return -1;
+
+	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		rte_memcpy((void *)(uintptr_t)(desc_addrs[i] + buf_offset),
+			   rte_pktmbuf_mtod_offset(pkts[i], void *, 0),
+			   pkts[i]->pkt_len);
+	}
+
+	if (unlikely((dev->features & (1ULL << VHOST_F_LOG_ALL)))) {
+		vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			lens[i] = descs[avail_idx + i].len;
+			vhost_log_cache_write_iova(dev, vq,
+				descs[avail_idx + i].addr, lens[i]);
+		}
+	}
+
+	vq_inc_last_avail_packed(vq, PACKED_BATCH_SIZE);
+	vq_inc_last_used_packed(vq, PACKED_BATCH_SIZE);
+	/* save len and flags, skip addr and id */
+	__m512i desc_updated = _mm512_mask_add_epi16(desc_vec,
+					DESC_LENS_FLAGS_POS, buf_len_vec,
+					used_flag_vec);
+	_mm512_storeu_si512(desc_addr, desc_updated);
+
+	return 0;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path
  2020-10-09  8:14   ` [dpdk-dev] [PATCH v3 " Marvin Liu
                       ` (4 preceding siblings ...)
  2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
@ 2020-10-12  8:21     ` Maxime Coquelin
  2020-10-12  9:10       ` Liu, Yong
  2020-10-15 15:28       ` Liu, Yong
  5 siblings, 2 replies; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-12  8:21 UTC (permalink / raw)
  To: Marvin Liu, chenbo.xia, zhihong.wang; +Cc: dev

Hi Marvin,

On 10/9/20 10:14 AM, Marvin Liu wrote:
> Packed ring format is imported since virtio spec 1.1. All descriptors
> are compacted into one single ring when packed ring format is on. It is
> straight forward that ring operations can be accelerated by utilizing
> SIMD instructions. 
> 
> This patch set will introduce vectorized data path in vhost library. If
> vectorized option is on, operations like descs check, descs writeback,
> address translation will be accelerated by SIMD instructions. On skylake
> server, it can bring 6% performance gain in loopback case and around 4%
> performance gain in PvP case.

IMHO, 4% gain on PVP is not a significant gain if we compare to the
added complexity. Moreover, I guess this is 4% gain with testpmd-based
PVP? If this is the case it may be even lower with OVS-DPDK PVP
benchmark, I will try to do a benchmark this week.

Thanks,
Maxime

> Vhost application can choose whether using vectorized acceleration, just
> like external buffer feature. If platform or ring format not support
> vectorized function, vhost will fallback to use default batch function.
> There will be no impact in current data path.
> 
> v3:
> * rename vectorized datapath file
> * eliminate the impact when avx512 disabled
> * dynamically allocate memory regions structure
> * remove unlikely hint for in_order
> 
> v2:
> * add vIOMMU support
> * add dequeue offloading
> * rebase code
> 
> Marvin Liu (5):
>   vhost: add vectorized data path
>   vhost: reuse packed ring functions
>   vhost: prepare memory regions addresses
>   vhost: add packed ring vectorized dequeue
>   vhost: add packed ring vectorized enqueue
> 
>  doc/guides/nics/vhost.rst           |   5 +
>  doc/guides/prog_guide/vhost_lib.rst |  12 +
>  drivers/net/vhost/rte_eth_vhost.c   |  17 +-
>  lib/librte_vhost/meson.build        |  16 ++
>  lib/librte_vhost/rte_vhost.h        |   1 +
>  lib/librte_vhost/socket.c           |   5 +
>  lib/librte_vhost/vhost.c            |  11 +
>  lib/librte_vhost/vhost.h            | 239 +++++++++++++++++++
>  lib/librte_vhost/vhost_user.c       |  26 +++
>  lib/librte_vhost/virtio_net.c       | 258 ++++-----------------
>  lib/librte_vhost/virtio_net_avx.c   | 344 ++++++++++++++++++++++++++++
>  11 files changed, 718 insertions(+), 216 deletions(-)
>  create mode 100644 lib/librte_vhost/virtio_net_avx.c
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path
  2020-10-12  8:21     ` [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path Maxime Coquelin
@ 2020-10-12  9:10       ` Liu, Yong
  2020-10-12  9:57         ` Maxime Coquelin
  2020-10-15 15:28       ` Liu, Yong
  1 sibling, 1 reply; 36+ messages in thread
From: Liu, Yong @ 2020-10-12  9:10 UTC (permalink / raw)
  To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Monday, October 12, 2020 4:22 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v3 0/5] vhost add vectorized data path
> 
> Hi Marvin,
> 
> On 10/9/20 10:14 AM, Marvin Liu wrote:
> > Packed ring format is imported since virtio spec 1.1. All descriptors
> > are compacted into one single ring when packed ring format is on. It is
> > straight forward that ring operations can be accelerated by utilizing
> > SIMD instructions.
> >
> > This patch set will introduce vectorized data path in vhost library. If
> > vectorized option is on, operations like descs check, descs writeback,
> > address translation will be accelerated by SIMD instructions. On skylake
> > server, it can bring 6% performance gain in loopback case and around 4%
> > performance gain in PvP case.
> 
> IMHO, 4% gain on PVP is not a significant gain if we compare to the
> added complexity. Moreover, I guess this is 4% gain with testpmd-based
> PVP? If this is the case it may be even lower with OVS-DPDK PVP
> benchmark, I will try to do a benchmark this week.
> 

Maxime, 
I have observed around 3% gain with OVS-DPDK in first version. But the number is not reliable as datapath has been changed. 
I will try again after fixed OVS integration issue with latest dpdk. 

> Thanks,
> Maxime
> 
> > Vhost application can choose whether using vectorized acceleration, just
> > like external buffer feature. If platform or ring format not support
> > vectorized function, vhost will fallback to use default batch function.
> > There will be no impact in current data path.
> >
> > v3:
> > * rename vectorized datapath file
> > * eliminate the impact when avx512 disabled
> > * dynamically allocate memory regions structure
> > * remove unlikely hint for in_order
> >
> > v2:
> > * add vIOMMU support
> > * add dequeue offloading
> > * rebase code
> >
> > Marvin Liu (5):
> >   vhost: add vectorized data path
> >   vhost: reuse packed ring functions
> >   vhost: prepare memory regions addresses
> >   vhost: add packed ring vectorized dequeue
> >   vhost: add packed ring vectorized enqueue
> >
> >  doc/guides/nics/vhost.rst           |   5 +
> >  doc/guides/prog_guide/vhost_lib.rst |  12 +
> >  drivers/net/vhost/rte_eth_vhost.c   |  17 +-
> >  lib/librte_vhost/meson.build        |  16 ++
> >  lib/librte_vhost/rte_vhost.h        |   1 +
> >  lib/librte_vhost/socket.c           |   5 +
> >  lib/librte_vhost/vhost.c            |  11 +
> >  lib/librte_vhost/vhost.h            | 239 +++++++++++++++++++
> >  lib/librte_vhost/vhost_user.c       |  26 +++
> >  lib/librte_vhost/virtio_net.c       | 258 ++++-----------------
> >  lib/librte_vhost/virtio_net_avx.c   | 344 ++++++++++++++++++++++++++++
> >  11 files changed, 718 insertions(+), 216 deletions(-)
> >  create mode 100644 lib/librte_vhost/virtio_net_avx.c
> >


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path
  2020-10-12  9:10       ` Liu, Yong
@ 2020-10-12  9:57         ` Maxime Coquelin
  2020-10-12 13:24           ` Liu, Yong
  0 siblings, 1 reply; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-12  9:57 UTC (permalink / raw)
  To: Liu, Yong, Xia, Chenbo, Wang, Zhihong; +Cc: dev

Hi Marvin,

On 10/12/20 11:10 AM, Liu, Yong wrote:
> 
> 
>> -----Original Message-----
>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Sent: Monday, October 12, 2020 4:22 PM
>> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
>> Wang, Zhihong <zhihong.wang@intel.com>
>> Cc: dev@dpdk.org
>> Subject: Re: [PATCH v3 0/5] vhost add vectorized data path
>>
>> Hi Marvin,
>>
>> On 10/9/20 10:14 AM, Marvin Liu wrote:
>>> Packed ring format is imported since virtio spec 1.1. All descriptors
>>> are compacted into one single ring when packed ring format is on. It is
>>> straight forward that ring operations can be accelerated by utilizing
>>> SIMD instructions.
>>>
>>> This patch set will introduce vectorized data path in vhost library. If
>>> vectorized option is on, operations like descs check, descs writeback,
>>> address translation will be accelerated by SIMD instructions. On skylake
>>> server, it can bring 6% performance gain in loopback case and around 4%
>>> performance gain in PvP case.
>>
>> IMHO, 4% gain on PVP is not a significant gain if we compare to the
>> added complexity. Moreover, I guess this is 4% gain with testpmd-based
>> PVP? If this is the case it may be even lower with OVS-DPDK PVP
>> benchmark, I will try to do a benchmark this week.
>>
> 
> Maxime, 
> I have observed around 3% gain with OVS-DPDK in first version. But the number is not reliable as datapath has been changed. 
> I will try again after fixed OVS integration issue with latest dpdk. 

Thanks for the information.

Also, wouldn't using AVX512 lower the CPU frequency?
If so, could it have an impact on the workload running on the other
CPUs?

Thanks,
Maxime

>> Thanks,
>> Maxime
>>
>>> Vhost application can choose whether using vectorized acceleration, just
>>> like external buffer feature. If platform or ring format not support
>>> vectorized function, vhost will fallback to use default batch function.
>>> There will be no impact in current data path.
>>>
>>> v3:
>>> * rename vectorized datapath file
>>> * eliminate the impact when avx512 disabled
>>> * dynamically allocate memory regions structure
>>> * remove unlikely hint for in_order
>>>
>>> v2:
>>> * add vIOMMU support
>>> * add dequeue offloading
>>> * rebase code
>>>
>>> Marvin Liu (5):
>>>   vhost: add vectorized data path
>>>   vhost: reuse packed ring functions
>>>   vhost: prepare memory regions addresses
>>>   vhost: add packed ring vectorized dequeue
>>>   vhost: add packed ring vectorized enqueue
>>>
>>>  doc/guides/nics/vhost.rst           |   5 +
>>>  doc/guides/prog_guide/vhost_lib.rst |  12 +
>>>  drivers/net/vhost/rte_eth_vhost.c   |  17 +-
>>>  lib/librte_vhost/meson.build        |  16 ++
>>>  lib/librte_vhost/rte_vhost.h        |   1 +
>>>  lib/librte_vhost/socket.c           |   5 +
>>>  lib/librte_vhost/vhost.c            |  11 +
>>>  lib/librte_vhost/vhost.h            | 239 +++++++++++++++++++
>>>  lib/librte_vhost/vhost_user.c       |  26 +++
>>>  lib/librte_vhost/virtio_net.c       | 258 ++++-----------------
>>>  lib/librte_vhost/virtio_net_avx.c   | 344 ++++++++++++++++++++++++++++
>>>  11 files changed, 718 insertions(+), 216 deletions(-)
>>>  create mode 100644 lib/librte_vhost/virtio_net_avx.c
>>>
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path
  2020-10-12  9:57         ` Maxime Coquelin
@ 2020-10-12 13:24           ` Liu, Yong
  0 siblings, 0 replies; 36+ messages in thread
From: Liu, Yong @ 2020-10-12 13:24 UTC (permalink / raw)
  To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Monday, October 12, 2020 5:57 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v3 0/5] vhost add vectorized data path
> 
> Hi Marvin,
> 
> On 10/12/20 11:10 AM, Liu, Yong wrote:
> >
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> >> Sent: Monday, October 12, 2020 4:22 PM
> >> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo
> <chenbo.xia@intel.com>;
> >> Wang, Zhihong <zhihong.wang@intel.com>
> >> Cc: dev@dpdk.org
> >> Subject: Re: [PATCH v3 0/5] vhost add vectorized data path
> >>
> >> Hi Marvin,
> >>
> >> On 10/9/20 10:14 AM, Marvin Liu wrote:
> >>> Packed ring format is imported since virtio spec 1.1. All descriptors
> >>> are compacted into one single ring when packed ring format is on. It is
> >>> straight forward that ring operations can be accelerated by utilizing
> >>> SIMD instructions.
> >>>
> >>> This patch set will introduce vectorized data path in vhost library. If
> >>> vectorized option is on, operations like descs check, descs writeback,
> >>> address translation will be accelerated by SIMD instructions. On skylake
> >>> server, it can bring 6% performance gain in loopback case and around 4%
> >>> performance gain in PvP case.
> >>
> >> IMHO, 4% gain on PVP is not a significant gain if we compare to the
> >> added complexity. Moreover, I guess this is 4% gain with testpmd-based
> >> PVP? If this is the case it may be even lower with OVS-DPDK PVP
> >> benchmark, I will try to do a benchmark this week.
> >>
> >
> > Maxime,
> > I have observed around 3% gain with OVS-DPDK in first version. But the
> number is not reliable as datapath has been changed.
> > I will try again after fixed OVS integration issue with latest dpdk.
> 
> Thanks for the information.
> 
> Also, wouldn't using AVX512 lower the CPU frequency?
> If so, could it have an impact on the workload running on the other
> CPUs?
> 

All AVX512 instructions used in vhost are lightweight ones, frequency won't be affected. 
Theoretically system performance won’t be affected if only lightweight instructions are used. 

Thanks.

> Thanks,
> Maxime
> 
> >> Thanks,
> >> Maxime
> >>
> >>> Vhost application can choose whether using vectorized acceleration,
> just
> >>> like external buffer feature. If platform or ring format not support
> >>> vectorized function, vhost will fallback to use default batch function.
> >>> There will be no impact in current data path.
> >>>
> >>> v3:
> >>> * rename vectorized datapath file
> >>> * eliminate the impact when avx512 disabled
> >>> * dynamically allocate memory regions structure
> >>> * remove unlikely hint for in_order
> >>>
> >>> v2:
> >>> * add vIOMMU support
> >>> * add dequeue offloading
> >>> * rebase code
> >>>
> >>> Marvin Liu (5):
> >>>   vhost: add vectorized data path
> >>>   vhost: reuse packed ring functions
> >>>   vhost: prepare memory regions addresses
> >>>   vhost: add packed ring vectorized dequeue
> >>>   vhost: add packed ring vectorized enqueue
> >>>
> >>>  doc/guides/nics/vhost.rst           |   5 +
> >>>  doc/guides/prog_guide/vhost_lib.rst |  12 +
> >>>  drivers/net/vhost/rte_eth_vhost.c   |  17 +-
> >>>  lib/librte_vhost/meson.build        |  16 ++
> >>>  lib/librte_vhost/rte_vhost.h        |   1 +
> >>>  lib/librte_vhost/socket.c           |   5 +
> >>>  lib/librte_vhost/vhost.c            |  11 +
> >>>  lib/librte_vhost/vhost.h            | 239 +++++++++++++++++++
> >>>  lib/librte_vhost/vhost_user.c       |  26 +++
> >>>  lib/librte_vhost/virtio_net.c       | 258 ++++-----------------
> >>>  lib/librte_vhost/virtio_net_avx.c   | 344
> ++++++++++++++++++++++++++++
> >>>  11 files changed, 718 insertions(+), 216 deletions(-)
> >>>  create mode 100644 lib/librte_vhost/virtio_net_avx.c
> >>>
> >


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path
  2020-10-12  8:21     ` [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path Maxime Coquelin
  2020-10-12  9:10       ` Liu, Yong
@ 2020-10-15 15:28       ` Liu, Yong
  2020-10-15 15:35         ` Maxime Coquelin
  1 sibling, 1 reply; 36+ messages in thread
From: Liu, Yong @ 2020-10-15 15:28 UTC (permalink / raw)
  To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev

Hi All,
Performance gain from vectorized datapath in OVS-DPDK is around 1%, meanwhile it have a small impact of original datapath. 
On the other hand, it will increase the complexity of vhost (new parameter introduced, prepare memory information for address translation). 
After weighed the procs and co, I’d like to drawback this patch set.  Thanks for your time.

Regards,
Marvin

> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Monday, October 12, 2020 4:22 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v3 0/5] vhost add vectorized data path
> 
> Hi Marvin,
> 
> On 10/9/20 10:14 AM, Marvin Liu wrote:
> > Packed ring format is imported since virtio spec 1.1. All descriptors
> > are compacted into one single ring when packed ring format is on. It is
> > straight forward that ring operations can be accelerated by utilizing
> > SIMD instructions.
> >
> > This patch set will introduce vectorized data path in vhost library. If
> > vectorized option is on, operations like descs check, descs writeback,
> > address translation will be accelerated by SIMD instructions. On skylake
> > server, it can bring 6% performance gain in loopback case and around 4%
> > performance gain in PvP case.
> 
> IMHO, 4% gain on PVP is not a significant gain if we compare to the
> added complexity. Moreover, I guess this is 4% gain with testpmd-based
> PVP? If this is the case it may be even lower with OVS-DPDK PVP
> benchmark, I will try to do a benchmark this week.
> 
> Thanks,
> Maxime
> 
> > Vhost application can choose whether using vectorized acceleration, just
> > like external buffer feature. If platform or ring format not support
> > vectorized function, vhost will fallback to use default batch function.
> > There will be no impact in current data path.
> >
> > v3:
> > * rename vectorized datapath file
> > * eliminate the impact when avx512 disabled
> > * dynamically allocate memory regions structure
> > * remove unlikely hint for in_order
> >
> > v2:
> > * add vIOMMU support
> > * add dequeue offloading
> > * rebase code
> >
> > Marvin Liu (5):
> >   vhost: add vectorized data path
> >   vhost: reuse packed ring functions
> >   vhost: prepare memory regions addresses
> >   vhost: add packed ring vectorized dequeue
> >   vhost: add packed ring vectorized enqueue
> >
> >  doc/guides/nics/vhost.rst           |   5 +
> >  doc/guides/prog_guide/vhost_lib.rst |  12 +
> >  drivers/net/vhost/rte_eth_vhost.c   |  17 +-
> >  lib/librte_vhost/meson.build        |  16 ++
> >  lib/librte_vhost/rte_vhost.h        |   1 +
> >  lib/librte_vhost/socket.c           |   5 +
> >  lib/librte_vhost/vhost.c            |  11 +
> >  lib/librte_vhost/vhost.h            | 239 +++++++++++++++++++
> >  lib/librte_vhost/vhost_user.c       |  26 +++
> >  lib/librte_vhost/virtio_net.c       | 258 ++++-----------------
> >  lib/librte_vhost/virtio_net_avx.c   | 344 ++++++++++++++++++++++++++++
> >  11 files changed, 718 insertions(+), 216 deletions(-)
> >  create mode 100644 lib/librte_vhost/virtio_net_avx.c
> >


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path
  2020-10-15 15:28       ` Liu, Yong
@ 2020-10-15 15:35         ` Maxime Coquelin
  0 siblings, 0 replies; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-15 15:35 UTC (permalink / raw)
  To: Liu, Yong, Xia, Chenbo, Wang, Zhihong; +Cc: dev

Hi Marvin,

On 10/15/20 5:28 PM, Liu, Yong wrote:
> Hi All,
> Performance gain from vectorized datapath in OVS-DPDK is around 1%, meanwhile it have a small impact of original datapath. 
> On the other hand, it will increase the complexity of vhost (new parameter introduced, prepare memory information for address translation). 
> After weighed the procs and co, I’d like to drawback this patch set.  Thanks for your time.

Thanks for running the test with the new version.
I have removed it from Patchwork.

Thanks,
Maxime

> Regards,
> Marvin
> 
>> -----Original Message-----
>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Sent: Monday, October 12, 2020 4:22 PM
>> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
>> Wang, Zhihong <zhihong.wang@intel.com>
>> Cc: dev@dpdk.org
>> Subject: Re: [PATCH v3 0/5] vhost add vectorized data path
>>
>> Hi Marvin,
>>
>> On 10/9/20 10:14 AM, Marvin Liu wrote:
>>> Packed ring format is imported since virtio spec 1.1. All descriptors
>>> are compacted into one single ring when packed ring format is on. It is
>>> straight forward that ring operations can be accelerated by utilizing
>>> SIMD instructions.
>>>
>>> This patch set will introduce vectorized data path in vhost library. If
>>> vectorized option is on, operations like descs check, descs writeback,
>>> address translation will be accelerated by SIMD instructions. On skylake
>>> server, it can bring 6% performance gain in loopback case and around 4%
>>> performance gain in PvP case.
>>
>> IMHO, 4% gain on PVP is not a significant gain if we compare to the
>> added complexity. Moreover, I guess this is 4% gain with testpmd-based
>> PVP? If this is the case it may be even lower with OVS-DPDK PVP
>> benchmark, I will try to do a benchmark this week.
>>
>> Thanks,
>> Maxime
>>
>>> Vhost application can choose whether using vectorized acceleration, just
>>> like external buffer feature. If platform or ring format not support
>>> vectorized function, vhost will fallback to use default batch function.
>>> There will be no impact in current data path.
>>>
>>> v3:
>>> * rename vectorized datapath file
>>> * eliminate the impact when avx512 disabled
>>> * dynamically allocate memory regions structure
>>> * remove unlikely hint for in_order
>>>
>>> v2:
>>> * add vIOMMU support
>>> * add dequeue offloading
>>> * rebase code
>>>
>>> Marvin Liu (5):
>>>   vhost: add vectorized data path
>>>   vhost: reuse packed ring functions
>>>   vhost: prepare memory regions addresses
>>>   vhost: add packed ring vectorized dequeue
>>>   vhost: add packed ring vectorized enqueue
>>>
>>>  doc/guides/nics/vhost.rst           |   5 +
>>>  doc/guides/prog_guide/vhost_lib.rst |  12 +
>>>  drivers/net/vhost/rte_eth_vhost.c   |  17 +-
>>>  lib/librte_vhost/meson.build        |  16 ++
>>>  lib/librte_vhost/rte_vhost.h        |   1 +
>>>  lib/librte_vhost/socket.c           |   5 +
>>>  lib/librte_vhost/vhost.c            |  11 +
>>>  lib/librte_vhost/vhost.h            | 239 +++++++++++++++++++
>>>  lib/librte_vhost/vhost_user.c       |  26 +++
>>>  lib/librte_vhost/virtio_net.c       | 258 ++++-----------------
>>>  lib/librte_vhost/virtio_net_avx.c   | 344 ++++++++++++++++++++++++++++
>>>  11 files changed, 718 insertions(+), 216 deletions(-)
>>>  create mode 100644 lib/librte_vhost/virtio_net_avx.c
>>>
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, back to index

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-19  3:24 [dpdk-dev] [PATCH v1 0/5] vhost add vectorized data path Marvin Liu
2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 1/5] vhost: " Marvin Liu
2020-09-21  6:48   ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 1/5] vhost: " Marvin Liu
2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 2/5] vhost: reuse packed ring functions Marvin Liu
2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 3/5] vhost: prepare memory regions addresses Marvin Liu
2020-10-06 15:06       ` Maxime Coquelin
2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
2020-10-06 14:59       ` Maxime Coquelin
2020-10-08  7:05         ` Liu, Yong
2020-10-06 15:18       ` Maxime Coquelin
2020-10-09  7:59         ` Liu, Yong
2020-09-21  6:48     ` [dpdk-dev] [PATCH v2 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
2020-10-06 15:00       ` Maxime Coquelin
2020-10-08  7:09         ` Liu, Yong
2020-10-06 13:34     ` [dpdk-dev] [PATCH v2 0/5] vhost add vectorized data path Maxime Coquelin
2020-10-08  6:20       ` Liu, Yong
2020-10-09  8:14   ` [dpdk-dev] [PATCH v3 " Marvin Liu
2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 1/5] vhost: " Marvin Liu
2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 2/5] vhost: reuse packed ring functions Marvin Liu
2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 3/5] vhost: prepare memory regions addresses Marvin Liu
2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
2020-10-09  8:14     ` [dpdk-dev] [PATCH v3 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
2020-10-12  8:21     ` [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path Maxime Coquelin
2020-10-12  9:10       ` Liu, Yong
2020-10-12  9:57         ` Maxime Coquelin
2020-10-12 13:24           ` Liu, Yong
2020-10-15 15:28       ` Liu, Yong
2020-10-15 15:35         ` Maxime Coquelin
2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 2/5] vhost: reuse packed ring functions Marvin Liu
2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 3/5] vhost: prepare memory regions addresses Marvin Liu
2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
2020-09-18 13:44   ` Maxime Coquelin
2020-09-21  6:26     ` Liu, Yong
2020-09-21  7:47       ` Liu, Yong
2020-08-19  3:24 ` [dpdk-dev] [PATCH v1 5/5] vhost: add packed ring vectorized enqueue Marvin Liu

DPDK patches and discussions

Archives are clonable:
	git clone --mirror https://inbox.dpdk.org/dev/0 dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 dev dev/ https://inbox.dpdk.org/dev \
		dev@dpdk.org
	public-inbox-index dev


Newsgroup available over NNTP:
	nntp://inbox.dpdk.org/inbox.dpdk.dev


AGPL code for this site: git clone https://public-inbox.org/ public-inbox