DPDK patches and discussions
* [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath
@ 2020-03-13 17:42 Marvin Liu
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 1/7] net/virtio: add Rx free threshold setting Marvin Liu
                   ` (17 more replies)
  0 siblings, 18 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

This patch set introduces a vectorized datapath for the packed ring.

A packed ring descriptor is 16 bytes, so a batch of four descriptors
fits exactly into one cacheline, which AVX512 instructions can handle
well. The packed ring Tx datapath can be fully transformed into a
vectorized datapath. The Rx datapath can also be vectorized when
features are limited (TSO and mergeable buffers disabled).

Marvin Liu (7):
  net/virtio: add Rx free threshold setting
  net/virtio-user: add LRO parameter
  net/virtio: add vectorized packed ring Rx function
  net/virtio: reuse packed ring xmit functions
  net/virtio: add vectorized packed ring Tx function
  net/virtio: add election for vectorized datapath
  net/virtio: support meson build

 drivers/net/virtio/Makefile                   |  30 +
 drivers/net/virtio/meson.build                |   1 +
 drivers/net/virtio/virtio_ethdev.c            |  35 +-
 drivers/net/virtio/virtio_ethdev.h            |   6 +
 drivers/net/virtio/virtio_pci.h               |   2 +
 drivers/net/virtio/virtio_rxtx.c              | 201 ++----
 drivers/net/virtio/virtio_rxtx_packed_avx.c   | 606 ++++++++++++++++++
 .../net/virtio/virtio_user/virtio_user_dev.c  |   8 +-
 .../net/virtio/virtio_user/virtio_user_dev.h  |   2 +-
 drivers/net/virtio/virtio_user_ethdev.c       |  17 +-
 drivers/net/virtio/virtqueue.h                | 165 ++++-
 11 files changed, 903 insertions(+), 170 deletions(-)
 create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v1 1/7] net/virtio: add Rx free threshold setting
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
@ 2020-03-13 17:42 ` Marvin Liu
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 2/7] net/virtio-user: add LRO parameter Marvin Liu
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Introduce a free threshold setting in the Rx queue; its default value
is 32. The threshold is limited to a multiple of four, as only the
vectorized packed Rx function will utilize it. The virtio driver will
rearm the Rx queue when more than rx_free_thresh descriptors have been
dequeued.
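
For example, an application could request a specific threshold at
queue-setup time as follows (a sketch only; error handling is minimal
and the port, queue and mempool are assumed to be initialized
elsewhere):

  #include <rte_ethdev.h>

  static int
  setup_rx_queue(uint16_t port_id, uint16_t queue_id, uint16_t nb_rx_desc,
                 struct rte_mempool *mp)
  {
          struct rte_eth_dev_info dev_info;
          struct rte_eth_rxconf rx_conf;

          rte_eth_dev_info_get(port_id, &dev_info);
          rx_conf = dev_info.default_rxconf;
          /* multiple of four and smaller than the ring size */
          rx_conf.rx_free_thresh = 64;

          return rte_eth_rx_queue_setup(port_id, queue_id, nb_rx_desc,
                                        rte_eth_dev_socket_id(port_id),
                                        &rx_conf, mp);
  }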

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 752faa0f6..3a2dbc2e0 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 	struct virtio_hw *hw = dev->data->dev_private;
 	struct virtqueue *vq = hw->vqs[vtpci_queue_idx];
 	struct virtnet_rx *rxvq;
+	uint16_t rx_free_thresh;
 
 	PMD_INIT_FUNC_TRACE();
 
@@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 		return -EINVAL;
 	}
 
+	rx_free_thresh = rx_conf->rx_free_thresh;
+	if (rx_free_thresh == 0)
+		rx_free_thresh =
+			RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH);
+
+	if (rx_free_thresh & 0x3) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+
+	if (rx_free_thresh >= vq->vq_nentries) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the "
+			"number of RX entries (%u)."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			vq->vq_nentries,
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+	vq->vq_free_thresh = rx_free_thresh;
+
 	if (nb_desc == 0 || nb_desc > vq->vq_nentries)
 		nb_desc = vq->vq_nentries;
 	vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc);
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 58ad7309a..bce1db030 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,7 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_RX_FREE_THRESH 32
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v1 2/7] net/virtio-user: add LRO parameter
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 1/7] net/virtio: add Rx free threshold setting Marvin Liu
@ 2020-03-13 17:42 ` Marvin Liu
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 3/7] net/virtio: add vectorized packed ring Rx function Marvin Liu
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

The packed ring vectorized Rx function does not support GUEST_TSO4 and
GUEST_TSO6. Add an "lro" parameter to the virtio-user vdev arguments so
that these features can be disabled when selecting the vectorized path.
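
For example, LRO can be turned off on a virtio-user port like this
(the socket path and core list below are only placeholders):

  testpmd -l 0-1 --no-pci \
          --vdev=net_virtio_user0,path=/tmp/vhost-user.sock,queues=1,lro=0 \
          -- -i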

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_user/virtio_user_dev.c b/drivers/net/virtio/virtio_user/virtio_user_dev.c
index 1c6b26f8d..45d4bf14f 100644
--- a/drivers/net/virtio/virtio_user/virtio_user_dev.c
+++ b/drivers/net/virtio/virtio_user/virtio_user_dev.c
@@ -422,7 +422,8 @@ virtio_user_dev_setup(struct virtio_user_dev *dev)
 int
 virtio_user_dev_init(struct virtio_user_dev *dev, char *path, int queues,
 		     int cq, int queue_size, const char *mac, char **ifname,
-		     int server, int mrg_rxbuf, int in_order, int packed_vq)
+		     int server, int mrg_rxbuf, int in_order, int packed_vq,
+		     int lro)
 {
 	pthread_mutex_init(&dev->mutex, NULL);
 	strlcpy(dev->path, path, PATH_MAX);
@@ -478,6 +479,11 @@ virtio_user_dev_init(struct virtio_user_dev *dev, char *path, int queues,
 	if (!packed_vq)
 		dev->unsupported_features |= (1ull << VIRTIO_F_RING_PACKED);
 
+	if (!lro) {
+		dev->unsupported_features |= (1ull << VIRTIO_NET_F_GUEST_TSO4);
+		dev->unsupported_features |= (1ull << VIRTIO_NET_F_GUEST_TSO6);
+	}
+
 	if (dev->mac_specified)
 		dev->frontend_features |= (1ull << VIRTIO_NET_F_MAC);
 	else
diff --git a/drivers/net/virtio/virtio_user/virtio_user_dev.h b/drivers/net/virtio/virtio_user/virtio_user_dev.h
index 3b6b6065a..7133e4d26 100644
--- a/drivers/net/virtio/virtio_user/virtio_user_dev.h
+++ b/drivers/net/virtio/virtio_user/virtio_user_dev.h
@@ -62,7 +62,7 @@ int virtio_user_stop_device(struct virtio_user_dev *dev);
 int virtio_user_dev_init(struct virtio_user_dev *dev, char *path, int queues,
 			 int cq, int queue_size, const char *mac, char **ifname,
 			 int server, int mrg_rxbuf, int in_order,
-			 int packed_vq);
+			 int packed_vq, int lro);
 void virtio_user_dev_uninit(struct virtio_user_dev *dev);
 void virtio_user_handle_cq(struct virtio_user_dev *dev, uint16_t queue_idx);
 void virtio_user_handle_cq_packed(struct virtio_user_dev *dev,
diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
index e61af4068..ea07a8384 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -450,6 +450,8 @@ static const char *valid_args[] = {
 	VIRTIO_USER_ARG_IN_ORDER,
 #define VIRTIO_USER_ARG_PACKED_VQ      "packed_vq"
 	VIRTIO_USER_ARG_PACKED_VQ,
+#define VIRTIO_USER_ARG_LRO            "lro"
+	VIRTIO_USER_ARG_LRO,
 	NULL
 };
 
@@ -552,6 +554,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	uint64_t mrg_rxbuf = 1;
 	uint64_t in_order = 1;
 	uint64_t packed_vq = 0;
+	uint64_t lro = 1;
 	char *path = NULL;
 	char *ifname = NULL;
 	char *mac_addr = NULL;
@@ -668,6 +671,15 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		}
 	}
 
+	if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_LRO) == 1) {
+		if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_LRO,
+				       &get_integer_arg, &lro) < 0) {
+			PMD_INIT_LOG(ERR, "error to parse %s",
+				     VIRTIO_USER_ARG_LRO);
+			goto end;
+		}
+	}
+
 	if (queues > 1 && cq == 0) {
 		PMD_INIT_LOG(ERR, "multi-q requires ctrl-q");
 		goto end;
@@ -707,7 +719,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	hw = eth_dev->data->dev_private;
 	if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq,
 			 queue_size, mac_addr, &ifname, server_mode,
-			 mrg_rxbuf, in_order, packed_vq) < 0) {
+			 mrg_rxbuf, in_order, packed_vq, lro) < 0) {
 		PMD_INIT_LOG(ERR, "virtio_user_dev_init fails");
 		virtio_user_eth_dev_free(eth_dev);
 		goto end;
@@ -777,4 +789,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user,
 	"server=<0|1> "
 	"mrg_rxbuf=<0|1> "
 	"in_order=<0|1> "
-	"packed_vq=<0|1>");
+	"packed_vq=<0|1> "
+	"lro=<0|1>");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v1 3/7] net/virtio: add vectorized packed ring Rx function
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 1/7] net/virtio: add Rx free threshold setting Marvin Liu
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 2/7] net/virtio-user: add LRO parameter Marvin Liu
@ 2020-03-13 17:42 ` Marvin Liu
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 4/7] net/virtio: reuse packed ring xmit functions Marvin Liu
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Optimize the packed ring Rx datapath when mergeable buffers and LRO are
not required. The optimization is similar to the vhost one: split the
datapath into batch and single functions. The batch function only
dequeues batches of descriptors that start at a cacheline boundary.
Also pad the descriptor extra structure to 16-byte alignment.
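
Conceptually, the AVX512 batch check in the new file is equivalent to
the scalar logic below (a simplified sketch, not part of the patch;
desc_is_used() and PACKED_BATCH_SIZE are the helpers used by the diff):

  #include "virtqueue.h"

  /* A batch is taken only when the used index is batch-aligned and all
   * four descriptors in that cacheline have been used by the device.
   */
  static inline int
  packed_batch_is_ready(struct virtqueue *vq)
  {
          uint16_t id = vq->vq_used_cons_idx;
          uint16_t i;

          if (id % PACKED_BATCH_SIZE)
                  return 0;

          for (i = 0; i < PACKED_BATCH_SIZE; i++)
                  if (!desc_is_used(&vq->vq_packed.ring.desc[id + i], vq))
                          return 0;

          return 1;
  }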

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index efdcb0d93..0458e8bf2 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -37,6 +37,36 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
 endif
 
+ifeq ($(RTE_TOOLCHAIN), gcc)
+ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
+CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), clang)
+ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1)
+CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), icc)
+ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
+CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
+endif
+endif
+
+CC_AVX512_SUPPORT=$(shell $(CC) -dM -E -mavx512f -dM -E - </dev/null 2>&1 | \
+		  grep -q AVX512F && echo 1)
+
+ifeq ($(CC_AVX512_SUPPORT), 1)
+CFLAGS_virtio_ethdev.o += -DCC_AVX512_SUPPORT
+CFLAGS_virtio_rxtx.o += -DCC_AVX512_SUPPORT
+ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
+CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
+endif
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
+endif
+
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index cd8947656..10e39670e 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -104,6 +104,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts,
 uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 3a2dbc2e0..ac417232b 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -1245,7 +1245,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
 	return 0;
 }
 
-#define VIRTIO_MBUF_BURST_SZ 64
 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc))
 uint16_t
 virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
@@ -2328,3 +2327,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
 
 	return nb_tx;
 }
+
+__rte_weak uint16_t
+virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue,
+			    struct rte_mbuf __rte_unused **rx_pkts,
+			    uint16_t __rte_unused nb_pkts)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
new file mode 100644
index 000000000..d8cda9d71
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -0,0 +1,380 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <rte_net.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtio_pci.h"
+#include "virtqueue.h"
+
+#define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63)
+
+#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
+	sizeof(struct vring_packed_desc))
+#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
+
+#ifdef VIRTIO_GCC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_ICC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifndef virtio_for_each_try_unroll
+#define virtio_for_each_try_unroll(iter, val, num) \
+	for (iter = val; iter < num; iter++)
+#endif
+
+static inline void
+virtio_update_batch_stats(struct virtnet_stats *stats,
+			  uint16_t pkt_len1,
+			  uint16_t pkt_len2,
+			  uint16_t pkt_len3,
+			  uint16_t pkt_len4)
+{
+	stats->bytes += pkt_len1;
+	stats->bytes += pkt_len2;
+	stats->bytes += pkt_len3;
+	stats->bytes += pkt_len4;
+}
+
+/* Optionally fill offload information in structure */
+static inline int
+virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
+{
+	struct rte_net_hdr_lens hdr_lens;
+	uint32_t hdrlen, ptype;
+	int l4_supported = 0;
+
+	/* nothing to do */
+	if (hdr->flags == 0)
+		return 0;
+
+	/* GSO is not supported in the vectorized path, skip the check */
+	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
+
+	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
+	m->packet_type = ptype;
+	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
+		l4_supported = 1;
+
+	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
+		if (hdr->csum_start <= hdrlen && l4_supported) {
+			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
+		} else {
+			/* Unknown proto or tunnel, do sw cksum. We can assume
+			 * the cksum field is in the first segment since the
+			 * buffers we provided to the host are large enough.
+			 * In case of SCTP, this will be wrong since it's a CRC
+			 * but there's nothing we can do.
+			 */
+			uint16_t csum = 0, off;
+
+			rte_raw_cksum_mbuf(m, hdr->csum_start,
+				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
+				&csum);
+			if (likely(csum != 0xffff))
+				csum = ~csum;
+			off = hdr->csum_offset + hdr->csum_start;
+			if (rte_pktmbuf_data_len(m) >= off + 1)
+				*rte_pktmbuf_mtod_offset(m, uint16_t *,
+					off) = csum;
+		}
+	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
+		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
+	}
+
+	return 0;
+}
+
+static int
+virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq,
+				   struct rte_mbuf **rx_pkts)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdrs[PACKED_BATCH_SIZE];
+	uint64_t addrs[PACKED_BATCH_SIZE << 1];
+	uint16_t id = vq->vq_used_cons_idx;
+	uint8_t desc_stats;
+	uint16_t i;
+	void *desc_addr;
+
+	if (id & PACKED_BATCH_MASK)
+		return -1;
+
+	/* only care avail/used bits */
+	__m512i desc_flags = _mm512_set_epi64(
+			PACKED_FLAGS_MASK, 0x0,
+			PACKED_FLAGS_MASK, 0x0,
+			PACKED_FLAGS_MASK, 0x0,
+			PACKED_FLAGS_MASK, 0x0);
+
+	desc_addr = &vq->vq_packed.ring.desc[id];
+	rte_smp_rmb();
+	__m512i packed_desc = _mm512_loadu_si512(desc_addr);
+	__m512i flags_mask  = _mm512_maskz_and_epi64(0xff, packed_desc,
+			desc_flags);
+
+	__m512i used_flags;
+	if (vq->vq_packed.used_wrap_counter) {
+		used_flags = _mm512_set_epi64(
+				PACKED_FLAGS_MASK, 0x0,
+				PACKED_FLAGS_MASK, 0x0,
+				PACKED_FLAGS_MASK, 0x0,
+				PACKED_FLAGS_MASK, 0x0);
+	} else {
+		used_flags = _mm512_set_epi64(
+				0x0, 0x0,
+				0x0, 0x0,
+				0x0, 0x0,
+				0x0, 0x0);
+	}
+
+	/* Check all descs are used */
+	desc_stats = _mm512_cmp_epu64_mask(flags_mask, used_flags,
+			_MM_CMPINT_EQ);
+	if (desc_stats != 0xff)
+		return -1;
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie;
+		rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *));
+
+		addrs[i << 1] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1;
+		addrs[(i << 1) + 1] =
+			(uint64_t)rx_pkts[i]->rx_descriptor_fields1 + 8;
+	}
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		char *addr = (char *)rx_pkts[i]->buf_addr +
+			RTE_PKTMBUF_HEADROOM - hdr_size;
+		hdrs[i] = (struct virtio_net_hdr *)addr;
+	}
+
+	/* addresses of pkt_len and data_len */
+	__m512i vindex = _mm512_set_epi64(
+			addrs[7], addrs[6],
+			addrs[5], addrs[4],
+			addrs[3], addrs[2],
+			addrs[1], addrs[0]);
+
+	/*
+	 * select 0x10   load 32bit from packed_desc[95:64]
+	 * mmask  0x0110 save 32bit into pkt_len and data_len
+	 */
+	__m512i value = _mm512_maskz_shuffle_epi32(0x6666, packed_desc, 0xAA);
+
+	__m512i mbuf_len_offset = _mm512_set_epi32(
+			0, (uint32_t)-hdr_size, (uint32_t)-hdr_size, 0,
+			0, (uint32_t)-hdr_size, (uint32_t)-hdr_size, 0,
+			0, (uint32_t)-hdr_size, (uint32_t)-hdr_size, 0,
+			0, (uint32_t)-hdr_size, (uint32_t)-hdr_size, 0);
+
+	value = _mm512_add_epi32(value, mbuf_len_offset);
+	/* batch store into mbufs */
+	_mm512_i64scatter_epi64(0, vindex, value, 1);
+
+	if (hw->has_rx_offload) {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
+			virtio_vec_rx_offload(rx_pkts[i], hdrs[i]);
+	}
+
+	virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len,
+			rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len,
+			rx_pkts[3]->pkt_len);
+
+	vq->vq_free_cnt += PACKED_BATCH_SIZE;
+
+	vq->vq_used_cons_idx += PACKED_BATCH_SIZE;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static int
+virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq,
+				    struct rte_mbuf **rx_pkts)
+{
+	uint16_t used_idx, id;
+	uint32_t len;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint32_t hdr_size = hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	struct vring_packed_desc *desc;
+	struct rte_mbuf *cookie;
+
+	desc = vq->vq_packed.ring.desc;
+	used_idx = vq->vq_used_cons_idx;
+	if (!desc_is_used(&desc[used_idx], vq))
+		return -1;
+
+	len = desc[used_idx].len;
+	id = desc[used_idx].id;
+	cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie;
+	if (unlikely(cookie == NULL)) {
+		PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u",
+				vq->vq_used_cons_idx);
+		return -1;
+	}
+	rte_prefetch0(cookie);
+	rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
+
+	cookie->data_off = RTE_PKTMBUF_HEADROOM;
+	cookie->ol_flags = 0;
+	cookie->pkt_len = (uint32_t)(len - hdr_size);
+	cookie->data_len = (uint32_t)(len - hdr_size);
+
+	hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr +
+					RTE_PKTMBUF_HEADROOM - hdr_size);
+	if (hw->has_rx_offload)
+		virtio_vec_rx_offload(cookie, hdr);
+
+	*rx_pkts = cookie;
+
+	rxvq->stats.bytes += cookie->pkt_len;
+
+	vq->vq_free_cnt++;
+	vq->vq_used_cons_idx++;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static inline void
+virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq,
+			      struct rte_mbuf **cookie,
+			      uint16_t num)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc;
+	uint16_t flags = vq->vq_packed.cached_flags;
+	struct virtio_hw *hw = vq->hw;
+	struct vq_desc_extra *dxp;
+	uint16_t idx, i;
+	uint16_t total_num = 0;
+	uint16_t head_idx = vq->vq_avail_idx;
+	uint16_t head_flag = vq->vq_packed.cached_flags;
+	uint64_t addr;
+
+	do {
+		idx = vq->vq_avail_idx;
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			dxp = &vq->vq_descx[idx + i];
+			dxp->cookie = (void *)cookie[total_num + i];
+
+			addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) +
+				RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size;
+			start_dp[idx + i].addr = addr;
+			start_dp[idx + i].len = cookie[total_num + i]->buf_len
+				- RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
+			if (total_num || i) {
+				virtqueue_store_flags_packed(&start_dp[idx + i],
+						flags, hw->weak_barriers);
+			}
+		}
+
+		vq->vq_avail_idx += PACKED_BATCH_SIZE;
+		if (vq->vq_avail_idx >= vq->vq_nentries) {
+			vq->vq_avail_idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+			flags = vq->vq_packed.cached_flags;
+		}
+		total_num += PACKED_BATCH_SIZE;
+	} while (total_num < num);
+
+	virtqueue_store_flags_packed(&start_dp[head_idx], head_flag,
+				hw->weak_barriers);
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num);
+}
+
+uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue,
+			    struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct virtnet_rx *rxvq = rx_queue;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t num, nb_rx = 0;
+	uint32_t nb_enqueued = 0;
+	uint16_t free_cnt = vq->vq_free_thresh;
+
+	if (unlikely(hw->started == 0))
+		return nb_rx;
+
+	num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts);
+	if (likely(num > PACKED_BATCH_SIZE))
+		num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE);
+
+	while (num) {
+		if (!virtqueue_dequeue_batch_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx += PACKED_BATCH_SIZE;
+			num -= PACKED_BATCH_SIZE;
+			continue;
+		}
+		if (!virtqueue_dequeue_single_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx++;
+			num--;
+			continue;
+		}
+		break;
+	};
+
+	PMD_RX_LOG(DEBUG, "dequeue:%d", num);
+
+	rxvq->stats.packets += nb_rx;
+
+	if (likely(vq->vq_free_cnt >= free_cnt)) {
+		struct rte_mbuf *new_pkts[free_cnt];
+		if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts,
+						free_cnt) == 0)) {
+			virtio_recv_refill_packed_vec(rxvq, new_pkts,
+					free_cnt);
+			nb_enqueued += free_cnt;
+		} else {
+			struct rte_eth_dev *dev =
+				&rte_eth_devices[rxvq->port_id];
+			dev->data->rx_mbuf_alloc_failed += free_cnt;
+		}
+	}
+
+	if (likely(nb_enqueued)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_RX_LOG(DEBUG, "Notified");
+		}
+	}
+
+	return nb_rx;
+}
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index bce1db030..43e305ecc 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -19,6 +19,8 @@
 struct rte_mbuf;
 
 #define DEFAULT_RX_FREE_THRESH 32
+
+#define VIRTIO_MBUF_BURST_SZ 64
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
@@ -235,7 +237,8 @@ struct vq_desc_extra {
 	void *cookie;
 	uint16_t ndescs;
 	uint16_t next;
-};
+	uint8_t padding[4];
+} __rte_packed __rte_aligned(16);
 
 struct virtqueue {
 	struct virtio_hw  *hw; /**< virtio_hw structure pointer. */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v1 4/7] net/virtio: reuse packed ring xmit functions
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
                   ` (2 preceding siblings ...)
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 3/7] net/virtio: add vectorized packed ring Rx function Marvin Liu
@ 2020-03-13 17:42 ` Marvin Liu
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 5/7] net/virtio: add vectorized packed ring Tx function Marvin Liu
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Move the xmit offload and packed ring xmit enqueue functions to the
header file. These functions will be reused by the packed ring
vectorized Tx function.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index ac417232b..b8b4d3c25 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq,
 	return i;
 }
 
-#ifndef DEFAULT_TX_FREE_THRESH
-#define DEFAULT_TX_FREE_THRESH 32
-#endif
-
 static void
 virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num)
 {
@@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m)
 }
 
 
-/* avoid write operation when necessary, to lessen cache issues */
-#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
-	if ((var) != (val))			\
-		(var) = (val);			\
-} while (0)
-
-#define virtqueue_clear_net_hdr(_hdr) do {		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0);		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0);	\
-} while (0)
-
-static inline void
-virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
-			struct rte_mbuf *cookie,
-			bool offload)
-{
-	if (offload) {
-		if (cookie->ol_flags & PKT_TX_TCP_SEG)
-			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
-
-		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
-		case PKT_TX_UDP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_udp_hdr,
-				dgram_cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		case PKT_TX_TCP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		default:
-			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
-			break;
-		}
 
-		/* TCP Segmentation Offload */
-		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
-			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
-				VIRTIO_NET_HDR_GSO_TCPV6 :
-				VIRTIO_NET_HDR_GSO_TCPV4;
-			hdr->gso_size = cookie->tso_segsz;
-			hdr->hdr_len =
-				cookie->l2_len +
-				cookie->l3_len +
-				cookie->l4_len;
-		} else {
-			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
-		}
-	}
-}
 
 static inline void
 virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq,
@@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq,
 	virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers);
 }
 
-static inline void
-virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
-			      uint16_t needed, int can_push, int in_order)
-{
-	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
-	struct vq_desc_extra *dxp;
-	struct virtqueue *vq = txvq->vq;
-	struct vring_packed_desc *start_dp, *head_dp;
-	uint16_t idx, id, head_idx, head_flags;
-	int16_t head_size = vq->hw->vtnet_hdr_size;
-	struct virtio_net_hdr *hdr;
-	uint16_t prev;
-	bool prepend_header = false;
-
-	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
-
-	dxp = &vq->vq_descx[id];
-	dxp->ndescs = needed;
-	dxp->cookie = cookie;
-
-	head_idx = vq->vq_avail_idx;
-	idx = head_idx;
-	prev = head_idx;
-	start_dp = vq->vq_packed.ring.desc;
-
-	head_dp = &vq->vq_packed.ring.desc[idx];
-	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-	head_flags |= vq->vq_packed.cached_flags;
-
-	if (can_push) {
-		/* prepend cannot fail, checked by caller */
-		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
-					      -head_size);
-		prepend_header = true;
-
-		/* if offload disabled, it is not zeroed below, do it now */
-		if (!vq->hw->has_tx_offload)
-			virtqueue_clear_net_hdr(hdr);
-	} else {
-		/* setup first tx ring slot to point to header
-		 * stored in reserved region.
-		 */
-		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
-			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
-		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
-		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	}
-
-	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
-
-	do {
-		uint16_t flags;
-
-		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
-		start_dp[idx].len  = cookie->data_len;
-		if (prepend_header) {
-			start_dp[idx].addr -= head_size;
-			start_dp[idx].len += head_size;
-			prepend_header = false;
-		}
-
-		if (likely(idx != head_idx)) {
-			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-			flags |= vq->vq_packed.cached_flags;
-			start_dp[idx].flags = flags;
-		}
-		prev = idx;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	} while ((cookie = cookie->next) != NULL);
-
-	start_dp[prev].id = id;
-
-	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
-	vq->vq_avail_idx = idx;
-
-	if (!in_order) {
-		vq->vq_desc_head_idx = dxp->next;
-		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
-			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
-	}
-
-	virtqueue_store_flags_packed(head_dp, head_flags,
-				     vq->hw->weak_barriers);
-}
-
 static inline void
 virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 			uint16_t needed, int use_indirect, int can_push,
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 43e305ecc..31c48710c 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,7 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_TX_FREE_THRESH 32
 #define DEFAULT_RX_FREE_THRESH 32
 
 #define VIRTIO_MBUF_BURST_SZ 64
@@ -562,4 +563,162 @@ virtqueue_notify(struct virtqueue *vq)
 #define VIRTQUEUE_DUMP(vq) do { } while (0)
 #endif
 
+/* avoid write operation when necessary, to lessen cache issues */
+#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
+	if ((var) != (val))			\
+		(var) = (val);			\
+} while (0)
+
+#define virtqueue_clear_net_hdr(_hdr) do {		\
+	ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0);		\
+	ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0);	\
+} while (0)
+
+static inline void
+virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
+			struct rte_mbuf *cookie,
+			bool offload)
+{
+	if (offload) {
+		if (cookie->ol_flags & PKT_TX_TCP_SEG)
+			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
+
+		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
+		case PKT_TX_UDP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_udp_hdr,
+				dgram_cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		case PKT_TX_TCP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		default:
+			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
+			break;
+		}
+
+		/* TCP Segmentation Offload */
+		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
+			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
+				VIRTIO_NET_HDR_GSO_TCPV6 :
+				VIRTIO_NET_HDR_GSO_TCPV4;
+			hdr->gso_size = cookie->tso_segsz;
+			hdr->hdr_len =
+				cookie->l2_len +
+				cookie->l3_len +
+				cookie->l4_len;
+		} else {
+			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
+		}
+	}
+}
+
+static inline void
+virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
+			      uint16_t needed, int can_push, int in_order)
+{
+	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
+	struct vq_desc_extra *dxp;
+	struct virtqueue *vq = txvq->vq;
+	struct vring_packed_desc *start_dp, *head_dp;
+	uint16_t idx, id, head_idx, head_flags;
+	int16_t head_size = vq->hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	uint16_t prev;
+	bool prepend_header = false;
+
+	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
+
+	dxp = &vq->vq_descx[id];
+	dxp->ndescs = needed;
+	dxp->cookie = cookie;
+
+	head_idx = vq->vq_avail_idx;
+	idx = head_idx;
+	prev = head_idx;
+	start_dp = vq->vq_packed.ring.desc;
+
+	head_dp = &vq->vq_packed.ring.desc[idx];
+	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+	head_flags |= vq->vq_packed.cached_flags;
+
+	if (can_push) {
+		/* prepend cannot fail, checked by caller */
+		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
+					      -head_size);
+		prepend_header = true;
+
+		/* if offload disabled, it is not zeroed below, do it now */
+		if (!vq->hw->has_tx_offload)
+			virtqueue_clear_net_hdr(hdr);
+	} else {
+		/* setup first tx ring slot to point to header
+		 * stored in reserved region.
+		 */
+		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
+			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
+		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
+		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	}
+
+	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
+
+	do {
+		uint16_t flags;
+
+		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
+		start_dp[idx].len  = cookie->data_len;
+		if (prepend_header) {
+			start_dp[idx].addr -= head_size;
+			start_dp[idx].len += head_size;
+			prepend_header = false;
+		}
+
+		if (likely(idx != head_idx)) {
+			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+			flags |= vq->vq_packed.cached_flags;
+			start_dp[idx].flags = flags;
+		}
+		prev = idx;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	} while ((cookie = cookie->next) != NULL);
+
+	start_dp[prev].id = id;
+
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
+	vq->vq_avail_idx = idx;
+
+	if (!in_order) {
+		vq->vq_desc_head_idx = dxp->next;
+		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
+			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
+	}
+
+	virtqueue_store_flags_packed(head_dp, head_flags,
+				     vq->hw->weak_barriers);
+}
 #endif /* _VIRTQUEUE_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v1 5/7] net/virtio: add vectorized packed ring Tx function
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
                   ` (3 preceding siblings ...)
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 4/7] net/virtio: reuse packed ring xmit functions Marvin Liu
@ 2020-03-13 17:42 ` Marvin Liu
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 6/7] net/virtio: add election for vectorized datapath Marvin Liu
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Optimize the packed ring Tx datapath in the same way as the Rx
datapath: split the Tx datapath into batch and single functions.
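
The batch Tx path below only applies to mbufs whose virtio-net header
can be prepended in place; the 256-bit rearm_data comparison in the
patch encodes roughly the following per-packet check (a scalar sketch,
not part of the patch):

  #include <rte_mbuf.h>

  static inline int
  mbuf_ok_for_batch_tx(const struct rte_mbuf *m)
  {
          /* untouched headroom, single segment, not shared */
          return m->data_off == RTE_PKTMBUF_HEADROOM &&
                 m->nb_segs == 1 &&
                 rte_mbuf_refcnt_read(m) == 1;
  }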

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 10e39670e..c9aaef0af 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -107,6 +107,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index b8b4d3c25..125df3a13 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -2174,3 +2174,11 @@ virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue,
 {
 	return 0;
 }
+
+__rte_weak uint16_t
+virtio_xmit_pkts_packed_vec(void __rte_unused *tx_queue,
+			    struct rte_mbuf __rte_unused **tx_pkts,
+			    uint16_t __rte_unused nb_pkts)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
index d8cda9d71..0872f2083 100644
--- a/drivers/net/virtio/virtio_rxtx_packed_avx.c
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -15,6 +15,11 @@
 #include "virtio_pci.h"
 #include "virtqueue.h"
 
+#define REF_CNT_OFFSET 16
+#define SEG_NUM_OFFSET 32
+#define BATCH_REARM_DATA (1ULL << SEG_NUM_OFFSET | \
+			  1ULL << REF_CNT_OFFSET | \
+			  RTE_PKTMBUF_HEADROOM)
 #define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63)
 
 #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
@@ -41,6 +46,48 @@
 	for (iter = val; iter < num; iter++)
 #endif
 
+static void
+virtio_xmit_cleanup_packed_vec(struct virtqueue *vq)
+{
+	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
+	struct vq_desc_extra *dxp;
+	uint16_t used_idx, id, curr_id, free_cnt = 0;
+	uint16_t size = vq->vq_nentries;
+	struct rte_mbuf *mbufs[size];
+	uint16_t nb_mbuf = 0, i;
+
+	used_idx = vq->vq_used_cons_idx;
+
+	if (desc_is_used(&desc[used_idx], vq))
+		id = desc[used_idx].id;
+	else
+		return;
+
+	do {
+		curr_id = used_idx;
+		dxp = &vq->vq_descx[used_idx];
+		used_idx += dxp->ndescs;
+		free_cnt += dxp->ndescs;
+
+		if (dxp->cookie != NULL) {
+			mbufs[nb_mbuf] = dxp->cookie;
+			dxp->cookie = NULL;
+			nb_mbuf++;
+		}
+
+		if (used_idx >= size) {
+			used_idx -= size;
+			vq->vq_packed.used_wrap_counter ^= 1;
+		}
+	} while (curr_id != id);
+
+	for (i = 0; i < nb_mbuf; i++)
+		rte_pktmbuf_free(mbufs[i]);
+
+	vq->vq_used_cons_idx = used_idx;
+	vq->vq_free_cnt += free_cnt;
+}
+
 static inline void
 virtio_update_batch_stats(struct virtnet_stats *stats,
 			  uint16_t pkt_len1,
@@ -54,6 +101,185 @@ virtio_update_batch_stats(struct virtnet_stats *stats,
 	stats->bytes += pkt_len4;
 }
 
+static inline int
+virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq,
+				   struct rte_mbuf **tx_pkts)
+{
+	struct virtqueue *vq = txvq->vq;
+	uint16_t head_size = vq->hw->vtnet_hdr_size;
+	struct vq_desc_extra *dxps[PACKED_BATCH_SIZE];
+	uint16_t idx = vq->vq_avail_idx;
+	uint64_t descs[PACKED_BATCH_SIZE];
+	struct virtio_net_hdr *hdrs[PACKED_BATCH_SIZE];
+	uint16_t i;
+
+	if (vq->vq_avail_idx & PACKED_BATCH_MASK)
+		return -1;
+
+	/* Load four mbufs rearm data */
+	__m256i mbufs = _mm256_set_epi64x(
+			*tx_pkts[3]->rearm_data,
+			*tx_pkts[2]->rearm_data,
+			*tx_pkts[1]->rearm_data,
+			*tx_pkts[0]->rearm_data);
+
+	/* hdr_room=128, refcnt=1 and nb_segs=1 */
+	__m256i mbuf_ref = _mm256_set_epi64x(
+			BATCH_REARM_DATA, BATCH_REARM_DATA,
+			BATCH_REARM_DATA, BATCH_REARM_DATA);
+
+	/* Check hdr_room,refcnt and nb_segs */
+	uint16_t cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
+	if (cmp & 0x7777)
+		return -1;
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		dxps[i] = &vq->vq_descx[idx + i];
+		dxps[i]->ndescs = 1;
+		dxps[i]->cookie = tx_pkts[i];
+	}
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		rte_pktmbuf_prepend(tx_pkts[i], head_size);
+		tx_pkts[i]->pkt_len -= head_size;
+	}
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
+		descs[i] = (uint64_t)tx_pkts[i]->data_len |
+		(uint64_t)(idx + i) << 32 |
+		(uint64_t)vq->vq_packed.cached_flags << 48;
+
+	__m512i new_descs = _mm512_set_epi64(
+			descs[3], VIRTIO_MBUF_DATA_DMA_ADDR(tx_pkts[3], vq),
+			descs[2], VIRTIO_MBUF_DATA_DMA_ADDR(tx_pkts[2], vq),
+			descs[1], VIRTIO_MBUF_DATA_DMA_ADDR(tx_pkts[1], vq),
+			descs[0], VIRTIO_MBUF_DATA_DMA_ADDR(tx_pkts[0], vq));
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
+		hdrs[i] = rte_pktmbuf_mtod_offset(tx_pkts[i],
+				struct virtio_net_hdr *, -head_size);
+
+	if (!vq->hw->has_tx_offload) {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
+			virtqueue_clear_net_hdr(hdrs[i]);
+	} else {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
+			virtqueue_xmit_offload(hdrs[i], tx_pkts[i], true);
+	}
+
+	/* Enqueue Packet buffers */
+	rte_smp_wmb();
+	_mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], new_descs);
+
+	virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len,
+			tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len,
+			tx_pkts[3]->pkt_len);
+
+	vq->vq_avail_idx += PACKED_BATCH_SIZE;
+	vq->vq_free_cnt -= PACKED_BATCH_SIZE;
+
+	if (vq->vq_avail_idx >= vq->vq_nentries) {
+		vq->vq_avail_idx -= vq->vq_nentries;
+		vq->vq_packed.cached_flags ^=
+			VRING_PACKED_DESC_F_AVAIL_USED;
+	}
+
+	return 0;
+}
+
+static inline int
+virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq,
+				    struct rte_mbuf *txm)
+{
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint16_t slots, can_push;
+	int16_t need;
+
+	/* How many main ring entries are needed to this Tx?
+	 * any_layout => number of segments
+	 * default    => number of segments + 1
+	 */
+	can_push = rte_mbuf_refcnt_read(txm) == 1 &&
+		   RTE_MBUF_DIRECT(txm) &&
+		   txm->nb_segs == 1 &&
+		   rte_pktmbuf_headroom(txm) >= hdr_size;
+
+	slots = txm->nb_segs + !can_push;
+	need = slots - vq->vq_free_cnt;
+
+	/* Positive value indicates it need free vring descriptors */
+	if (unlikely(need > 0)) {
+		virtio_xmit_cleanup_packed_vec(vq);
+		need = slots - vq->vq_free_cnt;
+		if (unlikely(need > 0)) {
+			PMD_TX_LOG(ERR,
+				   "No free tx descriptors to transmit");
+			return -1;
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1);
+
+	txvq->stats.bytes += txm->pkt_len;
+	return 0;
+}
+
+uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			uint16_t nb_pkts)
+{
+	struct virtnet_tx *txvq = tx_queue;
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t nb_tx = 0;
+	uint16_t remained;
+
+	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
+		return nb_tx;
+
+	if (unlikely(nb_pkts < 1))
+		return nb_pkts;
+
+	PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+	if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh)
+		virtio_xmit_cleanup_packed_vec(vq);
+
+	remained = RTE_MIN(nb_pkts, vq->vq_free_cnt);
+
+	while (remained) {
+		if (remained >= PACKED_BATCH_SIZE) {
+			if (!virtqueue_enqueue_batch_packed_vec(txvq,
+						&tx_pkts[nb_tx])) {
+				nb_tx += PACKED_BATCH_SIZE;
+				remained -= PACKED_BATCH_SIZE;
+				continue;
+			}
+		}
+		if (!virtqueue_enqueue_single_packed_vec(txvq,
+					tx_pkts[nb_tx])) {
+			nb_tx++;
+			remained--;
+			continue;
+		}
+		break;
+	};
+
+	txvq->stats.packets += nb_tx;
+
+	if (likely(nb_tx)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_TX_LOG(DEBUG, "Notified backend after xmit");
+		}
+	}
+
+	return nb_tx;
+}
+
 /* Optionally fill offload information in structure */
 static inline int
 virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v1 6/7] net/virtio: add election for vectorized datapath
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
                   ` (4 preceding siblings ...)
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 5/7] net/virtio: add vectorized packed ring Tx function Marvin Liu
@ 2020-03-13 17:42 ` Marvin Liu
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 7/7] net/virtio: support meson build Marvin Liu
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

The packed ring vectorized datapath can be selected when the following
requirements are fulfilled:

1. AVX512 is allowed by the config file and the compiler
2. VERSION_1 and in_order features are negotiated
3. ring size is a power of two
4. LRO and mergeable buffer features are disabled in the Rx datapath

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index f9d0ea70d..d27306d50 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1518,9 +1518,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	if (vtpci_packed_queue(hw)) {
 		PMD_INIT_LOG(INFO,
 			"virtio: using packed ring %s Tx path on port %u",
-			hw->use_inorder_tx ? "inorder" : "standard",
+			hw->packed_vec_tx ? "vectorized" : "standard",
 			eth_dev->data->port_id);
-		eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
+		if (hw->packed_vec_tx)
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec;
+		else
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
 	} else {
 		if (hw->use_inorder_tx) {
 			PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u",
@@ -1534,7 +1537,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+		if (hw->packed_vec_rx) {
+			PMD_INIT_LOG(INFO,
+				"virtio: using packed ring vectorized Rx path on port %u",
+				eth_dev->data->port_id);
+			eth_dev->rx_pkt_burst =
+				&virtio_recv_pkts_packed_vec;
+		} else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
 			PMD_INIT_LOG(INFO,
 				"virtio: using packed ring mergeable buffer Rx path on port %u",
 				eth_dev->data->port_id);
@@ -2159,6 +2168,26 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 
 	hw->use_simple_rx = 1;
 
+	if (vtpci_packed_queue(hw)) {
+#if defined(RTE_ARCH_X86) && defined(CC_AVX512_SUPPORT)
+		unsigned int vq_size;
+		vq_size = VTPCI_OPS(hw)->get_queue_num(hw, 0);
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
+		    rte_is_power_of_2(vq_size) &&
+		    vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) &&
+		    vtpci_with_feature(hw, VIRTIO_F_VERSION_1)) {
+			hw->packed_vec_rx = 1;
+			hw->packed_vec_tx = 1;
+		}
+
+		if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF))
+			hw->packed_vec_rx = 0;
+
+		if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO)
+			hw->packed_vec_rx = 0;
+#endif
+	}
+
 	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
 		hw->use_inorder_tx = 1;
 		hw->use_inorder_rx = 1;
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index 7433d2f08..8103b7a18 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -251,6 +251,8 @@ struct virtio_hw {
 	uint8_t	    use_msix;
 	uint8_t     modern;
 	uint8_t     use_simple_rx;
+	uint8_t     packed_vec_rx;
+	uint8_t     packed_vec_tx;
 	uint8_t     use_inorder_rx;
 	uint8_t     use_inorder_tx;
 	uint8_t     weak_barriers;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v1 7/7] net/virtio: support meson build
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
                   ` (5 preceding siblings ...)
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 6/7] net/virtio: add election for vectorized datapath Marvin Liu
@ 2020-03-13 17:42 ` Marvin Liu
  2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
index 04c7fdf25..b0bddfd06 100644
--- a/drivers/net/virtio/meson.build
+++ b/drivers/net/virtio/meson.build
@@ -11,6 +11,7 @@ deps += ['kvargs', 'bus_pci']
 
 if arch_subdir == 'x86'
 	sources += files('virtio_rxtx_simple_sse.c')
+	sources += files('virtio_rxtx_packed_avx.c')
 elif arch_subdir == 'ppc_64'
 	sources += files('virtio_rxtx_simple_altivec.c')
 elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64')
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
                   ` (6 preceding siblings ...)
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 7/7] net/virtio: support meson build Marvin Liu
@ 2020-03-27 16:54 ` Marvin Liu
  2020-03-27 16:54   ` [dpdk-dev] [PATCH v2 1/7] net/virtio: add Rx free threshold setting Marvin Liu
                     ` (6 more replies)
  2020-04-08  8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu
                   ` (9 subsequent siblings)
  17 siblings, 7 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

This patch set introduces a vectorized datapath for the packed ring.

A packed ring descriptor is 16 bytes, so a batch of four descriptors
fits exactly into one cacheline, which AVX512 instructions can handle
well. The packed ring Tx datapath can be fully transformed into a
vectorized datapath. The Rx datapath can also be vectorized when
features are limited (LRO and mergeable buffers disabled). Users can
disable the vectorized packed ring datapath with the 'packed_vec'
parameter of the virtio-user vdev.

v2:
1. replace more function blocks with vector instructions
2. clean virtio_net_hdr with vector instructions
3. allow header room size change
4. add 'packed_vec' option to the virtio_user vdev
5. fix the build to check whether AVX512 is enabled
6. doc update

Marvin Liu (7):
  net/virtio: add Rx free threshold setting
  net/virtio-user: add vectorized packed ring parameter
  net/virtio: add vectorized packed ring Rx function
  net/virtio: reuse packed ring xmit functions
  net/virtio: add vectorized packed ring Tx datapath
  net/virtio: add election for vectorized datapath
  doc: add packed vectorized datapath

 .../nics/features/virtio-packed_vec.ini       |  22 +
 .../{virtio_vec.ini => virtio-split_vec.ini}  |   2 +-
 doc/guides/nics/virtio.rst                    |  44 +-
 drivers/net/virtio/Makefile                   |  28 +
 drivers/net/virtio/meson.build                |  11 +
 drivers/net/virtio/virtio_ethdev.c            |  43 +-
 drivers/net/virtio/virtio_ethdev.h            |   6 +
 drivers/net/virtio/virtio_pci.h               |   2 +
 drivers/net/virtio/virtio_rxtx.c              | 201 ++----
 drivers/net/virtio/virtio_rxtx_packed_avx.c   | 636 ++++++++++++++++++
 drivers/net/virtio/virtio_user_ethdev.c       |  27 +-
 drivers/net/virtio/virtqueue.h                | 165 ++++-
 12 files changed, 1005 insertions(+), 182 deletions(-)
 create mode 100644 doc/guides/nics/features/virtio-packed_vec.ini
 rename doc/guides/nics/features/{virtio_vec.ini => virtio-split_vec.ini} (88%)
 create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v2 1/7] net/virtio: add Rx free threshold setting
  2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu
@ 2020-03-27 16:54   ` Marvin Liu
  2020-03-27 16:54   ` [dpdk-dev] [PATCH v2 2/7] net/virtio-user: add vectorized packed ring parameter Marvin Liu
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Introduce a free threshold setting in the Rx queue; its default value
is 32. The threshold is limited to a multiple of four, as only the
vectorized packed Rx function will utilize it. The virtio driver will
rearm the Rx queue when more than rx_free_thresh descriptors have been
dequeued.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 752faa0f6..3a2dbc2e0 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 	struct virtio_hw *hw = dev->data->dev_private;
 	struct virtqueue *vq = hw->vqs[vtpci_queue_idx];
 	struct virtnet_rx *rxvq;
+	uint16_t rx_free_thresh;
 
 	PMD_INIT_FUNC_TRACE();
 
@@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 		return -EINVAL;
 	}
 
+	rx_free_thresh = rx_conf->rx_free_thresh;
+	if (rx_free_thresh == 0)
+		rx_free_thresh =
+			RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH);
+
+	if (rx_free_thresh & 0x3) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+
+	if (rx_free_thresh >= vq->vq_nentries) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the "
+			"number of RX entries (%u)."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			vq->vq_nentries,
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+	vq->vq_free_thresh = rx_free_thresh;
+
 	if (nb_desc == 0 || nb_desc > vq->vq_nentries)
 		nb_desc = vq->vq_nentries;
 	vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc);
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 58ad7309a..6301c56b2 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,8 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_RX_FREE_THRESH 32
+
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v2 2/7] net/virtio-user: add vectorized packed ring parameter
  2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu
  2020-03-27 16:54   ` [dpdk-dev] [PATCH v2 1/7] net/virtio: add Rx free threshold setting Marvin Liu
@ 2020-03-27 16:54   ` Marvin Liu
  2020-03-27 16:54   ` [dpdk-dev] [PATCH v2 3/7] net/virtio: add vectorized packed ring Rx function Marvin Liu
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Add a new parameter "packed_vec" which can explicitly disable the
vectorized packed ring datapath. When the "packed_vec" option is on,
the driver checks the vectorized packed ring datapath prerequisites.
If any of them is not met, the vectorized datapath will not be
selected.
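
For example, the vectorized path can be explicitly disabled on a
virtio-user port while keeping the packed ring (socket path and cores
are placeholders):

  testpmd -l 0-1 --no-pci \
          --vdev=net_virtio_user0,path=/tmp/vhost-user.sock,packed_vq=1,packed_vec=0 \
          -- -i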

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index 7433d2f08..8103b7a18 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -251,6 +251,8 @@ struct virtio_hw {
 	uint8_t	    use_msix;
 	uint8_t     modern;
 	uint8_t     use_simple_rx;
+	uint8_t     packed_vec_rx;
+	uint8_t     packed_vec_tx;
 	uint8_t     use_inorder_rx;
 	uint8_t     use_inorder_tx;
 	uint8_t     weak_barriers;
diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
index e61af4068..2608b1fae 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -450,6 +450,8 @@ static const char *valid_args[] = {
 	VIRTIO_USER_ARG_IN_ORDER,
 #define VIRTIO_USER_ARG_PACKED_VQ      "packed_vq"
 	VIRTIO_USER_ARG_PACKED_VQ,
+#define VIRTIO_USER_ARG_PACKED_VEC     "packed_vec"
+	VIRTIO_USER_ARG_PACKED_VEC,
 	NULL
 };
 
@@ -552,6 +554,8 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	uint64_t mrg_rxbuf = 1;
 	uint64_t in_order = 1;
 	uint64_t packed_vq = 0;
+	uint64_t packed_vec = 1;
+
 	char *path = NULL;
 	char *ifname = NULL;
 	char *mac_addr = NULL;
@@ -668,6 +672,15 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		}
 	}
 
+	if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_PACKED_VEC) == 1) {
+		if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_PACKED_VEC,
+				       &get_integer_arg, &packed_vec) < 0) {
+			PMD_INIT_LOG(ERR, "error to parse %s",
+				     VIRTIO_USER_ARG_PACKED_VEC);
+			goto end;
+		}
+	}
+
 	if (queues > 1 && cq == 0) {
 		PMD_INIT_LOG(ERR, "multi-q requires ctrl-q");
 		goto end;
@@ -705,6 +718,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	}
 
 	hw = eth_dev->data->dev_private;
+#if defined(RTE_ARCH_X86) && defined(CC_AVX512_SUPPORT)
+	if (packed_vec) {
+		hw->packed_vec_rx = 1;
+		hw->packed_vec_tx = 1;
+	}
+#else
+	if (packed_vec)
+		PMD_INIT_LOG(ERR, "building environment not match vectorized "
+				  "packed ring datapath requirement");
+#endif
+
 	if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq,
 			 queue_size, mac_addr, &ifname, server_mode,
 			 mrg_rxbuf, in_order, packed_vq) < 0) {
@@ -777,4 +801,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user,
 	"server=<0|1> "
 	"mrg_rxbuf=<0|1> "
 	"in_order=<0|1> "
-	"packed_vq=<0|1>");
+	"packed_vq=<0|1> "
+	"packed_vec=<0|1>");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v2 3/7] net/virtio: add vectorized packed ring Rx function
  2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu
  2020-03-27 16:54   ` [dpdk-dev] [PATCH v2 1/7] net/virtio: add Rx free threshold setting Marvin Liu
  2020-03-27 16:54   ` [dpdk-dev] [PATCH v2 2/7] net/virtio-user: add vectorized packed ring parameter Marvin Liu
@ 2020-03-27 16:54   ` Marvin Liu
  2020-03-27 16:54   ` [dpdk-dev] [PATCH v2 4/7] net/virtio: reuse packed ring xmit functions Marvin Liu
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Optimize the packed ring Rx datapath when AVX512 is enabled and mergeable
buffer/Rx LRO offloading are not required. The optimization follows the same
approach as vhost: the datapath is split into batch and single functions, and
the batch function is further optimized with vector instructions. Also pad the
descriptor extra structure to 16-byte alignment, so that four elements can be
handled in one batch.
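
The sizing assumption behind the four-element batch, spelled out as a sketch
(the numbers hold for the usual 64-byte cacheline and are not new code in
this patch):

/* sizeof(struct vring_packed_desc) is 16 bytes (addr 8 + len 4 + id 2 +
 * flags 2), so one 64-byte cacheline holds 64 / 16 = 4 descriptors and
 * PACKED_BATCH_SIZE below evaluates to 4. Padding vq_desc_extra to
 * 16 bytes lets four of those entries share one cacheline as well.
 */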

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index efdcb0d93..7bdb87c49 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -37,6 +37,34 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
 endif
 
+ifeq ($(RTE_TOOLCHAIN), gcc)
+ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
+CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), clang)
+ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1)
+CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), icc)
+ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
+CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(findstring RTE_MACHINE_CPUFLAG_AVX512F,$(CFLAGS)),RTE_MACHINE_CPUFLAG_AVX512F)
+ifneq ($(FORCE_DISABLE_AVX512), y)
+CFLAGS += -DCC_AVX512_SUPPORT
+ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
+CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
+endif
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
+endif
+endif
+
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c
diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
index 04c7fdf25..652ae39af 100644
--- a/drivers/net/virtio/meson.build
+++ b/drivers/net/virtio/meson.build
@@ -11,6 +11,17 @@ deps += ['kvargs', 'bus_pci']
 
 if arch_subdir == 'x86'
 	sources += files('virtio_rxtx_simple_sse.c')
+	if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F')
+		cflags += ['-DCC_AVX512_SUPPORT']
+		if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
+			cflags += '-DVIRTIO_GCC_UNROLL_PRAGMA'
+		elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
+			cflags += '-DVIRTIO_CLANG_UNROLL_PRAGMA'
+		elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
+			cflags += '-DVIRTIO_ICC_UNROLL_PRAGMA'
+		endif
+		sources += files('virtio_rxtx_packed_avx.c')
+	endif
 elif arch_subdir == 'ppc_64'
 	sources += files('virtio_rxtx_simple_altivec.c')
 elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64')
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index cd8947656..10e39670e 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -104,6 +104,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts,
 uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 3a2dbc2e0..ac417232b 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -1245,7 +1245,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
 	return 0;
 }
 
-#define VIRTIO_MBUF_BURST_SZ 64
 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc))
 uint16_t
 virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
@@ -2328,3 +2327,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
 
 	return nb_tx;
 }
+
+__rte_weak uint16_t
+virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue,
+			    struct rte_mbuf __rte_unused **rx_pkts,
+			    uint16_t __rte_unused nb_pkts)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
new file mode 100644
index 000000000..e2310d74e
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -0,0 +1,361 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <rte_net.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtio_pci.h"
+#include "virtqueue.h"
+
+#define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63)
+
+#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
+	sizeof(struct vring_packed_desc))
+#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
+
+#ifdef VIRTIO_GCC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_ICC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifndef virtio_for_each_try_unroll
+#define virtio_for_each_try_unroll(iter, val, num) \
+	for (iter = val; iter < num; iter++)
+#endif
+
+
+static inline void
+virtio_update_batch_stats(struct virtnet_stats *stats,
+			  uint16_t pkt_len1,
+			  uint16_t pkt_len2,
+			  uint16_t pkt_len3,
+			  uint16_t pkt_len4)
+{
+	stats->bytes += pkt_len1;
+	stats->bytes += pkt_len2;
+	stats->bytes += pkt_len3;
+	stats->bytes += pkt_len4;
+}
+/* Optionally fill offload information in structure */
+static inline int
+virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
+{
+	struct rte_net_hdr_lens hdr_lens;
+	uint32_t hdrlen, ptype;
+	int l4_supported = 0;
+
+	/* nothing to do */
+	if (hdr->flags == 0)
+		return 0;
+
+	/* GSO not support in vec path, skip check */
+	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
+
+	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
+	m->packet_type = ptype;
+	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
+		l4_supported = 1;
+
+	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
+		if (hdr->csum_start <= hdrlen && l4_supported) {
+			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
+		} else {
+			/* Unknown proto or tunnel, do sw cksum. We can assume
+			 * the cksum field is in the first segment since the
+			 * buffers we provided to the host are large enough.
+			 * In case of SCTP, this will be wrong since it's a CRC
+			 * but there's nothing we can do.
+			 */
+			uint16_t csum = 0, off;
+
+			rte_raw_cksum_mbuf(m, hdr->csum_start,
+				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
+				&csum);
+			if (likely(csum != 0xffff))
+				csum = ~csum;
+			off = hdr->csum_offset + hdr->csum_start;
+			if (rte_pktmbuf_data_len(m) >= off + 1)
+				*rte_pktmbuf_mtod_offset(m, uint16_t *,
+					off) = csum;
+		}
+	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
+		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
+	}
+
+	return 0;
+}
+
+static uint16_t
+virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq,
+				   struct rte_mbuf **rx_pkts)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdrs[PACKED_BATCH_SIZE];
+	uint64_t addrs[PACKED_BATCH_SIZE << 1];
+	uint16_t id = vq->vq_used_cons_idx;
+	uint8_t desc_stats;
+	uint16_t i;
+	void *desc_addr;
+
+	if (id & PACKED_BATCH_MASK)
+		return -1;
+
+	/* only care avail/used bits */
+	__m512i desc_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+	desc_addr = &vq->vq_packed.ring.desc[id];
+
+	rte_smp_rmb();
+	__m512i packed_desc = _mm512_loadu_si512(desc_addr);
+	__m512i flags_mask  = _mm512_maskz_and_epi64(0xff, packed_desc,
+			desc_flags);
+
+	__m512i used_flags;
+	if (vq->vq_packed.used_wrap_counter)
+		used_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+	else
+		used_flags = _mm512_setzero_si512();
+
+	/* Check all descs are used */
+	desc_stats = _mm512_cmp_epu64_mask(flags_mask, used_flags,
+			_MM_CMPINT_EQ);
+	if (desc_stats != 0xff)
+		return -1;
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie;
+		rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *));
+
+		addrs[i << 1] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1;
+		addrs[(i << 1) + 1] =
+			(uint64_t)rx_pkts[i]->rx_descriptor_fields1 + 8;
+	}
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		char *addr = (char *)rx_pkts[i]->buf_addr +
+			RTE_PKTMBUF_HEADROOM - hdr_size;
+		hdrs[i] = (struct virtio_net_hdr *)addr;
+	}
+
+	/* addresses of pkt_len and data_len */
+	__m512i vindex = _mm512_loadu_si512((void *)addrs);
+
+	/*
+	 * select 10b*4 load 32bit from packed_desc[95:64]
+	 * mmask  0110b*4 save 32bit into pkt_len and data_len
+	 */
+	__m512i value = _mm512_maskz_shuffle_epi32(0x6666, packed_desc, 0xAA);
+
+	/* mmask 0110b*4 reduce hdr_len from pkt_len and data_len */
+	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(0x6666,
+			(uint32_t)-hdr_size);
+
+	value = _mm512_add_epi32(value, mbuf_len_offset);
+	/* batch store into mbufs */
+	_mm512_i64scatter_epi64(0, vindex, value, 1);
+
+	if (hw->has_rx_offload) {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
+			virtio_vec_rx_offload(rx_pkts[i], hdrs[i]);
+	}
+
+	virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len,
+			rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len,
+			rx_pkts[3]->pkt_len);
+
+	vq->vq_free_cnt += PACKED_BATCH_SIZE;
+
+	vq->vq_used_cons_idx += PACKED_BATCH_SIZE;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static uint16_t
+virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq,
+				    struct rte_mbuf **rx_pkts)
+{
+	uint16_t used_idx, id;
+	uint32_t len;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint32_t hdr_size = hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	struct vring_packed_desc *desc;
+	struct rte_mbuf *cookie;
+
+	desc = vq->vq_packed.ring.desc;
+	used_idx = vq->vq_used_cons_idx;
+	if (!desc_is_used(&desc[used_idx], vq))
+		return -1;
+
+	len = desc[used_idx].len;
+	id = desc[used_idx].id;
+	cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie;
+	if (unlikely(cookie == NULL)) {
+		PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u",
+				vq->vq_used_cons_idx);
+		return -1;
+	}
+	rte_prefetch0(cookie);
+	rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
+
+	cookie->data_off = RTE_PKTMBUF_HEADROOM;
+	cookie->ol_flags = 0;
+	cookie->pkt_len = (uint32_t)(len - hdr_size);
+	cookie->data_len = (uint32_t)(len - hdr_size);
+
+	hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr +
+					RTE_PKTMBUF_HEADROOM - hdr_size);
+	if (hw->has_rx_offload)
+		virtio_vec_rx_offload(cookie, hdr);
+
+	*rx_pkts = cookie;
+
+	rxvq->stats.bytes += cookie->pkt_len;
+
+	vq->vq_free_cnt++;
+	vq->vq_used_cons_idx++;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static inline void
+virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq,
+			      struct rte_mbuf **cookie,
+			      uint16_t num)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc;
+	uint16_t flags = vq->vq_packed.cached_flags;
+	struct virtio_hw *hw = vq->hw;
+	struct vq_desc_extra *dxp;
+	uint16_t idx, i;
+	uint16_t total_num = 0;
+	uint16_t head_idx = vq->vq_avail_idx;
+	uint16_t head_flag = vq->vq_packed.cached_flags;
+	uint64_t addr;
+
+	do {
+		idx = vq->vq_avail_idx;
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			dxp = &vq->vq_descx[idx + i];
+			dxp->cookie = (void *)cookie[total_num + i];
+
+			addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) +
+				RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size;
+			start_dp[idx + i].addr = addr;
+			start_dp[idx + i].len = cookie[total_num + i]->buf_len
+				- RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
+			if (total_num || i) {
+				virtqueue_store_flags_packed(&start_dp[idx + i],
+						flags, hw->weak_barriers);
+			}
+		}
+
+		vq->vq_avail_idx += PACKED_BATCH_SIZE;
+		if (vq->vq_avail_idx >= vq->vq_nentries) {
+			vq->vq_avail_idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+			flags = vq->vq_packed.cached_flags;
+		}
+		total_num += PACKED_BATCH_SIZE;
+	} while (total_num < num);
+
+	virtqueue_store_flags_packed(&start_dp[head_idx], head_flag,
+				hw->weak_barriers);
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num);
+}
+
+uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue,
+			    struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct virtnet_rx *rxvq = rx_queue;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t num, nb_rx = 0;
+	uint32_t nb_enqueued = 0;
+	uint16_t free_cnt = vq->vq_free_thresh;
+
+	if (unlikely(hw->started == 0))
+		return nb_rx;
+
+	num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts);
+	if (likely(num > PACKED_BATCH_SIZE))
+		num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE);
+
+	while (num) {
+		if (!virtqueue_dequeue_batch_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx += PACKED_BATCH_SIZE;
+			num -= PACKED_BATCH_SIZE;
+			continue;
+		}
+		if (!virtqueue_dequeue_single_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx++;
+			num--;
+			continue;
+		}
+		break;
+	};
+
+	PMD_RX_LOG(DEBUG, "dequeue:%d", num);
+
+	rxvq->stats.packets += nb_rx;
+
+	if (likely(vq->vq_free_cnt >= free_cnt)) {
+		struct rte_mbuf *new_pkts[free_cnt];
+		if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts,
+						free_cnt) == 0)) {
+			virtio_recv_refill_packed_vec(rxvq, new_pkts,
+					free_cnt);
+			nb_enqueued += free_cnt;
+		} else {
+			struct rte_eth_dev *dev =
+				&rte_eth_devices[rxvq->port_id];
+			dev->data->rx_mbuf_alloc_failed += free_cnt;
+		}
+	}
+
+	if (likely(nb_enqueued)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_RX_LOG(DEBUG, "Notified");
+		}
+	}
+
+	return nb_rx;
+}
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 6301c56b2..43e305ecc 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -20,6 +20,7 @@ struct rte_mbuf;
 
 #define DEFAULT_RX_FREE_THRESH 32
 
+#define VIRTIO_MBUF_BURST_SZ 64
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
@@ -236,7 +237,8 @@ struct vq_desc_extra {
 	void *cookie;
 	uint16_t ndescs;
 	uint16_t next;
-};
+	uint8_t padding[4];
+} __rte_packed __rte_aligned(16);
 
 struct virtqueue {
 	struct virtio_hw  *hw; /**< virtio_hw structure pointer. */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v2 4/7] net/virtio: reuse packed ring xmit functions
  2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu
                     ` (2 preceding siblings ...)
  2020-03-27 16:54   ` [dpdk-dev] [PATCH v2 3/7] net/virtio: add vectorized packed ring Rx function Marvin Liu
@ 2020-03-27 16:54   ` Marvin Liu
  2020-03-27 16:54   ` [dpdk-dev] [PATCH v2 5/7] net/virtio: add vectorized packed ring Tx datapath Marvin Liu
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Move the xmit offload and packed ring xmit enqueue functions to the header
file. These functions will be reused by the packed ring vectorized Tx
function.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index ac417232b..b8b4d3c25 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq,
 	return i;
 }
 
-#ifndef DEFAULT_TX_FREE_THRESH
-#define DEFAULT_TX_FREE_THRESH 32
-#endif
-
 static void
 virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num)
 {
@@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m)
 }
 
 
-/* avoid write operation when necessary, to lessen cache issues */
-#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
-	if ((var) != (val))			\
-		(var) = (val);			\
-} while (0)
-
-#define virtqueue_clear_net_hdr(_hdr) do {		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0);		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0);	\
-} while (0)
-
-static inline void
-virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
-			struct rte_mbuf *cookie,
-			bool offload)
-{
-	if (offload) {
-		if (cookie->ol_flags & PKT_TX_TCP_SEG)
-			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
-
-		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
-		case PKT_TX_UDP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_udp_hdr,
-				dgram_cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		case PKT_TX_TCP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		default:
-			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
-			break;
-		}
 
-		/* TCP Segmentation Offload */
-		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
-			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
-				VIRTIO_NET_HDR_GSO_TCPV6 :
-				VIRTIO_NET_HDR_GSO_TCPV4;
-			hdr->gso_size = cookie->tso_segsz;
-			hdr->hdr_len =
-				cookie->l2_len +
-				cookie->l3_len +
-				cookie->l4_len;
-		} else {
-			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
-		}
-	}
-}
 
 static inline void
 virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq,
@@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq,
 	virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers);
 }
 
-static inline void
-virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
-			      uint16_t needed, int can_push, int in_order)
-{
-	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
-	struct vq_desc_extra *dxp;
-	struct virtqueue *vq = txvq->vq;
-	struct vring_packed_desc *start_dp, *head_dp;
-	uint16_t idx, id, head_idx, head_flags;
-	int16_t head_size = vq->hw->vtnet_hdr_size;
-	struct virtio_net_hdr *hdr;
-	uint16_t prev;
-	bool prepend_header = false;
-
-	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
-
-	dxp = &vq->vq_descx[id];
-	dxp->ndescs = needed;
-	dxp->cookie = cookie;
-
-	head_idx = vq->vq_avail_idx;
-	idx = head_idx;
-	prev = head_idx;
-	start_dp = vq->vq_packed.ring.desc;
-
-	head_dp = &vq->vq_packed.ring.desc[idx];
-	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-	head_flags |= vq->vq_packed.cached_flags;
-
-	if (can_push) {
-		/* prepend cannot fail, checked by caller */
-		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
-					      -head_size);
-		prepend_header = true;
-
-		/* if offload disabled, it is not zeroed below, do it now */
-		if (!vq->hw->has_tx_offload)
-			virtqueue_clear_net_hdr(hdr);
-	} else {
-		/* setup first tx ring slot to point to header
-		 * stored in reserved region.
-		 */
-		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
-			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
-		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
-		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	}
-
-	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
-
-	do {
-		uint16_t flags;
-
-		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
-		start_dp[idx].len  = cookie->data_len;
-		if (prepend_header) {
-			start_dp[idx].addr -= head_size;
-			start_dp[idx].len += head_size;
-			prepend_header = false;
-		}
-
-		if (likely(idx != head_idx)) {
-			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-			flags |= vq->vq_packed.cached_flags;
-			start_dp[idx].flags = flags;
-		}
-		prev = idx;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	} while ((cookie = cookie->next) != NULL);
-
-	start_dp[prev].id = id;
-
-	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
-	vq->vq_avail_idx = idx;
-
-	if (!in_order) {
-		vq->vq_desc_head_idx = dxp->next;
-		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
-			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
-	}
-
-	virtqueue_store_flags_packed(head_dp, head_flags,
-				     vq->hw->weak_barriers);
-}
-
 static inline void
 virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 			uint16_t needed, int use_indirect, int can_push,
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 43e305ecc..31c48710c 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,7 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_TX_FREE_THRESH 32
 #define DEFAULT_RX_FREE_THRESH 32
 
 #define VIRTIO_MBUF_BURST_SZ 64
@@ -562,4 +563,162 @@ virtqueue_notify(struct virtqueue *vq)
 #define VIRTQUEUE_DUMP(vq) do { } while (0)
 #endif
 
+/* avoid write operation when necessary, to lessen cache issues */
+#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
+	if ((var) != (val))			\
+		(var) = (val);			\
+} while (0)
+
+#define virtqueue_clear_net_hdr(_hdr) do {		\
+	ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0);		\
+	ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0);	\
+} while (0)
+
+static inline void
+virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
+			struct rte_mbuf *cookie,
+			bool offload)
+{
+	if (offload) {
+		if (cookie->ol_flags & PKT_TX_TCP_SEG)
+			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
+
+		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
+		case PKT_TX_UDP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_udp_hdr,
+				dgram_cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		case PKT_TX_TCP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		default:
+			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
+			break;
+		}
+
+		/* TCP Segmentation Offload */
+		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
+			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
+				VIRTIO_NET_HDR_GSO_TCPV6 :
+				VIRTIO_NET_HDR_GSO_TCPV4;
+			hdr->gso_size = cookie->tso_segsz;
+			hdr->hdr_len =
+				cookie->l2_len +
+				cookie->l3_len +
+				cookie->l4_len;
+		} else {
+			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
+		}
+	}
+}
+
+static inline void
+virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
+			      uint16_t needed, int can_push, int in_order)
+{
+	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
+	struct vq_desc_extra *dxp;
+	struct virtqueue *vq = txvq->vq;
+	struct vring_packed_desc *start_dp, *head_dp;
+	uint16_t idx, id, head_idx, head_flags;
+	int16_t head_size = vq->hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	uint16_t prev;
+	bool prepend_header = false;
+
+	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
+
+	dxp = &vq->vq_descx[id];
+	dxp->ndescs = needed;
+	dxp->cookie = cookie;
+
+	head_idx = vq->vq_avail_idx;
+	idx = head_idx;
+	prev = head_idx;
+	start_dp = vq->vq_packed.ring.desc;
+
+	head_dp = &vq->vq_packed.ring.desc[idx];
+	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+	head_flags |= vq->vq_packed.cached_flags;
+
+	if (can_push) {
+		/* prepend cannot fail, checked by caller */
+		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
+					      -head_size);
+		prepend_header = true;
+
+		/* if offload disabled, it is not zeroed below, do it now */
+		if (!vq->hw->has_tx_offload)
+			virtqueue_clear_net_hdr(hdr);
+	} else {
+		/* setup first tx ring slot to point to header
+		 * stored in reserved region.
+		 */
+		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
+			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
+		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
+		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	}
+
+	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
+
+	do {
+		uint16_t flags;
+
+		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
+		start_dp[idx].len  = cookie->data_len;
+		if (prepend_header) {
+			start_dp[idx].addr -= head_size;
+			start_dp[idx].len += head_size;
+			prepend_header = false;
+		}
+
+		if (likely(idx != head_idx)) {
+			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+			flags |= vq->vq_packed.cached_flags;
+			start_dp[idx].flags = flags;
+		}
+		prev = idx;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	} while ((cookie = cookie->next) != NULL);
+
+	start_dp[prev].id = id;
+
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
+	vq->vq_avail_idx = idx;
+
+	if (!in_order) {
+		vq->vq_desc_head_idx = dxp->next;
+		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
+			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
+	}
+
+	virtqueue_store_flags_packed(head_dp, head_flags,
+				     vq->hw->weak_barriers);
+}
 #endif /* _VIRTQUEUE_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v2 5/7] net/virtio: add vectorized packed ring Tx datapath
  2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu
                     ` (3 preceding siblings ...)
  2020-03-27 16:54   ` [dpdk-dev] [PATCH v2 4/7] net/virtio: reuse packed ring xmit functions Marvin Liu
@ 2020-03-27 16:54   ` Marvin Liu
  2020-03-27 16:54   ` [dpdk-dev] [PATCH v2 6/7] net/virtio: add election for vectorized datapath Marvin Liu
  2020-03-27 16:54   ` [dpdk-dev] [PATCH v2 7/7] doc: add packed " Marvin Liu
  6 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Optimize the packed ring Tx datapath in the same way as the Rx datapath:
split the Tx datapath into batch and single Tx functions, with the batch
function further optimized by vector instructions.
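
A sketch of the mbuf field layout that the batch Tx checks rely on (this
only restates the masks used in the diff below):

/* rearm_data is a 64-bit window over data_off | refcnt | nb_segs | port,
 * each a 16-bit field. With four mbufs packed into one 256-bit register,
 * mask 0x6666 selects the refcnt/nb_segs words (both must read 1) and
 * mask 0x1111 selects the data_off words for the headroom check against
 * the virtio-net header size.
 */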

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 10e39670e..c9aaef0af 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -107,6 +107,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index b8b4d3c25..125df3a13 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -2174,3 +2174,11 @@ virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue,
 {
 	return 0;
 }
+
+__rte_weak uint16_t
+virtio_xmit_pkts_packed_vec(void __rte_unused *tx_queue,
+			    struct rte_mbuf __rte_unused **tx_pkts,
+			    uint16_t __rte_unused nb_pkts)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
index e2310d74e..b63429df6 100644
--- a/drivers/net/virtio/virtio_rxtx_packed_avx.c
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -15,6 +15,18 @@
 #include "virtio_pci.h"
 #include "virtqueue.h"
 
+/* reference count offset in mbuf rearm data */
+#define REF_CNT_OFFSET 16
+/* segment number offset in mbuf rearm data */
+#define SEG_NUM_OFFSET 32
+
+#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_OFFSET | \
+			  1ULL << REF_CNT_OFFSET)
+/* id offset in packed ring desc higher 64bits */
+#define ID_OFFSET 32
+/* flag offset in packed ring desc higher 64bits */
+#define FLAG_OFFSET 48
+
 #define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63)
 
 #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
@@ -41,6 +53,47 @@
 	for (iter = val; iter < num; iter++)
 #endif
 
+static void
+virtio_xmit_cleanup_packed_vec(struct virtqueue *vq)
+{
+	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
+	struct vq_desc_extra *dxp;
+	uint16_t used_idx, id, curr_id, free_cnt = 0;
+	uint16_t size = vq->vq_nentries;
+	struct rte_mbuf *mbufs[size];
+	uint16_t nb_mbuf = 0, i;
+
+	used_idx = vq->vq_used_cons_idx;
+
+	if (!desc_is_used(&desc[used_idx], vq))
+		return;
+
+	id = desc[used_idx].id;
+
+	do {
+		curr_id = used_idx;
+		dxp = &vq->vq_descx[used_idx];
+		used_idx += dxp->ndescs;
+		free_cnt += dxp->ndescs;
+
+		if (dxp->cookie != NULL) {
+			mbufs[nb_mbuf] = dxp->cookie;
+			dxp->cookie = NULL;
+			nb_mbuf++;
+		}
+
+		if (used_idx >= size) {
+			used_idx -= size;
+			vq->vq_packed.used_wrap_counter ^= 1;
+		}
+	} while (curr_id != id);
+
+	for (i = 0; i < nb_mbuf; i++)
+		rte_pktmbuf_free(mbufs[i]);
+
+	vq->vq_used_cons_idx = used_idx;
+	vq->vq_free_cnt += free_cnt;
+}
 
 static inline void
 virtio_update_batch_stats(struct virtnet_stats *stats,
@@ -54,6 +107,228 @@ virtio_update_batch_stats(struct virtnet_stats *stats,
 	stats->bytes += pkt_len3;
 	stats->bytes += pkt_len4;
 }
+
+static inline int
+virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq,
+				   struct rte_mbuf **tx_pkts)
+{
+	struct virtqueue *vq = txvq->vq;
+	uint16_t head_size = vq->hw->vtnet_hdr_size;
+	uint16_t idx = vq->vq_avail_idx;
+	struct virtio_net_hdr *hdrs[PACKED_BATCH_SIZE];
+	uint16_t i, cmp;
+
+	if (vq->vq_avail_idx & PACKED_BATCH_MASK)
+		return -1;
+
+	/* Load four mbufs rearm data */
+	__m256i mbufs = _mm256_set_epi64x(
+			*tx_pkts[3]->rearm_data,
+			*tx_pkts[2]->rearm_data,
+			*tx_pkts[1]->rearm_data,
+			*tx_pkts[0]->rearm_data);
+
+	/* refcnt=1 and nb_segs=1 */
+	__m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
+	__m256i head_rooms = _mm256_set1_epi16(head_size);
+
+	/* Check refcnt and nb_segs */
+	cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
+	if (cmp & 0x6666)
+		return -1;
+
+	/* Check headroom is enough */
+	cmp = _mm256_mask_cmp_epu16_mask(0x1111, mbufs, head_rooms,
+			_MM_CMPINT_LT);
+	if (unlikely(cmp))
+		return -1;
+
+	__m512i dxps = _mm512_set_epi64(
+			0x1, (uint64_t)tx_pkts[3],
+			0x1, (uint64_t)tx_pkts[2],
+			0x1, (uint64_t)tx_pkts[1],
+			0x1, (uint64_t)tx_pkts[0]);
+
+	_mm512_storeu_si512((void *)&vq->vq_descx[idx], dxps);
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		tx_pkts[i]->data_off -= head_size;
+		tx_pkts[i]->data_len += head_size;
+	}
+
+#ifdef RTE_VIRTIO_USER
+	__m512i descs_base = _mm512_set_epi64(
+			tx_pkts[3]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])),
+			tx_pkts[2]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])),
+			tx_pkts[1]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])),
+			tx_pkts[0]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0])));
+#else
+	__m512i descs_base = _mm512_set_epi64(
+			tx_pkts[3]->data_len, tx_pkts[3]->buf_iova,
+			tx_pkts[2]->data_len, tx_pkts[2]->buf_iova,
+			tx_pkts[1]->data_len, tx_pkts[1]->buf_iova,
+			tx_pkts[0]->data_len, tx_pkts[0]->buf_iova);
+#endif
+
+	/* id offset and data offset */
+	__m512i data_offsets = _mm512_set_epi64(
+			(uint64_t)3 << ID_OFFSET, tx_pkts[3]->data_off,
+			(uint64_t)2 << ID_OFFSET, tx_pkts[2]->data_off,
+			(uint64_t)1 << ID_OFFSET, tx_pkts[1]->data_off,
+			0, tx_pkts[0]->data_off);
+
+	__m512i new_descs = _mm512_add_epi64(descs_base, data_offsets);
+
+	uint64_t flags_temp = (uint64_t)idx << ID_OFFSET |
+		(uint64_t)vq->vq_packed.cached_flags << FLAG_OFFSET;
+
+	/* flags offset and guest virtual address offset */
+#ifdef RTE_VIRTIO_USER
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset);
+#else
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, 0);
+#endif
+	__m512i flag_offsets = _mm512_broadcast_i32x4(flag_offset);
+
+	__m512i descs = _mm512_add_epi64(new_descs, flag_offsets);
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
+		hdrs[i] = rte_pktmbuf_mtod_offset(tx_pkts[i],
+				struct virtio_net_hdr *, -head_size);
+
+	if (!vq->hw->has_tx_offload) {
+		__m128i mask = _mm_set1_epi16(0xFFFF);
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			__m128i hdr = _mm_loadu_si128((void *)hdrs[i]);
+			if (unlikely(_mm_mask_test_epi16_mask(0x3F, hdr,
+							mask))) {
+				__m128i all_zero = _mm_setzero_si128();
+				_mm_mask_storeu_epi16((void *)hdrs[i], 0x3F,
+						all_zero);
+			}
+		}
+	} else {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
+			virtqueue_xmit_offload(hdrs[i], tx_pkts[i], true);
+	}
+
+	/* Enqueue Packet buffers */
+	rte_smp_wmb();
+	_mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], descs);
+
+	virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len,
+			tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len,
+			tx_pkts[3]->pkt_len);
+
+	vq->vq_avail_idx += PACKED_BATCH_SIZE;
+	vq->vq_free_cnt -= PACKED_BATCH_SIZE;
+
+	if (vq->vq_avail_idx >= vq->vq_nentries) {
+		vq->vq_avail_idx -= vq->vq_nentries;
+		vq->vq_packed.cached_flags ^=
+			VRING_PACKED_DESC_F_AVAIL_USED;
+	}
+
+	return 0;
+}
+
+static inline int
+virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq,
+				    struct rte_mbuf *txm)
+{
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint16_t slots, can_push;
+	int16_t need;
+
+	/* How many main ring entries are needed to this Tx?
+	 * any_layout => number of segments
+	 * default    => number of segments + 1
+	 */
+	can_push = rte_mbuf_refcnt_read(txm) == 1 &&
+		   RTE_MBUF_DIRECT(txm) &&
+		   txm->nb_segs == 1 &&
+		   rte_pktmbuf_headroom(txm) >= hdr_size;
+
+	slots = txm->nb_segs + !can_push;
+	need = slots - vq->vq_free_cnt;
+
+	/* Positive value indicates it need free vring descriptors */
+	if (unlikely(need > 0)) {
+		virtio_xmit_cleanup_packed_vec(vq);
+		need = slots - vq->vq_free_cnt;
+		if (unlikely(need > 0)) {
+			PMD_TX_LOG(ERR,
+				   "No free tx descriptors to transmit");
+			return -1;
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1);
+
+	txvq->stats.bytes += txm->pkt_len;
+	return 0;
+}
+
+uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			uint16_t nb_pkts)
+{
+	struct virtnet_tx *txvq = tx_queue;
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t nb_tx = 0;
+	uint16_t remained;
+
+	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
+		return nb_tx;
+
+	if (unlikely(nb_pkts < 1))
+		return nb_pkts;
+
+	PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+	if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh)
+		virtio_xmit_cleanup_packed_vec(vq);
+
+	remained = RTE_MIN(nb_pkts, vq->vq_free_cnt);
+
+	while (remained) {
+		if (remained >= PACKED_BATCH_SIZE) {
+			if (!virtqueue_enqueue_batch_packed_vec(txvq,
+						&tx_pkts[nb_tx])) {
+				nb_tx += PACKED_BATCH_SIZE;
+				remained -= PACKED_BATCH_SIZE;
+				continue;
+			}
+		}
+		if (!virtqueue_enqueue_single_packed_vec(txvq,
+					tx_pkts[nb_tx])) {
+			nb_tx++;
+			remained--;
+			continue;
+		}
+		break;
+	};
+
+	txvq->stats.packets += nb_tx;
+
+	if (likely(nb_tx)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_TX_LOG(DEBUG, "Notified backend after xmit");
+		}
+	}
+
+	return nb_tx;
+}
+
 /* Optionally fill offload information in structure */
 static inline int
 virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v2 6/7] net/virtio: add election for vectorized datapath
  2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu
                     ` (4 preceding siblings ...)
  2020-03-27 16:54   ` [dpdk-dev] [PATCH v2 5/7] net/virtio: add vectorized packed ring Tx datapath Marvin Liu
@ 2020-03-27 16:54   ` Marvin Liu
  2020-03-27 16:54   ` [dpdk-dev] [PATCH v2 7/7] doc: add packed " Marvin Liu
  6 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Packed ring vectorized datapath will be selected when all of the following
requirements are fulfilled (a compact sketch of the check follows the list):

1. AVX512 is allowed in the config file and supported by the compiler
2. Host CPU supports AVX512F
3. Ring size is a power of two
4. virtio VERSION_1 and in_order features are negotiated
5. LRO and mergeable features are disabled in the Rx datapath
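
A compact sketch of how the list maps onto the check (illustrative only; the
full logic, including the mergeable/LRO fallback, is in the
virtio_dev_configure() hunk below):

#if defined(RTE_ARCH_X86) && defined(CC_AVX512_SUPPORT)	/* req. 1, build time */
	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||	/* req. 2 */
	    !rte_is_power_of_2(vq_size) ||			/* req. 3 */
	    !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) ||	/* req. 4 */
	    !vtpci_with_feature(hw, VIRTIO_F_VERSION_1)) {
		hw->packed_vec_rx = 0;
		hw->packed_vec_tx = 0;
	}
#endif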

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index f9d0ea70d..21570e5cf 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1518,9 +1518,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	if (vtpci_packed_queue(hw)) {
 		PMD_INIT_LOG(INFO,
 			"virtio: using packed ring %s Tx path on port %u",
-			hw->use_inorder_tx ? "inorder" : "standard",
+			hw->packed_vec_tx ? "vectorized" : "standard",
 			eth_dev->data->port_id);
-		eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
+		if (hw->packed_vec_tx)
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec;
+		else
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
 	} else {
 		if (hw->use_inorder_tx) {
 			PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u",
@@ -1534,7 +1537,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+		if (hw->packed_vec_rx) {
+			PMD_INIT_LOG(INFO,
+				"virtio: using packed ring vectorized Rx path on port %u",
+				eth_dev->data->port_id);
+			eth_dev->rx_pkt_burst =
+				&virtio_recv_pkts_packed_vec;
+		} else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
 			PMD_INIT_LOG(INFO,
 				"virtio: using packed ring mergeable buffer Rx path on port %u",
 				eth_dev->data->port_id);
@@ -2159,6 +2168,34 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 
 	hw->use_simple_rx = 1;
 
+	if (vtpci_packed_queue(hw)) {
+#if defined(RTE_ARCH_X86) && defined(CC_AVX512_SUPPORT)
+		unsigned int vq_size;
+		vq_size = VTPCI_OPS(hw)->get_queue_num(hw, 0);
+		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
+		    !rte_is_power_of_2(vq_size) ||
+		    !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) ||
+		    !vtpci_with_feature(hw, VIRTIO_F_VERSION_1)) {
+			hw->packed_vec_rx = 0;
+			hw->packed_vec_tx = 0;
+			PMD_DRV_LOG(INFO, "disabled packed ring vectorized "
+					  "path for requirements are not met");
+		}
+
+		if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+			hw->packed_vec_rx = 0;
+			PMD_DRV_LOG(ERR, "disabled packed ring vectorized rx "
+					 "path for mrg_rxbuf enabled");
+		}
+
+		if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) {
+			hw->packed_vec_rx = 0;
+			PMD_DRV_LOG(ERR, "disabled packed ring vectorized rx "
+					 "path for TCP_LRO enabled");
+		}
+#endif
+	}
+
 	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
 		hw->use_inorder_tx = 1;
 		hw->use_inorder_rx = 1;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v2 7/7] doc: add packed vectorized datapath
  2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu
                     ` (5 preceding siblings ...)
  2020-03-27 16:54   ` [dpdk-dev] [PATCH v2 6/7] net/virtio: add election for vectorized datapath Marvin Liu
@ 2020-03-27 16:54   ` Marvin Liu
  6 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Document the packed virtqueue vectorized datapath selection logic in the
virtio net PMD. Add the packed virtqueue vectorized datapath features to a
new ini file.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/doc/guides/nics/features/virtio-packed_vec.ini b/doc/guides/nics/features/virtio-packed_vec.ini
new file mode 100644
index 000000000..b239bcaad
--- /dev/null
+++ b/doc/guides/nics/features/virtio-packed_vec.ini
@@ -0,0 +1,22 @@
+;
+; Supported features of the 'virtio_packed_vec' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Speed capabilities   = P
+Link status          = Y
+Link status event    = Y
+Rx interrupt         = Y
+Queue start/stop     = Y
+Promiscuous mode     = Y
+Allmulticast mode    = Y
+Unicast MAC filter   = Y
+Multicast MAC filter = Y
+VLAN filter          = Y
+Basic stats          = Y
+Stats per queue      = Y
+BSD nic_uio          = Y
+Linux UIO            = Y
+Linux VFIO           = Y
+x86-64               = Y
diff --git a/doc/guides/nics/features/virtio_vec.ini b/doc/guides/nics/features/virtio-split_vec.ini
similarity index 88%
rename from doc/guides/nics/features/virtio_vec.ini
rename to doc/guides/nics/features/virtio-split_vec.ini
index e60fe36ae..4142fc9f0 100644
--- a/doc/guides/nics/features/virtio_vec.ini
+++ b/doc/guides/nics/features/virtio-split_vec.ini
@@ -1,5 +1,5 @@
 ;
-; Supported features of the 'virtio_vec' network poll mode driver.
+; Supported features of the 'virtio_split_vec' network poll mode driver.
 ;
 ; Refer to default.ini for the full list of available PMD features.
 ;
diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index d1f5fb898..fabe2e400 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -403,6 +403,11 @@ Below devargs are supported by the virtio-user vdev:
     It is used to enable virtio device packed virtqueue feature.
     (Default: 0 (disabled))
 
+#.  ``packed_vec``:
+
+    It is used to enable virtio device packed virtqueue vectorized path.
+    (Default: 1 (enabled))
+
 Virtio paths Selection and Usage
 --------------------------------
 
@@ -454,6 +459,13 @@ according to below configuration:
    both negotiated, this path will be selected.
 #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and
    Rx mergeable is not negotiated, this path will be selected.
+#. Packed virtqueue vectorized Rx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated &&
+   TCP_LRO Rx offloading is disabled && packed_vec option enabled,
+   this path will be selected.
+#. Packed virtqueue vectorized Tx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && packed_vec option enabled,
+   this path will be selected.
 
 Rx/Tx callbacks of each Virtio path
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -476,6 +488,8 @@ are shown in below table:
    Packed virtqueue non-meregable path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed
    Packed virtqueue in-order mergeable path     virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed
    Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed           virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Rx path          virtio_recv_pkts_packed_vec       virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Tx path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed_vec
    ============================================ ================================= ========================
 
 Virtio paths Support Status from Release to Release
@@ -493,20 +507,22 @@ All virtio paths support status are shown in below table:
 
 .. table:: Virtio Paths and Releases
 
-   ============================================ ============= ============= =============
-                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11
-   ============================================ ============= ============= =============
-   Split virtqueue mergeable path                     Y             Y             Y
-   Split virtqueue non-mergeable path                 Y             Y             Y
-   Split virtqueue vectorized Rx path                 Y             Y             Y
-   Split virtqueue simple Tx path                     Y             N             N
-   Split virtqueue in-order mergeable path                          Y             Y
-   Split virtqueue in-order non-mergeable path                      Y             Y
-   Packed virtqueue mergeable path                                                Y
-   Packed virtqueue non-mergeable path                                            Y
-   Packed virtqueue in-order mergeable path                                       Y
-   Packed virtqueue in-order non-mergeable path                                   Y
-   ============================================ ============= ============= =============
+   ============================================ ============= ============= ============= =======
+                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~
+   ============================================ ============= ============= ============= =======
+   Split virtqueue mergeable path                     Y             Y             Y          Y
+   Split virtqueue non-mergeable path                 Y             Y             Y          Y
+   Split virtqueue vectorized Rx path                 Y             Y             Y          Y
+   Split virtqueue simple Tx path                     Y             N             N          N
+   Split virtqueue in-order mergeable path                          Y             Y          Y
+   Split virtqueue in-order non-mergeable path                      Y             Y          Y
+   Packed virtqueue mergeable path                                                Y          Y
+   Packed virtqueue non-mergeable path                                            Y          Y
+   Packed virtqueue in-order mergeable path                                       Y          Y
+   Packed virtqueue in-order non-mergeable path                                   Y          Y
+   Packed virtqueue vectorized Rx path                                                       Y
+   Packed virtqueue vectorized Tx path                                                       Y
+   ============================================ ============= ============= ============= =======
 
 QEMU Support Status
 ~~~~~~~~~~~~~~~~~~~
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/7] net/virtio: add Rx free threshold setting
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 1/7] net/virtio: add Rx free threshold setting Marvin Liu
@ 2020-04-08  6:08     ` Ye Xiaolong
  0 siblings, 0 replies; 162+ messages in thread
From: Ye Xiaolong @ 2020-04-08  6:08 UTC (permalink / raw)
  To: Marvin Liu; +Cc: maxime.coquelin, zhihong.wang, harry.van.haaren, dev

On 04/08, Marvin Liu wrote:
>Introduce free threshold setting in Rx queue, default value of it is 32.
>Limiated threshold size to multiple of four as only vectorized packed Rx

s/Limiated/Limit

>function will utilize it. Virtio driver will rearm Rx queue when more
>than rx_free_thresh descs were dequeued.
>
>Signed-off-by: Marvin Liu <yong.liu@intel.com>
>
>diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
>index 752faa0f6..3a2dbc2e0 100644
>--- a/drivers/net/virtio/virtio_rxtx.c
>+++ b/drivers/net/virtio/virtio_rxtx.c
>@@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
> 	struct virtio_hw *hw = dev->data->dev_private;
> 	struct virtqueue *vq = hw->vqs[vtpci_queue_idx];
> 	struct virtnet_rx *rxvq;
>+	uint16_t rx_free_thresh;
> 
> 	PMD_INIT_FUNC_TRACE();
> 
>@@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
> 		return -EINVAL;
> 	}
> 
>+	rx_free_thresh = rx_conf->rx_free_thresh;
>+	if (rx_free_thresh == 0)
>+		rx_free_thresh =
>+			RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH);
>+
>+	if (rx_free_thresh & 0x3) {
>+		RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four."
>+			" (rx_free_thresh=%u port=%u queue=%u)\n",
>+			rx_free_thresh, dev->data->port_id, queue_idx);
>+		return -EINVAL;
>+	}
>+
>+	if (rx_free_thresh >= vq->vq_nentries) {
>+		RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the "
>+			"number of RX entries (%u)."
>+			" (rx_free_thresh=%u port=%u queue=%u)\n",
>+			vq->vq_nentries,
>+			rx_free_thresh, dev->data->port_id, queue_idx);
>+		return -EINVAL;
>+	}
>+	vq->vq_free_thresh = rx_free_thresh;
>+
> 	if (nb_desc == 0 || nb_desc > vq->vq_nentries)
> 		nb_desc = vq->vq_nentries;
> 	vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc);
>diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
>index 58ad7309a..6301c56b2 100644
>--- a/drivers/net/virtio/virtqueue.h
>+++ b/drivers/net/virtio/virtqueue.h
>@@ -18,6 +18,8 @@
> 
> struct rte_mbuf;
> 
>+#define DEFAULT_RX_FREE_THRESH 32

What about naming it VIRTIO_DEFAULT_RX_FREE_THRESH?

Thanks,
Xiaolong


>+
> /*
>  * Per virtio_ring.h in Linux.
>  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
>-- 
>2.17.1
>

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/7] net/virtio-user: add vectorized packed ring parameter
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 2/7] net/virtio-user: add vectorized packed ring parameter Marvin Liu
@ 2020-04-08  6:22     ` Ye Xiaolong
  2020-04-08  7:31       ` Liu, Yong
  0 siblings, 1 reply; 162+ messages in thread
From: Ye Xiaolong @ 2020-04-08  6:22 UTC (permalink / raw)
  To: Marvin Liu; +Cc: maxime.coquelin, zhihong.wang, harry.van.haaren, dev

On 04/08, Marvin Liu wrote:
>Add new parameter "packed_vec" which can disable vectorized packed ring
>datapath explicitly. When "packed_vec" option is on, driver will check
>packed ring vectorized datapath prerequisites. If any one of them not
>matched, vectorized datapath won't be selected.
>
>Signed-off-by: Marvin Liu <yong.liu@intel.com>
>
>diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
>index 7433d2f08..8103b7a18 100644
>--- a/drivers/net/virtio/virtio_pci.h
>+++ b/drivers/net/virtio/virtio_pci.h
>@@ -251,6 +251,8 @@ struct virtio_hw {
> 	uint8_t	    use_msix;
> 	uint8_t     modern;
> 	uint8_t     use_simple_rx;
>+	uint8_t     packed_vec_rx;
>+	uint8_t     packed_vec_tx;
> 	uint8_t     use_inorder_rx;
> 	uint8_t     use_inorder_tx;
> 	uint8_t     weak_barriers;
>diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
>index e61af4068..399ac5511 100644
>--- a/drivers/net/virtio/virtio_user_ethdev.c
>+++ b/drivers/net/virtio/virtio_user_ethdev.c
>@@ -450,6 +450,8 @@ static const char *valid_args[] = {
> 	VIRTIO_USER_ARG_IN_ORDER,
> #define VIRTIO_USER_ARG_PACKED_VQ      "packed_vq"
> 	VIRTIO_USER_ARG_PACKED_VQ,
>+#define VIRTIO_USER_ARG_PACKED_VEC     "packed_vec"
>+	VIRTIO_USER_ARG_PACKED_VEC,
> 	NULL
> };
> 
>@@ -552,6 +554,8 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
> 	uint64_t mrg_rxbuf = 1;
> 	uint64_t in_order = 1;
> 	uint64_t packed_vq = 0;
>+	uint64_t packed_vec = 0;
>+
> 	char *path = NULL;
> 	char *ifname = NULL;
> 	char *mac_addr = NULL;
>@@ -668,6 +672,15 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
> 		}
> 	}
> 
>+	if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_PACKED_VEC) == 1) {
>+		if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_PACKED_VEC,
>+				       &get_integer_arg, &packed_vec) < 0) {
>+			PMD_INIT_LOG(ERR, "error to parse %s",
>+				     VIRTIO_USER_ARG_PACKED_VQ);
>+			goto end;
>+		}
>+	}
>+
> 	if (queues > 1 && cq == 0) {
> 		PMD_INIT_LOG(ERR, "multi-q requires ctrl-q");
> 		goto end;
>@@ -705,6 +718,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
> 	}
> 
> 	hw = eth_dev->data->dev_private;
>+#if defined(RTE_ARCH_X86) && defined(CC_AVX512_SUPPORT)
>+	if (packed_vec) {
>+		hw->packed_vec_rx = 1;
>+		hw->packed_vec_tx = 1;
>+	}
>+#else
>+	if (packed_vec)
>+		PMD_INIT_LOG(ERR, "building environment not match vectorized "
>+				  "packed ring datapath requirement");

Minor nit:

s/not match/doesn't match/

And better to avoid breaking error message strings across multiple source lines.
It makes it harder to use tools like grep to find errors in source.
E.g. user uses "vectorized packed ring datapath" to grep the code. 

Thanks,
Xiaolong

>+#endif
>+
> 	if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq,
> 			 queue_size, mac_addr, &ifname, server_mode,
> 			 mrg_rxbuf, in_order, packed_vq) < 0) {
>@@ -777,4 +801,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user,
> 	"server=<0|1> "
> 	"mrg_rxbuf=<0|1> "
> 	"in_order=<0|1> "
>-	"packed_vq=<0|1>");
>+	"packed_vq=<0|1>"
>+	"packed_vec=<0|1>");
>-- 
>2.17.1
>

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/7] net/virtio-user: add vectorized packed ring parameter
  2020-04-08  6:22     ` Ye Xiaolong
@ 2020-04-08  7:31       ` Liu, Yong
  0 siblings, 0 replies; 162+ messages in thread
From: Liu, Yong @ 2020-04-08  7:31 UTC (permalink / raw)
  To: Ye, Xiaolong; +Cc: maxime.coquelin, Wang, Zhihong, Van Haaren, Harry, dev



> -----Original Message-----
> From: Ye, Xiaolong <xiaolong.ye@intel.com>
> Sent: Wednesday, April 8, 2020 2:23 PM
> To: Liu, Yong <yong.liu@intel.com>
> Cc: maxime.coquelin@redhat.com; Wang, Zhihong
> <zhihong.wang@intel.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>; dev@dpdk.org
> Subject: Re: [PATCH v3 2/7] net/virtio-user: add vectorized packed ring
> parameter
> 
> On 04/08, Marvin Liu wrote:
> >Add new parameter "packed_vec" which can disable vectorized packed
> ring
> >datapath explicitly. When "packed_vec" option is on, driver will check
> >packed ring vectorized datapath prerequisites. If any one of them not
> >matched, vectorized datapath won't be selected.
> >
> >Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> >diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
> >index 7433d2f08..8103b7a18 100644
> >--- a/drivers/net/virtio/virtio_pci.h
> >+++ b/drivers/net/virtio/virtio_pci.h
> >@@ -251,6 +251,8 @@ struct virtio_hw {
> > 	uint8_t	    use_msix;
> > 	uint8_t     modern;
> > 	uint8_t     use_simple_rx;
> >+	uint8_t     packed_vec_rx;
> >+	uint8_t     packed_vec_tx;
> > 	uint8_t     use_inorder_rx;
> > 	uint8_t     use_inorder_tx;
> > 	uint8_t     weak_barriers;
> >diff --git a/drivers/net/virtio/virtio_user_ethdev.c
> b/drivers/net/virtio/virtio_user_ethdev.c
> >index e61af4068..399ac5511 100644
> >--- a/drivers/net/virtio/virtio_user_ethdev.c
> >+++ b/drivers/net/virtio/virtio_user_ethdev.c
> >@@ -450,6 +450,8 @@ static const char *valid_args[] = {
> > 	VIRTIO_USER_ARG_IN_ORDER,
> > #define VIRTIO_USER_ARG_PACKED_VQ      "packed_vq"
> > 	VIRTIO_USER_ARG_PACKED_VQ,
> >+#define VIRTIO_USER_ARG_PACKED_VEC     "packed_vec"
> >+	VIRTIO_USER_ARG_PACKED_VEC,
> > 	NULL
> > };
> >
> >@@ -552,6 +554,8 @@ virtio_user_pmd_probe(struct rte_vdev_device
> *dev)
> > 	uint64_t mrg_rxbuf = 1;
> > 	uint64_t in_order = 1;
> > 	uint64_t packed_vq = 0;
> >+	uint64_t packed_vec = 0;
> >+
> > 	char *path = NULL;
> > 	char *ifname = NULL;
> > 	char *mac_addr = NULL;
> >@@ -668,6 +672,15 @@ virtio_user_pmd_probe(struct rte_vdev_device
> *dev)
> > 		}
> > 	}
> >
> >+	if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_PACKED_VEC) == 1) {
> >+		if (rte_kvargs_process(kvlist,
> VIRTIO_USER_ARG_PACKED_VEC,
> >+				       &get_integer_arg, &packed_vec) < 0) {
> >+			PMD_INIT_LOG(ERR, "error to parse %s",
> >+				     VIRTIO_USER_ARG_PACKED_VQ);
> >+			goto end;
> >+		}
> >+	}
> >+
> > 	if (queues > 1 && cq == 0) {
> > 		PMD_INIT_LOG(ERR, "multi-q requires ctrl-q");
> > 		goto end;
> >@@ -705,6 +718,17 @@ virtio_user_pmd_probe(struct rte_vdev_device
> *dev)
> > 	}
> >
> > 	hw = eth_dev->data->dev_private;
> >+#if defined(RTE_ARCH_X86) && defined(CC_AVX512_SUPPORT)
> >+	if (packed_vec) {
> >+		hw->packed_vec_rx = 1;
> >+		hw->packed_vec_tx = 1;
> >+	}
> >+#else
> >+	if (packed_vec)
> >+		PMD_INIT_LOG(ERR, "building environment not match
> vectorized "
> >+				  "packed ring datapath requirement");
> 
> Minor nit:
> 
> s/not match/doesn't match/
> 
> And better to avoid breaking error message strings across multiple source
> lines.
> It makes it harder to use tools like grep to find errors in source.
> E.g. user uses "vectorized packed ring datapath" to grep the code.
> 
> Thanks,
> Xiaolong
> 

Thanks for the reminder. Will change it in the next version.

> >+#endif
> >+
> > 	if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq,
> > 			 queue_size, mac_addr, &ifname, server_mode,
> > 			 mrg_rxbuf, in_order, packed_vq) < 0) {
> >@@ -777,4 +801,5 @@
> RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user,
> > 	"server=<0|1> "
> > 	"mrg_rxbuf=<0|1> "
> > 	"in_order=<0|1> "
> >-	"packed_vq=<0|1>");
> >+	"packed_vq=<0|1>"
> >+	"packed_vec=<0|1>");
> >--
> >2.17.1
> >

^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v3 0/7] add packed ring vectorized datapath
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
                   ` (7 preceding siblings ...)
  2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu
@ 2020-04-08  8:53 ` Marvin Liu
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 1/7] net/virtio: add Rx free threshold setting Marvin Liu
                     ` (6 more replies)
  2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu
                   ` (8 subsequent siblings)
  17 siblings, 7 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-08  8:53 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

This patch set introduces a vectorized datapath for the packed ring.

The size of a packed ring descriptor is 16 bytes, so four batched
descriptors fit exactly into one cacheline, which AVX512 instructions can
handle well. The packed ring Tx datapath can be fully transformed into a
vectorized datapath. The Rx datapath can also be vectorized when features
are limited (LRO and mergeable disabled). Users can disable the vectorized
packed ring datapath explicitly via the 'packed_vec' parameter of the
virtio-user vdev.
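
For reference, a minimal standalone sketch of the sizing assumption (the
struct below mirrors the virtio 1.1 spec layout, not the driver's own
header):

#include <stdint.h>

struct pkt_desc_sketch {	/* mirrors struct vring_packed_desc */
	uint64_t addr;		/* buffer address */
	uint32_t len;		/* buffer length */
	uint16_t id;		/* buffer id */
	uint16_t flags;		/* AVAIL/USED/WRITE/NEXT bits */
};

_Static_assert(sizeof(struct pkt_desc_sketch) == 16,
	       "a packed descriptor is 16 bytes");
_Static_assert(4 * sizeof(struct pkt_desc_sketch) == 64,
	       "a batch of four descriptors fills one 64-byte cacheline");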

v3:
1. Remove virtio_net_hdr array for better performance
2. Disable 'packed_vec' by default

v2:
1. More function blocks replaced by vector instructions
2. Clean virtio_net_hdr by vector instruction
3. Allow header room size change
4. Add 'packed_vec' option in virtio_user vdev
5. Fix build not checking whether AVX512 is enabled
6. Doc update

Marvin Liu (7):
  net/virtio: add Rx free threshold setting
  net/virtio-user: add vectorized packed ring parameter
  net/virtio: add vectorized packed ring Rx function
  net/virtio: reuse packed ring xmit functions
  net/virtio: add vectorized packed ring Tx datapath
  net/virtio: add election for vectorized datapath
  doc: add packed vectorized datapath

 .../nics/features/virtio-packed_vec.ini       |  22 +
 .../{virtio_vec.ini => virtio-split_vec.ini}  |   2 +-
 doc/guides/nics/virtio.rst                    |  44 +-
 drivers/net/virtio/Makefile                   |  28 +
 drivers/net/virtio/meson.build                |  11 +
 drivers/net/virtio/virtio_ethdev.c            |  43 +-
 drivers/net/virtio/virtio_ethdev.h            |   6 +
 drivers/net/virtio/virtio_pci.h               |   2 +
 drivers/net/virtio/virtio_rxtx.c              | 201 ++----
 drivers/net/virtio/virtio_rxtx_packed_avx.c   | 637 ++++++++++++++++++
 drivers/net/virtio/virtio_user_ethdev.c       |  27 +-
 drivers/net/virtio/virtqueue.h                | 165 ++++-
 12 files changed, 1006 insertions(+), 182 deletions(-)
 create mode 100644 doc/guides/nics/features/virtio-packed_vec.ini
 rename doc/guides/nics/features/{virtio_vec.ini => virtio-split_vec.ini} (88%)
 create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v3 1/7] net/virtio: add Rx free threshold setting
  2020-04-08  8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu
@ 2020-04-08  8:53   ` Marvin Liu
  2020-04-08  6:08     ` Ye Xiaolong
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 2/7] net/virtio-user: add vectorized packed ring parameter Marvin Liu
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-08  8:53 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Introduce a free threshold setting in the Rx queue; its default value is
32. Limit the threshold to a multiple of four, as only the vectorized
packed Rx function will utilize it. The virtio driver will rearm the Rx
queue once more than rx_free_thresh descriptors have been dequeued.
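
As a hypothetical usage sketch (not part of this patch; names and values
are illustrative), an application would request a custom threshold through
struct rte_eth_rxconf; setup fails with -EINVAL when the value is not a
multiple of four or not smaller than the ring size:

#include <rte_ethdev.h>

static int
setup_rx_queue(uint16_t port_id, uint16_t nb_rx_desc, struct rte_mempool *mp)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rx_conf;

	rte_eth_dev_info_get(port_id, &dev_info);
	rx_conf = dev_info.default_rxconf;
	rx_conf.rx_free_thresh = 64;	/* multiple of four, < nb_rx_desc */

	return rte_eth_rx_queue_setup(port_id, 0, nb_rx_desc,
			rte_eth_dev_socket_id(port_id), &rx_conf, mp);
}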

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 752faa0f6..3a2dbc2e0 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 	struct virtio_hw *hw = dev->data->dev_private;
 	struct virtqueue *vq = hw->vqs[vtpci_queue_idx];
 	struct virtnet_rx *rxvq;
+	uint16_t rx_free_thresh;
 
 	PMD_INIT_FUNC_TRACE();
 
@@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 		return -EINVAL;
 	}
 
+	rx_free_thresh = rx_conf->rx_free_thresh;
+	if (rx_free_thresh == 0)
+		rx_free_thresh =
+			RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH);
+
+	if (rx_free_thresh & 0x3) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+
+	if (rx_free_thresh >= vq->vq_nentries) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the "
+			"number of RX entries (%u)."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			vq->vq_nentries,
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+	vq->vq_free_thresh = rx_free_thresh;
+
 	if (nb_desc == 0 || nb_desc > vq->vq_nentries)
 		nb_desc = vq->vq_nentries;
 	vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc);
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 58ad7309a..6301c56b2 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,8 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_RX_FREE_THRESH 32
+
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v3 2/7] net/virtio-user: add vectorized packed ring parameter
  2020-04-08  8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 1/7] net/virtio: add Rx free threshold setting Marvin Liu
@ 2020-04-08  8:53   ` Marvin Liu
  2020-04-08  6:22     ` Ye Xiaolong
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 3/7] net/virtio: add vectorized packed ring Rx function Marvin Liu
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-08  8:53 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Add a new parameter "packed_vec" which can explicitly disable the
vectorized packed ring datapath. When the "packed_vec" option is on, the
driver will check the packed ring vectorized datapath prerequisites. If
any one of them is not matched, the vectorized datapath won't be selected.
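
As a hypothetical usage sketch (the device name and socket path below are
placeholders, not part of this patch), an application could create such a
port at runtime with the new devarg:

#include <rte_bus_vdev.h>

static int
create_virtio_user_port(void)
{
	/* packed_vq/packed_vec enable the packed ring and its vectorized path */
	return rte_vdev_init("net_virtio_user0",
		"path=/tmp/vhost-user.sock,queues=1,packed_vq=1,packed_vec=1");
}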

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index 7433d2f08..8103b7a18 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -251,6 +251,8 @@ struct virtio_hw {
 	uint8_t	    use_msix;
 	uint8_t     modern;
 	uint8_t     use_simple_rx;
+	uint8_t     packed_vec_rx;
+	uint8_t     packed_vec_tx;
 	uint8_t     use_inorder_rx;
 	uint8_t     use_inorder_tx;
 	uint8_t     weak_barriers;
diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
index e61af4068..399ac5511 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -450,6 +450,8 @@ static const char *valid_args[] = {
 	VIRTIO_USER_ARG_IN_ORDER,
 #define VIRTIO_USER_ARG_PACKED_VQ      "packed_vq"
 	VIRTIO_USER_ARG_PACKED_VQ,
+#define VIRTIO_USER_ARG_PACKED_VEC     "packed_vec"
+	VIRTIO_USER_ARG_PACKED_VEC,
 	NULL
 };
 
@@ -552,6 +554,8 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	uint64_t mrg_rxbuf = 1;
 	uint64_t in_order = 1;
 	uint64_t packed_vq = 0;
+	uint64_t packed_vec = 0;
+
 	char *path = NULL;
 	char *ifname = NULL;
 	char *mac_addr = NULL;
@@ -668,6 +672,15 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		}
 	}
 
+	if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_PACKED_VEC) == 1) {
+		if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_PACKED_VEC,
+				       &get_integer_arg, &packed_vec) < 0) {
+			PMD_INIT_LOG(ERR, "error to parse %s",
+				     VIRTIO_USER_ARG_PACKED_VQ);
+			goto end;
+		}
+	}
+
 	if (queues > 1 && cq == 0) {
 		PMD_INIT_LOG(ERR, "multi-q requires ctrl-q");
 		goto end;
@@ -705,6 +718,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	}
 
 	hw = eth_dev->data->dev_private;
+#if defined(RTE_ARCH_X86) && defined(CC_AVX512_SUPPORT)
+	if (packed_vec) {
+		hw->packed_vec_rx = 1;
+		hw->packed_vec_tx = 1;
+	}
+#else
+	if (packed_vec)
+		PMD_INIT_LOG(ERR, "building environment not match vectorized "
+				  "packed ring datapath requirement");
+#endif
+
 	if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq,
 			 queue_size, mac_addr, &ifname, server_mode,
 			 mrg_rxbuf, in_order, packed_vq) < 0) {
@@ -777,4 +801,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user,
 	"server=<0|1> "
 	"mrg_rxbuf=<0|1> "
 	"in_order=<0|1> "
-	"packed_vq=<0|1>");
+	"packed_vq=<0|1>"
+	"packed_vec=<0|1>");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v3 3/7] net/virtio: add vectorized packed ring Rx function
  2020-04-08  8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 1/7] net/virtio: add Rx free threshold setting Marvin Liu
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 2/7] net/virtio-user: add vectorized packed ring parameter Marvin Liu
@ 2020-04-08  8:53   ` Marvin Liu
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 4/7] net/virtio: reuse packed ring xmit functions Marvin Liu
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-08  8:53 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Optimize the packed ring Rx datapath when AVX512 is enabled and the
mergeable buffer and Rx LRO offloads are not required. The optimization
follows the same approach as vhost: split the datapath into batch and
single functions, with the batch function further optimized by vector
instructions. Also pad the desc extra structure to 16-byte alignment, so
that four elements can be handled in one batch.
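
A minimal standalone sketch of the padding assumption (the field layout
mirrors this patch, not the driver header itself):

#include <stdint.h>

struct desc_extra_sketch {
	void    *cookie;
	uint16_t ndescs;
	uint16_t next;
	uint8_t  padding[4];
} __attribute__((packed, aligned(16)));

_Static_assert(sizeof(struct desc_extra_sketch) == 16,
	       "desc extra entry must stay 16 bytes");
_Static_assert(4 * sizeof(struct desc_extra_sketch) == 64,
	       "a batch of four entries covers one 64-byte cacheline");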

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index efdcb0d93..7bdb87c49 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -37,6 +37,34 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
 endif
 
+ifeq ($(RTE_TOOLCHAIN), gcc)
+ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
+CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), clang)
+ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1)
+CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), icc)
+ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
+CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(findstring RTE_MACHINE_CPUFLAG_AVX512F,$(CFLAGS)),RTE_MACHINE_CPUFLAG_AVX512F)
+ifneq ($(FORCE_DISABLE_AVX512), y)
+CFLAGS += -DCC_AVX512_SUPPORT
+ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
+CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
+endif
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
+endif
+endif
+
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c
diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
index 04c7fdf25..652ae39af 100644
--- a/drivers/net/virtio/meson.build
+++ b/drivers/net/virtio/meson.build
@@ -11,6 +11,17 @@ deps += ['kvargs', 'bus_pci']
 
 if arch_subdir == 'x86'
 	sources += files('virtio_rxtx_simple_sse.c')
+	if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F')
+		cflags += ['-DCC_AVX512_SUPPORT']
+		if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
+			cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
+		elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
+			cflags += '-DVHOST_CLANG_UNROLL_PRAGMA'
+		elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
+			cflags += '-DVHOST_ICC_UNROLL_PRAGMA'
+		endif
+		sources += files('virtio_rxtx_packed_avx.c')
+	endif
 elif arch_subdir == 'ppc_64'
 	sources += files('virtio_rxtx_simple_altivec.c')
 elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64')
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index cd8947656..10e39670e 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -104,6 +104,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts,
 uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 3a2dbc2e0..ac417232b 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -1245,7 +1245,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
 	return 0;
 }
 
-#define VIRTIO_MBUF_BURST_SZ 64
 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc))
 uint16_t
 virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
@@ -2328,3 +2327,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
 
 	return nb_tx;
 }
+
+__rte_weak uint16_t
+virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue,
+			    struct rte_mbuf __rte_unused **rx_pkts,
+			    uint16_t __rte_unused nb_pkts)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
new file mode 100644
index 000000000..f2976b98f
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -0,0 +1,358 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <rte_net.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtio_pci.h"
+#include "virtqueue.h"
+
+#define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63)
+
+#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
+	sizeof(struct vring_packed_desc))
+#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
+
+#ifdef VIRTIO_GCC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_ICC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifndef virtio_for_each_try_unroll
+#define virtio_for_each_try_unroll(iter, val, num) \
+	for (iter = val; iter < num; iter++)
+#endif
+
+
+static inline void
+virtio_update_batch_stats(struct virtnet_stats *stats,
+			  uint16_t pkt_len1,
+			  uint16_t pkt_len2,
+			  uint16_t pkt_len3,
+			  uint16_t pkt_len4)
+{
+	stats->bytes += pkt_len1;
+	stats->bytes += pkt_len2;
+	stats->bytes += pkt_len3;
+	stats->bytes += pkt_len4;
+}
+/* Optionally fill offload information in structure */
+static inline int
+virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
+{
+	struct rte_net_hdr_lens hdr_lens;
+	uint32_t hdrlen, ptype;
+	int l4_supported = 0;
+
+	/* nothing to do */
+	if (hdr->flags == 0)
+		return 0;
+
+	/* GSO not support in vec path, skip check */
+	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
+
+	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
+	m->packet_type = ptype;
+	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
+		l4_supported = 1;
+
+	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
+		if (hdr->csum_start <= hdrlen && l4_supported) {
+			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
+		} else {
+			/* Unknown proto or tunnel, do sw cksum. We can assume
+			 * the cksum field is in the first segment since the
+			 * buffers we provided to the host are large enough.
+			 * In case of SCTP, this will be wrong since it's a CRC
+			 * but there's nothing we can do.
+			 */
+			uint16_t csum = 0, off;
+
+			rte_raw_cksum_mbuf(m, hdr->csum_start,
+				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
+				&csum);
+			if (likely(csum != 0xffff))
+				csum = ~csum;
+			off = hdr->csum_offset + hdr->csum_start;
+			if (rte_pktmbuf_data_len(m) >= off + 1)
+				*rte_pktmbuf_mtod_offset(m, uint16_t *,
+					off) = csum;
+		}
+	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
+		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
+	}
+
+	return 0;
+}
+
+static uint16_t
+virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq,
+				   struct rte_mbuf **rx_pkts)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint64_t addrs[PACKED_BATCH_SIZE << 1];
+	uint16_t id = vq->vq_used_cons_idx;
+	uint8_t desc_stats;
+	uint16_t i;
+	void *desc_addr;
+
+	if (id & PACKED_BATCH_MASK)
+		return -1;
+
+	/* only care avail/used bits */
+	__m512i desc_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+	desc_addr = &vq->vq_packed.ring.desc[id];
+
+	rte_smp_rmb();
+	__m512i packed_desc = _mm512_loadu_si512(desc_addr);
+	__m512i flags_mask  = _mm512_maskz_and_epi64(0xff, packed_desc,
+			desc_flags);
+
+	__m512i used_flags;
+	if (vq->vq_packed.used_wrap_counter)
+		used_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+	else
+		used_flags = _mm512_setzero_si512();
+
+	/* Check all descs are used */
+	desc_stats = _mm512_cmp_epu64_mask(flags_mask, used_flags,
+			_MM_CMPINT_EQ);
+	if (desc_stats != 0xff)
+		return -1;
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie;
+		rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *));
+
+		addrs[i << 1] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1;
+		addrs[(i << 1) + 1] =
+			(uint64_t)rx_pkts[i]->rx_descriptor_fields1 + 8;
+	}
+
+	/* addresses of pkt_len and data_len */
+	__m512i vindex = _mm512_loadu_si512((void *)addrs);
+
+	/*
+	 * select 10b*4 load 32bit from packed_desc[95:64]
+	 * mmask  0110b*4 save 32bit into pkt_len and data_len
+	 */
+	__m512i value = _mm512_maskz_shuffle_epi32(0x6666, packed_desc, 0xAA);
+
+	/* mmask 0110b*4 reduce hdr_len from pkt_len and data_len */
+	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(0x6666,
+			(uint32_t)-hdr_size);
+
+	value = _mm512_add_epi32(value, mbuf_len_offset);
+	/* batch store into mbufs */
+	_mm512_i64scatter_epi64(0, vindex, value, 1);
+
+	if (hw->has_rx_offload) {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			char *addr = (char *)rx_pkts[i]->buf_addr +
+				RTE_PKTMBUF_HEADROOM - hdr_size;
+			virtio_vec_rx_offload(rx_pkts[i],
+					(struct virtio_net_hdr *)addr);
+		}
+	}
+
+	virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len,
+			rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len,
+			rx_pkts[3]->pkt_len);
+
+	vq->vq_free_cnt += PACKED_BATCH_SIZE;
+
+	vq->vq_used_cons_idx += PACKED_BATCH_SIZE;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static uint16_t
+virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq,
+				    struct rte_mbuf **rx_pkts)
+{
+	uint16_t used_idx, id;
+	uint32_t len;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint32_t hdr_size = hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	struct vring_packed_desc *desc;
+	struct rte_mbuf *cookie;
+
+	desc = vq->vq_packed.ring.desc;
+	used_idx = vq->vq_used_cons_idx;
+	if (!desc_is_used(&desc[used_idx], vq))
+		return -1;
+
+	len = desc[used_idx].len;
+	id = desc[used_idx].id;
+	cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie;
+	if (unlikely(cookie == NULL)) {
+		PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u",
+				vq->vq_used_cons_idx);
+		return -1;
+	}
+	rte_prefetch0(cookie);
+	rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
+
+	cookie->data_off = RTE_PKTMBUF_HEADROOM;
+	cookie->ol_flags = 0;
+	cookie->pkt_len = (uint32_t)(len - hdr_size);
+	cookie->data_len = (uint32_t)(len - hdr_size);
+
+	hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr +
+					RTE_PKTMBUF_HEADROOM - hdr_size);
+	if (hw->has_rx_offload)
+		virtio_vec_rx_offload(cookie, hdr);
+
+	*rx_pkts = cookie;
+
+	rxvq->stats.bytes += cookie->pkt_len;
+
+	vq->vq_free_cnt++;
+	vq->vq_used_cons_idx++;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static inline void
+virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq,
+			      struct rte_mbuf **cookie,
+			      uint16_t num)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc;
+	uint16_t flags = vq->vq_packed.cached_flags;
+	struct virtio_hw *hw = vq->hw;
+	struct vq_desc_extra *dxp;
+	uint16_t idx, i;
+	uint16_t total_num = 0;
+	uint16_t head_idx = vq->vq_avail_idx;
+	uint16_t head_flag = vq->vq_packed.cached_flags;
+	uint64_t addr;
+
+	do {
+		idx = vq->vq_avail_idx;
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			dxp = &vq->vq_descx[idx + i];
+			dxp->cookie = (void *)cookie[total_num + i];
+
+			addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) +
+				RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size;
+			start_dp[idx + i].addr = addr;
+			start_dp[idx + i].len = cookie[total_num + i]->buf_len
+				- RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
+			if (total_num || i) {
+				virtqueue_store_flags_packed(&start_dp[idx + i],
+						flags, hw->weak_barriers);
+			}
+		}
+
+		vq->vq_avail_idx += PACKED_BATCH_SIZE;
+		if (vq->vq_avail_idx >= vq->vq_nentries) {
+			vq->vq_avail_idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+			flags = vq->vq_packed.cached_flags;
+		}
+		total_num += PACKED_BATCH_SIZE;
+	} while (total_num < num);
+
+	virtqueue_store_flags_packed(&start_dp[head_idx], head_flag,
+				hw->weak_barriers);
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num);
+}
+
+uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue,
+			    struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct virtnet_rx *rxvq = rx_queue;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t num, nb_rx = 0;
+	uint32_t nb_enqueued = 0;
+	uint16_t free_cnt = vq->vq_free_thresh;
+
+	if (unlikely(hw->started == 0))
+		return nb_rx;
+
+	num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts);
+	if (likely(num > PACKED_BATCH_SIZE))
+		num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE);
+
+	while (num) {
+		if (!virtqueue_dequeue_batch_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx += PACKED_BATCH_SIZE;
+			num -= PACKED_BATCH_SIZE;
+			continue;
+		}
+		if (!virtqueue_dequeue_single_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx++;
+			num--;
+			continue;
+		}
+		break;
+	};
+
+	PMD_RX_LOG(DEBUG, "dequeue:%d", num);
+
+	rxvq->stats.packets += nb_rx;
+
+	if (likely(vq->vq_free_cnt >= free_cnt)) {
+		struct rte_mbuf *new_pkts[free_cnt];
+		if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts,
+						free_cnt) == 0)) {
+			virtio_recv_refill_packed_vec(rxvq, new_pkts,
+					free_cnt);
+			nb_enqueued += free_cnt;
+		} else {
+			struct rte_eth_dev *dev =
+				&rte_eth_devices[rxvq->port_id];
+			dev->data->rx_mbuf_alloc_failed += free_cnt;
+		}
+	}
+
+	if (likely(nb_enqueued)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_RX_LOG(DEBUG, "Notified");
+		}
+	}
+
+	return nb_rx;
+}
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 6301c56b2..43e305ecc 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -20,6 +20,7 @@ struct rte_mbuf;
 
 #define DEFAULT_RX_FREE_THRESH 32
 
+#define VIRTIO_MBUF_BURST_SZ 64
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
@@ -236,7 +237,8 @@ struct vq_desc_extra {
 	void *cookie;
 	uint16_t ndescs;
 	uint16_t next;
-};
+	uint8_t padding[4];
+} __rte_packed __rte_aligned(16);
 
 struct virtqueue {
 	struct virtio_hw  *hw; /**< virtio_hw structure pointer. */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v3 4/7] net/virtio: reuse packed ring xmit functions
  2020-04-08  8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu
                     ` (2 preceding siblings ...)
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 3/7] net/virtio: add vectorized packed ring Rx function Marvin Liu
@ 2020-04-08  8:53   ` Marvin Liu
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 5/7] net/virtio: add vectorized packed ring Tx datapath Marvin Liu
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-08  8:53 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Move the xmit offload and packed ring xmit enqueue functions to the header
file. These functions will be reused by the packed ring vectorized Tx
function.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index ac417232b..b8b4d3c25 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq,
 	return i;
 }
 
-#ifndef DEFAULT_TX_FREE_THRESH
-#define DEFAULT_TX_FREE_THRESH 32
-#endif
-
 static void
 virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num)
 {
@@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m)
 }
 
 
-/* avoid write operation when necessary, to lessen cache issues */
-#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
-	if ((var) != (val))			\
-		(var) = (val);			\
-} while (0)
-
-#define virtqueue_clear_net_hdr(_hdr) do {		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0);		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0);	\
-} while (0)
-
-static inline void
-virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
-			struct rte_mbuf *cookie,
-			bool offload)
-{
-	if (offload) {
-		if (cookie->ol_flags & PKT_TX_TCP_SEG)
-			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
-
-		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
-		case PKT_TX_UDP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_udp_hdr,
-				dgram_cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		case PKT_TX_TCP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		default:
-			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
-			break;
-		}
 
-		/* TCP Segmentation Offload */
-		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
-			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
-				VIRTIO_NET_HDR_GSO_TCPV6 :
-				VIRTIO_NET_HDR_GSO_TCPV4;
-			hdr->gso_size = cookie->tso_segsz;
-			hdr->hdr_len =
-				cookie->l2_len +
-				cookie->l3_len +
-				cookie->l4_len;
-		} else {
-			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
-		}
-	}
-}
 
 static inline void
 virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq,
@@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq,
 	virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers);
 }
 
-static inline void
-virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
-			      uint16_t needed, int can_push, int in_order)
-{
-	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
-	struct vq_desc_extra *dxp;
-	struct virtqueue *vq = txvq->vq;
-	struct vring_packed_desc *start_dp, *head_dp;
-	uint16_t idx, id, head_idx, head_flags;
-	int16_t head_size = vq->hw->vtnet_hdr_size;
-	struct virtio_net_hdr *hdr;
-	uint16_t prev;
-	bool prepend_header = false;
-
-	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
-
-	dxp = &vq->vq_descx[id];
-	dxp->ndescs = needed;
-	dxp->cookie = cookie;
-
-	head_idx = vq->vq_avail_idx;
-	idx = head_idx;
-	prev = head_idx;
-	start_dp = vq->vq_packed.ring.desc;
-
-	head_dp = &vq->vq_packed.ring.desc[idx];
-	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-	head_flags |= vq->vq_packed.cached_flags;
-
-	if (can_push) {
-		/* prepend cannot fail, checked by caller */
-		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
-					      -head_size);
-		prepend_header = true;
-
-		/* if offload disabled, it is not zeroed below, do it now */
-		if (!vq->hw->has_tx_offload)
-			virtqueue_clear_net_hdr(hdr);
-	} else {
-		/* setup first tx ring slot to point to header
-		 * stored in reserved region.
-		 */
-		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
-			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
-		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
-		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	}
-
-	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
-
-	do {
-		uint16_t flags;
-
-		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
-		start_dp[idx].len  = cookie->data_len;
-		if (prepend_header) {
-			start_dp[idx].addr -= head_size;
-			start_dp[idx].len += head_size;
-			prepend_header = false;
-		}
-
-		if (likely(idx != head_idx)) {
-			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-			flags |= vq->vq_packed.cached_flags;
-			start_dp[idx].flags = flags;
-		}
-		prev = idx;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	} while ((cookie = cookie->next) != NULL);
-
-	start_dp[prev].id = id;
-
-	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
-	vq->vq_avail_idx = idx;
-
-	if (!in_order) {
-		vq->vq_desc_head_idx = dxp->next;
-		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
-			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
-	}
-
-	virtqueue_store_flags_packed(head_dp, head_flags,
-				     vq->hw->weak_barriers);
-}
-
 static inline void
 virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 			uint16_t needed, int use_indirect, int can_push,
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 43e305ecc..31c48710c 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,7 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_TX_FREE_THRESH 32
 #define DEFAULT_RX_FREE_THRESH 32
 
 #define VIRTIO_MBUF_BURST_SZ 64
@@ -562,4 +563,162 @@ virtqueue_notify(struct virtqueue *vq)
 #define VIRTQUEUE_DUMP(vq) do { } while (0)
 #endif
 
+/* avoid write operation when necessary, to lessen cache issues */
+#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
+	if ((var) != (val))			\
+		(var) = (val);			\
+} while (0)
+
+#define virtqueue_clear_net_hdr(_hdr) do {		\
+	ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0);		\
+	ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0);	\
+} while (0)
+
+static inline void
+virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
+			struct rte_mbuf *cookie,
+			bool offload)
+{
+	if (offload) {
+		if (cookie->ol_flags & PKT_TX_TCP_SEG)
+			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
+
+		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
+		case PKT_TX_UDP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_udp_hdr,
+				dgram_cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		case PKT_TX_TCP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		default:
+			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
+			break;
+		}
+
+		/* TCP Segmentation Offload */
+		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
+			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
+				VIRTIO_NET_HDR_GSO_TCPV6 :
+				VIRTIO_NET_HDR_GSO_TCPV4;
+			hdr->gso_size = cookie->tso_segsz;
+			hdr->hdr_len =
+				cookie->l2_len +
+				cookie->l3_len +
+				cookie->l4_len;
+		} else {
+			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
+		}
+	}
+}
+
+static inline void
+virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
+			      uint16_t needed, int can_push, int in_order)
+{
+	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
+	struct vq_desc_extra *dxp;
+	struct virtqueue *vq = txvq->vq;
+	struct vring_packed_desc *start_dp, *head_dp;
+	uint16_t idx, id, head_idx, head_flags;
+	int16_t head_size = vq->hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	uint16_t prev;
+	bool prepend_header = false;
+
+	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
+
+	dxp = &vq->vq_descx[id];
+	dxp->ndescs = needed;
+	dxp->cookie = cookie;
+
+	head_idx = vq->vq_avail_idx;
+	idx = head_idx;
+	prev = head_idx;
+	start_dp = vq->vq_packed.ring.desc;
+
+	head_dp = &vq->vq_packed.ring.desc[idx];
+	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+	head_flags |= vq->vq_packed.cached_flags;
+
+	if (can_push) {
+		/* prepend cannot fail, checked by caller */
+		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
+					      -head_size);
+		prepend_header = true;
+
+		/* if offload disabled, it is not zeroed below, do it now */
+		if (!vq->hw->has_tx_offload)
+			virtqueue_clear_net_hdr(hdr);
+	} else {
+		/* setup first tx ring slot to point to header
+		 * stored in reserved region.
+		 */
+		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
+			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
+		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
+		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	}
+
+	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
+
+	do {
+		uint16_t flags;
+
+		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
+		start_dp[idx].len  = cookie->data_len;
+		if (prepend_header) {
+			start_dp[idx].addr -= head_size;
+			start_dp[idx].len += head_size;
+			prepend_header = false;
+		}
+
+		if (likely(idx != head_idx)) {
+			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+			flags |= vq->vq_packed.cached_flags;
+			start_dp[idx].flags = flags;
+		}
+		prev = idx;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	} while ((cookie = cookie->next) != NULL);
+
+	start_dp[prev].id = id;
+
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
+	vq->vq_avail_idx = idx;
+
+	if (!in_order) {
+		vq->vq_desc_head_idx = dxp->next;
+		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
+			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
+	}
+
+	virtqueue_store_flags_packed(head_dp, head_flags,
+				     vq->hw->weak_barriers);
+}
 #endif /* _VIRTQUEUE_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v3 5/7] net/virtio: add vectorized packed ring Tx datapath
  2020-04-08  8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu
                     ` (3 preceding siblings ...)
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 4/7] net/virtio: reuse packed ring xmit functions Marvin Liu
@ 2020-04-08  8:53   ` Marvin Liu
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 6/7] net/virtio: add election for vectorized datapath Marvin Liu
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 7/7] doc: add packed " Marvin Liu
  6 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-08  8:53 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Optimize the packed ring Tx datapath in the same way as the Rx datapath:
split it into batch and single Tx functions, with the batch function
further optimized by vector instructions.
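
A minimal sketch of the rearm-data encoding used by the batch check in
this patch (the bit offsets are an assumption mirroring the patch and the
current rte_mbuf layout):

#include <stdint.h>

/* refcnt sits at bits 16..31 and nb_segs at bits 32..47 of the 64-bit
 * mbuf rearm_data word, so refcnt=1 / nb_segs=1 can be encoded once and
 * compared against four mbufs with a single vector compare.
 */
#define REF_CNT_OFFSET 16
#define SEG_NUM_OFFSET 32
#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_OFFSET | 1ULL << REF_CNT_OFFSET)

_Static_assert(DEFAULT_REARM_DATA == 0x0000000100010000ULL,
	       "refcnt=1 at bit 16, nb_segs=1 at bit 32");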

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 10e39670e..c9aaef0af 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -107,6 +107,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index b8b4d3c25..125df3a13 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -2174,3 +2174,11 @@ virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue,
 {
 	return 0;
 }
+
+__rte_weak uint16_t
+virtio_xmit_pkts_packed_vec(void __rte_unused *tx_queue,
+			    struct rte_mbuf __rte_unused **tx_pkts,
+			    uint16_t __rte_unused nb_pkts)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
index f2976b98f..fb26fe5f3 100644
--- a/drivers/net/virtio/virtio_rxtx_packed_avx.c
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -15,6 +15,21 @@
 #include "virtio_pci.h"
 #include "virtqueue.h"
 
+/* reference count offset in mbuf rearm data */
+#define REF_CNT_OFFSET 16
+/* segment number offset in mbuf rearm data */
+#define SEG_NUM_OFFSET 32
+
+#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_OFFSET | \
+			  1ULL << REF_CNT_OFFSET)
+/* id offset in packed ring desc higher 64bits */
+#define ID_OFFSET 32
+/* flag offset in packed ring desc higher 64bits */
+#define FLAG_OFFSET 48
+
+/* net hdr short size mask */
+#define NET_HDR_MASK 0x1F
+
 #define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63)
 
 #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
@@ -41,6 +56,47 @@
 	for (iter = val; iter < num; iter++)
 #endif
 
+static void
+virtio_xmit_cleanup_packed_vec(struct virtqueue *vq)
+{
+	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
+	struct vq_desc_extra *dxp;
+	uint16_t used_idx, id, curr_id, free_cnt = 0;
+	uint16_t size = vq->vq_nentries;
+	struct rte_mbuf *mbufs[size];
+	uint16_t nb_mbuf = 0, i;
+
+	used_idx = vq->vq_used_cons_idx;
+
+	if (!desc_is_used(&desc[used_idx], vq))
+		return;
+
+	id = desc[used_idx].id;
+
+	do {
+		curr_id = used_idx;
+		dxp = &vq->vq_descx[used_idx];
+		used_idx += dxp->ndescs;
+		free_cnt += dxp->ndescs;
+
+		if (dxp->cookie != NULL) {
+			mbufs[nb_mbuf] = dxp->cookie;
+			dxp->cookie = NULL;
+			nb_mbuf++;
+		}
+
+		if (used_idx >= size) {
+			used_idx -= size;
+			vq->vq_packed.used_wrap_counter ^= 1;
+		}
+	} while (curr_id != id);
+
+	for (i = 0; i < nb_mbuf; i++)
+		rte_pktmbuf_free(mbufs[i]);
+
+	vq->vq_used_cons_idx = used_idx;
+	vq->vq_free_cnt += free_cnt;
+}
 
 static inline void
 virtio_update_batch_stats(struct virtnet_stats *stats,
@@ -54,6 +110,229 @@ virtio_update_batch_stats(struct virtnet_stats *stats,
 	stats->bytes += pkt_len3;
 	stats->bytes += pkt_len4;
 }
+
+static inline int
+virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq,
+				   struct rte_mbuf **tx_pkts)
+{
+	struct virtqueue *vq = txvq->vq;
+	uint16_t head_size = vq->hw->vtnet_hdr_size;
+	uint16_t idx = vq->vq_avail_idx;
+	struct virtio_net_hdr *hdr;
+	uint16_t i, cmp;
+
+	if (vq->vq_avail_idx & PACKED_BATCH_MASK)
+		return -1;
+
+	/* Load four mbufs rearm data */
+	__m256i mbufs = _mm256_set_epi64x(
+			*tx_pkts[3]->rearm_data,
+			*tx_pkts[2]->rearm_data,
+			*tx_pkts[1]->rearm_data,
+			*tx_pkts[0]->rearm_data);
+
+	/* refcnt=1 and nb_segs=1 */
+	__m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
+	__m256i head_rooms = _mm256_set1_epi16(head_size);
+
+	/* Check refcnt and nb_segs */
+	cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
+	if (cmp & 0x6666)
+		return -1;
+
+	/* Check headroom is enough */
+	cmp = _mm256_mask_cmp_epu16_mask(0x1111, mbufs, head_rooms,
+			_MM_CMPINT_LT);
+	if (unlikely(cmp))
+		return -1;
+
+	__m512i dxps = _mm512_set_epi64(
+			0x1, (uint64_t)tx_pkts[3],
+			0x1, (uint64_t)tx_pkts[2],
+			0x1, (uint64_t)tx_pkts[1],
+			0x1, (uint64_t)tx_pkts[0]);
+
+	_mm512_storeu_si512((void *)&vq->vq_descx[idx], dxps);
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		tx_pkts[i]->data_off -= head_size;
+		tx_pkts[i]->data_len += head_size;
+	}
+
+#ifdef RTE_VIRTIO_USER
+	__m512i descs_base = _mm512_set_epi64(
+			tx_pkts[3]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])),
+			tx_pkts[2]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])),
+			tx_pkts[1]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])),
+			tx_pkts[0]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0])));
+#else
+	__m512i descs_base = _mm512_set_epi64(
+			tx_pkts[3]->data_len, tx_pkts[3]->buf_iova,
+			tx_pkts[2]->data_len, tx_pkts[2]->buf_iova,
+			tx_pkts[1]->data_len, tx_pkts[1]->buf_iova,
+			tx_pkts[0]->data_len, tx_pkts[0]->buf_iova);
+#endif
+
+	/* id offset and data offset */
+	__m512i data_offsets = _mm512_set_epi64(
+			(uint64_t)3 << ID_OFFSET, tx_pkts[3]->data_off,
+			(uint64_t)2 << ID_OFFSET, tx_pkts[2]->data_off,
+			(uint64_t)1 << ID_OFFSET, tx_pkts[1]->data_off,
+			0, tx_pkts[0]->data_off);
+
+	__m512i new_descs = _mm512_add_epi64(descs_base, data_offsets);
+
+	uint64_t flags_temp = (uint64_t)idx << ID_OFFSET |
+		(uint64_t)vq->vq_packed.cached_flags << FLAG_OFFSET;
+
+	/* flags offset and guest virtual address offset */
+#ifdef RTE_VIRTIO_USER
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset);
+#else
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, 0);
+#endif
+	__m512i flag_offsets = _mm512_broadcast_i32x4(flag_offset);
+
+	__m512i descs = _mm512_add_epi64(new_descs, flag_offsets);
+
+	if (!vq->hw->has_tx_offload) {
+		__m128i mask = _mm_set1_epi16(0xFFFF);
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			__m128i v_hdr = _mm_loadu_si128((void *)hdr);
+			if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK,
+							v_hdr, mask))) {
+				__m128i all_zero = _mm_setzero_si128();
+				_mm_mask_storeu_epi16((void *)hdr,
+						NET_HDR_MASK, all_zero);
+			}
+		}
+	} else {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			virtqueue_xmit_offload(hdr, tx_pkts[i], true);
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	rte_smp_wmb();
+	_mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], descs);
+
+	virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len,
+			tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len,
+			tx_pkts[3]->pkt_len);
+
+	vq->vq_avail_idx += PACKED_BATCH_SIZE;
+	vq->vq_free_cnt -= PACKED_BATCH_SIZE;
+
+	if (vq->vq_avail_idx >= vq->vq_nentries) {
+		vq->vq_avail_idx -= vq->vq_nentries;
+		vq->vq_packed.cached_flags ^=
+			VRING_PACKED_DESC_F_AVAIL_USED;
+	}
+
+	return 0;
+}
+
+static inline int
+virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq,
+				    struct rte_mbuf *txm)
+{
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint16_t slots, can_push;
+	int16_t need;
+
+	/* How many main ring entries are needed to this Tx?
+	 * any_layout => number of segments
+	 * default    => number of segments + 1
+	 */
+	can_push = rte_mbuf_refcnt_read(txm) == 1 &&
+		   RTE_MBUF_DIRECT(txm) &&
+		   txm->nb_segs == 1 &&
+		   rte_pktmbuf_headroom(txm) >= hdr_size;
+
+	slots = txm->nb_segs + !can_push;
+	need = slots - vq->vq_free_cnt;
+
+	/* Positive value indicates it need free vring descriptors */
+	if (unlikely(need > 0)) {
+		virtio_xmit_cleanup_packed_vec(vq);
+		need = slots - vq->vq_free_cnt;
+		if (unlikely(need > 0)) {
+			PMD_TX_LOG(ERR,
+				   "No free tx descriptors to transmit");
+			return -1;
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1);
+
+	txvq->stats.bytes += txm->pkt_len;
+	return 0;
+}
+
+uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			uint16_t nb_pkts)
+{
+	struct virtnet_tx *txvq = tx_queue;
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t nb_tx = 0;
+	uint16_t remained;
+
+	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
+		return nb_tx;
+
+	if (unlikely(nb_pkts < 1))
+		return nb_pkts;
+
+	PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+	if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh)
+		virtio_xmit_cleanup_packed_vec(vq);
+
+	remained = RTE_MIN(nb_pkts, vq->vq_free_cnt);
+
+	while (remained) {
+		if (remained >= PACKED_BATCH_SIZE) {
+			if (!virtqueue_enqueue_batch_packed_vec(txvq,
+						&tx_pkts[nb_tx])) {
+				nb_tx += PACKED_BATCH_SIZE;
+				remained -= PACKED_BATCH_SIZE;
+				continue;
+			}
+		}
+		if (!virtqueue_enqueue_single_packed_vec(txvq,
+					tx_pkts[nb_tx])) {
+			nb_tx++;
+			remained--;
+			continue;
+		}
+		break;
+	};
+
+	txvq->stats.packets += nb_tx;
+
+	if (likely(nb_tx)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_TX_LOG(DEBUG, "Notified backend after xmit");
+		}
+	}
+
+	return nb_tx;
+}
+
 /* Optionally fill offload information in structure */
 static inline int
 virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v3 6/7] net/virtio: add election for vectorized datapath
  2020-04-08  8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu
                     ` (4 preceding siblings ...)
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 5/7] net/virtio: add vectorized packed ring Tx datapath Marvin Liu
@ 2020-04-08  8:53   ` Marvin Liu
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 7/7] doc: add packed " Marvin Liu
  6 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-08  8:53 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

The packed ring vectorized datapath will be selected when all of the
following criteria are met (a sketch follows the list):

1. AVX512 is enabled in dpdk config and supported by compiler
2. Host cpu has AVX512F flag
3. Ring size is power of two
4. virtio VERSION_1 and IN_ORDER features are negotiated
5. LRO and mergeable are disabled in Rx datapath
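
A hedged sketch of that election (parameter names are illustrative
stand-ins for the negotiated features; the real checks live in
virtio_dev_configure() in this patch):

#include <rte_common.h>
#include <rte_cpuflags.h>

static inline int
packed_vec_rx_allowed(unsigned int vq_size, int has_version_1,
		      int has_in_order, int lro_enabled, int mrg_rxbuf)
{
	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F))
		return 0;			/* criterion 2 */
	if (!rte_is_power_of_2(vq_size))
		return 0;			/* criterion 3 */
	if (!has_version_1 || !has_in_order)
		return 0;			/* criterion 4 */
	if (lro_enabled || mrg_rxbuf)
		return 0;			/* criterion 5, Rx path only */
	return 1;	/* criterion 1 is a compile-time (build) check */
}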

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index f9d0ea70d..21570e5cf 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1518,9 +1518,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	if (vtpci_packed_queue(hw)) {
 		PMD_INIT_LOG(INFO,
 			"virtio: using packed ring %s Tx path on port %u",
-			hw->use_inorder_tx ? "inorder" : "standard",
+			hw->packed_vec_tx ? "vectorized" : "standard",
 			eth_dev->data->port_id);
-		eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
+		if (hw->packed_vec_tx)
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec;
+		else
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
 	} else {
 		if (hw->use_inorder_tx) {
 			PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u",
@@ -1534,7 +1537,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+		if (hw->packed_vec_rx) {
+			PMD_INIT_LOG(INFO,
+				"virtio: using packed ring vectorized Rx path on port %u",
+				eth_dev->data->port_id);
+			eth_dev->rx_pkt_burst =
+				&virtio_recv_pkts_packed_vec;
+		} else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
 			PMD_INIT_LOG(INFO,
 				"virtio: using packed ring mergeable buffer Rx path on port %u",
 				eth_dev->data->port_id);
@@ -2159,6 +2168,34 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 
 	hw->use_simple_rx = 1;
 
+	if (vtpci_packed_queue(hw)) {
+#if defined(RTE_ARCH_X86) && defined(CC_AVX512_SUPPORT)
+		unsigned int vq_size;
+		vq_size = VTPCI_OPS(hw)->get_queue_num(hw, 0);
+		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
+		    !rte_is_power_of_2(vq_size) ||
+		    !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) ||
+		    !vtpci_with_feature(hw, VIRTIO_F_VERSION_1)) {
+			hw->packed_vec_rx = 0;
+			hw->packed_vec_tx = 0;
+			PMD_DRV_LOG(INFO, "disabled packed ring vectorized "
+					  "path for requirements are not met");
+		}
+
+		if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+			hw->packed_vec_rx = 0;
+			PMD_DRV_LOG(ERR, "disabled packed ring vectorized rx "
+					 "path for mrg_rxbuf enabled");
+		}
+
+		if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) {
+			hw->packed_vec_rx = 0;
+			PMD_DRV_LOG(ERR, "disabled packed ring vectorized rx "
+					 "path for TCP_LRO enabled");
+		}
+#endif
+	}
+
 	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
 		hw->use_inorder_tx = 1;
 		hw->use_inorder_rx = 1;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v3 7/7] doc: add packed vectorized datapath
  2020-04-08  8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu
                     ` (5 preceding siblings ...)
  2020-04-08  8:53   ` [dpdk-dev] [PATCH v3 6/7] net/virtio: add election for vectorized datapath Marvin Liu
@ 2020-04-08  8:53   ` Marvin Liu
  6 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-08  8:53 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Document the packed virtqueue vectorized datapath selection logic in the
virtio net PMD. Add the packed virtqueue vectorized datapath features to a
new ini file.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/doc/guides/nics/features/virtio-packed_vec.ini b/doc/guides/nics/features/virtio-packed_vec.ini
new file mode 100644
index 000000000..b239bcaad
--- /dev/null
+++ b/doc/guides/nics/features/virtio-packed_vec.ini
@@ -0,0 +1,22 @@
+;
+; Supported features of the 'virtio_packed_vec' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Speed capabilities   = P
+Link status          = Y
+Link status event    = Y
+Rx interrupt         = Y
+Queue start/stop     = Y
+Promiscuous mode     = Y
+Allmulticast mode    = Y
+Unicast MAC filter   = Y
+Multicast MAC filter = Y
+VLAN filter          = Y
+Basic stats          = Y
+Stats per queue      = Y
+BSD nic_uio          = Y
+Linux UIO            = Y
+Linux VFIO           = Y
+x86-64               = Y
diff --git a/doc/guides/nics/features/virtio_vec.ini b/doc/guides/nics/features/virtio-split_vec.ini
similarity index 88%
rename from doc/guides/nics/features/virtio_vec.ini
rename to doc/guides/nics/features/virtio-split_vec.ini
index e60fe36ae..4142fc9f0 100644
--- a/doc/guides/nics/features/virtio_vec.ini
+++ b/doc/guides/nics/features/virtio-split_vec.ini
@@ -1,5 +1,5 @@
 ;
-; Supported features of the 'virtio_vec' network poll mode driver.
+; Supported features of the 'virtio_split_vec' network poll mode driver.
 ;
 ; Refer to default.ini for the full list of available PMD features.
 ;
diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index d1f5fb898..fabe2e400 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -403,6 +403,11 @@ Below devargs are supported by the virtio-user vdev:
     It is used to enable virtio device packed virtqueue feature.
     (Default: 0 (disabled))
 
+#.  ``packed_vec``:
+
+    It is used to enable virtio device packed virtqueue vectorized path.
+    (Default: 1 (enabled))
+
 Virtio paths Selection and Usage
 --------------------------------
 
@@ -454,6 +459,13 @@ according to below configuration:
    both negotiated, this path will be selected.
 #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and
    Rx mergeable is not negotiated, this path will be selected.
+#. Packed virtqueue vectorized Rx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated &&
+   TCP_LRO Rx offloading is disabled && packed_vec option enabled,
+   this path will be selected.
+#. Packed virtqueue vectorized Tx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && packed_vec option enabled,
+   this path will be selected.
 
 Rx/Tx callbacks of each Virtio path
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -476,6 +488,8 @@ are shown in below table:
    Packed virtqueue non-meregable path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed
    Packed virtqueue in-order mergeable path     virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed
    Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed           virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Rx path          virtio_recv_pkts_packed_vec       virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Tx path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed_vec
    ============================================ ================================= ========================
 
 Virtio paths Support Status from Release to Release
@@ -493,20 +507,22 @@ All virtio paths support status are shown in below table:
 
 .. table:: Virtio Paths and Releases
 
-   ============================================ ============= ============= =============
-                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11
-   ============================================ ============= ============= =============
-   Split virtqueue mergeable path                     Y             Y             Y
-   Split virtqueue non-mergeable path                 Y             Y             Y
-   Split virtqueue vectorized Rx path                 Y             Y             Y
-   Split virtqueue simple Tx path                     Y             N             N
-   Split virtqueue in-order mergeable path                          Y             Y
-   Split virtqueue in-order non-mergeable path                      Y             Y
-   Packed virtqueue mergeable path                                                Y
-   Packed virtqueue non-mergeable path                                            Y
-   Packed virtqueue in-order mergeable path                                       Y
-   Packed virtqueue in-order non-mergeable path                                   Y
-   ============================================ ============= ============= =============
+   ============================================ ============= ============= ============= =======
+                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~
+   ============================================ ============= ============= ============= =======
+   Split virtqueue mergeable path                     Y             Y             Y          Y
+   Split virtqueue non-mergeable path                 Y             Y             Y          Y
+   Split virtqueue vectorized Rx path                 Y             Y             Y          Y
+   Split virtqueue simple Tx path                     Y             N             N          N
+   Split virtqueue in-order mergeable path                          Y             Y          Y
+   Split virtqueue in-order non-mergeable path                      Y             Y          Y
+   Packed virtqueue mergeable path                                                Y          Y
+   Packed virtqueue non-mergeable path                                            Y          Y
+   Packed virtqueue in-order mergeable path                                       Y          Y
+   Packed virtqueue in-order non-mergeable path                                   Y          Y
+   Packed virtqueue vectorized Rx path                                                       Y
+   Packed virtqueue vectorized Tx path                                                       Y
+   ============================================ ============= ============= ============= =======
 
 QEMU Support Status
 ~~~~~~~~~~~~~~~~~~~
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v4 6/8] eal/x86: identify AVX512 extensions flag
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 6/8] eal/x86: identify AVX512 extensions flag Marvin Liu
@ 2020-04-15 13:31     ` David Marchand
  2020-04-15 14:57       ` Liu, Yong
  0 siblings, 1 reply; 162+ messages in thread
From: David Marchand @ 2020-04-15 13:31 UTC (permalink / raw)
  To: Marvin Liu
  Cc: Maxime Coquelin, Xiaolong Ye, Zhihong Wang, Van Haaren Harry,
	dev, Kevin Laatz, Kinsella, Ray

On Wed, Apr 15, 2020 at 11:14 AM Marvin Liu <yong.liu@intel.com> wrote:
>
> Read CPUID to check if AVX512 extensions are supported.
>
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
>
> diff --git a/lib/librte_eal/common/arch/x86/rte_cpuflags.c b/lib/librte_eal/common/arch/x86/rte_cpuflags.c
> index 6492df556..54e9f6185 100644
> --- a/lib/librte_eal/common/arch/x86/rte_cpuflags.c
> +++ b/lib/librte_eal/common/arch/x86/rte_cpuflags.c
> @@ -109,6 +109,9 @@ const struct feature_entry rte_cpu_feature_table[] = {
>         FEAT_DEF(RTM, 0x00000007, 0, RTE_REG_EBX, 11)
>         FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
>         FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
> +       FEAT_DEF(AVX512CD, 0x00000007, 0, RTE_REG_EBX, 28)
> +       FEAT_DEF(AVX512BW, 0x00000007, 0, RTE_REG_EBX, 30)
> +       FEAT_DEF(AVX512VL, 0x00000007, 0, RTE_REG_EBX, 31)
>
>         FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
>         FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
> index 25ba47b96..5bf99e05f 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
> @@ -98,6 +98,9 @@ enum rte_cpu_flag_t {
>         RTE_CPUFLAG_RTM,                    /**< Transactional memory */
>         RTE_CPUFLAG_AVX512F,                /**< AVX512F */
>         RTE_CPUFLAG_RDSEED,                 /**< RDSEED instruction */
> +       RTE_CPUFLAG_AVX512CD,               /**< AVX512CD */
> +       RTE_CPUFLAG_AVX512BW,               /**< AVX512BW */
> +       RTE_CPUFLAG_AVX512VL,               /**< AVX512VL */
>
>         /* (EAX 80000001h) ECX features */
>         RTE_CPUFLAG_LAHF_SAHF,              /**< LAHF_SAHF */

This patch most likely breaks the ABI (renumbering flags after
RTE_CPUFLAG_LAHF_SAHF).
This change should not go through the virtio tree and is not rebased on master.
A similar patch had been proposed by Kevin:
http://patchwork.dpdk.org/patch/67438/


-- 
David Marchand


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v4 6/8] eal/x86: identify AVX512 extensions flag
  2020-04-15 13:31     ` David Marchand
@ 2020-04-15 14:57       ` Liu, Yong
  0 siblings, 0 replies; 162+ messages in thread
From: Liu, Yong @ 2020-04-15 14:57 UTC (permalink / raw)
  To: David Marchand
  Cc: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong, Van Haaren, Harry,
	dev, Laatz, Kevin, Kinsella, Ray

Thanks for the note, David. Kevin's patch fully covers this one.

> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Wednesday, April 15, 2020 9:32 PM
> To: Liu, Yong <yong.liu@intel.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Ye, Xiaolong
> <xiaolong.ye@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; Van
> Haaren, Harry <harry.van.haaren@intel.com>; dev <dev@dpdk.org>; Laatz,
> Kevin <kevin.laatz@intel.com>; Kinsella, Ray <ray.kinsella@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v4 6/8] eal/x86: identify AVX512 extensions
> flag
> 
> On Wed, Apr 15, 2020 at 11:14 AM Marvin Liu <yong.liu@intel.com> wrote:
> >
> > Read CPUID to check if AVX512 extensions are supported.
> >
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/lib/librte_eal/common/arch/x86/rte_cpuflags.c
> b/lib/librte_eal/common/arch/x86/rte_cpuflags.c
> > index 6492df556..54e9f6185 100644
> > --- a/lib/librte_eal/common/arch/x86/rte_cpuflags.c
> > +++ b/lib/librte_eal/common/arch/x86/rte_cpuflags.c
> > @@ -109,6 +109,9 @@ const struct feature_entry rte_cpu_feature_table[]
> = {
> >         FEAT_DEF(RTM, 0x00000007, 0, RTE_REG_EBX, 11)
> >         FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
> >         FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
> > +       FEAT_DEF(AVX512CD, 0x00000007, 0, RTE_REG_EBX, 28)
> > +       FEAT_DEF(AVX512BW, 0x00000007, 0, RTE_REG_EBX, 30)
> > +       FEAT_DEF(AVX512VL, 0x00000007, 0, RTE_REG_EBX, 31)
> >
> >         FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
> >         FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
> > diff --git a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
> b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
> > index 25ba47b96..5bf99e05f 100644
> > --- a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
> > +++ b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
> > @@ -98,6 +98,9 @@ enum rte_cpu_flag_t {
> >         RTE_CPUFLAG_RTM,                    /**< Transactional memory */
> >         RTE_CPUFLAG_AVX512F,                /**< AVX512F */
> >         RTE_CPUFLAG_RDSEED,                 /**< RDSEED instruction */
> > +       RTE_CPUFLAG_AVX512CD,               /**< AVX512CD */
> > +       RTE_CPUFLAG_AVX512BW,               /**< AVX512BW */
> > +       RTE_CPUFLAG_AVX512VL,               /**< AVX512VL */
> >
> >         /* (EAX 80000001h) ECX features */
> >         RTE_CPUFLAG_LAHF_SAHF,              /**< LAHF_SAHF */
> 
> This patch most likely breaks the ABI (renumbering flags after
> RTE_CPUFLAG_LAHF_SAHF).
> This change should not go through the virtio tree and is not rebased on
> master.
> A similar patch had been proposed by Kevin:
> http://patchwork.dpdk.org/patch/67438/
> 
> 
> --
> David Marchand


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v4 0/8] add packed ring vectorized datapath
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
                   ` (8 preceding siblings ...)
  2020-04-08  8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu
@ 2020-04-15 16:47 ` Marvin Liu
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 1/8] net/virtio: enable " Marvin Liu
                     ` (7 more replies)
  2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu
                   ` (7 subsequent siblings)
  17 siblings, 8 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

This patch set introduces a vectorized datapath for the packed ring.

The packed ring descriptor is 16 bytes, so four batched descriptors fit
exactly into one cache line, which AVX512 instructions can handle well.
The packed ring Tx datapath can be fully transformed into a vectorized
datapath. The Rx datapath can also be vectorized when the limiting
features (LRO and mergeable buffers) are disabled. Users can enable the
vectorized packed ring datapath with the 'vectorized' devarg of the
virtio-user vdev.
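
As a rough illustration of why the batch size is four (this is not
driver code; the struct below simply mirrors the virtio 1.1 packed
descriptor layout):

#include <assert.h>
#include <stdint.h>

/* One packed ring descriptor, as laid out by the virtio 1.1 spec. */
struct packed_desc_example {
	uint64_t addr;   /* buffer address */
	uint32_t len;    /* buffer length */
	uint16_t id;     /* buffer id */
	uint16_t flags;  /* AVAIL/USED/NEXT/WRITE bits */
};

/* Four 16-byte descriptors span exactly one 64-byte cache line,
 * so a single AVX512 load or store covers a whole batch. */
static_assert(sizeof(struct packed_desc_example) == 16, "16B descriptor");
static_assert(64 / sizeof(struct packed_desc_example) == 4, "batch of 4");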

v4:
1. rename 'packed_vec' to 'vectorized', also used for split ring
2. add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev
3. check required AVX512 extension cpuflags
4. combine split and packed ring datapath selection logic
5. remove limitation that ring size must be a power of two
6. clear 12-byte virtio_net_hdr

v3:
1. Remove virtio_net_hdr array for better performance
2. disable 'packed_vec' by default

v2:
1. more function blocks replaced by vector instructions
2. clean virtio_net_hdr by vector instruction
3. allow header room size change
4. add 'packed_vec' option in virtio_user vdev 
5. fix build not checking whether AVX512 is enabled
6. doc update

Marvin Liu (8):
  net/virtio: enable vectorized datapath
  net/virtio-user: add vectorized datapath parameter
  net/virtio: add vectorized packed ring Rx function
  net/virtio: reuse packed ring xmit functions
  net/virtio: add vectorized packed ring Tx datapath
  eal/x86: identify AVX512 extensions flag
  net/virtio: add election for vectorized datapath
  doc: add packed vectorized datapath

 config/common_base                            |   1 +
 .../nics/features/virtio-packed_vec.ini       |  22 +
 .../{virtio_vec.ini => virtio-split_vec.ini}  |   2 +-
 doc/guides/nics/virtio.rst                    |  44 +-
 drivers/net/virtio/Makefile                   |  36 +
 drivers/net/virtio/meson.build                |  13 +
 drivers/net/virtio/virtio_ethdev.c            |  95 ++-
 drivers/net/virtio/virtio_ethdev.h            |   6 +
 drivers/net/virtio/virtio_pci.h               |   3 +-
 drivers/net/virtio/virtio_rxtx.c              | 182 +----
 drivers/net/virtio/virtio_rxtx_packed_avx.c   | 637 ++++++++++++++++++
 drivers/net/virtio/virtio_user_ethdev.c       |  36 +-
 drivers/net/virtio/virtqueue.c                |   6 +-
 drivers/net/virtio/virtqueue.h                | 163 ++++-
 lib/librte_eal/common/arch/x86/rte_cpuflags.c |   3 +
 .../common/include/arch/x86/rte_cpuflags.h    |   3 +
 16 files changed, 1040 insertions(+), 212 deletions(-)
 create mode 100644 doc/guides/nics/features/virtio-packed_vec.ini
 rename doc/guides/nics/features/{virtio_vec.ini => virtio-split_vec.ini} (88%)
 create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v4 1/8] net/virtio: enable vectorized datapath
  2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu
@ 2020-04-15 16:47   ` Marvin Liu
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 2/8] net/virtio-user: add vectorized datapath parameter Marvin Liu
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Previously, the virtio split ring vectorized datapath was enabled by
default. This is not suitable for everyone, as that datapath does not
follow the virtio spec. Add a specific config option for virtio
vectorized datapath selection. This config will also be used for the
virtio packed ring.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/config/common_base b/config/common_base
index 7ca2f28b1..afeda85b0 100644
--- a/config/common_base
+++ b/config/common_base
@@ -450,6 +450,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n
+CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=y
 
 #
 # Compile virtio device emulation inside virtio PMD driver
diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index efdcb0d93..9ef445bc9 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -29,6 +29,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
 
+ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y)
 ifeq ($(CONFIG_RTE_ARCH_X86),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c
 else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y)
@@ -36,6 +37,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c
 else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
 endif
+endif
 
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v4 2/8] net/virtio-user: add vectorized datapath parameter
  2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 1/8] net/virtio: enable " Marvin Liu
@ 2020-04-15 16:47   ` Marvin Liu
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 3/8] net/virtio: add vectorized packed ring Rx function Marvin Liu
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Add a new parameter "vectorized" which enables the vectorized datapath
explicitly. This parameter works for both split ring and packed ring.
When the "vectorized" option is on, the driver checks both the build
environment and the runtime environment.

Signed-off-by: Marvin Liu <yong.liu@intel.com>
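
For illustration only (the socket path below is hypothetical), the new
devarg is passed on the virtio-user vdev together with packed_vq, e.g.:

    --vdev=net_virtio_user0,path=/tmp/vhost-user.sock,packed_vq=1,vectorized=1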

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index f9d0ea70d..19a36ad82 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1547,7 +1547,7 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 			eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed;
 		}
 	} else {
-		if (hw->use_simple_rx) {
+		if (hw->use_vec_rx) {
 			PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u",
 				eth_dev->data->port_id);
 			eth_dev->rx_pkt_burst = virtio_recv_pkts_vec;
@@ -2157,33 +2157,31 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 			return -EBUSY;
 		}
 
-	hw->use_simple_rx = 1;
-
 	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
 		hw->use_inorder_tx = 1;
 		hw->use_inorder_rx = 1;
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 		hw->use_inorder_rx = 0;
 	}
 
 #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
 	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 #endif
 	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
-		 hw->use_simple_rx = 0;
+		 hw->use_vec_rx = 0;
 	}
 
 	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
 			   DEV_RX_OFFLOAD_TCP_CKSUM |
 			   DEV_RX_OFFLOAD_TCP_LRO |
 			   DEV_RX_OFFLOAD_VLAN_STRIP))
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 
 	return 0;
 }
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index 7433d2f08..36afed313 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -250,7 +250,8 @@ struct virtio_hw {
 	uint8_t	    vlan_strip;
 	uint8_t	    use_msix;
 	uint8_t     modern;
-	uint8_t     use_simple_rx;
+	uint8_t     use_vec_rx;
+	uint8_t     use_vec_tx;
 	uint8_t     use_inorder_rx;
 	uint8_t     use_inorder_tx;
 	uint8_t     weak_barriers;
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 3a2dbc2e0..285af1d47 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -995,7 +995,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 	/* Allocate blank mbufs for the each rx descriptor */
 	nbufs = 0;
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx) {
 		for (desc_idx = 0; desc_idx < vq->vq_nentries;
 		     desc_idx++) {
 			vq->vq_split.ring.avail->ring[desc_idx] = desc_idx;
@@ -1013,7 +1013,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 			&rxvq->fake_mbuf;
 	}
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx) {
 		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
 			virtio_rxq_rearm_vec(rxvq);
 			nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
index e61af4068..ca7797cfa 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -450,6 +450,8 @@ static const char *valid_args[] = {
 	VIRTIO_USER_ARG_IN_ORDER,
 #define VIRTIO_USER_ARG_PACKED_VQ      "packed_vq"
 	VIRTIO_USER_ARG_PACKED_VQ,
+#define VIRTIO_USER_ARG_VECTORIZED     "vectorized"
+	VIRTIO_USER_ARG_VECTORIZED,
 	NULL
 };
 
@@ -518,7 +520,8 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev)
 	 */
 	hw->use_msix = 1;
 	hw->modern   = 0;
-	hw->use_simple_rx = 0;
+	hw->use_vec_rx = 0;
+	hw->use_vec_tx = 0;
 	hw->use_inorder_rx = 0;
 	hw->use_inorder_tx = 0;
 	hw->virtio_user_dev = dev;
@@ -552,6 +555,8 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	uint64_t mrg_rxbuf = 1;
 	uint64_t in_order = 1;
 	uint64_t packed_vq = 0;
+	uint64_t vectorized = 0;
+
 	char *path = NULL;
 	char *ifname = NULL;
 	char *mac_addr = NULL;
@@ -668,6 +673,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		}
 	}
 
+#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR
+	if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) {
+		if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED,
+				       &get_integer_arg, &vectorized) < 0) {
+			PMD_INIT_LOG(ERR, "error to parse %s",
+				     VIRTIO_USER_ARG_VECTORIZED);
+			goto end;
+		}
+	}
+#endif
+
 	if (queues > 1 && cq == 0) {
 		PMD_INIT_LOG(ERR, "multi-q requires ctrl-q");
 		goto end;
@@ -705,6 +721,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	}
 
 	hw = eth_dev->data->dev_private;
+
 	if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq,
 			 queue_size, mac_addr, &ifname, server_mode,
 			 mrg_rxbuf, in_order, packed_vq) < 0) {
@@ -720,6 +737,20 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		goto end;
 	}
 
+	if (vectorized) {
+		if (packed_vq) {
+#if defined(CC_AVX512_SUPPORT)
+			hw->use_vec_rx = 1;
+			hw->use_vec_tx = 1;
+#else
+			PMD_INIT_LOG(INFO,
+				"building environment do not match packed ring vectorized requirement");
+#endif
+		} else {
+			hw->use_vec_rx = 1;
+		}
+	}
+
 	rte_eth_dev_probing_finish(eth_dev);
 	ret = 0;
 
@@ -777,4 +808,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user,
 	"server=<0|1> "
 	"mrg_rxbuf=<0|1> "
 	"in_order=<0|1> "
-	"packed_vq=<0|1>");
+	"packed_vq=<0|1>"
+	"vectorized=<0|1>");
diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c
index 0b4e3bf3e..349ff0c9d 100644
--- a/drivers/net/virtio/virtqueue.c
+++ b/drivers/net/virtio/virtqueue.c
@@ -32,7 +32,7 @@ virtqueue_detach_unused(struct virtqueue *vq)
 	end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1);
 
 	for (idx = 0; idx < vq->vq_nentries; idx++) {
-		if (hw->use_simple_rx && type == VTNET_RQ) {
+		if (hw->use_vec_rx && type == VTNET_RQ) {
 			if (start <= end && idx >= start && idx < end)
 				continue;
 			if (start > end && (idx >= start || idx < end))
@@ -97,7 +97,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
 	for (i = 0; i < nb_used; i++) {
 		used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1);
 		uep = &vq->vq_split.ring.used->ring[used_idx];
-		if (hw->use_simple_rx) {
+		if (hw->use_vec_rx) {
 			desc_idx = used_idx;
 			rte_pktmbuf_free(vq->sw_ring[desc_idx]);
 			vq->vq_free_cnt++;
@@ -121,7 +121,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
 		vq->vq_used_cons_idx++;
 	}
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx) {
 		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
 			virtio_rxq_rearm_vec(rxq);
 			if (virtqueue_kick_prepare(vq))
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v4 3/8] net/virtio: add vectorized packed ring Rx function
  2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 1/8] net/virtio: enable " Marvin Liu
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 2/8] net/virtio-user: add vectorized datapath parameter Marvin Liu
@ 2020-04-15 16:47   ` Marvin Liu
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 4/8] net/virtio: reuse packed ring xmit functions Marvin Liu
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Optimize the packed ring Rx datapath when AVX512 is enabled and
mergeable buffers/Rx LRO offloading are not required. The optimization
approach is similar to the one used in vhost: split the datapath into
batch and single functions, with the batch function further optimized
by vector instructions. Also pad the descriptor extra structure to a
16-byte alignment, so that four elements can be handled in one batch.

Signed-off-by: Marvin Liu <yong.liu@intel.com>
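
For reference, a minimal standalone sketch of the batch "all used"
check described above (AVX512F assumed; the helper name is illustrative
and barriers/error handling are omitted, so this is not the driver's
exact code):

#include <immintrin.h>
#include <stdint.h>

/* Flags live in the high 64-bit lane of each 16B descriptor:
 * AVAIL is bit 7 of the 16-bit flags field (bit 55 of the lane),
 * USED is bit 15 (bit 63 of the lane). */
#define DESC_FLAGS_MASK (1ULL << 55 | 1ULL << 63)

/* Return non-zero when all four descriptors starting at 'desc'
 * (one 64B cache line) have been marked used by the device. */
static inline int
batch_is_used(const void *desc, int used_wrap_counter)
{
	/* 0xaa selects the odd 64-bit lanes, i.e. the len/id/flags halves. */
	__m512i flag_bits = _mm512_maskz_set1_epi64(0xaa, DESC_FLAGS_MASK);
	__m512i cacheline = _mm512_loadu_si512(desc);
	__m512i masked = _mm512_and_si512(cacheline, flag_bits);
	__m512i expected = used_wrap_counter ?
		_mm512_maskz_set1_epi64(0xaa, DESC_FLAGS_MASK) :
		_mm512_setzero_si512();

	/* All eight lanes must match (even lanes are zero on both sides). */
	return _mm512_cmp_epu64_mask(masked, expected, _MM_CMPINT_EQ) == 0xff;
}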

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 9ef445bc9..4d20cb61a 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -37,6 +37,40 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c
 else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
 endif
+
+ifneq ($(FORCE_DISABLE_AVX512), y)
+	CC_AVX512_SUPPORT=\
+	$(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
+	sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
+	grep -q AVX512 && echo 1)
+endif
+
+ifeq ($(CC_AVX512_SUPPORT), 1)
+CFLAGS += -DCC_AVX512_SUPPORT
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
+
+ifeq ($(RTE_TOOLCHAIN), gcc)
+ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
+CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), clang)
+ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1)
+CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), icc)
+ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
+CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
+CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
+endif
+endif
 endif
 
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
index 04c7fdf25..00f84282c 100644
--- a/drivers/net/virtio/meson.build
+++ b/drivers/net/virtio/meson.build
@@ -11,6 +11,19 @@ deps += ['kvargs', 'bus_pci']
 
 if arch_subdir == 'x86'
 	sources += files('virtio_rxtx_simple_sse.c')
+	if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F')
+		if '-mno-avx512f' not in machine_args and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
+			cflags += ['-DCC_AVX512_SUPPORT']
+			if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
+				cflags += '-DVIRTIO_GCC_UNROLL_PRAGMA'
+			elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
+				cflags += '-DVIRTIO_CLANG_UNROLL_PRAGMA'
+			elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
+				cflags += '-DVIRTIO_ICC_UNROLL_PRAGMA'
+			endif
+			sources += files('virtio_rxtx_packed_avx.c')
+		endif
+	endif
 elif arch_subdir == 'ppc_64'
 	sources += files('virtio_rxtx_simple_altivec.c')
 elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64')
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index cd8947656..10e39670e 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -104,6 +104,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts,
 uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 285af1d47..965ce3dab 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -1245,7 +1245,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
 	return 0;
 }
 
-#define VIRTIO_MBUF_BURST_SZ 64
 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc))
 uint16_t
 virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
@@ -2328,3 +2327,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
 
 	return nb_tx;
 }
+
+__rte_weak uint16_t
+virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue,
+			    struct rte_mbuf __rte_unused **rx_pkts,
+			    uint16_t __rte_unused nb_pkts)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
new file mode 100644
index 000000000..f2976b98f
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -0,0 +1,358 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <rte_net.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtio_pci.h"
+#include "virtqueue.h"
+
+#define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63)
+
+#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
+	sizeof(struct vring_packed_desc))
+#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
+
+#ifdef VIRTIO_GCC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_ICC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifndef virtio_for_each_try_unroll
+#define virtio_for_each_try_unroll(iter, val, num) \
+	for (iter = val; iter < num; iter++)
+#endif
+
+
+static inline void
+virtio_update_batch_stats(struct virtnet_stats *stats,
+			  uint16_t pkt_len1,
+			  uint16_t pkt_len2,
+			  uint16_t pkt_len3,
+			  uint16_t pkt_len4)
+{
+	stats->bytes += pkt_len1;
+	stats->bytes += pkt_len2;
+	stats->bytes += pkt_len3;
+	stats->bytes += pkt_len4;
+}
+/* Optionally fill offload information in structure */
+static inline int
+virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
+{
+	struct rte_net_hdr_lens hdr_lens;
+	uint32_t hdrlen, ptype;
+	int l4_supported = 0;
+
+	/* nothing to do */
+	if (hdr->flags == 0)
+		return 0;
+
+	/* GSO not support in vec path, skip check */
+	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
+
+	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
+	m->packet_type = ptype;
+	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
+		l4_supported = 1;
+
+	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
+		if (hdr->csum_start <= hdrlen && l4_supported) {
+			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
+		} else {
+			/* Unknown proto or tunnel, do sw cksum. We can assume
+			 * the cksum field is in the first segment since the
+			 * buffers we provided to the host are large enough.
+			 * In case of SCTP, this will be wrong since it's a CRC
+			 * but there's nothing we can do.
+			 */
+			uint16_t csum = 0, off;
+
+			rte_raw_cksum_mbuf(m, hdr->csum_start,
+				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
+				&csum);
+			if (likely(csum != 0xffff))
+				csum = ~csum;
+			off = hdr->csum_offset + hdr->csum_start;
+			if (rte_pktmbuf_data_len(m) >= off + 1)
+				*rte_pktmbuf_mtod_offset(m, uint16_t *,
+					off) = csum;
+		}
+	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
+		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
+	}
+
+	return 0;
+}
+
+static uint16_t
+virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq,
+				   struct rte_mbuf **rx_pkts)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint64_t addrs[PACKED_BATCH_SIZE << 1];
+	uint16_t id = vq->vq_used_cons_idx;
+	uint8_t desc_stats;
+	uint16_t i;
+	void *desc_addr;
+
+	if (id & PACKED_BATCH_MASK)
+		return -1;
+
+	/* only care avail/used bits */
+	__m512i desc_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+	desc_addr = &vq->vq_packed.ring.desc[id];
+
+	rte_smp_rmb();
+	__m512i packed_desc = _mm512_loadu_si512(desc_addr);
+	__m512i flags_mask  = _mm512_maskz_and_epi64(0xff, packed_desc,
+			desc_flags);
+
+	__m512i used_flags;
+	if (vq->vq_packed.used_wrap_counter)
+		used_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+	else
+		used_flags = _mm512_setzero_si512();
+
+	/* Check all descs are used */
+	desc_stats = _mm512_cmp_epu64_mask(flags_mask, used_flags,
+			_MM_CMPINT_EQ);
+	if (desc_stats != 0xff)
+		return -1;
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie;
+		rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *));
+
+		addrs[i << 1] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1;
+		addrs[(i << 1) + 1] =
+			(uint64_t)rx_pkts[i]->rx_descriptor_fields1 + 8;
+	}
+
+	/* addresses of pkt_len and data_len */
+	__m512i vindex = _mm512_loadu_si512((void *)addrs);
+
+	/*
+	 * select 10b*4 load 32bit from packed_desc[95:64]
+	 * mmask  0110b*4 save 32bit into pkt_len and data_len
+	 */
+	__m512i value = _mm512_maskz_shuffle_epi32(0x6666, packed_desc, 0xAA);
+
+	/* mmask 0110b*4 reduce hdr_len from pkt_len and data_len */
+	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(0x6666,
+			(uint32_t)-hdr_size);
+
+	value = _mm512_add_epi32(value, mbuf_len_offset);
+	/* batch store into mbufs */
+	_mm512_i64scatter_epi64(0, vindex, value, 1);
+
+	if (hw->has_rx_offload) {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			char *addr = (char *)rx_pkts[i]->buf_addr +
+				RTE_PKTMBUF_HEADROOM - hdr_size;
+			virtio_vec_rx_offload(rx_pkts[i],
+					(struct virtio_net_hdr *)addr);
+		}
+	}
+
+	virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len,
+			rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len,
+			rx_pkts[3]->pkt_len);
+
+	vq->vq_free_cnt += PACKED_BATCH_SIZE;
+
+	vq->vq_used_cons_idx += PACKED_BATCH_SIZE;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static uint16_t
+virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq,
+				    struct rte_mbuf **rx_pkts)
+{
+	uint16_t used_idx, id;
+	uint32_t len;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint32_t hdr_size = hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	struct vring_packed_desc *desc;
+	struct rte_mbuf *cookie;
+
+	desc = vq->vq_packed.ring.desc;
+	used_idx = vq->vq_used_cons_idx;
+	if (!desc_is_used(&desc[used_idx], vq))
+		return -1;
+
+	len = desc[used_idx].len;
+	id = desc[used_idx].id;
+	cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie;
+	if (unlikely(cookie == NULL)) {
+		PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u",
+				vq->vq_used_cons_idx);
+		return -1;
+	}
+	rte_prefetch0(cookie);
+	rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
+
+	cookie->data_off = RTE_PKTMBUF_HEADROOM;
+	cookie->ol_flags = 0;
+	cookie->pkt_len = (uint32_t)(len - hdr_size);
+	cookie->data_len = (uint32_t)(len - hdr_size);
+
+	hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr +
+					RTE_PKTMBUF_HEADROOM - hdr_size);
+	if (hw->has_rx_offload)
+		virtio_vec_rx_offload(cookie, hdr);
+
+	*rx_pkts = cookie;
+
+	rxvq->stats.bytes += cookie->pkt_len;
+
+	vq->vq_free_cnt++;
+	vq->vq_used_cons_idx++;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static inline void
+virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq,
+			      struct rte_mbuf **cookie,
+			      uint16_t num)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc;
+	uint16_t flags = vq->vq_packed.cached_flags;
+	struct virtio_hw *hw = vq->hw;
+	struct vq_desc_extra *dxp;
+	uint16_t idx, i;
+	uint16_t total_num = 0;
+	uint16_t head_idx = vq->vq_avail_idx;
+	uint16_t head_flag = vq->vq_packed.cached_flags;
+	uint64_t addr;
+
+	do {
+		idx = vq->vq_avail_idx;
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			dxp = &vq->vq_descx[idx + i];
+			dxp->cookie = (void *)cookie[total_num + i];
+
+			addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) +
+				RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size;
+			start_dp[idx + i].addr = addr;
+			start_dp[idx + i].len = cookie[total_num + i]->buf_len
+				- RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
+			if (total_num || i) {
+				virtqueue_store_flags_packed(&start_dp[idx + i],
+						flags, hw->weak_barriers);
+			}
+		}
+
+		vq->vq_avail_idx += PACKED_BATCH_SIZE;
+		if (vq->vq_avail_idx >= vq->vq_nentries) {
+			vq->vq_avail_idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+			flags = vq->vq_packed.cached_flags;
+		}
+		total_num += PACKED_BATCH_SIZE;
+	} while (total_num < num);
+
+	virtqueue_store_flags_packed(&start_dp[head_idx], head_flag,
+				hw->weak_barriers);
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num);
+}
+
+uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue,
+			    struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct virtnet_rx *rxvq = rx_queue;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t num, nb_rx = 0;
+	uint32_t nb_enqueued = 0;
+	uint16_t free_cnt = vq->vq_free_thresh;
+
+	if (unlikely(hw->started == 0))
+		return nb_rx;
+
+	num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts);
+	if (likely(num > PACKED_BATCH_SIZE))
+		num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE);
+
+	while (num) {
+		if (!virtqueue_dequeue_batch_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx += PACKED_BATCH_SIZE;
+			num -= PACKED_BATCH_SIZE;
+			continue;
+		}
+		if (!virtqueue_dequeue_single_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx++;
+			num--;
+			continue;
+		}
+		break;
+	};
+
+	PMD_RX_LOG(DEBUG, "dequeue:%d", num);
+
+	rxvq->stats.packets += nb_rx;
+
+	if (likely(vq->vq_free_cnt >= free_cnt)) {
+		struct rte_mbuf *new_pkts[free_cnt];
+		if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts,
+						free_cnt) == 0)) {
+			virtio_recv_refill_packed_vec(rxvq, new_pkts,
+					free_cnt);
+			nb_enqueued += free_cnt;
+		} else {
+			struct rte_eth_dev *dev =
+				&rte_eth_devices[rxvq->port_id];
+			dev->data->rx_mbuf_alloc_failed += free_cnt;
+		}
+	}
+
+	if (likely(nb_enqueued)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_RX_LOG(DEBUG, "Notified");
+		}
+	}
+
+	return nb_rx;
+}
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 6301c56b2..43e305ecc 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -20,6 +20,7 @@ struct rte_mbuf;
 
 #define DEFAULT_RX_FREE_THRESH 32
 
+#define VIRTIO_MBUF_BURST_SZ 64
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
@@ -236,7 +237,8 @@ struct vq_desc_extra {
 	void *cookie;
 	uint16_t ndescs;
 	uint16_t next;
-};
+	uint8_t padding[4];
+} __rte_packed __rte_aligned(16);
 
 struct virtqueue {
 	struct virtio_hw  *hw; /**< virtio_hw structure pointer. */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v4 4/8] net/virtio: reuse packed ring xmit functions
  2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu
                     ` (2 preceding siblings ...)
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 3/8] net/virtio: add vectorized packed ring Rx function Marvin Liu
@ 2020-04-15 16:47   ` Marvin Liu
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 5/8] net/virtio: add vectorized packed ring Tx datapath Marvin Liu
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Move the xmit offload and packed ring xmit enqueue functions to the
header file. These functions will be reused by the packed ring
vectorized Tx function.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 965ce3dab..1d8135f4f 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq,
 	return i;
 }
 
-#ifndef DEFAULT_TX_FREE_THRESH
-#define DEFAULT_TX_FREE_THRESH 32
-#endif
-
 static void
 virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num)
 {
@@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m)
 }
 
 
-/* avoid write operation when necessary, to lessen cache issues */
-#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
-	if ((var) != (val))			\
-		(var) = (val);			\
-} while (0)
-
-#define virtqueue_clear_net_hdr(_hdr) do {		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0);		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0);	\
-} while (0)
-
-static inline void
-virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
-			struct rte_mbuf *cookie,
-			bool offload)
-{
-	if (offload) {
-		if (cookie->ol_flags & PKT_TX_TCP_SEG)
-			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
-
-		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
-		case PKT_TX_UDP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_udp_hdr,
-				dgram_cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		case PKT_TX_TCP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		default:
-			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
-			break;
-		}
 
-		/* TCP Segmentation Offload */
-		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
-			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
-				VIRTIO_NET_HDR_GSO_TCPV6 :
-				VIRTIO_NET_HDR_GSO_TCPV4;
-			hdr->gso_size = cookie->tso_segsz;
-			hdr->hdr_len =
-				cookie->l2_len +
-				cookie->l3_len +
-				cookie->l4_len;
-		} else {
-			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
-		}
-	}
-}
 
 static inline void
 virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq,
@@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq,
 	virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers);
 }
 
-static inline void
-virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
-			      uint16_t needed, int can_push, int in_order)
-{
-	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
-	struct vq_desc_extra *dxp;
-	struct virtqueue *vq = txvq->vq;
-	struct vring_packed_desc *start_dp, *head_dp;
-	uint16_t idx, id, head_idx, head_flags;
-	int16_t head_size = vq->hw->vtnet_hdr_size;
-	struct virtio_net_hdr *hdr;
-	uint16_t prev;
-	bool prepend_header = false;
-
-	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
-
-	dxp = &vq->vq_descx[id];
-	dxp->ndescs = needed;
-	dxp->cookie = cookie;
-
-	head_idx = vq->vq_avail_idx;
-	idx = head_idx;
-	prev = head_idx;
-	start_dp = vq->vq_packed.ring.desc;
-
-	head_dp = &vq->vq_packed.ring.desc[idx];
-	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-	head_flags |= vq->vq_packed.cached_flags;
-
-	if (can_push) {
-		/* prepend cannot fail, checked by caller */
-		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
-					      -head_size);
-		prepend_header = true;
-
-		/* if offload disabled, it is not zeroed below, do it now */
-		if (!vq->hw->has_tx_offload)
-			virtqueue_clear_net_hdr(hdr);
-	} else {
-		/* setup first tx ring slot to point to header
-		 * stored in reserved region.
-		 */
-		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
-			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
-		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
-		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	}
-
-	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
-
-	do {
-		uint16_t flags;
-
-		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
-		start_dp[idx].len  = cookie->data_len;
-		if (prepend_header) {
-			start_dp[idx].addr -= head_size;
-			start_dp[idx].len += head_size;
-			prepend_header = false;
-		}
-
-		if (likely(idx != head_idx)) {
-			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-			flags |= vq->vq_packed.cached_flags;
-			start_dp[idx].flags = flags;
-		}
-		prev = idx;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	} while ((cookie = cookie->next) != NULL);
-
-	start_dp[prev].id = id;
-
-	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
-	vq->vq_avail_idx = idx;
-
-	if (!in_order) {
-		vq->vq_desc_head_idx = dxp->next;
-		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
-			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
-	}
-
-	virtqueue_store_flags_packed(head_dp, head_flags,
-				     vq->hw->weak_barriers);
-}
-
 static inline void
 virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 			uint16_t needed, int use_indirect, int can_push,
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 43e305ecc..31c48710c 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,7 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_TX_FREE_THRESH 32
 #define DEFAULT_RX_FREE_THRESH 32
 
 #define VIRTIO_MBUF_BURST_SZ 64
@@ -562,4 +563,162 @@ virtqueue_notify(struct virtqueue *vq)
 #define VIRTQUEUE_DUMP(vq) do { } while (0)
 #endif
 
+/* avoid write operation when necessary, to lessen cache issues */
+#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
+	if ((var) != (val))			\
+		(var) = (val);			\
+} while (0)
+
+#define virtqueue_clear_net_hdr(_hdr) do {		\
+	ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0);		\
+	ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0);	\
+	ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0);	\
+} while (0)
+
+static inline void
+virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
+			struct rte_mbuf *cookie,
+			bool offload)
+{
+	if (offload) {
+		if (cookie->ol_flags & PKT_TX_TCP_SEG)
+			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
+
+		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
+		case PKT_TX_UDP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_udp_hdr,
+				dgram_cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		case PKT_TX_TCP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		default:
+			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
+			break;
+		}
+
+		/* TCP Segmentation Offload */
+		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
+			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
+				VIRTIO_NET_HDR_GSO_TCPV6 :
+				VIRTIO_NET_HDR_GSO_TCPV4;
+			hdr->gso_size = cookie->tso_segsz;
+			hdr->hdr_len =
+				cookie->l2_len +
+				cookie->l3_len +
+				cookie->l4_len;
+		} else {
+			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
+		}
+	}
+}
+
+static inline void
+virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
+			      uint16_t needed, int can_push, int in_order)
+{
+	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
+	struct vq_desc_extra *dxp;
+	struct virtqueue *vq = txvq->vq;
+	struct vring_packed_desc *start_dp, *head_dp;
+	uint16_t idx, id, head_idx, head_flags;
+	int16_t head_size = vq->hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	uint16_t prev;
+	bool prepend_header = false;
+
+	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
+
+	dxp = &vq->vq_descx[id];
+	dxp->ndescs = needed;
+	dxp->cookie = cookie;
+
+	head_idx = vq->vq_avail_idx;
+	idx = head_idx;
+	prev = head_idx;
+	start_dp = vq->vq_packed.ring.desc;
+
+	head_dp = &vq->vq_packed.ring.desc[idx];
+	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+	head_flags |= vq->vq_packed.cached_flags;
+
+	if (can_push) {
+		/* prepend cannot fail, checked by caller */
+		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
+					      -head_size);
+		prepend_header = true;
+
+		/* if offload disabled, it is not zeroed below, do it now */
+		if (!vq->hw->has_tx_offload)
+			virtqueue_clear_net_hdr(hdr);
+	} else {
+		/* setup first tx ring slot to point to header
+		 * stored in reserved region.
+		 */
+		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
+			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
+		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
+		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	}
+
+	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
+
+	do {
+		uint16_t flags;
+
+		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
+		start_dp[idx].len  = cookie->data_len;
+		if (prepend_header) {
+			start_dp[idx].addr -= head_size;
+			start_dp[idx].len += head_size;
+			prepend_header = false;
+		}
+
+		if (likely(idx != head_idx)) {
+			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+			flags |= vq->vq_packed.cached_flags;
+			start_dp[idx].flags = flags;
+		}
+		prev = idx;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	} while ((cookie = cookie->next) != NULL);
+
+	start_dp[prev].id = id;
+
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
+	vq->vq_avail_idx = idx;
+
+	if (!in_order) {
+		vq->vq_desc_head_idx = dxp->next;
+		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
+			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
+	}
+
+	virtqueue_store_flags_packed(head_dp, head_flags,
+				     vq->hw->weak_barriers);
+}
 #endif /* _VIRTQUEUE_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v4 5/8] net/virtio: add vectorized packed ring Tx datapath
  2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu
                     ` (3 preceding siblings ...)
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 4/8] net/virtio: reuse packed ring xmit functions Marvin Liu
@ 2020-04-15 16:47   ` Marvin Liu
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 6/8] eal/x86: identify AVX512 extensions flag Marvin Liu
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Optimize the packed ring Tx datapath in the same way as the Rx
datapath: split the Tx datapath into batch and single Tx functions,
with the batch function further optimized by vector instructions.

Signed-off-by: Marvin Liu <yong.liu@intel.com>
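
As a rough sketch of the batched descriptor store used by the Tx path
(illustrative only: it assumes the virtio 1.1 layout where len sits in
bits 0-31, id in bits 32-47 and flags in bits 48-63 of the descriptor's
high 64 bits, matching the ID_OFFSET/FLAG_OFFSET constants below;
barriers, header handling and the eligibility checks are omitted):

#include <immintrin.h>
#include <stdint.h>

#define ID_SHIFT   32	/* id offset within the high 64 bits */
#define FLAG_SHIFT 48	/* flags offset within the high 64 bits */

/* Compose four 16B descriptors (addr in the low lane, len|id|flags in
 * the high lane) and publish them with a single 64B store. */
static inline void
batch_store_descs(void *ring_desc_at_idx, uint16_t idx, uint16_t flags,
		  const uint64_t iova[4], const uint32_t len[4])
{
	uint64_t hi[4];
	int i;

	for (i = 0; i < 4; i++)
		hi[i] = (uint64_t)flags << FLAG_SHIFT |
			(uint64_t)(idx + i) << ID_SHIFT | len[i];

	__m512i descs = _mm512_set_epi64(hi[3], iova[3], hi[2], iova[2],
					 hi[1], iova[1], hi[0], iova[0]);

	/* A write barrier must precede this store in real code so the
	 * device never sees the flags before the rest of the descriptor. */
	_mm512_storeu_si512(ring_desc_at_idx, descs);
}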

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 10e39670e..c9aaef0af 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -107,6 +107,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 1d8135f4f..58c7778f4 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -2174,3 +2174,11 @@ virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue,
 {
 	return 0;
 }
+
+__rte_weak uint16_t
+virtio_xmit_pkts_packed_vec(void __rte_unused *tx_queue,
+			    struct rte_mbuf __rte_unused **tx_pkts,
+			    uint16_t __rte_unused nb_pkts)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
index f2976b98f..732256c86 100644
--- a/drivers/net/virtio/virtio_rxtx_packed_avx.c
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -15,6 +15,21 @@
 #include "virtio_pci.h"
 #include "virtqueue.h"
 
+/* reference count offset in mbuf rearm data */
+#define REF_CNT_OFFSET 16
+/* segment number offset in mbuf rearm data */
+#define SEG_NUM_OFFSET 32
+
+#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_OFFSET | \
+			  1ULL << REF_CNT_OFFSET)
+/* id offset in packed ring desc higher 64bits */
+#define ID_OFFSET 32
+/* flag offset in packed ring desc higher 64bits */
+#define FLAG_OFFSET 48
+
+/* net hdr short size mask */
+#define NET_HDR_MASK 0x3F
+
 #define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63)
 
 #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
@@ -41,6 +56,47 @@
 	for (iter = val; iter < num; iter++)
 #endif
 
+static void
+virtio_xmit_cleanup_packed_vec(struct virtqueue *vq)
+{
+	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
+	struct vq_desc_extra *dxp;
+	uint16_t used_idx, id, curr_id, free_cnt = 0;
+	uint16_t size = vq->vq_nentries;
+	struct rte_mbuf *mbufs[size];
+	uint16_t nb_mbuf = 0, i;
+
+	used_idx = vq->vq_used_cons_idx;
+
+	if (!desc_is_used(&desc[used_idx], vq))
+		return;
+
+	id = desc[used_idx].id;
+
+	do {
+		curr_id = used_idx;
+		dxp = &vq->vq_descx[used_idx];
+		used_idx += dxp->ndescs;
+		free_cnt += dxp->ndescs;
+
+		if (dxp->cookie != NULL) {
+			mbufs[nb_mbuf] = dxp->cookie;
+			dxp->cookie = NULL;
+			nb_mbuf++;
+		}
+
+		if (used_idx >= size) {
+			used_idx -= size;
+			vq->vq_packed.used_wrap_counter ^= 1;
+		}
+	} while (curr_id != id);
+
+	for (i = 0; i < nb_mbuf; i++)
+		rte_pktmbuf_free(mbufs[i]);
+
+	vq->vq_used_cons_idx = used_idx;
+	vq->vq_free_cnt += free_cnt;
+}
 
 static inline void
 virtio_update_batch_stats(struct virtnet_stats *stats,
@@ -54,6 +110,229 @@ virtio_update_batch_stats(struct virtnet_stats *stats,
 	stats->bytes += pkt_len3;
 	stats->bytes += pkt_len4;
 }
+
+static inline int
+virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq,
+				   struct rte_mbuf **tx_pkts)
+{
+	struct virtqueue *vq = txvq->vq;
+	uint16_t head_size = vq->hw->vtnet_hdr_size;
+	uint16_t idx = vq->vq_avail_idx;
+	struct virtio_net_hdr *hdr;
+	uint16_t i, cmp;
+
+	if (vq->vq_avail_idx & PACKED_BATCH_MASK)
+		return -1;
+
+	/* Load four mbufs rearm data */
+	__m256i mbufs = _mm256_set_epi64x(
+			*tx_pkts[3]->rearm_data,
+			*tx_pkts[2]->rearm_data,
+			*tx_pkts[1]->rearm_data,
+			*tx_pkts[0]->rearm_data);
+
+	/* refcnt=1 and nb_segs=1 */
+	__m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
+	__m256i head_rooms = _mm256_set1_epi16(head_size);
+
+	/* Check refcnt and nb_segs */
+	cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
+	if (cmp & 0x6666)
+		return -1;
+
+	/* Check headroom is enough */
+	cmp = _mm256_mask_cmp_epu16_mask(0x1111, mbufs, head_rooms,
+			_MM_CMPINT_LT);
+	if (unlikely(cmp))
+		return -1;
+
+	__m512i dxps = _mm512_set_epi64(
+			0x1, (uint64_t)tx_pkts[3],
+			0x1, (uint64_t)tx_pkts[2],
+			0x1, (uint64_t)tx_pkts[1],
+			0x1, (uint64_t)tx_pkts[0]);
+
+	_mm512_storeu_si512((void *)&vq->vq_descx[idx], dxps);
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		tx_pkts[i]->data_off -= head_size;
+		tx_pkts[i]->data_len += head_size;
+	}
+
+#ifdef RTE_VIRTIO_USER
+	__m512i descs_base = _mm512_set_epi64(
+			tx_pkts[3]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])),
+			tx_pkts[2]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])),
+			tx_pkts[1]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])),
+			tx_pkts[0]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0])));
+#else
+	__m512i descs_base = _mm512_set_epi64(
+			tx_pkts[3]->data_len, tx_pkts[3]->buf_iova,
+			tx_pkts[2]->data_len, tx_pkts[2]->buf_iova,
+			tx_pkts[1]->data_len, tx_pkts[1]->buf_iova,
+			tx_pkts[0]->data_len, tx_pkts[0]->buf_iova);
+#endif
+
+	/* id offset and data offset */
+	__m512i data_offsets = _mm512_set_epi64(
+			(uint64_t)3 << ID_OFFSET, tx_pkts[3]->data_off,
+			(uint64_t)2 << ID_OFFSET, tx_pkts[2]->data_off,
+			(uint64_t)1 << ID_OFFSET, tx_pkts[1]->data_off,
+			0, tx_pkts[0]->data_off);
+
+	__m512i new_descs = _mm512_add_epi64(descs_base, data_offsets);
+
+	uint64_t flags_temp = (uint64_t)idx << ID_OFFSET |
+		(uint64_t)vq->vq_packed.cached_flags << FLAG_OFFSET;
+
+	/* flags offset and guest virtual address offset */
+#ifdef RTE_VIRTIO_USER
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset);
+#else
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, 0);
+#endif
+	__m512i flag_offsets = _mm512_broadcast_i32x4(flag_offset);
+
+	__m512i descs = _mm512_add_epi64(new_descs, flag_offsets);
+
+	if (!vq->hw->has_tx_offload) {
+		__m128i mask = _mm_set1_epi16(0xFFFF);
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			__m128i v_hdr = _mm_loadu_si128((void *)hdr);
+			if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK,
+							v_hdr, mask))) {
+				__m128i all_zero = _mm_setzero_si128();
+				_mm_mask_storeu_epi16((void *)hdr,
+						NET_HDR_MASK, all_zero);
+			}
+		}
+	} else {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			virtqueue_xmit_offload(hdr, tx_pkts[i], true);
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	rte_smp_wmb();
+	_mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], descs);
+
+	virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len,
+			tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len,
+			tx_pkts[3]->pkt_len);
+
+	vq->vq_avail_idx += PACKED_BATCH_SIZE;
+	vq->vq_free_cnt -= PACKED_BATCH_SIZE;
+
+	if (vq->vq_avail_idx >= vq->vq_nentries) {
+		vq->vq_avail_idx -= vq->vq_nentries;
+		vq->vq_packed.cached_flags ^=
+			VRING_PACKED_DESC_F_AVAIL_USED;
+	}
+
+	return 0;
+}
+
+static inline int
+virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq,
+				    struct rte_mbuf *txm)
+{
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint16_t slots, can_push;
+	int16_t need;
+
+	/* How many main ring entries are needed to this Tx?
+	 * any_layout => number of segments
+	 * default    => number of segments + 1
+	 */
+	can_push = rte_mbuf_refcnt_read(txm) == 1 &&
+		   RTE_MBUF_DIRECT(txm) &&
+		   txm->nb_segs == 1 &&
+		   rte_pktmbuf_headroom(txm) >= hdr_size;
+
+	slots = txm->nb_segs + !can_push;
+	need = slots - vq->vq_free_cnt;
+
+	/* Positive value indicates it need free vring descriptors */
+	if (unlikely(need > 0)) {
+		virtio_xmit_cleanup_packed_vec(vq);
+		need = slots - vq->vq_free_cnt;
+		if (unlikely(need > 0)) {
+			PMD_TX_LOG(ERR,
+				   "No free tx descriptors to transmit");
+			return -1;
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1);
+
+	txvq->stats.bytes += txm->pkt_len;
+	return 0;
+}
+
+uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			uint16_t nb_pkts)
+{
+	struct virtnet_tx *txvq = tx_queue;
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t nb_tx = 0;
+	uint16_t remained;
+
+	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
+		return nb_tx;
+
+	if (unlikely(nb_pkts < 1))
+		return nb_pkts;
+
+	PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+	if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh)
+		virtio_xmit_cleanup_packed_vec(vq);
+
+	remained = RTE_MIN(nb_pkts, vq->vq_free_cnt);
+
+	while (remained) {
+		if (remained >= PACKED_BATCH_SIZE) {
+			if (!virtqueue_enqueue_batch_packed_vec(txvq,
+						&tx_pkts[nb_tx])) {
+				nb_tx += PACKED_BATCH_SIZE;
+				remained -= PACKED_BATCH_SIZE;
+				continue;
+			}
+		}
+		if (!virtqueue_enqueue_single_packed_vec(txvq,
+					tx_pkts[nb_tx])) {
+			nb_tx++;
+			remained--;
+			continue;
+		}
+		break;
+	};
+
+	txvq->stats.packets += nb_tx;
+
+	if (likely(nb_tx)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_TX_LOG(DEBUG, "Notified backend after xmit");
+		}
+	}
+
+	return nb_tx;
+}
+
 /* Optionally fill offload information in structure */
 static inline int
 virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v4 6/8] eal/x86: identify AVX512 extensions flag
  2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu
                     ` (4 preceding siblings ...)
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 5/8] net/virtio: add vectorized packed ring Tx datapath Marvin Liu
@ 2020-04-15 16:47   ` Marvin Liu
  2020-04-15 13:31     ` David Marchand
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 7/8] net/virtio: add election for vectorized datapath Marvin Liu
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 8/8] doc: add packed " Marvin Liu
  7 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Read CPUID to check if AVX512 extensions are supported.
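
For example (illustrative usage only, assuming this patch is applied), a
driver can gate its AVX512 code path at runtime with the new flags:

#include <rte_cpuflags.h>

static int
avx512_usable(void)
{
	/* all three extensions are required by the vectorized path */
	return rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
	       rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW) &&
	       rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL);
}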

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/lib/librte_eal/common/arch/x86/rte_cpuflags.c b/lib/librte_eal/common/arch/x86/rte_cpuflags.c
index 6492df556..54e9f6185 100644
--- a/lib/librte_eal/common/arch/x86/rte_cpuflags.c
+++ b/lib/librte_eal/common/arch/x86/rte_cpuflags.c
@@ -109,6 +109,9 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(RTM, 0x00000007, 0, RTE_REG_EBX, 11)
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
+	FEAT_DEF(AVX512CD, 0x00000007, 0, RTE_REG_EBX, 28)
+	FEAT_DEF(AVX512BW, 0x00000007, 0, RTE_REG_EBX, 30)
+	FEAT_DEF(AVX512VL, 0x00000007, 0, RTE_REG_EBX, 31)
 
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
diff --git a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
index 25ba47b96..5bf99e05f 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
@@ -98,6 +98,9 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_RTM,                    /**< Transactional memory */
 	RTE_CPUFLAG_AVX512F,                /**< AVX512F */
 	RTE_CPUFLAG_RDSEED,                 /**< RDSEED instruction */
+	RTE_CPUFLAG_AVX512CD,               /**< AVX512CD */
+	RTE_CPUFLAG_AVX512BW,               /**< AVX512BW */
+	RTE_CPUFLAG_AVX512VL,               /**< AVX512VL */
 
 	/* (EAX 80000001h) ECX features */
 	RTE_CPUFLAG_LAHF_SAHF,              /**< LAHF_SAHF */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v4 7/8] net/virtio: add election for vectorized datapath
  2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu
                     ` (5 preceding siblings ...)
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 6/8] eal/x86: identify AVX512 extensions flag Marvin Liu
@ 2020-04-15 16:47   ` Marvin Liu
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 8/8] doc: add packed " Marvin Liu
  7 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Packed ring vectorized datapath will be selected when the following criteria are matched.
1. vectorized option is enabled
2. AVX512F and required extensions are supported by compiler and host
3. virtio VERSION_1 and IN_ORDER features are negotiated
4. virtio mergeable feature is not negotiated
5. LRO offloading is disabled

Split ring vectorized Rx will be selected when the following criteria are matched.
1. vectorized option is enabled
2. virtio mergeable and IN_ORDER features are not negotiated
3. LRO, chksum and vlan strip offloading are disabled

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 19a36ad82..a6ce3a0b0 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1518,9 +1518,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	if (vtpci_packed_queue(hw)) {
 		PMD_INIT_LOG(INFO,
 			"virtio: using packed ring %s Tx path on port %u",
-			hw->use_inorder_tx ? "inorder" : "standard",
+			hw->use_vec_tx ? "vectorized" : "standard",
 			eth_dev->data->port_id);
-		eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
+		if (hw->use_vec_tx)
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec;
+		else
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
 	} else {
 		if (hw->use_inorder_tx) {
 			PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u",
@@ -1534,7 +1537,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+		if (hw->use_vec_rx) {
+			PMD_INIT_LOG(INFO,
+				"virtio: using packed ring vectorized Rx path on port %u",
+				eth_dev->data->port_id);
+			eth_dev->rx_pkt_burst =
+				&virtio_recv_pkts_packed_vec;
+		} else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
 			PMD_INIT_LOG(INFO,
 				"virtio: using packed ring mergeable buffer Rx path on port %u",
 				eth_dev->data->port_id);
@@ -1548,7 +1557,7 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 		}
 	} else {
 		if (hw->use_vec_rx) {
-			PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u",
+			PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u",
 				eth_dev->data->port_id);
 			eth_dev->rx_pkt_burst = virtio_recv_pkts_vec;
 		} else if (hw->use_inorder_rx) {
@@ -1921,6 +1930,10 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 		goto err_virtio_init;
 
 	hw->opened = true;
+#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR
+	hw->use_vec_rx = 1;
+	hw->use_vec_tx = 1;
+#endif
 
 	return 0;
 
@@ -2157,31 +2170,63 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 			return -EBUSY;
 		}
 
-	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
-		hw->use_inorder_tx = 1;
-		hw->use_inorder_rx = 1;
-		hw->use_vec_rx = 0;
-	}
-
 	if (vtpci_packed_queue(hw)) {
-		hw->use_vec_rx = 0;
-		hw->use_inorder_rx = 0;
-	}
+		if ((hw->use_vec_rx || hw->use_vec_tx) &&
+		    (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
+		     !rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW) ||
+		     !rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL) ||
+		     !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) ||
+		     !vtpci_with_feature(hw, VIRTIO_F_VERSION_1))) {
+			PMD_DRV_LOG(INFO,
+				"disabled packed ring vectorization for requirements are not met");
+			hw->use_vec_rx = 0;
+			hw->use_vec_tx = 0;
+		}
+
+		if (hw->use_vec_rx) {
+			if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+				PMD_DRV_LOG(INFO,
+					"disabled packed ring vectorized rx for mrg_rxbuf enabled");
+				hw->use_vec_rx = 0;
+			}
 
+			if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) {
+				PMD_DRV_LOG(INFO,
+					"disabled packed ring vectorized rx for TCP_LRO enabled");
+				hw->use_vec_rx = 0;
+			}
+		}
+	} else {
+		if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
+			hw->use_inorder_tx = 1;
+			hw->use_inorder_rx = 1;
+			hw->use_vec_rx = 0;
+		}
+
+		if (hw->use_vec_rx) {
 #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
-	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
-		hw->use_vec_rx = 0;
-	}
+			if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorization for requirements are not met");
+				hw->use_vec_rx = 0;
+			}
 #endif
-	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
-		 hw->use_vec_rx = 0;
-	}
+			if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized rx for mrg_rxbuf enabled");
+				hw->use_vec_rx = 0;
+			}
 
-	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
-			   DEV_RX_OFFLOAD_TCP_CKSUM |
-			   DEV_RX_OFFLOAD_TCP_LRO |
-			   DEV_RX_OFFLOAD_VLAN_STRIP))
-		hw->use_vec_rx = 0;
+			if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
+					   DEV_RX_OFFLOAD_TCP_CKSUM |
+					   DEV_RX_OFFLOAD_TCP_LRO |
+					   DEV_RX_OFFLOAD_VLAN_STRIP)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized rx for offloading enabled");
+				hw->use_vec_rx = 0;
+			}
+		}
+	}
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v4 8/8] doc: add packed vectorized datapath
  2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu
                     ` (6 preceding siblings ...)
  2020-04-15 16:47   ` [dpdk-dev] [PATCH v4 7/8] net/virtio: add election for vectorized datapath Marvin Liu
@ 2020-04-15 16:47   ` Marvin Liu
  7 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Document the packed virtqueue vectorized datapath selection logic in the
virtio net PMD. Add the packed virtqueue vectorized datapath features to a
new ini file.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/doc/guides/nics/features/virtio-packed_vec.ini b/doc/guides/nics/features/virtio-packed_vec.ini
new file mode 100644
index 000000000..b239bcaad
--- /dev/null
+++ b/doc/guides/nics/features/virtio-packed_vec.ini
@@ -0,0 +1,22 @@
+;
+; Supported features of the 'virtio_packed_vec' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Speed capabilities   = P
+Link status          = Y
+Link status event    = Y
+Rx interrupt         = Y
+Queue start/stop     = Y
+Promiscuous mode     = Y
+Allmulticast mode    = Y
+Unicast MAC filter   = Y
+Multicast MAC filter = Y
+VLAN filter          = Y
+Basic stats          = Y
+Stats per queue      = Y
+BSD nic_uio          = Y
+Linux UIO            = Y
+Linux VFIO           = Y
+x86-64               = Y
diff --git a/doc/guides/nics/features/virtio_vec.ini b/doc/guides/nics/features/virtio-split_vec.ini
similarity index 88%
rename from doc/guides/nics/features/virtio_vec.ini
rename to doc/guides/nics/features/virtio-split_vec.ini
index e60fe36ae..4142fc9f0 100644
--- a/doc/guides/nics/features/virtio_vec.ini
+++ b/doc/guides/nics/features/virtio-split_vec.ini
@@ -1,5 +1,5 @@
 ;
-; Supported features of the 'virtio_vec' network poll mode driver.
+; Supported features of the 'virtio_split_vec' network poll mode driver.
 ;
 ; Refer to default.ini for the full list of available PMD features.
 ;
diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index d1f5fb898..7c9ad9466 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -403,6 +403,11 @@ Below devargs are supported by the virtio-user vdev:
     It is used to enable virtio device packed virtqueue feature.
     (Default: 0 (disabled))
 
+#.  ``vectorized``:
+
+    It is used to enable virtio device vectorized datapath.
+    (Default: 0 (disabled))
+
 Virtio paths Selection and Usage
 --------------------------------
 
@@ -454,6 +459,13 @@ according to below configuration:
    both negotiated, this path will be selected.
 #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and
    Rx mergeable is not negotiated, this path will be selected.
+#. Packed virtqueue vectorized Rx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated &&
+   TCP_LRO Rx offloading is disabled && vectorized option enabled,
+   this path will be selected.
+#. Packed virtqueue vectorized Tx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && vectorized option enabled,
+   this path will be selected.
 
 Rx/Tx callbacks of each Virtio path
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -476,6 +488,8 @@ are shown in below table:
    Packed virtqueue non-meregable path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed
    Packed virtqueue in-order mergeable path     virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed
    Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed           virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Rx path          virtio_recv_pkts_packed_vec       virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Tx path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed_vec
    ============================================ ================================= ========================
 
 Virtio paths Support Status from Release to Release
@@ -493,20 +507,22 @@ All virtio paths support status are shown in below table:
 
 .. table:: Virtio Paths and Releases
 
-   ============================================ ============= ============= =============
-                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11
-   ============================================ ============= ============= =============
-   Split virtqueue mergeable path                     Y             Y             Y
-   Split virtqueue non-mergeable path                 Y             Y             Y
-   Split virtqueue vectorized Rx path                 Y             Y             Y
-   Split virtqueue simple Tx path                     Y             N             N
-   Split virtqueue in-order mergeable path                          Y             Y
-   Split virtqueue in-order non-mergeable path                      Y             Y
-   Packed virtqueue mergeable path                                                Y
-   Packed virtqueue non-mergeable path                                            Y
-   Packed virtqueue in-order mergeable path                                       Y
-   Packed virtqueue in-order non-mergeable path                                   Y
-   ============================================ ============= ============= =============
+   ============================================ ============= ============= ============= =======
+                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~
+   ============================================ ============= ============= ============= =======
+   Split virtqueue mergeable path                     Y             Y             Y          Y
+   Split virtqueue non-mergeable path                 Y             Y             Y          Y
+   Split virtqueue vectorized Rx path                 Y             Y             Y          Y
+   Split virtqueue simple Tx path                     Y             N             N          N
+   Split virtqueue in-order mergeable path                          Y             Y          Y
+   Split virtqueue in-order non-mergeable path                      Y             Y          Y
+   Packed virtqueue mergeable path                                                Y          Y
+   Packed virtqueue non-mergeable path                                            Y          Y
+   Packed virtqueue in-order mergeable path                                       Y          Y
+   Packed virtqueue in-order non-mergeable path                                   Y          Y
+   Packed virtqueue vectorized Rx path                                                       Y
+   Packed virtqueue vectorized Tx path                                                       Y
+   ============================================ ============= ============= ============= =======
 
 QEMU Support Status
 ~~~~~~~~~~~~~~~~~~~
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
                   ` (9 preceding siblings ...)
  2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu
@ 2020-04-16 15:31 ` Marvin Liu
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 1/9] net/virtio: add Rx free threshold setting Marvin Liu
                     ` (8 more replies)
  2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu
                   ` (6 subsequent siblings)
  17 siblings, 9 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

This patch set introduces a vectorized path for the packed ring.

The size of a packed ring descriptor is 16 bytes, so four batched
descriptors fit exactly into one cacheline, which AVX512 instructions can
handle well. The packed ring Tx path can be fully transformed into a
vectorized path. The packed ring Rx path can be vectorized when the
requirements are met (LRO and mergeable buffers disabled).
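
As an illustration (not part of the patch set), the sketch below shows why a
batch of four descriptors maps naturally onto a single 64-byte AVX512 load;
the struct mirrors the packed ring descriptor layout from the virtio spec:

#include <stdint.h>
#include <immintrin.h>

/* packed ring descriptor: 16 bytes */
struct vring_packed_desc {
	uint64_t addr;   /* buffer address */
	uint32_t len;    /* buffer length */
	uint16_t id;     /* buffer id */
	uint16_t flags;  /* avail/used/write flags */
};

/* four descriptors fill exactly one 64-byte cacheline, so one
 * 512-bit load fetches a whole batch (idx assumed batch aligned) */
static inline __m512i
load_desc_batch(const struct vring_packed_desc *ring, uint16_t idx)
{
	return _mm512_loadu_si512((const void *)&ring[idx]);
}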

This patch set also introduces the new option RTE_LIBRTE_VIRTIO_INC_VECTOR,
which unifies the default setting of the split and packed ring vectorized
paths. Meanwhile, the user can specify whether to enable the vectorized
path at runtime via the 'vectorized' parameter of the virtio-user vdev.

v5:
1. remove cpuflags definition as required extensions always come with
   AVX512F on x86_64
2. inorder actions should depend on feature bit
3. check ring type in rx queue setup
4. rewrite some commit logs
5. fix some checkpatch warnings

v4:
1. rename 'packed_vec' to 'vectorized', also used in split ring
2. add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev
3. check required AVX512 extensions cpuflags
4. combine split and packed ring datapath selection logic
5. remove limitation that size must power of two
6. clear 12-byte virtio_net_hdr

v3:
1. remove virtio_net_hdr array for better performance
2. disable 'packed_vec' by default

v2:
1. more function blocks replaced by vector instructions
2. clean virtio_net_hdr by vector instruction
3. allow header room size change
4. add 'packed_vec' option in virtio_user vdev 
5. fix build not checking whether AVX512 is enabled
6. doc update

Marvin Liu (9):
  net/virtio: add Rx free threshold setting
  net/virtio: enable vectorized path
  net/virtio: inorder should depend on feature bit
  net/virtio-user: add vectorized path parameter
  net/virtio: add vectorized packed ring Rx path
  net/virtio: reuse packed ring xmit functions
  net/virtio: add vectorized packed ring Tx path
  net/virtio: add election for vectorized path
  doc: add packed vectorized path

 config/common_base                            |   1 +
 .../nics/features/virtio-packed_vec.ini       |  22 +
 .../{virtio_vec.ini => virtio-split_vec.ini}  |   2 +-
 doc/guides/nics/virtio.rst                    |  44 +-
 drivers/net/virtio/Makefile                   |  36 +
 drivers/net/virtio/meson.build                |  27 +-
 drivers/net/virtio/virtio_ethdev.c            |  95 ++-
 drivers/net/virtio/virtio_ethdev.h            |   6 +
 drivers/net/virtio/virtio_pci.h               |   3 +-
 drivers/net/virtio/virtio_rxtx.c              | 212 ++----
 drivers/net/virtio/virtio_rxtx_packed_avx.c   | 639 ++++++++++++++++++
 drivers/net/virtio/virtio_user_ethdev.c       |  39 +-
 drivers/net/virtio/virtqueue.c                |   7 +-
 drivers/net/virtio/virtqueue.h                | 168 ++++-
 14 files changed, 1080 insertions(+), 221 deletions(-)
 create mode 100644 doc/guides/nics/features/virtio-packed_vec.ini
 rename doc/guides/nics/features/{virtio_vec.ini => virtio-split_vec.ini} (88%)
 create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v5 1/9] net/virtio: add Rx free threshold setting
  2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu
@ 2020-04-16 15:31   ` Marvin Liu
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 2/9] net/virtio: enable vectorized path Marvin Liu
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Introduce a free threshold setting for the Rx queue; its default value is 32.
Limit the threshold to a multiple of four, as only the vectorized packed Rx
function will utilize it. The virtio driver will rearm the Rx queue when more
than rx_free_thresh descriptors have been dequeued.
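
For reference, a minimal sketch (assumptions: port_id, nb_rxd and mbuf_pool
are set up elsewhere; requires <rte_ethdev.h>) of how an application could
request a custom threshold through the standard ethdev API:

	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rxconf;

	rte_eth_dev_info_get(port_id, &dev_info);
	rxconf = dev_info.default_rxconf;
	rxconf.rx_free_thresh = 64;	/* must be a multiple of four */

	rte_eth_rx_queue_setup(port_id, 0, nb_rxd, rte_socket_id(),
			       &rxconf, mbuf_pool);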

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 060410577..94ba7a3ec 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 	struct virtio_hw *hw = dev->data->dev_private;
 	struct virtqueue *vq = hw->vqs[vtpci_queue_idx];
 	struct virtnet_rx *rxvq;
+	uint16_t rx_free_thresh;
 
 	PMD_INIT_FUNC_TRACE();
 
@@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 		return -EINVAL;
 	}
 
+	rx_free_thresh = rx_conf->rx_free_thresh;
+	if (rx_free_thresh == 0)
+		rx_free_thresh =
+			RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH);
+
+	if (rx_free_thresh & 0x3) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+
+	if (rx_free_thresh >= vq->vq_nentries) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the "
+			"number of RX entries (%u)."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			vq->vq_nentries,
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+	vq->vq_free_thresh = rx_free_thresh;
+
 	if (nb_desc == 0 || nb_desc > vq->vq_nentries)
 		nb_desc = vq->vq_nentries;
 	vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc);
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 58ad7309a..6301c56b2 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,8 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_RX_FREE_THRESH 32
+
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v5 2/9] net/virtio: enable vectorized path
  2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 1/9] net/virtio: add Rx free threshold setting Marvin Liu
@ 2020-04-16 15:31   ` Marvin Liu
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 3/9] net/virtio: inorder should depend on feature bit Marvin Liu
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Previously, the virtio split ring vectorized path was enabled by default.
This is not suitable for everyone because that path does not follow the
virtio spec. Add a new config option for virtio vectorized path selection.
By default the vectorized path is enabled.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/config/common_base b/config/common_base
index c31175f9d..5901a94f7 100644
--- a/config/common_base
+++ b/config/common_base
@@ -449,6 +449,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n
+CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=y
 
 #
 # Compile virtio device emulation inside virtio PMD driver
diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index efdcb0d93..9ef445bc9 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -29,6 +29,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
 
+ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y)
 ifeq ($(CONFIG_RTE_ARCH_X86),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c
 else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y)
@@ -36,6 +37,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c
 else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
 endif
+endif
 
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
index 5e7ca855c..f9619a108 100644
--- a/drivers/net/virtio/meson.build
+++ b/drivers/net/virtio/meson.build
@@ -9,12 +9,14 @@ sources += files('virtio_ethdev.c',
 	'virtqueue.c')
 deps += ['kvargs', 'bus_pci']
 
-if arch_subdir == 'x86'
-	sources += files('virtio_rxtx_simple_sse.c')
-elif arch_subdir == 'ppc'
-	sources += files('virtio_rxtx_simple_altivec.c')
-elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64')
-	sources += files('virtio_rxtx_simple_neon.c')
+if dpdk_conf.has('RTE_LIBRTE_VIRTIO_INC_VECTOR')
+	if arch_subdir == 'x86'
+		sources += files('virtio_rxtx_simple_sse.c')
+	elif arch_subdir == 'ppc'
+		sources += files('virtio_rxtx_simple_altivec.c')
+	elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64')
+		sources += files('virtio_rxtx_simple_neon.c')
+	endif
 endif
 
 if is_linux
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v5 3/9] net/virtio: inorder should depend on feature bit
  2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 1/9] net/virtio: add Rx free threshold setting Marvin Liu
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 2/9] net/virtio: enable vectorized path Marvin Liu
@ 2020-04-16 15:31   ` Marvin Liu
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 4/9] net/virtio-user: add vectorized path parameter Marvin Liu
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Ring initialization differs when the in-order feature is negotiated. This
action should depend on the negotiated feature bits.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 94ba7a3ec..e450477e8 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -989,6 +989,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 	struct rte_mbuf *m;
 	uint16_t desc_idx;
 	int error, nbufs, i;
+	bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER);
 
 	PMD_INIT_FUNC_TRACE();
 
@@ -1018,7 +1019,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 			virtio_rxq_rearm_vec(rxvq);
 			nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
 		}
-	} else if (hw->use_inorder_rx) {
+	} else if (!vtpci_packed_queue(vq->hw) && in_order) {
 		if ((!virtqueue_full(vq))) {
 			uint16_t free_cnt = vq->vq_free_cnt;
 			struct rte_mbuf *pkts[free_cnt];
@@ -1133,7 +1134,7 @@ virtio_dev_tx_queue_setup_finish(struct rte_eth_dev *dev,
 	PMD_INIT_FUNC_TRACE();
 
 	if (!vtpci_packed_queue(hw)) {
-		if (hw->use_inorder_tx)
+		if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER))
 			vq->vq_split.ring.desc[vq->vq_nentries - 1].next = 0;
 	}
 
@@ -2046,7 +2047,7 @@ virtio_xmit_pkts_packed(void *tx_queue, struct rte_mbuf **tx_pkts,
 	struct virtio_hw *hw = vq->hw;
 	uint16_t hdr_size = hw->vtnet_hdr_size;
 	uint16_t nb_tx = 0;
-	bool in_order = hw->use_inorder_tx;
+	bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER);
 
 	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
 		return nb_tx;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v5 4/9] net/virtio-user: add vectorized path parameter
  2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu
                     ` (2 preceding siblings ...)
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 3/9] net/virtio: inorder should depend on feature bit Marvin Liu
@ 2020-04-16 15:31   ` Marvin Liu
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Add a new parameter "vectorized" which selects the vectorized path
explicitly. This parameter takes effect when the RTE_LIBRTE_VIRTIO_INC_VECTOR
option is set to yes. When "vectorized" is set, the driver will check both
the build environment and the runtime environment when selecting the path.
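
A minimal illustrative invocation (example command line, not taken from the
patch) passes the new parameter alongside packed_vq on a virtio-user vdev:

	testpmd -l 0-2 -n 4 \
		--vdev=virtio_user0,path=/dev/vhost-net,packed_vq=1,vectorized=1 \
		-- -i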

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 35203940a..4c7d60ca0 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1547,7 +1547,7 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 			eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed;
 		}
 	} else {
-		if (hw->use_simple_rx) {
+		if (hw->use_vec_rx) {
 			PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u",
 				eth_dev->data->port_id);
 			eth_dev->rx_pkt_burst = virtio_recv_pkts_vec;
@@ -2157,33 +2157,31 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 			return -EBUSY;
 		}
 
-	hw->use_simple_rx = 1;
-
 	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
 		hw->use_inorder_tx = 1;
 		hw->use_inorder_rx = 1;
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 		hw->use_inorder_rx = 0;
 	}
 
 #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
 	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 #endif
 	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
-		 hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 
 	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
 			   DEV_RX_OFFLOAD_TCP_CKSUM |
 			   DEV_RX_OFFLOAD_TCP_LRO |
 			   DEV_RX_OFFLOAD_VLAN_STRIP))
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 
 	return 0;
 }
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index 7433d2f08..36afed313 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -250,7 +250,8 @@ struct virtio_hw {
 	uint8_t	    vlan_strip;
 	uint8_t	    use_msix;
 	uint8_t     modern;
-	uint8_t     use_simple_rx;
+	uint8_t     use_vec_rx;
+	uint8_t     use_vec_tx;
 	uint8_t     use_inorder_rx;
 	uint8_t     use_inorder_tx;
 	uint8_t     weak_barriers;
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index e450477e8..84f4cf946 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 	/* Allocate blank mbufs for the each rx descriptor */
 	nbufs = 0;
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx && !vtpci_packed_queue(hw)) {
 		for (desc_idx = 0; desc_idx < vq->vq_nentries;
 		     desc_idx++) {
 			vq->vq_split.ring.avail->ring[desc_idx] = desc_idx;
@@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 			&rxvq->fake_mbuf;
 	}
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx && !vtpci_packed_queue(hw)) {
 		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
 			virtio_rxq_rearm_vec(rxvq);
 			nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
index 5637001df..6e30acaae 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -450,6 +450,8 @@ static const char *valid_args[] = {
 	VIRTIO_USER_ARG_IN_ORDER,
 #define VIRTIO_USER_ARG_PACKED_VQ      "packed_vq"
 	VIRTIO_USER_ARG_PACKED_VQ,
+#define VIRTIO_USER_ARG_VECTORIZED     "vectorized"
+	VIRTIO_USER_ARG_VECTORIZED,
 	NULL
 };
 
@@ -518,7 +520,8 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev)
 	 */
 	hw->use_msix = 1;
 	hw->modern   = 0;
-	hw->use_simple_rx = 0;
+	hw->use_vec_rx = 0;
+	hw->use_vec_tx = 0;
 	hw->use_inorder_rx = 0;
 	hw->use_inorder_tx = 0;
 	hw->virtio_user_dev = dev;
@@ -552,6 +555,8 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	uint64_t mrg_rxbuf = 1;
 	uint64_t in_order = 1;
 	uint64_t packed_vq = 0;
+	uint64_t vectorized = 0;
+
 	char *path = NULL;
 	char *ifname = NULL;
 	char *mac_addr = NULL;
@@ -668,6 +673,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		}
 	}
 
+#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR
+	if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) {
+		if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED,
+				       &get_integer_arg, &vectorized) < 0) {
+			PMD_INIT_LOG(ERR, "error to parse %s",
+				     VIRTIO_USER_ARG_VECTORIZED);
+			goto end;
+		}
+	}
+#endif
+
 	if (queues > 1 && cq == 0) {
 		PMD_INIT_LOG(ERR, "multi-q requires ctrl-q");
 		goto end;
@@ -705,6 +721,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	}
 
 	hw = eth_dev->data->dev_private;
+
 	if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq,
 			 queue_size, mac_addr, &ifname, server_mode,
 			 mrg_rxbuf, in_order, packed_vq) < 0) {
@@ -720,6 +737,23 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		goto end;
 	}
 
+	if (vectorized) {
+		if (packed_vq) {
+#if defined(CC_AVX512_SUPPORT)
+			hw->use_vec_rx = 1;
+			hw->use_vec_tx = 1;
+#else
+			PMD_INIT_LOG(INFO,
+				"building environment do not match packed ring vectorized requirement");
+#endif
+		} else {
+			hw->use_vec_rx = 1;
+		}
+	} else {
+		hw->use_vec_rx = 0;
+		hw->use_vec_tx = 0;
+	}
+
 	rte_eth_dev_probing_finish(eth_dev);
 	ret = 0;
 
@@ -777,4 +811,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user,
 	"server=<0|1> "
 	"mrg_rxbuf=<0|1> "
 	"in_order=<0|1> "
-	"packed_vq=<0|1>");
+	"packed_vq=<0|1>"
+	"vectorized=<0|1>");
diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c
index 0b4e3bf3e..ca23180de 100644
--- a/drivers/net/virtio/virtqueue.c
+++ b/drivers/net/virtio/virtqueue.c
@@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq)
 	end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1);
 
 	for (idx = 0; idx < vq->vq_nentries; idx++) {
-		if (hw->use_simple_rx && type == VTNET_RQ) {
+		if (hw->use_vec_rx && !vtpci_packed_queue(hw) &&
+		    type == VTNET_RQ) {
 			if (start <= end && idx >= start && idx < end)
 				continue;
 			if (start > end && (idx >= start || idx < end))
@@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
 	for (i = 0; i < nb_used; i++) {
 		used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1);
 		uep = &vq->vq_split.ring.used->ring[used_idx];
-		if (hw->use_simple_rx) {
+		if (hw->use_vec_rx) {
 			desc_idx = used_idx;
 			rte_pktmbuf_free(vq->sw_ring[desc_idx]);
 			vq->vq_free_cnt++;
@@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
 		vq->vq_used_cons_idx++;
 	}
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx) {
 		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
 			virtio_rxq_rearm_vec(rxq);
 			if (virtqueue_kick_prepare(vq))
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v5 5/9] net/virtio: add vectorized packed ring Rx path
  2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu
                     ` (3 preceding siblings ...)
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 4/9] net/virtio-user: add vectorized path parameter Marvin Liu
@ 2020-04-16 15:31   ` Marvin Liu
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Optimize the packed ring Rx path when AVX512 is enabled and mergeable
buffer/Rx LRO offloading are not required. The optimization approach is
similar to vhost: split the path into batch and single functions. The batch
function is further optimized with vector instructions. Also pad the
descriptor extra structure to a 16-byte alignment, so that four elements
can be handled in one batch.
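
A quick way to see the intent of the padding (illustrative only, mirroring
the layout assumed by this patch): with the 4-byte pad, four vq_desc_extra
elements occupy exactly one 64-byte cacheline, so batch metadata can be
written with a single 64-byte store.

#include <rte_common.h>

struct vq_desc_extra_padded {
	void *cookie;
	uint16_t ndescs;
	uint16_t next;
	uint8_t padding[4];
} __rte_packed __rte_aligned(16);

/* four elements per cacheline -> one _mm512_storeu_si512() can
 * update the descriptor extras of a whole batch */
_Static_assert(sizeof(struct vq_desc_extra_padded) * 4 == 64,
	       "a batch of desc extras must fill one cacheline");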

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 9ef445bc9..4d20cb61a 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -37,6 +37,40 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c
 else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
 endif
+
+ifneq ($(FORCE_DISABLE_AVX512), y)
+	CC_AVX512_SUPPORT=\
+	$(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
+	sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
+	grep -q AVX512 && echo 1)
+endif
+
+ifeq ($(CC_AVX512_SUPPORT), 1)
+CFLAGS += -DCC_AVX512_SUPPORT
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
+
+ifeq ($(RTE_TOOLCHAIN), gcc)
+ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
+CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), clang)
+ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1)
+CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), icc)
+ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
+CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
+CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
+endif
+endif
 endif
 
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
index f9619a108..9e0ff9761 100644
--- a/drivers/net/virtio/meson.build
+++ b/drivers/net/virtio/meson.build
@@ -11,6 +11,19 @@ deps += ['kvargs', 'bus_pci']
 
 if dpdk_conf.has('RTE_LIBRTE_VIRTIO_INC_VECTOR')
 	if arch_subdir == 'x86'
+		if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F')
+			if '-mno-avx512f' not in machine_args and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
+				cflags += ['-DCC_AVX512_SUPPORT']
+				if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
+					cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
+				elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
+					cflags += '-DVHOST_CLANG_UNROLL_PRAGMA'
+				elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
+					cflags += '-DVHOST_ICC_UNROLL_PRAGMA'
+				endif
+				sources += files('virtio_rxtx_packed_avx.c')
+			endif
+		endif
 		sources += files('virtio_rxtx_simple_sse.c')
 	elif arch_subdir == 'ppc'
 		sources += files('virtio_rxtx_simple_altivec.c')
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index cd8947656..10e39670e 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -104,6 +104,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts,
 uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 84f4cf946..7b65d0b0a 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -1246,7 +1246,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
 	return 0;
 }
 
-#define VIRTIO_MBUF_BURST_SZ 64
 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc))
 uint16_t
 virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
@@ -2329,3 +2328,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
 
 	return nb_tx;
 }
+
+__rte_weak uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
+			    struct rte_mbuf **rx_pkts __rte_unused,
+			    uint16_t nb_pkts __rte_unused)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
new file mode 100644
index 000000000..f2976b98f
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -0,0 +1,358 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <rte_net.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtio_pci.h"
+#include "virtqueue.h"
+
+#define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63)
+
+#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
+	sizeof(struct vring_packed_desc))
+#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
+
+#ifdef VIRTIO_GCC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_ICC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifndef virtio_for_each_try_unroll
+#define virtio_for_each_try_unroll(iter, val, num) \
+	for (iter = val; iter < num; iter++)
+#endif
+
+
+static inline void
+virtio_update_batch_stats(struct virtnet_stats *stats,
+			  uint16_t pkt_len1,
+			  uint16_t pkt_len2,
+			  uint16_t pkt_len3,
+			  uint16_t pkt_len4)
+{
+	stats->bytes += pkt_len1;
+	stats->bytes += pkt_len2;
+	stats->bytes += pkt_len3;
+	stats->bytes += pkt_len4;
+}
+/* Optionally fill offload information in structure */
+static inline int
+virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
+{
+	struct rte_net_hdr_lens hdr_lens;
+	uint32_t hdrlen, ptype;
+	int l4_supported = 0;
+
+	/* nothing to do */
+	if (hdr->flags == 0)
+		return 0;
+
+	/* GSO not support in vec path, skip check */
+	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
+
+	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
+	m->packet_type = ptype;
+	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
+		l4_supported = 1;
+
+	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
+		if (hdr->csum_start <= hdrlen && l4_supported) {
+			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
+		} else {
+			/* Unknown proto or tunnel, do sw cksum. We can assume
+			 * the cksum field is in the first segment since the
+			 * buffers we provided to the host are large enough.
+			 * In case of SCTP, this will be wrong since it's a CRC
+			 * but there's nothing we can do.
+			 */
+			uint16_t csum = 0, off;
+
+			rte_raw_cksum_mbuf(m, hdr->csum_start,
+				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
+				&csum);
+			if (likely(csum != 0xffff))
+				csum = ~csum;
+			off = hdr->csum_offset + hdr->csum_start;
+			if (rte_pktmbuf_data_len(m) >= off + 1)
+				*rte_pktmbuf_mtod_offset(m, uint16_t *,
+					off) = csum;
+		}
+	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
+		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
+	}
+
+	return 0;
+}
+
+static uint16_t
+virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq,
+				   struct rte_mbuf **rx_pkts)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint64_t addrs[PACKED_BATCH_SIZE << 1];
+	uint16_t id = vq->vq_used_cons_idx;
+	uint8_t desc_stats;
+	uint16_t i;
+	void *desc_addr;
+
+	if (id & PACKED_BATCH_MASK)
+		return -1;
+
+	/* only care avail/used bits */
+	__m512i desc_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+	desc_addr = &vq->vq_packed.ring.desc[id];
+
+	rte_smp_rmb();
+	__m512i packed_desc = _mm512_loadu_si512(desc_addr);
+	__m512i flags_mask  = _mm512_maskz_and_epi64(0xff, packed_desc,
+			desc_flags);
+
+	__m512i used_flags;
+	if (vq->vq_packed.used_wrap_counter)
+		used_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+	else
+		used_flags = _mm512_setzero_si512();
+
+	/* Check all descs are used */
+	desc_stats = _mm512_cmp_epu64_mask(flags_mask, used_flags,
+			_MM_CMPINT_EQ);
+	if (desc_stats != 0xff)
+		return -1;
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie;
+		rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *));
+
+		addrs[i << 1] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1;
+		addrs[(i << 1) + 1] =
+			(uint64_t)rx_pkts[i]->rx_descriptor_fields1 + 8;
+	}
+
+	/* addresses of pkt_len and data_len */
+	__m512i vindex = _mm512_loadu_si512((void *)addrs);
+
+	/*
+	 * select 10b*4 load 32bit from packed_desc[95:64]
+	 * mmask  0110b*4 save 32bit into pkt_len and data_len
+	 */
+	__m512i value = _mm512_maskz_shuffle_epi32(0x6666, packed_desc, 0xAA);
+
+	/* mmask 0110b*4 reduce hdr_len from pkt_len and data_len */
+	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(0x6666,
+			(uint32_t)-hdr_size);
+
+	value = _mm512_add_epi32(value, mbuf_len_offset);
+	/* batch store into mbufs */
+	_mm512_i64scatter_epi64(0, vindex, value, 1);
+
+	if (hw->has_rx_offload) {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			char *addr = (char *)rx_pkts[i]->buf_addr +
+				RTE_PKTMBUF_HEADROOM - hdr_size;
+			virtio_vec_rx_offload(rx_pkts[i],
+					(struct virtio_net_hdr *)addr);
+		}
+	}
+
+	virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len,
+			rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len,
+			rx_pkts[3]->pkt_len);
+
+	vq->vq_free_cnt += PACKED_BATCH_SIZE;
+
+	vq->vq_used_cons_idx += PACKED_BATCH_SIZE;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static uint16_t
+virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq,
+				    struct rte_mbuf **rx_pkts)
+{
+	uint16_t used_idx, id;
+	uint32_t len;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint32_t hdr_size = hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	struct vring_packed_desc *desc;
+	struct rte_mbuf *cookie;
+
+	desc = vq->vq_packed.ring.desc;
+	used_idx = vq->vq_used_cons_idx;
+	if (!desc_is_used(&desc[used_idx], vq))
+		return -1;
+
+	len = desc[used_idx].len;
+	id = desc[used_idx].id;
+	cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie;
+	if (unlikely(cookie == NULL)) {
+		PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u",
+				vq->vq_used_cons_idx);
+		return -1;
+	}
+	rte_prefetch0(cookie);
+	rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
+
+	cookie->data_off = RTE_PKTMBUF_HEADROOM;
+	cookie->ol_flags = 0;
+	cookie->pkt_len = (uint32_t)(len - hdr_size);
+	cookie->data_len = (uint32_t)(len - hdr_size);
+
+	hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr +
+					RTE_PKTMBUF_HEADROOM - hdr_size);
+	if (hw->has_rx_offload)
+		virtio_vec_rx_offload(cookie, hdr);
+
+	*rx_pkts = cookie;
+
+	rxvq->stats.bytes += cookie->pkt_len;
+
+	vq->vq_free_cnt++;
+	vq->vq_used_cons_idx++;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static inline void
+virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq,
+			      struct rte_mbuf **cookie,
+			      uint16_t num)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc;
+	uint16_t flags = vq->vq_packed.cached_flags;
+	struct virtio_hw *hw = vq->hw;
+	struct vq_desc_extra *dxp;
+	uint16_t idx, i;
+	uint16_t total_num = 0;
+	uint16_t head_idx = vq->vq_avail_idx;
+	uint16_t head_flag = vq->vq_packed.cached_flags;
+	uint64_t addr;
+
+	do {
+		idx = vq->vq_avail_idx;
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			dxp = &vq->vq_descx[idx + i];
+			dxp->cookie = (void *)cookie[total_num + i];
+
+			addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) +
+				RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size;
+			start_dp[idx + i].addr = addr;
+			start_dp[idx + i].len = cookie[total_num + i]->buf_len
+				- RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
+			if (total_num || i) {
+				virtqueue_store_flags_packed(&start_dp[idx + i],
+						flags, hw->weak_barriers);
+			}
+		}
+
+		vq->vq_avail_idx += PACKED_BATCH_SIZE;
+		if (vq->vq_avail_idx >= vq->vq_nentries) {
+			vq->vq_avail_idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+			flags = vq->vq_packed.cached_flags;
+		}
+		total_num += PACKED_BATCH_SIZE;
+	} while (total_num < num);
+
+	virtqueue_store_flags_packed(&start_dp[head_idx], head_flag,
+				hw->weak_barriers);
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num);
+}
+
+uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue,
+			    struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct virtnet_rx *rxvq = rx_queue;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t num, nb_rx = 0;
+	uint32_t nb_enqueued = 0;
+	uint16_t free_cnt = vq->vq_free_thresh;
+
+	if (unlikely(hw->started == 0))
+		return nb_rx;
+
+	num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts);
+	if (likely(num > PACKED_BATCH_SIZE))
+		num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE);
+
+	while (num) {
+		if (!virtqueue_dequeue_batch_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx += PACKED_BATCH_SIZE;
+			num -= PACKED_BATCH_SIZE;
+			continue;
+		}
+		if (!virtqueue_dequeue_single_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx++;
+			num--;
+			continue;
+		}
+		break;
+	};
+
+	PMD_RX_LOG(DEBUG, "dequeue:%d", num);
+
+	rxvq->stats.packets += nb_rx;
+
+	if (likely(vq->vq_free_cnt >= free_cnt)) {
+		struct rte_mbuf *new_pkts[free_cnt];
+		if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts,
+						free_cnt) == 0)) {
+			virtio_recv_refill_packed_vec(rxvq, new_pkts,
+					free_cnt);
+			nb_enqueued += free_cnt;
+		} else {
+			struct rte_eth_dev *dev =
+				&rte_eth_devices[rxvq->port_id];
+			dev->data->rx_mbuf_alloc_failed += free_cnt;
+		}
+	}
+
+	if (likely(nb_enqueued)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_RX_LOG(DEBUG, "Notified");
+		}
+	}
+
+	return nb_rx;
+}
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 6301c56b2..43e305ecc 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -20,6 +20,7 @@ struct rte_mbuf;
 
 #define DEFAULT_RX_FREE_THRESH 32
 
+#define VIRTIO_MBUF_BURST_SZ 64
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
@@ -236,7 +237,8 @@ struct vq_desc_extra {
 	void *cookie;
 	uint16_t ndescs;
 	uint16_t next;
-};
+	uint8_t padding[4];
+} __rte_packed __rte_aligned(16);
 
 struct virtqueue {
 	struct virtio_hw  *hw; /**< virtio_hw structure pointer. */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v5 6/9] net/virtio: reuse packed ring xmit functions
  2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu
                     ` (4 preceding siblings ...)
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu
@ 2020-04-16 15:31   ` Marvin Liu
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Move the xmit offload and packed ring xmit enqueue functions to the header
file. These functions will be reused by the packed ring vectorized Tx function.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 7b65d0b0a..cf18fe564 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq,
 	return i;
 }
 
-#ifndef DEFAULT_TX_FREE_THRESH
-#define DEFAULT_TX_FREE_THRESH 32
-#endif
-
 static void
 virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num)
 {
@@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m)
 }
 
 
-/* avoid write operation when necessary, to lessen cache issues */
-#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
-	if ((var) != (val))			\
-		(var) = (val);			\
-} while (0)
-
-#define virtqueue_clear_net_hdr(_hdr) do {		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0);		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0);	\
-} while (0)
-
-static inline void
-virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
-			struct rte_mbuf *cookie,
-			bool offload)
-{
-	if (offload) {
-		if (cookie->ol_flags & PKT_TX_TCP_SEG)
-			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
-
-		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
-		case PKT_TX_UDP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_udp_hdr,
-				dgram_cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		case PKT_TX_TCP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		default:
-			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
-			break;
-		}
 
-		/* TCP Segmentation Offload */
-		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
-			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
-				VIRTIO_NET_HDR_GSO_TCPV6 :
-				VIRTIO_NET_HDR_GSO_TCPV4;
-			hdr->gso_size = cookie->tso_segsz;
-			hdr->hdr_len =
-				cookie->l2_len +
-				cookie->l3_len +
-				cookie->l4_len;
-		} else {
-			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
-		}
-	}
-}
 
 static inline void
 virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq,
@@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq,
 	virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers);
 }
 
-static inline void
-virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
-			      uint16_t needed, int can_push, int in_order)
-{
-	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
-	struct vq_desc_extra *dxp;
-	struct virtqueue *vq = txvq->vq;
-	struct vring_packed_desc *start_dp, *head_dp;
-	uint16_t idx, id, head_idx, head_flags;
-	int16_t head_size = vq->hw->vtnet_hdr_size;
-	struct virtio_net_hdr *hdr;
-	uint16_t prev;
-	bool prepend_header = false;
-
-	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
-
-	dxp = &vq->vq_descx[id];
-	dxp->ndescs = needed;
-	dxp->cookie = cookie;
-
-	head_idx = vq->vq_avail_idx;
-	idx = head_idx;
-	prev = head_idx;
-	start_dp = vq->vq_packed.ring.desc;
-
-	head_dp = &vq->vq_packed.ring.desc[idx];
-	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-	head_flags |= vq->vq_packed.cached_flags;
-
-	if (can_push) {
-		/* prepend cannot fail, checked by caller */
-		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
-					      -head_size);
-		prepend_header = true;
-
-		/* if offload disabled, it is not zeroed below, do it now */
-		if (!vq->hw->has_tx_offload)
-			virtqueue_clear_net_hdr(hdr);
-	} else {
-		/* setup first tx ring slot to point to header
-		 * stored in reserved region.
-		 */
-		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
-			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
-		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
-		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	}
-
-	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
-
-	do {
-		uint16_t flags;
-
-		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
-		start_dp[idx].len  = cookie->data_len;
-		if (prepend_header) {
-			start_dp[idx].addr -= head_size;
-			start_dp[idx].len += head_size;
-			prepend_header = false;
-		}
-
-		if (likely(idx != head_idx)) {
-			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-			flags |= vq->vq_packed.cached_flags;
-			start_dp[idx].flags = flags;
-		}
-		prev = idx;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	} while ((cookie = cookie->next) != NULL);
-
-	start_dp[prev].id = id;
-
-	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
-	vq->vq_avail_idx = idx;
-
-	if (!in_order) {
-		vq->vq_desc_head_idx = dxp->next;
-		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
-			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
-	}
-
-	virtqueue_store_flags_packed(head_dp, head_flags,
-				     vq->hw->weak_barriers);
-}
-
 static inline void
 virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 			uint16_t needed, int use_indirect, int can_push,
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 43e305ecc..18ae34789 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,7 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_TX_FREE_THRESH 32
 #define DEFAULT_RX_FREE_THRESH 32
 
 #define VIRTIO_MBUF_BURST_SZ 64
@@ -562,4 +563,165 @@ virtqueue_notify(struct virtqueue *vq)
 #define VIRTQUEUE_DUMP(vq) do { } while (0)
 #endif
 
+/* avoid write operation when necessary, to lessen cache issues */
+#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
+	typeof(var) var_ = (var);		\
+	typeof(val) val_ = (val);		\
+	if ((var_) != (val_))			\
+		(var_) = (val_);		\
+} while (0)
+
+#define virtqueue_clear_net_hdr(hdr) do {		\
+	typeof(hdr) hdr_ = (hdr);			\
+	ASSIGN_UNLESS_EQUAL((hdr_)->csum_start, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->csum_offset, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->flags, 0);		\
+	ASSIGN_UNLESS_EQUAL((hdr_)->gso_type, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->gso_size, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->hdr_len, 0);	\
+} while (0)
+
+static inline void
+virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
+			struct rte_mbuf *cookie,
+			bool offload)
+{
+	if (offload) {
+		if (cookie->ol_flags & PKT_TX_TCP_SEG)
+			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
+
+		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
+		case PKT_TX_UDP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_udp_hdr,
+				dgram_cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		case PKT_TX_TCP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		default:
+			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
+			break;
+		}
+
+		/* TCP Segmentation Offload */
+		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
+			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
+				VIRTIO_NET_HDR_GSO_TCPV6 :
+				VIRTIO_NET_HDR_GSO_TCPV4;
+			hdr->gso_size = cookie->tso_segsz;
+			hdr->hdr_len =
+				cookie->l2_len +
+				cookie->l3_len +
+				cookie->l4_len;
+		} else {
+			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
+		}
+	}
+}
+
+static inline void
+virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
+			      uint16_t needed, int can_push, int in_order)
+{
+	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
+	struct vq_desc_extra *dxp;
+	struct virtqueue *vq = txvq->vq;
+	struct vring_packed_desc *start_dp, *head_dp;
+	uint16_t idx, id, head_idx, head_flags;
+	int16_t head_size = vq->hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	uint16_t prev;
+	bool prepend_header = false;
+
+	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
+
+	dxp = &vq->vq_descx[id];
+	dxp->ndescs = needed;
+	dxp->cookie = cookie;
+
+	head_idx = vq->vq_avail_idx;
+	idx = head_idx;
+	prev = head_idx;
+	start_dp = vq->vq_packed.ring.desc;
+
+	head_dp = &vq->vq_packed.ring.desc[idx];
+	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+	head_flags |= vq->vq_packed.cached_flags;
+
+	if (can_push) {
+		/* prepend cannot fail, checked by caller */
+		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
+					      -head_size);
+		prepend_header = true;
+
+		/* if offload disabled, it is not zeroed below, do it now */
+		if (!vq->hw->has_tx_offload)
+			virtqueue_clear_net_hdr(hdr);
+	} else {
+		/* setup first tx ring slot to point to header
+		 * stored in reserved region.
+		 */
+		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
+			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
+		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
+		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	}
+
+	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
+
+	do {
+		uint16_t flags;
+
+		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
+		start_dp[idx].len  = cookie->data_len;
+		if (prepend_header) {
+			start_dp[idx].addr -= head_size;
+			start_dp[idx].len += head_size;
+			prepend_header = false;
+		}
+
+		if (likely(idx != head_idx)) {
+			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+			flags |= vq->vq_packed.cached_flags;
+			start_dp[idx].flags = flags;
+		}
+		prev = idx;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	} while ((cookie = cookie->next) != NULL);
+
+	start_dp[prev].id = id;
+
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
+	vq->vq_avail_idx = idx;
+
+	if (!in_order) {
+		vq->vq_desc_head_idx = dxp->next;
+		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
+			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
+	}
+
+	virtqueue_store_flags_packed(head_dp, head_flags,
+				     vq->hw->weak_barriers);
+}
 #endif /* _VIRTQUEUE_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v5 7/9] net/virtio: add vectorized packed ring Tx path
  2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu
                     ` (5 preceding siblings ...)
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu
@ 2020-04-16 15:31   ` Marvin Liu
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 8/9] net/virtio: add election for vectorized path Marvin Liu
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 9/9] doc: add packed " Marvin Liu
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Optimize the packed ring Tx path in the same way as the Rx path: split the
Tx path into batch and single Tx functions, with the batch function further
optimized by vector instructions.
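
For reference, a minimal scalar sketch of the batch eligibility check that
the AVX512 code in this patch performs with masked vector compares. Macro
and field names follow the patch; the helper itself is illustrative only
and assumes the driver headers (virtqueue.h, rte_mbuf.h) are included.

static inline int
virtio_tx_batch_ok_sketch(struct virtqueue *vq, struct rte_mbuf **tx_pkts,
			  uint16_t hdr_size)
{
	uint16_t i;

	/* the four slots must start at a batch-aligned avail index */
	if (vq->vq_avail_idx & PACKED_BATCH_MASK)
		return 0;

	for (i = 0; i < PACKED_BATCH_SIZE; i++) {
		/* only exclusively owned, single-segment mbufs qualify */
		if (rte_mbuf_refcnt_read(tx_pkts[i]) != 1 ||
		    tx_pkts[i]->nb_segs != 1)
			return 0;
		/* the net header is prepended in place, headroom must fit */
		if (rte_pktmbuf_headroom(tx_pkts[i]) < hdr_size)
			return 0;
	}
	return 1;
}

virtio_xmit_pkts_packed_vec() falls back to the single-descriptor enqueue
whenever a batch does not qualify.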

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 10e39670e..c9aaef0af 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -107,6 +107,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index cf18fe564..f82fe8d64 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
 {
 	return 0;
 }
+
+__rte_weak uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused,
+			    struct rte_mbuf **tx_pkts __rte_unused,
+			    uint16_t nb_pkts __rte_unused)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
index f2976b98f..92094783a 100644
--- a/drivers/net/virtio/virtio_rxtx_packed_avx.c
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -15,6 +15,21 @@
 #include "virtio_pci.h"
 #include "virtqueue.h"
 
+/* reference count offset in mbuf rearm data */
+#define REF_CNT_OFFSET 16
+/* segment number offset in mbuf rearm data */
+#define SEG_NUM_OFFSET 32
+
+#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_OFFSET | \
+			  1ULL << REF_CNT_OFFSET)
+/* id offset in packed ring desc higher 64bits */
+#define ID_OFFSET 32
+/* flag offset in packed ring desc higher 64bits */
+#define FLAG_OFFSET 48
+
+/* net hdr short size mask */
+#define NET_HDR_MASK 0x3F
+
 #define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63)
 
 #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
@@ -41,6 +56,47 @@
 	for (iter = val; iter < num; iter++)
 #endif
 
+static void
+virtio_xmit_cleanup_packed_vec(struct virtqueue *vq)
+{
+	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
+	struct vq_desc_extra *dxp;
+	uint16_t used_idx, id, curr_id, free_cnt = 0;
+	uint16_t size = vq->vq_nentries;
+	struct rte_mbuf *mbufs[size];
+	uint16_t nb_mbuf = 0, i;
+
+	used_idx = vq->vq_used_cons_idx;
+
+	if (!desc_is_used(&desc[used_idx], vq))
+		return;
+
+	id = desc[used_idx].id;
+
+	do {
+		curr_id = used_idx;
+		dxp = &vq->vq_descx[used_idx];
+		used_idx += dxp->ndescs;
+		free_cnt += dxp->ndescs;
+
+		if (dxp->cookie != NULL) {
+			mbufs[nb_mbuf] = dxp->cookie;
+			dxp->cookie = NULL;
+			nb_mbuf++;
+		}
+
+		if (used_idx >= size) {
+			used_idx -= size;
+			vq->vq_packed.used_wrap_counter ^= 1;
+		}
+	} while (curr_id != id);
+
+	for (i = 0; i < nb_mbuf; i++)
+		rte_pktmbuf_free(mbufs[i]);
+
+	vq->vq_used_cons_idx = used_idx;
+	vq->vq_free_cnt += free_cnt;
+}
 
 static inline void
 virtio_update_batch_stats(struct virtnet_stats *stats,
@@ -54,6 +110,231 @@ virtio_update_batch_stats(struct virtnet_stats *stats,
 	stats->bytes += pkt_len3;
 	stats->bytes += pkt_len4;
 }
+
+static inline int
+virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq,
+				   struct rte_mbuf **tx_pkts)
+{
+	struct virtqueue *vq = txvq->vq;
+	uint16_t head_size = vq->hw->vtnet_hdr_size;
+	uint16_t idx = vq->vq_avail_idx;
+	struct virtio_net_hdr *hdr;
+	uint16_t i, cmp;
+
+	if (vq->vq_avail_idx & PACKED_BATCH_MASK)
+		return -1;
+
+	/* Load four mbufs rearm data */
+	__m256i mbufs = _mm256_set_epi64x(*tx_pkts[3]->rearm_data,
+					  *tx_pkts[2]->rearm_data,
+					  *tx_pkts[1]->rearm_data,
+					  *tx_pkts[0]->rearm_data);
+
+	/* refcnt=1 and nb_segs=1 */
+	__m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
+	__m256i head_rooms = _mm256_set1_epi16(head_size);
+
+	/* Check refcnt and nb_segs */
+	cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
+	if (cmp & 0x6666)
+		return -1;
+
+	/* Check headroom is enough */
+	cmp = _mm256_mask_cmp_epu16_mask(0x1111, mbufs, head_rooms,
+			_MM_CMPINT_LT);
+	if (unlikely(cmp))
+		return -1;
+
+	__m512i dxps = _mm512_set_epi64(0x1, (uint64_t)tx_pkts[3],
+					0x1, (uint64_t)tx_pkts[2],
+					0x1, (uint64_t)tx_pkts[1],
+					0x1, (uint64_t)tx_pkts[0]);
+
+	_mm512_storeu_si512((void *)&vq->vq_descx[idx], dxps);
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		tx_pkts[i]->data_off -= head_size;
+		tx_pkts[i]->data_len += head_size;
+	}
+
+#ifdef RTE_VIRTIO_USER
+	__m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])),
+			tx_pkts[2]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])),
+			tx_pkts[1]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])),
+			tx_pkts[0]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0])));
+#else
+	__m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len,
+					      tx_pkts[3]->buf_iova,
+					      tx_pkts[2]->data_len,
+					      tx_pkts[2]->buf_iova,
+					      tx_pkts[1]->data_len,
+					      tx_pkts[1]->buf_iova,
+					      tx_pkts[0]->data_len,
+					      tx_pkts[0]->buf_iova);
+#endif
+
+	/* id offset and data offset */
+	__m512i data_offsets = _mm512_set_epi64((uint64_t)3 << ID_OFFSET,
+						tx_pkts[3]->data_off,
+						(uint64_t)2 << ID_OFFSET,
+						tx_pkts[2]->data_off,
+						(uint64_t)1 << ID_OFFSET,
+						tx_pkts[1]->data_off,
+						0, tx_pkts[0]->data_off);
+
+	__m512i new_descs = _mm512_add_epi64(descs_base, data_offsets);
+
+	uint64_t flags_temp = (uint64_t)idx << ID_OFFSET |
+		(uint64_t)vq->vq_packed.cached_flags << FLAG_OFFSET;
+
+	/* flags offset and guest virtual address offset */
+#ifdef RTE_VIRTIO_USER
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset);
+#else
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, 0);
+#endif
+	__m512i flag_offsets = _mm512_broadcast_i32x4(flag_offset);
+
+	__m512i descs = _mm512_add_epi64(new_descs, flag_offsets);
+
+	if (!vq->hw->has_tx_offload) {
+		__m128i mask = _mm_set1_epi16(0xFFFF);
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			__m128i v_hdr = _mm_loadu_si128((void *)hdr);
+			if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK,
+							v_hdr, mask))) {
+				__m128i all_zero = _mm_setzero_si128();
+				_mm_mask_storeu_epi16((void *)hdr,
+						NET_HDR_MASK, all_zero);
+			}
+		}
+	} else {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			virtqueue_xmit_offload(hdr, tx_pkts[i], true);
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	rte_smp_wmb();
+	_mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], descs);
+
+	virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len,
+			tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len,
+			tx_pkts[3]->pkt_len);
+
+	vq->vq_avail_idx += PACKED_BATCH_SIZE;
+	vq->vq_free_cnt -= PACKED_BATCH_SIZE;
+
+	if (vq->vq_avail_idx >= vq->vq_nentries) {
+		vq->vq_avail_idx -= vq->vq_nentries;
+		vq->vq_packed.cached_flags ^=
+			VRING_PACKED_DESC_F_AVAIL_USED;
+	}
+
+	return 0;
+}
+
+static inline int
+virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq,
+				    struct rte_mbuf *txm)
+{
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint16_t slots, can_push;
+	int16_t need;
+
+	/* How many main ring entries are needed to this Tx?
+	 * any_layout => number of segments
+	 * default    => number of segments + 1
+	 */
+	can_push = rte_mbuf_refcnt_read(txm) == 1 &&
+		   RTE_MBUF_DIRECT(txm) &&
+		   txm->nb_segs == 1 &&
+		   rte_pktmbuf_headroom(txm) >= hdr_size;
+
+	slots = txm->nb_segs + !can_push;
+	need = slots - vq->vq_free_cnt;
+
+	/* Positive value indicates it need free vring descriptors */
+	if (unlikely(need > 0)) {
+		virtio_xmit_cleanup_packed_vec(vq);
+		need = slots - vq->vq_free_cnt;
+		if (unlikely(need > 0)) {
+			PMD_TX_LOG(ERR,
+				   "No free tx descriptors to transmit");
+			return -1;
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1);
+
+	txvq->stats.bytes += txm->pkt_len;
+	return 0;
+}
+
+uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			uint16_t nb_pkts)
+{
+	struct virtnet_tx *txvq = tx_queue;
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t nb_tx = 0;
+	uint16_t remained;
+
+	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
+		return nb_tx;
+
+	if (unlikely(nb_pkts < 1))
+		return nb_pkts;
+
+	PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+	if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh)
+		virtio_xmit_cleanup_packed_vec(vq);
+
+	remained = RTE_MIN(nb_pkts, vq->vq_free_cnt);
+
+	while (remained) {
+		if (remained >= PACKED_BATCH_SIZE) {
+			if (!virtqueue_enqueue_batch_packed_vec(txvq,
+						&tx_pkts[nb_tx])) {
+				nb_tx += PACKED_BATCH_SIZE;
+				remained -= PACKED_BATCH_SIZE;
+				continue;
+			}
+		}
+		if (!virtqueue_enqueue_single_packed_vec(txvq,
+					tx_pkts[nb_tx])) {
+			nb_tx++;
+			remained--;
+			continue;
+		}
+		break;
+	};
+
+	txvq->stats.packets += nb_tx;
+
+	if (likely(nb_tx)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_TX_LOG(DEBUG, "Notified backend after xmit");
+		}
+	}
+
+	return nb_tx;
+}
+
 /* Optionally fill offload information in structure */
 static inline int
 virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v5 8/9] net/virtio: add election for vectorized path
  2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu
                     ` (6 preceding siblings ...)
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu
@ 2020-04-16 15:31   ` Marvin Liu
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 9/9] doc: add packed " Marvin Liu
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Rewrite the vectorized path selection logic. The default setting comes from
the RTE_LIBRTE_VIRTIO_INC_VECTOR option. Path selection criteria are checked
as listed below; a condensed sketch of the checks follows the lists.

Packed ring vectorized path will be selected when:
    vectorized option is enabled
    AVX512F and required extensions are supported by compiler and host
    virtio VERSION_1 and IN_ORDER features are negotiated
    virtio mergeable feature is not negotiated
    LRO offloading is disabled

Split ring vectorized rx path will be selected when:
    vectorized option is enabled
    virtio mergeable and IN_ORDER features are not negotiated
    LRO, chksum and vlan strip offloading are disabled
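
A condensed C sketch of the packed ring checks above (the vectorized option
itself must also be enabled); the authoritative logic is the
virtio_dev_configure() hunk in the diff below, this helper merely restates
it and assumes the driver headers:

static int
virtio_packed_vec_usable_sketch(struct virtio_hw *hw, uint64_t rx_offloads)
{
	/* build/runtime support and mandatory feature bits (Rx and Tx) */
	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
	    !vtpci_with_feature(hw, VIRTIO_F_VERSION_1) ||
	    !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER))
		return 0;

	/* the Rx path additionally needs no mergeable buffers and no LRO */
	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF) ||
	    (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO))
		return 0;

	return 1;
}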

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 4c7d60ca0..de4cef843 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1518,9 +1518,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	if (vtpci_packed_queue(hw)) {
 		PMD_INIT_LOG(INFO,
 			"virtio: using packed ring %s Tx path on port %u",
-			hw->use_inorder_tx ? "inorder" : "standard",
+			hw->use_vec_tx ? "vectorized" : "standard",
 			eth_dev->data->port_id);
-		eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
+		if (hw->use_vec_tx)
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec;
+		else
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
 	} else {
 		if (hw->use_inorder_tx) {
 			PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u",
@@ -1534,7 +1537,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+		if (hw->use_vec_rx) {
+			PMD_INIT_LOG(INFO,
+				"virtio: using packed ring vectorized Rx path on port %u",
+				eth_dev->data->port_id);
+			eth_dev->rx_pkt_burst =
+				&virtio_recv_pkts_packed_vec;
+		} else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
 			PMD_INIT_LOG(INFO,
 				"virtio: using packed ring mergeable buffer Rx path on port %u",
 				eth_dev->data->port_id);
@@ -1548,7 +1557,7 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 		}
 	} else {
 		if (hw->use_vec_rx) {
-			PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u",
+			PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u",
 				eth_dev->data->port_id);
 			eth_dev->rx_pkt_burst = virtio_recv_pkts_vec;
 		} else if (hw->use_inorder_rx) {
@@ -1921,6 +1930,10 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 		goto err_virtio_init;
 
 	hw->opened = true;
+#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR
+	hw->use_vec_rx = 1;
+	hw->use_vec_tx = 1;
+#endif
 
 	return 0;
 
@@ -2157,31 +2170,63 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 			return -EBUSY;
 		}
 
-	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
-		hw->use_inorder_tx = 1;
-		hw->use_inorder_rx = 1;
-		hw->use_vec_rx = 0;
-	}
-
 	if (vtpci_packed_queue(hw)) {
-		hw->use_vec_rx = 0;
-		hw->use_inorder_rx = 0;
-	}
+#if defined RTE_ARCH_X86
+		if ((hw->use_vec_rx || hw->use_vec_tx) &&
+		    (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
+		     !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) ||
+		     !vtpci_with_feature(hw, VIRTIO_F_VERSION_1))) {
+			PMD_DRV_LOG(INFO,
+				"disabled packed ring vectorization for requirements are not met");
+			hw->use_vec_rx = 0;
+			hw->use_vec_tx = 0;
+		}
+#endif
+
+		if (hw->use_vec_rx) {
+			if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+				PMD_DRV_LOG(INFO,
+					"disabled packed ring vectorized rx for mrg_rxbuf enabled");
+				hw->use_vec_rx = 0;
+			}
 
+			if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) {
+				PMD_DRV_LOG(INFO,
+					"disabled packed ring vectorized rx for TCP_LRO enabled");
+				hw->use_vec_rx = 0;
+			}
+		}
+	} else {
+		if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
+			hw->use_inorder_tx = 1;
+			hw->use_inorder_rx = 1;
+			hw->use_vec_rx = 0;
+		}
+
+		if (hw->use_vec_rx) {
 #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
-	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
-		hw->use_vec_rx = 0;
-	}
+			if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorization for requirements are not met");
+				hw->use_vec_rx = 0;
+			}
 #endif
-	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
-		hw->use_vec_rx = 0;
-	}
+			if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized rx for mrg_rxbuf enabled");
+				hw->use_vec_rx = 0;
+			}
 
-	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
-			   DEV_RX_OFFLOAD_TCP_CKSUM |
-			   DEV_RX_OFFLOAD_TCP_LRO |
-			   DEV_RX_OFFLOAD_VLAN_STRIP))
-		hw->use_vec_rx = 0;
+			if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
+					   DEV_RX_OFFLOAD_TCP_CKSUM |
+					   DEV_RX_OFFLOAD_TCP_LRO |
+					   DEV_RX_OFFLOAD_VLAN_STRIP)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized rx for offloading enabled");
+				hw->use_vec_rx = 0;
+			}
+		}
+	}
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v5 9/9] doc: add packed vectorized path
  2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu
                     ` (7 preceding siblings ...)
  2020-04-16 15:31   ` [dpdk-dev] [PATCH v5 8/9] net/virtio: add election for vectorized path Marvin Liu
@ 2020-04-16 15:31   ` Marvin Liu
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Document the packed virtqueue vectorized path selection logic in the virtio
net PMD. Add the packed virtqueue vectorized path features to a new ini file.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/doc/guides/nics/features/virtio-packed_vec.ini b/doc/guides/nics/features/virtio-packed_vec.ini
new file mode 100644
index 000000000..b239bcaad
--- /dev/null
+++ b/doc/guides/nics/features/virtio-packed_vec.ini
@@ -0,0 +1,22 @@
+;
+; Supported features of the 'virtio_packed_vec' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Speed capabilities   = P
+Link status          = Y
+Link status event    = Y
+Rx interrupt         = Y
+Queue start/stop     = Y
+Promiscuous mode     = Y
+Allmulticast mode    = Y
+Unicast MAC filter   = Y
+Multicast MAC filter = Y
+VLAN filter          = Y
+Basic stats          = Y
+Stats per queue      = Y
+BSD nic_uio          = Y
+Linux UIO            = Y
+Linux VFIO           = Y
+x86-64               = Y
diff --git a/doc/guides/nics/features/virtio_vec.ini b/doc/guides/nics/features/virtio-split_vec.ini
similarity index 88%
rename from doc/guides/nics/features/virtio_vec.ini
rename to doc/guides/nics/features/virtio-split_vec.ini
index e60fe36ae..4142fc9f0 100644
--- a/doc/guides/nics/features/virtio_vec.ini
+++ b/doc/guides/nics/features/virtio-split_vec.ini
@@ -1,5 +1,5 @@
 ;
-; Supported features of the 'virtio_vec' network poll mode driver.
+; Supported features of the 'virtio_split_vec' network poll mode driver.
 ;
 ; Refer to default.ini for the full list of available PMD features.
 ;
diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index d1f5fb898..be07744ce 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -403,6 +403,11 @@ Below devargs are supported by the virtio-user vdev:
     It is used to enable virtio device packed virtqueue feature.
     (Default: 0 (disabled))
 
+#.  ``vectorized``:
+
+    It is used to enable virtio device vectorized path.
+    (Default: 0 (disabled))
+
 Virtio paths Selection and Usage
 --------------------------------
 
@@ -454,6 +459,13 @@ according to below configuration:
    both negotiated, this path will be selected.
 #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and
    Rx mergeable is not negotiated, this path will be selected.
+#. Packed virtqueue vectorized Rx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated &&
+   TCP_LRO Rx offloading is disabled && vectorized option enabled,
+   this path will be selected.
+#. Packed virtqueue vectorized Tx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && vectorized option enabled,
+   this path will be selected.
 
 Rx/Tx callbacks of each Virtio path
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -476,6 +488,8 @@ are shown in below table:
    Packed virtqueue non-meregable path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed
    Packed virtqueue in-order mergeable path     virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed
    Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed           virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Rx path          virtio_recv_pkts_packed_vec       virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Tx path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed_vec
    ============================================ ================================= ========================
 
 Virtio paths Support Status from Release to Release
@@ -493,20 +507,22 @@ All virtio paths support status are shown in below table:
 
 .. table:: Virtio Paths and Releases
 
-   ============================================ ============= ============= =============
-                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11
-   ============================================ ============= ============= =============
-   Split virtqueue mergeable path                     Y             Y             Y
-   Split virtqueue non-mergeable path                 Y             Y             Y
-   Split virtqueue vectorized Rx path                 Y             Y             Y
-   Split virtqueue simple Tx path                     Y             N             N
-   Split virtqueue in-order mergeable path                          Y             Y
-   Split virtqueue in-order non-mergeable path                      Y             Y
-   Packed virtqueue mergeable path                                                Y
-   Packed virtqueue non-mergeable path                                            Y
-   Packed virtqueue in-order mergeable path                                       Y
-   Packed virtqueue in-order non-mergeable path                                   Y
-   ============================================ ============= ============= =============
+   ============================================ ============= ============= ============= =======
+                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~
+   ============================================ ============= ============= ============= =======
+   Split virtqueue mergeable path                     Y             Y             Y          Y
+   Split virtqueue non-mergeable path                 Y             Y             Y          Y
+   Split virtqueue vectorized Rx path                 Y             Y             Y          Y
+   Split virtqueue simple Tx path                     Y             N             N          N
+   Split virtqueue in-order mergeable path                          Y             Y          Y
+   Split virtqueue in-order non-mergeable path                      Y             Y          Y
+   Packed virtqueue mergeable path                                                Y          Y
+   Packed virtqueue non-mergeable path                                            Y          Y
+   Packed virtqueue in-order mergeable path                                       Y          Y
+   Packed virtqueue in-order non-mergeable path                                   Y          Y
+   Packed virtqueue vectorized Rx path                                                       Y
+   Packed virtqueue vectorized Tx path                                                       Y
+   ============================================ ============= ============= ============= =======
 
 QEMU Support Status
 ~~~~~~~~~~~~~~~~~~~
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v6 0/9] add packed ring vectorized path
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
                   ` (10 preceding siblings ...)
  2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu
@ 2020-04-16 22:24 ` Marvin Liu
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 1/9] net/virtio: add Rx free threshold setting Marvin Liu
                     ` (8 more replies)
  2020-04-22  6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu
                   ` (5 subsequent siblings)
  17 siblings, 9 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

This patch set introduces a vectorized path for the packed ring.

A packed ring descriptor is 16 bytes, so four batched descriptors fit
exactly into one cache line, which AVX512 instructions can handle well.
The packed ring Tx path can be fully transformed into a vectorized path.
The packed ring Rx path can be vectorized when its requirements are met
(LRO and mergeable buffers disabled).
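
To illustrate the point, a standalone sketch (not part of the patches; the
descriptor layout is the one defined by the virtio 1.1 spec and the
intrinsic is plain AVX512F):

#include <stdint.h>
#include <immintrin.h>

/* 16 bytes per descriptor, so 4 descriptors == one 64B cache line */
struct vring_packed_desc_sketch {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

_Static_assert(sizeof(struct vring_packed_desc_sketch) == 16,
	       "packed descriptor must stay 16 bytes");

static inline __m512i
load_desc_batch(const struct vring_packed_desc_sketch *desc, uint16_t idx)
{
	/* with idx batch-aligned, one 512-bit load covers a whole batch */
	return _mm512_loadu_si512((const void *)&desc[idx]);
}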

A new option, RTE_LIBRTE_VIRTIO_INC_VECTOR, is introduced in this patch
set. It unifies the default setting of the split and packed ring
vectorized paths. In addition, the user can specify whether to enable the
vectorized path at runtime through the 'vectorized' parameter of the
virtio-user vdev.

v6:
1. fix issue when ring size is not a power of 2

v5:
1. remove cpuflags definition as required extensions always come with
   AVX512F on x86_64
2. inorder actions should depend on feature bit
3. check ring type in rx queue setup
4. rewrite some commit logs
5. fix some checkpatch warnings

v4:
1. rename 'packed_vec' to 'vectorized', also used in split ring
2. add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev
3. check required AVX512 extensions cpuflags
4. combine split and packed ring datapath selection logic
5. remove limitation that size must be a power of two
6. clear 12-byte virtio_net_hdr

v3:
1. remove virtio_net_hdr array for better performance
2. disable 'packed_vec' by default

v2:
1. more function blocks replaced by vector instructions
2. clean virtio_net_hdr by vector instruction
3. allow header room size change
4. add 'packed_vec' option in virtio_user vdev 
5. fix build not checking whether AVX512 is enabled
6. doc update

Marvin Liu (9):
  net/virtio: add Rx free threshold setting
  net/virtio: enable vectorized path
  net/virtio: inorder should depend on feature bit
  net/virtio-user: add vectorized path parameter
  net/virtio: add vectorized packed ring Rx path
  net/virtio: reuse packed ring xmit functions
  net/virtio: add vectorized packed ring Tx path
  net/virtio: add election for vectorized path
  doc: add packed vectorized path

 config/common_base                            |   1 +
 .../nics/features/virtio-packed_vec.ini       |  22 +
 .../{virtio_vec.ini => virtio-split_vec.ini}  |   2 +-
 doc/guides/nics/virtio.rst                    |  44 +-
 drivers/net/virtio/Makefile                   |  36 +
 drivers/net/virtio/meson.build                |  27 +-
 drivers/net/virtio/virtio_ethdev.c            |  95 ++-
 drivers/net/virtio/virtio_ethdev.h            |   6 +
 drivers/net/virtio/virtio_pci.h               |   3 +-
 drivers/net/virtio/virtio_rxtx.c              | 212 ++----
 drivers/net/virtio/virtio_rxtx_packed_avx.c   | 652 ++++++++++++++++++
 drivers/net/virtio/virtio_user_ethdev.c       |  39 +-
 drivers/net/virtio/virtqueue.c                |   7 +-
 drivers/net/virtio/virtqueue.h                | 168 ++++-
 14 files changed, 1093 insertions(+), 221 deletions(-)
 create mode 100644 doc/guides/nics/features/virtio-packed_vec.ini
 rename doc/guides/nics/features/{virtio_vec.ini => virtio-split_vec.ini} (88%)
 create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v6 1/9] net/virtio: add Rx free threshold setting
  2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu
@ 2020-04-16 22:24   ` Marvin Liu
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 2/9] net/virtio: enable vectorized path Marvin Liu
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Introduce a free threshold setting in the Rx queue; its default value is
32. Limit the threshold to a multiple of four, as only the vectorized
packed Rx function will utilize it. The virtio driver will rearm the Rx
queue once more than rx_free_thresh descriptors have been dequeued.
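
A sketch of how the threshold is consumed by the vectorized Rx path (the
field and helper names follow later patches in the series; illustrative
only):

static inline void
virtio_rx_rearm_sketch(struct virtnet_rx *rxvq)
{
	struct virtqueue *vq = rxvq->vq;
	uint16_t free_cnt = vq->vq_free_thresh;

	/* rearm only after at least rx_free_thresh descs were dequeued */
	if (vq->vq_free_cnt >= free_cnt) {
		struct rte_mbuf *new_pkts[free_cnt];

		if (rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts,
					   free_cnt) == 0)
			virtio_recv_refill_packed_vec(rxvq, new_pkts,
						      free_cnt);
	}
}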

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 060410577..94ba7a3ec 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 	struct virtio_hw *hw = dev->data->dev_private;
 	struct virtqueue *vq = hw->vqs[vtpci_queue_idx];
 	struct virtnet_rx *rxvq;
+	uint16_t rx_free_thresh;
 
 	PMD_INIT_FUNC_TRACE();
 
@@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 		return -EINVAL;
 	}
 
+	rx_free_thresh = rx_conf->rx_free_thresh;
+	if (rx_free_thresh == 0)
+		rx_free_thresh =
+			RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH);
+
+	if (rx_free_thresh & 0x3) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+
+	if (rx_free_thresh >= vq->vq_nentries) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the "
+			"number of RX entries (%u)."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			vq->vq_nentries,
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+	vq->vq_free_thresh = rx_free_thresh;
+
 	if (nb_desc == 0 || nb_desc > vq->vq_nentries)
 		nb_desc = vq->vq_nentries;
 	vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc);
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 58ad7309a..6301c56b2 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,8 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_RX_FREE_THRESH 32
+
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v6 2/9] net/virtio: enable vectorized path
  2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 1/9] net/virtio: add Rx free threshold setting Marvin Liu
@ 2020-04-16 22:24   ` Marvin Liu
  2020-04-20 14:08     ` Maxime Coquelin
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 3/9] net/virtio: inorder should depend on feature bit Marvin Liu
                     ` (6 subsequent siblings)
  8 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Previously, the virtio split ring vectorized path was enabled by default.
This is not suitable for everyone, because that path does not follow the
virtio spec. Add a new config option for virtio vectorized path selection.
The vectorized path remains enabled by default.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/config/common_base b/config/common_base
index c31175f9d..5901a94f7 100644
--- a/config/common_base
+++ b/config/common_base
@@ -449,6 +449,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n
+CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=y
 
 #
 # Compile virtio device emulation inside virtio PMD driver
diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index efdcb0d93..9ef445bc9 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -29,6 +29,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
 
+ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y)
 ifeq ($(CONFIG_RTE_ARCH_X86),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c
 else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y)
@@ -36,6 +37,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c
 else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
 endif
+endif
 
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
index 5e7ca855c..f9619a108 100644
--- a/drivers/net/virtio/meson.build
+++ b/drivers/net/virtio/meson.build
@@ -9,12 +9,14 @@ sources += files('virtio_ethdev.c',
 	'virtqueue.c')
 deps += ['kvargs', 'bus_pci']
 
-if arch_subdir == 'x86'
-	sources += files('virtio_rxtx_simple_sse.c')
-elif arch_subdir == 'ppc'
-	sources += files('virtio_rxtx_simple_altivec.c')
-elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64')
-	sources += files('virtio_rxtx_simple_neon.c')
+if dpdk_conf.has('RTE_LIBRTE_VIRTIO_INC_VECTOR')
+	if arch_subdir == 'x86'
+		sources += files('virtio_rxtx_simple_sse.c')
+	elif arch_subdir == 'ppc'
+		sources += files('virtio_rxtx_simple_altivec.c')
+	elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64')
+		sources += files('virtio_rxtx_simple_neon.c')
+	endif
 endif
 
 if is_linux
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v6 3/9] net/virtio: inorder should depend on feature bit
  2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 1/9] net/virtio: add Rx free threshold setting Marvin Liu
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 2/9] net/virtio: enable vectorized path Marvin Liu
@ 2020-04-16 22:24   ` Marvin Liu
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 4/9] net/virtio-user: add vectorized path parameter Marvin Liu
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Ring initialization differs when the in-order feature is negotiated. This
behavior should therefore depend on the negotiated feature bits.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 94ba7a3ec..e450477e8 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -989,6 +989,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 	struct rte_mbuf *m;
 	uint16_t desc_idx;
 	int error, nbufs, i;
+	bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER);
 
 	PMD_INIT_FUNC_TRACE();
 
@@ -1018,7 +1019,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 			virtio_rxq_rearm_vec(rxvq);
 			nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
 		}
-	} else if (hw->use_inorder_rx) {
+	} else if (!vtpci_packed_queue(vq->hw) && in_order) {
 		if ((!virtqueue_full(vq))) {
 			uint16_t free_cnt = vq->vq_free_cnt;
 			struct rte_mbuf *pkts[free_cnt];
@@ -1133,7 +1134,7 @@ virtio_dev_tx_queue_setup_finish(struct rte_eth_dev *dev,
 	PMD_INIT_FUNC_TRACE();
 
 	if (!vtpci_packed_queue(hw)) {
-		if (hw->use_inorder_tx)
+		if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER))
 			vq->vq_split.ring.desc[vq->vq_nentries - 1].next = 0;
 	}
 
@@ -2046,7 +2047,7 @@ virtio_xmit_pkts_packed(void *tx_queue, struct rte_mbuf **tx_pkts,
 	struct virtio_hw *hw = vq->hw;
 	uint16_t hdr_size = hw->vtnet_hdr_size;
 	uint16_t nb_tx = 0;
-	bool in_order = hw->use_inorder_tx;
+	bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER);
 
 	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
 		return nb_tx;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v6 4/9] net/virtio-user: add vectorized path parameter
  2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu
                     ` (2 preceding siblings ...)
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 3/9] net/virtio: inorder should depend on feature bit Marvin Liu
@ 2020-04-16 22:24   ` Marvin Liu
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Add new parameter "vectorized" which can select vectorized path
explicitly. This parameter will work when RTE_LIBRTE_VIRTIO_INC_VECTOR
option is yes. When "vectorized" is set, driver will check both
compiling environment and running environment when selecting path.
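
For example, assuming a vhost-user socket at /tmp/vhost.sock (the socket
path here is a placeholder), the parameter is passed together with the
other virtio-user devargs as:
--vdev=net_virtio_user0,path=/tmp/vhost.sock,packed_vq=1,vectorized=1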

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 35203940a..4c7d60ca0 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1547,7 +1547,7 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 			eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed;
 		}
 	} else {
-		if (hw->use_simple_rx) {
+		if (hw->use_vec_rx) {
 			PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u",
 				eth_dev->data->port_id);
 			eth_dev->rx_pkt_burst = virtio_recv_pkts_vec;
@@ -2157,33 +2157,31 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 			return -EBUSY;
 		}
 
-	hw->use_simple_rx = 1;
-
 	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
 		hw->use_inorder_tx = 1;
 		hw->use_inorder_rx = 1;
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 		hw->use_inorder_rx = 0;
 	}
 
 #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
 	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 #endif
 	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
-		 hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 
 	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
 			   DEV_RX_OFFLOAD_TCP_CKSUM |
 			   DEV_RX_OFFLOAD_TCP_LRO |
 			   DEV_RX_OFFLOAD_VLAN_STRIP))
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 
 	return 0;
 }
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index 7433d2f08..36afed313 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -250,7 +250,8 @@ struct virtio_hw {
 	uint8_t	    vlan_strip;
 	uint8_t	    use_msix;
 	uint8_t     modern;
-	uint8_t     use_simple_rx;
+	uint8_t     use_vec_rx;
+	uint8_t     use_vec_tx;
 	uint8_t     use_inorder_rx;
 	uint8_t     use_inorder_tx;
 	uint8_t     weak_barriers;
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index e450477e8..84f4cf946 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 	/* Allocate blank mbufs for the each rx descriptor */
 	nbufs = 0;
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx && !vtpci_packed_queue(hw)) {
 		for (desc_idx = 0; desc_idx < vq->vq_nentries;
 		     desc_idx++) {
 			vq->vq_split.ring.avail->ring[desc_idx] = desc_idx;
@@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 			&rxvq->fake_mbuf;
 	}
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx && !vtpci_packed_queue(hw)) {
 		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
 			virtio_rxq_rearm_vec(rxvq);
 			nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
index 5637001df..6e30acaae 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -450,6 +450,8 @@ static const char *valid_args[] = {
 	VIRTIO_USER_ARG_IN_ORDER,
 #define VIRTIO_USER_ARG_PACKED_VQ      "packed_vq"
 	VIRTIO_USER_ARG_PACKED_VQ,
+#define VIRTIO_USER_ARG_VECTORIZED     "vectorized"
+	VIRTIO_USER_ARG_VECTORIZED,
 	NULL
 };
 
@@ -518,7 +520,8 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev)
 	 */
 	hw->use_msix = 1;
 	hw->modern   = 0;
-	hw->use_simple_rx = 0;
+	hw->use_vec_rx = 0;
+	hw->use_vec_tx = 0;
 	hw->use_inorder_rx = 0;
 	hw->use_inorder_tx = 0;
 	hw->virtio_user_dev = dev;
@@ -552,6 +555,8 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	uint64_t mrg_rxbuf = 1;
 	uint64_t in_order = 1;
 	uint64_t packed_vq = 0;
+	uint64_t vectorized = 0;
+
 	char *path = NULL;
 	char *ifname = NULL;
 	char *mac_addr = NULL;
@@ -668,6 +673,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		}
 	}
 
+#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR
+	if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) {
+		if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED,
+				       &get_integer_arg, &vectorized) < 0) {
+			PMD_INIT_LOG(ERR, "error to parse %s",
+				     VIRTIO_USER_ARG_VECTORIZED);
+			goto end;
+		}
+	}
+#endif
+
 	if (queues > 1 && cq == 0) {
 		PMD_INIT_LOG(ERR, "multi-q requires ctrl-q");
 		goto end;
@@ -705,6 +721,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	}
 
 	hw = eth_dev->data->dev_private;
+
 	if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq,
 			 queue_size, mac_addr, &ifname, server_mode,
 			 mrg_rxbuf, in_order, packed_vq) < 0) {
@@ -720,6 +737,23 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		goto end;
 	}
 
+	if (vectorized) {
+		if (packed_vq) {
+#if defined(CC_AVX512_SUPPORT)
+			hw->use_vec_rx = 1;
+			hw->use_vec_tx = 1;
+#else
+			PMD_INIT_LOG(INFO,
+				"building environment do not match packed ring vectorized requirement");
+#endif
+		} else {
+			hw->use_vec_rx = 1;
+		}
+	} else {
+		hw->use_vec_rx = 0;
+		hw->use_vec_tx = 0;
+	}
+
 	rte_eth_dev_probing_finish(eth_dev);
 	ret = 0;
 
@@ -777,4 +811,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user,
 	"server=<0|1> "
 	"mrg_rxbuf=<0|1> "
 	"in_order=<0|1> "
-	"packed_vq=<0|1>");
+	"packed_vq=<0|1>"
+	"vectorized=<0|1>");
diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c
index 0b4e3bf3e..ca23180de 100644
--- a/drivers/net/virtio/virtqueue.c
+++ b/drivers/net/virtio/virtqueue.c
@@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq)
 	end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1);
 
 	for (idx = 0; idx < vq->vq_nentries; idx++) {
-		if (hw->use_simple_rx && type == VTNET_RQ) {
+		if (hw->use_vec_rx && !vtpci_packed_queue(hw) &&
+		    type == VTNET_RQ) {
 			if (start <= end && idx >= start && idx < end)
 				continue;
 			if (start > end && (idx >= start || idx < end))
@@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
 	for (i = 0; i < nb_used; i++) {
 		used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1);
 		uep = &vq->vq_split.ring.used->ring[used_idx];
-		if (hw->use_simple_rx) {
+		if (hw->use_vec_rx) {
 			desc_idx = used_idx;
 			rte_pktmbuf_free(vq->sw_ring[desc_idx]);
 			vq->vq_free_cnt++;
@@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
 		vq->vq_used_cons_idx++;
 	}
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx) {
 		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
 			virtio_rxq_rearm_vec(rxq);
 			if (virtqueue_kick_prepare(vq))
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v6 5/9] net/virtio: add vectorized packed ring Rx path
  2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu
                     ` (3 preceding siblings ...)
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 4/9] net/virtio-user: add vectorized path parameter Marvin Liu
@ 2020-04-16 22:24   ` Marvin Liu
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Optimize the packed ring Rx path when AVX512 is enabled and mergeable
buffer/Rx LRO offloading are not required. The optimization follows the
same approach as vhost: split the path into batch and single functions,
with the batch function further optimized by vector instructions. Also
pad the descriptor extra structure to 16-byte alignment, so that four
elements can be saved in one batch.
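
The padded bookkeeping entry referred to above, as it looks after this
series (see the virtqueue.h hunk earlier in the thread); on a 64-bit build
it is 8 + 2 + 2 + 4 = 16 bytes, so four entries can be written back with a
single 64-byte (512-bit) store in the batch path:

struct vq_desc_extra {
	void *cookie;        /* 8 bytes on 64-bit */
	uint16_t ndescs;
	uint16_t next;
	uint8_t padding[4];  /* explicit pad up to 16 bytes */
} __rte_packed __rte_aligned(16);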

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 9ef445bc9..4d20cb61a 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -37,6 +37,40 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c
 else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
 endif
+
+ifneq ($(FORCE_DISABLE_AVX512), y)
+	CC_AVX512_SUPPORT=\
+	$(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
+	sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
+	grep -q AVX512 && echo 1)
+endif
+
+ifeq ($(CC_AVX512_SUPPORT), 1)
+CFLAGS += -DCC_AVX512_SUPPORT
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
+
+ifeq ($(RTE_TOOLCHAIN), gcc)
+ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
+CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), clang)
+ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1)
+CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), icc)
+ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
+CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
+CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
+endif
+endif
 endif
 
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
index f9619a108..9e0ff9761 100644
--- a/drivers/net/virtio/meson.build
+++ b/drivers/net/virtio/meson.build
@@ -11,6 +11,19 @@ deps += ['kvargs', 'bus_pci']
 
 if dpdk_conf.has('RTE_LIBRTE_VIRTIO_INC_VECTOR')
 	if arch_subdir == 'x86'
+		if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F')
+			if '-mno-avx512f' not in machine_args and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
+				cflags += ['-DCC_AVX512_SUPPORT']
+				if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
+					cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
+				elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
+					cflags += '-DVHOST_CLANG_UNROLL_PRAGMA'
+				elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
+					cflags += '-DVHOST_ICC_UNROLL_PRAGMA'
+				endif
+				sources += files('virtio_rxtx_packed_avx.c')
+			endif
+		endif
 		sources += files('virtio_rxtx_simple_sse.c')
 	elif arch_subdir == 'ppc'
 		sources += files('virtio_rxtx_simple_altivec.c')
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index cd8947656..10e39670e 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -104,6 +104,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts,
 uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 84f4cf946..7b65d0b0a 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -1246,7 +1246,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
 	return 0;
 }
 
-#define VIRTIO_MBUF_BURST_SZ 64
 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc))
 uint16_t
 virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
@@ -2329,3 +2328,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
 
 	return nb_tx;
 }
+
+__rte_weak uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
+			    struct rte_mbuf **rx_pkts __rte_unused,
+			    uint16_t nb_pkts __rte_unused)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
new file mode 100644
index 000000000..ffd254489
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -0,0 +1,368 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <rte_net.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtio_pci.h"
+#include "virtqueue.h"
+
+#define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63)
+
+#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
+	sizeof(struct vring_packed_desc))
+#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
+
+#ifdef VIRTIO_GCC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_ICC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifndef virtio_for_each_try_unroll
+#define virtio_for_each_try_unroll(iter, val, num) \
+	for (iter = val; iter < num; iter++)
+#endif
+
+
+static inline void
+virtio_update_batch_stats(struct virtnet_stats *stats,
+			  uint16_t pkt_len1,
+			  uint16_t pkt_len2,
+			  uint16_t pkt_len3,
+			  uint16_t pkt_len4)
+{
+	stats->bytes += pkt_len1;
+	stats->bytes += pkt_len2;
+	stats->bytes += pkt_len3;
+	stats->bytes += pkt_len4;
+}
+/* Optionally fill offload information in structure */
+static inline int
+virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
+{
+	struct rte_net_hdr_lens hdr_lens;
+	uint32_t hdrlen, ptype;
+	int l4_supported = 0;
+
+	/* nothing to do */
+	if (hdr->flags == 0)
+		return 0;
+
+	/* GSO not support in vec path, skip check */
+	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
+
+	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
+	m->packet_type = ptype;
+	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
+		l4_supported = 1;
+
+	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
+		if (hdr->csum_start <= hdrlen && l4_supported) {
+			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
+		} else {
+			/* Unknown proto or tunnel, do sw cksum. We can assume
+			 * the cksum field is in the first segment since the
+			 * buffers we provided to the host are large enough.
+			 * In case of SCTP, this will be wrong since it's a CRC
+			 * but there's nothing we can do.
+			 */
+			uint16_t csum = 0, off;
+
+			rte_raw_cksum_mbuf(m, hdr->csum_start,
+				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
+				&csum);
+			if (likely(csum != 0xffff))
+				csum = ~csum;
+			off = hdr->csum_offset + hdr->csum_start;
+			if (rte_pktmbuf_data_len(m) >= off + 1)
+				*rte_pktmbuf_mtod_offset(m, uint16_t *,
+					off) = csum;
+		}
+	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
+		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
+	}
+
+	return 0;
+}
+
+static uint16_t
+virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq,
+				   struct rte_mbuf **rx_pkts)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint64_t addrs[PACKED_BATCH_SIZE << 1];
+	uint16_t id = vq->vq_used_cons_idx;
+	uint8_t desc_stats;
+	uint16_t i;
+	void *desc_addr;
+
+	if (id & PACKED_BATCH_MASK)
+		return -1;
+
+	if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries))
+		return -1;
+
+	/* only care avail/used bits */
+	__m512i desc_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+	desc_addr = &vq->vq_packed.ring.desc[id];
+
+	rte_smp_rmb();
+	__m512i packed_desc = _mm512_loadu_si512(desc_addr);
+	__m512i flags_mask  = _mm512_maskz_and_epi64(0xff, packed_desc,
+			desc_flags);
+
+	__m512i used_flags;
+	if (vq->vq_packed.used_wrap_counter)
+		used_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+	else
+		used_flags = _mm512_setzero_si512();
+
+	/* Check all descs are used */
+	desc_stats = _mm512_cmp_epu64_mask(flags_mask, used_flags,
+			_MM_CMPINT_EQ);
+	if (desc_stats != 0xff)
+		return -1;
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie;
+		rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *));
+
+		addrs[i << 1] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1;
+		addrs[(i << 1) + 1] =
+			(uint64_t)rx_pkts[i]->rx_descriptor_fields1 + 8;
+	}
+
+	/* addresses of pkt_len and data_len */
+	__m512i vindex = _mm512_loadu_si512((void *)addrs);
+
+	/*
+	 * select 10b*4 load 32bit from packed_desc[95:64]
+	 * mmask  0110b*4 save 32bit into pkt_len and data_len
+	 */
+	__m512i value = _mm512_maskz_shuffle_epi32(0x6666, packed_desc, 0xAA);
+
+	/* mmask 0110b*4 reduce hdr_len from pkt_len and data_len */
+	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(0x6666,
+			(uint32_t)-hdr_size);
+
+	value = _mm512_add_epi32(value, mbuf_len_offset);
+	/* batch store into mbufs */
+	_mm512_i64scatter_epi64(0, vindex, value, 1);
+
+	if (hw->has_rx_offload) {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			char *addr = (char *)rx_pkts[i]->buf_addr +
+				RTE_PKTMBUF_HEADROOM - hdr_size;
+			virtio_vec_rx_offload(rx_pkts[i],
+					(struct virtio_net_hdr *)addr);
+		}
+	}
+
+	virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len,
+			rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len,
+			rx_pkts[3]->pkt_len);
+
+	vq->vq_free_cnt += PACKED_BATCH_SIZE;
+
+	vq->vq_used_cons_idx += PACKED_BATCH_SIZE;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static uint16_t
+virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq,
+				    struct rte_mbuf **rx_pkts)
+{
+	uint16_t used_idx, id;
+	uint32_t len;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint32_t hdr_size = hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	struct vring_packed_desc *desc;
+	struct rte_mbuf *cookie;
+
+	desc = vq->vq_packed.ring.desc;
+	used_idx = vq->vq_used_cons_idx;
+	if (!desc_is_used(&desc[used_idx], vq))
+		return -1;
+
+	len = desc[used_idx].len;
+	id = desc[used_idx].id;
+	cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie;
+	if (unlikely(cookie == NULL)) {
+		PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u",
+				vq->vq_used_cons_idx);
+		return -1;
+	}
+	rte_prefetch0(cookie);
+	rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
+
+	cookie->data_off = RTE_PKTMBUF_HEADROOM;
+	cookie->ol_flags = 0;
+	cookie->pkt_len = (uint32_t)(len - hdr_size);
+	cookie->data_len = (uint32_t)(len - hdr_size);
+
+	hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr +
+					RTE_PKTMBUF_HEADROOM - hdr_size);
+	if (hw->has_rx_offload)
+		virtio_vec_rx_offload(cookie, hdr);
+
+	*rx_pkts = cookie;
+
+	rxvq->stats.bytes += cookie->pkt_len;
+
+	vq->vq_free_cnt++;
+	vq->vq_used_cons_idx++;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static inline void
+virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq,
+			      struct rte_mbuf **cookie,
+			      uint16_t num)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc;
+	uint16_t flags = vq->vq_packed.cached_flags;
+	struct virtio_hw *hw = vq->hw;
+	struct vq_desc_extra *dxp;
+	uint16_t idx, i;
+	uint16_t batch_num, total_num = 0;
+	uint16_t head_idx = vq->vq_avail_idx;
+	uint16_t head_flag = vq->vq_packed.cached_flags;
+	uint64_t addr;
+
+	do {
+		idx = vq->vq_avail_idx;
+
+		batch_num = PACKED_BATCH_SIZE;
+		if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries))
+			batch_num = vq->vq_nentries - idx;
+		if (unlikely((total_num + batch_num) > num))
+			batch_num = num - total_num;
+
+		virtio_for_each_try_unroll(i, 0, batch_num) {
+			dxp = &vq->vq_descx[idx + i];
+			dxp->cookie = (void *)cookie[total_num + i];
+
+			addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) +
+				RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size;
+			start_dp[idx + i].addr = addr;
+			start_dp[idx + i].len = cookie[total_num + i]->buf_len
+				- RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
+			if (total_num || i) {
+				virtqueue_store_flags_packed(&start_dp[idx + i],
+						flags, hw->weak_barriers);
+			}
+		}
+
+		vq->vq_avail_idx += batch_num;
+		if (vq->vq_avail_idx >= vq->vq_nentries) {
+			vq->vq_avail_idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+			flags = vq->vq_packed.cached_flags;
+		}
+		total_num += batch_num;
+	} while (total_num < num);
+
+	virtqueue_store_flags_packed(&start_dp[head_idx], head_flag,
+				hw->weak_barriers);
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num);
+}
+
+uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue,
+			    struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct virtnet_rx *rxvq = rx_queue;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t num, nb_rx = 0;
+	uint32_t nb_enqueued = 0;
+	uint16_t free_cnt = vq->vq_free_thresh;
+
+	if (unlikely(hw->started == 0))
+		return nb_rx;
+
+	num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts);
+	if (likely(num > PACKED_BATCH_SIZE))
+		num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE);
+
+	while (num) {
+		if (!virtqueue_dequeue_batch_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx += PACKED_BATCH_SIZE;
+			num -= PACKED_BATCH_SIZE;
+			continue;
+		}
+		if (!virtqueue_dequeue_single_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx++;
+			num--;
+			continue;
+		}
+		break;
+	};
+
+	PMD_RX_LOG(DEBUG, "dequeue:%d", num);
+
+	rxvq->stats.packets += nb_rx;
+
+	if (likely(vq->vq_free_cnt >= free_cnt)) {
+		struct rte_mbuf *new_pkts[free_cnt];
+		if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts,
+						free_cnt) == 0)) {
+			virtio_recv_refill_packed_vec(rxvq, new_pkts,
+					free_cnt);
+			nb_enqueued += free_cnt;
+		} else {
+			struct rte_eth_dev *dev =
+				&rte_eth_devices[rxvq->port_id];
+			dev->data->rx_mbuf_alloc_failed += free_cnt;
+		}
+	}
+
+	if (likely(nb_enqueued)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_RX_LOG(DEBUG, "Notified");
+		}
+	}
+
+	return nb_rx;
+}
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 6301c56b2..43e305ecc 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -20,6 +20,7 @@ struct rte_mbuf;
 
 #define DEFAULT_RX_FREE_THRESH 32
 
+#define VIRTIO_MBUF_BURST_SZ 64
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
@@ -236,7 +237,8 @@ struct vq_desc_extra {
 	void *cookie;
 	uint16_t ndescs;
 	uint16_t next;
-};
+	uint8_t padding[4];
+} __rte_packed __rte_aligned(16);
 
 struct virtqueue {
 	struct virtio_hw  *hw; /**< virtio_hw structure pointer. */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v6 6/9] net/virtio: reuse packed ring xmit functions
  2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu
                     ` (4 preceding siblings ...)
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu
@ 2020-04-16 22:24   ` Marvin Liu
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Move the xmit offload and packed ring xmit enqueue functions to the
header file. These functions will be reused by the packed ring
vectorized Tx path.
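
For context, once these helpers live in the header, the vectorized Tx
path added later in this series can fall back to the scalar enqueue
with a single call (taken from that patch):

	virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1);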

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 7b65d0b0a..cf18fe564 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq,
 	return i;
 }
 
-#ifndef DEFAULT_TX_FREE_THRESH
-#define DEFAULT_TX_FREE_THRESH 32
-#endif
-
 static void
 virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num)
 {
@@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m)
 }
 
 
-/* avoid write operation when necessary, to lessen cache issues */
-#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
-	if ((var) != (val))			\
-		(var) = (val);			\
-} while (0)
-
-#define virtqueue_clear_net_hdr(_hdr) do {		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0);		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0);	\
-} while (0)
-
-static inline void
-virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
-			struct rte_mbuf *cookie,
-			bool offload)
-{
-	if (offload) {
-		if (cookie->ol_flags & PKT_TX_TCP_SEG)
-			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
-
-		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
-		case PKT_TX_UDP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_udp_hdr,
-				dgram_cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		case PKT_TX_TCP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		default:
-			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
-			break;
-		}
 
-		/* TCP Segmentation Offload */
-		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
-			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
-				VIRTIO_NET_HDR_GSO_TCPV6 :
-				VIRTIO_NET_HDR_GSO_TCPV4;
-			hdr->gso_size = cookie->tso_segsz;
-			hdr->hdr_len =
-				cookie->l2_len +
-				cookie->l3_len +
-				cookie->l4_len;
-		} else {
-			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
-		}
-	}
-}
 
 static inline void
 virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq,
@@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq,
 	virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers);
 }
 
-static inline void
-virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
-			      uint16_t needed, int can_push, int in_order)
-{
-	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
-	struct vq_desc_extra *dxp;
-	struct virtqueue *vq = txvq->vq;
-	struct vring_packed_desc *start_dp, *head_dp;
-	uint16_t idx, id, head_idx, head_flags;
-	int16_t head_size = vq->hw->vtnet_hdr_size;
-	struct virtio_net_hdr *hdr;
-	uint16_t prev;
-	bool prepend_header = false;
-
-	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
-
-	dxp = &vq->vq_descx[id];
-	dxp->ndescs = needed;
-	dxp->cookie = cookie;
-
-	head_idx = vq->vq_avail_idx;
-	idx = head_idx;
-	prev = head_idx;
-	start_dp = vq->vq_packed.ring.desc;
-
-	head_dp = &vq->vq_packed.ring.desc[idx];
-	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-	head_flags |= vq->vq_packed.cached_flags;
-
-	if (can_push) {
-		/* prepend cannot fail, checked by caller */
-		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
-					      -head_size);
-		prepend_header = true;
-
-		/* if offload disabled, it is not zeroed below, do it now */
-		if (!vq->hw->has_tx_offload)
-			virtqueue_clear_net_hdr(hdr);
-	} else {
-		/* setup first tx ring slot to point to header
-		 * stored in reserved region.
-		 */
-		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
-			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
-		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
-		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	}
-
-	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
-
-	do {
-		uint16_t flags;
-
-		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
-		start_dp[idx].len  = cookie->data_len;
-		if (prepend_header) {
-			start_dp[idx].addr -= head_size;
-			start_dp[idx].len += head_size;
-			prepend_header = false;
-		}
-
-		if (likely(idx != head_idx)) {
-			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-			flags |= vq->vq_packed.cached_flags;
-			start_dp[idx].flags = flags;
-		}
-		prev = idx;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	} while ((cookie = cookie->next) != NULL);
-
-	start_dp[prev].id = id;
-
-	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
-	vq->vq_avail_idx = idx;
-
-	if (!in_order) {
-		vq->vq_desc_head_idx = dxp->next;
-		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
-			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
-	}
-
-	virtqueue_store_flags_packed(head_dp, head_flags,
-				     vq->hw->weak_barriers);
-}
-
 static inline void
 virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 			uint16_t needed, int use_indirect, int can_push,
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 43e305ecc..18ae34789 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,7 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_TX_FREE_THRESH 32
 #define DEFAULT_RX_FREE_THRESH 32
 
 #define VIRTIO_MBUF_BURST_SZ 64
@@ -562,4 +563,165 @@ virtqueue_notify(struct virtqueue *vq)
 #define VIRTQUEUE_DUMP(vq) do { } while (0)
 #endif
 
+/* avoid write operation when necessary, to lessen cache issues */
+#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
+	typeof(var) var_ = (var);		\
+	typeof(val) val_ = (val);		\
+	if ((var_) != (val_))			\
+		(var_) = (val_);		\
+} while (0)
+
+#define virtqueue_clear_net_hdr(hdr) do {		\
+	typeof(hdr) hdr_ = (hdr);			\
+	ASSIGN_UNLESS_EQUAL((hdr_)->csum_start, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->csum_offset, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->flags, 0);		\
+	ASSIGN_UNLESS_EQUAL((hdr_)->gso_type, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->gso_size, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->hdr_len, 0);	\
+} while (0)
+
+static inline void
+virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
+			struct rte_mbuf *cookie,
+			bool offload)
+{
+	if (offload) {
+		if (cookie->ol_flags & PKT_TX_TCP_SEG)
+			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
+
+		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
+		case PKT_TX_UDP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_udp_hdr,
+				dgram_cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		case PKT_TX_TCP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		default:
+			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
+			break;
+		}
+
+		/* TCP Segmentation Offload */
+		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
+			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
+				VIRTIO_NET_HDR_GSO_TCPV6 :
+				VIRTIO_NET_HDR_GSO_TCPV4;
+			hdr->gso_size = cookie->tso_segsz;
+			hdr->hdr_len =
+				cookie->l2_len +
+				cookie->l3_len +
+				cookie->l4_len;
+		} else {
+			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
+		}
+	}
+}
+
+static inline void
+virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
+			      uint16_t needed, int can_push, int in_order)
+{
+	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
+	struct vq_desc_extra *dxp;
+	struct virtqueue *vq = txvq->vq;
+	struct vring_packed_desc *start_dp, *head_dp;
+	uint16_t idx, id, head_idx, head_flags;
+	int16_t head_size = vq->hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	uint16_t prev;
+	bool prepend_header = false;
+
+	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
+
+	dxp = &vq->vq_descx[id];
+	dxp->ndescs = needed;
+	dxp->cookie = cookie;
+
+	head_idx = vq->vq_avail_idx;
+	idx = head_idx;
+	prev = head_idx;
+	start_dp = vq->vq_packed.ring.desc;
+
+	head_dp = &vq->vq_packed.ring.desc[idx];
+	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+	head_flags |= vq->vq_packed.cached_flags;
+
+	if (can_push) {
+		/* prepend cannot fail, checked by caller */
+		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
+					      -head_size);
+		prepend_header = true;
+
+		/* if offload disabled, it is not zeroed below, do it now */
+		if (!vq->hw->has_tx_offload)
+			virtqueue_clear_net_hdr(hdr);
+	} else {
+		/* setup first tx ring slot to point to header
+		 * stored in reserved region.
+		 */
+		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
+			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
+		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
+		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	}
+
+	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
+
+	do {
+		uint16_t flags;
+
+		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
+		start_dp[idx].len  = cookie->data_len;
+		if (prepend_header) {
+			start_dp[idx].addr -= head_size;
+			start_dp[idx].len += head_size;
+			prepend_header = false;
+		}
+
+		if (likely(idx != head_idx)) {
+			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+			flags |= vq->vq_packed.cached_flags;
+			start_dp[idx].flags = flags;
+		}
+		prev = idx;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	} while ((cookie = cookie->next) != NULL);
+
+	start_dp[prev].id = id;
+
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
+	vq->vq_avail_idx = idx;
+
+	if (!in_order) {
+		vq->vq_desc_head_idx = dxp->next;
+		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
+			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
+	}
+
+	virtqueue_store_flags_packed(head_dp, head_flags,
+				     vq->hw->weak_barriers);
+}
 #endif /* _VIRTQUEUE_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v6 7/9] net/virtio: add vectorized packed ring Tx path
  2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu
                     ` (5 preceding siblings ...)
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu
@ 2020-04-16 22:24   ` Marvin Liu
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 8/9] net/virtio: add election for vectorized path Marvin Liu
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 9/9] doc: add packed " Marvin Liu
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Optimize the packed ring Tx path in the same way as the Rx path: split
the Tx path into batch and single Tx functions, and further optimize
the batch function with vector instructions.
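
As a rough, scalar illustration (the helper name is made up), the
masked AVX512 compares in virtqueue_enqueue_batch_packed_vec() below
implement an eligibility test equivalent to:

static inline int
tx_batch_usable(struct virtqueue *vq, struct rte_mbuf **tx_pkts,
		uint16_t head_size)
{
	uint16_t i;

	/* the batch must start on a 4-descriptor boundary and not wrap */
	if (vq->vq_avail_idx & PACKED_BATCH_MASK)
		return 0;
	if ((uint16_t)(vq->vq_avail_idx + PACKED_BATCH_SIZE) >
			vq->vq_nentries)
		return 0;

	/* each mbuf: single segment, refcnt 1, headroom for the net hdr */
	for (i = 0; i < PACKED_BATCH_SIZE; i++) {
		if (tx_pkts[i]->nb_segs != 1 ||
		    rte_mbuf_refcnt_read(tx_pkts[i]) != 1 ||
		    rte_pktmbuf_headroom(tx_pkts[i]) < head_size)
			return 0;
	}
	return 1;
}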

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 10e39670e..c9aaef0af 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -107,6 +107,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index cf18fe564..f82fe8d64 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
 {
 	return 0;
 }
+
+__rte_weak uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused,
+			    struct rte_mbuf **tx_pkts __rte_unused,
+			    uint16_t nb_pkts __rte_unused)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
index ffd254489..255eba166 100644
--- a/drivers/net/virtio/virtio_rxtx_packed_avx.c
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -15,6 +15,21 @@
 #include "virtio_pci.h"
 #include "virtqueue.h"
 
+/* reference count offset in mbuf rearm data */
+#define REF_CNT_OFFSET 16
+/* segment number offset in mbuf rearm data */
+#define SEG_NUM_OFFSET 32
+
+#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_OFFSET | \
+			  1ULL << REF_CNT_OFFSET)
+/* id offset in packed ring desc higher 64bits */
+#define ID_OFFSET 32
+/* flag offset in packed ring desc higher 64bits */
+#define FLAG_OFFSET 48
+
+/* net hdr short size mask */
+#define NET_HDR_MASK 0x3F
+
 #define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63)
 
 #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
@@ -41,6 +56,47 @@
 	for (iter = val; iter < num; iter++)
 #endif
 
+static void
+virtio_xmit_cleanup_packed_vec(struct virtqueue *vq)
+{
+	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
+	struct vq_desc_extra *dxp;
+	uint16_t used_idx, id, curr_id, free_cnt = 0;
+	uint16_t size = vq->vq_nentries;
+	struct rte_mbuf *mbufs[size];
+	uint16_t nb_mbuf = 0, i;
+
+	used_idx = vq->vq_used_cons_idx;
+
+	if (!desc_is_used(&desc[used_idx], vq))
+		return;
+
+	id = desc[used_idx].id;
+
+	do {
+		curr_id = used_idx;
+		dxp = &vq->vq_descx[used_idx];
+		used_idx += dxp->ndescs;
+		free_cnt += dxp->ndescs;
+
+		if (dxp->cookie != NULL) {
+			mbufs[nb_mbuf] = dxp->cookie;
+			dxp->cookie = NULL;
+			nb_mbuf++;
+		}
+
+		if (used_idx >= size) {
+			used_idx -= size;
+			vq->vq_packed.used_wrap_counter ^= 1;
+		}
+	} while (curr_id != id);
+
+	for (i = 0; i < nb_mbuf; i++)
+		rte_pktmbuf_free(mbufs[i]);
+
+	vq->vq_used_cons_idx = used_idx;
+	vq->vq_free_cnt += free_cnt;
+}
 
 static inline void
 virtio_update_batch_stats(struct virtnet_stats *stats,
@@ -54,6 +110,234 @@ virtio_update_batch_stats(struct virtnet_stats *stats,
 	stats->bytes += pkt_len3;
 	stats->bytes += pkt_len4;
 }
+
+static inline int
+virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq,
+				   struct rte_mbuf **tx_pkts)
+{
+	struct virtqueue *vq = txvq->vq;
+	uint16_t head_size = vq->hw->vtnet_hdr_size;
+	uint16_t idx = vq->vq_avail_idx;
+	struct virtio_net_hdr *hdr;
+	uint16_t i, cmp;
+
+	if (vq->vq_avail_idx & PACKED_BATCH_MASK)
+		return -1;
+
+	if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries))
+		return -1;
+
+	/* Load four mbufs rearm data */
+	__m256i mbufs = _mm256_set_epi64x(*tx_pkts[3]->rearm_data,
+					  *tx_pkts[2]->rearm_data,
+					  *tx_pkts[1]->rearm_data,
+					  *tx_pkts[0]->rearm_data);
+
+	/* refcnt=1 and nb_segs=1 */
+	__m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
+	__m256i head_rooms = _mm256_set1_epi16(head_size);
+
+	/* Check refcnt and nb_segs */
+	cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
+	if (cmp & 0x6666)
+		return -1;
+
+	/* Check headroom is enough */
+	cmp = _mm256_mask_cmp_epu16_mask(0x1111, mbufs, head_rooms,
+			_MM_CMPINT_LT);
+	if (unlikely(cmp))
+					cflags += '-DVIRTIO_GCC_UNROLL_PRAGMA'
+
+					cflags += '-DVIRTIO_CLANG_UNROLL_PRAGMA'
+					0x1, (uint64_t)tx_pkts[2],
+					cflags += '-DVIRTIO_ICC_UNROLL_PRAGMA'
+					0x1, (uint64_t)tx_pkts[0]);
+
+	_mm512_storeu_si512((void *)&vq->vq_descx[idx], dxps);
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		tx_pkts[i]->data_off -= head_size;
+		tx_pkts[i]->data_len += head_size;
+	}
+
+#ifdef RTE_VIRTIO_USER
+	__m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])),
+			tx_pkts[2]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])),
+			tx_pkts[1]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])),
+			tx_pkts[0]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0])));
+#else
+	__m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len,
+					      tx_pkts[3]->buf_iova,
+					      tx_pkts[2]->data_len,
+					      tx_pkts[2]->buf_iova,
+					      tx_pkts[1]->data_len,
+					      tx_pkts[1]->buf_iova,
+					      tx_pkts[0]->data_len,
+					      tx_pkts[0]->buf_iova);
+#endif
+
+	/* id offset and data offset */
+	__m512i data_offsets = _mm512_set_epi64((uint64_t)3 << ID_OFFSET,
+						tx_pkts[3]->data_off,
+						(uint64_t)2 << ID_OFFSET,
+						tx_pkts[2]->data_off,
+						(uint64_t)1 << ID_OFFSET,
+						tx_pkts[1]->data_off,
+						0, tx_pkts[0]->data_off);
+
+	__m512i new_descs = _mm512_add_epi64(descs_base, data_offsets);
+
+	uint64_t flags_temp = (uint64_t)idx << ID_OFFSET |
+		(uint64_t)vq->vq_packed.cached_flags << FLAG_OFFSET;
+
+	/* flags offset and guest virtual address offset */
+#ifdef RTE_VIRTIO_USER
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset);
+#else
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, 0);
+#endif
+	__m512i flag_offsets = _mm512_broadcast_i32x4(flag_offset);
+
+	__m512i descs = _mm512_add_epi64(new_descs, flag_offsets);
+
+	if (!vq->hw->has_tx_offload) {
+		__m128i mask = _mm_set1_epi16(0xFFFF);
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			__m128i v_hdr = _mm_loadu_si128((void *)hdr);
+			if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK,
+							v_hdr, mask))) {
+				__m128i all_zero = _mm_setzero_si128();
+				_mm_mask_storeu_epi16((void *)hdr,
+						NET_HDR_MASK, all_zero);
+			}
+		}
+	} else {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			virtqueue_xmit_offload(hdr, tx_pkts[i], true);
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	rte_smp_wmb();
+	_mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], descs);
+
+	virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len,
+			tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len,
+			tx_pkts[3]->pkt_len);
+
+	vq->vq_avail_idx += PACKED_BATCH_SIZE;
+	vq->vq_free_cnt -= PACKED_BATCH_SIZE;
+
+	if (vq->vq_avail_idx >= vq->vq_nentries) {
+		vq->vq_avail_idx -= vq->vq_nentries;
+		vq->vq_packed.cached_flags ^=
+			VRING_PACKED_DESC_F_AVAIL_USED;
+	}
+
+	return 0;
+}
+
+static inline int
+virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq,
+				    struct rte_mbuf *txm)
+{
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint16_t slots, can_push;
+	int16_t need;
+
+	/* How many main ring entries are needed to this Tx?
+	 * any_layout => number of segments
+	 * default    => number of segments + 1
+	 */
+	can_push = rte_mbuf_refcnt_read(txm) == 1 &&
+		   RTE_MBUF_DIRECT(txm) &&
+		   txm->nb_segs == 1 &&
+		   rte_pktmbuf_headroom(txm) >= hdr_size;
+
+	slots = txm->nb_segs + !can_push;
+	need = slots - vq->vq_free_cnt;
+
+	/* Positive value indicates it need free vring descriptors */
+	if (unlikely(need > 0)) {
+		virtio_xmit_cleanup_packed_vec(vq);
+		need = slots - vq->vq_free_cnt;
+		if (unlikely(need > 0)) {
+			PMD_TX_LOG(ERR,
+				   "No free tx descriptors to transmit");
+			return -1;
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1);
+
+	txvq->stats.bytes += txm->pkt_len;
+	return 0;
+}
+
+uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			uint16_t nb_pkts)
+{
+	struct virtnet_tx *txvq = tx_queue;
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t nb_tx = 0;
+	uint16_t remained;
+
+	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
+		return nb_tx;
+
+	if (unlikely(nb_pkts < 1))
+		return nb_pkts;
+
+	PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+	if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh)
+		virtio_xmit_cleanup_packed_vec(vq);
+
+	remained = RTE_MIN(nb_pkts, vq->vq_free_cnt);
+
+	while (remained) {
+		if (remained >= PACKED_BATCH_SIZE) {
+			if (!virtqueue_enqueue_batch_packed_vec(txvq,
+						&tx_pkts[nb_tx])) {
+				nb_tx += PACKED_BATCH_SIZE;
+				remained -= PACKED_BATCH_SIZE;
+				continue;
+			}
+		}
+		if (!virtqueue_enqueue_single_packed_vec(txvq,
+					tx_pkts[nb_tx])) {
+			nb_tx++;
+			remained--;
+			continue;
+		}
+		break;
+	};
+
+	txvq->stats.packets += nb_tx;
+
+	if (likely(nb_tx)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_TX_LOG(DEBUG, "Notified backend after xmit");
+		}
+	}
+
+	return nb_tx;
+}
+
 /* Optionally fill offload information in structure */
 static inline int
 virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v6 8/9] net/virtio: add election for vectorized path
  2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu
                     ` (6 preceding siblings ...)
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu
@ 2020-04-16 22:24   ` Marvin Liu
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 9/9] doc: add packed " Marvin Liu
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Rewrite the vectorized path selection logic. The default setting comes
from the RTE_LIBRTE_VIRTIO_INC_VECTOR option. Each path's criteria are
checked as listed below (a condensed sketch follows the lists).

Packed ring vectorized path will be selected when:
    vectorized option is enabled
    AVX512F and required extensions are supported by compiler and host
    virtio VERSION_1 and IN_ORDER features are negotiated
    virtio mergeable feature is not negotiated
    LRO offloading is disabled

Split ring vectorized Rx path will be selected when:
    vectorized option is enabled
    virtio mergeable and IN_ORDER features are not negotiated
    LRO, checksum and VLAN strip offloading are disabled
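
A condensed sketch of the packed ring check (helper name and factoring
are illustrative; the real logic lives inline in virtio_dev_configure()
in the diff below, and build-time AVX512 support is gated separately by
CC_AVX512_SUPPORT):

static int
virtio_packed_vec_rx_usable(struct virtio_hw *hw, uint64_t rx_offloads)
{
	/* host support and negotiated features */
	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
	    !vtpci_with_feature(hw, VIRTIO_F_VERSION_1) ||
	    !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER))
		return 0;
	/* Rx only: no mergeable buffers, no LRO (Tx skips these checks) */
	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF) ||
	    (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO))
		return 0;
	return 1;
}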

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 4c7d60ca0..de4cef843 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1518,9 +1518,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	if (vtpci_packed_queue(hw)) {
 		PMD_INIT_LOG(INFO,
 			"virtio: using packed ring %s Tx path on port %u",
-			hw->use_inorder_tx ? "inorder" : "standard",
+			hw->use_vec_tx ? "vectorized" : "standard",
 			eth_dev->data->port_id);
-		eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
+		if (hw->use_vec_tx)
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec;
+		else
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
 	} else {
 		if (hw->use_inorder_tx) {
 			PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u",
@@ -1534,7 +1537,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+		if (hw->use_vec_rx) {
+			PMD_INIT_LOG(INFO,
+				"virtio: using packed ring vectorized Rx path on port %u",
+				eth_dev->data->port_id);
+			eth_dev->rx_pkt_burst =
+				&virtio_recv_pkts_packed_vec;
+		} else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
 			PMD_INIT_LOG(INFO,
 				"virtio: using packed ring mergeable buffer Rx path on port %u",
 				eth_dev->data->port_id);
@@ -1548,7 +1557,7 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 		}
 	} else {
 		if (hw->use_vec_rx) {
-			PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u",
+			PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u",
 				eth_dev->data->port_id);
 			eth_dev->rx_pkt_burst = virtio_recv_pkts_vec;
 		} else if (hw->use_inorder_rx) {
@@ -1921,6 +1930,10 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 		goto err_virtio_init;
 
 	hw->opened = true;
+#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR
+	hw->use_vec_rx = 1;
+	hw->use_vec_tx = 1;
+#endif
 
 	return 0;
 
@@ -2157,31 +2170,63 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 			return -EBUSY;
 		}
 
-	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
-		hw->use_inorder_tx = 1;
-		hw->use_inorder_rx = 1;
-		hw->use_vec_rx = 0;
-	}
-
 	if (vtpci_packed_queue(hw)) {
-		hw->use_vec_rx = 0;
-		hw->use_inorder_rx = 0;
-	}
+#if defined RTE_ARCH_X86
+		if ((hw->use_vec_rx || hw->use_vec_tx) &&
+		    (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
+		     !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) ||
+		     !vtpci_with_feature(hw, VIRTIO_F_VERSION_1))) {
+			PMD_DRV_LOG(INFO,
+				"disabled packed ring vectorization for requirements are not met");
+			hw->use_vec_rx = 0;
+			hw->use_vec_tx = 0;
+		}
+#endif
+
+		if (hw->use_vec_rx) {
+			if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+				PMD_DRV_LOG(INFO,
+					"disabled packed ring vectorized rx for mrg_rxbuf enabled");
+				hw->use_vec_rx = 0;
+			}
 
+			if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) {
+				PMD_DRV_LOG(INFO,
+					"disabled packed ring vectorized rx for TCP_LRO enabled");
+				hw->use_vec_rx = 0;
+			}
+		}
+	} else {
+		if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
+			hw->use_inorder_tx = 1;
+			hw->use_inorder_rx = 1;
+			hw->use_vec_rx = 0;
+		}
+
+		if (hw->use_vec_rx) {
 #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
-	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
-		hw->use_vec_rx = 0;
-	}
+			if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorization for requirements are not met");
+				hw->use_vec_rx = 0;
+			}
 #endif
-	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
-		hw->use_vec_rx = 0;
-	}
+			if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized rx for mrg_rxbuf enabled");
+				hw->use_vec_rx = 0;
+			}
 
-	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
-			   DEV_RX_OFFLOAD_TCP_CKSUM |
-			   DEV_RX_OFFLOAD_TCP_LRO |
-			   DEV_RX_OFFLOAD_VLAN_STRIP))
-		hw->use_vec_rx = 0;
+			if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
+					   DEV_RX_OFFLOAD_TCP_CKSUM |
+					   DEV_RX_OFFLOAD_TCP_LRO |
+					   DEV_RX_OFFLOAD_VLAN_STRIP)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized rx for offloading enabled");
+				hw->use_vec_rx = 0;
+			}
+		}
+	}
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v6 9/9] doc: add packed vectorized path
  2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu
                     ` (7 preceding siblings ...)
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 8/9] net/virtio: add election for vectorized path Marvin Liu
@ 2020-04-16 22:24   ` Marvin Liu
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Document the packed virtqueue vectorized path selection logic in the
virtio net PMD. Add the packed virtqueue vectorized path features to a
new ini file.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/doc/guides/nics/features/virtio-packed_vec.ini b/doc/guides/nics/features/virtio-packed_vec.ini
new file mode 100644
index 000000000..b239bcaad
--- /dev/null
+++ b/doc/guides/nics/features/virtio-packed_vec.ini
@@ -0,0 +1,22 @@
+;
+; Supported features of the 'virtio_packed_vec' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Speed capabilities   = P
+Link status          = Y
+Link status event    = Y
+Rx interrupt         = Y
+Queue start/stop     = Y
+Promiscuous mode     = Y
+Allmulticast mode    = Y
+Unicast MAC filter   = Y
+Multicast MAC filter = Y
+VLAN filter          = Y
+Basic stats          = Y
+Stats per queue      = Y
+BSD nic_uio          = Y
+Linux UIO            = Y
+Linux VFIO           = Y
+x86-64               = Y
diff --git a/doc/guides/nics/features/virtio_vec.ini b/doc/guides/nics/features/virtio-split_vec.ini
similarity index 88%
rename from doc/guides/nics/features/virtio_vec.ini
rename to doc/guides/nics/features/virtio-split_vec.ini
index e60fe36ae..4142fc9f0 100644
--- a/doc/guides/nics/features/virtio_vec.ini
+++ b/doc/guides/nics/features/virtio-split_vec.ini
@@ -1,5 +1,5 @@
 ;
-; Supported features of the 'virtio_vec' network poll mode driver.
+; Supported features of the 'virtio_split_vec' network poll mode driver.
 ;
 ; Refer to default.ini for the full list of available PMD features.
 ;
diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index d1f5fb898..be07744ce 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -403,6 +403,11 @@ Below devargs are supported by the virtio-user vdev:
     It is used to enable virtio device packed virtqueue feature.
     (Default: 0 (disabled))
 
+#.  ``vectorized``:
+
+    It is used to enable virtio device vectorized path.
+    (Default: 0 (disabled))
+
 Virtio paths Selection and Usage
 --------------------------------
 
@@ -454,6 +459,13 @@ according to below configuration:
    both negotiated, this path will be selected.
 #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and
    Rx mergeable is not negotiated, this path will be selected.
+#. Packed virtqueue vectorized Rx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated &&
+   TCP_LRO Rx offloading is disabled && vectorized option enabled,
+   this path will be selected.
+#. Packed virtqueue vectorized Tx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && vectorized option enabled,
+   this path will be selected.
 
 Rx/Tx callbacks of each Virtio path
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -476,6 +488,8 @@ are shown in below table:
    Packed virtqueue non-meregable path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed
    Packed virtqueue in-order mergeable path     virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed
    Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed           virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Rx path          virtio_recv_pkts_packed_vec       virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Tx path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed_vec
    ============================================ ================================= ========================
 
 Virtio paths Support Status from Release to Release
@@ -493,20 +507,22 @@ All virtio paths support status are shown in below table:
 
 .. table:: Virtio Paths and Releases
 
-   ============================================ ============= ============= =============
-                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11
-   ============================================ ============= ============= =============
-   Split virtqueue mergeable path                     Y             Y             Y
-   Split virtqueue non-mergeable path                 Y             Y             Y
-   Split virtqueue vectorized Rx path                 Y             Y             Y
-   Split virtqueue simple Tx path                     Y             N             N
-   Split virtqueue in-order mergeable path                          Y             Y
-   Split virtqueue in-order non-mergeable path                      Y             Y
-   Packed virtqueue mergeable path                                                Y
-   Packed virtqueue non-mergeable path                                            Y
-   Packed virtqueue in-order mergeable path                                       Y
-   Packed virtqueue in-order non-mergeable path                                   Y
-   ============================================ ============= ============= =============
+   ============================================ ============= ============= ============= =======
+                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~
+   ============================================ ============= ============= ============= =======
+   Split virtqueue mergeable path                     Y             Y             Y          Y
+   Split virtqueue non-mergeable path                 Y             Y             Y          Y
+   Split virtqueue vectorized Rx path                 Y             Y             Y          Y
+   Split virtqueue simple Tx path                     Y             N             N          N
+   Split virtqueue in-order mergeable path                          Y             Y          Y
+   Split virtqueue in-order non-mergeable path                      Y             Y          Y
+   Packed virtqueue mergeable path                                                Y          Y
+   Packed virtqueue non-mergeable path                                            Y          Y
+   Packed virtqueue in-order mergeable path                                       Y          Y
+   Packed virtqueue in-order non-mergeable path                                   Y          Y
+   Packed virtqueue vectorized Rx path                                                       Y
+   Packed virtqueue vectorized Tx path                                                       Y
+   ============================================ ============= ============= ============= =======
 
 QEMU Support Status
 ~~~~~~~~~~~~~~~~~~~
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v6 2/9] net/virtio: enable vectorized path
  2020-04-16 22:24   ` [dpdk-dev] [PATCH v6 2/9] net/virtio: enable vectorized path Marvin Liu
@ 2020-04-20 14:08     ` Maxime Coquelin
  2020-04-21  6:43       ` Liu, Yong
  0 siblings, 1 reply; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-20 14:08 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev

Hi Marvin,

On 4/17/20 12:24 AM, Marvin Liu wrote:
> Previously, virtio split ring vectorized path is enabled as default.
> This is not suitable for everyone because of that path not follow virtio
> spec. Add new config for virtio vectorized path selection. By default
> vectorized path is enabled.

It should be disabled by default if it does not follow the spec. Also,
it means it will always be enabled with Meson, which is not acceptable.

I think we should have a devarg, so that it is built by default but
disabled. The user would then explicitly specify that they want to
enable vector support when probing the device.

Thanks,
Maxime

> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 
> diff --git a/config/common_base b/config/common_base
> index c31175f9d..5901a94f7 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -449,6 +449,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y
>  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n
>  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n
>  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n
> +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=y
>  
>  #
>  # Compile virtio device emulation inside virtio PMD driver
> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> index efdcb0d93..9ef445bc9 100644
> --- a/drivers/net/virtio/Makefile
> +++ b/drivers/net/virtio/Makefile
> @@ -29,6 +29,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
>  
> +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y)
>  ifeq ($(CONFIG_RTE_ARCH_X86),y)
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c
>  else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y)
> @@ -36,6 +37,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c
>  else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
>  endif
> +endif
>  
>  ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
> diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
> index 5e7ca855c..f9619a108 100644
> --- a/drivers/net/virtio/meson.build
> +++ b/drivers/net/virtio/meson.build
> @@ -9,12 +9,14 @@ sources += files('virtio_ethdev.c',
>  	'virtqueue.c')
>  deps += ['kvargs', 'bus_pci']
>  
> -if arch_subdir == 'x86'
> -	sources += files('virtio_rxtx_simple_sse.c')
> -elif arch_subdir == 'ppc'
> -	sources += files('virtio_rxtx_simple_altivec.c')
> -elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64')
> -	sources += files('virtio_rxtx_simple_neon.c')
> +if dpdk_conf.has('RTE_LIBRTE_VIRTIO_INC_VECTOR')
> +	if arch_subdir == 'x86'
> +		sources += files('virtio_rxtx_simple_sse.c')
> +	elif arch_subdir == 'ppc'
> +		sources += files('virtio_rxtx_simple_altivec.c')
> +	elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64')
> +		sources += files('virtio_rxtx_simple_neon.c')
> +	endif
>  endif
>  
>  if is_linux
> 


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v6 2/9] net/virtio: enable vectorized path
  2020-04-20 14:08     ` Maxime Coquelin
@ 2020-04-21  6:43       ` Liu, Yong
  2020-04-22  8:07         ` Liu, Yong
  0 siblings, 1 reply; 162+ messages in thread
From: Liu, Yong @ 2020-04-21  6:43 UTC (permalink / raw)
  To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Monday, April 20, 2020 10:08 PM
> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v6 2/9] net/virtio: enable vectorized path
> 
> Hi Marvin,
> 
> On 4/17/20 12:24 AM, Marvin Liu wrote:
> > Previously, virtio split ring vectorized path is enabled as default.
> > This is not suitable for everyone because of that path not follow virtio
> > spec. Add new config for virtio vectorized path selection. By default
> > vectorized path is enabled.
> 
> It should be disabled by default if not following spec. Also, it means
> it will always be enabled with Meson, which is not acceptable.
> 
> I think we should have a devarg, so that it is built by default but
> disabled. User would specify explicitly he wants to enable vector
> support when probing the device.
> 

Thanks, Maxime. Will change it to be disabled by default in the next version.

> Thanks,
> Maxime
> 
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/config/common_base b/config/common_base
> > index c31175f9d..5901a94f7 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -449,6 +449,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y
> >  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n
> >  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n
> >  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n
> > +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=y
> >
> >  #
> >  # Compile virtio device emulation inside virtio PMD driver
> > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> > index efdcb0d93..9ef445bc9 100644
> > --- a/drivers/net/virtio/Makefile
> > +++ b/drivers/net/virtio/Makefile
> > @@ -29,6 +29,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=
> virtio_rxtx.c
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
> >
> > +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y)
> >  ifeq ($(CONFIG_RTE_ARCH_X86),y)
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c
> >  else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y)
> > @@ -36,6 +37,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=
> virtio_rxtx_simple_altivec.c
> >  else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM)
> $(CONFIG_RTE_ARCH_ARM64)),)
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
> >  endif
> > +endif
> >
> >  ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
> > diff --git a/drivers/net/virtio/meson.build
> b/drivers/net/virtio/meson.build
> > index 5e7ca855c..f9619a108 100644
> > --- a/drivers/net/virtio/meson.build
> > +++ b/drivers/net/virtio/meson.build
> > @@ -9,12 +9,14 @@ sources += files('virtio_ethdev.c',
> >  	'virtqueue.c')
> >  deps += ['kvargs', 'bus_pci']
> >
> > -if arch_subdir == 'x86'
> > -	sources += files('virtio_rxtx_simple_sse.c')
> > -elif arch_subdir == 'ppc'
> > -	sources += files('virtio_rxtx_simple_altivec.c')
> > -elif arch_subdir == 'arm' and
> host_machine.cpu_family().startswith('aarch64')
> > -	sources += files('virtio_rxtx_simple_neon.c')
> > +if dpdk_conf.has('RTE_LIBRTE_VIRTIO_INC_VECTOR')
> > +	if arch_subdir == 'x86'
> > +		sources += files('virtio_rxtx_simple_sse.c')
> > +	elif arch_subdir == 'ppc'
> > +		sources += files('virtio_rxtx_simple_altivec.c')
> > +	elif arch_subdir == 'arm' and
> host_machine.cpu_family().startswith('aarch64')
> > +		sources += files('virtio_rxtx_simple_neon.c')
> > +	endif
> >  endif
> >
> >  if is_linux
> >


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v7 0/9] add packed ring vectorized path
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
                   ` (11 preceding siblings ...)
  2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu
@ 2020-04-22  6:16 ` Marvin Liu
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 1/9] net/virtio: add Rx free threshold setting Marvin Liu
                     ` (8 more replies)
  2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu
                   ` (4 subsequent siblings)
  17 siblings, 9 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-22  6:16 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren
  Cc: dev, Marvin Liu

This patch set introduces a vectorized path for the packed ring.

The size of a packed ring descriptor is 16 bytes, so four batched descriptors
fit exactly into one cacheline, which AVX512 instructions can handle well. The
packed ring Tx path can be fully transformed into a vectorized path. The
packed ring Rx path can be vectorized when the requirements are met (LRO and
mergeable buffers disabled).
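
As a minimal illustration of the sizing claim above (a standalone sketch, not
part of the patches, assuming the x86 64-byte cacheline this series targets):

#include <rte_common.h>		/* RTE_BUILD_BUG_ON */
#include <rte_memory.h>		/* RTE_CACHE_LINE_SIZE */
#include "virtio_ring.h"	/* struct vring_packed_desc (driver-internal) */

static inline void
packed_desc_batch_layout_check(void)
{
	/* one packed ring descriptor is 16 bytes ... */
	RTE_BUILD_BUG_ON(sizeof(struct vring_packed_desc) != 16);
	/* ... so four of them fill exactly one 64-byte cacheline, i.e. a
	 * single 512-bit (zmm) load or store covers a whole batch.
	 */
	RTE_BUILD_BUG_ON(4 * sizeof(struct vring_packed_desc) !=
			 RTE_CACHE_LINE_SIZE);
}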

A new option, RTE_LIBRTE_VIRTIO_INC_VECTOR, is introduced in this patch set.
It unifies the default setting of the split and packed ring vectorized paths.
In addition, the user can choose at runtime whether to enable the vectorized
path through the 'vectorized' parameter of the virtio-user vdev.

v7:
1. default vectorization is disabled
2. compile-time check against the rte_mbuf structure
3. offsets are calculated at compile time
4. remove useless barrier as descs are batched store&load
5. vindex of scatter is directly set
6. some comments updates
7. enable vectorized path in meson build

v6:
1. fix issue when size not power of 2

v5:
1. remove cpuflags definition as required extensions always come with
   AVX512F on x86_64
2. inorder actions should depend on feature bit
3. check ring type in rx queue setup
4. rewrite some commit logs
5. fix some checkpatch warnings

v4:
1. rename 'packed_vec' to 'vectorized', also used in split ring
2. add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev
3. check required AVX512 extensions cpuflags
4. combine split and packed ring datapath selection logic
5. remove limitation that size must power of two
6. clear 12Bytes virtio_net_hdr

v3:
1. remove virtio_net_hdr array for better performance
2. disable 'packed_vec' by default

v2:
1. more function blocks replaced by vector instructions
2. clean virtio_net_hdr by vector instruction
3. allow header room size change
4. add 'packed_vec' option in virtio_user vdev 
5. fix build not check whether AVX512 enabled
6. doc update


Marvin Liu (9):
  net/virtio: add Rx free threshold setting
  net/virtio: enable vectorized path
  net/virtio: inorder should depend on feature bit
  net/virtio-user: add vectorized path parameter
  net/virtio: add vectorized packed ring Rx path
  net/virtio: reuse packed ring xmit functions
  net/virtio: add vectorized packed ring Tx path
  net/virtio: add election for vectorized path
  doc: add packed vectorized path

 config/common_base                          |   1 +
 doc/guides/nics/virtio.rst                  |  43 +-
 drivers/net/virtio/Makefile                 |  37 ++
 drivers/net/virtio/meson.build              |  15 +
 drivers/net/virtio/virtio_ethdev.c          |  95 ++-
 drivers/net/virtio/virtio_ethdev.h          |   6 +
 drivers/net/virtio/virtio_pci.h             |   3 +-
 drivers/net/virtio/virtio_rxtx.c            | 212 ++-----
 drivers/net/virtio/virtio_rxtx_packed_avx.c | 662 ++++++++++++++++++++
 drivers/net/virtio/virtio_user_ethdev.c     |  37 +-
 drivers/net/virtio/virtqueue.c              |   7 +-
 drivers/net/virtio/virtqueue.h              | 168 ++++-
 12 files changed, 1072 insertions(+), 214 deletions(-)
 create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v7 1/9] net/virtio: add Rx free threshold setting
  2020-04-22  6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu
@ 2020-04-22  6:16   ` Marvin Liu
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 2/9] net/virtio: enable vectorized path Marvin Liu
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-22  6:16 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren
  Cc: dev, Marvin Liu

Introduce a free threshold setting in the Rx queue; its default value is 32.
The threshold is limited to a multiple of four, as only the vectorized packed
Rx function will utilize it. The virtio driver will rearm the Rx queue when
more than rx_free_thresh descriptors have been dequeued.
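
For reference, a minimal application-side sketch of requesting a custom
threshold (assumed example values; it only uses the standard ethdev API, not
anything added by this patch):

#include <rte_ethdev.h>
#include <rte_mempool.h>

/* assumed: port 0, queue 0, 256 descriptors, mempool 'mp' already created */
static int
setup_virtio_rxq(struct rte_mempool *mp)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rx_conf;

	if (rte_eth_dev_info_get(0, &dev_info) != 0)
		return -1;

	rx_conf = dev_info.default_rxconf;
	rx_conf.rx_free_thresh = 64;	/* must stay a multiple of four */

	return rte_eth_rx_queue_setup(0, 0, 256, SOCKET_ID_ANY, &rx_conf, mp);
}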

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 060410577..94ba7a3ec 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 	struct virtio_hw *hw = dev->data->dev_private;
 	struct virtqueue *vq = hw->vqs[vtpci_queue_idx];
 	struct virtnet_rx *rxvq;
+	uint16_t rx_free_thresh;
 
 	PMD_INIT_FUNC_TRACE();
 
@@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 		return -EINVAL;
 	}
 
+	rx_free_thresh = rx_conf->rx_free_thresh;
+	if (rx_free_thresh == 0)
+		rx_free_thresh =
+			RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH);
+
+	if (rx_free_thresh & 0x3) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+
+	if (rx_free_thresh >= vq->vq_nentries) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the "
+			"number of RX entries (%u)."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			vq->vq_nentries,
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+	vq->vq_free_thresh = rx_free_thresh;
+
 	if (nb_desc == 0 || nb_desc > vq->vq_nentries)
 		nb_desc = vq->vq_nentries;
 	vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc);
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 58ad7309a..6301c56b2 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,8 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_RX_FREE_THRESH 32
+
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v7 2/9] net/virtio: enable vectorized path
  2020-04-22  6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 1/9] net/virtio: add Rx free threshold setting Marvin Liu
@ 2020-04-22  6:16   ` Marvin Liu
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 3/9] net/virtio: inorder should depend on feature bit Marvin Liu
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-22  6:16 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren
  Cc: dev, Marvin Liu

Previously, the virtio split ring vectorized path was enabled by default.
This is not suitable for everyone, because that path does not follow the
virtio spec. Add a new config option for virtio vectorized path selection.
By default the vectorized path is disabled.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/config/common_base b/config/common_base
index 00d8d0792..334a26a17 100644
--- a/config/common_base
+++ b/config/common_base
@@ -456,6 +456,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n
+CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=n
 
 #
 # Compile virtio device emulation inside virtio PMD driver
diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index c9edb84ee..4b69827ab 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -28,6 +28,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
 
+ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y)
 ifeq ($(CONFIG_RTE_ARCH_X86),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c
 else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y)
@@ -35,6 +36,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c
 else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
 endif
+endif
 
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
index 15150eea1..ce3525ef5 100644
--- a/drivers/net/virtio/meson.build
+++ b/drivers/net/virtio/meson.build
@@ -8,6 +8,7 @@ sources += files('virtio_ethdev.c',
 	'virtqueue.c')
 deps += ['kvargs', 'bus_pci']
 
+dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1)
 if arch_subdir == 'x86'
 	sources += files('virtio_rxtx_simple_sse.c')
 elif arch_subdir == 'ppc'
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v7 3/9] net/virtio: inorder should depend on feature bit
  2020-04-22  6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 1/9] net/virtio: add Rx free threshold setting Marvin Liu
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 2/9] net/virtio: enable vectorized path Marvin Liu
@ 2020-04-22  6:16   ` Marvin Liu
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 4/9] net/virtio-user: add vectorized path parameter Marvin Liu
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-22  6:16 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren
  Cc: dev, Marvin Liu

Ring initialization is different when the in-order feature is negotiated.
This behavior should depend on the negotiated feature bits.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 94ba7a3ec..e450477e8 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -989,6 +989,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 	struct rte_mbuf *m;
 	uint16_t desc_idx;
 	int error, nbufs, i;
+	bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER);
 
 	PMD_INIT_FUNC_TRACE();
 
@@ -1018,7 +1019,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 			virtio_rxq_rearm_vec(rxvq);
 			nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
 		}
-	} else if (hw->use_inorder_rx) {
+	} else if (!vtpci_packed_queue(vq->hw) && in_order) {
 		if ((!virtqueue_full(vq))) {
 			uint16_t free_cnt = vq->vq_free_cnt;
 			struct rte_mbuf *pkts[free_cnt];
@@ -1133,7 +1134,7 @@ virtio_dev_tx_queue_setup_finish(struct rte_eth_dev *dev,
 	PMD_INIT_FUNC_TRACE();
 
 	if (!vtpci_packed_queue(hw)) {
-		if (hw->use_inorder_tx)
+		if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER))
 			vq->vq_split.ring.desc[vq->vq_nentries - 1].next = 0;
 	}
 
@@ -2046,7 +2047,7 @@ virtio_xmit_pkts_packed(void *tx_queue, struct rte_mbuf **tx_pkts,
 	struct virtio_hw *hw = vq->hw;
 	uint16_t hdr_size = hw->vtnet_hdr_size;
 	uint16_t nb_tx = 0;
-	bool in_order = hw->use_inorder_tx;
+	bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER);
 
 	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
 		return nb_tx;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v7 4/9] net/virtio-user: add vectorized path parameter
  2020-04-22  6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu
                     ` (2 preceding siblings ...)
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 3/9] net/virtio: inorder should depend on feature bit Marvin Liu
@ 2020-04-22  6:16   ` Marvin Liu
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-22  6:16 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren
  Cc: dev, Marvin Liu

Add a new parameter "vectorized" which selects the vectorized path
explicitly. This parameter only takes effect when the
RTE_LIBRTE_VIRTIO_INC_VECTOR option is set to yes. When "vectorized" is set,
the driver will check both the build environment and the runtime environment
when selecting the path.
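
A minimal runtime usage sketch (hypothetical device name and placeholder
arguments; the same string can equally be passed through the EAL --vdev
option):

#include <rte_bus_vdev.h>

/* hot-plug a virtio-user port with packed ring and the new 'vectorized'
 * devarg enabled; 'path' and 'queues' values are placeholders
 */
static int
probe_vectorized_virtio_user(void)
{
	return rte_vdev_init("net_virtio_user0",
			     "path=/dev/vhost-net,queues=1,"
			     "packed_vq=1,in_order=1,vectorized=1");
}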

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 37766cbb6..361c834a9 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1551,8 +1551,8 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 			eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed;
 		}
 	} else {
-		if (hw->use_simple_rx) {
-			PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u",
+		if (hw->use_vec_rx) {
+			PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u",
 				eth_dev->data->port_id);
 			eth_dev->rx_pkt_burst = virtio_recv_pkts_vec;
 		} else if (hw->use_inorder_rx) {
@@ -2257,33 +2257,33 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 			return -EBUSY;
 		}
 
-	hw->use_simple_rx = 1;
+	hw->use_vec_rx = 1;
 
 	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
 		hw->use_inorder_tx = 1;
 		hw->use_inorder_rx = 1;
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 		hw->use_inorder_rx = 0;
 	}
 
 #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
 	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 #endif
 	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
-		 hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 
 	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
 			   DEV_RX_OFFLOAD_TCP_CKSUM |
 			   DEV_RX_OFFLOAD_TCP_LRO |
 			   DEV_RX_OFFLOAD_VLAN_STRIP))
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 
 	return 0;
 }
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index bd89357e4..668e688e1 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -253,7 +253,8 @@ struct virtio_hw {
 	uint8_t	    vlan_strip;
 	uint8_t	    use_msix;
 	uint8_t     modern;
-	uint8_t     use_simple_rx;
+	uint8_t     use_vec_rx;
+	uint8_t     use_vec_tx;
 	uint8_t     use_inorder_rx;
 	uint8_t     use_inorder_tx;
 	uint8_t     weak_barriers;
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index e450477e8..84f4cf946 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 	/* Allocate blank mbufs for the each rx descriptor */
 	nbufs = 0;
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx && !vtpci_packed_queue(hw)) {
 		for (desc_idx = 0; desc_idx < vq->vq_nentries;
 		     desc_idx++) {
 			vq->vq_split.ring.avail->ring[desc_idx] = desc_idx;
@@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 			&rxvq->fake_mbuf;
 	}
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx && !vtpci_packed_queue(hw)) {
 		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
 			virtio_rxq_rearm_vec(rxvq);
 			nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
index 953f00d72..5c338cf44 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -452,6 +452,8 @@ static const char *valid_args[] = {
 	VIRTIO_USER_ARG_PACKED_VQ,
 #define VIRTIO_USER_ARG_SPEED          "speed"
 	VIRTIO_USER_ARG_SPEED,
+#define VIRTIO_USER_ARG_VECTORIZED     "vectorized"
+	VIRTIO_USER_ARG_VECTORIZED,
 	NULL
 };
 
@@ -525,7 +527,8 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev)
 	 */
 	hw->use_msix = 1;
 	hw->modern   = 0;
-	hw->use_simple_rx = 0;
+	hw->use_vec_rx = 0;
+	hw->use_vec_tx = 0;
 	hw->use_inorder_rx = 0;
 	hw->use_inorder_tx = 0;
 	hw->virtio_user_dev = dev;
@@ -559,6 +562,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	uint64_t mrg_rxbuf = 1;
 	uint64_t in_order = 1;
 	uint64_t packed_vq = 0;
+	uint64_t vectorized = 0;
 	char *path = NULL;
 	char *ifname = NULL;
 	char *mac_addr = NULL;
@@ -675,6 +679,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		}
 	}
 
+#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR
+	if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) {
+		if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED,
+				       &get_integer_arg, &vectorized) < 0) {
+			PMD_INIT_LOG(ERR, "error to parse %s",
+				     VIRTIO_USER_ARG_VECTORIZED);
+			goto end;
+		}
+	}
+#endif
+
 	if (queues > 1 && cq == 0) {
 		PMD_INIT_LOG(ERR, "multi-q requires ctrl-q");
 		goto end;
@@ -727,6 +742,23 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		goto end;
 	}
 
+	if (vectorized) {
+		if (packed_vq) {
+#if defined(CC_AVX512_SUPPORT)
+			hw->use_vec_rx = 1;
+			hw->use_vec_tx = 1;
+#else
+			PMD_INIT_LOG(INFO,
+				"building environment do not match packed ring vectorized requirement");
+#endif
+		} else {
+			hw->use_vec_rx = 1;
+		}
+	} else {
+		hw->use_vec_rx = 0;
+		hw->use_vec_tx = 0;
+	}
+
 	rte_eth_dev_probing_finish(eth_dev);
 	ret = 0;
 
@@ -785,4 +817,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user,
 	"mrg_rxbuf=<0|1> "
 	"in_order=<0|1> "
 	"packed_vq=<0|1> "
-	"speed=<int>");
+	"speed=<int> "
+	"vectorized=<0|1>");
diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c
index 0b4e3bf3e..ca23180de 100644
--- a/drivers/net/virtio/virtqueue.c
+++ b/drivers/net/virtio/virtqueue.c
@@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq)
 	end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1);
 
 	for (idx = 0; idx < vq->vq_nentries; idx++) {
-		if (hw->use_simple_rx && type == VTNET_RQ) {
+		if (hw->use_vec_rx && !vtpci_packed_queue(hw) &&
+		    type == VTNET_RQ) {
 			if (start <= end && idx >= start && idx < end)
 				continue;
 			if (start > end && (idx >= start || idx < end))
@@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
 	for (i = 0; i < nb_used; i++) {
 		used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1);
 		uep = &vq->vq_split.ring.used->ring[used_idx];
-		if (hw->use_simple_rx) {
+		if (hw->use_vec_rx) {
 			desc_idx = used_idx;
 			rte_pktmbuf_free(vq->sw_ring[desc_idx]);
 			vq->vq_free_cnt++;
@@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
 		vq->vq_used_cons_idx++;
 	}
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx) {
 		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
 			virtio_rxq_rearm_vec(rxq);
 			if (virtqueue_kick_prepare(vq))
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v7 5/9] net/virtio: add vectorized packed ring Rx path
  2020-04-22  6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu
                     ` (3 preceding siblings ...)
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 4/9] net/virtio-user: add vectorized path parameter Marvin Liu
@ 2020-04-22  6:16   ` Marvin Liu
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-22  6:16 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren
  Cc: dev, Marvin Liu

Optimize the packed ring Rx path when AVX512 is enabled and mergeable
buffer/Rx LRO offloading are not required. The optimization approach is
similar to vhost: the path is split into batch and single functions, and the
batch function is further optimized with vector instructions. Also pad the
desc extra structure to a 16-byte alignment, so that four elements can be
handled in one batch.
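
As a condensed illustration of the batch check (a simplified sketch of what
the code added below does, assuming AVX512F and a batch of four 16-byte
descriptors):

#include <stdint.h>
#include <immintrin.h>

/* Check in one shot whether four contiguous packed descriptors are all
 * "used". 'descs' points at the first descriptor, 'flags_mask' selects the
 * AVAIL|USED bits in the high 64 bits of each descriptor and 'expected'
 * encodes the current wrap state (identical in every selected lane).
 */
static inline int
four_descs_used(const void *descs, uint64_t flags_mask, __m512i expected)
{
	__m512i v_desc = _mm512_loadu_si512(descs);
	/* 0xaa keeps the odd 64-bit lanes, i.e. the len/id/flags halves */
	__m512i v_mask = _mm512_maskz_set1_epi64(0xaa, flags_mask);
	__m512i v_flag = _mm512_and_epi64(v_desc, v_mask);

	return _mm512_cmpneq_epu64_mask(v_flag, expected) == 0;
}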

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 4b69827ab..de0b00e50 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -36,6 +36,41 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c
 else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
 endif
+
+ifneq ($(FORCE_DISABLE_AVX512), y)
+	CC_AVX512_SUPPORT=\
+	$(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
+	sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
+	grep -q AVX512 && echo 1)
+endif
+
+ifeq ($(CC_AVX512_SUPPORT), 1)
+CFLAGS += -DCC_AVX512_SUPPORT
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
+
+ifeq ($(RTE_TOOLCHAIN), gcc)
+ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
+CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), clang)
+ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1)
+CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), icc)
+ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
+CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
+endif
+endif
+
+CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl
+ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
+CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
+endif
+endif
 endif
 
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
index ce3525ef5..39b3605d9 100644
--- a/drivers/net/virtio/meson.build
+++ b/drivers/net/virtio/meson.build
@@ -10,6 +10,20 @@ deps += ['kvargs', 'bus_pci']
 
 dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1)
 if arch_subdir == 'x86'
+	if '-mno-avx512f' not in machine_args
+		if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
+			cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl']
+			cflags += ['-DCC_AVX512_SUPPORT']
+			if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
+				cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
+			elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
+				cflags += '-DVHOST_CLANG_UNROLL_PRAGMA'
+			elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
+				cflags += '-DVHOST_ICC_UNROLL_PRAGMA'
+			endif
+			sources += files('virtio_rxtx_packed_avx.c')
+		endif
+	endif
 	sources += files('virtio_rxtx_simple_sse.c')
 elif arch_subdir == 'ppc'
 	sources += files('virtio_rxtx_simple_altivec.c')
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index febaf17a8..5c112cac7 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts,
 uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 84f4cf946..7b65d0b0a 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -1246,7 +1246,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
 	return 0;
 }
 
-#define VIRTIO_MBUF_BURST_SZ 64
 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc))
 uint16_t
 virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
@@ -2329,3 +2328,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
 
 	return nb_tx;
 }
+
+__rte_weak uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
+			    struct rte_mbuf **rx_pkts __rte_unused,
+			    uint16_t nb_pkts __rte_unused)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
new file mode 100644
index 000000000..d02ba9ba6
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -0,0 +1,373 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <rte_net.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtio_pci.h"
+#include "virtqueue.h"
+
+#define BYTE_SIZE 8
+/* flag bits offset in packed ring desc higher 64bits */
+#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
+	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
+
+#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \
+	FLAGS_BITS_OFFSET)
+
+#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
+	sizeof(struct vring_packed_desc))
+#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
+
+#ifdef VIRTIO_GCC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_ICC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifndef virtio_for_each_try_unroll
+#define virtio_for_each_try_unroll(iter, val, num) \
+	for (iter = val; iter < num; iter++)
+#endif
+
+
+static inline void
+virtio_update_batch_stats(struct virtnet_stats *stats,
+			  uint16_t pkt_len1,
+			  uint16_t pkt_len2,
+			  uint16_t pkt_len3,
+			  uint16_t pkt_len4)
+{
+	stats->bytes += pkt_len1;
+	stats->bytes += pkt_len2;
+	stats->bytes += pkt_len3;
+	stats->bytes += pkt_len4;
+}
+/* Optionally fill offload information in structure */
+static inline int
+virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
+{
+	struct rte_net_hdr_lens hdr_lens;
+	uint32_t hdrlen, ptype;
+	int l4_supported = 0;
+
+	/* nothing to do */
+	if (hdr->flags == 0)
+		return 0;
+
+	/* GSO is not supported in the vec path, skip the check */
+	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
+
+	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
+	m->packet_type = ptype;
+	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
+		l4_supported = 1;
+
+	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
+		if (hdr->csum_start <= hdrlen && l4_supported) {
+			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
+		} else {
+			/* Unknown proto or tunnel, do sw cksum. We can assume
+			 * the cksum field is in the first segment since the
+			 * buffers we provided to the host are large enough.
+			 * In case of SCTP, this will be wrong since it's a CRC
+			 * but there's nothing we can do.
+			 */
+			uint16_t csum = 0, off;
+
+			rte_raw_cksum_mbuf(m, hdr->csum_start,
+				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
+				&csum);
+			if (likely(csum != 0xffff))
+				csum = ~csum;
+			off = hdr->csum_offset + hdr->csum_start;
+			if (rte_pktmbuf_data_len(m) >= off + 1)
+				*rte_pktmbuf_mtod_offset(m, uint16_t *,
+					off) = csum;
+		}
+	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
+		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
+	}
+
+	return 0;
+}
+
+static inline uint16_t
+virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq,
+				   struct rte_mbuf **rx_pkts)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint64_t addrs[PACKED_BATCH_SIZE];
+	uint16_t id = vq->vq_used_cons_idx;
+	uint8_t desc_stats;
+	uint16_t i;
+	void *desc_addr;
+
+	if (id & PACKED_BATCH_MASK)
+		return -1;
+
+	if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries))
+		return -1;
+
+	/* only care avail/used bits */
+	__m512i v_mask = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+	desc_addr = &vq->vq_packed.ring.desc[id];
+
+	__m512i v_desc = _mm512_loadu_si512(desc_addr);
+	__m512i v_flag = _mm512_and_epi64(v_desc, v_mask);
+
+	__m512i v_used_flag = _mm512_setzero_si512();
+	if (vq->vq_packed.used_wrap_counter)
+		v_used_flag = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+
+	/* Check all descs are used */
+	desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag);
+	if (desc_stats)
+		return -1;
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie;
+		rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *));
+
+		addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1;
+	}
+
+	/*
+	 * load len from desc, store into mbuf pkt_len and data_len
+	 * len limited by 16-bit buf_len, pkt_len[16:31] can be ignored
+	 */
+	__m512i values = _mm512_maskz_shuffle_epi32(0x6666, v_desc, 0xAA);
+
+	/* reduce hdr_len from pkt_len and data_len */
+	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(0x6666,
+			(uint32_t)-hdr_size);
+
+	__m512i v_value = _mm512_add_epi32(values, mbuf_len_offset);
+
+	/* assert offset of data_len */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+		offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+
+	__m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3],
+					   addrs[2] + 8, addrs[2],
+					   addrs[1] + 8, addrs[1],
+					   addrs[0] + 8, addrs[0]);
+	/* batch store into mbufs */
+	_mm512_i64scatter_epi64(0, v_index, v_value, 1);
+
+	if (hw->has_rx_offload) {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			char *addr = (char *)rx_pkts[i]->buf_addr +
+				RTE_PKTMBUF_HEADROOM - hdr_size;
+			virtio_vec_rx_offload(rx_pkts[i],
+					(struct virtio_net_hdr *)addr);
+		}
+	}
+
+	virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len,
+			rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len,
+			rx_pkts[3]->pkt_len);
+
+	vq->vq_free_cnt += PACKED_BATCH_SIZE;
+
+	vq->vq_used_cons_idx += PACKED_BATCH_SIZE;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static uint16_t
+virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq,
+				    struct rte_mbuf **rx_pkts)
+{
+	uint16_t used_idx, id;
+	uint32_t len;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint32_t hdr_size = hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	struct vring_packed_desc *desc;
+	struct rte_mbuf *cookie;
+
+	desc = vq->vq_packed.ring.desc;
+	used_idx = vq->vq_used_cons_idx;
+	if (!desc_is_used(&desc[used_idx], vq))
+		return -1;
+
+	len = desc[used_idx].len;
+	id = desc[used_idx].id;
+	cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie;
+	if (unlikely(cookie == NULL)) {
+		PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u",
+				vq->vq_used_cons_idx);
+		return -1;
+	}
+	rte_prefetch0(cookie);
+	rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
+
+	cookie->data_off = RTE_PKTMBUF_HEADROOM;
+	cookie->ol_flags = 0;
+	cookie->pkt_len = (uint32_t)(len - hdr_size);
+	cookie->data_len = (uint32_t)(len - hdr_size);
+
+	hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr +
+					RTE_PKTMBUF_HEADROOM - hdr_size);
+	if (hw->has_rx_offload)
+		virtio_vec_rx_offload(cookie, hdr);
+
+	*rx_pkts = cookie;
+
+	rxvq->stats.bytes += cookie->pkt_len;
+
+	vq->vq_free_cnt++;
+	vq->vq_used_cons_idx++;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static inline void
+virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq,
+			      struct rte_mbuf **cookie,
+			      uint16_t num)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc;
+	uint16_t flags = vq->vq_packed.cached_flags;
+	struct virtio_hw *hw = vq->hw;
+	struct vq_desc_extra *dxp;
+	uint16_t idx, i;
+	uint16_t batch_num, total_num = 0;
+	uint16_t head_idx = vq->vq_avail_idx;
+	uint16_t head_flag = vq->vq_packed.cached_flags;
+	uint64_t addr;
+
+	do {
+		idx = vq->vq_avail_idx;
+
+		batch_num = PACKED_BATCH_SIZE;
+		if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries))
+			batch_num = vq->vq_nentries - idx;
+		if (unlikely((total_num + batch_num) > num))
+			batch_num = num - total_num;
+
+		virtio_for_each_try_unroll(i, 0, batch_num) {
+			dxp = &vq->vq_descx[idx + i];
+			dxp->cookie = (void *)cookie[total_num + i];
+
+			addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) +
+				RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size;
+			start_dp[idx + i].addr = addr;
+			start_dp[idx + i].len = cookie[total_num + i]->buf_len
+				- RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
+			if (total_num || i) {
+				virtqueue_store_flags_packed(&start_dp[idx + i],
+						flags, hw->weak_barriers);
+			}
+		}
+
+		vq->vq_avail_idx += batch_num;
+		if (vq->vq_avail_idx >= vq->vq_nentries) {
+			vq->vq_avail_idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+			flags = vq->vq_packed.cached_flags;
+		}
+		total_num += batch_num;
+	} while (total_num < num);
+
+	virtqueue_store_flags_packed(&start_dp[head_idx], head_flag,
+				hw->weak_barriers);
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num);
+}
+
+uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue,
+			    struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct virtnet_rx *rxvq = rx_queue;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t num, nb_rx = 0;
+	uint32_t nb_enqueued = 0;
+	uint16_t free_cnt = vq->vq_free_thresh;
+
+	if (unlikely(hw->started == 0))
+		return nb_rx;
+
+	num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts);
+	if (likely(num > PACKED_BATCH_SIZE))
+		num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE);
+
+	while (num) {
+		if (!virtqueue_dequeue_batch_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx += PACKED_BATCH_SIZE;
+			num -= PACKED_BATCH_SIZE;
+			continue;
+		}
+		if (!virtqueue_dequeue_single_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx++;
+			num--;
+			continue;
+		}
+		break;
+	};
+
+	PMD_RX_LOG(DEBUG, "dequeue:%d", num);
+
+	rxvq->stats.packets += nb_rx;
+
+	if (likely(vq->vq_free_cnt >= free_cnt)) {
+		struct rte_mbuf *new_pkts[free_cnt];
+		if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts,
+						free_cnt) == 0)) {
+			virtio_recv_refill_packed_vec(rxvq, new_pkts,
+					free_cnt);
+			nb_enqueued += free_cnt;
+		} else {
+			struct rte_eth_dev *dev =
+				&rte_eth_devices[rxvq->port_id];
+			dev->data->rx_mbuf_alloc_failed += free_cnt;
+		}
+	}
+
+	if (likely(nb_enqueued)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_RX_LOG(DEBUG, "Notified");
+		}
+	}
+
+	return nb_rx;
+}
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 6301c56b2..43e305ecc 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -20,6 +20,7 @@ struct rte_mbuf;
 
 #define DEFAULT_RX_FREE_THRESH 32
 
+#define VIRTIO_MBUF_BURST_SZ 64
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
@@ -236,7 +237,8 @@ struct vq_desc_extra {
 	void *cookie;
 	uint16_t ndescs;
 	uint16_t next;
-};
+	uint8_t padding[4];
+} __rte_packed __rte_aligned(16);
 
 struct virtqueue {
 	struct virtio_hw  *hw; /**< virtio_hw structure pointer. */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v7 6/9] net/virtio: reuse packed ring xmit functions
  2020-04-22  6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu
                     ` (4 preceding siblings ...)
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu
@ 2020-04-22  6:16   ` Marvin Liu
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-22  6:16 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren
  Cc: dev, Marvin Liu

Move the xmit offload and packed ring xmit enqueue functions to the header
file. These functions will be reused by the packed ring vectorized Tx
function.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 7b65d0b0a..cf18fe564 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq,
 	return i;
 }
 
-#ifndef DEFAULT_TX_FREE_THRESH
-#define DEFAULT_TX_FREE_THRESH 32
-#endif
-
 static void
 virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num)
 {
@@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m)
 }
 
 
-/* avoid write operation when necessary, to lessen cache issues */
-#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
-	if ((var) != (val))			\
-		(var) = (val);			\
-} while (0)
-
-#define virtqueue_clear_net_hdr(_hdr) do {		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0);		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0);	\
-} while (0)
-
-static inline void
-virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
-			struct rte_mbuf *cookie,
-			bool offload)
-{
-	if (offload) {
-		if (cookie->ol_flags & PKT_TX_TCP_SEG)
-			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
-
-		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
-		case PKT_TX_UDP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_udp_hdr,
-				dgram_cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		case PKT_TX_TCP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		default:
-			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
-			break;
-		}
 
-		/* TCP Segmentation Offload */
-		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
-			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
-				VIRTIO_NET_HDR_GSO_TCPV6 :
-				VIRTIO_NET_HDR_GSO_TCPV4;
-			hdr->gso_size = cookie->tso_segsz;
-			hdr->hdr_len =
-				cookie->l2_len +
-				cookie->l3_len +
-				cookie->l4_len;
-		} else {
-			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
-		}
-	}
-}
 
 static inline void
 virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq,
@@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq,
 	virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers);
 }
 
-static inline void
-virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
-			      uint16_t needed, int can_push, int in_order)
-{
-	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
-	struct vq_desc_extra *dxp;
-	struct virtqueue *vq = txvq->vq;
-	struct vring_packed_desc *start_dp, *head_dp;
-	uint16_t idx, id, head_idx, head_flags;
-	int16_t head_size = vq->hw->vtnet_hdr_size;
-	struct virtio_net_hdr *hdr;
-	uint16_t prev;
-	bool prepend_header = false;
-
-	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
-
-	dxp = &vq->vq_descx[id];
-	dxp->ndescs = needed;
-	dxp->cookie = cookie;
-
-	head_idx = vq->vq_avail_idx;
-	idx = head_idx;
-	prev = head_idx;
-	start_dp = vq->vq_packed.ring.desc;
-
-	head_dp = &vq->vq_packed.ring.desc[idx];
-	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-	head_flags |= vq->vq_packed.cached_flags;
-
-	if (can_push) {
-		/* prepend cannot fail, checked by caller */
-		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
-					      -head_size);
-		prepend_header = true;
-
-		/* if offload disabled, it is not zeroed below, do it now */
-		if (!vq->hw->has_tx_offload)
-			virtqueue_clear_net_hdr(hdr);
-	} else {
-		/* setup first tx ring slot to point to header
-		 * stored in reserved region.
-		 */
-		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
-			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
-		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
-		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	}
-
-	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
-
-	do {
-		uint16_t flags;
-
-		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
-		start_dp[idx].len  = cookie->data_len;
-		if (prepend_header) {
-			start_dp[idx].addr -= head_size;
-			start_dp[idx].len += head_size;
-			prepend_header = false;
-		}
-
-		if (likely(idx != head_idx)) {
-			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-			flags |= vq->vq_packed.cached_flags;
-			start_dp[idx].flags = flags;
-		}
-		prev = idx;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	} while ((cookie = cookie->next) != NULL);
-
-	start_dp[prev].id = id;
-
-	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
-	vq->vq_avail_idx = idx;
-
-	if (!in_order) {
-		vq->vq_desc_head_idx = dxp->next;
-		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
-			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
-	}
-
-	virtqueue_store_flags_packed(head_dp, head_flags,
-				     vq->hw->weak_barriers);
-}
-
 static inline void
 virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 			uint16_t needed, int use_indirect, int can_push,
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 43e305ecc..18ae34789 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,7 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_TX_FREE_THRESH 32
 #define DEFAULT_RX_FREE_THRESH 32
 
 #define VIRTIO_MBUF_BURST_SZ 64
@@ -562,4 +563,165 @@ virtqueue_notify(struct virtqueue *vq)
 #define VIRTQUEUE_DUMP(vq) do { } while (0)
 #endif
 
+/* avoid write operation when necessary, to lessen cache issues */
+#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
+	typeof(var) var_ = (var);		\
+	typeof(val) val_ = (val);		\
+	if ((var_) != (val_))			\
+		(var_) = (val_);		\
+} while (0)
+
+#define virtqueue_clear_net_hdr(hdr) do {		\
+	typeof(hdr) hdr_ = (hdr);			\
+	ASSIGN_UNLESS_EQUAL((hdr_)->csum_start, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->csum_offset, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->flags, 0);		\
+	ASSIGN_UNLESS_EQUAL((hdr_)->gso_type, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->gso_size, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->hdr_len, 0);	\
+} while (0)
+
+static inline void
+virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
+			struct rte_mbuf *cookie,
+			bool offload)
+{
+	if (offload) {
+		if (cookie->ol_flags & PKT_TX_TCP_SEG)
+			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
+
+		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
+		case PKT_TX_UDP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_udp_hdr,
+				dgram_cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		case PKT_TX_TCP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		default:
+			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
+			break;
+		}
+
+		/* TCP Segmentation Offload */
+		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
+			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
+				VIRTIO_NET_HDR_GSO_TCPV6 :
+				VIRTIO_NET_HDR_GSO_TCPV4;
+			hdr->gso_size = cookie->tso_segsz;
+			hdr->hdr_len =
+				cookie->l2_len +
+				cookie->l3_len +
+				cookie->l4_len;
+		} else {
+			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
+		}
+	}
+}
+
+static inline void
+virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
+			      uint16_t needed, int can_push, int in_order)
+{
+	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
+	struct vq_desc_extra *dxp;
+	struct virtqueue *vq = txvq->vq;
+	struct vring_packed_desc *start_dp, *head_dp;
+	uint16_t idx, id, head_idx, head_flags;
+	int16_t head_size = vq->hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	uint16_t prev;
+	bool prepend_header = false;
+
+	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
+
+	dxp = &vq->vq_descx[id];
+	dxp->ndescs = needed;
+	dxp->cookie = cookie;
+
+	head_idx = vq->vq_avail_idx;
+	idx = head_idx;
+	prev = head_idx;
+	start_dp = vq->vq_packed.ring.desc;
+
+	head_dp = &vq->vq_packed.ring.desc[idx];
+	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+	head_flags |= vq->vq_packed.cached_flags;
+
+	if (can_push) {
+		/* prepend cannot fail, checked by caller */
+		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
+					      -head_size);
+		prepend_header = true;
+
+		/* if offload disabled, it is not zeroed below, do it now */
+		if (!vq->hw->has_tx_offload)
+			virtqueue_clear_net_hdr(hdr);
+	} else {
+		/* setup first tx ring slot to point to header
+		 * stored in reserved region.
+		 */
+		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
+			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
+		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
+		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	}
+
+	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
+
+	do {
+		uint16_t flags;
+
+		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
+		start_dp[idx].len  = cookie->data_len;
+		if (prepend_header) {
+			start_dp[idx].addr -= head_size;
+			start_dp[idx].len += head_size;
+			prepend_header = false;
+		}
+
+		if (likely(idx != head_idx)) {
+			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+			flags |= vq->vq_packed.cached_flags;
+			start_dp[idx].flags = flags;
+		}
+		prev = idx;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	} while ((cookie = cookie->next) != NULL);
+
+	start_dp[prev].id = id;
+
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
+	vq->vq_avail_idx = idx;
+
+	if (!in_order) {
+		vq->vq_desc_head_idx = dxp->next;
+		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
+			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
+	}
+
+	virtqueue_store_flags_packed(head_dp, head_flags,
+				     vq->hw->weak_barriers);
+}
 #endif /* _VIRTQUEUE_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v7 7/9] net/virtio: add vectorized packed ring Tx path
  2020-04-22  6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu
                     ` (5 preceding siblings ...)
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu
@ 2020-04-22  6:16   ` Marvin Liu
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 8/9] net/virtio: add election for vectorized path Marvin Liu
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 9/9] doc: add packed " Marvin Liu
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-22  6:16 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren
  Cc: dev, Marvin Liu

Optimize the packed ring Tx path like the Rx path: split the Tx path into
batch and single Tx functions. The batch function is further optimized with
vector instructions.
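
The dispatch pattern mirrors the Rx side; a condensed sketch of the loop this
patch adds (it assumes the driver-internal types and the two enqueue helpers
introduced below, and omits the cleanup and kick handling):

/* try to enqueue four packets at once, fall back to single enqueue */
static uint16_t
xmit_dispatch_sketch(struct virtnet_tx *txvq, struct rte_mbuf **tx_pkts,
		     uint16_t remained)
{
	uint16_t nb_tx = 0;

	while (remained) {
		if (remained >= PACKED_BATCH_SIZE &&
		    !virtqueue_enqueue_batch_packed_vec(txvq,
							&tx_pkts[nb_tx])) {
			nb_tx += PACKED_BATCH_SIZE;
			remained -= PACKED_BATCH_SIZE;
			continue;
		}
		if (!virtqueue_enqueue_single_packed_vec(txvq,
							 tx_pkts[nb_tx])) {
			nb_tx++;
			remained--;
			continue;
		}
		break;
	}

	return nb_tx;
}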

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 5c112cac7..b7d52d497 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index cf18fe564..f82fe8d64 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
 {
 	return 0;
 }
+
+__rte_weak uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused,
+			    struct rte_mbuf **tx_pkts __rte_unused,
+			    uint16_t nb_pkts __rte_unused)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
index d02ba9ba6..60d03b6d8 100644
--- a/drivers/net/virtio/virtio_rxtx_packed_avx.c
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -23,6 +23,24 @@
 #define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \
 	FLAGS_BITS_OFFSET)
 
+/* reference count offset in mbuf rearm data */
+#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
+	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+/* segment number offset in mbuf rearm data */
+#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
+	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+
+/* default rearm data */
+#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
+	1ULL << REFCNT_BITS_OFFSET)
+
+/* id bits offset in packed ring desc higher 64bits */
+#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \
+	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
+
+/* net hdr short size mask */
+#define NET_HDR_MASK 0x3F
+
 #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
 	sizeof(struct vring_packed_desc))
 #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
@@ -47,6 +65,47 @@
 	for (iter = val; iter < num; iter++)
 #endif
 
+static inline void
+virtio_xmit_cleanup_packed_vec(struct virtqueue *vq)
+{
+	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
+	struct vq_desc_extra *dxp;
+	uint16_t used_idx, id, curr_id, free_cnt = 0;
+	uint16_t size = vq->vq_nentries;
+	struct rte_mbuf *mbufs[size];
+	uint16_t nb_mbuf = 0, i;
+
+	used_idx = vq->vq_used_cons_idx;
+
+	if (!desc_is_used(&desc[used_idx], vq))
+		return;
+
+	id = desc[used_idx].id;
+
+	do {
+		curr_id = used_idx;
+		dxp = &vq->vq_descx[used_idx];
+		used_idx += dxp->ndescs;
+		free_cnt += dxp->ndescs;
+
+		if (dxp->cookie != NULL) {
+			mbufs[nb_mbuf] = dxp->cookie;
+			dxp->cookie = NULL;
+			nb_mbuf++;
+		}
+
+		if (used_idx >= size) {
+			used_idx -= size;
+			vq->vq_packed.used_wrap_counter ^= 1;
+		}
+	} while (curr_id != id);
+
+	for (i = 0; i < nb_mbuf; i++)
+		rte_pktmbuf_free(mbufs[i]);
+
+	vq->vq_used_cons_idx = used_idx;
+	vq->vq_free_cnt += free_cnt;
+}
 
 static inline void
 virtio_update_batch_stats(struct virtnet_stats *stats,
@@ -60,6 +119,236 @@ virtio_update_batch_stats(struct virtnet_stats *stats,
 	stats->bytes += pkt_len3;
 	stats->bytes += pkt_len4;
 }
+
+static inline int
+virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq,
+				   struct rte_mbuf **tx_pkts)
+{
+	struct virtqueue *vq = txvq->vq;
+	uint16_t head_size = vq->hw->vtnet_hdr_size;
+	uint16_t idx = vq->vq_avail_idx;
+	struct virtio_net_hdr *hdr;
+	uint16_t i, cmp;
+
+	if (vq->vq_avail_idx & PACKED_BATCH_MASK)
+		return -1;
+
+	if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries))
+		return -1;
+
+	/* Load four mbufs rearm data */
+	RTE_BUILD_BUG_ON(REFCNT_BITS_OFFSET >= 64);
+	RTE_BUILD_BUG_ON(SEG_NUM_BITS_OFFSET >= 64);
+	__m256i mbufs = _mm256_set_epi64x(*tx_pkts[3]->rearm_data,
+					  *tx_pkts[2]->rearm_data,
+					  *tx_pkts[1]->rearm_data,
+					  *tx_pkts[0]->rearm_data);
+
+	/* refcnt=1 and nb_segs=1 */
+	__m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
+	__m256i head_rooms = _mm256_set1_epi16(head_size);
+
+	/* Check refcnt and nb_segs */
+	cmp = _mm256_mask_cmpneq_epu16_mask(0x6666, mbufs, mbuf_ref);
+	if (unlikely(cmp))
+		return -1;
+
+	/* Check headroom is enough */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_off) !=
+		offsetof(struct rte_mbuf, rearm_data));
+	cmp = _mm256_mask_cmplt_epu16_mask(0x1111, mbufs, head_rooms);
+	if (unlikely(cmp))
+		return -1;
+
+	__m512i v_descx = _mm512_set_epi64(0x1, (uint64_t)tx_pkts[3],
+					   0x1, (uint64_t)tx_pkts[2],
+					   0x1, (uint64_t)tx_pkts[1],
+					   0x1, (uint64_t)tx_pkts[0]);
+
+	_mm512_storeu_si512((void *)&vq->vq_descx[idx], v_descx);
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		tx_pkts[i]->data_off -= head_size;
+		tx_pkts[i]->data_len += head_size;
+	}
+
+#ifdef RTE_VIRTIO_USER
+	__m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])),
+			tx_pkts[2]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])),
+			tx_pkts[1]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])),
+			tx_pkts[0]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0])));
+#else
+	__m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len,
+					      tx_pkts[3]->buf_iova,
+					      tx_pkts[2]->data_len,
+					      tx_pkts[2]->buf_iova,
+					      tx_pkts[1]->data_len,
+					      tx_pkts[1]->buf_iova,
+					      tx_pkts[0]->data_len,
+					      tx_pkts[0]->buf_iova);
+#endif
+
+	/* id offset and data offset */
+	__m512i data_offsets = _mm512_set_epi64((uint64_t)3 << ID_BITS_OFFSET,
+						tx_pkts[3]->data_off,
+						(uint64_t)2 << ID_BITS_OFFSET,
+						tx_pkts[2]->data_off,
+						(uint64_t)1 << ID_BITS_OFFSET,
+						tx_pkts[1]->data_off,
+						0, tx_pkts[0]->data_off);
+
+	__m512i new_descs = _mm512_add_epi64(descs_base, data_offsets);
+
+	uint64_t flags_temp = (uint64_t)idx << ID_BITS_OFFSET |
+		(uint64_t)vq->vq_packed.cached_flags << FLAGS_BITS_OFFSET;
+
+	/* flags offset and guest virtual address offset */
+#ifdef RTE_VIRTIO_USER
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset);
+#else
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, 0);
+#endif
+	__m512i v_offset = _mm512_broadcast_i32x4(flag_offset);
+
+	__m512i v_desc = _mm512_add_epi64(new_descs, v_offset);
+
+	if (!vq->hw->has_tx_offload) {
+		__m128i mask = _mm_set1_epi16(0xFFFF);
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			__m128i v_hdr = _mm_loadu_si128((void *)hdr);
+			if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK,
+							v_hdr, mask))) {
+				__m128i all_zero = _mm_setzero_si128();
+				_mm_mask_storeu_epi16((void *)hdr,
+						NET_HDR_MASK, all_zero);
+			}
+		}
+	} else {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			virtqueue_xmit_offload(hdr, tx_pkts[i], true);
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	_mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], v_desc);
+
+	virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len,
+			tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len,
+			tx_pkts[3]->pkt_len);
+
+	vq->vq_avail_idx += PACKED_BATCH_SIZE;
+	vq->vq_free_cnt -= PACKED_BATCH_SIZE;
+
+	if (vq->vq_avail_idx >= vq->vq_nentries) {
+		vq->vq_avail_idx -= vq->vq_nentries;
+		vq->vq_packed.cached_flags ^=
+			VRING_PACKED_DESC_F_AVAIL_USED;
+	}
+
+	return 0;
+}
+
+static inline int
+virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq,
+				    struct rte_mbuf *txm)
+{
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint16_t slots, can_push;
+	int16_t need;
+
+	/* How many main ring entries are needed for this Tx?
+	 * any_layout => number of segments
+	 * default    => number of segments + 1
+	 */
+	can_push = rte_mbuf_refcnt_read(txm) == 1 &&
+		   RTE_MBUF_DIRECT(txm) &&
+		   txm->nb_segs == 1 &&
+		   rte_pktmbuf_headroom(txm) >= hdr_size;
+
+	slots = txm->nb_segs + !can_push;
+	need = slots - vq->vq_free_cnt;
+
+	/* A positive value indicates it needs free vring descriptors */
+	if (unlikely(need > 0)) {
+		virtio_xmit_cleanup_packed_vec(vq);
+		need = slots - vq->vq_free_cnt;
+		if (unlikely(need > 0)) {
+			PMD_TX_LOG(ERR,
+				   "No free tx descriptors to transmit");
+			return -1;
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1);
+
+	txvq->stats.bytes += txm->pkt_len;
+	return 0;
+}
+
+uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			uint16_t nb_pkts)
+{
+	struct virtnet_tx *txvq = tx_queue;
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t nb_tx = 0;
+	uint16_t remained;
+
+	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
+		return nb_tx;
+
+	if (unlikely(nb_pkts < 1))
+		return nb_pkts;
+
+	PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+	if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh)
+		virtio_xmit_cleanup_packed_vec(vq);
+
+	remained = RTE_MIN(nb_pkts, vq->vq_free_cnt);
+
+	while (remained) {
+		if (remained >= PACKED_BATCH_SIZE) {
+			if (!virtqueue_enqueue_batch_packed_vec(txvq,
+						&tx_pkts[nb_tx])) {
+				nb_tx += PACKED_BATCH_SIZE;
+				remained -= PACKED_BATCH_SIZE;
+				continue;
+			}
+		}
+		if (!virtqueue_enqueue_single_packed_vec(txvq,
+					tx_pkts[nb_tx])) {
+			nb_tx++;
+			remained--;
+			continue;
+		}
+		break;
+	};
+
+	txvq->stats.packets += nb_tx;
+
+	if (likely(nb_tx)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_TX_LOG(DEBUG, "Notified backend after xmit");
+		}
+	}
+
+	return nb_tx;
+}
+
 /* Optionally fill offload information in structure */
 static inline int
 virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v7 8/9] net/virtio: add election for vectorized path
  2020-04-22  6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu
                     ` (6 preceding siblings ...)
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu
@ 2020-04-22  6:16   ` Marvin Liu
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 9/9] doc: add packed " Marvin Liu
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-22  6:16 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren
  Cc: dev, Marvin Liu

Rewrite the vectorized path selection logic. The default setting comes
from the RTE_LIBRTE_VIRTIO_INC_VECTOR option. The path selection criteria
are checked as listed below.

Packed ring vectorized path will be selected when:
    vectorized option is enabled
    AVX512F and required extensions are supported by compiler and host
    virtio VERSION_1 and IN_ORDER features are negotiated
    virtio mergeable feature is not negotiated
    LRO offloading is disabled

Split ring vectorized rx path will be selected when:
    vectorized option is enabled
    virtio mergeable and IN_ORDER features are not negotiated
    LRO, chksum and vlan strip offloading are disabled
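
As a quick summary before the diff, here is a minimal, self-contained
sketch condensing the packed ring rules above (per the diff below, the
last two checks only gate the Rx side); the parameter names are
illustrative stand-ins, not the driver's symbols:

#include <stdbool.h>

/* Condensed form of the packed ring Rx selection rules listed above. */
static bool
packed_vec_rx_usable(bool vectorized_opt, bool has_avx512f,
		     bool has_version_1, bool has_in_order,
		     bool has_mrg_rxbuf, bool lro_enabled)
{
	return vectorized_opt && has_avx512f &&
	       has_version_1 && has_in_order &&
	       !has_mrg_rxbuf && !lro_enabled;
}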

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 361c834a9..c700af6be 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1522,9 +1522,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	if (vtpci_packed_queue(hw)) {
 		PMD_INIT_LOG(INFO,
 			"virtio: using packed ring %s Tx path on port %u",
-			hw->use_inorder_tx ? "inorder" : "standard",
+			hw->use_vec_tx ? "vectorized" : "standard",
 			eth_dev->data->port_id);
-		eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
+		if (hw->use_vec_tx)
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec;
+		else
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
 	} else {
 		if (hw->use_inorder_tx) {
 			PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u",
@@ -1538,7 +1541,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+		if (hw->use_vec_rx) {
+			PMD_INIT_LOG(INFO,
+				"virtio: using packed ring vectorized Rx path on port %u",
+				eth_dev->data->port_id);
+			eth_dev->rx_pkt_burst =
+				&virtio_recv_pkts_packed_vec;
+		} else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
 			PMD_INIT_LOG(INFO,
 				"virtio: using packed ring mergeable buffer Rx path on port %u",
 				eth_dev->data->port_id);
@@ -1950,6 +1959,10 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 		goto err_virtio_init;
 
 	hw->opened = true;
+#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR
+	hw->use_vec_rx = 1;
+	hw->use_vec_tx = 1;
+#endif
 
 	return 0;
 
@@ -2257,33 +2270,63 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 			return -EBUSY;
 		}
 
-	hw->use_vec_rx = 1;
+	if (vtpci_packed_queue(hw)) {
+#if defined RTE_ARCH_X86
+		if ((hw->use_vec_rx || hw->use_vec_tx) &&
+		    (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
+		     !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) ||
+		     !vtpci_with_feature(hw, VIRTIO_F_VERSION_1))) {
+			PMD_DRV_LOG(INFO,
+				"disabled packed ring vectorization for requirements are not met");
+			hw->use_vec_rx = 0;
+			hw->use_vec_tx = 0;
+		}
+#endif
 
-	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
-		hw->use_inorder_tx = 1;
-		hw->use_inorder_rx = 1;
-		hw->use_vec_rx = 0;
-	}
+		if (hw->use_vec_rx) {
+			if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+				PMD_DRV_LOG(INFO,
+					"disabled packed ring vectorized rx for mrg_rxbuf enabled");
+				hw->use_vec_rx = 0;
+			}
 
-	if (vtpci_packed_queue(hw)) {
-		hw->use_vec_rx = 0;
-		hw->use_inorder_rx = 0;
-	}
+			if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) {
+				PMD_DRV_LOG(INFO,
+					"disabled packed ring vectorized rx for TCP_LRO enabled");
+				hw->use_vec_rx = 0;
+			}
+		}
+	} else {
+		if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
+			hw->use_inorder_tx = 1;
+			hw->use_inorder_rx = 1;
+			hw->use_vec_rx = 0;
+		}
 
+		if (hw->use_vec_rx) {
 #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
-	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
-		hw->use_vec_rx = 0;
-	}
+			if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorization for requirements are not met");
+				hw->use_vec_rx = 0;
+			}
 #endif
-	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
-		hw->use_vec_rx = 0;
-	}
+			if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized rx for mrg_rxbuf enabled");
+				hw->use_vec_rx = 0;
+			}
 
-	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
-			   DEV_RX_OFFLOAD_TCP_CKSUM |
-			   DEV_RX_OFFLOAD_TCP_LRO |
-			   DEV_RX_OFFLOAD_VLAN_STRIP))
-		hw->use_vec_rx = 0;
+			if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
+					   DEV_RX_OFFLOAD_TCP_CKSUM |
+					   DEV_RX_OFFLOAD_TCP_LRO |
+					   DEV_RX_OFFLOAD_VLAN_STRIP)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized rx for offloading enabled");
+				hw->use_vec_rx = 0;
+			}
+		}
+	}
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v7 9/9] doc: add packed vectorized path
  2020-04-22  6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu
                     ` (7 preceding siblings ...)
  2020-04-22  6:16   ` [dpdk-dev] [PATCH v7 8/9] net/virtio: add election for vectorized path Marvin Liu
@ 2020-04-22  6:16   ` Marvin Liu
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-22  6:16 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren
  Cc: dev, Marvin Liu

Document packed virtqueue vectorized path selection logic in virtio net
PMD.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index 6286286db..4bd46f83e 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -417,6 +417,10 @@ Below devargs are supported by the virtio-user vdev:
     rte_eth_link_get_nowait function.
     (Default: 10000 (10G))
 
+#.  ``vectorized``:
+
+    It is used to enable virtio device vectorized path.
+    (Default: 0 (disabled))
 
 Virtio paths Selection and Usage
 --------------------------------
@@ -469,6 +473,13 @@ according to below configuration:
    both negotiated, this path will be selected.
 #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and
    Rx mergeable is not negotiated, this path will be selected.
+#. Packed virtqueue vectorized Rx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated &&
+   TCP_LRO Rx offloading is disabled && vectorized option enabled,
+   this path will be selected.
+#. Packed virtqueue vectorized Tx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && vectorized option enabled,
+   this path will be selected.
 
 Rx/Tx callbacks of each Virtio path
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -491,6 +502,8 @@ are shown in below table:
    Packed virtqueue non-meregable path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed
    Packed virtqueue in-order mergeable path     virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed
    Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed           virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Rx path          virtio_recv_pkts_packed_vec       virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Tx path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed_vec
    ============================================ ================================= ========================
 
 Virtio paths Support Status from Release to Release
@@ -508,20 +521,22 @@ All virtio paths support status are shown in below table:
 
 .. table:: Virtio Paths and Releases
 
-   ============================================ ============= ============= =============
-                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11
-   ============================================ ============= ============= =============
-   Split virtqueue mergeable path                     Y             Y             Y
-   Split virtqueue non-mergeable path                 Y             Y             Y
-   Split virtqueue vectorized Rx path                 Y             Y             Y
-   Split virtqueue simple Tx path                     Y             N             N
-   Split virtqueue in-order mergeable path                          Y             Y
-   Split virtqueue in-order non-mergeable path                      Y             Y
-   Packed virtqueue mergeable path                                                Y
-   Packed virtqueue non-mergeable path                                            Y
-   Packed virtqueue in-order mergeable path                                       Y
-   Packed virtqueue in-order non-mergeable path                                   Y
-   ============================================ ============= ============= =============
+   ============================================ ============= ============= ============= =======
+                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~
+   ============================================ ============= ============= ============= =======
+   Split virtqueue mergeable path                     Y             Y             Y          Y
+   Split virtqueue non-mergeable path                 Y             Y             Y          Y
+   Split virtqueue vectorized Rx path                 Y             Y             Y          Y
+   Split virtqueue simple Tx path                     Y             N             N          N
+   Split virtqueue in-order mergeable path                          Y             Y          Y
+   Split virtqueue in-order non-mergeable path                      Y             Y          Y
+   Packed virtqueue mergeable path                                                Y          Y
+   Packed virtqueue non-mergeable path                                            Y          Y
+   Packed virtqueue in-order mergeable path                                       Y          Y
+   Packed virtqueue in-order non-mergeable path                                   Y          Y
+   Packed virtqueue vectorized Rx path                                                       Y
+   Packed virtqueue vectorized Tx path                                                       Y
+   ============================================ ============= ============= ============= =======
 
 QEMU Support Status
 ~~~~~~~~~~~~~~~~~~~
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v6 2/9] net/virtio: enable vectorized path
  2020-04-21  6:43       ` Liu, Yong
@ 2020-04-22  8:07         ` Liu, Yong
  0 siblings, 0 replies; 162+ messages in thread
From: Liu, Yong @ 2020-04-22  8:07 UTC (permalink / raw)
  To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev



> -----Original Message-----
> From: Liu, Yong
> Sent: Tuesday, April 21, 2020 2:43 PM
> To: 'Maxime Coquelin' <maxime.coquelin@redhat.com>; Ye, Xiaolong
> <xiaolong.ye@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: RE: [PATCH v6 2/9] net/virtio: enable vectorized path
> 
> 
> 
> > -----Original Message-----
> > From: Maxime Coquelin <maxime.coquelin@redhat.com>
> > Sent: Monday, April 20, 2020 10:08 PM
> > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
> > Wang, Zhihong <zhihong.wang@intel.com>
> > Cc: dev@dpdk.org
> > Subject: Re: [PATCH v6 2/9] net/virtio: enable vectorized path
> >
> > Hi Marvin,
> >
> > On 4/17/20 12:24 AM, Marvin Liu wrote:
> > > Previously, virtio split ring vectorized path is enabled as default.
> > > This is not suitable for everyone because of that path not follow virtio
> > > spec. Add new config for virtio vectorized path selection. By default
> > > vectorized path is enabled.
> >
> > It should be disabled by default if not following spec. Also, it means
> > it will always be enabled with Meson, which is not acceptable.
> >
> > I think we should have a devarg, so that it is built by default but
> > disabled. User would specify explicitly he wants to enable vector
> > support when probing the device.
> >
> 

Hi Maxime,
There's one new devarg parameter "vectorized" which allows the user to specify whether to enable or disable the vectorized path.
For now this parameter depends on RTE_LIBRTE_VIRTIO_INC_VECTOR; the parameter won't be used if the INC_VECTOR option is disabled.

Regards,
Marvin

> Thanks, Maxime. Will change to disable as default in next version.
> 
> > Thanks,
> > Maxime
> >
> > > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> > >
> > > diff --git a/config/common_base b/config/common_base
> > > index c31175f9d..5901a94f7 100644
> > > --- a/config/common_base
> > > +++ b/config/common_base
> > > @@ -449,6 +449,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y
> > >  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n
> > >  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n
> > >  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n
> > > +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=y
> > >
> > >  #
> > >  # Compile virtio device emulation inside virtio PMD driver
> > > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> > > index efdcb0d93..9ef445bc9 100644
> > > --- a/drivers/net/virtio/Makefile
> > > +++ b/drivers/net/virtio/Makefile
> > > @@ -29,6 +29,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=
> > virtio_rxtx.c
> > >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
> > >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
> > >
> > > +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y)
> > >  ifeq ($(CONFIG_RTE_ARCH_X86),y)
> > >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c
> > >  else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y)
> > > @@ -36,6 +37,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=
> > virtio_rxtx_simple_altivec.c
> > >  else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM)
> > $(CONFIG_RTE_ARCH_ARM64)),)
> > >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
> > >  endif
> > > +endif
> > >
> > >  ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
> > >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
> > > diff --git a/drivers/net/virtio/meson.build
> > b/drivers/net/virtio/meson.build
> > > index 5e7ca855c..f9619a108 100644
> > > --- a/drivers/net/virtio/meson.build
> > > +++ b/drivers/net/virtio/meson.build
> > > @@ -9,12 +9,14 @@ sources += files('virtio_ethdev.c',
> > >  	'virtqueue.c')
> > >  deps += ['kvargs', 'bus_pci']
> > >
> > > -if arch_subdir == 'x86'
> > > -	sources += files('virtio_rxtx_simple_sse.c')
> > > -elif arch_subdir == 'ppc'
> > > -	sources += files('virtio_rxtx_simple_altivec.c')
> > > -elif arch_subdir == 'arm' and
> > host_machine.cpu_family().startswith('aarch64')
> > > -	sources += files('virtio_rxtx_simple_neon.c')
> > > +if dpdk_conf.has('RTE_LIBRTE_VIRTIO_INC_VECTOR')
> > > +	if arch_subdir == 'x86'
> > > +		sources += files('virtio_rxtx_simple_sse.c')
> > > +	elif arch_subdir == 'ppc'
> > > +		sources += files('virtio_rxtx_simple_altivec.c')
> > > +	elif arch_subdir == 'arm' and
> > host_machine.cpu_family().startswith('aarch64')
> > > +		sources += files('virtio_rxtx_simple_neon.c')
> > > +	endif
> > >  endif
> > >
> > >  if is_linux
> > >


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v8 1/9] net/virtio: add Rx free threshold setting
  2020-04-23 12:30   ` [dpdk-dev] [PATCH v8 1/9] net/virtio: add Rx free threshold setting Marvin Liu
@ 2020-04-23  8:09     ` Maxime Coquelin
  0 siblings, 0 replies; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-23  8:09 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: harry.van.haaren, dev



On 4/23/20 2:30 PM, Marvin Liu wrote:
> Introduce free threshold setting in Rx queue, default value of it is 32.
> Limiated threshold size to multiple of four as only vectorized packed Rx
s/Limiated/Limit the/

> function will utilize it. Virtio driver will rearm Rx queue when more
> than rx_free_thresh descs were dequeued.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path
  2020-04-23 12:30   ` [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path Marvin Liu
@ 2020-04-23  8:33     ` Maxime Coquelin
  2020-04-23  8:46       ` Liu, Yong
  0 siblings, 1 reply; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-23  8:33 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: harry.van.haaren, dev



On 4/23/20 2:30 PM, Marvin Liu wrote:
> Previously, virtio split ring vectorized path is enabled as default.

s/is/was/
s/as/by/

> This is not suitable for everyone because of that path not follow virtio

s/because of that path not follow/because that path does not follow the/

> spec. Add new config for virtio vectorized path selection. By default
> vectorized path is disabled.

I think we can keep it enabled by default for consistency between make &
meson, now that you are providing a devarg for it that is disabled by
default.

Maybe we can just drop this config flag, what do you think?

Thanks,
Maxime

> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 
> diff --git a/config/common_base b/config/common_base
> index 00d8d0792..334a26a17 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -456,6 +456,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y
>  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n
>  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n
>  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n
> +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=n
>  
>  #
>  # Compile virtio device emulation inside virtio PMD driver
> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> index c9edb84ee..4b69827ab 100644
> --- a/drivers/net/virtio/Makefile
> +++ b/drivers/net/virtio/Makefile
> @@ -28,6 +28,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
>  
> +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y)
>  ifeq ($(CONFIG_RTE_ARCH_X86),y)
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c
>  else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y)
> @@ -35,6 +36,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c
>  else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
>  endif
> +endif
>  
>  ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
> diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
> index 15150eea1..ce3525ef5 100644
> --- a/drivers/net/virtio/meson.build
> +++ b/drivers/net/virtio/meson.build
> @@ -8,6 +8,7 @@ sources += files('virtio_ethdev.c',
>  	'virtqueue.c')
>  deps += ['kvargs', 'bus_pci']
>  
> +dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1)
>  if arch_subdir == 'x86'
>  	sources += files('virtio_rxtx_simple_sse.c')
>  elif arch_subdir == 'ppc'
> 


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v8 3/9] net/virtio: inorder should depend on feature bit
  2020-04-23 12:31   ` [dpdk-dev] [PATCH v8 3/9] net/virtio: inorder should depend on feature bit Marvin Liu
@ 2020-04-23  8:46     ` Maxime Coquelin
  0 siblings, 0 replies; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-23  8:46 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: harry.van.haaren, dev



On 4/23/20 2:31 PM, Marvin Liu wrote:
> Ring initialzation is different when inorder feature negotiated. This
s/initialzation/initialization/
> action should dependent on negotiated feature bits.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path
  2020-04-23  8:33     ` Maxime Coquelin
@ 2020-04-23  8:46       ` Liu, Yong
  2020-04-23  8:49         ` Maxime Coquelin
  0 siblings, 1 reply; 162+ messages in thread
From: Liu, Yong @ 2020-04-23  8:46 UTC (permalink / raw)
  To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: Van Haaren, Harry, dev



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Thursday, April 23, 2020 4:34 PM
> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> Subject: Re: [PATCH v8 2/9] net/virtio: enable vectorized path
> 
> 
> 
> On 4/23/20 2:30 PM, Marvin Liu wrote:
> > Previously, virtio split ring vectorized path is enabled as default.
> 
> s/is/was/
> s/as/by/
> 
> > This is not suitable for everyone because of that path not follow virtio
> 
> s/because of that path not follow/because that path does not follow the/
> 
> > spec. Add new config for virtio vectorized path selection. By default
> > vectorized path is disabled.
> 
> I think we can keep it enabled by default for consistency between make &
> meson, now that you are providing a devarg for it that is disabled by
> default.
> 
> Maybe we can just drop this config flag, what do you think?
> 

Maxime, 
The devarg will only affect virtio-user path selection, while the DPDK configuration can affect both the virtio PMD and virtio-user.
It may be worth adding the new configuration option, as it allows the user to choose whether to disable the vectorized path in the virtio PMD.
IMHO, AVX512 instructions should be selectable in each component.

Regards,
Marvin

> Thanks,
> Maxime
> 
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/config/common_base b/config/common_base
> > index 00d8d0792..334a26a17 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -456,6 +456,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y
> >  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n
> >  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n
> >  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n
> > +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=n
> >
> >  #
> >  # Compile virtio device emulation inside virtio PMD driver
> > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> > index c9edb84ee..4b69827ab 100644
> > --- a/drivers/net/virtio/Makefile
> > +++ b/drivers/net/virtio/Makefile
> > @@ -28,6 +28,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=
> virtio_rxtx.c
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
> >
> > +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y)
> >  ifeq ($(CONFIG_RTE_ARCH_X86),y)
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c
> >  else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y)
> > @@ -35,6 +36,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=
> virtio_rxtx_simple_altivec.c
> >  else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM)
> $(CONFIG_RTE_ARCH_ARM64)),)
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
> >  endif
> > +endif
> >
> >  ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
> > diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
> > index 15150eea1..ce3525ef5 100644
> > --- a/drivers/net/virtio/meson.build
> > +++ b/drivers/net/virtio/meson.build
> > @@ -8,6 +8,7 @@ sources += files('virtio_ethdev.c',
> >  	'virtqueue.c')
> >  deps += ['kvargs', 'bus_pci']
> >
> > +dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1)
> >  if arch_subdir == 'x86'
> >  	sources += files('virtio_rxtx_simple_sse.c')
> >  elif arch_subdir == 'ppc'
> >


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path
  2020-04-23  8:46       ` Liu, Yong
@ 2020-04-23  8:49         ` Maxime Coquelin
  2020-04-23  9:59           ` Liu, Yong
  0 siblings, 1 reply; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-23  8:49 UTC (permalink / raw)
  To: Liu, Yong, Ye, Xiaolong, Wang, Zhihong; +Cc: Van Haaren, Harry, dev



On 4/23/20 10:46 AM, Liu, Yong wrote:
> 
> 
>> -----Original Message-----
>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Sent: Thursday, April 23, 2020 4:34 PM
>> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
>> Wang, Zhihong <zhihong.wang@intel.com>
>> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
>> Subject: Re: [PATCH v8 2/9] net/virtio: enable vectorized path
>>
>>
>>
>> On 4/23/20 2:30 PM, Marvin Liu wrote:
>>> Previously, virtio split ring vectorized path is enabled as default.
>>
>> s/is/was/
>> s/as/by/
>>
>>> This is not suitable for everyone because of that path not follow virtio
>>
>> s/because of that path not follow/because that path does not follow the/
>>
>>> spec. Add new config for virtio vectorized path selection. By default
>>> vectorized path is disabled.
>>
>> I think we can keep it enabled by default for consistency between make &
>> meson, now that you are providing a devarg for it that is disabled by
>> default.
>>
>> Maybe we can just drop this config flag, what do you think?
>>
> 
> Maxime, 
> Devarg will only have effect on virtio-user path selection, while DPDK configuration can affect both virtio pmd and virtio-user.
> It maybe worth to add new configuration as it can allow user to choice whether disabled vectorized path in virtio pmd. 

Ok, so we had a misunderstanding. I was requesting the devarg to be
effective also for the Virtio PMD, disabled by default.

Thanks,
Maxime
> IMHO, AVX512 instructions should be selective in each component. 
> 
> Regards,
> Marvin
> 
>> Thanks,
>> Maxime
>>
>>> Signed-off-by: Marvin Liu <yong.liu@intel.com>
>>>
>>> diff --git a/config/common_base b/config/common_base
>>> index 00d8d0792..334a26a17 100644
>>> --- a/config/common_base
>>> +++ b/config/common_base
>>> @@ -456,6 +456,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y
>>>  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n
>>>  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n
>>>  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n
>>> +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=n
>>>
>>>  #
>>>  # Compile virtio device emulation inside virtio PMD driver
>>> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
>>> index c9edb84ee..4b69827ab 100644
>>> --- a/drivers/net/virtio/Makefile
>>> +++ b/drivers/net/virtio/Makefile
>>> @@ -28,6 +28,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=
>> virtio_rxtx.c
>>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
>>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
>>>
>>> +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y)
>>>  ifeq ($(CONFIG_RTE_ARCH_X86),y)
>>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c
>>>  else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y)
>>> @@ -35,6 +36,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=
>> virtio_rxtx_simple_altivec.c
>>>  else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM)
>> $(CONFIG_RTE_ARCH_ARM64)),)
>>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
>>>  endif
>>> +endif
>>>
>>>  ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
>>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
>>> diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
>>> index 15150eea1..ce3525ef5 100644
>>> --- a/drivers/net/virtio/meson.build
>>> +++ b/drivers/net/virtio/meson.build
>>> @@ -8,6 +8,7 @@ sources += files('virtio_ethdev.c',
>>>  	'virtqueue.c')
>>>  deps += ['kvargs', 'bus_pci']
>>>
>>> +dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1)
>>>  if arch_subdir == 'x86'
>>>  	sources += files('virtio_rxtx_simple_sse.c')
>>>  elif arch_subdir == 'ppc'
>>>
> 


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path
  2020-04-23  8:49         ` Maxime Coquelin
@ 2020-04-23  9:59           ` Liu, Yong
  0 siblings, 0 replies; 162+ messages in thread
From: Liu, Yong @ 2020-04-23  9:59 UTC (permalink / raw)
  To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: Van Haaren, Harry, dev



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Thursday, April 23, 2020 4:50 PM
> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> Subject: Re: [PATCH v8 2/9] net/virtio: enable vectorized path
> 
> 
> 
> On 4/23/20 10:46 AM, Liu, Yong wrote:
> >
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> >> Sent: Thursday, April 23, 2020 4:34 PM
> >> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
> >> Wang, Zhihong <zhihong.wang@intel.com>
> >> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> >> Subject: Re: [PATCH v8 2/9] net/virtio: enable vectorized path
> >>
> >>
> >>
> >> On 4/23/20 2:30 PM, Marvin Liu wrote:
> >>> Previously, virtio split ring vectorized path is enabled as default.
> >>
> >> s/is/was/
> >> s/as/by/
> >>
> >>> This is not suitable for everyone because of that path not follow virtio
> >>
> >> s/because of that path not follow/because that path does not follow the/
> >>
> >>> spec. Add new config for virtio vectorized path selection. By default
> >>> vectorized path is disabled.
> >>
> >> I think we can keep it enabled by default for consistency between make &
> >> meson, now that you are providing a devarg for it that is disabled by
> >> default.
> >>
> >> Maybe we can just drop this config flag, what do you think?
> >>
> >
> > Maxime,
> > Devarg will only have effect on virtio-user path selection, while DPDK
> configuration can affect both virtio pmd and virtio-user.
> > It maybe worth to add new configuration as it can allow user to choice
> whether disabled vectorized path in virtio pmd.
> 
> Ok, so we had a misunderstanding. I was requesting the the devarg to be
> effective also for the Virtio PMD, disabled by default.
> 
Got you, will change in next version.

> Thanks,
> Maxime
> > IMHO, AVX512 instructions should be selective in each component.
> >
> > Regards,
> > Marvin
> >
> >> Thanks,
> >> Maxime
> >>
> >>> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >>>
> >>> diff --git a/config/common_base b/config/common_base
> >>> index 00d8d0792..334a26a17 100644
> >>> --- a/config/common_base
> >>> +++ b/config/common_base
> >>> @@ -456,6 +456,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y
> >>>  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n
> >>>  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n
> >>>  CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n
> >>> +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=n
> >>>
> >>>  #
> >>>  # Compile virtio device emulation inside virtio PMD driver
> >>> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> >>> index c9edb84ee..4b69827ab 100644
> >>> --- a/drivers/net/virtio/Makefile
> >>> +++ b/drivers/net/virtio/Makefile
> >>> @@ -28,6 +28,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=
> >> virtio_rxtx.c
> >>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
> >>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
> >>>
> >>> +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y)
> >>>  ifeq ($(CONFIG_RTE_ARCH_X86),y)
> >>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c
> >>>  else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y)
> >>> @@ -35,6 +36,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=
> >> virtio_rxtx_simple_altivec.c
> >>>  else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM)
> >> $(CONFIG_RTE_ARCH_ARM64)),)
> >>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
> >>>  endif
> >>> +endif
> >>>
> >>>  ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
> >>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
> >>> diff --git a/drivers/net/virtio/meson.build
> b/drivers/net/virtio/meson.build
> >>> index 15150eea1..ce3525ef5 100644
> >>> --- a/drivers/net/virtio/meson.build
> >>> +++ b/drivers/net/virtio/meson.build
> >>> @@ -8,6 +8,7 @@ sources += files('virtio_ethdev.c',
> >>>  	'virtqueue.c')
> >>>  deps += ['kvargs', 'bus_pci']
> >>>
> >>> +dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1)
> >>>  if arch_subdir == 'x86'
> >>>  	sources += files('virtio_rxtx_simple_sse.c')
> >>>  elif arch_subdir == 'ppc'
> >>>
> >


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v8 0/9] add packed ring vectorized path
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
                   ` (12 preceding siblings ...)
  2020-04-22  6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu
@ 2020-04-23 12:30 ` Marvin Liu
  2020-04-23 12:30   ` [dpdk-dev] [PATCH v8 1/9] net/virtio: add Rx free threshold setting Marvin Liu
                     ` (9 more replies)
  2020-04-24  9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu
                   ` (3 subsequent siblings)
  17 siblings, 10 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-23 12:30 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

This patch set introduces a vectorized path for the packed ring.

The size of a packed ring descriptor is 16 bytes, so a batch of four
descriptors fits exactly into one cacheline, which AVX512 instructions
can handle well. The packed ring Tx path can be fully transformed into
a vectorized path. The packed ring Rx path can be vectorized when the
requirements are met (LRO and mergeable disabled).
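
To make the layout point concrete, here is a minimal, self-contained
sketch (compile with -mavx512f); the struct below is a stand-in that
mirrors the 16-byte descriptor layout, not the driver's header:

#include <stdint.h>
#include <assert.h>
#include <immintrin.h>

/* Stand-in for the 16-byte packed ring descriptor. */
struct packed_desc {
	uint64_t addr;   /* buffer address */
	uint32_t len;    /* buffer length */
	uint16_t id;     /* buffer id */
	uint16_t flags;  /* avail/used/write flags */
};

/* Four descriptors fill one 64-byte cacheline exactly. */
static_assert(sizeof(struct packed_desc) == 16, "desc must be 16 bytes");
static_assert(4 * sizeof(struct packed_desc) == 64,
	      "a batch must be one cacheline");

/* A whole batch can be moved with a single 512-bit load. */
static inline __m512i
load_desc_batch(const struct packed_desc *descs)
{
	return _mm512_loadu_si512((const void *)descs);
}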

A new option, RTE_LIBRTE_VIRTIO_INC_VECTOR, is introduced in this
patch set. It unifies the default setting of the split and packed ring
vectorized paths. Meanwhile, the user can specify whether to enable the
vectorized path at runtime through the 'vectorized' parameter of the
virtio user vdev.

v8:
* fix meson build error on ubuntu16.04 and suse15

v7:
* default vectorization is disabled
* compilation time check dependency on rte_mbuf structure
* offsets are calculated when compiling
* remove useless barrier as descs are batched store&load
* vindex of scatter is directly set
* some comments updates
* enable vectorized path in meson build

v6:
* fix issue when size not power of 2

v5:
* remove cpuflags definition as required extensions always come with
  AVX512F on x86_64
* inorder actions should depend on feature bit
* check ring type in rx queue setup
* rewrite some commit logs
* fix some checkpatch warnings

v4:
* rename 'packed_vec' to 'vectorized', also used in split ring
* add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev
* check required AVX512 extensions cpuflags
* combine split and packed ring datapath selection logic
* remove limitation that size must be a power of two
* clear 12Bytes virtio_net_hdr

v3:
* remove virtio_net_hdr array for better performance
* disable 'packed_vec' by default

v2:
* more function blocks replaced by vector instructions
* clean virtio_net_hdr by vector instruction
* allow header room size change
* add 'packed_vec' option in virtio_user vdev 
* fix build not check whether AVX512 enabled
* doc update


Marvin Liu (9):
  net/virtio: add Rx free threshold setting
  net/virtio: enable vectorized path
  net/virtio: inorder should depend on feature bit
  net/virtio-user: add vectorized path parameter
  net/virtio: add vectorized packed ring Rx path
  net/virtio: reuse packed ring xmit functions
  net/virtio: add vectorized packed ring Tx path
  net/virtio: add election for vectorized path
  doc: add packed vectorized path

 config/common_base                          |   1 +
 doc/guides/nics/virtio.rst                  |  43 +-
 drivers/net/virtio/Makefile                 |  37 ++
 drivers/net/virtio/meson.build              |  15 +
 drivers/net/virtio/virtio_ethdev.c          |  95 ++-
 drivers/net/virtio/virtio_ethdev.h          |   6 +
 drivers/net/virtio/virtio_pci.h             |   3 +-
 drivers/net/virtio/virtio_rxtx.c            | 212 ++-----
 drivers/net/virtio/virtio_rxtx_packed_avx.c | 665 ++++++++++++++++++++
 drivers/net/virtio/virtio_user_ethdev.c     |  37 +-
 drivers/net/virtio/virtqueue.c              |   7 +-
 drivers/net/virtio/virtqueue.h              | 168 ++++-
 12 files changed, 1075 insertions(+), 214 deletions(-)
 create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v8 1/9] net/virtio: add Rx free threshold setting
  2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu
@ 2020-04-23 12:30   ` Marvin Liu
  2020-04-23  8:09     ` Maxime Coquelin
  2020-04-23 12:30   ` [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path Marvin Liu
                     ` (8 subsequent siblings)
  9 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-23 12:30 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Introduce free threshold setting in Rx queue, default value of it is 32.
Limiated threshold size to multiple of four as only vectorized packed Rx
function will utilize it. Virtio driver will rearm Rx queue when more
than rx_free_thresh descs were dequeued.
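
For reference, a minimal usage sketch from the application side,
assuming an already-configured port and an existing mbuf pool (the
port, queue and descriptor numbers below are illustrative):

#include <rte_ethdev.h>

/* Request a custom Rx free threshold when setting up Rx queue 0. */
static int
setup_rx_queue(uint16_t port_id, struct rte_mempool *mb_pool)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rxconf;
	int ret;

	ret = rte_eth_dev_info_get(port_id, &dev_info);
	if (ret != 0)
		return ret;

	rxconf = dev_info.default_rxconf;
	/* Must be a multiple of four and less than the ring size. */
	rxconf.rx_free_thresh = 32;

	return rte_eth_rx_queue_setup(port_id, 0, 1024,
			rte_eth_dev_socket_id(port_id), &rxconf, mb_pool);
}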

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 060410577..94ba7a3ec 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 	struct virtio_hw *hw = dev->data->dev_private;
 	struct virtqueue *vq = hw->vqs[vtpci_queue_idx];
 	struct virtnet_rx *rxvq;
+	uint16_t rx_free_thresh;
 
 	PMD_INIT_FUNC_TRACE();
 
@@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 		return -EINVAL;
 	}
 
+	rx_free_thresh = rx_conf->rx_free_thresh;
+	if (rx_free_thresh == 0)
+		rx_free_thresh =
+			RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH);
+
+	if (rx_free_thresh & 0x3) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+
+	if (rx_free_thresh >= vq->vq_nentries) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the "
+			"number of RX entries (%u)."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			vq->vq_nentries,
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+	vq->vq_free_thresh = rx_free_thresh;
+
 	if (nb_desc == 0 || nb_desc > vq->vq_nentries)
 		nb_desc = vq->vq_nentries;
 	vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc);
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 58ad7309a..6301c56b2 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,8 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_RX_FREE_THRESH 32
+
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path
  2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu
  2020-04-23 12:30   ` [dpdk-dev] [PATCH v8 1/9] net/virtio: add Rx free threshold setting Marvin Liu
@ 2020-04-23 12:30   ` Marvin Liu
  2020-04-23  8:33     ` Maxime Coquelin
  2020-04-23 12:31   ` [dpdk-dev] [PATCH v8 3/9] net/virtio: inorder should depend on feature bit Marvin Liu
                     ` (7 subsequent siblings)
  9 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-23 12:30 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Previously, virtio split ring vectorized path is enabled as default.
This is not suitable for everyone because of that path not follow virtio
spec. Add new config for virtio vectorized path selection. By default
vectorized path is disabled.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/config/common_base b/config/common_base
index 00d8d0792..334a26a17 100644
--- a/config/common_base
+++ b/config/common_base
@@ -456,6 +456,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n
+CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=n
 
 #
 # Compile virtio device emulation inside virtio PMD driver
diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index c9edb84ee..4b69827ab 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -28,6 +28,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
 
+ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y)
 ifeq ($(CONFIG_RTE_ARCH_X86),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c
 else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y)
@@ -35,6 +36,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c
 else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
 endif
+endif
 
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
index 15150eea1..ce3525ef5 100644
--- a/drivers/net/virtio/meson.build
+++ b/drivers/net/virtio/meson.build
@@ -8,6 +8,7 @@ sources += files('virtio_ethdev.c',
 	'virtqueue.c')
 deps += ['kvargs', 'bus_pci']
 
+dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1)
 if arch_subdir == 'x86'
 	sources += files('virtio_rxtx_simple_sse.c')
 elif arch_subdir == 'ppc'
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v8 3/9] net/virtio: inorder should depend on feature bit
  2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu
  2020-04-23 12:30   ` [dpdk-dev] [PATCH v8 1/9] net/virtio: add Rx free threshold setting Marvin Liu
  2020-04-23 12:30   ` [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path Marvin Liu
@ 2020-04-23 12:31   ` Marvin Liu
  2020-04-23  8:46     ` Maxime Coquelin
  2020-04-23 12:31   ` [dpdk-dev] [PATCH v8 4/9] net/virtio-user: add vectorized path parameter Marvin Liu
                     ` (6 subsequent siblings)
  9 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-23 12:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Ring initialzation is different when inorder feature negotiated. This
action should dependent on negotiated feature bits.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 94ba7a3ec..e450477e8 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -989,6 +989,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 	struct rte_mbuf *m;
 	uint16_t desc_idx;
 	int error, nbufs, i;
+	bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER);
 
 	PMD_INIT_FUNC_TRACE();
 
@@ -1018,7 +1019,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 			virtio_rxq_rearm_vec(rxvq);
 			nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
 		}
-	} else if (hw->use_inorder_rx) {
+	} else if (!vtpci_packed_queue(vq->hw) && in_order) {
 		if ((!virtqueue_full(vq))) {
 			uint16_t free_cnt = vq->vq_free_cnt;
 			struct rte_mbuf *pkts[free_cnt];
@@ -1133,7 +1134,7 @@ virtio_dev_tx_queue_setup_finish(struct rte_eth_dev *dev,
 	PMD_INIT_FUNC_TRACE();
 
 	if (!vtpci_packed_queue(hw)) {
-		if (hw->use_inorder_tx)
+		if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER))
 			vq->vq_split.ring.desc[vq->vq_nentries - 1].next = 0;
 	}
 
@@ -2046,7 +2047,7 @@ virtio_xmit_pkts_packed(void *tx_queue, struct rte_mbuf **tx_pkts,
 	struct virtio_hw *hw = vq->hw;
 	uint16_t hdr_size = hw->vtnet_hdr_size;
 	uint16_t nb_tx = 0;
-	bool in_order = hw->use_inorder_tx;
+	bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER);
 
 	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
 		return nb_tx;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v8 4/9] net/virtio-user: add vectorized path parameter
  2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu
                     ` (2 preceding siblings ...)
  2020-04-23 12:31   ` [dpdk-dev] [PATCH v8 3/9] net/virtio: inorder should depend on feature bit Marvin Liu
@ 2020-04-23 12:31   ` Marvin Liu
  2020-04-23 12:31   ` [dpdk-dev] [PATCH v8 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-23 12:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Add a new parameter "vectorized" which selects the vectorized path
explicitly. This parameter only takes effect when the
RTE_LIBRTE_VIRTIO_INC_VECTOR option is set to yes. When "vectorized" is
set, the driver will check both the build environment and the runtime
environment when selecting the path.
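
A minimal sketch of passing the new devarg when creating a virtio-user
port from code; the device name, socket path and the other argument
values are hypothetical:

#include <rte_bus_vdev.h>

/* Create a virtio-user vdev and request the vectorized path. */
static int
create_virtio_user_port(void)
{
	return rte_vdev_init("net_virtio_user0",
			"path=/tmp/vhost-user.sock,queues=1,"
			"packed_vq=1,in_order=1,vectorized=1");
}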

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 37766cbb6..361c834a9 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1551,8 +1551,8 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 			eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed;
 		}
 	} else {
-		if (hw->use_simple_rx) {
-			PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u",
+		if (hw->use_vec_rx) {
+			PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u",
 				eth_dev->data->port_id);
 			eth_dev->rx_pkt_burst = virtio_recv_pkts_vec;
 		} else if (hw->use_inorder_rx) {
@@ -2257,33 +2257,33 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 			return -EBUSY;
 		}
 
-	hw->use_simple_rx = 1;
+	hw->use_vec_rx = 1;
 
 	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
 		hw->use_inorder_tx = 1;
 		hw->use_inorder_rx = 1;
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 		hw->use_inorder_rx = 0;
 	}
 
 #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
 	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 #endif
 	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
-		 hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 
 	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
 			   DEV_RX_OFFLOAD_TCP_CKSUM |
 			   DEV_RX_OFFLOAD_TCP_LRO |
 			   DEV_RX_OFFLOAD_VLAN_STRIP))
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 
 	return 0;
 }
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index bd89357e4..668e688e1 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -253,7 +253,8 @@ struct virtio_hw {
 	uint8_t	    vlan_strip;
 	uint8_t	    use_msix;
 	uint8_t     modern;
-	uint8_t     use_simple_rx;
+	uint8_t     use_vec_rx;
+	uint8_t     use_vec_tx;
 	uint8_t     use_inorder_rx;
 	uint8_t     use_inorder_tx;
 	uint8_t     weak_barriers;
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index e450477e8..84f4cf946 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 	/* Allocate blank mbufs for the each rx descriptor */
 	nbufs = 0;
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx && !vtpci_packed_queue(hw)) {
 		for (desc_idx = 0; desc_idx < vq->vq_nentries;
 		     desc_idx++) {
 			vq->vq_split.ring.avail->ring[desc_idx] = desc_idx;
@@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 			&rxvq->fake_mbuf;
 	}
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx && !vtpci_packed_queue(hw)) {
 		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
 			virtio_rxq_rearm_vec(rxvq);
 			nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
index 953f00d72..5c338cf44 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -452,6 +452,8 @@ static const char *valid_args[] = {
 	VIRTIO_USER_ARG_PACKED_VQ,
 #define VIRTIO_USER_ARG_SPEED          "speed"
 	VIRTIO_USER_ARG_SPEED,
+#define VIRTIO_USER_ARG_VECTORIZED     "vectorized"
+	VIRTIO_USER_ARG_VECTORIZED,
 	NULL
 };
 
@@ -525,7 +527,8 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev)
 	 */
 	hw->use_msix = 1;
 	hw->modern   = 0;
-	hw->use_simple_rx = 0;
+	hw->use_vec_rx = 0;
+	hw->use_vec_tx = 0;
 	hw->use_inorder_rx = 0;
 	hw->use_inorder_tx = 0;
 	hw->virtio_user_dev = dev;
@@ -559,6 +562,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	uint64_t mrg_rxbuf = 1;
 	uint64_t in_order = 1;
 	uint64_t packed_vq = 0;
+	uint64_t vectorized = 0;
 	char *path = NULL;
 	char *ifname = NULL;
 	char *mac_addr = NULL;
@@ -675,6 +679,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		}
 	}
 
+#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR
+	if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) {
+		if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED,
+				       &get_integer_arg, &vectorized) < 0) {
+			PMD_INIT_LOG(ERR, "error to parse %s",
+				     VIRTIO_USER_ARG_VECTORIZED);
+			goto end;
+		}
+	}
+#endif
+
 	if (queues > 1 && cq == 0) {
 		PMD_INIT_LOG(ERR, "multi-q requires ctrl-q");
 		goto end;
@@ -727,6 +742,23 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		goto end;
 	}
 
+	if (vectorized) {
+		if (packed_vq) {
+#if defined(CC_AVX512_SUPPORT)
+			hw->use_vec_rx = 1;
+			hw->use_vec_tx = 1;
+#else
+			PMD_INIT_LOG(INFO,
+				"building environment do not match packed ring vectorized requirement");
+#endif
+		} else {
+			hw->use_vec_rx = 1;
+		}
+	} else {
+		hw->use_vec_rx = 0;
+		hw->use_vec_tx = 0;
+	}
+
 	rte_eth_dev_probing_finish(eth_dev);
 	ret = 0;
 
@@ -785,4 +817,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user,
 	"mrg_rxbuf=<0|1> "
 	"in_order=<0|1> "
 	"packed_vq=<0|1> "
-	"speed=<int>");
+	"speed=<int> "
+	"vectorized=<0|1>");
diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c
index 0b4e3bf3e..ca23180de 100644
--- a/drivers/net/virtio/virtqueue.c
+++ b/drivers/net/virtio/virtqueue.c
@@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq)
 	end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1);
 
 	for (idx = 0; idx < vq->vq_nentries; idx++) {
-		if (hw->use_simple_rx && type == VTNET_RQ) {
+		if (hw->use_vec_rx && !vtpci_packed_queue(hw) &&
+		    type == VTNET_RQ) {
 			if (start <= end && idx >= start && idx < end)
 				continue;
 			if (start > end && (idx >= start || idx < end))
@@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
 	for (i = 0; i < nb_used; i++) {
 		used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1);
 		uep = &vq->vq_split.ring.used->ring[used_idx];
-		if (hw->use_simple_rx) {
+		if (hw->use_vec_rx) {
 			desc_idx = used_idx;
 			rte_pktmbuf_free(vq->sw_ring[desc_idx]);
 			vq->vq_free_cnt++;
@@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
 		vq->vq_used_cons_idx++;
 	}
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx) {
 		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
 			virtio_rxq_rearm_vec(rxq);
 			if (virtqueue_kick_prepare(vq))
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v8 5/9] net/virtio: add vectorized packed ring Rx path
  2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu
                     ` (3 preceding siblings ...)
  2020-04-23 12:31   ` [dpdk-dev] [PATCH v8 4/9] net/virtio-user: add vectorized path parameter Marvin Liu
@ 2020-04-23 12:31   ` Marvin Liu
  2020-04-23 12:31   ` [dpdk-dev] [PATCH v8 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-23 12:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Optimize the packed ring Rx path when AVX512 is enabled and mergeable
buffer/Rx LRO offloading are not required. The optimization approach is
much like vhost: the path is split into batch and single functions, and
the batch function is further optimized with vector instructions. Also
pad the desc extra structure to 16-byte alignment, so that four
elements can be handled in one batch.
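
The batch/single split mentioned above follows the pattern sketched
below; the type and function names are illustrative stubs, not the
driver's symbols (the real code is in virtio_rxtx_packed_avx.c in this
diff):

#include <stdint.h>

#define PACKED_BATCH_SIZE 4	/* four 16-byte descs per 64-byte cacheline */

struct rxq;	/* stand-in for the driver's Rx queue */
struct mbuf;	/* stand-in for rte_mbuf */

/* Stubs: return 0 on success, -1 if descriptors are not ready. */
static int
try_recv_batch(struct rxq *q, struct mbuf **pkts)
{
	(void)q; (void)pkts;
	return -1;
}

static int
try_recv_single(struct rxq *q, struct mbuf **pkt)
{
	(void)q; (void)pkt;
	return -1;
}

static uint16_t
recv_pkts_sketch(struct rxq *q, struct mbuf **rx_pkts, uint16_t nb_pkts)
{
	uint16_t nb_rx = 0;

	while (nb_rx < nb_pkts) {
		uint16_t remained = nb_pkts - nb_rx;

		/* Prefer the AVX512 batch path when a full batch may be ready. */
		if (remained >= PACKED_BATCH_SIZE &&
		    try_recv_batch(q, &rx_pkts[nb_rx]) == 0) {
			nb_rx += PACKED_BATCH_SIZE;
			continue;
		}
		/* Otherwise fall back to the single-descriptor path. */
		if (try_recv_single(q, &rx_pkts[nb_rx]) == 0) {
			nb_rx++;
			continue;
		}
		break;	/* no more used descriptors */
	}
	return nb_rx;
}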

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 4b69827ab..de0b00e50 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -36,6 +36,41 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c
 else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
 endif
+
+ifneq ($(FORCE_DISABLE_AVX512), y)
+	CC_AVX512_SUPPORT=\
+	$(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
+	sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
+	grep -q AVX512 && echo 1)
+endif
+
+ifeq ($(CC_AVX512_SUPPORT), 1)
+CFLAGS += -DCC_AVX512_SUPPORT
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
+
+ifeq ($(RTE_TOOLCHAIN), gcc)
+ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
+CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), clang)
+ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1)
+CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), icc)
+ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
+CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
+endif
+endif
+
+CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl
+ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
+CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
+endif
+endif
 endif
 
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
index ce3525ef5..39b3605d9 100644
--- a/drivers/net/virtio/meson.build
+++ b/drivers/net/virtio/meson.build
@@ -10,6 +10,20 @@ deps += ['kvargs', 'bus_pci']
 
 dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1)
 if arch_subdir == 'x86'
+	if '-mno-avx512f' not in machine_args
+		if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
+			cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl']
+			cflags += ['-DCC_AVX512_SUPPORT']
+			if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
+				cflags += '-DVIRTIO_GCC_UNROLL_PRAGMA'
+			elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
+				cflags += '-DVIRTIO_CLANG_UNROLL_PRAGMA'
+			elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
+				cflags += '-DVIRTIO_ICC_UNROLL_PRAGMA'
+			endif
+			sources += files('virtio_rxtx_packed_avx.c')
+		endif
+	endif
 	sources += files('virtio_rxtx_simple_sse.c')
 elif arch_subdir == 'ppc'
 	sources += files('virtio_rxtx_simple_altivec.c')
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index febaf17a8..5c112cac7 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts,
 uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 84f4cf946..7b65d0b0a 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -1246,7 +1246,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
 	return 0;
 }
 
-#define VIRTIO_MBUF_BURST_SZ 64
 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc))
 uint16_t
 virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
@@ -2329,3 +2328,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
 
 	return nb_tx;
 }
+
+__rte_weak uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
+			    struct rte_mbuf **rx_pkts __rte_unused,
+			    uint16_t nb_pkts __rte_unused)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
new file mode 100644
index 000000000..3380f1da5
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -0,0 +1,374 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <rte_net.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtio_pci.h"
+#include "virtqueue.h"
+
+#define BYTE_SIZE 8
+/* flag bits offset in packed ring desc higher 64bits */
+#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
+	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
+
+#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \
+	FLAGS_BITS_OFFSET)
+
+#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
+	sizeof(struct vring_packed_desc))
+#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
+
+#ifdef VIRTIO_GCC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_ICC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifndef virtio_for_each_try_unroll
+#define virtio_for_each_try_unroll(iter, val, num) \
+	for (iter = val; iter < num; iter++)
+#endif
+
+
+static inline void
+virtio_update_batch_stats(struct virtnet_stats *stats,
+			  uint16_t pkt_len1,
+			  uint16_t pkt_len2,
+			  uint16_t pkt_len3,
+			  uint16_t pkt_len4)
+{
+	stats->bytes += pkt_len1;
+	stats->bytes += pkt_len2;
+	stats->bytes += pkt_len3;
+	stats->bytes += pkt_len4;
+}
+/* Optionally fill offload information in structure */
+static inline int
+virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
+{
+	struct rte_net_hdr_lens hdr_lens;
+	uint32_t hdrlen, ptype;
+	int l4_supported = 0;
+
+	/* nothing to do */
+	if (hdr->flags == 0)
+		return 0;
+
+	/* GSO not supported in vec path, skip check */
+	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
+
+	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
+	m->packet_type = ptype;
+	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
+		l4_supported = 1;
+
+	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
+		if (hdr->csum_start <= hdrlen && l4_supported) {
+			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
+		} else {
+			/* Unknown proto or tunnel, do sw cksum. We can assume
+			 * the cksum field is in the first segment since the
+			 * buffers we provided to the host are large enough.
+			 * In case of SCTP, this will be wrong since it's a CRC
+			 * but there's nothing we can do.
+			 */
+			uint16_t csum = 0, off;
+
+			rte_raw_cksum_mbuf(m, hdr->csum_start,
+				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
+				&csum);
+			if (likely(csum != 0xffff))
+				csum = ~csum;
+			off = hdr->csum_offset + hdr->csum_start;
+			if (rte_pktmbuf_data_len(m) >= off + 1)
+				*rte_pktmbuf_mtod_offset(m, uint16_t *,
+					off) = csum;
+		}
+	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
+		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
+	}
+
+	return 0;
+}
+
+static inline uint16_t
+virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq,
+				   struct rte_mbuf **rx_pkts)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint64_t addrs[PACKED_BATCH_SIZE];
+	uint16_t id = vq->vq_used_cons_idx;
+	uint8_t desc_stats;
+	uint16_t i;
+	void *desc_addr;
+
+	if (id & PACKED_BATCH_MASK)
+		return -1;
+
+	if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries))
+		return -1;
+
+	/* only care avail/used bits */
+	__m512i v_mask = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+	desc_addr = &vq->vq_packed.ring.desc[id];
+
+	__m512i v_desc = _mm512_loadu_si512(desc_addr);
+	__m512i v_flag = _mm512_and_epi64(v_desc, v_mask);
+
+	__m512i v_used_flag = _mm512_setzero_si512();
+	if (vq->vq_packed.used_wrap_counter)
+		v_used_flag = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+
+	/* Check all descs are used */
+	desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag);
+	if (desc_stats)
+		return -1;
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie;
+		rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *));
+
+		addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1;
+	}
+
+	/*
+	 * load len from desc, store into mbuf pkt_len and data_len
+	 * len is limited by 16bit buf_len, so pkt_len[16:31] can be ignored
+	 */
+	const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12;
+	__m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, 0xAA);
+
+	/* reduce hdr_len from pkt_len and data_len */
+	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask,
+			(uint32_t)-hdr_size);
+
+	__m512i v_value = _mm512_add_epi32(values, mbuf_len_offset);
+
+	/* assert offset of data_len */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+		offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+
+	__m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3],
+					   addrs[2] + 8, addrs[2],
+					   addrs[1] + 8, addrs[1],
+					   addrs[0] + 8, addrs[0]);
+	/* batch store into mbufs */
+	_mm512_i64scatter_epi64(0, v_index, v_value, 1);
+
+	if (hw->has_rx_offload) {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			char *addr = (char *)rx_pkts[i]->buf_addr +
+				RTE_PKTMBUF_HEADROOM - hdr_size;
+			virtio_vec_rx_offload(rx_pkts[i],
+					(struct virtio_net_hdr *)addr);
+		}
+	}
+
+	virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len,
+			rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len,
+			rx_pkts[3]->pkt_len);
+
+	vq->vq_free_cnt += PACKED_BATCH_SIZE;
+
+	vq->vq_used_cons_idx += PACKED_BATCH_SIZE;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static uint16_t
+virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq,
+				    struct rte_mbuf **rx_pkts)
+{
+	uint16_t used_idx, id;
+	uint32_t len;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint32_t hdr_size = hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	struct vring_packed_desc *desc;
+	struct rte_mbuf *cookie;
+
+	desc = vq->vq_packed.ring.desc;
+	used_idx = vq->vq_used_cons_idx;
+	if (!desc_is_used(&desc[used_idx], vq))
+		return -1;
+
+	len = desc[used_idx].len;
+	id = desc[used_idx].id;
+	cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie;
+	if (unlikely(cookie == NULL)) {
+		PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u",
+				vq->vq_used_cons_idx);
+		return -1;
+	}
+	rte_prefetch0(cookie);
+	rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
+
+	cookie->data_off = RTE_PKTMBUF_HEADROOM;
+	cookie->ol_flags = 0;
+	cookie->pkt_len = (uint32_t)(len - hdr_size);
+	cookie->data_len = (uint32_t)(len - hdr_size);
+
+	hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr +
+					RTE_PKTMBUF_HEADROOM - hdr_size);
+	if (hw->has_rx_offload)
+		virtio_vec_rx_offload(cookie, hdr);
+
+	*rx_pkts = cookie;
+
+	rxvq->stats.bytes += cookie->pkt_len;
+
+	vq->vq_free_cnt++;
+	vq->vq_used_cons_idx++;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static inline void
+virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq,
+			      struct rte_mbuf **cookie,
+			      uint16_t num)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc;
+	uint16_t flags = vq->vq_packed.cached_flags;
+	struct virtio_hw *hw = vq->hw;
+	struct vq_desc_extra *dxp;
+	uint16_t idx, i;
+	uint16_t batch_num, total_num = 0;
+	uint16_t head_idx = vq->vq_avail_idx;
+	uint16_t head_flag = vq->vq_packed.cached_flags;
+	uint64_t addr;
+
+	do {
+		idx = vq->vq_avail_idx;
+
+		batch_num = PACKED_BATCH_SIZE;
+		if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries))
+			batch_num = vq->vq_nentries - idx;
+		if (unlikely((total_num + batch_num) > num))
+			batch_num = num - total_num;
+
+		virtio_for_each_try_unroll(i, 0, batch_num) {
+			dxp = &vq->vq_descx[idx + i];
+			dxp->cookie = (void *)cookie[total_num + i];
+
+			addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) +
+				RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size;
+			start_dp[idx + i].addr = addr;
+			start_dp[idx + i].len = cookie[total_num + i]->buf_len
+				- RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
+			if (total_num || i) {
+				virtqueue_store_flags_packed(&start_dp[idx + i],
+						flags, hw->weak_barriers);
+			}
+		}
+
+		vq->vq_avail_idx += batch_num;
+		if (vq->vq_avail_idx >= vq->vq_nentries) {
+			vq->vq_avail_idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+			flags = vq->vq_packed.cached_flags;
+		}
+		total_num += batch_num;
+	} while (total_num < num);
+
+	virtqueue_store_flags_packed(&start_dp[head_idx], head_flag,
+				hw->weak_barriers);
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num);
+}
+
+uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue,
+			    struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct virtnet_rx *rxvq = rx_queue;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t num, nb_rx = 0;
+	uint32_t nb_enqueued = 0;
+	uint16_t free_cnt = vq->vq_free_thresh;
+
+	if (unlikely(hw->started == 0))
+		return nb_rx;
+
+	num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts);
+	if (likely(num > PACKED_BATCH_SIZE))
+		num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE);
+
+	while (num) {
+		if (!virtqueue_dequeue_batch_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx += PACKED_BATCH_SIZE;
+			num -= PACKED_BATCH_SIZE;
+			continue;
+		}
+		if (!virtqueue_dequeue_single_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx++;
+			num--;
+			continue;
+		}
+		break;
+	};
+
+	PMD_RX_LOG(DEBUG, "dequeue:%d", num);
+
+	rxvq->stats.packets += nb_rx;
+
+	if (likely(vq->vq_free_cnt >= free_cnt)) {
+		struct rte_mbuf *new_pkts[free_cnt];
+		if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts,
+						free_cnt) == 0)) {
+			virtio_recv_refill_packed_vec(rxvq, new_pkts,
+					free_cnt);
+			nb_enqueued += free_cnt;
+		} else {
+			struct rte_eth_dev *dev =
+				&rte_eth_devices[rxvq->port_id];
+			dev->data->rx_mbuf_alloc_failed += free_cnt;
+		}
+	}
+
+	if (likely(nb_enqueued)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_RX_LOG(DEBUG, "Notified");
+		}
+	}
+
+	return nb_rx;
+}
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 6301c56b2..43e305ecc 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -20,6 +20,7 @@ struct rte_mbuf;
 
 #define DEFAULT_RX_FREE_THRESH 32
 
+#define VIRTIO_MBUF_BURST_SZ 64
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
@@ -236,7 +237,8 @@ struct vq_desc_extra {
 	void *cookie;
 	uint16_t ndescs;
 	uint16_t next;
-};
+	uint8_t padding[4];
+} __rte_packed __rte_aligned(16);
 
 struct virtqueue {
 	struct virtio_hw  *hw; /**< virtio_hw structure pointer. */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v8 6/9] net/virtio: reuse packed ring xmit functions
  2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu
                     ` (4 preceding siblings ...)
  2020-04-23 12:31   ` [dpdk-dev] [PATCH v8 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu
@ 2020-04-23 12:31   ` Marvin Liu
  2020-04-23 12:31   ` [dpdk-dev] [PATCH v8 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-23 12:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Move the xmit offload and packed ring xmit enqueue functions to the
header file, so that they can be reused by the packed ring vectorized
Tx function.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 7b65d0b0a..cf18fe564 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq,
 	return i;
 }
 
-#ifndef DEFAULT_TX_FREE_THRESH
-#define DEFAULT_TX_FREE_THRESH 32
-#endif
-
 static void
 virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num)
 {
@@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m)
 }
 
 
-/* avoid write operation when necessary, to lessen cache issues */
-#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
-	if ((var) != (val))			\
-		(var) = (val);			\
-} while (0)
-
-#define virtqueue_clear_net_hdr(_hdr) do {		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0);		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0);	\
-} while (0)
-
-static inline void
-virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
-			struct rte_mbuf *cookie,
-			bool offload)
-{
-	if (offload) {
-		if (cookie->ol_flags & PKT_TX_TCP_SEG)
-			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
-
-		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
-		case PKT_TX_UDP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_udp_hdr,
-				dgram_cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		case PKT_TX_TCP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		default:
-			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
-			break;
-		}
 
-		/* TCP Segmentation Offload */
-		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
-			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
-				VIRTIO_NET_HDR_GSO_TCPV6 :
-				VIRTIO_NET_HDR_GSO_TCPV4;
-			hdr->gso_size = cookie->tso_segsz;
-			hdr->hdr_len =
-				cookie->l2_len +
-				cookie->l3_len +
-				cookie->l4_len;
-		} else {
-			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
-		}
-	}
-}
 
 static inline void
 virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq,
@@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq,
 	virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers);
 }
 
-static inline void
-virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
-			      uint16_t needed, int can_push, int in_order)
-{
-	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
-	struct vq_desc_extra *dxp;
-	struct virtqueue *vq = txvq->vq;
-	struct vring_packed_desc *start_dp, *head_dp;
-	uint16_t idx, id, head_idx, head_flags;
-	int16_t head_size = vq->hw->vtnet_hdr_size;
-	struct virtio_net_hdr *hdr;
-	uint16_t prev;
-	bool prepend_header = false;
-
-	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
-
-	dxp = &vq->vq_descx[id];
-	dxp->ndescs = needed;
-	dxp->cookie = cookie;
-
-	head_idx = vq->vq_avail_idx;
-	idx = head_idx;
-	prev = head_idx;
-	start_dp = vq->vq_packed.ring.desc;
-
-	head_dp = &vq->vq_packed.ring.desc[idx];
-	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-	head_flags |= vq->vq_packed.cached_flags;
-
-	if (can_push) {
-		/* prepend cannot fail, checked by caller */
-		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
-					      -head_size);
-		prepend_header = true;
-
-		/* if offload disabled, it is not zeroed below, do it now */
-		if (!vq->hw->has_tx_offload)
-			virtqueue_clear_net_hdr(hdr);
-	} else {
-		/* setup first tx ring slot to point to header
-		 * stored in reserved region.
-		 */
-		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
-			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
-		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
-		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	}
-
-	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
-
-	do {
-		uint16_t flags;
-
-		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
-		start_dp[idx].len  = cookie->data_len;
-		if (prepend_header) {
-			start_dp[idx].addr -= head_size;
-			start_dp[idx].len += head_size;
-			prepend_header = false;
-		}
-
-		if (likely(idx != head_idx)) {
-			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-			flags |= vq->vq_packed.cached_flags;
-			start_dp[idx].flags = flags;
-		}
-		prev = idx;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	} while ((cookie = cookie->next) != NULL);
-
-	start_dp[prev].id = id;
-
-	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
-	vq->vq_avail_idx = idx;
-
-	if (!in_order) {
-		vq->vq_desc_head_idx = dxp->next;
-		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
-			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
-	}
-
-	virtqueue_store_flags_packed(head_dp, head_flags,
-				     vq->hw->weak_barriers);
-}
-
 static inline void
 virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 			uint16_t needed, int use_indirect, int can_push,
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 43e305ecc..18ae34789 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,7 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_TX_FREE_THRESH 32
 #define DEFAULT_RX_FREE_THRESH 32
 
 #define VIRTIO_MBUF_BURST_SZ 64
@@ -562,4 +563,165 @@ virtqueue_notify(struct virtqueue *vq)
 #define VIRTQUEUE_DUMP(vq) do { } while (0)
 #endif
 
+/* avoid write operation when necessary, to lessen cache issues */
+#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
+	typeof(var) var_ = (var);		\
+	typeof(val) val_ = (val);		\
+	if ((var_) != (val_))			\
+		(var_) = (val_);		\
+} while (0)
+
+#define virtqueue_clear_net_hdr(hdr) do {		\
+	typeof(hdr) hdr_ = (hdr);			\
+	ASSIGN_UNLESS_EQUAL((hdr_)->csum_start, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->csum_offset, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->flags, 0);		\
+	ASSIGN_UNLESS_EQUAL((hdr_)->gso_type, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->gso_size, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->hdr_len, 0);	\
+} while (0)
+
+static inline void
+virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
+			struct rte_mbuf *cookie,
+			bool offload)
+{
+	if (offload) {
+		if (cookie->ol_flags & PKT_TX_TCP_SEG)
+			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
+
+		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
+		case PKT_TX_UDP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_udp_hdr,
+				dgram_cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		case PKT_TX_TCP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		default:
+			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
+			break;
+		}
+
+		/* TCP Segmentation Offload */
+		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
+			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
+				VIRTIO_NET_HDR_GSO_TCPV6 :
+				VIRTIO_NET_HDR_GSO_TCPV4;
+			hdr->gso_size = cookie->tso_segsz;
+			hdr->hdr_len =
+				cookie->l2_len +
+				cookie->l3_len +
+				cookie->l4_len;
+		} else {
+			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
+		}
+	}
+}
+
+static inline void
+virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
+			      uint16_t needed, int can_push, int in_order)
+{
+	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
+	struct vq_desc_extra *dxp;
+	struct virtqueue *vq = txvq->vq;
+	struct vring_packed_desc *start_dp, *head_dp;
+	uint16_t idx, id, head_idx, head_flags;
+	int16_t head_size = vq->hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	uint16_t prev;
+	bool prepend_header = false;
+
+	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
+
+	dxp = &vq->vq_descx[id];
+	dxp->ndescs = needed;
+	dxp->cookie = cookie;
+
+	head_idx = vq->vq_avail_idx;
+	idx = head_idx;
+	prev = head_idx;
+	start_dp = vq->vq_packed.ring.desc;
+
+	head_dp = &vq->vq_packed.ring.desc[idx];
+	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+	head_flags |= vq->vq_packed.cached_flags;
+
+	if (can_push) {
+		/* prepend cannot fail, checked by caller */
+		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
+					      -head_size);
+		prepend_header = true;
+
+		/* if offload disabled, it is not zeroed below, do it now */
+		if (!vq->hw->has_tx_offload)
+			virtqueue_clear_net_hdr(hdr);
+	} else {
+		/* setup first tx ring slot to point to header
+		 * stored in reserved region.
+		 */
+		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
+			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
+		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
+		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	}
+
+	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
+
+	do {
+		uint16_t flags;
+
+		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
+		start_dp[idx].len  = cookie->data_len;
+		if (prepend_header) {
+			start_dp[idx].addr -= head_size;
+			start_dp[idx].len += head_size;
+			prepend_header = false;
+		}
+
+		if (likely(idx != head_idx)) {
+			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+			flags |= vq->vq_packed.cached_flags;
+			start_dp[idx].flags = flags;
+		}
+		prev = idx;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	} while ((cookie = cookie->next) != NULL);
+
+	start_dp[prev].id = id;
+
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
+	vq->vq_avail_idx = idx;
+
+	if (!in_order) {
+		vq->vq_desc_head_idx = dxp->next;
+		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
+			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
+	}
+
+	virtqueue_store_flags_packed(head_dp, head_flags,
+				     vq->hw->weak_barriers);
+}
 #endif /* _VIRTQUEUE_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v8 7/9] net/virtio: add vectorized packed ring Tx path
  2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu
                     ` (5 preceding siblings ...)
  2020-04-23 12:31   ` [dpdk-dev] [PATCH v8 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu
@ 2020-04-23 12:31   ` Marvin Liu
  2020-04-23 12:31   ` [dpdk-dev] [PATCH v8 8/9] net/virtio: add election for vectorized path Marvin Liu
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-23 12:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Optimize the packed ring Tx path in the same way as the Rx path: split
it into batch and single Tx functions, with the batch function further
optimized by vector instructions.
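
The key vector trick the batch function relies on can be shown in a
condensed, standalone form: the refcnt and nb_segs fields of four
candidate mbufs are validated with a single masked 256-bit compare
over their rearm_data words, so the whole batch is accepted or
rejected in one step. The function name and parameters below are
illustrative, not the patch code itself.

/* Condensed sketch, needs AVX512VL/BW; names are illustrative. */
#include <stdint.h>
#include <immintrin.h>
#include <rte_mbuf.h>

static inline int
tx_batch_is_simple(struct rte_mbuf **pkts, uint64_t default_rearm_data)
{
	/* gather the rearm_data word (data_off/refcnt/nb_segs/port)
	 * of the four candidate mbufs into one 256-bit register
	 */
	__m256i v_rearm = _mm256_set_epi64x(*pkts[3]->rearm_data,
					    *pkts[2]->rearm_data,
					    *pkts[1]->rearm_data,
					    *pkts[0]->rearm_data);
	__m256i v_ref = _mm256_set1_epi64x(default_rearm_data);
	/* compare only the 16-bit words holding refcnt and nb_segs */
	const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12;
	/* a zero mask means refcnt == 1 and nb_segs == 1 for all four */
	return _mm256_mask_cmpneq_epu16_mask(mask, v_rearm, v_ref) == 0;
}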

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 5c112cac7..b7d52d497 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index cf18fe564..f82fe8d64 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
 {
 	return 0;
 }
+
+__rte_weak uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused,
+			    struct rte_mbuf **tx_pkts __rte_unused,
+			    uint16_t nb_pkts __rte_unused)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
index 3380f1da5..c023ace4e 100644
--- a/drivers/net/virtio/virtio_rxtx_packed_avx.c
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -23,6 +23,24 @@
 #define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \
 	FLAGS_BITS_OFFSET)
 
+/* reference count offset in mbuf rearm data */
+#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
+	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+/* segment number offset in mbuf rearm data */
+#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
+	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+
+/* default rearm data */
+#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
+	1ULL << REFCNT_BITS_OFFSET)
+
+/* id bits offset in packed ring desc higher 64bits */
+#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \
+	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
+
+/* net hdr short size mask */
+#define NET_HDR_MASK 0x3F
+
 #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
 	sizeof(struct vring_packed_desc))
 #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
@@ -47,6 +65,47 @@
 	for (iter = val; iter < num; iter++)
 #endif
 
+static inline void
+virtio_xmit_cleanup_packed_vec(struct virtqueue *vq)
+{
+	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
+	struct vq_desc_extra *dxp;
+	uint16_t used_idx, id, curr_id, free_cnt = 0;
+	uint16_t size = vq->vq_nentries;
+	struct rte_mbuf *mbufs[size];
+	uint16_t nb_mbuf = 0, i;
+
+	used_idx = vq->vq_used_cons_idx;
+
+	if (!desc_is_used(&desc[used_idx], vq))
+		return;
+
+	id = desc[used_idx].id;
+
+	do {
+		curr_id = used_idx;
+		dxp = &vq->vq_descx[used_idx];
+		used_idx += dxp->ndescs;
+		free_cnt += dxp->ndescs;
+
+		if (dxp->cookie != NULL) {
+			mbufs[nb_mbuf] = dxp->cookie;
+			dxp->cookie = NULL;
+			nb_mbuf++;
+		}
+
+		if (used_idx >= size) {
+			used_idx -= size;
+			vq->vq_packed.used_wrap_counter ^= 1;
+		}
+	} while (curr_id != id);
+
+	for (i = 0; i < nb_mbuf; i++)
+		rte_pktmbuf_free(mbufs[i]);
+
+	vq->vq_used_cons_idx = used_idx;
+	vq->vq_free_cnt += free_cnt;
+}
 
 static inline void
 virtio_update_batch_stats(struct virtnet_stats *stats,
@@ -60,6 +119,238 @@ virtio_update_batch_stats(struct virtnet_stats *stats,
 	stats->bytes += pkt_len3;
 	stats->bytes += pkt_len4;
 }
+
+static inline int
+virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq,
+				   struct rte_mbuf **tx_pkts)
+{
+	struct virtqueue *vq = txvq->vq;
+	uint16_t head_size = vq->hw->vtnet_hdr_size;
+	uint16_t idx = vq->vq_avail_idx;
+	struct virtio_net_hdr *hdr;
+	uint16_t i, cmp;
+
+	if (vq->vq_avail_idx & PACKED_BATCH_MASK)
+		return -1;
+
+	if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries))
+		return -1;
+
+	/* Load four mbufs rearm data */
+	RTE_BUILD_BUG_ON(REFCNT_BITS_OFFSET >= 64);
+	RTE_BUILD_BUG_ON(SEG_NUM_BITS_OFFSET >= 64);
+	__m256i mbufs = _mm256_set_epi64x(*tx_pkts[3]->rearm_data,
+					  *tx_pkts[2]->rearm_data,
+					  *tx_pkts[1]->rearm_data,
+					  *tx_pkts[0]->rearm_data);
+
+	/* refcnt=1 and nb_segs=1 */
+	__m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
+	__m256i head_rooms = _mm256_set1_epi16(head_size);
+
+	/* Check refcnt and nb_segs */
+	const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12;
+	cmp = _mm256_mask_cmpneq_epu16_mask(mask, mbufs, mbuf_ref);
+	if (unlikely(cmp))
+		return -1;
+
+	/* Check headroom is enough */
+	const __mmask16 data_mask = 0x1 | 0x1 << 4 | 0x1 << 8 | 0x1 << 12;
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_off) !=
+		offsetof(struct rte_mbuf, rearm_data));
+	cmp = _mm256_mask_cmplt_epu16_mask(data_mask, mbufs, head_rooms);
+	if (unlikely(cmp))
+		return -1;
+
+	__m512i v_descx = _mm512_set_epi64(0x1, (uint64_t)tx_pkts[3],
+					   0x1, (uint64_t)tx_pkts[2],
+					   0x1, (uint64_t)tx_pkts[1],
+					   0x1, (uint64_t)tx_pkts[0]);
+
+	_mm512_storeu_si512((void *)&vq->vq_descx[idx], v_descx);
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		tx_pkts[i]->data_off -= head_size;
+		tx_pkts[i]->data_len += head_size;
+	}
+
+#ifdef RTE_VIRTIO_USER
+	__m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])),
+			tx_pkts[2]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])),
+			tx_pkts[1]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])),
+			tx_pkts[0]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0])));
+#else
+	__m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len,
+					      tx_pkts[3]->buf_iova,
+					      tx_pkts[2]->data_len,
+					      tx_pkts[2]->buf_iova,
+					      tx_pkts[1]->data_len,
+					      tx_pkts[1]->buf_iova,
+					      tx_pkts[0]->data_len,
+					      tx_pkts[0]->buf_iova);
+#endif
+
+	/* id offset and data offset */
+	__m512i data_offsets = _mm512_set_epi64((uint64_t)3 << ID_BITS_OFFSET,
+						tx_pkts[3]->data_off,
+						(uint64_t)2 << ID_BITS_OFFSET,
+						tx_pkts[2]->data_off,
+						(uint64_t)1 << ID_BITS_OFFSET,
+						tx_pkts[1]->data_off,
+						0, tx_pkts[0]->data_off);
+
+	__m512i new_descs = _mm512_add_epi64(descs_base, data_offsets);
+
+	uint64_t flags_temp = (uint64_t)idx << ID_BITS_OFFSET |
+		(uint64_t)vq->vq_packed.cached_flags << FLAGS_BITS_OFFSET;
+
+	/* flags offset and guest virtual address offset */
+#ifdef RTE_VIRTIO_USER
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset);
+#else
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, 0);
+#endif
+	__m512i v_offset = _mm512_broadcast_i32x4(flag_offset);
+
+	__m512i v_desc = _mm512_add_epi64(new_descs, v_offset);
+
+	if (!vq->hw->has_tx_offload) {
+		__m128i all_mask = _mm_set1_epi16(0xFFFF);
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			__m128i v_hdr = _mm_loadu_si128((void *)hdr);
+			if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK,
+							v_hdr, all_mask))) {
+				__m128i all_zero = _mm_setzero_si128();
+				_mm_mask_storeu_epi16((void *)hdr,
+						NET_HDR_MASK, all_zero);
+			}
+		}
+	} else {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			virtqueue_xmit_offload(hdr, tx_pkts[i], true);
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	_mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], v_desc);
+
+	virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len,
+			tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len,
+			tx_pkts[3]->pkt_len);
+
+	vq->vq_avail_idx += PACKED_BATCH_SIZE;
+	vq->vq_free_cnt -= PACKED_BATCH_SIZE;
+
+	if (vq->vq_avail_idx >= vq->vq_nentries) {
+		vq->vq_avail_idx -= vq->vq_nentries;
+		vq->vq_packed.cached_flags ^=
+			VRING_PACKED_DESC_F_AVAIL_USED;
+	}
+
+	return 0;
+}
+
+static inline int
+virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq,
+				    struct rte_mbuf *txm)
+{
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint16_t slots, can_push;
+	int16_t need;
+
+	/* How many main ring entries are needed for this Tx?
+	 * any_layout => number of segments
+	 * default    => number of segments + 1
+	 */
+	can_push = rte_mbuf_refcnt_read(txm) == 1 &&
+		   RTE_MBUF_DIRECT(txm) &&
+		   txm->nb_segs == 1 &&
+		   rte_pktmbuf_headroom(txm) >= hdr_size;
+
+	slots = txm->nb_segs + !can_push;
+	need = slots - vq->vq_free_cnt;
+
+	/* A positive value indicates it needs free vring descriptors */
+	if (unlikely(need > 0)) {
+		virtio_xmit_cleanup_packed_vec(vq);
+		need = slots - vq->vq_free_cnt;
+		if (unlikely(need > 0)) {
+			PMD_TX_LOG(ERR,
+				   "No free tx descriptors to transmit");
+			return -1;
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1);
+
+	txvq->stats.bytes += txm->pkt_len;
+	return 0;
+}
+
+uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			uint16_t nb_pkts)
+{
+	struct virtnet_tx *txvq = tx_queue;
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t nb_tx = 0;
+	uint16_t remained;
+
+	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
+		return nb_tx;
+
+	if (unlikely(nb_pkts < 1))
+		return nb_pkts;
+
+	PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+	if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh)
+		virtio_xmit_cleanup_packed_vec(vq);
+
+	remained = RTE_MIN(nb_pkts, vq->vq_free_cnt);
+
+	while (remained) {
+		if (remained >= PACKED_BATCH_SIZE) {
+			if (!virtqueue_enqueue_batch_packed_vec(txvq,
+						&tx_pkts[nb_tx])) {
+				nb_tx += PACKED_BATCH_SIZE;
+				remained -= PACKED_BATCH_SIZE;
+				continue;
+			}
+		}
+		if (!virtqueue_enqueue_single_packed_vec(txvq,
+					tx_pkts[nb_tx])) {
+			nb_tx++;
+			remained--;
+			continue;
+		}
+		break;
+	};
+
+	txvq->stats.packets += nb_tx;
+
+	if (likely(nb_tx)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_TX_LOG(DEBUG, "Notified backend after xmit");
+		}
+	}
+
+	return nb_tx;
+}
+
 /* Optionally fill offload information in structure */
 static inline int
 virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v8 8/9] net/virtio: add election for vectorized path
  2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu
                     ` (6 preceding siblings ...)
  2020-04-23 12:31   ` [dpdk-dev] [PATCH v8 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu
@ 2020-04-23 12:31   ` Marvin Liu
  2020-04-23 12:31   ` [dpdk-dev] [PATCH v8 9/9] doc: add packed " Marvin Liu
  2020-04-23 15:17   ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Wang, Yinan
  9 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-23 12:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Rewrite the vectorized path selection logic. The default setting comes
from the RTE_LIBRTE_VIRTIO_INC_VECTOR option. Path selection criteria
are checked as listed below; a condensed sketch of the packed ring
check follows the two lists.

Packed ring vectorized path will be selected when:
    vectorized option is enabled
    AVX512F and required extensions are supported by compiler and host
    virtio VERSION_1 and IN_ORDER features are negotiated
    virtio mergeable feature is not negotiated
    LRO offloading is disabled

Split ring vectorized rx path will be selected when:
    vectorized option is enabled
    virtio mergeable and IN_ORDER features are not negotiated
    LRO, chksum and vlan strip offloading are disabled
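
For clarity, the packed ring part of the election can be restated as a
tiny standalone helper. The struct, field and function names below are
made up for illustration; the authoritative logic is the
virtio_dev_configure() hunk in this patch.

/* Rough standalone restatement of the packed ring election. */
#include <stdbool.h>

struct vec_caps {
	bool avx512f;      /* AVX512F + required extensions available */
	bool version_1;    /* VIRTIO_F_VERSION_1 negotiated */
	bool in_order;     /* VIRTIO_F_IN_ORDER negotiated */
	bool mrg_rxbuf;    /* VIRTIO_NET_F_MRG_RXBUF negotiated */
	bool lro_offload;  /* DEV_RX_OFFLOAD_TCP_LRO requested */
};

static void
elect_packed_vec(const struct vec_caps *c, bool *use_vec_rx, bool *use_vec_tx)
{
	/* both Rx and Tx need AVX512F, VERSION_1 and IN_ORDER */
	if (!c->avx512f || !c->version_1 || !c->in_order) {
		*use_vec_rx = false;
		*use_vec_tx = false;
		return;
	}
	/* Rx additionally requires no mergeable buffers and no LRO;
	 * Tx can stay vectorized even when Rx falls back
	 */
	if (c->mrg_rxbuf || c->lro_offload)
		*use_vec_rx = false;
}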

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 361c834a9..c700af6be 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1522,9 +1522,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	if (vtpci_packed_queue(hw)) {
 		PMD_INIT_LOG(INFO,
 			"virtio: using packed ring %s Tx path on port %u",
-			hw->use_inorder_tx ? "inorder" : "standard",
+			hw->use_vec_tx ? "vectorized" : "standard",
 			eth_dev->data->port_id);
-		eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
+		if (hw->use_vec_tx)
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec;
+		else
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
 	} else {
 		if (hw->use_inorder_tx) {
 			PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u",
@@ -1538,7 +1541,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+		if (hw->use_vec_rx) {
+			PMD_INIT_LOG(INFO,
+				"virtio: using packed ring vectorized Rx path on port %u",
+				eth_dev->data->port_id);
+			eth_dev->rx_pkt_burst =
+				&virtio_recv_pkts_packed_vec;
+		} else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
 			PMD_INIT_LOG(INFO,
 				"virtio: using packed ring mergeable buffer Rx path on port %u",
 				eth_dev->data->port_id);
@@ -1950,6 +1959,10 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 		goto err_virtio_init;
 
 	hw->opened = true;
+#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR
+	hw->use_vec_rx = 1;
+	hw->use_vec_tx = 1;
+#endif
 
 	return 0;
 
@@ -2257,33 +2270,63 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 			return -EBUSY;
 		}
 
-	hw->use_vec_rx = 1;
+	if (vtpci_packed_queue(hw)) {
+#if defined RTE_ARCH_X86
+		if ((hw->use_vec_rx || hw->use_vec_tx) &&
+		    (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
+		     !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) ||
+		     !vtpci_with_feature(hw, VIRTIO_F_VERSION_1))) {
+			PMD_DRV_LOG(INFO,
+				"disabled packed ring vectorization as requirements are not met");
+			hw->use_vec_rx = 0;
+			hw->use_vec_tx = 0;
+		}
+#endif
 
-	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
-		hw->use_inorder_tx = 1;
-		hw->use_inorder_rx = 1;
-		hw->use_vec_rx = 0;
-	}
+		if (hw->use_vec_rx) {
+			if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+				PMD_DRV_LOG(INFO,
+					"disabled packed ring vectorized rx for mrg_rxbuf enabled");
+				hw->use_vec_rx = 0;
+			}
 
-	if (vtpci_packed_queue(hw)) {
-		hw->use_vec_rx = 0;
-		hw->use_inorder_rx = 0;
-	}
+			if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) {
+				PMD_DRV_LOG(INFO,
+					"disabled packed ring vectorized rx for TCP_LRO enabled");
+				hw->use_vec_rx = 0;
+			}
+		}
+	} else {
+		if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
+			hw->use_inorder_tx = 1;
+			hw->use_inorder_rx = 1;
+			hw->use_vec_rx = 0;
+		}
 
+		if (hw->use_vec_rx) {
 #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
-	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
-		hw->use_vec_rx = 0;
-	}
+			if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorization as requirements are not met");
+				hw->use_vec_rx = 0;
+			}
 #endif
-	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
-		hw->use_vec_rx = 0;
-	}
+			if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized rx for mrg_rxbuf enabled");
+				hw->use_vec_rx = 0;
+			}
 
-	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
-			   DEV_RX_OFFLOAD_TCP_CKSUM |
-			   DEV_RX_OFFLOAD_TCP_LRO |
-			   DEV_RX_OFFLOAD_VLAN_STRIP))
-		hw->use_vec_rx = 0;
+			if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
+					   DEV_RX_OFFLOAD_TCP_CKSUM |
+					   DEV_RX_OFFLOAD_TCP_LRO |
+					   DEV_RX_OFFLOAD_VLAN_STRIP)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized rx for offloading enabled");
+				hw->use_vec_rx = 0;
+			}
+		}
+	}
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v8 9/9] doc: add packed vectorized path
  2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu
                     ` (7 preceding siblings ...)
  2020-04-23 12:31   ` [dpdk-dev] [PATCH v8 8/9] net/virtio: add election for vectorized path Marvin Liu
@ 2020-04-23 12:31   ` Marvin Liu
  2020-04-23 15:17   ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Wang, Yinan
  9 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-23 12:31 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: harry.van.haaren, dev, Marvin Liu

Document packed virtqueue vectorized path selection logic in virtio net
PMD.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index 6286286db..4bd46f83e 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -417,6 +417,10 @@ Below devargs are supported by the virtio-user vdev:
     rte_eth_link_get_nowait function.
     (Default: 10000 (10G))
 
+#.  ``vectorized``:
+
+    It is used to enable virtio device vectorized path.
+    (Default: 0 (disabled))
 
 Virtio paths Selection and Usage
 --------------------------------
@@ -469,6 +473,13 @@ according to below configuration:
    both negotiated, this path will be selected.
 #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and
    Rx mergeable is not negotiated, this path will be selected.
+#. Packed virtqueue vectorized Rx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated &&
+   TCP_LRO Rx offloading is disabled && vectorized option enabled,
+   this path will be selected.
+#. Packed virtqueue vectorized Tx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && vectorized option enabled,
+   this path will be selected.
 
 Rx/Tx callbacks of each Virtio path
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -491,6 +502,8 @@ are shown in below table:
    Packed virtqueue non-meregable path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed
    Packed virtqueue in-order mergeable path     virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed
    Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed           virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Rx path          virtio_recv_pkts_packed_vec       virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Tx path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed_vec
    ============================================ ================================= ========================
 
 Virtio paths Support Status from Release to Release
@@ -508,20 +521,22 @@ All virtio paths support status are shown in below table:
 
 .. table:: Virtio Paths and Releases
 
-   ============================================ ============= ============= =============
-                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11
-   ============================================ ============= ============= =============
-   Split virtqueue mergeable path                     Y             Y             Y
-   Split virtqueue non-mergeable path                 Y             Y             Y
-   Split virtqueue vectorized Rx path                 Y             Y             Y
-   Split virtqueue simple Tx path                     Y             N             N
-   Split virtqueue in-order mergeable path                          Y             Y
-   Split virtqueue in-order non-mergeable path                      Y             Y
-   Packed virtqueue mergeable path                                                Y
-   Packed virtqueue non-mergeable path                                            Y
-   Packed virtqueue in-order mergeable path                                       Y
-   Packed virtqueue in-order non-mergeable path                                   Y
-   ============================================ ============= ============= =============
+   ============================================ ============= ============= ============= =======
+                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~
+   ============================================ ============= ============= ============= =======
+   Split virtqueue mergeable path                     Y             Y             Y          Y
+   Split virtqueue non-mergeable path                 Y             Y             Y          Y
+   Split virtqueue vectorized Rx path                 Y             Y             Y          Y
+   Split virtqueue simple Tx path                     Y             N             N          N
+   Split virtqueue in-order mergeable path                          Y             Y          Y
+   Split virtqueue in-order non-mergeable path                      Y             Y          Y
+   Packed virtqueue mergeable path                                                Y          Y
+   Packed virtqueue non-mergeable path                                            Y          Y
+   Packed virtqueue in-order mergeable path                                       Y          Y
+   Packed virtqueue in-order non-mergeable path                                   Y          Y
+   Packed virtqueue vectorized Rx path                                                       Y
+   Packed virtqueue vectorized Tx path                                                       Y
+   ============================================ ============= ============= ============= =======
 
 QEMU Support Status
 ~~~~~~~~~~~~~~~~~~~
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v8 0/9] add packed ring vectorized path
  2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu
                     ` (8 preceding siblings ...)
  2020-04-23 12:31   ` [dpdk-dev] [PATCH v8 9/9] doc: add packed " Marvin Liu
@ 2020-04-23 15:17   ` Wang, Yinan
  9 siblings, 0 replies; 162+ messages in thread
From: Wang, Yinan @ 2020-04-23 15:17 UTC (permalink / raw)
  To: Liu, Yong, maxime.coquelin, Ye, Xiaolong, Wang, Zhihong
  Cc: Van Haaren, Harry, dev, Liu, Yong

Tested-by: Wang, Yinan <yinan.wang@intel.com>

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Marvin Liu
> Sent: April 23, 2020 20:31
> To: maxime.coquelin@redhat.com; Ye, Xiaolong <xiaolong.ye@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org; Liu,
> Yong <yong.liu@intel.com>
> Subject: [dpdk-dev] [PATCH v8 0/9] add packed ring vectorized path
> 
> This patch set introduced vectorized path for packed ring.
> 
> The size of packed ring descriptor is 16Bytes. Four batched descriptors are
> just placed into one cacheline. AVX512 instructions can well handle this kind
> of data. Packed ring TX path can fully transformed into vectorized path.
> Packed ring Rx path can be vectorized when requirements met(LRO and
> mergeable disabled).
> 
> New option RTE_LIBRTE_VIRTIO_INC_VECTOR will be introduced in this patch
> set. This option will unify split and packed ring vectorized path default setting.
> Meanwhile user can specify whether enable vectorized path at runtime by
> 'vectorized' parameter of virtio user vdev.
> 
> v8:
> * fix meson build error on ubuntu16.04 and suse15
> 
> v7:
> * default vectorization is disabled
> * compilation time check dependency on rte_mbuf structure
> * offsets are calcuated when compiling
> * remove useless barrier as descs are batched store&load
> * vindex of scatter is directly set
> * some comments updates
> * enable vectorized path in meson build
> 
> v6:
> * fix issue when size not power of 2
> 
> v5:
> * remove cpuflags definition as required extensions always come with
>   AVX512F on x86_64
> * inorder actions should depend on feature bit
> * check ring type in rx queue setup
> * rewrite some commit logs
> * fix some checkpatch warnings
> 
> v4:
> * rename 'packed_vec' to 'vectorized', also used in split ring
> * add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev
> * check required AVX512 extensions cpuflags
> * combine split and packed ring datapath selection logic
> * remove limitation that size must power of two
> * clear 12Bytes virtio_net_hdr
> 
> v3:
> * remove virtio_net_hdr array for better performance
> * disable 'packed_vec' by default
> 
> v2:
> * more function blocks replaced by vector instructions
> * clean virtio_net_hdr by vector instruction
> * allow header room size change
> * add 'packed_vec' option in virtio_user vdev
> * fix build not check whether AVX512 enabled
> * doc update
> 
> 
> Marvin Liu (9):
>   net/virtio: add Rx free threshold setting
>   net/virtio: enable vectorized path
>   net/virtio: inorder should depend on feature bit
>   net/virtio-user: add vectorized path parameter
>   net/virtio: add vectorized packed ring Rx path
>   net/virtio: reuse packed ring xmit functions
>   net/virtio: add vectorized packed ring Tx path
>   net/virtio: add election for vectorized path
>   doc: add packed vectorized path
> 
>  config/common_base                          |   1 +
>  doc/guides/nics/virtio.rst                  |  43 +-
>  drivers/net/virtio/Makefile                 |  37 ++
>  drivers/net/virtio/meson.build              |  15 +
>  drivers/net/virtio/virtio_ethdev.c          |  95 ++-
>  drivers/net/virtio/virtio_ethdev.h          |   6 +
>  drivers/net/virtio/virtio_pci.h             |   3 +-
>  drivers/net/virtio/virtio_rxtx.c            | 212 ++-----
>  drivers/net/virtio/virtio_rxtx_packed_avx.c | 665 ++++++++++++++++++++
>  drivers/net/virtio/virtio_user_ethdev.c     |  37 +-
>  drivers/net/virtio/virtqueue.c              |   7 +-
>  drivers/net/virtio/virtqueue.h              | 168 ++++-
>  12 files changed, 1075 insertions(+), 214 deletions(-)
>  create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c
> 
> --
> 2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v9 0/9] add packed ring vectorized path
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
                   ` (13 preceding siblings ...)
  2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu
@ 2020-04-24  9:24 ` Marvin Liu
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 1/9] net/virtio: add Rx free threshold setting Marvin Liu
                     ` (8 more replies)
  2020-04-26  2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu
                   ` (2 subsequent siblings)
  17 siblings, 9 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-24  9:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: dev, harry.van.haaren, Marvin Liu

This patch set introduces a vectorized path for packed ring.

The size of a packed ring descriptor is 16 bytes, so four batched
descriptors fit exactly into one cacheline, which AVX512 instructions
can handle well. The packed ring Tx path can be fully transformed into
a vectorized path. The packed ring Rx path can be vectorized when the
requirements are met (LRO and mergeable buffers disabled).

Since v9 the RTE_LIBRTE_VIRTIO_INC_VECTOR build option used in earlier
revisions is replaced by a 'vectorized' devarg, which covers both the
split and packed ring vectorized paths. The user specifies at runtime
whether to enable the vectorized path through this parameter of the
virtio and virtio-user devices.
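
For illustration, a testpmd invocation exercising the packed ring
vectorized path could look roughly like the command below. The binary
location, socket path, core/memory options and the other devargs
(packed_vq, in_order, mrg_rxbuf) are assumptions for the example and
not part of this series; only the 'vectorized=1' parameter is what the
series adds.

./x86_64-native-linux-gcc/app/testpmd -l 1-2 -n 4 --no-pci \
    --vdev 'net_virtio_user0,mac=00:01:02:03:04:05,path=/tmp/vhost.sock,packed_vq=1,in_order=1,mrg_rxbuf=0,vectorized=1' \
    -- -i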

v9:
* replace RTE_LIBRTE_VIRTIO_INC_VECTOR with vectorized devarg
* reorder patch sequence

v8:
* fix meson build error on ubuntu16.04 and suse15

v7:
* default vectorization is disabled
* compilation time check dependency on rte_mbuf structure
* offsets are calculated when compiling
* remove useless barrier as descs are batched store&load
* vindex of scatter is directly set
* some comments updates
* enable vectorized path in meson build

v6:
* fix issue when size not power of 2

v5:
* remove cpuflags definition as required extensions always come with
  AVX512F on x86_64
* inorder actions should depend on feature bit
* check ring type in rx queue setup
* rewrite some commit logs
* fix some checkpatch warnings

v4:
* rename 'packed_vec' to 'vectorized', also used in split ring
* add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev
* check required AVX512 extensions cpuflags
* combine split and packed ring datapath selection logic
* remove limitation that size must power of two
* clear 12Bytes virtio_net_hdr

v3:
* remove virtio_net_hdr array for better performance
* disable 'packed_vec' by default

v2:
* more function blocks replaced by vector instructions
* clean virtio_net_hdr by vector instruction
* allow header room size change
* add 'packed_vec' option in virtio_user vdev 
* fix build not check whether AVX512 enabled
* doc update

Tested-by: Wang, Yinan <yinan.wang@intel.com>

Marvin Liu (9):
  net/virtio: add Rx free threshold setting
  net/virtio: inorder should depend on feature bit
  net/virtio: add vectorized devarg
  net/virtio-user: add vectorized devarg
  net/virtio: add vectorized packed ring Rx path
  net/virtio: reuse packed ring xmit functions
  net/virtio: add vectorized packed ring Tx path
  net/virtio: add election for vectorized path
  doc: add packed vectorized path

 doc/guides/nics/virtio.rst                  |  52 +-
 drivers/net/virtio/Makefile                 |  35 ++
 drivers/net/virtio/meson.build              |  14 +
 drivers/net/virtio/virtio_ethdev.c          | 136 +++-
 drivers/net/virtio/virtio_ethdev.h          |   6 +
 drivers/net/virtio/virtio_pci.h             |   3 +-
 drivers/net/virtio/virtio_rxtx.c            | 212 ++-----
 drivers/net/virtio/virtio_rxtx_packed_avx.c | 665 ++++++++++++++++++++
 drivers/net/virtio/virtio_user_ethdev.c     |  32 +-
 drivers/net/virtio/virtqueue.c              |   7 +-
 drivers/net/virtio/virtqueue.h              | 168 ++++-
 11 files changed, 1112 insertions(+), 218 deletions(-)
 create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v9 1/9] net/virtio: add Rx free threshold setting
  2020-04-24  9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu
@ 2020-04-24  9:24   ` Marvin Liu
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 2/9] net/virtio: inorder should depend on feature bit Marvin Liu
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-24  9:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: dev, harry.van.haaren, Marvin Liu

Introduce a free threshold setting in the Rx queue; its default value
is 32. Limit the threshold to a multiple of four, as only the
vectorized packed Rx function will utilize it. The virtio driver will
rearm the Rx queue once more than rx_free_thresh descs have been
dequeued.
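
From the application side the threshold is passed through the standard
rte_eth_rxconf structure. A minimal sketch is below, assuming an
already probed port_id, a ring size nb_rxd and an initialized
mbuf_pool; it is not part of the patch itself.

#include <rte_ethdev.h>

/* Request a 64-descriptor Rx free threshold. With this patch, values
 * that are not a multiple of four, or that are not smaller than the
 * ring size, are rejected with -EINVAL.
 */
struct rte_eth_dev_info dev_info;
struct rte_eth_rxconf rxconf;
int ret;

rte_eth_dev_info_get(port_id, &dev_info);
rxconf = dev_info.default_rxconf;
rxconf.rx_free_thresh = 64;
ret = rte_eth_rx_queue_setup(port_id, 0 /* queue id */, nb_rxd,
			     rte_eth_dev_socket_id(port_id),
			     &rxconf, mbuf_pool);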

Signed-off-by: Marvin Liu <yong.liu@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 060410577..94ba7a3ec 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 	struct virtio_hw *hw = dev->data->dev_private;
 	struct virtqueue *vq = hw->vqs[vtpci_queue_idx];
 	struct virtnet_rx *rxvq;
+	uint16_t rx_free_thresh;
 
 	PMD_INIT_FUNC_TRACE();
 
@@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 		return -EINVAL;
 	}
 
+	rx_free_thresh = rx_conf->rx_free_thresh;
+	if (rx_free_thresh == 0)
+		rx_free_thresh =
+			RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH);
+
+	if (rx_free_thresh & 0x3) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+
+	if (rx_free_thresh >= vq->vq_nentries) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the "
+			"number of RX entries (%u)."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			vq->vq_nentries,
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+	vq->vq_free_thresh = rx_free_thresh;
+
 	if (nb_desc == 0 || nb_desc > vq->vq_nentries)
 		nb_desc = vq->vq_nentries;
 	vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc);
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 58ad7309a..6301c56b2 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,8 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_RX_FREE_THRESH 32
+
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v9 2/9] net/virtio: inorder should depend on feature bit
  2020-04-24  9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 1/9] net/virtio: add Rx free threshold setting Marvin Liu
@ 2020-04-24  9:24   ` Marvin Liu
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 3/9] net/virtio: add vectorized devarg Marvin Liu
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-24  9:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: dev, harry.van.haaren, Marvin Liu

Ring initialization differs when the in-order feature is negotiated.
This behaviour should depend on the negotiated feature bits.
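
For illustration only, the intent can be summarized as the sketch below; it
reuses the existing driver-internal helpers seen in the diff and assumes
"virtqueue.h" is in scope, the wrapper function itself is hypothetical:

/* Derive the in-order Rx setup decision from the negotiated feature bit
 * instead of the cached hw->use_inorder_rx flag, and restrict it to
 * split rings.
 */
static inline bool
rxq_use_inorder_init(struct virtio_hw *hw)
{
	return !vtpci_packed_queue(hw) &&
	       vtpci_with_feature(hw, VIRTIO_F_IN_ORDER);
}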

Signed-off-by: Marvin Liu <yong.liu@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 94ba7a3ec..e450477e8 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -989,6 +989,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 	struct rte_mbuf *m;
 	uint16_t desc_idx;
 	int error, nbufs, i;
+	bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER);
 
 	PMD_INIT_FUNC_TRACE();
 
@@ -1018,7 +1019,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 			virtio_rxq_rearm_vec(rxvq);
 			nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
 		}
-	} else if (hw->use_inorder_rx) {
+	} else if (!vtpci_packed_queue(vq->hw) && in_order) {
 		if ((!virtqueue_full(vq))) {
 			uint16_t free_cnt = vq->vq_free_cnt;
 			struct rte_mbuf *pkts[free_cnt];
@@ -1133,7 +1134,7 @@ virtio_dev_tx_queue_setup_finish(struct rte_eth_dev *dev,
 	PMD_INIT_FUNC_TRACE();
 
 	if (!vtpci_packed_queue(hw)) {
-		if (hw->use_inorder_tx)
+		if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER))
 			vq->vq_split.ring.desc[vq->vq_nentries - 1].next = 0;
 	}
 
@@ -2046,7 +2047,7 @@ virtio_xmit_pkts_packed(void *tx_queue, struct rte_mbuf **tx_pkts,
 	struct virtio_hw *hw = vq->hw;
 	uint16_t hdr_size = hw->vtnet_hdr_size;
 	uint16_t nb_tx = 0;
-	bool in_order = hw->use_inorder_tx;
+	bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER);
 
 	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
 		return nb_tx;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v9 3/9] net/virtio: add vectorized devarg
  2020-04-24  9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 1/9] net/virtio: add Rx free threshold setting Marvin Liu
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 2/9] net/virtio: inorder should depend on feature bit Marvin Liu
@ 2020-04-24  9:24   ` Marvin Liu
  2020-04-24 11:27     ` Maxime Coquelin
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 4/9] net/virtio-user: " Marvin Liu
                     ` (5 subsequent siblings)
  8 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-24  9:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: dev, harry.van.haaren, Marvin Liu

Previously, the virtio split ring vectorized path was enabled by
default. This is not suitable for everyone because that path does not
follow the virtio spec. Add a new devarg for virtio vectorized path
selection. The vectorized path is disabled by default.
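
As a usage illustration (not part of the patch), the devarg can be passed to
a PCI virtio device through the EAL arguments; the application name, core
list and PCI address are placeholders, and the -w PCI selection option is the
one available in the release this series targets:

#include <rte_eal.h>
#include <rte_common.h>

int
main(void)
{
	/* "vectorized=1" only expresses a preference; the PMD still
	 * verifies the remaining conditions during datapath election.
	 */
	char *eal_args[] = {
		"virtio-app", "-l", "0-1",
		"-w", "0000:00:04.0,vectorized=1",
	};

	if (rte_eal_init(RTE_DIM(eal_args), eal_args) < 0)
		return -1;

	/* regular ethdev configuration follows */
	return 0;
}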

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index 6286286db..902a1f0cf 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -363,6 +363,13 @@ Below devargs are supported by the PCI virtio driver:
     rte_eth_link_get_nowait function.
     (Default: 10000 (10G))
 
+#.  ``vectorized``:
+
+    It is used to specify whether the virtio device prefers to use the
+    vectorized path. Afterwards, dependencies of the vectorized path will
+    be checked during path election.
+    (Default: 0 (disabled))
+
 Below devargs are supported by the virtio-user vdev:
 
 #.  ``path``:
diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 37766cbb6..0a69a4db1 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -48,7 +48,8 @@ static int virtio_dev_allmulticast_disable(struct rte_eth_dev *dev);
 static uint32_t virtio_dev_speed_capa_get(uint32_t speed);
 static int virtio_dev_devargs_parse(struct rte_devargs *devargs,
 	int *vdpa,
-	uint32_t *speed);
+	uint32_t *speed,
+	int *vectorized);
 static int virtio_dev_info_get(struct rte_eth_dev *dev,
 				struct rte_eth_dev_info *dev_info);
 static int virtio_dev_link_update(struct rte_eth_dev *dev,
@@ -1551,8 +1552,8 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 			eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed;
 		}
 	} else {
-		if (hw->use_simple_rx) {
-			PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u",
+		if (hw->use_vec_rx) {
+			PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u",
 				eth_dev->data->port_id);
 			eth_dev->rx_pkt_burst = virtio_recv_pkts_vec;
 		} else if (hw->use_inorder_rx) {
@@ -1886,6 +1887,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 {
 	struct virtio_hw *hw = eth_dev->data->dev_private;
 	uint32_t speed = SPEED_UNKNOWN;
+	int vectorized = 0;
 	int ret;
 
 	if (sizeof(struct virtio_net_hdr_mrg_rxbuf) > RTE_PKTMBUF_HEADROOM) {
@@ -1912,7 +1914,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 		return 0;
 	}
 	ret = virtio_dev_devargs_parse(eth_dev->device->devargs,
-		 NULL, &speed);
+		 NULL, &speed, &vectorized);
 	if (ret < 0)
 		return ret;
 	hw->speed = speed;
@@ -1949,6 +1951,11 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 	if (ret < 0)
 		goto err_virtio_init;
 
+	if (vectorized) {
+		if (!vtpci_packed_queue(hw))
+			hw->use_vec_rx = 1;
+	}
+
 	hw->opened = true;
 
 	return 0;
@@ -2021,9 +2028,20 @@ virtio_dev_speed_capa_get(uint32_t speed)
 	}
 }
 
+static int vectorized_check_handler(__rte_unused const char *key,
+		const char *value, void *ret_val)
+{
+	if (strcmp(value, "1") == 0)
+		*(int *)ret_val = 1;
+	else
+		*(int *)ret_val = 0;
+
+	return 0;
+}
 
 #define VIRTIO_ARG_SPEED      "speed"
 #define VIRTIO_ARG_VDPA       "vdpa"
+#define VIRTIO_ARG_VECTORIZED "vectorized"
 
 
 static int
@@ -2045,7 +2063,7 @@ link_speed_handler(const char *key __rte_unused,
 
 static int
 virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa,
-	uint32_t *speed)
+	uint32_t *speed, int *vectorized)
 {
 	struct rte_kvargs *kvlist;
 	int ret = 0;
@@ -2081,6 +2099,18 @@ virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa,
 		}
 	}
 
+	if (vectorized &&
+		rte_kvargs_count(kvlist, VIRTIO_ARG_VECTORIZED) == 1) {
+		ret = rte_kvargs_process(kvlist,
+				VIRTIO_ARG_VECTORIZED,
+				vectorized_check_handler, vectorized);
+		if (ret < 0) {
+			PMD_INIT_LOG(ERR, "Failed to parse %s",
+					VIRTIO_ARG_VECTORIZED);
+			goto exit;
+		}
+	}
+
 exit:
 	rte_kvargs_free(kvlist);
 	return ret;
@@ -2092,7 +2122,8 @@ static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	int vdpa = 0;
 	int ret = 0;
 
-	ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL);
+	ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL,
+		NULL);
 	if (ret < 0) {
 		PMD_INIT_LOG(ERR, "devargs parsing is failed");
 		return ret;
@@ -2257,33 +2288,31 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 			return -EBUSY;
 		}
 
-	hw->use_simple_rx = 1;
-
 	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
 		hw->use_inorder_tx = 1;
 		hw->use_inorder_rx = 1;
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 		hw->use_inorder_rx = 0;
 	}
 
 #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
 	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 #endif
 	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
-		 hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 
 	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
 			   DEV_RX_OFFLOAD_TCP_CKSUM |
 			   DEV_RX_OFFLOAD_TCP_LRO |
 			   DEV_RX_OFFLOAD_VLAN_STRIP))
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 
 	return 0;
 }
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index bd89357e4..668e688e1 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -253,7 +253,8 @@ struct virtio_hw {
 	uint8_t	    vlan_strip;
 	uint8_t	    use_msix;
 	uint8_t     modern;
-	uint8_t     use_simple_rx;
+	uint8_t     use_vec_rx;
+	uint8_t     use_vec_tx;
 	uint8_t     use_inorder_rx;
 	uint8_t     use_inorder_tx;
 	uint8_t     weak_barriers;
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index e450477e8..84f4cf946 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 	/* Allocate blank mbufs for the each rx descriptor */
 	nbufs = 0;
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx && !vtpci_packed_queue(hw)) {
 		for (desc_idx = 0; desc_idx < vq->vq_nentries;
 		     desc_idx++) {
 			vq->vq_split.ring.avail->ring[desc_idx] = desc_idx;
@@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 			&rxvq->fake_mbuf;
 	}
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx && !vtpci_packed_queue(hw)) {
 		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
 			virtio_rxq_rearm_vec(rxvq);
 			nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
index 953f00d72..150a8d987 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -525,7 +525,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev)
 	 */
 	hw->use_msix = 1;
 	hw->modern   = 0;
-	hw->use_simple_rx = 0;
+	hw->use_vec_rx = 0;
 	hw->use_inorder_rx = 0;
 	hw->use_inorder_tx = 0;
 	hw->virtio_user_dev = dev;
diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c
index 0b4e3bf3e..ca23180de 100644
--- a/drivers/net/virtio/virtqueue.c
+++ b/drivers/net/virtio/virtqueue.c
@@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq)
 	end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1);
 
 	for (idx = 0; idx < vq->vq_nentries; idx++) {
-		if (hw->use_simple_rx && type == VTNET_RQ) {
+		if (hw->use_vec_rx && !vtpci_packed_queue(hw) &&
+		    type == VTNET_RQ) {
 			if (start <= end && idx >= start && idx < end)
 				continue;
 			if (start > end && (idx >= start || idx < end))
@@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
 	for (i = 0; i < nb_used; i++) {
 		used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1);
 		uep = &vq->vq_split.ring.used->ring[used_idx];
-		if (hw->use_simple_rx) {
+		if (hw->use_vec_rx) {
 			desc_idx = used_idx;
 			rte_pktmbuf_free(vq->sw_ring[desc_idx]);
 			vq->vq_free_cnt++;
@@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
 		vq->vq_used_cons_idx++;
 	}
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx) {
 		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
 			virtio_rxq_rearm_vec(rxq);
 			if (virtqueue_kick_prepare(vq))
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v9 4/9] net/virtio-user: add vectorized devarg
  2020-04-24  9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu
                     ` (2 preceding siblings ...)
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 3/9] net/virtio: add vectorized devarg Marvin Liu
@ 2020-04-24  9:24   ` Marvin Liu
  2020-04-24 11:29     ` Maxime Coquelin
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu
                     ` (4 subsequent siblings)
  8 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-24  9:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: dev, harry.van.haaren, Marvin Liu

Add a new devarg for virtio-user device vectorized path selection. The
vectorized path is disabled by default.
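
As a usage illustration (not part of the patch), a virtio-user port could be
created with the new devarg via the vdev bus API; the device name and
vhost-user socket path are placeholders:

#include <rte_bus_vdev.h>

static int
create_virtio_user_port(void)
{
	/* "vectorized=1" combined with "packed_vq=1" requests the packed
	 * ring vectorized datapath when the build supports AVX512;
	 * otherwise the PMD falls back to the standard path.
	 */
	return rte_vdev_init("net_virtio_user0",
		"path=/tmp/vhost-user.sock,queues=1,packed_vq=1,vectorized=1");
}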

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index 902a1f0cf..d59add23e 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -424,6 +424,12 @@ Below devargs are supported by the virtio-user vdev:
     rte_eth_link_get_nowait function.
     (Default: 10000 (10G))
 
+#.  ``vectorized``:
+
+    It is used to specify whether the virtio device prefers to use the
+    vectorized path. Afterwards, dependencies of the vectorized path will
+    be checked during path election.
+    (Default: 0 (disabled))
 
 Virtio paths Selection and Usage
 --------------------------------
diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
index 150a8d987..40ad786cc 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -452,6 +452,8 @@ static const char *valid_args[] = {
 	VIRTIO_USER_ARG_PACKED_VQ,
 #define VIRTIO_USER_ARG_SPEED          "speed"
 	VIRTIO_USER_ARG_SPEED,
+#define VIRTIO_USER_ARG_VECTORIZED     "vectorized"
+	VIRTIO_USER_ARG_VECTORIZED,
 	NULL
 };
 
@@ -559,6 +561,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	uint64_t mrg_rxbuf = 1;
 	uint64_t in_order = 1;
 	uint64_t packed_vq = 0;
+	uint64_t vectorized = 0;
 	char *path = NULL;
 	char *ifname = NULL;
 	char *mac_addr = NULL;
@@ -675,6 +678,15 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		}
 	}
 
+	if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) {
+		if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED,
+				       &get_integer_arg, &vectorized) < 0) {
+			PMD_INIT_LOG(ERR, "error to parse %s",
+				     VIRTIO_USER_ARG_VECTORIZED);
+			goto end;
+		}
+	}
+
 	if (queues > 1 && cq == 0) {
 		PMD_INIT_LOG(ERR, "multi-q requires ctrl-q");
 		goto end;
@@ -727,6 +739,9 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		goto end;
 	}
 
+	if (vectorized)
+		hw->use_vec_rx = 1;
+
 	rte_eth_dev_probing_finish(eth_dev);
 	ret = 0;
 
@@ -785,4 +800,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user,
 	"mrg_rxbuf=<0|1> "
 	"in_order=<0|1> "
 	"packed_vq=<0|1> "
-	"speed=<int>");
+	"speed=<int> "
+	"vectorized=<0|1>");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path
  2020-04-24  9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu
                     ` (3 preceding siblings ...)
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 4/9] net/virtio-user: " Marvin Liu
@ 2020-04-24  9:24   ` Marvin Liu
  2020-04-24 11:51     ` Maxime Coquelin
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-24  9:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: dev, harry.van.haaren, Marvin Liu

Optimize the packed ring Rx path with SIMD instructions. The approach is
much like the vhost one: split the path into batch and single functions.
The batch function is further optimized with AVX512 instructions. Also
pad the desc extra structure to 16-byte alignment, so that four elements
can be stored in one batch.
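
Conceptually, the AVX512 batch dequeue added below replaces a scalar check of
the following shape (a simplified sketch relying on existing driver-internal
helpers from "virtqueue.h", not the actual implementation):

/* Scalar equivalent of the vectorized availability test: the four
 * descriptors starting at the batch-aligned used index must all be
 * marked used for the expected wrap state before one 64-byte load can
 * consume them.
 */
static inline bool
packed_batch_is_used(struct virtqueue *vq)
{
	uint16_t id = vq->vq_used_cons_idx;
	uint16_t i;

	if ((id % 4) || (id + 4 > vq->vq_nentries))
		return false;

	for (i = 0; i < 4; i++)
		if (!desc_is_used(&vq->vq_packed.ring.desc[id + i], vq))
			return false;

	return true;
}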

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index c9edb84ee..102b1deab 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
 endif
 
+ifneq ($(FORCE_DISABLE_AVX512), y)
+	CC_AVX512_SUPPORT=\
+	$(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
+	sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
+	grep -q AVX512 && echo 1)
+endif
+
+ifeq ($(CC_AVX512_SUPPORT), 1)
+CFLAGS += -DCC_AVX512_SUPPORT
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
+
+ifeq ($(RTE_TOOLCHAIN), gcc)
+ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
+CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), clang)
+ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1)
+CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), icc)
+ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
+CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
+endif
+endif
+
+CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl
+ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
+CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
+endif
+endif
+
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c
diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
index 15150eea1..8e68c3039 100644
--- a/drivers/net/virtio/meson.build
+++ b/drivers/net/virtio/meson.build
@@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c',
 deps += ['kvargs', 'bus_pci']
 
 if arch_subdir == 'x86'
+	if '-mno-avx512f' not in machine_args
+		if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
+			cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl']
+			cflags += ['-DCC_AVX512_SUPPORT']
+			if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
+				cflags += '-DVIRTIO_GCC_UNROLL_PRAGMA'
+			elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
+				cflags += '-DVIRTIO_CLANG_UNROLL_PRAGMA'
+			elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
+				cflags += '-DVIRTIO_ICC_UNROLL_PRAGMA'
+			endif
+			sources += files('virtio_rxtx_packed_avx.c')
+		endif
+	endif
 	sources += files('virtio_rxtx_simple_sse.c')
 elif arch_subdir == 'ppc'
 	sources += files('virtio_rxtx_simple_altivec.c')
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index febaf17a8..5c112cac7 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts,
 uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 84f4cf946..c9b6e7844 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -2329,3 +2329,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
 
 	return nb_tx;
 }
+
+__rte_weak uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
+			    struct rte_mbuf **rx_pkts __rte_unused,
+			    uint16_t nb_pkts __rte_unused)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
new file mode 100644
index 000000000..8a7b459eb
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -0,0 +1,374 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <rte_net.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtio_pci.h"
+#include "virtqueue.h"
+
+#define BYTE_SIZE 8
+/* flag bits offset in packed ring desc higher 64bits */
+#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
+	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
+
+#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \
+	FLAGS_BITS_OFFSET)
+
+#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
+	sizeof(struct vring_packed_desc))
+#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
+
+#ifdef VIRTIO_GCC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_ICC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifndef virtio_for_each_try_unroll
+#define virtio_for_each_try_unroll(iter, val, num) \
+	for (iter = val; iter < num; iter++)
+#endif
+
+static inline void
+virtio_update_batch_stats(struct virtnet_stats *stats,
+			  uint16_t pkt_len1,
+			  uint16_t pkt_len2,
+			  uint16_t pkt_len3,
+			  uint16_t pkt_len4)
+{
+	stats->bytes += pkt_len1;
+	stats->bytes += pkt_len2;
+	stats->bytes += pkt_len3;
+	stats->bytes += pkt_len4;
+}
+
+/* Optionally fill offload information in structure */
+static inline int
+virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
+{
+	struct rte_net_hdr_lens hdr_lens;
+	uint32_t hdrlen, ptype;
+	int l4_supported = 0;
+
+	/* nothing to do */
+	if (hdr->flags == 0)
+		return 0;
+
+	/* GSO not supported in vec path, skip check */
+	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
+
+	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
+	m->packet_type = ptype;
+	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
+		l4_supported = 1;
+
+	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
+		if (hdr->csum_start <= hdrlen && l4_supported) {
+			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
+		} else {
+			/* Unknown proto or tunnel, do sw cksum. We can assume
+			 * the cksum field is in the first segment since the
+			 * buffers we provided to the host are large enough.
+			 * In case of SCTP, this will be wrong since it's a CRC
+			 * but there's nothing we can do.
+			 */
+			uint16_t csum = 0, off;
+
+			rte_raw_cksum_mbuf(m, hdr->csum_start,
+				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
+				&csum);
+			if (likely(csum != 0xffff))
+				csum = ~csum;
+			off = hdr->csum_offset + hdr->csum_start;
+			if (rte_pktmbuf_data_len(m) >= off + 1)
+				*rte_pktmbuf_mtod_offset(m, uint16_t *,
+					off) = csum;
+		}
+	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
+		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
+	}
+
+	return 0;
+}
+
+static inline uint16_t
+virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq,
+				   struct rte_mbuf **rx_pkts)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint64_t addrs[PACKED_BATCH_SIZE];
+	uint16_t id = vq->vq_used_cons_idx;
+	uint8_t desc_stats;
+	uint16_t i;
+	void *desc_addr;
+
+	if (id & PACKED_BATCH_MASK)
+		return -1;
+
+	if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries))
+		return -1;
+
+	/* only care avail/used bits */
+	__m512i v_mask = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+	desc_addr = &vq->vq_packed.ring.desc[id];
+
+	__m512i v_desc = _mm512_loadu_si512(desc_addr);
+	__m512i v_flag = _mm512_and_epi64(v_desc, v_mask);
+
+	__m512i v_used_flag = _mm512_setzero_si512();
+	if (vq->vq_packed.used_wrap_counter)
+		v_used_flag = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+
+	/* Check all descs are used */
+	desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag);
+	if (desc_stats)
+		return -1;
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie;
+		rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *));
+
+		addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1;
+	}
+
+	/*
+	 * load len from desc, store into mbuf pkt_len and data_len
+	 * len limited by 16-bit buf_len, pkt_len[16:31] can be ignored
+	 */
+	const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12;
+	__m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, 0xAA);
+
+	/* reduce hdr_len from pkt_len and data_len */
+	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask,
+			(uint32_t)-hdr_size);
+
+	__m512i v_value = _mm512_add_epi32(values, mbuf_len_offset);
+
+	/* assert offset of data_len */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+		offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+
+	__m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3],
+					   addrs[2] + 8, addrs[2],
+					   addrs[1] + 8, addrs[1],
+					   addrs[0] + 8, addrs[0]);
+	/* batch store into mbufs */
+	_mm512_i64scatter_epi64(0, v_index, v_value, 1);
+
+	if (hw->has_rx_offload) {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			char *addr = (char *)rx_pkts[i]->buf_addr +
+				RTE_PKTMBUF_HEADROOM - hdr_size;
+			virtio_vec_rx_offload(rx_pkts[i],
+					(struct virtio_net_hdr *)addr);
+		}
+	}
+
+	virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len,
+			rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len,
+			rx_pkts[3]->pkt_len);
+
+	vq->vq_free_cnt += PACKED_BATCH_SIZE;
+
+	vq->vq_used_cons_idx += PACKED_BATCH_SIZE;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static uint16_t
+virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq,
+				    struct rte_mbuf **rx_pkts)
+{
+	uint16_t used_idx, id;
+	uint32_t len;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint32_t hdr_size = hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	struct vring_packed_desc *desc;
+	struct rte_mbuf *cookie;
+
+	desc = vq->vq_packed.ring.desc;
+	used_idx = vq->vq_used_cons_idx;
+	if (!desc_is_used(&desc[used_idx], vq))
+		return -1;
+
+	len = desc[used_idx].len;
+	id = desc[used_idx].id;
+	cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie;
+	if (unlikely(cookie == NULL)) {
+		PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u",
+				vq->vq_used_cons_idx);
+		return -1;
+	}
+	rte_prefetch0(cookie);
+	rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
+
+	cookie->data_off = RTE_PKTMBUF_HEADROOM;
+	cookie->ol_flags = 0;
+	cookie->pkt_len = (uint32_t)(len - hdr_size);
+	cookie->data_len = (uint32_t)(len - hdr_size);
+
+	hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr +
+					RTE_PKTMBUF_HEADROOM - hdr_size);
+	if (hw->has_rx_offload)
+		virtio_vec_rx_offload(cookie, hdr);
+
+	*rx_pkts = cookie;
+
+	rxvq->stats.bytes += cookie->pkt_len;
+
+	vq->vq_free_cnt++;
+	vq->vq_used_cons_idx++;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static inline void
+virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq,
+			      struct rte_mbuf **cookie,
+			      uint16_t num)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc;
+	uint16_t flags = vq->vq_packed.cached_flags;
+	struct virtio_hw *hw = vq->hw;
+	struct vq_desc_extra *dxp;
+	uint16_t idx, i;
+	uint16_t batch_num, total_num = 0;
+	uint16_t head_idx = vq->vq_avail_idx;
+	uint16_t head_flag = vq->vq_packed.cached_flags;
+	uint64_t addr;
+
+	do {
+		idx = vq->vq_avail_idx;
+
+		batch_num = PACKED_BATCH_SIZE;
+		if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries))
+			batch_num = vq->vq_nentries - idx;
+		if (unlikely((total_num + batch_num) > num))
+			batch_num = num - total_num;
+
+		virtio_for_each_try_unroll(i, 0, batch_num) {
+			dxp = &vq->vq_descx[idx + i];
+			dxp->cookie = (void *)cookie[total_num + i];
+
+			addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) +
+				RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size;
+			start_dp[idx + i].addr = addr;
+			start_dp[idx + i].len = cookie[total_num + i]->buf_len
+				- RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
+			if (total_num || i) {
+				virtqueue_store_flags_packed(&start_dp[idx + i],
+						flags, hw->weak_barriers);
+			}
+		}
+
+		vq->vq_avail_idx += batch_num;
+		if (vq->vq_avail_idx >= vq->vq_nentries) {
+			vq->vq_avail_idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+			flags = vq->vq_packed.cached_flags;
+		}
+		total_num += batch_num;
+	} while (total_num < num);
+
+	virtqueue_store_flags_packed(&start_dp[head_idx], head_flag,
+				hw->weak_barriers);
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num);
+}
+
+uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue,
+			    struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct virtnet_rx *rxvq = rx_queue;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t num, nb_rx = 0;
+	uint32_t nb_enqueued = 0;
+	uint16_t free_cnt = vq->vq_free_thresh;
+
+	if (unlikely(hw->started == 0))
+		return nb_rx;
+
+	num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts);
+	if (likely(num > PACKED_BATCH_SIZE))
+		num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE);
+
+	while (num) {
+		if (!virtqueue_dequeue_batch_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx += PACKED_BATCH_SIZE;
+			num -= PACKED_BATCH_SIZE;
+			continue;
+		}
+		if (!virtqueue_dequeue_single_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx++;
+			num--;
+			continue;
+		}
+		break;
+	};
+
+	PMD_RX_LOG(DEBUG, "dequeue:%d", num);
+
+	rxvq->stats.packets += nb_rx;
+
+	if (likely(vq->vq_free_cnt >= free_cnt)) {
+		struct rte_mbuf *new_pkts[free_cnt];
+		if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts,
+						free_cnt) == 0)) {
+			virtio_recv_refill_packed_vec(rxvq, new_pkts,
+					free_cnt);
+			nb_enqueued += free_cnt;
+		} else {
+			struct rte_eth_dev *dev =
+				&rte_eth_devices[rxvq->port_id];
+			dev->data->rx_mbuf_alloc_failed += free_cnt;
+		}
+	}
+
+	if (likely(nb_enqueued)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_RX_LOG(DEBUG, "Notified");
+		}
+	}
+
+	return nb_rx;
+}
diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
index 40ad786cc..c54698ad1 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -528,6 +528,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev)
 	hw->use_msix = 1;
 	hw->modern   = 0;
 	hw->use_vec_rx = 0;
+	hw->use_vec_tx = 0;
 	hw->use_inorder_rx = 0;
 	hw->use_inorder_tx = 0;
 	hw->virtio_user_dev = dev;
@@ -739,8 +740,19 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		goto end;
 	}
 
-	if (vectorized)
-		hw->use_vec_rx = 1;
+	if (vectorized) {
+		if (packed_vq) {
+#if defined(CC_AVX512_SUPPORT)
+			hw->use_vec_rx = 1;
+			hw->use_vec_tx = 1;
+#else
+			PMD_INIT_LOG(INFO,
+				"build environment does not support packed ring vectorization");
+#endif
+		} else {
+			hw->use_vec_rx = 1;
+		}
+	}
 
 	rte_eth_dev_probing_finish(eth_dev);
 	ret = 0;
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 6301c56b2..d293a3189 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,8 +18,10 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_TX_FREE_THRESH 32
 #define DEFAULT_RX_FREE_THRESH 32
 
+#define VIRTIO_MBUF_BURST_SZ 64
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
@@ -236,7 +238,8 @@ struct vq_desc_extra {
 	void *cookie;
 	uint16_t ndescs;
 	uint16_t next;
-};
+	uint8_t padding[4];
+} __rte_packed __rte_aligned(16);
 
 struct virtqueue {
 	struct virtio_hw  *hw; /**< virtio_hw structure pointer. */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v9 6/9] net/virtio: reuse packed ring xmit functions
  2020-04-24  9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu
                     ` (4 preceding siblings ...)
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu
@ 2020-04-24  9:24   ` Marvin Liu
  2020-04-24 12:01     ` Maxime Coquelin
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-24  9:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: dev, harry.van.haaren, Marvin Liu

Move the xmit offload and packed ring xmit enqueue functions to the
header file. These functions will be reused by the packed ring
vectorized Tx function.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index c9b6e7844..cf18fe564 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq,
 	return i;
 }
 
-#ifndef DEFAULT_TX_FREE_THRESH
-#define DEFAULT_TX_FREE_THRESH 32
-#endif
-
 static void
 virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num)
 {
@@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m)
 }
 
 
-/* avoid write operation when necessary, to lessen cache issues */
-#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
-	if ((var) != (val))			\
-		(var) = (val);			\
-} while (0)
-
-#define virtqueue_clear_net_hdr(_hdr) do {		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0);		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0);	\
-} while (0)
-
-static inline void
-virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
-			struct rte_mbuf *cookie,
-			bool offload)
-{
-	if (offload) {
-		if (cookie->ol_flags & PKT_TX_TCP_SEG)
-			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
-
-		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
-		case PKT_TX_UDP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_udp_hdr,
-				dgram_cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		case PKT_TX_TCP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		default:
-			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
-			break;
-		}
 
-		/* TCP Segmentation Offload */
-		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
-			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
-				VIRTIO_NET_HDR_GSO_TCPV6 :
-				VIRTIO_NET_HDR_GSO_TCPV4;
-			hdr->gso_size = cookie->tso_segsz;
-			hdr->hdr_len =
-				cookie->l2_len +
-				cookie->l3_len +
-				cookie->l4_len;
-		} else {
-			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
-		}
-	}
-}
 
 static inline void
 virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq,
@@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq,
 	virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers);
 }
 
-static inline void
-virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
-			      uint16_t needed, int can_push, int in_order)
-{
-	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
-	struct vq_desc_extra *dxp;
-	struct virtqueue *vq = txvq->vq;
-	struct vring_packed_desc *start_dp, *head_dp;
-	uint16_t idx, id, head_idx, head_flags;
-	int16_t head_size = vq->hw->vtnet_hdr_size;
-	struct virtio_net_hdr *hdr;
-	uint16_t prev;
-	bool prepend_header = false;
-
-	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
-
-	dxp = &vq->vq_descx[id];
-	dxp->ndescs = needed;
-	dxp->cookie = cookie;
-
-	head_idx = vq->vq_avail_idx;
-	idx = head_idx;
-	prev = head_idx;
-	start_dp = vq->vq_packed.ring.desc;
-
-	head_dp = &vq->vq_packed.ring.desc[idx];
-	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-	head_flags |= vq->vq_packed.cached_flags;
-
-	if (can_push) {
-		/* prepend cannot fail, checked by caller */
-		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
-					      -head_size);
-		prepend_header = true;
-
-		/* if offload disabled, it is not zeroed below, do it now */
-		if (!vq->hw->has_tx_offload)
-			virtqueue_clear_net_hdr(hdr);
-	} else {
-		/* setup first tx ring slot to point to header
-		 * stored in reserved region.
-		 */
-		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
-			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
-		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
-		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	}
-
-	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
-
-	do {
-		uint16_t flags;
-
-		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
-		start_dp[idx].len  = cookie->data_len;
-		if (prepend_header) {
-			start_dp[idx].addr -= head_size;
-			start_dp[idx].len += head_size;
-			prepend_header = false;
-		}
-
-		if (likely(idx != head_idx)) {
-			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-			flags |= vq->vq_packed.cached_flags;
-			start_dp[idx].flags = flags;
-		}
-		prev = idx;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	} while ((cookie = cookie->next) != NULL);
-
-	start_dp[prev].id = id;
-
-	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
-	vq->vq_avail_idx = idx;
-
-	if (!in_order) {
-		vq->vq_desc_head_idx = dxp->next;
-		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
-			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
-	}
-
-	virtqueue_store_flags_packed(head_dp, head_flags,
-				     vq->hw->weak_barriers);
-}
-
 static inline void
 virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 			uint16_t needed, int use_indirect, int can_push,
@@ -1246,7 +1085,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
 	return 0;
 }
 
-#define VIRTIO_MBUF_BURST_SZ 64
 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc))
 uint16_t
 virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index d293a3189..18ae34789 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -563,4 +563,165 @@ virtqueue_notify(struct virtqueue *vq)
 #define VIRTQUEUE_DUMP(vq) do { } while (0)
 #endif
 
+/* avoid write operation when necessary, to lessen cache issues */
+#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
+	typeof(var) *var_ = &(var);		\
+	typeof(val) val_ = (val);		\
+	if (*(var_) != (val_))			\
+		*(var_) = (val_);		\
+} while (0)
+
+#define virtqueue_clear_net_hdr(hdr) do {		\
+	typeof(hdr) hdr_ = (hdr);			\
+	ASSIGN_UNLESS_EQUAL((hdr_)->csum_start, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->csum_offset, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->flags, 0);		\
+	ASSIGN_UNLESS_EQUAL((hdr_)->gso_type, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->gso_size, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->hdr_len, 0);	\
+} while (0)
+
+static inline void
+virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
+			struct rte_mbuf *cookie,
+			bool offload)
+{
+	if (offload) {
+		if (cookie->ol_flags & PKT_TX_TCP_SEG)
+			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
+
+		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
+		case PKT_TX_UDP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_udp_hdr,
+				dgram_cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		case PKT_TX_TCP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		default:
+			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
+			break;
+		}
+
+		/* TCP Segmentation Offload */
+		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
+			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
+				VIRTIO_NET_HDR_GSO_TCPV6 :
+				VIRTIO_NET_HDR_GSO_TCPV4;
+			hdr->gso_size = cookie->tso_segsz;
+			hdr->hdr_len =
+				cookie->l2_len +
+				cookie->l3_len +
+				cookie->l4_len;
+		} else {
+			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
+		}
+	}
+}
+
+static inline void
+virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
+			      uint16_t needed, int can_push, int in_order)
+{
+	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
+	struct vq_desc_extra *dxp;
+	struct virtqueue *vq = txvq->vq;
+	struct vring_packed_desc *start_dp, *head_dp;
+	uint16_t idx, id, head_idx, head_flags;
+	int16_t head_size = vq->hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	uint16_t prev;
+	bool prepend_header = false;
+
+	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
+
+	dxp = &vq->vq_descx[id];
+	dxp->ndescs = needed;
+	dxp->cookie = cookie;
+
+	head_idx = vq->vq_avail_idx;
+	idx = head_idx;
+	prev = head_idx;
+	start_dp = vq->vq_packed.ring.desc;
+
+	head_dp = &vq->vq_packed.ring.desc[idx];
+	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+	head_flags |= vq->vq_packed.cached_flags;
+
+	if (can_push) {
+		/* prepend cannot fail, checked by caller */
+		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
+					      -head_size);
+		prepend_header = true;
+
+		/* if offload disabled, it is not zeroed below, do it now */
+		if (!vq->hw->has_tx_offload)
+			virtqueue_clear_net_hdr(hdr);
+	} else {
+		/* setup first tx ring slot to point to header
+		 * stored in reserved region.
+		 */
+		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
+			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
+		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
+		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	}
+
+	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
+
+	do {
+		uint16_t flags;
+
+		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
+		start_dp[idx].len  = cookie->data_len;
+		if (prepend_header) {
+			start_dp[idx].addr -= head_size;
+			start_dp[idx].len += head_size;
+			prepend_header = false;
+		}
+
+		if (likely(idx != head_idx)) {
+			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+			flags |= vq->vq_packed.cached_flags;
+			start_dp[idx].flags = flags;
+		}
+		prev = idx;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	} while ((cookie = cookie->next) != NULL);
+
+	start_dp[prev].id = id;
+
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
+	vq->vq_avail_idx = idx;
+
+	if (!in_order) {
+		vq->vq_desc_head_idx = dxp->next;
+		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
+			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
+	}
+
+	virtqueue_store_flags_packed(head_dp, head_flags,
+				     vq->hw->weak_barriers);
+}
 #endif /* _VIRTQUEUE_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path
  2020-04-24  9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu
                     ` (5 preceding siblings ...)
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu
@ 2020-04-24  9:24   ` Marvin Liu
  2020-04-24 12:29     ` Maxime Coquelin
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 8/9] net/virtio: add election for vectorized path Marvin Liu
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 9/9] doc: add packed " Marvin Liu
  8 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-24  9:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: dev, harry.van.haaren, Marvin Liu

Optimize the packed ring Tx path in the same way as the Rx path: split
the Tx path into batch and single Tx functions. The batch function is
further optimized with AVX512 instructions.
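
For reference, the mask compare on the mbuf rearm data in the batch function
is the vector form of a scalar eligibility test roughly like the sketch below
(simplified, not the actual implementation; the helper name is hypothetical):

#include <rte_mbuf.h>

/* Scalar equivalent of the batch Tx eligibility check: each of the four
 * mbufs must be a single-segment buffer with refcnt 1 and enough
 * headroom to prepend the virtio net header in place.
 */
static inline bool
tx_batch_can_push_hdr(struct rte_mbuf **tx_pkts, uint16_t hdr_size)
{
	uint16_t i;

	for (i = 0; i < 4; i++) {
		struct rte_mbuf *m = tx_pkts[i];

		if (rte_mbuf_refcnt_read(m) != 1 || m->nb_segs != 1 ||
		    rte_pktmbuf_headroom(m) < hdr_size)
			return false;
	}
	return true;
}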

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 5c112cac7..b7d52d497 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index cf18fe564..f82fe8d64 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
 {
 	return 0;
 }
+
+__rte_weak uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused,
+			    struct rte_mbuf **tx_pkts __rte_unused,
+			    uint16_t nb_pkts __rte_unused)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
index 8a7b459eb..c023ace4e 100644
--- a/drivers/net/virtio/virtio_rxtx_packed_avx.c
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -23,6 +23,24 @@
 #define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \
 	FLAGS_BITS_OFFSET)
 
+/* reference count offset in mbuf rearm data */
+#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
+	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+/* segment number offset in mbuf rearm data */
+#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
+	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+
+/* default rearm data */
+#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
+	1ULL << REFCNT_BITS_OFFSET)
+
+/* id bits offset in packed ring desc higher 64bits */
+#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \
+	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
+
+/* net hdr short size mask */
+#define NET_HDR_MASK 0x3F
+
 #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
 	sizeof(struct vring_packed_desc))
 #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
@@ -47,6 +65,48 @@
 	for (iter = val; iter < num; iter++)
 #endif
 
+static inline void
+virtio_xmit_cleanup_packed_vec(struct virtqueue *vq)
+{
+	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
+	struct vq_desc_extra *dxp;
+	uint16_t used_idx, id, curr_id, free_cnt = 0;
+	uint16_t size = vq->vq_nentries;
+	struct rte_mbuf *mbufs[size];
+	uint16_t nb_mbuf = 0, i;
+
+	used_idx = vq->vq_used_cons_idx;
+
+	if (!desc_is_used(&desc[used_idx], vq))
+		return;
+
+	id = desc[used_idx].id;
+
+	do {
+		curr_id = used_idx;
+		dxp = &vq->vq_descx[used_idx];
+		used_idx += dxp->ndescs;
+		free_cnt += dxp->ndescs;
+
+		if (dxp->cookie != NULL) {
+			mbufs[nb_mbuf] = dxp->cookie;
+			dxp->cookie = NULL;
+			nb_mbuf++;
+		}
+
+		if (used_idx >= size) {
+			used_idx -= size;
+			vq->vq_packed.used_wrap_counter ^= 1;
+		}
+	} while (curr_id != id);
+
+	for (i = 0; i < nb_mbuf; i++)
+		rte_pktmbuf_free(mbufs[i]);
+
+	vq->vq_used_cons_idx = used_idx;
+	vq->vq_free_cnt += free_cnt;
+}
+
 static inline void
 virtio_update_batch_stats(struct virtnet_stats *stats,
 			  uint16_t pkt_len1,
@@ -60,6 +120,237 @@ virtio_update_batch_stats(struct virtnet_stats *stats,
 	stats->bytes += pkt_len4;
 }
 
+static inline int
+virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq,
+				   struct rte_mbuf **tx_pkts)
+{
+	struct virtqueue *vq = txvq->vq;
+	uint16_t head_size = vq->hw->vtnet_hdr_size;
+	uint16_t idx = vq->vq_avail_idx;
+	struct virtio_net_hdr *hdr;
+	uint16_t i, cmp;
+
+	if (vq->vq_avail_idx & PACKED_BATCH_MASK)
+		return -1;
+
+	if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries))
+		return -1;
+
+	/* Load four mbufs rearm data */
+	RTE_BUILD_BUG_ON(REFCNT_BITS_OFFSET >= 64);
+	RTE_BUILD_BUG_ON(SEG_NUM_BITS_OFFSET >= 64);
+	__m256i mbufs = _mm256_set_epi64x(*tx_pkts[3]->rearm_data,
+					  *tx_pkts[2]->rearm_data,
+					  *tx_pkts[1]->rearm_data,
+					  *tx_pkts[0]->rearm_data);
+
+	/* refcnt=1 and nb_segs=1 */
+	__m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
+	__m256i head_rooms = _mm256_set1_epi16(head_size);
+
+	/* Check refcnt and nb_segs */
+	const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12;
+	cmp = _mm256_mask_cmpneq_epu16_mask(mask, mbufs, mbuf_ref);
+	if (unlikely(cmp))
+		return -1;
+
+	/* Check headroom is enough */
+	const __mmask16 data_mask = 0x1 | 0x1 << 4 | 0x1 << 8 | 0x1 << 12;
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_off) !=
+		offsetof(struct rte_mbuf, rearm_data));
+	cmp = _mm256_mask_cmplt_epu16_mask(data_mask, mbufs, head_rooms);
+	if (unlikely(cmp))
+		return -1;
+
+	__m512i v_descx = _mm512_set_epi64(0x1, (uint64_t)tx_pkts[3],
+					   0x1, (uint64_t)tx_pkts[2],
+					   0x1, (uint64_t)tx_pkts[1],
+					   0x1, (uint64_t)tx_pkts[0]);
+
+	_mm512_storeu_si512((void *)&vq->vq_descx[idx], v_descx);
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		tx_pkts[i]->data_off -= head_size;
+		tx_pkts[i]->data_len += head_size;
+	}
+
+#ifdef RTE_VIRTIO_USER
+	__m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])),
+			tx_pkts[2]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])),
+			tx_pkts[1]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])),
+			tx_pkts[0]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0])));
+#else
+	__m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len,
+					      tx_pkts[3]->buf_iova,
+					      tx_pkts[2]->data_len,
+					      tx_pkts[2]->buf_iova,
+					      tx_pkts[1]->data_len,
+					      tx_pkts[1]->buf_iova,
+					      tx_pkts[0]->data_len,
+					      tx_pkts[0]->buf_iova);
+#endif
+
+	/* id offset and data offset */
+	__m512i data_offsets = _mm512_set_epi64((uint64_t)3 << ID_BITS_OFFSET,
+						tx_pkts[3]->data_off,
+						(uint64_t)2 << ID_BITS_OFFSET,
+						tx_pkts[2]->data_off,
+						(uint64_t)1 << ID_BITS_OFFSET,
+						tx_pkts[1]->data_off,
+						0, tx_pkts[0]->data_off);
+
+	__m512i new_descs = _mm512_add_epi64(descs_base, data_offsets);
+
+	uint64_t flags_temp = (uint64_t)idx << ID_BITS_OFFSET |
+		(uint64_t)vq->vq_packed.cached_flags << FLAGS_BITS_OFFSET;
+
+	/* flags offset and guest virtual address offset */
+#ifdef RTE_VIRTIO_USER
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset);
+#else
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, 0);
+#endif
+	__m512i v_offset = _mm512_broadcast_i32x4(flag_offset);
+
+	__m512i v_desc = _mm512_add_epi64(new_descs, v_offset);
+
+	if (!vq->hw->has_tx_offload) {
+		__m128i all_mask = _mm_set1_epi16(0xFFFF);
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			__m128i v_hdr = _mm_loadu_si128((void *)hdr);
+			if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK,
+							v_hdr, all_mask))) {
+				__m128i all_zero = _mm_setzero_si128();
+				_mm_mask_storeu_epi16((void *)hdr,
+						NET_HDR_MASK, all_zero);
+			}
+		}
+	} else {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			virtqueue_xmit_offload(hdr, tx_pkts[i], true);
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	_mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], v_desc);
+
+	virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len,
+			tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len,
+			tx_pkts[3]->pkt_len);
+
+	vq->vq_avail_idx += PACKED_BATCH_SIZE;
+	vq->vq_free_cnt -= PACKED_BATCH_SIZE;
+
+	if (vq->vq_avail_idx >= vq->vq_nentries) {
+		vq->vq_avail_idx -= vq->vq_nentries;
+		vq->vq_packed.cached_flags ^=
+			VRING_PACKED_DESC_F_AVAIL_USED;
+	}
+
+	return 0;
+}
+
+static inline int
+virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq,
+				    struct rte_mbuf *txm)
+{
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint16_t slots, can_push;
+	int16_t need;
+
+	/* How many main ring entries are needed for this Tx?
+	 * any_layout => number of segments
+	 * default    => number of segments + 1
+	 */
+	can_push = rte_mbuf_refcnt_read(txm) == 1 &&
+		   RTE_MBUF_DIRECT(txm) &&
+		   txm->nb_segs == 1 &&
+		   rte_pktmbuf_headroom(txm) >= hdr_size;
+
+	slots = txm->nb_segs + !can_push;
+	need = slots - vq->vq_free_cnt;
+
+	/* Positive value indicates it need free vring descriptors */
+	if (unlikely(need > 0)) {
+		virtio_xmit_cleanup_packed_vec(vq);
+		need = slots - vq->vq_free_cnt;
+		if (unlikely(need > 0)) {
+			PMD_TX_LOG(ERR,
+				   "No free tx descriptors to transmit");
+			return -1;
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1);
+
+	txvq->stats.bytes += txm->pkt_len;
+	return 0;
+}
+
+uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			uint16_t nb_pkts)
+{
+	struct virtnet_tx *txvq = tx_queue;
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t nb_tx = 0;
+	uint16_t remained;
+
+	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
+		return nb_tx;
+
+	if (unlikely(nb_pkts < 1))
+		return nb_pkts;
+
+	PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+	if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh)
+		virtio_xmit_cleanup_packed_vec(vq);
+
+	remained = RTE_MIN(nb_pkts, vq->vq_free_cnt);
+
+	while (remained) {
+		if (remained >= PACKED_BATCH_SIZE) {
+			if (!virtqueue_enqueue_batch_packed_vec(txvq,
+						&tx_pkts[nb_tx])) {
+				nb_tx += PACKED_BATCH_SIZE;
+				remained -= PACKED_BATCH_SIZE;
+				continue;
+			}
+		}
+		if (!virtqueue_enqueue_single_packed_vec(txvq,
+					tx_pkts[nb_tx])) {
+			nb_tx++;
+			remained--;
+			continue;
+		}
+		break;
+	};
+
+	txvq->stats.packets += nb_tx;
+
+	if (likely(nb_tx)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_TX_LOG(DEBUG, "Notified backend after xmit");
+		}
+	}
+
+	return nb_tx;
+}
+
 /* Optionally fill offload information in structure */
 static inline int
 virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v9 8/9] net/virtio: add election for vectorized path
  2020-04-24  9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu
                     ` (6 preceding siblings ...)
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu
@ 2020-04-24  9:24   ` Marvin Liu
  2020-04-24 13:26     ` Maxime Coquelin
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 9/9] doc: add packed " Marvin Liu
  8 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-24  9:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: dev, harry.van.haaren, Marvin Liu

Rewrite the vectorized path selection logic. The default setting comes from
the vectorized devarg, then each criterion is checked.

Packed ring vectorized path needs:
    AVX512F and required extensions are supported by compiler and host
    VERSION_1 and IN_ORDER features are negotiated
    mergeable feature is not negotiated
    LRO offloading is disabled

Split ring vectorized Rx path needs:
    mergeable and IN_ORDER features are not negotiated
    LRO, checksum and VLAN strip offloads are disabled

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 0a69a4db1..8a9545dd8 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1523,9 +1523,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	if (vtpci_packed_queue(hw)) {
 		PMD_INIT_LOG(INFO,
 			"virtio: using packed ring %s Tx path on port %u",
-			hw->use_inorder_tx ? "inorder" : "standard",
+			hw->use_vec_tx ? "vectorized" : "standard",
 			eth_dev->data->port_id);
-		eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
+		if (hw->use_vec_tx)
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec;
+		else
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
 	} else {
 		if (hw->use_inorder_tx) {
 			PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u",
@@ -1539,7 +1542,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+		if (hw->use_vec_rx) {
+			PMD_INIT_LOG(INFO,
+				"virtio: using packed ring vectorized Rx path on port %u",
+				eth_dev->data->port_id);
+			eth_dev->rx_pkt_burst =
+				&virtio_recv_pkts_packed_vec;
+		} else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
 			PMD_INIT_LOG(INFO,
 				"virtio: using packed ring mergeable buffer Rx path on port %u",
 				eth_dev->data->port_id);
@@ -1952,8 +1961,17 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 		goto err_virtio_init;
 
 	if (vectorized) {
-		if (!vtpci_packed_queue(hw))
+		if (!vtpci_packed_queue(hw)) {
+			hw->use_vec_rx = 1;
+		} else {
+#if !defined(CC_AVX512_SUPPORT)
+			PMD_DRV_LOG(INFO,
+				"building environment do not support packed ring vectorized");
+#else
 			hw->use_vec_rx = 1;
+			hw->use_vec_tx = 1;
+#endif
+		}
 	}
 
 	hw->opened = true;
@@ -2099,11 +2117,10 @@ virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa,
 		}
 	}
 
-	if (vectorized &&
-		rte_kvargs_count(kvlist, VIRTIO_ARG_VECTORIZED) == 1) {
+	if (vectorized && rte_kvargs_count(kvlist, VIRTIO_ARG_VECTORIZED) == 1) {
 		ret = rte_kvargs_process(kvlist,
-				VIRTIO_ARG_VECTORIZED,
-				vectorized_check_handler, vectorized);
+					VIRTIO_ARG_VECTORIZED,
+					vectorized_check_handler, vectorized);
 		if (ret < 0) {
 			PMD_INIT_LOG(ERR, "Failed to parse %s",
 					VIRTIO_ARG_VECTORIZED);
@@ -2288,31 +2305,61 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 			return -EBUSY;
 		}
 
-	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
-		hw->use_inorder_tx = 1;
-		hw->use_inorder_rx = 1;
-		hw->use_vec_rx = 0;
-	}
-
 	if (vtpci_packed_queue(hw)) {
-		hw->use_vec_rx = 0;
-		hw->use_inorder_rx = 0;
-	}
+		if ((hw->use_vec_rx || hw->use_vec_tx) &&
+		    (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
+		     !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) ||
+		     !vtpci_with_feature(hw, VIRTIO_F_VERSION_1))) {
+			PMD_DRV_LOG(INFO,
+				"disabled packed ring vectorized path for requirements not met");
+			hw->use_vec_rx = 0;
+			hw->use_vec_tx = 0;
+		}
 
+		if (hw->use_vec_rx) {
+			if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+				PMD_DRV_LOG(INFO,
+					"disabled packed ring vectorized rx for mrg_rxbuf enabled");
+				hw->use_vec_rx = 0;
+			}
+
+			if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) {
+				PMD_DRV_LOG(INFO,
+					"disabled packed ring vectorized rx for TCP_LRO enabled");
+				hw->use_vec_rx = 0;
+			}
+		}
+	} else {
+		if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
+			hw->use_inorder_tx = 1;
+			hw->use_inorder_rx = 1;
+			hw->use_vec_rx = 0;
+		}
+
+		if (hw->use_vec_rx) {
 #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
-	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
-		hw->use_vec_rx = 0;
-	}
+			if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized path for requirement not met");
+				hw->use_vec_rx = 0;
+			}
 #endif
-	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
-		hw->use_vec_rx = 0;
-	}
+			if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized rx for mrg_rxbuf enabled");
+				hw->use_vec_rx = 0;
+			}
 
-	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
-			   DEV_RX_OFFLOAD_TCP_CKSUM |
-			   DEV_RX_OFFLOAD_TCP_LRO |
-			   DEV_RX_OFFLOAD_VLAN_STRIP))
-		hw->use_vec_rx = 0;
+			if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
+					   DEV_RX_OFFLOAD_TCP_CKSUM |
+					   DEV_RX_OFFLOAD_TCP_LRO |
+					   DEV_RX_OFFLOAD_VLAN_STRIP)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized rx for offloading enabled");
+				hw->use_vec_rx = 0;
+			}
+		}
+	}
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v9 9/9] doc: add packed vectorized path
  2020-04-24  9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu
                     ` (7 preceding siblings ...)
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 8/9] net/virtio: add election for vectorized path Marvin Liu
@ 2020-04-24  9:24   ` Marvin Liu
  2020-04-24 13:31     ` Maxime Coquelin
  8 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-24  9:24 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang
  Cc: dev, harry.van.haaren, Marvin Liu

Document packed virtqueue vectorized path selection logic in virtio net
PMD.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index d59add23e..dbcf49ae1 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -482,6 +482,13 @@ according to below configuration:
    both negotiated, this path will be selected.
 #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and
    Rx mergeable is not negotiated, this path will be selected.
+#. Packed virtqueue vectorized Rx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated &&
+   TCP_LRO Rx offloading is disabled && vectorized option enabled,
+   this path will be selected.
+#. Packed virtqueue vectorized Tx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && vectorized option enabled,
+   this path will be selected.
 
 Rx/Tx callbacks of each Virtio path
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -504,6 +511,8 @@ are shown in below table:
    Packed virtqueue non-meregable path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed
    Packed virtqueue in-order mergeable path     virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed
    Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed           virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Rx path          virtio_recv_pkts_packed_vec       virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Tx path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed_vec
    ============================================ ================================= ========================
 
 Virtio paths Support Status from Release to Release
@@ -521,20 +530,22 @@ All virtio paths support status are shown in below table:
 
 .. table:: Virtio Paths and Releases
 
-   ============================================ ============= ============= =============
-                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11
-   ============================================ ============= ============= =============
-   Split virtqueue mergeable path                     Y             Y             Y
-   Split virtqueue non-mergeable path                 Y             Y             Y
-   Split virtqueue vectorized Rx path                 Y             Y             Y
-   Split virtqueue simple Tx path                     Y             N             N
-   Split virtqueue in-order mergeable path                          Y             Y
-   Split virtqueue in-order non-mergeable path                      Y             Y
-   Packed virtqueue mergeable path                                                Y
-   Packed virtqueue non-mergeable path                                            Y
-   Packed virtqueue in-order mergeable path                                       Y
-   Packed virtqueue in-order non-mergeable path                                   Y
-   ============================================ ============= ============= =============
+   ============================================ ============= ============= ============= =======
+                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~
+   ============================================ ============= ============= ============= =======
+   Split virtqueue mergeable path                     Y             Y             Y          Y
+   Split virtqueue non-mergeable path                 Y             Y             Y          Y
+   Split virtqueue vectorized Rx path                 Y             Y             Y          Y
+   Split virtqueue simple Tx path                     Y             N             N          N
+   Split virtqueue in-order mergeable path                          Y             Y          Y
+   Split virtqueue in-order non-mergeable path                      Y             Y          Y
+   Packed virtqueue mergeable path                                                Y          Y
+   Packed virtqueue non-mergeable path                                            Y          Y
+   Packed virtqueue in-order mergeable path                                       Y          Y
+   Packed virtqueue in-order non-mergeable path                                   Y          Y
+   Packed virtqueue vectorized Rx path                                                       Y
+   Packed virtqueue vectorized Tx path                                                       Y
+   ============================================ ============= ============= ============= =======
 
 QEMU Support Status
 ~~~~~~~~~~~~~~~~~~~
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v9 3/9] net/virtio: add vectorized devarg
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 3/9] net/virtio: add vectorized devarg Marvin Liu
@ 2020-04-24 11:27     ` Maxime Coquelin
  0 siblings, 0 replies; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-24 11:27 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev, harry.van.haaren



On 4/24/20 11:24 AM, Marvin Liu wrote:
> Previously, virtio split ring vectorized path was enabled by default.
> This is not suitable for everyone because that path does not follow
> virtio spec. Add new devarg for virtio vectorized path selection. By
> default vectorized path is disabled.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 


Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks!
Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v9 4/9] net/virtio-user: add vectorized devarg
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 4/9] net/virtio-user: " Marvin Liu
@ 2020-04-24 11:29     ` Maxime Coquelin
  0 siblings, 0 replies; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-24 11:29 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev, harry.van.haaren



On 4/24/20 11:24 AM, Marvin Liu wrote:
> Add new devarg for virtio user device vectorized path selection. By
> default vectorized path is disabled.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 
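
Just to make the expected usage explicit (everything in the example below
except the new "vectorized" key is an illustrative placeholder): the
vectorized path has to be requested explicitly in the vdev string, along
the lines of

  --vdev=net_virtio_user0,path=/dev/vhost-net,queues=1,vectorized=1

and without the devarg the existing scalar paths keep being selected.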

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu
@ 2020-04-24 11:51     ` Maxime Coquelin
  2020-04-24 13:12       ` Liu, Yong
  0 siblings, 1 reply; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-24 11:51 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev, harry.van.haaren



On 4/24/20 11:24 AM, Marvin Liu wrote:
> Optimize packed ring Rx path with SIMD instructions. Solution of
> optimization is pretty like vhost, is that split path into batch and
> single functions. Batch function is further optimized by AVX512
> instructions. Also pad desc extra structure to 16 bytes aligned, thus
> four elements will be saved in one batch.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 
> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> index c9edb84ee..102b1deab 100644
> --- a/drivers/net/virtio/Makefile
> +++ b/drivers/net/virtio/Makefile
> @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
>  endif
>  
> +ifneq ($(FORCE_DISABLE_AVX512), y)
> +	CC_AVX512_SUPPORT=\
> +	$(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
> +	sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
> +	grep -q AVX512 && echo 1)
> +endif
> +
> +ifeq ($(CC_AVX512_SUPPORT), 1)
> +CFLAGS += -DCC_AVX512_SUPPORT
> +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
> +
> +ifeq ($(RTE_TOOLCHAIN), gcc)
> +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
> +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
> +endif
> +endif
> +
> +ifeq ($(RTE_TOOLCHAIN), clang)
> +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1)
> +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
> +endif
> +endif
> +
> +ifeq ($(RTE_TOOLCHAIN), icc)
> +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
> +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
> +endif
> +endif
> +
> +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl
> +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
> +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
> +endif
> +endif
> +
>  ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c
> diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
> index 15150eea1..8e68c3039 100644
> --- a/drivers/net/virtio/meson.build
> +++ b/drivers/net/virtio/meson.build
> @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c',
>  deps += ['kvargs', 'bus_pci']
>  
>  if arch_subdir == 'x86'
> +	if '-mno-avx512f' not in machine_args
> +		if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
> +			cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl']
> +			cflags += ['-DCC_AVX512_SUPPORT']
> +			if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> +				cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> +			elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> +				cflags += '-DVHOST_CLANG_UNROLL_PRAGMA'
> +			elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
> +				cflags += '-DVHOST_ICC_UNROLL_PRAGMA'
> +			endif
> +			sources += files('virtio_rxtx_packed_avx.c')
> +		endif
> +	endif
>  	sources += files('virtio_rxtx_simple_sse.c')
>  elif arch_subdir == 'ppc'
>  	sources += files('virtio_rxtx_simple_altivec.c')
> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
> index febaf17a8..5c112cac7 100644
> --- a/drivers/net/virtio/virtio_ethdev.h
> +++ b/drivers/net/virtio/virtio_ethdev.h
> @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts,
>  uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
>  		uint16_t nb_pkts);
>  
> +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> +		uint16_t nb_pkts);
> +
>  int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
>  
>  void virtio_interrupt_handler(void *param);
> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
> index 84f4cf946..c9b6e7844 100644
> --- a/drivers/net/virtio/virtio_rxtx.c
> +++ b/drivers/net/virtio/virtio_rxtx.c
> @@ -2329,3 +2329,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
>  
>  	return nb_tx;
>  }
> +
> +__rte_weak uint16_t
> +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
> +			    struct rte_mbuf **rx_pkts __rte_unused,
> +			    uint16_t nb_pkts __rte_unused)
> +{
> +	return 0;
> +}
> diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> new file mode 100644
> index 000000000..8a7b459eb
> --- /dev/null
> +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> @@ -0,0 +1,374 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <errno.h>
> +
> +#include <rte_net.h>
> +
> +#include "virtio_logs.h"
> +#include "virtio_ethdev.h"
> +#include "virtio_pci.h"
> +#include "virtqueue.h"
> +
> +#define BYTE_SIZE 8
> +/* flag bits offset in packed ring desc higher 64bits */
> +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> +
> +#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \
> +	FLAGS_BITS_OFFSET)
> +
> +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
> +	sizeof(struct vring_packed_desc))
> +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
> +
> +#ifdef VIRTIO_GCC_UNROLL_PRAGMA
> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
> +	for (iter = val; iter < size; iter++)
> +#endif
> +
> +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
> +	for (iter = val; iter < size; iter++)
> +#endif
> +
> +#ifdef VIRTIO_ICC_UNROLL_PRAGMA
> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
> +	for (iter = val; iter < size; iter++)
> +#endif
> +
> +#ifndef virtio_for_each_try_unroll
> +#define virtio_for_each_try_unroll(iter, val, num) \
> +	for (iter = val; iter < num; iter++)
> +#endif
> +
> +static inline void
> +virtio_update_batch_stats(struct virtnet_stats *stats,
> +			  uint16_t pkt_len1,
> +			  uint16_t pkt_len2,
> +			  uint16_t pkt_len3,
> +			  uint16_t pkt_len4)
> +{
> +	stats->bytes += pkt_len1;
> +	stats->bytes += pkt_len2;
> +	stats->bytes += pkt_len3;
> +	stats->bytes += pkt_len4;
> +}
> +
> +/* Optionally fill offload information in structure */
> +static inline int
> +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
> +{
> +	struct rte_net_hdr_lens hdr_lens;
> +	uint32_t hdrlen, ptype;
> +	int l4_supported = 0;
> +
> +	/* nothing to do */
> +	if (hdr->flags == 0)
> +		return 0;

IIUC, the only difference with the non-vectorized version is the GSO
support removed here.
gso_type being in the same cacheline as flags in virtio_net_hdr, I don't
think checking the performance gain is worth the added maintenance
effort due to code duplication.

Please prove I'm wrong, otherwise please move virtio_rx_offload() into a
header and use it here. An alternative, if it really impacts performance, is
to put all the shared code in a dedicated function that can be re-used
by both implementations.
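
As a rough sketch of that second option (the helper name and placement are
only a suggestion, not existing API; it assumes the usual virtio Rx includes
such as rte_mbuf.h, rte_net.h and rte_ip.h), the shared part could look like:

/* shared between virtio_rx_offload() and the vectorized variant */
static inline void
virtio_rx_update_offload_common(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
{
	struct rte_net_hdr_lens hdr_lens;
	uint32_t hdrlen, ptype;
	int l4_supported = 0;

	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
	m->packet_type = ptype;
	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
		l4_supported = 1;

	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
		if (hdr->csum_start <= hdrlen && l4_supported) {
			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
		} else {
			/* unknown proto or tunnel: software checksum, with
			 * the same caveats as the existing scalar code
			 */
			uint16_t csum = 0, off;

			rte_raw_cksum_mbuf(m, hdr->csum_start,
				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
				&csum);
			if (likely(csum != 0xffff))
				csum = ~csum;
			off = hdr->csum_offset + hdr->csum_start;
			if (rte_pktmbuf_data_len(m) >= off + 1)
				*rte_pktmbuf_mtod_offset(m, uint16_t *,
					off) = csum;
		}
	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
	}
}

virtio_rx_offload() would then keep only the GSO handling on top of this,
and virtio_vec_rx_offload() would reduce to the "hdr->flags == 0" early
return plus a call to it.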

> +
> +	/* GSO not support in vec path, skip check */
> +	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
> +
> +	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
> +	m->packet_type = ptype;
> +	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
> +	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
> +	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
> +		l4_supported = 1;
> +
> +	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> +		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
> +		if (hdr->csum_start <= hdrlen && l4_supported) {
> +			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
> +		} else {
> +			/* Unknown proto or tunnel, do sw cksum. We can assume
> +			 * the cksum field is in the first segment since the
> +			 * buffers we provided to the host are large enough.
> +			 * In case of SCTP, this will be wrong since it's a CRC
> +			 * but there's nothing we can do.
> +			 */
> +			uint16_t csum = 0, off;
> +
> +			rte_raw_cksum_mbuf(m, hdr->csum_start,
> +				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
> +				&csum);
> +			if (likely(csum != 0xffff))
> +				csum = ~csum;
> +			off = hdr->csum_offset + hdr->csum_start;
> +			if (rte_pktmbuf_data_len(m) >= off + 1)
> +				*rte_pktmbuf_mtod_offset(m, uint16_t *,
> +					off) = csum;
> +		}
> +	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
> +		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
> +	}
> +
> +	return 0;
> +}

Otherwise, the patch looks okay to me.

Thanks,
Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v9 6/9] net/virtio: reuse packed ring xmit functions
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu
@ 2020-04-24 12:01     ` Maxime Coquelin
  0 siblings, 0 replies; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-24 12:01 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev, harry.van.haaren



On 4/24/20 11:24 AM, Marvin Liu wrote:
> Move xmit offload and packed ring xmit enqueue function to header file.
> These functions will be reused by packed ring vectorized Tx function.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu
@ 2020-04-24 12:29     ` Maxime Coquelin
  2020-04-24 13:33       ` Liu, Yong
  0 siblings, 1 reply; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-24 12:29 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev, harry.van.haaren



On 4/24/20 11:24 AM, Marvin Liu wrote:
> Optimize packed ring Tx path alike Rx path. Split Tx path into batch and

s/alike/like/ ?

> single Tx functions. Batch function is further optimized by AVX512
> instructions.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 
> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
> index 5c112cac7..b7d52d497 100644
> --- a/drivers/net/virtio/virtio_ethdev.h
> +++ b/drivers/net/virtio/virtio_ethdev.h
> @@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
>  uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
>  		uint16_t nb_pkts);
>  
> +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
> +		uint16_t nb_pkts);
> +
>  int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
>  
>  void virtio_interrupt_handler(void *param);
> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
> index cf18fe564..f82fe8d64 100644
> --- a/drivers/net/virtio/virtio_rxtx.c
> +++ b/drivers/net/virtio/virtio_rxtx.c
> @@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
>  {
>  	return 0;
>  }
> +
> +__rte_weak uint16_t
> +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused,
> +			    struct rte_mbuf **tx_pkts __rte_unused,
> +			    uint16_t nb_pkts __rte_unused)
> +{
> +	return 0;
> +}
> diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> index 8a7b459eb..c023ace4e 100644
> --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c
> +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> @@ -23,6 +23,24 @@
>  #define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \
>  	FLAGS_BITS_OFFSET)
>  
> +/* reference count offset in mbuf rearm data */
> +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> +/* segment number offset in mbuf rearm data */
> +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> +
> +/* default rearm data */
> +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> +	1ULL << REFCNT_BITS_OFFSET)
> +
> +/* id bits offset in packed ring desc higher 64bits */
> +#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \
> +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> +
> +/* net hdr short size mask */
> +#define NET_HDR_MASK 0x3F
> +
>  #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
>  	sizeof(struct vring_packed_desc))
>  #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
> @@ -47,6 +65,48 @@
>  	for (iter = val; iter < num; iter++)
>  #endif
>  
> +static inline void
> +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq)
> +{
> +	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
> +	struct vq_desc_extra *dxp;
> +	uint16_t used_idx, id, curr_id, free_cnt = 0;
> +	uint16_t size = vq->vq_nentries;
> +	struct rte_mbuf *mbufs[size];
> +	uint16_t nb_mbuf = 0, i;
> +
> +	used_idx = vq->vq_used_cons_idx;
> +
> +	if (!desc_is_used(&desc[used_idx], vq))
> +		return;
> +
> +	id = desc[used_idx].id;
> +
> +	do {
> +		curr_id = used_idx;
> +		dxp = &vq->vq_descx[used_idx];
> +		used_idx += dxp->ndescs;
> +		free_cnt += dxp->ndescs;
> +
> +		if (dxp->cookie != NULL) {
> +			mbufs[nb_mbuf] = dxp->cookie;
> +			dxp->cookie = NULL;
> +			nb_mbuf++;
> +		}
> +
> +		if (used_idx >= size) {
> +			used_idx -= size;
> +			vq->vq_packed.used_wrap_counter ^= 1;
> +		}
> +	} while (curr_id != id);
> +
> +	for (i = 0; i < nb_mbuf; i++)
> +		rte_pktmbuf_free(mbufs[i]);
> +
> +	vq->vq_used_cons_idx = used_idx;
> +	vq->vq_free_cnt += free_cnt;
> +}
> +


I think you can re-use the inlined non-vectorized cleanup function here.
Or use your implementation in the non-vectorized path.
BTW, do you know whether we have to pass the num argument in the
non-vectorized case? I'm not sure I remember.

Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path
  2020-04-24 11:51     ` Maxime Coquelin
@ 2020-04-24 13:12       ` Liu, Yong
  2020-04-24 13:33         ` Maxime Coquelin
  0 siblings, 1 reply; 162+ messages in thread
From: Liu, Yong @ 2020-04-24 13:12 UTC (permalink / raw)
  To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Van Haaren, Harry



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Friday, April 24, 2020 7:52 PM
> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com>
> Subject: Re: [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path
> 
> 
> 
> On 4/24/20 11:24 AM, Marvin Liu wrote:
> > Optimize packed ring Rx path with SIMD instructions. Solution of
> > optimization is pretty like vhost, is that split path into batch and
> > single functions. Batch function is further optimized by AVX512
> > instructions. Also pad desc extra structure to 16 bytes aligned, thus
> > four elements will be saved in one batch.
> >
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> > index c9edb84ee..102b1deab 100644
> > --- a/drivers/net/virtio/Makefile
> > +++ b/drivers/net/virtio/Makefile
> > @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM)
> $(CONFIG_RTE_ARCH_ARM64)),)
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
> >  endif
> >
> > +ifneq ($(FORCE_DISABLE_AVX512), y)
> > +	CC_AVX512_SUPPORT=\
> > +	$(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
> > +	sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
> > +	grep -q AVX512 && echo 1)
> > +endif
> > +
> > +ifeq ($(CC_AVX512_SUPPORT), 1)
> > +CFLAGS += -DCC_AVX512_SUPPORT
> > +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
> > +
> > +ifeq ($(RTE_TOOLCHAIN), gcc)
> > +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
> > +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
> > +endif
> > +endif
> > +
> > +ifeq ($(RTE_TOOLCHAIN), clang)
> > +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -
> ge 37 && echo 1), 1)
> > +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
> > +endif
> > +endif
> > +
> > +ifeq ($(RTE_TOOLCHAIN), icc)
> > +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
> > +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
> > +endif
> > +endif
> > +
> > +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl
> > +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
> > +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
> > +endif
> > +endif
> > +
> >  ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c
> > diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
> > index 15150eea1..8e68c3039 100644
> > --- a/drivers/net/virtio/meson.build
> > +++ b/drivers/net/virtio/meson.build
> > @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c',
> >  deps += ['kvargs', 'bus_pci']
> >
> >  if arch_subdir == 'x86'
> > +	if '-mno-avx512f' not in machine_args
> > +		if cc.has_argument('-mavx512f') and cc.has_argument('-
> mavx512vl') and cc.has_argument('-mavx512bw')
> > +			cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl']
> > +			cflags += ['-DCC_AVX512_SUPPORT']
> > +			if (toolchain == 'gcc' and
> cc.version().version_compare('>=8.3.0'))
> > +				cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> > +			elif (toolchain == 'clang' and
> cc.version().version_compare('>=3.7.0'))
> > +				cflags += '-
> DVHOST_CLANG_UNROLL_PRAGMA'
> > +			elif (toolchain == 'icc' and
> cc.version().version_compare('>=16.0.0'))
> > +				cflags += '-DVHOST_ICC_UNROLL_PRAGMA'
> > +			endif
> > +			sources += files('virtio_rxtx_packed_avx.c')
> > +		endif
> > +	endif
> >  	sources += files('virtio_rxtx_simple_sse.c')
> >  elif arch_subdir == 'ppc'
> >  	sources += files('virtio_rxtx_simple_altivec.c')
> > diff --git a/drivers/net/virtio/virtio_ethdev.h
> b/drivers/net/virtio/virtio_ethdev.h
> > index febaf17a8..5c112cac7 100644
> > --- a/drivers/net/virtio/virtio_ethdev.h
> > +++ b/drivers/net/virtio/virtio_ethdev.h
> > @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue,
> struct rte_mbuf **tx_pkts,
> >  uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> >  		uint16_t nb_pkts);
> >
> > +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf
> **rx_pkts,
> > +		uint16_t nb_pkts);
> > +
> >  int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
> >
> >  void virtio_interrupt_handler(void *param);
> > diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
> > index 84f4cf946..c9b6e7844 100644
> > --- a/drivers/net/virtio/virtio_rxtx.c
> > +++ b/drivers/net/virtio/virtio_rxtx.c
> > @@ -2329,3 +2329,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
> >
> >  	return nb_tx;
> >  }
> > +
> > +__rte_weak uint16_t
> > +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
> > +			    struct rte_mbuf **rx_pkts __rte_unused,
> > +			    uint16_t nb_pkts __rte_unused)
> > +{
> > +	return 0;
> > +}
> > diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c
> b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> > new file mode 100644
> > index 000000000..8a7b459eb
> > --- /dev/null
> > +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> > @@ -0,0 +1,374 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2020 Intel Corporation
> > + */
> > +
> > +#include <stdint.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <errno.h>
> > +
> > +#include <rte_net.h>
> > +
> > +#include "virtio_logs.h"
> > +#include "virtio_ethdev.h"
> > +#include "virtio_pci.h"
> > +#include "virtqueue.h"
> > +
> > +#define BYTE_SIZE 8
> > +/* flag bits offset in packed ring desc higher 64bits */
> > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> > +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> > +
> > +#define PACKED_FLAGS_MASK ((0ULL |
> VRING_PACKED_DESC_F_AVAIL_USED) << \
> > +	FLAGS_BITS_OFFSET)
> > +
> > +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
> > +	sizeof(struct vring_packed_desc))
> > +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
> > +
> > +#ifdef VIRTIO_GCC_UNROLL_PRAGMA
> > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4")
> \
> > +	for (iter = val; iter < size; iter++)
> > +#endif
> > +
> > +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
> > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
> > +	for (iter = val; iter < size; iter++)
> > +#endif
> > +
> > +#ifdef VIRTIO_ICC_UNROLL_PRAGMA
> > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
> > +	for (iter = val; iter < size; iter++)
> > +#endif
> > +
> > +#ifndef virtio_for_each_try_unroll
> > +#define virtio_for_each_try_unroll(iter, val, num) \
> > +	for (iter = val; iter < num; iter++)
> > +#endif
> > +
> > +static inline void
> > +virtio_update_batch_stats(struct virtnet_stats *stats,
> > +			  uint16_t pkt_len1,
> > +			  uint16_t pkt_len2,
> > +			  uint16_t pkt_len3,
> > +			  uint16_t pkt_len4)
> > +{
> > +	stats->bytes += pkt_len1;
> > +	stats->bytes += pkt_len2;
> > +	stats->bytes += pkt_len3;
> > +	stats->bytes += pkt_len4;
> > +}
> > +
> > +/* Optionally fill offload information in structure */
> > +static inline int
> > +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
> > +{
> > +	struct rte_net_hdr_lens hdr_lens;
> > +	uint32_t hdrlen, ptype;
> > +	int l4_supported = 0;
> > +
> > +	/* nothing to do */
> > +	if (hdr->flags == 0)
> > +		return 0;
> 
> IIUC, the only difference with the non-vectorized version is the GSO
> support removed here.
> gso_type being in the same cacheline as flags in virtio_net_hdr, I don't
> think checking the performance gain is worth the added maintainance
> effort due to code duplication.
> 
> Please prove I'm wrong, otherwise please move virtio_rx_offload() in a
> header and use it here. Alternative if it really imapcts performance is
> to put all the shared code in a dedicated function that can be re-used
> by both implementations.
> 

Maxime,
There won't be much performance difference between the non-vectorized and vectorized versions.
The reason for adding a special vectorized version is to skip the handling of garbage GSO packets.
As all descs are handled in a batch, the whole batch would need to be reverted when a garbage packet is found.
That would introduce complicated logic in the vectorized path.

Regards,
Marvin

> > +
> > +	/* GSO not support in vec path, skip check */
> > +	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
> > +
> > +	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
> > +	m->packet_type = ptype;
> > +	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
> > +	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
> > +	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
> > +		l4_supported = 1;
> > +
> > +	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> > +		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
> > +		if (hdr->csum_start <= hdrlen && l4_supported) {
> > +			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
> > +		} else {
> > +			/* Unknown proto or tunnel, do sw cksum. We can
> assume
> > +			 * the cksum field is in the first segment since the
> > +			 * buffers we provided to the host are large enough.
> > +			 * In case of SCTP, this will be wrong since it's a CRC
> > +			 * but there's nothing we can do.
> > +			 */
> > +			uint16_t csum = 0, off;
> > +
> > +			rte_raw_cksum_mbuf(m, hdr->csum_start,
> > +				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
> > +				&csum);
> > +			if (likely(csum != 0xffff))
> > +				csum = ~csum;
> > +			off = hdr->csum_offset + hdr->csum_start;
> > +			if (rte_pktmbuf_data_len(m) >= off + 1)
> > +				*rte_pktmbuf_mtod_offset(m, uint16_t *,
> > +					off) = csum;
> > +		}
> > +	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID &&
> l4_supported) {
> > +		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
> > +	}
> > +
> > +	return 0;
> > +}
> 
> Otherwise, the patch looks okay to me.
> 
> Thanks,
> Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v9 8/9] net/virtio: add election for vectorized path
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 8/9] net/virtio: add election for vectorized path Marvin Liu
@ 2020-04-24 13:26     ` Maxime Coquelin
  0 siblings, 0 replies; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-24 13:26 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev, harry.van.haaren



On 4/24/20 11:24 AM, Marvin Liu wrote:
> Rewrite the vectorized path selection logic. The default setting comes from
> the vectorized devarg, then each criterion is checked.
> 
> Packed ring vectorized path needs:
>     AVX512F and required extensions are supported by compiler and host
>     VERSION_1 and IN_ORDER features are negotiated
>     mergeable feature is not negotiated
>     LRO offloading is disabled
> 
> Split ring vectorized Rx path needs:
>     mergeable and IN_ORDER features are not negotiated
>     LRO, checksum and VLAN strip offloads are disabled
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v9 9/9] doc: add packed vectorized path
  2020-04-24  9:24   ` [dpdk-dev] [PATCH v9 9/9] doc: add packed " Marvin Liu
@ 2020-04-24 13:31     ` Maxime Coquelin
  0 siblings, 0 replies; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-24 13:31 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev, harry.van.haaren



On 4/24/20 11:24 AM, Marvin Liu wrote:
> Document packed virtqueue vectorized path selection logic in virtio net
> PMD.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path
  2020-04-24 13:12       ` Liu, Yong
@ 2020-04-24 13:33         ` Maxime Coquelin
  2020-04-24 13:40           ` Liu, Yong
  0 siblings, 1 reply; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-24 13:33 UTC (permalink / raw)
  To: Liu, Yong, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Van Haaren, Harry



On 4/24/20 3:12 PM, Liu, Yong wrote:
>> IIUC, the only difference with the non-vectorized version is the GSO
>> support removed here.
>> gso_type being in the same cacheline as flags in virtio_net_hdr, I don't
>> think checking the performance gain is worth the added maintainance
>> effort due to code duplication.
>>
>> Please prove I'm wrong, otherwise please move virtio_rx_offload() in a
>> header and use it here. Alternative if it really imapcts performance is
>> to put all the shared code in a dedicated function that can be re-used
>> by both implementations.
>>
> Maxime,
> It won't be much performance difference between non-vectorized and vectorized.
> The reason to add special vectorized version is for skipping the handling of garbage GSO packets. 
> As all descs have been handled in batch, it is needed to revert when found garbage packets. 
> That will introduce complicated logic in vectorized path.


What do you mean by garbage packet?
Is it really good to just ignore such issues?

Thanks,
Maxime

> Regards,
> Marvin
> 


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path
  2020-04-24 12:29     ` Maxime Coquelin
@ 2020-04-24 13:33       ` Liu, Yong
  2020-04-24 13:35         ` Maxime Coquelin
  0 siblings, 1 reply; 162+ messages in thread
From: Liu, Yong @ 2020-04-24 13:33 UTC (permalink / raw)
  To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Van Haaren, Harry



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Friday, April 24, 2020 8:30 PM
> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com>
> Subject: Re: [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path
> 
> 
> 
> On 4/24/20 11:24 AM, Marvin Liu wrote:
> > Optimize packed ring Tx path alike Rx path. Split Tx path into batch and
> 
> s/alike/like/ ?
> 
> > single Tx functions. Batch function is further optimized by AVX512
> > instructions.
> >
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/drivers/net/virtio/virtio_ethdev.h
> b/drivers/net/virtio/virtio_ethdev.h
> > index 5c112cac7..b7d52d497 100644
> > --- a/drivers/net/virtio/virtio_ethdev.h
> > +++ b/drivers/net/virtio/virtio_ethdev.h
> > @@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue,
> struct rte_mbuf **rx_pkts,
> >  uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf
> **rx_pkts,
> >  		uint16_t nb_pkts);
> >
> > +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf
> **tx_pkts,
> > +		uint16_t nb_pkts);
> > +
> >  int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
> >
> >  void virtio_interrupt_handler(void *param);
> > diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
> > index cf18fe564..f82fe8d64 100644
> > --- a/drivers/net/virtio/virtio_rxtx.c
> > +++ b/drivers/net/virtio/virtio_rxtx.c
> > @@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue
> __rte_unused,
> >  {
> >  	return 0;
> >  }
> > +
> > +__rte_weak uint16_t
> > +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused,
> > +			    struct rte_mbuf **tx_pkts __rte_unused,
> > +			    uint16_t nb_pkts __rte_unused)
> > +{
> > +	return 0;
> > +}
> > diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c
> b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> > index 8a7b459eb..c023ace4e 100644
> > --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c
> > +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> > @@ -23,6 +23,24 @@
> >  #define PACKED_FLAGS_MASK ((0ULL |
> VRING_PACKED_DESC_F_AVAIL_USED) << \
> >  	FLAGS_BITS_OFFSET)
> >
> > +/* reference count offset in mbuf rearm data */
> > +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> > +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +/* segment number offset in mbuf rearm data */
> > +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> > +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +
> > +/* default rearm data */
> > +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> > +	1ULL << REFCNT_BITS_OFFSET)
> > +
> > +/* id bits offset in packed ring desc higher 64bits */
> > +#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \
> > +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> > +
> > +/* net hdr short size mask */
> > +#define NET_HDR_MASK 0x3F
> > +
> >  #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
> >  	sizeof(struct vring_packed_desc))
> >  #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
> > @@ -47,6 +65,48 @@
> >  	for (iter = val; iter < num; iter++)
> >  #endif
> >
> > +static inline void
> > +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq)
> > +{
> > +	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
> > +	struct vq_desc_extra *dxp;
> > +	uint16_t used_idx, id, curr_id, free_cnt = 0;
> > +	uint16_t size = vq->vq_nentries;
> > +	struct rte_mbuf *mbufs[size];
> > +	uint16_t nb_mbuf = 0, i;
> > +
> > +	used_idx = vq->vq_used_cons_idx;
> > +
> > +	if (!desc_is_used(&desc[used_idx], vq))
> > +		return;
> > +
> > +	id = desc[used_idx].id;
> > +
> > +	do {
> > +		curr_id = used_idx;
> > +		dxp = &vq->vq_descx[used_idx];
> > +		used_idx += dxp->ndescs;
> > +		free_cnt += dxp->ndescs;
> > +
> > +		if (dxp->cookie != NULL) {
> > +			mbufs[nb_mbuf] = dxp->cookie;
> > +			dxp->cookie = NULL;
> > +			nb_mbuf++;
> > +		}
> > +
> > +		if (used_idx >= size) {
> > +			used_idx -= size;
> > +			vq->vq_packed.used_wrap_counter ^= 1;
> > +		}
> > +	} while (curr_id != id);
> > +
> > +	for (i = 0; i < nb_mbuf; i++)
> > +		rte_pktmbuf_free(mbufs[i]);
> > +
> > +	vq->vq_used_cons_idx = used_idx;
> > +	vq->vq_free_cnt += free_cnt;
> > +}
> > +
> 
> 
> I think you can re-use the inlined non-vectorized cleanup function here.
> Or use your implementation in non-vectorized path.
> BTW, do you know we have to pass the num argument in non-vectorized
> case? I'm not sure to remember.
> 

Maxime,
This is a simplified version of the xmit cleanup function. It is based on the concept that the backend will update used ids in bursts, which also matches the frontend's requirement.
I just found that the original version works better in the loopback case. Will adapt it in the next version.

Thanks,
Marvin

> Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path
  2020-04-24 13:33       ` Liu, Yong
@ 2020-04-24 13:35         ` Maxime Coquelin
  2020-04-24 13:47           ` Liu, Yong
  0 siblings, 1 reply; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-24 13:35 UTC (permalink / raw)
  To: Liu, Yong, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Van Haaren, Harry



On 4/24/20 3:33 PM, Liu, Yong wrote:
> 
> 
>> -----Original Message-----
>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Sent: Friday, April 24, 2020 8:30 PM
>> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
>> Wang, Zhihong <zhihong.wang@intel.com>
>> Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com>
>> Subject: Re: [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path
>>
>>
>>
>> On 4/24/20 11:24 AM, Marvin Liu wrote:
>>> Optimize packed ring Tx path alike Rx path. Split Tx path into batch and
>>
>> s/alike/like/ ?
>>
>>> single Tx functions. Batch function is further optimized by AVX512
>>> instructions.
>>>
>>> Signed-off-by: Marvin Liu <yong.liu@intel.com>
>>>
>>> diff --git a/drivers/net/virtio/virtio_ethdev.h
>> b/drivers/net/virtio/virtio_ethdev.h
>>> index 5c112cac7..b7d52d497 100644
>>> --- a/drivers/net/virtio/virtio_ethdev.h
>>> +++ b/drivers/net/virtio/virtio_ethdev.h
>>> @@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue,
>> struct rte_mbuf **rx_pkts,
>>>  uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf
>> **rx_pkts,
>>>  		uint16_t nb_pkts);
>>>
>>> +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf
>> **tx_pkts,
>>> +		uint16_t nb_pkts);
>>> +
>>>  int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
>>>
>>>  void virtio_interrupt_handler(void *param);
>>> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
>>> index cf18fe564..f82fe8d64 100644
>>> --- a/drivers/net/virtio/virtio_rxtx.c
>>> +++ b/drivers/net/virtio/virtio_rxtx.c
>>> @@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue
>> __rte_unused,
>>>  {
>>>  	return 0;
>>>  }
>>> +
>>> +__rte_weak uint16_t
>>> +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused,
>>> +			    struct rte_mbuf **tx_pkts __rte_unused,
>>> +			    uint16_t nb_pkts __rte_unused)
>>> +{
>>> +	return 0;
>>> +}
>>> diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c
>> b/drivers/net/virtio/virtio_rxtx_packed_avx.c
>>> index 8a7b459eb..c023ace4e 100644
>>> --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c
>>> +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
>>> @@ -23,6 +23,24 @@
>>>  #define PACKED_FLAGS_MASK ((0ULL |
>> VRING_PACKED_DESC_F_AVAIL_USED) << \
>>>  	FLAGS_BITS_OFFSET)
>>>
>>> +/* reference count offset in mbuf rearm data */
>>> +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
>>> +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
>>> +/* segment number offset in mbuf rearm data */
>>> +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
>>> +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
>>> +
>>> +/* default rearm data */
>>> +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
>>> +	1ULL << REFCNT_BITS_OFFSET)
>>> +
>>> +/* id bits offset in packed ring desc higher 64bits */
>>> +#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \
>>> +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
>>> +
>>> +/* net hdr short size mask */
>>> +#define NET_HDR_MASK 0x3F
>>> +
>>>  #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
>>>  	sizeof(struct vring_packed_desc))
>>>  #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
>>> @@ -47,6 +65,48 @@
>>>  	for (iter = val; iter < num; iter++)
>>>  #endif
>>>
>>> +static inline void
>>> +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq)
>>> +{
>>> +	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
>>> +	struct vq_desc_extra *dxp;
>>> +	uint16_t used_idx, id, curr_id, free_cnt = 0;
>>> +	uint16_t size = vq->vq_nentries;
>>> +	struct rte_mbuf *mbufs[size];
>>> +	uint16_t nb_mbuf = 0, i;
>>> +
>>> +	used_idx = vq->vq_used_cons_idx;
>>> +
>>> +	if (!desc_is_used(&desc[used_idx], vq))
>>> +		return;
>>> +
>>> +	id = desc[used_idx].id;
>>> +
>>> +	do {
>>> +		curr_id = used_idx;
>>> +		dxp = &vq->vq_descx[used_idx];
>>> +		used_idx += dxp->ndescs;
>>> +		free_cnt += dxp->ndescs;
>>> +
>>> +		if (dxp->cookie != NULL) {
>>> +			mbufs[nb_mbuf] = dxp->cookie;
>>> +			dxp->cookie = NULL;
>>> +			nb_mbuf++;
>>> +		}
>>> +
>>> +		if (used_idx >= size) {
>>> +			used_idx -= size;
>>> +			vq->vq_packed.used_wrap_counter ^= 1;
>>> +		}
>>> +	} while (curr_id != id);
>>> +
>>> +	for (i = 0; i < nb_mbuf; i++)
>>> +		rte_pktmbuf_free(mbufs[i]);
>>> +
>>> +	vq->vq_used_cons_idx = used_idx;
>>> +	vq->vq_free_cnt += free_cnt;
>>> +}
>>> +
>>
>>
>> I think you can re-use the inlined non-vectorized cleanup function here.
>> Or use your implementation in non-vectorized path.
>> BTW, do you know we have to pass the num argument in non-vectorized
>> case? I'm not sure to remember.
>>
> 
> Maxime,
> This is simple version of xmit clean up function. It is based on the concept that backend will update used id in burst which also match frontend's requirement.

And what if the backend doesn't follow that concept?
Is it just slower, or broken?

> I just found original version work better in loopback case. Will adapt it in next version. 
> 
> Thanks,
> Marvin
> 
>> Maxime
> 


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path
  2020-04-24 13:33         ` Maxime Coquelin
@ 2020-04-24 13:40           ` Liu, Yong
  2020-04-24 15:58             ` Liu, Yong
  0 siblings, 1 reply; 162+ messages in thread
From: Liu, Yong @ 2020-04-24 13:40 UTC (permalink / raw)
  To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Van Haaren, Harry



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Friday, April 24, 2020 9:34 PM
> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com>
> Subject: Re: [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path
> 
> 
> 
> On 4/24/20 3:12 PM, Liu, Yong wrote:
> >> IIUC, the only difference with the non-vectorized version is the GSO
> >> support removed here.
> >> gso_type being in the same cacheline as flags in virtio_net_hdr, I don't
> >> think checking the performance gain is worth the added maintainance
> >> effort due to code duplication.
> >>
> >> Please prove I'm wrong, otherwise please move virtio_rx_offload() in a
> >> header and use it here. Alternative if it really imapcts performance is
> >> to put all the shared code in a dedicated function that can be re-used
> >> by both implementations.
> >>
> > Maxime,
> > It won't be much performance difference between non-vectorized and
> vectorized.
> > The reason to add special vectorized version is for skipping the handling of
> garbage GSO packets.
> > As all descs have been handled in batch, it is needed to revert when found
> garbage packets.
> > That will introduce complicated logic in vectorized path.
> 
		
The dequeue function will call virtio_discard_rxbuf when the GSO info in the hdr is found to be invalid.
IMHO, there's no need to check the GSO info when GSO is not negotiated.
An alternative would be to handle GSO packets with the single-packet function, but its performance would be worse than the normal function.

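/* GSO sanity check in the scalar Rx offload path; when it fails, the
 * dequeue side discards the buffer via virtio_discard_rxbuf(). In the
 * batch path a failure would require unwinding descriptors that were
 * already handled as a group of four.
 */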
if ((hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN) ||
    (hdr->gso_size == 0)) {
	return -EINVAL;
}

> 
> What do you mean by garbage packet?
> Is it really good to just ignore such issues?
> 
> Thanks,
> Maxime
> 
> > Regards,
> > Marvin
> >


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path
  2020-04-24 13:35         ` Maxime Coquelin
@ 2020-04-24 13:47           ` Liu, Yong
  0 siblings, 0 replies; 162+ messages in thread
From: Liu, Yong @ 2020-04-24 13:47 UTC (permalink / raw)
  To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Van Haaren, Harry



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Friday, April 24, 2020 9:36 PM
> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com>
> Subject: Re: [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path
> 
> 
> 
> On 4/24/20 3:33 PM, Liu, Yong wrote:
> >
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> >> Sent: Friday, April 24, 2020 8:30 PM
> >> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
> >> Wang, Zhihong <zhihong.wang@intel.com>
> >> Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com>
> >> Subject: Re: [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path
> >>
> >>
> >>
> >> On 4/24/20 11:24 AM, Marvin Liu wrote:
> >>> Optimize packed ring Tx path alike Rx path. Split Tx path into batch and
> >>
> >> s/alike/like/ ?
> >>
> >>> single Tx functions. Batch function is further optimized by AVX512
> >>> instructions.
> >>>
> >>> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >>>
> >>> diff --git a/drivers/net/virtio/virtio_ethdev.h
> >> b/drivers/net/virtio/virtio_ethdev.h
> >>> index 5c112cac7..b7d52d497 100644
> >>> --- a/drivers/net/virtio/virtio_ethdev.h
> >>> +++ b/drivers/net/virtio/virtio_ethdev.h
> >>> @@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue,
> >> struct rte_mbuf **rx_pkts,
> >>>  uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf
> >> **rx_pkts,
> >>>  		uint16_t nb_pkts);
> >>>
> >>> +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf
> >> **tx_pkts,
> >>> +		uint16_t nb_pkts);
> >>> +
> >>>  int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
> >>>
> >>>  void virtio_interrupt_handler(void *param);
> >>> diff --git a/drivers/net/virtio/virtio_rxtx.c
> b/drivers/net/virtio/virtio_rxtx.c
> >>> index cf18fe564..f82fe8d64 100644
> >>> --- a/drivers/net/virtio/virtio_rxtx.c
> >>> +++ b/drivers/net/virtio/virtio_rxtx.c
> >>> @@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue
> >> __rte_unused,
> >>>  {
> >>>  	return 0;
> >>>  }
> >>> +
> >>> +__rte_weak uint16_t
> >>> +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused,
> >>> +			    struct rte_mbuf **tx_pkts __rte_unused,
> >>> +			    uint16_t nb_pkts __rte_unused)
> >>> +{
> >>> +	return 0;
> >>> +}
> >>> diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c
> >> b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> >>> index 8a7b459eb..c023ace4e 100644
> >>> --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c
> >>> +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> >>> @@ -23,6 +23,24 @@
> >>>  #define PACKED_FLAGS_MASK ((0ULL |
> >> VRING_PACKED_DESC_F_AVAIL_USED) << \
> >>>  	FLAGS_BITS_OFFSET)
> >>>
> >>> +/* reference count offset in mbuf rearm data */
> >>> +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> >>> +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> >>> +/* segment number offset in mbuf rearm data */
> >>> +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> >>> +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> >>> +
> >>> +/* default rearm data */
> >>> +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> >>> +	1ULL << REFCNT_BITS_OFFSET)
> >>> +
> >>> +/* id bits offset in packed ring desc higher 64bits */
> >>> +#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \
> >>> +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> >>> +
> >>> +/* net hdr short size mask */
> >>> +#define NET_HDR_MASK 0x3F
> >>> +
> >>>  #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
> >>>  	sizeof(struct vring_packed_desc))
> >>>  #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
> >>> @@ -47,6 +65,48 @@
> >>>  	for (iter = val; iter < num; iter++)
> >>>  #endif
> >>>
> >>> +static inline void
> >>> +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq)
> >>> +{
> >>> +	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
> >>> +	struct vq_desc_extra *dxp;
> >>> +	uint16_t used_idx, id, curr_id, free_cnt = 0;
> >>> +	uint16_t size = vq->vq_nentries;
> >>> +	struct rte_mbuf *mbufs[size];
> >>> +	uint16_t nb_mbuf = 0, i;
> >>> +
> >>> +	used_idx = vq->vq_used_cons_idx;
> >>> +
> >>> +	if (!desc_is_used(&desc[used_idx], vq))
> >>> +		return;
> >>> +
> >>> +	id = desc[used_idx].id;
> >>> +
> >>> +	do {
> >>> +		curr_id = used_idx;
> >>> +		dxp = &vq->vq_descx[used_idx];
> >>> +		used_idx += dxp->ndescs;
> >>> +		free_cnt += dxp->ndescs;
> >>> +
> >>> +		if (dxp->cookie != NULL) {
> >>> +			mbufs[nb_mbuf] = dxp->cookie;
> >>> +			dxp->cookie = NULL;
> >>> +			nb_mbuf++;
> >>> +		}
> >>> +
> >>> +		if (used_idx >= size) {
> >>> +			used_idx -= size;
> >>> +			vq->vq_packed.used_wrap_counter ^= 1;
> >>> +		}
> >>> +	} while (curr_id != id);
> >>> +
> >>> +	for (i = 0; i < nb_mbuf; i++)
> >>> +		rte_pktmbuf_free(mbufs[i]);
> >>> +
> >>> +	vq->vq_used_cons_idx = used_idx;
> >>> +	vq->vq_free_cnt += free_cnt;
> >>> +}
> >>> +
> >>
> >>
> >> I think you can re-use the inlined non-vectorized cleanup function here.
> >> Or use your implementation in non-vectorized path.
> >> BTW, do you know we have to pass the num argument in non-vectorized
> >> case? I'm not sure to remember.
> >>
> >
> > Maxime,
> > This is simple version of xmit clean up function. It is based on the concept
> that backend will update used id in burst which also match frontend's
> requirement.
> 
> And what the backend doesn't follow that concept?
> It is just slower or broken?

It is just slower. More packets may be dropped when there is no free room in the ring.
I will replace the vectorized cleanup with the non-vectorized version since it has shown good numbers.

> 
> > I just found original version work better in loopback case. Will adapt it in
> next version.
> >
> > Thanks,
> > Marvin
> >
> >> Maxime
> >


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path
  2020-04-24 13:40           ` Liu, Yong
@ 2020-04-24 15:58             ` Liu, Yong
  0 siblings, 0 replies; 162+ messages in thread
From: Liu, Yong @ 2020-04-24 15:58 UTC (permalink / raw)
  To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Van Haaren, Harry



> -----Original Message-----
> From: Liu, Yong
> Sent: Friday, April 24, 2020 9:41 PM
> To: 'Maxime Coquelin' <maxime.coquelin@redhat.com>; Ye, Xiaolong
> <xiaolong.ye@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com>
> Subject: RE: [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path
> 
> 
> 
> > -----Original Message-----
> > From: Maxime Coquelin <maxime.coquelin@redhat.com>
> > Sent: Friday, April 24, 2020 9:34 PM
> > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
> > Wang, Zhihong <zhihong.wang@intel.com>
> > Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com>
> > Subject: Re: [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path
> >
> >
> >
> > On 4/24/20 3:12 PM, Liu, Yong wrote:
> > >> IIUC, the only difference with the non-vectorized version is the GSO
> > >> support removed here.
> > >> gso_type being in the same cacheline as flags in virtio_net_hdr, I don't
> > >> think checking the performance gain is worth the added maintainance
> > >> effort due to code duplication.
> > >>
> > >> Please prove I'm wrong, otherwise please move virtio_rx_offload() in a
> > >> header and use it here. Alternative if it really imapcts performance is
> > >> to put all the shared code in a dedicated function that can be re-used
> > >> by both implementations.
> > >>
> > > Maxime,
> > > It won't be much performance difference between non-vectorized and
> > vectorized.
> > > The reason to add special vectorized version is for skipping the handling
> of
> > garbage GSO packets.
> > > As all descs have been handled in batch, it is needed to revert when
> found
> > garbage packets.
> > > That will introduce complicated logic in vectorized path.
> >
> 
> Dequeue function will call virtio_discard_rxbuf when found gso info in hdr is
> invalid.
> IMHO, there's no need to check gso info when GSO not negotiated.
> There's an alternative way is that use single function handle GSO packets but
> its performance will be worse than normal function.
> 
> if ((hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN) ||
> 	 (hdr->gso_size == 0)) {
> 	 return -EINVAL;
> }
> 

Hi Maxime,
There's about a 6% performance drop in the loopback case after handling this special case in the Rx path.
I prefer to keep the current implementation. What's your opinion?

Thanks,
Marvin

> >
> > What do you mean by garbage packet?
> > Is it really good to just ignore such issues?
> >
> > Thanks,
> > Maxime
> >
> > > Regards,
> > > Marvin
> > >


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v9 0/9] add packed ring vectorized path
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
                   ` (14 preceding siblings ...)
  2020-04-24  9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu
@ 2020-04-26  2:19 ` Marvin Liu
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 1/9] net/virtio: add Rx free threshold setting Marvin Liu
                     ` (8 more replies)
  2020-04-28  8:32 ` [dpdk-dev] [PATCH v11 0/9] add packed ring " Marvin Liu
  2020-04-29  7:28 ` [dpdk-dev] [PATCH v12 0/9] add packed ring " Marvin Liu
  17 siblings, 9 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-26  2:19 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

This patch set introduces a vectorized path for packed ring.

The size of a packed ring descriptor is 16 bytes, so four batched
descriptors fit exactly into one cacheline, which AVX512 instructions can
handle well. The packed ring Tx path can be fully transformed into a
vectorized path. The packed ring Rx path can be vectorized when the
requirements are met (LRO and mergeable buffers disabled).

A new option, RTE_LIBRTE_VIRTIO_INC_VECTOR, will be introduced in this
patch set. This option unifies the default setting of the split and
packed ring vectorized paths. Meanwhile, the user can specify at runtime
whether to enable the vectorized path through the 'vectorized' parameter
of the virtio-user vdev.
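
For illustration only (socket path, cores and the companion devargs below
are placeholders, not mandated by this series), the vectorized path could
be requested on a virtio-user port like:

  testpmd -l 1-2 -n 4 --no-pci \
    --vdev=net_virtio_user0,path=/tmp/vhost-user.sock,packed_vq=1,in_order=1,vectorized=1 \
    -- -i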

v10:
* reuse packed ring xmit cleanup

v9:
* replace RTE_LIBRTE_VIRTIO_INC_VECTOR with vectorized devarg
* reorder patch sequence

v8:
* fix meson build error on ubuntu16.04 and suse15

v7:
* default vectorization is disabled
* compilation time check dependency on rte_mbuf structure
* offsets are calculated when compiling
* remove useless barrier as descs are batched store&load
* vindex of scatter is directly set
* some comments updates
* enable vectorized path in meson build

v6:
* fix issue when size not power of 2

v5:
* remove cpuflags definition as required extensions always come with
  AVX512F on x86_64
* inorder actions should depend on feature bit
* check ring type in rx queue setup
* rewrite some commit logs
* fix some checkpatch warnings

v4:
* rename 'packed_vec' to 'vectorized', also used in split ring
* add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev
* check required AVX512 extensions cpuflags
* combine split and packed ring datapath selection logic
* remove limitation that size must power of two
* clear 12Bytes virtio_net_hdr

v3:
* remove virtio_net_hdr array for better performance
* disable 'packed_vec' by default

v2:
* more function blocks replaced by vector instructions
* clean virtio_net_hdr by vector instruction
* allow header room size change
* add 'packed_vec' option in virtio_user vdev 
* fix build not check whether AVX512 enabled
* doc update

Tested-by: Wang, Yinan <yinan.wang@intel.com>

Marvin Liu (9):
  net/virtio: add Rx free threshold setting
  net/virtio: inorder should depend on feature bit
  net/virtio: add vectorized devarg
  net/virtio-user: add vectorized devarg
  net/virtio: reuse packed ring functions
  net/virtio: add vectorized packed ring Rx path
  net/virtio: add vectorized packed ring Tx path
  net/virtio: add election for vectorized path
  doc: add packed vectorized path

 doc/guides/nics/virtio.rst                  |  52 +-
 drivers/net/virtio/Makefile                 |  35 ++
 drivers/net/virtio/meson.build              |  14 +
 drivers/net/virtio/virtio_ethdev.c          | 137 ++++-
 drivers/net/virtio/virtio_ethdev.h          |   6 +
 drivers/net/virtio/virtio_pci.h             |   3 +-
 drivers/net/virtio/virtio_rxtx.c            | 349 ++---------
 drivers/net/virtio/virtio_rxtx_packed_avx.c | 623 ++++++++++++++++++++
 drivers/net/virtio/virtio_user_ethdev.c     |  32 +-
 drivers/net/virtio/virtqueue.c              |   7 +-
 drivers/net/virtio/virtqueue.h              | 307 +++++++++-
 11 files changed, 1210 insertions(+), 355 deletions(-)
 create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v10 1/9] net/virtio: add Rx free threshold setting
  2020-04-26  2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu
@ 2020-04-26  2:19   ` Marvin Liu
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 2/9] net/virtio: inorder should depend on feature bit Marvin Liu
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-26  2:19 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Introduce a free threshold setting in the Rx queue; its default value is
32. Limit the threshold to a multiple of four as only the vectorized
packed Rx function will utilize it. The virtio driver will rearm the Rx
queue when more than rx_free_thresh descs have been dequeued.
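
As a usage sketch only (port id, queue id, descriptor count and mbuf pool
below are assumptions of the example, not taken from this patch), an
application requests the threshold through the standard rte_eth_rxconf
field:

#include <rte_ethdev.h>
#include <rte_mempool.h>

/* Sketch: set up Rx queue 0 of "port_id" with a 32-descriptor free
 * threshold; "mp" is an mbuf pool created elsewhere. */
static int
setup_rxq_with_free_thresh(uint16_t port_id, struct rte_mempool *mp)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rxconf;

	rte_eth_dev_info_get(port_id, &dev_info);
	rxconf = dev_info.default_rxconf;
	rxconf.rx_free_thresh = 32;	/* multiple of four, below ring size */

	return rte_eth_rx_queue_setup(port_id, 0, 1024,
			rte_eth_dev_socket_id(port_id), &rxconf, mp);
}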

Signed-off-by: Marvin Liu <yong.liu@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 060410577..94ba7a3ec 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 	struct virtio_hw *hw = dev->data->dev_private;
 	struct virtqueue *vq = hw->vqs[vtpci_queue_idx];
 	struct virtnet_rx *rxvq;
+	uint16_t rx_free_thresh;
 
 	PMD_INIT_FUNC_TRACE();
 
@@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
 		return -EINVAL;
 	}
 
+	rx_free_thresh = rx_conf->rx_free_thresh;
+	if (rx_free_thresh == 0)
+		rx_free_thresh =
+			RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH);
+
+	if (rx_free_thresh & 0x3) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+
+	if (rx_free_thresh >= vq->vq_nentries) {
+		RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the "
+			"number of RX entries (%u)."
+			" (rx_free_thresh=%u port=%u queue=%u)\n",
+			vq->vq_nentries,
+			rx_free_thresh, dev->data->port_id, queue_idx);
+		return -EINVAL;
+	}
+	vq->vq_free_thresh = rx_free_thresh;
+
 	if (nb_desc == 0 || nb_desc > vq->vq_nentries)
 		nb_desc = vq->vq_nentries;
 	vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc);
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 58ad7309a..6301c56b2 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -18,6 +18,8 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_RX_FREE_THRESH 32
+
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v10 2/9] net/virtio: inorder should depend on feature bit
  2020-04-26  2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 1/9] net/virtio: add Rx free threshold setting Marvin Liu
@ 2020-04-26  2:19   ` Marvin Liu
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 3/9] net/virtio: add vectorized devarg Marvin Liu
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-26  2:19 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Ring initialization is different when the in-order feature is negotiated.
This action should depend on the negotiated feature bits.

Signed-off-by: Marvin Liu <yong.liu@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 94ba7a3ec..e450477e8 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -989,6 +989,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 	struct rte_mbuf *m;
 	uint16_t desc_idx;
 	int error, nbufs, i;
+	bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER);
 
 	PMD_INIT_FUNC_TRACE();
 
@@ -1018,7 +1019,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 			virtio_rxq_rearm_vec(rxvq);
 			nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
 		}
-	} else if (hw->use_inorder_rx) {
+	} else if (!vtpci_packed_queue(vq->hw) && in_order) {
 		if ((!virtqueue_full(vq))) {
 			uint16_t free_cnt = vq->vq_free_cnt;
 			struct rte_mbuf *pkts[free_cnt];
@@ -1133,7 +1134,7 @@ virtio_dev_tx_queue_setup_finish(struct rte_eth_dev *dev,
 	PMD_INIT_FUNC_TRACE();
 
 	if (!vtpci_packed_queue(hw)) {
-		if (hw->use_inorder_tx)
+		if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER))
 			vq->vq_split.ring.desc[vq->vq_nentries - 1].next = 0;
 	}
 
@@ -2046,7 +2047,7 @@ virtio_xmit_pkts_packed(void *tx_queue, struct rte_mbuf **tx_pkts,
 	struct virtio_hw *hw = vq->hw;
 	uint16_t hdr_size = hw->vtnet_hdr_size;
 	uint16_t nb_tx = 0;
-	bool in_order = hw->use_inorder_tx;
+	bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER);
 
 	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
 		return nb_tx;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v10 3/9] net/virtio: add vectorized devarg
  2020-04-26  2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 1/9] net/virtio: add Rx free threshold setting Marvin Liu
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 2/9] net/virtio: inorder should depend on feature bit Marvin Liu
@ 2020-04-26  2:19   ` Marvin Liu
  2020-04-27 11:12     ` Maxime Coquelin
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 4/9] net/virtio-user: " Marvin Liu
                     ` (5 subsequent siblings)
  8 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-26  2:19 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Previously, the virtio split ring vectorized path was enabled by default.
This is not suitable for everyone because that path does not follow the
virtio spec. Add a new devarg for virtio vectorized path selection. By
default the vectorized path is disabled.
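
For illustration, the devarg is attached to the PCI device on the EAL
command line, e.g. (the BDF below is only a placeholder):

  -w 0000:02:00.0,vectorized=1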

Signed-off-by: Marvin Liu <yong.liu@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index 6286286db..902a1f0cf 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -363,6 +363,13 @@ Below devargs are supported by the PCI virtio driver:
     rte_eth_link_get_nowait function.
     (Default: 10000 (10G))
 
+#.  ``vectorized``:
+
+    It is used to specify whether the virtio device prefers to use the vectorized path.
+    Afterwards, dependencies of vectorized path will be checked in path
+    election.
+    (Default: 0 (disabled))
+
 Below devargs are supported by the virtio-user vdev:
 
 #.  ``path``:
diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 37766cbb6..0a69a4db1 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -48,7 +48,8 @@ static int virtio_dev_allmulticast_disable(struct rte_eth_dev *dev);
 static uint32_t virtio_dev_speed_capa_get(uint32_t speed);
 static int virtio_dev_devargs_parse(struct rte_devargs *devargs,
 	int *vdpa,
-	uint32_t *speed);
+	uint32_t *speed,
+	int *vectorized);
 static int virtio_dev_info_get(struct rte_eth_dev *dev,
 				struct rte_eth_dev_info *dev_info);
 static int virtio_dev_link_update(struct rte_eth_dev *dev,
@@ -1551,8 +1552,8 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 			eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed;
 		}
 	} else {
-		if (hw->use_simple_rx) {
-			PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u",
+		if (hw->use_vec_rx) {
+			PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u",
 				eth_dev->data->port_id);
 			eth_dev->rx_pkt_burst = virtio_recv_pkts_vec;
 		} else if (hw->use_inorder_rx) {
@@ -1886,6 +1887,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 {
 	struct virtio_hw *hw = eth_dev->data->dev_private;
 	uint32_t speed = SPEED_UNKNOWN;
+	int vectorized = 0;
 	int ret;
 
 	if (sizeof(struct virtio_net_hdr_mrg_rxbuf) > RTE_PKTMBUF_HEADROOM) {
@@ -1912,7 +1914,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 		return 0;
 	}
 	ret = virtio_dev_devargs_parse(eth_dev->device->devargs,
-		 NULL, &speed);
+		 NULL, &speed, &vectorized);
 	if (ret < 0)
 		return ret;
 	hw->speed = speed;
@@ -1949,6 +1951,11 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 	if (ret < 0)
 		goto err_virtio_init;
 
+	if (vectorized) {
+		if (!vtpci_packed_queue(hw))
+			hw->use_vec_rx = 1;
+	}
+
 	hw->opened = true;
 
 	return 0;
@@ -2021,9 +2028,20 @@ virtio_dev_speed_capa_get(uint32_t speed)
 	}
 }
 
+static int vectorized_check_handler(__rte_unused const char *key,
+		const char *value, void *ret_val)
+{
+	if (strcmp(value, "1") == 0)
+		*(int *)ret_val = 1;
+	else
+		*(int *)ret_val = 0;
+
+	return 0;
+}
 
 #define VIRTIO_ARG_SPEED      "speed"
 #define VIRTIO_ARG_VDPA       "vdpa"
+#define VIRTIO_ARG_VECTORIZED "vectorized"
 
 
 static int
@@ -2045,7 +2063,7 @@ link_speed_handler(const char *key __rte_unused,
 
 static int
 virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa,
-	uint32_t *speed)
+	uint32_t *speed, int *vectorized)
 {
 	struct rte_kvargs *kvlist;
 	int ret = 0;
@@ -2081,6 +2099,18 @@ virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa,
 		}
 	}
 
+	if (vectorized &&
+		rte_kvargs_count(kvlist, VIRTIO_ARG_VECTORIZED) == 1) {
+		ret = rte_kvargs_process(kvlist,
+				VIRTIO_ARG_VECTORIZED,
+				vectorized_check_handler, vectorized);
+		if (ret < 0) {
+			PMD_INIT_LOG(ERR, "Failed to parse %s",
+					VIRTIO_ARG_VECTORIZED);
+			goto exit;
+		}
+	}
+
 exit:
 	rte_kvargs_free(kvlist);
 	return ret;
@@ -2092,7 +2122,8 @@ static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 	int vdpa = 0;
 	int ret = 0;
 
-	ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL);
+	ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL,
+		NULL);
 	if (ret < 0) {
 		PMD_INIT_LOG(ERR, "devargs parsing is failed");
 		return ret;
@@ -2257,33 +2288,31 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 			return -EBUSY;
 		}
 
-	hw->use_simple_rx = 1;
-
 	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
 		hw->use_inorder_tx = 1;
 		hw->use_inorder_rx = 1;
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 		hw->use_inorder_rx = 0;
 	}
 
 #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
 	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 #endif
 	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
-		 hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 	}
 
 	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
 			   DEV_RX_OFFLOAD_TCP_CKSUM |
 			   DEV_RX_OFFLOAD_TCP_LRO |
 			   DEV_RX_OFFLOAD_VLAN_STRIP))
-		hw->use_simple_rx = 0;
+		hw->use_vec_rx = 0;
 
 	return 0;
 }
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index bd89357e4..668e688e1 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -253,7 +253,8 @@ struct virtio_hw {
 	uint8_t	    vlan_strip;
 	uint8_t	    use_msix;
 	uint8_t     modern;
-	uint8_t     use_simple_rx;
+	uint8_t     use_vec_rx;
+	uint8_t     use_vec_tx;
 	uint8_t     use_inorder_rx;
 	uint8_t     use_inorder_tx;
 	uint8_t     weak_barriers;
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index e450477e8..84f4cf946 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 	/* Allocate blank mbufs for the each rx descriptor */
 	nbufs = 0;
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx && !vtpci_packed_queue(hw)) {
 		for (desc_idx = 0; desc_idx < vq->vq_nentries;
 		     desc_idx++) {
 			vq->vq_split.ring.avail->ring[desc_idx] = desc_idx;
@@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
 			&rxvq->fake_mbuf;
 	}
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx && !vtpci_packed_queue(hw)) {
 		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
 			virtio_rxq_rearm_vec(rxvq);
 			nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
index 953f00d72..150a8d987 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -525,7 +525,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev)
 	 */
 	hw->use_msix = 1;
 	hw->modern   = 0;
-	hw->use_simple_rx = 0;
+	hw->use_vec_rx = 0;
 	hw->use_inorder_rx = 0;
 	hw->use_inorder_tx = 0;
 	hw->virtio_user_dev = dev;
diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c
index 0b4e3bf3e..ca23180de 100644
--- a/drivers/net/virtio/virtqueue.c
+++ b/drivers/net/virtio/virtqueue.c
@@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq)
 	end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1);
 
 	for (idx = 0; idx < vq->vq_nentries; idx++) {
-		if (hw->use_simple_rx && type == VTNET_RQ) {
+		if (hw->use_vec_rx && !vtpci_packed_queue(hw) &&
+		    type == VTNET_RQ) {
 			if (start <= end && idx >= start && idx < end)
 				continue;
 			if (start > end && (idx >= start || idx < end))
@@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
 	for (i = 0; i < nb_used; i++) {
 		used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1);
 		uep = &vq->vq_split.ring.used->ring[used_idx];
-		if (hw->use_simple_rx) {
+		if (hw->use_vec_rx) {
 			desc_idx = used_idx;
 			rte_pktmbuf_free(vq->sw_ring[desc_idx]);
 			vq->vq_free_cnt++;
@@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
 		vq->vq_used_cons_idx++;
 	}
 
-	if (hw->use_simple_rx) {
+	if (hw->use_vec_rx) {
 		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
 			virtio_rxq_rearm_vec(rxq);
 			if (virtqueue_kick_prepare(vq))
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v10 4/9] net/virtio-user: add vectorized devarg
  2020-04-26  2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu
                     ` (2 preceding siblings ...)
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 3/9] net/virtio: add vectorized devarg Marvin Liu
@ 2020-04-26  2:19   ` Marvin Liu
  2020-04-27 11:07     ` Maxime Coquelin
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 5/9] net/virtio: reuse packed ring functions Marvin Liu
                     ` (4 subsequent siblings)
  8 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-26  2:19 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Add a new devarg for virtio-user device vectorized path selection. By
default the vectorized path is disabled.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index 902a1f0cf..d59add23e 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -424,6 +424,12 @@ Below devargs are supported by the virtio-user vdev:
     rte_eth_link_get_nowait function.
     (Default: 10000 (10G))
 
+#.  ``vectorized``:
+
+    It is used to specify whether the virtio device prefers to use the vectorized path.
+    Afterwards, dependencies of vectorized path will be checked in path
+    election.
+    (Default: 0 (disabled))
 
 Virtio paths Selection and Usage
 --------------------------------
diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
index 150a8d987..40ad786cc 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -452,6 +452,8 @@ static const char *valid_args[] = {
 	VIRTIO_USER_ARG_PACKED_VQ,
 #define VIRTIO_USER_ARG_SPEED          "speed"
 	VIRTIO_USER_ARG_SPEED,
+#define VIRTIO_USER_ARG_VECTORIZED     "vectorized"
+	VIRTIO_USER_ARG_VECTORIZED,
 	NULL
 };
 
@@ -559,6 +561,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 	uint64_t mrg_rxbuf = 1;
 	uint64_t in_order = 1;
 	uint64_t packed_vq = 0;
+	uint64_t vectorized = 0;
 	char *path = NULL;
 	char *ifname = NULL;
 	char *mac_addr = NULL;
@@ -675,6 +678,15 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		}
 	}
 
+	if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) {
+		if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED,
+				       &get_integer_arg, &vectorized) < 0) {
+			PMD_INIT_LOG(ERR, "error to parse %s",
+				     VIRTIO_USER_ARG_VECTORIZED);
+			goto end;
+		}
+	}
+
 	if (queues > 1 && cq == 0) {
 		PMD_INIT_LOG(ERR, "multi-q requires ctrl-q");
 		goto end;
@@ -727,6 +739,9 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		goto end;
 	}
 
+	if (vectorized)
+		hw->use_vec_rx = 1;
+
 	rte_eth_dev_probing_finish(eth_dev);
 	ret = 0;
 
@@ -785,4 +800,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user,
 	"mrg_rxbuf=<0|1> "
 	"in_order=<0|1> "
 	"packed_vq=<0|1> "
-	"speed=<int>");
+	"speed=<int> "
+	"vectorized=<0|1>");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v10 5/9] net/virtio: reuse packed ring functions
  2020-04-26  2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu
                     ` (3 preceding siblings ...)
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 4/9] net/virtio-user: " Marvin Liu
@ 2020-04-26  2:19   ` Marvin Liu
  2020-04-27 11:08     ` Maxime Coquelin
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path Marvin Liu
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-26  2:19 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Move the offload, xmit cleanup and packed xmit enqueue functions to the
header file. These functions will be reused by the packed ring vectorized
path.

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 84f4cf946..a549991aa 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -89,23 +89,6 @@ vq_ring_free_chain(struct virtqueue *vq, uint16_t desc_idx)
 	dp->next = VQ_RING_DESC_CHAIN_END;
 }
 
-static void
-vq_ring_free_id_packed(struct virtqueue *vq, uint16_t id)
-{
-	struct vq_desc_extra *dxp;
-
-	dxp = &vq->vq_descx[id];
-	vq->vq_free_cnt += dxp->ndescs;
-
-	if (vq->vq_desc_tail_idx == VQ_RING_DESC_CHAIN_END)
-		vq->vq_desc_head_idx = id;
-	else
-		vq->vq_descx[vq->vq_desc_tail_idx].next = id;
-
-	vq->vq_desc_tail_idx = id;
-	dxp->next = VQ_RING_DESC_CHAIN_END;
-}
-
 void
 virtio_update_packet_stats(struct virtnet_stats *stats, struct rte_mbuf *mbuf)
 {
@@ -264,130 +247,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq,
 	return i;
 }
 
-#ifndef DEFAULT_TX_FREE_THRESH
-#define DEFAULT_TX_FREE_THRESH 32
-#endif
-
-static void
-virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num)
-{
-	uint16_t used_idx, id, curr_id, free_cnt = 0;
-	uint16_t size = vq->vq_nentries;
-	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
-	struct vq_desc_extra *dxp;
-
-	used_idx = vq->vq_used_cons_idx;
-	/* desc_is_used has a load-acquire or rte_cio_rmb inside
-	 * and wait for used desc in virtqueue.
-	 */
-	while (num > 0 && desc_is_used(&desc[used_idx], vq)) {
-		id = desc[used_idx].id;
-		do {
-			curr_id = used_idx;
-			dxp = &vq->vq_descx[used_idx];
-			used_idx += dxp->ndescs;
-			free_cnt += dxp->ndescs;
-			num -= dxp->ndescs;
-			if (used_idx >= size) {
-				used_idx -= size;
-				vq->vq_packed.used_wrap_counter ^= 1;
-			}
-			if (dxp->cookie != NULL) {
-				rte_pktmbuf_free(dxp->cookie);
-				dxp->cookie = NULL;
-			}
-		} while (curr_id != id);
-	}
-	vq->vq_used_cons_idx = used_idx;
-	vq->vq_free_cnt += free_cnt;
-}
-
-static void
-virtio_xmit_cleanup_normal_packed(struct virtqueue *vq, int num)
-{
-	uint16_t used_idx, id;
-	uint16_t size = vq->vq_nentries;
-	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
-	struct vq_desc_extra *dxp;
-
-	used_idx = vq->vq_used_cons_idx;
-	/* desc_is_used has a load-acquire or rte_cio_rmb inside
-	 * and wait for used desc in virtqueue.
-	 */
-	while (num-- && desc_is_used(&desc[used_idx], vq)) {
-		id = desc[used_idx].id;
-		dxp = &vq->vq_descx[id];
-		vq->vq_used_cons_idx += dxp->ndescs;
-		if (vq->vq_used_cons_idx >= size) {
-			vq->vq_used_cons_idx -= size;
-			vq->vq_packed.used_wrap_counter ^= 1;
-		}
-		vq_ring_free_id_packed(vq, id);
-		if (dxp->cookie != NULL) {
-			rte_pktmbuf_free(dxp->cookie);
-			dxp->cookie = NULL;
-		}
-		used_idx = vq->vq_used_cons_idx;
-	}
-}
-
-/* Cleanup from completed transmits. */
-static inline void
-virtio_xmit_cleanup_packed(struct virtqueue *vq, int num, int in_order)
-{
-	if (in_order)
-		virtio_xmit_cleanup_inorder_packed(vq, num);
-	else
-		virtio_xmit_cleanup_normal_packed(vq, num);
-}
-
-static void
-virtio_xmit_cleanup(struct virtqueue *vq, uint16_t num)
-{
-	uint16_t i, used_idx, desc_idx;
-	for (i = 0; i < num; i++) {
-		struct vring_used_elem *uep;
-		struct vq_desc_extra *dxp;
-
-		used_idx = (uint16_t)(vq->vq_used_cons_idx & (vq->vq_nentries - 1));
-		uep = &vq->vq_split.ring.used->ring[used_idx];
-
-		desc_idx = (uint16_t) uep->id;
-		dxp = &vq->vq_descx[desc_idx];
-		vq->vq_used_cons_idx++;
-		vq_ring_free_chain(vq, desc_idx);
-
-		if (dxp->cookie != NULL) {
-			rte_pktmbuf_free(dxp->cookie);
-			dxp->cookie = NULL;
-		}
-	}
-}
-
-/* Cleanup from completed inorder transmits. */
-static __rte_always_inline void
-virtio_xmit_cleanup_inorder(struct virtqueue *vq, uint16_t num)
-{
-	uint16_t i, idx = vq->vq_used_cons_idx;
-	int16_t free_cnt = 0;
-	struct vq_desc_extra *dxp = NULL;
-
-	if (unlikely(num == 0))
-		return;
-
-	for (i = 0; i < num; i++) {
-		dxp = &vq->vq_descx[idx++ & (vq->vq_nentries - 1)];
-		free_cnt += dxp->ndescs;
-		if (dxp->cookie != NULL) {
-			rte_pktmbuf_free(dxp->cookie);
-			dxp->cookie = NULL;
-		}
-	}
-
-	vq->vq_free_cnt += free_cnt;
-	vq->vq_used_cons_idx = idx;
-}
-
 static inline int
 virtqueue_enqueue_refill_inorder(struct virtqueue *vq,
 			struct rte_mbuf **cookies,
@@ -562,68 +421,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m)
 }
 
 
-/* avoid write operation when necessary, to lessen cache issues */
-#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
-	if ((var) != (val))			\
-		(var) = (val);			\
-} while (0)
-
-#define virtqueue_clear_net_hdr(_hdr) do {		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0);		\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0);	\
-	ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0);	\
-} while (0)
-
-static inline void
-virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
-			struct rte_mbuf *cookie,
-			bool offload)
-{
-	if (offload) {
-		if (cookie->ol_flags & PKT_TX_TCP_SEG)
-			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
-
-		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
-		case PKT_TX_UDP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_udp_hdr,
-				dgram_cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		case PKT_TX_TCP_CKSUM:
-			hdr->csum_start = cookie->l2_len + cookie->l3_len;
-			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
-			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			break;
-
-		default:
-			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
-			break;
-		}
 
-		/* TCP Segmentation Offload */
-		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
-			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
-				VIRTIO_NET_HDR_GSO_TCPV6 :
-				VIRTIO_NET_HDR_GSO_TCPV4;
-			hdr->gso_size = cookie->tso_segsz;
-			hdr->hdr_len =
-				cookie->l2_len +
-				cookie->l3_len +
-				cookie->l4_len;
-		} else {
-			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
-			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
-		}
-	}
-}
 
 static inline void
 virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq,
@@ -725,102 +523,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq,
 	virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers);
 }
 
-static inline void
-virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
-			      uint16_t needed, int can_push, int in_order)
-{
-	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
-	struct vq_desc_extra *dxp;
-	struct virtqueue *vq = txvq->vq;
-	struct vring_packed_desc *start_dp, *head_dp;
-	uint16_t idx, id, head_idx, head_flags;
-	int16_t head_size = vq->hw->vtnet_hdr_size;
-	struct virtio_net_hdr *hdr;
-	uint16_t prev;
-	bool prepend_header = false;
-
-	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
-
-	dxp = &vq->vq_descx[id];
-	dxp->ndescs = needed;
-	dxp->cookie = cookie;
-
-	head_idx = vq->vq_avail_idx;
-	idx = head_idx;
-	prev = head_idx;
-	start_dp = vq->vq_packed.ring.desc;
-
-	head_dp = &vq->vq_packed.ring.desc[idx];
-	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-	head_flags |= vq->vq_packed.cached_flags;
-
-	if (can_push) {
-		/* prepend cannot fail, checked by caller */
-		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
-					      -head_size);
-		prepend_header = true;
-
-		/* if offload disabled, it is not zeroed below, do it now */
-		if (!vq->hw->has_tx_offload)
-			virtqueue_clear_net_hdr(hdr);
-	} else {
-		/* setup first tx ring slot to point to header
-		 * stored in reserved region.
-		 */
-		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
-			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
-		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
-		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	}
-
-	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
-
-	do {
-		uint16_t flags;
-
-		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
-		start_dp[idx].len  = cookie->data_len;
-		if (prepend_header) {
-			start_dp[idx].addr -= head_size;
-			start_dp[idx].len += head_size;
-			prepend_header = false;
-		}
-
-		if (likely(idx != head_idx)) {
-			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
-			flags |= vq->vq_packed.cached_flags;
-			start_dp[idx].flags = flags;
-		}
-		prev = idx;
-		idx++;
-		if (idx >= vq->vq_nentries) {
-			idx -= vq->vq_nentries;
-			vq->vq_packed.cached_flags ^=
-				VRING_PACKED_DESC_F_AVAIL_USED;
-		}
-	} while ((cookie = cookie->next) != NULL);
-
-	start_dp[prev].id = id;
-
-	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
-	vq->vq_avail_idx = idx;
-
-	if (!in_order) {
-		vq->vq_desc_head_idx = dxp->next;
-		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
-			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
-	}
-
-	virtqueue_store_flags_packed(head_dp, head_flags,
-				     vq->hw->weak_barriers);
-}
-
 static inline void
 virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 			uint16_t needed, int use_indirect, int can_push,
@@ -1246,7 +948,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
 	return 0;
 }
 
-#define VIRTIO_MBUF_BURST_SZ 64
 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc))
 uint16_t
 virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 6301c56b2..ca1c10499 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -10,6 +10,7 @@
 #include <rte_atomic.h>
 #include <rte_memory.h>
 #include <rte_mempool.h>
+#include <rte_net.h>
 
 #include "virtio_pci.h"
 #include "virtio_ring.h"
@@ -18,8 +19,10 @@
 
 struct rte_mbuf;
 
+#define DEFAULT_TX_FREE_THRESH 32
 #define DEFAULT_RX_FREE_THRESH 32
 
+#define VIRTIO_MBUF_BURST_SZ 64
 /*
  * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
@@ -560,4 +563,303 @@ virtqueue_notify(struct virtqueue *vq)
 #define VIRTQUEUE_DUMP(vq) do { } while (0)
 #endif
 
+/* avoid write operation when necessary, to lessen cache issues */
+#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
+	typeof(var) var_ = (var);		\
+	typeof(val) val_ = (val);		\
+	if ((var_) != (val_))			\
+		(var_) = (val_);		\
+} while (0)
+
+#define virtqueue_clear_net_hdr(hdr) do {		\
+	typeof(hdr) hdr_ = (hdr);			\
+	ASSIGN_UNLESS_EQUAL((hdr_)->csum_start, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->csum_offset, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->flags, 0);		\
+	ASSIGN_UNLESS_EQUAL((hdr_)->gso_type, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->gso_size, 0);	\
+	ASSIGN_UNLESS_EQUAL((hdr_)->hdr_len, 0);	\
+} while (0)
+
+static inline void
+virtqueue_xmit_offload(struct virtio_net_hdr *hdr,
+			struct rte_mbuf *cookie,
+			bool offload)
+{
+	if (offload) {
+		if (cookie->ol_flags & PKT_TX_TCP_SEG)
+			cookie->ol_flags |= PKT_TX_TCP_CKSUM;
+
+		switch (cookie->ol_flags & PKT_TX_L4_MASK) {
+		case PKT_TX_UDP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_udp_hdr,
+				dgram_cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		case PKT_TX_TCP_CKSUM:
+			hdr->csum_start = cookie->l2_len + cookie->l3_len;
+			hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
+			hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			break;
+
+		default:
+			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
+			break;
+		}
+
+		/* TCP Segmentation Offload */
+		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
+			hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ?
+				VIRTIO_NET_HDR_GSO_TCPV6 :
+				VIRTIO_NET_HDR_GSO_TCPV4;
+			hdr->gso_size = cookie->tso_segsz;
+			hdr->hdr_len =
+				cookie->l2_len +
+				cookie->l3_len +
+				cookie->l4_len;
+		} else {
+			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
+		}
+	}
+}
+
+static inline void
+virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
+			      uint16_t needed, int can_push, int in_order)
+{
+	struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
+	struct vq_desc_extra *dxp;
+	struct virtqueue *vq = txvq->vq;
+	struct vring_packed_desc *start_dp, *head_dp;
+	uint16_t idx, id, head_idx, head_flags;
+	int16_t head_size = vq->hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	uint16_t prev;
+	bool prepend_header = false;
+
+	id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx;
+
+	dxp = &vq->vq_descx[id];
+	dxp->ndescs = needed;
+	dxp->cookie = cookie;
+
+	head_idx = vq->vq_avail_idx;
+	idx = head_idx;
+	prev = head_idx;
+	start_dp = vq->vq_packed.ring.desc;
+
+	head_dp = &vq->vq_packed.ring.desc[idx];
+	head_flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+	head_flags |= vq->vq_packed.cached_flags;
+
+	if (can_push) {
+		/* prepend cannot fail, checked by caller */
+		hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *,
+					      -head_size);
+		prepend_header = true;
+
+		/* if offload disabled, it is not zeroed below, do it now */
+		if (!vq->hw->has_tx_offload)
+			virtqueue_clear_net_hdr(hdr);
+	} else {
+		/* setup first tx ring slot to point to header
+		 * stored in reserved region.
+		 */
+		start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
+			RTE_PTR_DIFF(&txr[idx].tx_hdr, txr);
+		start_dp[idx].len   = vq->hw->vtnet_hdr_size;
+		hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	}
+
+	virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload);
+
+	do {
+		uint16_t flags;
+
+		start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq);
+		start_dp[idx].len  = cookie->data_len;
+		if (prepend_header) {
+			start_dp[idx].addr -= head_size;
+			start_dp[idx].len += head_size;
+			prepend_header = false;
+		}
+
+		if (likely(idx != head_idx)) {
+			flags = cookie->next ? VRING_DESC_F_NEXT : 0;
+			flags |= vq->vq_packed.cached_flags;
+			start_dp[idx].flags = flags;
+		}
+		prev = idx;
+		idx++;
+		if (idx >= vq->vq_nentries) {
+			idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+		}
+	} while ((cookie = cookie->next) != NULL);
+
+	start_dp[prev].id = id;
+
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed);
+	vq->vq_avail_idx = idx;
+
+	if (!in_order) {
+		vq->vq_desc_head_idx = dxp->next;
+		if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END)
+			vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END;
+	}
+
+	virtqueue_store_flags_packed(head_dp, head_flags,
+				     vq->hw->weak_barriers);
+}
+
+static void
+vq_ring_free_id_packed(struct virtqueue *vq, uint16_t id)
+{
+	struct vq_desc_extra *dxp;
+
+	dxp = &vq->vq_descx[id];
+	vq->vq_free_cnt += dxp->ndescs;
+
+	if (vq->vq_desc_tail_idx == VQ_RING_DESC_CHAIN_END)
+		vq->vq_desc_head_idx = id;
+	else
+		vq->vq_descx[vq->vq_desc_tail_idx].next = id;
+
+	vq->vq_desc_tail_idx = id;
+	dxp->next = VQ_RING_DESC_CHAIN_END;
+}
+
+static void
+virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num)
+{
+	uint16_t used_idx, id, curr_id, free_cnt = 0;
+	uint16_t size = vq->vq_nentries;
+	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
+	struct vq_desc_extra *dxp;
+
+	used_idx = vq->vq_used_cons_idx;
+	/* desc_is_used has a load-acquire or rte_cio_rmb inside
+	 * and wait for used desc in virtqueue.
+	 */
+	while (num > 0 && desc_is_used(&desc[used_idx], vq)) {
+		id = desc[used_idx].id;
+		do {
+			curr_id = used_idx;
+			dxp = &vq->vq_descx[used_idx];
+			used_idx += dxp->ndescs;
+			free_cnt += dxp->ndescs;
+			num -= dxp->ndescs;
+			if (used_idx >= size) {
+				used_idx -= size;
+				vq->vq_packed.used_wrap_counter ^= 1;
+			}
+			if (dxp->cookie != NULL) {
+				rte_pktmbuf_free(dxp->cookie);
+				dxp->cookie = NULL;
+			}
+		} while (curr_id != id);
+	}
+	vq->vq_used_cons_idx = used_idx;
+	vq->vq_free_cnt += free_cnt;
+}
+
+static void
+virtio_xmit_cleanup_normal_packed(struct virtqueue *vq, int num)
+{
+	uint16_t used_idx, id;
+	uint16_t size = vq->vq_nentries;
+	struct vring_packed_desc *desc = vq->vq_packed.ring.desc;
+	struct vq_desc_extra *dxp;
+
+	used_idx = vq->vq_used_cons_idx;
+	/* desc_is_used has a load-acquire or rte_cio_rmb inside
+	 * and wait for used desc in virtqueue.
+	 */
+	while (num-- && desc_is_used(&desc[used_idx], vq)) {
+		id = desc[used_idx].id;
+		dxp = &vq->vq_descx[id];
+		vq->vq_used_cons_idx += dxp->ndescs;
+		if (vq->vq_used_cons_idx >= size) {
+			vq->vq_used_cons_idx -= size;
+			vq->vq_packed.used_wrap_counter ^= 1;
+		}
+		vq_ring_free_id_packed(vq, id);
+		if (dxp->cookie != NULL) {
+			rte_pktmbuf_free(dxp->cookie);
+			dxp->cookie = NULL;
+		}
+		used_idx = vq->vq_used_cons_idx;
+	}
+}
+
+/* Cleanup from completed transmits. */
+static inline void
+virtio_xmit_cleanup_packed(struct virtqueue *vq, int num, int in_order)
+{
+	if (in_order)
+		virtio_xmit_cleanup_inorder_packed(vq, num);
+	else
+		virtio_xmit_cleanup_normal_packed(vq, num);
+}
+
+static inline void
+virtio_xmit_cleanup(struct virtqueue *vq, uint16_t num)
+{
+	uint16_t i, used_idx, desc_idx;
+	for (i = 0; i < num; i++) {
+		struct vring_used_elem *uep;
+		struct vq_desc_extra *dxp;
+
+		used_idx = (uint16_t)(vq->vq_used_cons_idx &
+				(vq->vq_nentries - 1));
+		uep = &vq->vq_split.ring.used->ring[used_idx];
+
+		desc_idx = (uint16_t)uep->id;
+		dxp = &vq->vq_descx[desc_idx];
+		vq->vq_used_cons_idx++;
+		vq_ring_free_chain(vq, desc_idx);
+
+		if (dxp->cookie != NULL) {
+			rte_pktmbuf_free(dxp->cookie);
+			dxp->cookie = NULL;
+		}
+	}
+}
+
+/* Cleanup from completed inorder transmits. */
+static __rte_always_inline void
+virtio_xmit_cleanup_inorder(struct virtqueue *vq, uint16_t num)
+{
+	uint16_t i, idx = vq->vq_used_cons_idx;
+	int16_t free_cnt = 0;
+	struct vq_desc_extra *dxp = NULL;
+
+	if (unlikely(num == 0))
+		return;
+
+	for (i = 0; i < num; i++) {
+		dxp = &vq->vq_descx[idx++ & (vq->vq_nentries - 1)];
+		free_cnt += dxp->ndescs;
+		if (dxp->cookie != NULL) {
+			rte_pktmbuf_free(dxp->cookie);
+			dxp->cookie = NULL;
+		}
+	}
+
+	vq->vq_free_cnt += free_cnt;
+	vq->vq_used_cons_idx = idx;
+}
 #endif /* _VIRTQUEUE_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path
  2020-04-26  2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu
                     ` (4 preceding siblings ...)
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 5/9] net/virtio: reuse packed ring functions Marvin Liu
@ 2020-04-26  2:19   ` Marvin Liu
  2020-04-27 11:20     ` Maxime Coquelin
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-26  2:19 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Optimize the packed ring Rx path with SIMD instructions. The solution is
similar to the vhost one: split the path into batch and single functions,
and further optimize the batch function with AVX512 instructions. Also pad
the desc extra structure to a 16-byte alignment, so that four elements can
be handled in one batch.
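
Purely as a reading aid, the batched availability test in this patch is
roughly the scalar loop below (a sketch assuming the PACKED_BATCH_SIZE
macro and the desc_is_used() helper already used by this series):

/* Scalar equivalent of the 512-bit flags check done with
 * _mm512_cmpneq_epu64_mask(): the whole batch is processed only when
 * all four descriptors starting at "id" are marked used. */
static inline int
packed_batch_is_used(struct virtqueue *vq, uint16_t id)
{
	uint16_t i;

	for (i = 0; i < PACKED_BATCH_SIZE; i++)
		if (!desc_is_used(&vq->vq_packed.ring.desc[id + i], vq))
			return 0;

	return 1;
}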

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index c9edb84ee..102b1deab 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
 endif
 
+ifneq ($(FORCE_DISABLE_AVX512), y)
+	CC_AVX512_SUPPORT=\
+	$(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
+	sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
+	grep -q AVX512 && echo 1)
+endif
+
+ifeq ($(CC_AVX512_SUPPORT), 1)
+CFLAGS += -DCC_AVX512_SUPPORT
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
+
+ifeq ($(RTE_TOOLCHAIN), gcc)
+ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
+CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), clang)
+ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1)
+CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
+endif
+endif
+
+ifeq ($(RTE_TOOLCHAIN), icc)
+ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
+CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
+endif
+endif
+
+CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl
+ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
+CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
+endif
+endif
+
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c
diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
index 15150eea1..8e68c3039 100644
--- a/drivers/net/virtio/meson.build
+++ b/drivers/net/virtio/meson.build
@@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c',
 deps += ['kvargs', 'bus_pci']
 
 if arch_subdir == 'x86'
+	if '-mno-avx512f' not in machine_args
+		if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
+			cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl']
+			cflags += ['-DCC_AVX512_SUPPORT']
+			if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
+				cflags += '-DVIRTIO_GCC_UNROLL_PRAGMA'
+			elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
+				cflags += '-DVIRTIO_CLANG_UNROLL_PRAGMA'
+			elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
+				cflags += '-DVIRTIO_ICC_UNROLL_PRAGMA'
+			endif
+			sources += files('virtio_rxtx_packed_avx.c')
+		endif
+	endif
 	sources += files('virtio_rxtx_simple_sse.c')
 elif arch_subdir == 'ppc'
 	sources += files('virtio_rxtx_simple_altivec.c')
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index febaf17a8..5c112cac7 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts,
 uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index a549991aa..534562cca 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -2030,3 +2030,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
 
 	return nb_tx;
 }
+
+__rte_weak uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
+			    struct rte_mbuf **rx_pkts __rte_unused,
+			    uint16_t nb_pkts __rte_unused)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
new file mode 100644
index 000000000..8a7b459eb
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -0,0 +1,374 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <rte_net.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtio_pci.h"
+#include "virtqueue.h"
+
+#define BYTE_SIZE 8
+/* flag bits offset in packed ring desc higher 64bits */
+#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
+	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
+
+#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \
+	FLAGS_BITS_OFFSET)
+
+#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
+	sizeof(struct vring_packed_desc))
+#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
+
+#ifdef VIRTIO_GCC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifdef VIRTIO_ICC_UNROLL_PRAGMA
+#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
+	for (iter = val; iter < size; iter++)
+#endif
+
+#ifndef virtio_for_each_try_unroll
+#define virtio_for_each_try_unroll(iter, val, num) \
+	for (iter = val; iter < num; iter++)
+#endif
+
+static inline void
+virtio_update_batch_stats(struct virtnet_stats *stats,
+			  uint16_t pkt_len1,
+			  uint16_t pkt_len2,
+			  uint16_t pkt_len3,
+			  uint16_t pkt_len4)
+{
+	stats->bytes += pkt_len1;
+	stats->bytes += pkt_len2;
+	stats->bytes += pkt_len3;
+	stats->bytes += pkt_len4;
+}
+
+/* Optionally fill offload information in structure */
+static inline int
+virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
+{
+	struct rte_net_hdr_lens hdr_lens;
+	uint32_t hdrlen, ptype;
+	int l4_supported = 0;
+
+	/* nothing to do */
+	if (hdr->flags == 0)
+		return 0;
+
+	/* GSO is not supported in the vec path, skip the check */
+	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
+
+	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
+	m->packet_type = ptype;
+	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
+	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
+		l4_supported = 1;
+
+	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
+		if (hdr->csum_start <= hdrlen && l4_supported) {
+			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
+		} else {
+			/* Unknown proto or tunnel, do sw cksum. We can assume
+			 * the cksum field is in the first segment since the
+			 * buffers we provided to the host are large enough.
+			 * In case of SCTP, this will be wrong since it's a CRC
+			 * but there's nothing we can do.
+			 */
+			uint16_t csum = 0, off;
+
+			rte_raw_cksum_mbuf(m, hdr->csum_start,
+				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
+				&csum);
+			if (likely(csum != 0xffff))
+				csum = ~csum;
+			off = hdr->csum_offset + hdr->csum_start;
+			if (rte_pktmbuf_data_len(m) >= off + 1)
+				*rte_pktmbuf_mtod_offset(m, uint16_t *,
+					off) = csum;
+		}
+	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
+		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
+	}
+
+	return 0;
+}
+
+static inline uint16_t
+virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq,
+				   struct rte_mbuf **rx_pkts)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint64_t addrs[PACKED_BATCH_SIZE];
+	uint16_t id = vq->vq_used_cons_idx;
+	uint8_t desc_stats;
+	uint16_t i;
+	void *desc_addr;
+
+	if (id & PACKED_BATCH_MASK)
+		return -1;
+
+	if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries))
+		return -1;
+
+	/* only care avail/used bits */
+	__m512i v_mask = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+	desc_addr = &vq->vq_packed.ring.desc[id];
+
+	__m512i v_desc = _mm512_loadu_si512(desc_addr);
+	__m512i v_flag = _mm512_and_epi64(v_desc, v_mask);
+
+	__m512i v_used_flag = _mm512_setzero_si512();
+	if (vq->vq_packed.used_wrap_counter)
+		v_used_flag = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
+
+	/* Check all descs are used */
+	desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag);
+	if (desc_stats)
+		return -1;
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie;
+		rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *));
+
+		addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1;
+	}
+
+	/*
+	 * load len from desc, store into mbuf pkt_len and data_len
+	 * len limited by 16-bit buf_len, pkt_len[16:31] can be ignored
+	 */
+	const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12;
+	__m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, 0xAA);
+
+	/* reduce hdr_len from pkt_len and data_len */
+	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask,
+			(uint32_t)-hdr_size);
+
+	__m512i v_value = _mm512_add_epi32(values, mbuf_len_offset);
+
+	/* assert offset of data_len */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+		offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+
+	__m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3],
+					   addrs[2] + 8, addrs[2],
+					   addrs[1] + 8, addrs[1],
+					   addrs[0] + 8, addrs[0]);
+	/* batch store into mbufs */
+	_mm512_i64scatter_epi64(0, v_index, v_value, 1);
+
+	if (hw->has_rx_offload) {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			char *addr = (char *)rx_pkts[i]->buf_addr +
+				RTE_PKTMBUF_HEADROOM - hdr_size;
+			virtio_vec_rx_offload(rx_pkts[i],
+					(struct virtio_net_hdr *)addr);
+		}
+	}
+
+	virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len,
+			rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len,
+			rx_pkts[3]->pkt_len);
+
+	vq->vq_free_cnt += PACKED_BATCH_SIZE;
+
+	vq->vq_used_cons_idx += PACKED_BATCH_SIZE;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static uint16_t
+virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq,
+				    struct rte_mbuf **rx_pkts)
+{
+	uint16_t used_idx, id;
+	uint32_t len;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint32_t hdr_size = hw->vtnet_hdr_size;
+	struct virtio_net_hdr *hdr;
+	struct vring_packed_desc *desc;
+	struct rte_mbuf *cookie;
+
+	desc = vq->vq_packed.ring.desc;
+	used_idx = vq->vq_used_cons_idx;
+	if (!desc_is_used(&desc[used_idx], vq))
+		return -1;
+
+	len = desc[used_idx].len;
+	id = desc[used_idx].id;
+	cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie;
+	if (unlikely(cookie == NULL)) {
+		PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u",
+				vq->vq_used_cons_idx);
+		return -1;
+	}
+	rte_prefetch0(cookie);
+	rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
+
+	cookie->data_off = RTE_PKTMBUF_HEADROOM;
+	cookie->ol_flags = 0;
+	cookie->pkt_len = (uint32_t)(len - hdr_size);
+	cookie->data_len = (uint32_t)(len - hdr_size);
+
+	hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr +
+					RTE_PKTMBUF_HEADROOM - hdr_size);
+	if (hw->has_rx_offload)
+		virtio_vec_rx_offload(cookie, hdr);
+
+	*rx_pkts = cookie;
+
+	rxvq->stats.bytes += cookie->pkt_len;
+
+	vq->vq_free_cnt++;
+	vq->vq_used_cons_idx++;
+	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
+		vq->vq_used_cons_idx -= vq->vq_nentries;
+		vq->vq_packed.used_wrap_counter ^= 1;
+	}
+
+	return 0;
+}
+
+static inline void
+virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq,
+			      struct rte_mbuf **cookie,
+			      uint16_t num)
+{
+	struct virtqueue *vq = rxvq->vq;
+	struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc;
+	uint16_t flags = vq->vq_packed.cached_flags;
+	struct virtio_hw *hw = vq->hw;
+	struct vq_desc_extra *dxp;
+	uint16_t idx, i;
+	uint16_t batch_num, total_num = 0;
+	uint16_t head_idx = vq->vq_avail_idx;
+	uint16_t head_flag = vq->vq_packed.cached_flags;
+	uint64_t addr;
+
+	do {
+		idx = vq->vq_avail_idx;
+
+		batch_num = PACKED_BATCH_SIZE;
+		if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries))
+			batch_num = vq->vq_nentries - idx;
+		if (unlikely((total_num + batch_num) > num))
+			batch_num = num - total_num;
+
+		virtio_for_each_try_unroll(i, 0, batch_num) {
+			dxp = &vq->vq_descx[idx + i];
+			dxp->cookie = (void *)cookie[total_num + i];
+
+			addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) +
+				RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size;
+			start_dp[idx + i].addr = addr;
+			start_dp[idx + i].len = cookie[total_num + i]->buf_len
+				- RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
+			if (total_num || i) {
+				virtqueue_store_flags_packed(&start_dp[idx + i],
+						flags, hw->weak_barriers);
+			}
+		}
+
+		vq->vq_avail_idx += batch_num;
+		if (vq->vq_avail_idx >= vq->vq_nentries) {
+			vq->vq_avail_idx -= vq->vq_nentries;
+			vq->vq_packed.cached_flags ^=
+				VRING_PACKED_DESC_F_AVAIL_USED;
+			flags = vq->vq_packed.cached_flags;
+		}
+		total_num += batch_num;
+	} while (total_num < num);
+
+	virtqueue_store_flags_packed(&start_dp[head_idx], head_flag,
+				hw->weak_barriers);
+	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num);
+}
+
+uint16_t
+virtio_recv_pkts_packed_vec(void *rx_queue,
+			    struct rte_mbuf **rx_pkts,
+			    uint16_t nb_pkts)
+{
+	struct virtnet_rx *rxvq = rx_queue;
+	struct virtqueue *vq = rxvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t num, nb_rx = 0;
+	uint32_t nb_enqueued = 0;
+	uint16_t free_cnt = vq->vq_free_thresh;
+
+	if (unlikely(hw->started == 0))
+		return nb_rx;
+
+	num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts);
+	if (likely(num > PACKED_BATCH_SIZE))
+		num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE);
+
+	while (num) {
+		if (!virtqueue_dequeue_batch_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx += PACKED_BATCH_SIZE;
+			num -= PACKED_BATCH_SIZE;
+			continue;
+		}
+		if (!virtqueue_dequeue_single_packed_vec(rxvq,
+					&rx_pkts[nb_rx])) {
+			nb_rx++;
+			num--;
+			continue;
+		}
+		break;
+	};
+
+	PMD_RX_LOG(DEBUG, "dequeue:%d", num);
+
+	rxvq->stats.packets += nb_rx;
+
+	if (likely(vq->vq_free_cnt >= free_cnt)) {
+		struct rte_mbuf *new_pkts[free_cnt];
+		if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts,
+						free_cnt) == 0)) {
+			virtio_recv_refill_packed_vec(rxvq, new_pkts,
+					free_cnt);
+			nb_enqueued += free_cnt;
+		} else {
+			struct rte_eth_dev *dev =
+				&rte_eth_devices[rxvq->port_id];
+			dev->data->rx_mbuf_alloc_failed += free_cnt;
+		}
+	}
+
+	if (likely(nb_enqueued)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_RX_LOG(DEBUG, "Notified");
+		}
+	}
+
+	return nb_rx;
+}
diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
index 40ad786cc..c54698ad1 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -528,6 +528,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev)
 	hw->use_msix = 1;
 	hw->modern   = 0;
 	hw->use_vec_rx = 0;
+	hw->use_vec_tx = 0;
 	hw->use_inorder_rx = 0;
 	hw->use_inorder_tx = 0;
 	hw->virtio_user_dev = dev;
@@ -739,8 +740,19 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
 		goto end;
 	}
 
-	if (vectorized)
-		hw->use_vec_rx = 1;
+	if (vectorized) {
+		if (packed_vq) {
+#if defined(CC_AVX512_SUPPORT)
+			hw->use_vec_rx = 1;
+			hw->use_vec_tx = 1;
+#else
+			PMD_INIT_LOG(INFO,
+				"building environment do not support packed ring vectorized");
+#endif
+		} else {
+			hw->use_vec_rx = 1;
+		}
+	}
 
 	rte_eth_dev_probing_finish(eth_dev);
 	ret = 0;
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index ca1c10499..ce0340743 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -239,7 +239,8 @@ struct vq_desc_extra {
 	void *cookie;
 	uint16_t ndescs;
 	uint16_t next;
-};
+	uint8_t padding[4];
+} __rte_packed __rte_aligned(16);
 
 struct virtqueue {
 	struct virtio_hw  *hw; /**< virtio_hw structure pointer. */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread
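
The batched Rx dequeue above leans on PACKED_BATCH_SIZE being defined as
RTE_CACHE_LINE_SIZE / sizeof(struct vring_packed_desc): one cache line worth
of packed descriptors is fetched with a single 512-bit load, and one masked
compare then tells whether the whole batch is marked used. A scalar sketch of
that check, assuming the driver's packed ring definitions are in scope (this
is an illustration of the idea, not the driver code itself):

static inline int
batch_descs_used(const struct vring_packed_desc *desc, uint16_t id,
		 uint16_t used_wrap_counter)
{
	uint16_t i;

	for (i = 0; i < PACKED_BATCH_SIZE; i++) {
		uint16_t flags = desc[id + i].flags;
		uint16_t avail = !!(flags & VRING_PACKED_DESC_F_AVAIL);
		uint16_t used = !!(flags & VRING_PACKED_DESC_F_USED);

		/* used when both flag bits match the used wrap counter */
		if (avail != used_wrap_counter || used != used_wrap_counter)
			return 0;
	}
	return 1;
}

In virtqueue_dequeue_batch_packed_vec() this loop collapses into the
_mm512_and_epi64() plus _mm512_cmpneq_epu64_mask() pair, and any batch that
is not fully used or not aligned on a batch boundary is handled by
virtqueue_dequeue_single_packed_vec() instead.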

* [dpdk-dev] [PATCH v10 7/9] net/virtio: add vectorized packed ring Tx path
  2020-04-26  2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu
                     ` (5 preceding siblings ...)
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path Marvin Liu
@ 2020-04-26  2:19   ` Marvin Liu
  2020-04-27 11:55     ` Maxime Coquelin
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 8/9] net/virtio: add election for vectorized path Marvin Liu
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 9/9] doc: add packed " Marvin Liu
  8 siblings, 1 reply; 162+ messages in thread
From: Marvin Liu @ 2020-04-26  2:19 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Optimize the packed ring Tx path in the same way as the Rx path: split the
Tx path into batch and single Tx functions. The batch function is further
optimized with AVX512 instructions.
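
The batch function only takes the common case: before four descriptors are
built and written with a single 512-bit store, every mbuf in the batch must
be a single segment owned only by the driver, with enough headroom for the
virtio net header; otherwise the burst falls back to the single enqueue
function. A scalar sketch of that eligibility test (the AVX512 code performs
it with two masked compares on the mbuf rearm_data), assuming the usual
rte_mbuf helpers and offered only as an illustration:

static inline int
tx_batch_is_eligible(struct rte_mbuf **tx_pkts, uint16_t hdr_size)
{
	uint16_t i;

	for (i = 0; i < PACKED_BATCH_SIZE; i++) {
		struct rte_mbuf *m = tx_pkts[i];

		/* exactly one reference and one segment */
		if (rte_mbuf_refcnt_read(m) != 1 || m->nb_segs != 1)
			return 0;
		/* headroom must hold the prepended virtio net header */
		if (rte_pktmbuf_headroom(m) < hdr_size)
			return 0;
	}
	return 1;
}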

Signed-off-by: Marvin Liu <yong.liu@intel.com>

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 5c112cac7..b7d52d497 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		uint16_t nb_pkts);
 
+uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		uint16_t nb_pkts);
+
 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 
 void virtio_interrupt_handler(void *param);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 534562cca..460e9d4a2 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -2038,3 +2038,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
 {
 	return 0;
 }
+
+__rte_weak uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused,
+			    struct rte_mbuf **tx_pkts __rte_unused,
+			    uint16_t nb_pkts __rte_unused)
+{
+	return 0;
+}
diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
index 8a7b459eb..43cee4244 100644
--- a/drivers/net/virtio/virtio_rxtx_packed_avx.c
+++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
@@ -23,6 +23,24 @@
 #define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \
 	FLAGS_BITS_OFFSET)
 
+/* reference count offset in mbuf rearm data */
+#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
+	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+/* segment number offset in mbuf rearm data */
+#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
+	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+
+/* default rearm data */
+#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
+	1ULL << REFCNT_BITS_OFFSET)
+
+/* id bits offset in packed ring desc higher 64bits */
+#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \
+	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
+
+/* net hdr short size mask */
+#define NET_HDR_MASK 0x3F
+
 #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
 	sizeof(struct vring_packed_desc))
 #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
@@ -60,6 +78,237 @@ virtio_update_batch_stats(struct virtnet_stats *stats,
 	stats->bytes += pkt_len4;
 }
 
+static inline int
+virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq,
+				   struct rte_mbuf **tx_pkts)
+{
+	struct virtqueue *vq = txvq->vq;
+	uint16_t head_size = vq->hw->vtnet_hdr_size;
+	uint16_t idx = vq->vq_avail_idx;
+	struct virtio_net_hdr *hdr;
+	uint16_t i, cmp;
+
+	if (vq->vq_avail_idx & PACKED_BATCH_MASK)
+		return -1;
+
+	if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries))
+		return -1;
+
+	/* Load four mbufs rearm data */
+	RTE_BUILD_BUG_ON(REFCNT_BITS_OFFSET >= 64);
+	RTE_BUILD_BUG_ON(SEG_NUM_BITS_OFFSET >= 64);
+	__m256i mbufs = _mm256_set_epi64x(*tx_pkts[3]->rearm_data,
+					  *tx_pkts[2]->rearm_data,
+					  *tx_pkts[1]->rearm_data,
+					  *tx_pkts[0]->rearm_data);
+
+	/* refcnt=1 and nb_segs=1 */
+	__m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
+	__m256i head_rooms = _mm256_set1_epi16(head_size);
+
+	/* Check refcnt and nb_segs */
+	const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12;
+	cmp = _mm256_mask_cmpneq_epu16_mask(mask, mbufs, mbuf_ref);
+	if (unlikely(cmp))
+		return -1;
+
+	/* Check headroom is enough */
+	const __mmask16 data_mask = 0x1 | 0x1 << 4 | 0x1 << 8 | 0x1 << 12;
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_off) !=
+		offsetof(struct rte_mbuf, rearm_data));
+	cmp = _mm256_mask_cmplt_epu16_mask(data_mask, mbufs, head_rooms);
+	if (unlikely(cmp))
+		return -1;
+
+	__m512i v_descx = _mm512_set_epi64(0x1, (uint64_t)tx_pkts[3],
+					   0x1, (uint64_t)tx_pkts[2],
+					   0x1, (uint64_t)tx_pkts[1],
+					   0x1, (uint64_t)tx_pkts[0]);
+
+	_mm512_storeu_si512((void *)&vq->vq_descx[idx], v_descx);
+
+	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+		tx_pkts[i]->data_off -= head_size;
+		tx_pkts[i]->data_len += head_size;
+	}
+
+#ifdef RTE_VIRTIO_USER
+	__m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])),
+			tx_pkts[2]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])),
+			tx_pkts[1]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])),
+			tx_pkts[0]->data_len,
+			(uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0])));
+#else
+	__m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len,
+					      tx_pkts[3]->buf_iova,
+					      tx_pkts[2]->data_len,
+					      tx_pkts[2]->buf_iova,
+					      tx_pkts[1]->data_len,
+					      tx_pkts[1]->buf_iova,
+					      tx_pkts[0]->data_len,
+					      tx_pkts[0]->buf_iova);
+#endif
+
+	/* id offset and data offset */
+	__m512i data_offsets = _mm512_set_epi64((uint64_t)3 << ID_BITS_OFFSET,
+						tx_pkts[3]->data_off,
+						(uint64_t)2 << ID_BITS_OFFSET,
+						tx_pkts[2]->data_off,
+						(uint64_t)1 << ID_BITS_OFFSET,
+						tx_pkts[1]->data_off,
+						0, tx_pkts[0]->data_off);
+
+	__m512i new_descs = _mm512_add_epi64(descs_base, data_offsets);
+
+	uint64_t flags_temp = (uint64_t)idx << ID_BITS_OFFSET |
+		(uint64_t)vq->vq_packed.cached_flags << FLAGS_BITS_OFFSET;
+
+	/* flags offset and guest virtual address offset */
+#ifdef RTE_VIRTIO_USER
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset);
+#else
+	__m128i flag_offset = _mm_set_epi64x(flags_temp, 0);
+#endif
+	__m512i v_offset = _mm512_broadcast_i32x4(flag_offset);
+
+	__m512i v_desc = _mm512_add_epi64(new_descs, v_offset);
+
+	if (!vq->hw->has_tx_offload) {
+		__m128i all_mask = _mm_set1_epi16(0xFFFF);
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			__m128i v_hdr = _mm_loadu_si128((void *)hdr);
+			if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK,
+							v_hdr, all_mask))) {
+				__m128i all_zero = _mm_setzero_si128();
+				_mm_mask_storeu_epi16((void *)hdr,
+						NET_HDR_MASK, all_zero);
+			}
+		}
+	} else {
+		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+			hdr = rte_pktmbuf_mtod_offset(tx_pkts[i],
+					struct virtio_net_hdr *, -head_size);
+			virtqueue_xmit_offload(hdr, tx_pkts[i], true);
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	_mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], v_desc);
+
+	virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len,
+			tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len,
+			tx_pkts[3]->pkt_len);
+
+	vq->vq_avail_idx += PACKED_BATCH_SIZE;
+	vq->vq_free_cnt -= PACKED_BATCH_SIZE;
+
+	if (vq->vq_avail_idx >= vq->vq_nentries) {
+		vq->vq_avail_idx -= vq->vq_nentries;
+		vq->vq_packed.cached_flags ^=
+			VRING_PACKED_DESC_F_AVAIL_USED;
+	}
+
+	return 0;
+}
+
+static inline int
+virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq,
+				    struct rte_mbuf *txm)
+{
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t hdr_size = hw->vtnet_hdr_size;
+	uint16_t slots, can_push;
+	int16_t need;
+
+	/* How many main ring entries are needed for this Tx?
+	 * any_layout => number of segments
+	 * default    => number of segments + 1
+	 */
+	can_push = rte_mbuf_refcnt_read(txm) == 1 &&
+		   RTE_MBUF_DIRECT(txm) &&
+		   txm->nb_segs == 1 &&
+		   rte_pktmbuf_headroom(txm) >= hdr_size;
+
+	slots = txm->nb_segs + !can_push;
+	need = slots - vq->vq_free_cnt;
+
+	/* Positive value indicates it needs free vring descriptors */
+	if (unlikely(need > 0)) {
+		virtio_xmit_cleanup_inorder_packed(vq, need);
+		need = slots - vq->vq_free_cnt;
+		if (unlikely(need > 0)) {
+			PMD_TX_LOG(ERR,
+				   "No free tx descriptors to transmit");
+			return -1;
+		}
+	}
+
+	/* Enqueue Packet buffers */
+	virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1);
+
+	txvq->stats.bytes += txm->pkt_len;
+	return 0;
+}
+
+uint16_t
+virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+			uint16_t nb_pkts)
+{
+	struct virtnet_tx *txvq = tx_queue;
+	struct virtqueue *vq = txvq->vq;
+	struct virtio_hw *hw = vq->hw;
+	uint16_t nb_tx = 0;
+	uint16_t remained;
+
+	if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts))
+		return nb_tx;
+
+	if (unlikely(nb_pkts < 1))
+		return nb_pkts;
+
+	PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+	if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh)
+		virtio_xmit_cleanup_inorder_packed(vq, vq->vq_free_thresh);
+
+	remained = RTE_MIN(nb_pkts, vq->vq_free_cnt);
+
+	while (remained) {
+		if (remained >= PACKED_BATCH_SIZE) {
+			if (!virtqueue_enqueue_batch_packed_vec(txvq,
+						&tx_pkts[nb_tx])) {
+				nb_tx += PACKED_BATCH_SIZE;
+				remained -= PACKED_BATCH_SIZE;
+				continue;
+			}
+		}
+		if (!virtqueue_enqueue_single_packed_vec(txvq,
+					tx_pkts[nb_tx])) {
+			nb_tx++;
+			remained--;
+			continue;
+		}
+		break;
+	};
+
+	txvq->stats.packets += nb_tx;
+
+	if (likely(nb_tx)) {
+		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
+			virtqueue_notify(vq);
+			PMD_TX_LOG(DEBUG, "Notified backend after xmit");
+		}
+	}
+
+	return nb_tx;
+}
+
 /* Optionally fill offload information in structure */
 static inline int
 virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v10 8/9] net/virtio: add election for vectorized path
  2020-04-26  2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu
                     ` (6 preceding siblings ...)
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu
@ 2020-04-26  2:19   ` Marvin Liu
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 9/9] doc: add packed " Marvin Liu
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-26  2:19 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Rewrite the vectorized path selection logic. The default setting comes from
the vectorized devarg, then each criterion below is checked; a short sketch
of the resulting election follows the lists.

Packed ring vectorized path needs:
    AVX512F and required extensions are supported by compiler and host
    VERSION_1 and IN_ORDER features are negotiated
    mergeable feature is not negotiated
    LRO offloading is disabled

Split ring vectorized Rx path needs:
    mergeable and IN_ORDER features are not negotiated
    LRO, checksum and VLAN strip offloads are disabled
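
A minimal sketch of the packed ring part of this election, assuming the
virtio_hw fields and feature helpers used elsewhere in this series (the real
code in the diff also logs the reason whenever a path is rejected):

static void
virtio_packed_vec_path_election(struct virtio_hw *hw, uint64_t rx_offloads)
{
	if (!hw->use_vec_rx && !hw->use_vec_tx)
		return; /* "vectorized" devarg was not requested */

	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
	    !vtpci_with_feature(hw, VIRTIO_F_VERSION_1) ||
	    !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
		hw->use_vec_rx = 0;
		hw->use_vec_tx = 0;
		return;
	}

	/* Rx additionally requires non-mergeable buffers and no LRO */
	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF) ||
	    (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO))
		hw->use_vec_rx = 0;
}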

Signed-off-by: Marvin Liu <yong.liu@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 0a69a4db1..f8ff41d99 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1523,9 +1523,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	if (vtpci_packed_queue(hw)) {
 		PMD_INIT_LOG(INFO,
 			"virtio: using packed ring %s Tx path on port %u",
-			hw->use_inorder_tx ? "inorder" : "standard",
+			hw->use_vec_tx ? "vectorized" : "standard",
 			eth_dev->data->port_id);
-		eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
+		if (hw->use_vec_tx)
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec;
+		else
+			eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
 	} else {
 		if (hw->use_inorder_tx) {
 			PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u",
@@ -1539,7 +1542,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
 	}
 
 	if (vtpci_packed_queue(hw)) {
-		if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+		if (hw->use_vec_rx) {
+			PMD_INIT_LOG(INFO,
+				"virtio: using packed ring vectorized Rx path on port %u",
+				eth_dev->data->port_id);
+			eth_dev->rx_pkt_burst =
+				&virtio_recv_pkts_packed_vec;
+		} else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
 			PMD_INIT_LOG(INFO,
 				"virtio: using packed ring mergeable buffer Rx path on port %u",
 				eth_dev->data->port_id);
@@ -1952,8 +1961,17 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 		goto err_virtio_init;
 
 	if (vectorized) {
-		if (!vtpci_packed_queue(hw))
+		if (!vtpci_packed_queue(hw)) {
+			hw->use_vec_rx = 1;
+		} else {
+#if !defined(CC_AVX512_SUPPORT)
+			PMD_DRV_LOG(INFO,
+				"building environment do not support packed ring vectorized");
+#else
 			hw->use_vec_rx = 1;
+			hw->use_vec_tx = 1;
+#endif
+		}
 	}
 
 	hw->opened = true;
@@ -2102,8 +2120,8 @@ virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa,
 	if (vectorized &&
 		rte_kvargs_count(kvlist, VIRTIO_ARG_VECTORIZED) == 1) {
 		ret = rte_kvargs_process(kvlist,
-				VIRTIO_ARG_VECTORIZED,
-				vectorized_check_handler, vectorized);
+					VIRTIO_ARG_VECTORIZED,
+					vectorized_check_handler, vectorized);
 		if (ret < 0) {
 			PMD_INIT_LOG(ERR, "Failed to parse %s",
 					VIRTIO_ARG_VECTORIZED);
@@ -2288,31 +2306,61 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 			return -EBUSY;
 		}
 
-	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
-		hw->use_inorder_tx = 1;
-		hw->use_inorder_rx = 1;
-		hw->use_vec_rx = 0;
-	}
-
 	if (vtpci_packed_queue(hw)) {
-		hw->use_vec_rx = 0;
-		hw->use_inorder_rx = 0;
-	}
+		if ((hw->use_vec_rx || hw->use_vec_tx) &&
+		    (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
+		     !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) ||
+		     !vtpci_with_feature(hw, VIRTIO_F_VERSION_1))) {
+			PMD_DRV_LOG(INFO,
+				"disabled packed ring vectorized path for requirements not met");
+			hw->use_vec_rx = 0;
+			hw->use_vec_tx = 0;
+		}
 
+		if (hw->use_vec_rx) {
+			if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+				PMD_DRV_LOG(INFO,
+					"disabled packed ring vectorized rx for mrg_rxbuf enabled");
+				hw->use_vec_rx = 0;
+			}
+
+			if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) {
+				PMD_DRV_LOG(INFO,
+					"disabled packed ring vectorized rx for TCP_LRO enabled");
+				hw->use_vec_rx = 0;
+			}
+		}
+	} else {
+		if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
+			hw->use_inorder_tx = 1;
+			hw->use_inorder_rx = 1;
+			hw->use_vec_rx = 0;
+		}
+
+		if (hw->use_vec_rx) {
 #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
-	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
-		hw->use_vec_rx = 0;
-	}
+			if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized path for requirement not met");
+				hw->use_vec_rx = 0;
+			}
 #endif
-	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
-		hw->use_vec_rx = 0;
-	}
+			if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized rx for mrg_rxbuf enabled");
+				hw->use_vec_rx = 0;
+			}
 
-	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
-			   DEV_RX_OFFLOAD_TCP_CKSUM |
-			   DEV_RX_OFFLOAD_TCP_LRO |
-			   DEV_RX_OFFLOAD_VLAN_STRIP))
-		hw->use_vec_rx = 0;
+			if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
+					   DEV_RX_OFFLOAD_TCP_CKSUM |
+					   DEV_RX_OFFLOAD_TCP_LRO |
+					   DEV_RX_OFFLOAD_VLAN_STRIP)) {
+				PMD_DRV_LOG(INFO,
+					"disabled split ring vectorized rx for offloading enabled");
+				hw->use_vec_rx = 0;
+			}
+		}
+	}
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [dpdk-dev] [PATCH v10 9/9] doc: add packed vectorized path
  2020-04-26  2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu
                     ` (7 preceding siblings ...)
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 8/9] net/virtio: add election for vectorized path Marvin Liu
@ 2020-04-26  2:19   ` Marvin Liu
  8 siblings, 0 replies; 162+ messages in thread
From: Marvin Liu @ 2020-04-26  2:19 UTC (permalink / raw)
  To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

Document packed virtqueue vectorized path selection logic in virtio net
PMD.

Signed-off-by: Marvin Liu <yong.liu@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index d59add23e..dbcf49ae1 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -482,6 +482,13 @@ according to below configuration:
    both negotiated, this path will be selected.
 #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and
    Rx mergeable is not negotiated, this path will be selected.
+#. Packed virtqueue vectorized Rx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated &&
+   TCP_LRO Rx offloading is disabled && vectorized option enabled,
+   this path will be selected.
+#. Packed virtqueue vectorized Tx path: If building and running environment support
+   AVX512 && in-order feature is negotiated && vectorized option enabled,
+   this path will be selected.
 
 Rx/Tx callbacks of each Virtio path
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -504,6 +511,8 @@ are shown in below table:
    Packed virtqueue non-meregable path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed
    Packed virtqueue in-order mergeable path     virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed
    Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed           virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Rx path          virtio_recv_pkts_packed_vec       virtio_xmit_pkts_packed
+   Packed virtqueue vectorized Tx path          virtio_recv_pkts_packed           virtio_xmit_pkts_packed_vec
    ============================================ ================================= ========================
 
 Virtio paths Support Status from Release to Release
@@ -521,20 +530,22 @@ All virtio paths support status are shown in below table:
 
 .. table:: Virtio Paths and Releases
 
-   ============================================ ============= ============= =============
-                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11
-   ============================================ ============= ============= =============
-   Split virtqueue mergeable path                     Y             Y             Y
-   Split virtqueue non-mergeable path                 Y             Y             Y
-   Split virtqueue vectorized Rx path                 Y             Y             Y
-   Split virtqueue simple Tx path                     Y             N             N
-   Split virtqueue in-order mergeable path                          Y             Y
-   Split virtqueue in-order non-mergeable path                      Y             Y
-   Packed virtqueue mergeable path                                                Y
-   Packed virtqueue non-mergeable path                                            Y
-   Packed virtqueue in-order mergeable path                                       Y
-   Packed virtqueue in-order non-mergeable path                                   Y
-   ============================================ ============= ============= =============
+   ============================================ ============= ============= ============= =======
+                  Virtio paths                  16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~
+   ============================================ ============= ============= ============= =======
+   Split virtqueue mergeable path                     Y             Y             Y          Y
+   Split virtqueue non-mergeable path                 Y             Y             Y          Y
+   Split virtqueue vectorized Rx path                 Y             Y             Y          Y
+   Split virtqueue simple Tx path                     Y             N             N          N
+   Split virtqueue in-order mergeable path                          Y             Y          Y
+   Split virtqueue in-order non-mergeable path                      Y             Y          Y
+   Packed virtqueue mergeable path                                                Y          Y
+   Packed virtqueue non-mergeable path                                            Y          Y
+   Packed virtqueue in-order mergeable path                                       Y          Y
+   Packed virtqueue in-order non-mergeable path                                   Y          Y
+   Packed virtqueue vectorized Rx path                                                       Y
+   Packed virtqueue vectorized Tx path                                                       Y
+   ============================================ ============= ============= ============= =======
 
 QEMU Support Status
 ~~~~~~~~~~~~~~~~~~~
-- 
2.17.1


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v10 4/9] net/virtio-user: add vectorized devarg
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 4/9] net/virtio-user: " Marvin Liu
@ 2020-04-27 11:07     ` Maxime Coquelin
  2020-04-28  1:29       ` Liu, Yong
  0 siblings, 1 reply; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-27 11:07 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev



On 4/26/20 4:19 AM, Marvin Liu wrote:
> Add new devarg for virtio user device vectorized path selection. By
> default vectorized path is disabled.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 
> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
> index 902a1f0cf..d59add23e 100644
> --- a/doc/guides/nics/virtio.rst
> +++ b/doc/guides/nics/virtio.rst
> @@ -424,6 +424,12 @@ Below devargs are supported by the virtio-user vdev:
>      rte_eth_link_get_nowait function.
>      (Default: 10000 (10G))
>  
> +#.  ``vectorized``:
> +
> +    It is used to specify whether virtio device perfer to use vectorized path.

s/perfer/prefers/

I'll fix while applying if the rest of the series is ok.

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread
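
For illustration, a virtio-user port that can elect the packed ring
vectorized path would be created with a devargs string along the lines of
path=/tmp/vhost-user.sock,packed_vq=1,in_order=1,mrg_rxbuf=0,vectorized=1
(the socket path here is only a placeholder); the packed_vq, in_order and
mrg_rxbuf settings mirror the path election criteria of patch 8/9, while
vectorized=1 is the devarg added in patch 4/9.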

* Re: [dpdk-dev] [PATCH v10 5/9] net/virtio: reuse packed ring functions
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 5/9] net/virtio: reuse packed ring functions Marvin Liu
@ 2020-04-27 11:08     ` Maxime Coquelin
  0 siblings, 0 replies; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-27 11:08 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev



On 4/26/20 4:19 AM, Marvin Liu wrote:
> Move the offload, xmit cleanup and packed xmit enqueue functions to the
> header file. These functions will be reused by the packed ring vectorized
> path.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v10 3/9] net/virtio: add vectorized devarg
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 3/9] net/virtio: add vectorized devarg Marvin Liu
@ 2020-04-27 11:12     ` Maxime Coquelin
  0 siblings, 0 replies; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-27 11:12 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev



On 4/26/20 4:19 AM, Marvin Liu wrote:
> Previously, virtio split ring vectorized path was enabled by default.
> This is not suitable for everyone because that path does not follow the
> virtio spec. Add a new devarg for virtio vectorized path selection. By
> default vectorized path is disabled.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> 
> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
> index 6286286db..902a1f0cf 100644
> --- a/doc/guides/nics/virtio.rst
> +++ b/doc/guides/nics/virtio.rst
> @@ -363,6 +363,13 @@ Below devargs are supported by the PCI virtio driver:
>      rte_eth_link_get_nowait function.
>      (Default: 10000 (10G))
>  
> +#.  ``vectorized``:
> +
> +    It is used to specify whether virtio device perfer to use vectorized path.

s/perfer/prefers/

> +    Afterwards, dependencies of vectorized path will be checked in path
> +    election.
> +    (Default: 0 (disabled))
> +
>  Below devargs are supported by the virtio-user vdev:
>  
>  #.  ``path``:
> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
> index 37766cbb6..0a69a4db1 100644
> --- a/drivers/net/virtio/virtio_ethdev.c
> +++ b/drivers/net/virtio/virtio_ethdev.c
> @@ -48,7 +48,8 @@ static int virtio_dev_allmulticast_disable(struct rte_eth_dev *dev);
>  static uint32_t virtio_dev_speed_capa_get(uint32_t speed);
>  static int virtio_dev_devargs_parse(struct rte_devargs *devargs,
>  	int *vdpa,
> -	uint32_t *speed);
> +	uint32_t *speed,
> +	int *vectorized);
>  static int virtio_dev_info_get(struct rte_eth_dev *dev,
>  				struct rte_eth_dev_info *dev_info);
>  static int virtio_dev_link_update(struct rte_eth_dev *dev,
> @@ -1551,8 +1552,8 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev)
>  			eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed;
>  		}
>  	} else {
> -		if (hw->use_simple_rx) {
> -			PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u",
> +		if (hw->use_vec_rx) {
> +			PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u",
>  				eth_dev->data->port_id);
>  			eth_dev->rx_pkt_burst = virtio_recv_pkts_vec;
>  		} else if (hw->use_inorder_rx) {
> @@ -1886,6 +1887,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
>  {
>  	struct virtio_hw *hw = eth_dev->data->dev_private;
>  	uint32_t speed = SPEED_UNKNOWN;
> +	int vectorized = 0;
>  	int ret;
>  
>  	if (sizeof(struct virtio_net_hdr_mrg_rxbuf) > RTE_PKTMBUF_HEADROOM) {
> @@ -1912,7 +1914,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
>  		return 0;
>  	}
>  	ret = virtio_dev_devargs_parse(eth_dev->device->devargs,
> -		 NULL, &speed);
> +		 NULL, &speed, &vectorized);
>  	if (ret < 0)
>  		return ret;
>  	hw->speed = speed;
> @@ -1949,6 +1951,11 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
>  	if (ret < 0)
>  		goto err_virtio_init;
>  
> +	if (vectorized) {
> +		if (!vtpci_packed_queue(hw))
> +			hw->use_vec_rx = 1;
> +	}
> +
>  	hw->opened = true;
>  
>  	return 0;
> @@ -2021,9 +2028,20 @@ virtio_dev_speed_capa_get(uint32_t speed)
>  	}
>  }
>  
> +static int vectorized_check_handler(__rte_unused const char *key,
> +		const char *value, void *ret_val)
> +{
> +	if (strcmp(value, "1") == 0)
> +		*(int *)ret_val = 1;
> +	else
> +		*(int *)ret_val = 0;
> +
> +	return 0;
> +}
>  
>  #define VIRTIO_ARG_SPEED      "speed"
>  #define VIRTIO_ARG_VDPA       "vdpa"
> +#define VIRTIO_ARG_VECTORIZED "vectorized"
>  
>  
>  static int
> @@ -2045,7 +2063,7 @@ link_speed_handler(const char *key __rte_unused,
>  
>  static int
>  virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa,
> -	uint32_t *speed)
> +	uint32_t *speed, int *vectorized)
>  {
>  	struct rte_kvargs *kvlist;
>  	int ret = 0;
> @@ -2081,6 +2099,18 @@ virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa,
>  		}
>  	}
>  
> +	if (vectorized &&
> +		rte_kvargs_count(kvlist, VIRTIO_ARG_VECTORIZED) == 1) {
> +		ret = rte_kvargs_process(kvlist,
> +				VIRTIO_ARG_VECTORIZED,
> +				vectorized_check_handler, vectorized);
> +		if (ret < 0) {
> +			PMD_INIT_LOG(ERR, "Failed to parse %s",
> +					VIRTIO_ARG_VECTORIZED);
> +			goto exit;
> +		}
> +	}
> +
>  exit:
>  	rte_kvargs_free(kvlist);
>  	return ret;
> @@ -2092,7 +2122,8 @@ static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
>  	int vdpa = 0;
>  	int ret = 0;
>  
> -	ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL);
> +	ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL,
> +		NULL);
>  	if (ret < 0) {
>  		PMD_INIT_LOG(ERR, "devargs parsing is failed");
>  		return ret;
> @@ -2257,33 +2288,31 @@ virtio_dev_configure(struct rte_eth_dev *dev)
>  			return -EBUSY;
>  		}
>  
> -	hw->use_simple_rx = 1;
> -
>  	if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
>  		hw->use_inorder_tx = 1;
>  		hw->use_inorder_rx = 1;
> -		hw->use_simple_rx = 0;
> +		hw->use_vec_rx = 0;
>  	}
>  
>  	if (vtpci_packed_queue(hw)) {
> -		hw->use_simple_rx = 0;
> +		hw->use_vec_rx = 0;
>  		hw->use_inorder_rx = 0;
>  	}
>  
>  #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
>  	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
> -		hw->use_simple_rx = 0;
> +		hw->use_vec_rx = 0;
>  	}
>  #endif
>  	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
> -		 hw->use_simple_rx = 0;
> +		hw->use_vec_rx = 0;
>  	}
>  
>  	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
>  			   DEV_RX_OFFLOAD_TCP_CKSUM |
>  			   DEV_RX_OFFLOAD_TCP_LRO |
>  			   DEV_RX_OFFLOAD_VLAN_STRIP))
> -		hw->use_simple_rx = 0;
> +		hw->use_vec_rx = 0;
>  
>  	return 0;
>  }
> diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
> index bd89357e4..668e688e1 100644
> --- a/drivers/net/virtio/virtio_pci.h
> +++ b/drivers/net/virtio/virtio_pci.h
> @@ -253,7 +253,8 @@ struct virtio_hw {
>  	uint8_t	    vlan_strip;
>  	uint8_t	    use_msix;
>  	uint8_t     modern;
> -	uint8_t     use_simple_rx;
> +	uint8_t     use_vec_rx;
> +	uint8_t     use_vec_tx;
>  	uint8_t     use_inorder_rx;
>  	uint8_t     use_inorder_tx;
>  	uint8_t     weak_barriers;
> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
> index e450477e8..84f4cf946 100644
> --- a/drivers/net/virtio/virtio_rxtx.c
> +++ b/drivers/net/virtio/virtio_rxtx.c
> @@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
>  	/* Allocate blank mbufs for the each rx descriptor */
>  	nbufs = 0;
>  
> -	if (hw->use_simple_rx) {
> +	if (hw->use_vec_rx && !vtpci_packed_queue(hw)) {
>  		for (desc_idx = 0; desc_idx < vq->vq_nentries;
>  		     desc_idx++) {
>  			vq->vq_split.ring.avail->ring[desc_idx] = desc_idx;
> @@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx)
>  			&rxvq->fake_mbuf;
>  	}
>  
> -	if (hw->use_simple_rx) {
> +	if (hw->use_vec_rx && !vtpci_packed_queue(hw)) {
>  		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
>  			virtio_rxq_rearm_vec(rxvq);
>  			nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
> diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
> index 953f00d72..150a8d987 100644
> --- a/drivers/net/virtio/virtio_user_ethdev.c
> +++ b/drivers/net/virtio/virtio_user_ethdev.c
> @@ -525,7 +525,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev)
>  	 */
>  	hw->use_msix = 1;
>  	hw->modern   = 0;
> -	hw->use_simple_rx = 0;
> +	hw->use_vec_rx = 0;
>  	hw->use_inorder_rx = 0;
>  	hw->use_inorder_tx = 0;
>  	hw->virtio_user_dev = dev;
> diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c
> index 0b4e3bf3e..ca23180de 100644
> --- a/drivers/net/virtio/virtqueue.c
> +++ b/drivers/net/virtio/virtqueue.c
> @@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq)
>  	end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1);
>  
>  	for (idx = 0; idx < vq->vq_nentries; idx++) {
> -		if (hw->use_simple_rx && type == VTNET_RQ) {
> +		if (hw->use_vec_rx && !vtpci_packed_queue(hw) &&
> +		    type == VTNET_RQ) {
>  			if (start <= end && idx >= start && idx < end)
>  				continue;
>  			if (start > end && (idx >= start || idx < end))
> @@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
>  	for (i = 0; i < nb_used; i++) {
>  		used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1);
>  		uep = &vq->vq_split.ring.used->ring[used_idx];
> -		if (hw->use_simple_rx) {
> +		if (hw->use_vec_rx) {
>  			desc_idx = used_idx;
>  			rte_pktmbuf_free(vq->sw_ring[desc_idx]);
>  			vq->vq_free_cnt++;
> @@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq)
>  		vq->vq_used_cons_idx++;
>  	}
>  
> -	if (hw->use_simple_rx) {
> +	if (hw->use_vec_rx) {
>  		while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
>  			virtio_rxq_rearm_vec(rxq);
>  			if (virtqueue_kick_prepare(vq))
> 


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path Marvin Liu
@ 2020-04-27 11:20     ` Maxime Coquelin
  2020-04-28  1:14       ` Liu, Yong
  0 siblings, 1 reply; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-27 11:20 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev



On 4/26/20 4:19 AM, Marvin Liu wrote:
> Optimize packed ring Rx path with SIMD instructions. The optimization
> approach is similar to the vhost one: split the path into batch and
> single functions. The batch function is further optimized with AVX512
> instructions. Also pad the desc extra structure to 16 bytes with 16-byte
> alignment, so that four elements can be processed in one batch.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 
> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> index c9edb84ee..102b1deab 100644
> --- a/drivers/net/virtio/Makefile
> +++ b/drivers/net/virtio/Makefile
> @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
>  endif
>  
> +ifneq ($(FORCE_DISABLE_AVX512), y)
> +	CC_AVX512_SUPPORT=\
> +	$(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
> +	sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
> +	grep -q AVX512 && echo 1)
> +endif
> +
> +ifeq ($(CC_AVX512_SUPPORT), 1)
> +CFLAGS += -DCC_AVX512_SUPPORT
> +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
> +
> +ifeq ($(RTE_TOOLCHAIN), gcc)
> +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
> +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
> +endif
> +endif
> +
> +ifeq ($(RTE_TOOLCHAIN), clang)
> +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1)
> +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
> +endif
> +endif
> +
> +ifeq ($(RTE_TOOLCHAIN), icc)
> +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
> +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
> +endif
> +endif
> +
> +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl
> +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
> +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
> +endif
> +endif
> +
>  ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c
> diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
> index 15150eea1..8e68c3039 100644
> --- a/drivers/net/virtio/meson.build
> +++ b/drivers/net/virtio/meson.build
> @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c',
>  deps += ['kvargs', 'bus_pci']
>  
>  if arch_subdir == 'x86'
> +	if '-mno-avx512f' not in machine_args
> +		if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
> +			cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl']
> +			cflags += ['-DCC_AVX512_SUPPORT']
> +			if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> +				cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> +			elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> +				cflags += '-DVHOST_CLANG_UNROLL_PRAGMA'
> +			elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
> +				cflags += '-DVHOST_ICC_UNROLL_PRAGMA'
> +			endif
> +			sources += files('virtio_rxtx_packed_avx.c')
> +		endif
> +	endif
>  	sources += files('virtio_rxtx_simple_sse.c')
>  elif arch_subdir == 'ppc'
>  	sources += files('virtio_rxtx_simple_altivec.c')
> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
> index febaf17a8..5c112cac7 100644
> --- a/drivers/net/virtio/virtio_ethdev.h
> +++ b/drivers/net/virtio/virtio_ethdev.h
> @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts,
>  uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
>  		uint16_t nb_pkts);
>  
> +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> +		uint16_t nb_pkts);
> +
>  int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
>  
>  void virtio_interrupt_handler(void *param);
> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
> index a549991aa..534562cca 100644
> --- a/drivers/net/virtio/virtio_rxtx.c
> +++ b/drivers/net/virtio/virtio_rxtx.c
> @@ -2030,3 +2030,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
>  
>  	return nb_tx;
>  }
> +
> +__rte_weak uint16_t
> +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
> +			    struct rte_mbuf **rx_pkts __rte_unused,
> +			    uint16_t nb_pkts __rte_unused)
> +{
> +	return 0;
> +}
> diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> new file mode 100644
> index 000000000..8a7b459eb
> --- /dev/null
> +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> @@ -0,0 +1,374 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <errno.h>
> +
> +#include <rte_net.h>
> +
> +#include "virtio_logs.h"
> +#include "virtio_ethdev.h"
> +#include "virtio_pci.h"
> +#include "virtqueue.h"
> +
> +#define BYTE_SIZE 8
> +/* flag bits offset in packed ring desc higher 64bits */
> +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> +
> +#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \
> +	FLAGS_BITS_OFFSET)
> +
> +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
> +	sizeof(struct vring_packed_desc))
> +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
> +
> +#ifdef VIRTIO_GCC_UNROLL_PRAGMA
> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
> +	for (iter = val; iter < size; iter++)
> +#endif
> +
> +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
> +	for (iter = val; iter < size; iter++)
> +#endif
> +
> +#ifdef VIRTIO_ICC_UNROLL_PRAGMA
> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
> +	for (iter = val; iter < size; iter++)
> +#endif
> +
> +#ifndef virtio_for_each_try_unroll
> +#define virtio_for_each_try_unroll(iter, val, num) \
> +	for (iter = val; iter < num; iter++)
> +#endif
> +
> +static inline void
> +virtio_update_batch_stats(struct virtnet_stats *stats,
> +			  uint16_t pkt_len1,
> +			  uint16_t pkt_len2,
> +			  uint16_t pkt_len3,
> +			  uint16_t pkt_len4)
> +{
> +	stats->bytes += pkt_len1;
> +	stats->bytes += pkt_len2;
> +	stats->bytes += pkt_len3;
> +	stats->bytes += pkt_len4;
> +}
> +
> +/* Optionally fill offload information in structure */
> +static inline int
> +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
> +{
> +	struct rte_net_hdr_lens hdr_lens;
> +	uint32_t hdrlen, ptype;
> +	int l4_supported = 0;
> +
> +	/* nothing to do */
> +	if (hdr->flags == 0)
> +		return 0;
> +
> +	/* GSO not supported in vec path, skip check */
> +	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
> +
> +	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
> +	m->packet_type = ptype;
> +	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
> +	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
> +	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
> +		l4_supported = 1;
> +
> +	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> +		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
> +		if (hdr->csum_start <= hdrlen && l4_supported) {
> +			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
> +		} else {
> +			/* Unknown proto or tunnel, do sw cksum. We can assume
> +			 * the cksum field is in the first segment since the
> +			 * buffers we provided to the host are large enough.
> +			 * In case of SCTP, this will be wrong since it's a CRC
> +			 * but there's nothing we can do.
> +			 */
> +			uint16_t csum = 0, off;
> +
> +			rte_raw_cksum_mbuf(m, hdr->csum_start,
> +				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
> +				&csum);
> +			if (likely(csum != 0xffff))
> +				csum = ~csum;
> +			off = hdr->csum_offset + hdr->csum_start;
> +			if (rte_pktmbuf_data_len(m) >= off + 1)
> +				*rte_pktmbuf_mtod_offset(m, uint16_t *,
> +					off) = csum;
> +		}
> +	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
> +		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
> +	}
> +
> +	return 0;
> +}
> +
> +static inline uint16_t
> +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq,
> +				   struct rte_mbuf **rx_pkts)
> +{
> +	struct virtqueue *vq = rxvq->vq;
> +	struct virtio_hw *hw = vq->hw;
> +	uint16_t hdr_size = hw->vtnet_hdr_size;
> +	uint64_t addrs[PACKED_BATCH_SIZE];
> +	uint16_t id = vq->vq_used_cons_idx;
> +	uint8_t desc_stats;
> +	uint16_t i;
> +	void *desc_addr;
> +
> +	if (id & PACKED_BATCH_MASK)
> +		return -1;
> +
> +	if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries))
> +		return -1;
> +
> +	/* only care avail/used bits */
> +	__m512i v_mask = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
> +	desc_addr = &vq->vq_packed.ring.desc[id];
> +
> +	__m512i v_desc = _mm512_loadu_si512(desc_addr);
> +	__m512i v_flag = _mm512_and_epi64(v_desc, v_mask);
> +
> +	__m512i v_used_flag = _mm512_setzero_si512();
> +	if (vq->vq_packed.used_wrap_counter)
> +		v_used_flag = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
> +
> +	/* Check all descs are used */
> +	desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag);
> +	if (desc_stats)
> +		return -1;
> +
> +	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> +		rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie;
> +		rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *));
> +
> +		addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1;
> +	}
> +
> +	/*
> +	 * load len from desc, store into mbuf pkt_len and data_len
> +	 * len limited by 16-bit buf_len, pkt_len[16:31] can be ignored
> +	 */
> +	const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12;
> +	__m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, 0xAA);
> +
> +	/* reduce hdr_len from pkt_len and data_len */
> +	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask,
> +			(uint32_t)-hdr_size);
> +
> +	__m512i v_value = _mm512_add_epi32(values, mbuf_len_offset);
> +
> +	/* assert offset of data_len */
> +	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
> +		offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
> +
> +	__m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3],
> +					   addrs[2] + 8, addrs[2],
> +					   addrs[1] + 8, addrs[1],
> +					   addrs[0] + 8, addrs[0]);
> +	/* batch store into mbufs */
> +	_mm512_i64scatter_epi64(0, v_index, v_value, 1);
> +
> +	if (hw->has_rx_offload) {
> +		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> +			char *addr = (char *)rx_pkts[i]->buf_addr +
> +				RTE_PKTMBUF_HEADROOM - hdr_size;
> +			virtio_vec_rx_offload(rx_pkts[i],
> +					(struct virtio_net_hdr *)addr);
> +		}
> +	}
> +
> +	virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len,
> +			rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len,
> +			rx_pkts[3]->pkt_len);
> +
> +	vq->vq_free_cnt += PACKED_BATCH_SIZE;
> +
> +	vq->vq_used_cons_idx += PACKED_BATCH_SIZE;
> +	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
> +		vq->vq_used_cons_idx -= vq->vq_nentries;
> +		vq->vq_packed.used_wrap_counter ^= 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static uint16_t
> +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq,
> +				    struct rte_mbuf **rx_pkts)
> +{
> +	uint16_t used_idx, id;
> +	uint32_t len;
> +	struct virtqueue *vq = rxvq->vq;
> +	struct virtio_hw *hw = vq->hw;
> +	uint32_t hdr_size = hw->vtnet_hdr_size;
> +	struct virtio_net_hdr *hdr;
> +	struct vring_packed_desc *desc;
> +	struct rte_mbuf *cookie;
> +
> +	desc = vq->vq_packed.ring.desc;
> +	used_idx = vq->vq_used_cons_idx;
> +	if (!desc_is_used(&desc[used_idx], vq))
> +		return -1;
> +
> +	len = desc[used_idx].len;
> +	id = desc[used_idx].id;
> +	cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie;
> +	if (unlikely(cookie == NULL)) {
> +		PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u",
> +				vq->vq_used_cons_idx);
> +		return -1;
> +	}
> +	rte_prefetch0(cookie);
> +	rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
> +
> +	cookie->data_off = RTE_PKTMBUF_HEADROOM;
> +	cookie->ol_flags = 0;
> +	cookie->pkt_len = (uint32_t)(len - hdr_size);
> +	cookie->data_len = (uint32_t)(len - hdr_size);
> +
> +	hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr +
> +					RTE_PKTMBUF_HEADROOM - hdr_size);
> +	if (hw->has_rx_offload)
> +		virtio_vec_rx_offload(cookie, hdr);
> +
> +	*rx_pkts = cookie;
> +
> +	rxvq->stats.bytes += cookie->pkt_len;
> +
> +	vq->vq_free_cnt++;
> +	vq->vq_used_cons_idx++;
> +	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
> +		vq->vq_used_cons_idx -= vq->vq_nentries;
> +		vq->vq_packed.used_wrap_counter ^= 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static inline void
> +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq,
> +			      struct rte_mbuf **cookie,
> +			      uint16_t num)
> +{
> +	struct virtqueue *vq = rxvq->vq;
> +	struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc;
> +	uint16_t flags = vq->vq_packed.cached_flags;
> +	struct virtio_hw *hw = vq->hw;
> +	struct vq_desc_extra *dxp;
> +	uint16_t idx, i;
> +	uint16_t batch_num, total_num = 0;
> +	uint16_t head_idx = vq->vq_avail_idx;
> +	uint16_t head_flag = vq->vq_packed.cached_flags;
> +	uint64_t addr;
> +
> +	do {
> +		idx = vq->vq_avail_idx;
> +
> +		batch_num = PACKED_BATCH_SIZE;
> +		if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries))
> +			batch_num = vq->vq_nentries - idx;
> +		if (unlikely((total_num + batch_num) > num))
> +			batch_num = num - total_num;
> +
> +		virtio_for_each_try_unroll(i, 0, batch_num) {
> +			dxp = &vq->vq_descx[idx + i];
> +			dxp->cookie = (void *)cookie[total_num + i];
> +
> +			addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) +
> +				RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size;
> +			start_dp[idx + i].addr = addr;
> +			start_dp[idx + i].len = cookie[total_num + i]->buf_len
> +				- RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
> +			if (total_num || i) {
> +				virtqueue_store_flags_packed(&start_dp[idx + i],
> +						flags, hw->weak_barriers);
> +			}
> +		}
> +
> +		vq->vq_avail_idx += batch_num;
> +		if (vq->vq_avail_idx >= vq->vq_nentries) {
> +			vq->vq_avail_idx -= vq->vq_nentries;
> +			vq->vq_packed.cached_flags ^=
> +				VRING_PACKED_DESC_F_AVAIL_USED;
> +			flags = vq->vq_packed.cached_flags;
> +		}
> +		total_num += batch_num;
> +	} while (total_num < num);
> +
> +	virtqueue_store_flags_packed(&start_dp[head_idx], head_flag,
> +				hw->weak_barriers);
> +	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num);
> +}
> +
> +uint16_t
> +virtio_recv_pkts_packed_vec(void *rx_queue,
> +			    struct rte_mbuf **rx_pkts,
> +			    uint16_t nb_pkts)
> +{
> +	struct virtnet_rx *rxvq = rx_queue;
> +	struct virtqueue *vq = rxvq->vq;
> +	struct virtio_hw *hw = vq->hw;
> +	uint16_t num, nb_rx = 0;
> +	uint32_t nb_enqueued = 0;
> +	uint16_t free_cnt = vq->vq_free_thresh;
> +
> +	if (unlikely(hw->started == 0))
> +		return nb_rx;
> +
> +	num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts);
> +	if (likely(num > PACKED_BATCH_SIZE))
> +		num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE);
> +
> +	while (num) {
> +		if (!virtqueue_dequeue_batch_packed_vec(rxvq,
> +					&rx_pkts[nb_rx])) {
> +			nb_rx += PACKED_BATCH_SIZE;
> +			num -= PACKED_BATCH_SIZE;
> +			continue;
> +		}
> +		if (!virtqueue_dequeue_single_packed_vec(rxvq,
> +					&rx_pkts[nb_rx])) {
> +			nb_rx++;
> +			num--;
> +			continue;
> +		}
> +		break;
> +	};
> +
> +	PMD_RX_LOG(DEBUG, "dequeue:%d", num);
> +
> +	rxvq->stats.packets += nb_rx;
> +
> +	if (likely(vq->vq_free_cnt >= free_cnt)) {
> +		struct rte_mbuf *new_pkts[free_cnt];
> +		if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts,
> +						free_cnt) == 0)) {
> +			virtio_recv_refill_packed_vec(rxvq, new_pkts,
> +					free_cnt);
> +			nb_enqueued += free_cnt;
> +		} else {
> +			struct rte_eth_dev *dev =
> +				&rte_eth_devices[rxvq->port_id];
> +			dev->data->rx_mbuf_alloc_failed += free_cnt;
> +		}
> +	}
> +
> +	if (likely(nb_enqueued)) {
> +		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
> +			virtqueue_notify(vq);
> +			PMD_RX_LOG(DEBUG, "Notified");
> +		}
> +	}
> +
> +	return nb_rx;
> +}
> diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
> index 40ad786cc..c54698ad1 100644
> --- a/drivers/net/virtio/virtio_user_ethdev.c
> +++ b/drivers/net/virtio/virtio_user_ethdev.c
> @@ -528,6 +528,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev)
>  	hw->use_msix = 1;
>  	hw->modern   = 0;
>  	hw->use_vec_rx = 0;
> +	hw->use_vec_tx = 0;
>  	hw->use_inorder_rx = 0;
>  	hw->use_inorder_tx = 0;
>  	hw->virtio_user_dev = dev;
> @@ -739,8 +740,19 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
>  		goto end;
>  	}
>  
> -	if (vectorized)
> -		hw->use_vec_rx = 1;
> +	if (vectorized) {
> +		if (packed_vq) {
> +#if defined(CC_AVX512_SUPPORT)
> +			hw->use_vec_rx = 1;
> +			hw->use_vec_tx = 1;
> +#else
> +			PMD_INIT_LOG(INFO,
> +				"building environment do not support packed ring vectorized");
> +#endif
> +		} else {
> +			hw->use_vec_rx = 1;
> +		}
> +	}
>  
>  	rte_eth_dev_probing_finish(eth_dev);
>  	ret = 0;
> diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
> index ca1c10499..ce0340743 100644
> --- a/drivers/net/virtio/virtqueue.h
> +++ b/drivers/net/virtio/virtqueue.h
> @@ -239,7 +239,8 @@ struct vq_desc_extra {
>  	void *cookie;
>  	uint16_t ndescs;
>  	uint16_t next;
> -};
> +	uint8_t padding[4];
> +} __rte_packed __rte_aligned(16);

Can't this introduce a performance impact for the non-vectorized
case? I am thinking of worse cache line utilization.

For example with a burst of 32 descriptors with 32B cachelines, before
it would take 14 cachelines, after 16. So for each burst, one could face
2 extra cache misses.

If you could run non-vectorized benchmarks with and without that patch,
I would be grateful.

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v10 7/9] net/virtio: add vectorized packed ring Tx path
  2020-04-26  2:19   ` [dpdk-dev] [PATCH v10 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu
@ 2020-04-27 11:55     ` Maxime Coquelin
  0 siblings, 0 replies; 162+ messages in thread
From: Maxime Coquelin @ 2020-04-27 11:55 UTC (permalink / raw)
  To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev



On 4/26/20 4:19 AM, Marvin Liu wrote:
> Optimize packed ring Tx path like Rx path. Split Tx path into batch and
> single Tx functions. Batch function is further optimized by AVX512
> instructions.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 
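
The core of the batch Tx is the same trick as on the Rx side: four 16-byte
packed descriptors fill exactly one cache line, so addr/len/id/flags for four
mbufs can be composed in a zmm register and written with a single 512-bit
store. A simplified, hypothetical sketch of that store (names and parameters
are illustrative, not the patch's actual code, and the store ordering the
real datapath needs towards the device is omitted):

	#include <stdint.h>
	#include <immintrin.h>

	/* same 16-byte layout as the virtio packed ring descriptor */
	struct pdesc {
		uint64_t addr;
		uint32_t len;
		uint16_t id;
		uint16_t flags;
	};

	/* Write four Tx descriptors with one 512-bit store. 'flags' already
	 * carries the avail/used wrap bits for this slot. _mm512_set_epi64()
	 * takes the highest 64-bit lane first, so addr[0] ends up in the
	 * lowest lane, i.e. slot[0].addr.
	 */
	static inline void
	tx_batch_store(struct pdesc *slot, const uint64_t addr[4],
		       const uint32_t len[4], const uint16_t id[4],
		       uint16_t flags)
	{
		__m512i v = _mm512_set_epi64(
			((uint64_t)flags << 48) | ((uint64_t)id[3] << 32) | len[3], addr[3],
			((uint64_t)flags << 48) | ((uint64_t)id[2] << 32) | len[2], addr[2],
			((uint64_t)flags << 48) | ((uint64_t)id[1] << 32) | len[1], addr[1],
			((uint64_t)flags << 48) | ((uint64_t)id[0] << 32) | len[0], addr[0]);

		_mm512_storeu_si512((void *)slot, v);
	}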

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path
  2020-04-27 11:20     ` Maxime Coquelin
@ 2020-04-28  1:14       ` Liu, Yong
  2020-04-28  8:44         ` Maxime Coquelin
  0 siblings, 1 reply; 162+ messages in thread
From: Liu, Yong @ 2020-04-28  1:14 UTC (permalink / raw)
  To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Monday, April 27, 2020 7:21 PM
> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path
> 
> 
> 
> On 4/26/20 4:19 AM, Marvin Liu wrote:
> > Optimize the packed ring Rx path with SIMD instructions. The approach is
> > similar to the vhost one: split the path into batch and single functions,
> > with the batch function further optimized by AVX512 instructions. Also pad
> > the desc extra structure to a 16-byte aligned size, so that four elements
> > fit in one cache line and can be handled per batch.
> >
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> > index c9edb84ee..102b1deab 100644
> > --- a/drivers/net/virtio/Makefile
> > +++ b/drivers/net/virtio/Makefile
> > @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
> >  endif
> >
> > +ifneq ($(FORCE_DISABLE_AVX512), y)
> > +	CC_AVX512_SUPPORT=\
> > +	$(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
> > +	sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
> > +	grep -q AVX512 && echo 1)
> > +endif
> > +
> > +ifeq ($(CC_AVX512_SUPPORT), 1)
> > +CFLAGS += -DCC_AVX512_SUPPORT
> > +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
> > +
> > +ifeq ($(RTE_TOOLCHAIN), gcc)
> > +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
> > +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
> > +endif
> > +endif
> > +
> > +ifeq ($(RTE_TOOLCHAIN), clang)
> > +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1)
> > +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
> > +endif
> > +endif
> > +
> > +ifeq ($(RTE_TOOLCHAIN), icc)
> > +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
> > +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
> > +endif
> > +endif
> > +
> > +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl
> > +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
> > +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
> > +endif
> > +endif
> > +
> >  ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c
> > diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
> > index 15150eea1..8e68c3039 100644
> > --- a/drivers/net/virtio/meson.build
> > +++ b/drivers/net/virtio/meson.build
> > @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c',
> >  deps += ['kvargs', 'bus_pci']
> >
> >  if arch_subdir == 'x86'
> > +	if '-mno-avx512f' not in machine_args
> > +		if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
> > +			cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl']
> > +			cflags += ['-DCC_AVX512_SUPPORT']
> > +			if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> > +				cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> > +			elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> > +				cflags += '-DVHOST_CLANG_UNROLL_PRAGMA'
> > +			elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
> > +				cflags += '-DVHOST_ICC_UNROLL_PRAGMA'
> > +			endif
> > +			sources += files('virtio_rxtx_packed_avx.c')
> > +		endif
> > +	endif
> >  	sources += files('virtio_rxtx_simple_sse.c')
> >  elif arch_subdir == 'ppc'
> >  	sources += files('virtio_rxtx_simple_altivec.c')
> > diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
> > index febaf17a8..5c112cac7 100644
> > --- a/drivers/net/virtio/virtio_ethdev.h
> > +++ b/drivers/net/virtio/virtio_ethdev.h
> > @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts,
> >  uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> >  		uint16_t nb_pkts);
> >
> > +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> > +		uint16_t nb_pkts);
> > +
> >  int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
> >
> >  void virtio_interrupt_handler(void *param);
> > diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
> > index a549991aa..534562cca 100644
> > --- a/drivers/net/virtio/virtio_rxtx.c
> > +++ b/drivers/net/virtio/virtio_rxtx.c
> > @@ -2030,3 +2030,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
> >
> >  	return nb_tx;
> >  }
> > +
> > +__rte_weak uint16_t
> > +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
> > +			    struct rte_mbuf **rx_pkts __rte_unused,
> > +			    uint16_t nb_pkts __rte_unused)
> > +{
> > +	return 0;
> > +}
> > diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> > new file mode 100644
> > index 000000000..8a7b459eb
> > --- /dev/null
> > +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> > @@ -0,0 +1,374 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2020 Intel Corporation
> > + */
> > +
> > +#include <stdint.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <errno.h>
> > +
> > +#include <rte_net.h>
> > +
> > +#include "virtio_logs.h"
> > +#include "virtio_ethdev.h"
> > +#include "virtio_pci.h"
> > +#include "virtqueue.h"
> > +
> > +#define BYTE_SIZE 8
> > +/* flag bits offset in packed ring desc higher 64bits */
> > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> > +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> > +
> > +#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \
> > +	FLAGS_BITS_OFFSET)
> > +
> > +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
> > +	sizeof(struct vring_packed_desc))
> > +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
> > +
> > +#ifdef VIRTIO_GCC_UNROLL_PRAGMA
> > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
> > +	for (iter = val; iter < size; iter++)
> > +#endif
> > +
> > +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
> > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
> > +	for (iter = val; iter < size; iter++)
> > +#endif
> > +
> > +#ifdef VIRTIO_ICC_UNROLL_PRAGMA
> > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
> > +	for (iter = val; iter < size; iter++)
> > +#endif
> > +
> > +#ifndef virtio_for_each_try_unroll
> > +#define virtio_for_each_try_unroll(iter, val, num) \
> > +	for (iter = val; iter < num; iter++)
> > +#endif
> > +
> > +static inline void
> > +virtio_update_batch_stats(struct virtnet_stats *stats,
> > +			  uint16_t pkt_len1,
> > +			  uint16_t pkt_len2,
> > +			  uint16_t pkt_len3,
> > +			  uint16_t pkt_len4)
> > +{
> > +	stats->bytes += pkt_len1;
> > +	stats->bytes += pkt_len2;
> > +	stats->bytes += pkt_len3;
> > +	stats->bytes += pkt_len4;
> > +}
> > +
> > +/* Optionally fill offload information in structure */
> > +static inline int
> > +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
> > +{
> > +	struct rte_net_hdr_lens hdr_lens;
> > +	uint32_t hdrlen, ptype;
> > +	int l4_supported = 0;
> > +
> > +	/* nothing to do */
> > +	if (hdr->flags == 0)
> > +		return 0;
> > +
> > +	/* GSO not support in vec path, skip check */
> > +	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
> > +
> > +	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
> > +	m->packet_type = ptype;
> > +	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
> > +	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
> > +	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
> > +		l4_supported = 1;
> > +
> > +	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> > +		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
> > +		if (hdr->csum_start <= hdrlen && l4_supported) {
> > +			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
> > +		} else {
> > +			/* Unknown proto or tunnel, do sw cksum. We can assume
> > +			 * the cksum field is in the first segment since the
> > +			 * buffers we provided to the host are large enough.
> > +			 * In case of SCTP, this will be wrong since it's a CRC
> > +			 * but there's nothing we can do.
> > +			 */
> > +			uint16_t csum = 0, off;
> > +
> > +			rte_raw_cksum_mbuf(m, hdr->csum_start,
> > +				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
> > +				&csum);
> > +			if (likely(csum != 0xffff))
> > +				csum = ~csum;
> > +			off = hdr->csum_offset + hdr->csum_start;
> > +			if (rte_pktmbuf_data_len(m) >= off + 1)
> > +				*rte_pktmbuf_mtod_offset(m, uint16_t *,
> > +					off) = csum;
> > +		}
> > +	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
> > +		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static inline uint16_t
> > +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq,
> > +				   struct rte_mbuf **rx_pkts)
> > +{
> > +	struct virtqueue *vq = rxvq->vq;
> > +	struct virtio_hw *hw = vq->hw;
> > +	uint16_t hdr_size = hw->vtnet_hdr_size;
> > +	uint64_t addrs[PACKED_BATCH_SIZE];
> > +	uint16_t id = vq->vq_used_cons_idx;
> > +	uint8_t desc_stats;
> > +	uint16_t i;
> > +	void *desc_addr;
> > +
> > +	if (id & PACKED_BATCH_MASK)
> > +		return -1;
> > +
> > +	if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries))
> > +		return -1;
> > +
> > +	/* only care avail/used bits */
> > +	__m512i v_mask = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
> > +	desc_addr = &vq->vq_packed.ring.desc[id];
> > +
> > +	__m512i v_desc = _mm512_loadu_si512(desc_addr);
> > +	__m512i v_flag = _mm512_and_epi64(v_desc, v_mask);
> > +
> > +	__m512i v_used_flag = _mm512_setzero_si512();
> > +	if (vq->vq_packed.used_wrap_counter)
> > +		v_used_flag = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
> > +
> > +	/* Check all descs are used */
> > +	desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag);
> > +	if (desc_stats)
> > +		return -1;
> > +
> > +	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +		rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie;
> > +		rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *));
> > +
> > +		addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1;
> > +	}
> > +
> > +	/*
> > +	 * load len from desc, store into mbuf pkt_len and data_len
> > +	 * len is limited by the 16-bit buf_len, so pkt_len[16:31] can be ignored
> > +	 */
> > +	const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12;
> > +	__m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, 0xAA);
> > +
> > +	/* reduce hdr_len from pkt_len and data_len */
> > +	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask,
> > +			(uint32_t)-hdr_size);
> > +
> > +	__m512i v_value = _mm512_add_epi32(values, mbuf_len_offset);
> > +
> > +	/* assert offset of data_len */
> > +	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
> > +		offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
> > +
> > +	__m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3],
> > +					   addrs[2] + 8, addrs[2],
> > +					   addrs[1] + 8, addrs[1],
> > +					   addrs[0] + 8, addrs[0]);
> > +	/* batch store into mbufs */
> > +	_mm512_i64scatter_epi64(0, v_index, v_value, 1);
> > +
> > +	if (hw->has_rx_offload) {
> > +		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +			char *addr = (char *)rx_pkts[i]->buf_addr +
> > +				RTE_PKTMBUF_HEADROOM - hdr_size;
> > +			virtio_vec_rx_offload(rx_pkts[i],
> > +					(struct virtio_net_hdr *)addr);
> > +		}
> > +	}
> > +
> > +	virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len,
> > +			rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len,
> > +			rx_pkts[3]->pkt_len);
> > +
> > +	vq->vq_free_cnt += PACKED_BATCH_SIZE;
> > +
> > +	vq->vq_used_cons_idx += PACKED_BATCH_SIZE;
> > +	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
> > +		vq->vq_used_cons_idx -= vq->vq_nentries;
> > +		vq->vq_packed.used_wrap_counter ^= 1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static uint16_t
> > +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq,
> > +				    struct rte_mbuf **rx_pkts)
> > +{
> > +	uint16_t used_idx, id;
> > +	uint32_t len;
> > +	struct virtqueue *vq = rxvq->vq;
> > +	struct virtio_hw *hw = vq->hw;
> > +	uint32_t hdr_size = hw->vtnet_hdr_size;
> > +	struct virtio_net_hdr *hdr;
> > +	struct vring_packed_desc *desc;
> > +	struct rte_mbuf *cookie;
> > +
> > +	desc = vq->vq_packed.ring.desc;
> > +	used_idx = vq->vq_used_cons_idx;
> > +	if (!desc_is_used(&desc[used_idx], vq))
> > +		return -1;
> > +
> > +	len = desc[used_idx].len;
> > +	id = desc[used_idx].id;
> > +	cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie;
> > +	if (unlikely(cookie == NULL)) {
> > +		PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u",
> > +				vq->vq_used_cons_idx);
> > +		return -1;
> > +	}
> > +	rte_prefetch0(cookie);
> > +	rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
> > +
> > +	cookie->data_off = RTE_PKTMBUF_HEADROOM;
> > +	cookie->ol_flags = 0;
> > +	cookie->pkt_len = (uint32_t)(len - hdr_size);
> > +	cookie->data_len = (uint32_t)(len - hdr_size);
> > +
> > +	hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr +
> > +					RTE_PKTMBUF_HEADROOM - hdr_size);
> > +	if (hw->has_rx_offload)
> > +		virtio_vec_rx_offload(cookie, hdr);
> > +
> > +	*rx_pkts = cookie;
> > +
> > +	rxvq->stats.bytes += cookie->pkt_len;
> > +
> > +	vq->vq_free_cnt++;
> > +	vq->vq_used_cons_idx++;
> > +	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
> > +		vq->vq_used_cons_idx -= vq->vq_nentries;
> > +		vq->vq_packed.used_wrap_counter ^= 1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static inline void
> > +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq,
> > +			      struct rte_mbuf **cookie,
> > +			      uint16_t num)
> > +{
> > +	struct virtqueue *vq = rxvq->vq;
> > +	struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc;
> > +	uint16_t flags = vq->vq_packed.cached_flags;
> > +	struct virtio_hw *hw = vq->hw;
> > +	struct vq_desc_extra *dxp;
> > +	uint16_t idx, i;
> > +	uint16_t batch_num, total_num = 0;
> > +	uint16_t head_idx = vq->vq_avail_idx;
> > +	uint16_t head_flag = vq->vq_packed.cached_flags;
> > +	uint64_t addr;
> > +
> > +	do {
> > +		idx = vq->vq_avail_idx;
> > +
> > +		batch_num = PACKED_BATCH_SIZE;
> > +		if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries))
> > +			batch_num = vq->vq_nentries - idx;
> > +		if (unlikely((total_num + batch_num) > num))
> > +			batch_num = num - total_num;
> > +
> > +		virtio_for_each_try_unroll(i, 0, batch_num) {
> > +			dxp = &vq->vq_descx[idx + i];
> > +			dxp->cookie = (void *)cookie[total_num + i];
> > +
> > +			addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) +
> > +				RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size;
> > +			start_dp[idx + i].addr = addr;
> > +			start_dp[idx + i].len = cookie[total_num + i]->buf_len
> > +				- RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
> > +			if (total_num || i) {
> > +				virtqueue_store_flags_packed(&start_dp[idx + i],
> > +						flags, hw->weak_barriers);
> > +			}
> > +		}
> > +
> > +		vq->vq_avail_idx += batch_num;
> > +		if (vq->vq_avail_idx >= vq->vq_nentries) {
> > +			vq->vq_avail_idx -= vq->vq_nentries;
> > +			vq->vq_packed.cached_flags ^=
> > +				VRING_PACKED_DESC_F_AVAIL_USED;
> > +			flags = vq->vq_packed.cached_flags;
> > +		}
> > +		total_num += batch_num;
> > +	} while (total_num < num);
> > +
> > +	virtqueue_store_flags_packed(&start_dp[head_idx], head_flag,
> > +				hw->weak_barriers);
> > +	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num);
> > +}
> > +
> > +uint16_t
> > +virtio_recv_pkts_packed_vec(void *rx_queue,
> > +			    struct rte_mbuf **rx_pkts,
> > +			    uint16_t nb_pkts)
> > +{
> > +	struct virtnet_rx *rxvq = rx_queue;
> > +	struct virtqueue *vq = rxvq->vq;
> > +	struct virtio_hw *hw = vq->hw;
> > +	uint16_t num, nb_rx = 0;
> > +	uint32_t nb_enqueued = 0;
> > +	uint16_t free_cnt = vq->vq_free_thresh;
> > +
> > +	if (unlikely(hw->started == 0))
> > +		return nb_rx;
> > +
> > +	num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts);
> > +	if (likely(num > PACKED_BATCH_SIZE))
> > +		num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE);
> > +
> > +	while (num) {
> > +		if (!virtqueue_dequeue_batch_packed_vec(rxvq,
> > +					&rx_pkts[nb_rx])) {
> > +			nb_rx += PACKED_BATCH_SIZE;
> > +			num -= PACKED_BATCH_SIZE;
> > +			continue;
> > +		}
> > +		if (!virtqueue_dequeue_single_packed_vec(rxvq,
> > +					&rx_pkts[nb_rx])) {
> > +			nb_rx++;
> > +			num--;
> > +			continue;
> > +		}
> > +		break;
> > +	};
> > +
> > +	PMD_RX_LOG(DEBUG, "dequeue:%d", num);
> > +
> > +	rxvq->stats.packets += nb_rx;
> > +
> > +	if (likely(vq->vq_free_cnt >= free_cnt)) {
> > +		struct rte_mbuf *new_pkts[free_cnt];
> > +		if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts,
> > +						free_cnt) == 0)) {
> > +			virtio_recv_refill_packed_vec(rxvq, new_pkts,
> > +					free_cnt);
> > +			nb_enqueued += free_cnt;
> > +		} else {
> > +			struct rte_eth_dev *dev =
> > +				&rte_eth_devices[rxvq->port_id];
> > +			dev->data->rx_mbuf_alloc_failed += free_cnt;
> > +		}
> > +	}
> > +
> > +	if (likely(nb_enqueued)) {
> > +		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
> > +			virtqueue_notify(vq);
> > +			PMD_RX_LOG(DEBUG, "Notified");
> > +		}
> > +	}
> > +
> > +	return nb_rx;
> > +}
> > diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
> > index 40ad786cc..c54698ad1 100644
> > --- a/drivers/net/virtio/virtio_user_ethdev.c
> > +++ b/drivers/net/virtio/virtio_user_ethdev.c
> > @@ -528,6 +528,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev)
> >  	hw->use_msix = 1;
> >  	hw->modern   = 0;
> >  	hw->use_vec_rx = 0;
> > +	hw->use_vec_tx = 0;
> >  	hw->use_inorder_rx = 0;
> >  	hw->use_inorder_tx = 0;
> >  	hw->virtio_user_dev = dev;
> > @@ -739,8 +740,19 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
> >  		goto end;
> >  	}
> >
> > -	if (vectorized)
> > -		hw->use_vec_rx = 1;
> > +	if (vectorized) {
> > +		if (packed_vq) {
> > +#if defined(CC_AVX512_SUPPORT)
> > +			hw->use_vec_rx = 1;
> > +			hw->use_vec_tx = 1;
> > +#else
> > +			PMD_INIT_LOG(INFO,
> > +				"building environment do not support packed ring vectorized");
> > +#endif
> > +		} else {
> > +			hw->use_vec_rx = 1;
> > +		}
> > +	}
> >
> >  	rte_eth_dev_probing_finish(eth_dev);
> >  	ret = 0;
> > diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
> > index ca1c10499..ce0340743 100644
> > --- a/drivers/net/virtio/virtqueue.h
> > +++ b/drivers/net/virtio/virtqueue.h
> > @@ -239,7 +239,8 @@ struct vq_desc_extra {
> >  	void *cookie;
> >  	uint16_t ndescs;
> >  	uint16_t next;
> > -};
> > +	uint8_t padding[4];
> > +} __rte_packed __rte_aligned(16);
> 
> Can't this introduce a performance impact for the non-vectorized
> case? I am thinking of worse cache line utilization.
> 
> For example with a burst of 32 descriptors with 32B cachelines, before
> it would take 14 cachelines, after 16. So for each burst, one could face
> 2 extra cache misses.
> 
> If you could run non-vectorized benchmarks with and without that patch,
> I would be grateful.
> 

Maxime,
Thanks for pointing it out, the padding will indeed add an extra cache miss in
the datapath. Its impact on performance is around 1% in the loopback case,
while the benefit of the vectorized path is well above that number.
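
For reference, reproducing such a comparison requires the vectorized path to
be enabled explicitly; with virtio-user this is done through the devargs added
in this series. A hypothetical testpmd invocation (socket path, core list and
the exact binary name are placeholders, and the vectorized Rx path is only
selected when features such as mergeable buffers are disabled):

	# vectorized packed ring datapath (AVX512-capable build and CPU)
	testpmd -l 1-2 --no-pci \
		--vdev 'virtio_user0,path=/tmp/vhost-user.sock,packed_vq=1,mrg_rxbuf=0,in_order=0,vectorized=1' \
		-- -i

	# baseline: drop vectorized=1 to fall back to the scalar packed ring path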

Thanks,
Marvin

> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> 
> Thanks,
> Maxime


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [dpdk-dev] [PATCH v10 4/9] net/virtio-user: add vectorized devarg
  2020-04-27 11:07     ` Maxime Coquelin
@ 2020-04-28  1:29       ` Liu, Yong
  0 siblings, 0 replies; 162+ messages in thread
From: Liu, Yong @ 2020-04-28  1:29 UTC (permalink / raw)
  To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Monday, April 27, 2020 7:07 PM
> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v10 4/9] net/virtio-user: add vectorized devarg
> 
> 
> 
> On 4/26/20 4:19 AM, Marvin Liu wrote:
> > Add new devarg for