* [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath
@ 2020-03-13 17:42 Marvin Liu
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 1/7] net/virtio: add Rx free threshold setting Marvin Liu
  ` (17 more replies)
  0 siblings, 18 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw)
To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

This patch set introduces a vectorized datapath for the packed ring. A packed
ring descriptor is 16 bytes, so a batch of four descriptors fits exactly into
one 64-byte cacheline, which AVX512 instructions can handle well. The packed
ring Tx datapath can be fully transformed into a vectorized datapath. The Rx
datapath can also be vectorized when features are limited (TSO and mergeable
buffers disabled).

Marvin Liu (7):
  net/virtio: add Rx free threshold setting
  net/virtio-user: add LRO parameter
  net/virtio: add vectorized packed ring Rx function
  net/virtio: reuse packed ring xmit functions
  net/virtio: add vectorized packed ring Tx function
  net/virtio: add election for vectorized datapath
  net/virtio: support meson build

 drivers/net/virtio/Makefile                   |  30 +
 drivers/net/virtio/meson.build                |   1 +
 drivers/net/virtio/virtio_ethdev.c            |  35 +-
 drivers/net/virtio/virtio_ethdev.h            |   6 +
 drivers/net/virtio/virtio_pci.h               |   2 +
 drivers/net/virtio/virtio_rxtx.c              | 201 ++----
 drivers/net/virtio/virtio_rxtx_packed_avx.c   | 606 ++++++++++++++++++
 .../net/virtio/virtio_user/virtio_user_dev.c  |   8 +-
 .../net/virtio/virtio_user/virtio_user_dev.h  |   2 +-
 drivers/net/virtio/virtio_user_ethdev.c       |  17 +-
 drivers/net/virtio/virtqueue.h                | 165 ++++-
 11 files changed, 903 insertions(+), 170 deletions(-)
 create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c

--
2.17.1

^ permalink raw reply [flat|nested] 162+ messages in thread
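
As a concrete illustration of the cover letter's point (a minimal sketch, not
part of the series, built with -mavx512f): four 16-byte packed descriptors
fill one 64-byte cacheline, so a whole batch can be fetched with a single
512-bit load on an AVX512-capable CPU.

    #include <stdint.h>
    #include <immintrin.h>

    /* Layout per the virtio 1.1 spec; 16 bytes per descriptor. */
    struct vring_packed_desc {
        uint64_t addr;   /* buffer address */
        uint32_t len;    /* buffer length */
        uint16_t id;     /* buffer id */
        uint16_t flags;  /* avail/used/write/... bits */
    };

    /* One unaligned 512-bit load covers descriptors idx .. idx+3. */
    static inline __m512i
    load_desc_batch(const struct vring_packed_desc *ring, uint16_t idx)
    {
        return _mm512_loadu_si512((const void *)&ring[idx]);
    }
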
* [dpdk-dev] [PATCH v1 1/7] net/virtio: add Rx free threshold setting 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu @ 2020-03-13 17:42 ` Marvin Liu 2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 2/7] net/virtio-user: add LRO parameter Marvin Liu ` (16 subsequent siblings) 17 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Introduce free threshold setting in Rx queue. Now default value of Rx free threshold is 32. Limiated threshold size to multiple of four as only vectorized packed Rx function will utilize it. Virtio driver will rearm Rx queue when more than threshold descs were dequeued. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 752faa0f6..3a2dbc2e0 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, struct virtio_hw *hw = dev->data->dev_private; struct virtqueue *vq = hw->vqs[vtpci_queue_idx]; struct virtnet_rx *rxvq; + uint16_t rx_free_thresh; PMD_INIT_FUNC_TRACE(); @@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, return -EINVAL; } + rx_free_thresh = rx_conf->rx_free_thresh; + if (rx_free_thresh == 0) + rx_free_thresh = + RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH); + + if (rx_free_thresh & 0x3) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four." + " (rx_free_thresh=%u port=%u queue=%u)\n", + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + + if (rx_free_thresh >= vq->vq_nentries) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the " + "number of RX entries (%u)." + " (rx_free_thresh=%u port=%u queue=%u)\n", + vq->vq_nentries, + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + vq->vq_free_thresh = rx_free_thresh; + if (nb_desc == 0 || nb_desc > vq->vq_nentries) nb_desc = vq->vq_nentries; vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc); diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 58ad7309a..bce1db030 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,7 @@ struct rte_mbuf; +#define DEFAULT_RX_FREE_THRESH 32 /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
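
For reference, an application would request this threshold through the
standard Rx queue configuration. A minimal sketch (port, queue count and
mempool setup are assumed to happen elsewhere; values are only examples):

    #include <rte_ethdev.h>

    static int
    setup_rx_queue_with_thresh(uint16_t port_id, struct rte_mempool *mb_pool)
    {
        struct rte_eth_rxconf rxconf = {
            /* must be a multiple of four and smaller than the ring size;
             * leaving it at 0 picks the driver default min(ring/4, 32) */
            .rx_free_thresh = 32,
        };

        return rte_eth_rx_queue_setup(port_id, 0 /* queue */, 256 /* descs */,
                                      rte_eth_dev_socket_id(port_id),
                                      &rxconf, mb_pool);
    }
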
* [dpdk-dev] [PATCH v1 2/7] net/virtio-user: add LRO parameter 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu 2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 1/7] net/virtio: add Rx free threshold setting Marvin Liu @ 2020-03-13 17:42 ` Marvin Liu 2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 3/7] net/virtio: add vectorized packed ring Rx function Marvin Liu ` (15 subsequent siblings) 17 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Packed ring vectorized rx function won't support GUEST_TSO4 and GUSET_TSO6. Adding "lro" parameter into virtio user vdev arguments can disable these features for vectorized path selection. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_user/virtio_user_dev.c b/drivers/net/virtio/virtio_user/virtio_user_dev.c index 1c6b26f8d..45d4bf14f 100644 --- a/drivers/net/virtio/virtio_user/virtio_user_dev.c +++ b/drivers/net/virtio/virtio_user/virtio_user_dev.c @@ -422,7 +422,8 @@ virtio_user_dev_setup(struct virtio_user_dev *dev) int virtio_user_dev_init(struct virtio_user_dev *dev, char *path, int queues, int cq, int queue_size, const char *mac, char **ifname, - int server, int mrg_rxbuf, int in_order, int packed_vq) + int server, int mrg_rxbuf, int in_order, int packed_vq, + int lro) { pthread_mutex_init(&dev->mutex, NULL); strlcpy(dev->path, path, PATH_MAX); @@ -478,6 +479,11 @@ virtio_user_dev_init(struct virtio_user_dev *dev, char *path, int queues, if (!packed_vq) dev->unsupported_features |= (1ull << VIRTIO_F_RING_PACKED); + if (!lro) { + dev->unsupported_features |= (1ull << VIRTIO_NET_F_GUEST_TSO4); + dev->unsupported_features |= (1ull << VIRTIO_NET_F_GUEST_TSO6); + } + if (dev->mac_specified) dev->frontend_features |= (1ull << VIRTIO_NET_F_MAC); else diff --git a/drivers/net/virtio/virtio_user/virtio_user_dev.h b/drivers/net/virtio/virtio_user/virtio_user_dev.h index 3b6b6065a..7133e4d26 100644 --- a/drivers/net/virtio/virtio_user/virtio_user_dev.h +++ b/drivers/net/virtio/virtio_user/virtio_user_dev.h @@ -62,7 +62,7 @@ int virtio_user_stop_device(struct virtio_user_dev *dev); int virtio_user_dev_init(struct virtio_user_dev *dev, char *path, int queues, int cq, int queue_size, const char *mac, char **ifname, int server, int mrg_rxbuf, int in_order, - int packed_vq); + int packed_vq, int lro); void virtio_user_dev_uninit(struct virtio_user_dev *dev); void virtio_user_handle_cq(struct virtio_user_dev *dev, uint16_t queue_idx); void virtio_user_handle_cq_packed(struct virtio_user_dev *dev, diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index e61af4068..ea07a8384 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -450,6 +450,8 @@ static const char *valid_args[] = { VIRTIO_USER_ARG_IN_ORDER, #define VIRTIO_USER_ARG_PACKED_VQ "packed_vq" VIRTIO_USER_ARG_PACKED_VQ, +#define VIRTIO_USER_ARG_LRO "lro" + VIRTIO_USER_ARG_LRO, NULL }; @@ -552,6 +554,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) uint64_t mrg_rxbuf = 1; uint64_t in_order = 1; uint64_t packed_vq = 0; + uint64_t lro = 1; char *path = NULL; char *ifname = NULL; char *mac_addr = NULL; @@ -668,6 +671,15 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } } + if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_LRO) == 1) { + if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_LRO, + &get_integer_arg, &lro) < 0) { + 
PMD_INIT_LOG(ERR, "error to parse %s", + VIRTIO_USER_ARG_PACKED_VQ); + goto end; + } + } + if (queues > 1 && cq == 0) { PMD_INIT_LOG(ERR, "multi-q requires ctrl-q"); goto end; @@ -707,7 +719,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) hw = eth_dev->data->dev_private; if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq, queue_size, mac_addr, &ifname, server_mode, - mrg_rxbuf, in_order, packed_vq) < 0) { + mrg_rxbuf, in_order, packed_vq, lro) < 0) { PMD_INIT_LOG(ERR, "virtio_user_dev_init fails"); virtio_user_eth_dev_free(eth_dev); goto end; @@ -777,4 +789,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, "server=<0|1> " "mrg_rxbuf=<0|1> " "in_order=<0|1> " - "packed_vq=<0|1>"); + "packed_vq=<0|1>" + "lro=<0|1>"); -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
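
Based on the parameter string registered above, a virtio-user port that keeps
GUEST_TSO4/GUEST_TSO6 disabled (so the vectorized Rx path can later be
selected) could be created like this; the socket path and core list are only
placeholders:

    testpmd -l 1-2 -n 4 \
        --vdev=net_virtio_user0,path=/tmp/vhost-user0.sock,queues=1,lro=0 \
        -- -i
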
* [dpdk-dev] [PATCH v1 3/7] net/virtio: add vectorized packed ring Rx function 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu 2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 1/7] net/virtio: add Rx free threshold setting Marvin Liu 2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 2/7] net/virtio-user: add LRO parameter Marvin Liu @ 2020-03-13 17:42 ` Marvin Liu 2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 4/7] net/virtio: reuse packed ring xmit functions Marvin Liu ` (14 subsequent siblings) 17 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Optimize packed ring Rx datapath when mergeable buffer and LRO are not required. Solution of optimization is pretty like vhost, split batch and single functions. Batch function will only dequeue those descs whose cacheline are aligned. Also padding desc extra structure to 16 bytes aligned. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index efdcb0d93..0458e8bf2 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -37,6 +37,36 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif +ifeq ($(RTE_TOOLCHAIN), gcc) +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), clang) +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1) +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), icc) +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA +endif +endif + +CC_AVX512_SUPPORT=$(shell $(CC) -dM -E -mavx512f -dM -E - </dev/null 2>&1 | \ + grep -q AVX512F && echo 1) + +ifeq ($(CC_AVX512_SUPPORT), 1) +CFLAGS_virtio_ethdev.o += -DCC_AVX512_SUPPORT +CFLAGS_virtio_rxtx.o += -DCC_AVX512_SUPPORT +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds +endif +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c +endif + ifeq ($(CONFIG_RTE_VIRTIO_USER),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index cd8947656..10e39670e 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -104,6 +104,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 3a2dbc2e0..ac417232b 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -1245,7 +1245,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) return 0; } -#define VIRTIO_MBUF_BURST_SZ 64 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc)) uint16_t virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) @@ -2328,3 +2327,11 @@ 
virtio_xmit_pkts_inorder(void *tx_queue, return nb_tx; } + +__rte_weak uint16_t +virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue, + struct rte_mbuf __rte_unused **rx_pkts, + uint16_t __rte_unused nb_pkts) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c new file mode 100644 index 000000000..d8cda9d71 --- /dev/null +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -0,0 +1,380 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2014 Intel Corporation + */ + +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> + +#include <rte_net.h> + +#include "virtio_logs.h" +#include "virtio_ethdev.h" +#include "virtio_pci.h" +#include "virtqueue.h" + +#define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63) + +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ + sizeof(struct vring_packed_desc)) +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) + +#ifdef VIRTIO_GCC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_ICC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ + for (iter = val; iter < size; iter++) +#endif + +#ifndef virtio_for_each_try_unroll +#define virtio_for_each_try_unroll(iter, val, num) \ + for (iter = val; iter < num; iter++) +#endif + +static inline void +virtio_update_batch_stats(struct virtnet_stats *stats, + uint16_t pkt_len1, + uint16_t pkt_len2, + uint16_t pkt_len3, + uint16_t pkt_len4) +{ + stats->bytes += pkt_len1; + stats->bytes += pkt_len2; + stats->bytes += pkt_len3; + stats->bytes += pkt_len4; +} + +/* Optionally fill offload information in structure */ +static inline int +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) +{ + struct rte_net_hdr_lens hdr_lens; + uint32_t hdrlen, ptype; + int l4_supported = 0; + + /* nothing to do */ + if (hdr->flags == 0) + return 0; + + /* GSO not support in vec path, skip check */ + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; + + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); + m->packet_type = ptype; + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) + l4_supported = 1; + + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; + if (hdr->csum_start <= hdrlen && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; + } else { + /* Unknown proto or tunnel, do sw cksum. We can assume + * the cksum field is in the first segment since the + * buffers we provided to the host are large enough. + * In case of SCTP, this will be wrong since it's a CRC + * but there's nothing we can do. 
+ */ + uint16_t csum = 0, off; + + rte_raw_cksum_mbuf(m, hdr->csum_start, + rte_pktmbuf_pkt_len(m) - hdr->csum_start, + &csum); + if (likely(csum != 0xffff)) + csum = ~csum; + off = hdr->csum_offset + hdr->csum_start; + if (rte_pktmbuf_data_len(m) >= off + 1) + *rte_pktmbuf_mtod_offset(m, uint16_t *, + off) = csum; + } + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; + } + + return 0; +} + +static uint16_t +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + struct virtio_net_hdr *hdrs[PACKED_BATCH_SIZE]; + uint64_t addrs[PACKED_BATCH_SIZE << 1]; + uint16_t id = vq->vq_used_cons_idx; + uint8_t desc_stats; + uint16_t i; + void *desc_addr; + + if (id & PACKED_BATCH_MASK) + return -1; + + /* only care avail/used bits */ + __m512i desc_flags = _mm512_set_epi64( + PACKED_FLAGS_MASK, 0x0, + PACKED_FLAGS_MASK, 0x0, + PACKED_FLAGS_MASK, 0x0, + PACKED_FLAGS_MASK, 0x0); + + desc_addr = &vq->vq_packed.ring.desc[id]; + rte_smp_rmb(); + __m512i packed_desc = _mm512_loadu_si512(desc_addr); + __m512i flags_mask = _mm512_maskz_and_epi64(0xff, packed_desc, + desc_flags); + + __m512i used_flags; + if (vq->vq_packed.used_wrap_counter) { + used_flags = _mm512_set_epi64( + PACKED_FLAGS_MASK, 0x0, + PACKED_FLAGS_MASK, 0x0, + PACKED_FLAGS_MASK, 0x0, + PACKED_FLAGS_MASK, 0x0); + } else { + used_flags = _mm512_set_epi64( + 0x0, 0x0, + 0x0, 0x0, + 0x0, 0x0, + 0x0, 0x0); + } + + /* Check all descs are used */ + desc_stats = _mm512_cmp_epu64_mask(flags_mask, used_flags, + _MM_CMPINT_EQ); + if (desc_stats != 0xff) + return -1; + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); + + addrs[i << 1] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1; + addrs[(i << 1) + 1] = + (uint64_t)rx_pkts[i]->rx_descriptor_fields1 + 8; + } + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + char *addr = (char *)rx_pkts[i]->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size; + hdrs[i] = (struct virtio_net_hdr *)addr; + } + + /* addresses of pkt_len and data_len */ + __m512i vindex = _mm512_set_epi64( + addrs[7], addrs[6], + addrs[5], addrs[4], + addrs[3], addrs[2], + addrs[1], addrs[0]); + + /* + * select 0x10 load 32bit from packed_desc[95:64] + * mmask 0x0110 save 32bit into pkt_len and data_len + */ + __m512i value = _mm512_maskz_shuffle_epi32(0x6666, packed_desc, 0xAA); + + __m512i mbuf_len_offset = _mm512_set_epi32( + 0, (uint32_t)-hdr_size, (uint32_t)-hdr_size, 0, + 0, (uint32_t)-hdr_size, (uint32_t)-hdr_size, 0, + 0, (uint32_t)-hdr_size, (uint32_t)-hdr_size, 0, + 0, (uint32_t)-hdr_size, (uint32_t)-hdr_size, 0); + + value = _mm512_add_epi32(value, mbuf_len_offset); + /* batch store into mbufs */ + _mm512_i64scatter_epi64(0, vindex, value, 1); + + if (hw->has_rx_offload) { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) + virtio_vec_rx_offload(rx_pkts[i], hdrs[i]); + } + + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, + rx_pkts[3]->pkt_len); + + vq->vq_free_cnt += PACKED_BATCH_SIZE; + + vq->vq_used_cons_idx += PACKED_BATCH_SIZE; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static uint16_t 
+virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + uint16_t used_idx, id; + uint32_t len; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint32_t hdr_size = hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + struct vring_packed_desc *desc; + struct rte_mbuf *cookie; + + desc = vq->vq_packed.ring.desc; + used_idx = vq->vq_used_cons_idx; + if (!desc_is_used(&desc[used_idx], vq)) + return -1; + + len = desc[used_idx].len; + id = desc[used_idx].id; + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; + if (unlikely(cookie == NULL)) { + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u", + vq->vq_used_cons_idx); + return -1; + } + rte_prefetch0(cookie); + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); + + cookie->data_off = RTE_PKTMBUF_HEADROOM; + cookie->ol_flags = 0; + cookie->pkt_len = (uint32_t)(len - hdr_size); + cookie->data_len = (uint32_t)(len - hdr_size); + + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size); + if (hw->has_rx_offload) + virtio_vec_rx_offload(cookie, hdr); + + *rx_pkts = cookie; + + rxvq->stats.bytes += cookie->pkt_len; + + vq->vq_free_cnt++; + vq->vq_used_cons_idx++; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static inline void +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **cookie, + uint16_t num) +{ + struct virtqueue *vq = rxvq->vq; + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; + uint16_t flags = vq->vq_packed.cached_flags; + struct virtio_hw *hw = vq->hw; + struct vq_desc_extra *dxp; + uint16_t idx, i; + uint16_t total_num = 0; + uint16_t head_idx = vq->vq_avail_idx; + uint16_t head_flag = vq->vq_packed.cached_flags; + uint64_t addr; + + do { + idx = vq->vq_avail_idx; + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + dxp = &vq->vq_descx[idx + i]; + dxp->cookie = (void *)cookie[total_num + i]; + + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) + + RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size; + start_dp[idx + i].addr = addr; + start_dp[idx + i].len = cookie[total_num + i]->buf_len + - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size; + if (total_num || i) { + virtqueue_store_flags_packed(&start_dp[idx + i], + flags, hw->weak_barriers); + } + } + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + flags = vq->vq_packed.cached_flags; + } + total_num += PACKED_BATCH_SIZE; + } while (total_num < num); + + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, + hw->weak_barriers); + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); +} + +uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue, + struct rte_mbuf **rx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_rx *rxvq = rx_queue; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t num, nb_rx = 0; + uint32_t nb_enqueued = 0; + uint16_t free_cnt = vq->vq_free_thresh; + + if (unlikely(hw->started == 0)) + return nb_rx; + + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); + if (likely(num > PACKED_BATCH_SIZE)) + num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE); + + while (num) { + if (!virtqueue_dequeue_batch_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx += PACKED_BATCH_SIZE; + num -= PACKED_BATCH_SIZE; + continue; + } + if 
(!virtqueue_dequeue_single_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx++; + num--; + continue; + } + break; + }; + + PMD_RX_LOG(DEBUG, "dequeue:%d", num); + + rxvq->stats.packets += nb_rx; + + if (likely(vq->vq_free_cnt >= free_cnt)) { + struct rte_mbuf *new_pkts[free_cnt]; + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, + free_cnt) == 0)) { + virtio_recv_refill_packed_vec(rxvq, new_pkts, + free_cnt); + nb_enqueued += free_cnt; + } else { + struct rte_eth_dev *dev = + &rte_eth_devices[rxvq->port_id]; + dev->data->rx_mbuf_alloc_failed += free_cnt; + } + } + + if (likely(nb_enqueued)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_RX_LOG(DEBUG, "Notified"); + } + } + + return nb_rx; +} diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index bce1db030..43e305ecc 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -19,6 +19,8 @@ struct rte_mbuf; #define DEFAULT_RX_FREE_THRESH 32 + +#define VIRTIO_MBUF_BURST_SZ 64 /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO @@ -235,7 +237,8 @@ struct vq_desc_extra { void *cookie; uint16_t ndescs; uint16_t next; -}; + uint8_t padding[4]; +} __rte_packed __rte_aligned(16); struct virtqueue { struct virtio_hw *hw; /**< virtio_hw structure pointer. */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
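
A side note on the padding change above: with the 4-byte pad, struct
vq_desc_extra becomes 16 bytes on 64-bit targets, so four entries keep the
same stride as four packed descriptors. A compile-time check of these
assumptions (not part of the patch, assumes the driver headers are included
and a 64-byte cacheline target such as x86) could look like:

    _Static_assert(sizeof(struct vring_packed_desc) == 16,
        "packed descriptor expected to be 16 bytes");
    _Static_assert(sizeof(struct vq_desc_extra) == 16,
        "vq_desc_extra expected to be padded to 16 bytes");
    _Static_assert(RTE_CACHE_LINE_SIZE / sizeof(struct vring_packed_desc) == 4,
        "PACKED_BATCH_SIZE expected to be 4");
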
* [dpdk-dev] [PATCH v1 4/7] net/virtio: reuse packed ring xmit functions 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu ` (2 preceding siblings ...) 2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 3/7] net/virtio: add vectorized packed ring Rx function Marvin Liu @ 2020-03-13 17:42 ` Marvin Liu 2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 5/7] net/virtio: add vectorized packed ring Tx function Marvin Liu ` (13 subsequent siblings) 17 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Move xmit offload and packed ring xmit enqueue function to header file. These functions will be reused by packed ring vectorized Tx function. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index ac417232b..b8b4d3c25 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq, return i; } -#ifndef DEFAULT_TX_FREE_THRESH -#define DEFAULT_TX_FREE_THRESH 32 -#endif - static void virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num) { @@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m) } -/* avoid write operation when necessary, to lessen cache issues */ -#define ASSIGN_UNLESS_EQUAL(var, val) do { \ - if ((var) != (val)) \ - (var) = (val); \ -} while (0) - -#define virtqueue_clear_net_hdr(_hdr) do { \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ -} while (0) - -static inline void -virtqueue_xmit_offload(struct virtio_net_hdr *hdr, - struct rte_mbuf *cookie, - bool offload) -{ - if (offload) { - if (cookie->ol_flags & PKT_TX_TCP_SEG) - cookie->ol_flags |= PKT_TX_TCP_CKSUM; - - switch (cookie->ol_flags & PKT_TX_L4_MASK) { - case PKT_TX_UDP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_udp_hdr, - dgram_cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - case PKT_TX_TCP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - default: - ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); - ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); - ASSIGN_UNLESS_EQUAL(hdr->flags, 0); - break; - } - /* TCP Segmentation Offload */ - if (cookie->ol_flags & PKT_TX_TCP_SEG) { - hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
- VIRTIO_NET_HDR_GSO_TCPV6 : - VIRTIO_NET_HDR_GSO_TCPV4; - hdr->gso_size = cookie->tso_segsz; - hdr->hdr_len = - cookie->l2_len + - cookie->l3_len + - cookie->l4_len; - } else { - ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); - ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); - ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); - } - } -} static inline void virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq, @@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq, virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers); } -static inline void -virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, - uint16_t needed, int can_push, int in_order) -{ - struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; - struct vq_desc_extra *dxp; - struct virtqueue *vq = txvq->vq; - struct vring_packed_desc *start_dp, *head_dp; - uint16_t idx, id, head_idx, head_flags; - int16_t head_size = vq->hw->vtnet_hdr_size; - struct virtio_net_hdr *hdr; - uint16_t prev; - bool prepend_header = false; - - id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; - - dxp = &vq->vq_descx[id]; - dxp->ndescs = needed; - dxp->cookie = cookie; - - head_idx = vq->vq_avail_idx; - idx = head_idx; - prev = head_idx; - start_dp = vq->vq_packed.ring.desc; - - head_dp = &vq->vq_packed.ring.desc[idx]; - head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; - head_flags |= vq->vq_packed.cached_flags; - - if (can_push) { - /* prepend cannot fail, checked by caller */ - hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, - -head_size); - prepend_header = true; - - /* if offload disabled, it is not zeroed below, do it now */ - if (!vq->hw->has_tx_offload) - virtqueue_clear_net_hdr(hdr); - } else { - /* setup first tx ring slot to point to header - * stored in reserved region. - */ - start_dp[idx].addr = txvq->virtio_net_hdr_mem + - RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); - start_dp[idx].len = vq->hw->vtnet_hdr_size; - hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } - - virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); - - do { - uint16_t flags; - - start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); - start_dp[idx].len = cookie->data_len; - if (prepend_header) { - start_dp[idx].addr -= head_size; - start_dp[idx].len += head_size; - prepend_header = false; - } - - if (likely(idx != head_idx)) { - flags = cookie->next ? 
VRING_DESC_F_NEXT : 0; - flags |= vq->vq_packed.cached_flags; - start_dp[idx].flags = flags; - } - prev = idx; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } while ((cookie = cookie->next) != NULL); - - start_dp[prev].id = id; - - vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); - vq->vq_avail_idx = idx; - - if (!in_order) { - vq->vq_desc_head_idx = dxp->next; - if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) - vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; - } - - virtqueue_store_flags_packed(head_dp, head_flags, - vq->hw->weak_barriers); -} - static inline void virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie, uint16_t needed, int use_indirect, int can_push, diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 43e305ecc..31c48710c 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,7 @@ struct rte_mbuf; +#define DEFAULT_TX_FREE_THRESH 32 #define DEFAULT_RX_FREE_THRESH 32 #define VIRTIO_MBUF_BURST_SZ 64 @@ -562,4 +563,162 @@ virtqueue_notify(struct virtqueue *vq) #define VIRTQUEUE_DUMP(vq) do { } while (0) #endif +/* avoid write operation when necessary, to lessen cache issues */ +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ + if ((var) != (val)) \ + (var) = (val); \ +} while (0) + +#define virtqueue_clear_net_hdr(_hdr) do { \ + ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ +} while (0) + +static inline void +virtqueue_xmit_offload(struct virtio_net_hdr *hdr, + struct rte_mbuf *cookie, + bool offload) +{ + if (offload) { + if (cookie->ol_flags & PKT_TX_TCP_SEG) + cookie->ol_flags |= PKT_TX_TCP_CKSUM; + + switch (cookie->ol_flags & PKT_TX_L4_MASK) { + case PKT_TX_UDP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_udp_hdr, + dgram_cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + case PKT_TX_TCP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + default: + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); + break; + } + + /* TCP Segmentation Offload */ + if (cookie->ol_flags & PKT_TX_TCP_SEG) { + hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? + VIRTIO_NET_HDR_GSO_TCPV6 : + VIRTIO_NET_HDR_GSO_TCPV4; + hdr->gso_size = cookie->tso_segsz; + hdr->hdr_len = + cookie->l2_len + + cookie->l3_len + + cookie->l4_len; + } else { + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); + } + } +} + +static inline void +virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, + uint16_t needed, int can_push, int in_order) +{ + struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; + struct vq_desc_extra *dxp; + struct virtqueue *vq = txvq->vq; + struct vring_packed_desc *start_dp, *head_dp; + uint16_t idx, id, head_idx, head_flags; + int16_t head_size = vq->hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + uint16_t prev; + bool prepend_header = false; + + id = in_order ? 
vq->vq_avail_idx : vq->vq_desc_head_idx; + + dxp = &vq->vq_descx[id]; + dxp->ndescs = needed; + dxp->cookie = cookie; + + head_idx = vq->vq_avail_idx; + idx = head_idx; + prev = head_idx; + start_dp = vq->vq_packed.ring.desc; + + head_dp = &vq->vq_packed.ring.desc[idx]; + head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; + head_flags |= vq->vq_packed.cached_flags; + + if (can_push) { + /* prepend cannot fail, checked by caller */ + hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, + -head_size); + prepend_header = true; + + /* if offload disabled, it is not zeroed below, do it now */ + if (!vq->hw->has_tx_offload) + virtqueue_clear_net_hdr(hdr); + } else { + /* setup first tx ring slot to point to header + * stored in reserved region. + */ + start_dp[idx].addr = txvq->virtio_net_hdr_mem + + RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); + start_dp[idx].len = vq->hw->vtnet_hdr_size; + hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } + + virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); + + do { + uint16_t flags; + + start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); + start_dp[idx].len = cookie->data_len; + if (prepend_header) { + start_dp[idx].addr -= head_size; + start_dp[idx].len += head_size; + prepend_header = false; + } + + if (likely(idx != head_idx)) { + flags = cookie->next ? VRING_DESC_F_NEXT : 0; + flags |= vq->vq_packed.cached_flags; + start_dp[idx].flags = flags; + } + prev = idx; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } while ((cookie = cookie->next) != NULL); + + start_dp[prev].id = id; + + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); + vq->vq_avail_idx = idx; + + if (!in_order) { + vq->vq_desc_head_idx = dxp->next; + if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) + vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; + } + + virtqueue_store_flags_packed(head_dp, head_flags, + vq->hw->weak_barriers); +} #endif /* _VIRTQUEUE_H_ */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
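
As context for the move: the vectorized Tx path added in the next patch falls
back to this scalar helper for mbufs that cannot be batched, calling it
roughly as below (taken from patch 5), which is why the function has to be
visible from virtio_rxtx_packed_avx.c.

    /* single-mbuf fallback inside the vectorized Tx path */
    virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1);
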
* [dpdk-dev] [PATCH v1 5/7] net/virtio: add vectorized packed ring Tx function 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu ` (3 preceding siblings ...) 2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 4/7] net/virtio: reuse packed ring xmit functions Marvin Liu @ 2020-03-13 17:42 ` Marvin Liu 2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 6/7] net/virtio: add election for vectorized datapath Marvin Liu ` (12 subsequent siblings) 17 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Optimize packed ring Tx datapath alike Rx datapath. Split Rx datapath into batch and single Tx functions. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index 10e39670e..c9aaef0af 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -107,6 +107,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index b8b4d3c25..125df3a13 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2174,3 +2174,11 @@ virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue, { return 0; } + +__rte_weak uint16_t +virtio_xmit_pkts_packed_vec(void __rte_unused *tx_queue, + struct rte_mbuf __rte_unused **tx_pkts, + uint16_t __rte_unused nb_pkts) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c index d8cda9d71..0872f2083 100644 --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -15,6 +15,11 @@ #include "virtio_pci.h" #include "virtqueue.h" +#define REF_CNT_OFFSET 16 +#define SEG_NUM_OFFSET 32 +#define BATCH_REARM_DATA (1ULL << SEG_NUM_OFFSET | \ + 1ULL << REF_CNT_OFFSET | \ + RTE_PKTMBUF_HEADROOM) #define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63) #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ @@ -41,6 +46,48 @@ for (iter = val; iter < num; iter++) #endif +static void +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq) +{ + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; + struct vq_desc_extra *dxp; + uint16_t used_idx, id, curr_id, free_cnt = 0; + uint16_t size = vq->vq_nentries; + struct rte_mbuf *mbufs[size]; + uint16_t nb_mbuf = 0, i; + + used_idx = vq->vq_used_cons_idx; + + if (desc_is_used(&desc[used_idx], vq)) + id = desc[used_idx].id; + else + return; + + do { + curr_id = used_idx; + dxp = &vq->vq_descx[used_idx]; + used_idx += dxp->ndescs; + free_cnt += dxp->ndescs; + + if (dxp->cookie != NULL) { + mbufs[nb_mbuf] = dxp->cookie; + dxp->cookie = NULL; + nb_mbuf++; + } + + if (used_idx >= size) { + used_idx -= size; + vq->vq_packed.used_wrap_counter ^= 1; + } + } while (curr_id != id); + + for (i = 0; i < nb_mbuf; i++) + rte_pktmbuf_free(mbufs[i]); + + vq->vq_used_cons_idx = used_idx; + vq->vq_free_cnt += free_cnt; +} + static inline void virtio_update_batch_stats(struct virtnet_stats *stats, uint16_t pkt_len1, @@ -54,6 +101,185 @@ virtio_update_batch_stats(struct virtnet_stats *stats, 
stats->bytes += pkt_len4; } +static inline int +virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf **tx_pkts) +{ + struct virtqueue *vq = txvq->vq; + uint16_t head_size = vq->hw->vtnet_hdr_size; + struct vq_desc_extra *dxps[PACKED_BATCH_SIZE]; + uint16_t idx = vq->vq_avail_idx; + uint64_t descs[PACKED_BATCH_SIZE]; + struct virtio_net_hdr *hdrs[PACKED_BATCH_SIZE]; + uint16_t i; + + if (vq->vq_avail_idx & PACKED_BATCH_MASK) + return -1; + + /* Load four mbufs rearm data */ + __m256i mbufs = _mm256_set_epi64x( + *tx_pkts[3]->rearm_data, + *tx_pkts[2]->rearm_data, + *tx_pkts[1]->rearm_data, + *tx_pkts[0]->rearm_data); + + /* hdr_room=128, refcnt=1 and nb_segs=1 */ + __m256i mbuf_ref = _mm256_set_epi64x( + BATCH_REARM_DATA, BATCH_REARM_DATA, + BATCH_REARM_DATA, BATCH_REARM_DATA); + + /* Check hdr_room,refcnt and nb_segs */ + uint16_t cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref); + if (cmp & 0x7777) + return -1; + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + dxps[i] = &vq->vq_descx[idx + i]; + dxps[i]->ndescs = 1; + dxps[i]->cookie = tx_pkts[i]; + } + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + rte_pktmbuf_prepend(tx_pkts[i], head_size); + tx_pkts[i]->pkt_len -= head_size; + } + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) + descs[i] = (uint64_t)tx_pkts[i]->data_len | + (uint64_t)(idx + i) << 32 | + (uint64_t)vq->vq_packed.cached_flags << 48; + + __m512i new_descs = _mm512_set_epi64( + descs[3], VIRTIO_MBUF_DATA_DMA_ADDR(tx_pkts[3], vq), + descs[2], VIRTIO_MBUF_DATA_DMA_ADDR(tx_pkts[2], vq), + descs[1], VIRTIO_MBUF_DATA_DMA_ADDR(tx_pkts[1], vq), + descs[0], VIRTIO_MBUF_DATA_DMA_ADDR(tx_pkts[0], vq)); + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) + hdrs[i] = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + + if (!vq->hw->has_tx_offload) { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) + virtqueue_clear_net_hdr(hdrs[i]); + } else { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) + virtqueue_xmit_offload(hdrs[i], tx_pkts[i], true); + } + + /* Enqueue Packet buffers */ + rte_smp_wmb(); + _mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], new_descs); + + virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len, + tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len, + tx_pkts[3]->pkt_len); + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + vq->vq_free_cnt -= PACKED_BATCH_SIZE; + + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + + return 0; +} + +static inline int +virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf *txm) +{ + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint16_t slots, can_push; + int16_t need; + + /* How many main ring entries are needed to this Tx? 
+ * any_layout => number of segments + * default => number of segments + 1 + */ + can_push = rte_mbuf_refcnt_read(txm) == 1 && + RTE_MBUF_DIRECT(txm) && + txm->nb_segs == 1 && + rte_pktmbuf_headroom(txm) >= hdr_size; + + slots = txm->nb_segs + !can_push; + need = slots - vq->vq_free_cnt; + + /* Positive value indicates it need free vring descriptors */ + if (unlikely(need > 0)) { + virtio_xmit_cleanup_packed_vec(vq); + need = slots - vq->vq_free_cnt; + if (unlikely(need > 0)) { + PMD_TX_LOG(ERR, + "No free tx descriptors to transmit"); + return -1; + } + } + + /* Enqueue Packet buffers */ + virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1); + + txvq->stats.bytes += txm->pkt_len; + return 0; +} + +uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_tx *txvq = tx_queue; + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t nb_tx = 0; + uint16_t remained; + + if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) + return nb_tx; + + if (unlikely(nb_pkts < 1)) + return nb_pkts; + + PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts); + + if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh) + virtio_xmit_cleanup_packed_vec(vq); + + remained = RTE_MIN(nb_pkts, vq->vq_free_cnt); + + while (remained) { + if (remained >= PACKED_BATCH_SIZE) { + if (!virtqueue_enqueue_batch_packed_vec(txvq, + &tx_pkts[nb_tx])) { + nb_tx += PACKED_BATCH_SIZE; + remained -= PACKED_BATCH_SIZE; + continue; + } + } + if (!virtqueue_enqueue_single_packed_vec(txvq, + tx_pkts[nb_tx])) { + nb_tx++; + remained--; + continue; + } + break; + }; + + txvq->stats.packets += nb_tx; + + if (likely(nb_tx)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_TX_LOG(DEBUG, "Notified backend after xmit"); + } + } + + return nb_tx; +} + /* Optionally fill offload information in structure */ static inline int virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
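
A note on the batch eligibility test above: each mbuf's 64-bit rearm_data
word packs data_off, refcnt, nb_segs and port, so one vector compare (with
the port words excluded by the 0x7777 mask) validates all four mbufs at once.
An illustrative breakdown of BATCH_REARM_DATA, assuming the standard rte_mbuf
field order on a little-endian 64-bit target:

    /* Illustration only -- mirrors REF_CNT_OFFSET/SEG_NUM_OFFSET above.
     * rearm_data, low to high bits:
     *   [15:0]   data_off = RTE_PKTMBUF_HEADROOM (128 by default)
     *   [31:16]  refcnt   = 1
     *   [47:32]  nb_segs  = 1
     *   [63:48]  port       (ignored via the 0x7777 compare mask)
     */
    uint64_t batch_rearm_data = (1ULL << 32) | (1ULL << 16) |
                                RTE_PKTMBUF_HEADROOM;
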
* [dpdk-dev] [PATCH v1 6/7] net/virtio: add election for vectorized datapath 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu ` (4 preceding siblings ...) 2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 5/7] net/virtio: add vectorized packed ring Tx function Marvin Liu @ 2020-03-13 17:42 ` Marvin Liu 2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 7/7] net/virtio: support meson build Marvin Liu ` (11 subsequent siblings) 17 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Packed ring vectorized datapath can be selected when requirements are fulfilled. 1. AVX512 is allowed by config file and compiler 2. VERSION_1 and in_order features are negotiated 3. ring size is power of two 4. LRO and mergeable feature disabled in Rx datapath Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index f9d0ea70d..d27306d50 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1518,9 +1518,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) if (vtpci_packed_queue(hw)) { PMD_INIT_LOG(INFO, "virtio: using packed ring %s Tx path on port %u", - hw->use_inorder_tx ? "inorder" : "standard", + hw->packed_vec_tx ? "vectorized" : "standard", eth_dev->data->port_id); - eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; + if (hw->packed_vec_tx) + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec; + else + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; } else { if (hw->use_inorder_tx) { PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u", @@ -1534,7 +1537,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) } if (vtpci_packed_queue(hw)) { - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + if (hw->packed_vec_rx) { + PMD_INIT_LOG(INFO, + "virtio: using packed ring vectorized Rx path on port %u", + eth_dev->data->port_id); + eth_dev->rx_pkt_burst = + &virtio_recv_pkts_packed_vec; + } else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { PMD_INIT_LOG(INFO, "virtio: using packed ring mergeable buffer Rx path on port %u", eth_dev->data->port_id); @@ -2159,6 +2168,26 @@ virtio_dev_configure(struct rte_eth_dev *dev) hw->use_simple_rx = 1; + if (vtpci_packed_queue(hw)) { +#if defined(RTE_ARCH_X86) && defined(CC_AVX512_SUPPORT) + unsigned int vq_size; + vq_size = VTPCI_OPS(hw)->get_queue_num(hw, 0); + if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) && + rte_is_power_of_2(vq_size) && + vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) && + vtpci_with_feature(hw, VIRTIO_F_VERSION_1)) { + hw->packed_vec_rx = 1; + hw->packed_vec_tx = 1; + } + + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) + hw->packed_vec_rx = 0; + + if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) + hw->packed_vec_rx = 0; +#endif + } + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { hw->use_inorder_tx = 1; hw->use_inorder_rx = 1; diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h index 7433d2f08..8103b7a18 100644 --- a/drivers/net/virtio/virtio_pci.h +++ b/drivers/net/virtio/virtio_pci.h @@ -251,6 +251,8 @@ struct virtio_hw { uint8_t use_msix; uint8_t modern; uint8_t use_simple_rx; + uint8_t packed_vec_rx; + uint8_t packed_vec_tx; uint8_t use_inorder_rx; uint8_t use_inorder_tx; uint8_t weak_barriers; -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
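
When every condition above is met, the selection shows up in the existing
init logs at port start, for example (the port number depends on the setup):

    virtio: using packed ring vectorized Tx path on port 0
    virtio: using packed ring vectorized Rx path on port 0

Otherwise the driver keeps the standard packed ring handlers chosen by the
unmodified branches.
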
* [dpdk-dev] [PATCH v1 7/7] net/virtio: support meson build 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu ` (5 preceding siblings ...) 2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 6/7] net/virtio: add election for vectorized datapath Marvin Liu @ 2020-03-13 17:42 ` Marvin Liu 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu ` (10 subsequent siblings) 17 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-03-13 17:42 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index 04c7fdf25..b0bddfd06 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -11,6 +11,7 @@ deps += ['kvargs', 'bus_pci'] if arch_subdir == 'x86' sources += files('virtio_rxtx_simple_sse.c') + sources += files('virtio_rxtx_packed_avx.c') elif arch_subdir == 'ppc_64' sources += files('virtio_rxtx_simple_altivec.c') elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64') -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
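
For reference, the usual native build then picks the new file up on x86,
for example:

    meson build
    ninja -C build

Note that this revision adds virtio_rxtx_packed_avx.c for every x86 target;
the v2 series below (changelog item 5) gates it on AVX512 support being
detected by the build.
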
* [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath
  2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu
  ` (6 preceding siblings ...)
  2020-03-13 17:42 ` [dpdk-dev] [PATCH v1 7/7] net/virtio: support meson build Marvin Liu
@ 2020-03-27 16:54 ` Marvin Liu
  2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 1/7] net/virtio: add Rx free threshold setting Marvin Liu
  ` (6 more replies)
  2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu
  ` (9 subsequent siblings)
  17 siblings, 7 replies; 162+ messages in thread
From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw)
To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu

This patch set introduces a vectorized datapath for the packed ring. A packed
ring descriptor is 16 bytes, so a batch of four descriptors fits exactly into
one 64-byte cacheline, which AVX512 instructions can handle well. The packed
ring Tx datapath can be fully transformed into a vectorized datapath. The Rx
datapath can also be vectorized when features are limited (LRO and mergeable
buffers disabled). Users can disable the vectorized packed ring datapath
through the 'packed_vec' parameter of the virtio-user vdev.

v2:
1. replace more function blocks with vector instructions
2. clean virtio_net_hdr with vector instructions
3. allow header room size change
4. add 'packed_vec' option to the virtio_user vdev
5. fix build not checking whether AVX512 is enabled
6. update documentation

Marvin Liu (7):
  net/virtio: add Rx free threshold setting
  net/virtio-user: add vectorized packed ring parameter
  net/virtio: add vectorized packed ring Rx function
  net/virtio: reuse packed ring xmit functions
  net/virtio: add vectorized packed ring Tx datapath
  net/virtio: add election for vectorized datapath
  doc: add packed vectorized datapath

 .../nics/features/virtio-packed_vec.ini       |  22 +
 .../{virtio_vec.ini => virtio-split_vec.ini}  |   2 +-
 doc/guides/nics/virtio.rst                    |  44 +-
 drivers/net/virtio/Makefile                   |  28 +
 drivers/net/virtio/meson.build                |  11 +
 drivers/net/virtio/virtio_ethdev.c            |  43 +-
 drivers/net/virtio/virtio_ethdev.h            |   6 +
 drivers/net/virtio/virtio_pci.h               |   2 +
 drivers/net/virtio/virtio_rxtx.c              | 201 ++----
 drivers/net/virtio/virtio_rxtx_packed_avx.c   | 636 ++++++++++++++++++
 drivers/net/virtio/virtio_user_ethdev.c       |  27 +-
 drivers/net/virtio/virtqueue.h                | 165 ++++-
 12 files changed, 1005 insertions(+), 182 deletions(-)
 create mode 100644 doc/guides/nics/features/virtio-packed_vec.ini
 rename doc/guides/nics/features/{virtio_vec.ini => virtio-split_vec.ini} (88%)
 create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c

--
2.17.1

^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v2 1/7] net/virtio: add Rx free threshold setting 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu @ 2020-03-27 16:54 ` Marvin Liu 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 2/7] net/virtio-user: add vectorized packed ring parameter Marvin Liu ` (5 subsequent siblings) 6 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Introduce free threshold setting in Rx queue, default value of it is 32. Limiated threshold size to multiple of four as only vectorized packed Rx function will utilize it. Virtio driver will rearm Rx queue when more than rx_free_thresh descs were dequeued. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 752faa0f6..3a2dbc2e0 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, struct virtio_hw *hw = dev->data->dev_private; struct virtqueue *vq = hw->vqs[vtpci_queue_idx]; struct virtnet_rx *rxvq; + uint16_t rx_free_thresh; PMD_INIT_FUNC_TRACE(); @@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, return -EINVAL; } + rx_free_thresh = rx_conf->rx_free_thresh; + if (rx_free_thresh == 0) + rx_free_thresh = + RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH); + + if (rx_free_thresh & 0x3) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four." + " (rx_free_thresh=%u port=%u queue=%u)\n", + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + + if (rx_free_thresh >= vq->vq_nentries) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the " + "number of RX entries (%u)." + " (rx_free_thresh=%u port=%u queue=%u)\n", + vq->vq_nentries, + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + vq->vq_free_thresh = rx_free_thresh; + if (nb_desc == 0 || nb_desc > vq->vq_nentries) nb_desc = vq->vq_nentries; vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc); diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 58ad7309a..6301c56b2 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,8 @@ struct rte_mbuf; +#define DEFAULT_RX_FREE_THRESH 32 + /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v2 2/7] net/virtio-user: add vectorized packed ring parameter 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 1/7] net/virtio: add Rx free threshold setting Marvin Liu @ 2020-03-27 16:54 ` Marvin Liu 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 3/7] net/virtio: add vectorized packed ring Rx function Marvin Liu ` (4 subsequent siblings) 6 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Add new parameter "packed_vec" which can disable vectorized packed ring datapath explicitly. When "packed_vec" option is on, driver will check packed ring vectorized datapath prerequisites. If any one of them not matched, vectorized datapath won't be selected. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h index 7433d2f08..8103b7a18 100644 --- a/drivers/net/virtio/virtio_pci.h +++ b/drivers/net/virtio/virtio_pci.h @@ -251,6 +251,8 @@ struct virtio_hw { uint8_t use_msix; uint8_t modern; uint8_t use_simple_rx; + uint8_t packed_vec_rx; + uint8_t packed_vec_tx; uint8_t use_inorder_rx; uint8_t use_inorder_tx; uint8_t weak_barriers; diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index e61af4068..2608b1fae 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -450,6 +450,8 @@ static const char *valid_args[] = { VIRTIO_USER_ARG_IN_ORDER, #define VIRTIO_USER_ARG_PACKED_VQ "packed_vq" VIRTIO_USER_ARG_PACKED_VQ, +#define VIRTIO_USER_ARG_PACKED_VEC "packed_vec" + VIRTIO_USER_ARG_PACKED_VEC, NULL }; @@ -552,6 +554,8 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) uint64_t mrg_rxbuf = 1; uint64_t in_order = 1; uint64_t packed_vq = 0; + uint64_t packed_vec = 1; + char *path = NULL; char *ifname = NULL; char *mac_addr = NULL; @@ -668,6 +672,15 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } } + if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_PACKED_VEC) == 1) { + if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_PACKED_VEC, + &get_integer_arg, &packed_vec) < 0) { + PMD_INIT_LOG(ERR, "error to parse %s", + VIRTIO_USER_ARG_PACKED_VQ); + goto end; + } + } + if (queues > 1 && cq == 0) { PMD_INIT_LOG(ERR, "multi-q requires ctrl-q"); goto end; @@ -705,6 +718,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } hw = eth_dev->data->dev_private; +#if defined(RTE_ARCH_X86) && defined(CC_AVX512_SUPPORT) + if (packed_vec) { + hw->packed_vec_rx = 1; + hw->packed_vec_tx = 1; + } +#else + if (packed_vec) + PMD_INIT_LOG(ERR, "building environment not match vectorized " + "packed ring datapath requirement"); +#endif + if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq, queue_size, mac_addr, &ifname, server_mode, mrg_rxbuf, in_order, packed_vq) < 0) { @@ -777,4 +801,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, "server=<0|1> " "mrg_rxbuf=<0|1> " "in_order=<0|1> " - "packed_vq=<0|1>"); + "packed_vq=<0|1>" + "packed_vec=<0|1>"); -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
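
Based on the parameter string registered above, the vectorized path can be
controlled per virtio-user port. For example, to create a packed ring port
with the vectorized datapath explicitly disabled (socket path and core list
are placeholders; leaving packed_vec unset keeps the default of 1):

    testpmd -l 1-2 -n 4 \
        --vdev=net_virtio_user0,path=/tmp/vhost-user0.sock,packed_vq=1,packed_vec=0 \
        -- -i
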
* [dpdk-dev] [PATCH v2 3/7] net/virtio: add vectorized packed ring Rx function 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 1/7] net/virtio: add Rx free threshold setting Marvin Liu 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 2/7] net/virtio-user: add vectorized packed ring parameter Marvin Liu @ 2020-03-27 16:54 ` Marvin Liu 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 4/7] net/virtio: reuse packed ring xmit functions Marvin Liu ` (3 subsequent siblings) 6 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Optimize packed ring Rx datapath when AVX512 enabled and mergeable buffer/Rx LRO offloading are not required. Solution of optimization is pretty like vhost, is that split datapath into batch and single functions. Batch function is further optimized by vector instructions. Also pad desc extra structure to 16 bytes aligned, thus four elements will be saved in one batch. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index efdcb0d93..7bdb87c49 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -37,6 +37,34 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif +ifeq ($(RTE_TOOLCHAIN), gcc) +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), clang) +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1) +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), icc) +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA +endif +endif + +ifeq ($(findstring RTE_MACHINE_CPUFLAG_AVX512F,$(CFLAGS)),RTE_MACHINE_CPUFLAG_AVX512F) +ifneq ($(FORCE_DISABLE_AVX512), y) +CFLAGS += -DCC_AVX512_SUPPORT +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds +endif +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c +endif +endif + ifeq ($(CONFIG_RTE_VIRTIO_USER),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index 04c7fdf25..652ae39af 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -11,6 +11,17 @@ deps += ['kvargs', 'bus_pci'] if arch_subdir == 'x86' sources += files('virtio_rxtx_simple_sse.c') + if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F') + cflags += ['-DCC_AVX512_SUPPORT'] + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' + endif + sources += files('virtio_rxtx_packed_avx.c') + endif elif arch_subdir == 'ppc_64' sources += files('virtio_rxtx_simple_altivec.c') elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64') diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index cd8947656..10e39670e 100644 --- 
a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -104,6 +104,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 3a2dbc2e0..ac417232b 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -1245,7 +1245,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) return 0; } -#define VIRTIO_MBUF_BURST_SZ 64 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc)) uint16_t virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) @@ -2328,3 +2327,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, return nb_tx; } + +__rte_weak uint16_t +virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue, + struct rte_mbuf __rte_unused **rx_pkts, + uint16_t __rte_unused nb_pkts) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c new file mode 100644 index 000000000..e2310d74e --- /dev/null +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -0,0 +1,361 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> + +#include <rte_net.h> + +#include "virtio_logs.h" +#include "virtio_ethdev.h" +#include "virtio_pci.h" +#include "virtqueue.h" + +#define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63) + +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ + sizeof(struct vring_packed_desc)) +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) + +#ifdef VIRTIO_GCC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_ICC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ + for (iter = val; iter < size; iter++) +#endif + +#ifndef virtio_for_each_try_unroll +#define virtio_for_each_try_unroll(iter, val, num) \ + for (iter = val; iter < num; iter++) +#endif + + +static inline void +virtio_update_batch_stats(struct virtnet_stats *stats, + uint16_t pkt_len1, + uint16_t pkt_len2, + uint16_t pkt_len3, + uint16_t pkt_len4) +{ + stats->bytes += pkt_len1; + stats->bytes += pkt_len2; + stats->bytes += pkt_len3; + stats->bytes += pkt_len4; +} +/* Optionally fill offload information in structure */ +static inline int +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) +{ + struct rte_net_hdr_lens hdr_lens; + uint32_t hdrlen, ptype; + int l4_supported = 0; + + /* nothing to do */ + if (hdr->flags == 0) + return 0; + + /* GSO not support in vec path, skip check */ + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; + + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); + m->packet_type = ptype; + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) + l4_supported = 1; + + if (hdr->flags & 
VIRTIO_NET_HDR_F_NEEDS_CSUM) { + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; + if (hdr->csum_start <= hdrlen && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; + } else { + /* Unknown proto or tunnel, do sw cksum. We can assume + * the cksum field is in the first segment since the + * buffers we provided to the host are large enough. + * In case of SCTP, this will be wrong since it's a CRC + * but there's nothing we can do. + */ + uint16_t csum = 0, off; + + rte_raw_cksum_mbuf(m, hdr->csum_start, + rte_pktmbuf_pkt_len(m) - hdr->csum_start, + &csum); + if (likely(csum != 0xffff)) + csum = ~csum; + off = hdr->csum_offset + hdr->csum_start; + if (rte_pktmbuf_data_len(m) >= off + 1) + *rte_pktmbuf_mtod_offset(m, uint16_t *, + off) = csum; + } + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; + } + + return 0; +} + +static uint16_t +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + struct virtio_net_hdr *hdrs[PACKED_BATCH_SIZE]; + uint64_t addrs[PACKED_BATCH_SIZE << 1]; + uint16_t id = vq->vq_used_cons_idx; + uint8_t desc_stats; + uint16_t i; + void *desc_addr; + + if (id & PACKED_BATCH_MASK) + return -1; + + /* only care avail/used bits */ + __m512i desc_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + desc_addr = &vq->vq_packed.ring.desc[id]; + + rte_smp_rmb(); + __m512i packed_desc = _mm512_loadu_si512(desc_addr); + __m512i flags_mask = _mm512_maskz_and_epi64(0xff, packed_desc, + desc_flags); + + __m512i used_flags; + if (vq->vq_packed.used_wrap_counter) + used_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + else + used_flags = _mm512_setzero_si512(); + + /* Check all descs are used */ + desc_stats = _mm512_cmp_epu64_mask(flags_mask, used_flags, + _MM_CMPINT_EQ); + if (desc_stats != 0xff) + return -1; + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); + + addrs[i << 1] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1; + addrs[(i << 1) + 1] = + (uint64_t)rx_pkts[i]->rx_descriptor_fields1 + 8; + } + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + char *addr = (char *)rx_pkts[i]->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size; + hdrs[i] = (struct virtio_net_hdr *)addr; + } + + /* addresses of pkt_len and data_len */ + __m512i vindex = _mm512_loadu_si512((void *)addrs); + + /* + * select 10b*4 load 32bit from packed_desc[95:64] + * mmask 0110b*4 save 32bit into pkt_len and data_len + */ + __m512i value = _mm512_maskz_shuffle_epi32(0x6666, packed_desc, 0xAA); + + /* mmask 0110b*4 reduce hdr_len from pkt_len and data_len */ + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(0x6666, + (uint32_t)-hdr_size); + + value = _mm512_add_epi32(value, mbuf_len_offset); + /* batch store into mbufs */ + _mm512_i64scatter_epi64(0, vindex, value, 1); + + if (hw->has_rx_offload) { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) + virtio_vec_rx_offload(rx_pkts[i], hdrs[i]); + } + + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, + rx_pkts[3]->pkt_len); + + vq->vq_free_cnt += PACKED_BATCH_SIZE; + + vq->vq_used_cons_idx += PACKED_BATCH_SIZE; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + 
vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static uint16_t +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + uint16_t used_idx, id; + uint32_t len; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint32_t hdr_size = hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + struct vring_packed_desc *desc; + struct rte_mbuf *cookie; + + desc = vq->vq_packed.ring.desc; + used_idx = vq->vq_used_cons_idx; + if (!desc_is_used(&desc[used_idx], vq)) + return -1; + + len = desc[used_idx].len; + id = desc[used_idx].id; + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; + if (unlikely(cookie == NULL)) { + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u", + vq->vq_used_cons_idx); + return -1; + } + rte_prefetch0(cookie); + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); + + cookie->data_off = RTE_PKTMBUF_HEADROOM; + cookie->ol_flags = 0; + cookie->pkt_len = (uint32_t)(len - hdr_size); + cookie->data_len = (uint32_t)(len - hdr_size); + + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size); + if (hw->has_rx_offload) + virtio_vec_rx_offload(cookie, hdr); + + *rx_pkts = cookie; + + rxvq->stats.bytes += cookie->pkt_len; + + vq->vq_free_cnt++; + vq->vq_used_cons_idx++; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static inline void +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **cookie, + uint16_t num) +{ + struct virtqueue *vq = rxvq->vq; + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; + uint16_t flags = vq->vq_packed.cached_flags; + struct virtio_hw *hw = vq->hw; + struct vq_desc_extra *dxp; + uint16_t idx, i; + uint16_t total_num = 0; + uint16_t head_idx = vq->vq_avail_idx; + uint16_t head_flag = vq->vq_packed.cached_flags; + uint64_t addr; + + do { + idx = vq->vq_avail_idx; + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + dxp = &vq->vq_descx[idx + i]; + dxp->cookie = (void *)cookie[total_num + i]; + + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) + + RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size; + start_dp[idx + i].addr = addr; + start_dp[idx + i].len = cookie[total_num + i]->buf_len + - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size; + if (total_num || i) { + virtqueue_store_flags_packed(&start_dp[idx + i], + flags, hw->weak_barriers); + } + } + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + flags = vq->vq_packed.cached_flags; + } + total_num += PACKED_BATCH_SIZE; + } while (total_num < num); + + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, + hw->weak_barriers); + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); +} + +uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue, + struct rte_mbuf **rx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_rx *rxvq = rx_queue; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t num, nb_rx = 0; + uint32_t nb_enqueued = 0; + uint16_t free_cnt = vq->vq_free_thresh; + + if (unlikely(hw->started == 0)) + return nb_rx; + + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); + if (likely(num > PACKED_BATCH_SIZE)) + num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE); + + while (num) { + if (!virtqueue_dequeue_batch_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx += 
PACKED_BATCH_SIZE; + num -= PACKED_BATCH_SIZE; + continue; + } + if (!virtqueue_dequeue_single_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx++; + num--; + continue; + } + break; + }; + + PMD_RX_LOG(DEBUG, "dequeue:%d", num); + + rxvq->stats.packets += nb_rx; + + if (likely(vq->vq_free_cnt >= free_cnt)) { + struct rte_mbuf *new_pkts[free_cnt]; + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, + free_cnt) == 0)) { + virtio_recv_refill_packed_vec(rxvq, new_pkts, + free_cnt); + nb_enqueued += free_cnt; + } else { + struct rte_eth_dev *dev = + &rte_eth_devices[rxvq->port_id]; + dev->data->rx_mbuf_alloc_failed += free_cnt; + } + } + + if (likely(nb_enqueued)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_RX_LOG(DEBUG, "Notified"); + } + } + + return nb_rx; +} diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 6301c56b2..43e305ecc 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -20,6 +20,7 @@ struct rte_mbuf; #define DEFAULT_RX_FREE_THRESH 32 +#define VIRTIO_MBUF_BURST_SZ 64 /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO @@ -236,7 +237,8 @@ struct vq_desc_extra { void *cookie; uint16_t ndescs; uint16_t next; -}; + uint8_t padding[4]; +} __rte_packed __rte_aligned(16); struct virtqueue { struct virtio_hw *hw; /**< virtio_hw structure pointer. */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
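The core of the batch dequeue is that four 16-byte packed descriptors fill exactly one 64-byte cache line, so the "are all four descriptors used?" test collapses into a single 512-bit load, a masked AND and one compare. The fragment below is a self-contained sketch of just that test, using a stand-in descriptor type and illustrative names; it assumes AVX512F and the same flag layout as vring_packed_desc.

#include <stdint.h>
#include <immintrin.h>

/* Stand-in for vring_packed_desc: 16 bytes, flags in the top 16 bits of the
 * second 64-bit word, so AVAIL is bit 55 and USED is bit 63 of that qword.
 */
struct demo_packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

#define DEMO_FLAGS_MASK ((1ULL << 55) | (1ULL << 63))

/* Return non-zero when the four descriptors starting at 'desc' are all used. */
static inline int
demo_batch_is_used(const struct demo_packed_desc *desc, int used_wrap_counter)
{
	/* One 64-byte load covers the whole batch (a full cache line). */
	__m512i batch = _mm512_loadu_si512((const void *)desc);

	/* The flag bits live in the odd 64-bit lanes only, hence mask 0xaa. */
	__m512i flag_bits = _mm512_and_epi64(batch,
			_mm512_maskz_set1_epi64(0xaa, DEMO_FLAGS_MASK));

	/* Expected pattern: both bits set while the used wrap counter is 1,
	 * both bits clear after it flips to 0.
	 */
	__m512i expected = used_wrap_counter ?
			_mm512_maskz_set1_epi64(0xaa, DEMO_FLAGS_MASK) :
			_mm512_setzero_si512();

	/* All eight 64-bit lanes must match for the batch to be dequeued. */
	return _mm512_cmpeq_epu64_mask(flag_bits, expected) == 0xff;
}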
* [dpdk-dev] [PATCH v2 4/7] net/virtio: reuse packed ring xmit functions 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu ` (2 preceding siblings ...) 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 3/7] net/virtio: add vectorized packed ring Rx function Marvin Liu @ 2020-03-27 16:54 ` Marvin Liu 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 5/7] net/virtio: add vectorized packed ring Tx datapath Marvin Liu ` (2 subsequent siblings) 6 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Move xmit offload and packed ring xmit enqueue function to header file. These functions will be reused by packed ring vectorized Tx function. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index ac417232b..b8b4d3c25 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq, return i; } -#ifndef DEFAULT_TX_FREE_THRESH -#define DEFAULT_TX_FREE_THRESH 32 -#endif - static void virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num) { @@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m) } -/* avoid write operation when necessary, to lessen cache issues */ -#define ASSIGN_UNLESS_EQUAL(var, val) do { \ - if ((var) != (val)) \ - (var) = (val); \ -} while (0) - -#define virtqueue_clear_net_hdr(_hdr) do { \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ -} while (0) - -static inline void -virtqueue_xmit_offload(struct virtio_net_hdr *hdr, - struct rte_mbuf *cookie, - bool offload) -{ - if (offload) { - if (cookie->ol_flags & PKT_TX_TCP_SEG) - cookie->ol_flags |= PKT_TX_TCP_CKSUM; - - switch (cookie->ol_flags & PKT_TX_L4_MASK) { - case PKT_TX_UDP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_udp_hdr, - dgram_cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - case PKT_TX_TCP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - default: - ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); - ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); - ASSIGN_UNLESS_EQUAL(hdr->flags, 0); - break; - } - /* TCP Segmentation Offload */ - if (cookie->ol_flags & PKT_TX_TCP_SEG) { - hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
- VIRTIO_NET_HDR_GSO_TCPV6 : - VIRTIO_NET_HDR_GSO_TCPV4; - hdr->gso_size = cookie->tso_segsz; - hdr->hdr_len = - cookie->l2_len + - cookie->l3_len + - cookie->l4_len; - } else { - ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); - ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); - ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); - } - } -} static inline void virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq, @@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq, virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers); } -static inline void -virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, - uint16_t needed, int can_push, int in_order) -{ - struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; - struct vq_desc_extra *dxp; - struct virtqueue *vq = txvq->vq; - struct vring_packed_desc *start_dp, *head_dp; - uint16_t idx, id, head_idx, head_flags; - int16_t head_size = vq->hw->vtnet_hdr_size; - struct virtio_net_hdr *hdr; - uint16_t prev; - bool prepend_header = false; - - id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; - - dxp = &vq->vq_descx[id]; - dxp->ndescs = needed; - dxp->cookie = cookie; - - head_idx = vq->vq_avail_idx; - idx = head_idx; - prev = head_idx; - start_dp = vq->vq_packed.ring.desc; - - head_dp = &vq->vq_packed.ring.desc[idx]; - head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; - head_flags |= vq->vq_packed.cached_flags; - - if (can_push) { - /* prepend cannot fail, checked by caller */ - hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, - -head_size); - prepend_header = true; - - /* if offload disabled, it is not zeroed below, do it now */ - if (!vq->hw->has_tx_offload) - virtqueue_clear_net_hdr(hdr); - } else { - /* setup first tx ring slot to point to header - * stored in reserved region. - */ - start_dp[idx].addr = txvq->virtio_net_hdr_mem + - RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); - start_dp[idx].len = vq->hw->vtnet_hdr_size; - hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } - - virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); - - do { - uint16_t flags; - - start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); - start_dp[idx].len = cookie->data_len; - if (prepend_header) { - start_dp[idx].addr -= head_size; - start_dp[idx].len += head_size; - prepend_header = false; - } - - if (likely(idx != head_idx)) { - flags = cookie->next ? 
VRING_DESC_F_NEXT : 0; - flags |= vq->vq_packed.cached_flags; - start_dp[idx].flags = flags; - } - prev = idx; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } while ((cookie = cookie->next) != NULL); - - start_dp[prev].id = id; - - vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); - vq->vq_avail_idx = idx; - - if (!in_order) { - vq->vq_desc_head_idx = dxp->next; - if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) - vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; - } - - virtqueue_store_flags_packed(head_dp, head_flags, - vq->hw->weak_barriers); -} - static inline void virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie, uint16_t needed, int use_indirect, int can_push, diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 43e305ecc..31c48710c 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,7 @@ struct rte_mbuf; +#define DEFAULT_TX_FREE_THRESH 32 #define DEFAULT_RX_FREE_THRESH 32 #define VIRTIO_MBUF_BURST_SZ 64 @@ -562,4 +563,162 @@ virtqueue_notify(struct virtqueue *vq) #define VIRTQUEUE_DUMP(vq) do { } while (0) #endif +/* avoid write operation when necessary, to lessen cache issues */ +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ + if ((var) != (val)) \ + (var) = (val); \ +} while (0) + +#define virtqueue_clear_net_hdr(_hdr) do { \ + ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ +} while (0) + +static inline void +virtqueue_xmit_offload(struct virtio_net_hdr *hdr, + struct rte_mbuf *cookie, + bool offload) +{ + if (offload) { + if (cookie->ol_flags & PKT_TX_TCP_SEG) + cookie->ol_flags |= PKT_TX_TCP_CKSUM; + + switch (cookie->ol_flags & PKT_TX_L4_MASK) { + case PKT_TX_UDP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_udp_hdr, + dgram_cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + case PKT_TX_TCP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + default: + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); + break; + } + + /* TCP Segmentation Offload */ + if (cookie->ol_flags & PKT_TX_TCP_SEG) { + hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? + VIRTIO_NET_HDR_GSO_TCPV6 : + VIRTIO_NET_HDR_GSO_TCPV4; + hdr->gso_size = cookie->tso_segsz; + hdr->hdr_len = + cookie->l2_len + + cookie->l3_len + + cookie->l4_len; + } else { + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); + } + } +} + +static inline void +virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, + uint16_t needed, int can_push, int in_order) +{ + struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; + struct vq_desc_extra *dxp; + struct virtqueue *vq = txvq->vq; + struct vring_packed_desc *start_dp, *head_dp; + uint16_t idx, id, head_idx, head_flags; + int16_t head_size = vq->hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + uint16_t prev; + bool prepend_header = false; + + id = in_order ? 
vq->vq_avail_idx : vq->vq_desc_head_idx; + + dxp = &vq->vq_descx[id]; + dxp->ndescs = needed; + dxp->cookie = cookie; + + head_idx = vq->vq_avail_idx; + idx = head_idx; + prev = head_idx; + start_dp = vq->vq_packed.ring.desc; + + head_dp = &vq->vq_packed.ring.desc[idx]; + head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; + head_flags |= vq->vq_packed.cached_flags; + + if (can_push) { + /* prepend cannot fail, checked by caller */ + hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, + -head_size); + prepend_header = true; + + /* if offload disabled, it is not zeroed below, do it now */ + if (!vq->hw->has_tx_offload) + virtqueue_clear_net_hdr(hdr); + } else { + /* setup first tx ring slot to point to header + * stored in reserved region. + */ + start_dp[idx].addr = txvq->virtio_net_hdr_mem + + RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); + start_dp[idx].len = vq->hw->vtnet_hdr_size; + hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } + + virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); + + do { + uint16_t flags; + + start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); + start_dp[idx].len = cookie->data_len; + if (prepend_header) { + start_dp[idx].addr -= head_size; + start_dp[idx].len += head_size; + prepend_header = false; + } + + if (likely(idx != head_idx)) { + flags = cookie->next ? VRING_DESC_F_NEXT : 0; + flags |= vq->vq_packed.cached_flags; + start_dp[idx].flags = flags; + } + prev = idx; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } while ((cookie = cookie->next) != NULL); + + start_dp[prev].id = id; + + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); + vq->vq_avail_idx = idx; + + if (!in_order) { + vq->vq_desc_head_idx = dxp->next; + if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) + vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; + } + + virtqueue_store_flags_packed(head_dp, head_flags, + vq->hw->weak_barriers); +} #endif /* _VIRTQUEUE_H_ */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
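The practical effect of the move is that virtqueue_xmit_offload() and virtqueue_enqueue_xmit_packed() become static inline helpers visible to any file including virtqueue.h, which is what lets a separately compiled vectorized Tx source reuse the scalar enqueue for packets that do not fit the batch layout. The fragment below is only a hypothetical illustration of such a caller; apart from the helper and the virtqueue fields taken from the patch, the names are made up.

#include <rte_mbuf.h>
#include "virtqueue.h"

/* Illustrative fallback: enqueue one multi-segment mbuf through the shared
 * scalar helper, using a separate header descriptor (can_push == 0).
 */
static int
demo_xmit_fallback(struct virtnet_tx *txvq, struct rte_mbuf *m, int in_order)
{
	struct virtqueue *vq = txvq->vq;
	uint16_t slots = m->nb_segs + 1;	/* header desc + data segments */

	if (vq->vq_free_cnt < slots)
		return -1;	/* caller should reclaim used descriptors first */

	virtqueue_enqueue_xmit_packed(txvq, m, slots, 0 /* can_push */, in_order);
	return 0;
}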
* [dpdk-dev] [PATCH v2 5/7] net/virtio: add vectorized packed ring Tx datapath 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu ` (3 preceding siblings ...) 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 4/7] net/virtio: reuse packed ring xmit functions Marvin Liu @ 2020-03-27 16:54 ` Marvin Liu 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 6/7] net/virtio: add election for vectorized datapath Marvin Liu 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 7/7] doc: add packed " Marvin Liu 6 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Optimize packed ring Tx datapath alike Rx datapath. Split Tx datapath into batch and single Tx functions. Batch function further optimized by vector instructions. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index 10e39670e..c9aaef0af 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -107,6 +107,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index b8b4d3c25..125df3a13 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2174,3 +2174,11 @@ virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue, { return 0; } + +__rte_weak uint16_t +virtio_xmit_pkts_packed_vec(void __rte_unused *tx_queue, + struct rte_mbuf __rte_unused **tx_pkts, + uint16_t __rte_unused nb_pkts) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c index e2310d74e..b63429df6 100644 --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -15,6 +15,18 @@ #include "virtio_pci.h" #include "virtqueue.h" +/* reference count offset in mbuf rearm data */ +#define REF_CNT_OFFSET 16 +/* segment number offset in mbuf rearm data */ +#define SEG_NUM_OFFSET 32 + +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_OFFSET | \ + 1ULL << REF_CNT_OFFSET) +/* id offset in packed ring desc higher 64bits */ +#define ID_OFFSET 32 +/* flag offset in packed ring desc higher 64bits */ +#define FLAG_OFFSET 48 + #define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63) #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ @@ -41,6 +53,47 @@ for (iter = val; iter < num; iter++) #endif +static void +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq) +{ + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; + struct vq_desc_extra *dxp; + uint16_t used_idx, id, curr_id, free_cnt = 0; + uint16_t size = vq->vq_nentries; + struct rte_mbuf *mbufs[size]; + uint16_t nb_mbuf = 0, i; + + used_idx = vq->vq_used_cons_idx; + + if (!desc_is_used(&desc[used_idx], vq)) + return; + + id = desc[used_idx].id; + + do { + curr_id = used_idx; + dxp = &vq->vq_descx[used_idx]; + used_idx += dxp->ndescs; + free_cnt += dxp->ndescs; + + if (dxp->cookie != NULL) { + mbufs[nb_mbuf] = dxp->cookie; + dxp->cookie = NULL; + nb_mbuf++; + } + + if (used_idx >= size) { + used_idx -= size; + vq->vq_packed.used_wrap_counter ^= 1; + } + } while 
(curr_id != id); + + for (i = 0; i < nb_mbuf; i++) + rte_pktmbuf_free(mbufs[i]); + + vq->vq_used_cons_idx = used_idx; + vq->vq_free_cnt += free_cnt; +} static inline void virtio_update_batch_stats(struct virtnet_stats *stats, @@ -54,6 +107,228 @@ virtio_update_batch_stats(struct virtnet_stats *stats, stats->bytes += pkt_len3; stats->bytes += pkt_len4; } + +static inline int +virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf **tx_pkts) +{ + struct virtqueue *vq = txvq->vq; + uint16_t head_size = vq->hw->vtnet_hdr_size; + uint16_t idx = vq->vq_avail_idx; + struct virtio_net_hdr *hdrs[PACKED_BATCH_SIZE]; + uint16_t i, cmp; + + if (vq->vq_avail_idx & PACKED_BATCH_MASK) + return -1; + + /* Load four mbufs rearm data */ + __m256i mbufs = _mm256_set_epi64x( + *tx_pkts[3]->rearm_data, + *tx_pkts[2]->rearm_data, + *tx_pkts[1]->rearm_data, + *tx_pkts[0]->rearm_data); + + /* refcnt=1 and nb_segs=1 */ + __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA); + __m256i head_rooms = _mm256_set1_epi16(head_size); + + /* Check refcnt and nb_segs */ + cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref); + if (cmp & 0x6666) + return -1; + + /* Check headroom is enough */ + cmp = _mm256_mask_cmp_epu16_mask(0x1111, mbufs, head_rooms, + _MM_CMPINT_LT); + if (unlikely(cmp)) + return -1; + + __m512i dxps = _mm512_set_epi64( + 0x1, (uint64_t)tx_pkts[3], + 0x1, (uint64_t)tx_pkts[2], + 0x1, (uint64_t)tx_pkts[1], + 0x1, (uint64_t)tx_pkts[0]); + + _mm512_storeu_si512((void *)&vq->vq_descx[idx], dxps); + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + tx_pkts[i]->data_off -= head_size; + tx_pkts[i]->data_len += head_size; + } + +#ifdef RTE_VIRTIO_USER + __m512i descs_base = _mm512_set_epi64( + tx_pkts[3]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])), + tx_pkts[2]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])), + tx_pkts[1]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])), + tx_pkts[0]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0]))); +#else + __m512i descs_base = _mm512_set_epi64( + tx_pkts[3]->data_len, tx_pkts[3]->buf_iova, + tx_pkts[2]->data_len, tx_pkts[2]->buf_iova, + tx_pkts[1]->data_len, tx_pkts[1]->buf_iova, + tx_pkts[0]->data_len, tx_pkts[0]->buf_iova); +#endif + + /* id offset and data offset */ + __m512i data_offsets = _mm512_set_epi64( + (uint64_t)3 << ID_OFFSET, tx_pkts[3]->data_off, + (uint64_t)2 << ID_OFFSET, tx_pkts[2]->data_off, + (uint64_t)1 << ID_OFFSET, tx_pkts[1]->data_off, + 0, tx_pkts[0]->data_off); + + __m512i new_descs = _mm512_add_epi64(descs_base, data_offsets); + + uint64_t flags_temp = (uint64_t)idx << ID_OFFSET | + (uint64_t)vq->vq_packed.cached_flags << FLAG_OFFSET; + + /* flags offset and guest virtual address offset */ +#ifdef RTE_VIRTIO_USER + __m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset); +#else + __m128i flag_offset = _mm_set_epi64x(flags_temp, 0); +#endif + __m512i flag_offsets = _mm512_broadcast_i32x4(flag_offset); + + __m512i descs = _mm512_add_epi64(new_descs, flag_offsets); + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) + hdrs[i] = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + + if (!vq->hw->has_tx_offload) { + __m128i mask = _mm_set1_epi16(0xFFFF); + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + __m128i hdr = _mm_loadu_si128((void *)hdrs[i]); + if (unlikely(_mm_mask_test_epi16_mask(0x3F, hdr, + mask))) { + __m128i all_zero = _mm_setzero_si128(); + _mm_mask_storeu_epi16((void *)hdrs[i], 0x3F, + all_zero); 
+ } + } + } else { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) + virtqueue_xmit_offload(hdrs[i], tx_pkts[i], true); + } + + /* Enqueue Packet buffers */ + rte_smp_wmb(); + _mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], descs); + + virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len, + tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len, + tx_pkts[3]->pkt_len); + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + vq->vq_free_cnt -= PACKED_BATCH_SIZE; + + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + + return 0; +} + +static inline int +virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf *txm) +{ + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint16_t slots, can_push; + int16_t need; + + /* How many main ring entries are needed to this Tx? + * any_layout => number of segments + * default => number of segments + 1 + */ + can_push = rte_mbuf_refcnt_read(txm) == 1 && + RTE_MBUF_DIRECT(txm) && + txm->nb_segs == 1 && + rte_pktmbuf_headroom(txm) >= hdr_size; + + slots = txm->nb_segs + !can_push; + need = slots - vq->vq_free_cnt; + + /* Positive value indicates it need free vring descriptors */ + if (unlikely(need > 0)) { + virtio_xmit_cleanup_packed_vec(vq); + need = slots - vq->vq_free_cnt; + if (unlikely(need > 0)) { + PMD_TX_LOG(ERR, + "No free tx descriptors to transmit"); + return -1; + } + } + + /* Enqueue Packet buffers */ + virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1); + + txvq->stats.bytes += txm->pkt_len; + return 0; +} + +uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_tx *txvq = tx_queue; + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t nb_tx = 0; + uint16_t remained; + + if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) + return nb_tx; + + if (unlikely(nb_pkts < 1)) + return nb_pkts; + + PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts); + + if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh) + virtio_xmit_cleanup_packed_vec(vq); + + remained = RTE_MIN(nb_pkts, vq->vq_free_cnt); + + while (remained) { + if (remained >= PACKED_BATCH_SIZE) { + if (!virtqueue_enqueue_batch_packed_vec(txvq, + &tx_pkts[nb_tx])) { + nb_tx += PACKED_BATCH_SIZE; + remained -= PACKED_BATCH_SIZE; + continue; + } + } + if (!virtqueue_enqueue_single_packed_vec(txvq, + tx_pkts[nb_tx])) { + nb_tx++; + remained--; + continue; + } + break; + }; + + txvq->stats.packets += nb_tx; + + if (likely(nb_tx)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_TX_LOG(DEBUG, "Notified backend after xmit"); + } + } + + return nb_tx; +} + /* Optionally fill offload information in structure */ static inline int virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
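Before the 512-bit descriptor store, the batch Tx path must prove that all four mbufs are eligible: direct, single segment, reference count 1, and with enough headroom for the virtio-net header. The patch folds most of that into masked 16-bit compares on each mbuf's 64-bit rearm_data word. The sketch below restates just the refcnt/nb_segs part of that check in isolation; it assumes the rte_mbuf layout relied on by the patch (data_off, refcnt, nb_segs and port packed into one 64-bit word, in that order) and an AVX512BW/VL capable build, and the helper name is illustrative.

#include <stdint.h>
#include <immintrin.h>
#include <rte_mbuf.h>

/* Bit positions of refcnt and nb_segs inside the rearm_data qword. */
#define DEMO_REFCNT_SHIFT	16
#define DEMO_NB_SEGS_SHIFT	32
#define DEMO_REARM_REF	((1ULL << DEMO_NB_SEGS_SHIFT) | (1ULL << DEMO_REFCNT_SHIFT))

/* Return non-zero when all four mbufs have refcnt == 1 and nb_segs == 1. */
static inline int
demo_batch_refcnt_ok(struct rte_mbuf **pkts)
{
	__m256i rearm = _mm256_set_epi64x(*pkts[3]->rearm_data,
					  *pkts[2]->rearm_data,
					  *pkts[1]->rearm_data,
					  *pkts[0]->rearm_data);
	__m256i ref = _mm256_set1_epi64x(DEMO_REARM_REF);

	/* Lanes 1 and 2 of each 64-bit group hold refcnt and nb_segs;
	 * mask 0x6666 selects exactly those 16-bit lanes in all four groups.
	 */
	return (_mm256_cmpneq_epu16_mask(rearm, ref) & 0x6666) == 0;
}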
* [dpdk-dev] [PATCH v2 6/7] net/virtio: add election for vectorized datapath 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu ` (4 preceding siblings ...) 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 5/7] net/virtio: add vectorized packed ring Tx datapath Marvin Liu @ 2020-03-27 16:54 ` Marvin Liu 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 7/7] doc: add packed " Marvin Liu 6 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Packed ring vectorized datapath will be selected when requirements are fulfilled. 1. AVX512 is allowed in config file and supported by compiler 2. Host cpu support AVX512F 3. ring size is power of two 4. virtio VERSION_1 and in_order features are negotiated 5. LRO and mergeable feature disabled in Rx datapath Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index f9d0ea70d..21570e5cf 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1518,9 +1518,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) if (vtpci_packed_queue(hw)) { PMD_INIT_LOG(INFO, "virtio: using packed ring %s Tx path on port %u", - hw->use_inorder_tx ? "inorder" : "standard", + hw->packed_vec_tx ? "vectorized" : "standard", eth_dev->data->port_id); - eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; + if (hw->packed_vec_tx) + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec; + else + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; } else { if (hw->use_inorder_tx) { PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u", @@ -1534,7 +1537,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) } if (vtpci_packed_queue(hw)) { - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + if (hw->packed_vec_rx) { + PMD_INIT_LOG(INFO, + "virtio: using packed ring vectorized Rx path on port %u", + eth_dev->data->port_id); + eth_dev->rx_pkt_burst = + &virtio_recv_pkts_packed_vec; + } else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { PMD_INIT_LOG(INFO, "virtio: using packed ring mergeable buffer Rx path on port %u", eth_dev->data->port_id); @@ -2159,6 +2168,34 @@ virtio_dev_configure(struct rte_eth_dev *dev) hw->use_simple_rx = 1; + if (vtpci_packed_queue(hw)) { +#if defined(RTE_ARCH_X86) && defined(CC_AVX512_SUPPORT) + unsigned int vq_size; + vq_size = VTPCI_OPS(hw)->get_queue_num(hw, 0); + if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) || + !rte_is_power_of_2(vq_size) || + !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) || + !vtpci_with_feature(hw, VIRTIO_F_VERSION_1)) { + hw->packed_vec_rx = 0; + hw->packed_vec_tx = 0; + PMD_DRV_LOG(INFO, "disabled packed ring vectorized " + "path for requirements are not met"); + } + + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + hw->packed_vec_rx = 0; + PMD_DRV_LOG(ERR, "disabled packed ring vectorized rx " + "path for mrg_rxbuf enabled"); + } + + if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) { + hw->packed_vec_rx = 0; + PMD_DRV_LOG(ERR, "disabled packed ring vectorized rx " + "path for TCP_LRO enabled"); + } +#endif + } + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { hw->use_inorder_tx = 1; hw->use_inorder_rx = 1; -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
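Put together, the Rx-side election amounts to a handful of boolean checks on the CPU, the ring geometry and the negotiated features. The helper below is a condensed, illustrative restatement of those checks, not a function that exists in the driver; it uses the same driver and EAL accessors the patch itself calls.

#include <rte_common.h>
#include <rte_cpuflags.h>
#include <rte_ethdev.h>
#include "virtio_pci.h"

/* Return 1 when the vectorized packed ring Rx path may be used. */
static int
demo_packed_vec_rx_allowed(struct virtio_hw *hw, uint16_t vq_size,
		uint64_t rx_offloads)
{
	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F))
		return 0;		/* runtime CPU support */
	if (!rte_is_power_of_2(vq_size))
		return 0;		/* ring size must be a power of two */
	if (!vtpci_with_feature(hw, VIRTIO_F_VERSION_1) ||
	    !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER))
		return 0;		/* required negotiated features */
	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF))
		return 0;		/* mergeable buffers not supported */
	if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO)
		return 0;		/* LRO not supported on this path */
	return 1;
}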
* [dpdk-dev] [PATCH v2 7/7] doc: add packed vectorized datapath 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu ` (5 preceding siblings ...) 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 6/7] net/virtio: add election for vectorized datapath Marvin Liu @ 2020-03-27 16:54 ` Marvin Liu 6 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-03-27 16:54 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Document packed virtqueue vectorized datapath selection logic in virtio net PMD. Add packed virtqueue vectorized datapath features to new ini file. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/doc/guides/nics/features/virtio-packed_vec.ini b/doc/guides/nics/features/virtio-packed_vec.ini new file mode 100644 index 000000000..b239bcaad --- /dev/null +++ b/doc/guides/nics/features/virtio-packed_vec.ini @@ -0,0 +1,22 @@ +; +; Supported features of the 'virtio_packed_vec' network poll mode driver. +; +; Refer to default.ini for the full list of available PMD features. +; +[Features] +Speed capabilities = P +Link status = Y +Link status event = Y +Rx interrupt = Y +Queue start/stop = Y +Promiscuous mode = Y +Allmulticast mode = Y +Unicast MAC filter = Y +Multicast MAC filter = Y +VLAN filter = Y +Basic stats = Y +Stats per queue = Y +BSD nic_uio = Y +Linux UIO = Y +Linux VFIO = Y +x86-64 = Y diff --git a/doc/guides/nics/features/virtio_vec.ini b/doc/guides/nics/features/virtio-split_vec.ini similarity index 88% rename from doc/guides/nics/features/virtio_vec.ini rename to doc/guides/nics/features/virtio-split_vec.ini index e60fe36ae..4142fc9f0 100644 --- a/doc/guides/nics/features/virtio_vec.ini +++ b/doc/guides/nics/features/virtio-split_vec.ini @@ -1,5 +1,5 @@ ; -; Supported features of the 'virtio_vec' network poll mode driver. +; Supported features of the 'virtio_split_vec' network poll mode driver. ; ; Refer to default.ini for the full list of available PMD features. ; diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index d1f5fb898..fabe2e400 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -403,6 +403,11 @@ Below devargs are supported by the virtio-user vdev: It is used to enable virtio device packed virtqueue feature. (Default: 0 (disabled)) +#. ``packed_vec``: + + It is used to enable virtio device packed virtqueue vectorized path. + (Default: 1 (enabled)) + Virtio paths Selection and Usage -------------------------------- @@ -454,6 +459,13 @@ according to below configuration: both negotiated, this path will be selected. #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and Rx mergeable is not negotiated, this path will be selected. +#. Packed virtqueue vectorized Rx path: If building and running environment support + AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated && + TCP_LRO Rx offloading is disabled && packed_vec option enabled, + this path will be selected. +#. Packed virtqueue vectorized Tx path: If building and running environment support + AVX512 && in-order feature is negotiated && packed_vec option enabled, + this path will be selected. 
Rx/Tx callbacks of each Virtio path ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -476,6 +488,8 @@ are shown in below table: Packed virtqueue non-meregable path virtio_recv_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order mergeable path virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed virtio_xmit_pkts_packed + Packed virtqueue vectorized Rx path virtio_recv_pkts_packed_vec virtio_xmit_pkts_packed + Packed virtqueue vectorized Tx path virtio_recv_pkts_packed virtio_xmit_pkts_packed_vec ============================================ ================================= ======================== Virtio paths Support Status from Release to Release @@ -493,20 +507,22 @@ All virtio paths support status are shown in below table: .. table:: Virtio Paths and Releases - ============================================ ============= ============= ============= - Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 - ============================================ ============= ============= ============= - Split virtqueue mergeable path Y Y Y - Split virtqueue non-mergeable path Y Y Y - Split virtqueue vectorized Rx path Y Y Y - Split virtqueue simple Tx path Y N N - Split virtqueue in-order mergeable path Y Y - Split virtqueue in-order non-mergeable path Y Y - Packed virtqueue mergeable path Y - Packed virtqueue non-mergeable path Y - Packed virtqueue in-order mergeable path Y - Packed virtqueue in-order non-mergeable path Y - ============================================ ============= ============= ============= + ============================================ ============= ============= ============= ======= + Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~ + ============================================ ============= ============= ============= ======= + Split virtqueue mergeable path Y Y Y Y + Split virtqueue non-mergeable path Y Y Y Y + Split virtqueue vectorized Rx path Y Y Y Y + Split virtqueue simple Tx path Y N N N + Split virtqueue in-order mergeable path Y Y Y + Split virtqueue in-order non-mergeable path Y Y Y + Packed virtqueue mergeable path Y Y + Packed virtqueue non-mergeable path Y Y + Packed virtqueue in-order mergeable path Y Y + Packed virtqueue in-order non-mergeable path Y Y + Packed virtqueue vectorized Rx path Y + Packed virtqueue vectorized Tx path Y + ============================================ ============= ============= ============= ======= QEMU Support Status ~~~~~~~~~~~~~~~~~~~ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
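For reference, the selection rules documented above can be exercised end to end with testpmd. The invocation below is purely illustrative: the core list, socket path and queue counts are placeholders, a vhost-user backend must already be listening on the socket, and the vectorized path additionally needs an AVX512F-capable CPU plus a build with AVX512 support enabled.

./testpmd -l 0-1 -n 4 --no-pci \
    --vdev 'net_virtio_user0,mac=00:01:02:03:04:05,path=/tmp/vhost-user0.sock,queues=1,packed_vq=1,mrg_rxbuf=0,in_order=1,packed_vec=1' \
    -- -i --rxq=1 --txq=1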
* [dpdk-dev] [PATCH v3 0/7] add packed ring vectorized datapath 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu ` (7 preceding siblings ...) 2020-03-27 16:54 ` [dpdk-dev] [PATCH v2 0/7] add packed ring vectorized datapath Marvin Liu @ 2020-04-08 8:53 ` Marvin Liu 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 1/7] net/virtio: add Rx free threshold setting Marvin Liu ` (6 more replies) 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu ` (8 subsequent siblings) 17 siblings, 7 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-08 8:53 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu This patch set introduced vectorized datapath for packed ring. The size of packed ring descriptor is 16Bytes. Four batched descriptors are just placed into one cacheline. AVX512 instructions can well handle this kind of data. Packed ring TX datapath can fully transformed into vectorized datapath. Rx datapath also can be vectorized when features limiated(LRO and mergable disabled). User can specify whether disable vectorized packed ring datapath by 'packed_vec' parameter of virtio user vdev. v3: 1. Remove virtio_net_hdr array for better performance 2. disable 'packed_vec' by default v2: 1. more function blocks replaced by vector instructions 2. clean virtio_net_hdr by vector instruction 3. allow header room size change 4. add 'packed_vec' option in virtio_user vdev 5. fix build not check whether AVX512 enabled 6. doc update Marvin Liu (7): net/virtio: add Rx free threshold setting net/virtio-user: add vectorized packed ring parameter net/virtio: add vectorized packed ring Rx function net/virtio: reuse packed ring xmit functions net/virtio: add vectorized packed ring Tx datapath net/virtio: add election for vectorized datapath doc: add packed vectorized datapath .../nics/features/virtio-packed_vec.ini | 22 + .../{virtio_vec.ini => virtio-split_vec.ini} | 2 +- doc/guides/nics/virtio.rst | 44 +- drivers/net/virtio/Makefile | 28 + drivers/net/virtio/meson.build | 11 + drivers/net/virtio/virtio_ethdev.c | 43 +- drivers/net/virtio/virtio_ethdev.h | 6 + drivers/net/virtio/virtio_pci.h | 2 + drivers/net/virtio/virtio_rxtx.c | 201 ++---- drivers/net/virtio/virtio_rxtx_packed_avx.c | 637 ++++++++++++++++++ drivers/net/virtio/virtio_user_ethdev.c | 27 +- drivers/net/virtio/virtqueue.h | 165 ++++- 12 files changed, 1006 insertions(+), 182 deletions(-) create mode 100644 doc/guides/nics/features/virtio-packed_vec.ini rename doc/guides/nics/features/{virtio_vec.ini => virtio-split_vec.ini} (88%) create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v3 1/7] net/virtio: add Rx free threshold setting 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu @ 2020-04-08 8:53 ` Marvin Liu 2020-04-08 6:08 ` Ye Xiaolong 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 2/7] net/virtio-user: add vectorized packed ring parameter Marvin Liu ` (5 subsequent siblings) 6 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-08 8:53 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Introduce free threshold setting in Rx queue, default value of it is 32. Limiated threshold size to multiple of four as only vectorized packed Rx function will utilize it. Virtio driver will rearm Rx queue when more than rx_free_thresh descs were dequeued. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 752faa0f6..3a2dbc2e0 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, struct virtio_hw *hw = dev->data->dev_private; struct virtqueue *vq = hw->vqs[vtpci_queue_idx]; struct virtnet_rx *rxvq; + uint16_t rx_free_thresh; PMD_INIT_FUNC_TRACE(); @@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, return -EINVAL; } + rx_free_thresh = rx_conf->rx_free_thresh; + if (rx_free_thresh == 0) + rx_free_thresh = + RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH); + + if (rx_free_thresh & 0x3) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four." + " (rx_free_thresh=%u port=%u queue=%u)\n", + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + + if (rx_free_thresh >= vq->vq_nentries) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the " + "number of RX entries (%u)." + " (rx_free_thresh=%u port=%u queue=%u)\n", + vq->vq_nentries, + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + vq->vq_free_thresh = rx_free_thresh; + if (nb_desc == 0 || nb_desc > vq->vq_nentries) nb_desc = vq->vq_nentries; vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc); diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 58ad7309a..6301c56b2 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,8 @@ struct rte_mbuf; +#define DEFAULT_RX_FREE_THRESH 32 + /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
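From the application side the new knob is just the standard rte_eth_rxconf field, applied at queue setup time; per this patch it must be a multiple of four and smaller than the ring size, and it defaults to min(ring size / 4, 32) when left at zero. The snippet below is an illustrative setup helper; port id, descriptor count and mempool are placeholders.

#include <rte_ethdev.h>

/* Configure one Rx queue with an explicit rx_free_thresh of 32. */
static int
demo_setup_rx_queue(uint16_t port_id, uint16_t queue_id, struct rte_mempool *mp)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rxconf;

	if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
		return -1;

	rxconf = dev_info.default_rxconf;
	rxconf.rx_free_thresh = 32;	/* multiple of four, less than nb_rx_desc */

	return rte_eth_rx_queue_setup(port_id, queue_id, 256 /* nb_rx_desc */,
			rte_eth_dev_socket_id(port_id), &rxconf, mp);
}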
* Re: [dpdk-dev] [PATCH v3 1/7] net/virtio: add Rx free threshold setting 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 1/7] net/virtio: add Rx free threshold setting Marvin Liu @ 2020-04-08 6:08 ` Ye Xiaolong 0 siblings, 0 replies; 162+ messages in thread From: Ye Xiaolong @ 2020-04-08 6:08 UTC (permalink / raw) To: Marvin Liu; +Cc: maxime.coquelin, zhihong.wang, harry.van.haaren, dev On 04/08, Marvin Liu wrote: >Introduce free threshold setting in Rx queue, default value of it is 32. >Limiated threshold size to multiple of four as only vectorized packed Rx s/Limiated/Limit >function will utilize it. Virtio driver will rearm Rx queue when more >than rx_free_thresh descs were dequeued. > >Signed-off-by: Marvin Liu <yong.liu@intel.com> > >diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c >index 752faa0f6..3a2dbc2e0 100644 >--- a/drivers/net/virtio/virtio_rxtx.c >+++ b/drivers/net/virtio/virtio_rxtx.c >@@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, > struct virtio_hw *hw = dev->data->dev_private; > struct virtqueue *vq = hw->vqs[vtpci_queue_idx]; > struct virtnet_rx *rxvq; >+ uint16_t rx_free_thresh; > > PMD_INIT_FUNC_TRACE(); > >@@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, > return -EINVAL; > } > >+ rx_free_thresh = rx_conf->rx_free_thresh; >+ if (rx_free_thresh == 0) >+ rx_free_thresh = >+ RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH); >+ >+ if (rx_free_thresh & 0x3) { >+ RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four." >+ " (rx_free_thresh=%u port=%u queue=%u)\n", >+ rx_free_thresh, dev->data->port_id, queue_idx); >+ return -EINVAL; >+ } >+ >+ if (rx_free_thresh >= vq->vq_nentries) { >+ RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the " >+ "number of RX entries (%u)." >+ " (rx_free_thresh=%u port=%u queue=%u)\n", >+ vq->vq_nentries, >+ rx_free_thresh, dev->data->port_id, queue_idx); >+ return -EINVAL; >+ } >+ vq->vq_free_thresh = rx_free_thresh; >+ > if (nb_desc == 0 || nb_desc > vq->vq_nentries) > nb_desc = vq->vq_nentries; > vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc); >diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h >index 58ad7309a..6301c56b2 100644 >--- a/drivers/net/virtio/virtqueue.h >+++ b/drivers/net/virtio/virtqueue.h >@@ -18,6 +18,8 @@ > > struct rte_mbuf; > >+#define DEFAULT_RX_FREE_THRESH 32 What about naming it VIRITO_DEFAULT_RX_FREE_THRESH? Thanks, Xiaolong >+ > /* > * Per virtio_ring.h in Linux. > * For virtio_pci on SMP, we don't need to order with respect to MMIO >-- >2.17.1 > ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v3 2/7] net/virtio-user: add vectorized packed ring parameter 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 1/7] net/virtio: add Rx free threshold setting Marvin Liu @ 2020-04-08 8:53 ` Marvin Liu 2020-04-08 6:22 ` Ye Xiaolong 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 3/7] net/virtio: add vectorized packed ring Rx function Marvin Liu ` (4 subsequent siblings) 6 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-08 8:53 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Add new parameter "packed_vec" which can disable vectorized packed ring datapath explicitly. When "packed_vec" option is on, driver will check packed ring vectorized datapath prerequisites. If any one of them not matched, vectorized datapath won't be selected. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h index 7433d2f08..8103b7a18 100644 --- a/drivers/net/virtio/virtio_pci.h +++ b/drivers/net/virtio/virtio_pci.h @@ -251,6 +251,8 @@ struct virtio_hw { uint8_t use_msix; uint8_t modern; uint8_t use_simple_rx; + uint8_t packed_vec_rx; + uint8_t packed_vec_tx; uint8_t use_inorder_rx; uint8_t use_inorder_tx; uint8_t weak_barriers; diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index e61af4068..399ac5511 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -450,6 +450,8 @@ static const char *valid_args[] = { VIRTIO_USER_ARG_IN_ORDER, #define VIRTIO_USER_ARG_PACKED_VQ "packed_vq" VIRTIO_USER_ARG_PACKED_VQ, +#define VIRTIO_USER_ARG_PACKED_VEC "packed_vec" + VIRTIO_USER_ARG_PACKED_VEC, NULL }; @@ -552,6 +554,8 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) uint64_t mrg_rxbuf = 1; uint64_t in_order = 1; uint64_t packed_vq = 0; + uint64_t packed_vec = 0; + char *path = NULL; char *ifname = NULL; char *mac_addr = NULL; @@ -668,6 +672,15 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } } + if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_PACKED_VEC) == 1) { + if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_PACKED_VEC, + &get_integer_arg, &packed_vec) < 0) { + PMD_INIT_LOG(ERR, "error to parse %s", + VIRTIO_USER_ARG_PACKED_VQ); + goto end; + } + } + if (queues > 1 && cq == 0) { PMD_INIT_LOG(ERR, "multi-q requires ctrl-q"); goto end; @@ -705,6 +718,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } hw = eth_dev->data->dev_private; +#if defined(RTE_ARCH_X86) && defined(CC_AVX512_SUPPORT) + if (packed_vec) { + hw->packed_vec_rx = 1; + hw->packed_vec_tx = 1; + } +#else + if (packed_vec) + PMD_INIT_LOG(ERR, "building environment not match vectorized " + "packed ring datapath requirement"); +#endif + if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq, queue_size, mac_addr, &ifname, server_mode, mrg_rxbuf, in_order, packed_vq) < 0) { @@ -777,4 +801,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, "server=<0|1> " "mrg_rxbuf=<0|1> " "in_order=<0|1> " - "packed_vq=<0|1>"); + "packed_vq=<0|1>" + "packed_vec=<0|1>"); -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v3 2/7] net/virtio-user: add vectorized packed ring parameter 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 2/7] net/virtio-user: add vectorized packed ring parameter Marvin Liu @ 2020-04-08 6:22 ` Ye Xiaolong 2020-04-08 7:31 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: Ye Xiaolong @ 2020-04-08 6:22 UTC (permalink / raw) To: Marvin Liu; +Cc: maxime.coquelin, zhihong.wang, harry.van.haaren, dev On 04/08, Marvin Liu wrote: >Add new parameter "packed_vec" which can disable vectorized packed ring >datapath explicitly. When "packed_vec" option is on, driver will check >packed ring vectorized datapath prerequisites. If any one of them not >matched, vectorized datapath won't be selected. > >Signed-off-by: Marvin Liu <yong.liu@intel.com> > >diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h >index 7433d2f08..8103b7a18 100644 >--- a/drivers/net/virtio/virtio_pci.h >+++ b/drivers/net/virtio/virtio_pci.h >@@ -251,6 +251,8 @@ struct virtio_hw { > uint8_t use_msix; > uint8_t modern; > uint8_t use_simple_rx; >+ uint8_t packed_vec_rx; >+ uint8_t packed_vec_tx; > uint8_t use_inorder_rx; > uint8_t use_inorder_tx; > uint8_t weak_barriers; >diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c >index e61af4068..399ac5511 100644 >--- a/drivers/net/virtio/virtio_user_ethdev.c >+++ b/drivers/net/virtio/virtio_user_ethdev.c >@@ -450,6 +450,8 @@ static const char *valid_args[] = { > VIRTIO_USER_ARG_IN_ORDER, > #define VIRTIO_USER_ARG_PACKED_VQ "packed_vq" > VIRTIO_USER_ARG_PACKED_VQ, >+#define VIRTIO_USER_ARG_PACKED_VEC "packed_vec" >+ VIRTIO_USER_ARG_PACKED_VEC, > NULL > }; > >@@ -552,6 +554,8 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) > uint64_t mrg_rxbuf = 1; > uint64_t in_order = 1; > uint64_t packed_vq = 0; >+ uint64_t packed_vec = 0; >+ > char *path = NULL; > char *ifname = NULL; > char *mac_addr = NULL; >@@ -668,6 +672,15 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) > } > } > >+ if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_PACKED_VEC) == 1) { >+ if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_PACKED_VEC, >+ &get_integer_arg, &packed_vec) < 0) { >+ PMD_INIT_LOG(ERR, "error to parse %s", >+ VIRTIO_USER_ARG_PACKED_VQ); >+ goto end; >+ } >+ } >+ > if (queues > 1 && cq == 0) { > PMD_INIT_LOG(ERR, "multi-q requires ctrl-q"); > goto end; >@@ -705,6 +718,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) > } > > hw = eth_dev->data->dev_private; >+#if defined(RTE_ARCH_X86) && defined(CC_AVX512_SUPPORT) >+ if (packed_vec) { >+ hw->packed_vec_rx = 1; >+ hw->packed_vec_tx = 1; >+ } >+#else >+ if (packed_vec) >+ PMD_INIT_LOG(ERR, "building environment not match vectorized " >+ "packed ring datapath requirement"); Minor nit: s/not match/doesn't match/ And better to avoid breaking error message strings across multiple source lines. It makes it harder to use tools like grep to find errors in source. E.g. user uses "vectorized packed ring datapath" to grep the code. Thanks, Xiaolong >+#endif >+ > if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq, > queue_size, mac_addr, &ifname, server_mode, > mrg_rxbuf, in_order, packed_vq) < 0) { >@@ -777,4 +801,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, > "server=<0|1> " > "mrg_rxbuf=<0|1> " > "in_order=<0|1> " >- "packed_vq=<0|1>"); >+ "packed_vq=<0|1>" >+ "packed_vec=<0|1>"); >-- >2.17.1 > ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v3 2/7] net/virtio-user: add vectorized packed ring parameter 2020-04-08 6:22 ` Ye Xiaolong @ 2020-04-08 7:31 ` Liu, Yong 0 siblings, 0 replies; 162+ messages in thread From: Liu, Yong @ 2020-04-08 7:31 UTC (permalink / raw) To: Ye, Xiaolong; +Cc: maxime.coquelin, Wang, Zhihong, Van Haaren, Harry, dev > -----Original Message----- > From: Ye, Xiaolong <xiaolong.ye@intel.com> > Sent: Wednesday, April 8, 2020 2:23 PM > To: Liu, Yong <yong.liu@intel.com> > Cc: maxime.coquelin@redhat.com; Wang, Zhihong > <zhihong.wang@intel.com>; Van Haaren, Harry > <harry.van.haaren@intel.com>; dev@dpdk.org > Subject: Re: [PATCH v3 2/7] net/virtio-user: add vectorized packed ring > parameter > > On 04/08, Marvin Liu wrote: > >Add new parameter "packed_vec" which can disable vectorized packed > ring > >datapath explicitly. When "packed_vec" option is on, driver will check > >packed ring vectorized datapath prerequisites. If any one of them not > >matched, vectorized datapath won't be selected. > > > >Signed-off-by: Marvin Liu <yong.liu@intel.com> > > > >diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h > >index 7433d2f08..8103b7a18 100644 > >--- a/drivers/net/virtio/virtio_pci.h > >+++ b/drivers/net/virtio/virtio_pci.h > >@@ -251,6 +251,8 @@ struct virtio_hw { > > uint8_t use_msix; > > uint8_t modern; > > uint8_t use_simple_rx; > >+ uint8_t packed_vec_rx; > >+ uint8_t packed_vec_tx; > > uint8_t use_inorder_rx; > > uint8_t use_inorder_tx; > > uint8_t weak_barriers; > >diff --git a/drivers/net/virtio/virtio_user_ethdev.c > b/drivers/net/virtio/virtio_user_ethdev.c > >index e61af4068..399ac5511 100644 > >--- a/drivers/net/virtio/virtio_user_ethdev.c > >+++ b/drivers/net/virtio/virtio_user_ethdev.c > >@@ -450,6 +450,8 @@ static const char *valid_args[] = { > > VIRTIO_USER_ARG_IN_ORDER, > > #define VIRTIO_USER_ARG_PACKED_VQ "packed_vq" > > VIRTIO_USER_ARG_PACKED_VQ, > >+#define VIRTIO_USER_ARG_PACKED_VEC "packed_vec" > >+ VIRTIO_USER_ARG_PACKED_VEC, > > NULL > > }; > > > >@@ -552,6 +554,8 @@ virtio_user_pmd_probe(struct rte_vdev_device > *dev) > > uint64_t mrg_rxbuf = 1; > > uint64_t in_order = 1; > > uint64_t packed_vq = 0; > >+ uint64_t packed_vec = 0; > >+ > > char *path = NULL; > > char *ifname = NULL; > > char *mac_addr = NULL; > >@@ -668,6 +672,15 @@ virtio_user_pmd_probe(struct rte_vdev_device > *dev) > > } > > } > > > >+ if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_PACKED_VEC) == 1) { > >+ if (rte_kvargs_process(kvlist, > VIRTIO_USER_ARG_PACKED_VEC, > >+ &get_integer_arg, &packed_vec) < 0) { > >+ PMD_INIT_LOG(ERR, "error to parse %s", > >+ VIRTIO_USER_ARG_PACKED_VQ); > >+ goto end; > >+ } > >+ } > >+ > > if (queues > 1 && cq == 0) { > > PMD_INIT_LOG(ERR, "multi-q requires ctrl-q"); > > goto end; > >@@ -705,6 +718,17 @@ virtio_user_pmd_probe(struct rte_vdev_device > *dev) > > } > > > > hw = eth_dev->data->dev_private; > >+#if defined(RTE_ARCH_X86) && defined(CC_AVX512_SUPPORT) > >+ if (packed_vec) { > >+ hw->packed_vec_rx = 1; > >+ hw->packed_vec_tx = 1; > >+ } > >+#else > >+ if (packed_vec) > >+ PMD_INIT_LOG(ERR, "building environment not match > vectorized " > >+ "packed ring datapath requirement"); > > Minor nit: > > s/not match/doesn't match/ > > And better to avoid breaking error message strings across multiple source > lines. > It makes it harder to use tools like grep to find errors in source. > E.g. user uses "vectorized packed ring datapath" to grep the code. > > Thanks, > Xiaolong > Thanks for remind. Will change in next release. 
> >+#endif > >+ > > if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq, > > queue_size, mac_addr, &ifname, server_mode, > > mrg_rxbuf, in_order, packed_vq) < 0) { > >@@ -777,4 +801,5 @@ > RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, > > "server=<0|1> " > > "mrg_rxbuf=<0|1> " > > "in_order=<0|1> " > >- "packed_vq=<0|1>"); > >+ "packed_vq=<0|1>" > >+ "packed_vec=<0|1>"); > >-- > >2.17.1 > > ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v3 3/7] net/virtio: add vectorized packed ring Rx function 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 1/7] net/virtio: add Rx free threshold setting Marvin Liu 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 2/7] net/virtio-user: add vectorized packed ring parameter Marvin Liu @ 2020-04-08 8:53 ` Marvin Liu 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 4/7] net/virtio: reuse packed ring xmit functions Marvin Liu ` (3 subsequent siblings) 6 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-08 8:53 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Optimize packed ring Rx datapath when AVX512 enabled and mergeable buffer/Rx LRO offloading are not required. Solution of optimization is pretty like vhost, is that split datapath into batch and single functions. Batch function is further optimized by vector instructions. Also pad desc extra structure to 16 bytes aligned, thus four elements will be saved in one batch. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index efdcb0d93..7bdb87c49 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -37,6 +37,34 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif +ifeq ($(RTE_TOOLCHAIN), gcc) +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), clang) +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1) +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), icc) +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA +endif +endif + +ifeq ($(findstring RTE_MACHINE_CPUFLAG_AVX512F,$(CFLAGS)),RTE_MACHINE_CPUFLAG_AVX512F) +ifneq ($(FORCE_DISABLE_AVX512), y) +CFLAGS += -DCC_AVX512_SUPPORT +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds +endif +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c +endif +endif + ifeq ($(CONFIG_RTE_VIRTIO_USER),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index 04c7fdf25..652ae39af 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -11,6 +11,17 @@ deps += ['kvargs', 'bus_pci'] if arch_subdir == 'x86' sources += files('virtio_rxtx_simple_sse.c') + if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F') + cflags += ['-DCC_AVX512_SUPPORT'] + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' + endif + sources += files('virtio_rxtx_packed_avx.c') + endif elif arch_subdir == 'ppc_64' sources += files('virtio_rxtx_simple_altivec.c') elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64') diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index cd8947656..10e39670e 100644 --- 
a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -104,6 +104,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 3a2dbc2e0..ac417232b 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -1245,7 +1245,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) return 0; } -#define VIRTIO_MBUF_BURST_SZ 64 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc)) uint16_t virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) @@ -2328,3 +2327,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, return nb_tx; } + +__rte_weak uint16_t +virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue, + struct rte_mbuf __rte_unused **rx_pkts, + uint16_t __rte_unused nb_pkts) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c new file mode 100644 index 000000000..f2976b98f --- /dev/null +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -0,0 +1,358 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> + +#include <rte_net.h> + +#include "virtio_logs.h" +#include "virtio_ethdev.h" +#include "virtio_pci.h" +#include "virtqueue.h" + +#define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63) + +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ + sizeof(struct vring_packed_desc)) +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) + +#ifdef VIRTIO_GCC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_ICC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ + for (iter = val; iter < size; iter++) +#endif + +#ifndef virtio_for_each_try_unroll +#define virtio_for_each_try_unroll(iter, val, num) \ + for (iter = val; iter < num; iter++) +#endif + + +static inline void +virtio_update_batch_stats(struct virtnet_stats *stats, + uint16_t pkt_len1, + uint16_t pkt_len2, + uint16_t pkt_len3, + uint16_t pkt_len4) +{ + stats->bytes += pkt_len1; + stats->bytes += pkt_len2; + stats->bytes += pkt_len3; + stats->bytes += pkt_len4; +} +/* Optionally fill offload information in structure */ +static inline int +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) +{ + struct rte_net_hdr_lens hdr_lens; + uint32_t hdrlen, ptype; + int l4_supported = 0; + + /* nothing to do */ + if (hdr->flags == 0) + return 0; + + /* GSO not support in vec path, skip check */ + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; + + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); + m->packet_type = ptype; + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) + l4_supported = 1; + + if (hdr->flags & 
VIRTIO_NET_HDR_F_NEEDS_CSUM) { + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; + if (hdr->csum_start <= hdrlen && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; + } else { + /* Unknown proto or tunnel, do sw cksum. We can assume + * the cksum field is in the first segment since the + * buffers we provided to the host are large enough. + * In case of SCTP, this will be wrong since it's a CRC + * but there's nothing we can do. + */ + uint16_t csum = 0, off; + + rte_raw_cksum_mbuf(m, hdr->csum_start, + rte_pktmbuf_pkt_len(m) - hdr->csum_start, + &csum); + if (likely(csum != 0xffff)) + csum = ~csum; + off = hdr->csum_offset + hdr->csum_start; + if (rte_pktmbuf_data_len(m) >= off + 1) + *rte_pktmbuf_mtod_offset(m, uint16_t *, + off) = csum; + } + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; + } + + return 0; +} + +static uint16_t +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint64_t addrs[PACKED_BATCH_SIZE << 1]; + uint16_t id = vq->vq_used_cons_idx; + uint8_t desc_stats; + uint16_t i; + void *desc_addr; + + if (id & PACKED_BATCH_MASK) + return -1; + + /* only care avail/used bits */ + __m512i desc_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + desc_addr = &vq->vq_packed.ring.desc[id]; + + rte_smp_rmb(); + __m512i packed_desc = _mm512_loadu_si512(desc_addr); + __m512i flags_mask = _mm512_maskz_and_epi64(0xff, packed_desc, + desc_flags); + + __m512i used_flags; + if (vq->vq_packed.used_wrap_counter) + used_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + else + used_flags = _mm512_setzero_si512(); + + /* Check all descs are used */ + desc_stats = _mm512_cmp_epu64_mask(flags_mask, used_flags, + _MM_CMPINT_EQ); + if (desc_stats != 0xff) + return -1; + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); + + addrs[i << 1] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1; + addrs[(i << 1) + 1] = + (uint64_t)rx_pkts[i]->rx_descriptor_fields1 + 8; + } + + /* addresses of pkt_len and data_len */ + __m512i vindex = _mm512_loadu_si512((void *)addrs); + + /* + * select 10b*4 load 32bit from packed_desc[95:64] + * mmask 0110b*4 save 32bit into pkt_len and data_len + */ + __m512i value = _mm512_maskz_shuffle_epi32(0x6666, packed_desc, 0xAA); + + /* mmask 0110b*4 reduce hdr_len from pkt_len and data_len */ + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(0x6666, + (uint32_t)-hdr_size); + + value = _mm512_add_epi32(value, mbuf_len_offset); + /* batch store into mbufs */ + _mm512_i64scatter_epi64(0, vindex, value, 1); + + if (hw->has_rx_offload) { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + char *addr = (char *)rx_pkts[i]->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size; + virtio_vec_rx_offload(rx_pkts[i], + (struct virtio_net_hdr *)addr); + } + } + + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, + rx_pkts[3]->pkt_len); + + vq->vq_free_cnt += PACKED_BATCH_SIZE; + + vq->vq_used_cons_idx += PACKED_BATCH_SIZE; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static uint16_t +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, + struct 
rte_mbuf **rx_pkts) +{ + uint16_t used_idx, id; + uint32_t len; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint32_t hdr_size = hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + struct vring_packed_desc *desc; + struct rte_mbuf *cookie; + + desc = vq->vq_packed.ring.desc; + used_idx = vq->vq_used_cons_idx; + if (!desc_is_used(&desc[used_idx], vq)) + return -1; + + len = desc[used_idx].len; + id = desc[used_idx].id; + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; + if (unlikely(cookie == NULL)) { + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u", + vq->vq_used_cons_idx); + return -1; + } + rte_prefetch0(cookie); + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); + + cookie->data_off = RTE_PKTMBUF_HEADROOM; + cookie->ol_flags = 0; + cookie->pkt_len = (uint32_t)(len - hdr_size); + cookie->data_len = (uint32_t)(len - hdr_size); + + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size); + if (hw->has_rx_offload) + virtio_vec_rx_offload(cookie, hdr); + + *rx_pkts = cookie; + + rxvq->stats.bytes += cookie->pkt_len; + + vq->vq_free_cnt++; + vq->vq_used_cons_idx++; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static inline void +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **cookie, + uint16_t num) +{ + struct virtqueue *vq = rxvq->vq; + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; + uint16_t flags = vq->vq_packed.cached_flags; + struct virtio_hw *hw = vq->hw; + struct vq_desc_extra *dxp; + uint16_t idx, i; + uint16_t total_num = 0; + uint16_t head_idx = vq->vq_avail_idx; + uint16_t head_flag = vq->vq_packed.cached_flags; + uint64_t addr; + + do { + idx = vq->vq_avail_idx; + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + dxp = &vq->vq_descx[idx + i]; + dxp->cookie = (void *)cookie[total_num + i]; + + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) + + RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size; + start_dp[idx + i].addr = addr; + start_dp[idx + i].len = cookie[total_num + i]->buf_len + - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size; + if (total_num || i) { + virtqueue_store_flags_packed(&start_dp[idx + i], + flags, hw->weak_barriers); + } + } + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + flags = vq->vq_packed.cached_flags; + } + total_num += PACKED_BATCH_SIZE; + } while (total_num < num); + + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, + hw->weak_barriers); + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); +} + +uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue, + struct rte_mbuf **rx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_rx *rxvq = rx_queue; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t num, nb_rx = 0; + uint32_t nb_enqueued = 0; + uint16_t free_cnt = vq->vq_free_thresh; + + if (unlikely(hw->started == 0)) + return nb_rx; + + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); + if (likely(num > PACKED_BATCH_SIZE)) + num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE); + + while (num) { + if (!virtqueue_dequeue_batch_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx += PACKED_BATCH_SIZE; + num -= PACKED_BATCH_SIZE; + continue; + } + if (!virtqueue_dequeue_single_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx++; + num--; + continue; 
+ } + break; + }; + + PMD_RX_LOG(DEBUG, "dequeue:%d", num); + + rxvq->stats.packets += nb_rx; + + if (likely(vq->vq_free_cnt >= free_cnt)) { + struct rte_mbuf *new_pkts[free_cnt]; + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, + free_cnt) == 0)) { + virtio_recv_refill_packed_vec(rxvq, new_pkts, + free_cnt); + nb_enqueued += free_cnt; + } else { + struct rte_eth_dev *dev = + &rte_eth_devices[rxvq->port_id]; + dev->data->rx_mbuf_alloc_failed += free_cnt; + } + } + + if (likely(nb_enqueued)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_RX_LOG(DEBUG, "Notified"); + } + } + + return nb_rx; +} diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 6301c56b2..43e305ecc 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -20,6 +20,7 @@ struct rte_mbuf; #define DEFAULT_RX_FREE_THRESH 32 +#define VIRTIO_MBUF_BURST_SZ 64 /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO @@ -236,7 +237,8 @@ struct vq_desc_extra { void *cookie; uint16_t ndescs; uint16_t next; -}; + uint8_t padding[4]; +} __rte_packed __rte_aligned(16); struct virtqueue { struct virtio_hw *hw; /**< virtio_hw structure pointer. */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
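The heart of the batch dequeue above is a single 64-byte load followed by a masked compare. A condensed sketch of just that "are all four descriptors used?" step, assuming AVX-512F and the flag layout the patch relies on (bit 55 = AVAIL, bit 63 = USED within the high 64 bits of each 16-byte packed descriptor):

    #include <stdbool.h>
    #include <stdint.h>
    #include <immintrin.h>

    #define PACKED_FLAGS_MASK ((1ULL << 55) | (1ULL << 63))

    static inline bool
    batch_descs_used(const void *desc_addr, bool used_wrap_counter)
    {
        /* four 16B packed descriptors = one 64B cacheline */
        __m512i descs = _mm512_loadu_si512(desc_addr);

        /* keep only the AVAIL/USED bits, which live in the odd 64-bit lanes */
        __m512i flag_bits = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
        __m512i flags = _mm512_and_epi64(descs, flag_bits);

        /* the expected pattern flips with the ring's used wrap counter */
        __m512i expected = used_wrap_counter ?
            _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK) :
            _mm512_setzero_si512();

        /* all eight 64-bit lanes must match for the batch to be complete */
        return _mm512_cmp_epu64_mask(flags, expected, _MM_CMPINT_EQ) == 0xff;
    }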
* [dpdk-dev] [PATCH v3 4/7] net/virtio: reuse packed ring xmit functions 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu ` (2 preceding siblings ...) 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 3/7] net/virtio: add vectorized packed ring Rx function Marvin Liu @ 2020-04-08 8:53 ` Marvin Liu 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 5/7] net/virtio: add vectorized packed ring Tx datapath Marvin Liu ` (2 subsequent siblings) 6 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-08 8:53 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Move xmit offload and packed ring xmit enqueue function to header file. These functions will be reused by packed ring vectorized Tx function. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index ac417232b..b8b4d3c25 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq, return i; } -#ifndef DEFAULT_TX_FREE_THRESH -#define DEFAULT_TX_FREE_THRESH 32 -#endif - static void virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num) { @@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m) } -/* avoid write operation when necessary, to lessen cache issues */ -#define ASSIGN_UNLESS_EQUAL(var, val) do { \ - if ((var) != (val)) \ - (var) = (val); \ -} while (0) - -#define virtqueue_clear_net_hdr(_hdr) do { \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ -} while (0) - -static inline void -virtqueue_xmit_offload(struct virtio_net_hdr *hdr, - struct rte_mbuf *cookie, - bool offload) -{ - if (offload) { - if (cookie->ol_flags & PKT_TX_TCP_SEG) - cookie->ol_flags |= PKT_TX_TCP_CKSUM; - - switch (cookie->ol_flags & PKT_TX_L4_MASK) { - case PKT_TX_UDP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_udp_hdr, - dgram_cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - case PKT_TX_TCP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - default: - ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); - ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); - ASSIGN_UNLESS_EQUAL(hdr->flags, 0); - break; - } - /* TCP Segmentation Offload */ - if (cookie->ol_flags & PKT_TX_TCP_SEG) { - hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
- VIRTIO_NET_HDR_GSO_TCPV6 : - VIRTIO_NET_HDR_GSO_TCPV4; - hdr->gso_size = cookie->tso_segsz; - hdr->hdr_len = - cookie->l2_len + - cookie->l3_len + - cookie->l4_len; - } else { - ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); - ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); - ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); - } - } -} static inline void virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq, @@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq, virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers); } -static inline void -virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, - uint16_t needed, int can_push, int in_order) -{ - struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; - struct vq_desc_extra *dxp; - struct virtqueue *vq = txvq->vq; - struct vring_packed_desc *start_dp, *head_dp; - uint16_t idx, id, head_idx, head_flags; - int16_t head_size = vq->hw->vtnet_hdr_size; - struct virtio_net_hdr *hdr; - uint16_t prev; - bool prepend_header = false; - - id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; - - dxp = &vq->vq_descx[id]; - dxp->ndescs = needed; - dxp->cookie = cookie; - - head_idx = vq->vq_avail_idx; - idx = head_idx; - prev = head_idx; - start_dp = vq->vq_packed.ring.desc; - - head_dp = &vq->vq_packed.ring.desc[idx]; - head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; - head_flags |= vq->vq_packed.cached_flags; - - if (can_push) { - /* prepend cannot fail, checked by caller */ - hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, - -head_size); - prepend_header = true; - - /* if offload disabled, it is not zeroed below, do it now */ - if (!vq->hw->has_tx_offload) - virtqueue_clear_net_hdr(hdr); - } else { - /* setup first tx ring slot to point to header - * stored in reserved region. - */ - start_dp[idx].addr = txvq->virtio_net_hdr_mem + - RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); - start_dp[idx].len = vq->hw->vtnet_hdr_size; - hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } - - virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); - - do { - uint16_t flags; - - start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); - start_dp[idx].len = cookie->data_len; - if (prepend_header) { - start_dp[idx].addr -= head_size; - start_dp[idx].len += head_size; - prepend_header = false; - } - - if (likely(idx != head_idx)) { - flags = cookie->next ? 
VRING_DESC_F_NEXT : 0; - flags |= vq->vq_packed.cached_flags; - start_dp[idx].flags = flags; - } - prev = idx; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } while ((cookie = cookie->next) != NULL); - - start_dp[prev].id = id; - - vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); - vq->vq_avail_idx = idx; - - if (!in_order) { - vq->vq_desc_head_idx = dxp->next; - if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) - vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; - } - - virtqueue_store_flags_packed(head_dp, head_flags, - vq->hw->weak_barriers); -} - static inline void virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie, uint16_t needed, int use_indirect, int can_push, diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 43e305ecc..31c48710c 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,7 @@ struct rte_mbuf; +#define DEFAULT_TX_FREE_THRESH 32 #define DEFAULT_RX_FREE_THRESH 32 #define VIRTIO_MBUF_BURST_SZ 64 @@ -562,4 +563,162 @@ virtqueue_notify(struct virtqueue *vq) #define VIRTQUEUE_DUMP(vq) do { } while (0) #endif +/* avoid write operation when necessary, to lessen cache issues */ +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ + if ((var) != (val)) \ + (var) = (val); \ +} while (0) + +#define virtqueue_clear_net_hdr(_hdr) do { \ + ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ +} while (0) + +static inline void +virtqueue_xmit_offload(struct virtio_net_hdr *hdr, + struct rte_mbuf *cookie, + bool offload) +{ + if (offload) { + if (cookie->ol_flags & PKT_TX_TCP_SEG) + cookie->ol_flags |= PKT_TX_TCP_CKSUM; + + switch (cookie->ol_flags & PKT_TX_L4_MASK) { + case PKT_TX_UDP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_udp_hdr, + dgram_cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + case PKT_TX_TCP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + default: + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); + break; + } + + /* TCP Segmentation Offload */ + if (cookie->ol_flags & PKT_TX_TCP_SEG) { + hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? + VIRTIO_NET_HDR_GSO_TCPV6 : + VIRTIO_NET_HDR_GSO_TCPV4; + hdr->gso_size = cookie->tso_segsz; + hdr->hdr_len = + cookie->l2_len + + cookie->l3_len + + cookie->l4_len; + } else { + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); + } + } +} + +static inline void +virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, + uint16_t needed, int can_push, int in_order) +{ + struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; + struct vq_desc_extra *dxp; + struct virtqueue *vq = txvq->vq; + struct vring_packed_desc *start_dp, *head_dp; + uint16_t idx, id, head_idx, head_flags; + int16_t head_size = vq->hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + uint16_t prev; + bool prepend_header = false; + + id = in_order ? 
vq->vq_avail_idx : vq->vq_desc_head_idx; + + dxp = &vq->vq_descx[id]; + dxp->ndescs = needed; + dxp->cookie = cookie; + + head_idx = vq->vq_avail_idx; + idx = head_idx; + prev = head_idx; + start_dp = vq->vq_packed.ring.desc; + + head_dp = &vq->vq_packed.ring.desc[idx]; + head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; + head_flags |= vq->vq_packed.cached_flags; + + if (can_push) { + /* prepend cannot fail, checked by caller */ + hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, + -head_size); + prepend_header = true; + + /* if offload disabled, it is not zeroed below, do it now */ + if (!vq->hw->has_tx_offload) + virtqueue_clear_net_hdr(hdr); + } else { + /* setup first tx ring slot to point to header + * stored in reserved region. + */ + start_dp[idx].addr = txvq->virtio_net_hdr_mem + + RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); + start_dp[idx].len = vq->hw->vtnet_hdr_size; + hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } + + virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); + + do { + uint16_t flags; + + start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); + start_dp[idx].len = cookie->data_len; + if (prepend_header) { + start_dp[idx].addr -= head_size; + start_dp[idx].len += head_size; + prepend_header = false; + } + + if (likely(idx != head_idx)) { + flags = cookie->next ? VRING_DESC_F_NEXT : 0; + flags |= vq->vq_packed.cached_flags; + start_dp[idx].flags = flags; + } + prev = idx; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } while ((cookie = cookie->next) != NULL); + + start_dp[prev].id = id; + + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); + vq->vq_avail_idx = idx; + + if (!in_order) { + vq->vq_desc_head_idx = dxp->next; + if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) + vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; + } + + virtqueue_store_flags_packed(head_dp, head_flags, + vq->hw->weak_barriers); +} #endif /* _VIRTQUEUE_H_ */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
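A small worked example of what the now-shared virtqueue_xmit_offload() helper produces for a plain TCP-checksum packet. The values are illustrative; the function and constants are the ones moved into virtqueue.h above, so this assumes the driver's internal headers.

    /* Ethernet + IPv4 + TCP, checksum offload requested, no TSO */
    static inline void
    fill_hdr_tcp_cksum_example(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
    {
        m->ol_flags = PKT_TX_IPV4 | PKT_TX_TCP_CKSUM;
        m->l2_len = RTE_ETHER_HDR_LEN;              /* 14 */
        m->l3_len = sizeof(struct rte_ipv4_hdr);    /* 20 */

        virtqueue_xmit_offload(hdr, m, true);
        /* result: csum_start  = 34 (l2_len + l3_len)
         *         csum_offset = 16 (offsetof(struct rte_tcp_hdr, cksum))
         *         flags       = VIRTIO_NET_HDR_F_NEEDS_CSUM
         *         gso_type/gso_size/hdr_len forced to 0, written only if
         *         they were not already 0 (ASSIGN_UNLESS_EQUAL). */
    }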
* [dpdk-dev] [PATCH v3 5/7] net/virtio: add vectorized packed ring Tx datapath 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu ` (3 preceding siblings ...) 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 4/7] net/virtio: reuse packed ring xmit functions Marvin Liu @ 2020-04-08 8:53 ` Marvin Liu 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 6/7] net/virtio: add election for vectorized datapath Marvin Liu 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 7/7] doc: add packed " Marvin Liu 6 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-08 8:53 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Optimize packed ring Tx datapath alike Rx datapath. Split Tx datapath into batch and single Tx functions. Batch function further optimized by vector instructions. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index 10e39670e..c9aaef0af 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -107,6 +107,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index b8b4d3c25..125df3a13 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2174,3 +2174,11 @@ virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue, { return 0; } + +__rte_weak uint16_t +virtio_xmit_pkts_packed_vec(void __rte_unused *tx_queue, + struct rte_mbuf __rte_unused **tx_pkts, + uint16_t __rte_unused nb_pkts) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c index f2976b98f..fb26fe5f3 100644 --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -15,6 +15,21 @@ #include "virtio_pci.h" #include "virtqueue.h" +/* reference count offset in mbuf rearm data */ +#define REF_CNT_OFFSET 16 +/* segment number offset in mbuf rearm data */ +#define SEG_NUM_OFFSET 32 + +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_OFFSET | \ + 1ULL << REF_CNT_OFFSET) +/* id offset in packed ring desc higher 64bits */ +#define ID_OFFSET 32 +/* flag offset in packed ring desc higher 64bits */ +#define FLAG_OFFSET 48 + +/* net hdr short size mask */ +#define NET_HDR_MASK 0x1F + #define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63) #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ @@ -41,6 +56,47 @@ for (iter = val; iter < num; iter++) #endif +static void +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq) +{ + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; + struct vq_desc_extra *dxp; + uint16_t used_idx, id, curr_id, free_cnt = 0; + uint16_t size = vq->vq_nentries; + struct rte_mbuf *mbufs[size]; + uint16_t nb_mbuf = 0, i; + + used_idx = vq->vq_used_cons_idx; + + if (!desc_is_used(&desc[used_idx], vq)) + return; + + id = desc[used_idx].id; + + do { + curr_id = used_idx; + dxp = &vq->vq_descx[used_idx]; + used_idx += dxp->ndescs; + free_cnt += dxp->ndescs; + + if (dxp->cookie != NULL) { + mbufs[nb_mbuf] = dxp->cookie; + dxp->cookie = NULL; + nb_mbuf++; + } + + if (used_idx >= size) { + used_idx -= size; + 
vq->vq_packed.used_wrap_counter ^= 1; + } + } while (curr_id != id); + + for (i = 0; i < nb_mbuf; i++) + rte_pktmbuf_free(mbufs[i]); + + vq->vq_used_cons_idx = used_idx; + vq->vq_free_cnt += free_cnt; +} static inline void virtio_update_batch_stats(struct virtnet_stats *stats, @@ -54,6 +110,229 @@ virtio_update_batch_stats(struct virtnet_stats *stats, stats->bytes += pkt_len3; stats->bytes += pkt_len4; } + +static inline int +virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf **tx_pkts) +{ + struct virtqueue *vq = txvq->vq; + uint16_t head_size = vq->hw->vtnet_hdr_size; + uint16_t idx = vq->vq_avail_idx; + struct virtio_net_hdr *hdr; + uint16_t i, cmp; + + if (vq->vq_avail_idx & PACKED_BATCH_MASK) + return -1; + + /* Load four mbufs rearm data */ + __m256i mbufs = _mm256_set_epi64x( + *tx_pkts[3]->rearm_data, + *tx_pkts[2]->rearm_data, + *tx_pkts[1]->rearm_data, + *tx_pkts[0]->rearm_data); + + /* refcnt=1 and nb_segs=1 */ + __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA); + __m256i head_rooms = _mm256_set1_epi16(head_size); + + /* Check refcnt and nb_segs */ + cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref); + if (cmp & 0x6666) + return -1; + + /* Check headroom is enough */ + cmp = _mm256_mask_cmp_epu16_mask(0x1111, mbufs, head_rooms, + _MM_CMPINT_LT); + if (unlikely(cmp)) + return -1; + + __m512i dxps = _mm512_set_epi64( + 0x1, (uint64_t)tx_pkts[3], + 0x1, (uint64_t)tx_pkts[2], + 0x1, (uint64_t)tx_pkts[1], + 0x1, (uint64_t)tx_pkts[0]); + + _mm512_storeu_si512((void *)&vq->vq_descx[idx], dxps); + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + tx_pkts[i]->data_off -= head_size; + tx_pkts[i]->data_len += head_size; + } + +#ifdef RTE_VIRTIO_USER + __m512i descs_base = _mm512_set_epi64( + tx_pkts[3]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])), + tx_pkts[2]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])), + tx_pkts[1]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])), + tx_pkts[0]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0]))); +#else + __m512i descs_base = _mm512_set_epi64( + tx_pkts[3]->data_len, tx_pkts[3]->buf_iova, + tx_pkts[2]->data_len, tx_pkts[2]->buf_iova, + tx_pkts[1]->data_len, tx_pkts[1]->buf_iova, + tx_pkts[0]->data_len, tx_pkts[0]->buf_iova); +#endif + + /* id offset and data offset */ + __m512i data_offsets = _mm512_set_epi64( + (uint64_t)3 << ID_OFFSET, tx_pkts[3]->data_off, + (uint64_t)2 << ID_OFFSET, tx_pkts[2]->data_off, + (uint64_t)1 << ID_OFFSET, tx_pkts[1]->data_off, + 0, tx_pkts[0]->data_off); + + __m512i new_descs = _mm512_add_epi64(descs_base, data_offsets); + + uint64_t flags_temp = (uint64_t)idx << ID_OFFSET | + (uint64_t)vq->vq_packed.cached_flags << FLAG_OFFSET; + + /* flags offset and guest virtual address offset */ +#ifdef RTE_VIRTIO_USER + __m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset); +#else + __m128i flag_offset = _mm_set_epi64x(flags_temp, 0); +#endif + __m512i flag_offsets = _mm512_broadcast_i32x4(flag_offset); + + __m512i descs = _mm512_add_epi64(new_descs, flag_offsets); + + if (!vq->hw->has_tx_offload) { + __m128i mask = _mm_set1_epi16(0xFFFF); + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + __m128i v_hdr = _mm_loadu_si128((void *)hdr); + if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK, + v_hdr, mask))) { + __m128i all_zero = _mm_setzero_si128(); + _mm_mask_storeu_epi16((void *)hdr, + NET_HDR_MASK, all_zero); + } + } + } else 
{ + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + virtqueue_xmit_offload(hdr, tx_pkts[i], true); + } + } + + /* Enqueue Packet buffers */ + rte_smp_wmb(); + _mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], descs); + + virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len, + tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len, + tx_pkts[3]->pkt_len); + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + vq->vq_free_cnt -= PACKED_BATCH_SIZE; + + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + + return 0; +} + +static inline int +virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf *txm) +{ + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint16_t slots, can_push; + int16_t need; + + /* How many main ring entries are needed to this Tx? + * any_layout => number of segments + * default => number of segments + 1 + */ + can_push = rte_mbuf_refcnt_read(txm) == 1 && + RTE_MBUF_DIRECT(txm) && + txm->nb_segs == 1 && + rte_pktmbuf_headroom(txm) >= hdr_size; + + slots = txm->nb_segs + !can_push; + need = slots - vq->vq_free_cnt; + + /* Positive value indicates it need free vring descriptors */ + if (unlikely(need > 0)) { + virtio_xmit_cleanup_packed_vec(vq); + need = slots - vq->vq_free_cnt; + if (unlikely(need > 0)) { + PMD_TX_LOG(ERR, + "No free tx descriptors to transmit"); + return -1; + } + } + + /* Enqueue Packet buffers */ + virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1); + + txvq->stats.bytes += txm->pkt_len; + return 0; +} + +uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_tx *txvq = tx_queue; + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t nb_tx = 0; + uint16_t remained; + + if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) + return nb_tx; + + if (unlikely(nb_pkts < 1)) + return nb_pkts; + + PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts); + + if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh) + virtio_xmit_cleanup_packed_vec(vq); + + remained = RTE_MIN(nb_pkts, vq->vq_free_cnt); + + while (remained) { + if (remained >= PACKED_BATCH_SIZE) { + if (!virtqueue_enqueue_batch_packed_vec(txvq, + &tx_pkts[nb_tx])) { + nb_tx += PACKED_BATCH_SIZE; + remained -= PACKED_BATCH_SIZE; + continue; + } + } + if (!virtqueue_enqueue_single_packed_vec(txvq, + tx_pkts[nb_tx])) { + nb_tx++; + remained--; + continue; + } + break; + }; + + txvq->stats.packets += nb_tx; + + if (likely(nb_tx)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_TX_LOG(DEBUG, "Notified backend after xmit"); + } + } + + return nb_tx; +} + /* Optionally fill offload information in structure */ static inline int virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
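The gatekeeping step of the batch Tx path above can be read in isolation: all four mbufs must be single-segment and unreferenced before their rearm words are compared in one shot. A condensed sketch of that check, assuming the mbuf rearm_data layout the patch relies on (data_off, refcnt, nb_segs, port as consecutive 16-bit fields) and AVX-512BW/VL for the 256-bit masked compare:

    #include <immintrin.h>
    #include <rte_mbuf.h>

    #define REF_CNT_OFFSET 16
    #define SEG_NUM_OFFSET 32
    #define DEFAULT_REARM_DATA ((1ULL << SEG_NUM_OFFSET) | (1ULL << REF_CNT_OFFSET))

    static inline int
    tx_batch_mbufs_reusable(struct rte_mbuf **tx_pkts)
    {
        __m256i mbufs = _mm256_set_epi64x(*tx_pkts[3]->rearm_data,
                                          *tx_pkts[2]->rearm_data,
                                          *tx_pkts[1]->rearm_data,
                                          *tx_pkts[0]->rearm_data);
        /* reference value: refcnt == 1 and nb_segs == 1 */
        __m256i ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);

        /* 0x6666 selects the refcnt and nb_segs 16-bit lanes of each word;
         * any mismatch there disqualifies the whole batch */
        return (_mm256_cmpneq_epu16_mask(mbufs, ref) & 0x6666) == 0;
    }

(The real code additionally checks, with the 0x1111 lane mask, that each mbuf's data_off leaves enough headroom to prepend the virtio net header.)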
* [dpdk-dev] [PATCH v3 6/7] net/virtio: add election for vectorized datapath 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu ` (4 preceding siblings ...) 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 5/7] net/virtio: add vectorized packed ring Tx datapath Marvin Liu @ 2020-04-08 8:53 ` Marvin Liu 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 7/7] doc: add packed " Marvin Liu 6 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-08 8:53 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Packed ring vectorized datapath will be selected when criterian matched. 1. AVX512 is enabled in dpdk config and supported by compiler 2. Host cpu has AVX512F flag 3. Ring size is power of two 4. virtio VERSION_1 and IN_ORDER features are negotiated 5. LRO and mergeable are disabled in Rx datapath Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index f9d0ea70d..21570e5cf 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1518,9 +1518,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) if (vtpci_packed_queue(hw)) { PMD_INIT_LOG(INFO, "virtio: using packed ring %s Tx path on port %u", - hw->use_inorder_tx ? "inorder" : "standard", + hw->packed_vec_tx ? "vectorized" : "standard", eth_dev->data->port_id); - eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; + if (hw->packed_vec_tx) + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec; + else + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; } else { if (hw->use_inorder_tx) { PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u", @@ -1534,7 +1537,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) } if (vtpci_packed_queue(hw)) { - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + if (hw->packed_vec_rx) { + PMD_INIT_LOG(INFO, + "virtio: using packed ring vectorized Rx path on port %u", + eth_dev->data->port_id); + eth_dev->rx_pkt_burst = + &virtio_recv_pkts_packed_vec; + } else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { PMD_INIT_LOG(INFO, "virtio: using packed ring mergeable buffer Rx path on port %u", eth_dev->data->port_id); @@ -2159,6 +2168,34 @@ virtio_dev_configure(struct rte_eth_dev *dev) hw->use_simple_rx = 1; + if (vtpci_packed_queue(hw)) { +#if defined(RTE_ARCH_X86) && defined(CC_AVX512_SUPPORT) + unsigned int vq_size; + vq_size = VTPCI_OPS(hw)->get_queue_num(hw, 0); + if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) || + !rte_is_power_of_2(vq_size) || + !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) || + !vtpci_with_feature(hw, VIRTIO_F_VERSION_1)) { + hw->packed_vec_rx = 0; + hw->packed_vec_tx = 0; + PMD_DRV_LOG(INFO, "disabled packed ring vectorized " + "path for requirements are not met"); + } + + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + hw->packed_vec_rx = 0; + PMD_DRV_LOG(ERR, "disabled packed ring vectorized rx " + "path for mrg_rxbuf enabled"); + } + + if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) { + hw->packed_vec_rx = 0; + PMD_DRV_LOG(ERR, "disabled packed ring vectorized rx " + "path for TCP_LRO enabled"); + } +#endif + } + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { hw->use_inorder_tx = 1; hw->use_inorder_rx = 1; -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
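Condensed into a single predicate, the Rx-side election rule described above reads roughly as follows. This is a sketch only; the actual hunk clears hw->packed_vec_rx/tx and logs instead of returning a value, and all names come from the patch and the driver's headers.

    static inline bool
    packed_vec_rx_allowed(struct virtio_hw *hw, uint64_t rx_offloads,
                          uint16_t vq_size)
    {
        return rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
               rte_is_power_of_2(vq_size) &&
               vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) &&
               vtpci_with_feature(hw, VIRTIO_F_VERSION_1) &&
               !vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF) &&
               !(rx_offloads & DEV_RX_OFFLOAD_TCP_LRO);
    }

The Tx side keeps only the first four conditions, since mergeable buffers and LRO are Rx-only concerns.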
* [dpdk-dev] [PATCH v3 7/7] doc: add packed vectorized datapath 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu ` (5 preceding siblings ...) 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 6/7] net/virtio: add election for vectorized datapath Marvin Liu @ 2020-04-08 8:53 ` Marvin Liu 6 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-08 8:53 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Document packed virtqueue vectorized datapath selection logic in virtio net PMD. Add packed virtqueue vectorized datapath features to new ini file. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/doc/guides/nics/features/virtio-packed_vec.ini b/doc/guides/nics/features/virtio-packed_vec.ini new file mode 100644 index 000000000..b239bcaad --- /dev/null +++ b/doc/guides/nics/features/virtio-packed_vec.ini @@ -0,0 +1,22 @@ +; +; Supported features of the 'virtio_packed_vec' network poll mode driver. +; +; Refer to default.ini for the full list of available PMD features. +; +[Features] +Speed capabilities = P +Link status = Y +Link status event = Y +Rx interrupt = Y +Queue start/stop = Y +Promiscuous mode = Y +Allmulticast mode = Y +Unicast MAC filter = Y +Multicast MAC filter = Y +VLAN filter = Y +Basic stats = Y +Stats per queue = Y +BSD nic_uio = Y +Linux UIO = Y +Linux VFIO = Y +x86-64 = Y diff --git a/doc/guides/nics/features/virtio_vec.ini b/doc/guides/nics/features/virtio-split_vec.ini similarity index 88% rename from doc/guides/nics/features/virtio_vec.ini rename to doc/guides/nics/features/virtio-split_vec.ini index e60fe36ae..4142fc9f0 100644 --- a/doc/guides/nics/features/virtio_vec.ini +++ b/doc/guides/nics/features/virtio-split_vec.ini @@ -1,5 +1,5 @@ ; -; Supported features of the 'virtio_vec' network poll mode driver. +; Supported features of the 'virtio_split_vec' network poll mode driver. ; ; Refer to default.ini for the full list of available PMD features. ; diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index d1f5fb898..fabe2e400 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -403,6 +403,11 @@ Below devargs are supported by the virtio-user vdev: It is used to enable virtio device packed virtqueue feature. (Default: 0 (disabled)) +#. ``packed_vec``: + + It is used to enable virtio device packed virtqueue vectorized path. + (Default: 1 (enabled)) + Virtio paths Selection and Usage -------------------------------- @@ -454,6 +459,13 @@ according to below configuration: both negotiated, this path will be selected. #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and Rx mergeable is not negotiated, this path will be selected. +#. Packed virtqueue vectorized Rx path: If building and running environment support + AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated && + TCP_LRO Rx offloading is disabled && packed_vec option enabled, + this path will be selected. +#. Packed virtqueue vectorized Tx path: If building and running environment support + AVX512 && in-order feature is negotiated && packed_vec option enabled, + this path will be selected. 
Rx/Tx callbacks of each Virtio path ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -476,6 +488,8 @@ are shown in below table: Packed virtqueue non-meregable path virtio_recv_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order mergeable path virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed virtio_xmit_pkts_packed + Packed virtqueue vectorized Rx path virtio_recv_pkts_packed_vec virtio_xmit_pkts_packed + Packed virtqueue vectorized Tx path virtio_recv_pkts_packed virtio_xmit_pkts_packed_vec ============================================ ================================= ======================== Virtio paths Support Status from Release to Release @@ -493,20 +507,22 @@ All virtio paths support status are shown in below table: .. table:: Virtio Paths and Releases - ============================================ ============= ============= ============= - Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 - ============================================ ============= ============= ============= - Split virtqueue mergeable path Y Y Y - Split virtqueue non-mergeable path Y Y Y - Split virtqueue vectorized Rx path Y Y Y - Split virtqueue simple Tx path Y N N - Split virtqueue in-order mergeable path Y Y - Split virtqueue in-order non-mergeable path Y Y - Packed virtqueue mergeable path Y - Packed virtqueue non-mergeable path Y - Packed virtqueue in-order mergeable path Y - Packed virtqueue in-order non-mergeable path Y - ============================================ ============= ============= ============= + ============================================ ============= ============= ============= ======= + Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~ + ============================================ ============= ============= ============= ======= + Split virtqueue mergeable path Y Y Y Y + Split virtqueue non-mergeable path Y Y Y Y + Split virtqueue vectorized Rx path Y Y Y Y + Split virtqueue simple Tx path Y N N N + Split virtqueue in-order mergeable path Y Y Y + Split virtqueue in-order non-mergeable path Y Y Y + Packed virtqueue mergeable path Y Y + Packed virtqueue non-mergeable path Y Y + Packed virtqueue in-order mergeable path Y Y + Packed virtqueue in-order non-mergeable path Y Y + Packed virtqueue vectorized Rx path Y + Packed virtqueue vectorized Tx path Y + ============================================ ============= ============= ============= ======= QEMU Support Status ~~~~~~~~~~~~~~~~~~~ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
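As a usage illustration of the devargs documented above (the socket path and core list are placeholders, not values from the patch), a virtio-user port with the packed vectorized path enabled could be started in testpmd like this:

    ./testpmd -l 0-1 --no-pci \
        --vdev=net_virtio_user0,path=/tmp/vhost-user0.sock,packed_vq=1,packed_vec=1 \
        -- -i

In the v4 revision later in this thread the option is renamed to vectorized=<0|1> and also covers the split ring, so the same invocation would use vectorized=1 there.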
* [dpdk-dev] [PATCH v4 0/8] add packed ring vectorized datapath 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu ` (8 preceding siblings ...) 2020-04-08 8:53 ` [dpdk-dev] [PATCH v3 0/7] add packed ring " Marvin Liu @ 2020-04-15 16:47 ` Marvin Liu 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 1/8] net/virtio: enable " Marvin Liu ` (7 more replies) 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu ` (7 subsequent siblings) 17 siblings, 8 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu This patch set introduced vectorized datapath for packed ring. The size of packed ring descriptor is 16Bytes. Four batched descriptors are just placed into one cacheline. AVX512 instructions can well handle this kind of data. Packed ring TX datapath can fully transformed into vectorized datapath. Rx datapath also can be vectorized when features limiated(LRO and mergable disabled). User can specify whether disable vectorized packed ring datapath by 'packed_vec' parameter of virtio user vdev. v4: 1. rename 'packed_vec' to 'vectorized', also used in split ring 2. add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev 3. check required AVX512 extensions cpuflags 4. combine split and packed ring datapath selection logic 5. remove limitation that size must power of two 6. clear 12Bytes virtio_net_hdr v3: 1. Remove virtio_net_hdr array for better performance 2. disable 'packed_vec' by default v2: 1. more function blocks replaced by vector instructions 2. clean virtio_net_hdr by vector instruction 3. allow header room size change 4. add 'packed_vec' option in virtio_user vdev 5. fix build not check whether AVX512 enabled 6. doc update Marvin Liu (8): net/virtio: enable vectorized datapath net/virtio-user: add vectorized datapath parameter net/virtio: add vectorized packed ring Rx function net/virtio: reuse packed ring xmit functions net/virtio: add vectorized packed ring Tx datapath eal/x86: identify AVX512 extensions flag net/virtio: add election for vectorized datapath doc: add packed vectorized datapath config/common_base | 1 + .../nics/features/virtio-packed_vec.ini | 22 + .../{virtio_vec.ini => virtio-split_vec.ini} | 2 +- doc/guides/nics/virtio.rst | 44 +- drivers/net/virtio/Makefile | 36 + drivers/net/virtio/meson.build | 13 + drivers/net/virtio/virtio_ethdev.c | 95 ++- drivers/net/virtio/virtio_ethdev.h | 6 + drivers/net/virtio/virtio_pci.h | 3 +- drivers/net/virtio/virtio_rxtx.c | 182 +---- drivers/net/virtio/virtio_rxtx_packed_avx.c | 637 ++++++++++++++++++ drivers/net/virtio/virtio_user_ethdev.c | 36 +- drivers/net/virtio/virtqueue.c | 6 +- drivers/net/virtio/virtqueue.h | 163 ++++- lib/librte_eal/common/arch/x86/rte_cpuflags.c | 3 + .../common/include/arch/x86/rte_cpuflags.h | 3 + 16 files changed, 1040 insertions(+), 212 deletions(-) create mode 100644 doc/guides/nics/features/virtio-packed_vec.ini rename doc/guides/nics/features/{virtio_vec.ini => virtio-split_vec.ini} (88%) create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
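One of the v4 changes listed above, "check required AVX512 extensions cpuflags", amounts to a runtime gate on top of the build-time CC_AVX512_SUPPORT define. A minimal sketch of such a check is below; RTE_CPUFLAG_AVX512F already exists in rte_cpuflags, while the BW/VL identifiers are assumed here to be among those the series' "eal/x86: identify AVX512 extensions flag" patch introduces.

    #include <stdbool.h>
    #include <rte_cpuflags.h>

    static bool
    cpu_supports_packed_vec(void)
    {
        return rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
               rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW) &&
               rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL);
    }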
* [dpdk-dev] [PATCH v4 1/8] net/virtio: enable vectorized datapath 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu @ 2020-04-15 16:47 ` Marvin Liu 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 2/8] net/virtio-user: add vectorized datapath parameter Marvin Liu ` (6 subsequent siblings) 7 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Previously, virtio split ring vectorized datapath is enabled as default. This is not suitable for everyone as that datapath not follow virtio spec. Add specific config for virtio vectorized datapath selection. This config will be also used for virtio packed ring. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/config/common_base b/config/common_base index 7ca2f28b1..afeda85b0 100644 --- a/config/common_base +++ b/config/common_base @@ -450,6 +450,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=y # # Compile virtio device emulation inside virtio PMD driver diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index efdcb0d93..9ef445bc9 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -29,6 +29,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y) ifeq ($(CONFIG_RTE_ARCH_X86),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y) @@ -36,6 +37,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif +endif ifeq ($(CONFIG_RTE_VIRTIO_USER),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
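Usage note: the option added above defaults to enabled, so out-of-the-box behaviour is unchanged. A build that must avoid all virtio vectorized paths (for example, to keep only the spec-conforming split-ring code mentioned in the commit message) would flip it in the make config before compiling:

    # config/common_base (or a local defconfig)
    CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=n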
* [dpdk-dev] [PATCH v4 2/8] net/virtio-user: add vectorized datapath parameter 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 1/8] net/virtio: enable " Marvin Liu @ 2020-04-15 16:47 ` Marvin Liu 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 3/8] net/virtio: add vectorized packed ring Rx function Marvin Liu ` (5 subsequent siblings) 7 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Add new parameter "vectorized" which can enable vectorized datapath explicitly. This parameter will work for both split ring and packed ring. When "vectorized" option is on, driver will check both compiling environment and running enviornment. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index f9d0ea70d..19a36ad82 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1547,7 +1547,7 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed; } } else { - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u", eth_dev->data->port_id); eth_dev->rx_pkt_burst = virtio_recv_pkts_vec; @@ -2157,33 +2157,31 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - hw->use_simple_rx = 1; - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { hw->use_inorder_tx = 1; hw->use_inorder_rx = 1; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (vtpci_packed_queue(hw)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; hw->use_inorder_rx = 0; } #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } #endif if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | DEV_RX_OFFLOAD_TCP_CKSUM | DEV_RX_OFFLOAD_TCP_LRO | DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; return 0; } diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h index 7433d2f08..36afed313 100644 --- a/drivers/net/virtio/virtio_pci.h +++ b/drivers/net/virtio/virtio_pci.h @@ -250,7 +250,8 @@ struct virtio_hw { uint8_t vlan_strip; uint8_t use_msix; uint8_t modern; - uint8_t use_simple_rx; + uint8_t use_vec_rx; + uint8_t use_vec_tx; uint8_t use_inorder_rx; uint8_t use_inorder_tx; uint8_t weak_barriers; diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 3a2dbc2e0..285af1d47 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -995,7 +995,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) /* Allocate blank mbufs for the each rx descriptor */ nbufs = 0; - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { for (desc_idx = 0; desc_idx < vq->vq_nentries; desc_idx++) { vq->vq_split.ring.avail->ring[desc_idx] = desc_idx; @@ -1013,7 +1013,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) &rxvq->fake_mbuf; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 
e61af4068..ca7797cfa 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -450,6 +450,8 @@ static const char *valid_args[] = { VIRTIO_USER_ARG_IN_ORDER, #define VIRTIO_USER_ARG_PACKED_VQ "packed_vq" VIRTIO_USER_ARG_PACKED_VQ, +#define VIRTIO_USER_ARG_VECTORIZED "vectorized" + VIRTIO_USER_ARG_VECTORIZED, NULL }; @@ -518,7 +520,8 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev) */ hw->use_msix = 1; hw->modern = 0; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; hw->use_inorder_rx = 0; hw->use_inorder_tx = 0; hw->virtio_user_dev = dev; @@ -552,6 +555,8 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) uint64_t mrg_rxbuf = 1; uint64_t in_order = 1; uint64_t packed_vq = 0; + uint64_t vectorized = 0; + char *path = NULL; char *ifname = NULL; char *mac_addr = NULL; @@ -668,6 +673,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } } +#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR + if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) { + if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED, + &get_integer_arg, &vectorized) < 0) { + PMD_INIT_LOG(ERR, "error to parse %s", + VIRTIO_USER_ARG_VECTORIZED); + goto end; + } + } +#endif + if (queues > 1 && cq == 0) { PMD_INIT_LOG(ERR, "multi-q requires ctrl-q"); goto end; @@ -705,6 +721,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } hw = eth_dev->data->dev_private; + if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq, queue_size, mac_addr, &ifname, server_mode, mrg_rxbuf, in_order, packed_vq) < 0) { @@ -720,6 +737,20 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) goto end; } + if (vectorized) { + if (packed_vq) { +#if defined(CC_AVX512_SUPPORT) + hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#else + PMD_INIT_LOG(INFO, + "building environment do not match packed ring vectorized requirement"); +#endif + } else { + hw->use_vec_rx = 1; + } + } + rte_eth_dev_probing_finish(eth_dev); ret = 0; @@ -777,4 +808,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, "server=<0|1> " "mrg_rxbuf=<0|1> " "in_order=<0|1> " - "packed_vq=<0|1>"); + "packed_vq=<0|1>" + "vectorized=<0|1>"); diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c index 0b4e3bf3e..349ff0c9d 100644 --- a/drivers/net/virtio/virtqueue.c +++ b/drivers/net/virtio/virtqueue.c @@ -32,7 +32,7 @@ virtqueue_detach_unused(struct virtqueue *vq) end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1); for (idx = 0; idx < vq->vq_nentries; idx++) { - if (hw->use_simple_rx && type == VTNET_RQ) { + if (hw->use_vec_rx && type == VTNET_RQ) { if (start <= end && idx >= start && idx < end) continue; if (start > end && (idx >= start || idx < end)) @@ -97,7 +97,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) for (i = 0; i < nb_used; i++) { used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1); uep = &vq->vq_split.ring.used->ring[used_idx]; - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { desc_idx = used_idx; rte_pktmbuf_free(vq->sw_ring[desc_idx]); vq->vq_free_cnt++; @@ -121,7 +121,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) vq->vq_used_cons_idx++; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxq); if (virtqueue_kick_prepare(vq)) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
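The new devarg is consumed through rte_kvargs_process() with the driver's existing get_integer_arg() helper, and a virtio-user port would be created with something like --vdev 'net_virtio_user0,path=/dev/vhost-net,packed_vq=1,vectorized=1'. Below is a minimal, hypothetical handler with the same rte_kvargs callback shape, shown standalone for clarity; it is not the driver's implementation and the 0/1 range check is an assumption.

#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical handler with the rte_kvargs callback signature
 * (int (*)(const char *key, const char *value, void *opaque)), in the
 * spirit of the driver's get_integer_arg(). rte_kvargs_process() would
 * invoke it once with the value string of "vectorized=<0|1>".
 */
static int
parse_vectorized_arg(const char *key, const char *value, void *extra_args)
{
	uint64_t *flag = extra_args;

	(void)key;
	if (value == NULL || extra_args == NULL)
		return -1;

	*flag = strtoull(value, NULL, 0);
	return *flag > 1 ? -1 : 0;	/* accept only 0 or 1 */
}

int main(void)
{
	uint64_t vectorized = 0;

	parse_vectorized_arg("vectorized", "1", &vectorized);
	return vectorized == 1 ? 0 : 1;
}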
* [dpdk-dev] [PATCH v4 3/8] net/virtio: add vectorized packed ring Rx function 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 1/8] net/virtio: enable " Marvin Liu 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 2/8] net/virtio-user: add vectorized datapath parameter Marvin Liu @ 2020-04-15 16:47 ` Marvin Liu 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 4/8] net/virtio: reuse packed ring xmit functions Marvin Liu ` (4 subsequent siblings) 7 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Optimize packed ring Rx datapath when AVX512 enabled and mergeable buffer/Rx LRO offloading are not required. Solution of optimization is pretty like vhost, is that split datapath into batch and single functions. Batch function is further optimized by vector instructions. Also pad desc extra structure to 16 bytes aligned, thus four elements will be saved in one batch. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index 9ef445bc9..4d20cb61a 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -37,6 +37,40 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif + +ifneq ($(FORCE_DISABLE_AVX512), y) + CC_AVX512_SUPPORT=\ + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ + grep -q AVX512 && echo 1) +endif + +ifeq ($(CC_AVX512_SUPPORT), 1) +CFLAGS += -DCC_AVX512_SUPPORT +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c + +ifeq ($(RTE_TOOLCHAIN), gcc) +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), clang) +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1) +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), icc) +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA +endif +endif + +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds +endif +endif endif ifeq ($(CONFIG_RTE_VIRTIO_USER),y) diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index 04c7fdf25..00f84282c 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -11,6 +11,19 @@ deps += ['kvargs', 'bus_pci'] if arch_subdir == 'x86' sources += files('virtio_rxtx_simple_sse.c') + if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F') + if '-mno-avx512f' not in machine_args and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw') + cflags += ['-DCC_AVX512_SUPPORT'] + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' + endif + sources += files('virtio_rxtx_packed_avx.c') + endif + endif elif arch_subdir == 'ppc_64' sources += files('virtio_rxtx_simple_altivec.c') elif arch_subdir == 'arm' and 
host_machine.cpu_family().startswith('aarch64') diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index cd8947656..10e39670e 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -104,6 +104,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 285af1d47..965ce3dab 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -1245,7 +1245,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) return 0; } -#define VIRTIO_MBUF_BURST_SZ 64 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc)) uint16_t virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) @@ -2328,3 +2327,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, return nb_tx; } + +__rte_weak uint16_t +virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue, + struct rte_mbuf __rte_unused **rx_pkts, + uint16_t __rte_unused nb_pkts) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c new file mode 100644 index 000000000..f2976b98f --- /dev/null +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -0,0 +1,358 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> + +#include <rte_net.h> + +#include "virtio_logs.h" +#include "virtio_ethdev.h" +#include "virtio_pci.h" +#include "virtqueue.h" + +#define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63) + +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ + sizeof(struct vring_packed_desc)) +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) + +#ifdef VIRTIO_GCC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_ICC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ + for (iter = val; iter < size; iter++) +#endif + +#ifndef virtio_for_each_try_unroll +#define virtio_for_each_try_unroll(iter, val, num) \ + for (iter = val; iter < num; iter++) +#endif + + +static inline void +virtio_update_batch_stats(struct virtnet_stats *stats, + uint16_t pkt_len1, + uint16_t pkt_len2, + uint16_t pkt_len3, + uint16_t pkt_len4) +{ + stats->bytes += pkt_len1; + stats->bytes += pkt_len2; + stats->bytes += pkt_len3; + stats->bytes += pkt_len4; +} +/* Optionally fill offload information in structure */ +static inline int +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) +{ + struct rte_net_hdr_lens hdr_lens; + uint32_t hdrlen, ptype; + int l4_supported = 0; + + /* nothing to do */ + if (hdr->flags == 0) + return 0; + + /* GSO not support in vec path, skip check */ + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; + + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); + m->packet_type = ptype; + if ((ptype & RTE_PTYPE_L4_MASK) 
== RTE_PTYPE_L4_TCP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) + l4_supported = 1; + + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; + if (hdr->csum_start <= hdrlen && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; + } else { + /* Unknown proto or tunnel, do sw cksum. We can assume + * the cksum field is in the first segment since the + * buffers we provided to the host are large enough. + * In case of SCTP, this will be wrong since it's a CRC + * but there's nothing we can do. + */ + uint16_t csum = 0, off; + + rte_raw_cksum_mbuf(m, hdr->csum_start, + rte_pktmbuf_pkt_len(m) - hdr->csum_start, + &csum); + if (likely(csum != 0xffff)) + csum = ~csum; + off = hdr->csum_offset + hdr->csum_start; + if (rte_pktmbuf_data_len(m) >= off + 1) + *rte_pktmbuf_mtod_offset(m, uint16_t *, + off) = csum; + } + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; + } + + return 0; +} + +static uint16_t +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint64_t addrs[PACKED_BATCH_SIZE << 1]; + uint16_t id = vq->vq_used_cons_idx; + uint8_t desc_stats; + uint16_t i; + void *desc_addr; + + if (id & PACKED_BATCH_MASK) + return -1; + + /* only care avail/used bits */ + __m512i desc_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + desc_addr = &vq->vq_packed.ring.desc[id]; + + rte_smp_rmb(); + __m512i packed_desc = _mm512_loadu_si512(desc_addr); + __m512i flags_mask = _mm512_maskz_and_epi64(0xff, packed_desc, + desc_flags); + + __m512i used_flags; + if (vq->vq_packed.used_wrap_counter) + used_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + else + used_flags = _mm512_setzero_si512(); + + /* Check all descs are used */ + desc_stats = _mm512_cmp_epu64_mask(flags_mask, used_flags, + _MM_CMPINT_EQ); + if (desc_stats != 0xff) + return -1; + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); + + addrs[i << 1] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1; + addrs[(i << 1) + 1] = + (uint64_t)rx_pkts[i]->rx_descriptor_fields1 + 8; + } + + /* addresses of pkt_len and data_len */ + __m512i vindex = _mm512_loadu_si512((void *)addrs); + + /* + * select 10b*4 load 32bit from packed_desc[95:64] + * mmask 0110b*4 save 32bit into pkt_len and data_len + */ + __m512i value = _mm512_maskz_shuffle_epi32(0x6666, packed_desc, 0xAA); + + /* mmask 0110b*4 reduce hdr_len from pkt_len and data_len */ + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(0x6666, + (uint32_t)-hdr_size); + + value = _mm512_add_epi32(value, mbuf_len_offset); + /* batch store into mbufs */ + _mm512_i64scatter_epi64(0, vindex, value, 1); + + if (hw->has_rx_offload) { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + char *addr = (char *)rx_pkts[i]->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size; + virtio_vec_rx_offload(rx_pkts[i], + (struct virtio_net_hdr *)addr); + } + } + + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, + rx_pkts[3]->pkt_len); + + vq->vq_free_cnt += PACKED_BATCH_SIZE; + + vq->vq_used_cons_idx += PACKED_BATCH_SIZE; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= 
vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static uint16_t +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + uint16_t used_idx, id; + uint32_t len; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint32_t hdr_size = hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + struct vring_packed_desc *desc; + struct rte_mbuf *cookie; + + desc = vq->vq_packed.ring.desc; + used_idx = vq->vq_used_cons_idx; + if (!desc_is_used(&desc[used_idx], vq)) + return -1; + + len = desc[used_idx].len; + id = desc[used_idx].id; + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; + if (unlikely(cookie == NULL)) { + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u", + vq->vq_used_cons_idx); + return -1; + } + rte_prefetch0(cookie); + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); + + cookie->data_off = RTE_PKTMBUF_HEADROOM; + cookie->ol_flags = 0; + cookie->pkt_len = (uint32_t)(len - hdr_size); + cookie->data_len = (uint32_t)(len - hdr_size); + + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size); + if (hw->has_rx_offload) + virtio_vec_rx_offload(cookie, hdr); + + *rx_pkts = cookie; + + rxvq->stats.bytes += cookie->pkt_len; + + vq->vq_free_cnt++; + vq->vq_used_cons_idx++; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static inline void +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **cookie, + uint16_t num) +{ + struct virtqueue *vq = rxvq->vq; + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; + uint16_t flags = vq->vq_packed.cached_flags; + struct virtio_hw *hw = vq->hw; + struct vq_desc_extra *dxp; + uint16_t idx, i; + uint16_t total_num = 0; + uint16_t head_idx = vq->vq_avail_idx; + uint16_t head_flag = vq->vq_packed.cached_flags; + uint64_t addr; + + do { + idx = vq->vq_avail_idx; + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + dxp = &vq->vq_descx[idx + i]; + dxp->cookie = (void *)cookie[total_num + i]; + + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) + + RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size; + start_dp[idx + i].addr = addr; + start_dp[idx + i].len = cookie[total_num + i]->buf_len + - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size; + if (total_num || i) { + virtqueue_store_flags_packed(&start_dp[idx + i], + flags, hw->weak_barriers); + } + } + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + flags = vq->vq_packed.cached_flags; + } + total_num += PACKED_BATCH_SIZE; + } while (total_num < num); + + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, + hw->weak_barriers); + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); +} + +uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue, + struct rte_mbuf **rx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_rx *rxvq = rx_queue; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t num, nb_rx = 0; + uint32_t nb_enqueued = 0; + uint16_t free_cnt = vq->vq_free_thresh; + + if (unlikely(hw->started == 0)) + return nb_rx; + + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); + if (likely(num > PACKED_BATCH_SIZE)) + num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE); + + while (num) { + if (!virtqueue_dequeue_batch_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx 
+= PACKED_BATCH_SIZE; + num -= PACKED_BATCH_SIZE; + continue; + } + if (!virtqueue_dequeue_single_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx++; + num--; + continue; + } + break; + }; + + PMD_RX_LOG(DEBUG, "dequeue:%d", num); + + rxvq->stats.packets += nb_rx; + + if (likely(vq->vq_free_cnt >= free_cnt)) { + struct rte_mbuf *new_pkts[free_cnt]; + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, + free_cnt) == 0)) { + virtio_recv_refill_packed_vec(rxvq, new_pkts, + free_cnt); + nb_enqueued += free_cnt; + } else { + struct rte_eth_dev *dev = + &rte_eth_devices[rxvq->port_id]; + dev->data->rx_mbuf_alloc_failed += free_cnt; + } + } + + if (likely(nb_enqueued)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_RX_LOG(DEBUG, "Notified"); + } + } + + return nb_rx; +} diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 6301c56b2..43e305ecc 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -20,6 +20,7 @@ struct rte_mbuf; #define DEFAULT_RX_FREE_THRESH 32 +#define VIRTIO_MBUF_BURST_SZ 64 /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO @@ -236,7 +237,8 @@ struct vq_desc_extra { void *cookie; uint16_t ndescs; uint16_t next; -}; + uint8_t padding[4]; +} __rte_packed __rte_aligned(16); struct virtqueue { struct virtio_hw *hw; /**< virtio_hw structure pointer. */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
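The core trick of the batch Rx path is testing the AVAIL/USED bits of four descriptors with one 512-bit load, mask and compare. The standalone program below (compile with -mavx512f, run on an AVX512-capable host) reproduces just that check outside the driver; the descriptor contents are made up, and the wrap-counter bookkeeping, stats and refill logic are omitted.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

struct vring_packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

#define VRING_PACKED_DESC_F_AVAIL	(1 << 7)
#define VRING_PACKED_DESC_F_USED	(1 << 15)
/* AVAIL/USED flag bits within the high 64-bit lane of each descriptor */
#define PACKED_FLAGS_MASK		(1ULL << 55 | 1ULL << 63)

int main(void)
{
	struct vring_packed_desc descs[4];
	int i;

	for (i = 0; i < 4; i++) {
		descs[i].addr = 0;
		descs[i].len = 64;
		descs[i].id = i;
		/* used wrap counter == 1: device sets both AVAIL and USED */
		descs[i].flags = VRING_PACKED_DESC_F_AVAIL |
				 VRING_PACKED_DESC_F_USED;
	}

	/* keep only the flag bits of the odd (len/id/flags) lanes */
	__m512i desc_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
	__m512i packed_desc = _mm512_loadu_si512(descs);
	__m512i flags_mask = _mm512_maskz_and_epi64(0xff, packed_desc,
						    desc_flags);
	/* expected value when used_wrap_counter == 1 */
	__m512i used_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);

	uint8_t ok = _mm512_cmp_epu64_mask(flags_mask, used_flags,
					   _MM_CMPINT_EQ);
	printf("batch of 4 ready: %s (mask=0x%x)\n",
	       ok == 0xff ? "yes" : "no", ok);
	return 0;
}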
* [dpdk-dev] [PATCH v4 4/8] net/virtio: reuse packed ring xmit functions 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu ` (2 preceding siblings ...) 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 3/8] net/virtio: add vectorized packed ring Rx function Marvin Liu @ 2020-04-15 16:47 ` Marvin Liu 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 5/8] net/virtio: add vectorized packed ring Tx datapath Marvin Liu ` (3 subsequent siblings) 7 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Move xmit offload and packed ring xmit enqueue function to header file. These functions will be reused by packed ring vectorized Tx function. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 965ce3dab..1d8135f4f 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq, return i; } -#ifndef DEFAULT_TX_FREE_THRESH -#define DEFAULT_TX_FREE_THRESH 32 -#endif - static void virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num) { @@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m) } -/* avoid write operation when necessary, to lessen cache issues */ -#define ASSIGN_UNLESS_EQUAL(var, val) do { \ - if ((var) != (val)) \ - (var) = (val); \ -} while (0) - -#define virtqueue_clear_net_hdr(_hdr) do { \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ -} while (0) - -static inline void -virtqueue_xmit_offload(struct virtio_net_hdr *hdr, - struct rte_mbuf *cookie, - bool offload) -{ - if (offload) { - if (cookie->ol_flags & PKT_TX_TCP_SEG) - cookie->ol_flags |= PKT_TX_TCP_CKSUM; - - switch (cookie->ol_flags & PKT_TX_L4_MASK) { - case PKT_TX_UDP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_udp_hdr, - dgram_cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - case PKT_TX_TCP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - default: - ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); - ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); - ASSIGN_UNLESS_EQUAL(hdr->flags, 0); - break; - } - /* TCP Segmentation Offload */ - if (cookie->ol_flags & PKT_TX_TCP_SEG) { - hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
- VIRTIO_NET_HDR_GSO_TCPV6 : - VIRTIO_NET_HDR_GSO_TCPV4; - hdr->gso_size = cookie->tso_segsz; - hdr->hdr_len = - cookie->l2_len + - cookie->l3_len + - cookie->l4_len; - } else { - ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); - ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); - ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); - } - } -} static inline void virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq, @@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq, virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers); } -static inline void -virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, - uint16_t needed, int can_push, int in_order) -{ - struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; - struct vq_desc_extra *dxp; - struct virtqueue *vq = txvq->vq; - struct vring_packed_desc *start_dp, *head_dp; - uint16_t idx, id, head_idx, head_flags; - int16_t head_size = vq->hw->vtnet_hdr_size; - struct virtio_net_hdr *hdr; - uint16_t prev; - bool prepend_header = false; - - id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; - - dxp = &vq->vq_descx[id]; - dxp->ndescs = needed; - dxp->cookie = cookie; - - head_idx = vq->vq_avail_idx; - idx = head_idx; - prev = head_idx; - start_dp = vq->vq_packed.ring.desc; - - head_dp = &vq->vq_packed.ring.desc[idx]; - head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; - head_flags |= vq->vq_packed.cached_flags; - - if (can_push) { - /* prepend cannot fail, checked by caller */ - hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, - -head_size); - prepend_header = true; - - /* if offload disabled, it is not zeroed below, do it now */ - if (!vq->hw->has_tx_offload) - virtqueue_clear_net_hdr(hdr); - } else { - /* setup first tx ring slot to point to header - * stored in reserved region. - */ - start_dp[idx].addr = txvq->virtio_net_hdr_mem + - RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); - start_dp[idx].len = vq->hw->vtnet_hdr_size; - hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } - - virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); - - do { - uint16_t flags; - - start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); - start_dp[idx].len = cookie->data_len; - if (prepend_header) { - start_dp[idx].addr -= head_size; - start_dp[idx].len += head_size; - prepend_header = false; - } - - if (likely(idx != head_idx)) { - flags = cookie->next ? 
VRING_DESC_F_NEXT : 0; - flags |= vq->vq_packed.cached_flags; - start_dp[idx].flags = flags; - } - prev = idx; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } while ((cookie = cookie->next) != NULL); - - start_dp[prev].id = id; - - vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); - vq->vq_avail_idx = idx; - - if (!in_order) { - vq->vq_desc_head_idx = dxp->next; - if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) - vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; - } - - virtqueue_store_flags_packed(head_dp, head_flags, - vq->hw->weak_barriers); -} - static inline void virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie, uint16_t needed, int use_indirect, int can_push, diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 43e305ecc..31c48710c 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,7 @@ struct rte_mbuf; +#define DEFAULT_TX_FREE_THRESH 32 #define DEFAULT_RX_FREE_THRESH 32 #define VIRTIO_MBUF_BURST_SZ 64 @@ -562,4 +563,162 @@ virtqueue_notify(struct virtqueue *vq) #define VIRTQUEUE_DUMP(vq) do { } while (0) #endif +/* avoid write operation when necessary, to lessen cache issues */ +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ + if ((var) != (val)) \ + (var) = (val); \ +} while (0) + +#define virtqueue_clear_net_hdr(_hdr) do { \ + ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ + ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ +} while (0) + +static inline void +virtqueue_xmit_offload(struct virtio_net_hdr *hdr, + struct rte_mbuf *cookie, + bool offload) +{ + if (offload) { + if (cookie->ol_flags & PKT_TX_TCP_SEG) + cookie->ol_flags |= PKT_TX_TCP_CKSUM; + + switch (cookie->ol_flags & PKT_TX_L4_MASK) { + case PKT_TX_UDP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_udp_hdr, + dgram_cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + case PKT_TX_TCP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + default: + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); + break; + } + + /* TCP Segmentation Offload */ + if (cookie->ol_flags & PKT_TX_TCP_SEG) { + hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? + VIRTIO_NET_HDR_GSO_TCPV6 : + VIRTIO_NET_HDR_GSO_TCPV4; + hdr->gso_size = cookie->tso_segsz; + hdr->hdr_len = + cookie->l2_len + + cookie->l3_len + + cookie->l4_len; + } else { + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); + } + } +} + +static inline void +virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, + uint16_t needed, int can_push, int in_order) +{ + struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; + struct vq_desc_extra *dxp; + struct virtqueue *vq = txvq->vq; + struct vring_packed_desc *start_dp, *head_dp; + uint16_t idx, id, head_idx, head_flags; + int16_t head_size = vq->hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + uint16_t prev; + bool prepend_header = false; + + id = in_order ? 
vq->vq_avail_idx : vq->vq_desc_head_idx; + + dxp = &vq->vq_descx[id]; + dxp->ndescs = needed; + dxp->cookie = cookie; + + head_idx = vq->vq_avail_idx; + idx = head_idx; + prev = head_idx; + start_dp = vq->vq_packed.ring.desc; + + head_dp = &vq->vq_packed.ring.desc[idx]; + head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; + head_flags |= vq->vq_packed.cached_flags; + + if (can_push) { + /* prepend cannot fail, checked by caller */ + hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, + -head_size); + prepend_header = true; + + /* if offload disabled, it is not zeroed below, do it now */ + if (!vq->hw->has_tx_offload) + virtqueue_clear_net_hdr(hdr); + } else { + /* setup first tx ring slot to point to header + * stored in reserved region. + */ + start_dp[idx].addr = txvq->virtio_net_hdr_mem + + RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); + start_dp[idx].len = vq->hw->vtnet_hdr_size; + hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } + + virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); + + do { + uint16_t flags; + + start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); + start_dp[idx].len = cookie->data_len; + if (prepend_header) { + start_dp[idx].addr -= head_size; + start_dp[idx].len += head_size; + prepend_header = false; + } + + if (likely(idx != head_idx)) { + flags = cookie->next ? VRING_DESC_F_NEXT : 0; + flags |= vq->vq_packed.cached_flags; + start_dp[idx].flags = flags; + } + prev = idx; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } while ((cookie = cookie->next) != NULL); + + start_dp[prev].id = id; + + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); + vq->vq_avail_idx = idx; + + if (!in_order) { + vq->vq_desc_head_idx = dxp->next; + if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) + vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; + } + + virtqueue_store_flags_packed(head_dp, head_flags, + vq->hw->weak_barriers); +} #endif /* _VIRTQUEUE_H_ */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
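One detail worth noting in the helpers moved to virtqueue.h is the ASSIGN_UNLESS_EQUAL() macro: the virtio-net header sits in memory shared with the host, so skipping stores when a field already holds the target value avoids dirtying cachelines for no reason. A minimal standalone illustration of the macro's behaviour, using the macro verbatim from the patch:

#include <stdio.h>

#define ASSIGN_UNLESS_EQUAL(var, val) do { \
	if ((var) != (val)) \
		(var) = (val); \
} while (0)

int main(void)
{
	unsigned short csum_start = 0;

	ASSIGN_UNLESS_EQUAL(csum_start, 0);	/* already 0: no write */
	ASSIGN_UNLESS_EQUAL(csum_start, 14);	/* differs: write happens */
	printf("csum_start=%u\n", csum_start);
	return 0;
}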
* [dpdk-dev] [PATCH v4 5/8] net/virtio: add vectorized packed ring Tx datapath 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu ` (3 preceding siblings ...) 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 4/8] net/virtio: reuse packed ring xmit functions Marvin Liu @ 2020-04-15 16:47 ` Marvin Liu 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 6/8] eal/x86: identify AVX512 extensions flag Marvin Liu ` (2 subsequent siblings) 7 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Optimize packed ring Tx datapath alike Rx datapath. Split Tx datapath into batch and single Tx functions. Batch function further optimized by vector instructions. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index 10e39670e..c9aaef0af 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -107,6 +107,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 1d8135f4f..58c7778f4 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2174,3 +2174,11 @@ virtio_recv_pkts_packed_vec(void __rte_unused *rx_queue, { return 0; } + +__rte_weak uint16_t +virtio_xmit_pkts_packed_vec(void __rte_unused *tx_queue, + struct rte_mbuf __rte_unused **tx_pkts, + uint16_t __rte_unused nb_pkts) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c index f2976b98f..732256c86 100644 --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -15,6 +15,21 @@ #include "virtio_pci.h" #include "virtqueue.h" +/* reference count offset in mbuf rearm data */ +#define REF_CNT_OFFSET 16 +/* segment number offset in mbuf rearm data */ +#define SEG_NUM_OFFSET 32 + +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_OFFSET | \ + 1ULL << REF_CNT_OFFSET) +/* id offset in packed ring desc higher 64bits */ +#define ID_OFFSET 32 +/* flag offset in packed ring desc higher 64bits */ +#define FLAG_OFFSET 48 + +/* net hdr short size mask */ +#define NET_HDR_MASK 0x3F + #define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63) #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ @@ -41,6 +56,47 @@ for (iter = val; iter < num; iter++) #endif +static void +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq) +{ + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; + struct vq_desc_extra *dxp; + uint16_t used_idx, id, curr_id, free_cnt = 0; + uint16_t size = vq->vq_nentries; + struct rte_mbuf *mbufs[size]; + uint16_t nb_mbuf = 0, i; + + used_idx = vq->vq_used_cons_idx; + + if (!desc_is_used(&desc[used_idx], vq)) + return; + + id = desc[used_idx].id; + + do { + curr_id = used_idx; + dxp = &vq->vq_descx[used_idx]; + used_idx += dxp->ndescs; + free_cnt += dxp->ndescs; + + if (dxp->cookie != NULL) { + mbufs[nb_mbuf] = dxp->cookie; + dxp->cookie = NULL; + nb_mbuf++; + } + + if (used_idx >= size) { + used_idx -= size; + vq->vq_packed.used_wrap_counter ^= 1; + } + } while 
(curr_id != id); + + for (i = 0; i < nb_mbuf; i++) + rte_pktmbuf_free(mbufs[i]); + + vq->vq_used_cons_idx = used_idx; + vq->vq_free_cnt += free_cnt; +} static inline void virtio_update_batch_stats(struct virtnet_stats *stats, @@ -54,6 +110,229 @@ virtio_update_batch_stats(struct virtnet_stats *stats, stats->bytes += pkt_len3; stats->bytes += pkt_len4; } + +static inline int +virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf **tx_pkts) +{ + struct virtqueue *vq = txvq->vq; + uint16_t head_size = vq->hw->vtnet_hdr_size; + uint16_t idx = vq->vq_avail_idx; + struct virtio_net_hdr *hdr; + uint16_t i, cmp; + + if (vq->vq_avail_idx & PACKED_BATCH_MASK) + return -1; + + /* Load four mbufs rearm data */ + __m256i mbufs = _mm256_set_epi64x( + *tx_pkts[3]->rearm_data, + *tx_pkts[2]->rearm_data, + *tx_pkts[1]->rearm_data, + *tx_pkts[0]->rearm_data); + + /* refcnt=1 and nb_segs=1 */ + __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA); + __m256i head_rooms = _mm256_set1_epi16(head_size); + + /* Check refcnt and nb_segs */ + cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref); + if (cmp & 0x6666) + return -1; + + /* Check headroom is enough */ + cmp = _mm256_mask_cmp_epu16_mask(0x1111, mbufs, head_rooms, + _MM_CMPINT_LT); + if (unlikely(cmp)) + return -1; + + __m512i dxps = _mm512_set_epi64( + 0x1, (uint64_t)tx_pkts[3], + 0x1, (uint64_t)tx_pkts[2], + 0x1, (uint64_t)tx_pkts[1], + 0x1, (uint64_t)tx_pkts[0]); + + _mm512_storeu_si512((void *)&vq->vq_descx[idx], dxps); + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + tx_pkts[i]->data_off -= head_size; + tx_pkts[i]->data_len += head_size; + } + +#ifdef RTE_VIRTIO_USER + __m512i descs_base = _mm512_set_epi64( + tx_pkts[3]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])), + tx_pkts[2]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])), + tx_pkts[1]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])), + tx_pkts[0]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0]))); +#else + __m512i descs_base = _mm512_set_epi64( + tx_pkts[3]->data_len, tx_pkts[3]->buf_iova, + tx_pkts[2]->data_len, tx_pkts[2]->buf_iova, + tx_pkts[1]->data_len, tx_pkts[1]->buf_iova, + tx_pkts[0]->data_len, tx_pkts[0]->buf_iova); +#endif + + /* id offset and data offset */ + __m512i data_offsets = _mm512_set_epi64( + (uint64_t)3 << ID_OFFSET, tx_pkts[3]->data_off, + (uint64_t)2 << ID_OFFSET, tx_pkts[2]->data_off, + (uint64_t)1 << ID_OFFSET, tx_pkts[1]->data_off, + 0, tx_pkts[0]->data_off); + + __m512i new_descs = _mm512_add_epi64(descs_base, data_offsets); + + uint64_t flags_temp = (uint64_t)idx << ID_OFFSET | + (uint64_t)vq->vq_packed.cached_flags << FLAG_OFFSET; + + /* flags offset and guest virtual address offset */ +#ifdef RTE_VIRTIO_USER + __m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset); +#else + __m128i flag_offset = _mm_set_epi64x(flags_temp, 0); +#endif + __m512i flag_offsets = _mm512_broadcast_i32x4(flag_offset); + + __m512i descs = _mm512_add_epi64(new_descs, flag_offsets); + + if (!vq->hw->has_tx_offload) { + __m128i mask = _mm_set1_epi16(0xFFFF); + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + __m128i v_hdr = _mm_loadu_si128((void *)hdr); + if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK, + v_hdr, mask))) { + __m128i all_zero = _mm_setzero_si128(); + _mm_mask_storeu_epi16((void *)hdr, + NET_HDR_MASK, all_zero); + } + } + } else { + virtio_for_each_try_unroll(i, 0, 
PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + virtqueue_xmit_offload(hdr, tx_pkts[i], true); + } + } + + /* Enqueue Packet buffers */ + rte_smp_wmb(); + _mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], descs); + + virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len, + tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len, + tx_pkts[3]->pkt_len); + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + vq->vq_free_cnt -= PACKED_BATCH_SIZE; + + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + + return 0; +} + +static inline int +virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf *txm) +{ + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint16_t slots, can_push; + int16_t need; + + /* How many main ring entries are needed to this Tx? + * any_layout => number of segments + * default => number of segments + 1 + */ + can_push = rte_mbuf_refcnt_read(txm) == 1 && + RTE_MBUF_DIRECT(txm) && + txm->nb_segs == 1 && + rte_pktmbuf_headroom(txm) >= hdr_size; + + slots = txm->nb_segs + !can_push; + need = slots - vq->vq_free_cnt; + + /* Positive value indicates it need free vring descriptors */ + if (unlikely(need > 0)) { + virtio_xmit_cleanup_packed_vec(vq); + need = slots - vq->vq_free_cnt; + if (unlikely(need > 0)) { + PMD_TX_LOG(ERR, + "No free tx descriptors to transmit"); + return -1; + } + } + + /* Enqueue Packet buffers */ + virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1); + + txvq->stats.bytes += txm->pkt_len; + return 0; +} + +uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_tx *txvq = tx_queue; + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t nb_tx = 0; + uint16_t remained; + + if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) + return nb_tx; + + if (unlikely(nb_pkts < 1)) + return nb_pkts; + + PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts); + + if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh) + virtio_xmit_cleanup_packed_vec(vq); + + remained = RTE_MIN(nb_pkts, vq->vq_free_cnt); + + while (remained) { + if (remained >= PACKED_BATCH_SIZE) { + if (!virtqueue_enqueue_batch_packed_vec(txvq, + &tx_pkts[nb_tx])) { + nb_tx += PACKED_BATCH_SIZE; + remained -= PACKED_BATCH_SIZE; + continue; + } + } + if (!virtqueue_enqueue_single_packed_vec(txvq, + tx_pkts[nb_tx])) { + nb_tx++; + remained--; + continue; + } + break; + }; + + txvq->stats.packets += nb_tx; + + if (likely(nb_tx)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_TX_LOG(DEBUG, "Notified backend after xmit"); + } + } + + return nb_tx; +} + /* Optionally fill offload information in structure */ static inline int virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
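The batch Tx path composes all four 16-byte descriptors in zmm registers and commits them with a single 64-byte store. The sketch below (compile with -mavx512f) shows only that composition step with made-up addresses, lengths and flags; in the driver these values come from the mbufs and the ring's cached flags, and a write barrier precedes the store.

#include <immintrin.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define ID_OFFSET	32	/* id position in the high 64-bit lane */
#define FLAG_OFFSET	48	/* flags position in the high 64-bit lane */

struct vring_packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

int main(void)
{
	struct vring_packed_desc ring[4] __attribute__((aligned(64)));
	uint64_t addrs[4] = { 0x1000, 0x2000, 0x3000, 0x4000 };
	uint64_t lens[4] = { 64, 128, 256, 512 };
	uint64_t flags = 0x0080;	/* e.g. AVAIL bit for wrap == 1 */
	int i;

	/* high lane of each desc: len | id << 32 | flags << 48 */
	__m512i descs = _mm512_set_epi64(
		lens[3] | 3ULL << ID_OFFSET | flags << FLAG_OFFSET, addrs[3],
		lens[2] | 2ULL << ID_OFFSET | flags << FLAG_OFFSET, addrs[2],
		lens[1] | 1ULL << ID_OFFSET | flags << FLAG_OFFSET, addrs[1],
		lens[0] | 0ULL << ID_OFFSET | flags << FLAG_OFFSET, addrs[0]);

	_mm512_storeu_si512(ring, descs);	/* one 64B store, 4 descs */

	for (i = 0; i < 4; i++)
		printf("desc[%d]: addr=0x%" PRIx64 " len=%u id=%u flags=0x%x\n",
		       i, ring[i].addr, ring[i].len, ring[i].id,
		       ring[i].flags);
	return 0;
}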
* [dpdk-dev] [PATCH v4 6/8] eal/x86: identify AVX512 extensions flag 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu ` (4 preceding siblings ...) 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 5/8] net/virtio: add vectorized packed ring Tx datapath Marvin Liu @ 2020-04-15 16:47 ` Marvin Liu 2020-04-15 13:31 ` David Marchand 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 7/8] net/virtio: add election for vectorized datapath Marvin Liu 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 8/8] doc: add packed " Marvin Liu 7 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Read CPUID to check if AVX512 extensions are supported. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/lib/librte_eal/common/arch/x86/rte_cpuflags.c b/lib/librte_eal/common/arch/x86/rte_cpuflags.c index 6492df556..54e9f6185 100644 --- a/lib/librte_eal/common/arch/x86/rte_cpuflags.c +++ b/lib/librte_eal/common/arch/x86/rte_cpuflags.c @@ -109,6 +109,9 @@ const struct feature_entry rte_cpu_feature_table[] = { FEAT_DEF(RTM, 0x00000007, 0, RTE_REG_EBX, 11) FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16) FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18) + FEAT_DEF(AVX512CD, 0x00000007, 0, RTE_REG_EBX, 28) + FEAT_DEF(AVX512BW, 0x00000007, 0, RTE_REG_EBX, 30) + FEAT_DEF(AVX512VL, 0x00000007, 0, RTE_REG_EBX, 31) FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX, 0) FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX, 4) diff --git a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h index 25ba47b96..5bf99e05f 100644 --- a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h +++ b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h @@ -98,6 +98,9 @@ enum rte_cpu_flag_t { RTE_CPUFLAG_RTM, /**< Transactional memory */ RTE_CPUFLAG_AVX512F, /**< AVX512F */ RTE_CPUFLAG_RDSEED, /**< RDSEED instruction */ + RTE_CPUFLAG_AVX512CD, /**< AVX512CD */ + RTE_CPUFLAG_AVX512BW, /**< AVX512BW */ + RTE_CPUFLAG_AVX512VL, /**< AVX512VL */ /* (EAX 80000001h) ECX features */ RTE_CPUFLAG_LAHF_SAHF, /**< LAHF_SAHF */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
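For reference, the same CPUID leaf-7/sub-leaf-0 EBX bits can be checked outside DPDK with the GCC/clang cpuid.h helper; the bit positions below match the ones added by the patch (16, 28, 30, 31). This is a standalone check for experimentation only, not a replacement for rte_cpu_get_flag_enabled().

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
		printf("CPUID leaf 7 not supported\n");
		return 1;
	}
	printf("AVX512F:  %s\n", (ebx >> 16) & 1 ? "yes" : "no");
	printf("AVX512CD: %s\n", (ebx >> 28) & 1 ? "yes" : "no");
	printf("AVX512BW: %s\n", (ebx >> 30) & 1 ? "yes" : "no");
	printf("AVX512VL: %s\n", (ebx >> 31) & 1 ? "yes" : "no");
	return 0;
}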
* Re: [dpdk-dev] [PATCH v4 6/8] eal/x86: identify AVX512 extensions flag 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 6/8] eal/x86: identify AVX512 extensions flag Marvin Liu @ 2020-04-15 13:31 ` David Marchand 2020-04-15 14:57 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: David Marchand @ 2020-04-15 13:31 UTC (permalink / raw) To: Marvin Liu Cc: Maxime Coquelin, Xiaolong Ye, Zhihong Wang, Van Haaren Harry, dev, Kevin Laatz, Kinsella, Ray On Wed, Apr 15, 2020 at 11:14 AM Marvin Liu <yong.liu@intel.com> wrote: > > Read CPUID to check if AVX512 extensions are supported. > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > diff --git a/lib/librte_eal/common/arch/x86/rte_cpuflags.c b/lib/librte_eal/common/arch/x86/rte_cpuflags.c > index 6492df556..54e9f6185 100644 > --- a/lib/librte_eal/common/arch/x86/rte_cpuflags.c > +++ b/lib/librte_eal/common/arch/x86/rte_cpuflags.c > @@ -109,6 +109,9 @@ const struct feature_entry rte_cpu_feature_table[] = { > FEAT_DEF(RTM, 0x00000007, 0, RTE_REG_EBX, 11) > FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16) > FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18) > + FEAT_DEF(AVX512CD, 0x00000007, 0, RTE_REG_EBX, 28) > + FEAT_DEF(AVX512BW, 0x00000007, 0, RTE_REG_EBX, 30) > + FEAT_DEF(AVX512VL, 0x00000007, 0, RTE_REG_EBX, 31) > > FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX, 0) > FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX, 4) > diff --git a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h > index 25ba47b96..5bf99e05f 100644 > --- a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h > +++ b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h > @@ -98,6 +98,9 @@ enum rte_cpu_flag_t { > RTE_CPUFLAG_RTM, /**< Transactional memory */ > RTE_CPUFLAG_AVX512F, /**< AVX512F */ > RTE_CPUFLAG_RDSEED, /**< RDSEED instruction */ > + RTE_CPUFLAG_AVX512CD, /**< AVX512CD */ > + RTE_CPUFLAG_AVX512BW, /**< AVX512BW */ > + RTE_CPUFLAG_AVX512VL, /**< AVX512VL */ > > /* (EAX 80000001h) ECX features */ > RTE_CPUFLAG_LAHF_SAHF, /**< LAHF_SAHF */ This patch most likely breaks the ABI (renumbering flags after RTE_CPUFLAG_LAHF_SAHF). This change should not go through the virtio tree and is not rebased on master. A similar patch had been proposed by Kevin: http://patchwork.dpdk.org/patch/67438/ -- David Marchand ^ permalink raw reply [flat|nested] 162+ messages in thread
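The ABI concern here is that rte_cpu_flag_t values are part of the library's binary interface: inserting enumerators before RTE_CPUFLAG_LAHF_SAHF shifts every later value, so an application built against the old header queries the wrong flag at run time. A toy illustration of the renumbering effect (not DPDK's actual enum):

#include <stdio.h>

enum old_flags { OLD_AVX512F, OLD_RDSEED, OLD_LAHF_SAHF };
enum new_flags { NEW_AVX512F, NEW_RDSEED, NEW_AVX512CD, NEW_AVX512BW,
		 NEW_AVX512VL, NEW_LAHF_SAHF };

int main(void)
{
	/* A binary built with the old header still passes 2 meaning
	 * LAHF_SAHF, but the new table now has AVX512CD at 2. */
	printf("old LAHF_SAHF=%d, new LAHF_SAHF=%d, new AVX512CD=%d\n",
	       OLD_LAHF_SAHF, NEW_LAHF_SAHF, NEW_AVX512CD);
	return 0;
}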
* Re: [dpdk-dev] [PATCH v4 6/8] eal/x86: identify AVX512 extensions flag 2020-04-15 13:31 ` David Marchand @ 2020-04-15 14:57 ` Liu, Yong 0 siblings, 0 replies; 162+ messages in thread From: Liu, Yong @ 2020-04-15 14:57 UTC (permalink / raw) To: David Marchand Cc: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong, Van Haaren, Harry, dev, Laatz, Kevin, Kinsella, Ray Thanks for note, David. Kevin's patch can fully cover this one. > -----Original Message----- > From: David Marchand <david.marchand@redhat.com> > Sent: Wednesday, April 15, 2020 9:32 PM > To: Liu, Yong <yong.liu@intel.com> > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Ye, Xiaolong > <xiaolong.ye@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; Van > Haaren, Harry <harry.van.haaren@intel.com>; dev <dev@dpdk.org>; Laatz, > Kevin <kevin.laatz@intel.com>; Kinsella, Ray <ray.kinsella@intel.com> > Subject: Re: [dpdk-dev] [PATCH v4 6/8] eal/x86: identify AVX512 extensions > flag > > On Wed, Apr 15, 2020 at 11:14 AM Marvin Liu <yong.liu@intel.com> wrote: > > > > Read CPUID to check if AVX512 extensions are supported. > > > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > > > diff --git a/lib/librte_eal/common/arch/x86/rte_cpuflags.c > b/lib/librte_eal/common/arch/x86/rte_cpuflags.c > > index 6492df556..54e9f6185 100644 > > --- a/lib/librte_eal/common/arch/x86/rte_cpuflags.c > > +++ b/lib/librte_eal/common/arch/x86/rte_cpuflags.c > > @@ -109,6 +109,9 @@ const struct feature_entry rte_cpu_feature_table[] > = { > > FEAT_DEF(RTM, 0x00000007, 0, RTE_REG_EBX, 11) > > FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16) > > FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18) > > + FEAT_DEF(AVX512CD, 0x00000007, 0, RTE_REG_EBX, 28) > > + FEAT_DEF(AVX512BW, 0x00000007, 0, RTE_REG_EBX, 30) > > + FEAT_DEF(AVX512VL, 0x00000007, 0, RTE_REG_EBX, 31) > > > > FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX, 0) > > FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX, 4) > > diff --git a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h > b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h > > index 25ba47b96..5bf99e05f 100644 > > --- a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h > > +++ b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h > > @@ -98,6 +98,9 @@ enum rte_cpu_flag_t { > > RTE_CPUFLAG_RTM, /**< Transactional memory */ > > RTE_CPUFLAG_AVX512F, /**< AVX512F */ > > RTE_CPUFLAG_RDSEED, /**< RDSEED instruction */ > > + RTE_CPUFLAG_AVX512CD, /**< AVX512CD */ > > + RTE_CPUFLAG_AVX512BW, /**< AVX512BW */ > > + RTE_CPUFLAG_AVX512VL, /**< AVX512VL */ > > > > /* (EAX 80000001h) ECX features */ > > RTE_CPUFLAG_LAHF_SAHF, /**< LAHF_SAHF */ > > This patch most likely breaks the ABI (renumbering flags after > RTE_CPUFLAG_LAHF_SAHF). > This change should not go through the virtio tree and is not rebased on > master. > A similar patch had been proposed by Kevin: > http://patchwork.dpdk.org/patch/67438/ > > > -- > David Marchand ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v4 7/8] net/virtio: add election for vectorized datapath 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu ` (5 preceding siblings ...) 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 6/8] eal/x86: identify AVX512 extensions flag Marvin Liu @ 2020-04-15 16:47 ` Marvin Liu 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 8/8] doc: add packed " Marvin Liu 7 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Packed ring vectorized datapath will be selected when criterian matched. 1. vectorized option is enabled 2. AVX512F and required extensions are supported by compiler and host 3. virtio VERSION_1 and IN_ORDER features are negotiated 4. virtio mergeable feature is not negotiated 5. LRO offloading is disabled Split ring vectorized rx will be selected when criterian matched. 1. vectorized option is enabled 2. virtio mergeable and IN_ORDER features are not negotiated 3. LRO, chksum and vlan strip offloading are disabled Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 19a36ad82..a6ce3a0b0 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1518,9 +1518,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) if (vtpci_packed_queue(hw)) { PMD_INIT_LOG(INFO, "virtio: using packed ring %s Tx path on port %u", - hw->use_inorder_tx ? "inorder" : "standard", + hw->use_vec_tx ? "vectorized" : "standard", eth_dev->data->port_id); - eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; + if (hw->use_vec_tx) + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec; + else + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; } else { if (hw->use_inorder_tx) { PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u", @@ -1534,7 +1537,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) } if (vtpci_packed_queue(hw)) { - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + if (hw->use_vec_rx) { + PMD_INIT_LOG(INFO, + "virtio: using packed ring vectorized Rx path on port %u", + eth_dev->data->port_id); + eth_dev->rx_pkt_burst = + &virtio_recv_pkts_packed_vec; + } else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { PMD_INIT_LOG(INFO, "virtio: using packed ring mergeable buffer Rx path on port %u", eth_dev->data->port_id); @@ -1548,7 +1557,7 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) } } else { if (hw->use_vec_rx) { - PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u", + PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u", eth_dev->data->port_id); eth_dev->rx_pkt_burst = virtio_recv_pkts_vec; } else if (hw->use_inorder_rx) { @@ -1921,6 +1930,10 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) goto err_virtio_init; hw->opened = true; +#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR + hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#endif return 0; @@ -2157,31 +2170,63 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { - hw->use_inorder_tx = 1; - hw->use_inorder_rx = 1; - hw->use_vec_rx = 0; - } - if (vtpci_packed_queue(hw)) { - hw->use_vec_rx = 0; - hw->use_inorder_rx = 0; - } + if ((hw->use_vec_rx || hw->use_vec_tx) && + (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) || + !rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW) || + !rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL) || + !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) || + !vtpci_with_feature(hw, 
VIRTIO_F_VERSION_1))) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorization for requirements are not met"); + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; + } + + if (hw->use_vec_rx) { + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } + if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for TCP_LRO enabled"); + hw->use_vec_rx = 0; + } + } + } else { + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { + hw->use_inorder_tx = 1; + hw->use_inorder_rx = 1; + hw->use_vec_rx = 0; + } + + if (hw->use_vec_rx) { #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM - if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_vec_rx = 0; - } + if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorization for requirements are not met"); + hw->use_vec_rx = 0; + } #endif - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_vec_rx = 0; - } + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } - if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | - DEV_RX_OFFLOAD_TCP_CKSUM | - DEV_RX_OFFLOAD_TCP_LRO | - DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_vec_rx = 0; + if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | + DEV_RX_OFFLOAD_TCP_CKSUM | + DEV_RX_OFFLOAD_TCP_LRO | + DEV_RX_OFFLOAD_VLAN_STRIP)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for offloading enabled"); + hw->use_vec_rx = 0; + } + } + } return 0; } -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
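Condensing the commit message, the packed-ring vectorized Rx path is elected only when every listed condition holds. The helper below is an illustrative sketch of that decision, with plain booleans standing in for the devargs flag, rte_cpu_get_flag_enabled() results and negotiated feature bits; it is not the driver's virtio_dev_configure() code.

#include <stdbool.h>
#include <stdio.h>

struct caps {
	bool vectorized_option;	/* "vectorized" devarg / build option */
	bool avx512f, avx512bw, avx512vl;
	bool version_1, in_order, mrg_rxbuf, lro_offload;
};

static bool
use_packed_vec_rx(const struct caps *c)
{
	if (!c->vectorized_option)
		return false;
	if (!(c->avx512f && c->avx512bw && c->avx512vl))
		return false;
	if (!c->version_1 || !c->in_order)
		return false;
	/* Tx needs only the checks above; Rx additionally requires
	 * no mergeable buffers and no LRO offload. */
	return !c->mrg_rxbuf && !c->lro_offload;
}

int main(void)
{
	struct caps c = {
		.vectorized_option = true,
		.avx512f = true, .avx512bw = true, .avx512vl = true,
		.version_1 = true, .in_order = true,
		.mrg_rxbuf = false, .lro_offload = false,
	};

	printf("packed vectorized Rx eligible: %s\n",
	       use_packed_vec_rx(&c) ? "yes" : "no");
	return 0;
}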
* [dpdk-dev] [PATCH v4 8/8] doc: add packed vectorized datapath 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu ` (6 preceding siblings ...) 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 7/8] net/virtio: add election for vectorized datapath Marvin Liu @ 2020-04-15 16:47 ` Marvin Liu 7 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-15 16:47 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Document packed virtqueue vectorized datapath selection logic in virtio net PMD. Add packed virtqueue vectorized datapath features to new ini file. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/doc/guides/nics/features/virtio-packed_vec.ini b/doc/guides/nics/features/virtio-packed_vec.ini new file mode 100644 index 000000000..b239bcaad --- /dev/null +++ b/doc/guides/nics/features/virtio-packed_vec.ini @@ -0,0 +1,22 @@ +; +; Supported features of the 'virtio_packed_vec' network poll mode driver. +; +; Refer to default.ini for the full list of available PMD features. +; +[Features] +Speed capabilities = P +Link status = Y +Link status event = Y +Rx interrupt = Y +Queue start/stop = Y +Promiscuous mode = Y +Allmulticast mode = Y +Unicast MAC filter = Y +Multicast MAC filter = Y +VLAN filter = Y +Basic stats = Y +Stats per queue = Y +BSD nic_uio = Y +Linux UIO = Y +Linux VFIO = Y +x86-64 = Y diff --git a/doc/guides/nics/features/virtio_vec.ini b/doc/guides/nics/features/virtio-split_vec.ini similarity index 88% rename from doc/guides/nics/features/virtio_vec.ini rename to doc/guides/nics/features/virtio-split_vec.ini index e60fe36ae..4142fc9f0 100644 --- a/doc/guides/nics/features/virtio_vec.ini +++ b/doc/guides/nics/features/virtio-split_vec.ini @@ -1,5 +1,5 @@ ; -; Supported features of the 'virtio_vec' network poll mode driver. +; Supported features of the 'virtio_split_vec' network poll mode driver. ; ; Refer to default.ini for the full list of available PMD features. ; diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index d1f5fb898..7c9ad9466 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -403,6 +403,11 @@ Below devargs are supported by the virtio-user vdev: It is used to enable virtio device packed virtqueue feature. (Default: 0 (disabled)) +#. ``vectorized``: + + It is used to enable virtio device vectorized datapath. + (Default: 0 (disabled)) + Virtio paths Selection and Usage -------------------------------- @@ -454,6 +459,13 @@ according to below configuration: both negotiated, this path will be selected. #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and Rx mergeable is not negotiated, this path will be selected. +#. Packed virtqueue vectorized Rx path: If building and running environment support + AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated && + TCP_LRO Rx offloading is disabled && vectorized option enabled, + this path will be selected. +#. Packed virtqueue vectorized Tx path: If building and running environment support + AVX512 && in-order feature is negotiated && vectorized option enabled, + this path will be selected. 
Rx/Tx callbacks of each Virtio path ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -476,6 +488,8 @@ are shown in below table: Packed virtqueue non-meregable path virtio_recv_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order mergeable path virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed virtio_xmit_pkts_packed + Packed virtqueue vectorized Rx path virtio_recv_pkts_packed_vec virtio_xmit_pkts_packed + Packed virtqueue vectorized Tx path virtio_recv_pkts_packed virtio_xmit_pkts_packed_vec ============================================ ================================= ======================== Virtio paths Support Status from Release to Release @@ -493,20 +507,22 @@ All virtio paths support status are shown in below table: .. table:: Virtio Paths and Releases - ============================================ ============= ============= ============= - Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 - ============================================ ============= ============= ============= - Split virtqueue mergeable path Y Y Y - Split virtqueue non-mergeable path Y Y Y - Split virtqueue vectorized Rx path Y Y Y - Split virtqueue simple Tx path Y N N - Split virtqueue in-order mergeable path Y Y - Split virtqueue in-order non-mergeable path Y Y - Packed virtqueue mergeable path Y - Packed virtqueue non-mergeable path Y - Packed virtqueue in-order mergeable path Y - Packed virtqueue in-order non-mergeable path Y - ============================================ ============= ============= ============= + ============================================ ============= ============= ============= ======= + Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~ + ============================================ ============= ============= ============= ======= + Split virtqueue mergeable path Y Y Y Y + Split virtqueue non-mergeable path Y Y Y Y + Split virtqueue vectorized Rx path Y Y Y Y + Split virtqueue simple Tx path Y N N N + Split virtqueue in-order mergeable path Y Y Y + Split virtqueue in-order non-mergeable path Y Y Y + Packed virtqueue mergeable path Y Y + Packed virtqueue non-mergeable path Y Y + Packed virtqueue in-order mergeable path Y Y + Packed virtqueue in-order non-mergeable path Y Y + Packed virtqueue vectorized Rx path Y + Packed virtqueue vectorized Tx path Y + ============================================ ============= ============= ============= ======= QEMU Support Status ~~~~~~~~~~~~~~~~~~~ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu ` (9 preceding siblings ...) 2020-04-15 16:47 ` [dpdk-dev] [PATCH v4 0/8] add packed ring " Marvin Liu @ 2020-04-16 15:31 ` Marvin Liu 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 1/9] net/virtio: add Rx free threshold setting Marvin Liu ` (8 more replies) 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu ` (6 subsequent siblings) 17 siblings, 9 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu

This patch set introduces a vectorized path for the packed ring. A packed ring descriptor is 16 bytes, so four batched descriptors fit exactly into one cacheline, which AVX512 instructions can handle well. The packed ring Tx path can be fully transformed into a vectorized path. The packed ring Rx path can be vectorized when the requirements are met (LRO and mergeable buffers disabled). A new option, RTE_LIBRTE_VIRTIO_INC_VECTOR, is introduced in this patch set; it unifies the default setting of the split and packed ring vectorized paths. In addition, the user can choose at runtime whether to enable the vectorized path through the 'vectorized' parameter of the virtio-user vdev.

v5:
1. remove cpuflags definition as required extensions always come with AVX512F on x86_64
2. in-order actions should depend on the feature bit
3. check ring type in Rx queue setup
4. rewrite some commit logs
5. fix some checkpatch warnings

v4:
1. rename 'packed_vec' to 'vectorized', also used in split ring
2. add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev
3. check required AVX512 extension cpuflags
4. combine split and packed ring datapath selection logic
5. remove the limitation that size must be a power of two
6. clear 12-byte virtio_net_hdr

v3:
1. remove virtio_net_hdr array for better performance
2. disable 'packed_vec' by default

v2:
1. more function blocks replaced by vector instructions
2. clean virtio_net_hdr by vector instruction
3. allow header room size change
4. add 'packed_vec' option in virtio_user vdev
5. fix build not checking whether AVX512 is enabled
6. 
doc update Marvin Liu (9): net/virtio: add Rx free threshold setting net/virtio: enable vectorized path net/virtio: inorder should depend on feature bit net/virtio-user: add vectorized path parameter net/virtio: add vectorized packed ring Rx path net/virtio: reuse packed ring xmit functions net/virtio: add vectorized packed ring Tx path net/virtio: add election for vectorized path doc: add packed vectorized path config/common_base | 1 + .../nics/features/virtio-packed_vec.ini | 22 + .../{virtio_vec.ini => virtio-split_vec.ini} | 2 +- doc/guides/nics/virtio.rst | 44 +- drivers/net/virtio/Makefile | 36 + drivers/net/virtio/meson.build | 27 +- drivers/net/virtio/virtio_ethdev.c | 95 ++- drivers/net/virtio/virtio_ethdev.h | 6 + drivers/net/virtio/virtio_pci.h | 3 +- drivers/net/virtio/virtio_rxtx.c | 212 ++---- drivers/net/virtio/virtio_rxtx_packed_avx.c | 639 ++++++++++++++++++ drivers/net/virtio/virtio_user_ethdev.c | 39 +- drivers/net/virtio/virtqueue.c | 7 +- drivers/net/virtio/virtqueue.h | 168 ++++- 14 files changed, 1080 insertions(+), 221 deletions(-) create mode 100644 doc/guides/nics/features/virtio-packed_vec.ini rename doc/guides/nics/features/{virtio_vec.ini => virtio-split_vec.ini} (88%) create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
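The batch size used throughout this series falls directly out of the descriptor and cacheline sizes. A minimal standalone sketch of that arithmetic, assuming the 16-byte vring_packed_desc layout from the virtio 1.1 specification and a 64-byte cacheline:

    #include <assert.h>
    #include <stdint.h>

    /* Packed ring descriptor as defined by virtio 1.1: 16 bytes. */
    struct vring_packed_desc_sketch {
        uint64_t addr;   /* buffer address */
        uint32_t len;    /* buffer length */
        uint16_t id;     /* buffer id */
        uint16_t flags;  /* AVAIL/USED/WRITE/NEXT bits */
    };

    #define CACHE_LINE_SIZE 64

    int main(void)
    {
        /* Four descriptors fill one cacheline exactly, so a single
         * 64-byte (512-bit) AVX512 load or store moves a whole batch.
         */
        assert(sizeof(struct vring_packed_desc_sketch) == 16);
        assert(CACHE_LINE_SIZE / sizeof(struct vring_packed_desc_sketch) == 4);
        return 0;
    }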
* [dpdk-dev] [PATCH v5 1/9] net/virtio: add Rx free threshold setting 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu @ 2020-04-16 15:31 ` Marvin Liu 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 2/9] net/virtio: enable vectorized path Marvin Liu ` (7 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Introduce free threshold setting in Rx queue, default value of it is 32. Limiated threshold size to multiple of four as only vectorized packed Rx function will utilize it. Virtio driver will rearm Rx queue when more than rx_free_thresh descs were dequeued. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 060410577..94ba7a3ec 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, struct virtio_hw *hw = dev->data->dev_private; struct virtqueue *vq = hw->vqs[vtpci_queue_idx]; struct virtnet_rx *rxvq; + uint16_t rx_free_thresh; PMD_INIT_FUNC_TRACE(); @@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, return -EINVAL; } + rx_free_thresh = rx_conf->rx_free_thresh; + if (rx_free_thresh == 0) + rx_free_thresh = + RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH); + + if (rx_free_thresh & 0x3) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four." + " (rx_free_thresh=%u port=%u queue=%u)\n", + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + + if (rx_free_thresh >= vq->vq_nentries) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the " + "number of RX entries (%u)." + " (rx_free_thresh=%u port=%u queue=%u)\n", + vq->vq_nentries, + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + vq->vq_free_thresh = rx_free_thresh; + if (nb_desc == 0 || nb_desc > vq->vq_nentries) nb_desc = vq->vq_nentries; vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc); diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 58ad7309a..6301c56b2 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,8 @@ struct rte_mbuf; +#define DEFAULT_RX_FREE_THRESH 32 + /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
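For completeness, the new threshold is driven through the normal ethdev Rx queue setup call. A hedged application-side sketch follows; the port id, ring size and mempool are placeholders, and the chosen value must stay a multiple of four and below the ring size, as enforced by the checks above:

    #include <rte_ethdev.h>

    /* Illustrative helper, not part of the patch: request an explicit
     * rx_free_thresh so the vectorized Rx path rearms after 64 used
     * descriptors instead of the default of min(ring_size / 4, 32).
     */
    static int
    setup_virtio_rxq(uint16_t port_id, uint16_t queue_id, struct rte_mempool *mp)
    {
        struct rte_eth_dev_info dev_info;
        struct rte_eth_rxconf rxconf;
        int ret;

        ret = rte_eth_dev_info_get(port_id, &dev_info);
        if (ret != 0)
            return ret;

        rxconf = dev_info.default_rxconf;
        rxconf.rx_free_thresh = 64; /* multiple of four, < nb_rx_desc */

        return rte_eth_rx_queue_setup(port_id, queue_id, 256,
                rte_eth_dev_socket_id(port_id), &rxconf, mp);
    }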
* [dpdk-dev] [PATCH v5 2/9] net/virtio: enable vectorized path 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 1/9] net/virtio: add Rx free threshold setting Marvin Liu @ 2020-04-16 15:31 ` Marvin Liu 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 3/9] net/virtio: inorder should depend on feature bit Marvin Liu ` (6 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Previously, virtio split ring vectorized path is enabled as default. This is not suitable for everyone because of that path not follow virtio spec. Add new config for virtio vectorized path selection. By default vectorized path is enabled. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/config/common_base b/config/common_base index c31175f9d..5901a94f7 100644 --- a/config/common_base +++ b/config/common_base @@ -449,6 +449,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=y # # Compile virtio device emulation inside virtio PMD driver diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index efdcb0d93..9ef445bc9 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -29,6 +29,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y) ifeq ($(CONFIG_RTE_ARCH_X86),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y) @@ -36,6 +37,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif +endif ifeq ($(CONFIG_RTE_VIRTIO_USER),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index 5e7ca855c..f9619a108 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -9,12 +9,14 @@ sources += files('virtio_ethdev.c', 'virtqueue.c') deps += ['kvargs', 'bus_pci'] -if arch_subdir == 'x86' - sources += files('virtio_rxtx_simple_sse.c') -elif arch_subdir == 'ppc' - sources += files('virtio_rxtx_simple_altivec.c') -elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64') - sources += files('virtio_rxtx_simple_neon.c') +if dpdk_conf.has('RTE_LIBRTE_VIRTIO_INC_VECTOR') + if arch_subdir == 'x86' + sources += files('virtio_rxtx_simple_sse.c') + elif arch_subdir == 'ppc' + sources += files('virtio_rxtx_simple_altivec.c') + elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64') + sources += files('virtio_rxtx_simple_neon.c') + endif endif if is_linux -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v5 3/9] net/virtio: inorder should depend on feature bit 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 1/9] net/virtio: add Rx free threshold setting Marvin Liu 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 2/9] net/virtio: enable vectorized path Marvin Liu @ 2020-04-16 15:31 ` Marvin Liu 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 4/9] net/virtio-user: add vectorized path parameter Marvin Liu ` (5 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Ring initialzation is different when inorder feature negotiated. This action should dependent on negotiated feature bits. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 94ba7a3ec..e450477e8 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -989,6 +989,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) struct rte_mbuf *m; uint16_t desc_idx; int error, nbufs, i; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); PMD_INIT_FUNC_TRACE(); @@ -1018,7 +1019,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; } - } else if (hw->use_inorder_rx) { + } else if (!vtpci_packed_queue(vq->hw) && in_order) { if ((!virtqueue_full(vq))) { uint16_t free_cnt = vq->vq_free_cnt; struct rte_mbuf *pkts[free_cnt]; @@ -1133,7 +1134,7 @@ virtio_dev_tx_queue_setup_finish(struct rte_eth_dev *dev, PMD_INIT_FUNC_TRACE(); if (!vtpci_packed_queue(hw)) { - if (hw->use_inorder_tx) + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) vq->vq_split.ring.desc[vq->vq_nentries - 1].next = 0; } @@ -2046,7 +2047,7 @@ virtio_xmit_pkts_packed(void *tx_queue, struct rte_mbuf **tx_pkts, struct virtio_hw *hw = vq->hw; uint16_t hdr_size = hw->vtnet_hdr_size; uint16_t nb_tx = 0; - bool in_order = hw->use_inorder_tx; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) return nb_tx; -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v5 4/9] net/virtio-user: add vectorized path parameter 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu ` (2 preceding siblings ...) 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 3/9] net/virtio: inorder should depend on feature bit Marvin Liu @ 2020-04-16 15:31 ` Marvin Liu 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu ` (4 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Add new parameter "vectorized" which can select vectorized path explicitly. This parameter will work when RTE_LIBRTE_VIRTIO_INC_VECTOR option is yes. When "vectorized" is set, driver will check both compiling environment and running environment when selecting path. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 35203940a..4c7d60ca0 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1547,7 +1547,7 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed; } } else { - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u", eth_dev->data->port_id); eth_dev->rx_pkt_burst = virtio_recv_pkts_vec; @@ -2157,33 +2157,31 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - hw->use_simple_rx = 1; - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { hw->use_inorder_tx = 1; hw->use_inorder_rx = 1; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (vtpci_packed_queue(hw)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; hw->use_inorder_rx = 0; } #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } #endif if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | DEV_RX_OFFLOAD_TCP_CKSUM | DEV_RX_OFFLOAD_TCP_LRO | DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; return 0; } diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h index 7433d2f08..36afed313 100644 --- a/drivers/net/virtio/virtio_pci.h +++ b/drivers/net/virtio/virtio_pci.h @@ -250,7 +250,8 @@ struct virtio_hw { uint8_t vlan_strip; uint8_t use_msix; uint8_t modern; - uint8_t use_simple_rx; + uint8_t use_vec_rx; + uint8_t use_vec_tx; uint8_t use_inorder_rx; uint8_t use_inorder_tx; uint8_t weak_barriers; diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index e450477e8..84f4cf946 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) /* Allocate blank mbufs for the each rx descriptor */ nbufs = 0; - if (hw->use_simple_rx) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { for (desc_idx = 0; desc_idx < vq->vq_nentries; desc_idx++) { vq->vq_split.ring.avail->ring[desc_idx] = desc_idx; @@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) &rxvq->fake_mbuf; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxvq); nbufs += 
RTE_VIRTIO_VPMD_RX_REARM_THRESH; diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 5637001df..6e30acaae 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -450,6 +450,8 @@ static const char *valid_args[] = { VIRTIO_USER_ARG_IN_ORDER, #define VIRTIO_USER_ARG_PACKED_VQ "packed_vq" VIRTIO_USER_ARG_PACKED_VQ, +#define VIRTIO_USER_ARG_VECTORIZED "vectorized" + VIRTIO_USER_ARG_VECTORIZED, NULL }; @@ -518,7 +520,8 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev) */ hw->use_msix = 1; hw->modern = 0; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; hw->use_inorder_rx = 0; hw->use_inorder_tx = 0; hw->virtio_user_dev = dev; @@ -552,6 +555,8 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) uint64_t mrg_rxbuf = 1; uint64_t in_order = 1; uint64_t packed_vq = 0; + uint64_t vectorized = 0; + char *path = NULL; char *ifname = NULL; char *mac_addr = NULL; @@ -668,6 +673,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } } +#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR + if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) { + if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED, + &get_integer_arg, &vectorized) < 0) { + PMD_INIT_LOG(ERR, "error to parse %s", + VIRTIO_USER_ARG_VECTORIZED); + goto end; + } + } +#endif + if (queues > 1 && cq == 0) { PMD_INIT_LOG(ERR, "multi-q requires ctrl-q"); goto end; @@ -705,6 +721,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } hw = eth_dev->data->dev_private; + if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq, queue_size, mac_addr, &ifname, server_mode, mrg_rxbuf, in_order, packed_vq) < 0) { @@ -720,6 +737,23 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) goto end; } + if (vectorized) { + if (packed_vq) { +#if defined(CC_AVX512_SUPPORT) + hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#else + PMD_INIT_LOG(INFO, + "building environment do not match packed ring vectorized requirement"); +#endif + } else { + hw->use_vec_rx = 1; + } + } else { + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; + } + rte_eth_dev_probing_finish(eth_dev); ret = 0; @@ -777,4 +811,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, "server=<0|1> " "mrg_rxbuf=<0|1> " "in_order=<0|1> " - "packed_vq=<0|1>"); + "packed_vq=<0|1>" + "vectorized=<0|1>"); diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c index 0b4e3bf3e..ca23180de 100644 --- a/drivers/net/virtio/virtqueue.c +++ b/drivers/net/virtio/virtqueue.c @@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq) end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1); for (idx = 0; idx < vq->vq_nentries; idx++) { - if (hw->use_simple_rx && type == VTNET_RQ) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw) && + type == VTNET_RQ) { if (start <= end && idx >= start && idx < end) continue; if (start > end && (idx >= start || idx < end)) @@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) for (i = 0; i < nb_used; i++) { used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1); uep = &vq->vq_split.ring.used->ring[used_idx]; - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { desc_idx = used_idx; rte_pktmbuf_free(vq->sw_ring[desc_idx]); vq->vq_free_cnt++; @@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) vq->vq_used_cons_idx++; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxq); if (virtqueue_kick_prepare(vq)) -- 2.17.1 ^ 
permalink raw reply [flat|nested] 162+ messages in thread
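As a usage illustration (not part of the patch), the new devarg rides on the existing virtio-user option string; the same string can be given through the EAL --vdev option or, as sketched below, by hot-plugging the device from code. The socket path is a placeholder.

    #include <rte_bus_vdev.h>

    /* Attach a virtio-user port with packed ring and the new
     * 'vectorized' knob enabled. "/tmp/vhost-user.sock" stands in for
     * a real vhost-user socket path.
     */
    static int
    attach_virtio_user_vectorized(void)
    {
        return rte_vdev_init("net_virtio_user0",
                "path=/tmp/vhost-user.sock,queues=1,"
                "packed_vq=1,vectorized=1");
    }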
* [dpdk-dev] [PATCH v5 5/9] net/virtio: add vectorized packed ring Rx path 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu ` (3 preceding siblings ...) 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 4/9] net/virtio-user: add vectorized path parameter Marvin Liu @ 2020-04-16 15:31 ` Marvin Liu 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu ` (3 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Optimize packed ring Rx path when AVX512 enabled and mergeable buffer/Rx LRO offloading are not required. Solution of optimization is pretty like vhost, is that split path into batch and single functions. Batch function is further optimized by vector instructions. Also pad desc extra structure to 16 bytes aligned, thus four elements will be saved in one batch. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index 9ef445bc9..4d20cb61a 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -37,6 +37,40 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif + +ifneq ($(FORCE_DISABLE_AVX512), y) + CC_AVX512_SUPPORT=\ + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ + grep -q AVX512 && echo 1) +endif + +ifeq ($(CC_AVX512_SUPPORT), 1) +CFLAGS += -DCC_AVX512_SUPPORT +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c + +ifeq ($(RTE_TOOLCHAIN), gcc) +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), clang) +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1) +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), icc) +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA +endif +endif + +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds +endif +endif endif ifeq ($(CONFIG_RTE_VIRTIO_USER),y) diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index f9619a108..9e0ff9761 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -11,6 +11,19 @@ deps += ['kvargs', 'bus_pci'] if dpdk_conf.has('RTE_LIBRTE_VIRTIO_INC_VECTOR') if arch_subdir == 'x86' + if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F') + if '-mno-avx512f' not in machine_args and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw') + cflags += ['-DCC_AVX512_SUPPORT'] + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' + endif + sources += files('virtio_rxtx_packed_avx.c') + endif + endif sources += files('virtio_rxtx_simple_sse.c') elif arch_subdir == 'ppc' sources += files('virtio_rxtx_simple_altivec.c') diff --git a/drivers/net/virtio/virtio_ethdev.h 
b/drivers/net/virtio/virtio_ethdev.h index cd8947656..10e39670e 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -104,6 +104,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 84f4cf946..7b65d0b0a 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -1246,7 +1246,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) return 0; } -#define VIRTIO_MBUF_BURST_SZ 64 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc)) uint16_t virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) @@ -2329,3 +2328,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, return nb_tx; } + +__rte_weak uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, + struct rte_mbuf **rx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c new file mode 100644 index 000000000..f2976b98f --- /dev/null +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -0,0 +1,358 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> + +#include <rte_net.h> + +#include "virtio_logs.h" +#include "virtio_ethdev.h" +#include "virtio_pci.h" +#include "virtqueue.h" + +#define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63) + +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ + sizeof(struct vring_packed_desc)) +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) + +#ifdef VIRTIO_GCC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_ICC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ + for (iter = val; iter < size; iter++) +#endif + +#ifndef virtio_for_each_try_unroll +#define virtio_for_each_try_unroll(iter, val, num) \ + for (iter = val; iter < num; iter++) +#endif + + +static inline void +virtio_update_batch_stats(struct virtnet_stats *stats, + uint16_t pkt_len1, + uint16_t pkt_len2, + uint16_t pkt_len3, + uint16_t pkt_len4) +{ + stats->bytes += pkt_len1; + stats->bytes += pkt_len2; + stats->bytes += pkt_len3; + stats->bytes += pkt_len4; +} +/* Optionally fill offload information in structure */ +static inline int +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) +{ + struct rte_net_hdr_lens hdr_lens; + uint32_t hdrlen, ptype; + int l4_supported = 0; + + /* nothing to do */ + if (hdr->flags == 0) + return 0; + + /* GSO not support in vec path, skip check */ + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; + + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); + m->packet_type = ptype; + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || + (ptype & 
RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) + l4_supported = 1; + + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; + if (hdr->csum_start <= hdrlen && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; + } else { + /* Unknown proto or tunnel, do sw cksum. We can assume + * the cksum field is in the first segment since the + * buffers we provided to the host are large enough. + * In case of SCTP, this will be wrong since it's a CRC + * but there's nothing we can do. + */ + uint16_t csum = 0, off; + + rte_raw_cksum_mbuf(m, hdr->csum_start, + rte_pktmbuf_pkt_len(m) - hdr->csum_start, + &csum); + if (likely(csum != 0xffff)) + csum = ~csum; + off = hdr->csum_offset + hdr->csum_start; + if (rte_pktmbuf_data_len(m) >= off + 1) + *rte_pktmbuf_mtod_offset(m, uint16_t *, + off) = csum; + } + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; + } + + return 0; +} + +static uint16_t +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint64_t addrs[PACKED_BATCH_SIZE << 1]; + uint16_t id = vq->vq_used_cons_idx; + uint8_t desc_stats; + uint16_t i; + void *desc_addr; + + if (id & PACKED_BATCH_MASK) + return -1; + + /* only care avail/used bits */ + __m512i desc_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + desc_addr = &vq->vq_packed.ring.desc[id]; + + rte_smp_rmb(); + __m512i packed_desc = _mm512_loadu_si512(desc_addr); + __m512i flags_mask = _mm512_maskz_and_epi64(0xff, packed_desc, + desc_flags); + + __m512i used_flags; + if (vq->vq_packed.used_wrap_counter) + used_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + else + used_flags = _mm512_setzero_si512(); + + /* Check all descs are used */ + desc_stats = _mm512_cmp_epu64_mask(flags_mask, used_flags, + _MM_CMPINT_EQ); + if (desc_stats != 0xff) + return -1; + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); + + addrs[i << 1] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1; + addrs[(i << 1) + 1] = + (uint64_t)rx_pkts[i]->rx_descriptor_fields1 + 8; + } + + /* addresses of pkt_len and data_len */ + __m512i vindex = _mm512_loadu_si512((void *)addrs); + + /* + * select 10b*4 load 32bit from packed_desc[95:64] + * mmask 0110b*4 save 32bit into pkt_len and data_len + */ + __m512i value = _mm512_maskz_shuffle_epi32(0x6666, packed_desc, 0xAA); + + /* mmask 0110b*4 reduce hdr_len from pkt_len and data_len */ + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(0x6666, + (uint32_t)-hdr_size); + + value = _mm512_add_epi32(value, mbuf_len_offset); + /* batch store into mbufs */ + _mm512_i64scatter_epi64(0, vindex, value, 1); + + if (hw->has_rx_offload) { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + char *addr = (char *)rx_pkts[i]->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size; + virtio_vec_rx_offload(rx_pkts[i], + (struct virtio_net_hdr *)addr); + } + } + + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, + rx_pkts[3]->pkt_len); + + vq->vq_free_cnt += PACKED_BATCH_SIZE; + + vq->vq_used_cons_idx += PACKED_BATCH_SIZE; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static 
uint16_t +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + uint16_t used_idx, id; + uint32_t len; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint32_t hdr_size = hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + struct vring_packed_desc *desc; + struct rte_mbuf *cookie; + + desc = vq->vq_packed.ring.desc; + used_idx = vq->vq_used_cons_idx; + if (!desc_is_used(&desc[used_idx], vq)) + return -1; + + len = desc[used_idx].len; + id = desc[used_idx].id; + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; + if (unlikely(cookie == NULL)) { + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u", + vq->vq_used_cons_idx); + return -1; + } + rte_prefetch0(cookie); + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); + + cookie->data_off = RTE_PKTMBUF_HEADROOM; + cookie->ol_flags = 0; + cookie->pkt_len = (uint32_t)(len - hdr_size); + cookie->data_len = (uint32_t)(len - hdr_size); + + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size); + if (hw->has_rx_offload) + virtio_vec_rx_offload(cookie, hdr); + + *rx_pkts = cookie; + + rxvq->stats.bytes += cookie->pkt_len; + + vq->vq_free_cnt++; + vq->vq_used_cons_idx++; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static inline void +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **cookie, + uint16_t num) +{ + struct virtqueue *vq = rxvq->vq; + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; + uint16_t flags = vq->vq_packed.cached_flags; + struct virtio_hw *hw = vq->hw; + struct vq_desc_extra *dxp; + uint16_t idx, i; + uint16_t total_num = 0; + uint16_t head_idx = vq->vq_avail_idx; + uint16_t head_flag = vq->vq_packed.cached_flags; + uint64_t addr; + + do { + idx = vq->vq_avail_idx; + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + dxp = &vq->vq_descx[idx + i]; + dxp->cookie = (void *)cookie[total_num + i]; + + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) + + RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size; + start_dp[idx + i].addr = addr; + start_dp[idx + i].len = cookie[total_num + i]->buf_len + - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size; + if (total_num || i) { + virtqueue_store_flags_packed(&start_dp[idx + i], + flags, hw->weak_barriers); + } + } + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + flags = vq->vq_packed.cached_flags; + } + total_num += PACKED_BATCH_SIZE; + } while (total_num < num); + + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, + hw->weak_barriers); + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); +} + +uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue, + struct rte_mbuf **rx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_rx *rxvq = rx_queue; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t num, nb_rx = 0; + uint32_t nb_enqueued = 0; + uint16_t free_cnt = vq->vq_free_thresh; + + if (unlikely(hw->started == 0)) + return nb_rx; + + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); + if (likely(num > PACKED_BATCH_SIZE)) + num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE); + + while (num) { + if (!virtqueue_dequeue_batch_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx += PACKED_BATCH_SIZE; + num -= PACKED_BATCH_SIZE; + continue; + } + if 
(!virtqueue_dequeue_single_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx++; + num--; + continue; + } + break; + }; + + PMD_RX_LOG(DEBUG, "dequeue:%d", num); + + rxvq->stats.packets += nb_rx; + + if (likely(vq->vq_free_cnt >= free_cnt)) { + struct rte_mbuf *new_pkts[free_cnt]; + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, + free_cnt) == 0)) { + virtio_recv_refill_packed_vec(rxvq, new_pkts, + free_cnt); + nb_enqueued += free_cnt; + } else { + struct rte_eth_dev *dev = + &rte_eth_devices[rxvq->port_id]; + dev->data->rx_mbuf_alloc_failed += free_cnt; + } + } + + if (likely(nb_enqueued)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_RX_LOG(DEBUG, "Notified"); + } + } + + return nb_rx; +} diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 6301c56b2..43e305ecc 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -20,6 +20,7 @@ struct rte_mbuf; #define DEFAULT_RX_FREE_THRESH 32 +#define VIRTIO_MBUF_BURST_SZ 64 /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO @@ -236,7 +237,8 @@ struct vq_desc_extra { void *cookie; uint16_t ndescs; uint16_t next; -}; + uint8_t padding[4]; +} __rte_packed __rte_aligned(16); struct virtqueue { struct virtio_hw *hw; /**< virtio_hw structure pointer. */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
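For readers less used to AVX512 intrinsics, the masked 512-bit compare in virtqueue_dequeue_batch_packed_vec() is the vector form of the scalar check below; a sketch assuming the usual packed-descriptor flag bits (AVAIL = bit 7, USED = bit 15):

    #include <stdbool.h>
    #include <stdint.h>

    #define BATCH_SIZE   4
    #define DESC_F_AVAIL (1 << 7)
    #define DESC_F_USED  (1 << 15)

    struct desc_sketch {
        uint64_t addr;
        uint32_t len;
        uint16_t id;
        uint16_t flags;
    };

    /* Scalar equivalent of the flags_mask/used_flags compare: the batch
     * is consumed only when every descriptor has AVAIL == USED == the
     * current used wrap counter.
     */
    static bool
    batch_is_used(const struct desc_sketch *desc, bool used_wrap_counter)
    {
        uint16_t expected = used_wrap_counter ?
                (DESC_F_AVAIL | DESC_F_USED) : 0;
        int i;

        for (i = 0; i < BATCH_SIZE; i++)
            if ((desc[i].flags & (DESC_F_AVAIL | DESC_F_USED)) != expected)
                return false;
        return true;
    }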
* [dpdk-dev] [PATCH v5 6/9] net/virtio: reuse packed ring xmit functions 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu ` (4 preceding siblings ...) 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu @ 2020-04-16 15:31 ` Marvin Liu 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu ` (2 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Move xmit offload and packed ring xmit enqueue function to header file. These functions will be reused by packed ring vectorized Tx function. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 7b65d0b0a..cf18fe564 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq, return i; } -#ifndef DEFAULT_TX_FREE_THRESH -#define DEFAULT_TX_FREE_THRESH 32 -#endif - static void virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num) { @@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m) } -/* avoid write operation when necessary, to lessen cache issues */ -#define ASSIGN_UNLESS_EQUAL(var, val) do { \ - if ((var) != (val)) \ - (var) = (val); \ -} while (0) - -#define virtqueue_clear_net_hdr(_hdr) do { \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ -} while (0) - -static inline void -virtqueue_xmit_offload(struct virtio_net_hdr *hdr, - struct rte_mbuf *cookie, - bool offload) -{ - if (offload) { - if (cookie->ol_flags & PKT_TX_TCP_SEG) - cookie->ol_flags |= PKT_TX_TCP_CKSUM; - - switch (cookie->ol_flags & PKT_TX_L4_MASK) { - case PKT_TX_UDP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_udp_hdr, - dgram_cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - case PKT_TX_TCP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - default: - ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); - ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); - ASSIGN_UNLESS_EQUAL(hdr->flags, 0); - break; - } - /* TCP Segmentation Offload */ - if (cookie->ol_flags & PKT_TX_TCP_SEG) { - hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
- VIRTIO_NET_HDR_GSO_TCPV6 : - VIRTIO_NET_HDR_GSO_TCPV4; - hdr->gso_size = cookie->tso_segsz; - hdr->hdr_len = - cookie->l2_len + - cookie->l3_len + - cookie->l4_len; - } else { - ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); - ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); - ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); - } - } -} static inline void virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq, @@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq, virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers); } -static inline void -virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, - uint16_t needed, int can_push, int in_order) -{ - struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; - struct vq_desc_extra *dxp; - struct virtqueue *vq = txvq->vq; - struct vring_packed_desc *start_dp, *head_dp; - uint16_t idx, id, head_idx, head_flags; - int16_t head_size = vq->hw->vtnet_hdr_size; - struct virtio_net_hdr *hdr; - uint16_t prev; - bool prepend_header = false; - - id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; - - dxp = &vq->vq_descx[id]; - dxp->ndescs = needed; - dxp->cookie = cookie; - - head_idx = vq->vq_avail_idx; - idx = head_idx; - prev = head_idx; - start_dp = vq->vq_packed.ring.desc; - - head_dp = &vq->vq_packed.ring.desc[idx]; - head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; - head_flags |= vq->vq_packed.cached_flags; - - if (can_push) { - /* prepend cannot fail, checked by caller */ - hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, - -head_size); - prepend_header = true; - - /* if offload disabled, it is not zeroed below, do it now */ - if (!vq->hw->has_tx_offload) - virtqueue_clear_net_hdr(hdr); - } else { - /* setup first tx ring slot to point to header - * stored in reserved region. - */ - start_dp[idx].addr = txvq->virtio_net_hdr_mem + - RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); - start_dp[idx].len = vq->hw->vtnet_hdr_size; - hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } - - virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); - - do { - uint16_t flags; - - start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); - start_dp[idx].len = cookie->data_len; - if (prepend_header) { - start_dp[idx].addr -= head_size; - start_dp[idx].len += head_size; - prepend_header = false; - } - - if (likely(idx != head_idx)) { - flags = cookie->next ? 
VRING_DESC_F_NEXT : 0; - flags |= vq->vq_packed.cached_flags; - start_dp[idx].flags = flags; - } - prev = idx; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } while ((cookie = cookie->next) != NULL); - - start_dp[prev].id = id; - - vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); - vq->vq_avail_idx = idx; - - if (!in_order) { - vq->vq_desc_head_idx = dxp->next; - if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) - vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; - } - - virtqueue_store_flags_packed(head_dp, head_flags, - vq->hw->weak_barriers); -} - static inline void virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie, uint16_t needed, int use_indirect, int can_push, diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 43e305ecc..18ae34789 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,7 @@ struct rte_mbuf; +#define DEFAULT_TX_FREE_THRESH 32 #define DEFAULT_RX_FREE_THRESH 32 #define VIRTIO_MBUF_BURST_SZ 64 @@ -562,4 +563,165 @@ virtqueue_notify(struct virtqueue *vq) #define VIRTQUEUE_DUMP(vq) do { } while (0) #endif +/* avoid write operation when necessary, to lessen cache issues */ +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ + typeof(var) var_ = (var); \ + typeof(val) val_ = (val); \ + if ((var_) != (val_)) \ + (var_) = (val_); \ +} while (0) + +#define virtqueue_clear_net_hdr(hdr) do { \ + typeof(hdr) hdr_ = (hdr); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_start, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_offset, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->flags, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_type, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_size, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->hdr_len, 0); \ +} while (0) + +static inline void +virtqueue_xmit_offload(struct virtio_net_hdr *hdr, + struct rte_mbuf *cookie, + bool offload) +{ + if (offload) { + if (cookie->ol_flags & PKT_TX_TCP_SEG) + cookie->ol_flags |= PKT_TX_TCP_CKSUM; + + switch (cookie->ol_flags & PKT_TX_L4_MASK) { + case PKT_TX_UDP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_udp_hdr, + dgram_cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + case PKT_TX_TCP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + default: + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); + break; + } + + /* TCP Segmentation Offload */ + if (cookie->ol_flags & PKT_TX_TCP_SEG) { + hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
+ VIRTIO_NET_HDR_GSO_TCPV6 : + VIRTIO_NET_HDR_GSO_TCPV4; + hdr->gso_size = cookie->tso_segsz; + hdr->hdr_len = + cookie->l2_len + + cookie->l3_len + + cookie->l4_len; + } else { + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); + } + } +} + +static inline void +virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, + uint16_t needed, int can_push, int in_order) +{ + struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; + struct vq_desc_extra *dxp; + struct virtqueue *vq = txvq->vq; + struct vring_packed_desc *start_dp, *head_dp; + uint16_t idx, id, head_idx, head_flags; + int16_t head_size = vq->hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + uint16_t prev; + bool prepend_header = false; + + id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; + + dxp = &vq->vq_descx[id]; + dxp->ndescs = needed; + dxp->cookie = cookie; + + head_idx = vq->vq_avail_idx; + idx = head_idx; + prev = head_idx; + start_dp = vq->vq_packed.ring.desc; + + head_dp = &vq->vq_packed.ring.desc[idx]; + head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; + head_flags |= vq->vq_packed.cached_flags; + + if (can_push) { + /* prepend cannot fail, checked by caller */ + hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, + -head_size); + prepend_header = true; + + /* if offload disabled, it is not zeroed below, do it now */ + if (!vq->hw->has_tx_offload) + virtqueue_clear_net_hdr(hdr); + } else { + /* setup first tx ring slot to point to header + * stored in reserved region. + */ + start_dp[idx].addr = txvq->virtio_net_hdr_mem + + RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); + start_dp[idx].len = vq->hw->vtnet_hdr_size; + hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } + + virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); + + do { + uint16_t flags; + + start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); + start_dp[idx].len = cookie->data_len; + if (prepend_header) { + start_dp[idx].addr -= head_size; + start_dp[idx].len += head_size; + prepend_header = false; + } + + if (likely(idx != head_idx)) { + flags = cookie->next ? VRING_DESC_F_NEXT : 0; + flags |= vq->vq_packed.cached_flags; + start_dp[idx].flags = flags; + } + prev = idx; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } while ((cookie = cookie->next) != NULL); + + start_dp[prev].id = id; + + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); + vq->vq_avail_idx = idx; + + if (!in_order) { + vq->vq_desc_head_idx = dxp->next; + if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) + vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; + } + + virtqueue_store_flags_packed(head_dp, head_flags, + vq->hw->weak_barriers); +} #endif /* _VIRTQUEUE_H_ */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
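One detail worth flagging in the macros moved above: because ASSIGN_UNLESS_EQUAL() now copies 'var' by value, the conditional store lands in the local var_ and the caller's field (e.g. hdr->csum_start in virtqueue_clear_net_hdr()) is never written back. A pointer-based variant keeps the single-evaluation property while still writing through; a sketch:

    /* Write-through variant: evaluate 'var' once, but store via a
     * pointer so the caller's header field is actually updated when
     * the values differ.
     */
    #define ASSIGN_UNLESS_EQUAL(var, val) do {      \
        typeof(var) *const var_ = &(var);           \
        typeof(val)  const val_ = (val);            \
        if (*var_ != val_)                          \
            *var_ = val_;                           \
    } while (0)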
* [dpdk-dev] [PATCH v5 7/9] net/virtio: add vectorized packed ring Tx path 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu ` (5 preceding siblings ...) 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu @ 2020-04-16 15:31 ` Marvin Liu 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 8/9] net/virtio: add election for vectorized path Marvin Liu 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 9/9] doc: add packed " Marvin Liu 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Optimize packed ring Tx path alike Rx path. Split Tx path into batch and single Tx functions. Batch function is further optimized by vector instructions. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index 10e39670e..c9aaef0af 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -107,6 +107,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index cf18fe564..f82fe8d64 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, { return 0; } + +__rte_weak uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused, + struct rte_mbuf **tx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c index f2976b98f..92094783a 100644 --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -15,6 +15,21 @@ #include "virtio_pci.h" #include "virtqueue.h" +/* reference count offset in mbuf rearm data */ +#define REF_CNT_OFFSET 16 +/* segment number offset in mbuf rearm data */ +#define SEG_NUM_OFFSET 32 + +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_OFFSET | \ + 1ULL << REF_CNT_OFFSET) +/* id offset in packed ring desc higher 64bits */ +#define ID_OFFSET 32 +/* flag offset in packed ring desc higher 64bits */ +#define FLAG_OFFSET 48 + +/* net hdr short size mask */ +#define NET_HDR_MASK 0x3F + #define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63) #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ @@ -41,6 +56,47 @@ for (iter = val; iter < num; iter++) #endif +static void +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq) +{ + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; + struct vq_desc_extra *dxp; + uint16_t used_idx, id, curr_id, free_cnt = 0; + uint16_t size = vq->vq_nentries; + struct rte_mbuf *mbufs[size]; + uint16_t nb_mbuf = 0, i; + + used_idx = vq->vq_used_cons_idx; + + if (!desc_is_used(&desc[used_idx], vq)) + return; + + id = desc[used_idx].id; + + do { + curr_id = used_idx; + dxp = &vq->vq_descx[used_idx]; + used_idx += dxp->ndescs; + free_cnt += dxp->ndescs; + + if (dxp->cookie != NULL) { + mbufs[nb_mbuf] = dxp->cookie; + dxp->cookie = NULL; + nb_mbuf++; + } + + if (used_idx >= size) { + used_idx -= size; + 
vq->vq_packed.used_wrap_counter ^= 1; + } + } while (curr_id != id); + + for (i = 0; i < nb_mbuf; i++) + rte_pktmbuf_free(mbufs[i]); + + vq->vq_used_cons_idx = used_idx; + vq->vq_free_cnt += free_cnt; +} static inline void virtio_update_batch_stats(struct virtnet_stats *stats, @@ -54,6 +110,231 @@ virtio_update_batch_stats(struct virtnet_stats *stats, stats->bytes += pkt_len3; stats->bytes += pkt_len4; } + +static inline int +virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf **tx_pkts) +{ + struct virtqueue *vq = txvq->vq; + uint16_t head_size = vq->hw->vtnet_hdr_size; + uint16_t idx = vq->vq_avail_idx; + struct virtio_net_hdr *hdr; + uint16_t i, cmp; + + if (vq->vq_avail_idx & PACKED_BATCH_MASK) + return -1; + + /* Load four mbufs rearm data */ + __m256i mbufs = _mm256_set_epi64x(*tx_pkts[3]->rearm_data, + *tx_pkts[2]->rearm_data, + *tx_pkts[1]->rearm_data, + *tx_pkts[0]->rearm_data); + + /* refcnt=1 and nb_segs=1 */ + __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA); + __m256i head_rooms = _mm256_set1_epi16(head_size); + + /* Check refcnt and nb_segs */ + cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref); + if (cmp & 0x6666) + return -1; + + /* Check headroom is enough */ + cmp = _mm256_mask_cmp_epu16_mask(0x1111, mbufs, head_rooms, + _MM_CMPINT_LT); + if (unlikely(cmp)) + return -1; + + __m512i dxps = _mm512_set_epi64(0x1, (uint64_t)tx_pkts[3], + 0x1, (uint64_t)tx_pkts[2], + 0x1, (uint64_t)tx_pkts[1], + 0x1, (uint64_t)tx_pkts[0]); + + _mm512_storeu_si512((void *)&vq->vq_descx[idx], dxps); + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + tx_pkts[i]->data_off -= head_size; + tx_pkts[i]->data_len += head_size; + } + +#ifdef RTE_VIRTIO_USER + __m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])), + tx_pkts[2]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])), + tx_pkts[1]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])), + tx_pkts[0]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0]))); +#else + __m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len, + tx_pkts[3]->buf_iova, + tx_pkts[2]->data_len, + tx_pkts[2]->buf_iova, + tx_pkts[1]->data_len, + tx_pkts[1]->buf_iova, + tx_pkts[0]->data_len, + tx_pkts[0]->buf_iova); +#endif + + /* id offset and data offset */ + __m512i data_offsets = _mm512_set_epi64((uint64_t)3 << ID_OFFSET, + tx_pkts[3]->data_off, + (uint64_t)2 << ID_OFFSET, + tx_pkts[2]->data_off, + (uint64_t)1 << ID_OFFSET, + tx_pkts[1]->data_off, + 0, tx_pkts[0]->data_off); + + __m512i new_descs = _mm512_add_epi64(descs_base, data_offsets); + + uint64_t flags_temp = (uint64_t)idx << ID_OFFSET | + (uint64_t)vq->vq_packed.cached_flags << FLAG_OFFSET; + + /* flags offset and guest virtual address offset */ +#ifdef RTE_VIRTIO_USER + __m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset); +#else + __m128i flag_offset = _mm_set_epi64x(flags_temp, 0); +#endif + __m512i flag_offsets = _mm512_broadcast_i32x4(flag_offset); + + __m512i descs = _mm512_add_epi64(new_descs, flag_offsets); + + if (!vq->hw->has_tx_offload) { + __m128i mask = _mm_set1_epi16(0xFFFF); + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + __m128i v_hdr = _mm_loadu_si128((void *)hdr); + if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK, + v_hdr, mask))) { + __m128i all_zero = _mm_setzero_si128(); + _mm_mask_storeu_epi16((void *)hdr, + NET_HDR_MASK, all_zero); + } + } + } else 
{ + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + virtqueue_xmit_offload(hdr, tx_pkts[i], true); + } + } + + /* Enqueue Packet buffers */ + rte_smp_wmb(); + _mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], descs); + + virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len, + tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len, + tx_pkts[3]->pkt_len); + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + vq->vq_free_cnt -= PACKED_BATCH_SIZE; + + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + + return 0; +} + +static inline int +virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf *txm) +{ + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint16_t slots, can_push; + int16_t need; + + /* How many main ring entries are needed to this Tx? + * any_layout => number of segments + * default => number of segments + 1 + */ + can_push = rte_mbuf_refcnt_read(txm) == 1 && + RTE_MBUF_DIRECT(txm) && + txm->nb_segs == 1 && + rte_pktmbuf_headroom(txm) >= hdr_size; + + slots = txm->nb_segs + !can_push; + need = slots - vq->vq_free_cnt; + + /* Positive value indicates it need free vring descriptors */ + if (unlikely(need > 0)) { + virtio_xmit_cleanup_packed_vec(vq); + need = slots - vq->vq_free_cnt; + if (unlikely(need > 0)) { + PMD_TX_LOG(ERR, + "No free tx descriptors to transmit"); + return -1; + } + } + + /* Enqueue Packet buffers */ + virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1); + + txvq->stats.bytes += txm->pkt_len; + return 0; +} + +uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_tx *txvq = tx_queue; + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t nb_tx = 0; + uint16_t remained; + + if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) + return nb_tx; + + if (unlikely(nb_pkts < 1)) + return nb_pkts; + + PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts); + + if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh) + virtio_xmit_cleanup_packed_vec(vq); + + remained = RTE_MIN(nb_pkts, vq->vq_free_cnt); + + while (remained) { + if (remained >= PACKED_BATCH_SIZE) { + if (!virtqueue_enqueue_batch_packed_vec(txvq, + &tx_pkts[nb_tx])) { + nb_tx += PACKED_BATCH_SIZE; + remained -= PACKED_BATCH_SIZE; + continue; + } + } + if (!virtqueue_enqueue_single_packed_vec(txvq, + tx_pkts[nb_tx])) { + nb_tx++; + remained--; + continue; + } + break; + }; + + txvq->stats.packets += nb_tx; + + if (likely(nb_tx)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_TX_LOG(DEBUG, "Notified backend after xmit"); + } + } + + return nb_tx; +} + /* Optionally fill offload information in structure */ static inline int virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
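The REF_CNT_OFFSET/SEG_NUM_OFFSET constants in the Tx batch check rely on the rte_mbuf rearm_data word, in which data_off, refcnt, nb_segs and port share 64 bits. A small sketch of that assumption; the offsets hold only for the standard mbuf layout:

    #include <stdint.h>

    /* 64-bit view of the mbuf 'rearm_data' word assumed above:
     * bits  0-15  data_off
     * bits 16-31  refcnt
     * bits 32-47  nb_segs
     * bits 48-63  port
     */
    #define REF_CNT_OFFSET 16
    #define SEG_NUM_OFFSET 32
    #define DEFAULT_REARM_DATA (1ULL << SEG_NUM_OFFSET | 1ULL << REF_CNT_OFFSET)

    static inline uint16_t
    rearm_refcnt(uint64_t rearm_data)
    {
        return (uint16_t)(rearm_data >> REF_CNT_OFFSET);
    }

    static inline uint16_t
    rearm_nb_segs(uint64_t rearm_data)
    {
        return (uint16_t)(rearm_data >> SEG_NUM_OFFSET);
    }

    /* DEFAULT_REARM_DATA therefore encodes refcnt == 1 and nb_segs == 1,
     * which is what the 0x6666 lane mask compares each mbuf against.
     */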
* [dpdk-dev] [PATCH v5 8/9] net/virtio: add election for vectorized path 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu ` (6 preceding siblings ...) 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu @ 2020-04-16 15:31 ` Marvin Liu 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 9/9] doc: add packed " Marvin Liu 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Rewrite vectorized path selection logic. Default setting comes from RTE_LIBRTE_VIRTIO_INC_VECTOR option. Paths criteria will be checked as listed below. Packed ring vectorized path will be selected when: vectorized option is enabled AVX512F and required extensions are supported by compiler and host virtio VERSION_1 and IN_ORDER features are negotiated virtio mergeable feature is not negotiated LRO offloading is disabled Split ring vectorized rx path will be selected when: vectorized option is enabled virtio mergeable and IN_ORDER features are not negotiated LRO, chksum and vlan strip offloading are disabled Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 4c7d60ca0..de4cef843 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1518,9 +1518,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) if (vtpci_packed_queue(hw)) { PMD_INIT_LOG(INFO, "virtio: using packed ring %s Tx path on port %u", - hw->use_inorder_tx ? "inorder" : "standard", + hw->use_vec_tx ? "vectorized" : "standard", eth_dev->data->port_id); - eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; + if (hw->use_vec_tx) + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec; + else + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; } else { if (hw->use_inorder_tx) { PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u", @@ -1534,7 +1537,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) } if (vtpci_packed_queue(hw)) { - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + if (hw->use_vec_rx) { + PMD_INIT_LOG(INFO, + "virtio: using packed ring vectorized Rx path on port %u", + eth_dev->data->port_id); + eth_dev->rx_pkt_burst = + &virtio_recv_pkts_packed_vec; + } else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { PMD_INIT_LOG(INFO, "virtio: using packed ring mergeable buffer Rx path on port %u", eth_dev->data->port_id); @@ -1548,7 +1557,7 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) } } else { if (hw->use_vec_rx) { - PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u", + PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u", eth_dev->data->port_id); eth_dev->rx_pkt_burst = virtio_recv_pkts_vec; } else if (hw->use_inorder_rx) { @@ -1921,6 +1930,10 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) goto err_virtio_init; hw->opened = true; +#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR + hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#endif return 0; @@ -2157,31 +2170,63 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { - hw->use_inorder_tx = 1; - hw->use_inorder_rx = 1; - hw->use_vec_rx = 0; - } - if (vtpci_packed_queue(hw)) { - hw->use_vec_rx = 0; - hw->use_inorder_rx = 0; - } +#if defined RTE_ARCH_X86 + if ((hw->use_vec_rx || hw->use_vec_tx) && + (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) || + !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) 
|| + !vtpci_with_feature(hw, VIRTIO_F_VERSION_1))) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorization for requirements are not met"); + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; + } +#endif + + if (hw->use_vec_rx) { + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } + if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for TCP_LRO enabled"); + hw->use_vec_rx = 0; + } + } + } else { + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { + hw->use_inorder_tx = 1; + hw->use_inorder_rx = 1; + hw->use_vec_rx = 0; + } + + if (hw->use_vec_rx) { #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM - if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_vec_rx = 0; - } + if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorization for requirements are not met"); + hw->use_vec_rx = 0; + } #endif - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_vec_rx = 0; - } + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } - if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | - DEV_RX_OFFLOAD_TCP_CKSUM | - DEV_RX_OFFLOAD_TCP_LRO | - DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_vec_rx = 0; + if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | + DEV_RX_OFFLOAD_TCP_CKSUM | + DEV_RX_OFFLOAD_TCP_LRO | + DEV_RX_OFFLOAD_VLAN_STRIP)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for offloading enabled"); + hw->use_vec_rx = 0; + } + } + } return 0; } -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
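The hunk above is easier to follow when the packed ring part is read on its own. Below is a minimal restatement under the names used in this patch; virtio_packed_vec_requirements() itself is a hypothetical helper added only for readability, the real change keeps these checks inline in virtio_dev_configure() and logs why a path was rejected:

    static void
    virtio_packed_vec_requirements(struct virtio_hw *hw, uint64_t rx_offloads)
    {
            /* CPU and feature-bit requirements shared by Rx and Tx. */
            if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
                !vtpci_with_feature(hw, VIRTIO_F_VERSION_1) ||
                !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) {
                    hw->use_vec_rx = 0;
                    hw->use_vec_tx = 0;
                    return;
            }

            /* Rx additionally requires no mergeable buffers and no LRO. */
            if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF) ||
                (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO))
                    hw->use_vec_rx = 0;
    }

Note the asymmetry: a failed Rx-only check clears hw->use_vec_rx but leaves hw->use_vec_tx set, so the vectorized Tx path can still be elected on its own.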
* [dpdk-dev] [PATCH v5 9/9] doc: add packed vectorized path 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu ` (7 preceding siblings ...) 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 8/9] net/virtio: add election for vectorized path Marvin Liu @ 2020-04-16 15:31 ` Marvin Liu 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 15:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Document packed virtqueue vectorized path selection logic in virtio net PMD. Add packed virtqueue vectorized path features to new ini file. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/doc/guides/nics/features/virtio-packed_vec.ini b/doc/guides/nics/features/virtio-packed_vec.ini new file mode 100644 index 000000000..b239bcaad --- /dev/null +++ b/doc/guides/nics/features/virtio-packed_vec.ini @@ -0,0 +1,22 @@ +; +; Supported features of the 'virtio_packed_vec' network poll mode driver. +; +; Refer to default.ini for the full list of available PMD features. +; +[Features] +Speed capabilities = P +Link status = Y +Link status event = Y +Rx interrupt = Y +Queue start/stop = Y +Promiscuous mode = Y +Allmulticast mode = Y +Unicast MAC filter = Y +Multicast MAC filter = Y +VLAN filter = Y +Basic stats = Y +Stats per queue = Y +BSD nic_uio = Y +Linux UIO = Y +Linux VFIO = Y +x86-64 = Y diff --git a/doc/guides/nics/features/virtio_vec.ini b/doc/guides/nics/features/virtio-split_vec.ini similarity index 88% rename from doc/guides/nics/features/virtio_vec.ini rename to doc/guides/nics/features/virtio-split_vec.ini index e60fe36ae..4142fc9f0 100644 --- a/doc/guides/nics/features/virtio_vec.ini +++ b/doc/guides/nics/features/virtio-split_vec.ini @@ -1,5 +1,5 @@ ; -; Supported features of the 'virtio_vec' network poll mode driver. +; Supported features of the 'virtio_split_vec' network poll mode driver. ; ; Refer to default.ini for the full list of available PMD features. ; diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index d1f5fb898..be07744ce 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -403,6 +403,11 @@ Below devargs are supported by the virtio-user vdev: It is used to enable virtio device packed virtqueue feature. (Default: 0 (disabled)) +#. ``vectorized``: + + It is used to enable virtio device vectorized path. + (Default: 0 (disabled)) + Virtio paths Selection and Usage -------------------------------- @@ -454,6 +459,13 @@ according to below configuration: both negotiated, this path will be selected. #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and Rx mergeable is not negotiated, this path will be selected. +#. Packed virtqueue vectorized Rx path: If building and running environment support + AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated && + TCP_LRO Rx offloading is disabled && vectorized option enabled, + this path will be selected. +#. Packed virtqueue vectorized Tx path: If building and running environment support + AVX512 && in-order feature is negotiated && vectorized option enabled, + this path will be selected. 
Rx/Tx callbacks of each Virtio path ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -476,6 +488,8 @@ are shown in below table: Packed virtqueue non-meregable path virtio_recv_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order mergeable path virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed virtio_xmit_pkts_packed + Packed virtqueue vectorized Rx path virtio_recv_pkts_packed_vec virtio_xmit_pkts_packed + Packed virtqueue vectorized Tx path virtio_recv_pkts_packed virtio_xmit_pkts_packed_vec ============================================ ================================= ======================== Virtio paths Support Status from Release to Release @@ -493,20 +507,22 @@ All virtio paths support status are shown in below table: .. table:: Virtio Paths and Releases - ============================================ ============= ============= ============= - Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 - ============================================ ============= ============= ============= - Split virtqueue mergeable path Y Y Y - Split virtqueue non-mergeable path Y Y Y - Split virtqueue vectorized Rx path Y Y Y - Split virtqueue simple Tx path Y N N - Split virtqueue in-order mergeable path Y Y - Split virtqueue in-order non-mergeable path Y Y - Packed virtqueue mergeable path Y - Packed virtqueue non-mergeable path Y - Packed virtqueue in-order mergeable path Y - Packed virtqueue in-order non-mergeable path Y - ============================================ ============= ============= ============= + ============================================ ============= ============= ============= ======= + Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~ + ============================================ ============= ============= ============= ======= + Split virtqueue mergeable path Y Y Y Y + Split virtqueue non-mergeable path Y Y Y Y + Split virtqueue vectorized Rx path Y Y Y Y + Split virtqueue simple Tx path Y N N N + Split virtqueue in-order mergeable path Y Y Y + Split virtqueue in-order non-mergeable path Y Y Y + Packed virtqueue mergeable path Y Y + Packed virtqueue non-mergeable path Y Y + Packed virtqueue in-order mergeable path Y Y + Packed virtqueue in-order non-mergeable path Y Y + Packed virtqueue vectorized Rx path Y + Packed virtqueue vectorized Tx path Y + ============================================ ============= ============= ============= ======= QEMU Support Status ~~~~~~~~~~~~~~~~~~~ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
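For readers who want to try the paths described above, a typical way to exercise the packed vectorized path is testpmd with a virtio-user vdev. The command below is only illustrative (binary path, cores, socket path and MAC address are examples) and assumes a vhost backend that negotiates VERSION_1, IN_ORDER and packed ring while mergeable buffers stay off; 'vectorized' is the devarg added by this series, the other devargs already exist in the virtio-user PMD:

    ./x86_64-native-linux-gcc/app/testpmd -l 1-2 -n 4 --no-pci \
        --vdev 'net_virtio_user0,mac=00:01:02:03:04:05,path=/tmp/vhost-user.sock,mrg_rxbuf=0,in_order=1,packed_vq=1,vectorized=1' \
        -- -i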
* [dpdk-dev] [PATCH v6 0/9] add packed ring vectorized path 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu ` (10 preceding siblings ...) 2020-04-16 15:31 ` [dpdk-dev] [PATCH v5 0/9] add packed ring vectorized path Marvin Liu @ 2020-04-16 22:24 ` Marvin Liu 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 1/9] net/virtio: add Rx free threshold setting Marvin Liu ` (8 more replies) 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu ` (5 subsequent siblings) 17 siblings, 9 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu This patch set introduced vectorized path for packed ring. The size of packed ring descriptor is 16Bytes. Four batched descriptors are just placed into one cacheline. AVX512 instructions can well handle this kind of data. Packed ring TX path can fully transformed into vectorized path. Packed ring Rx path can be vectorized when requirements met(LRO and mergeable disabled). New option RTE_LIBRTE_VIRTIO_INC_VECTOR will be introduced in this patch set. This option will unify split and packed ring vectorized path default setting. Meanwhile user can specify whether enable vectorized path at runtime by 'vectorized' parameter of virtio user vdev. v6: 1. fix issue when size not power of 2 v5: 1. remove cpuflags definition as required extensions always come with AVX512F on x86_64 2. inorder actions should depend on feature bit 3. check ring type in rx queue setup 4. rewrite some commit logs 5. fix some checkpatch warnings v4: 1. rename 'packed_vec' to 'vectorized', also used in split ring 2. add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev 3. check required AVX512 extensions cpuflags 4. combine split and packed ring datapath selection logic 5. remove limitation that size must power of two 6. clear 12Bytes virtio_net_hdr v3: 1. remove virtio_net_hdr array for better performance 2. disable 'packed_vec' by default v2: 1. more function blocks replaced by vector instructions 2. clean virtio_net_hdr by vector instruction 3. allow header room size change 4. add 'packed_vec' option in virtio_user vdev 5. fix build not check whether AVX512 enabled 6. 
doc update Marvin Liu (9): net/virtio: add Rx free threshold setting net/virtio: enable vectorized path net/virtio: inorder should depend on feature bit net/virtio-user: add vectorized path parameter net/virtio: add vectorized packed ring Rx path net/virtio: reuse packed ring xmit functions net/virtio: add vectorized packed ring Tx path net/virtio: add election for vectorized path doc: add packed vectorized path config/common_base | 1 + .../nics/features/virtio-packed_vec.ini | 22 + .../{virtio_vec.ini => virtio-split_vec.ini} | 2 +- doc/guides/nics/virtio.rst | 44 +- drivers/net/virtio/Makefile | 36 + drivers/net/virtio/meson.build | 27 +- drivers/net/virtio/virtio_ethdev.c | 95 ++- drivers/net/virtio/virtio_ethdev.h | 6 + drivers/net/virtio/virtio_pci.h | 3 +- drivers/net/virtio/virtio_rxtx.c | 212 ++---- drivers/net/virtio/virtio_rxtx_packed_avx.c | 652 ++++++++++++++++++ drivers/net/virtio/virtio_user_ethdev.c | 39 +- drivers/net/virtio/virtqueue.c | 7 +- drivers/net/virtio/virtqueue.h | 168 ++++- 14 files changed, 1093 insertions(+), 221 deletions(-) create mode 100644 doc/guides/nics/features/virtio-packed_vec.ini rename doc/guides/nics/features/{virtio_vec.ini => virtio-split_vec.ini} (88%) create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
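A hedged compile-time restatement of the sizing argument in this cover letter (x86 only; packed_batch_size_sanity() is a hypothetical helper added purely to host the asserts, and it assumes the driver's internal virtio_ring.h for struct vring_packed_desc):

    #include <rte_common.h>
    #include "virtio_ring.h"

    static inline void
    packed_batch_size_sanity(void)
    {
            /* 16 B descriptor, 64 B cache line / __m512i register: batch of 4. */
            RTE_BUILD_BUG_ON(sizeof(struct vring_packed_desc) != 16);
            RTE_BUILD_BUG_ON(RTE_CACHE_LINE_SIZE /
                             sizeof(struct vring_packed_desc) != 4);
    }

This is the same ratio the series later encodes as PACKED_BATCH_SIZE.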
* [dpdk-dev] [PATCH v6 1/9] net/virtio: add Rx free threshold setting 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu @ 2020-04-16 22:24 ` Marvin Liu 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 2/9] net/virtio: enable vectorized path Marvin Liu ` (7 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Introduce free threshold setting in Rx queue, default value of it is 32. Limiated threshold size to multiple of four as only vectorized packed Rx function will utilize it. Virtio driver will rearm Rx queue when more than rx_free_thresh descs were dequeued. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 060410577..94ba7a3ec 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, struct virtio_hw *hw = dev->data->dev_private; struct virtqueue *vq = hw->vqs[vtpci_queue_idx]; struct virtnet_rx *rxvq; + uint16_t rx_free_thresh; PMD_INIT_FUNC_TRACE(); @@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, return -EINVAL; } + rx_free_thresh = rx_conf->rx_free_thresh; + if (rx_free_thresh == 0) + rx_free_thresh = + RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH); + + if (rx_free_thresh & 0x3) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four." + " (rx_free_thresh=%u port=%u queue=%u)\n", + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + + if (rx_free_thresh >= vq->vq_nentries) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the " + "number of RX entries (%u)." + " (rx_free_thresh=%u port=%u queue=%u)\n", + vq->vq_nentries, + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + vq->vq_free_thresh = rx_free_thresh; + if (nb_desc == 0 || nb_desc > vq->vq_nentries) nb_desc = vq->vq_nentries; vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc); diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 58ad7309a..6301c56b2 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,8 @@ struct rte_mbuf; +#define DEFAULT_RX_FREE_THRESH 32 + /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
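From the application side the new check only matters when rx_free_thresh is set explicitly. A minimal sketch of a value that passes the new validation; port id, queue id, descriptor count and mbuf_pool are placeholders:

    struct rte_eth_dev_info dev_info;
    struct rte_eth_rxconf rxconf;
    int ret;

    rte_eth_dev_info_get(port_id, &dev_info);
    rxconf = dev_info.default_rxconf;
    rxconf.rx_free_thresh = 64;     /* multiple of 4 and < nb_rx_desc */

    ret = rte_eth_rx_queue_setup(port_id, 0, 256,
                                 rte_eth_dev_socket_id(port_id),
                                 &rxconf, mbuf_pool);

Leaving rx_free_thresh at 0 keeps the driver default of min(ring size / 4, 32); a value that is not a multiple of four, or not smaller than the ring size, now makes queue setup return -EINVAL.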
* [dpdk-dev] [PATCH v6 2/9] net/virtio: enable vectorized path 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 1/9] net/virtio: add Rx free threshold setting Marvin Liu @ 2020-04-16 22:24 ` Marvin Liu 2020-04-20 14:08 ` Maxime Coquelin 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 3/9] net/virtio: inorder should depend on feature bit Marvin Liu ` (6 subsequent siblings) 8 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Previously, virtio split ring vectorized path is enabled as default. This is not suitable for everyone because of that path not follow virtio spec. Add new config for virtio vectorized path selection. By default vectorized path is enabled. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/config/common_base b/config/common_base index c31175f9d..5901a94f7 100644 --- a/config/common_base +++ b/config/common_base @@ -449,6 +449,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=y # # Compile virtio device emulation inside virtio PMD driver diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index efdcb0d93..9ef445bc9 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -29,6 +29,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y) ifeq ($(CONFIG_RTE_ARCH_X86),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y) @@ -36,6 +37,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif +endif ifeq ($(CONFIG_RTE_VIRTIO_USER),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index 5e7ca855c..f9619a108 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -9,12 +9,14 @@ sources += files('virtio_ethdev.c', 'virtqueue.c') deps += ['kvargs', 'bus_pci'] -if arch_subdir == 'x86' - sources += files('virtio_rxtx_simple_sse.c') -elif arch_subdir == 'ppc' - sources += files('virtio_rxtx_simple_altivec.c') -elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64') - sources += files('virtio_rxtx_simple_neon.c') +if dpdk_conf.has('RTE_LIBRTE_VIRTIO_INC_VECTOR') + if arch_subdir == 'x86' + sources += files('virtio_rxtx_simple_sse.c') + elif arch_subdir == 'ppc' + sources += files('virtio_rxtx_simple_altivec.c') + elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64') + sources += files('virtio_rxtx_simple_neon.c') + endif endif if is_linux -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
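For completeness, a rough way to flip the new option with the legacy make build (directory and target names are examples); with meson the flag simply comes from dpdk_conf as in the hunk above, which is part of what the review below is concerned about:

    make config T=x86_64-native-linux-gcc O=build
    sed -i 's/CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=y/CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=n/' \
        build/.config
    make -j O=build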
* Re: [dpdk-dev] [PATCH v6 2/9] net/virtio: enable vectorized path 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 2/9] net/virtio: enable vectorized path Marvin Liu @ 2020-04-20 14:08 ` Maxime Coquelin 2020-04-21 6:43 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: Maxime Coquelin @ 2020-04-20 14:08 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev Hi Marvin, On 4/17/20 12:24 AM, Marvin Liu wrote: > Previously, virtio split ring vectorized path is enabled as default. > This is not suitable for everyone because of that path not follow virtio > spec. Add new config for virtio vectorized path selection. By default > vectorized path is enabled. It should be disabled by default if not following spec. Also, it means it will always be enabled with Meson, which is not acceptable. I think we should have a devarg, so that it is built by default but disabled. User would specify explicitly he wants to enable vector support when probing the device. Thanks, Maxime > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > diff --git a/config/common_base b/config/common_base > index c31175f9d..5901a94f7 100644 > --- a/config/common_base > +++ b/config/common_base > @@ -449,6 +449,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y > CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n > CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n > CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n > +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=y > > # > # Compile virtio device emulation inside virtio PMD driver > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile > index efdcb0d93..9ef445bc9 100644 > --- a/drivers/net/virtio/Makefile > +++ b/drivers/net/virtio/Makefile > @@ -29,6 +29,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c > > +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y) > ifeq ($(CONFIG_RTE_ARCH_X86),y) > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c > else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y) > @@ -36,6 +37,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c > else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c > endif > +endif > > ifeq ($(CONFIG_RTE_VIRTIO_USER),y) > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c > diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build > index 5e7ca855c..f9619a108 100644 > --- a/drivers/net/virtio/meson.build > +++ b/drivers/net/virtio/meson.build > @@ -9,12 +9,14 @@ sources += files('virtio_ethdev.c', > 'virtqueue.c') > deps += ['kvargs', 'bus_pci'] > > -if arch_subdir == 'x86' > - sources += files('virtio_rxtx_simple_sse.c') > -elif arch_subdir == 'ppc' > - sources += files('virtio_rxtx_simple_altivec.c') > -elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64') > - sources += files('virtio_rxtx_simple_neon.c') > +if dpdk_conf.has('RTE_LIBRTE_VIRTIO_INC_VECTOR') > + if arch_subdir == 'x86' > + sources += files('virtio_rxtx_simple_sse.c') > + elif arch_subdir == 'ppc' > + sources += files('virtio_rxtx_simple_altivec.c') > + elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64') > + sources += files('virtio_rxtx_simple_neon.c') > + endif > endif > > if is_linux > ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v6 2/9] net/virtio: enable vectorized path 2020-04-20 14:08 ` Maxime Coquelin @ 2020-04-21 6:43 ` Liu, Yong 2020-04-22 8:07 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: Liu, Yong @ 2020-04-21 6:43 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Monday, April 20, 2020 10:08 PM > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > Wang, Zhihong <zhihong.wang@intel.com> > Cc: dev@dpdk.org > Subject: Re: [PATCH v6 2/9] net/virtio: enable vectorized path > > Hi Marvin, > > On 4/17/20 12:24 AM, Marvin Liu wrote: > > Previously, virtio split ring vectorized path is enabled as default. > > This is not suitable for everyone because of that path not follow virtio > > spec. Add new config for virtio vectorized path selection. By default > > vectorized path is enabled. > > It should be disabled by default if not following spec. Also, it means > it will always be enabled with Meson, which is not acceptable. > > I think we should have a devarg, so that it is built by default but > disabled. User would specify explicitly he wants to enable vector > support when probing the device. > Thanks, Maxime. Will change to disable as default in next version. > Thanks, > Maxime > > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > > > diff --git a/config/common_base b/config/common_base > > index c31175f9d..5901a94f7 100644 > > --- a/config/common_base > > +++ b/config/common_base > > @@ -449,6 +449,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y > > CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n > > CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n > > CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n > > +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=y > > > > # > > # Compile virtio device emulation inside virtio PMD driver > > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile > > index efdcb0d93..9ef445bc9 100644 > > --- a/drivers/net/virtio/Makefile > > +++ b/drivers/net/virtio/Makefile > > @@ -29,6 +29,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += > virtio_rxtx.c > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c > > > > +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y) > > ifeq ($(CONFIG_RTE_ARCH_X86),y) > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c > > else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y) > > @@ -36,6 +37,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += > virtio_rxtx_simple_altivec.c > > else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) > $(CONFIG_RTE_ARCH_ARM64)),) > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c > > endif > > +endif > > > > ifeq ($(CONFIG_RTE_VIRTIO_USER),y) > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c > > diff --git a/drivers/net/virtio/meson.build > b/drivers/net/virtio/meson.build > > index 5e7ca855c..f9619a108 100644 > > --- a/drivers/net/virtio/meson.build > > +++ b/drivers/net/virtio/meson.build > > @@ -9,12 +9,14 @@ sources += files('virtio_ethdev.c', > > 'virtqueue.c') > > deps += ['kvargs', 'bus_pci'] > > > > -if arch_subdir == 'x86' > > - sources += files('virtio_rxtx_simple_sse.c') > > -elif arch_subdir == 'ppc' > > - sources += files('virtio_rxtx_simple_altivec.c') > > -elif arch_subdir == 'arm' and > host_machine.cpu_family().startswith('aarch64') > > - sources += files('virtio_rxtx_simple_neon.c') > > +if dpdk_conf.has('RTE_LIBRTE_VIRTIO_INC_VECTOR') > > + if arch_subdir == 'x86' > > + 
sources += files('virtio_rxtx_simple_sse.c') > > + elif arch_subdir == 'ppc' > > + sources += files('virtio_rxtx_simple_altivec.c') > > + elif arch_subdir == 'arm' and > host_machine.cpu_family().startswith('aarch64') > > + sources += files('virtio_rxtx_simple_neon.c') > > + endif > > endif > > > > if is_linux > > ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v6 2/9] net/virtio: enable vectorized path 2020-04-21 6:43 ` Liu, Yong @ 2020-04-22 8:07 ` Liu, Yong 0 siblings, 0 replies; 162+ messages in thread From: Liu, Yong @ 2020-04-22 8:07 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev > -----Original Message----- > From: Liu, Yong > Sent: Tuesday, April 21, 2020 2:43 PM > To: 'Maxime Coquelin' <maxime.coquelin@redhat.com>; Ye, Xiaolong > <xiaolong.ye@intel.com>; Wang, Zhihong <zhihong.wang@intel.com> > Cc: dev@dpdk.org > Subject: RE: [PATCH v6 2/9] net/virtio: enable vectorized path > > > > > -----Original Message----- > > From: Maxime Coquelin <maxime.coquelin@redhat.com> > > Sent: Monday, April 20, 2020 10:08 PM > > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > > Wang, Zhihong <zhihong.wang@intel.com> > > Cc: dev@dpdk.org > > Subject: Re: [PATCH v6 2/9] net/virtio: enable vectorized path > > > > Hi Marvin, > > > > On 4/17/20 12:24 AM, Marvin Liu wrote: > > > Previously, virtio split ring vectorized path is enabled as default. > > > This is not suitable for everyone because of that path not follow virtio > > > spec. Add new config for virtio vectorized path selection. By default > > > vectorized path is enabled. > > > > It should be disabled by default if not following spec. Also, it means > > it will always be enabled with Meson, which is not acceptable. > > > > I think we should have a devarg, so that it is built by default but > > disabled. User would specify explicitly he wants to enable vector > > support when probing the device. > > > Hi Maxime, There's one new parameter "vectorized" in devarg which allow user specific whether enable or disable vectorized path. By now this parameter depend on RTE_LIBRTE_VIRTIO_INC_VECTOR, parameter won't be used if INC_VECTOR option is disable. Regards, Marvin > Thanks, Maxime. Will change to disable as default in next version. 
> > > Thanks, > > Maxime > > > > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > > > > > diff --git a/config/common_base b/config/common_base > > > index c31175f9d..5901a94f7 100644 > > > --- a/config/common_base > > > +++ b/config/common_base > > > @@ -449,6 +449,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y > > > CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n > > > CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n > > > CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n > > > +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=y > > > > > > # > > > # Compile virtio device emulation inside virtio PMD driver > > > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile > > > index efdcb0d93..9ef445bc9 100644 > > > --- a/drivers/net/virtio/Makefile > > > +++ b/drivers/net/virtio/Makefile > > > @@ -29,6 +29,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += > > virtio_rxtx.c > > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c > > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c > > > > > > +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y) > > > ifeq ($(CONFIG_RTE_ARCH_X86),y) > > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c > > > else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y) > > > @@ -36,6 +37,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += > > virtio_rxtx_simple_altivec.c > > > else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) > > $(CONFIG_RTE_ARCH_ARM64)),) > > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c > > > endif > > > +endif > > > > > > ifeq ($(CONFIG_RTE_VIRTIO_USER),y) > > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c > > > diff --git a/drivers/net/virtio/meson.build > > b/drivers/net/virtio/meson.build > > > index 5e7ca855c..f9619a108 100644 > > > --- a/drivers/net/virtio/meson.build > > > +++ b/drivers/net/virtio/meson.build > > > @@ -9,12 +9,14 @@ sources += files('virtio_ethdev.c', > > > 'virtqueue.c') > > > deps += ['kvargs', 'bus_pci'] > > > > > > -if arch_subdir == 'x86' > > > - sources += files('virtio_rxtx_simple_sse.c') > > > -elif arch_subdir == 'ppc' > > > - sources += files('virtio_rxtx_simple_altivec.c') > > > -elif arch_subdir == 'arm' and > > host_machine.cpu_family().startswith('aarch64') > > > - sources += files('virtio_rxtx_simple_neon.c') > > > +if dpdk_conf.has('RTE_LIBRTE_VIRTIO_INC_VECTOR') > > > + if arch_subdir == 'x86' > > > + sources += files('virtio_rxtx_simple_sse.c') > > > + elif arch_subdir == 'ppc' > > > + sources += files('virtio_rxtx_simple_altivec.c') > > > + elif arch_subdir == 'arm' and > > host_machine.cpu_family().startswith('aarch64') > > > + sources += files('virtio_rxtx_simple_neon.c') > > > + endif > > > endif > > > > > > if is_linux > > > ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v6 3/9] net/virtio: inorder should depend on feature bit 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 1/9] net/virtio: add Rx free threshold setting Marvin Liu 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 2/9] net/virtio: enable vectorized path Marvin Liu @ 2020-04-16 22:24 ` Marvin Liu 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 4/9] net/virtio-user: add vectorized path parameter Marvin Liu ` (5 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Ring initialzation is different when inorder feature negotiated. This action should dependent on negotiated feature bits. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 94ba7a3ec..e450477e8 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -989,6 +989,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) struct rte_mbuf *m; uint16_t desc_idx; int error, nbufs, i; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); PMD_INIT_FUNC_TRACE(); @@ -1018,7 +1019,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; } - } else if (hw->use_inorder_rx) { + } else if (!vtpci_packed_queue(vq->hw) && in_order) { if ((!virtqueue_full(vq))) { uint16_t free_cnt = vq->vq_free_cnt; struct rte_mbuf *pkts[free_cnt]; @@ -1133,7 +1134,7 @@ virtio_dev_tx_queue_setup_finish(struct rte_eth_dev *dev, PMD_INIT_FUNC_TRACE(); if (!vtpci_packed_queue(hw)) { - if (hw->use_inorder_tx) + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) vq->vq_split.ring.desc[vq->vq_nentries - 1].next = 0; } @@ -2046,7 +2047,7 @@ virtio_xmit_pkts_packed(void *tx_queue, struct rte_mbuf **tx_pkts, struct virtio_hw *hw = vq->hw; uint16_t hdr_size = hw->vtnet_hdr_size; uint16_t nb_tx = 0; - bool in_order = hw->use_inorder_tx; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) return nb_tx; -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v6 4/9] net/virtio-user: add vectorized path parameter 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu ` (2 preceding siblings ...) 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 3/9] net/virtio: inorder should depend on feature bit Marvin Liu @ 2020-04-16 22:24 ` Marvin Liu 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu ` (4 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Add new parameter "vectorized" which can select vectorized path explicitly. This parameter will work when RTE_LIBRTE_VIRTIO_INC_VECTOR option is yes. When "vectorized" is set, driver will check both compiling environment and running environment when selecting path. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 35203940a..4c7d60ca0 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1547,7 +1547,7 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed; } } else { - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u", eth_dev->data->port_id); eth_dev->rx_pkt_burst = virtio_recv_pkts_vec; @@ -2157,33 +2157,31 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - hw->use_simple_rx = 1; - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { hw->use_inorder_tx = 1; hw->use_inorder_rx = 1; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (vtpci_packed_queue(hw)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; hw->use_inorder_rx = 0; } #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } #endif if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | DEV_RX_OFFLOAD_TCP_CKSUM | DEV_RX_OFFLOAD_TCP_LRO | DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; return 0; } diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h index 7433d2f08..36afed313 100644 --- a/drivers/net/virtio/virtio_pci.h +++ b/drivers/net/virtio/virtio_pci.h @@ -250,7 +250,8 @@ struct virtio_hw { uint8_t vlan_strip; uint8_t use_msix; uint8_t modern; - uint8_t use_simple_rx; + uint8_t use_vec_rx; + uint8_t use_vec_tx; uint8_t use_inorder_rx; uint8_t use_inorder_tx; uint8_t weak_barriers; diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index e450477e8..84f4cf946 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) /* Allocate blank mbufs for the each rx descriptor */ nbufs = 0; - if (hw->use_simple_rx) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { for (desc_idx = 0; desc_idx < vq->vq_nentries; desc_idx++) { vq->vq_split.ring.avail->ring[desc_idx] = desc_idx; @@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) &rxvq->fake_mbuf; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; diff 
--git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 5637001df..6e30acaae 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -450,6 +450,8 @@ static const char *valid_args[] = { VIRTIO_USER_ARG_IN_ORDER, #define VIRTIO_USER_ARG_PACKED_VQ "packed_vq" VIRTIO_USER_ARG_PACKED_VQ, +#define VIRTIO_USER_ARG_VECTORIZED "vectorized" + VIRTIO_USER_ARG_VECTORIZED, NULL }; @@ -518,7 +520,8 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev) */ hw->use_msix = 1; hw->modern = 0; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; hw->use_inorder_rx = 0; hw->use_inorder_tx = 0; hw->virtio_user_dev = dev; @@ -552,6 +555,8 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) uint64_t mrg_rxbuf = 1; uint64_t in_order = 1; uint64_t packed_vq = 0; + uint64_t vectorized = 0; + char *path = NULL; char *ifname = NULL; char *mac_addr = NULL; @@ -668,6 +673,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } } +#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR + if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) { + if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED, + &get_integer_arg, &vectorized) < 0) { + PMD_INIT_LOG(ERR, "error to parse %s", + VIRTIO_USER_ARG_VECTORIZED); + goto end; + } + } +#endif + if (queues > 1 && cq == 0) { PMD_INIT_LOG(ERR, "multi-q requires ctrl-q"); goto end; @@ -705,6 +721,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } hw = eth_dev->data->dev_private; + if (virtio_user_dev_init(hw->virtio_user_dev, path, queues, cq, queue_size, mac_addr, &ifname, server_mode, mrg_rxbuf, in_order, packed_vq) < 0) { @@ -720,6 +737,23 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) goto end; } + if (vectorized) { + if (packed_vq) { +#if defined(CC_AVX512_SUPPORT) + hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#else + PMD_INIT_LOG(INFO, + "building environment do not match packed ring vectorized requirement"); +#endif + } else { + hw->use_vec_rx = 1; + } + } else { + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; + } + rte_eth_dev_probing_finish(eth_dev); ret = 0; @@ -777,4 +811,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, "server=<0|1> " "mrg_rxbuf=<0|1> " "in_order=<0|1> " - "packed_vq=<0|1>"); + "packed_vq=<0|1>" + "vectorized=<0|1>"); diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c index 0b4e3bf3e..ca23180de 100644 --- a/drivers/net/virtio/virtqueue.c +++ b/drivers/net/virtio/virtqueue.c @@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq) end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1); for (idx = 0; idx < vq->vq_nentries; idx++) { - if (hw->use_simple_rx && type == VTNET_RQ) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw) && + type == VTNET_RQ) { if (start <= end && idx >= start && idx < end) continue; if (start > end && (idx >= start || idx < end)) @@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) for (i = 0; i < nb_used; i++) { used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1); uep = &vq->vq_split.ring.used->ring[used_idx]; - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { desc_idx = used_idx; rte_pktmbuf_free(vq->sw_ring[desc_idx]); vq->vq_free_cnt++; @@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) vq->vq_used_cons_idx++; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxq); if (virtqueue_kick_prepare(vq)) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ 
messages in thread
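Once the vdev is probed with vectorized=1, the selected datapath can be confirmed from the PMD init logs; with the path-election patch later in this series applied, the expected lines look like the following (port id is an example, wording taken from the PMD_INIT_LOG calls):

    virtio: using packed ring vectorized Rx path on port 0
    virtio: using packed ring vectorized Tx path on port 0

If the build lacks AVX512 support, the probe above instead falls back and prints the "building environment do not match packed ring vectorized requirement" message shown in the diff.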
* [dpdk-dev] [PATCH v6 5/9] net/virtio: add vectorized packed ring Rx path 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu ` (3 preceding siblings ...) 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 4/9] net/virtio-user: add vectorized path parameter Marvin Liu @ 2020-04-16 22:24 ` Marvin Liu 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu ` (3 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Optimize packed ring Rx path when AVX512 enabled and mergeable buffer/Rx LRO offloading are not required. Solution of optimization is pretty like vhost, is that split path into batch and single functions. Batch function is further optimized by vector instructions. Also pad desc extra structure to 16 bytes aligned, thus four elements will be saved in one batch. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index 9ef445bc9..4d20cb61a 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -37,6 +37,40 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif + +ifneq ($(FORCE_DISABLE_AVX512), y) + CC_AVX512_SUPPORT=\ + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ + grep -q AVX512 && echo 1) +endif + +ifeq ($(CC_AVX512_SUPPORT), 1) +CFLAGS += -DCC_AVX512_SUPPORT +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c + +ifeq ($(RTE_TOOLCHAIN), gcc) +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), clang) +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1) +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), icc) +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA +endif +endif + +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds +endif +endif endif ifeq ($(CONFIG_RTE_VIRTIO_USER),y) diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index f9619a108..9e0ff9761 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -11,6 +11,19 @@ deps += ['kvargs', 'bus_pci'] if dpdk_conf.has('RTE_LIBRTE_VIRTIO_INC_VECTOR') if arch_subdir == 'x86' + if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F') + if '-mno-avx512f' not in machine_args and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw') + cflags += ['-DCC_AVX512_SUPPORT'] + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' + endif + sources += files('virtio_rxtx_packed_avx.c') + endif + endif sources += files('virtio_rxtx_simple_sse.c') elif arch_subdir == 'ppc' sources += files('virtio_rxtx_simple_altivec.c') diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index 
cd8947656..10e39670e 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -104,6 +104,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 84f4cf946..7b65d0b0a 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -1246,7 +1246,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) return 0; } -#define VIRTIO_MBUF_BURST_SZ 64 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc)) uint16_t virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) @@ -2329,3 +2328,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, return nb_tx; } + +__rte_weak uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, + struct rte_mbuf **rx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c new file mode 100644 index 000000000..ffd254489 --- /dev/null +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -0,0 +1,368 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> + +#include <rte_net.h> + +#include "virtio_logs.h" +#include "virtio_ethdev.h" +#include "virtio_pci.h" +#include "virtqueue.h" + +#define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63) + +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ + sizeof(struct vring_packed_desc)) +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) + +#ifdef VIRTIO_GCC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_ICC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ + for (iter = val; iter < size; iter++) +#endif + +#ifndef virtio_for_each_try_unroll +#define virtio_for_each_try_unroll(iter, val, num) \ + for (iter = val; iter < num; iter++) +#endif + + +static inline void +virtio_update_batch_stats(struct virtnet_stats *stats, + uint16_t pkt_len1, + uint16_t pkt_len2, + uint16_t pkt_len3, + uint16_t pkt_len4) +{ + stats->bytes += pkt_len1; + stats->bytes += pkt_len2; + stats->bytes += pkt_len3; + stats->bytes += pkt_len4; +} +/* Optionally fill offload information in structure */ +static inline int +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) +{ + struct rte_net_hdr_lens hdr_lens; + uint32_t hdrlen, ptype; + int l4_supported = 0; + + /* nothing to do */ + if (hdr->flags == 0) + return 0; + + /* GSO not support in vec path, skip check */ + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; + + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); + m->packet_type = ptype; + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) + 
l4_supported = 1; + + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; + if (hdr->csum_start <= hdrlen && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; + } else { + /* Unknown proto or tunnel, do sw cksum. We can assume + * the cksum field is in the first segment since the + * buffers we provided to the host are large enough. + * In case of SCTP, this will be wrong since it's a CRC + * but there's nothing we can do. + */ + uint16_t csum = 0, off; + + rte_raw_cksum_mbuf(m, hdr->csum_start, + rte_pktmbuf_pkt_len(m) - hdr->csum_start, + &csum); + if (likely(csum != 0xffff)) + csum = ~csum; + off = hdr->csum_offset + hdr->csum_start; + if (rte_pktmbuf_data_len(m) >= off + 1) + *rte_pktmbuf_mtod_offset(m, uint16_t *, + off) = csum; + } + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; + } + + return 0; +} + +static uint16_t +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint64_t addrs[PACKED_BATCH_SIZE << 1]; + uint16_t id = vq->vq_used_cons_idx; + uint8_t desc_stats; + uint16_t i; + void *desc_addr; + + if (id & PACKED_BATCH_MASK) + return -1; + + if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries)) + return -1; + + /* only care avail/used bits */ + __m512i desc_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + desc_addr = &vq->vq_packed.ring.desc[id]; + + rte_smp_rmb(); + __m512i packed_desc = _mm512_loadu_si512(desc_addr); + __m512i flags_mask = _mm512_maskz_and_epi64(0xff, packed_desc, + desc_flags); + + __m512i used_flags; + if (vq->vq_packed.used_wrap_counter) + used_flags = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + else + used_flags = _mm512_setzero_si512(); + + /* Check all descs are used */ + desc_stats = _mm512_cmp_epu64_mask(flags_mask, used_flags, + _MM_CMPINT_EQ); + if (desc_stats != 0xff) + return -1; + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); + + addrs[i << 1] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1; + addrs[(i << 1) + 1] = + (uint64_t)rx_pkts[i]->rx_descriptor_fields1 + 8; + } + + /* addresses of pkt_len and data_len */ + __m512i vindex = _mm512_loadu_si512((void *)addrs); + + /* + * select 10b*4 load 32bit from packed_desc[95:64] + * mmask 0110b*4 save 32bit into pkt_len and data_len + */ + __m512i value = _mm512_maskz_shuffle_epi32(0x6666, packed_desc, 0xAA); + + /* mmask 0110b*4 reduce hdr_len from pkt_len and data_len */ + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(0x6666, + (uint32_t)-hdr_size); + + value = _mm512_add_epi32(value, mbuf_len_offset); + /* batch store into mbufs */ + _mm512_i64scatter_epi64(0, vindex, value, 1); + + if (hw->has_rx_offload) { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + char *addr = (char *)rx_pkts[i]->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size; + virtio_vec_rx_offload(rx_pkts[i], + (struct virtio_net_hdr *)addr); + } + } + + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, + rx_pkts[3]->pkt_len); + + vq->vq_free_cnt += PACKED_BATCH_SIZE; + + vq->vq_used_cons_idx += PACKED_BATCH_SIZE; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; 
+ } + + return 0; +} + +static uint16_t +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + uint16_t used_idx, id; + uint32_t len; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint32_t hdr_size = hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + struct vring_packed_desc *desc; + struct rte_mbuf *cookie; + + desc = vq->vq_packed.ring.desc; + used_idx = vq->vq_used_cons_idx; + if (!desc_is_used(&desc[used_idx], vq)) + return -1; + + len = desc[used_idx].len; + id = desc[used_idx].id; + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; + if (unlikely(cookie == NULL)) { + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u", + vq->vq_used_cons_idx); + return -1; + } + rte_prefetch0(cookie); + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); + + cookie->data_off = RTE_PKTMBUF_HEADROOM; + cookie->ol_flags = 0; + cookie->pkt_len = (uint32_t)(len - hdr_size); + cookie->data_len = (uint32_t)(len - hdr_size); + + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size); + if (hw->has_rx_offload) + virtio_vec_rx_offload(cookie, hdr); + + *rx_pkts = cookie; + + rxvq->stats.bytes += cookie->pkt_len; + + vq->vq_free_cnt++; + vq->vq_used_cons_idx++; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static inline void +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **cookie, + uint16_t num) +{ + struct virtqueue *vq = rxvq->vq; + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; + uint16_t flags = vq->vq_packed.cached_flags; + struct virtio_hw *hw = vq->hw; + struct vq_desc_extra *dxp; + uint16_t idx, i; + uint16_t batch_num, total_num = 0; + uint16_t head_idx = vq->vq_avail_idx; + uint16_t head_flag = vq->vq_packed.cached_flags; + uint64_t addr; + + do { + idx = vq->vq_avail_idx; + + batch_num = PACKED_BATCH_SIZE; + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) + batch_num = vq->vq_nentries - idx; + if (unlikely((total_num + batch_num) > num)) + batch_num = num - total_num; + + virtio_for_each_try_unroll(i, 0, batch_num) { + dxp = &vq->vq_descx[idx + i]; + dxp->cookie = (void *)cookie[total_num + i]; + + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) + + RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size; + start_dp[idx + i].addr = addr; + start_dp[idx + i].len = cookie[total_num + i]->buf_len + - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size; + if (total_num || i) { + virtqueue_store_flags_packed(&start_dp[idx + i], + flags, hw->weak_barriers); + } + } + + vq->vq_avail_idx += batch_num; + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + flags = vq->vq_packed.cached_flags; + } + total_num += batch_num; + } while (total_num < num); + + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, + hw->weak_barriers); + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); +} + +uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue, + struct rte_mbuf **rx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_rx *rxvq = rx_queue; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t num, nb_rx = 0; + uint32_t nb_enqueued = 0; + uint16_t free_cnt = vq->vq_free_thresh; + + if (unlikely(hw->started == 0)) + return nb_rx; + + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); + if (likely(num > PACKED_BATCH_SIZE)) + num = num - 
((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE); + + while (num) { + if (!virtqueue_dequeue_batch_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx += PACKED_BATCH_SIZE; + num -= PACKED_BATCH_SIZE; + continue; + } + if (!virtqueue_dequeue_single_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx++; + num--; + continue; + } + break; + }; + + PMD_RX_LOG(DEBUG, "dequeue:%d", num); + + rxvq->stats.packets += nb_rx; + + if (likely(vq->vq_free_cnt >= free_cnt)) { + struct rte_mbuf *new_pkts[free_cnt]; + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, + free_cnt) == 0)) { + virtio_recv_refill_packed_vec(rxvq, new_pkts, + free_cnt); + nb_enqueued += free_cnt; + } else { + struct rte_eth_dev *dev = + &rte_eth_devices[rxvq->port_id]; + dev->data->rx_mbuf_alloc_failed += free_cnt; + } + } + + if (likely(nb_enqueued)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_RX_LOG(DEBUG, "Notified"); + } + } + + return nb_rx; +} diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 6301c56b2..43e305ecc 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -20,6 +20,7 @@ struct rte_mbuf; #define DEFAULT_RX_FREE_THRESH 32 +#define VIRTIO_MBUF_BURST_SZ 64 /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO @@ -236,7 +237,8 @@ struct vq_desc_extra { void *cookie; uint16_t ndescs; uint16_t next; -}; + uint8_t padding[4]; +} __rte_packed __rte_aligned(16); struct virtqueue { struct virtio_hw *hw; /**< virtio_hw structure pointer. */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
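For readers less used to the AVX512 mask intrinsics, the availability test in virtqueue_dequeue_batch_packed_vec() can be restated in scalar form. packed_batch_is_used() below is a hypothetical helper, not part of the patch, and it relies only on the fields and macros introduced above:

    static inline int
    packed_batch_is_used(struct virtqueue *vq, uint16_t id)
    {
            uint16_t expected = vq->vq_packed.used_wrap_counter ?
                    VRING_PACKED_DESC_F_AVAIL_USED : 0;
            uint16_t i;

            for (i = 0; i < PACKED_BATCH_SIZE; i++) {
                    uint16_t flags = vq->vq_packed.ring.desc[id + i].flags;

                    /* Used means AVAIL and USED both equal the wrap counter. */
                    if ((flags & VRING_PACKED_DESC_F_AVAIL_USED) != expected)
                            return 0;
            }
            return 1;
    }

The _mm512_cmp_epu64_mask() against PACKED_FLAGS_MASK performs this whole loop as a single masked compare over the four descriptors loaded into one zmm register.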
* [dpdk-dev] [PATCH v6 6/9] net/virtio: reuse packed ring xmit functions 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu ` (4 preceding siblings ...) 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu @ 2020-04-16 22:24 ` Marvin Liu 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu ` (2 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Move xmit offload and packed ring xmit enqueue function to header file. These functions will be reused by packed ring vectorized Tx function. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 7b65d0b0a..cf18fe564 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq, return i; } -#ifndef DEFAULT_TX_FREE_THRESH -#define DEFAULT_TX_FREE_THRESH 32 -#endif - static void virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num) { @@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m) } -/* avoid write operation when necessary, to lessen cache issues */ -#define ASSIGN_UNLESS_EQUAL(var, val) do { \ - if ((var) != (val)) \ - (var) = (val); \ -} while (0) - -#define virtqueue_clear_net_hdr(_hdr) do { \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ -} while (0) - -static inline void -virtqueue_xmit_offload(struct virtio_net_hdr *hdr, - struct rte_mbuf *cookie, - bool offload) -{ - if (offload) { - if (cookie->ol_flags & PKT_TX_TCP_SEG) - cookie->ol_flags |= PKT_TX_TCP_CKSUM; - - switch (cookie->ol_flags & PKT_TX_L4_MASK) { - case PKT_TX_UDP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_udp_hdr, - dgram_cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - case PKT_TX_TCP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - default: - ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); - ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); - ASSIGN_UNLESS_EQUAL(hdr->flags, 0); - break; - } - /* TCP Segmentation Offload */ - if (cookie->ol_flags & PKT_TX_TCP_SEG) { - hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
- VIRTIO_NET_HDR_GSO_TCPV6 : - VIRTIO_NET_HDR_GSO_TCPV4; - hdr->gso_size = cookie->tso_segsz; - hdr->hdr_len = - cookie->l2_len + - cookie->l3_len + - cookie->l4_len; - } else { - ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); - ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); - ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); - } - } -} static inline void virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq, @@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq, virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers); } -static inline void -virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, - uint16_t needed, int can_push, int in_order) -{ - struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; - struct vq_desc_extra *dxp; - struct virtqueue *vq = txvq->vq; - struct vring_packed_desc *start_dp, *head_dp; - uint16_t idx, id, head_idx, head_flags; - int16_t head_size = vq->hw->vtnet_hdr_size; - struct virtio_net_hdr *hdr; - uint16_t prev; - bool prepend_header = false; - - id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; - - dxp = &vq->vq_descx[id]; - dxp->ndescs = needed; - dxp->cookie = cookie; - - head_idx = vq->vq_avail_idx; - idx = head_idx; - prev = head_idx; - start_dp = vq->vq_packed.ring.desc; - - head_dp = &vq->vq_packed.ring.desc[idx]; - head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; - head_flags |= vq->vq_packed.cached_flags; - - if (can_push) { - /* prepend cannot fail, checked by caller */ - hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, - -head_size); - prepend_header = true; - - /* if offload disabled, it is not zeroed below, do it now */ - if (!vq->hw->has_tx_offload) - virtqueue_clear_net_hdr(hdr); - } else { - /* setup first tx ring slot to point to header - * stored in reserved region. - */ - start_dp[idx].addr = txvq->virtio_net_hdr_mem + - RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); - start_dp[idx].len = vq->hw->vtnet_hdr_size; - hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } - - virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); - - do { - uint16_t flags; - - start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); - start_dp[idx].len = cookie->data_len; - if (prepend_header) { - start_dp[idx].addr -= head_size; - start_dp[idx].len += head_size; - prepend_header = false; - } - - if (likely(idx != head_idx)) { - flags = cookie->next ? 
VRING_DESC_F_NEXT : 0; - flags |= vq->vq_packed.cached_flags; - start_dp[idx].flags = flags; - } - prev = idx; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } while ((cookie = cookie->next) != NULL); - - start_dp[prev].id = id; - - vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); - vq->vq_avail_idx = idx; - - if (!in_order) { - vq->vq_desc_head_idx = dxp->next; - if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) - vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; - } - - virtqueue_store_flags_packed(head_dp, head_flags, - vq->hw->weak_barriers); -} - static inline void virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie, uint16_t needed, int use_indirect, int can_push, diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 43e305ecc..18ae34789 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,7 @@ struct rte_mbuf; +#define DEFAULT_TX_FREE_THRESH 32 #define DEFAULT_RX_FREE_THRESH 32 #define VIRTIO_MBUF_BURST_SZ 64 @@ -562,4 +563,165 @@ virtqueue_notify(struct virtqueue *vq) #define VIRTQUEUE_DUMP(vq) do { } while (0) #endif +/* avoid write operation when necessary, to lessen cache issues */ +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ + typeof(var) var_ = (var); \ + typeof(val) val_ = (val); \ + if ((var_) != (val_)) \ + (var_) = (val_); \ +} while (0) + +#define virtqueue_clear_net_hdr(hdr) do { \ + typeof(hdr) hdr_ = (hdr); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_start, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_offset, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->flags, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_type, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_size, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->hdr_len, 0); \ +} while (0) + +static inline void +virtqueue_xmit_offload(struct virtio_net_hdr *hdr, + struct rte_mbuf *cookie, + bool offload) +{ + if (offload) { + if (cookie->ol_flags & PKT_TX_TCP_SEG) + cookie->ol_flags |= PKT_TX_TCP_CKSUM; + + switch (cookie->ol_flags & PKT_TX_L4_MASK) { + case PKT_TX_UDP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_udp_hdr, + dgram_cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + case PKT_TX_TCP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + default: + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); + break; + } + + /* TCP Segmentation Offload */ + if (cookie->ol_flags & PKT_TX_TCP_SEG) { + hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
+ VIRTIO_NET_HDR_GSO_TCPV6 : + VIRTIO_NET_HDR_GSO_TCPV4; + hdr->gso_size = cookie->tso_segsz; + hdr->hdr_len = + cookie->l2_len + + cookie->l3_len + + cookie->l4_len; + } else { + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); + } + } +} + +static inline void +virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, + uint16_t needed, int can_push, int in_order) +{ + struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; + struct vq_desc_extra *dxp; + struct virtqueue *vq = txvq->vq; + struct vring_packed_desc *start_dp, *head_dp; + uint16_t idx, id, head_idx, head_flags; + int16_t head_size = vq->hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + uint16_t prev; + bool prepend_header = false; + + id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; + + dxp = &vq->vq_descx[id]; + dxp->ndescs = needed; + dxp->cookie = cookie; + + head_idx = vq->vq_avail_idx; + idx = head_idx; + prev = head_idx; + start_dp = vq->vq_packed.ring.desc; + + head_dp = &vq->vq_packed.ring.desc[idx]; + head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; + head_flags |= vq->vq_packed.cached_flags; + + if (can_push) { + /* prepend cannot fail, checked by caller */ + hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, + -head_size); + prepend_header = true; + + /* if offload disabled, it is not zeroed below, do it now */ + if (!vq->hw->has_tx_offload) + virtqueue_clear_net_hdr(hdr); + } else { + /* setup first tx ring slot to point to header + * stored in reserved region. + */ + start_dp[idx].addr = txvq->virtio_net_hdr_mem + + RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); + start_dp[idx].len = vq->hw->vtnet_hdr_size; + hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } + + virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); + + do { + uint16_t flags; + + start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); + start_dp[idx].len = cookie->data_len; + if (prepend_header) { + start_dp[idx].addr -= head_size; + start_dp[idx].len += head_size; + prepend_header = false; + } + + if (likely(idx != head_idx)) { + flags = cookie->next ? VRING_DESC_F_NEXT : 0; + flags |= vq->vq_packed.cached_flags; + start_dp[idx].flags = flags; + } + prev = idx; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } while ((cookie = cookie->next) != NULL); + + start_dp[prev].id = id; + + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); + vq->vq_avail_idx = idx; + + if (!in_order) { + vq->vq_desc_head_idx = dxp->next; + if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) + vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; + } + + virtqueue_store_flags_packed(head_dp, head_flags, + vq->hw->weak_barriers); +} #endif /* _VIRTQUEUE_H_ */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
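The ASSIGN_UNLESS_EQUAL/virtqueue_clear_net_hdr pair moved into the header exists purely to avoid dirtying a cache line whose virtio_net_hdr fields are already zero. Note that, as written, the copy-based form above assigns to a local typeof temporary, so the conditional store does not reach the caller's field; below is a small standalone sketch of the intended behaviour, written through a pointer instead. The toy_net_hdr struct is a stand-in for struct virtio_net_hdr and the _SKETCH macro name is illustrative:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Skip the store when the field already holds the value, so an all-zero
     * header cache line stays clean on the fast path. */
    #define ASSIGN_UNLESS_EQUAL_SKETCH(var, val) do {       \
            typeof(var) *var_ = &(var);                     \
            typeof(val) val_ = (val);                       \
            if (*var_ != val_)                              \
                    *var_ = val_;                           \
    } while (0)

    struct toy_net_hdr {             /* same field order as virtio_net_hdr */
            uint8_t flags;
            uint8_t gso_type;
            uint16_t hdr_len;
            uint16_t gso_size;
            uint16_t csum_start;
            uint16_t csum_offset;
    };

    int main(void)
    {
            struct toy_net_hdr hdr;

            memset(&hdr, 0, sizeof(hdr));
            ASSIGN_UNLESS_EQUAL_SKETCH(hdr.csum_start, 0);       /* no write */
            ASSIGN_UNLESS_EQUAL_SKETCH(hdr.csum_start, 14 + 20); /* real write */
            printf("csum_start=%u\n", (unsigned)hdr.csum_start);
            return 0;
    }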
* [dpdk-dev] [PATCH v6 7/9] net/virtio: add vectorized packed ring Tx path 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu ` (5 preceding siblings ...) 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu @ 2020-04-16 22:24 ` Marvin Liu 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 8/9] net/virtio: add election for vectorized path Marvin Liu 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 9/9] doc: add packed " Marvin Liu 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Optimize packed ring Tx path alike Rx path. Split Tx path into batch and single Tx functions. Batch function is further optimized by vector instructions. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index 10e39670e..c9aaef0af 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -107,6 +107,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index cf18fe564..f82fe8d64 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, { return 0; } + +__rte_weak uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused, + struct rte_mbuf **tx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c index ffd254489..255eba166 100644 --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -15,6 +15,21 @@ #include "virtio_pci.h" #include "virtqueue.h" +/* reference count offset in mbuf rearm data */ +#define REF_CNT_OFFSET 16 +/* segment number offset in mbuf rearm data */ +#define SEG_NUM_OFFSET 32 + +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_OFFSET | \ + 1ULL << REF_CNT_OFFSET) +/* id offset in packed ring desc higher 64bits */ +#define ID_OFFSET 32 +/* flag offset in packed ring desc higher 64bits */ +#define FLAG_OFFSET 48 + +/* net hdr short size mask */ +#define NET_HDR_MASK 0x3F + #define PACKED_FLAGS_MASK (1ULL << 55 | 1ULL << 63) #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ @@ -41,6 +56,47 @@ for (iter = val; iter < num; iter++) #endif +static void +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq) +{ + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; + struct vq_desc_extra *dxp; + uint16_t used_idx, id, curr_id, free_cnt = 0; + uint16_t size = vq->vq_nentries; + struct rte_mbuf *mbufs[size]; + uint16_t nb_mbuf = 0, i; + + used_idx = vq->vq_used_cons_idx; + + if (!desc_is_used(&desc[used_idx], vq)) + return; + + id = desc[used_idx].id; + + do { + curr_id = used_idx; + dxp = &vq->vq_descx[used_idx]; + used_idx += dxp->ndescs; + free_cnt += dxp->ndescs; + + if (dxp->cookie != NULL) { + mbufs[nb_mbuf] = dxp->cookie; + dxp->cookie = NULL; + nb_mbuf++; + } + + if (used_idx >= size) { + used_idx -= size; + vq->vq_packed.used_wrap_counter 
^= 1; + } + } while (curr_id != id); + + for (i = 0; i < nb_mbuf; i++) + rte_pktmbuf_free(mbufs[i]); + + vq->vq_used_cons_idx = used_idx; + vq->vq_free_cnt += free_cnt; +} static inline void virtio_update_batch_stats(struct virtnet_stats *stats, @@ -54,6 +110,234 @@ virtio_update_batch_stats(struct virtnet_stats *stats, stats->bytes += pkt_len3; stats->bytes += pkt_len4; } + +static inline int +virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf **tx_pkts) +{ + struct virtqueue *vq = txvq->vq; + uint16_t head_size = vq->hw->vtnet_hdr_size; + uint16_t idx = vq->vq_avail_idx; + struct virtio_net_hdr *hdr; + uint16_t i, cmp; + + if (vq->vq_avail_idx & PACKED_BATCH_MASK) + return -1; + + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) + return -1; + + /* Load four mbufs rearm data */ + __m256i mbufs = _mm256_set_epi64x(*tx_pkts[3]->rearm_data, + *tx_pkts[2]->rearm_data, + *tx_pkts[1]->rearm_data, + *tx_pkts[0]->rearm_data); + + /* refcnt=1 and nb_segs=1 */ + __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA); + __m256i head_rooms = _mm256_set1_epi16(head_size); + + /* Check refcnt and nb_segs */ + cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref); + if (cmp & 0x6666) + return -1; + + /* Check headroom is enough */ + cmp = _mm256_mask_cmp_epu16_mask(0x1111, mbufs, head_rooms, + _MM_CMPINT_LT); + if (unlikely(cmp)) + return -1; + + __m512i dxps = _mm512_set_epi64(0x1, (uint64_t)tx_pkts[3], + 0x1, (uint64_t)tx_pkts[2], + 0x1, (uint64_t)tx_pkts[1], + 0x1, (uint64_t)tx_pkts[0]); + + _mm512_storeu_si512((void *)&vq->vq_descx[idx], dxps); + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + tx_pkts[i]->data_off -= head_size; + tx_pkts[i]->data_len += head_size; + } + +#ifdef RTE_VIRTIO_USER + __m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])), + tx_pkts[2]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])), + tx_pkts[1]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])), + tx_pkts[0]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0]))); +#else + __m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len, + tx_pkts[3]->buf_iova, + tx_pkts[2]->data_len, + tx_pkts[2]->buf_iova, + tx_pkts[1]->data_len, + tx_pkts[1]->buf_iova, + tx_pkts[0]->data_len, + tx_pkts[0]->buf_iova); +#endif + + /* id offset and data offset */ + __m512i data_offsets = _mm512_set_epi64((uint64_t)3 << ID_OFFSET, + tx_pkts[3]->data_off, + (uint64_t)2 << ID_OFFSET, + tx_pkts[2]->data_off, + (uint64_t)1 << ID_OFFSET, + tx_pkts[1]->data_off, + 0, tx_pkts[0]->data_off); + + __m512i new_descs = _mm512_add_epi64(descs_base, data_offsets); + + uint64_t flags_temp = (uint64_t)idx << ID_OFFSET | + (uint64_t)vq->vq_packed.cached_flags << FLAG_OFFSET; + + /* flags offset and guest virtual address offset */ +#ifdef RTE_VIRTIO_USER + __m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset); +#else + __m128i flag_offset = _mm_set_epi64x(flags_temp, 0); +#endif + __m512i flag_offsets = _mm512_broadcast_i32x4(flag_offset); + + __m512i descs = _mm512_add_epi64(new_descs, flag_offsets); + + if (!vq->hw->has_tx_offload) { + __m128i mask = _mm_set1_epi16(0xFFFF); + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + __m128i v_hdr = _mm_loadu_si128((void *)hdr); + if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK, + v_hdr, mask))) { + __m128i all_zero = _mm_setzero_si128(); + _mm_mask_storeu_epi16((void *)hdr, 
+ NET_HDR_MASK, all_zero); + } + } + } else { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + virtqueue_xmit_offload(hdr, tx_pkts[i], true); + } + } + + /* Enqueue Packet buffers */ + rte_smp_wmb(); + _mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], descs); + + virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len, + tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len, + tx_pkts[3]->pkt_len); + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + vq->vq_free_cnt -= PACKED_BATCH_SIZE; + + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + + return 0; +} + +static inline int +virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf *txm) +{ + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint16_t slots, can_push; + int16_t need; + + /* How many main ring entries are needed to this Tx? + * any_layout => number of segments + * default => number of segments + 1 + */ + can_push = rte_mbuf_refcnt_read(txm) == 1 && + RTE_MBUF_DIRECT(txm) && + txm->nb_segs == 1 && + rte_pktmbuf_headroom(txm) >= hdr_size; + + slots = txm->nb_segs + !can_push; + need = slots - vq->vq_free_cnt; + + /* Positive value indicates it need free vring descriptors */ + if (unlikely(need > 0)) { + virtio_xmit_cleanup_packed_vec(vq); + need = slots - vq->vq_free_cnt; + if (unlikely(need > 0)) { + PMD_TX_LOG(ERR, + "No free tx descriptors to transmit"); + return -1; + } + } + + /* Enqueue Packet buffers */ + virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1); + + txvq->stats.bytes += txm->pkt_len; + return 0; +} + +uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_tx *txvq = tx_queue; + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t nb_tx = 0; + uint16_t remained; + + if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) + return nb_tx; + + if (unlikely(nb_pkts < 1)) + return nb_pkts; + + PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts); + + if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh) + virtio_xmit_cleanup_packed_vec(vq); + + remained = RTE_MIN(nb_pkts, vq->vq_free_cnt); + + while (remained) { + if (remained >= PACKED_BATCH_SIZE) { + if (!virtqueue_enqueue_batch_packed_vec(txvq, + &tx_pkts[nb_tx])) { + nb_tx += PACKED_BATCH_SIZE; + remained -= PACKED_BATCH_SIZE; + continue; + } + } + if (!virtqueue_enqueue_single_packed_vec(txvq, + tx_pkts[nb_tx])) { + nb_tx++; + remained--; + continue; + } + break; + }; + + txvq->stats.packets += nb_tx; + + if (likely(nb_tx)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_TX_LOG(DEBUG, "Notified backend after xmit"); + } + } + + return nb_tx; +} + /* Optionally fill offload information in structure */ static inline int virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
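The fast Tx check above leans on rte_mbuf packing data_off, refcnt and nb_segs into a single 64-bit rearm_data word, which is why "refcnt == 1, nb_segs == 1, headroom can hold the net header" can be tested for four packets with two masked 16-bit vector compares. A scalar equivalent for one packet, with the bit positions taken from the REF_CNT_OFFSET/SEG_NUM_OFFSET constants above (treat that layout as an assumption of this sketch, not a stable ABI):

    #include <stdbool.h>
    #include <stdint.h>

    #define REF_CNT_OFFSET 16        /* refcnt lane inside rearm_data */
    #define SEG_NUM_OFFSET 32        /* nb_segs lane inside rearm_data */

    /* Scalar version of the masked compares done on the loaded rearm words. */
    static bool
    tx_batch_candidate(uint64_t rearm_data, uint16_t hdr_size)
    {
            uint16_t data_off = rearm_data & 0xffff;
            uint16_t refcnt   = (rearm_data >> REF_CNT_OFFSET) & 0xffff;
            uint16_t nb_segs  = (rearm_data >> SEG_NUM_OFFSET) & 0xffff;

            if (refcnt != 1 || nb_segs != 1)   /* must own a single segment */
                    return false;

            /* virtio_net_hdr is prepended in the headroom, so it must fit. */
            return data_off >= hdr_size;
    }

The vector path performs the same test for a whole batch at once, comparing the four loaded rearm words against DEFAULT_REARM_DATA with _mm256_cmpneq_epu16_mask (lane mask 0x6666) and checking the headroom lanes against head_size with mask 0x1111.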
* [dpdk-dev] [PATCH v6 8/9] net/virtio: add election for vectorized path 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu ` (6 preceding siblings ...) 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu @ 2020-04-16 22:24 ` Marvin Liu 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 9/9] doc: add packed " Marvin Liu 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Rewrite vectorized path selection logic. Default setting comes from RTE_LIBRTE_VIRTIO_INC_VECTOR option. Paths criteria will be checked as listed below. Packed ring vectorized path will be selected when: vectorized option is enabled AVX512F and required extensions are supported by compiler and host virtio VERSION_1 and IN_ORDER features are negotiated virtio mergeable feature is not negotiated LRO offloading is disabled Split ring vectorized rx path will be selected when: vectorized option is enabled virtio mergeable and IN_ORDER features are not negotiated LRO, chksum and vlan strip offloading are disabled Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 4c7d60ca0..de4cef843 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1518,9 +1518,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) if (vtpci_packed_queue(hw)) { PMD_INIT_LOG(INFO, "virtio: using packed ring %s Tx path on port %u", - hw->use_inorder_tx ? "inorder" : "standard", + hw->use_vec_tx ? "vectorized" : "standard", eth_dev->data->port_id); - eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; + if (hw->use_vec_tx) + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec; + else + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; } else { if (hw->use_inorder_tx) { PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u", @@ -1534,7 +1537,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) } if (vtpci_packed_queue(hw)) { - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + if (hw->use_vec_rx) { + PMD_INIT_LOG(INFO, + "virtio: using packed ring vectorized Rx path on port %u", + eth_dev->data->port_id); + eth_dev->rx_pkt_burst = + &virtio_recv_pkts_packed_vec; + } else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { PMD_INIT_LOG(INFO, "virtio: using packed ring mergeable buffer Rx path on port %u", eth_dev->data->port_id); @@ -1548,7 +1557,7 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) } } else { if (hw->use_vec_rx) { - PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u", + PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u", eth_dev->data->port_id); eth_dev->rx_pkt_burst = virtio_recv_pkts_vec; } else if (hw->use_inorder_rx) { @@ -1921,6 +1930,10 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) goto err_virtio_init; hw->opened = true; +#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR + hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#endif return 0; @@ -2157,31 +2170,63 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { - hw->use_inorder_tx = 1; - hw->use_inorder_rx = 1; - hw->use_vec_rx = 0; - } - if (vtpci_packed_queue(hw)) { - hw->use_vec_rx = 0; - hw->use_inorder_rx = 0; - } +#if defined RTE_ARCH_X86 + if ((hw->use_vec_rx || hw->use_vec_tx) && + (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) || + !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) || + !vtpci_with_feature(hw, 
VIRTIO_F_VERSION_1))) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorization for requirements are not met"); + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; + } +#endif + + if (hw->use_vec_rx) { + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } + if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for TCP_LRO enabled"); + hw->use_vec_rx = 0; + } + } + } else { + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { + hw->use_inorder_tx = 1; + hw->use_inorder_rx = 1; + hw->use_vec_rx = 0; + } + + if (hw->use_vec_rx) { #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM - if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_vec_rx = 0; - } + if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorization for requirements are not met"); + hw->use_vec_rx = 0; + } #endif - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_vec_rx = 0; - } + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } - if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | - DEV_RX_OFFLOAD_TCP_CKSUM | - DEV_RX_OFFLOAD_TCP_LRO | - DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_vec_rx = 0; + if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | + DEV_RX_OFFLOAD_TCP_CKSUM | + DEV_RX_OFFLOAD_TCP_LRO | + DEV_RX_OFFLOAD_VLAN_STRIP)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for offloading enabled"); + hw->use_vec_rx = 0; + } + } + } return 0; } -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
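Flattened into one predicate, the packed ring Rx election above reduces to the checks below. This is a standalone sketch with the inputs passed as booleans; the driver derives them from rte_cpu_get_flag_enabled(), vtpci_with_feature() and the configured rx_offloads, and the Tx election needs only the first two checks:

    #include <stdbool.h>

    /* Sketch of the packed ring vectorized Rx election in virtio_dev_configure(). */
    static bool
    packed_vec_rx_usable(bool has_avx512f, bool version_1, bool in_order,
                         bool mrg_rxbuf, bool lro_enabled)
    {
            if (!has_avx512f)               /* runtime ISA support */
                    return false;
            if (!version_1 || !in_order)    /* required negotiated features */
                    return false;
            if (mrg_rxbuf || lro_enabled)   /* Rx-only restrictions */
                    return false;
            return true;
    }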
* [dpdk-dev] [PATCH v6 9/9] doc: add packed vectorized path 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu ` (7 preceding siblings ...) 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 8/9] net/virtio: add election for vectorized path Marvin Liu @ 2020-04-16 22:24 ` Marvin Liu 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-16 22:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Document packed virtqueue vectorized path selection logic in virtio net PMD. Add packed virtqueue vectorized path features to new ini file. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/doc/guides/nics/features/virtio-packed_vec.ini b/doc/guides/nics/features/virtio-packed_vec.ini new file mode 100644 index 000000000..b239bcaad --- /dev/null +++ b/doc/guides/nics/features/virtio-packed_vec.ini @@ -0,0 +1,22 @@ +; +; Supported features of the 'virtio_packed_vec' network poll mode driver. +; +; Refer to default.ini for the full list of available PMD features. +; +[Features] +Speed capabilities = P +Link status = Y +Link status event = Y +Rx interrupt = Y +Queue start/stop = Y +Promiscuous mode = Y +Allmulticast mode = Y +Unicast MAC filter = Y +Multicast MAC filter = Y +VLAN filter = Y +Basic stats = Y +Stats per queue = Y +BSD nic_uio = Y +Linux UIO = Y +Linux VFIO = Y +x86-64 = Y diff --git a/doc/guides/nics/features/virtio_vec.ini b/doc/guides/nics/features/virtio-split_vec.ini similarity index 88% rename from doc/guides/nics/features/virtio_vec.ini rename to doc/guides/nics/features/virtio-split_vec.ini index e60fe36ae..4142fc9f0 100644 --- a/doc/guides/nics/features/virtio_vec.ini +++ b/doc/guides/nics/features/virtio-split_vec.ini @@ -1,5 +1,5 @@ ; -; Supported features of the 'virtio_vec' network poll mode driver. +; Supported features of the 'virtio_split_vec' network poll mode driver. ; ; Refer to default.ini for the full list of available PMD features. ; diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index d1f5fb898..be07744ce 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -403,6 +403,11 @@ Below devargs are supported by the virtio-user vdev: It is used to enable virtio device packed virtqueue feature. (Default: 0 (disabled)) +#. ``vectorized``: + + It is used to enable virtio device vectorized path. + (Default: 0 (disabled)) + Virtio paths Selection and Usage -------------------------------- @@ -454,6 +459,13 @@ according to below configuration: both negotiated, this path will be selected. #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and Rx mergeable is not negotiated, this path will be selected. +#. Packed virtqueue vectorized Rx path: If building and running environment support + AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated && + TCP_LRO Rx offloading is disabled && vectorized option enabled, + this path will be selected. +#. Packed virtqueue vectorized Tx path: If building and running environment support + AVX512 && in-order feature is negotiated && vectorized option enabled, + this path will be selected. 
Rx/Tx callbacks of each Virtio path ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -476,6 +488,8 @@ are shown in below table: Packed virtqueue non-meregable path virtio_recv_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order mergeable path virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed virtio_xmit_pkts_packed + Packed virtqueue vectorized Rx path virtio_recv_pkts_packed_vec virtio_xmit_pkts_packed + Packed virtqueue vectorized Tx path virtio_recv_pkts_packed virtio_xmit_pkts_packed_vec ============================================ ================================= ======================== Virtio paths Support Status from Release to Release @@ -493,20 +507,22 @@ All virtio paths support status are shown in below table: .. table:: Virtio Paths and Releases - ============================================ ============= ============= ============= - Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 - ============================================ ============= ============= ============= - Split virtqueue mergeable path Y Y Y - Split virtqueue non-mergeable path Y Y Y - Split virtqueue vectorized Rx path Y Y Y - Split virtqueue simple Tx path Y N N - Split virtqueue in-order mergeable path Y Y - Split virtqueue in-order non-mergeable path Y Y - Packed virtqueue mergeable path Y - Packed virtqueue non-mergeable path Y - Packed virtqueue in-order mergeable path Y - Packed virtqueue in-order non-mergeable path Y - ============================================ ============= ============= ============= + ============================================ ============= ============= ============= ======= + Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~ + ============================================ ============= ============= ============= ======= + Split virtqueue mergeable path Y Y Y Y + Split virtqueue non-mergeable path Y Y Y Y + Split virtqueue vectorized Rx path Y Y Y Y + Split virtqueue simple Tx path Y N N N + Split virtqueue in-order mergeable path Y Y Y + Split virtqueue in-order non-mergeable path Y Y Y + Packed virtqueue mergeable path Y Y + Packed virtqueue non-mergeable path Y Y + Packed virtqueue in-order mergeable path Y Y + Packed virtqueue in-order non-mergeable path Y Y + Packed virtqueue vectorized Rx path Y + Packed virtqueue vectorized Tx path Y + ============================================ ============= ============= ============= ======= QEMU Support Status ~~~~~~~~~~~~~~~~~~~ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v7 0/9] add packed ring vectorized path 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu ` (11 preceding siblings ...) 2020-04-16 22:24 ` [dpdk-dev] [PATCH v6 0/9] add packed ring " Marvin Liu @ 2020-04-22 6:16 ` Marvin Liu 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 1/9] net/virtio: add Rx free threshold setting Marvin Liu ` (8 more replies) 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu ` (4 subsequent siblings) 17 siblings, 9 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-22 6:16 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren Cc: dev, Marvin Liu This patch set introduced vectorized path for packed ring. The size of packed ring descriptor is 16Bytes. Four batched descriptors are just placed into one cacheline. AVX512 instructions can well handle this kind of data. Packed ring TX path can fully transformed into vectorized path. Packed ring Rx path can be vectorized when requirements met(LRO and mergeable disabled). New option RTE_LIBRTE_VIRTIO_INC_VECTOR will be introduced in this patch set. This option will unify split and packed ring vectorized path default setting. Meanwhile user can specify whether enable vectorized path at runtime by 'vectorized' parameter of virtio user vdev. v7: 1. default vectorization is disabled 2. compilation time check with rte_mbuf structure 3. offsets are calcuated when compiling 4. remove useless barrier as descs are batched store&load 5. vindex of scatter is directly set 6. some comments updates 7. enable vectorized path in meson build v6: 1. fix issue when size not power of 2 v5: 1. remove cpuflags definition as required extensions always come with AVX512F on x86_64 2. inorder actions should depend on feature bit 3. check ring type in rx queue setup 4. rewrite some commit logs 5. fix some checkpatch warnings v4: 1. rename 'packed_vec' to 'vectorized', also used in split ring 2. add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev 3. check required AVX512 extensions cpuflags 4. combine split and packed ring datapath selection logic 5. remove limitation that size must power of two 6. clear 12Bytes virtio_net_hdr v3: 1. remove virtio_net_hdr array for better performance 2. disable 'packed_vec' by default v2: 1. more function blocks replaced by vector instructions 2. clean virtio_net_hdr by vector instruction 3. allow header room size change 4. add 'packed_vec' option in virtio_user vdev 5. fix build not check whether AVX512 enabled 6. 
doc update Marvin Liu (9): net/virtio: add Rx free threshold setting net/virtio: enable vectorized path net/virtio: inorder should depend on feature bit net/virtio-user: add vectorized path parameter net/virtio: add vectorized packed ring Rx path net/virtio: reuse packed ring xmit functions net/virtio: add vectorized packed ring Tx path net/virtio: add election for vectorized path doc: add packed vectorized path config/common_base | 1 + doc/guides/nics/virtio.rst | 43 +- drivers/net/virtio/Makefile | 37 ++ drivers/net/virtio/meson.build | 15 + drivers/net/virtio/virtio_ethdev.c | 95 ++- drivers/net/virtio/virtio_ethdev.h | 6 + drivers/net/virtio/virtio_pci.h | 3 +- drivers/net/virtio/virtio_rxtx.c | 212 ++----- drivers/net/virtio/virtio_rxtx_packed_avx.c | 662 ++++++++++++++++++++ drivers/net/virtio/virtio_user_ethdev.c | 37 +- drivers/net/virtio/virtqueue.c | 7 +- drivers/net/virtio/virtqueue.h | 168 ++++- 12 files changed, 1072 insertions(+), 214 deletions(-) create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
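The cover letter's central observation is that one packed descriptor is 16 bytes, so a batch of four is exactly one 64-byte cache line and fits a single zmm register. A compile-only toy (build with -mavx512f) that makes the point; demo_packed_desc is a local mirror of the descriptor layout, not the driver's type:

    #include <stdint.h>
    #include <string.h>
    #include <immintrin.h>

    struct demo_packed_desc {        /* 16 bytes, mirroring vring_packed_desc */
            uint64_t addr;
            uint32_t len;
            uint16_t id;
            uint16_t flags;
    };

    int main(void)
    {
            /* Four descriptors span exactly one 64-byte cache line... */
            struct demo_packed_desc ring[4] __attribute__((aligned(64)));

            memset(ring, 0, sizeof(ring));

            /* ...so one 512-bit load or store moves a whole batch, which is
             * what the batched Rx dequeue and Tx enqueue paths build on. */
            __m512i batch = _mm512_loadu_si512(ring);
            (void)batch;
            return 0;
    }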
* [dpdk-dev] [PATCH v7 1/9] net/virtio: add Rx free threshold setting 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu @ 2020-04-22 6:16 ` Marvin Liu 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 2/9] net/virtio: enable vectorized path Marvin Liu ` (7 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-22 6:16 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren Cc: dev, Marvin Liu Introduce a free threshold setting in the Rx queue; its default value is 32. The threshold is limited to multiples of four, as only the vectorized packed Rx function will utilize it. The virtio driver will rearm the Rx queue when more than rx_free_thresh descriptors have been dequeued. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 060410577..94ba7a3ec 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, struct virtio_hw *hw = dev->data->dev_private; struct virtqueue *vq = hw->vqs[vtpci_queue_idx]; struct virtnet_rx *rxvq; + uint16_t rx_free_thresh; PMD_INIT_FUNC_TRACE(); @@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, return -EINVAL; } + rx_free_thresh = rx_conf->rx_free_thresh; + if (rx_free_thresh == 0) + rx_free_thresh = + RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH); + + if (rx_free_thresh & 0x3) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four." + " (rx_free_thresh=%u port=%u queue=%u)\n", + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + + if (rx_free_thresh >= vq->vq_nentries) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the " + "number of RX entries (%u)." + " (rx_free_thresh=%u port=%u queue=%u)\n", + vq->vq_nentries, + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + vq->vq_free_thresh = rx_free_thresh; + if (nb_desc == 0 || nb_desc > vq->vq_nentries) nb_desc = vq->vq_nentries; vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc); diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 58ad7309a..6301c56b2 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,8 @@ struct rte_mbuf; +#define DEFAULT_RX_FREE_THRESH 32 + /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
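On the application side the threshold comes in through the standard rte_eth_rxconf field. A hedged sketch of a queue setup that requests the 32-entry default explicitly; error handling is trimmed and the setup_rxq name is illustrative:

    #include <rte_ethdev.h>
    #include <rte_mempool.h>

    /* Request an Rx free threshold the vectorized refill accepts:
     * a multiple of four that is smaller than the ring size. */
    static int
    setup_rxq(uint16_t port_id, uint16_t queue_id, uint16_t nb_desc,
              struct rte_mempool *mb_pool)
    {
            struct rte_eth_dev_info dev_info;
            struct rte_eth_rxconf rxconf;

            if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
                    return -1;
            rxconf = dev_info.default_rxconf;
            rxconf.rx_free_thresh = 32;

            return rte_eth_rx_queue_setup(port_id, queue_id, nb_desc,
                            rte_eth_dev_socket_id(port_id), &rxconf, mb_pool);
    }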
* [dpdk-dev] [PATCH v7 2/9] net/virtio: enable vectorized path 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 1/9] net/virtio: add Rx free threshold setting Marvin Liu @ 2020-04-22 6:16 ` Marvin Liu 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 3/9] net/virtio: inorder should depend on feature bit Marvin Liu ` (6 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-22 6:16 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren Cc: dev, Marvin Liu Previously, virtio split ring vectorized path is enabled as default. This is not suitable for everyone because of that path not follow virtio spec. Add new config for virtio vectorized path selection. By default vectorized path is disabled. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/config/common_base b/config/common_base index 00d8d0792..334a26a17 100644 --- a/config/common_base +++ b/config/common_base @@ -456,6 +456,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=n # # Compile virtio device emulation inside virtio PMD driver diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index c9edb84ee..4b69827ab 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -28,6 +28,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y) ifeq ($(CONFIG_RTE_ARCH_X86),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y) @@ -35,6 +36,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif +endif ifeq ($(CONFIG_RTE_VIRTIO_USER),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index 15150eea1..ce3525ef5 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -8,6 +8,7 @@ sources += files('virtio_ethdev.c', 'virtqueue.c') deps += ['kvargs', 'bus_pci'] +dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1) if arch_subdir == 'x86' sources += files('virtio_rxtx_simple_sse.c') elif arch_subdir == 'ppc' -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
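Since the new option defaults to "n", opting in with the make-based build is a one-line change in the build configuration (for example in the generated .config or in config/common_base):

    # opt in to the vectorized virtio datapaths introduced by this series
    CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=y

The meson build in this patch sets RTE_LIBRTE_VIRTIO_INC_VECTOR unconditionally through dpdk_conf.set(), so nothing extra is needed there.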
* [dpdk-dev] [PATCH v7 3/9] net/virtio: inorder should depend on feature bit 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 1/9] net/virtio: add Rx free threshold setting Marvin Liu 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 2/9] net/virtio: enable vectorized path Marvin Liu @ 2020-04-22 6:16 ` Marvin Liu 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 4/9] net/virtio-user: add vectorized path parameter Marvin Liu ` (5 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-22 6:16 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren Cc: dev, Marvin Liu Ring initialization is different when the in-order feature is negotiated. This behavior should depend on the negotiated feature bits. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 94ba7a3ec..e450477e8 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -989,6 +989,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) struct rte_mbuf *m; uint16_t desc_idx; int error, nbufs, i; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); PMD_INIT_FUNC_TRACE(); @@ -1018,7 +1019,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; } - } else if (hw->use_inorder_rx) { + } else if (!vtpci_packed_queue(vq->hw) && in_order) { if ((!virtqueue_full(vq))) { uint16_t free_cnt = vq->vq_free_cnt; struct rte_mbuf *pkts[free_cnt]; @@ -1133,7 +1134,7 @@ virtio_dev_tx_queue_setup_finish(struct rte_eth_dev *dev, PMD_INIT_FUNC_TRACE(); if (!vtpci_packed_queue(hw)) { - if (hw->use_inorder_tx) + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) vq->vq_split.ring.desc[vq->vq_nentries - 1].next = 0; } @@ -2046,7 +2047,7 @@ virtio_xmit_pkts_packed(void *tx_queue, struct rte_mbuf **tx_pkts, struct virtio_hw *hw = vq->hw; uint16_t hdr_size = hw->vtnet_hdr_size; uint16_t nb_tx = 0; - bool in_order = hw->use_inorder_tx; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) return nb_tx; -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
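The point of the change is that in-order handling is keyed off the negotiated feature bit rather than the driver-side use_inorder_* flags. A standalone sketch of such a feature-bit test; demo_with_feature mimics what vtpci_with_feature() does against the negotiated feature word, and VIRTIO_F_IN_ORDER is bit 35 in the virtio specification:

    #include <stdbool.h>
    #include <stdint.h>

    #define DEMO_VIRTIO_F_IN_ORDER 35    /* feature bit number from the spec */

    /* True when the given feature bit was negotiated with the device. */
    static bool
    demo_with_feature(uint64_t negotiated_features, uint64_t bit)
    {
            return (negotiated_features & (1ULL << bit)) != 0;
    }

Ring setup and the packed Tx burst would then branch on demo_with_feature(features, DEMO_VIRTIO_F_IN_ORDER), exactly as the hunks above do with vtpci_with_feature(hw, VIRTIO_F_IN_ORDER).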
* [dpdk-dev] [PATCH v7 4/9] net/virtio-user: add vectorized path parameter 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu ` (2 preceding siblings ...) 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 3/9] net/virtio: inorder should depend on feature bit Marvin Liu @ 2020-04-22 6:16 ` Marvin Liu 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu ` (4 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-22 6:16 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren Cc: dev, Marvin Liu Add new parameter "vectorized" which can select vectorized path explicitly. This parameter will work when RTE_LIBRTE_VIRTIO_INC_VECTOR option is yes. When "vectorized" is set, driver will check both compiling environment and running environment when selecting path. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 37766cbb6..361c834a9 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1551,8 +1551,8 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed; } } else { - if (hw->use_simple_rx) { - PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u", + if (hw->use_vec_rx) { + PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u", eth_dev->data->port_id); eth_dev->rx_pkt_burst = virtio_recv_pkts_vec; } else if (hw->use_inorder_rx) { @@ -2257,33 +2257,33 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - hw->use_simple_rx = 1; + hw->use_vec_rx = 1; if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { hw->use_inorder_tx = 1; hw->use_inorder_rx = 1; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (vtpci_packed_queue(hw)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; hw->use_inorder_rx = 0; } #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } #endif if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | DEV_RX_OFFLOAD_TCP_CKSUM | DEV_RX_OFFLOAD_TCP_LRO | DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; return 0; } diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h index bd89357e4..668e688e1 100644 --- a/drivers/net/virtio/virtio_pci.h +++ b/drivers/net/virtio/virtio_pci.h @@ -253,7 +253,8 @@ struct virtio_hw { uint8_t vlan_strip; uint8_t use_msix; uint8_t modern; - uint8_t use_simple_rx; + uint8_t use_vec_rx; + uint8_t use_vec_tx; uint8_t use_inorder_rx; uint8_t use_inorder_tx; uint8_t weak_barriers; diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index e450477e8..84f4cf946 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) /* Allocate blank mbufs for the each rx descriptor */ nbufs = 0; - if (hw->use_simple_rx) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { for (desc_idx = 0; desc_idx < vq->vq_nentries; desc_idx++) { vq->vq_split.ring.avail->ring[desc_idx] = desc_idx; @@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) &rxvq->fake_mbuf; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { 
while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 953f00d72..5c338cf44 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -452,6 +452,8 @@ static const char *valid_args[] = { VIRTIO_USER_ARG_PACKED_VQ, #define VIRTIO_USER_ARG_SPEED "speed" VIRTIO_USER_ARG_SPEED, +#define VIRTIO_USER_ARG_VECTORIZED "vectorized" + VIRTIO_USER_ARG_VECTORIZED, NULL }; @@ -525,7 +527,8 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev) */ hw->use_msix = 1; hw->modern = 0; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; hw->use_inorder_rx = 0; hw->use_inorder_tx = 0; hw->virtio_user_dev = dev; @@ -559,6 +562,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) uint64_t mrg_rxbuf = 1; uint64_t in_order = 1; uint64_t packed_vq = 0; + uint64_t vectorized = 0; char *path = NULL; char *ifname = NULL; char *mac_addr = NULL; @@ -675,6 +679,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } } +#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR + if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) { + if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED, + &get_integer_arg, &vectorized) < 0) { + PMD_INIT_LOG(ERR, "error to parse %s", + VIRTIO_USER_ARG_VECTORIZED); + goto end; + } + } +#endif + if (queues > 1 && cq == 0) { PMD_INIT_LOG(ERR, "multi-q requires ctrl-q"); goto end; @@ -727,6 +742,23 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) goto end; } + if (vectorized) { + if (packed_vq) { +#if defined(CC_AVX512_SUPPORT) + hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#else + PMD_INIT_LOG(INFO, + "building environment do not match packed ring vectorized requirement"); +#endif + } else { + hw->use_vec_rx = 1; + } + } else { + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; + } + rte_eth_dev_probing_finish(eth_dev); ret = 0; @@ -785,4 +817,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, "mrg_rxbuf=<0|1> " "in_order=<0|1> " "packed_vq=<0|1> " - "speed=<int>"); + "speed=<int> " + "vectorized=<0|1>"); diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c index 0b4e3bf3e..ca23180de 100644 --- a/drivers/net/virtio/virtqueue.c +++ b/drivers/net/virtio/virtqueue.c @@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq) end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1); for (idx = 0; idx < vq->vq_nentries; idx++) { - if (hw->use_simple_rx && type == VTNET_RQ) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw) && + type == VTNET_RQ) { if (start <= end && idx >= start && idx < end) continue; if (start > end && (idx >= start || idx < end)) @@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) for (i = 0; i < nb_used; i++) { used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1); uep = &vq->vq_split.ring.used->ring[used_idx]; - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { desc_idx = used_idx; rte_pktmbuf_free(vq->sw_ring[desc_idx]); vq->vq_free_cnt++; @@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) vq->vq_used_cons_idx++; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxq); if (virtqueue_kick_prepare(vq)) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
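The new devarg is consumed through the rte_kvargs helpers visible in the hunk above. A reduced, standalone sketch of that flow; the demo_* names are illustrative, and valid-key checking is skipped here by passing NULL to rte_kvargs_parse:

    #include <stdint.h>
    #include <stdlib.h>
    #include <rte_kvargs.h>

    /* Handler in the style of get_integer_arg(): parse "key=value" into u64. */
    static int
    demo_parse_uint(const char *key, const char *value, void *opaque)
    {
            uint64_t *out = opaque;

            (void)key;
            *out = strtoull(value, NULL, 0);
            return 0;
    }

    /* Extract e.g. "vectorized=1" from a vdev argument string. */
    static uint64_t
    demo_get_vectorized(const char *devargs)
    {
            struct rte_kvargs *kvlist = rte_kvargs_parse(devargs, NULL);
            uint64_t vectorized = 0;

            if (kvlist == NULL)
                    return 0;
            if (rte_kvargs_count(kvlist, "vectorized") == 1)
                    rte_kvargs_process(kvlist, "vectorized",
                                    demo_parse_uint, &vectorized);
            rte_kvargs_free(kvlist);
            return vectorized;
    }

The probe path above additionally validates every key against the valid_args table and only honours "vectorized" when the build enables RTE_LIBRTE_VIRTIO_INC_VECTOR.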
* [dpdk-dev] [PATCH v7 5/9] net/virtio: add vectorized packed ring Rx path 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu ` (3 preceding siblings ...) 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 4/9] net/virtio-user: add vectorized path parameter Marvin Liu @ 2020-04-22 6:16 ` Marvin Liu 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu ` (3 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-22 6:16 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren Cc: dev, Marvin Liu Optimize packed ring Rx path when AVX512 enabled and mergeable buffer/Rx LRO offloading are not required. Solution of optimization is pretty like vhost, is that split path into batch and single functions. Batch function is further optimized by vector instructions. Also pad desc extra structure to 16 bytes aligned, thus four elements will be saved in one batch. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index 4b69827ab..de0b00e50 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -36,6 +36,41 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif + +ifneq ($(FORCE_DISABLE_AVX512), y) + CC_AVX512_SUPPORT=\ + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ + grep -q AVX512 && echo 1) +endif + +ifeq ($(CC_AVX512_SUPPORT), 1) +CFLAGS += -DCC_AVX512_SUPPORT +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c + +ifeq ($(RTE_TOOLCHAIN), gcc) +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), clang) +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1) +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), icc) +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA +endif +endif + +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds +endif +endif endif ifeq ($(CONFIG_RTE_VIRTIO_USER),y) diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index ce3525ef5..39b3605d9 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -10,6 +10,20 @@ deps += ['kvargs', 'bus_pci'] dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1) if arch_subdir == 'x86' + if '-mno-avx512f' not in machine_args + if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw') + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] + cflags += ['-DCC_AVX512_SUPPORT'] + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' + endif + sources += files('virtio_rxtx_packed_avx.c') + endif + endif sources += files('virtio_rxtx_simple_sse.c') elif arch_subdir == 'ppc' sources += 
files('virtio_rxtx_simple_altivec.c') diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index febaf17a8..5c112cac7 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 84f4cf946..7b65d0b0a 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -1246,7 +1246,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) return 0; } -#define VIRTIO_MBUF_BURST_SZ 64 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc)) uint16_t virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) @@ -2329,3 +2328,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, return nb_tx; } + +__rte_weak uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, + struct rte_mbuf **rx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c new file mode 100644 index 000000000..d02ba9ba6 --- /dev/null +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -0,0 +1,373 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> + +#include <rte_net.h> + +#include "virtio_logs.h" +#include "virtio_ethdev.h" +#include "virtio_pci.h" +#include "virtqueue.h" + +#define BYTE_SIZE 8 +/* flag bits offset in packed ring desc higher 64bits */ +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \ + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) + +#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \ + FLAGS_BITS_OFFSET) + +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ + sizeof(struct vring_packed_desc)) +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) + +#ifdef VIRTIO_GCC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_ICC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ + for (iter = val; iter < size; iter++) +#endif + +#ifndef virtio_for_each_try_unroll +#define virtio_for_each_try_unroll(iter, val, num) \ + for (iter = val; iter < num; iter++) +#endif + + +static inline void +virtio_update_batch_stats(struct virtnet_stats *stats, + uint16_t pkt_len1, + uint16_t pkt_len2, + uint16_t pkt_len3, + uint16_t pkt_len4) +{ + stats->bytes += pkt_len1; + stats->bytes += pkt_len2; + stats->bytes += pkt_len3; + stats->bytes += pkt_len4; +} +/* Optionally fill offload information in structure */ +static inline int +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) +{ + struct rte_net_hdr_lens hdr_lens; + uint32_t hdrlen, ptype; + int l4_supported = 0; + + /* nothing to do */ + if 
(hdr->flags == 0) + return 0; + + /* GSO not support in vec path, skip check */ + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; + + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); + m->packet_type = ptype; + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) + l4_supported = 1; + + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; + if (hdr->csum_start <= hdrlen && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; + } else { + /* Unknown proto or tunnel, do sw cksum. We can assume + * the cksum field is in the first segment since the + * buffers we provided to the host are large enough. + * In case of SCTP, this will be wrong since it's a CRC + * but there's nothing we can do. + */ + uint16_t csum = 0, off; + + rte_raw_cksum_mbuf(m, hdr->csum_start, + rte_pktmbuf_pkt_len(m) - hdr->csum_start, + &csum); + if (likely(csum != 0xffff)) + csum = ~csum; + off = hdr->csum_offset + hdr->csum_start; + if (rte_pktmbuf_data_len(m) >= off + 1) + *rte_pktmbuf_mtod_offset(m, uint16_t *, + off) = csum; + } + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; + } + + return 0; +} + +static inline uint16_t +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint64_t addrs[PACKED_BATCH_SIZE]; + uint16_t id = vq->vq_used_cons_idx; + uint8_t desc_stats; + uint16_t i; + void *desc_addr; + + if (id & PACKED_BATCH_MASK) + return -1; + + if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries)) + return -1; + + /* only care avail/used bits */ + __m512i v_mask = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + desc_addr = &vq->vq_packed.ring.desc[id]; + + __m512i v_desc = _mm512_loadu_si512(desc_addr); + __m512i v_flag = _mm512_and_epi64(v_desc, v_mask); + + __m512i v_used_flag = _mm512_setzero_si512(); + if (vq->vq_packed.used_wrap_counter) + v_used_flag = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + + /* Check all descs are used */ + desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag); + if (desc_stats) + return -1; + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); + + addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1; + } + + /* + * load len from desc, store into mbuf pkt_len and data_len + * len limiated by l6bit buf_len, pkt_len[16:31] can be ignored + */ + __m512i values = _mm512_maskz_shuffle_epi32(0x6666, v_desc, 0xAA); + + /* reduce hdr_len from pkt_len and data_len */ + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(0x6666, + (uint32_t)-hdr_size); + + __m512i v_value = _mm512_add_epi32(values, mbuf_len_offset); + + /* assert offset of data_len */ + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) != + offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8); + + __m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3], + addrs[2] + 8, addrs[2], + addrs[1] + 8, addrs[1], + addrs[0] + 8, addrs[0]); + /* batch store into mbufs */ + _mm512_i64scatter_epi64(0, v_index, v_value, 1); + + if (hw->has_rx_offload) { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + char *addr = (char *)rx_pkts[i]->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size; + 
virtio_vec_rx_offload(rx_pkts[i], + (struct virtio_net_hdr *)addr); + } + } + + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, + rx_pkts[3]->pkt_len); + + vq->vq_free_cnt += PACKED_BATCH_SIZE; + + vq->vq_used_cons_idx += PACKED_BATCH_SIZE; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static uint16_t +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + uint16_t used_idx, id; + uint32_t len; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint32_t hdr_size = hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + struct vring_packed_desc *desc; + struct rte_mbuf *cookie; + + desc = vq->vq_packed.ring.desc; + used_idx = vq->vq_used_cons_idx; + if (!desc_is_used(&desc[used_idx], vq)) + return -1; + + len = desc[used_idx].len; + id = desc[used_idx].id; + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; + if (unlikely(cookie == NULL)) { + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u", + vq->vq_used_cons_idx); + return -1; + } + rte_prefetch0(cookie); + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); + + cookie->data_off = RTE_PKTMBUF_HEADROOM; + cookie->ol_flags = 0; + cookie->pkt_len = (uint32_t)(len - hdr_size); + cookie->data_len = (uint32_t)(len - hdr_size); + + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size); + if (hw->has_rx_offload) + virtio_vec_rx_offload(cookie, hdr); + + *rx_pkts = cookie; + + rxvq->stats.bytes += cookie->pkt_len; + + vq->vq_free_cnt++; + vq->vq_used_cons_idx++; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static inline void +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **cookie, + uint16_t num) +{ + struct virtqueue *vq = rxvq->vq; + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; + uint16_t flags = vq->vq_packed.cached_flags; + struct virtio_hw *hw = vq->hw; + struct vq_desc_extra *dxp; + uint16_t idx, i; + uint16_t batch_num, total_num = 0; + uint16_t head_idx = vq->vq_avail_idx; + uint16_t head_flag = vq->vq_packed.cached_flags; + uint64_t addr; + + do { + idx = vq->vq_avail_idx; + + batch_num = PACKED_BATCH_SIZE; + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) + batch_num = vq->vq_nentries - idx; + if (unlikely((total_num + batch_num) > num)) + batch_num = num - total_num; + + virtio_for_each_try_unroll(i, 0, batch_num) { + dxp = &vq->vq_descx[idx + i]; + dxp->cookie = (void *)cookie[total_num + i]; + + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) + + RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size; + start_dp[idx + i].addr = addr; + start_dp[idx + i].len = cookie[total_num + i]->buf_len + - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size; + if (total_num || i) { + virtqueue_store_flags_packed(&start_dp[idx + i], + flags, hw->weak_barriers); + } + } + + vq->vq_avail_idx += batch_num; + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + flags = vq->vq_packed.cached_flags; + } + total_num += batch_num; + } while (total_num < num); + + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, + hw->weak_barriers); + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); +} + +uint16_t 
+virtio_recv_pkts_packed_vec(void *rx_queue, + struct rte_mbuf **rx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_rx *rxvq = rx_queue; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t num, nb_rx = 0; + uint32_t nb_enqueued = 0; + uint16_t free_cnt = vq->vq_free_thresh; + + if (unlikely(hw->started == 0)) + return nb_rx; + + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); + if (likely(num > PACKED_BATCH_SIZE)) + num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE); + + while (num) { + if (!virtqueue_dequeue_batch_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx += PACKED_BATCH_SIZE; + num -= PACKED_BATCH_SIZE; + continue; + } + if (!virtqueue_dequeue_single_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx++; + num--; + continue; + } + break; + }; + + PMD_RX_LOG(DEBUG, "dequeue:%d", num); + + rxvq->stats.packets += nb_rx; + + if (likely(vq->vq_free_cnt >= free_cnt)) { + struct rte_mbuf *new_pkts[free_cnt]; + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, + free_cnt) == 0)) { + virtio_recv_refill_packed_vec(rxvq, new_pkts, + free_cnt); + nb_enqueued += free_cnt; + } else { + struct rte_eth_dev *dev = + &rte_eth_devices[rxvq->port_id]; + dev->data->rx_mbuf_alloc_failed += free_cnt; + } + } + + if (likely(nb_enqueued)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_RX_LOG(DEBUG, "Notified"); + } + } + + return nb_rx; +} diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 6301c56b2..43e305ecc 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -20,6 +20,7 @@ struct rte_mbuf; #define DEFAULT_RX_FREE_THRESH 32 +#define VIRTIO_MBUF_BURST_SZ 64 /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO @@ -236,7 +237,8 @@ struct vq_desc_extra { void *cookie; uint16_t ndescs; uint16_t next; -}; + uint8_t padding[4]; +} __rte_packed __rte_aligned(16); struct virtqueue { struct virtio_hw *hw; /**< virtio_hw structure pointer. */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
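Editorial note on the Rx batch above: the AVX512 check in virtqueue_dequeue_batch_packed_vec() is equivalent to the scalar loop below, with one masked 64-bit compare replacing four separate flag tests. This is an illustrative sketch only; the helper name is invented, and the descriptor layout and flag bits are local copies of the driver's virtio_ring.h definitions.

#include <stdint.h>
#include <stdbool.h>

/* Local copies for illustration; the driver takes these from virtio_ring.h. */
#define VRING_PACKED_DESC_F_AVAIL (1 << 7)
#define VRING_PACKED_DESC_F_USED  (1 << 15)

struct vring_packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

/* Scalar equivalent of the masked compare against PACKED_FLAGS_MASK:
 * a batch is dequeued only when all four descriptors have
 * AVAIL == USED == used_wrap_counter.
 */
static inline bool
batch_descs_used(const struct vring_packed_desc *desc, uint16_t id,
		bool used_wrap_counter)
{
	uint16_t i;

	for (i = 0; i < 4; i++) {
		uint16_t flags = desc[id + i].flags;
		bool avail = !!(flags & VRING_PACKED_DESC_F_AVAIL);
		bool used = !!(flags & VRING_PACKED_DESC_F_USED);

		if (avail != used_wrap_counter || used != used_wrap_counter)
			return false;
	}
	return true;
}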
* [dpdk-dev] [PATCH v7 6/9] net/virtio: reuse packed ring xmit functions 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu ` (4 preceding siblings ...) 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu @ 2020-04-22 6:16 ` Marvin Liu 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu ` (2 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-22 6:16 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren Cc: dev, Marvin Liu Move xmit offload and packed ring xmit enqueue function to header file. These functions will be reused by packed ring vectorized Tx function. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 7b65d0b0a..cf18fe564 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq, return i; } -#ifndef DEFAULT_TX_FREE_THRESH -#define DEFAULT_TX_FREE_THRESH 32 -#endif - static void virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num) { @@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m) } -/* avoid write operation when necessary, to lessen cache issues */ -#define ASSIGN_UNLESS_EQUAL(var, val) do { \ - if ((var) != (val)) \ - (var) = (val); \ -} while (0) - -#define virtqueue_clear_net_hdr(_hdr) do { \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ -} while (0) - -static inline void -virtqueue_xmit_offload(struct virtio_net_hdr *hdr, - struct rte_mbuf *cookie, - bool offload) -{ - if (offload) { - if (cookie->ol_flags & PKT_TX_TCP_SEG) - cookie->ol_flags |= PKT_TX_TCP_CKSUM; - - switch (cookie->ol_flags & PKT_TX_L4_MASK) { - case PKT_TX_UDP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_udp_hdr, - dgram_cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - case PKT_TX_TCP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - default: - ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); - ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); - ASSIGN_UNLESS_EQUAL(hdr->flags, 0); - break; - } - /* TCP Segmentation Offload */ - if (cookie->ol_flags & PKT_TX_TCP_SEG) { - hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
- VIRTIO_NET_HDR_GSO_TCPV6 : - VIRTIO_NET_HDR_GSO_TCPV4; - hdr->gso_size = cookie->tso_segsz; - hdr->hdr_len = - cookie->l2_len + - cookie->l3_len + - cookie->l4_len; - } else { - ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); - ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); - ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); - } - } -} static inline void virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq, @@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq, virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers); } -static inline void -virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, - uint16_t needed, int can_push, int in_order) -{ - struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; - struct vq_desc_extra *dxp; - struct virtqueue *vq = txvq->vq; - struct vring_packed_desc *start_dp, *head_dp; - uint16_t idx, id, head_idx, head_flags; - int16_t head_size = vq->hw->vtnet_hdr_size; - struct virtio_net_hdr *hdr; - uint16_t prev; - bool prepend_header = false; - - id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; - - dxp = &vq->vq_descx[id]; - dxp->ndescs = needed; - dxp->cookie = cookie; - - head_idx = vq->vq_avail_idx; - idx = head_idx; - prev = head_idx; - start_dp = vq->vq_packed.ring.desc; - - head_dp = &vq->vq_packed.ring.desc[idx]; - head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; - head_flags |= vq->vq_packed.cached_flags; - - if (can_push) { - /* prepend cannot fail, checked by caller */ - hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, - -head_size); - prepend_header = true; - - /* if offload disabled, it is not zeroed below, do it now */ - if (!vq->hw->has_tx_offload) - virtqueue_clear_net_hdr(hdr); - } else { - /* setup first tx ring slot to point to header - * stored in reserved region. - */ - start_dp[idx].addr = txvq->virtio_net_hdr_mem + - RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); - start_dp[idx].len = vq->hw->vtnet_hdr_size; - hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } - - virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); - - do { - uint16_t flags; - - start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); - start_dp[idx].len = cookie->data_len; - if (prepend_header) { - start_dp[idx].addr -= head_size; - start_dp[idx].len += head_size; - prepend_header = false; - } - - if (likely(idx != head_idx)) { - flags = cookie->next ? 
VRING_DESC_F_NEXT : 0; - flags |= vq->vq_packed.cached_flags; - start_dp[idx].flags = flags; - } - prev = idx; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } while ((cookie = cookie->next) != NULL); - - start_dp[prev].id = id; - - vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); - vq->vq_avail_idx = idx; - - if (!in_order) { - vq->vq_desc_head_idx = dxp->next; - if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) - vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; - } - - virtqueue_store_flags_packed(head_dp, head_flags, - vq->hw->weak_barriers); -} - static inline void virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie, uint16_t needed, int use_indirect, int can_push, diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 43e305ecc..18ae34789 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,7 @@ struct rte_mbuf; +#define DEFAULT_TX_FREE_THRESH 32 #define DEFAULT_RX_FREE_THRESH 32 #define VIRTIO_MBUF_BURST_SZ 64 @@ -562,4 +563,165 @@ virtqueue_notify(struct virtqueue *vq) #define VIRTQUEUE_DUMP(vq) do { } while (0) #endif +/* avoid write operation when necessary, to lessen cache issues */ +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ + typeof(var) var_ = (var); \ + typeof(val) val_ = (val); \ + if ((var_) != (val_)) \ + (var_) = (val_); \ +} while (0) + +#define virtqueue_clear_net_hdr(hdr) do { \ + typeof(hdr) hdr_ = (hdr); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_start, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_offset, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->flags, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_type, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_size, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->hdr_len, 0); \ +} while (0) + +static inline void +virtqueue_xmit_offload(struct virtio_net_hdr *hdr, + struct rte_mbuf *cookie, + bool offload) +{ + if (offload) { + if (cookie->ol_flags & PKT_TX_TCP_SEG) + cookie->ol_flags |= PKT_TX_TCP_CKSUM; + + switch (cookie->ol_flags & PKT_TX_L4_MASK) { + case PKT_TX_UDP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_udp_hdr, + dgram_cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + case PKT_TX_TCP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + default: + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); + break; + } + + /* TCP Segmentation Offload */ + if (cookie->ol_flags & PKT_TX_TCP_SEG) { + hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
+ VIRTIO_NET_HDR_GSO_TCPV6 : + VIRTIO_NET_HDR_GSO_TCPV4; + hdr->gso_size = cookie->tso_segsz; + hdr->hdr_len = + cookie->l2_len + + cookie->l3_len + + cookie->l4_len; + } else { + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); + } + } +} + +static inline void +virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, + uint16_t needed, int can_push, int in_order) +{ + struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; + struct vq_desc_extra *dxp; + struct virtqueue *vq = txvq->vq; + struct vring_packed_desc *start_dp, *head_dp; + uint16_t idx, id, head_idx, head_flags; + int16_t head_size = vq->hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + uint16_t prev; + bool prepend_header = false; + + id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; + + dxp = &vq->vq_descx[id]; + dxp->ndescs = needed; + dxp->cookie = cookie; + + head_idx = vq->vq_avail_idx; + idx = head_idx; + prev = head_idx; + start_dp = vq->vq_packed.ring.desc; + + head_dp = &vq->vq_packed.ring.desc[idx]; + head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; + head_flags |= vq->vq_packed.cached_flags; + + if (can_push) { + /* prepend cannot fail, checked by caller */ + hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, + -head_size); + prepend_header = true; + + /* if offload disabled, it is not zeroed below, do it now */ + if (!vq->hw->has_tx_offload) + virtqueue_clear_net_hdr(hdr); + } else { + /* setup first tx ring slot to point to header + * stored in reserved region. + */ + start_dp[idx].addr = txvq->virtio_net_hdr_mem + + RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); + start_dp[idx].len = vq->hw->vtnet_hdr_size; + hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } + + virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); + + do { + uint16_t flags; + + start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); + start_dp[idx].len = cookie->data_len; + if (prepend_header) { + start_dp[idx].addr -= head_size; + start_dp[idx].len += head_size; + prepend_header = false; + } + + if (likely(idx != head_idx)) { + flags = cookie->next ? VRING_DESC_F_NEXT : 0; + flags |= vq->vq_packed.cached_flags; + start_dp[idx].flags = flags; + } + prev = idx; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } while ((cookie = cookie->next) != NULL); + + start_dp[prev].id = id; + + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); + vq->vq_avail_idx = idx; + + if (!in_order) { + vq->vq_desc_head_idx = dxp->next; + if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) + vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; + } + + virtqueue_store_flags_packed(head_dp, head_flags, + vq->hw->weak_barriers); +} #endif /* _VIRTQUEUE_H_ */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
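Editorial note: since virtqueue_xmit_offload() now lives in virtqueue.h and is shared with the vectorized Tx path, it may help to recall what feeds it. The sketch below shows the mbuf fields an application sets to request TSO on a TCP/IPv4 packet; it follows the generic rte_mbuf Tx offload contract rather than anything specific to this patch, and the MSS value is only an example.

#include <rte_mbuf.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_tcp.h>

/* With these fields set, virtqueue_xmit_offload() fills the virtio_net_hdr
 * checksum fields and gso_type/gso_size/hdr_len for a TCPv4 TSO request.
 */
static void
request_tcp_tso(struct rte_mbuf *m)
{
	m->l2_len = sizeof(struct rte_ether_hdr);
	m->l3_len = sizeof(struct rte_ipv4_hdr);
	m->l4_len = sizeof(struct rte_tcp_hdr);
	m->tso_segsz = 1448; /* example MSS */
	m->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_SEG;
}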
* [dpdk-dev] [PATCH v7 7/9] net/virtio: add vectorized packed ring Tx path 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu ` (5 preceding siblings ...) 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu @ 2020-04-22 6:16 ` Marvin Liu 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 8/9] net/virtio: add election for vectorized path Marvin Liu 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 9/9] doc: add packed " Marvin Liu 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-22 6:16 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren Cc: dev, Marvin Liu Optimize packed ring Tx path alike Rx path. Split Tx path into batch and single Tx functions. Batch function is further optimized by vector instructions. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index 5c112cac7..b7d52d497 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index cf18fe564..f82fe8d64 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, { return 0; } + +__rte_weak uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused, + struct rte_mbuf **tx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c index d02ba9ba6..60d03b6d8 100644 --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -23,6 +23,24 @@ #define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \ FLAGS_BITS_OFFSET) +/* reference count offset in mbuf rearm data */ +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \ + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) +/* segment number offset in mbuf rearm data */ +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \ + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) + +/* default rearm data */ +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \ + 1ULL << REFCNT_BITS_OFFSET) + +/* id bits offset in packed ring desc higher 64bits */ +#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \ + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) + +/* net hdr short size mask */ +#define NET_HDR_MASK 0x3F + #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ sizeof(struct vring_packed_desc)) #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) @@ -47,6 +65,47 @@ for (iter = val; iter < num; iter++) #endif +static inline void +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq) +{ + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; + struct vq_desc_extra *dxp; + uint16_t used_idx, id, curr_id, free_cnt = 0; + uint16_t size = vq->vq_nentries; + struct rte_mbuf *mbufs[size]; + uint16_t nb_mbuf = 0, i; + + used_idx = vq->vq_used_cons_idx; + + if 
(!desc_is_used(&desc[used_idx], vq)) + return; + + id = desc[used_idx].id; + + do { + curr_id = used_idx; + dxp = &vq->vq_descx[used_idx]; + used_idx += dxp->ndescs; + free_cnt += dxp->ndescs; + + if (dxp->cookie != NULL) { + mbufs[nb_mbuf] = dxp->cookie; + dxp->cookie = NULL; + nb_mbuf++; + } + + if (used_idx >= size) { + used_idx -= size; + vq->vq_packed.used_wrap_counter ^= 1; + } + } while (curr_id != id); + + for (i = 0; i < nb_mbuf; i++) + rte_pktmbuf_free(mbufs[i]); + + vq->vq_used_cons_idx = used_idx; + vq->vq_free_cnt += free_cnt; +} static inline void virtio_update_batch_stats(struct virtnet_stats *stats, @@ -60,6 +119,236 @@ virtio_update_batch_stats(struct virtnet_stats *stats, stats->bytes += pkt_len3; stats->bytes += pkt_len4; } + +static inline int +virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf **tx_pkts) +{ + struct virtqueue *vq = txvq->vq; + uint16_t head_size = vq->hw->vtnet_hdr_size; + uint16_t idx = vq->vq_avail_idx; + struct virtio_net_hdr *hdr; + uint16_t i, cmp; + + if (vq->vq_avail_idx & PACKED_BATCH_MASK) + return -1; + + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) + return -1; + + /* Load four mbufs rearm data */ + RTE_BUILD_BUG_ON(REFCNT_BITS_OFFSET >= 64); + RTE_BUILD_BUG_ON(SEG_NUM_BITS_OFFSET >= 64); + __m256i mbufs = _mm256_set_epi64x(*tx_pkts[3]->rearm_data, + *tx_pkts[2]->rearm_data, + *tx_pkts[1]->rearm_data, + *tx_pkts[0]->rearm_data); + + /* refcnt=1 and nb_segs=1 */ + __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA); + __m256i head_rooms = _mm256_set1_epi16(head_size); + + /* Check refcnt and nb_segs */ + cmp = _mm256_mask_cmpneq_epu16_mask(0x6666, mbufs, mbuf_ref); + if (unlikely(cmp)) + return -1; + + /* Check headroom is enough */ + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_off) != + offsetof(struct rte_mbuf, rearm_data)); + cmp = _mm256_mask_cmplt_epu16_mask(0x1111, mbufs, head_rooms); + if (unlikely(cmp)) + return -1; + + __m512i v_descx = _mm512_set_epi64(0x1, (uint64_t)tx_pkts[3], + 0x1, (uint64_t)tx_pkts[2], + 0x1, (uint64_t)tx_pkts[1], + 0x1, (uint64_t)tx_pkts[0]); + + _mm512_storeu_si512((void *)&vq->vq_descx[idx], v_descx); + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + tx_pkts[i]->data_off -= head_size; + tx_pkts[i]->data_len += head_size; + } + +#ifdef RTE_VIRTIO_USER + __m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])), + tx_pkts[2]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])), + tx_pkts[1]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])), + tx_pkts[0]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0]))); +#else + __m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len, + tx_pkts[3]->buf_iova, + tx_pkts[2]->data_len, + tx_pkts[2]->buf_iova, + tx_pkts[1]->data_len, + tx_pkts[1]->buf_iova, + tx_pkts[0]->data_len, + tx_pkts[0]->buf_iova); +#endif + + /* id offset and data offset */ + __m512i data_offsets = _mm512_set_epi64((uint64_t)3 << ID_BITS_OFFSET, + tx_pkts[3]->data_off, + (uint64_t)2 << ID_BITS_OFFSET, + tx_pkts[2]->data_off, + (uint64_t)1 << ID_BITS_OFFSET, + tx_pkts[1]->data_off, + 0, tx_pkts[0]->data_off); + + __m512i new_descs = _mm512_add_epi64(descs_base, data_offsets); + + uint64_t flags_temp = (uint64_t)idx << ID_BITS_OFFSET | + (uint64_t)vq->vq_packed.cached_flags << FLAGS_BITS_OFFSET; + + /* flags offset and guest virtual address offset */ +#ifdef RTE_VIRTIO_USER + __m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset); +#else + 
__m128i flag_offset = _mm_set_epi64x(flags_temp, 0); +#endif + __m512i v_offset = _mm512_broadcast_i32x4(flag_offset); + + __m512i v_desc = _mm512_add_epi64(new_descs, v_offset); + + if (!vq->hw->has_tx_offload) { + __m128i mask = _mm_set1_epi16(0xFFFF); + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + __m128i v_hdr = _mm_loadu_si128((void *)hdr); + if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK, + v_hdr, mask))) { + __m128i all_zero = _mm_setzero_si128(); + _mm_mask_storeu_epi16((void *)hdr, + NET_HDR_MASK, all_zero); + } + } + } else { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + virtqueue_xmit_offload(hdr, tx_pkts[i], true); + } + } + + /* Enqueue Packet buffers */ + _mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], v_desc); + + virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len, + tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len, + tx_pkts[3]->pkt_len); + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + vq->vq_free_cnt -= PACKED_BATCH_SIZE; + + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + + return 0; +} + +static inline int +virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf *txm) +{ + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint16_t slots, can_push; + int16_t need; + + /* How many main ring entries are needed to this Tx? + * any_layout => number of segments + * default => number of segments + 1 + */ + can_push = rte_mbuf_refcnt_read(txm) == 1 && + RTE_MBUF_DIRECT(txm) && + txm->nb_segs == 1 && + rte_pktmbuf_headroom(txm) >= hdr_size; + + slots = txm->nb_segs + !can_push; + need = slots - vq->vq_free_cnt; + + /* Positive value indicates it need free vring descriptors */ + if (unlikely(need > 0)) { + virtio_xmit_cleanup_packed_vec(vq); + need = slots - vq->vq_free_cnt; + if (unlikely(need > 0)) { + PMD_TX_LOG(ERR, + "No free tx descriptors to transmit"); + return -1; + } + } + + /* Enqueue Packet buffers */ + virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1); + + txvq->stats.bytes += txm->pkt_len; + return 0; +} + +uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_tx *txvq = tx_queue; + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t nb_tx = 0; + uint16_t remained; + + if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) + return nb_tx; + + if (unlikely(nb_pkts < 1)) + return nb_pkts; + + PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts); + + if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh) + virtio_xmit_cleanup_packed_vec(vq); + + remained = RTE_MIN(nb_pkts, vq->vq_free_cnt); + + while (remained) { + if (remained >= PACKED_BATCH_SIZE) { + if (!virtqueue_enqueue_batch_packed_vec(txvq, + &tx_pkts[nb_tx])) { + nb_tx += PACKED_BATCH_SIZE; + remained -= PACKED_BATCH_SIZE; + continue; + } + } + if (!virtqueue_enqueue_single_packed_vec(txvq, + tx_pkts[nb_tx])) { + nb_tx++; + remained--; + continue; + } + break; + }; + + txvq->stats.packets += nb_tx; + + if (likely(nb_tx)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_TX_LOG(DEBUG, "Notified backend after xmit"); + } + } + + return nb_tx; +} + /* Optionally fill offload 
information in structure */ static inline int virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
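Editorial note on the Tx batch above: the rearm_data compares in virtqueue_enqueue_batch_packed_vec() encode a simple eligibility rule, namely that four packets take the batched path only when each is a single-segment mbuf with refcnt 1 and enough headroom for the virtio net header. A scalar sketch of that rule (helper name invented for illustration):

#include <stdint.h>
#include <rte_mbuf.h>

/* Scalar equivalent of the _mm256 mask compares on rearm_data:
 * refcnt == 1, nb_segs == 1, and data_off (headroom) >= header size.
 */
static inline int
tx_batch_eligible(struct rte_mbuf **tx_pkts, uint16_t head_size)
{
	int i;

	for (i = 0; i < 4; i++) {
		if (rte_mbuf_refcnt_read(tx_pkts[i]) != 1 ||
		    tx_pkts[i]->nb_segs != 1 ||
		    tx_pkts[i]->data_off < head_size)
			return 0;
	}
	return 1;
}

Packets that fail this test fall back to virtqueue_enqueue_single_packed_vec(), which still handles chained and headroom-less mbufs through the regular descriptor path.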
* [dpdk-dev] [PATCH v7 8/9] net/virtio: add election for vectorized path 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu ` (6 preceding siblings ...) 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu @ 2020-04-22 6:16 ` Marvin Liu 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 9/9] doc: add packed " Marvin Liu 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-22 6:16 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren Cc: dev, Marvin Liu Rewrite vectorized path selection logic. Default setting comes from RTE_LIBRTE_VIRTIO_INC_VECTOR option. Paths criteria will be checked as listed below. Packed ring vectorized path will be selected when: vectorized option is enabled AVX512F and required extensions are supported by compiler and host virtio VERSION_1 and IN_ORDER features are negotiated virtio mergeable feature is not negotiated LRO offloading is disabled Split ring vectorized rx path will be selected when: vectorized option is enabled virtio mergeable and IN_ORDER features are not negotiated LRO, chksum and vlan strip offloading are disabled Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 361c834a9..c700af6be 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1522,9 +1522,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) if (vtpci_packed_queue(hw)) { PMD_INIT_LOG(INFO, "virtio: using packed ring %s Tx path on port %u", - hw->use_inorder_tx ? "inorder" : "standard", + hw->use_vec_tx ? "vectorized" : "standard", eth_dev->data->port_id); - eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; + if (hw->use_vec_tx) + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec; + else + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; } else { if (hw->use_inorder_tx) { PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u", @@ -1538,7 +1541,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) } if (vtpci_packed_queue(hw)) { - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + if (hw->use_vec_rx) { + PMD_INIT_LOG(INFO, + "virtio: using packed ring vectorized Rx path on port %u", + eth_dev->data->port_id); + eth_dev->rx_pkt_burst = + &virtio_recv_pkts_packed_vec; + } else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { PMD_INIT_LOG(INFO, "virtio: using packed ring mergeable buffer Rx path on port %u", eth_dev->data->port_id); @@ -1950,6 +1959,10 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) goto err_virtio_init; hw->opened = true; +#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR + hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#endif return 0; @@ -2257,33 +2270,63 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - hw->use_vec_rx = 1; + if (vtpci_packed_queue(hw)) { +#if defined RTE_ARCH_X86 + if ((hw->use_vec_rx || hw->use_vec_tx) && + (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) || + !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) || + !vtpci_with_feature(hw, VIRTIO_F_VERSION_1))) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorization for requirements are not met"); + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; + } +#endif - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { - hw->use_inorder_tx = 1; - hw->use_inorder_rx = 1; - hw->use_vec_rx = 0; - } + if (hw->use_vec_rx) { + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; 
+ } - if (vtpci_packed_queue(hw)) { - hw->use_vec_rx = 0; - hw->use_inorder_rx = 0; - } + if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for TCP_LRO enabled"); + hw->use_vec_rx = 0; + } + } + } else { + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { + hw->use_inorder_tx = 1; + hw->use_inorder_rx = 1; + hw->use_vec_rx = 0; + } + if (hw->use_vec_rx) { #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM - if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_vec_rx = 0; - } + if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorization for requirements are not met"); + hw->use_vec_rx = 0; + } #endif - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_vec_rx = 0; - } + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } - if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | - DEV_RX_OFFLOAD_TCP_CKSUM | - DEV_RX_OFFLOAD_TCP_LRO | - DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_vec_rx = 0; + if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | + DEV_RX_OFFLOAD_TCP_CKSUM | + DEV_RX_OFFLOAD_TCP_LRO | + DEV_RX_OFFLOAD_VLAN_STRIP)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for offloading enabled"); + hw->use_vec_rx = 0; + } + } + } return 0; } -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v7 9/9] doc: add packed vectorized path 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu ` (7 preceding siblings ...) 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 8/9] net/virtio: add election for vectorized path Marvin Liu @ 2020-04-22 6:16 ` Marvin Liu 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-22 6:16 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang, harry.van.haaren Cc: dev, Marvin Liu Document packed virtqueue vectorized path selection logic in virtio net PMD. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index 6286286db..4bd46f83e 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -417,6 +417,10 @@ Below devargs are supported by the virtio-user vdev: rte_eth_link_get_nowait function. (Default: 10000 (10G)) +#. ``vectorized``: + + It is used to enable virtio device vectorized path. + (Default: 0 (disabled)) Virtio paths Selection and Usage -------------------------------- @@ -469,6 +473,13 @@ according to below configuration: both negotiated, this path will be selected. #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and Rx mergeable is not negotiated, this path will be selected. +#. Packed virtqueue vectorized Rx path: If building and running environment support + AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated && + TCP_LRO Rx offloading is disabled && vectorized option enabled, + this path will be selected. +#. Packed virtqueue vectorized Tx path: If building and running environment support + AVX512 && in-order feature is negotiated && vectorized option enabled, + this path will be selected. Rx/Tx callbacks of each Virtio path ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -491,6 +502,8 @@ are shown in below table: Packed virtqueue non-meregable path virtio_recv_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order mergeable path virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed virtio_xmit_pkts_packed + Packed virtqueue vectorized Rx path virtio_recv_pkts_packed_vec virtio_xmit_pkts_packed + Packed virtqueue vectorized Tx path virtio_recv_pkts_packed virtio_xmit_pkts_packed_vec ============================================ ================================= ======================== Virtio paths Support Status from Release to Release @@ -508,20 +521,22 @@ All virtio paths support status are shown in below table: .. 
table:: Virtio Paths and Releases - ============================================ ============= ============= ============= - Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 - ============================================ ============= ============= ============= - Split virtqueue mergeable path Y Y Y - Split virtqueue non-mergeable path Y Y Y - Split virtqueue vectorized Rx path Y Y Y - Split virtqueue simple Tx path Y N N - Split virtqueue in-order mergeable path Y Y - Split virtqueue in-order non-mergeable path Y Y - Packed virtqueue mergeable path Y - Packed virtqueue non-mergeable path Y - Packed virtqueue in-order mergeable path Y - Packed virtqueue in-order non-mergeable path Y - ============================================ ============= ============= ============= + ============================================ ============= ============= ============= ======= + Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~ + ============================================ ============= ============= ============= ======= + Split virtqueue mergeable path Y Y Y Y + Split virtqueue non-mergeable path Y Y Y Y + Split virtqueue vectorized Rx path Y Y Y Y + Split virtqueue simple Tx path Y N N N + Split virtqueue in-order mergeable path Y Y Y + Split virtqueue in-order non-mergeable path Y Y Y + Packed virtqueue mergeable path Y Y + Packed virtqueue non-mergeable path Y Y + Packed virtqueue in-order mergeable path Y Y + Packed virtqueue in-order non-mergeable path Y Y + Packed virtqueue vectorized Rx path Y + Packed virtqueue vectorized Tx path Y + ============================================ ============= ============= ============= ======= QEMU Support Status ~~~~~~~~~~~~~~~~~~~ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v8 0/9] add packed ring vectorized path 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu ` (12 preceding siblings ...) 2020-04-22 6:16 ` [dpdk-dev] [PATCH v7 0/9] add packed ring " Marvin Liu @ 2020-04-23 12:30 ` Marvin Liu 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 1/9] net/virtio: add Rx free threshold setting Marvin Liu ` (9 more replies) 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu ` (3 subsequent siblings) 17 siblings, 10 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-23 12:30 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu This patch set introduced vectorized path for packed ring. The size of packed ring descriptor is 16Bytes. Four batched descriptors are just placed into one cacheline. AVX512 instructions can well handle this kind of data. Packed ring TX path can fully transformed into vectorized path. Packed ring Rx path can be vectorized when requirements met(LRO and mergeable disabled). New option RTE_LIBRTE_VIRTIO_INC_VECTOR will be introduced in this patch set. This option will unify split and packed ring vectorized path default setting. Meanwhile user can specify whether enable vectorized path at runtime by 'vectorized' parameter of virtio user vdev. v8: * fix meson build error on ubuntu16.04 and suse15 v7: * default vectorization is disabled * compilation time check dependency on rte_mbuf structure * offsets are calcuated when compiling * remove useless barrier as descs are batched store&load * vindex of scatter is directly set * some comments updates * enable vectorized path in meson build v6: * fix issue when size not power of 2 v5: * remove cpuflags definition as required extensions always come with AVX512F on x86_64 * inorder actions should depend on feature bit * check ring type in rx queue setup * rewrite some commit logs * fix some checkpatch warnings v4: * rename 'packed_vec' to 'vectorized', also used in split ring * add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev * check required AVX512 extensions cpuflags * combine split and packed ring datapath selection logic * remove limitation that size must power of two * clear 12Bytes virtio_net_hdr v3: * remove virtio_net_hdr array for better performance * disable 'packed_vec' by default v2: * more function blocks replaced by vector instructions * clean virtio_net_hdr by vector instruction * allow header room size change * add 'packed_vec' option in virtio_user vdev * fix build not check whether AVX512 enabled * doc update Marvin Liu (9): net/virtio: add Rx free threshold setting net/virtio: enable vectorized path net/virtio: inorder should depend on feature bit net/virtio-user: add vectorized path parameter net/virtio: add vectorized packed ring Rx path net/virtio: reuse packed ring xmit functions net/virtio: add vectorized packed ring Tx path net/virtio: add election for vectorized path doc: add packed vectorized path config/common_base | 1 + doc/guides/nics/virtio.rst | 43 +- drivers/net/virtio/Makefile | 37 ++ drivers/net/virtio/meson.build | 15 + drivers/net/virtio/virtio_ethdev.c | 95 ++- drivers/net/virtio/virtio_ethdev.h | 6 + drivers/net/virtio/virtio_pci.h | 3 +- drivers/net/virtio/virtio_rxtx.c | 212 ++----- drivers/net/virtio/virtio_rxtx_packed_avx.c | 665 ++++++++++++++++++++ drivers/net/virtio/virtio_user_ethdev.c | 37 +- drivers/net/virtio/virtqueue.c | 7 +- drivers/net/virtio/virtqueue.h | 168 ++++- 12 files changed, 1075 insertions(+), 214 deletions(-) create 
mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
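Editorial note: the cover letter's "four descriptors per cacheline" point is the invariant the whole series is built on. A compile-time sketch of that arithmetic, with the descriptor layout copied here only for illustration:

#include <stdint.h>

/* Illustrative copy of the packed descriptor layout: 8 + 4 + 2 + 2 = 16 bytes. */
struct vring_packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

/* One 64-byte cacheline (and one 512-bit AVX512 register) holds exactly
 * the PACKED_BATCH_SIZE of four descriptors used throughout the series.
 */
_Static_assert(sizeof(struct vring_packed_desc) == 16,
	"packed descriptor must be 16 bytes");
_Static_assert(64 / sizeof(struct vring_packed_desc) == 4,
	"four descriptors per 64-byte cacheline");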
* [dpdk-dev] [PATCH v8 1/9] net/virtio: add Rx free threshold setting 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu @ 2020-04-23 12:30 ` Marvin Liu 2020-04-23 8:09 ` Maxime Coquelin 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path Marvin Liu ` (8 subsequent siblings) 9 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-23 12:30 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Introduce free threshold setting in Rx queue, default value of it is 32. Limiated threshold size to multiple of four as only vectorized packed Rx function will utilize it. Virtio driver will rearm Rx queue when more than rx_free_thresh descs were dequeued. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 060410577..94ba7a3ec 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, struct virtio_hw *hw = dev->data->dev_private; struct virtqueue *vq = hw->vqs[vtpci_queue_idx]; struct virtnet_rx *rxvq; + uint16_t rx_free_thresh; PMD_INIT_FUNC_TRACE(); @@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, return -EINVAL; } + rx_free_thresh = rx_conf->rx_free_thresh; + if (rx_free_thresh == 0) + rx_free_thresh = + RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH); + + if (rx_free_thresh & 0x3) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four." + " (rx_free_thresh=%u port=%u queue=%u)\n", + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + + if (rx_free_thresh >= vq->vq_nentries) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the " + "number of RX entries (%u)." + " (rx_free_thresh=%u port=%u queue=%u)\n", + vq->vq_nentries, + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + vq->vq_free_thresh = rx_free_thresh; + if (nb_desc == 0 || nb_desc > vq->vq_nentries) nb_desc = vq->vq_nentries; vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc); diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 58ad7309a..6301c56b2 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,8 @@ struct rte_mbuf; +#define DEFAULT_RX_FREE_THRESH 32 + /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
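Editorial note: for completeness, this is roughly how an application would pass the new threshold down at queue setup time. The port/queue ids, ring size and mempool are placeholders, and 32 simply matches the patch's default.

#include <rte_ethdev.h>

/* Configure a virtio Rx queue with an explicit rx_free_thresh; the value
 * must stay a multiple of four for the vectorized packed Rx path.
 */
static int
setup_virtio_rxq(uint16_t port_id, uint16_t queue_id, struct rte_mempool *mp)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rxconf;
	int ret;

	ret = rte_eth_dev_info_get(port_id, &dev_info);
	if (ret != 0)
		return ret;

	rxconf = dev_info.default_rxconf;
	rxconf.rx_free_thresh = 32;

	return rte_eth_rx_queue_setup(port_id, queue_id, 256,
			rte_eth_dev_socket_id(port_id), &rxconf, mp);
}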
* Re: [dpdk-dev] [PATCH v8 1/9] net/virtio: add Rx free threshold setting 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 1/9] net/virtio: add Rx free threshold setting Marvin Liu @ 2020-04-23 8:09 ` Maxime Coquelin 0 siblings, 0 replies; 162+ messages in thread From: Maxime Coquelin @ 2020-04-23 8:09 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: harry.van.haaren, dev On 4/23/20 2:30 PM, Marvin Liu wrote: > Introduce free threshold setting in Rx queue, default value of it is 32. > Limiated threshold size to multiple of four as only vectorized packed Rx s/Limiated/Limit the/ > function will utilize it. Virtio driver will rearm Rx queue when more > than rx_free_thresh descs were dequeued. > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Thanks, Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 1/9] net/virtio: add Rx free threshold setting Marvin Liu @ 2020-04-23 12:30 ` Marvin Liu 2020-04-23 8:33 ` Maxime Coquelin 2020-04-23 12:31 ` [dpdk-dev] [PATCH v8 3/9] net/virtio: inorder should depend on feature bit Marvin Liu ` (7 subsequent siblings) 9 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-23 12:30 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Previously, virtio split ring vectorized path is enabled as default. This is not suitable for everyone because of that path not follow virtio spec. Add new config for virtio vectorized path selection. By default vectorized path is disabled. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/config/common_base b/config/common_base index 00d8d0792..334a26a17 100644 --- a/config/common_base +++ b/config/common_base @@ -456,6 +456,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=n # # Compile virtio device emulation inside virtio PMD driver diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index c9edb84ee..4b69827ab 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -28,6 +28,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y) ifeq ($(CONFIG_RTE_ARCH_X86),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y) @@ -35,6 +36,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif +endif ifeq ($(CONFIG_RTE_VIRTIO_USER),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index 15150eea1..ce3525ef5 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -8,6 +8,7 @@ sources += files('virtio_ethdev.c', 'virtqueue.c') deps += ['kvargs', 'bus_pci'] +dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1) if arch_subdir == 'x86' sources += files('virtio_rxtx_simple_sse.c') elif arch_subdir == 'ppc' -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path Marvin Liu @ 2020-04-23 8:33 ` Maxime Coquelin 2020-04-23 8:46 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: Maxime Coquelin @ 2020-04-23 8:33 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: harry.van.haaren, dev On 4/23/20 2:30 PM, Marvin Liu wrote: > Previously, virtio split ring vectorized path is enabled as default. s/is/was/ s/as/by/ > This is not suitable for everyone because of that path not follow virtio s/because of that path not follow/because that path does not follow the/ > spec. Add new config for virtio vectorized path selection. By default > vectorized path is disabled. I think we can keep it enabled by default for consistency between make & meson, now that you are providing a devarg for it that is disabled by default. Maybe we can just drop this config flag, what do you think? Thanks, Maxime > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > diff --git a/config/common_base b/config/common_base > index 00d8d0792..334a26a17 100644 > --- a/config/common_base > +++ b/config/common_base > @@ -456,6 +456,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y > CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n > CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n > CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n > +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=n > > # > # Compile virtio device emulation inside virtio PMD driver > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile > index c9edb84ee..4b69827ab 100644 > --- a/drivers/net/virtio/Makefile > +++ b/drivers/net/virtio/Makefile > @@ -28,6 +28,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c > > +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y) > ifeq ($(CONFIG_RTE_ARCH_X86),y) > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c > else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y) > @@ -35,6 +36,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c > else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c > endif > +endif > > ifeq ($(CONFIG_RTE_VIRTIO_USER),y) > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c > diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build > index 15150eea1..ce3525ef5 100644 > --- a/drivers/net/virtio/meson.build > +++ b/drivers/net/virtio/meson.build > @@ -8,6 +8,7 @@ sources += files('virtio_ethdev.c', > 'virtqueue.c') > deps += ['kvargs', 'bus_pci'] > > +dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1) > if arch_subdir == 'x86' > sources += files('virtio_rxtx_simple_sse.c') > elif arch_subdir == 'ppc' > ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path 2020-04-23 8:33 ` Maxime Coquelin @ 2020-04-23 8:46 ` Liu, Yong 2020-04-23 8:49 ` Maxime Coquelin 0 siblings, 1 reply; 162+ messages in thread From: Liu, Yong @ 2020-04-23 8:46 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: Van Haaren, Harry, dev > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Thursday, April 23, 2020 4:34 PM > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > Wang, Zhihong <zhihong.wang@intel.com> > Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org > Subject: Re: [PATCH v8 2/9] net/virtio: enable vectorized path > > > > On 4/23/20 2:30 PM, Marvin Liu wrote: > > Previously, virtio split ring vectorized path is enabled as default. > > s/is/was/ > s/as/by/ > > > This is not suitable for everyone because of that path not follow virtio > > s/because of that path not follow/because that path does not follow the/ > > > spec. Add new config for virtio vectorized path selection. By default > > vectorized path is disabled. > > I think we can keep it enabled by default for consistency between make & > meson, now that you are providing a devarg for it that is disabled by > default. > > Maybe we can just drop this config flag, what do you think? > Maxime, Devarg will only have effect on virtio-user path selection, while DPDK configuration can affect both virtio pmd and virtio-user. It maybe worth to add new configuration as it can allow user to choice whether disabled vectorized path in virtio pmd. IMHO, AVX512 instructions should be selective in each component. Regards, Marvin > Thanks, > Maxime > > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > > > diff --git a/config/common_base b/config/common_base > > index 00d8d0792..334a26a17 100644 > > --- a/config/common_base > > +++ b/config/common_base > > @@ -456,6 +456,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y > > CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n > > CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n > > CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n > > +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=n > > > > # > > # Compile virtio device emulation inside virtio PMD driver > > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile > > index c9edb84ee..4b69827ab 100644 > > --- a/drivers/net/virtio/Makefile > > +++ b/drivers/net/virtio/Makefile > > @@ -28,6 +28,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += > virtio_rxtx.c > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c > > > > +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y) > > ifeq ($(CONFIG_RTE_ARCH_X86),y) > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c > > else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y) > > @@ -35,6 +36,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += > virtio_rxtx_simple_altivec.c > > else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) > $(CONFIG_RTE_ARCH_ARM64)),) > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c > > endif > > +endif > > > > ifeq ($(CONFIG_RTE_VIRTIO_USER),y) > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c > > diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build > > index 15150eea1..ce3525ef5 100644 > > --- a/drivers/net/virtio/meson.build > > +++ b/drivers/net/virtio/meson.build > > @@ -8,6 +8,7 @@ sources += files('virtio_ethdev.c', > > 'virtqueue.c') > > deps += ['kvargs', 'bus_pci'] > > > > +dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1) > 
> if arch_subdir == 'x86' > > sources += files('virtio_rxtx_simple_sse.c') > > elif arch_subdir == 'ppc' > > ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path 2020-04-23 8:46 ` Liu, Yong @ 2020-04-23 8:49 ` Maxime Coquelin 2020-04-23 9:59 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: Maxime Coquelin @ 2020-04-23 8:49 UTC (permalink / raw) To: Liu, Yong, Ye, Xiaolong, Wang, Zhihong; +Cc: Van Haaren, Harry, dev On 4/23/20 10:46 AM, Liu, Yong wrote: > > >> -----Original Message----- >> From: Maxime Coquelin <maxime.coquelin@redhat.com> >> Sent: Thursday, April 23, 2020 4:34 PM >> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; >> Wang, Zhihong <zhihong.wang@intel.com> >> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org >> Subject: Re: [PATCH v8 2/9] net/virtio: enable vectorized path >> >> >> >> On 4/23/20 2:30 PM, Marvin Liu wrote: >>> Previously, virtio split ring vectorized path is enabled as default. >> >> s/is/was/ >> s/as/by/ >> >>> This is not suitable for everyone because of that path not follow virtio >> >> s/because of that path not follow/because that path does not follow the/ >> >>> spec. Add new config for virtio vectorized path selection. By default >>> vectorized path is disabled. >> >> I think we can keep it enabled by default for consistency between make & >> meson, now that you are providing a devarg for it that is disabled by >> default. >> >> Maybe we can just drop this config flag, what do you think? >> > > Maxime, > Devarg will only have effect on virtio-user path selection, while DPDK configuration can affect both virtio pmd and virtio-user. > It maybe worth to add new configuration as it can allow user to choice whether disabled vectorized path in virtio pmd. Ok, so we had a misunderstanding. I was requesting the the devarg to be effective also for the Virtio PMD, disabled by default. Thanks, Maxime > IMHO, AVX512 instructions should be selective in each component. 
> > Regards, > Marvin > >> Thanks, >> Maxime >> >>> Signed-off-by: Marvin Liu <yong.liu@intel.com> >>> >>> diff --git a/config/common_base b/config/common_base >>> index 00d8d0792..334a26a17 100644 >>> --- a/config/common_base >>> +++ b/config/common_base >>> @@ -456,6 +456,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y >>> CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n >>> CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n >>> CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n >>> +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=n >>> >>> # >>> # Compile virtio device emulation inside virtio PMD driver >>> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile >>> index c9edb84ee..4b69827ab 100644 >>> --- a/drivers/net/virtio/Makefile >>> +++ b/drivers/net/virtio/Makefile >>> @@ -28,6 +28,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += >> virtio_rxtx.c >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c >>> >>> +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y) >>> ifeq ($(CONFIG_RTE_ARCH_X86),y) >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c >>> else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y) >>> @@ -35,6 +36,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += >> virtio_rxtx_simple_altivec.c >>> else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) >> $(CONFIG_RTE_ARCH_ARM64)),) >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c >>> endif >>> +endif >>> >>> ifeq ($(CONFIG_RTE_VIRTIO_USER),y) >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c >>> diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build >>> index 15150eea1..ce3525ef5 100644 >>> --- a/drivers/net/virtio/meson.build >>> +++ b/drivers/net/virtio/meson.build >>> @@ -8,6 +8,7 @@ sources += files('virtio_ethdev.c', >>> 'virtqueue.c') >>> deps += ['kvargs', 'bus_pci'] >>> >>> +dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1) >>> if arch_subdir == 'x86' >>> sources += files('virtio_rxtx_simple_sse.c') >>> elif arch_subdir == 'ppc' >>> > ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path 2020-04-23 8:49 ` Maxime Coquelin @ 2020-04-23 9:59 ` Liu, Yong 0 siblings, 0 replies; 162+ messages in thread From: Liu, Yong @ 2020-04-23 9:59 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: Van Haaren, Harry, dev > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Thursday, April 23, 2020 4:50 PM > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > Wang, Zhihong <zhihong.wang@intel.com> > Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org > Subject: Re: [PATCH v8 2/9] net/virtio: enable vectorized path > > > > On 4/23/20 10:46 AM, Liu, Yong wrote: > > > > > >> -----Original Message----- > >> From: Maxime Coquelin <maxime.coquelin@redhat.com> > >> Sent: Thursday, April 23, 2020 4:34 PM > >> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > >> Wang, Zhihong <zhihong.wang@intel.com> > >> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org > >> Subject: Re: [PATCH v8 2/9] net/virtio: enable vectorized path > >> > >> > >> > >> On 4/23/20 2:30 PM, Marvin Liu wrote: > >>> Previously, virtio split ring vectorized path is enabled as default. > >> > >> s/is/was/ > >> s/as/by/ > >> > >>> This is not suitable for everyone because of that path not follow virtio > >> > >> s/because of that path not follow/because that path does not follow the/ > >> > >>> spec. Add new config for virtio vectorized path selection. By default > >>> vectorized path is disabled. > >> > >> I think we can keep it enabled by default for consistency between make & > >> meson, now that you are providing a devarg for it that is disabled by > >> default. > >> > >> Maybe we can just drop this config flag, what do you think? > >> > > > > Maxime, > > Devarg will only have effect on virtio-user path selection, while DPDK > configuration can affect both virtio pmd and virtio-user. > > It maybe worth to add new configuration as it can allow user to choice > whether disabled vectorized path in virtio pmd. > > Ok, so we had a misunderstanding. I was requesting the devarg to be > effective also for the Virtio PMD, disabled by default. > Got you, will change in next version. > Thanks, > Maxime > > IMHO, AVX512 instructions should be selective in each component. 
> > > > Regards, > > Marvin > > > >> Thanks, > >> Maxime > >> > >>> Signed-off-by: Marvin Liu <yong.liu@intel.com> > >>> > >>> diff --git a/config/common_base b/config/common_base > >>> index 00d8d0792..334a26a17 100644 > >>> --- a/config/common_base > >>> +++ b/config/common_base > >>> @@ -456,6 +456,7 @@ CONFIG_RTE_LIBRTE_VIRTIO_PMD=y > >>> CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n > >>> CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n > >>> CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_DUMP=n > >>> +CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR=n > >>> > >>> # > >>> # Compile virtio device emulation inside virtio PMD driver > >>> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile > >>> index c9edb84ee..4b69827ab 100644 > >>> --- a/drivers/net/virtio/Makefile > >>> +++ b/drivers/net/virtio/Makefile > >>> @@ -28,6 +28,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += > >> virtio_rxtx.c > >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c > >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c > >>> > >>> +ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_INC_VECTOR),y) > >>> ifeq ($(CONFIG_RTE_ARCH_X86),y) > >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_sse.c > >>> else ifeq ($(CONFIG_RTE_ARCH_PPC_64),y) > >>> @@ -35,6 +36,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += > >> virtio_rxtx_simple_altivec.c > >>> else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) > >> $(CONFIG_RTE_ARCH_ARM64)),) > >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c > >>> endif > >>> +endif > >>> > >>> ifeq ($(CONFIG_RTE_VIRTIO_USER),y) > >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c > >>> diff --git a/drivers/net/virtio/meson.build > b/drivers/net/virtio/meson.build > >>> index 15150eea1..ce3525ef5 100644 > >>> --- a/drivers/net/virtio/meson.build > >>> +++ b/drivers/net/virtio/meson.build > >>> @@ -8,6 +8,7 @@ sources += files('virtio_ethdev.c', > >>> 'virtqueue.c') > >>> deps += ['kvargs', 'bus_pci'] > >>> > >>> +dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1) > >>> if arch_subdir == 'x86' > >>> sources += files('virtio_rxtx_simple_sse.c') > >>> elif arch_subdir == 'ppc' > >>> > > ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v8 3/9] net/virtio: inorder should depend on feature bit 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 1/9] net/virtio: add Rx free threshold setting Marvin Liu 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 2/9] net/virtio: enable vectorized path Marvin Liu @ 2020-04-23 12:31 ` Marvin Liu 2020-04-23 8:46 ` Maxime Coquelin 2020-04-23 12:31 ` [dpdk-dev] [PATCH v8 4/9] net/virtio-user: add vectorized path parameter Marvin Liu ` (6 subsequent siblings) 9 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-23 12:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Ring initialzation is different when inorder feature negotiated. This action should dependent on negotiated feature bits. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 94ba7a3ec..e450477e8 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -989,6 +989,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) struct rte_mbuf *m; uint16_t desc_idx; int error, nbufs, i; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); PMD_INIT_FUNC_TRACE(); @@ -1018,7 +1019,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; } - } else if (hw->use_inorder_rx) { + } else if (!vtpci_packed_queue(vq->hw) && in_order) { if ((!virtqueue_full(vq))) { uint16_t free_cnt = vq->vq_free_cnt; struct rte_mbuf *pkts[free_cnt]; @@ -1133,7 +1134,7 @@ virtio_dev_tx_queue_setup_finish(struct rte_eth_dev *dev, PMD_INIT_FUNC_TRACE(); if (!vtpci_packed_queue(hw)) { - if (hw->use_inorder_tx) + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) vq->vq_split.ring.desc[vq->vq_nentries - 1].next = 0; } @@ -2046,7 +2047,7 @@ virtio_xmit_pkts_packed(void *tx_queue, struct rte_mbuf **tx_pkts, struct virtio_hw *hw = vq->hw; uint16_t hdr_size = hw->vtnet_hdr_size; uint16_t nb_tx = 0; - bool in_order = hw->use_inorder_tx; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) return nb_tx; -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v8 3/9] net/virtio: inorder should depend on feature bit 2020-04-23 12:31 ` [dpdk-dev] [PATCH v8 3/9] net/virtio: inorder should depend on feature bit Marvin Liu @ 2020-04-23 8:46 ` Maxime Coquelin 0 siblings, 0 replies; 162+ messages in thread From: Maxime Coquelin @ 2020-04-23 8:46 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: harry.van.haaren, dev On 4/23/20 2:31 PM, Marvin Liu wrote: > Ring initialzation is different when inorder feature negotiated. This s/initialzation/initialization/ > action should dependent on negotiated feature bits. > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Thanks, Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
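The change above drops the driver's cached use_inorder_rx/use_inorder_tx hints in favour of the negotiated feature bit itself, which stays authoritative even after path election rewrites those hints. A minimal sketch of the kind of feature-bit test involved is shown below; it models only the one field needed for the check, so the struct and helper names are illustrative stand-ins for the real definitions in virtio_pci.h.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in: the real struct virtio_hw in virtio_pci.h has
 * many more members; only the negotiated feature word matters here. */
struct virtio_hw_sketch {
	uint64_t guest_features;	/* feature bits negotiated with the device */
};

#define VIRTIO_F_IN_ORDER_BIT	35	/* bit position defined by the virtio spec */

static inline bool
with_feature(const struct virtio_hw_sketch *hw, uint64_t bit)
{
	return (hw->guest_features & (1ULL << bit)) != 0;
}

Ring initialization then keys off with_feature(hw, VIRTIO_F_IN_ORDER_BIT) rather than a software flag that may already have been cleared by the Rx/Tx path selection code.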
* [dpdk-dev] [PATCH v8 4/9] net/virtio-user: add vectorized path parameter 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu ` (2 preceding siblings ...) 2020-04-23 12:31 ` [dpdk-dev] [PATCH v8 3/9] net/virtio: inorder should depend on feature bit Marvin Liu @ 2020-04-23 12:31 ` Marvin Liu 2020-04-23 12:31 ` [dpdk-dev] [PATCH v8 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu ` (5 subsequent siblings) 9 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-23 12:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Add new parameter "vectorized" which can select vectorized path explicitly. This parameter will work when RTE_LIBRTE_VIRTIO_INC_VECTOR option is yes. When "vectorized" is set, driver will check both compiling environment and running environment when selecting path. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 37766cbb6..361c834a9 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1551,8 +1551,8 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed; } } else { - if (hw->use_simple_rx) { - PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u", + if (hw->use_vec_rx) { + PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u", eth_dev->data->port_id); eth_dev->rx_pkt_burst = virtio_recv_pkts_vec; } else if (hw->use_inorder_rx) { @@ -2257,33 +2257,33 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - hw->use_simple_rx = 1; + hw->use_vec_rx = 1; if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { hw->use_inorder_tx = 1; hw->use_inorder_rx = 1; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (vtpci_packed_queue(hw)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; hw->use_inorder_rx = 0; } #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } #endif if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | DEV_RX_OFFLOAD_TCP_CKSUM | DEV_RX_OFFLOAD_TCP_LRO | DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; return 0; } diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h index bd89357e4..668e688e1 100644 --- a/drivers/net/virtio/virtio_pci.h +++ b/drivers/net/virtio/virtio_pci.h @@ -253,7 +253,8 @@ struct virtio_hw { uint8_t vlan_strip; uint8_t use_msix; uint8_t modern; - uint8_t use_simple_rx; + uint8_t use_vec_rx; + uint8_t use_vec_tx; uint8_t use_inorder_rx; uint8_t use_inorder_tx; uint8_t weak_barriers; diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index e450477e8..84f4cf946 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) /* Allocate blank mbufs for the each rx descriptor */ nbufs = 0; - if (hw->use_simple_rx) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { for (desc_idx = 0; desc_idx < vq->vq_nentries; desc_idx++) { vq->vq_split.ring.avail->ring[desc_idx] = desc_idx; @@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) &rxvq->fake_mbuf; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx && 
!vtpci_packed_queue(hw)) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 953f00d72..5c338cf44 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -452,6 +452,8 @@ static const char *valid_args[] = { VIRTIO_USER_ARG_PACKED_VQ, #define VIRTIO_USER_ARG_SPEED "speed" VIRTIO_USER_ARG_SPEED, +#define VIRTIO_USER_ARG_VECTORIZED "vectorized" + VIRTIO_USER_ARG_VECTORIZED, NULL }; @@ -525,7 +527,8 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev) */ hw->use_msix = 1; hw->modern = 0; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; hw->use_inorder_rx = 0; hw->use_inorder_tx = 0; hw->virtio_user_dev = dev; @@ -559,6 +562,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) uint64_t mrg_rxbuf = 1; uint64_t in_order = 1; uint64_t packed_vq = 0; + uint64_t vectorized = 0; char *path = NULL; char *ifname = NULL; char *mac_addr = NULL; @@ -675,6 +679,17 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } } +#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR + if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) { + if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED, + &get_integer_arg, &vectorized) < 0) { + PMD_INIT_LOG(ERR, "error to parse %s", + VIRTIO_USER_ARG_VECTORIZED); + goto end; + } + } +#endif + if (queues > 1 && cq == 0) { PMD_INIT_LOG(ERR, "multi-q requires ctrl-q"); goto end; @@ -727,6 +742,23 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) goto end; } + if (vectorized) { + if (packed_vq) { +#if defined(CC_AVX512_SUPPORT) + hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#else + PMD_INIT_LOG(INFO, + "building environment do not match packed ring vectorized requirement"); +#endif + } else { + hw->use_vec_rx = 1; + } + } else { + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; + } + rte_eth_dev_probing_finish(eth_dev); ret = 0; @@ -785,4 +817,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, "mrg_rxbuf=<0|1> " "in_order=<0|1> " "packed_vq=<0|1> " - "speed=<int>"); + "speed=<int> " + "vectorized=<0|1>"); diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c index 0b4e3bf3e..ca23180de 100644 --- a/drivers/net/virtio/virtqueue.c +++ b/drivers/net/virtio/virtqueue.c @@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq) end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1); for (idx = 0; idx < vq->vq_nentries; idx++) { - if (hw->use_simple_rx && type == VTNET_RQ) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw) && + type == VTNET_RQ) { if (start <= end && idx >= start && idx < end) continue; if (start > end && (idx >= start || idx < end)) @@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) for (i = 0; i < nb_used; i++) { used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1); uep = &vq->vq_split.ring.used->ring[used_idx]; - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { desc_idx = used_idx; rte_pktmbuf_free(vq->sw_ring[desc_idx]); vq->vq_free_cnt++; @@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) vq->vq_used_cons_idx++; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxq); if (virtqueue_kick_prepare(vq)) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
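To make the new devarg concrete, here is one way an application could bring up a virtio-user port with it from code rather than from a testpmd command line. This is a usage sketch only: the vdev name and socket path are placeholders, and with this revision the parameter takes effect only when the build enables RTE_LIBRTE_VIRTIO_INC_VECTOR (and, for the packed ring, when CC_AVX512_SUPPORT is set at build time).

#include <rte_bus_vdev.h>

/* Usage sketch: create a virtio-user vdev with packed ring, in-order and
 * the new "vectorized" devarg enabled. rte_eal_init() must already have
 * run; returns 0 on success. */
static int
probe_vectorized_virtio_user(void)
{
	return rte_vdev_init("net_virtio_user0",
			"path=/tmp/vhost-user.sock,queues=1,"
			"packed_vq=1,in_order=1,vectorized=1");
}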
* [dpdk-dev] [PATCH v8 5/9] net/virtio: add vectorized packed ring Rx path 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu ` (3 preceding siblings ...) 2020-04-23 12:31 ` [dpdk-dev] [PATCH v8 4/9] net/virtio-user: add vectorized path parameter Marvin Liu @ 2020-04-23 12:31 ` Marvin Liu 2020-04-23 12:31 ` [dpdk-dev] [PATCH v8 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu ` (4 subsequent siblings) 9 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-23 12:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Optimize packed ring Rx path when AVX512 enabled and mergeable buffer/Rx LRO offloading are not required. Solution of optimization is pretty like vhost, is that split path into batch and single functions. Batch function is further optimized by vector instructions. Also pad desc extra structure to 16 bytes aligned, thus four elements will be saved in one batch. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index 4b69827ab..de0b00e50 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -36,6 +36,41 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_altivec.c else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif + +ifneq ($(FORCE_DISABLE_AVX512), y) + CC_AVX512_SUPPORT=\ + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ + grep -q AVX512 && echo 1) +endif + +ifeq ($(CC_AVX512_SUPPORT), 1) +CFLAGS += -DCC_AVX512_SUPPORT +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c + +ifeq ($(RTE_TOOLCHAIN), gcc) +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), clang) +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1) +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), icc) +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA +endif +endif + +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds +endif +endif endif ifeq ($(CONFIG_RTE_VIRTIO_USER),y) diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index ce3525ef5..39b3605d9 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -10,6 +10,20 @@ deps += ['kvargs', 'bus_pci'] dpdk_conf.set('RTE_LIBRTE_VIRTIO_INC_VECTOR', 1) if arch_subdir == 'x86' + if '-mno-avx512f' not in machine_args + if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw') + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] + cflags += ['-DCC_AVX512_SUPPORT'] + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' + endif + sources += files('virtio_rxtx_packed_avx.c') + endif + endif sources += files('virtio_rxtx_simple_sse.c') elif arch_subdir == 'ppc' sources += 
files('virtio_rxtx_simple_altivec.c') diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index febaf17a8..5c112cac7 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 84f4cf946..7b65d0b0a 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -1246,7 +1246,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) return 0; } -#define VIRTIO_MBUF_BURST_SZ 64 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc)) uint16_t virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) @@ -2329,3 +2328,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, return nb_tx; } + +__rte_weak uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, + struct rte_mbuf **rx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c new file mode 100644 index 000000000..3380f1da5 --- /dev/null +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -0,0 +1,374 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> + +#include <rte_net.h> + +#include "virtio_logs.h" +#include "virtio_ethdev.h" +#include "virtio_pci.h" +#include "virtqueue.h" + +#define BYTE_SIZE 8 +/* flag bits offset in packed ring desc higher 64bits */ +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \ + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) + +#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \ + FLAGS_BITS_OFFSET) + +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ + sizeof(struct vring_packed_desc)) +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) + +#ifdef VIRTIO_GCC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_ICC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ + for (iter = val; iter < size; iter++) +#endif + +#ifndef virtio_for_each_try_unroll +#define virtio_for_each_try_unroll(iter, val, num) \ + for (iter = val; iter < num; iter++) +#endif + + +static inline void +virtio_update_batch_stats(struct virtnet_stats *stats, + uint16_t pkt_len1, + uint16_t pkt_len2, + uint16_t pkt_len3, + uint16_t pkt_len4) +{ + stats->bytes += pkt_len1; + stats->bytes += pkt_len2; + stats->bytes += pkt_len3; + stats->bytes += pkt_len4; +} +/* Optionally fill offload information in structure */ +static inline int +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) +{ + struct rte_net_hdr_lens hdr_lens; + uint32_t hdrlen, ptype; + int l4_supported = 0; + + /* nothing to do */ + if 
(hdr->flags == 0) + return 0; + + /* GSO not support in vec path, skip check */ + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; + + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); + m->packet_type = ptype; + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) + l4_supported = 1; + + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; + if (hdr->csum_start <= hdrlen && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; + } else { + /* Unknown proto or tunnel, do sw cksum. We can assume + * the cksum field is in the first segment since the + * buffers we provided to the host are large enough. + * In case of SCTP, this will be wrong since it's a CRC + * but there's nothing we can do. + */ + uint16_t csum = 0, off; + + rte_raw_cksum_mbuf(m, hdr->csum_start, + rte_pktmbuf_pkt_len(m) - hdr->csum_start, + &csum); + if (likely(csum != 0xffff)) + csum = ~csum; + off = hdr->csum_offset + hdr->csum_start; + if (rte_pktmbuf_data_len(m) >= off + 1) + *rte_pktmbuf_mtod_offset(m, uint16_t *, + off) = csum; + } + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; + } + + return 0; +} + +static inline uint16_t +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint64_t addrs[PACKED_BATCH_SIZE]; + uint16_t id = vq->vq_used_cons_idx; + uint8_t desc_stats; + uint16_t i; + void *desc_addr; + + if (id & PACKED_BATCH_MASK) + return -1; + + if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries)) + return -1; + + /* only care avail/used bits */ + __m512i v_mask = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + desc_addr = &vq->vq_packed.ring.desc[id]; + + __m512i v_desc = _mm512_loadu_si512(desc_addr); + __m512i v_flag = _mm512_and_epi64(v_desc, v_mask); + + __m512i v_used_flag = _mm512_setzero_si512(); + if (vq->vq_packed.used_wrap_counter) + v_used_flag = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + + /* Check all descs are used */ + desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag); + if (desc_stats) + return -1; + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); + + addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1; + } + + /* + * load len from desc, store into mbuf pkt_len and data_len + * len limiated by l6bit buf_len, pkt_len[16:31] can be ignored + */ + const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12; + __m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, 0xAA); + + /* reduce hdr_len from pkt_len and data_len */ + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask, + (uint32_t)-hdr_size); + + __m512i v_value = _mm512_add_epi32(values, mbuf_len_offset); + + /* assert offset of data_len */ + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) != + offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8); + + __m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3], + addrs[2] + 8, addrs[2], + addrs[1] + 8, addrs[1], + addrs[0] + 8, addrs[0]); + /* batch store into mbufs */ + _mm512_i64scatter_epi64(0, v_index, v_value, 1); + + if (hw->has_rx_offload) { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + char *addr = (char 
*)rx_pkts[i]->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size; + virtio_vec_rx_offload(rx_pkts[i], + (struct virtio_net_hdr *)addr); + } + } + + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, + rx_pkts[3]->pkt_len); + + vq->vq_free_cnt += PACKED_BATCH_SIZE; + + vq->vq_used_cons_idx += PACKED_BATCH_SIZE; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static uint16_t +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + uint16_t used_idx, id; + uint32_t len; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint32_t hdr_size = hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + struct vring_packed_desc *desc; + struct rte_mbuf *cookie; + + desc = vq->vq_packed.ring.desc; + used_idx = vq->vq_used_cons_idx; + if (!desc_is_used(&desc[used_idx], vq)) + return -1; + + len = desc[used_idx].len; + id = desc[used_idx].id; + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; + if (unlikely(cookie == NULL)) { + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u", + vq->vq_used_cons_idx); + return -1; + } + rte_prefetch0(cookie); + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); + + cookie->data_off = RTE_PKTMBUF_HEADROOM; + cookie->ol_flags = 0; + cookie->pkt_len = (uint32_t)(len - hdr_size); + cookie->data_len = (uint32_t)(len - hdr_size); + + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size); + if (hw->has_rx_offload) + virtio_vec_rx_offload(cookie, hdr); + + *rx_pkts = cookie; + + rxvq->stats.bytes += cookie->pkt_len; + + vq->vq_free_cnt++; + vq->vq_used_cons_idx++; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static inline void +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **cookie, + uint16_t num) +{ + struct virtqueue *vq = rxvq->vq; + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; + uint16_t flags = vq->vq_packed.cached_flags; + struct virtio_hw *hw = vq->hw; + struct vq_desc_extra *dxp; + uint16_t idx, i; + uint16_t batch_num, total_num = 0; + uint16_t head_idx = vq->vq_avail_idx; + uint16_t head_flag = vq->vq_packed.cached_flags; + uint64_t addr; + + do { + idx = vq->vq_avail_idx; + + batch_num = PACKED_BATCH_SIZE; + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) + batch_num = vq->vq_nentries - idx; + if (unlikely((total_num + batch_num) > num)) + batch_num = num - total_num; + + virtio_for_each_try_unroll(i, 0, batch_num) { + dxp = &vq->vq_descx[idx + i]; + dxp->cookie = (void *)cookie[total_num + i]; + + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) + + RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size; + start_dp[idx + i].addr = addr; + start_dp[idx + i].len = cookie[total_num + i]->buf_len + - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size; + if (total_num || i) { + virtqueue_store_flags_packed(&start_dp[idx + i], + flags, hw->weak_barriers); + } + } + + vq->vq_avail_idx += batch_num; + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + flags = vq->vq_packed.cached_flags; + } + total_num += batch_num; + } while (total_num < num); + + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, + hw->weak_barriers); + vq->vq_free_cnt = 
(uint16_t)(vq->vq_free_cnt - num); +} + +uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue, + struct rte_mbuf **rx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_rx *rxvq = rx_queue; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t num, nb_rx = 0; + uint32_t nb_enqueued = 0; + uint16_t free_cnt = vq->vq_free_thresh; + + if (unlikely(hw->started == 0)) + return nb_rx; + + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); + if (likely(num > PACKED_BATCH_SIZE)) + num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE); + + while (num) { + if (!virtqueue_dequeue_batch_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx += PACKED_BATCH_SIZE; + num -= PACKED_BATCH_SIZE; + continue; + } + if (!virtqueue_dequeue_single_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx++; + num--; + continue; + } + break; + }; + + PMD_RX_LOG(DEBUG, "dequeue:%d", num); + + rxvq->stats.packets += nb_rx; + + if (likely(vq->vq_free_cnt >= free_cnt)) { + struct rte_mbuf *new_pkts[free_cnt]; + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, + free_cnt) == 0)) { + virtio_recv_refill_packed_vec(rxvq, new_pkts, + free_cnt); + nb_enqueued += free_cnt; + } else { + struct rte_eth_dev *dev = + &rte_eth_devices[rxvq->port_id]; + dev->data->rx_mbuf_alloc_failed += free_cnt; + } + } + + if (likely(nb_enqueued)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_RX_LOG(DEBUG, "Notified"); + } + } + + return nb_rx; +} diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 6301c56b2..43e305ecc 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -20,6 +20,7 @@ struct rte_mbuf; #define DEFAULT_RX_FREE_THRESH 32 +#define VIRTIO_MBUF_BURST_SZ 64 /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO @@ -236,7 +237,8 @@ struct vq_desc_extra { void *cookie; uint16_t ndescs; uint16_t next; -}; + uint8_t padding[4]; +} __rte_packed __rte_aligned(16); struct virtqueue { struct virtio_hw *hw; /**< virtio_hw structure pointer. */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
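Most of the intrinsics above implement one idea: a single 64-byte load covers four 16-byte packed descriptors, and the batch may be consumed only if every descriptor's avail/used flag pair matches the ring's current used wrap counter. A scalar sketch of that test follows for readers who want to cross-check the mask arithmetic; the struct mirrors the 16-byte descriptor layout and the flag bit positions follow the virtio spec, but the names here are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define BATCH_SKETCH	4
#define F_AVAIL		(1 << 7)	/* VRING_PACKED_DESC_F_AVAIL */
#define F_USED		(1 << 15)	/* VRING_PACKED_DESC_F_USED */

struct packed_desc_sketch {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

/* Scalar equivalent of the _mm512_cmpneq_epu64_mask() check: all four
 * contiguous descriptors must have avail == used == used_wrap_counter. */
static bool
batch_is_used(const struct packed_desc_sketch *desc, uint16_t start,
		bool used_wrap_counter)
{
	uint16_t expect = used_wrap_counter ? (F_AVAIL | F_USED) : 0;
	int i;

	for (i = 0; i < BATCH_SKETCH; i++)
		if ((desc[start + i].flags & (F_AVAIL | F_USED)) != expect)
			return false;
	return true;
}

The vector version does the same comparison with one masked compare, then uses a shuffle plus a 64-bit scatter to write pkt_len and data_len into the four mbufs in one pass.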
* [dpdk-dev] [PATCH v8 6/9] net/virtio: reuse packed ring xmit functions 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu ` (4 preceding siblings ...) 2020-04-23 12:31 ` [dpdk-dev] [PATCH v8 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu @ 2020-04-23 12:31 ` Marvin Liu 2020-04-23 12:31 ` [dpdk-dev] [PATCH v8 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu ` (3 subsequent siblings) 9 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-23 12:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Move xmit offload and packed ring xmit enqueue function to header file. These functions will be reused by packed ring vectorized Tx function. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 7b65d0b0a..cf18fe564 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq, return i; } -#ifndef DEFAULT_TX_FREE_THRESH -#define DEFAULT_TX_FREE_THRESH 32 -#endif - static void virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num) { @@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m) } -/* avoid write operation when necessary, to lessen cache issues */ -#define ASSIGN_UNLESS_EQUAL(var, val) do { \ - if ((var) != (val)) \ - (var) = (val); \ -} while (0) - -#define virtqueue_clear_net_hdr(_hdr) do { \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ -} while (0) - -static inline void -virtqueue_xmit_offload(struct virtio_net_hdr *hdr, - struct rte_mbuf *cookie, - bool offload) -{ - if (offload) { - if (cookie->ol_flags & PKT_TX_TCP_SEG) - cookie->ol_flags |= PKT_TX_TCP_CKSUM; - - switch (cookie->ol_flags & PKT_TX_L4_MASK) { - case PKT_TX_UDP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_udp_hdr, - dgram_cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - case PKT_TX_TCP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - default: - ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); - ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); - ASSIGN_UNLESS_EQUAL(hdr->flags, 0); - break; - } - /* TCP Segmentation Offload */ - if (cookie->ol_flags & PKT_TX_TCP_SEG) { - hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
- VIRTIO_NET_HDR_GSO_TCPV6 : - VIRTIO_NET_HDR_GSO_TCPV4; - hdr->gso_size = cookie->tso_segsz; - hdr->hdr_len = - cookie->l2_len + - cookie->l3_len + - cookie->l4_len; - } else { - ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); - ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); - ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); - } - } -} static inline void virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq, @@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq, virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers); } -static inline void -virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, - uint16_t needed, int can_push, int in_order) -{ - struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; - struct vq_desc_extra *dxp; - struct virtqueue *vq = txvq->vq; - struct vring_packed_desc *start_dp, *head_dp; - uint16_t idx, id, head_idx, head_flags; - int16_t head_size = vq->hw->vtnet_hdr_size; - struct virtio_net_hdr *hdr; - uint16_t prev; - bool prepend_header = false; - - id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; - - dxp = &vq->vq_descx[id]; - dxp->ndescs = needed; - dxp->cookie = cookie; - - head_idx = vq->vq_avail_idx; - idx = head_idx; - prev = head_idx; - start_dp = vq->vq_packed.ring.desc; - - head_dp = &vq->vq_packed.ring.desc[idx]; - head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; - head_flags |= vq->vq_packed.cached_flags; - - if (can_push) { - /* prepend cannot fail, checked by caller */ - hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, - -head_size); - prepend_header = true; - - /* if offload disabled, it is not zeroed below, do it now */ - if (!vq->hw->has_tx_offload) - virtqueue_clear_net_hdr(hdr); - } else { - /* setup first tx ring slot to point to header - * stored in reserved region. - */ - start_dp[idx].addr = txvq->virtio_net_hdr_mem + - RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); - start_dp[idx].len = vq->hw->vtnet_hdr_size; - hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } - - virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); - - do { - uint16_t flags; - - start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); - start_dp[idx].len = cookie->data_len; - if (prepend_header) { - start_dp[idx].addr -= head_size; - start_dp[idx].len += head_size; - prepend_header = false; - } - - if (likely(idx != head_idx)) { - flags = cookie->next ? 
VRING_DESC_F_NEXT : 0; - flags |= vq->vq_packed.cached_flags; - start_dp[idx].flags = flags; - } - prev = idx; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } while ((cookie = cookie->next) != NULL); - - start_dp[prev].id = id; - - vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); - vq->vq_avail_idx = idx; - - if (!in_order) { - vq->vq_desc_head_idx = dxp->next; - if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) - vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; - } - - virtqueue_store_flags_packed(head_dp, head_flags, - vq->hw->weak_barriers); -} - static inline void virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie, uint16_t needed, int use_indirect, int can_push, diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 43e305ecc..18ae34789 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,7 @@ struct rte_mbuf; +#define DEFAULT_TX_FREE_THRESH 32 #define DEFAULT_RX_FREE_THRESH 32 #define VIRTIO_MBUF_BURST_SZ 64 @@ -562,4 +563,165 @@ virtqueue_notify(struct virtqueue *vq) #define VIRTQUEUE_DUMP(vq) do { } while (0) #endif +/* avoid write operation when necessary, to lessen cache issues */ +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ + typeof(var) var_ = (var); \ + typeof(val) val_ = (val); \ + if ((var_) != (val_)) \ + (var_) = (val_); \ +} while (0) + +#define virtqueue_clear_net_hdr(hdr) do { \ + typeof(hdr) hdr_ = (hdr); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_start, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_offset, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->flags, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_type, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_size, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->hdr_len, 0); \ +} while (0) + +static inline void +virtqueue_xmit_offload(struct virtio_net_hdr *hdr, + struct rte_mbuf *cookie, + bool offload) +{ + if (offload) { + if (cookie->ol_flags & PKT_TX_TCP_SEG) + cookie->ol_flags |= PKT_TX_TCP_CKSUM; + + switch (cookie->ol_flags & PKT_TX_L4_MASK) { + case PKT_TX_UDP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_udp_hdr, + dgram_cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + case PKT_TX_TCP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + default: + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); + break; + } + + /* TCP Segmentation Offload */ + if (cookie->ol_flags & PKT_TX_TCP_SEG) { + hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
+ VIRTIO_NET_HDR_GSO_TCPV6 : + VIRTIO_NET_HDR_GSO_TCPV4; + hdr->gso_size = cookie->tso_segsz; + hdr->hdr_len = + cookie->l2_len + + cookie->l3_len + + cookie->l4_len; + } else { + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); + } + } +} + +static inline void +virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, + uint16_t needed, int can_push, int in_order) +{ + struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; + struct vq_desc_extra *dxp; + struct virtqueue *vq = txvq->vq; + struct vring_packed_desc *start_dp, *head_dp; + uint16_t idx, id, head_idx, head_flags; + int16_t head_size = vq->hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + uint16_t prev; + bool prepend_header = false; + + id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; + + dxp = &vq->vq_descx[id]; + dxp->ndescs = needed; + dxp->cookie = cookie; + + head_idx = vq->vq_avail_idx; + idx = head_idx; + prev = head_idx; + start_dp = vq->vq_packed.ring.desc; + + head_dp = &vq->vq_packed.ring.desc[idx]; + head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; + head_flags |= vq->vq_packed.cached_flags; + + if (can_push) { + /* prepend cannot fail, checked by caller */ + hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, + -head_size); + prepend_header = true; + + /* if offload disabled, it is not zeroed below, do it now */ + if (!vq->hw->has_tx_offload) + virtqueue_clear_net_hdr(hdr); + } else { + /* setup first tx ring slot to point to header + * stored in reserved region. + */ + start_dp[idx].addr = txvq->virtio_net_hdr_mem + + RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); + start_dp[idx].len = vq->hw->vtnet_hdr_size; + hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } + + virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); + + do { + uint16_t flags; + + start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); + start_dp[idx].len = cookie->data_len; + if (prepend_header) { + start_dp[idx].addr -= head_size; + start_dp[idx].len += head_size; + prepend_header = false; + } + + if (likely(idx != head_idx)) { + flags = cookie->next ? VRING_DESC_F_NEXT : 0; + flags |= vq->vq_packed.cached_flags; + start_dp[idx].flags = flags; + } + prev = idx; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } while ((cookie = cookie->next) != NULL); + + start_dp[prev].id = id; + + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); + vq->vq_avail_idx = idx; + + if (!in_order) { + vq->vq_desc_head_idx = dxp->next; + if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) + vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; + } + + virtqueue_store_flags_packed(head_dp, head_flags, + vq->hw->weak_barriers); +} #endif /* _VIRTQUEUE_H_ */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
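One detail worth noting in the header copy of ASSIGN_UNLESS_EQUAL above: with the typeof temporaries as written, var_ is a local copy, so the final (var_) = (val_) updates that copy rather than the caller's field, and virtqueue_clear_net_hdr() would no longer clear the header in place. A variant that keeps the single-evaluation intent while preserving the write-back could look like the sketch below; this is an illustrative alternative, not text from the posted patch.

/* Sketch: evaluate 'var' and 'val' once, but keep a pointer so the
 * conditional store still reaches the caller's field. */
#define ASSIGN_UNLESS_EQUAL_SKETCH(var, val) do {	\
	typeof(var) *const var_ = &(var);		\
	typeof(val) const val_ = (val);			\
	if (*var_ != val_)				\
		*var_ = val_;				\
} while (0)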
* [dpdk-dev] [PATCH v8 7/9] net/virtio: add vectorized packed ring Tx path 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu ` (5 preceding siblings ...) 2020-04-23 12:31 ` [dpdk-dev] [PATCH v8 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu @ 2020-04-23 12:31 ` Marvin Liu 2020-04-23 12:31 ` [dpdk-dev] [PATCH v8 8/9] net/virtio: add election for vectorized path Marvin Liu ` (2 subsequent siblings) 9 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-23 12:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Optimize packed ring Tx path alike Rx path. Split Tx path into batch and single Tx functions. Batch function is further optimized by vector instructions. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index 5c112cac7..b7d52d497 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index cf18fe564..f82fe8d64 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, { return 0; } + +__rte_weak uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused, + struct rte_mbuf **tx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c index 3380f1da5..c023ace4e 100644 --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -23,6 +23,24 @@ #define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \ FLAGS_BITS_OFFSET) +/* reference count offset in mbuf rearm data */ +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \ + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) +/* segment number offset in mbuf rearm data */ +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \ + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) + +/* default rearm data */ +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \ + 1ULL << REFCNT_BITS_OFFSET) + +/* id bits offset in packed ring desc higher 64bits */ +#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \ + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) + +/* net hdr short size mask */ +#define NET_HDR_MASK 0x3F + #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ sizeof(struct vring_packed_desc)) #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) @@ -47,6 +65,47 @@ for (iter = val; iter < num; iter++) #endif +static inline void +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq) +{ + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; + struct vq_desc_extra *dxp; + uint16_t used_idx, id, curr_id, free_cnt = 0; + uint16_t size = vq->vq_nentries; + struct rte_mbuf *mbufs[size]; + uint16_t nb_mbuf = 0, i; + + used_idx = vq->vq_used_cons_idx; + + if (!desc_is_used(&desc[used_idx], vq)) + return; + + id = 
desc[used_idx].id; + + do { + curr_id = used_idx; + dxp = &vq->vq_descx[used_idx]; + used_idx += dxp->ndescs; + free_cnt += dxp->ndescs; + + if (dxp->cookie != NULL) { + mbufs[nb_mbuf] = dxp->cookie; + dxp->cookie = NULL; + nb_mbuf++; + } + + if (used_idx >= size) { + used_idx -= size; + vq->vq_packed.used_wrap_counter ^= 1; + } + } while (curr_id != id); + + for (i = 0; i < nb_mbuf; i++) + rte_pktmbuf_free(mbufs[i]); + + vq->vq_used_cons_idx = used_idx; + vq->vq_free_cnt += free_cnt; +} static inline void virtio_update_batch_stats(struct virtnet_stats *stats, @@ -60,6 +119,238 @@ virtio_update_batch_stats(struct virtnet_stats *stats, stats->bytes += pkt_len3; stats->bytes += pkt_len4; } + +static inline int +virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf **tx_pkts) +{ + struct virtqueue *vq = txvq->vq; + uint16_t head_size = vq->hw->vtnet_hdr_size; + uint16_t idx = vq->vq_avail_idx; + struct virtio_net_hdr *hdr; + uint16_t i, cmp; + + if (vq->vq_avail_idx & PACKED_BATCH_MASK) + return -1; + + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) + return -1; + + /* Load four mbufs rearm data */ + RTE_BUILD_BUG_ON(REFCNT_BITS_OFFSET >= 64); + RTE_BUILD_BUG_ON(SEG_NUM_BITS_OFFSET >= 64); + __m256i mbufs = _mm256_set_epi64x(*tx_pkts[3]->rearm_data, + *tx_pkts[2]->rearm_data, + *tx_pkts[1]->rearm_data, + *tx_pkts[0]->rearm_data); + + /* refcnt=1 and nb_segs=1 */ + __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA); + __m256i head_rooms = _mm256_set1_epi16(head_size); + + /* Check refcnt and nb_segs */ + const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12; + cmp = _mm256_mask_cmpneq_epu16_mask(mask, mbufs, mbuf_ref); + if (unlikely(cmp)) + return -1; + + /* Check headroom is enough */ + const __mmask16 data_mask = 0x1 | 0x1 << 4 | 0x1 << 8 | 0x1 << 12; + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_off) != + offsetof(struct rte_mbuf, rearm_data)); + cmp = _mm256_mask_cmplt_epu16_mask(data_mask, mbufs, head_rooms); + if (unlikely(cmp)) + return -1; + + __m512i v_descx = _mm512_set_epi64(0x1, (uint64_t)tx_pkts[3], + 0x1, (uint64_t)tx_pkts[2], + 0x1, (uint64_t)tx_pkts[1], + 0x1, (uint64_t)tx_pkts[0]); + + _mm512_storeu_si512((void *)&vq->vq_descx[idx], v_descx); + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + tx_pkts[i]->data_off -= head_size; + tx_pkts[i]->data_len += head_size; + } + +#ifdef RTE_VIRTIO_USER + __m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])), + tx_pkts[2]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])), + tx_pkts[1]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])), + tx_pkts[0]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0]))); +#else + __m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len, + tx_pkts[3]->buf_iova, + tx_pkts[2]->data_len, + tx_pkts[2]->buf_iova, + tx_pkts[1]->data_len, + tx_pkts[1]->buf_iova, + tx_pkts[0]->data_len, + tx_pkts[0]->buf_iova); +#endif + + /* id offset and data offset */ + __m512i data_offsets = _mm512_set_epi64((uint64_t)3 << ID_BITS_OFFSET, + tx_pkts[3]->data_off, + (uint64_t)2 << ID_BITS_OFFSET, + tx_pkts[2]->data_off, + (uint64_t)1 << ID_BITS_OFFSET, + tx_pkts[1]->data_off, + 0, tx_pkts[0]->data_off); + + __m512i new_descs = _mm512_add_epi64(descs_base, data_offsets); + + uint64_t flags_temp = (uint64_t)idx << ID_BITS_OFFSET | + (uint64_t)vq->vq_packed.cached_flags << FLAGS_BITS_OFFSET; + + /* flags offset and guest virtual address offset */ +#ifdef RTE_VIRTIO_USER + 
__m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset); +#else + __m128i flag_offset = _mm_set_epi64x(flags_temp, 0); +#endif + __m512i v_offset = _mm512_broadcast_i32x4(flag_offset); + + __m512i v_desc = _mm512_add_epi64(new_descs, v_offset); + + if (!vq->hw->has_tx_offload) { + __m128i all_mask = _mm_set1_epi16(0xFFFF); + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + __m128i v_hdr = _mm_loadu_si128((void *)hdr); + if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK, + v_hdr, all_mask))) { + __m128i all_zero = _mm_setzero_si128(); + _mm_mask_storeu_epi16((void *)hdr, + NET_HDR_MASK, all_zero); + } + } + } else { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + virtqueue_xmit_offload(hdr, tx_pkts[i], true); + } + } + + /* Enqueue Packet buffers */ + _mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], v_desc); + + virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len, + tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len, + tx_pkts[3]->pkt_len); + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + vq->vq_free_cnt -= PACKED_BATCH_SIZE; + + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + + return 0; +} + +static inline int +virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf *txm) +{ + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint16_t slots, can_push; + int16_t need; + + /* How many main ring entries are needed to this Tx? + * any_layout => number of segments + * default => number of segments + 1 + */ + can_push = rte_mbuf_refcnt_read(txm) == 1 && + RTE_MBUF_DIRECT(txm) && + txm->nb_segs == 1 && + rte_pktmbuf_headroom(txm) >= hdr_size; + + slots = txm->nb_segs + !can_push; + need = slots - vq->vq_free_cnt; + + /* Positive value indicates it need free vring descriptors */ + if (unlikely(need > 0)) { + virtio_xmit_cleanup_packed_vec(vq); + need = slots - vq->vq_free_cnt; + if (unlikely(need > 0)) { + PMD_TX_LOG(ERR, + "No free tx descriptors to transmit"); + return -1; + } + } + + /* Enqueue Packet buffers */ + virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1); + + txvq->stats.bytes += txm->pkt_len; + return 0; +} + +uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_tx *txvq = tx_queue; + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t nb_tx = 0; + uint16_t remained; + + if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) + return nb_tx; + + if (unlikely(nb_pkts < 1)) + return nb_pkts; + + PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts); + + if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh) + virtio_xmit_cleanup_packed_vec(vq); + + remained = RTE_MIN(nb_pkts, vq->vq_free_cnt); + + while (remained) { + if (remained >= PACKED_BATCH_SIZE) { + if (!virtqueue_enqueue_batch_packed_vec(txvq, + &tx_pkts[nb_tx])) { + nb_tx += PACKED_BATCH_SIZE; + remained -= PACKED_BATCH_SIZE; + continue; + } + } + if (!virtqueue_enqueue_single_packed_vec(txvq, + tx_pkts[nb_tx])) { + nb_tx++; + remained--; + continue; + } + break; + }; + + txvq->stats.packets += nb_tx; + + if (likely(nb_tx)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_TX_LOG(DEBUG, "Notified 
backend after xmit"); + } + } + + return nb_tx; +} + /* Optionally fill offload information in structure */ static inline int virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
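Before the 512-bit stores run, the batch path above has to prove that all four mbufs are simple enough to inline: reference count one, a single segment, and enough headroom in front of the data to prepend the virtio-net header. The patch performs that test with masked 16-bit compares over each mbuf's rearm data; the scalar sketch below expresses the same eligibility conditions (plus the RTE_MBUF_DIRECT check used by the single-packet fallback) with an invented helper name.

#include <stdbool.h>
#include <rte_mbuf.h>

#define TX_BATCH_SKETCH	4

/* Scalar sketch of the batch-eligibility check: true only if every mbuf
 * can take the virtio-net header in its headroom as a single,
 * exclusively owned, unsegmented buffer. */
static bool
tx_batch_is_simple(struct rte_mbuf **pkts, uint16_t hdr_size)
{
	int i;

	for (i = 0; i < TX_BATCH_SKETCH; i++) {
		if (rte_mbuf_refcnt_read(pkts[i]) != 1 ||
		    !RTE_MBUF_DIRECT(pkts[i]) ||
		    pkts[i]->nb_segs != 1 ||
		    rte_pktmbuf_headroom(pkts[i]) < hdr_size)
			return false;
	}
	return true;
}

When any of these conditions fails, the burst function falls back to virtqueue_enqueue_single_packed_vec(), which reuses the scalar virtqueue_enqueue_xmit_packed() moved into the header by the previous patch.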
* [dpdk-dev] [PATCH v8 8/9] net/virtio: add election for vectorized path 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu ` (6 preceding siblings ...) 2020-04-23 12:31 ` [dpdk-dev] [PATCH v8 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu @ 2020-04-23 12:31 ` Marvin Liu 2020-04-23 12:31 ` [dpdk-dev] [PATCH v8 9/9] doc: add packed " Marvin Liu 2020-04-23 15:17 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Wang, Yinan 9 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-23 12:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Rewrite vectorized path selection logic. Default setting comes from RTE_LIBRTE_VIRTIO_INC_VECTOR option. Paths criteria will be checked as listed below. Packed ring vectorized path will be selected when: vectorized option is enabled AVX512F and required extensions are supported by compiler and host virtio VERSION_1 and IN_ORDER features are negotiated virtio mergeable feature is not negotiated LRO offloading is disabled Split ring vectorized rx path will be selected when: vectorized option is enabled virtio mergeable and IN_ORDER features are not negotiated LRO, chksum and vlan strip offloading are disabled Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 361c834a9..c700af6be 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1522,9 +1522,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) if (vtpci_packed_queue(hw)) { PMD_INIT_LOG(INFO, "virtio: using packed ring %s Tx path on port %u", - hw->use_inorder_tx ? "inorder" : "standard", + hw->use_vec_tx ? "vectorized" : "standard", eth_dev->data->port_id); - eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; + if (hw->use_vec_tx) + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec; + else + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; } else { if (hw->use_inorder_tx) { PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u", @@ -1538,7 +1541,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) } if (vtpci_packed_queue(hw)) { - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + if (hw->use_vec_rx) { + PMD_INIT_LOG(INFO, + "virtio: using packed ring vectorized Rx path on port %u", + eth_dev->data->port_id); + eth_dev->rx_pkt_burst = + &virtio_recv_pkts_packed_vec; + } else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { PMD_INIT_LOG(INFO, "virtio: using packed ring mergeable buffer Rx path on port %u", eth_dev->data->port_id); @@ -1950,6 +1959,10 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) goto err_virtio_init; hw->opened = true; +#ifdef RTE_LIBRTE_VIRTIO_INC_VECTOR + hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#endif return 0; @@ -2257,33 +2270,63 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - hw->use_vec_rx = 1; + if (vtpci_packed_queue(hw)) { +#if defined RTE_ARCH_X86 + if ((hw->use_vec_rx || hw->use_vec_tx) && + (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) || + !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) || + !vtpci_with_feature(hw, VIRTIO_F_VERSION_1))) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorization for requirements are not met"); + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; + } +#endif - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { - hw->use_inorder_tx = 1; - hw->use_inorder_rx = 1; - hw->use_vec_rx = 0; - } + if (hw->use_vec_rx) { + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + 
"disabled packed ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } - if (vtpci_packed_queue(hw)) { - hw->use_vec_rx = 0; - hw->use_inorder_rx = 0; - } + if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for TCP_LRO enabled"); + hw->use_vec_rx = 0; + } + } + } else { + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { + hw->use_inorder_tx = 1; + hw->use_inorder_rx = 1; + hw->use_vec_rx = 0; + } + if (hw->use_vec_rx) { #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM - if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_vec_rx = 0; - } + if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorization for requirements are not met"); + hw->use_vec_rx = 0; + } #endif - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_vec_rx = 0; - } + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } - if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | - DEV_RX_OFFLOAD_TCP_CKSUM | - DEV_RX_OFFLOAD_TCP_LRO | - DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_vec_rx = 0; + if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | + DEV_RX_OFFLOAD_TCP_CKSUM | + DEV_RX_OFFLOAD_TCP_LRO | + DEV_RX_OFFLOAD_VLAN_STRIP)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for offloading enabled"); + hw->use_vec_rx = 0; + } + } + } return 0; } -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v8 9/9] doc: add packed vectorized path 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu ` (7 preceding siblings ...) 2020-04-23 12:31 ` [dpdk-dev] [PATCH v8 8/9] net/virtio: add election for vectorized path Marvin Liu @ 2020-04-23 12:31 ` Marvin Liu 2020-04-23 15:17 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Wang, Yinan 9 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-23 12:31 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: harry.van.haaren, dev, Marvin Liu Document packed virtqueue vectorized path selection logic in virtio net PMD. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index 6286286db..4bd46f83e 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -417,6 +417,10 @@ Below devargs are supported by the virtio-user vdev: rte_eth_link_get_nowait function. (Default: 10000 (10G)) +#. ``vectorized``: + + It is used to enable virtio device vectorized path. + (Default: 0 (disabled)) Virtio paths Selection and Usage -------------------------------- @@ -469,6 +473,13 @@ according to below configuration: both negotiated, this path will be selected. #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and Rx mergeable is not negotiated, this path will be selected. +#. Packed virtqueue vectorized Rx path: If building and running environment support + AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated && + TCP_LRO Rx offloading is disabled && vectorized option enabled, + this path will be selected. +#. Packed virtqueue vectorized Tx path: If building and running environment support + AVX512 && in-order feature is negotiated && vectorized option enabled, + this path will be selected. Rx/Tx callbacks of each Virtio path ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -491,6 +502,8 @@ are shown in below table: Packed virtqueue non-meregable path virtio_recv_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order mergeable path virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed virtio_xmit_pkts_packed + Packed virtqueue vectorized Rx path virtio_recv_pkts_packed_vec virtio_xmit_pkts_packed + Packed virtqueue vectorized Tx path virtio_recv_pkts_packed virtio_xmit_pkts_packed_vec ============================================ ================================= ======================== Virtio paths Support Status from Release to Release @@ -508,20 +521,22 @@ All virtio paths support status are shown in below table: .. 
table:: Virtio Paths and Releases - ============================================ ============= ============= ============= - Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 - ============================================ ============= ============= ============= - Split virtqueue mergeable path Y Y Y - Split virtqueue non-mergeable path Y Y Y - Split virtqueue vectorized Rx path Y Y Y - Split virtqueue simple Tx path Y N N - Split virtqueue in-order mergeable path Y Y - Split virtqueue in-order non-mergeable path Y Y - Packed virtqueue mergeable path Y - Packed virtqueue non-mergeable path Y - Packed virtqueue in-order mergeable path Y - Packed virtqueue in-order non-mergeable path Y - ============================================ ============= ============= ============= + ============================================ ============= ============= ============= ======= + Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~ + ============================================ ============= ============= ============= ======= + Split virtqueue mergeable path Y Y Y Y + Split virtqueue non-mergeable path Y Y Y Y + Split virtqueue vectorized Rx path Y Y Y Y + Split virtqueue simple Tx path Y N N N + Split virtqueue in-order mergeable path Y Y Y + Split virtqueue in-order non-mergeable path Y Y Y + Packed virtqueue mergeable path Y Y + Packed virtqueue non-mergeable path Y Y + Packed virtqueue in-order mergeable path Y Y + Packed virtqueue in-order non-mergeable path Y Y + Packed virtqueue vectorized Rx path Y + Packed virtqueue vectorized Tx path Y + ============================================ ============= ============= ============= ======= QEMU Support Status ~~~~~~~~~~~~~~~~~~~ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
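As a usage note for the selection rules documented above: on a virtio-user port the packed vectorized Rx/Tx paths can be requested by combining the relevant devargs, for example (socket path and vdev name are placeholders):

    --vdev 'net_virtio_user0,path=/tmp/vhost-user.sock,packed_vq=1,in_order=1,mrg_rxbuf=0,vectorized=1'

Here in_order=1 and mrg_rxbuf=0 satisfy the negotiated-feature conditions, vectorized=1 opts into the vectorized election, and the AVX512 build/runtime requirement is still verified by the driver when the port is configured.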
* Re: [dpdk-dev] [PATCH v8 0/9] add packed ring vectorized path 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu ` (8 preceding siblings ...) 2020-04-23 12:31 ` [dpdk-dev] [PATCH v8 9/9] doc: add packed " Marvin Liu @ 2020-04-23 15:17 ` Wang, Yinan 9 siblings, 0 replies; 162+ messages in thread From: Wang, Yinan @ 2020-04-23 15:17 UTC (permalink / raw) To: Liu, Yong, maxime.coquelin, Ye, Xiaolong, Wang, Zhihong Cc: Van Haaren, Harry, dev, Liu, Yong Tested-by: Wang, Yinan <yinan.wang@intel.com> > -----Original Message----- > From: dev <dev-bounces@dpdk.org> On Behalf Of Marvin Liu > Sent: 2020年4月23日 20:31 > To: maxime.coquelin@redhat.com; Ye, Xiaolong <xiaolong.ye@intel.com>; > Wang, Zhihong <zhihong.wang@intel.com> > Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org; Liu, > Yong <yong.liu@intel.com> > Subject: [dpdk-dev] [PATCH v8 0/9] add packed ring vectorized path > > This patch set introduced vectorized path for packed ring. > > The size of packed ring descriptor is 16Bytes. Four batched descriptors are > just placed into one cacheline. AVX512 instructions can well handle this kind > of data. Packed ring TX path can fully transformed into vectorized path. > Packed ring Rx path can be vectorized when requirements met(LRO and > mergeable disabled). > > New option RTE_LIBRTE_VIRTIO_INC_VECTOR will be introduced in this patch > set. This option will unify split and packed ring vectorized path default setting. > Meanwhile user can specify whether enable vectorized path at runtime by > 'vectorized' parameter of virtio user vdev. > > v8: > * fix meson build error on ubuntu16.04 and suse15 > > v7: > * default vectorization is disabled > * compilation time check dependency on rte_mbuf structure > * offsets are calcuated when compiling > * remove useless barrier as descs are batched store&load > * vindex of scatter is directly set > * some comments updates > * enable vectorized path in meson build > > v6: > * fix issue when size not power of 2 > > v5: > * remove cpuflags definition as required extensions always come with > AVX512F on x86_64 > * inorder actions should depend on feature bit > * check ring type in rx queue setup > * rewrite some commit logs > * fix some checkpatch warnings > > v4: > * rename 'packed_vec' to 'vectorized', also used in split ring > * add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev > * check required AVX512 extensions cpuflags > * combine split and packed ring datapath selection logic > * remove limitation that size must power of two > * clear 12Bytes virtio_net_hdr > > v3: > * remove virtio_net_hdr array for better performance > * disable 'packed_vec' by default > > v2: > * more function blocks replaced by vector instructions > * clean virtio_net_hdr by vector instruction > * allow header room size change > * add 'packed_vec' option in virtio_user vdev > * fix build not check whether AVX512 enabled > * doc update > > > Marvin Liu (9): > net/virtio: add Rx free threshold setting > net/virtio: enable vectorized path > net/virtio: inorder should depend on feature bit > net/virtio-user: add vectorized path parameter > net/virtio: add vectorized packed ring Rx path > net/virtio: reuse packed ring xmit functions > net/virtio: add vectorized packed ring Tx path > net/virtio: add election for vectorized path > doc: add packed vectorized path > > config/common_base | 1 + > doc/guides/nics/virtio.rst | 43 +- > drivers/net/virtio/Makefile | 37 ++ > drivers/net/virtio/meson.build | 15 + > drivers/net/virtio/virtio_ethdev.c | 95 
++- > drivers/net/virtio/virtio_ethdev.h | 6 + > drivers/net/virtio/virtio_pci.h | 3 +- > drivers/net/virtio/virtio_rxtx.c | 212 ++----- > drivers/net/virtio/virtio_rxtx_packed_avx.c | 665 ++++++++++++++++++++ > drivers/net/virtio/virtio_user_ethdev.c | 37 +- > drivers/net/virtio/virtqueue.c | 7 +- > drivers/net/virtio/virtqueue.h | 168 ++++- > 12 files changed, 1075 insertions(+), 214 deletions(-) > create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c > > -- > 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v9 0/9] add packed ring vectorized path 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu ` (13 preceding siblings ...) 2020-04-23 12:30 ` [dpdk-dev] [PATCH v8 0/9] add packed ring " Marvin Liu @ 2020-04-24 9:24 ` Marvin Liu 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 1/9] net/virtio: add Rx free threshold setting Marvin Liu ` (8 more replies) 2020-04-26 2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu ` (2 subsequent siblings) 17 siblings, 9 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-24 9:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: dev, harry.van.haaren, Marvin Liu This patch set introduced vectorized path for packed ring. The size of packed ring descriptor is 16Bytes. Four batched descriptors are just placed into one cacheline. AVX512 instructions can well handle this kind of data. Packed ring TX path can fully transformed into vectorized path. Packed ring Rx path can be vectorized when requirements met(LRO and mergeable disabled). New option RTE_LIBRTE_VIRTIO_INC_VECTOR will be introduced in this patch set. This option will unify split and packed ring vectorized path default setting. Meanwhile user can specify whether enable vectorized path at runtime by 'vectorized' parameter of virtio user vdev. v9: * replace RTE_LIBRTE_VIRTIO_INC_VECTOR with vectorized devarg * reorder patch sequence v8: * fix meson build error on ubuntu16.04 and suse15 v7: * default vectorization is disabled * compilation time check dependency on rte_mbuf structure * offsets are calcuated when compiling * remove useless barrier as descs are batched store&load * vindex of scatter is directly set * some comments updates * enable vectorized path in meson build v6: * fix issue when size not power of 2 v5: * remove cpuflags definition as required extensions always come with AVX512F on x86_64 * inorder actions should depend on feature bit * check ring type in rx queue setup * rewrite some commit logs * fix some checkpatch warnings v4: * rename 'packed_vec' to 'vectorized', also used in split ring * add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev * check required AVX512 extensions cpuflags * combine split and packed ring datapath selection logic * remove limitation that size must power of two * clear 12Bytes virtio_net_hdr v3: * remove virtio_net_hdr array for better performance * disable 'packed_vec' by default v2: * more function blocks replaced by vector instructions * clean virtio_net_hdr by vector instruction * allow header room size change * add 'packed_vec' option in virtio_user vdev * fix build not check whether AVX512 enabled * doc update Tested-by: Wang, Yinan <yinan.wang@intel.com> Marvin Liu (9): net/virtio: add Rx free threshold setting net/virtio: inorder should depend on feature bit net/virtio: add vectorized devarg net/virtio-user: add vectorized devarg net/virtio: add vectorized packed ring Rx path net/virtio: reuse packed ring xmit functions net/virtio: add vectorized packed ring Tx path net/virtio: add election for vectorized path doc: add packed vectorized path doc/guides/nics/virtio.rst | 52 +- drivers/net/virtio/Makefile | 35 ++ drivers/net/virtio/meson.build | 14 + drivers/net/virtio/virtio_ethdev.c | 136 +++- drivers/net/virtio/virtio_ethdev.h | 6 + drivers/net/virtio/virtio_pci.h | 3 +- drivers/net/virtio/virtio_rxtx.c | 212 ++----- drivers/net/virtio/virtio_rxtx_packed_avx.c | 665 ++++++++++++++++++++ drivers/net/virtio/virtio_user_ethdev.c | 32 +- 
drivers/net/virtio/virtqueue.c | 7 +- drivers/net/virtio/virtqueue.h | 168 ++++- 11 files changed, 1112 insertions(+), 218 deletions(-) create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
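The batching assumption stated in the cover letter, 16-byte packed descriptors with four per cache line, can be pinned down with a compile-time check. A small sketch, assuming the driver's struct vring_packed_desc and the 64-byte RTE_CACHE_LINE_SIZE used on x86; the helper itself is illustrative and not part of the series:

static inline void
virtio_packed_batch_assumption_check(void)
{
	/* four 16 B descriptors fill one 64 B cache line, so the batch
	 * size used by the vectorized path evaluates to 4 on x86
	 */
	RTE_BUILD_BUG_ON(sizeof(struct vring_packed_desc) != 16);
	RTE_BUILD_BUG_ON(RTE_CACHE_LINE_SIZE /
			sizeof(struct vring_packed_desc) != 4);
}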
* [dpdk-dev] [PATCH v9 1/9] net/virtio: add Rx free threshold setting 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu @ 2020-04-24 9:24 ` Marvin Liu 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 2/9] net/virtio: inorder should depend on feature bit Marvin Liu ` (7 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-24 9:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: dev, harry.van.haaren, Marvin Liu Introduce free threshold setting in Rx queue, its default value is 32. Limit the threshold size to multiple of four as only vectorized packed Rx function will utilize it. Virtio driver will rearm Rx queue when more than rx_free_thresh descs were dequeued. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 060410577..94ba7a3ec 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, struct virtio_hw *hw = dev->data->dev_private; struct virtqueue *vq = hw->vqs[vtpci_queue_idx]; struct virtnet_rx *rxvq; + uint16_t rx_free_thresh; PMD_INIT_FUNC_TRACE(); @@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, return -EINVAL; } + rx_free_thresh = rx_conf->rx_free_thresh; + if (rx_free_thresh == 0) + rx_free_thresh = + RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH); + + if (rx_free_thresh & 0x3) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four." + " (rx_free_thresh=%u port=%u queue=%u)\n", + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + + if (rx_free_thresh >= vq->vq_nentries) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the " + "number of RX entries (%u)." + " (rx_free_thresh=%u port=%u queue=%u)\n", + vq->vq_nentries, + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + vq->vq_free_thresh = rx_free_thresh; + if (nb_desc == 0 || nb_desc > vq->vq_nentries) nb_desc = vq->vq_nentries; vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc); diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 58ad7309a..6301c56b2 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,8 @@ struct rte_mbuf; +#define DEFAULT_RX_FREE_THRESH 32 + /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
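On the application side the new checks only constrain rte_eth_rxconf::rx_free_thresh. A short sketch of a Rx queue setup that satisfies them; the port id, ring size and mempool are placeholders:

#include <rte_ethdev.h>
#include <rte_lcore.h>

static int
setup_virtio_rxq(uint16_t port_id, struct rte_mempool *mp)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rxconf;
	int ret;

	ret = rte_eth_dev_info_get(port_id, &dev_info);
	if (ret != 0)
		return ret;

	rxconf = dev_info.default_rxconf;
	/* multiple of four and smaller than the ring size (512 here) */
	rxconf.rx_free_thresh = 64;

	return rte_eth_rx_queue_setup(port_id, 0, 512, rte_socket_id(),
			&rxconf, mp);
}

Leaving rx_free_thresh at 0 keeps the default of min(ring size / 4, 32) chosen by the patch.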
* [dpdk-dev] [PATCH v9 2/9] net/virtio: inorder should depend on feature bit 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 1/9] net/virtio: add Rx free threshold setting Marvin Liu @ 2020-04-24 9:24 ` Marvin Liu 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 3/9] net/virtio: add vectorized devarg Marvin Liu ` (6 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-24 9:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: dev, harry.van.haaren, Marvin Liu Ring initialization is different when inorder feature negotiated. This action should dependent on negotiated feature bits. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 94ba7a3ec..e450477e8 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -989,6 +989,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) struct rte_mbuf *m; uint16_t desc_idx; int error, nbufs, i; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); PMD_INIT_FUNC_TRACE(); @@ -1018,7 +1019,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; } - } else if (hw->use_inorder_rx) { + } else if (!vtpci_packed_queue(vq->hw) && in_order) { if ((!virtqueue_full(vq))) { uint16_t free_cnt = vq->vq_free_cnt; struct rte_mbuf *pkts[free_cnt]; @@ -1133,7 +1134,7 @@ virtio_dev_tx_queue_setup_finish(struct rte_eth_dev *dev, PMD_INIT_FUNC_TRACE(); if (!vtpci_packed_queue(hw)) { - if (hw->use_inorder_tx) + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) vq->vq_split.ring.desc[vq->vq_nentries - 1].next = 0; } @@ -2046,7 +2047,7 @@ virtio_xmit_pkts_packed(void *tx_queue, struct rte_mbuf **tx_pkts, struct virtio_hw *hw = vq->hw; uint16_t hdr_size = hw->vtnet_hdr_size; uint16_t nb_tx = 0; - bool in_order = hw->use_inorder_tx; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) return nb_tx; -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v9 3/9] net/virtio: add vectorized devarg 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 1/9] net/virtio: add Rx free threshold setting Marvin Liu 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 2/9] net/virtio: inorder should depend on feature bit Marvin Liu @ 2020-04-24 9:24 ` Marvin Liu 2020-04-24 11:27 ` Maxime Coquelin 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 4/9] net/virtio-user: " Marvin Liu ` (5 subsequent siblings) 8 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-24 9:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: dev, harry.van.haaren, Marvin Liu Previously, virtio split ring vectorized path was enabled by default. This is not suitable for everyone because that path dose not follow virtio spec. Add new devarg for virtio vectorized path selection. By default vectorized path is disabled. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index 6286286db..902a1f0cf 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -363,6 +363,13 @@ Below devargs are supported by the PCI virtio driver: rte_eth_link_get_nowait function. (Default: 10000 (10G)) +#. ``vectorized``: + + It is used to specify whether virtio device perfer to use vectorized path. + Afterwards, dependencies of vectorized path will be checked in path + election. + (Default: 0 (disabled)) + Below devargs are supported by the virtio-user vdev: #. ``path``: diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 37766cbb6..0a69a4db1 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -48,7 +48,8 @@ static int virtio_dev_allmulticast_disable(struct rte_eth_dev *dev); static uint32_t virtio_dev_speed_capa_get(uint32_t speed); static int virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, - uint32_t *speed); + uint32_t *speed, + int *vectorized); static int virtio_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info); static int virtio_dev_link_update(struct rte_eth_dev *dev, @@ -1551,8 +1552,8 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed; } } else { - if (hw->use_simple_rx) { - PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u", + if (hw->use_vec_rx) { + PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u", eth_dev->data->port_id); eth_dev->rx_pkt_burst = virtio_recv_pkts_vec; } else if (hw->use_inorder_rx) { @@ -1886,6 +1887,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) { struct virtio_hw *hw = eth_dev->data->dev_private; uint32_t speed = SPEED_UNKNOWN; + int vectorized = 0; int ret; if (sizeof(struct virtio_net_hdr_mrg_rxbuf) > RTE_PKTMBUF_HEADROOM) { @@ -1912,7 +1914,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) return 0; } ret = virtio_dev_devargs_parse(eth_dev->device->devargs, - NULL, &speed); + NULL, &speed, &vectorized); if (ret < 0) return ret; hw->speed = speed; @@ -1949,6 +1951,11 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) if (ret < 0) goto err_virtio_init; + if (vectorized) { + if (!vtpci_packed_queue(hw)) + hw->use_vec_rx = 1; + } + hw->opened = true; return 0; @@ -2021,9 +2028,20 @@ virtio_dev_speed_capa_get(uint32_t speed) } } +static int vectorized_check_handler(__rte_unused const char *key, + const char *value, void *ret_val) +{ + if (strcmp(value, "1") == 0) + *(int *)ret_val = 1; + else + *(int *)ret_val = 0; + + return 
0; +} #define VIRTIO_ARG_SPEED "speed" #define VIRTIO_ARG_VDPA "vdpa" +#define VIRTIO_ARG_VECTORIZED "vectorized" static int @@ -2045,7 +2063,7 @@ link_speed_handler(const char *key __rte_unused, static int virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, - uint32_t *speed) + uint32_t *speed, int *vectorized) { struct rte_kvargs *kvlist; int ret = 0; @@ -2081,6 +2099,18 @@ virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, } } + if (vectorized && + rte_kvargs_count(kvlist, VIRTIO_ARG_VECTORIZED) == 1) { + ret = rte_kvargs_process(kvlist, + VIRTIO_ARG_VECTORIZED, + vectorized_check_handler, vectorized); + if (ret < 0) { + PMD_INIT_LOG(ERR, "Failed to parse %s", + VIRTIO_ARG_VECTORIZED); + goto exit; + } + } + exit: rte_kvargs_free(kvlist); return ret; @@ -2092,7 +2122,8 @@ static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused, int vdpa = 0; int ret = 0; - ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL); + ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL, + NULL); if (ret < 0) { PMD_INIT_LOG(ERR, "devargs parsing is failed"); return ret; @@ -2257,33 +2288,31 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - hw->use_simple_rx = 1; - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { hw->use_inorder_tx = 1; hw->use_inorder_rx = 1; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (vtpci_packed_queue(hw)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; hw->use_inorder_rx = 0; } #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } #endif if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | DEV_RX_OFFLOAD_TCP_CKSUM | DEV_RX_OFFLOAD_TCP_LRO | DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; return 0; } diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h index bd89357e4..668e688e1 100644 --- a/drivers/net/virtio/virtio_pci.h +++ b/drivers/net/virtio/virtio_pci.h @@ -253,7 +253,8 @@ struct virtio_hw { uint8_t vlan_strip; uint8_t use_msix; uint8_t modern; - uint8_t use_simple_rx; + uint8_t use_vec_rx; + uint8_t use_vec_tx; uint8_t use_inorder_rx; uint8_t use_inorder_tx; uint8_t weak_barriers; diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index e450477e8..84f4cf946 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) /* Allocate blank mbufs for the each rx descriptor */ nbufs = 0; - if (hw->use_simple_rx) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { for (desc_idx = 0; desc_idx < vq->vq_nentries; desc_idx++) { vq->vq_split.ring.avail->ring[desc_idx] = desc_idx; @@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) &rxvq->fake_mbuf; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 953f00d72..150a8d987 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -525,7 +525,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev) */ 
hw->use_msix = 1; hw->modern = 0; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; hw->use_inorder_rx = 0; hw->use_inorder_tx = 0; hw->virtio_user_dev = dev; diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c index 0b4e3bf3e..ca23180de 100644 --- a/drivers/net/virtio/virtqueue.c +++ b/drivers/net/virtio/virtqueue.c @@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq) end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1); for (idx = 0; idx < vq->vq_nentries; idx++) { - if (hw->use_simple_rx && type == VTNET_RQ) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw) && + type == VTNET_RQ) { if (start <= end && idx >= start && idx < end) continue; if (start > end && (idx >= start || idx < end)) @@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) for (i = 0; i < nb_used; i++) { used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1); uep = &vq->vq_split.ring.used->ring[used_idx]; - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { desc_idx = used_idx; rte_pktmbuf_free(vq->sw_ring[desc_idx]); vq->vq_free_cnt++; @@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) vq->vq_used_cons_idx++; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxq); if (virtqueue_kick_prepare(vq)) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
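For the PCI virtio PMD the new devarg is passed through the EAL device arguments in the same way as the existing speed devarg; the BDF below is a placeholder:

    testpmd -l 0-1 -n 4 -w 0000:00:06.0,vectorized=1 -- -i

Note that, per vectorized_check_handler() above, only the literal value 1 enables the option; any other value leaves it disabled. Also, at this point in the series vectorized=1 only sets use_vec_rx for split ring devices; the packed ring wiring comes with the later patches.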
* Re: [dpdk-dev] [PATCH v9 3/9] net/virtio: add vectorized devarg 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 3/9] net/virtio: add vectorized devarg Marvin Liu @ 2020-04-24 11:27 ` Maxime Coquelin 0 siblings, 0 replies; 162+ messages in thread From: Maxime Coquelin @ 2020-04-24 11:27 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev, harry.van.haaren On 4/24/20 11:24 AM, Marvin Liu wrote: > Previously, virtio split ring vectorized path was enabled by default. > This is not suitable for everyone because that path dose not follow > virtio spec. Add new devarg for virtio vectorized path selection. By > default vectorized path is disabled. > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Thanks! Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v9 4/9] net/virtio-user: add vectorized devarg 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu ` (2 preceding siblings ...) 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 3/9] net/virtio: add vectorized devarg Marvin Liu @ 2020-04-24 9:24 ` Marvin Liu 2020-04-24 11:29 ` Maxime Coquelin 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu ` (4 subsequent siblings) 8 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-24 9:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: dev, harry.van.haaren, Marvin Liu Add new devarg for virtio user device vectorized path selection. By default vectorized path is disabled. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index 902a1f0cf..d59add23e 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -424,6 +424,12 @@ Below devargs are supported by the virtio-user vdev: rte_eth_link_get_nowait function. (Default: 10000 (10G)) +#. ``vectorized``: + + It is used to specify whether virtio device perfer to use vectorized path. + Afterwards, dependencies of vectorized path will be checked in path + election. + (Default: 0 (disabled)) Virtio paths Selection and Usage -------------------------------- diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 150a8d987..40ad786cc 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -452,6 +452,8 @@ static const char *valid_args[] = { VIRTIO_USER_ARG_PACKED_VQ, #define VIRTIO_USER_ARG_SPEED "speed" VIRTIO_USER_ARG_SPEED, +#define VIRTIO_USER_ARG_VECTORIZED "vectorized" + VIRTIO_USER_ARG_VECTORIZED, NULL }; @@ -559,6 +561,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) uint64_t mrg_rxbuf = 1; uint64_t in_order = 1; uint64_t packed_vq = 0; + uint64_t vectorized = 0; char *path = NULL; char *ifname = NULL; char *mac_addr = NULL; @@ -675,6 +678,15 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } } + if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) { + if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED, + &get_integer_arg, &vectorized) < 0) { + PMD_INIT_LOG(ERR, "error to parse %s", + VIRTIO_USER_ARG_VECTORIZED); + goto end; + } + } + if (queues > 1 && cq == 0) { PMD_INIT_LOG(ERR, "multi-q requires ctrl-q"); goto end; @@ -727,6 +739,9 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) goto end; } + if (vectorized) + hw->use_vec_rx = 1; + rte_eth_dev_probing_finish(eth_dev); ret = 0; @@ -785,4 +800,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, "mrg_rxbuf=<0|1> " "in_order=<0|1> " "packed_vq=<0|1> " - "speed=<int>"); + "speed=<int> " + "vectorized=<0|1>"); -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
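For virtio-user the option is a vdev devarg. At this stage of the series it only requests the split ring vectorized Rx path; the combination with packed_vq=1 is wired up in the next patch. Socket path and vdev name below are placeholders:

    --vdev 'net_virtio_user0,path=/tmp/vhost-user.sock,vectorized=1'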
* Re: [dpdk-dev] [PATCH v9 4/9] net/virtio-user: add vectorized devarg 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 4/9] net/virtio-user: " Marvin Liu @ 2020-04-24 11:29 ` Maxime Coquelin 0 siblings, 0 replies; 162+ messages in thread From: Maxime Coquelin @ 2020-04-24 11:29 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev, harry.van.haaren On 4/24/20 11:24 AM, Marvin Liu wrote: > Add new devarg for virtio user device vectorized path selection. By > default vectorized path is disabled. > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Thanks, Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu ` (3 preceding siblings ...) 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 4/9] net/virtio-user: " Marvin Liu @ 2020-04-24 9:24 ` Marvin Liu 2020-04-24 11:51 ` Maxime Coquelin 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu ` (3 subsequent siblings) 8 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-24 9:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: dev, harry.van.haaren, Marvin Liu Optimize packed ring Rx path with SIMD instructions. Solution of optimization is pretty like vhost, is that split path into batch and single functions. Batch function is further optimized by AVX512 instructions. Also pad desc extra structure to 16 bytes aligned, thus four elements will be saved in one batch. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index c9edb84ee..102b1deab 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif +ifneq ($(FORCE_DISABLE_AVX512), y) + CC_AVX512_SUPPORT=\ + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ + grep -q AVX512 && echo 1) +endif + +ifeq ($(CC_AVX512_SUPPORT), 1) +CFLAGS += -DCC_AVX512_SUPPORT +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c + +ifeq ($(RTE_TOOLCHAIN), gcc) +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), clang) +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1) +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), icc) +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA +endif +endif + +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds +endif +endif + ifeq ($(CONFIG_RTE_VIRTIO_USER),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index 15150eea1..8e68c3039 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c', deps += ['kvargs', 'bus_pci'] if arch_subdir == 'x86' + if '-mno-avx512f' not in machine_args + if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw') + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] + cflags += ['-DCC_AVX512_SUPPORT'] + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' + endif + sources += files('virtio_rxtx_packed_avx.c') + endif + endif sources += files('virtio_rxtx_simple_sse.c') elif arch_subdir == 'ppc' sources += files('virtio_rxtx_simple_altivec.c') diff 
--git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index febaf17a8..5c112cac7 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 84f4cf946..c9b6e7844 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2329,3 +2329,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, return nb_tx; } + +__rte_weak uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, + struct rte_mbuf **rx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c new file mode 100644 index 000000000..8a7b459eb --- /dev/null +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -0,0 +1,374 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> + +#include <rte_net.h> + +#include "virtio_logs.h" +#include "virtio_ethdev.h" +#include "virtio_pci.h" +#include "virtqueue.h" + +#define BYTE_SIZE 8 +/* flag bits offset in packed ring desc higher 64bits */ +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \ + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) + +#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \ + FLAGS_BITS_OFFSET) + +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ + sizeof(struct vring_packed_desc)) +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) + +#ifdef VIRTIO_GCC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_ICC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ + for (iter = val; iter < size; iter++) +#endif + +#ifndef virtio_for_each_try_unroll +#define virtio_for_each_try_unroll(iter, val, num) \ + for (iter = val; iter < num; iter++) +#endif + +static inline void +virtio_update_batch_stats(struct virtnet_stats *stats, + uint16_t pkt_len1, + uint16_t pkt_len2, + uint16_t pkt_len3, + uint16_t pkt_len4) +{ + stats->bytes += pkt_len1; + stats->bytes += pkt_len2; + stats->bytes += pkt_len3; + stats->bytes += pkt_len4; +} + +/* Optionally fill offload information in structure */ +static inline int +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) +{ + struct rte_net_hdr_lens hdr_lens; + uint32_t hdrlen, ptype; + int l4_supported = 0; + + /* nothing to do */ + if (hdr->flags == 0) + return 0; + + /* GSO not support in vec path, skip check */ + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; + + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); + m->packet_type = ptype; + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || + (ptype & 
RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) + l4_supported = 1; + + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; + if (hdr->csum_start <= hdrlen && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; + } else { + /* Unknown proto or tunnel, do sw cksum. We can assume + * the cksum field is in the first segment since the + * buffers we provided to the host are large enough. + * In case of SCTP, this will be wrong since it's a CRC + * but there's nothing we can do. + */ + uint16_t csum = 0, off; + + rte_raw_cksum_mbuf(m, hdr->csum_start, + rte_pktmbuf_pkt_len(m) - hdr->csum_start, + &csum); + if (likely(csum != 0xffff)) + csum = ~csum; + off = hdr->csum_offset + hdr->csum_start; + if (rte_pktmbuf_data_len(m) >= off + 1) + *rte_pktmbuf_mtod_offset(m, uint16_t *, + off) = csum; + } + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; + } + + return 0; +} + +static inline uint16_t +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint64_t addrs[PACKED_BATCH_SIZE]; + uint16_t id = vq->vq_used_cons_idx; + uint8_t desc_stats; + uint16_t i; + void *desc_addr; + + if (id & PACKED_BATCH_MASK) + return -1; + + if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries)) + return -1; + + /* only care avail/used bits */ + __m512i v_mask = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + desc_addr = &vq->vq_packed.ring.desc[id]; + + __m512i v_desc = _mm512_loadu_si512(desc_addr); + __m512i v_flag = _mm512_and_epi64(v_desc, v_mask); + + __m512i v_used_flag = _mm512_setzero_si512(); + if (vq->vq_packed.used_wrap_counter) + v_used_flag = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + + /* Check all descs are used */ + desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag); + if (desc_stats) + return -1; + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); + + addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1; + } + + /* + * load len from desc, store into mbuf pkt_len and data_len + * len limiated by l6bit buf_len, pkt_len[16:31] can be ignored + */ + const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12; + __m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, 0xAA); + + /* reduce hdr_len from pkt_len and data_len */ + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask, + (uint32_t)-hdr_size); + + __m512i v_value = _mm512_add_epi32(values, mbuf_len_offset); + + /* assert offset of data_len */ + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) != + offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8); + + __m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3], + addrs[2] + 8, addrs[2], + addrs[1] + 8, addrs[1], + addrs[0] + 8, addrs[0]); + /* batch store into mbufs */ + _mm512_i64scatter_epi64(0, v_index, v_value, 1); + + if (hw->has_rx_offload) { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + char *addr = (char *)rx_pkts[i]->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size; + virtio_vec_rx_offload(rx_pkts[i], + (struct virtio_net_hdr *)addr); + } + } + + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, + rx_pkts[3]->pkt_len); + + vq->vq_free_cnt += PACKED_BATCH_SIZE; + + vq->vq_used_cons_idx += 
PACKED_BATCH_SIZE; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static uint16_t +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + uint16_t used_idx, id; + uint32_t len; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint32_t hdr_size = hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + struct vring_packed_desc *desc; + struct rte_mbuf *cookie; + + desc = vq->vq_packed.ring.desc; + used_idx = vq->vq_used_cons_idx; + if (!desc_is_used(&desc[used_idx], vq)) + return -1; + + len = desc[used_idx].len; + id = desc[used_idx].id; + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; + if (unlikely(cookie == NULL)) { + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u", + vq->vq_used_cons_idx); + return -1; + } + rte_prefetch0(cookie); + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); + + cookie->data_off = RTE_PKTMBUF_HEADROOM; + cookie->ol_flags = 0; + cookie->pkt_len = (uint32_t)(len - hdr_size); + cookie->data_len = (uint32_t)(len - hdr_size); + + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size); + if (hw->has_rx_offload) + virtio_vec_rx_offload(cookie, hdr); + + *rx_pkts = cookie; + + rxvq->stats.bytes += cookie->pkt_len; + + vq->vq_free_cnt++; + vq->vq_used_cons_idx++; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static inline void +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **cookie, + uint16_t num) +{ + struct virtqueue *vq = rxvq->vq; + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; + uint16_t flags = vq->vq_packed.cached_flags; + struct virtio_hw *hw = vq->hw; + struct vq_desc_extra *dxp; + uint16_t idx, i; + uint16_t batch_num, total_num = 0; + uint16_t head_idx = vq->vq_avail_idx; + uint16_t head_flag = vq->vq_packed.cached_flags; + uint64_t addr; + + do { + idx = vq->vq_avail_idx; + + batch_num = PACKED_BATCH_SIZE; + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) + batch_num = vq->vq_nentries - idx; + if (unlikely((total_num + batch_num) > num)) + batch_num = num - total_num; + + virtio_for_each_try_unroll(i, 0, batch_num) { + dxp = &vq->vq_descx[idx + i]; + dxp->cookie = (void *)cookie[total_num + i]; + + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) + + RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size; + start_dp[idx + i].addr = addr; + start_dp[idx + i].len = cookie[total_num + i]->buf_len + - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size; + if (total_num || i) { + virtqueue_store_flags_packed(&start_dp[idx + i], + flags, hw->weak_barriers); + } + } + + vq->vq_avail_idx += batch_num; + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + flags = vq->vq_packed.cached_flags; + } + total_num += batch_num; + } while (total_num < num); + + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, + hw->weak_barriers); + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); +} + +uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue, + struct rte_mbuf **rx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_rx *rxvq = rx_queue; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t num, nb_rx = 0; + uint32_t nb_enqueued = 0; + uint16_t free_cnt = vq->vq_free_thresh; + + if 
(unlikely(hw->started == 0)) + return nb_rx; + + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); + if (likely(num > PACKED_BATCH_SIZE)) + num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE); + + while (num) { + if (!virtqueue_dequeue_batch_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx += PACKED_BATCH_SIZE; + num -= PACKED_BATCH_SIZE; + continue; + } + if (!virtqueue_dequeue_single_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx++; + num--; + continue; + } + break; + }; + + PMD_RX_LOG(DEBUG, "dequeue:%d", num); + + rxvq->stats.packets += nb_rx; + + if (likely(vq->vq_free_cnt >= free_cnt)) { + struct rte_mbuf *new_pkts[free_cnt]; + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, + free_cnt) == 0)) { + virtio_recv_refill_packed_vec(rxvq, new_pkts, + free_cnt); + nb_enqueued += free_cnt; + } else { + struct rte_eth_dev *dev = + &rte_eth_devices[rxvq->port_id]; + dev->data->rx_mbuf_alloc_failed += free_cnt; + } + } + + if (likely(nb_enqueued)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_RX_LOG(DEBUG, "Notified"); + } + } + + return nb_rx; +} diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 40ad786cc..c54698ad1 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -528,6 +528,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev) hw->use_msix = 1; hw->modern = 0; hw->use_vec_rx = 0; + hw->use_vec_tx = 0; hw->use_inorder_rx = 0; hw->use_inorder_tx = 0; hw->virtio_user_dev = dev; @@ -739,8 +740,19 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) goto end; } - if (vectorized) - hw->use_vec_rx = 1; + if (vectorized) { + if (packed_vq) { +#if defined(CC_AVX512_SUPPORT) + hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#else + PMD_INIT_LOG(INFO, + "building environment do not support packed ring vectorized"); +#endif + } else { + hw->use_vec_rx = 1; + } + } rte_eth_dev_probing_finish(eth_dev); ret = 0; diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 6301c56b2..d293a3189 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,8 +18,10 @@ struct rte_mbuf; +#define DEFAULT_TX_FREE_THRESH 32 #define DEFAULT_RX_FREE_THRESH 32 +#define VIRTIO_MBUF_BURST_SZ 64 /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO @@ -236,7 +238,8 @@ struct vq_desc_extra { void *cookie; uint16_t ndescs; uint16_t next; -}; + uint8_t padding[4]; +} __rte_packed __rte_aligned(16); struct virtqueue { struct virtio_hw *hw; /**< virtio_hw structure pointer. */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
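For readers less familiar with the AVX512 intrinsics, the batched availability test in virtqueue_dequeue_batch_packed_vec() is logically equivalent to the scalar loop below: all four descriptors must have their AVAIL and USED flag bits matching the current used wrap counter. This is only a restatement for clarity (barriers omitted), not part of the patch; the real code performs it with a single 64-byte load and one mask compare:

static inline int
virtio_packed_batch_is_used(struct virtqueue *vq, uint16_t id)
{
	uint16_t expected = vq->vq_packed.used_wrap_counter ?
		VRING_PACKED_DESC_F_AVAIL_USED : 0;
	uint16_t i;

	for (i = 0; i < PACKED_BATCH_SIZE; i++) {
		uint16_t flags = vq->vq_packed.ring.desc[id + i].flags;

		if ((flags & VRING_PACKED_DESC_F_AVAIL_USED) != expected)
			return 0;	/* at least one desc not yet used */
	}
	return 1;
}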
* Re: [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu @ 2020-04-24 11:51 ` Maxime Coquelin 2020-04-24 13:12 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: Maxime Coquelin @ 2020-04-24 11:51 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev, harry.van.haaren On 4/24/20 11:24 AM, Marvin Liu wrote: > Optimize packed ring Rx path with SIMD instructions. Solution of > optimization is pretty like vhost, is that split path into batch and > single functions. Batch function is further optimized by AVX512 > instructions. Also pad desc extra structure to 16 bytes aligned, thus > four elements will be saved in one batch. > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile > index c9edb84ee..102b1deab 100644 > --- a/drivers/net/virtio/Makefile > +++ b/drivers/net/virtio/Makefile > @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c > endif > > +ifneq ($(FORCE_DISABLE_AVX512), y) > + CC_AVX512_SUPPORT=\ > + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ > + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ > + grep -q AVX512 && echo 1) > +endif > + > +ifeq ($(CC_AVX512_SUPPORT), 1) > +CFLAGS += -DCC_AVX512_SUPPORT > +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c > + > +ifeq ($(RTE_TOOLCHAIN), gcc) > +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) > +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA > +endif > +endif > + > +ifeq ($(RTE_TOOLCHAIN), clang) > +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1) > +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA > +endif > +endif > + > +ifeq ($(RTE_TOOLCHAIN), icc) > +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) > +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA > +endif > +endif > + > +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl > +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) > +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds > +endif > +endif > + > ifeq ($(CONFIG_RTE_VIRTIO_USER),y) > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c > diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build > index 15150eea1..8e68c3039 100644 > --- a/drivers/net/virtio/meson.build > +++ b/drivers/net/virtio/meson.build > @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c', > deps += ['kvargs', 'bus_pci'] > > if arch_subdir == 'x86' > + if '-mno-avx512f' not in machine_args > + if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw') > + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] > + cflags += ['-DCC_AVX512_SUPPORT'] > + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) > + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' > + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) > + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' > + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) > + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' > + endif > + sources += files('virtio_rxtx_packed_avx.c') > + endif > + endif > sources += files('virtio_rxtx_simple_sse.c') > elif arch_subdir == 'ppc' > sources += 
files('virtio_rxtx_simple_altivec.c') > diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h > index febaf17a8..5c112cac7 100644 > --- a/drivers/net/virtio/virtio_ethdev.h > +++ b/drivers/net/virtio/virtio_ethdev.h > @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts, > uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, > uint16_t nb_pkts); > > +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, > + uint16_t nb_pkts); > + > int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); > > void virtio_interrupt_handler(void *param); > diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c > index 84f4cf946..c9b6e7844 100644 > --- a/drivers/net/virtio/virtio_rxtx.c > +++ b/drivers/net/virtio/virtio_rxtx.c > @@ -2329,3 +2329,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, > > return nb_tx; > } > + > +__rte_weak uint16_t > +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, > + struct rte_mbuf **rx_pkts __rte_unused, > + uint16_t nb_pkts __rte_unused) > +{ > + return 0; > +} > diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c > new file mode 100644 > index 000000000..8a7b459eb > --- /dev/null > +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c > @@ -0,0 +1,374 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(c) 2010-2020 Intel Corporation > + */ > + > +#include <stdint.h> > +#include <stdio.h> > +#include <stdlib.h> > +#include <string.h> > +#include <errno.h> > + > +#include <rte_net.h> > + > +#include "virtio_logs.h" > +#include "virtio_ethdev.h" > +#include "virtio_pci.h" > +#include "virtqueue.h" > + > +#define BYTE_SIZE 8 > +/* flag bits offset in packed ring desc higher 64bits */ > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \ > + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) > + > +#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \ > + FLAGS_BITS_OFFSET) > + > +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ > + sizeof(struct vring_packed_desc)) > +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) > + > +#ifdef VIRTIO_GCC_UNROLL_PRAGMA > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \ > + for (iter = val; iter < size; iter++) > +#endif > + > +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ > + for (iter = val; iter < size; iter++) > +#endif > + > +#ifdef VIRTIO_ICC_UNROLL_PRAGMA > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ > + for (iter = val; iter < size; iter++) > +#endif > + > +#ifndef virtio_for_each_try_unroll > +#define virtio_for_each_try_unroll(iter, val, num) \ > + for (iter = val; iter < num; iter++) > +#endif > + > +static inline void > +virtio_update_batch_stats(struct virtnet_stats *stats, > + uint16_t pkt_len1, > + uint16_t pkt_len2, > + uint16_t pkt_len3, > + uint16_t pkt_len4) > +{ > + stats->bytes += pkt_len1; > + stats->bytes += pkt_len2; > + stats->bytes += pkt_len3; > + stats->bytes += pkt_len4; > +} > + > +/* Optionally fill offload information in structure */ > +static inline int > +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) > +{ > + struct rte_net_hdr_lens hdr_lens; > + uint32_t hdrlen, ptype; > + int l4_supported = 0; > + > + /* nothing to do */ > + if (hdr->flags == 0) > + return 0; IIUC, the only difference with the non-vectorized 
version is the GSO support removed here. gso_type being in the same cacheline as flags in virtio_net_hdr, I don't think checking the performance gain is worth the added maintainance effort due to code duplication. Please prove I'm wrong, otherwise please move virtio_rx_offload() in a header and use it here. Alternative if it really imapcts performance is to put all the shared code in a dedicated function that can be re-used by both implementations. > + > + /* GSO not support in vec path, skip check */ > + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; > + > + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); > + m->packet_type = ptype; > + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || > + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || > + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) > + l4_supported = 1; > + > + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { > + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; > + if (hdr->csum_start <= hdrlen && l4_supported) { > + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; > + } else { > + /* Unknown proto or tunnel, do sw cksum. We can assume > + * the cksum field is in the first segment since the > + * buffers we provided to the host are large enough. > + * In case of SCTP, this will be wrong since it's a CRC > + * but there's nothing we can do. > + */ > + uint16_t csum = 0, off; > + > + rte_raw_cksum_mbuf(m, hdr->csum_start, > + rte_pktmbuf_pkt_len(m) - hdr->csum_start, > + &csum); > + if (likely(csum != 0xffff)) > + csum = ~csum; > + off = hdr->csum_offset + hdr->csum_start; > + if (rte_pktmbuf_data_len(m) >= off + 1) > + *rte_pktmbuf_mtod_offset(m, uint16_t *, > + off) = csum; > + } > + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) { > + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; > + } > + > + return 0; > +} Otherwise, the patch looks okay to me. Thanks, Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
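To make the second alternative concrete: the checksum and ptype handling is identical in virtio_rx_offload() and the new virtio_vec_rx_offload(), so it could be hoisted into one shared static inline, with each caller keeping its own early-exit on hdr->flags and the scalar path keeping its GSO handling on top. A sketch under those assumptions; the header and function names are hypothetical:

/* virtio_rx_offload_common.h -- hypothetical shared header */
#ifndef _VIRTIO_RX_OFFLOAD_COMMON_H_
#define _VIRTIO_RX_OFFLOAD_COMMON_H_

#include <rte_net.h>
#include <rte_mbuf.h>
#include <rte_branch_prediction.h>

#include "virtio_pci.h"

static inline void
virtio_rx_cksum_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
{
	struct rte_net_hdr_lens hdr_lens;
	uint32_t hdrlen, ptype;
	int l4_supported = 0;

	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
	m->packet_type = ptype;
	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
		l4_supported = 1;

	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
		if (hdr->csum_start <= hdrlen && l4_supported) {
			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
		} else {
			/* unknown proto or tunnel, do sw cksum as in the
			 * existing implementations
			 */
			uint16_t csum = 0, off;

			rte_raw_cksum_mbuf(m, hdr->csum_start,
				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
				&csum);
			if (likely(csum != 0xffff))
				csum = ~csum;
			off = hdr->csum_offset + hdr->csum_start;
			if (rte_pktmbuf_data_len(m) >= off + 1)
				*rte_pktmbuf_mtod_offset(m, uint16_t *,
					off) = csum;
		}
	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
	}
}

#endif /* _VIRTIO_RX_OFFLOAD_COMMON_H_ */

Whether the extra gso_type check really costs anything in the vectorized path, as the first alternative assumes, would still need the performance data asked for above.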
* Re: [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path 2020-04-24 11:51 ` Maxime Coquelin @ 2020-04-24 13:12 ` Liu, Yong 2020-04-24 13:33 ` Maxime Coquelin 0 siblings, 1 reply; 162+ messages in thread From: Liu, Yong @ 2020-04-24 13:12 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Van Haaren, Harry > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Friday, April 24, 2020 7:52 PM > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > Wang, Zhihong <zhihong.wang@intel.com> > Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com> > Subject: Re: [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path > > > > On 4/24/20 11:24 AM, Marvin Liu wrote: > > Optimize packed ring Rx path with SIMD instructions. Solution of > > optimization is pretty like vhost, is that split path into batch and > > single functions. Batch function is further optimized by AVX512 > > instructions. Also pad desc extra structure to 16 bytes aligned, thus > > four elements will be saved in one batch. > > > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > > > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile > > index c9edb84ee..102b1deab 100644 > > --- a/drivers/net/virtio/Makefile > > +++ b/drivers/net/virtio/Makefile > > @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) > $(CONFIG_RTE_ARCH_ARM64)),) > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c > > endif > > > > +ifneq ($(FORCE_DISABLE_AVX512), y) > > + CC_AVX512_SUPPORT=\ > > + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ > > + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ > > + grep -q AVX512 && echo 1) > > +endif > > + > > +ifeq ($(CC_AVX512_SUPPORT), 1) > > +CFLAGS += -DCC_AVX512_SUPPORT > > +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c > > + > > +ifeq ($(RTE_TOOLCHAIN), gcc) > > +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) > > +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA > > +endif > > +endif > > + > > +ifeq ($(RTE_TOOLCHAIN), clang) > > +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) - > ge 37 && echo 1), 1) > > +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA > > +endif > > +endif > > + > > +ifeq ($(RTE_TOOLCHAIN), icc) > > +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) > > +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA > > +endif > > +endif > > + > > +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl > > +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) > > +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds > > +endif > > +endif > > + > > ifeq ($(CONFIG_RTE_VIRTIO_USER),y) > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c > > diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build > > index 15150eea1..8e68c3039 100644 > > --- a/drivers/net/virtio/meson.build > > +++ b/drivers/net/virtio/meson.build > > @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c', > > deps += ['kvargs', 'bus_pci'] > > > > if arch_subdir == 'x86' > > + if '-mno-avx512f' not in machine_args > > + if cc.has_argument('-mavx512f') and cc.has_argument('- > mavx512vl') and cc.has_argument('-mavx512bw') > > + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] > > + cflags += ['-DCC_AVX512_SUPPORT'] > > + if (toolchain == 'gcc' and > cc.version().version_compare('>=8.3.0')) > > + 
cflags += '-DVHOST_GCC_UNROLL_PRAGMA' > > + elif (toolchain == 'clang' and > cc.version().version_compare('>=3.7.0')) > > + cflags += '- > DVHOST_CLANG_UNROLL_PRAGMA' > > + elif (toolchain == 'icc' and > cc.version().version_compare('>=16.0.0')) > > + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' > > + endif > > + sources += files('virtio_rxtx_packed_avx.c') > > + endif > > + endif > > sources += files('virtio_rxtx_simple_sse.c') > > elif arch_subdir == 'ppc' > > sources += files('virtio_rxtx_simple_altivec.c') > > diff --git a/drivers/net/virtio/virtio_ethdev.h > b/drivers/net/virtio/virtio_ethdev.h > > index febaf17a8..5c112cac7 100644 > > --- a/drivers/net/virtio/virtio_ethdev.h > > +++ b/drivers/net/virtio/virtio_ethdev.h > > @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, > struct rte_mbuf **tx_pkts, > > uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, > > uint16_t nb_pkts); > > > > +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf > **rx_pkts, > > + uint16_t nb_pkts); > > + > > int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); > > > > void virtio_interrupt_handler(void *param); > > diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c > > index 84f4cf946..c9b6e7844 100644 > > --- a/drivers/net/virtio/virtio_rxtx.c > > +++ b/drivers/net/virtio/virtio_rxtx.c > > @@ -2329,3 +2329,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, > > > > return nb_tx; > > } > > + > > +__rte_weak uint16_t > > +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, > > + struct rte_mbuf **rx_pkts __rte_unused, > > + uint16_t nb_pkts __rte_unused) > > +{ > > + return 0; > > +} > > diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c > b/drivers/net/virtio/virtio_rxtx_packed_avx.c > > new file mode 100644 > > index 000000000..8a7b459eb > > --- /dev/null > > +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c > > @@ -0,0 +1,374 @@ > > +/* SPDX-License-Identifier: BSD-3-Clause > > + * Copyright(c) 2010-2020 Intel Corporation > > + */ > > + > > +#include <stdint.h> > > +#include <stdio.h> > > +#include <stdlib.h> > > +#include <string.h> > > +#include <errno.h> > > + > > +#include <rte_net.h> > > + > > +#include "virtio_logs.h" > > +#include "virtio_ethdev.h" > > +#include "virtio_pci.h" > > +#include "virtqueue.h" > > + > > +#define BYTE_SIZE 8 > > +/* flag bits offset in packed ring desc higher 64bits */ > > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \ > > + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) > > + > > +#define PACKED_FLAGS_MASK ((0ULL | > VRING_PACKED_DESC_F_AVAIL_USED) << \ > > + FLAGS_BITS_OFFSET) > > + > > +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ > > + sizeof(struct vring_packed_desc)) > > +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) > > + > > +#ifdef VIRTIO_GCC_UNROLL_PRAGMA > > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") > \ > > + for (iter = val; iter < size; iter++) > > +#endif > > + > > +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA > > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ > > + for (iter = val; iter < size; iter++) > > +#endif > > + > > +#ifdef VIRTIO_ICC_UNROLL_PRAGMA > > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ > > + for (iter = val; iter < size; iter++) > > +#endif > > + > > +#ifndef virtio_for_each_try_unroll > > +#define virtio_for_each_try_unroll(iter, val, num) \ > > + for (iter = val; iter < num; iter++) > > +#endif > > + > > +static inline void 
> > +virtio_update_batch_stats(struct virtnet_stats *stats, > > + uint16_t pkt_len1, > > + uint16_t pkt_len2, > > + uint16_t pkt_len3, > > + uint16_t pkt_len4) > > +{ > > + stats->bytes += pkt_len1; > > + stats->bytes += pkt_len2; > > + stats->bytes += pkt_len3; > > + stats->bytes += pkt_len4; > > +} > > + > > +/* Optionally fill offload information in structure */ > > +static inline int > > +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) > > +{ > > + struct rte_net_hdr_lens hdr_lens; > > + uint32_t hdrlen, ptype; > > + int l4_supported = 0; > > + > > + /* nothing to do */ > > + if (hdr->flags == 0) > > + return 0; > > IIUC, the only difference with the non-vectorized version is the GSO > support removed here. > gso_type being in the same cacheline as flags in virtio_net_hdr, I don't > think checking the performance gain is worth the added maintainance > effort due to code duplication. > > Please prove I'm wrong, otherwise please move virtio_rx_offload() in a > header and use it here. Alternative if it really imapcts performance is > to put all the shared code in a dedicated function that can be re-used > by both implementations. > Maxime, It won't be much performance difference between non-vectorized and vectorized. The reason to add special vectorized version is for skipping the handling of garbage GSO packets. As all descs have been handled in batch, it is needed to revert when found garbage packets. That will introduce complicated logic in vectorized path. Regards, Marvin > > + > > + /* GSO not support in vec path, skip check */ > > + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; > > + > > + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); > > + m->packet_type = ptype; > > + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || > > + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || > > + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) > > + l4_supported = 1; > > + > > + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { > > + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; > > + if (hdr->csum_start <= hdrlen && l4_supported) { > > + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; > > + } else { > > + /* Unknown proto or tunnel, do sw cksum. We can > assume > > + * the cksum field is in the first segment since the > > + * buffers we provided to the host are large enough. > > + * In case of SCTP, this will be wrong since it's a CRC > > + * but there's nothing we can do. > > + */ > > + uint16_t csum = 0, off; > > + > > + rte_raw_cksum_mbuf(m, hdr->csum_start, > > + rte_pktmbuf_pkt_len(m) - hdr->csum_start, > > + &csum); > > + if (likely(csum != 0xffff)) > > + csum = ~csum; > > + off = hdr->csum_offset + hdr->csum_start; > > + if (rte_pktmbuf_data_len(m) >= off + 1) > > + *rte_pktmbuf_mtod_offset(m, uint16_t *, > > + off) = csum; > > + } > > + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && > l4_supported) { > > + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; > > + } > > + > > + return 0; > > +} > > Otherwise, the patch looks okay to me. > > Thanks, > Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
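For readers following the macros quoted above, the arithmetic behind PACKED_BATCH_SIZE and FLAGS_BITS_OFFSET can be checked with the small standalone program below. The local struct only mirrors the 16-byte layout of DPDK's vring_packed_desc for illustration (it is not code from the patch): four descriptors fit in one 64-byte cache line, and flags sits 32 bits above len in the descriptor's upper 64 bits, which is why PACKED_FLAGS_MASK shifts the AVAIL/USED bits by FLAGS_BITS_OFFSET.

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Local mirror of the virtio 1.1 packed descriptor layout (16 bytes),
 * assumed to match DPDK's struct vring_packed_desc. */
struct pkd_desc {
        uint64_t addr;
        uint32_t len;
        uint16_t id;
        uint16_t flags;
};

#define BYTE_SIZE 8
#define CACHE_LINE_SIZE 64      /* RTE_CACHE_LINE_SIZE on x86 */

int main(void)
{
        size_t batch = CACHE_LINE_SIZE / sizeof(struct pkd_desc);
        size_t flags_off = (offsetof(struct pkd_desc, flags) -
                            offsetof(struct pkd_desc, len)) * BYTE_SIZE;

        /* Prints 16, 4 and 32: four descriptors per cache line, and the
         * flags field 32 bits above len in the high 64-bit half. */
        printf("desc size %zu, batch size %zu, FLAGS_BITS_OFFSET %zu\n",
               sizeof(struct pkd_desc), batch, flags_off);
        return 0;
}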
* Re: [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path 2020-04-24 13:12 ` Liu, Yong @ 2020-04-24 13:33 ` Maxime Coquelin 2020-04-24 13:40 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: Maxime Coquelin @ 2020-04-24 13:33 UTC (permalink / raw) To: Liu, Yong, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Van Haaren, Harry On 4/24/20 3:12 PM, Liu, Yong wrote: >> IIUC, the only difference with the non-vectorized version is the GSO >> support removed here. >> gso_type being in the same cacheline as flags in virtio_net_hdr, I don't >> think checking the performance gain is worth the added maintainance >> effort due to code duplication. >> >> Please prove I'm wrong, otherwise please move virtio_rx_offload() in a >> header and use it here. Alternative if it really imapcts performance is >> to put all the shared code in a dedicated function that can be re-used >> by both implementations. >> > Maxime, > It won't be much performance difference between non-vectorized and vectorized. > The reason to add special vectorized version is for skipping the handling of garbage GSO packets. > As all descs have been handled in batch, it is needed to revert when found garbage packets. > That will introduce complicated logic in vectorized path. What do you mean by garbage packet? Is it really good to just ignore such issues? Thanks, Maxime > Regards, > Marvin > ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path 2020-04-24 13:33 ` Maxime Coquelin @ 2020-04-24 13:40 ` Liu, Yong 2020-04-24 15:58 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: Liu, Yong @ 2020-04-24 13:40 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Van Haaren, Harry > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Friday, April 24, 2020 9:34 PM > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > Wang, Zhihong <zhihong.wang@intel.com> > Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com> > Subject: Re: [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path > > > > On 4/24/20 3:12 PM, Liu, Yong wrote: > >> IIUC, the only difference with the non-vectorized version is the GSO > >> support removed here. > >> gso_type being in the same cacheline as flags in virtio_net_hdr, I don't > >> think checking the performance gain is worth the added maintainance > >> effort due to code duplication. > >> > >> Please prove I'm wrong, otherwise please move virtio_rx_offload() in a > >> header and use it here. Alternative if it really imapcts performance is > >> to put all the shared code in a dedicated function that can be re-used > >> by both implementations. > >> > > Maxime, > > It won't be much performance difference between non-vectorized and > vectorized. > > The reason to add special vectorized version is for skipping the handling of > garbage GSO packets. > > As all descs have been handled in batch, it is needed to revert when found > garbage packets. > > That will introduce complicated logic in vectorized path. > The dequeue function will call virtio_discard_rxbuf when the GSO info in the header is found to be invalid. IMHO, there's no need to check GSO info when GSO is not negotiated. An alternative is to use a single function that also handles GSO packets, but its performance will be worse than the normal function. if ((hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN) || (hdr->gso_size == 0)) { return -EINVAL; } > > What do you mean by garbage packet? > Is it really good to just ignore such issues? > > Thanks, > Maxime > > > Regards, > > Marvin > > ^ permalink raw reply [flat|nested] 162+ messages in thread
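The "dedicated function that can be re-used by both implementations" alternative discussed in this sub-thread could be shaped roughly as in the sketch below. It is only an illustration against the driver's internal types: the helper name and the check_gso parameter are hypothetical, the scalar path would call it with check_gso set, and virtio_vec_rx_offload() would pass false, so only the GSO validation differs while the checksum handling stays shared.

/* Sketch of a shared Rx offload helper (names are illustrative, not from
 * the posted patches); assumes the driver's usual headers are available. */
static inline int
virtio_rx_offload_common(struct rte_mbuf *m, struct virtio_net_hdr *hdr,
                         bool check_gso)
{
        if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
                return 0;

        if (check_gso && hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
                /* Reject malformed GSO info the way the scalar path does. */
                if ((hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN) ||
                    hdr->gso_size == 0)
                        return -EINVAL;
                /* ... set tso_segsz/LRO flags as virtio_rx_offload() does ... */
        }

        /* ... shared checksum handling (NEEDS_CSUM / DATA_VALID), as in the
         * code quoted above ... */

        return 0;
}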
* Re: [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path 2020-04-24 13:40 ` Liu, Yong @ 2020-04-24 15:58 ` Liu, Yong 0 siblings, 0 replies; 162+ messages in thread From: Liu, Yong @ 2020-04-24 15:58 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Van Haaren, Harry > -----Original Message----- > From: Liu, Yong > Sent: Friday, April 24, 2020 9:41 PM > To: 'Maxime Coquelin' <maxime.coquelin@redhat.com>; Ye, Xiaolong > <xiaolong.ye@intel.com>; Wang, Zhihong <zhihong.wang@intel.com> > Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com> > Subject: RE: [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path > > > > > -----Original Message----- > > From: Maxime Coquelin <maxime.coquelin@redhat.com> > > Sent: Friday, April 24, 2020 9:34 PM > > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > > Wang, Zhihong <zhihong.wang@intel.com> > > Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com> > > Subject: Re: [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path > > > > > > > > On 4/24/20 3:12 PM, Liu, Yong wrote: > > >> IIUC, the only difference with the non-vectorized version is the GSO > > >> support removed here. > > >> gso_type being in the same cacheline as flags in virtio_net_hdr, I don't > > >> think checking the performance gain is worth the added maintainance > > >> effort due to code duplication. > > >> > > >> Please prove I'm wrong, otherwise please move virtio_rx_offload() in a > > >> header and use it here. Alternative if it really imapcts performance is > > >> to put all the shared code in a dedicated function that can be re-used > > >> by both implementations. > > >> > > > Maxime, > > > It won't be much performance difference between non-vectorized and > > vectorized. > > > The reason to add special vectorized version is for skipping the handling > of > > garbage GSO packets. > > > As all descs have been handled in batch, it is needed to revert when > found > > garbage packets. > > > That will introduce complicated logic in vectorized path. > > > > Dequeue function will call virtio_discard_rxbuf when found gso info in hdr is > invalid. > IMHO, there's no need to check gso info when GSO not negotiated. > There's an alternative way is that use single function handle GSO packets but > its performance will be worse than normal function. > > if ((hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN) || > (hdr->gso_size == 0)) { > return -EINVAL; > } > Hi Maxime, There's about 6% performance drop in loopback case after handling this special case in Rx path. I prefer to keep current implementation. What's your option? Thanks, Marvin > > > > What do you mean by garbage packet? > > Is it really good to just ignore such issues? > > > > Thanks, > > Maxime > > > > > Regards, > > > Marvin > > > ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v9 6/9] net/virtio: reuse packed ring xmit functions 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu ` (4 preceding siblings ...) 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path Marvin Liu @ 2020-04-24 9:24 ` Marvin Liu 2020-04-24 12:01 ` Maxime Coquelin 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu ` (2 subsequent siblings) 8 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-24 9:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: dev, harry.van.haaren, Marvin Liu Move xmit offload and packed ring xmit enqueue function to header file. These functions will be reused by packed ring vectorized Tx function. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index c9b6e7844..cf18fe564 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -264,10 +264,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq, return i; } -#ifndef DEFAULT_TX_FREE_THRESH -#define DEFAULT_TX_FREE_THRESH 32 -#endif - static void virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num) { @@ -562,68 +558,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m) } -/* avoid write operation when necessary, to lessen cache issues */ -#define ASSIGN_UNLESS_EQUAL(var, val) do { \ - if ((var) != (val)) \ - (var) = (val); \ -} while (0) - -#define virtqueue_clear_net_hdr(_hdr) do { \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ -} while (0) - -static inline void -virtqueue_xmit_offload(struct virtio_net_hdr *hdr, - struct rte_mbuf *cookie, - bool offload) -{ - if (offload) { - if (cookie->ol_flags & PKT_TX_TCP_SEG) - cookie->ol_flags |= PKT_TX_TCP_CKSUM; - - switch (cookie->ol_flags & PKT_TX_L4_MASK) { - case PKT_TX_UDP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_udp_hdr, - dgram_cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - case PKT_TX_TCP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - default: - ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); - ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); - ASSIGN_UNLESS_EQUAL(hdr->flags, 0); - break; - } - /* TCP Segmentation Offload */ - if (cookie->ol_flags & PKT_TX_TCP_SEG) { - hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
- VIRTIO_NET_HDR_GSO_TCPV6 : - VIRTIO_NET_HDR_GSO_TCPV4; - hdr->gso_size = cookie->tso_segsz; - hdr->hdr_len = - cookie->l2_len + - cookie->l3_len + - cookie->l4_len; - } else { - ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); - ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); - ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); - } - } -} static inline void virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq, @@ -725,102 +660,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq, virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers); } -static inline void -virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, - uint16_t needed, int can_push, int in_order) -{ - struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; - struct vq_desc_extra *dxp; - struct virtqueue *vq = txvq->vq; - struct vring_packed_desc *start_dp, *head_dp; - uint16_t idx, id, head_idx, head_flags; - int16_t head_size = vq->hw->vtnet_hdr_size; - struct virtio_net_hdr *hdr; - uint16_t prev; - bool prepend_header = false; - - id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; - - dxp = &vq->vq_descx[id]; - dxp->ndescs = needed; - dxp->cookie = cookie; - - head_idx = vq->vq_avail_idx; - idx = head_idx; - prev = head_idx; - start_dp = vq->vq_packed.ring.desc; - - head_dp = &vq->vq_packed.ring.desc[idx]; - head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; - head_flags |= vq->vq_packed.cached_flags; - - if (can_push) { - /* prepend cannot fail, checked by caller */ - hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, - -head_size); - prepend_header = true; - - /* if offload disabled, it is not zeroed below, do it now */ - if (!vq->hw->has_tx_offload) - virtqueue_clear_net_hdr(hdr); - } else { - /* setup first tx ring slot to point to header - * stored in reserved region. - */ - start_dp[idx].addr = txvq->virtio_net_hdr_mem + - RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); - start_dp[idx].len = vq->hw->vtnet_hdr_size; - hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } - - virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); - - do { - uint16_t flags; - - start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); - start_dp[idx].len = cookie->data_len; - if (prepend_header) { - start_dp[idx].addr -= head_size; - start_dp[idx].len += head_size; - prepend_header = false; - } - - if (likely(idx != head_idx)) { - flags = cookie->next ? 
VRING_DESC_F_NEXT : 0; - flags |= vq->vq_packed.cached_flags; - start_dp[idx].flags = flags; - } - prev = idx; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } while ((cookie = cookie->next) != NULL); - - start_dp[prev].id = id; - - vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); - vq->vq_avail_idx = idx; - - if (!in_order) { - vq->vq_desc_head_idx = dxp->next; - if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) - vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; - } - - virtqueue_store_flags_packed(head_dp, head_flags, - vq->hw->weak_barriers); -} - static inline void virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie, uint16_t needed, int use_indirect, int can_push, @@ -1246,7 +1085,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) return 0; } -#define VIRTIO_MBUF_BURST_SZ 64 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc)) uint16_t virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index d293a3189..18ae34789 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -563,4 +563,165 @@ virtqueue_notify(struct virtqueue *vq) #define VIRTQUEUE_DUMP(vq) do { } while (0) #endif +/* avoid write operation when necessary, to lessen cache issues */ +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ + typeof(var) var_ = (var); \ + typeof(val) val_ = (val); \ + if ((var_) != (val_)) \ + (var_) = (val_); \ +} while (0) + +#define virtqueue_clear_net_hdr(hdr) do { \ + typeof(hdr) hdr_ = (hdr); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_start, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_offset, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->flags, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_type, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_size, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->hdr_len, 0); \ +} while (0) + +static inline void +virtqueue_xmit_offload(struct virtio_net_hdr *hdr, + struct rte_mbuf *cookie, + bool offload) +{ + if (offload) { + if (cookie->ol_flags & PKT_TX_TCP_SEG) + cookie->ol_flags |= PKT_TX_TCP_CKSUM; + + switch (cookie->ol_flags & PKT_TX_L4_MASK) { + case PKT_TX_UDP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_udp_hdr, + dgram_cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + case PKT_TX_TCP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + default: + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); + break; + } + + /* TCP Segmentation Offload */ + if (cookie->ol_flags & PKT_TX_TCP_SEG) { + hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
+ VIRTIO_NET_HDR_GSO_TCPV6 : + VIRTIO_NET_HDR_GSO_TCPV4; + hdr->gso_size = cookie->tso_segsz; + hdr->hdr_len = + cookie->l2_len + + cookie->l3_len + + cookie->l4_len; + } else { + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); + } + } +} + +static inline void +virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, + uint16_t needed, int can_push, int in_order) +{ + struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; + struct vq_desc_extra *dxp; + struct virtqueue *vq = txvq->vq; + struct vring_packed_desc *start_dp, *head_dp; + uint16_t idx, id, head_idx, head_flags; + int16_t head_size = vq->hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + uint16_t prev; + bool prepend_header = false; + + id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; + + dxp = &vq->vq_descx[id]; + dxp->ndescs = needed; + dxp->cookie = cookie; + + head_idx = vq->vq_avail_idx; + idx = head_idx; + prev = head_idx; + start_dp = vq->vq_packed.ring.desc; + + head_dp = &vq->vq_packed.ring.desc[idx]; + head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; + head_flags |= vq->vq_packed.cached_flags; + + if (can_push) { + /* prepend cannot fail, checked by caller */ + hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, + -head_size); + prepend_header = true; + + /* if offload disabled, it is not zeroed below, do it now */ + if (!vq->hw->has_tx_offload) + virtqueue_clear_net_hdr(hdr); + } else { + /* setup first tx ring slot to point to header + * stored in reserved region. + */ + start_dp[idx].addr = txvq->virtio_net_hdr_mem + + RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); + start_dp[idx].len = vq->hw->vtnet_hdr_size; + hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } + + virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); + + do { + uint16_t flags; + + start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); + start_dp[idx].len = cookie->data_len; + if (prepend_header) { + start_dp[idx].addr -= head_size; + start_dp[idx].len += head_size; + prepend_header = false; + } + + if (likely(idx != head_idx)) { + flags = cookie->next ? VRING_DESC_F_NEXT : 0; + flags |= vq->vq_packed.cached_flags; + start_dp[idx].flags = flags; + } + prev = idx; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } while ((cookie = cookie->next) != NULL); + + start_dp[prev].id = id; + + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); + vq->vq_avail_idx = idx; + + if (!in_order) { + vq->vq_desc_head_idx = dxp->next; + if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) + vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; + } + + virtqueue_store_flags_packed(head_dp, head_flags, + vq->hw->weak_barriers); +} #endif /* _VIRTQUEUE_H_ */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
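One detail worth double-checking in the ASSIGN_UNLESS_EQUAL variant moved into virtqueue.h above: the new typeof temporaries take the field by value, so the final assignment appears to update only the local copy and the virtio_net_hdr field itself is never written, which would leave virtqueue_clear_net_hdr() without effect. A pointer-based variant (a sketch, not part of the posted patch) keeps the arguments evaluated once while still writing through to the field:

/* avoid write operation when necessary, to lessen cache issues */
#define ASSIGN_UNLESS_EQUAL(var, val) do {              \
        typeof(var) *const var_ = &(var);               \
        typeof(val)  const val_ = (val);                \
        if (*var_ != val_)                              \
                *var_ = val_;                           \
} while (0)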
* Re: [dpdk-dev] [PATCH v9 6/9] net/virtio: reuse packed ring xmit functions 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu @ 2020-04-24 12:01 ` Maxime Coquelin 0 siblings, 0 replies; 162+ messages in thread From: Maxime Coquelin @ 2020-04-24 12:01 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev, harry.van.haaren On 4/24/20 11:24 AM, Marvin Liu wrote: > Move xmit offload and packed ring xmit enqueue function to header file. > These functions will be reused by packed ring vectorized Tx function. > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Thanks, Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu ` (5 preceding siblings ...) 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 6/9] net/virtio: reuse packed ring xmit functions Marvin Liu @ 2020-04-24 9:24 ` Marvin Liu 2020-04-24 12:29 ` Maxime Coquelin 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 8/9] net/virtio: add election for vectorized path Marvin Liu 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 9/9] doc: add packed " Marvin Liu 8 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-24 9:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: dev, harry.van.haaren, Marvin Liu Optimize packed ring Tx path alike Rx path. Split Tx path into batch and single Tx functions. Batch function is further optimized by AVX512 instructions. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index 5c112cac7..b7d52d497 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index cf18fe564..f82fe8d64 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, { return 0; } + +__rte_weak uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused, + struct rte_mbuf **tx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c index 8a7b459eb..c023ace4e 100644 --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -23,6 +23,24 @@ #define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \ FLAGS_BITS_OFFSET) +/* reference count offset in mbuf rearm data */ +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \ + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) +/* segment number offset in mbuf rearm data */ +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \ + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) + +/* default rearm data */ +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \ + 1ULL << REFCNT_BITS_OFFSET) + +/* id bits offset in packed ring desc higher 64bits */ +#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \ + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) + +/* net hdr short size mask */ +#define NET_HDR_MASK 0x3F + #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ sizeof(struct vring_packed_desc)) #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) @@ -47,6 +65,48 @@ for (iter = val; iter < num; iter++) #endif +static inline void +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq) +{ + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; + struct vq_desc_extra *dxp; + uint16_t used_idx, id, curr_id, free_cnt = 0; + uint16_t size = vq->vq_nentries; + struct rte_mbuf *mbufs[size]; + uint16_t nb_mbuf = 0, i; + + used_idx = vq->vq_used_cons_idx; + + if 
(!desc_is_used(&desc[used_idx], vq)) + return; + + id = desc[used_idx].id; + + do { + curr_id = used_idx; + dxp = &vq->vq_descx[used_idx]; + used_idx += dxp->ndescs; + free_cnt += dxp->ndescs; + + if (dxp->cookie != NULL) { + mbufs[nb_mbuf] = dxp->cookie; + dxp->cookie = NULL; + nb_mbuf++; + } + + if (used_idx >= size) { + used_idx -= size; + vq->vq_packed.used_wrap_counter ^= 1; + } + } while (curr_id != id); + + for (i = 0; i < nb_mbuf; i++) + rte_pktmbuf_free(mbufs[i]); + + vq->vq_used_cons_idx = used_idx; + vq->vq_free_cnt += free_cnt; +} + static inline void virtio_update_batch_stats(struct virtnet_stats *stats, uint16_t pkt_len1, @@ -60,6 +120,237 @@ virtio_update_batch_stats(struct virtnet_stats *stats, stats->bytes += pkt_len4; } +static inline int +virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf **tx_pkts) +{ + struct virtqueue *vq = txvq->vq; + uint16_t head_size = vq->hw->vtnet_hdr_size; + uint16_t idx = vq->vq_avail_idx; + struct virtio_net_hdr *hdr; + uint16_t i, cmp; + + if (vq->vq_avail_idx & PACKED_BATCH_MASK) + return -1; + + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) + return -1; + + /* Load four mbufs rearm data */ + RTE_BUILD_BUG_ON(REFCNT_BITS_OFFSET >= 64); + RTE_BUILD_BUG_ON(SEG_NUM_BITS_OFFSET >= 64); + __m256i mbufs = _mm256_set_epi64x(*tx_pkts[3]->rearm_data, + *tx_pkts[2]->rearm_data, + *tx_pkts[1]->rearm_data, + *tx_pkts[0]->rearm_data); + + /* refcnt=1 and nb_segs=1 */ + __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA); + __m256i head_rooms = _mm256_set1_epi16(head_size); + + /* Check refcnt and nb_segs */ + const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12; + cmp = _mm256_mask_cmpneq_epu16_mask(mask, mbufs, mbuf_ref); + if (unlikely(cmp)) + return -1; + + /* Check headroom is enough */ + const __mmask16 data_mask = 0x1 | 0x1 << 4 | 0x1 << 8 | 0x1 << 12; + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_off) != + offsetof(struct rte_mbuf, rearm_data)); + cmp = _mm256_mask_cmplt_epu16_mask(data_mask, mbufs, head_rooms); + if (unlikely(cmp)) + return -1; + + __m512i v_descx = _mm512_set_epi64(0x1, (uint64_t)tx_pkts[3], + 0x1, (uint64_t)tx_pkts[2], + 0x1, (uint64_t)tx_pkts[1], + 0x1, (uint64_t)tx_pkts[0]); + + _mm512_storeu_si512((void *)&vq->vq_descx[idx], v_descx); + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + tx_pkts[i]->data_off -= head_size; + tx_pkts[i]->data_len += head_size; + } + +#ifdef RTE_VIRTIO_USER + __m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])), + tx_pkts[2]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])), + tx_pkts[1]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])), + tx_pkts[0]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0]))); +#else + __m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len, + tx_pkts[3]->buf_iova, + tx_pkts[2]->data_len, + tx_pkts[2]->buf_iova, + tx_pkts[1]->data_len, + tx_pkts[1]->buf_iova, + tx_pkts[0]->data_len, + tx_pkts[0]->buf_iova); +#endif + + /* id offset and data offset */ + __m512i data_offsets = _mm512_set_epi64((uint64_t)3 << ID_BITS_OFFSET, + tx_pkts[3]->data_off, + (uint64_t)2 << ID_BITS_OFFSET, + tx_pkts[2]->data_off, + (uint64_t)1 << ID_BITS_OFFSET, + tx_pkts[1]->data_off, + 0, tx_pkts[0]->data_off); + + __m512i new_descs = _mm512_add_epi64(descs_base, data_offsets); + + uint64_t flags_temp = (uint64_t)idx << ID_BITS_OFFSET | + (uint64_t)vq->vq_packed.cached_flags << FLAGS_BITS_OFFSET; + + /* flags offset and guest virtual 
address offset */ +#ifdef RTE_VIRTIO_USER + __m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset); +#else + __m128i flag_offset = _mm_set_epi64x(flags_temp, 0); +#endif + __m512i v_offset = _mm512_broadcast_i32x4(flag_offset); + + __m512i v_desc = _mm512_add_epi64(new_descs, v_offset); + + if (!vq->hw->has_tx_offload) { + __m128i all_mask = _mm_set1_epi16(0xFFFF); + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + __m128i v_hdr = _mm_loadu_si128((void *)hdr); + if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK, + v_hdr, all_mask))) { + __m128i all_zero = _mm_setzero_si128(); + _mm_mask_storeu_epi16((void *)hdr, + NET_HDR_MASK, all_zero); + } + } + } else { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + virtqueue_xmit_offload(hdr, tx_pkts[i], true); + } + } + + /* Enqueue Packet buffers */ + _mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], v_desc); + + virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len, + tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len, + tx_pkts[3]->pkt_len); + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + vq->vq_free_cnt -= PACKED_BATCH_SIZE; + + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + + return 0; +} + +static inline int +virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf *txm) +{ + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint16_t slots, can_push; + int16_t need; + + /* How many main ring entries are needed to this Tx? 
+ * any_layout => number of segments + * default => number of segments + 1 + */ + can_push = rte_mbuf_refcnt_read(txm) == 1 && + RTE_MBUF_DIRECT(txm) && + txm->nb_segs == 1 && + rte_pktmbuf_headroom(txm) >= hdr_size; + + slots = txm->nb_segs + !can_push; + need = slots - vq->vq_free_cnt; + + /* Positive value indicates it need free vring descriptors */ + if (unlikely(need > 0)) { + virtio_xmit_cleanup_packed_vec(vq); + need = slots - vq->vq_free_cnt; + if (unlikely(need > 0)) { + PMD_TX_LOG(ERR, + "No free tx descriptors to transmit"); + return -1; + } + } + + /* Enqueue Packet buffers */ + virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1); + + txvq->stats.bytes += txm->pkt_len; + return 0; +} + +uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_tx *txvq = tx_queue; + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t nb_tx = 0; + uint16_t remained; + + if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) + return nb_tx; + + if (unlikely(nb_pkts < 1)) + return nb_pkts; + + PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts); + + if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh) + virtio_xmit_cleanup_packed_vec(vq); + + remained = RTE_MIN(nb_pkts, vq->vq_free_cnt); + + while (remained) { + if (remained >= PACKED_BATCH_SIZE) { + if (!virtqueue_enqueue_batch_packed_vec(txvq, + &tx_pkts[nb_tx])) { + nb_tx += PACKED_BATCH_SIZE; + remained -= PACKED_BATCH_SIZE; + continue; + } + } + if (!virtqueue_enqueue_single_packed_vec(txvq, + tx_pkts[nb_tx])) { + nb_tx++; + remained--; + continue; + } + break; + }; + + txvq->stats.packets += nb_tx; + + if (likely(nb_tx)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_TX_LOG(DEBUG, "Notified backend after xmit"); + } + } + + return nb_tx; +} + /* Optionally fill offload information in structure */ static inline int virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
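The rearm_data constants used by the batch Tx function above can be checked with the standalone program below. The mock struct encodes only what the patch itself implies (data_off at the rearm_data marker, refcnt 16 bits above it and nb_segs 32 bits above it, which is what the 0x1 and 0x6 lane masks select per mbuf); it is an assumption for illustration, not a copy of rte_mbuf.h.

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

#define BYTE_SIZE 8

/* Mock of the 8 bytes covered by rte_mbuf's rearm_data marker, with the
 * field order inferred from the offsets and masks used in the patch. */
struct rearm_mock {
        uint16_t data_off;
        uint16_t refcnt;
        uint16_t nb_segs;
        uint16_t port;
};

int main(void)
{
        size_t refcnt_off = offsetof(struct rearm_mock, refcnt) * BYTE_SIZE;
        size_t seg_num_off = offsetof(struct rearm_mock, nb_segs) * BYTE_SIZE;
        uint64_t default_rearm = (1ULL << seg_num_off) |
                                 (1ULL << refcnt_off);

        /* Prints 16, 32 and 0x100010000: refcnt and nb_segs both set to 1
         * in a single 64-bit value, as DEFAULT_REARM_DATA encodes. Each
         * mbuf contributes four 16-bit lanes to the 256-bit compare, so
         * mask 0x6 picks lanes 1-2 (refcnt, nb_segs) and 0x1 picks lane 0
         * (data_off) for the headroom check. */
        printf("REFCNT_BITS_OFFSET %zu, SEG_NUM_BITS_OFFSET %zu, "
               "DEFAULT_REARM_DATA 0x%llx\n", refcnt_off, seg_num_off,
               (unsigned long long)default_rearm);
        return 0;
}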
* Re: [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu @ 2020-04-24 12:29 ` Maxime Coquelin 2020-04-24 13:33 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: Maxime Coquelin @ 2020-04-24 12:29 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev, harry.van.haaren On 4/24/20 11:24 AM, Marvin Liu wrote: > Optimize packed ring Tx path alike Rx path. Split Tx path into batch and s/alike/like/ ? > single Tx functions. Batch function is further optimized by AVX512 > instructions. > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h > index 5c112cac7..b7d52d497 100644 > --- a/drivers/net/virtio/virtio_ethdev.h > +++ b/drivers/net/virtio/virtio_ethdev.h > @@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, > uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, > uint16_t nb_pkts); > > +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, > + uint16_t nb_pkts); > + > int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); > > void virtio_interrupt_handler(void *param); > diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c > index cf18fe564..f82fe8d64 100644 > --- a/drivers/net/virtio/virtio_rxtx.c > +++ b/drivers/net/virtio/virtio_rxtx.c > @@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, > { > return 0; > } > + > +__rte_weak uint16_t > +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused, > + struct rte_mbuf **tx_pkts __rte_unused, > + uint16_t nb_pkts __rte_unused) > +{ > + return 0; > +} > diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c > index 8a7b459eb..c023ace4e 100644 > --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c > +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c > @@ -23,6 +23,24 @@ > #define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \ > FLAGS_BITS_OFFSET) > > +/* reference count offset in mbuf rearm data */ > +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \ > + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) > +/* segment number offset in mbuf rearm data */ > +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \ > + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) > + > +/* default rearm data */ > +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \ > + 1ULL << REFCNT_BITS_OFFSET) > + > +/* id bits offset in packed ring desc higher 64bits */ > +#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \ > + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) > + > +/* net hdr short size mask */ > +#define NET_HDR_MASK 0x3F > + > #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ > sizeof(struct vring_packed_desc)) > #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) > @@ -47,6 +65,48 @@ > for (iter = val; iter < num; iter++) > #endif > > +static inline void > +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq) > +{ > + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; > + struct vq_desc_extra *dxp; > + uint16_t used_idx, id, curr_id, free_cnt = 0; > + uint16_t size = vq->vq_nentries; > + struct rte_mbuf *mbufs[size]; > + uint16_t nb_mbuf = 0, i; > + > + used_idx = vq->vq_used_cons_idx; > + > + if (!desc_is_used(&desc[used_idx], vq)) > + return; 
> + > + id = desc[used_idx].id; > + > + do { > + curr_id = used_idx; > + dxp = &vq->vq_descx[used_idx]; > + used_idx += dxp->ndescs; > + free_cnt += dxp->ndescs; > + > + if (dxp->cookie != NULL) { > + mbufs[nb_mbuf] = dxp->cookie; > + dxp->cookie = NULL; > + nb_mbuf++; > + } > + > + if (used_idx >= size) { > + used_idx -= size; > + vq->vq_packed.used_wrap_counter ^= 1; > + } > + } while (curr_id != id); > + > + for (i = 0; i < nb_mbuf; i++) > + rte_pktmbuf_free(mbufs[i]); > + > + vq->vq_used_cons_idx = used_idx; > + vq->vq_free_cnt += free_cnt; > +} > + I think you can re-use the inlined non-vectorized cleanup function here. Or use your implementation in non-vectorized path. BTW, do you know we have to pass the num argument in non-vectorized case? I'm not sure to remember. Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path 2020-04-24 12:29 ` Maxime Coquelin @ 2020-04-24 13:33 ` Liu, Yong 2020-04-24 13:35 ` Maxime Coquelin 0 siblings, 1 reply; 162+ messages in thread From: Liu, Yong @ 2020-04-24 13:33 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Van Haaren, Harry > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Friday, April 24, 2020 8:30 PM > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > Wang, Zhihong <zhihong.wang@intel.com> > Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com> > Subject: Re: [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path > > > > On 4/24/20 11:24 AM, Marvin Liu wrote: > > Optimize packed ring Tx path alike Rx path. Split Tx path into batch and > > s/alike/like/ ? > > > single Tx functions. Batch function is further optimized by AVX512 > > instructions. > > > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > > > diff --git a/drivers/net/virtio/virtio_ethdev.h > b/drivers/net/virtio/virtio_ethdev.h > > index 5c112cac7..b7d52d497 100644 > > --- a/drivers/net/virtio/virtio_ethdev.h > > +++ b/drivers/net/virtio/virtio_ethdev.h > > @@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, > struct rte_mbuf **rx_pkts, > > uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf > **rx_pkts, > > uint16_t nb_pkts); > > > > +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf > **tx_pkts, > > + uint16_t nb_pkts); > > + > > int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); > > > > void virtio_interrupt_handler(void *param); > > diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c > > index cf18fe564..f82fe8d64 100644 > > --- a/drivers/net/virtio/virtio_rxtx.c > > +++ b/drivers/net/virtio/virtio_rxtx.c > > @@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue > __rte_unused, > > { > > return 0; > > } > > + > > +__rte_weak uint16_t > > +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused, > > + struct rte_mbuf **tx_pkts __rte_unused, > > + uint16_t nb_pkts __rte_unused) > > +{ > > + return 0; > > +} > > diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c > b/drivers/net/virtio/virtio_rxtx_packed_avx.c > > index 8a7b459eb..c023ace4e 100644 > > --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c > > +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c > > @@ -23,6 +23,24 @@ > > #define PACKED_FLAGS_MASK ((0ULL | > VRING_PACKED_DESC_F_AVAIL_USED) << \ > > FLAGS_BITS_OFFSET) > > > > +/* reference count offset in mbuf rearm data */ > > +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \ > > + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) > > +/* segment number offset in mbuf rearm data */ > > +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \ > > + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) > > + > > +/* default rearm data */ > > +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \ > > + 1ULL << REFCNT_BITS_OFFSET) > > + > > +/* id bits offset in packed ring desc higher 64bits */ > > +#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \ > > + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) > > + > > +/* net hdr short size mask */ > > +#define NET_HDR_MASK 0x3F > > + > > #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ > > sizeof(struct vring_packed_desc)) > > #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) > > @@ -47,6 +65,48 @@ 
> > for (iter = val; iter < num; iter++) > > #endif > > > > +static inline void > > +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq) > > +{ > > + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; > > + struct vq_desc_extra *dxp; > > + uint16_t used_idx, id, curr_id, free_cnt = 0; > > + uint16_t size = vq->vq_nentries; > > + struct rte_mbuf *mbufs[size]; > > + uint16_t nb_mbuf = 0, i; > > + > > + used_idx = vq->vq_used_cons_idx; > > + > > + if (!desc_is_used(&desc[used_idx], vq)) > > + return; > > + > > + id = desc[used_idx].id; > > + > > + do { > > + curr_id = used_idx; > > + dxp = &vq->vq_descx[used_idx]; > > + used_idx += dxp->ndescs; > > + free_cnt += dxp->ndescs; > > + > > + if (dxp->cookie != NULL) { > > + mbufs[nb_mbuf] = dxp->cookie; > > + dxp->cookie = NULL; > > + nb_mbuf++; > > + } > > + > > + if (used_idx >= size) { > > + used_idx -= size; > > + vq->vq_packed.used_wrap_counter ^= 1; > > + } > > + } while (curr_id != id); > > + > > + for (i = 0; i < nb_mbuf; i++) > > + rte_pktmbuf_free(mbufs[i]); > > + > > + vq->vq_used_cons_idx = used_idx; > > + vq->vq_free_cnt += free_cnt; > > +} > > + > > > I think you can re-use the inlined non-vectorized cleanup function here. > Or use your implementation in non-vectorized path. > BTW, do you know we have to pass the num argument in non-vectorized > case? I'm not sure to remember. > Maxime, This is simple version of xmit clean up function. It is based on the concept that backend will update used id in burst which also match frontend's requirement. I just found original version work better in loopback case. Will adapt it in next version. Thanks, Marvin > Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path 2020-04-24 13:33 ` Liu, Yong @ 2020-04-24 13:35 ` Maxime Coquelin 2020-04-24 13:47 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: Maxime Coquelin @ 2020-04-24 13:35 UTC (permalink / raw) To: Liu, Yong, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Van Haaren, Harry On 4/24/20 3:33 PM, Liu, Yong wrote: > > >> -----Original Message----- >> From: Maxime Coquelin <maxime.coquelin@redhat.com> >> Sent: Friday, April 24, 2020 8:30 PM >> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; >> Wang, Zhihong <zhihong.wang@intel.com> >> Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com> >> Subject: Re: [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path >> >> >> >> On 4/24/20 11:24 AM, Marvin Liu wrote: >>> Optimize packed ring Tx path alike Rx path. Split Tx path into batch and >> >> s/alike/like/ ? >> >>> single Tx functions. Batch function is further optimized by AVX512 >>> instructions. >>> >>> Signed-off-by: Marvin Liu <yong.liu@intel.com> >>> >>> diff --git a/drivers/net/virtio/virtio_ethdev.h >> b/drivers/net/virtio/virtio_ethdev.h >>> index 5c112cac7..b7d52d497 100644 >>> --- a/drivers/net/virtio/virtio_ethdev.h >>> +++ b/drivers/net/virtio/virtio_ethdev.h >>> @@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, >> struct rte_mbuf **rx_pkts, >>> uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf >> **rx_pkts, >>> uint16_t nb_pkts); >>> >>> +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf >> **tx_pkts, >>> + uint16_t nb_pkts); >>> + >>> int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); >>> >>> void virtio_interrupt_handler(void *param); >>> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c >>> index cf18fe564..f82fe8d64 100644 >>> --- a/drivers/net/virtio/virtio_rxtx.c >>> +++ b/drivers/net/virtio/virtio_rxtx.c >>> @@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue >> __rte_unused, >>> { >>> return 0; >>> } >>> + >>> +__rte_weak uint16_t >>> +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused, >>> + struct rte_mbuf **tx_pkts __rte_unused, >>> + uint16_t nb_pkts __rte_unused) >>> +{ >>> + return 0; >>> +} >>> diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c >> b/drivers/net/virtio/virtio_rxtx_packed_avx.c >>> index 8a7b459eb..c023ace4e 100644 >>> --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c >>> +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c >>> @@ -23,6 +23,24 @@ >>> #define PACKED_FLAGS_MASK ((0ULL | >> VRING_PACKED_DESC_F_AVAIL_USED) << \ >>> FLAGS_BITS_OFFSET) >>> >>> +/* reference count offset in mbuf rearm data */ >>> +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \ >>> + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) >>> +/* segment number offset in mbuf rearm data */ >>> +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \ >>> + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) >>> + >>> +/* default rearm data */ >>> +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \ >>> + 1ULL << REFCNT_BITS_OFFSET) >>> + >>> +/* id bits offset in packed ring desc higher 64bits */ >>> +#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \ >>> + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) >>> + >>> +/* net hdr short size mask */ >>> +#define NET_HDR_MASK 0x3F >>> + >>> #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ >>> sizeof(struct vring_packed_desc)) >>> #define 
PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) >>> @@ -47,6 +65,48 @@ >>> for (iter = val; iter < num; iter++) >>> #endif >>> >>> +static inline void >>> +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq) >>> +{ >>> + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; >>> + struct vq_desc_extra *dxp; >>> + uint16_t used_idx, id, curr_id, free_cnt = 0; >>> + uint16_t size = vq->vq_nentries; >>> + struct rte_mbuf *mbufs[size]; >>> + uint16_t nb_mbuf = 0, i; >>> + >>> + used_idx = vq->vq_used_cons_idx; >>> + >>> + if (!desc_is_used(&desc[used_idx], vq)) >>> + return; >>> + >>> + id = desc[used_idx].id; >>> + >>> + do { >>> + curr_id = used_idx; >>> + dxp = &vq->vq_descx[used_idx]; >>> + used_idx += dxp->ndescs; >>> + free_cnt += dxp->ndescs; >>> + >>> + if (dxp->cookie != NULL) { >>> + mbufs[nb_mbuf] = dxp->cookie; >>> + dxp->cookie = NULL; >>> + nb_mbuf++; >>> + } >>> + >>> + if (used_idx >= size) { >>> + used_idx -= size; >>> + vq->vq_packed.used_wrap_counter ^= 1; >>> + } >>> + } while (curr_id != id); >>> + >>> + for (i = 0; i < nb_mbuf; i++) >>> + rte_pktmbuf_free(mbufs[i]); >>> + >>> + vq->vq_used_cons_idx = used_idx; >>> + vq->vq_free_cnt += free_cnt; >>> +} >>> + >> >> >> I think you can re-use the inlined non-vectorized cleanup function here. >> Or use your implementation in non-vectorized path. >> BTW, do you know we have to pass the num argument in non-vectorized >> case? I'm not sure to remember. >> > > Maxime, > This is simple version of xmit clean up function. It is based on the concept that backend will update used id in burst which also match frontend's requirement. And what the backend doesn't follow that concept? It is just slower or broken? > I just found original version work better in loopback case. Will adapt it in next version. > > Thanks, > Marvin > >> Maxime > ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path 2020-04-24 13:35 ` Maxime Coquelin @ 2020-04-24 13:47 ` Liu, Yong 0 siblings, 0 replies; 162+ messages in thread From: Liu, Yong @ 2020-04-24 13:47 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Van Haaren, Harry > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Friday, April 24, 2020 9:36 PM > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > Wang, Zhihong <zhihong.wang@intel.com> > Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com> > Subject: Re: [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path > > > > On 4/24/20 3:33 PM, Liu, Yong wrote: > > > > > >> -----Original Message----- > >> From: Maxime Coquelin <maxime.coquelin@redhat.com> > >> Sent: Friday, April 24, 2020 8:30 PM > >> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > >> Wang, Zhihong <zhihong.wang@intel.com> > >> Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com> > >> Subject: Re: [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path > >> > >> > >> > >> On 4/24/20 11:24 AM, Marvin Liu wrote: > >>> Optimize packed ring Tx path alike Rx path. Split Tx path into batch and > >> > >> s/alike/like/ ? > >> > >>> single Tx functions. Batch function is further optimized by AVX512 > >>> instructions. > >>> > >>> Signed-off-by: Marvin Liu <yong.liu@intel.com> > >>> > >>> diff --git a/drivers/net/virtio/virtio_ethdev.h > >> b/drivers/net/virtio/virtio_ethdev.h > >>> index 5c112cac7..b7d52d497 100644 > >>> --- a/drivers/net/virtio/virtio_ethdev.h > >>> +++ b/drivers/net/virtio/virtio_ethdev.h > >>> @@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, > >> struct rte_mbuf **rx_pkts, > >>> uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf > >> **rx_pkts, > >>> uint16_t nb_pkts); > >>> > >>> +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf > >> **tx_pkts, > >>> + uint16_t nb_pkts); > >>> + > >>> int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); > >>> > >>> void virtio_interrupt_handler(void *param); > >>> diff --git a/drivers/net/virtio/virtio_rxtx.c > b/drivers/net/virtio/virtio_rxtx.c > >>> index cf18fe564..f82fe8d64 100644 > >>> --- a/drivers/net/virtio/virtio_rxtx.c > >>> +++ b/drivers/net/virtio/virtio_rxtx.c > >>> @@ -2175,3 +2175,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue > >> __rte_unused, > >>> { > >>> return 0; > >>> } > >>> + > >>> +__rte_weak uint16_t > >>> +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused, > >>> + struct rte_mbuf **tx_pkts __rte_unused, > >>> + uint16_t nb_pkts __rte_unused) > >>> +{ > >>> + return 0; > >>> +} > >>> diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c > >> b/drivers/net/virtio/virtio_rxtx_packed_avx.c > >>> index 8a7b459eb..c023ace4e 100644 > >>> --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c > >>> +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c > >>> @@ -23,6 +23,24 @@ > >>> #define PACKED_FLAGS_MASK ((0ULL | > >> VRING_PACKED_DESC_F_AVAIL_USED) << \ > >>> FLAGS_BITS_OFFSET) > >>> > >>> +/* reference count offset in mbuf rearm data */ > >>> +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \ > >>> + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) > >>> +/* segment number offset in mbuf rearm data */ > >>> +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \ > >>> + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) > >>> + > >>> 
+/* default rearm data */ > >>> +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \ > >>> + 1ULL << REFCNT_BITS_OFFSET) > >>> + > >>> +/* id bits offset in packed ring desc higher 64bits */ > >>> +#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \ > >>> + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) > >>> + > >>> +/* net hdr short size mask */ > >>> +#define NET_HDR_MASK 0x3F > >>> + > >>> #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ > >>> sizeof(struct vring_packed_desc)) > >>> #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) > >>> @@ -47,6 +65,48 @@ > >>> for (iter = val; iter < num; iter++) > >>> #endif > >>> > >>> +static inline void > >>> +virtio_xmit_cleanup_packed_vec(struct virtqueue *vq) > >>> +{ > >>> + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; > >>> + struct vq_desc_extra *dxp; > >>> + uint16_t used_idx, id, curr_id, free_cnt = 0; > >>> + uint16_t size = vq->vq_nentries; > >>> + struct rte_mbuf *mbufs[size]; > >>> + uint16_t nb_mbuf = 0, i; > >>> + > >>> + used_idx = vq->vq_used_cons_idx; > >>> + > >>> + if (!desc_is_used(&desc[used_idx], vq)) > >>> + return; > >>> + > >>> + id = desc[used_idx].id; > >>> + > >>> + do { > >>> + curr_id = used_idx; > >>> + dxp = &vq->vq_descx[used_idx]; > >>> + used_idx += dxp->ndescs; > >>> + free_cnt += dxp->ndescs; > >>> + > >>> + if (dxp->cookie != NULL) { > >>> + mbufs[nb_mbuf] = dxp->cookie; > >>> + dxp->cookie = NULL; > >>> + nb_mbuf++; > >>> + } > >>> + > >>> + if (used_idx >= size) { > >>> + used_idx -= size; > >>> + vq->vq_packed.used_wrap_counter ^= 1; > >>> + } > >>> + } while (curr_id != id); > >>> + > >>> + for (i = 0; i < nb_mbuf; i++) > >>> + rte_pktmbuf_free(mbufs[i]); > >>> + > >>> + vq->vq_used_cons_idx = used_idx; > >>> + vq->vq_free_cnt += free_cnt; > >>> +} > >>> + > >> > >> > >> I think you can re-use the inlined non-vectorized cleanup function here. > >> Or use your implementation in non-vectorized path. > >> BTW, do you know we have to pass the num argument in non-vectorized > >> case? I'm not sure to remember. > >> > > > > Maxime, > > This is simple version of xmit clean up function. It is based on the concept > that backend will update used id in burst which also match frontend's > requirement. > > And what the backend doesn't follow that concept? > It is just slower or broken? It is just slower. More packets maybe drop due to no free room in the ring. I will replace vectorized with non-vectorized version it shown good number. > > > I just found original version work better in loopback case. Will adapt it in > next version. > > > > Thanks, > > Marvin > > > >> Maxime > > ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v9 8/9] net/virtio: add election for vectorized path 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu ` (6 preceding siblings ...) 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu @ 2020-04-24 9:24 ` Marvin Liu 2020-04-24 13:26 ` Maxime Coquelin 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 9/9] doc: add packed " Marvin Liu 8 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-24 9:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: dev, harry.van.haaren, Marvin Liu Rewrite vectorized path selection logic. Default setting comes from vectorized devarg, then checks each criteria. Packed ring vectorized path need: AVX512F and required extensions are supported by compiler and host VERSION_1 and IN_ORDER features are negotiated mergeable feature is not negotiated LRO offloading is disabled Split ring vectorized rx path need: mergeable and IN_ORDER features are not negotiated LRO, chksum and vlan strip offloadings are disabled Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 0a69a4db1..8a9545dd8 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1523,9 +1523,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) if (vtpci_packed_queue(hw)) { PMD_INIT_LOG(INFO, "virtio: using packed ring %s Tx path on port %u", - hw->use_inorder_tx ? "inorder" : "standard", + hw->use_vec_tx ? "vectorized" : "standard", eth_dev->data->port_id); - eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; + if (hw->use_vec_tx) + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec; + else + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; } else { if (hw->use_inorder_tx) { PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u", @@ -1539,7 +1542,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) } if (vtpci_packed_queue(hw)) { - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + if (hw->use_vec_rx) { + PMD_INIT_LOG(INFO, + "virtio: using packed ring vectorized Rx path on port %u", + eth_dev->data->port_id); + eth_dev->rx_pkt_burst = + &virtio_recv_pkts_packed_vec; + } else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { PMD_INIT_LOG(INFO, "virtio: using packed ring mergeable buffer Rx path on port %u", eth_dev->data->port_id); @@ -1952,8 +1961,17 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) goto err_virtio_init; if (vectorized) { - if (!vtpci_packed_queue(hw)) + if (!vtpci_packed_queue(hw)) { + hw->use_vec_rx = 1; + } else { +#if !defined(CC_AVX512_SUPPORT) + PMD_DRV_LOG(INFO, + "building environment do not support packed ring vectorized"); +#else hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#endif + } } hw->opened = true; @@ -2099,11 +2117,10 @@ virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, } } - if (vectorized && - rte_kvargs_count(kvlist, VIRTIO_ARG_VECTORIZED) == 1) { + if (vectorized && rte_kvargs_count(kvlist, VIRTIO_ARG_VECTORIZED) == 1) { ret = rte_kvargs_process(kvlist, - VIRTIO_ARG_VECTORIZED, - vectorized_check_handler, vectorized); + VIRTIO_ARG_VECTORIZED, + vectorized_check_handler, vectorized); if (ret < 0) { PMD_INIT_LOG(ERR, "Failed to parse %s", VIRTIO_ARG_VECTORIZED); @@ -2288,31 +2305,61 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { - hw->use_inorder_tx = 1; - hw->use_inorder_rx = 1; - hw->use_vec_rx = 0; - } - if (vtpci_packed_queue(hw)) { - hw->use_vec_rx = 0; - 
hw->use_inorder_rx = 0; - } + if ((hw->use_vec_rx || hw->use_vec_tx) && + (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) || + !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) || + !vtpci_with_feature(hw, VIRTIO_F_VERSION_1))) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized path for requirements not met"); + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; + } + if (hw->use_vec_rx) { + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } + + if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for TCP_LRO enabled"); + hw->use_vec_rx = 0; + } + } + } else { + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { + hw->use_inorder_tx = 1; + hw->use_inorder_rx = 1; + hw->use_vec_rx = 0; + } + + if (hw->use_vec_rx) { #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM - if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_vec_rx = 0; - } + if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized path for requirement not met"); + hw->use_vec_rx = 0; + } #endif - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_vec_rx = 0; - } + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } - if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | - DEV_RX_OFFLOAD_TCP_CKSUM | - DEV_RX_OFFLOAD_TCP_LRO | - DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_vec_rx = 0; + if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | + DEV_RX_OFFLOAD_TCP_CKSUM | + DEV_RX_OFFLOAD_TCP_LRO | + DEV_RX_OFFLOAD_VLAN_STRIP)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for offloading enabled"); + hw->use_vec_rx = 0; + } + } + } return 0; } -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
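For readers skimming the long diff, the conditions listed in the commit message can be condensed into two predicates. The helper names below are hypothetical; the feature bits, CPU flag and offload mask are the ones the patch actually checks:

    #include <stdbool.h>

    /* Hypothetical condensation of the packed ring checks done in
     * virtio_dev_configure(); not part of the patch itself. */
    static bool
    virtio_packed_vec_tx_ok(struct virtio_hw *hw)
    {
            return rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
                   vtpci_with_feature(hw, VIRTIO_F_VERSION_1) &&
                   vtpci_with_feature(hw, VIRTIO_F_IN_ORDER);
    }

    static bool
    virtio_packed_vec_rx_ok(struct virtio_hw *hw, uint64_t rx_offloads)
    {
            /* Rx additionally requires no mergeable buffers and no LRO. */
            return virtio_packed_vec_tx_ok(hw) &&
                   !vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF) &&
                   !(rx_offloads & DEV_RX_OFFLOAD_TCP_LRO);
    }

Both predicates only come into play when the user opted in with vectorized=1; otherwise use_vec_rx/use_vec_tx stay at 0 and the scalar packed paths are kept.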
* Re: [dpdk-dev] [PATCH v9 8/9] net/virtio: add election for vectorized path 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 8/9] net/virtio: add election for vectorized path Marvin Liu @ 2020-04-24 13:26 ` Maxime Coquelin 0 siblings, 0 replies; 162+ messages in thread From: Maxime Coquelin @ 2020-04-24 13:26 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev, harry.van.haaren On 4/24/20 11:24 AM, Marvin Liu wrote: > Rewrite vectorized path selection logic. Default setting comes from > vectorized devarg, then checks each criteria. > > Packed ring vectorized path need: > AVX512F and required extensions are supported by compiler and host > VERSION_1 and IN_ORDER features are negotiated > mergeable feature is not negotiated > LRO offloading is disabled > > Split ring vectorized rx path need: > mergeable and IN_ORDER features are not negotiated > LRO, chksum and vlan strip offloadings are disabled > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Thanks, Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v9 9/9] doc: add packed vectorized path 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu ` (7 preceding siblings ...) 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 8/9] net/virtio: add election for vectorized path Marvin Liu @ 2020-04-24 9:24 ` Marvin Liu 2020-04-24 13:31 ` Maxime Coquelin 8 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-24 9:24 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang Cc: dev, harry.van.haaren, Marvin Liu Document packed virtqueue vectorized path selection logic in virtio net PMD. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index d59add23e..dbcf49ae1 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -482,6 +482,13 @@ according to below configuration: both negotiated, this path will be selected. #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and Rx mergeable is not negotiated, this path will be selected. +#. Packed virtqueue vectorized Rx path: If building and running environment support + AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated && + TCP_LRO Rx offloading is disabled && vectorized option enabled, + this path will be selected. +#. Packed virtqueue vectorized Tx path: If building and running environment support + AVX512 && in-order feature is negotiated && vectorized option enabled, + this path will be selected. Rx/Tx callbacks of each Virtio path ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -504,6 +511,8 @@ are shown in below table: Packed virtqueue non-meregable path virtio_recv_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order mergeable path virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed virtio_xmit_pkts_packed + Packed virtqueue vectorized Rx path virtio_recv_pkts_packed_vec virtio_xmit_pkts_packed + Packed virtqueue vectorized Tx path virtio_recv_pkts_packed virtio_xmit_pkts_packed_vec ============================================ ================================= ======================== Virtio paths Support Status from Release to Release @@ -521,20 +530,22 @@ All virtio paths support status are shown in below table: .. 
table:: Virtio Paths and Releases - ============================================ ============= ============= ============= - Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 - ============================================ ============= ============= ============= - Split virtqueue mergeable path Y Y Y - Split virtqueue non-mergeable path Y Y Y - Split virtqueue vectorized Rx path Y Y Y - Split virtqueue simple Tx path Y N N - Split virtqueue in-order mergeable path Y Y - Split virtqueue in-order non-mergeable path Y Y - Packed virtqueue mergeable path Y - Packed virtqueue non-mergeable path Y - Packed virtqueue in-order mergeable path Y - Packed virtqueue in-order non-mergeable path Y - ============================================ ============= ============= ============= + ============================================ ============= ============= ============= ======= + Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~ + ============================================ ============= ============= ============= ======= + Split virtqueue mergeable path Y Y Y Y + Split virtqueue non-mergeable path Y Y Y Y + Split virtqueue vectorized Rx path Y Y Y Y + Split virtqueue simple Tx path Y N N N + Split virtqueue in-order mergeable path Y Y Y + Split virtqueue in-order non-mergeable path Y Y Y + Packed virtqueue mergeable path Y Y + Packed virtqueue non-mergeable path Y Y + Packed virtqueue in-order mergeable path Y Y + Packed virtqueue in-order non-mergeable path Y Y + Packed virtqueue vectorized Rx path Y + Packed virtqueue vectorized Tx path Y + ============================================ ============= ============= ============= ======= QEMU Support Status ~~~~~~~~~~~~~~~~~~~ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v9 9/9] doc: add packed vectorized path 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 9/9] doc: add packed " Marvin Liu @ 2020-04-24 13:31 ` Maxime Coquelin 0 siblings, 0 replies; 162+ messages in thread From: Maxime Coquelin @ 2020-04-24 13:31 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev, harry.van.haaren On 4/24/20 11:24 AM, Marvin Liu wrote: > Document packed virtqueue vectorized path selection logic in virtio net > PMD. > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Thanks, Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v9 0/9] add packed ring vectorized path 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu ` (14 preceding siblings ...) 2020-04-24 9:24 ` [dpdk-dev] [PATCH v9 " Marvin Liu @ 2020-04-26 2:19 ` Marvin Liu 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 1/9] net/virtio: add Rx free threshold setting Marvin Liu ` (8 more replies) 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 0/9] add packed ring " Marvin Liu 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 0/9] add packed ring " Marvin Liu 17 siblings, 9 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-26 2:19 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu This patch set introduced vectorized path for packed ring. The size of packed ring descriptor is 16Bytes. Four batched descriptors are just placed into one cacheline. AVX512 instructions can well handle this kind of data. Packed ring TX path can fully transformed into vectorized path. Packed ring Rx path can be vectorized when requirements met(LRO and mergeable disabled). New option RTE_LIBRTE_VIRTIO_INC_VECTOR will be introduced in this patch set. This option will unify split and packed ring vectorized path default setting. Meanwhile user can specify whether enable vectorized path at runtime by 'vectorized' parameter of virtio user vdev. v10: * reuse packed ring xmit cleanup v9: * replace RTE_LIBRTE_VIRTIO_INC_VECTOR with vectorized devarg * reorder patch sequence v8: * fix meson build error on ubuntu16.04 and suse15 v7: * default vectorization is disabled * compilation time check dependency on rte_mbuf structure * offsets are calcuated when compiling * remove useless barrier as descs are batched store&load * vindex of scatter is directly set * some comments updates * enable vectorized path in meson build v6: * fix issue when size not power of 2 v5: * remove cpuflags definition as required extensions always come with AVX512F on x86_64 * inorder actions should depend on feature bit * check ring type in rx queue setup * rewrite some commit logs * fix some checkpatch warnings v4: * rename 'packed_vec' to 'vectorized', also used in split ring * add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev * check required AVX512 extensions cpuflags * combine split and packed ring datapath selection logic * remove limitation that size must power of two * clear 12Bytes virtio_net_hdr v3: * remove virtio_net_hdr array for better performance * disable 'packed_vec' by default v2: * more function blocks replaced by vector instructions * clean virtio_net_hdr by vector instruction * allow header room size change * add 'packed_vec' option in virtio_user vdev * fix build not check whether AVX512 enabled * doc update Tested-by: Wang, Yinan <yinan.wang@intel.com> Marvin Liu (9): net/virtio: add Rx free threshold setting net/virtio: inorder should depend on feature bit net/virtio: add vectorized devarg net/virtio-user: add vectorized devarg net/virtio: reuse packed ring functions net/virtio: add vectorized packed ring Rx path net/virtio: add vectorized packed ring Tx path net/virtio: add election for vectorized path doc: add packed vectorized path doc/guides/nics/virtio.rst | 52 +- drivers/net/virtio/Makefile | 35 ++ drivers/net/virtio/meson.build | 14 + drivers/net/virtio/virtio_ethdev.c | 137 ++++- drivers/net/virtio/virtio_ethdev.h | 6 + drivers/net/virtio/virtio_pci.h | 3 +- drivers/net/virtio/virtio_rxtx.c | 349 ++--------- drivers/net/virtio/virtio_rxtx_packed_avx.c | 623 ++++++++++++++++++++ 
drivers/net/virtio/virtio_user_ethdev.c | 32 +- drivers/net/virtio/virtqueue.c | 7 +- drivers/net/virtio/virtqueue.h | 307 +++++++++- 11 files changed, 1210 insertions(+), 355 deletions(-) create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
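The sizing argument in the cover letter (16-byte descriptors, four per 64-byte cache line, one AVX512 operation per batch) can be sanity-checked with a few lines of standalone C. The struct below is only a stand-in for struct vring_packed_desc and the copy helper is illustrative; compile with -mavx512f:

    #include <stdint.h>
    #include <immintrin.h>

    /* Stand-in mirroring struct vring_packed_desc: addr/len/id/flags. */
    struct pdesc {
            uint64_t addr;
            uint32_t len;
            uint16_t id;
            uint16_t flags;
    };

    _Static_assert(sizeof(struct pdesc) == 16, "packed desc is 16 bytes");
    #define BATCH (64 / sizeof(struct pdesc))   /* 4 descriptors per line */

    /* One 512-bit load/store moves a whole batch of four descriptors. */
    static inline void
    copy_desc_batch(struct pdesc *dst, const struct pdesc *src)
    {
            _mm512_storeu_si512(dst, _mm512_loadu_si512(src));
    }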
* [dpdk-dev] [PATCH v10 1/9] net/virtio: add Rx free threshold setting 2020-04-26 2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu @ 2020-04-26 2:19 ` Marvin Liu 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 2/9] net/virtio: inorder should depend on feature bit Marvin Liu ` (7 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-26 2:19 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Introduce free threshold setting in Rx queue, its default value is 32. Limit the threshold size to multiple of four as only vectorized packed Rx function will utilize it. Virtio driver will rearm Rx queue when more than rx_free_thresh descs were dequeued. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 060410577..94ba7a3ec 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, struct virtio_hw *hw = dev->data->dev_private; struct virtqueue *vq = hw->vqs[vtpci_queue_idx]; struct virtnet_rx *rxvq; + uint16_t rx_free_thresh; PMD_INIT_FUNC_TRACE(); @@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, return -EINVAL; } + rx_free_thresh = rx_conf->rx_free_thresh; + if (rx_free_thresh == 0) + rx_free_thresh = + RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH); + + if (rx_free_thresh & 0x3) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four." + " (rx_free_thresh=%u port=%u queue=%u)\n", + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + + if (rx_free_thresh >= vq->vq_nentries) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the " + "number of RX entries (%u)." + " (rx_free_thresh=%u port=%u queue=%u)\n", + vq->vq_nentries, + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + vq->vq_free_thresh = rx_free_thresh; + if (nb_desc == 0 || nb_desc > vq->vq_nentries) nb_desc = vq->vq_nentries; vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc); diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 58ad7309a..6301c56b2 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,8 @@ struct rte_mbuf; +#define DEFAULT_RX_FREE_THRESH 32 + /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
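From the application side this threshold is the standard rte_eth_rxconf field; a hedged usage fragment, assuming port_id and mbuf_pool were set up elsewhere by the application:

    struct rte_eth_dev_info dev_info;
    struct rte_eth_rxconf rxconf;
    int ret;

    rte_eth_dev_info_get(port_id, &dev_info);
    rxconf = dev_info.default_rxconf;
    /* Per the checks added here: a multiple of four that is smaller
     * than the ring size, otherwise queue setup fails with -EINVAL.
     * Leaving it at 0 picks min(ring_size / 4, 32). */
    rxconf.rx_free_thresh = 64;
    ret = rte_eth_rx_queue_setup(port_id, 0 /* queue */, 1024 /* descs */,
                                 rte_socket_id(), &rxconf, mbuf_pool);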
* [dpdk-dev] [PATCH v10 2/9] net/virtio: inorder should depend on feature bit 2020-04-26 2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 1/9] net/virtio: add Rx free threshold setting Marvin Liu @ 2020-04-26 2:19 ` Marvin Liu 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 3/9] net/virtio: add vectorized devarg Marvin Liu ` (6 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-26 2:19 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Ring initialization is different when inorder feature negotiated. This action should dependent on negotiated feature bits. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 94ba7a3ec..e450477e8 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -989,6 +989,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) struct rte_mbuf *m; uint16_t desc_idx; int error, nbufs, i; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); PMD_INIT_FUNC_TRACE(); @@ -1018,7 +1019,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; } - } else if (hw->use_inorder_rx) { + } else if (!vtpci_packed_queue(vq->hw) && in_order) { if ((!virtqueue_full(vq))) { uint16_t free_cnt = vq->vq_free_cnt; struct rte_mbuf *pkts[free_cnt]; @@ -1133,7 +1134,7 @@ virtio_dev_tx_queue_setup_finish(struct rte_eth_dev *dev, PMD_INIT_FUNC_TRACE(); if (!vtpci_packed_queue(hw)) { - if (hw->use_inorder_tx) + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) vq->vq_split.ring.desc[vq->vq_nentries - 1].next = 0; } @@ -2046,7 +2047,7 @@ virtio_xmit_pkts_packed(void *tx_queue, struct rte_mbuf **tx_pkts, struct virtio_hw *hw = vq->hw; uint16_t hdr_size = hw->vtnet_hdr_size; uint16_t nb_tx = 0; - bool in_order = hw->use_inorder_tx; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) return nb_tx; -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
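The distinction the patch leans on: VIRTIO_F_IN_ORDER is what was negotiated with the device and therefore dictates how the ring must be initialized, while use_inorder_rx/tx only record which software datapath the driver picked and may be cleared later (for example when the vectorized path is elected in patch 8/9). A tiny fragment, variable names hypothetical:

    /* Ring layout follows the negotiated feature bit ... */
    bool ring_in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER);
    /* ... while this only reflects the datapath choice and may differ. */
    bool inorder_datapath = hw->use_inorder_tx;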
* [dpdk-dev] [PATCH v10 3/9] net/virtio: add vectorized devarg 2020-04-26 2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 1/9] net/virtio: add Rx free threshold setting Marvin Liu 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 2/9] net/virtio: inorder should depend on feature bit Marvin Liu @ 2020-04-26 2:19 ` Marvin Liu 2020-04-27 11:12 ` Maxime Coquelin 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 4/9] net/virtio-user: " Marvin Liu ` (5 subsequent siblings) 8 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-26 2:19 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Previously, virtio split ring vectorized path was enabled by default. This is not suitable for everyone because that path dose not follow virtio spec. Add new devarg for virtio vectorized path selection. By default vectorized path is disabled. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index 6286286db..902a1f0cf 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -363,6 +363,13 @@ Below devargs are supported by the PCI virtio driver: rte_eth_link_get_nowait function. (Default: 10000 (10G)) +#. ``vectorized``: + + It is used to specify whether virtio device perfer to use vectorized path. + Afterwards, dependencies of vectorized path will be checked in path + election. + (Default: 0 (disabled)) + Below devargs are supported by the virtio-user vdev: #. ``path``: diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 37766cbb6..0a69a4db1 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -48,7 +48,8 @@ static int virtio_dev_allmulticast_disable(struct rte_eth_dev *dev); static uint32_t virtio_dev_speed_capa_get(uint32_t speed); static int virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, - uint32_t *speed); + uint32_t *speed, + int *vectorized); static int virtio_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info); static int virtio_dev_link_update(struct rte_eth_dev *dev, @@ -1551,8 +1552,8 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed; } } else { - if (hw->use_simple_rx) { - PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u", + if (hw->use_vec_rx) { + PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u", eth_dev->data->port_id); eth_dev->rx_pkt_burst = virtio_recv_pkts_vec; } else if (hw->use_inorder_rx) { @@ -1886,6 +1887,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) { struct virtio_hw *hw = eth_dev->data->dev_private; uint32_t speed = SPEED_UNKNOWN; + int vectorized = 0; int ret; if (sizeof(struct virtio_net_hdr_mrg_rxbuf) > RTE_PKTMBUF_HEADROOM) { @@ -1912,7 +1914,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) return 0; } ret = virtio_dev_devargs_parse(eth_dev->device->devargs, - NULL, &speed); + NULL, &speed, &vectorized); if (ret < 0) return ret; hw->speed = speed; @@ -1949,6 +1951,11 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) if (ret < 0) goto err_virtio_init; + if (vectorized) { + if (!vtpci_packed_queue(hw)) + hw->use_vec_rx = 1; + } + hw->opened = true; return 0; @@ -2021,9 +2028,20 @@ virtio_dev_speed_capa_get(uint32_t speed) } } +static int vectorized_check_handler(__rte_unused const char *key, + const char *value, void *ret_val) +{ + if (strcmp(value, "1") == 
0) + *(int *)ret_val = 1; + else + *(int *)ret_val = 0; + + return 0; +} #define VIRTIO_ARG_SPEED "speed" #define VIRTIO_ARG_VDPA "vdpa" +#define VIRTIO_ARG_VECTORIZED "vectorized" static int @@ -2045,7 +2063,7 @@ link_speed_handler(const char *key __rte_unused, static int virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, - uint32_t *speed) + uint32_t *speed, int *vectorized) { struct rte_kvargs *kvlist; int ret = 0; @@ -2081,6 +2099,18 @@ virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, } } + if (vectorized && + rte_kvargs_count(kvlist, VIRTIO_ARG_VECTORIZED) == 1) { + ret = rte_kvargs_process(kvlist, + VIRTIO_ARG_VECTORIZED, + vectorized_check_handler, vectorized); + if (ret < 0) { + PMD_INIT_LOG(ERR, "Failed to parse %s", + VIRTIO_ARG_VECTORIZED); + goto exit; + } + } + exit: rte_kvargs_free(kvlist); return ret; @@ -2092,7 +2122,8 @@ static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused, int vdpa = 0; int ret = 0; - ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL); + ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL, + NULL); if (ret < 0) { PMD_INIT_LOG(ERR, "devargs parsing is failed"); return ret; @@ -2257,33 +2288,31 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - hw->use_simple_rx = 1; - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { hw->use_inorder_tx = 1; hw->use_inorder_rx = 1; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (vtpci_packed_queue(hw)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; hw->use_inorder_rx = 0; } #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } #endif if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | DEV_RX_OFFLOAD_TCP_CKSUM | DEV_RX_OFFLOAD_TCP_LRO | DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; return 0; } diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h index bd89357e4..668e688e1 100644 --- a/drivers/net/virtio/virtio_pci.h +++ b/drivers/net/virtio/virtio_pci.h @@ -253,7 +253,8 @@ struct virtio_hw { uint8_t vlan_strip; uint8_t use_msix; uint8_t modern; - uint8_t use_simple_rx; + uint8_t use_vec_rx; + uint8_t use_vec_tx; uint8_t use_inorder_rx; uint8_t use_inorder_tx; uint8_t weak_barriers; diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index e450477e8..84f4cf946 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) /* Allocate blank mbufs for the each rx descriptor */ nbufs = 0; - if (hw->use_simple_rx) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { for (desc_idx = 0; desc_idx < vq->vq_nentries; desc_idx++) { vq->vq_split.ring.avail->ring[desc_idx] = desc_idx; @@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) &rxvq->fake_mbuf; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 953f00d72..150a8d987 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -525,7 
+525,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev) */ hw->use_msix = 1; hw->modern = 0; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; hw->use_inorder_rx = 0; hw->use_inorder_tx = 0; hw->virtio_user_dev = dev; diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c index 0b4e3bf3e..ca23180de 100644 --- a/drivers/net/virtio/virtqueue.c +++ b/drivers/net/virtio/virtqueue.c @@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq) end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1); for (idx = 0; idx < vq->vq_nentries; idx++) { - if (hw->use_simple_rx && type == VTNET_RQ) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw) && + type == VTNET_RQ) { if (start <= end && idx >= start && idx < end) continue; if (start > end && (idx >= start || idx < end)) @@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) for (i = 0; i < nb_used; i++) { used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1); uep = &vq->vq_split.ring.used->ring[used_idx]; - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { desc_idx = used_idx; rte_pktmbuf_free(vq->sw_ring[desc_idx]); vq->vq_free_cnt++; @@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) vq->vq_used_cons_idx++; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxq); if (virtqueue_kick_prepare(vq)) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
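As a usage note, the preference is expressed through the device's devargs string; an illustrative testpmd invocation for that release (PCI address and core list are placeholders, and -w was the whitelist option before it became -a):

    testpmd -l 0-1 -n 4 -w 0000:00:04.0,vectorized=1 -- -i

Whether the vectorized path is actually taken still depends on the runtime checks added by the election patch later in the series.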
* Re: [dpdk-dev] [PATCH v10 3/9] net/virtio: add vectorized devarg 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 3/9] net/virtio: add vectorized devarg Marvin Liu @ 2020-04-27 11:12 ` Maxime Coquelin 0 siblings, 0 replies; 162+ messages in thread From: Maxime Coquelin @ 2020-04-27 11:12 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev On 4/26/20 4:19 AM, Marvin Liu wrote: > Previously, virtio split ring vectorized path was enabled by default. > This is not suitable for everyone because that path dose not follow > virtio spec. Add new devarg for virtio vectorized path selection. By > default vectorized path is disabled. > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> > > diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst > index 6286286db..902a1f0cf 100644 > --- a/doc/guides/nics/virtio.rst > +++ b/doc/guides/nics/virtio.rst > @@ -363,6 +363,13 @@ Below devargs are supported by the PCI virtio driver: > rte_eth_link_get_nowait function. > (Default: 10000 (10G)) > > +#. ``vectorized``: > + > + It is used to specify whether virtio device perfer to use vectorized path. s/perfer/prefers/ > + Afterwards, dependencies of vectorized path will be checked in path > + election. > + (Default: 0 (disabled)) > + > Below devargs are supported by the virtio-user vdev: > > #. ``path``: > diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c > index 37766cbb6..0a69a4db1 100644 > --- a/drivers/net/virtio/virtio_ethdev.c > +++ b/drivers/net/virtio/virtio_ethdev.c > @@ -48,7 +48,8 @@ static int virtio_dev_allmulticast_disable(struct rte_eth_dev *dev); > static uint32_t virtio_dev_speed_capa_get(uint32_t speed); > static int virtio_dev_devargs_parse(struct rte_devargs *devargs, > int *vdpa, > - uint32_t *speed); > + uint32_t *speed, > + int *vectorized); > static int virtio_dev_info_get(struct rte_eth_dev *dev, > struct rte_eth_dev_info *dev_info); > static int virtio_dev_link_update(struct rte_eth_dev *dev, > @@ -1551,8 +1552,8 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) > eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed; > } > } else { > - if (hw->use_simple_rx) { > - PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u", > + if (hw->use_vec_rx) { > + PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u", > eth_dev->data->port_id); > eth_dev->rx_pkt_burst = virtio_recv_pkts_vec; > } else if (hw->use_inorder_rx) { > @@ -1886,6 +1887,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) > { > struct virtio_hw *hw = eth_dev->data->dev_private; > uint32_t speed = SPEED_UNKNOWN; > + int vectorized = 0; > int ret; > > if (sizeof(struct virtio_net_hdr_mrg_rxbuf) > RTE_PKTMBUF_HEADROOM) { > @@ -1912,7 +1914,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) > return 0; > } > ret = virtio_dev_devargs_parse(eth_dev->device->devargs, > - NULL, &speed); > + NULL, &speed, &vectorized); > if (ret < 0) > return ret; > hw->speed = speed; > @@ -1949,6 +1951,11 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) > if (ret < 0) > goto err_virtio_init; > > + if (vectorized) { > + if (!vtpci_packed_queue(hw)) > + hw->use_vec_rx = 1; > + } > + > hw->opened = true; > > return 0; > @@ -2021,9 +2028,20 @@ virtio_dev_speed_capa_get(uint32_t speed) > } > } > > +static int vectorized_check_handler(__rte_unused const char *key, > + const char *value, void *ret_val) > +{ > + if (strcmp(value, "1") == 0) > + *(int *)ret_val = 1; > + else > + *(int *)ret_val = 0; > + > + return 0; > +} > > 
#define VIRTIO_ARG_SPEED "speed" > #define VIRTIO_ARG_VDPA "vdpa" > +#define VIRTIO_ARG_VECTORIZED "vectorized" > > > static int > @@ -2045,7 +2063,7 @@ link_speed_handler(const char *key __rte_unused, > > static int > virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, > - uint32_t *speed) > + uint32_t *speed, int *vectorized) > { > struct rte_kvargs *kvlist; > int ret = 0; > @@ -2081,6 +2099,18 @@ virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, > } > } > > + if (vectorized && > + rte_kvargs_count(kvlist, VIRTIO_ARG_VECTORIZED) == 1) { > + ret = rte_kvargs_process(kvlist, > + VIRTIO_ARG_VECTORIZED, > + vectorized_check_handler, vectorized); > + if (ret < 0) { > + PMD_INIT_LOG(ERR, "Failed to parse %s", > + VIRTIO_ARG_VECTORIZED); > + goto exit; > + } > + } > + > exit: > rte_kvargs_free(kvlist); > return ret; > @@ -2092,7 +2122,8 @@ static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused, > int vdpa = 0; > int ret = 0; > > - ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL); > + ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL, > + NULL); > if (ret < 0) { > PMD_INIT_LOG(ERR, "devargs parsing is failed"); > return ret; > @@ -2257,33 +2288,31 @@ virtio_dev_configure(struct rte_eth_dev *dev) > return -EBUSY; > } > > - hw->use_simple_rx = 1; > - > if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { > hw->use_inorder_tx = 1; > hw->use_inorder_rx = 1; > - hw->use_simple_rx = 0; > + hw->use_vec_rx = 0; > } > > if (vtpci_packed_queue(hw)) { > - hw->use_simple_rx = 0; > + hw->use_vec_rx = 0; > hw->use_inorder_rx = 0; > } > > #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM > if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { > - hw->use_simple_rx = 0; > + hw->use_vec_rx = 0; > } > #endif > if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { > - hw->use_simple_rx = 0; > + hw->use_vec_rx = 0; > } > > if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | > DEV_RX_OFFLOAD_TCP_CKSUM | > DEV_RX_OFFLOAD_TCP_LRO | > DEV_RX_OFFLOAD_VLAN_STRIP)) > - hw->use_simple_rx = 0; > + hw->use_vec_rx = 0; > > return 0; > } > diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h > index bd89357e4..668e688e1 100644 > --- a/drivers/net/virtio/virtio_pci.h > +++ b/drivers/net/virtio/virtio_pci.h > @@ -253,7 +253,8 @@ struct virtio_hw { > uint8_t vlan_strip; > uint8_t use_msix; > uint8_t modern; > - uint8_t use_simple_rx; > + uint8_t use_vec_rx; > + uint8_t use_vec_tx; > uint8_t use_inorder_rx; > uint8_t use_inorder_tx; > uint8_t weak_barriers; > diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c > index e450477e8..84f4cf946 100644 > --- a/drivers/net/virtio/virtio_rxtx.c > +++ b/drivers/net/virtio/virtio_rxtx.c > @@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) > /* Allocate blank mbufs for the each rx descriptor */ > nbufs = 0; > > - if (hw->use_simple_rx) { > + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { > for (desc_idx = 0; desc_idx < vq->vq_nentries; > desc_idx++) { > vq->vq_split.ring.avail->ring[desc_idx] = desc_idx; > @@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) > &rxvq->fake_mbuf; > } > > - if (hw->use_simple_rx) { > + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { > while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { > virtio_rxq_rearm_vec(rxvq); > nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; > diff --git a/drivers/net/virtio/virtio_user_ethdev.c 
b/drivers/net/virtio/virtio_user_ethdev.c > index 953f00d72..150a8d987 100644 > --- a/drivers/net/virtio/virtio_user_ethdev.c > +++ b/drivers/net/virtio/virtio_user_ethdev.c > @@ -525,7 +525,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev) > */ > hw->use_msix = 1; > hw->modern = 0; > - hw->use_simple_rx = 0; > + hw->use_vec_rx = 0; > hw->use_inorder_rx = 0; > hw->use_inorder_tx = 0; > hw->virtio_user_dev = dev; > diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c > index 0b4e3bf3e..ca23180de 100644 > --- a/drivers/net/virtio/virtqueue.c > +++ b/drivers/net/virtio/virtqueue.c > @@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq) > end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1); > > for (idx = 0; idx < vq->vq_nentries; idx++) { > - if (hw->use_simple_rx && type == VTNET_RQ) { > + if (hw->use_vec_rx && !vtpci_packed_queue(hw) && > + type == VTNET_RQ) { > if (start <= end && idx >= start && idx < end) > continue; > if (start > end && (idx >= start || idx < end)) > @@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) > for (i = 0; i < nb_used; i++) { > used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1); > uep = &vq->vq_split.ring.used->ring[used_idx]; > - if (hw->use_simple_rx) { > + if (hw->use_vec_rx) { > desc_idx = used_idx; > rte_pktmbuf_free(vq->sw_ring[desc_idx]); > vq->vq_free_cnt++; > @@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) > vq->vq_used_cons_idx++; > } > > - if (hw->use_simple_rx) { > + if (hw->use_vec_rx) { > while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { > virtio_rxq_rearm_vec(rxq); > if (virtqueue_kick_prepare(vq)) > ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v10 4/9] net/virtio-user: add vectorized devarg 2020-04-26 2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu ` (2 preceding siblings ...) 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 3/9] net/virtio: add vectorized devarg Marvin Liu @ 2020-04-26 2:19 ` Marvin Liu 2020-04-27 11:07 ` Maxime Coquelin 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 5/9] net/virtio: reuse packed ring functions Marvin Liu ` (4 subsequent siblings) 8 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-26 2:19 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Add new devarg for virtio user device vectorized path selection. By default vectorized path is disabled. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index 902a1f0cf..d59add23e 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -424,6 +424,12 @@ Below devargs are supported by the virtio-user vdev: rte_eth_link_get_nowait function. (Default: 10000 (10G)) +#. ``vectorized``: + + It is used to specify whether virtio device perfer to use vectorized path. + Afterwards, dependencies of vectorized path will be checked in path + election. + (Default: 0 (disabled)) Virtio paths Selection and Usage -------------------------------- diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 150a8d987..40ad786cc 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -452,6 +452,8 @@ static const char *valid_args[] = { VIRTIO_USER_ARG_PACKED_VQ, #define VIRTIO_USER_ARG_SPEED "speed" VIRTIO_USER_ARG_SPEED, +#define VIRTIO_USER_ARG_VECTORIZED "vectorized" + VIRTIO_USER_ARG_VECTORIZED, NULL }; @@ -559,6 +561,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) uint64_t mrg_rxbuf = 1; uint64_t in_order = 1; uint64_t packed_vq = 0; + uint64_t vectorized = 0; char *path = NULL; char *ifname = NULL; char *mac_addr = NULL; @@ -675,6 +678,15 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } } + if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) { + if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED, + &get_integer_arg, &vectorized) < 0) { + PMD_INIT_LOG(ERR, "error to parse %s", + VIRTIO_USER_ARG_VECTORIZED); + goto end; + } + } + if (queues > 1 && cq == 0) { PMD_INIT_LOG(ERR, "multi-q requires ctrl-q"); goto end; @@ -727,6 +739,9 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) goto end; } + if (vectorized) + hw->use_vec_rx = 1; + rte_eth_dev_probing_finish(eth_dev); ret = 0; @@ -785,4 +800,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, "mrg_rxbuf=<0|1> " "in_order=<0|1> " "packed_vq=<0|1> " - "speed=<int>"); + "speed=<int> " + "vectorized=<0|1>"); -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
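An illustrative virtio-user invocation (socket path and core list are placeholders); mrg_rxbuf=0, in_order=1 and packed_vq=1 match the requirements the election patch places on the packed ring vectorized path:

    testpmd -l 0-1 --no-pci \
        --vdev=net_virtio_user0,path=/tmp/vhost.sock,queues=1,mrg_rxbuf=0,in_order=1,packed_vq=1,vectorized=1 \
        -- -i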
* Re: [dpdk-dev] [PATCH v10 4/9] net/virtio-user: add vectorized devarg 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 4/9] net/virtio-user: " Marvin Liu @ 2020-04-27 11:07 ` Maxime Coquelin 2020-04-28 1:29 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: Maxime Coquelin @ 2020-04-27 11:07 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev On 4/26/20 4:19 AM, Marvin Liu wrote: > Add new devarg for virtio user device vectorized path selection. By > default vectorized path is disabled. > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst > index 902a1f0cf..d59add23e 100644 > --- a/doc/guides/nics/virtio.rst > +++ b/doc/guides/nics/virtio.rst > @@ -424,6 +424,12 @@ Below devargs are supported by the virtio-user vdev: > rte_eth_link_get_nowait function. > (Default: 10000 (10G)) > > +#. ``vectorized``: > + > + It is used to specify whether virtio device perfer to use vectorized path. s/perfer/prefers/ I'll fix while applying if the rest of the series is ok. Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Thanks, Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v10 4/9] net/virtio-user: add vectorized devarg 2020-04-27 11:07 ` Maxime Coquelin @ 2020-04-28 1:29 ` Liu, Yong 0 siblings, 0 replies; 162+ messages in thread From: Liu, Yong @ 2020-04-28 1:29 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Monday, April 27, 2020 7:07 PM > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > Wang, Zhihong <zhihong.wang@intel.com> > Cc: dev@dpdk.org > Subject: Re: [PATCH v10 4/9] net/virtio-user: add vectorized devarg > > > > On 4/26/20 4:19 AM, Marvin Liu wrote: > > Add new devarg for virtio user device vectorized path selection. By > > default vectorized path is disabled. > > > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > > > diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst > > index 902a1f0cf..d59add23e 100644 > > --- a/doc/guides/nics/virtio.rst > > +++ b/doc/guides/nics/virtio.rst > > @@ -424,6 +424,12 @@ Below devargs are supported by the virtio-user > vdev: > > rte_eth_link_get_nowait function. > > (Default: 10000 (10G)) > > > > +#. ``vectorized``: > > + > > + It is used to specify whether virtio device perfer to use vectorized path. > > s/perfer/prefers/ > > I'll fix while applying if the rest of the series is ok. Thanks, Maxime. I will fix it in the next version, together with the i686 build fix. > > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> > > Thanks, > Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v10 5/9] net/virtio: reuse packed ring functions 2020-04-26 2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu ` (3 preceding siblings ...) 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 4/9] net/virtio-user: " Marvin Liu @ 2020-04-26 2:19 ` Marvin Liu 2020-04-27 11:08 ` Maxime Coquelin 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path Marvin Liu ` (3 subsequent siblings) 8 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-26 2:19 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Move offload, xmit cleanup and packed xmit enqueue function to header file. These functions will be reused by packed ring vectorized path. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 84f4cf946..a549991aa 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -89,23 +89,6 @@ vq_ring_free_chain(struct virtqueue *vq, uint16_t desc_idx) dp->next = VQ_RING_DESC_CHAIN_END; } -static void -vq_ring_free_id_packed(struct virtqueue *vq, uint16_t id) -{ - struct vq_desc_extra *dxp; - - dxp = &vq->vq_descx[id]; - vq->vq_free_cnt += dxp->ndescs; - - if (vq->vq_desc_tail_idx == VQ_RING_DESC_CHAIN_END) - vq->vq_desc_head_idx = id; - else - vq->vq_descx[vq->vq_desc_tail_idx].next = id; - - vq->vq_desc_tail_idx = id; - dxp->next = VQ_RING_DESC_CHAIN_END; -} - void virtio_update_packet_stats(struct virtnet_stats *stats, struct rte_mbuf *mbuf) { @@ -264,130 +247,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq, return i; } -#ifndef DEFAULT_TX_FREE_THRESH -#define DEFAULT_TX_FREE_THRESH 32 -#endif - -static void -virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num) -{ - uint16_t used_idx, id, curr_id, free_cnt = 0; - uint16_t size = vq->vq_nentries; - struct vring_packed_desc *desc = vq->vq_packed.ring.desc; - struct vq_desc_extra *dxp; - - used_idx = vq->vq_used_cons_idx; - /* desc_is_used has a load-acquire or rte_cio_rmb inside - * and wait for used desc in virtqueue. - */ - while (num > 0 && desc_is_used(&desc[used_idx], vq)) { - id = desc[used_idx].id; - do { - curr_id = used_idx; - dxp = &vq->vq_descx[used_idx]; - used_idx += dxp->ndescs; - free_cnt += dxp->ndescs; - num -= dxp->ndescs; - if (used_idx >= size) { - used_idx -= size; - vq->vq_packed.used_wrap_counter ^= 1; - } - if (dxp->cookie != NULL) { - rte_pktmbuf_free(dxp->cookie); - dxp->cookie = NULL; - } - } while (curr_id != id); - } - vq->vq_used_cons_idx = used_idx; - vq->vq_free_cnt += free_cnt; -} - -static void -virtio_xmit_cleanup_normal_packed(struct virtqueue *vq, int num) -{ - uint16_t used_idx, id; - uint16_t size = vq->vq_nentries; - struct vring_packed_desc *desc = vq->vq_packed.ring.desc; - struct vq_desc_extra *dxp; - - used_idx = vq->vq_used_cons_idx; - /* desc_is_used has a load-acquire or rte_cio_rmb inside - * and wait for used desc in virtqueue. - */ - while (num-- && desc_is_used(&desc[used_idx], vq)) { - id = desc[used_idx].id; - dxp = &vq->vq_descx[id]; - vq->vq_used_cons_idx += dxp->ndescs; - if (vq->vq_used_cons_idx >= size) { - vq->vq_used_cons_idx -= size; - vq->vq_packed.used_wrap_counter ^= 1; - } - vq_ring_free_id_packed(vq, id); - if (dxp->cookie != NULL) { - rte_pktmbuf_free(dxp->cookie); - dxp->cookie = NULL; - } - used_idx = vq->vq_used_cons_idx; - } -} - -/* Cleanup from completed transmits. 
*/ -static inline void -virtio_xmit_cleanup_packed(struct virtqueue *vq, int num, int in_order) -{ - if (in_order) - virtio_xmit_cleanup_inorder_packed(vq, num); - else - virtio_xmit_cleanup_normal_packed(vq, num); -} - -static void -virtio_xmit_cleanup(struct virtqueue *vq, uint16_t num) -{ - uint16_t i, used_idx, desc_idx; - for (i = 0; i < num; i++) { - struct vring_used_elem *uep; - struct vq_desc_extra *dxp; - - used_idx = (uint16_t)(vq->vq_used_cons_idx & (vq->vq_nentries - 1)); - uep = &vq->vq_split.ring.used->ring[used_idx]; - - desc_idx = (uint16_t) uep->id; - dxp = &vq->vq_descx[desc_idx]; - vq->vq_used_cons_idx++; - vq_ring_free_chain(vq, desc_idx); - - if (dxp->cookie != NULL) { - rte_pktmbuf_free(dxp->cookie); - dxp->cookie = NULL; - } - } -} - -/* Cleanup from completed inorder transmits. */ -static __rte_always_inline void -virtio_xmit_cleanup_inorder(struct virtqueue *vq, uint16_t num) -{ - uint16_t i, idx = vq->vq_used_cons_idx; - int16_t free_cnt = 0; - struct vq_desc_extra *dxp = NULL; - - if (unlikely(num == 0)) - return; - - for (i = 0; i < num; i++) { - dxp = &vq->vq_descx[idx++ & (vq->vq_nentries - 1)]; - free_cnt += dxp->ndescs; - if (dxp->cookie != NULL) { - rte_pktmbuf_free(dxp->cookie); - dxp->cookie = NULL; - } - } - - vq->vq_free_cnt += free_cnt; - vq->vq_used_cons_idx = idx; -} - static inline int virtqueue_enqueue_refill_inorder(struct virtqueue *vq, struct rte_mbuf **cookies, @@ -562,68 +421,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m) } -/* avoid write operation when necessary, to lessen cache issues */ -#define ASSIGN_UNLESS_EQUAL(var, val) do { \ - if ((var) != (val)) \ - (var) = (val); \ -} while (0) - -#define virtqueue_clear_net_hdr(_hdr) do { \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ -} while (0) - -static inline void -virtqueue_xmit_offload(struct virtio_net_hdr *hdr, - struct rte_mbuf *cookie, - bool offload) -{ - if (offload) { - if (cookie->ol_flags & PKT_TX_TCP_SEG) - cookie->ol_flags |= PKT_TX_TCP_CKSUM; - - switch (cookie->ol_flags & PKT_TX_L4_MASK) { - case PKT_TX_UDP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_udp_hdr, - dgram_cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - case PKT_TX_TCP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - default: - ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); - ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); - ASSIGN_UNLESS_EQUAL(hdr->flags, 0); - break; - } - /* TCP Segmentation Offload */ - if (cookie->ol_flags & PKT_TX_TCP_SEG) { - hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
- VIRTIO_NET_HDR_GSO_TCPV6 : - VIRTIO_NET_HDR_GSO_TCPV4; - hdr->gso_size = cookie->tso_segsz; - hdr->hdr_len = - cookie->l2_len + - cookie->l3_len + - cookie->l4_len; - } else { - ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); - ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); - ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); - } - } -} static inline void virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq, @@ -725,102 +523,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq, virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers); } -static inline void -virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, - uint16_t needed, int can_push, int in_order) -{ - struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; - struct vq_desc_extra *dxp; - struct virtqueue *vq = txvq->vq; - struct vring_packed_desc *start_dp, *head_dp; - uint16_t idx, id, head_idx, head_flags; - int16_t head_size = vq->hw->vtnet_hdr_size; - struct virtio_net_hdr *hdr; - uint16_t prev; - bool prepend_header = false; - - id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; - - dxp = &vq->vq_descx[id]; - dxp->ndescs = needed; - dxp->cookie = cookie; - - head_idx = vq->vq_avail_idx; - idx = head_idx; - prev = head_idx; - start_dp = vq->vq_packed.ring.desc; - - head_dp = &vq->vq_packed.ring.desc[idx]; - head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; - head_flags |= vq->vq_packed.cached_flags; - - if (can_push) { - /* prepend cannot fail, checked by caller */ - hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, - -head_size); - prepend_header = true; - - /* if offload disabled, it is not zeroed below, do it now */ - if (!vq->hw->has_tx_offload) - virtqueue_clear_net_hdr(hdr); - } else { - /* setup first tx ring slot to point to header - * stored in reserved region. - */ - start_dp[idx].addr = txvq->virtio_net_hdr_mem + - RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); - start_dp[idx].len = vq->hw->vtnet_hdr_size; - hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } - - virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); - - do { - uint16_t flags; - - start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); - start_dp[idx].len = cookie->data_len; - if (prepend_header) { - start_dp[idx].addr -= head_size; - start_dp[idx].len += head_size; - prepend_header = false; - } - - if (likely(idx != head_idx)) { - flags = cookie->next ? 
VRING_DESC_F_NEXT : 0; - flags |= vq->vq_packed.cached_flags; - start_dp[idx].flags = flags; - } - prev = idx; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } while ((cookie = cookie->next) != NULL); - - start_dp[prev].id = id; - - vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); - vq->vq_avail_idx = idx; - - if (!in_order) { - vq->vq_desc_head_idx = dxp->next; - if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) - vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; - } - - virtqueue_store_flags_packed(head_dp, head_flags, - vq->hw->weak_barriers); -} - static inline void virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie, uint16_t needed, int use_indirect, int can_push, @@ -1246,7 +948,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) return 0; } -#define VIRTIO_MBUF_BURST_SZ 64 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc)) uint16_t virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 6301c56b2..ca1c10499 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -10,6 +10,7 @@ #include <rte_atomic.h> #include <rte_memory.h> #include <rte_mempool.h> +#include <rte_net.h> #include "virtio_pci.h" #include "virtio_ring.h" @@ -18,8 +19,10 @@ struct rte_mbuf; +#define DEFAULT_TX_FREE_THRESH 32 #define DEFAULT_RX_FREE_THRESH 32 +#define VIRTIO_MBUF_BURST_SZ 64 /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO @@ -560,4 +563,303 @@ virtqueue_notify(struct virtqueue *vq) #define VIRTQUEUE_DUMP(vq) do { } while (0) #endif +/* avoid write operation when necessary, to lessen cache issues */ +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ + typeof(var) var_ = (var); \ + typeof(val) val_ = (val); \ + if ((var_) != (val_)) \ + (var_) = (val_); \ +} while (0) + +#define virtqueue_clear_net_hdr(hdr) do { \ + typeof(hdr) hdr_ = (hdr); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_start, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_offset, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->flags, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_type, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_size, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->hdr_len, 0); \ +} while (0) + +static inline void +virtqueue_xmit_offload(struct virtio_net_hdr *hdr, + struct rte_mbuf *cookie, + bool offload) +{ + if (offload) { + if (cookie->ol_flags & PKT_TX_TCP_SEG) + cookie->ol_flags |= PKT_TX_TCP_CKSUM; + + switch (cookie->ol_flags & PKT_TX_L4_MASK) { + case PKT_TX_UDP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_udp_hdr, + dgram_cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + case PKT_TX_TCP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + default: + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); + break; + } + + /* TCP Segmentation Offload */ + if (cookie->ol_flags & PKT_TX_TCP_SEG) { + hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
+ VIRTIO_NET_HDR_GSO_TCPV6 : + VIRTIO_NET_HDR_GSO_TCPV4; + hdr->gso_size = cookie->tso_segsz; + hdr->hdr_len = + cookie->l2_len + + cookie->l3_len + + cookie->l4_len; + } else { + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); + } + } +} + +static inline void +virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, + uint16_t needed, int can_push, int in_order) +{ + struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; + struct vq_desc_extra *dxp; + struct virtqueue *vq = txvq->vq; + struct vring_packed_desc *start_dp, *head_dp; + uint16_t idx, id, head_idx, head_flags; + int16_t head_size = vq->hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + uint16_t prev; + bool prepend_header = false; + + id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; + + dxp = &vq->vq_descx[id]; + dxp->ndescs = needed; + dxp->cookie = cookie; + + head_idx = vq->vq_avail_idx; + idx = head_idx; + prev = head_idx; + start_dp = vq->vq_packed.ring.desc; + + head_dp = &vq->vq_packed.ring.desc[idx]; + head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; + head_flags |= vq->vq_packed.cached_flags; + + if (can_push) { + /* prepend cannot fail, checked by caller */ + hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, + -head_size); + prepend_header = true; + + /* if offload disabled, it is not zeroed below, do it now */ + if (!vq->hw->has_tx_offload) + virtqueue_clear_net_hdr(hdr); + } else { + /* setup first tx ring slot to point to header + * stored in reserved region. + */ + start_dp[idx].addr = txvq->virtio_net_hdr_mem + + RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); + start_dp[idx].len = vq->hw->vtnet_hdr_size; + hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } + + virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); + + do { + uint16_t flags; + + start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); + start_dp[idx].len = cookie->data_len; + if (prepend_header) { + start_dp[idx].addr -= head_size; + start_dp[idx].len += head_size; + prepend_header = false; + } + + if (likely(idx != head_idx)) { + flags = cookie->next ? 
VRING_DESC_F_NEXT : 0; + flags |= vq->vq_packed.cached_flags; + start_dp[idx].flags = flags; + } + prev = idx; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } while ((cookie = cookie->next) != NULL); + + start_dp[prev].id = id; + + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); + vq->vq_avail_idx = idx; + + if (!in_order) { + vq->vq_desc_head_idx = dxp->next; + if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) + vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; + } + + virtqueue_store_flags_packed(head_dp, head_flags, + vq->hw->weak_barriers); +} + +static void +vq_ring_free_id_packed(struct virtqueue *vq, uint16_t id) +{ + struct vq_desc_extra *dxp; + + dxp = &vq->vq_descx[id]; + vq->vq_free_cnt += dxp->ndescs; + + if (vq->vq_desc_tail_idx == VQ_RING_DESC_CHAIN_END) + vq->vq_desc_head_idx = id; + else + vq->vq_descx[vq->vq_desc_tail_idx].next = id; + + vq->vq_desc_tail_idx = id; + dxp->next = VQ_RING_DESC_CHAIN_END; +} + +static void +virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num) +{ + uint16_t used_idx, id, curr_id, free_cnt = 0; + uint16_t size = vq->vq_nentries; + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; + struct vq_desc_extra *dxp; + + used_idx = vq->vq_used_cons_idx; + /* desc_is_used has a load-acquire or rte_cio_rmb inside + * and wait for used desc in virtqueue. + */ + while (num > 0 && desc_is_used(&desc[used_idx], vq)) { + id = desc[used_idx].id; + do { + curr_id = used_idx; + dxp = &vq->vq_descx[used_idx]; + used_idx += dxp->ndescs; + free_cnt += dxp->ndescs; + num -= dxp->ndescs; + if (used_idx >= size) { + used_idx -= size; + vq->vq_packed.used_wrap_counter ^= 1; + } + if (dxp->cookie != NULL) { + rte_pktmbuf_free(dxp->cookie); + dxp->cookie = NULL; + } + } while (curr_id != id); + } + vq->vq_used_cons_idx = used_idx; + vq->vq_free_cnt += free_cnt; +} + +static void +virtio_xmit_cleanup_normal_packed(struct virtqueue *vq, int num) +{ + uint16_t used_idx, id; + uint16_t size = vq->vq_nentries; + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; + struct vq_desc_extra *dxp; + + used_idx = vq->vq_used_cons_idx; + /* desc_is_used has a load-acquire or rte_cio_rmb inside + * and wait for used desc in virtqueue. + */ + while (num-- && desc_is_used(&desc[used_idx], vq)) { + id = desc[used_idx].id; + dxp = &vq->vq_descx[id]; + vq->vq_used_cons_idx += dxp->ndescs; + if (vq->vq_used_cons_idx >= size) { + vq->vq_used_cons_idx -= size; + vq->vq_packed.used_wrap_counter ^= 1; + } + vq_ring_free_id_packed(vq, id); + if (dxp->cookie != NULL) { + rte_pktmbuf_free(dxp->cookie); + dxp->cookie = NULL; + } + used_idx = vq->vq_used_cons_idx; + } +} + +/* Cleanup from completed transmits. 
*/ +static inline void +virtio_xmit_cleanup_packed(struct virtqueue *vq, int num, int in_order) +{ + if (in_order) + virtio_xmit_cleanup_inorder_packed(vq, num); + else + virtio_xmit_cleanup_normal_packed(vq, num); +} + +static inline void +virtio_xmit_cleanup(struct virtqueue *vq, uint16_t num) +{ + uint16_t i, used_idx, desc_idx; + for (i = 0; i < num; i++) { + struct vring_used_elem *uep; + struct vq_desc_extra *dxp; + + used_idx = (uint16_t)(vq->vq_used_cons_idx & + (vq->vq_nentries - 1)); + uep = &vq->vq_split.ring.used->ring[used_idx]; + + desc_idx = (uint16_t)uep->id; + dxp = &vq->vq_descx[desc_idx]; + vq->vq_used_cons_idx++; + vq_ring_free_chain(vq, desc_idx); + + if (dxp->cookie != NULL) { + rte_pktmbuf_free(dxp->cookie); + dxp->cookie = NULL; + } + } +} + +/* Cleanup from completed inorder transmits. */ +static __rte_always_inline void +virtio_xmit_cleanup_inorder(struct virtqueue *vq, uint16_t num) +{ + uint16_t i, idx = vq->vq_used_cons_idx; + int16_t free_cnt = 0; + struct vq_desc_extra *dxp = NULL; + + if (unlikely(num == 0)) + return; + + for (i = 0; i < num; i++) { + dxp = &vq->vq_descx[idx++ & (vq->vq_nentries - 1)]; + free_cnt += dxp->ndescs; + if (dxp->cookie != NULL) { + rte_pktmbuf_free(dxp->cookie); + dxp->cookie = NULL; + } + } + + vq->vq_free_cnt += free_cnt; + vq->vq_used_cons_idx = idx; +} #endif /* _VIRTQUEUE_H_ */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
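For context, a minimal sketch of how a Tx burst routine could consume the helpers that now live in virtqueue.h (virtio_xmit_cleanup_packed and virtqueue_enqueue_xmit_packed). The burst loop, the slot accounting and the skipped can_push handling are simplified assumptions for illustration, not the driver's actual transmit path; only the helper signatures come from the header above.

/* Sketch only: reuse of the virtqueue.h packed-ring helpers from a
 * hypothetical Tx burst function.  can_push/indirect handling omitted.
 */
static uint16_t
virtio_xmit_pkts_packed_sketch(void *tx_queue, struct rte_mbuf **tx_pkts,
		uint16_t nb_pkts)
{
	struct virtnet_tx *txvq = tx_queue;
	struct virtqueue *vq = txvq->vq;
	struct virtio_hw *hw = vq->hw;
	uint16_t nb_tx;

	if (unlikely(hw->started == 0 || nb_pkts < 1))
		return 0;

	/* reclaim used descriptors before trying to enqueue new ones */
	if (nb_pkts > vq->vq_free_cnt)
		virtio_xmit_cleanup_packed(vq, nb_pkts - vq->vq_free_cnt,
				hw->use_inorder_tx);

	for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
		struct rte_mbuf *txm = tx_pkts[nb_tx];
		/* one extra slot for the separate virtio-net header */
		int slots = txm->nb_segs + 1;

		if (slots > vq->vq_free_cnt)
			break;

		/* always place the header in its own slot in this sketch */
		virtqueue_enqueue_xmit_packed(txvq, txm, slots, 0,
				hw->use_inorder_tx);
	}

	txvq->stats.packets += nb_tx;

	if (likely(nb_tx) && unlikely(virtqueue_kick_prepare_packed(vq)))
		virtqueue_notify(vq);

	return nb_tx;
}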
* Re: [dpdk-dev] [PATCH v10 5/9] net/virtio: reuse packed ring functions 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 5/9] net/virtio: reuse packed ring functions Marvin Liu @ 2020-04-27 11:08 ` Maxime Coquelin 0 siblings, 0 replies; 162+ messages in thread From: Maxime Coquelin @ 2020-04-27 11:08 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev On 4/26/20 4:19 AM, Marvin Liu wrote: > Move offload, xmit cleanup and packed xmit enqueue functions to the header > file. These functions will be reused by the packed ring vectorized path. > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Thanks, Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-26 2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu ` (4 preceding siblings ...) 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 5/9] net/virtio: reuse packed ring functions Marvin Liu @ 2020-04-26 2:19 ` Marvin Liu 2020-04-27 11:20 ` Maxime Coquelin 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu ` (2 subsequent siblings) 8 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-26 2:19 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Optimize packed ring Rx path with SIMD instructions. Solution of optimization is pretty like vhost, is that split path into batch and single functions. Batch function is further optimized by AVX512 instructions. Also pad desc extra structure to 16 bytes aligned, thus four elements will be saved in one batch. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index c9edb84ee..102b1deab 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif +ifneq ($(FORCE_DISABLE_AVX512), y) + CC_AVX512_SUPPORT=\ + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ + grep -q AVX512 && echo 1) +endif + +ifeq ($(CC_AVX512_SUPPORT), 1) +CFLAGS += -DCC_AVX512_SUPPORT +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c + +ifeq ($(RTE_TOOLCHAIN), gcc) +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), clang) +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1) +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), icc) +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA +endif +endif + +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds +endif +endif + ifeq ($(CONFIG_RTE_VIRTIO_USER),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index 15150eea1..8e68c3039 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c', deps += ['kvargs', 'bus_pci'] if arch_subdir == 'x86' + if '-mno-avx512f' not in machine_args + if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw') + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] + cflags += ['-DCC_AVX512_SUPPORT'] + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' + endif + sources += files('virtio_rxtx_packed_avx.c') + endif + endif sources += files('virtio_rxtx_simple_sse.c') elif arch_subdir == 'ppc' sources += 
files('virtio_rxtx_simple_altivec.c') diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index febaf17a8..5c112cac7 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index a549991aa..534562cca 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2030,3 +2030,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, return nb_tx; } + +__rte_weak uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, + struct rte_mbuf **rx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c new file mode 100644 index 000000000..8a7b459eb --- /dev/null +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -0,0 +1,374 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> + +#include <rte_net.h> + +#include "virtio_logs.h" +#include "virtio_ethdev.h" +#include "virtio_pci.h" +#include "virtqueue.h" + +#define BYTE_SIZE 8 +/* flag bits offset in packed ring desc higher 64bits */ +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \ + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) + +#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \ + FLAGS_BITS_OFFSET) + +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ + sizeof(struct vring_packed_desc)) +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) + +#ifdef VIRTIO_GCC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_ICC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ + for (iter = val; iter < size; iter++) +#endif + +#ifndef virtio_for_each_try_unroll +#define virtio_for_each_try_unroll(iter, val, num) \ + for (iter = val; iter < num; iter++) +#endif + +static inline void +virtio_update_batch_stats(struct virtnet_stats *stats, + uint16_t pkt_len1, + uint16_t pkt_len2, + uint16_t pkt_len3, + uint16_t pkt_len4) +{ + stats->bytes += pkt_len1; + stats->bytes += pkt_len2; + stats->bytes += pkt_len3; + stats->bytes += pkt_len4; +} + +/* Optionally fill offload information in structure */ +static inline int +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) +{ + struct rte_net_hdr_lens hdr_lens; + uint32_t hdrlen, ptype; + int l4_supported = 0; + + /* nothing to do */ + if (hdr->flags == 0) + return 0; + + /* GSO not support in vec path, skip check */ + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; + + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); + m->packet_type = ptype; + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || + (ptype & RTE_PTYPE_L4_MASK) == 
RTE_PTYPE_L4_UDP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) + l4_supported = 1; + + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; + if (hdr->csum_start <= hdrlen && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; + } else { + /* Unknown proto or tunnel, do sw cksum. We can assume + * the cksum field is in the first segment since the + * buffers we provided to the host are large enough. + * In case of SCTP, this will be wrong since it's a CRC + * but there's nothing we can do. + */ + uint16_t csum = 0, off; + + rte_raw_cksum_mbuf(m, hdr->csum_start, + rte_pktmbuf_pkt_len(m) - hdr->csum_start, + &csum); + if (likely(csum != 0xffff)) + csum = ~csum; + off = hdr->csum_offset + hdr->csum_start; + if (rte_pktmbuf_data_len(m) >= off + 1) + *rte_pktmbuf_mtod_offset(m, uint16_t *, + off) = csum; + } + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; + } + + return 0; +} + +static inline uint16_t +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint64_t addrs[PACKED_BATCH_SIZE]; + uint16_t id = vq->vq_used_cons_idx; + uint8_t desc_stats; + uint16_t i; + void *desc_addr; + + if (id & PACKED_BATCH_MASK) + return -1; + + if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries)) + return -1; + + /* only care avail/used bits */ + __m512i v_mask = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + desc_addr = &vq->vq_packed.ring.desc[id]; + + __m512i v_desc = _mm512_loadu_si512(desc_addr); + __m512i v_flag = _mm512_and_epi64(v_desc, v_mask); + + __m512i v_used_flag = _mm512_setzero_si512(); + if (vq->vq_packed.used_wrap_counter) + v_used_flag = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + + /* Check all descs are used */ + desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag); + if (desc_stats) + return -1; + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); + + addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1; + } + + /* + * load len from desc, store into mbuf pkt_len and data_len + * len limiated by l6bit buf_len, pkt_len[16:31] can be ignored + */ + const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12; + __m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, 0xAA); + + /* reduce hdr_len from pkt_len and data_len */ + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask, + (uint32_t)-hdr_size); + + __m512i v_value = _mm512_add_epi32(values, mbuf_len_offset); + + /* assert offset of data_len */ + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) != + offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8); + + __m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3], + addrs[2] + 8, addrs[2], + addrs[1] + 8, addrs[1], + addrs[0] + 8, addrs[0]); + /* batch store into mbufs */ + _mm512_i64scatter_epi64(0, v_index, v_value, 1); + + if (hw->has_rx_offload) { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + char *addr = (char *)rx_pkts[i]->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size; + virtio_vec_rx_offload(rx_pkts[i], + (struct virtio_net_hdr *)addr); + } + } + + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, + rx_pkts[3]->pkt_len); + + vq->vq_free_cnt += PACKED_BATCH_SIZE; + + 
vq->vq_used_cons_idx += PACKED_BATCH_SIZE; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static uint16_t +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + uint16_t used_idx, id; + uint32_t len; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint32_t hdr_size = hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + struct vring_packed_desc *desc; + struct rte_mbuf *cookie; + + desc = vq->vq_packed.ring.desc; + used_idx = vq->vq_used_cons_idx; + if (!desc_is_used(&desc[used_idx], vq)) + return -1; + + len = desc[used_idx].len; + id = desc[used_idx].id; + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; + if (unlikely(cookie == NULL)) { + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u", + vq->vq_used_cons_idx); + return -1; + } + rte_prefetch0(cookie); + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); + + cookie->data_off = RTE_PKTMBUF_HEADROOM; + cookie->ol_flags = 0; + cookie->pkt_len = (uint32_t)(len - hdr_size); + cookie->data_len = (uint32_t)(len - hdr_size); + + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size); + if (hw->has_rx_offload) + virtio_vec_rx_offload(cookie, hdr); + + *rx_pkts = cookie; + + rxvq->stats.bytes += cookie->pkt_len; + + vq->vq_free_cnt++; + vq->vq_used_cons_idx++; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static inline void +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **cookie, + uint16_t num) +{ + struct virtqueue *vq = rxvq->vq; + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; + uint16_t flags = vq->vq_packed.cached_flags; + struct virtio_hw *hw = vq->hw; + struct vq_desc_extra *dxp; + uint16_t idx, i; + uint16_t batch_num, total_num = 0; + uint16_t head_idx = vq->vq_avail_idx; + uint16_t head_flag = vq->vq_packed.cached_flags; + uint64_t addr; + + do { + idx = vq->vq_avail_idx; + + batch_num = PACKED_BATCH_SIZE; + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) + batch_num = vq->vq_nentries - idx; + if (unlikely((total_num + batch_num) > num)) + batch_num = num - total_num; + + virtio_for_each_try_unroll(i, 0, batch_num) { + dxp = &vq->vq_descx[idx + i]; + dxp->cookie = (void *)cookie[total_num + i]; + + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) + + RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size; + start_dp[idx + i].addr = addr; + start_dp[idx + i].len = cookie[total_num + i]->buf_len + - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size; + if (total_num || i) { + virtqueue_store_flags_packed(&start_dp[idx + i], + flags, hw->weak_barriers); + } + } + + vq->vq_avail_idx += batch_num; + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + flags = vq->vq_packed.cached_flags; + } + total_num += batch_num; + } while (total_num < num); + + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, + hw->weak_barriers); + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); +} + +uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue, + struct rte_mbuf **rx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_rx *rxvq = rx_queue; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t num, nb_rx = 0; + uint32_t nb_enqueued = 0; + uint16_t free_cnt = 
vq->vq_free_thresh; + + if (unlikely(hw->started == 0)) + return nb_rx; + + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); + if (likely(num > PACKED_BATCH_SIZE)) + num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE); + + while (num) { + if (!virtqueue_dequeue_batch_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx += PACKED_BATCH_SIZE; + num -= PACKED_BATCH_SIZE; + continue; + } + if (!virtqueue_dequeue_single_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx++; + num--; + continue; + } + break; + }; + + PMD_RX_LOG(DEBUG, "dequeue:%d", num); + + rxvq->stats.packets += nb_rx; + + if (likely(vq->vq_free_cnt >= free_cnt)) { + struct rte_mbuf *new_pkts[free_cnt]; + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, + free_cnt) == 0)) { + virtio_recv_refill_packed_vec(rxvq, new_pkts, + free_cnt); + nb_enqueued += free_cnt; + } else { + struct rte_eth_dev *dev = + &rte_eth_devices[rxvq->port_id]; + dev->data->rx_mbuf_alloc_failed += free_cnt; + } + } + + if (likely(nb_enqueued)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_RX_LOG(DEBUG, "Notified"); + } + } + + return nb_rx; +} diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 40ad786cc..c54698ad1 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -528,6 +528,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev) hw->use_msix = 1; hw->modern = 0; hw->use_vec_rx = 0; + hw->use_vec_tx = 0; hw->use_inorder_rx = 0; hw->use_inorder_tx = 0; hw->virtio_user_dev = dev; @@ -739,8 +740,19 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) goto end; } - if (vectorized) - hw->use_vec_rx = 1; + if (vectorized) { + if (packed_vq) { +#if defined(CC_AVX512_SUPPORT) + hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#else + PMD_INIT_LOG(INFO, + "building environment do not support packed ring vectorized"); +#endif + } else { + hw->use_vec_rx = 1; + } + } rte_eth_dev_probing_finish(eth_dev); ret = 0; diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index ca1c10499..ce0340743 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -239,7 +239,8 @@ struct vq_desc_extra { void *cookie; uint16_t ndescs; uint16_t next; -}; + uint8_t padding[4]; +} __rte_packed __rte_aligned(16); struct virtqueue { struct virtio_hw *hw; /**< virtio_hw structure pointer. */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
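The one-line padding change at the end of the patch is what makes the batch functions safe: with the packed ring descriptor fixed at 16 bytes and vq_desc_extra padded to the same size, four entries of either array fill exactly one 64-byte cache line, which is what PACKED_BATCH_SIZE evaluates to on x86. A compile-time sketch of those assumptions (the helper name is made up and the check is illustrative, not part of the patch):

/* Illustrative only: layout assumptions behind PACKED_BATCH_SIZE == 4.
 * Needs rte_common.h plus the driver's virtio_ring.h and virtqueue.h.
 */
static inline void
virtio_vec_layout_assumptions(void)
{
	/* packed ring descriptor is 16 bytes per the virtio spec,
	 * so a 64-byte cache line holds 64 / 16 = 4 of them
	 */
	RTE_BUILD_BUG_ON(sizeof(struct vring_packed_desc) != 16);
	/* the padded shadow entry matches it, so vq_descx can be
	 * walked four entries per cache line as well
	 */
	RTE_BUILD_BUG_ON(sizeof(struct vq_desc_extra) != 16);
}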
* Re: [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path Marvin Liu @ 2020-04-27 11:20 ` Maxime Coquelin 2020-04-28 1:14 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: Maxime Coquelin @ 2020-04-27 11:20 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev On 4/26/20 4:19 AM, Marvin Liu wrote: > Optimize packed ring Rx path with SIMD instructions. Solution of > optimization is pretty like vhost, is that split path into batch and > single functions. Batch function is further optimized by AVX512 > instructions. Also pad desc extra structure to 16 bytes aligned, thus > four elements will be saved in one batch. > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile > index c9edb84ee..102b1deab 100644 > --- a/drivers/net/virtio/Makefile > +++ b/drivers/net/virtio/Makefile > @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c > endif > > +ifneq ($(FORCE_DISABLE_AVX512), y) > + CC_AVX512_SUPPORT=\ > + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ > + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ > + grep -q AVX512 && echo 1) > +endif > + > +ifeq ($(CC_AVX512_SUPPORT), 1) > +CFLAGS += -DCC_AVX512_SUPPORT > +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c > + > +ifeq ($(RTE_TOOLCHAIN), gcc) > +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) > +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA > +endif > +endif > + > +ifeq ($(RTE_TOOLCHAIN), clang) > +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1) > +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA > +endif > +endif > + > +ifeq ($(RTE_TOOLCHAIN), icc) > +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) > +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA > +endif > +endif > + > +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl > +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) > +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds > +endif > +endif > + > ifeq ($(CONFIG_RTE_VIRTIO_USER),y) > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c > diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build > index 15150eea1..8e68c3039 100644 > --- a/drivers/net/virtio/meson.build > +++ b/drivers/net/virtio/meson.build > @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c', > deps += ['kvargs', 'bus_pci'] > > if arch_subdir == 'x86' > + if '-mno-avx512f' not in machine_args > + if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw') > + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] > + cflags += ['-DCC_AVX512_SUPPORT'] > + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) > + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' > + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) > + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' > + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) > + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' > + endif > + sources += files('virtio_rxtx_packed_avx.c') > + endif > + endif > sources += files('virtio_rxtx_simple_sse.c') > elif arch_subdir == 'ppc' > sources += files('virtio_rxtx_simple_altivec.c') > diff --git 
a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h > index febaf17a8..5c112cac7 100644 > --- a/drivers/net/virtio/virtio_ethdev.h > +++ b/drivers/net/virtio/virtio_ethdev.h > @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts, > uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, > uint16_t nb_pkts); > > +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, > + uint16_t nb_pkts); > + > int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); > > void virtio_interrupt_handler(void *param); > diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c > index a549991aa..534562cca 100644 > --- a/drivers/net/virtio/virtio_rxtx.c > +++ b/drivers/net/virtio/virtio_rxtx.c > @@ -2030,3 +2030,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, > > return nb_tx; > } > + > +__rte_weak uint16_t > +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, > + struct rte_mbuf **rx_pkts __rte_unused, > + uint16_t nb_pkts __rte_unused) > +{ > + return 0; > +} > diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c > new file mode 100644 > index 000000000..8a7b459eb > --- /dev/null > +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c > @@ -0,0 +1,374 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(c) 2010-2020 Intel Corporation > + */ > + > +#include <stdint.h> > +#include <stdio.h> > +#include <stdlib.h> > +#include <string.h> > +#include <errno.h> > + > +#include <rte_net.h> > + > +#include "virtio_logs.h" > +#include "virtio_ethdev.h" > +#include "virtio_pci.h" > +#include "virtqueue.h" > + > +#define BYTE_SIZE 8 > +/* flag bits offset in packed ring desc higher 64bits */ > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \ > + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) > + > +#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \ > + FLAGS_BITS_OFFSET) > + > +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ > + sizeof(struct vring_packed_desc)) > +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) > + > +#ifdef VIRTIO_GCC_UNROLL_PRAGMA > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \ > + for (iter = val; iter < size; iter++) > +#endif > + > +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ > + for (iter = val; iter < size; iter++) > +#endif > + > +#ifdef VIRTIO_ICC_UNROLL_PRAGMA > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ > + for (iter = val; iter < size; iter++) > +#endif > + > +#ifndef virtio_for_each_try_unroll > +#define virtio_for_each_try_unroll(iter, val, num) \ > + for (iter = val; iter < num; iter++) > +#endif > + > +static inline void > +virtio_update_batch_stats(struct virtnet_stats *stats, > + uint16_t pkt_len1, > + uint16_t pkt_len2, > + uint16_t pkt_len3, > + uint16_t pkt_len4) > +{ > + stats->bytes += pkt_len1; > + stats->bytes += pkt_len2; > + stats->bytes += pkt_len3; > + stats->bytes += pkt_len4; > +} > + > +/* Optionally fill offload information in structure */ > +static inline int > +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) > +{ > + struct rte_net_hdr_lens hdr_lens; > + uint32_t hdrlen, ptype; > + int l4_supported = 0; > + > + /* nothing to do */ > + if (hdr->flags == 0) > + return 0; > + > + /* GSO not support in vec path, skip check */ > + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; > + > + 
ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); > + m->packet_type = ptype; > + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || > + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || > + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) > + l4_supported = 1; > + > + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { > + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; > + if (hdr->csum_start <= hdrlen && l4_supported) { > + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; > + } else { > + /* Unknown proto or tunnel, do sw cksum. We can assume > + * the cksum field is in the first segment since the > + * buffers we provided to the host are large enough. > + * In case of SCTP, this will be wrong since it's a CRC > + * but there's nothing we can do. > + */ > + uint16_t csum = 0, off; > + > + rte_raw_cksum_mbuf(m, hdr->csum_start, > + rte_pktmbuf_pkt_len(m) - hdr->csum_start, > + &csum); > + if (likely(csum != 0xffff)) > + csum = ~csum; > + off = hdr->csum_offset + hdr->csum_start; > + if (rte_pktmbuf_data_len(m) >= off + 1) > + *rte_pktmbuf_mtod_offset(m, uint16_t *, > + off) = csum; > + } > + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) { > + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; > + } > + > + return 0; > +} > + > +static inline uint16_t > +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, > + struct rte_mbuf **rx_pkts) > +{ > + struct virtqueue *vq = rxvq->vq; > + struct virtio_hw *hw = vq->hw; > + uint16_t hdr_size = hw->vtnet_hdr_size; > + uint64_t addrs[PACKED_BATCH_SIZE]; > + uint16_t id = vq->vq_used_cons_idx; > + uint8_t desc_stats; > + uint16_t i; > + void *desc_addr; > + > + if (id & PACKED_BATCH_MASK) > + return -1; > + > + if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries)) > + return -1; > + > + /* only care avail/used bits */ > + __m512i v_mask = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); > + desc_addr = &vq->vq_packed.ring.desc[id]; > + > + __m512i v_desc = _mm512_loadu_si512(desc_addr); > + __m512i v_flag = _mm512_and_epi64(v_desc, v_mask); > + > + __m512i v_used_flag = _mm512_setzero_si512(); > + if (vq->vq_packed.used_wrap_counter) > + v_used_flag = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); > + > + /* Check all descs are used */ > + desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag); > + if (desc_stats) > + return -1; > + > + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { > + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; > + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); > + > + addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1; > + } > + > + /* > + * load len from desc, store into mbuf pkt_len and data_len > + * len limiated by l6bit buf_len, pkt_len[16:31] can be ignored > + */ > + const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12; > + __m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, 0xAA); > + > + /* reduce hdr_len from pkt_len and data_len */ > + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask, > + (uint32_t)-hdr_size); > + > + __m512i v_value = _mm512_add_epi32(values, mbuf_len_offset); > + > + /* assert offset of data_len */ > + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) != > + offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8); > + > + __m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3], > + addrs[2] + 8, addrs[2], > + addrs[1] + 8, addrs[1], > + addrs[0] + 8, addrs[0]); > + /* batch store into mbufs */ > + _mm512_i64scatter_epi64(0, v_index, v_value, 1); > + > + if (hw->has_rx_offload) { > + 
virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { > + char *addr = (char *)rx_pkts[i]->buf_addr + > + RTE_PKTMBUF_HEADROOM - hdr_size; > + virtio_vec_rx_offload(rx_pkts[i], > + (struct virtio_net_hdr *)addr); > + } > + } > + > + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, > + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, > + rx_pkts[3]->pkt_len); > + > + vq->vq_free_cnt += PACKED_BATCH_SIZE; > + > + vq->vq_used_cons_idx += PACKED_BATCH_SIZE; > + if (vq->vq_used_cons_idx >= vq->vq_nentries) { > + vq->vq_used_cons_idx -= vq->vq_nentries; > + vq->vq_packed.used_wrap_counter ^= 1; > + } > + > + return 0; > +} > + > +static uint16_t > +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, > + struct rte_mbuf **rx_pkts) > +{ > + uint16_t used_idx, id; > + uint32_t len; > + struct virtqueue *vq = rxvq->vq; > + struct virtio_hw *hw = vq->hw; > + uint32_t hdr_size = hw->vtnet_hdr_size; > + struct virtio_net_hdr *hdr; > + struct vring_packed_desc *desc; > + struct rte_mbuf *cookie; > + > + desc = vq->vq_packed.ring.desc; > + used_idx = vq->vq_used_cons_idx; > + if (!desc_is_used(&desc[used_idx], vq)) > + return -1; > + > + len = desc[used_idx].len; > + id = desc[used_idx].id; > + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; > + if (unlikely(cookie == NULL)) { > + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u", > + vq->vq_used_cons_idx); > + return -1; > + } > + rte_prefetch0(cookie); > + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); > + > + cookie->data_off = RTE_PKTMBUF_HEADROOM; > + cookie->ol_flags = 0; > + cookie->pkt_len = (uint32_t)(len - hdr_size); > + cookie->data_len = (uint32_t)(len - hdr_size); > + > + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + > + RTE_PKTMBUF_HEADROOM - hdr_size); > + if (hw->has_rx_offload) > + virtio_vec_rx_offload(cookie, hdr); > + > + *rx_pkts = cookie; > + > + rxvq->stats.bytes += cookie->pkt_len; > + > + vq->vq_free_cnt++; > + vq->vq_used_cons_idx++; > + if (vq->vq_used_cons_idx >= vq->vq_nentries) { > + vq->vq_used_cons_idx -= vq->vq_nentries; > + vq->vq_packed.used_wrap_counter ^= 1; > + } > + > + return 0; > +} > + > +static inline void > +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, > + struct rte_mbuf **cookie, > + uint16_t num) > +{ > + struct virtqueue *vq = rxvq->vq; > + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; > + uint16_t flags = vq->vq_packed.cached_flags; > + struct virtio_hw *hw = vq->hw; > + struct vq_desc_extra *dxp; > + uint16_t idx, i; > + uint16_t batch_num, total_num = 0; > + uint16_t head_idx = vq->vq_avail_idx; > + uint16_t head_flag = vq->vq_packed.cached_flags; > + uint64_t addr; > + > + do { > + idx = vq->vq_avail_idx; > + > + batch_num = PACKED_BATCH_SIZE; > + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) > + batch_num = vq->vq_nentries - idx; > + if (unlikely((total_num + batch_num) > num)) > + batch_num = num - total_num; > + > + virtio_for_each_try_unroll(i, 0, batch_num) { > + dxp = &vq->vq_descx[idx + i]; > + dxp->cookie = (void *)cookie[total_num + i]; > + > + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) + > + RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size; > + start_dp[idx + i].addr = addr; > + start_dp[idx + i].len = cookie[total_num + i]->buf_len > + - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size; > + if (total_num || i) { > + virtqueue_store_flags_packed(&start_dp[idx + i], > + flags, hw->weak_barriers); > + } > + } > + > + vq->vq_avail_idx += batch_num; > + if (vq->vq_avail_idx >= vq->vq_nentries) { > + 
vq->vq_avail_idx -= vq->vq_nentries; > + vq->vq_packed.cached_flags ^= > + VRING_PACKED_DESC_F_AVAIL_USED; > + flags = vq->vq_packed.cached_flags; > + } > + total_num += batch_num; > + } while (total_num < num); > + > + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, > + hw->weak_barriers); > + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); > +} > + > +uint16_t > +virtio_recv_pkts_packed_vec(void *rx_queue, > + struct rte_mbuf **rx_pkts, > + uint16_t nb_pkts) > +{ > + struct virtnet_rx *rxvq = rx_queue; > + struct virtqueue *vq = rxvq->vq; > + struct virtio_hw *hw = vq->hw; > + uint16_t num, nb_rx = 0; > + uint32_t nb_enqueued = 0; > + uint16_t free_cnt = vq->vq_free_thresh; > + > + if (unlikely(hw->started == 0)) > + return nb_rx; > + > + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); > + if (likely(num > PACKED_BATCH_SIZE)) > + num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE); > + > + while (num) { > + if (!virtqueue_dequeue_batch_packed_vec(rxvq, > + &rx_pkts[nb_rx])) { > + nb_rx += PACKED_BATCH_SIZE; > + num -= PACKED_BATCH_SIZE; > + continue; > + } > + if (!virtqueue_dequeue_single_packed_vec(rxvq, > + &rx_pkts[nb_rx])) { > + nb_rx++; > + num--; > + continue; > + } > + break; > + }; > + > + PMD_RX_LOG(DEBUG, "dequeue:%d", num); > + > + rxvq->stats.packets += nb_rx; > + > + if (likely(vq->vq_free_cnt >= free_cnt)) { > + struct rte_mbuf *new_pkts[free_cnt]; > + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, > + free_cnt) == 0)) { > + virtio_recv_refill_packed_vec(rxvq, new_pkts, > + free_cnt); > + nb_enqueued += free_cnt; > + } else { > + struct rte_eth_dev *dev = > + &rte_eth_devices[rxvq->port_id]; > + dev->data->rx_mbuf_alloc_failed += free_cnt; > + } > + } > + > + if (likely(nb_enqueued)) { > + if (unlikely(virtqueue_kick_prepare_packed(vq))) { > + virtqueue_notify(vq); > + PMD_RX_LOG(DEBUG, "Notified"); > + } > + } > + > + return nb_rx; > +} > diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c > index 40ad786cc..c54698ad1 100644 > --- a/drivers/net/virtio/virtio_user_ethdev.c > +++ b/drivers/net/virtio/virtio_user_ethdev.c > @@ -528,6 +528,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev) > hw->use_msix = 1; > hw->modern = 0; > hw->use_vec_rx = 0; > + hw->use_vec_tx = 0; > hw->use_inorder_rx = 0; > hw->use_inorder_tx = 0; > hw->virtio_user_dev = dev; > @@ -739,8 +740,19 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) > goto end; > } > > - if (vectorized) > - hw->use_vec_rx = 1; > + if (vectorized) { > + if (packed_vq) { > +#if defined(CC_AVX512_SUPPORT) > + hw->use_vec_rx = 1; > + hw->use_vec_tx = 1; > +#else > + PMD_INIT_LOG(INFO, > + "building environment do not support packed ring vectorized"); > +#endif > + } else { > + hw->use_vec_rx = 1; > + } > + } > > rte_eth_dev_probing_finish(eth_dev); > ret = 0; > diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h > index ca1c10499..ce0340743 100644 > --- a/drivers/net/virtio/virtqueue.h > +++ b/drivers/net/virtio/virtqueue.h > @@ -239,7 +239,8 @@ struct vq_desc_extra { > void *cookie; > uint16_t ndescs; > uint16_t next; > -}; > + uint8_t padding[4]; > +} __rte_packed __rte_aligned(16); Can't this introduce a performance impact for the non-vectorized case? I think of worse cache liens utilization. For example with a burst of 32 descriptors with 32B cachelines, before it would take 14 cachelines, after 16. So for each burst, one could face 2 extra cache misses. 
If you could run non-vectorized benchmarks with and without that patch, I would be grateful. Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Thanks, Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
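Before running those benchmarks, a throwaway program can show what the padding actually changes on a given 64-bit target, since the compiler may already round the unpadded struct up for pointer alignment. Everything below is a standalone illustration, not part of the thread; the pre- and post-patch layouts are re-declared locally for comparison.

#include <stdio.h>
#include <stdint.h>
#include <rte_config.h>
#include <rte_common.h>

/* pre-patch layout, copied locally */
struct vq_desc_extra_old {
	void *cookie;
	uint16_t ndescs;
	uint16_t next;
};

/* post-patch layout, copied locally */
struct vq_desc_extra_new {
	void *cookie;
	uint16_t ndescs;
	uint16_t next;
	uint8_t padding[4];
} __rte_packed __rte_aligned(16);

int
main(void)
{
	unsigned int burst = 32;

	printf("old: %zu B/entry, %zu B per %u-burst\n",
	       sizeof(struct vq_desc_extra_old),
	       burst * sizeof(struct vq_desc_extra_old), burst);
	printf("new: %zu B/entry, %zu B per %u-burst (cache line %d B)\n",
	       sizeof(struct vq_desc_extra_new),
	       burst * sizeof(struct vq_desc_extra_new), burst,
	       RTE_CACHE_LINE_SIZE);
	return 0;
}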
* Re: [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-27 11:20 ` Maxime Coquelin @ 2020-04-28 1:14 ` Liu, Yong 2020-04-28 8:44 ` Maxime Coquelin 0 siblings, 1 reply; 162+ messages in thread From: Liu, Yong @ 2020-04-28 1:14 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Monday, April 27, 2020 7:21 PM > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > Wang, Zhihong <zhihong.wang@intel.com> > Cc: dev@dpdk.org > Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path > > > > On 4/26/20 4:19 AM, Marvin Liu wrote: > > Optimize packed ring Rx path with SIMD instructions. Solution of > > optimization is pretty like vhost, is that split path into batch and > > single functions. Batch function is further optimized by AVX512 > > instructions. Also pad desc extra structure to 16 bytes aligned, thus > > four elements will be saved in one batch. > > > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > > > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile > > index c9edb84ee..102b1deab 100644 > > --- a/drivers/net/virtio/Makefile > > +++ b/drivers/net/virtio/Makefile > > @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) > $(CONFIG_RTE_ARCH_ARM64)),) > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c > > endif > > > > +ifneq ($(FORCE_DISABLE_AVX512), y) > > + CC_AVX512_SUPPORT=\ > > + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ > > + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ > > + grep -q AVX512 && echo 1) > > +endif > > + > > +ifeq ($(CC_AVX512_SUPPORT), 1) > > +CFLAGS += -DCC_AVX512_SUPPORT > > +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c > > + > > +ifeq ($(RTE_TOOLCHAIN), gcc) > > +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) > > +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA > > +endif > > +endif > > + > > +ifeq ($(RTE_TOOLCHAIN), clang) > > +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) - > ge 37 && echo 1), 1) > > +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA > > +endif > > +endif > > + > > +ifeq ($(RTE_TOOLCHAIN), icc) > > +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) > > +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA > > +endif > > +endif > > + > > +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl > > +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) > > +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds > > +endif > > +endif > > + > > ifeq ($(CONFIG_RTE_VIRTIO_USER),y) > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c > > diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build > > index 15150eea1..8e68c3039 100644 > > --- a/drivers/net/virtio/meson.build > > +++ b/drivers/net/virtio/meson.build > > @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c', > > deps += ['kvargs', 'bus_pci'] > > > > if arch_subdir == 'x86' > > + if '-mno-avx512f' not in machine_args > > + if cc.has_argument('-mavx512f') and cc.has_argument('- > mavx512vl') and cc.has_argument('-mavx512bw') > > + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] > > + cflags += ['-DCC_AVX512_SUPPORT'] > > + if (toolchain == 'gcc' and > cc.version().version_compare('>=8.3.0')) > > + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' > > + elif (toolchain == 'clang' and 
> cc.version().version_compare('>=3.7.0')) > > + cflags += '- > DVHOST_CLANG_UNROLL_PRAGMA' > > + elif (toolchain == 'icc' and > cc.version().version_compare('>=16.0.0')) > > + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' > > + endif > > + sources += files('virtio_rxtx_packed_avx.c') > > + endif > > + endif > > sources += files('virtio_rxtx_simple_sse.c') > > elif arch_subdir == 'ppc' > > sources += files('virtio_rxtx_simple_altivec.c') > > diff --git a/drivers/net/virtio/virtio_ethdev.h > b/drivers/net/virtio/virtio_ethdev.h > > index febaf17a8..5c112cac7 100644 > > --- a/drivers/net/virtio/virtio_ethdev.h > > +++ b/drivers/net/virtio/virtio_ethdev.h > > @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, > struct rte_mbuf **tx_pkts, > > uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, > > uint16_t nb_pkts); > > > > +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf > **rx_pkts, > > + uint16_t nb_pkts); > > + > > int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); > > > > void virtio_interrupt_handler(void *param); > > diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c > > index a549991aa..534562cca 100644 > > --- a/drivers/net/virtio/virtio_rxtx.c > > +++ b/drivers/net/virtio/virtio_rxtx.c > > @@ -2030,3 +2030,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, > > > > return nb_tx; > > } > > + > > +__rte_weak uint16_t > > +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, > > + struct rte_mbuf **rx_pkts __rte_unused, > > + uint16_t nb_pkts __rte_unused) > > +{ > > + return 0; > > +} > > diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c > b/drivers/net/virtio/virtio_rxtx_packed_avx.c > > new file mode 100644 > > index 000000000..8a7b459eb > > --- /dev/null > > +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c > > @@ -0,0 +1,374 @@ > > +/* SPDX-License-Identifier: BSD-3-Clause > > + * Copyright(c) 2010-2020 Intel Corporation > > + */ > > + > > +#include <stdint.h> > > +#include <stdio.h> > > +#include <stdlib.h> > > +#include <string.h> > > +#include <errno.h> > > + > > +#include <rte_net.h> > > + > > +#include "virtio_logs.h" > > +#include "virtio_ethdev.h" > > +#include "virtio_pci.h" > > +#include "virtqueue.h" > > + > > +#define BYTE_SIZE 8 > > +/* flag bits offset in packed ring desc higher 64bits */ > > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \ > > + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) > > + > > +#define PACKED_FLAGS_MASK ((0ULL | > VRING_PACKED_DESC_F_AVAIL_USED) << \ > > + FLAGS_BITS_OFFSET) > > + > > +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ > > + sizeof(struct vring_packed_desc)) > > +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) > > + > > +#ifdef VIRTIO_GCC_UNROLL_PRAGMA > > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") > \ > > + for (iter = val; iter < size; iter++) > > +#endif > > + > > +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA > > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ > > + for (iter = val; iter < size; iter++) > > +#endif > > + > > +#ifdef VIRTIO_ICC_UNROLL_PRAGMA > > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ > > + for (iter = val; iter < size; iter++) > > +#endif > > + > > +#ifndef virtio_for_each_try_unroll > > +#define virtio_for_each_try_unroll(iter, val, num) \ > > + for (iter = val; iter < num; iter++) > > +#endif > > + > > +static inline void > > +virtio_update_batch_stats(struct virtnet_stats *stats, > > + uint16_t 
pkt_len1, > > + uint16_t pkt_len2, > > + uint16_t pkt_len3, > > + uint16_t pkt_len4) > > +{ > > + stats->bytes += pkt_len1; > > + stats->bytes += pkt_len2; > > + stats->bytes += pkt_len3; > > + stats->bytes += pkt_len4; > > +} > > + > > +/* Optionally fill offload information in structure */ > > +static inline int > > +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) > > +{ > > + struct rte_net_hdr_lens hdr_lens; > > + uint32_t hdrlen, ptype; > > + int l4_supported = 0; > > + > > + /* nothing to do */ > > + if (hdr->flags == 0) > > + return 0; > > + > > + /* GSO not support in vec path, skip check */ > > + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; > > + > > + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); > > + m->packet_type = ptype; > > + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || > > + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || > > + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) > > + l4_supported = 1; > > + > > + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { > > + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; > > + if (hdr->csum_start <= hdrlen && l4_supported) { > > + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; > > + } else { > > + /* Unknown proto or tunnel, do sw cksum. We can > assume > > + * the cksum field is in the first segment since the > > + * buffers we provided to the host are large enough. > > + * In case of SCTP, this will be wrong since it's a CRC > > + * but there's nothing we can do. > > + */ > > + uint16_t csum = 0, off; > > + > > + rte_raw_cksum_mbuf(m, hdr->csum_start, > > + rte_pktmbuf_pkt_len(m) - hdr->csum_start, > > + &csum); > > + if (likely(csum != 0xffff)) > > + csum = ~csum; > > + off = hdr->csum_offset + hdr->csum_start; > > + if (rte_pktmbuf_data_len(m) >= off + 1) > > + *rte_pktmbuf_mtod_offset(m, uint16_t *, > > + off) = csum; > > + } > > + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && > l4_supported) { > > + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; > > + } > > + > > + return 0; > > +} > > + > > +static inline uint16_t > > +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, > > + struct rte_mbuf **rx_pkts) > > +{ > > + struct virtqueue *vq = rxvq->vq; > > + struct virtio_hw *hw = vq->hw; > > + uint16_t hdr_size = hw->vtnet_hdr_size; > > + uint64_t addrs[PACKED_BATCH_SIZE]; > > + uint16_t id = vq->vq_used_cons_idx; > > + uint8_t desc_stats; > > + uint16_t i; > > + void *desc_addr; > > + > > + if (id & PACKED_BATCH_MASK) > > + return -1; > > + > > + if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries)) > > + return -1; > > + > > + /* only care avail/used bits */ > > + __m512i v_mask = _mm512_maskz_set1_epi64(0xaa, > PACKED_FLAGS_MASK); > > + desc_addr = &vq->vq_packed.ring.desc[id]; > > + > > + __m512i v_desc = _mm512_loadu_si512(desc_addr); > > + __m512i v_flag = _mm512_and_epi64(v_desc, v_mask); > > + > > + __m512i v_used_flag = _mm512_setzero_si512(); > > + if (vq->vq_packed.used_wrap_counter) > > + v_used_flag = _mm512_maskz_set1_epi64(0xaa, > PACKED_FLAGS_MASK); > > + > > + /* Check all descs are used */ > > + desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag); > > + if (desc_stats) > > + return -1; > > + > > + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { > > + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; > > + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); > > + > > + addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1; > > + } > > + > > + /* > > + * load len from desc, store into mbuf pkt_len and data_len > > + * len 
limiated by l6bit buf_len, pkt_len[16:31] can be ignored > > + */ > > + const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12; > > + __m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, > 0xAA); > > + > > + /* reduce hdr_len from pkt_len and data_len */ > > + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask, > > + (uint32_t)-hdr_size); > > + > > + __m512i v_value = _mm512_add_epi32(values, mbuf_len_offset); > > + > > + /* assert offset of data_len */ > > + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) != > > + offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8); > > + > > + __m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3], > > + addrs[2] + 8, addrs[2], > > + addrs[1] + 8, addrs[1], > > + addrs[0] + 8, addrs[0]); > > + /* batch store into mbufs */ > > + _mm512_i64scatter_epi64(0, v_index, v_value, 1); > > + > > + if (hw->has_rx_offload) { > > + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { > > + char *addr = (char *)rx_pkts[i]->buf_addr + > > + RTE_PKTMBUF_HEADROOM - hdr_size; > > + virtio_vec_rx_offload(rx_pkts[i], > > + (struct virtio_net_hdr *)addr); > > + } > > + } > > + > > + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, > > + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, > > + rx_pkts[3]->pkt_len); > > + > > + vq->vq_free_cnt += PACKED_BATCH_SIZE; > > + > > + vq->vq_used_cons_idx += PACKED_BATCH_SIZE; > > + if (vq->vq_used_cons_idx >= vq->vq_nentries) { > > + vq->vq_used_cons_idx -= vq->vq_nentries; > > + vq->vq_packed.used_wrap_counter ^= 1; > > + } > > + > > + return 0; > > +} > > + > > +static uint16_t > > +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, > > + struct rte_mbuf **rx_pkts) > > +{ > > + uint16_t used_idx, id; > > + uint32_t len; > > + struct virtqueue *vq = rxvq->vq; > > + struct virtio_hw *hw = vq->hw; > > + uint32_t hdr_size = hw->vtnet_hdr_size; > > + struct virtio_net_hdr *hdr; > > + struct vring_packed_desc *desc; > > + struct rte_mbuf *cookie; > > + > > + desc = vq->vq_packed.ring.desc; > > + used_idx = vq->vq_used_cons_idx; > > + if (!desc_is_used(&desc[used_idx], vq)) > > + return -1; > > + > > + len = desc[used_idx].len; > > + id = desc[used_idx].id; > > + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; > > + if (unlikely(cookie == NULL)) { > > + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie > at %u", > > + vq->vq_used_cons_idx); > > + return -1; > > + } > > + rte_prefetch0(cookie); > > + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); > > + > > + cookie->data_off = RTE_PKTMBUF_HEADROOM; > > + cookie->ol_flags = 0; > > + cookie->pkt_len = (uint32_t)(len - hdr_size); > > + cookie->data_len = (uint32_t)(len - hdr_size); > > + > > + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + > > + RTE_PKTMBUF_HEADROOM - > hdr_size); > > + if (hw->has_rx_offload) > > + virtio_vec_rx_offload(cookie, hdr); > > + > > + *rx_pkts = cookie; > > + > > + rxvq->stats.bytes += cookie->pkt_len; > > + > > + vq->vq_free_cnt++; > > + vq->vq_used_cons_idx++; > > + if (vq->vq_used_cons_idx >= vq->vq_nentries) { > > + vq->vq_used_cons_idx -= vq->vq_nentries; > > + vq->vq_packed.used_wrap_counter ^= 1; > > + } > > + > > + return 0; > > +} > > + > > +static inline void > > +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, > > + struct rte_mbuf **cookie, > > + uint16_t num) > > +{ > > + struct virtqueue *vq = rxvq->vq; > > + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; > > + uint16_t flags = vq->vq_packed.cached_flags; > > + struct virtio_hw *hw = vq->hw; > > + struct 
vq_desc_extra *dxp; > > + uint16_t idx, i; > > + uint16_t batch_num, total_num = 0; > > + uint16_t head_idx = vq->vq_avail_idx; > > + uint16_t head_flag = vq->vq_packed.cached_flags; > > + uint64_t addr; > > + > > + do { > > + idx = vq->vq_avail_idx; > > + > > + batch_num = PACKED_BATCH_SIZE; > > + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) > > + batch_num = vq->vq_nentries - idx; > > + if (unlikely((total_num + batch_num) > num)) > > + batch_num = num - total_num; > > + > > + virtio_for_each_try_unroll(i, 0, batch_num) { > > + dxp = &vq->vq_descx[idx + i]; > > + dxp->cookie = (void *)cookie[total_num + i]; > > + > > + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], > vq) + > > + RTE_PKTMBUF_HEADROOM - hw- > >vtnet_hdr_size; > > + start_dp[idx + i].addr = addr; > > + start_dp[idx + i].len = cookie[total_num + i]->buf_len > > + - RTE_PKTMBUF_HEADROOM + hw- > >vtnet_hdr_size; > > + if (total_num || i) { > > + virtqueue_store_flags_packed(&start_dp[idx > + i], > > + flags, hw->weak_barriers); > > + } > > + } > > + > > + vq->vq_avail_idx += batch_num; > > + if (vq->vq_avail_idx >= vq->vq_nentries) { > > + vq->vq_avail_idx -= vq->vq_nentries; > > + vq->vq_packed.cached_flags ^= > > + VRING_PACKED_DESC_F_AVAIL_USED; > > + flags = vq->vq_packed.cached_flags; > > + } > > + total_num += batch_num; > > + } while (total_num < num); > > + > > + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, > > + hw->weak_barriers); > > + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); > > +} > > + > > +uint16_t > > +virtio_recv_pkts_packed_vec(void *rx_queue, > > + struct rte_mbuf **rx_pkts, > > + uint16_t nb_pkts) > > +{ > > + struct virtnet_rx *rxvq = rx_queue; > > + struct virtqueue *vq = rxvq->vq; > > + struct virtio_hw *hw = vq->hw; > > + uint16_t num, nb_rx = 0; > > + uint32_t nb_enqueued = 0; > > + uint16_t free_cnt = vq->vq_free_thresh; > > + > > + if (unlikely(hw->started == 0)) > > + return nb_rx; > > + > > + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); > > + if (likely(num > PACKED_BATCH_SIZE)) > > + num = num - ((vq->vq_used_cons_idx + num) % > PACKED_BATCH_SIZE); > > + > > + while (num) { > > + if (!virtqueue_dequeue_batch_packed_vec(rxvq, > > + &rx_pkts[nb_rx])) { > > + nb_rx += PACKED_BATCH_SIZE; > > + num -= PACKED_BATCH_SIZE; > > + continue; > > + } > > + if (!virtqueue_dequeue_single_packed_vec(rxvq, > > + &rx_pkts[nb_rx])) { > > + nb_rx++; > > + num--; > > + continue; > > + } > > + break; > > + }; > > + > > + PMD_RX_LOG(DEBUG, "dequeue:%d", num); > > + > > + rxvq->stats.packets += nb_rx; > > + > > + if (likely(vq->vq_free_cnt >= free_cnt)) { > > + struct rte_mbuf *new_pkts[free_cnt]; > > + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, > > + free_cnt) == 0)) { > > + virtio_recv_refill_packed_vec(rxvq, new_pkts, > > + free_cnt); > > + nb_enqueued += free_cnt; > > + } else { > > + struct rte_eth_dev *dev = > > + &rte_eth_devices[rxvq->port_id]; > > + dev->data->rx_mbuf_alloc_failed += free_cnt; > > + } > > + } > > + > > + if (likely(nb_enqueued)) { > > + if (unlikely(virtqueue_kick_prepare_packed(vq))) { > > + virtqueue_notify(vq); > > + PMD_RX_LOG(DEBUG, "Notified"); > > + } > > + } > > + > > + return nb_rx; > > +} > > diff --git a/drivers/net/virtio/virtio_user_ethdev.c > b/drivers/net/virtio/virtio_user_ethdev.c > > index 40ad786cc..c54698ad1 100644 > > --- a/drivers/net/virtio/virtio_user_ethdev.c > > +++ b/drivers/net/virtio/virtio_user_ethdev.c > > @@ -528,6 +528,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device > *vdev) > > hw->use_msix = 1; > > 
hw->modern = 0; > > hw->use_vec_rx = 0; > > + hw->use_vec_tx = 0; > > hw->use_inorder_rx = 0; > > hw->use_inorder_tx = 0; > > hw->virtio_user_dev = dev; > > @@ -739,8 +740,19 @@ virtio_user_pmd_probe(struct rte_vdev_device > *dev) > > goto end; > > } > > > > - if (vectorized) > > - hw->use_vec_rx = 1; > > + if (vectorized) { > > + if (packed_vq) { > > +#if defined(CC_AVX512_SUPPORT) > > + hw->use_vec_rx = 1; > > + hw->use_vec_tx = 1; > > +#else > > + PMD_INIT_LOG(INFO, > > + "building environment do not support packed > ring vectorized"); > > +#endif > > + } else { > > + hw->use_vec_rx = 1; > > + } > > + } > > > > rte_eth_dev_probing_finish(eth_dev); > > ret = 0; > > diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h > > index ca1c10499..ce0340743 100644 > > --- a/drivers/net/virtio/virtqueue.h > > +++ b/drivers/net/virtio/virtqueue.h > > @@ -239,7 +239,8 @@ struct vq_desc_extra { > > void *cookie; > > uint16_t ndescs; > > uint16_t next; > > -}; > > + uint8_t padding[4]; > > +} __rte_packed __rte_aligned(16); > > Can't this introduce a performance impact for the non-vectorized > case? I think of worse cache liens utilization. > > For example with a burst of 32 descriptors with 32B cachelines, before > it would take 14 cachelines, after 16. So for each burst, one could face > 2 extra cache misses. > > If you could run non-vectorized benchamrks with and without that patch, > I would be grateful. > Maxime, Thanks for point it out, it will add extra cache miss in datapath. And its impact on performance is around 1% in loopback case. While benefit of vectorized path will be more than that number. Thanks, Marvin > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> > > Thanks, > Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-28 1:14 ` Liu, Yong @ 2020-04-28 8:44 ` Maxime Coquelin 2020-04-28 13:01 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: Maxime Coquelin @ 2020-04-28 8:44 UTC (permalink / raw) To: Liu, Yong, Ye, Xiaolong, Wang, Zhihong; +Cc: dev On 4/28/20 3:14 AM, Liu, Yong wrote: > > >> -----Original Message----- >> From: Maxime Coquelin <maxime.coquelin@redhat.com> >> Sent: Monday, April 27, 2020 7:21 PM >> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; >> Wang, Zhihong <zhihong.wang@intel.com> >> Cc: dev@dpdk.org >> Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path >> >> >> >> On 4/26/20 4:19 AM, Marvin Liu wrote: >>> Optimize packed ring Rx path with SIMD instructions. Solution of >>> optimization is pretty like vhost, is that split path into batch and >>> single functions. Batch function is further optimized by AVX512 >>> instructions. Also pad desc extra structure to 16 bytes aligned, thus >>> four elements will be saved in one batch. >>> >>> Signed-off-by: Marvin Liu <yong.liu@intel.com> >>> >>> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile >>> index c9edb84ee..102b1deab 100644 >>> --- a/drivers/net/virtio/Makefile >>> +++ b/drivers/net/virtio/Makefile >>> @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) >> $(CONFIG_RTE_ARCH_ARM64)),) >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c >>> endif >>> >>> +ifneq ($(FORCE_DISABLE_AVX512), y) >>> + CC_AVX512_SUPPORT=\ >>> + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ >>> + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ >>> + grep -q AVX512 && echo 1) >>> +endif >>> + >>> +ifeq ($(CC_AVX512_SUPPORT), 1) >>> +CFLAGS += -DCC_AVX512_SUPPORT >>> +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c >>> + >>> +ifeq ($(RTE_TOOLCHAIN), gcc) >>> +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) >>> +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA >>> +endif >>> +endif >>> + >>> +ifeq ($(RTE_TOOLCHAIN), clang) >>> +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) - >> ge 37 && echo 1), 1) >>> +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA >>> +endif >>> +endif >>> + >>> +ifeq ($(RTE_TOOLCHAIN), icc) >>> +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) >>> +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA >>> +endif >>> +endif >>> + >>> +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl >>> +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) >>> +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds >>> +endif >>> +endif >>> + >>> ifeq ($(CONFIG_RTE_VIRTIO_USER),y) >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c >>> diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build >>> index 15150eea1..8e68c3039 100644 >>> --- a/drivers/net/virtio/meson.build >>> +++ b/drivers/net/virtio/meson.build >>> @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c', >>> deps += ['kvargs', 'bus_pci'] >>> >>> if arch_subdir == 'x86' >>> + if '-mno-avx512f' not in machine_args >>> + if cc.has_argument('-mavx512f') and cc.has_argument('- >> mavx512vl') and cc.has_argument('-mavx512bw') >>> + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] >>> + cflags += ['-DCC_AVX512_SUPPORT'] >>> + if (toolchain == 'gcc' and >> cc.version().version_compare('>=8.3.0')) >>> + cflags += 
'-DVHOST_GCC_UNROLL_PRAGMA' >>> + elif (toolchain == 'clang' and >> cc.version().version_compare('>=3.7.0')) >>> + cflags += '- >> DVHOST_CLANG_UNROLL_PRAGMA' >>> + elif (toolchain == 'icc' and >> cc.version().version_compare('>=16.0.0')) >>> + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' >>> + endif >>> + sources += files('virtio_rxtx_packed_avx.c') >>> + endif >>> + endif >>> sources += files('virtio_rxtx_simple_sse.c') >>> elif arch_subdir == 'ppc' >>> sources += files('virtio_rxtx_simple_altivec.c') >>> diff --git a/drivers/net/virtio/virtio_ethdev.h >> b/drivers/net/virtio/virtio_ethdev.h >>> index febaf17a8..5c112cac7 100644 >>> --- a/drivers/net/virtio/virtio_ethdev.h >>> +++ b/drivers/net/virtio/virtio_ethdev.h >>> @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, >> struct rte_mbuf **tx_pkts, >>> uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, >>> uint16_t nb_pkts); >>> >>> +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf >> **rx_pkts, >>> + uint16_t nb_pkts); >>> + >>> int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); >>> >>> void virtio_interrupt_handler(void *param); >>> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c >>> index a549991aa..534562cca 100644 >>> --- a/drivers/net/virtio/virtio_rxtx.c >>> +++ b/drivers/net/virtio/virtio_rxtx.c >>> @@ -2030,3 +2030,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, >>> >>> return nb_tx; >>> } >>> + >>> +__rte_weak uint16_t >>> +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, >>> + struct rte_mbuf **rx_pkts __rte_unused, >>> + uint16_t nb_pkts __rte_unused) >>> +{ >>> + return 0; >>> +} >>> diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c >> b/drivers/net/virtio/virtio_rxtx_packed_avx.c >>> new file mode 100644 >>> index 000000000..8a7b459eb >>> --- /dev/null >>> +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c >>> @@ -0,0 +1,374 @@ >>> +/* SPDX-License-Identifier: BSD-3-Clause >>> + * Copyright(c) 2010-2020 Intel Corporation >>> + */ >>> + >>> +#include <stdint.h> >>> +#include <stdio.h> >>> +#include <stdlib.h> >>> +#include <string.h> >>> +#include <errno.h> >>> + >>> +#include <rte_net.h> >>> + >>> +#include "virtio_logs.h" >>> +#include "virtio_ethdev.h" >>> +#include "virtio_pci.h" >>> +#include "virtqueue.h" >>> + >>> +#define BYTE_SIZE 8 >>> +/* flag bits offset in packed ring desc higher 64bits */ >>> +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \ >>> + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) >>> + >>> +#define PACKED_FLAGS_MASK ((0ULL | >> VRING_PACKED_DESC_F_AVAIL_USED) << \ >>> + FLAGS_BITS_OFFSET) >>> + >>> +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ >>> + sizeof(struct vring_packed_desc)) >>> +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) >>> + >>> +#ifdef VIRTIO_GCC_UNROLL_PRAGMA >>> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") >> \ >>> + for (iter = val; iter < size; iter++) >>> +#endif >>> + >>> +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA >>> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ >>> + for (iter = val; iter < size; iter++) >>> +#endif >>> + >>> +#ifdef VIRTIO_ICC_UNROLL_PRAGMA >>> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ >>> + for (iter = val; iter < size; iter++) >>> +#endif >>> + >>> +#ifndef virtio_for_each_try_unroll >>> +#define virtio_for_each_try_unroll(iter, val, num) \ >>> + for (iter = val; iter < num; iter++) >>> +#endif >>> + >>> +static inline void 
>>> +virtio_update_batch_stats(struct virtnet_stats *stats, >>> + uint16_t pkt_len1, >>> + uint16_t pkt_len2, >>> + uint16_t pkt_len3, >>> + uint16_t pkt_len4) >>> +{ >>> + stats->bytes += pkt_len1; >>> + stats->bytes += pkt_len2; >>> + stats->bytes += pkt_len3; >>> + stats->bytes += pkt_len4; >>> +} >>> + >>> +/* Optionally fill offload information in structure */ >>> +static inline int >>> +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) >>> +{ >>> + struct rte_net_hdr_lens hdr_lens; >>> + uint32_t hdrlen, ptype; >>> + int l4_supported = 0; >>> + >>> + /* nothing to do */ >>> + if (hdr->flags == 0) >>> + return 0; >>> + >>> + /* GSO not support in vec path, skip check */ >>> + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; >>> + >>> + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); >>> + m->packet_type = ptype; >>> + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || >>> + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || >>> + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) >>> + l4_supported = 1; >>> + >>> + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { >>> + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; >>> + if (hdr->csum_start <= hdrlen && l4_supported) { >>> + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; >>> + } else { >>> + /* Unknown proto or tunnel, do sw cksum. We can >> assume >>> + * the cksum field is in the first segment since the >>> + * buffers we provided to the host are large enough. >>> + * In case of SCTP, this will be wrong since it's a CRC >>> + * but there's nothing we can do. >>> + */ >>> + uint16_t csum = 0, off; >>> + >>> + rte_raw_cksum_mbuf(m, hdr->csum_start, >>> + rte_pktmbuf_pkt_len(m) - hdr->csum_start, >>> + &csum); >>> + if (likely(csum != 0xffff)) >>> + csum = ~csum; >>> + off = hdr->csum_offset + hdr->csum_start; >>> + if (rte_pktmbuf_data_len(m) >= off + 1) >>> + *rte_pktmbuf_mtod_offset(m, uint16_t *, >>> + off) = csum; >>> + } >>> + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && >> l4_supported) { >>> + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; >>> + } >>> + >>> + return 0; >>> +} >>> + >>> +static inline uint16_t >>> +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, >>> + struct rte_mbuf **rx_pkts) >>> +{ >>> + struct virtqueue *vq = rxvq->vq; >>> + struct virtio_hw *hw = vq->hw; >>> + uint16_t hdr_size = hw->vtnet_hdr_size; >>> + uint64_t addrs[PACKED_BATCH_SIZE]; >>> + uint16_t id = vq->vq_used_cons_idx; >>> + uint8_t desc_stats; >>> + uint16_t i; >>> + void *desc_addr; >>> + >>> + if (id & PACKED_BATCH_MASK) >>> + return -1; >>> + >>> + if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries)) >>> + return -1; >>> + >>> + /* only care avail/used bits */ >>> + __m512i v_mask = _mm512_maskz_set1_epi64(0xaa, >> PACKED_FLAGS_MASK); >>> + desc_addr = &vq->vq_packed.ring.desc[id]; >>> + >>> + __m512i v_desc = _mm512_loadu_si512(desc_addr); >>> + __m512i v_flag = _mm512_and_epi64(v_desc, v_mask); >>> + >>> + __m512i v_used_flag = _mm512_setzero_si512(); >>> + if (vq->vq_packed.used_wrap_counter) >>> + v_used_flag = _mm512_maskz_set1_epi64(0xaa, >> PACKED_FLAGS_MASK); >>> + >>> + /* Check all descs are used */ >>> + desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag); >>> + if (desc_stats) >>> + return -1; >>> + >>> + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { >>> + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; >>> + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); >>> + >>> + addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1; >>> + } >>> + >>> + 
/* >>> + * load len from desc, store into mbuf pkt_len and data_len >>> + * len limiated by l6bit buf_len, pkt_len[16:31] can be ignored >>> + */ >>> + const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12; >>> + __m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, >> 0xAA); >>> + >>> + /* reduce hdr_len from pkt_len and data_len */ >>> + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask, >>> + (uint32_t)-hdr_size); >>> + >>> + __m512i v_value = _mm512_add_epi32(values, mbuf_len_offset); >>> + >>> + /* assert offset of data_len */ >>> + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) != >>> + offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8); >>> + >>> + __m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3], >>> + addrs[2] + 8, addrs[2], >>> + addrs[1] + 8, addrs[1], >>> + addrs[0] + 8, addrs[0]); >>> + /* batch store into mbufs */ >>> + _mm512_i64scatter_epi64(0, v_index, v_value, 1); >>> + >>> + if (hw->has_rx_offload) { >>> + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { >>> + char *addr = (char *)rx_pkts[i]->buf_addr + >>> + RTE_PKTMBUF_HEADROOM - hdr_size; >>> + virtio_vec_rx_offload(rx_pkts[i], >>> + (struct virtio_net_hdr *)addr); >>> + } >>> + } >>> + >>> + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, >>> + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, >>> + rx_pkts[3]->pkt_len); >>> + >>> + vq->vq_free_cnt += PACKED_BATCH_SIZE; >>> + >>> + vq->vq_used_cons_idx += PACKED_BATCH_SIZE; >>> + if (vq->vq_used_cons_idx >= vq->vq_nentries) { >>> + vq->vq_used_cons_idx -= vq->vq_nentries; >>> + vq->vq_packed.used_wrap_counter ^= 1; >>> + } >>> + >>> + return 0; >>> +} >>> + >>> +static uint16_t >>> +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, >>> + struct rte_mbuf **rx_pkts) >>> +{ >>> + uint16_t used_idx, id; >>> + uint32_t len; >>> + struct virtqueue *vq = rxvq->vq; >>> + struct virtio_hw *hw = vq->hw; >>> + uint32_t hdr_size = hw->vtnet_hdr_size; >>> + struct virtio_net_hdr *hdr; >>> + struct vring_packed_desc *desc; >>> + struct rte_mbuf *cookie; >>> + >>> + desc = vq->vq_packed.ring.desc; >>> + used_idx = vq->vq_used_cons_idx; >>> + if (!desc_is_used(&desc[used_idx], vq)) >>> + return -1; >>> + >>> + len = desc[used_idx].len; >>> + id = desc[used_idx].id; >>> + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; >>> + if (unlikely(cookie == NULL)) { >>> + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie >> at %u", >>> + vq->vq_used_cons_idx); >>> + return -1; >>> + } >>> + rte_prefetch0(cookie); >>> + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); >>> + >>> + cookie->data_off = RTE_PKTMBUF_HEADROOM; >>> + cookie->ol_flags = 0; >>> + cookie->pkt_len = (uint32_t)(len - hdr_size); >>> + cookie->data_len = (uint32_t)(len - hdr_size); >>> + >>> + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + >>> + RTE_PKTMBUF_HEADROOM - >> hdr_size); >>> + if (hw->has_rx_offload) >>> + virtio_vec_rx_offload(cookie, hdr); >>> + >>> + *rx_pkts = cookie; >>> + >>> + rxvq->stats.bytes += cookie->pkt_len; >>> + >>> + vq->vq_free_cnt++; >>> + vq->vq_used_cons_idx++; >>> + if (vq->vq_used_cons_idx >= vq->vq_nentries) { >>> + vq->vq_used_cons_idx -= vq->vq_nentries; >>> + vq->vq_packed.used_wrap_counter ^= 1; >>> + } >>> + >>> + return 0; >>> +} >>> + >>> +static inline void >>> +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, >>> + struct rte_mbuf **cookie, >>> + uint16_t num) >>> +{ >>> + struct virtqueue *vq = rxvq->vq; >>> + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; >>> + uint16_t flags = 
vq->vq_packed.cached_flags; >>> + struct virtio_hw *hw = vq->hw; >>> + struct vq_desc_extra *dxp; >>> + uint16_t idx, i; >>> + uint16_t batch_num, total_num = 0; >>> + uint16_t head_idx = vq->vq_avail_idx; >>> + uint16_t head_flag = vq->vq_packed.cached_flags; >>> + uint64_t addr; >>> + >>> + do { >>> + idx = vq->vq_avail_idx; >>> + >>> + batch_num = PACKED_BATCH_SIZE; >>> + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) >>> + batch_num = vq->vq_nentries - idx; >>> + if (unlikely((total_num + batch_num) > num)) >>> + batch_num = num - total_num; >>> + >>> + virtio_for_each_try_unroll(i, 0, batch_num) { >>> + dxp = &vq->vq_descx[idx + i]; >>> + dxp->cookie = (void *)cookie[total_num + i]; >>> + >>> + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], >> vq) + >>> + RTE_PKTMBUF_HEADROOM - hw- >>> vtnet_hdr_size; >>> + start_dp[idx + i].addr = addr; >>> + start_dp[idx + i].len = cookie[total_num + i]->buf_len >>> + - RTE_PKTMBUF_HEADROOM + hw- >>> vtnet_hdr_size; >>> + if (total_num || i) { >>> + virtqueue_store_flags_packed(&start_dp[idx >> + i], >>> + flags, hw->weak_barriers); >>> + } >>> + } >>> + >>> + vq->vq_avail_idx += batch_num; >>> + if (vq->vq_avail_idx >= vq->vq_nentries) { >>> + vq->vq_avail_idx -= vq->vq_nentries; >>> + vq->vq_packed.cached_flags ^= >>> + VRING_PACKED_DESC_F_AVAIL_USED; >>> + flags = vq->vq_packed.cached_flags; >>> + } >>> + total_num += batch_num; >>> + } while (total_num < num); >>> + >>> + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, >>> + hw->weak_barriers); >>> + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); >>> +} >>> + >>> +uint16_t >>> +virtio_recv_pkts_packed_vec(void *rx_queue, >>> + struct rte_mbuf **rx_pkts, >>> + uint16_t nb_pkts) >>> +{ >>> + struct virtnet_rx *rxvq = rx_queue; >>> + struct virtqueue *vq = rxvq->vq; >>> + struct virtio_hw *hw = vq->hw; >>> + uint16_t num, nb_rx = 0; >>> + uint32_t nb_enqueued = 0; >>> + uint16_t free_cnt = vq->vq_free_thresh; >>> + >>> + if (unlikely(hw->started == 0)) >>> + return nb_rx; >>> + >>> + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); >>> + if (likely(num > PACKED_BATCH_SIZE)) >>> + num = num - ((vq->vq_used_cons_idx + num) % >> PACKED_BATCH_SIZE); >>> + >>> + while (num) { >>> + if (!virtqueue_dequeue_batch_packed_vec(rxvq, >>> + &rx_pkts[nb_rx])) { >>> + nb_rx += PACKED_BATCH_SIZE; >>> + num -= PACKED_BATCH_SIZE; >>> + continue; >>> + } >>> + if (!virtqueue_dequeue_single_packed_vec(rxvq, >>> + &rx_pkts[nb_rx])) { >>> + nb_rx++; >>> + num--; >>> + continue; >>> + } >>> + break; >>> + }; >>> + >>> + PMD_RX_LOG(DEBUG, "dequeue:%d", num); >>> + >>> + rxvq->stats.packets += nb_rx; >>> + >>> + if (likely(vq->vq_free_cnt >= free_cnt)) { >>> + struct rte_mbuf *new_pkts[free_cnt]; >>> + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, >>> + free_cnt) == 0)) { >>> + virtio_recv_refill_packed_vec(rxvq, new_pkts, >>> + free_cnt); >>> + nb_enqueued += free_cnt; >>> + } else { >>> + struct rte_eth_dev *dev = >>> + &rte_eth_devices[rxvq->port_id]; >>> + dev->data->rx_mbuf_alloc_failed += free_cnt; >>> + } >>> + } >>> + >>> + if (likely(nb_enqueued)) { >>> + if (unlikely(virtqueue_kick_prepare_packed(vq))) { >>> + virtqueue_notify(vq); >>> + PMD_RX_LOG(DEBUG, "Notified"); >>> + } >>> + } >>> + >>> + return nb_rx; >>> +} >>> diff --git a/drivers/net/virtio/virtio_user_ethdev.c >> b/drivers/net/virtio/virtio_user_ethdev.c >>> index 40ad786cc..c54698ad1 100644 >>> --- a/drivers/net/virtio/virtio_user_ethdev.c >>> +++ b/drivers/net/virtio/virtio_user_ethdev.c >>> @@ -528,6 +528,7 @@ 
virtio_user_eth_dev_alloc(struct rte_vdev_device >> *vdev) >>> hw->use_msix = 1; >>> hw->modern = 0; >>> hw->use_vec_rx = 0; >>> + hw->use_vec_tx = 0; >>> hw->use_inorder_rx = 0; >>> hw->use_inorder_tx = 0; >>> hw->virtio_user_dev = dev; >>> @@ -739,8 +740,19 @@ virtio_user_pmd_probe(struct rte_vdev_device >> *dev) >>> goto end; >>> } >>> >>> - if (vectorized) >>> - hw->use_vec_rx = 1; >>> + if (vectorized) { >>> + if (packed_vq) { >>> +#if defined(CC_AVX512_SUPPORT) >>> + hw->use_vec_rx = 1; >>> + hw->use_vec_tx = 1; >>> +#else >>> + PMD_INIT_LOG(INFO, >>> + "building environment do not support packed >> ring vectorized"); >>> +#endif >>> + } else { >>> + hw->use_vec_rx = 1; >>> + } >>> + } >>> >>> rte_eth_dev_probing_finish(eth_dev); >>> ret = 0; >>> diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h >>> index ca1c10499..ce0340743 100644 >>> --- a/drivers/net/virtio/virtqueue.h >>> +++ b/drivers/net/virtio/virtqueue.h >>> @@ -239,7 +239,8 @@ struct vq_desc_extra { >>> void *cookie; >>> uint16_t ndescs; >>> uint16_t next; >>> -}; >>> + uint8_t padding[4]; >>> +} __rte_packed __rte_aligned(16); >> >> Can't this introduce a performance impact for the non-vectorized >> case? I think of worse cache liens utilization. >> >> For example with a burst of 32 descriptors with 32B cachelines, before >> it would take 14 cachelines, after 16. So for each burst, one could face >> 2 extra cache misses. >> >> If you could run non-vectorized benchamrks with and without that patch, >> I would be grateful. >> > > Maxime, > Thanks for point it out, it will add extra cache miss in datapath. > And its impact on performance is around 1% in loopback case. Ok, thanks for doing the test. I'll try to run some PVP benchmarks on my side because when doing IO loopback, the cache pressure is much less important. > While benefit of vectorized path will be more than that number. Ok, but I disagree for two reasons: 1. You have to keep in mind than non-vectorized is the default and encouraged mode to use. Indeed, it takes a lot of shortcuts like not checking header length (so no error stats), etc... 2. It's like saying it's OK it degrades by 5% on $CPU_VENDOR_A because the gain is 20% on $CPU_VENDOR_B. In the case we see more degradation in real-world scenario, you might want to consider using ifdefs to avoid adding padding in the non- vectorized case, like you did to differentiate Virtio PMD to Virtio-user PMD in patch 7. Thanks, Maxime > Thanks, > Marvin > >> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> >> >> Thanks, >> Maxime > ^ permalink raw reply [flat|nested] 162+ messages in thread
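A minimal sketch of the conditional-padding idea raised in the review above, assuming the series' existing CC_AVX512_SUPPORT define is reused as the gate inside virtqueue.h; the hunk below is illustrative only and is not part of the submitted patch:

    /* Pad vq_desc_extra to 16 bytes only when the AVX512 vectorized path is
     * compiled in, so non-vectorized builds keep the original layout and
     * cache footprint. */
    struct vq_desc_extra {
            void *cookie;
            uint16_t ndescs;
            uint16_t next;
    #ifdef CC_AVX512_SUPPORT
            uint8_t padding[4];
    #endif
    }
    #ifdef CC_AVX512_SUPPORT
    __rte_packed __rte_aligned(16)
    #endif
    ;

The trade-off is a build-time rather than run-time choice: a binary built without AVX512 support never pays the extra cache-line accesses discussed above, at the cost of two layouts to validate.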
* Re: [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-28 8:44 ` Maxime Coquelin @ 2020-04-28 13:01 ` Liu, Yong 2020-04-28 13:46 ` Maxime Coquelin 2020-04-28 17:01 ` Liu, Yong 0 siblings, 2 replies; 162+ messages in thread From: Liu, Yong @ 2020-04-28 13:01 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Tuesday, April 28, 2020 4:44 PM > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > Wang, Zhihong <zhihong.wang@intel.com> > Cc: dev@dpdk.org > Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path > > > > On 4/28/20 3:14 AM, Liu, Yong wrote: > > > > > >> -----Original Message----- > >> From: Maxime Coquelin <maxime.coquelin@redhat.com> > >> Sent: Monday, April 27, 2020 7:21 PM > >> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong > <xiaolong.ye@intel.com>; > >> Wang, Zhihong <zhihong.wang@intel.com> > >> Cc: dev@dpdk.org > >> Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx > path > >> > >> > >> > >> On 4/26/20 4:19 AM, Marvin Liu wrote: > >>> Optimize packed ring Rx path with SIMD instructions. Solution of > >>> optimization is pretty like vhost, is that split path into batch and > >>> single functions. Batch function is further optimized by AVX512 > >>> instructions. Also pad desc extra structure to 16 bytes aligned, thus > >>> four elements will be saved in one batch. > >>> > >>> Signed-off-by: Marvin Liu <yong.liu@intel.com> > >>> > >>> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile > >>> index c9edb84ee..102b1deab 100644 > >>> --- a/drivers/net/virtio/Makefile > >>> +++ b/drivers/net/virtio/Makefile > >>> @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) > >> $(CONFIG_RTE_ARCH_ARM64)),) > >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += > virtio_rxtx_simple_neon.c > >>> endif > >>> > >>> +ifneq ($(FORCE_DISABLE_AVX512), y) > >>> + CC_AVX512_SUPPORT=\ > >>> + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ > >>> + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ > >>> + grep -q AVX512 && echo 1) > >>> +endif > >>> + > >>> +ifeq ($(CC_AVX512_SUPPORT), 1) > >>> +CFLAGS += -DCC_AVX512_SUPPORT > >>> +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c > >>> + > >>> +ifeq ($(RTE_TOOLCHAIN), gcc) > >>> +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) > >>> +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA > >>> +endif > >>> +endif > >>> + > >>> +ifeq ($(RTE_TOOLCHAIN), clang) > >>> +ifeq ($(shell test > $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) - > >> ge 37 && echo 1), 1) > >>> +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA > >>> +endif > >>> +endif > >>> + > >>> +ifeq ($(RTE_TOOLCHAIN), icc) > >>> +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) > >>> +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA > >>> +endif > >>> +endif > >>> + > >>> +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw - > mavx512vl > >>> +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) > >>> +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds > >>> +endif > >>> +endif > >>> + > >>> ifeq ($(CONFIG_RTE_VIRTIO_USER),y) > >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c > >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += > virtio_user/vhost_kernel.c > >>> diff --git a/drivers/net/virtio/meson.build > b/drivers/net/virtio/meson.build > >>> index 15150eea1..8e68c3039 100644 > >>> --- 
a/drivers/net/virtio/meson.build > >>> +++ b/drivers/net/virtio/meson.build > >>> @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c', > >>> deps += ['kvargs', 'bus_pci'] > >>> > >>> if arch_subdir == 'x86' > >>> + if '-mno-avx512f' not in machine_args > >>> + if cc.has_argument('-mavx512f') and cc.has_argument('- > >> mavx512vl') and cc.has_argument('-mavx512bw') > >>> + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] > >>> + cflags += ['-DCC_AVX512_SUPPORT'] > >>> + if (toolchain == 'gcc' and > >> cc.version().version_compare('>=8.3.0')) > >>> + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' > >>> + elif (toolchain == 'clang' and > >> cc.version().version_compare('>=3.7.0')) > >>> + cflags += '- > >> DVHOST_CLANG_UNROLL_PRAGMA' > >>> + elif (toolchain == 'icc' and > >> cc.version().version_compare('>=16.0.0')) > >>> + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' > >>> + endif > >>> + sources += files('virtio_rxtx_packed_avx.c') > >>> + endif > >>> + endif > >>> sources += files('virtio_rxtx_simple_sse.c') > >>> elif arch_subdir == 'ppc' > >>> sources += files('virtio_rxtx_simple_altivec.c') > >>> diff --git a/drivers/net/virtio/virtio_ethdev.h > >> b/drivers/net/virtio/virtio_ethdev.h > >>> index febaf17a8..5c112cac7 100644 > >>> --- a/drivers/net/virtio/virtio_ethdev.h > >>> +++ b/drivers/net/virtio/virtio_ethdev.h > >>> @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void > *tx_queue, > >> struct rte_mbuf **tx_pkts, > >>> uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf > **rx_pkts, > >>> uint16_t nb_pkts); > >>> > >>> +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf > >> **rx_pkts, > >>> + uint16_t nb_pkts); > >>> + > >>> int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); > >>> > >>> void virtio_interrupt_handler(void *param); > >>> diff --git a/drivers/net/virtio/virtio_rxtx.c > b/drivers/net/virtio/virtio_rxtx.c > >>> index a549991aa..534562cca 100644 > >>> --- a/drivers/net/virtio/virtio_rxtx.c > >>> +++ b/drivers/net/virtio/virtio_rxtx.c > >>> @@ -2030,3 +2030,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, > >>> > >>> return nb_tx; > >>> } > >>> + > >>> +__rte_weak uint16_t > >>> +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, > >>> + struct rte_mbuf **rx_pkts __rte_unused, > >>> + uint16_t nb_pkts __rte_unused) > >>> +{ > >>> + return 0; > >>> +} > >>> diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c > >> b/drivers/net/virtio/virtio_rxtx_packed_avx.c > >>> new file mode 100644 > >>> index 000000000..8a7b459eb > >>> --- /dev/null > >>> +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c > >>> @@ -0,0 +1,374 @@ > >>> +/* SPDX-License-Identifier: BSD-3-Clause > >>> + * Copyright(c) 2010-2020 Intel Corporation > >>> + */ > >>> + > >>> +#include <stdint.h> > >>> +#include <stdio.h> > >>> +#include <stdlib.h> > >>> +#include <string.h> > >>> +#include <errno.h> > >>> + > >>> +#include <rte_net.h> > >>> + > >>> +#include "virtio_logs.h" > >>> +#include "virtio_ethdev.h" > >>> +#include "virtio_pci.h" > >>> +#include "virtqueue.h" > >>> + > >>> +#define BYTE_SIZE 8 > >>> +/* flag bits offset in packed ring desc higher 64bits */ > >>> +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) > - \ > >>> + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) > >>> + > >>> +#define PACKED_FLAGS_MASK ((0ULL | > >> VRING_PACKED_DESC_F_AVAIL_USED) << \ > >>> + FLAGS_BITS_OFFSET) > >>> + > >>> +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ > >>> + sizeof(struct vring_packed_desc)) > >>> +#define PACKED_BATCH_MASK 
(PACKED_BATCH_SIZE - 1) > >>> + > >>> +#ifdef VIRTIO_GCC_UNROLL_PRAGMA > >>> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll > 4") > >> \ > >>> + for (iter = val; iter < size; iter++) > >>> +#endif > >>> + > >>> +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA > >>> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ > >>> + for (iter = val; iter < size; iter++) > >>> +#endif > >>> + > >>> +#ifdef VIRTIO_ICC_UNROLL_PRAGMA > >>> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") > \ > >>> + for (iter = val; iter < size; iter++) > >>> +#endif > >>> + > >>> +#ifndef virtio_for_each_try_unroll > >>> +#define virtio_for_each_try_unroll(iter, val, num) \ > >>> + for (iter = val; iter < num; iter++) > >>> +#endif > >>> + > >>> +static inline void > >>> +virtio_update_batch_stats(struct virtnet_stats *stats, > >>> + uint16_t pkt_len1, > >>> + uint16_t pkt_len2, > >>> + uint16_t pkt_len3, > >>> + uint16_t pkt_len4) > >>> +{ > >>> + stats->bytes += pkt_len1; > >>> + stats->bytes += pkt_len2; > >>> + stats->bytes += pkt_len3; > >>> + stats->bytes += pkt_len4; > >>> +} > >>> + > >>> +/* Optionally fill offload information in structure */ > >>> +static inline int > >>> +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) > >>> +{ > >>> + struct rte_net_hdr_lens hdr_lens; > >>> + uint32_t hdrlen, ptype; > >>> + int l4_supported = 0; > >>> + > >>> + /* nothing to do */ > >>> + if (hdr->flags == 0) > >>> + return 0; > >>> + > >>> + /* GSO not support in vec path, skip check */ > >>> + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; > >>> + > >>> + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); > >>> + m->packet_type = ptype; > >>> + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || > >>> + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || > >>> + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) > >>> + l4_supported = 1; > >>> + > >>> + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { > >>> + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; > >>> + if (hdr->csum_start <= hdrlen && l4_supported) { > >>> + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; > >>> + } else { > >>> + /* Unknown proto or tunnel, do sw cksum. We can > >> assume > >>> + * the cksum field is in the first segment since the > >>> + * buffers we provided to the host are large enough. > >>> + * In case of SCTP, this will be wrong since it's a CRC > >>> + * but there's nothing we can do. 
> >>> + */ > >>> + uint16_t csum = 0, off; > >>> + > >>> + rte_raw_cksum_mbuf(m, hdr->csum_start, > >>> + rte_pktmbuf_pkt_len(m) - hdr->csum_start, > >>> + &csum); > >>> + if (likely(csum != 0xffff)) > >>> + csum = ~csum; > >>> + off = hdr->csum_offset + hdr->csum_start; > >>> + if (rte_pktmbuf_data_len(m) >= off + 1) > >>> + *rte_pktmbuf_mtod_offset(m, uint16_t *, > >>> + off) = csum; > >>> + } > >>> + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && > >> l4_supported) { > >>> + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; > >>> + } > >>> + > >>> + return 0; > >>> +} > >>> + > >>> +static inline uint16_t > >>> +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, > >>> + struct rte_mbuf **rx_pkts) > >>> +{ > >>> + struct virtqueue *vq = rxvq->vq; > >>> + struct virtio_hw *hw = vq->hw; > >>> + uint16_t hdr_size = hw->vtnet_hdr_size; > >>> + uint64_t addrs[PACKED_BATCH_SIZE]; > >>> + uint16_t id = vq->vq_used_cons_idx; > >>> + uint8_t desc_stats; > >>> + uint16_t i; > >>> + void *desc_addr; > >>> + > >>> + if (id & PACKED_BATCH_MASK) > >>> + return -1; > >>> + > >>> + if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries)) > >>> + return -1; > >>> + > >>> + /* only care avail/used bits */ > >>> + __m512i v_mask = _mm512_maskz_set1_epi64(0xaa, > >> PACKED_FLAGS_MASK); > >>> + desc_addr = &vq->vq_packed.ring.desc[id]; > >>> + > >>> + __m512i v_desc = _mm512_loadu_si512(desc_addr); > >>> + __m512i v_flag = _mm512_and_epi64(v_desc, v_mask); > >>> + > >>> + __m512i v_used_flag = _mm512_setzero_si512(); > >>> + if (vq->vq_packed.used_wrap_counter) > >>> + v_used_flag = _mm512_maskz_set1_epi64(0xaa, > >> PACKED_FLAGS_MASK); > >>> + > >>> + /* Check all descs are used */ > >>> + desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag); > >>> + if (desc_stats) > >>> + return -1; > >>> + > >>> + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { > >>> + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; > >>> + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); > >>> + > >>> + addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1; > >>> + } > >>> + > >>> + /* > >>> + * load len from desc, store into mbuf pkt_len and data_len > >>> + * len limiated by l6bit buf_len, pkt_len[16:31] can be ignored > >>> + */ > >>> + const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12; > >>> + __m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, > >> 0xAA); > >>> + > >>> + /* reduce hdr_len from pkt_len and data_len */ > >>> + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask, > >>> + (uint32_t)-hdr_size); > >>> + > >>> + __m512i v_value = _mm512_add_epi32(values, mbuf_len_offset); > >>> + > >>> + /* assert offset of data_len */ > >>> + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) != > >>> + offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8); > >>> + > >>> + __m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3], > >>> + addrs[2] + 8, addrs[2], > >>> + addrs[1] + 8, addrs[1], > >>> + addrs[0] + 8, addrs[0]); > >>> + /* batch store into mbufs */ > >>> + _mm512_i64scatter_epi64(0, v_index, v_value, 1); > >>> + > >>> + if (hw->has_rx_offload) { > >>> + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { > >>> + char *addr = (char *)rx_pkts[i]->buf_addr + > >>> + RTE_PKTMBUF_HEADROOM - hdr_size; > >>> + virtio_vec_rx_offload(rx_pkts[i], > >>> + (struct virtio_net_hdr *)addr); > >>> + } > >>> + } > >>> + > >>> + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, > >>> + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, > >>> + rx_pkts[3]->pkt_len); > >>> + > 
>>> + vq->vq_free_cnt += PACKED_BATCH_SIZE; > >>> + > >>> + vq->vq_used_cons_idx += PACKED_BATCH_SIZE; > >>> + if (vq->vq_used_cons_idx >= vq->vq_nentries) { > >>> + vq->vq_used_cons_idx -= vq->vq_nentries; > >>> + vq->vq_packed.used_wrap_counter ^= 1; > >>> + } > >>> + > >>> + return 0; > >>> +} > >>> + > >>> +static uint16_t > >>> +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, > >>> + struct rte_mbuf **rx_pkts) > >>> +{ > >>> + uint16_t used_idx, id; > >>> + uint32_t len; > >>> + struct virtqueue *vq = rxvq->vq; > >>> + struct virtio_hw *hw = vq->hw; > >>> + uint32_t hdr_size = hw->vtnet_hdr_size; > >>> + struct virtio_net_hdr *hdr; > >>> + struct vring_packed_desc *desc; > >>> + struct rte_mbuf *cookie; > >>> + > >>> + desc = vq->vq_packed.ring.desc; > >>> + used_idx = vq->vq_used_cons_idx; > >>> + if (!desc_is_used(&desc[used_idx], vq)) > >>> + return -1; > >>> + > >>> + len = desc[used_idx].len; > >>> + id = desc[used_idx].id; > >>> + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; > >>> + if (unlikely(cookie == NULL)) { > >>> + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie > >> at %u", > >>> + vq->vq_used_cons_idx); > >>> + return -1; > >>> + } > >>> + rte_prefetch0(cookie); > >>> + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); > >>> + > >>> + cookie->data_off = RTE_PKTMBUF_HEADROOM; > >>> + cookie->ol_flags = 0; > >>> + cookie->pkt_len = (uint32_t)(len - hdr_size); > >>> + cookie->data_len = (uint32_t)(len - hdr_size); > >>> + > >>> + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + > >>> + RTE_PKTMBUF_HEADROOM - > >> hdr_size); > >>> + if (hw->has_rx_offload) > >>> + virtio_vec_rx_offload(cookie, hdr); > >>> + > >>> + *rx_pkts = cookie; > >>> + > >>> + rxvq->stats.bytes += cookie->pkt_len; > >>> + > >>> + vq->vq_free_cnt++; > >>> + vq->vq_used_cons_idx++; > >>> + if (vq->vq_used_cons_idx >= vq->vq_nentries) { > >>> + vq->vq_used_cons_idx -= vq->vq_nentries; > >>> + vq->vq_packed.used_wrap_counter ^= 1; > >>> + } > >>> + > >>> + return 0; > >>> +} > >>> + > >>> +static inline void > >>> +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, > >>> + struct rte_mbuf **cookie, > >>> + uint16_t num) > >>> +{ > >>> + struct virtqueue *vq = rxvq->vq; > >>> + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; > >>> + uint16_t flags = vq->vq_packed.cached_flags; > >>> + struct virtio_hw *hw = vq->hw; > >>> + struct vq_desc_extra *dxp; > >>> + uint16_t idx, i; > >>> + uint16_t batch_num, total_num = 0; > >>> + uint16_t head_idx = vq->vq_avail_idx; > >>> + uint16_t head_flag = vq->vq_packed.cached_flags; > >>> + uint64_t addr; > >>> + > >>> + do { > >>> + idx = vq->vq_avail_idx; > >>> + > >>> + batch_num = PACKED_BATCH_SIZE; > >>> + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) > >>> + batch_num = vq->vq_nentries - idx; > >>> + if (unlikely((total_num + batch_num) > num)) > >>> + batch_num = num - total_num; > >>> + > >>> + virtio_for_each_try_unroll(i, 0, batch_num) { > >>> + dxp = &vq->vq_descx[idx + i]; > >>> + dxp->cookie = (void *)cookie[total_num + i]; > >>> + > >>> + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], > >> vq) + > >>> + RTE_PKTMBUF_HEADROOM - hw- > >>> vtnet_hdr_size; > >>> + start_dp[idx + i].addr = addr; > >>> + start_dp[idx + i].len = cookie[total_num + i]- > >buf_len > >>> + - RTE_PKTMBUF_HEADROOM + hw- > >>> vtnet_hdr_size; > >>> + if (total_num || i) { > >>> + virtqueue_store_flags_packed(&start_dp[idx > >> + i], > >>> + flags, hw->weak_barriers); > >>> + } > >>> + } > >>> + > >>> + 
vq->vq_avail_idx += batch_num; > >>> + if (vq->vq_avail_idx >= vq->vq_nentries) { > >>> + vq->vq_avail_idx -= vq->vq_nentries; > >>> + vq->vq_packed.cached_flags ^= > >>> + VRING_PACKED_DESC_F_AVAIL_USED; > >>> + flags = vq->vq_packed.cached_flags; > >>> + } > >>> + total_num += batch_num; > >>> + } while (total_num < num); > >>> + > >>> + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, > >>> + hw->weak_barriers); > >>> + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); > >>> +} > >>> + > >>> +uint16_t > >>> +virtio_recv_pkts_packed_vec(void *rx_queue, > >>> + struct rte_mbuf **rx_pkts, > >>> + uint16_t nb_pkts) > >>> +{ > >>> + struct virtnet_rx *rxvq = rx_queue; > >>> + struct virtqueue *vq = rxvq->vq; > >>> + struct virtio_hw *hw = vq->hw; > >>> + uint16_t num, nb_rx = 0; > >>> + uint32_t nb_enqueued = 0; > >>> + uint16_t free_cnt = vq->vq_free_thresh; > >>> + > >>> + if (unlikely(hw->started == 0)) > >>> + return nb_rx; > >>> + > >>> + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); > >>> + if (likely(num > PACKED_BATCH_SIZE)) > >>> + num = num - ((vq->vq_used_cons_idx + num) % > >> PACKED_BATCH_SIZE); > >>> + > >>> + while (num) { > >>> + if (!virtqueue_dequeue_batch_packed_vec(rxvq, > >>> + &rx_pkts[nb_rx])) { > >>> + nb_rx += PACKED_BATCH_SIZE; > >>> + num -= PACKED_BATCH_SIZE; > >>> + continue; > >>> + } > >>> + if (!virtqueue_dequeue_single_packed_vec(rxvq, > >>> + &rx_pkts[nb_rx])) { > >>> + nb_rx++; > >>> + num--; > >>> + continue; > >>> + } > >>> + break; > >>> + }; > >>> + > >>> + PMD_RX_LOG(DEBUG, "dequeue:%d", num); > >>> + > >>> + rxvq->stats.packets += nb_rx; > >>> + > >>> + if (likely(vq->vq_free_cnt >= free_cnt)) { > >>> + struct rte_mbuf *new_pkts[free_cnt]; > >>> + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, > >>> + free_cnt) == 0)) { > >>> + virtio_recv_refill_packed_vec(rxvq, new_pkts, > >>> + free_cnt); > >>> + nb_enqueued += free_cnt; > >>> + } else { > >>> + struct rte_eth_dev *dev = > >>> + &rte_eth_devices[rxvq->port_id]; > >>> + dev->data->rx_mbuf_alloc_failed += free_cnt; > >>> + } > >>> + } > >>> + > >>> + if (likely(nb_enqueued)) { > >>> + if (unlikely(virtqueue_kick_prepare_packed(vq))) { > >>> + virtqueue_notify(vq); > >>> + PMD_RX_LOG(DEBUG, "Notified"); > >>> + } > >>> + } > >>> + > >>> + return nb_rx; > >>> +} > >>> diff --git a/drivers/net/virtio/virtio_user_ethdev.c > >> b/drivers/net/virtio/virtio_user_ethdev.c > >>> index 40ad786cc..c54698ad1 100644 > >>> --- a/drivers/net/virtio/virtio_user_ethdev.c > >>> +++ b/drivers/net/virtio/virtio_user_ethdev.c > >>> @@ -528,6 +528,7 @@ virtio_user_eth_dev_alloc(struct > rte_vdev_device > >> *vdev) > >>> hw->use_msix = 1; > >>> hw->modern = 0; > >>> hw->use_vec_rx = 0; > >>> + hw->use_vec_tx = 0; > >>> hw->use_inorder_rx = 0; > >>> hw->use_inorder_tx = 0; > >>> hw->virtio_user_dev = dev; > >>> @@ -739,8 +740,19 @@ virtio_user_pmd_probe(struct rte_vdev_device > >> *dev) > >>> goto end; > >>> } > >>> > >>> - if (vectorized) > >>> - hw->use_vec_rx = 1; > >>> + if (vectorized) { > >>> + if (packed_vq) { > >>> +#if defined(CC_AVX512_SUPPORT) > >>> + hw->use_vec_rx = 1; > >>> + hw->use_vec_tx = 1; > >>> +#else > >>> + PMD_INIT_LOG(INFO, > >>> + "building environment do not support > packed > >> ring vectorized"); > >>> +#endif > >>> + } else { > >>> + hw->use_vec_rx = 1; > >>> + } > >>> + } > >>> > >>> rte_eth_dev_probing_finish(eth_dev); > >>> ret = 0; > >>> diff --git a/drivers/net/virtio/virtqueue.h > b/drivers/net/virtio/virtqueue.h > >>> index ca1c10499..ce0340743 100644 > >>> --- 
a/drivers/net/virtio/virtqueue.h > >>> +++ b/drivers/net/virtio/virtqueue.h > >>> @@ -239,7 +239,8 @@ struct vq_desc_extra { > >>> void *cookie; > >>> uint16_t ndescs; > >>> uint16_t next; > >>> -}; > >>> + uint8_t padding[4]; > >>> +} __rte_packed __rte_aligned(16); > >> > >> Can't this introduce a performance impact for the non-vectorized > >> case? I think of worse cache liens utilization. > >> > >> For example with a burst of 32 descriptors with 32B cachelines, before > >> it would take 14 cachelines, after 16. So for each burst, one could face > >> 2 extra cache misses. > >> > >> If you could run non-vectorized benchamrks with and without that patch, > >> I would be grateful. > >> > > > > Maxime, > > Thanks for point it out, it will add extra cache miss in datapath. > > And its impact on performance is around 1% in loopback case. > > Ok, thanks for doing the test. I'll try to run some PVP benchmarks > on my side because when doing IO loopback, the cache pressure is > much less important. > > > While benefit of vectorized path will be more than that number. > > Ok, but I disagree for two reasons: > 1. You have to keep in mind than non-vectorized is the default and > encouraged mode to use. Indeed, it takes a lot of shortcuts like not > checking header length (so no error stats), etc... > Ok, I will keep non-vectorized same as before. > 2. It's like saying it's OK it degrades by 5% on $CPU_VENDOR_A because > the gain is 20% on $CPU_VENDOR_B. > > In the case we see more degradation in real-world scenario, you might > want to consider using ifdefs to avoid adding padding in the non- > vectorized case, like you did to differentiate Virtio PMD to Virtio-user > PMD in patch 7. > Maxime, The performance difference is so slight, so I ignored for it look like a sampling error. It maybe not suitable to add new configuration for such setting which only used inside driver. Virtio driver can check whether virtqueue is using vectorized path when initialization, will use padded structure if it is. I have added some tested code and now performance came back. Since code has changed in initialization process, it need some time for regression check. Regards, Marvin > Thanks, > Maxime > > > Thanks, > > Marvin > > > >> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> > >> > >> Thanks, > >> Maxime > > ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-28 13:01 ` Liu, Yong @ 2020-04-28 13:46 ` Maxime Coquelin 2020-04-28 14:43 ` Liu, Yong 2020-04-28 17:01 ` Liu, Yong 1 sibling, 1 reply; 162+ messages in thread From: Maxime Coquelin @ 2020-04-28 13:46 UTC (permalink / raw) To: Liu, Yong, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Honnappa Nagarahalli, jerinj On 4/28/20 3:01 PM, Liu, Yong wrote: >>> Maxime, >>> Thanks for point it out, it will add extra cache miss in datapath. >>> And its impact on performance is around 1% in loopback case. >> Ok, thanks for doing the test. I'll try to run some PVP benchmarks >> on my side because when doing IO loopback, the cache pressure is >> much less important. >> >>> While benefit of vectorized path will be more than that number. >> Ok, but I disagree for two reasons: >> 1. You have to keep in mind than non-vectorized is the default and >> encouraged mode to use. Indeed, it takes a lot of shortcuts like not >> checking header length (so no error stats), etc... >> > Ok, I will keep non-vectorized same as before. > >> 2. It's like saying it's OK it degrades by 5% on $CPU_VENDOR_A because >> the gain is 20% on $CPU_VENDOR_B. >> >> In the case we see more degradation in real-world scenario, you might >> want to consider using ifdefs to avoid adding padding in the non- >> vectorized case, like you did to differentiate Virtio PMD to Virtio-user >> PMD in patch 7. >> > Maxime, > The performance difference is so slight, so I ignored for it look like a sampling error. Agree for IO loopback, but it adds one more cache line access per burst, which might be see in some real-life use cases. > It maybe not suitable to add new configuration for such setting which only used inside driver. Wait, the Virtio-user #ifdef is based on the defconfig options? How can it work since both Virtio PMD and Virtio-user PMD can be selected at the same time? I thought it was a define set before the headers inclusion and unset afterwards, but I didn't checked carefully. > Virtio driver can check whether virtqueue is using vectorized path when initialization, will use padded structure if it is. > I have added some tested code and now performance came back. Since code has changed in initialization process, it need some time for regression check. Ok, works for me. I am investigating a linkage issue with your series, which does not happen systematically (see below, it happens also with clang). David pointed me to some Intel patches removing the usage if __rte_weak, could it be related? 
gcc -o app/test/dpdk-test 'app/test/3062f5d@@dpdk-test@exe/commands.c.o' 'app/test/3062f5d@@dpdk-test@exe/packet_burst_generator.c.o' 'app/test/3062f5d@@dpdk-test@exe/test.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_acl.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_alarm.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_atomic.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_barrier.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_bpf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_byteorder.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_cmdline.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_cmdline_cirbuf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_cmdline_etheraddr.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_cmdline_ipaddr.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_cmdline_lib.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_cmdline_num.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_cmdline_portlist.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_cmdline_string.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_common.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_cpuflags.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_crc.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_cryptodev.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_cryptodev_asym.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_cryptodev_blockcipher.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_cryptodev_security_pdcp.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_cycles.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_debug.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_distributor.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_distributor_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_eal_flags.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_eal_fs.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_efd.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_efd_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_errno.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_event_crypto_adapter.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_event_eth_rx_adapter.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_event_ring.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_event_timer_adapter.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_eventdev.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_external_mem.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_fbarray.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_fib.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_fib_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_fib6.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_fib6_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_func_reentrancy.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_flow_classify.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_hash.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_hash_functions.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_hash_multiwriter.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_hash_readwrite.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_hash_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_hash_readwrite_lf_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_interrupts.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_ipfrag.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_ipsec.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_ipsec_sad.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_kni.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_kvargs.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_link_bonding.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_link_bonding_rssconf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_logs.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_lpm.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_lpm6.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_lpm6_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_lpm_perf.c.o' 
'app/test/3062f5d@@dpdk-test@exe/test_malloc.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_mbuf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_member.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_member_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_memcpy.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_memcpy_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_memory.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_mempool.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_mempool_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_memzone.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_meter.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_metrics.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_mcslock.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_mp_secondary.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_per_lcore.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_pmd_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_power.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_power_cpufreq.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_power_kvm_vm.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_prefetch.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_rand_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_rawdev.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_rcu_qsbr.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_rcu_qsbr_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_reciprocal_division.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_reciprocal_division_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_red.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_reorder.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_rib.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_rib6.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_ring.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_ring_mpmc_stress.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_ring_hts_stress.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_ring_peek_stress.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_ring_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_ring_rts_stress.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_ring_stress.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_rwlock.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_sched.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_security.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_service_cores.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_spinlock.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_stack.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_stack_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_string_fns.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_table.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_table_acl.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_table_combined.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_table_pipeline.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_table_ports.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_table_tables.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_tailq.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_thash.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_timer.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_timer_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_timer_racecond.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_timer_secondary.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_ticketlock.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_trace.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_trace_register.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_trace_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_version.c.o' 'app/test/3062f5d@@dpdk-test@exe/virtual_pmd.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_pmd_ring_perf.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_pmd_ring.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_event_eth_tx_adapter.c.o' 
'app/test/3062f5d@@dpdk-test@exe/test_bitratestats.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_latencystats.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_link_bonding_mode4.c.o' 'app/test/3062f5d@@dpdk-test@exe/sample_packet_forward.c.o' 'app/test/3062f5d@@dpdk-test@exe/test_pdump.c.o' -Wl,--no-undefined -Wl,--as-needed -Wl,-O1 -Wl,--whole-archive -Wl,--start-group drivers/librte_common_cpt.a drivers/librte_common_dpaax.a drivers/librte_common_iavf.a drivers/librte_common_octeontx.a drivers/librte_common_octeontx2.a drivers/librte_bus_dpaa.a drivers/librte_bus_fslmc.a drivers/librte_bus_ifpga.a drivers/librte_bus_pci.a drivers/librte_bus_vdev.a drivers/librte_bus_vmbus.a drivers/librte_mempool_bucket.a drivers/librte_mempool_dpaa.a drivers/librte_mempool_dpaa2.a drivers/librte_mempool_octeontx.a drivers/librte_mempool_octeontx2.a drivers/librte_mempool_ring.a drivers/librte_mempool_stack.a drivers/librte_pmd_af_packet.a drivers/librte_pmd_ark.a drivers/librte_pmd_atlantic.a drivers/librte_pmd_avp.a drivers/librte_pmd_axgbe.a drivers/librte_pmd_bond.a drivers/librte_pmd_bnxt.a drivers/librte_pmd_cxgbe.a drivers/librte_pmd_dpaa.a drivers/librte_pmd_dpaa2.a drivers/librte_pmd_e1000.a drivers/librte_pmd_ena.a drivers/librte_pmd_enetc.a drivers/librte_pmd_enic.a drivers/librte_pmd_failsafe.a drivers/librte_pmd_fm10k.a drivers/librte_pmd_i40e.a drivers/librte_pmd_hinic.a drivers/librte_pmd_hns3.a drivers/librte_pmd_iavf.a drivers/librte_pmd_ice.a drivers/librte_pmd_igc.a drivers/librte_pmd_ixgbe.a drivers/librte_pmd_kni.a drivers/librte_pmd_liquidio.a drivers/librte_pmd_memif.a drivers/librte_pmd_netvsc.a drivers/librte_pmd_nfp.a drivers/librte_pmd_null.a drivers/librte_pmd_octeontx.a drivers/librte_pmd_octeontx2.a drivers/librte_pmd_pfe.a drivers/librte_pmd_qede.a drivers/librte_pmd_ring.a drivers/librte_pmd_sfc.a drivers/librte_pmd_softnic.a drivers/librte_pmd_tap.a drivers/librte_pmd_thunderx.a drivers/librte_pmd_vdev_netvsc.a drivers/librte_pmd_vhost.a drivers/librte_pmd_virtio.a drivers/librte_pmd_vmxnet3.a drivers/librte_rawdev_dpaa2_cmdif.a drivers/librte_rawdev_dpaa2_qdma.a drivers/librte_rawdev_ioat.a drivers/librte_rawdev_ntb.a drivers/librte_rawdev_octeontx2_dma.a drivers/librte_rawdev_octeontx2_ep.a drivers/librte_rawdev_skeleton.a drivers/librte_pmd_caam_jr.a drivers/librte_pmd_dpaa_sec.a drivers/librte_pmd_dpaa2_sec.a drivers/librte_pmd_nitrox.a drivers/librte_pmd_null_crypto.a drivers/librte_pmd_octeontx_crypto.a drivers/librte_pmd_octeontx2_crypto.a drivers/librte_pmd_crypto_scheduler.a drivers/librte_pmd_virtio_crypto.a drivers/librte_pmd_octeontx_compress.a drivers/librte_pmd_qat.a drivers/librte_pmd_ifc.a drivers/librte_pmd_dpaa_event.a drivers/librte_pmd_dpaa2_event.a drivers/librte_pmd_octeontx2_event.a drivers/librte_pmd_opdl_event.a drivers/librte_pmd_skeleton_event.a drivers/librte_pmd_sw_event.a drivers/librte_pmd_dsw_event.a drivers/librte_pmd_octeontx_event.a drivers/librte_pmd_bbdev_null.a drivers/librte_pmd_bbdev_turbo_sw.a drivers/librte_pmd_bbdev_fpga_lte_fec.a drivers/librte_pmd_bbdev_fpga_5gnr_fec.a -Wl,--no-whole-archive -Wl,--no-as-needed -pthread -lm -ldl -lnuma lib/librte_acl.a lib/librte_eal.a lib/librte_kvargs.a lib/librte_bitratestats.a lib/librte_ethdev.a lib/librte_net.a lib/librte_mbuf.a lib/librte_mempool.a lib/librte_ring.a lib/librte_meter.a lib/librte_metrics.a lib/librte_bpf.a lib/librte_cfgfile.a lib/librte_cmdline.a lib/librte_cryptodev.a lib/librte_distributor.a lib/librte_efd.a lib/librte_hash.a lib/librte_eventdev.a lib/librte_timer.a 
lib/librte_fib.a lib/librte_rib.a lib/librte_flow_classify.a lib/librte_table.a lib/librte_port.a lib/librte_sched.a lib/librte_ip_frag.a lib/librte_kni.a lib/librte_pci.a lib/librte_lpm.a lib/librte_ipsec.a lib/librte_security.a lib/librte_latencystats.a lib/librte_member.a lib/librte_pipeline.a lib/librte_rawdev.a lib/librte_rcu.a lib/librte_reorder.a lib/librte_stack.a lib/librte_power.a lib/librte_pdump.a lib/librte_gso.a lib/librte_vhost.a lib/librte_compressdev.a lib/librte_bbdev.a -Wl,--end-group '-Wl,-rpath,$ORIGIN/../../lib:$ORIGIN/../../drivers' -Wl,-rpath-link,/tmp/dpdk_build/meson_buildir_gcc/lib:/tmp/dpdk_build/meson_buildir_gcc/drivers drivers/librte_pmd_virtio.a(net_virtio_virtio_ethdev.c.o): In function `set_rxtx_funcs': virtio_ethdev.c:(.text.unlikely+0x6f): undefined reference to `virtio_xmit_pkts_packed_vec' collect2: error: ld returned 1 exit status ninja: build stopped: subcommand failed. > Regards, > Marvin > ^ permalink raw reply [flat|nested] 162+ messages in thread
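The undefined reference above points at the fallback mechanism the series relies on: virtio_rxtx.c is supposed to carry __rte_weak stubs for the vectorized burst functions so that set_rxtx_funcs() can always reference them, while virtio_rxtx_packed_avx.c, when it is compiled, supplies the strong AVX512 definitions that override the stubs. For reference, the Tx stub that the linker fails to resolve here is reproduced below from patch 7/9 as quoted later in this thread (comments added for clarity):

__rte_weak uint16_t
virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused,
		struct rte_mbuf **tx_pkts __rte_unused,
		uint16_t nb_pkts __rte_unused)
{
	/* Weak fallback: transmits nothing. When the AVX512 object is
	 * built, it provides the strong definition that replaces this. */
	return 0;
}

If this weak definition is not pulled into the link for some reason, the reference from set_rxtx_funcs() stays unresolved, which is exactly the failure shown above.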
* Re: [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-28 13:46 ` Maxime Coquelin @ 2020-04-28 14:43 ` Liu, Yong 2020-04-28 14:50 ` Maxime Coquelin 0 siblings, 1 reply; 162+ messages in thread From: Liu, Yong @ 2020-04-28 14:43 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong Cc: dev, Honnappa Nagarahalli, jerinj > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Tuesday, April 28, 2020 9:46 PM > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > Wang, Zhihong <zhihong.wang@intel.com> > Cc: dev@dpdk.org; Honnappa Nagarahalli > <Honnappa.Nagarahalli@arm.com>; jerinj@marvell.com > Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path > > > > On 4/28/20 3:01 PM, Liu, Yong wrote: > >>> Maxime, > >>> Thanks for point it out, it will add extra cache miss in datapath. > >>> And its impact on performance is around 1% in loopback case. > >> Ok, thanks for doing the test. I'll try to run some PVP benchmarks > >> on my side because when doing IO loopback, the cache pressure is > >> much less important. > >> > >>> While benefit of vectorized path will be more than that number. > >> Ok, but I disagree for two reasons: > >> 1. You have to keep in mind than non-vectorized is the default and > >> encouraged mode to use. Indeed, it takes a lot of shortcuts like not > >> checking header length (so no error stats), etc... > >> > > Ok, I will keep non-vectorized same as before. > > > >> 2. It's like saying it's OK it degrades by 5% on $CPU_VENDOR_A because > >> the gain is 20% on $CPU_VENDOR_B. > >> > >> In the case we see more degradation in real-world scenario, you might > >> want to consider using ifdefs to avoid adding padding in the non- > >> vectorized case, like you did to differentiate Virtio PMD to Virtio-user > >> PMD in patch 7. > >> > > Maxime, > > The performance difference is so slight, so I ignored for it look like a > sampling error. > > Agree for IO loopback, but it adds one more cache line access per burst, > which might be see in some real-life use cases. > > > It maybe not suitable to add new configuration for such setting which > only used inside driver. > > Wait, the Virtio-user #ifdef is based on the defconfig options? How can > it work since both Virtio PMD and Virtio-user PMD can be selected at the > same time? > > I thought it was a define set before the headers inclusion and unset > afterwards, but I didn't checked carefully. > Maxime, The difference between virtio PMD and Virtio-user PMD addresses is handled by vq->offset. When virtio PMD is running, offset will be set to buf_iova. vq->offset = offsetof(struct rte_mbuf, buf_iova); When virtio_user PMD is running, offset will be set to buf_addr. vq->offset = offsetof(struct rte_mbuf, buf_addr); > > Virtio driver can check whether virtqueue is using vectorized path when > initialization, will use padded structure if it is. > > I have added some tested code and now performance came back. Since > code has changed in initialization process, it need some time for regression > check. > > Ok, works for me. > > I am investigating a linkage issue with your series, which does not > happen systematically (see below, it happens also with clang). David > pointed me to some Intel patches removing the usage if __rte_weak, > could it be related? > I checked David's patch, it only changed i40e driver. Meanwhile attribute __rte_weak should still be in virtio_rxtx.c. 
I will follow David's patch, eliminate the usage of weak attribute. > > gcc -o app/test/dpdk-test > 'app/test/3062f5d@@dpdk-test@exe/commands.c.o' > 'app/test/3062f5d@@dpdk-test@exe/packet_burst_generator.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_acl.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_alarm.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_atomic.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_barrier.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_bpf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_byteorder.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_cmdline.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_cmdline_cirbuf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_cmdline_etheraddr.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_cmdline_ipaddr.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_cmdline_lib.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_cmdline_num.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_cmdline_portlist.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_cmdline_string.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_common.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_cpuflags.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_crc.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_cryptodev.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_cryptodev_asym.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_cryptodev_blockcipher.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_cryptodev_security_pdcp.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_cycles.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_debug.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_distributor.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_distributor_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_eal_flags.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_eal_fs.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_efd.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_efd_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_errno.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_event_crypto_adapter.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_event_eth_rx_adapter.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_event_ring.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_event_timer_adapter.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_eventdev.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_external_mem.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_fbarray.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_fib.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_fib_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_fib6.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_fib6_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_func_reentrancy.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_flow_classify.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_hash.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_hash_functions.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_hash_multiwriter.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_hash_readwrite.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_hash_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_hash_readwrite_lf_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_interrupts.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_ipfrag.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_ipsec.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_ipsec_sad.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_kni.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_kvargs.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_link_bonding.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_link_bonding_rssconf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_logs.c.o' > 
'app/test/3062f5d@@dpdk-test@exe/test_lpm.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_lpm6.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_lpm6_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_lpm_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_malloc.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_mbuf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_member.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_member_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_memcpy.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_memcpy_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_memory.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_mempool.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_mempool_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_memzone.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_meter.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_metrics.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_mcslock.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_mp_secondary.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_per_lcore.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_pmd_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_power.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_power_cpufreq.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_power_kvm_vm.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_prefetch.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_rand_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_rawdev.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_rcu_qsbr.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_rcu_qsbr_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_reciprocal_division.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_reciprocal_division_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_red.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_reorder.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_rib.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_rib6.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_ring.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_ring_mpmc_stress.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_ring_hts_stress.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_ring_peek_stress.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_ring_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_ring_rts_stress.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_ring_stress.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_rwlock.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_sched.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_security.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_service_cores.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_spinlock.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_stack.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_stack_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_string_fns.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_table.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_table_acl.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_table_combined.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_table_pipeline.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_table_ports.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_table_tables.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_tailq.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_thash.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_timer.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_timer_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_timer_racecond.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_timer_secondary.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_ticketlock.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_trace.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_trace_register.c.o' > 
'app/test/3062f5d@@dpdk-test@exe/test_trace_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_version.c.o' > 'app/test/3062f5d@@dpdk-test@exe/virtual_pmd.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_pmd_ring_perf.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_pmd_ring.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_event_eth_tx_adapter.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_bitratestats.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_latencystats.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_link_bonding_mode4.c.o' > 'app/test/3062f5d@@dpdk-test@exe/sample_packet_forward.c.o' > 'app/test/3062f5d@@dpdk-test@exe/test_pdump.c.o' -Wl,--no-undefined > -Wl,--as-needed -Wl,-O1 -Wl,--whole-archive -Wl,--start-group > drivers/librte_common_cpt.a drivers/librte_common_dpaax.a > drivers/librte_common_iavf.a drivers/librte_common_octeontx.a > drivers/librte_common_octeontx2.a drivers/librte_bus_dpaa.a > drivers/librte_bus_fslmc.a drivers/librte_bus_ifpga.a > drivers/librte_bus_pci.a drivers/librte_bus_vdev.a > drivers/librte_bus_vmbus.a drivers/librte_mempool_bucket.a > drivers/librte_mempool_dpaa.a drivers/librte_mempool_dpaa2.a > drivers/librte_mempool_octeontx.a drivers/librte_mempool_octeontx2.a > drivers/librte_mempool_ring.a drivers/librte_mempool_stack.a > drivers/librte_pmd_af_packet.a drivers/librte_pmd_ark.a > drivers/librte_pmd_atlantic.a drivers/librte_pmd_avp.a > drivers/librte_pmd_axgbe.a drivers/librte_pmd_bond.a > drivers/librte_pmd_bnxt.a drivers/librte_pmd_cxgbe.a > drivers/librte_pmd_dpaa.a drivers/librte_pmd_dpaa2.a > drivers/librte_pmd_e1000.a drivers/librte_pmd_ena.a > drivers/librte_pmd_enetc.a drivers/librte_pmd_enic.a > drivers/librte_pmd_failsafe.a drivers/librte_pmd_fm10k.a > drivers/librte_pmd_i40e.a drivers/librte_pmd_hinic.a > drivers/librte_pmd_hns3.a drivers/librte_pmd_iavf.a > drivers/librte_pmd_ice.a drivers/librte_pmd_igc.a > drivers/librte_pmd_ixgbe.a drivers/librte_pmd_kni.a > drivers/librte_pmd_liquidio.a drivers/librte_pmd_memif.a > drivers/librte_pmd_netvsc.a drivers/librte_pmd_nfp.a > drivers/librte_pmd_null.a drivers/librte_pmd_octeontx.a > drivers/librte_pmd_octeontx2.a drivers/librte_pmd_pfe.a > drivers/librte_pmd_qede.a drivers/librte_pmd_ring.a > drivers/librte_pmd_sfc.a drivers/librte_pmd_softnic.a > drivers/librte_pmd_tap.a drivers/librte_pmd_thunderx.a > drivers/librte_pmd_vdev_netvsc.a drivers/librte_pmd_vhost.a > drivers/librte_pmd_virtio.a drivers/librte_pmd_vmxnet3.a > drivers/librte_rawdev_dpaa2_cmdif.a drivers/librte_rawdev_dpaa2_qdma.a > drivers/librte_rawdev_ioat.a drivers/librte_rawdev_ntb.a > drivers/librte_rawdev_octeontx2_dma.a > drivers/librte_rawdev_octeontx2_ep.a drivers/librte_rawdev_skeleton.a > drivers/librte_pmd_caam_jr.a drivers/librte_pmd_dpaa_sec.a > drivers/librte_pmd_dpaa2_sec.a drivers/librte_pmd_nitrox.a > drivers/librte_pmd_null_crypto.a drivers/librte_pmd_octeontx_crypto.a > drivers/librte_pmd_octeontx2_crypto.a > drivers/librte_pmd_crypto_scheduler.a drivers/librte_pmd_virtio_crypto.a > drivers/librte_pmd_octeontx_compress.a drivers/librte_pmd_qat.a > drivers/librte_pmd_ifc.a drivers/librte_pmd_dpaa_event.a > drivers/librte_pmd_dpaa2_event.a drivers/librte_pmd_octeontx2_event.a > drivers/librte_pmd_opdl_event.a drivers/librte_pmd_skeleton_event.a > drivers/librte_pmd_sw_event.a drivers/librte_pmd_dsw_event.a > drivers/librte_pmd_octeontx_event.a drivers/librte_pmd_bbdev_null.a > drivers/librte_pmd_bbdev_turbo_sw.a > drivers/librte_pmd_bbdev_fpga_lte_fec.a > drivers/librte_pmd_bbdev_fpga_5gnr_fec.a -Wl,--no-whole-archive > 
-Wl,--no-as-needed -pthread -lm -ldl -lnuma lib/librte_acl.a > lib/librte_eal.a lib/librte_kvargs.a lib/librte_bitratestats.a > lib/librte_ethdev.a lib/librte_net.a lib/librte_mbuf.a > lib/librte_mempool.a lib/librte_ring.a lib/librte_meter.a > lib/librte_metrics.a lib/librte_bpf.a lib/librte_cfgfile.a > lib/librte_cmdline.a lib/librte_cryptodev.a lib/librte_distributor.a > lib/librte_efd.a lib/librte_hash.a lib/librte_eventdev.a > lib/librte_timer.a lib/librte_fib.a lib/librte_rib.a > lib/librte_flow_classify.a lib/librte_table.a lib/librte_port.a > lib/librte_sched.a lib/librte_ip_frag.a lib/librte_kni.a > lib/librte_pci.a lib/librte_lpm.a lib/librte_ipsec.a > lib/librte_security.a lib/librte_latencystats.a lib/librte_member.a > lib/librte_pipeline.a lib/librte_rawdev.a lib/librte_rcu.a > lib/librte_reorder.a lib/librte_stack.a lib/librte_power.a > lib/librte_pdump.a lib/librte_gso.a lib/librte_vhost.a > lib/librte_compressdev.a lib/librte_bbdev.a -Wl,--end-group > '-Wl,-rpath,$ORIGIN/../../lib:$ORIGIN/../../drivers' > -Wl,-rpath- > link,/tmp/dpdk_build/meson_buildir_gcc/lib:/tmp/dpdk_build/meson_buil > dir_gcc/drivers > drivers/librte_pmd_virtio.a(net_virtio_virtio_ethdev.c.o): In function > `set_rxtx_funcs': > virtio_ethdev.c:(.text.unlikely+0x6f): undefined reference to > `virtio_xmit_pkts_packed_vec' > collect2: error: ld returned 1 exit status > ninja: build stopped: subcommand failed. > > > Regards, > > Marvin > > ^ permalink raw reply [flat|nested] 162+ messages in thread
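To make the vq->offset point above concrete: the offset is chosen once per device at queue setup, and the datapath only ever dereferences it afterwards. A minimal sketch of that selection (the offsetof() values are taken verbatim from the message above; the condition on hw->virtio_user_dev is an assumption used here only for illustration):

	/* Decided at setup time, not in the datapath. */
	if (hw->virtio_user_dev)
		/* virtio-user PMD: descriptors carry virtual addresses */
		vq->offset = offsetof(struct rte_mbuf, buf_addr);
	else
		/* virtio PCI PMD: descriptors carry IOVA addresses */
		vq->offset = offsetof(struct rte_mbuf, buf_iova);

Because this is per-queue state, no build-time knowledge of which PMD is in use is required.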
* Re: [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-28 14:43 ` Liu, Yong @ 2020-04-28 14:50 ` Maxime Coquelin 2020-04-28 15:35 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: Maxime Coquelin @ 2020-04-28 14:50 UTC (permalink / raw) To: Liu, Yong, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Honnappa Nagarahalli, jerinj On 4/28/20 4:43 PM, Liu, Yong wrote: > > >> -----Original Message----- >> From: Maxime Coquelin <maxime.coquelin@redhat.com> >> Sent: Tuesday, April 28, 2020 9:46 PM >> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; >> Wang, Zhihong <zhihong.wang@intel.com> >> Cc: dev@dpdk.org; Honnappa Nagarahalli >> <Honnappa.Nagarahalli@arm.com>; jerinj@marvell.com >> Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path >> >> >> >> On 4/28/20 3:01 PM, Liu, Yong wrote: >>>>> Maxime, >>>>> Thanks for point it out, it will add extra cache miss in datapath. >>>>> And its impact on performance is around 1% in loopback case. >>>> Ok, thanks for doing the test. I'll try to run some PVP benchmarks >>>> on my side because when doing IO loopback, the cache pressure is >>>> much less important. >>>> >>>>> While benefit of vectorized path will be more than that number. >>>> Ok, but I disagree for two reasons: >>>> 1. You have to keep in mind than non-vectorized is the default and >>>> encouraged mode to use. Indeed, it takes a lot of shortcuts like not >>>> checking header length (so no error stats), etc... >>>> >>> Ok, I will keep non-vectorized same as before. >>> >>>> 2. It's like saying it's OK it degrades by 5% on $CPU_VENDOR_A because >>>> the gain is 20% on $CPU_VENDOR_B. >>>> >>>> In the case we see more degradation in real-world scenario, you might >>>> want to consider using ifdefs to avoid adding padding in the non- >>>> vectorized case, like you did to differentiate Virtio PMD to Virtio-user >>>> PMD in patch 7. >>>> >>> Maxime, >>> The performance difference is so slight, so I ignored for it look like a >> sampling error. >> >> Agree for IO loopback, but it adds one more cache line access per burst, >> which might be see in some real-life use cases. >> >>> It maybe not suitable to add new configuration for such setting which >> only used inside driver. >> >> Wait, the Virtio-user #ifdef is based on the defconfig options? How can >> it work since both Virtio PMD and Virtio-user PMD can be selected at the >> same time? >> >> I thought it was a define set before the headers inclusion and unset >> afterwards, but I didn't checked carefully. >> > > Maxime, > The difference between virtio PMD and Virtio-user PMD addresses is handled by vq->offset. > > When virtio PMD is running, offset will be set to buf_iova. > vq->offset = offsetof(struct rte_mbuf, buf_iova); > > When virtio_user PMD is running, offset will be set to buf_addr. > vq->offset = offsetof(struct rte_mbuf, buf_addr); Ok, but below is a build time check: +#ifdef RTE_VIRTIO_USER + __m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset); +#else + __m128i flag_offset = _mm_set_epi64x(flags_temp, 0); +#endif So how can it work for a single build for both Virtio and Virtio-user? >>> Virtio driver can check whether virtqueue is using vectorized path when >> initialization, will use padded structure if it is. >>> I have added some tested code and now performance came back. Since >> code has changed in initialization process, it need some time for regression >> check. >> >> Ok, works for me. 
>> >> I am investigating a linkage issue with your series, which does not >> happen systematically (see below, it happens also with clang). David >> pointed me to some Intel patches removing the usage if __rte_weak, >> could it be related? >> > > I checked David's patch, it only changed i40e driver. Meanwhile attribute __rte_weak should still be in virtio_rxtx.c. > I will follow David's patch, eliminate the usage of weak attribute. Yeah, I meant below issue could be linked to __rte_weak, not that i40e patch was the cause of this problem. ^ permalink raw reply [flat|nested] 162+ messages in thread
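The hunk Maxime quotes is worth spelling out, because the problem is easy to miss: RTE_VIRTIO_USER is a build-time option, while a single binary routinely contains both the virtio PCI PMD and the virtio-user PMD, which need different address offsets at run time. Annotated copy of the quoted code (code as posted in the series, comments added):

#ifdef RTE_VIRTIO_USER
	/* Compiled in only when the virtio-user option is enabled; every
	 * device served by this object then uses vq->offset. */
	__m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset);
#else
	/* Otherwise every device gets offset 0, including any virtio-user
	 * devices present in the same build. */
	__m128i flag_offset = _mm_set_epi64x(flags_temp, 0);
#endif

The reply below concedes this is an implementation error; the per-queue vq->offset, which is already set correctly for both PMDs, is the runtime information that should be used instead.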
* Re: [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-28 14:50 ` Maxime Coquelin @ 2020-04-28 15:35 ` Liu, Yong 2020-04-28 15:40 ` Maxime Coquelin 0 siblings, 1 reply; 162+ messages in thread From: Liu, Yong @ 2020-04-28 15:35 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong Cc: dev, Honnappa Nagarahalli, jerinj > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Tuesday, April 28, 2020 10:50 PM > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > Wang, Zhihong <zhihong.wang@intel.com> > Cc: dev@dpdk.org; Honnappa Nagarahalli > <Honnappa.Nagarahalli@arm.com>; jerinj@marvell.com > Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path > > > > On 4/28/20 4:43 PM, Liu, Yong wrote: > > > > > >> -----Original Message----- > >> From: Maxime Coquelin <maxime.coquelin@redhat.com> > >> Sent: Tuesday, April 28, 2020 9:46 PM > >> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong > <xiaolong.ye@intel.com>; > >> Wang, Zhihong <zhihong.wang@intel.com> > >> Cc: dev@dpdk.org; Honnappa Nagarahalli > >> <Honnappa.Nagarahalli@arm.com>; jerinj@marvell.com > >> Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx > path > >> > >> > >> > >> On 4/28/20 3:01 PM, Liu, Yong wrote: > >>>>> Maxime, > >>>>> Thanks for point it out, it will add extra cache miss in datapath. > >>>>> And its impact on performance is around 1% in loopback case. > >>>> Ok, thanks for doing the test. I'll try to run some PVP benchmarks > >>>> on my side because when doing IO loopback, the cache pressure is > >>>> much less important. > >>>> > >>>>> While benefit of vectorized path will be more than that number. > >>>> Ok, but I disagree for two reasons: > >>>> 1. You have to keep in mind than non-vectorized is the default and > >>>> encouraged mode to use. Indeed, it takes a lot of shortcuts like not > >>>> checking header length (so no error stats), etc... > >>>> > >>> Ok, I will keep non-vectorized same as before. > >>> > >>>> 2. It's like saying it's OK it degrades by 5% on $CPU_VENDOR_A > because > >>>> the gain is 20% on $CPU_VENDOR_B. > >>>> > >>>> In the case we see more degradation in real-world scenario, you might > >>>> want to consider using ifdefs to avoid adding padding in the non- > >>>> vectorized case, like you did to differentiate Virtio PMD to Virtio-user > >>>> PMD in patch 7. > >>>> > >>> Maxime, > >>> The performance difference is so slight, so I ignored for it look like a > >> sampling error. > >> > >> Agree for IO loopback, but it adds one more cache line access per burst, > >> which might be see in some real-life use cases. > >> > >>> It maybe not suitable to add new configuration for such setting which > >> only used inside driver. > >> > >> Wait, the Virtio-user #ifdef is based on the defconfig options? How can > >> it work since both Virtio PMD and Virtio-user PMD can be selected at the > >> same time? > >> > >> I thought it was a define set before the headers inclusion and unset > >> afterwards, but I didn't checked carefully. > >> > > > > Maxime, > > The difference between virtio PMD and Virtio-user PMD addresses is > handled by vq->offset. > > > > When virtio PMD is running, offset will be set to buf_iova. > > vq->offset = offsetof(struct rte_mbuf, buf_iova); > > > > When virtio_user PMD is running, offset will be set to buf_addr. 
> > vq->offset = offsetof(struct rte_mbuf, buf_addr); > > Ok, but below is a build time check: > > +#ifdef RTE_VIRTIO_USER > + __m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq- > >offset); > +#else > + __m128i flag_offset = _mm_set_epi64x(flags_temp, 0); > +#endif > > So how can it work for a single build for both Virtio and Virtio-user? > Sorry, here is an implementation error. vq->offset should be used in descs_base for getting the iova address. It will work the same as VIRTIO_MBUF_ADDR macro. > >>> Virtio driver can check whether virtqueue is using vectorized path when > >> initialization, will use padded structure if it is. > >>> I have added some tested code and now performance came back. Since > >> code has changed in initialization process, it need some time for > regression > >> check. > >> > >> Ok, works for me. > >> > >> I am investigating a linkage issue with your series, which does not > >> happen systematically (see below, it happens also with clang). David > >> pointed me to some Intel patches removing the usage if __rte_weak, > >> could it be related? > >> > > > > I checked David's patch, it only changed i40e driver. Meanwhile attribute > __rte_weak should still be in virtio_rxtx.c. > > I will follow David's patch, eliminate the usage of weak attribute. > > Yeah, I meant below issue could be linked to __rte_weak, not that i40e > patch was the cause of this problem. > Maxime, I haven't seen any build issue related to __rte_weak both with gcc and clang. Thanks, Marvin ^ permalink raw reply [flat|nested] 162+ messages in thread
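The remark that it "will work the same as VIRTIO_MBUF_ADDR" refers to a runtime lookup through vq->offset instead of any #ifdef. A hypothetical helper showing the idea (the driver's actual macro works along these lines; the helper name here is made up for illustration):

/* vq->offset is offsetof(struct rte_mbuf, buf_iova) for the virtio PCI
 * PMD and offsetof(struct rte_mbuf, buf_addr) for virtio-user, so one
 * load at that offset yields the right descriptor address either way. */
static inline uint64_t
virtio_desc_addr(struct rte_mbuf *mb, struct virtqueue *vq)
{
	return (uint64_t)(*(uintptr_t *)((uintptr_t)mb + vq->offset));
}

A single build can then serve both PMDs, which answers the question raised in the previous message.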
* Re: [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-28 15:35 ` Liu, Yong @ 2020-04-28 15:40 ` Maxime Coquelin 2020-04-28 15:55 ` Liu, Yong 0 siblings, 1 reply; 162+ messages in thread From: Maxime Coquelin @ 2020-04-28 15:40 UTC (permalink / raw) To: Liu, Yong, Ye, Xiaolong, Wang, Zhihong; +Cc: dev, Honnappa Nagarahalli, jerinj On 4/28/20 5:35 PM, Liu, Yong wrote: > > >> -----Original Message----- >> From: Maxime Coquelin <maxime.coquelin@redhat.com> >> Sent: Tuesday, April 28, 2020 10:50 PM >> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; >> Wang, Zhihong <zhihong.wang@intel.com> >> Cc: dev@dpdk.org; Honnappa Nagarahalli >> <Honnappa.Nagarahalli@arm.com>; jerinj@marvell.com >> Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path >> >> >> >> On 4/28/20 4:43 PM, Liu, Yong wrote: >>> >>> >>>> -----Original Message----- >>>> From: Maxime Coquelin <maxime.coquelin@redhat.com> >>>> Sent: Tuesday, April 28, 2020 9:46 PM >>>> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong >> <xiaolong.ye@intel.com>; >>>> Wang, Zhihong <zhihong.wang@intel.com> >>>> Cc: dev@dpdk.org; Honnappa Nagarahalli >>>> <Honnappa.Nagarahalli@arm.com>; jerinj@marvell.com >>>> Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx >> path >>>> >>>> >>>> >>>> On 4/28/20 3:01 PM, Liu, Yong wrote: >>>>>>> Maxime, >>>>>>> Thanks for point it out, it will add extra cache miss in datapath. >>>>>>> And its impact on performance is around 1% in loopback case. >>>>>> Ok, thanks for doing the test. I'll try to run some PVP benchmarks >>>>>> on my side because when doing IO loopback, the cache pressure is >>>>>> much less important. >>>>>> >>>>>>> While benefit of vectorized path will be more than that number. >>>>>> Ok, but I disagree for two reasons: >>>>>> 1. You have to keep in mind than non-vectorized is the default and >>>>>> encouraged mode to use. Indeed, it takes a lot of shortcuts like not >>>>>> checking header length (so no error stats), etc... >>>>>> >>>>> Ok, I will keep non-vectorized same as before. >>>>> >>>>>> 2. It's like saying it's OK it degrades by 5% on $CPU_VENDOR_A >> because >>>>>> the gain is 20% on $CPU_VENDOR_B. >>>>>> >>>>>> In the case we see more degradation in real-world scenario, you might >>>>>> want to consider using ifdefs to avoid adding padding in the non- >>>>>> vectorized case, like you did to differentiate Virtio PMD to Virtio-user >>>>>> PMD in patch 7. >>>>>> >>>>> Maxime, >>>>> The performance difference is so slight, so I ignored for it look like a >>>> sampling error. >>>> >>>> Agree for IO loopback, but it adds one more cache line access per burst, >>>> which might be see in some real-life use cases. >>>> >>>>> It maybe not suitable to add new configuration for such setting which >>>> only used inside driver. >>>> >>>> Wait, the Virtio-user #ifdef is based on the defconfig options? How can >>>> it work since both Virtio PMD and Virtio-user PMD can be selected at the >>>> same time? >>>> >>>> I thought it was a define set before the headers inclusion and unset >>>> afterwards, but I didn't checked carefully. >>>> >>> >>> Maxime, >>> The difference between virtio PMD and Virtio-user PMD addresses is >> handled by vq->offset. >>> >>> When virtio PMD is running, offset will be set to buf_iova. >>> vq->offset = offsetof(struct rte_mbuf, buf_iova); >>> >>> When virtio_user PMD is running, offset will be set to buf_addr. 
>>> vq->offset = offsetof(struct rte_mbuf, buf_addr); >> >> Ok, but below is a build time check: >> >> +#ifdef RTE_VIRTIO_USER >> + __m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq- >>> offset); >> +#else >> + __m128i flag_offset = _mm_set_epi64x(flags_temp, 0); >> +#endif >> >> So how can it work for a single build for both Virtio and Virtio-user? >> > > Sorry, here is an implementation error. vq->offset should be used in descs_base for getting the iova address. > It will work the same as VIRTIO_MBUF_ADDR macro. > >>>>> Virtio driver can check whether virtqueue is using vectorized path when >>>> initialization, will use padded structure if it is. >>>>> I have added some tested code and now performance came back. Since >>>> code has changed in initialization process, it need some time for >> regression >>>> check. >>>> >>>> Ok, works for me. >>>> >>>> I am investigating a linkage issue with your series, which does not >>>> happen systematically (see below, it happens also with clang). David >>>> pointed me to some Intel patches removing the usage if __rte_weak, >>>> could it be related? >>>> >>> >>> I checked David's patch, it only changed i40e driver. Meanwhile attribute >> __rte_weak should still be in virtio_rxtx.c. >>> I will follow David's patch, eliminate the usage of weak attribute. >> >> Yeah, I meant below issue could be linked to __rte_weak, not that i40e >> patch was the cause of this problem. >> > > Maxime, > I haven't seen any build issue related to __rte_weak both with gcc and clang. Note that this build (which does not fail systematically) is when using binutils 2.30, which cause AVX512 support to be disabled. > Thanks, > Marvin > ^ permalink raw reply [flat|nested] 162+ messages in thread
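The binutils 2.30 detail matters: with that assembler DPDK disables AVX512, virtio_rxtx_packed_avx.c is not compiled, and only the __rte_weak stubs are left to satisfy set_rxtx_funcs(). One way to stop depending on weak linkage, which Marvin already offered to do above, is to guard the reference itself. This is a sketch only; the guard placement and the vtpci_packed_queue()/hw->use_vec_tx condition follow the quoted patches but are assumptions, not the series' final form:

#ifdef CC_AVX512_SUPPORT
	/* Reference the vectorized burst function only when the AVX512
	 * object is actually part of this build. */
	if (vtpci_packed_queue(hw) && hw->use_vec_tx)
		eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec;
#endif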
* Re: [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-28 15:40 ` Maxime Coquelin @ 2020-04-28 15:55 ` Liu, Yong 0 siblings, 0 replies; 162+ messages in thread From: Liu, Yong @ 2020-04-28 15:55 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong Cc: dev, Honnappa Nagarahalli, jerinj > -----Original Message----- > From: Maxime Coquelin <maxime.coquelin@redhat.com> > Sent: Tuesday, April 28, 2020 11:40 PM > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > Wang, Zhihong <zhihong.wang@intel.com> > Cc: dev@dpdk.org; Honnappa Nagarahalli > <Honnappa.Nagarahalli@arm.com>; jerinj@marvell.com > Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path > > > > On 4/28/20 5:35 PM, Liu, Yong wrote: > > > > > >> -----Original Message----- > >> From: Maxime Coquelin <maxime.coquelin@redhat.com> > >> Sent: Tuesday, April 28, 2020 10:50 PM > >> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong > <xiaolong.ye@intel.com>; > >> Wang, Zhihong <zhihong.wang@intel.com> > >> Cc: dev@dpdk.org; Honnappa Nagarahalli > >> <Honnappa.Nagarahalli@arm.com>; jerinj@marvell.com > >> Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx > path > >> > >> > >> > >> On 4/28/20 4:43 PM, Liu, Yong wrote: > >>> > >>> > >>>> -----Original Message----- > >>>> From: Maxime Coquelin <maxime.coquelin@redhat.com> > >>>> Sent: Tuesday, April 28, 2020 9:46 PM > >>>> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong > >> <xiaolong.ye@intel.com>; > >>>> Wang, Zhihong <zhihong.wang@intel.com> > >>>> Cc: dev@dpdk.org; Honnappa Nagarahalli > >>>> <Honnappa.Nagarahalli@arm.com>; jerinj@marvell.com > >>>> Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx > >> path > >>>> > >>>> > >>>> > >>>> On 4/28/20 3:01 PM, Liu, Yong wrote: > >>>>>>> Maxime, > >>>>>>> Thanks for point it out, it will add extra cache miss in datapath. > >>>>>>> And its impact on performance is around 1% in loopback case. > >>>>>> Ok, thanks for doing the test. I'll try to run some PVP benchmarks > >>>>>> on my side because when doing IO loopback, the cache pressure is > >>>>>> much less important. > >>>>>> > >>>>>>> While benefit of vectorized path will be more than that number. > >>>>>> Ok, but I disagree for two reasons: > >>>>>> 1. You have to keep in mind than non-vectorized is the default and > >>>>>> encouraged mode to use. Indeed, it takes a lot of shortcuts like not > >>>>>> checking header length (so no error stats), etc... > >>>>>> > >>>>> Ok, I will keep non-vectorized same as before. > >>>>> > >>>>>> 2. It's like saying it's OK it degrades by 5% on $CPU_VENDOR_A > >> because > >>>>>> the gain is 20% on $CPU_VENDOR_B. > >>>>>> > >>>>>> In the case we see more degradation in real-world scenario, you > might > >>>>>> want to consider using ifdefs to avoid adding padding in the non- > >>>>>> vectorized case, like you did to differentiate Virtio PMD to Virtio- > user > >>>>>> PMD in patch 7. > >>>>>> > >>>>> Maxime, > >>>>> The performance difference is so slight, so I ignored for it look like a > >>>> sampling error. > >>>> > >>>> Agree for IO loopback, but it adds one more cache line access per > burst, > >>>> which might be see in some real-life use cases. > >>>> > >>>>> It maybe not suitable to add new configuration for such setting > which > >>>> only used inside driver. > >>>> > >>>> Wait, the Virtio-user #ifdef is based on the defconfig options? 
How > can > >>>> it work since both Virtio PMD and Virtio-user PMD can be selected at > the > >>>> same time? > >>>> > >>>> I thought it was a define set before the headers inclusion and unset > >>>> afterwards, but I didn't checked carefully. > >>>> > >>> > >>> Maxime, > >>> The difference between virtio PMD and Virtio-user PMD addresses is > >> handled by vq->offset. > >>> > >>> When virtio PMD is running, offset will be set to buf_iova. > >>> vq->offset = offsetof(struct rte_mbuf, buf_iova); > >>> > >>> When virtio_user PMD is running, offset will be set to buf_addr. > >>> vq->offset = offsetof(struct rte_mbuf, buf_addr); > >> > >> Ok, but below is a build time check: > >> > >> +#ifdef RTE_VIRTIO_USER > >> + __m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq- > >>> offset); > >> +#else > >> + __m128i flag_offset = _mm_set_epi64x(flags_temp, 0); > >> +#endif > >> > >> So how can it work for a single build for both Virtio and Virtio-user? > >> > > > > Sorry, here is an implementation error. vq->offset should be used in > descs_base for getting the iova address. > > It will work the same as VIRTIO_MBUF_ADDR macro. > > > >>>>> Virtio driver can check whether virtqueue is using vectorized path > when > >>>> initialization, will use padded structure if it is. > >>>>> I have added some tested code and now performance came back. > Since > >>>> code has changed in initialization process, it need some time for > >> regression > >>>> check. > >>>> > >>>> Ok, works for me. > >>>> > >>>> I am investigating a linkage issue with your series, which does not > >>>> happen systematically (see below, it happens also with clang). David > >>>> pointed me to some Intel patches removing the usage if __rte_weak, > >>>> could it be related? > >>>> > >>> > >>> I checked David's patch, it only changed i40e driver. Meanwhile > attribute > >> __rte_weak should still be in virtio_rxtx.c. > >>> I will follow David's patch, eliminate the usage of weak attribute. > >> > >> Yeah, I meant below issue could be linked to __rte_weak, not that i40e > >> patch was the cause of this problem. > >> > > > > Maxime, > > I haven't seen any build issue related to __rte_weak both with gcc and > clang. > > Note that this build (which does not fail systematically) is when using > binutils 2.30, which cause AVX512 support to be disabled. > Just change to binutils 2.30, AVX512 code will be skipped as expected in meson build. Could you please supply more information, I will try to reproduce it. > > Thanks, > > Marvin > > ^ permalink raw reply [flat|nested] 162+ messages in thread
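One more distinction that runs through this sub-thread: CC_AVX512_SUPPORT only says the toolchain (compiler plus binutils) can emit AVX512 instructions, not that the CPU the PMD ends up running on has them. The runtime side belongs to the election patch (8/9); the sketch below shows the kind of check that typically sits in device initialization, and is an assumption rather than a quote from that patch:

#include <rte_cpuflags.h>

	/* Enable the vectorized packed ring paths only when the running
	 * CPU reports the instruction sets the AVX512 code needs. */
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
	    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW) &&
	    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL)) {
		hw->use_vec_rx = 1;
		hw->use_vec_tx = 1;
	}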
* Re: [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-28 13:01 ` Liu, Yong 2020-04-28 13:46 ` Maxime Coquelin @ 2020-04-28 17:01 ` Liu, Yong 1 sibling, 0 replies; 162+ messages in thread From: Liu, Yong @ 2020-04-28 17:01 UTC (permalink / raw) To: Maxime Coquelin, Ye, Xiaolong, Wang, Zhihong; +Cc: dev > -----Original Message----- > From: Liu, Yong > Sent: Tuesday, April 28, 2020 9:01 PM > To: 'Maxime Coquelin' <maxime.coquelin@redhat.com>; Ye, Xiaolong > <xiaolong.ye@intel.com>; Wang, Zhihong <zhihong.wang@intel.com> > Cc: dev@dpdk.org > Subject: RE: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path > > > > > -----Original Message----- > > From: Maxime Coquelin <maxime.coquelin@redhat.com> > > Sent: Tuesday, April 28, 2020 4:44 PM > > To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>; > > Wang, Zhihong <zhihong.wang@intel.com> > > Cc: dev@dpdk.org > > Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx > path > > > > > > > > On 4/28/20 3:14 AM, Liu, Yong wrote: > > > > > > > > >> -----Original Message----- > > >> From: Maxime Coquelin <maxime.coquelin@redhat.com> > > >> Sent: Monday, April 27, 2020 7:21 PM > > >> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong > > <xiaolong.ye@intel.com>; > > >> Wang, Zhihong <zhihong.wang@intel.com> > > >> Cc: dev@dpdk.org > > >> Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx > > path > > >> > > >> > > >> > > >> On 4/26/20 4:19 AM, Marvin Liu wrote: > > >>> Optimize packed ring Rx path with SIMD instructions. Solution of > > >>> optimization is pretty like vhost, is that split path into batch and > > >>> single functions. Batch function is further optimized by AVX512 > > >>> instructions. Also pad desc extra structure to 16 bytes aligned, thus > > >>> four elements will be saved in one batch. 
> > >>> > > >>> Signed-off-by: Marvin Liu <yong.liu@intel.com> > > >>> > > >>> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile > > >>> index c9edb84ee..102b1deab 100644 > > >>> --- a/drivers/net/virtio/Makefile > > >>> +++ b/drivers/net/virtio/Makefile > > >>> @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) > > >> $(CONFIG_RTE_ARCH_ARM64)),) > > >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += > > virtio_rxtx_simple_neon.c > > >>> endif > > >>> > > >>> +ifneq ($(FORCE_DISABLE_AVX512), y) > > >>> + CC_AVX512_SUPPORT=\ > > >>> + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ > > >>> + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ > > >>> + grep -q AVX512 && echo 1) > > >>> +endif > > >>> + > > >>> +ifeq ($(CC_AVX512_SUPPORT), 1) > > >>> +CFLAGS += -DCC_AVX512_SUPPORT > > >>> +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += > virtio_rxtx_packed_avx.c > > >>> + > > >>> +ifeq ($(RTE_TOOLCHAIN), gcc) > > >>> +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) > > >>> +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA > > >>> +endif > > >>> +endif > > >>> + > > >>> +ifeq ($(RTE_TOOLCHAIN), clang) > > >>> +ifeq ($(shell test > > $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) - > > >> ge 37 && echo 1), 1) > > >>> +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA > > >>> +endif > > >>> +endif > > >>> + > > >>> +ifeq ($(RTE_TOOLCHAIN), icc) > > >>> +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) > > >>> +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA > > >>> +endif > > >>> +endif > > >>> + > > >>> +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw - > > mavx512vl > > >>> +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) > > >>> +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds > > >>> +endif > > >>> +endif > > >>> + > > >>> ifeq ($(CONFIG_RTE_VIRTIO_USER),y) > > >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += > virtio_user/vhost_user.c > > >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += > > virtio_user/vhost_kernel.c > > >>> diff --git a/drivers/net/virtio/meson.build > > b/drivers/net/virtio/meson.build > > >>> index 15150eea1..8e68c3039 100644 > > >>> --- a/drivers/net/virtio/meson.build > > >>> +++ b/drivers/net/virtio/meson.build > > >>> @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c', > > >>> deps += ['kvargs', 'bus_pci'] > > >>> > > >>> if arch_subdir == 'x86' > > >>> + if '-mno-avx512f' not in machine_args > > >>> + if cc.has_argument('-mavx512f') and cc.has_argument('- > > >> mavx512vl') and cc.has_argument('-mavx512bw') > > >>> + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] > > >>> + cflags += ['-DCC_AVX512_SUPPORT'] > > >>> + if (toolchain == 'gcc' and > > >> cc.version().version_compare('>=8.3.0')) > > >>> + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' > > >>> + elif (toolchain == 'clang' and > > >> cc.version().version_compare('>=3.7.0')) > > >>> + cflags += '- > > >> DVHOST_CLANG_UNROLL_PRAGMA' > > >>> + elif (toolchain == 'icc' and > > >> cc.version().version_compare('>=16.0.0')) > > >>> + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' > > >>> + endif > > >>> + sources += files('virtio_rxtx_packed_avx.c') > > >>> + endif > > >>> + endif > > >>> sources += files('virtio_rxtx_simple_sse.c') > > >>> elif arch_subdir == 'ppc' > > >>> sources += files('virtio_rxtx_simple_altivec.c') > > >>> diff --git a/drivers/net/virtio/virtio_ethdev.h > > >> b/drivers/net/virtio/virtio_ethdev.h > > >>> index febaf17a8..5c112cac7 100644 > > >>> --- a/drivers/net/virtio/virtio_ethdev.h > > >>> +++ b/drivers/net/virtio/virtio_ethdev.h > > >>> @@ -105,6 
+105,9 @@ uint16_t virtio_xmit_pkts_inorder(void > > *tx_queue, > > >> struct rte_mbuf **tx_pkts, > > >>> uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf > > **rx_pkts, > > >>> uint16_t nb_pkts); > > >>> > > >>> +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct > rte_mbuf > > >> **rx_pkts, > > >>> + uint16_t nb_pkts); > > >>> + > > >>> int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); > > >>> > > >>> void virtio_interrupt_handler(void *param); > > >>> diff --git a/drivers/net/virtio/virtio_rxtx.c > > b/drivers/net/virtio/virtio_rxtx.c > > >>> index a549991aa..534562cca 100644 > > >>> --- a/drivers/net/virtio/virtio_rxtx.c > > >>> +++ b/drivers/net/virtio/virtio_rxtx.c > > >>> @@ -2030,3 +2030,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, > > >>> > > >>> return nb_tx; > > >>> } > > >>> + > > >>> +__rte_weak uint16_t > > >>> +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, > > >>> + struct rte_mbuf **rx_pkts __rte_unused, > > >>> + uint16_t nb_pkts __rte_unused) > > >>> +{ > > >>> + return 0; > > >>> +} > > >>> diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c > > >> b/drivers/net/virtio/virtio_rxtx_packed_avx.c > > >>> new file mode 100644 > > >>> index 000000000..8a7b459eb > > >>> --- /dev/null > > >>> +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c > > >>> @@ -0,0 +1,374 @@ > > >>> +/* SPDX-License-Identifier: BSD-3-Clause > > >>> + * Copyright(c) 2010-2020 Intel Corporation > > >>> + */ > > >>> + > > >>> +#include <stdint.h> > > >>> +#include <stdio.h> > > >>> +#include <stdlib.h> > > >>> +#include <string.h> > > >>> +#include <errno.h> > > >>> + > > >>> +#include <rte_net.h> > > >>> + > > >>> +#include "virtio_logs.h" > > >>> +#include "virtio_ethdev.h" > > >>> +#include "virtio_pci.h" > > >>> +#include "virtqueue.h" > > >>> + > > >>> +#define BYTE_SIZE 8 > > >>> +/* flag bits offset in packed ring desc higher 64bits */ > > >>> +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, > flags) > > - \ > > >>> + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) > > >>> + > > >>> +#define PACKED_FLAGS_MASK ((0ULL | > > >> VRING_PACKED_DESC_F_AVAIL_USED) << \ > > >>> + FLAGS_BITS_OFFSET) > > >>> + > > >>> +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ > > >>> + sizeof(struct vring_packed_desc)) > > >>> +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) > > >>> + > > >>> +#ifdef VIRTIO_GCC_UNROLL_PRAGMA > > >>> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC > unroll > > 4") > > >> \ > > >>> + for (iter = val; iter < size; iter++) > > >>> +#endif > > >>> + > > >>> +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA > > >>> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") > \ > > >>> + for (iter = val; iter < size; iter++) > > >>> +#endif > > >>> + > > >>> +#ifdef VIRTIO_ICC_UNROLL_PRAGMA > > >>> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll > (4)") > > \ > > >>> + for (iter = val; iter < size; iter++) > > >>> +#endif > > >>> + > > >>> +#ifndef virtio_for_each_try_unroll > > >>> +#define virtio_for_each_try_unroll(iter, val, num) \ > > >>> + for (iter = val; iter < num; iter++) > > >>> +#endif > > >>> + > > >>> +static inline void > > >>> +virtio_update_batch_stats(struct virtnet_stats *stats, > > >>> + uint16_t pkt_len1, > > >>> + uint16_t pkt_len2, > > >>> + uint16_t pkt_len3, > > >>> + uint16_t pkt_len4) > > >>> +{ > > >>> + stats->bytes += pkt_len1; > > >>> + stats->bytes += pkt_len2; > > >>> + stats->bytes += pkt_len3; > > >>> + stats->bytes += pkt_len4; > > >>> +} > > >>> + > 
> >>> +/* Optionally fill offload information in structure */ > > >>> +static inline int > > >>> +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) > > >>> +{ > > >>> + struct rte_net_hdr_lens hdr_lens; > > >>> + uint32_t hdrlen, ptype; > > >>> + int l4_supported = 0; > > >>> + > > >>> + /* nothing to do */ > > >>> + if (hdr->flags == 0) > > >>> + return 0; > > >>> + > > >>> + /* GSO not support in vec path, skip check */ > > >>> + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; > > >>> + > > >>> + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); > > >>> + m->packet_type = ptype; > > >>> + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || > > >>> + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || > > >>> + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) > > >>> + l4_supported = 1; > > >>> + > > >>> + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { > > >>> + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; > > >>> + if (hdr->csum_start <= hdrlen && l4_supported) { > > >>> + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; > > >>> + } else { > > >>> + /* Unknown proto or tunnel, do sw cksum. We can > > >> assume > > >>> + * the cksum field is in the first segment since the > > >>> + * buffers we provided to the host are large enough. > > >>> + * In case of SCTP, this will be wrong since it's a CRC > > >>> + * but there's nothing we can do. > > >>> + */ > > >>> + uint16_t csum = 0, off; > > >>> + > > >>> + rte_raw_cksum_mbuf(m, hdr->csum_start, > > >>> + rte_pktmbuf_pkt_len(m) - hdr->csum_start, > > >>> + &csum); > > >>> + if (likely(csum != 0xffff)) > > >>> + csum = ~csum; > > >>> + off = hdr->csum_offset + hdr->csum_start; > > >>> + if (rte_pktmbuf_data_len(m) >= off + 1) > > >>> + *rte_pktmbuf_mtod_offset(m, uint16_t *, > > >>> + off) = csum; > > >>> + } > > >>> + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && > > >> l4_supported) { > > >>> + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; > > >>> + } > > >>> + > > >>> + return 0; > > >>> +} > > >>> + > > >>> +static inline uint16_t > > >>> +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, > > >>> + struct rte_mbuf **rx_pkts) > > >>> +{ > > >>> + struct virtqueue *vq = rxvq->vq; > > >>> + struct virtio_hw *hw = vq->hw; > > >>> + uint16_t hdr_size = hw->vtnet_hdr_size; > > >>> + uint64_t addrs[PACKED_BATCH_SIZE]; > > >>> + uint16_t id = vq->vq_used_cons_idx; > > >>> + uint8_t desc_stats; > > >>> + uint16_t i; > > >>> + void *desc_addr; > > >>> + > > >>> + if (id & PACKED_BATCH_MASK) > > >>> + return -1; > > >>> + > > >>> + if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries)) > > >>> + return -1; > > >>> + > > >>> + /* only care avail/used bits */ > > >>> + __m512i v_mask = _mm512_maskz_set1_epi64(0xaa, > > >> PACKED_FLAGS_MASK); > > >>> + desc_addr = &vq->vq_packed.ring.desc[id]; > > >>> + > > >>> + __m512i v_desc = _mm512_loadu_si512(desc_addr); > > >>> + __m512i v_flag = _mm512_and_epi64(v_desc, v_mask); > > >>> + > > >>> + __m512i v_used_flag = _mm512_setzero_si512(); > > >>> + if (vq->vq_packed.used_wrap_counter) > > >>> + v_used_flag = _mm512_maskz_set1_epi64(0xaa, > > >> PACKED_FLAGS_MASK); > > >>> + > > >>> + /* Check all descs are used */ > > >>> + desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag); > > >>> + if (desc_stats) > > >>> + return -1; > > >>> + > > >>> + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { > > >>> + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; > > >>> + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); > > >>> + > > >>> + 
addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1; > > >>> + } > > >>> + > > >>> + /* > > >>> + * load len from desc, store into mbuf pkt_len and data_len > > >>> + * len limiated by l6bit buf_len, pkt_len[16:31] can be ignored > > >>> + */ > > >>> + const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12; > > >>> + __m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, > > >> 0xAA); > > >>> + > > >>> + /* reduce hdr_len from pkt_len and data_len */ > > >>> + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask, > > >>> + (uint32_t)-hdr_size); > > >>> + > > >>> + __m512i v_value = _mm512_add_epi32(values, mbuf_len_offset); > > >>> + > > >>> + /* assert offset of data_len */ > > >>> + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) != > > >>> + offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8); > > >>> + > > >>> + __m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3], > > >>> + addrs[2] + 8, addrs[2], > > >>> + addrs[1] + 8, addrs[1], > > >>> + addrs[0] + 8, addrs[0]); > > >>> + /* batch store into mbufs */ > > >>> + _mm512_i64scatter_epi64(0, v_index, v_value, 1); > > >>> + > > >>> + if (hw->has_rx_offload) { > > >>> + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { > > >>> + char *addr = (char *)rx_pkts[i]->buf_addr + > > >>> + RTE_PKTMBUF_HEADROOM - hdr_size; > > >>> + virtio_vec_rx_offload(rx_pkts[i], > > >>> + (struct virtio_net_hdr *)addr); > > >>> + } > > >>> + } > > >>> + > > >>> + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, > > >>> + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, > > >>> + rx_pkts[3]->pkt_len); > > >>> + > > >>> + vq->vq_free_cnt += PACKED_BATCH_SIZE; > > >>> + > > >>> + vq->vq_used_cons_idx += PACKED_BATCH_SIZE; > > >>> + if (vq->vq_used_cons_idx >= vq->vq_nentries) { > > >>> + vq->vq_used_cons_idx -= vq->vq_nentries; > > >>> + vq->vq_packed.used_wrap_counter ^= 1; > > >>> + } > > >>> + > > >>> + return 0; > > >>> +} > > >>> + > > >>> +static uint16_t > > >>> +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, > > >>> + struct rte_mbuf **rx_pkts) > > >>> +{ > > >>> + uint16_t used_idx, id; > > >>> + uint32_t len; > > >>> + struct virtqueue *vq = rxvq->vq; > > >>> + struct virtio_hw *hw = vq->hw; > > >>> + uint32_t hdr_size = hw->vtnet_hdr_size; > > >>> + struct virtio_net_hdr *hdr; > > >>> + struct vring_packed_desc *desc; > > >>> + struct rte_mbuf *cookie; > > >>> + > > >>> + desc = vq->vq_packed.ring.desc; > > >>> + used_idx = vq->vq_used_cons_idx; > > >>> + if (!desc_is_used(&desc[used_idx], vq)) > > >>> + return -1; > > >>> + > > >>> + len = desc[used_idx].len; > > >>> + id = desc[used_idx].id; > > >>> + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; > > >>> + if (unlikely(cookie == NULL)) { > > >>> + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie > > >> at %u", > > >>> + vq->vq_used_cons_idx); > > >>> + return -1; > > >>> + } > > >>> + rte_prefetch0(cookie); > > >>> + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); > > >>> + > > >>> + cookie->data_off = RTE_PKTMBUF_HEADROOM; > > >>> + cookie->ol_flags = 0; > > >>> + cookie->pkt_len = (uint32_t)(len - hdr_size); > > >>> + cookie->data_len = (uint32_t)(len - hdr_size); > > >>> + > > >>> + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + > > >>> + RTE_PKTMBUF_HEADROOM - > > >> hdr_size); > > >>> + if (hw->has_rx_offload) > > >>> + virtio_vec_rx_offload(cookie, hdr); > > >>> + > > >>> + *rx_pkts = cookie; > > >>> + > > >>> + rxvq->stats.bytes += cookie->pkt_len; > > >>> + > > >>> + vq->vq_free_cnt++; > > >>> + 
vq->vq_used_cons_idx++; > > >>> + if (vq->vq_used_cons_idx >= vq->vq_nentries) { > > >>> + vq->vq_used_cons_idx -= vq->vq_nentries; > > >>> + vq->vq_packed.used_wrap_counter ^= 1; > > >>> + } > > >>> + > > >>> + return 0; > > >>> +} > > >>> + > > >>> +static inline void > > >>> +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, > > >>> + struct rte_mbuf **cookie, > > >>> + uint16_t num) > > >>> +{ > > >>> + struct virtqueue *vq = rxvq->vq; > > >>> + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; > > >>> + uint16_t flags = vq->vq_packed.cached_flags; > > >>> + struct virtio_hw *hw = vq->hw; > > >>> + struct vq_desc_extra *dxp; > > >>> + uint16_t idx, i; > > >>> + uint16_t batch_num, total_num = 0; > > >>> + uint16_t head_idx = vq->vq_avail_idx; > > >>> + uint16_t head_flag = vq->vq_packed.cached_flags; > > >>> + uint64_t addr; > > >>> + > > >>> + do { > > >>> + idx = vq->vq_avail_idx; > > >>> + > > >>> + batch_num = PACKED_BATCH_SIZE; > > >>> + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) > > >>> + batch_num = vq->vq_nentries - idx; > > >>> + if (unlikely((total_num + batch_num) > num)) > > >>> + batch_num = num - total_num; > > >>> + > > >>> + virtio_for_each_try_unroll(i, 0, batch_num) { > > >>> + dxp = &vq->vq_descx[idx + i]; > > >>> + dxp->cookie = (void *)cookie[total_num + i]; > > >>> + > > >>> + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], > > >> vq) + > > >>> + RTE_PKTMBUF_HEADROOM - hw- > > >>> vtnet_hdr_size; > > >>> + start_dp[idx + i].addr = addr; > > >>> + start_dp[idx + i].len = cookie[total_num + i]- > > >buf_len > > >>> + - RTE_PKTMBUF_HEADROOM + hw- > > >>> vtnet_hdr_size; > > >>> + if (total_num || i) { > > >>> + virtqueue_store_flags_packed(&start_dp[idx > > >> + i], > > >>> + flags, hw->weak_barriers); > > >>> + } > > >>> + } > > >>> + > > >>> + vq->vq_avail_idx += batch_num; > > >>> + if (vq->vq_avail_idx >= vq->vq_nentries) { > > >>> + vq->vq_avail_idx -= vq->vq_nentries; > > >>> + vq->vq_packed.cached_flags ^= > > >>> + VRING_PACKED_DESC_F_AVAIL_USED; > > >>> + flags = vq->vq_packed.cached_flags; > > >>> + } > > >>> + total_num += batch_num; > > >>> + } while (total_num < num); > > >>> + > > >>> + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, > > >>> + hw->weak_barriers); > > >>> + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); > > >>> +} > > >>> + > > >>> +uint16_t > > >>> +virtio_recv_pkts_packed_vec(void *rx_queue, > > >>> + struct rte_mbuf **rx_pkts, > > >>> + uint16_t nb_pkts) > > >>> +{ > > >>> + struct virtnet_rx *rxvq = rx_queue; > > >>> + struct virtqueue *vq = rxvq->vq; > > >>> + struct virtio_hw *hw = vq->hw; > > >>> + uint16_t num, nb_rx = 0; > > >>> + uint32_t nb_enqueued = 0; > > >>> + uint16_t free_cnt = vq->vq_free_thresh; > > >>> + > > >>> + if (unlikely(hw->started == 0)) > > >>> + return nb_rx; > > >>> + > > >>> + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); > > >>> + if (likely(num > PACKED_BATCH_SIZE)) > > >>> + num = num - ((vq->vq_used_cons_idx + num) % > > >> PACKED_BATCH_SIZE); > > >>> + > > >>> + while (num) { > > >>> + if (!virtqueue_dequeue_batch_packed_vec(rxvq, > > >>> + &rx_pkts[nb_rx])) { > > >>> + nb_rx += PACKED_BATCH_SIZE; > > >>> + num -= PACKED_BATCH_SIZE; > > >>> + continue; > > >>> + } > > >>> + if (!virtqueue_dequeue_single_packed_vec(rxvq, > > >>> + &rx_pkts[nb_rx])) { > > >>> + nb_rx++; > > >>> + num--; > > >>> + continue; > > >>> + } > > >>> + break; > > >>> + }; > > >>> + > > >>> + PMD_RX_LOG(DEBUG, "dequeue:%d", num); > > >>> + > > >>> + rxvq->stats.packets += nb_rx; 
> > >>> + > > >>> + if (likely(vq->vq_free_cnt >= free_cnt)) { > > >>> + struct rte_mbuf *new_pkts[free_cnt]; > > >>> + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, > > >>> + free_cnt) == 0)) { > > >>> + virtio_recv_refill_packed_vec(rxvq, new_pkts, > > >>> + free_cnt); > > >>> + nb_enqueued += free_cnt; > > >>> + } else { > > >>> + struct rte_eth_dev *dev = > > >>> + &rte_eth_devices[rxvq->port_id]; > > >>> + dev->data->rx_mbuf_alloc_failed += free_cnt; > > >>> + } > > >>> + } > > >>> + > > >>> + if (likely(nb_enqueued)) { > > >>> + if (unlikely(virtqueue_kick_prepare_packed(vq))) { > > >>> + virtqueue_notify(vq); > > >>> + PMD_RX_LOG(DEBUG, "Notified"); > > >>> + } > > >>> + } > > >>> + > > >>> + return nb_rx; > > >>> +} > > >>> diff --git a/drivers/net/virtio/virtio_user_ethdev.c > > >> b/drivers/net/virtio/virtio_user_ethdev.c > > >>> index 40ad786cc..c54698ad1 100644 > > >>> --- a/drivers/net/virtio/virtio_user_ethdev.c > > >>> +++ b/drivers/net/virtio/virtio_user_ethdev.c > > >>> @@ -528,6 +528,7 @@ virtio_user_eth_dev_alloc(struct > > rte_vdev_device > > >> *vdev) > > >>> hw->use_msix = 1; > > >>> hw->modern = 0; > > >>> hw->use_vec_rx = 0; > > >>> + hw->use_vec_tx = 0; > > >>> hw->use_inorder_rx = 0; > > >>> hw->use_inorder_tx = 0; > > >>> hw->virtio_user_dev = dev; > > >>> @@ -739,8 +740,19 @@ virtio_user_pmd_probe(struct > rte_vdev_device > > >> *dev) > > >>> goto end; > > >>> } > > >>> > > >>> - if (vectorized) > > >>> - hw->use_vec_rx = 1; > > >>> + if (vectorized) { > > >>> + if (packed_vq) { > > >>> +#if defined(CC_AVX512_SUPPORT) > > >>> + hw->use_vec_rx = 1; > > >>> + hw->use_vec_tx = 1; > > >>> +#else > > >>> + PMD_INIT_LOG(INFO, > > >>> + "building environment do not support > > packed > > >> ring vectorized"); > > >>> +#endif > > >>> + } else { > > >>> + hw->use_vec_rx = 1; > > >>> + } > > >>> + } > > >>> > > >>> rte_eth_dev_probing_finish(eth_dev); > > >>> ret = 0; > > >>> diff --git a/drivers/net/virtio/virtqueue.h > > b/drivers/net/virtio/virtqueue.h > > >>> index ca1c10499..ce0340743 100644 > > >>> --- a/drivers/net/virtio/virtqueue.h > > >>> +++ b/drivers/net/virtio/virtqueue.h > > >>> @@ -239,7 +239,8 @@ struct vq_desc_extra { > > >>> void *cookie; > > >>> uint16_t ndescs; > > >>> uint16_t next; > > >>> -}; > > >>> + uint8_t padding[4]; > > >>> +} __rte_packed __rte_aligned(16); > > >> > > >> Can't this introduce a performance impact for the non-vectorized > > >> case? I think of worse cache liens utilization. > > >> > > >> For example with a burst of 32 descriptors with 32B cachelines, before > > >> it would take 14 cachelines, after 16. So for each burst, one could face > > >> 2 extra cache misses. > > >> > > >> If you could run non-vectorized benchamrks with and without that > patch, > > >> I would be grateful. > > >> > > > > > > Maxime, > > > Thanks for point it out, it will add extra cache miss in datapath. > > > And its impact on performance is around 1% in loopback case. > > > > Ok, thanks for doing the test. I'll try to run some PVP benchmarks > > on my side because when doing IO loopback, the cache pressure is > > much less important. > > > > > While benefit of vectorized path will be more than that number. > > > > Ok, but I disagree for two reasons: > > 1. You have to keep in mind than non-vectorized is the default and > > encouraged mode to use. Indeed, it takes a lot of shortcuts like not > > checking header length (so no error stats), etc... > > > Ok, I will keep non-vectorized same as before. > > > 2. 
It's like saying it's OK it degrades by 5% on $CPU_VENDOR_A because > > the gain is 20% on $CPU_VENDOR_B. > > > > In the case we see more degradation in real-world scenario, you might > > want to consider using ifdefs to avoid adding padding in the non- > > vectorized case, like you did to differentiate Virtio PMD to Virtio-user > > PMD in patch 7. > > > > Maxime, > The performance difference is so slight, so I ignored for it look like a > sampling error. > It maybe not suitable to add new configuration for such setting which only > used inside driver. > Virtio driver can check whether virtqueue is using vectorized path when > initialization, will use padded structure if it is. > I have added some tested code and now performance came back. Since > code has changed in initialization process, it need some time for regression > check. > + one more update. Batch store with padding structure won't have benefit based on the latest code. It may due to addition load/store cost can't be hidden by saved cpu cycles. Will moved padding structure and make things clear as before. > Regards, > Marvin > > > Thanks, > > Maxime > > > > > Thanks, > > > Marvin > > > > > >> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> > > >> > > >> Thanks, > > >> Maxime > > > ^ permalink raw reply [flat|nested] 162+ messages in thread
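The cache-line concern in this thread comes down to the size and alignment of the per-descriptor bookkeeping entries: the batch Tx path later in the series writes four vq_desc_extra entries with a single 512-bit store, which only works if every entry occupies exactly 16 bytes. A minimal sketch of that invariant, outside any patch, spelled with plain GCC attributes instead of the __rte_packed/__rte_aligned wrappers:

#include <stdint.h>
#include <assert.h>

/* Element layout of the padded vq_desc_extra discussed above. */
struct vq_desc_extra_padded {
	void *cookie;
	uint16_t ndescs;
	uint16_t next;
	uint8_t padding[4];
} __attribute__((__packed__, __aligned__(16)));

/* Four 16-byte entries fill one 64-byte cache line, i.e. exactly one
 * _mm512_storeu_si512() in the vectorized Tx batch function.
 */
static_assert(sizeof(struct vq_desc_extra_padded) == 16,
	      "batch store assumes 16-byte bookkeeping entries");
static_assert(4 * sizeof(struct vq_desc_extra_padded) == 64,
	      "one 512-bit store must cover a whole batch of four");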
* [dpdk-dev] [PATCH v10 7/9] net/virtio: add vectorized packed ring Tx path 2020-04-26 2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu ` (5 preceding siblings ...) 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path Marvin Liu @ 2020-04-26 2:19 ` Marvin Liu 2020-04-27 11:55 ` Maxime Coquelin 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 8/9] net/virtio: add election for vectorized path Marvin Liu 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 9/9] doc: add packed " Marvin Liu 8 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-26 2:19 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Optimize packed ring Tx path like Rx path. Split Tx path into batch and single Tx functions. Batch function is further optimized by AVX512 instructions. Signed-off-by: Marvin Liu <yong.liu@intel.com> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index 5c112cac7..b7d52d497 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 534562cca..460e9d4a2 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2038,3 +2038,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, { return 0; } + +__rte_weak uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused, + struct rte_mbuf **tx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c index 8a7b459eb..43cee4244 100644 --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -23,6 +23,24 @@ #define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \ FLAGS_BITS_OFFSET) +/* reference count offset in mbuf rearm data */ +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \ + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) +/* segment number offset in mbuf rearm data */ +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \ + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) + +/* default rearm data */ +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \ + 1ULL << REFCNT_BITS_OFFSET) + +/* id bits offset in packed ring desc higher 64bits */ +#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \ + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) + +/* net hdr short size mask */ +#define NET_HDR_MASK 0x3F + #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ sizeof(struct vring_packed_desc)) #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) @@ -60,6 +78,237 @@ virtio_update_batch_stats(struct virtnet_stats *stats, stats->bytes += pkt_len4; } +static inline int +virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf **tx_pkts) +{ + struct virtqueue *vq = txvq->vq; + uint16_t head_size = vq->hw->vtnet_hdr_size; + uint16_t idx = vq->vq_avail_idx; + struct virtio_net_hdr *hdr; + uint16_t i, cmp; + + if (vq->vq_avail_idx & 
PACKED_BATCH_MASK) + return -1; + + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) + return -1; + + /* Load four mbufs rearm data */ + RTE_BUILD_BUG_ON(REFCNT_BITS_OFFSET >= 64); + RTE_BUILD_BUG_ON(SEG_NUM_BITS_OFFSET >= 64); + __m256i mbufs = _mm256_set_epi64x(*tx_pkts[3]->rearm_data, + *tx_pkts[2]->rearm_data, + *tx_pkts[1]->rearm_data, + *tx_pkts[0]->rearm_data); + + /* refcnt=1 and nb_segs=1 */ + __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA); + __m256i head_rooms = _mm256_set1_epi16(head_size); + + /* Check refcnt and nb_segs */ + const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12; + cmp = _mm256_mask_cmpneq_epu16_mask(mask, mbufs, mbuf_ref); + if (unlikely(cmp)) + return -1; + + /* Check headroom is enough */ + const __mmask16 data_mask = 0x1 | 0x1 << 4 | 0x1 << 8 | 0x1 << 12; + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_off) != + offsetof(struct rte_mbuf, rearm_data)); + cmp = _mm256_mask_cmplt_epu16_mask(data_mask, mbufs, head_rooms); + if (unlikely(cmp)) + return -1; + + __m512i v_descx = _mm512_set_epi64(0x1, (uint64_t)tx_pkts[3], + 0x1, (uint64_t)tx_pkts[2], + 0x1, (uint64_t)tx_pkts[1], + 0x1, (uint64_t)tx_pkts[0]); + + _mm512_storeu_si512((void *)&vq->vq_descx[idx], v_descx); + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + tx_pkts[i]->data_off -= head_size; + tx_pkts[i]->data_len += head_size; + } + +#ifdef RTE_VIRTIO_USER + __m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])), + tx_pkts[2]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])), + tx_pkts[1]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])), + tx_pkts[0]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0]))); +#else + __m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len, + tx_pkts[3]->buf_iova, + tx_pkts[2]->data_len, + tx_pkts[2]->buf_iova, + tx_pkts[1]->data_len, + tx_pkts[1]->buf_iova, + tx_pkts[0]->data_len, + tx_pkts[0]->buf_iova); +#endif + + /* id offset and data offset */ + __m512i data_offsets = _mm512_set_epi64((uint64_t)3 << ID_BITS_OFFSET, + tx_pkts[3]->data_off, + (uint64_t)2 << ID_BITS_OFFSET, + tx_pkts[2]->data_off, + (uint64_t)1 << ID_BITS_OFFSET, + tx_pkts[1]->data_off, + 0, tx_pkts[0]->data_off); + + __m512i new_descs = _mm512_add_epi64(descs_base, data_offsets); + + uint64_t flags_temp = (uint64_t)idx << ID_BITS_OFFSET | + (uint64_t)vq->vq_packed.cached_flags << FLAGS_BITS_OFFSET; + + /* flags offset and guest virtual address offset */ +#ifdef RTE_VIRTIO_USER + __m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset); +#else + __m128i flag_offset = _mm_set_epi64x(flags_temp, 0); +#endif + __m512i v_offset = _mm512_broadcast_i32x4(flag_offset); + + __m512i v_desc = _mm512_add_epi64(new_descs, v_offset); + + if (!vq->hw->has_tx_offload) { + __m128i all_mask = _mm_set1_epi16(0xFFFF); + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + __m128i v_hdr = _mm_loadu_si128((void *)hdr); + if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK, + v_hdr, all_mask))) { + __m128i all_zero = _mm_setzero_si128(); + _mm_mask_storeu_epi16((void *)hdr, + NET_HDR_MASK, all_zero); + } + } + } else { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + virtqueue_xmit_offload(hdr, tx_pkts[i], true); + } + } + + /* Enqueue Packet buffers */ + _mm512_storeu_si512((void 
*)&vq->vq_packed.ring.desc[idx], v_desc); + + virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len, + tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len, + tx_pkts[3]->pkt_len); + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + vq->vq_free_cnt -= PACKED_BATCH_SIZE; + + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + + return 0; +} + +static inline int +virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf *txm) +{ + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint16_t slots, can_push; + int16_t need; + + /* How many main ring entries are needed to this Tx? + * any_layout => number of segments + * default => number of segments + 1 + */ + can_push = rte_mbuf_refcnt_read(txm) == 1 && + RTE_MBUF_DIRECT(txm) && + txm->nb_segs == 1 && + rte_pktmbuf_headroom(txm) >= hdr_size; + + slots = txm->nb_segs + !can_push; + need = slots - vq->vq_free_cnt; + + /* Positive value indicates it need free vring descriptors */ + if (unlikely(need > 0)) { + virtio_xmit_cleanup_inorder_packed(vq, need); + need = slots - vq->vq_free_cnt; + if (unlikely(need > 0)) { + PMD_TX_LOG(ERR, + "No free tx descriptors to transmit"); + return -1; + } + } + + /* Enqueue Packet buffers */ + virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1); + + txvq->stats.bytes += txm->pkt_len; + return 0; +} + +uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_tx *txvq = tx_queue; + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t nb_tx = 0; + uint16_t remained; + + if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) + return nb_tx; + + if (unlikely(nb_pkts < 1)) + return nb_pkts; + + PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts); + + if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh) + virtio_xmit_cleanup_inorder_packed(vq, vq->vq_free_thresh); + + remained = RTE_MIN(nb_pkts, vq->vq_free_cnt); + + while (remained) { + if (remained >= PACKED_BATCH_SIZE) { + if (!virtqueue_enqueue_batch_packed_vec(txvq, + &tx_pkts[nb_tx])) { + nb_tx += PACKED_BATCH_SIZE; + remained -= PACKED_BATCH_SIZE; + continue; + } + } + if (!virtqueue_enqueue_single_packed_vec(txvq, + tx_pkts[nb_tx])) { + nb_tx++; + remained--; + continue; + } + break; + }; + + txvq->stats.packets += nb_tx; + + if (likely(nb_tx)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_TX_LOG(DEBUG, "Notified backend after xmit"); + } + } + + return nb_tx; +} + /* Optionally fill offload information in structure */ static inline int virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
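For readers less familiar with AVX512 mask intrinsics, the early-exit checks in virtqueue_enqueue_batch_packed_vec() (index alignment aside) are the vector form of the per-mbuf conditions the scalar path applies one packet at a time. A hedged scalar equivalent, for illustration only; the helper name is hypothetical and head_size corresponds to hw->vtnet_hdr_size:

#include <rte_mbuf.h>

/* Scalar restatement of the masked 16-bit compares above: a batch of four
 * is taken only when every mbuf is a single segment with refcnt 1 and has
 * enough headroom to prepend the virtio-net header in place.
 */
static inline int
batch_tx_prerequisites(struct rte_mbuf **tx_pkts, uint16_t head_size)
{
	int i;

	for (i = 0; i < 4; i++) {
		if (rte_mbuf_refcnt_read(tx_pkts[i]) != 1 ||
		    tx_pkts[i]->nb_segs != 1 ||
		    rte_pktmbuf_headroom(tx_pkts[i]) < head_size)
			return -1;	/* fall back to the single-packet path */
	}
	return 0;
}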
* Re: [dpdk-dev] [PATCH v10 7/9] net/virtio: add vectorized packed ring Tx path 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu @ 2020-04-27 11:55 ` Maxime Coquelin 0 siblings, 0 replies; 162+ messages in thread From: Maxime Coquelin @ 2020-04-27 11:55 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev On 4/26/20 4:19 AM, Marvin Liu wrote: > Optimize packed ring Tx path like Rx path. Split Tx path into batch and > single Tx functions. Batch function is further optimized by AVX512 > instructions. > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Thanks, Maxime ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v10 8/9] net/virtio: add election for vectorized path 2020-04-26 2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu ` (6 preceding siblings ...) 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu @ 2020-04-26 2:19 ` Marvin Liu 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 9/9] doc: add packed " Marvin Liu 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-26 2:19 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Rewrite vectorized path selection logic. Default setting comes from vectorized devarg, then checks each criteria. Packed ring vectorized path need: AVX512F and required extensions are supported by compiler and host VERSION_1 and IN_ORDER features are negotiated mergeable feature is not negotiated LRO offloading is disabled Split ring vectorized rx path need: mergeable and IN_ORDER features are not negotiated LRO, chksum and vlan strip offloadings are disabled Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 0a69a4db1..f8ff41d99 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1523,9 +1523,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) if (vtpci_packed_queue(hw)) { PMD_INIT_LOG(INFO, "virtio: using packed ring %s Tx path on port %u", - hw->use_inorder_tx ? "inorder" : "standard", + hw->use_vec_tx ? "vectorized" : "standard", eth_dev->data->port_id); - eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; + if (hw->use_vec_tx) + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec; + else + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; } else { if (hw->use_inorder_tx) { PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u", @@ -1539,7 +1542,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) } if (vtpci_packed_queue(hw)) { - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + if (hw->use_vec_rx) { + PMD_INIT_LOG(INFO, + "virtio: using packed ring vectorized Rx path on port %u", + eth_dev->data->port_id); + eth_dev->rx_pkt_burst = + &virtio_recv_pkts_packed_vec; + } else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { PMD_INIT_LOG(INFO, "virtio: using packed ring mergeable buffer Rx path on port %u", eth_dev->data->port_id); @@ -1952,8 +1961,17 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) goto err_virtio_init; if (vectorized) { - if (!vtpci_packed_queue(hw)) + if (!vtpci_packed_queue(hw)) { + hw->use_vec_rx = 1; + } else { +#if !defined(CC_AVX512_SUPPORT) + PMD_DRV_LOG(INFO, + "building environment do not support packed ring vectorized"); +#else hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#endif + } } hw->opened = true; @@ -2102,8 +2120,8 @@ virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, if (vectorized && rte_kvargs_count(kvlist, VIRTIO_ARG_VECTORIZED) == 1) { ret = rte_kvargs_process(kvlist, - VIRTIO_ARG_VECTORIZED, - vectorized_check_handler, vectorized); + VIRTIO_ARG_VECTORIZED, + vectorized_check_handler, vectorized); if (ret < 0) { PMD_INIT_LOG(ERR, "Failed to parse %s", VIRTIO_ARG_VECTORIZED); @@ -2288,31 +2306,61 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { - hw->use_inorder_tx = 1; - hw->use_inorder_rx = 1; - hw->use_vec_rx = 0; - } - if (vtpci_packed_queue(hw)) { - hw->use_vec_rx = 0; - hw->use_inorder_rx = 0; - } + if ((hw->use_vec_rx || 
hw->use_vec_tx) && + (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) || + !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) || + !vtpci_with_feature(hw, VIRTIO_F_VERSION_1))) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized path for requirements not met"); + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; + } + if (hw->use_vec_rx) { + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } + + if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for TCP_LRO enabled"); + hw->use_vec_rx = 0; + } + } + } else { + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { + hw->use_inorder_tx = 1; + hw->use_inorder_rx = 1; + hw->use_vec_rx = 0; + } + + if (hw->use_vec_rx) { #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM - if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_vec_rx = 0; - } + if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized path for requirement not met"); + hw->use_vec_rx = 0; + } #endif - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_vec_rx = 0; - } + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } - if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | - DEV_RX_OFFLOAD_TCP_CKSUM | - DEV_RX_OFFLOAD_TCP_LRO | - DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_vec_rx = 0; + if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | + DEV_RX_OFFLOAD_TCP_CKSUM | + DEV_RX_OFFLOAD_TCP_LRO | + DEV_RX_OFFLOAD_VLAN_STRIP)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for offloading enabled"); + hw->use_vec_rx = 0; + } + } + } return 0; } -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v10 9/9] doc: add packed vectorized path 2020-04-26 2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu ` (7 preceding siblings ...) 2020-04-26 2:19 ` [dpdk-dev] [PATCH v10 8/9] net/virtio: add election for vectorized path Marvin Liu @ 2020-04-26 2:19 ` Marvin Liu 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-26 2:19 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Document packed virtqueue vectorized path selection logic in virtio net PMD. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index d59add23e..dbcf49ae1 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -482,6 +482,13 @@ according to below configuration: both negotiated, this path will be selected. #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and Rx mergeable is not negotiated, this path will be selected. +#. Packed virtqueue vectorized Rx path: If building and running environment support + AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated && + TCP_LRO Rx offloading is disabled && vectorized option enabled, + this path will be selected. +#. Packed virtqueue vectorized Tx path: If building and running environment support + AVX512 && in-order feature is negotiated && vectorized option enabled, + this path will be selected. Rx/Tx callbacks of each Virtio path ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -504,6 +511,8 @@ are shown in below table: Packed virtqueue non-meregable path virtio_recv_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order mergeable path virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed virtio_xmit_pkts_packed + Packed virtqueue vectorized Rx path virtio_recv_pkts_packed_vec virtio_xmit_pkts_packed + Packed virtqueue vectorized Tx path virtio_recv_pkts_packed virtio_xmit_pkts_packed_vec ============================================ ================================= ======================== Virtio paths Support Status from Release to Release @@ -521,20 +530,22 @@ All virtio paths support status are shown in below table: .. 
table:: Virtio Paths and Releases - ============================================ ============= ============= ============= - Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 - ============================================ ============= ============= ============= - Split virtqueue mergeable path Y Y Y - Split virtqueue non-mergeable path Y Y Y - Split virtqueue vectorized Rx path Y Y Y - Split virtqueue simple Tx path Y N N - Split virtqueue in-order mergeable path Y Y - Split virtqueue in-order non-mergeable path Y Y - Packed virtqueue mergeable path Y - Packed virtqueue non-mergeable path Y - Packed virtqueue in-order mergeable path Y - Packed virtqueue in-order non-mergeable path Y - ============================================ ============= ============= ============= + ============================================ ============= ============= ============= ======= + Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~ + ============================================ ============= ============= ============= ======= + Split virtqueue mergeable path Y Y Y Y + Split virtqueue non-mergeable path Y Y Y Y + Split virtqueue vectorized Rx path Y Y Y Y + Split virtqueue simple Tx path Y N N N + Split virtqueue in-order mergeable path Y Y Y + Split virtqueue in-order non-mergeable path Y Y Y + Packed virtqueue mergeable path Y Y + Packed virtqueue non-mergeable path Y Y + Packed virtqueue in-order mergeable path Y Y + Packed virtqueue in-order non-mergeable path Y Y + Packed virtqueue vectorized Rx path Y + Packed virtqueue vectorized Tx path Y + ============================================ ============= ============= ============= ======= QEMU Support Status ~~~~~~~~~~~~~~~~~~~ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v11 0/9] add packed ring vectorized path 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu ` (15 preceding siblings ...) 2020-04-26 2:19 ` [dpdk-dev] [PATCH v9 0/9] add packed ring " Marvin Liu @ 2020-04-28 8:32 ` Marvin Liu 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 1/9] net/virtio: add Rx free threshold setting Marvin Liu ` (8 more replies) 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 0/9] add packed ring " Marvin Liu 17 siblings, 9 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-28 8:32 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu This patch set introduced vectorized path for packed ring. The size of packed ring descriptor is 16Bytes. Four batched descriptors are just placed into one cacheline. AVX512 instructions can well handle this kind of data. Packed ring TX path can fully transformed into vectorized path. Packed ring Rx path can be vectorized when requirements met(LRO and mergeable disabled). New option RTE_LIBRTE_VIRTIO_INC_VECTOR will be introduced in this patch set. This option will unify split and packed ring vectorized path default setting. Meanwhile user can specify whether enable vectorized path at runtime by 'vectorized' parameter of virtio user vdev. v11: * fix i686 build warnings * fix typo in doc v10: * reuse packed ring xmit cleanup v9: * replace RTE_LIBRTE_VIRTIO_INC_VECTOR with vectorized devarg * reorder patch sequence v8: * fix meson build error on ubuntu16.04 and suse15 v7: * default vectorization is disabled * compilation time check dependency on rte_mbuf structure * offsets are calcuated when compiling * remove useless barrier as descs are batched store&load * vindex of scatter is directly set * some comments updates * enable vectorized path in meson build v6: * fix issue when size not power of 2 v5: * remove cpuflags definition as required extensions always come with AVX512F on x86_64 * inorder actions should depend on feature bit * check ring type in rx queue setup * rewrite some commit logs * fix some checkpatch warnings v4: * rename 'packed_vec' to 'vectorized', also used in split ring * add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev * check required AVX512 extensions cpuflags * combine split and packed ring datapath selection logic * remove limitation that size must power of two * clear 12Bytes virtio_net_hdr v3: * remove virtio_net_hdr array for better performance * disable 'packed_vec' by default v2: * more function blocks replaced by vector instructions * clean virtio_net_hdr by vector instruction * allow header room size change * add 'packed_vec' option in virtio_user vdev * fix build not check whether AVX512 enabled * doc update Tested-by: Wang, Yinan <yinan.wang@intel.com> Marvin Liu (9): net/virtio: add Rx free threshold setting net/virtio: inorder should depend on feature bit net/virtio: add vectorized devarg net/virtio-user: add vectorized devarg net/virtio: reuse packed ring functions net/virtio: add vectorized packed ring Rx path net/virtio: add vectorized packed ring Tx path net/virtio: add election for vectorized path doc: add packed vectorized path doc/guides/nics/virtio.rst | 52 +- drivers/net/virtio/Makefile | 35 ++ drivers/net/virtio/meson.build | 14 + drivers/net/virtio/virtio_ethdev.c | 137 ++++- drivers/net/virtio/virtio_ethdev.h | 6 + drivers/net/virtio/virtio_pci.h | 3 +- drivers/net/virtio/virtio_rxtx.c | 349 ++--------- drivers/net/virtio/virtio_rxtx_packed_avx.c | 623 ++++++++++++++++++++ 
drivers/net/virtio/virtio_user_ethdev.c | 32 +- drivers/net/virtio/virtqueue.c | 7 +- drivers/net/virtio/virtqueue.h | 307 +++++++++- 11 files changed, 1210 insertions(+), 355 deletions(-) create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
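The "four batched descriptors in one cacheline" observation in the cover letter is pure arithmetic on the packed descriptor layout; the driver derives the batch size as RTE_CACHE_LINE_SIZE / sizeof(struct vring_packed_desc). A small standalone sketch of that arithmetic:

#include <stdint.h>
#include <assert.h>

/* Packed ring descriptor layout per the virtio 1.1 spec (mirrors the
 * driver's struct vring_packed_desc).
 */
struct packed_desc_sketch {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

static_assert(sizeof(struct packed_desc_sketch) == 16,
	      "a packed descriptor is 16 bytes");
static_assert(64 / sizeof(struct packed_desc_sketch) == 4,
	      "a 64-byte cache line / 512-bit AVX512 register holds four");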
* [dpdk-dev] [PATCH v11 1/9] net/virtio: add Rx free threshold setting 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 0/9] add packed ring " Marvin Liu @ 2020-04-28 8:32 ` Marvin Liu 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 2/9] net/virtio: inorder should depend on feature bit Marvin Liu ` (7 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-28 8:32 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Introduce free threshold setting in Rx queue, its default value is 32. Limit the threshold size to multiple of four as only vectorized packed Rx function will utilize it. Virtio driver will rearm Rx queue when more than rx_free_thresh descs were dequeued. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 060410577..94ba7a3ec 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, struct virtio_hw *hw = dev->data->dev_private; struct virtqueue *vq = hw->vqs[vtpci_queue_idx]; struct virtnet_rx *rxvq; + uint16_t rx_free_thresh; PMD_INIT_FUNC_TRACE(); @@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, return -EINVAL; } + rx_free_thresh = rx_conf->rx_free_thresh; + if (rx_free_thresh == 0) + rx_free_thresh = + RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH); + + if (rx_free_thresh & 0x3) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four." + " (rx_free_thresh=%u port=%u queue=%u)\n", + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + + if (rx_free_thresh >= vq->vq_nentries) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the " + "number of RX entries (%u)." + " (rx_free_thresh=%u port=%u queue=%u)\n", + vq->vq_nentries, + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + vq->vq_free_thresh = rx_free_thresh; + if (nb_desc == 0 || nb_desc > vq->vq_nentries) nb_desc = vq->vq_nentries; vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc); diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 58ad7309a..6301c56b2 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,8 @@ struct rte_mbuf; +#define DEFAULT_RX_FREE_THRESH 32 + /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
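From the application side the new constraint is visible through rte_eth_rxconf: a threshold that is not a multiple of four, or not below the ring size, now makes queue setup return -EINVAL, while 0 lets the PMD pick its default (the smaller of ring size / 4 and 32). A minimal, hedged setup sketch; port, queue, descriptor count and mempool are placeholders:

#include <string.h>
#include <rte_ethdev.h>
#include <rte_mempool.h>

static int
setup_virtio_rxq(uint16_t port_id, struct rte_mempool *mp)
{
	struct rte_eth_rxconf rxconf;

	memset(&rxconf, 0, sizeof(rxconf));
	rxconf.rx_free_thresh = 32;	/* multiple of four, below nb_rx_desc */

	return rte_eth_rx_queue_setup(port_id, 0 /* queue */,
				      256 /* nb_rx_desc */,
				      rte_eth_dev_socket_id(port_id),
				      &rxconf, mp);
}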
* [dpdk-dev] [PATCH v11 2/9] net/virtio: inorder should depend on feature bit 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 0/9] add packed ring " Marvin Liu 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 1/9] net/virtio: add Rx free threshold setting Marvin Liu @ 2020-04-28 8:32 ` Marvin Liu 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 3/9] net/virtio: add vectorized devarg Marvin Liu ` (6 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-28 8:32 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Ring initialization is different when inorder feature negotiated. This action should dependent on negotiated feature bits. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 94ba7a3ec..e450477e8 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -989,6 +989,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) struct rte_mbuf *m; uint16_t desc_idx; int error, nbufs, i; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); PMD_INIT_FUNC_TRACE(); @@ -1018,7 +1019,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; } - } else if (hw->use_inorder_rx) { + } else if (!vtpci_packed_queue(vq->hw) && in_order) { if ((!virtqueue_full(vq))) { uint16_t free_cnt = vq->vq_free_cnt; struct rte_mbuf *pkts[free_cnt]; @@ -1133,7 +1134,7 @@ virtio_dev_tx_queue_setup_finish(struct rte_eth_dev *dev, PMD_INIT_FUNC_TRACE(); if (!vtpci_packed_queue(hw)) { - if (hw->use_inorder_tx) + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) vq->vq_split.ring.desc[vq->vq_nentries - 1].next = 0; } @@ -2046,7 +2047,7 @@ virtio_xmit_pkts_packed(void *tx_queue, struct rte_mbuf **tx_pkts, struct virtio_hw *hw = vq->hw; uint16_t hdr_size = hw->vtnet_hdr_size; uint16_t nb_tx = 0; - bool in_order = hw->use_inorder_tx; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) return nb_tx; -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v11 3/9] net/virtio: add vectorized devarg 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 0/9] add packed ring " Marvin Liu 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 1/9] net/virtio: add Rx free threshold setting Marvin Liu 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 2/9] net/virtio: inorder should depend on feature bit Marvin Liu @ 2020-04-28 8:32 ` Marvin Liu 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 4/9] net/virtio-user: " Marvin Liu ` (5 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-28 8:32 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Previously, virtio split ring vectorized path was enabled by default. This is not suitable for everyone because that path dose not follow virtio spec. Add new devarg for virtio vectorized path selection. By default vectorized path is disabled. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index 6286286db..a67774e91 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -363,6 +363,13 @@ Below devargs are supported by the PCI virtio driver: rte_eth_link_get_nowait function. (Default: 10000 (10G)) +#. ``vectorized``: + + It is used to specify whether virtio device perfers to use vectorized path. + Afterwards, dependencies of vectorized path will be checked in path + election. + (Default: 0 (disabled)) + Below devargs are supported by the virtio-user vdev: #. ``path``: diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 37766cbb6..0a69a4db1 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -48,7 +48,8 @@ static int virtio_dev_allmulticast_disable(struct rte_eth_dev *dev); static uint32_t virtio_dev_speed_capa_get(uint32_t speed); static int virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, - uint32_t *speed); + uint32_t *speed, + int *vectorized); static int virtio_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info); static int virtio_dev_link_update(struct rte_eth_dev *dev, @@ -1551,8 +1552,8 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed; } } else { - if (hw->use_simple_rx) { - PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u", + if (hw->use_vec_rx) { + PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u", eth_dev->data->port_id); eth_dev->rx_pkt_burst = virtio_recv_pkts_vec; } else if (hw->use_inorder_rx) { @@ -1886,6 +1887,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) { struct virtio_hw *hw = eth_dev->data->dev_private; uint32_t speed = SPEED_UNKNOWN; + int vectorized = 0; int ret; if (sizeof(struct virtio_net_hdr_mrg_rxbuf) > RTE_PKTMBUF_HEADROOM) { @@ -1912,7 +1914,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) return 0; } ret = virtio_dev_devargs_parse(eth_dev->device->devargs, - NULL, &speed); + NULL, &speed, &vectorized); if (ret < 0) return ret; hw->speed = speed; @@ -1949,6 +1951,11 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) if (ret < 0) goto err_virtio_init; + if (vectorized) { + if (!vtpci_packed_queue(hw)) + hw->use_vec_rx = 1; + } + hw->opened = true; return 0; @@ -2021,9 +2028,20 @@ virtio_dev_speed_capa_get(uint32_t speed) } } +static int vectorized_check_handler(__rte_unused const char *key, + const char *value, void *ret_val) +{ + if (strcmp(value, "1") == 0) + *(int *)ret_val = 1; + 
else + *(int *)ret_val = 0; + + return 0; +} #define VIRTIO_ARG_SPEED "speed" #define VIRTIO_ARG_VDPA "vdpa" +#define VIRTIO_ARG_VECTORIZED "vectorized" static int @@ -2045,7 +2063,7 @@ link_speed_handler(const char *key __rte_unused, static int virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, - uint32_t *speed) + uint32_t *speed, int *vectorized) { struct rte_kvargs *kvlist; int ret = 0; @@ -2081,6 +2099,18 @@ virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, } } + if (vectorized && + rte_kvargs_count(kvlist, VIRTIO_ARG_VECTORIZED) == 1) { + ret = rte_kvargs_process(kvlist, + VIRTIO_ARG_VECTORIZED, + vectorized_check_handler, vectorized); + if (ret < 0) { + PMD_INIT_LOG(ERR, "Failed to parse %s", + VIRTIO_ARG_VECTORIZED); + goto exit; + } + } + exit: rte_kvargs_free(kvlist); return ret; @@ -2092,7 +2122,8 @@ static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused, int vdpa = 0; int ret = 0; - ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL); + ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL, + NULL); if (ret < 0) { PMD_INIT_LOG(ERR, "devargs parsing is failed"); return ret; @@ -2257,33 +2288,31 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - hw->use_simple_rx = 1; - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { hw->use_inorder_tx = 1; hw->use_inorder_rx = 1; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (vtpci_packed_queue(hw)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; hw->use_inorder_rx = 0; } #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } #endif if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | DEV_RX_OFFLOAD_TCP_CKSUM | DEV_RX_OFFLOAD_TCP_LRO | DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; return 0; } diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h index bd89357e4..668e688e1 100644 --- a/drivers/net/virtio/virtio_pci.h +++ b/drivers/net/virtio/virtio_pci.h @@ -253,7 +253,8 @@ struct virtio_hw { uint8_t vlan_strip; uint8_t use_msix; uint8_t modern; - uint8_t use_simple_rx; + uint8_t use_vec_rx; + uint8_t use_vec_tx; uint8_t use_inorder_rx; uint8_t use_inorder_tx; uint8_t weak_barriers; diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index e450477e8..84f4cf946 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) /* Allocate blank mbufs for the each rx descriptor */ nbufs = 0; - if (hw->use_simple_rx) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { for (desc_idx = 0; desc_idx < vq->vq_nentries; desc_idx++) { vq->vq_split.ring.avail->ring[desc_idx] = desc_idx; @@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) &rxvq->fake_mbuf; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 953f00d72..150a8d987 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -525,7 +525,7 @@ 
virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev) */ hw->use_msix = 1; hw->modern = 0; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; hw->use_inorder_rx = 0; hw->use_inorder_tx = 0; hw->virtio_user_dev = dev; diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c index 0b4e3bf3e..ca23180de 100644 --- a/drivers/net/virtio/virtqueue.c +++ b/drivers/net/virtio/virtqueue.c @@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq) end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1); for (idx = 0; idx < vq->vq_nentries; idx++) { - if (hw->use_simple_rx && type == VTNET_RQ) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw) && + type == VTNET_RQ) { if (start <= end && idx >= start && idx < end) continue; if (start > end && (idx >= start || idx < end)) @@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) for (i = 0; i < nb_used; i++) { used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1); uep = &vq->vq_split.ring.used->ring[used_idx]; - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { desc_idx = used_idx; rte_pktmbuf_free(vq->sw_ring[desc_idx]); vq->vq_free_cnt++; @@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) vq->vq_used_cons_idx++; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxq); if (virtqueue_kick_prepare(vq)) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v11 4/9] net/virtio-user: add vectorized devarg 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 0/9] add packed ring " Marvin Liu ` (2 preceding siblings ...) 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 3/9] net/virtio: add vectorized devarg Marvin Liu @ 2020-04-28 8:32 ` Marvin Liu 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 5/9] net/virtio: reuse packed ring functions Marvin Liu ` (4 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-28 8:32 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Add new devarg for virtio user device vectorized path selection. By default vectorized path is disabled. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index a67774e91..fdd0790e0 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -424,6 +424,12 @@ Below devargs are supported by the virtio-user vdev: rte_eth_link_get_nowait function. (Default: 10000 (10G)) +#. ``vectorized``: + + It is used to specify whether virtio device perfers to use vectorized path. + Afterwards, dependencies of vectorized path will be checked in path + election. + (Default: 0 (disabled)) Virtio paths Selection and Usage -------------------------------- diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 150a8d987..40ad786cc 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -452,6 +452,8 @@ static const char *valid_args[] = { VIRTIO_USER_ARG_PACKED_VQ, #define VIRTIO_USER_ARG_SPEED "speed" VIRTIO_USER_ARG_SPEED, +#define VIRTIO_USER_ARG_VECTORIZED "vectorized" + VIRTIO_USER_ARG_VECTORIZED, NULL }; @@ -559,6 +561,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) uint64_t mrg_rxbuf = 1; uint64_t in_order = 1; uint64_t packed_vq = 0; + uint64_t vectorized = 0; char *path = NULL; char *ifname = NULL; char *mac_addr = NULL; @@ -675,6 +678,15 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } } + if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) { + if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED, + &get_integer_arg, &vectorized) < 0) { + PMD_INIT_LOG(ERR, "error to parse %s", + VIRTIO_USER_ARG_VECTORIZED); + goto end; + } + } + if (queues > 1 && cq == 0) { PMD_INIT_LOG(ERR, "multi-q requires ctrl-q"); goto end; @@ -727,6 +739,9 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) goto end; } + if (vectorized) + hw->use_vec_rx = 1; + rte_eth_dev_probing_finish(eth_dev); ret = 0; @@ -785,4 +800,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, "mrg_rxbuf=<0|1> " "in_order=<0|1> " "packed_vq=<0|1> " - "speed=<int>"); + "speed=<int> " + "vectorized=<0|1>"); -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
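As a usage note, the packed ring vectorized path of this series is reached by combining the new devarg with the existing virtio-user ones, e.g. path plus packed_vq=1, in_order=1, mrg_rxbuf=0 and vectorized=1 (the same string works as --vdev=net_virtio_user0,path=...,packed_vq=1,in_order=1,mrg_rxbuf=0,vectorized=1 on a testpmd command line). A hedged programmatic sketch with a placeholder socket path and device name:

#include <rte_bus_vdev.h>

/* Hot-plug a virtio-user port whose devargs satisfy the packed-ring
 * vectorized requirements checked later during path election.
 */
static int
create_packed_vec_port(void)
{
	return rte_vdev_init("net_virtio_user0",
			     "path=/tmp/vhost-user.sock,queues=1,"
			     "packed_vq=1,in_order=1,mrg_rxbuf=0,vectorized=1");
}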
* [dpdk-dev] [PATCH v11 5/9] net/virtio: reuse packed ring functions 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 0/9] add packed ring " Marvin Liu ` (3 preceding siblings ...) 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 4/9] net/virtio-user: " Marvin Liu @ 2020-04-28 8:32 ` Marvin Liu 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 6/9] net/virtio: add vectorized packed ring Rx path Marvin Liu ` (3 subsequent siblings) 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-28 8:32 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Move offload, xmit cleanup and packed xmit enqueue function to header file. These functions will be reused by packed ring vectorized path. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 84f4cf946..a549991aa 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -89,23 +89,6 @@ vq_ring_free_chain(struct virtqueue *vq, uint16_t desc_idx) dp->next = VQ_RING_DESC_CHAIN_END; } -static void -vq_ring_free_id_packed(struct virtqueue *vq, uint16_t id) -{ - struct vq_desc_extra *dxp; - - dxp = &vq->vq_descx[id]; - vq->vq_free_cnt += dxp->ndescs; - - if (vq->vq_desc_tail_idx == VQ_RING_DESC_CHAIN_END) - vq->vq_desc_head_idx = id; - else - vq->vq_descx[vq->vq_desc_tail_idx].next = id; - - vq->vq_desc_tail_idx = id; - dxp->next = VQ_RING_DESC_CHAIN_END; -} - void virtio_update_packet_stats(struct virtnet_stats *stats, struct rte_mbuf *mbuf) { @@ -264,130 +247,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq, return i; } -#ifndef DEFAULT_TX_FREE_THRESH -#define DEFAULT_TX_FREE_THRESH 32 -#endif - -static void -virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num) -{ - uint16_t used_idx, id, curr_id, free_cnt = 0; - uint16_t size = vq->vq_nentries; - struct vring_packed_desc *desc = vq->vq_packed.ring.desc; - struct vq_desc_extra *dxp; - - used_idx = vq->vq_used_cons_idx; - /* desc_is_used has a load-acquire or rte_cio_rmb inside - * and wait for used desc in virtqueue. - */ - while (num > 0 && desc_is_used(&desc[used_idx], vq)) { - id = desc[used_idx].id; - do { - curr_id = used_idx; - dxp = &vq->vq_descx[used_idx]; - used_idx += dxp->ndescs; - free_cnt += dxp->ndescs; - num -= dxp->ndescs; - if (used_idx >= size) { - used_idx -= size; - vq->vq_packed.used_wrap_counter ^= 1; - } - if (dxp->cookie != NULL) { - rte_pktmbuf_free(dxp->cookie); - dxp->cookie = NULL; - } - } while (curr_id != id); - } - vq->vq_used_cons_idx = used_idx; - vq->vq_free_cnt += free_cnt; -} - -static void -virtio_xmit_cleanup_normal_packed(struct virtqueue *vq, int num) -{ - uint16_t used_idx, id; - uint16_t size = vq->vq_nentries; - struct vring_packed_desc *desc = vq->vq_packed.ring.desc; - struct vq_desc_extra *dxp; - - used_idx = vq->vq_used_cons_idx; - /* desc_is_used has a load-acquire or rte_cio_rmb inside - * and wait for used desc in virtqueue. - */ - while (num-- && desc_is_used(&desc[used_idx], vq)) { - id = desc[used_idx].id; - dxp = &vq->vq_descx[id]; - vq->vq_used_cons_idx += dxp->ndescs; - if (vq->vq_used_cons_idx >= size) { - vq->vq_used_cons_idx -= size; - vq->vq_packed.used_wrap_counter ^= 1; - } - vq_ring_free_id_packed(vq, id); - if (dxp->cookie != NULL) { - rte_pktmbuf_free(dxp->cookie); - dxp->cookie = NULL; - } - used_idx = vq->vq_used_cons_idx; - } -} - -/* Cleanup from completed transmits. 
*/ -static inline void -virtio_xmit_cleanup_packed(struct virtqueue *vq, int num, int in_order) -{ - if (in_order) - virtio_xmit_cleanup_inorder_packed(vq, num); - else - virtio_xmit_cleanup_normal_packed(vq, num); -} - -static void -virtio_xmit_cleanup(struct virtqueue *vq, uint16_t num) -{ - uint16_t i, used_idx, desc_idx; - for (i = 0; i < num; i++) { - struct vring_used_elem *uep; - struct vq_desc_extra *dxp; - - used_idx = (uint16_t)(vq->vq_used_cons_idx & (vq->vq_nentries - 1)); - uep = &vq->vq_split.ring.used->ring[used_idx]; - - desc_idx = (uint16_t) uep->id; - dxp = &vq->vq_descx[desc_idx]; - vq->vq_used_cons_idx++; - vq_ring_free_chain(vq, desc_idx); - - if (dxp->cookie != NULL) { - rte_pktmbuf_free(dxp->cookie); - dxp->cookie = NULL; - } - } -} - -/* Cleanup from completed inorder transmits. */ -static __rte_always_inline void -virtio_xmit_cleanup_inorder(struct virtqueue *vq, uint16_t num) -{ - uint16_t i, idx = vq->vq_used_cons_idx; - int16_t free_cnt = 0; - struct vq_desc_extra *dxp = NULL; - - if (unlikely(num == 0)) - return; - - for (i = 0; i < num; i++) { - dxp = &vq->vq_descx[idx++ & (vq->vq_nentries - 1)]; - free_cnt += dxp->ndescs; - if (dxp->cookie != NULL) { - rte_pktmbuf_free(dxp->cookie); - dxp->cookie = NULL; - } - } - - vq->vq_free_cnt += free_cnt; - vq->vq_used_cons_idx = idx; -} - static inline int virtqueue_enqueue_refill_inorder(struct virtqueue *vq, struct rte_mbuf **cookies, @@ -562,68 +421,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m) } -/* avoid write operation when necessary, to lessen cache issues */ -#define ASSIGN_UNLESS_EQUAL(var, val) do { \ - if ((var) != (val)) \ - (var) = (val); \ -} while (0) - -#define virtqueue_clear_net_hdr(_hdr) do { \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ -} while (0) - -static inline void -virtqueue_xmit_offload(struct virtio_net_hdr *hdr, - struct rte_mbuf *cookie, - bool offload) -{ - if (offload) { - if (cookie->ol_flags & PKT_TX_TCP_SEG) - cookie->ol_flags |= PKT_TX_TCP_CKSUM; - - switch (cookie->ol_flags & PKT_TX_L4_MASK) { - case PKT_TX_UDP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_udp_hdr, - dgram_cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - case PKT_TX_TCP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - default: - ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); - ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); - ASSIGN_UNLESS_EQUAL(hdr->flags, 0); - break; - } - /* TCP Segmentation Offload */ - if (cookie->ol_flags & PKT_TX_TCP_SEG) { - hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
- VIRTIO_NET_HDR_GSO_TCPV6 : - VIRTIO_NET_HDR_GSO_TCPV4; - hdr->gso_size = cookie->tso_segsz; - hdr->hdr_len = - cookie->l2_len + - cookie->l3_len + - cookie->l4_len; - } else { - ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); - ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); - ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); - } - } -} static inline void virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq, @@ -725,102 +523,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq, virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers); } -static inline void -virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, - uint16_t needed, int can_push, int in_order) -{ - struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; - struct vq_desc_extra *dxp; - struct virtqueue *vq = txvq->vq; - struct vring_packed_desc *start_dp, *head_dp; - uint16_t idx, id, head_idx, head_flags; - int16_t head_size = vq->hw->vtnet_hdr_size; - struct virtio_net_hdr *hdr; - uint16_t prev; - bool prepend_header = false; - - id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; - - dxp = &vq->vq_descx[id]; - dxp->ndescs = needed; - dxp->cookie = cookie; - - head_idx = vq->vq_avail_idx; - idx = head_idx; - prev = head_idx; - start_dp = vq->vq_packed.ring.desc; - - head_dp = &vq->vq_packed.ring.desc[idx]; - head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; - head_flags |= vq->vq_packed.cached_flags; - - if (can_push) { - /* prepend cannot fail, checked by caller */ - hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, - -head_size); - prepend_header = true; - - /* if offload disabled, it is not zeroed below, do it now */ - if (!vq->hw->has_tx_offload) - virtqueue_clear_net_hdr(hdr); - } else { - /* setup first tx ring slot to point to header - * stored in reserved region. - */ - start_dp[idx].addr = txvq->virtio_net_hdr_mem + - RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); - start_dp[idx].len = vq->hw->vtnet_hdr_size; - hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } - - virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); - - do { - uint16_t flags; - - start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); - start_dp[idx].len = cookie->data_len; - if (prepend_header) { - start_dp[idx].addr -= head_size; - start_dp[idx].len += head_size; - prepend_header = false; - } - - if (likely(idx != head_idx)) { - flags = cookie->next ? 
VRING_DESC_F_NEXT : 0; - flags |= vq->vq_packed.cached_flags; - start_dp[idx].flags = flags; - } - prev = idx; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } while ((cookie = cookie->next) != NULL); - - start_dp[prev].id = id; - - vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); - vq->vq_avail_idx = idx; - - if (!in_order) { - vq->vq_desc_head_idx = dxp->next; - if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) - vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; - } - - virtqueue_store_flags_packed(head_dp, head_flags, - vq->hw->weak_barriers); -} - static inline void virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie, uint16_t needed, int use_indirect, int can_push, @@ -1246,7 +948,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) return 0; } -#define VIRTIO_MBUF_BURST_SZ 64 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc)) uint16_t virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 6301c56b2..ca1c10499 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -10,6 +10,7 @@ #include <rte_atomic.h> #include <rte_memory.h> #include <rte_mempool.h> +#include <rte_net.h> #include "virtio_pci.h" #include "virtio_ring.h" @@ -18,8 +19,10 @@ struct rte_mbuf; +#define DEFAULT_TX_FREE_THRESH 32 #define DEFAULT_RX_FREE_THRESH 32 +#define VIRTIO_MBUF_BURST_SZ 64 /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO @@ -560,4 +563,303 @@ virtqueue_notify(struct virtqueue *vq) #define VIRTQUEUE_DUMP(vq) do { } while (0) #endif +/* avoid write operation when necessary, to lessen cache issues */ +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ + typeof(var) var_ = (var); \ + typeof(val) val_ = (val); \ + if ((var_) != (val_)) \ + (var_) = (val_); \ +} while (0) + +#define virtqueue_clear_net_hdr(hdr) do { \ + typeof(hdr) hdr_ = (hdr); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_start, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_offset, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->flags, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_type, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_size, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->hdr_len, 0); \ +} while (0) + +static inline void +virtqueue_xmit_offload(struct virtio_net_hdr *hdr, + struct rte_mbuf *cookie, + bool offload) +{ + if (offload) { + if (cookie->ol_flags & PKT_TX_TCP_SEG) + cookie->ol_flags |= PKT_TX_TCP_CKSUM; + + switch (cookie->ol_flags & PKT_TX_L4_MASK) { + case PKT_TX_UDP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_udp_hdr, + dgram_cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + case PKT_TX_TCP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + default: + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); + break; + } + + /* TCP Segmentation Offload */ + if (cookie->ol_flags & PKT_TX_TCP_SEG) { + hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
+ VIRTIO_NET_HDR_GSO_TCPV6 : + VIRTIO_NET_HDR_GSO_TCPV4; + hdr->gso_size = cookie->tso_segsz; + hdr->hdr_len = + cookie->l2_len + + cookie->l3_len + + cookie->l4_len; + } else { + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); + } + } +} + +static inline void +virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, + uint16_t needed, int can_push, int in_order) +{ + struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; + struct vq_desc_extra *dxp; + struct virtqueue *vq = txvq->vq; + struct vring_packed_desc *start_dp, *head_dp; + uint16_t idx, id, head_idx, head_flags; + int16_t head_size = vq->hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + uint16_t prev; + bool prepend_header = false; + + id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; + + dxp = &vq->vq_descx[id]; + dxp->ndescs = needed; + dxp->cookie = cookie; + + head_idx = vq->vq_avail_idx; + idx = head_idx; + prev = head_idx; + start_dp = vq->vq_packed.ring.desc; + + head_dp = &vq->vq_packed.ring.desc[idx]; + head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; + head_flags |= vq->vq_packed.cached_flags; + + if (can_push) { + /* prepend cannot fail, checked by caller */ + hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, + -head_size); + prepend_header = true; + + /* if offload disabled, it is not zeroed below, do it now */ + if (!vq->hw->has_tx_offload) + virtqueue_clear_net_hdr(hdr); + } else { + /* setup first tx ring slot to point to header + * stored in reserved region. + */ + start_dp[idx].addr = txvq->virtio_net_hdr_mem + + RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); + start_dp[idx].len = vq->hw->vtnet_hdr_size; + hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } + + virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); + + do { + uint16_t flags; + + start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); + start_dp[idx].len = cookie->data_len; + if (prepend_header) { + start_dp[idx].addr -= head_size; + start_dp[idx].len += head_size; + prepend_header = false; + } + + if (likely(idx != head_idx)) { + flags = cookie->next ? 
VRING_DESC_F_NEXT : 0; + flags |= vq->vq_packed.cached_flags; + start_dp[idx].flags = flags; + } + prev = idx; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } while ((cookie = cookie->next) != NULL); + + start_dp[prev].id = id; + + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); + vq->vq_avail_idx = idx; + + if (!in_order) { + vq->vq_desc_head_idx = dxp->next; + if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) + vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; + } + + virtqueue_store_flags_packed(head_dp, head_flags, + vq->hw->weak_barriers); +} + +static void +vq_ring_free_id_packed(struct virtqueue *vq, uint16_t id) +{ + struct vq_desc_extra *dxp; + + dxp = &vq->vq_descx[id]; + vq->vq_free_cnt += dxp->ndescs; + + if (vq->vq_desc_tail_idx == VQ_RING_DESC_CHAIN_END) + vq->vq_desc_head_idx = id; + else + vq->vq_descx[vq->vq_desc_tail_idx].next = id; + + vq->vq_desc_tail_idx = id; + dxp->next = VQ_RING_DESC_CHAIN_END; +} + +static void +virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num) +{ + uint16_t used_idx, id, curr_id, free_cnt = 0; + uint16_t size = vq->vq_nentries; + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; + struct vq_desc_extra *dxp; + + used_idx = vq->vq_used_cons_idx; + /* desc_is_used has a load-acquire or rte_cio_rmb inside + * and wait for used desc in virtqueue. + */ + while (num > 0 && desc_is_used(&desc[used_idx], vq)) { + id = desc[used_idx].id; + do { + curr_id = used_idx; + dxp = &vq->vq_descx[used_idx]; + used_idx += dxp->ndescs; + free_cnt += dxp->ndescs; + num -= dxp->ndescs; + if (used_idx >= size) { + used_idx -= size; + vq->vq_packed.used_wrap_counter ^= 1; + } + if (dxp->cookie != NULL) { + rte_pktmbuf_free(dxp->cookie); + dxp->cookie = NULL; + } + } while (curr_id != id); + } + vq->vq_used_cons_idx = used_idx; + vq->vq_free_cnt += free_cnt; +} + +static void +virtio_xmit_cleanup_normal_packed(struct virtqueue *vq, int num) +{ + uint16_t used_idx, id; + uint16_t size = vq->vq_nentries; + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; + struct vq_desc_extra *dxp; + + used_idx = vq->vq_used_cons_idx; + /* desc_is_used has a load-acquire or rte_cio_rmb inside + * and wait for used desc in virtqueue. + */ + while (num-- && desc_is_used(&desc[used_idx], vq)) { + id = desc[used_idx].id; + dxp = &vq->vq_descx[id]; + vq->vq_used_cons_idx += dxp->ndescs; + if (vq->vq_used_cons_idx >= size) { + vq->vq_used_cons_idx -= size; + vq->vq_packed.used_wrap_counter ^= 1; + } + vq_ring_free_id_packed(vq, id); + if (dxp->cookie != NULL) { + rte_pktmbuf_free(dxp->cookie); + dxp->cookie = NULL; + } + used_idx = vq->vq_used_cons_idx; + } +} + +/* Cleanup from completed transmits. 
*/ +static inline void +virtio_xmit_cleanup_packed(struct virtqueue *vq, int num, int in_order) +{ + if (in_order) + virtio_xmit_cleanup_inorder_packed(vq, num); + else + virtio_xmit_cleanup_normal_packed(vq, num); +} + +static inline void +virtio_xmit_cleanup(struct virtqueue *vq, uint16_t num) +{ + uint16_t i, used_idx, desc_idx; + for (i = 0; i < num; i++) { + struct vring_used_elem *uep; + struct vq_desc_extra *dxp; + + used_idx = (uint16_t)(vq->vq_used_cons_idx & + (vq->vq_nentries - 1)); + uep = &vq->vq_split.ring.used->ring[used_idx]; + + desc_idx = (uint16_t)uep->id; + dxp = &vq->vq_descx[desc_idx]; + vq->vq_used_cons_idx++; + vq_ring_free_chain(vq, desc_idx); + + if (dxp->cookie != NULL) { + rte_pktmbuf_free(dxp->cookie); + dxp->cookie = NULL; + } + } +} + +/* Cleanup from completed inorder transmits. */ +static __rte_always_inline void +virtio_xmit_cleanup_inorder(struct virtqueue *vq, uint16_t num) +{ + uint16_t i, idx = vq->vq_used_cons_idx; + int16_t free_cnt = 0; + struct vq_desc_extra *dxp = NULL; + + if (unlikely(num == 0)) + return; + + for (i = 0; i < num; i++) { + dxp = &vq->vq_descx[idx++ & (vq->vq_nentries - 1)]; + free_cnt += dxp->ndescs; + if (dxp->cookie != NULL) { + rte_pktmbuf_free(dxp->cookie); + dxp->cookie = NULL; + } + } + + vq->vq_free_cnt += free_cnt; + vq->vq_used_cons_idx = idx; +} #endif /* _VIRTQUEUE_H_ */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
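Both cleanup variants above poll desc_is_used() before walking vq_descx, and the wrap counter they toggle is what that check is compared against. Below is a minimal standalone sketch of that test, assuming the virtio 1.1 packed-ring flag layout (AVAIL at bit 7, USED at bit 15); the sketch_* names and macros are invented for illustration, and the real helper additionally fetches the flags with a load-acquire (or rte_cio_rmb) as the comments in the patch note.

#include <stdint.h>

#define SKETCH_DESC_F_AVAIL (1 << 7)
#define SKETCH_DESC_F_USED  (1 << 15)

/* Same field order as struct vring_packed_desc: addr, len, id, flags. */
struct sketch_packed_desc {
        uint64_t addr;
        uint32_t len;
        uint16_t id;
        uint16_t flags;
};

/*
 * A descriptor has been consumed by the device when its AVAIL and USED
 * bits are equal and both match the driver's used wrap counter.
 */
static inline int
sketch_desc_is_used(const struct sketch_packed_desc *desc,
                    uint16_t used_wrap_counter)
{
        uint16_t flags = desc->flags;
        uint16_t avail = !!(flags & SKETCH_DESC_F_AVAIL);
        uint16_t used = !!(flags & SKETCH_DESC_F_USED);

        return avail == used && used == used_wrap_counter;
}

With weak barriers the flags read has to be an acquire load so the descriptor payload written by the device is visible before the mbuf is freed, which is why the cleanup loops rely on the barrier inside desc_is_used() instead of adding their own.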
* [dpdk-dev] [PATCH v11 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 0/9] add packed ring " Marvin Liu ` (4 preceding siblings ...) 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 5/9] net/virtio: reuse packed ring functions Marvin Liu @ 2020-04-28 8:32 ` Marvin Liu 2020-04-30 9:48 ` Ferruh Yigit 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu ` (2 subsequent siblings) 8 siblings, 1 reply; 162+ messages in thread From: Marvin Liu @ 2020-04-28 8:32 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Optimize packed ring Rx path with SIMD instructions. Solution of optimization is pretty like vhost, is that split path into batch and single functions. Batch function is further optimized by AVX512 instructions. Also pad desc extra structure to 16 bytes aligned, thus four elements will be saved in one batch. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index c9edb84ee..102b1deab 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif +ifneq ($(FORCE_DISABLE_AVX512), y) + CC_AVX512_SUPPORT=\ + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ + grep -q AVX512 && echo 1) +endif + +ifeq ($(CC_AVX512_SUPPORT), 1) +CFLAGS += -DCC_AVX512_SUPPORT +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c + +ifeq ($(RTE_TOOLCHAIN), gcc) +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), clang) +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1) +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), icc) +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA +endif +endif + +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds +endif +endif + ifeq ($(CONFIG_RTE_VIRTIO_USER),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index 15150eea1..8e68c3039 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c', deps += ['kvargs', 'bus_pci'] if arch_subdir == 'x86' + if '-mno-avx512f' not in machine_args + if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw') + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] + cflags += ['-DCC_AVX512_SUPPORT'] + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' + endif + sources += files('virtio_rxtx_packed_avx.c') + endif + endif sources += 
files('virtio_rxtx_simple_sse.c') elif arch_subdir == 'ppc' sources += files('virtio_rxtx_simple_altivec.c') diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index febaf17a8..5c112cac7 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index a549991aa..534562cca 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2030,3 +2030,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, return nb_tx; } + +__rte_weak uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, + struct rte_mbuf **rx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c new file mode 100644 index 000000000..88831a786 --- /dev/null +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -0,0 +1,374 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> + +#include <rte_net.h> + +#include "virtio_logs.h" +#include "virtio_ethdev.h" +#include "virtio_pci.h" +#include "virtqueue.h" + +#define BYTE_SIZE 8 +/* flag bits offset in packed ring desc higher 64bits */ +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \ + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) + +#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \ + FLAGS_BITS_OFFSET) + +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ + sizeof(struct vring_packed_desc)) +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) + +#ifdef VIRTIO_GCC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_ICC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ + for (iter = val; iter < size; iter++) +#endif + +#ifndef virtio_for_each_try_unroll +#define virtio_for_each_try_unroll(iter, val, num) \ + for (iter = val; iter < num; iter++) +#endif + +static inline void +virtio_update_batch_stats(struct virtnet_stats *stats, + uint16_t pkt_len1, + uint16_t pkt_len2, + uint16_t pkt_len3, + uint16_t pkt_len4) +{ + stats->bytes += pkt_len1; + stats->bytes += pkt_len2; + stats->bytes += pkt_len3; + stats->bytes += pkt_len4; +} + +/* Optionally fill offload information in structure */ +static inline int +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) +{ + struct rte_net_hdr_lens hdr_lens; + uint32_t hdrlen, ptype; + int l4_supported = 0; + + /* nothing to do */ + if (hdr->flags == 0) + return 0; + + /* GSO not support in vec path, skip check */ + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; + + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); + m->packet_type = ptype; + if ((ptype & 
RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) + l4_supported = 1; + + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; + if (hdr->csum_start <= hdrlen && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; + } else { + /* Unknown proto or tunnel, do sw cksum. We can assume + * the cksum field is in the first segment since the + * buffers we provided to the host are large enough. + * In case of SCTP, this will be wrong since it's a CRC + * but there's nothing we can do. + */ + uint16_t csum = 0, off; + + rte_raw_cksum_mbuf(m, hdr->csum_start, + rte_pktmbuf_pkt_len(m) - hdr->csum_start, + &csum); + if (likely(csum != 0xffff)) + csum = ~csum; + off = hdr->csum_offset + hdr->csum_start; + if (rte_pktmbuf_data_len(m) >= off + 1) + *rte_pktmbuf_mtod_offset(m, uint16_t *, + off) = csum; + } + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; + } + + return 0; +} + +static inline uint16_t +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint64_t addrs[PACKED_BATCH_SIZE]; + uint16_t id = vq->vq_used_cons_idx; + uint8_t desc_stats; + uint16_t i; + void *desc_addr; + + if (id & PACKED_BATCH_MASK) + return -1; + + if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries)) + return -1; + + /* only care avail/used bits */ + __m512i v_mask = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + desc_addr = &vq->vq_packed.ring.desc[id]; + + __m512i v_desc = _mm512_loadu_si512(desc_addr); + __m512i v_flag = _mm512_and_epi64(v_desc, v_mask); + + __m512i v_used_flag = _mm512_setzero_si512(); + if (vq->vq_packed.used_wrap_counter) + v_used_flag = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + + /* Check all descs are used */ + desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag); + if (desc_stats) + return -1; + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); + + addrs[i] = (uintptr_t)rx_pkts[i]->rx_descriptor_fields1; + } + + /* + * load len from desc, store into mbuf pkt_len and data_len + * len limiated by l6bit buf_len, pkt_len[16:31] can be ignored + */ + const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12; + __m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, 0xAA); + + /* reduce hdr_len from pkt_len and data_len */ + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask, + (uint32_t)-hdr_size); + + __m512i v_value = _mm512_add_epi32(values, mbuf_len_offset); + + /* assert offset of data_len */ + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) != + offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8); + + __m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3], + addrs[2] + 8, addrs[2], + addrs[1] + 8, addrs[1], + addrs[0] + 8, addrs[0]); + /* batch store into mbufs */ + _mm512_i64scatter_epi64(0, v_index, v_value, 1); + + if (hw->has_rx_offload) { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + char *addr = (char *)rx_pkts[i]->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size; + virtio_vec_rx_offload(rx_pkts[i], + (struct virtio_net_hdr *)addr); + } + } + + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, 
+ rx_pkts[3]->pkt_len); + + vq->vq_free_cnt += PACKED_BATCH_SIZE; + + vq->vq_used_cons_idx += PACKED_BATCH_SIZE; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static uint16_t +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + uint16_t used_idx, id; + uint32_t len; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint32_t hdr_size = hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + struct vring_packed_desc *desc; + struct rte_mbuf *cookie; + + desc = vq->vq_packed.ring.desc; + used_idx = vq->vq_used_cons_idx; + if (!desc_is_used(&desc[used_idx], vq)) + return -1; + + len = desc[used_idx].len; + id = desc[used_idx].id; + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; + if (unlikely(cookie == NULL)) { + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u", + vq->vq_used_cons_idx); + return -1; + } + rte_prefetch0(cookie); + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); + + cookie->data_off = RTE_PKTMBUF_HEADROOM; + cookie->ol_flags = 0; + cookie->pkt_len = (uint32_t)(len - hdr_size); + cookie->data_len = (uint32_t)(len - hdr_size); + + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size); + if (hw->has_rx_offload) + virtio_vec_rx_offload(cookie, hdr); + + *rx_pkts = cookie; + + rxvq->stats.bytes += cookie->pkt_len; + + vq->vq_free_cnt++; + vq->vq_used_cons_idx++; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static inline void +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **cookie, + uint16_t num) +{ + struct virtqueue *vq = rxvq->vq; + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; + uint16_t flags = vq->vq_packed.cached_flags; + struct virtio_hw *hw = vq->hw; + struct vq_desc_extra *dxp; + uint16_t idx, i; + uint16_t batch_num, total_num = 0; + uint16_t head_idx = vq->vq_avail_idx; + uint16_t head_flag = vq->vq_packed.cached_flags; + uint64_t addr; + + do { + idx = vq->vq_avail_idx; + + batch_num = PACKED_BATCH_SIZE; + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) + batch_num = vq->vq_nentries - idx; + if (unlikely((total_num + batch_num) > num)) + batch_num = num - total_num; + + virtio_for_each_try_unroll(i, 0, batch_num) { + dxp = &vq->vq_descx[idx + i]; + dxp->cookie = (void *)cookie[total_num + i]; + + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) + + RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size; + start_dp[idx + i].addr = addr; + start_dp[idx + i].len = cookie[total_num + i]->buf_len + - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size; + if (total_num || i) { + virtqueue_store_flags_packed(&start_dp[idx + i], + flags, hw->weak_barriers); + } + } + + vq->vq_avail_idx += batch_num; + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + flags = vq->vq_packed.cached_flags; + } + total_num += batch_num; + } while (total_num < num); + + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, + hw->weak_barriers); + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); +} + +uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue, + struct rte_mbuf **rx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_rx *rxvq = rx_queue; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t 
num, nb_rx = 0; + uint32_t nb_enqueued = 0; + uint16_t free_cnt = vq->vq_free_thresh; + + if (unlikely(hw->started == 0)) + return nb_rx; + + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); + if (likely(num > PACKED_BATCH_SIZE)) + num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE); + + while (num) { + if (!virtqueue_dequeue_batch_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx += PACKED_BATCH_SIZE; + num -= PACKED_BATCH_SIZE; + continue; + } + if (!virtqueue_dequeue_single_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx++; + num--; + continue; + } + break; + }; + + PMD_RX_LOG(DEBUG, "dequeue:%d", num); + + rxvq->stats.packets += nb_rx; + + if (likely(vq->vq_free_cnt >= free_cnt)) { + struct rte_mbuf *new_pkts[free_cnt]; + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, + free_cnt) == 0)) { + virtio_recv_refill_packed_vec(rxvq, new_pkts, + free_cnt); + nb_enqueued += free_cnt; + } else { + struct rte_eth_dev *dev = + &rte_eth_devices[rxvq->port_id]; + dev->data->rx_mbuf_alloc_failed += free_cnt; + } + } + + if (likely(nb_enqueued)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_RX_LOG(DEBUG, "Notified"); + } + } + + return nb_rx; +} diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 40ad786cc..c54698ad1 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -528,6 +528,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev) hw->use_msix = 1; hw->modern = 0; hw->use_vec_rx = 0; + hw->use_vec_tx = 0; hw->use_inorder_rx = 0; hw->use_inorder_tx = 0; hw->virtio_user_dev = dev; @@ -739,8 +740,19 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) goto end; } - if (vectorized) - hw->use_vec_rx = 1; + if (vectorized) { + if (packed_vq) { +#if defined(CC_AVX512_SUPPORT) + hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#else + PMD_INIT_LOG(INFO, + "building environment do not support packed ring vectorized"); +#endif + } else { + hw->use_vec_rx = 1; + } + } rte_eth_dev_probing_finish(eth_dev); ret = 0; diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index ca1c10499..ce0340743 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -239,7 +239,8 @@ struct vq_desc_extra { void *cookie; uint16_t ndescs; uint16_t next; -}; + uint8_t padding[4]; +} __rte_packed __rte_aligned(16); struct virtqueue { struct virtio_hw *hw; /**< virtio_hw structure pointer. */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
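The fast path above hinges on the batch test in virtqueue_dequeue_batch_packed_vec(): four 16-byte descriptors fill one 64-byte cache line, so whether a whole batch is ready can be decided from a single 512-bit load, a mask that keeps only the AVAIL/USED bits and one compare against the pattern expected for the current wrap counter. A scalar sketch of the same decision follows (illustrative only, not the driver code; the sketch_* names are invented and the flag bit positions follow the virtio 1.1 spec).

#include <stdint.h>
#include <assert.h>

#define SKETCH_DESC_F_AVAIL (1 << 7)
#define SKETCH_DESC_F_USED  (1 << 15)
#define SKETCH_BATCH_SIZE   4   /* 64-byte cache line / 16-byte descriptor */

struct sketch_packed_desc {
        uint64_t addr;
        uint32_t len;
        uint16_t id;
        uint16_t flags;
};

static_assert(sizeof(struct sketch_packed_desc) == 16,
              "four descriptors per cache line is what makes the batch work");

/*
 * A batch starting at used_idx can be received when the index is batch
 * aligned, the batch does not cross the end of the ring, and all four
 * descriptors are marked used for the current wrap counter.
 */
static inline int
sketch_batch_rx_ready(const struct sketch_packed_desc *ring,
                      uint16_t used_idx, uint16_t nentries,
                      uint16_t used_wrap_counter)
{
        uint16_t i;

        if (used_idx & (SKETCH_BATCH_SIZE - 1))
                return 0;
        if (used_idx + SKETCH_BATCH_SIZE > nentries)
                return 0;

        for (i = 0; i < SKETCH_BATCH_SIZE; i++) {
                uint16_t flags = ring[used_idx + i].flags;
                uint16_t avail = !!(flags & SKETCH_DESC_F_AVAIL);
                uint16_t used = !!(flags & SKETCH_DESC_F_USED);

                if (avail != used || used != used_wrap_counter)
                        return 0;
        }
        return 1;
}

When this test fails the driver falls back to virtqueue_dequeue_single_packed_vec() for one descriptor and then retries a batch, which is the loop visible in virtio_recv_pkts_packed_vec() above.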
* Re: [dpdk-dev] [PATCH v11 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 6/9] net/virtio: add vectorized packed ring Rx path Marvin Liu @ 2020-04-30 9:48 ` Ferruh Yigit 2020-04-30 10:23 ` Bruce Richardson 0 siblings, 1 reply; 162+ messages in thread From: Ferruh Yigit @ 2020-04-30 9:48 UTC (permalink / raw) To: Marvin Liu, maxime.coquelin Cc: xiaolong.ye, zhihong.wang, dev, Luca Boccassi, Bruce Richardson On 4/28/2020 9:32 AM, Marvin Liu wrote: > Optimize packed ring Rx path with SIMD instructions. Solution of > optimization is pretty like vhost, is that split path into batch and > single functions. Batch function is further optimized by AVX512 > instructions. Also pad desc extra structure to 16 bytes aligned, thus > four elements will be saved in one batch. > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> <...> > @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c', > deps += ['kvargs', 'bus_pci'] > > if arch_subdir == 'x86' > + if '-mno-avx512f' not in machine_args > + if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw') > + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] > + cflags += ['-DCC_AVX512_SUPPORT'] > + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) > + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' > + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) > + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' > + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) > + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' > + endif > + sources += files('virtio_rxtx_packed_avx.c') > + endif > + endif This is giving following error in Travis build [1], it is seems this usage is supported since meson 0.49 [2] and Travis has 0.47 [3], also DPDK supports version 0.47.1+ [4]. Can you please check for meson v0.47 version way of doing same thing? [1] drivers/net/virtio/meson.build:12:19: ERROR: Expecting eol got not. if '-mno-avx512f' not in machine_args ^ [2] https://mesonbuild.com/Syntax.html#dictionaries Since 0.49.0, you can check if a dictionary contains a key like this: if 'foo' not in my_dict # This condition is false endif [3] The Meson build system Version: 0.47.1 [4] doc/guides/linux_gsg/sys_reqs.rst * Meson (version 0.47.1+) and ninja ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v11 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-30 9:48 ` Ferruh Yigit @ 2020-04-30 10:23 ` Bruce Richardson 2020-04-30 13:04 ` Ferruh Yigit 0 siblings, 1 reply; 162+ messages in thread From: Bruce Richardson @ 2020-04-30 10:23 UTC (permalink / raw) To: Ferruh Yigit Cc: Marvin Liu, maxime.coquelin, xiaolong.ye, zhihong.wang, dev, Luca Boccassi On Thu, Apr 30, 2020 at 10:48:35AM +0100, Ferruh Yigit wrote: > On 4/28/2020 9:32 AM, Marvin Liu wrote: > > Optimize packed ring Rx path with SIMD instructions. Solution of > > optimization is pretty like vhost, is that split path into batch and > > single functions. Batch function is further optimized by AVX512 > > instructions. Also pad desc extra structure to 16 bytes aligned, thus > > four elements will be saved in one batch. > > > > Signed-off-by: Marvin Liu <yong.liu@intel.com> > > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> > > <...> > > > @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c', > > deps += ['kvargs', 'bus_pci'] > > > > if arch_subdir == 'x86' > > + if '-mno-avx512f' not in machine_args > > + if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw') > > + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] > > + cflags += ['-DCC_AVX512_SUPPORT'] > > + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) > > + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' > > + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) > > + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' > > + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) > > + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' > > + endif > > + sources += files('virtio_rxtx_packed_avx.c') > > + endif > > + endif > > This is giving following error in Travis build [1], it is seems this usage is > supported since meson 0.49 [2] and Travis has 0.47 [3], also DPDK supports > version 0.47.1+ [4]. > > Can you please check for meson v0.47 version way of doing same thing? > > <arrayname>.contains() is probably what you want. /Bruce ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v11 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-30 10:23 ` Bruce Richardson @ 2020-04-30 13:04 ` Ferruh Yigit 0 siblings, 0 replies; 162+ messages in thread From: Ferruh Yigit @ 2020-04-30 13:04 UTC (permalink / raw) To: Bruce Richardson Cc: Marvin Liu, maxime.coquelin, xiaolong.ye, zhihong.wang, dev, Luca Boccassi On 4/30/2020 11:23 AM, Bruce Richardson wrote: > On Thu, Apr 30, 2020 at 10:48:35AM +0100, Ferruh Yigit wrote: >> On 4/28/2020 9:32 AM, Marvin Liu wrote: >>> Optimize packed ring Rx path with SIMD instructions. Solution of >>> optimization is pretty like vhost, is that split path into batch and >>> single functions. Batch function is further optimized by AVX512 >>> instructions. Also pad desc extra structure to 16 bytes aligned, thus >>> four elements will be saved in one batch. >>> >>> Signed-off-by: Marvin Liu <yong.liu@intel.com> >>> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> >> >> <...> >> >>> @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c', >>> deps += ['kvargs', 'bus_pci'] >>> >>> if arch_subdir == 'x86' >>> + if '-mno-avx512f' not in machine_args >>> + if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw') >>> + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] >>> + cflags += ['-DCC_AVX512_SUPPORT'] >>> + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) >>> + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' >>> + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) >>> + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' >>> + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) >>> + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' >>> + endif >>> + sources += files('virtio_rxtx_packed_avx.c') >>> + endif >>> + endif >> >> This is giving following error in Travis build [1], it is seems this usage is >> supported since meson 0.49 [2] and Travis has 0.47 [3], also DPDK supports >> version 0.47.1+ [4]. >> >> Can you please check for meson v0.47 version way of doing same thing? >> >> > <arrayname>.contains() is probably what you want. > Thanks Bruce, I will update in the next-net as following [1]. @Marvin can you please double check it on the next-net? [1] - if '-mno-avx512f' not in machine_args + if not machine_args.contains('-mno-avx512f') ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v11 7/9] net/virtio: add vectorized packed ring Tx path 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 0/9] add packed ring " Marvin Liu ` (5 preceding siblings ...) 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 6/9] net/virtio: add vectorized packed ring Rx path Marvin Liu @ 2020-04-28 8:32 ` Marvin Liu 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 8/9] net/virtio: add election for vectorized path Marvin Liu 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 9/9] doc: add packed " Marvin Liu 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-28 8:32 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Optimize packed ring Tx path like Rx path. Split Tx path into batch and single Tx functions. Batch function is further optimized by AVX512 instructions. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index 5c112cac7..b7d52d497 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 534562cca..460e9d4a2 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2038,3 +2038,11 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, { return 0; } + +__rte_weak uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused, + struct rte_mbuf **tx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c index 88831a786..a7358c768 100644 --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -23,6 +23,24 @@ #define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \ FLAGS_BITS_OFFSET) +/* reference count offset in mbuf rearm data */ +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \ + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) +/* segment number offset in mbuf rearm data */ +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \ + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) + +/* default rearm data */ +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \ + 1ULL << REFCNT_BITS_OFFSET) + +/* id bits offset in packed ring desc higher 64bits */ +#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \ + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) + +/* net hdr short size mask */ +#define NET_HDR_MASK 0x3F + #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ sizeof(struct vring_packed_desc)) #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) @@ -60,6 +78,237 @@ virtio_update_batch_stats(struct virtnet_stats *stats, stats->bytes += pkt_len4; } +static inline int +virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf **tx_pkts) +{ + struct virtqueue *vq = txvq->vq; + uint16_t head_size = vq->hw->vtnet_hdr_size; + uint16_t idx = vq->vq_avail_idx; + struct virtio_net_hdr *hdr; + uint16_t i, cmp; + + if 
(vq->vq_avail_idx & PACKED_BATCH_MASK) + return -1; + + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) + return -1; + + /* Load four mbufs rearm data */ + RTE_BUILD_BUG_ON(REFCNT_BITS_OFFSET >= 64); + RTE_BUILD_BUG_ON(SEG_NUM_BITS_OFFSET >= 64); + __m256i mbufs = _mm256_set_epi64x(*tx_pkts[3]->rearm_data, + *tx_pkts[2]->rearm_data, + *tx_pkts[1]->rearm_data, + *tx_pkts[0]->rearm_data); + + /* refcnt=1 and nb_segs=1 */ + __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA); + __m256i head_rooms = _mm256_set1_epi16(head_size); + + /* Check refcnt and nb_segs */ + const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12; + cmp = _mm256_mask_cmpneq_epu16_mask(mask, mbufs, mbuf_ref); + if (unlikely(cmp)) + return -1; + + /* Check headroom is enough */ + const __mmask16 data_mask = 0x1 | 0x1 << 4 | 0x1 << 8 | 0x1 << 12; + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_off) != + offsetof(struct rte_mbuf, rearm_data)); + cmp = _mm256_mask_cmplt_epu16_mask(data_mask, mbufs, head_rooms); + if (unlikely(cmp)) + return -1; + + __m512i v_descx = _mm512_set_epi64(0x1, (uintptr_t)tx_pkts[3], + 0x1, (uintptr_t)tx_pkts[2], + 0x1, (uintptr_t)tx_pkts[1], + 0x1, (uintptr_t)tx_pkts[0]); + + _mm512_storeu_si512((void *)&vq->vq_descx[idx], v_descx); + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + tx_pkts[i]->data_off -= head_size; + tx_pkts[i]->data_len += head_size; + } + +#ifdef RTE_VIRTIO_USER + __m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[3])), + tx_pkts[2]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[2])), + tx_pkts[1]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[1])), + tx_pkts[0]->data_len, + (uint64_t)(*(uintptr_t *)((uintptr_t)tx_pkts[0]))); +#else + __m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len, + tx_pkts[3]->buf_iova, + tx_pkts[2]->data_len, + tx_pkts[2]->buf_iova, + tx_pkts[1]->data_len, + tx_pkts[1]->buf_iova, + tx_pkts[0]->data_len, + tx_pkts[0]->buf_iova); +#endif + + /* id offset and data offset */ + __m512i data_offsets = _mm512_set_epi64((uint64_t)3 << ID_BITS_OFFSET, + tx_pkts[3]->data_off, + (uint64_t)2 << ID_BITS_OFFSET, + tx_pkts[2]->data_off, + (uint64_t)1 << ID_BITS_OFFSET, + tx_pkts[1]->data_off, + 0, tx_pkts[0]->data_off); + + __m512i new_descs = _mm512_add_epi64(descs_base, data_offsets); + + uint64_t flags_temp = (uint64_t)idx << ID_BITS_OFFSET | + (uint64_t)vq->vq_packed.cached_flags << FLAGS_BITS_OFFSET; + + /* flags offset and guest virtual address offset */ +#ifdef RTE_VIRTIO_USER + __m128i flag_offset = _mm_set_epi64x(flags_temp, (uint64_t)vq->offset); +#else + __m128i flag_offset = _mm_set_epi64x(flags_temp, 0); +#endif + __m512i v_offset = _mm512_broadcast_i32x4(flag_offset); + + __m512i v_desc = _mm512_add_epi64(new_descs, v_offset); + + if (!vq->hw->has_tx_offload) { + __m128i all_mask = _mm_set1_epi16(0xFFFF); + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + __m128i v_hdr = _mm_loadu_si128((void *)hdr); + if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK, + v_hdr, all_mask))) { + __m128i all_zero = _mm_setzero_si128(); + _mm_mask_storeu_epi16((void *)hdr, + NET_HDR_MASK, all_zero); + } + } + } else { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + virtqueue_xmit_offload(hdr, tx_pkts[i], true); + } + } + + /* Enqueue Packet buffers */ + 
_mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], v_desc); + + virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len, + tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len, + tx_pkts[3]->pkt_len); + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + vq->vq_free_cnt -= PACKED_BATCH_SIZE; + + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + + return 0; +} + +static inline int +virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf *txm) +{ + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint16_t slots, can_push; + int16_t need; + + /* How many main ring entries are needed to this Tx? + * any_layout => number of segments + * default => number of segments + 1 + */ + can_push = rte_mbuf_refcnt_read(txm) == 1 && + RTE_MBUF_DIRECT(txm) && + txm->nb_segs == 1 && + rte_pktmbuf_headroom(txm) >= hdr_size; + + slots = txm->nb_segs + !can_push; + need = slots - vq->vq_free_cnt; + + /* Positive value indicates it need free vring descriptors */ + if (unlikely(need > 0)) { + virtio_xmit_cleanup_inorder_packed(vq, need); + need = slots - vq->vq_free_cnt; + if (unlikely(need > 0)) { + PMD_TX_LOG(ERR, + "No free tx descriptors to transmit"); + return -1; + } + } + + /* Enqueue Packet buffers */ + virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1); + + txvq->stats.bytes += txm->pkt_len; + return 0; +} + +uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_tx *txvq = tx_queue; + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t nb_tx = 0; + uint16_t remained; + + if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) + return nb_tx; + + if (unlikely(nb_pkts < 1)) + return nb_pkts; + + PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts); + + if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh) + virtio_xmit_cleanup_inorder_packed(vq, vq->vq_free_thresh); + + remained = RTE_MIN(nb_pkts, vq->vq_free_cnt); + + while (remained) { + if (remained >= PACKED_BATCH_SIZE) { + if (!virtqueue_enqueue_batch_packed_vec(txvq, + &tx_pkts[nb_tx])) { + nb_tx += PACKED_BATCH_SIZE; + remained -= PACKED_BATCH_SIZE; + continue; + } + } + if (!virtqueue_enqueue_single_packed_vec(txvq, + tx_pkts[nb_tx])) { + nb_tx++; + remained--; + continue; + } + break; + }; + + txvq->stats.packets += nb_tx; + + if (likely(nb_tx)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_TX_LOG(DEBUG, "Notified backend after xmit"); + } + } + + return nb_tx; +} + /* Optionally fill offload information in structure */ static inline int virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
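The batched Tx enqueue only accepts mbufs whose virtio net header can be prepended in place: refcnt and nb_segs are verified for all four packets with one 256-bit compare over rearm_data, and the headroom with a compare against data_off. Per packet this is the "can_push" test used by the single-packet path above; a hedged restatement of that per-mbuf condition (the function name is invented, the checks are lifted from virtqueue_enqueue_single_packed_vec()):

#include <rte_mbuf.h>

/*
 * A packet can take the in-place header path when it is a single direct
 * segment, owned only by the caller, and has room in front of the data to
 * prepend the virtio net header (vq->hw->vtnet_hdr_size in the driver).
 */
static inline int
sketch_can_push_hdr(struct rte_mbuf *m, uint16_t hdr_size)
{
        return rte_mbuf_refcnt_read(m) == 1 &&
               RTE_MBUF_DIRECT(m) &&
               m->nb_segs == 1 &&
               rte_pktmbuf_headroom(m) >= hdr_size;
}

If any of the four packets in a candidate batch fails the vectorized checks, virtio_xmit_pkts_packed_vec() falls back to enqueuing them one at a time through the single-packet helper, losing only the batching benefit.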
* [dpdk-dev] [PATCH v11 8/9] net/virtio: add election for vectorized path 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 0/9] add packed ring " Marvin Liu ` (6 preceding siblings ...) 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu @ 2020-04-28 8:32 ` Marvin Liu 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 9/9] doc: add packed " Marvin Liu 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-28 8:32 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Rewrite vectorized path selection logic. Default setting comes from vectorized devarg, then checks each criteria. Packed ring vectorized path need: AVX512F and required extensions are supported by compiler and host VERSION_1 and IN_ORDER features are negotiated mergeable feature is not negotiated LRO offloading is disabled Split ring vectorized rx path need: mergeable and IN_ORDER features are not negotiated LRO, chksum and vlan strip offloadings are disabled Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 0a69a4db1..088d0e45e 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1523,9 +1523,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) if (vtpci_packed_queue(hw)) { PMD_INIT_LOG(INFO, "virtio: using packed ring %s Tx path on port %u", - hw->use_inorder_tx ? "inorder" : "standard", + hw->use_vec_tx ? "vectorized" : "standard", eth_dev->data->port_id); - eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; + if (hw->use_vec_tx) + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec; + else + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; } else { if (hw->use_inorder_tx) { PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u", @@ -1539,7 +1542,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) } if (vtpci_packed_queue(hw)) { - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + if (hw->use_vec_rx) { + PMD_INIT_LOG(INFO, + "virtio: using packed ring vectorized Rx path on port %u", + eth_dev->data->port_id); + eth_dev->rx_pkt_burst = + &virtio_recv_pkts_packed_vec; + } else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { PMD_INIT_LOG(INFO, "virtio: using packed ring mergeable buffer Rx path on port %u", eth_dev->data->port_id); @@ -1952,8 +1961,17 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) goto err_virtio_init; if (vectorized) { - if (!vtpci_packed_queue(hw)) + if (!vtpci_packed_queue(hw)) { + hw->use_vec_rx = 1; + } else { +#if !defined(CC_AVX512_SUPPORT) + PMD_DRV_LOG(INFO, + "building environment do not support packed ring vectorized"); +#else hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#endif + } } hw->opened = true; @@ -2288,31 +2306,61 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { - hw->use_inorder_tx = 1; - hw->use_inorder_rx = 1; - hw->use_vec_rx = 0; - } - if (vtpci_packed_queue(hw)) { - hw->use_vec_rx = 0; - hw->use_inorder_rx = 0; - } + if ((hw->use_vec_rx || hw->use_vec_tx) && + (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) || + !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) || + !vtpci_with_feature(hw, VIRTIO_F_VERSION_1))) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized path for requirements not met"); + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; + } + if (hw->use_vec_rx) { + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, 
+ "disabled packed ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } + + if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for TCP_LRO enabled"); + hw->use_vec_rx = 0; + } + } + } else { + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { + hw->use_inorder_tx = 1; + hw->use_inorder_rx = 1; + hw->use_vec_rx = 0; + } + + if (hw->use_vec_rx) { #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM - if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_vec_rx = 0; - } + if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized path for requirement not met"); + hw->use_vec_rx = 0; + } #endif - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_vec_rx = 0; - } + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } - if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | - DEV_RX_OFFLOAD_TCP_CKSUM | - DEV_RX_OFFLOAD_TCP_LRO | - DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_vec_rx = 0; + if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | + DEV_RX_OFFLOAD_TCP_CKSUM | + DEV_RX_OFFLOAD_TCP_LRO | + DEV_RX_OFFLOAD_VLAN_STRIP)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for offloading enabled"); + hw->use_vec_rx = 0; + } + } + } return 0; } -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v11 9/9] doc: add packed vectorized path 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 0/9] add packed ring " Marvin Liu ` (7 preceding siblings ...) 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 8/9] net/virtio: add election for vectorized path Marvin Liu @ 2020-04-28 8:32 ` Marvin Liu 8 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-28 8:32 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Document packed virtqueue vectorized path selection logic in virtio net PMD. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index fdd0790e0..226f4308d 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -482,6 +482,13 @@ according to below configuration: both negotiated, this path will be selected. #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and Rx mergeable is not negotiated, this path will be selected. +#. Packed virtqueue vectorized Rx path: If building and running environment support + AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated && + TCP_LRO Rx offloading is disabled && vectorized option enabled, + this path will be selected. +#. Packed virtqueue vectorized Tx path: If building and running environment support + AVX512 && in-order feature is negotiated && vectorized option enabled, + this path will be selected. Rx/Tx callbacks of each Virtio path ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -504,6 +511,8 @@ are shown in below table: Packed virtqueue non-meregable path virtio_recv_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order mergeable path virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed virtio_xmit_pkts_packed + Packed virtqueue vectorized Rx path virtio_recv_pkts_packed_vec virtio_xmit_pkts_packed + Packed virtqueue vectorized Tx path virtio_recv_pkts_packed virtio_xmit_pkts_packed_vec ============================================ ================================= ======================== Virtio paths Support Status from Release to Release @@ -521,20 +530,22 @@ All virtio paths support status are shown in below table: .. 
table:: Virtio Paths and Releases - ============================================ ============= ============= ============= - Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 - ============================================ ============= ============= ============= - Split virtqueue mergeable path Y Y Y - Split virtqueue non-mergeable path Y Y Y - Split virtqueue vectorized Rx path Y Y Y - Split virtqueue simple Tx path Y N N - Split virtqueue in-order mergeable path Y Y - Split virtqueue in-order non-mergeable path Y Y - Packed virtqueue mergeable path Y - Packed virtqueue non-mergeable path Y - Packed virtqueue in-order mergeable path Y - Packed virtqueue in-order non-mergeable path Y - ============================================ ============= ============= ============= + ============================================ ============= ============= ============= ======= + Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~ + ============================================ ============= ============= ============= ======= + Split virtqueue mergeable path Y Y Y Y + Split virtqueue non-mergeable path Y Y Y Y + Split virtqueue vectorized Rx path Y Y Y Y + Split virtqueue simple Tx path Y N N N + Split virtqueue in-order mergeable path Y Y Y + Split virtqueue in-order non-mergeable path Y Y Y + Packed virtqueue mergeable path Y Y + Packed virtqueue non-mergeable path Y Y + Packed virtqueue in-order mergeable path Y Y + Packed virtqueue in-order non-mergeable path Y Y + Packed virtqueue vectorized Rx path Y + Packed virtqueue vectorized Tx path Y + ============================================ ============= ============= ============= ======= QEMU Support Status ~~~~~~~~~~~~~~~~~~~ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v12 0/9] add packed ring vectorized path 2020-03-13 17:42 [dpdk-dev] [PATCH v1 0/7] vectorize virtio packed ring datapath Marvin Liu ` (16 preceding siblings ...) 2020-04-28 8:32 ` [dpdk-dev] [PATCH v11 0/9] add packed ring " Marvin Liu @ 2020-04-29 7:28 ` Marvin Liu 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 1/9] net/virtio: add Rx free threshold setting Marvin Liu ` (9 more replies) 17 siblings, 10 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-29 7:28 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu This patch set introduced vectorized path for packed ring. The size of packed ring descriptor is 16Bytes. Four batched descriptors are just placed into one cacheline. AVX512 instructions can well handle this kind of data. Packed ring TX path can fully transformed into vectorized path. Packed ring Rx path can be vectorized when requirements met(LRO and mergeable disabled). New device parameter "vectorized" will be introduced in this patch set. This parameter will be workable for both virtio device and virtio user vdev. It will also unify split and packed ring vectorized path default setting. Path election logic will check dependencies of vectorized path. Packed ring vectorized path is dependent on building/running environment and features like IN_ORDER and VERSION_1 enabled, MRG and LRO disabled. If vectorized path is not supported, will fallback to normal path. v12: * eliminate weak symbols in data path * remove desc extra padding which can impact normal path * fix enqueue address invalid v11: * fix i686 build warnings * fix typo in doc v10: * reuse packed ring xmit cleanup v9: * replace RTE_LIBRTE_VIRTIO_INC_VECTOR with vectorized devarg * reorder patch sequence v8: * fix meson build error on ubuntu16.04 and suse15 v7: * default vectorization is disabled * compilation time check dependency on rte_mbuf structure * offsets are calcuated when compiling * remove useless barrier as descs are batched store&load * vindex of scatter is directly set * some comments updates * enable vectorized path in meson build v6: * fix issue when size not power of 2 v5: * remove cpuflags definition as required extensions always come with AVX512F on x86_64 * inorder actions should depend on feature bit * check ring type in rx queue setup * rewrite some commit logs * fix some checkpatch warnings v4: * rename 'packed_vec' to 'vectorized', also used in split ring * add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev * check required AVX512 extensions cpuflags * combine split and packed ring datapath selection logic * remove limitation that size must power of two * clear 12Bytes virtio_net_hdr v3: * remove virtio_net_hdr array for better performance * disable 'packed_vec' by default v2: * more function blocks replaced by vector instructions * clean virtio_net_hdr by vector instruction * allow header room size change * add 'packed_vec' option in virtio_user vdev * fix build not check whether AVX512 enabled * doc update Tested-by: Wang, Yinan <yinan.wang@intel.com> Marvin Liu (9): net/virtio: add Rx free threshold setting net/virtio: inorder should depend on feature bit net/virtio: add vectorized devarg net/virtio-user: add vectorized devarg net/virtio: reuse packed ring functions net/virtio: add vectorized packed ring Rx path net/virtio: add vectorized packed ring Tx path net/virtio: add election for vectorized path doc: add packed vectorized path doc/guides/nics/virtio.rst | 52 +- drivers/net/virtio/Makefile | 35 ++ 
drivers/net/virtio/meson.build | 14 + drivers/net/virtio/virtio_ethdev.c | 142 ++++- drivers/net/virtio/virtio_ethdev.h | 6 + drivers/net/virtio/virtio_pci.h | 3 +- drivers/net/virtio/virtio_rxtx.c | 351 ++--------- drivers/net/virtio/virtio_rxtx_packed_avx.c | 607 ++++++++++++++++++++ drivers/net/virtio/virtio_user_ethdev.c | 32 +- drivers/net/virtio/virtqueue.c | 7 +- drivers/net/virtio/virtqueue.h | 304 ++++++++++ 11 files changed, 1199 insertions(+), 354 deletions(-) create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v12 1/9] net/virtio: add Rx free threshold setting 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 0/9] add packed ring " Marvin Liu @ 2020-04-29 7:28 ` Marvin Liu 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 2/9] net/virtio: inorder should depend on feature bit Marvin Liu ` (8 subsequent siblings) 9 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-29 7:28 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Introduce free threshold setting in Rx queue, its default value is 32. Limit the threshold size to multiple of four as only vectorized packed Rx function will utilize it. Virtio driver will rearm Rx queue when more than rx_free_thresh descs were dequeued. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 060410577..94ba7a3ec 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -936,6 +936,7 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, struct virtio_hw *hw = dev->data->dev_private; struct virtqueue *vq = hw->vqs[vtpci_queue_idx]; struct virtnet_rx *rxvq; + uint16_t rx_free_thresh; PMD_INIT_FUNC_TRACE(); @@ -944,6 +945,28 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev, return -EINVAL; } + rx_free_thresh = rx_conf->rx_free_thresh; + if (rx_free_thresh == 0) + rx_free_thresh = + RTE_MIN(vq->vq_nentries / 4, DEFAULT_RX_FREE_THRESH); + + if (rx_free_thresh & 0x3) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be multiples of four." + " (rx_free_thresh=%u port=%u queue=%u)\n", + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + + if (rx_free_thresh >= vq->vq_nentries) { + RTE_LOG(ERR, PMD, "rx_free_thresh must be less than the " + "number of RX entries (%u)." + " (rx_free_thresh=%u port=%u queue=%u)\n", + vq->vq_nentries, + rx_free_thresh, dev->data->port_id, queue_idx); + return -EINVAL; + } + vq->vq_free_thresh = rx_free_thresh; + if (nb_desc == 0 || nb_desc > vq->vq_nentries) nb_desc = vq->vq_nentries; vq->vq_free_cnt = RTE_MIN(vq->vq_free_cnt, nb_desc); diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 58ad7309a..6301c56b2 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -18,6 +18,8 @@ struct rte_mbuf; +#define DEFAULT_RX_FREE_THRESH 32 + /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
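Seen from the application, the new checks surface through rte_eth_rx_queue_setup(): rx_free_thresh comes from struct rte_eth_rxconf, zero means use the default (the smaller of a quarter of the ring and 32), and a value that is not a multiple of four or not below the ring size now fails with -EINVAL. A usage sketch follows; the function name, port id, queue id, ring size and mempool are illustrative.

#include <rte_ethdev.h>
#include <rte_mempool.h>

static int
setup_virtio_rxq(uint16_t port_id, struct rte_mempool *mp)
{
        struct rte_eth_dev_info dev_info;
        struct rte_eth_rxconf rxconf;
        int ret;

        ret = rte_eth_dev_info_get(port_id, &dev_info);
        if (ret != 0)
                return ret;

        rxconf = dev_info.default_rxconf;
        rxconf.rx_free_thresh = 32;     /* multiple of four, below ring size */

        return rte_eth_rx_queue_setup(port_id, 0 /* queue id */,
                                      256 /* ring size */,
                                      rte_eth_dev_socket_id(port_id),
                                      &rxconf, mp);
}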
* [dpdk-dev] [PATCH v12 2/9] net/virtio: inorder should depend on feature bit 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 0/9] add packed ring " Marvin Liu 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 1/9] net/virtio: add Rx free threshold setting Marvin Liu @ 2020-04-29 7:28 ` Marvin Liu 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 3/9] net/virtio: add vectorized devarg Marvin Liu ` (7 subsequent siblings) 9 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-29 7:28 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Ring initialization is different when inorder feature negotiated. This action should dependent on negotiated feature bits. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 94ba7a3ec..e450477e8 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -989,6 +989,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) struct rte_mbuf *m; uint16_t desc_idx; int error, nbufs, i; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); PMD_INIT_FUNC_TRACE(); @@ -1018,7 +1019,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; } - } else if (hw->use_inorder_rx) { + } else if (!vtpci_packed_queue(vq->hw) && in_order) { if ((!virtqueue_full(vq))) { uint16_t free_cnt = vq->vq_free_cnt; struct rte_mbuf *pkts[free_cnt]; @@ -1133,7 +1134,7 @@ virtio_dev_tx_queue_setup_finish(struct rte_eth_dev *dev, PMD_INIT_FUNC_TRACE(); if (!vtpci_packed_queue(hw)) { - if (hw->use_inorder_tx) + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) vq->vq_split.ring.desc[vq->vq_nentries - 1].next = 0; } @@ -2046,7 +2047,7 @@ virtio_xmit_pkts_packed(void *tx_queue, struct rte_mbuf **tx_pkts, struct virtio_hw *hw = vq->hw; uint16_t hdr_size = hw->vtnet_hdr_size; uint16_t nb_tx = 0; - bool in_order = hw->use_inorder_tx; + bool in_order = vtpci_with_feature(hw, VIRTIO_F_IN_ORDER); if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) return nb_tx; -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v12 3/9] net/virtio: add vectorized devarg 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 0/9] add packed ring " Marvin Liu 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 1/9] net/virtio: add Rx free threshold setting Marvin Liu 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 2/9] net/virtio: inorder should depend on feature bit Marvin Liu @ 2020-04-29 7:28 ` Marvin Liu 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 4/9] net/virtio-user: " Marvin Liu ` (6 subsequent siblings) 9 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-29 7:28 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Previously, virtio split ring vectorized path was enabled by default. This is not suitable for everyone because that path dose not follow virtio spec. Add new devarg for virtio vectorized path selection. By default vectorized path is disabled. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index 6286286db..a67774e91 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -363,6 +363,13 @@ Below devargs are supported by the PCI virtio driver: rte_eth_link_get_nowait function. (Default: 10000 (10G)) +#. ``vectorized``: + + It is used to specify whether virtio device perfers to use vectorized path. + Afterwards, dependencies of vectorized path will be checked in path + election. + (Default: 0 (disabled)) + Below devargs are supported by the virtio-user vdev: #. ``path``: diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 37766cbb6..0a69a4db1 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -48,7 +48,8 @@ static int virtio_dev_allmulticast_disable(struct rte_eth_dev *dev); static uint32_t virtio_dev_speed_capa_get(uint32_t speed); static int virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, - uint32_t *speed); + uint32_t *speed, + int *vectorized); static int virtio_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info); static int virtio_dev_link_update(struct rte_eth_dev *dev, @@ -1551,8 +1552,8 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed; } } else { - if (hw->use_simple_rx) { - PMD_INIT_LOG(INFO, "virtio: using simple Rx path on port %u", + if (hw->use_vec_rx) { + PMD_INIT_LOG(INFO, "virtio: using vectorized Rx path on port %u", eth_dev->data->port_id); eth_dev->rx_pkt_burst = virtio_recv_pkts_vec; } else if (hw->use_inorder_rx) { @@ -1886,6 +1887,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) { struct virtio_hw *hw = eth_dev->data->dev_private; uint32_t speed = SPEED_UNKNOWN; + int vectorized = 0; int ret; if (sizeof(struct virtio_net_hdr_mrg_rxbuf) > RTE_PKTMBUF_HEADROOM) { @@ -1912,7 +1914,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) return 0; } ret = virtio_dev_devargs_parse(eth_dev->device->devargs, - NULL, &speed); + NULL, &speed, &vectorized); if (ret < 0) return ret; hw->speed = speed; @@ -1949,6 +1951,11 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) if (ret < 0) goto err_virtio_init; + if (vectorized) { + if (!vtpci_packed_queue(hw)) + hw->use_vec_rx = 1; + } + hw->opened = true; return 0; @@ -2021,9 +2028,20 @@ virtio_dev_speed_capa_get(uint32_t speed) } } +static int vectorized_check_handler(__rte_unused const char *key, + const char *value, void *ret_val) +{ + if (strcmp(value, "1") == 0) + *(int *)ret_val = 1; + 
else + *(int *)ret_val = 0; + + return 0; +} #define VIRTIO_ARG_SPEED "speed" #define VIRTIO_ARG_VDPA "vdpa" +#define VIRTIO_ARG_VECTORIZED "vectorized" static int @@ -2045,7 +2063,7 @@ link_speed_handler(const char *key __rte_unused, static int virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, - uint32_t *speed) + uint32_t *speed, int *vectorized) { struct rte_kvargs *kvlist; int ret = 0; @@ -2081,6 +2099,18 @@ virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa, } } + if (vectorized && + rte_kvargs_count(kvlist, VIRTIO_ARG_VECTORIZED) == 1) { + ret = rte_kvargs_process(kvlist, + VIRTIO_ARG_VECTORIZED, + vectorized_check_handler, vectorized); + if (ret < 0) { + PMD_INIT_LOG(ERR, "Failed to parse %s", + VIRTIO_ARG_VECTORIZED); + goto exit; + } + } + exit: rte_kvargs_free(kvlist); return ret; @@ -2092,7 +2122,8 @@ static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused, int vdpa = 0; int ret = 0; - ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL); + ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa, NULL, + NULL); if (ret < 0) { PMD_INIT_LOG(ERR, "devargs parsing is failed"); return ret; @@ -2257,33 +2288,31 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - hw->use_simple_rx = 1; - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { hw->use_inorder_tx = 1; hw->use_inorder_rx = 1; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (vtpci_packed_queue(hw)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; hw->use_inorder_rx = 0; } #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } #endif if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; } if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | DEV_RX_OFFLOAD_TCP_CKSUM | DEV_RX_OFFLOAD_TCP_LRO | DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; return 0; } diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h index bd89357e4..668e688e1 100644 --- a/drivers/net/virtio/virtio_pci.h +++ b/drivers/net/virtio/virtio_pci.h @@ -253,7 +253,8 @@ struct virtio_hw { uint8_t vlan_strip; uint8_t use_msix; uint8_t modern; - uint8_t use_simple_rx; + uint8_t use_vec_rx; + uint8_t use_vec_tx; uint8_t use_inorder_rx; uint8_t use_inorder_tx; uint8_t weak_barriers; diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index e450477e8..84f4cf946 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -996,7 +996,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) /* Allocate blank mbufs for the each rx descriptor */ nbufs = 0; - if (hw->use_simple_rx) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { for (desc_idx = 0; desc_idx < vq->vq_nentries; desc_idx++) { vq->vq_split.ring.avail->ring[desc_idx] = desc_idx; @@ -1014,7 +1014,7 @@ virtio_dev_rx_queue_setup_finish(struct rte_eth_dev *dev, uint16_t queue_idx) &rxvq->fake_mbuf; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw)) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxvq); nbufs += RTE_VIRTIO_VPMD_RX_REARM_THRESH; diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 953f00d72..150a8d987 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -525,7 +525,7 @@ 
virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev) */ hw->use_msix = 1; hw->modern = 0; - hw->use_simple_rx = 0; + hw->use_vec_rx = 0; hw->use_inorder_rx = 0; hw->use_inorder_tx = 0; hw->virtio_user_dev = dev; diff --git a/drivers/net/virtio/virtqueue.c b/drivers/net/virtio/virtqueue.c index 0b4e3bf3e..ca23180de 100644 --- a/drivers/net/virtio/virtqueue.c +++ b/drivers/net/virtio/virtqueue.c @@ -32,7 +32,8 @@ virtqueue_detach_unused(struct virtqueue *vq) end = (vq->vq_avail_idx + vq->vq_free_cnt) & (vq->vq_nentries - 1); for (idx = 0; idx < vq->vq_nentries; idx++) { - if (hw->use_simple_rx && type == VTNET_RQ) { + if (hw->use_vec_rx && !vtpci_packed_queue(hw) && + type == VTNET_RQ) { if (start <= end && idx >= start && idx < end) continue; if (start > end && (idx >= start || idx < end)) @@ -97,7 +98,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) for (i = 0; i < nb_used; i++) { used_idx = vq->vq_used_cons_idx & (vq->vq_nentries - 1); uep = &vq->vq_split.ring.used->ring[used_idx]; - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { desc_idx = used_idx; rte_pktmbuf_free(vq->sw_ring[desc_idx]); vq->vq_free_cnt++; @@ -121,7 +122,7 @@ virtqueue_rxvq_flush_split(struct virtqueue *vq) vq->vq_used_cons_idx++; } - if (hw->use_simple_rx) { + if (hw->use_vec_rx) { while (vq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) { virtio_rxq_rearm_vec(rxq); if (virtqueue_kick_prepare(vq)) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
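For reference, the devarg plumbing added above follows the usual rte_kvargs pattern. Below is a minimal, stand-alone sketch of that pattern; the handler and wrapper names (my_vectorized_handler, parse_vectorized_devarg) are hypothetical and only illustrate the flow, they are not part of the patch:

    #include <errno.h>
    #include <string.h>
    #include <rte_common.h>
    #include <rte_devargs.h>
    #include <rte_kvargs.h>

    /* hypothetical handler: treat "1" as enabled, anything else as disabled */
    static int
    my_vectorized_handler(const char *key __rte_unused,
                          const char *value, void *ret_val)
    {
            *(int *)ret_val = (strcmp(value, "1") == 0);
            return 0;
    }

    /* hypothetical wrapper mirroring the single-key parse done in the patch */
    static int
    parse_vectorized_devarg(struct rte_devargs *devargs, int *vectorized)
    {
            struct rte_kvargs *kvlist;
            int ret = 0;

            *vectorized = 0;
            if (devargs == NULL)
                    return 0;

            kvlist = rte_kvargs_parse(devargs->args, NULL);
            if (kvlist == NULL)
                    return -EINVAL;

            if (rte_kvargs_count(kvlist, "vectorized") == 1)
                    ret = rte_kvargs_process(kvlist, "vectorized",
                                    my_vectorized_handler, vectorized);

            rte_kvargs_free(kvlist);
            return ret;
    }

On the command line the new option is simply appended to the PCI device's devargs, e.g. "-w 0000:00:04.0,vectorized=1" with the whitelist syntax of that release (the PCI address is a placeholder).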
* [dpdk-dev] [PATCH v12 4/9] net/virtio-user: add vectorized devarg 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 0/9] add packed ring " Marvin Liu ` (2 preceding siblings ...) 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 3/9] net/virtio: add vectorized devarg Marvin Liu @ 2020-04-29 7:28 ` Marvin Liu 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 5/9] net/virtio: reuse packed ring functions Marvin Liu ` (5 subsequent siblings) 9 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-29 7:28 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Add new devarg for virtio user device vectorized path selection. By default vectorized path is disabled. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index a67774e91..fdd0790e0 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -424,6 +424,12 @@ Below devargs are supported by the virtio-user vdev: rte_eth_link_get_nowait function. (Default: 10000 (10G)) +#. ``vectorized``: + + It is used to specify whether virtio device perfers to use vectorized path. + Afterwards, dependencies of vectorized path will be checked in path + election. + (Default: 0 (disabled)) Virtio paths Selection and Usage -------------------------------- diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 150a8d987..40ad786cc 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -452,6 +452,8 @@ static const char *valid_args[] = { VIRTIO_USER_ARG_PACKED_VQ, #define VIRTIO_USER_ARG_SPEED "speed" VIRTIO_USER_ARG_SPEED, +#define VIRTIO_USER_ARG_VECTORIZED "vectorized" + VIRTIO_USER_ARG_VECTORIZED, NULL }; @@ -559,6 +561,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) uint64_t mrg_rxbuf = 1; uint64_t in_order = 1; uint64_t packed_vq = 0; + uint64_t vectorized = 0; char *path = NULL; char *ifname = NULL; char *mac_addr = NULL; @@ -675,6 +678,15 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) } } + if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_VECTORIZED) == 1) { + if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_VECTORIZED, + &get_integer_arg, &vectorized) < 0) { + PMD_INIT_LOG(ERR, "error to parse %s", + VIRTIO_USER_ARG_VECTORIZED); + goto end; + } + } + if (queues > 1 && cq == 0) { PMD_INIT_LOG(ERR, "multi-q requires ctrl-q"); goto end; @@ -727,6 +739,9 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) goto end; } + if (vectorized) + hw->use_vec_rx = 1; + rte_eth_dev_probing_finish(eth_dev); ret = 0; @@ -785,4 +800,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, "mrg_rxbuf=<0|1> " "in_order=<0|1> " "packed_vq=<0|1> " - "speed=<int>"); + "speed=<int> " + "vectorized=<0|1>"); -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v12 5/9] net/virtio: reuse packed ring functions 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 0/9] add packed ring " Marvin Liu ` (3 preceding siblings ...) 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 4/9] net/virtio-user: " Marvin Liu @ 2020-04-29 7:28 ` Marvin Liu 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 6/9] net/virtio: add vectorized packed ring Rx path Marvin Liu ` (4 subsequent siblings) 9 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-29 7:28 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Move offload, xmit cleanup and packed xmit enqueue function to header file. These functions will be reused by packed ring vectorized path. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index 84f4cf946..a549991aa 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -89,23 +89,6 @@ vq_ring_free_chain(struct virtqueue *vq, uint16_t desc_idx) dp->next = VQ_RING_DESC_CHAIN_END; } -static void -vq_ring_free_id_packed(struct virtqueue *vq, uint16_t id) -{ - struct vq_desc_extra *dxp; - - dxp = &vq->vq_descx[id]; - vq->vq_free_cnt += dxp->ndescs; - - if (vq->vq_desc_tail_idx == VQ_RING_DESC_CHAIN_END) - vq->vq_desc_head_idx = id; - else - vq->vq_descx[vq->vq_desc_tail_idx].next = id; - - vq->vq_desc_tail_idx = id; - dxp->next = VQ_RING_DESC_CHAIN_END; -} - void virtio_update_packet_stats(struct virtnet_stats *stats, struct rte_mbuf *mbuf) { @@ -264,130 +247,6 @@ virtqueue_dequeue_rx_inorder(struct virtqueue *vq, return i; } -#ifndef DEFAULT_TX_FREE_THRESH -#define DEFAULT_TX_FREE_THRESH 32 -#endif - -static void -virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num) -{ - uint16_t used_idx, id, curr_id, free_cnt = 0; - uint16_t size = vq->vq_nentries; - struct vring_packed_desc *desc = vq->vq_packed.ring.desc; - struct vq_desc_extra *dxp; - - used_idx = vq->vq_used_cons_idx; - /* desc_is_used has a load-acquire or rte_cio_rmb inside - * and wait for used desc in virtqueue. - */ - while (num > 0 && desc_is_used(&desc[used_idx], vq)) { - id = desc[used_idx].id; - do { - curr_id = used_idx; - dxp = &vq->vq_descx[used_idx]; - used_idx += dxp->ndescs; - free_cnt += dxp->ndescs; - num -= dxp->ndescs; - if (used_idx >= size) { - used_idx -= size; - vq->vq_packed.used_wrap_counter ^= 1; - } - if (dxp->cookie != NULL) { - rte_pktmbuf_free(dxp->cookie); - dxp->cookie = NULL; - } - } while (curr_id != id); - } - vq->vq_used_cons_idx = used_idx; - vq->vq_free_cnt += free_cnt; -} - -static void -virtio_xmit_cleanup_normal_packed(struct virtqueue *vq, int num) -{ - uint16_t used_idx, id; - uint16_t size = vq->vq_nentries; - struct vring_packed_desc *desc = vq->vq_packed.ring.desc; - struct vq_desc_extra *dxp; - - used_idx = vq->vq_used_cons_idx; - /* desc_is_used has a load-acquire or rte_cio_rmb inside - * and wait for used desc in virtqueue. - */ - while (num-- && desc_is_used(&desc[used_idx], vq)) { - id = desc[used_idx].id; - dxp = &vq->vq_descx[id]; - vq->vq_used_cons_idx += dxp->ndescs; - if (vq->vq_used_cons_idx >= size) { - vq->vq_used_cons_idx -= size; - vq->vq_packed.used_wrap_counter ^= 1; - } - vq_ring_free_id_packed(vq, id); - if (dxp->cookie != NULL) { - rte_pktmbuf_free(dxp->cookie); - dxp->cookie = NULL; - } - used_idx = vq->vq_used_cons_idx; - } -} - -/* Cleanup from completed transmits. 
*/ -static inline void -virtio_xmit_cleanup_packed(struct virtqueue *vq, int num, int in_order) -{ - if (in_order) - virtio_xmit_cleanup_inorder_packed(vq, num); - else - virtio_xmit_cleanup_normal_packed(vq, num); -} - -static void -virtio_xmit_cleanup(struct virtqueue *vq, uint16_t num) -{ - uint16_t i, used_idx, desc_idx; - for (i = 0; i < num; i++) { - struct vring_used_elem *uep; - struct vq_desc_extra *dxp; - - used_idx = (uint16_t)(vq->vq_used_cons_idx & (vq->vq_nentries - 1)); - uep = &vq->vq_split.ring.used->ring[used_idx]; - - desc_idx = (uint16_t) uep->id; - dxp = &vq->vq_descx[desc_idx]; - vq->vq_used_cons_idx++; - vq_ring_free_chain(vq, desc_idx); - - if (dxp->cookie != NULL) { - rte_pktmbuf_free(dxp->cookie); - dxp->cookie = NULL; - } - } -} - -/* Cleanup from completed inorder transmits. */ -static __rte_always_inline void -virtio_xmit_cleanup_inorder(struct virtqueue *vq, uint16_t num) -{ - uint16_t i, idx = vq->vq_used_cons_idx; - int16_t free_cnt = 0; - struct vq_desc_extra *dxp = NULL; - - if (unlikely(num == 0)) - return; - - for (i = 0; i < num; i++) { - dxp = &vq->vq_descx[idx++ & (vq->vq_nentries - 1)]; - free_cnt += dxp->ndescs; - if (dxp->cookie != NULL) { - rte_pktmbuf_free(dxp->cookie); - dxp->cookie = NULL; - } - } - - vq->vq_free_cnt += free_cnt; - vq->vq_used_cons_idx = idx; -} - static inline int virtqueue_enqueue_refill_inorder(struct virtqueue *vq, struct rte_mbuf **cookies, @@ -562,68 +421,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m) } -/* avoid write operation when necessary, to lessen cache issues */ -#define ASSIGN_UNLESS_EQUAL(var, val) do { \ - if ((var) != (val)) \ - (var) = (val); \ -} while (0) - -#define virtqueue_clear_net_hdr(_hdr) do { \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_start, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->csum_offset, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->flags, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_type, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->gso_size, 0); \ - ASSIGN_UNLESS_EQUAL((_hdr)->hdr_len, 0); \ -} while (0) - -static inline void -virtqueue_xmit_offload(struct virtio_net_hdr *hdr, - struct rte_mbuf *cookie, - bool offload) -{ - if (offload) { - if (cookie->ol_flags & PKT_TX_TCP_SEG) - cookie->ol_flags |= PKT_TX_TCP_CKSUM; - - switch (cookie->ol_flags & PKT_TX_L4_MASK) { - case PKT_TX_UDP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_udp_hdr, - dgram_cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - case PKT_TX_TCP_CKSUM: - hdr->csum_start = cookie->l2_len + cookie->l3_len; - hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); - hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - break; - - default: - ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); - ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); - ASSIGN_UNLESS_EQUAL(hdr->flags, 0); - break; - } - /* TCP Segmentation Offload */ - if (cookie->ol_flags & PKT_TX_TCP_SEG) { - hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
- VIRTIO_NET_HDR_GSO_TCPV6 : - VIRTIO_NET_HDR_GSO_TCPV4; - hdr->gso_size = cookie->tso_segsz; - hdr->hdr_len = - cookie->l2_len + - cookie->l3_len + - cookie->l4_len; - } else { - ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); - ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); - ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); - } - } -} static inline void virtqueue_enqueue_xmit_inorder(struct virtnet_tx *txvq, @@ -725,102 +523,6 @@ virtqueue_enqueue_xmit_packed_fast(struct virtnet_tx *txvq, virtqueue_store_flags_packed(dp, flags, vq->hw->weak_barriers); } -static inline void -virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, - uint16_t needed, int can_push, int in_order) -{ - struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; - struct vq_desc_extra *dxp; - struct virtqueue *vq = txvq->vq; - struct vring_packed_desc *start_dp, *head_dp; - uint16_t idx, id, head_idx, head_flags; - int16_t head_size = vq->hw->vtnet_hdr_size; - struct virtio_net_hdr *hdr; - uint16_t prev; - bool prepend_header = false; - - id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; - - dxp = &vq->vq_descx[id]; - dxp->ndescs = needed; - dxp->cookie = cookie; - - head_idx = vq->vq_avail_idx; - idx = head_idx; - prev = head_idx; - start_dp = vq->vq_packed.ring.desc; - - head_dp = &vq->vq_packed.ring.desc[idx]; - head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; - head_flags |= vq->vq_packed.cached_flags; - - if (can_push) { - /* prepend cannot fail, checked by caller */ - hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, - -head_size); - prepend_header = true; - - /* if offload disabled, it is not zeroed below, do it now */ - if (!vq->hw->has_tx_offload) - virtqueue_clear_net_hdr(hdr); - } else { - /* setup first tx ring slot to point to header - * stored in reserved region. - */ - start_dp[idx].addr = txvq->virtio_net_hdr_mem + - RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); - start_dp[idx].len = vq->hw->vtnet_hdr_size; - hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } - - virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); - - do { - uint16_t flags; - - start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); - start_dp[idx].len = cookie->data_len; - if (prepend_header) { - start_dp[idx].addr -= head_size; - start_dp[idx].len += head_size; - prepend_header = false; - } - - if (likely(idx != head_idx)) { - flags = cookie->next ? 
VRING_DESC_F_NEXT : 0; - flags |= vq->vq_packed.cached_flags; - start_dp[idx].flags = flags; - } - prev = idx; - idx++; - if (idx >= vq->vq_nentries) { - idx -= vq->vq_nentries; - vq->vq_packed.cached_flags ^= - VRING_PACKED_DESC_F_AVAIL_USED; - } - } while ((cookie = cookie->next) != NULL); - - start_dp[prev].id = id; - - vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); - vq->vq_avail_idx = idx; - - if (!in_order) { - vq->vq_desc_head_idx = dxp->next; - if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) - vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; - } - - virtqueue_store_flags_packed(head_dp, head_flags, - vq->hw->weak_barriers); -} - static inline void virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie, uint16_t needed, int use_indirect, int can_push, @@ -1246,7 +948,6 @@ virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) return 0; } -#define VIRTIO_MBUF_BURST_SZ 64 #define DESC_PER_CACHELINE (RTE_CACHE_LINE_SIZE / sizeof(struct vring_desc)) uint16_t virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h index 6301c56b2..ca1c10499 100644 --- a/drivers/net/virtio/virtqueue.h +++ b/drivers/net/virtio/virtqueue.h @@ -10,6 +10,7 @@ #include <rte_atomic.h> #include <rte_memory.h> #include <rte_mempool.h> +#include <rte_net.h> #include "virtio_pci.h" #include "virtio_ring.h" @@ -18,8 +19,10 @@ struct rte_mbuf; +#define DEFAULT_TX_FREE_THRESH 32 #define DEFAULT_RX_FREE_THRESH 32 +#define VIRTIO_MBUF_BURST_SZ 64 /* * Per virtio_ring.h in Linux. * For virtio_pci on SMP, we don't need to order with respect to MMIO @@ -560,4 +563,303 @@ virtqueue_notify(struct virtqueue *vq) #define VIRTQUEUE_DUMP(vq) do { } while (0) #endif +/* avoid write operation when necessary, to lessen cache issues */ +#define ASSIGN_UNLESS_EQUAL(var, val) do { \ + typeof(var) var_ = (var); \ + typeof(val) val_ = (val); \ + if ((var_) != (val_)) \ + (var_) = (val_); \ +} while (0) + +#define virtqueue_clear_net_hdr(hdr) do { \ + typeof(hdr) hdr_ = (hdr); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_start, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->csum_offset, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->flags, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_type, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->gso_size, 0); \ + ASSIGN_UNLESS_EQUAL((hdr_)->hdr_len, 0); \ +} while (0) + +static inline void +virtqueue_xmit_offload(struct virtio_net_hdr *hdr, + struct rte_mbuf *cookie, + bool offload) +{ + if (offload) { + if (cookie->ol_flags & PKT_TX_TCP_SEG) + cookie->ol_flags |= PKT_TX_TCP_CKSUM; + + switch (cookie->ol_flags & PKT_TX_L4_MASK) { + case PKT_TX_UDP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_udp_hdr, + dgram_cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + case PKT_TX_TCP_CKSUM: + hdr->csum_start = cookie->l2_len + cookie->l3_len; + hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum); + hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; + break; + + default: + ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0); + ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0); + ASSIGN_UNLESS_EQUAL(hdr->flags, 0); + break; + } + + /* TCP Segmentation Offload */ + if (cookie->ol_flags & PKT_TX_TCP_SEG) { + hdr->gso_type = (cookie->ol_flags & PKT_TX_IPV6) ? 
+ VIRTIO_NET_HDR_GSO_TCPV6 : + VIRTIO_NET_HDR_GSO_TCPV4; + hdr->gso_size = cookie->tso_segsz; + hdr->hdr_len = + cookie->l2_len + + cookie->l3_len + + cookie->l4_len; + } else { + ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0); + ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0); + ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0); + } + } +} + +static inline void +virtqueue_enqueue_xmit_packed(struct virtnet_tx *txvq, struct rte_mbuf *cookie, + uint16_t needed, int can_push, int in_order) +{ + struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr; + struct vq_desc_extra *dxp; + struct virtqueue *vq = txvq->vq; + struct vring_packed_desc *start_dp, *head_dp; + uint16_t idx, id, head_idx, head_flags; + int16_t head_size = vq->hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + uint16_t prev; + bool prepend_header = false; + + id = in_order ? vq->vq_avail_idx : vq->vq_desc_head_idx; + + dxp = &vq->vq_descx[id]; + dxp->ndescs = needed; + dxp->cookie = cookie; + + head_idx = vq->vq_avail_idx; + idx = head_idx; + prev = head_idx; + start_dp = vq->vq_packed.ring.desc; + + head_dp = &vq->vq_packed.ring.desc[idx]; + head_flags = cookie->next ? VRING_DESC_F_NEXT : 0; + head_flags |= vq->vq_packed.cached_flags; + + if (can_push) { + /* prepend cannot fail, checked by caller */ + hdr = rte_pktmbuf_mtod_offset(cookie, struct virtio_net_hdr *, + -head_size); + prepend_header = true; + + /* if offload disabled, it is not zeroed below, do it now */ + if (!vq->hw->has_tx_offload) + virtqueue_clear_net_hdr(hdr); + } else { + /* setup first tx ring slot to point to header + * stored in reserved region. + */ + start_dp[idx].addr = txvq->virtio_net_hdr_mem + + RTE_PTR_DIFF(&txr[idx].tx_hdr, txr); + start_dp[idx].len = vq->hw->vtnet_hdr_size; + hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } + + virtqueue_xmit_offload(hdr, cookie, vq->hw->has_tx_offload); + + do { + uint16_t flags; + + start_dp[idx].addr = VIRTIO_MBUF_DATA_DMA_ADDR(cookie, vq); + start_dp[idx].len = cookie->data_len; + if (prepend_header) { + start_dp[idx].addr -= head_size; + start_dp[idx].len += head_size; + prepend_header = false; + } + + if (likely(idx != head_idx)) { + flags = cookie->next ? 
VRING_DESC_F_NEXT : 0; + flags |= vq->vq_packed.cached_flags; + start_dp[idx].flags = flags; + } + prev = idx; + idx++; + if (idx >= vq->vq_nentries) { + idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + } while ((cookie = cookie->next) != NULL); + + start_dp[prev].id = id; + + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - needed); + vq->vq_avail_idx = idx; + + if (!in_order) { + vq->vq_desc_head_idx = dxp->next; + if (vq->vq_desc_head_idx == VQ_RING_DESC_CHAIN_END) + vq->vq_desc_tail_idx = VQ_RING_DESC_CHAIN_END; + } + + virtqueue_store_flags_packed(head_dp, head_flags, + vq->hw->weak_barriers); +} + +static void +vq_ring_free_id_packed(struct virtqueue *vq, uint16_t id) +{ + struct vq_desc_extra *dxp; + + dxp = &vq->vq_descx[id]; + vq->vq_free_cnt += dxp->ndescs; + + if (vq->vq_desc_tail_idx == VQ_RING_DESC_CHAIN_END) + vq->vq_desc_head_idx = id; + else + vq->vq_descx[vq->vq_desc_tail_idx].next = id; + + vq->vq_desc_tail_idx = id; + dxp->next = VQ_RING_DESC_CHAIN_END; +} + +static void +virtio_xmit_cleanup_inorder_packed(struct virtqueue *vq, int num) +{ + uint16_t used_idx, id, curr_id, free_cnt = 0; + uint16_t size = vq->vq_nentries; + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; + struct vq_desc_extra *dxp; + + used_idx = vq->vq_used_cons_idx; + /* desc_is_used has a load-acquire or rte_cio_rmb inside + * and wait for used desc in virtqueue. + */ + while (num > 0 && desc_is_used(&desc[used_idx], vq)) { + id = desc[used_idx].id; + do { + curr_id = used_idx; + dxp = &vq->vq_descx[used_idx]; + used_idx += dxp->ndescs; + free_cnt += dxp->ndescs; + num -= dxp->ndescs; + if (used_idx >= size) { + used_idx -= size; + vq->vq_packed.used_wrap_counter ^= 1; + } + if (dxp->cookie != NULL) { + rte_pktmbuf_free(dxp->cookie); + dxp->cookie = NULL; + } + } while (curr_id != id); + } + vq->vq_used_cons_idx = used_idx; + vq->vq_free_cnt += free_cnt; +} + +static void +virtio_xmit_cleanup_normal_packed(struct virtqueue *vq, int num) +{ + uint16_t used_idx, id; + uint16_t size = vq->vq_nentries; + struct vring_packed_desc *desc = vq->vq_packed.ring.desc; + struct vq_desc_extra *dxp; + + used_idx = vq->vq_used_cons_idx; + /* desc_is_used has a load-acquire or rte_cio_rmb inside + * and wait for used desc in virtqueue. + */ + while (num-- && desc_is_used(&desc[used_idx], vq)) { + id = desc[used_idx].id; + dxp = &vq->vq_descx[id]; + vq->vq_used_cons_idx += dxp->ndescs; + if (vq->vq_used_cons_idx >= size) { + vq->vq_used_cons_idx -= size; + vq->vq_packed.used_wrap_counter ^= 1; + } + vq_ring_free_id_packed(vq, id); + if (dxp->cookie != NULL) { + rte_pktmbuf_free(dxp->cookie); + dxp->cookie = NULL; + } + used_idx = vq->vq_used_cons_idx; + } +} + +/* Cleanup from completed transmits. 
*/ +static inline void +virtio_xmit_cleanup_packed(struct virtqueue *vq, int num, int in_order) +{ + if (in_order) + virtio_xmit_cleanup_inorder_packed(vq, num); + else + virtio_xmit_cleanup_normal_packed(vq, num); +} + +static inline void +virtio_xmit_cleanup(struct virtqueue *vq, uint16_t num) +{ + uint16_t i, used_idx, desc_idx; + for (i = 0; i < num; i++) { + struct vring_used_elem *uep; + struct vq_desc_extra *dxp; + + used_idx = (uint16_t)(vq->vq_used_cons_idx & + (vq->vq_nentries - 1)); + uep = &vq->vq_split.ring.used->ring[used_idx]; + + desc_idx = (uint16_t)uep->id; + dxp = &vq->vq_descx[desc_idx]; + vq->vq_used_cons_idx++; + vq_ring_free_chain(vq, desc_idx); + + if (dxp->cookie != NULL) { + rte_pktmbuf_free(dxp->cookie); + dxp->cookie = NULL; + } + } +} + +/* Cleanup from completed inorder transmits. */ +static __rte_always_inline void +virtio_xmit_cleanup_inorder(struct virtqueue *vq, uint16_t num) +{ + uint16_t i, idx = vq->vq_used_cons_idx; + int16_t free_cnt = 0; + struct vq_desc_extra *dxp = NULL; + + if (unlikely(num == 0)) + return; + + for (i = 0; i < num; i++) { + dxp = &vq->vq_descx[idx++ & (vq->vq_nentries - 1)]; + free_cnt += dxp->ndescs; + if (dxp->cookie != NULL) { + rte_pktmbuf_free(dxp->cookie); + dxp->cookie = NULL; + } + } + + vq->vq_free_cnt += free_cnt; + vq->vq_used_cons_idx = idx; +} #endif /* _VIRTQUEUE_H_ */ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v12 6/9] net/virtio: add vectorized packed ring Rx path 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 0/9] add packed ring " Marvin Liu ` (4 preceding siblings ...) 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 5/9] net/virtio: reuse packed ring functions Marvin Liu @ 2020-04-29 7:28 ` Marvin Liu 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu ` (3 subsequent siblings) 9 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-29 7:28 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Optimize packed ring Rx path with SIMD instructions. Solution of optimization is pretty like vhost, is that split path into batch and single functions. Batch function is further optimized by AVX512 instructions. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index c9edb84ee..102b1deab 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c endif +ifneq ($(FORCE_DISABLE_AVX512), y) + CC_AVX512_SUPPORT=\ + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \ + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ + grep -q AVX512 && echo 1) +endif + +ifeq ($(CC_AVX512_SUPPORT), 1) +CFLAGS += -DCC_AVX512_SUPPORT +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c + +ifeq ($(RTE_TOOLCHAIN), gcc) +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), clang) +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1) +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA +endif +endif + +ifeq ($(RTE_TOOLCHAIN), icc) +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA +endif +endif + +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds +endif +endif + ifeq ($(CONFIG_RTE_VIRTIO_USER),y) SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build index 15150eea1..8e68c3039 100644 --- a/drivers/net/virtio/meson.build +++ b/drivers/net/virtio/meson.build @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c', deps += ['kvargs', 'bus_pci'] if arch_subdir == 'x86' + if '-mno-avx512f' not in machine_args + if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw') + cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl'] + cflags += ['-DCC_AVX512_SUPPORT'] + if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0')) + cflags += '-DVHOST_GCC_UNROLL_PRAGMA' + elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0')) + cflags += '-DVHOST_CLANG_UNROLL_PRAGMA' + elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0')) + cflags += '-DVHOST_ICC_UNROLL_PRAGMA' + endif + sources += files('virtio_rxtx_packed_avx.c') + endif + endif sources += files('virtio_rxtx_simple_sse.c') elif arch_subdir == 'ppc' sources += files('virtio_rxtx_simple_altivec.c') diff --git a/drivers/net/virtio/virtio_ethdev.h 
b/drivers/net/virtio/virtio_ethdev.h index febaf17a8..5c112cac7 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index a549991aa..c50980c82 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2030,3 +2030,13 @@ virtio_xmit_pkts_inorder(void *tx_queue, return nb_tx; } + +#ifndef CC_AVX512_SUPPORT +uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, + struct rte_mbuf **rx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} +#endif /* ifndef CC_AVX512_SUPPORT */ diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c new file mode 100644 index 000000000..88831a786 --- /dev/null +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -0,0 +1,374 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <errno.h> + +#include <rte_net.h> + +#include "virtio_logs.h" +#include "virtio_ethdev.h" +#include "virtio_pci.h" +#include "virtqueue.h" + +#define BYTE_SIZE 8 +/* flag bits offset in packed ring desc higher 64bits */ +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \ + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) + +#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \ + FLAGS_BITS_OFFSET) + +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ + sizeof(struct vring_packed_desc)) +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) + +#ifdef VIRTIO_GCC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \ + for (iter = val; iter < size; iter++) +#endif + +#ifdef VIRTIO_ICC_UNROLL_PRAGMA +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \ + for (iter = val; iter < size; iter++) +#endif + +#ifndef virtio_for_each_try_unroll +#define virtio_for_each_try_unroll(iter, val, num) \ + for (iter = val; iter < num; iter++) +#endif + +static inline void +virtio_update_batch_stats(struct virtnet_stats *stats, + uint16_t pkt_len1, + uint16_t pkt_len2, + uint16_t pkt_len3, + uint16_t pkt_len4) +{ + stats->bytes += pkt_len1; + stats->bytes += pkt_len2; + stats->bytes += pkt_len3; + stats->bytes += pkt_len4; +} + +/* Optionally fill offload information in structure */ +static inline int +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) +{ + struct rte_net_hdr_lens hdr_lens; + uint32_t hdrlen, ptype; + int l4_supported = 0; + + /* nothing to do */ + if (hdr->flags == 0) + return 0; + + /* GSO not support in vec path, skip check */ + m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN; + + ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); + m->packet_type = ptype; + if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP || + (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP || + (ptype & 
RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP) + l4_supported = 1; + + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { + hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; + if (hdr->csum_start <= hdrlen && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_NONE; + } else { + /* Unknown proto or tunnel, do sw cksum. We can assume + * the cksum field is in the first segment since the + * buffers we provided to the host are large enough. + * In case of SCTP, this will be wrong since it's a CRC + * but there's nothing we can do. + */ + uint16_t csum = 0, off; + + rte_raw_cksum_mbuf(m, hdr->csum_start, + rte_pktmbuf_pkt_len(m) - hdr->csum_start, + &csum); + if (likely(csum != 0xffff)) + csum = ~csum; + off = hdr->csum_offset + hdr->csum_start; + if (rte_pktmbuf_data_len(m) >= off + 1) + *rte_pktmbuf_mtod_offset(m, uint16_t *, + off) = csum; + } + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) { + m->ol_flags |= PKT_RX_L4_CKSUM_GOOD; + } + + return 0; +} + +static inline uint16_t +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint64_t addrs[PACKED_BATCH_SIZE]; + uint16_t id = vq->vq_used_cons_idx; + uint8_t desc_stats; + uint16_t i; + void *desc_addr; + + if (id & PACKED_BATCH_MASK) + return -1; + + if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries)) + return -1; + + /* only care avail/used bits */ + __m512i v_mask = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + desc_addr = &vq->vq_packed.ring.desc[id]; + + __m512i v_desc = _mm512_loadu_si512(desc_addr); + __m512i v_flag = _mm512_and_epi64(v_desc, v_mask); + + __m512i v_used_flag = _mm512_setzero_si512(); + if (vq->vq_packed.used_wrap_counter) + v_used_flag = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK); + + /* Check all descs are used */ + desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag); + if (desc_stats) + return -1; + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie; + rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *)); + + addrs[i] = (uintptr_t)rx_pkts[i]->rx_descriptor_fields1; + } + + /* + * load len from desc, store into mbuf pkt_len and data_len + * len limiated by l6bit buf_len, pkt_len[16:31] can be ignored + */ + const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12; + __m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, 0xAA); + + /* reduce hdr_len from pkt_len and data_len */ + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask, + (uint32_t)-hdr_size); + + __m512i v_value = _mm512_add_epi32(values, mbuf_len_offset); + + /* assert offset of data_len */ + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) != + offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8); + + __m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3], + addrs[2] + 8, addrs[2], + addrs[1] + 8, addrs[1], + addrs[0] + 8, addrs[0]); + /* batch store into mbufs */ + _mm512_i64scatter_epi64(0, v_index, v_value, 1); + + if (hw->has_rx_offload) { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + char *addr = (char *)rx_pkts[i]->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size; + virtio_vec_rx_offload(rx_pkts[i], + (struct virtio_net_hdr *)addr); + } + } + + virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len, + rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len, + rx_pkts[3]->pkt_len); + + vq->vq_free_cnt += PACKED_BATCH_SIZE; + + vq->vq_used_cons_idx += 
PACKED_BATCH_SIZE; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static uint16_t +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **rx_pkts) +{ + uint16_t used_idx, id; + uint32_t len; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint32_t hdr_size = hw->vtnet_hdr_size; + struct virtio_net_hdr *hdr; + struct vring_packed_desc *desc; + struct rte_mbuf *cookie; + + desc = vq->vq_packed.ring.desc; + used_idx = vq->vq_used_cons_idx; + if (!desc_is_used(&desc[used_idx], vq)) + return -1; + + len = desc[used_idx].len; + id = desc[used_idx].id; + cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie; + if (unlikely(cookie == NULL)) { + PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u", + vq->vq_used_cons_idx); + return -1; + } + rte_prefetch0(cookie); + rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *)); + + cookie->data_off = RTE_PKTMBUF_HEADROOM; + cookie->ol_flags = 0; + cookie->pkt_len = (uint32_t)(len - hdr_size); + cookie->data_len = (uint32_t)(len - hdr_size); + + hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr + + RTE_PKTMBUF_HEADROOM - hdr_size); + if (hw->has_rx_offload) + virtio_vec_rx_offload(cookie, hdr); + + *rx_pkts = cookie; + + rxvq->stats.bytes += cookie->pkt_len; + + vq->vq_free_cnt++; + vq->vq_used_cons_idx++; + if (vq->vq_used_cons_idx >= vq->vq_nentries) { + vq->vq_used_cons_idx -= vq->vq_nentries; + vq->vq_packed.used_wrap_counter ^= 1; + } + + return 0; +} + +static inline void +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq, + struct rte_mbuf **cookie, + uint16_t num) +{ + struct virtqueue *vq = rxvq->vq; + struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc; + uint16_t flags = vq->vq_packed.cached_flags; + struct virtio_hw *hw = vq->hw; + struct vq_desc_extra *dxp; + uint16_t idx, i; + uint16_t batch_num, total_num = 0; + uint16_t head_idx = vq->vq_avail_idx; + uint16_t head_flag = vq->vq_packed.cached_flags; + uint64_t addr; + + do { + idx = vq->vq_avail_idx; + + batch_num = PACKED_BATCH_SIZE; + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) + batch_num = vq->vq_nentries - idx; + if (unlikely((total_num + batch_num) > num)) + batch_num = num - total_num; + + virtio_for_each_try_unroll(i, 0, batch_num) { + dxp = &vq->vq_descx[idx + i]; + dxp->cookie = (void *)cookie[total_num + i]; + + addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) + + RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size; + start_dp[idx + i].addr = addr; + start_dp[idx + i].len = cookie[total_num + i]->buf_len + - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size; + if (total_num || i) { + virtqueue_store_flags_packed(&start_dp[idx + i], + flags, hw->weak_barriers); + } + } + + vq->vq_avail_idx += batch_num; + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + flags = vq->vq_packed.cached_flags; + } + total_num += batch_num; + } while (total_num < num); + + virtqueue_store_flags_packed(&start_dp[head_idx], head_flag, + hw->weak_barriers); + vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num); +} + +uint16_t +virtio_recv_pkts_packed_vec(void *rx_queue, + struct rte_mbuf **rx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_rx *rxvq = rx_queue; + struct virtqueue *vq = rxvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t num, nb_rx = 0; + uint32_t nb_enqueued = 0; + uint16_t free_cnt = vq->vq_free_thresh; + + if 
(unlikely(hw->started == 0)) + return nb_rx; + + num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts); + if (likely(num > PACKED_BATCH_SIZE)) + num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE); + + while (num) { + if (!virtqueue_dequeue_batch_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx += PACKED_BATCH_SIZE; + num -= PACKED_BATCH_SIZE; + continue; + } + if (!virtqueue_dequeue_single_packed_vec(rxvq, + &rx_pkts[nb_rx])) { + nb_rx++; + num--; + continue; + } + break; + }; + + PMD_RX_LOG(DEBUG, "dequeue:%d", num); + + rxvq->stats.packets += nb_rx; + + if (likely(vq->vq_free_cnt >= free_cnt)) { + struct rte_mbuf *new_pkts[free_cnt]; + if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, + free_cnt) == 0)) { + virtio_recv_refill_packed_vec(rxvq, new_pkts, + free_cnt); + nb_enqueued += free_cnt; + } else { + struct rte_eth_dev *dev = + &rte_eth_devices[rxvq->port_id]; + dev->data->rx_mbuf_alloc_failed += free_cnt; + } + } + + if (likely(nb_enqueued)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_RX_LOG(DEBUG, "Notified"); + } + } + + return nb_rx; +} diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 40ad786cc..c54698ad1 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -528,6 +528,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev) hw->use_msix = 1; hw->modern = 0; hw->use_vec_rx = 0; + hw->use_vec_tx = 0; hw->use_inorder_rx = 0; hw->use_inorder_tx = 0; hw->virtio_user_dev = dev; @@ -739,8 +740,19 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev) goto end; } - if (vectorized) - hw->use_vec_rx = 1; + if (vectorized) { + if (packed_vq) { +#if defined(CC_AVX512_SUPPORT) + hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#else + PMD_INIT_LOG(INFO, + "building environment do not support packed ring vectorized"); +#endif + } else { + hw->use_vec_rx = 1; + } + } rte_eth_dev_probing_finish(eth_dev); ret = 0; -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
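The heart of the batch Rx function above is the "are all four descriptors used?" test: four 16-byte packed descriptors fill one 64-byte cacheline, so a single 512-bit load plus a masked compare covers the whole batch. A stand-alone sketch of that test follows; the toy_ names and the 0x8080 constant are written out here for illustration and assume the standard VRING_PACKED_DESC_F_AVAIL/USED bit positions (build with -mavx512f):

    #include <stdint.h>
    #include <immintrin.h>

    struct toy_packed_desc {        /* same layout as vring_packed_desc */
            uint64_t addr;
            uint32_t len;
            uint16_t id;
            uint16_t flags;
    };

    /* AVAIL (1 << 7) | USED (1 << 15), shifted into the flags field of the
     * high qword of each descriptor */
    #define TOY_FLAGS_MASK ((uint64_t)0x8080 << 48)

    static inline int
    toy_batch_is_used(const struct toy_packed_desc *desc, int used_wrap_counter)
    {
            /* one load covers four descriptors (one cacheline) */
            __m512i v_desc = _mm512_loadu_si512((const void *)desc);
            /* 0xaa selects the high qword of each 16-byte descriptor */
            __m512i v_mask = _mm512_maskz_set1_epi64(0xaa, TOY_FLAGS_MASK);
            __m512i v_flag = _mm512_and_epi64(v_desc, v_mask);
            __m512i v_used = used_wrap_counter ?
                    _mm512_maskz_set1_epi64(0xaa, TOY_FLAGS_MASK) :
                    _mm512_setzero_si512();

            /* all four used <=> no qword differs from the expected pattern */
            return _mm512_cmpneq_epu64_mask(v_flag, v_used) == 0;
    }

The real virtqueue_dequeue_batch_packed_vec() additionally requires the consumed index to be batch-aligned and not to wrap inside the batch, which is why the scalar single-descriptor fallback is kept alongside it.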
* [dpdk-dev] [PATCH v12 7/9] net/virtio: add vectorized packed ring Tx path 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 0/9] add packed ring " Marvin Liu ` (5 preceding siblings ...) 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 6/9] net/virtio: add vectorized packed ring Rx path Marvin Liu @ 2020-04-29 7:28 ` Marvin Liu 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 8/9] net/virtio: add election for vectorized path Marvin Liu ` (2 subsequent siblings) 9 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-29 7:28 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Optimize packed ring Tx path like Rx path. Split Tx path into batch and single Tx functions. Batch function is further optimized by AVX512 instructions. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h index 5c112cac7..b7d52d497 100644 --- a/drivers/net/virtio/virtio_ethdev.h +++ b/drivers/net/virtio/virtio_ethdev.h @@ -108,6 +108,9 @@ uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); +uint16_t virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts); + int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); void virtio_interrupt_handler(void *param); diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c index c50980c82..050541a10 100644 --- a/drivers/net/virtio/virtio_rxtx.c +++ b/drivers/net/virtio/virtio_rxtx.c @@ -2039,4 +2039,12 @@ virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, { return 0; } + +uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue __rte_unused, + struct rte_mbuf **tx_pkts __rte_unused, + uint16_t nb_pkts __rte_unused) +{ + return 0; +} #endif /* ifndef CC_AVX512_SUPPORT */ diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c index 88831a786..d130d68bf 100644 --- a/drivers/net/virtio/virtio_rxtx_packed_avx.c +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c @@ -23,6 +23,24 @@ #define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \ FLAGS_BITS_OFFSET) +/* reference count offset in mbuf rearm data */ +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \ + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) +/* segment number offset in mbuf rearm data */ +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \ + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE) + +/* default rearm data */ +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \ + 1ULL << REFCNT_BITS_OFFSET) + +/* id bits offset in packed ring desc higher 64bits */ +#define ID_BITS_OFFSET ((offsetof(struct vring_packed_desc, id) - \ + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) + +/* net hdr short size mask */ +#define NET_HDR_MASK 0x3F + #define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ sizeof(struct vring_packed_desc)) #define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) @@ -60,6 +78,221 @@ virtio_update_batch_stats(struct virtnet_stats *stats, stats->bytes += pkt_len4; } +static inline int +virtqueue_enqueue_batch_packed_vec(struct virtnet_tx *txvq, + struct rte_mbuf **tx_pkts) +{ + struct virtqueue *vq = txvq->vq; + uint16_t head_size = vq->hw->vtnet_hdr_size; + uint16_t idx = vq->vq_avail_idx; + struct virtio_net_hdr *hdr; + struct vq_desc_extra *dxp; + uint16_t i, cmp; + 
+ if (vq->vq_avail_idx & PACKED_BATCH_MASK) + return -1; + + if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries)) + return -1; + + /* Load four mbufs rearm data */ + RTE_BUILD_BUG_ON(REFCNT_BITS_OFFSET >= 64); + RTE_BUILD_BUG_ON(SEG_NUM_BITS_OFFSET >= 64); + __m256i mbufs = _mm256_set_epi64x(*tx_pkts[3]->rearm_data, + *tx_pkts[2]->rearm_data, + *tx_pkts[1]->rearm_data, + *tx_pkts[0]->rearm_data); + + /* refcnt=1 and nb_segs=1 */ + __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA); + __m256i head_rooms = _mm256_set1_epi16(head_size); + + /* Check refcnt and nb_segs */ + const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12; + cmp = _mm256_mask_cmpneq_epu16_mask(mask, mbufs, mbuf_ref); + if (unlikely(cmp)) + return -1; + + /* Check headroom is enough */ + const __mmask16 data_mask = 0x1 | 0x1 << 4 | 0x1 << 8 | 0x1 << 12; + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_off) != + offsetof(struct rte_mbuf, rearm_data)); + cmp = _mm256_mask_cmplt_epu16_mask(data_mask, mbufs, head_rooms); + if (unlikely(cmp)) + return -1; + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + dxp = &vq->vq_descx[idx + i]; + dxp->ndescs = 1; + dxp->cookie = tx_pkts[i]; + } + + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + tx_pkts[i]->data_off -= head_size; + tx_pkts[i]->data_len += head_size; + } + + __m512i descs_base = _mm512_set_epi64(tx_pkts[3]->data_len, + VIRTIO_MBUF_ADDR(tx_pkts[3], vq), + tx_pkts[2]->data_len, + VIRTIO_MBUF_ADDR(tx_pkts[2], vq), + tx_pkts[1]->data_len, + VIRTIO_MBUF_ADDR(tx_pkts[1], vq), + tx_pkts[0]->data_len, + VIRTIO_MBUF_ADDR(tx_pkts[0], vq)); + + /* id offset and data offset */ + __m512i data_offsets = _mm512_set_epi64((uint64_t)3 << ID_BITS_OFFSET, + tx_pkts[3]->data_off, + (uint64_t)2 << ID_BITS_OFFSET, + tx_pkts[2]->data_off, + (uint64_t)1 << ID_BITS_OFFSET, + tx_pkts[1]->data_off, + 0, tx_pkts[0]->data_off); + + __m512i new_descs = _mm512_add_epi64(descs_base, data_offsets); + + uint64_t flags_temp = (uint64_t)idx << ID_BITS_OFFSET | + (uint64_t)vq->vq_packed.cached_flags << FLAGS_BITS_OFFSET; + + /* flags offset and guest virtual address offset */ + __m128i flag_offset = _mm_set_epi64x(flags_temp, 0); + __m512i v_offset = _mm512_broadcast_i32x4(flag_offset); + __m512i v_desc = _mm512_add_epi64(new_descs, v_offset); + + if (!vq->hw->has_tx_offload) { + __m128i all_mask = _mm_set1_epi16(0xFFFF); + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + __m128i v_hdr = _mm_loadu_si128((void *)hdr); + if (unlikely(_mm_mask_test_epi16_mask(NET_HDR_MASK, + v_hdr, all_mask))) { + __m128i all_zero = _mm_setzero_si128(); + _mm_mask_storeu_epi16((void *)hdr, + NET_HDR_MASK, all_zero); + } + } + } else { + virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) { + hdr = rte_pktmbuf_mtod_offset(tx_pkts[i], + struct virtio_net_hdr *, -head_size); + virtqueue_xmit_offload(hdr, tx_pkts[i], true); + } + } + + /* Enqueue Packet buffers */ + _mm512_storeu_si512((void *)&vq->vq_packed.ring.desc[idx], v_desc); + + virtio_update_batch_stats(&txvq->stats, tx_pkts[0]->pkt_len, + tx_pkts[1]->pkt_len, tx_pkts[2]->pkt_len, + tx_pkts[3]->pkt_len); + + vq->vq_avail_idx += PACKED_BATCH_SIZE; + vq->vq_free_cnt -= PACKED_BATCH_SIZE; + + if (vq->vq_avail_idx >= vq->vq_nentries) { + vq->vq_avail_idx -= vq->vq_nentries; + vq->vq_packed.cached_flags ^= + VRING_PACKED_DESC_F_AVAIL_USED; + } + + return 0; +} + +static inline int +virtqueue_enqueue_single_packed_vec(struct virtnet_tx *txvq, + struct 
rte_mbuf *txm) +{ + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t hdr_size = hw->vtnet_hdr_size; + uint16_t slots, can_push; + int16_t need; + + /* How many main ring entries are needed to this Tx? + * any_layout => number of segments + * default => number of segments + 1 + */ + can_push = rte_mbuf_refcnt_read(txm) == 1 && + RTE_MBUF_DIRECT(txm) && + txm->nb_segs == 1 && + rte_pktmbuf_headroom(txm) >= hdr_size; + + slots = txm->nb_segs + !can_push; + need = slots - vq->vq_free_cnt; + + /* Positive value indicates it need free vring descriptors */ + if (unlikely(need > 0)) { + virtio_xmit_cleanup_inorder_packed(vq, need); + need = slots - vq->vq_free_cnt; + if (unlikely(need > 0)) { + PMD_TX_LOG(ERR, + "No free tx descriptors to transmit"); + return -1; + } + } + + /* Enqueue Packet buffers */ + virtqueue_enqueue_xmit_packed(txvq, txm, slots, can_push, 1); + + txvq->stats.bytes += txm->pkt_len; + return 0; +} + +uint16_t +virtio_xmit_pkts_packed_vec(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts) +{ + struct virtnet_tx *txvq = tx_queue; + struct virtqueue *vq = txvq->vq; + struct virtio_hw *hw = vq->hw; + uint16_t nb_tx = 0; + uint16_t remained; + + if (unlikely(hw->started == 0 && tx_pkts != hw->inject_pkts)) + return nb_tx; + + if (unlikely(nb_pkts < 1)) + return nb_pkts; + + PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts); + + if (vq->vq_free_cnt <= vq->vq_nentries - vq->vq_free_thresh) + virtio_xmit_cleanup_inorder_packed(vq, vq->vq_free_thresh); + + remained = RTE_MIN(nb_pkts, vq->vq_free_cnt); + + while (remained) { + if (remained >= PACKED_BATCH_SIZE) { + if (!virtqueue_enqueue_batch_packed_vec(txvq, + &tx_pkts[nb_tx])) { + nb_tx += PACKED_BATCH_SIZE; + remained -= PACKED_BATCH_SIZE; + continue; + } + } + if (!virtqueue_enqueue_single_packed_vec(txvq, + tx_pkts[nb_tx])) { + nb_tx++; + remained--; + continue; + } + break; + }; + + txvq->stats.packets += nb_tx; + + if (likely(nb_tx)) { + if (unlikely(virtqueue_kick_prepare_packed(vq))) { + virtqueue_notify(vq); + PMD_TX_LOG(DEBUG, "Notified backend after xmit"); + } + } + + return nb_tx; +} + /* Optionally fill offload information in structure */ static inline int virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [dpdk-dev] [PATCH v12 8/9] net/virtio: add election for vectorized path 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 0/9] add packed ring " Marvin Liu ` (6 preceding siblings ...) 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 7/9] net/virtio: add vectorized packed ring Tx path Marvin Liu @ 2020-04-29 7:28 ` Marvin Liu 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 9/9] doc: add packed " Marvin Liu 2020-04-29 8:17 ` [dpdk-dev] [PATCH v12 0/9] add packed ring " Maxime Coquelin 9 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-29 7:28 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Rewrite vectorized path selection logic. Default setting comes from vectorized devarg, then checks each criteria. Packed ring vectorized path need: AVX512F and required extensions are supported by compiler and host VERSION_1 and IN_ORDER features are negotiated mergeable feature is not negotiated LRO offloading is disabled Split ring vectorized rx path need: mergeable and IN_ORDER features are not negotiated LRO, chksum and vlan strip offloadings are disabled Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c index 0a69a4db1..e86d4e08f 100644 --- a/drivers/net/virtio/virtio_ethdev.c +++ b/drivers/net/virtio/virtio_ethdev.c @@ -1523,9 +1523,12 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) if (vtpci_packed_queue(hw)) { PMD_INIT_LOG(INFO, "virtio: using packed ring %s Tx path on port %u", - hw->use_inorder_tx ? "inorder" : "standard", + hw->use_vec_tx ? "vectorized" : "standard", eth_dev->data->port_id); - eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; + if (hw->use_vec_tx) + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec; + else + eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed; } else { if (hw->use_inorder_tx) { PMD_INIT_LOG(INFO, "virtio: using inorder Tx path on port %u", @@ -1539,7 +1542,13 @@ set_rxtx_funcs(struct rte_eth_dev *eth_dev) } if (vtpci_packed_queue(hw)) { - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + if (hw->use_vec_rx) { + PMD_INIT_LOG(INFO, + "virtio: using packed ring vectorized Rx path on port %u", + eth_dev->data->port_id); + eth_dev->rx_pkt_burst = + &virtio_recv_pkts_packed_vec; + } else if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { PMD_INIT_LOG(INFO, "virtio: using packed ring mergeable buffer Rx path on port %u", eth_dev->data->port_id); @@ -1952,8 +1961,17 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev) goto err_virtio_init; if (vectorized) { - if (!vtpci_packed_queue(hw)) + if (!vtpci_packed_queue(hw)) { + hw->use_vec_rx = 1; + } else { +#if !defined(CC_AVX512_SUPPORT) + PMD_DRV_LOG(INFO, + "building environment do not support packed ring vectorized"); +#else hw->use_vec_rx = 1; + hw->use_vec_tx = 1; +#endif + } } hw->opened = true; @@ -2288,31 +2306,66 @@ virtio_dev_configure(struct rte_eth_dev *dev) return -EBUSY; } - if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { - hw->use_inorder_tx = 1; - hw->use_inorder_rx = 1; - hw->use_vec_rx = 0; - } - if (vtpci_packed_queue(hw)) { +#if defined(RTE_ARCH_X86_64) && defined(CC_AVX512_SUPPORT) + if ((hw->use_vec_rx || hw->use_vec_tx) && + (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) || + !vtpci_with_feature(hw, VIRTIO_F_IN_ORDER) || + !vtpci_with_feature(hw, VIRTIO_F_VERSION_1))) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized path for requirements not met"); + hw->use_vec_rx = 0; + hw->use_vec_tx = 0; + } +#else 
hw->use_vec_rx = 0; - hw->use_inorder_rx = 0; - } + hw->use_vec_tx = 0; +#endif + + if (hw->use_vec_rx) { + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } + if (rx_offloads & DEV_RX_OFFLOAD_TCP_LRO) { + PMD_DRV_LOG(INFO, + "disabled packed ring vectorized rx for TCP_LRO enabled"); + hw->use_vec_rx = 0; + } + } + } else { + if (vtpci_with_feature(hw, VIRTIO_F_IN_ORDER)) { + hw->use_inorder_tx = 1; + hw->use_inorder_rx = 1; + hw->use_vec_rx = 0; + } + + if (hw->use_vec_rx) { #if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM - if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { - hw->use_vec_rx = 0; - } + if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized path for requirement not met"); + hw->use_vec_rx = 0; + } #endif - if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { - hw->use_vec_rx = 0; - } + if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for mrg_rxbuf enabled"); + hw->use_vec_rx = 0; + } - if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | - DEV_RX_OFFLOAD_TCP_CKSUM | - DEV_RX_OFFLOAD_TCP_LRO | - DEV_RX_OFFLOAD_VLAN_STRIP)) - hw->use_vec_rx = 0; + if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM | + DEV_RX_OFFLOAD_TCP_CKSUM | + DEV_RX_OFFLOAD_TCP_LRO | + DEV_RX_OFFLOAD_VLAN_STRIP)) { + PMD_DRV_LOG(INFO, + "disabled split ring vectorized rx for offloading enabled"); + hw->use_vec_rx = 0; + } + } + } return 0; } -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
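The net effect of the packed-ring part of this election rework can be condensed into a few lines. The sketch below is a hedged summary only: struct toy_hw and the TOY_ feature/offload constants stand in for struct virtio_hw, vtpci_with_feature(), rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) and DEV_RX_OFFLOAD_TCP_LRO (the bit positions mirror the real virtio feature bits but are local definitions), and it ignores the split-ring branch shown in the diff:

    #include <stdint.h>
    #include <stdbool.h>

    struct toy_hw {
            uint64_t guest_features;
            uint8_t use_vec_rx;
            uint8_t use_vec_tx;
    };

    #define TOY_F_MRG_RXBUF  (1ULL << 15)
    #define TOY_F_VERSION_1  (1ULL << 32)
    #define TOY_F_IN_ORDER   (1ULL << 35)
    #define TOY_OFFLOAD_LRO  (1ULL << 4)

    static bool
    toy_has_feature(const struct toy_hw *hw, uint64_t f)
    {
            return (hw->guest_features & f) != 0;
    }

    static void
    toy_elect_packed_vec(struct toy_hw *hw, bool cpu_has_avx512f,
                         uint64_t rx_offloads)
    {
            /* hard requirements: AVX512F plus VERSION_1 and IN_ORDER negotiated */
            if (!cpu_has_avx512f ||
                !toy_has_feature(hw, TOY_F_VERSION_1) ||
                !toy_has_feature(hw, TOY_F_IN_ORDER)) {
                    hw->use_vec_rx = 0;
                    hw->use_vec_tx = 0;
                    return;
            }

            /* Rx-only restrictions: mergeable buffers or LRO disable the
             * vectorized Rx path, while vectorized Tx may still be used */
            if (toy_has_feature(hw, TOY_F_MRG_RXBUF) ||
                (rx_offloads & TOY_OFFLOAD_LRO))
                    hw->use_vec_rx = 0;
    }

If any requirement is not met, the driver silently falls back to the corresponding scalar packed-ring path, as the cover letter describes.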
* [dpdk-dev] [PATCH v12 9/9] doc: add packed vectorized path 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 0/9] add packed ring " Marvin Liu ` (7 preceding siblings ...) 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 8/9] net/virtio: add election for vectorized path Marvin Liu @ 2020-04-29 7:28 ` Marvin Liu 2020-04-29 8:17 ` [dpdk-dev] [PATCH v12 0/9] add packed ring " Maxime Coquelin 9 siblings, 0 replies; 162+ messages in thread From: Marvin Liu @ 2020-04-29 7:28 UTC (permalink / raw) To: maxime.coquelin, xiaolong.ye, zhihong.wang; +Cc: dev, Marvin Liu Document packed virtqueue vectorized path selection logic in virtio net PMD. Signed-off-by: Marvin Liu <yong.liu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst index fdd0790e0..226f4308d 100644 --- a/doc/guides/nics/virtio.rst +++ b/doc/guides/nics/virtio.rst @@ -482,6 +482,13 @@ according to below configuration: both negotiated, this path will be selected. #. Packed virtqueue in-order non-mergeable path: If in-order feature is negotiated and Rx mergeable is not negotiated, this path will be selected. +#. Packed virtqueue vectorized Rx path: If building and running environment support + AVX512 && in-order feature is negotiated && Rx mergeable is not negotiated && + TCP_LRO Rx offloading is disabled && vectorized option enabled, + this path will be selected. +#. Packed virtqueue vectorized Tx path: If building and running environment support + AVX512 && in-order feature is negotiated && vectorized option enabled, + this path will be selected. Rx/Tx callbacks of each Virtio path ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -504,6 +511,8 @@ are shown in below table: Packed virtqueue non-meregable path virtio_recv_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order mergeable path virtio_recv_mergeable_pkts_packed virtio_xmit_pkts_packed Packed virtqueue in-order non-mergeable path virtio_recv_pkts_packed virtio_xmit_pkts_packed + Packed virtqueue vectorized Rx path virtio_recv_pkts_packed_vec virtio_xmit_pkts_packed + Packed virtqueue vectorized Tx path virtio_recv_pkts_packed virtio_xmit_pkts_packed_vec ============================================ ================================= ======================== Virtio paths Support Status from Release to Release @@ -521,20 +530,22 @@ All virtio paths support status are shown in below table: .. 
table:: Virtio Paths and Releases - ============================================ ============= ============= ============= - Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 - ============================================ ============= ============= ============= - Split virtqueue mergeable path Y Y Y - Split virtqueue non-mergeable path Y Y Y - Split virtqueue vectorized Rx path Y Y Y - Split virtqueue simple Tx path Y N N - Split virtqueue in-order mergeable path Y Y - Split virtqueue in-order non-mergeable path Y Y - Packed virtqueue mergeable path Y - Packed virtqueue non-mergeable path Y - Packed virtqueue in-order mergeable path Y - Packed virtqueue in-order non-mergeable path Y - ============================================ ============= ============= ============= + ============================================ ============= ============= ============= ======= + Virtio paths 16.11 ~ 18.05 18.08 ~ 18.11 19.02 ~ 19.11 20.05 ~ + ============================================ ============= ============= ============= ======= + Split virtqueue mergeable path Y Y Y Y + Split virtqueue non-mergeable path Y Y Y Y + Split virtqueue vectorized Rx path Y Y Y Y + Split virtqueue simple Tx path Y N N N + Split virtqueue in-order mergeable path Y Y Y + Split virtqueue in-order non-mergeable path Y Y Y + Packed virtqueue mergeable path Y Y + Packed virtqueue non-mergeable path Y Y + Packed virtqueue in-order mergeable path Y Y + Packed virtqueue in-order non-mergeable path Y Y + Packed virtqueue vectorized Rx path Y + Packed virtqueue vectorized Tx path Y + ============================================ ============= ============= ============= ======= QEMU Support Status ~~~~~~~~~~~~~~~~~~~ -- 2.17.1 ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [dpdk-dev] [PATCH v12 0/9] add packed ring vectorized path 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 0/9] add packed ring " Marvin Liu ` (8 preceding siblings ...) 2020-04-29 7:28 ` [dpdk-dev] [PATCH v12 9/9] doc: add packed " Marvin Liu @ 2020-04-29 8:17 ` Maxime Coquelin 9 siblings, 0 replies; 162+ messages in thread From: Maxime Coquelin @ 2020-04-29 8:17 UTC (permalink / raw) To: Marvin Liu, xiaolong.ye, zhihong.wang; +Cc: dev

On 4/29/20 9:28 AM, Marvin Liu wrote:
> This patch set introduces a vectorized path for the packed ring.
>
> The size of a packed ring descriptor is 16 bytes. Four batched descriptors
> fit exactly into one cacheline, which AVX512 instructions can handle well.
> The packed ring Tx path can be fully transformed into a vectorized path.
> The packed ring Rx path can be vectorized when the requirements are met
> (LRO and mergeable disabled).
>
> A new device parameter "vectorized" is introduced in this patch set. The
> parameter works for both the virtio device and the virtio-user vdev. It
> also unifies the default setting of the split and packed ring vectorized
> paths. The path election logic checks the dependencies of the vectorized
> path. The packed ring vectorized path depends on the building/running
> environment and on features: IN_ORDER and VERSION_1 must be enabled, MRG
> and LRO disabled. If the vectorized path is not supported, the driver
> falls back to the normal path.
>
> v12:
> * eliminate weak symbols in data path
> * remove descriptor extra padding which can impact the normal path
> * fix invalid enqueue address
>
> v11:
> * fix i686 build warnings
> * fix typo in doc
>
> v10:
> * reuse packed ring xmit cleanup
>
> v9:
> * replace RTE_LIBRTE_VIRTIO_INC_VECTOR with vectorized devarg
> * reorder patch sequence
>
> v8:
> * fix meson build error on Ubuntu 16.04 and SUSE 15
>
> v7:
> * vectorization is disabled by default
> * compile-time check of dependency on rte_mbuf structure
> * offsets are calculated at compile time
> * remove useless barrier as descs are batched store&load
> * vindex of scatter is directly set
> * some comments updates
> * enable vectorized path in meson build
>
> v6:
> * fix issue when size is not a power of 2
>
> v5:
> * remove cpuflags definition as required extensions always come with
>   AVX512F on x86_64
> * inorder actions should depend on feature bit
> * check ring type in rx queue setup
> * rewrite some commit logs
> * fix some checkpatch warnings
>
> v4:
> * rename 'packed_vec' to 'vectorized', also used in split ring
> * add RTE_LIBRTE_VIRTIO_INC_VECTOR config for virtio ethdev
> * check required AVX512 extensions cpuflags
> * combine split and packed ring datapath selection logic
> * remove limitation that size must be a power of two
> * clear 12-byte virtio_net_hdr
>
> v3:
> * remove virtio_net_hdr array for better performance
> * disable 'packed_vec' by default
>
> v2:
> * more function blocks replaced by vector instructions
> * clean virtio_net_hdr by vector instruction
> * allow header room size change
> * add 'packed_vec' option in virtio_user vdev
> * fix build not checking whether AVX512 is enabled
> * doc update
>
> Tested-by: Wang, Yinan <yinan.wang@intel.com>
>
> Marvin Liu (9):
>   net/virtio: add Rx free threshold setting
>   net/virtio: inorder should depend on feature bit
>   net/virtio: add vectorized devarg
>   net/virtio-user: add vectorized devarg
>   net/virtio: reuse packed ring functions
>   net/virtio: add vectorized packed ring Rx path
>   net/virtio: add vectorized packed ring Tx path
>   net/virtio: add election for vectorized path
>   doc: add packed vectorized path
>
>  doc/guides/nics/virtio.rst                  |  52 +-
>  drivers/net/virtio/Makefile                 |  35 ++
>  drivers/net/virtio/meson.build              |  14 +
>  drivers/net/virtio/virtio_ethdev.c          | 142 ++++-
>  drivers/net/virtio/virtio_ethdev.h          |   6 +
>  drivers/net/virtio/virtio_pci.h             |   3 +-
>  drivers/net/virtio/virtio_rxtx.c            | 351 ++---------
>  drivers/net/virtio/virtio_rxtx_packed_avx.c | 607 ++++++++++++++++++++
>  drivers/net/virtio/virtio_user_ethdev.c     |  32 +-
>  drivers/net/virtio/virtqueue.c              |   7 +-
>  drivers/net/virtio/virtqueue.h              | 304 ++++++++++
>  11 files changed, 1199 insertions(+), 354 deletions(-)
>  create mode 100644 drivers/net/virtio/virtio_rxtx_packed_avx.c
>

Applied to dpdk-next-virtio/master,

Thanks,
Maxime

^ permalink raw reply	[flat|nested] 162+ messages in thread
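The core idea behind the series merged here is spelled out in the cover letter: a packed ring descriptor is 16 bytes, so a batch of four descriptors occupies exactly one 64-byte cache line and can be moved with a single 512-bit AVX512 operation. The short C sketch below only illustrates that granularity; the struct and function names are hypothetical (the layout follows the virtio 1.1 packed descriptor format), and this is not the code from virtio_rxtx_packed_avx.c.

#include <stdint.h>
#include <immintrin.h>	/* AVX512F intrinsics; build with e.g. -mavx512f */

/* 16-byte packed ring descriptor, laid out per the virtio 1.1 spec. */
struct packed_desc_sketch {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

/*
 * Move a batch of four descriptors (4 x 16B = 64B, one cache line) with a
 * single 512-bit load/store pair -- the granularity the cover letter
 * describes for the AVX512 datapath.
 */
static inline void
copy_desc_batch(struct packed_desc_sketch *dst,
		const struct packed_desc_sketch *src)
{
	__m512i batch = _mm512_loadu_si512((const void *)src);

	_mm512_storeu_si512((void *)dst, batch);
}

int
main(void)
{
	struct packed_desc_sketch ring[4] = {
		{ .addr = 0x1000, .len = 64, .id = 0, .flags = 0x80 },
		{ .addr = 0x2000, .len = 64, .id = 1, .flags = 0x80 },
		{ .addr = 0x3000, .len = 64, .id = 2, .flags = 0x80 },
		{ .addr = 0x4000, .len = 64, .id = 3, .flags = 0x80 },
	};
	struct packed_desc_sketch shadow[4];

	copy_desc_batch(shadow, ring);
	return shadow[3].id == 3 ? 0 : 1;
}

Working at this 64-byte granularity is also what lets a batched datapath inspect the flags of four descriptors in one register instead of one at a time, which is where the Rx/Tx speedup of the series comes from.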