From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Liu, Yong"
To: Maxime Coquelin, "Ye, Xiaolong", "Wang, Zhihong"
CC: "dev@dpdk.org"
Subject: Re: [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed
 ring Rx path
Date: Tue, 28 Apr 2020 17:01:07 +0000
Message-ID: <86228AFD5BCD8E4EBFD2B90117B5E81E635488D5@SHSMSX103.ccr.corp.intel.com>
References: <20200313174230.74661-1-yong.liu@intel.com>
 <20200426021943.43158-1-yong.liu@intel.com>
 <20200426021943.43158-7-yong.liu@intel.com>
 <672a584a-46d1-c78b-7b21-9ed7bc060814@redhat.com>
 <86228AFD5BCD8E4EBFD2B90117B5E81E63546D0C@SHSMSX103.ccr.corp.intel.com>
 <86228AFD5BCD8E4EBFD2B90117B5E81E6354755C@SHSMSX103.ccr.corp.intel.com>
In-Reply-To: <86228AFD5BCD8E4EBFD2B90117B5E81E6354755C@SHSMSX103.ccr.corp.intel.com>
List-Id: DPDK patches and discussions

> -----Original Message-----
> From: Liu, Yong
> Sent: Tuesday, April 28, 2020 9:01 PM
> To: 'Maxime Coquelin'; Ye, Xiaolong; Wang, Zhihong
> Cc: dev@dpdk.org
> Subject: RE: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path
>
>
>
> > -----Original Message-----
> > From: Maxime Coquelin
> > Sent: Tuesday, April 28, 2020 4:44 PM
> > To: Liu, Yong; Ye, Xiaolong; Wang, Zhihong
> > Cc: dev@dpdk.org
> > Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path
> >
> >
> > On 4/28/20 3:14 AM, Liu, Yong wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: Maxime Coquelin
> > >> Sent: Monday, April 27, 2020 7:21 PM
> > >> To: Liu, Yong; Ye, Xiaolong; Wang, Zhihong
> > >> Cc: dev@dpdk.org
> > >> Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path
> > >>
> > >>
> > >>
> > >> On 4/26/20 4:19 AM, Marvin Liu wrote:
> > >>> Optimize the packed ring Rx path with SIMD instructions. The solution
> > >>> is much like the vhost one: the path is split into batch and single
> > >>> functions, and the batch function is further optimized with AVX512
> > >>> instructions. Also pad the desc extra structure to 16-byte alignment,
> > >>> so that four elements fit in one batch.
> > >>>
> > >>> Signed-off-by: Marvin Liu
> > >>>
> > >>> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> > >>> index c9edb84ee..102b1deab 100644
> > >>> --- a/drivers/net/virtio/Makefile
> > >>> +++ b/drivers/net/virtio/Makefile
> > >>> @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
> > >>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
> > >>>  endif
> > >>>
> > >>> +ifneq ($(FORCE_DISABLE_AVX512), y)
> > >>> +	CC_AVX512_SUPPORT=\
> > >>> +	$(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
> > >>> +	sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
> > >>> +	grep -q AVX512 && echo 1)
> > >>> +endif
> > >>> +
> > >>> +ifeq ($(CC_AVX512_SUPPORT), 1)
> > >>> +CFLAGS += -DCC_AVX512_SUPPORT
> > >>> +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
> > >>> +
> > >>> +ifeq ($(RTE_TOOLCHAIN), gcc)
> > >>> +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
> > >>> +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
> > >>> +endif
> > >>> +endif
> > >>> +
> > >>> +ifeq ($(RTE_TOOLCHAIN), clang)
> > >>> +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1)
> > >>> +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
> > >>> +endif
> > >>> +endif
> > >>> +
> > >>> +ifeq ($(RTE_TOOLCHAIN), icc)
> > >>> +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
> > >>> +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
> > >>> +endif
> > >>> +endif
> > >>> +
> > >>> +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl
> > >>> +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
> > >>> +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
> > >>> +endif
> > >>> +endif
> > >>> +
> > >>>  ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
> > >>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
> > >>>  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c
> > >>> diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
> > >>> index 15150eea1..8e68c3039 100644
> > >>> --- a/drivers/net/virtio/meson.build
> > >>> +++ b/drivers/net/virtio/meson.build
> > >>> @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c',
> > >>>  deps += ['kvargs', 'bus_pci']
> > >>>
> > >>>  if arch_subdir == 'x86'
> > >>> +	if '-mno-avx512f' not in machine_args
> > >>> +		if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
> > >>> +			cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl']
> > >>> +			cflags += ['-DCC_AVX512_SUPPORT']
> > >>> +			if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> > >>> +				cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> > >>> +			elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> > >>> +				cflags += '-DVHOST_CLANG_UNROLL_PRAGMA'
> > >>> +			elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
> > >>> +				cflags += '-DVHOST_ICC_UNROLL_PRAGMA'
> > >>> +			endif
> > >>> +			sources += files('virtio_rxtx_packed_avx.c')
> > >>> +		endif
> > >>> +	endif
> > >>>  	sources += files('virtio_rxtx_simple_sse.c')
> > >>>  elif arch_subdir == 'ppc'
> > >>>  	sources += files('virtio_rxtx_simple_altivec.c')
> > >>> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
> > >>> index febaf17a8..5c112cac7 100644
> > >>> --- a/drivers/net/virtio/virtio_ethdev.h
> > >>> +++ b/drivers/net/virtio/virtio_ethdev.h
> > >>> @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts,
> > >>>  uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> > >>>  		uint16_t nb_pkts);
> > >>>
> > >>> +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> > >>> +		uint16_t nb_pkts);
> > >>> +
> > >>>  int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
> > >>>
> > >>>  void virtio_interrupt_handler(void *param);
> > >>> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
> > >>> index a549991aa..534562cca 100644
> > >>> --- a/drivers/net/virtio/virtio_rxtx.c
> > >>> +++ b/drivers/net/virtio/virtio_rxtx.c
> > >>> @@ -2030,3 +2030,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
> > >>>
> > >>>  	return nb_tx;
> > >>>  }
> > >>> +
> > >>> +__rte_weak uint16_t
> > >>> +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
> > >>> +			    struct rte_mbuf **rx_pkts __rte_unused,
> > >>> +			    uint16_t nb_pkts __rte_unused)
> > >>> +{
> > >>> +	return 0;
> > >>> +}
> > >>> diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> > >>> new file mode 100644
> > >>> index 000000000..8a7b459eb
> > >>> --- /dev/null
> > >>> +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> > >>> @@ -0,0 +1,374 @@
> > >>> +/* SPDX-License-Identifier: BSD-3-Clause
> > >>> + * Copyright(c) 2010-2020 Intel Corporation
> > >>> + */
> > >>> +
> > >>> +#include <stdint.h>
> > >>> +#include <stdio.h>
> > >>> +#include <stdlib.h>
> > >>> +#include <string.h>
> > >>> +#include <errno.h>
> > >>> +
> > >>> +#include <rte_net.h>
> > >>> +
> > >>> +#include "virtio_logs.h"
> > >>> +#include "virtio_ethdev.h"
> > >>> +#include "virtio_pci.h"
> > >>> +#include "virtqueue.h"
> > >>> +
> > >>> +#define BYTE_SIZE 8
> > >>> +/* flag bits offset in packed ring desc higher 64bits */
> > >>> +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> > >>> +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> > >>> +
> > >>> +#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \
> > >>> +	FLAGS_BITS_OFFSET)
> > >>> +
> > >>> +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
> > >>> +	sizeof(struct vring_packed_desc))
> > >>> +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
> > >>> +
> > >>> +#ifdef VIRTIO_GCC_UNROLL_PRAGMA
> > >>> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
> > >>> +	for (iter = val; iter < size; iter++)
> > >>> +#endif
> > >>> +
> > >>> +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
> > >>> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
> > >>> +	for (iter = val; iter < size; iter++)
> > >>> +#endif
> > >>> +
> > >>> +#ifdef VIRTIO_ICC_UNROLL_PRAGMA
> > >>> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
> > >>> +	for (iter = val; iter < size; iter++)
> > >>> +#endif
> > >>> +
> > >>> +#ifndef virtio_for_each_try_unroll
> > >>> +#define virtio_for_each_try_unroll(iter, val, num) \
> > >>> +	for (iter = val; iter < num; iter++)
> > >>> +#endif
> > >>> +
> > >>> +static inline void
> > >>> +virtio_update_batch_stats(struct virtnet_stats *stats,
> > >>> +			  uint16_t pkt_len1,
> > >>> +			  uint16_t pkt_len2,
> > >>> +			  uint16_t pkt_len3,
> > >>> +			  uint16_t pkt_len4)
> > >>> +{
> > >>> +	stats->bytes += pkt_len1;
> > >>> +	stats->bytes += pkt_len2;
> > >>> +	stats->bytes += pkt_len3;
> > >>> +	stats->bytes += pkt_len4;
> > >>> +}
> > >>> +
> > >>> +/* Optionally fill offload information in structure */
> > >>> +static inline int
> > >>> +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
> > >>> +{
> > >>> +	struct rte_net_hdr_lens hdr_lens;
> > >>> +	uint32_t hdrlen, ptype;
> > >>> +	int l4_supported = 0;
> > >>> +
> > >>> +	/* nothing to do */
> > >>> +	if (hdr->flags == 0)
> > >>> +		return 0;
> > >>> +
> > >>> +	/* GSO not supported in vec path, skip check */
> > >>> +	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
> > >>> +
> > >>> +	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
> > >>> +	m->packet_type = ptype;
> > >>> +	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
> > >>> +	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
> > >>> +	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
> > >>> +		l4_supported = 1;
> > >>> +
> > >>> +	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> > >>> +		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
> > >>> +		if (hdr->csum_start <= hdrlen && l4_supported) {
> > >>> +			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
> > >>> +		} else {
> > >>> +			/* Unknown proto or tunnel, do sw cksum. We can assume
> > >>> +			 * the cksum field is in the first segment since the
> > >>> +			 * buffers we provided to the host are large enough.
> > >>> +			 * In case of SCTP, this will be wrong since it's a CRC
> > >>> +			 * but there's nothing we can do.
> > >>> +			 */
> > >>> +			uint16_t csum = 0, off;
> > >>> +
> > >>> +			rte_raw_cksum_mbuf(m, hdr->csum_start,
> > >>> +				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
> > >>> +				&csum);
> > >>> +			if (likely(csum != 0xffff))
> > >>> +				csum = ~csum;
> > >>> +			off = hdr->csum_offset + hdr->csum_start;
> > >>> +			if (rte_pktmbuf_data_len(m) >= off + 1)
> > >>> +				*rte_pktmbuf_mtod_offset(m, uint16_t *,
> > >>> +					off) = csum;
> > >>> +		}
> > >>> +	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
> > >>> +		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
> > >>> +	}
> > >>> +
> > >>> +	return 0;
> > >>> +}
> > >>> +
> > >>> +static inline uint16_t
> > >>> +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq,
> > >>> +				   struct rte_mbuf **rx_pkts)
> > >>> +{
> > >>> +	struct virtqueue *vq = rxvq->vq;
> > >>> +	struct virtio_hw *hw = vq->hw;
> > >>> +	uint16_t hdr_size = hw->vtnet_hdr_size;
> > >>> +	uint64_t addrs[PACKED_BATCH_SIZE];
> > >>> +	uint16_t id = vq->vq_used_cons_idx;
> > >>> +	uint8_t desc_stats;
> > >>> +	uint16_t i;
> > >>> +	void *desc_addr;
> > >>> +
> > >>> +	if (id & PACKED_BATCH_MASK)
> > >>> +		return -1;
> > >>> +
> > >>> +	if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries))
> > >>> +		return -1;
> > >>> +
> > >>> +	/* only care avail/used bits */
> > >>> +	__m512i v_mask = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
> > >>> +	desc_addr = &vq->vq_packed.ring.desc[id];
> > >>> +
> > >>> +	__m512i v_desc = _mm512_loadu_si512(desc_addr);
> > >>> +	__m512i v_flag = _mm512_and_epi64(v_desc, v_mask);
> > >>> +
> > >>> +	__m512i v_used_flag = _mm512_setzero_si512();
> > >>> +	if (vq->vq_packed.used_wrap_counter)
> > >>> +		v_used_flag = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
> > >>> +
> > >>> +	/* Check all descs are used */
> > >>> +	desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag);
> > >>> +	if (desc_stats)
> > >>> +		return -1;
> > >>> +
> > >>> +	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > >>> +		rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie;
> > >>> +		rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *));
> > >>> +
> > >>> +		addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1;
> > >>> +	}
> > >>> +
> > >>> +	/*
> > >>> +	 * load len from desc, store into mbuf pkt_len and data_len
> > >>> +	 * len limited by 16bit buf_len, pkt_len[16:31] can be ignored
> > >>> +	 */
> > >>> +	const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12;
> > >>> +	__m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, 0xAA);
> > >>> +
> > >>> +	/* reduce hdr_len from pkt_len and data_len */
> > >>> +	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask,
> > >>> +			(uint32_t)-hdr_size);
> > >>> +
> > >>> +	__m512i v_value = _mm512_add_epi32(values, mbuf_len_offset);
> > >>> +
> > >>> +	/* assert offset of data_len */
> > >>> +	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
> > >>> +		offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
> > >>> +
> > >>> +	__m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3],
> > >>> +					   addrs[2] + 8, addrs[2],
> > >>> +					   addrs[1] + 8, addrs[1],
> > >>> +					   addrs[0] + 8, addrs[0]);
> > >>> +	/* batch store into mbufs */
> > >>> +	_mm512_i64scatter_epi64(0, v_index, v_value, 1);
> > >>> +
> > >>> +	if (hw->has_rx_offload) {
> > >>> +		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > >>> +			char *addr = (char *)rx_pkts[i]->buf_addr +
> > >>> +				RTE_PKTMBUF_HEADROOM - hdr_size;
> > >>> +			virtio_vec_rx_offload(rx_pkts[i],
> > >>> +					(struct virtio_net_hdr *)addr);
> > >>> +		}
> > >>> +	}
> > >>> +
> > >>> +	virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len,
> > >>> +			rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len,
> > >>> +			rx_pkts[3]->pkt_len);
> > >>> +
> > >>> +	vq->vq_free_cnt += PACKED_BATCH_SIZE;
> > >>> +
> > >>> +	vq->vq_used_cons_idx += PACKED_BATCH_SIZE;
> > >>> +	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
> > >>> +		vq->vq_used_cons_idx -= vq->vq_nentries;
> > >>> +		vq->vq_packed.used_wrap_counter ^= 1;
> > >>> +	}
> > >>> +
> > >>> +	return 0;
> > >>> +}
> > >>> +
> > >>> +static uint16_t
> > >>> +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq,
> > >>> +				    struct rte_mbuf **rx_pkts)
> > >>> +{
> > >>> +	uint16_t used_idx, id;
> > >>> +	uint32_t len;
> > >>> +	struct virtqueue *vq = rxvq->vq;
> > >>> +	struct virtio_hw *hw = vq->hw;
> > >>> +	uint32_t hdr_size = hw->vtnet_hdr_size;
> > >>> +	struct virtio_net_hdr *hdr;
> > >>> +	struct vring_packed_desc *desc;
> > >>> +	struct rte_mbuf *cookie;
> > >>> +
> > >>> +	desc = vq->vq_packed.ring.desc;
> > >>> +	used_idx = vq->vq_used_cons_idx;
> > >>> +	if (!desc_is_used(&desc[used_idx], vq))
> > >>> +		return -1;
> > >>> +
> > >>> +	len = desc[used_idx].len;
> > >>> +	id = desc[used_idx].id;
> > >>> +	cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie;
> > >>> +	if (unlikely(cookie == NULL)) {
> > >>> +		PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u",
> > >>> +			vq->vq_used_cons_idx);
> > >>> +		return -1;
> > >>> +	}
> > >>> +	rte_prefetch0(cookie);
> > >>> +	rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
> > >>> +
> > >>> +	cookie->data_off = RTE_PKTMBUF_HEADROOM;
> > >>> +	cookie->ol_flags = 0;
> > >>> +	cookie->pkt_len = (uint32_t)(len - hdr_size);
> > >>> +	cookie->data_len = (uint32_t)(len - hdr_size);
> > >>> +
> > >>> +	hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr +
> > >>> +					RTE_PKTMBUF_HEADROOM - hdr_size);
> > >>> +	if (hw->has_rx_offload)
> > >>> +		virtio_vec_rx_offload(cookie, hdr);
> > >>> +
> > >>> +	*rx_pkts = cookie;
> > >>> +
> > >>> +	rxvq->stats.bytes += cookie->pkt_len;
> > >>> +
> > >>> +	vq->vq_free_cnt++;
> > >>> +	vq->vq_used_cons_idx++;
> > >>> +	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
> > >>> +		vq->vq_used_cons_idx -= vq->vq_nentries;
> > >>> +		vq->vq_packed.used_wrap_counter ^= 1;
> > >>> +	}
> > >>> +
> > >>> +	return 0;
> > >>> +}
> > >>> +
> > >>> +static inline void
> > >>> +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq,
> > >>> +			      struct rte_mbuf **cookie,
> > >>> +			      uint16_t num)
> > >>> +{
> > >>> +	struct virtqueue *vq = rxvq->vq;
> > >>> +	struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc;
> > >>> +	uint16_t flags = vq->vq_packed.cached_flags;
> > >>> +	struct virtio_hw *hw = vq->hw;
> > >>> +	struct vq_desc_extra *dxp;
> > >>> +	uint16_t idx, i;
> > >>> +	uint16_t batch_num, total_num = 0;
> > >>> +	uint16_t head_idx = vq->vq_avail_idx;
> > >>> +	uint16_t head_flag = vq->vq_packed.cached_flags;
> > >>> +	uint64_t addr;
> > >>> +
> > >>> +	do {
> > >>> +		idx = vq->vq_avail_idx;
> > >>> +
> > >>> +		batch_num = PACKED_BATCH_SIZE;
> > >>> +		if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries))
> > >>> +			batch_num = vq->vq_nentries - idx;
> > >>> +		if (unlikely((total_num + batch_num) > num))
> > >>> +			batch_num = num - total_num;
> > >>> +
> > >>> +		virtio_for_each_try_unroll(i, 0, batch_num) {
> > >>> +			dxp = &vq->vq_descx[idx + i];
> > >>> +			dxp->cookie = (void *)cookie[total_num + i];
> > >>> +
> > >>> +			addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) +
> > >>> +				RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size;
> > >>> +			start_dp[idx + i].addr = addr;
> > >>> +			start_dp[idx + i].len = cookie[total_num + i]->buf_len
> > >>> +				- RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
> > >>> +			if (total_num || i) {
> > >>> +				virtqueue_store_flags_packed(&start_dp[idx + i],
> > >>> +						flags, hw->weak_barriers);
> > >>> +			}
> > >>> +		}
> > >>> +
> > >>> +		vq->vq_avail_idx += batch_num;
> > >>> +		if (vq->vq_avail_idx >= vq->vq_nentries) {
> > >>> +			vq->vq_avail_idx -= vq->vq_nentries;
> > >>> +			vq->vq_packed.cached_flags ^=
> > >>> +				VRING_PACKED_DESC_F_AVAIL_USED;
> > >>> +			flags = vq->vq_packed.cached_flags;
> > >>> +		}
> > >>> +		total_num += batch_num;
> > >>> +	} while (total_num < num);
> > >>> +
> > >>> +	virtqueue_store_flags_packed(&start_dp[head_idx], head_flag,
> > >>> +			hw->weak_barriers);
> > >>> +	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num);
> > >>> +}
> > >>> +
> > >>> +uint16_t
> > >>> +virtio_recv_pkts_packed_vec(void *rx_queue,
> > >>> +			    struct rte_mbuf **rx_pkts,
> > >>> +			    uint16_t nb_pkts)
> > >>> +{
> > >>> +	struct virtnet_rx *rxvq = rx_queue;
> > >>> +	struct virtqueue *vq = rxvq->vq;
> > >>> +	struct virtio_hw *hw = vq->hw;
> > >>> +	uint16_t num, nb_rx = 0;
> > >>> +	uint32_t nb_enqueued = 0;
> > >>> +	uint16_t free_cnt = vq->vq_free_thresh;
> > >>> +
> > >>> +	if (unlikely(hw->started == 0))
> > >>> +		return nb_rx;
> > >>> +
> > >>> +	num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts);
> > >>> +	if (likely(num > PACKED_BATCH_SIZE))
> > >>> +		num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE);
> > >>> +
> > >>> +	while (num) {
> > >>> +		if (!virtqueue_dequeue_batch_packed_vec(rxvq,
> > >>> +					&rx_pkts[nb_rx])) {
> > >>> +			nb_rx += PACKED_BATCH_SIZE;
> > >>> +			num -= PACKED_BATCH_SIZE;
> > >>> +			continue;
> > >>> +		}
> > >>> +		if (!virtqueue_dequeue_single_packed_vec(rxvq,
> > >>> +					&rx_pkts[nb_rx])) {
> > >>> +			nb_rx++;
> > >>> +			num--;
> > >>> +			continue;
> > >>> +		}
> > >>> +		break;
> > >>> +	};
> > >>> +
> > >>> +	PMD_RX_LOG(DEBUG, "dequeue:%d", num);
> > >>> +
> > >>> +	rxvq->stats.packets += nb_rx;
> > >>> +
> > >>> +	if (likely(vq->vq_free_cnt >= free_cnt)) {
> > >>> +		struct rte_mbuf *new_pkts[free_cnt];
> > >>> +		if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts,
> > >>> +						free_cnt) == 0)) {
> > >>> +			virtio_recv_refill_packed_vec(rxvq, new_pkts,
> > >>> +					free_cnt);
> > >>> +			nb_enqueued += free_cnt;
> > >>> +		} else {
> > >>> +			struct rte_eth_dev *dev =
> > >>> +				&rte_eth_devices[rxvq->port_id];
> > >>> +			dev->data->rx_mbuf_alloc_failed += free_cnt;
> > >>> +		}
> > >>> +	}
> > >>> +
> > >>> +	if (likely(nb_enqueued)) {
> > >>> +		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
> > >>> +			virtqueue_notify(vq);
> > >>> +			PMD_RX_LOG(DEBUG, "Notified");
> > >>> +		}
> > >>> +	}
> > >>> +
> > >>> +	return nb_rx;
> > >>> +}
> > >>> diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
> > >>> index 40ad786cc..c54698ad1 100644
> > >>> --- a/drivers/net/virtio/virtio_user_ethdev.c
> > >>> +++ b/drivers/net/virtio/virtio_user_ethdev.c
> > >>> @@ -528,6 +528,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev)
> > >>>  	hw->use_msix = 1;
> > >>>  	hw->modern = 0;
> > >>>  	hw->use_vec_rx = 0;
> > >>> +	hw->use_vec_tx = 0;
> > >>>  	hw->use_inorder_rx = 0;
> > >>>  	hw->use_inorder_tx = 0;
> > >>>  	hw->virtio_user_dev = dev;
> > >>> @@ -739,8 +740,19 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
> > >>>  		goto end;
> > >>>  	}
> > >>>
> > >>> -	if (vectorized)
> > >>> -		hw->use_vec_rx = 1;
> > >>> +	if (vectorized) {
> > >>> +		if (packed_vq) {
> > >>> +#if defined(CC_AVX512_SUPPORT)
> > >>> +			hw->use_vec_rx = 1;
> > >>> +			hw->use_vec_tx = 1;
> > >>> +#else
> > >>> +			PMD_INIT_LOG(INFO,
> > >>> +				"building environment do not support packed ring vectorized");
> > >>> +#endif
> > >>> +		} else {
> > >>> +			hw->use_vec_rx = 1;
> > >>> +		}
> > >>> +	}
> > >>>
> > >>>  	rte_eth_dev_probing_finish(eth_dev);
> > >>>  	ret = 0;
> > >>> diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
> > >>> index ca1c10499..ce0340743 100644
> > >>> --- a/drivers/net/virtio/virtqueue.h
> > >>> +++ b/drivers/net/virtio/virtqueue.h
> > >>> @@ -239,7 +239,8 @@ struct vq_desc_extra {
> > >>>  	void *cookie;
> > >>>  	uint16_t ndescs;
> > >>>  	uint16_t next;
> > >>> -};
> > >>> +	uint8_t padding[4];
> > >>> +} __rte_packed __rte_aligned(16);
> > >>
> > >> Can't this introduce a performance impact for the non-vectorized
> > >> case? I'm thinking of worse cache-line utilization.
> > >>
> > >> For example, with a burst of 32 descriptors and 32B cache lines, before
> > >> it would take 14 cache lines, after 16. So for each burst, one could face
> > >> 2 extra cache misses.
> > >>
> > >> If you could run non-vectorized benchmarks with and without that patch,
> > >> I would be grateful.
> > >>
> > >
> > > Maxime,
> > > Thanks for pointing it out; it will add an extra cache miss in the
> > > datapath, and its impact on performance is around 1% in the loopback case.
> >
> > OK, thanks for doing the test. I'll try to run some PVP benchmarks
> > on my side, because when doing IO loopback the cache pressure is
> > much less important.
> >
> > > While the benefit of the vectorized path will be more than that number.
> >
> > OK, but I disagree, for two reasons:
> > 1. You have to keep in mind that non-vectorized is the default and
> > encouraged mode to use. Indeed, it takes a lot of shortcuts, like not
> > checking the header length (so no error stats), etc...
> >
> OK, I will keep the non-vectorized path the same as before.
>
> > 2. It's like saying it's OK that it degrades by 5% on $CPU_VENDOR_A
> > because the gain is 20% on $CPU_VENDOR_B.
> >
> > If we see more degradation in a real-world scenario, you might
> > want to consider using ifdefs to avoid adding the padding in the non-
> > vectorized case, like you did to differentiate the Virtio PMD from the
> > Virtio-user PMD in patch 7.
> >
>
> Maxime,
> The performance difference is so slight that I ignored it; it looked like
> a sampling error.
> It may not be suitable to add a new configuration option for a setting
> that is only used inside the driver.
> Instead, the virtio driver can check at initialization whether a virtqueue
> is using the vectorized path, and use the padded structure only if it is.
> I have added some test code and performance has now come back. Since the
> initialization code has changed, it needs some time for regression
> checking.
>
One more update: batch store with the padded structure shows no benefit on
the latest code. That may be because the additional load/store cost cannot
be hidden by the saved CPU cycles. I will remove the padded structure and
make things clear as before.
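To make the cache-footprint arithmetic above concrete, here is a standalone
sketch. It is illustrative only: the struct names are invented, "before"
mirrors the current vq_desc_extra layout, "after" mirrors this patch, and
it assumes an LP64 target with 64-byte cache lines.

    /* Illustrative only, not driver code: compare size and alignment of
     * the unpadded and padded descriptor-extra layouts. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdalign.h>

    struct desc_extra_before {          /* layout before this patch */
            void *cookie;
            uint16_t ndescs;
            uint16_t next;
    };  /* 12 bytes of fields; on LP64 the compiler rounds the size up
         * to 16, with 8-byte alignment */

    struct desc_extra_after {           /* layout after this patch */
            void *cookie;
            uint16_t ndescs;
            uint16_t next;
            uint8_t padding[4];
    } __attribute__((packed, aligned(16)));  /* 16 bytes, 16-byte aligned */

    int main(void)
    {
            /* With 64B cache lines, four 16B, 16B-aligned elements fill
             * exactly one line and no element straddles two lines, which
             * is what the batch path wants. The cost appears where the
             * unpadded layout is smaller (e.g. 32-bit pointers: 8B vs
             * 16B per element), so a burst touches more lines. */
            printf("before: size %zu, align %zu\n",
                   sizeof(struct desc_extra_before),
                   alignof(struct desc_extra_before));
            printf("after:  size %zu, align %zu\n",
                   sizeof(struct desc_extra_after),
                   alignof(struct desc_extra_after));
            return 0;
    }

On a 64-bit target both variants end up 16 bytes, so there the change mainly
guarantees 16-byte alignment of each element rather than growing the array.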
> Regards,
> Marvin
>
> > Thanks,
> > Maxime
> >
> > > Thanks,
> > > Marvin
> > >
> > >> Reviewed-by: Maxime Coquelin
> > >>
> > >> Thanks,
> > >> Maxime
> > >