From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Liu, Yong"
To: Maxime Coquelin, "Ye, Xiaolong", "Wang, Zhihong"
CC: "dev@dpdk.org"
Subject: Re: [dpdk-dev] [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path
Date: Tue, 28 Apr 2020 13:01:10 +0000
Message-ID: <86228AFD5BCD8E4EBFD2B90117B5E81E6354755C@SHSMSX103.ccr.corp.intel.com>
References: <20200313174230.74661-1-yong.liu@intel.com>
 <20200426021943.43158-1-yong.liu@intel.com>
 <20200426021943.43158-7-yong.liu@intel.com>
 <672a584a-46d1-c78b-7b21-9ed7bc060814@redhat.com>
 <86228AFD5BCD8E4EBFD2B90117B5E81E63546D0C@SHSMSX103.ccr.corp.intel.com>

> -----Original Message-----
> From: Maxime Coquelin
> Sent: Tuesday, April 28, 2020 4:44 PM
> To: Liu, Yong; Ye, Xiaolong; Wang, Zhihong
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path
>
>
>
> On 4/28/20 3:14 AM, Liu, Yong wrote:
> >
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin
> >> Sent: Monday, April 27, 2020 7:21 PM
> >> To: Liu, Yong; Ye, Xiaolong; Wang, Zhihong
> >> Cc: dev@dpdk.org
> >> Subject: Re: [PATCH v10 6/9] net/virtio: add vectorized packed ring Rx path
> >>
> >>
> >>
> >> On 4/26/20 4:19 AM, Marvin Liu wrote:
> >>> Optimize the packed ring Rx path with SIMD instructions. The
> >>> optimization approach is similar to vhost: the path is split into
> >>> batch and single functions, and the batch function is further
> >>> optimized with AVX512 instructions. The desc extra structure is also
> >>> padded to 16-byte alignment, so that four elements fit in one batch.
> >>>
> >>> Signed-off-by: Marvin Liu
> >>>
> >>> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> >>> index c9edb84ee..102b1deab 100644
> >>> --- a/drivers/net/virtio/Makefile
> >>> +++ b/drivers/net/virtio/Makefile
> >>> @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) $(CONFIG_RTE_ARCH_ARM64)),)
> >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple_neon.c
> >>> endif
> >>>
> >>> +ifneq ($(FORCE_DISABLE_AVX512), y)
> >>> +	CC_AVX512_SUPPORT=\
> >>> +	$(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
> >>> +	sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
> >>> +	grep -q AVX512 && echo 1)
> >>> +endif
> >>> +
> >>> +ifeq ($(CC_AVX512_SUPPORT), 1)
> >>> +CFLAGS += -DCC_AVX512_SUPPORT
> >>> +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_packed_avx.c
> >>> +
> >>> +ifeq ($(RTE_TOOLCHAIN), gcc)
> >>> +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
> >>> +CFLAGS += -DVIRTIO_GCC_UNROLL_PRAGMA
> >>> +endif
> >>> +endif
> >>> +
> >>> +ifeq ($(RTE_TOOLCHAIN), clang)
> >>> +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -ge 37 && echo 1), 1)
> >>> +CFLAGS += -DVIRTIO_CLANG_UNROLL_PRAGMA
> >>> +endif
> >>> +endif
> >>> +
> >>> +ifeq ($(RTE_TOOLCHAIN), icc)
> >>> +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
> >>> +CFLAGS += -DVIRTIO_ICC_UNROLL_PRAGMA
> >>> +endif
> >>> +endif
> >>> +
> >>> +CFLAGS_virtio_rxtx_packed_avx.o += -mavx512f -mavx512bw -mavx512vl
> >>> +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
> >>> +CFLAGS_virtio_rxtx_packed_avx.o += -Wno-zero-length-bounds
> >>> +endif
> >>> +endif
> >>> +
> >>> ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
> >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
> >>> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c
> >>> diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
> >>> index 15150eea1..8e68c3039 100644
> >>> --- a/drivers/net/virtio/meson.build
> >>> +++ b/drivers/net/virtio/meson.build
> >>> @@ -9,6 +9,20 @@ sources += files('virtio_ethdev.c',
> >>> 	deps += ['kvargs', 'bus_pci']
> >>>
> >>> 	if arch_subdir == 'x86'
> >>> +	if '-mno-avx512f' not in machine_args
> >>> +		if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
> >>> +			cflags += ['-mavx512f', '-mavx512bw', '-mavx512vl']
> >>> +			cflags += ['-DCC_AVX512_SUPPORT']
> >>> +			if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> >>> +				cflags += '-DVIRTIO_GCC_UNROLL_PRAGMA'
> >>> +			elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> >>> +				cflags += '-DVIRTIO_CLANG_UNROLL_PRAGMA'
> >>> +			elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
> >>> +				cflags += '-DVIRTIO_ICC_UNROLL_PRAGMA'
> >>> +			endif
> >>> +			sources += files('virtio_rxtx_packed_avx.c')
> >>> +		endif
> >>> +	endif
> >>> 	sources += files('virtio_rxtx_simple_sse.c')
> >>> elif arch_subdir == 'ppc'
> >>> 	sources += files('virtio_rxtx_simple_altivec.c')
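
A side note on the detection above: CC_AVX512_SUPPORT only proves the
*compiler* can emit AVX512; the CPU running the binary still has to be
checked before selecting this path. A minimal sketch of such a runtime
guard using DPDK's CPU-flag API (illustrative only, not part of this
patch; the helper name is made up):

	#include <rte_cpuflags.h>

	/* Hypothetical helper: use the AVX512 Rx path only when the
	 * binary was built with support and the running CPU exposes
	 * the required ISA extensions.
	 */
	static int
	vectorized_rx_usable(void)
	{
	#ifdef CC_AVX512_SUPPORT
		return rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
		       rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW) &&
		       rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL);
	#else
		return 0;
	#endif
	}
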
> >>> diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
> >>> index febaf17a8..5c112cac7 100644
> >>> --- a/drivers/net/virtio/virtio_ethdev.h
> >>> +++ b/drivers/net/virtio/virtio_ethdev.h
> >>> @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, struct rte_mbuf **tx_pkts,
> >>> uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> >>> 	uint16_t nb_pkts);
> >>>
> >>> +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> >>> +	uint16_t nb_pkts);
> >>> +
> >>> int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
> >>>
> >>> void virtio_interrupt_handler(void *param);
> >>> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
> >>> index a549991aa..534562cca 100644
> >>> --- a/drivers/net/virtio/virtio_rxtx.c
> >>> +++ b/drivers/net/virtio/virtio_rxtx.c
> >>> @@ -2030,3 +2030,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
> >>>
> >>> 	return nb_tx;
> >>> }
> >>> +
> >>> +__rte_weak uint16_t
> >>> +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
> >>> +	struct rte_mbuf **rx_pkts __rte_unused,
> >>> +	uint16_t nb_pkts __rte_unused)
> >>> +{
> >>> +	return 0;
> >>> +}
> >>> diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> >>> new file mode 100644
> >>> index 000000000..8a7b459eb
> >>> --- /dev/null
> >>> +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> >>> @@ -0,0 +1,374 @@
> >>> +/* SPDX-License-Identifier: BSD-3-Clause
> >>> + * Copyright(c) 2010-2020 Intel Corporation
> >>> + */
> >>> +
> >>> +#include <stdint.h>
> >>> +#include <stdio.h>
> >>> +#include <stdlib.h>
> >>> +#include <string.h>
> >>> +#include <errno.h>
> >>> +
> >>> +#include <rte_net.h>
> >>> +
> >>> +#include "virtio_logs.h"
> >>> +#include "virtio_ethdev.h"
> >>> +#include "virtio_pci.h"
> >>> +#include "virtqueue.h"
> >>> +
> >>> +#define BYTE_SIZE 8
> >>> +/* flag bits offset in the higher 64 bits of a packed ring desc */
> >>> +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> >>> +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> >>> +
> >>> +#define PACKED_FLAGS_MASK ((0ULL | VRING_PACKED_DESC_F_AVAIL_USED) << \
> >>> +	FLAGS_BITS_OFFSET)
> >>> +
> >>> +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
> >>> +	sizeof(struct vring_packed_desc))
> >>> +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
> >>> +
> >>> +#ifdef VIRTIO_GCC_UNROLL_PRAGMA
> >>> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
> >>> +	for (iter = val; iter < size; iter++)
> >>> +#endif
> >>> +
> >>> +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
> >>> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
> >>> +	for (iter = val; iter < size; iter++)
> >>> +#endif
> >>> +
> >>> +#ifdef VIRTIO_ICC_UNROLL_PRAGMA
> >>> +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
> >>> +	for (iter = val; iter < size; iter++)
> >>> +#endif
> >>> +
> >>> +#ifndef virtio_for_each_try_unroll
> >>> +#define virtio_for_each_try_unroll(iter, val, num) \
> >>> +	for (iter = val; iter < num; iter++)
> >>> +#endif
> >>> +
> >>> +static inline void
> >>> +virtio_update_batch_stats(struct virtnet_stats *stats,
> >>> +	uint16_t pkt_len1,
> >>> +	uint16_t pkt_len2,
> >>> +	uint16_t pkt_len3,
> >>> +	uint16_t pkt_len4)
> >>> +{
> >>> +	stats->bytes += pkt_len1;
> >>> +	stats->bytes += pkt_len2;
> >>> +	stats->bytes += pkt_len3;
> >>> +	stats->bytes += pkt_len4;
> >>> +}
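
For clarity: PACKED_BATCH_SIZE works out to 4 on x86, since a
vring_packed_desc is 16 bytes and RTE_CACHE_LINE_SIZE is 64, so one
batch is exactly one cache line of descriptors. A hedged compile-time
check one could add to document that assumption (the helper below is
hypothetical, not in the patch):

	/* Illustrative only: the batch functions rely on four 16B
	 * descriptors filling one 64B cache line.
	 */
	static inline void
	virtio_packed_batch_size_asserts(void)
	{
		RTE_BUILD_BUG_ON(sizeof(struct vring_packed_desc) != 16);
		RTE_BUILD_BUG_ON(PACKED_BATCH_SIZE != 4);
	}
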
> >>> +
> >>> +/* Optionally fill offload information into the mbuf */
> >>> +static inline int
> >>> +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
> >>> +{
> >>> +	struct rte_net_hdr_lens hdr_lens;
> >>> +	uint32_t hdrlen, ptype;
> >>> +	int l4_supported = 0;
> >>> +
> >>> +	/* nothing to do */
> >>> +	if (hdr->flags == 0)
> >>> +		return 0;
> >>> +
> >>> +	/* GSO is not supported in the vec path, skip the check */
> >>> +	m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
> >>> +
> >>> +	ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
> >>> +	m->packet_type = ptype;
> >>> +	if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
> >>> +	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
> >>> +	    (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
> >>> +		l4_supported = 1;
> >>> +
> >>> +	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> >>> +		hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
> >>> +		if (hdr->csum_start <= hdrlen && l4_supported) {
> >>> +			m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
> >>> +		} else {
> >>> +			/* Unknown proto or tunnel, do sw cksum. We can assume
> >>> +			 * the cksum field is in the first segment since the
> >>> +			 * buffers we provided to the host are large enough.
> >>> +			 * In case of SCTP, this will be wrong since it's a CRC
> >>> +			 * but there's nothing we can do.
> >>> +			 */
> >>> +			uint16_t csum = 0, off;
> >>> +
> >>> +			rte_raw_cksum_mbuf(m, hdr->csum_start,
> >>> +				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
> >>> +				&csum);
> >>> +			if (likely(csum != 0xffff))
> >>> +				csum = ~csum;
> >>> +			off = hdr->csum_offset + hdr->csum_start;
> >>> +			if (rte_pktmbuf_data_len(m) >= off + 1)
> >>> +				*rte_pktmbuf_mtod_offset(m, uint16_t *,
> >>> +					off) = csum;
> >>> +		}
> >>> +	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && l4_supported) {
> >>> +		m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
> >>> +	}
> >>> +
> >>> +	return 0;
> >>> +}
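
To make the csum_start/csum_offset handling above concrete, a worked
example (illustrative values only, not taken from the patch):

	/* Plain TCP over IPv4, 14B Ethernet header, 20B IP header:
	 *   hdr->csum_start  = 14 + 20 = 34  (checksumming begins at L4)
	 *   hdr->csum_offset = 16            (TCP checksum field offset)
	 * so the software checksum is stored at byte 34 + 16 = 50.
	 */
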
> >>> +
> >>> +static inline uint16_t
> >>> +virtqueue_dequeue_batch_packed_vec(struct virtnet_rx *rxvq,
> >>> +	struct rte_mbuf **rx_pkts)
> >>> +{
> >>> +	struct virtqueue *vq = rxvq->vq;
> >>> +	struct virtio_hw *hw = vq->hw;
> >>> +	uint16_t hdr_size = hw->vtnet_hdr_size;
> >>> +	uint64_t addrs[PACKED_BATCH_SIZE];
> >>> +	uint16_t id = vq->vq_used_cons_idx;
> >>> +	uint8_t desc_stats;
> >>> +	uint16_t i;
> >>> +	void *desc_addr;
> >>> +
> >>> +	if (id & PACKED_BATCH_MASK)
> >>> +		return -1;
> >>> +
> >>> +	if (unlikely((id + PACKED_BATCH_SIZE) > vq->vq_nentries))
> >>> +		return -1;
> >>> +
> >>> +	/* we only care about the avail/used bits */
> >>> +	__m512i v_mask = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
> >>> +	desc_addr = &vq->vq_packed.ring.desc[id];
> >>> +
> >>> +	__m512i v_desc = _mm512_loadu_si512(desc_addr);
> >>> +	__m512i v_flag = _mm512_and_epi64(v_desc, v_mask);
> >>> +
> >>> +	__m512i v_used_flag = _mm512_setzero_si512();
> >>> +	if (vq->vq_packed.used_wrap_counter)
> >>> +		v_used_flag = _mm512_maskz_set1_epi64(0xaa, PACKED_FLAGS_MASK);
> >>> +
> >>> +	/* Check that all descs are used */
> >>> +	desc_stats = _mm512_cmpneq_epu64_mask(v_flag, v_used_flag);
> >>> +	if (desc_stats)
> >>> +		return -1;
> >>> +
> >>> +	virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> >>> +		rx_pkts[i] = (struct rte_mbuf *)vq->vq_descx[id + i].cookie;
> >>> +		rte_packet_prefetch(rte_pktmbuf_mtod(rx_pkts[i], void *));
> >>> +
> >>> +		addrs[i] = (uint64_t)rx_pkts[i]->rx_descriptor_fields1;
> >>> +	}
> >>> +
> >>> +	/*
> >>> +	 * load len from desc, store into mbuf pkt_len and data_len
> >>> +	 * len is limited by the 16-bit buf_len, so pkt_len[16:31] can be ignored
> >>> +	 */
> >>> +	const __mmask16 mask = 0x6 | 0x6 << 4 | 0x6 << 8 | 0x6 << 12;
> >>> +	__m512i values = _mm512_maskz_shuffle_epi32(mask, v_desc, 0xAA);
> >>> +
> >>> +	/* reduce hdr_len from pkt_len and data_len */
> >>> +	__m512i mbuf_len_offset = _mm512_maskz_set1_epi32(mask,
> >>> +			(uint32_t)-hdr_size);
> >>> +
> >>> +	__m512i v_value = _mm512_add_epi32(values, mbuf_len_offset);
> >>> +
> >>> +	/* assert offset of data_len */
> >>> +	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
> >>> +		offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
> >>> +
> >>> +	__m512i v_index = _mm512_set_epi64(addrs[3] + 8, addrs[3],
> >>> +					   addrs[2] + 8, addrs[2],
> >>> +					   addrs[1] + 8, addrs[1],
> >>> +					   addrs[0] + 8, addrs[0]);
> >>> +	/* batch store into mbufs */
> >>> +	_mm512_i64scatter_epi64(0, v_index, v_value, 1);
> >>> +
> >>> +	if (hw->has_rx_offload) {
> >>> +		virtio_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> >>> +			char *addr = (char *)rx_pkts[i]->buf_addr +
> >>> +				RTE_PKTMBUF_HEADROOM - hdr_size;
> >>> +			virtio_vec_rx_offload(rx_pkts[i],
> >>> +					(struct virtio_net_hdr *)addr);
> >>> +		}
> >>> +	}
> >>> +
> >>> +	virtio_update_batch_stats(&rxvq->stats, rx_pkts[0]->pkt_len,
> >>> +			rx_pkts[1]->pkt_len, rx_pkts[2]->pkt_len,
> >>> +			rx_pkts[3]->pkt_len);
> >>> +
> >>> +	vq->vq_free_cnt += PACKED_BATCH_SIZE;
> >>> +
> >>> +	vq->vq_used_cons_idx += PACKED_BATCH_SIZE;
> >>> +	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
> >>> +		vq->vq_used_cons_idx -= vq->vq_nentries;
> >>> +		vq->vq_packed.used_wrap_counter ^= 1;
> >>> +	}
> >>> +
> >>> +	return 0;
> >>> +}
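
For readers not fluent in AVX512 mask intrinsics, the flag check above
is the 4-wide equivalent of the following scalar logic (a rough sketch
for readability only, not part of the patch; the real code also needs
proper load ordering):

	/* A descriptor has been used by the device when its AVAIL and
	 * USED flag bits both match the ring's current wrap counter.
	 */
	static inline int
	batch_descs_used(struct virtqueue *vq, uint16_t id)
	{
		uint16_t expected = vq->vq_packed.used_wrap_counter ?
			VRING_PACKED_DESC_F_AVAIL_USED : 0;
		uint16_t i;

		for (i = 0; i < PACKED_BATCH_SIZE; i++) {
			uint16_t flags = vq->vq_packed.ring.desc[id + i].flags;
			if ((flags & VRING_PACKED_DESC_F_AVAIL_USED) != expected)
				return 0;
		}
		return 1;
	}
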
> >>> +
> >>> +static uint16_t
> >>> +virtqueue_dequeue_single_packed_vec(struct virtnet_rx *rxvq,
> >>> +	struct rte_mbuf **rx_pkts)
> >>> +{
> >>> +	uint16_t used_idx, id;
> >>> +	uint32_t len;
> >>> +	struct virtqueue *vq = rxvq->vq;
> >>> +	struct virtio_hw *hw = vq->hw;
> >>> +	uint32_t hdr_size = hw->vtnet_hdr_size;
> >>> +	struct virtio_net_hdr *hdr;
> >>> +	struct vring_packed_desc *desc;
> >>> +	struct rte_mbuf *cookie;
> >>> +
> >>> +	desc = vq->vq_packed.ring.desc;
> >>> +	used_idx = vq->vq_used_cons_idx;
> >>> +	if (!desc_is_used(&desc[used_idx], vq))
> >>> +		return -1;
> >>> +
> >>> +	len = desc[used_idx].len;
> >>> +	id = desc[used_idx].id;
> >>> +	cookie = (struct rte_mbuf *)vq->vq_descx[id].cookie;
> >>> +	if (unlikely(cookie == NULL)) {
> >>> +		PMD_DRV_LOG(ERR, "vring descriptor with no mbuf cookie at %u",
> >>> +			vq->vq_used_cons_idx);
> >>> +		return -1;
> >>> +	}
> >>> +	rte_prefetch0(cookie);
> >>> +	rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
> >>> +
> >>> +	cookie->data_off = RTE_PKTMBUF_HEADROOM;
> >>> +	cookie->ol_flags = 0;
> >>> +	cookie->pkt_len = (uint32_t)(len - hdr_size);
> >>> +	cookie->data_len = (uint32_t)(len - hdr_size);
> >>> +
> >>> +	hdr = (struct virtio_net_hdr *)((char *)cookie->buf_addr +
> >>> +					RTE_PKTMBUF_HEADROOM - hdr_size);
> >>> +	if (hw->has_rx_offload)
> >>> +		virtio_vec_rx_offload(cookie, hdr);
> >>> +
> >>> +	*rx_pkts = cookie;
> >>> +
> >>> +	rxvq->stats.bytes += cookie->pkt_len;
> >>> +
> >>> +	vq->vq_free_cnt++;
> >>> +	vq->vq_used_cons_idx++;
> >>> +	if (vq->vq_used_cons_idx >= vq->vq_nentries) {
> >>> +		vq->vq_used_cons_idx -= vq->vq_nentries;
> >>> +		vq->vq_packed.used_wrap_counter ^= 1;
> >>> +	}
> >>> +
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +static inline void
> >>> +virtio_recv_refill_packed_vec(struct virtnet_rx *rxvq,
> >>> +	struct rte_mbuf **cookie,
> >>> +	uint16_t num)
> >>> +{
> >>> +	struct virtqueue *vq = rxvq->vq;
> >>> +	struct vring_packed_desc *start_dp = vq->vq_packed.ring.desc;
> >>> +	uint16_t flags = vq->vq_packed.cached_flags;
> >>> +	struct virtio_hw *hw = vq->hw;
> >>> +	struct vq_desc_extra *dxp;
> >>> +	uint16_t idx, i;
> >>> +	uint16_t batch_num, total_num = 0;
> >>> +	uint16_t head_idx = vq->vq_avail_idx;
> >>> +	uint16_t head_flag = vq->vq_packed.cached_flags;
> >>> +	uint64_t addr;
> >>> +
> >>> +	do {
> >>> +		idx = vq->vq_avail_idx;
> >>> +
> >>> +		batch_num = PACKED_BATCH_SIZE;
> >>> +		if (unlikely((idx + PACKED_BATCH_SIZE) > vq->vq_nentries))
> >>> +			batch_num = vq->vq_nentries - idx;
> >>> +		if (unlikely((total_num + batch_num) > num))
> >>> +			batch_num = num - total_num;
> >>> +
> >>> +		virtio_for_each_try_unroll(i, 0, batch_num) {
> >>> +			dxp = &vq->vq_descx[idx + i];
> >>> +			dxp->cookie = (void *)cookie[total_num + i];
> >>> +
> >>> +			addr = VIRTIO_MBUF_ADDR(cookie[total_num + i], vq) +
> >>> +				RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size;
> >>> +			start_dp[idx + i].addr = addr;
> >>> +			start_dp[idx + i].len = cookie[total_num + i]->buf_len
> >>> +				- RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
> >>> +			if (total_num || i) {
> >>> +				virtqueue_store_flags_packed(&start_dp[idx + i],
> >>> +					flags, hw->weak_barriers);
> >>> +			}
> >>> +		}
> >>> +
> >>> +		vq->vq_avail_idx += batch_num;
> >>> +		if (vq->vq_avail_idx >= vq->vq_nentries) {
> >>> +			vq->vq_avail_idx -= vq->vq_nentries;
> >>> +			vq->vq_packed.cached_flags ^=
> >>> +				VRING_PACKED_DESC_F_AVAIL_USED;
> >>> +			flags = vq->vq_packed.cached_flags;
> >>> +		}
> >>> +		total_num += batch_num;
> >>> +	} while (total_num < num);
> >>> +
> >>> +	virtqueue_store_flags_packed(&start_dp[head_idx], head_flag,
> >>> +		hw->weak_barriers);
> >>> +	vq->vq_free_cnt = (uint16_t)(vq->vq_free_cnt - num);
> >>> +}
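
Note the ordering in the refill above: every descriptor except the head
gets its flags written inside the loop, and the head descriptor's flags
are stored last, once all addr/len fields are populated, so the device
never sees a partially initialized chain. The underlying contract is
release-style ordering, roughly (an illustrative sketch, not the actual
virtqueue_store_flags_packed() implementation):

	/* Descriptor payload must be globally visible before the flags
	 * that hand ownership to the device.
	 */
	static inline void
	store_flags_release(struct vring_packed_desc *dp, uint16_t flags)
	{
		rte_smp_wmb();     /* order addr/len/id stores first */
		dp->flags = flags; /* device may consume from here   */
	}
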
> >>> +
> >>> +uint16_t
> >>> +virtio_recv_pkts_packed_vec(void *rx_queue,
> >>> +	struct rte_mbuf **rx_pkts,
> >>> +	uint16_t nb_pkts)
> >>> +{
> >>> +	struct virtnet_rx *rxvq = rx_queue;
> >>> +	struct virtqueue *vq = rxvq->vq;
> >>> +	struct virtio_hw *hw = vq->hw;
> >>> +	uint16_t num, nb_rx = 0;
> >>> +	uint32_t nb_enqueued = 0;
> >>> +	uint16_t free_cnt = vq->vq_free_thresh;
> >>> +
> >>> +	if (unlikely(hw->started == 0))
> >>> +		return nb_rx;
> >>> +
> >>> +	num = RTE_MIN(VIRTIO_MBUF_BURST_SZ, nb_pkts);
> >>> +	if (likely(num > PACKED_BATCH_SIZE))
> >>> +		num = num - ((vq->vq_used_cons_idx + num) % PACKED_BATCH_SIZE);
> >>> +
> >>> +	while (num) {
> >>> +		if (!virtqueue_dequeue_batch_packed_vec(rxvq,
> >>> +					&rx_pkts[nb_rx])) {
> >>> +			nb_rx += PACKED_BATCH_SIZE;
> >>> +			num -= PACKED_BATCH_SIZE;
> >>> +			continue;
> >>> +		}
> >>> +		if (!virtqueue_dequeue_single_packed_vec(rxvq,
> >>> +					&rx_pkts[nb_rx])) {
> >>> +			nb_rx++;
> >>> +			num--;
> >>> +			continue;
> >>> +		}
> >>> +		break;
> >>> +	}
> >>> +
> >>> +	PMD_RX_LOG(DEBUG, "dequeue:%d", num);
> >>> +
> >>> +	rxvq->stats.packets += nb_rx;
> >>> +
> >>> +	if (likely(vq->vq_free_cnt >= free_cnt)) {
> >>> +		struct rte_mbuf *new_pkts[free_cnt];
> >>> +		if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts,
> >>> +						free_cnt) == 0)) {
> >>> +			virtio_recv_refill_packed_vec(rxvq, new_pkts,
> >>> +					free_cnt);
> >>> +			nb_enqueued += free_cnt;
> >>> +		} else {
> >>> +			struct rte_eth_dev *dev =
> >>> +				&rte_eth_devices[rxvq->port_id];
> >>> +			dev->data->rx_mbuf_alloc_failed += free_cnt;
> >>> +		}
> >>> +	}
> >>> +
> >>> +	if (likely(nb_enqueued)) {
> >>> +		if (unlikely(virtqueue_kick_prepare_packed(vq))) {
> >>> +			virtqueue_notify(vq);
> >>> +			PMD_RX_LOG(DEBUG, "Notified");
> >>> +		}
> >>> +	}
> >>> +
> >>> +	return nb_rx;
> >>> +}
> >>> diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
> >>> index 40ad786cc..c54698ad1 100644
> >>> --- a/drivers/net/virtio/virtio_user_ethdev.c
> >>> +++ b/drivers/net/virtio/virtio_user_ethdev.c
> >>> @@ -528,6 +528,7 @@ virtio_user_eth_dev_alloc(struct rte_vdev_device *vdev)
> >>> 	hw->use_msix = 1;
> >>> 	hw->modern = 0;
> >>> 	hw->use_vec_rx = 0;
> >>> +	hw->use_vec_tx = 0;
> >>> 	hw->use_inorder_rx = 0;
> >>> 	hw->use_inorder_tx = 0;
> >>> 	hw->virtio_user_dev = dev;
> >>> @@ -739,8 +740,19 @@ virtio_user_pmd_probe(struct rte_vdev_device *dev)
> >>> 		goto end;
> >>> 	}
> >>>
> >>> -	if (vectorized)
> >>> -		hw->use_vec_rx = 1;
> >>> +	if (vectorized) {
> >>> +		if (packed_vq) {
> >>> +#if defined(CC_AVX512_SUPPORT)
> >>> +			hw->use_vec_rx = 1;
> >>> +			hw->use_vec_tx = 1;
> >>> +#else
> >>> +			PMD_INIT_LOG(INFO,
> >>> +				"build environment does not support packed ring vectorized path");
> >>> +#endif
> >>> +		} else {
> >>> +			hw->use_vec_rx = 1;
> >>> +		}
> >>> +	}
> >>>
> >>> 	rte_eth_dev_probing_finish(eth_dev);
> >>> 	ret = 0;
> >>> diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
> >>> index ca1c10499..ce0340743 100644
> >>> --- a/drivers/net/virtio/virtqueue.h
> >>> +++ b/drivers/net/virtio/virtqueue.h
> >>> @@ -239,7 +239,8 @@ struct vq_desc_extra {
> >>> 	void *cookie;
> >>> 	uint16_t ndescs;
> >>> 	uint16_t next;
> >>> -};
> >>> +	uint8_t padding[4];
> >>> +} __rte_packed __rte_aligned(16);
> >>
> >> Can't this introduce a performance impact for the non-vectorized
> >> case? I am thinking of worse cache line utilization.
> >>
> >> For example, with a burst of 32 descriptors and 32B cache lines, before
> >> it would take 14 cache lines, after 16. So for each burst, one could face
> >> 2 extra cache misses.
> >>
> >> If you could run non-vectorized benchmarks with and without that patch,
> >> I would be grateful.
> >>
> >
> > Maxime,
> > Thanks for pointing it out; it does add extra cache misses in the datapath.
> > Its impact on performance is around 1% in the loopback case.
>
> Ok, thanks for doing the test. I'll try to run some PVP benchmarks
> on my side, because with IO loopback the cache pressure is much less
> important.
>
> > While the benefit of the vectorized path will be more than that number.
>
> Ok, but I disagree, for two reasons:
> 1. You have to keep in mind that non-vectorized is the default and
> encouraged mode to use. Indeed, it takes a lot of shortcuts like not
> checking header length (so no error stats), etc...
>

OK, I will keep the non-vectorized path the same as before.

> 2. It's like saying it's OK to degrade by 5% on $CPU_VENDOR_A because
> the gain is 20% on $CPU_VENDOR_B.
>
> If we see more degradation in a real-world scenario, you might want to
> consider using ifdefs to avoid adding the padding in the non-vectorized
> case, like you did to differentiate the Virtio PMD from the Virtio-user
> PMD in patch 7.
>

Maxime,
The performance difference is so slight that I ignored it; it looked like a
sampling error. It may not be suitable to add a new configuration option
for a setting that is only used inside the driver.
The virtio driver can check at initialization time whether the virtqueue
is using the vectorized path, and use the padded structure only if it is.
I have added some test code along these lines and performance has come
back. Since the initialization code has changed, it will need some time
for regression checking.

Regards,
Marvin

> Thanks,
> Maxime
>
> > Thanks,
> > Marvin
> >
> >> Reviewed-by: Maxime Coquelin
> >>
> >> Thanks,
> >> Maxime
> >