From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from dpdk.org (dpdk.org [92.243.14.124])
	by inbox.dpdk.org (Postfix) with ESMTP id AB70EA00C2;
	Fri, 24 Apr 2020 15:12:40 +0200 (CEST)
Received: from [92.243.14.124] (localhost [127.0.0.1])
	by dpdk.org (Postfix) with ESMTP id 45A741C226;
	Fri, 24 Apr 2020 15:12:39 +0200 (CEST)
Received: from mga17.intel.com (mga17.intel.com [192.55.52.151])
 by dpdk.org (Postfix) with ESMTP id AB2111C223
 for <dev@dpdk.org>; Fri, 24 Apr 2020 15:12:36 +0200 (CEST)
IronPort-SDR: OQ36Np+18VEqUAyY6BFhUMc+bzuni6BcQnw9JoDXD5LWqGSqZW8fVTzs8DwcLcVem3mA7jVl96
 rQYJEfdtoc9Q==
X-Amp-Result: SKIPPED(no attachment in message)
X-Amp-File-Uploaded: False
Received: from fmsmga008.fm.intel.com ([10.253.24.58])
 by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 24 Apr 2020 06:12:35 -0700
IronPort-SDR: AxkGp1V3/sj8Ffa9OG9RcKWodAk8kjCxtNkNQMdv4WRCr4h2dBn6x97BPLGLq2Ka+Z3Z0fX7k3
 eT66jHvivmhQ==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.73,311,1583222400"; d="scan'208";a="248061876"
Received: from fmsmsx106.amr.corp.intel.com ([10.18.124.204])
 by fmsmga008.fm.intel.com with ESMTP; 24 Apr 2020 06:12:35 -0700
Received: from fmsmsx115.amr.corp.intel.com (10.18.116.19) by
 FMSMSX106.amr.corp.intel.com (10.18.124.204) with Microsoft SMTP Server (TLS)
 id 14.3.439.0; Fri, 24 Apr 2020 06:12:32 -0700
Received: from shsmsx102.ccr.corp.intel.com (10.239.4.154) by
 fmsmsx115.amr.corp.intel.com (10.18.116.19) with Microsoft SMTP Server (TLS)
 id 14.3.439.0; Fri, 24 Apr 2020 06:12:32 -0700
Received: from shsmsx103.ccr.corp.intel.com ([169.254.4.146]) by
 shsmsx102.ccr.corp.intel.com ([169.254.2.138]) with mapi id 14.03.0439.000;
 Fri, 24 Apr 2020 21:12:28 +0800
From: "Liu, Yong" <yong.liu@intel.com>
To: Maxime Coquelin <maxime.coquelin@redhat.com>, "Ye, Xiaolong"
 <xiaolong.ye@intel.com>, "Wang, Zhihong" <zhihong.wang@intel.com>
CC: "dev@dpdk.org" <dev@dpdk.org>, "Van Haaren, Harry"
 <harry.van.haaren@intel.com>
Thread-Topic: [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path
Thread-Index: AQHWGdqj9niUthnVgU+NhrCwmxjwpaiHo6OAgACYrYA=
Date: Fri, 24 Apr 2020 13:12:28 +0000
Message-ID: <86228AFD5BCD8E4EBFD2B90117B5E81E63543212@SHSMSX103.ccr.corp.intel.com>
References: <20200313174230.74661-1-yong.liu@intel.com>
 <20200424092445.44693-1-yong.liu@intel.com>
 <20200424092445.44693-6-yong.liu@intel.com>
 <8d024d70-21b2-19cb-ee72-0bd036f0816f@redhat.com>
In-Reply-To: <8d024d70-21b2-19cb-ee72-0bd036f0816f@redhat.com>
Accept-Language: zh-CN, en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
dlp-product: dlpe-windows
dlp-version: 11.2.0.6
dlp-reaction: no-action
x-originating-ip: [10.239.127.40]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Subject: Re: [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed
	ring Rx path
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>



> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Friday, April 24, 2020 7:52 PM
> To: Liu, Yong <yong.liu@intel.com>; Ye, Xiaolong <xiaolong.ye@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com>
> Subject: Re: [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx pat=
h
>=20
>=20
>=20
> On 4/24/20 11:24 AM, Marvin Liu wrote:
> > Optimize packed ring Rx path with SIMD instructions. Solution of
> > optimization is pretty like vhost, is that split path into batch and
> > single functions. Batch function is further optimized by AVX512
> > instructions. Also pad desc extra structure to 16 bytes aligned, thus
> > four elements will be saved in one batch.
> >
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> > index c9edb84ee..102b1deab 100644
> > --- a/drivers/net/virtio/Makefile
> > +++ b/drivers/net/virtio/Makefile
> > @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM)
> $(CONFIG_RTE_ARCH_ARM64)),)
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=3D virtio_rxtx_simple_neon.c
> >  endif
> >
> > +ifneq ($(FORCE_DISABLE_AVX512), y)
> > +	CC_AVX512_SUPPORT=3D\
> > +	$(shell $(CC) -march=3Dnative -dM -E - </dev/null 2>&1 | \
> > +	sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
> > +	grep -q AVX512 && echo 1)
> > +endif
> > +
> > +ifeq ($(CC_AVX512_SUPPORT), 1)
> > +CFLAGS +=3D -DCC_AVX512_SUPPORT
> > +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=3D virtio_rxtx_packed_avx.c
> > +
> > +ifeq ($(RTE_TOOLCHAIN), gcc)
> > +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
> > +CFLAGS +=3D -DVIRTIO_GCC_UNROLL_PRAGMA
> > +endif
> > +endif
> > +
> > +ifeq ($(RTE_TOOLCHAIN), clang)
> > +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) -
> ge 37 && echo 1), 1)
> > +CFLAGS +=3D -DVIRTIO_CLANG_UNROLL_PRAGMA
> > +endif
> > +endif
> > +
> > +ifeq ($(RTE_TOOLCHAIN), icc)
> > +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1)
> > +CFLAGS +=3D -DVIRTIO_ICC_UNROLL_PRAGMA
> > +endif
> > +endif
> > +
> > +CFLAGS_virtio_rxtx_packed_avx.o +=3D -mavx512f -mavx512bw -mavx512vl
> > +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1)
> > +CFLAGS_virtio_rxtx_packed_avx.o +=3D -Wno-zero-length-bounds
> > +endif
> > +endif
> > +
> >  ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=3D virtio_user/vhost_user.c
> >  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=3D virtio_user/vhost_kernel.c
> > diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.=
build
> > index 15150eea1..8e68c3039 100644
> > --- a/drivers/net/virtio/meson.build
> > +++ b/drivers/net/virtio/meson.build
> > @@ -9,6 +9,20 @@ sources +=3D files('virtio_ethdev.c',
> >  deps +=3D ['kvargs', 'bus_pci']
> >
> >  if arch_subdir =3D=3D 'x86'
> > +	if '-mno-avx512f' not in machine_args
> > +		if cc.has_argument('-mavx512f') and cc.has_argument('-
> mavx512vl') and cc.has_argument('-mavx512bw')
> > +			cflags +=3D ['-mavx512f', '-mavx512bw', '-mavx512vl']
> > +			cflags +=3D ['-DCC_AVX512_SUPPORT']
> > +			if (toolchain =3D=3D 'gcc' and
> cc.version().version_compare('>=3D8.3.0'))
> > +				cflags +=3D '-DVHOST_GCC_UNROLL_PRAGMA'
> > +			elif (toolchain =3D=3D 'clang' and
> cc.version().version_compare('>=3D3.7.0'))
> > +				cflags +=3D '-
> DVHOST_CLANG_UNROLL_PRAGMA'
> > +			elif (toolchain =3D=3D 'icc' and
> cc.version().version_compare('>=3D16.0.0'))
> > +				cflags +=3D '-DVHOST_ICC_UNROLL_PRAGMA'
> > +			endif
> > +			sources +=3D files('virtio_rxtx_packed_avx.c')
> > +		endif
> > +	endif
> >  	sources +=3D files('virtio_rxtx_simple_sse.c')
> >  elif arch_subdir =3D=3D 'ppc'
> >  	sources +=3D files('virtio_rxtx_simple_altivec.c')
> > diff --git a/drivers/net/virtio/virtio_ethdev.h
> b/drivers/net/virtio/virtio_ethdev.h
> > index febaf17a8..5c112cac7 100644
> > --- a/drivers/net/virtio/virtio_ethdev.h
> > +++ b/drivers/net/virtio/virtio_ethdev.h
> > @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue,
> struct rte_mbuf **tx_pkts,
> >  uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkt=
s,
> >  		uint16_t nb_pkts);
> >
> > +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf
> **rx_pkts,
> > +		uint16_t nb_pkts);
> > +
> >  int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
> >
> >  void virtio_interrupt_handler(void *param);
> > diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virt=
io_rxtx.c
> > index 84f4cf946..c9b6e7844 100644
> > --- a/drivers/net/virtio/virtio_rxtx.c
> > +++ b/drivers/net/virtio/virtio_rxtx.c
> > @@ -2329,3 +2329,11 @@ virtio_xmit_pkts_inorder(void *tx_queue,
> >
> >  	return nb_tx;
> >  }
> > +
> > +__rte_weak uint16_t
> > +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused,
> > +			    struct rte_mbuf **rx_pkts __rte_unused,
> > +			    uint16_t nb_pkts __rte_unused)
> > +{
> > +	return 0;
> > +}
> > diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c
> b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> > new file mode 100644
> > index 000000000..8a7b459eb
> > --- /dev/null
> > +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c
> > @@ -0,0 +1,374 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2020 Intel Corporation
> > + */
> > +
> > +#include <stdint.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <errno.h>
> > +
> > +#include <rte_net.h>
> > +
> > +#include "virtio_logs.h"
> > +#include "virtio_ethdev.h"
> > +#include "virtio_pci.h"
> > +#include "virtqueue.h"
> > +
> > +#define BYTE_SIZE 8
> > +/* flag bits offset in packed ring desc higher 64bits */
> > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) =
- \
> > +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> > +
> > +#define PACKED_FLAGS_MASK ((0ULL |
> VRING_PACKED_DESC_F_AVAIL_USED) << \
> > +	FLAGS_BITS_OFFSET)
> > +
> > +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \
> > +	sizeof(struct vring_packed_desc))
> > +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1)
> > +
> > +#ifdef VIRTIO_GCC_UNROLL_PRAGMA
> > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unrol=
l 4")
> \
> > +	for (iter =3D val; iter < size; iter++)
> > +#endif
> > +
> > +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA
> > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4"=
) \
> > +	for (iter =3D val; iter < size; iter++)
> > +#endif
> > +
> > +#ifdef VIRTIO_ICC_UNROLL_PRAGMA
> > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4=
)") \
> > +	for (iter =3D val; iter < size; iter++)
> > +#endif
> > +
> > +#ifndef virtio_for_each_try_unroll
> > +#define virtio_for_each_try_unroll(iter, val, num) \
> > +	for (iter =3D val; iter < num; iter++)
> > +#endif
> > +
> > +static inline void
> > +virtio_update_batch_stats(struct virtnet_stats *stats,
> > +			  uint16_t pkt_len1,
> > +			  uint16_t pkt_len2,
> > +			  uint16_t pkt_len3,
> > +			  uint16_t pkt_len4)
> > +{
> > +	stats->bytes +=3D pkt_len1;
> > +	stats->bytes +=3D pkt_len2;
> > +	stats->bytes +=3D pkt_len3;
> > +	stats->bytes +=3D pkt_len4;
> > +}
> > +
> > +/* Optionally fill offload information in structure */
> > +static inline int
> > +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
> > +{
> > +	struct rte_net_hdr_lens hdr_lens;
> > +	uint32_t hdrlen, ptype;
> > +	int l4_supported =3D 0;
> > +
> > +	/* nothing to do */
> > +	if (hdr->flags =3D=3D 0)
> > +		return 0;
>=20
> IIUC, the only difference with the non-vectorized version is the GSO
> support removed here.
> gso_type being in the same cacheline as flags in virtio_net_hdr, I don't
> think checking the performance gain is worth the added maintainance
> effort due to code duplication.
>=20
> Please prove I'm wrong, otherwise please move virtio_rx_offload() in a
> header and use it here. Alternative if it really imapcts performance is
> to put all the shared code in a dedicated function that can be re-used
> by both implementations.
>=20

Maxime,
It won't be much performance difference between non-vectorized and vectoriz=
ed.
The reason to add special vectorized version is for skipping the handling o=
f garbage GSO packets.=20
As all descs have been handled in batch, it is needed to revert when found =
garbage packets.=20
That will introduce complicated logic in vectorized path.

Regards,
Marvin

> > +
> > +	/* GSO not support in vec path, skip check */
> > +	m->ol_flags |=3D PKT_RX_IP_CKSUM_UNKNOWN;
> > +
> > +	ptype =3D rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
> > +	m->packet_type =3D ptype;
> > +	if ((ptype & RTE_PTYPE_L4_MASK) =3D=3D RTE_PTYPE_L4_TCP ||
> > +	    (ptype & RTE_PTYPE_L4_MASK) =3D=3D RTE_PTYPE_L4_UDP ||
> > +	    (ptype & RTE_PTYPE_L4_MASK) =3D=3D RTE_PTYPE_L4_SCTP)
> > +		l4_supported =3D 1;
> > +
> > +	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> > +		hdrlen =3D hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
> > +		if (hdr->csum_start <=3D hdrlen && l4_supported) {
> > +			m->ol_flags |=3D PKT_RX_L4_CKSUM_NONE;
> > +		} else {
> > +			/* Unknown proto or tunnel, do sw cksum. We can
> assume
> > +			 * the cksum field is in the first segment since the
> > +			 * buffers we provided to the host are large enough.
> > +			 * In case of SCTP, this will be wrong since it's a CRC
> > +			 * but there's nothing we can do.
> > +			 */
> > +			uint16_t csum =3D 0, off;
> > +
> > +			rte_raw_cksum_mbuf(m, hdr->csum_start,
> > +				rte_pktmbuf_pkt_len(m) - hdr->csum_start,
> > +				&csum);
> > +			if (likely(csum !=3D 0xffff))
> > +				csum =3D ~csum;
> > +			off =3D hdr->csum_offset + hdr->csum_start;
> > +			if (rte_pktmbuf_data_len(m) >=3D off + 1)
> > +				*rte_pktmbuf_mtod_offset(m, uint16_t *,
> > +					off) =3D csum;
> > +		}
> > +	} else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID &&
> l4_supported) {
> > +		m->ol_flags |=3D PKT_RX_L4_CKSUM_GOOD;
> > +	}
> > +
> > +	return 0;
> > +}
>=20
> Otherwise, the patch looks okay to me.
>=20
> Thanks,
> Maxime