From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id AB70EA00C2; Fri, 24 Apr 2020 15:12:40 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 45A741C226; Fri, 24 Apr 2020 15:12:39 +0200 (CEST) Received: from mga17.intel.com (mga17.intel.com [192.55.52.151]) by dpdk.org (Postfix) with ESMTP id AB2111C223 for ; Fri, 24 Apr 2020 15:12:36 +0200 (CEST) IronPort-SDR: OQ36Np+18VEqUAyY6BFhUMc+bzuni6BcQnw9JoDXD5LWqGSqZW8fVTzs8DwcLcVem3mA7jVl96 rQYJEfdtoc9Q== X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 24 Apr 2020 06:12:35 -0700 IronPort-SDR: AxkGp1V3/sj8Ffa9OG9RcKWodAk8kjCxtNkNQMdv4WRCr4h2dBn6x97BPLGLq2Ka+Z3Z0fX7k3 eT66jHvivmhQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.73,311,1583222400"; d="scan'208";a="248061876" Received: from fmsmsx106.amr.corp.intel.com ([10.18.124.204]) by fmsmga008.fm.intel.com with ESMTP; 24 Apr 2020 06:12:35 -0700 Received: from fmsmsx115.amr.corp.intel.com (10.18.116.19) by FMSMSX106.amr.corp.intel.com (10.18.124.204) with Microsoft SMTP Server (TLS) id 14.3.439.0; Fri, 24 Apr 2020 06:12:32 -0700 Received: from shsmsx102.ccr.corp.intel.com (10.239.4.154) by fmsmsx115.amr.corp.intel.com (10.18.116.19) with Microsoft SMTP Server (TLS) id 14.3.439.0; Fri, 24 Apr 2020 06:12:32 -0700 Received: from shsmsx103.ccr.corp.intel.com ([169.254.4.146]) by shsmsx102.ccr.corp.intel.com ([169.254.2.138]) with mapi id 14.03.0439.000; Fri, 24 Apr 2020 21:12:28 +0800 From: "Liu, Yong" To: Maxime Coquelin , "Ye, Xiaolong" , "Wang, Zhihong" CC: "dev@dpdk.org" , "Van Haaren, Harry" Thread-Topic: [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path Thread-Index: AQHWGdqj9niUthnVgU+NhrCwmxjwpaiHo6OAgACYrYA= Date: Fri, 24 Apr 2020 13:12:28 +0000 Message-ID: <86228AFD5BCD8E4EBFD2B90117B5E81E63543212@SHSMSX103.ccr.corp.intel.com> References: <20200313174230.74661-1-yong.liu@intel.com> <20200424092445.44693-1-yong.liu@intel.com> <20200424092445.44693-6-yong.liu@intel.com> <8d024d70-21b2-19cb-ee72-0bd036f0816f@redhat.com> In-Reply-To: <8d024d70-21b2-19cb-ee72-0bd036f0816f@redhat.com> Accept-Language: zh-CN, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: dlp-product: dlpe-windows dlp-version: 11.2.0.6 dlp-reaction: no-action x-originating-ip: [10.239.127.40] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Subject: Re: [dpdk-dev] [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx path X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" > -----Original Message----- > From: Maxime Coquelin > Sent: Friday, April 24, 2020 7:52 PM > To: Liu, Yong ; Ye, Xiaolong ; > Wang, Zhihong > Cc: dev@dpdk.org; Van Haaren, Harry > Subject: Re: [PATCH v9 5/9] net/virtio: add vectorized packed ring Rx pat= h >=20 >=20 >=20 > On 4/24/20 11:24 AM, Marvin Liu wrote: > > Optimize packed ring Rx path with SIMD instructions. Solution of > > optimization is pretty like vhost, is that split path into batch and > > single functions. Batch function is further optimized by AVX512 > > instructions. Also pad desc extra structure to 16 bytes aligned, thus > > four elements will be saved in one batch. > > > > Signed-off-by: Marvin Liu > > > > diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile > > index c9edb84ee..102b1deab 100644 > > --- a/drivers/net/virtio/Makefile > > +++ b/drivers/net/virtio/Makefile > > @@ -36,6 +36,41 @@ else ifneq ($(filter y,$(CONFIG_RTE_ARCH_ARM) > $(CONFIG_RTE_ARCH_ARM64)),) > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=3D virtio_rxtx_simple_neon.c > > endif > > > > +ifneq ($(FORCE_DISABLE_AVX512), y) > > + CC_AVX512_SUPPORT=3D\ > > + $(shell $(CC) -march=3Dnative -dM -E - &1 | \ > > + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \ > > + grep -q AVX512 && echo 1) > > +endif > > + > > +ifeq ($(CC_AVX512_SUPPORT), 1) > > +CFLAGS +=3D -DCC_AVX512_SUPPORT > > +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=3D virtio_rxtx_packed_avx.c > > + > > +ifeq ($(RTE_TOOLCHAIN), gcc) > > +ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1) > > +CFLAGS +=3D -DVIRTIO_GCC_UNROLL_PRAGMA > > +endif > > +endif > > + > > +ifeq ($(RTE_TOOLCHAIN), clang) > > +ifeq ($(shell test $(CLANG_MAJOR_VERSION)$(CLANG_MINOR_VERSION) - > ge 37 && echo 1), 1) > > +CFLAGS +=3D -DVIRTIO_CLANG_UNROLL_PRAGMA > > +endif > > +endif > > + > > +ifeq ($(RTE_TOOLCHAIN), icc) > > +ifeq ($(shell test $(ICC_MAJOR_VERSION) -ge 16 && echo 1), 1) > > +CFLAGS +=3D -DVIRTIO_ICC_UNROLL_PRAGMA > > +endif > > +endif > > + > > +CFLAGS_virtio_rxtx_packed_avx.o +=3D -mavx512f -mavx512bw -mavx512vl > > +ifeq ($(shell test $(GCC_VERSION) -ge 100 && echo 1), 1) > > +CFLAGS_virtio_rxtx_packed_avx.o +=3D -Wno-zero-length-bounds > > +endif > > +endif > > + > > ifeq ($(CONFIG_RTE_VIRTIO_USER),y) > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=3D virtio_user/vhost_user.c > > SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) +=3D virtio_user/vhost_kernel.c > > diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.= build > > index 15150eea1..8e68c3039 100644 > > --- a/drivers/net/virtio/meson.build > > +++ b/drivers/net/virtio/meson.build > > @@ -9,6 +9,20 @@ sources +=3D files('virtio_ethdev.c', > > deps +=3D ['kvargs', 'bus_pci'] > > > > if arch_subdir =3D=3D 'x86' > > + if '-mno-avx512f' not in machine_args > > + if cc.has_argument('-mavx512f') and cc.has_argument('- > mavx512vl') and cc.has_argument('-mavx512bw') > > + cflags +=3D ['-mavx512f', '-mavx512bw', '-mavx512vl'] > > + cflags +=3D ['-DCC_AVX512_SUPPORT'] > > + if (toolchain =3D=3D 'gcc' and > cc.version().version_compare('>=3D8.3.0')) > > + cflags +=3D '-DVHOST_GCC_UNROLL_PRAGMA' > > + elif (toolchain =3D=3D 'clang' and > cc.version().version_compare('>=3D3.7.0')) > > + cflags +=3D '- > DVHOST_CLANG_UNROLL_PRAGMA' > > + elif (toolchain =3D=3D 'icc' and > cc.version().version_compare('>=3D16.0.0')) > > + cflags +=3D '-DVHOST_ICC_UNROLL_PRAGMA' > > + endif > > + sources +=3D files('virtio_rxtx_packed_avx.c') > > + endif > > + endif > > sources +=3D files('virtio_rxtx_simple_sse.c') > > elif arch_subdir =3D=3D 'ppc' > > sources +=3D files('virtio_rxtx_simple_altivec.c') > > diff --git a/drivers/net/virtio/virtio_ethdev.h > b/drivers/net/virtio/virtio_ethdev.h > > index febaf17a8..5c112cac7 100644 > > --- a/drivers/net/virtio/virtio_ethdev.h > > +++ b/drivers/net/virtio/virtio_ethdev.h > > @@ -105,6 +105,9 @@ uint16_t virtio_xmit_pkts_inorder(void *tx_queue, > struct rte_mbuf **tx_pkts, > > uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkt= s, > > uint16_t nb_pkts); > > > > +uint16_t virtio_recv_pkts_packed_vec(void *rx_queue, struct rte_mbuf > **rx_pkts, > > + uint16_t nb_pkts); > > + > > int eth_virtio_dev_init(struct rte_eth_dev *eth_dev); > > > > void virtio_interrupt_handler(void *param); > > diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virt= io_rxtx.c > > index 84f4cf946..c9b6e7844 100644 > > --- a/drivers/net/virtio/virtio_rxtx.c > > +++ b/drivers/net/virtio/virtio_rxtx.c > > @@ -2329,3 +2329,11 @@ virtio_xmit_pkts_inorder(void *tx_queue, > > > > return nb_tx; > > } > > + > > +__rte_weak uint16_t > > +virtio_recv_pkts_packed_vec(void *rx_queue __rte_unused, > > + struct rte_mbuf **rx_pkts __rte_unused, > > + uint16_t nb_pkts __rte_unused) > > +{ > > + return 0; > > +} > > diff --git a/drivers/net/virtio/virtio_rxtx_packed_avx.c > b/drivers/net/virtio/virtio_rxtx_packed_avx.c > > new file mode 100644 > > index 000000000..8a7b459eb > > --- /dev/null > > +++ b/drivers/net/virtio/virtio_rxtx_packed_avx.c > > @@ -0,0 +1,374 @@ > > +/* SPDX-License-Identifier: BSD-3-Clause > > + * Copyright(c) 2010-2020 Intel Corporation > > + */ > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +#include > > + > > +#include "virtio_logs.h" > > +#include "virtio_ethdev.h" > > +#include "virtio_pci.h" > > +#include "virtqueue.h" > > + > > +#define BYTE_SIZE 8 > > +/* flag bits offset in packed ring desc higher 64bits */ > > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) = - \ > > + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE) > > + > > +#define PACKED_FLAGS_MASK ((0ULL | > VRING_PACKED_DESC_F_AVAIL_USED) << \ > > + FLAGS_BITS_OFFSET) > > + > > +#define PACKED_BATCH_SIZE (RTE_CACHE_LINE_SIZE / \ > > + sizeof(struct vring_packed_desc)) > > +#define PACKED_BATCH_MASK (PACKED_BATCH_SIZE - 1) > > + > > +#ifdef VIRTIO_GCC_UNROLL_PRAGMA > > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("GCC unrol= l 4") > \ > > + for (iter =3D val; iter < size; iter++) > > +#endif > > + > > +#ifdef VIRTIO_CLANG_UNROLL_PRAGMA > > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll 4"= ) \ > > + for (iter =3D val; iter < size; iter++) > > +#endif > > + > > +#ifdef VIRTIO_ICC_UNROLL_PRAGMA > > +#define virtio_for_each_try_unroll(iter, val, size) _Pragma("unroll (4= )") \ > > + for (iter =3D val; iter < size; iter++) > > +#endif > > + > > +#ifndef virtio_for_each_try_unroll > > +#define virtio_for_each_try_unroll(iter, val, num) \ > > + for (iter =3D val; iter < num; iter++) > > +#endif > > + > > +static inline void > > +virtio_update_batch_stats(struct virtnet_stats *stats, > > + uint16_t pkt_len1, > > + uint16_t pkt_len2, > > + uint16_t pkt_len3, > > + uint16_t pkt_len4) > > +{ > > + stats->bytes +=3D pkt_len1; > > + stats->bytes +=3D pkt_len2; > > + stats->bytes +=3D pkt_len3; > > + stats->bytes +=3D pkt_len4; > > +} > > + > > +/* Optionally fill offload information in structure */ > > +static inline int > > +virtio_vec_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr) > > +{ > > + struct rte_net_hdr_lens hdr_lens; > > + uint32_t hdrlen, ptype; > > + int l4_supported =3D 0; > > + > > + /* nothing to do */ > > + if (hdr->flags =3D=3D 0) > > + return 0; >=20 > IIUC, the only difference with the non-vectorized version is the GSO > support removed here. > gso_type being in the same cacheline as flags in virtio_net_hdr, I don't > think checking the performance gain is worth the added maintainance > effort due to code duplication. >=20 > Please prove I'm wrong, otherwise please move virtio_rx_offload() in a > header and use it here. Alternative if it really imapcts performance is > to put all the shared code in a dedicated function that can be re-used > by both implementations. >=20 Maxime, It won't be much performance difference between non-vectorized and vectoriz= ed. The reason to add special vectorized version is for skipping the handling o= f garbage GSO packets.=20 As all descs have been handled in batch, it is needed to revert when found = garbage packets.=20 That will introduce complicated logic in vectorized path. Regards, Marvin > > + > > + /* GSO not support in vec path, skip check */ > > + m->ol_flags |=3D PKT_RX_IP_CKSUM_UNKNOWN; > > + > > + ptype =3D rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK); > > + m->packet_type =3D ptype; > > + if ((ptype & RTE_PTYPE_L4_MASK) =3D=3D RTE_PTYPE_L4_TCP || > > + (ptype & RTE_PTYPE_L4_MASK) =3D=3D RTE_PTYPE_L4_UDP || > > + (ptype & RTE_PTYPE_L4_MASK) =3D=3D RTE_PTYPE_L4_SCTP) > > + l4_supported =3D 1; > > + > > + if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { > > + hdrlen =3D hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len; > > + if (hdr->csum_start <=3D hdrlen && l4_supported) { > > + m->ol_flags |=3D PKT_RX_L4_CKSUM_NONE; > > + } else { > > + /* Unknown proto or tunnel, do sw cksum. We can > assume > > + * the cksum field is in the first segment since the > > + * buffers we provided to the host are large enough. > > + * In case of SCTP, this will be wrong since it's a CRC > > + * but there's nothing we can do. > > + */ > > + uint16_t csum =3D 0, off; > > + > > + rte_raw_cksum_mbuf(m, hdr->csum_start, > > + rte_pktmbuf_pkt_len(m) - hdr->csum_start, > > + &csum); > > + if (likely(csum !=3D 0xffff)) > > + csum =3D ~csum; > > + off =3D hdr->csum_offset + hdr->csum_start; > > + if (rte_pktmbuf_data_len(m) >=3D off + 1) > > + *rte_pktmbuf_mtod_offset(m, uint16_t *, > > + off) =3D csum; > > + } > > + } else if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID && > l4_supported) { > > + m->ol_flags |=3D PKT_RX_L4_CKSUM_GOOD; > > + } > > + > > + return 0; > > +} >=20 > Otherwise, the patch looks okay to me. >=20 > Thanks, > Maxime