From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from dpdk.org (dpdk.org [92.243.14.124])
	by inbox.dpdk.org (Postfix) with ESMTP id 887B1A0613
	for <public@inbox.dpdk.org>; Wed, 25 Sep 2019 11:36:25 +0200 (CEST)
Received: from [92.243.14.124] (localhost [127.0.0.1])
	by dpdk.org (Postfix) with ESMTP id 626CC288C;
	Wed, 25 Sep 2019 11:36:25 +0200 (CEST)
Received: from mga14.intel.com (mga14.intel.com [192.55.52.115])
 by dpdk.org (Postfix) with ESMTP id 2C85923D
 for <dev@dpdk.org>; Wed, 25 Sep 2019 11:36:24 +0200 (CEST)
Received: from orsmga007.jf.intel.com ([10.7.209.58])
 by fmsmga103.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;
 25 Sep 2019 02:36:23 -0700
Received: from fmsmsx104.amr.corp.intel.com ([10.18.124.202])
 by orsmga007.jf.intel.com with ESMTP; 25 Sep 2019 02:36:22 -0700
Received: from shsmsx107.ccr.corp.intel.com (10.239.4.96) by
 fmsmsx104.amr.corp.intel.com (10.18.124.202) with Microsoft SMTP Server (TLS)
 id 14.3.439.0; Wed, 25 Sep 2019 02:36:21 -0700
Received: from shsmsx103.ccr.corp.intel.com ([169.254.4.140]) by
 SHSMSX107.ccr.corp.intel.com ([169.254.9.89]) with mapi id 14.03.0439.000;
 Wed, 25 Sep 2019 17:36:20 +0800
From: "Liu, Yong" <yong.liu@intel.com>
To: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>
CC: "dev@dpdk.org" <dev@dpdk.org>, nd <nd@arm.com>, nd <nd@arm.com>,
 "maxime.coquelin@redhat.com" <maxime.coquelin@redhat.com>, "Bie, Tiwei"
 <tiwei.bie@intel.com>, "Wang, Zhihong" <zhihong.wang@intel.com>
Thread-Topic: [dpdk-dev] [PATCH v2 03/16] vhost: add burst enqueue function
 for	packed ring
Thread-Index: AQHVcf9tU8514n/ZukGYby7mJZvv0Kc6KDHwgAF1FICAAIdFwA==
Date: Wed, 25 Sep 2019 09:36:19 +0000
Message-ID: <86228AFD5BCD8E4EBFD2B90117B5E81E633B0F58@SHSMSX103.ccr.corp.intel.com>
References: <20190905161421.55981-2-yong.liu@intel.com>
 <20190919163643.24130-1-yong.liu@intel.com>
 <20190919163643.24130-4-yong.liu@intel.com>
 <VI1PR08MB537678465C5F9E3B033389D68F850@VI1PR08MB5376.eurprd08.prod.outlook.com>
 <86228AFD5BCD8E4EBFD2B90117B5E81E633AA324@SHSMSX103.ccr.corp.intel.com>
 <VI1PR08MB5376813B4A696E01AC42A54B8F870@VI1PR08MB5376.eurprd08.prod.outlook.com>
In-Reply-To: <VI1PR08MB5376813B4A696E01AC42A54B8F870@VI1PR08MB5376.eurprd08.prod.outlook.com>
Accept-Language: zh-CN, en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Subject: Re: [dpdk-dev] [PATCH v2 03/16] vhost: add burst enqueue function
 for	packed ring
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>



> -----Original Message-----
> From: Gavin Hu (Arm Technology China) [mailto:Gavin.Hu@arm.com]
> Sent: Wednesday, September 25, 2019 5:29 PM
> To: Liu, Yong <yong.liu@intel.com>; maxime.coquelin@redhat.com; Bie, Tiwei
> <tiwei.bie@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v2 03/16] vhost: add burst enqueue function
> for packed ring
>
> Hi Marvin,
>
> > -----Original Message-----
> > From: Liu, Yong <yong.liu@intel.com>
> > Sent: Tuesday, September 24, 2019 11:31 AM
> > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> > maxime.coquelin@redhat.com; Bie, Tiwei <tiwei.bie@intel.com>; Wang,
> > Zhihong <zhihong.wang@intel.com>
> > Cc: dev@dpdk.org; nd <nd@arm.com>
> > Subject: RE: [dpdk-dev] [PATCH v2 03/16] vhost: add burst enqueue
> function
> > for packed ring
> >
> > Thanks, Gavin. My comments are inline.
> >
> > > -----Original Message-----
> > > From: Gavin Hu (Arm Technology China) [mailto:Gavin.Hu@arm.com]
> > > Sent: Monday, September 23, 2019 7:09 PM
> > > To: Liu, Yong <yong.liu@intel.com>; maxime.coquelin@redhat.com; Bie,
> > Tiwei
> > > <tiwei.bie@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
> > > Cc: dev@dpdk.org; nd <nd@arm.com>
> > > Subject: RE: [dpdk-dev] [PATCH v2 03/16] vhost: add burst enqueue
> > function
> > > for packed ring
> > >
> > > Hi Marvin,
> > >
> > > Is it possible to vectorize the processing?
> > > Other comments inline:
> > > /Gavin
> >
> > Gavin,
> > According to our experiments, vectorizing only some parts of the
> > enqueue/dequeue functions does not benefit performance.
> > Functions like vhost_iova_to_vva and virtio_enqueue_offload can't be
> > easily vectorized, as they are full of conditional branches.
> >
> > Thanks,
> > Marvin
> >
> > > > -----Original Message-----
> > > > From: dev <dev-bounces@dpdk.org> On Behalf Of Marvin Liu
> > > > Sent: Friday, September 20, 2019 12:37 AM
> > > > To: maxime.coquelin@redhat.com; tiwei.bie@intel.com;
> > > > zhihong.wang@intel.com
> > > > Cc: dev@dpdk.org; Marvin Liu <yong.liu@intel.com>
> > > > Subject: [dpdk-dev] [PATCH v2 03/16] vhost: add burst enqueue
> function
> > > for
> > > > packed ring
> > > >
> > > > The burst enqueue function first checks whether the descriptors are
> > > > cache aligned, and checks the other prerequisites up front. The
> > > > burst enqueue function does not support chained mbufs; those are
> > > > handled by the single packet enqueue function.
> > > >
> > > > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> > > >
> > > > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > > > index 5074226f0..67889c80a 100644
> > > > --- a/lib/librte_vhost/vhost.h
> > > > +++ b/lib/librte_vhost/vhost.h
> > > > @@ -39,6 +39,9 @@
> > > >
> > > >  #define VHOST_LOG_CACHE_NR 32
> > > >
> > > > +#define PACKED_DESCS_BURST (RTE_CACHE_LINE_SIZE / \
> > > > +			    sizeof(struct vring_packed_desc))
> > > > +
> > > >  #ifdef SUPPORT_GCC_UNROLL_PRAGMA
> > > >  #define PRAGMA_PARAM "GCC unroll 4"
> > > >  #endif
> > > > @@ -57,6 +60,8 @@
> > > >  #define UNROLL_PRAGMA(param) do {} while(0);
> > > >  #endif
> > > >
> > > > +#define PACKED_BURST_MASK (PACKED_DESCS_BURST - 1)
> > > > +
> > > >  /**
> > > >   * Structure contains buffer address, length and descriptor index
> > > >   * from vring to do scatter RX.
> > > > diff --git a/lib/librte_vhost/virtio_net.c
> > > b/lib/librte_vhost/virtio_net.c
> > > > index 2b5c47145..c664b27c5 100644
> > > > --- a/lib/librte_vhost/virtio_net.c
> > > > +++ b/lib/librte_vhost/virtio_net.c
> > > > @@ -895,6 +895,84 @@ virtio_dev_rx_split(struct virtio_net *dev,
> > struct
> > > > vhost_virtqueue *vq,
> > > >  	return pkt_idx;
> > > >  }
> > > >
> > > > +static __rte_unused __rte_always_inline int
> > > I remember "__rte_always_inline" should be on its own separate line
> > > at the start, otherwise you will get a style issue.
> > > /Gavin
> > > > +virtio_dev_rx_burst_packed(struct virtio_net *dev, struct
> > > vhost_virtqueue
> > > > *vq,
> > > > +	 struct rte_mbuf **pkts)
> > > > +{
> > > > +	bool wrap_counter = vq->avail_wrap_counter;
> > > > +	struct vring_packed_desc *descs = vq->desc_packed;
> > > > +	uint16_t avail_idx = vq->last_avail_idx;
> > > > +
> > > > +	uint64_t desc_addrs[PACKED_DESCS_BURST];
> > > > +	struct virtio_net_hdr_mrg_rxbuf *hdrs[PACKED_DESCS_BURST];
> > > > +	uint32_t buf_offset = dev->vhost_hlen;
> > > > +	uint64_t lens[PACKED_DESCS_BURST];
> > > > +
> > > > +	uint16_t i;
> > > > +
> > > > +	if (unlikely(avail_idx & PACKED_BURST_MASK))
> > > > +		return -1;
> > > > +
> > > > +	UNROLL_PRAGMA(PRAGMA_PARAM)
> > > > +	for (i = 0; i < PACKED_DESCS_BURST; i++) {
> > > > +		if (unlikely(pkts[i]->next != NULL))
> > > > +			return -1;
> > > > +		if (unlikely(!desc_is_avail(&descs[avail_idx + i],
> > > > +					    wrap_counter)))
> > > > +			return -1;
> > > > +	}
> > > > +
> > > > +	rte_smp_rmb();
> > > > +
> > > > +	UNROLL_PRAGMA(PRAGMA_PARAM)
> > > > +	for (i = 0; i < PACKED_DESCS_BURST; i++)
> > > > +		lens[i] = descs[avail_idx + i].len;
> > > Looks like the code is a strong candidate for vectorization.
> > > > +
> > > > +	UNROLL_PRAGMA(PRAGMA_PARAM)
> > > > +	for (i = 0; i < PACKED_DESCS_BURST; i++) {
> > > > +		if (unlikely(pkts[i]->pkt_len > (lens[i] - buf_offset)))
> > > > +			return -1;
> > > > +	}
> > > > +
> > > > +	UNROLL_PRAGMA(PRAGMA_PARAM)
> > > > +	for (i = 0; i < PACKED_DESCS_BURST; i++)
> > > > +		desc_addrs[i] = vhost_iova_to_vva(dev, vq,
> > > > +						 descs[avail_idx + i].addr,
> > > > +						 &lens[i],
> > > > +						 VHOST_ACCESS_RW);
> > > > +	UNROLL_PRAGMA(PRAGMA_PARAM)
> > > > +	for (i = 0; i < PACKED_DESCS_BURST; i++) {
> > > > +		if (unlikely(lens[i] != descs[avail_idx + i].len))
> > > > +			return -1;
> > > > +	}
> > > > +
> > > > +	UNROLL_PRAGMA(PRAGMA_PARAM)
> > > > +	for (i = 0; i < PACKED_DESCS_BURST; i++) {
> > > > +		rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > > > +		hdrs[i] = (struct virtio_net_hdr_mrg_rxbuf *)desc_addrs[i];
> > > > +		lens[i] = pkts[i]->pkt_len + dev->vhost_hlen;
> > > > +	}
> > > > +
> > > > +	UNROLL_PRAGMA(PRAGMA_PARAM)
> > > > +	for (i = 0; i < PACKED_DESCS_BURST; i++)
> > > > +		virtio_enqueue_offload(pkts[i], &hdrs[i]->hdr);
> > > > +
> > > A store barrier here is missing, last_avail_idx may be observed before
> > > the above enqueue completion on weak memory order architectures.
> > > For x86, a compiler barrier is also required.
> > >
> > Thanks a lot for pointing this out. I guess you mean that a barrier is
> > needed between the memcpy and the enqueue.
> > last_avail_idx is just a local variable, no barrier is needed to
> > protect it.
>
> Sorry, I was wrong. Yes, last_avail_idx is a local variable (or we may
> call it metadata).
> Copying the headers and payload does not need to be ordered; we just need
> to ensure all of these happen before updating the idx, which is the single
> synchronization point.
> In one word, no barriers are required here.
> /Gavin
> >

NP :) Nothing changed here in the V3 patch. Thanks for the kind review.

> > > > +	vq->last_avail_idx += PACKED_DESCS_BURST;
> > > > +	if (vq->last_avail_idx >= vq->size) {
> > > > +		vq->last_avail_idx -= vq->size;
> > > > +		vq->avail_wrap_counter ^= 1;
> > > > +	}
> > > > +
> > > > +	UNROLL_PRAGMA(PRAGMA_PARAM)
> > > > +	for (i = 0; i < PACKED_DESCS_BURST; i++) {
> > > > +		rte_memcpy((void *)(uintptr_t)(desc_addrs[i] + buf_offset),
> > > > +			   rte_pktmbuf_mtod_offset(pkts[i], void *, 0),
> > > > +			   pkts[i]->pkt_len);
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > >  static __rte_unused int16_t
> > > >  virtio_dev_rx_single_packed(struct virtio_net *dev, struct
> > > vhost_virtqueue
> > > > *vq,
> > > >  	struct rte_mbuf *pkt)
> > > > --
> > > > 2.17.1