From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Liu, Yong"
To: "Hu, Jiayu", "dev@dpdk.org"
CC: "maxime.coquelin@redhat.com", "Ye, Xiaolong", "Wang, Zhihong"
Date: Wed, 18 Mar 2020 01:22:39 +0000
Subject: Re: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
 accelerate Tx operations

> -----Original Message-----
> From: Hu, Jiayu
> Sent: Tuesday, March 17, 2020 5:31 PM
> To: Liu, Yong; dev@dpdk.org
> Cc: maxime.coquelin@redhat.com; Ye, Xiaolong; Wang, Zhihong
> Subject: RE: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> accelerate Tx operations
>
> Hi Marvin,
>
> Thanks for the comments. Replies are inline.
>
> > -----Original Message-----
> > From: Liu, Yong
> > Sent: Tuesday, March 17, 2020 3:21 PM
> > To: Hu, Jiayu; dev@dpdk.org
> > Cc: maxime.coquelin@redhat.com; Ye, Xiaolong; Wang, Zhihong; Hu, Jiayu
> > Subject: RE: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> > accelerate Tx operations
> >
> > Hi Jiayu,
> > Some comments are inline.
> >
> > Thanks,
> > Marvin
> >
> > > -----Original Message-----
> > > From: dev On Behalf Of Jiayu Hu
> > > Sent: Tuesday, March 17, 2020 5:21 PM
> > > To: dev@dpdk.org
> > > Cc: maxime.coquelin@redhat.com; Ye, Xiaolong; Wang, Zhihong; Hu, Jiayu
> > > Subject: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> > > accelerate Tx operations
> > >
> > >
> > >  int vhost_logtype;
> > > @@ -30,8 +34,12 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> > >  #define ETH_VHOST_IOMMU_SUPPORT "iommu-support"
> > >  #define ETH_VHOST_POSTCOPY_SUPPORT "postcopy-support"
> > >  #define ETH_VHOST_VIRTIO_NET_F_HOST_TSO "tso"
> > > +#define ETH_VHOST_DMA_ARG "dmas"
> > >  #define VHOST_MAX_PKT_BURST 32
> > >
> > > +/* ring size of I/OAT */
> > > +#define IOAT_RING_SIZE 1024
> > > +
> >
> > Jiayu,
> > The configured I/OAT ring size is 1024 here, but I do not see an
> > in_flight or nr_batching size check in the enqueue function.
> > Is there any possibility that the IOAT ring gets exhausted?
>
> We will wait for the IOAT's copy completion when its ring is full.
> This guarantees that all enqueues to the IOAT can succeed.
>
> > > +struct dma_info_input {
> > > +	struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2];
> > > +	uint16_t nr;
> > > +};
> > > +
> > > +static inline int
> > > +open_dma(const char *key __rte_unused, const char *value,
> > > +	 void *extra_args)
> > > +{
> > > +	struct dma_info_input *dma_info = extra_args;
> > > +	char *input = strndup(value, strlen(value) + 1);
> > > +	char *addrs = input;
> > > +	char *ptrs[2];
> > > +	char *start, *end, *substr;
> > > +	int64_t qid, vring_id;
> > > +	struct rte_ioat_rawdev_config config;
> > > +	struct rte_rawdev_info info = { .dev_private = &config };
> > > +	char name[32];
> > > +	int dev_id;
> > > +	int ret = 0;
> > > +
> > > +	while (isblank(*addrs))
> > > +		addrs++;
> > > +	if (addrs == '\0') {
> > > +		VHOST_LOG(ERR, "No input DMA addresses\n");
> > > +		ret = -1;
> > > +		goto out;
> > > +	}
> > > +
> > > +	/* process DMA devices within bracket. */
> > > +	addrs++;
> > > +	substr = strtok(addrs, ";]");
> > > +	if (!substr) {
> > > +		VHOST_LOG(ERR, "No input DMA addresses\n");
> > > +		ret = -1;
> > > +		goto out;
> > > +	}
> > > +
> > > +	do {
> > > +		rte_strsplit(substr, strlen(substr), ptrs, 2, '@');
> > > +
> >
> > rte_strsplit() can fail; its return value needs to be checked.
>
> Thanks. Will check it later.
>
> >
> > > +		start = strstr(ptrs[0], "txq");
> > > +		if (start == NULL) {
> > > +			VHOST_LOG(ERR, "Illegal queue\n");
> > > +			ret = -1;
> > > +			goto out;
> > > +		}
> > > +
> > > +		start += 3;
> >
> > It's better not to use a hardcoded value here.
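For example, deriving the offset from the string literal itself avoids
the magic number. An untested sketch:

	#define TXQ_PREFIX	"txq"

	start = strstr(ptrs[0], TXQ_PREFIX);
	if (start == NULL) {
		VHOST_LOG(ERR, "Illegal queue\n");
		ret = -1;
		goto out;
	}
	/* skip the prefix without a hardcoded 3 */
	start += sizeof(TXQ_PREFIX) - 1;
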
> > > +		qid = strtol(start, &end, 0);
> > > +		if (end == start) {
> > > +			VHOST_LOG(ERR, "No input queue ID\n");
> > > +			ret = -1;
> > > +			goto out;
> > > +		}
> > > +
> > > +		vring_id = qid * 2 + VIRTIO_RXQ;
> > > +		if (rte_pci_addr_parse(ptrs[1],
> > > +				       &dma_info->dmas[vring_id].addr) < 0) {
> > > +			VHOST_LOG(ERR, "Invalid DMA address %s\n", ptrs[1]);
> > > +			ret = -1;
> > > +			goto out;
> > > +		}
> > > +
> > > +		rte_pci_device_name(&dma_info->dmas[vring_id].addr,
> > > +				    name, sizeof(name));
> > > +		dev_id = rte_rawdev_get_dev_id(name);
> > > +		if (dev_id == (uint16_t)(-ENODEV) ||
> > > +		    dev_id == (uint16_t)(-EINVAL)) {
> > > +			VHOST_LOG(ERR, "Cannot find device %s.\n", name);
> > > +			ret = -1;
> > > +			goto out;
> > > +		}
> > > +
> >
> > Multiple queues can't share one IOAT device; a check should be added
> > here, as that is not allowed.
>
> I just claim it in the doc. Will add the check later.
>
> > > +
> > > +/* notify front-end of enqueued packets */
> > > +static __rte_always_inline void
> > > +vhost_dma_vring_call(struct pmd_internal *dev,
> > > +		     struct dma_vring *dma_vr)
> > > +{
> > > +	vhost_vring_call_split(dev, dma_vr);
> > > +}
> > > +
> > > +int
> > > +free_dma_done(void *dev, void *dma_vr)
> > > +{
> > > +	uintptr_t flags[255], tmps[255];
> >
> > Please add a meaningful macro for 255; not sure why the limit is 255
> > rather than 256.
>
> The second parameter of rte_ioat_completed_copies() is uint8_t, so the
> max value can only be 255. I can replace it with UINT8_MAX later.
>
> >
> > > +	int dma_done, i;
> > > +	uint16_t used_idx;
> > > +	struct pmd_internal *device = dev;
> > > +	struct dma_vring *dma_vring = dma_vr;
> > > +
> > > +	dma_done = rte_ioat_completed_copies(dma_vring->dev_id, 255,
> > > +					     flags, tmps);
> > > +	if (unlikely(dma_done <= 0))
> > > +		return dma_done;
> > > +
> > > +	dma_vring->nr_inflight -= dma_done;
> >
> > I'm not sure whether the DMA engine returns completions in input
> > order; the mbuf free should happen after the index update is done.
>
> IMO, a pktmbuf can be freed once the IOAT no longer occupies it.
> We don't need to wait for the update of the used index.
>
> This is achieved by using the mbuf's refcnt. We increase the mbuf's
> refcnt by 1 once we submit its copy job to the IOAT. If it has N
> copies that are all offloaded to the IOAT, the refcnt will be
> increased by N. On completion of an IOAT copy, if it's a pktmbuf copy,
> we decrease its refcnt by rte_pktmbuf_free_seg(). When the value of
> refcnt reaches 1, which means all its copies are completed, the mbuf
> will be freed.
>

Thanks for the reply.
My concern is whether the IOAT preserves completion order with respect
to the input sequence. If the hardware can guarantee that, no more
questions here.
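For reference, my reading of the refcnt scheme described above, as a
minimal sketch (submit_pkt_copy() and on_copy_done() are illustrative
stand-ins, not functions from the patch):

	#include <rte_mbuf.h>

	/* sketch: take one extra reference per copy job offloaded to
	 * the IOAT, so the segment stays valid until the copy is done */
	static void
	submit_pkt_copy(struct rte_mbuf *pkt)
	{
		rte_mbuf_refcnt_update(pkt, 1);
		/* ... enqueue the copy to the IOAT, passing pkt back
		 * as the completion cookie ... */
	}

	/* sketch: on completion, drop one reference; the segment is
	 * actually freed once the last reference is gone */
	static void
	on_copy_done(struct rte_mbuf *pkt)
	{
		rte_pktmbuf_free_seg(pkt);
	}
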
> > > +	for (i = 0; i < dma_done; i++) {
> > > +		if ((uint64_t)flags[i] >= dma_vring->max_indices) {
> > > +			struct rte_mbuf *pkt = (struct rte_mbuf *)flags[i];
> > > +
> > > +			/**
> > > +			 * the DMA completes a packet copy job, we
> > > +			 * decrease the refcnt or free the mbuf segment.
> > > +			 */
> > > +			rte_pktmbuf_free_seg(pkt);
> > > +		} else {
> > > +			uint16_t id = flags[i];
> > > +
> > > +			/**
> > > +			 * the DMA completes updating index of the
> > > +			 * used ring.
> > > +			 */
> > > +			used_idx = dma_vring->indices[id].data;
> > > +			VHOST_LOG(DEBUG, "The DMA finishes updating index %u "
> > > +				  "for the used ring.\n", used_idx);
> > > +
> > > +			dma_vring->copy_done_used = used_idx;
> > > +			vhost_dma_vring_call(device, dma_vring);
> > > +			put_used_index(dma_vring->indices,
> > > +				       dma_vring->max_indices, id);
> > > +		}
> > > +	}
> > > +	return dma_done;
> > > +}
> > > +
> > > +static __rte_always_inline bool
> > > +rxvq_is_mergeable(struct pmd_internal *dev)
> > > +{
> > > +	return dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF);
> > > +}
> > > +
> >
> > I'm not sure whether the shadow used ring helps in the DMA
> > acceleration scenario. The vhost driver will wait until the DMA copy
> > is done, so this optimization of the CPU moves may not improve
> > overall performance and just adds convoluted code.
>
> For small copies, we still use the CPU, as the IOAT is less efficient
> on them. Therefore, I think we still need the CPU optimization here.
>

Thanks for the clarification, got it.

> > > +static __rte_always_inline void
> > > +flush_shadow_used_ring_split(struct pmd_internal *dev,
> > > +			     struct dma_vring *dma_vr)
> > > +{
> > > +	uint16_t used_idx = dma_vr->last_used_idx & (dma_vr->vr.size - 1);
> > > +
> > > +	if (used_idx + dma_vr->shadow_used_idx <= dma_vr->vr.size) {
> > > +		do_flush_shadow_used_ring_split(dma_vr, used_idx, 0,
> > > +						dma_vr->shadow_used_idx);
> > > +	} else {
> > > +		uint16_t size;
> > > +
> > > +		/* update used ring interval [used_idx, vr->size] */
> > > +		size = dma_vr->vr.size - used_idx;
> > > +		do_flush_shadow_used_ring_split(dma_vr, used_idx, 0, size);
> > > +
> > > +		/* update the left half used ring interval [0, left_size] */
> > > +		do_flush_shadow_used_ring_split(dma_vr, 0, size,
> > > +						dma_vr->shadow_used_idx -
> > > +						size);
> > > +	}
> > > +	dma_vr->last_used_idx += dma_vr->shadow_used_idx;
> > > +
> > > +	rte_smp_wmb();
> > > +
> > > +	if (dma_vr->nr_inflight > 0) {
> > > +		struct ring_index *index;
> > > +
> > > +		index = get_empty_index(dma_vr->indices,
> > > +					dma_vr->max_indices);
> > > +		index->data = dma_vr->last_used_idx;
> > > +		while (unlikely(rte_ioat_enqueue_copy(dma_vr->dev_id,
> > > +						      index->pa,
> > > +						      dma_vr->used_idx_hpa,
> > > +						      sizeof(uint16_t),
> > > +						      index->idx, 0, 0) ==
> > > +				0)) {
> > > +			int ret;
> > > +
> > > +			do {
> > > +				ret = dma_vr->dma_done_fn(dev, dma_vr);
> > > +			} while (ret <= 0);
> > > +		}
> > > +		dma_vr->nr_batching++;
> > > +		dma_vr->nr_inflight++;
> > > +	} else {
> > > +		/**
> > > +		 * we update index of used ring when all previous copy
> > > +		 * jobs are completed.
> > > +		 *
> > > +		 * When enabling DMA copy, if there are outstanding copy
> > > +		 * jobs of the DMA, to avoid the DMA overwriting the
> > > +		 * write of the CPU, the DMA is in charge of updating
> > > +		 * the index of used ring.
> > > +		 */
> >
> > According to the comments, this should be a DMA data move, but the
> > following code is a CPU data move. Is anything wrong here?
>
> The update of the used index is done by the CPU if there are no
> inflight IOAT copies; otherwise, it's done by the IOAT. The code in
> "else {}" is executed only when dma_vr->nr_inflight is 0, which means
> there are no inflight IOAT copies, so the CPU is in charge of updating
> the used ring's index.
>
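Thanks, that makes it clearer. So the dispatch is effectively the
following (my simplified reading, not the exact patch code;
enqueue_used_idx_copy() stands in for the IOAT enqueue of the index
copy shown above):

	if (dma_vr->nr_inflight > 0) {
		/* outstanding IOAT copies: let the IOAT also write the
		 * used index, so an earlier CPU store cannot be
		 * overwritten by a later DMA store */
		enqueue_used_idx_copy(dma_vr);
	} else {
		/* no inflight copies: the CPU updates the index directly */
		*(volatile uint16_t *)&dma_vr->vr.used->idx +=
			dma_vr->shadow_used_idx;
	}
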
Given that, it looks like the comment can be split into two parts, with
the second part describing how the index update behaves when there are
inflight copies.

> > > +		*(volatile uint16_t *)&dma_vr->vr.used->idx +=
> > > +			dma_vr->shadow_used_idx;
> > > +		dma_vr->copy_done_used += dma_vr->shadow_used_idx;
> > > +	}
> > > +
> > > +	dma_vr->shadow_used_idx = 0;
> > > +}
> > > +
> > > 2.7.4