From: "Liu, Yong" <yong.liu@intel.com>
To: "Hu, Jiayu" <jiayu.hu@intel.com>, "dev@dpdk.org" <dev@dpdk.org>
Cc: "maxime.coquelin@redhat.com" <maxime.coquelin@redhat.com>,
"Ye, Xiaolong" <xiaolong.ye@intel.com>,
"Wang, Zhihong" <zhihong.wang@intel.com>
Subject: Re: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to accelerate Tx operations
Date: Wed, 18 Mar 2020 01:22:39 +0000 [thread overview]
Message-ID: <86228AFD5BCD8E4EBFD2B90117B5E81E6350B775@SHSMSX103.ccr.corp.intel.com> (raw)
In-Reply-To: <2484dc0f250f4ddabfecd6f7c9ed8f9b@intel.com>
> -----Original Message-----
> From: Hu, Jiayu <jiayu.hu@intel.com>
> Sent: Tuesday, March 17, 2020 5:31 PM
> To: Liu, Yong <yong.liu@intel.com>; dev@dpdk.org
> Cc: maxime.coquelin@redhat.com; Ye, Xiaolong <xiaolong.ye@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> accelerate Tx operations
>
> Hi Marvin,
>
> Thanks for the comments. Replies are inline.
>
> > -----Original Message-----
> > From: Liu, Yong <yong.liu@intel.com>
> > Sent: Tuesday, March 17, 2020 3:21 PM
> > To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org
> > Cc: maxime.coquelin@redhat.com; Ye, Xiaolong <xiaolong.ye@intel.com>;
> > Wang, Zhihong <zhihong.wang@intel.com>; Hu, Jiayu
> <jiayu.hu@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> > accelerate Tx operations
> >
> > Hi Jiayu,
> > Some comments are inline.
> >
> > Thanks,
> > Marvin
> >
> > > -----Original Message-----
> > > From: dev <dev-bounces@dpdk.org> On Behalf Of Jiayu Hu
> > > Sent: Tuesday, March 17, 2020 5:21 PM
> > > To: dev@dpdk.org
> > > Cc: maxime.coquelin@redhat.com; Ye, Xiaolong
> <xiaolong.ye@intel.com>;
> > > Wang, Zhihong <zhihong.wang@intel.com>; Hu, Jiayu
> <jiayu.hu@intel.com>
> > > Subject: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> > > accelerate Tx operations
> > >
> > >
> > > int vhost_logtype;
> > > @@ -30,8 +34,12 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> > > #define ETH_VHOST_IOMMU_SUPPORT "iommu-support"
> > > #define ETH_VHOST_POSTCOPY_SUPPORT "postcopy-support"
> > > #define ETH_VHOST_VIRTIO_NET_F_HOST_TSO "tso"
> > > +#define ETH_VHOST_DMA_ARG "dmas"
> > > #define VHOST_MAX_PKT_BURST 32
> > >
> > > +/* ring size of I/OAT */
> > > +#define IOAT_RING_SIZE 1024
> > > +
> >
> > Jiayu,
> > The configured I/OAT ring size is 1024 here, but I do not see an in_flight
> > or nr_batching size check in the enqueue function.
> > Is there any possibility that the IOAT ring gets exhausted?
>
> We will wait for the IOAT's copy completion when its ring is full.
> This guarantees that all enqueues to the IOAT can succeed.
>
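
OK. Just to confirm my understanding, a minimal sketch of that drain-and-retry
path (free_dma_done() is the completion handler from this patch; dev, dma_vr
and the copy parameters are assumed to be in scope):

    /* Sketch only: if the IOAT descriptor ring is full, keep harvesting
     * completions until the enqueue succeeds.
     */
    while (rte_ioat_enqueue_copy(dma_vr->dev_id, src_pa, dst_pa, len,
                                 src_hdl, 0, 0) == 0) {
            int ret;

            do {
                    ret = free_dma_done(dev, dma_vr);
            } while (ret <= 0);
    }
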
> > > +struct dma_info_input {
> > > + struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2];
> > > + uint16_t nr;
> > > +};
> > > +
> > > +static inline int
> > > +open_dma(const char *key __rte_unused, const char *value, void
> > > *extra_args)
> > > +{
> > > + struct dma_info_input *dma_info = extra_args;
> > > + char *input = strndup(value, strlen(value) + 1);
> > > + char *addrs = input;
> > > + char *ptrs[2];
> > > + char *start, *end, *substr;
> > > + int64_t qid, vring_id;
> > > + struct rte_ioat_rawdev_config config;
> > > + struct rte_rawdev_info info = { .dev_private = &config };
> > > + char name[32];
> > > + int dev_id;
> > > + int ret = 0;
> > > +
> > > + while (isblank(*addrs))
> > > + addrs++;
> > > + if (addrs == '\0') {
> > > + VHOST_LOG(ERR, "No input DMA addresses\n");
> > > + ret = -1;
> > > + goto out;
> > > + }
> > > +
> > > + /* process DMA devices within bracket. */
> > > + addrs++;
> > > + substr = strtok(addrs, ";]");
> > > + if (!substr) {
> > > + VHOST_LOG(ERR, "No input DMA addresse\n");
> > > + ret = -1;
> > > + goto out;
> > > + }
> > > +
> > > + do {
> > > + rte_strsplit(substr, strlen(substr), ptrs, 2, '@');
> > > +
> > Function rte_strsplit() can fail. Its return value needs to be checked.
>
> Thanks. Will check it later.
>
> >
> > > + start = strstr(ptrs[0], "txq");
> > > + if (start == NULL) {
> > > + VHOST_LOG(ERR, "Illegal queue\n");
> > > + ret = -1;
> > > + goto out;
> > > + }
> > > +
> > > + start += 3;
> >
> > It's better not to use a hardcoded value here.
> >
> > > + qid = strtol(start, &end, 0);
> > > + if (end == start) {
> > > + VHOST_LOG(ERR, "No input queue ID\n");
> > > + ret = -1;
> > > + goto out;
> > > + }
> > > +
> > > + vring_id = qid * 2 + VIRTIO_RXQ;
> > > + if (rte_pci_addr_parse(ptrs[1],
> > > + &dma_info->dmas[vring_id].addr) < 0) {
> > > + VHOST_LOG(ERR, "Invalid DMA address %s\n",
> > > ptrs[1]);
> > > + ret = -1;
> > > + goto out;
> > > + }
> > > +
> > > + rte_pci_device_name(&dma_info->dmas[vring_id].addr,
> > > + name, sizeof(name));
> > > + dev_id = rte_rawdev_get_dev_id(name);
> > > + if (dev_id == (uint16_t)(-ENODEV) ||
> > > + dev_id == (uint16_t)(-EINVAL)) {
> > > + VHOST_LOG(ERR, "Cannot find device %s.\n", name);
> > > + ret = -1;
> > > + goto out;
> > > + }
> > > +
> > Multiple queues can't share one IOAT device. A check should be added here,
> > since that is not allowed.
>
> For now it is only stated in the doc. Will add the check later.
>
> > > +
> > > +/* notify front-end of enqueued packets */
> > > +static __rte_always_inline void
> > > +vhost_dma_vring_call(struct pmd_internal *dev, struct dma_vring
> > > *dma_vr)
> > > +{
> > > + vhost_vring_call_split(dev, dma_vr);
> > > +}
> > > +
> > > +int
> > > +free_dma_done(void *dev, void *dma_vr)
> > > +{
> > > + uintptr_t flags[255], tmps[255];
> >
> > Please add a meaningful macro for 255; I am not sure why the limit is 255
> > rather than 256.
>
> The second parameter of rte_ioat_completed_copies() is uint8_t, so the max
> value can only be 255. I can replace it with UINT8_MAX later.
>
> >
> > > + int dma_done, i;
> > > + uint16_t used_idx;
> > > + struct pmd_internal *device = dev;
> > > + struct dma_vring *dma_vring = dma_vr;
> > > +
> > > + dma_done = rte_ioat_completed_copies(dma_vring->dev_id, 255,
> > > flags,
> > > + tmps);
> > > + if (unlikely(dma_done <= 0))
> > > + return dma_done;
> > > +
> > > + dma_vring->nr_inflight -= dma_done;
> >
> > Not sure whether the DMA engine returns completions in the same order as
> > the input sequence; the mbuf free should happen after the index update is
> > done.
>
> IMO, the pktmbuf can be freed once the IOAT no longer occupies it.
> We don't need to wait for the update of the used index.
>
> This is achieved by using the mbuf's refcnt. We increase the mbuf's refcnt
> by 1 once we submit its copy job to the IOAT. If it has N copies that are
> all offloaded to the IOAT, the refcnt is increased by N. On completion of an
> IOAT copy, if it's a pktmbuf copy, we decrease its refcnt via
> rte_pktmbuf_free_seg(). When the refcnt reaches 1, which means all its
> copies are completed, the mbuf will be freed.
>
Thanks for the reply.
My concern is whether the IOAT preserves completion order relative to the
input sequence. If the hardware can guarantee that, I have no further
questions here.
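
On the refcnt mechanism itself, here is a minimal sketch of how I read it
(simplified from the patch; the index-update handles and doorbell are left
out, and the function names other than the rte_* ones are made up):

    #include <stdint.h>
    #include <rte_mbuf.h>
    #include <rte_ioat_rawdev.h>

    /* Sketch only: take an extra reference while the IOAT owns the segment. */
    static int
    submit_seg_copy(int dev_id, struct rte_mbuf *pkt,
                    phys_addr_t src_pa, phys_addr_t dst_pa, unsigned int len)
    {
            rte_mbuf_refcnt_update(pkt, 1);
            if (rte_ioat_enqueue_copy(dev_id, src_pa, dst_pa, len,
                                      (uintptr_t)pkt, 0, 0) == 0) {
                    /* ring full: drop the extra reference, caller falls back */
                    rte_mbuf_refcnt_update(pkt, -1);
                    return -1;
            }
            return 0;
    }

    /* Sketch only: each completion drops one reference; a segment is freed
     * only after every offloaded copy of it has completed.
     */
    static void
    drain_seg_copies(int dev_id)
    {
            uintptr_t src_hdls[UINT8_MAX], dst_hdls[UINT8_MAX];
            int n, i;

            n = rte_ioat_completed_copies(dev_id, UINT8_MAX, src_hdls,
                                          dst_hdls);
            for (i = 0; i < n; i++)
                    rte_pktmbuf_free_seg((struct rte_mbuf *)src_hdls[i]);
    }
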
> >
> > > + for (i = 0; i < dma_done; i++) {
> > > + if ((uint64_t)flags[i] >= dma_vring->max_indices) {
> > > + struct rte_mbuf *pkt = (struct rte_mbuf *)flags[i];
> > > +
> > > + /**
> > > + * the DMA completes a packet copy job, we
> > > + * decrease the refcnt or free the mbuf segment.
> > > + */
> > > + rte_pktmbuf_free_seg(pkt);
> > > + } else {
> > > + uint16_t id = flags[i];
> > > +
> > > + /**
> > > + * the DMA completes updating index of the
> > > + * used ring.
> > > + */
> > > + used_idx = dma_vring->indices[id].data;
> > > + VHOST_LOG(DEBUG, "The DMA finishes updating
> > > index %u "
> > > + "for the used ring.\n", used_idx);
> > > +
> > > + dma_vring->copy_done_used = used_idx;
> > > + vhost_dma_vring_call(device, dma_vring);
> > > + put_used_index(dma_vring->indices,
> > > + dma_vring->max_indices, id);
> > > + }
> > > + }
> > > + return dma_done;
> > > +}
> > > +
> > > +static __rte_always_inline bool
> > > +rxvq_is_mergeable(struct pmd_internal *dev)
> > > +{
> > > + return dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF);
> > > +}
> > > +
> >
> > I'm not sure whether the shadow used ring can help in the DMA
> > acceleration scenario.
> > The vhost driver will wait until the DMA copy is done. Optimizing the CPU
> > moves may not help overall performance and just adds odd-looking code.
>
> For small copies, we still use the CPU, as the IOAT is less efficient for
> small copies.
> Therefore, I think we still need the CPU optimization here.
>
Thanks for the clarification, got it.
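
Just for my own notes, a rough sketch of the size-based dispatch you describe
(the threshold name and value are made up for illustration, and the doorbell
and completion handling are omitted):

    #include <rte_memcpy.h>
    #include <rte_ioat_rawdev.h>

    #define SMALL_COPY_THRESHOLD 1024    /* hypothetical cut-off */

    /* Sketch only: copies below the threshold stay on the CPU, larger ones
     * are offloaded to the IOAT.
     */
    static inline void
    copy_desc_payload(int dev_id, void *dst, const void *src,
                      phys_addr_t dst_pa, phys_addr_t src_pa,
                      unsigned int len, uintptr_t hdl)
    {
            if (len < SMALL_COPY_THRESHOLD) {
                    rte_memcpy(dst, src, len);      /* CPU copy */
            } else {
                    while (rte_ioat_enqueue_copy(dev_id, src_pa, dst_pa, len,
                                                 hdl, 0, 0) == 0)
                            ;  /* ring full: real code drains completions */
            }
    }
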
> > > +static __rte_always_inline void
> > > +flush_shadow_used_ring_split(struct pmd_internal *dev,
> > > + struct dma_vring *dma_vr)
> > > +{
> > > + uint16_t used_idx = dma_vr->last_used_idx & (dma_vr->vr.size - 1);
> > > +
> > > + if (used_idx + dma_vr->shadow_used_idx <= dma_vr->vr.size) {
> > > + do_flush_shadow_used_ring_split(dma_vr, used_idx, 0,
> > > + dma_vr->shadow_used_idx);
> > > + } else {
> > > + uint16_t size;
> > > +
> > > + /* update used ring interval [used_idx, vr->size] */
> > > + size = dma_vr->vr.size - used_idx;
> > > + do_flush_shadow_used_ring_split(dma_vr, used_idx, 0,
> > size);
> > > +
> > > + /* update the left half used ring interval [0, left_size] */
> > > + do_flush_shadow_used_ring_split(dma_vr, 0, size,
> > > + dma_vr->shadow_used_idx -
> > > + size);
> > > + }
> > > + dma_vr->last_used_idx += dma_vr->shadow_used_idx;
> > > +
> > > + rte_smp_wmb();
> > > +
> > > + if (dma_vr->nr_inflight > 0) {
> > > + struct ring_index *index;
> > > +
> > > + index = get_empty_index(dma_vr->indices, dma_vr-
> > > >max_indices);
> > > + index->data = dma_vr->last_used_idx;
> > > + while (unlikely(rte_ioat_enqueue_copy(dma_vr->dev_id,
> > > + index->pa,
> > > + dma_vr->used_idx_hpa,
> > > + sizeof(uint16_t),
> > > + index->idx, 0, 0) ==
> > > + 0)) {
> > > + int ret;
> > > +
> > > + do {
> > > + ret = dma_vr->dma_done_fn(dev, dma_vr);
> > > + } while (ret <= 0);
> > > + }
> > > + dma_vr->nr_batching++;
> > > + dma_vr->nr_inflight++;
> > > + } else {
> > > + /**
> > > + * we update index of used ring when all previous copy
> > > + * jobs are completed.
> > > + *
> > > + * When enabling DMA copy, if there are outstanding copy
> > > + * jobs of the DMA, to avoid the DMA overwriting the
> > > + * write of the CPU, the DMA is in charge of updating
> > > + * the index of used ring.
> > > + */
> >
> > According to the comments, this should be a DMA data move, but the
> > following code is a CPU data move. Is anything wrong here?
>
> The update of the used index is done by the CPU if there are no inflight
> IOAT copies; otherwise, it's done by the IOAT. The code in "else {}" is
> executed only when dma_vr->nr_inflight is 0, which means there are no
> inflight IOAT copies, so the CPU is in charge of updating the used ring's
> index.
>
So it looks like the comment can be separated into two parts; the second part
should describe how the index update is handled when there are inflight
copies.
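
For example, the else-branch part could read:

    /* No in-flight DMA copies: the CPU writes the used ring's index
     * directly. (If there were outstanding DMA copy jobs, the DMA engine
     * would update the index instead, to avoid the DMA overwriting the
     * CPU's write.)
     */
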
> >
> > > + *(volatile uint16_t *)&dma_vr->vr.used->idx +=
> > > + dma_vr->shadow_used_idx;
> > > + dma_vr->copy_done_used += dma_vr->shadow_used_idx;
> > > + }
> > > +
> > > + dma_vr->shadow_used_idx = 0;
> > > +}
> > > +
> > > 2.7.4