From: "Hu, Jiayu" <jiayu.hu@intel.com>
To: "Liu, Yong" <yong.liu@intel.com>, "dev@dpdk.org" <dev@dpdk.org>
Cc: "maxime.coquelin@redhat.com" <maxime.coquelin@redhat.com>,
	"Ye, Xiaolong" <xiaolong.ye@intel.com>,
	"Wang, Zhihong" <zhihong.wang@intel.com>
Subject: Re: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to accelerate Tx operations
Date: Tue, 17 Mar 2020 09:31:20 +0000
Message-ID: <2484dc0f250f4ddabfecd6f7c9ed8f9b@intel.com>
In-Reply-To: <86228AFD5BCD8E4EBFD2B90117B5E81E6350B05E@SHSMSX103.ccr.corp.intel.com>

Hi Marvin,

Thanks for the comments. Replies are inline.

> -----Original Message-----
> From: Liu, Yong <yong.liu@intel.com>
> Sent: Tuesday, March 17, 2020 3:21 PM
> To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org
> Cc: maxime.coquelin@redhat.com; Ye, Xiaolong <xiaolong.ye@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> accelerate Tx operations
> 
> Hi Jiayu,
> Some comments are inline.
> 
> Thanks,
> Marvin
> 
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Jiayu Hu
> > Sent: Tuesday, March 17, 2020 5:21 PM
> > To: dev@dpdk.org
> > Cc: maxime.coquelin@redhat.com; Ye, Xiaolong <xiaolong.ye@intel.com>;
> > Wang, Zhihong <zhihong.wang@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>
> > Subject: [dpdk-dev] [PATCH 3/4] net/vhost: leverage DMA engines to
> > accelerate Tx operations
> >
> >
> >  int vhost_logtype;
> > @@ -30,8 +34,12 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> >  #define ETH_VHOST_IOMMU_SUPPORT		"iommu-support"
> >  #define ETH_VHOST_POSTCOPY_SUPPORT	"postcopy-support"
> >  #define ETH_VHOST_VIRTIO_NET_F_HOST_TSO "tso"
> > +#define ETH_VHOST_DMA_ARG		"dmas"
> >  #define VHOST_MAX_PKT_BURST 32
> >
> > +/* ring size of I/OAT */
> > +#define IOAT_RING_SIZE 1024
> > +
> 
> Jiayu,
> The configured I/OAT ring size is 1024 here, but I do not see an in_flight or
> nr_batching size check in the enqueue function.
> Is there any possibility that the IOAT ring gets exhausted?

We will wait for the IOAT's copy completions when its ring is full.
This guarantees that every enqueue to the IOAT can succeed.
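Roughly, the enqueue path spins on the ring-full case, draining completions
until a slot frees up, along the lines of the sketch below (variable names are
illustrative; the calls follow the rte_ioat rawdev API used in this patch):

    while (rte_ioat_enqueue_copy(dma_vr->dev_id, src_iova, dst_iova,
                                 copy_len, (uintptr_t)pkt, 0, 0) == 0) {
        /* IOAT ring is full: reap completed copies to free slots */
        while (free_dma_done(dev, dma_vr) <= 0)
            ;
    }

So an enqueue is never dropped, only delayed.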

> > +struct dma_info_input {
> > +	struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2];
> > +	uint16_t nr;
> > +};
> > +
> > +static inline int
> > +open_dma(const char *key __rte_unused, const char *value, void
> > *extra_args)
> > +{
> > +	struct dma_info_input *dma_info = extra_args;
> > +	char *input = strndup(value, strlen(value) + 1);
> > +	char *addrs = input;
> > +	char *ptrs[2];
> > +	char *start, *end, *substr;
> > +	int64_t qid, vring_id;
> > +	struct rte_ioat_rawdev_config config;
> > +	struct rte_rawdev_info info = { .dev_private = &config };
> > +	char name[32];
> > +	int dev_id;
> > +	int ret = 0;
> > +
> > +	while (isblank(*addrs))
> > +		addrs++;
> > +	if (*addrs == '\0') {
> > +		VHOST_LOG(ERR, "No input DMA addresses\n");
> > +		ret = -1;
> > +		goto out;
> > +	}
> > +
> > +	/* process DMA devices within bracket. */
> > +	addrs++;
> > +	substr = strtok(addrs, ";]");
> > +	if (!substr) {
> > +		VHOST_LOG(ERR, "No input DMA addresses\n");
> > +		ret = -1;
> > +		goto out;
> > +	}
> > +
> > +	do {
> > +		rte_strsplit(substr, strlen(substr), ptrs, 2, '@');
> > +
> Function rte_strsplit can fail. Need to check the return value.

Thanks. Will check it later.
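Something like this should do (sketch only; rte_strsplit() returns the number
of tokens found, or a negative value on invalid parameters):

    if (rte_strsplit(substr, strlen(substr), ptrs, 2, '@') < 2) {
        VHOST_LOG(ERR, "Invalid DMA address format: %s\n", substr);
        ret = -1;
        goto out;
    }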

> 
> > +		start = strstr(ptrs[0], "txq");
> > +		if (start == NULL) {
> > +			VHOST_LOG(ERR, "Illegal queue\n");
> > +			ret = -1;
> > +			goto out;
> > +		}
> > +
> > +		start += 3;
> 
> It's better not to use a hardcoded value.
> 
> > +		qid = strtol(start, &end, 0);
> > +		if (end == start) {
> > +			VHOST_LOG(ERR, "No input queue ID\n");
> > +			ret = -1;
> > +			goto out;
> > +		}
> > +
> > +		vring_id = qid * 2 + VIRTIO_RXQ;
> > +		if (rte_pci_addr_parse(ptrs[1],
> > +				       &dma_info->dmas[vring_id].addr) < 0) {
> > +			VHOST_LOG(ERR, "Invalid DMA address %s\n",
> > ptrs[1]);
> > +			ret = -1;
> > +			goto out;
> > +		}
> > +
> > +		rte_pci_device_name(&dma_info->dmas[vring_id].addr,
> > +				    name, sizeof(name));
> > +		dev_id = rte_rawdev_get_dev_id(name);
> > +		if (dev_id == (uint16_t)(-ENODEV) ||
> > +		    dev_id == (uint16_t)(-EINVAL)) {
> > +			VHOST_LOG(ERR, "Cannot find device %s.\n", name);
> > +			ret = -1;
> > +			goto out;
> > +		}
> > +
> Multiple queues can't share one IOAT device. A check should be added here, as
> that is not allowed.

For now, this is only stated in the doc. Will add the check later.
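The check could look roughly like this (sketch only; the is_valid and dev_id
fields on struct dma_info are assumed here for illustration and may not match
the actual structure):

    int i;

    /* reject a DMA device that is already bound to another vring */
    for (i = 0; i < RTE_MAX_QUEUES_PER_PORT * 2; i++) {
        if (dma_info->dmas[i].is_valid && dma_info->dmas[i].dev_id == dev_id) {
            VHOST_LOG(ERR, "DMA device %s is already in use.\n", name);
            ret = -1;
            goto out;
        }
    }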

> > +
> > +/* notify front-end of enqueued packets */
> > +static __rte_always_inline void
> > +vhost_dma_vring_call(struct pmd_internal *dev, struct dma_vring
> > *dma_vr)
> > +{
> > +	vhost_vring_call_split(dev, dma_vr);
> > +}
> > +
> > +int
> > +free_dma_done(void *dev, void *dma_vr)
> > +{
> > +	uintptr_t flags[255], tmps[255];
> 
> Please add a meaningful macro for 255; not sure why the limitation is 255
> rather than 256.

The second parameter of rte_ioat_completed_copies() is uint8_t, so the max
value can only be 255. I can replace it with UINT8_MAX later.
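i.e. something like (sketch of that change):

    uintptr_t flags[UINT8_MAX], tmps[UINT8_MAX];

    dma_done = rte_ioat_completed_copies(dma_vring->dev_id, UINT8_MAX,
                                         flags, tmps);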

> 
> > +	int dma_done, i;
> > +	uint16_t used_idx;
> > +	struct pmd_internal *device = dev;
> > +	struct dma_vring *dma_vring = dma_vr;
> > +
> > +	dma_done = rte_ioat_completed_copies(dma_vring->dev_id, 255,
> > flags,
> > +					     tmps);
> > +	if (unlikely(dma_done <= 0))
> > +		return dma_done;
> > +
> > +	dma_vring->nr_inflight -= dma_done;
> 
> Not sure whether the DMA engine will return completions in submission order;
> the mbuf free should happen after the index update is done.

IMO, a pktmbuf can be freed once the IOAT no longer references it.
We don't need to wait for the update of the used index.

This is achieved by using the mbuf's refcnt. We increase the mbuf's refcnt by 1
when we submit one of its copy jobs to the IOAT. If it has N copies that are all
offloaded to the IOAT, the refcnt is increased by N. On completion of an IOAT
copy, if it is a pktmbuf copy, we decrease its refcnt via rte_pktmbuf_free_seg().
Once the refcnt drops back to 1, all its copies have completed and the mbuf can
be freed.
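
In sketch form (variable names are illustrative; as in the patch, the mbuf
pointer is passed as the completion handle so the reaping side can tell packet
copies apart from used-index updates):

    /* submission: take an extra reference while the IOAT still reads the data */
    rte_mbuf_refcnt_update(pkt, 1);
    rte_ioat_enqueue_copy(dma_vr->dev_id,
                          rte_pktmbuf_iova(pkt), dst_iova,
                          rte_pktmbuf_data_len(pkt),
                          (uintptr_t)pkt /* completion handle */, 0, 0);

    /* completion (in free_dma_done): drop the reference taken at submission */
    rte_pktmbuf_free_seg(pkt);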

> 
> > +	for (i = 0; i < dma_done; i++) {
> > +		if ((uint64_t)flags[i] >= dma_vring->max_indices) {
> > +			struct rte_mbuf *pkt = (struct rte_mbuf *)flags[i];
> > +
> > +			/**
> > +			 * the DMA completes a packet copy job, we
> > +			 * decrease the refcnt or free the mbuf segment.
> > +			 */
> > +			rte_pktmbuf_free_seg(pkt);
> > +		} else {
> > +			uint16_t id = flags[i];
> > +
> > +			/**
> > +			 * the DMA completes updating index of the
> > +			 * used ring.
> > +			 */
> > +			used_idx = dma_vring->indices[id].data;
> > +			VHOST_LOG(DEBUG, "The DMA finishes updating
> > index %u "
> > +				  "for the used ring.\n", used_idx);
> > +
> > +			dma_vring->copy_done_used = used_idx;
> > +			vhost_dma_vring_call(device, dma_vring);
> > +			put_used_index(dma_vring->indices,
> > +				       dma_vring->max_indices, id);
> > +		}
> > +	}
> > +	return dma_done;
> > +}
> > +
> > +static  __rte_always_inline bool
> > +rxvq_is_mergeable(struct pmd_internal *dev)
> > +{
> > +	return dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF);
> > +}
> > +
> 
> I'm not sure whether the shadow used ring can help in the DMA acceleration
> scenario.
> The vhost driver will wait until the DMA copy is done. Optimizing the CPU copy
> path may not help overall performance and just adds odd code.

For small copies, we still use the CPU, as the IOAT is less efficient for them.
Therefore, I think we still need the CPU optimization here.
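The split is along the lines of the sketch below (DMA_COPY_LENGTH_THRESHOLD and
offload_copy_to_ioat() are made-up names, used only to illustrate the idea):

    if (copy_len < DMA_COPY_LENGTH_THRESHOLD)
        rte_memcpy(dst_va, src_va, copy_len);  /* small copy: cheaper on the CPU */
    else
        offload_copy_to_ioat(dma_vr, pkt);     /* large copy: enqueue to the IOAT */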

> > +static __rte_always_inline void
> > +flush_shadow_used_ring_split(struct pmd_internal *dev,
> > +			     struct dma_vring *dma_vr)
> > +{
> > +	uint16_t used_idx = dma_vr->last_used_idx & (dma_vr->vr.size - 1);
> > +
> > +	if (used_idx + dma_vr->shadow_used_idx <= dma_vr->vr.size) {
> > +		do_flush_shadow_used_ring_split(dma_vr, used_idx, 0,
> > +						dma_vr->shadow_used_idx);
> > +	} else {
> > +		uint16_t size;
> > +
> > +		/* update used ring interval [used_idx, vr->size] */
> > +		size = dma_vr->vr.size - used_idx;
> > +		do_flush_shadow_used_ring_split(dma_vr, used_idx, 0,
> size);
> > +
> > +		/* update the left half used ring interval [0, left_size] */
> > +		do_flush_shadow_used_ring_split(dma_vr, 0, size,
> > +						dma_vr->shadow_used_idx -
> > +						size);
> > +	}
> > +	dma_vr->last_used_idx += dma_vr->shadow_used_idx;
> > +
> > +	rte_smp_wmb();
> > +
> > +	if (dma_vr->nr_inflight > 0) {
> > +		struct ring_index *index;
> > +
> > +		index = get_empty_index(dma_vr->indices, dma_vr-
> > >max_indices);
> > +		index->data = dma_vr->last_used_idx;
> > +		while (unlikely(rte_ioat_enqueue_copy(dma_vr->dev_id,
> > +						      index->pa,
> > +						      dma_vr->used_idx_hpa,
> > +						      sizeof(uint16_t),
> > +						      index->idx, 0, 0) ==
> > +				0)) {
> > +			int ret;
> > +
> > +			do {
> > +				ret = dma_vr->dma_done_fn(dev, dma_vr);
> > +			} while (ret <= 0);
> > +		}
> > +		dma_vr->nr_batching++;
> > +		dma_vr->nr_inflight++;
> > +	} else {
> > +		/**
> > +		 * we update index of used ring when all previous copy
> > +		 * jobs are completed.
> > +		 *
> > +		 * When enabling DMA copy, if there are outstanding copy
> > +		 * jobs of the DMA, to avoid the DMA overwriting the
> > +		 * write of the CPU, the DMA is in charge of updating
> > +		 * the index of used ring.
> > +		 */
> 
> According to the comments, this should be a DMA data move, but the following
> code is a CPU data move. Is anything wrong here?

The update of the used index is done by the CPU if there are no in-flight IOAT
copies; otherwise, it's done by the IOAT. The code in the "else {}" branch is
executed only when dma_vr->nr_inflight is 0, i.e. there are no in-flight IOAT
copies, so the CPU is in charge of updating the used ring's index.

> 
> > +		*(volatile uint16_t *)&dma_vr->vr.used->idx +=
> > +			dma_vr->shadow_used_idx;
> > +		dma_vr->copy_done_used += dma_vr->shadow_used_idx;
> > +	}
> > +
> > +	dma_vr->shadow_used_idx = 0;
> > +}
> > +
> > 2.7.4

