DPDK patches and discussions
From: "Hu, Jiayu" <jiayu.hu@intel.com>
To: Maxime Coquelin <maxime.coquelin@redhat.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Cc: "i.maximets@ovn.org" <i.maximets@ovn.org>,
	"Xia, Chenbo" <chenbo.xia@intel.com>,
	"Richardson, Bruce" <bruce.richardson@intel.com>,
	"Van Haaren, Harry" <harry.van.haaren@intel.com>,
	"Mcnamara, John" <john.mcnamara@intel.com>,
	"Pai G, Sunil" <sunil.pai.g@intel.com>
Subject: RE: [RFC 1/1] vhost: integrate dmadev in asynchronous datapath
Date: Tue, 28 Dec 2021 01:15:21 +0000	[thread overview]
Message-ID: <c4cf886f882e44b794597157f70e8445@intel.com> (raw)
In-Reply-To: <61b6c53f-753b-7bc7-9aef-58e706e647aa@redhat.com>

Hi Maxime,

Thanks for your comments, and some replies are inline.

Thanks,
Jiayu

> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Friday, December 24, 2021 6:40 PM
> To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org
> Cc: i.maximets@ovn.org; Xia, Chenbo <chenbo.xia@intel.com>; Richardson,
> Bruce <bruce.richardson@intel.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>; Mcnamara, John
> <john.mcnamara@intel.com>; Pai G, Sunil <sunil.pai.g@intel.com>
> Subject: Re: [RFC 1/1] vhost: integrate dmadev in asynchronous datapath
> 
> Hi Jiayu,
> 
> This is a first review; I need to spend more time on the series to understand
> it well. Do you have a prototype of the OVS part, so that it helps us grasp
> what the full integration would look like?

I think the OVS patch will be sent soon, and we will send the dequeue side implementation too.

> 
> On 11/22/21 11:54, Jiayu Hu wrote:
> > Since dmadev was introduced in 21.11, to avoid the overhead of the vhost
> > DMA abstraction layer and simplify application logic, this patch
> > integrates dmadev into the asynchronous data path.
> >
> > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com>
> > Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com>
> > ---
> >   doc/guides/prog_guide/vhost_lib.rst |  63 ++++----
> >   examples/vhost/ioat.c               | 218 ----------------------------
> >   examples/vhost/ioat.h               |  63 --------
> >   examples/vhost/main.c               | 144 +++++++++++++++---
> >   examples/vhost/main.h               |  12 ++
> >   examples/vhost/meson.build          |   6 +-
> >   lib/vhost/meson.build               |   3 +-
> >   lib/vhost/rte_vhost_async.h         |  73 +++-------
> >   lib/vhost/vhost.c                   |  37 ++---
> >   lib/vhost/vhost.h                   |  45 +++++-
> >   lib/vhost/virtio_net.c              | 198 ++++++++++++++++++++-----
> >   11 files changed, 410 insertions(+), 452 deletions(-)
> >   delete mode 100644 examples/vhost/ioat.c
> >   delete mode 100644 examples/vhost/ioat.h
> >
> > diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
> > index 76f5d303c9..32969a1c41 100644
> > --- a/doc/guides/prog_guide/vhost_lib.rst
> > +++ b/doc/guides/prog_guide/vhost_lib.rst
> > @@ -113,8 +113,8 @@ The following is an overview of some key Vhost API functions:
> >       the async capability. Only packets enqueued/dequeued by async APIs are
> >       processed through the async data path.
> >
> > -    Currently this feature is only implemented on split ring enqueue data
> > -    path.
> > +    Currently this feature is only implemented on split and packed ring
> > +    enqueue data path.
> 
> That's not related to the topic of this patch; you may move it to a dedicated
> patch in v1.

Sure, I will remove it later.

> 
> >
> > diff --git a/examples/vhost/ioat.c b/examples/vhost/ioat.c
> > deleted file mode 100644
> > index 9aeeb12fd9..0000000000
> > --- a/examples/vhost/ioat.c
> > +++ /dev/null
> 
> Nice to see platform-specific code not being necessary for the
> application.
> 
> > diff --git a/examples/vhost/main.c b/examples/vhost/main.c
> > index 33d023aa39..16a02b9219 100644
> > --- a/examples/vhost/main.c
> > +++ b/examples/vhost/main.c
> > @@ -199,10 +205,113 @@ struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * MAX_VHOST_DEVICE];
> >   static inline int
> >   open_dma(const char *value)
> >   {
> > +		if (rte_dma_start(dev_id) != 0) {
> > +			RTE_LOG(ERR, VHOST_CONFIG, "Fail to start DMA %u.\n", dev_id);
> > +			ret = -1;
> > +			goto out;
> > +		}
> > +
> > +		(dma_info + vid)->dmas[vring_id].dev_id = dev_id;
> > +		(dma_info + vid)->dmas[vring_id].is_valid = true;
> 
> This is_valid field is never used AFAICT; my understanding is that it
> was added to differentiate between "not valid" and the first dmadev,
> since dev_id is zero in both cases. Either make use of is_valid in the
> code, or change dev_id to int and initialize it to -1.

Right, I will change it later. Thanks for the reminder.
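
For example, a minimal sketch of the second option (the struct and helper
names below are illustrative, not the final code):

	struct dma_vring_info {
		int dev_id;	/* -1 = no dmadev bound; >= 0 = valid dmadev id */
	};

	static void
	dma_vring_info_init(struct dma_vring_info *dmas, unsigned int nr)
	{
		unsigned int i;

		/* mark every vring as having no DMA device until
		 * open_dma() binds one */
		for (i = 0; i < nr; i++)
			dmas[i].dev_id = -1;
	}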

> 
> > +		dma_info->nr++;
> > +		i++;
> > +	}
> > +out:
> > +	free(input);
> > +	return ret;
> >   }
> >
> >   /*
> > diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c
> > index 13a9bb9dd1..595cf63b8d 100644
> > --- a/lib/vhost/vhost.c
> > +++ b/lib/vhost/vhost.c
> > @@ -344,6 +344,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq)
> >   		return;
> >
> >   	rte_free(vq->async->pkts_info);
> > +	rte_free(vq->async->pkts_cmpl_flag);
> >
> >   	rte_free(vq->async->buffers_packed);
> >   	vq->async->buffers_packed = NULL;
> > @@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid,
> >   }
> >
> > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
> > index 7085e0885c..974e495b56 100644
> > --- a/lib/vhost/vhost.h
> > +++ b/lib/vhost/vhost.h
> > @@ -51,6 +51,11 @@
> >   #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST)
> >   #define VHOST_MAX_ASYNC_VEC 2048
> >
> > +/* DMA device copy operation tracking ring size. */
> > +#define VHOST_ASYNC_DMA_TRACK_RING_SIZE (uint32_t)4096
> 
> How is this value chosen? Is that specific to your hardware?

Yes. But in fact, this value should be equal to or greater than the vchan
descriptor number, and it should be dynamic. In addition, the context tracking
array "dma_copy_track" should be on a per-vchan basis, rather than per-device,
although existing DMA devices support at most 1 vchan.

I have reworked this part so that it can be configured by users dynamically.
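
For example, the ring size could be derived from what the device reports at
registration time; a rough sketch (rte_dma_info_get() and rte_align32pow2()
are existing DPDK APIs, the helper name is illustrative):

	#include <rte_common.h>
	#include <rte_dmadev.h>

	/* Size the tracking ring from the vchan descriptor limit, rounded
	 * up to a power of two so an index mask still works. */
	static uint32_t
	dma_track_ring_size(int16_t dma_id)
	{
		struct rte_dma_info info;

		if (rte_dma_info_get(dma_id, &info) != 0)
			return 0;

		return rte_align32pow2(info.max_desc);
	}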

> 
> > +#define VHOST_ASYNC_DMA_TRACK_RING_MASK (VHOST_ASYNC_DMA_TRACK_RING_SIZE - 1)
> > +#define VHOST_ASYNC_DMA_BATCHING_SIZE 32
> > +
> >   #define PACKED_DESC_ENQUEUE_USED_FLAG(w)	\
> >   	((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \
> >   		VRING_DESC_F_WRITE)
> > @@ -119,6 +124,29 @@ struct vring_used_elem_packed {
> >   	uint32_t count;
> >   };
> >
> > +struct async_dma_info {
> > +	/* circular array to track copy metadata */
> > +	bool *metadata[VHOST_ASYNC_DMA_TRACK_RING_SIZE];
> > +
> > +	/* batching copies before a DMA doorbell */
> > +	uint16_t nr_batching;
> > +
> > +	/**
> > +	 * DMA virtual channel lock. Although it is possible to bind DMA
> > +	 * virtual channels to data plane threads, the vhost control plane
> > +	 * thread could call data plane functions too, thus causing
> > +	 * DMA device contention.
> > +	 *
> > +	 * For example, in the VM exit case, the vhost control plane thread
> > +	 * needs to clear in-flight packets before disabling the vring, but
> > +	 * there could be another data plane thread enqueuing packets to the
> > +	 * same vring with the same DMA virtual channel. But dmadev PMD
> > +	 * functions are lock-free, so the control plane and data plane threads
> > +	 * could operate the same DMA virtual channel at the same time.
> > +	 */
> > +	rte_spinlock_t dma_lock;
> > +};
> > +
> >   /**
> >    * inflight async packet information
> >    */
> > @@ -129,9 +157,6 @@ struct async_inflight_info {
> >   };
> >
> >   struct vhost_async {
> > -	/* operation callbacks for DMA */
> > -	struct rte_vhost_async_channel_ops ops;
> > -
> >   	struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT];
> >   	struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC];
> >   	uint16_t iter_idx;
> > @@ -139,8 +164,22 @@ struct vhost_async {
> >
> >   	/* data transfer status */
> >   	struct async_inflight_info *pkts_info;
> > +	/**
> > +	 * packet reorder array. "true" indicates that DMA
> > +	 * device completes all copies for the packet.
> > +	 *
> > +	 * Note that this arry could be written by multiple
> 
> array

I will change it later.

> 
> > +	 * threads at the same time. For example, two threads
> > +	 * enqueue packets to the same virtqueue with their
> > +	 * own DMA devices. However, since offloading is done
> > +	 * on a per-packet basis, each packet flag will only be
> > +	 * written by one thread. And single byte write is
> > +	 * atomic, so no lock is needed.
> > +	 */
> > +	bool *pkts_cmpl_flag;
> >   	uint16_t pkts_idx;
> >   	uint16_t pkts_inflight_n;
> > +
> >   	union {
> >   		struct vring_used_elem  *descs_split;
> >   		struct vring_used_elem_packed *buffers_packed;
> > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
> > index b3d954aab4..95ecfeb64b 100644
> > --- a/lib/vhost/virtio_net.c
> > +++ b/lib/vhost/virtio_net.c
> > @@ -11,6 +11,7 @@
> >   #include <rte_net.h>
> >   #include <rte_ether.h>
> >   #include <rte_ip.h>
> > +#include <rte_dmadev.h>
> >   #include <rte_vhost.h>
> >   #include <rte_tcp.h>
> >   #include <rte_udp.h>
> > @@ -25,6 +26,9 @@
> >
> >   #define MAX_BATCH_LEN 256
> >
> > +/* DMA device copy operation tracking array. */
> > +static struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX];
> > +
> >   static  __rte_always_inline bool
> >   rxvq_is_mergeable(struct virtio_net *dev)
> >   {
> > @@ -43,6 +47,108 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring)
> >   	return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring;
> >   }
> >
> > +static uint16_t
> > +vhost_async_dma_transfer(struct vhost_virtqueue *vq, uint16_t dma_id,
> > +		uint16_t dma_vchan, uint16_t head_idx,
> > +		struct rte_vhost_iov_iter *pkts, uint16_t nr_pkts)
> > +{
> > +	struct async_dma_info *dma_info = &dma_copy_track[dma_id];
> > +	uint16_t dma_space_left = rte_dma_burst_capacity(dma_id, 0);
> > +	uint16_t pkt_idx = 0;
> > +
> > +	rte_spinlock_lock(&dma_info->dma_lock);
> > +
> > +	while (pkt_idx < nr_pkts) {
> 
> A for loop would be preferred here.

Sure, I will change it later.
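
For reference, a minimal sketch of the reworked loop, keeping the RFC's
variable names (the per-segment copy logic is unchanged and elided):

	uint16_t pkt_idx;

	for (pkt_idx = 0; pkt_idx < nr_pkts; pkt_idx++) {
		uint16_t nr_segs = pkts[pkt_idx].nr_segs;

		if (unlikely(dma_space_left < nr_segs))
			break;

		/* ... rte_dma_copy() each segment and record the
		 * completion-flag slot, exactly as in the RFC ... */

		dma_space_left -= nr_segs;
		head_idx++;
	}
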
> 
> > +		struct rte_vhost_iovec *iov = pkts[pkt_idx].iov;
> > +		int copy_idx = 0;
> > +		uint16_t nr_segs = pkts[pkt_idx].nr_segs;
> > +		uint16_t i;
> > +
> > +		if (unlikely(dma_space_left < nr_segs)) {
> > +			goto out;
> > +		}
> > +
> > +		for (i = 0; i < nr_segs; i++) {
> > +			copy_idx = rte_dma_copy(dma_id, dma_vchan,
> > +					(rte_iova_t)iov[i].src_addr,
> > +					(rte_iova_t)iov[i].dst_addr,
> > +					iov[i].len, RTE_DMA_OP_FLAG_LLC);
> > +			if (unlikely(copy_idx < 0)) {
> > +				VHOST_LOG_DATA(ERR, "DMA device %u (%u) copy failed\n",
> > +						dma_id, dma_vchan);
> > +				dma_info->nr_batching += i;
> > +				goto out;
> > +			}
> > +
> > +			dma_info->metadata[copy_idx & VHOST_ASYNC_DMA_TRACK_RING_MASK] = NULL;
> > +		}
> > +
> > +		/**
> > +		 * Only store packet completion flag address in the last copy's
> > +		 * slot, and other slots are set to NULL.
> > +		 */
> > +		dma_info->metadata[copy_idx & VHOST_ASYNC_DMA_TRACK_RING_MASK] =
> > +			&vq->async->pkts_cmpl_flag[head_idx % vq->size];
> > +
> > +		dma_info->nr_batching += nr_segs;
> > +		if (unlikely(dma_info->nr_batching > VHOST_ASYNC_DMA_BATCHING_SIZE)) {
> > +			rte_dma_submit(dma_id, 0);
> > +			dma_info->nr_batching = 0;
> > +		}
> > +
> > +		dma_space_left -= nr_segs;
> > +		pkt_idx++;
> > +		head_idx++;
> > +	}
> > +
> > +out:
> > +	if (dma_info->nr_batching > 0) {
> > +		rte_dma_submit(dma_id, 0);
> > +		dma_info->nr_batching = 0;
> > +	}
> > +	rte_spinlock_unlock(&dma_info->dma_lock);
> 
> At first sight, that looks like a lot of things being done while the
> spinlock is held. But maybe there will only be contention in some corner
> cases?

I think the answer is yes. A typical model is binding DMA devices to data-path
threads, so there is no DMA sharing in the data path. One case of contention
happens between the control plane and data plane threads, but it only happens
when vrings need to stop, etc. So contention should not occur very often.
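
For context, the completion side that consumes the tracking ring could look
roughly like the sketch below (it assumes dma_copy_track, the ring mask and
pkts_cmpl_flag as defined in this RFC; the function name and the burst size
of 255 are illustrative):

	static void
	vhost_async_dma_check_completed(int16_t dma_id, uint16_t dma_vchan)
	{
		struct async_dma_info *dma_info = &dma_copy_track[dma_id];
		uint16_t last_idx = 0;
		uint16_t nr_copies, i;
		bool has_error = false;

		rte_spinlock_lock(&dma_info->dma_lock);

		nr_copies = rte_dma_completed(dma_id, dma_vchan, 255,
				&last_idx, &has_error);

		/* completed jobs are (last_idx - nr_copies + 1) .. last_idx;
		 * only the final copy of a packet stores a non-NULL
		 * completion-flag pointer, all other slots were set to NULL */
		for (i = 0; i < nr_copies; i++) {
			uint16_t slot = (uint16_t)(last_idx - i) &
					VHOST_ASYNC_DMA_TRACK_RING_MASK;
			bool *flag = dma_info->metadata[slot];

			if (flag != NULL)
				*flag = true;	/* packet fully copied */
		}

		rte_spinlock_unlock(&dma_info->dma_lock);
	}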

Thanks,
Jiayu

Thread overview: 31+ messages
2021-11-22 10:54 [RFC 0/1] integrate dmadev in vhost Jiayu Hu
2021-11-22 10:54 ` [RFC 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu
2021-12-24 10:39   ` Maxime Coquelin
2021-12-28  1:15     ` Hu, Jiayu [this message]
2022-01-03 10:26       ` Maxime Coquelin
2022-01-06  5:46         ` Hu, Jiayu
2021-12-03  3:49 ` [RFC 0/1] integrate dmadev in vhost fengchengwen
2021-12-30 21:55 ` [PATCH v1 " Jiayu Hu
2021-12-30 21:55   ` [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu
2021-12-31  0:55     ` Liang Ma
2022-01-14  6:30     ` Xia, Chenbo
2022-01-17  5:39       ` Hu, Jiayu
2022-01-19  2:18         ` Xia, Chenbo
2022-01-20 17:00     ` Maxime Coquelin
2022-01-21  1:56       ` Hu, Jiayu
2022-01-24 16:40   ` [PATCH v2 0/1] integrate dmadev in vhost Jiayu Hu
2022-01-24 16:40     ` [PATCH v2 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu
2022-02-03 13:04       ` Maxime Coquelin
2022-02-07  1:34         ` Hu, Jiayu
2022-02-08 10:40       ` [PATCH v3 0/1] integrate dmadev in vhost Jiayu Hu
2022-02-08 10:40         ` [PATCH v3 1/1] vhost: integrate dmadev in asynchronous data-path Jiayu Hu
2022-02-08 17:46           ` Maxime Coquelin
2022-02-09 12:51           ` [PATCH v4 0/1] integrate dmadev in vhost Jiayu Hu
2022-02-09 12:51             ` [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path Jiayu Hu
2022-02-10  7:58               ` Yang, YvonneX
2022-02-10 13:44               ` Maxime Coquelin
2022-02-10 15:14               ` Maxime Coquelin
2022-02-10 20:50               ` Ferruh Yigit
2022-02-10 21:01                 ` Maxime Coquelin
2022-02-10 20:56               ` Ferruh Yigit
2022-02-10 21:00                 ` Maxime Coquelin
