From: "Hu, Jiayu" <jiayu.hu@intel.com>
To: Maxime Coquelin <maxime.coquelin@redhat.com>,
"dev@dpdk.org" <dev@dpdk.org>
Cc: "i.maximets@ovn.org" <i.maximets@ovn.org>,
"Xia, Chenbo" <chenbo.xia@intel.com>,
"Richardson, Bruce" <bruce.richardson@intel.com>,
"Van Haaren, Harry" <harry.van.haaren@intel.com>,
"Pai G, Sunil" <sunil.pai.g@intel.com>,
"Mcnamara, John" <john.mcnamara@intel.com>,
"Ding, Xuan" <xuan.ding@intel.com>,
"Jiang, Cheng1" <cheng1.jiang@intel.com>,
"liangma@liangbit.com" <liangma@liangbit.com>
Subject: RE: [PATCH v2 1/1] vhost: integrate dmadev in asynchronous datapath
Date: Mon, 7 Feb 2022 01:34:52 +0000 [thread overview]
Message-ID: <74b39fe943e740bea4e8d5a6358b99b2@intel.com> (raw)
In-Reply-To: <42a88a85-f4cd-d6e0-f5d2-411db1879380@redhat.com>
Hi Maxime,
Thanks for your comments. Please see replies inline.
> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Thursday, February 3, 2022 9:04 PM
> To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org
> Cc: i.maximets@ovn.org; Xia, Chenbo <chenbo.xia@intel.com>; Richardson,
> Bruce <bruce.richardson@intel.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>; Pai G, Sunil <sunil.pai.g@intel.com>;
> Mcnamara, John <john.mcnamara@intel.com>; Ding, Xuan
> <xuan.ding@intel.com>; Jiang, Cheng1 <cheng1.jiang@intel.com>;
> liangma@liangbit.com
> Subject: Re: [PATCH v2 1/1] vhost: integrate dmadev in asynchronous
> datapath
>
> Hi Jiayu,
>
> On 1/24/22 17:40, Jiayu Hu wrote:
> > Since dmadev is introduced in 21.11, to avoid the overhead of vhost
> > DMA abstraction layer and simplify application logics, this patch
> > integrates dmadev in asynchronous data path.
> >
> > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com>
> > Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com>
> > ---
> > doc/guides/prog_guide/vhost_lib.rst | 95 ++++-----
> > examples/vhost/Makefile | 2 +-
> > examples/vhost/ioat.c | 218 --------------------
> > examples/vhost/ioat.h | 63 ------
> > examples/vhost/main.c | 255 ++++++++++++++++++-----
> > examples/vhost/main.h | 11 +
> > examples/vhost/meson.build | 6 +-
> > lib/vhost/meson.build | 2 +-
> > lib/vhost/rte_vhost.h | 2 +
> > lib/vhost/rte_vhost_async.h | 132 +++++-------
> > lib/vhost/version.map | 3 +
> > lib/vhost/vhost.c | 148 ++++++++++----
> > lib/vhost/vhost.h | 64 +++++-
> > lib/vhost/vhost_user.c | 2 +
> > lib/vhost/virtio_net.c | 305 +++++++++++++++++++++++-----
> > 15 files changed, 744 insertions(+), 564 deletions(-)
> > delete mode 100644 examples/vhost/ioat.c
> > delete mode 100644 examples/vhost/ioat.h
> >
>
> When you rebase to the next version, please ensure to rework all the logs to
> follow the new standard:
> VHOST_LOG_CONFIG(ERR,"(%s) .....", dev->ifname, ...);
Sure, will do.
>
> > git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h index
> > a87ea6ba37..758a80f403 100644
> > --- a/lib/vhost/rte_vhost_async.h
> > +++ b/lib/vhost/rte_vhost_async.h
> > @@ -26,73 +26,6 @@ struct rte_vhost_iov_iter {
> > unsigned long nr_segs;
> > };
> >
> > -/**
> > - * dma transfer status
> > - */
> > -struct rte_vhost_async_status {
> > - /** An array of application specific data for source memory */
> > - uintptr_t *src_opaque_data;
> > - /** An array of application specific data for destination memory */
> > - uintptr_t *dst_opaque_data;
> > -};
> > -
> > -/**
> > - * dma operation callbacks to be implemented by applications
> > - */
> > -struct rte_vhost_async_channel_ops {
> > - /**
> > - * instruct async engines to perform copies for a batch of packets
> > - *
> > - * @param vid
> > - * id of vhost device to perform data copies
> > - * @param queue_id
> > - * queue id to perform data copies
> > - * @param iov_iter
> > - * an array of IOV iterators
> > - * @param opaque_data
> > - * opaque data pair sending to DMA engine
> > - * @param count
> > - * number of elements in the "descs" array
> > - * @return
> > - * number of IOV iterators processed, negative value means error
> > - */
> > - int32_t (*transfer_data)(int vid, uint16_t queue_id,
> > - struct rte_vhost_iov_iter *iov_iter,
> > - struct rte_vhost_async_status *opaque_data,
> > - uint16_t count);
> > - /**
> > - * check copy-completed packets from the async engine
> > - * @param vid
> > - * id of vhost device to check copy completion
> > - * @param queue_id
> > - * queue id to check copy completion
> > - * @param opaque_data
> > - * buffer to receive the opaque data pair from DMA engine
> > - * @param max_packets
> > - * max number of packets could be completed
> > - * @return
> > - * number of async descs completed, negative value means error
> > - */
> > - int32_t (*check_completed_copies)(int vid, uint16_t queue_id,
> > - struct rte_vhost_async_status *opaque_data,
> > - uint16_t max_packets);
> > -};
> > -
> > -/**
> > - * async channel features
> > - */
> > -enum {
> > - RTE_VHOST_ASYNC_INORDER = 1U << 0,
> > -};
> > -
> > -/**
> > - * async channel configuration
> > - */
> > -struct rte_vhost_async_config {
> > - uint32_t features;
> > - uint32_t rsvd[2];
> > -};
> > -
> > /**
> > * Register an async channel for a vhost queue
> > *
> > @@ -100,17 +33,11 @@ struct rte_vhost_async_config {
> > * vhost device id async channel to be attached to
> > * @param queue_id
> > * vhost queue id async channel to be attached to
> > - * @param config
> > - * Async channel configuration structure
> > - * @param ops
> > - * Async channel operation callbacks
> > * @return
> > * 0 on success, -1 on failures
> > */
> > __rte_experimental
> > -int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
> > - struct rte_vhost_async_config config,
> > - struct rte_vhost_async_channel_ops *ops);
> > +int rte_vhost_async_channel_register(int vid, uint16_t queue_id);
> >
> > /**
> > * Unregister an async channel for a vhost queue @@ -136,17 +63,11
> > @@ int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
> > * vhost device id async channel to be attached to
> > * @param queue_id
> > * vhost queue id async channel to be attached to
> > - * @param config
> > - * Async channel configuration
> > - * @param ops
> > - * Async channel operation callbacks
> > * @return
> > * 0 on success, -1 on failures
> > */
> > __rte_experimental
> > -int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t
> queue_id,
> > - struct rte_vhost_async_config config,
> > - struct rte_vhost_async_channel_ops *ops);
> > +int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t
> > +queue_id);
> >
> > /**
> > * Unregister an async channel for a vhost queue without performing
> > any @@ -179,12 +100,17 @@ int
> rte_vhost_async_channel_unregister_thread_unsafe(int vid,
> > * array of packets to be enqueued
> > * @param count
> > * packets num to be enqueued
> > + * @param dma_id
> > + * the identifier of the DMA device
> > + * @param vchan_id
> > + * the identifier of virtual DMA channel
> > * @return
> > * num of packets enqueued
> > */
> > __rte_experimental
> > uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
> > - struct rte_mbuf **pkts, uint16_t count);
> > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id,
> > + uint16_t vchan_id);
> >
> > /**
> > * This function checks async completion status for a specific vhost
> > @@ -199,12 +125,17 @@ uint16_t rte_vhost_submit_enqueue_burst(int
> vid, uint16_t queue_id,
> > * blank array to get return packet pointer
> > * @param count
> > * size of the packet array
> > + * @param dma_id
> > + * the identifier of the DMA device
> > + * @param vchan_id
> > + * the identifier of virtual DMA channel
> > * @return
> > * num of packets returned
> > */
> > __rte_experimental
> > uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
> > - struct rte_mbuf **pkts, uint16_t count);
> > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id,
> > + uint16_t vchan_id);
> >
> > /**
> > * This function returns the amount of in-flight packets for the
> > vhost @@ -235,11 +166,44 @@ int rte_vhost_async_get_inflight(int vid,
> uint16_t queue_id);
> > * Blank array to get return packet pointer
> > * @param count
> > * Size of the packet array
> > + * @param dma_id
> > + * the identifier of the DMA device
> > + * @param vchan_id
> > + * the identifier of virtual DMA channel
> > * @return
> > * Number of packets returned
> > */
> > __rte_experimental
> > uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id,
> > - struct rte_mbuf **pkts, uint16_t count);
> > + struct rte_mbuf **pkts, uint16_t count, int16_t dma_id,
> > + uint16_t vchan_id);
> > +/**
> > + * The DMA vChannels used in asynchronous data path must be
> > +configured
> > + * first. So this function needs to be called before enabling DMA
> > + * acceleration for vring. If this function fails, asynchronous data
> > +path
> > + * cannot be enabled for any vring further.
> > + *
> > + * DMA devices used in data-path must belong to DMA devices given in
> > +this
> > + * function. But users are free to use DMA devices given in the
> > +function
> > + * in non-vhost scenarios, only if guarantee no copies in vhost are
> > + * offloaded to them at the same time.
> > + *
> > + * @param dmas_id
> > + * DMA ID array
> > + * @param count
> > + * Element number of 'dmas_id'
> > + * @param poll_factor
> > + * For large or scatter-gather packets, one packet would consist of
> > + * small buffers. In this case, vhost will issue several DMA copy
> > + * operations for the packet. Therefore, the number of copies to
> > + * check by rte_dma_completed() is calculated by "nb_pkts_to_poll *
> > + * poll_factor" andused in rte_vhost_poll_enqueue_completed(). The
> > + * default value of "poll_factor" is 1.
> > + * @return
> > + * 0 on success, and -1 on failure
> > + */
> > +__rte_experimental
> > +int rte_vhost_async_dma_configure(int16_t *dmas_id, uint16_t count,
> > + uint16_t poll_factor);
> >
> > #endif /* _RTE_VHOST_ASYNC_H_ */
> > diff --git a/lib/vhost/version.map b/lib/vhost/version.map index
> > a7ef7f1976..1202ba9c1a 100644
> > --- a/lib/vhost/version.map
> > +++ b/lib/vhost/version.map
> > @@ -84,6 +84,9 @@ EXPERIMENTAL {
> >
> > # added in 21.11
> > rte_vhost_get_monitor_addr;
> > +
> > + # added in 22.03
> > + rte_vhost_async_dma_configure;
> > };
> >
> > INTERNAL {
> > diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c index
> > 13a9bb9dd1..c408cee63e 100644
> > --- a/lib/vhost/vhost.c
> > +++ b/lib/vhost/vhost.c
> > @@ -25,7 +25,7 @@
> > #include "vhost.h"
> > #include "vhost_user.h"
> >
> > -struct virtio_net *vhost_devices[MAX_VHOST_DEVICE];
> > +struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE];
> > pthread_mutex_t vhost_dev_lock = PTHREAD_MUTEX_INITIALIZER;
> >
> > /* Called with iotlb_lock read-locked */ @@ -344,6 +344,7 @@
> > vhost_free_async_mem(struct vhost_virtqueue *vq)
> > return;
> >
> > rte_free(vq->async->pkts_info);
> > + rte_free(vq->async->pkts_cmpl_flag);
> >
> > rte_free(vq->async->buffers_packed);
> > vq->async->buffers_packed = NULL;
> > @@ -667,12 +668,12 @@ vhost_new_device(void)
> > int i;
> >
> > pthread_mutex_lock(&vhost_dev_lock);
> > - for (i = 0; i < MAX_VHOST_DEVICE; i++) {
> > + for (i = 0; i < RTE_MAX_VHOST_DEVICE; i++) {
> > if (vhost_devices[i] == NULL)
> > break;
> > }
> >
> > - if (i == MAX_VHOST_DEVICE) {
> > + if (i == RTE_MAX_VHOST_DEVICE) {
> > VHOST_LOG_CONFIG(ERR,
> > "Failed to find a free slot for new device.\n");
> > pthread_mutex_unlock(&vhost_dev_lock);
> > @@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid,
> > }
> >
> > static __rte_always_inline int
> > -async_channel_register(int vid, uint16_t queue_id,
> > - struct rte_vhost_async_channel_ops *ops)
> > +async_channel_register(int vid, uint16_t queue_id)
> > {
> > struct virtio_net *dev = get_device(vid);
> > struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; @@ -1656,6
> > +1656,14 @@ async_channel_register(int vid, uint16_t queue_id,
> > goto out_free_async;
> > }
> >
> > + async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size *
> sizeof(bool),
> > + RTE_CACHE_LINE_SIZE, node);
> > + if (!async->pkts_cmpl_flag) {
> > + VHOST_LOG_CONFIG(ERR, "failed to allocate async
> pkts_cmpl_flag (vid %d, qid: %d)\n",
> > + vid, queue_id);
> > + goto out_free_async;
> > + }
> > +
> > if (vq_is_packed(dev)) {
> > async->buffers_packed = rte_malloc_socket(NULL,
> > vq->size * sizeof(struct
> vring_used_elem_packed), @@ -1676,9
> > +1684,6 @@ async_channel_register(int vid, uint16_t queue_id,
> > }
> > }
> >
> > - async->ops.check_completed_copies = ops-
> >check_completed_copies;
> > - async->ops.transfer_data = ops->transfer_data;
> > -
> > vq->async = async;
> >
> > return 0;
> > @@ -1691,15 +1696,13 @@ async_channel_register(int vid, uint16_t
> queue_id,
> > }
> >
> > int
> > -rte_vhost_async_channel_register(int vid, uint16_t queue_id,
> > - struct rte_vhost_async_config config,
> > - struct rte_vhost_async_channel_ops *ops)
> > +rte_vhost_async_channel_register(int vid, uint16_t queue_id)
> > {
> > struct vhost_virtqueue *vq;
> > struct virtio_net *dev = get_device(vid);
> > int ret;
> >
> > - if (dev == NULL || ops == NULL)
> > + if (dev == NULL)
> > return -1;
> >
> > if (queue_id >= VHOST_MAX_VRING)
> > @@ -1710,33 +1713,20 @@ rte_vhost_async_channel_register(int vid,
> uint16_t queue_id,
> > if (unlikely(vq == NULL || !dev->async_copy))
> > return -1;
> >
> > - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) {
> > - VHOST_LOG_CONFIG(ERR,
> > - "async copy is not supported on non-inorder mode "
> > - "(vid %d, qid: %d)\n", vid, queue_id);
> > - return -1;
> > - }
> > -
> > - if (unlikely(ops->check_completed_copies == NULL ||
> > - ops->transfer_data == NULL))
> > - return -1;
> > -
> > rte_spinlock_lock(&vq->access_lock);
> > - ret = async_channel_register(vid, queue_id, ops);
> > + ret = async_channel_register(vid, queue_id);
> > rte_spinlock_unlock(&vq->access_lock);
> >
> > return ret;
> > }
> >
> > int
> > -rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t
> queue_id,
> > - struct rte_vhost_async_config config,
> > - struct rte_vhost_async_channel_ops *ops)
> > +rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t
> > +queue_id)
> > {
> > struct vhost_virtqueue *vq;
> > struct virtio_net *dev = get_device(vid);
> >
> > - if (dev == NULL || ops == NULL)
> > + if (dev == NULL)
> > return -1;
> >
> > if (queue_id >= VHOST_MAX_VRING)
> > @@ -1747,18 +1737,7 @@
> rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id,
> > if (unlikely(vq == NULL || !dev->async_copy))
> > return -1;
> >
> > - if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) {
> > - VHOST_LOG_CONFIG(ERR,
> > - "async copy is not supported on non-inorder mode "
> > - "(vid %d, qid: %d)\n", vid, queue_id);
> > - return -1;
> > - }
> > -
> > - if (unlikely(ops->check_completed_copies == NULL ||
> > - ops->transfer_data == NULL))
> > - return -1;
> > -
> > - return async_channel_register(vid, queue_id, ops);
> > + return async_channel_register(vid, queue_id);
> > }
> >
> > int
> > @@ -1835,6 +1814,95 @@
> rte_vhost_async_channel_unregister_thread_unsafe(int vid, uint16_t
> queue_id)
> > return 0;
> > }
> >
> > +static __rte_always_inline void
> > +vhost_free_async_dma_mem(void)
> > +{
> > + uint16_t i;
> > +
> > + for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) {
> > + struct async_dma_info *dma = &dma_copy_track[i];
> > + int16_t j;
> > +
> > + if (dma->max_vchans == 0)
> > + continue;
> > +
> > + for (j = 0; j < dma->max_vchans; j++)
> > + rte_free(dma->vchans[j].pkts_completed_flag);
> > +
> > + rte_free(dma->vchans);
> > + dma->vchans = NULL;
> > + dma->max_vchans = 0;
> > + }
> > +}
> > +
> > +int
> > +rte_vhost_async_dma_configure(int16_t *dmas_id, uint16_t count,
> > +uint16_t poll_factor)
>
> I'm not fan of the poll_factor, I think it is too complex for the user to know
> what value he should set.
It seems so, and users need to know DMA offloading details before setting it.
The simple way is setting the max copies to check by rte_dma_completed to
9728/2048*32, where 9728 is equal to VIRTIO_MAX_RX_PKTLEN and 32 is
MAX_PKT_BURST. It should be able to cover most of cases.
>
> Also, I would like that the API only registers one DMA channel at a time and
> let the application call it multiple times. Dong that, user can still use the DMA
> channels that could be registered.
Sure, I will change it.
>
> > +{
> > + uint16_t i;
> > +
> > + if (!dmas_id) {
> > + VHOST_LOG_CONFIG(ERR, "Invalid DMA configuration
> parameter.\n");
> > + return -1;
> > + }
> > +
> > + if (poll_factor == 0) {
> > + VHOST_LOG_CONFIG(ERR, "Invalid DMA poll factor %u\n",
> poll_factor);
> > + return -1;
> > + }
> > + dma_poll_factor = poll_factor;
> > +
> > + for (i = 0; i < count; i++) {
> > + struct async_dma_vchan_info *vchans;
> > + struct rte_dma_info info;
> > + uint16_t max_vchans;
> > + uint16_t max_desc;
> > + uint16_t j;
> > +
> > + if (!rte_dma_is_valid(dmas_id[i])) {
> > + VHOST_LOG_CONFIG(ERR, "DMA %d is not found.
> Cannot enable async"
> > + " data-path\n.", dmas_id[i]);
> > + vhost_free_async_dma_mem();
> > + return -1;
> > + }
> > +
> > + rte_dma_info_get(dmas_id[i], &info);
> > +
> > + max_vchans = info.max_vchans;
> > + max_desc = info.max_desc;
> > +
> > + if (!rte_is_power_of_2(max_desc))
> > + max_desc = rte_align32pow2(max_desc);
> > +
> > + vchans = rte_zmalloc(NULL, sizeof(struct
> async_dma_vchan_info) * max_vchans,
> > + RTE_CACHE_LINE_SIZE);
> > + if (vchans == NULL) {
> > + VHOST_LOG_CONFIG(ERR, "Failed to allocate vchans
> for dma-%d."
> > + " Cannot enable async data-path.\n",
> dmas_id[i]);
> > + vhost_free_async_dma_mem();
> > + return -1;
> > + }
> > +
> > + for (j = 0; j < max_vchans; j++) {
> > + vchans[j].pkts_completed_flag = rte_zmalloc(NULL,
> sizeof(bool *) * max_desc,
> > + RTE_CACHE_LINE_SIZE);
> > + if (!vchans[j].pkts_completed_flag) {
> > + VHOST_LOG_CONFIG(ERR, "Failed to allocate
> pkts_completed_flag for "
> > + "dma-%d vchan-%u\n",
> dmas_id[i], j);
> > + vhost_free_async_dma_mem();
> > + return -1;
> > + }
> > +
> > + vchans[j].ring_size = max_desc;
> > + vchans[j].ring_mask = max_desc - 1;
> > + }
> > +
> > + dma_copy_track[dmas_id[i]].vchans = vchans;
> > + dma_copy_track[dmas_id[i]].max_vchans = max_vchans;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > int
> > rte_vhost_async_get_inflight(int vid, uint16_t queue_id)
> > {
> > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h index
> > 7085e0885c..475843fec0 100644
> > --- a/lib/vhost/vhost.h
> > +++ b/lib/vhost/vhost.h
> > @@ -19,6 +19,7 @@
> > #include <rte_ether.h>
> > #include <rte_rwlock.h>
> > #include <rte_malloc.h>
> > +#include <rte_dmadev.h>
> >
> > #include "rte_vhost.h"
> > #include "rte_vdpa.h"
> > @@ -50,6 +51,7 @@
> >
> > #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST)
> > #define VHOST_MAX_ASYNC_VEC 2048
> > +#define VHOST_ASYNC_DMA_BATCHING_SIZE 32
> >
> > #define PACKED_DESC_ENQUEUE_USED_FLAG(w) \
> > ((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED |
> > VRING_DESC_F_WRITE) : \ @@ -119,6 +121,42 @@ struct
> vring_used_elem_packed {
> > uint32_t count;
> > };
> >
> > +struct async_dma_vchan_info {
> > + /* circular array to track if packet copy completes */
> > + bool **pkts_completed_flag;
> > +
> > + /* max elements in 'metadata' */
> > + uint16_t ring_size;
> > + /* ring index mask for 'metadata' */
> > + uint16_t ring_mask;
> > +
> > + /* batching copies before a DMA doorbell */
> > + uint16_t nr_batching;
> > +
> > + /**
> > + * DMA virtual channel lock. Although it is able to bind DMA
> > + * virtual channels to data plane threads, vhost control plane
> > + * thread could call data plane functions too, thus causing
> > + * DMA device contention.
> > + *
> > + * For example, in VM exit case, vhost control plane thread needs
> > + * to clear in-flight packets before disable vring, but there could
> > + * be anotther data plane thread is enqueuing packets to the same
> > + * vring with the same DMA virtual channel. But dmadev PMD
> functions
> > + * are lock-free, so the control plane and data plane threads
> > + * could operate the same DMA virtual channel at the same time.
> > + */
> > + rte_spinlock_t dma_lock;
> > +};
> > +
> > +struct async_dma_info {
> > + uint16_t max_vchans;
> > + struct async_dma_vchan_info *vchans; };
> > +
> > +extern struct async_dma_info
> dma_copy_track[RTE_DMADEV_DEFAULT_MAX];
> > +extern uint16_t dma_poll_factor;
> > +
> > /**
> > * inflight async packet information
> > */
> > @@ -129,9 +167,6 @@ struct async_inflight_info {
> > };
> >
> > struct vhost_async {
> > - /* operation callbacks for DMA */
> > - struct rte_vhost_async_channel_ops ops;
> > -
> > struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT];
> > struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC];
> > uint16_t iter_idx;
> > @@ -139,6 +174,25 @@ struct vhost_async {
> >
> > /* data transfer status */
> > struct async_inflight_info *pkts_info;
> > + /**
> > + * Packet reorder array. "true" indicates that DMA device
> > + * completes all copies for the packet.
> > + *
> > + * Note that this array could be written by multiple threads
> > + * simultaneously. For example, in the case of thread0 and
> > + * thread1 RX packets from NIC and then enqueue packets to
> > + * vring0 and vring1 with own DMA device DMA0 and DMA1, it's
> > + * possible for thread0 to get completed copies belonging to
> > + * vring1 from DMA0, while thread0 is calling rte_vhost_poll
> > + * _enqueue_completed() for vring0 and thread1 is calling
> > + * rte_vhost_submit_enqueue_burst() for vring1. In this case,
> > + * vq->access_lock cannot protect pkts_cmpl_flag of vring1.
> > + *
> > + * However, since offloading is per-packet basis, each packet
> > + * flag will only be written by one thread. And single byte
> > + * write is atomic, so no lock for pkts_cmpl_flag is needed.
> > + */
> > + bool *pkts_cmpl_flag;
> > uint16_t pkts_idx;
> > uint16_t pkts_inflight_n;
> > union {
> > @@ -198,6 +252,7 @@ struct vhost_virtqueue {
> > /* Record packed ring first dequeue desc index */
> > uint16_t shadow_last_used_idx;
> >
> > + uint16_t batch_copy_max_elems;
> > uint16_t batch_copy_nb_elems;
> > struct batch_copy_elem *batch_copy_elems;
> > int numa_node;
> > @@ -568,8 +623,7 @@ extern int vhost_data_log_level;
> > #define PRINT_PACKET(device, addr, size, header) do {} while (0)
> > #endif
> >
> > -#define MAX_VHOST_DEVICE 1024
> > -extern struct virtio_net *vhost_devices[MAX_VHOST_DEVICE];
> > +extern struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE];
> >
> > #define VHOST_BINARY_SEARCH_THRESH 256
> >
> > diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c index
> > 5eb1dd6812..3147e72f04 100644
> > --- a/lib/vhost/vhost_user.c
> > +++ b/lib/vhost/vhost_user.c
> > @@ -527,6 +527,8 @@ vhost_user_set_vring_num(struct virtio_net
> **pdev,
> > return RTE_VHOST_MSG_RESULT_ERR;
> > }
> >
> > + vq->batch_copy_max_elems = vq->size;
> > +
>
> I don't understand the point of this new field. But it can be removed anyway
> if we agree to drop the SW fallback.
This is for handling lacking of batch_copy elements in SW fallback.
>
> > return RTE_VHOST_MSG_RESULT_OK;
> > }
> >
> > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c index
> > b3d954aab4..305f6cd562 100644
> > --- a/lib/vhost/virtio_net.c
> > +++ b/lib/vhost/virtio_net.c
> > @@ -11,6 +11,7 @@
> > #include <rte_net.h>
> > #include <rte_ether.h>
> > #include <rte_ip.h>
> > +#include <rte_dmadev.h>
> > #include <rte_vhost.h>
> > #include <rte_tcp.h>
> > #include <rte_udp.h>
> > @@ -25,6 +26,10 @@
> >
> > #define MAX_BATCH_LEN 256
> >
> > +/* DMA device copy operation tracking array. */ struct async_dma_info
> > +dma_copy_track[RTE_DMADEV_DEFAULT_MAX];
> > +uint16_t dma_poll_factor = 1;
> > +
> > static __rte_always_inline bool
> > rxvq_is_mergeable(struct virtio_net *dev)
> > {
> > @@ -43,6 +48,140 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx,
> uint32_t nr_vring)
> > return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring;
> > }
> >
> > +static __rte_always_inline uint16_t
> > +vhost_async_dma_transfer(struct vhost_virtqueue *vq, int16_t dma_id,
> > + uint16_t vchan_id, uint16_t head_idx,
> > + struct rte_vhost_iov_iter *pkts, uint16_t nr_pkts) {
> > + struct async_dma_vchan_info *dma_info =
> &dma_copy_track[dma_id].vchans[vchan_id];
> > + uint16_t ring_mask = dma_info->ring_mask;
> > + uint16_t pkt_idx, bce_idx = 0;
> > +
> > + rte_spinlock_lock(&dma_info->dma_lock);
> > +
> > + for (pkt_idx = 0; pkt_idx < nr_pkts; pkt_idx++) {
> > + struct rte_vhost_iovec *iov = pkts[pkt_idx].iov;
> > + int copy_idx, last_copy_idx = 0;
> > + uint16_t nr_segs = pkts[pkt_idx].nr_segs;
> > + uint16_t nr_sw_copy = 0;
> > + uint16_t i;
> > +
> > + if (rte_dma_burst_capacity(dma_id, vchan_id) < nr_segs)
> > + goto out;
>
> I would consider introducing a vhost_async_dma_transfer_one function to
> avoid nesting too much loops and make the code cleaner.
Sure, I will add it.
>
> > + for (i = 0; i < nr_segs; i++) {
> > + /* Fallback to SW copy if error happens */
> > + copy_idx = rte_dma_copy(dma_id, vchan_id,
> (rte_iova_t)iov[i].src_addr,
> > + (rte_iova_t)iov[i].dst_addr, iov[i].len,
> > + RTE_DMA_OP_FLAG_LLC);
> > + if (unlikely(copy_idx < 0)) {
>
> The DMA channel is protected by a lock, and we check the capacity before
> initiating the copy.
> So I don't expect rte_dma_copy() to fail because of lack of capacity. If an
> error happens, that is a serious one.
>
> So, I wonder whether having a SW fallback makes sense. Code would be
> much simpler if we just exit early if an error happens. Logging an error
> message instead would help debugging. Certainly with rate limiting not to
> flood the log file.
That's correct. If error really happens in this case, it means DMA definitely
has something wrong. Better to stop async data-path and debug. SW fallback
may hide this serious issue.
If no objections, I will remove SW fallback but adding an error log.
>
> > + /* Find corresponding VA pair and do SW
> copy */
> > + rte_memcpy(vq-
> >batch_copy_elems[bce_idx].dst,
> > + vq-
> >batch_copy_elems[bce_idx].src,
> > + vq-
> >batch_copy_elems[bce_idx].len);
> > + nr_sw_copy++;
> > +
> > + /**
> > + * All copies of the packet are performed
> > + * by the CPU, set the packet completion flag
> > + * to true, as all copies are done.
> > + */
>
> I think it would better be moved out of the loop to avoid doing the check for
> every segment while only the last one has a chance to match.
I didn't get the point. How to get rid of the loop? Do you suggest to do SW copy
for all left copies once one DMA error happens (if SW copy is kept)?
>
> > + if (nr_sw_copy == nr_segs) {
> > + vq->async->pkts_cmpl_flag[head_idx %
> vq->size] = true;
> > + break;
> > + } else if (i == (nr_segs - 1)) {
> > + /**
> > + * A part of copies of current packet
> > + * are enqueued to the DMA
> successfully
> > + * but the last copy fails, store the
> > + * packet completion flag address
> > + * in the last DMA copy slot.
> > + */
> > + dma_info-
> >pkts_completed_flag[last_copy_idx & ring_mask] =
> > + &vq->async-
> >pkts_cmpl_flag[head_idx % vq->size];
> > + break;
> > + }
> > + } else
> > + last_copy_idx = copy_idx;
>
> Braces on the else as you have braces for the if statement.
>
> > +
> > + bce_idx++;
> > +
> > + /**
> > + * Only store packet completion flag address in the
> last copy's
> > + * slot, and other slots are set to NULL.
> > + */
> > + if (i == (nr_segs - 1)) {
> > + dma_info->pkts_completed_flag[copy_idx &
> ring_mask] =
> > + &vq->async-
> >pkts_cmpl_flag[head_idx % vq->size];
> > + }
> > + }
> > +
> > + dma_info->nr_batching += nr_segs;
> > + if (unlikely(dma_info->nr_batching >=
> VHOST_ASYNC_DMA_BATCHING_SIZE)) {
> > + rte_dma_submit(dma_id, vchan_id);
> > + dma_info->nr_batching = 0;
> > + }
>
> I wonder whether we could just remove this submit.
> I don't expect completions to happen between two packets as the DMA
> channel is protected by a lock, so my understanding is once the DMA ring is
> full, we just end-up exiting early because DMA channel capacity is checked
> for every packet.
>
> Removing it will maybe improve performance a (very) little bit, but will
> certainly make the code simpler to follow.
Good suggestion. I will remove it.
>
> > +
> > + head_idx++;
> > + }
> > +
> > +out:
> > + if (dma_info->nr_batching > 0) {
>
> if (likely(...))
Sure, I will add later.
>
> > + rte_dma_submit(dma_id, vchan_id);
> > + dma_info->nr_batching = 0;
> > + }
> > + rte_spinlock_unlock(&dma_info->dma_lock);
> > + vq->batch_copy_nb_elems = 0;
> > +
> > + return pkt_idx;
> > +}
> > +
> > +static __rte_always_inline uint16_t
> > +vhost_async_dma_check_completed(int16_t dma_id, uint16_t vchan_id,
> > +uint16_t max_pkts) {
> > + struct async_dma_vchan_info *dma_info =
> &dma_copy_track[dma_id].vchans[vchan_id];
> > + uint16_t ring_mask = dma_info->ring_mask;
> > + uint16_t last_idx = 0;
> > + uint16_t nr_copies;
> > + uint16_t copy_idx;
> > + uint16_t i;
> > + bool has_error = false;
> > +
> > + rte_spinlock_lock(&dma_info->dma_lock);
> > +
> > + /**
> > + * Print error log for debugging, if DMA reports error during
> > + * DMA transfer. We do not handle error in vhost level.
> > + */
> > + nr_copies = rte_dma_completed(dma_id, vchan_id, max_pkts,
> &last_idx, &has_error);
> > + if (unlikely(has_error)) {
> > + VHOST_LOG_DATA(ERR, "dma %d vchannel %u reports error
> in rte_dma_completed()\n",
> > + dma_id, vchan_id);
>
> I wonder if rate limiting would
Sure, I will avoid log flooding.
>
> > + } else if (nr_copies == 0)
> > + goto out;
> > +
> > + copy_idx = last_idx - nr_copies + 1;
> > + for (i = 0; i < nr_copies; i++) {
> > + bool *flag;
> > +
> > + flag = dma_info->pkts_completed_flag[copy_idx &
> ring_mask];
> > + if (flag) {
> > + /**
> > + * Mark the packet flag as received. The flag
> > + * could belong to another virtqueue but write
> > + * is atomic.
> > + */
> > + *flag = true;
> > + dma_info->pkts_completed_flag[copy_idx &
> ring_mask] = NULL;
> > + }
> > + copy_idx++;
> > + }
> > +
> > +out:
> > + rte_spinlock_unlock(&dma_info->dma_lock);
> > + return nr_copies;
> > +}
> > +
> > static inline void
> > do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq)
> > {
> > @@ -865,12 +1004,13 @@ async_iter_reset(struct vhost_async *async)
> > static __rte_always_inline int
> > async_mbuf_to_desc_seg(struct virtio_net *dev, struct vhost_virtqueue
> *vq,
> > struct rte_mbuf *m, uint32_t mbuf_offset,
> > - uint64_t buf_iova, uint32_t cpy_len)
> > + uint64_t buf_addr, uint64_t buf_iova, uint32_t cpy_len)
> > {
> > struct vhost_async *async = vq->async;
> > uint64_t mapped_len;
> > uint32_t buf_offset = 0;
> > void *hpa;
> > + struct batch_copy_elem *bce = vq->batch_copy_elems;
> >
> > while (cpy_len) {
> > hpa = (void *)(uintptr_t)gpa_to_first_hpa(dev,
> > @@ -886,6 +1026,31 @@ async_mbuf_to_desc_seg(struct virtio_net *dev,
> struct vhost_virtqueue *vq,
> > hpa, (size_t)mapped_len)))
> > return -1;
> >
> > + /**
> > + * Keep VA for all IOVA segments for falling back to SW
> > + * copy in case of rte_dma_copy() error.
> > + */
>
> As said below, I think we could get rid off the SW fallback.
Like the replies above, a better way for me is to remove SW fallback.
> But in case we didn't, I think it would be prefferable to change the
> rte_vhost_iovec struct to have both the iova and the VA, that would make
> the code simpler.
>
> Also, while looking at this, I notice the structs rte_vhost_iov_iter and
> rte_vhost_iovec are still part of the Vhost API, but it should not be necessary
> now since application no more need to know about it.
Good catch, and I will change it later.
>
> > + if (unlikely(vq->batch_copy_nb_elems >= vq-
> >batch_copy_max_elems)) {
> > + struct batch_copy_elem *tmp;
> > + uint16_t nb_elems = 2 * vq->batch_copy_max_elems;
> > +
> > + VHOST_LOG_DATA(DEBUG, "(%d) %s: run out of
> batch_copy_elems, "
> > + "and realloc double elements.\n",
> dev->vid, __func__);
> > + tmp = rte_realloc_socket(vq->batch_copy_elems,
> nb_elems * sizeof(*tmp),
> > + RTE_CACHE_LINE_SIZE, vq-
> >numa_node);
> > + if (!tmp) {
> > + VHOST_LOG_DATA(ERR, "Failed to re-alloc
> batch_copy_elems\n");
> > + return -1;
> > + }
> > +
> > + vq->batch_copy_max_elems = nb_elems;
> > + vq->batch_copy_elems = tmp;
> > + bce = tmp;
> > + }
> > + bce[vq->batch_copy_nb_elems].dst = (void
> *)((uintptr_t)(buf_addr + buf_offset));
> > + bce[vq->batch_copy_nb_elems].src =
> rte_pktmbuf_mtod_offset(m, void *, mbuf_offset);
> > + bce[vq->batch_copy_nb_elems++].len = mapped_len;
> > +
> > cpy_len -= (uint32_t)mapped_len;
> > mbuf_offset += (uint32_t)mapped_len;
> > buf_offset += (uint32_t)mapped_len; @@ -901,7 +1066,8
> @@
> > sync_mbuf_to_desc_seg(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > {
> > struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
> >
> > - if (likely(cpy_len > MAX_BATCH_LEN || vq->batch_copy_nb_elems >=
> vq->size)) {
> > + if (likely(cpy_len > MAX_BATCH_LEN ||
> > + vq->batch_copy_nb_elems >= vq-
> >batch_copy_max_elems)) {
> > rte_memcpy((void *)((uintptr_t)(buf_addr)),
> > rte_pktmbuf_mtod_offset(m, void *,
> mbuf_offset),
> > cpy_len);
> > @@ -1020,8 +1186,10 @@ mbuf_to_desc(struct virtio_net *dev, struct
> > vhost_virtqueue *vq,
> >
> > if (is_async) {
> > if (async_mbuf_to_desc_seg(dev, vq, m, mbuf_offset,
> > + buf_addr + buf_offset,
> > buf_iova + buf_offset,
> cpy_len) < 0)
> > goto error;
> > +
>
> Remove new line.
I will remove it later.
Thanks,
Jiayu
next prev parent reply other threads:[~2022-02-07 1:35 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-11-22 10:54 [RFC 0/1] integrate dmadev in vhost Jiayu Hu
2021-11-22 10:54 ` [RFC 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu
2021-12-24 10:39 ` Maxime Coquelin
2021-12-28 1:15 ` Hu, Jiayu
2022-01-03 10:26 ` Maxime Coquelin
2022-01-06 5:46 ` Hu, Jiayu
2021-12-03 3:49 ` [RFC 0/1] integrate dmadev in vhost fengchengwen
2021-12-30 21:55 ` [PATCH v1 " Jiayu Hu
2021-12-30 21:55 ` [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu
2021-12-31 0:55 ` Liang Ma
2022-01-14 6:30 ` Xia, Chenbo
2022-01-17 5:39 ` Hu, Jiayu
2022-01-19 2:18 ` Xia, Chenbo
2022-01-20 17:00 ` Maxime Coquelin
2022-01-21 1:56 ` Hu, Jiayu
2022-01-24 16:40 ` [PATCH v2 0/1] integrate dmadev in vhost Jiayu Hu
2022-01-24 16:40 ` [PATCH v2 1/1] vhost: integrate dmadev in asynchronous datapath Jiayu Hu
2022-02-03 13:04 ` Maxime Coquelin
2022-02-07 1:34 ` Hu, Jiayu [this message]
2022-02-08 10:40 ` [PATCH v3 0/1] integrate dmadev in vhost Jiayu Hu
2022-02-08 10:40 ` [PATCH v3 1/1] vhost: integrate dmadev in asynchronous data-path Jiayu Hu
2022-02-08 17:46 ` Maxime Coquelin
2022-02-09 12:51 ` [PATCH v4 0/1] integrate dmadev in vhost Jiayu Hu
2022-02-09 12:51 ` [PATCH v4 1/1] vhost: integrate dmadev in asynchronous data-path Jiayu Hu
2022-02-10 7:58 ` Yang, YvonneX
2022-02-10 13:44 ` Maxime Coquelin
2022-02-10 15:14 ` Maxime Coquelin
2022-02-10 20:50 ` Ferruh Yigit
2022-02-10 21:01 ` Maxime Coquelin
2022-02-10 20:56 ` Ferruh Yigit
2022-02-10 21:00 ` Maxime Coquelin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=74b39fe943e740bea4e8d5a6358b99b2@intel.com \
--to=jiayu.hu@intel.com \
--cc=bruce.richardson@intel.com \
--cc=chenbo.xia@intel.com \
--cc=cheng1.jiang@intel.com \
--cc=dev@dpdk.org \
--cc=harry.van.haaren@intel.com \
--cc=i.maximets@ovn.org \
--cc=john.mcnamara@intel.com \
--cc=liangma@liangbit.com \
--cc=maxime.coquelin@redhat.com \
--cc=sunil.pai.g@intel.com \
--cc=xuan.ding@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).