From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 0716BA04FD; Mon, 11 Apr 2022 12:01:24 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id EA94D427F0; Mon, 11 Apr 2022 12:01:15 +0200 (CEST) Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by mails.dpdk.org (Postfix) with ESMTP id CC315427F0 for ; Mon, 11 Apr 2022 12:01:13 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1649671274; x=1681207274; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=0o02K3DrzGuP0SbXdzldW3mht4D27FLKzrn95L9rSxY=; b=jc5jvw/3sKOFm7q0rJ8Eg+DfacR1jm2yYXhY24WO96d2CP5LVWoC5qrW J6ysV+RI/7OpZQ62Uk/LmlAGlrmQ4rOXT+7UTv3c1HFuIbmgvFVmWAg51 O7MAbE8tRnnDjHg0/2H+q2UCxYfj7uxAGKh2QMxWM3oK8Z1k8xG7K690C cw4DWMR1v7YH+oYrV4EUjqiyrifePtS1n3flk2LIFQylXZNC50MgZzYbq THRyV5SyAWLZD9i0JaA6vQlUvBOpJTOH8FRJGtaoyMeNFDjXvxBDyyR4y otxo32yn+GMWTDfUK3tk/nqJUa+jzXzAoh+H7iIF9nAEubFFG6mqIeeJm A==; X-IronPort-AV: E=McAfee;i="6400,9594,10313"; a="249361632" X-IronPort-AV: E=Sophos;i="5.90,251,1643702400"; d="scan'208";a="249361632" Received: from orsmga002.jf.intel.com ([10.7.209.21]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 11 Apr 2022 03:01:13 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,251,1643702400"; d="scan'208";a="525446328" Received: from npg-dpdk-xuan-cbdma.sh.intel.com ([10.67.110.228]) by orsmga002.jf.intel.com with ESMTP; 11 Apr 2022 03:01:10 -0700 From: xuan.ding@intel.com To: maxime.coquelin@redhat.com, chenbo.xia@intel.com Cc: dev@dpdk.org, jiayu.hu@intel.com, cheng1.jiang@intel.com, sunil.pai.g@intel.com, liangma@liangbit.com, Xuan Ding , Yuan Wang Subject: [PATCH v2 4/5] vhost: support async dequeue for split ring Date: Mon, 11 Apr 2022 10:00:31 +0000 Message-Id: <20220411100032.114434-5-xuan.ding@intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220411100032.114434-1-xuan.ding@intel.com> References: <20220407152546.38167-1-xuan.ding@intel.com> <20220411100032.114434-1-xuan.ding@intel.com> X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org From: Xuan Ding This patch implements asynchronous dequeue data path for vhost split ring, a new API rte_vhost_async_try_dequeue_burst() is introduced. Signed-off-by: Xuan Ding Signed-off-by: Yuan Wang --- doc/guides/prog_guide/vhost_lib.rst | 7 + doc/guides/rel_notes/release_22_07.rst | 4 + lib/vhost/rte_vhost_async.h | 33 +++ lib/vhost/version.map | 3 + lib/vhost/virtio_net.c | 335 +++++++++++++++++++++++++ 5 files changed, 382 insertions(+) diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst index 886f8f5e72..40cf315170 100644 --- a/doc/guides/prog_guide/vhost_lib.rst +++ b/doc/guides/prog_guide/vhost_lib.rst @@ -276,6 +276,13 @@ The following is an overview of some key Vhost API functions: Clear inflight packets which are submitted to DMA engine in vhost async data path. Completed packets are returned to applications through ``pkts``. +* ``rte_vhost_async_try_dequeue_burst(int vid, uint16_t queue_id, + struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count, + int *nr_inflight, uint16_t dma_id, uint16_t vchan_id)`` + + Receives (dequeues) ``count`` packets from guest to host in async data path, + and stored them at ``pkts``. + Vhost-user Implementations -------------------------- diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst index 42a5f2d990..422a6673cb 100644 --- a/doc/guides/rel_notes/release_22_07.rst +++ b/doc/guides/rel_notes/release_22_07.rst @@ -55,6 +55,10 @@ New Features Also, make sure to start the actual text at the margin. ======================================================= +* **Added vhost async dequeue API to receive pkts from guest.** + + Added vhost async dequeue API which can leverage DMA devices to accelerate + receiving pkts from guest. Removed Items ------------- diff --git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h index f1293c6a9d..23fe1a7316 100644 --- a/lib/vhost/rte_vhost_async.h +++ b/lib/vhost/rte_vhost_async.h @@ -187,6 +187,39 @@ uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id, __rte_experimental int rte_vhost_async_dma_configure(int16_t dma_id, uint16_t vchan_id); +/** + * This function tries to receive packets from the guest with offloading + * copies to the async channel. The packets that are transfer completed + * are returned in "pkts". The other packets that their copies are submitted to + * the async channel but not completed are called "in-flight packets". + * This function will not return in-flight packets until their copies are + * completed by the async channel. + * + * @param vid + * ID of vhost device to dequeue data + * @param queue_id + * ID of virtqueue to dequeue data + * @param mbuf_pool + * Mbuf_pool where host mbuf is allocated + * @param pkts + * Blank array to keep successfully dequeued packets + * @param count + * Size of the packet array + * @param nr_inflight + * The amount of in-flight packets. If error occurred, its value is set to -1. + * @param dma_id + * The identifier of DMA device + * @param vchan_id + * The identifier of virtual DMA channel + * @return + * Number of successfully dequeued packets + */ +__rte_experimental +uint16_t +rte_vhost_async_try_dequeue_burst(int vid, uint16_t queue_id, + struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count, + int *nr_inflight, uint16_t dma_id, uint16_t vchan_id); + #ifdef __cplusplus } #endif diff --git a/lib/vhost/version.map b/lib/vhost/version.map index 0a66c5840c..514e3ff6a6 100644 --- a/lib/vhost/version.map +++ b/lib/vhost/version.map @@ -87,6 +87,9 @@ EXPERIMENTAL { # added in 22.03 rte_vhost_async_dma_configure; + + # added in 22.07 + rte_vhost_async_try_dequeue_burst; }; INTERNAL { diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c index 56904ad9a5..514315ef50 100644 --- a/lib/vhost/virtio_net.c +++ b/lib/vhost/virtio_net.c @@ -3166,3 +3166,338 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id, return count; } + +static __rte_always_inline uint16_t +async_poll_dequeue_completed_split(struct virtio_net *dev, uint16_t queue_id, + struct rte_mbuf **pkts, uint16_t count, uint16_t dma_id, + uint16_t vchan_id, bool legacy_ol_flags) +{ + uint16_t start_idx, from, i; + uint16_t nr_cpl_pkts = 0; + struct async_inflight_info *pkts_info; + struct vhost_virtqueue *vq = dev->virtqueue[queue_id]; + + pkts_info = vq->async->pkts_info; + + vhost_async_dma_check_completed(dev, dma_id, vchan_id, VHOST_DMA_MAX_COPY_COMPLETE); + + start_idx = async_get_first_inflight_pkt_idx(vq); + + from = start_idx; + while (vq->async->pkts_cmpl_flag[from] && count--) { + vq->async->pkts_cmpl_flag[from] = false; + from = (from + 1) & (vq->size - 1); + nr_cpl_pkts++; + } + + if (nr_cpl_pkts == 0) + return 0; + + for (i = 0; i < nr_cpl_pkts; i++) { + from = (start_idx + i) & (vq->size - 1); + pkts[i] = pkts_info[from].mbuf; + + if (virtio_net_with_host_offload(dev)) + vhost_dequeue_offload(dev, &pkts_info[from].nethdr, pkts[i], + legacy_ol_flags); + } + + /* write back completed descs to used ring and update used idx */ + write_back_completed_descs_split(vq, nr_cpl_pkts); + __atomic_add_fetch(&vq->used->idx, nr_cpl_pkts, __ATOMIC_RELEASE); + vhost_vring_call_split(dev, vq); + + vq->async->pkts_inflight_n -= nr_cpl_pkts; + + return nr_cpl_pkts; +} + +static __rte_always_inline uint16_t +virtio_dev_tx_async_split(struct virtio_net *dev, struct vhost_virtqueue *vq, + uint16_t queue_id, struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, + uint16_t count, uint16_t dma_id, uint16_t vchan_id, bool legacy_ol_flags) +{ + static bool allocerr_warned; + bool dropped = false; + uint16_t free_entries; + uint16_t pkt_idx, slot_idx = 0; + uint16_t nr_done_pkts = 0; + uint16_t pkt_err = 0; + uint16_t n_xfer; + struct vhost_async *async = vq->async; + struct async_inflight_info *pkts_info = async->pkts_info; + struct rte_mbuf *pkts_prealloc[MAX_PKT_BURST]; + uint16_t pkts_size = count; + + /** + * The ordering between avail index and + * desc reads needs to be enforced. + */ + free_entries = __atomic_load_n(&vq->avail->idx, __ATOMIC_ACQUIRE) - vq->last_avail_idx; + if (free_entries == 0) + goto out; + + rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]); + + async_iter_reset(async); + + count = RTE_MIN(count, MAX_PKT_BURST); + count = RTE_MIN(count, free_entries); + VHOST_LOG_DATA(DEBUG, "(%s) about to dequeue %u buffers\n", dev->ifname, count); + + if (rte_pktmbuf_alloc_bulk(mbuf_pool, pkts_prealloc, count)) + goto out; + + for (pkt_idx = 0; pkt_idx < count; pkt_idx++) { + uint16_t head_idx = 0; + uint16_t nr_vec = 0; + uint16_t to; + uint32_t buf_len; + int err; + struct buf_vector buf_vec[BUF_VECTOR_MAX]; + struct rte_mbuf *pkt = pkts_prealloc[pkt_idx]; + + if (unlikely(fill_vec_buf_split(dev, vq, vq->last_avail_idx, + &nr_vec, buf_vec, + &head_idx, &buf_len, + VHOST_ACCESS_RO) < 0)) { + dropped = true; + break; + } + + err = virtio_dev_pktmbuf_prep(dev, pkt, buf_len); + if (unlikely(err)) { + /** + * mbuf allocation fails for jumbo packets when external + * buffer allocation is not allowed and linear buffer + * is required. Drop this packet. + */ + if (!allocerr_warned) { + VHOST_LOG_DATA(ERR, + "(%s) %s: Failed mbuf alloc of size %d from %s\n", + dev->ifname, __func__, buf_len, mbuf_pool->name); + allocerr_warned = true; + } + dropped = true; + break; + } + + slot_idx = (async->pkts_idx + pkt_idx) & (vq->size - 1); + err = desc_to_mbuf(dev, vq, buf_vec, nr_vec, pkt, mbuf_pool, + legacy_ol_flags, slot_idx, true); + if (unlikely(err)) { + if (!allocerr_warned) { + VHOST_LOG_DATA(ERR, + "(%s) %s: Failed to offload copies to async channel.\n", + dev->ifname, __func__); + allocerr_warned = true; + } + dropped = true; + break; + } + + pkts_info[slot_idx].mbuf = pkt; + + /* store used descs */ + to = async->desc_idx_split & (vq->size - 1); + async->descs_split[to].id = head_idx; + async->descs_split[to].len = 0; + async->desc_idx_split++; + + vq->last_avail_idx++; + } + + if (unlikely(dropped)) + rte_pktmbuf_free_bulk(&pkts_prealloc[pkt_idx], count - pkt_idx); + + n_xfer = vhost_async_dma_transfer(dev, vq, dma_id, vchan_id, async->pkts_idx, + async->iov_iter, pkt_idx); + + async->pkts_inflight_n += n_xfer; + + pkt_err = pkt_idx - n_xfer; + if (unlikely(pkt_err)) { + VHOST_LOG_DATA(DEBUG, + "(%s) %s: failed to transfer data for queue id %d.\n", + dev->ifname, __func__, queue_id); + + pkt_idx = n_xfer; + /* recover available ring */ + vq->last_avail_idx -= pkt_err; + + /** + * recover async channel copy related structures and free pktmbufs + * for error pkts. + */ + async->desc_idx_split -= pkt_err; + while (pkt_err-- > 0) { + rte_pktmbuf_free(pkts_info[slot_idx & (vq->size - 1)].mbuf); + slot_idx--; + } + } + + async->pkts_idx += pkt_idx; + if (async->pkts_idx >= vq->size) + async->pkts_idx -= vq->size; + +out: + /* DMA device may serve other queues, unconditionally check completed. */ + nr_done_pkts = async_poll_dequeue_completed_split(dev, queue_id, pkts, pkts_size, + dma_id, vchan_id, legacy_ol_flags); + + return nr_done_pkts; +} + +__rte_noinline +static uint16_t +virtio_dev_tx_async_split_legacy(struct virtio_net *dev, + struct vhost_virtqueue *vq, uint16_t queue_id, + struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, + uint16_t count, uint16_t dma_id, uint16_t vchan_id) +{ + return virtio_dev_tx_async_split(dev, vq, queue_id, mbuf_pool, + pkts, count, dma_id, vchan_id, true); +} + +__rte_noinline +static uint16_t +virtio_dev_tx_async_split_compliant(struct virtio_net *dev, + struct vhost_virtqueue *vq, uint16_t queue_id, + struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, + uint16_t count, uint16_t dma_id, uint16_t vchan_id) +{ + return virtio_dev_tx_async_split(dev, vq, queue_id, mbuf_pool, + pkts, count, dma_id, vchan_id, false); +} + +uint16_t +rte_vhost_async_try_dequeue_burst(int vid, uint16_t queue_id, + struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count, + int *nr_inflight, uint16_t dma_id, uint16_t vchan_id) +{ + struct virtio_net *dev; + struct rte_mbuf *rarp_mbuf = NULL; + struct vhost_virtqueue *vq; + int16_t success = 1; + + *nr_inflight = -1; + + dev = get_device(vid); + if (!dev) + return 0; + + if (unlikely(!(dev->flags & VIRTIO_DEV_BUILTIN_VIRTIO_NET))) { + VHOST_LOG_DATA(ERR, + "(%s) %s: built-in vhost net backend is disabled.\n", + dev->ifname, __func__); + return 0; + } + + if (unlikely(!is_valid_virt_queue_idx(queue_id, 1, dev->nr_vring))) { + VHOST_LOG_DATA(ERR, + "(%s) %s: invalid virtqueue idx %d.\n", + dev->ifname, __func__, queue_id); + return 0; + } + + if (unlikely(!dma_copy_track[dma_id].vchans || + !dma_copy_track[dma_id].vchans[vchan_id].pkts_cmpl_flag_addr)) { + VHOST_LOG_DATA(ERR, "(%s) %s: invalid channel %d:%u.\n", dev->ifname, __func__, + dma_id, vchan_id); + return 0; + } + + vq = dev->virtqueue[queue_id]; + + if (unlikely(rte_spinlock_trylock(&vq->access_lock) == 0)) + return 0; + + if (unlikely(vq->enabled == 0)) { + count = 0; + goto out_access_unlock; + } + + if (unlikely(!vq->async)) { + VHOST_LOG_DATA(ERR, "(%s) %s: async not registered for queue id %d.\n", + dev->ifname, __func__, queue_id); + count = 0; + goto out_access_unlock; + } + + if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) + vhost_user_iotlb_rd_lock(vq); + + if (unlikely(vq->access_ok == 0)) + if (unlikely(vring_translate(dev, vq) < 0)) { + count = 0; + goto out; + } + + /* + * Construct a RARP broadcast packet, and inject it to the "pkts" + * array, to looks like that guest actually send such packet. + * + * Check user_send_rarp() for more information. + * + * broadcast_rarp shares a cacheline in the virtio_net structure + * with some fields that are accessed during enqueue and + * __atomic_compare_exchange_n causes a write if performed compare + * and exchange. This could result in false sharing between enqueue + * and dequeue. + * + * Prevent unnecessary false sharing by reading broadcast_rarp first + * and only performing compare and exchange if the read indicates it + * is likely to be set. + */ + if (unlikely(__atomic_load_n(&dev->broadcast_rarp, __ATOMIC_ACQUIRE) && + __atomic_compare_exchange_n(&dev->broadcast_rarp, + &success, 0, 0, __ATOMIC_RELEASE, __ATOMIC_RELAXED))) { + + rarp_mbuf = rte_net_make_rarp_packet(mbuf_pool, &dev->mac); + if (rarp_mbuf == NULL) { + VHOST_LOG_DATA(ERR, "Failed to make RARP packet.\n"); + count = 0; + goto out; + } + /* + * Inject it to the head of "pkts" array, so that switch's mac + * learning table will get updated first. + */ + pkts[0] = rarp_mbuf; + pkts++; + count -= 1; + } + + if (unlikely(vq_is_packed(dev))) { + static bool not_support_pack_log; + if (!not_support_pack_log) { + VHOST_LOG_DATA(ERR, + "(%s) %s: async dequeue does not support packed ring.\n", + dev->ifname, __func__); + not_support_pack_log = true; + } + count = 0; + goto out; + } + + if (dev->flags & VIRTIO_DEV_LEGACY_OL_FLAGS) + count = virtio_dev_tx_async_split_legacy(dev, vq, queue_id, + mbuf_pool, pkts, count, dma_id, vchan_id); + else + count = virtio_dev_tx_async_split_compliant(dev, vq, queue_id, + mbuf_pool, pkts, count, dma_id, vchan_id); + + *nr_inflight = vq->async->pkts_inflight_n; + +out: + if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) + vhost_user_iotlb_rd_unlock(vq); + +out_access_unlock: + rte_spinlock_unlock(&vq->access_lock); + + if (unlikely(rarp_mbuf != NULL)) + count += 1; + + return count; +} -- 2.17.1