From mboxrd@z Thu Jan  1 00:00:00 1970
From: xuan.ding@intel.com
To: maxime.coquelin@redhat.com, chenbo.xia@intel.com
Cc: dev@dpdk.org, jiayu.hu@intel.com, sunil.pai.g@intel.com,
 bruce.richardson@intel.com, harry.van.haaren@intel.com, yong.liu@intel.com,
 Xuan Ding
Date: Mon, 31 May 2021 15:06:29 +0000
Message-Id: <20210531150629.35020-1-xuan.ding@intel.com>
Subject: [dpdk-dev] [PATCH v1] lib/vhost: enable IOMMU for async vhost
List-Id: DPDK patches and discussions

From: Xuan Ding

For async copy, it is unsafe to directly use the physical address.
Moreover, the current software translation from GPA to HPA also consumes
CPU cycles. Both issues can be addressed by the IOMMU.

Since the existing DMA engines support the platform IOMMU, this patch
enables IOMMU for async vhost: IOAT devices are defined to use virtual
addresses instead of physical addresses. When the memory table is set,
the frontend's memory is mapped into the default VFIO container of DPDK,
to which the IOAT devices have been added. When a DMA copy fails, the
virtual addresses provided to the IOAT devices also allow falling back
to a SW copy or a PA copy.

With IOMMU enabled, to use IOAT devices:
1. IOAT devices must be bound to vfio-pci, rather than igb_uio.
2. DPDK must use "--iova-mode=va".

Signed-off-by: Xuan Ding
---
 doc/guides/prog_guide/vhost_lib.rst |  17 +++++
 lib/vhost/vhost_user.c              | 102 ++++------------------------
 lib/vhost/virtio_net.c              |  30 +++-----
 3 files changed, 41 insertions(+), 108 deletions(-)

diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index d18fb98910..6b7206bc1d 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -420,3 +420,20 @@ Finally, a set of device ops is defined for device specific operations:
   * ``get_notify_area``

     Called to get the notify area info of the queue.
+
+Vhost async data path
+---------------------
+
+* Address mode
+
+  Modern IOAT devices support the IOMMU, which avoids using the unsafe
+  HPA. Besides, the CPU cycles taken by SW to translate from GPA to HPA
+  can be saved. So IOAT devices are defined to use virtual addresses
+  instead of physical addresses.
+
+  With IOMMU enabled, to use IOAT devices:
+  1. IOAT devices must be bound to vfio-pci, rather than igb_uio.
+  2. DPDK must use ``--iova-mode=va``.
+
+* Fallback
+
+  When the DMA copy fails, the user who implements the transfer_data
+  callback can fall back to a SW copy or to PA through
+  rte_mem_virt2iova().
diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
index 8f0eba6412..4d562e0091 100644
--- a/lib/vhost/vhost_user.c
+++ b/lib/vhost/vhost_user.c
@@ -45,6 +45,7 @@
 #include
 #include
 #include
+#include

 #include "iotlb.h"
 #include "vhost.h"
@@ -866,87 +867,6 @@ vhost_user_set_vring_base(struct virtio_net **pdev,
 	return RTE_VHOST_MSG_RESULT_OK;
 }

-static int
-add_one_guest_page(struct virtio_net *dev, uint64_t guest_phys_addr,
-		uint64_t host_phys_addr, uint64_t size)
-{
-	struct guest_page *page, *last_page;
-	struct guest_page *old_pages;
-
-	if (dev->nr_guest_pages == dev->max_guest_pages) {
-		dev->max_guest_pages *= 2;
-		old_pages = dev->guest_pages;
-		dev->guest_pages = rte_realloc(dev->guest_pages,
-					dev->max_guest_pages * sizeof(*page),
-					RTE_CACHE_LINE_SIZE);
-		if (dev->guest_pages == NULL) {
-			VHOST_LOG_CONFIG(ERR, "cannot realloc guest_pages\n");
-			rte_free(old_pages);
-			return -1;
-		}
-	}
-
-	if (dev->nr_guest_pages > 0) {
-		last_page = &dev->guest_pages[dev->nr_guest_pages - 1];
-		/* merge if the two pages are continuous */
-		if (host_phys_addr == last_page->host_phys_addr +
-				last_page->size) {
-			last_page->size += size;
-			return 0;
-		}
-	}
-
-	page = &dev->guest_pages[dev->nr_guest_pages++];
-	page->guest_phys_addr = guest_phys_addr;
-	page->host_phys_addr = host_phys_addr;
-	page->size = size;
-
-	return 0;
-}
-
-static int
-add_guest_pages(struct virtio_net *dev, struct rte_vhost_mem_region *reg,
-		uint64_t page_size)
-{
-	uint64_t reg_size = reg->size;
-	uint64_t host_user_addr = reg->host_user_addr;
-	uint64_t guest_phys_addr = reg->guest_phys_addr;
-	uint64_t host_phys_addr;
-	uint64_t size;
-
-	host_phys_addr = rte_mem_virt2iova((void *)(uintptr_t)host_user_addr);
-	size = page_size - (guest_phys_addr & (page_size - 1));
-	size = RTE_MIN(size, reg_size);
-
-	if (add_one_guest_page(dev, guest_phys_addr, host_phys_addr, size) < 0)
-		return -1;
-
-	host_user_addr += size;
-	guest_phys_addr += size;
-	reg_size -= size;
-
-	while (reg_size > 0) {
-		size = RTE_MIN(reg_size, page_size);
-		host_phys_addr = rte_mem_virt2iova((void *)(uintptr_t)
-						host_user_addr);
-		if (add_one_guest_page(dev, guest_phys_addr, host_phys_addr,
-				size) < 0)
-			return -1;
-
-		host_user_addr += size;
-		guest_phys_addr += size;
-		reg_size -= size;
-	}
-
-	/* sort guest page array if over binary search threshold */
-	if (dev->nr_guest_pages >= VHOST_BINARY_SEARCH_THRESH) {
-		qsort((void *)dev->guest_pages, dev->nr_guest_pages,
-			sizeof(struct guest_page), guest_page_addrcmp);
-	}
-
-	return 0;
-}
-
 #ifdef RTE_LIBRTE_VHOST_DEBUG
 /* TODO: enable it only in debug mode? */
 static void
@@ -1158,13 +1078,6 @@ vhost_user_mmap_region(struct virtio_net *dev,
 	region->mmap_size = mmap_size;
 	region->host_user_addr = (uint64_t)(uintptr_t)mmap_addr + mmap_offset;

-	if (dev->async_copy)
-		if (add_guest_pages(dev, region, alignment) < 0) {
-			VHOST_LOG_CONFIG(ERR,
-					"adding guest pages to region failed.\n");
-			return -1;
-		}
-
 	VHOST_LOG_CONFIG(INFO,
 			"guest memory region size: 0x%" PRIx64 "\n"
 			"\t guest physical addr: 0x%" PRIx64 "\n"
@@ -1196,6 +1109,7 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
 	uint64_t mmap_offset;
 	uint32_t i;
+	int ret;

 	if (validate_msg_fds(msg, memory->nregions) != 0)
 		return RTE_VHOST_MSG_RESULT_ERR;
@@ -1280,6 +1194,18 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
 		}

 		dev->mem->nregions++;
+
+		if (dev->async_copy) {
+			/* Add mapped region into the default container of DPDK. */
+			ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
+							reg->host_user_addr,
+							reg->host_user_addr,
+							reg->size);
+			if (ret < 0) {
+				VHOST_LOG_CONFIG(ERR,
+					"Failed to configure IOMMU for DMA engine\n");
+				goto free_mem_table;
+			}
+		}
 	}

 	if (vhost_user_postcopy_register(dev, main_fd, msg) < 0)
diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
index 8da8a86a10..88110d2cb3 100644
--- a/lib/vhost/virtio_net.c
+++ b/lib/vhost/virtio_net.c
@@ -980,11 +980,9 @@ async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
 	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
 	int error = 0;
-	uint64_t mapped_len;

 	uint32_t tlen = 0;
 	int tvec_idx = 0;
-	void *hpa;

 	if (unlikely(m == NULL)) {
 		error = -1;
@@ -1074,27 +1072,19 @@ async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,

 		cpy_len = RTE_MIN(buf_avail, mbuf_avail);

-		while (unlikely(cpy_len && cpy_len >= cpy_threshold)) {
-			hpa = (void *)(uintptr_t)gpa_to_first_hpa(dev,
-					buf_iova + buf_offset,
-					cpy_len, &mapped_len);
-
-			if (unlikely(!hpa || mapped_len < cpy_threshold))
-				break;
-
+		if (unlikely(cpy_len >= cpy_threshold)) {
 			async_fill_vec(src_iovec + tvec_idx,
-				(void *)(uintptr_t)rte_pktmbuf_iova_offset(m,
-				mbuf_offset), (size_t)mapped_len);
+				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
+				(size_t)cpy_len);
 			async_fill_vec(dst_iovec + tvec_idx,
-				hpa, (size_t)mapped_len);
-
-			tlen += (uint32_t)mapped_len;
-			cpy_len -= (uint32_t)mapped_len;
-			mbuf_avail -= (uint32_t)mapped_len;
-			mbuf_offset += (uint32_t)mapped_len;
-			buf_avail -= (uint32_t)mapped_len;
-			buf_offset += (uint32_t)mapped_len;
+				(void *)((uintptr_t)(buf_addr + buf_offset)),
+				(size_t)cpy_len);
+
+			tlen += cpy_len;
+			mbuf_avail -= cpy_len;
+			mbuf_offset += cpy_len;
+			buf_avail -= cpy_len;
+			buf_offset += cpy_len;
+			cpy_len = 0;
 			tvec_idx++;
 		}
--
2.17.1