From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 7 Jul 2021 16:00:07 +0100
From: Bruce Richardson
To: "Burakov, Anatoly"
Cc: "Ding, Xuan", Maxime Coquelin, "Xia, Chenbo", Thomas Monjalon, David Marchand, dev@dpdk.org, "Hu, Jiayu", "Pai G, Sunil", "Van Haaren, Harry", "Liu, Yong", "Ma, WenwuX"
Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
List-Id: DPDK patches and discussions <dev.dpdk.org>
On Wed, Jul 07, 2021 at 03:33:44PM +0100, Burakov, Anatoly wrote:
> On 07-Jul-21 1:54 PM, Ding, Xuan wrote:
> > Hi Anatoly,
> >
> > > From: Burakov, Anatoly
> > > Sent: Wednesday, July 7, 2021 8:18 PM
> > >
> > > On 07-Jul-21 7:25 AM, Ding, Xuan wrote:
> > > > Hi,
> > > >
> > > > > From: Maxime Coquelin
> > > > > Sent: Tuesday, July 6, 2021 5:32 PM
> > > > >
> > > > > On 7/6/21 11:16 AM, Burakov, Anatoly wrote:
> > > > > > On 06-Jul-21 9:31 AM, Ding, Xuan wrote:
> > > > > > > Hi Maxime,
> > > > > > >
> > > > > > > > From: Maxime Coquelin
> > > > > > > > Sent: Monday, July 5, 2021 8:46 PM
> > > > > > > >
> > > > > > > > On 7/5/21 2:16 PM, Burakov, Anatoly wrote:
> > > > > > > > > On 05-Jul-21 9:40 AM, Xuan Ding wrote:
> > > > > > > > > > The use of IOMMU has many advantages, such as isolation and
> > > > > > > > > > address translation. This patch extends the capability of the
> > > > > > > > > > DMA engine to use the IOMMU if the DMA device is bound to vfio.
> > > > > > > > > >
> > > > > > > > > > When the memory table is set, the guest memory will be mapped
> > > > > > > > > > into the default container of DPDK.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Xuan Ding
> > > > > > > > > > ---
> > > > > > > > > >  doc/guides/prog_guide/vhost_lib.rst |  9 ++++++
> > > > > > > > > >  lib/vhost/rte_vhost.h               |  1 +
> > > > > > > > > >  lib/vhost/socket.c                  |  9 ++++++
> > > > > > > > > >  lib/vhost/vhost.h                   |  1 +
> > > > > > > > > >  lib/vhost/vhost_user.c              | 46 ++++++++++++++++++++++++++++-
> > > > > > > > > >  5 files changed, 65 insertions(+), 1 deletion(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
> > > > > > > > > > index 05c42c9b11..c3beda23d9 100644
> > > > > > > > > > --- a/doc/guides/prog_guide/vhost_lib.rst
> > > > > > > > > > +++ b/doc/guides/prog_guide/vhost_lib.rst
> > > > > > > > > > @@ -118,6 +118,15 @@ The following is an overview of some key Vhost API functions:
> > > > > > > > > >      It is disabled by default.
> > > > > > > > > >
> > > > > > > > > > +  - ``RTE_VHOST_USER_ASYNC_USE_VFIO``
> > > > > > > > > > +
> > > > > > > > > > +    In the asynchronous data path, the vhost library is not aware of which
> > > > > > > > > > +    driver (igb_uio/vfio) the DMA device is bound to. The application should
> > > > > > > > > > +    pass this flag to tell the vhost library whether the IOMMU should be
> > > > > > > > > > +    programmed for guest memory.
> > > > > > > > > > +
> > > > > > > > > > +    It is disabled by default.
> > > > > > > > > > +
> > > > > > > > > >    - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``
> > > > > > > > > >
> > > > > > > > > >      Since v16.04, the vhost library forwards checksum and gso requests for
> > > > > > > > > >
> > > > > > > > > > diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
> > > > > > > > > > index 8d875e9322..a766ea7b6b 100644
> > > > > > > > > > --- a/lib/vhost/rte_vhost.h
> > > > > > > > > > +++ b/lib/vhost/rte_vhost.h
> > > > > > > > > > @@ -37,6 +37,7 @@ extern "C" {
> > > > > > > > > >  #define RTE_VHOST_USER_LINEARBUF_SUPPORT (1ULL << 6)
> > > > > > > > > >  #define RTE_VHOST_USER_ASYNC_COPY (1ULL << 7)
> > > > > > > > > >  #define RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS (1ULL << 8)
> > > > > > > > > > +#define RTE_VHOST_USER_ASYNC_USE_VFIO (1ULL << 9)
> > > > > > > > > >
> > > > > > > > > >  /* Features. */
> > > > > > > > > >  #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
> > > > > > > > > >
> > > > > > > > > > diff --git a/lib/vhost/socket.c b/lib/vhost/socket.c
> > > > > > > > > > index 5d0d728d52..77c722c86b 100644
> > > > > > > > > > --- a/lib/vhost/socket.c
> > > > > > > > > > +++ b/lib/vhost/socket.c
> > > > > > > > > > @@ -42,6 +42,7 @@ struct vhost_user_socket {
> > > > > > > > > >  	bool extbuf;
> > > > > > > > > >  	bool linearbuf;
> > > > > > > > > >  	bool async_copy;
> > > > > > > > > > +	bool async_use_vfio;
> > > > > > > > > >  	bool net_compliant_ol_flags;
> > > > > > > > > >
> > > > > > > > > >  	/*
> > > > > > > > > > @@ -243,6 +244,13 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
> > > > > > > > > >  		dev->async_copy = 1;
> > > > > > > > > >  	}
> > > > > > > > > >
> > > > > > > > > > +	if (vsocket->async_use_vfio) {
> > > > > > > > > > +		dev = get_device(vid);
> > > > > > > > > > +
> > > > > > > > > > +		if (dev)
> > > > > > > > > > +			dev->async_use_vfio = 1;
> > > > > > > > > > +	}
> > > > > > > > > > +
> > > > > > > > > >  	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
> > > > > > > > > >
> > > > > > > > > >  	if (vsocket->notify_ops->new_connection) {
> > > > > > > > > > @@ -879,6 +887,7 @@
> > > > > > > > > > rte_vhost_driver_register(const char *path, uint64_t flags)
> > > > > > > > > >  	vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
> > > > > > > > > >  	vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
> > > > > > > > > >  	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
> > > > > > > > > > +	vsocket->async_use_vfio = flags & RTE_VHOST_USER_ASYNC_USE_VFIO;
> > > > > > > > > >  	vsocket->net_compliant_ol_flags = flags & RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS;
> > > > > > > > > >
> > > > > > > > > >  	if (vsocket->async_copy &&
> > > > > > > > > >
> > > > > > > > > > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
> > > > > > > > > > index 8078ddff79..fb775ce4ed 100644
> > > > > > > > > > --- a/lib/vhost/vhost.h
> > > > > > > > > > +++ b/lib/vhost/vhost.h
> > > > > > > > > > @@ -370,6 +370,7 @@ struct virtio_net {
> > > > > > > > > >  	int16_t broadcast_rarp;
> > > > > > > > > >  	uint32_t nr_vring;
> > > > > > > > > >  	int async_copy;
> > > > > > > > > > +	int async_use_vfio;
> > > > > > > > > >  	int extbuf;
> > > > > > > > > >  	int linearbuf;
> > > > > > > > > >  	struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
> > > > > > > > > >
> > > > > > > > > > diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> > > > > > > > > > index 8f0eba6412..f3703f2e72 100644
> > > > > > > > > > --- a/lib/vhost/vhost_user.c
> > > > > > > > > > +++ b/lib/vhost/vhost_user.c
> > > > > > > > > > @@ -45,6 +45,7 @@
> > > > > > > > > >  #include
> > > > > > > > > >  #include
> > > > > > > > > >  #include
> > > > > > > > > > +#include
> > > > > > > > > >
> > > > > > > > > >  #include "iotlb.h"
> > > > > > > > > >  #include "vhost.h"
> > > > > > > > > > @@ -141,6 +142,36 @@ get_blk_size(int fd)
> > > > > > > > > >  	return ret == -1 ?
> > > > > > > > > > (uint64_t)-1 : (uint64_t)stat.st_blksize;
> > > > > > > > > >  }
> > > > > > > > > >
> > > > > > > > > > +static int
> > > > > > > > > > +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
> > > > > > > > > > +{
> > > > > > > > > > +	int ret = 0;
> > > > > > > > > > +	uint64_t host_iova;
> > > > > > > > > > +
> > > > > > > > > > +	host_iova = rte_mem_virt2iova((void *)(uintptr_t)region->host_user_addr);
> > > > > > > > > > +	if (do_map) {
> > > > > > > > > > +		/* Add mapped region into the default container of DPDK. */
> > > > > > > > > > +		ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > > > > > > > > > +						 region->host_user_addr,
> > > > > > > > > > +						 host_iova,
> > > > > > > > > > +						 region->size);
> > > > > > > > > > +		if (ret) {
> > > > > > > > > > +			VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
> > > > > > > > > > +			return ret;
> > > > > > > > > > +		}
> > > > > > > > > > +	} else {
> > > > > > > > > > +		/* Remove mapped region from the default container of DPDK. */
> > > > > > > > > > +		ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > > > > > > > > > +						   region->host_user_addr,
> > > > > > > > > > +						   host_iova,
> > > > > > > > > > +						   region->size);
> > > > > > > > > > +		if (ret) {
> > > > > > > > > > +			VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
> > > > > > > > > > +			return ret;
> > > > > > > > > > +		}
> > > > > > > > > > +	}
> > > > > > > > > > +
> > > > > > > > > > +	return ret;
> > > > > > > > > > +}
> > > > > > > > >
> > > > > > > > > We've been discussing this off list with Xuan, and unfortunately
> > > > > > > > > this is a blocker for now.
> > > > > > > > >
> > > > > > > > > Currently, the x86 IOMMU does not support partial unmap - the
> > > > > > > > > segments have to be unmapped exactly the same addr/len as they
> > > > > > > > > were mapped.
> > > > > > > > > We also concatenate adjacent mappings to prevent filling up the
> > > > > > > > > DMA mapping entry table with superfluous entries.
> > > > > > > > >
> > > > > > > > > This means that, when two unrelated mappings are contiguous in
> > > > > > > > > memory (e.g. if you map regions 1 and 2 independently, but they
> > > > > > > > > happen to be sitting right next to each other in virtual memory),
> > > > > > > > > we cannot later unmap one of them because, even though these are
> > > > > > > > > two separate mappings as far as kernel VFIO infrastructure is
> > > > > > > > > concerned, the mapping gets compacted and looks like one single
> > > > > > > > > mapping to VFIO, so DPDK API will not let us unmap region 1
> > > > > > > > > without also unmapping region 2.
> > > > > > > > >
> > > > > > > > > The proper fix for this problem would be to always map memory
> > > > > > > > > page-by-page regardless of where it comes from (we already do
> > > > > > > > > that for internal memory, but not for external). However, the
> > > > > > > > > reason this works for internal memory is because when mapping
> > > > > > > > > internal memory segments, *we know the page size*. For external
> > > > > > > > > memory segments, there is no such guarantee, so we cannot deduce
> > > > > > > > > the page size for a given memory segment, and thus can't map
> > > > > > > > > things page-by-page.
> > > > > > > > >
> > > > > > > > > So, the proper fix for it would be to add page size to the VFIO
> > > > > > > > > DMA API. Unfortunately, it probably has to wait until 21.11
> > > > > > > > > because it is an API change.
> > > > > > > > >
> > > > > > > > > The slightly hacky fix for this would be to forego user mem map
> > > > > > > > > concatenation and trust that the user is not going to do anything
> > > > > > > > > stupid, and will not spam the VFIO DMA API without reason.
> > > > > > > > > I would rather not go down this road, but this could be an
> > > > > > > > > option in this case.
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > >
> > > > > > > > Thanks Anatoly for the detailed description of the issue.
> > > > > > > > It may be possible to either create a versioned symbol for this API
> > > > > > > > change, or maybe even to have a temporary internal API.
> > > > > > > >
> > > > > > > > But I think this series in its current form is not acceptable, so
> > > > > > > > waiting for v21.11 would be the best option (we may want to send
> > > > > > > > the deprecation notice in this release though).
> > > > > > > >
> > > > > > > > In this series, I don't like that the user application has to pass
> > > > > > > > a flag to state whether the DMA engine uses VFIO or not. AFAICT,
> > > > > > > > this new revision does not implement what was discussed in the
> > > > > > > > previous one, i.e. supporting both IOVA_AS_VA and IOVA_AS_PA.
> > > > > > >
> > > > > > > Thanks for your comments. Here I hope to explain some questions:
> > > > > > >
> > > > > > > 1. Are both IOVA_AS_VA and IOVA_AS_PA supported now?
> > > > > > > A: Both IOVA_AS_PA and IOVA_AS_VA are supported now. In this
> > > > > > > version, the virtual address is replaced with the iova address of
> > > > > > > the mapped region, and the iova address is selected to program the
> > > > > > > IOMMU instead of the virtual address only.
> > > > >
> > > > > Good!
> > > > >
> > > > > > > 2. Why is a flag chosen to be passed by the application?
> > > > > > > A: Yes, as we discussed before, the rte_eal_iova_mode() API can be
> > > > > > > used to get the IOVA mode, so as to determine whether the IOMMU
> > > > > > > should be programmed. However, in the implementation process, I
> > > > > > > found a problem: how to distinguish the VFIO PA and IGB_UIO PA
> > > > > > > cases. For VFIO cases, we should always program the IOMMU.
> > > > > > > While in IGB_UIO cases, it depends on the IOMMU capability of the
> > > > > > > platform.
> > > > > >
> > > > > > How does one program the IOMMU with igb_uio? I was under the
> > > > > > impression that igb_uio (and uio_pci_generic, for that matter) does
> > > > > > not provide such facilities.
> > > > >
> > > > > +1
> > > >
> > > > Maybe there is some misunderstanding in this sentence.
> > > > In our design, if rte_eal_vfio_is_enabled("vfio") is true, the iommu will
> > > > be programmed. True means the vfio module is modprobed.
> > > >
> > > > But there is an exception here: even if the vfio module is modprobed,
> > > > the DPDK user may still bind all the devices to igb_uio.
> > > >
> > > > This situation can be distinguished in DPDK eal initialization, because
> > > > the resource mapping is done according to the driver loaded by each
> > > > device (rte_pci_map_device).
> > > >
> > > > But in our scenario, this judgment is somewhat weak, because we cannot
> > > > get the device driver info in the vhost library. I also think it is
> > > > unreasonable for vhost to do this; we can only trust that users will not
> > > > use it like this. Thoughts for this scenario?
> > >
> > > I don't see how igb_uio would make any difference at all. If you are
> > > using igb_uio, you *don't have DMA mapping at all* and will use raw
> > > physical addresses. Assuming your code supports this, that's all you're
> > > ever going to get. The point of VFIO is to have memory regions that are
> > > mapped for DMA *because real physical addresses are assumed to be not
> > > available*. When you're using igb_uio, you effectively do have DMA
> > > access to the entire memory, and thus can bypass the IOMMU altogether
> > > (assuming you're using passthrough mode).
> >
> > My concern is exactly here.
> > In igb_uio cases, although devices are not added to the default container
> > in eal init, the "IOMMU programming" actually happens when
> > rte_vfio_container_dma_map() is called.
> > It is no harm, but it is also unnecessary.
>
> Yes, it is unnecessary, but it's also not actively harmful, which means you
> can still do it without any regard as to whether you do or don't have IOMMU
> :)
>
> Think of a hybrid VFIO/igb_uio setup - some NICs will have VFIO, some will
> have igb_uio. The igb_uio-bound NICs will not care if you have mapped
> anything for DMA because they don't go through the IOMMU; things will "just
> work". The VFIO-bound NICs will get the memory mapped, because they are the
> ones who actually need the DMA mapping.
>

Do we even support a hybrid setup? I would have thought that was just asking
for problems and should be considered an unsupported configuration?
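For readers following the thread, the concatenation problem Anatoly describes can be made concrete with a small toy model. This is a sketch, not DPDK code: the `UserMemMaps` class, the addresses, and the sizes are all invented for illustration. It only mimics the two behaviours discussed above: adjacent user mem maps are merged into one entry, and an unmap request must match a stored entry's addr/len exactly (the x86 IOMMU has no partial unmap).

```python
# Toy model (NOT DPDK code) of the user mem map behaviour discussed above.

class UserMemMaps:
    """Tracks (addr, length) DMA mappings; optionally concatenates
    adjacent entries, and refuses any unmap that does not exactly
    match a stored entry."""

    def __init__(self, merge=True):
        self.merge = merge
        self.maps = []          # list of (addr, length) tuples

    def dma_map(self, addr, length):
        self.maps.append((addr, length))
        if not self.merge:
            return
        # Concatenate adjacent mappings to keep the entry table small.
        self.maps.sort()
        merged = [self.maps[0]]
        for a, l in self.maps[1:]:
            prev_a, prev_l = merged[-1]
            if prev_a + prev_l == a:
                merged[-1] = (prev_a, prev_l + l)
            else:
                merged.append((a, l))
        self.maps = merged

    def dma_unmap(self, addr, length):
        # No partial unmap: addr/len must match a stored entry exactly.
        if (addr, length) in self.maps:
            self.maps.remove((addr, length))
            return 0
        return -1


# Two unrelated regions that happen to be adjacent in virtual memory.
R1 = (0x100000, 0x1000)
R2 = (0x101000, 0x1000)

merging = UserMemMaps(merge=True)
merging.dma_map(*R1)
merging.dma_map(*R2)
assert merging.maps == [(0x100000, 0x2000)]    # compacted into one entry
assert merging.dma_unmap(*R1) == -1            # region 1 alone is refused

# The "slightly hacky fix": forego concatenation entirely.
plain = UserMemMaps(merge=False)
plain.dma_map(*R1)
plain.dma_map(*R2)
assert plain.dma_unmap(*R1) == 0               # exact-match unmap succeeds
```

The second half models the "slightly hacky fix" mentioned in the thread: with one entry per map call, unmapping region 1 by its original addr/len succeeds, at the cost of a larger entry table if callers map many small regions.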