From: Maxime Coquelin
Date: Tue, 6 Jul 2021 11:32:23 +0200
Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
To: "Burakov, Anatoly", "Ding, Xuan", "Xia, Chenbo", Thomas Monjalon,
 David Marchand
Cc: dev@dpdk.org, "Hu, Jiayu", "Pai G, Sunil", "Richardson, Bruce",
 "Van Haaren, Harry", "Liu, Yong", "Ma, WenwuX"
Message-ID: <1013369b-3661-38fd-7207-6993e61da08e@redhat.com>
In-Reply-To: <4077e127-afff-0a2b-f2ba-5850b5e0a2ff@intel.com>
References: <20210531150629.35020-1-xuan.ding@intel.com>
 <20210705084026.99898-1-xuan.ding@intel.com>
 <20210705084026.99898-2-xuan.ding@intel.com>
 <37d0691a-5094-4fea-8557-d117d230dcc8@intel.com>
 <68f64aa6-59a8-2e17-6eab-a49a6682e626@redhat.com>
 <4077e127-afff-0a2b-f2ba-5850b5e0a2ff@intel.com>
List-Id: DPDK patches and discussions

On 7/6/21 11:16 AM, Burakov, Anatoly wrote:
> On 06-Jul-21 9:31 AM, Ding, Xuan wrote:
>> Hi Maxime,
>>
>>> -----Original Message-----
>>> From: Maxime Coquelin
>>> Sent: Monday, July 5, 2021 8:46 PM
>>> To: Burakov, Anatoly; Ding, Xuan; Xia, Chenbo; Thomas Monjalon;
>>> David Marchand
>>> Cc: dev@dpdk.org; Hu, Jiayu; Pai G, Sunil; Richardson, Bruce;
>>> Van Haaren, Harry; Liu, Yong; Ma, WenwuX
>>> Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async
>>> vhost
>>>
>>> On 7/5/21 2:16 PM, Burakov, Anatoly wrote:
>>>> On 05-Jul-21 9:40 AM, Xuan Ding wrote:
>>>>> The use of IOMMU has many advantages, such as isolation and address
>>>>> translation. This patch extends the capability of the DMA engine to
>>>>> use the IOMMU if the DMA device is bound to vfio.
>>>>>
>>>>> When the memory table is set, the guest memory will be mapped
>>>>> into the default container of DPDK.
>>>>>
>>>>> Signed-off-by: Xuan Ding
>>>>> ---
>>>>>   doc/guides/prog_guide/vhost_lib.rst |  9 ++++++
>>>>>   lib/vhost/rte_vhost.h               |  1 +
>>>>>   lib/vhost/socket.c                  |  9 ++++++
>>>>>   lib/vhost/vhost.h                   |  1 +
>>>>>   lib/vhost/vhost_user.c              | 46 ++++++++++++++++++++++++++++-
>>>>>   5 files changed, 65 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/doc/guides/prog_guide/vhost_lib.rst
>>>>> b/doc/guides/prog_guide/vhost_lib.rst
>>>>> index 05c42c9b11..c3beda23d9 100644
>>>>> --- a/doc/guides/prog_guide/vhost_lib.rst
>>>>> +++ b/doc/guides/prog_guide/vhost_lib.rst
>>>>> @@ -118,6 +118,15 @@ The following is an overview of some key Vhost
>>>>> API functions:
>>>>>         It is disabled by default.
>>>>>
>>>>> +  - ``RTE_VHOST_USER_ASYNC_USE_VFIO``
>>>>> +
>>>>> +    In the asynchronous data path, the vhost library is not aware of
>>>>> +    which driver (igb_uio/vfio) the DMA device is bound to. The
>>>>> +    application should pass this flag to tell the vhost library
>>>>> +    whether the IOMMU should be programmed for guest memory.
>>>>> +
>>>>> +    It is disabled by default.
>>>>> +
>>>>>     - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``
>>>>>
>>>>>       Since v16.04, the vhost library forwards checksum and gso
>>>>> requests for
>>>>> diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
>>>>> index 8d875e9322..a766ea7b6b 100644
>>>>> --- a/lib/vhost/rte_vhost.h
>>>>> +++ b/lib/vhost/rte_vhost.h
>>>>> @@ -37,6 +37,7 @@ extern "C" {
>>>>>   #define RTE_VHOST_USER_LINEARBUF_SUPPORT    (1ULL << 6)
>>>>>   #define RTE_VHOST_USER_ASYNC_COPY    (1ULL << 7)
>>>>>   #define RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS    (1ULL << 8)
>>>>> +#define RTE_VHOST_USER_ASYNC_USE_VFIO    (1ULL << 9)
>>>>>
>>>>>   /* Features. */
>>>>>   #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
>>>>> diff --git a/lib/vhost/socket.c b/lib/vhost/socket.c
>>>>> index 5d0d728d52..77c722c86b 100644
>>>>> --- a/lib/vhost/socket.c
>>>>> +++ b/lib/vhost/socket.c
>>>>> @@ -42,6 +42,7 @@ struct vhost_user_socket {
>>>>>       bool extbuf;
>>>>>       bool linearbuf;
>>>>>       bool async_copy;
>>>>> +    bool async_use_vfio;
>>>>>       bool net_compliant_ol_flags;
>>>>>
>>>>>       /*
>>>>> @@ -243,6 +244,13 @@ vhost_user_add_connection(int fd, struct
>>>>> vhost_user_socket *vsocket)
>>>>>               dev->async_copy = 1;
>>>>>       }
>>>>>
>>>>> +    if (vsocket->async_use_vfio) {
>>>>> +        dev = get_device(vid);
>>>>> +
>>>>> +        if (dev)
>>>>> +            dev->async_use_vfio = 1;
>>>>> +    }
>>>>> +
>>>>>       VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
>>>>>
>>>>>       if (vsocket->notify_ops->new_connection) {
>>>>> @@ -879,6 +887,7 @@ rte_vhost_driver_register(const char *path,
>>>>> uint64_t flags)
>>>>>       vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
>>>>>       vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
>>>>>       vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
>>>>> +    vsocket->async_use_vfio = flags & RTE_VHOST_USER_ASYNC_USE_VFIO;
>>>>>       vsocket->net_compliant_ol_flags = flags &
>>>>> RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS;
>>>>>
>>>>>       if (vsocket->async_copy &&
>>>>> diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
>>>>> index 8078ddff79..fb775ce4ed 100644
>>>>> --- a/lib/vhost/vhost.h
>>>>> +++ b/lib/vhost/vhost.h
>>>>> @@ -370,6 +370,7 @@ struct virtio_net {
>>>>>       int16_t            broadcast_rarp;
>>>>>       uint32_t        nr_vring;
>>>>>       int            async_copy;
>>>>> +    int            async_use_vfio;
>>>>>       int            extbuf;
>>>>>       int            linearbuf;
>>>>>       struct vhost_virtqueue    *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
>>>>> diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
>>>>> index 8f0eba6412..f3703f2e72 100644
>>>>> --- a/lib/vhost/vhost_user.c
>>>>> +++ b/lib/vhost/vhost_user.c
>>>>> @@ -45,6 +45,7 @@
>>>>>   #include
>>>>>   #include
>>>>>   #include
>>>>> +#include
>>>>>
>>>>>   #include "iotlb.h"
>>>>>   #include "vhost.h"
>>>>> @@ -141,6 +142,36 @@ get_blk_size(int fd)
>>>>>       return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
>>>>>   }
>>>>>
>>>>> +static int
>>>>> +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
>>>>> +{
>>>>> +    int ret = 0;
>>>>> +    uint64_t host_iova;
>>>>> +    host_iova = rte_mem_virt2iova((void *)(uintptr_t)region->host_user_addr);
>>>>> +    if (do_map) {
>>>>> +        /* Add mapped region into the default container of DPDK. */
>>>>> +        ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
>>>>> +                         region->host_user_addr,
>>>>> +                         host_iova,
>>>>> +                         region->size);
>>>>> +        if (ret) {
>>>>> +            VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
>>>>> +            return ret;
>>>>> +        }
>>>>> +    } else {
>>>>> +        /* Remove mapped region from the default container of DPDK. */
>>>>> +        ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
>>>>> +                           region->host_user_addr,
>>>>> +                           host_iova,
>>>>> +                           region->size);
>>>>> +        if (ret) {
>>>>> +            VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
>>>>> +            return ret;
>>>>> +        }
>>>>> +    }
>>>>> +    return ret;
>>>>> +}
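
Side note to make the application-facing impact of this revision concrete:
with the new flag, every application has to know how its DMA device is bound
and forward that knowledge to vhost. A rough, untested sketch of what that
looks like (the helper name and the dma_dev_is_vfio knob are mine, not part
of the patch):

    #include <stdint.h>
    #include <stdbool.h>
    #include <rte_vhost.h>

    /* Rough, untested sketch: how an application would opt in to the
     * proposed flag when registering its vhost-user socket. Knowing
     * whether the DMA device is bound to vfio-pci is left entirely to
     * the application. */
    static int
    register_async_vhost_socket(const char *path, bool dma_dev_is_vfio)
    {
        uint64_t flags = RTE_VHOST_USER_ASYNC_COPY;

        if (dma_dev_is_vfio)
            flags |= RTE_VHOST_USER_ASYNC_USE_VFIO;

        return rte_vhost_driver_register(path, flags);
    }
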
>>>>
>>>> We've been discussing this off list with Xuan, and unfortunately
>>>> this is a blocker for now.
>>>>
>>>> Currently, the x86 IOMMU does not support partial unmap - the segments
>>>> have to be unmapped exactly the same addr/len as they were mapped. We
>>>> also concatenate adjacent mappings to prevent filling up the DMA
>>>> mapping entry table with superfluous entries.
>>>>
>>>> This means that, when two unrelated mappings are contiguous in memory
>>>> (e.g. if you map regions 1 and 2 independently, but they happen to be
>>>> sitting right next to each other in virtual memory), we cannot later
>>>> unmap one of them because, even though these are two separate mappings
>>>> as far as kernel VFIO infrastructure is concerned, the mapping gets
>>>> compacted and looks like one single mapping to VFIO, so DPDK API will
>>>> not let us unmap region 1 without also unmapping region 2.
>>>>
>>>> The proper fix for this problem would be to always map memory
>>>> page-by-page regardless of where it comes from (we already do that for
>>>> internal memory, but not for external). However, the reason this works
>>>> for internal memory is because when mapping internal memory segments,
>>>> *we know the page size*. For external memory segments, there is no such
>>>> guarantee, so we cannot deduce page size for a given memory segment,
>>>> and thus can't map things page-by-page.
>>>>
>>>> So, the proper fix for it would be to add page size to the VFIO DMA
>>>> API. Unfortunately, it probably has to wait until 21.11 because it is
>>>> an API change.
>>>>
>>>> The slightly hacky fix for this would be to forego user mem map
>>>> concatenation and trust that user is not going to do anything stupid,
>>>> and will not spam the VFIO DMA API without reason. I would rather
>>>> not go down this road, but this could be an option in this case.
>>>>
>>>> Thoughts?
>>>>
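
Trying to make sure I understand the failure mode, here is my own minimal,
untested illustration of the sequence you describe (it assumes a VFIO
container and IOMMU already set up by a probed vfio-pci device, IOVA as VA,
and 2 MB regions picked arbitrarily):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <rte_vfio.h>

    /* Untested illustration of the scenario described above: two regions
     * mapped independently that happen to be virtually contiguous. The
     * user mem maps get concatenated into a single entry, so unmapping
     * only the first region later is refused. */
    static void
    adjacent_map_example(void)
    {
        size_t len = 2 * 1024 * 1024;
        void *base = mmap(NULL, 2 * len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        uint64_t va1, va2;

        if (base == MAP_FAILED)
            return;
        va1 = (uint64_t)(uintptr_t)base;
        va2 = va1 + len;

        /* Two independent mappings, adjacent in virtual memory. */
        rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
                va1, va1, len);
        rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
                va2, va2, len);

        /* Unmapping only region 1 fails: to VFIO this now looks like one
         * single [va1, va1 + 2 * len) mapping, and partial unmap of it
         * is not supported. */
        if (rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
                va1, va1, len) < 0)
            printf("partial unmap refused\n");
    }

If that is accurate, I agree it is a real blocker for the reconnection case
mentioned below.
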
>>>
>>> Thanks Anatoly for the detailed description of the issue.
>>> It may be possible to either create a versioned symbol for this API
>>> change, or maybe even to have a temporary internal API.
>>>
>>> But I think this series in its current form is not acceptable, so
>>> waiting for v21.11 would be the best option (we may want to send the
>>> deprecation notice in this release though).
>>>
>>> In this series, I don't like that the user application has to pass a
>>> flag to state whether the DMA engine uses VFIO or not. AFAICT, this new
>>> revision does not implement what was discussed in the previous one,
>>> i.e. supporting both IOVA_AS_VA and IOVA_AS_PA.
>>
>> Thanks for your comments. Here I hope to address some questions:
>> 1. Are both IOVA_AS_VA and IOVA_AS_PA supported now?
>> A: Both IOVA_AS_PA and IOVA_AS_VA are supported now. In this version,
>> the virtual address is replaced with the IOVA address of the mapped
>> region, and the IOVA address is selected to program the IOMMU instead
>> of the virtual address only.

Good!
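
Just to check my understanding of the new selection logic, I picture
something like the sketch below (illustration only, not your actual code;
the helper name is mine). Please correct me if the next revision does it
differently:

    #include <stdint.h>
    #include <rte_eal.h>
    #include <rte_memory.h>

    /* Sketch only: choose the IOVA to program into the IOMMU for a guest
     * memory region, depending on the IOVA mode EAL is running in. */
    static uint64_t
    region_iova(uint64_t host_user_addr)
    {
        /* IOVA as VA: DMA addresses are host virtual addresses. */
        if (rte_eal_iova_mode() == RTE_IOVA_VA)
            return host_user_addr;

        /* IOVA as PA: translate the host VA into its IOVA. */
        return rte_mem_virt2iova((void *)(uintptr_t)host_user_addr);
    }
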
>>
>> 2. Why is a flag chosen to be passed by the application?
>> A: Yes, as we discussed before, the rte_eal_iova_mode() API can be used
>> to get the IOVA mode, so as to determine whether the IOMMU should be
>> programmed. However, in the implementation process, I found a problem:
>> how to distinguish VFIO PA from IGB_UIO PA. Because for VFIO cases, we
>> should always program the IOMMU, while in IGB_UIO cases, it depends on
>> the IOMMU capability of the platform.
>
> How does one program the IOMMU with igb_uio? I was under the impression
> that igb_uio (and uio_pci_generic for that matter) does not provide such
> facilities.

+1

>>
>> So a flag is selected, but this requires the application to do extra
>> things. I found another solution, which is to use:
>>
>> #ifdef VFIO_PRESENT
>>         if (rte_vfio_is_enabled("vfio"))
>>                 program_iommu;
>> #endif
>>
>> Because all the devices are managed by DPDK, we can follow DPDK to make
>> the decision. Does this make sense to you, or do you have any
>> suggestions?
>
> IMO the #ifdef is not needed. The API will always work, it's just that
> if VFIO is not compiled in, it'll just compile down to noops.

Agree the #ifdef is not necessary.

To be clear, the rte_vfio_is_enabled() check is going to be done in the
Vhost library, making this transparent to the application?
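
In other words, I would expect the whole decision to live inside the
library, along the lines of this rough sketch (the helper name is mine,
and no new flag is added to the API):

    #include <stdbool.h>
    #include <rte_vfio.h>
    #include "vhost.h"  /* library-internal header, for struct virtio_net */

    /* Rough sketch: decide inside the vhost library whether the guest
     * memory regions should be programmed into the default VFIO
     * container, without any hint from the application. */
    static bool
    async_dma_map_needed(struct virtio_net *dev)
    {
        if (!dev->async_copy)
            return false;

        /* No VFIO means no container to program; igb_uio and
         * uio_pci_generic do not provide such facilities anyway. */
        return rte_vfio_is_enabled("vfio") != 0;
    }
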
>>
>> 3. The partial unmap issue
>> A: Thanks Anatoly for the detailed explanation. This problem was found
>> in reconnection cases. After our off-list discussion, the solution
>> requires an rte_vfio_container_dma_map/unmap API change. Here I want to
>> ask whether there is any hope for a versioned symbol or a temporary
>> internal API to be used in this release.
>
> I don't think we can add a versioned symbol in this release unless
> there's an exception to the rc1 feature freeze. I also don't like the
> idea of a temporary internal API because vhost is not in EAL, it's a
> library - meaning, the "internal" API has to in fact be an external API,
> added to the .map file etc., otherwise it won't work with shared library
> builds.
>
> That said, I'm not an expert on versioning, so maybe there are other
> ways I'm not aware of, or I have some misconceptions about how it works :)

Ok, it may indeed be better to wait for v21.11; it is too late for this
release.

Thanks,
Maxime

>>
>> Thanks for your time!
>>
>> Regards,
>> Xuan
>>
>>>
>>> Regards,
>>> Maxime
>>
>
>