From: Maxime Coquelin <maxime.coquelin@redhat.com>
To: "Burakov, Anatoly" <anatoly.burakov@intel.com>,
"Ding, Xuan" <xuan.ding@intel.com>,
"Xia, Chenbo" <chenbo.xia@intel.com>,
Thomas Monjalon <thomas@monjalon.net>,
David Marchand <david.marchand@redhat.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>, "Hu, Jiayu" <jiayu.hu@intel.com>,
"Pai G, Sunil" <sunil.pai.g@intel.com>,
"Richardson, Bruce" <bruce.richardson@intel.com>,
"Van Haaren, Harry" <harry.van.haaren@intel.com>,
"Liu, Yong" <yong.liu@intel.com>,
"Ma, WenwuX" <wenwux.ma@intel.com>
Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
Date: Tue, 6 Jul 2021 11:32:23 +0200 [thread overview]
Message-ID: <1013369b-3661-38fd-7207-6993e61da08e@redhat.com> (raw)
In-Reply-To: <4077e127-afff-0a2b-f2ba-5850b5e0a2ff@intel.com>
On 7/6/21 11:16 AM, Burakov, Anatoly wrote:
> On 06-Jul-21 9:31 AM, Ding, Xuan wrote:
>> Hi Maxime,
>>
>>> -----Original Message-----
>>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>>> Sent: Monday, July 5, 2021 8:46 PM
>>> To: Burakov, Anatoly <anatoly.burakov@intel.com>; Ding, Xuan
>>> <xuan.ding@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>; Thomas
>>> Monjalon <thomas@monjalon.net>; David Marchand
>>> <david.marchand@redhat.com>
>>> Cc: dev@dpdk.org; Hu, Jiayu <jiayu.hu@intel.com>; Pai G, Sunil
>>> <sunil.pai.g@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>;
>>> Van Haaren, Harry <harry.van.haaren@intel.com>; Liu, Yong
>>> <yong.liu@intel.com>; Ma, WenwuX <wenwux.ma@intel.com>
>>> Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async
>>> vhost
>>>
>>>
>>>
>>> On 7/5/21 2:16 PM, Burakov, Anatoly wrote:
>>>> On 05-Jul-21 9:40 AM, Xuan Ding wrote:
>>>>> The use of IOMMU has many advantages, such as isolation and address
>>>>> translation. This patch extends the capability of the DMA engine to
>>>>> use IOMMU if the DMA device is bound to vfio.
>>>>>
>>>>> When the memory table is set, the guest memory will be mapped
>>>>> into the default container of DPDK.
>>>>>
>>>>> Signed-off-by: Xuan Ding <xuan.ding@intel.com>
>>>>> ---
>>>>> doc/guides/prog_guide/vhost_lib.rst | 9 ++++++
>>>>> lib/vhost/rte_vhost.h | 1 +
>>>>> lib/vhost/socket.c | 9 ++++++
>>>>> lib/vhost/vhost.h | 1 +
>>>>> lib/vhost/vhost_user.c | 46 ++++++++++++++++++++++++++++-
>>>>> 5 files changed, 65 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/doc/guides/prog_guide/vhost_lib.rst
>>>>> b/doc/guides/prog_guide/vhost_lib.rst
>>>>> index 05c42c9b11..c3beda23d9 100644
>>>>> --- a/doc/guides/prog_guide/vhost_lib.rst
>>>>> +++ b/doc/guides/prog_guide/vhost_lib.rst
>>>>> @@ -118,6 +118,15 @@ The following is an overview of some key Vhost
>>>>> API functions:
>>>>> It is disabled by default.
>>>>> + - ``RTE_VHOST_USER_ASYNC_USE_VFIO``
>>>>> +
>>>>> + In asynchronous data path, the vhost library is not aware of which
>>>>> + driver (igb_uio/vfio) the DMA device is bound to. The application
>>>>> + should pass this flag to tell the vhost library whether the IOMMU
>>>>> + should be programmed for guest memory.
>>>>> +
>>>>> + It is disabled by default.
>>>>> +
>>>>> - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``
>>>>> Since v16.04, the vhost library forwards checksum and gso
>>>>> requests for
>>>>> diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
>>>>> index 8d875e9322..a766ea7b6b 100644
>>>>> --- a/lib/vhost/rte_vhost.h
>>>>> +++ b/lib/vhost/rte_vhost.h
>>>>> @@ -37,6 +37,7 @@ extern "C" {
>>>>> #define RTE_VHOST_USER_LINEARBUF_SUPPORT (1ULL << 6)
>>>>> #define RTE_VHOST_USER_ASYNC_COPY (1ULL << 7)
>>>>> #define RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS (1ULL << 8)
>>>>> +#define RTE_VHOST_USER_ASYNC_USE_VFIO (1ULL << 9)
>>>>> /* Features. */
>>>>> #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
>>>>> diff --git a/lib/vhost/socket.c b/lib/vhost/socket.c
>>>>> index 5d0d728d52..77c722c86b 100644
>>>>> --- a/lib/vhost/socket.c
>>>>> +++ b/lib/vhost/socket.c
>>>>> @@ -42,6 +42,7 @@ struct vhost_user_socket {
>>>>> bool extbuf;
>>>>> bool linearbuf;
>>>>> bool async_copy;
>>>>> + bool async_use_vfio;
>>>>> bool net_compliant_ol_flags;
>>>>> /*
>>>>> @@ -243,6 +244,13 @@ vhost_user_add_connection(int fd, struct
>>>>> vhost_user_socket *vsocket)
>>>>> dev->async_copy = 1;
>>>>> }
>>>>> + if (vsocket->async_use_vfio) {
>>>>> + dev = get_device(vid);
>>>>> +
>>>>> + if (dev)
>>>>> + dev->async_use_vfio = 1;
>>>>> + }
>>>>> +
>>>>> VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
>>>>> if (vsocket->notify_ops->new_connection) {
>>>>> @@ -879,6 +887,7 @@ rte_vhost_driver_register(const char *path,
>>>>> uint64_t flags)
>>>>> vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
>>>>> vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
>>>>> vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
>>>>> + vsocket->async_use_vfio = flags & RTE_VHOST_USER_ASYNC_USE_VFIO;
>>>>> vsocket->net_compliant_ol_flags = flags &
>>>>> RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS;
>>>>> if (vsocket->async_copy &&
>>>>> diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
>>>>> index 8078ddff79..fb775ce4ed 100644
>>>>> --- a/lib/vhost/vhost.h
>>>>> +++ b/lib/vhost/vhost.h
>>>>> @@ -370,6 +370,7 @@ struct virtio_net {
>>>>> int16_t broadcast_rarp;
>>>>> uint32_t nr_vring;
>>>>> int async_copy;
>>>>> + int async_use_vfio;
>>>>> int extbuf;
>>>>> int linearbuf;
>>>>> struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS *
>>>>> 2];
>>>>> diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
>>>>> index 8f0eba6412..f3703f2e72 100644
>>>>> --- a/lib/vhost/vhost_user.c
>>>>> +++ b/lib/vhost/vhost_user.c
>>>>> @@ -45,6 +45,7 @@
>>>>> #include <rte_common.h>
>>>>> #include <rte_malloc.h>
>>>>> #include <rte_log.h>
>>>>> +#include <rte_vfio.h>
>>>>> #include "iotlb.h"
>>>>> #include "vhost.h"
>>>>> @@ -141,6 +142,36 @@ get_blk_size(int fd)
>>>>> return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
>>>>> }
>>>>> +static int
>>>>> +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
>>>>> +{
>>>>> + int ret = 0;
>>>>> + uint64_t host_iova;
>>>>> + host_iova = rte_mem_virt2iova((void *)(uintptr_t)region->host_user_addr);
>>>>> + if (do_map) {
>>>>> + /* Add mapped region into the default container of DPDK. */
>>>>> + ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
>>>>> + region->host_user_addr,
>>>>> + host_iova,
>>>>> + region->size);
>>>>> + if (ret) {
>>>>> + VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
>>>>> + return ret;
>>>>> + }
>>>>> + } else {
>>>>> + /* Remove mapped region from the default container of DPDK. */
>>>>> + ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
>>>>> + region->host_user_addr,
>>>>> + host_iova,
>>>>> + region->size);
>>>>> + if (ret) {
>>>>> + VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
>>>>> + return ret;
>>>>> + }
>>>>> + }
>>>>> + return ret;
>>>>> +}
>>>>
>>>> We've been discussing this off list with Xuan, and unfortunately
>>>> this is
>>>> a blocker for now.
>>>>
>>>> Currently, the x86 IOMMU does not support partial unmap - the segments
>>>> have to be unmapped exactly the same addr/len as they were mapped. We
>>>> also concatenate adjacent mappings to prevent filling up the DMA
>>>> mapping
>>>> entry table with superfluous entries.
>>>>
>>>> This means that, when two unrelated mappings are contiguous in memory
>>>> (e.g. if you map regions 1 and 2 independently, but they happen to be
>>>> sitting right next to each other in virtual memory), we cannot later
>>>> unmap one of them because, even though these are two separate
>>> mappings
>>>> as far as kernel VFIO infrastructure is concerned, the mapping gets
>>>> compacted and looks like one single mapping to VFIO, so DPDK API will
>>>> not let us unmap region 1 without also unmapping region 2.
>>>>
>>>> The proper fix for this problem would be to always map memory
>>>> page-by-page regardless of where it comes from (we already do that for
>>>> internal memory, but not for external). However, the reason this works
>>>> for internal memory is because when mapping internal memory segments,
>>>> *we know the page size*. For external memory segments, there is no such
>>>> guarantee, so we cannot deduce page size for a given memory segment,
>>> and
>>>> thus can't map things page-by-page.
>>>>
>>>> So, the proper fix for it would be to add page size to the VFIO DMA
>>>> API.
>>>> Unfortunately, it probably has to wait until 21.11 because it is an API
>>>> change.
>>>>
>>>> The slightly hacky fix for this would be to forego user mem map
>>>> concatenation and trust that user is not going to do anything stupid,
>>>> and will not spam the VFIO DMA API without reason. I would rather
>>>> not go
>>>> down this road, but this could be an option in this case.
>>>>
>>>> Thoughts?
>>>>
>>>
>>> Thanks Anatoly for the detailed description of the issue.
>>> It may be possible to either create a versioned symbol for this API
>>> change, or maybe even to have a temporary internal API.
>>>
>>> But I think this series in its current form is not acceptable, so
>>> waiting for v21.11 would be the best option (we may want to send the
>>> deprecation notice in this release though).
>>>
>>> In this series, I don't like the user application has to pass a flag to
>>> state whether the DMA engine uses VFIO or not. AFAICT, this new revision
>>> does not implement what was discussed in the previous one, i.e.
>>> supporting both IOVA_AS_VA and IOVA_AS_PA.
>>
>> Thanks for your comments. Here I hope to explain some questions:
>> 1. Whether both IOVA_AS_VA and IOVA_AS_PA are supported now?
>> A: Both IOVA_AS_PA and IOVA_AS_VA are supported now. In this version,
>> the virtual address is replaced with the IOVA address of the mapped
>> region, and the IOVA address is used to program the IOMMU instead of
>> the virtual address only.
Good!
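For readers following the thread, the address selection described above can be sketched as below. This is an illustration only, not the patch's actual code: `pick_dma_addr()` and the local `iova_mode` enum are made up here to stand in for DPDK's real `rte_eal_iova_mode()` / `RTE_IOVA_VA` / `rte_mem_virt2iova()` so the fragment is self-contained.

```c
#include <stdint.h>

/* Stand-in for DPDK's enum rte_iova_mode (illustrative only). */
enum iova_mode { IOVA_PA, IOVA_VA };

/* Pick the address to program into the IOMMU for a guest memory region.
 * In IOVA-as-VA mode the virtual address *is* the IOVA; in IOVA-as-PA
 * mode the IOMMU must be programmed with the region's real IOVA (as
 * returned by rte_mem_virt2iova() in the patch), not its VA. */
static uint64_t
pick_dma_addr(enum iova_mode mode, uint64_t host_va, uint64_t host_iova)
{
	return (mode == IOVA_VA) ? host_va : host_iova;
}
```

In the patch itself, `host_iova` comes from `rte_mem_virt2iova()`, which already collapses to the VA in IOVA-as-VA mode, so both modes are covered by one code path.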
>>
>> 2. Why is a flag chosen to be passed by the application?
>> A: Yes, as we discussed before, the rte_eal_iova_mode() API can be
>> used to get the IOVA mode, so as to determine whether the IOMMU should
>> be programmed. However, during implementation I found a problem: how to
>> distinguish VFIO PA from IGB_UIO PA. Because in VFIO cases we should
>> always program the IOMMU, while in IGB_UIO cases it depends on the
>> IOMMU capability of the platform.
>
> How does one program IOMMU with igb_uio? I was under impression that
> igb_uio (and uio_pci_generic for that matter) does not provide such
> facilities.
+1
>>
>> So a flag was chosen, but this requires the application to do extra
>> work. I found another solution, which is to use:
>> #ifdef VFIO_PRESENT
>> if (rte_vfio_is_enabled("vfio"))
>> program_iommu;
>> #endif
>>
>> Because all the devices are managed by DPDK, we can follow DPDK to make
>> the decision. Does this make sense to you, or do you have any suggestions?
>
> IMO the #ifdef is not needed. The API will always work, it's just that
> if VFIO is not compiled, it'll just compile down to noops.
Agree the #ifdef is not necessary.
To be clear, the rte_vfio_is_enabled() check is going to be done in the
vhost library, making this transparent to the application?
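As a sketch of that idea: the check would live inside the vhost library, so the application never passes a flag. Here rte_vfio_is_enabled() is stubbed out so the fragment stands alone; the real symbol from <rte_vfio.h> compiles down to a no-op returning 0 when VFIO support is absent, which is why no #ifdef is needed. The helper name should_program_iommu() is made up for this example.

```c
#include <stdbool.h>

/* Stub standing in for DPDK's rte_vfio_is_enabled("vfio"); the real
 * function returns non-zero iff the vfio kernel module is loaded and
 * VFIO support was compiled in, and 0 otherwise (including when VFIO
 * is not compiled in at all -- no #ifdef VFIO_PRESENT required). */
static int
rte_vfio_is_enabled_stub(const char *modname)
{
	(void)modname;
	return 1; /* pretend "vfio" is loaded for this example */
}

/* Decision point, e.g. at set-mem-table time inside the vhost library,
 * keeping the whole thing transparent to the application. */
static bool
should_program_iommu(void)
{
	return rte_vfio_is_enabled_stub("vfio") != 0;
}
```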
>>
>> 3. The partial unmap issue
>> A: Thanks Anatoly for the detailed explanation. This problem was found
>> in reconnection cases. After our off-list discussion, the solution
>> requires a rte_vfio_container_dma_map/unmap API change. Here I want to
>> ask whether there is any hope for a versioned symbol or a temporary
>> internal API to be used in this release.
>
> I don't think we can add a versioned symbol in this release unless
> there's an exception to rc1 feature freeze. I also don't like the idea
> of a temporary internal API because vhost is not in EAL, it's a library
> - meaning, the "internal" API has to in fact be external API, added to
> the .map file etc., otherwise it won't work with shared library builds.
>
> That said, I'm not an expert on versioning, so maybe there are other
> ways I'm not aware of, or I have some misconceptions about how it works :)
OK, it may indeed be better to wait for v21.11; it is too late for
this release.
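For completeness, the page-granular mapping that Anatoly describes as the proper fix could look roughly like the sketch below, once the DMA map API learns the page size: mapping a region page by page means any page (and hence any sub-range) can later be unmapped independently, so the partial-unmap limitation disappears. This is purely illustrative; map_one_page() is a hypothetical callback standing in for rte_vfio_container_dma_map(), and the future API shape is not decided here.

```c
#include <stdint.h>

/* Hypothetical per-page mapping callback; in a real implementation this
 * would be rte_vfio_container_dma_map() on the default container. */
typedef int (*map_page_fn)(uint64_t vaddr, uint64_t iova, uint64_t len);

/* Map [vaddr, vaddr + size) page by page instead of as one entry, so
 * that the kernel VFIO driver never sees a single compacted mapping
 * spanning unrelated regions. Assumes size is a multiple of pagesz. */
static int
map_region_by_page(uint64_t vaddr, uint64_t iova, uint64_t size,
		   uint64_t pagesz, map_page_fn map_one_page)
{
	uint64_t off;

	for (off = 0; off < size; off += pagesz) {
		int ret = map_one_page(vaddr + off, iova + off, pagesz);
		if (ret != 0)
			return ret; /* caller unwinds already-mapped pages */
	}
	return 0;
}
```

The cost is more entries in the DMA mapping table, which is exactly the trade-off against concatenation discussed above.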
Thanks,
Maxime
>>
>> Thanks for your time!
>>
>> Regards,
>> Xuan
>>
>>>
>>> Regards,
>>> Maxime
>>
>
>
Thread overview: 25+ messages
2021-05-31 15:06 [dpdk-dev] [PATCH v1] lib/vhost: " xuan.ding
2021-06-02 14:26 ` [dpdk-dev] [PATCH v2] " xuan.ding
2021-06-03 17:30 ` [dpdk-dev] [PATCH v3] vhost: " xuan.ding
2021-06-18 16:17 ` Maxime Coquelin
2021-06-21 3:57 ` Hu, Jiayu
2021-06-22 6:18 ` Ding, Xuan
2021-06-29 9:23 ` Maxime Coquelin
2021-07-01 5:12 ` Ding, Xuan
2021-07-05 8:19 ` [dpdk-dev] [PATCH v4 0/2] vhost: add IOMMU support in async data path Xuan Ding
2021-07-05 8:19 ` [dpdk-dev] [PATCH v4 1/2] vhost: enable IOMMU for async vhost Xuan Ding
2021-07-05 8:19 ` [dpdk-dev] [PATCH v4 2/2] example/vhost: add dma vfio parsing Xuan Ding
2021-07-05 8:40 ` [dpdk-dev] [PATCH v5 0/2] vhost: add IOMMU support in async data path Xuan Ding
2021-07-05 8:40 ` [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost Xuan Ding
2021-07-05 12:16 ` Burakov, Anatoly
2021-07-05 12:45 ` Maxime Coquelin
2021-07-06 8:31 ` Ding, Xuan
2021-07-06 9:16 ` Burakov, Anatoly
2021-07-06 9:32 ` Maxime Coquelin [this message]
2021-07-07 6:25 ` Ding, Xuan
2021-07-07 12:17 ` Burakov, Anatoly
2021-07-07 12:54 ` Ding, Xuan
2021-07-07 14:33 ` Burakov, Anatoly
2021-07-07 15:00 ` Bruce Richardson
2021-07-07 15:09 ` Ding, Xuan
2021-07-05 8:40 ` [dpdk-dev] [PATCH v5 2/2] example/vhost: add dma vfio parsing Xuan Ding