From: Maxime Coquelin <maxime.coquelin@redhat.com>
To: "Burakov, Anatoly" <anatoly.burakov@intel.com>,
"Ding, Xuan" <xuan.ding@intel.com>,
"Xia, Chenbo" <chenbo.xia@intel.com>,
Thomas Monjalon <thomas@monjalon.net>,
David Marchand <david.marchand@redhat.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>, "Hu, Jiayu" <jiayu.hu@intel.com>,
"Pai G, Sunil" <sunil.pai.g@intel.com>,
"Richardson, Bruce" <bruce.richardson@intel.com>,
"Van Haaren, Harry" <harry.van.haaren@intel.com>,
"Liu, Yong" <yong.liu@intel.com>,
"Ma, WenwuX" <wenwux.ma@intel.com>
Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
Date: Tue, 6 Jul 2021 11:32:23 +0200 [thread overview]
Message-ID: <1013369b-3661-38fd-7207-6993e61da08e@redhat.com> (raw)
In-Reply-To: <4077e127-afff-0a2b-f2ba-5850b5e0a2ff@intel.com>
On 7/6/21 11:16 AM, Burakov, Anatoly wrote:
> On 06-Jul-21 9:31 AM, Ding, Xuan wrote:
>> Hi Maxime,
>>
>>> -----Original Message-----
>>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>>> Sent: Monday, July 5, 2021 8:46 PM
>>> To: Burakov, Anatoly <anatoly.burakov@intel.com>; Ding, Xuan
>>> <xuan.ding@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>; Thomas
>>> Monjalon <thomas@monjalon.net>; David Marchand
>>> <david.marchand@redhat.com>
>>> Cc: dev@dpdk.org; Hu, Jiayu <jiayu.hu@intel.com>; Pai G, Sunil
>>> <sunil.pai.g@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>;
>>> Van Haaren, Harry <harry.van.haaren@intel.com>; Liu, Yong
>>> <yong.liu@intel.com>; Ma, WenwuX <wenwux.ma@intel.com>
>>> Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async
>>> vhost
>>>
>>>
>>>
>>> On 7/5/21 2:16 PM, Burakov, Anatoly wrote:
>>>> On 05-Jul-21 9:40 AM, Xuan Ding wrote:
>>>>> The use of IOMMU has many advantages, such as isolation and address
>>>>> translation. This patch extends the capability of the DMA engine to
>>>>> use IOMMU if the DMA device is bound to vfio.
>>>>>
>>>>> When the memory table is set, the guest memory will be mapped
>>>>> into the default container of DPDK.
>>>>>
>>>>> Signed-off-by: Xuan Ding <xuan.ding@intel.com>
>>>>> ---
>>>>> doc/guides/prog_guide/vhost_lib.rst | 9 ++++++
>>>>> lib/vhost/rte_vhost.h | 1 +
>>>>> lib/vhost/socket.c | 9 ++++++
>>>>> lib/vhost/vhost.h | 1 +
>>>>> lib/vhost/vhost_user.c | 46 ++++++++++++++++++++++++++++-
>>>>> 5 files changed, 65 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/doc/guides/prog_guide/vhost_lib.rst
>>>>> b/doc/guides/prog_guide/vhost_lib.rst
>>>>> index 05c42c9b11..c3beda23d9 100644
>>>>> --- a/doc/guides/prog_guide/vhost_lib.rst
>>>>> +++ b/doc/guides/prog_guide/vhost_lib.rst
>>>>> @@ -118,6 +118,15 @@ The following is an overview of some key Vhost
>>>>> API functions:
>>>>> It is disabled by default.
>>>>> + - ``RTE_VHOST_USER_ASYNC_USE_VFIO``
>>>>> +
>>>>> + In asynchronous data path, the vhost library is not aware of which
>>>>> + driver (igb_uio/vfio) the DMA device is bound to. The application
>>>>> + should pass this flag to tell the vhost library whether the IOMMU
>>>>> + should be programmed for guest memory.
>>>>> +
>>>>> + It is disabled by default.
>>>>> +
>>>>> - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``
>>>>> Since v16.04, the vhost library forwards checksum and gso
>>>>> requests for
>>>>> diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
>>>>> index 8d875e9322..a766ea7b6b 100644
>>>>> --- a/lib/vhost/rte_vhost.h
>>>>> +++ b/lib/vhost/rte_vhost.h
>>>>> @@ -37,6 +37,7 @@ extern "C" {
>>>>> #define RTE_VHOST_USER_LINEARBUF_SUPPORT (1ULL << 6)
>>>>> #define RTE_VHOST_USER_ASYNC_COPY (1ULL << 7)
>>>>> #define RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS (1ULL << 8)
>>>>> +#define RTE_VHOST_USER_ASYNC_USE_VFIO (1ULL << 9)
>>>>> /* Features. */
>>>>> #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
>>>>> diff --git a/lib/vhost/socket.c b/lib/vhost/socket.c
>>>>> index 5d0d728d52..77c722c86b 100644
>>>>> --- a/lib/vhost/socket.c
>>>>> +++ b/lib/vhost/socket.c
>>>>> @@ -42,6 +42,7 @@ struct vhost_user_socket {
>>>>> bool extbuf;
>>>>> bool linearbuf;
>>>>> bool async_copy;
>>>>> + bool async_use_vfio;
>>>>> bool net_compliant_ol_flags;
>>>>> /*
>>>>> @@ -243,6 +244,13 @@ vhost_user_add_connection(int fd, struct
>>>>> vhost_user_socket *vsocket)
>>>>> dev->async_copy = 1;
>>>>> }
>>>>> + if (vsocket->async_use_vfio) {
>>>>> + dev = get_device(vid);
>>>>> +
>>>>> + if (dev)
>>>>> + dev->async_use_vfio = 1;
>>>>> + }
>>>>> +
>>>>> VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
>>>>> if (vsocket->notify_ops->new_connection) {
>>>>> @@ -879,6 +887,7 @@ rte_vhost_driver_register(const char *path,
>>>>> uint64_t flags)
>>>>> vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
>>>>> vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
>>>>> vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
>>>>> + vsocket->async_use_vfio = flags & RTE_VHOST_USER_ASYNC_USE_VFIO;
>>>>> vsocket->net_compliant_ol_flags = flags &
>>>>> RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS;
>>>>> if (vsocket->async_copy &&
>>>>> diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
>>>>> index 8078ddff79..fb775ce4ed 100644
>>>>> --- a/lib/vhost/vhost.h
>>>>> +++ b/lib/vhost/vhost.h
>>>>> @@ -370,6 +370,7 @@ struct virtio_net {
>>>>> int16_t broadcast_rarp;
>>>>> uint32_t nr_vring;
>>>>> int async_copy;
>>>>> + int async_use_vfio;
>>>>> int extbuf;
>>>>> int linearbuf;
>>>>> struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS *
>>>>> 2];
>>>>> diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
>>>>> index 8f0eba6412..f3703f2e72 100644
>>>>> --- a/lib/vhost/vhost_user.c
>>>>> +++ b/lib/vhost/vhost_user.c
>>>>> @@ -45,6 +45,7 @@
>>>>> #include <rte_common.h>
>>>>> #include <rte_malloc.h>
>>>>> #include <rte_log.h>
>>>>> +#include <rte_vfio.h>
>>>>> #include "iotlb.h"
>>>>> #include "vhost.h"
>>>>> @@ -141,6 +142,36 @@ get_blk_size(int fd)
>>>>> return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
>>>>> }
>>>>> +static int
>>>>> +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
>>>>> +{
>>>>> + int ret = 0;
>>>>> + uint64_t host_iova;
>>>>> + host_iova = rte_mem_virt2iova((void *)(uintptr_t)region->host_user_addr);
>>>>> + if (do_map) {
>>>>> + /* Add mapped region into the default container of DPDK. */
>>>>> + ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
>>>>> + region->host_user_addr,
>>>>> + host_iova,
>>>>> + region->size);
>>>>> + if (ret) {
>>>>> + VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
>>>>> + return ret;
>>>>> + }
>>>>> + } else {
>>>>> + /* Remove mapped region from the default container of DPDK. */
>>>>> + ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
>>>>> + region->host_user_addr,
>>>>> + host_iova,
>>>>> + region->size);
>>>>> + if (ret) {
>>>>> + VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
>>>>> + return ret;
>>>>> + }
>>>>> + }
>>>>> + return ret;
>>>>> +}
>>>>
>>>> We've been discussing this off list with Xuan, and unfortunately
>>>> this is
>>>> a blocker for now.
>>>>
>>>> Currently, the x86 IOMMU does not support partial unmap - the segments
>>>> have to be unmapped exactly the same addr/len as they were mapped. We
>>>> also concatenate adjacent mappings to prevent filling up the DMA
>>>> mapping
>>>> entry table with superfluous entries.
>>>>
>>>> This means that, when two unrelated mappings are contiguous in memory
>>>> (e.g. if you map regions 1 and 2 independently, but they happen to be
>>>> sitting right next to each other in virtual memory), we cannot later
>>>> unmap one of them because, even though these are two separate
>>> mappings
>>>> as far as kernel VFIO infrastructure is concerned, the mapping gets
>>>> compacted and looks like one single mapping to VFIO, so DPDK API will
>>>> not let us unmap region 1 without also unmapping region 2.
>>>>
>>>> The proper fix for this problem would be to always map memory
>>>> page-by-page regardless of where it comes from (we already do that for
>>>> internal memory, but not for external). However, the reason this works
>>>> for internal memory is because when mapping internal memory segments,
>>>> *we know the page size*. For external memory segments, there is no such
>>>> guarantee, so we cannot deduce page size for a given memory segment,
>>> and
>>>> thus can't map things page-by-page.
>>>>
>>>> So, the proper fix for it would be to add page size to the VFIO DMA
>>>> API.
>>>> Unfortunately, it probably has to wait until 21.11 because it is an API
>>>> change.
>>>>
>>>> The slightly hacky fix for this would be to forego user mem map
>>>> concatenation and trust that user is not going to do anything stupid,
>>>> and will not spam the VFIO DMA API without reason. I would rather
>>>> not go
>>>> down this road, but this could be an option in this case.
>>>>
>>>> Thoughts?
>>>>
>>>
>>> Thanks Anatoly for the detailed description of the issue.
>>> It may be possible to either create a versioned symbol for this API
>>> change, or maybe even to have a temporary internal API.
>>>
>>> But I think this series in its current form is not acceptable, so
>>> waiting for v21.11 would be the best option (we may want to send the
>>> deprecation notice in this release though).
>>>
>>> In this series, I don't like the user application has to pass a flag to
>>> state whether the DMA engine uses VFIO or not. AFAICT, this new revision
>>> does not implement what was discussed in the previous one, i.e.
>>> supporting both IOVA_AS_VA and IOVA_AS_PA.
>>
>> Thanks for your comments. Here I hope to explain some questions:
>> 1. Whether both IOVA_AS_VA and IOVA_AS_PA are supported now?
>> A: Both IOVA_AS_PA and IOVA_AS_VA are supported now. In this version,
>> the virtual address is replaced with the IOVA address of the mapped
>> region, and the IOVA address is used to program the IOMMU instead of
>> the virtual address only.
Good!
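For readers following the thread, the address selection described above can be sketched as below. This is an illustration only, not the patch's actual code: `pick_dma_addr()` and the local `iova_mode` enum are made up here to stand in for DPDK's real `rte_eal_iova_mode()` / `RTE_IOVA_VA` / `rte_mem_virt2iova()` so the fragment is self-contained.

```c
#include <stdint.h>

/* Stand-in for DPDK's enum rte_iova_mode (illustrative only). */
enum iova_mode { IOVA_PA, IOVA_VA };

/* Pick the address to program into the IOMMU for a guest memory region.
 * In IOVA-as-VA mode the virtual address *is* the IOVA; in IOVA-as-PA
 * mode the IOMMU must be programmed with the region's real IOVA (as
 * returned by rte_mem_virt2iova() in the patch), not its VA. */
static uint64_t
pick_dma_addr(enum iova_mode mode, uint64_t host_va, uint64_t host_iova)
{
	return (mode == IOVA_VA) ? host_va : host_iova;
}
```

In the patch itself, `host_iova` comes from `rte_mem_virt2iova()`, which already collapses to the VA in IOVA-as-VA mode, so both modes are covered by one code path.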
>>
>> 2. Why is a flag chosen to be passed by the application?
>> A: Yes, as we discussed before, the rte_eal_iova_mode() API can be
>> used to get the IOVA mode, so as to determine whether the IOMMU should
>> be programmed. However, during implementation I found a problem: how to
>> distinguish VFIO PA from IGB_UIO PA. Because in VFIO cases we should
>> always program the IOMMU, while in IGB_UIO cases it depends on the
>> IOMMU capability of the platform.
>
> How does one program IOMMU with igb_uio? I was under impression that
> igb_uio (and uio_pci_generic for that matter) does not provide such
> facilities.
+1
>>
>> So a flag was chosen, but this requires the application to do extra
>> work. I found another solution, which is to use:
>> #ifdef VFIO_PRESENT
>> if (rte_vfio_is_enabled("vfio"))
>> program_iommu;
>> #endif
>>
>> Because all the devices are managed by DPDK, we can follow DPDK to make
>> the decision. Does this make sense to you, or do you have any suggestions?
>
> IMO the #ifdef is not needed. The API will always work, it's just that
> if VFIO is not compiled, it'll just compile down to noops.
Agree the #ifdef is not necessary.
To be clear, the rte_vfio_is_enabled() check is going to be done in the
vhost library, making this transparent to the application?
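As a sketch of that idea: the check would live inside the vhost library, so the application never passes a flag. Here rte_vfio_is_enabled() is stubbed out so the fragment stands alone; the real symbol from <rte_vfio.h> compiles down to a no-op returning 0 when VFIO support is absent, which is why no #ifdef is needed. The helper name should_program_iommu() is made up for this example.

```c
#include <stdbool.h>

/* Stub standing in for DPDK's rte_vfio_is_enabled("vfio"); the real
 * function returns non-zero iff the vfio kernel module is loaded and
 * VFIO support was compiled in, and 0 otherwise (including when VFIO
 * is not compiled in at all -- no #ifdef VFIO_PRESENT required). */
static int
rte_vfio_is_enabled_stub(const char *modname)
{
	(void)modname;
	return 1; /* pretend "vfio" is loaded for this example */
}

/* Decision point, e.g. at set-mem-table time inside the vhost library,
 * keeping the whole thing transparent to the application. */
static bool
should_program_iommu(void)
{
	return rte_vfio_is_enabled_stub("vfio") != 0;
}
```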
>>
>> 3. The partial unmap issue
>> A: Thanks Anatoly for the detailed explanation. This problem was found
>> in reconnection cases. After our off-list discussion, the solution
>> requires a rte_vfio_container_dma_map/unmap API change. Here I want to
>> ask whether there is any hope for a versioned symbol or a temporary
>> internal API to be used in this release.
>
> I don't think we can add a versioned symbol in this release unless
> there's an exception to rc1 feature freeze. I also don't like the idea
> of a temporary internal API because vhost is not in EAL, it's a library
> - meaning, the "internal" API has to in fact be external API, added to
> the .map file etc., otherwise it won't work with shared library builds.
>
> That said, I'm not an expert on versioning, so maybe there are other
> ways I'm not aware of, or I have some misconceptions about how it works :)
OK, it may indeed be better to wait for v21.11; it is too late for
this release.
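For completeness, the page-granular mapping that Anatoly describes as the proper fix could look roughly like the sketch below, once the DMA map API learns the page size: mapping a region page by page means any page (and hence any sub-range) can later be unmapped independently, so the partial-unmap limitation disappears. This is purely illustrative; map_one_page() is a hypothetical callback standing in for rte_vfio_container_dma_map(), and the future API shape is not decided here.

```c
#include <stdint.h>

/* Hypothetical per-page mapping callback; in a real implementation this
 * would be rte_vfio_container_dma_map() on the default container. */
typedef int (*map_page_fn)(uint64_t vaddr, uint64_t iova, uint64_t len);

/* Map [vaddr, vaddr + size) page by page instead of as one entry, so
 * that the kernel VFIO driver never sees a single compacted mapping
 * spanning unrelated regions. Assumes size is a multiple of pagesz. */
static int
map_region_by_page(uint64_t vaddr, uint64_t iova, uint64_t size,
		   uint64_t pagesz, map_page_fn map_one_page)
{
	uint64_t off;

	for (off = 0; off < size; off += pagesz) {
		int ret = map_one_page(vaddr + off, iova + off, pagesz);
		if (ret != 0)
			return ret; /* caller unwinds already-mapped pages */
	}
	return 0;
}
```

The cost is more entries in the DMA mapping table, which is exactly the trade-off against concatenation discussed above.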
Thanks,
Maxime
>>
>> Thanks for your time!
>>
>> Regards,
>> Xuan
>>
>>>
>>> Regards,
>>> Maxime
>>
>
>
Thread overview: 25+ messages
2021-05-31 15:06 [dpdk-dev] [PATCH v1] lib/vhost: " xuan.ding
2021-06-02 14:26 ` [dpdk-dev] [PATCH v2] " xuan.ding
2021-06-03 17:30 ` [dpdk-dev] [PATCH v3] vhost: " xuan.ding
2021-06-18 16:17 ` Maxime Coquelin
2021-06-21 3:57 ` Hu, Jiayu
2021-06-22 6:18 ` Ding, Xuan
2021-06-29 9:23 ` Maxime Coquelin
2021-07-01 5:12 ` Ding, Xuan
2021-07-05 8:19 ` [dpdk-dev] [PATCH v4 0/2] vhost: add IOMMU support in async data path Xuan Ding
2021-07-05 8:19 ` [dpdk-dev] [PATCH v4 1/2] vhost: enable IOMMU for async vhost Xuan Ding
2021-07-05 8:19 ` [dpdk-dev] [PATCH v4 2/2] example/vhost: add dma vfio parsing Xuan Ding
2021-07-05 8:40 ` [dpdk-dev] [PATCH v5 0/2] vhost: add IOMMU support in async data path Xuan Ding
2021-07-05 8:40 ` [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost Xuan Ding
2021-07-05 12:16 ` Burakov, Anatoly
2021-07-05 12:45 ` Maxime Coquelin
2021-07-06 8:31 ` Ding, Xuan
2021-07-06 9:16 ` Burakov, Anatoly
2021-07-06 9:32 ` Maxime Coquelin [this message]
2021-07-07 6:25 ` Ding, Xuan
2021-07-07 12:17 ` Burakov, Anatoly
2021-07-07 12:54 ` Ding, Xuan
2021-07-07 14:33 ` Burakov, Anatoly
2021-07-07 15:00 ` Bruce Richardson
2021-07-07 15:09 ` Ding, Xuan
2021-07-05 8:40 ` [dpdk-dev] [PATCH v5 2/2] example/vhost: add dma vfio parsing Xuan Ding