DPDK patches and discussions
 help / color / mirror / Atom feed
From: "Ding, Xuan" <xuan.ding@intel.com>
To: "Burakov, Anatoly" <anatoly.burakov@intel.com>,
	Maxime Coquelin <maxime.coquelin@redhat.com>,
	"Xia, Chenbo" <chenbo.xia@intel.com>,
	"Thomas Monjalon" <thomas@monjalon.net>,
	David Marchand <david.marchand@redhat.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>, "Hu, Jiayu" <jiayu.hu@intel.com>,
	"Pai G, Sunil" <sunil.pai.g@intel.com>,
	"Richardson, Bruce" <bruce.richardson@intel.com>,
	"Van Haaren, Harry" <harry.van.haaren@intel.com>,
	"Liu, Yong" <yong.liu@intel.com>,
	"Ma, WenwuX" <wenwux.ma@intel.com>
Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
Date: Wed, 7 Jul 2021 12:54:46 +0000	[thread overview]
Message-ID: <BN9PR11MB551341408628DA3F3C610CC9E71A9@BN9PR11MB5513.namprd11.prod.outlook.com> (raw)
In-Reply-To: <095cc219-4cba-5fbe-a7c0-bf02138f0959@intel.com>

Hi Anatoly,

> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Wednesday, July 7, 2021 8:18 PM
> To: Ding, Xuan <xuan.ding@intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Thomas Monjalon <thomas@monjalon.net>; David Marchand
> <david.marchand@redhat.com>
> Cc: dev@dpdk.org; Hu, Jiayu <jiayu.hu@intel.com>; Pai G, Sunil
> <sunil.pai.g@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Van
> Haaren, Harry <harry.van.haaren@intel.com>; Liu, Yong <yong.liu@intel.com>;
> Ma, WenwuX <wenwux.ma@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
> 
> On 07-Jul-21 7:25 AM, Ding, Xuan wrote:
> > Hi,
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> >> Sent: Tuesday, July 6, 2021 5:32 PM
> >> To: Burakov, Anatoly <anatoly.burakov@intel.com>; Ding, Xuan
> >> <xuan.ding@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>; Thomas
> >> Monjalon <thomas@monjalon.net>; David Marchand
> >> <david.marchand@redhat.com>
> >> Cc: dev@dpdk.org; Hu, Jiayu <jiayu.hu@intel.com>; Pai G, Sunil
> >> <sunil.pai.g@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> Van
> >> Haaren, Harry <harry.van.haaren@intel.com>; Liu, Yong
> <yong.liu@intel.com>;
> >> Ma, WenwuX <wenwux.ma@intel.com>
> >> Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
> >>
> >>
> >>
> >> On 7/6/21 11:16 AM, Burakov, Anatoly wrote:
> >>> On 06-Jul-21 9:31 AM, Ding, Xuan wrote:
> >>>> Hi Maxime,
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> >>>>> Sent: Monday, July 5, 2021 8:46 PM
> >>>>> To: Burakov, Anatoly <anatoly.burakov@intel.com>; Ding, Xuan
> >>>>> <xuan.ding@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>; Thomas
> >>>>> Monjalon <thomas@monjalon.net>; David Marchand
> >>>>> <david.marchand@redhat.com>
> >>>>> Cc: dev@dpdk.org; Hu, Jiayu <jiayu.hu@intel.com>; Pai G, Sunil
> >>>>> <sunil.pai.g@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>;
> >>>>> Van Haaren, Harry <harry.van.haaren@intel.com>; Liu, Yong
> >>>>> <yong.liu@intel.com>; Ma, WenwuX <wenwux.ma@intel.com>
> >>>>> Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async
> >>>>> vhost
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 7/5/21 2:16 PM, Burakov, Anatoly wrote:
> >>>>>> On 05-Jul-21 9:40 AM, Xuan Ding wrote:
> >>>>>>> The use of IOMMU has many advantages, such as isolation and address
> >>>>>>> translation. This patch extends the capbility of DMA engine to use
> >>>>>>> IOMMU if the DMA device is bound to vfio.
> >>>>>>>
> >>>>>>> When set memory table, the guest memory will be mapped
> >>>>>>> into the default container of DPDK.
> >>>>>>>
> >>>>>>> Signed-off-by: Xuan Ding <xuan.ding@intel.com>
> >>>>>>> ---
> >>>>>>>     doc/guides/prog_guide/vhost_lib.rst |  9 ++++++
> >>>>>>>     lib/vhost/rte_vhost.h               |  1 +
> >>>>>>>     lib/vhost/socket.c                  |  9 ++++++
> >>>>>>>     lib/vhost/vhost.h                   |  1 +
> >>>>>>>     lib/vhost/vhost_user.c              | 46
> >>>>>>> ++++++++++++++++++++++++++++-
> >>>>>>>     5 files changed, 65 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/doc/guides/prog_guide/vhost_lib.rst
> >>>>>>> b/doc/guides/prog_guide/vhost_lib.rst
> >>>>>>> index 05c42c9b11..c3beda23d9 100644
> >>>>>>> --- a/doc/guides/prog_guide/vhost_lib.rst
> >>>>>>> +++ b/doc/guides/prog_guide/vhost_lib.rst
> >>>>>>> @@ -118,6 +118,15 @@ The following is an overview of some key
> Vhost
> >>>>>>> API functions:
> >>>>>>>           It is disabled by default.
> >>>>>>>     +  - ``RTE_VHOST_USER_ASYNC_USE_VFIO``
> >>>>>>> +
> >>>>>>> +    In asynchronous data path, vhost liarary is not aware of which
> >>>>>>> driver
> >>>>>>> +    (igb_uio/vfio) the DMA device is bound to. Application should
> >>>>>>> pass
> >>>>>>> +    this flag to tell vhost library whether IOMMU should be
> >>>>>>> programmed
> >>>>>>> +    for guest memory.
> >>>>>>> +
> >>>>>>> +    It is disabled by default.
> >>>>>>> +
> >>>>>>>       - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``
> >>>>>>>           Since v16.04, the vhost library forwards checksum and gso
> >>>>>>> requests for
> >>>>>>> diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
> >>>>>>> index 8d875e9322..a766ea7b6b 100644
> >>>>>>> --- a/lib/vhost/rte_vhost.h
> >>>>>>> +++ b/lib/vhost/rte_vhost.h
> >>>>>>> @@ -37,6 +37,7 @@ extern "C" {
> >>>>>>>     #define RTE_VHOST_USER_LINEARBUF_SUPPORT    (1ULL << 6)
> >>>>>>>     #define RTE_VHOST_USER_ASYNC_COPY    (1ULL << 7)
> >>>>>>>     #define RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS    (1ULL <<
> 8)
> >>>>>>> +#define RTE_VHOST_USER_ASYNC_USE_VFIO    (1ULL << 9)
> >>>>>>>       /* Features. */
> >>>>>>>     #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
> >>>>>>> diff --git a/lib/vhost/socket.c b/lib/vhost/socket.c
> >>>>>>> index 5d0d728d52..77c722c86b 100644
> >>>>>>> --- a/lib/vhost/socket.c
> >>>>>>> +++ b/lib/vhost/socket.c
> >>>>>>> @@ -42,6 +42,7 @@ struct vhost_user_socket {
> >>>>>>>         bool extbuf;
> >>>>>>>         bool linearbuf;
> >>>>>>>         bool async_copy;
> >>>>>>> +    bool async_use_vfio;
> >>>>>>>         bool net_compliant_ol_flags;
> >>>>>>>           /*
> >>>>>>> @@ -243,6 +244,13 @@ vhost_user_add_connection(int fd, struct
> >>>>>>> vhost_user_socket *vsocket)
> >>>>>>>                 dev->async_copy = 1;
> >>>>>>>         }
> >>>>>>>     +    if (vsocket->async_use_vfio) {
> >>>>>>> +        dev = get_device(vid);
> >>>>>>> +
> >>>>>>> +        if (dev)
> >>>>>>> +            dev->async_use_vfio = 1;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>>         VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
> >>>>>>>           if (vsocket->notify_ops->new_connection) {
> >>>>>>> @@ -879,6 +887,7 @@ rte_vhost_driver_register(const char *path,
> >>>>>>> uint64_t flags)
> >>>>>>>         vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
> >>>>>>>         vsocket->linearbuf = flags &
> >> RTE_VHOST_USER_LINEARBUF_SUPPORT;
> >>>>>>>         vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
> >>>>>>> +    vsocket->async_use_vfio = flags &
> >>>>> RTE_VHOST_USER_ASYNC_USE_VFIO;
> >>>>>>>         vsocket->net_compliant_ol_flags = flags &
> >>>>>>> RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS;
> >>>>>>>           if (vsocket->async_copy &&
> >>>>>>> diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
> >>>>>>> index 8078ddff79..fb775ce4ed 100644
> >>>>>>> --- a/lib/vhost/vhost.h
> >>>>>>> +++ b/lib/vhost/vhost.h
> >>>>>>> @@ -370,6 +370,7 @@ struct virtio_net {
> >>>>>>>         int16_t            broadcast_rarp;
> >>>>>>>         uint32_t        nr_vring;
> >>>>>>>         int            async_copy;
> >>>>>>> +    int            async_use_vfio;
> >>>>>>>         int            extbuf;
> >>>>>>>         int            linearbuf;
> >>>>>>>         struct vhost_virtqueue    *virtqueue[VHOST_MAX_QUEUE_PAIRS *
> >>>>>>> 2];
> >>>>>>> diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> >>>>>>> index 8f0eba6412..f3703f2e72 100644
> >>>>>>> --- a/lib/vhost/vhost_user.c
> >>>>>>> +++ b/lib/vhost/vhost_user.c
> >>>>>>> @@ -45,6 +45,7 @@
> >>>>>>>     #include <rte_common.h>
> >>>>>>>     #include <rte_malloc.h>
> >>>>>>>     #include <rte_log.h>
> >>>>>>> +#include <rte_vfio.h>
> >>>>>>>       #include "iotlb.h"
> >>>>>>>     #include "vhost.h"
> >>>>>>> @@ -141,6 +142,36 @@ get_blk_size(int fd)
> >>>>>>>         return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
> >>>>>>>     }
> >>>>>>>     +static int
> >>>>>>> +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
> >>>>>>> +{
> >>>>>>> +    int ret = 0;
> >>>>>>> +    uint64_t host_iova;
> >>>>>>> +    host_iova = rte_mem_virt2iova((void
> >>>>>>> *)(uintptr_t)region->host_user_addr);
> >>>>>>> +    if (do_map) {
> >>>>>>> +        /* Add mapped region into the default container of DPDK. */
> >>>>>>> +        ret =
> >>>>> rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> >>>>>>> +                         region->host_user_addr,
> >>>>>>> +                         host_iova,
> >>>>>>> +                         region->size);
> >>>>>>> +        if (ret) {
> >>>>>>> +            VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
> >>>>>>> +            return ret;
> >>>>>>> +        }
> >>>>>>> +    } else {
> >>>>>>> +        /* Remove mapped region from the default container of
> >>>>>>> DPDK. */
> >>>>>>> +        ret =
> >>>>>>> rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> >>>>>>> +                           region->host_user_addr,
> >>>>>>> +                           host_iova,
> >>>>>>> +                           region->size);
> >>>>>>> +        if (ret) {
> >>>>>>> +            VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
> >>>>>>> +            return ret;
> >>>>>>> +        }
> >>>>>>> +    }
> >>>>>>> +    return ret;
> >>>>>>> +}
> >>>>>>
> >>>>>> We've been discussing this off list with Xuan, and unfortunately
> >>>>>> this is
> >>>>>> a blocker for now.
> >>>>>>
> >>>>>> Currently, the x86 IOMMU does not support partial unmap - the
> segments
> >>>>>> have to be unmapped exactly the same addr/len as they were mapped.
> We
> >>>>>> also concatenate adjacent mappings to prevent filling up the DMA
> >>>>>> mapping
> >>>>>> entry table with superfluous entries.
> >>>>>>
> >>>>>> This means that, when two unrelated mappings are contiguous in
> memory
> >>>>>> (e.g. if you map regions 1 and 2 independently, but they happen to be
> >>>>>> sitting right next to each other in virtual memory), we cannot later
> >>>>>> unmap one of them because, even though these are two separate
> >>>>> mappings
> >>>>>> as far as kernel VFIO infrastructure is concerned, the mapping gets
> >>>>>> compacted and looks like one single mapping to VFIO, so DPDK API will
> >>>>>> not let us unmap region 1 without also unmapping region 2.
> >>>>>>
> >>>>>> The proper fix for this problem would be to always map memory
> >>>>>> page-by-page regardless of where it comes from (we already do that for
> >>>>>> internal memory, but not for external). However, the reason this works
> >>>>>> for internal memory is because when mapping internal memory
> segments,
> >>>>>> *we know the page size*. For external memory segments, there is no
> such
> >>>>>> guarantee, so we cannot deduce page size for a given memory segment,
> >>>>> and
> >>>>>> thus can't map things page-by-page.
> >>>>>>
> >>>>>> So, the proper fix for it would be to add page size to the VFIO DMA
> >>>>>> API.
> >>>>>> Unfortunately, it probably has to wait until 21.11 because it is an API
> >>>>>> change.
> >>>>>>
> >>>>>> The slightly hacky fix for this would be to forego user mem map
> >>>>>> concatenation and trust that user is not going to do anything stupid,
> >>>>>> and will not spam the VFIO DMA API without reason. I would rather
> >>>>>> not go
> >>>>>> down this road, but this could be an option in this case.
> >>>>>>
> >>>>>> Thoughts?
> >>>>>>
> >>>>>
> >>>>> Thanks Anatoly for the detailed description of the issue.
> >>>>> It may be possible to either create a versioned symbol for this API
> >>>>> change, or maybe even to have a temporary internal API.
> >>>>>
> >>>>> But I think this series in its current form is not acceptable, so
> >>>>> waiting for v21.11 would be the best option (we may want to send the
> >>>>> deprecation notice in this release though).
> >>>>>
> >>>>> In this series, I don't like the user application has to pass a flag to
> >>>>> state whether the DMA engine uses VFIO or not. AFAICT, this new revision
> >>>>> does not implement what was discussed in the previous one, i.e.
> >>>>> supporting both IOVA_AS_VA and IOVA_AS_PA.
> >>>>
> >>>> Thanks for your comments. Here I hope to explain some questions:
> >>>> 1. Whether both IOVA_AS_VA and IOVA_AS_PA are supported now?
> >>>> A: Both IOVA_AS_PA and IOVA_AS_VA are supported now. In this version,
> the
> >>>> virtual address is replaced with iova address of mapped region, and
> >>>> the iova
> >>>> address is selected to program the IOMMU instead of virtual address only.
> >>
> >> Good!
> >>
> >>>>
> >>>> 2. Why a flag is chosen to be passed by application?
> >>>> A: Yes, as we discussed before, the rte_eal_iova_mode() API can be
> >>>> used to
> >>>> get the IOVA mode, so as to determine whether IOMMU should be
> >> programmed.
> >>>> However, in the implementation process, I found a problem. That is how to
> >>>> distinguish the VFIO PA and IGB_UIO PA. Because for VFIO cases, we
> should
> >>>> always program the IOMMU. While in IGB_UIO cases, it depends on
> IOMMU
> >>>> capability of platform.
> >>>
> >>> How does one program IOMMU with igb_uio? I was under impression that
> >>> igb_uio (and uio_pci_generic for that matter) does not provide such
> >>> facilities.
> >>
> >> +1
> >
> > Maybe some misunderstanding in this sentence here.
> > In our design, if rte_eal_vfio_is_enabled("vfio") is true, iommu will be
> programmed.
> > True means vfio module is modprobed.
> >
> > But there is an exception here, that is, even if vfio module is modprobed,
> > DPDK user still bind all the devices to igb_uio.
> >
> > This situation can be distinguished in DPDK eal initialization, because the
> resource mapping
> > is according to the driver loaded by each device(rte_pci_map_device).
> >
> > But in our scenario, this judgment is somewhat weak. Because we cannot get
> > the device driver info in vhost library. I also think it is unreasonable for vhost to
> > do this. Only trust that users will not use it like this. Thoughts for this scenario?
> 
> I don't see how igb_uio would make any difference at all. If you are
> using igb_uio, you *don't have DMA mapping at all* and will use raw
> physical addresses. Assuming your code supports this, that's all you're
> ever going to get. The point of VFIO is to have memory regions that are
> mapped for DMA *because real physical addresses are assumed to be not
> available*. When you're using igb_uio, you effectively do have DMA
> access to the entire memory, and thus can bypass IOMMU altogether
> (assuming you're using passthrough mode).

My concern is exactly here.
In igb_uio cases, although devices are not added to the default container in eal init,
but the "IOMMU programming" actually happens when the rte_vfio_container_dma_map() is called.
It is no harm but it is also unnecessary.

> 
> Bottom line: do VFIO DMA mapping unconditionally. If VFIO is active -
> great, the memory will be DMA mapped. If it's not active - no harm will
> ever be done by mapping the memory for DMA anyway.

Do VFIO DMA mapping unconditionally, do you mean the rte_eal_vfio_is_enabled() is unnecessary?
What if the platform does not have IOMMU?

Thanks very much.

Regards,
Xuan

> 
> --
> Thanks,
> Anatoly

  reply	other threads:[~2021-07-07 12:55 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-31 15:06 [dpdk-dev] [PATCH v1] lib/vhost: " xuan.ding
2021-06-02 14:26 ` [dpdk-dev] [PATCH v2] " xuan.ding
2021-06-03 17:30 ` [dpdk-dev] [PATCH v3] vhost: " xuan.ding
2021-06-18 16:17   ` Maxime Coquelin
2021-06-21  3:57     ` Hu, Jiayu
2021-06-22  6:18     ` Ding, Xuan
2021-06-29  9:23       ` Maxime Coquelin
2021-07-01  5:12         ` Ding, Xuan
2021-07-05  8:19 ` [dpdk-dev] [PATCH v4 0/2] vhost: add IOMMU support in async data path Xuan Ding
2021-07-05  8:19   ` [dpdk-dev] [PATCH v4 1/2] vhost: enable IOMMU for async vhost Xuan Ding
2021-07-05  8:19   ` [dpdk-dev] [PATCH v4 2/2] example/vhost: add dma vfio parsing Xuan Ding
2021-07-05  8:40 ` [dpdk-dev] [PATCH v5 0/2] vhost: add IOMMU support in async data path Xuan Ding
2021-07-05  8:40   ` [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost Xuan Ding
2021-07-05 12:16     ` Burakov, Anatoly
2021-07-05 12:45       ` Maxime Coquelin
2021-07-06  8:31         ` Ding, Xuan
2021-07-06  9:16           ` Burakov, Anatoly
2021-07-06  9:32             ` Maxime Coquelin
2021-07-07  6:25               ` Ding, Xuan
2021-07-07 12:17                 ` Burakov, Anatoly
2021-07-07 12:54                   ` Ding, Xuan [this message]
2021-07-07 14:33                     ` Burakov, Anatoly
2021-07-07 15:00                       ` Bruce Richardson
2021-07-07 15:09                       ` Ding, Xuan
2021-07-05  8:40   ` [dpdk-dev] [PATCH v5 2/2] example/vhost: add dma vfio parsing Xuan Ding

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=BN9PR11MB551341408628DA3F3C610CC9E71A9@BN9PR11MB5513.namprd11.prod.outlook.com \
    --to=xuan.ding@intel.com \
    --cc=anatoly.burakov@intel.com \
    --cc=bruce.richardson@intel.com \
    --cc=chenbo.xia@intel.com \
    --cc=david.marchand@redhat.com \
    --cc=dev@dpdk.org \
    --cc=harry.van.haaren@intel.com \
    --cc=jiayu.hu@intel.com \
    --cc=maxime.coquelin@redhat.com \
    --cc=sunil.pai.g@intel.com \
    --cc=thomas@monjalon.net \
    --cc=wenwux.ma@intel.com \
    --cc=yong.liu@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).