DPDK patches and discussions
From: "Ding, Xuan" <xuan.ding@intel.com>
To: Maxime Coquelin <maxime.coquelin@redhat.com>,
	"Burakov, Anatoly" <anatoly.burakov@intel.com>,
	"Xia, Chenbo" <chenbo.xia@intel.com>,
	"Thomas Monjalon" <thomas@monjalon.net>,
	David Marchand <david.marchand@redhat.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>, "Hu, Jiayu" <jiayu.hu@intel.com>,
	"Pai G, Sunil" <sunil.pai.g@intel.com>,
	"Richardson, Bruce" <bruce.richardson@intel.com>,
	"Van Haaren, Harry" <harry.van.haaren@intel.com>,
	"Liu, Yong" <yong.liu@intel.com>,
	"Ma, WenwuX" <wenwux.ma@intel.com>
Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
Date: Tue, 6 Jul 2021 08:31:04 +0000	[thread overview]
Message-ID: <BN9PR11MB551332B3D24B226A9ADAF629E71B9@BN9PR11MB5513.namprd11.prod.outlook.com> (raw)
In-Reply-To: <68f64aa6-59a8-2e17-6eab-a49a6682e626@redhat.com>

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Monday, July 5, 2021 8:46 PM
> To: Burakov, Anatoly <anatoly.burakov@intel.com>; Ding, Xuan
> <xuan.ding@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>; Thomas
> Monjalon <thomas@monjalon.net>; David Marchand
> <david.marchand@redhat.com>
> Cc: dev@dpdk.org; Hu, Jiayu <jiayu.hu@intel.com>; Pai G, Sunil
> <sunil.pai.g@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> Van Haaren, Harry <harry.van.haaren@intel.com>; Liu, Yong
> <yong.liu@intel.com>; Ma, WenwuX <wenwux.ma@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
> 
> 
> 
> On 7/5/21 2:16 PM, Burakov, Anatoly wrote:
> > On 05-Jul-21 9:40 AM, Xuan Ding wrote:
> >> The use of IOMMU has many advantages, such as isolation and address
> >> translation. This patch extends the capability of the DMA engine to use
> >> IOMMU if the DMA device is bound to vfio.
> >>
> >> When the memory table is set, the guest memory will be mapped
> >> into DPDK's default VFIO container.
> >>
> >> Signed-off-by: Xuan Ding <xuan.ding@intel.com>
> >> ---
> >>   doc/guides/prog_guide/vhost_lib.rst |  9 ++++++
> >>   lib/vhost/rte_vhost.h               |  1 +
> >>   lib/vhost/socket.c                  |  9 ++++++
> >>   lib/vhost/vhost.h                   |  1 +
> >>   lib/vhost/vhost_user.c              | 46 ++++++++++++++++++++++++++++-
> >>   5 files changed, 65 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/doc/guides/prog_guide/vhost_lib.rst
> >> b/doc/guides/prog_guide/vhost_lib.rst
> >> index 05c42c9b11..c3beda23d9 100644
> >> --- a/doc/guides/prog_guide/vhost_lib.rst
> >> +++ b/doc/guides/prog_guide/vhost_lib.rst
> >> @@ -118,6 +118,15 @@ The following is an overview of some key Vhost
> >> API functions:
> >>         It is disabled by default.
> >>   +  - ``RTE_VHOST_USER_ASYNC_USE_VFIO``
> >> +
> >> +    In the asynchronous data path, the vhost library is not aware of
> >> +    which driver (igb_uio/vfio) the DMA device is bound to. The
> >> +    application should pass this flag to tell the vhost library
> >> +    whether the IOMMU should be programmed for guest memory.
> >> +
> >> +    It is disabled by default.
> >> +
> >>     - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``
> >>         Since v16.04, the vhost library forwards checksum and gso
> >> requests for
> >> diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
> >> index 8d875e9322..a766ea7b6b 100644
> >> --- a/lib/vhost/rte_vhost.h
> >> +++ b/lib/vhost/rte_vhost.h
> >> @@ -37,6 +37,7 @@ extern "C" {
> >>   #define RTE_VHOST_USER_LINEARBUF_SUPPORT    (1ULL << 6)
> >>   #define RTE_VHOST_USER_ASYNC_COPY    (1ULL << 7)
> >>   #define RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS    (1ULL << 8)
> >> +#define RTE_VHOST_USER_ASYNC_USE_VFIO    (1ULL << 9)
> >>     /* Features. */
> >>   #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
> >> diff --git a/lib/vhost/socket.c b/lib/vhost/socket.c
> >> index 5d0d728d52..77c722c86b 100644
> >> --- a/lib/vhost/socket.c
> >> +++ b/lib/vhost/socket.c
> >> @@ -42,6 +42,7 @@ struct vhost_user_socket {
> >>       bool extbuf;
> >>       bool linearbuf;
> >>       bool async_copy;
> >> +    bool async_use_vfio;
> >>       bool net_compliant_ol_flags;
> >>         /*
> >> @@ -243,6 +244,13 @@ vhost_user_add_connection(int fd, struct
> >> vhost_user_socket *vsocket)
> >>               dev->async_copy = 1;
> >>       }
> >>   +    if (vsocket->async_use_vfio) {
> >> +        dev = get_device(vid);
> >> +
> >> +        if (dev)
> >> +            dev->async_use_vfio = 1;
> >> +    }
> >> +
> >>       VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
> >>         if (vsocket->notify_ops->new_connection) {
> >> @@ -879,6 +887,7 @@ rte_vhost_driver_register(const char *path,
> >> uint64_t flags)
> >>       vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
> >>       vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
> >>       vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
> >> +    vsocket->async_use_vfio = flags &
> RTE_VHOST_USER_ASYNC_USE_VFIO;
> >>       vsocket->net_compliant_ol_flags = flags &
> >> RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS;
> >>         if (vsocket->async_copy &&
> >> diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
> >> index 8078ddff79..fb775ce4ed 100644
> >> --- a/lib/vhost/vhost.h
> >> +++ b/lib/vhost/vhost.h
> >> @@ -370,6 +370,7 @@ struct virtio_net {
> >>       int16_t            broadcast_rarp;
> >>       uint32_t        nr_vring;
> >>       int            async_copy;
> >> +    int            async_use_vfio;
> >>       int            extbuf;
> >>       int            linearbuf;
> >>       struct vhost_virtqueue    *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
> >> diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> >> index 8f0eba6412..f3703f2e72 100644
> >> --- a/lib/vhost/vhost_user.c
> >> +++ b/lib/vhost/vhost_user.c
> >> @@ -45,6 +45,7 @@
> >>   #include <rte_common.h>
> >>   #include <rte_malloc.h>
> >>   #include <rte_log.h>
> >> +#include <rte_vfio.h>
> >>     #include "iotlb.h"
> >>   #include "vhost.h"
> >> @@ -141,6 +142,36 @@ get_blk_size(int fd)
> >>       return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
> >>   }
> >>   +static int
> >> +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
> >> +{
> >> +    int ret = 0;
> >> +    uint64_t host_iova;
> >> +    host_iova = rte_mem_virt2iova((void
> >> *)(uintptr_t)region->host_user_addr);
> >> +    if (do_map) {
> >> +        /* Add mapped region into the default container of DPDK. */
> >> +        ret =
> rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> >> +                         region->host_user_addr,
> >> +                         host_iova,
> >> +                         region->size);
> >> +        if (ret) {
> >> +            VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
> >> +            return ret;
> >> +        }
> >> +    } else {
> >> +        /* Remove mapped region from the default container of DPDK. */
> >> +        ret =
> >> rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> >> +                           region->host_user_addr,
> >> +                           host_iova,
> >> +                           region->size);
> >> +        if (ret) {
> >> +            VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
> >> +            return ret;
> >> +        }
> >> +    }
> >> +    return ret;
> >> +}
> >
> > We've been discussing this off list with Xuan, and unfortunately this is
> > a blocker for now.
> >
> > Currently, the x86 IOMMU does not support partial unmap - the segments
> > have to be unmapped exactly the same addr/len as they were mapped. We
> > also concatenate adjacent mappings to prevent filling up the DMA mapping
> > entry table with superfluous entries.
> >
> > This means that, when two unrelated mappings are contiguous in memory
> > (e.g. if you map regions 1 and 2 independently, but they happen to be
> > sitting right next to each other in virtual memory), we cannot later
> > unmap one of them because, even though these are two separate
> mappings
> > as far as kernel VFIO infrastructure is concerned, the mapping gets
> > compacted and looks like one single mapping to VFIO, so DPDK API will
> > not let us unmap region 1 without also unmapping region 2.
> >
> > The proper fix for this problem would be to always map memory
> > page-by-page regardless of where it comes from (we already do that for
> > internal memory, but not for external). However, the reason this works
> > for internal memory is because when mapping internal memory segments,
> > *we know the page size*. For external memory segments, there is no such
> > guarantee, so we cannot deduce page size for a given memory segment,
> and
> > thus can't map things page-by-page.
> >
> > So, the proper fix for it would be to add page size to the VFIO DMA API.
> > Unfortunately, it probably has to wait until 21.11 because it is an API
> > change.
> >
> > The slightly hacky fix for this would be to forego user mem map
> > concatenation and trust that user is not going to do anything stupid,
> > and will not spam the VFIO DMA API without reason. I would rather not go
> > down this road, but this could be an option in this case.
> >
> > Thoughts?
> >
> 
> Thanks Anatoly for the detailed description of the issue.
> It may be possible to either create a versioned symbol for this API
> change, or maybe even to have a temporary internal API.
> 
> But I think this series in its current form is not acceptable, so
> waiting for v21.11 would be the best option (we may want to send the
> deprecation notice in this release though).
> 
> In this series, I don't like that the user application has to pass a flag to
> state whether the DMA engine uses VFIO or not. AFAICT, this new revision
> does not implement what was discussed in the previous one, i.e.
> supporting both IOVA_AS_VA and IOVA_AS_PA.

Thanks for your comments. Let me address your questions:
1. Are both IOVA_AS_VA and IOVA_AS_PA supported now?
A: Yes, both IOVA_AS_PA and IOVA_AS_VA are supported in this version.
The virtual address is replaced with the IOVA of the mapped region, and
the IOVA is selected to program the IOMMU instead of the virtual
address only.

2. Why is a flag passed by the application?
A: As we discussed before, the rte_eal_iova_mode() API can be used to
get the IOVA mode, so as to determine whether the IOMMU should be
programmed. However, during implementation I found a problem: how to
distinguish VFIO PA from IGB_UIO PA. In VFIO cases we should always
program the IOMMU, while in IGB_UIO cases it depends on the IOMMU
capability of the platform.

So a flag was chosen, but it requires the application to do extra work.
I found another solution, which is to use:
#ifdef VFIO_PRESENT
	if (rte_vfio_is_enabled("vfio"))
		program_iommu();
#endif

Because all the devices are managed by DPDK, we can let DPDK make the
decision. Does this make sense to you, or do you have any suggestions?
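
To spell out the rule I am proposing, here is a tiny standalone model
(my own sketch, not DPDK code; the helper name and parameters are
hypothetical):

```c
#include <stdbool.h>

/* Decision rule discussed above: when the DMA device is bound to VFIO,
 * the IOMMU must always be programmed; when it is bound to igb_uio,
 * whether DMA remapping applies depends on the platform's IOMMU
 * capability. */
static bool
should_program_iommu(bool vfio_enabled, bool platform_has_iommu)
{
	if (vfio_enabled)
		return true;            /* VFIO case: always program */
	return platform_has_iommu;      /* igb_uio case: platform dependent */
}
```

With rte_vfio_is_enabled() answering the first question at runtime,
vhost would not need any new user-visible flag.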

3. The partial unmap issue
A: Thanks Anatoly for the detailed explanation. This problem was found
in reconnection cases. Per our off-list discussion, the solution
requires a change to the rte_vfio_container_dma_map/unmap API. May I
ask whether there is any hope of a versioned symbol or a temporary
internal API being used in this release?
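
For other readers, here is a tiny standalone model (my own sketch, not
DPDK or kernel code) of why a compacted mapping cannot be partially
unmapped: adjacent maps get merged, and unmap then requires an exact
addr/len match:

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_MAPS 8

struct map { uint64_t addr, len; };

static struct map maps[MAX_MAPS];
static int nmaps;

/* Insert a mapping, concatenating it with an adjacent existing one
 * (mimics the compaction of user mem maps described above). */
static void
dma_map(uint64_t addr, uint64_t len)
{
	for (int i = 0; i < nmaps; i++) {
		if (maps[i].addr + maps[i].len == addr) {
			maps[i].len += len;	/* merge with predecessor */
			return;
		}
	}
	maps[nmaps].addr = addr;
	maps[nmaps].len = len;
	nmaps++;
}

/* Unmap succeeds only on an exact addr/len match, as the x86 IOMMU
 * behavior described in this thread requires. */
static bool
dma_unmap(uint64_t addr, uint64_t len)
{
	for (int i = 0; i < nmaps; i++) {
		if (maps[i].addr == addr && maps[i].len == len) {
			maps[i] = maps[--nmaps];
			return true;
		}
	}
	return false;
}
```

Mapping two adjacent regions leaves one merged entry, so later
unmapping only the first region finds no matching entry and fails,
which is exactly the reconnection-case failure mode.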

Thanks for your time!

Regards,
Xuan

> 
> Regards,
> Maxime



Thread overview: 25+ messages
2021-05-31 15:06 [dpdk-dev] [PATCH v1] lib/vhost: " xuan.ding
2021-06-02 14:26 ` [dpdk-dev] [PATCH v2] " xuan.ding
2021-06-03 17:30 ` [dpdk-dev] [PATCH v3] vhost: " xuan.ding
2021-06-18 16:17   ` Maxime Coquelin
2021-06-21  3:57     ` Hu, Jiayu
2021-06-22  6:18     ` Ding, Xuan
2021-06-29  9:23       ` Maxime Coquelin
2021-07-01  5:12         ` Ding, Xuan
2021-07-05  8:19 ` [dpdk-dev] [PATCH v4 0/2] vhost: add IOMMU support in async data path Xuan Ding
2021-07-05  8:19   ` [dpdk-dev] [PATCH v4 1/2] vhost: enable IOMMU for async vhost Xuan Ding
2021-07-05  8:19   ` [dpdk-dev] [PATCH v4 2/2] example/vhost: add dma vfio parsing Xuan Ding
2021-07-05  8:40 ` [dpdk-dev] [PATCH v5 0/2] vhost: add IOMMU support in async data path Xuan Ding
2021-07-05  8:40   ` [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost Xuan Ding
2021-07-05 12:16     ` Burakov, Anatoly
2021-07-05 12:45       ` Maxime Coquelin
2021-07-06  8:31         ` Ding, Xuan [this message]
2021-07-06  9:16           ` Burakov, Anatoly
2021-07-06  9:32             ` Maxime Coquelin
2021-07-07  6:25               ` Ding, Xuan
2021-07-07 12:17                 ` Burakov, Anatoly
2021-07-07 12:54                   ` Ding, Xuan
2021-07-07 14:33                     ` Burakov, Anatoly
2021-07-07 15:00                       ` Bruce Richardson
2021-07-07 15:09                       ` Ding, Xuan
2021-07-05  8:40   ` [dpdk-dev] [PATCH v5 2/2] example/vhost: add dma vfio parsing Xuan Ding
