From: David Marchand <david.marchand@redhat.com>
To: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: dev <dev@dpdk.org>, John McNamara <john.mcnamara@intel.com>,
Marko Kovacevic <marko.kovacevic@intel.com>,
dariusz.stojaczyk@intel.com,
Thomas Monjalon <thomas@monjalon.net>,
Jerin Jacob Kollanukkaran <jerinj@marvell.com>
Subject: Re: [dpdk-dev] [PATCH v4] eal: pick IOVA as PA if IOMMU is not available
Date: Mon, 29 Jul 2019 11:31:23 +0200
Message-ID: <CAJFAV8weSCL=--j=J04hdxiFJiL7P5vF4i2ud8rW0Ep0ecU4YA@mail.gmail.com>
In-Reply-To: <cda08c84d00201139718e15eeeca0dc5e67ead31.1564155444.git.anatoly.burakov@intel.com>
On Fri, Jul 26, 2019 at 5:37 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> When an IOMMU is not available, /sys/kernel/iommu_groups will not be
> populated. This has been the case since at least Linux 3.6, when VFIO
> support was added. If the directory is empty, EAL should not pick IOVA
> as VA as the default IOVA mode.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Tested-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
> Tested-by: Jerin Jacob <jerinj@marvell.com>
> Reviewed-by: Jerin Jacob <jerinj@marvell.com>
> ---
>
> Notes:
> v4:
> - Fix indentation in release notes' known issues
>
> v3:
> - Add documentation changes
> - Fix a typo pointed out by checkpatch
>
> v2:
> - Decouple IOMMU from VFIO
> - Add a check for physical addresses availability
>
> .../prog_guide/env_abstraction_layer.rst | 27 +++++++++++------
> doc/guides/rel_notes/known_issues.rst | 26 ++++++++++++++++
> doc/guides/rel_notes/release_19_08.rst | 16 ++++++++++
> lib/librte_eal/linux/eal/eal.c | 21 +++++++++++--
> lib/librte_eal/linux/eal/eal_vfio.c | 30 +++++++++++++++++++
> lib/librte_eal/linux/eal/eal_vfio.h | 2 ++
> 6 files changed, 111 insertions(+), 11 deletions(-)
>
> diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
> index 1487ea550..e6e70e5a8 100644
> --- a/doc/guides/prog_guide/env_abstraction_layer.rst
> +++ b/doc/guides/prog_guide/env_abstraction_layer.rst
> @@ -425,6 +425,9 @@ IOVA Mode Detection
> IOVA Mode is selected by considering what the current usable Devices on the
> system require and/or support.
>
> +On FreeBSD, RTE_IOVA_VA mode is not supported, so RTE_IOVA_PA is always used.
We still allow setting it via --iova-mode=va.
Is it really unsupported? vdevs like rings could work.
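To illustrate the point, a minimal sketch of such option parsing, assuming hypothetical names (iova_mode, parse_iova_mode) rather than the real EAL option parser. Nothing in it is OS-specific, so a VA request can reach EAL on FreeBSD as well.

```c
#include <string.h>

/* Hypothetical stand-ins for RTE_IOVA_DC/PA/VA; not the real EAL enum. */
enum iova_mode { IOVA_DC = 0, IOVA_PA, IOVA_VA };

/*
 * Hypothetical parser for the value of an "--iova-mode=" option.
 * Returns 0 and stores the requested mode on success, -1 on an unknown
 * value (in which case *mode is left untouched).
 */
static int
parse_iova_mode(const char *arg, enum iova_mode *mode)
{
	if (arg == NULL || mode == NULL)
		return -1;

	if (strcmp(arg, "pa") == 0)
		*mode = IOVA_PA;
	else if (strcmp(arg, "va") == 0)
		*mode = IOVA_VA;
	else
		return -1;

	return 0;
}
```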
> +On Linux, the IOVA mode is detected based on a heuristic.
> +
> Below is the 2-step heuristic for this choice.
We can combine these two sentences into a single one.
>
> For the first step, EAL asks each bus its requirement in terms of IOVA mode
> @@ -438,20 +441,26 @@ and decides on a preferred IOVA mode.
> RTE_IOVA_VA), then the preferred IOVA mode is RTE_IOVA_DC (see below with the
> check on Physical Addresses availability),
>
> +If the buses have expressed no preference on which IOVA mode to pick, then a
> +default is selected using the following logic:
> +
> +- if physical addresses are not available, RTE_IOVA_VA mode is used
> +- if /sys/kernel/iommu_groups is not empty, RTE_IOVA_VA mode is used
> +- otherwise, RTE_IOVA_PA mode is used
> +
> +If the buses disagreed on their preferred IOVA mode, some of them won't work
> +because of this decision.
> +
> The second step checks if the preferred mode complies with the Physical
> Addresses availability since those are only available to root user in recent
> -kernels.
> -
> -- if the preferred mode is RTE_IOVA_PA but there is no access to Physical
> - Addresses, then EAL init fails early, since later probing of the devices
> - would fail anyway,
> -- if the preferred mode is RTE_IOVA_DC then EAL selects the RTE_IOVA_VA mode.
> - In the case when the buses had disagreed on the IOVA Mode at the first step,
> - part of the buses won't work because of this decision.
> +kernels. Namely, if the preferred mode is RTE_IOVA_PA but there is no access to
> +Physical Addresses, then EAL init fails early, since later probing of the
> +devices would fail anyway.
>
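For reference, the documented heuristic (bus preference, then the default logic, then the physical address check) can be sketched as below. This is a minimal standalone illustration with hypothetical names (iova_mode, select_iova_mode), not the actual rte_eal_init() code; it assumes the per-bus preferences have already been combined into a single value, with disagreement yielding DC.

```c
#include <stdbool.h>

/* Hypothetical mirror of RTE_IOVA_DC/PA/VA; not the real EAL enum. */
enum iova_mode { IOVA_DC = 0, IOVA_PA, IOVA_VA };

/*
 * bus_pref:   combined bus preference (IOVA_DC if no preference/conflict)
 * phys_addrs: whether physical addresses are accessible
 * iommu:      whether /sys/kernel/iommu_groups is non-empty
 * Returns the selected mode, or -1 if EAL init should fail early.
 */
static int
select_iova_mode(enum iova_mode bus_pref, bool phys_addrs, bool iommu)
{
	enum iova_mode mode = bus_pref;

	/* step 1: buses expressed no preference, apply the default logic */
	if (mode == IOVA_DC) {
		if (!phys_addrs || iommu)
			mode = IOVA_VA;	/* no PA access, or an IOMMU is present */
		else
			mode = IOVA_PA;	/* PA readable and no IOMMU groups */
	}

	/* step 2: PA mode cannot work without access to physical addresses */
	if (mode == IOVA_PA && !phys_addrs)
		return -1;

	return (int)mode;
}
```

Written this way, the step 2 failure can only trigger when a bus explicitly asked for PA, since the default logic never picks PA without physical address access.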
> .. note::
>
> - The RTE_IOVA_VA mode is selected as the default for the following reasons:
> + The RTE_IOVA_VA mode is preferred as the default in most cases for the
> + following reasons:
>
> - All drivers are expected to work in RTE_IOVA_VA mode, irrespective of
> physical address availability.
> diff --git a/doc/guides/rel_notes/known_issues.rst b/doc/guides/rel_notes/known_issues.rst
> index 276327c15..0b50c8306 100644
> --- a/doc/guides/rel_notes/known_issues.rst
> +++ b/doc/guides/rel_notes/known_issues.rst
> @@ -861,3 +861,29 @@ AVX-512 support disabled
>
> **Driver/Module**:
> ALL.
> +
> +
> +Unsuitable IOVA mode may be picked as the default
> +----------------------------------------------------------------
> +**Description**:
> + Not all kernel drivers and not all devices support all IOVA modes. EAL will
> + attempt to pick a reasonable default based on a number of factors, but there
> + may be cases where the default is unsuitable (for example, hotplugging
> + devices using the ``igb_uio`` driver after IOVA as VA mode was picked during
> + EAL initialization).
> +
> +**Implication**:
> + Some devices (hotplugged or otherwise) may not work due to an incompatible
> + IOVA mode being automatically picked by EAL.
> +
> +**Resolution/Workaround**:
> + It is possible to force EAL to pick a particular IOVA mode by using the
> + ``--iova-mode`` command-line parameter. If conflicting requirements are
> + present (such as one device requiring IOVA as PA and one requiring IOVA as VA
> + mode), there is no workaround.
> +
> +**Affected Environment/Platform**:
> + Linux.
> +
> +**Driver/Module**:
> + ALL.
> diff --git a/doc/guides/rel_notes/release_19_08.rst b/doc/guides/rel_notes/release_19_08.rst
> index c9bd3ce18..b399ca536 100644
> --- a/doc/guides/rel_notes/release_19_08.rst
> +++ b/doc/guides/rel_notes/release_19_08.rst
> @@ -56,6 +56,12 @@ New Features
> Also, make sure to start the actual text at the margin.
> =========================================================
>
> +* **EAL will now pick IOVA as VA mode as the default in most cases.**
> +
> + Previously, the default IOVA mode was IOVA as PA. IOVA mode detection now
> + takes bus requirements, physical address availability and IOMMU presence
> + into account, and defaults to IOVA as VA in most cases.
> +
> * **Added MCS lock.**
>
> MCS lock provides scalability by spinning on a CPU/thread local variable
> @@ -436,6 +442,16 @@ Known Issues
> =========================================================
>
>
> +* **Unsuitable IOVA mode may be picked as the default**
> +
> + Not all kernel drivers and not all devices support all IOVA modes. EAL will
> + attempt to pick a reasonable default based on a number of factors, but
> + there may be cases where the default is unsuitable.
> +
> + It is recommended to use the ``--iova-mode`` command-line parameter if the
> + default is not suitable.
> +
> +
> Tested Platforms
> ----------------
>
> diff --git a/lib/librte_eal/linux/eal/eal.c b/lib/librte_eal/linux/eal/eal.c
> index 34db78753..29972b896 100644
> --- a/lib/librte_eal/linux/eal/eal.c
> +++ b/lib/librte_eal/linux/eal/eal.c
> @@ -1061,8 +1061,25 @@ rte_eal_init(int argc, char **argv)
> enum rte_iova_mode iova_mode = rte_bus_get_iommu_class();
>
> if (iova_mode == RTE_IOVA_DC) {
> - iova_mode = RTE_IOVA_VA;
> - RTE_LOG(DEBUG, EAL, "Buses did not request a specific IOVA mode, select IOVA as VA mode.\n");
> + RTE_LOG(DEBUG, EAL, "Buses did not request a specific IOVA mode.\n");
> +
> + if (!phys_addrs) {
> + /* if we have no access to physical addresses,
> + * pick IOVA as VA mode.
> + */
> + iova_mode = RTE_IOVA_VA;
> + RTE_LOG(DEBUG, EAL, "Physical addresses are unavailable, selecting IOVA as VA mode.\n");
> + } else if (vfio_iommu_enabled()) {
How about:
s/vfio_iommu_enabled/is_iommu_available/
And the code would move from the VFIO-specific files to eal.c.
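Something along these lines, as a minimal sketch: is_iommu_available() is the name suggested above, while dir_has_entries() and KERNEL_IOMMU_GROUPS_PATH are hypothetical helpers. Skipping "." and ".." by name avoids assuming that readdir() always returns exactly those two extra entries.

```c
#include <dirent.h>
#include <string.h>

/* Hypothetical macro name; the path itself is the one used in the patch. */
#define KERNEL_IOMMU_GROUPS_PATH "/sys/kernel/iommu_groups"

/*
 * Return 1 if 'path' is a readable directory holding at least one entry
 * other than "." and "..", 0 otherwise (including when it does not exist).
 */
static int
dir_has_entries(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *d;
	int found = 0;

	if (dir == NULL)
		return 0;

	while ((d = readdir(dir)) != NULL) {
		if (strcmp(d->d_name, ".") == 0 || strcmp(d->d_name, "..") == 0)
			continue;
		found = 1;
		break;
	}
	closedir(dir);

	return found;
}

/* Not VFIO specific: any kernel IOMMU group means an IOMMU is enabled. */
static int
is_iommu_available(void)
{
	return dir_has_entries(KERNEL_IOMMU_GROUPS_PATH);
}
```

Taking the path as a parameter also keeps the helper testable against an arbitrary directory.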
> + /* we have an IOMMU, pick IOVA as VA mode */
> + iova_mode = RTE_IOVA_VA;
> + RTE_LOG(DEBUG, EAL, "IOMMU is available, selecting IOVA as VA mode.\n");
> + } else {
> + /* physical addresses available, and no IOMMU
> + * found, so pick IOVA as PA.
> + */
> + iova_mode = RTE_IOVA_PA;
> + RTE_LOG(DEBUG, EAL, "IOMMU is not available, selecting IOVA as PA mode.\n");
> + }
> }
> #ifdef RTE_LIBRTE_KNI
> /* Workaround for KNI which requires physical address to work */
> diff --git a/lib/librte_eal/linux/eal/eal_vfio.c b/lib/librte_eal/linux/eal/eal_vfio.c
> index 501c74f23..92d290284 100644
> --- a/lib/librte_eal/linux/eal/eal_vfio.c
> +++ b/lib/librte_eal/linux/eal/eal_vfio.c
> @@ -2,6 +2,7 @@
> * Copyright(c) 2010-2018 Intel Corporation
> */
>
> +#include <dirent.h>
> #include <inttypes.h>
> #include <string.h>
> #include <fcntl.h>
> @@ -19,6 +20,8 @@
> #include "eal_vfio.h"
> #include "eal_private.h"
>
> +#define VFIO_KERNEL_IOMMU_GROUPS_PATH "/sys/kernel/iommu_groups"
> +
> #ifdef VFIO_PRESENT
>
> #define VFIO_MEM_EVENT_CLB_NAME "vfio_mem_event_clb"
> @@ -2147,3 +2150,30 @@ rte_vfio_container_dma_unmap(__rte_unused int container_fd,
> }
>
> #endif /* VFIO_PRESENT */
> +
> +/*
> + * on Linux 3.6+, even if VFIO is not loaded, whenever IOMMU is enabled in the
> + * BIOS and in the kernel, /sys/kernel/iommu_groups path will contain kernel
> + * IOMMU groups. If IOMMU is not enabled, that path would be empty. Therefore,
> + * checking if the path is empty will tell us if IOMMU is enabled.
> + */
> +int
> +vfio_iommu_enabled(void)
> +{
> + DIR *dir = opendir(VFIO_KERNEL_IOMMU_GROUPS_PATH);
> + struct dirent *d;
> + int n = 0;
> +
> + /* if directory doesn't exist, assume IOMMU is not enabled */
> + if (dir == NULL)
> + return 0;
> +
> + while ((d = readdir(dir)) != NULL) {
> + /* skip dot and dot-dot */
> + if (++n > 2)
> + break;
> + }
> + closedir(dir);
> +
> + return n > 2;
> +}
> diff --git a/lib/librte_eal/linux/eal/eal_vfio.h b/lib/librte_eal/linux/eal/eal_vfio.h
> index cb2d35fb1..58c7a7309 100644
> --- a/lib/librte_eal/linux/eal/eal_vfio.h
> +++ b/lib/librte_eal/linux/eal/eal_vfio.h
> @@ -133,6 +133,8 @@ vfio_has_supported_extensions(int vfio_container_fd);
>
> int vfio_mp_sync_setup(void);
>
> +int vfio_iommu_enabled(void);
> +
> #define EAL_VFIO_MP "eal_vfio_mp_sync"
>
> #define SOCKET_REQ_CONTAINER 0x100
> --
> 2.17.1
--
David Marchand