DPDK patches and discussions
* [dpdk-dev] [PATCH 0/2] fix issue with partial DMA unmap
@ 2020-10-12  8:11 Nithin Dabilpuram
  2020-10-12  8:11 ` [dpdk-dev] [PATCH 1/2] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
                   ` (8 more replies)
  0 siblings, 9 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-10-12  8:11 UTC (permalink / raw)
  To: anatoly.burakov; +Cc: jerinj, dev, Nithin Dabilpuram

Partial DMA unmap is not supported by the VFIO type1 IOMMU
in Linux. Though the return value is zero, the returned
DMA unmap size is not the same as the expected size.
So add a test case and a fix for both heap-triggered DMA
mapping and user-triggered DMA mapping/unmapping.

Refer to vfio_dma_do_unmap() in drivers/vfio/vfio_iommu_type1.c;
a snippet of its comment is below.

        /*
         * vfio-iommu-type1 (v1) - User mappings were coalesced together to
         * avoid tracking individual mappings.  This means that the granularity
         * of the original mapping was lost and the user was allowed to attempt
         * to unmap any range.  Depending on the contiguousness of physical
         * memory and page sizes supported by the IOMMU, arbitrary unmaps may
         * or may not have worked.  We only guaranteed unmap granularity
         * matching the original mapping; even though it was untracked here,
         * the original mappings are reflected in IOMMU mappings.  This
         * resulted in a couple unusual behaviors.  First, if a range is not
         * able to be unmapped, ex. a set of 4k pages that was mapped as a
         * 2M hugepage into the IOMMU, the unmap ioctl returns success but with
         * a zero sized unmap.  Also, if an unmap request overlaps the first
         * address of a hugepage, the IOMMU will unmap the entire hugepage.
         * This also returns success and the returned unmap size reflects the
         * actual size unmapped.

         * We attempt to maintain compatibility with this "v1" interface, but
         * we take control out of the hands of the IOMMU.  Therefore, an unmap
         * request offset from the beginning of the original mapping will
         * return success with zero sized unmap.  And an unmap request covering
         * the first iova of mapping will unmap the entire range.
         */

This behavior can be verified by applying the first patch and adding a
return check for dma_unmap.size != len in vfio_type1_dma_mem_map().
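The effect is easy to model outside the kernel. The sketch below (model_unmap()
and its mapping struct are invented names; it only mirrors the semantics
documented in the comment above and does not talk to VFIO) shows why a caller
that trusts the ioctl's return code alone never notices a failed partial unmap:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a single coalesced type1 "v1" mapping. */
struct mapping {
	uint64_t iova;
	uint64_t size;
};

/*
 * Mimic the documented vfio_dma_do_unmap() v1 behavior:
 *  - a request offset from the start of the mapping "succeeds" with
 *    zero bytes unmapped;
 *  - a request covering the first iova unmaps the entire mapping.
 * Like the ioctl, this always returns 0 and reports the unmapped size
 * through *unmapped.
 */
static int
model_unmap(struct mapping *m, uint64_t iova, uint64_t len,
	    uint64_t *unmapped)
{
	(void)len; /* requested length does not matter in this model */

	if (m->size == 0 || iova != m->iova) {
		*unmapped = 0; /* "success but with a zero sized unmap" */
		return 0;
	}
	*unmapped = m->size; /* first iova hit: the whole range goes away */
	m->size = 0;
	return 0;
}
```

A caller must therefore compare the reported unmap size against the requested
length, exactly as the second patch does for dma_unmap.size != len.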

Nithin Dabilpuram (2):
  test: add test case to validate VFIO DMA map/unmap
  vfio: fix partial DMA unmapping for VFIO type1

 app/test/test_memory.c          | 79 +++++++++++++++++++++++++++++++++++++++++
 lib/librte_eal/linux/eal_vfio.c | 34 ++++++++++++++----
 lib/librte_eal/linux/eal_vfio.h |  1 +
 3 files changed, 108 insertions(+), 6 deletions(-)

-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH 1/2] test: add test case to validate VFIO DMA map/unmap
  2020-10-12  8:11 [dpdk-dev] [PATCH 0/2] fix issue with partial DMA unmap Nithin Dabilpuram
@ 2020-10-12  8:11 ` Nithin Dabilpuram
  2020-10-14 14:39   ` Burakov, Anatoly
  2020-10-12  8:11 ` [dpdk-dev] [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1 Nithin Dabilpuram
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-10-12  8:11 UTC (permalink / raw)
  To: anatoly.burakov; +Cc: jerinj, dev, Nithin Dabilpuram

Add test case in test_memory to test VFIO DMA map/unmap.

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
---
 app/test/test_memory.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/app/test/test_memory.c b/app/test/test_memory.c
index 7d5ae99..1c56455 100644
--- a/app/test/test_memory.c
+++ b/app/test/test_memory.c
@@ -4,11 +4,16 @@
 
 #include <stdio.h>
 #include <stdint.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
 
 #include <rte_eal.h>
+#include <rte_errno.h>
 #include <rte_memory.h>
 #include <rte_common.h>
 #include <rte_memzone.h>
+#include <rte_vfio.h>
 
 #include "test.h"
 
@@ -70,6 +75,71 @@ check_seg_fds(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
 }
 
 static int
+test_memory_vfio_dma_map(void)
+{
+	uint64_t sz = 2 * sysconf(_SC_PAGESIZE), sz1, sz2;
+	uint64_t unmap1, unmap2;
+	uint8_t *mem;
+	int ret;
+
+	/* Check if vfio is enabled in both kernel and eal */
+	ret = rte_vfio_is_enabled("vfio");
+	if (!ret)
+		return 1;
+
+	/* Allocate twice size of page */
+	mem = mmap(NULL, sz, PROT_READ | PROT_WRITE,
+		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (mem == MAP_FAILED) {
+		printf("Failed to allocate memory for external heap\n");
+		return -1;
+	}
+
+	/* Force page allocation */
+	memset(mem, 0, sz);
+
+	/* map the whole region */
+	ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					 (uint64_t)mem, (rte_iova_t)mem, sz);
+	if (ret) {
+		printf("Failed to dma map whole region, ret=%d\n", ret);
+		goto fail;
+	}
+
+	unmap1 = (uint64_t)mem + (sz / 2);
+	sz1 = sz / 2;
+	unmap2 = (uint64_t)mem;
+	sz2 = sz / 2;
+	/* unmap the partial region */
+	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					   unmap1, (rte_iova_t)unmap1, sz1);
+	if (ret) {
+		if (rte_errno == ENOTSUP) {
+			printf("Partial dma unmap not supported\n");
+			unmap2 = (uint64_t)mem;
+			sz2 = sz;
+		} else {
+			printf("Failed to unmap send half region, ret=%d(%d)\n",
+			       ret, rte_errno);
+			goto fail;
+		}
+	}
+
+	/* unmap the remaining region */
+	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					   unmap2, (rte_iova_t)unmap2, sz2);
+	if (ret) {
+		printf("Failed to unmap remaining region, ret=%d(%d)\n", ret,
+		       rte_errno);
+		goto fail;
+	}
+
+fail:
+	munmap(mem, sz);
+	return ret;
+}
+
+static int
 test_memory(void)
 {
 	uint64_t s;
@@ -101,6 +171,15 @@ test_memory(void)
 		return -1;
 	}
 
+	/* test for vfio dma map/unmap */
+	ret = test_memory_vfio_dma_map();
+	if (ret == 1) {
+		printf("VFIO dma map/unmap unsupported\n");
+	} else if (ret < 0) {
+		printf("Error vfio dma map/unmap, ret=%d\n", ret);
+		return -1;
+	}
+
 	return 0;
 }
 
-- 
2.8.4



* [dpdk-dev] [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-12  8:11 [dpdk-dev] [PATCH 0/2] fix issue with partial DMA unmap Nithin Dabilpuram
  2020-10-12  8:11 ` [dpdk-dev] [PATCH 1/2] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
@ 2020-10-12  8:11 ` Nithin Dabilpuram
  2020-10-14 15:07   ` Burakov, Anatoly
  2020-11-05  9:04 ` [dpdk-dev] [PATCH v2 0/3] fix issue with partial DMA unmap Nithin Dabilpuram
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-10-12  8:11 UTC (permalink / raw)
  To: anatoly.burakov; +Cc: jerinj, dev, Nithin Dabilpuram, stable

Partial unmapping is not supported for VFIO IOMMU type1
by kernel. Though kernel gives return as zero, the unmapped size
returned will not be same as expected. So check for
returned unmap size and return error.

For case of DMA map/unmap triggered by heap allocations,
maintain granularity of memseg page size so that heap
expansion and contraction does not have this issue.

For user requested DMA map/unmap disallow partial unmapping
for VFIO type1.

Fixes: 73a639085938 ("vfio: allow to map other memory regions")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
---
 lib/librte_eal/linux/eal_vfio.c | 34 ++++++++++++++++++++++++++++------
 lib/librte_eal/linux/eal_vfio.h |  1 +
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index d26e164..ef95259 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -69,6 +69,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_TYPE1,
 		.name = "Type 1",
+		.partial_unmap = false,
 		.dma_map_func = &vfio_type1_dma_map,
 		.dma_user_map_func = &vfio_type1_dma_mem_map
 	},
@@ -76,6 +77,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_SPAPR,
 		.name = "sPAPR",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_spapr_dma_map,
 		.dma_user_map_func = &vfio_spapr_dma_mem_map
 	},
@@ -83,6 +85,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_NOIOMMU,
 		.name = "No-IOMMU",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_noiommu_dma_map,
 		.dma_user_map_func = &vfio_noiommu_dma_mem_map
 	},
@@ -525,12 +528,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 	/* for IOVA as VA mode, no need to care for IOVA addresses */
 	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
 		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
-		if (type == RTE_MEM_EVENT_ALLOC)
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 1);
-		else
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 0);
+		uint64_t page_sz = msl->page_sz;
+
+		/* Maintain granularity of DMA map/unmap to memseg size */
+		for (; cur_len < len; cur_len += page_sz) {
+			if (type == RTE_MEM_EVENT_ALLOC)
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 1);
+			else
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 0);
+			vfio_va += page_sz;
+		}
+
 		return;
 	}
 
@@ -1383,6 +1393,12 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
 					errno, strerror(errno));
 			return -1;
+		} else if (dma_unmap.size != len) {
+			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
+				"remapping cleared instead of %"PRIu64"\n",
+				(uint64_t)dma_unmap.size, len);
+			rte_errno = EIO;
+			return -1;
 		}
 	}
 
@@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
 		/* we're partially unmapping a previously mapped region, so we
 		 * need to split entry into two.
 		 */
+		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
+			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
+			rte_errno = ENOTSUP;
+			ret = -1;
+			goto out;
+		}
 		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
 			rte_errno = ENOMEM;
diff --git a/lib/librte_eal/linux/eal_vfio.h b/lib/librte_eal/linux/eal_vfio.h
index cb2d35f..6ebaca6 100644
--- a/lib/librte_eal/linux/eal_vfio.h
+++ b/lib/librte_eal/linux/eal_vfio.h
@@ -113,6 +113,7 @@ typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova,
 struct vfio_iommu_type {
 	int type_id;
 	const char *name;
+	bool partial_unmap;
 	vfio_dma_user_func_t dma_user_map_func;
 	vfio_dma_func_t dma_map_func;
 };
-- 
2.8.4



* Re: [dpdk-dev] [PATCH 1/2] test: add test case to validate VFIO DMA map/unmap
  2020-10-12  8:11 ` [dpdk-dev] [PATCH 1/2] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
@ 2020-10-14 14:39   ` Burakov, Anatoly
  2020-10-15  9:54     ` [dpdk-dev] [EXT] " Nithin Dabilpuram
  0 siblings, 1 reply; 76+ messages in thread
From: Burakov, Anatoly @ 2020-10-14 14:39 UTC (permalink / raw)
  To: Nithin Dabilpuram; +Cc: jerinj, dev

On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> Add test case in test_memory to test VFIO DMA map/unmap.
> 
> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> ---
>   app/test/test_memory.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 79 insertions(+)
> 
> diff --git a/app/test/test_memory.c b/app/test/test_memory.c
> index 7d5ae99..1c56455 100644
> --- a/app/test/test_memory.c
> +++ b/app/test/test_memory.c
> @@ -4,11 +4,16 @@
>   
>   #include <stdio.h>
>   #include <stdint.h>
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
>   
>   #include <rte_eal.h>
> +#include <rte_errno.h>
>   #include <rte_memory.h>
>   #include <rte_common.h>
>   #include <rte_memzone.h>
> +#include <rte_vfio.h>
>   
>   #include "test.h"
>   
> @@ -70,6 +75,71 @@ check_seg_fds(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
>   }
>   
>   static int
> +test_memory_vfio_dma_map(void)
> +{
> +	uint64_t sz = 2 * sysconf(_SC_PAGESIZE), sz1, sz2;

i think we now have a function for that, rte_page_size() ?

Also, i would prefer

uint64_t sz1, sz2, sz = 2 * rte_page_size();

Easier to parse IMO.

> +	uint64_t unmap1, unmap2;
> +	uint8_t *mem;
> +	int ret;
> +
> +	/* Check if vfio is enabled in both kernel and eal */
> +	ret = rte_vfio_is_enabled("vfio");
> +	if (!ret)
> +		return 1;

No need, rte_vfio_container_dma_map() should set errno to ENODEV if vfio 
is not enabled.

> +
> +	/* Allocate twice size of page */
> +	mem = mmap(NULL, sz, PROT_READ | PROT_WRITE,
> +		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +	if (mem == MAP_FAILED) {
> +		printf("Failed to allocate memory for external heap\n");
> +		return -1;
> +	}
> +
> +	/* Force page allocation */
> +	memset(mem, 0, sz);
> +
> +	/* map the whole region */
> +	ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> +					 (uint64_t)mem, (rte_iova_t)mem, sz);

should be (uintptr_t) perhaps?

Also, this can return -1 with rte_errno == ENOTSUP, i think this happens 
if there are no devices attached (or if there's no VFIO support, like it 
would be on FreeBSD or Windows).

> +	if (ret) {
> +		printf("Failed to dma map whole region, ret=%d\n", ret);
> +		goto fail;
> +	}
> +
> +	unmap1 = (uint64_t)mem + (sz / 2);
> +	sz1 = sz / 2;
> +	unmap2 = (uint64_t)mem;
> +	sz2 = sz / 2;
> +	/* unmap the partial region */
> +	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> +					   unmap1, (rte_iova_t)unmap1, sz1);
> +	if (ret) {
> +		if (rte_errno == ENOTSUP) {
> +			printf("Partial dma unmap not supported\n");
> +			unmap2 = (uint64_t)mem;
> +			sz2 = sz;
> +		} else {
> +			printf("Failed to unmap send half region, ret=%d(%d)\n",

I think "send half" is a typo? Also, here and in other places, i would 
prefer a rte_strerror() instead of raw rte_errno number.

> +			       ret, rte_errno);
> +			goto fail;
> +		}
> +	}
> +
> +	/* unmap the remaining region */
> +	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> +					   unmap2, (rte_iova_t)unmap2, sz2);
> +	if (ret) {
> +		printf("Failed to unmap remaining region, ret=%d(%d)\n", ret,
> +		       rte_errno);
> +		goto fail;
> +	}
> +
> +fail:
> +	munmap(mem, sz);
> +	return ret;
> +}
> +
> +static int
>   test_memory(void)
>   {
>   	uint64_t s;
> @@ -101,6 +171,15 @@ test_memory(void)
>   		return -1;
>   	}
>   
> +	/* test for vfio dma map/unmap */
> +	ret = test_memory_vfio_dma_map();
> +	if (ret == 1) {
> +		printf("VFIO dma map/unmap unsupported\n");
> +	} else if (ret < 0) {
> +		printf("Error vfio dma map/unmap, ret=%d\n", ret);
> +		return -1;
> +	}
> +

This looks odd in this autotest. Perhaps create a new autotest for VFIO?

>   	return 0;
>   }
>   
> 


-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-12  8:11 ` [dpdk-dev] [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1 Nithin Dabilpuram
@ 2020-10-14 15:07   ` Burakov, Anatoly
  2020-10-15  6:09     ` [dpdk-dev] [EXT] " Nithin Dabilpuram
  0 siblings, 1 reply; 76+ messages in thread
From: Burakov, Anatoly @ 2020-10-14 15:07 UTC (permalink / raw)
  To: Nithin Dabilpuram; +Cc: jerinj, dev, stable

On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> Partial unmapping is not supported for VFIO IOMMU type1
> by kernel. Though kernel gives return as zero, the unmapped size
> returned will not be same as expected. So check for
> returned unmap size and return error.
> 
> For case of DMA map/unmap triggered by heap allocations,
> maintain granularity of memseg page size so that heap
> expansion and contraction does not have this issue.

This is quite unfortunate, because there was a different bug that had to 
do with kernel having a very limited number of mappings available [1], 
as a result of which the page concatenation code was added.

It should therefore be documented that the dma_entry_limit parameter 
should be adjusted should the user run out of the DMA entries.

[1] 
https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
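For reference, the knob referred to here is a module parameter of
vfio_iommu_type1. Assuming a system where the module can be reloaded, raising
the limit might look like the following config fragment (131072 is an
arbitrary example value, not a recommendation):

```shell
# Reload the type1 backend with a larger per-container DMA entry limit ...
modprobe -r vfio_iommu_type1
modprobe vfio_iommu_type1 dma_entry_limit=131072

# ... or persist the setting across reboots.
echo "options vfio_iommu_type1 dma_entry_limit=131072" \
	> /etc/modprobe.d/vfio-dma-entry-limit.conf
```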

> 
> For user requested DMA map/unmap disallow partial unmapping
> for VFIO type1.
> 
> Fixes: 73a639085938 ("vfio: allow to map other memory regions")
> Cc: anatoly.burakov@intel.com
> Cc: stable@dpdk.org
> 
> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> ---
>   lib/librte_eal/linux/eal_vfio.c | 34 ++++++++++++++++++++++++++++------
>   lib/librte_eal/linux/eal_vfio.h |  1 +
>   2 files changed, 29 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
> index d26e164..ef95259 100644
> --- a/lib/librte_eal/linux/eal_vfio.c
> +++ b/lib/librte_eal/linux/eal_vfio.c
> @@ -69,6 +69,7 @@ static const struct vfio_iommu_type iommu_types[] = {
>   	{
>   		.type_id = RTE_VFIO_TYPE1,
>   		.name = "Type 1",
> +		.partial_unmap = false,
>   		.dma_map_func = &vfio_type1_dma_map,
>   		.dma_user_map_func = &vfio_type1_dma_mem_map
>   	},
> @@ -76,6 +77,7 @@ static const struct vfio_iommu_type iommu_types[] = {
>   	{
>   		.type_id = RTE_VFIO_SPAPR,
>   		.name = "sPAPR",
> +		.partial_unmap = true,
>   		.dma_map_func = &vfio_spapr_dma_map,
>   		.dma_user_map_func = &vfio_spapr_dma_mem_map
>   	},
> @@ -83,6 +85,7 @@ static const struct vfio_iommu_type iommu_types[] = {
>   	{
>   		.type_id = RTE_VFIO_NOIOMMU,
>   		.name = "No-IOMMU",
> +		.partial_unmap = true,
>   		.dma_map_func = &vfio_noiommu_dma_map,
>   		.dma_user_map_func = &vfio_noiommu_dma_mem_map
>   	},
> @@ -525,12 +528,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
>   	/* for IOVA as VA mode, no need to care for IOVA addresses */
>   	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
>   		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
> -		if (type == RTE_MEM_EVENT_ALLOC)
> -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
> -					len, 1);
> -		else
> -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
> -					len, 0);
> +		uint64_t page_sz = msl->page_sz;
> +
> +		/* Maintain granularity of DMA map/unmap to memseg size */
> +		for (; cur_len < len; cur_len += page_sz) {
> +			if (type == RTE_MEM_EVENT_ALLOC)
> +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
> +						 vfio_va, page_sz, 1);
> +			else
> +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
> +						 vfio_va, page_sz, 0);
> +			vfio_va += page_sz;
> +		}
> +

You'd also have to revert d1c7c0cdf7bac5eb40d3a2a690453aefeee5887b 
because currently the PA path will opportunistically concantenate 
contiguous segments into single mapping too.

>   		return;
>   	}
>   
> @@ -1383,6 +1393,12 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
>   			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
>   					errno, strerror(errno));
>   			return -1;
> +		} else if (dma_unmap.size != len) {
> +			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> +				"remapping cleared instead of %"PRIu64"\n",
> +				(uint64_t)dma_unmap.size, len);
> +			rte_errno = EIO;
> +			return -1;
>   		}
>   	}
>   
> @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>   		/* we're partially unmapping a previously mapped region, so we
>   		 * need to split entry into two.
>   		 */
> +		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> +			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> +			rte_errno = ENOTSUP;
> +			ret = -1;
> +			goto out;
> +		}

How would we ever arrive here if we never do more than 1 page worth of 
memory anyway? I don't think this is needed.

>   		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
>   			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
>   			rte_errno = ENOMEM;
> diff --git a/lib/librte_eal/linux/eal_vfio.h b/lib/librte_eal/linux/eal_vfio.h
> index cb2d35f..6ebaca6 100644
> --- a/lib/librte_eal/linux/eal_vfio.h
> +++ b/lib/librte_eal/linux/eal_vfio.h
> @@ -113,6 +113,7 @@ typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova,
>   struct vfio_iommu_type {
>   	int type_id;
>   	const char *name;
> +	bool partial_unmap;
>   	vfio_dma_user_func_t dma_user_map_func;
>   	vfio_dma_func_t dma_map_func;
>   };
> 


-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-14 15:07   ` Burakov, Anatoly
@ 2020-10-15  6:09     ` Nithin Dabilpuram
  2020-10-15 10:00       ` Burakov, Anatoly
  0 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-10-15  6:09 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: jerinj, dev, stable

On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> External Email
> 
> ----------------------------------------------------------------------
> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > Partial unmapping is not supported for VFIO IOMMU type1
> > by kernel. Though kernel gives return as zero, the unmapped size
> > returned will not be same as expected. So check for
> > returned unmap size and return error.
> > 
> > For case of DMA map/unmap triggered by heap allocations,
> > maintain granularity of memseg page size so that heap
> > expansion and contraction does not have this issue.
> 
> This is quite unfortunate, because there was a different bug that had to do
> with kernel having a very limited number of mappings available [1], as a
> result of which the page concatenation code was added.
> 
> It should therefore be documented that the dma_entry_limit parameter should
> be adjusted should the user run out of the DMA entries.
> 
> [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/

Ack, I'll document it in guides/linux_gsg/linux_drivers.rst in vfio section.
> 
> > 
> > For user requested DMA map/unmap disallow partial unmapping
> > for VFIO type1.
> > 
> > Fixes: 73a639085938 ("vfio: allow to map other memory regions")
> > Cc: anatoly.burakov@intel.com
> > Cc: stable@dpdk.org
> > 
> > Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> > ---
> >   lib/librte_eal/linux/eal_vfio.c | 34 ++++++++++++++++++++++++++++------
> >   lib/librte_eal/linux/eal_vfio.h |  1 +
> >   2 files changed, 29 insertions(+), 6 deletions(-)
> > 
> > diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
> > index d26e164..ef95259 100644
> > --- a/lib/librte_eal/linux/eal_vfio.c
> > +++ b/lib/librte_eal/linux/eal_vfio.c
> > @@ -69,6 +69,7 @@ static const struct vfio_iommu_type iommu_types[] = {
> >   	{
> >   		.type_id = RTE_VFIO_TYPE1,
> >   		.name = "Type 1",
> > +		.partial_unmap = false,
> >   		.dma_map_func = &vfio_type1_dma_map,
> >   		.dma_user_map_func = &vfio_type1_dma_mem_map
> >   	},
> > @@ -76,6 +77,7 @@ static const struct vfio_iommu_type iommu_types[] = {
> >   	{
> >   		.type_id = RTE_VFIO_SPAPR,
> >   		.name = "sPAPR",
> > +		.partial_unmap = true,
> >   		.dma_map_func = &vfio_spapr_dma_map,
> >   		.dma_user_map_func = &vfio_spapr_dma_mem_map
> >   	},
> > @@ -83,6 +85,7 @@ static const struct vfio_iommu_type iommu_types[] = {
> >   	{
> >   		.type_id = RTE_VFIO_NOIOMMU,
> >   		.name = "No-IOMMU",
> > +		.partial_unmap = true,
> >   		.dma_map_func = &vfio_noiommu_dma_map,
> >   		.dma_user_map_func = &vfio_noiommu_dma_mem_map
> >   	},
> > @@ -525,12 +528,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
> >   	/* for IOVA as VA mode, no need to care for IOVA addresses */
> >   	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
> >   		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
> > -		if (type == RTE_MEM_EVENT_ALLOC)
> > -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
> > -					len, 1);
> > -		else
> > -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
> > -					len, 0);
> > +		uint64_t page_sz = msl->page_sz;
> > +
> > +		/* Maintain granularity of DMA map/unmap to memseg size */
> > +		for (; cur_len < len; cur_len += page_sz) {
> > +			if (type == RTE_MEM_EVENT_ALLOC)
> > +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
> > +						 vfio_va, page_sz, 1);
> > +			else
> > +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
> > +						 vfio_va, page_sz, 0);
> > +			vfio_va += page_sz;
> > +		}
> > +
> 
> You'd also have to revert d1c7c0cdf7bac5eb40d3a2a690453aefeee5887b because
> currently the PA path will opportunistically concantenate contiguous
> segments into single mapping too.

Ack, I'll change it even for IOVA as PA mode. I missed that.
> 
> >   		return;
> >   	}
> > @@ -1383,6 +1393,12 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
> >   			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> >   					errno, strerror(errno));
> >   			return -1;
> > +		} else if (dma_unmap.size != len) {
> > +			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > +				"remapping cleared instead of %"PRIu64"\n",
> > +				(uint64_t)dma_unmap.size, len);
> > +			rte_errno = EIO;
> > +			return -1;
> >   		}
> >   	}
> > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> >   		/* we're partially unmapping a previously mapped region, so we
> >   		 * need to split entry into two.
> >   		 */
> > +		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > +			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > +			rte_errno = ENOTSUP;
> > +			ret = -1;
> > +			goto out;
> > +		}
> 
> How would we ever arrive here if we never do more than 1 page worth of
> memory anyway? I don't think this is needed.

container_dma_unmap() is called by the user via rte_vfio_container_dma_unmap(),
and when he maps we don't split it, as we don't know about his memory.
So if he maps multiple pages and tries to unmap partially, then we should fail.
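The case being described can be captured in a tiny model: a user mapping is
tracked as one entry, and a request covering only part of it must fail with
ENOTSUP when the IOMMU type cannot split mappings (model_user_unmap() and its
types are hypothetical, mirroring the container_dma_unmap() check only in
spirit):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* One tracked user mapping, as created by a single user dma_map call. */
struct user_map {
	uint64_t iova;
	uint64_t len;
};

/*
 * Toy version of the unmap check: user mappings are never split internally,
 * so a partial request is only legal when the IOMMU type supports partial
 * unmap. Returns 0 on success, -1 with *err set otherwise.
 */
static int
model_user_unmap(struct user_map *map, bool partial_unmap_ok,
		 uint64_t iova, uint64_t len, int *err)
{
	bool partial = (iova != map->iova || len != map->len);

	if (partial && !partial_unmap_ok) {
		*err = ENOTSUP; /* type1: refuse rather than silently no-op */
		return -1;
	}
	map->len = 0; /* full unmap (split handling elided in this model) */
	return 0;
}
```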

> 
> >   		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
> >   			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
> >   			rte_errno = ENOMEM;
> > diff --git a/lib/librte_eal/linux/eal_vfio.h b/lib/librte_eal/linux/eal_vfio.h
> > index cb2d35f..6ebaca6 100644
> > --- a/lib/librte_eal/linux/eal_vfio.h
> > +++ b/lib/librte_eal/linux/eal_vfio.h
> > @@ -113,6 +113,7 @@ typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova,
> >   struct vfio_iommu_type {
> >   	int type_id;
> >   	const char *name;
> > +	bool partial_unmap;
> >   	vfio_dma_user_func_t dma_user_map_func;
> >   	vfio_dma_func_t dma_map_func;
> >   };
> > 
> 
> 
> -- 
> Thanks,
> Anatoly


* Re: [dpdk-dev] [EXT] Re: [PATCH 1/2] test: add test case to validate VFIO DMA map/unmap
  2020-10-14 14:39   ` Burakov, Anatoly
@ 2020-10-15  9:54     ` Nithin Dabilpuram
  0 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-10-15  9:54 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: jerinj, dev

On Wed, Oct 14, 2020 at 03:39:36PM +0100, Burakov, Anatoly wrote:
> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > Add test case in test_memory to test VFIO DMA map/unmap.
> > 
> > Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> > ---
> >   app/test/test_memory.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 79 insertions(+)
> > 
> > diff --git a/app/test/test_memory.c b/app/test/test_memory.c
> > index 7d5ae99..1c56455 100644
> > --- a/app/test/test_memory.c
> > +++ b/app/test/test_memory.c
> > @@ -4,11 +4,16 @@
> >   #include <stdio.h>
> >   #include <stdint.h>
> > +#include <string.h>
> > +#include <sys/mman.h>
> > +#include <unistd.h>
> >   #include <rte_eal.h>
> > +#include <rte_errno.h>
> >   #include <rte_memory.h>
> >   #include <rte_common.h>
> >   #include <rte_memzone.h>
> > +#include <rte_vfio.h>
> >   #include "test.h"
> > @@ -70,6 +75,71 @@ check_seg_fds(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
> >   }
> >   static int
> > +test_memory_vfio_dma_map(void)
> > +{
> > +	uint64_t sz = 2 * sysconf(_SC_PAGESIZE), sz1, sz2;
> 
> i think we now have a function for that, rte_page_size() ?
> 
> Also, i would prefer
> 
> uint64_t sz1, sz2, sz = 2 * rte_page_size();
> 
> Easier to parse IMO.

Ack, will use rte_mem_page_size().
> 
> > +	uint64_t unmap1, unmap2;
> > +	uint8_t *mem;
> > +	int ret;
> > +
> > +	/* Check if vfio is enabled in both kernel and eal */
> > +	ret = rte_vfio_is_enabled("vfio");
> > +	if (!ret)
> > +		return 1;
> 
> No need, rte_vfio_container_dma_map() should set errno to ENODEV if vfio is
> not enabled.

Ack.

> 
> > +
> > +	/* Allocate twice size of page */
> > +	mem = mmap(NULL, sz, PROT_READ | PROT_WRITE,
> > +		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > +	if (mem == MAP_FAILED) {
> > +		printf("Failed to allocate memory for external heap\n");
> > +		return -1;
> > +	}
> > +
> > +	/* Force page allocation */
> > +	memset(mem, 0, sz);
> > +
> > +	/* map the whole region */
> > +	ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > +					 (uint64_t)mem, (rte_iova_t)mem, sz);
> 
> should be (uintptr_t) perhaps?
> 
> Also, this can return -1 with rte_errno == ENOTSUP, i think this happens if
> there are no devices attached (or if there's no VFIO support, like it would
> be on FreeBSD or Windows).
Ok. Will return 1 if NOTSUP.
> 
> > +	if (ret) {
> > +		printf("Failed to dma map whole region, ret=%d\n", ret);
> > +		goto fail;
> > +	}
> > +
> > +	unmap1 = (uint64_t)mem + (sz / 2);
> > +	sz1 = sz / 2;
> > +	unmap2 = (uint64_t)mem;
> > +	sz2 = sz / 2;
> > +	/* unmap the partial region */
> > +	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > +					   unmap1, (rte_iova_t)unmap1, sz1);
> > +	if (ret) {
> > +		if (rte_errno == ENOTSUP) {
> > +			printf("Partial dma unmap not supported\n");
> > +			unmap2 = (uint64_t)mem;
> > +			sz2 = sz;
> > +		} else {
> > +			printf("Failed to unmap send half region, ret=%d(%d)\n",
> 
> I think "send half" is a typo? Also, here and in other places, i would
> prefer a rte_strerror() instead of raw rte_errno number.
Ack.
> 
> > +			       ret, rte_errno);
> > +			goto fail;
> > +		}
> > +	}
> > +
> > +	/* unmap the remaining region */
> > +	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > +					   unmap2, (rte_iova_t)unmap2, sz2);
> > +	if (ret) {
> > +		printf("Failed to unmap remaining region, ret=%d(%d)\n", ret,
> > +		       rte_errno);
> > +		goto fail;
> > +	}
> > +
> > +fail:
> > +	munmap(mem, sz);
> > +	return ret;
> > +}
> > +
> > +static int
> >   test_memory(void)
> >   {
> >   	uint64_t s;
> > @@ -101,6 +171,15 @@ test_memory(void)
> >   		return -1;
> >   	}
> > +	/* test for vfio dma map/unmap */
> > +	ret = test_memory_vfio_dma_map();
> > +	if (ret == 1) {
> > +		printf("VFIO dma map/unmap unsupported\n");
> > +	} else if (ret < 0) {
> > +		printf("Error vfio dma map/unmap, ret=%d\n", ret);
> > +		return -1;
> > +	}
> > +
> 
> This looks odd in this autotest. Perhaps create a new autotest for VFIO?
Ack, will add test_vfio.c
> 
> >   	return 0;
> >   }
> > 
> 
> 
> -- 
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-15  6:09     ` [dpdk-dev] [EXT] " Nithin Dabilpuram
@ 2020-10-15 10:00       ` Burakov, Anatoly
  2020-10-15 11:38         ` Nithin Dabilpuram
                           ` (2 more replies)
  0 siblings, 3 replies; 76+ messages in thread
From: Burakov, Anatoly @ 2020-10-15 10:00 UTC (permalink / raw)
  To: Nithin Dabilpuram; +Cc: jerinj, dev, stable

On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
>> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
>>> Partial unmapping is not supported for VFIO IOMMU type1
>>> by kernel. Though kernel gives return as zero, the unmapped size
>>> returned will not be same as expected. So check for
>>> returned unmap size and return error.
>>>
>>> For case of DMA map/unmap triggered by heap allocations,
>>> maintain granularity of memseg page size so that heap
>>> expansion and contraction does not have this issue.
>>
>> This is quite unfortunate, because there was a different bug that had to do
>> with kernel having a very limited number of mappings available [1], as a
>> result of which the page concatenation code was added.
>>
>> It should therefore be documented that the dma_entry_limit parameter should
>> be adjusted should the user run out of the DMA entries.
>>
>> [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/

<snip>

>>>    			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
>>>    					errno, strerror(errno));
>>>    			return -1;
>>> +		} else if (dma_unmap.size != len) {
>>> +			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
>>> +				"remapping cleared instead of %"PRIu64"\n",
>>> +				(uint64_t)dma_unmap.size, len);
>>> +			rte_errno = EIO;
>>> +			return -1;
>>>    		}
>>>    	}
>>> @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>>    		/* we're partially unmapping a previously mapped region, so we
>>>    		 * need to split entry into two.
>>>    		 */
>>> +		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
>>> +			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
>>> +			rte_errno = ENOTSUP;
>>> +			ret = -1;
>>> +			goto out;
>>> +		}
>>
>> How would we ever arrive here if we never do more than 1 page worth of
>> memory anyway? I don't think this is needed.
> 
> container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> and when he maps we don't split it as we don't know about his memory.
> So if he maps multiple pages and tries to unmap partially, then we should fail.

Should we map it at page granularity then, instead of adding this 
discrepancy between EAL and user mappings? I.e. instead of adding a 
workaround, how about we just do the same thing for user mem mappings?

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-15 10:00       ` Burakov, Anatoly
@ 2020-10-15 11:38         ` Nithin Dabilpuram
  2020-10-15 11:50         ` Nithin Dabilpuram
  2020-10-15 11:57         ` Nithin Dabilpuram
  2 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-10-15 11:38 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: jerinj, dev, stable


On Thu, Oct 15, 2020 at 11:00:59AM +0100, Burakov, Anatoly wrote:
> On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> > > On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > > > Partial unmapping is not supported for VFIO IOMMU type1
> > > > by kernel. Though kernel gives return as zero, the unmapped size
> > > > returned will not be same as expected. So check for
> > > > returned unmap size and return error.
> > > > 
> > > > For case of DMA map/unmap triggered by heap allocations,
> > > > maintain granularity of memseg page size so that heap
> > > > expansion and contraction does not have this issue.
> > > 
> > > This is quite unfortunate, because there was a different bug that had to do
> > > with kernel having a very limited number of mappings available [1], as a
> > > result of which the page concatenation code was added.
> > > 
> > > It should therefore be documented that the dma_entry_limit parameter should
> > > be adjusted should the user run out of the DMA entries.
> > > 
> > > [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
> 
> <snip>
> 
> > > >    			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> > > >    					errno, strerror(errno));
> > > >    			return -1;
> > > > +		} else if (dma_unmap.size != len) {
> > > > +			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > > > +				"remapping cleared instead of %"PRIu64"\n",
> > > > +				(uint64_t)dma_unmap.size, len);
> > > > +			rte_errno = EIO;
> > > > +			return -1;
> > > >    		}
> > > >    	}
> > > > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > >    		/* we're partially unmapping a previously mapped region, so we
> > > >    		 * need to split entry into two.
> > > >    		 */
> > > > +		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > > > +			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > > > +			rte_errno = ENOTSUP;
> > > > +			ret = -1;
> > > > +			goto out;
> > > > +		}
> > > 
> > > How would we ever arrive here if we never do more than 1 page worth of
> > > memory anyway? I don't think this is needed.
> > 
> > container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> > and when he maps we don't split it as we don't about his memory.
> > So if he maps multiple pages and tries to unmap partially, then we should fail.
> 
> Should we map it in page granularity then, instead of adding this
> discrepancy between EAL and user mapping? I.e. instead of adding a
> workaround, how about we just do the same thing for user mem mappings?

For heap mappings we map and unmap at hugepage granularity, since we always
maintain that.

But here we don't know whether the user's allocation is a hugepage or a
collection of system pages. The only thing we can do is map it at system page
granularity, which could waste DMA entries if the user really is working with
hugepages, couldn't it?

> 
> -- 
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-15 10:00       ` Burakov, Anatoly
  2020-10-15 11:38         ` Nithin Dabilpuram
@ 2020-10-15 11:50         ` Nithin Dabilpuram
  2020-10-15 11:57         ` Nithin Dabilpuram
  2 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-10-15 11:50 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: jerinj, dev, stable


On Thu, Oct 15, 2020 at 11:00:59AM +0100, Burakov, Anatoly wrote:
> On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> > > On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > > > Partial unmapping is not supported for VFIO IOMMU type1
> > > > by kernel. Though kernel gives return as zero, the unmapped size
> > > > returned will not be same as expected. So check for
> > > > returned unmap size and return error.
> > > > 
> > > > For case of DMA map/unmap triggered by heap allocations,
> > > > maintain granularity of memseg page size so that heap
> > > > expansion and contraction does not have this issue.
> > > 
> > > This is quite unfortunate, because there was a different bug that had to do
> > > with kernel having a very limited number of mappings available [1], as a
> > > result of which the page concatenation code was added.
> > > 
> > > It should therefore be documented that the dma_entry_limit parameter should
> > > be adjusted should the user run out of the DMA entries.
> > > 
> > > [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
> 
> <snip>
> 
> > > >    			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> > > >    					errno, strerror(errno));
> > > >    			return -1;
> > > > +		} else if (dma_unmap.size != len) {
> > > > +			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > > > +				"remapping cleared instead of %"PRIu64"\n",
> > > > +				(uint64_t)dma_unmap.size, len);
> > > > +			rte_errno = EIO;
> > > > +			return -1;
> > > >    		}
> > > >    	}
> > > > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > >    		/* we're partially unmapping a previously mapped region, so we
> > > >    		 * need to split entry into two.
> > > >    		 */
> > > > +		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > > > +			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > > > +			rte_errno = ENOTSUP;
> > > > +			ret = -1;
> > > > +			goto out;
> > > > +		}
> > > 
> > > How would we ever arrive here if we never do more than 1 page worth of
> > > memory anyway? I don't think this is needed.
> > 
> > container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> > and when he maps we don't split it as we don't about his memory.
> > So if he maps multiple pages and tries to unmap partially, then we should fail.
> 
> Should we map it in page granularity then, instead of adding this
> discrepancy between EAL and user mapping? I.e. instead of adding a
> workaround, how about we just do the same thing for user mem mappings?

For heap mappings we map and unmap at hugepage granularity, since we always
maintain that.

But here we don't know whether the user's allocation is a hugepage or a
collection of system pages. The only thing we can do is map it at system page
granularity, which could waste DMA entries if the user really is working with
hugepages, couldn't it?

> 
> -- 
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-15 10:00       ` Burakov, Anatoly
  2020-10-15 11:38         ` Nithin Dabilpuram
  2020-10-15 11:50         ` Nithin Dabilpuram
@ 2020-10-15 11:57         ` Nithin Dabilpuram
  2020-10-15 15:10           ` Burakov, Anatoly
  2 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-10-15 11:57 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: Nithin Dabilpuram, Jerin Jacob, dev, stable

On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> >> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> >>> Partial unmapping is not supported for VFIO IOMMU type1
> >>> by kernel. Though kernel gives return as zero, the unmapped size
> >>> returned will not be same as expected. So check for
> >>> returned unmap size and return error.
> >>>
> >>> For case of DMA map/unmap triggered by heap allocations,
> >>> maintain granularity of memseg page size so that heap
> >>> expansion and contraction does not have this issue.
> >>
> >> This is quite unfortunate, because there was a different bug that had to do
> >> with kernel having a very limited number of mappings available [1], as a
> >> result of which the page concatenation code was added.
> >>
> >> It should therefore be documented that the dma_entry_limit parameter should
> >> be adjusted should the user run out of the DMA entries.
> >>
> >> [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
>
> <snip>
>
> >>>                     RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> >>>                                     errno, strerror(errno));
> >>>                     return -1;
> >>> +           } else if (dma_unmap.size != len) {
> >>> +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> >>> +                           "remapping cleared instead of %"PRIu64"\n",
> >>> +                           (uint64_t)dma_unmap.size, len);
> >>> +                   rte_errno = EIO;
> >>> +                   return -1;
> >>>             }
> >>>     }
> >>> @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> >>>             /* we're partially unmapping a previously mapped region, so we
> >>>              * need to split entry into two.
> >>>              */
> >>> +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> >>> +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> >>> +                   rte_errno = ENOTSUP;
> >>> +                   ret = -1;
> >>> +                   goto out;
> >>> +           }
> >>
> >> How would we ever arrive here if we never do more than 1 page worth of
> >> memory anyway? I don't think this is needed.
> >
> > container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> > and when he maps we don't split it as we don't about his memory.
> > So if he maps multiple pages and tries to unmap partially, then we should fail.
>
> Should we map it in page granularity then, instead of adding this
> discrepancy between EAL and user mapping? I.e. instead of adding a
> workaround, how about we just do the same thing for user mem mappings?
>
For heap mappings we map and unmap at hugepage granularity, since we always
maintain that.

But here we don't know whether the user's allocation is a hugepage or a
collection of system pages. The only thing we can do is map it at system page
granularity, which could waste DMA entries if the user really is working with
hugepages, couldn't it?


>
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-15 11:57         ` Nithin Dabilpuram
@ 2020-10-15 15:10           ` Burakov, Anatoly
  2020-10-16  7:10             ` Nithin Dabilpuram
  0 siblings, 1 reply; 76+ messages in thread
From: Burakov, Anatoly @ 2020-10-15 15:10 UTC (permalink / raw)
  To: Nithin Dabilpuram; +Cc: Nithin Dabilpuram, Jerin Jacob, dev, stable

On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
> On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
> <anatoly.burakov@intel.com> wrote:
>>
>> On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
>>> On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
>>>> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
>>>>> Partial unmapping is not supported for VFIO IOMMU type1
>>>>> by kernel. Though kernel gives return as zero, the unmapped size
>>>>> returned will not be same as expected. So check for
>>>>> returned unmap size and return error.
>>>>>
>>>>> For case of DMA map/unmap triggered by heap allocations,
>>>>> maintain granularity of memseg page size so that heap
>>>>> expansion and contraction does not have this issue.
>>>>
>>>> This is quite unfortunate, because there was a different bug that had to do
>>>> with kernel having a very limited number of mappings available [1], as a
>>>> result of which the page concatenation code was added.
>>>>
>>>> It should therefore be documented that the dma_entry_limit parameter should
>>>> be adjusted should the user run out of the DMA entries.
>>>>
>>>> [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
>>
>> <snip>
>>
>>>>>                      RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
>>>>>                                      errno, strerror(errno));
>>>>>                      return -1;
>>>>> +           } else if (dma_unmap.size != len) {
>>>>> +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
>>>>> +                           "remapping cleared instead of %"PRIu64"\n",
>>>>> +                           (uint64_t)dma_unmap.size, len);
>>>>> +                   rte_errno = EIO;
>>>>> +                   return -1;
>>>>>              }
>>>>>      }
>>>>> @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>>>>              /* we're partially unmapping a previously mapped region, so we
>>>>>               * need to split entry into two.
>>>>>               */
>>>>> +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
>>>>> +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
>>>>> +                   rte_errno = ENOTSUP;
>>>>> +                   ret = -1;
>>>>> +                   goto out;
>>>>> +           }
>>>>
>>>> How would we ever arrive here if we never do more than 1 page worth of
>>>> memory anyway? I don't think this is needed.
>>>
>>> container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
>>> and when he maps we don't split it as we don't about his memory.
>>> So if he maps multiple pages and tries to unmap partially, then we should fail.
>>
>> Should we map it in page granularity then, instead of adding this
>> discrepancy between EAL and user mapping? I.e. instead of adding a
>> workaround, how about we just do the same thing for user mem mappings?
>>
> In heap mapping's we map and unmap it at huge page granularity as we will always
> maintain that.
> 
> But here I think we don't know if user's allocation is huge page or
> collection of system
> pages. Only thing we can do here is map it at system page granularity which
> could waste entries if he say really is working with hugepages. Isn't ?
> 

Yeah, we do. The API mandates page granularity, and it checks against the 
page size and the number of IOVA entries, so yes, we do enforce that the 
IOVA addresses supplied by the user have to be page addresses.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-15 15:10           ` Burakov, Anatoly
@ 2020-10-16  7:10             ` Nithin Dabilpuram
  2020-10-17 16:14               ` Burakov, Anatoly
  0 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-10-16  7:10 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: Jerin Jacob, dev, stable

On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
> On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
> > On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com> wrote:
> > > 
> > > On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > > > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> > > > > On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > > > > > Partial unmapping is not supported for VFIO IOMMU type1
> > > > > > by kernel. Though kernel gives return as zero, the unmapped size
> > > > > > returned will not be same as expected. So check for
> > > > > > returned unmap size and return error.
> > > > > > 
> > > > > > For case of DMA map/unmap triggered by heap allocations,
> > > > > > maintain granularity of memseg page size so that heap
> > > > > > expansion and contraction does not have this issue.
> > > > > 
> > > > > This is quite unfortunate, because there was a different bug that had to do
> > > > > with kernel having a very limited number of mappings available [1], as a
> > > > > result of which the page concatenation code was added.
> > > > > 
> > > > > It should therefore be documented that the dma_entry_limit parameter should
> > > > > be adjusted should the user run out of the DMA entries.
> > > > > 
> > > > > [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
> > > 
> > > <snip>
> > > 
> > > > > >                      RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> > > > > >                                      errno, strerror(errno));
> > > > > >                      return -1;
> > > > > > +           } else if (dma_unmap.size != len) {
> > > > > > +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > > > > > +                           "remapping cleared instead of %"PRIu64"\n",
> > > > > > +                           (uint64_t)dma_unmap.size, len);
> > > > > > +                   rte_errno = EIO;
> > > > > > +                   return -1;
> > > > > >              }
> > > > > >      }
> > > > > > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > > > >              /* we're partially unmapping a previously mapped region, so we
> > > > > >               * need to split entry into two.
> > > > > >               */
> > > > > > +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > > > > > +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > > > > > +                   rte_errno = ENOTSUP;
> > > > > > +                   ret = -1;
> > > > > > +                   goto out;
> > > > > > +           }
> > > > > 
> > > > > How would we ever arrive here if we never do more than 1 page worth of
> > > > > memory anyway? I don't think this is needed.
> > > > 
> > > > container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> > > > and when he maps we don't split it as we don't about his memory.
> > > > So if he maps multiple pages and tries to unmap partially, then we should fail.
> > > 
> > > Should we map it in page granularity then, instead of adding this
> > > discrepancy between EAL and user mapping? I.e. instead of adding a
> > > workaround, how about we just do the same thing for user mem mappings?
> > > 
> > In heap mapping's we map and unmap it at huge page granularity as we will always
> > maintain that.
> > 
> > But here I think we don't know if user's allocation is huge page or
> > collection of system
> > pages. Only thing we can do here is map it at system page granularity which
> > could waste entries if he say really is working with hugepages. Isn't ?
> > 
> 
> Yeah we do. The API mandates the pages granularity, and it will check
> against page size and number of IOVA entries, so yes, we do enforce the fact
> that the IOVA addresses supplied by the user have to be page addresses.

Looking at rte_vfio_container_dma_map(), there is no mention of the hugepage
size the user is providing, nor do we compute it. The user can call
rte_vfio_container_dma_map() with a 1GB hugepage or a 4K system page.

Am I missing something?
> 
> -- 
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-16  7:10             ` Nithin Dabilpuram
@ 2020-10-17 16:14               ` Burakov, Anatoly
  2020-10-19  9:43                 ` Nithin Dabilpuram
  0 siblings, 1 reply; 76+ messages in thread
From: Burakov, Anatoly @ 2020-10-17 16:14 UTC (permalink / raw)
  To: Nithin Dabilpuram; +Cc: Jerin Jacob, dev, stable

On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
> On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
>> On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
>>> On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
>>> <anatoly.burakov@intel.com> wrote:
>>>>
>>>> On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
>>>>> On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
>>>>>> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
>>>>>>> Partial unmapping is not supported for VFIO IOMMU type1
>>>>>>> by kernel. Though kernel gives return as zero, the unmapped size
>>>>>>> returned will not be same as expected. So check for
>>>>>>> returned unmap size and return error.
>>>>>>>
>>>>>>> For case of DMA map/unmap triggered by heap allocations,
>>>>>>> maintain granularity of memseg page size so that heap
>>>>>>> expansion and contraction does not have this issue.
>>>>>>
>>>>>> This is quite unfortunate, because there was a different bug that had to do
>>>>>> with kernel having a very limited number of mappings available [1], as a
>>>>>> result of which the page concatenation code was added.
>>>>>>
>>>>>> It should therefore be documented that the dma_entry_limit parameter should
>>>>>> be adjusted should the user run out of the DMA entries.
>>>>>>
>>>>>> [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
>>>>
>>>> <snip>
>>>>
>>>>>>>                       RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
>>>>>>>                                       errno, strerror(errno));
>>>>>>>                       return -1;
>>>>>>> +           } else if (dma_unmap.size != len) {
>>>>>>> +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
>>>>>>> +                           "remapping cleared instead of %"PRIu64"\n",
>>>>>>> +                           (uint64_t)dma_unmap.size, len);
>>>>>>> +                   rte_errno = EIO;
>>>>>>> +                   return -1;
>>>>>>>               }
>>>>>>>       }
>>>>>>> @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>>>>>>               /* we're partially unmapping a previously mapped region, so we
>>>>>>>                * need to split entry into two.
>>>>>>>                */
>>>>>>> +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
>>>>>>> +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
>>>>>>> +                   rte_errno = ENOTSUP;
>>>>>>> +                   ret = -1;
>>>>>>> +                   goto out;
>>>>>>> +           }
>>>>>>
>>>>>> How would we ever arrive here if we never do more than 1 page worth of
>>>>>> memory anyway? I don't think this is needed.
>>>>>
>>>>> container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
>>>>> and when he maps we don't split it as we don't about his memory.
>>>>> So if he maps multiple pages and tries to unmap partially, then we should fail.
>>>>
>>>> Should we map it in page granularity then, instead of adding this
>>>> discrepancy between EAL and user mapping? I.e. instead of adding a
>>>> workaround, how about we just do the same thing for user mem mappings?
>>>>
>>> In heap mapping's we map and unmap it at huge page granularity as we will always
>>> maintain that.
>>>
>>> But here I think we don't know if user's allocation is huge page or
>>> collection of system
>>> pages. Only thing we can do here is map it at system page granularity which
>>> could waste entries if he say really is working with hugepages. Isn't ?
>>>
>>
>> Yeah we do. The API mandates the pages granularity, and it will check
>> against page size and number of IOVA entries, so yes, we do enforce the fact
>> that the IOVA addresses supplied by the user have to be page addresses.
> 
> If I see rte_vfio_container_dma_map(), there is no mention of Huge page size
> user is providing or we computing. He can call rte_vfio_container_dma_map()
> with 1GB huge page or 4K system page.
> 
> Am I missing something ?

Are you suggesting that a DMA mapping for hugepage-backed memory will be 
made at system page size granularity? E.g. will a 1GB page-backed 
segment be mapped for DMA as a contiguous 4K-based block?

>>
>> -- 
>> Thanks,
>> Anatoly


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-17 16:14               ` Burakov, Anatoly
@ 2020-10-19  9:43                 ` Nithin Dabilpuram
  2020-10-22 12:13                   ` Nithin Dabilpuram
  0 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-10-19  9:43 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: Jerin Jacob, dev, stable

On Sat, Oct 17, 2020 at 05:14:55PM +0100, Burakov, Anatoly wrote:
> On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
> > On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
> > > On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
> > > > On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
> > > > <anatoly.burakov@intel.com> wrote:
> > > > > 
> > > > > On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > > > > > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> > > > > > > On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > > > > > > > Partial unmapping is not supported for VFIO IOMMU type1
> > > > > > > > by kernel. Though kernel gives return as zero, the unmapped size
> > > > > > > > returned will not be same as expected. So check for
> > > > > > > > returned unmap size and return error.
> > > > > > > > 
> > > > > > > > For case of DMA map/unmap triggered by heap allocations,
> > > > > > > > maintain granularity of memseg page size so that heap
> > > > > > > > expansion and contraction does not have this issue.
> > > > > > > 
> > > > > > > This is quite unfortunate, because there was a different bug that had to do
> > > > > > > with kernel having a very limited number of mappings available [1], as a
> > > > > > > result of which the page concatenation code was added.
> > > > > > > 
> > > > > > > It should therefore be documented that the dma_entry_limit parameter should
> > > > > > > be adjusted should the user run out of the DMA entries.
> > > > > > > 
> > > > > > > [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
> > > > > 
> > > > > <snip>
> > > > > 
> > > > > > > >                       RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> > > > > > > >                                       errno, strerror(errno));
> > > > > > > >                       return -1;
> > > > > > > > +           } else if (dma_unmap.size != len) {
> > > > > > > > +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > > > > > > > +                           "remapping cleared instead of %"PRIu64"\n",
> > > > > > > > +                           (uint64_t)dma_unmap.size, len);
> > > > > > > > +                   rte_errno = EIO;
> > > > > > > > +                   return -1;
> > > > > > > >               }
> > > > > > > >       }
> > > > > > > > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > > > > > >               /* we're partially unmapping a previously mapped region, so we
> > > > > > > >                * need to split entry into two.
> > > > > > > >                */
> > > > > > > > +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > > > > > > > +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > > > > > > > +                   rte_errno = ENOTSUP;
> > > > > > > > +                   ret = -1;
> > > > > > > > +                   goto out;
> > > > > > > > +           }
> > > > > > > 
> > > > > > > How would we ever arrive here if we never do more than 1 page worth of
> > > > > > > memory anyway? I don't think this is needed.
> > > > > > 
> > > > > > container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> > > > > > and when he maps, we don't split it as we don't know about his memory.
> > > > > > So if he maps multiple pages and tries to unmap partially, then we should fail.
> > > > > 
> > > > > Should we map it in page granularity then, instead of adding this
> > > > > discrepancy between EAL and user mapping? I.e. instead of adding a
> > > > > workaround, how about we just do the same thing for user mem mappings?
> > > > > 
> > > > In heap mappings we map and unmap it at huge page granularity, as we will always
> > > > maintain that.
> > > > 
> > > > But here I think we don't know if the user's allocation is a huge page or a
> > > > collection of system pages. The only thing we can do here is map it at
> > > > system page granularity, which could waste entries if he really is
> > > > working with hugepages. Isn't it?
> > > > 
> > > 
> > > Yeah we do. The API mandates the pages granularity, and it will check
> > > against page size and number of IOVA entries, so yes, we do enforce the fact
> > > that the IOVA addresses supplied by the user have to be page addresses.
> > 
> > If I see rte_vfio_container_dma_map(), there is no mention of Huge page size
> > user is providing or we computing. He can call rte_vfio_container_dma_map()
> > with 1GB huge page or 4K system page.
> > 
> > Am I missing something ?
> 
> Are you suggesting that a DMA mapping for hugepage-backed memory will be
> made at system page size granularity? E.g. will a 1GB page-backed segment be
> mapped for DMA as a contiguous 4K-based block?

I'm not suggesting anything. My only thought is how to solve the problem below.
Say the application does the following.

#1 Allocate 1GB memory from huge page or some external mem.
#2 Do rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, mem, mem, 1GB)
   In linux/eal_vfio.c, we map it as a single VFIO DMA entry of 1 GB as we
   don't know where this memory is coming from or what it is backed by.
#3 After a while call rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD, mem+4KB, mem+4KB, 4KB)
 
Though rte_vfio_container_dma_unmap() supports #3 by splitting the entry as shown below,
in the VFIO type1 IOMMU #3 cannot be supported by the current kernel interface. So how
can we allow #3?


static int
container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
                uint64_t len) 
{
        struct user_mem_map *map, *new_map = NULL;
        struct user_mem_maps *user_mem_maps;
        int ret = 0; 

        user_mem_maps = &vfio_cfg->mem_maps;
        rte_spinlock_recursive_lock(&user_mem_maps->lock);

        /* find our mapping */
        map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
        if (!map) {
                RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
                rte_errno = EINVAL;
                ret = -1;
                goto out; 
        }
        if (map->addr != vaddr || map->iova != iova || map->len != len) {
                /* we're partially unmapping a previously mapped region, so we
                 * need to split entry into two.
                 */


> 
> > > 
> > > -- 
> > > Thanks,
> > > Anatoly
> 
> 
> -- 
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-19  9:43                 ` Nithin Dabilpuram
@ 2020-10-22 12:13                   ` Nithin Dabilpuram
  2020-10-28 13:04                     ` Burakov, Anatoly
  0 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-10-22 12:13 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: Jerin Jacob, dev, stable

Ping.

On Mon, Oct 19, 2020 at 03:13:15PM +0530, Nithin Dabilpuram wrote:
> On Sat, Oct 17, 2020 at 05:14:55PM +0100, Burakov, Anatoly wrote:
> > On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
> > > On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
> > > > On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
> > > > > On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
> > > > > <anatoly.burakov@intel.com> wrote:
> > > > > > 
> > > > > > On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > > > > > > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> > > > > > > > External Email
> > > > > > > > 
> > > > > > > > ----------------------------------------------------------------------
> > > > > > > > On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > > > > > > > > Partial unmapping is not supported for VFIO IOMMU type1
> > > > > > > > > by kernel. Though kernel gives return as zero, the unmapped size
> > > > > > > > > returned will not be same as expected. So check for
> > > > > > > > > returned unmap size and return error.
> > > > > > > > > 
> > > > > > > > > For case of DMA map/unmap triggered by heap allocations,
> > > > > > > > > maintain granularity of memseg page size so that heap
> > > > > > > > > expansion and contraction does not have this issue.
> > > > > > > > 
> > > > > > > > This is quite unfortunate, because there was a different bug that had to do
> > > > > > > > with kernel having a very limited number of mappings available [1], as a
> > > > > > > > result of which the page concatenation code was added.
> > > > > > > > 
> > > > > > > > It should therefore be documented that the dma_entry_limit parameter should
> > > > > > > > be adjusted should the user run out of the DMA entries.
> > > > > > > > 
> > > > > > > > [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_lkml_155414977872.12780.13728555131525362206.stgit-40gimli.home_T_&d=DwICaQ&c=nKjWec2b6R0mOyPaz7xtfQ&r=FZ_tPCbgFOh18zwRPO9H0yDx8VW38vuapifdDfc8SFQ&m=3GMg-634_cdUCY4WpQPwjzZ_S4ckuMHOnt2FxyyjXMk&s=TJLzppkaDS95VGyRHX2hzflQfb9XLK0OiOszSXoeXKk&e=
> > > > > > 
> > > > > > <snip>
> > > > > > 
> > > > > > > > >                       RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> > > > > > > > >                                       errno, strerror(errno));
> > > > > > > > >                       return -1;
> > > > > > > > > +           } else if (dma_unmap.size != len) {
> > > > > > > > > +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > > > > > > > > +                           "remapping cleared instead of %"PRIu64"\n",
> > > > > > > > > +                           (uint64_t)dma_unmap.size, len);
> > > > > > > > > +                   rte_errno = EIO;
> > > > > > > > > +                   return -1;
> > > > > > > > >               }
> > > > > > > > >       }
> > > > > > > > > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > > > > > > >               /* we're partially unmapping a previously mapped region, so we
> > > > > > > > >                * need to split entry into two.
> > > > > > > > >                */
> > > > > > > > > +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > > > > > > > > +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > > > > > > > > +                   rte_errno = ENOTSUP;
> > > > > > > > > +                   ret = -1;
> > > > > > > > > +                   goto out;
> > > > > > > > > +           }
> > > > > > > > 
> > > > > > > > How would we ever arrive here if we never do more than 1 page worth of
> > > > > > > > memory anyway? I don't think this is needed.
> > > > > > > 
> > > > > > > container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> > > > > > > and when he maps, we don't split it as we don't know about his memory.
> > > > > > > So if he maps multiple pages and tries to unmap partially, then we should fail.
> > > > > > 
> > > > > > Should we map it in page granularity then, instead of adding this
> > > > > > discrepancy between EAL and user mapping? I.e. instead of adding a
> > > > > > workaround, how about we just do the same thing for user mem mappings?
> > > > > > 
> > > > > In heap mappings we map and unmap it at huge page granularity, as we will always
> > > > > maintain that.
> > > > > 
> > > > > But here I think we don't know if the user's allocation is a huge page or a
> > > > > collection of system pages. The only thing we can do here is map it at
> > > > > system page granularity, which could waste entries if he really is
> > > > > working with hugepages. Isn't it?
> > > > > 
> > > > 
> > > > Yeah we do. The API mandates the pages granularity, and it will check
> > > > against page size and number of IOVA entries, so yes, we do enforce the fact
> > > > that the IOVA addresses supplied by the user have to be page addresses.
> > > 
> > > If I see rte_vfio_container_dma_map(), there is no mention of Huge page size
> > > user is providing or we computing. He can call rte_vfio_container_dma_map()
> > > with 1GB huge page or 4K system page.
> > > 
> > > Am I missing something ?
> > 
> > Are you suggesting that a DMA mapping for hugepage-backed memory will be
> > made at system page size granularity? E.g. will a 1GB page-backed segment be
> > mapped for DMA as a contiguous 4K-based block?
> 
> I'm not suggesting anything. My only thought is how to solve below problem.
> Say application does the following.
> 
> #1 Allocate 1GB memory from huge page or some external mem.
> #2 Do rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, mem, mem, 1GB)
>    In linux/eal_vfio.c, we map it as a single VFIO DMA entry of 1 GB as we
>    don't know where this memory is coming from or backed by what.
> #3 After a while call rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD, mem+4KB, mem+4KB, 4KB)
>  
> Though rte_vfio_container_dma_unmap() supports #3 by splitting entry as shown below,
> In VFIO type1 iommu, #3 cannot be supported by current kernel interface. So how
> can we allow #3 ?
> 
> 
> static int
> container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>                 uint64_t len) 
> {
>         struct user_mem_map *map, *new_map = NULL;
>         struct user_mem_maps *user_mem_maps;
>         int ret = 0; 
> 
>         user_mem_maps = &vfio_cfg->mem_maps;
>         rte_spinlock_recursive_lock(&user_mem_maps->lock);
> 
>         /* find our mapping */
>         map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
>         if (!map) {
>                 RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
>                 rte_errno = EINVAL;
>                 ret = -1;
>                 goto out; 
>         }
>         if (map->addr != vaddr || map->iova != iova || map->len != len) {
>                 /* we're partially unmapping a previously mapped region, so we
>                  * need to split entry into two.
>                  */
> 
> 
> > 
> > > > 
> > > > -- 
> > > > Thanks,
> > > > Anatoly
> > 
> > 
> > -- 
> > Thanks,
> > Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-22 12:13                   ` Nithin Dabilpuram
@ 2020-10-28 13:04                     ` Burakov, Anatoly
  2020-10-28 14:17                       ` Nithin Dabilpuram
  0 siblings, 1 reply; 76+ messages in thread
From: Burakov, Anatoly @ 2020-10-28 13:04 UTC (permalink / raw)
  To: Nithin Dabilpuram; +Cc: Jerin Jacob, dev, stable

On 22-Oct-20 1:13 PM, Nithin Dabilpuram wrote:
> Ping.
> 
> On Mon, Oct 19, 2020 at 03:13:15PM +0530, Nithin Dabilpuram wrote:
>> On Sat, Oct 17, 2020 at 05:14:55PM +0100, Burakov, Anatoly wrote:
>>> On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
>>>> On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
>>>>> On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
>>>>>> On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>
>>>>>>> On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
>>>>>>>> On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
>>>>>>>>> External Email
>>>>>>>>>
>>>>>>>>> ----------------------------------------------------------------------
>>>>>>>>> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
>>>>>>>>>> Partial unmapping is not supported for VFIO IOMMU type1
>>>>>>>>>> by kernel. Though kernel gives return as zero, the unmapped size
>>>>>>>>>> returned will not be same as expected. So check for
>>>>>>>>>> returned unmap size and return error.
>>>>>>>>>>
>>>>>>>>>> For case of DMA map/unmap triggered by heap allocations,
>>>>>>>>>> maintain granularity of memseg page size so that heap
>>>>>>>>>> expansion and contraction does not have this issue.
>>>>>>>>>
>>>>>>>>> This is quite unfortunate, because there was a different bug that had to do
>>>>>>>>> with kernel having a very limited number of mappings available [1], as a
>>>>>>>>> result of which the page concatenation code was added.
>>>>>>>>>
>>>>>>>>> It should therefore be documented that the dma_entry_limit parameter should
>>>>>>>>> be adjusted should the user run out of the DMA entries.
>>>>>>>>>
>>>>>>>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_lkml_155414977872.12780.13728555131525362206.stgit-40gimli.home_T_&d=DwICaQ&c=nKjWec2b6R0mOyPaz7xtfQ&r=FZ_tPCbgFOh18zwRPO9H0yDx8VW38vuapifdDfc8SFQ&m=3GMg-634_cdUCY4WpQPwjzZ_S4ckuMHOnt2FxyyjXMk&s=TJLzppkaDS95VGyRHX2hzflQfb9XLK0OiOszSXoeXKk&e=
>>>>>>>
>>>>>>> <snip>
>>>>>>>
>>>>>>>>>>                        RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
>>>>>>>>>>                                        errno, strerror(errno));
>>>>>>>>>>                        return -1;
>>>>>>>>>> +           } else if (dma_unmap.size != len) {
>>>>>>>>>> +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
>>>>>>>>>> +                           "remapping cleared instead of %"PRIu64"\n",
>>>>>>>>>> +                           (uint64_t)dma_unmap.size, len);
>>>>>>>>>> +                   rte_errno = EIO;
>>>>>>>>>> +                   return -1;
>>>>>>>>>>                }
>>>>>>>>>>        }
>>>>>>>>>> @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>>>>>>>>>                /* we're partially unmapping a previously mapped region, so we
>>>>>>>>>>                 * need to split entry into two.
>>>>>>>>>>                 */
>>>>>>>>>> +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
>>>>>>>>>> +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
>>>>>>>>>> +                   rte_errno = ENOTSUP;
>>>>>>>>>> +                   ret = -1;
>>>>>>>>>> +                   goto out;
>>>>>>>>>> +           }
>>>>>>>>>
>>>>>>>>> How would we ever arrive here if we never do more than 1 page worth of
>>>>>>>>> memory anyway? I don't think this is needed.
>>>>>>>>
>>>>>>>> container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
>>>>>>>> and when he maps, we don't split it as we don't know about his memory.
>>>>>>>> So if he maps multiple pages and tries to unmap partially, then we should fail.
>>>>>>>
>>>>>>> Should we map it in page granularity then, instead of adding this
>>>>>>> discrepancy between EAL and user mapping? I.e. instead of adding a
>>>>>>> workaround, how about we just do the same thing for user mem mappings?
>>>>>>>
>>>>>> In heap mappings we map and unmap it at huge page granularity, as we will always
>>>>>> maintain that.
>>>>>>
>>>>>> But here I think we don't know if the user's allocation is a huge page or a
>>>>>> collection of system pages. The only thing we can do here is map it at
>>>>>> system page granularity, which could waste entries if he really is
>>>>>> working with hugepages. Isn't it?
>>>>>>
>>>>>
>>>>> Yeah we do. The API mandates the pages granularity, and it will check
>>>>> against page size and number of IOVA entries, so yes, we do enforce the fact
>>>>> that the IOVA addresses supplied by the user have to be page addresses.
>>>>
>>>> If I see rte_vfio_container_dma_map(), there is no mention of Huge page size
>>>> user is providing or we computing. He can call rte_vfio_container_dma_map()
>>>> with 1GB huge page or 4K system page.
>>>>
>>>> Am I missing something ?
>>>
>>> Are you suggesting that a DMA mapping for hugepage-backed memory will be
>>> made at system page size granularity? E.g. will a 1GB page-backed segment be
>>> mapped for DMA as a contiguous 4K-based block?
>>
>> I'm not suggesting anything. My only thought is how to solve below problem.
>> Say application does the following.
>>
>> #1 Allocate 1GB memory from huge page or some external mem.
>> #2 Do rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, mem, mem, 1GB)
>>     In linux/eal_vfio.c, we map it as a single VFIO DMA entry of 1 GB as we
>>     don't know where this memory is coming from or backed by what.
>> #3 After a while call rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD, mem+4KB, mem+4KB, 4KB)
>>   
>> Though rte_vfio_container_dma_unmap() supports #3 by splitting entry as shown below,
>> In VFIO type1 iommu, #3 cannot be supported by current kernel interface. So how
>> can we allow #3 ?
>>
>>
>> static int
>> container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>                  uint64_t len)
>> {
>>          struct user_mem_map *map, *new_map = NULL;
>>          struct user_mem_maps *user_mem_maps;
>>          int ret = 0;
>>
>>          user_mem_maps = &vfio_cfg->mem_maps;
>>          rte_spinlock_recursive_lock(&user_mem_maps->lock);
>>
>>          /* find our mapping */
>>          map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
>>          if (!map) {
>>                  RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
>>                  rte_errno = EINVAL;
>>                  ret = -1;
>>                  goto out;
>>          }
>>          if (map->addr != vaddr || map->iova != iova || map->len != len) {
>>                  /* we're partially unmapping a previously mapped region, so we
>>                   * need to split entry into two.
>>                   */

Hi,

Apologies, I was on vacation.

Yes, I can see the problem now. Does VFIO even support non-system page 
sizes? Like, if I allocated a 1GB page, would I be able to map *this 
page* for DMA, as opposed to the first 4K of this page? I suspect that the 
mapping doesn't support page sizes other than the system page size.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-28 13:04                     ` Burakov, Anatoly
@ 2020-10-28 14:17                       ` Nithin Dabilpuram
  2020-10-28 16:07                         ` Burakov, Anatoly
  0 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-10-28 14:17 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: Jerin Jacob, dev, stable

On Wed, Oct 28, 2020 at 01:04:26PM +0000, Burakov, Anatoly wrote:
> On 22-Oct-20 1:13 PM, Nithin Dabilpuram wrote:
> > Ping.
> > 
> > On Mon, Oct 19, 2020 at 03:13:15PM +0530, Nithin Dabilpuram wrote:
> > > On Sat, Oct 17, 2020 at 05:14:55PM +0100, Burakov, Anatoly wrote:
> > > > On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
> > > > > On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
> > > > > > On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
> > > > > > > On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
> > > > > > > <anatoly.burakov@intel.com> wrote:
> > > > > > > > 
> > > > > > > > On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > > > > > > > > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> > > > > > > > > > External Email
> > > > > > > > > > 
> > > > > > > > > > ----------------------------------------------------------------------
> > > > > > > > > > On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > > > > > > > > > > Partial unmapping is not supported for VFIO IOMMU type1
> > > > > > > > > > > by kernel. Though kernel gives return as zero, the unmapped size
> > > > > > > > > > > returned will not be same as expected. So check for
> > > > > > > > > > > returned unmap size and return error.
> > > > > > > > > > > 
> > > > > > > > > > > For case of DMA map/unmap triggered by heap allocations,
> > > > > > > > > > > maintain granularity of memseg page size so that heap
> > > > > > > > > > > expansion and contraction does not have this issue.
> > > > > > > > > > 
> > > > > > > > > > This is quite unfortunate, because there was a different bug that had to do
> > > > > > > > > > with kernel having a very limited number of mappings available [1], as a
> > > > > > > > > > result of which the page concatenation code was added.
> > > > > > > > > > 
> > > > > > > > > > It should therefore be documented that the dma_entry_limit parameter should
> > > > > > > > > > be adjusted should the user run out of the DMA entries.
> > > > > > > > > > 
> > > > > > > > > > [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_lkml_155414977872.12780.13728555131525362206.stgit-40gimli.home_T_&d=DwICaQ&c=nKjWec2b6R0mOyPaz7xtfQ&r=FZ_tPCbgFOh18zwRPO9H0yDx8VW38vuapifdDfc8SFQ&m=3GMg-634_cdUCY4WpQPwjzZ_S4ckuMHOnt2FxyyjXMk&s=TJLzppkaDS95VGyRHX2hzflQfb9XLK0OiOszSXoeXKk&e=
> > > > > > > > 
> > > > > > > > <snip>
> > > > > > > > 
> > > > > > > > > > >                        RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> > > > > > > > > > >                                        errno, strerror(errno));
> > > > > > > > > > >                        return -1;
> > > > > > > > > > > +           } else if (dma_unmap.size != len) {
> > > > > > > > > > > +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > > > > > > > > > > +                           "remapping cleared instead of %"PRIu64"\n",
> > > > > > > > > > > +                           (uint64_t)dma_unmap.size, len);
> > > > > > > > > > > +                   rte_errno = EIO;
> > > > > > > > > > > +                   return -1;
> > > > > > > > > > >                }
> > > > > > > > > > >        }
> > > > > > > > > > > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > > > > > > > > >                /* we're partially unmapping a previously mapped region, so we
> > > > > > > > > > >                 * need to split entry into two.
> > > > > > > > > > >                 */
> > > > > > > > > > > +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > > > > > > > > > > +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > > > > > > > > > > +                   rte_errno = ENOTSUP;
> > > > > > > > > > > +                   ret = -1;
> > > > > > > > > > > +                   goto out;
> > > > > > > > > > > +           }
> > > > > > > > > > 
> > > > > > > > > > How would we ever arrive here if we never do more than 1 page worth of
> > > > > > > > > > memory anyway? I don't think this is needed.
> > > > > > > > > 
> > > > > > > > > container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> > > > > > > > > and when he maps, we don't split it as we don't know about his memory.
> > > > > > > > > So if he maps multiple pages and tries to unmap partially, then we should fail.
> > > > > > > > 
> > > > > > > > Should we map it in page granularity then, instead of adding this
> > > > > > > > discrepancy between EAL and user mapping? I.e. instead of adding a
> > > > > > > > workaround, how about we just do the same thing for user mem mappings?
> > > > > > > > 
> > > > > > > In heap mappings we map and unmap it at huge page granularity, as we will always
> > > > > > > maintain that.
> > > > > > > 
> > > > > > > But here I think we don't know if the user's allocation is a huge page or a
> > > > > > > collection of system pages. The only thing we can do here is map it at
> > > > > > > system page granularity, which could waste entries if he really is
> > > > > > > working with hugepages. Isn't it?
> > > > > > > 
> > > > > > 
> > > > > > Yeah we do. The API mandates the pages granularity, and it will check
> > > > > > against page size and number of IOVA entries, so yes, we do enforce the fact
> > > > > > that the IOVA addresses supplied by the user have to be page addresses.
> > > > > 
> > > > > If I see rte_vfio_container_dma_map(), there is no mention of Huge page size
> > > > > user is providing or we computing. He can call rte_vfio_container_dma_map()
> > > > > with 1GB huge page or 4K system page.
> > > > > 
> > > > > Am I missing something ?
> > > > 
> > > > Are you suggesting that a DMA mapping for hugepage-backed memory will be
> > > > made at system page size granularity? E.g. will a 1GB page-backed segment be
> > > > mapped for DMA as a contiguous 4K-based block?
> > > 
> > > I'm not suggesting anything. My only thought is how to solve below problem.
> > > Say application does the following.
> > > 
> > > #1 Allocate 1GB memory from huge page or some external mem.
> > > #2 Do rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, mem, mem, 1GB)
> > >     In linux/eal_vfio.c, we map it as a single VFIO DMA entry of 1 GB as we
> > >     don't know where this memory is coming from or backed by what.
> > > #3 After a while call rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD, mem+4KB, mem+4KB, 4KB)
> > > Though rte_vfio_container_dma_unmap() supports #3 by splitting entry as shown below,
> > > In VFIO type1 iommu, #3 cannot be supported by current kernel interface. So how
> > > can we allow #3 ?
> > > 
> > > 
> > > static int
> > > container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > >                  uint64_t len)
> > > {
> > >          struct user_mem_map *map, *new_map = NULL;
> > >          struct user_mem_maps *user_mem_maps;
> > >          int ret = 0;
> > > 
> > >          user_mem_maps = &vfio_cfg->mem_maps;
> > >          rte_spinlock_recursive_lock(&user_mem_maps->lock);
> > > 
> > >          /* find our mapping */
> > >          map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
> > >          if (!map) {
> > >                  RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
> > >                  rte_errno = EINVAL;
> > >                  ret = -1;
> > >                  goto out;
> > >          }
> > >          if (map->addr != vaddr || map->iova != iova || map->len != len) {
> > >                  /* we're partially unmapping a previously mapped region, so we
> > >                   * need to split entry into two.
> > >                   */
> 
> Hi,
> 
> Apologies, I was on vacation.
> 
> Yes, I can see the problem now. Does VFIO even support non-system page
> sizes? Like, if I allocated a 1GB page, would I be able to map *this page*
> for DMA, as opposed to the first 4K of this page? I suspect that the mapping
> doesn't support page sizes other than the system page size.

It does support mapping any multiple of the system page size.
See vfio_pin_map_dma() in drivers/vfio/vfio_iommu_type1.c. Also,
./driver-api/vfio.rst doesn't mention any such restriction, even in its
example.

Also, my test case is passing, so that confirms the behavior.


> 
> -- 
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-28 14:17                       ` Nithin Dabilpuram
@ 2020-10-28 16:07                         ` Burakov, Anatoly
  2020-10-28 16:31                           ` Nithin Dabilpuram
  0 siblings, 1 reply; 76+ messages in thread
From: Burakov, Anatoly @ 2020-10-28 16:07 UTC (permalink / raw)
  To: Nithin Dabilpuram; +Cc: Jerin Jacob, dev, stable

On 28-Oct-20 2:17 PM, Nithin Dabilpuram wrote:
> On Wed, Oct 28, 2020 at 01:04:26PM +0000, Burakov, Anatoly wrote:
>> On 22-Oct-20 1:13 PM, Nithin Dabilpuram wrote:
>>> Ping.
>>>
>>> On Mon, Oct 19, 2020 at 03:13:15PM +0530, Nithin Dabilpuram wrote:
>>>> On Sat, Oct 17, 2020 at 05:14:55PM +0100, Burakov, Anatoly wrote:
>>>>> On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
>>>>>> On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
>>>>>>> On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
>>>>>>>> On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
>>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>>>
>>>>>>>>> On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
>>>>>>>>>> On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
>>>>>>>>>>> External Email
>>>>>>>>>>>
>>>>>>>>>>> ----------------------------------------------------------------------
>>>>>>>>>>> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
>>>>>>>>>>>> Partial unmapping is not supported for VFIO IOMMU type1
>>>>>>>>>>>> by kernel. Though kernel gives return as zero, the unmapped size
>>>>>>>>>>>> returned will not be same as expected. So check for
>>>>>>>>>>>> returned unmap size and return error.
>>>>>>>>>>>>
>>>>>>>>>>>> For case of DMA map/unmap triggered by heap allocations,
>>>>>>>>>>>> maintain granularity of memseg page size so that heap
>>>>>>>>>>>> expansion and contraction does not have this issue.
>>>>>>>>>>>
>>>>>>>>>>> This is quite unfortunate, because there was a different bug that had to do
>>>>>>>>>>> with kernel having a very limited number of mappings available [1], as a
>>>>>>>>>>> result of which the page concatenation code was added.
>>>>>>>>>>>
>>>>>>>>>>> It should therefore be documented that the dma_entry_limit parameter should
>>>>>>>>>>> be adjusted should the user run out of the DMA entries.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_lkml_155414977872.12780.13728555131525362206.stgit-40gimli.home_T_&d=DwICaQ&c=nKjWec2b6R0mOyPaz7xtfQ&r=FZ_tPCbgFOh18zwRPO9H0yDx8VW38vuapifdDfc8SFQ&m=3GMg-634_cdUCY4WpQPwjzZ_S4ckuMHOnt2FxyyjXMk&s=TJLzppkaDS95VGyRHX2hzflQfb9XLK0OiOszSXoeXKk&e=
>>>>>>>>>
>>>>>>>>> <snip>
>>>>>>>>>
>>>>>>>>>>>>                         RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
>>>>>>>>>>>>                                         errno, strerror(errno));
>>>>>>>>>>>>                         return -1;
>>>>>>>>>>>> +           } else if (dma_unmap.size != len) {
>>>>>>>>>>>> +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
>>>>>>>>>>>> +                           "remapping cleared instead of %"PRIu64"\n",
>>>>>>>>>>>> +                           (uint64_t)dma_unmap.size, len);
>>>>>>>>>>>> +                   rte_errno = EIO;
>>>>>>>>>>>> +                   return -1;
>>>>>>>>>>>>                 }
>>>>>>>>>>>>         }
>>>>>>>>>>>> @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>>>>>>>>>>>                 /* we're partially unmapping a previously mapped region, so we
>>>>>>>>>>>>                  * need to split entry into two.
>>>>>>>>>>>>                  */
>>>>>>>>>>>> +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
>>>>>>>>>>>> +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
>>>>>>>>>>>> +                   rte_errno = ENOTSUP;
>>>>>>>>>>>> +                   ret = -1;
>>>>>>>>>>>> +                   goto out;
>>>>>>>>>>>> +           }
>>>>>>>>>>>
>>>>>>>>>>> How would we ever arrive here if we never do more than 1 page worth of
>>>>>>>>>>> memory anyway? I don't think this is needed.
>>>>>>>>>>
>>>>>>>>>> container_dma_unmap() is called by the user via rte_vfio_container_dma_unmap(),
>>>>>>>>>> and when the user maps we don't split it, as we don't know about their memory.
>>>>>>>>>> So if the user maps multiple pages and tries to unmap partially, then we should fail.
>>>>>>>>>
>>>>>>>>> Should we map it in page granularity then, instead of adding this
>>>>>>>>> discrepancy between EAL and user mapping? I.e. instead of adding a
>>>>>>>>> workaround, how about we just do the same thing for user mem mappings?
>>>>>>>>>
>>>>>>>> In heap mappings we map and unmap at hugepage granularity, as we will always
>>>>>>>> maintain that.
>>>>>>>>
>>>>>>>> But here I think we don't know whether the user's allocation is a hugepage or
>>>>>>>> a collection of system
>>>>>>>> pages. The only thing we can do here is map it at system page granularity, which
>>>>>>>> could waste entries if the user really is working with hugepages. Isn't it?
>>>>>>>>
>>>>>>>
>>>>>>> Yeah we do. The API mandates the pages granularity, and it will check
>>>>>>> against page size and number of IOVA entries, so yes, we do enforce the fact
>>>>>>> that the IOVA addresses supplied by the user have to be page addresses.
>>>>>>
>>>>>> If I see rte_vfio_container_dma_map(), there is no mention of the hugepage size
>>>>>> the user is providing, nor do we compute it. The user can call rte_vfio_container_dma_map()
>>>>>> with a 1GB hugepage or a 4K system page.
>>>>>>
>>>>>> Am I missing something ?
>>>>>
>>>>> Are you suggesting that a DMA mapping for hugepage-backed memory will be
>>>>> made at system page size granularity? E.g. will a 1GB page-backed segment be
>>>>> mapped for DMA as a contiguous 4K-based block?
>>>>
>>>> I'm not suggesting anything. My only thought is how to solve the problem below.
>>>> Say the application does the following.
>>>>
>>>> #1 Allocate 1GB memory from huge page or some external mem.
>>>> #2 Do rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, mem, mem, 1GB)
>>>>      In linux/eal_vfio.c, we map it as a single VFIO DMA entry of 1 GB as we
>>>>      don't know where this memory is coming from or what it is backed by.
>>>> #3 After a while, call rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD, mem+4KB, mem+4KB, 4KB)
>>>> Though rte_vfio_container_dma_unmap() supports #3 by splitting the entry as shown below,
>>>> in VFIO type1 IOMMU, #3 cannot be supported by the current kernel interface. So how
>>>> can we allow #3?
>>>>
>>>>
>>>> static int
>>>> container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>>>                   uint64_t len)
>>>> {
>>>>           struct user_mem_map *map, *new_map = NULL;
>>>>           struct user_mem_maps *user_mem_maps;
>>>>           int ret = 0;
>>>>
>>>>           user_mem_maps = &vfio_cfg->mem_maps;
>>>>           rte_spinlock_recursive_lock(&user_mem_maps->lock);
>>>>
>>>>           /* find our mapping */
>>>>           map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
>>>>           if (!map) {
>>>>                   RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
>>>>                   rte_errno = EINVAL;
>>>>                   ret = -1;
>>>>                   goto out;
>>>>           }
>>>>           if (map->addr != vaddr || map->iova != iova || map->len != len) {
>>>>                   /* we're partially unmapping a previously mapped region, so we
>>>>                    * need to split entry into two.
>>>>                    */
>>
>> Hi,
>>
>> Apologies, I was on vacation.
>>
>> Yes, I can see the problem now. Does VFIO even support non-system page
>> sizes? Like, if I allocated a 1GB page, would I be able to map *this page*
>> for DMA, as opposed to the first 4K of this page? I suspect that the mapping
>> doesn't support page sizes other than the system page size.
> 
> It does support mapping any multiple of system page size.
> See vfio/vfio_iommu_type1.c vfio_pin_map_dma(). Also
> ./driver-api/vfio.rst doesn't mention any such restrictions even in its
> example.
> 
> Also my test case is passing so that confirms the behavior.

Can we perhaps make it so that the API mandates mapping/unmapping the 
same chunks? That would be the easiest solution here.

> 
> 
>>
>> -- 
>> Thanks,
>> Anatoly


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
  2020-10-28 16:07                         ` Burakov, Anatoly
@ 2020-10-28 16:31                           ` Nithin Dabilpuram
  0 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-10-28 16:31 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: Jerin Jacob, dev, stable

On Wed, Oct 28, 2020 at 04:07:17PM +0000, Burakov, Anatoly wrote:
> On 28-Oct-20 2:17 PM, Nithin Dabilpuram wrote:
> > On Wed, Oct 28, 2020 at 01:04:26PM +0000, Burakov, Anatoly wrote:
> > > On 22-Oct-20 1:13 PM, Nithin Dabilpuram wrote:
> > > > Ping.
> > > > 
> > > > On Mon, Oct 19, 2020 at 03:13:15PM +0530, Nithin Dabilpuram wrote:
> > > > > On Sat, Oct 17, 2020 at 05:14:55PM +0100, Burakov, Anatoly wrote:
> > > > > > On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
> > > > > > > On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
> > > > > > > > On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
> > > > > > > > > On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
> > > > > > > > > <anatoly.burakov@intel.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > > > > > > > > > > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> > > > > > > > > > > > External Email
> > > > > > > > > > > > 
> > > > > > > > > > > > ----------------------------------------------------------------------
> > > > > > > > > > > > On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > > > > > > > > > > > > Partial unmapping is not supported for VFIO IOMMU type1
> > > > > > > > > > > > > by the kernel. Though the kernel returns zero, the unmapped
> > > > > > > > > > > > > size reported will not be the same as requested. So check the
> > > > > > > > > > > > > returned unmap size and return an error.
> > > > > > > > > > > > >
> > > > > > > > > > > > > For the case of DMA map/unmap triggered by heap allocations,
> > > > > > > > > > > > > maintain memseg page size granularity so that heap
> > > > > > > > > > > > > expansion and contraction do not hit this issue.
> > > > > > > > > > > > 
> > > > > > > > > > > > This is quite unfortunate, because there was a different bug that had to do
> > > > > > > > > > > > with kernel having a very limited number of mappings available [1], as a
> > > > > > > > > > > > result of which the page concatenation code was added.
> > > > > > > > > > > > 
> > > > > > > > > > > > It should therefore be documented that the dma_entry_limit parameter should
> > > > > > > > > > > > be adjusted should the user run out of the DMA entries.
> > > > > > > > > > > > 
> > > > > > > > > > > > [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_lkml_155414977872.12780.13728555131525362206.stgit-40gimli.home_T_&d=DwICaQ&c=nKjWec2b6R0mOyPaz7xtfQ&r=FZ_tPCbgFOh18zwRPO9H0yDx8VW38vuapifdDfc8SFQ&m=3GMg-634_cdUCY4WpQPwjzZ_S4ckuMHOnt2FxyyjXMk&s=TJLzppkaDS95VGyRHX2hzflQfb9XLK0OiOszSXoeXKk&e=
> > > > > > > > > > 
> > > > > > > > > > <snip>
> > > > > > > > > > 
> > > > > > > > > > > > >                         RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> > > > > > > > > > > > >                                         errno, strerror(errno));
> > > > > > > > > > > > >                         return -1;
> > > > > > > > > > > > > +           } else if (dma_unmap.size != len) {
> > > > > > > > > > > > > +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > > > > > > > > > > > > +                           "remapping cleared instead of %"PRIu64"\n",
> > > > > > > > > > > > > +                           (uint64_t)dma_unmap.size, len);
> > > > > > > > > > > > > +                   rte_errno = EIO;
> > > > > > > > > > > > > +                   return -1;
> > > > > > > > > > > > >                 }
> > > > > > > > > > > > >         }
> > > > > > > > > > > > > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > > > > > > > > > > >                 /* we're partially unmapping a previously mapped region, so we
> > > > > > > > > > > > >                  * need to split entry into two.
> > > > > > > > > > > > >                  */
> > > > > > > > > > > > > +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > > > > > > > > > > > > +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > > > > > > > > > > > > +                   rte_errno = ENOTSUP;
> > > > > > > > > > > > > +                   ret = -1;
> > > > > > > > > > > > > +                   goto out;
> > > > > > > > > > > > > +           }
> > > > > > > > > > > > 
> > > > > > > > > > > > How would we ever arrive here if we never do more than 1 page worth of
> > > > > > > > > > > > memory anyway? I don't think this is needed.
> > > > > > > > > > > 
> > > > > > > > > > > container_dma_unmap() is called by the user via rte_vfio_container_dma_unmap(),
> > > > > > > > > > > and when the user maps we don't split it, as we don't know about their memory.
> > > > > > > > > > > So if the user maps multiple pages and tries to unmap partially, then we should fail.
> > > > > > > > > > 
> > > > > > > > > > Should we map it in page granularity then, instead of adding this
> > > > > > > > > > discrepancy between EAL and user mapping? I.e. instead of adding a
> > > > > > > > > > workaround, how about we just do the same thing for user mem mappings?
> > > > > > > > > > 
> > > > > > > > > In heap mappings we map and unmap at hugepage granularity, as we will always
> > > > > > > > > maintain that.
> > > > > > > > >
> > > > > > > > > But here I think we don't know whether the user's allocation is a hugepage or
> > > > > > > > > a collection of system
> > > > > > > > > pages. The only thing we can do here is map it at system page granularity, which
> > > > > > > > > could waste entries if the user really is working with hugepages. Isn't it?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Yeah we do. The API mandates the pages granularity, and it will check
> > > > > > > > against page size and number of IOVA entries, so yes, we do enforce the fact
> > > > > > > > that the IOVA addresses supplied by the user have to be page addresses.
> > > > > > > 
> > > > > > > If I see rte_vfio_container_dma_map(), there is no mention of the hugepage size
> > > > > > > the user is providing, nor do we compute it. The user can call rte_vfio_container_dma_map()
> > > > > > > with a 1GB hugepage or a 4K system page.
> > > > > > > 
> > > > > > > Am I missing something ?
> > > > > > 
> > > > > > Are you suggesting that a DMA mapping for hugepage-backed memory will be
> > > > > > made at system page size granularity? E.g. will a 1GB page-backed segment be
> > > > > > mapped for DMA as a contiguous 4K-based block?
> > > > > 
> > > > > I'm not suggesting anything. My only thought is how to solve the problem below.
> > > > > Say the application does the following.
> > > > >
> > > > > #1 Allocate 1GB memory from huge page or some external mem.
> > > > > #2 Do rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, mem, mem, 1GB)
> > > > >      In linux/eal_vfio.c, we map it as a single VFIO DMA entry of 1 GB as we
> > > > >      don't know where this memory is coming from or what it is backed by.
> > > > > #3 After a while, call rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD, mem+4KB, mem+4KB, 4KB)
> > > > > Though rte_vfio_container_dma_unmap() supports #3 by splitting the entry as shown below,
> > > > > in VFIO type1 IOMMU, #3 cannot be supported by the current kernel interface. So how
> > > > > can we allow #3?
> > > > > 
> > > > > 
> > > > > static int
> > > > > container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > > >                   uint64_t len)
> > > > > {
> > > > >           struct user_mem_map *map, *new_map = NULL;
> > > > >           struct user_mem_maps *user_mem_maps;
> > > > >           int ret = 0;
> > > > > 
> > > > >           user_mem_maps = &vfio_cfg->mem_maps;
> > > > >           rte_spinlock_recursive_lock(&user_mem_maps->lock);
> > > > > 
> > > > >           /* find our mapping */
> > > > >           map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
> > > > >           if (!map) {
> > > > >                   RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
> > > > >                   rte_errno = EINVAL;
> > > > >                   ret = -1;
> > > > >                   goto out;
> > > > >           }
> > > > >           if (map->addr != vaddr || map->iova != iova || map->len != len) {
> > > > >                   /* we're partially unmapping a previously mapped region, so we
> > > > >                    * need to split entry into two.
> > > > >                    */
> > > 
> > > Hi,
> > > 
> > > Apologies, I was on vacation.
> > > 
> > > Yes, I can see the problem now. Does VFIO even support non-system page
> > > sizes? Like, if I allocated a 1GB page, would I be able to map *this page*
> > > for DMA, as opposed to the first 4K of this page? I suspect that the mapping
> > > doesn't support page sizes other than the system page size.
> > 
> > It does support mapping any multiple of system page size.
> > See vfio/vfio_iommu_type1.c vfio_pin_map_dma(). Also
> > ./driver-api/vfio.rst doesn't mention any such restrictions even in its
> > example.
> > 
> > Also my test case is passing so that confirms the behavior.
> 
> Can we perhaps make it so that the API mandates mapping/unmapping the same
> chunks? That would be the easiest solution here.

Ack, I was already doing that for type1 IOMMU with my above patch.
I didn't change the behavior for sPAPR or no-iommu mode.
> 
> > 
> > 
> > > 
> > > -- 
> > > Thanks,
> > > Anatoly
> 
> 
> -- 
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v2 0/3] fix issue with partial DMA unmap
  2020-10-12  8:11 [dpdk-dev] [PATCH 0/2] fix issue with partial DMA unmap Nithin Dabilpuram
  2020-10-12  8:11 ` [dpdk-dev] [PATCH 1/2] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
  2020-10-12  8:11 ` [dpdk-dev] [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1 Nithin Dabilpuram
@ 2020-11-05  9:04 ` Nithin Dabilpuram
  2020-11-05  9:04   ` [dpdk-dev] [PATCH v2 1/3] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
                     ` (2 more replies)
  2020-12-01 19:32 ` [dpdk-dev] [PATCH v3 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-11-05  9:04 UTC (permalink / raw)
  To: anatoly.burakov; +Cc: jerinj, dev, Nithin Dabilpuram

Partial DMA unmap is not supported by the VFIO type1 IOMMU
in Linux. Though the return value is zero, the returned
DMA unmap size is not the same as the expected size.
So add a test case and a fix for both heap-triggered DMA
mapping and user-triggered DMA mapping/unmapping.

Refer to vfio_dma_do_unmap() in drivers/vfio/vfio_iommu_type1.c.
A snippet of its comment is below.

        /*
         * vfio-iommu-type1 (v1) - User mappings were coalesced together to
         * avoid tracking individual mappings.  This means that the granularity
         * of the original mapping was lost and the user was allowed to attempt
         * to unmap any range.  Depending on the contiguousness of physical
         * memory and page sizes supported by the IOMMU, arbitrary unmaps may
         * or may not have worked.  We only guaranteed unmap granularity
         * matching the original mapping; even though it was untracked here,
         * the original mappings are reflected in IOMMU mappings.  This
         * resulted in a couple unusual behaviors.  First, if a range is not
         * able to be unmapped, ex. a set of 4k pages that was mapped as a
         * 2M hugepage into the IOMMU, the unmap ioctl returns success but with
         * a zero sized unmap.  Also, if an unmap request overlaps the first
         * address of a hugepage, the IOMMU will unmap the entire hugepage.
         * This also returns success and the returned unmap size reflects the
         * actual size unmapped.

         * We attempt to maintain compatibility with this "v1" interface, but  
         * we take control out of the hands of the IOMMU.  Therefore, an unmap 
         * request offset from the beginning of the original mapping will      
         * return success with zero sized unmap.  And an unmap request covering
         * the first iova of mapping will unmap the entire range.              

This behavior can be verified by applying the first patch and adding a return
check for dma_unmap.size != len in vfio_type1_dma_mem_map().

v2:
- Reverted the earlier commit that enables merging contiguous mappings for
  IOVA as PA. (see 1/3)
- Updated documentation about kernel DMA mapping limits and the VFIO
  module parameter.
- Moved the VFIO test to test_vfio.c and handled comments from
  Anatoly.
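
As context for the documentation update, the per-container limit can be inspected and raised through the `vfio_iommu_type1` module parameter. A sketch follows; the value 131072 is illustrative, and the sysfs path assumes kernel >= 5.1 with the module loaded:

```shell
# Read the current per-container DMA entry limit (default 65535)
cat /sys/module/vfio_iommu_type1/parameters/dma_entry_limit

# Raise it on the running system (parameter is writable by root)
echo 131072 | sudo tee /sys/module/vfio_iommu_type1/parameters/dma_entry_limit

# Or set it persistently at module load time
echo "options vfio_iommu_type1 dma_entry_limit=131072" | \
    sudo tee /etc/modprobe.d/vfio.conf
```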


Nithin Dabilpuram (3):
  vfio: revert changes for map contiguous areas in one go
  vfio: fix DMA mapping granularity for type1 iova as va
  test: add test case to validate VFIO DMA map/unmap

 app/test/meson.build                   |   1 +
 app/test/test_vfio.c                   | 103 +++++++++++++++++++++++++++++++++
 doc/guides/linux_gsg/linux_drivers.rst |  10 ++++
 lib/librte_eal/linux/eal_vfio.c        |  93 ++++++++++++-----------------
 lib/librte_eal/linux/eal_vfio.h        |   1 +
 5 files changed, 151 insertions(+), 57 deletions(-)
 create mode 100644 app/test/test_vfio.c

-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v2 1/3] vfio: revert changes for map contiguous areas in one go
  2020-11-05  9:04 ` [dpdk-dev] [PATCH v2 0/3] fix issue with partial DMA unmap Nithin Dabilpuram
@ 2020-11-05  9:04   ` Nithin Dabilpuram
  2020-11-05  9:04   ` [dpdk-dev] [PATCH v2 2/3] vfio: fix DMA mapping granularity for type1 iova as va Nithin Dabilpuram
  2020-11-05  9:04   ` [dpdk-dev] [PATCH v2 3/3] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
  2 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-11-05  9:04 UTC (permalink / raw)
  To: anatoly.burakov; +Cc: jerinj, dev, Nithin Dabilpuram, stable

In order to save DMA entries, which are limited by the kernel, for both
external memory and hugepage memory, an attempt was made to map physically
contiguous memory in one go. This cannot be done, as VFIO IOMMU type1
does not support partially unmapping a previously mapped memory
region, while the heap can request multi-page mapping and
partial unmapping.
Hence, to go back to the old method of mapping/unmapping at
memseg granularity, this commit reverts commit
d1c7c0cdf7ba ("vfio: map contiguous areas in one go").

Also add documentation on which module parameter needs to be used
to increase the per-container DMA map limit for VFIO.

Fixes: d1c7c0cdf7ba ("vfio: map contiguous areas in one go")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
---
 doc/guides/linux_gsg/linux_drivers.rst | 10 ++++++
 lib/librte_eal/linux/eal_vfio.c        | 59 +++++-----------------------------
 2 files changed, 18 insertions(+), 51 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_drivers.rst b/doc/guides/linux_gsg/linux_drivers.rst
index 080b449..bb43ab2 100644
--- a/doc/guides/linux_gsg/linux_drivers.rst
+++ b/doc/guides/linux_gsg/linux_drivers.rst
@@ -67,6 +67,16 @@ Note that in order to use VFIO, your kernel must support it.
 VFIO kernel modules have been included in the Linux kernel since version 3.6.0 and are usually present by default,
 however please consult your distributions documentation to make sure that is the case.
 
+The VFIO interface is used for DMA mapping of both external memory and hugepages.
+VFIO does not support partially unmapping previously mapped memory. Hence DPDK's
+memory is mapped at hugepage or system page granularity. The number of DMA
+mappings is limited by the kernel: the locked memory limit of a process (rlimit)
+applies to system/hugepage memory, and an additional per-container limit,
+applicable to both external and system memory, was added in kernel 5.1 via the
+VFIO module parameter ``dma_entry_limit`` (default 64K entries).
+When an application runs out of DMA entries, these limits need to be raised
+accordingly.
+
 The ``vfio-pci`` module since Linux version 5.7 supports the creation of virtual
 functions. After the PF is bound to vfio-pci module, the user can create the VFs
 by sysfs interface, and these VFs are bound to vfio-pci module automatically.
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 380f2f4..dbefcba 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -516,11 +516,9 @@ static void
 vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		void *arg __rte_unused)
 {
-	rte_iova_t iova_start, iova_expected;
 	struct rte_memseg_list *msl;
 	struct rte_memseg *ms;
 	size_t cur_len = 0;
-	uint64_t va_start;
 
 	msl = rte_mem_virt2memseg_list(addr);
 
@@ -549,63 +547,22 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 #endif
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
-
-	/*
-	 * This memory is not guaranteed to be contiguous, but it still could
-	 * be, or it could have some small contiguous chunks. Since the number
-	 * of VFIO mappings is limited, and VFIO appears to not concatenate
-	 * adjacent mappings, we have to do this ourselves.
-	 *
-	 * So, find contiguous chunks, then map them.
-	 */
-	va_start = ms->addr_64;
-	iova_start = iova_expected = ms->iova;
 	while (cur_len < len) {
-		bool new_contig_area = ms->iova != iova_expected;
-		bool last_seg = (len - cur_len) == ms->len;
-		bool skip_last = false;
-
-		/* only do mappings when current contiguous area ends */
-		if (new_contig_area) {
-			if (type == RTE_MEM_EVENT_ALLOC)
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 1);
-			else
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 0);
-			va_start = ms->addr_64;
-			iova_start = ms->iova;
-		}
 		/* some memory segments may have invalid IOVA */
 		if (ms->iova == RTE_BAD_IOVA) {
 			RTE_LOG(DEBUG, EAL, "Memory segment at %p has bad IOVA, skipping\n",
 					ms->addr);
-			skip_last = true;
+			goto next;
 		}
-		iova_expected = ms->iova + ms->len;
+		if (type == RTE_MEM_EVENT_ALLOC)
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 1);
+		else
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 0);
+next:
 		cur_len += ms->len;
 		++ms;
-
-		/*
-		 * don't count previous segment, and don't attempt to
-		 * dereference a potentially invalid pointer.
-		 */
-		if (skip_last && !last_seg) {
-			iova_expected = iova_start = ms->iova;
-			va_start = ms->addr_64;
-		} else if (!skip_last && last_seg) {
-			/* this is the last segment and we're not skipping */
-			if (type == RTE_MEM_EVENT_ALLOC)
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 1);
-			else
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 0);
-		}
 	}
 #ifdef RTE_ARCH_PPC_64
 	cur_len = 0;
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v2 2/3] vfio: fix DMA mapping granularity for type1 iova as va
  2020-11-05  9:04 ` [dpdk-dev] [PATCH v2 0/3] fix issue with partial DMA unmap Nithin Dabilpuram
  2020-11-05  9:04   ` [dpdk-dev] [PATCH v2 1/3] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
@ 2020-11-05  9:04   ` Nithin Dabilpuram
  2020-11-10 14:04     ` Burakov, Anatoly
  2020-11-10 14:17     ` Burakov, Anatoly
  2020-11-05  9:04   ` [dpdk-dev] [PATCH v2 3/3] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
  2 siblings, 2 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-11-05  9:04 UTC (permalink / raw)
  To: anatoly.burakov; +Cc: jerinj, dev, Nithin Dabilpuram, stable

Partial unmapping is not supported for VFIO IOMMU type1
by the kernel. Though the kernel returns zero, the unmapped size
reported will not be the same as requested. So check the
returned unmap size and return an error.

For IOVA as PA, DMA mapping is already done at memseg size
granularity. Do the same for IOVA as VA mode: for
DMA map/unmap triggered by heap allocations,
maintain memseg page size granularity so that heap
expansion and contraction do not hit this issue.

For user-requested DMA map/unmap, disallow partial unmapping
for VFIO type1.

Fixes: 73a639085938 ("vfio: allow to map other memory regions")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
---
 lib/librte_eal/linux/eal_vfio.c | 34 ++++++++++++++++++++++++++++------
 lib/librte_eal/linux/eal_vfio.h |  1 +
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index dbefcba..b4f9c33 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -69,6 +69,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_TYPE1,
 		.name = "Type 1",
+		.partial_unmap = false,
 		.dma_map_func = &vfio_type1_dma_map,
 		.dma_user_map_func = &vfio_type1_dma_mem_map
 	},
@@ -76,6 +77,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_SPAPR,
 		.name = "sPAPR",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_spapr_dma_map,
 		.dma_user_map_func = &vfio_spapr_dma_mem_map
 	},
@@ -83,6 +85,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_NOIOMMU,
 		.name = "No-IOMMU",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_noiommu_dma_map,
 		.dma_user_map_func = &vfio_noiommu_dma_mem_map
 	},
@@ -525,12 +528,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 	/* for IOVA as VA mode, no need to care for IOVA addresses */
 	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
 		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
-		if (type == RTE_MEM_EVENT_ALLOC)
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 1);
-		else
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 0);
+		uint64_t page_sz = msl->page_sz;
+
+		/* Maintain granularity of DMA map/unmap to memseg size */
+		for (; cur_len < len; cur_len += page_sz) {
+			if (type == RTE_MEM_EVENT_ALLOC)
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 1);
+			else
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 0);
+			vfio_va += page_sz;
+		}
+
 		return;
 	}
 
@@ -1369,6 +1379,12 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
 					errno, strerror(errno));
 			return -1;
+		} else if (dma_unmap.size != len) {
+			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
+				"remapping cleared instead of %"PRIu64"\n",
+				(uint64_t)dma_unmap.size, len);
+			rte_errno = EIO;
+			return -1;
 		}
 	}
 
@@ -1839,6 +1855,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
 		/* we're partially unmapping a previously mapped region, so we
 		 * need to split entry into two.
 		 */
+		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
+			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
+			rte_errno = ENOTSUP;
+			ret = -1;
+			goto out;
+		}
 		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
 			rte_errno = ENOMEM;
diff --git a/lib/librte_eal/linux/eal_vfio.h b/lib/librte_eal/linux/eal_vfio.h
index cb2d35f..6ebaca6 100644
--- a/lib/librte_eal/linux/eal_vfio.h
+++ b/lib/librte_eal/linux/eal_vfio.h
@@ -113,6 +113,7 @@ typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova,
 struct vfio_iommu_type {
 	int type_id;
 	const char *name;
+	bool partial_unmap;
 	vfio_dma_user_func_t dma_user_map_func;
 	vfio_dma_func_t dma_map_func;
 };
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v2 3/3] test: add test case to validate VFIO DMA map/unmap
  2020-11-05  9:04 ` [dpdk-dev] [PATCH v2 0/3] fix issue with partial DMA unmap Nithin Dabilpuram
  2020-11-05  9:04   ` [dpdk-dev] [PATCH v2 1/3] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
  2020-11-05  9:04   ` [dpdk-dev] [PATCH v2 2/3] vfio: fix DMA mapping granularity for type1 iova as va Nithin Dabilpuram
@ 2020-11-05  9:04   ` Nithin Dabilpuram
  2 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-11-05  9:04 UTC (permalink / raw)
  To: anatoly.burakov; +Cc: jerinj, dev, Nithin Dabilpuram

The test case mmaps system pages and tries to perform a user
DMA map and unmap, both partially and fully.

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
---
 app/test/meson.build |   1 +
 app/test/test_vfio.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 104 insertions(+)
 create mode 100644 app/test/test_vfio.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 88c831a..b0411ee 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -139,6 +139,7 @@ test_sources = files('commands.c',
 	'test_trace_register.c',
 	'test_trace_perf.c',
 	'test_version.c',
+	'test_vfio.c',
 	'virtual_pmd.c'
 )
 
diff --git a/app/test/test_vfio.c b/app/test/test_vfio.c
new file mode 100644
index 0000000..00626d4
--- /dev/null
+++ b/app/test/test_vfio.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2020 Marvell.
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_common.h>
+#include <rte_eal.h>
+#include <rte_eal_paging.h>
+#include <rte_errno.h>
+#include <rte_memory.h>
+#include <rte_vfio.h>
+
+#include "test.h"
+
+static int
+test_memory_vfio_dma_map(void)
+{
+	uint64_t sz1, sz2, sz = 2 * rte_mem_page_size();
+	uint64_t unmap1, unmap2;
+	uint8_t *mem;
+	int ret;
+
+	/* Allocate twice size of page */
+	mem = mmap(NULL, sz, PROT_READ | PROT_WRITE,
+		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (mem == MAP_FAILED) {
+		printf("Failed to allocate memory for external heap\n");
+		return -1;
+	}
+
+	/* Force page allocation */
+	memset(mem, 0, sz);
+
+	/* map the whole region */
+	ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					 (uintptr_t)mem, (rte_iova_t)mem, sz);
+	if (ret) {
+		/* Check if VFIO is not available or no device is probed */
+		if (rte_errno == ENOTSUP || rte_errno == ENODEV) {
+			ret = 1;
+			goto fail;
+		}
+		printf("Failed to dma map whole region, ret=%d(%s)\n",
+		       ret, rte_strerror(rte_errno));
+		goto fail;
+	}
+
+	unmap1 = (uint64_t)mem + (sz / 2);
+	sz1 = sz / 2;
+	unmap2 = (uint64_t)mem;
+	sz2 = sz / 2;
+	/* unmap the partial region */
+	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					   unmap1, (rte_iova_t)unmap1, sz1);
+	if (ret) {
+		if (rte_errno == ENOTSUP) {
+			printf("Partial dma unmap not supported\n");
+			unmap2 = (uint64_t)mem;
+			sz2 = sz;
+		} else {
+			printf("Failed to unmap second half region, ret=%d(%s)\n",
+			       ret, rte_strerror(rte_errno));
+			goto fail;
+		}
+	}
+
+	/* unmap the remaining region */
+	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					   unmap2, (rte_iova_t)unmap2, sz2);
+	if (ret) {
+		printf("Failed to unmap remaining region, ret=%d(%s)\n", ret,
+		       rte_strerror(rte_errno));
+		goto fail;
+	}
+
+fail:
+	munmap(mem, sz);
+	return ret;
+}
+
+static int
+test_vfio(void)
+{
+	int ret;
+
+	/* test for vfio dma map/unmap */
+	ret = test_memory_vfio_dma_map();
+	if (ret == 1) {
+		printf("VFIO dma map/unmap unsupported\n");
+	} else if (ret < 0) {
+		printf("Error vfio dma map/unmap, ret=%d\n", ret);
+		return -1;
+	}
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(vfio_autotest, test_vfio);
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/3] vfio: fix DMA mapping granularity for type1 iova as va
  2020-11-05  9:04   ` [dpdk-dev] [PATCH v2 2/3] vfio: fix DMA mapping granularity for type1 iova as va Nithin Dabilpuram
@ 2020-11-10 14:04     ` Burakov, Anatoly
  2020-11-10 14:22       ` Burakov, Anatoly
  2020-11-10 14:17     ` Burakov, Anatoly
  1 sibling, 1 reply; 76+ messages in thread
From: Burakov, Anatoly @ 2020-11-10 14:04 UTC (permalink / raw)
  To: Nithin Dabilpuram; +Cc: jerinj, dev, stable

On 05-Nov-20 9:04 AM, Nithin Dabilpuram wrote:
> Partial unmapping is not supported for VFIO IOMMU type1
> by kernel. Though kernel gives return as zero, the unmapped size
> returned will not be same as expected. So check for
> returned unmap size and return error.
> 
> For IOVA as PA, DMA mapping is already at memseg size
> granularity. Do the same even for IOVA as VA mode as
> DMA map/unmap triggered by heap allocations,
> maintain granularity of memseg page size so that heap
> expansion and contraction does not have this issue.
> 
> For user requested DMA map/unmap disallow partial unmapping
> for VFIO type1.
> 
> Fixes: 73a639085938 ("vfio: allow to map other memory regions")
> Cc: anatoly.burakov@intel.com
> Cc: stable@dpdk.org
> 
> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> ---

Maybe i just didn't have enough coffee today, but i still don't see why 
this "partial unmap" thing exists.

We are already mapping the addresses page-by-page, so surely "partial" 
unmaps can't even exist in the first place?

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/3] vfio: fix DMA mapping granularity for type1 iova as va
  2020-11-05  9:04   ` [dpdk-dev] [PATCH v2 2/3] vfio: fix DMA mapping granularity for type1 iova as va Nithin Dabilpuram
  2020-11-10 14:04     ` Burakov, Anatoly
@ 2020-11-10 14:17     ` Burakov, Anatoly
  2020-11-11  5:08       ` Nithin Dabilpuram
  1 sibling, 1 reply; 76+ messages in thread
From: Burakov, Anatoly @ 2020-11-10 14:17 UTC (permalink / raw)
  To: Nithin Dabilpuram; +Cc: jerinj, dev, stable

On 05-Nov-20 9:04 AM, Nithin Dabilpuram wrote:
> Partial unmapping is not supported for VFIO IOMMU type1
> by kernel. Though kernel gives return as zero, the unmapped size
> returned will not be same as expected. So check for
> returned unmap size and return error.
> 
> For IOVA as PA, DMA mapping is already at memseg size
> granularity. Do the same even for IOVA as VA mode as
> DMA map/unmap triggered by heap allocations,
> maintain granularity of memseg page size so that heap
> expansion and contraction does not have this issue.
> 
> For user requested DMA map/unmap disallow partial unmapping
> for VFIO type1.
> 
> Fixes: 73a639085938 ("vfio: allow to map other memory regions")
> Cc: anatoly.burakov@intel.com
> Cc: stable@dpdk.org
> 
> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> ---

<snip>

> @@ -525,12 +528,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
>   	/* for IOVA as VA mode, no need to care for IOVA addresses */
>   	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
>   		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
> -		if (type == RTE_MEM_EVENT_ALLOC)
> -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
> -					len, 1);
> -		else
> -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
> -					len, 0);
> +		uint64_t page_sz = msl->page_sz;
> +
> +		/* Maintain granularity of DMA map/unmap to memseg size */
> +		for (; cur_len < len; cur_len += page_sz) {
> +			if (type == RTE_MEM_EVENT_ALLOC)
> +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
> +						 vfio_va, page_sz, 1);
> +			else
> +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
> +						 vfio_va, page_sz, 0);

I think you're mapping the same address here, over and over. Perhaps you 
meant `vfio_va + cur_len` for the mapping addresses?

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/3] vfio: fix DMA mapping granularity for type1 iova as va
  2020-11-10 14:04     ` Burakov, Anatoly
@ 2020-11-10 14:22       ` Burakov, Anatoly
  0 siblings, 0 replies; 76+ messages in thread
From: Burakov, Anatoly @ 2020-11-10 14:22 UTC (permalink / raw)
  To: Nithin Dabilpuram; +Cc: jerinj, dev, stable

On 10-Nov-20 2:04 PM, Burakov, Anatoly wrote:
> On 05-Nov-20 9:04 AM, Nithin Dabilpuram wrote:
>> Partial unmapping is not supported for VFIO IOMMU type1
>> by kernel. Though kernel gives return as zero, the unmapped size
>> returned will not be same as expected. So check for
>> returned unmap size and return error.
>>
>> For IOVA as PA, DMA mapping is already at memseg size
>> granularity. Do the same even for IOVA as VA mode as
>> DMA map/unmap triggered by heap allocations,
>> maintain granularity of memseg page size so that heap
>> expansion and contraction does not have this issue.
>>
>> For user requested DMA map/unmap disallow partial unmapping
>> for VFIO type1.
>>
>> Fixes: 73a639085938 ("vfio: allow to map other memory regions")
>> Cc: anatoly.burakov@intel.com
>> Cc: stable@dpdk.org
>>
>> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
>> ---
> 
> Maybe i just didn't have enough coffee today, but i still don't see why 
> this "partial unmap" thing exists.

Oh, right, this is for *user* mapped memory. Disregard this email.

> 
> We are already mapping the addresses page-by-page, so surely "partial" 
> unmaps can't even exist in the first place?
> 


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/3] vfio: fix DMA mapping granularity for type1 iova as va
  2020-11-10 14:17     ` Burakov, Anatoly
@ 2020-11-11  5:08       ` Nithin Dabilpuram
  2020-11-11 10:00         ` Burakov, Anatoly
  0 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-11-11  5:08 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: jerinj, dev, stable

On Tue, Nov 10, 2020 at 02:17:39PM +0000, Burakov, Anatoly wrote:
> On 05-Nov-20 9:04 AM, Nithin Dabilpuram wrote:
> > Partial unmapping is not supported for VFIO IOMMU type1
> > by kernel. Though kernel gives return as zero, the unmapped size
> > returned will not be same as expected. So check for
> > returned unmap size and return error.
> > 
> > For IOVA as PA, DMA mapping is already at memseg size
> > granularity. Do the same even for IOVA as VA mode as
> > DMA map/unmap triggered by heap allocations,
> > maintain granularity of memseg page size so that heap
> > expansion and contraction does not have this issue.
> > 
> > For user requested DMA map/unmap disallow partial unmapping
> > for VFIO type1.
> > 
> > Fixes: 73a639085938 ("vfio: allow to map other memory regions")
> > Cc: anatoly.burakov@intel.com
> > Cc: stable@dpdk.org
> > 
> > Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> > ---
> 
> <snip>
> 
> > @@ -525,12 +528,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
> >   	/* for IOVA as VA mode, no need to care for IOVA addresses */
> >   	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
> >   		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
> > -		if (type == RTE_MEM_EVENT_ALLOC)
> > -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
> > -					len, 1);
> > -		else
> > -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
> > -					len, 0);
> > +		uint64_t page_sz = msl->page_sz;
> > +
> > +		/* Maintain granularity of DMA map/unmap to memseg size */
> > +		for (; cur_len < len; cur_len += page_sz) {
> > +			if (type == RTE_MEM_EVENT_ALLOC)
> > +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
> > +						 vfio_va, page_sz, 1);
> > +			else
> > +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
> > +						 vfio_va, page_sz, 0);
> 
> I think you're mapping the same address here, over and over. Perhaps you
> meant `vfio_va + cur_len` for the mapping addresses?

There is a 'vfio_va += page_sz;' in next line right ?
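For clarity, the per-page loop under discussion behaves like this minimal stand-alone sketch (not the DPDK code itself; record_map() is a stand-in for vfio_dma_mem_map(), recording each address that would be mapped):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for vfio_dma_mem_map(): record each VA that would be mapped. */
static uint64_t mapped[8];
static size_t n_mapped;

static void
record_map(uint64_t vaddr, uint64_t len)
{
	(void)len;
	mapped[n_mapped++] = vaddr;
}

/* Sketch of the loop in vfio_mem_event_callback(): one map call per
 * memseg page, with vfio_va advancing by page_sz each iteration.
 */
static void
map_per_page(uint64_t vfio_va, uint64_t len, uint64_t page_sz)
{
	uint64_t cur_len;

	for (cur_len = 0; cur_len < len; cur_len += page_sz) {
		record_map(vfio_va, page_sz);
		vfio_va += page_sz;	/* the advance in question */
	}
}
```

With the advance in place, three pages starting at 0x1000 are mapped at 0x1000, 0x2000, and 0x3000 rather than three times at the same address.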
> 
> -- 
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/3] vfio: fix DMA mapping granularity for type1 iova as va
  2020-11-11  5:08       ` Nithin Dabilpuram
@ 2020-11-11 10:00         ` Burakov, Anatoly
  0 siblings, 0 replies; 76+ messages in thread
From: Burakov, Anatoly @ 2020-11-11 10:00 UTC (permalink / raw)
  To: Nithin Dabilpuram; +Cc: jerinj, dev, stable

On 11-Nov-20 5:08 AM, Nithin Dabilpuram wrote:
> On Tue, Nov 10, 2020 at 02:17:39PM +0000, Burakov, Anatoly wrote:
>> On 05-Nov-20 9:04 AM, Nithin Dabilpuram wrote:
>>> Partial unmapping is not supported for VFIO IOMMU type1
>>> by kernel. Though kernel gives return as zero, the unmapped size
>>> returned will not be same as expected. So check for
>>> returned unmap size and return error.
>>>
>>> For IOVA as PA, DMA mapping is already at memseg size
>>> granularity. Do the same even for IOVA as VA mode as
>>> DMA map/unmap triggered by heap allocations,
>>> maintain granularity of memseg page size so that heap
>>> expansion and contraction does not have this issue.
>>>
>>> For user requested DMA map/unmap disallow partial unmapping
>>> for VFIO type1.
>>>
>>> Fixes: 73a639085938 ("vfio: allow to map other memory regions")
>>> Cc: anatoly.burakov@intel.com
>>> Cc: stable@dpdk.org
>>>
>>> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
>>> ---
>>
>> <snip>
>>
>>> @@ -525,12 +528,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
>>>    	/* for IOVA as VA mode, no need to care for IOVA addresses */
>>>    	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
>>>    		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
>>> -		if (type == RTE_MEM_EVENT_ALLOC)
>>> -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
>>> -					len, 1);
>>> -		else
>>> -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
>>> -					len, 0);
>>> +		uint64_t page_sz = msl->page_sz;
>>> +
>>> +		/* Maintain granularity of DMA map/unmap to memseg size */
>>> +		for (; cur_len < len; cur_len += page_sz) {
>>> +			if (type == RTE_MEM_EVENT_ALLOC)
>>> +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
>>> +						 vfio_va, page_sz, 1);
>>> +			else
>>> +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
>>> +						 vfio_va, page_sz, 0);
>>
>> I think you're mapping the same address here, over and over. Perhaps you
>> meant `vfio_va + cur_len` for the mapping addresses?
> 
> There is a 'vfio_va += page_sz;' in next line right ?
>>
>> -- 
>> Thanks,
>> Anatoly

Oh, right, my apologies. I did need more coffee :D

Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v3 0/4] fix issue with partial DMA unmap
  2020-10-12  8:11 [dpdk-dev] [PATCH 0/2] fix issue with partial DMA unmap Nithin Dabilpuram
                   ` (2 preceding siblings ...)
  2020-11-05  9:04 ` [dpdk-dev] [PATCH v2 0/3] fix issue with partial DMA unmap Nithin Dabilpuram
@ 2020-12-01 19:32 ` Nithin Dabilpuram
  2020-12-01 19:32   ` [dpdk-dev] [PATCH v3 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
                     ` (3 more replies)
  2020-12-02  5:46 ` [dpdk-dev] [PATCH v4 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
                   ` (4 subsequent siblings)
  8 siblings, 4 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-01 19:32 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram

Partial DMA unmap is not supported by the VFIO type1 IOMMU
in Linux. Though the return value is zero, the returned
DMA unmap size is not the same as the expected size.
So add a test case and fixes for both heap-triggered DMA
mapping and user-triggered DMA mapping/unmapping.

Refer to vfio_dma_do_unmap() in drivers/vfio/vfio_iommu_type1.c.
A snippet of the comment is below.

        /*
         * vfio-iommu-type1 (v1) - User mappings were coalesced together to
         * avoid tracking individual mappings.  This means that the granularity
         * of the original mapping was lost and the user was allowed to attempt
         * to unmap any range.  Depending on the contiguousness of physical
         * memory and page sizes supported by the IOMMU, arbitrary unmaps may
         * or may not have worked.  We only guaranteed unmap granularity
         * matching the original mapping; even though it was untracked here,
         * the original mappings are reflected in IOMMU mappings.  This
         * resulted in a couple unusual behaviors.  First, if a range is not
         * able to be unmapped, ex. a set of 4k pages that was mapped as a
         * 2M hugepage into the IOMMU, the unmap ioctl returns success but with
         * a zero sized unmap.  Also, if an unmap request overlaps the first
         * address of a hugepage, the IOMMU will unmap the entire hugepage.
         * This also returns success and the returned unmap size reflects the
         * actual size unmapped.

         * We attempt to maintain compatibility with this "v1" interface, but  
         * we take control out of the hands of the IOMMU.  Therefore, an unmap 
         * request offset from the beginning of the original mapping will      
         * return success with zero sized unmap.  And an unmap request covering
         * the first iova of mapping will unmap the entire range.              

This behavior can be verified by applying the first patch and adding a
return check for dma_unmap.size != len in vfio_type1_dma_mem_map().
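That size check can be sketched in isolation as follows (a hypothetical helper, not the actual DPDK function; the real patch compares dma_unmap.size against len right after the VFIO_IOMMU_UNMAP_DMA ioctl and sets rte_errno = EIO):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper: VFIO type1 reports success on a partial unmap but
 * returns a smaller-than-requested size, so treat any size mismatch as a
 * failure (mirrors the dma_unmap.size != len check in the patch).
 */
static int
check_unmap_size(uint64_t requested, uint64_t reported)
{
	if (reported != requested) {
		errno = EIO;	/* the patch sets rte_errno = EIO here */
		return -1;
	}
	return 0;
}
```

A zero-sized "successful" unmap of a hugepage-backed range is thus surfaced as an error instead of being silently ignored.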

v3:
- Fixed external memory test case (4/4) to use system page size
  instead of 4K.
- Fixed check-git-log.sh issue and rebased.
- Added Acked-by from anatoly.burakov@intel.com to the first 3 patches.

v2:
- Reverted the earlier commit that enables merging contiguous mappings
  for IOVA as PA (see 1/3).
- Updated documentation about kernel DMA mapping limits and the VFIO
  module parameter.
- Moved the VFIO test to test_vfio.c and handled comments from
  Anatoly.
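As an aside, the VFIO module parameter covered by the documentation update can be raised as below (a config sketch; 131072 is an arbitrary example value, and runtime writability of the parameter depends on the kernel build):

```shell
# Raise the per-container DMA entry limit (kernel >= 5.1); default is 64K.
modprobe vfio_iommu_type1 dma_entry_limit=131072

# Or adjust at runtime, if the parameter is writable on this kernel:
echo 131072 > /sys/module/vfio_iommu_type1/parameters/dma_entry_limit
```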

Nithin Dabilpuram (4):
  vfio: revert changes for map contiguous areas in one go
  vfio: fix DMA mapping granularity for type1 IOVA as VA
  test: add test case to validate VFIO DMA map/unmap
  test: change external memory test to use system page sz

 app/test/meson.build                   |   1 +
 app/test/test_external_mem.c           |   2 +-
 app/test/test_vfio.c                   | 103 +++++++++++++++++++++++++++++++++
 doc/guides/linux_gsg/linux_drivers.rst |  10 ++++
 lib/librte_eal/linux/eal_vfio.c        |  93 ++++++++++++-----------------
 lib/librte_eal/linux/eal_vfio.h        |   1 +
 6 files changed, 152 insertions(+), 58 deletions(-)
 create mode 100644 app/test/test_vfio.c

-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v3 1/4] vfio: revert changes for map contiguous areas in one go
  2020-12-01 19:32 ` [dpdk-dev] [PATCH v3 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
@ 2020-12-01 19:32   ` Nithin Dabilpuram
  2020-12-01 19:33   ` [dpdk-dev] [PATCH v3 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-01 19:32 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram, stable

In order to save DMA entries limited by the kernel, both for external
memory and hugepage memory, an attempt was made to map physically
contiguous memory in one go. This cannot be done, as VFIO IOMMU type1
does not support partially unmapping a previously mapped memory
region, while the heap can request multi-page mapping and
partial unmapping.
Hence, to go back to the old method of mapping/unmapping at
memseg granularity, this commit reverts
commit d1c7c0cdf7ba ("vfio: map contiguous areas in one go")

Also add documentation on the module parameter that needs to be used
to increase the per-container DMA map limit for VFIO.

Fixes: d1c7c0cdf7ba ("vfio: map contiguous areas in one go")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 doc/guides/linux_gsg/linux_drivers.rst | 10 ++++++
 lib/librte_eal/linux/eal_vfio.c        | 59 +++++-----------------------------
 2 files changed, 18 insertions(+), 51 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_drivers.rst b/doc/guides/linux_gsg/linux_drivers.rst
index 90635a4..9a662a7 100644
--- a/doc/guides/linux_gsg/linux_drivers.rst
+++ b/doc/guides/linux_gsg/linux_drivers.rst
@@ -25,6 +25,16 @@ To make use of VFIO, the ``vfio-pci`` module must be loaded:
 VFIO kernel is usually present by default in all distributions,
 however please consult your distributions documentation to make sure that is the case.
 
+For DMA mapping of either external memory or hugepages, VFIO interface is used.
+VFIO does not support partial unmap of once mapped memory. Hence DPDK's memory is
+mapped in hugepage granularity or system page granularity. Number of DMA
+mappings is limited by kernel with user locked memory limit of a process(rlimit)
+for system/hugepage memory. Another per-container overall limit applicable both
+for external memory and system memory was added in kernel 5.1 defined by
+VFIO module parameter ``dma_entry_limit`` with a default value of 64K.
+When application is out of DMA entries, these limits need to be adjusted to
+increase the allowed limit.
+
 Since Linux version 5.7,
 the ``vfio-pci`` module supports the creation of virtual functions.
 After the PF is bound to ``vfio-pci`` module,
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 0500824..64b134d 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -517,11 +517,9 @@ static void
 vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		void *arg __rte_unused)
 {
-	rte_iova_t iova_start, iova_expected;
 	struct rte_memseg_list *msl;
 	struct rte_memseg *ms;
 	size_t cur_len = 0;
-	uint64_t va_start;
 
 	msl = rte_mem_virt2memseg_list(addr);
 
@@ -539,63 +537,22 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
-
-	/*
-	 * This memory is not guaranteed to be contiguous, but it still could
-	 * be, or it could have some small contiguous chunks. Since the number
-	 * of VFIO mappings is limited, and VFIO appears to not concatenate
-	 * adjacent mappings, we have to do this ourselves.
-	 *
-	 * So, find contiguous chunks, then map them.
-	 */
-	va_start = ms->addr_64;
-	iova_start = iova_expected = ms->iova;
 	while (cur_len < len) {
-		bool new_contig_area = ms->iova != iova_expected;
-		bool last_seg = (len - cur_len) == ms->len;
-		bool skip_last = false;
-
-		/* only do mappings when current contiguous area ends */
-		if (new_contig_area) {
-			if (type == RTE_MEM_EVENT_ALLOC)
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 1);
-			else
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 0);
-			va_start = ms->addr_64;
-			iova_start = ms->iova;
-		}
 		/* some memory segments may have invalid IOVA */
 		if (ms->iova == RTE_BAD_IOVA) {
 			RTE_LOG(DEBUG, EAL, "Memory segment at %p has bad IOVA, skipping\n",
 					ms->addr);
-			skip_last = true;
+			goto next;
 		}
-		iova_expected = ms->iova + ms->len;
+		if (type == RTE_MEM_EVENT_ALLOC)
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 1);
+		else
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 0);
+next:
 		cur_len += ms->len;
 		++ms;
-
-		/*
-		 * don't count previous segment, and don't attempt to
-		 * dereference a potentially invalid pointer.
-		 */
-		if (skip_last && !last_seg) {
-			iova_expected = iova_start = ms->iova;
-			va_start = ms->addr_64;
-		} else if (!skip_last && last_seg) {
-			/* this is the last segment and we're not skipping */
-			if (type == RTE_MEM_EVENT_ALLOC)
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 1);
-			else
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 0);
-		}
 	}
 }
 
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v3 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA
  2020-12-01 19:32 ` [dpdk-dev] [PATCH v3 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
  2020-12-01 19:32   ` [dpdk-dev] [PATCH v3 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
@ 2020-12-01 19:33   ` Nithin Dabilpuram
  2020-12-01 19:33   ` [dpdk-dev] [PATCH v3 3/4] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
  2020-12-01 19:33   ` [dpdk-dev] [PATCH v3 4/4] test: change external memory test to use system page sz Nithin Dabilpuram
  3 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-01 19:33 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram, stable

Partial unmapping is not supported for VFIO IOMMU type1
by the kernel. Though the kernel returns zero, the unmapped size
returned will not be the same as expected. So check the
returned unmap size and return an error.

For IOVA as PA, DMA mapping is already at memseg size
granularity. Do the same even for IOVA as VA mode: as
DMA map/unmap is triggered by heap allocations,
maintain the granularity of the memseg page size so that heap
expansion and contraction do not have this issue.

For user-requested DMA map/unmap, disallow partial unmapping
for VFIO type1.

Fixes: 73a639085938 ("vfio: allow to map other memory regions")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/linux/eal_vfio.c | 34 ++++++++++++++++++++++++++++------
 lib/librte_eal/linux/eal_vfio.h |  1 +
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 64b134d..b15b758 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -70,6 +70,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_TYPE1,
 		.name = "Type 1",
+		.partial_unmap = false,
 		.dma_map_func = &vfio_type1_dma_map,
 		.dma_user_map_func = &vfio_type1_dma_mem_map
 	},
@@ -77,6 +78,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_SPAPR,
 		.name = "sPAPR",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_spapr_dma_map,
 		.dma_user_map_func = &vfio_spapr_dma_mem_map
 	},
@@ -84,6 +86,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_NOIOMMU,
 		.name = "No-IOMMU",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_noiommu_dma_map,
 		.dma_user_map_func = &vfio_noiommu_dma_mem_map
 	},
@@ -526,12 +529,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 	/* for IOVA as VA mode, no need to care for IOVA addresses */
 	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
 		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
-		if (type == RTE_MEM_EVENT_ALLOC)
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 1);
-		else
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 0);
+		uint64_t page_sz = msl->page_sz;
+
+		/* Maintain granularity of DMA map/unmap to memseg size */
+		for (; cur_len < len; cur_len += page_sz) {
+			if (type == RTE_MEM_EVENT_ALLOC)
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 1);
+			else
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 0);
+			vfio_va += page_sz;
+		}
+
 		return;
 	}
 
@@ -1348,6 +1358,12 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
 					errno, strerror(errno));
 			return -1;
+		} else if (dma_unmap.size != len) {
+			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
+				"remapping cleared instead of %"PRIu64"\n",
+				(uint64_t)dma_unmap.size, len);
+			rte_errno = EIO;
+			return -1;
 		}
 	}
 
@@ -1823,6 +1839,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
 		/* we're partially unmapping a previously mapped region, so we
 		 * need to split entry into two.
 		 */
+		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
+			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
+			rte_errno = ENOTSUP;
+			ret = -1;
+			goto out;
+		}
 		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
 			rte_errno = ENOMEM;
diff --git a/lib/librte_eal/linux/eal_vfio.h b/lib/librte_eal/linux/eal_vfio.h
index cb2d35f..6ebaca6 100644
--- a/lib/librte_eal/linux/eal_vfio.h
+++ b/lib/librte_eal/linux/eal_vfio.h
@@ -113,6 +113,7 @@ typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova,
 struct vfio_iommu_type {
 	int type_id;
 	const char *name;
+	bool partial_unmap;
 	vfio_dma_user_func_t dma_user_map_func;
 	vfio_dma_func_t dma_map_func;
 };
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v3 3/4] test: add test case to validate VFIO DMA map/unmap
  2020-12-01 19:32 ` [dpdk-dev] [PATCH v3 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
  2020-12-01 19:32   ` [dpdk-dev] [PATCH v3 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
  2020-12-01 19:33   ` [dpdk-dev] [PATCH v3 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
@ 2020-12-01 19:33   ` Nithin Dabilpuram
  2020-12-01 19:33   ` [dpdk-dev] [PATCH v3 4/4] test: change external memory test to use system page sz Nithin Dabilpuram
  3 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-01 19:33 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram

Test case mmap()s system pages and tries to perform a user
DMA map and unmap, both partially and fully.

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 app/test/meson.build |   1 +
 app/test/test_vfio.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 104 insertions(+)
 create mode 100644 app/test/test_vfio.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 94fd39f..d9eedb6 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -139,6 +139,7 @@ test_sources = files('commands.c',
 	'test_trace_register.c',
 	'test_trace_perf.c',
 	'test_version.c',
+	'test_vfio.c',
 	'virtual_pmd.c'
 )
 
diff --git a/app/test/test_vfio.c b/app/test/test_vfio.c
new file mode 100644
index 0000000..00626d4
--- /dev/null
+++ b/app/test/test_vfio.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2020 Marvell.
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_common.h>
+#include <rte_eal.h>
+#include <rte_eal_paging.h>
+#include <rte_errno.h>
+#include <rte_memory.h>
+#include <rte_vfio.h>
+
+#include "test.h"
+
+static int
+test_memory_vfio_dma_map(void)
+{
+	uint64_t sz1, sz2, sz = 2 * rte_mem_page_size();
+	uint64_t unmap1, unmap2;
+	uint8_t *mem;
+	int ret;
+
+	/* Allocate twice size of page */
+	mem = mmap(NULL, sz, PROT_READ | PROT_WRITE,
+		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (mem == MAP_FAILED) {
+		printf("Failed to allocate memory for external heap\n");
+		return -1;
+	}
+
+	/* Force page allocation */
+	memset(mem, 0, sz);
+
+	/* map the whole region */
+	ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					 (uintptr_t)mem, (rte_iova_t)mem, sz);
+	if (ret) {
+		/* Check if VFIO is not available or no device is probed */
+		if (rte_errno == ENOTSUP || rte_errno == ENODEV) {
+			ret = 1;
+			goto fail;
+		}
+		printf("Failed to dma map whole region, ret=%d(%s)\n",
+		       ret, rte_strerror(rte_errno));
+		goto fail;
+	}
+
+	unmap1 = (uint64_t)mem + (sz / 2);
+	sz1 = sz / 2;
+	unmap2 = (uint64_t)mem;
+	sz2 = sz / 2;
+	/* unmap the partial region */
+	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					   unmap1, (rte_iova_t)unmap1, sz1);
+	if (ret) {
+		if (rte_errno == ENOTSUP) {
+			printf("Partial dma unmap not supported\n");
+			unmap2 = (uint64_t)mem;
+			sz2 = sz;
+		} else {
+			printf("Failed to unmap second half region, ret=%d(%s)\n",
+			       ret, rte_strerror(rte_errno));
+			goto fail;
+		}
+	}
+
+	/* unmap the remaining region */
+	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					   unmap2, (rte_iova_t)unmap2, sz2);
+	if (ret) {
+		printf("Failed to unmap remaining region, ret=%d(%s)\n", ret,
+		       rte_strerror(rte_errno));
+		goto fail;
+	}
+
+fail:
+	munmap(mem, sz);
+	return ret;
+}
+
+static int
+test_vfio(void)
+{
+	int ret;
+
+	/* test for vfio dma map/unmap */
+	ret = test_memory_vfio_dma_map();
+	if (ret == 1) {
+		printf("VFIO dma map/unmap unsupported\n");
+	} else if (ret < 0) {
+		printf("Error vfio dma map/unmap, ret=%d\n", ret);
+		return -1;
+	}
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(vfio_autotest, test_vfio);
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v3 4/4] test: change external memory test to use system page sz
  2020-12-01 19:32 ` [dpdk-dev] [PATCH v3 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
                     ` (2 preceding siblings ...)
  2020-12-01 19:33   ` [dpdk-dev] [PATCH v3 3/4] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
@ 2020-12-01 19:33   ` Nithin Dabilpuram
  2020-12-01 23:23     ` David Christensen
  3 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-01 19:33 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram

Currently, the external memory test uses a 4K page size, while
VFIO DMA mapping works only at system page granularity.

Earlier this worked because all the contiguous mappings
were coalesced and mapped in one go, which effectively produced
a much bigger page. Now that VFIO DMA mappings, in both IOVA as VA
and IOVA as PA modes, are done at memseg list granularity,
we need to use the system page size.

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
---
 app/test/test_external_mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/app/test/test_external_mem.c b/app/test/test_external_mem.c
index 7eb81f6..67690c6 100644
--- a/app/test/test_external_mem.c
+++ b/app/test/test_external_mem.c
@@ -532,8 +532,8 @@ test_extmem_basic(void *addr, size_t len, size_t pgsz, rte_iova_t *iova,
 static int
 test_external_mem(void)
 {
+	size_t pgsz = rte_mem_page_size();
 	size_t len = EXTERNAL_MEM_SZ;
-	size_t pgsz = RTE_PGSIZE_4K;
 	rte_iova_t iova[len / pgsz];
 	void *addr;
 	int ret, n_pages;
-- 
2.8.4



* Re: [dpdk-dev] [PATCH v3 4/4] test: change external memory test to use system page sz
  2020-12-01 19:33   ` [dpdk-dev] [PATCH v3 4/4] test: change external memory test to use system page sz Nithin Dabilpuram
@ 2020-12-01 23:23     ` David Christensen
  2020-12-02  5:40       ` Nithin Dabilpuram
  0 siblings, 1 reply; 76+ messages in thread
From: David Christensen @ 2020-12-01 23:23 UTC (permalink / raw)
  To: Nithin Dabilpuram, anatoly.burakov, david.marchand; +Cc: jerinj, dev

> diff --git a/app/test/test_external_mem.c b/app/test/test_external_mem.c
> index 7eb81f6..67690c6 100644
> --- a/app/test/test_external_mem.c
> +++ b/app/test/test_external_mem.c
> @@ -532,8 +532,8 @@ test_extmem_basic(void *addr, size_t len, size_t pgsz, rte_iova_t *iova,
>   static int
>   test_external_mem(void)
>   {
> +	size_t pgsz = rte_mem_page_size();

I'm seeing a build warning with this code.  Looks like you need:

#include <rte_eal_paging.h>

Dave


* Re: [dpdk-dev] [PATCH v3 4/4] test: change external memory test to use system page sz
  2020-12-01 23:23     ` David Christensen
@ 2020-12-02  5:40       ` Nithin Dabilpuram
  0 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-02  5:40 UTC (permalink / raw)
  To: David Christensen; +Cc: anatoly.burakov, david.marchand, jerinj, dev

On Tue, Dec 01, 2020 at 03:23:39PM -0800, David Christensen wrote:
> > diff --git a/app/test/test_external_mem.c b/app/test/test_external_mem.c
> > index 7eb81f6..67690c6 100644
> > --- a/app/test/test_external_mem.c
> > +++ b/app/test/test_external_mem.c
> > @@ -532,8 +532,8 @@ test_extmem_basic(void *addr, size_t len, size_t pgsz, rte_iova_t *iova,
> >   static int
> >   test_external_mem(void)
> >   {
> > +	size_t pgsz = rte_mem_page_size();
> 
> I'm seeing a build warning with this code.  Looks like you need:
> 
> #include <rte_eal_paging.h>

Ack, will fix it in v4. I missed testing this series on x86 and had only
tested it on arm64.

Thanks.
> 
> Dave


* [dpdk-dev] [PATCH v4 0/4] fix issue with partial DMA unmap
  2020-10-12  8:11 [dpdk-dev] [PATCH 0/2] fix issue with partial DMA unmap Nithin Dabilpuram
                   ` (3 preceding siblings ...)
  2020-12-01 19:32 ` [dpdk-dev] [PATCH v3 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
@ 2020-12-02  5:46 ` Nithin Dabilpuram
  2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
                     ` (3 more replies)
  2020-12-14  8:19 ` [dpdk-dev] [PATCH v5 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
                   ` (3 subsequent siblings)
  8 siblings, 4 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-02  5:46 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram

Partial DMA unmap is not supported by the VFIO type1 IOMMU
in Linux. Though the return value is zero, the returned
DMA unmap size is not the same as the expected size.
So add a test case and a fix covering both heap-triggered DMA
mapping and user-triggered DMA mapping/unmapping.

Refer to vfio_dma_do_unmap() in drivers/vfio/vfio_iommu_type1.c;
a snippet of its comment is below.

        /*
         * vfio-iommu-type1 (v1) - User mappings were coalesced together to
         * avoid tracking individual mappings.  This means that the granularity
         * of the original mapping was lost and the user was allowed to attempt
         * to unmap any range.  Depending on the contiguousness of physical
         * memory and page sizes supported by the IOMMU, arbitrary unmaps may
         * or may not have worked.  We only guaranteed unmap granularity
         * matching the original mapping; even though it was untracked here,
         * the original mappings are reflected in IOMMU mappings.  This
         * resulted in a couple unusual behaviors.  First, if a range is not
         * able to be unmapped, ex. a set of 4k pages that was mapped as a
         * 2M hugepage into the IOMMU, the unmap ioctl returns success but with
         * a zero sized unmap.  Also, if an unmap request overlaps the first
         * address of a hugepage, the IOMMU will unmap the entire hugepage.
         * This also returns success and the returned unmap size reflects the
         * actual size unmapped.

         * We attempt to maintain compatibility with this "v1" interface, but  
         * we take control out of the hands of the IOMMU.  Therefore, an unmap 
         * request offset from the beginning of the original mapping will      
         * return success with zero sized unmap.  And an unmap request covering
         * the first iova of mapping will unmap the entire range.              

This behavior can be verified by applying the first patch and adding a return
check for dma_unmap.size != len in vfio_type1_dma_mem_map().

v4:
- Fixed issue with patch 4/4 on x86 builds.

v3:
- Fixed external memory test case(4/4) to use system page size
  instead of 4K.
- Fixed check-git-log.sh issue and rebased.
- Added acked-by from anatoly.burakov@intel.com to first 3 patches.

v2:
- Reverted the earlier commit that enabled merging contiguous mappings for
  IOVA as PA (see 1/3).
- Updated documentation about kernel dma mapping limits and vfio
  module parameter.
- Moved vfio test to test_vfio.c and handled comments from
  Anatoly.

Nithin Dabilpuram (4):
  vfio: revert changes for map contiguous areas in one go
  vfio: fix DMA mapping granularity for type1 IOVA as VA
  test: add test case to validate VFIO DMA map/unmap
  test: change external memory test to use system page sz

 app/test/meson.build                   |   1 +
 app/test/test_external_mem.c           |   3 +-
 app/test/test_vfio.c                   | 103 +++++++++++++++++++++++++++++++++
 doc/guides/linux_gsg/linux_drivers.rst |  10 ++++
 lib/librte_eal/linux/eal_vfio.c        |  93 ++++++++++++-----------------
 lib/librte_eal/linux/eal_vfio.h        |   1 +
 6 files changed, 153 insertions(+), 58 deletions(-)
 create mode 100644 app/test/test_vfio.c

-- 
2.8.4



* [dpdk-dev] [PATCH v4 1/4] vfio: revert changes for map contiguous areas in one go
  2020-12-02  5:46 ` [dpdk-dev] [PATCH v4 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
@ 2020-12-02  5:46   ` Nithin Dabilpuram
  2020-12-02 18:36     ` David Christensen
  2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-02  5:46 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram, stable

In order to save DMA entries, which the kernel limits, both for external
memory and hugepage memory, an attempt was made to map physically
contiguous memory in one go. This cannot be done as VFIO IOMMU type1
does not support partially unmapping a previously mapped memory
region, while the heap can request multi-page mapping and
partial unmapping.
Hence, to go back to the old method of mapping/unmapping at
memseg granularity, this commit reverts
commit d1c7c0cdf7ba ("vfio: map contiguous areas in one go")

Also add documentation on what module parameter needs to be used
to increase the per-container dma map limit for VFIO.

Fixes: d1c7c0cdf7ba ("vfio: map contiguous areas in one go")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 doc/guides/linux_gsg/linux_drivers.rst | 10 ++++++
 lib/librte_eal/linux/eal_vfio.c        | 59 +++++-----------------------------
 2 files changed, 18 insertions(+), 51 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_drivers.rst b/doc/guides/linux_gsg/linux_drivers.rst
index 90635a4..9a662a7 100644
--- a/doc/guides/linux_gsg/linux_drivers.rst
+++ b/doc/guides/linux_gsg/linux_drivers.rst
@@ -25,6 +25,16 @@ To make use of VFIO, the ``vfio-pci`` module must be loaded:
 VFIO kernel is usually present by default in all distributions,
 however please consult your distributions documentation to make sure that is the case.
 
+For DMA mapping of either external memory or hugepages, the VFIO interface is used.
+VFIO does not support partial unmap of once-mapped memory. Hence DPDK's memory is
+mapped at hugepage or system page granularity. The number of DMA
+mappings is limited by the kernel via the user locked memory limit of a
+process (rlimit) for system/hugepage memory. Another per-container limit,
+applicable to both external memory and system memory, was added in kernel 5.1,
+defined by the VFIO module parameter ``dma_entry_limit`` with a default value of 64K.
+When an application runs out of DMA entries, these limits need to be adjusted to
+increase the allowed count.
+
 Since Linux version 5.7,
 the ``vfio-pci`` module supports the creation of virtual functions.
 After the PF is bound to ``vfio-pci`` module,
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 0500824..64b134d 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -517,11 +517,9 @@ static void
 vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		void *arg __rte_unused)
 {
-	rte_iova_t iova_start, iova_expected;
 	struct rte_memseg_list *msl;
 	struct rte_memseg *ms;
 	size_t cur_len = 0;
-	uint64_t va_start;
 
 	msl = rte_mem_virt2memseg_list(addr);
 
@@ -539,63 +537,22 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
-
-	/*
-	 * This memory is not guaranteed to be contiguous, but it still could
-	 * be, or it could have some small contiguous chunks. Since the number
-	 * of VFIO mappings is limited, and VFIO appears to not concatenate
-	 * adjacent mappings, we have to do this ourselves.
-	 *
-	 * So, find contiguous chunks, then map them.
-	 */
-	va_start = ms->addr_64;
-	iova_start = iova_expected = ms->iova;
 	while (cur_len < len) {
-		bool new_contig_area = ms->iova != iova_expected;
-		bool last_seg = (len - cur_len) == ms->len;
-		bool skip_last = false;
-
-		/* only do mappings when current contiguous area ends */
-		if (new_contig_area) {
-			if (type == RTE_MEM_EVENT_ALLOC)
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 1);
-			else
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 0);
-			va_start = ms->addr_64;
-			iova_start = ms->iova;
-		}
 		/* some memory segments may have invalid IOVA */
 		if (ms->iova == RTE_BAD_IOVA) {
 			RTE_LOG(DEBUG, EAL, "Memory segment at %p has bad IOVA, skipping\n",
 					ms->addr);
-			skip_last = true;
+			goto next;
 		}
-		iova_expected = ms->iova + ms->len;
+		if (type == RTE_MEM_EVENT_ALLOC)
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 1);
+		else
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 0);
+next:
 		cur_len += ms->len;
 		++ms;
-
-		/*
-		 * don't count previous segment, and don't attempt to
-		 * dereference a potentially invalid pointer.
-		 */
-		if (skip_last && !last_seg) {
-			iova_expected = iova_start = ms->iova;
-			va_start = ms->addr_64;
-		} else if (!skip_last && last_seg) {
-			/* this is the last segment and we're not skipping */
-			if (type == RTE_MEM_EVENT_ALLOC)
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 1);
-			else
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 0);
-		}
 	}
 }
 
-- 
2.8.4



* [dpdk-dev] [PATCH v4 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA
  2020-12-02  5:46 ` [dpdk-dev] [PATCH v4 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
  2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
@ 2020-12-02  5:46   ` Nithin Dabilpuram
  2020-12-02 18:38     ` David Christensen
  2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 3/4] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
  2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 4/4] test: change external memory test to use system page sz Nithin Dabilpuram
  3 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-02  5:46 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram, stable

Partial unmapping is not supported for VFIO IOMMU type1
by the kernel. Though the kernel returns zero, the unmapped size
returned will not be the same as expected. So check the
returned unmap size and return an error.

For IOVA as PA mode, DMA mapping is already done at memseg size
granularity. Do the same even for IOVA as VA mode: since
DMA map/unmap is triggered by heap allocations,
maintain memseg page size granularity so that heap
expansion and contraction do not hit this issue.

For user-requested DMA map/unmap, disallow partial unmapping
for VFIO type1.

Fixes: 73a639085938 ("vfio: allow to map other memory regions")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/linux/eal_vfio.c | 34 ++++++++++++++++++++++++++++------
 lib/librte_eal/linux/eal_vfio.h |  1 +
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 64b134d..b15b758 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -70,6 +70,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_TYPE1,
 		.name = "Type 1",
+		.partial_unmap = false,
 		.dma_map_func = &vfio_type1_dma_map,
 		.dma_user_map_func = &vfio_type1_dma_mem_map
 	},
@@ -77,6 +78,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_SPAPR,
 		.name = "sPAPR",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_spapr_dma_map,
 		.dma_user_map_func = &vfio_spapr_dma_mem_map
 	},
@@ -84,6 +86,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_NOIOMMU,
 		.name = "No-IOMMU",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_noiommu_dma_map,
 		.dma_user_map_func = &vfio_noiommu_dma_mem_map
 	},
@@ -526,12 +529,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 	/* for IOVA as VA mode, no need to care for IOVA addresses */
 	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
 		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
-		if (type == RTE_MEM_EVENT_ALLOC)
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 1);
-		else
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 0);
+		uint64_t page_sz = msl->page_sz;
+
+		/* Maintain granularity of DMA map/unmap to memseg size */
+		for (; cur_len < len; cur_len += page_sz) {
+			if (type == RTE_MEM_EVENT_ALLOC)
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 1);
+			else
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 0);
+			vfio_va += page_sz;
+		}
+
 		return;
 	}
 
@@ -1348,6 +1358,12 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
 					errno, strerror(errno));
 			return -1;
+		} else if (dma_unmap.size != len) {
+			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
+				"remapping cleared instead of %"PRIu64"\n",
+				(uint64_t)dma_unmap.size, len);
+			rte_errno = EIO;
+			return -1;
 		}
 	}
 
@@ -1823,6 +1839,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
 		/* we're partially unmapping a previously mapped region, so we
 		 * need to split entry into two.
 		 */
+		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
+			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
+			rte_errno = ENOTSUP;
+			ret = -1;
+			goto out;
+		}
 		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
 			rte_errno = ENOMEM;
diff --git a/lib/librte_eal/linux/eal_vfio.h b/lib/librte_eal/linux/eal_vfio.h
index cb2d35f..6ebaca6 100644
--- a/lib/librte_eal/linux/eal_vfio.h
+++ b/lib/librte_eal/linux/eal_vfio.h
@@ -113,6 +113,7 @@ typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova,
 struct vfio_iommu_type {
 	int type_id;
 	const char *name;
+	bool partial_unmap;
 	vfio_dma_user_func_t dma_user_map_func;
 	vfio_dma_func_t dma_map_func;
 };
-- 
2.8.4



* [dpdk-dev] [PATCH v4 3/4] test: add test case to validate VFIO DMA map/unmap
  2020-12-02  5:46 ` [dpdk-dev] [PATCH v4 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
  2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
  2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
@ 2020-12-02  5:46   ` Nithin Dabilpuram
  2020-12-02 19:23     ` David Christensen
  2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 4/4] test: change external memory test to use system page sz Nithin Dabilpuram
  3 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-02  5:46 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram

The test case mmaps system pages and performs a user
DMA map and unmap, both partially and fully.

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 app/test/meson.build |   1 +
 app/test/test_vfio.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 104 insertions(+)
 create mode 100644 app/test/test_vfio.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 94fd39f..d9eedb6 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -139,6 +139,7 @@ test_sources = files('commands.c',
 	'test_trace_register.c',
 	'test_trace_perf.c',
 	'test_version.c',
+	'test_vfio.c',
 	'virtual_pmd.c'
 )
 
diff --git a/app/test/test_vfio.c b/app/test/test_vfio.c
new file mode 100644
index 0000000..00626d4
--- /dev/null
+++ b/app/test/test_vfio.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2020 Marvell.
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_common.h>
+#include <rte_eal.h>
+#include <rte_eal_paging.h>
+#include <rte_errno.h>
+#include <rte_memory.h>
+#include <rte_vfio.h>
+
+#include "test.h"
+
+static int
+test_memory_vfio_dma_map(void)
+{
+	uint64_t sz1, sz2, sz = 2 * rte_mem_page_size();
+	uint64_t unmap1, unmap2;
+	uint8_t *mem;
+	int ret;
+
+	/* Allocate twice size of page */
+	mem = mmap(NULL, sz, PROT_READ | PROT_WRITE,
+		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (mem == MAP_FAILED) {
+		printf("Failed to allocate memory for external heap\n");
+		return -1;
+	}
+
+	/* Force page allocation */
+	memset(mem, 0, sz);
+
+	/* map the whole region */
+	ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					 (uintptr_t)mem, (rte_iova_t)mem, sz);
+	if (ret) {
+		/* Check if VFIO is not available or no device is probed */
+		if (rte_errno == ENOTSUP || rte_errno == ENODEV) {
+			ret = 1;
+			goto fail;
+		}
+		printf("Failed to dma map whole region, ret=%d(%s)\n",
+		       ret, rte_strerror(rte_errno));
+		goto fail;
+	}
+
+	unmap1 = (uint64_t)mem + (sz / 2);
+	sz1 = sz / 2;
+	unmap2 = (uint64_t)mem;
+	sz2 = sz / 2;
+	/* unmap the partial region */
+	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					   unmap1, (rte_iova_t)unmap1, sz1);
+	if (ret) {
+		if (rte_errno == ENOTSUP) {
+			printf("Partial dma unmap not supported\n");
+			unmap2 = (uint64_t)mem;
+			sz2 = sz;
+		} else {
+			printf("Failed to unmap second half region, ret=%d(%s)\n",
+			       ret, rte_strerror(rte_errno));
+			goto fail;
+		}
+	}
+
+	/* unmap the remaining region */
+	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					   unmap2, (rte_iova_t)unmap2, sz2);
+	if (ret) {
+		printf("Failed to unmap remaining region, ret=%d(%s)\n", ret,
+		       rte_strerror(rte_errno));
+		goto fail;
+	}
+
+fail:
+	munmap(mem, sz);
+	return ret;
+}
+
+static int
+test_vfio(void)
+{
+	int ret;
+
+	/* test for vfio dma map/unmap */
+	ret = test_memory_vfio_dma_map();
+	if (ret == 1) {
+		printf("VFIO dma map/unmap unsupported\n");
+	} else if (ret < 0) {
+		printf("Error vfio dma map/unmap, ret=%d\n", ret);
+		return -1;
+	}
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(vfio_autotest, test_vfio);
-- 
2.8.4



* [dpdk-dev] [PATCH v4 4/4] test: change external memory test to use system page sz
  2020-12-02  5:46 ` [dpdk-dev] [PATCH v4 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
                     ` (2 preceding siblings ...)
  2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 3/4] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
@ 2020-12-02  5:46   ` Nithin Dabilpuram
  3 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-02  5:46 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram

Currently the external memory test uses a 4K page size,
but VFIO DMA mapping works only at system page granularity.

Earlier it worked because all the contiguous mappings
were coalesced and mapped in one go, which effectively formed
a much larger page. Now that VFIO DMA mappings, in both IOVA as VA
and IOVA as PA modes, are done at memseg list granularity,
we need to use the system page size.

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
---
 app/test/test_external_mem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/app/test/test_external_mem.c b/app/test/test_external_mem.c
index 7eb81f6..5edf88b 100644
--- a/app/test/test_external_mem.c
+++ b/app/test/test_external_mem.c
@@ -13,6 +13,7 @@
 #include <rte_common.h>
 #include <rte_debug.h>
 #include <rte_eal.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
 #include <rte_malloc.h>
 #include <rte_ring.h>
@@ -532,8 +533,8 @@ test_extmem_basic(void *addr, size_t len, size_t pgsz, rte_iova_t *iova,
 static int
 test_external_mem(void)
 {
+	size_t pgsz = rte_mem_page_size();
 	size_t len = EXTERNAL_MEM_SZ;
-	size_t pgsz = RTE_PGSIZE_4K;
 	rte_iova_t iova[len / pgsz];
 	void *addr;
 	int ret, n_pages;
-- 
2.8.4



* Re: [dpdk-dev] [PATCH v4 1/4] vfio: revert changes for map contiguous areas in one go
  2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
@ 2020-12-02 18:36     ` David Christensen
  0 siblings, 0 replies; 76+ messages in thread
From: David Christensen @ 2020-12-02 18:36 UTC (permalink / raw)
  To: Nithin Dabilpuram, anatoly.burakov, david.marchand; +Cc: jerinj, dev, stable



On 12/1/20 9:46 PM, Nithin Dabilpuram wrote:
> In order to save DMA entries, which the kernel limits, both for external
> memory and hugepage memory, an attempt was made to map physically
> contiguous memory in one go. This cannot be done as VFIO IOMMU type1
> does not support partially unmapping a previously mapped memory
> region, while the heap can request multi-page mapping and
> partial unmapping.
> Hence, to go back to the old method of mapping/unmapping at
> memseg granularity, this commit reverts
> commit d1c7c0cdf7ba ("vfio: map contiguous areas in one go")
> 
> Also add documentation on what module parameter needs to be used
> to increase the per-container dma map limit for VFIO.
> 
> Fixes: d1c7c0cdf7ba ("vfio: map contiguous areas in one go")
> Cc: anatoly.burakov@intel.com
> Cc: stable@dpdk.org
> 
> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>   doc/guides/linux_gsg/linux_drivers.rst | 10 ++++++
>   lib/librte_eal/linux/eal_vfio.c        | 59 +++++-----------------------------
>   2 files changed, 18 insertions(+), 51 deletions(-)
> 
> [...]

Acked-by: David Christensen <drc@linux.vnet.ibm.com>


* Re: [dpdk-dev] [PATCH v4 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA
  2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
@ 2020-12-02 18:38     ` David Christensen
  0 siblings, 0 replies; 76+ messages in thread
From: David Christensen @ 2020-12-02 18:38 UTC (permalink / raw)
  To: Nithin Dabilpuram, anatoly.burakov, david.marchand; +Cc: jerinj, dev, stable



On 12/1/20 9:46 PM, Nithin Dabilpuram wrote:
> Partial unmapping is not supported for VFIO IOMMU type1
> by the kernel. Though the kernel returns zero, the unmapped size
> returned will not be the same as expected. So check the
> returned unmap size and return an error.
> 
> For IOVA as PA mode, DMA mapping is already done at memseg size
> granularity. Do the same even for IOVA as VA mode: since
> DMA map/unmap is triggered by heap allocations,
> maintain memseg page size granularity so that heap
> expansion and contraction do not hit this issue.
> 
> For user-requested DMA map/unmap, disallow partial unmapping
> for VFIO type1.
> 
> Fixes: 73a639085938 ("vfio: allow to map other memory regions")
> Cc: anatoly.burakov@intel.com
> Cc: stable@dpdk.org
> 
> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>   lib/librte_eal/linux/eal_vfio.c | 34 ++++++++++++++++++++++++++++------
>   lib/librte_eal/linux/eal_vfio.h |  1 +
>   2 files changed, 29 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
> index 64b134d..b15b758 100644
> --- a/lib/librte_eal/linux/eal_vfio.c
> +++ b/lib/librte_eal/linux/eal_vfio.c
> @@ -70,6 +70,7 @@ static const struct vfio_iommu_type iommu_types[] = {
>   	{
>   		.type_id = RTE_VFIO_TYPE1,
>   		.name = "Type 1",
> +		.partial_unmap = false,
>   		.dma_map_func = &vfio_type1_dma_map,
>   		.dma_user_map_func = &vfio_type1_dma_mem_map
>   	},
> @@ -77,6 +78,7 @@ static const struct vfio_iommu_type iommu_types[] = {
>   	{
>   		.type_id = RTE_VFIO_SPAPR,
>   		.name = "sPAPR",
> +		.partial_unmap = true,
>   		.dma_map_func = &vfio_spapr_dma_map,
>   		.dma_user_map_func = &vfio_spapr_dma_mem_map
>   	},
> @@ -84,6 +86,7 @@ static const struct vfio_iommu_type iommu_types[] = {
>   	{
>   		.type_id = RTE_VFIO_NOIOMMU,
>   		.name = "No-IOMMU",
> +		.partial_unmap = true,
>   		.dma_map_func = &vfio_noiommu_dma_map,
>   		.dma_user_map_func = &vfio_noiommu_dma_mem_map
>   	},
> @@ -526,12 +529,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
>   	/* for IOVA as VA mode, no need to care for IOVA addresses */
>   	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
>   		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
> -		if (type == RTE_MEM_EVENT_ALLOC)
> -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
> -					len, 1);
> -		else
> -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
> -					len, 0);
> +		uint64_t page_sz = msl->page_sz;
> +
> +		/* Maintain granularity of DMA map/unmap to memseg size */
> +		for (; cur_len < len; cur_len += page_sz) {
> +			if (type == RTE_MEM_EVENT_ALLOC)
> +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
> +						 vfio_va, page_sz, 1);
> +			else
> +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
> +						 vfio_va, page_sz, 0);
> +			vfio_va += page_sz;
> +		}
> +
>   		return;
>   	}
> 
> @@ -1348,6 +1358,12 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
>   			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
>   					errno, strerror(errno));
>   			return -1;
> +		} else if (dma_unmap.size != len) {
> +			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> +				"remapping cleared instead of %"PRIu64"\n",
> +				(uint64_t)dma_unmap.size, len);
> +			rte_errno = EIO;
> +			return -1;
>   		}
>   	}
> 
> @@ -1823,6 +1839,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>   		/* we're partially unmapping a previously mapped region, so we
>   		 * need to split entry into two.
>   		 */
> +		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> +			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> +			rte_errno = ENOTSUP;
> +			ret = -1;
> +			goto out;
> +		}
>   		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
>   			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
>   			rte_errno = ENOMEM;
> diff --git a/lib/librte_eal/linux/eal_vfio.h b/lib/librte_eal/linux/eal_vfio.h
> index cb2d35f..6ebaca6 100644
> --- a/lib/librte_eal/linux/eal_vfio.h
> +++ b/lib/librte_eal/linux/eal_vfio.h
> @@ -113,6 +113,7 @@ typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova,
>   struct vfio_iommu_type {
>   	int type_id;
>   	const char *name;
> +	bool partial_unmap;
>   	vfio_dma_user_func_t dma_user_map_func;
>   	vfio_dma_func_t dma_map_func;
>   };
> 

Acked-by: David Christensen <drc@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v4 3/4] test: add test case to validate VFIO DMA map/unmap
  2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 3/4] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
@ 2020-12-02 19:23     ` David Christensen
  2020-12-03  7:14       ` Nithin Dabilpuram
  0 siblings, 1 reply; 76+ messages in thread
From: David Christensen @ 2020-12-02 19:23 UTC (permalink / raw)
  To: Nithin Dabilpuram, anatoly.burakov, david.marchand; +Cc: jerinj, dev



On 12/1/20 9:46 PM, Nithin Dabilpuram wrote:
> Test case mmaps system pages and tries to perform a user
> DMA map and unmap, both partially and fully.
> 
> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>   app/test/meson.build |   1 +
>   app/test/test_vfio.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 104 insertions(+)
>   create mode 100644 app/test/test_vfio.c
> 
> diff --git a/app/test/meson.build b/app/test/meson.build
> index 94fd39f..d9eedb6 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -139,6 +139,7 @@ test_sources = files('commands.c',
>   	'test_trace_register.c',
>   	'test_trace_perf.c',
>   	'test_version.c',
> +	'test_vfio.c',
>   	'virtual_pmd.c'
>   )
> 
> diff --git a/app/test/test_vfio.c b/app/test/test_vfio.c
> new file mode 100644
> index 0000000..00626d4
> --- /dev/null
> +++ b/app/test/test_vfio.c
> @@ -0,0 +1,103 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2020 Marvell.
> + */
> +
> +#include <stdio.h>
> +#include <stdint.h>
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +
> +#include <rte_common.h>
> +#include <rte_eal.h>
> +#include <rte_eal_paging.h>
> +#include <rte_errno.h>
> +#include <rte_memory.h>
> +#include <rte_vfio.h>
> +
> +#include "test.h"
> +
> +static int
> +test_memory_vfio_dma_map(void)
> +{
> +	uint64_t sz1, sz2, sz = 2 * rte_mem_page_size();
> +	uint64_t unmap1, unmap2;
> +	uint8_t *mem;
> +	int ret;
> +
> +	/* Allocate twice size of page */
> +	mem = mmap(NULL, sz, PROT_READ | PROT_WRITE,
> +		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +	if (mem == MAP_FAILED) {
> +		printf("Failed to allocate memory for external heap\n");
> +		return -1;
> +	}
> +
> +	/* Force page allocation */
> +	memset(mem, 0, sz);
> +
> +	/* map the whole region */
> +	ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> +					 (uintptr_t)mem, (rte_iova_t)mem, sz);
> +	if (ret) {
> +		/* Check if VFIO is not available or no device is probed */
> +		if (rte_errno == ENOTSUP || rte_errno == ENODEV) {
> +			ret = 1;
> +			goto fail;
> +		}
> +		printf("Failed to dma map whole region, ret=%d(%s)\n",
> +		       ret, rte_strerror(rte_errno));
> +		goto fail;
> +	}
> +
> +	unmap1 = (uint64_t)mem + (sz / 2);
> +	sz1 = sz / 2;
> +	unmap2 = (uint64_t)mem;
> +	sz2 = sz / 2;
> +	/* unmap the partial region */
> +	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> +					   unmap1, (rte_iova_t)unmap1, sz1);
> +	if (ret) {
> +		if (rte_errno == ENOTSUP) {
> +			printf("Partial dma unmap not supported\n");
> +			unmap2 = (uint64_t)mem;
> +			sz2 = sz;
> +		} else {
> +			printf("Failed to unmap second half region, ret=%d(%s)\n",
> +			       ret, rte_strerror(rte_errno));
> +			goto fail;
> +		}
> +	}
> +
> +	/* unmap the remaining region */
> +	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> +					   unmap2, (rte_iova_t)unmap2, sz2);
> +	if (ret) {
> +		printf("Failed to unmap remaining region, ret=%d(%s)\n", ret,
> +		       rte_strerror(rte_errno));
> +		goto fail;
> +	}
> +
> +fail:
> +	munmap(mem, sz);
> +	return ret;
> +}
> +
> +static int
> +test_vfio(void)
> +{
> +	int ret;
> +
> +	/* test for vfio dma map/unmap */
> +	ret = test_memory_vfio_dma_map();
> +	if (ret == 1) {
> +		printf("VFIO dma map/unmap unsupported\n");
> +	} else if (ret < 0) {
> +		printf("Error vfio dma map/unmap, ret=%d\n", ret);
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +REGISTER_TEST_COMMAND(vfio_autotest, test_vfio);
> 

The test as written fails on a POWER9 system (see below for debug output).

IOMMU on POWER systems requires that a DMA window be defined and that 
all DMA mappings reside within that window.  In this test, the DMA 
window is defined as 0x0 to 0x4000000000, but the VA allocated in your 
test is 0x7fffb8680000, well outside that range.

I recently submitted a change in the 20.11 release which scans the 
memseg list in order to set the DMA window. The effect can be seen here:

EAL: Highest VA address in memseg list is 0x2200000000
EAL: Setting DMA window size to 0x4000000000

Can we modify the test to allocate memory out of the existing memseg 
allocations?

Dave

$ sudo ~/src/dpdk/build/app/test/dpdk-test --log="eal,debug" 
--iova-mode=va -l 64-127
EAL: Detected lcore 0 as core 0 on socket 0
EAL: Detected lcore 1 as core 0 on socket 0
EAL: Detected lcore 2 as core 0 on socket 0
EAL: Detected lcore 3 as core 0 on socket 0
EAL: Detected lcore 4 as core 4 on socket 0
EAL: Detected lcore 5 as core 4 on socket 0
EAL: Detected lcore 6 as core 4 on socket 0
EAL: Detected lcore 7 as core 4 on socket 0
EAL: Detected lcore 8 as core 8 on socket 0
EAL: Detected lcore 9 as core 8 on socket 0
EAL: Detected lcore 10 as core 8 on socket 0
EAL: Detected lcore 11 as core 8 on socket 0
EAL: Detected lcore 12 as core 12 on socket 0
EAL: Detected lcore 13 as core 12 on socket 0
EAL: Detected lcore 14 as core 12 on socket 0
EAL: Detected lcore 15 as core 12 on socket 0
EAL: Detected lcore 16 as core 16 on socket 0
EAL: Detected lcore 17 as core 16 on socket 0
EAL: Detected lcore 18 as core 16 on socket 0
EAL: Detected lcore 19 as core 16 on socket 0
EAL: Detected lcore 20 as core 20 on socket 0
EAL: Detected lcore 21 as core 20 on socket 0
EAL: Detected lcore 22 as core 20 on socket 0
EAL: Detected lcore 23 as core 20 on socket 0
EAL: Detected lcore 24 as core 24 on socket 0
EAL: Detected lcore 25 as core 24 on socket 0
EAL: Detected lcore 26 as core 24 on socket 0
EAL: Detected lcore 27 as core 24 on socket 0
EAL: Detected lcore 28 as core 28 on socket 0
EAL: Detected lcore 29 as core 28 on socket 0
EAL: Detected lcore 30 as core 28 on socket 0
EAL: Detected lcore 31 as core 28 on socket 0
EAL: Detected lcore 32 as core 32 on socket 0
EAL: Detected lcore 33 as core 32 on socket 0
EAL: Detected lcore 34 as core 32 on socket 0
EAL: Detected lcore 35 as core 32 on socket 0
EAL: Detected lcore 36 as core 36 on socket 0
EAL: Detected lcore 37 as core 36 on socket 0
EAL: Detected lcore 38 as core 36 on socket 0
EAL: Detected lcore 39 as core 36 on socket 0
EAL: Detected lcore 40 as core 48 on socket 0
EAL: Detected lcore 41 as core 48 on socket 0
EAL: Detected lcore 42 as core 48 on socket 0
EAL: Detected lcore 43 as core 48 on socket 0
EAL: Detected lcore 44 as core 52 on socket 0
EAL: Detected lcore 45 as core 52 on socket 0
EAL: Detected lcore 46 as core 52 on socket 0
EAL: Detected lcore 47 as core 52 on socket 0
EAL: Detected lcore 48 as core 72 on socket 0
EAL: Detected lcore 49 as core 72 on socket 0
EAL: Detected lcore 50 as core 72 on socket 0
EAL: Detected lcore 51 as core 72 on socket 0
EAL: Detected lcore 52 as core 76 on socket 0
EAL: Detected lcore 53 as core 76 on socket 0
EAL: Detected lcore 54 as core 76 on socket 0
EAL: Detected lcore 55 as core 76 on socket 0
EAL: Detected lcore 56 as core 80 on socket 0
EAL: Detected lcore 57 as core 80 on socket 0
EAL: Detected lcore 58 as core 80 on socket 0
EAL: Detected lcore 59 as core 80 on socket 0
EAL: Detected lcore 60 as core 84 on socket 0
EAL: Detected lcore 61 as core 84 on socket 0
EAL: Detected lcore 62 as core 84 on socket 0
EAL: Detected lcore 63 as core 84 on socket 0
EAL: Detected lcore 64 as core 2048 on socket 8
EAL: Detected lcore 65 as core 2048 on socket 8
EAL: Detected lcore 66 as core 2048 on socket 8
EAL: Detected lcore 67 as core 2048 on socket 8
EAL: Detected lcore 68 as core 2052 on socket 8
EAL: Detected lcore 69 as core 2052 on socket 8
EAL: Detected lcore 70 as core 2052 on socket 8
EAL: Detected lcore 71 as core 2052 on socket 8
EAL: Detected lcore 72 as core 2056 on socket 8
EAL: Detected lcore 73 as core 2056 on socket 8
EAL: Detected lcore 74 as core 2056 on socket 8
EAL: Detected lcore 75 as core 2056 on socket 8
EAL: Detected lcore 76 as core 2060 on socket 8
EAL: Detected lcore 77 as core 2060 on socket 8
EAL: Detected lcore 78 as core 2060 on socket 8
EAL: Detected lcore 79 as core 2060 on socket 8
EAL: Detected lcore 80 as core 2072 on socket 8
EAL: Detected lcore 81 as core 2072 on socket 8
EAL: Detected lcore 82 as core 2072 on socket 8
EAL: Detected lcore 83 as core 2072 on socket 8
EAL: Detected lcore 84 as core 2076 on socket 8
EAL: Detected lcore 85 as core 2076 on socket 8
EAL: Detected lcore 86 as core 2076 on socket 8
EAL: Detected lcore 87 as core 2076 on socket 8
EAL: Detected lcore 88 as core 2080 on socket 8
EAL: Detected lcore 89 as core 2080 on socket 8
EAL: Detected lcore 90 as core 2080 on socket 8
EAL: Detected lcore 91 as core 2080 on socket 8
EAL: Detected lcore 92 as core 2084 on socket 8
EAL: Detected lcore 93 as core 2084 on socket 8
EAL: Detected lcore 94 as core 2084 on socket 8
EAL: Detected lcore 95 as core 2084 on socket 8
EAL: Detected lcore 96 as core 2088 on socket 8
EAL: Detected lcore 97 as core 2088 on socket 8
EAL: Detected lcore 98 as core 2088 on socket 8
EAL: Detected lcore 99 as core 2088 on socket 8
EAL: Detected lcore 100 as core 2092 on socket 8
EAL: Detected lcore 101 as core 2092 on socket 8
EAL: Detected lcore 102 as core 2092 on socket 8
EAL: Detected lcore 103 as core 2092 on socket 8
EAL: Detected lcore 104 as core 2096 on socket 8
EAL: Detected lcore 105 as core 2096 on socket 8
EAL: Detected lcore 106 as core 2096 on socket 8
EAL: Detected lcore 107 as core 2096 on socket 8
EAL: Detected lcore 108 as core 2100 on socket 8
EAL: Detected lcore 109 as core 2100 on socket 8
EAL: Detected lcore 110 as core 2100 on socket 8
EAL: Detected lcore 111 as core 2100 on socket 8
EAL: Detected lcore 112 as core 2120 on socket 8
EAL: Detected lcore 113 as core 2120 on socket 8
EAL: Detected lcore 114 as core 2120 on socket 8
EAL: Detected lcore 115 as core 2120 on socket 8
EAL: Detected lcore 116 as core 2124 on socket 8
EAL: Detected lcore 117 as core 2124 on socket 8
EAL: Detected lcore 118 as core 2124 on socket 8
EAL: Detected lcore 119 as core 2124 on socket 8
EAL: Detected lcore 120 as core 2136 on socket 8
EAL: Detected lcore 121 as core 2136 on socket 8
EAL: Detected lcore 122 as core 2136 on socket 8
EAL: Detected lcore 123 as core 2136 on socket 8
EAL: Detected lcore 124 as core 2140 on socket 8
EAL: Detected lcore 125 as core 2140 on socket 8
EAL: Detected lcore 126 as core 2140 on socket 8
EAL: Detected lcore 127 as core 2140 on socket 8
EAL: Support maximum 1536 logical core(s) by configuration.
EAL: Detected 128 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Ask a virtual area of 0x10000 bytes
EAL: Virtual area found at 0x100000000 (size = 0x10000)
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: DPAA Bus not present. Skipping.
EAL: VFIO PCI modules not loaded
EAL: Selected IOVA mode 'VA'
EAL: 2 hugepages of size 2097152 reserved, but no mounted hugetlbfs 
found for that size
EAL: Probing VFIO support...
EAL:   IOMMU type 1 (Type 1) is not supported
EAL:   IOMMU type 7 (sPAPR) is supported
EAL:   IOMMU type 8 (No-IOMMU) is not supported
EAL: VFIO support initialized
EAL: Ask a virtual area of 0x30000 bytes
EAL: Virtual area found at 0x100010000 (size = 0x30000)
EAL: Setting up physically contiguous memory...
EAL: Setting maximum number of open files to 32768
EAL: Detected memory type: socket_id:0 hugepage_sz:1073741824
EAL: Detected memory type: socket_id:8 hugepage_sz:1073741824
EAL: Creating 2 segment lists: n_segs:32 socket_id:0 hugepage_sz:1073741824
EAL: Ask a virtual area of 0x10000 bytes
EAL: Virtual area found at 0x100040000 (size = 0x10000)
EAL: Memseg list allocated at socket 0, page size 0x100000kB
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x140000000 (size = 0x800000000)
EAL: VA reserved for memseg list at 0x140000000, size 800000000
EAL: Ask a virtual area of 0x10000 bytes
EAL: Virtual area found at 0x940000000 (size = 0x10000)
EAL: Memseg list allocated at socket 0, page size 0x100000kB
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x980000000 (size = 0x800000000)
EAL: VA reserved for memseg list at 0x980000000, size 800000000
EAL: Creating 2 segment lists: n_segs:32 socket_id:8 hugepage_sz:1073741824
EAL: Ask a virtual area of 0x10000 bytes
EAL: Virtual area found at 0x1180000000 (size = 0x10000)
EAL: Memseg list allocated at socket 8, page size 0x100000kB
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x11c0000000 (size = 0x800000000)
EAL: VA reserved for memseg list at 0x11c0000000, size 800000000
EAL: Ask a virtual area of 0x10000 bytes
EAL: Virtual area found at 0x19c0000000 (size = 0x10000)
EAL: Memseg list allocated at socket 8, page size 0x100000kB
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x1a00000000 (size = 0x800000000)
EAL: VA reserved for memseg list at 0x1a00000000, size 800000000
EAL: TSC frequency is ~510000 KHz
EAL: Main lcore 64 is ready (tid=7fffb8018890;cpuset=[64])
EAL: lcore 65 is ready (tid=7fffb64ad090;cpuset=[65])
EAL: lcore 66 is ready (tid=7fffb5c9d090;cpuset=[66])
EAL: lcore 67 is ready (tid=7fffb548d090;cpuset=[67])
EAL: lcore 68 is ready (tid=7fffb4c7d090;cpuset=[68])
EAL: lcore 69 is ready (tid=7fffa7ffd090;cpuset=[69])
EAL: lcore 70 is ready (tid=7fffa77ed090;cpuset=[70])
EAL: lcore 71 is ready (tid=7fffa6fdd090;cpuset=[71])
EAL: lcore 72 is ready (tid=7fffa67cd090;cpuset=[72])
EAL: lcore 73 is ready (tid=7fffa5fbd090;cpuset=[73])
EAL: lcore 74 is ready (tid=7fffa57ad090;cpuset=[74])
EAL: lcore 75 is ready (tid=7fffa4f9d090;cpuset=[75])
EAL: lcore 76 is ready (tid=7fff8fffd090;cpuset=[76])
EAL: lcore 77 is ready (tid=7fff8f7ed090;cpuset=[77])
EAL: lcore 78 is ready (tid=7fff8efdd090;cpuset=[78])
EAL: lcore 79 is ready (tid=7fff8e7cd090;cpuset=[79])
EAL: lcore 80 is ready (tid=7fff8dfbd090;cpuset=[80])
EAL: lcore 81 is ready (tid=7fff8d7ad090;cpuset=[81])
EAL: lcore 82 is ready (tid=7fff8cf9d090;cpuset=[82])
EAL: lcore 83 is ready (tid=7fff6bffd090;cpuset=[83])
EAL: lcore 84 is ready (tid=7fff6b7ed090;cpuset=[84])
EAL: lcore 85 is ready (tid=7fff6afdd090;cpuset=[85])
EAL: lcore 86 is ready (tid=7fff6a7cd090;cpuset=[86])
EAL: lcore 87 is ready (tid=7fff69fbd090;cpuset=[87])
EAL: lcore 88 is ready (tid=7fff697ad090;cpuset=[88])
EAL: lcore 89 is ready (tid=7fff68f9d090;cpuset=[89])
EAL: lcore 90 is ready (tid=7fff4bffd090;cpuset=[90])
EAL: lcore 91 is ready (tid=7fff4b7ed090;cpuset=[91])
EAL: lcore 92 is ready (tid=7fff4afdd090;cpuset=[92])
EAL: lcore 93 is ready (tid=7fff4a7cd090;cpuset=[93])
EAL: lcore 94 is ready (tid=7fff49fbd090;cpuset=[94])
EAL: lcore 95 is ready (tid=7fff497ad090;cpuset=[95])
EAL: lcore 96 is ready (tid=7fff48f9d090;cpuset=[96])
EAL: lcore 97 is ready (tid=7fff2bffd090;cpuset=[97])
EAL: lcore 98 is ready (tid=7fff2b7ed090;cpuset=[98])
EAL: lcore 99 is ready (tid=7fff2afdd090;cpuset=[99])
EAL: lcore 100 is ready (tid=7fff2a7cd090;cpuset=[100])
EAL: lcore 101 is ready (tid=7fff29fbd090;cpuset=[101])
EAL: lcore 102 is ready (tid=7fff297ad090;cpuset=[102])
EAL: lcore 103 is ready (tid=7fff28f9d090;cpuset=[103])
EAL: lcore 104 is ready (tid=7fff07ffd090;cpuset=[104])
EAL: lcore 105 is ready (tid=7ffeff7ed090;cpuset=[105])
EAL: lcore 106 is ready (tid=7fff077ed090;cpuset=[106])
EAL: lcore 107 is ready (tid=7fff06fdd090;cpuset=[107])
EAL: lcore 108 is ready (tid=7fff067cd090;cpuset=[108])
EAL: lcore 109 is ready (tid=7fff05fbd090;cpuset=[109])
EAL: lcore 110 is ready (tid=7fff057ad090;cpuset=[110])
EAL: lcore 111 is ready (tid=7fff04f9d090;cpuset=[111])
EAL: lcore 112 is ready (tid=7ffeffffd090;cpuset=[112])
EAL: lcore 113 is ready (tid=7ffefefdd090;cpuset=[113])
EAL: lcore 114 is ready (tid=7ffefe7cd090;cpuset=[114])
EAL: lcore 115 is ready (tid=7ffefdfbd090;cpuset=[115])
EAL: lcore 116 is ready (tid=7ffefd7ad090;cpuset=[116])
EAL: lcore 117 is ready (tid=7ffefcf9d090;cpuset=[117])
EAL: lcore 118 is ready (tid=7ffecfffd090;cpuset=[118])
EAL: lcore 119 is ready (tid=7ffecf7ed090;cpuset=[119])
EAL: lcore 120 is ready (tid=7ffecefdd090;cpuset=[120])
EAL: lcore 121 is ready (tid=7ffece7cd090;cpuset=[121])
EAL: lcore 122 is ready (tid=7ffecdfbd090;cpuset=[122])
EAL: lcore 123 is ready (tid=7ffecd7ad090;cpuset=[123])
EAL: lcore 124 is ready (tid=7ffeccf9d090;cpuset=[124])
EAL: lcore 125 is ready (tid=7ffe9bffd090;cpuset=[125])
EAL: lcore 126 is ready (tid=7ffe9b7ed090;cpuset=[126])
EAL: lcore 127 is ready (tid=7ffe9afdd090;cpuset=[127])
EAL: Trying to obtain current memory policy.
EAL: Setting policy MPOL_PREFERRED for socket 8
EAL: Restoring previous memory policy: 0
EAL: request: mp_malloc_sync
EAL: Heap on socket 8 was expanded by 1024MB
EAL: PCI device 0000:01:00.0 on NUMA socket 0
EAL:   probe driver: 15b3:1019 mlx5_pci
EAL: Probe PCI driver: mlx5_pci (15b3:1019) device: 0000:01:00.0 (socket 0)
EAL: Mem event callback 'MLX5_MEM_EVENT_CB:(nil)' registered
EAL: Trying to obtain current memory policy.
EAL: Setting policy MPOL_PREFERRED for socket 0
EAL: Restoring previous memory policy: 0
EAL: Calling mem event callback 'MLX5_MEM_EVENT_CB:(nil)'
EAL: request: mp_malloc_sync
EAL: Heap on socket 0 was expanded by 1024MB
mlx5_pci: Size 0xFFFF is not power of 2, will be aligned to 0x10000.
mlx5_pci: Default miss action is not supported.
EAL: PCI device 0000:01:00.1 on NUMA socket 0
EAL:   probe driver: 15b3:1019 mlx5_pci
EAL: Probe PCI driver: mlx5_pci (15b3:1019) device: 0000:01:00.1 (socket 0)
mlx5_pci: Size 0xFFFF is not power of 2, will be aligned to 0x10000.
mlx5_pci: Default miss action is not supported.
EAL: PCI device 0003:01:00.0 on NUMA socket 0
EAL:   probe driver: 14e4:168a net_bnx2x
EAL:   Not managed by a supported kernel driver, skipped
EAL: PCI device 0003:01:00.1 on NUMA socket 0
EAL:   probe driver: 14e4:168a net_bnx2x
EAL:   Not managed by a supported kernel driver, skipped
EAL: PCI device 0003:01:00.2 on NUMA socket 0
EAL:   probe driver: 14e4:168a net_bnx2x
EAL:   Not managed by a supported kernel driver, skipped
EAL: PCI device 0003:01:00.3 on NUMA socket 0
EAL:   probe driver: 14e4:168a net_bnx2x
EAL:   Not managed by a supported kernel driver, skipped
EAL: PCI device 0030:01:00.0 on NUMA socket 8
EAL:   probe driver: 15b3:1019 mlx5_pci
EAL: Probe PCI driver: mlx5_pci (15b3:1019) device: 0030:01:00.0 (socket 8)
mlx5_pci: Size 0xFFFF is not power of 2, will be aligned to 0x10000.
mlx5_pci: Default miss action is not supported.
EAL: PCI device 0030:01:00.1 on NUMA socket 8
EAL:   probe driver: 15b3:1019 mlx5_pci
EAL: Probe PCI driver: mlx5_pci (15b3:1019) device: 0030:01:00.1 (socket 8)
mlx5_pci: Size 0xFFFF is not power of 2, will be aligned to 0x10000.
mlx5_pci: Default miss action is not supported.
EAL: PCI device 0034:01:00.0 on NUMA socket 8
EAL:   probe driver: 8086:1583 net_i40e
EAL:   set IOMMU type 1 (Type 1) failed, error 19 (No such device)
EAL:   using IOMMU type 7 (sPAPR)
EAL: Highest VA address in memseg list is 0x2200000000
EAL: Setting DMA window size to 0x4000000000
EAL: Mem event callback 'vfio_mem_event_clb:(nil)' registered
EAL: Installed memory event callback for VFIO
EAL: VFIO reports MSI-X BAR as mappable
EAL:   PCI memory mapped at 0x2200000000
EAL:   PCI memory mapped at 0x2200800000
EAL: Probe PCI driver: net_i40e (8086:1583) device: 0034:01:00.0 (socket 8)
EAL: PCI device 0034:01:00.1 on NUMA socket 8
EAL:   probe driver: 8086:1583 net_i40e
EAL: VFIO reports MSI-X BAR as mappable
EAL:   PCI memory mapped at 0x2200810000
EAL:   PCI memory mapped at 0x2201010000
EAL: Probe PCI driver: net_i40e (8086:1583) device: 0034:01:00.1 (socket 8)
APP: HPET is not enabled, using TSC as default timer
RTE>>vfio_autotest
DRC: sz = 0x20000
DRC: mem = 0x0x7fffb8680000
EAL:   dma map attempt outside DMA window
EAL: Failed to map DMA
EAL: Couldn't map new region for DMA
Failed to dma map whole region, ret=-1(No such file or directory)
Error vfio dma map/unmap, ret=-1
Test Failed
RTE>>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v4 3/4] test: add test case to validate VFIO DMA map/unmap
  2020-12-02 19:23     ` David Christensen
@ 2020-12-03  7:14       ` Nithin Dabilpuram
  2020-12-14  8:24         ` Nithin Dabilpuram
  0 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-03  7:14 UTC (permalink / raw)
  To: David Christensen; +Cc: anatoly.burakov, david.marchand, jerinj, dev

On Wed, Dec 02, 2020 at 11:23:09AM -0800, David Christensen wrote:
> 
> 
> On 12/1/20 9:46 PM, Nithin Dabilpuram wrote:
> > Test case mmaps system pages and tries to perform a user
> > DMA map and unmap, both partially and fully.
> > 
> > Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> > Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
> > ---
> >   app/test/meson.build |   1 +
> >   app/test/test_vfio.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >   2 files changed, 104 insertions(+)
> >   create mode 100644 app/test/test_vfio.c
> > 
> > diff --git a/app/test/meson.build b/app/test/meson.build
> > index 94fd39f..d9eedb6 100644
> > --- a/app/test/meson.build
> > +++ b/app/test/meson.build
> > @@ -139,6 +139,7 @@ test_sources = files('commands.c',
> >   	'test_trace_register.c',
> >   	'test_trace_perf.c',
> >   	'test_version.c',
> > +	'test_vfio.c',
> >   	'virtual_pmd.c'
> >   )
> > 
> > diff --git a/app/test/test_vfio.c b/app/test/test_vfio.c
> > new file mode 100644
> > index 0000000..00626d4
> > --- /dev/null
> > +++ b/app/test/test_vfio.c
> > @@ -0,0 +1,103 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(C) 2020 Marvell.
> > + */
> > +
> > +#include <stdio.h>
> > +#include <stdint.h>
> > +#include <string.h>
> > +#include <sys/mman.h>
> > +#include <unistd.h>
> > +
> > +#include <rte_common.h>
> > +#include <rte_eal.h>
> > +#include <rte_eal_paging.h>
> > +#include <rte_errno.h>
> > +#include <rte_memory.h>
> > +#include <rte_vfio.h>
> > +
> > +#include "test.h"
> > +
> > +static int
> > +test_memory_vfio_dma_map(void)
> > +{
> > +	uint64_t sz1, sz2, sz = 2 * rte_mem_page_size();
> > +	uint64_t unmap1, unmap2;
> > +	uint8_t *mem;
> > +	int ret;
> > +
> > +	/* Allocate twice size of page */
> > +	mem = mmap(NULL, sz, PROT_READ | PROT_WRITE,
> > +		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > +	if (mem == MAP_FAILED) {
> > +		printf("Failed to allocate memory for external heap\n");
> > +		return -1;
> > +	}
> > +
> > +	/* Force page allocation */
> > +	memset(mem, 0, sz);
> > +
> > +	/* map the whole region */
> > +	ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > +					 (uintptr_t)mem, (rte_iova_t)mem, sz);
> > +	if (ret) {
> > +		/* Check if VFIO is not available or no device is probed */
> > +		if (rte_errno == ENOTSUP || rte_errno == ENODEV) {
> > +			ret = 1;
> > +			goto fail;
> > +		}
> > +		printf("Failed to dma map whole region, ret=%d(%s)\n",
> > +		       ret, rte_strerror(rte_errno));
> > +		goto fail;
> > +	}
> > +
> > +	unmap1 = (uint64_t)mem + (sz / 2);
> > +	sz1 = sz / 2;
> > +	unmap2 = (uint64_t)mem;
> > +	sz2 = sz / 2;
> > +	/* unmap the partial region */
> > +	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > +					   unmap1, (rte_iova_t)unmap1, sz1);
> > +	if (ret) {
> > +		if (rte_errno == ENOTSUP) {
> > +			printf("Partial dma unmap not supported\n");
> > +			unmap2 = (uint64_t)mem;
> > +			sz2 = sz;
> > +		} else {
> > +			printf("Failed to unmap second half region, ret=%d(%s)\n",
> > +			       ret, rte_strerror(rte_errno));
> > +			goto fail;
> > +		}
> > +	}
> > +
> > +	/* unmap the remaining region */
> > +	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > +					   unmap2, (rte_iova_t)unmap2, sz2);
> > +	if (ret) {
> > +		printf("Failed to unmap remaining region, ret=%d(%s)\n", ret,
> > +		       rte_strerror(rte_errno));
> > +		goto fail;
> > +	}
> > +
> > +fail:
> > +	munmap(mem, sz);
> > +	return ret;
> > +}
> > +
> > +static int
> > +test_vfio(void)
> > +{
> > +	int ret;
> > +
> > +	/* test for vfio dma map/unmap */
> > +	ret = test_memory_vfio_dma_map();
> > +	if (ret == 1) {
> > +		printf("VFIO dma map/unmap unsupported\n");
> > +	} else if (ret < 0) {
> > +		printf("Error vfio dma map/unmap, ret=%d\n", ret);
> > +		return -1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +REGISTER_TEST_COMMAND(vfio_autotest, test_vfio);
> > 
> 
> The test as written fails on a POWER9 system (see below for debug output).
> 
> IOMMU on POWER systems requires that a DMA window be defined and that all
> DMA mappings reside within that window.  In this test, the DMA window is
> defined as 0x0 to 0x4000000000, but the VA allocated in your test is
> 0x7fffb8680000, well outside that range.
> 
> I recently submitted a change in the 20.11 release which scans the memseg
> list in order to set the DMA window. The effect can be seen here:
> 
> EAL: Highest VA address in memseg list is 0x2200000000
> EAL: Setting DMA window size to 0x4000000000
> 

I missed that thread. So basically external memory with IOVA as VA mode
is not supported on POWER9 systems, as its memseg lists can be created
after the DMA window size is fixed. Correct?

> Can we modify the test to allocate memory out of the existing memseg
> allocations?

Since I'm mmap'ing normal pages for this test outside EAL, I cannot use
the memseg list VA range, as it is already reserved by the memseg lists.

I can see only three options left.

#1 Use the initial process VA range by using heap memory instead of mmap.
It falls below EAL's base_virtaddr for both FreeBSD and Windows, and also
I think your DMA window will include it.

#2 Use a PA from the real PA range, as in external_mem_autotest(). The test
   currently acts as an IOVA-as-VA test, but this change would make it IOVA as PA.

#3 Disable this test for VFIO SPAPR or remove it completely.

Will #1 work in your case ?
> 
> Dave
> 
> $ sudo ~/src/dpdk/build/app/test/dpdk-test --log="eal,debug" --iova-mode=va
> -l 64-127
> [...]
> EAL: Detected lcore 97 as core 2088 on socket 8
> EAL: Detected lcore 98 as core 2088 on socket 8
> EAL: Detected lcore 99 as core 2088 on socket 8
> EAL: Detected lcore 100 as core 2092 on socket 8
> EAL: Detected lcore 101 as core 2092 on socket 8
> EAL: Detected lcore 102 as core 2092 on socket 8
> EAL: Detected lcore 103 as core 2092 on socket 8
> EAL: Detected lcore 104 as core 2096 on socket 8
> EAL: Detected lcore 105 as core 2096 on socket 8
> EAL: Detected lcore 106 as core 2096 on socket 8
> EAL: Detected lcore 107 as core 2096 on socket 8
> EAL: Detected lcore 108 as core 2100 on socket 8
> EAL: Detected lcore 109 as core 2100 on socket 8
> EAL: Detected lcore 110 as core 2100 on socket 8
> EAL: Detected lcore 111 as core 2100 on socket 8
> EAL: Detected lcore 112 as core 2120 on socket 8
> EAL: Detected lcore 113 as core 2120 on socket 8
> EAL: Detected lcore 114 as core 2120 on socket 8
> EAL: Detected lcore 115 as core 2120 on socket 8
> EAL: Detected lcore 116 as core 2124 on socket 8
> EAL: Detected lcore 117 as core 2124 on socket 8
> EAL: Detected lcore 118 as core 2124 on socket 8
> EAL: Detected lcore 119 as core 2124 on socket 8
> EAL: Detected lcore 120 as core 2136 on socket 8
> EAL: Detected lcore 121 as core 2136 on socket 8
> EAL: Detected lcore 122 as core 2136 on socket 8
> EAL: Detected lcore 123 as core 2136 on socket 8
> EAL: Detected lcore 124 as core 2140 on socket 8
> EAL: Detected lcore 125 as core 2140 on socket 8
> EAL: Detected lcore 126 as core 2140 on socket 8
> EAL: Detected lcore 127 as core 2140 on socket 8
> EAL: Support maximum 1536 logical core(s) by configuration.
> EAL: Detected 128 lcore(s)
> EAL: Detected 2 NUMA nodes
> EAL: Ask a virtual area of 0x10000 bytes
> EAL: Virtual area found at 0x100000000 (size = 0x10000)
> EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
> EAL: DPAA Bus not present. Skipping.
> EAL: VFIO PCI modules not loaded
> EAL: Selected IOVA mode 'VA'
> EAL: 2 hugepages of size 2097152 reserved, but no mounted hugetlbfs found
> for that size
> EAL: Probing VFIO support...
> EAL:   IOMMU type 1 (Type 1) is not supported
> EAL:   IOMMU type 7 (sPAPR) is supported
> EAL:   IOMMU type 8 (No-IOMMU) is not supported
> EAL: VFIO support initialized
> EAL: Ask a virtual area of 0x30000 bytes
> EAL: Virtual area found at 0x100010000 (size = 0x30000)
> EAL: Setting up physically contiguous memory...
> EAL: Setting maximum number of open files to 32768
> EAL: Detected memory type: socket_id:0 hugepage_sz:1073741824
> EAL: Detected memory type: socket_id:8 hugepage_sz:1073741824
> EAL: Creating 2 segment lists: n_segs:32 socket_id:0 hugepage_sz:1073741824
> EAL: Ask a virtual area of 0x10000 bytes
> EAL: Virtual area found at 0x100040000 (size = 0x10000)
> EAL: Memseg list allocated at socket 0, page size 0x100000kB
> EAL: Ask a virtual area of 0x800000000 bytes
> EAL: Virtual area found at 0x140000000 (size = 0x800000000)
> EAL: VA reserved for memseg list at 0x140000000, size 800000000
> EAL: Ask a virtual area of 0x10000 bytes
> EAL: Virtual area found at 0x940000000 (size = 0x10000)
> EAL: Memseg list allocated at socket 0, page size 0x100000kB
> EAL: Ask a virtual area of 0x800000000 bytes
> EAL: Virtual area found at 0x980000000 (size = 0x800000000)
> EAL: VA reserved for memseg list at 0x980000000, size 800000000
> EAL: Creating 2 segment lists: n_segs:32 socket_id:8 hugepage_sz:1073741824
> EAL: Ask a virtual area of 0x10000 bytes
> EAL: Virtual area found at 0x1180000000 (size = 0x10000)
> EAL: Memseg list allocated at socket 8, page size 0x100000kB
> EAL: Ask a virtual area of 0x800000000 bytes
> EAL: Virtual area found at 0x11c0000000 (size = 0x800000000)
> EAL: VA reserved for memseg list at 0x11c0000000, size 800000000
> EAL: Ask a virtual area of 0x10000 bytes
> EAL: Virtual area found at 0x19c0000000 (size = 0x10000)
> EAL: Memseg list allocated at socket 8, page size 0x100000kB
> EAL: Ask a virtual area of 0x800000000 bytes
> EAL: Virtual area found at 0x1a00000000 (size = 0x800000000)
> EAL: VA reserved for memseg list at 0x1a00000000, size 800000000
> EAL: TSC frequency is ~510000 KHz
> EAL: Main lcore 64 is ready (tid=7fffb8018890;cpuset=[64])
> EAL: lcore 65 is ready (tid=7fffb64ad090;cpuset=[65])
> EAL: lcore 66 is ready (tid=7fffb5c9d090;cpuset=[66])
> EAL: lcore 67 is ready (tid=7fffb548d090;cpuset=[67])
> EAL: lcore 68 is ready (tid=7fffb4c7d090;cpuset=[68])
> EAL: lcore 69 is ready (tid=7fffa7ffd090;cpuset=[69])
> EAL: lcore 70 is ready (tid=7fffa77ed090;cpuset=[70])
> EAL: lcore 71 is ready (tid=7fffa6fdd090;cpuset=[71])
> EAL: lcore 72 is ready (tid=7fffa67cd090;cpuset=[72])
> EAL: lcore 73 is ready (tid=7fffa5fbd090;cpuset=[73])
> EAL: lcore 74 is ready (tid=7fffa57ad090;cpuset=[74])
> EAL: lcore 75 is ready (tid=7fffa4f9d090;cpuset=[75])
> EAL: lcore 76 is ready (tid=7fff8fffd090;cpuset=[76])
> EAL: lcore 77 is ready (tid=7fff8f7ed090;cpuset=[77])
> EAL: lcore 78 is ready (tid=7fff8efdd090;cpuset=[78])
> EAL: lcore 79 is ready (tid=7fff8e7cd090;cpuset=[79])
> EAL: lcore 80 is ready (tid=7fff8dfbd090;cpuset=[80])
> EAL: lcore 81 is ready (tid=7fff8d7ad090;cpuset=[81])
> EAL: lcore 82 is ready (tid=7fff8cf9d090;cpuset=[82])
> EAL: lcore 83 is ready (tid=7fff6bffd090;cpuset=[83])
> EAL: lcore 84 is ready (tid=7fff6b7ed090;cpuset=[84])
> EAL: lcore 85 is ready (tid=7fff6afdd090;cpuset=[85])
> EAL: lcore 86 is ready (tid=7fff6a7cd090;cpuset=[86])
> EAL: lcore 87 is ready (tid=7fff69fbd090;cpuset=[87])
> EAL: lcore 88 is ready (tid=7fff697ad090;cpuset=[88])
> EAL: lcore 89 is ready (tid=7fff68f9d090;cpuset=[89])
> EAL: lcore 90 is ready (tid=7fff4bffd090;cpuset=[90])
> EAL: lcore 91 is ready (tid=7fff4b7ed090;cpuset=[91])
> EAL: lcore 92 is ready (tid=7fff4afdd090;cpuset=[92])
> EAL: lcore 93 is ready (tid=7fff4a7cd090;cpuset=[93])
> EAL: lcore 94 is ready (tid=7fff49fbd090;cpuset=[94])
> EAL: lcore 95 is ready (tid=7fff497ad090;cpuset=[95])
> EAL: lcore 96 is ready (tid=7fff48f9d090;cpuset=[96])
> EAL: lcore 97 is ready (tid=7fff2bffd090;cpuset=[97])
> EAL: lcore 98 is ready (tid=7fff2b7ed090;cpuset=[98])
> EAL: lcore 99 is ready (tid=7fff2afdd090;cpuset=[99])
> EAL: lcore 100 is ready (tid=7fff2a7cd090;cpuset=[100])
> EAL: lcore 101 is ready (tid=7fff29fbd090;cpuset=[101])
> EAL: lcore 102 is ready (tid=7fff297ad090;cpuset=[102])
> EAL: lcore 103 is ready (tid=7fff28f9d090;cpuset=[103])
> EAL: lcore 104 is ready (tid=7fff07ffd090;cpuset=[104])
> EAL: lcore 105 is ready (tid=7ffeff7ed090;cpuset=[105])
> EAL: lcore 106 is ready (tid=7fff077ed090;cpuset=[106])
> EAL: lcore 107 is ready (tid=7fff06fdd090;cpuset=[107])
> EAL: lcore 108 is ready (tid=7fff067cd090;cpuset=[108])
> EAL: lcore 109 is ready (tid=7fff05fbd090;cpuset=[109])
> EAL: lcore 110 is ready (tid=7fff057ad090;cpuset=[110])
> EAL: lcore 111 is ready (tid=7fff04f9d090;cpuset=[111])
> EAL: lcore 112 is ready (tid=7ffeffffd090;cpuset=[112])
> EAL: lcore 113 is ready (tid=7ffefefdd090;cpuset=[113])
> EAL: lcore 114 is ready (tid=7ffefe7cd090;cpuset=[114])
> EAL: lcore 115 is ready (tid=7ffefdfbd090;cpuset=[115])
> EAL: lcore 116 is ready (tid=7ffefd7ad090;cpuset=[116])
> EAL: lcore 117 is ready (tid=7ffefcf9d090;cpuset=[117])
> EAL: lcore 118 is ready (tid=7ffecfffd090;cpuset=[118])
> EAL: lcore 119 is ready (tid=7ffecf7ed090;cpuset=[119])
> EAL: lcore 120 is ready (tid=7ffecefdd090;cpuset=[120])
> EAL: lcore 121 is ready (tid=7ffece7cd090;cpuset=[121])
> EAL: lcore 122 is ready (tid=7ffecdfbd090;cpuset=[122])
> EAL: lcore 123 is ready (tid=7ffecd7ad090;cpuset=[123])
> EAL: lcore 124 is ready (tid=7ffeccf9d090;cpuset=[124])
> EAL: lcore 125 is ready (tid=7ffe9bffd090;cpuset=[125])
> EAL: lcore 126 is ready (tid=7ffe9b7ed090;cpuset=[126])
> EAL: lcore 127 is ready (tid=7ffe9afdd090;cpuset=[127])
> EAL: Trying to obtain current memory policy.
> EAL: Setting policy MPOL_PREFERRED for socket 8
> EAL: Restoring previous memory policy: 0
> EAL: request: mp_malloc_sync
> EAL: Heap on socket 8 was expanded by 1024MB
> EAL: PCI device 0000:01:00.0 on NUMA socket 0
> EAL:   probe driver: 15b3:1019 mlx5_pci
> EAL: Probe PCI driver: mlx5_pci (15b3:1019) device: 0000:01:00.0 (socket 0)
> EAL: Mem event callback 'MLX5_MEM_EVENT_CB:(nil)' registered
> EAL: Trying to obtain current memory policy.
> EAL: Setting policy MPOL_PREFERRED for socket 0
> EAL: Restoring previous memory policy: 0
> EAL: Calling mem event callback 'MLX5_MEM_EVENT_CB:(nil)'
> EAL: request: mp_malloc_sync
> EAL: Heap on socket 0 was expanded by 1024MB
> mlx5_pci: Size 0xFFFF is not power of 2, will be aligned to 0x10000.
> mlx5_pci: Default miss action is not supported.
> EAL: PCI device 0000:01:00.1 on NUMA socket 0
> EAL:   probe driver: 15b3:1019 mlx5_pci
> EAL: Probe PCI driver: mlx5_pci (15b3:1019) device: 0000:01:00.1 (socket 0)
> mlx5_pci: Size 0xFFFF is not power of 2, will be aligned to 0x10000.
> mlx5_pci: Default miss action is not supported.
> EAL: PCI device 0003:01:00.0 on NUMA socket 0
> EAL:   probe driver: 14e4:168a net_bnx2x
> EAL:   Not managed by a supported kernel driver, skipped
> EAL: PCI device 0003:01:00.1 on NUMA socket 0
> EAL:   probe driver: 14e4:168a net_bnx2x
> EAL:   Not managed by a supported kernel driver, skipped
> EAL: PCI device 0003:01:00.2 on NUMA socket 0
> EAL:   probe driver: 14e4:168a net_bnx2x
> EAL:   Not managed by a supported kernel driver, skipped
> EAL: PCI device 0003:01:00.3 on NUMA socket 0
> EAL:   probe driver: 14e4:168a net_bnx2x
> EAL:   Not managed by a supported kernel driver, skipped
> EAL: PCI device 0030:01:00.0 on NUMA socket 8
> EAL:   probe driver: 15b3:1019 mlx5_pci
> EAL: Probe PCI driver: mlx5_pci (15b3:1019) device: 0030:01:00.0 (socket 8)
> mlx5_pci: Size 0xFFFF is not power of 2, will be aligned to 0x10000.
> mlx5_pci: Default miss action is not supported.
> EAL: PCI device 0030:01:00.1 on NUMA socket 8
> EAL:   probe driver: 15b3:1019 mlx5_pci
> EAL: Probe PCI driver: mlx5_pci (15b3:1019) device: 0030:01:00.1 (socket 8)
> mlx5_pci: Size 0xFFFF is not power of 2, will be aligned to 0x10000.
> mlx5_pci: Default miss action is not supported.
> EAL: PCI device 0034:01:00.0 on NUMA socket 8
> EAL:   probe driver: 8086:1583 net_i40e
> EAL:   set IOMMU type 1 (Type 1) failed, error 19 (No such device)
> EAL:   using IOMMU type 7 (sPAPR)
> EAL: Highest VA address in memseg list is 0x2200000000
> EAL: Setting DMA window size to 0x4000000000
> EAL: Mem event callback 'vfio_mem_event_clb:(nil)' registered
> EAL: Installed memory event callback for VFIO
> EAL: VFIO reports MSI-X BAR as mappable
> EAL:   PCI memory mapped at 0x2200000000
> EAL:   PCI memory mapped at 0x2200800000
> EAL: Probe PCI driver: net_i40e (8086:1583) device: 0034:01:00.0 (socket 8)
> EAL: PCI device 0034:01:00.1 on NUMA socket 8
> EAL:   probe driver: 8086:1583 net_i40e
> EAL: VFIO reports MSI-X BAR as mappable
> EAL:   PCI memory mapped at 0x2200810000
> EAL:   PCI memory mapped at 0x2201010000
> EAL: Probe PCI driver: net_i40e (8086:1583) device: 0034:01:00.1 (socket 8)
> APP: HPET is not enabled, using TSC as default timer
> RTE>>vfio_autotest
> DRC: sz = 0x20000
> DRC: mem = 0x0x7fffb8680000
> EAL:   dma map attempt outside DMA window
> EAL: Failed to map DMA
> EAL: Couldn't map new region for DMA
> Failed to dma map whole region, ret=-1(No such file or directory)
> Error vfio dma map/unmap, ret=-1
> Test Failed
> RTE>>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v5 0/4] fix issue with partial DMA unmap
  2020-10-12  8:11 [dpdk-dev] [PATCH 0/2] fix issue with partial DMA unmap Nithin Dabilpuram
                   ` (4 preceding siblings ...)
  2020-12-02  5:46 ` [dpdk-dev] [PATCH v4 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
@ 2020-12-14  8:19 ` Nithin Dabilpuram
  2020-12-14  8:19   ` [dpdk-dev] [PATCH v5 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
                     ` (3 more replies)
  2020-12-17 19:06 ` [dpdk-dev] [PATCH v6 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
                   ` (2 subsequent siblings)
  8 siblings, 4 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-14  8:19 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram

Partial DMA unmap is not supported by the VFIO type1 IOMMU
in Linux. Though the return value is zero, the returned
DMA unmap size is not the same as the expected size.
So add a test case and fixes for both heap-triggered DMA
mapping and user-triggered DMA mapping/unmapping.

Refer to vfio_dma_do_unmap() in drivers/vfio/vfio_iommu_type1.c.
A snippet of the comment is below.

        /*
         * vfio-iommu-type1 (v1) - User mappings were coalesced together to
         * avoid tracking individual mappings.  This means that the granularity
         * of the original mapping was lost and the user was allowed to attempt
         * to unmap any range.  Depending on the contiguousness of physical
         * memory and page sizes supported by the IOMMU, arbitrary unmaps may
         * or may not have worked.  We only guaranteed unmap granularity
         * matching the original mapping; even though it was untracked here,
         * the original mappings are reflected in IOMMU mappings.  This
         * resulted in a couple unusual behaviors.  First, if a range is not
         * able to be unmapped, ex. a set of 4k pages that was mapped as a
         * 2M hugepage into the IOMMU, the unmap ioctl returns success but with
         * a zero sized unmap.  Also, if an unmap request overlaps the first
         * address of a hugepage, the IOMMU will unmap the entire hugepage.
         * This also returns success and the returned unmap size reflects the
         * actual size unmapped.

         * We attempt to maintain compatibility with this "v1" interface, but  
         * we take control out of the hands of the IOMMU.  Therefore, an unmap 
         * request offset from the beginning of the original mapping will      
         * return success with zero sized unmap.  And an unmap request covering
         * the first iova of mapping will unmap the entire range.              

This behavior can be verified by applying the first patch and adding a
return check for dma_unmap.size != len in vfio_type1_dma_mem_map().

v5:
- Changed vfio test in test_vfio.c to use system pages allocated from
  heap instead of mmap() so that it comes in range of initially configured
  window for POWER9 System.
- Added acked-by from David for 1/4, 2/4.

v4:
- Fixed issue with patch 4/4 on x86 builds.

v3:
- Fixed external memory test case(4/4) to use system page size
  instead of 4K.
- Fixed check-git-log.sh issue and rebased.
- Added acked-by from anatoly.burakov@intel.com to first 3 patches.

v2: 
- Reverted the earlier commit that enabled merging contiguous mappings for
  IOVA as PA. (see 1/3)
- Updated documentation about kernel dma mapping limits and vfio
  module parameter.
- Moved vfio test to test_vfio.c and handled comments from
  Anatoly.

Nithin Dabilpuram (4):
  vfio: revert changes for map contiguous areas in one go
  vfio: fix DMA mapping granularity for type1 IOVA as VA
  test: add test case to validate VFIO DMA map/unmap
  test: change external memory test to use system page sz

 app/test/meson.build                   |   1 +
 app/test/test_external_mem.c           |   3 +-
 app/test/test_vfio.c                   | 106 +++++++++++++++++++++++++++++++++
 doc/guides/linux_gsg/linux_drivers.rst |  10 ++++
 lib/librte_eal/linux/eal_vfio.c        |  93 +++++++++++------------------
 lib/librte_eal/linux/eal_vfio.h        |   1 +
 6 files changed, 156 insertions(+), 58 deletions(-)
 create mode 100644 app/test/test_vfio.c

-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v5 1/4] vfio: revert changes for map contiguous areas in one go
  2020-12-14  8:19 ` [dpdk-dev] [PATCH v5 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
@ 2020-12-14  8:19   ` Nithin Dabilpuram
  2020-12-14  8:19   ` [dpdk-dev] [PATCH v5 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-14  8:19 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram, stable

In order to conserve DMA entries, which are limited by the kernel, for
both external memory and hugepage memory, an attempt was made to map
physically contiguous memory in one go. This cannot be done, as VFIO
IOMMU type1 does not support partially unmapping a previously mapped
memory region, while the heap can request multi-page mapping and
partial unmapping.
Hence, to go back to the old method of mapping/unmapping at
memseg granularity, this commit reverts
commit d1c7c0cdf7ba ("vfio: map contiguous areas in one go").

Also add documentation on which module parameter needs to be used
to increase the per-container DMA map limit for VFIO.

Fixes: d1c7c0cdf7ba ("vfio: map contiguous areas in one go")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
---
 doc/guides/linux_gsg/linux_drivers.rst | 10 ++++++
 lib/librte_eal/linux/eal_vfio.c        | 59 +++++-----------------------------
 2 files changed, 18 insertions(+), 51 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_drivers.rst b/doc/guides/linux_gsg/linux_drivers.rst
index 90635a4..9a662a7 100644
--- a/doc/guides/linux_gsg/linux_drivers.rst
+++ b/doc/guides/linux_gsg/linux_drivers.rst
@@ -25,6 +25,16 @@ To make use of VFIO, the ``vfio-pci`` module must be loaded:
 VFIO kernel is usually present by default in all distributions,
 however please consult your distributions documentation to make sure that is the case.
 
+For DMA mapping of either external memory or hugepages, VFIO interface is used.
+VFIO does not support partial unmap of once mapped memory. Hence DPDK's memory is
+mapped in hugepage granularity or system page granularity. Number of DMA
+mappings is limited by kernel with user locked memory limit of a process(rlimit)
+for system/hugepage memory. Another per-container overall limit applicable both
+for external memory and system memory was added in kernel 5.1 defined by
+VFIO module parameter ``dma_entry_limit`` with a default value of 64K.
+When application is out of DMA entries, these limits need to be adjusted to
+increase the allowed limit.
+
 Since Linux version 5.7,
 the ``vfio-pci`` module supports the creation of virtual functions.
 After the PF is bound to ``vfio-pci`` module,
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 0500824..64b134d 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -517,11 +517,9 @@ static void
 vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		void *arg __rte_unused)
 {
-	rte_iova_t iova_start, iova_expected;
 	struct rte_memseg_list *msl;
 	struct rte_memseg *ms;
 	size_t cur_len = 0;
-	uint64_t va_start;
 
 	msl = rte_mem_virt2memseg_list(addr);
 
@@ -539,63 +537,22 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
-
-	/*
-	 * This memory is not guaranteed to be contiguous, but it still could
-	 * be, or it could have some small contiguous chunks. Since the number
-	 * of VFIO mappings is limited, and VFIO appears to not concatenate
-	 * adjacent mappings, we have to do this ourselves.
-	 *
-	 * So, find contiguous chunks, then map them.
-	 */
-	va_start = ms->addr_64;
-	iova_start = iova_expected = ms->iova;
 	while (cur_len < len) {
-		bool new_contig_area = ms->iova != iova_expected;
-		bool last_seg = (len - cur_len) == ms->len;
-		bool skip_last = false;
-
-		/* only do mappings when current contiguous area ends */
-		if (new_contig_area) {
-			if (type == RTE_MEM_EVENT_ALLOC)
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 1);
-			else
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 0);
-			va_start = ms->addr_64;
-			iova_start = ms->iova;
-		}
 		/* some memory segments may have invalid IOVA */
 		if (ms->iova == RTE_BAD_IOVA) {
 			RTE_LOG(DEBUG, EAL, "Memory segment at %p has bad IOVA, skipping\n",
 					ms->addr);
-			skip_last = true;
+			goto next;
 		}
-		iova_expected = ms->iova + ms->len;
+		if (type == RTE_MEM_EVENT_ALLOC)
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 1);
+		else
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 0);
+next:
 		cur_len += ms->len;
 		++ms;
-
-		/*
-		 * don't count previous segment, and don't attempt to
-		 * dereference a potentially invalid pointer.
-		 */
-		if (skip_last && !last_seg) {
-			iova_expected = iova_start = ms->iova;
-			va_start = ms->addr_64;
-		} else if (!skip_last && last_seg) {
-			/* this is the last segment and we're not skipping */
-			if (type == RTE_MEM_EVENT_ALLOC)
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 1);
-			else
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 0);
-		}
 	}
 }
 
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v5 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA
  2020-12-14  8:19 ` [dpdk-dev] [PATCH v5 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
  2020-12-14  8:19   ` [dpdk-dev] [PATCH v5 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
@ 2020-12-14  8:19   ` Nithin Dabilpuram
  2020-12-14  8:19   ` [dpdk-dev] [PATCH v5 3/4] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
  2020-12-14  8:19   ` [dpdk-dev] [PATCH v5 4/4] test: change external memory test to use system page sz Nithin Dabilpuram
  3 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-14  8:19 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram, stable

Partial unmapping is not supported for VFIO IOMMU type1
by the kernel. Though the kernel returns zero, the unmapped size
returned will not be the same as expected. So check the
returned unmap size and return an error.

For IOVA as PA, DMA mapping is already done at memseg size
granularity. Do the same even in IOVA as VA mode: for DMA
map/unmap triggered by heap allocations, maintain the
granularity of the memseg page size so that heap
expansion and contraction do not hit this issue.

For user-requested DMA map/unmap, disallow partial unmapping
for VFIO type1.

Fixes: 73a639085938 ("vfio: allow to map other memory regions")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
---
 lib/librte_eal/linux/eal_vfio.c | 34 ++++++++++++++++++++++++++++------
 lib/librte_eal/linux/eal_vfio.h |  1 +
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 64b134d..b15b758 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -70,6 +70,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_TYPE1,
 		.name = "Type 1",
+		.partial_unmap = false,
 		.dma_map_func = &vfio_type1_dma_map,
 		.dma_user_map_func = &vfio_type1_dma_mem_map
 	},
@@ -77,6 +78,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_SPAPR,
 		.name = "sPAPR",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_spapr_dma_map,
 		.dma_user_map_func = &vfio_spapr_dma_mem_map
 	},
@@ -84,6 +86,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_NOIOMMU,
 		.name = "No-IOMMU",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_noiommu_dma_map,
 		.dma_user_map_func = &vfio_noiommu_dma_mem_map
 	},
@@ -526,12 +529,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 	/* for IOVA as VA mode, no need to care for IOVA addresses */
 	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
 		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
-		if (type == RTE_MEM_EVENT_ALLOC)
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 1);
-		else
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 0);
+		uint64_t page_sz = msl->page_sz;
+
+		/* Maintain granularity of DMA map/unmap to memseg size */
+		for (; cur_len < len; cur_len += page_sz) {
+			if (type == RTE_MEM_EVENT_ALLOC)
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 1);
+			else
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 0);
+			vfio_va += page_sz;
+		}
+
 		return;
 	}
 
@@ -1348,6 +1358,12 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
 					errno, strerror(errno));
 			return -1;
+		} else if (dma_unmap.size != len) {
+			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
+				"remapping cleared instead of %"PRIu64"\n",
+				(uint64_t)dma_unmap.size, len);
+			rte_errno = EIO;
+			return -1;
 		}
 	}
 
@@ -1823,6 +1839,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
 		/* we're partially unmapping a previously mapped region, so we
 		 * need to split entry into two.
 		 */
+		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
+			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
+			rte_errno = ENOTSUP;
+			ret = -1;
+			goto out;
+		}
 		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
 			rte_errno = ENOMEM;
diff --git a/lib/librte_eal/linux/eal_vfio.h b/lib/librte_eal/linux/eal_vfio.h
index cb2d35f..6ebaca6 100644
--- a/lib/librte_eal/linux/eal_vfio.h
+++ b/lib/librte_eal/linux/eal_vfio.h
@@ -113,6 +113,7 @@ typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova,
 struct vfio_iommu_type {
 	int type_id;
 	const char *name;
+	bool partial_unmap;
 	vfio_dma_user_func_t dma_user_map_func;
 	vfio_dma_func_t dma_map_func;
 };
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v5 3/4] test: add test case to validate VFIO DMA map/unmap
  2020-12-14  8:19 ` [dpdk-dev] [PATCH v5 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
  2020-12-14  8:19   ` [dpdk-dev] [PATCH v5 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
  2020-12-14  8:19   ` [dpdk-dev] [PATCH v5 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
@ 2020-12-14  8:19   ` Nithin Dabilpuram
  2020-12-14  8:19   ` [dpdk-dev] [PATCH v5 4/4] test: change external memory test to use system page sz Nithin Dabilpuram
  3 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-14  8:19 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram

The test case allocates system pages and tries to perform a user
DMA map and unmap, both partially and fully.

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 app/test/meson.build |   1 +
 app/test/test_vfio.c | 106 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 107 insertions(+)
 create mode 100644 app/test/test_vfio.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 94fd39f..d9eedb6 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -139,6 +139,7 @@ test_sources = files('commands.c',
 	'test_trace_register.c',
 	'test_trace_perf.c',
 	'test_version.c',
+	'test_vfio.c',
 	'virtual_pmd.c'
 )
 
diff --git a/app/test/test_vfio.c b/app/test/test_vfio.c
new file mode 100644
index 0000000..9febf35
--- /dev/null
+++ b/app/test/test_vfio.c
@@ -0,0 +1,106 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2020 Marvell.
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_common.h>
+#include <rte_eal.h>
+#include <rte_eal_paging.h>
+#include <rte_errno.h>
+#include <rte_memory.h>
+#include <rte_vfio.h>
+
+#include "test.h"
+
+static int
+test_memory_vfio_dma_map(void)
+{
+	uint64_t sz1, sz2, sz = 2 * rte_mem_page_size();
+	uint64_t unmap1, unmap2;
+	uint8_t *alloc_mem;
+	uint8_t *mem;
+	int ret;
+
+	/* Allocate twice size of requirement from heap to align later */
+	alloc_mem = malloc(sz * 2);
+	if (!alloc_mem) {
+		printf("Skipping test as unable to alloc %luB from heap\n",
+		       sz * 2);
+		return 1;
+	}
+
+	/* Force page allocation */
+	memset(alloc_mem, 0, sz * 2);
+
+	mem = RTE_PTR_ALIGN(alloc_mem, rte_mem_page_size());
+
+	/* map the whole region */
+	ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					 (uintptr_t)mem, (rte_iova_t)mem, sz);
+	if (ret) {
+		/* Check if VFIO is not available or no device is probed */
+		if (rte_errno == ENOTSUP || rte_errno == ENODEV) {
+			ret = 1;
+			goto fail;
+		}
+		printf("Failed to dma map whole region, ret=%d(%s)\n",
+		       ret, rte_strerror(rte_errno));
+		goto fail;
+	}
+
+	unmap1 = (uint64_t)mem + (sz / 2);
+	sz1 = sz / 2;
+	unmap2 = (uint64_t)mem;
+	sz2 = sz / 2;
+	/* unmap the partial region */
+	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					   unmap1, (rte_iova_t)unmap1, sz1);
+	if (ret) {
+		if (rte_errno == ENOTSUP) {
+			printf("Partial dma unmap not supported\n");
+			unmap2 = (uint64_t)mem;
+			sz2 = sz;
+		} else {
+			printf("Failed to unmap second half region, ret=%d(%s)\n",
+			       ret, rte_strerror(rte_errno));
+			goto fail;
+		}
+	}
+
+	/* unmap the remaining region */
+	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					   unmap2, (rte_iova_t)unmap2, sz2);
+	if (ret) {
+		printf("Failed to unmap remaining region, ret=%d(%s)\n", ret,
+		       rte_strerror(rte_errno));
+		goto fail;
+	}
+
+fail:
+	free(alloc_mem);
+	return ret;
+}
+
+static int
+test_vfio(void)
+{
+	int ret;
+
+	/* test for vfio dma map/unmap */
+	ret = test_memory_vfio_dma_map();
+	if (ret == 1) {
+		printf("VFIO dma map/unmap unsupported\n");
+	} else if (ret < 0) {
+		printf("Error vfio dma map/unmap, ret=%d\n", ret);
+		return -1;
+	}
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(vfio_autotest, test_vfio);
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v5 4/4] test: change external memory test to use system page sz
  2020-12-14  8:19 ` [dpdk-dev] [PATCH v5 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
                     ` (2 preceding siblings ...)
  2020-12-14  8:19   ` [dpdk-dev] [PATCH v5 3/4] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
@ 2020-12-14  8:19   ` Nithin Dabilpuram
  3 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-14  8:19 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram

Currently the external memory test uses a 4K page size,
but VFIO DMA mapping works only at system page granularity.

Earlier this worked because all contiguous mappings were
coalesced and mapped in one go, which effectively produced
a much larger page. Now that VFIO DMA mappings, in both
IOVA as VA and IOVA as PA modes, are done at memseg list
granularity, the test needs to use the system page size.

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
---
 app/test/test_external_mem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/app/test/test_external_mem.c b/app/test/test_external_mem.c
index 7eb81f6..5edf88b 100644
--- a/app/test/test_external_mem.c
+++ b/app/test/test_external_mem.c
@@ -13,6 +13,7 @@
 #include <rte_common.h>
 #include <rte_debug.h>
 #include <rte_eal.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
 #include <rte_malloc.h>
 #include <rte_ring.h>
@@ -532,8 +533,8 @@ test_extmem_basic(void *addr, size_t len, size_t pgsz, rte_iova_t *iova,
 static int
 test_external_mem(void)
 {
+	size_t pgsz = rte_mem_page_size();
 	size_t len = EXTERNAL_MEM_SZ;
-	size_t pgsz = RTE_PGSIZE_4K;
 	rte_iova_t iova[len / pgsz];
 	void *addr;
 	int ret, n_pages;
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v4 3/4] test: add test case to validate VFIO DMA map/unmap
  2020-12-03  7:14       ` Nithin Dabilpuram
@ 2020-12-14  8:24         ` Nithin Dabilpuram
  0 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-14  8:24 UTC (permalink / raw)
  To: David Christensen; +Cc: anatoly.burakov, david.marchand, jerinj, dev

Hi David,

As mentioned below in #1, I sent v5 with memory allocated from the heap, which I
think falls within the initially configured DMA window 0x0 .... 0x4000000000, at
least on Linux, since DPDK memory starts after the heap.

Let me know if this is OK for POWER9 systems.

On Thu, Dec 03, 2020 at 12:44:06PM +0530, Nithin Dabilpuram wrote:
> On Wed, Dec 02, 2020 at 11:23:09AM -0800, David Christensen wrote:
> > 
> > 
> > On 12/1/20 9:46 PM, Nithin Dabilpuram wrote:
> > > Test case mmap's system pages and tries to performs a user
> > > DMA map and unmap both partially and fully.
> > > 
> > > Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> > > Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
> > > ---
> > >   app/test/meson.build |   1 +
> > >   app/test/test_vfio.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > >   2 files changed, 104 insertions(+)
> > >   create mode 100644 app/test/test_vfio.c
> > > 
> > > diff --git a/app/test/meson.build b/app/test/meson.build
> > > index 94fd39f..d9eedb6 100644
> > > --- a/app/test/meson.build
> > > +++ b/app/test/meson.build
> > > @@ -139,6 +139,7 @@ test_sources = files('commands.c',
> > >   	'test_trace_register.c',
> > >   	'test_trace_perf.c',
> > >   	'test_version.c',
> > > +	'test_vfio.c',
> > >   	'virtual_pmd.c'
> > >   )
> > > 
> > > diff --git a/app/test/test_vfio.c b/app/test/test_vfio.c
> > > new file mode 100644
> > > index 0000000..00626d4
> > > --- /dev/null
> > > +++ b/app/test/test_vfio.c
> > > @@ -0,0 +1,103 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright(C) 2020 Marvell.
> > > + */
> > > +
> > > +#include <stdio.h>
> > > +#include <stdint.h>
> > > +#include <string.h>
> > > +#include <sys/mman.h>
> > > +#include <unistd.h>
> > > +
> > > +#include <rte_common.h>
> > > +#include <rte_eal.h>
> > > +#include <rte_eal_paging.h>
> > > +#include <rte_errno.h>
> > > +#include <rte_memory.h>
> > > +#include <rte_vfio.h>
> > > +
> > > +#include "test.h"
> > > +
> > > +static int
> > > +test_memory_vfio_dma_map(void)
> > > +{
> > > +	uint64_t sz1, sz2, sz = 2 * rte_mem_page_size();
> > > +	uint64_t unmap1, unmap2;
> > > +	uint8_t *mem;
> > > +	int ret;
> > > +
> > > +	/* Allocate twice size of page */
> > > +	mem = mmap(NULL, sz, PROT_READ | PROT_WRITE,
> > > +		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > +	if (mem == MAP_FAILED) {
> > > +		printf("Failed to allocate memory for external heap\n");
> > > +		return -1;
> > > +	}
> > > +
> > > +	/* Force page allocation */
> > > +	memset(mem, 0, sz);
> > > +
> > > +	/* map the whole region */
> > > +	ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > > +					 (uintptr_t)mem, (rte_iova_t)mem, sz);
> > > +	if (ret) {
> > > +		/* Check if VFIO is not available or no device is probed */
> > > +		if (rte_errno == ENOTSUP || rte_errno == ENODEV) {
> > > +			ret = 1;
> > > +			goto fail;
> > > +		}
> > > +		printf("Failed to dma map whole region, ret=%d(%s)\n",
> > > +		       ret, rte_strerror(rte_errno));
> > > +		goto fail;
> > > +	}
> > > +
> > > +	unmap1 = (uint64_t)mem + (sz / 2);
> > > +	sz1 = sz / 2;
> > > +	unmap2 = (uint64_t)mem;
> > > +	sz2 = sz / 2;
> > > +	/* unmap the partial region */
> > > +	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > > +					   unmap1, (rte_iova_t)unmap1, sz1);
> > > +	if (ret) {
> > > +		if (rte_errno == ENOTSUP) {
> > > +			printf("Partial dma unmap not supported\n");
> > > +			unmap2 = (uint64_t)mem;
> > > +			sz2 = sz;
> > > +		} else {
> > > +			printf("Failed to unmap second half region, ret=%d(%s)\n",
> > > +			       ret, rte_strerror(rte_errno));
> > > +			goto fail;
> > > +		}
> > > +	}
> > > +
> > > +	/* unmap the remaining region */
> > > +	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > > +					   unmap2, (rte_iova_t)unmap2, sz2);
> > > +	if (ret) {
> > > +		printf("Failed to unmap remaining region, ret=%d(%s)\n", ret,
> > > +		       rte_strerror(rte_errno));
> > > +		goto fail;
> > > +	}
> > > +
> > > +fail:
> > > +	munmap(mem, sz);
> > > +	return ret;
> > > +}
> > > +
> > > +static int
> > > +test_vfio(void)
> > > +{
> > > +	int ret;
> > > +
> > > +	/* test for vfio dma map/unmap */
> > > +	ret = test_memory_vfio_dma_map();
> > > +	if (ret == 1) {
> > > +		printf("VFIO dma map/unmap unsupported\n");
> > > +	} else if (ret < 0) {
> > > +		printf("Error vfio dma map/unmap, ret=%d\n", ret);
> > > +		return -1;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +REGISTER_TEST_COMMAND(vfio_autotest, test_vfio);
> > > 
> > 
> > The test as written fails on a POWER9 system (see below for debug output).
> > 
> > IOMMU on POWER systems requires that a DMA window be defined and that all
> > DMA mappings reside within that window.  In this test, the DMA window is
> > defined as 0x0 to 0x4000000000, but the VA allocated in your test is
> > 0x7fffb8680000, well outside that range.
> > 
> > I recently submitted a change in the 20.11 release which scans the memseg
> > list in order to set the DMA window.  The result can be seen here:
> > 
> > EAL: Highest VA address in memseg list is 0x2200000000
> > EAL: Setting DMA window size to 0x4000000000
> > 
> 
> I missed that thread. So basically external memory in IOVA as VA mode
> is not supported on POWER9 systems, as its memseg lists can be created
> later, after the DMA window size is fixed. Correct?
> 
> > Can we modify the test to allocate memory out of the existing memseg
> > allocations?
> 
> Since I'm mmap'ing normal pages for this test outside EAL, I cannot use the
> memseg list range as VA, as it is already reserved by the memseg lists.
> 
> I can see only three options left.
> 
> #1 Use the initial process VA range by using heap memory instead of mmap,
>    which falls below EAL's base_virtaddr for both FreeBSD and Windows, and
>    I also think your DMA window will include it.
> 
> #2 Use PA from the real PA range, as in external_mem_autotest(). The test
>    currently acts as an IOVA as VA test, but this change would make it
>    IOVA as PA.
> 
> #3 Disable this test for VFIO sPAPR or remove it completely.
> 
> Will #1 work in your case?
> > Dave
> > 
> > $ sudo ~/src/dpdk/build/app/test/dpdk-test --log="eal,debug" --iova-mode=va
> > -l 64-127
> > EAL: Detected lcore 0 as core 0 on socket 0
> > EAL: Detected lcore 1 as core 0 on socket 0
> > EAL: Detected lcore 2 as core 0 on socket 0
> > EAL: Detected lcore 3 as core 0 on socket 0
> > EAL: Detected lcore 4 as core 4 on socket 0
> > EAL: Detected lcore 5 as core 4 on socket 0
> > EAL: Detected lcore 6 as core 4 on socket 0
> > EAL: Detected lcore 7 as core 4 on socket 0
> > EAL: Detected lcore 8 as core 8 on socket 0
> > EAL: Detected lcore 9 as core 8 on socket 0
> > EAL: Detected lcore 10 as core 8 on socket 0
> > EAL: Detected lcore 11 as core 8 on socket 0
> > EAL: Detected lcore 12 as core 12 on socket 0
> > EAL: Detected lcore 13 as core 12 on socket 0
> > EAL: Detected lcore 14 as core 12 on socket 0
> > EAL: Detected lcore 15 as core 12 on socket 0
> > EAL: Detected lcore 16 as core 16 on socket 0
> > EAL: Detected lcore 17 as core 16 on socket 0
> > EAL: Detected lcore 18 as core 16 on socket 0
> > EAL: Detected lcore 19 as core 16 on socket 0
> > EAL: Detected lcore 20 as core 20 on socket 0
> > EAL: Detected lcore 21 as core 20 on socket 0
> > EAL: Detected lcore 22 as core 20 on socket 0
> > EAL: Detected lcore 23 as core 20 on socket 0
> > EAL: Detected lcore 24 as core 24 on socket 0
> > EAL: Detected lcore 25 as core 24 on socket 0
> > EAL: Detected lcore 26 as core 24 on socket 0
> > EAL: Detected lcore 27 as core 24 on socket 0
> > EAL: Detected lcore 28 as core 28 on socket 0
> > EAL: Detected lcore 29 as core 28 on socket 0
> > EAL: Detected lcore 30 as core 28 on socket 0
> > EAL: Detected lcore 31 as core 28 on socket 0
> > EAL: Detected lcore 32 as core 32 on socket 0
> > EAL: Detected lcore 33 as core 32 on socket 0
> > EAL: Detected lcore 34 as core 32 on socket 0
> > EAL: Detected lcore 35 as core 32 on socket 0
> > EAL: Detected lcore 36 as core 36 on socket 0
> > EAL: Detected lcore 37 as core 36 on socket 0
> > EAL: Detected lcore 38 as core 36 on socket 0
> > EAL: Detected lcore 39 as core 36 on socket 0
> > EAL: Detected lcore 40 as core 48 on socket 0
> > EAL: Detected lcore 41 as core 48 on socket 0
> > EAL: Detected lcore 42 as core 48 on socket 0
> > EAL: Detected lcore 43 as core 48 on socket 0
> > EAL: Detected lcore 44 as core 52 on socket 0
> > EAL: Detected lcore 45 as core 52 on socket 0
> > EAL: Detected lcore 46 as core 52 on socket 0
> > EAL: Detected lcore 47 as core 52 on socket 0
> > EAL: Detected lcore 48 as core 72 on socket 0
> > EAL: Detected lcore 49 as core 72 on socket 0
> > EAL: Detected lcore 50 as core 72 on socket 0
> > EAL: Detected lcore 51 as core 72 on socket 0
> > EAL: Detected lcore 52 as core 76 on socket 0
> > EAL: Detected lcore 53 as core 76 on socket 0
> > EAL: Detected lcore 54 as core 76 on socket 0
> > EAL: Detected lcore 55 as core 76 on socket 0
> > EAL: Detected lcore 56 as core 80 on socket 0
> > EAL: Detected lcore 57 as core 80 on socket 0
> > EAL: Detected lcore 58 as core 80 on socket 0
> > EAL: Detected lcore 59 as core 80 on socket 0
> > EAL: Detected lcore 60 as core 84 on socket 0
> > EAL: Detected lcore 61 as core 84 on socket 0
> > EAL: Detected lcore 62 as core 84 on socket 0
> > EAL: Detected lcore 63 as core 84 on socket 0
> > EAL: Detected lcore 64 as core 2048 on socket 8
> > EAL: Detected lcore 65 as core 2048 on socket 8
> > EAL: Detected lcore 66 as core 2048 on socket 8
> > EAL: Detected lcore 67 as core 2048 on socket 8
> > EAL: Detected lcore 68 as core 2052 on socket 8
> > EAL: Detected lcore 69 as core 2052 on socket 8
> > EAL: Detected lcore 70 as core 2052 on socket 8
> > EAL: Detected lcore 71 as core 2052 on socket 8
> > EAL: Detected lcore 72 as core 2056 on socket 8
> > EAL: Detected lcore 73 as core 2056 on socket 8
> > EAL: Detected lcore 74 as core 2056 on socket 8
> > EAL: Detected lcore 75 as core 2056 on socket 8
> > EAL: Detected lcore 76 as core 2060 on socket 8
> > EAL: Detected lcore 77 as core 2060 on socket 8
> > EAL: Detected lcore 78 as core 2060 on socket 8
> > EAL: Detected lcore 79 as core 2060 on socket 8
> > EAL: Detected lcore 80 as core 2072 on socket 8
> > EAL: Detected lcore 81 as core 2072 on socket 8
> > EAL: Detected lcore 82 as core 2072 on socket 8
> > EAL: Detected lcore 83 as core 2072 on socket 8
> > EAL: Detected lcore 84 as core 2076 on socket 8
> > EAL: Detected lcore 85 as core 2076 on socket 8
> > EAL: Detected lcore 86 as core 2076 on socket 8
> > EAL: Detected lcore 87 as core 2076 on socket 8
> > EAL: Detected lcore 88 as core 2080 on socket 8
> > EAL: Detected lcore 89 as core 2080 on socket 8
> > EAL: Detected lcore 90 as core 2080 on socket 8
> > EAL: Detected lcore 91 as core 2080 on socket 8
> > EAL: Detected lcore 92 as core 2084 on socket 8
> > EAL: Detected lcore 93 as core 2084 on socket 8
> > EAL: Detected lcore 94 as core 2084 on socket 8
> > EAL: Detected lcore 95 as core 2084 on socket 8
> > EAL: Detected lcore 96 as core 2088 on socket 8
> > EAL: Detected lcore 97 as core 2088 on socket 8
> > EAL: Detected lcore 98 as core 2088 on socket 8
> > EAL: Detected lcore 99 as core 2088 on socket 8
> > EAL: Detected lcore 100 as core 2092 on socket 8
> > EAL: Detected lcore 101 as core 2092 on socket 8
> > EAL: Detected lcore 102 as core 2092 on socket 8
> > EAL: Detected lcore 103 as core 2092 on socket 8
> > EAL: Detected lcore 104 as core 2096 on socket 8
> > EAL: Detected lcore 105 as core 2096 on socket 8
> > EAL: Detected lcore 106 as core 2096 on socket 8
> > EAL: Detected lcore 107 as core 2096 on socket 8
> > EAL: Detected lcore 108 as core 2100 on socket 8
> > EAL: Detected lcore 109 as core 2100 on socket 8
> > EAL: Detected lcore 110 as core 2100 on socket 8
> > EAL: Detected lcore 111 as core 2100 on socket 8
> > EAL: Detected lcore 112 as core 2120 on socket 8
> > EAL: Detected lcore 113 as core 2120 on socket 8
> > EAL: Detected lcore 114 as core 2120 on socket 8
> > EAL: Detected lcore 115 as core 2120 on socket 8
> > EAL: Detected lcore 116 as core 2124 on socket 8
> > EAL: Detected lcore 117 as core 2124 on socket 8
> > EAL: Detected lcore 118 as core 2124 on socket 8
> > EAL: Detected lcore 119 as core 2124 on socket 8
> > EAL: Detected lcore 120 as core 2136 on socket 8
> > EAL: Detected lcore 121 as core 2136 on socket 8
> > EAL: Detected lcore 122 as core 2136 on socket 8
> > EAL: Detected lcore 123 as core 2136 on socket 8
> > EAL: Detected lcore 124 as core 2140 on socket 8
> > EAL: Detected lcore 125 as core 2140 on socket 8
> > EAL: Detected lcore 126 as core 2140 on socket 8
> > EAL: Detected lcore 127 as core 2140 on socket 8
> > EAL: Support maximum 1536 logical core(s) by configuration.
> > EAL: Detected 128 lcore(s)
> > EAL: Detected 2 NUMA nodes
> > EAL: Ask a virtual area of 0x10000 bytes
> > EAL: Virtual area found at 0x100000000 (size = 0x10000)
> > EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
> > EAL: DPAA Bus not present. Skipping.
> > EAL: VFIO PCI modules not loaded
> > EAL: Selected IOVA mode 'VA'
> > EAL: 2 hugepages of size 2097152 reserved, but no mounted hugetlbfs found
> > for that size
> > EAL: Probing VFIO support...
> > EAL:   IOMMU type 1 (Type 1) is not supported
> > EAL:   IOMMU type 7 (sPAPR) is supported
> > EAL:   IOMMU type 8 (No-IOMMU) is not supported
> > EAL: VFIO support initialized
> > EAL: Ask a virtual area of 0x30000 bytes
> > EAL: Virtual area found at 0x100010000 (size = 0x30000)
> > EAL: Setting up physically contiguous memory...
> > EAL: Setting maximum number of open files to 32768
> > EAL: Detected memory type: socket_id:0 hugepage_sz:1073741824
> > EAL: Detected memory type: socket_id:8 hugepage_sz:1073741824
> > EAL: Creating 2 segment lists: n_segs:32 socket_id:0 hugepage_sz:1073741824
> > EAL: Ask a virtual area of 0x10000 bytes
> > EAL: Virtual area found at 0x100040000 (size = 0x10000)
> > EAL: Memseg list allocated at socket 0, page size 0x100000kB
> > EAL: Ask a virtual area of 0x800000000 bytes
> > EAL: Virtual area found at 0x140000000 (size = 0x800000000)
> > EAL: VA reserved for memseg list at 0x140000000, size 800000000
> > EAL: Ask a virtual area of 0x10000 bytes
> > EAL: Virtual area found at 0x940000000 (size = 0x10000)
> > EAL: Memseg list allocated at socket 0, page size 0x100000kB
> > EAL: Ask a virtual area of 0x800000000 bytes
> > EAL: Virtual area found at 0x980000000 (size = 0x800000000)
> > EAL: VA reserved for memseg list at 0x980000000, size 800000000
> > EAL: Creating 2 segment lists: n_segs:32 socket_id:8 hugepage_sz:1073741824
> > EAL: Ask a virtual area of 0x10000 bytes
> > EAL: Virtual area found at 0x1180000000 (size = 0x10000)
> > EAL: Memseg list allocated at socket 8, page size 0x100000kB
> > EAL: Ask a virtual area of 0x800000000 bytes
> > EAL: Virtual area found at 0x11c0000000 (size = 0x800000000)
> > EAL: VA reserved for memseg list at 0x11c0000000, size 800000000
> > EAL: Ask a virtual area of 0x10000 bytes
> > EAL: Virtual area found at 0x19c0000000 (size = 0x10000)
> > EAL: Memseg list allocated at socket 8, page size 0x100000kB
> > EAL: Ask a virtual area of 0x800000000 bytes
> > EAL: Virtual area found at 0x1a00000000 (size = 0x800000000)
> > EAL: VA reserved for memseg list at 0x1a00000000, size 800000000
> > EAL: TSC frequency is ~510000 KHz
> > EAL: Main lcore 64 is ready (tid=7fffb8018890;cpuset=[64])
> > EAL: lcore 65 is ready (tid=7fffb64ad090;cpuset=[65])
> > EAL: lcore 66 is ready (tid=7fffb5c9d090;cpuset=[66])
> > EAL: lcore 67 is ready (tid=7fffb548d090;cpuset=[67])
> > EAL: lcore 68 is ready (tid=7fffb4c7d090;cpuset=[68])
> > EAL: lcore 69 is ready (tid=7fffa7ffd090;cpuset=[69])
> > EAL: lcore 70 is ready (tid=7fffa77ed090;cpuset=[70])
> > EAL: lcore 71 is ready (tid=7fffa6fdd090;cpuset=[71])
> > EAL: lcore 72 is ready (tid=7fffa67cd090;cpuset=[72])
> > EAL: lcore 73 is ready (tid=7fffa5fbd090;cpuset=[73])
> > EAL: lcore 74 is ready (tid=7fffa57ad090;cpuset=[74])
> > EAL: lcore 75 is ready (tid=7fffa4f9d090;cpuset=[75])
> > EAL: lcore 76 is ready (tid=7fff8fffd090;cpuset=[76])
> > EAL: lcore 77 is ready (tid=7fff8f7ed090;cpuset=[77])
> > EAL: lcore 78 is ready (tid=7fff8efdd090;cpuset=[78])
> > EAL: lcore 79 is ready (tid=7fff8e7cd090;cpuset=[79])
> > EAL: lcore 80 is ready (tid=7fff8dfbd090;cpuset=[80])
> > EAL: lcore 81 is ready (tid=7fff8d7ad090;cpuset=[81])
> > EAL: lcore 82 is ready (tid=7fff8cf9d090;cpuset=[82])
> > EAL: lcore 83 is ready (tid=7fff6bffd090;cpuset=[83])
> > EAL: lcore 84 is ready (tid=7fff6b7ed090;cpuset=[84])
> > EAL: lcore 85 is ready (tid=7fff6afdd090;cpuset=[85])
> > EAL: lcore 86 is ready (tid=7fff6a7cd090;cpuset=[86])
> > EAL: lcore 87 is ready (tid=7fff69fbd090;cpuset=[87])
> > EAL: lcore 88 is ready (tid=7fff697ad090;cpuset=[88])
> > EAL: lcore 89 is ready (tid=7fff68f9d090;cpuset=[89])
> > EAL: lcore 90 is ready (tid=7fff4bffd090;cpuset=[90])
> > EAL: lcore 91 is ready (tid=7fff4b7ed090;cpuset=[91])
> > EAL: lcore 92 is ready (tid=7fff4afdd090;cpuset=[92])
> > EAL: lcore 93 is ready (tid=7fff4a7cd090;cpuset=[93])
> > EAL: lcore 94 is ready (tid=7fff49fbd090;cpuset=[94])
> > EAL: lcore 95 is ready (tid=7fff497ad090;cpuset=[95])
> > EAL: lcore 96 is ready (tid=7fff48f9d090;cpuset=[96])
> > EAL: lcore 97 is ready (tid=7fff2bffd090;cpuset=[97])
> > EAL: lcore 98 is ready (tid=7fff2b7ed090;cpuset=[98])
> > EAL: lcore 99 is ready (tid=7fff2afdd090;cpuset=[99])
> > EAL: lcore 100 is ready (tid=7fff2a7cd090;cpuset=[100])
> > EAL: lcore 101 is ready (tid=7fff29fbd090;cpuset=[101])
> > EAL: lcore 102 is ready (tid=7fff297ad090;cpuset=[102])
> > EAL: lcore 103 is ready (tid=7fff28f9d090;cpuset=[103])
> > EAL: lcore 104 is ready (tid=7fff07ffd090;cpuset=[104])
> > EAL: lcore 105 is ready (tid=7ffeff7ed090;cpuset=[105])
> > EAL: lcore 106 is ready (tid=7fff077ed090;cpuset=[106])
> > EAL: lcore 107 is ready (tid=7fff06fdd090;cpuset=[107])
> > EAL: lcore 108 is ready (tid=7fff067cd090;cpuset=[108])
> > EAL: lcore 109 is ready (tid=7fff05fbd090;cpuset=[109])
> > EAL: lcore 110 is ready (tid=7fff057ad090;cpuset=[110])
> > EAL: lcore 111 is ready (tid=7fff04f9d090;cpuset=[111])
> > EAL: lcore 112 is ready (tid=7ffeffffd090;cpuset=[112])
> > EAL: lcore 113 is ready (tid=7ffefefdd090;cpuset=[113])
> > EAL: lcore 114 is ready (tid=7ffefe7cd090;cpuset=[114])
> > EAL: lcore 115 is ready (tid=7ffefdfbd090;cpuset=[115])
> > EAL: lcore 116 is ready (tid=7ffefd7ad090;cpuset=[116])
> > EAL: lcore 117 is ready (tid=7ffefcf9d090;cpuset=[117])
> > EAL: lcore 118 is ready (tid=7ffecfffd090;cpuset=[118])
> > EAL: lcore 119 is ready (tid=7ffecf7ed090;cpuset=[119])
> > EAL: lcore 120 is ready (tid=7ffecefdd090;cpuset=[120])
> > EAL: lcore 121 is ready (tid=7ffece7cd090;cpuset=[121])
> > EAL: lcore 122 is ready (tid=7ffecdfbd090;cpuset=[122])
> > EAL: lcore 123 is ready (tid=7ffecd7ad090;cpuset=[123])
> > EAL: lcore 124 is ready (tid=7ffeccf9d090;cpuset=[124])
> > EAL: lcore 125 is ready (tid=7ffe9bffd090;cpuset=[125])
> > EAL: lcore 126 is ready (tid=7ffe9b7ed090;cpuset=[126])
> > EAL: lcore 127 is ready (tid=7ffe9afdd090;cpuset=[127])
> > EAL: Trying to obtain current memory policy.
> > EAL: Setting policy MPOL_PREFERRED for socket 8
> > EAL: Restoring previous memory policy: 0
> > EAL: request: mp_malloc_sync
> > EAL: Heap on socket 8 was expanded by 1024MB
> > EAL: PCI device 0000:01:00.0 on NUMA socket 0
> > EAL:   probe driver: 15b3:1019 mlx5_pci
> > EAL: Probe PCI driver: mlx5_pci (15b3:1019) device: 0000:01:00.0 (socket 0)
> > EAL: Mem event callback 'MLX5_MEM_EVENT_CB:(nil)' registered
> > EAL: Trying to obtain current memory policy.
> > EAL: Setting policy MPOL_PREFERRED for socket 0
> > EAL: Restoring previous memory policy: 0
> > EAL: Calling mem event callback 'MLX5_MEM_EVENT_CB:(nil)'
> > EAL: request: mp_malloc_sync
> > EAL: Heap on socket 0 was expanded by 1024MB
> > mlx5_pci: Size 0xFFFF is not power of 2, will be aligned to 0x10000.
> > mlx5_pci: Default miss action is not supported.
> > EAL: PCI device 0000:01:00.1 on NUMA socket 0
> > EAL:   probe driver: 15b3:1019 mlx5_pci
> > EAL: Probe PCI driver: mlx5_pci (15b3:1019) device: 0000:01:00.1 (socket 0)
> > mlx5_pci: Size 0xFFFF is not power of 2, will be aligned to 0x10000.
> > mlx5_pci: Default miss action is not supported.
> > EAL: PCI device 0003:01:00.0 on NUMA socket 0
> > EAL:   probe driver: 14e4:168a net_bnx2x
> > EAL:   Not managed by a supported kernel driver, skipped
> > EAL: PCI device 0003:01:00.1 on NUMA socket 0
> > EAL:   probe driver: 14e4:168a net_bnx2x
> > EAL:   Not managed by a supported kernel driver, skipped
> > EAL: PCI device 0003:01:00.2 on NUMA socket 0
> > EAL:   probe driver: 14e4:168a net_bnx2x
> > EAL:   Not managed by a supported kernel driver, skipped
> > EAL: PCI device 0003:01:00.3 on NUMA socket 0
> > EAL:   probe driver: 14e4:168a net_bnx2x
> > EAL:   Not managed by a supported kernel driver, skipped
> > EAL: PCI device 0030:01:00.0 on NUMA socket 8
> > EAL:   probe driver: 15b3:1019 mlx5_pci
> > EAL: Probe PCI driver: mlx5_pci (15b3:1019) device: 0030:01:00.0 (socket 8)
> > mlx5_pci: Size 0xFFFF is not power of 2, will be aligned to 0x10000.
> > mlx5_pci: Default miss action is not supported.
> > EAL: PCI device 0030:01:00.1 on NUMA socket 8
> > EAL:   probe driver: 15b3:1019 mlx5_pci
> > EAL: Probe PCI driver: mlx5_pci (15b3:1019) device: 0030:01:00.1 (socket 8)
> > mlx5_pci: Size 0xFFFF is not power of 2, will be aligned to 0x10000.
> > mlx5_pci: Default miss action is not supported.
> > EAL: PCI device 0034:01:00.0 on NUMA socket 8
> > EAL:   probe driver: 8086:1583 net_i40e
> > EAL:   set IOMMU type 1 (Type 1) failed, error 19 (No such device)
> > EAL:   using IOMMU type 7 (sPAPR)
> > EAL: Highest VA address in memseg list is 0x2200000000
> > EAL: Setting DMA window size to 0x4000000000
> > EAL: Mem event callback 'vfio_mem_event_clb:(nil)' registered
> > EAL: Installed memory event callback for VFIO
> > EAL: VFIO reports MSI-X BAR as mappable
> > EAL:   PCI memory mapped at 0x2200000000
> > EAL:   PCI memory mapped at 0x2200800000
> > EAL: Probe PCI driver: net_i40e (8086:1583) device: 0034:01:00.0 (socket 8)
> > EAL: PCI device 0034:01:00.1 on NUMA socket 8
> > EAL:   probe driver: 8086:1583 net_i40e
> > EAL: VFIO reports MSI-X BAR as mappable
> > EAL:   PCI memory mapped at 0x2200810000
> > EAL:   PCI memory mapped at 0x2201010000
> > EAL: Probe PCI driver: net_i40e (8086:1583) device: 0034:01:00.1 (socket 8)
> > APP: HPET is not enabled, using TSC as default timer
> > RTE>>vfio_autotest
> > DRC: sz = 0x20000
> > DRC: mem = 0x0x7fffb8680000
> > EAL:   dma map attempt outside DMA window
> > EAL: Failed to map DMA
> > EAL: Couldn't map new region for DMA
> > Failed to dma map whole region, ret=-1(No such file or directory)
> > Error vfio dma map/unmap, ret=-1
> > Test Failed
> > RTE>>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v6 0/4] fix issue with partial DMA unmap
  2020-10-12  8:11 [dpdk-dev] [PATCH 0/2] fix issue with partial DMA unmap Nithin Dabilpuram
                   ` (5 preceding siblings ...)
  2020-12-14  8:19 ` [dpdk-dev] [PATCH v5 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
@ 2020-12-17 19:06 ` Nithin Dabilpuram
  2020-12-17 19:06   ` [dpdk-dev] [PATCH v6 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
                     ` (4 more replies)
  2021-01-12 17:39 ` [dpdk-dev] [PATCH v7 0/3] " Nithin Dabilpuram
  2021-01-15  7:32 ` [dpdk-dev] [PATCH v8 0/3] fix issue with partial DMA unmap Nithin Dabilpuram
  8 siblings, 5 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-17 19:06 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram

Partial DMA unmap is not supported by the VFIO type1 IOMMU
in Linux. Though the return value is zero, the returned
DMA unmap size is not the same as the expected size.
So add a test case and a fix for both heap-triggered DMA
mapping and user-triggered DMA mapping/unmapping.

Refer to vfio_dma_do_unmap() in drivers/vfio/vfio_iommu_type1.c.
A snippet of the comment is below.

        /*
         * vfio-iommu-type1 (v1) - User mappings were coalesced together to
         * avoid tracking individual mappings.  This means that the granularity
         * of the original mapping was lost and the user was allowed to attempt
         * to unmap any range.  Depending on the contiguousness of physical
         * memory and page sizes supported by the IOMMU, arbitrary unmaps may
         * or may not have worked.  We only guaranteed unmap granularity
         * matching the original mapping; even though it was untracked here,
         * the original mappings are reflected in IOMMU mappings.  This
         * resulted in a couple unusual behaviors.  First, if a range is not
         * able to be unmapped, ex. a set of 4k pages that was mapped as a
         * 2M hugepage into the IOMMU, the unmap ioctl returns success but with
         * a zero sized unmap.  Also, if an unmap request overlaps the first
         * address of a hugepage, the IOMMU will unmap the entire hugepage.
         * This also returns success and the returned unmap size reflects the
         * actual size unmapped.
         *
         * We attempt to maintain compatibility with this "v1" interface, but
         * we take control out of the hands of the IOMMU.  Therefore, an unmap
         * request offset from the beginning of the original mapping will
         * return success with zero sized unmap.  And an unmap request covering
         * the first iova of mapping will unmap the entire range.
         */

This behavior can be verified by applying the first patch and adding a return
check for dma_unmap.size != len in vfio_type1_dma_mem_map().

v6:
- Fixed issue with x86-32 build introduced by v5.

v5:
- Changed the vfio test in test_vfio.c to use system pages allocated from
  the heap instead of mmap() so that they fall within the initially
  configured window on POWER9 systems.
- Added acked-by from David for 1/4, 2/4.

v4:
- Fixed issue with patch 4/4 on x86 builds.

v3:
- Fixed external memory test case(4/4) to use system page size
  instead of 4K.
- Fixed check-git-log.sh issue and rebased.
- Added acked-by from anatoly.burakov@intel.com to first 3 patches.

v2:
- Reverted the earlier commit that enables merging contiguous mappings
  for IOVA as PA (see 1/3).
- Updated documentation about kernel dma mapping limits and vfio
  module parameter.
- Moved vfio test to test_vfio.c and handled comments from
  Anatoly.

Nithin Dabilpuram (4):
  vfio: revert changes for map contiguous areas in one go
  vfio: fix DMA mapping granularity for type1 IOVA as VA
  test: add test case to validate VFIO DMA map/unmap
  test: change external memory test to use system page sz

 app/test/meson.build                   |   1 +
 app/test/test_external_mem.c           |   3 +-
 app/test/test_vfio.c                   | 107 +++++++++++++++++++++++++++++++++
 doc/guides/linux_gsg/linux_drivers.rst |  10 +++
 lib/librte_eal/linux/eal_vfio.c        |  93 +++++++++++-----------------
 lib/librte_eal/linux/eal_vfio.h        |   1 +
 6 files changed, 157 insertions(+), 58 deletions(-)
 create mode 100644 app/test/test_vfio.c

-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v6 1/4] vfio: revert changes for map contiguous areas in one go
  2020-12-17 19:06 ` [dpdk-dev] [PATCH v6 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
@ 2020-12-17 19:06   ` Nithin Dabilpuram
  2020-12-17 19:06   ` [dpdk-dev] [PATCH v6 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-17 19:06 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram, stable

In order to save DMA entries, which are limited by the kernel, for both
external memory and hugepage memory, an attempt was made to map
physically contiguous memory in one go. This cannot be done, as VFIO
IOMMU type1 does not support partially unmapping a previously mapped
memory region, while the heap can request multi-page mapping and
partial unmapping.
Hence, to go back to the old method of mapping/unmapping at memseg
granularity, this commit reverts
commit d1c7c0cdf7ba ("vfio: map contiguous areas in one go").

Also add documentation on the module parameter that needs to be used
to increase the per-container DMA map limit for VFIO.

Fixes: d1c7c0cdf7ba ("vfio: map contiguous areas in one go")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
---
 doc/guides/linux_gsg/linux_drivers.rst | 10 ++++++
 lib/librte_eal/linux/eal_vfio.c        | 59 +++++-----------------------------
 2 files changed, 18 insertions(+), 51 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_drivers.rst b/doc/guides/linux_gsg/linux_drivers.rst
index 90635a4..9a662a7 100644
--- a/doc/guides/linux_gsg/linux_drivers.rst
+++ b/doc/guides/linux_gsg/linux_drivers.rst
@@ -25,6 +25,16 @@ To make use of VFIO, the ``vfio-pci`` module must be loaded:
 VFIO kernel is usually present by default in all distributions,
 however please consult your distributions documentation to make sure that is the case.
 
+The VFIO interface is used for DMA mapping of both external memory and hugepages.
+VFIO does not support partial unmap of once mapped memory. Hence DPDK's memory is
+mapped at hugepage granularity or at system page granularity. The number of DMA
+mappings is limited by the kernel via the user locked memory limit of a process
+(rlimit) for system/hugepage memory. Another per-container overall limit,
+applicable to both external memory and system memory, was added in kernel 5.1 and
+is defined by the VFIO module parameter ``dma_entry_limit``, with a default value
+of 64K. When an application runs out of DMA entries, these limits need to be
+adjusted to increase the allowed limit.
+
 Since Linux version 5.7,
 the ``vfio-pci`` module supports the creation of virtual functions.
 After the PF is bound to ``vfio-pci`` module,
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 0500824..64b134d 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -517,11 +517,9 @@ static void
 vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		void *arg __rte_unused)
 {
-	rte_iova_t iova_start, iova_expected;
 	struct rte_memseg_list *msl;
 	struct rte_memseg *ms;
 	size_t cur_len = 0;
-	uint64_t va_start;
 
 	msl = rte_mem_virt2memseg_list(addr);
 
@@ -539,63 +537,22 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
-
-	/*
-	 * This memory is not guaranteed to be contiguous, but it still could
-	 * be, or it could have some small contiguous chunks. Since the number
-	 * of VFIO mappings is limited, and VFIO appears to not concatenate
-	 * adjacent mappings, we have to do this ourselves.
-	 *
-	 * So, find contiguous chunks, then map them.
-	 */
-	va_start = ms->addr_64;
-	iova_start = iova_expected = ms->iova;
 	while (cur_len < len) {
-		bool new_contig_area = ms->iova != iova_expected;
-		bool last_seg = (len - cur_len) == ms->len;
-		bool skip_last = false;
-
-		/* only do mappings when current contiguous area ends */
-		if (new_contig_area) {
-			if (type == RTE_MEM_EVENT_ALLOC)
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 1);
-			else
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 0);
-			va_start = ms->addr_64;
-			iova_start = ms->iova;
-		}
 		/* some memory segments may have invalid IOVA */
 		if (ms->iova == RTE_BAD_IOVA) {
 			RTE_LOG(DEBUG, EAL, "Memory segment at %p has bad IOVA, skipping\n",
 					ms->addr);
-			skip_last = true;
+			goto next;
 		}
-		iova_expected = ms->iova + ms->len;
+		if (type == RTE_MEM_EVENT_ALLOC)
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 1);
+		else
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 0);
+next:
 		cur_len += ms->len;
 		++ms;
-
-		/*
-		 * don't count previous segment, and don't attempt to
-		 * dereference a potentially invalid pointer.
-		 */
-		if (skip_last && !last_seg) {
-			iova_expected = iova_start = ms->iova;
-			va_start = ms->addr_64;
-		} else if (!skip_last && last_seg) {
-			/* this is the last segment and we're not skipping */
-			if (type == RTE_MEM_EVENT_ALLOC)
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 1);
-			else
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 0);
-		}
 	}
 }
 
-- 
2.8.4
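The documentation hunk above mentions the per-container ``dma_entry_limit`` module parameter and the locked-memory rlimit. A hedged sketch of how an administrator might raise both; the values are illustrative assumptions, and whether the sysfs parameter is writable on a given kernel may vary:

```shell
# Illustrative values only. Raise the per-container DMA entry limit
# (default 64K since kernel 5.1) when loading the type1 IOMMU driver:
modprobe vfio_iommu_type1 dma_entry_limit=131072

# Raise the locked-memory limit (rlimit) for the current shell, which
# caps how much memory the process may pin for DMA:
ulimit -l unlimited
```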


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v6 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA
  2020-12-17 19:06 ` [dpdk-dev] [PATCH v6 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
  2020-12-17 19:06   ` [dpdk-dev] [PATCH v6 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
@ 2020-12-17 19:06   ` Nithin Dabilpuram
  2020-12-17 19:06   ` [dpdk-dev] [PATCH v6 3/4] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-17 19:06 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram, stable

Partial unmapping is not supported for VFIO IOMMU type1
by the kernel. Though the kernel returns zero, the unmapped
size returned will not be the same as expected. So check the
returned unmap size and return an error on mismatch.

For IOVA as PA, DMA mapping is already done at memseg size
granularity. Do the same for IOVA as VA mode: for DMA
map/unmap triggered by heap allocations, maintain the
granularity of the memseg page size so that heap expansion
and contraction do not hit this issue.

For user-requested DMA map/unmap, disallow partial unmapping
for VFIO type1.

Fixes: 73a639085938 ("vfio: allow to map other memory regions")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
---
 lib/librte_eal/linux/eal_vfio.c | 34 ++++++++++++++++++++++++++++------
 lib/librte_eal/linux/eal_vfio.h |  1 +
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 64b134d..b15b758 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -70,6 +70,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_TYPE1,
 		.name = "Type 1",
+		.partial_unmap = false,
 		.dma_map_func = &vfio_type1_dma_map,
 		.dma_user_map_func = &vfio_type1_dma_mem_map
 	},
@@ -77,6 +78,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_SPAPR,
 		.name = "sPAPR",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_spapr_dma_map,
 		.dma_user_map_func = &vfio_spapr_dma_mem_map
 	},
@@ -84,6 +86,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_NOIOMMU,
 		.name = "No-IOMMU",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_noiommu_dma_map,
 		.dma_user_map_func = &vfio_noiommu_dma_mem_map
 	},
@@ -526,12 +529,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 	/* for IOVA as VA mode, no need to care for IOVA addresses */
 	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
 		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
-		if (type == RTE_MEM_EVENT_ALLOC)
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 1);
-		else
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 0);
+		uint64_t page_sz = msl->page_sz;
+
+		/* Maintain granularity of DMA map/unmap to memseg size */
+		for (; cur_len < len; cur_len += page_sz) {
+			if (type == RTE_MEM_EVENT_ALLOC)
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 1);
+			else
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 0);
+			vfio_va += page_sz;
+		}
+
 		return;
 	}
 
@@ -1348,6 +1358,12 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
 					errno, strerror(errno));
 			return -1;
+		} else if (dma_unmap.size != len) {
+			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
+				"remapping cleared instead of %"PRIu64"\n",
+				(uint64_t)dma_unmap.size, len);
+			rte_errno = EIO;
+			return -1;
 		}
 	}
 
@@ -1823,6 +1839,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
 		/* we're partially unmapping a previously mapped region, so we
 		 * need to split entry into two.
 		 */
+		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
+			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
+			rte_errno = ENOTSUP;
+			ret = -1;
+			goto out;
+		}
 		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
 			rte_errno = ENOMEM;
diff --git a/lib/librte_eal/linux/eal_vfio.h b/lib/librte_eal/linux/eal_vfio.h
index cb2d35f..6ebaca6 100644
--- a/lib/librte_eal/linux/eal_vfio.h
+++ b/lib/librte_eal/linux/eal_vfio.h
@@ -113,6 +113,7 @@ typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova,
 struct vfio_iommu_type {
 	int type_id;
 	const char *name;
+	bool partial_unmap;
 	vfio_dma_user_func_t dma_user_map_func;
 	vfio_dma_func_t dma_map_func;
 };
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v6 3/4] test: add test case to validate VFIO DMA map/unmap
  2020-12-17 19:06 ` [dpdk-dev] [PATCH v6 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
  2020-12-17 19:06   ` [dpdk-dev] [PATCH v6 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
  2020-12-17 19:06   ` [dpdk-dev] [PATCH v6 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
@ 2020-12-17 19:06   ` Nithin Dabilpuram
  2020-12-17 19:10     ` Nithin Dabilpuram
  2020-12-17 19:06   ` [dpdk-dev] [PATCH v6 4/4] test: change external memory test to use system page sz Nithin Dabilpuram
  2020-12-23  5:13   ` [dpdk-dev] [PATCH v6 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
  4 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-17 19:06 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram

The test case allocates system pages and tries to perform a user
DMA map and unmap, both partially and fully.

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 app/test/meson.build |   1 +
 app/test/test_vfio.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+)
 create mode 100644 app/test/test_vfio.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 94fd39f..d9eedb6 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -139,6 +139,7 @@ test_sources = files('commands.c',
 	'test_trace_register.c',
 	'test_trace_perf.c',
 	'test_version.c',
+	'test_vfio.c',
 	'virtual_pmd.c'
 )
 
diff --git a/app/test/test_vfio.c b/app/test/test_vfio.c
new file mode 100644
index 0000000..c35efed
--- /dev/null
+++ b/app/test/test_vfio.c
@@ -0,0 +1,107 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2020 Marvell.
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_common.h>
+#include <rte_eal.h>
+#include <rte_eal_paging.h>
+#include <rte_errno.h>
+#include <rte_memory.h>
+#include <rte_vfio.h>
+
+#include "test.h"
+
+static int
+test_memory_vfio_dma_map(void)
+{
+	uint64_t sz1, sz2, sz = 2 * rte_mem_page_size();
+	uint64_t unmap1, unmap2;
+	uint8_t *alloc_mem;
+	uint8_t *mem;
+	int ret;
+
+	/* Allocate twice size of requirement from heap to align later */
+	alloc_mem = malloc(sz * 2);
+	if (!alloc_mem) {
+		printf("Skipping test as unable to alloc %"PRIx64"B from heap\n",
+		       sz * 2);
+		return 1;
+	}
+
+	/* Force page allocation */
+	memset(alloc_mem, 0, sz * 2);
+
+	mem = RTE_PTR_ALIGN(alloc_mem, rte_mem_page_size());
+
+	/* map the whole region */
+	ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					 (uintptr_t)mem, (rte_iova_t)mem, sz);
+	if (ret) {
+		/* Check if VFIO is not available or no device is probed */
+		if (rte_errno == ENOTSUP || rte_errno == ENODEV) {
+			ret = 1;
+			goto fail;
+		}
+		printf("Failed to dma map whole region, ret=%d(%s)\n",
+		       ret, rte_strerror(rte_errno));
+		goto fail;
+	}
+
+	unmap1 = (uint64_t)mem + (sz / 2);
+	sz1 = sz / 2;
+	unmap2 = (uint64_t)mem;
+	sz2 = sz / 2;
+	/* unmap the partial region */
+	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					   unmap1, (rte_iova_t)unmap1, sz1);
+	if (ret) {
+		if (rte_errno == ENOTSUP) {
+			printf("Partial dma unmap not supported\n");
+			unmap2 = (uint64_t)mem;
+			sz2 = sz;
+		} else {
+			printf("Failed to unmap second half region, ret=%d(%s)\n",
+			       ret, rte_strerror(rte_errno));
+			goto fail;
+		}
+	}
+
+	/* unmap the remaining region */
+	ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
+					   unmap2, (rte_iova_t)unmap2, sz2);
+	if (ret) {
+		printf("Failed to unmap remaining region, ret=%d(%s)\n", ret,
+		       rte_strerror(rte_errno));
+		goto fail;
+	}
+
+fail:
+	free(alloc_mem);
+	return ret;
+}
+
+static int
+test_vfio(void)
+{
+	int ret;
+
+	/* test for vfio dma map/unmap */
+	ret = test_memory_vfio_dma_map();
+	if (ret == 1) {
+		printf("VFIO dma map/unmap unsupported\n");
+	} else if (ret < 0) {
+		printf("Error vfio dma map/unmap, ret=%d\n", ret);
+		return -1;
+	}
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(vfio_autotest, test_vfio);
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v6 4/4] test: change external memory test to use system page sz
  2020-12-17 19:06 ` [dpdk-dev] [PATCH v6 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
                     ` (2 preceding siblings ...)
  2020-12-17 19:06   ` [dpdk-dev] [PATCH v6 3/4] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
@ 2020-12-17 19:06   ` Nithin Dabilpuram
  2020-12-23  5:13   ` [dpdk-dev] [PATCH v6 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
  4 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-17 19:06 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram

Currently the external memory test uses a 4K page size, but
VFIO DMA mapping works only at system page granularity.

Earlier it worked because all contiguous mappings were coalesced
and mapped in one go, which effectively produced a much larger
page. Now that VFIO DMA mappings, in both IOVA as VA and IOVA
as PA modes, are done at memseg list granularity, we need to use
the system page size.

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
---
 app/test/test_external_mem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/app/test/test_external_mem.c b/app/test/test_external_mem.c
index 7eb81f6..5edf88b 100644
--- a/app/test/test_external_mem.c
+++ b/app/test/test_external_mem.c
@@ -13,6 +13,7 @@
 #include <rte_common.h>
 #include <rte_debug.h>
 #include <rte_eal.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
 #include <rte_malloc.h>
 #include <rte_ring.h>
@@ -532,8 +533,8 @@ test_extmem_basic(void *addr, size_t len, size_t pgsz, rte_iova_t *iova,
 static int
 test_external_mem(void)
 {
+	size_t pgsz = rte_mem_page_size();
 	size_t len = EXTERNAL_MEM_SZ;
-	size_t pgsz = RTE_PGSIZE_4K;
 	rte_iova_t iova[len / pgsz];
 	void *addr;
 	int ret, n_pages;
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/4] test: add test case to validate VFIO DMA map/unmap
  2020-12-17 19:06   ` [dpdk-dev] [PATCH v6 3/4] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
@ 2020-12-17 19:10     ` Nithin Dabilpuram
  2021-01-05 19:33       ` David Christensen
  0 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-17 19:10 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand; +Cc: jerinj, dev

Hi David Christensen,

Ping. Let me know if this way of allocating from the heap is fine on a POWER9 system.

On Fri, Dec 18, 2020 at 12:36:03AM +0530, Nithin Dabilpuram wrote:
> The test case allocates system pages and tries to perform a user
> DMA map and unmap, both partially and fully.
> 
> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v6 0/4] fix issue with partial DMA unmap
  2020-12-17 19:06 ` [dpdk-dev] [PATCH v6 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
                     ` (3 preceding siblings ...)
  2020-12-17 19:06   ` [dpdk-dev] [PATCH v6 4/4] test: change external memory test to use system page sz Nithin Dabilpuram
@ 2020-12-23  5:13   ` Nithin Dabilpuram
  2021-01-04 22:29     ` David Christensen
  4 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2020-12-23  5:13 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand; +Cc: jerinj, dev

Ping.

On Fri, Dec 18, 2020 at 12:36:00AM +0530, Nithin Dabilpuram wrote:
> Partial DMA unmap is not supported by VFIO type1 IOMMU
> in Linux. Though the return value is zero, the returned
> DMA unmap size is not the same as the expected size.
> So add a test case and fixes to both heap-triggered DMA
> mapping and user-triggered DMA mapping/unmapping.
> 
> Refer vfio_dma_do_unmap() in drivers/vfio/vfio_iommu_type1.c
> Snippet of comment is below.
> 
>         /*
>          * vfio-iommu-type1 (v1) - User mappings were coalesced together to
>          * avoid tracking individual mappings.  This means that the granularity
>          * of the original mapping was lost and the user was allowed to attempt
>          * to unmap any range.  Depending on the contiguousness of physical
>          * memory and page sizes supported by the IOMMU, arbitrary unmaps may
>          * or may not have worked.  We only guaranteed unmap granularity
>          * matching the original mapping; even though it was untracked here,
>          * the original mappings are reflected in IOMMU mappings.  This
>          * resulted in a couple unusual behaviors.  First, if a range is not
>          * able to be unmapped, ex. a set of 4k pages that was mapped as a
>          * 2M hugepage into the IOMMU, the unmap ioctl returns success but with
>          * a zero sized unmap.  Also, if an unmap request overlaps the first
>          * address of a hugepage, the IOMMU will unmap the entire hugepage.
>          * This also returns success and the returned unmap size reflects the
>          * actual size unmapped.
> 
>          * We attempt to maintain compatibility with this "v1" interface, but  
>          * we take control out of the hands of the IOMMU.  Therefore, an unmap 
>          * request offset from the beginning of the original mapping will      
>          * return success with zero sized unmap.  And an unmap request covering
>          * the first iova of mapping will unmap the entire range.              
> 
> This behavior can be verified by using first patch and add return check for
> dma_unmap.size != len in vfio_type1_dma_mem_map()
> 
> v6:
> - Fixed issue with x86-32 build introduced by v5.
> 
> v5:
> - Changed vfio test in test_vfio.c to use system pages allocated from
>   heap instead of mmap() so that it comes in range of initially configured
>   window for POWER9 System.
> - Added acked-by from David for 1/4, 2/4.
> 
> v4:
> - Fixed issue with patch 4/4 on x86 builds.
> 
> v3:
> - Fixed external memory test case(4/4) to use system page size
>   instead of 4K.
> - Fixed check-git-log.sh issue and rebased.
> - Added acked-by from anatoly.burakov@intel.com to first 3 patches.
> 
> v2: 
> - Reverted the earlier commit that enabled merging contiguous mappings for
>   IOVA as PA. (see 1/3)
> - Updated documentation about kernel dma mapping limits and vfio
>   module parameter.
> - Moved vfio test to test_vfio.c and handled comments from
>   Anatoly.
> 
> Nithin Dabilpuram (4):
>   vfio: revert changes for map contiguous areas in one go
>   vfio: fix DMA mapping granularity for type1 IOVA as VA
>   test: add test case to validate VFIO DMA map/unmap
>   test: change external memory test to use system page sz
> 
>  app/test/meson.build                   |   1 +
>  app/test/test_external_mem.c           |   3 +-
>  app/test/test_vfio.c                   | 107 +++++++++++++++++++++++++++++++++
>  doc/guides/linux_gsg/linux_drivers.rst |  10 +++
>  lib/librte_eal/linux/eal_vfio.c        |  93 +++++++++++-----------------
>  lib/librte_eal/linux/eal_vfio.h        |   1 +
>  6 files changed, 157 insertions(+), 58 deletions(-)
>  create mode 100644 app/test/test_vfio.c
> 
> -- 
> 2.8.4
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v6 0/4] fix issue with partial DMA unmap
  2020-12-23  5:13   ` [dpdk-dev] [PATCH v6 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
@ 2021-01-04 22:29     ` David Christensen
  0 siblings, 0 replies; 76+ messages in thread
From: David Christensen @ 2021-01-04 22:29 UTC (permalink / raw)
  To: Nithin Dabilpuram, anatoly.burakov, david.marchand; +Cc: jerinj, dev



On 12/22/20 9:13 PM, Nithin Dabilpuram wrote:
> Ping.

Tested the patches and they generate a failure on my P9 system:

EAL:   cannot map vaddr for IOMMU, error 22 (Invalid argument)

I'm looking at it now to see what the problem might be.  I'm assuming 
it's related to the size parameter (see 
https://elixir.bootlin.com/linux/v4.18/source/drivers/vfio/vfio_iommu_spapr_tce.c#L906) 
but I'm not sure yet.  See below for a bit more of the debug output from 
the failure.

Dave


sudo gdb --args /home/drc/src/dpdk/build/app/test/dpdk-test 
--log="eal,debug" --iova-mode=va -a 0034:01:00.0 -a 0034:01:00.1 -l 
64-127 -n 4
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-12.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "ppc64le-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
     <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/drc/src/dpdk/build/app/test/dpdk-test...done.
(gdb) b test_vfio
Breakpoint 1 at 0x1066ae34: file ../app/test/test_vfio.c, line 96.
(gdb) start
Temporary breakpoint 2 at 0x1000e724: file ../app/test/test.c, line 100.
Starting program: /home/drc/src/dpdk/build/app/test/dpdk-test 
--log=eal,debug --iova-mode=va -a 0034:01:00.0 -a 0034:01:00.1 -l 64-127 
-n 4
Missing separate debuginfos, use: yum debuginfo-install 
glibc-2.28-127.el8.ppc64le
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/power9/libthread_db.so.1".

Temporary breakpoint 2, main (argc=11, argv=0x7ffffffff098) at 
../app/test/test.c:100
100		extra_args = getenv("DPDK_TEST_PARAMS");
Missing separate debuginfos, use: yum debuginfo-install 
elfutils-libelf-0.180-1.el8.ppc64le jansson-2.11-3.el8.ppc64le 
libibverbs-29.0-3.el8.ppc64le libnl3-3.5.0-1.el8.ppc64le 
libpcap-1.9.1-4.el8.ppc64le numactl-libs-2.0.12-11.el8.ppc64le 
openssl-libs-1.1.1g-11.el8.ppc64le zlib-1.2.11-16.el8_2.ppc64le
(gdb) c
Continuing.
EAL: Detected lcore 0 as core 0 on socket 0
EAL: Detected lcore 1 as core 0 on socket 0
EAL: Detected lcore 2 as core 0 on socket 0
EAL: Detected lcore 3 as core 0 on socket 0
EAL: Detected lcore 4 as core 4 on socket 0
EAL: Detected lcore 5 as core 4 on socket 0
EAL: Detected lcore 6 as core 4 on socket 0
EAL: Detected lcore 7 as core 4 on socket 0
EAL: Detected lcore 8 as core 8 on socket 0
EAL: Detected lcore 9 as core 8 on socket 0
EAL: Detected lcore 10 as core 8 on socket 0
EAL: Detected lcore 11 as core 8 on socket 0
EAL: Detected lcore 12 as core 12 on socket 0
EAL: Detected lcore 13 as core 12 on socket 0
EAL: Detected lcore 14 as core 12 on socket 0
EAL: Detected lcore 15 as core 12 on socket 0
EAL: Detected lcore 16 as core 16 on socket 0
EAL: Detected lcore 17 as core 16 on socket 0
EAL: Detected lcore 18 as core 16 on socket 0
EAL: Detected lcore 19 as core 16 on socket 0
EAL: Detected lcore 20 as core 20 on socket 0
EAL: Detected lcore 21 as core 20 on socket 0
EAL: Detected lcore 22 as core 20 on socket 0
EAL: Detected lcore 23 as core 20 on socket 0
EAL: Detected lcore 24 as core 24 on socket 0
EAL: Detected lcore 25 as core 24 on socket 0
EAL: Detected lcore 26 as core 24 on socket 0
EAL: Detected lcore 27 as core 24 on socket 0
EAL: Detected lcore 28 as core 28 on socket 0
EAL: Detected lcore 29 as core 28 on socket 0
EAL: Detected lcore 30 as core 28 on socket 0
EAL: Detected lcore 31 as core 28 on socket 0
EAL: Detected lcore 32 as core 32 on socket 0
EAL: Detected lcore 33 as core 32 on socket 0
EAL: Detected lcore 34 as core 32 on socket 0
EAL: Detected lcore 35 as core 32 on socket 0
EAL: Detected lcore 36 as core 36 on socket 0
EAL: Detected lcore 37 as core 36 on socket 0
EAL: Detected lcore 38 as core 36 on socket 0
EAL: Detected lcore 39 as core 36 on socket 0
EAL: Detected lcore 40 as core 48 on socket 0
EAL: Detected lcore 41 as core 48 on socket 0
EAL: Detected lcore 42 as core 48 on socket 0
EAL: Detected lcore 43 as core 48 on socket 0
EAL: Detected lcore 44 as core 52 on socket 0
EAL: Detected lcore 45 as core 52 on socket 0
EAL: Detected lcore 46 as core 52 on socket 0
EAL: Detected lcore 47 as core 52 on socket 0
EAL: Detected lcore 48 as core 72 on socket 0
EAL: Detected lcore 49 as core 72 on socket 0
EAL: Detected lcore 50 as core 72 on socket 0
EAL: Detected lcore 51 as core 72 on socket 0
EAL: Detected lcore 52 as core 76 on socket 0
EAL: Detected lcore 53 as core 76 on socket 0
EAL: Detected lcore 54 as core 76 on socket 0
EAL: Detected lcore 55 as core 76 on socket 0
EAL: Detected lcore 56 as core 80 on socket 0
EAL: Detected lcore 57 as core 80 on socket 0
EAL: Detected lcore 58 as core 80 on socket 0
EAL: Detected lcore 59 as core 80 on socket 0
EAL: Detected lcore 60 as core 84 on socket 0
EAL: Detected lcore 61 as core 84 on socket 0
EAL: Detected lcore 62 as core 84 on socket 0
EAL: Detected lcore 63 as core 84 on socket 0
EAL: Detected lcore 64 as core 2048 on socket 8
EAL: Detected lcore 65 as core 2048 on socket 8
EAL: Detected lcore 66 as core 2048 on socket 8
EAL: Detected lcore 67 as core 2048 on socket 8
EAL: Detected lcore 68 as core 2052 on socket 8
EAL: Detected lcore 69 as core 2052 on socket 8
EAL: Detected lcore 70 as core 2052 on socket 8
EAL: Detected lcore 71 as core 2052 on socket 8
EAL: Detected lcore 72 as core 2056 on socket 8
EAL: Detected lcore 73 as core 2056 on socket 8
EAL: Detected lcore 74 as core 2056 on socket 8
EAL: Detected lcore 75 as core 2056 on socket 8
EAL: Detected lcore 76 as core 2060 on socket 8
EAL: Detected lcore 77 as core 2060 on socket 8
EAL: Detected lcore 78 as core 2060 on socket 8
EAL: Detected lcore 79 as core 2060 on socket 8
EAL: Detected lcore 80 as core 2072 on socket 8
EAL: Detected lcore 81 as core 2072 on socket 8
EAL: Detected lcore 82 as core 2072 on socket 8
EAL: Detected lcore 83 as core 2072 on socket 8
EAL: Detected lcore 84 as core 2076 on socket 8
EAL: Detected lcore 85 as core 2076 on socket 8
EAL: Detected lcore 86 as core 2076 on socket 8
EAL: Detected lcore 87 as core 2076 on socket 8
EAL: Detected lcore 88 as core 2080 on socket 8
EAL: Detected lcore 89 as core 2080 on socket 8
EAL: Detected lcore 90 as core 2080 on socket 8
EAL: Detected lcore 91 as core 2080 on socket 8
EAL: Detected lcore 92 as core 2084 on socket 8
EAL: Detected lcore 93 as core 2084 on socket 8
EAL: Detected lcore 94 as core 2084 on socket 8
EAL: Detected lcore 95 as core 2084 on socket 8
EAL: Detected lcore 96 as core 2088 on socket 8
EAL: Detected lcore 97 as core 2088 on socket 8
EAL: Detected lcore 98 as core 2088 on socket 8
EAL: Detected lcore 99 as core 2088 on socket 8
EAL: Detected lcore 100 as core 2092 on socket 8
EAL: Detected lcore 101 as core 2092 on socket 8
EAL: Detected lcore 102 as core 2092 on socket 8
EAL: Detected lcore 103 as core 2092 on socket 8
EAL: Detected lcore 104 as core 2096 on socket 8
EAL: Detected lcore 105 as core 2096 on socket 8
EAL: Detected lcore 106 as core 2096 on socket 8
EAL: Detected lcore 107 as core 2096 on socket 8
EAL: Detected lcore 108 as core 2100 on socket 8
EAL: Detected lcore 109 as core 2100 on socket 8
EAL: Detected lcore 110 as core 2100 on socket 8
EAL: Detected lcore 111 as core 2100 on socket 8
EAL: Detected lcore 112 as core 2120 on socket 8
EAL: Detected lcore 113 as core 2120 on socket 8
EAL: Detected lcore 114 as core 2120 on socket 8
EAL: Detected lcore 115 as core 2120 on socket 8
EAL: Detected lcore 116 as core 2124 on socket 8
EAL: Detected lcore 117 as core 2124 on socket 8
EAL: Detected lcore 118 as core 2124 on socket 8
EAL: Detected lcore 119 as core 2124 on socket 8
EAL: Detected lcore 120 as core 2136 on socket 8
EAL: Detected lcore 121 as core 2136 on socket 8
EAL: Detected lcore 122 as core 2136 on socket 8
EAL: Detected lcore 123 as core 2136 on socket 8
EAL: Detected lcore 124 as core 2140 on socket 8
EAL: Detected lcore 125 as core 2140 on socket 8
EAL: Detected lcore 126 as core 2140 on socket 8
EAL: Detected lcore 127 as core 2140 on socket 8
EAL: Support maximum 1536 logical core(s) by configuration.
EAL: Detected 128 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Ask a virtual area of 0x10000 bytes
EAL: Virtual area found at 0x100000000 (size = 0x10000)
[New Thread 0x7ffff74ad090 (LWP 140091)]
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
[New Thread 0x7ffff6c9d090 (LWP 140092)]
EAL: DPAA Bus not present. Skipping.
EAL: VFIO PCI modules not loaded
EAL: Selected IOVA mode 'VA'
EAL: No available hugepages reported in hugepages-2048kB
EAL: Probing VFIO support...
EAL:   IOMMU type 1 (Type 1) is not supported
EAL:   IOMMU type 7 (sPAPR) is supported
EAL:   IOMMU type 8 (No-IOMMU) is not supported
EAL: VFIO support initialized
EAL: Ask a virtual area of 0x30000 bytes
EAL: Virtual area found at 0x100010000 (size = 0x30000)
EAL: Setting up physically contiguous memory...
EAL: Setting maximum number of open files to 32768
EAL: Detected memory type: socket_id:0 hugepage_sz:1073741824
EAL: Detected memory type: socket_id:8 hugepage_sz:1073741824
EAL: Creating 2 segment lists: n_segs:32 socket_id:0 hugepage_sz:1073741824
EAL: Ask a virtual area of 0x10000 bytes
EAL: Virtual area found at 0x100040000 (size = 0x10000)
EAL: Memseg list allocated at socket 0, page size 0x100000kB
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x140000000 (size = 0x800000000)
EAL: VA reserved for memseg list at 0x140000000, size 800000000
EAL: Ask a virtual area of 0x10000 bytes
EAL: Virtual area found at 0x940000000 (size = 0x10000)
EAL: Memseg list allocated at socket 0, page size 0x100000kB
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x980000000 (size = 0x800000000)
EAL: VA reserved for memseg list at 0x980000000, size 800000000
EAL: Creating 2 segment lists: n_segs:32 socket_id:8 hugepage_sz:1073741824
EAL: Ask a virtual area of 0x10000 bytes
EAL: Virtual area found at 0x1180000000 (size = 0x10000)
EAL: Memseg list allocated at socket 8, page size 0x100000kB
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x11c0000000 (size = 0x800000000)
EAL: VA reserved for memseg list at 0x11c0000000, size 800000000
EAL: Ask a virtual area of 0x10000 bytes
EAL: Virtual area found at 0x19c0000000 (size = 0x10000)
EAL: Memseg list allocated at socket 8, page size 0x100000kB
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x1a00000000 (size = 0x800000000)
EAL: VA reserved for memseg list at 0x1a00000000, size 800000000
EAL: TSC frequency is ~510000 KHz
EAL: Main lcore 64 is ready (tid=7ffff7ff8890;cpuset=[64])
[New Thread 0x7ffff648d090 (LWP 140093)]
[New Thread 0x7ffff5c7d090 (LWP 140094)]
EAL: lcore 65 is ready (tid=7ffff648d090;cpuset=[65])
[New Thread 0x7ffff546d090 (LWP 140095)]
EAL: lcore 66 is ready (tid=7ffff5c7d090;cpuset=[66])
[New Thread 0x7ffff4c5d090 (LWP 140096)]
EAL: lcore 67 is ready (tid=7ffff546d090;cpuset=[67])
[New Thread 0x7fffdfffd090 (LWP 140097)]
EAL: lcore 68 is ready (tid=7ffff4c5d090;cpuset=[68])
[New Thread 0x7fffdf7ed090 (LWP 140098)]
EAL: lcore 69 is ready (tid=7fffdfffd090;cpuset=[69])
[New Thread 0x7fffdefdd090 (LWP 140099)]
EAL: lcore 70 is ready (tid=7fffdf7ed090;cpuset=[70])
[New Thread 0x7fffde7cd090 (LWP 140100)]
EAL: lcore 71 is ready (tid=7fffdefdd090;cpuset=[71])
[New Thread 0x7fffddfbd090 (LWP 140101)]
EAL: lcore 72 is ready (tid=7fffde7cd090;cpuset=[72])
[New Thread 0x7fffdd7ad090 (LWP 140102)]
EAL: lcore 73 is ready (tid=7fffddfbd090;cpuset=[73])
[New Thread 0x7fffdcf9d090 (LWP 140103)]
EAL: lcore 74 is ready (tid=7fffdd7ad090;cpuset=[74])
[New Thread 0x7fffbfffd090 (LWP 140104)]
EAL: lcore 75 is ready (tid=7fffdcf9d090;cpuset=[75])
EAL: lcore 76 is ready (tid=7fffbfffd090;cpuset=[76])
[New Thread 0x7fffbf7ed090 (LWP 140105)]
[New Thread 0x7fffbefdd090 (LWP 140106)]
EAL: lcore 77 is ready (tid=7fffbf7ed090;cpuset=[77])
[New Thread 0x7fffbe7cd090 (LWP 140107)]
EAL: lcore 78 is ready (tid=7fffbefdd090;cpuset=[78])
[New Thread 0x7fffbdfbd090 (LWP 140108)]
EAL: lcore 79 is ready (tid=7fffbe7cd090;cpuset=[79])
[New Thread 0x7fffbd7ad090 (LWP 140109)]
EAL: lcore 80 is ready (tid=7fffbdfbd090;cpuset=[80])
[New Thread 0x7fffbcf9d090 (LWP 140110)]
EAL: lcore 81 is ready (tid=7fffbd7ad090;cpuset=[81])
[New Thread 0x7fff9fffd090 (LWP 140111)]
EAL: lcore 82 is ready (tid=7fffbcf9d090;cpuset=[82])
[New Thread 0x7fff9f7ed090 (LWP 140112)]
EAL: lcore 83 is ready (tid=7fff9fffd090;cpuset=[83])
[New Thread 0x7fff9efdd090 (LWP 140113)]
EAL: lcore 84 is ready (tid=7fff9f7ed090;cpuset=[84])
[New Thread 0x7fff9e7cd090 (LWP 140114)]
EAL: lcore 85 is ready (tid=7fff9efdd090;cpuset=[85])
[New Thread 0x7fff9dfbd090 (LWP 140115)]
EAL: lcore 86 is ready (tid=7fff9e7cd090;cpuset=[86])
[New Thread 0x7fff9d7ad090 (LWP 140116)]
EAL: lcore 87 is ready (tid=7fff9dfbd090;cpuset=[87])
[New Thread 0x7fff9cf9d090 (LWP 140117)]
EAL: lcore 88 is ready (tid=7fff9d7ad090;cpuset=[88])
[New Thread 0x7fff7fffd090 (LWP 140118)]
EAL: lcore 89 is ready (tid=7fff9cf9d090;cpuset=[89])
[New Thread 0x7fff7f7ed090 (LWP 140119)]
EAL: lcore 90 is ready (tid=7fff7fffd090;cpuset=[90])
[New Thread 0x7fff7efdd090 (LWP 140120)]
EAL: lcore 91 is ready (tid=7fff7f7ed090;cpuset=[91])
EAL: lcore 92 is ready (tid=7fff7efdd090;cpuset=[92])
[New Thread 0x7fff7e7cd090 (LWP 140121)]
[New Thread 0x7fff7dfbd090 (LWP 140122)]
EAL: lcore 93 is ready (tid=7fff7e7cd090;cpuset=[93])
[New Thread 0x7fff7d7ad090 (LWP 140123)]
EAL: lcore 94 is ready (tid=7fff7dfbd090;cpuset=[94])
[New Thread 0x7fff7cf9d090 (LWP 140124)]
EAL: lcore 95 is ready (tid=7fff7d7ad090;cpuset=[95])
[New Thread 0x7fff5fffd090 (LWP 140125)]
EAL: lcore 96 is ready (tid=7fff7cf9d090;cpuset=[96])
[New Thread 0x7fff5f7ed090 (LWP 140126)]
EAL: lcore 97 is ready (tid=7fff5fffd090;cpuset=[97])
[New Thread 0x7fff5efdd090 (LWP 140127)]
EAL: lcore 98 is ready (tid=7fff5f7ed090;cpuset=[98])
[New Thread 0x7fff5e7cd090 (LWP 140128)]
EAL: lcore 99 is ready (tid=7fff5efdd090;cpuset=[99])
[New Thread 0x7fff5dfbd090 (LWP 140129)]
EAL: lcore 100 is ready (tid=7fff5e7cd090;cpuset=[100])
[New Thread 0x7fff5d7ad090 (LWP 140130)]
EAL: lcore 101 is ready (tid=7fff5dfbd090;cpuset=[101])
[New Thread 0x7fff5cf9d090 (LWP 140131)]
EAL: lcore 102 is ready (tid=7fff5d7ad090;cpuset=[102])
[New Thread 0x7fff3fffd090 (LWP 140132)]
EAL: lcore 103 is ready (tid=7fff5cf9d090;cpuset=[103])
[New Thread 0x7fff3f7ed090 (LWP 140133)]
EAL: lcore 104 is ready (tid=7fff3fffd090;cpuset=[104])
[New Thread 0x7fff3efdd090 (LWP 140134)]
EAL: lcore 105 is ready (tid=7fff3f7ed090;cpuset=[105])
[New Thread 0x7fff3e7cd090 (LWP 140135)]
EAL: lcore 106 is ready (tid=7fff3efdd090;cpuset=[106])
[New Thread 0x7fff3dfbd090 (LWP 140136)]
EAL: lcore 107 is ready (tid=7fff3e7cd090;cpuset=[107])
[New Thread 0x7fff3d7ad090 (LWP 140137)]
EAL: lcore 108 is ready (tid=7fff3dfbd090;cpuset=[108])
[New Thread 0x7fff3cf9d090 (LWP 140138)]
EAL: lcore 109 is ready (tid=7fff3d7ad090;cpuset=[109])
[New Thread 0x7fff1fffd090 (LWP 140139)]
EAL: lcore 110 is ready (tid=7fff3cf9d090;cpuset=[110])
[New Thread 0x7fff1f7ed090 (LWP 140140)]
EAL: lcore 111 is ready (tid=7fff1fffd090;cpuset=[111])
[New Thread 0x7fff1efdd090 (LWP 140141)]
EAL: lcore 112 is ready (tid=7fff1f7ed090;cpuset=[112])
[New Thread 0x7fff1e7cd090 (LWP 140142)]
EAL: lcore 113 is ready (tid=7fff1efdd090;cpuset=[113])
[New Thread 0x7fff1dfbd090 (LWP 140143)]
EAL: lcore 114 is ready (tid=7fff1e7cd090;cpuset=[114])
[New Thread 0x7fff1d7ad090 (LWP 140144)]
EAL: lcore 115 is ready (tid=7fff1dfbd090;cpuset=[115])
[New Thread 0x7fff1cf9d090 (LWP 140145)]
EAL: lcore 116 is ready (tid=7fff1d7ad090;cpuset=[116])
[New Thread 0x7ffeffffd090 (LWP 140146)]
EAL: lcore 117 is ready (tid=7fff1cf9d090;cpuset=[117])
[New Thread 0x7ffeff7ed090 (LWP 140147)]
EAL: lcore 118 is ready (tid=7ffeffffd090;cpuset=[118])
[New Thread 0x7ffefefdd090 (LWP 140148)]
EAL: lcore 119 is ready (tid=7ffeff7ed090;cpuset=[119])
[New Thread 0x7ffefe7cd090 (LWP 140149)]
EAL: lcore 120 is ready (tid=7ffefefdd090;cpuset=[120])
[New Thread 0x7ffefdfbd090 (LWP 140150)]
EAL: lcore 121 is ready (tid=7ffefe7cd090;cpuset=[121])
[New Thread 0x7ffefd7ad090 (LWP 140151)]
EAL: lcore 122 is ready (tid=7ffefdfbd090;cpuset=[122])
[New Thread 0x7ffefcf9d090 (LWP 140152)]
EAL: lcore 123 is ready (tid=7ffefd7ad090;cpuset=[123])
EAL: lcore 124 is ready (tid=7ffefcf9d090;cpuset=[124])
[New Thread 0x7ffedfffd090 (LWP 140153)]
[New Thread 0x7ffedf7ed090 (LWP 140154)]
EAL: lcore 125 is ready (tid=7ffedfffd090;cpuset=[125])
[New Thread 0x7ffedefdd090 (LWP 140155)]
EAL: lcore 126 is ready (tid=7ffedf7ed090;cpuset=[126])
EAL: lcore 127 is ready (tid=7ffedefdd090;cpuset=[127])
EAL: Trying to obtain current memory policy.
EAL: Setting policy MPOL_PREFERRED for socket 8
EAL: Restoring previous memory policy: 0
EAL: request: mp_malloc_sync
EAL: Heap on socket 8 was expanded by 1024MB
EAL: PCI device 0034:01:00.0 on NUMA socket 8
EAL:   probe driver: 8086:1583 net_i40e
EAL:   set IOMMU type 1 (Type 1) failed, error 19 (No such device)
EAL:   using IOMMU type 7 (sPAPR)
EAL: Highest VA address in memseg list is 0x2200000000
EAL: Setting DMA window size to 0x4000000000
EAL: Mem event callback 'vfio_mem_event_clb:(nil)' registered
EAL: Installed memory event callback for VFIO
EAL: VFIO reports MSI-X BAR as mappable
EAL:   PCI memory mapped at 0x2200000000
EAL:   PCI memory mapped at 0x2200800000
EAL: Probe PCI driver: net_i40e (8086:1583) device: 0034:01:00.0 (socket 8)
EAL: PCI device 0034:01:00.1 on NUMA socket 8
EAL:   probe driver: 8086:1583 net_i40e
EAL: VFIO reports MSI-X BAR as mappable
EAL:   PCI memory mapped at 0x2200810000
EAL:   PCI memory mapped at 0x2201010000
EAL: Probe PCI driver: net_i40e (8086:1583) device: 0034:01:00.1 (socket 8)
[New Thread 0x7ffede7cd090 (LWP 140156)]
[New Thread 0x7ffeddfbd090 (LWP 140157)]
APP: HPET is not enabled, using TSC as default timer
RTE>>vfio_autotest

Thread 1 "dpdk-test" hit Breakpoint 1, test_vfio () at 
../app/test/test_vfio.c:96
96		ret = test_memory_vfio_dma_map();
(gdb) s
test_memory_vfio_dma_map () at ../app/test/test_vfio.c:24
24		uint64_t sz1, sz2, sz = 2 * rte_mem_page_size();
(gdb) n
31		alloc_mem = malloc(sz * 2);
(gdb) p sz
$1 = 131072
(gdb) n
32		if (!alloc_mem) {
(gdb) n
39		memset(alloc_mem, 0, sz * 2);
(gdb) n
41		mem = RTE_PTR_ALIGN(alloc_mem, rte_mem_page_size());
(gdb) p/x mem
$2 = 0x12c37400
(gdb) n
44		ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
(gdb) p/x mem
$3 = 0x24d20000
(gdb) s
rte_vfio_container_dma_map (container_fd=-1, vaddr=617742336, 
iova=617742336, len=131072)
     at ../lib/librte_eal/linux/eal_vfio.c:2038
2038		if (len == 0) {
(gdb) p/x iova
$4 = 0x24d20000
(gdb) n
2043		vfio_cfg = get_vfio_cfg_by_container_fd(container_fd);
(gdb) n
2044		if (vfio_cfg == NULL) {
(gdb) n
2049		return container_dma_map(vfio_cfg, vaddr, iova, len);
(gdb) s
container_dma_map (vfio_cfg=0x23faba88 <vfio_cfgs>, vaddr=617742336, 
iova=617742336, len=131072)
     at ../lib/librte_eal/linux/eal_vfio.c:1784
1784		int ret = 0;
(gdb) n
1786		user_mem_maps = &vfio_cfg->mem_maps;
(gdb) n
1787		rte_spinlock_recursive_lock(&user_mem_maps->lock);
(gdb) n
1788		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
(gdb) n
1795		if (vfio_dma_mem_map(vfio_cfg, vaddr, iova, len, 1)) {
(gdb) s
vfio_dma_mem_map (vfio_cfg=0x23faba88 <vfio_cfgs>, vaddr=617742336, 
iova=617742336, len=131072, do_map=1)
     at ../lib/librte_eal/linux/eal_vfio.c:1758
1758		const struct vfio_iommu_type *t = vfio_cfg->vfio_iommu_type;
(gdb) n
1760		if (!t) {
(gdb) n
1766		if (!t->dma_user_map_func) {
(gdb) n
1774		return t->dma_user_map_func(vfio_cfg->vfio_container_fd, vaddr, iova,
(gdb) s
vfio_spapr_dma_mem_map (vfio_container_fd=11, vaddr=617742336, 
iova=617742336, len=131072, do_map=1)
     at ../lib/librte_eal/linux/eal_vfio.c:1703
1703		int ret = 0;
(gdb) n
1705		if (do_map) {
(gdb) n
1706			if (vfio_spapr_dma_do_map(vfio_container_fd,
(gdb) s
vfio_spapr_dma_do_map (vfio_container_fd=11, vaddr=617742336, 
iova=617742336, len=131072, do_map=1)
     at ../lib/librte_eal/linux/eal_vfio.c:1399
1399		struct vfio_iommu_spapr_register_memory reg = {
(gdb) n
1407		if (do_map != 0) {
(gdb) n
1410			if (iova + len > spapr_dma_win_len) {
(gdb) p/x iova
$5 = 0x24d20000
(gdb) p/x spapr_dma_win_len
$6 = 0x4000000000
(gdb) n
1415			ret = ioctl(vfio_container_fd,
(gdb) n
1417			if (ret) {
(gdb) p ret
$7 = 0
(gdb) n
1423			memset(&dma_map, 0, sizeof(dma_map));
(gdb) n
1424			dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
(gdb) n
1425			dma_map.vaddr = vaddr;
(gdb) n
1426			dma_map.size = len;
(gdb) n
1427			dma_map.iova = iova;
(gdb) n
1428			dma_map.flags = VFIO_DMA_MAP_FLAG_READ |
(gdb) n
1431			ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
(gdb) p/x dma_map
$8 = {argsz = 0x20, flags = 0x3, vaddr = 0x24d20000, iova = 0x24d20000, 
size = 0x20000}
(gdb) n
1432			if (ret) {
(gdb) p ret
$9 = -1
(gdb) n
1433				RTE_LOG(ERR, EAL, "  cannot map vaddr for IOMMU, error %i (%s)\n",
(gdb) n
EAL:   cannot map vaddr for IOMMU, error 22 (Invalid argument)
1435				return -1;
(gdb) quit

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/4] test: add test case to validate VFIO DMA map/unmap
  2020-12-17 19:10     ` Nithin Dabilpuram
@ 2021-01-05 19:33       ` David Christensen
  2021-01-06  8:40         ` Nithin Dabilpuram
  0 siblings, 1 reply; 76+ messages in thread
From: David Christensen @ 2021-01-05 19:33 UTC (permalink / raw)
  To: Nithin Dabilpuram, anatoly.burakov, david.marchand; +Cc: jerinj, dev

Hey Nithin,

>> +static int
>> +test_memory_vfio_dma_map(void)
>> +{
>> +	uint64_t sz1, sz2, sz = 2 * rte_mem_page_size();
>> +	uint64_t unmap1, unmap2;
>> +	uint8_t *alloc_mem;
>> +	uint8_t *mem;
>> +	int ret;
>> +
>> +	/* Allocate twice size of requirement from heap to align later */
>> +	alloc_mem = malloc(sz * 2);
>> +	if (!alloc_mem) {
>> +		printf("Skipping test as unable to alloc %"PRIx64"B from heap\n",
>> +		       sz * 2);
>> +		return 1;
>> +	}
>> +
>> +	/* Force page allocation */
>> +	memset(alloc_mem, 0, sz * 2);
>> +
>> +	mem = RTE_PTR_ALIGN(alloc_mem, rte_mem_page_size());
>> +
>> +	/* map the whole region */
>> +	ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
>> +					 (uintptr_t)mem, (rte_iova_t)mem, sz);

I'm not sure how to resolve this patch for POWER systems.  The patch 
currently fails with the error:

EAL:   cannot map vaddr for IOMMU, error 22 (Invalid argument)

The problem is that the size argument (page size of 64KB * 2) is smaller 
than the page size set when the DMA window is created (2MB or 1GB 
depending on system configuration for hugepages), resulting in the 
EINVAL error.  When I tried bumping the sz value up to 2 * 1GB the test 
also failed because the VA address was well outside the DMA window set 
when scanning memseg lists.

Allocating heap memory dynamically through the EAL works since it's 
allocated in hugepage size segments and the EAL attempts to keep VA 
memory addresses contiguous, therefore within the defined DMA window. 
But the downside is that the memory is DMA mapped behind the scenes in 
vfio_mem_event_callback().

Not sure how to get around this without duplicating a lot of the heap 
management code in your test.  Maybe others have a suggestion.

Dave

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/4] test: add test case to validate VFIO DMA map/unmap
  2021-01-05 19:33       ` David Christensen
@ 2021-01-06  8:40         ` Nithin Dabilpuram
  2021-01-06 21:20           ` David Christensen
  0 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2021-01-06  8:40 UTC (permalink / raw)
  To: David Christensen; +Cc: anatoly.burakov, david.marchand, jerinj, dev

On Tue, Jan 05, 2021 at 11:33:20AM -0800, David Christensen wrote:
> Hey Nithin,
> 
> > > +static int
> > > +test_memory_vfio_dma_map(void)
> > > +{
> > > +	uint64_t sz1, sz2, sz = 2 * rte_mem_page_size();
> > > +	uint64_t unmap1, unmap2;
> > > +	uint8_t *alloc_mem;
> > > +	uint8_t *mem;
> > > +	int ret;
> > > +
> > > +	/* Allocate twice size of requirement from heap to align later */
> > > +	alloc_mem = malloc(sz * 2);
> > > +	if (!alloc_mem) {
> > > +		printf("Skipping test as unable to alloc %"PRIx64"B from heap\n",
> > > +		       sz * 2);
> > > +		return 1;
> > > +	}
> > > +
> > > +	/* Force page allocation */
> > > +	memset(alloc_mem, 0, sz * 2);
> > > +
> > > +	mem = RTE_PTR_ALIGN(alloc_mem, rte_mem_page_size());
> > > +
> > > +	/* map the whole region */
> > > +	ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > > +					 (uintptr_t)mem, (rte_iova_t)mem, sz);
> 
> I'm not sure how to resolve this patch for POWER systems.  The patch
> currently fails with the error:
> 
> EAL:   cannot map vaddr for IOMMU, error 22 (Invalid argument)
> 
> The problem is that the size argument (page size of 64KB * 2) is smaller
> than the page size set when the DMA window is created (2MB or 1GB depending
> on system configuration for hugepages), resulting in the EINVAL error.  When
> I tried bumping the sz value up to 2 * 1GB the test also failed because the
> VA address was well outside the DMA window set when scanning memseg lists.
> 
> Allocating heap memory dynamically through the EAL works since it's
> allocated in hugepage size segments and the EAL attempts to keep VA memory
> addresses contiguous, therefore within the defined DMA window. But the
> downside is that the memory is DMA mapped behind the scenes in
> vfio_mem_event_callback().
> 
> Not sure how to get around this without duplicating a lot of the heap
> management code in your test.  Maybe others have a suggestion.

David, Anatoly Burakov,

Given that both malloc'ed memory and mmap'd memory are not working on the POWER9
setup, I can either drop this test patch (3/4) alone or restrict it to non-POWER9
systems. Since the main fix is already acked, I think it shouldn't be a problem to
drop the test case, which was only added to demonstrate the problem.

Any thoughts?
> 
> Dave

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/4] test: add test case to validate VFIO DMA map/unmap
  2021-01-06  8:40         ` Nithin Dabilpuram
@ 2021-01-06 21:20           ` David Christensen
  0 siblings, 0 replies; 76+ messages in thread
From: David Christensen @ 2021-01-06 21:20 UTC (permalink / raw)
  To: Nithin Dabilpuram; +Cc: anatoly.burakov, david.marchand, jerinj, dev



On 1/6/21 12:40 AM, Nithin Dabilpuram wrote:
> On Tue, Jan 05, 2021 at 11:33:20AM -0800, David Christensen wrote:
>> Hey Nithin,
>>
>>>> +static int
>>>> +test_memory_vfio_dma_map(void)
>>>> +{
>>>> +	uint64_t sz1, sz2, sz = 2 * rte_mem_page_size();
>>>> +	uint64_t unmap1, unmap2;
>>>> +	uint8_t *alloc_mem;
>>>> +	uint8_t *mem;
>>>> +	int ret;
>>>> +
>>>> +	/* Allocate twice size of requirement from heap to align later */
>>>> +	alloc_mem = malloc(sz * 2);
>>>> +	if (!alloc_mem) {
>>>> +		printf("Skipping test as unable to alloc %"PRIx64"B from heap\n",
>>>> +		       sz * 2);
>>>> +		return 1;
>>>> +	}
>>>> +
>>>> +	/* Force page allocation */
>>>> +	memset(alloc_mem, 0, sz * 2);
>>>> +
>>>> +	mem = RTE_PTR_ALIGN(alloc_mem, rte_mem_page_size());
>>>> +
>>>> +	/* map the whole region */
>>>> +	ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
>>>> +					 (uintptr_t)mem, (rte_iova_t)mem, sz);
>>
>> I'm not sure how to resolve this patch for POWER systems.  The patch
>> currently fails with the error:
>>
>> EAL:   cannot map vaddr for IOMMU, error 22 (Invalid argument)
>>
>> The problem is that the size argument (page size of 64KB * 2) is smaller
>> than the page size set when the DMA window is created (2MB or 1GB depending
>> on system configuration for hugepages), resulting in the EINVAL error.  When
>> I tried bumping the sz value up to 2 * 1GB the test also failed because the
>> VA address was well outside the DMA window set when scanning memseg lists.
>>
>> Allocating heap memory dynamically through the EAL works since it's
>> allocated in hugepage size segments and the EAL attempts to keep VA memory
>> addresses contiguous, therefore within the defined DMA window. But the
>> downside is that the memory is DMA mapped behind the scenes in
>> vfio_mem_event_callback().
>>
>> Not sure how to get around this without duplicating a lot of the heap
>> management code in your test.  Maybe others have a suggestion.
> 
> David, Anatoly Burakov,
> 
> Given that both malloc'ed memory and mmap'd memory are not working on the POWER9
> setup, I can either drop this test patch (3/4) alone or restrict it to non-POWER9
> systems. Since the main fix is already acked, I think it shouldn't be a problem to
> drop the test case, which was only added to demonstrate the problem.
> 
> Any thoughts?

I dislike having to special-case the architecture in general, but I don't 
see an easy solution in this case. I'd be fine with either option: 
dropping the test case or disabling test case support for all POWER 
architectures.

Dave

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v7 0/3] fix issue with partial DMA unmap
  2020-10-12  8:11 [dpdk-dev] [PATCH 0/2] fix issue with partial DMA unmap Nithin Dabilpuram
                   ` (6 preceding siblings ...)
  2020-12-17 19:06 ` [dpdk-dev] [PATCH v6 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
@ 2021-01-12 17:39 ` Nithin Dabilpuram
  2021-01-12 17:39   ` [dpdk-dev] [PATCH v7 1/3] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
                     ` (2 more replies)
  2021-01-15  7:32 ` [dpdk-dev] [PATCH v8 0/3] fix issue with partial DMA unmap Nithin Dabilpuram
  8 siblings, 3 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2021-01-12 17:39 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram

Partial DMA unmap is not supported by VFIO type1 IOMMU
in Linux. Though the return value is zero, the returned
DMA unmap size is not the same as the expected size.
So add a test case and fixes for both heap-triggered DMA
mapping and user-triggered DMA mapping/unmapping.

Refer to vfio_dma_do_unmap() in drivers/vfio/vfio_iommu_type1.c.
A snippet of the comment is below.

        /*
         * vfio-iommu-type1 (v1) - User mappings were coalesced together to
         * avoid tracking individual mappings.  This means that the granularity
         * of the original mapping was lost and the user was allowed to attempt
         * to unmap any range.  Depending on the contiguousness of physical
         * memory and page sizes supported by the IOMMU, arbitrary unmaps may
         * or may not have worked.  We only guaranteed unmap granularity
         * matching the original mapping; even though it was untracked here,
         * the original mappings are reflected in IOMMU mappings.  This
         * resulted in a couple unusual behaviors.  First, if a range is not
         * able to be unmapped, ex. a set of 4k pages that was mapped as a
         * 2M hugepage into the IOMMU, the unmap ioctl returns success but with
         * a zero sized unmap.  Also, if an unmap request overlaps the first
         * address of a hugepage, the IOMMU will unmap the entire hugepage.
         * This also returns success and the returned unmap size reflects the
         * actual size unmapped.

         * We attempt to maintain compatibility with this "v1" interface, but  
         * we take control out of the hands of the IOMMU.  Therefore, an unmap 
         * request offset from the beginning of the original mapping will      
         * return success with zero sized unmap.  And an unmap request covering
         * the first iova of mapping will unmap the entire range.              

This behavior can be verified by applying the first patch and adding a return
check for dma_unmap.size != len in vfio_type1_dma_mem_map().
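As a rough illustration of that check, here is a sketch (field names come from
the ``linux/vfio.h`` UAPI header; the function is a simplified stand-in for
illustration, not the actual DPDK code in eal_vfio.c):

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Simplified stand-in for the unmap path of vfio_type1_dma_mem_map():
 * type1 can return 0 from the ioctl while unmapping fewer bytes than
 * requested (zero bytes for a partial unmap of a hugepage), so the
 * returned size field must be compared against the requested length. */
static int
type1_dma_unmap_checked(int container_fd, uint64_t iova, uint64_t len)
{
	struct vfio_iommu_type1_dma_unmap dma_unmap;

	memset(&dma_unmap, 0, sizeof(dma_unmap));
	dma_unmap.argsz = sizeof(dma_unmap);
	dma_unmap.size = len;
	dma_unmap.iova = iova;

	if (ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap) != 0)
		return -errno;
	if (dma_unmap.size != len)
		return -EINVAL; /* partial unmap: treat as failure */
	return 0;
}
```

Without the size comparison, a caller would see success even though the IOMMU
mappings were left fully in place.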

v7:
- Dropped the vfio test case of patch 3/4, i.e.
  "test: add test case to validate VFIO DMA map/unmap",
  as it could not be supported on POWER9 systems.

v6:
- Fixed issue with x86-32 build introduced by v5.

v5:
- Changed the vfio test in test_vfio.c to use system pages allocated from
  the heap instead of mmap() so that they fall within the initially configured
  DMA window on POWER9 systems.
- Added acked-by from David for 1/4, 2/4.

v4:
- Fixed issue with patch 4/4 on x86 builds.

v3:
- Fixed the external memory test case (4/4) to use the system page size
  instead of 4K.
- Fixed check-git-log.sh issue and rebased.
- Added acked-by from anatoly.burakov@intel.com to first 3 patches.

v2: 
- Reverted the earlier commit that enables merging of contiguous mappings for
  IOVA as PA. (see 1/3)
- Updated documentation about kernel DMA mapping limits and the VFIO
  module parameter.
- Moved vfio test to test_vfio.c and handled comments from
  Anatoly.

Nithin Dabilpuram (3):
  vfio: revert changes for map contiguous areas in one go
  vfio: fix DMA mapping granularity for type1 IOVA as VA
  test: change external memory test to use system page sz

 app/test/test_external_mem.c           |  3 +-
 doc/guides/linux_gsg/linux_drivers.rst | 10 ++++
 lib/librte_eal/linux/eal_vfio.c        | 93 +++++++++++++---------------------
 lib/librte_eal/linux/eal_vfio.h        |  1 +
 4 files changed, 49 insertions(+), 58 deletions(-)

-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v7 1/3] vfio: revert changes for map contiguous areas in one go
  2021-01-12 17:39 ` [dpdk-dev] [PATCH v7 0/3] " Nithin Dabilpuram
@ 2021-01-12 17:39   ` Nithin Dabilpuram
  2021-01-12 17:39   ` [dpdk-dev] [PATCH v7 2/3] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
  2021-01-12 17:39   ` [dpdk-dev] [PATCH v7 3/3] test: change external memory test to use system page sz Nithin Dabilpuram
  2 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2021-01-12 17:39 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram, stable

In order to save DMA entries, which are limited by the kernel, for
both external memory and hugepage memory, an attempt was made to map
physically contiguous memory in one go. This cannot be done as VFIO
IOMMU type1 does not support partially unmapping a previously mapped
memory region, while the heap can request multi-page mapping and
partial unmapping.
Hence, to go back to the old method of mapping/unmapping at memseg
granularity, this commit reverts
commit d1c7c0cdf7ba ("vfio: map contiguous areas in one go").

Also add documentation on which module parameter needs to be used
to increase the per-container DMA map limit for VFIO.

Fixes: d1c7c0cdf7ba ("vfio: map contiguous areas in one go")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
---
 doc/guides/linux_gsg/linux_drivers.rst | 10 ++++++
 lib/librte_eal/linux/eal_vfio.c        | 59 +++++-----------------------------
 2 files changed, 18 insertions(+), 51 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_drivers.rst b/doc/guides/linux_gsg/linux_drivers.rst
index 90635a4..9a662a7 100644
--- a/doc/guides/linux_gsg/linux_drivers.rst
+++ b/doc/guides/linux_gsg/linux_drivers.rst
@@ -25,6 +25,16 @@ To make use of VFIO, the ``vfio-pci`` module must be loaded:
 VFIO kernel is usually present by default in all distributions,
 however please consult your distributions documentation to make sure that is the case.
 
+For DMA mapping of either external memory or hugepages, VFIO interface is used.
+VFIO does not support partial unmap of once mapped memory. Hence DPDK's memory is
+mapped in hugepage granularity or system page granularity. Number of DMA
+mappings is limited by kernel with user locked memory limit of a process(rlimit)
+for system/hugepage memory. Another per-container overall limit applicable both
+for external memory and system memory was added in kernel 5.1 defined by
+VFIO module parameter ``dma_entry_limit`` with a default value of 64K.
+When application is out of DMA entries, these limits need to be adjusted to
+increase the allowed limit.
+
 Since Linux version 5.7,
 the ``vfio-pci`` module supports the creation of virtual functions.
 After the PF is bound to ``vfio-pci`` module,
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 0500824..64b134d 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -517,11 +517,9 @@ static void
 vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		void *arg __rte_unused)
 {
-	rte_iova_t iova_start, iova_expected;
 	struct rte_memseg_list *msl;
 	struct rte_memseg *ms;
 	size_t cur_len = 0;
-	uint64_t va_start;
 
 	msl = rte_mem_virt2memseg_list(addr);
 
@@ -539,63 +537,22 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
-
-	/*
-	 * This memory is not guaranteed to be contiguous, but it still could
-	 * be, or it could have some small contiguous chunks. Since the number
-	 * of VFIO mappings is limited, and VFIO appears to not concatenate
-	 * adjacent mappings, we have to do this ourselves.
-	 *
-	 * So, find contiguous chunks, then map them.
-	 */
-	va_start = ms->addr_64;
-	iova_start = iova_expected = ms->iova;
 	while (cur_len < len) {
-		bool new_contig_area = ms->iova != iova_expected;
-		bool last_seg = (len - cur_len) == ms->len;
-		bool skip_last = false;
-
-		/* only do mappings when current contiguous area ends */
-		if (new_contig_area) {
-			if (type == RTE_MEM_EVENT_ALLOC)
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 1);
-			else
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 0);
-			va_start = ms->addr_64;
-			iova_start = ms->iova;
-		}
 		/* some memory segments may have invalid IOVA */
 		if (ms->iova == RTE_BAD_IOVA) {
 			RTE_LOG(DEBUG, EAL, "Memory segment at %p has bad IOVA, skipping\n",
 					ms->addr);
-			skip_last = true;
+			goto next;
 		}
-		iova_expected = ms->iova + ms->len;
+		if (type == RTE_MEM_EVENT_ALLOC)
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 1);
+		else
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 0);
+next:
 		cur_len += ms->len;
 		++ms;
-
-		/*
-		 * don't count previous segment, and don't attempt to
-		 * dereference a potentially invalid pointer.
-		 */
-		if (skip_last && !last_seg) {
-			iova_expected = iova_start = ms->iova;
-			va_start = ms->addr_64;
-		} else if (!skip_last && last_seg) {
-			/* this is the last segment and we're not skipping */
-			if (type == RTE_MEM_EVENT_ALLOC)
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 1);
-			else
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 0);
-		}
 	}
 }
 
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v7 2/3] vfio: fix DMA mapping granularity for type1 IOVA as VA
  2021-01-12 17:39 ` [dpdk-dev] [PATCH v7 0/3] " Nithin Dabilpuram
  2021-01-12 17:39   ` [dpdk-dev] [PATCH v7 1/3] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
@ 2021-01-12 17:39   ` Nithin Dabilpuram
  2021-01-12 17:39   ` [dpdk-dev] [PATCH v7 3/3] test: change external memory test to use system page sz Nithin Dabilpuram
  2 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2021-01-12 17:39 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram, stable

Partial unmapping is not supported for VFIO IOMMU type1
by the kernel. Though the kernel returns zero, the unmapped size
reported will not be the same as requested. So check the
returned unmap size and return an error.

For IOVA as PA mode, DMA mapping is already done at memseg
size granularity. Do the same for IOVA as VA mode: for DMA
map/unmap triggered by heap allocations, maintain memseg page
size granularity so that heap expansion and contraction do not
hit this issue.

For user-requested DMA map/unmap, disallow partial unmapping
for VFIO type1.

Fixes: 73a639085938 ("vfio: allow to map other memory regions")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
---
 lib/librte_eal/linux/eal_vfio.c | 34 ++++++++++++++++++++++++++++------
 lib/librte_eal/linux/eal_vfio.h |  1 +
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 64b134d..b15b758 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -70,6 +70,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_TYPE1,
 		.name = "Type 1",
+		.partial_unmap = false,
 		.dma_map_func = &vfio_type1_dma_map,
 		.dma_user_map_func = &vfio_type1_dma_mem_map
 	},
@@ -77,6 +78,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_SPAPR,
 		.name = "sPAPR",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_spapr_dma_map,
 		.dma_user_map_func = &vfio_spapr_dma_mem_map
 	},
@@ -84,6 +86,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_NOIOMMU,
 		.name = "No-IOMMU",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_noiommu_dma_map,
 		.dma_user_map_func = &vfio_noiommu_dma_mem_map
 	},
@@ -526,12 +529,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 	/* for IOVA as VA mode, no need to care for IOVA addresses */
 	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
 		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
-		if (type == RTE_MEM_EVENT_ALLOC)
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 1);
-		else
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 0);
+		uint64_t page_sz = msl->page_sz;
+
+		/* Maintain granularity of DMA map/unmap to memseg size */
+		for (; cur_len < len; cur_len += page_sz) {
+			if (type == RTE_MEM_EVENT_ALLOC)
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 1);
+			else
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 0);
+			vfio_va += page_sz;
+		}
+
 		return;
 	}
 
@@ -1348,6 +1358,12 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
 					errno, strerror(errno));
 			return -1;
+		} else if (dma_unmap.size != len) {
+			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
+				"remapping cleared instead of %"PRIu64"\n",
+				(uint64_t)dma_unmap.size, len);
+			rte_errno = EIO;
+			return -1;
 		}
 	}
 
@@ -1823,6 +1839,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
 		/* we're partially unmapping a previously mapped region, so we
 		 * need to split entry into two.
 		 */
+		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
+			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
+			rte_errno = ENOTSUP;
+			ret = -1;
+			goto out;
+		}
 		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
 			rte_errno = ENOMEM;
diff --git a/lib/librte_eal/linux/eal_vfio.h b/lib/librte_eal/linux/eal_vfio.h
index cb2d35f..6ebaca6 100644
--- a/lib/librte_eal/linux/eal_vfio.h
+++ b/lib/librte_eal/linux/eal_vfio.h
@@ -113,6 +113,7 @@ typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova,
 struct vfio_iommu_type {
 	int type_id;
 	const char *name;
+	bool partial_unmap;
 	vfio_dma_user_func_t dma_user_map_func;
 	vfio_dma_func_t dma_map_func;
 };
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v7 3/3] test: change external memory test to use system page sz
  2021-01-12 17:39 ` [dpdk-dev] [PATCH v7 0/3] " Nithin Dabilpuram
  2021-01-12 17:39   ` [dpdk-dev] [PATCH v7 1/3] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
  2021-01-12 17:39   ` [dpdk-dev] [PATCH v7 2/3] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
@ 2021-01-12 17:39   ` Nithin Dabilpuram
  2021-01-14 16:30     ` David Marchand
  2 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2021-01-12 17:39 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram

Currently the external memory test uses a 4K page size, but
VFIO DMA mapping works only at system page granularity.

Earlier it worked because all the contiguous mappings were
coalesced and mapped in one go, which effectively produced a
much bigger page. Now that VFIO DMA mappings, in both IOVA as VA
and IOVA as PA modes, are done at memseg list granularity,
we need to use the system page size.

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
---
 app/test/test_external_mem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/app/test/test_external_mem.c b/app/test/test_external_mem.c
index 7eb81f6..5edf88b 100644
--- a/app/test/test_external_mem.c
+++ b/app/test/test_external_mem.c
@@ -13,6 +13,7 @@
 #include <rte_common.h>
 #include <rte_debug.h>
 #include <rte_eal.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
 #include <rte_malloc.h>
 #include <rte_ring.h>
@@ -532,8 +533,8 @@ test_extmem_basic(void *addr, size_t len, size_t pgsz, rte_iova_t *iova,
 static int
 test_external_mem(void)
 {
+	size_t pgsz = rte_mem_page_size();
 	size_t len = EXTERNAL_MEM_SZ;
-	size_t pgsz = RTE_PGSIZE_4K;
 	rte_iova_t iova[len / pgsz];
 	void *addr;
 	int ret, n_pages;
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v7 3/3] test: change external memory test to use system page sz
  2021-01-12 17:39   ` [dpdk-dev] [PATCH v7 3/3] test: change external memory test to use system page sz Nithin Dabilpuram
@ 2021-01-14 16:30     ` David Marchand
  2021-01-15  6:57       ` Nithin Dabilpuram
  0 siblings, 1 reply; 76+ messages in thread
From: David Marchand @ 2021-01-14 16:30 UTC (permalink / raw)
  To: Nithin Dabilpuram, Burakov, Anatoly, David Christensen
  Cc: Jerin Jacob Kollanukkaran, dev

On Tue, Jan 12, 2021 at 6:39 PM Nithin Dabilpuram
<ndabilpuram@marvell.com> wrote:
>
> Currently external memory test uses 4K page size.
> VFIO DMA mapping works only with system page granularity.
>
> Earlier it was working because all the contiguous mappings
> were coalesced and mapped in one-go which ended up becoming
> a lot bigger page. Now that VFIO DMA mappings both in IOVA as VA
> and IOVA as PA mode, are being done at memseg list granularity,
> we need to use system page size.

When you say "earlier", do you mean before this series?
The other patches have been marked for backports, so either this test
change must be too, or its content must be squashed in the right patch
of this series.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v7 3/3] test: change external memory test to use system page sz
  2021-01-14 16:30     ` David Marchand
@ 2021-01-15  6:57       ` Nithin Dabilpuram
  0 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2021-01-15  6:57 UTC (permalink / raw)
  To: David Marchand
  Cc: Burakov, Anatoly, David Christensen, Jerin Jacob Kollanukkaran, dev

On Thu, Jan 14, 2021 at 05:30:05PM +0100, David Marchand wrote:
> On Tue, Jan 12, 2021 at 6:39 PM Nithin Dabilpuram
> <ndabilpuram@marvell.com> wrote:
> >
> > Currently external memory test uses 4K page size.
> > VFIO DMA mapping works only with system page granularity.
> >
> > Earlier it was working because all the contiguous mappings
> > were coalesced and mapped in one-go which ended up becoming
> > a lot bigger page. Now that VFIO DMA mappings both in IOVA as VA
> > and IOVA as PA mode, are being done at memseg list granularity,
> > we need to use system page size.
> 
> When you say "earlier", do you mean before this series?
> The other patches have been marked for backports, so either this test
> change must be too, or its content must be squashed in the right patch
> of this series.

Yes, I meant before this series when I said "earlier".
I missed adding Cc: stable on this patch. Will send a v8.

> 
> 
> -- 
> David Marchand
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v8 0/3] fix issue with partial DMA unmap
  2020-10-12  8:11 [dpdk-dev] [PATCH 0/2] fix issue with partial DMA unmap Nithin Dabilpuram
                   ` (7 preceding siblings ...)
  2021-01-12 17:39 ` [dpdk-dev] [PATCH v7 0/3] " Nithin Dabilpuram
@ 2021-01-15  7:32 ` Nithin Dabilpuram
  2021-01-15  7:32   ` [dpdk-dev] [PATCH v8 1/3] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
                     ` (3 more replies)
  8 siblings, 4 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2021-01-15  7:32 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram

Partial DMA unmap is not supported by the VFIO type1 IOMMU
in Linux. Though the return value is zero, the returned
DMA unmap size is not the same as the expected size.
So add a test case and a fix to both heap-triggered DMA
mapping and user-triggered DMA mapping/unmapping.

Refer to vfio_dma_do_unmap() in drivers/vfio/vfio_iommu_type1.c;
a snippet of its comment is below.

        /*
         * vfio-iommu-type1 (v1) - User mappings were coalesced together to
         * avoid tracking individual mappings.  This means that the granularity
         * of the original mapping was lost and the user was allowed to attempt
         * to unmap any range.  Depending on the contiguousness of physical
         * memory and page sizes supported by the IOMMU, arbitrary unmaps may
         * or may not have worked.  We only guaranteed unmap granularity
         * matching the original mapping; even though it was untracked here,
         * the original mappings are reflected in IOMMU mappings.  This
         * resulted in a couple unusual behaviors.  First, if a range is not
         * able to be unmapped, ex. a set of 4k pages that was mapped as a
         * 2M hugepage into the IOMMU, the unmap ioctl returns success but with
         * a zero sized unmap.  Also, if an unmap request overlaps the first
         * address of a hugepage, the IOMMU will unmap the entire hugepage.
         * This also returns success and the returned unmap size reflects the
         * actual size unmapped.

         * We attempt to maintain compatibility with this "v1" interface, but  
         * we take control out of the hands of the IOMMU.  Therefore, an unmap 
         * request offset from the beginning of the original mapping will      
         * return success with zero sized unmap.  And an unmap request covering
         * the first iova of mapping will unmap the entire range.              

This behavior can be verified by applying the first patch and adding a
return check for dma_unmap.size != len in vfio_type1_dma_mem_map().

v8:
- Add cc stable to patch 3/3

v7:
- Dropped the VFIO test case of patch 3/4, i.e.
  "test: add test case to validate VFIO DMA map/unmap",
  as it couldn't be supported on POWER9 systems.

v6:
- Fixed issue with x86-32 build introduced by v5.

v5:
- Changed the VFIO test in test_vfio.c to use system pages allocated
  from the heap instead of mmap(), so that they fall within the
  initially configured window on POWER9 systems.
- Added acked-by from David for 1/4, 2/4.

v4:
- Fixed issue with patch 4/4 on x86 builds.

v3:
- Fixed the external memory test case (4/4) to use the system page size
  instead of 4K.
- Fixed check-git-log.sh issue and rebased.
- Added acked-by from anatoly.burakov@intel.com to first 3 patches.

v2: 
- Reverted the earlier commit that enabled merging contiguous mappings
  for IOVA as PA (see 1/3).
- Updated documentation about kernel DMA mapping limits and the VFIO
  module parameter.
- Moved vfio test to test_vfio.c and handled comments from
  Anatoly.

Nithin Dabilpuram (3):
  vfio: revert changes for map contiguous areas in one go
  vfio: fix DMA mapping granularity for type1 IOVA as VA
  test: change external memory test to use system page sz

 app/test/test_external_mem.c           |  3 +-
 doc/guides/linux_gsg/linux_drivers.rst | 10 ++++
 lib/librte_eal/linux/eal_vfio.c        | 93 +++++++++++++---------------------
 lib/librte_eal/linux/eal_vfio.h        |  1 +
 4 files changed, 49 insertions(+), 58 deletions(-)

-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v8 1/3] vfio: revert changes for map contiguous areas in one go
  2021-01-15  7:32 ` [dpdk-dev] [PATCH v8 0/3] fix issue with partial DMA unmap Nithin Dabilpuram
@ 2021-01-15  7:32   ` Nithin Dabilpuram
  2021-01-15  7:32   ` [dpdk-dev] [PATCH v8 2/3] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2021-01-15  7:32 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram, stable

In order to save DMA entries, which are limited by the kernel, both
for external memory and hugepage memory, an attempt was made to map
physically contiguous memory in one go. This cannot be done, as VFIO
IOMMU type1 does not support partially unmapping a previously mapped
memory region, while the heap can request multi-page mapping and
partial unmapping.
Hence, to go back to the old method of mapping/unmapping at
memseg granularity, this commit reverts
commit d1c7c0cdf7ba ("vfio: map contiguous areas in one go")

Also add documentation on the module parameter that needs to be used
to increase the per-container DMA map limit for VFIO.

Fixes: d1c7c0cdf7ba ("vfio: map contiguous areas in one go")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
---
 doc/guides/linux_gsg/linux_drivers.rst | 10 ++++++
 lib/librte_eal/linux/eal_vfio.c        | 59 +++++-----------------------------
 2 files changed, 18 insertions(+), 51 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_drivers.rst b/doc/guides/linux_gsg/linux_drivers.rst
index 90635a4..9a662a7 100644
--- a/doc/guides/linux_gsg/linux_drivers.rst
+++ b/doc/guides/linux_gsg/linux_drivers.rst
@@ -25,6 +25,16 @@ To make use of VFIO, the ``vfio-pci`` module must be loaded:
 VFIO kernel is usually present by default in all distributions,
 however please consult your distributions documentation to make sure that is the case.
 
+For DMA mapping of either external memory or hugepages, the VFIO interface is
+used. VFIO does not support partial unmap of previously mapped memory, so DPDK's
+memory is mapped at hugepage or system page granularity. The number of DMA
+mappings is limited by the kernel through the user locked memory limit of a
+process (rlimit) for system/hugepage memory. Another per-container overall
+limit, applicable to both external memory and system memory, was added in
+kernel 5.1 and is defined by the VFIO module parameter ``dma_entry_limit``,
+with a default value of 64K. When an application runs out of DMA entries,
+these limits need to be adjusted to increase the allowed limit.
+
 Since Linux version 5.7,
 the ``vfio-pci`` module supports the creation of virtual functions.
 After the PF is bound to ``vfio-pci`` module,
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 0500824..64b134d 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -517,11 +517,9 @@ static void
 vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		void *arg __rte_unused)
 {
-	rte_iova_t iova_start, iova_expected;
 	struct rte_memseg_list *msl;
 	struct rte_memseg *ms;
 	size_t cur_len = 0;
-	uint64_t va_start;
 
 	msl = rte_mem_virt2memseg_list(addr);
 
@@ -539,63 +537,22 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
-
-	/*
-	 * This memory is not guaranteed to be contiguous, but it still could
-	 * be, or it could have some small contiguous chunks. Since the number
-	 * of VFIO mappings is limited, and VFIO appears to not concatenate
-	 * adjacent mappings, we have to do this ourselves.
-	 *
-	 * So, find contiguous chunks, then map them.
-	 */
-	va_start = ms->addr_64;
-	iova_start = iova_expected = ms->iova;
 	while (cur_len < len) {
-		bool new_contig_area = ms->iova != iova_expected;
-		bool last_seg = (len - cur_len) == ms->len;
-		bool skip_last = false;
-
-		/* only do mappings when current contiguous area ends */
-		if (new_contig_area) {
-			if (type == RTE_MEM_EVENT_ALLOC)
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 1);
-			else
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 0);
-			va_start = ms->addr_64;
-			iova_start = ms->iova;
-		}
 		/* some memory segments may have invalid IOVA */
 		if (ms->iova == RTE_BAD_IOVA) {
 			RTE_LOG(DEBUG, EAL, "Memory segment at %p has bad IOVA, skipping\n",
 					ms->addr);
-			skip_last = true;
+			goto next;
 		}
-		iova_expected = ms->iova + ms->len;
+		if (type == RTE_MEM_EVENT_ALLOC)
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 1);
+		else
+			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
+					ms->iova, ms->len, 0);
+next:
 		cur_len += ms->len;
 		++ms;
-
-		/*
-		 * don't count previous segment, and don't attempt to
-		 * dereference a potentially invalid pointer.
-		 */
-		if (skip_last && !last_seg) {
-			iova_expected = iova_start = ms->iova;
-			va_start = ms->addr_64;
-		} else if (!skip_last && last_seg) {
-			/* this is the last segment and we're not skipping */
-			if (type == RTE_MEM_EVENT_ALLOC)
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 1);
-			else
-				vfio_dma_mem_map(default_vfio_cfg, va_start,
-						iova_start,
-						iova_expected - iova_start, 0);
-		}
 	}
 }
 
-- 
2.8.4
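As a usage note for the documentation added in this patch: the per-container limit can be adjusted when loading the module. This is a configuration sketch; the value 131072 is an arbitrary example, and the sysfs read assumes the parameter is exported readable:

```shell
# Raise the VFIO type1 per-container DMA entry limit (kernel >= 5.1).
# Unloading fails if the module is in use; do this before binding devices.
modprobe -r vfio_iommu_type1
modprobe vfio_iommu_type1 dma_entry_limit=131072
# Confirm the new value:
cat /sys/module/vfio_iommu_type1/parameters/dma_entry_limit
```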


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v8 2/3] vfio: fix DMA mapping granularity for type1 IOVA as VA
  2021-01-15  7:32 ` [dpdk-dev] [PATCH v8 0/3] fix issue with partial DMA unmap Nithin Dabilpuram
  2021-01-15  7:32   ` [dpdk-dev] [PATCH v8 1/3] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
@ 2021-01-15  7:32   ` Nithin Dabilpuram
  2021-01-15  7:32   ` [dpdk-dev] [PATCH v8 3/3] test: change external memory test to use system page sz Nithin Dabilpuram
  2021-02-16 13:14   ` [dpdk-dev] [PATCH v8 0/3] fix issue with partial DMA unmap Burakov, Anatoly
  3 siblings, 0 replies; 76+ messages in thread
From: Nithin Dabilpuram @ 2021-01-15  7:32 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram, stable

Partial unmapping is not supported for VFIO IOMMU type1
by the kernel. Though the kernel returns zero, the unmapped size
reported will not be the same as requested. So check the
returned unmap size and return an error.

For IOVA as PA mode, DMA mapping is already done at memseg
size granularity. Do the same for IOVA as VA mode: for DMA
map/unmap triggered by heap allocations, maintain memseg page
size granularity so that heap expansion and contraction do not
hit this issue.

For user-requested DMA map/unmap, disallow partial unmapping
for VFIO type1.

Fixes: 73a639085938 ("vfio: allow to map other memory regions")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
---
 lib/librte_eal/linux/eal_vfio.c | 34 ++++++++++++++++++++++++++++------
 lib/librte_eal/linux/eal_vfio.h |  1 +
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 64b134d..b15b758 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -70,6 +70,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_TYPE1,
 		.name = "Type 1",
+		.partial_unmap = false,
 		.dma_map_func = &vfio_type1_dma_map,
 		.dma_user_map_func = &vfio_type1_dma_mem_map
 	},
@@ -77,6 +78,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_SPAPR,
 		.name = "sPAPR",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_spapr_dma_map,
 		.dma_user_map_func = &vfio_spapr_dma_mem_map
 	},
@@ -84,6 +86,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_NOIOMMU,
 		.name = "No-IOMMU",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_noiommu_dma_map,
 		.dma_user_map_func = &vfio_noiommu_dma_mem_map
 	},
@@ -526,12 +529,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 	/* for IOVA as VA mode, no need to care for IOVA addresses */
 	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
 		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
-		if (type == RTE_MEM_EVENT_ALLOC)
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 1);
-		else
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 0);
+		uint64_t page_sz = msl->page_sz;
+
+		/* Maintain granularity of DMA map/unmap to memseg size */
+		for (; cur_len < len; cur_len += page_sz) {
+			if (type == RTE_MEM_EVENT_ALLOC)
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 1);
+			else
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 0);
+			vfio_va += page_sz;
+		}
+
 		return;
 	}
 
@@ -1348,6 +1358,12 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
 					errno, strerror(errno));
 			return -1;
+		} else if (dma_unmap.size != len) {
+			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
+				"remapping cleared instead of %"PRIu64"\n",
+				(uint64_t)dma_unmap.size, len);
+			rte_errno = EIO;
+			return -1;
 		}
 	}
 
@@ -1823,6 +1839,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
 		/* we're partially unmapping a previously mapped region, so we
 		 * need to split entry into two.
 		 */
+		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
+			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
+			rte_errno = ENOTSUP;
+			ret = -1;
+			goto out;
+		}
 		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
 			rte_errno = ENOMEM;
diff --git a/lib/librte_eal/linux/eal_vfio.h b/lib/librte_eal/linux/eal_vfio.h
index cb2d35f..6ebaca6 100644
--- a/lib/librte_eal/linux/eal_vfio.h
+++ b/lib/librte_eal/linux/eal_vfio.h
@@ -113,6 +113,7 @@ typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova,
 struct vfio_iommu_type {
 	int type_id;
 	const char *name;
+	bool partial_unmap;
 	vfio_dma_user_func_t dma_user_map_func;
 	vfio_dma_func_t dma_map_func;
 };
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dpdk-dev] [PATCH v8 3/3] test: change external memory test to use system page sz
  2021-01-15  7:32 ` [dpdk-dev] [PATCH v8 0/3] fix issue with partial DMA unmap Nithin Dabilpuram
  2021-01-15  7:32   ` [dpdk-dev] [PATCH v8 1/3] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
  2021-01-15  7:32   ` [dpdk-dev] [PATCH v8 2/3] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
@ 2021-01-15  7:32   ` Nithin Dabilpuram
  2021-02-11 11:21     ` Burakov, Anatoly
  2021-02-16 13:14   ` [dpdk-dev] [PATCH v8 0/3] fix issue with partial DMA unmap Burakov, Anatoly
  3 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2021-01-15  7:32 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen, david.marchand
  Cc: jerinj, dev, Nithin Dabilpuram, stable

Currently the external memory test uses a 4K page size, but
VFIO DMA mapping works only at system page granularity.

Earlier it worked because all the contiguous mappings were
coalesced and mapped in one go, which effectively produced a
much bigger page. Now that VFIO DMA mappings, in both IOVA as VA
and IOVA as PA modes, are done at memseg list granularity,
we need to use the system page size.

Fixes: b270daa43b3d ("test: support external memory")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
---
 app/test/test_external_mem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/app/test/test_external_mem.c b/app/test/test_external_mem.c
index 7eb81f6..5edf88b 100644
--- a/app/test/test_external_mem.c
+++ b/app/test/test_external_mem.c
@@ -13,6 +13,7 @@
 #include <rte_common.h>
 #include <rte_debug.h>
 #include <rte_eal.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
 #include <rte_malloc.h>
 #include <rte_ring.h>
@@ -532,8 +533,8 @@ test_extmem_basic(void *addr, size_t len, size_t pgsz, rte_iova_t *iova,
 static int
 test_external_mem(void)
 {
+	size_t pgsz = rte_mem_page_size();
 	size_t len = EXTERNAL_MEM_SZ;
-	size_t pgsz = RTE_PGSIZE_4K;
 	rte_iova_t iova[len / pgsz];
 	void *addr;
 	int ret, n_pages;
-- 
2.8.4


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v8 3/3] test: change external memory test to use system page sz
  2021-01-15  7:32   ` [dpdk-dev] [PATCH v8 3/3] test: change external memory test to use system page sz Nithin Dabilpuram
@ 2021-02-11 11:21     ` Burakov, Anatoly
  0 siblings, 0 replies; 76+ messages in thread
From: Burakov, Anatoly @ 2021-02-11 11:21 UTC (permalink / raw)
  To: Nithin Dabilpuram, David Christensen, david.marchand; +Cc: jerinj, dev, stable

On 15-Jan-21 7:32 AM, Nithin Dabilpuram wrote:
> Currently external memory test uses 4K page size.
> VFIO DMA mapping works only with system page granularity.
> 
> Earlier it was working because all the contiguous mappings
> were coalesced and mapped in one-go which ended up becoming
> a lot bigger page. Now that VFIO DMA mappings both in IOVA as VA
> and IOVA as PA mode, are being done at memseg list granularity,
> we need to use system page size.
> 
> Fixes: b270daa43b3d ("test: support external memory")
> Cc: anatoly.burakov@intel.com
> Cc: stable@dpdk.org
> 
> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> ---

Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v8 0/3] fix issue with partial DMA unmap
  2021-01-15  7:32 ` [dpdk-dev] [PATCH v8 0/3] fix issue with partial DMA unmap Nithin Dabilpuram
                     ` (2 preceding siblings ...)
  2021-01-15  7:32   ` [dpdk-dev] [PATCH v8 3/3] test: change external memory test to use system page sz Nithin Dabilpuram
@ 2021-02-16 13:14   ` Burakov, Anatoly
  2021-02-22  9:41     ` Nithin Dabilpuram
  3 siblings, 1 reply; 76+ messages in thread
From: Burakov, Anatoly @ 2021-02-16 13:14 UTC (permalink / raw)
  To: Nithin Dabilpuram, David Christensen, david.marchand; +Cc: jerinj, dev

On 15-Jan-21 7:32 AM, Nithin Dabilpuram wrote:
> Partial DMA unmap is not supported by VFIO type1 IOMMU
> in Linux. Though the return value is zero, the returned
> DMA unmap size is not same as expected size.
> So add test case and fix to both heap triggered DMA
> mapping and user triggered DMA mapping/unmapping.
> 
> Refer vfio_dma_do_unmap() in drivers/vfio/vfio_iommu_type1.c
> Snippet of comment is below.
> 
>          /*
>           * vfio-iommu-type1 (v1) - User mappings were coalesced together to
>           * avoid tracking individual mappings.  This means that the granularity
>           * of the original mapping was lost and the user was allowed to attempt
>           * to unmap any range.  Depending on the contiguousness of physical
>           * memory and page sizes supported by the IOMMU, arbitrary unmaps may
>           * or may not have worked.  We only guaranteed unmap granularity
>           * matching the original mapping; even though it was untracked here,
>           * the original mappings are reflected in IOMMU mappings.  This
>           * resulted in a couple unusual behaviors.  First, if a range is not
>           * able to be unmapped, ex. a set of 4k pages that was mapped as a
>           * 2M hugepage into the IOMMU, the unmap ioctl returns success but with
>           * a zero sized unmap.  Also, if an unmap request overlaps the first
>           * address of a hugepage, the IOMMU will unmap the entire hugepage.
>           * This also returns success and the returned unmap size reflects the
>           * actual size unmapped.
> 
>           * We attempt to maintain compatibility with this "v1" interface, but
>           * we take control out of the hands of the IOMMU.  Therefore, an unmap
>           * request offset from the beginning of the original mapping will
>           * return success with zero sized unmap.  And an unmap request covering
>           * the first iova of mapping will unmap the entire range.
> 
> This behavior can be verified by using first patch and add return check for
> dma_unmap.size != len in vfio_type1_dma_mem_map()
> 
> v8:
> - Add cc stable to patch 3/3
> 
> v7:
> - Dropped vfio test case of patch 3/4 i.e
>    "test: add test case to validate VFIO DMA map/unmap"
>    as it couldn't be supported in POWER9 system.
> 
> v6:
> - Fixed issue with x86-32 build introduced by v5.
> 
> v5:
> - Changed vfio test in test_vfio.c to use system pages allocated from
>    heap instead of mmap() so that it comes in range of initially configured
>    window for POWER9 System.
> - Added acked-by from David for 1/4, 2/4.
> 
> v4:
> - Fixed issue with patch 4/4 on x86 builds.
> 
> v3:
> - Fixed external memory test case(4/4) to use system page size
>    instead of 4K.
> - Fixed check-git-log.sh issue and rebased.
> - Added acked-by from anatoly.burakov@intel.com to first 3 patches.
> 
> v2:
> - Reverted the earlier commit that enabled merging contiguous mappings for
>    IOVA as PA. (see 1/3)
> - Updated documentation about kernel dma mapping limits and vfio
>    module parameter.
> - Moved vfio test to test_vfio.c and handled comments from
>    Anatoly.
> 
> Nithin Dabilpuram (3):
>    vfio: revert changes for map contiguous areas in one go
>    vfio: fix DMA mapping granularity for type1 IOVA as VA
>    test: change external memory test to use system page sz
> 

Is there anything preventing this from getting merged? Let's try for 
21.05 :)

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v8 0/3] fix issue with partial DMA unmap
  2021-02-16 13:14   ` [dpdk-dev] [PATCH v8 0/3] fix issue with partial DMA unmap Burakov, Anatoly
@ 2021-02-22  9:41     ` Nithin Dabilpuram
  2021-02-22 10:06       ` David Marchand
  0 siblings, 1 reply; 76+ messages in thread
From: Nithin Dabilpuram @ 2021-02-22  9:41 UTC (permalink / raw)
  To: david.marchand
  Cc: David Christensen, david.marchand, jerinj, dev, Burakov, Anatoly

Ping. 
Can this be merged for 21.05? It has been pending for a few releases.

--
Thanks
Nithin

On Tue, Feb 16, 2021 at 01:14:37PM +0000, Burakov, Anatoly wrote:
> On 15-Jan-21 7:32 AM, Nithin Dabilpuram wrote:
> > Partial DMA unmap is not supported by VFIO type1 IOMMU
> > in Linux. Though the return value is zero, the returned
> > DMA unmap size is not same as expected size.
> > So add test case and fix to both heap triggered DMA
> > mapping and user triggered DMA mapping/unmapping.
> > 
> > Refer vfio_dma_do_unmap() in drivers/vfio/vfio_iommu_type1.c
> > Snippet of comment is below.
> > 
> >          /*
> >           * vfio-iommu-type1 (v1) - User mappings were coalesced together to
> >           * avoid tracking individual mappings.  This means that the granularity
> >           * of the original mapping was lost and the user was allowed to attempt
> >           * to unmap any range.  Depending on the contiguousness of physical
> >           * memory and page sizes supported by the IOMMU, arbitrary unmaps may
> >           * or may not have worked.  We only guaranteed unmap granularity
> >           * matching the original mapping; even though it was untracked here,
> >           * the original mappings are reflected in IOMMU mappings.  This
> >           * resulted in a couple unusual behaviors.  First, if a range is not
> >           * able to be unmapped, ex. a set of 4k pages that was mapped as a
> >           * 2M hugepage into the IOMMU, the unmap ioctl returns success but with
> >           * a zero sized unmap.  Also, if an unmap request overlaps the first
> >           * address of a hugepage, the IOMMU will unmap the entire hugepage.
> >           * This also returns success and the returned unmap size reflects the
> >           * actual size unmapped.
> > 
> >           * We attempt to maintain compatibility with this "v1" interface, but
> >           * we take control out of the hands of the IOMMU.  Therefore, an unmap
> >           * request offset from the beginning of the original mapping will
> >           * return success with zero sized unmap.  And an unmap request covering
> >           * the first iova of mapping will unmap the entire range.
> > 
> > This behavior can be verified by applying the first patch and adding a return
> > check for dma_unmap.size != len in vfio_type1_dma_mem_map().
> > 
> > v8:
> > - Add cc stable to patch 3/3
> > 
> > v7:
> > - Dropped the vfio test case of patch 3/4, i.e.,
> >    "test: add test case to validate VFIO DMA map/unmap",
> >    as it could not be supported on POWER9 systems.
> > 
> > v6:
> > - Fixed issue with x86-32 build introduced by v5.
> > 
> > v5:
> > - Changed vfio test in test_vfio.c to use system pages allocated from
> >    the heap instead of mmap() so that they fall within the initially
> >    configured window on POWER9 systems.
> > - Added acked-by from David for 1/4, 2/4.
> > 
> > v4:
> > - Fixed issue with patch 4/4 on x86 builds.
> > 
> > v3:
> > - Fixed external memory test case (4/4) to use the system page size
> >    instead of 4K.
> > - Fixed check-git-log.sh issue and rebased.
> > - Added acked-by from anatoly.burakov@intel.com to first 3 patches.
> > 
> > v2:
> > - Reverted the earlier commit that enabled merging contiguous mappings for
> >    IOVA as PA. (see 1/3)
> > - Updated documentation about kernel dma mapping limits and vfio
> >    module parameter.
> > - Moved vfio test to test_vfio.c and handled comments from
> >    Anatoly.
> > 
> > Nithin Dabilpuram (3):
> >    vfio: revert changes for map contiguous areas in one go
> >    vfio: fix DMA mapping granularity for type1 IOVA as VA
> >    test: change external memory test to use system page sz
> > 
> 
> Is there anything preventing this from getting merged? Let's try for 21.05
> :)
> 
> -- 
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dpdk-dev] [PATCH v8 0/3] fix issue with partial DMA unmap
  2021-02-22  9:41     ` Nithin Dabilpuram
@ 2021-02-22 10:06       ` David Marchand
  0 siblings, 0 replies; 76+ messages in thread
From: David Marchand @ 2021-02-22 10:06 UTC (permalink / raw)
  To: Nithin Dabilpuram
  Cc: David Christensen, Jerin Jacob Kollanukkaran, dev, Burakov, Anatoly

On Mon, Feb 22, 2021 at 10:42 AM Nithin Dabilpuram
<nithind1988@gmail.com> wrote:
>
> Can this be merged for 21.05? It has been pending for a few releases.

I'll get them this week, hopefully.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2021-02-22 10:06 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-12  8:11 [dpdk-dev] [PATCH 0/2] fix issue with partial DMA unmap Nithin Dabilpuram
2020-10-12  8:11 ` [dpdk-dev] [PATCH 1/2] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
2020-10-14 14:39   ` Burakov, Anatoly
2020-10-15  9:54     ` [dpdk-dev] [EXT] " Nithin Dabilpuram
2020-10-12  8:11 ` [dpdk-dev] [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1 Nithin Dabilpuram
2020-10-14 15:07   ` Burakov, Anatoly
2020-10-15  6:09     ` [dpdk-dev] [EXT] " Nithin Dabilpuram
2020-10-15 10:00       ` Burakov, Anatoly
2020-10-15 11:38         ` Nithin Dabilpuram
2020-10-15 11:50         ` Nithin Dabilpuram
2020-10-15 11:57         ` Nithin Dabilpuram
2020-10-15 15:10           ` Burakov, Anatoly
2020-10-16  7:10             ` Nithin Dabilpuram
2020-10-17 16:14               ` Burakov, Anatoly
2020-10-19  9:43                 ` Nithin Dabilpuram
2020-10-22 12:13                   ` Nithin Dabilpuram
2020-10-28 13:04                     ` Burakov, Anatoly
2020-10-28 14:17                       ` Nithin Dabilpuram
2020-10-28 16:07                         ` Burakov, Anatoly
2020-10-28 16:31                           ` Nithin Dabilpuram
2020-11-05  9:04 ` [dpdk-dev] [PATCH v2 0/3] fix issue with partial DMA unmap Nithin Dabilpuram
2020-11-05  9:04   ` [dpdk-dev] [PATCH v2 1/3] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
2020-11-05  9:04   ` [dpdk-dev] [PATCH v2 2/3] vfio: fix DMA mapping granularity for type1 iova as va Nithin Dabilpuram
2020-11-10 14:04     ` Burakov, Anatoly
2020-11-10 14:22       ` Burakov, Anatoly
2020-11-10 14:17     ` Burakov, Anatoly
2020-11-11  5:08       ` Nithin Dabilpuram
2020-11-11 10:00         ` Burakov, Anatoly
2020-11-05  9:04   ` [dpdk-dev] [PATCH v2 3/3] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
2020-12-01 19:32 ` [dpdk-dev] [PATCH v3 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
2020-12-01 19:32   ` [dpdk-dev] [PATCH v3 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
2020-12-01 19:33   ` [dpdk-dev] [PATCH v3 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
2020-12-01 19:33   ` [dpdk-dev] [PATCH v3 3/4] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
2020-12-01 19:33   ` [dpdk-dev] [PATCH v3 4/4] test: change external memory test to use system page sz Nithin Dabilpuram
2020-12-01 23:23     ` David Christensen
2020-12-02  5:40       ` Nithin Dabilpuram
2020-12-02  5:46 ` [dpdk-dev] [PATCH v4 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
2020-12-02 18:36     ` David Christensen
2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
2020-12-02 18:38     ` David Christensen
2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 3/4] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
2020-12-02 19:23     ` David Christensen
2020-12-03  7:14       ` Nithin Dabilpuram
2020-12-14  8:24         ` Nithin Dabilpuram
2020-12-02  5:46   ` [dpdk-dev] [PATCH v4 4/4] test: change external memory test to use system page sz Nithin Dabilpuram
2020-12-14  8:19 ` [dpdk-dev] [PATCH v5 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
2020-12-14  8:19   ` [dpdk-dev] [PATCH v5 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
2020-12-14  8:19   ` [dpdk-dev] [PATCH v5 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
2020-12-14  8:19   ` [dpdk-dev] [PATCH v5 3/4] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
2020-12-14  8:19   ` [dpdk-dev] [PATCH v5 4/4] test: change external memory test to use system page sz Nithin Dabilpuram
2020-12-17 19:06 ` [dpdk-dev] [PATCH v6 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
2020-12-17 19:06   ` [dpdk-dev] [PATCH v6 1/4] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
2020-12-17 19:06   ` [dpdk-dev] [PATCH v6 2/4] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
2020-12-17 19:06   ` [dpdk-dev] [PATCH v6 3/4] test: add test case to validate VFIO DMA map/unmap Nithin Dabilpuram
2020-12-17 19:10     ` Nithin Dabilpuram
2021-01-05 19:33       ` David Christensen
2021-01-06  8:40         ` Nithin Dabilpuram
2021-01-06 21:20           ` David Christensen
2020-12-17 19:06   ` [dpdk-dev] [PATCH v6 4/4] test: change external memory test to use system page sz Nithin Dabilpuram
2020-12-23  5:13   ` [dpdk-dev] [PATCH v6 0/4] fix issue with partial DMA unmap Nithin Dabilpuram
2021-01-04 22:29     ` David Christensen
2021-01-12 17:39 ` [dpdk-dev] [PATCH v7 0/3] " Nithin Dabilpuram
2021-01-12 17:39   ` [dpdk-dev] [PATCH v7 1/3] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
2021-01-12 17:39   ` [dpdk-dev] [PATCH v7 2/3] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
2021-01-12 17:39   ` [dpdk-dev] [PATCH v7 3/3] test: change external memory test to use system page sz Nithin Dabilpuram
2021-01-14 16:30     ` David Marchand
2021-01-15  6:57       ` Nithin Dabilpuram
2021-01-15  7:32 ` [dpdk-dev] [PATCH v8 0/3] fix issue with partial DMA unmap Nithin Dabilpuram
2021-01-15  7:32   ` [dpdk-dev] [PATCH v8 1/3] vfio: revert changes for map contiguous areas in one go Nithin Dabilpuram
2021-01-15  7:32   ` [dpdk-dev] [PATCH v8 2/3] vfio: fix DMA mapping granularity for type1 IOVA as VA Nithin Dabilpuram
2021-01-15  7:32   ` [dpdk-dev] [PATCH v8 3/3] test: change external memory test to use system page sz Nithin Dabilpuram
2021-02-11 11:21     ` Burakov, Anatoly
2021-02-16 13:14   ` [dpdk-dev] [PATCH v8 0/3] fix issue with partial DMA unmap Burakov, Anatoly
2021-02-22  9:41     ` Nithin Dabilpuram
2021-02-22 10:06       ` David Marchand
