* [dpdk-dev] [PATCH 0/2] vfio: change spapr DMA window sizing operation
@ 2020-04-29 23:29 David Christensen
  2020-04-29 23:29 ` [dpdk-dev] [PATCH 1/2] vfio: use ifdef's for ppc64 spapr code David Christensen
                   ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: David Christensen @ 2020-04-29 23:29 UTC (permalink / raw)
  To: dev; +Cc: David Christensen
The SPAPR v2 IOMMU used on bare-metal PowerNV systems requires that a DMA window
be defined before mapping/unmapping memory.  The current VFIO code dynamically
resizes this DMA window every time a new memory request is made, which requires
that all existing memory be unmapped/remapped.  While this strategy worked in
DPDK 17.11 and earlier where memory was statically allocated during startup, it
is potentially dangerous in DPDK 18.11 and later where memory can be allocated
during runtime, temporarily invalidating IOVA memory used by hardware.
This new code statically sizes the DMA window at startup, based on the amount of
memory installed in the system, avoiding the need to unmap memory during
runtime.
David Christensen (2):
  vfio: use ifdef's for ppc64 spapr code
  vfio: modify spapr iommu support to use static window sizing
 lib/librte_eal/linux/eal_vfio.c | 396 +++++++++++++++-----------------
 1 file changed, 187 insertions(+), 209 deletions(-)
--
2.18.1
^ permalink raw reply	[flat|nested] 48+ messages in thread
* [dpdk-dev] [PATCH 1/2] vfio: use ifdef's for ppc64 spapr code
  2020-04-29 23:29 [dpdk-dev] [PATCH 0/2] vfio: change spapr DMA window sizing operation David Christensen
@ 2020-04-29 23:29 ` David Christensen
  2020-04-30 11:14   ` Burakov, Anatoly
  2020-04-29 23:29 ` [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing David Christensen
  2020-06-30 21:38 ` [dpdk-dev] [PATCH v2 0/1] vfio: change spapr DMA window sizing operation David Christensen
  2 siblings, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-04-29 23:29 UTC (permalink / raw)
  To: dev; +Cc: David Christensen
Enclose ppc64 specific SPAPR VFIO support with ifdef's.
Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
---
 lib/librte_eal/linux/eal_vfio.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index d26e1649a..953397984 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -56,8 +56,10 @@ static struct vfio_config *default_vfio_cfg = &vfio_cfgs[0];
 
 static int vfio_type1_dma_map(int);
 static int vfio_type1_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int);
+#ifdef RTE_ARCH_PPC_64
 static int vfio_spapr_dma_map(int);
 static int vfio_spapr_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int);
+#endif
 static int vfio_noiommu_dma_map(int);
 static int vfio_noiommu_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int);
 static int vfio_dma_mem_map(struct vfio_config *vfio_cfg, uint64_t vaddr,
@@ -72,6 +74,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 		.dma_map_func = &vfio_type1_dma_map,
 		.dma_user_map_func = &vfio_type1_dma_mem_map
 	},
+#ifdef RTE_ARCH_PPC_64
 	/* ppc64 IOMMU, otherwise known as spapr */
 	{
 		.type_id = RTE_VFIO_SPAPR,
@@ -79,6 +82,7 @@ static const struct vfio_iommu_type iommu_types[] = {
 		.dma_map_func = &vfio_spapr_dma_map,
 		.dma_user_map_func = &vfio_spapr_dma_mem_map
 	},
+#endif
 	/* IOMMU-less mode */
 	{
 		.type_id = RTE_VFIO_NOIOMMU,
@@ -1407,6 +1411,7 @@ vfio_type1_dma_map(int vfio_container_fd)
 	return rte_memseg_walk(type1_map, &vfio_container_fd);
 }
 
+#ifdef RTE_ARCH_PPC_64
 static int
 vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		uint64_t len, int do_map)
@@ -1578,7 +1583,6 @@ vfio_spapr_create_new_dma_window(int vfio_container_fd,
 	/* create new DMA window */
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, create);
 	if (ret) {
-#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
 		/* try possible page_shift and levels for workaround */
 		uint32_t levels;
 
@@ -1588,7 +1592,6 @@ vfio_spapr_create_new_dma_window(int vfio_container_fd,
 			ret = ioctl(vfio_container_fd,
 				VFIO_IOMMU_SPAPR_TCE_CREATE, create);
 		}
-#endif
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
 					"error %i (%s)\n", errno, strerror(errno));
@@ -1747,6 +1750,7 @@ vfio_spapr_dma_map(int vfio_container_fd)
 
 	return 0;
 }
+#endif /* RTE_ARCH_PPC_64 */
 
 static int
 vfio_noiommu_dma_map(int __rte_unused vfio_container_fd)
-- 
2.18.1
* [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing
  2020-04-29 23:29 [dpdk-dev] [PATCH 0/2] vfio: change spapr DMA window sizing operation David Christensen
  2020-04-29 23:29 ` [dpdk-dev] [PATCH 1/2] vfio: use ifdef's for ppc64 spapr code David Christensen
@ 2020-04-29 23:29 ` David Christensen
  2020-04-30 11:34   ` Burakov, Anatoly
  2020-06-30 21:38 ` [dpdk-dev] [PATCH v2 0/1] vfio: change spapr DMA window sizing operation David Christensen
  2 siblings, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-04-29 23:29 UTC (permalink / raw)
  To: dev; +Cc: David Christensen
Current SPAPR IOMMU support code dynamically modifies the DMA window
size in response to every new memory allocation. This is potentially
dangerous because all existing mappings need to be unmapped/remapped in
order to resize the DMA window, leaving hardware holding IOVA addresses
that are not properly prepared for DMA.  The new SPAPR code statically
assigns the DMA window size on first use, using the largest physical
memory address when IOVA=PA and the base_virtaddr + physical memory size
when IOVA=VA.  As a result, memory will only be unmapped when
specifically requested.
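
The sizing rule described above can be sketched in isolation as follows.
This is only an illustration, not the code from the patch: the enum and
the function names are hypothetical stand-ins, align64pow2() is a local
helper assumed to behave like rte_align64pow2(), and
__builtin_ctzll() is the GCC/Clang builtin the patch also relies on.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for DPDK's IOVA mode enum. */
enum iova_mode { IOVA_MODE_PA, IOVA_MODE_VA };

/* Round v up to the next power of two; v must be in 1..2^63. */
static uint64_t
align64pow2(uint64_t v)
{
	assert(v != 0 && v <= (UINT64_C(1) << 63));
	v--;
	v |= v >> 1;
	v |= v >> 2;
	v |= v >> 4;
	v |= v >> 8;
	v |= v >> 16;
	v |= v >> 32;
	return v + 1;
}

/*
 * Window length for a window that starts at address 0.
 * IOVA=PA: cover every address up to and including max_phys_addr.
 * IOVA=VA: cover base_virtaddr + mem_total, with 4 GiB assumed as the
 *          base when none is configured (base_virtaddr == 0).
 */
static uint64_t
static_window_len(enum iova_mode mode, uint64_t max_phys_addr,
		uint64_t mem_total, uint64_t base_virtaddr)
{
	if (mode == IOVA_MODE_PA)
		return align64pow2(max_phys_addr + 1);

	if (base_virtaddr == 0)
		base_virtaddr = UINT64_C(1) << 32;
	return align64pow2(base_virtaddr + mem_total);
}

/* Address bits of a power-of-two window, usable as a DMA mask width. */
static unsigned int
window_addr_bits(uint64_t window_len)
{
	return (unsigned int)__builtin_ctzll(window_len);
}
```

With the highest "System RAM" address at 0x201fffffffff, the IOVA=PA
window would be 1 << 46 bytes; with 16 GiB of memory and no base address
configured, the IOVA=VA window would be 1 << 35 bytes.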
Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
---
 lib/librte_eal/linux/eal_vfio.c | 388 +++++++++++++++-----------------
 1 file changed, 181 insertions(+), 207 deletions(-)
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 953397984..2716ae557 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -18,6 +18,7 @@
 #include "eal_memcfg.h"
 #include "eal_vfio.h"
 #include "eal_private.h"
+#include "eal_internal_cfg.h"
 
 #ifdef VFIO_PRESENT
 
@@ -538,17 +539,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		return;
 	}
 
-#ifdef RTE_ARCH_PPC_64
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_free(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-	cur_len = 0;
-#endif
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
 
@@ -609,17 +599,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 						iova_expected - iova_start, 0);
 		}
 	}
-#ifdef RTE_ARCH_PPC_64
-	cur_len = 0;
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_used(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-#endif
 }
 
 static int
@@ -1416,17 +1395,16 @@ static int
 vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		uint64_t len, int do_map)
 {
-	struct vfio_iommu_type1_dma_map dma_map;
-	struct vfio_iommu_type1_dma_unmap dma_unmap;
-	int ret;
 	struct vfio_iommu_spapr_register_memory reg = {
 		.argsz = sizeof(reg),
+		.vaddr = (uintptr_t) vaddr,
+		.size = len,
 		.flags = 0
 	};
-	reg.vaddr = (uintptr_t) vaddr;
-	reg.size = len;
+	int ret;
 
-	if (do_map != 0) {
+	if (do_map == 1) {
+		struct vfio_iommu_type1_dma_map dma_map;
 		ret = ioctl(vfio_container_fd,
 				VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
 		if (ret) {
@@ -1441,28 +1419,17 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		dma_map.size = len;
 		dma_map.iova = iova;
 		dma_map.flags = VFIO_DMA_MAP_FLAG_READ |
-				VFIO_DMA_MAP_FLAG_WRITE;
+			VFIO_DMA_MAP_FLAG_WRITE;
 
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
 		if (ret) {
-			/**
-			 * In case the mapping was already done EBUSY will be
-			 * returned from kernel.
-			 */
-			if (errno == EBUSY) {
-				RTE_LOG(DEBUG, EAL,
-					" Memory segment is already mapped,"
-					" skipping");
-			} else {
-				RTE_LOG(ERR, EAL,
-					"  cannot set up DMA remapping,"
-					" error %i (%s)\n", errno,
-					strerror(errno));
+			RTE_LOG(ERR, EAL, "  cannot map vaddr for IOMMU, "
+				"error %i (%s)\n", errno, strerror(errno));
 				return -1;
-			}
 		}
 
 	} else {
+		struct vfio_iommu_type1_dma_unmap dma_unmap;
 		memset(&dma_unmap, 0, sizeof(dma_unmap));
 		dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
 		dma_unmap.size = len;
@@ -1471,16 +1438,16 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA,
 				&dma_unmap);
 		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
-					errno, strerror(errno));
+			RTE_LOG(ERR, EAL, "  cannot unmap vaddr for IOMMU, "
+				"error %i (%s)\n", errno, strerror(errno));
 			return -1;
 		}
 
 		ret = ioctl(vfio_container_fd,
 				VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
 		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, error %i (%s)\n",
-					errno, strerror(errno));
+			RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, "
+				"error %i (%s)\n", errno, strerror(errno));
 			return -1;
 		}
 	}
@@ -1502,26 +1469,8 @@ vfio_spapr_map_walk(const struct rte_memseg_list *msl,
 	if (ms->iova == RTE_BAD_IOVA)
 		return 0;
 
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 1);
-}
-
-static int
-vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
-		const struct rte_memseg *ms, void *arg)
-{
-	int *vfio_container_fd = arg;
-
-	/* skip external memory that isn't a heap */
-	if (msl->external && !msl->heap)
-		return 0;
-
-	/* skip any segments with invalid IOVA addresses */
-	if (ms->iova == RTE_BAD_IOVA)
-		return 0;
-
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 0);
+	return vfio_spapr_dma_do_map(*vfio_container_fd,
+		ms->addr_64, ms->iova, ms->len, 1);
 }
 
 struct spapr_walk_param {
@@ -1552,26 +1501,150 @@ vfio_spapr_window_size_walk(const struct rte_memseg_list *msl,
 	return 0;
 }
 
+/*
+ * The SPAPRv2 IOMMU supports 2 DMA windows with starting
+ * address at 0 or 1<<59.  The default window is 2GB with
+ * a 4KB page.  The DMA window must be defined before any
+ * pages are mapped.
+ */
+uint64_t spapr_dma_win_start;
+uint64_t spapr_dma_win_len;
+
+static int
+spapr_dma_win_size(void)
+{
+	/* only create DMA window once */
+	if (spapr_dma_win_len > 0)
+		return 0;
+
+	if (rte_eal_iova_mode() == RTE_IOVA_PA) {
+		/* Set the DMA window to cover the max physical address */
+		const char proc_iomem[] = "/proc/iomem";
+		const char str_sysram[] = "System RAM";
+		uint64_t start, end, max = 0;
+		char *line = NULL;
+		char *dash, *space;
+		size_t line_len;
+
+		/*
+		 * Read "System RAM" in /proc/iomem:
+		 * 00000000-1fffffffff : System RAM
+		 * 200000000000-201fffffffff : System RAM
+		 */
+		FILE *fd = fopen(proc_iomem, "r");
+		if (fd == NULL) {
+			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
+			return -1;
+		}
+		/* Scan /proc/iomem for the highest PA in the system */
+		while (getline(&line, &line_len, fd) != -1) {
+			if (strstr(line, str_sysram) == NULL)
+				continue;
+
+			space = strstr(line, " ");
+			dash = strstr(line, "-");
+
+			/* Validate the format of the memory string */
+			if (space == NULL || dash == NULL || space < dash) {
+				RTE_LOG(ERR, EAL, "Can't parse line \"%s\" in file %s\n",
+					line, proc_iomem);
+				continue;
+			}
+
+			start = strtoull(line, NULL, 16);
+			end   = strtoull(dash + 1, NULL, 16);
+			RTE_LOG(DEBUG, EAL, "Found system RAM from 0x%"
+				PRIx64 " to 0x%" PRIx64 "\n", start, end);
+			if (end > max)
+				max = end;
+		}
+		free(line);
+		fclose(fd);
+
+		if (max == 0) {
+			RTE_LOG(ERR, EAL, "Failed to find valid \"System RAM\" entry "
+				"in file %s\n", proc_iomem);
+			return -1;
+		}
+
+		spapr_dma_win_len = rte_align64pow2(max + 1);
+		rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
+		return 0;
+
+	} else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
+		/* Set the DMA window to base_virtaddr + system memory size */
+		const char proc_meminfo[] = "/proc/meminfo";
+		const char str_memtotal[] = "MemTotal:";
+		int memtotal_len = sizeof(str_memtotal) - 1;
+		char buffer[256];
+		uint64_t size = 0;
+
+		FILE *fd = fopen(proc_meminfo, "r");
+		if (fd == NULL) {
+			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_meminfo);
+			return -1;
+		}
+		while (fgets(buffer, sizeof(buffer), fd)) {
+			if (strncmp(buffer, str_memtotal, memtotal_len) == 0) {
+				size = rte_str_to_size(&buffer[memtotal_len]);
+				break;
+			}
+		}
+		fclose(fd);
+
+		if (size == 0) {
+			RTE_LOG(ERR, EAL, "Failed to find valid \"MemTotal\" entry "
+				"in file %s\n", proc_meminfo);
+			return -1;
+		}
+
+		RTE_LOG(DEBUG, EAL, "MemTotal is 0x%" PRIx64 "\n", size);
+		/* if no base virtual address is configured use 4GB */
+		spapr_dma_win_len = rte_align64pow2(size +
+			(internal_config.base_virtaddr > 0 ?
+			(uint64_t)internal_config.base_virtaddr : 1ULL << 32));
+		rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
+		return 0;
+	}
+
+	/* must be an unsupported IOVA mode */
+	return -1;
+}
+
+
 static int
-vfio_spapr_create_new_dma_window(int vfio_container_fd,
-		struct vfio_iommu_spapr_tce_create *create) {
+vfio_spapr_create_dma_window(int vfio_container_fd)
+{
+	struct vfio_iommu_spapr_tce_create create = {
+		.argsz = sizeof(create), };
 	struct vfio_iommu_spapr_tce_remove remove = {
-		.argsz = sizeof(remove),
-	};
+		.argsz = sizeof(remove), };
 	struct vfio_iommu_spapr_tce_info info = {
-		.argsz = sizeof(info),
-	};
+		.argsz = sizeof(info), };
+	struct spapr_walk_param param;
 	int ret;
 
+	/* exit if we can't define the DMA window size */
+	ret = spapr_dma_win_size();
+	if (ret < 0)
+		return ret;
+
+	/* walk the memseg list to find the hugepage size */
+	memset(&param, 0, sizeof(param));
+	if (rte_memseg_walk(vfio_spapr_window_size_walk, &param) < 0) {
+		RTE_LOG(ERR, EAL, "Could not get hugepage size\n");
+		return -1;
+	}
+
 	/* query spapr iommu info */
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
 	if (ret) {
-		RTE_LOG(ERR, EAL, "  cannot get iommu info, "
-				"error %i (%s)\n", errno, strerror(errno));
+		RTE_LOG(ERR, EAL, "  can't get iommu info, "
+			"error %i (%s)\n", errno, strerror(errno));
 		return -1;
 	}
 
-	/* remove default DMA of 32 bit window */
+	/* remove default DMA window */
 	remove.start_addr = info.dma32_window_start;
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
 	if (ret) {
@@ -1580,27 +1653,34 @@ vfio_spapr_create_new_dma_window(int vfio_container_fd,
 		return -1;
 	}
 
-	/* create new DMA window */
-	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, create);
+	/* create a new DMA window */
+	create.start_addr  = spapr_dma_win_start;
+	create.window_size = spapr_dma_win_len;
+	create.page_shift  = __builtin_ctzll(param.hugepage_sz);
+	create.levels = 1;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 	if (ret) {
-		/* try possible page_shift and levels for workaround */
+		/* if at first we don't succeed, try more levels */
 		uint32_t levels;
 
-		for (levels = create->levels + 1;
+		for (levels = create.levels + 1;
 			ret && levels <= info.ddw.levels; levels++) {
-			create->levels = levels;
+			create.levels = levels;
 			ret = ioctl(vfio_container_fd,
-				VFIO_IOMMU_SPAPR_TCE_CREATE, create);
-		}
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
-					"error %i (%s)\n", errno, strerror(errno));
-			return -1;
+				VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 		}
 	}
+	if (ret) {
+		RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
+			"error %i (%s)\n", errno, strerror(errno));
+		return -1;
+	}
 
-	if (create->start_addr != 0) {
-		RTE_LOG(ERR, EAL, "  DMA window start address != 0\n");
+	/* verify the start address is what we requested */
+	if (create.start_addr != spapr_dma_win_start) {
+		RTE_LOG(ERR, EAL, "  requested start address 0x%" PRIx64
+			", received start address 0x%" PRIx64 "\n",
+			spapr_dma_win_start, create.start_addr);
 		return -1;
 	}
 
@@ -1608,143 +1688,37 @@ vfio_spapr_create_new_dma_window(int vfio_container_fd,
 }
 
 static int
-vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
-		uint64_t len, int do_map)
+vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr,
+	uint64_t iova, uint64_t len, int do_map)
 {
-	struct spapr_walk_param param;
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct vfio_config *vfio_cfg;
-	struct user_mem_maps *user_mem_maps;
-	int i, ret = 0;
-
-	vfio_cfg = get_vfio_cfg_by_container_fd(vfio_container_fd);
-	if (vfio_cfg == NULL) {
-		RTE_LOG(ERR, EAL, "  invalid container fd!\n");
-		return -1;
-	}
-
-	user_mem_maps = &vfio_cfg->mem_maps;
-	rte_spinlock_recursive_lock(&user_mem_maps->lock);
-
-	/* check if window size needs to be adjusted */
-	memset(&param, 0, sizeof(param));
-
-	/* we're inside a callback so use thread-unsafe version */
-	if (rte_memseg_walk_thread_unsafe(vfio_spapr_window_size_walk,
-				&param) < 0) {
-		RTE_LOG(ERR, EAL, "Could not get window size\n");
-		ret = -1;
-		goto out;
-	}
-
-	/* also check user maps */
-	for (i = 0; i < user_mem_maps->n_maps; i++) {
-		uint64_t max = user_mem_maps->maps[i].iova +
-				user_mem_maps->maps[i].len;
-		param.window_size = RTE_MAX(param.window_size, max);
-	}
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
+	int ret = 0;
 
 	if (do_map) {
-		/* re-create window and remap the entire memory */
-		if (iova + len > create.window_size) {
-			/* release all maps before recreating the window */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_unmap_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not release DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* release all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						0)) {
-					RTE_LOG(ERR, EAL, "Could not release user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-			create.window_size = rte_align64pow2(iova + len);
-			if (vfio_spapr_create_new_dma_window(vfio_container_fd,
-					&create) < 0) {
-				RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
-				ret = -1;
-				goto out;
-			}
-			/* we're inside a callback, so use thread-unsafe version
-			 */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_map_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not recreate DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* remap all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						1)) {
-					RTE_LOG(ERR, EAL, "Could not recreate user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-		}
-		if (vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 1)) {
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+				vaddr, iova, len, 1)) {
 			RTE_LOG(ERR, EAL, "Failed to map DMA\n");
 			ret = -1;
-			goto out;
 		}
 	} else {
-		/* for unmap, check if iova within DMA window */
-		if (iova > create.window_size) {
-			RTE_LOG(ERR, EAL, "iova beyond DMA window for unmap");
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+				vaddr, iova, len, 0)) {
+			RTE_LOG(ERR, EAL, "Failed to unmap DMA\n");
 			ret = -1;
-			goto out;
 		}
-
-		vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 0);
 	}
-out:
-	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
+
 	return ret;
 }
 
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct spapr_walk_param param;
-
-	memset(&param, 0, sizeof(param));
-
-	/* create DMA window from 0 to max(phys_addr + len) */
-	rte_memseg_walk(vfio_spapr_window_size_walk, &param);
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
-
-	if (vfio_spapr_create_new_dma_window(vfio_container_fd, &create) < 0) {
-		RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
+	if (vfio_spapr_create_dma_window(vfio_container_fd) < 0) {
+		RTE_LOG(ERR, EAL, "Could not create new DMA window!\n");
 		return -1;
 	}
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
+	/* map all existing DPDK segments for DMA */
 	if (rte_memseg_walk(vfio_spapr_map_walk, &vfio_container_fd) < 0)
 		return -1;
 
-- 
2.18.1
* Re: [dpdk-dev] [PATCH 1/2] vfio: use ifdef's for ppc64 spapr code
  2020-04-29 23:29 ` [dpdk-dev] [PATCH 1/2] vfio: use ifdef's for ppc64 spapr code David Christensen
@ 2020-04-30 11:14   ` Burakov, Anatoly
  2020-04-30 16:22     ` David Christensen
  0 siblings, 1 reply; 48+ messages in thread
From: Burakov, Anatoly @ 2020-04-30 11:14 UTC (permalink / raw)
  To: David Christensen, dev
On 30-Apr-20 12:29 AM, David Christensen wrote:
> Enclose ppc64 specific SPAPR VFIO support with ifdef's.
> 
> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
> ---
Why is this needed?
-- 
Thanks,
Anatoly
* Re: [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing
  2020-04-29 23:29 ` [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing David Christensen
@ 2020-04-30 11:34   ` Burakov, Anatoly
  2020-04-30 17:36     ` David Christensen
  0 siblings, 1 reply; 48+ messages in thread
From: Burakov, Anatoly @ 2020-04-30 11:34 UTC (permalink / raw)
  To: David Christensen, dev
On 30-Apr-20 12:29 AM, David Christensen wrote:
> Current SPAPR IOMMU support code dynamically modifies the DMA window
> size in response to every new memory allocation. This is potentially
> dangerous because all existing mappings need to be unmapped/remapped in
> order to resize the DMA window, leaving hardware holding IOVA addresses
> that are not properly prepared for DMA.  The new SPAPR code statically
> assigns the DMA window size on first use, using the largest physical
> memory address when IOVA=PA and the base_virtaddr + physical memory size
> when IOVA=VA.  As a result, memory will only be unmapped when
> specifically requested.
> 
> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
> ---
Hi David,
I haven't yet looked at the code in detail (will do so later), but some 
general comments and questions below.
> +	/* only create DMA window once */
> +	if (spapr_dma_win_len > 0)
> +		return 0;
> +
> +	if (rte_eal_iova_mode() == RTE_IOVA_PA) {
> +		/* Set the DMA window to cover the max physical address */
> +		const char proc_iomem[] = "/proc/iomem";
> +		const char str_sysram[] = "System RAM";
> +		uint64_t start, end, max = 0;
> +		char *line = NULL;
> +		char *dash, *space;
> +		size_t line_len;
> +
> +		/*
> +		 * Read "System RAM" in /proc/iomem:
> +		 * 00000000-1fffffffff : System RAM
> +		 * 200000000000-201fffffffff : System RAM
> +		 */
> +		FILE *fd = fopen(proc_iomem, "r");
> +		if (fd == NULL) {
> +			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
> +			return -1;
> +		}
> +		/* Scan /proc/iomem for the highest PA in the system */
> +		while (getline(&line, &line_len, fd) != -1) {
> +			if (strstr(line, str_sysram) == NULL)
> +				continue;
> +
> +			space = strstr(line, " ");
> +			dash = strstr(line, "-");
> +
> +			/* Validate the format of the memory string */
> +			if (space == NULL || dash == NULL || space < dash) {
> +				RTE_LOG(ERR, EAL, "Can't parse line \"%s\" in file %s\n",
> +					line, proc_iomem);
> +				continue;
> +			}
> +
> +			start = strtoull(line, NULL, 16);
> +			end   = strtoull(dash + 1, NULL, 16);
> +			RTE_LOG(DEBUG, EAL, "Found system RAM from 0x%"
> +				PRIx64 " to 0x%" PRIx64 "\n", start, end);
> +			if (end > max)
> +				max = end;
> +		}
> +		free(line);
> +		fclose(fd);
> +
> +		if (max == 0) {
> +			RTE_LOG(ERR, EAL, "Failed to find valid \"System RAM\" entry "
> +				"in file %s\n", proc_iomem);
> +			return -1;
> +		}
> +
> +		spapr_dma_win_len = rte_align64pow2(max + 1);
> +		rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
A quick check on my machines shows that when cat'ing /proc/iomem as 
non-root, you get zeroes everywhere, which leads me to believe that you 
have to be root to get anything useful out of /proc/iomem. Since one of 
the major selling points of VFIO is the ability to run as non-root, 
depending on iomem kind of defeats the purpose a bit.
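
To make the concern concrete, here is a rough sketch of what the scan
sees when the ranges are zeroed out. It is not the patch code: the
function names are made up for illustration, and it works on an array of
already-read lines rather than on the file itself. The point is that a
masked /proc/iomem yields a maximum of 0, which is indistinguishable
from "no System RAM entry found".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one /proc/iomem style line ("<start>-<end> : <name>").
 * Returns 0 and fills start/end on success, -1 if the line is malformed.
 */
static int
parse_iomem_line(const char *line, uint64_t *start, uint64_t *end)
{
	char *stop;

	assert(line != NULL && start != NULL && end != NULL);

	unsigned long long s = strtoull(line, &stop, 16);
	if (stop == line || *stop != '-')
		return -1;

	const char *second = stop + 1;
	unsigned long long e = strtoull(second, &stop, 16);
	if (stop == second)
		return -1;

	*start = s;
	*end = e;
	return 0;
}

/*
 * Highest end address among the "System RAM" lines, or 0 if there is
 * none. Ranges reported as 00000000-00000000 also give 0.
 */
static uint64_t
max_system_ram_end(const char *const *lines, size_t n)
{
	uint64_t max = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t start, end;

		if (strstr(lines[i], "System RAM") == NULL)
			continue;
		if (parse_iomem_line(lines[i], &start, &end) < 0)
			continue;
		if (end > max)
			max = end;
	}
	return max;
}
```

A caller that gets 0 back cannot size the window from this file and
would have to fail or fall back to some other source.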
> +		return 0;
> +
> +	} else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
> +		/* Set the DMA window to base_virtaddr + system memory size */
> +		const char proc_meminfo[] = "/proc/meminfo";
> +		const char str_memtotal[] = "MemTotal:";
> +		int memtotal_len = sizeof(str_memtotal) - 1;
> +		char buffer[256];
> +		uint64_t size = 0;
> +
> +		FILE *fd = fopen(proc_meminfo, "r");
> +		if (fd == NULL) {
> +			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_meminfo);
> +			return -1;
> +		}
> +		while (fgets(buffer, sizeof(buffer), fd)) {
> +			if (strncmp(buffer, str_memtotal, memtotal_len) == 0) {
> +				size = rte_str_to_size(&buffer[memtotal_len]);
> +				break;
> +			}
> +		}
> +		fclose(fd);
> +
> +		if (size == 0) {
> +			RTE_LOG(ERR, EAL, "Failed to find valid \"MemTotal\" entry "
> +				"in file %s\n", proc_meminfo);
> +			return -1;
> +		}
> +
> +		RTE_LOG(DEBUG, EAL, "MemTotal is 0x%" PRIx64 "\n", size);
> +		/* if no base virtual address is configured use 4GB */
> +		spapr_dma_win_len = rte_align64pow2(size +
> +			(internal_config.base_virtaddr > 0 ?
> +			(uint64_t)internal_config.base_virtaddr : 1ULL << 32));
> +		rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
I'm not sure of the algorithm for "memory size" here.
Technically, DPDK can reserve memory segments anywhere in the VA space 
allocated by memseg lists. That space may be far bigger than system 
memory (on a typical Intel server board you'd see 128GB of VA space 
preallocated even though the machine itself might only have, say, 16GB 
of RAM installed). The same applies to any other arch running on Linux, 
so the window needs to cover at least RTE_MIN(base_virtaddr, lowest 
memseglist VA address) and up to highest memseglist VA address. That's 
not even mentioning the fact that the user may register external memory 
for DMA which may cause the window to be of insufficient size to cover 
said external memory.
I also think that in general, "system memory" metric is ill suited for 
measuring VA space, because unlike system memory, the VA space is sparse 
and can therefore span *a lot* of address space even though in reality 
it may actually use very little physical memory.
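
As a rough illustration of the gap, the sketch below sizes a window that
starts at 0 from the reserved VA ranges instead of from MemTotal. The
struct and both function names are hypothetical, and the numbers are
just the example figures above (128GB of VA space, 16GB of RAM) combined
with the 4GB default base from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one reserved VA range (one memseg list). */
struct va_range {
	uint64_t start;
	uint64_t len;
};

/* Highest end address (exclusive) over all reserved ranges. */
static uint64_t
highest_va_end(const struct va_range *ranges, size_t n)
{
	uint64_t end = 0;

	assert(ranges != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		uint64_t e = ranges[i].start + ranges[i].len;

		if (e > end)
			end = e;
	}
	return end;
}

/* Smallest number of address bits such that (1 << bits) >= end. */
static unsigned int
bits_to_cover(uint64_t end)
{
	unsigned int bits = 0;

	while (bits < 64 && (UINT64_C(1) << bits) < end)
		bits++;
	return bits;
}
```

For a single 128GB reservation placed at 4GB, the highest VA end is
132GB and the window needs 38 address bits, while MemTotal (16GB) plus
the 4GB base only asks for 35 bits. External memory registered by the
user could push the required end higher still.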
-- 
Thanks,
Anatoly
* Re: [dpdk-dev] [PATCH 1/2] vfio: use ifdef's for ppc64 spapr code
  2020-04-30 11:14   ` Burakov, Anatoly
@ 2020-04-30 16:22     ` David Christensen
  2020-04-30 16:24       ` Burakov, Anatoly
  0 siblings, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-04-30 16:22 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
>> Enclose ppc64 specific SPAPR VFIO support with ifdef's.
>>
>> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
>> ---
> 
> Why is this needed?
It's hardware specific to the PPC64 platform.  I don't know of a 
situation where the IOMMU would be present on other hardware.  Even 
running a VM in KVM/QEMU on a PPC64 platform results in a SPAPR V1 IOMMU 
which isn't supported in DPDK.
Dave
* Re: [dpdk-dev] [PATCH 1/2] vfio: use ifdef's for ppc64 spapr code
  2020-04-30 16:22     ` David Christensen
@ 2020-04-30 16:24       ` Burakov, Anatoly
  2020-04-30 17:38         ` David Christensen
  0 siblings, 1 reply; 48+ messages in thread
From: Burakov, Anatoly @ 2020-04-30 16:24 UTC (permalink / raw)
  To: David Christensen, dev
On 30-Apr-20 5:22 PM, David Christensen wrote:
>>> Enclose ppc64 specific SPAPR VFIO support with ifdef's.
>>>
>>> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
>>> ---
>>
>> Why is this needed?
> 
> It's hardware specific to the PPC64 platform.  I don't know of a 
> situation where the IOMMU would be present on other hardware.  Even 
> running a VM in KVM/QEMU on a PPC64 platform results in a SPAPR V1 IOMMU 
> which isn't supported in DPDK.
> 
> Dave
Yes, but generally #ifdef's are there for detecting compile-time 
conditions. Is there anything specific to that code that would cause 
trouble when compiled on other platforms?
-- 
Thanks,
Anatoly
* Re: [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing
  2020-04-30 11:34   ` Burakov, Anatoly
@ 2020-04-30 17:36     ` David Christensen
  2020-05-01  9:06       ` Burakov, Anatoly
  0 siblings, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-04-30 17:36 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
On 4/30/20 4:34 AM, Burakov, Anatoly wrote:
> On 30-Apr-20 12:29 AM, David Christensen wrote:
>> Current SPAPR IOMMU support code dynamically modifies the DMA window
>> size in response to every new memory allocation. This is potentially
>> dangerous because all existing mappings need to be unmapped/remapped in
>> order to resize the DMA window, leaving hardware holding IOVA addresses
>> that are not properly prepared for DMA.  The new SPAPR code statically
>> assigns the DMA window size on first use, using the largest physical
>> memory address when IOVA=PA and the base_virtaddr + physical memory size
>> when IOVA=VA.  As a result, memory will only be unmapped when
>> specifically requested.
>>
>> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
>> ---
> 
> Hi David,
> 
> I haven't yet looked at the code in detail (will do so later), but some 
> general comments and questions below.
> 
>> +        /*
>> +         * Read "System RAM" in /proc/iomem:
>> +         * 00000000-1fffffffff : System RAM
>> +         * 200000000000-201fffffffff : System RAM
>> +         */
>> +        FILE *fd = fopen(proc_iomem, "r");
>> +        if (fd == NULL) {
>> +            RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
>> +            return -1;
>> +        }
> 
> A quick check on my machines shows that when cat'ing /proc/iomem as 
> non-root, you get zeroes everywhere, which leads me to believe that you 
> have to be root to get anything useful out of /proc/iomem. Since one of 
> the major selling points of VFIO is the ability to run as non-root, 
> depending on iomem kind of defeats the purpose a bit.
I observed the same thing on my system during development.  I didn't see 
anything that precluded support for RTE_IOVA_PA in the VFIO code.  Are 
you suggesting that I should explicitly not support that configuration? 
If you're attempting to use RTE_IOVA_PA then you're already required to 
run as root, so there shouldn't be an issue accessing this file.
>> +        return 0;
>> +
>> +    } else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
>> +        /* Set the DMA window to base_virtaddr + system memory size */
>> +        const char proc_meminfo[] = "/proc/meminfo";
>> +        const char str_memtotal[] = "MemTotal:";
>> +        int memtotal_len = sizeof(str_memtotal) - 1;
>> +        char buffer[256];
>> +        uint64_t size = 0;
>> +
>> +        FILE *fd = fopen(proc_meminfo, "r");
>> +        if (fd == NULL) {
>> +            RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_meminfo);
>> +            return -1;
>> +        }
>> +        while (fgets(buffer, sizeof(buffer), fd)) {
>> +            if (strncmp(buffer, str_memtotal, memtotal_len) == 0) {
>> +                size = rte_str_to_size(&buffer[memtotal_len]);
>> +                break;
>> +            }
>> +        }
>> +        fclose(fd);
>> +
>> +        if (size == 0) {
>> +            RTE_LOG(ERR, EAL, "Failed to find valid \"MemTotal\" entry "
>> +                "in file %s\n", proc_meminfo);
>> +            return -1;
>> +        }
>> +
>> +        RTE_LOG(DEBUG, EAL, "MemTotal is 0x%" PRIx64 "\n", size);
>> +        /* if no base virtual address is configured use 4GB */
>> +        spapr_dma_win_len = rte_align64pow2(size +
>> +            (internal_config.base_virtaddr > 0 ?
>> +            (uint64_t)internal_config.base_virtaddr : 1ULL << 32));
>> +        rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
> 
> I'm not sure of the algorithm for "memory size" here.
> 
> Technically, DPDK can reserve memory segments anywhere in the VA space 
> allocated by memseg lists. That space may be far bigger than system 
> memory (on a typical Intel server board you'd see 128GB of VA space 
> preallocated even though the machine itself might only have, say, 16GB 
> of RAM installed). The same applies to any other arch running on Linux, 
> so the window needs to cover at least RTE_MIN(base_virtaddr, lowest 
> memseglist VA address) and up to highest memseglist VA address. That's 
> not even mentioning the fact that the user may register external memory 
> for DMA which may cause the window to be of insufficient size to cover 
> said external memory.
> 
> I also think that in general, "system memory" metric is ill suited for 
> measuring VA space, because unlike system memory, the VA space is sparse 
> and can therefore span *a lot* of address space even though in reality 
> it may actually use very little physical memory.
I'm open to suggestions here.  Perhaps an alternative in /proc/meminfo:
VmallocTotal:   549755813888 kB
I tested it with 1GB hugepages and it works, need to check with 2M as 
well.  If there's no alternative for sizing the window based on 
available system parameters then I have another option which creates a 
new RTE_IOVA_TA mode that forces IOVA addresses into the range 0 to X 
where X is configured on the EAL command-line (--iova-base, --iova-len). 
  I use these command-line values to create a static window.
Dave
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] vfio: use ifdef's for ppc64 spapr code
  2020-04-30 16:24       ` Burakov, Anatoly
@ 2020-04-30 17:38         ` David Christensen
  2020-05-01  8:49           ` Burakov, Anatoly
  0 siblings, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-04-30 17:38 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
>>> Why is this needed?
>>
>> It's hardware specific to the PPC64 platform.  I don't know of a 
>> situation where the IOMMU would be present on other hardware.  Even 
>> running a VM in KVM/QEMU on a PPC64 platform results in a SPAPR V1 
>> IOMMU which isn't supported in DPDK.
>>
>> Dave
> 
> Yes, but generally #ifdef's are there for detecting compile-time 
> conditions. Is there anything specific to that code that would cause 
> trouble when compiled on other platforms?
No, I can't say that's the case, it's been operating this way for a while.
Dave
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] vfio: use ifdef's for ppc64 spapr code
  2020-04-30 17:38         ` David Christensen
@ 2020-05-01  8:49           ` Burakov, Anatoly
  0 siblings, 0 replies; 48+ messages in thread
From: Burakov, Anatoly @ 2020-05-01  8:49 UTC (permalink / raw)
  To: David Christensen, dev
On 30-Apr-20 6:38 PM, David Christensen wrote:
>>>> Why is this needed?
>>>
>>> It's hardware specific to the PPC64 platform.  I don't know of a 
>>> situation where the IOMMU would be present on other hardware.  Even 
>>> running a VM in KVM/QEMU on a PPC64 platform results in a SPAPR V1 
>>> IOMMU which isn't supported in DPDK.
>>>
>>> Dave
>>
>> Yes, but generally #ifdef's are there for detecting compile-time 
>> conditions. Is there anything specific to that code that would cause 
>> trouble when compiled on other platforms?
> 
> No, I can't say that's the case, it's been operating this way for a while.
> 
> Dave
So no #ifdef's necessary then :)
-- 
Thanks,
Anatoly
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing
  2020-04-30 17:36     ` David Christensen
@ 2020-05-01  9:06       ` Burakov, Anatoly
  2020-05-01 16:48         ` David Christensen
  0 siblings, 1 reply; 48+ messages in thread
From: Burakov, Anatoly @ 2020-05-01  9:06 UTC (permalink / raw)
  To: David Christensen, dev
On 30-Apr-20 6:36 PM, David Christensen wrote:
> 
> 
> On 4/30/20 4:34 AM, Burakov, Anatoly wrote:
>> On 30-Apr-20 12:29 AM, David Christensen wrote:
>>> Current SPAPR IOMMU support code dynamically modifies the DMA window
>>> size in response to every new memory allocation. This is potentially
>>> dangerous because all existing mappings need to be unmapped/remapped in
>>> order to resize the DMA window, leaving hardware holding IOVA addresses
>>> that are not properly prepared for DMA.  The new SPAPR code statically
>>> assigns the DMA window size on first use, using the largest physical
>>> memory address when IOVA=PA and the base_virtaddr + physical memory size
>>> when IOVA=VA.  As a result, memory will only be unmapped when
>>> specifically requested.
>>>
>>> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
>>> ---
>>
>> Hi David,
>>
>> I haven't yet looked at the code in detail (will do so later), but 
>> some general comments and questions below.
>>
>>> +        /*
>>> +         * Read "System RAM" in /proc/iomem:
>>> +         * 00000000-1fffffffff : System RAM
>>> +         * 200000000000-201fffffffff : System RAM
>>> +         */
>>> +        FILE *fd = fopen(proc_iomem, "r");
>>> +        if (fd == NULL) {
>>> +            RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
>>> +            return -1;
>>> +        }
>>
>> A quick check on my machines shows that when cat'ing /proc/iomem as 
>> non-root, you get zeroes everywhere, which leads me to believe that 
>> you have to be root to get anything useful out of /proc/iomem. Since 
>> one of the major selling points of VFIO is the ability to run as 
>> non-root, depending on iomem kind of defeats the purpose a bit.
> 
> I observed the same thing on my system during development.  I didn't see 
> anything that precluded support for RTE_IOVA_PA in the VFIO code.  Are 
> you suggesting that I should explicitly not support that configuration? 
> If you're attempting to use RTE_IOVA_PA then you're already required to 
> run as root, so there shouldn't be an issue accessing this
Oh, right, forgot about that. That's OK then.
> 
>>> +        return 0;
>>> +
>>> +    } else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
>>> +        /* Set the DMA window to base_virtaddr + system memory size */
>>> +        const char proc_meminfo[] = "/proc/meminfo";
>>> +        const char str_memtotal[] = "MemTotal:";
>>> +        int memtotal_len = sizeof(str_memtotal) - 1;
>>> +        char buffer[256];
>>> +        uint64_t size = 0;
>>> +
>>> +        FILE *fd = fopen(proc_meminfo, "r");
>>> +        if (fd == NULL) {
>>> +            RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_meminfo);
>>> +            return -1;
>>> +        }
>>> +        while (fgets(buffer, sizeof(buffer), fd)) {
>>> +            if (strncmp(buffer, str_memtotal, memtotal_len) == 0) {
>>> +                size = rte_str_to_size(&buffer[memtotal_len]);
>>> +                break;
>>> +            }
>>> +        }
>>> +        fclose(fd);
>>> +
>>> +        if (size == 0) {
>>> +            RTE_LOG(ERR, EAL, "Failed to find valid \"MemTotal\" entry "
>>> +                "in file %s\n", proc_meminfo);
>>> +            return -1;
>>> +        }
>>> +
>>> +        RTE_LOG(DEBUG, EAL, "MemTotal is 0x%" PRIx64 "\n", size);
>>> +        /* if no base virtual address is configured use 4GB */
>>> +        spapr_dma_win_len = rte_align64pow2(size +
>>> +            (internal_config.base_virtaddr > 0 ?
>>> +            (uint64_t)internal_config.base_virtaddr : 1ULL << 32));
>>> +        rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
>>
>> I'm not sure of the algorithm for "memory size" here.
>>
>> Technically, DPDK can reserve memory segments anywhere in the VA space 
>> allocated by memseg lists. That space may be far bigger than system 
>> memory (on a typical Intel server board you'd see 128GB of VA space 
>> preallocated even though the machine itself might only have, say, 16GB 
>> of RAM installed). The same applies to any other arch running on 
>> Linux, so the window needs to cover at least RTE_MIN(base_virtaddr, 
>> lowest memseglist VA address) and up to highest memseglist VA address. 
>> That's not even mentioning the fact that the user may register 
>> external memory for DMA which may cause the window to be of 
>> insufficient size to cover said external memory.
>>
>> I also think that in general, "system memory" metric is ill suited for 
>> measuring VA space, because unlike system memory, the VA space is 
>> sparse and can therefore span *a lot* of address space even though in 
>> reality it may actually use very little physical memory.
> 
> I'm open to suggestions here.  Perhaps an alternative in /proc/meminfo:
> 
> VmallocTotal:   549755813888 kB
> 
> I tested it with 1GB hugepages and it works, need to check with 2M as 
> well.  If there's no alternative for sizing the window based on 
> available system parameters then I have another option which creates a 
> new RTE_IOVA_TA mode that forces IOVA addresses into the range 0 to X 
> where X is configured on the EAL command-line (--iova-base, --iova-len). 
>   I use these command-line values to create a static window.
> 
A whole new IOVA mode, while being a cleaner solution, would require a 
lot of testing, and it doesn't really solve the external memory problem, 
because we're still reliant on the user to provide IOVA addresses. 
Perhaps something akin to VA/IOVA address reservation would solve the 
problem, but again, lots of changes and testing, all for a comparatively 
narrow use case.
The vmalloc area seems big enough (512 terabytes on your machine, 32 
terabytes on mine), so it'll probably be OK. I'd settle for:
1) start at base_virtaddr OR lowest memseg list address, whichever is lowest
2) end at lowest addr + VmallocTotal OR highest memseglist addr, 
whichever is higher
3) a check in user DMA map function that would warn/throw an error 
whenever there is an attempt to map an address for DMA that doesn't fit 
into the DMA window
I think that would be best approach. Thoughts?
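
The three steps above can be sketched roughly as follows. This is only an
illustration under stated assumptions: struct dma_window,
dma_window_bounds() and dma_window_contains() are hypothetical names, not
existing EAL functions, a base address of 0 is treated as "not configured",
and VmallocTotal is assumed to have been read from /proc/meminfo and
converted to bytes already.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the computed window bounds. */
struct dma_window {
	uint64_t start;
	uint64_t end;	/* exclusive upper bound */
};

/*
 * base_va:       configured base virtual address (0 means unset here)
 * msl_lowest:    lowest memseg list VA
 * msl_highest:   end of the highest memseg list
 * vmalloc_total: VmallocTotal in bytes
 */
static struct dma_window
dma_window_bounds(uint64_t base_va, uint64_t msl_lowest,
		uint64_t msl_highest, uint64_t vmalloc_total)
{
	struct dma_window w;
	uint64_t by_vmalloc;

	assert(msl_lowest <= msl_highest);

	/* 1) start at base_virtaddr or the lowest memseg list address */
	w.start = (base_va != 0 && base_va < msl_lowest) ?
			base_va : msl_lowest;

	/* 2) end at start + VmallocTotal or the highest memseg list addr */
	by_vmalloc = w.start + vmalloc_total;
	if (by_vmalloc < w.start)	/* saturate on overflow */
		by_vmalloc = UINT64_MAX;
	w.end = by_vmalloc > msl_highest ? by_vmalloc : msl_highest;

	return w;
}

/* 3) return 1 if [iova, iova + len) fits inside the window, else 0 */
static int
dma_window_contains(const struct dma_window *w, uint64_t iova, uint64_t len)
{
	if (iova < w->start || iova > w->end)
		return 0;
	return len <= w->end - iova;
}
```

A user DMA map path could call dma_window_contains() first and log a
warning or return an error instead of passing the request on.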
> Dave
> 
> Dave
> 
> 
-- 
Thanks,
Anatoly
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing
  2020-05-01  9:06       ` Burakov, Anatoly
@ 2020-05-01 16:48         ` David Christensen
  2020-05-05 14:57           ` Burakov, Anatoly
  0 siblings, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-05-01 16:48 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
>>> I'm not sure of the algorithm for "memory size" here.
>>>
>>> Technically, DPDK can reserve memory segments anywhere in the VA 
>>> space allocated by memseg lists. That space may be far bigger than 
>>> system memory (on a typical Intel server board you'd see 128GB of VA 
>>> space preallocated even though the machine itself might only have, 
>>> say, 16GB of RAM installed). The same applies to any other arch 
>>> running on Linux, so the window needs to cover at least 
>>> RTE_MIN(base_virtaddr, lowest memseglist VA address) and up to 
>>> highest memseglist VA address. That's not even mentioning the fact 
>>> that the user may register external memory for DMA which may cause 
>>> the window to be of insufficient size to cover said external memory.
>>>
>>> I also think that in general, "system memory" metric is ill suited 
>>> for measuring VA space, because unlike system memory, the VA space is 
>>> sparse and can therefore span *a lot* of address space even though in 
>>> reality it may actually use very little physical memory.
>>
>> I'm open to suggestions here.  Perhaps an alternative in /proc/meminfo:
>>
>> VmallocTotal:   549755813888 kB
>>
>> I tested it with 1GB hugepages and it works, need to check with 2M as 
>> well.  If there's no alternative for sizing the window based on 
>> available system parameters then I have another option which creates a 
>> new RTE_IOVA_TA mode that forces IOVA addresses into the range 0 to X 
>> where X is configured on the EAL command-line (--iova-base, 
>> --iova-len).   I use these command-line values to create a static window.
>>
> 
> A whole new IOVA mode, while being a cleaner solution, would require a 
> lot of testing, and it doesn't really solve the external memory problem, 
> because we're still reliant on the user to provide IOVA addresses. 
> Perhaps something akin to VA/IOVA address reservation would solve the 
> problem, but again, lots of changes and testing, all for a comparatively 
> narrow use case.
> 
> The vmalloc area seems big enough (512 terabytes on your machine, 32 
> terabytes on mine), so it'll probably be OK. I'd settle for:
> 
> 1) start at base_virtaddr OR lowest memseg list address, whichever is 
> lowest
The IOMMU only supports two starting addresses, 0 or 1<<59, so 
implementation will need to start at 0.  (I've been bit by this before, 
my understanding is that the processor only supports 54 bits of the 
address and that the PCI host bridge uses bit 59 of the IOVA as a signal 
to do the address translation for the second DMA window.)
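
Since the window has to start at 0, its length has to reach the highest
address that will be mapped. A minimal sketch of the arithmetic, assuming
the length is rounded up to a power of two as the v1 patch does with
rte_align64pow2(); round_up_pow2() and window_addr_bits() are hypothetical
stand-ins written out here so the example is self-contained:

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to the next power of two; v must be in 1..2^63. */
static uint64_t
round_up_pow2(uint64_t v)
{
	uint64_t p = 1;

	assert(v != 0 && v <= (1ULL << 63));
	while (p < v)
		p <<= 1;
	return p;
}

/* Number of address bits covered by a power-of-two window length. */
static unsigned int
window_addr_bits(uint64_t win_len)
{
	unsigned int bits = 0;

	assert(win_len != 0 && (win_len & (win_len - 1)) == 0);
	while ((win_len >>= 1) != 0)
		bits++;
	return bits;
}
```

For example, a highest address of 0x2200000000 rounds up to a window
length of 0x4000000000, which corresponds to a 38-bit DMA mask.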
> 2) end at lowest addr + VmallocTotal OR highest memseglist addr, 
> whichever is higher
So, instead of rte_memseg_walk() execute rte_memseg_list_walk() to find 
the lowest/highest msl addresses?
> 3) a check in user DMA map function that would warn/throw an error 
> whenever there is an attempt to map an address for DMA that doesn't fit 
> into the DMA window
Isn't this mostly prevented by the use of  rte_mem_set_dma_mask() and 
rte_mem_check_dma_mask()?  I'd expect an error would be thrown by the 
kernel IOMMU API for an out-of-range mapping that I would simply return 
to the caller (drivers/vfio/vfio_iommu_spapr_tce.c includes the comment 
/* iova is checked by the IOMMU API */).  Why do you think double 
checking this would help?
> 
> I think that would be best approach. Thoughts?
Dave
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing
  2020-05-01 16:48         ` David Christensen
@ 2020-05-05 14:57           ` Burakov, Anatoly
  2020-05-05 16:26             ` David Christensen
  0 siblings, 1 reply; 48+ messages in thread
From: Burakov, Anatoly @ 2020-05-05 14:57 UTC (permalink / raw)
  To: David Christensen, dev
On 01-May-20 5:48 PM, David Christensen wrote:
>>>> I'm not sure of the algorithm for "memory size" here.
>>>>
>>>> Technically, DPDK can reserve memory segments anywhere in the VA 
>>>> space allocated by memseg lists. That space may be far bigger than 
>>>> system memory (on a typical Intel server board you'd see 128GB of VA 
>>>> space preallocated even though the machine itself might only have, 
>>>> say, 16GB of RAM installed). The same applies to any other arch 
>>>> running on Linux, so the window needs to cover at least 
>>>> RTE_MIN(base_virtaddr, lowest memseglist VA address) and up to 
>>>> highest memseglist VA address. That's not even mentioning the fact 
>>>> that the user may register external memory for DMA which may cause 
>>>> the window to be of insufficient size to cover said external memory.
>>>>
>>>> I also think that in general, "system memory" metric is ill suited 
>>>> for measuring VA space, because unlike system memory, the VA space 
>>>> is sparse and can therefore span *a lot* of address space even 
>>>> though in reality it may actually use very little physical memory.
>>>
>>> I'm open to suggestions here.  Perhaps an alternative in /proc/meminfo:
>>>
>>> VmallocTotal:   549755813888 kB
>>>
>>> I tested it with 1GB hugepages and it works, need to check with 2M as 
>>> well.  If there's no alternative for sizing the window based on 
>>> available system parameters then I have another option which creates 
>>> a new RTE_IOVA_TA mode that forces IOVA addresses into the range 0 to 
>>> X where X is configured on the EAL command-line (--iova-base, 
>>> --iova-len).   I use these command-line values to create a static 
>>> window.
>>>
>>
>> A whole new IOVA mode, while being a cleaner solution, would require a 
>> lot of testing, and it doesn't really solve the external memory 
>> problem, because we're still reliant on the user to provide IOVA 
>> addresses. Perhaps something akin to VA/IOVA address reservation would 
>> solve the problem, but again, lots of changes and testing, all for a 
>> comparatively narrow use case.
>>
>> The vmalloc area seems big enough (512 terabytes on your machine, 32 
>> terabytes on mine), so it'll probably be OK. I'd settle for:
>>
>> 1) start at base_virtaddr OR lowest memseg list address, whichever is 
>> lowest
> 
> The IOMMU only supports two starting addresses, 0 or 1<<59, so 
> implementation will need to start at 0.  (I've been bit by this before, 
> my understanding is that the processor only supports 54 bits of the 
> address and that the PCI host bridge uses bit 59 of the IOVA as a signal 
> to do the address translation for the second DMA window.)
Fair enough, 0 it is then.
> 
>> 2) end at lowest addr + VmallocTotal OR highest memseglist addr, 
>> whichever is higher
> 
> So, instead of rte_memseg_walk() execute rte_memseg_list_walk() to find 
> the lowest/highest msl addresses?
Yep. rte_memseg_walk() will only cover allocated pages, while 
rte_memseg_list_walk() will cover even empty page tables.
> 
>> 3) a check in user DMA map function that would warn/throw an error 
>> whenever there is an attempt to map an address for DMA that doesn't 
>> fit into the DMA window
> 
> Isn't this mostly prevented by the use of  rte_mem_set_dma_mask() and 
> rte_mem_check_dma_mask()?  I'd expect an error would be thrown by the 
> kernel IOMMU API for an out-of-range mapping that I would simply return 
> to the caller (drivers/vfio/vfio_iommu_spapr_tce.c includes the comment 
> /* iova is checked by the IOMMU API */).  Why do you think double 
> checking this would help?
I don't think we check rte_mem_check_dma_mask() anywhere in the call 
path of external memory code. Also, I just checked, and you're right, 
rte_vfio_container_dma_map() will fail if the kernel fails to map the 
memory, however nothing will fail in external memory because the IOVA 
addresses aren't checked for being within DMA mask.
See malloc_heap.c:1097 onwards, we simply add user-specified IOVA 
addresses into the page table without checking if they fit into the DMA 
mask. The DMA mapping will then happen through a mem event callback, but 
we don't check return value of that callback either, so even if DMA 
mapping fails, we'll only get a log message.
So, perhaps the real solution here is to add a DMA mask check into 
rte_malloc_heap_memory_add(), so that we check the IOVA addresses before 
we ever try to do anything with them. I'll submit a patch for this.
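
A check of that kind could look roughly like the sketch below. It is not
the patch referred to above: iovas_fit_dma_mask() is a hypothetical helper,
and the per-page IOVA array, page size and mask width are assumed to be
available at the point where the external memory is added.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if every page [iova[i], iova[i] + page_sz) lies below
 * 2^mask_bits, 0 otherwise. mask_bits must be 1..63 in this sketch.
 */
static int
iovas_fit_dma_mask(const uint64_t *iova, size_t n_pages,
		uint64_t page_sz, unsigned int mask_bits)
{
	uint64_t limit;
	size_t i;

	assert(mask_bits >= 1 && mask_bits <= 63);
	limit = 1ULL << mask_bits;

	if (page_sz > limit)
		return 0;

	for (i = 0; i < n_pages; i++) {
		/* compare against limit - page_sz to avoid overflow */
		if (iova[i] > limit - page_sz)
			return 0;
	}
	return 1;
}
```

Rejecting the request up front means the caller gets an error back rather
than only a log message from a failed mapping later on.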
> 
>>
>> I think that would be best approach. Thoughts?
> 
> Dave
-- 
Thanks,
Anatoly
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing
  2020-05-05 14:57           ` Burakov, Anatoly
@ 2020-05-05 16:26             ` David Christensen
  2020-05-06 10:18               ` Burakov, Anatoly
  0 siblings, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-05-05 16:26 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
>>>>> That's not even mentioning the fact 
>>>>> that the user may register external memory for DMA which may cause 
>>>>> the window to be of insufficient size to cover said external memory.
Regarding external memory, I can think of two obvious options:
1) Skip window sizing altogether if external memory is detected and 
assume the user has set things up appropriately.
2) Add an EAL command line option --iova-len that would allow the 
external memory requirements to be considered if required.
I'll work on a new patch with (2) along with the other changes discussed 
and resubmit.  Thanks for the feedback.
Dave
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing
  2020-05-05 16:26             ` David Christensen
@ 2020-05-06 10:18               ` Burakov, Anatoly
  0 siblings, 0 replies; 48+ messages in thread
From: Burakov, Anatoly @ 2020-05-06 10:18 UTC (permalink / raw)
  To: David Christensen, dev
On 05-May-20 5:26 PM, David Christensen wrote:
>>>>>> That's not even mentioning the fact that the user may register 
>>>>>> external memory for DMA which may cause the window to be of 
>>>>>> insufficient size to cover said external memory.
> 
> Regarding external memory, I can think of two obvious options:
> 
> 1) Skip window sizing altogether if external memory is detected and 
> assume the user has set things up appropriately.
A third option is just to caution users that external memory might not 
work on PPC64, and rely on my patch and the DMA mask infrastructure to 
ensure that we can satisfy any IOVA the user requests. If we can't, the 
user can adjust their IOVA request accordingly.
> 2) Add an EAL command line option --iova-len that would allow the 
> external memory requirements to be considered if required.
> 
I'm not keen on an additional platform-specific EAL option, to be 
honest, and I don't think others would be either.
> I'll work on a new patch with (2) along with the other changes discussed 
> and resubmit.  Thanks for the feedback.
> 
> Dave
-- 
Thanks,
Anatoly
^ permalink raw reply	[flat|nested] 48+ messages in thread
* [dpdk-dev] [PATCH v2 0/1] vfio: change spapr DMA window sizing operation
  2020-04-29 23:29 [dpdk-dev] [PATCH 0/2] vfio: change spapr DMA window sizing operation David Christensen
  2020-04-29 23:29 ` [dpdk-dev] [PATCH 1/2] vfio: use ifdef's for ppc64 spapr code David Christensen
  2020-04-29 23:29 ` [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing David Christensen
@ 2020-06-30 21:38 ` David Christensen
  2020-06-30 21:38   ` [dpdk-dev] [PATCH v2 1/1] vfio: modify spapr iommu support to use static window sizing David Christensen
  2020-08-10 21:07   ` [dpdk-dev] [PATCH v3 0/1] vfio: change spapr DMA window sizing operation David Christensen
  2 siblings, 2 replies; 48+ messages in thread
From: David Christensen @ 2020-06-30 21:38 UTC (permalink / raw)
  To: Anatoly Burakov; +Cc: dev, David Christensen
The SPAPR v2 IOMMU used on bare-metal PowerNV systems requires that a DMA
window be defined before mapping/unmapping memory.  The current VFIO code
dynamically resizes this DMA window every time a new memory request is
made, which requires that all existing memory be unmapped/remapped.
While this strategy worked in DPDK 17.11 and earlier where memory was
statically allocated during startup, it is potentially dangerous in DPDK
18.11 and later where memory can be allocated during runtime, temporarily
invalidating IOVA memory used by hardware.
This new code statically sizes the DMA window at startup, based on the
amount of memory installed in the system, avoiding the need to unmap
memory during runtime.
---
v2:
- Drop patch to wrap ppc64 code with ifdef's
- Add warning when external memory detected
- Change VA memory size detection to scan memseg list when setting DMA window
  for IOVA=VA
- Add explicit error message when attempting to map outside the DMA window
David Christensen (1):
  vfio: modify spapr iommu support to use static window sizing
 lib/librte_eal/linux/eal_vfio.c | 412 ++++++++++++++------------------
 1 file changed, 186 insertions(+), 226 deletions(-)
--
2.18.2
^ permalink raw reply	[flat|nested] 48+ messages in thread
* [dpdk-dev] [PATCH v2 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-06-30 21:38 ` [dpdk-dev] [PATCH v2 0/1] vfio: change spapr DMA window sizing operation David Christensen
@ 2020-06-30 21:38   ` David Christensen
  2020-08-10 21:07   ` [dpdk-dev] [PATCH v3 0/1] vfio: change spapr DMA window sizing operation David Christensen
  1 sibling, 0 replies; 48+ messages in thread
From: David Christensen @ 2020-06-30 21:38 UTC (permalink / raw)
  To: Anatoly Burakov; +Cc: dev, David Christensen
The SPAPR IOMMU requires that a DMA window size be defined before memory
can be mapped for DMA. Current code dynamically modifies the DMA window
size in response to every new memory allocation which is potentially
dangerous because all existing mappings need to be unmapped/remapped in
order to resize the DMA window, leaving hardware holding IOVA addresses
that are temporarily unmapped.  The new SPAPR code statically assigns
the DMA window size on first use, using the largest physical memory
address when IOVA=PA and the highest existing memseg virtual
address when IOVA=VA.
Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
---
 lib/librte_eal/linux/eal_vfio.c | 412 ++++++++++++++------------------
 1 file changed, 186 insertions(+), 226 deletions(-)
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index d26e1649a..1f11b40cc 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -18,6 +18,7 @@
 #include "eal_memcfg.h"
 #include "eal_vfio.h"
 #include "eal_private.h"
+#include "eal_internal_cfg.h"
 
 #ifdef VFIO_PRESENT
 
@@ -534,17 +535,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		return;
 	}
 
-#ifdef RTE_ARCH_PPC_64
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_free(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-	cur_len = 0;
-#endif
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
 
@@ -605,17 +595,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 						iova_expected - iova_start, 0);
 		}
 	}
-#ifdef RTE_ARCH_PPC_64
-	cur_len = 0;
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_used(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-#endif
 }
 
 static int
@@ -1407,21 +1386,30 @@ vfio_type1_dma_map(int vfio_container_fd)
 	return rte_memseg_walk(type1_map, &vfio_container_fd);
 }
 
+/* Track the size of the statically allocated DMA window for SPAPR */
+uint64_t spapr_dma_win_len;
+uint64_t spapr_dma_win_page_sz;
+
 static int
 vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		uint64_t len, int do_map)
 {
-	struct vfio_iommu_type1_dma_map dma_map;
-	struct vfio_iommu_type1_dma_unmap dma_unmap;
-	int ret;
 	struct vfio_iommu_spapr_register_memory reg = {
 		.argsz = sizeof(reg),
+		.vaddr = (uintptr_t) vaddr,
+		.size = len,
 		.flags = 0
 	};
-	reg.vaddr = (uintptr_t) vaddr;
-	reg.size = len;
+	int ret;
 
 	if (do_map != 0) {
+		struct vfio_iommu_type1_dma_map dma_map;
+
+		if (iova + len > spapr_dma_win_len) {
+			RTE_LOG(ERR, EAL, "  dma map attempt outside DMA window\n");
+			return -1;
+		}
+
 		ret = ioctl(vfio_container_fd,
 				VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
 		if (ret) {
@@ -1440,24 +1428,14 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
 		if (ret) {
-			/**
-			 * In case the mapping was already done EBUSY will be
-			 * returned from kernel.
-			 */
-			if (errno == EBUSY) {
-				RTE_LOG(DEBUG, EAL,
-					" Memory segment is already mapped,"
-					" skipping");
-			} else {
-				RTE_LOG(ERR, EAL,
-					"  cannot set up DMA remapping,"
-					" error %i (%s)\n", errno,
-					strerror(errno));
-				return -1;
-			}
+			RTE_LOG(ERR, EAL, "  cannot map vaddr for IOMMU, "
+				"error %i (%s)\n", errno, strerror(errno));
+			return -1;
 		}
 
 	} else {
+		struct vfio_iommu_type1_dma_unmap dma_unmap;
+
 		memset(&dma_unmap, 0, sizeof(dma_unmap));
 		dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
 		dma_unmap.size = len;
@@ -1466,21 +1444,21 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA,
 				&dma_unmap);
 		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
-					errno, strerror(errno));
+			RTE_LOG(ERR, EAL, "  cannot unmap vaddr for IOMMU, "
+				"error %i (%s)\n", errno, strerror(errno));
 			return -1;
 		}
 
 		ret = ioctl(vfio_container_fd,
 				VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
 		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, error %i (%s)\n",
-					errno, strerror(errno));
+			RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, "
+				"error %i (%s)\n", errno, strerror(errno));
 			return -1;
 		}
 	}
 
-	return 0;
+	return ret;
 }
 
 static int
@@ -1497,251 +1475,233 @@ vfio_spapr_map_walk(const struct rte_memseg_list *msl,
 	if (ms->iova == RTE_BAD_IOVA)
 		return 0;
 
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 1);
+	return vfio_spapr_dma_do_map(*vfio_container_fd,
+		ms->addr_64, ms->iova, ms->len, 1);
 }
 
+struct spapr_size_walk_param {
+	uint64_t max_va;
+	uint64_t page_sz;
+	int external;
+};
+
+/*
+ * In order to set the DMA window size required for the SPAPR IOMMU
+ * we need to walk the existing virtual memory allocations as well as
+ * find the hugepage size used.
+ */
 static int
-vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
-		const struct rte_memseg *ms, void *arg)
+vfio_spapr_size_walk(const struct rte_memseg_list *msl, void *arg)
 {
-	int *vfio_container_fd = arg;
+	struct spapr_size_walk_param *param = arg;
+	uint64_t max = (uint64_t) msl->base_va + (uint64_t) msl->len;
 
-	/* skip external memory that isn't a heap */
-	if (msl->external && !msl->heap)
-		return 0;
+	if (msl->external) {
+		param->external++;
+		if (!msl->heap)
+			return 0;
+	}
 
-	/* skip any segments with invalid IOVA addresses */
-	if (ms->iova == RTE_BAD_IOVA)
-		return 0;
+	if (max > param->max_va) {
+		param->page_sz = msl->page_sz;
+		param->max_va = max;
+	}
 
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 0);
+	return 0;
 }
 
-struct spapr_walk_param {
-	uint64_t window_size;
-	uint64_t hugepage_sz;
-};
-
+/*
+ * The SPAPRv2 IOMMU supports 2 DMA windows with starting
+ * address at 0 or 1<<59.  By default, a DMA window is set
+ * at address 0, 2GB long, with a 4KB page.  For DPDK we
+ * must remove the default window and setup a new DMA window
+ * based on the hugepage size and memory requirements of
+ * the application before we can map memory for DMA.
+ */
 static int
-vfio_spapr_window_size_walk(const struct rte_memseg_list *msl,
-		const struct rte_memseg *ms, void *arg)
+spapr_dma_win_size(void)
 {
-	struct spapr_walk_param *param = arg;
-	uint64_t max = ms->iova + ms->len;
+	struct spapr_size_walk_param param;
 
-	/* skip external memory that isn't a heap */
-	if (msl->external && !msl->heap)
+	/* only create DMA window once */
+	if (spapr_dma_win_len > 0)
 		return 0;
 
-	/* skip any segments with invalid IOVA addresses */
-	if (ms->iova == RTE_BAD_IOVA)
-		return 0;
+	/* walk the memseg list to find the page size/max VA address */
+	memset(&param, 0, sizeof(param));
+	if (rte_memseg_list_walk(vfio_spapr_size_walk, &param) < 0) {
+		RTE_LOG(ERR, EAL, "Failed to walk memseg list for DMA "
+			"window size\n");
+		return -1;
+	}
+
+	/* We can't be sure if DMA window covers external memory */
+	if (param.external > 0)
+		RTE_LOG(WARNING, EAL, "Detected external memory which may "
+			"not be managed by the IOMMU\n");
+
+	/* find the maximum IOVA address for setting the DMA window size */
+	if (rte_eal_iova_mode() == RTE_IOVA_PA) {
+		static const char proc_iomem[] = "/proc/iomem";
+		static const char str_sysram[] = "System RAM";
+		uint64_t start, end, max = 0;
+		char *line = NULL;
+		char *dash, *space;
+		size_t line_len;
+
+		/*
+		 * Example "System RAM" in /proc/iomem:
+		 * 00000000-1fffffffff : System RAM
+		 * 200000000000-201fffffffff : System RAM
+		 */
+		FILE *fd = fopen(proc_iomem, "r");
+		if (fd == NULL) {
+			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
+			return -1;
+		}
+		/* Scan /proc/iomem for the highest PA in the system */
+		while (getline(&line, &line_len, fd) != -1) {
+			if (strstr(line, str_sysram) == NULL)
+				continue;
+
+			space = strstr(line, " ");
+			dash = strstr(line, "-");
+
+			/* Validate the format of the memory string */
+			if (space == NULL || dash == NULL || space < dash) {
+				RTE_LOG(ERR, EAL, "Can't parse line \"%s\" in "
+					"file %s\n", line, proc_iomem);
+				continue;
+			}
+
+			start = strtoull(line, NULL, 16);
+			end   = strtoull(dash + 1, NULL, 16);
+			RTE_LOG(DEBUG, EAL, "Found system RAM from 0x%"
+				PRIx64 " to 0x%" PRIx64 "\n", start, end);
+			if (end > max)
+				max = end;
+		}
+		free(line);
+		fclose(fd);
 
-	if (max > param->window_size) {
-		param->hugepage_sz = ms->hugepage_sz;
-		param->window_size = max;
+		if (max == 0) {
+			RTE_LOG(ERR, EAL, "Failed to find valid \"System RAM\" "
+				"entry in file %s\n", proc_iomem);
+			return -1;
+		}
+
+		spapr_dma_win_len = rte_align64pow2(max + 1);
+		RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%"
+			PRIx64 "\n", spapr_dma_win_len);
+	} else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
+		RTE_LOG(DEBUG, EAL, "Highest VA address in memseg list is 0x%"
+			PRIx64 "\n", param.max_va);
+		spapr_dma_win_len = rte_align64pow2(param.max_va);
+		RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%"
+			PRIx64 "\n", spapr_dma_win_len);
+	} else {
+		RTE_LOG(ERR, EAL, "Unsupported IOVA mode\n");
+		return -1;
 	}
 
+	spapr_dma_win_page_sz = param.page_sz;
+	rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
 	return 0;
 }
 
 static int
-vfio_spapr_create_new_dma_window(int vfio_container_fd,
-		struct vfio_iommu_spapr_tce_create *create) {
+vfio_spapr_create_dma_window(int vfio_container_fd)
+{
+	struct vfio_iommu_spapr_tce_create create = {
+		.argsz = sizeof(create), };
 	struct vfio_iommu_spapr_tce_remove remove = {
-		.argsz = sizeof(remove),
-	};
+		.argsz = sizeof(remove), };
 	struct vfio_iommu_spapr_tce_info info = {
-		.argsz = sizeof(info),
-	};
+		.argsz = sizeof(info), };
 	int ret;
 
-	/* query spapr iommu info */
+	ret = spapr_dma_win_size();
+	if (ret < 0)
+		return ret;
+
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
 	if (ret) {
-		RTE_LOG(ERR, EAL, "  cannot get iommu info, "
-				"error %i (%s)\n", errno, strerror(errno));
+		RTE_LOG(ERR, EAL, "  can't get iommu info, "
+			"error %i (%s)\n", errno, strerror(errno));
 		return -1;
 	}
 
-	/* remove default DMA of 32 bit window */
+	/* remove default DMA window */
 	remove.start_addr = info.dma32_window_start;
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
-	if (ret) {
-		RTE_LOG(ERR, EAL, "  cannot remove default DMA window, "
-				"error %i (%s)\n", errno, strerror(errno));
+	if (ret)
 		return -1;
-	}
 
-	/* create new DMA window */
-	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, create);
+	/* create a new DMA window (start address is not selectable) */
+	create.window_size = spapr_dma_win_len;
+	create.page_shift  = __builtin_ctzll(spapr_dma_win_page_sz);
+	create.levels = 1;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 	if (ret) {
-#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
-		/* try possible page_shift and levels for workaround */
+		/* if at first we don't succeed, try more levels */
 		uint32_t levels;
 
-		for (levels = create->levels + 1;
+		for (levels = create.levels + 1;
 			ret && levels <= info.ddw.levels; levels++) {
-			create->levels = levels;
+			create.levels = levels;
 			ret = ioctl(vfio_container_fd,
-				VFIO_IOMMU_SPAPR_TCE_CREATE, create);
-		}
-#endif
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
-					"error %i (%s)\n", errno, strerror(errno));
-			return -1;
+				VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 		}
 	}
-
-	if (create->start_addr != 0) {
-		RTE_LOG(ERR, EAL, "  DMA window start address != 0\n");
+	if (ret) {
+		RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
+			"error %i (%s)\n", errno, strerror(errno));
+		RTE_LOG(ERR, EAL, "  consider using a larger hugepage size if "
+			"supported by the system\n");
 		return -1;
 	}
 
-	return 0;
+	/* verify the start address  */
+	if (create.start_addr != 0) {
+		RTE_LOG(ERR, EAL, "  received unsupported start address 0x%"
+			PRIx64 "\n", (uint64_t)create.start_addr);
+		return -1;
+	}
+	return ret;
 }
 
 static int
-vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
-		uint64_t len, int do_map)
+vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr,
+	uint64_t iova, uint64_t len, int do_map)
 {
-	struct spapr_walk_param param;
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct vfio_config *vfio_cfg;
-	struct user_mem_maps *user_mem_maps;
-	int i, ret = 0;
-
-	vfio_cfg = get_vfio_cfg_by_container_fd(vfio_container_fd);
-	if (vfio_cfg == NULL) {
-		RTE_LOG(ERR, EAL, "  invalid container fd!\n");
-		return -1;
-	}
-
-	user_mem_maps = &vfio_cfg->mem_maps;
-	rte_spinlock_recursive_lock(&user_mem_maps->lock);
-
-	/* check if window size needs to be adjusted */
-	memset(&param, 0, sizeof(param));
-
-	/* we're inside a callback so use thread-unsafe version */
-	if (rte_memseg_walk_thread_unsafe(vfio_spapr_window_size_walk,
-				&param) < 0) {
-		RTE_LOG(ERR, EAL, "Could not get window size\n");
-		ret = -1;
-		goto out;
-	}
-
-	/* also check user maps */
-	for (i = 0; i < user_mem_maps->n_maps; i++) {
-		uint64_t max = user_mem_maps->maps[i].iova +
-				user_mem_maps->maps[i].len;
-		param.window_size = RTE_MAX(param.window_size, max);
-	}
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
+	int ret = 0;
 
 	if (do_map) {
-		/* re-create window and remap the entire memory */
-		if (iova + len > create.window_size) {
-			/* release all maps before recreating the window */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_unmap_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not release DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* release all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						0)) {
-					RTE_LOG(ERR, EAL, "Could not release user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-			create.window_size = rte_align64pow2(iova + len);
-			if (vfio_spapr_create_new_dma_window(vfio_container_fd,
-					&create) < 0) {
-				RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
-				ret = -1;
-				goto out;
-			}
-			/* we're inside a callback, so use thread-unsafe version
-			 */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_map_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not recreate DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* remap all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						1)) {
-					RTE_LOG(ERR, EAL, "Could not recreate user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-		}
-		if (vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 1)) {
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+			vaddr, iova, len, 1)) {
 			RTE_LOG(ERR, EAL, "Failed to map DMA\n");
 			ret = -1;
-			goto out;
 		}
 	} else {
-		/* for unmap, check if iova within DMA window */
-		if (iova > create.window_size) {
-			RTE_LOG(ERR, EAL, "iova beyond DMA window for unmap");
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+			vaddr, iova, len, 0)) {
+			RTE_LOG(ERR, EAL, "Failed to unmap DMA\n");
 			ret = -1;
-			goto out;
 		}
-
-		vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 0);
 	}
-out:
-	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
+
 	return ret;
 }
 
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct spapr_walk_param param;
-
-	memset(&param, 0, sizeof(param));
-
-	/* create DMA window from 0 to max(phys_addr + len) */
-	rte_memseg_walk(vfio_spapr_window_size_walk, &param);
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
-
-	if (vfio_spapr_create_new_dma_window(vfio_container_fd, &create) < 0) {
-		RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
+	if (vfio_spapr_create_dma_window(vfio_container_fd) < 0) {
+		RTE_LOG(ERR, EAL, "Could not create new DMA window!\n");
 		return -1;
 	}
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
+	/* map all existing DPDK segments for DMA */
 	if (rte_memseg_walk(vfio_spapr_map_walk, &vfio_container_fd) < 0)
 		return -1;
 
-- 
2.18.2
* [dpdk-dev] [PATCH v3 0/1] vfio: change spapr DMA window sizing operation
  2020-06-30 21:38 ` [dpdk-dev] [PATCH v2 0/1] vfio: change spapr DMA window sizing operation David Christensen
  2020-06-30 21:38   ` [dpdk-dev] [PATCH v2 1/1] vfio: modify spapr iommu support to use static window sizing David Christensen
@ 2020-08-10 21:07   ` David Christensen
  2020-08-10 21:07     ` [dpdk-dev] [PATCH v3 1/1] vfio: modify spapr iommu support to use static window sizing David Christensen
  2020-10-15 17:23     ` [dpdk-dev] [PATCH v4 0/1] vfio: change spapr DMA window sizing operation David Christensen
  1 sibling, 2 replies; 48+ messages in thread
From: David Christensen @ 2020-08-10 21:07 UTC (permalink / raw)
  To: anatoly.burakov, david.marchand; +Cc: dev, David Christensen
The SPAPR v2 IOMMU used on bare-metal PowerNV systems requires that a DMA
window be defined before mapping/unmapping memory.  The current VFIO code
dynamically resizes this DMA window every time a new memory request is
made, which requires that all existing memory be unmapped/remapped.
While this strategy worked in DPDK 17.11 and earlier where memory was
statically allocated during startup, it is potentially dangerous in DPDK
18.11 and later where memory can be allocated during runtime, temporarily
invalidating IOVA memory used by hardware.
This new code statically sizes the DMA window at startup, based on the
amount of memory installed in the system, avoiding the need to unmap
memory during runtime.
---
v3:
- Rebase for 20.08
v2:
- Drop patch to wrap ppc64 code with ifdef's
- Add warning when external memory detected
- Change VA memory size detection to scan memseg list when setting DMA window
  for IOVA=VA
- Add explicit error message when attempting to map outside the DMA window
David Christensen (1):
  vfio: modify spapr iommu support to use static window sizing
 lib/librte_eal/linux/eal_vfio.c | 412 ++++++++++++++------------------
 1 file changed, 186 insertions(+), 226 deletions(-)
-- 
2.18.4
* [dpdk-dev] [PATCH v3 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-08-10 21:07   ` [dpdk-dev] [PATCH v3 0/1] vfio: change spapr DMA window sizing operation David Christensen
@ 2020-08-10 21:07     ` David Christensen
  2020-09-03 18:55       ` David Christensen
  2020-09-17 11:13       ` Burakov, Anatoly
  2020-10-15 17:23     ` [dpdk-dev] [PATCH v4 0/1] vfio: change spapr DMA window sizing operation David Christensen
  1 sibling, 2 replies; 48+ messages in thread
From: David Christensen @ 2020-08-10 21:07 UTC (permalink / raw)
  To: anatoly.burakov, david.marchand; +Cc: dev, David Christensen
The SPAPR IOMMU requires that a DMA window size be defined before memory
can be mapped for DMA. Current code dynamically modifies the DMA window
size in response to every new memory allocation which is potentially
dangerous because all existing mappings need to be unmapped/remapped in
order to resize the DMA window, leaving hardware holding IOVA addresses
that are temporarily unmapped.  The new SPAPR code statically assigns
the DMA window size on first use, using the largest physical memory
address when IOVA=PA and the highest existing memseg virtual
address when IOVA=VA.
Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
---
 lib/librte_eal/linux/eal_vfio.c | 412 ++++++++++++++------------------
 1 file changed, 186 insertions(+), 226 deletions(-)
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index e07979936..4456761fc 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -18,6 +18,7 @@
 #include "eal_memcfg.h"
 #include "eal_vfio.h"
 #include "eal_private.h"
+#include "eal_internal_cfg.h"
 
 #ifdef VFIO_PRESENT
 
@@ -536,17 +537,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		return;
 	}
 
-#ifdef RTE_ARCH_PPC_64
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_free(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-	cur_len = 0;
-#endif
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
 
@@ -607,17 +597,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 						iova_expected - iova_start, 0);
 		}
 	}
-#ifdef RTE_ARCH_PPC_64
-	cur_len = 0;
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_used(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-#endif
 }
 
 static int
@@ -1433,21 +1412,30 @@ vfio_type1_dma_map(int vfio_container_fd)
 	return rte_memseg_walk(type1_map, &vfio_container_fd);
 }
 
+/* Track the size of the statically allocated DMA window for SPAPR */
+uint64_t spapr_dma_win_len;
+uint64_t spapr_dma_win_page_sz;
+
 static int
 vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		uint64_t len, int do_map)
 {
-	struct vfio_iommu_type1_dma_map dma_map;
-	struct vfio_iommu_type1_dma_unmap dma_unmap;
-	int ret;
 	struct vfio_iommu_spapr_register_memory reg = {
 		.argsz = sizeof(reg),
+		.vaddr = (uintptr_t) vaddr,
+		.size = len,
 		.flags = 0
 	};
-	reg.vaddr = (uintptr_t) vaddr;
-	reg.size = len;
+	int ret;
 
 	if (do_map != 0) {
+		struct vfio_iommu_type1_dma_map dma_map;
+
+		if (iova + len > spapr_dma_win_len) {
+			RTE_LOG(ERR, EAL, "  dma map attempt outside DMA window\n");
+			return -1;
+		}
+
 		ret = ioctl(vfio_container_fd,
 				VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
 		if (ret) {
@@ -1466,24 +1454,14 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
 		if (ret) {
-			/**
-			 * In case the mapping was already done EBUSY will be
-			 * returned from kernel.
-			 */
-			if (errno == EBUSY) {
-				RTE_LOG(DEBUG, EAL,
-					" Memory segment is already mapped,"
-					" skipping");
-			} else {
-				RTE_LOG(ERR, EAL,
-					"  cannot set up DMA remapping,"
-					" error %i (%s)\n", errno,
-					strerror(errno));
-				return -1;
-			}
+			RTE_LOG(ERR, EAL, "  cannot map vaddr for IOMMU, "
+				"error %i (%s)\n", errno, strerror(errno));
+			return -1;
 		}
 
 	} else {
+		struct vfio_iommu_type1_dma_map dma_unmap;
+
 		memset(&dma_unmap, 0, sizeof(dma_unmap));
 		dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
 		dma_unmap.size = len;
@@ -1492,21 +1470,21 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA,
 				&dma_unmap);
 		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
-					errno, strerror(errno));
+			RTE_LOG(ERR, EAL, "  cannot unmap vaddr for IOMMU, "
+				"error %i (%s)\n", errno, strerror(errno));
 			return -1;
 		}
 
 		ret = ioctl(vfio_container_fd,
 				VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
 		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, error %i (%s)\n",
-					errno, strerror(errno));
+			RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, "
+				"error %i (%s)\n", errno, strerror(errno));
 			return -1;
 		}
 	}
 
-	return 0;
+	return ret;
 }
 
 static int
@@ -1523,251 +1501,233 @@ vfio_spapr_map_walk(const struct rte_memseg_list *msl,
 	if (ms->iova == RTE_BAD_IOVA)
 		return 0;
 
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 1);
+	return vfio_spapr_dma_do_map(*vfio_container_fd,
+		ms->addr_64, ms->iova, ms->len, 1);
 }
 
+struct spapr_size_walk_param {
+	uint64_t max_va;
+	uint64_t page_sz;
+	int external;
+};
+
+/*
+ * In order to set the DMA window size required for the SPAPR IOMMU
+ * we need to walk the existing virtual memory allocations as well as
+ * find the hugepage size used.
+ */
 static int
-vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
-		const struct rte_memseg *ms, void *arg)
+vfio_spapr_size_walk(const struct rte_memseg_list *msl, void *arg)
 {
-	int *vfio_container_fd = arg;
+	struct spapr_size_walk_param *param = arg;
+	uint64_t max = (uint64_t) msl->base_va + (uint64_t) msl->len;
 
-	/* skip external memory that isn't a heap */
-	if (msl->external && !msl->heap)
-		return 0;
+	if (msl->external) {
+		param->external++;
+		if (!msl->heap)
+			return 0;
+	}
 
-	/* skip any segments with invalid IOVA addresses */
-	if (ms->iova == RTE_BAD_IOVA)
-		return 0;
+	if (max > param->max_va) {
+		param->page_sz = msl->page_sz;
+		param->max_va = max;
+	}
 
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 0);
+	return 0;
 }
 
-struct spapr_walk_param {
-	uint64_t window_size;
-	uint64_t hugepage_sz;
-};
-
+/*
+ * The SPAPRv2 IOMMU supports 2 DMA windows with starting
+ * address at 0 or 1<<59.  By default, a DMA window is set
+ * at address 0, 2GB long, with a 4KB page.  For DPDK we
+ * must remove the default window and setup a new DMA window
+ * based on the hugepage size and memory requirements of
+ * the application before we can map memory for DMA.
+ */
 static int
-vfio_spapr_window_size_walk(const struct rte_memseg_list *msl,
-		const struct rte_memseg *ms, void *arg)
+spapr_dma_win_size(void)
 {
-	struct spapr_walk_param *param = arg;
-	uint64_t max = ms->iova + ms->len;
+	struct spapr_size_walk_param param;
 
-	/* skip external memory that isn't a heap */
-	if (msl->external && !msl->heap)
+	/* only create DMA window once */
+	if (spapr_dma_win_len > 0)
 		return 0;
 
-	/* skip any segments with invalid IOVA addresses */
-	if (ms->iova == RTE_BAD_IOVA)
-		return 0;
+	/* walk the memseg list to find the page size/max VA address */
+	memset(&param, 0, sizeof(param));
+	if (rte_memseg_list_walk(vfio_spapr_size_walk, &param) < 0) {
+		RTE_LOG(ERR, EAL, "Failed to walk memseg list for DMA "
+			"window size\n");
+		return -1;
+	}
+
+	/* We can't be sure if DMA window covers external memory */
+	if (param.external > 0)
+		RTE_LOG(WARNING, EAL, "Detected external memory which may "
+			"not be managed by the IOMMU\n");
+
+	/* find the maximum IOVA address for setting the DMA window size */
+	if (rte_eal_iova_mode() == RTE_IOVA_PA) {
+		static const char proc_iomem[] = "/proc/iomem";
+		static const char str_sysram[] = "System RAM";
+		uint64_t start, end, max = 0;
+		char *line = NULL;
+		char *dash, *space;
+		size_t line_len;
+
+		/*
+		 * Example "System RAM" in /proc/iomem:
+		 * 00000000-1fffffffff : System RAM
+		 * 200000000000-201fffffffff : System RAM
+		 */
+		FILE *fd = fopen(proc_iomem, "r");
+		if (fd == NULL) {
+			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
+			return -1;
+		}
+		/* Scan /proc/iomem for the highest PA in the system */
+		while (getline(&line, &line_len, fd) != -1) {
+			if (strstr(line, str_sysram) == NULL)
+				continue;
+
+			space = strstr(line, " ");
+			dash = strstr(line, "-");
+
+			/* Validate the format of the memory string */
+			if (space == NULL || dash == NULL || space < dash) {
+				RTE_LOG(ERR, EAL, "Can't parse line \"%s\" in "
+					"file %s\n", line, proc_iomem);
+				continue;
+			}
+
+			start = strtoull(line, NULL, 16);
+			end   = strtoull(dash + 1, NULL, 16);
+			RTE_LOG(DEBUG, EAL, "Found system RAM from 0x%"
+				PRIx64 " to 0x%" PRIx64 "\n", start, end);
+			if (end > max)
+				max = end;
+		}
+		free(line);
+		fclose(fd);
 
-	if (max > param->window_size) {
-		param->hugepage_sz = ms->hugepage_sz;
-		param->window_size = max;
+		if (max == 0) {
+			RTE_LOG(ERR, EAL, "Failed to find valid \"System RAM\" "
+				"entry in file %s\n", proc_iomem);
+			return -1;
+		}
+
+		spapr_dma_win_len = rte_align64pow2(max + 1);
+		RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%"
+			PRIx64 "\n", spapr_dma_win_len);
+	} else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
+		RTE_LOG(DEBUG, EAL, "Highest VA address in memseg list is 0x%"
+			PRIx64 "\n", param.max_va);
+		spapr_dma_win_len = rte_align64pow2(param.max_va);
+		RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%"
+			PRIx64 "\n", spapr_dma_win_len);
+	} else {
+		RTE_LOG(ERR, EAL, "Unsupported IOVA mode\n");
+		return -1;
 	}
 
+	spapr_dma_win_page_sz = param.page_sz;
+	rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
 	return 0;
 }
 
 static int
-vfio_spapr_create_new_dma_window(int vfio_container_fd,
-		struct vfio_iommu_spapr_tce_create *create) {
+vfio_spapr_create_dma_window(int vfio_container_fd)
+{
+	struct vfio_iommu_spapr_tce_create create = {
+		.argsz = sizeof(create), };
 	struct vfio_iommu_spapr_tce_remove remove = {
-		.argsz = sizeof(remove),
-	};
+		.argsz = sizeof(remove), };
 	struct vfio_iommu_spapr_tce_info info = {
-		.argsz = sizeof(info),
-	};
+		.argsz = sizeof(info), };
 	int ret;
 
-	/* query spapr iommu info */
+	ret = spapr_dma_win_size();
+	if (ret < 0)
+		return ret;
+
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
 	if (ret) {
-		RTE_LOG(ERR, EAL, "  cannot get iommu info, "
-				"error %i (%s)\n", errno, strerror(errno));
+		RTE_LOG(ERR, EAL, "  can't get iommu info, "
+			"error %i (%s)\n", errno, strerror(errno));
 		return -1;
 	}
 
-	/* remove default DMA of 32 bit window */
+	/* remove default DMA window */
 	remove.start_addr = info.dma32_window_start;
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
-	if (ret) {
-		RTE_LOG(ERR, EAL, "  cannot remove default DMA window, "
-				"error %i (%s)\n", errno, strerror(errno));
+	if (ret)
 		return -1;
-	}
 
-	/* create new DMA window */
-	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, create);
+	/* create a new DMA window (start address is not selectable) */
+	create.window_size = spapr_dma_win_len;
+	create.page_shift  = __builtin_ctzll(spapr_dma_win_page_sz);
+	create.levels = 1;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 	if (ret) {
-#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
-		/* try possible page_shift and levels for workaround */
+		/* if at first we don't succeed, try more levels */
 		uint32_t levels;
 
-		for (levels = create->levels + 1;
+		for (levels = create.levels + 1;
 			ret && levels <= info.ddw.levels; levels++) {
-			create->levels = levels;
+			create.levels = levels;
 			ret = ioctl(vfio_container_fd,
-				VFIO_IOMMU_SPAPR_TCE_CREATE, create);
-		}
-#endif
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
-					"error %i (%s)\n", errno, strerror(errno));
-			return -1;
+				VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 		}
 	}
-
-	if (create->start_addr != 0) {
-		RTE_LOG(ERR, EAL, "  DMA window start address != 0\n");
+	if (ret) {
+		RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
+			"error %i (%s)\n", errno, strerror(errno));
+		RTE_LOG(ERR, EAL, "  consider using a larger hugepage size if "
+			"supported by the system\n");
 		return -1;
 	}
 
-	return 0;
+	/* verify the start address  */
+	if (create.start_addr != 0) {
+		RTE_LOG(ERR, EAL, "  received unsupported start address 0x%"
+			PRIx64 "\n", (uint64_t)create.start_addr);
+		return -1;
+	}
+	return ret;
 }
 
 static int
-vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
-		uint64_t len, int do_map)
+vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr,
+	uint64_t iova, uint64_t len, int do_map)
 {
-	struct spapr_walk_param param;
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct vfio_config *vfio_cfg;
-	struct user_mem_maps *user_mem_maps;
-	int i, ret = 0;
-
-	vfio_cfg = get_vfio_cfg_by_container_fd(vfio_container_fd);
-	if (vfio_cfg == NULL) {
-		RTE_LOG(ERR, EAL, "  invalid container fd!\n");
-		return -1;
-	}
-
-	user_mem_maps = &vfio_cfg->mem_maps;
-	rte_spinlock_recursive_lock(&user_mem_maps->lock);
-
-	/* check if window size needs to be adjusted */
-	memset(&param, 0, sizeof(param));
-
-	/* we're inside a callback so use thread-unsafe version */
-	if (rte_memseg_walk_thread_unsafe(vfio_spapr_window_size_walk,
-				&param) < 0) {
-		RTE_LOG(ERR, EAL, "Could not get window size\n");
-		ret = -1;
-		goto out;
-	}
-
-	/* also check user maps */
-	for (i = 0; i < user_mem_maps->n_maps; i++) {
-		uint64_t max = user_mem_maps->maps[i].iova +
-				user_mem_maps->maps[i].len;
-		param.window_size = RTE_MAX(param.window_size, max);
-	}
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
+	int ret = 0;
 
 	if (do_map) {
-		/* re-create window and remap the entire memory */
-		if (iova + len > create.window_size) {
-			/* release all maps before recreating the window */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_unmap_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not release DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* release all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						0)) {
-					RTE_LOG(ERR, EAL, "Could not release user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-			create.window_size = rte_align64pow2(iova + len);
-			if (vfio_spapr_create_new_dma_window(vfio_container_fd,
-					&create) < 0) {
-				RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
-				ret = -1;
-				goto out;
-			}
-			/* we're inside a callback, so use thread-unsafe version
-			 */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_map_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not recreate DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* remap all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						1)) {
-					RTE_LOG(ERR, EAL, "Could not recreate user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-		}
-		if (vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 1)) {
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+			vaddr, iova, len, 1)) {
 			RTE_LOG(ERR, EAL, "Failed to map DMA\n");
 			ret = -1;
-			goto out;
 		}
 	} else {
-		/* for unmap, check if iova within DMA window */
-		if (iova > create.window_size) {
-			RTE_LOG(ERR, EAL, "iova beyond DMA window for unmap");
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+			vaddr, iova, len, 0)) {
+			RTE_LOG(ERR, EAL, "Failed to unmap DMA\n");
 			ret = -1;
-			goto out;
 		}
-
-		vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 0);
 	}
-out:
-	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
+
 	return ret;
 }
 
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct spapr_walk_param param;
-
-	memset(&param, 0, sizeof(param));
-
-	/* create DMA window from 0 to max(phys_addr + len) */
-	rte_memseg_walk(vfio_spapr_window_size_walk, &param);
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
-
-	if (vfio_spapr_create_new_dma_window(vfio_container_fd, &create) < 0) {
-		RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
+	if (vfio_spapr_create_dma_window(vfio_container_fd) < 0) {
+		RTE_LOG(ERR, EAL, "Could not create new DMA window!\n");
 		return -1;
 	}
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
+	/* map all existing DPDK segments for DMA */
 	if (rte_memseg_walk(vfio_spapr_map_walk, &vfio_container_fd) < 0)
 		return -1;
 
-- 
2.18.4
* Re: [dpdk-dev] [PATCH v3 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-08-10 21:07     ` [dpdk-dev] [PATCH v3 1/1] vfio: modify spapr iommu support to use static window sizing David Christensen
@ 2020-09-03 18:55       ` David Christensen
  2020-09-17 11:13       ` Burakov, Anatoly
  1 sibling, 0 replies; 48+ messages in thread
From: David Christensen @ 2020-09-03 18:55 UTC (permalink / raw)
  To: anatoly.burakov, david.marchand; +Cc: dev
Ping
On 8/10/20 2:07 PM, David Christensen wrote:
> The SPAPR IOMMU requires that a DMA window size be defined before memory
> can be mapped for DMA. Current code dynamically modifies the DMA window
> size in response to every new memory allocation which is potentially
> dangerous because all existing mappings need to be unmapped/remapped in
> order to resize the DMA window, leaving hardware holding IOVA addresses
> that are temporarily unmapped.  The new SPAPR code statically assigns
> the DMA window size on first use, using the largest physical memory
> address when IOVA=PA and the highest existing memseg virtual
> address when IOVA=VA.
> 
> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
> ---
>   lib/librte_eal/linux/eal_vfio.c | 412 ++++++++++++++------------------
>   1 file changed, 186 insertions(+), 226 deletions(-)
> 
> diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
> index e07979936..4456761fc 100644
> --- a/lib/librte_eal/linux/eal_vfio.c
> +++ b/lib/librte_eal/linux/eal_vfio.c
> @@ -18,6 +18,7 @@
>   #include "eal_memcfg.h"
>   #include "eal_vfio.h"
>   #include "eal_private.h"
> +#include "eal_internal_cfg.h"
> 
>   #ifdef VFIO_PRESENT
> 
> @@ -536,17 +537,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
>   		return;
>   	}
> 
> -#ifdef RTE_ARCH_PPC_64
> -	ms = rte_mem_virt2memseg(addr, msl);
> -	while (cur_len < len) {
> -		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
> -
> -		rte_fbarray_set_free(&msl->memseg_arr, idx);
> -		cur_len += ms->len;
> -		++ms;
> -	}
> -	cur_len = 0;
> -#endif
>   	/* memsegs are contiguous in memory */
>   	ms = rte_mem_virt2memseg(addr, msl);
> 
> @@ -607,17 +597,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
>   						iova_expected - iova_start, 0);
>   		}
>   	}
> -#ifdef RTE_ARCH_PPC_64
> -	cur_len = 0;
> -	ms = rte_mem_virt2memseg(addr, msl);
> -	while (cur_len < len) {
> -		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
> -
> -		rte_fbarray_set_used(&msl->memseg_arr, idx);
> -		cur_len += ms->len;
> -		++ms;
> -	}
> -#endif
>   }
> 
>   static int
> @@ -1433,21 +1412,30 @@ vfio_type1_dma_map(int vfio_container_fd)
>   	return rte_memseg_walk(type1_map, &vfio_container_fd);
>   }
> 
> +/* Track the size of the statically allocated DMA window for SPAPR */
> +uint64_t spapr_dma_win_len;
> +uint64_t spapr_dma_win_page_sz;
> +
>   static int
>   vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
>   		uint64_t len, int do_map)
>   {
> -	struct vfio_iommu_type1_dma_map dma_map;
> -	struct vfio_iommu_type1_dma_unmap dma_unmap;
> -	int ret;
>   	struct vfio_iommu_spapr_register_memory reg = {
>   		.argsz = sizeof(reg),
> +		.vaddr = (uintptr_t) vaddr,
> +		.size = len,
>   		.flags = 0
>   	};
> -	reg.vaddr = (uintptr_t) vaddr;
> -	reg.size = len;
> +	int ret;
> 
>   	if (do_map != 0) {
> +		struct vfio_iommu_type1_dma_map dma_map;
> +
> +		if (iova + len > spapr_dma_win_len) {
> +			RTE_LOG(ERR, EAL, "  dma map attempt outside DMA window\n");
> +			return -1;
> +		}
> +
>   		ret = ioctl(vfio_container_fd,
> 				VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
>   		if (ret) {
> @@ -1466,24 +1454,14 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
> 
>   		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
>   		if (ret) {
> -			/**
> -			 * In case the mapping was already done EBUSY will be
> -			 * returned from kernel.
> -			 */
> -			if (errno == EBUSY) {
> -				RTE_LOG(DEBUG, EAL,
> -					" Memory segment is already mapped,"
> -					" skipping");
> -			} else {
> -				RTE_LOG(ERR, EAL,
> -					"  cannot set up DMA remapping,"
> -					" error %i (%s)\n", errno,
> -					strerror(errno));
> -				return -1;
> -			}
> +			RTE_LOG(ERR, EAL, "  cannot map vaddr for IOMMU, "
> +				"error %i (%s)\n", errno, strerror(errno));
> +			return -1;
>   		}
> 
>   	} else {
> +		struct vfio_iommu_type1_dma_map dma_unmap;
> +
>   		memset(&dma_unmap, 0, sizeof(dma_unmap));
>   		dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
>   		dma_unmap.size = len;
> @@ -1492,21 +1470,21 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
>   		ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA,
>   				&dma_unmap);
>   		if (ret) {
> -			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> -					errno, strerror(errno));
> +			RTE_LOG(ERR, EAL, "  cannot unmap vaddr for IOMMU, "
> +				"error %i (%s)\n", errno, strerror(errno));
>   			return -1;
>   		}
> 
>   		ret = ioctl(vfio_container_fd,
> 				VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
>   		if (ret) {
> -			RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, error %i (%s)\n",
> -					errno, strerror(errno));
> +			RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, "
> +				"error %i (%s)\n", errno, strerror(errno));
>   			return -1;
>   		}
>   	}
> 
> -	return 0;
> +	return ret;
>   }
> 
>   static int
> @@ -1523,251 +1501,233 @@ vfio_spapr_map_walk(const struct rte_memseg_list *msl,
>   	if (ms->iova == RTE_BAD_IOVA)
>   		return 0;
> 
> -	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
> -			ms->len, 1);
> +	return vfio_spapr_dma_do_map(*vfio_container_fd,
> +		ms->addr_64, ms->iova, ms->len, 1);
>   }
> 
> +struct spapr_size_walk_param {
> +	uint64_t max_va;
> +	uint64_t page_sz;
> +	int external;
> +};
> +
> +/*
> + * In order to set the DMA window size required for the SPAPR IOMMU
> + * we need to walk the existing virtual memory allocations as well as
> + * find the hugepage size used.
> + */
>   static int
> -vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
> -		const struct rte_memseg *ms, void *arg)
> +vfio_spapr_size_walk(const struct rte_memseg_list *msl, void *arg)
>   {
> -	int *vfio_container_fd = arg;
> +	struct spapr_size_walk_param *param = arg;
> +	uint64_t max = (uint64_t) msl->base_va + (uint64_t) msl->len;
> 
> -	/* skip external memory that isn't a heap */
> -	if (msl->external && !msl->heap)
> -		return 0;
> +	if (msl->external) {
> +		param->external++;
> +		if (!msl->heap)
> +			return 0;
> +	}
> 
> -	/* skip any segments with invalid IOVA addresses */
> -	if (ms->iova == RTE_BAD_IOVA)
> -		return 0;
> +	if (max > param->max_va) {
> +		param->page_sz = msl->page_sz;
> +		param->max_va = max;
> +	}
> 
> -	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
> -			ms->len, 0);
> +	return 0;
>   }
> 
> -struct spapr_walk_param {
> -	uint64_t window_size;
> -	uint64_t hugepage_sz;
> -};
> -
> +/*
> + * The SPAPRv2 IOMMU supports 2 DMA windows with starting
> + * address at 0 or 1<<59.  By default, a DMA window is set
> + * at address 0, 2GB long, with a 4KB page.  For DPDK we
> + * must remove the default window and setup a new DMA window
> + * based on the hugepage size and memory requirements of
> + * the application before we can map memory for DMA.
> + */
>   static int
> -vfio_spapr_window_size_walk(const struct rte_memseg_list *msl,
> -		const struct rte_memseg *ms, void *arg)
> +spapr_dma_win_size(void)
>   {
> -	struct spapr_walk_param *param = arg;
> -	uint64_t max = ms->iova + ms->len;
> +	struct spapr_size_walk_param param;
> 
> -	/* skip external memory that isn't a heap */
> -	if (msl->external && !msl->heap)
> +	/* only create DMA window once */
> +	if (spapr_dma_win_len > 0)
>   		return 0;
> 
> -	/* skip any segments with invalid IOVA addresses */
> -	if (ms->iova == RTE_BAD_IOVA)
> -		return 0;
> +	/* walk the memseg list to find the page size/max VA address */
> +	memset(&param, 0, sizeof(param));
> +	if (rte_memseg_list_walk(vfio_spapr_size_walk, &param) < 0) {
> +		RTE_LOG(ERR, EAL, "Failed to walk memseg list for DMA "
> +			"window size\n");
> +		return -1;
> +	}
> +
> +	/* We can't be sure if DMA window covers external memory */
> +	if (param.external > 0)
> +		RTE_LOG(WARNING, EAL, "Detected external memory which may "
> +			"not be managed by the IOMMU\n");
> +
> +	/* find the maximum IOVA address for setting the DMA window size */
> +	if (rte_eal_iova_mode() == RTE_IOVA_PA) {
> +		static const char proc_iomem[] = "/proc/iomem";
> +		static const char str_sysram[] = "System RAM";
> +		uint64_t start, end, max = 0;
> +		char *line = NULL;
> +		char *dash, *space;
> +		size_t line_len;
> +
> +		/*
> +		 * Example "System RAM" in /proc/iomem:
> +		 * 00000000-1fffffffff : System RAM
> +		 * 200000000000-201fffffffff : System RAM
> +		 */
> +		FILE *fd = fopen(proc_iomem, "r");
> +		if (fd == NULL) {
> +			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
> +			return -1;
> +		}
> +		/* Scan /proc/iomem for the highest PA in the system */
> +		while (getline(&line, &line_len, fd) != -1) {
> +			if (strstr(line, str_sysram) == NULL)
> +				continue;
> +
> +			space = strstr(line, " ");
> +			dash = strstr(line, "-");
> +
> +			/* Validate the format of the memory string */
> +			if (space == NULL || dash == NULL || space < dash) {
> +				RTE_LOG(ERR, EAL, "Can't parse line \"%s\" in "
> +					"file %s\n", line, proc_iomem);
> +				continue;
> +			}
> +
> +			start = strtoull(line, NULL, 16);
> +			end   = strtoull(dash + 1, NULL, 16);
> +			RTE_LOG(DEBUG, EAL, "Found system RAM from 0x%"
> +				PRIx64 " to 0x%" PRIx64 "\n", start, end);
> +			if (end > max)
> +				max = end;
> +		}
> +		free(line);
> +		fclose(fd);
> 
> -	if (max > param->window_size) {
> -		param->hugepage_sz = ms->hugepage_sz;
> -		param->window_size = max;
> +		if (max == 0) {
> +			RTE_LOG(ERR, EAL, "Failed to find valid \"System RAM\" "
> +				"entry in file %s\n", proc_iomem);
> +			return -1;
> +		}
> +
> +		spapr_dma_win_len = rte_align64pow2(max + 1);
> +		RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%"
> +			PRIx64 "\n", spapr_dma_win_len);
> +	} else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
> +		RTE_LOG(DEBUG, EAL, "Highest VA address in memseg list is 0x%"
> +			PRIx64 "\n", param.max_va);
> +		spapr_dma_win_len = rte_align64pow2(param.max_va);
> +		RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%"
> +			PRIx64 "\n", spapr_dma_win_len);
> +	} else {
> +		RTE_LOG(ERR, EAL, "Unsupported IOVA mode\n");
> +		return -1;
>   	}
> 
> +	spapr_dma_win_page_sz = param.page_sz;
> +	rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
>   	return 0;
>   }
> 
>   static int
> -vfio_spapr_create_new_dma_window(int vfio_container_fd,
> -		struct vfio_iommu_spapr_tce_create *create) {
> +vfio_spapr_create_dma_window(int vfio_container_fd)
> +{
> +	struct vfio_iommu_spapr_tce_create create = {
> +		.argsz = sizeof(create), };
>   	struct vfio_iommu_spapr_tce_remove remove = {
> -		.argsz = sizeof(remove),
> -	};
> +		.argsz = sizeof(remove), };
>   	struct vfio_iommu_spapr_tce_info info = {
> -		.argsz = sizeof(info),
> -	};
> +		.argsz = sizeof(info), };
>   	int ret;
> 
> -	/* query spapr iommu info */
> +	ret = spapr_dma_win_size();
> +	if (ret < 0)
> +		return ret;
> +
>   	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>   	if (ret) {
> -		RTE_LOG(ERR, EAL, "  cannot get iommu info, "
> -				"error %i (%s)\n", errno, strerror(errno));
> +		RTE_LOG(ERR, EAL, "  can't get iommu info, "
> +			"error %i (%s)\n", errno, strerror(errno));
>   		return -1;
>   	}
> 
> -	/* remove default DMA of 32 bit window */
> +	/* remove default DMA window */
>   	remove.start_addr = info.dma32_window_start;
>   	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> -	if (ret) {
> -		RTE_LOG(ERR, EAL, "  cannot remove default DMA window, "
> -				"error %i (%s)\n", errno, strerror(errno));
> +	if (ret)
>   		return -1;
> -	}
> 
> -	/* create new DMA window */
> -	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, create);
> +	/* create a new DMA window (start address is not selectable) */
> +	create.window_size = spapr_dma_win_len;
> +	create.page_shift  = __builtin_ctzll(spapr_dma_win_page_sz);
> +	create.levels = 1;
> +	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>   	if (ret) {
> -#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
> -		/* try possible page_shift and levels for workaround */
> +		/* if at first we don't succeed, try more levels */
>   		uint32_t levels;
> 
> -		for (levels = create->levels + 1;
> +		for (levels = create.levels + 1;
>   			ret && levels <= info.ddw.levels; levels++) {
> -			create->levels = levels;
> +			create.levels = levels;
>   			ret = ioctl(vfio_container_fd,
> -				VFIO_IOMMU_SPAPR_TCE_CREATE, create);
> -		}
> -#endif
> -		if (ret) {
> -			RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
> -					"error %i (%s)\n", errno, strerror(errno));
> -			return -1;
> +				VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>   		}
>   	}
> -
> -	if (create->start_addr != 0) {
> -		RTE_LOG(ERR, EAL, "  DMA window start address != 0\n");
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
> +			"error %i (%s)\n", errno, strerror(errno));
> +		RTE_LOG(ERR, EAL, "  consider using a larger hugepage size if "
> +			"supported by the system\n");
>   		return -1;
>   	}
> 
> -	return 0;
> +	/* verify the start address  */
> +	if (create.start_addr != 0) {
> +		RTE_LOG(ERR, EAL, "  received unsupported start address 0x%"
> +			PRIx64 "\n", (uint64_t)create.start_addr);
> +		return -1;
> +	}
> +	return ret;
>   }
> 
>   static int
> -vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
> -		uint64_t len, int do_map)
> +vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr,
> +	uint64_t iova, uint64_t len, int do_map)
>   {
> -	struct spapr_walk_param param;
> -	struct vfio_iommu_spapr_tce_create create = {
> -		.argsz = sizeof(create),
> -	};
> -	struct vfio_config *vfio_cfg;
> -	struct user_mem_maps *user_mem_maps;
> -	int i, ret = 0;
> -
> -	vfio_cfg = get_vfio_cfg_by_container_fd(vfio_container_fd);
> -	if (vfio_cfg == NULL) {
> -		RTE_LOG(ERR, EAL, "  invalid container fd!\n");
> -		return -1;
> -	}
> -
> -	user_mem_maps = &vfio_cfg->mem_maps;
> -	rte_spinlock_recursive_lock(&user_mem_maps->lock);
> -
> -	/* check if window size needs to be adjusted */
> -	memset(&param, 0, sizeof(param));
> -
> -	/* we're inside a callback so use thread-unsafe version */
> -	if (rte_memseg_walk_thread_unsafe(vfio_spapr_window_size_walk,
> -				&param) < 0) {
> -		RTE_LOG(ERR, EAL, "Could not get window size\n");
> -		ret = -1;
> -		goto out;
> -	}
> -
> -	/* also check user maps */
> -	for (i = 0; i < user_mem_maps->n_maps; i++) {
> -		uint64_t max = user_mem_maps->maps[i].iova +
> -				user_mem_maps->maps[i].len;
> -		param.window_size = RTE_MAX(param.window_size, max);
> -	}
> -
> -	/* sPAPR requires window size to be a power of 2 */
> -	create.window_size = rte_align64pow2(param.window_size);
> -	create.page_shift = __builtin_ctzll(param.hugepage_sz);
> -	create.levels = 1;
> +	int ret = 0;
> 
>   	if (do_map) {
> -		/* re-create window and remap the entire memory */
> -		if (iova + len > create.window_size) {
> -			/* release all maps before recreating the window */
> -			if (rte_memseg_walk_thread_unsafe(vfio_spapr_unmap_walk,
> -					&vfio_container_fd) < 0) {
> -				RTE_LOG(ERR, EAL, "Could not release DMA maps\n");
> -				ret = -1;
> -				goto out;
> -			}
> -			/* release all user maps */
> -			for (i = 0; i < user_mem_maps->n_maps; i++) {
> -				struct user_mem_map *map =
> -						&user_mem_maps->maps[i];
> -				if (vfio_spapr_dma_do_map(vfio_container_fd,
> -						map->addr, map->iova, map->len,
> -						0)) {
> -					RTE_LOG(ERR, EAL, "Could not release user DMA maps\n");
> -					ret = -1;
> -					goto out;
> -				}
> -			}
> -			create.window_size = rte_align64pow2(iova + len);
> -			if (vfio_spapr_create_new_dma_window(vfio_container_fd,
> -					&create) < 0) {
> -				RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
> -				ret = -1;
> -				goto out;
> -			}
> -			/* we're inside a callback, so use thread-unsafe version
> -			 */
> -			if (rte_memseg_walk_thread_unsafe(vfio_spapr_map_walk,
> -					&vfio_container_fd) < 0) {
> -				RTE_LOG(ERR, EAL, "Could not recreate DMA maps\n");
> -				ret = -1;
> -				goto out;
> -			}
> -			/* remap all user maps */
> -			for (i = 0; i < user_mem_maps->n_maps; i++) {
> -				struct user_mem_map *map =
> -						&user_mem_maps->maps[i];
> -				if (vfio_spapr_dma_do_map(vfio_container_fd,
> -						map->addr, map->iova, map->len,
> -						1)) {
> -					RTE_LOG(ERR, EAL, "Could not recreate user DMA maps\n");
> -					ret = -1;
> -					goto out;
> -				}
> -			}
> -		}
> -		if (vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 1)) {
> +		if (vfio_spapr_dma_do_map(vfio_container_fd,
> +			vaddr, iova, len, 1)) {
>   			RTE_LOG(ERR, EAL, "Failed to map DMA\n");
>   			ret = -1;
> -			goto out;
>   		}
>   	} else {
> -		/* for unmap, check if iova within DMA window */
> -		if (iova > create.window_size) {
> -			RTE_LOG(ERR, EAL, "iova beyond DMA window for unmap");
> +		if (vfio_spapr_dma_do_map(vfio_container_fd,
> +			vaddr, iova, len, 0)) {
> +			RTE_LOG(ERR, EAL, "Failed to unmap DMA\n");
>   			ret = -1;
> -			goto out;
>   		}
> -
> -		vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 0);
>   	}
> -out:
> -	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
> +
>   	return ret;
>   }
> 
>   static int
>   vfio_spapr_dma_map(int vfio_container_fd)
>   {
> -	struct vfio_iommu_spapr_tce_create create = {
> -		.argsz = sizeof(create),
> -	};
> -	struct spapr_walk_param param;
> -
> -	memset(&param, 0, sizeof(param));
> -
> -	/* create DMA window from 0 to max(phys_addr + len) */
> -	rte_memseg_walk(vfio_spapr_window_size_walk, &param);
> -
> -	/* sPAPR requires window size to be a power of 2 */
> -	create.window_size = rte_align64pow2(param.window_size);
> -	create.page_shift = __builtin_ctzll(param.hugepage_sz);
> -	create.levels = 1;
> -
> -	if (vfio_spapr_create_new_dma_window(vfio_container_fd, &create) < 0) {
> -		RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
> +	if (vfio_spapr_create_dma_window(vfio_container_fd) < 0) {
> +		RTE_LOG(ERR, EAL, "Could not create new DMA window!\n");
>   		return -1;
>   	}
> 
> -	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
> +	/* map all existing DPDK segments for DMA */
>   	if (rte_memseg_walk(vfio_spapr_map_walk, &vfio_container_fd) < 0)
>   		return -1;
> 
* Re: [dpdk-dev] [PATCH v3 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-08-10 21:07     ` [dpdk-dev] [PATCH v3 1/1] vfio: modify spapr iommu support to use static window sizing David Christensen
  2020-09-03 18:55       ` David Christensen
@ 2020-09-17 11:13       ` Burakov, Anatoly
  2020-10-07 12:49         ` Thomas Monjalon
  2020-10-07 17:44         ` David Christensen
  1 sibling, 2 replies; 48+ messages in thread
From: Burakov, Anatoly @ 2020-09-17 11:13 UTC (permalink / raw)
  To: David Christensen, david.marchand; +Cc: dev
On 10-Aug-20 10:07 PM, David Christensen wrote:
> The SPAPR IOMMU requires that a DMA window size be defined before memory
> can be mapped for DMA. Current code dynamically modifies the DMA window
> size in response to every new memory allocation which is potentially
> dangerous because all existing mappings need to be unmapped/remapped in
> order to resize the DMA window, leaving hardware holding IOVA addresses
> that are temporarily unmapped.  The new SPAPR code statically assigns
> the DMA window size on first use, using the largest physical memory
> address when IOVA=PA and the highest existing memseg virtual
> address when IOVA=VA.
> 
> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
> ---
<snip>
> +struct spapr_size_walk_param {
> +	uint64_t max_va;
> +	uint64_t page_sz;
> +	int external;
> +};
> +
> +/*
> + * In order to set the DMA window size required for the SPAPR IOMMU
> + * we need to walk the existing virtual memory allocations as well as
> + * find the hugepage size used.
> + */
>   static int
> -vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
> -		const struct rte_memseg *ms, void *arg)
> +vfio_spapr_size_walk(const struct rte_memseg_list *msl, void *arg)
>   {
> -	int *vfio_container_fd = arg;
> +	struct spapr_size_walk_param *param = arg;
> +	uint64_t max = (uint64_t) msl->base_va + (uint64_t) msl->len;
>   
> -	/* skip external memory that isn't a heap */
> -	if (msl->external && !msl->heap)
> -		return 0;
> +	if (msl->external) {
> +		param->external++;
> +		if (!msl->heap)
> +			return 0;
> +	}
It would be nice to have some comments in the code explaining what we're 
skipping and why.
Also, seems that you're using param->external as bool? This is a 
non-public API so using stdbool is not an issue here, perhaps replace it 
with bool param->has_external?
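A minimal sketch of what that could look like, with the skip logic commented and the counter replaced by a bool. The stand-in struct memseg_list_stub and the name spapr_size_walk_sketch are hypothetical, used only so the fragment compiles outside DPDK; the real callback would take a const struct rte_memseg_list * as in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the rte_memseg_list fields used below. */
struct memseg_list_stub {
	uint64_t base_va;
	uint64_t len;
	uint64_t page_sz;
	bool external;
	bool heap;
};

struct spapr_size_walk_param {
	uint64_t max_va;
	uint64_t page_sz;
	bool has_external;
};

static int
spapr_size_walk_sketch(const struct memseg_list_stub *msl, void *arg)
{
	struct spapr_size_walk_param *param = arg;
	uint64_t max = msl->base_va + msl->len;

	if (msl->external) {
		/* remember that external memory exists so the caller can warn */
		param->has_external = true;
		/* skip external memory that isn't a heap (same filter as
		 * the walk callbacks this patch removes)
		 */
		if (!msl->heap)
			return 0;
	}

	/* track the highest VA and the page size of the list that owns it */
	if (max > param->max_va) {
		param->page_sz = msl->page_sz;
		param->max_va = max;
	}

	return 0;
}
```

The caller would then test param.has_external instead of param.external > 0 before logging the warning.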
>   
> -	/* skip any segments with invalid IOVA addresses */
> -	if (ms->iova == RTE_BAD_IOVA)
> -		return 0;
> +	if (max > param->max_va) {
> +		param->page_sz = msl->page_sz;
> +		param->max_va = max;
> +	}
>   
> -	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
> -			ms->len, 0);
> +	return 0;
>   }
>   
> -struct spapr_walk_param {
> -	uint64_t window_size;
> -	uint64_t hugepage_sz;
> -};
> -
> +/*
> + * The SPAPRv2 IOMMU supports 2 DMA windows with starting
> + * address at 0 or 1<<59.  By default, a DMA window is set
> + * at address 0, 2GB long, with a 4KB page.  For DPDK we
> + * must remove the default window and setup a new DMA window
> + * based on the hugepage size and memory requirements of
> + * the application before we can map memory for DMA.
> + */
>   static int
> -vfio_spapr_window_size_walk(const struct rte_memseg_list *msl,
> -		const struct rte_memseg *ms, void *arg)
> +spapr_dma_win_size(void)
>   {
> -	struct spapr_walk_param *param = arg;
> -	uint64_t max = ms->iova + ms->len;
> +	struct spapr_size_walk_param param;
>   
> -	/* skip external memory that isn't a heap */
> -	if (msl->external && !msl->heap)
> +	/* only create DMA window once */
> +	if (spapr_dma_win_len > 0)
>   		return 0;
>   
> -	/* skip any segments with invalid IOVA addresses */
> -	if (ms->iova == RTE_BAD_IOVA)
> -		return 0;
> +	/* walk the memseg list to find the page size/max VA address */
> +	memset(&param, 0, sizeof(param));
> +	if (rte_memseg_list_walk(vfio_spapr_size_walk, &param) < 0) {
> +		RTE_LOG(ERR, EAL, "Failed to walk memseg list for DMA "
> +			"window size\n");
> +		return -1;
> +	}
> +
> +	/* We can't be sure if DMA window covers external memory */
> +	if (param.external > 0)
> +		RTE_LOG(WARNING, EAL, "Detected external memory which may "
> +			"not be managed by the IOMMU\n");
> +
> +	/* find the maximum IOVA address for setting the DMA window size */
> +	if (rte_eal_iova_mode() == RTE_IOVA_PA) {
> +		static const char proc_iomem[] = "/proc/iomem";
> +		static const char str_sysram[] = "System RAM";
> +		uint64_t start, end, max = 0;
> +		char *line = NULL;
> +		char *dash, *space;
> +		size_t line_len;
> +
> +		/*
> +		 * Example "System RAM" in /proc/iomem:
> +		 * 00000000-1fffffffff : System RAM
> +		 * 200000000000-201fffffffff : System RAM
> +		 */
> +		FILE *fd = fopen(proc_iomem, "r");
> +		if (fd == NULL) {
> +			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
> +			return -1;
> +		}
> +		/* Scan /proc/iomem for the highest PA in the system */
> +		while (getline(&line, &line_len, fd) != -1) {
> +			if (strstr(line, str_sysram) == NULL)
> +				continue;
> +
> +			space = strstr(line, " ");
> +			dash = strstr(line, "-");
> +
> +			/* Validate the format of the memory string */
> +			if (space == NULL || dash == NULL || space < dash) {
> +				RTE_LOG(ERR, EAL, "Can't parse line \"%s\" in "
> +					"file %s\n", line, proc_iomem);
> +				continue;
> +			}
> +
> +			start = strtoull(line, NULL, 16);
> +			end   = strtoull(dash + 1, NULL, 16);
> +			RTE_LOG(DEBUG, EAL, "Found system RAM from 0x%"
> +				PRIx64 " to 0x%" PRIx64 "\n", start, end);
> +			if (end > max)
> +				max = end;
> +		}
> +		free(line);
> +		fclose(fd);
I would've put all of this file reading business into a separate 
function, as otherwise it's a bit hard to follow the mix of file ops and 
using the results. Something like
value = get_value_from_iomem();
if (value > ...)
...
is much easier on the eyes :)
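As a rough illustration of that split, here is a sketch of such a helper. The name spapr_max_sysram_addr and its FILE * parameter are hypothetical and not part of the patch, and it reads lines with fgets() rather than getline() purely to keep the fragment portable; the parsing rules otherwise follow the loop quoted above.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: scan an open /proc/iomem style stream and return
 * the highest "System RAM" end address, or 0 if no valid entry is found.
 */
static uint64_t
spapr_max_sysram_addr(FILE *fp)
{
	static const char str_sysram[] = "System RAM";
	uint64_t max = 0;
	char line[256];

	while (fgets(line, sizeof(line), fp) != NULL) {
		char *dash, *space;
		uint64_t end;

		if (strstr(line, str_sysram) == NULL)
			continue;

		space = strchr(line, ' ');
		dash = strchr(line, '-');

		/* expect "<start>-<end> : System RAM"; skip anything else */
		if (space == NULL || dash == NULL || space < dash)
			continue;

		end = strtoull(dash + 1, NULL, 16);
		if (end > max)
			max = end;
	}

	return max;
}
```

With this in place, the IOVA=PA branch would reduce to opening the file, calling the helper, checking for a zero result, and passing max + 1 to rte_align64pow2() as the patch already does.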
>   
> -	if (max > param->window_size) {
> -		param->hugepage_sz = ms->hugepage_sz;
> -		param->window_size = max;
> +		if (max == 0) {
> +			RTE_LOG(ERR, EAL, "Failed to find valid \"System RAM\" "
> +				"entry in file %s\n", proc_iomem);
> +			return -1;
> +		}
> +
> +		spapr_dma_win_len = rte_align64pow2(max + 1);
> +		RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%"
> +			PRIx64 "\n", spapr_dma_win_len);
> +	} else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
> +		RTE_LOG(DEBUG, EAL, "Highest VA address in memseg list is 0x%"
> +			PRIx64 "\n", param.max_va);
> +		spapr_dma_win_len = rte_align64pow2(param.max_va);
> +		RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%"
> +			PRIx64 "\n", spapr_dma_win_len);
> +	} else {
> +		RTE_LOG(ERR, EAL, "Unsupported IOVA mode\n");
> +		return -1;
>   	}
>   
> +	spapr_dma_win_page_sz = param.page_sz;
> +	rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
>   	return 0;
>   }
>   
>   static int
> -vfio_spapr_create_new_dma_window(int vfio_container_fd,
> -		struct vfio_iommu_spapr_tce_create *create) {
> +vfio_spapr_create_dma_window(int vfio_container_fd)
> +{
> +	struct vfio_iommu_spapr_tce_create create = {
> +		.argsz = sizeof(create), };
>   	struct vfio_iommu_spapr_tce_remove remove = {
> -		.argsz = sizeof(remove),
> -	};
> +		.argsz = sizeof(remove), };
>   	struct vfio_iommu_spapr_tce_info info = {
> -		.argsz = sizeof(info),
> -	};
> +		.argsz = sizeof(info), };
>   	int ret;
>   
> -	/* query spapr iommu info */
> +	ret = spapr_dma_win_size();
> +	if (ret < 0)
> +		return ret;
> +
>   	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>   	if (ret) {
> -		RTE_LOG(ERR, EAL, "  cannot get iommu info, "
> -				"error %i (%s)\n", errno, strerror(errno));
Here and in other similar places, no need to split strings into multiline.
Overall, since these changes are confined to PPC64 i can't really test 
these, but with the above changes:
Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
-- 
Thanks,
Anatoly
* Re: [dpdk-dev] [PATCH v3 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-09-17 11:13       ` Burakov, Anatoly
@ 2020-10-07 12:49         ` Thomas Monjalon
  2020-10-07 17:44         ` David Christensen
  1 sibling, 0 replies; 48+ messages in thread
From: Thomas Monjalon @ 2020-10-07 12:49 UTC (permalink / raw)
  To: David Christensen; +Cc: david.marchand, dev, Burakov, Anatoly
Hi David,
Do you plan to send a v4?
17/09/2020 13:13, Burakov, Anatoly:
> On 10-Aug-20 10:07 PM, David Christensen wrote:
> > The SPAPR IOMMU requires that a DMA window size be defined before memory
> > can be mapped for DMA. Current code dynamically modifies the DMA window
> > size in response to every new memory allocation which is potentially
> > dangerous because all existing mappings need to be unmapped/remapped in
> > order to resize the DMA window, leaving hardware holding IOVA addresses
> > that are temporarily unmapped.  The new SPAPR code statically assigns
> > the DMA window size on first use, using the largest physical memory
> > address when IOVA=PA and the highest existing memseg virtual
> > address when IOVA=VA.
> > 
> > Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
> > ---
> 
> <snip>
> 
> > +struct spapr_size_walk_param {
> > +	uint64_t max_va;
> > +	uint64_t page_sz;
> > +	int external;
> > +};
> > +
> > +/*
> > + * In order to set the DMA window size required for the SPAPR IOMMU
> > + * we need to walk the existing virtual memory allocations as well as
> > + * find the hugepage size used.
> > + */
> >   static int
> > -vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
> > -		const struct rte_memseg *ms, void *arg)
> > +vfio_spapr_size_walk(const struct rte_memseg_list *msl, void *arg)
> >   {
> > -	int *vfio_container_fd = arg;
> > +	struct spapr_size_walk_param *param = arg;
> > +	uint64_t max = (uint64_t) msl->base_va + (uint64_t) msl->len;
> >   
> > -	/* skip external memory that isn't a heap */
> > -	if (msl->external && !msl->heap)
> > -		return 0;
> > +	if (msl->external) {
> > +		param->external++;
> > +		if (!msl->heap)
> > +			return 0;
> > +	}
> 
> It would be nice to have some comments in the code explaining what we're 
> skipping and why.
> 
> Also, seems that you're using param->external as bool? This is a 
> non-public API so using stdbool is not an issue here, perhaps replace it 
> with bool param->has_external?
> 
> >   
> > -	/* skip any segments with invalid IOVA addresses */
> > -	if (ms->iova == RTE_BAD_IOVA)
> > -		return 0;
> > +	if (max > param->max_va) {
> > +		param->page_sz = msl->page_sz;
> > +		param->max_va = max;
> > +	}
> >   
> > -	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
> > -			ms->len, 0);
> > +	return 0;
> >   }
> >   
> > -struct spapr_walk_param {
> > -	uint64_t window_size;
> > -	uint64_t hugepage_sz;
> > -};
> > -
> > +/*
> > + * The SPAPRv2 IOMMU supports 2 DMA windows with starting
> > + * address at 0 or 1<<59.  By default, a DMA window is set
> > + * at address 0, 2GB long, with a 4KB page.  For DPDK we
> > + * must remove the default window and setup a new DMA window
> > + * based on the hugepage size and memory requirements of
> > + * the application before we can map memory for DMA.
> > + */
> >   static int
> > -vfio_spapr_window_size_walk(const struct rte_memseg_list *msl,
> > -		const struct rte_memseg *ms, void *arg)
> > +spapr_dma_win_size(void)
> >   {
> > -	struct spapr_walk_param *param = arg;
> > -	uint64_t max = ms->iova + ms->len;
> > +	struct spapr_size_walk_param param;
> >   
> > -	/* skip external memory that isn't a heap */
> > -	if (msl->external && !msl->heap)
> > +	/* only create DMA window once */
> > +	if (spapr_dma_win_len > 0)
> >   		return 0;
> >   
> > -	/* skip any segments with invalid IOVA addresses */
> > -	if (ms->iova == RTE_BAD_IOVA)
> > -		return 0;
> > +	/* walk the memseg list to find the page size/max VA address */
> > +	memset(&param, 0, sizeof(param));
> > +	if (rte_memseg_list_walk(vfio_spapr_size_walk, &param) < 0) {
> > +		RTE_LOG(ERR, EAL, "Failed to walk memseg list for DMA "
> > +			"window size\n");
> > +		return -1;
> > +	}
> > +
> > +	/* We can't be sure if DMA window covers external memory */
> > +	if (param.external > 0)
> > +		RTE_LOG(WARNING, EAL, "Detected external memory which may "
> > +			"not be managed by the IOMMU\n");
> > +
> > +	/* find the maximum IOVA address for setting the DMA window size */
> > +	if (rte_eal_iova_mode() == RTE_IOVA_PA) {
> > +		static const char proc_iomem[] = "/proc/iomem";
> > +		static const char str_sysram[] = "System RAM";
> > +		uint64_t start, end, max = 0;
> > +		char *line = NULL;
> > +		char *dash, *space;
> > +		size_t line_len;
> > +
> > +		/*
> > +		 * Example "System RAM" in /proc/iomem:
> > +		 * 00000000-1fffffffff : System RAM
> > +		 * 200000000000-201fffffffff : System RAM
> > +		 */
> > +		FILE *fd = fopen(proc_iomem, "r");
> > +		if (fd == NULL) {
> > +			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
> > +			return -1;
> > +		}
> > +		/* Scan /proc/iomem for the highest PA in the system */
> > +		while (getline(&line, &line_len, fd) != -1) {
> > +			if (strstr(line, str_sysram) == NULL)
> > +				continue;
> > +
> > +			space = strstr(line, " ");
> > +			dash = strstr(line, "-");
> > +
> > +			/* Validate the format of the memory string */
> > +			if (space == NULL || dash == NULL || space < dash) {
> > +				RTE_LOG(ERR, EAL, "Can't parse line \"%s\" in "
> > +					"file %s\n", line, proc_iomem);
> > +				continue;
> > +			}
> > +
> > +			start = strtoull(line, NULL, 16);
> > +			end   = strtoull(dash + 1, NULL, 16);
> > +			RTE_LOG(DEBUG, EAL, "Found system RAM from 0x%"
> > +				PRIx64 " to 0x%" PRIx64 "\n", start, end);
> > +			if (end > max)
> > +				max = end;
> > +		}
> > +		free(line);
> > +		fclose(fd);
> 
> I would've put all of this file reading business into a separate 
> function, as otherwise it's a bit hard to follow the mix of file ops and 
> using the results. Something like
> 
> value = get_value_from_iomem();
> if (value > ...)
> ...
> 
> is much easier on the eyes :)
> 
> >   
> > -	if (max > param->window_size) {
> > -		param->hugepage_sz = ms->hugepage_sz;
> > -		param->window_size = max;
> > +		if (max == 0) {
> > +			RTE_LOG(ERR, EAL, "Failed to find valid \"System RAM\" "
> > +				"entry in file %s\n", proc_iomem);
> > +			return -1;
> > +		}
> > +
> > +		spapr_dma_win_len = rte_align64pow2(max + 1);
> > +		RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%"
> > +			PRIx64 "\n", spapr_dma_win_len);
> > +	} else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
> > +		RTE_LOG(DEBUG, EAL, "Highest VA address in memseg list is 0x%"
> > +			PRIx64 "\n", param.max_va);
> > +		spapr_dma_win_len = rte_align64pow2(param.max_va);
> > +		RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%"
> > +			PRIx64 "\n", spapr_dma_win_len);
> > +	} else {
> > +		RTE_LOG(ERR, EAL, "Unsupported IOVA mode\n");
> > +		return -1;
> >   	}
> >   
> > +	spapr_dma_win_page_sz = param.page_sz;
> > +	rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
> >   	return 0;
> >   }
> >   
> >   static int
> > -vfio_spapr_create_new_dma_window(int vfio_container_fd,
> > -		struct vfio_iommu_spapr_tce_create *create) {
> > +vfio_spapr_create_dma_window(int vfio_container_fd)
> > +{
> > +	struct vfio_iommu_spapr_tce_create create = {
> > +		.argsz = sizeof(create), };
> >   	struct vfio_iommu_spapr_tce_remove remove = {
> > -		.argsz = sizeof(remove),
> > -	};
> > +		.argsz = sizeof(remove), };
> >   	struct vfio_iommu_spapr_tce_info info = {
> > -		.argsz = sizeof(info),
> > -	};
> > +		.argsz = sizeof(info), };
> >   	int ret;
> >   
> > -	/* query spapr iommu info */
> > +	ret = spapr_dma_win_size();
> > +	if (ret < 0)
> > +		return ret;
> > +
> >   	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
> >   	if (ret) {
> > -		RTE_LOG(ERR, EAL, "  cannot get iommu info, "
> > -				"error %i (%s)\n", errno, strerror(errno));
> 
> Here and in other similar places, no need to split strings into multiline.
> 
> Overall, since these changes are confined to PPC64 i can't really test 
> these, but with the above changes:
> 
> Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> 
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v3 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-09-17 11:13       ` Burakov, Anatoly
  2020-10-07 12:49         ` Thomas Monjalon
@ 2020-10-07 17:44         ` David Christensen
  2020-10-08  9:39           ` Burakov, Anatoly
  1 sibling, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-10-07 17:44 UTC (permalink / raw)
  To: Burakov, Anatoly, david.marchand; +Cc: dev
On 9/17/20 4:13 AM, Burakov, Anatoly wrote:
> On 10-Aug-20 10:07 PM, David Christensen wrote:
>> The SPAPR IOMMU requires that a DMA window size be defined before memory
>> can be mapped for DMA. Current code dynamically modifies the DMA window
>> size in response to every new memory allocation which is potentially
>> dangerous because all existing mappings need to be unmapped/remapped in
>> order to resize the DMA window, leaving hardware holding IOVA addresses
>> that are temporarily unmapped.  The new SPAPR code statically assigns
>> the DMA window size on first use, using the largest physical memory
>> address when IOVA=PA and the highest existing memseg virtual
>> address when IOVA=VA.
>>
>> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
>> ---
> 
> <snip>
> 
>> +struct spapr_size_walk_param {
>> +    uint64_t max_va;
>> +    uint64_t page_sz;
>> +    int external;
>> +};
>> +
>> +/*
>> + * In order to set the DMA window size required for the SPAPR IOMMU
>> + * we need to walk the existing virtual memory allocations as well as
>> + * find the hugepage size used.
>> + */
>>   static int
>> -vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
>> -        const struct rte_memseg *ms, void *arg)
>> +vfio_spapr_size_walk(const struct rte_memseg_list *msl, void *arg)
>>   {
>> -    int *vfio_container_fd = arg;
>> +    struct spapr_size_walk_param *param = arg;
>> +    uint64_t max = (uint64_t) msl->base_va + (uint64_t) msl->len;
>> -    /* skip external memory that isn't a heap */
>> -    if (msl->external && !msl->heap)
>> -        return 0;
>> +    if (msl->external) {
>> +        param->external++;
>> +        if (!msl->heap)
>> +            return 0;
>> +    }
> 
> It would be nice to have some comments in the code explaining what we're 
> skipping and why.
Reviewing this again, my inclination is to skip ALL external memory, 
which by definition would seem to be outside of IOMMU control, so the 
code would read:
    if (msl->external) {
        param->external++;
        return 0;
    }
Not sure why existing code such as vfio_spapr_map_walk() distinguishes 
between heap and non-heap in this situation.  Are there instances in x86 
where it would matter?
> Also, seems that you're using param->external as bool? This is a 
> non-public API so using stdbool is not an issue here, perhaps replace it 
> with bool param->has_external?
Why do you think the distinction is necessary?
Dave
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v3 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-10-07 17:44         ` David Christensen
@ 2020-10-08  9:39           ` Burakov, Anatoly
  2020-10-12 19:19             ` David Christensen
  0 siblings, 1 reply; 48+ messages in thread
From: Burakov, Anatoly @ 2020-10-08  9:39 UTC (permalink / raw)
  To: David Christensen, david.marchand; +Cc: dev
On 07-Oct-20 6:44 PM, David Christensen wrote:
> 
> 
> On 9/17/20 4:13 AM, Burakov, Anatoly wrote:
>> On 10-Aug-20 10:07 PM, David Christensen wrote:
>>> The SPAPR IOMMU requires that a DMA window size be defined before memory
>>> can be mapped for DMA. Current code dynamically modifies the DMA window
>>> size in response to every new memory allocation which is potentially
>>> dangerous because all existing mappings need to be unmapped/remapped in
>>> order to resize the DMA window, leaving hardware holding IOVA addresses
>>> that are temporarily unmapped.  The new SPAPR code statically assigns
>>> the DMA window size on first use, using the largest physical memory
>>> address when IOVA=PA and the highest existing memseg virtual
>>> address when IOVA=VA.
>>>
>>> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
>>> ---
>>
>> <snip>
>>
>>> +struct spapr_size_walk_param {
>>> +    uint64_t max_va;
>>> +    uint64_t page_sz;
>>> +    int external;
>>> +};
>>> +
>>> +/*
>>> + * In order to set the DMA window size required for the SPAPR IOMMU
>>> + * we need to walk the existing virtual memory allocations as well as
>>> + * find the hugepage size used.
>>> + */
>>>   static int
>>> -vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
>>> -        const struct rte_memseg *ms, void *arg)
>>> +vfio_spapr_size_walk(const struct rte_memseg_list *msl, void *arg)
>>>   {
>>> -    int *vfio_container_fd = arg;
>>> +    struct spapr_size_walk_param *param = arg;
>>> +    uint64_t max = (uint64_t) msl->base_va + (uint64_t) msl->len;
>>> -    /* skip external memory that isn't a heap */
>>> -    if (msl->external && !msl->heap)
>>> -        return 0;
>>> +    if (msl->external) {
>>> +        param->external++;
>>> +        if (!msl->heap)
>>> +            return 0;
>>> +    }
>>
>> It would be nice to have some comments in the code explaining what 
>> we're skipping and why.
> 
> Reviewing this again, my inclination is to skip ALL external memory, 
> which by definition would seem to be outside of IOMMU control, so the 
> code would read:
> 
>     if (msl->external) {
>         param->external++;
>         return 0;
>     }
The external memory can still be mapped for DMA with rte_dev_dma_map() 
API. The heap memory is meant to be mapped automatically by DPDK, while 
the non-heap memory (created with rte_extmem_register() API) is meant to 
be managed by the user and will be mapped using the user_mem_map 
functions in this file.
> 
> Not sure why existing code such as vfio_spapr_map_walk() distinguishes 
> between heap and non-heap in this situation.  Are there instances in x86 
> where it would matter?
> 
>> Also, seems that you're using param->external as bool? This is a 
>> non-public API so using stdbool is not an issue here, perhaps replace 
>> it with bool param->has_external?
> 
> Why do you think the distinction is necessary?
> 
It's not *necessary*, i just don't like the ancient C style where ints 
are used as booleans :D Not a serious issue though, your choice.
> Dave
-- 
Thanks,
Anatoly
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v3 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-10-08  9:39           ` Burakov, Anatoly
@ 2020-10-12 19:19             ` David Christensen
  2020-10-14  9:27               ` Burakov, Anatoly
  0 siblings, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-10-12 19:19 UTC (permalink / raw)
  To: Burakov, Anatoly, david.marchand; +Cc: dev
>>>> -vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
>>>> -        const struct rte_memseg *ms, void *arg)
>>>> +vfio_spapr_size_walk(const struct rte_memseg_list *msl, void *arg)
>>>>   {
>>>> -    int *vfio_container_fd = arg;
>>>> +    struct spapr_size_walk_param *param = arg;
>>>> +    uint64_t max = (uint64_t) msl->base_va + (uint64_t) msl->len;
>>>> -    /* skip external memory that isn't a heap */
>>>> -    if (msl->external && !msl->heap)
>>>> -        return 0;
>>>> +    if (msl->external) {
>>>> +        param->external++;
>>>> +        if (!msl->heap)
>>>> +            return 0;
>>>> +    }
>>>
>>> It would be nice to have some comments in the code explaining what 
>>> we're skipping and why.
>>
>> Reviewing this again, my inclination is to skip ALL external memory, 
>> which by definition would seem to be outside of IOMMU control, so the 
>> code would read:
>>
>>     if (msl->external) {
>>         param->external++;
>>         return 0;
>>     }
> 
> The external memory can still be mapped for DMA with rte_dev_dma_map() 
> API. The heap memory is meant to be mapped automatically by DPDK, while 
> the non-heap memory (created with rte_extmem_register() API) is meant to 
> be managed by the user and will be mapped using the user_mem_map 
> functions in this file.
So for my purpose of identifying the memory range qualified for IOMMU 
protection, are you saying that external memory in the heap should be 
included in the DMA window calculation?  Like this:
         if (msl->external && !msl->heap) {
                 /* ignore user managed external memory */
                 param->is_user_managed = true;
                 return 0;
         }
Dave
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v3 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-10-12 19:19             ` David Christensen
@ 2020-10-14  9:27               ` Burakov, Anatoly
  0 siblings, 0 replies; 48+ messages in thread
From: Burakov, Anatoly @ 2020-10-14  9:27 UTC (permalink / raw)
  To: David Christensen, david.marchand; +Cc: dev
On 12-Oct-20 8:19 PM, David Christensen wrote:
>>>>> -vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
>>>>> -        const struct rte_memseg *ms, void *arg)
>>>>> +vfio_spapr_size_walk(const struct rte_memseg_list *msl, void *arg)
>>>>>   {
>>>>> -    int *vfio_container_fd = arg;
>>>>> +    struct spapr_size_walk_param *param = arg;
>>>>> +    uint64_t max = (uint64_t) msl->base_va + (uint64_t) msl->len;
>>>>> -    /* skip external memory that isn't a heap */
>>>>> -    if (msl->external && !msl->heap)
>>>>> -        return 0;
>>>>> +    if (msl->external) {
>>>>> +        param->external++;
>>>>> +        if (!msl->heap)
>>>>> +            return 0;
>>>>> +    }
>>>>
>>>> It would be nice to have some comments in the code explaining what 
>>>> we're skipping and why.
>>>
>>> Reviewing this again, my inclination is to skip ALL external memory, 
>>> which by definition would seem to be outside of IOMMU control, so the 
>>> code would read:
>>>
>>>     if (msl->external) {
>>>         param->external++;
>>>         return 0;
>>>     }
>>
>> The external memory can still be mapped for DMA with rte_dev_dma_map() 
>> API. The heap memory is meant to be mapped automatically by DPDK, 
>> while the non-heap memory (created with rte_extmem_register() API) is 
>> meant to be managed by the user and will be mapped using the 
>> user_mem_map functions in this file.
> 
> So for my purpose of identifying the memory range qualified for IOMMU 
> protection, are you saying that external memory in the heap should be 
> included in the DMA window calculation?  Like this:
> 
>          if (msl->external && !msl->heap) {
>                  /* ignore user managed external memory */
>                  param->is_user_managed = true;
>                  return 0;
>          }
> 
> Dave
I would say so, yes. I would also double check the user mem map path to 
see if it makes sense with these changes, and correctly calculates the 
new window, should it be needed.
-- 
Thanks,
Anatoly
^ permalink raw reply	[flat|nested] 48+ messages in thread
* [dpdk-dev] [PATCH v4 0/1] vfio: change spapr DMA window sizing operation
  2020-08-10 21:07   ` [dpdk-dev] [PATCH v3 0/1] vfio: change spapr DMA window sizing operation David Christensen
  2020-08-10 21:07     ` [dpdk-dev] [PATCH v3 1/1] vfio: modify spapr iommu support to use static window sizing David Christensen
@ 2020-10-15 17:23     ` David Christensen
  2020-10-15 17:23       ` [dpdk-dev] [PATCH v4 1/1] vfio: modify spapr iommu support to use static window sizing David Christensen
  2020-11-03 22:05       ` [dpdk-dev] [PATCH v5 0/1] " David Christensen
  1 sibling, 2 replies; 48+ messages in thread
From: David Christensen @ 2020-10-15 17:23 UTC (permalink / raw)
  To: dev, anatoly.burakov, david.marchand; +Cc: David Christensen
The SPAPR v2 IOMMU used on bare-metal PowerNV systems requires that a DMA
window be defined before mapping/unmapping memory.  The current VFIO code
dynamically resizes this DMA window every time a new memory request is
made, which requires that all existing memory be unmapped/remapped.
While this strategy worked in DPDK 17.11 and earlier where memory was
statically allocated during startup, it is potentially dangerous in DPDK
18.11 and later where memory can be allocated during runtime, temporarily
invalidating IOVA memory used by hardware.
This new code statically sizes the DMA window at startup, based on the
amount of memory installed in the system, avoiding the need to unmap
memory during runtime.
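
As a rough illustration of the sizing arithmetic, the sketch below rounds
the highest address up to a power of two and derives the DMA mask width
from it. The patch itself relies on rte_align64pow2() and
__builtin_ctzll(); align64pow2(), dma_window_len() and dma_mask_bits() here
are stand-alone, hypothetical helpers written only for this example, and
__builtin_ctzll() assumes a GCC or Clang compatible compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to the next power of two (v itself if it already is one). */
static uint64_t
align64pow2(uint64_t v)
{
	/* 0 and values above 2^63 have no representable result */
	assert(v != 0 && v <= (UINT64_C(1) << 63));

	v--;
	v |= v >> 1;
	v |= v >> 2;
	v |= v >> 4;
	v |= v >> 8;
	v |= v >> 16;
	v |= v >> 32;
	return v + 1;
}

/* Window length covering every address up to and including highest_addr. */
static uint64_t
dma_window_len(uint64_t highest_addr)
{
	return align64pow2(highest_addr + 1);
}

/* Number of address bits for a power-of-two window length. */
static unsigned int
dma_mask_bits(uint64_t window_len)
{
	return (unsigned int)__builtin_ctzll(window_len);
}
```

With a highest physical address of 0x201fffffffff, this yields a window
length of 0x400000000000 and a 46-bit DMA mask.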
Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
v4:
- Move file reading code out of vfio_spapr_window_size_walk()
v3:
- Rebase for 20.08
v2:
- Drop patch to wrap ppc64 code with ifdef's
- Add warning when external memory detected
- Change VA memory size detection to scan memseg list when setting DMA window
  for IOVA=VA
- Add explicit error message when attempting to map outside the DMA window
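
For reference, the /proc/iomem scan that v4 moves into its own function can
be sketched in isolation as below. This is a simplified illustration that
works on in-memory strings rather than the file; parse_sysram_line() and
highest_sysram_addr() are hypothetical names and not the functions added by
the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one /proc/iomem style line of the form "start-end : System RAM".
 * Returns 1 and stores the range on success, 0 if the line is not a
 * "System RAM" entry or does not have the expected layout.
 */
static int
parse_sysram_line(const char *line, uint64_t *start, uint64_t *end)
{
	const char *dash, *space;

	assert(line != NULL && start != NULL && end != NULL);

	if (strstr(line, "System RAM") == NULL)
		return 0;

	dash = strchr(line, '-');
	space = strchr(line, ' ');

	/* the dash separating the two addresses must precede the first space */
	if (dash == NULL || space == NULL || space < dash)
		return 0;

	*start = strtoull(line, NULL, 16);
	*end = strtoull(dash + 1, NULL, 16);
	return 1;
}

/* Return the highest "System RAM" end address found, or 0 if none. */
static uint64_t
highest_sysram_addr(const char *const *lines, size_t n)
{
	uint64_t start, end, max = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (parse_sysram_line(lines[i], &start, &end) && end > max)
			max = end;
	}
	return max;
}
```

A return value of 0 doubles as the "nothing found" indication, so a caller
only needs a single check before sizing the window.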
David Christensen (1):
  vfio: modify spapr iommu support to use static window sizing
 lib/librte_eal/linux/eal_vfio.c | 412 +++++++++++++++-----------------
 1 file changed, 188 insertions(+), 224 deletions(-)
-- 
2.18.4
^ permalink raw reply	[flat|nested] 48+ messages in thread
* [dpdk-dev] [PATCH v4 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-10-15 17:23     ` [dpdk-dev] [PATCH v4 0/1] vfio: change spapr DMA window sizing operation David Christensen
@ 2020-10-15 17:23       ` David Christensen
  2020-10-20 12:05         ` Thomas Monjalon
  2020-11-02 11:04         ` Burakov, Anatoly
  2020-11-03 22:05       ` [dpdk-dev] [PATCH v5 0/1] " David Christensen
  1 sibling, 2 replies; 48+ messages in thread
From: David Christensen @ 2020-10-15 17:23 UTC (permalink / raw)
  To: dev, anatoly.burakov, david.marchand; +Cc: David Christensen
The SPAPR IOMMU requires that a DMA window size be defined before memory
can be mapped for DMA. Current code dynamically modifies the DMA window
size in response to every new memory allocation which is potentially
dangerous because all existing mappings need to be unmapped/remapped in
order to resize the DMA window, leaving hardware holding IOVA addresses
that are temporarily unmapped.  The new SPAPR code statically assigns
the DMA window size on first use, using the largest physical memory
address when IOVA=PA and the highest existing memseg virtual
address when IOVA=VA.
Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
---
 lib/librte_eal/linux/eal_vfio.c | 412 +++++++++++++++-----------------
 1 file changed, 188 insertions(+), 224 deletions(-)
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 380f2f44a..dfb1125f3 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -18,6 +18,7 @@
 #include "eal_memcfg.h"
 #include "eal_vfio.h"
 #include "eal_private.h"
+#include "eal_internal_cfg.h"
 
 #ifdef VFIO_PRESENT
 
@@ -536,17 +537,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		return;
 	}
 
-#ifdef RTE_ARCH_PPC_64
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_free(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-	cur_len = 0;
-#endif
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
 
@@ -607,17 +597,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 						iova_expected - iova_start, 0);
 		}
 	}
-#ifdef RTE_ARCH_PPC_64
-	cur_len = 0;
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_used(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-#endif
 }
 
 static int
@@ -1436,21 +1415,30 @@ vfio_type1_dma_map(int vfio_container_fd)
 	return rte_memseg_walk(type1_map, &vfio_container_fd);
 }
 
+/* Track the size of the statically allocated DMA window for SPAPR */
+uint64_t spapr_dma_win_len;
+uint64_t spapr_dma_win_page_sz;
+
 static int
 vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		uint64_t len, int do_map)
 {
-	struct vfio_iommu_type1_dma_map dma_map;
-	struct vfio_iommu_type1_dma_unmap dma_unmap;
-	int ret;
 	struct vfio_iommu_spapr_register_memory reg = {
 		.argsz = sizeof(reg),
+		.vaddr = (uintptr_t) vaddr,
+		.size = len,
 		.flags = 0
 	};
-	reg.vaddr = (uintptr_t) vaddr;
-	reg.size = len;
+	int ret;
 
 	if (do_map != 0) {
+		struct vfio_iommu_type1_dma_map dma_map;
+
+		if (iova + len > spapr_dma_win_len) {
+			RTE_LOG(ERR, EAL, "  dma map attempt outside DMA window\n");
+			return -1;
+		}
+
 		ret = ioctl(vfio_container_fd,
 				VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
 		if (ret) {
@@ -1469,24 +1457,14 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
 		if (ret) {
-			/**
-			 * In case the mapping was already done EBUSY will be
-			 * returned from kernel.
-			 */
-			if (errno == EBUSY) {
-				RTE_LOG(DEBUG, EAL,
-					" Memory segment is already mapped,"
-					" skipping");
-			} else {
-				RTE_LOG(ERR, EAL,
-					"  cannot set up DMA remapping,"
-					" error %i (%s)\n", errno,
-					strerror(errno));
-				return -1;
-			}
+			RTE_LOG(ERR, EAL, "  cannot map vaddr for IOMMU, error %i (%s)\n",
+				errno, strerror(errno));
+			return -1;
 		}
 
 	} else {
+		struct vfio_iommu_type1_dma_map dma_unmap;
+
 		memset(&dma_unmap, 0, sizeof(dma_unmap));
 		dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
 		dma_unmap.size = len;
@@ -1495,8 +1473,8 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA,
 				&dma_unmap);
 		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
-					errno, strerror(errno));
+			RTE_LOG(ERR, EAL, "  cannot unmap vaddr for IOMMU, error %i (%s)\n",
+				errno, strerror(errno));
 			return -1;
 		}
 
@@ -1504,12 +1482,12 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 				VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, error %i (%s)\n",
-					errno, strerror(errno));
+				errno, strerror(errno));
 			return -1;
 		}
 	}
 
-	return 0;
+	return ret;
 }
 
 static int
@@ -1526,251 +1504,237 @@ vfio_spapr_map_walk(const struct rte_memseg_list *msl,
 	if (ms->iova == RTE_BAD_IOVA)
 		return 0;
 
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 1);
+	return vfio_spapr_dma_do_map(*vfio_container_fd,
+		ms->addr_64, ms->iova, ms->len, 1);
 }
 
+struct spapr_size_walk_param {
+	uint64_t max_va;
+	uint64_t page_sz;
+	bool is_user_managed;
+};
+
+/*
+ * In order to set the DMA window size required for the SPAPR IOMMU
+ * we need to walk the existing virtual memory allocations as well as
+ * find the hugepage size used.
+ */
 static int
-vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
-		const struct rte_memseg *ms, void *arg)
+vfio_spapr_size_walk(const struct rte_memseg_list *msl, void *arg)
 {
-	int *vfio_container_fd = arg;
+	struct spapr_size_walk_param *param = arg;
+	uint64_t max = (uint64_t) msl->base_va + (uint64_t) msl->len;
 
-	/* skip external memory that isn't a heap */
-	if (msl->external && !msl->heap)
+	if (msl->external && !msl->heap) {
+		/* ignore user managed external memory */
+		param->is_user_managed = true;
 		return 0;
+	}
 
-	/* skip any segments with invalid IOVA addresses */
-	if (ms->iova == RTE_BAD_IOVA)
-		return 0;
+	if (max > param->max_va) {
+		param->page_sz = msl->page_sz;
+		param->max_va = max;
+	}
 
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 0);
+	return 0;
 }
 
-struct spapr_walk_param {
-	uint64_t window_size;
-	uint64_t hugepage_sz;
-};
+static uint64_t
+get_highest_mem_addr(struct spapr_size_walk_param *param)
+{
+	/* find the maximum IOVA address for setting the DMA window size */
+	if (rte_eal_iova_mode() == RTE_IOVA_PA) {
+		static const char proc_iomem[] = "/proc/iomem";
+		static const char str_sysram[] = "System RAM";
+		uint64_t start, end, max = 0;
+		char *line = NULL;
+		char *dash, *space;
+		size_t line_len;
 
+		/*
+		 * Example "System RAM" in /proc/iomem:
+		 * 00000000-1fffffffff : System RAM
+		 * 200000000000-201fffffffff : System RAM
+		 */
+		FILE *fd = fopen(proc_iomem, "r");
+		if (fd == NULL) {
+			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
+			return -1;
+		}
+		/* Scan /proc/iomem for the highest PA in the system */
+		while (getline(&line, &line_len, fd) != -1) {
+			if (strstr(line, str_sysram) == NULL)
+				continue;
+
+			space = strstr(line, " ");
+			dash = strstr(line, "-");
+
+			/* Validate the format of the memory string */
+			if (space == NULL || dash == NULL || space < dash) {
+				RTE_LOG(ERR, EAL, "Can't parse line \"%s\" in file %s\n",
+					line, proc_iomem);
+				continue;
+			}
+
+			start = strtoull(line, NULL, 16);
+			end   = strtoull(dash + 1, NULL, 16);
+			RTE_LOG(DEBUG, EAL, "Found system RAM from 0x%" PRIx64
+				" to 0x%" PRIx64 "\n", start, end);
+			if (end > max)
+				max = end;
+		}
+		free(line);
+		fclose(fd);
+
+		if (max == 0) {
+			RTE_LOG(ERR, EAL, "Failed to find valid \"System RAM\" "
+				"entry in file %s\n", proc_iomem);
+			return -1;
+		}
+
+		return rte_align64pow2(max + 1);
+	} else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
+		RTE_LOG(DEBUG, EAL, "Highest VA address in memseg list is 0x%"
+			PRIx64 "\n", param->max_va);
+		return rte_align64pow2(param->max_va);
+	}
+
+	RTE_LOG(ERR, EAL, "Unsupported IOVA mode\n");
+	return 0;
+}
+
+
+/*
+ * The SPAPRv2 IOMMU supports 2 DMA windows with starting
+ * address at 0 or 1<<59.  By default, a DMA window is set
+ * at address 0, 2GB long, with a 4KB page.  For DPDK we
+ * must remove the default window and setup a new DMA window
+ * based on the hugepage size and memory requirements of
+ * the application before we can map memory for DMA.
+ */
 static int
-vfio_spapr_window_size_walk(const struct rte_memseg_list *msl,
-		const struct rte_memseg *ms, void *arg)
+spapr_dma_win_size(void)
 {
-	struct spapr_walk_param *param = arg;
-	uint64_t max = ms->iova + ms->len;
-
-	/* skip external memory that isn't a heap */
-	if (msl->external && !msl->heap)
-		return 0;
+	struct spapr_size_walk_param param;
 
-	/* skip any segments with invalid IOVA addresses */
-	if (ms->iova == RTE_BAD_IOVA)
+	/* only create DMA window once */
+	if (spapr_dma_win_len > 0)
 		return 0;
 
-	if (max > param->window_size) {
-		param->hugepage_sz = ms->hugepage_sz;
-		param->window_size = max;
+	/* walk the memseg list to find the page size/max VA address */
+	memset(&param, 0, sizeof(param));
+	if (rte_memseg_list_walk(vfio_spapr_size_walk, &param) < 0) {
+		RTE_LOG(ERR, EAL, "Failed to walk memseg list for DMA window size\n");
+		return -1;
 	}
 
+	/* We can't be sure if DMA window covers external memory */
+	if (param.is_user_managed)
+		RTE_LOG(WARNING, EAL, "Detected user managed external memory which may not be managed by the IOMMU\n");
+
+	spapr_dma_win_len = get_highest_mem_addr(&param);
+	if (spapr_dma_win_len == 0)
+		return -1;
+	RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%" PRIx64 "\n",
+		spapr_dma_win_len);
+	spapr_dma_win_page_sz = param.page_sz;
+	rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
 	return 0;
 }
 
 static int
-vfio_spapr_create_new_dma_window(int vfio_container_fd,
-		struct vfio_iommu_spapr_tce_create *create) {
+vfio_spapr_create_dma_window(int vfio_container_fd)
+{
+	struct vfio_iommu_spapr_tce_create create = {
+		.argsz = sizeof(create), };
 	struct vfio_iommu_spapr_tce_remove remove = {
-		.argsz = sizeof(remove),
-	};
+		.argsz = sizeof(remove), };
 	struct vfio_iommu_spapr_tce_info info = {
-		.argsz = sizeof(info),
-	};
+		.argsz = sizeof(info), };
 	int ret;
 
-	/* query spapr iommu info */
+	ret = spapr_dma_win_size();
+	if (ret < 0)
+		return ret;
+
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
 	if (ret) {
-		RTE_LOG(ERR, EAL, "  cannot get iommu info, "
-				"error %i (%s)\n", errno, strerror(errno));
+		RTE_LOG(ERR, EAL, "  can't get iommu info, error %i (%s)\n",
+			errno, strerror(errno));
 		return -1;
 	}
 
-	/* remove default DMA of 32 bit window */
+	/* remove default DMA window */
 	remove.start_addr = info.dma32_window_start;
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
-	if (ret) {
-		RTE_LOG(ERR, EAL, "  cannot remove default DMA window, "
-				"error %i (%s)\n", errno, strerror(errno));
+	if (ret)
 		return -1;
-	}
 
-	/* create new DMA window */
-	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, create);
+	/* create a new DMA window (start address is not selectable) */
+	create.window_size = spapr_dma_win_len;
+	create.page_shift  = __builtin_ctzll(spapr_dma_win_page_sz);
+	create.levels = 1;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 	if (ret) {
-#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
-		/* try possible page_shift and levels for workaround */
+		/* if at first we don't succeed, try more levels */
 		uint32_t levels;
 
-		for (levels = create->levels + 1;
+		for (levels = create.levels + 1;
 			ret && levels <= info.ddw.levels; levels++) {
-			create->levels = levels;
+			create.levels = levels;
 			ret = ioctl(vfio_container_fd,
-				VFIO_IOMMU_SPAPR_TCE_CREATE, create);
-		}
-#endif
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
-					"error %i (%s)\n", errno, strerror(errno));
-			return -1;
+				VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 		}
 	}
-
-	if (create->start_addr != 0) {
-		RTE_LOG(ERR, EAL, "  DMA window start address != 0\n");
+	if (ret) {
+		RTE_LOG(ERR, EAL, "  cannot create new DMA window, error %i (%s)\n",
+			errno, strerror(errno));
+		RTE_LOG(ERR, EAL, "  consider using a larger hugepage size "
+			"if supported by the system\n");
 		return -1;
 	}
 
-	return 0;
+	/* verify the start address  */
+	if (create.start_addr != 0) {
+		RTE_LOG(ERR, EAL, "  received unsupported start address 0x%"
+			PRIx64 "\n", (uint64_t)create.start_addr);
+		return -1;
+	}
+	return ret;
 }
 
 static int
-vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
-		uint64_t len, int do_map)
+vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr,
+	uint64_t iova, uint64_t len, int do_map)
 {
-	struct spapr_walk_param param;
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct vfio_config *vfio_cfg;
-	struct user_mem_maps *user_mem_maps;
-	int i, ret = 0;
-
-	vfio_cfg = get_vfio_cfg_by_container_fd(vfio_container_fd);
-	if (vfio_cfg == NULL) {
-		RTE_LOG(ERR, EAL, "  invalid container fd!\n");
-		return -1;
-	}
-
-	user_mem_maps = &vfio_cfg->mem_maps;
-	rte_spinlock_recursive_lock(&user_mem_maps->lock);
-
-	/* check if window size needs to be adjusted */
-	memset(&param, 0, sizeof(param));
-
-	/* we're inside a callback so use thread-unsafe version */
-	if (rte_memseg_walk_thread_unsafe(vfio_spapr_window_size_walk,
-				&param) < 0) {
-		RTE_LOG(ERR, EAL, "Could not get window size\n");
-		ret = -1;
-		goto out;
-	}
-
-	/* also check user maps */
-	for (i = 0; i < user_mem_maps->n_maps; i++) {
-		uint64_t max = user_mem_maps->maps[i].iova +
-				user_mem_maps->maps[i].len;
-		param.window_size = RTE_MAX(param.window_size, max);
-	}
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
+	int ret = 0;
 
 	if (do_map) {
-		/* re-create window and remap the entire memory */
-		if (iova + len > create.window_size) {
-			/* release all maps before recreating the window */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_unmap_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not release DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* release all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						0)) {
-					RTE_LOG(ERR, EAL, "Could not release user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-			create.window_size = rte_align64pow2(iova + len);
-			if (vfio_spapr_create_new_dma_window(vfio_container_fd,
-					&create) < 0) {
-				RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
-				ret = -1;
-				goto out;
-			}
-			/* we're inside a callback, so use thread-unsafe version
-			 */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_map_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not recreate DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* remap all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						1)) {
-					RTE_LOG(ERR, EAL, "Could not recreate user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-		}
-		if (vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 1)) {
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+			vaddr, iova, len, 1)) {
 			RTE_LOG(ERR, EAL, "Failed to map DMA\n");
 			ret = -1;
-			goto out;
 		}
 	} else {
-		/* for unmap, check if iova within DMA window */
-		if (iova > create.window_size) {
-			RTE_LOG(ERR, EAL, "iova beyond DMA window for unmap");
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+			vaddr, iova, len, 0)) {
+			RTE_LOG(ERR, EAL, "Failed to unmap DMA\n");
 			ret = -1;
-			goto out;
 		}
-
-		vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 0);
 	}
-out:
-	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
+
 	return ret;
 }
 
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct spapr_walk_param param;
-
-	memset(&param, 0, sizeof(param));
-
-	/* create DMA window from 0 to max(phys_addr + len) */
-	rte_memseg_walk(vfio_spapr_window_size_walk, &param);
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
-
-	if (vfio_spapr_create_new_dma_window(vfio_container_fd, &create) < 0) {
-		RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
+	if (vfio_spapr_create_dma_window(vfio_container_fd) < 0) {
+		RTE_LOG(ERR, EAL, "Could not create new DMA window!\n");
 		return -1;
 	}
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
+	/* map all existing DPDK segments for DMA */
 	if (rte_memseg_walk(vfio_spapr_map_walk, &vfio_container_fd) < 0)
 		return -1;
 
-- 
2.18.4
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v4 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-10-15 17:23       ` [dpdk-dev] [PATCH v4 1/1] vfio: modify spapr iommu support to use static window sizing David Christensen
@ 2020-10-20 12:05         ` Thomas Monjalon
  2020-10-29 21:30           ` Thomas Monjalon
  2020-11-02 11:04         ` Burakov, Anatoly
  1 sibling, 1 reply; 48+ messages in thread
From: Thomas Monjalon @ 2020-10-20 12:05 UTC (permalink / raw)
  To: anatoly.burakov; +Cc: dev, david.marchand, David Christensen
Anatoly, please could you review this patch?
15/10/2020 19:23, David Christensen:
> The SPAPR IOMMU requires that a DMA window size be defined before memory
> can be mapped for DMA. Current code dynamically modifies the DMA window
> size in response to every new memory allocation which is potentially
> dangerous because all existing mappings need to be unmapped/remapped in
> order to resize the DMA window, leaving hardware holding IOVA addresses
> that are temporarily unmapped.  The new SPAPR code statically assigns
> the DMA window size on first use, using the largest physical memory
> address when IOVA=PA and the highest existing memseg virtual
> address when IOVA=VA.
> 
> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v4 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-10-20 12:05         ` Thomas Monjalon
@ 2020-10-29 21:30           ` Thomas Monjalon
  0 siblings, 0 replies; 48+ messages in thread
From: Thomas Monjalon @ 2020-10-29 21:30 UTC (permalink / raw)
  To: anatoly.burakov; +Cc: dev, david.marchand, David Christensen
Ping for review please
20/10/2020 14:05, Thomas Monjalon:
> Anatoly, please could you review this patch?
> 
> 15/10/2020 19:23, David Christensen:
> > The SPAPR IOMMU requires that a DMA window size be defined before memory
> > can be mapped for DMA. Current code dynamically modifies the DMA window
> > size in response to every new memory allocation which is potentially
> > dangerous because all existing mappings need to be unmapped/remapped in
> > order to resize the DMA window, leaving hardware holding IOVA addresses
> > that are temporarily unmapped.  The new SPAPR code statically assigns
> > the DMA window size on first use, using the largest physical memory
> > address when IOVA=PA and the highest existing memseg virtual
> > address when IOVA=VA.
> > 
> > Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v4 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-10-15 17:23       ` [dpdk-dev] [PATCH v4 1/1] vfio: modify spapr iommu support to use static window sizing David Christensen
  2020-10-20 12:05         ` Thomas Monjalon
@ 2020-11-02 11:04         ` Burakov, Anatoly
  1 sibling, 0 replies; 48+ messages in thread
From: Burakov, Anatoly @ 2020-11-02 11:04 UTC (permalink / raw)
  To: David Christensen, dev, david.marchand
On 15-Oct-20 6:23 PM, David Christensen wrote:
> The SPAPR IOMMU requires that a DMA window size be defined before memory
> can be mapped for DMA. Current code dynamically modifies the DMA window
> size in response to every new memory allocation which is potentially
> dangerous because all existing mappings need to be unmapped/remapped in
> order to resize the DMA window, leaving hardware holding IOVA addresses
> that are temporarily unmapped.  The new SPAPR code statically assigns
> the DMA window size on first use, using the largest physical memory
> address when IOVA=PA and the highest existing memseg virtual
> address when IOVA=VA.
> 
> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
> ---
These changes are almost exclusively contained to PPC64 code, so with 
below changes,
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
> +static uint64_t
> +get_highest_mem_addr(struct spapr_size_walk_param *param)
> +{
> +	/* find the maximum IOVA address for setting the DMA window size */
> +	if (rte_eal_iova_mode() == RTE_IOVA_PA) {
> +		static const char proc_iomem[] = "/proc/iomem";
> +		static const char str_sysram[] = "System RAM";
> +		uint64_t start, end, max = 0;
> +		char *line = NULL;
> +		char *dash, *space;
> +		size_t line_len;
>   
> +		/*
> +		 * Example "System RAM" in /proc/iomem:
> +		 * 00000000-1fffffffff : System RAM
> +		 * 200000000000-201fffffffff : System RAM
> +		 */
> +		FILE *fd = fopen(proc_iomem, "r");
> +		if (fd == NULL) {
> +			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
> +			return -1;
> +		}
> +		/* Scan /proc/iomem for the highest PA in the system */
> +		while (getline(&line, &line_len, fd) != -1) {
> +			if (strstr(line, str_sysram) == NULL)
> +				continue;
> +
> +			space = strstr(line, " ");
> +			dash = strstr(line, "-");
> +
> +			/* Validate the format of the memory string */
> +			if (space == NULL || dash == NULL || space < dash) {
> +				RTE_LOG(ERR, EAL, "Can't parse line \"%s\" in file %s\n",
> +					line, proc_iomem);
> +				continue;
> +			}
> +
> +			start = strtoull(line, NULL, 16);
> +			end   = strtoull(dash + 1, NULL, 16);
> +			RTE_LOG(DEBUG, EAL, "Found system RAM from 0x%" PRIx64
> +				" to 0x%" PRIx64 "\n", start, end);
> +			if (end > max)
> +				max = end;
> +		}
> +		free(line);
> +		fclose(fd);
> +
> +		if (max == 0) {
> +			RTE_LOG(ERR, EAL, "Failed to find valid \"System RAM\" "
> +				"entry in file %s\n", proc_iomem);
> +			return -1;
> +		}
> +
> +		return rte_align64pow2(max + 1);
> +	} else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
> +		RTE_LOG(DEBUG, EAL, "Highest VA address in memseg list is 0x%"
> +			PRIx64 "\n", param->max_va);
> +		return rte_align64pow2(param->max_va);
> +	}
> +
> +	RTE_LOG(ERR, EAL, "Unsupported IOVA mode\n");
> +	return 0;
You're returning a uint64_t here, while also returning -1 in some of the 
error cases above, but not here. How about making the address an output 
parameter, and return 0 on success and -1 on error? It makes the code a 
bit messier at the call site, but would probably make more sense than 
returning 0 or -1 depending on which error condition you've hit.
Also, because of this, there's a bug below where you check for return 
value of 0, but not -1.
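A minimal sketch of the shape suggested above, not taken from the patch: every
name here (sketch_highest_addr, the sketch_iova_mode enum, the max_pa/max_va
inputs) is hypothetical and only stands in for the real EAL code. The point is
that the address travels through an output parameter while the return value
carries only 0 or -1, so one "< 0" check at the call site covers every
failure. With the uint64_t return, -1 becomes 0xffffffffffffffff, which a
"== 0" check does not catch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the EAL IOVA modes; not the real rte_iova_mode. */
enum sketch_iova_mode { SKETCH_IOVA_PA, SKETCH_IOVA_VA, SKETCH_IOVA_DC };

/*
 * Return 0 and store the highest address in *addr on success,
 * or return -1 and leave *addr untouched on any error.
 */
static int
sketch_highest_addr(enum sketch_iova_mode mode, uint64_t max_pa,
		uint64_t max_va, uint64_t *addr)
{
	assert(addr != NULL);

	switch (mode) {
	case SKETCH_IOVA_PA:
		/* max_pa == 0 models "no System RAM entry found" */
		if (max_pa == 0)
			return -1;
		*addr = max_pa;
		return 0;
	case SKETCH_IOVA_VA:
		*addr = max_va;
		return 0;
	default:
		/* unsupported mode: same error value as every other failure */
		return -1;
	}
}
```

A caller would then write something like
"if (sketch_highest_addr(mode, pa, va, &len) < 0) return -1;" and use len only
on the success path.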
> +}
> +
> +
> +/*
> + * The SPAPRv2 IOMMU supports 2 DMA windows with starting
> + * address at 0 or 1<<59.  By default, a DMA window is set
> + * at address 0, 2GB long, with a 4KB page.  For DPDK we
> + * must remove the default window and setup a new DMA window
> + * based on the hugepage size and memory requirements of
> + * the application before we can map memory for DMA.
> + */
>   static int
> -vfio_spapr_window_size_walk(const struct rte_memseg_list *msl,
> -		const struct rte_memseg *ms, void *arg)
> +spapr_dma_win_size(void)
>   {
> -	struct spapr_walk_param *param = arg;
> -	uint64_t max = ms->iova + ms->len;
> -
> -	/* skip external memory that isn't a heap */
> -	if (msl->external && !msl->heap)
> -		return 0;
> +	struct spapr_size_walk_param param;
>   
> -	/* skip any segments with invalid IOVA addresses */
> -	if (ms->iova == RTE_BAD_IOVA)
> +	/* only create DMA window once */
> +	if (spapr_dma_win_len > 0)
>   		return 0;
>   
> -	if (max > param->window_size) {
> -		param->hugepage_sz = ms->hugepage_sz;
> -		param->window_size = max;
> +	/* walk the memseg list to find the page size/max VA address */
> +	memset(&param, 0, sizeof(param));
> +	if (rte_memseg_list_walk(vfio_spapr_size_walk, &param) < 0) {
> +		RTE_LOG(ERR, EAL, "Failed to walk memseg list for DMA window size\n");
> +		return -1;
>   	}
>   
> +	/* We can't be sure if DMA window covers external memory */
> +	if (param.is_user_managed)
> +		RTE_LOG(WARNING, EAL, "Detected user managed external memory which may not be managed by the IOMMU\n");
> +
> +	spapr_dma_win_len = get_highest_mem_addr(&param);
> +	if (spapr_dma_win_len == 0)
> +		return -1;
This error check doesn't catch all errors, as indicated above.
> +	RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%" PRIx64 "\n",
> +		spapr_dma_win_len);
> +	spapr_dma_win_page_sz = param.page_sz;
> +	rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
>   	return 0;
>   }
>   
>   static int
> -vfio_spapr_create_new_dma_window(int vfio_container_fd,
> -		struct vfio_iommu_spapr_tce_create *create) {
> +vfio_spapr_create_dma_window(int vfio_container_fd)
> +{
> +	struct vfio_iommu_spapr_tce_create create = {
> +		.argsz = sizeof(create), };
>   	struct vfio_iommu_spapr_tce_remove remove = {
> -		.argsz = sizeof(remove),
> -	};
> +		.argsz = sizeof(remove), };
>   	struct vfio_iommu_spapr_tce_info info = {
> -		.argsz = sizeof(info),
> -	};
> +		.argsz = sizeof(info), };
>   	int ret;
>   
> -	/* query spapr iommu info */
> +	ret = spapr_dma_win_size();
> +	if (ret < 0)
> +		return ret;
> +
>   	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>   	if (ret) {
> -		RTE_LOG(ERR, EAL, "  cannot get iommu info, "
> -				"error %i (%s)\n", errno, strerror(errno));
> +		RTE_LOG(ERR, EAL, "  can't get iommu info, error %i (%s)\n",
> +			errno, strerror(errno));
>   		return -1;
>   	}
>   
> -	/* remove default DMA of 32 bit window */
> +	/* remove default DMA window */
>   	remove.start_addr = info.dma32_window_start;
>   	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
If you're never recreating a window, does it need to be removed? Or is 
this some kind of default window that is always present?
> -	if (ret) {
> -		RTE_LOG(ERR, EAL, "  cannot remove default DMA window, "
> -				"error %i (%s)\n", errno, strerror(errno));
> +	if (ret)
>   		return -1;
> -	}
>   
> -	/* create new DMA window */
> -	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, create);
> +	/* create a new DMA window (start address is not selectable) */
> +	create.window_size = spapr_dma_win_len;
> +	create.page_shift  = __builtin_ctzll(spapr_dma_win_page_sz);
> +	create.levels = 1;
> +	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>   	if (ret) {
> -#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
> -		/* try possible page_shift and levels for workaround */
> +		/* if at first we don't succeed, try more levels */
>   		uint32_t levels;
>   
> -		for (levels = create->levels + 1;
> +		for (levels = create.levels + 1;
>   			ret && levels <= info.ddw.levels; levels++) {
> -			create->levels = levels;
> +			create.levels = levels;
>   			ret = ioctl(vfio_container_fd,
> -				VFIO_IOMMU_SPAPR_TCE_CREATE, create);
> -		}
> -#endif
> -		if (ret) {
> -			RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
> -					"error %i (%s)\n", errno, strerror(errno));
> -			return -1;
> +				VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>   		}
>   	}
> -
> -	if (create->start_addr != 0) {
> -		RTE_LOG(ERR, EAL, "  DMA window start address != 0\n");
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "  cannot create new DMA window, error %i (%s)\n",
> +			errno, strerror(errno));
> +		RTE_LOG(ERR, EAL, "  consider using a larger hugepage size "
> +			"if supported by the system\n");
>   		return -1;
>   	}
>   
> -	return 0;
> +	/* verify the start address  */
> +	if (create.start_addr != 0) {
> +		RTE_LOG(ERR, EAL, "  received unsupported start address 0x%"
> +			PRIx64 "\n", (uint64_t)create.start_addr);
> +		return -1;
> +	}
> +	return ret;
>   }
>   
>   static int
> -vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
> -		uint64_t len, int do_map)
> +vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr,
> +	uint64_t iova, uint64_t len, int do_map)
Nitpick, but this bit after newline should have two indents, not one.
-- 
Thanks,
Anatoly
^ permalink raw reply	[flat|nested] 48+ messages in thread
* [dpdk-dev] [PATCH v5 0/1] vfio: modify spapr iommu support to use static window sizing
  2020-10-15 17:23     ` [dpdk-dev] [PATCH v4 0/1] vfio: change spapr DMA window sizing operation David Christensen
  2020-10-15 17:23       ` [dpdk-dev] [PATCH v4 1/1] vfio: modify spapr iommu support to use static window sizing David Christensen
@ 2020-11-03 22:05       ` David Christensen
  2020-11-03 22:05         ` [dpdk-dev] [PATCH v5 1/1] " David Christensen
  2020-11-09 20:35         ` [dpdk-dev] [PATCH v5 0/1] " David Christensen
  1 sibling, 2 replies; 48+ messages in thread
From: David Christensen @ 2020-11-03 22:05 UTC (permalink / raw)
  To: dev, anatoly.burakov, david.marchand; +Cc: David Christensen
The SPAPR v2 IOMMU used on bare-metal PowerNV systems requires that a DMA
window be defined before mapping/unmapping memory.  The current VFIO code
dynamically resizes this DMA window every time a new memory request is
made, which requires that all existing memory be unmapped/remapped.
While this strategy worked in DPDK 17.11 and earlier where memory was
statically allocated during startup, it is potentially dangerous in DPDK
18.11 and later where memory can be allocated during runtime, temporarily
invalidating IOVA memory used by hardware.
This new code statically sizes the DMA window at startup, based on the
amount of memory installed in the system, avoiding the need to unmap
memory during runtime.
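
As an illustration of the static sizing for the IOVA=PA case, here is a
minimal sketch, assuming the window must be a power of two that covers the
highest "System RAM" address reported in /proc/iomem. The helper names
(sketch_align64pow2, sketch_parse_sysram_end, sketch_dma_window_size) are
hypothetical and are not part of the patch; sketch_align64pow2 only stands in
for rte_align64pow2(), and the input is passed as an array of strings instead
of being read from the file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for rte_align64pow2(): round up to a power of 2. */
static uint64_t
sketch_align64pow2(uint64_t v)
{
	v--;
	v |= v >> 1;
	v |= v >> 2;
	v |= v >> 4;
	v |= v >> 8;
	v |= v >> 16;
	v |= v >> 32;
	return v + 1;
}

/*
 * Parse one "start-end : System RAM" line. Return 0 and store the end
 * address in *end on success, -1 if the line is not a System RAM entry
 * or does not have the expected format.
 */
static int
sketch_parse_sysram_end(const char *line, uint64_t *end)
{
	const char *dash, *space;

	assert(line != NULL && end != NULL);

	if (strstr(line, "System RAM") == NULL)
		return -1;
	dash = strchr(line, '-');
	space = strchr(line, ' ');
	if (dash == NULL || space == NULL || space < dash)
		return -1;
	*end = strtoull(dash + 1, NULL, 16);
	return 0;
}

/* Return the power-of-2 window size covering all entries, or 0 if none. */
static uint64_t
sketch_dma_window_size(const char *const lines[], size_t n)
{
	uint64_t end, max = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (sketch_parse_sysram_end(lines[i], &end) == 0 && end > max)
			max = end;
	}
	return max == 0 ? 0 : sketch_align64pow2(max + 1);
}
```

With the two example ranges 00000000-1fffffffff and 200000000000-201fffffffff,
the highest address is 0x201fffffffff, so the sketch yields a window of
0x400000000000 (2^46). The size is fixed once and never has to grow at
runtime.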
---
v5:
- Modify get_highest_mem_addr to return error, not address
- Add comment regarding sPAPR v1/v2 default window and why it
  needs to be removed
- Added indent to second line of vfio_spapr_dma_mem_map() definition
v4:
- Move file reading code out of vfio_spapr_window_size_walk()
v3:
- Rebase for 20.08
v2:
- Drop patch to wrap ppc64 code with ifdef's
- Add warning when external memory detected
- Change VA memory size detection to scan memseg list when setting DMA window
  for IOVA=VA
- Add explicit error message when attempting to map outside the DMA window
David Christensen (1):
  vfio: modify spapr iommu support to use static window sizing
 lib/librte_eal/linux/eal_vfio.c | 421 +++++++++++++++-----------------
 1 file changed, 198 insertions(+), 223 deletions(-)
-- 
2.18.4
^ permalink raw reply	[flat|nested] 48+ messages in thread
* [dpdk-dev] [PATCH v5 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-03 22:05       ` [dpdk-dev] [PATCH v5 0/1] " David Christensen
@ 2020-11-03 22:05         ` David Christensen
  2020-11-04 19:43           ` Thomas Monjalon
  2020-11-09 20:35         ` [dpdk-dev] [PATCH v5 0/1] " David Christensen
  1 sibling, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-11-03 22:05 UTC (permalink / raw)
  To: dev, anatoly.burakov, david.marchand; +Cc: David Christensen
The SPAPR IOMMU requires that a DMA window size be defined before memory
can be mapped for DMA. Current code dynamically modifies the DMA window
size in response to every new memory allocation which is potentially
dangerous because all existing mappings need to be unmapped/remapped in
order to resize the DMA window, leaving hardware holding IOVA addresses
that are temporarily unmapped.  The new SPAPR code statically assigns
the DMA window size on first use, using the largest physical memory
address when IOVA=PA and the highest existing memseg virtual
address when IOVA=VA.
Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/linux/eal_vfio.c | 421 +++++++++++++++-----------------
 1 file changed, 198 insertions(+), 223 deletions(-)
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 380f2f44a..f716710d7 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -18,6 +18,7 @@
 #include "eal_memcfg.h"
 #include "eal_vfio.h"
 #include "eal_private.h"
+#include "eal_internal_cfg.h"
 
 #ifdef VFIO_PRESENT
 
@@ -536,17 +537,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		return;
 	}
 
-#ifdef RTE_ARCH_PPC_64
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_free(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-	cur_len = 0;
-#endif
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
 
@@ -607,17 +597,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 						iova_expected - iova_start, 0);
 		}
 	}
-#ifdef RTE_ARCH_PPC_64
-	cur_len = 0;
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_used(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-#endif
 }
 
 static int
@@ -1436,21 +1415,30 @@ vfio_type1_dma_map(int vfio_container_fd)
 	return rte_memseg_walk(type1_map, &vfio_container_fd);
 }
 
+/* Track the size of the statically allocated DMA window for SPAPR */
+uint64_t spapr_dma_win_len;
+uint64_t spapr_dma_win_page_sz;
+
 static int
 vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		uint64_t len, int do_map)
 {
-	struct vfio_iommu_type1_dma_map dma_map;
-	struct vfio_iommu_type1_dma_unmap dma_unmap;
-	int ret;
 	struct vfio_iommu_spapr_register_memory reg = {
 		.argsz = sizeof(reg),
+		.vaddr = (uintptr_t) vaddr,
+		.size = len,
 		.flags = 0
 	};
-	reg.vaddr = (uintptr_t) vaddr;
-	reg.size = len;
+	int ret;
 
 	if (do_map != 0) {
+		struct vfio_iommu_type1_dma_map dma_map;
+
+		if (iova + len > spapr_dma_win_len) {
+			RTE_LOG(ERR, EAL, "  dma map attempt outside DMA window\n");
+			return -1;
+		}
+
 		ret = ioctl(vfio_container_fd,
 				VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
 		if (ret) {
@@ -1469,24 +1457,14 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
 		if (ret) {
-			/**
-			 * In case the mapping was already done EBUSY will be
-			 * returned from kernel.
-			 */
-			if (errno == EBUSY) {
-				RTE_LOG(DEBUG, EAL,
-					" Memory segment is already mapped,"
-					" skipping");
-			} else {
-				RTE_LOG(ERR, EAL,
-					"  cannot set up DMA remapping,"
-					" error %i (%s)\n", errno,
-					strerror(errno));
-				return -1;
-			}
+			RTE_LOG(ERR, EAL, "  cannot map vaddr for IOMMU, error %i (%s)\n",
+				errno, strerror(errno));
+			return -1;
 		}
 
 	} else {
+		struct vfio_iommu_type1_dma_map dma_unmap;
+
 		memset(&dma_unmap, 0, sizeof(dma_unmap));
 		dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
 		dma_unmap.size = len;
@@ -1495,8 +1473,8 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA,
 				&dma_unmap);
 		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
-					errno, strerror(errno));
+			RTE_LOG(ERR, EAL, "  cannot unmap vaddr for IOMMU, error %i (%s)\n",
+				errno, strerror(errno));
 			return -1;
 		}
 
@@ -1504,12 +1482,12 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 				VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, error %i (%s)\n",
-					errno, strerror(errno));
+				errno, strerror(errno));
 			return -1;
 		}
 	}
 
-	return 0;
+	return ret;
 }
 
 static int
@@ -1526,251 +1504,248 @@ vfio_spapr_map_walk(const struct rte_memseg_list *msl,
 	if (ms->iova == RTE_BAD_IOVA)
 		return 0;
 
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 1);
+	return vfio_spapr_dma_do_map(*vfio_container_fd,
+		ms->addr_64, ms->iova, ms->len, 1);
 }
 
+struct spapr_size_walk_param {
+	uint64_t max_va;
+	uint64_t page_sz;
+	bool is_user_managed;
+};
+
+/*
+ * In order to set the DMA window size required for the SPAPR IOMMU
+ * we need to walk the existing virtual memory allocations as well as
+ * find the hugepage size used.
+ */
 static int
-vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
-		const struct rte_memseg *ms, void *arg)
+vfio_spapr_size_walk(const struct rte_memseg_list *msl, void *arg)
 {
-	int *vfio_container_fd = arg;
+	struct spapr_size_walk_param *param = arg;
+	uint64_t max = (uint64_t) msl->base_va + (uint64_t) msl->len;
 
-	/* skip external memory that isn't a heap */
-	if (msl->external && !msl->heap)
+	if (msl->external && !msl->heap) {
+		/* ignore user managed external memory */
+		param->is_user_managed = true;
 		return 0;
+	}
 
-	/* skip any segments with invalid IOVA addresses */
-	if (ms->iova == RTE_BAD_IOVA)
-		return 0;
+	if (max > param->max_va) {
+		param->page_sz = msl->page_sz;
+		param->max_va = max;
+	}
 
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 0);
+	return 0;
 }
 
-struct spapr_walk_param {
-	uint64_t window_size;
-	uint64_t hugepage_sz;
-};
-
+/*
+ * Find the highest memory address used in physical or virtual address
+ * space and use that as the top of the DMA window.
+ */
 static int
-vfio_spapr_window_size_walk(const struct rte_memseg_list *msl,
-		const struct rte_memseg *ms, void *arg)
+find_highest_mem_addr(struct spapr_size_walk_param *param)
 {
-	struct spapr_walk_param *param = arg;
-	uint64_t max = ms->iova + ms->len;
+	/* find the maximum IOVA address for setting the DMA window size */
+	if (rte_eal_iova_mode() == RTE_IOVA_PA) {
+		static const char proc_iomem[] = "/proc/iomem";
+		static const char str_sysram[] = "System RAM";
+		uint64_t start, end, max = 0;
+		char *line = NULL;
+		char *dash, *space;
+		size_t line_len;
 
-	/* skip external memory that isn't a heap */
-	if (msl->external && !msl->heap)
+		/*
+		 * Example "System RAM" in /proc/iomem:
+		 * 00000000-1fffffffff : System RAM
+		 * 200000000000-201fffffffff : System RAM
+		 */
+		FILE *fd = fopen(proc_iomem, "r");
+		if (fd == NULL) {
+			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
+			return -1;
+		}
+		/* Scan /proc/iomem for the highest PA in the system */
+		while (getline(&line, &line_len, fd) != -1) {
+			if (strstr(line, str_sysram) == NULL)
+				continue;
+
+			space = strstr(line, " ");
+			dash = strstr(line, "-");
+
+			/* Validate the format of the memory string */
+			if (space == NULL || dash == NULL || space < dash) {
+				RTE_LOG(ERR, EAL, "Can't parse line \"%s\" in file %s\n",
+					line, proc_iomem);
+				continue;
+			}
+
+			start = strtoull(line, NULL, 16);
+			end   = strtoull(dash + 1, NULL, 16);
+			RTE_LOG(DEBUG, EAL, "Found system RAM from 0x%" PRIx64
+				" to 0x%" PRIx64 "\n", start, end);
+			if (end > max)
+				max = end;
+		}
+		free(line);
+		fclose(fd);
+
+		if (max == 0) {
+			RTE_LOG(ERR, EAL, "Failed to find valid \"System RAM\" "
+				"entry in file %s\n", proc_iomem);
+			return -1;
+		}
+
+		spapr_dma_win_len = rte_align64pow2(max + 1);
 		return 0;
+	} else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
+		RTE_LOG(DEBUG, EAL, "Highest VA address in memseg list is 0x%"
+			PRIx64 "\n", param->max_va);
+		spapr_dma_win_len = rte_align64pow2(param->max_va);
+		return 0;
+	}
 
-	/* skip any segments with invalid IOVA addresses */
-	if (ms->iova == RTE_BAD_IOVA)
+	spapr_dma_win_len = 0;
+	RTE_LOG(ERR, EAL, "Unsupported IOVA mode\n");
+	return -1;
+}
+
+
+/*
+ * The SPAPRv2 IOMMU supports 2 DMA windows with starting
+ * address at 0 or 1<<59.  By default, a DMA window is set
+ * at address 0, 2GB long, with a 4KB page.  For DPDK we
+ * must remove the default window and setup a new DMA window
+ * based on the hugepage size and memory requirements of
+ * the application before we can map memory for DMA.
+ */
+static int
+spapr_dma_win_size(void)
+{
+	struct spapr_size_walk_param param;
+
+	/* only create DMA window once */
+	if (spapr_dma_win_len > 0)
 		return 0;
 
-	if (max > param->window_size) {
-		param->hugepage_sz = ms->hugepage_sz;
-		param->window_size = max;
+	/* walk the memseg list to find the page size/max VA address */
+	memset(&param, 0, sizeof(param));
+	if (rte_memseg_list_walk(vfio_spapr_size_walk, &param) < 0) {
+		RTE_LOG(ERR, EAL, "Failed to walk memseg list for DMA window size\n");
+		return -1;
 	}
 
+	/* we can't be sure if DMA window covers external memory */
+	if (param.is_user_managed)
+		RTE_LOG(WARNING, EAL, "Detected user managed external memory which may not be managed by the IOMMU\n");
+
+	/* check physical/virtual memory size */
+	if (find_highest_mem_addr(&param) < 0)
+		return -1;
+	RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%" PRIx64 "\n",
+		spapr_dma_win_len);
+	spapr_dma_win_page_sz = param.page_sz;
+	rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
 	return 0;
 }
 
 static int
-vfio_spapr_create_new_dma_window(int vfio_container_fd,
-		struct vfio_iommu_spapr_tce_create *create) {
+vfio_spapr_create_dma_window(int vfio_container_fd)
+{
+	struct vfio_iommu_spapr_tce_create create = {
+		.argsz = sizeof(create), };
 	struct vfio_iommu_spapr_tce_remove remove = {
-		.argsz = sizeof(remove),
-	};
+		.argsz = sizeof(remove), };
 	struct vfio_iommu_spapr_tce_info info = {
-		.argsz = sizeof(info),
-	};
+		.argsz = sizeof(info), };
 	int ret;
 
-	/* query spapr iommu info */
+	ret = spapr_dma_win_size();
+	if (ret < 0)
+		return ret;
+
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
 	if (ret) {
-		RTE_LOG(ERR, EAL, "  cannot get iommu info, "
-				"error %i (%s)\n", errno, strerror(errno));
+		RTE_LOG(ERR, EAL, "  can't get iommu info, error %i (%s)\n",
+			errno, strerror(errno));
 		return -1;
 	}
 
-	/* remove default DMA of 32 bit window */
+	/*
+	 * sPAPR v1/v2 IOMMU always has a default 1G DMA window set.  The window
+	 * can't be changed for v1 but it can be changed for v2. Since DPDK only
+	 * supports v2, remove the default DMA window so it can be resized.
+	 */
 	remove.start_addr = info.dma32_window_start;
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
-	if (ret) {
-		RTE_LOG(ERR, EAL, "  cannot remove default DMA window, "
-				"error %i (%s)\n", errno, strerror(errno));
+	if (ret)
 		return -1;
-	}
 
-	/* create new DMA window */
-	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, create);
+	/* create a new DMA window (start address is not selectable) */
+	create.window_size = spapr_dma_win_len;
+	create.page_shift  = __builtin_ctzll(spapr_dma_win_page_sz);
+	create.levels = 1;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 	if (ret) {
-#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
-		/* try possible page_shift and levels for workaround */
+		/* if at first we don't succeed, try more levels */
 		uint32_t levels;
 
-		for (levels = create->levels + 1;
+		for (levels = create.levels + 1;
 			ret && levels <= info.ddw.levels; levels++) {
-			create->levels = levels;
+			create.levels = levels;
 			ret = ioctl(vfio_container_fd,
-				VFIO_IOMMU_SPAPR_TCE_CREATE, create);
-		}
-#endif
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
-					"error %i (%s)\n", errno, strerror(errno));
-			return -1;
+				VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 		}
 	}
-
-	if (create->start_addr != 0) {
-		RTE_LOG(ERR, EAL, "  DMA window start address != 0\n");
+	if (ret) {
+		RTE_LOG(ERR, EAL, "  cannot create new DMA window, error %i (%s)\n",
+			errno, strerror(errno));
+		RTE_LOG(ERR, EAL, "  consider using a larger hugepage size "
+			"if supported by the system\n");
 		return -1;
 	}
 
-	return 0;
+	/* verify the start address  */
+	if (create.start_addr != 0) {
+		RTE_LOG(ERR, EAL, "  received unsupported start address 0x%"
+			PRIx64 "\n", (uint64_t)create.start_addr);
+		return -1;
+	}
+	return ret;
 }
 
 static int
-vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
-		uint64_t len, int do_map)
+vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr,
+		uint64_t iova, uint64_t len, int do_map)
 {
-	struct spapr_walk_param param;
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct vfio_config *vfio_cfg;
-	struct user_mem_maps *user_mem_maps;
-	int i, ret = 0;
-
-	vfio_cfg = get_vfio_cfg_by_container_fd(vfio_container_fd);
-	if (vfio_cfg == NULL) {
-		RTE_LOG(ERR, EAL, "  invalid container fd!\n");
-		return -1;
-	}
-
-	user_mem_maps = &vfio_cfg->mem_maps;
-	rte_spinlock_recursive_lock(&user_mem_maps->lock);
-
-	/* check if window size needs to be adjusted */
-	memset(&param, 0, sizeof(param));
-
-	/* we're inside a callback so use thread-unsafe version */
-	if (rte_memseg_walk_thread_unsafe(vfio_spapr_window_size_walk,
-				&param) < 0) {
-		RTE_LOG(ERR, EAL, "Could not get window size\n");
-		ret = -1;
-		goto out;
-	}
-
-	/* also check user maps */
-	for (i = 0; i < user_mem_maps->n_maps; i++) {
-		uint64_t max = user_mem_maps->maps[i].iova +
-				user_mem_maps->maps[i].len;
-		param.window_size = RTE_MAX(param.window_size, max);
-	}
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
+	int ret = 0;
 
 	if (do_map) {
-		/* re-create window and remap the entire memory */
-		if (iova + len > create.window_size) {
-			/* release all maps before recreating the window */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_unmap_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not release DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* release all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						0)) {
-					RTE_LOG(ERR, EAL, "Could not release user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-			create.window_size = rte_align64pow2(iova + len);
-			if (vfio_spapr_create_new_dma_window(vfio_container_fd,
-					&create) < 0) {
-				RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
-				ret = -1;
-				goto out;
-			}
-			/* we're inside a callback, so use thread-unsafe version
-			 */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_map_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not recreate DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* remap all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						1)) {
-					RTE_LOG(ERR, EAL, "Could not recreate user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-		}
-		if (vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 1)) {
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+			vaddr, iova, len, 1)) {
 			RTE_LOG(ERR, EAL, "Failed to map DMA\n");
 			ret = -1;
-			goto out;
 		}
 	} else {
-		/* for unmap, check if iova within DMA window */
-		if (iova > create.window_size) {
-			RTE_LOG(ERR, EAL, "iova beyond DMA window for unmap");
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+			vaddr, iova, len, 0)) {
+			RTE_LOG(ERR, EAL, "Failed to unmap DMA\n");
 			ret = -1;
-			goto out;
 		}
-
-		vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 0);
 	}
-out:
-	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
+
 	return ret;
 }
 
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct spapr_walk_param param;
-
-	memset(&param, 0, sizeof(param));
-
-	/* create DMA window from 0 to max(phys_addr + len) */
-	rte_memseg_walk(vfio_spapr_window_size_walk, &param);
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
-
-	if (vfio_spapr_create_new_dma_window(vfio_container_fd, &create) < 0) {
-		RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
+	if (vfio_spapr_create_dma_window(vfio_container_fd) < 0) {
+		RTE_LOG(ERR, EAL, "Could not create new DMA window!\n");
 		return -1;
 	}
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
+	/* map all existing DPDK segments for DMA */
 	if (rte_memseg_walk(vfio_spapr_map_walk, &vfio_container_fd) < 0)
 		return -1;
 
-- 
2.18.4
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v5 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-03 22:05         ` [dpdk-dev] [PATCH v5 1/1] " David Christensen
@ 2020-11-04 19:43           ` Thomas Monjalon
  2020-11-04 21:00             ` David Christensen
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Monjalon @ 2020-11-04 19:43 UTC (permalink / raw)
  To: David Christensen; +Cc: dev, anatoly.burakov, david.marchand
03/11/2020 23:05, David Christensen:
> The SPAPR IOMMU requires that a DMA window size be defined before memory
> can be mapped for DMA. Current code dynamically modifies the DMA window
> size in response to every new memory allocation which is potentially
> dangerous because all existing mappings need to be unmapped/remapped in
> order to resize the DMA window, leaving hardware holding IOVA addresses
> that are temporarily unmapped.  The new SPAPR code statically assigns
> the DMA window size on first use, using the largest physical memory
> address when IOVA=PA and the highest existing memseg virtual
> address when IOVA=VA.
> 
> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> -#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
> -		/* try possible page_shift and levels for workaround */
> +		/* if at first we don't succeed, try more levels */
>  		uint32_t levels;
>  
> -		for (levels = create->levels + 1;
> +		for (levels = create.levels + 1;
>  			ret && levels <= info.ddw.levels; levels++) {
There is a compilation failure with ppc64le-power8-linux-gcc:
error: ‘struct vfio_iommu_spapr_tce_info’ has no member named ‘ddw’
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v5 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-04 19:43           ` Thomas Monjalon
@ 2020-11-04 21:00             ` David Christensen
  2020-11-04 21:02               ` Thomas Monjalon
  0 siblings, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-11-04 21:00 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, anatoly.burakov, david.marchand
On 11/4/20 11:43 AM, Thomas Monjalon wrote:
>> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
>> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>> -#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
>> -		/* try possible page_shift and levels for workaround */
>> +		/* if at first we don't succeed, try more levels */
>>   		uint32_t levels;
>>   
>> -		for (levels = create->levels + 1;
>> +		for (levels = create.levels + 1;
>>   			ret && levels <= info.ddw.levels; levels++) {
> 
> There is a compilation failure with ppc64le-power8-linux-gcc:
> error: ‘struct vfio_iommu_spapr_tce_info’ has no member named ‘ddw’
How did you find that error?  It builds locally for me on a POWER system 
with Meson/gcc and there were no build failures on Travis 
(https://travis-ci.com/github/drchristensen/dpdk/builds/198047029) when 
I checked it against AMD64/ARM systems.  The code is PPC specific but it 
will build on all architectures (there are no IFDEFs around it).
Dave
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v5 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-04 21:00             ` David Christensen
@ 2020-11-04 21:02               ` Thomas Monjalon
  2020-11-04 22:25                 ` David Christensen
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Monjalon @ 2020-11-04 21:02 UTC (permalink / raw)
  To: David Christensen; +Cc: dev, anatoly.burakov, david.marchand
04/11/2020 22:00, David Christensen:
> 
> On 11/4/20 11:43 AM, Thomas Monjalon wrote:
> >> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
> >> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> ---
> >> -#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
> >> -		/* try possible page_shift and levels for workaround */
> >> +		/* if at first we don't succeed, try more levels */
> >>   		uint32_t levels;
> >>   
> >> -		for (levels = create->levels + 1;
> >> +		for (levels = create.levels + 1;
> >>   			ret && levels <= info.ddw.levels; levels++) {
> > 
> > There is a compilation failure with ppc64le-power8-linux-gcc:
> > error: ‘struct vfio_iommu_spapr_tce_info’ has no member named ‘ddw’
> 
> How did you find that error?  It builds locally for me on a POWER system 
> with Meson/gcc and there were no build failures on Travis 
> (https://travis-ci.com/github/drchristensen/dpdk/builds/198047029) when 
> I checked it against AMD64/ARM systems.  The code is PPC specific but it 
> will build on all architectures (there are no IFDEFs around it).
Remember, I cross-build with test-meson-builds.sh
Is it an issue of my toolchain?
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v5 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-04 21:02               ` Thomas Monjalon
@ 2020-11-04 22:25                 ` David Christensen
  2020-11-05  7:12                   ` Thomas Monjalon
  0 siblings, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-11-04 22:25 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, anatoly.burakov, david.marchand
On 11/4/20 1:02 PM, Thomas Monjalon wrote:
> 04/11/2020 22:00, David Christensen:
>>
>> On 11/4/20 11:43 AM, Thomas Monjalon wrote:
>>>> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
>>>> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>> ---
>>>> -#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
>>>> -		/* try possible page_shift and levels for workaround */
>>>> +		/* if at first we don't succeed, try more levels */
>>>>    		uint32_t levels;
>>>>    
>>>> -		for (levels = create->levels + 1;
>>>> +		for (levels = create.levels + 1;
>>>>    			ret && levels <= info.ddw.levels; levels++) {
>>>
>>> There is a compilation failure with ppc64le-power8-linux-gcc:
>>> error: ‘struct vfio_iommu_spapr_tce_info’ has no member named ‘ddw’
>>
>> How did you find that error?  It builds locally for me on a POWER system
>> with Meson/gcc and there were no build failures on Travis
>> (https://travis-ci.com/github/drchristensen/dpdk/builds/198047029) when
>> I checked it against AMD64/ARM systems.  The code is PPC specific but it
>> will build on all architectures (there are no IFDEFs around it).
> 
> Remember, I cross-build with test-meson-builds.sh
> Is it an issue of my toolchain?
> 
What distro/gcc version are you using?  I'll try it locally on an x86.
Dave
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v5 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-04 22:25                 ` David Christensen
@ 2020-11-05  7:12                   ` Thomas Monjalon
  2020-11-06 22:16                     ` David Christensen
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Monjalon @ 2020-11-05  7:12 UTC (permalink / raw)
  To: David Christensen; +Cc: dev, anatoly.burakov, david.marchand
04/11/2020 23:25, David Christensen:
> On 11/4/20 1:02 PM, Thomas Monjalon wrote:
> > 04/11/2020 22:00, David Christensen:
> >>
> >> On 11/4/20 11:43 AM, Thomas Monjalon wrote:
> >>>> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
> >>>> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>> ---
> >>>> -#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
> >>>> -		/* try possible page_shift and levels for workaround */
> >>>> +		/* if at first we don't succeed, try more levels */
> >>>>    		uint32_t levels;
> >>>>    
> >>>> -		for (levels = create->levels + 1;
> >>>> +		for (levels = create.levels + 1;
> >>>>    			ret && levels <= info.ddw.levels; levels++) {
> >>>
> >>> There is a compilation failure with ppc64le-power8-linux-gcc:
> >>> error: ‘struct vfio_iommu_spapr_tce_info’ has no member named ‘ddw’
> >>
> >> How did you find that error?  It builds locally for me on a POWER system
> >> with Meson/gcc and there were no build failures on Travis
> >> (https://travis-ci.com/github/drchristensen/dpdk/builds/198047029) when
> >> I checked it against AMD64/ARM systems.  The code is PPC specific but it
> >> will build on all architectures (there are no IFDEFs around it).
> > 
> > Remember, I cross-build with test-meson-builds.sh
> > Is it an issue of my toolchain?
> 
> What distro/gcc version are you using?  I'll try it locally on an x86.
I am using powerpc64le-power8--glibc--stable-2018.11-1 from
https://toolchains.bootlin.com/releases_powerpc64le-power8.html
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v5 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-05  7:12                   ` Thomas Monjalon
@ 2020-11-06 22:16                     ` David Christensen
  2020-11-07  9:58                       ` Thomas Monjalon
  0 siblings, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-11-06 22:16 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, anatoly.burakov, david.marchand
On 11/4/20 11:12 PM, Thomas Monjalon wrote:
> 04/11/2020 23:25, David Christensen:
>> On 11/4/20 1:02 PM, Thomas Monjalon wrote:
>>> 04/11/2020 22:00, David Christensen:
>>>>
>>>> On 11/4/20 11:43 AM, Thomas Monjalon wrote:
>>>>>> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
>>>>>> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>> ---
>>>>>> -#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
>>>>>> -		/* try possible page_shift and levels for workaround */
>>>>>> +		/* if at first we don't succeed, try more levels */
>>>>>>     		uint32_t levels;
>>>>>>     
>>>>>> -		for (levels = create->levels + 1;
>>>>>> +		for (levels = create.levels + 1;
>>>>>>     			ret && levels <= info.ddw.levels; levels++) {
>>>>>
>>>>> There is a compilation failure with ppc64le-power8-linux-gcc:
>>>>> error: ‘struct vfio_iommu_spapr_tce_info’ has no member named ‘ddw’
>>>>
>>>> How did you find that error?  It builds locally for me on a POWER system
>>>> with Meson/gcc and there were no build failures on Travis
>>>> (https://travis-ci.com/github/drchristensen/dpdk/builds/198047029) when
>>>> I checked it against AMD64/ARM systems.  The code is PPC specific but it
>>>> will build on all architectures (there are no IFDEFs around it).
>>>
>>> Remember, I cross-build with test-meson-builds.sh
>>> Is it an issue of my toolchain?
>>
>> What distro/gcc version are you using?  I'll try it locally on an x86.
> 
> I am using powerpc64le-power8--glibc--stable-2018.11-1 from
> https://toolchains.bootlin.com/releases_powerpc64le-power8.html
Here's what I found:
- Builds correctly on a RHEL 8.2 POWER9 host with gcc (GCC) 8.3.1 
20191121 (Red Hat 8.3.1-5) and kernel 4.18.0
- Builds correctly on an Ubuntu 18.04.5 POWER9 host with gcc (Ubuntu 
7.5.0-3ubuntu1~18.04) 7.5.0 and kernel 4.15.0.
- Build fails on an Ubuntu 18.04.5 AMD64 host with your POWER8 toolchain 
and the devtools/test-meson-builds.sh script.
It appears that the VFIO header file in your toolchain:
powerpc64le-buildroot-linux-gnu/sysroot/usr/include/linux/vfio.h
is from the 4.1.49 kernel, but the sPAPR v2 IOMMU support wasn't added 
until the 4.2.0 kernel (https://lkml.org/lkml/2015/4/25/56).  The update 
added the ddw member to the vfio_iommu_spapr_tce_info structure.  I'll 
submit a new patch which skips testing additional levels unless kernel 
4.2.0 or later is used.
Dave
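For illustration only (not part of any posted patch): the version gate described above amounts to comparing packed kernel version numbers. The sketch below assumes the classic (major << 16) + (minor << 8) + patch packing that KERNEL_VERSION() in <linux/version.h> has traditionally used; kver() and headers_have_ddw() are hypothetical names, and the 4.2.0 threshold is taken from the message above.

```c
#include <stdint.h>

/* Pack a version triple the way KERNEL_VERSION() traditionally does. */
uint32_t kver(uint32_t major, uint32_t minor, uint32_t patch)
{
	return (major << 16) + (minor << 8) + patch;
}

/*
 * Hypothetical helper: returns 1 when headers from the given kernel
 * version are expected to carry the ddw member (4.2.0 or later, per
 * the thread), 0 otherwise.
 */
int headers_have_ddw(uint32_t major, uint32_t minor, uint32_t patch)
{
	return kver(major, minor, patch) >= kver(4, 2, 0);
}
```

With these assumptions the 4.1.49 toolchain headers fall below the threshold, while the 4.15.0 and 4.18.0 hosts listed above are at or past it.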
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v5 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-06 22:16                     ` David Christensen
@ 2020-11-07  9:58                       ` Thomas Monjalon
  0 siblings, 0 replies; 48+ messages in thread
From: Thomas Monjalon @ 2020-11-07  9:58 UTC (permalink / raw)
  To: David Christensen; +Cc: dev, anatoly.burakov, david.marchand
06/11/2020 23:16, David Christensen:
> On 11/4/20 11:12 PM, Thomas Monjalon wrote:
> > 04/11/2020 23:25, David Christensen:
> >> On 11/4/20 1:02 PM, Thomas Monjalon wrote:
> >>> 04/11/2020 22:00, David Christensen:
> >>>> On 11/4/20 11:43 AM, Thomas Monjalon wrote:
> >>>>>> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
> >>>>>> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>>>> ---
> >>>>>> -#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
> >>>>>> -		/* try possible page_shift and levels for workaround */
> >>>>>> +		/* if at first we don't succeed, try more levels */
> >>>>>>     		uint32_t levels;
> >>>>>>     
> >>>>>> -		for (levels = create->levels + 1;
> >>>>>> +		for (levels = create.levels + 1;
> >>>>>>     			ret && levels <= info.ddw.levels; levels++) {
> >>>>>
> >>>>> There is a compilation failure with ppc64le-power8-linux-gcc:
> >>>>> error: ‘struct vfio_iommu_spapr_tce_info’ has no member named ‘ddw’
> >>>>
> >>>> How did you find that error?  It builds locally for me on a POWER system
> >>>> with Meson/gcc and there were no build failures on Travis
> >>>> (https://travis-ci.com/github/drchristensen/dpdk/builds/198047029) when
> >>>> I checked it against AMD64/ARM systems.  The code is PPC specific but it
> >>>> will build on all architectures (there are no IFDEFs around it).
> >>>
> >>> Remember, I cross-build with test-meson-builds.sh
> >>> Is it an issue of my toolchain?
> >>
> >> What distro/gcc version are you using?  I'll try it locally on an x86.
> > 
> > I am using powerpc64le-power8--glibc--stable-2018.11-1 from
> > https://toolchains.bootlin.com/releases_powerpc64le-power8.html
> 
> Here's what I found:
> 
> - Builds correctly on a RHEL 8.2 POWER9 host with gcc (GCC) 8.3.1 
> 20191121 (Red Hat 8.3.1-5) and kernel 4.18.0
> - Builds correctly on an Ubuntu 18.04.5 POWER9 host with gcc (Ubuntu 
> 7.5.0-3ubuntu1~18.04) 7.5.0 and kernel 4.15.0.
> - Build fails on an Ubuntu 18.04.5 AMD64 host with your POWER8 toolchain 
> and the devtools/test-meson-builds.sh script.
> 
> It appears that the VFIO header file in your toolchain:
> 
> powerpc64le-buildroot-linux-gnu/sysroot/usr/include/linux/vfio.h
> 
> is from the 4.1.49 kernel, but the sPAPR v2 IOMMU support wasn't added 
> until the 4.2.0 kernel (https://lkml.org/lkml/2015/4/25/56).  The update 
> added the ddw member to the vfio_iommu_spapr_tce_info structure.  I'll 
> submit a new patch which skips testing additional levels unless kernel 
> 4.2.0 or later is used.
Instead of testing kernel version, which is fragile with backports,
can you test the presence of the feature itself?
If no macro (usable with #ifdef) is defined with the feature,
checking the kernel version is acceptable.
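For illustration only: a minimal sketch of testing for the feature itself rather than the kernel version. It assumes the VFIO_IOMMU_SPAPR_INFO_DDW flag macro from <linux/vfio.h> (the same macro named in the quoted diff) is defined exactly when the ddw member exists. spapr_ddw_available() and max_levels_to_try() are hypothetical helpers, and the __has_include guard is only there so the sketch also builds where the header is absent.

```c
#include <stdint.h>

#if defined(__has_include)
#  if __has_include(<linux/vfio.h>)
#    include <linux/vfio.h>
#  endif
#endif

/* Returns 1 when the build headers advertise DDW support, else 0. */
int spapr_ddw_available(void)
{
#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
	return 1;
#else
	return 0;
#endif
}

/*
 * Hypothetical helper: upper bound for the "try more levels" loop.
 * Without the feature macro only the initial single level is tried.
 */
uint32_t max_levels_to_try(uint32_t reported_levels)
{
#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
	return reported_levels;
#else
	(void)reported_levels;
	return 1;
#endif
}
```

A guard of this kind follows the header contents, so a distribution kernel that backports the feature is handled without any version arithmetic.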
^ permalink raw reply	[flat|nested] 48+ messages in thread
* [dpdk-dev] [PATCH v5 0/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-03 22:05       ` [dpdk-dev] [PATCH v5 0/1] " David Christensen
  2020-11-03 22:05         ` [dpdk-dev] [PATCH v5 1/1] " David Christensen
@ 2020-11-09 20:35         ` David Christensen
  2020-11-09 20:35           ` [dpdk-dev] [PATCH v6 1/1] " David Christensen
                             ` (2 more replies)
  1 sibling, 3 replies; 48+ messages in thread
From: David Christensen @ 2020-11-09 20:35 UTC (permalink / raw)
  To: dev, anatoly.burakov, david.marchand; +Cc: David Christensen
The SPAPR v2 IOMMU used on bare-metal PowerNV systems requires that a DMA
window be defined before mapping/unmapping memory.  The current VFIO code
dynamically resizes this DMA window every time a new memory request is
made, which requires that all existing memory be unmapped/remapped.
While this strategy worked in DPDK 17.11 and earlier where memory was
statically allocated during startup, it is potentially dangerous in DPDK
18.11 and later where memory can be allocated during runtime, temporarily
invalidating IOVA memory used by hardware.
This new code statically sizes the DMA window at startup, based on the
amount of memory installed in the system, avoiding the need to unmap
memory during runtime.
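For illustration only: a standalone sketch of the sizing idea for the IOVA=PA case, assuming "System RAM" lines in the /proc/iomem format and a window length rounded up to a power of two. align64pow2() is a local stand-in for DPDK's rte_align64pow2(); sysram_end() and window_len() are hypothetical names, and the parsing here uses sscanf() rather than the approach taken in the patch.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Round v up to the next power of two (v itself if already one). */
uint64_t align64pow2(uint64_t v)
{
	v--;
	v |= v >> 1;
	v |= v >> 2;
	v |= v >> 4;
	v |= v >> 8;
	v |= v >> 16;
	v |= v >> 32;
	return v + 1;
}

/*
 * Parse one /proc/iomem-style line. On a well-formed "System RAM"
 * line, store the range's end address in *end and return 1;
 * otherwise return 0.
 */
int sysram_end(const char *line, uint64_t *end)
{
	uint64_t start;

	if (strstr(line, "System RAM") == NULL)
		return 0;
	if (sscanf(line, "%" SCNx64 "-%" SCNx64, &start, end) != 2)
		return 0;
	return 1;
}

/* Window length covering the highest System RAM address, 0 if none. */
uint64_t window_len(const char *const *lines, size_t n)
{
	uint64_t max = 0, end;
	size_t i;

	for (i = 0; i < n; i++)
		if (sysram_end(lines[i], &end) && end > max)
			max = end;

	return max == 0 ? 0 : align64pow2(max + 1);
}
```

Fed the range 200000000000-201fffffffff, the highest address plus one is 0x202000000000, which rounds up to a 0x400000000000 window.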
---
v6:
- Fix build error on Linux kernels prior to 4.2.0
- Rebased on 20.11-rc3
v5:
- Modify get_highest_mem_addr to return error, not address
- Add comment regarding sPAPR v1/v2 default window and why it
  needs to be removed
- Added indent to second line of vfio_spapr_dma_mem_map() definition
v4:
- Move file reading code out of vfio_spapr_window_size_walk()
v3:
- Rebase for 20.08
v2:
- Drop patch to wrap ppc64 code with ifdef's
- Add warning when external memory detected
- Change VA memory size detection to scan memseg list when setting DMA window
  for IOVA=VA
- Add explicit error message when attempting to map outside the DMA window
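For illustration only: with a fixed window, every map request has to be checked against the window length before it reaches the IOMMU. The patch performs this test as iova + len > spapr_dma_win_len; the sketch below expresses the same bound in a form that does not wrap around for very large inputs, which is a variation assumed here and not taken from the patch. in_dma_window() is a hypothetical name.

```c
#include <stdint.h>

/*
 * Returns 1 when [iova, iova + len) lies entirely inside a window of
 * win_len bytes starting at address 0, and 0 otherwise. Written so
 * that iova + len is never computed and therefore cannot overflow.
 */
int in_dma_window(uint64_t iova, uint64_t len, uint64_t win_len)
{
	return len <= win_len && iova <= win_len - len;
}
```

A request ending exactly at the window boundary is accepted; one byte past it is rejected.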
David Christensen (1):
  vfio: modify spapr iommu support to use static window sizing
 lib/librte_eal/linux/eal_vfio.c | 430 +++++++++++++++-----------------
 1 file changed, 207 insertions(+), 223 deletions(-)
-- 
2.18.4
^ permalink raw reply	[flat|nested] 48+ messages in thread
* [dpdk-dev] [PATCH v6 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-09 20:35         ` [dpdk-dev] [PATCH v5 0/1] " David Christensen
@ 2020-11-09 20:35           ` David Christensen
  2020-11-09 21:10             ` Thomas Monjalon
  2020-11-10 17:41           ` [dpdk-dev] [PATCH v7 0/1] " David Christensen
  2020-11-10 17:43           ` [dpdk-dev] [PATCH v7 0/1] " David Christensen
  2 siblings, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-11-09 20:35 UTC (permalink / raw)
  To: dev, anatoly.burakov, david.marchand; +Cc: David Christensen
The SPAPR IOMMU requires that a DMA window size be defined before memory
can be mapped for DMA. Current code dynamically modifies the DMA window
size in response to every new memory allocation which is potentially
dangerous because all existing mappings need to be unmapped/remapped in
order to resize the DMA window, leaving hardware holding IOVA addresses
that are temporarily unmapped.  The new SPAPR code statically assigns
the DMA window size on first use, using the largest physical memory
address when IOVA=PA and the highest existing memseg virtual
address when IOVA=VA.
Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
---
 lib/librte_eal/linux/eal_vfio.c | 430 +++++++++++++++-----------------
 1 file changed, 207 insertions(+), 223 deletions(-)
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 380f2f44a..050082444 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -18,6 +18,7 @@
 #include "eal_memcfg.h"
 #include "eal_vfio.h"
 #include "eal_private.h"
+#include "eal_internal_cfg.h"
 
 #ifdef VFIO_PRESENT
 
@@ -536,17 +537,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		return;
 	}
 
-#ifdef RTE_ARCH_PPC_64
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_free(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-	cur_len = 0;
-#endif
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
 
@@ -607,17 +597,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 						iova_expected - iova_start, 0);
 		}
 	}
-#ifdef RTE_ARCH_PPC_64
-	cur_len = 0;
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_used(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-#endif
 }
 
 static int
@@ -1436,21 +1415,30 @@ vfio_type1_dma_map(int vfio_container_fd)
 	return rte_memseg_walk(type1_map, &vfio_container_fd);
 }
 
+/* Track the size of the statically allocated DMA window for SPAPR */
+uint64_t spapr_dma_win_len;
+uint64_t spapr_dma_win_page_sz;
+
 static int
 vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		uint64_t len, int do_map)
 {
-	struct vfio_iommu_type1_dma_map dma_map;
-	struct vfio_iommu_type1_dma_unmap dma_unmap;
-	int ret;
 	struct vfio_iommu_spapr_register_memory reg = {
 		.argsz = sizeof(reg),
+		.vaddr = (uintptr_t) vaddr,
+		.size = len,
 		.flags = 0
 	};
-	reg.vaddr = (uintptr_t) vaddr;
-	reg.size = len;
+	int ret;
 
 	if (do_map != 0) {
+		struct vfio_iommu_type1_dma_map dma_map;
+
+		if (iova + len > spapr_dma_win_len) {
+			RTE_LOG(ERR, EAL, "  dma map attempt outside DMA window\n");
+			return -1;
+		}
+
 		ret = ioctl(vfio_container_fd,
 				VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
 		if (ret) {
@@ -1469,24 +1457,14 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
 		if (ret) {
-			/**
-			 * In case the mapping was already done EBUSY will be
-			 * returned from kernel.
-			 */
-			if (errno == EBUSY) {
-				RTE_LOG(DEBUG, EAL,
-					" Memory segment is already mapped,"
-					" skipping");
-			} else {
-				RTE_LOG(ERR, EAL,
-					"  cannot set up DMA remapping,"
-					" error %i (%s)\n", errno,
-					strerror(errno));
-				return -1;
-			}
+			RTE_LOG(ERR, EAL, "  cannot map vaddr for IOMMU, error %i (%s)\n",
+				errno, strerror(errno));
+			return -1;
 		}
 
 	} else {
+		struct vfio_iommu_type1_dma_unmap dma_unmap;
+
 		memset(&dma_unmap, 0, sizeof(dma_unmap));
 		dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
 		dma_unmap.size = len;
@@ -1495,8 +1473,8 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA,
 				&dma_unmap);
 		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
-					errno, strerror(errno));
+			RTE_LOG(ERR, EAL, "  cannot unmap vaddr for IOMMU, error %i (%s)\n",
+				errno, strerror(errno));
 			return -1;
 		}
 
@@ -1504,12 +1482,12 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 				VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, error %i (%s)\n",
-					errno, strerror(errno));
+				errno, strerror(errno));
 			return -1;
 		}
 	}
 
-	return 0;
+	return ret;
 }
 
 static int
@@ -1526,251 +1504,257 @@ vfio_spapr_map_walk(const struct rte_memseg_list *msl,
 	if (ms->iova == RTE_BAD_IOVA)
 		return 0;
 
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 1);
+	return vfio_spapr_dma_do_map(*vfio_container_fd,
+		ms->addr_64, ms->iova, ms->len, 1);
 }
 
+struct spapr_size_walk_param {
+	uint64_t max_va;
+	uint64_t page_sz;
+	bool is_user_managed;
+};
+
+/*
+ * In order to set the DMA window size required for the SPAPR IOMMU
+ * we need to walk the existing virtual memory allocations as well as
+ * find the hugepage size used.
+ */
 static int
-vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
-		const struct rte_memseg *ms, void *arg)
+vfio_spapr_size_walk(const struct rte_memseg_list *msl, void *arg)
 {
-	int *vfio_container_fd = arg;
+	struct spapr_size_walk_param *param = arg;
+	uint64_t max = (uint64_t) msl->base_va + (uint64_t) msl->len;
 
-	/* skip external memory that isn't a heap */
-	if (msl->external && !msl->heap)
+	if (msl->external && !msl->heap) {
+		/* ignore user managed external memory */
+		param->is_user_managed = true;
 		return 0;
+	}
 
-	/* skip any segments with invalid IOVA addresses */
-	if (ms->iova == RTE_BAD_IOVA)
-		return 0;
+	if (max > param->max_va) {
+		param->page_sz = msl->page_sz;
+		param->max_va = max;
+	}
 
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 0);
+	return 0;
 }
 
-struct spapr_walk_param {
-	uint64_t window_size;
-	uint64_t hugepage_sz;
-};
-
+/*
+ * Find the highest memory address used in physical or virtual address
+ * space and use that as the top of the DMA window.
+ */
 static int
-vfio_spapr_window_size_walk(const struct rte_memseg_list *msl,
-		const struct rte_memseg *ms, void *arg)
+find_highest_mem_addr(struct spapr_size_walk_param *param)
 {
-	struct spapr_walk_param *param = arg;
-	uint64_t max = ms->iova + ms->len;
+	/* find the maximum IOVA address for setting the DMA window size */
+	if (rte_eal_iova_mode() == RTE_IOVA_PA) {
+		static const char proc_iomem[] = "/proc/iomem";
+		static const char str_sysram[] = "System RAM";
+		uint64_t start, end, max = 0;
+		char *line = NULL;
+		char *dash, *space;
+		size_t line_len;
 
-	/* skip external memory that isn't a heap */
-	if (msl->external && !msl->heap)
+		/*
+		 * Example "System RAM" in /proc/iomem:
+		 * 00000000-1fffffffff : System RAM
+		 * 200000000000-201fffffffff : System RAM
+		 */
+		FILE *fd = fopen(proc_iomem, "r");
+		if (fd == NULL) {
+			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
+			return -1;
+		}
+		/* Scan /proc/iomem for the highest PA in the system */
+		while (getline(&line, &line_len, fd) != -1) {
+			if (strstr(line, str_sysram) == NULL)
+				continue;
+
+			space = strstr(line, " ");
+			dash = strstr(line, "-");
+
+			/* Validate the format of the memory string */
+			if (space == NULL || dash == NULL || space < dash) {
+				RTE_LOG(ERR, EAL, "Can't parse line \"%s\" in file %s\n",
+					line, proc_iomem);
+				continue;
+			}
+
+			start = strtoull(line, NULL, 16);
+			end   = strtoull(dash + 1, NULL, 16);
+			RTE_LOG(DEBUG, EAL, "Found system RAM from 0x%" PRIx64
+				" to 0x%" PRIx64 "\n", start, end);
+			if (end > max)
+				max = end;
+		}
+		free(line);
+		fclose(fd);
+
+		if (max == 0) {
+			RTE_LOG(ERR, EAL, "Failed to find valid \"System RAM\" "
+				"entry in file %s\n", proc_iomem);
+			return -1;
+		}
+
+		spapr_dma_win_len = rte_align64pow2(max + 1);
 		return 0;
+	} else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
+		RTE_LOG(DEBUG, EAL, "Highest VA address in memseg list is 0x%"
+			PRIx64 "\n", param->max_va);
+		spapr_dma_win_len = rte_align64pow2(param->max_va);
+		return 0;
+	}
 
-	/* skip any segments with invalid IOVA addresses */
-	if (ms->iova == RTE_BAD_IOVA)
+	spapr_dma_win_len = 0;
+	RTE_LOG(ERR, EAL, "Unsupported IOVA mode\n");
+	return -1;
+}
+
+
+/*
+ * The SPAPRv2 IOMMU supports 2 DMA windows with starting
+ * address at 0 or 1<<59.  By default, a DMA window is set
+ * at address 0, 2GB long, with a 4KB page.  For DPDK we
+ * must remove the default window and setup a new DMA window
+ * based on the hugepage size and memory requirements of
+ * the application before we can map memory for DMA.
+ */
+static int
+spapr_dma_win_size(void)
+{
+	struct spapr_size_walk_param param;
+
+	/* only create DMA window once */
+	if (spapr_dma_win_len > 0)
 		return 0;
 
-	if (max > param->window_size) {
-		param->hugepage_sz = ms->hugepage_sz;
-		param->window_size = max;
+	/* walk the memseg list to find the page size/max VA address */
+	memset(&param, 0, sizeof(param));
+	if (rte_memseg_list_walk(vfio_spapr_size_walk, &param) < 0) {
+		RTE_LOG(ERR, EAL, "Failed to walk memseg list for DMA window size\n");
+		return -1;
 	}
 
+	/* we can't be sure if DMA window covers external memory */
+	if (param.is_user_managed)
+		RTE_LOG(WARNING, EAL, "Detected user managed external memory which may not be managed by the IOMMU\n");
+
+	/* check physical/virtual memory size */
+	if (find_highest_mem_addr(&param) < 0)
+		return -1;
+	RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%" PRIx64 "\n",
+		spapr_dma_win_len);
+	spapr_dma_win_page_sz = param.page_sz;
+	rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
 	return 0;
 }
 
 static int
-vfio_spapr_create_new_dma_window(int vfio_container_fd,
-		struct vfio_iommu_spapr_tce_create *create) {
+vfio_spapr_create_dma_window(int vfio_container_fd)
+{
+	struct vfio_iommu_spapr_tce_create create = {
+		.argsz = sizeof(create), };
 	struct vfio_iommu_spapr_tce_remove remove = {
-		.argsz = sizeof(remove),
-	};
+		.argsz = sizeof(remove), };
 	struct vfio_iommu_spapr_tce_info info = {
-		.argsz = sizeof(info),
-	};
+		.argsz = sizeof(info), };
 	int ret;
 
-	/* query spapr iommu info */
+	ret = spapr_dma_win_size();
+	if (ret < 0)
+		return ret;
+
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
 	if (ret) {
-		RTE_LOG(ERR, EAL, "  cannot get iommu info, "
-				"error %i (%s)\n", errno, strerror(errno));
+		RTE_LOG(ERR, EAL, "  can't get iommu info, error %i (%s)\n",
+			errno, strerror(errno));
 		return -1;
 	}
 
-	/* remove default DMA of 32 bit window */
+	/*
+	 * sPAPR v1/v2 IOMMU always has a default 1G DMA window set.  The window
+	 * can't be changed for v1 but it can be changed for v2. Since DPDK only
+	 * supports v2, remove the default DMA window so it can be resized.
+	 */
 	remove.start_addr = info.dma32_window_start;
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
-	if (ret) {
-		RTE_LOG(ERR, EAL, "  cannot remove default DMA window, "
-				"error %i (%s)\n", errno, strerror(errno));
+	if (ret)
 		return -1;
-	}
 
-	/* create new DMA window */
-	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, create);
-	if (ret) {
+	/* create a new DMA window (start address is not selectable) */
+	create.window_size = spapr_dma_win_len;
+	create.page_shift  = __builtin_ctzll(spapr_dma_win_page_sz);
+	create.levels = 1;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 #ifdef VFIO_IOMMU_SPAPR_INFO_DDW
-		/* try possible page_shift and levels for workaround */
+	/*
+	 * The vfio_iommu_spapr_tce_info structure was modified in
+	 * Linux kernel 4.2.0 to add support for the
+	 * vfio_iommu_spapr_tce_ddw_info structure needed to try
+	 * multiple table levels.  Skip the attempt if running with
+	 * an older kernel.
+	 */
+	if (ret) {
+		/* if at first we don't succeed, try more levels */
 		uint32_t levels;
 
-		for (levels = create->levels + 1;
+		for (levels = create.levels + 1;
 			ret && levels <= info.ddw.levels; levels++) {
-			create->levels = levels;
+			create.levels = levels;
 			ret = ioctl(vfio_container_fd,
-				VFIO_IOMMU_SPAPR_TCE_CREATE, create);
-		}
-#endif
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
-					"error %i (%s)\n", errno, strerror(errno));
-			return -1;
+				VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 		}
 	}
-
-	if (create->start_addr != 0) {
-		RTE_LOG(ERR, EAL, "  DMA window start address != 0\n");
+#endif /* VFIO_IOMMU_SPAPR_INFO_DDW */
+	if (ret) {
+		RTE_LOG(ERR, EAL, "  cannot create new DMA window, error %i (%s)\n",
+			errno, strerror(errno));
+		RTE_LOG(ERR, EAL, "  consider using a larger hugepage size "
+			"if supported by the system\n");
 		return -1;
 	}
 
-	return 0;
+	/* verify the start address  */
+	if (create.start_addr != 0) {
+		RTE_LOG(ERR, EAL, "  received unsupported start address 0x%"
+			PRIx64 "\n", (uint64_t)create.start_addr);
+		return -1;
+	}
+	return ret;
 }
 
 static int
-vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
-		uint64_t len, int do_map)
+vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr,
+		uint64_t iova, uint64_t len, int do_map)
 {
-	struct spapr_walk_param param;
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct vfio_config *vfio_cfg;
-	struct user_mem_maps *user_mem_maps;
-	int i, ret = 0;
-
-	vfio_cfg = get_vfio_cfg_by_container_fd(vfio_container_fd);
-	if (vfio_cfg == NULL) {
-		RTE_LOG(ERR, EAL, "  invalid container fd!\n");
-		return -1;
-	}
-
-	user_mem_maps = &vfio_cfg->mem_maps;
-	rte_spinlock_recursive_lock(&user_mem_maps->lock);
-
-	/* check if window size needs to be adjusted */
-	memset(&param, 0, sizeof(param));
-
-	/* we're inside a callback so use thread-unsafe version */
-	if (rte_memseg_walk_thread_unsafe(vfio_spapr_window_size_walk,
-				&param) < 0) {
-		RTE_LOG(ERR, EAL, "Could not get window size\n");
-		ret = -1;
-		goto out;
-	}
-
-	/* also check user maps */
-	for (i = 0; i < user_mem_maps->n_maps; i++) {
-		uint64_t max = user_mem_maps->maps[i].iova +
-				user_mem_maps->maps[i].len;
-		param.window_size = RTE_MAX(param.window_size, max);
-	}
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
+	int ret = 0;
 
 	if (do_map) {
-		/* re-create window and remap the entire memory */
-		if (iova + len > create.window_size) {
-			/* release all maps before recreating the window */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_unmap_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not release DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* release all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						0)) {
-					RTE_LOG(ERR, EAL, "Could not release user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-			create.window_size = rte_align64pow2(iova + len);
-			if (vfio_spapr_create_new_dma_window(vfio_container_fd,
-					&create) < 0) {
-				RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
-				ret = -1;
-				goto out;
-			}
-			/* we're inside a callback, so use thread-unsafe version
-			 */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_map_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not recreate DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* remap all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						1)) {
-					RTE_LOG(ERR, EAL, "Could not recreate user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-		}
-		if (vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 1)) {
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+			vaddr, iova, len, 1)) {
 			RTE_LOG(ERR, EAL, "Failed to map DMA\n");
 			ret = -1;
-			goto out;
 		}
 	} else {
-		/* for unmap, check if iova within DMA window */
-		if (iova > create.window_size) {
-			RTE_LOG(ERR, EAL, "iova beyond DMA window for unmap");
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+			vaddr, iova, len, 0)) {
+			RTE_LOG(ERR, EAL, "Failed to unmap DMA\n");
 			ret = -1;
-			goto out;
 		}
-
-		vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 0);
 	}
-out:
-	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
+
 	return ret;
 }
 
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct spapr_walk_param param;
-
-	memset(&param, 0, sizeof(param));
-
-	/* create DMA window from 0 to max(phys_addr + len) */
-	rte_memseg_walk(vfio_spapr_window_size_walk, &param);
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
-
-	if (vfio_spapr_create_new_dma_window(vfio_container_fd, &create) < 0) {
-		RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
+	if (vfio_spapr_create_dma_window(vfio_container_fd) < 0) {
+		RTE_LOG(ERR, EAL, "Could not create new DMA window!\n");
 		return -1;
 	}
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
+	/* map all existing DPDK segments for DMA */
 	if (rte_memseg_walk(vfio_spapr_map_walk, &vfio_container_fd) < 0)
 		return -1;
 
-- 
2.18.4
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v6 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-09 20:35           ` [dpdk-dev] [PATCH v6 1/1] " David Christensen
@ 2020-11-09 21:10             ` Thomas Monjalon
  0 siblings, 0 replies; 48+ messages in thread
From: Thomas Monjalon @ 2020-11-09 21:10 UTC (permalink / raw)
  To: anatoly.burakov, David Christensen; +Cc: dev, david.marchand
09/11/2020 21:35, David Christensen:
> The SPAPR IOMMU requires that a DMA window size be defined before memory
> can be mapped for DMA. Current code dynamically modifies the DMA window
> size in response to every new memory allocation which is potentially
> dangerous because all existing mappings need to be unmapped/remapped in
> order to resize the DMA window, leaving hardware holding IOVA addresses
> that are temporarily unmapped.  The new SPAPR code statically assigns
> the DMA window size on first use, using the largest physical memory
> address when IOVA=PA and the highest existing memseg virtual
> address when IOVA=VA.
> 
> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
Did you remove Anatoly's ack on purpose?
He must review the patch again?
^ permalink raw reply	[flat|nested] 48+ messages in thread
* [dpdk-dev] [PATCH v7 0/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-09 20:35         ` [dpdk-dev] [PATCH v5 0/1] " David Christensen
  2020-11-09 20:35           ` [dpdk-dev] [PATCH v6 1/1] " David Christensen
@ 2020-11-10 17:41           ` David Christensen
  2020-11-10 17:41             ` [dpdk-dev] [PATCH v7 1/1] " David Christensen
  2020-11-10 17:43           ` [dpdk-dev] [PATCH v7 0/1] " David Christensen
  2 siblings, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-11-10 17:41 UTC (permalink / raw)
  To: dev, anatoly.burakov, david.marchand; +Cc: David Christensen
From: David Christensen <drc@linux.vnet.ibm.com>
The SPAPR v2 IOMMU used on bare-metal PowerNV systems requires that a DMA
window be defined before mapping/unmapping memory.  The current VFIO code
dynamically resizes this DMA window every time a new memory request is
made, which requires that all existing memory be unmapped/remapped.
While this strategy worked in DPDK 17.11 and earlier where memory was
statically allocated during startup, it is potentially dangerous in DPDK
18.11 and later where memory can be allocated during runtime, temporarily
invalidating IOVA memory used by hardware.
This new code statically sizes the DMA window at startup, based on the
amount of memory installed in the system, avoiding the need to unmap
memory during runtime.
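
For readers following the sizing logic, here is a minimal sketch of the
arithmetic described above, not the code from the patch itself. The helper
names (align64pow2, dma_window_len, dma_mask_bits) are hypothetical; the
patch relies on DPDK's rte_align64pow2() and the GCC/Clang builtin
__builtin_ctzll() for the same steps.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for rte_align64pow2(): round v up to the next
 * power of two, leaving v unchanged if it already is one.
 */
static uint64_t
align64pow2(uint64_t v)
{
	v--;
	v |= v >> 1;
	v |= v >> 2;
	v |= v >> 4;
	v |= v >> 8;
	v |= v >> 16;
	v |= v >> 32;
	return v + 1;
}

/*
 * Length of a DMA window that starts at 0 and covers every address up
 * to and including highest_addr. The window size must be a power of two.
 */
static uint64_t
dma_window_len(uint64_t highest_addr)
{
	return align64pow2(highest_addr + 1);
}

/*
 * Number of address bits spanned by a power-of-two window length.
 * Assumes window_len is non-zero; uses a GCC/Clang builtin.
 */
static unsigned int
dma_mask_bits(uint64_t window_len)
{
	return (unsigned int)__builtin_ctzll(window_len);
}
```

With the highest "System RAM" address from the /proc/iomem example in the
patch (0x201fffffffff), this yields a window length of 0x400000000000 and a
46-bit DMA mask.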
---
v7:
- No patch changes, fixed email patch description
v6:
- Fix build error on Linux kernels prior to 4.2.0
- Rebased on 20.11-rc3
v5:
- Modify get_highest_mem_addr to return error, not address
- Add comment regarding sPAPR v1/v2 default window and why it
  needs to be removed
- Added indent to second line of vfio_spapr_dma_mem_map() definition
v4:
- Move file reading code out of vfio_spapr_window_size_walk()
v3:
- Rebase for 20.08
v2:
- Drop patch to wrap ppc64 code with ifdef's
- Add warning when external memory detected
- Change VA memory size detection to scan memseg list when setting DMA window
  for IOVA=VA
- Add explicit error message when attempting to map outside the DMA window
David Christensen (1):
  vfio: modify spapr iommu support to use static window sizing
 lib/librte_eal/linux/eal_vfio.c | 430 +++++++++++++++-----------------
 1 file changed, 207 insertions(+), 223 deletions(-)
-- 
2.18.4
^ permalink raw reply	[flat|nested] 48+ messages in thread
* [dpdk-dev] [PATCH v7 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-10 17:41           ` [dpdk-dev] [PATCH v7 0/1] " David Christensen
@ 2020-11-10 17:41             ` David Christensen
  0 siblings, 0 replies; 48+ messages in thread
From: David Christensen @ 2020-11-10 17:41 UTC (permalink / raw)
  To: dev, anatoly.burakov, david.marchand; +Cc: David Christensen
From: David Christensen <drc@linux.vnet.ibm.com>
The SPAPR IOMMU requires that a DMA window size be defined before memory
can be mapped for DMA. Current code dynamically modifies the DMA window
size in response to every new memory allocation which is potentially
dangerous because all existing mappings need to be unmapped/remapped in
order to resize the DMA window, leaving hardware holding IOVA addresses
that are temporarily unmapped.  The new SPAPR code statically assigns
the DMA window size on first use, using the largest physical memory
address when IOVA=PA and the highest existing memseg virtual
address when IOVA=VA.
Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/linux/eal_vfio.c | 430 +++++++++++++++-----------------
 1 file changed, 207 insertions(+), 223 deletions(-)
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 380f2f44a..050082444 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -18,6 +18,7 @@
 #include "eal_memcfg.h"
 #include "eal_vfio.h"
 #include "eal_private.h"
+#include "eal_internal_cfg.h"
 
 #ifdef VFIO_PRESENT
 
@@ -536,17 +537,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		return;
 	}
 
-#ifdef RTE_ARCH_PPC_64
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_free(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-	cur_len = 0;
-#endif
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
 
@@ -607,17 +597,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 						iova_expected - iova_start, 0);
 		}
 	}
-#ifdef RTE_ARCH_PPC_64
-	cur_len = 0;
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_used(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-#endif
 }
 
 static int
@@ -1436,21 +1415,30 @@ vfio_type1_dma_map(int vfio_container_fd)
 	return rte_memseg_walk(type1_map, &vfio_container_fd);
 }
 
+/* Track the size of the statically allocated DMA window for SPAPR */
+uint64_t spapr_dma_win_len;
+uint64_t spapr_dma_win_page_sz;
+
 static int
 vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		uint64_t len, int do_map)
 {
-	struct vfio_iommu_type1_dma_map dma_map;
-	struct vfio_iommu_type1_dma_unmap dma_unmap;
-	int ret;
 	struct vfio_iommu_spapr_register_memory reg = {
 		.argsz = sizeof(reg),
+		.vaddr = (uintptr_t) vaddr,
+		.size = len,
 		.flags = 0
 	};
-	reg.vaddr = (uintptr_t) vaddr;
-	reg.size = len;
+	int ret;
 
 	if (do_map != 0) {
+		struct vfio_iommu_type1_dma_map dma_map;
+
+		if (iova + len > spapr_dma_win_len) {
+			RTE_LOG(ERR, EAL, "  dma map attempt outside DMA window\n");
+			return -1;
+		}
+
 		ret = ioctl(vfio_container_fd,
 				VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
 		if (ret) {
@@ -1469,24 +1457,14 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
 		if (ret) {
-			/**
-			 * In case the mapping was already done EBUSY will be
-			 * returned from kernel.
-			 */
-			if (errno == EBUSY) {
-				RTE_LOG(DEBUG, EAL,
-					" Memory segment is already mapped,"
-					" skipping");
-			} else {
-				RTE_LOG(ERR, EAL,
-					"  cannot set up DMA remapping,"
-					" error %i (%s)\n", errno,
-					strerror(errno));
-				return -1;
-			}
+			RTE_LOG(ERR, EAL, "  cannot map vaddr for IOMMU, error %i (%s)\n",
+				errno, strerror(errno));
+			return -1;
 		}
 
 	} else {
+		struct vfio_iommu_type1_dma_map dma_unmap;
+
 		memset(&dma_unmap, 0, sizeof(dma_unmap));
 		dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
 		dma_unmap.size = len;
@@ -1495,8 +1473,8 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA,
 				&dma_unmap);
 		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
-					errno, strerror(errno));
+			RTE_LOG(ERR, EAL, "  cannot unmap vaddr for IOMMU, error %i (%s)\n",
+				errno, strerror(errno));
 			return -1;
 		}
 
@@ -1504,12 +1482,12 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 				VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, error %i (%s)\n",
-					errno, strerror(errno));
+				errno, strerror(errno));
 			return -1;
 		}
 	}
 
-	return 0;
+	return ret;
 }
 
 static int
@@ -1526,251 +1504,257 @@ vfio_spapr_map_walk(const struct rte_memseg_list *msl,
 	if (ms->iova == RTE_BAD_IOVA)
 		return 0;
 
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 1);
+	return vfio_spapr_dma_do_map(*vfio_container_fd,
+		ms->addr_64, ms->iova, ms->len, 1);
 }
 
+struct spapr_size_walk_param {
+	uint64_t max_va;
+	uint64_t page_sz;
+	bool is_user_managed;
+};
+
+/*
+ * In order to set the DMA window size required for the SPAPR IOMMU
+ * we need to walk the existing virtual memory allocations as well as
+ * find the hugepage size used.
+ */
 static int
-vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
-		const struct rte_memseg *ms, void *arg)
+vfio_spapr_size_walk(const struct rte_memseg_list *msl, void *arg)
 {
-	int *vfio_container_fd = arg;
+	struct spapr_size_walk_param *param = arg;
+	uint64_t max = (uint64_t) msl->base_va + (uint64_t) msl->len;
 
-	/* skip external memory that isn't a heap */
-	if (msl->external && !msl->heap)
+	if (msl->external && !msl->heap) {
+		/* ignore user managed external memory */
+		param->is_user_managed = true;
 		return 0;
+	}
 
-	/* skip any segments with invalid IOVA addresses */
-	if (ms->iova == RTE_BAD_IOVA)
-		return 0;
+	if (max > param->max_va) {
+		param->page_sz = msl->page_sz;
+		param->max_va = max;
+	}
 
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 0);
+	return 0;
 }
 
-struct spapr_walk_param {
-	uint64_t window_size;
-	uint64_t hugepage_sz;
-};
-
+/*
+ * Find the highest memory address used in physical or virtual address
+ * space and use that as the top of the DMA window.
+ */
 static int
-vfio_spapr_window_size_walk(const struct rte_memseg_list *msl,
-		const struct rte_memseg *ms, void *arg)
+find_highest_mem_addr(struct spapr_size_walk_param *param)
 {
-	struct spapr_walk_param *param = arg;
-	uint64_t max = ms->iova + ms->len;
+	/* find the maximum IOVA address for setting the DMA window size */
+	if (rte_eal_iova_mode() == RTE_IOVA_PA) {
+		static const char proc_iomem[] = "/proc/iomem";
+		static const char str_sysram[] = "System RAM";
+		uint64_t start, end, max = 0;
+		char *line = NULL;
+		char *dash, *space;
+		size_t line_len;
 
-	/* skip external memory that isn't a heap */
-	if (msl->external && !msl->heap)
+		/*
+		 * Example "System RAM" in /proc/iomem:
+		 * 00000000-1fffffffff : System RAM
+		 * 200000000000-201fffffffff : System RAM
+		 */
+		FILE *fd = fopen(proc_iomem, "r");
+		if (fd == NULL) {
+			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
+			return -1;
+		}
+		/* Scan /proc/iomem for the highest PA in the system */
+		while (getline(&line, &line_len, fd) != -1) {
+			if (strstr(line, str_sysram) == NULL)
+				continue;
+
+			space = strstr(line, " ");
+			dash = strstr(line, "-");
+
+			/* Validate the format of the memory string */
+			if (space == NULL || dash == NULL || space < dash) {
+				RTE_LOG(ERR, EAL, "Can't parse line \"%s\" in file %s\n",
+					line, proc_iomem);
+				continue;
+			}
+
+			start = strtoull(line, NULL, 16);
+			end   = strtoull(dash + 1, NULL, 16);
+			RTE_LOG(DEBUG, EAL, "Found system RAM from 0x%" PRIx64
+				" to 0x%" PRIx64 "\n", start, end);
+			if (end > max)
+				max = end;
+		}
+		free(line);
+		fclose(fd);
+
+		if (max == 0) {
+			RTE_LOG(ERR, EAL, "Failed to find valid \"System RAM\" "
+				"entry in file %s\n", proc_iomem);
+			return -1;
+		}
+
+		spapr_dma_win_len = rte_align64pow2(max + 1);
 		return 0;
+	} else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
+		RTE_LOG(DEBUG, EAL, "Highest VA address in memseg list is 0x%"
+			PRIx64 "\n", param->max_va);
+		spapr_dma_win_len = rte_align64pow2(param->max_va);
+		return 0;
+	}
 
-	/* skip any segments with invalid IOVA addresses */
-	if (ms->iova == RTE_BAD_IOVA)
+	spapr_dma_win_len = 0;
+	RTE_LOG(ERR, EAL, "Unsupported IOVA mode\n");
+	return -1;
+}
+
+
+/*
+ * The SPAPRv2 IOMMU supports 2 DMA windows with starting
+ * address at 0 or 1<<59.  By default, a DMA window is set
+ * at address 0, 2GB long, with a 4KB page.  For DPDK we
+ * must remove the default window and setup a new DMA window
+ * based on the hugepage size and memory requirements of
+ * the application before we can map memory for DMA.
+ */
+static int
+spapr_dma_win_size(void)
+{
+	struct spapr_size_walk_param param;
+
+	/* only create DMA window once */
+	if (spapr_dma_win_len > 0)
 		return 0;
 
-	if (max > param->window_size) {
-		param->hugepage_sz = ms->hugepage_sz;
-		param->window_size = max;
+	/* walk the memseg list to find the page size/max VA address */
+	memset(&param, 0, sizeof(param));
+	if (rte_memseg_list_walk(vfio_spapr_size_walk, &param) < 0) {
+		RTE_LOG(ERR, EAL, "Failed to walk memseg list for DMA window size\n");
+		return -1;
 	}
 
+	/* we can't be sure if DMA window covers external memory */
+	if (param.is_user_managed)
+		RTE_LOG(WARNING, EAL, "Detected user managed external memory which may not be managed by the IOMMU\n");
+
+	/* check physical/virtual memory size */
+	if (find_highest_mem_addr(&param) < 0)
+		return -1;
+	RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%" PRIx64 "\n",
+		spapr_dma_win_len);
+	spapr_dma_win_page_sz = param.page_sz;
+	rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
 	return 0;
 }
 
 static int
-vfio_spapr_create_new_dma_window(int vfio_container_fd,
-		struct vfio_iommu_spapr_tce_create *create) {
+vfio_spapr_create_dma_window(int vfio_container_fd)
+{
+	struct vfio_iommu_spapr_tce_create create = {
+		.argsz = sizeof(create), };
 	struct vfio_iommu_spapr_tce_remove remove = {
-		.argsz = sizeof(remove),
-	};
+		.argsz = sizeof(remove), };
 	struct vfio_iommu_spapr_tce_info info = {
-		.argsz = sizeof(info),
-	};
+		.argsz = sizeof(info), };
 	int ret;
 
-	/* query spapr iommu info */
+	ret = spapr_dma_win_size();
+	if (ret < 0)
+		return ret;
+
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
 	if (ret) {
-		RTE_LOG(ERR, EAL, "  cannot get iommu info, "
-				"error %i (%s)\n", errno, strerror(errno));
+		RTE_LOG(ERR, EAL, "  can't get iommu info, error %i (%s)\n",
+			errno, strerror(errno));
 		return -1;
 	}
 
-	/* remove default DMA of 32 bit window */
+	/*
+	 * sPAPR v1/v2 IOMMU always has a default 1G DMA window set.  The window
+	 * can't be changed for v1 but it can be changed for v2. Since DPDK only
+	 * supports v2, remove the default DMA window so it can be resized.
+	 */
 	remove.start_addr = info.dma32_window_start;
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
-	if (ret) {
-		RTE_LOG(ERR, EAL, "  cannot remove default DMA window, "
-				"error %i (%s)\n", errno, strerror(errno));
+	if (ret)
 		return -1;
-	}
 
-	/* create new DMA window */
-	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, create);
-	if (ret) {
+	/* create a new DMA window (start address is not selectable) */
+	create.window_size = spapr_dma_win_len;
+	create.page_shift  = __builtin_ctzll(spapr_dma_win_page_sz);
+	create.levels = 1;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 #ifdef VFIO_IOMMU_SPAPR_INFO_DDW
-		/* try possible page_shift and levels for workaround */
+	/*
+	 * The vfio_iommu_spapr_tce_info structure was modified in
+	 * Linux kernel 4.2.0 to add support for the
+	 * vfio_iommu_spapr_tce_ddw_info structure needed to try
+	 * multiple table levels.  Skip the attempt if running with
+	 * an older kernel.
+	 */
+	if (ret) {
+		/* if at first we don't succeed, try more levels */
 		uint32_t levels;
 
-		for (levels = create->levels + 1;
+		for (levels = create.levels + 1;
 			ret && levels <= info.ddw.levels; levels++) {
-			create->levels = levels;
+			create.levels = levels;
 			ret = ioctl(vfio_container_fd,
-				VFIO_IOMMU_SPAPR_TCE_CREATE, create);
-		}
-#endif
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
-					"error %i (%s)\n", errno, strerror(errno));
-			return -1;
+				VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 		}
 	}
-
-	if (create->start_addr != 0) {
-		RTE_LOG(ERR, EAL, "  DMA window start address != 0\n");
+#endif /* VFIO_IOMMU_SPAPR_INFO_DDW */
+	if (ret) {
+		RTE_LOG(ERR, EAL, "  cannot create new DMA window, error %i (%s)\n",
+			errno, strerror(errno));
+		RTE_LOG(ERR, EAL, "  consider using a larger hugepage size "
+			"if supported by the system\n");
 		return -1;
 	}
 
-	return 0;
+	/* verify the start address  */
+	if (create.start_addr != 0) {
+		RTE_LOG(ERR, EAL, "  received unsupported start address 0x%"
+			PRIx64 "\n", (uint64_t)create.start_addr);
+		return -1;
+	}
+	return ret;
 }
 
 static int
-vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
-		uint64_t len, int do_map)
+vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr,
+		uint64_t iova, uint64_t len, int do_map)
 {
-	struct spapr_walk_param param;
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct vfio_config *vfio_cfg;
-	struct user_mem_maps *user_mem_maps;
-	int i, ret = 0;
-
-	vfio_cfg = get_vfio_cfg_by_container_fd(vfio_container_fd);
-	if (vfio_cfg == NULL) {
-		RTE_LOG(ERR, EAL, "  invalid container fd!\n");
-		return -1;
-	}
-
-	user_mem_maps = &vfio_cfg->mem_maps;
-	rte_spinlock_recursive_lock(&user_mem_maps->lock);
-
-	/* check if window size needs to be adjusted */
-	memset(&param, 0, sizeof(param));
-
-	/* we're inside a callback so use thread-unsafe version */
-	if (rte_memseg_walk_thread_unsafe(vfio_spapr_window_size_walk,
-				&param) < 0) {
-		RTE_LOG(ERR, EAL, "Could not get window size\n");
-		ret = -1;
-		goto out;
-	}
-
-	/* also check user maps */
-	for (i = 0; i < user_mem_maps->n_maps; i++) {
-		uint64_t max = user_mem_maps->maps[i].iova +
-				user_mem_maps->maps[i].len;
-		param.window_size = RTE_MAX(param.window_size, max);
-	}
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
+	int ret = 0;
 
 	if (do_map) {
-		/* re-create window and remap the entire memory */
-		if (iova + len > create.window_size) {
-			/* release all maps before recreating the window */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_unmap_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not release DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* release all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						0)) {
-					RTE_LOG(ERR, EAL, "Could not release user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-			create.window_size = rte_align64pow2(iova + len);
-			if (vfio_spapr_create_new_dma_window(vfio_container_fd,
-					&create) < 0) {
-				RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
-				ret = -1;
-				goto out;
-			}
-			/* we're inside a callback, so use thread-unsafe version
-			 */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_map_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not recreate DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* remap all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						1)) {
-					RTE_LOG(ERR, EAL, "Could not recreate user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-		}
-		if (vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 1)) {
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+			vaddr, iova, len, 1)) {
 			RTE_LOG(ERR, EAL, "Failed to map DMA\n");
 			ret = -1;
-			goto out;
 		}
 	} else {
-		/* for unmap, check if iova within DMA window */
-		if (iova > create.window_size) {
-			RTE_LOG(ERR, EAL, "iova beyond DMA window for unmap");
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+			vaddr, iova, len, 0)) {
+			RTE_LOG(ERR, EAL, "Failed to unmap DMA\n");
 			ret = -1;
-			goto out;
 		}
-
-		vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 0);
 	}
-out:
-	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
+
 	return ret;
 }
 
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct spapr_walk_param param;
-
-	memset(&param, 0, sizeof(param));
-
-	/* create DMA window from 0 to max(phys_addr + len) */
-	rte_memseg_walk(vfio_spapr_window_size_walk, &param);
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
-
-	if (vfio_spapr_create_new_dma_window(vfio_container_fd, &create) < 0) {
-		RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
+	if (vfio_spapr_create_dma_window(vfio_container_fd) < 0) {
+		RTE_LOG(ERR, EAL, "Could not create new DMA window!\n");
 		return -1;
 	}
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
+	/* map all existing DPDK segments for DMA */
 	if (rte_memseg_walk(vfio_spapr_map_walk, &vfio_container_fd) < 0)
 		return -1;
 
-- 
2.18.4
^ permalink raw reply	[flat|nested] 48+ messages in thread
* [dpdk-dev] [PATCH v7 0/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-09 20:35         ` [dpdk-dev] [PATCH v5 0/1] " David Christensen
  2020-11-09 20:35           ` [dpdk-dev] [PATCH v6 1/1] " David Christensen
  2020-11-10 17:41           ` [dpdk-dev] [PATCH v7 0/1] " David Christensen
@ 2020-11-10 17:43           ` David Christensen
  2020-11-10 17:43             ` [dpdk-dev] [PATCH v7 1/1] " David Christensen
  2 siblings, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-11-10 17:43 UTC (permalink / raw)
  To: dev, anatoly.burakov, david.marchand; +Cc: David Christensen
The SPAPR v2 IOMMU used on bare-metal PowerNV systems requires that a DMA
window be defined before mapping/unmapping memory.  The current VFIO code
dynamically resizes this DMA window every time a new memory request is
made, which requires that all existing memory be unmapped/remapped.
While this strategy worked in DPDK 17.11 and earlier where memory was
statically allocated during startup, it is potentially dangerous in DPDK
18.11 and later where memory can be allocated during runtime, temporarily
invalidating IOVA memory used by hardware.
This new code statically sizes the DMA window at startup, based on the
amount of memory installed in the system, avoiding the need to unmap
memory during runtime.
---
v7:
- No patch changes, fixed email patch description
v6:
- Fix build error on Linux kernels prior to 4.2.0
- Rebased on 20.11-rc3
v5:
- Modify get_highest_mem_addr to return error, not address
- Add comment regarding sPAPR v1/v2 default window and why it
  needs to be removed
- Added indent to second line of vfio_spapr_dma_mem_map() definition
v4:
- Move file reading code out of vfio_spapr_window_size_walk()
v3:
- Rebase for 20.08
v2:
- Drop patch to wrap ppc64 code with ifdef's
- Add warning when external memory detected
- Change VA memory size detection to scan memseg list when setting DMA window
  for IOVA=VA
- Add explicit error message when attempting to map outside the DMA window
David Christensen (1):
  vfio: modify spapr iommu support to use static window sizing
 lib/librte_eal/linux/eal_vfio.c | 430 +++++++++++++++-----------------
 1 file changed, 207 insertions(+), 223 deletions(-)
-- 
2.18.4
^ permalink raw reply	[flat|nested] 48+ messages in thread
* [dpdk-dev] [PATCH v7 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-10 17:43           ` [dpdk-dev] [PATCH v7 0/1] " David Christensen
@ 2020-11-10 17:43             ` David Christensen
  2020-11-13  8:39               ` Thomas Monjalon
  0 siblings, 1 reply; 48+ messages in thread
From: David Christensen @ 2020-11-10 17:43 UTC (permalink / raw)
  To: dev, anatoly.burakov, david.marchand; +Cc: David Christensen
The SPAPR IOMMU requires that a DMA window size be defined before memory
can be mapped for DMA. Current code dynamically modifies the DMA window
size in response to every new memory allocation which is potentially
dangerous because all existing mappings need to be unmapped/remapped in
order to resize the DMA window, leaving hardware holding IOVA addresses
that are temporarily unmapped.  The new SPAPR code statically assigns
the DMA window size on first use, using the largest physical memory
address when IOVA=PA and the highest existing memseg virtual
address when IOVA=VA.
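
As an illustration of the IOVA=PA case, the following is a rough sketch of
how the highest physical address can be picked out of "System RAM" lines in
the format /proc/iomem uses. It is a simplified stand-in for the parsing in
the patch, not a copy of it: the function names parse_sysram_line and
highest_sysram_addr are hypothetical, and the lines are passed in as strings
rather than read from the file.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one line of the form "<start>-<end> : System RAM" (hex values).
 * Returns 0 and fills start/end on success, -1 if the line is not a
 * well-formed System RAM entry.
 */
static int
parse_sysram_line(const char *line, uint64_t *start, uint64_t *end)
{
	const char *dash, *space;

	if (strstr(line, "System RAM") == NULL)
		return -1;

	dash = strchr(line, '-');
	space = strchr(line, ' ');

	/* the dash must come before the first space */
	if (dash == NULL || space == NULL || space < dash)
		return -1;

	*start = strtoull(line, NULL, 16);
	*end = strtoull(dash + 1, NULL, 16);
	return 0;
}

/*
 * Return the highest end address found in any System RAM line,
 * or 0 if no such line was found.
 */
static uint64_t
highest_sysram_addr(const char *const *lines, size_t n)
{
	uint64_t start, end, max = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (parse_sysram_line(lines[i], &start, &end) < 0)
			continue;
		if (end > max)
			max = end;
	}
	return max;
}
```

A result of 0 is treated as "no valid entry found", matching the error path
in the patch; the DMA window length is then derived from the highest address
plus one, rounded up to a power of two.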
Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/linux/eal_vfio.c | 430 +++++++++++++++-----------------
 1 file changed, 207 insertions(+), 223 deletions(-)
diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index 380f2f44a..050082444 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -18,6 +18,7 @@
 #include "eal_memcfg.h"
 #include "eal_vfio.h"
 #include "eal_private.h"
+#include "eal_internal_cfg.h"
 
 #ifdef VFIO_PRESENT
 
@@ -536,17 +537,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		return;
 	}
 
-#ifdef RTE_ARCH_PPC_64
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_free(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-	cur_len = 0;
-#endif
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
 
@@ -607,17 +597,6 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 						iova_expected - iova_start, 0);
 		}
 	}
-#ifdef RTE_ARCH_PPC_64
-	cur_len = 0;
-	ms = rte_mem_virt2memseg(addr, msl);
-	while (cur_len < len) {
-		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
-
-		rte_fbarray_set_used(&msl->memseg_arr, idx);
-		cur_len += ms->len;
-		++ms;
-	}
-#endif
 }
 
 static int
@@ -1436,21 +1415,30 @@ vfio_type1_dma_map(int vfio_container_fd)
 	return rte_memseg_walk(type1_map, &vfio_container_fd);
 }
 
+/* Track the size of the statically allocated DMA window for SPAPR */
+uint64_t spapr_dma_win_len;
+uint64_t spapr_dma_win_page_sz;
+
 static int
 vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		uint64_t len, int do_map)
 {
-	struct vfio_iommu_type1_dma_map dma_map;
-	struct vfio_iommu_type1_dma_unmap dma_unmap;
-	int ret;
 	struct vfio_iommu_spapr_register_memory reg = {
 		.argsz = sizeof(reg),
+		.vaddr = (uintptr_t) vaddr,
+		.size = len,
 		.flags = 0
 	};
-	reg.vaddr = (uintptr_t) vaddr;
-	reg.size = len;
+	int ret;
 
 	if (do_map != 0) {
+		struct vfio_iommu_type1_dma_map dma_map;
+
+		if (iova + len > spapr_dma_win_len) {
+			RTE_LOG(ERR, EAL, "  dma map attempt outside DMA window\n");
+			return -1;
+		}
+
 		ret = ioctl(vfio_container_fd,
 				VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
 		if (ret) {
@@ -1469,24 +1457,14 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
 		if (ret) {
-			/**
-			 * In case the mapping was already done EBUSY will be
-			 * returned from kernel.
-			 */
-			if (errno == EBUSY) {
-				RTE_LOG(DEBUG, EAL,
-					" Memory segment is already mapped,"
-					" skipping");
-			} else {
-				RTE_LOG(ERR, EAL,
-					"  cannot set up DMA remapping,"
-					" error %i (%s)\n", errno,
-					strerror(errno));
-				return -1;
-			}
+			RTE_LOG(ERR, EAL, "  cannot map vaddr for IOMMU, error %i (%s)\n",
+				errno, strerror(errno));
+			return -1;
 		}
 
 	} else {
+		struct vfio_iommu_type1_dma_map dma_unmap;
+
 		memset(&dma_unmap, 0, sizeof(dma_unmap));
 		dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
 		dma_unmap.size = len;
@@ -1495,8 +1473,8 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA,
 				&dma_unmap);
 		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
-					errno, strerror(errno));
+			RTE_LOG(ERR, EAL, "  cannot unmap vaddr for IOMMU, error %i (%s)\n",
+				errno, strerror(errno));
 			return -1;
 		}
 
@@ -1504,12 +1482,12 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 				VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, error %i (%s)\n",
-					errno, strerror(errno));
+				errno, strerror(errno));
 			return -1;
 		}
 	}
 
-	return 0;
+	return ret;
 }
 
 static int
@@ -1526,251 +1504,257 @@ vfio_spapr_map_walk(const struct rte_memseg_list *msl,
 	if (ms->iova == RTE_BAD_IOVA)
 		return 0;
 
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 1);
+	return vfio_spapr_dma_do_map(*vfio_container_fd,
+		ms->addr_64, ms->iova, ms->len, 1);
 }
 
+struct spapr_size_walk_param {
+	uint64_t max_va;
+	uint64_t page_sz;
+	bool is_user_managed;
+};
+
+/*
+ * In order to set the DMA window size required for the SPAPR IOMMU
+ * we need to walk the existing virtual memory allocations as well as
+ * find the hugepage size used.
+ */
 static int
-vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
-		const struct rte_memseg *ms, void *arg)
+vfio_spapr_size_walk(const struct rte_memseg_list *msl, void *arg)
 {
-	int *vfio_container_fd = arg;
+	struct spapr_size_walk_param *param = arg;
+	uint64_t max = (uint64_t) msl->base_va + (uint64_t) msl->len;
 
-	/* skip external memory that isn't a heap */
-	if (msl->external && !msl->heap)
+	if (msl->external && !msl->heap) {
+		/* ignore user managed external memory */
+		param->is_user_managed = true;
 		return 0;
+	}
 
-	/* skip any segments with invalid IOVA addresses */
-	if (ms->iova == RTE_BAD_IOVA)
-		return 0;
+	if (max > param->max_va) {
+		param->page_sz = msl->page_sz;
+		param->max_va = max;
+	}
 
-	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			ms->len, 0);
+	return 0;
 }
 
-struct spapr_walk_param {
-	uint64_t window_size;
-	uint64_t hugepage_sz;
-};
-
+/*
+ * Find the highest memory address used in physical or virtual address
+ * space and use that as the top of the DMA window.
+ */
 static int
-vfio_spapr_window_size_walk(const struct rte_memseg_list *msl,
-		const struct rte_memseg *ms, void *arg)
+find_highest_mem_addr(struct spapr_size_walk_param *param)
 {
-	struct spapr_walk_param *param = arg;
-	uint64_t max = ms->iova + ms->len;
+	/* find the maximum IOVA address for setting the DMA window size */
+	if (rte_eal_iova_mode() == RTE_IOVA_PA) {
+		static const char proc_iomem[] = "/proc/iomem";
+		static const char str_sysram[] = "System RAM";
+		uint64_t start, end, max = 0;
+		char *line = NULL;
+		char *dash, *space;
+		size_t line_len;
 
-	/* skip external memory that isn't a heap */
-	if (msl->external && !msl->heap)
+		/*
+		 * Example "System RAM" in /proc/iomem:
+		 * 00000000-1fffffffff : System RAM
+		 * 200000000000-201fffffffff : System RAM
+		 */
+		FILE *fd = fopen(proc_iomem, "r");
+		if (fd == NULL) {
+			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
+			return -1;
+		}
+		/* Scan /proc/iomem for the highest PA in the system */
+		while (getline(&line, &line_len, fd) != -1) {
+			if (strstr(line, str_sysram) == NULL)
+				continue;
+
+			space = strstr(line, " ");
+			dash = strstr(line, "-");
+
+			/* Validate the format of the memory string */
+			if (space == NULL || dash == NULL || space < dash) {
+				RTE_LOG(ERR, EAL, "Can't parse line \"%s\" in file %s\n",
+					line, proc_iomem);
+				continue;
+			}
+
+			start = strtoull(line, NULL, 16);
+			end   = strtoull(dash + 1, NULL, 16);
+			RTE_LOG(DEBUG, EAL, "Found system RAM from 0x%" PRIx64
+				" to 0x%" PRIx64 "\n", start, end);
+			if (end > max)
+				max = end;
+		}
+		free(line);
+		fclose(fd);
+
+		if (max == 0) {
+			RTE_LOG(ERR, EAL, "Failed to find valid \"System RAM\" "
+				"entry in file %s\n", proc_iomem);
+			return -1;
+		}
+
+		spapr_dma_win_len = rte_align64pow2(max + 1);
 		return 0;
+	} else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
+		RTE_LOG(DEBUG, EAL, "Highest VA address in memseg list is 0x%"
+			PRIx64 "\n", param->max_va);
+		spapr_dma_win_len = rte_align64pow2(param->max_va);
+		return 0;
+	}
 
-	/* skip any segments with invalid IOVA addresses */
-	if (ms->iova == RTE_BAD_IOVA)
+	spapr_dma_win_len = 0;
+	RTE_LOG(ERR, EAL, "Unsupported IOVA mode\n");
+	return -1;
+}
+
+
+/*
+ * The SPAPRv2 IOMMU supports 2 DMA windows with starting
+ * address at 0 or 1<<59.  By default, a DMA window is set
+ * at address 0, 2GB long, with a 4KB page.  For DPDK we
+ * must remove the default window and setup a new DMA window
+ * based on the hugepage size and memory requirements of
+ * the application before we can map memory for DMA.
+ */
+static int
+spapr_dma_win_size(void)
+{
+	struct spapr_size_walk_param param;
+
+	/* only create DMA window once */
+	if (spapr_dma_win_len > 0)
 		return 0;
 
-	if (max > param->window_size) {
-		param->hugepage_sz = ms->hugepage_sz;
-		param->window_size = max;
+	/* walk the memseg list to find the page size/max VA address */
+	memset(&param, 0, sizeof(param));
+	if (rte_memseg_list_walk(vfio_spapr_size_walk, &param) < 0) {
+		RTE_LOG(ERR, EAL, "Failed to walk memseg list for DMA window size\n");
+		return -1;
 	}
 
+	/* we can't be sure if DMA window covers external memory */
+	if (param.is_user_managed)
+		RTE_LOG(WARNING, EAL, "Detected user managed external memory which may not be managed by the IOMMU\n");
+
+	/* check physical/virtual memory size */
+	if (find_highest_mem_addr(&param) < 0)
+		return -1;
+	RTE_LOG(DEBUG, EAL, "Setting DMA window size to 0x%" PRIx64 "\n",
+		spapr_dma_win_len);
+	spapr_dma_win_page_sz = param.page_sz;
+	rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
 	return 0;
 }
 
 static int
-vfio_spapr_create_new_dma_window(int vfio_container_fd,
-		struct vfio_iommu_spapr_tce_create *create) {
+vfio_spapr_create_dma_window(int vfio_container_fd)
+{
+	struct vfio_iommu_spapr_tce_create create = {
+		.argsz = sizeof(create), };
 	struct vfio_iommu_spapr_tce_remove remove = {
-		.argsz = sizeof(remove),
-	};
+		.argsz = sizeof(remove), };
 	struct vfio_iommu_spapr_tce_info info = {
-		.argsz = sizeof(info),
-	};
+		.argsz = sizeof(info), };
 	int ret;
 
-	/* query spapr iommu info */
+	ret = spapr_dma_win_size();
+	if (ret < 0)
+		return ret;
+
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
 	if (ret) {
-		RTE_LOG(ERR, EAL, "  cannot get iommu info, "
-				"error %i (%s)\n", errno, strerror(errno));
+		RTE_LOG(ERR, EAL, "  can't get iommu info, error %i (%s)\n",
+			errno, strerror(errno));
 		return -1;
 	}
 
-	/* remove default DMA of 32 bit window */
+	/*
+	 * sPAPR v1/v2 IOMMU always has a default 1G DMA window set.  The window
+	 * can't be changed for v1 but it can be changed for v2. Since DPDK only
+	 * supports v2, remove the default DMA window so it can be resized.
+	 */
 	remove.start_addr = info.dma32_window_start;
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
-	if (ret) {
-		RTE_LOG(ERR, EAL, "  cannot remove default DMA window, "
-				"error %i (%s)\n", errno, strerror(errno));
+	if (ret)
 		return -1;
-	}
 
-	/* create new DMA window */
-	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, create);
-	if (ret) {
+	/* create a new DMA window (start address is not selectable) */
+	create.window_size = spapr_dma_win_len;
+	create.page_shift  = __builtin_ctzll(spapr_dma_win_page_sz);
+	create.levels = 1;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 #ifdef VFIO_IOMMU_SPAPR_INFO_DDW
-		/* try possible page_shift and levels for workaround */
+	/*
+	 * The vfio_iommu_spapr_tce_info structure was modified in
+	 * Linux kernel 4.2.0 to add support for the
+	 * vfio_iommu_spapr_tce_ddw_info structure needed to try
+	 * multiple table levels.  Skip the attempt if running with
+	 * an older kernel.
+	 */
+	if (ret) {
+		/* if at first we don't succeed, try more levels */
 		uint32_t levels;
 
-		for (levels = create->levels + 1;
+		for (levels = create.levels + 1;
 			ret && levels <= info.ddw.levels; levels++) {
-			create->levels = levels;
+			create.levels = levels;
 			ret = ioctl(vfio_container_fd,
-				VFIO_IOMMU_SPAPR_TCE_CREATE, create);
-		}
-#endif
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
-					"error %i (%s)\n", errno, strerror(errno));
-			return -1;
+				VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 		}
 	}
-
-	if (create->start_addr != 0) {
-		RTE_LOG(ERR, EAL, "  DMA window start address != 0\n");
+#endif /* VFIO_IOMMU_SPAPR_INFO_DDW */
+	if (ret) {
+		RTE_LOG(ERR, EAL, "  cannot create new DMA window, error %i (%s)\n",
+			errno, strerror(errno));
+		RTE_LOG(ERR, EAL, "  consider using a larger hugepage size "
+			"if supported by the system\n");
 		return -1;
 	}
 
-	return 0;
+	/* verify the start address  */
+	if (create.start_addr != 0) {
+		RTE_LOG(ERR, EAL, "  received unsupported start address 0x%"
+			PRIx64 "\n", (uint64_t)create.start_addr);
+		return -1;
+	}
+	return ret;
 }
 
 static int
-vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
-		uint64_t len, int do_map)
+vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr,
+		uint64_t iova, uint64_t len, int do_map)
 {
-	struct spapr_walk_param param;
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct vfio_config *vfio_cfg;
-	struct user_mem_maps *user_mem_maps;
-	int i, ret = 0;
-
-	vfio_cfg = get_vfio_cfg_by_container_fd(vfio_container_fd);
-	if (vfio_cfg == NULL) {
-		RTE_LOG(ERR, EAL, "  invalid container fd!\n");
-		return -1;
-	}
-
-	user_mem_maps = &vfio_cfg->mem_maps;
-	rte_spinlock_recursive_lock(&user_mem_maps->lock);
-
-	/* check if window size needs to be adjusted */
-	memset(&param, 0, sizeof(param));
-
-	/* we're inside a callback so use thread-unsafe version */
-	if (rte_memseg_walk_thread_unsafe(vfio_spapr_window_size_walk,
-				&param) < 0) {
-		RTE_LOG(ERR, EAL, "Could not get window size\n");
-		ret = -1;
-		goto out;
-	}
-
-	/* also check user maps */
-	for (i = 0; i < user_mem_maps->n_maps; i++) {
-		uint64_t max = user_mem_maps->maps[i].iova +
-				user_mem_maps->maps[i].len;
-		param.window_size = RTE_MAX(param.window_size, max);
-	}
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
+	int ret = 0;
 
 	if (do_map) {
-		/* re-create window and remap the entire memory */
-		if (iova + len > create.window_size) {
-			/* release all maps before recreating the window */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_unmap_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not release DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* release all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						0)) {
-					RTE_LOG(ERR, EAL, "Could not release user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-			create.window_size = rte_align64pow2(iova + len);
-			if (vfio_spapr_create_new_dma_window(vfio_container_fd,
-					&create) < 0) {
-				RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
-				ret = -1;
-				goto out;
-			}
-			/* we're inside a callback, so use thread-unsafe version
-			 */
-			if (rte_memseg_walk_thread_unsafe(vfio_spapr_map_walk,
-					&vfio_container_fd) < 0) {
-				RTE_LOG(ERR, EAL, "Could not recreate DMA maps\n");
-				ret = -1;
-				goto out;
-			}
-			/* remap all user maps */
-			for (i = 0; i < user_mem_maps->n_maps; i++) {
-				struct user_mem_map *map =
-						&user_mem_maps->maps[i];
-				if (vfio_spapr_dma_do_map(vfio_container_fd,
-						map->addr, map->iova, map->len,
-						1)) {
-					RTE_LOG(ERR, EAL, "Could not recreate user DMA maps\n");
-					ret = -1;
-					goto out;
-				}
-			}
-		}
-		if (vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 1)) {
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+			vaddr, iova, len, 1)) {
 			RTE_LOG(ERR, EAL, "Failed to map DMA\n");
 			ret = -1;
-			goto out;
 		}
 	} else {
-		/* for unmap, check if iova within DMA window */
-		if (iova > create.window_size) {
-			RTE_LOG(ERR, EAL, "iova beyond DMA window for unmap");
+		if (vfio_spapr_dma_do_map(vfio_container_fd,
+			vaddr, iova, len, 0)) {
+			RTE_LOG(ERR, EAL, "Failed to unmap DMA\n");
 			ret = -1;
-			goto out;
 		}
-
-		vfio_spapr_dma_do_map(vfio_container_fd, vaddr, iova, len, 0);
 	}
-out:
-	rte_spinlock_recursive_unlock(&user_mem_maps->lock);
+
 	return ret;
 }
 
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
-	struct vfio_iommu_spapr_tce_create create = {
-		.argsz = sizeof(create),
-	};
-	struct spapr_walk_param param;
-
-	memset(&param, 0, sizeof(param));
-
-	/* create DMA window from 0 to max(phys_addr + len) */
-	rte_memseg_walk(vfio_spapr_window_size_walk, &param);
-
-	/* sPAPR requires window size to be a power of 2 */
-	create.window_size = rte_align64pow2(param.window_size);
-	create.page_shift = __builtin_ctzll(param.hugepage_sz);
-	create.levels = 1;
-
-	if (vfio_spapr_create_new_dma_window(vfio_container_fd, &create) < 0) {
-		RTE_LOG(ERR, EAL, "Could not create new DMA window\n");
+	if (vfio_spapr_create_dma_window(vfio_container_fd) < 0) {
+		RTE_LOG(ERR, EAL, "Could not create new DMA window!\n");
 		return -1;
 	}
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
+	/* map all existing DPDK segments for DMA */
 	if (rte_memseg_walk(vfio_spapr_map_walk, &vfio_container_fd) < 0)
 		return -1;
 
-- 
2.18.4
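
For anyone following the IOVA=PA path in find_highest_mem_addr() above, the
per-line parsing can be tried in isolation.  What follows is only a sketch
under stated assumptions: parse_sysram_line() and highest_sysram_end() are
hypothetical helper names that exist neither in the patch nor in DPDK, and the
input is passed in as an array of strings instead of being read from
/proc/iomem with getline().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse one "start-end : System RAM" line.
 * Returns 0 and fills start/end on success, -1 if the line is not a
 * "System RAM" entry or does not have the expected "start-end " layout.
 */
static int
parse_sysram_line(const char *line, uint64_t *start, uint64_t *end)
{
	const char *dash, *space;

	if (strstr(line, "System RAM") == NULL)
		return -1;

	dash = strchr(line, '-');
	space = strchr(line, ' ');

	/* the dash must come before the first space */
	if (dash == NULL || space == NULL || space < dash)
		return -1;

	/* both addresses are hex without a 0x prefix */
	*start = strtoull(line, NULL, 16);
	*end = strtoull(dash + 1, NULL, 16);
	return 0;
}

/*
 * Hypothetical helper: highest "System RAM" end address in a set of
 * lines, or 0 if no valid entry was found.
 */
static uint64_t
highest_sysram_end(const char *const lines[], size_t n)
{
	uint64_t start, end, max = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (parse_sysram_line(lines[i], &start, &end) < 0)
			continue;
		if (end > max)
			max = end;
	}
	return max;
}
```

The end address in each entry is the last valid byte, not the size, which is
why the patch adds 1 before rounding up to a power of two.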
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dpdk-dev] [PATCH v7 1/1] vfio: modify spapr iommu support to use static window sizing
  2020-11-10 17:43             ` [dpdk-dev] [PATCH v7 1/1] " David Christensen
@ 2020-11-13  8:39               ` Thomas Monjalon
  0 siblings, 0 replies; 48+ messages in thread
From: Thomas Monjalon @ 2020-11-13  8:39 UTC (permalink / raw)
  To: David Christensen; +Cc: dev, anatoly.burakov, david.marchand
10/11/2020 18:43, David Christensen:
> The SPAPR IOMMU requires that a DMA window size be defined before memory
> can be mapped for DMA. Current code dynamically modifies the DMA window
> size in response to every new memory allocation which is potentially
> dangerous because all existing mappings need to be unmapped/remapped in
> order to resize the DMA window, leaving hardware holding IOVA addresses
> that are temporarily unmapped.  The new SPAPR code statically assigns
> the DMA window size on first use, using the largest physical memory
> address when IOVA=PA and the highest existing memseg virtual
> address when IOVA=VA.
> 
> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Applied, thanks
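
The sizing arithmetic described in the commit message can be sketched on its
own.  This is a minimal illustration, not the applied code: align64pow2() is a
local stand-in for DPDK's rte_align64pow2(), while dma_win_len_for_pa(),
pow2_shift() and range_in_window() are hypothetical names introduced here.
The bounds check is also written to avoid overflow in iova + len, which is an
assumption of this sketch and differs from the check in the patch.

```c
#include <stdint.h>

/* Round v up to the next power of two; v is returned unchanged if it
 * already is one.  Local stand-in for rte_align64pow2(). */
static uint64_t
align64pow2(uint64_t v)
{
	v--;
	v |= v >> 1;
	v |= v >> 2;
	v |= v >> 4;
	v |= v >> 8;
	v |= v >> 16;
	v |= v >> 32;
	return v + 1;
}

/* IOVA=PA: highest_pa is the last valid byte address of system RAM,
 * so the window must cover highest_pa + 1 bytes. */
static uint64_t
dma_win_len_for_pa(uint64_t highest_pa)
{
	return align64pow2(highest_pa + 1);
}

/* log2 of a nonzero power of two; the patch uses the same builtin
 * for both the page shift and the DMA mask width. */
static unsigned int
pow2_shift(uint64_t v)
{
	return (unsigned int)__builtin_ctzll(v);
}

/* Return 1 if [iova, iova + len) lies inside a window that starts at 0
 * and is win_len bytes long, without overflowing iova + len. */
static int
range_in_window(uint64_t iova, uint64_t len, uint64_t win_len)
{
	return len <= win_len && iova <= win_len - len;
}
```

With the second /proc/iomem example from the patch comment, a highest address
of 0x201fffffffff gives a window of 0x400000000000 bytes, i.e. a 46-bit DMA
mask; a 1GB hugepage gives a page shift of 30.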
^ permalink raw reply	[flat|nested] 48+ messages in thread
Thread overview: 48+ messages
2020-04-29 23:29 [dpdk-dev] [PATCH 0/2] vfio: change spapr DMA window sizing operation David Christensen
2020-04-29 23:29 ` [dpdk-dev] [PATCH 1/2] vfio: use ifdef's for ppc64 spapr code David Christensen
2020-04-30 11:14   ` Burakov, Anatoly
2020-04-30 16:22     ` David Christensen
2020-04-30 16:24       ` Burakov, Anatoly
2020-04-30 17:38         ` David Christensen
2020-05-01  8:49           ` Burakov, Anatoly
2020-04-29 23:29 ` [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing David Christensen
2020-04-30 11:34   ` Burakov, Anatoly
2020-04-30 17:36     ` David Christensen
2020-05-01  9:06       ` Burakov, Anatoly
2020-05-01 16:48         ` David Christensen
2020-05-05 14:57           ` Burakov, Anatoly
2020-05-05 16:26             ` David Christensen
2020-05-06 10:18               ` Burakov, Anatoly
2020-06-30 21:38 ` [dpdk-dev] [PATCH v2 0/1] vfio: change spapr DMA window sizing operation David Christensen
2020-06-30 21:38   ` [dpdk-dev] [PATCH v2 1/1] vfio: modify spapr iommu support to use static window sizing David Christensen
2020-08-10 21:07   ` [dpdk-dev] [PATCH v3 0/1] vfio: change spapr DMA window sizing operation David Christensen
2020-08-10 21:07     ` [dpdk-dev] [PATCH v3 1/1] vfio: modify spapr iommu support to use static window sizing David Christensen
2020-09-03 18:55       ` David Christensen
2020-09-17 11:13       ` Burakov, Anatoly
2020-10-07 12:49         ` Thomas Monjalon
2020-10-07 17:44         ` David Christensen
2020-10-08  9:39           ` Burakov, Anatoly
2020-10-12 19:19             ` David Christensen
2020-10-14  9:27               ` Burakov, Anatoly
2020-10-15 17:23     ` [dpdk-dev] [PATCH v4 0/1] vfio: change spapr DMA window sizing operation David Christensen
2020-10-15 17:23       ` [dpdk-dev] [PATCH v4 1/1] vfio: modify spapr iommu support to use static window sizing David Christensen
2020-10-20 12:05         ` Thomas Monjalon
2020-10-29 21:30           ` Thomas Monjalon
2020-11-02 11:04         ` Burakov, Anatoly
2020-11-03 22:05       ` [dpdk-dev] [PATCH v5 0/1] " David Christensen
2020-11-03 22:05         ` [dpdk-dev] [PATCH v5 1/1] " David Christensen
2020-11-04 19:43           ` Thomas Monjalon
2020-11-04 21:00             ` David Christensen
2020-11-04 21:02               ` Thomas Monjalon
2020-11-04 22:25                 ` David Christensen
2020-11-05  7:12                   ` Thomas Monjalon
2020-11-06 22:16                     ` David Christensen
2020-11-07  9:58                       ` Thomas Monjalon
2020-11-09 20:35         ` [dpdk-dev] [PATCH v5 0/1] " David Christensen
2020-11-09 20:35           ` [dpdk-dev] [PATCH v6 1/1] " David Christensen
2020-11-09 21:10             ` Thomas Monjalon
2020-11-10 17:41           ` [dpdk-dev] [PATCH v7 0/1] " David Christensen
2020-11-10 17:41             ` [dpdk-dev] [PATCH v7 1/1] " David Christensen
2020-11-10 17:43           ` [dpdk-dev] [PATCH v7 0/1] " David Christensen
2020-11-10 17:43             ` [dpdk-dev] [PATCH v7 1/1] " David Christensen
2020-11-13  8:39               ` Thomas Monjalon