DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [RFC 0/3] Make device mapping more reliable
@ 2018-05-31 10:57 Anatoly Burakov
  2018-05-31 10:57 ` [dpdk-dev] [RFC 1/3] fbarray: allow zero-sized elements Anatoly Burakov
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Anatoly Burakov @ 2018-05-31 10:57 UTC (permalink / raw)
  To: dev
  Cc: thomas, hemant.agrawal, bruce.richardson, ferruh.yigit,
	konstantin.ananyev, jerin.jacob, olivier.matz, stephen, nhorman,
	david.marchand, gowrishankar.m

Currently, memory for device maps is allocated ad-hoc, by calculating
end of VA space allocated for hugepages and crossing fingers in hopes that
those addresses will be free in primary and secondary processes. This leads
to situations such as this:

EAL: Detected 88 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket_178323_8af2229603de4
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:81:00.0 on NUMA socket 1
EAL:   probe driver: 8086:1563 net_ixgbe
EAL: Cannot mmap device resource file /sys/bus/pci/devices/0000:81:00.0/resource0 to address: 0x7ff7f5800000
EAL: Requested device 0000:81:00.0 cannot be used
EAL: Error - exiting with code: 1
  Cause: No Ethernet ports - bye

As can be seen from the above log, secondary process has initialized
successfully, but device BAR mapping has failed, which resulted in missing ports
in the secondary process.

This patchset is an attempt to fix this problem once and for all, by using
the same method we use for memory to do device mappings as well. That is,
by preallocating all of the device memory in advance, so that initialization
either succeeds and allows for device mappings, or it fails outright (whereas
currently we may be in an in-between kind of situation, where init has
succeeded but device mappings have failed).

This change breaks the ABI, so it is not for this release. However, i'd like
to hear feedback on the approach and whether there are potential problems with
other buses/use cases that i didn't think of.

Anatoly Burakov (3):
  fbarray: allow zero-sized elements
  mem: add device memory reserve/free API
  bus/pci: use the new device memory API for BAR mapping

 drivers/bus/pci/linux/pci_init.h              |   1 -
 drivers/bus/pci/linux/pci_uio.c               |  11 +-
 drivers/bus/pci/linux/pci_vfio.c              |  27 +-
 lib/librte_eal/common/eal_common_fbarray.c    |  10 +-
 lib/librte_eal/common/eal_common_memory.c     | 270 ++++++++++++++++--
 .../common/include/rte_eal_memconfig.h        |  18 ++
 lib/librte_eal/common/include/rte_memory.h    |  40 +++
 lib/librte_pci/Makefile                       |   1 +
 lib/librte_pci/rte_pci.c                      |  20 +-
 9 files changed, 350 insertions(+), 48 deletions(-)

-- 
2.17.0

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [dpdk-dev] [RFC 1/3] fbarray: allow zero-sized elements
  2018-05-31 10:57 [dpdk-dev] [RFC 0/3] Make device mapping more reliable Anatoly Burakov
@ 2018-05-31 10:57 ` Anatoly Burakov
  2018-05-31 10:57 ` [dpdk-dev] [RFC 2/3] mem: add device memory reserve/free API Anatoly Burakov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Anatoly Burakov @ 2018-05-31 10:57 UTC (permalink / raw)
  To: dev
  Cc: thomas, hemant.agrawal, bruce.richardson, ferruh.yigit,
	konstantin.ananyev, jerin.jacob, olivier.matz, stephen, nhorman,
	david.marchand, gowrishankar.m

We need to keep usage of our memory area indexed, but we don't
actually need to store any data - we need just the indexing
capabilities of fbarray. Yet, it currently disallows zero-sized
elements. Fix that by removing the check for zero-sized elements -
the rest will work correctly already.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/eal_common_fbarray.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index 019f84c18..4a365e7ce 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -391,9 +391,9 @@ set_used(struct rte_fbarray *arr, unsigned int idx, bool used)
 }
 
 static int
-fully_validate(const char *name, unsigned int elt_sz, unsigned int len)
+fully_validate(const char *name, unsigned int len)
 {
-	if (name == NULL || elt_sz == 0 || len == 0 || len > INT_MAX) {
+	if (name == NULL || len == 0 || len > INT_MAX) {
 		rte_errno = EINVAL;
 		return -1;
 	}
@@ -420,7 +420,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		return -1;
 	}
 
-	if (fully_validate(name, elt_sz, len))
+	if (fully_validate(name, len))
 		return -1;
 
 	page_sz = sysconf(_SC_PAGESIZE);
@@ -511,7 +511,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	 * the array, so the parts we care about will not race.
 	 */
 
-	if (fully_validate(arr->name, arr->elt_sz, arr->len))
+	if (fully_validate(arr->name, arr->len))
 		return -1;
 
 	page_sz = sysconf(_SC_PAGESIZE);
@@ -858,7 +858,7 @@ rte_fbarray_dump_metadata(struct rte_fbarray *arr, FILE *f)
 		return;
 	}
 
-	if (fully_validate(arr->name, arr->elt_sz, arr->len)) {
+	if (fully_validate(arr->name, arr->len)) {
 		fprintf(f, "Invalid file-backed array\n");
 		goto out;
 	}
-- 
2.17.0

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [dpdk-dev] [RFC 2/3] mem: add device memory reserve/free API
  2018-05-31 10:57 [dpdk-dev] [RFC 0/3] Make device mapping more reliable Anatoly Burakov
  2018-05-31 10:57 ` [dpdk-dev] [RFC 1/3] fbarray: allow zero-sized elements Anatoly Burakov
@ 2018-05-31 10:57 ` Anatoly Burakov
  2018-05-31 10:57 ` [dpdk-dev] [RFC 3/3] bus/pci: use the new device memory API for BAR mapping Anatoly Burakov
  2018-08-14 10:13 ` [dpdk-dev] [RFC 0/3] Make device mapping more reliable Burakov, Anatoly
  3 siblings, 0 replies; 5+ messages in thread
From: Anatoly Burakov @ 2018-05-31 10:57 UTC (permalink / raw)
  To: dev
  Cc: thomas, hemant.agrawal, bruce.richardson, ferruh.yigit,
	konstantin.ananyev, jerin.jacob, olivier.matz, stephen, nhorman,
	david.marchand, gowrishankar.m

In order for hotplug in multiprocess to work reliably, we will need
a common shared memory area that is guaranteed to be accessible to all
processes at all times. This is accomplished by pre-reserving memory
that will be used for device mappings at startup, and managing it
at runtime.

Two new API calls are added: alloc and free of device memory. Once
allocation is requested, memory is considered to be reserved until it
is freed back using the same API. Usage of which blocks are occupied is
tracked using shared fbarray. This allows us to give out device memory
piecemeal and lessen fragmentation.

Naturally, this adds a limitation of how much device memory DPDK can
use. This is currently set to 2 gigabytes, but will be adjustable in
later revisions.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/eal_common_memory.c     | 270 ++++++++++++++++--
 .../common/include/rte_eal_memconfig.h        |  18 ++
 lib/librte_eal/common/include/rte_memory.h    |  40 +++
 3 files changed, 312 insertions(+), 16 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index 4f0688f9d..8cae9b354 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -33,6 +33,7 @@
  */
 
 #define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
+#define DEVICE_MEMORY_NAME "device_memory"
 
 static uint64_t baseaddr_offset;
 static uint64_t system_page_sz;
@@ -904,6 +905,227 @@ rte_memseg_list_walk(rte_memseg_list_walk_t func, void *arg)
 	return ret;
 }
 
+void * __rte_experimental
+rte_mem_dev_memory_alloc(size_t size, size_t align)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_fbarray *arr = &mcfg->device_memory.mem_map_arr;
+	unsigned int n_pages, page_align;
+	int start_idx, cur_idx;
+	void *addr = NULL;
+
+	/* check parameters first */
+	if (size == 0 || (size & (system_page_sz - 1)) != 0) {
+		RTE_LOG(ERR, EAL, "%s(): size is not page-aligned\n",
+				__func__);
+		rte_errno = EINVAL;
+		return NULL;
+	}
+	if ((align & (system_page_sz - 1)) != 0) {
+		RTE_LOG(ERR, EAL, "%s(): alignment is not page-aligned\n",
+			__func__);
+		rte_errno = EINVAL;
+		return NULL;
+	}
+	/* PCI BAR sizes can only be powers of two, but this memory may be used
+	 * for more than just PCI BAR mappings, so only check if alignment is
+	 * power of two.
+	 */
+	if (align != 0 && !rte_is_power_of_2(align)) {
+		RTE_LOG(ERR, EAL, "%s(): alignment is not a power of two\n",
+			__func__);
+		rte_errno = EINVAL;
+		return NULL;
+	}
+	/* check if device memory map is uninitialized. */
+	if (mcfg->device_memory.base_va == NULL || arr->len == 0) {
+		RTE_LOG(ERR, EAL, "%s(): device memory map is not initialized\n",
+			__func__);
+		rte_errno = ENODEV;
+		return NULL;
+	}
+
+	n_pages = size / system_page_sz;
+	page_align = align / system_page_sz;
+
+	/* lock the device memory map */
+	rte_spinlock_lock(&mcfg->device_memory.lock);
+
+	start_idx = 0;
+	while (1) {
+		size_t offset;
+		int end;
+
+		cur_idx = rte_fbarray_find_next_n_free(arr, start_idx, n_pages);
+		if (cur_idx < 0)
+			break;
+
+		/* if there are alignment requirements, check if the offset we
+		 * found is aligned, and if not, align it and check if we still
+		 * have enough space.
+		 */
+		if (page_align != 0 && (cur_idx & (page_align - 1)) != 0) {
+			unsigned int aligned, len;
+
+			aligned = RTE_ALIGN_CEIL(cur_idx, page_align);
+			len = rte_fbarray_find_contig_free(arr, aligned);
+
+			/* if there's not enough space, keep looking */
+			if (len < n_pages) {
+				start_idx = aligned + len;
+				continue;
+			}
+
+			/* we've found space */
+			cur_idx = aligned;
+		}
+		end = cur_idx + n_pages;
+		offset = cur_idx * system_page_sz;
+		addr = RTE_PTR_ADD(mcfg->device_memory.base_va,
+				offset);
+
+		/* now, mark all space as occupied */
+		for (; cur_idx < end; cur_idx++)
+			rte_fbarray_set_used(arr, cur_idx);
+		break;
+	}
+	rte_spinlock_unlock(&mcfg->device_memory.lock);
+
+	if (addr != NULL)
+		RTE_LOG(DEBUG, EAL, "%s(): allocated %p-%p (%lu bytes) for hardware device usage\n",
+			__func__, addr, RTE_PTR_ADD(addr, size), size);
+
+	return addr;
+}
+
+int __rte_experimental
+rte_mem_dev_memory_free(void *addr, size_t size)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_fbarray *arr = &mcfg->device_memory.mem_map_arr;
+	int cur_idx, end, ret, n_pages, len;
+	void *map_end, *mem_end;
+
+	/* check parameters first */
+	if (size == 0 || (size & (system_page_sz - 1)) != 0) {
+		RTE_LOG(ERR, EAL, "%s(): size is not page-aligned\n",
+				__func__);
+		rte_errno = EINVAL;
+		return -1;
+	}
+	/* check if device memory map is uninitialized. */
+	if (mcfg->device_memory.base_va == NULL || arr->len == 0) {
+		RTE_LOG(ERR, EAL, "%s(): device memory map is not initialized\n",
+			__func__);
+		rte_errno = ENODEV;
+		return -1;
+	}
+	map_end = RTE_PTR_ADD(mcfg->device_memory.base_va,
+			arr->len * system_page_sz);
+	mem_end = RTE_PTR_ADD(addr, size);
+
+	/* check if address is within the memory map */
+	if (addr < mcfg->device_memory.base_va || addr >= map_end ||
+			mem_end > map_end) {
+		RTE_LOG(ERR, EAL, "%s(): address is beyond device memory map range\n",
+				__func__);
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	rte_spinlock_lock(&mcfg->device_memory.lock);
+
+	n_pages = size / system_page_sz;
+	cur_idx = RTE_PTR_DIFF(addr, mcfg->device_memory.base_va) /
+			system_page_sz;
+	end = cur_idx + n_pages;
+
+	/* check all space we will be marking as free is currently occupied */
+	len = rte_fbarray_find_contig_used(arr, cur_idx);
+	if (len < n_pages) {
+		RTE_LOG(ERR, EAL, "%s(): attempting to free unoccupied space\n",
+			__func__);
+		rte_errno = EINVAL;
+		ret = -1;
+		goto unlock;
+	}
+	/* now, mark all space as free */
+	for (; cur_idx < end; cur_idx++)
+		rte_fbarray_set_free(arr, cur_idx);
+
+	/* success */
+	ret = 0;
+
+	RTE_LOG(DEBUG, EAL, "%s(): deallocated %p-%p (%lu bytes) for hardware device usage\n",
+		__func__, addr, RTE_PTR_ADD(addr, size), size);
+unlock:
+	rte_spinlock_unlock(&mcfg->device_memory.lock);
+	return ret;
+}
+
+static int
+dev_memory_init(void)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	size_t size;
+	void *addr;
+	int retval;
+
+	if (system_page_sz == 0)
+		system_page_sz = sysconf(_SC_PAGESIZE);
+
+	size = (size_t) 2 << 30;
+
+	addr = eal_get_virtual_area(NULL, &size, system_page_sz, 0, 0);
+	if (addr == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot reserve device memory\n");
+		return -1;
+	}
+
+	retval = rte_fbarray_init(&mcfg->device_memory.mem_map_arr,
+			DEVICE_MEMORY_NAME, size / system_page_sz, 0);
+	if (retval < 0) {
+		RTE_LOG(ERR, EAL, "Cannot initialize device memory map\n");
+		return -1;
+	}
+	mcfg->device_memory.base_va = addr;
+	rte_spinlock_init(&mcfg->device_memory.lock);
+	return 0;
+}
+
+static int
+dev_memory_attach(void)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	size_t size;
+	void *addr;
+	int retval;
+
+	rte_spinlock_lock(&mcfg->device_memory.lock);
+
+	if (system_page_sz == 0)
+		system_page_sz = sysconf(_SC_PAGESIZE);
+
+	size = mcfg->device_memory.mem_map_arr.len * system_page_sz;
+
+	addr = eal_get_virtual_area(mcfg->device_memory.base_va, &size,
+			system_page_sz, 0, 0);
+	if (addr == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot reserve device memory\n");
+		return -1;
+	}
+
+	retval = rte_fbarray_attach(&mcfg->device_memory.mem_map_arr);
+	if (retval < 0) {
+		RTE_LOG(ERR, EAL, "Cannot attach to device memory map\n");
+		return -1;
+	}
+
+	rte_spinlock_unlock(&mcfg->device_memory.lock);
+
+	return 0;
+}
+
 /* init memory subsystem */
 int
 rte_eal_memory_init(void)
@@ -918,25 +1140,41 @@ rte_eal_memory_init(void)
 	/* lock mem hotplug here, to prevent races while we init */
 	rte_rwlock_read_lock(&mcfg->memory_hotplug_lock);
 
-	retval = rte_eal_process_type() == RTE_PROC_PRIMARY ?
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+		retval = dev_memory_init();
+		if (retval < 0)
+			goto fail;
+
 #ifndef RTE_ARCH_64
-			memseg_primary_init_32() :
+		retval = memseg_primary_init_32();
 #else
-			memseg_primary_init() :
+		retval = memseg_primary_init();
 #endif
-			memseg_secondary_init();
-
-	if (retval < 0)
-		goto fail;
-
-	if (eal_memalloc_init() < 0)
-		goto fail;
-
-	retval = rte_eal_process_type() == RTE_PROC_PRIMARY ?
-			rte_eal_hugepage_init() :
-			rte_eal_hugepage_attach();
-	if (retval < 0)
-		goto fail;
+		if (retval < 0)
+			goto fail;
+
+		if (eal_memalloc_init() < 0)
+			goto fail;
+
+		retval = rte_eal_hugepage_init();
+		if (retval < 0)
+			goto fail;
+	} else {
+		retval = dev_memory_attach();
+		if (retval < 0)
+			goto fail;
+
+		retval = memseg_secondary_init();
+		if (retval < 0)
+			goto fail;
+
+		if (eal_memalloc_init() < 0)
+			goto fail;
+
+		retval = rte_eal_hugepage_attach();
+		if (retval < 0)
+			goto fail;
+	}
 
 	if (internal_config.no_shconf == 0 && rte_eal_memdevice_init() < 0)
 		goto fail;
diff --git a/lib/librte_eal/common/include/rte_eal_memconfig.h b/lib/librte_eal/common/include/rte_eal_memconfig.h
index aff0688dd..a8e7c39ff 100644
--- a/lib/librte_eal/common/include/rte_eal_memconfig.h
+++ b/lib/librte_eal/common/include/rte_eal_memconfig.h
@@ -36,6 +36,22 @@ struct rte_memseg_list {
 	struct rte_fbarray memseg_arr;
 };
 
+/**
+ * mem map is a special case because we need to store a bunch of other data
+ * together with the array itself.
+ */
+struct rte_mem_map {
+	RTE_STD_C11
+	union {
+		void *base_va;
+		/**< Base virtual address for this mem map. */
+		uint64_t addr_64;
+		/**< Makes sure addr is always 64-bits */
+	};
+	rte_spinlock_t lock;
+	struct rte_fbarray mem_map_arr;
+};
+
 /**
  * the structure for the memory configuration for the RTE.
  * Used by the rte_config structure. It is separated out, as for multi-process
@@ -68,6 +84,8 @@ struct rte_mem_config {
 	struct rte_memseg_list memsegs[RTE_MAX_MEMSEG_LISTS];
 	/**< list of dynamic arrays holding memsegs */
 
+	struct rte_mem_map device_memory; /**< Occupancy map of preallocated device memory */
+
 	struct rte_tailq_head tailq_head[RTE_MAX_TAILQ]; /**< Tailqs for objects */
 
 	/* Heaps of Malloc per socket */
diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h
index aab9f6fe5..4cf58bd2a 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -445,6 +445,46 @@ rte_mem_alloc_validator_register(const char *name,
 int __rte_experimental
 rte_mem_alloc_validator_unregister(const char *name, int socket_id);
 
+/**
+ * @brief Request memory for device mapping.
+ *
+ * @note after this call, reserved memory will be marked as unavailable in all
+ *       processes until it is released, even if it goes unused.
+ *
+ * @param size
+ *   Size of memory to request.
+ *
+ * @param align
+ *   Alignment of memory to be returned.
+ *
+ * @return
+ *   Valid pointer on successful fulfillment of request.
+ *   NULL on unsuccessful fulfillment of request, with rte_errno indicating the
+ *   case of error.
+ */
+void * __rte_experimental
+rte_mem_dev_memory_alloc(size_t size, size_t align);
+
+/**
+ * @brief Release memory for device mapping.
+ *
+ * @note by the time this call is made, memory region being freed must not be in
+ *       use.
+ *
+ * @param addr
+ *   Address of previously requested block of memory.
+ *
+ * @param size
+ *   Size of memory to request.
+ *
+ * @return
+ *   0 on successful memory release.
+ *   -1 on unsuccessful memory release, with rte_errno indicating the cause of
+ *   error.
+ */
+int __rte_experimental
+rte_mem_dev_memory_free(void *addr, size_t size);
+
 #ifdef __cplusplus
 }
 #endif
-- 
2.17.0

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [dpdk-dev] [RFC 3/3] bus/pci: use the new device memory API for BAR mapping
  2018-05-31 10:57 [dpdk-dev] [RFC 0/3] Make device mapping more reliable Anatoly Burakov
  2018-05-31 10:57 ` [dpdk-dev] [RFC 1/3] fbarray: allow zero-sized elements Anatoly Burakov
  2018-05-31 10:57 ` [dpdk-dev] [RFC 2/3] mem: add device memory reserve/free API Anatoly Burakov
@ 2018-05-31 10:57 ` Anatoly Burakov
  2018-08-14 10:13 ` [dpdk-dev] [RFC 0/3] Make device mapping more reliable Burakov, Anatoly
  3 siblings, 0 replies; 5+ messages in thread
From: Anatoly Burakov @ 2018-05-31 10:57 UTC (permalink / raw)
  To: dev
  Cc: Ferruh Yigit, Gaetan Rivet, thomas, hemant.agrawal,
	bruce.richardson, konstantin.ananyev, jerin.jacob, olivier.matz,
	stephen, nhorman, david.marchand, gowrishankar.m

Adjust PCI infrastructure to reserve device memory through the
new device memory API. Any hotplug event will reserve memory, any
hot-unplug event will release memory back to the system.

This allows for more reliable PCI mappings in secondary processes,
and will be crucial to support multiprocess hotplug.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/bus/pci/linux/pci_init.h |  1 -
 drivers/bus/pci/linux/pci_uio.c  | 11 +----------
 drivers/bus/pci/linux/pci_vfio.c | 27 ++++++++++++---------------
 lib/librte_pci/Makefile          |  1 +
 lib/librte_pci/rte_pci.c         | 20 +++++++++++++++++++-
 5 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/drivers/bus/pci/linux/pci_init.h b/drivers/bus/pci/linux/pci_init.h
index c2e603a37..bc9279c66 100644
--- a/drivers/bus/pci/linux/pci_init.h
+++ b/drivers/bus/pci/linux/pci_init.h
@@ -14,7 +14,6 @@
 /*
  * Helper function to map PCI resources right after hugepages in virtual memory
  */
-extern void *pci_map_addr;
 void *pci_find_max_end_va(void);
 
 /* parse one line of the "resource" sysfs file (note that the 'line'
diff --git a/drivers/bus/pci/linux/pci_uio.c b/drivers/bus/pci/linux/pci_uio.c
index d423e4bb0..dbf108b6f 100644
--- a/drivers/bus/pci/linux/pci_uio.c
+++ b/drivers/bus/pci/linux/pci_uio.c
@@ -26,8 +26,6 @@
 #include "eal_filesystem.h"
 #include "pci_init.h"
 
-void *pci_map_addr = NULL;
-
 #define OFF_MAX              ((uint64_t)(off_t)-1)
 
 int
@@ -316,19 +314,12 @@ pci_uio_map_resource_by_index(struct rte_pci_device *dev, int res_idx,
 		goto error;
 	}
 
-	/* try mapping somewhere close to the end of hugepages */
-	if (pci_map_addr == NULL)
-		pci_map_addr = pci_find_max_end_va();
-
-	mapaddr = pci_map_resource(pci_map_addr, fd, 0,
+	mapaddr = pci_map_resource(NULL, fd, 0,
 			(size_t)dev->mem_resource[res_idx].len, 0);
 	close(fd);
 	if (mapaddr == MAP_FAILED)
 		goto error;
 
-	pci_map_addr = RTE_PTR_ADD(mapaddr,
-			(size_t)dev->mem_resource[res_idx].len);
-
 	maps[map_idx].phaddr = dev->mem_resource[res_idx].phys_addr;
 	maps[map_idx].size = dev->mem_resource[res_idx].len;
 	maps[map_idx].addr = mapaddr;
diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index aeeaa9ed8..f390ea37a 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -324,7 +324,7 @@ pci_rte_vfio_setup_device(struct rte_pci_device *dev, int vfio_dev_fd)
 
 static int
 pci_vfio_mmap_bar(int vfio_dev_fd, struct mapped_pci_resource *vfio_res,
-		int bar_index, int additional_flags)
+		int bar_index)
 {
 	struct memreg {
 		unsigned long offset, size;
@@ -371,9 +371,14 @@ pci_vfio_mmap_bar(int vfio_dev_fd, struct mapped_pci_resource *vfio_res,
 		memreg[0].size = bar->size;
 	}
 
-	/* reserve the address using an inaccessible mapping */
-	bar_addr = mmap(bar->addr, bar->size, 0, MAP_PRIVATE |
-			MAP_ANONYMOUS | additional_flags, -1, 0);
+	if (bar->addr == NULL) {
+		bar_addr = rte_mem_dev_memory_alloc(bar->size, 0);
+		if (bar_addr == NULL) {
+			RTE_LOG(ERR, EAL, "%s(): cannot reserve space for device\n",
+				__func__);
+			return -1;
+		}
+	}
 	if (bar_addr != MAP_FAILED) {
 		void *map_addr = NULL;
 		if (memreg[0].size) {
@@ -469,7 +474,6 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 
 	for (i = 0; i < (int) vfio_res->nb_maps; i++) {
 		struct vfio_region_info reg = { .argsz = sizeof(reg) };
-		void *bar_addr;
 
 		reg.index = i;
 
@@ -494,19 +498,12 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 		if ((reg.flags & VFIO_REGION_INFO_FLAG_MMAP) == 0)
 			continue;
 
-		/* try mapping somewhere close to the end of hugepages */
-		if (pci_map_addr == NULL)
-			pci_map_addr = pci_find_max_end_va();
-
-		bar_addr = pci_map_addr;
-		pci_map_addr = RTE_PTR_ADD(bar_addr, (size_t) reg.size);
-
-		maps[i].addr = bar_addr;
+		maps[i].addr = NULL;
 		maps[i].offset = reg.offset;
 		maps[i].size = reg.size;
 		maps[i].path = NULL; /* vfio doesn't have per-resource paths */
 
-		ret = pci_vfio_mmap_bar(vfio_dev_fd, vfio_res, i, 0);
+		ret = pci_vfio_mmap_bar(vfio_dev_fd, vfio_res, i);
 		if (ret < 0) {
 			RTE_LOG(ERR, EAL, "  %s mapping BAR%i failed: %s\n",
 					pci_addr, i, strerror(errno));
@@ -574,7 +571,7 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
 	maps = vfio_res->maps;
 
 	for (i = 0; i < (int) vfio_res->nb_maps; i++) {
-		ret = pci_vfio_mmap_bar(vfio_dev_fd, vfio_res, i, MAP_FIXED);
+		ret = pci_vfio_mmap_bar(vfio_dev_fd, vfio_res, i);
 		if (ret < 0) {
 			RTE_LOG(ERR, EAL, "  %s mapping BAR%i failed: %s\n",
 					pci_addr, i, strerror(errno));
diff --git a/lib/librte_pci/Makefile b/lib/librte_pci/Makefile
index 94a632670..f996fe33c 100644
--- a/lib/librte_pci/Makefile
+++ b/lib/librte_pci/Makefile
@@ -8,6 +8,7 @@ LIB = librte_pci.a
 
 CFLAGS := -I$(SRCDIR) $(CFLAGS)
 CFLAGS += $(WERROR_FLAGS) -O3
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 LDLIBS += -lrte_eal
 
 EXPORT_MAP := rte_pci_version.map
diff --git a/lib/librte_pci/rte_pci.c b/lib/librte_pci/rte_pci.c
index 530738dbd..c425a624e 100644
--- a/lib/librte_pci/rte_pci.c
+++ b/lib/librte_pci/rte_pci.c
@@ -151,6 +151,16 @@ pci_map_resource(void *requested_addr, int fd, off_t offset, size_t size,
 {
 	void *mapaddr;
 
+	if (requested_addr == NULL) {
+		requested_addr = rte_mem_dev_memory_alloc(size, 0);
+		if (requested_addr == NULL) {
+			RTE_LOG(ERR, EAL, "%s(): cannot reserve space for device\n",
+				__func__);
+			return MAP_FAILED;
+		}
+	}
+	additional_flags |= MAP_FIXED;
+
 	/* Map the PCI memory resource of device */
 	mapaddr = mmap(requested_addr, size, PROT_READ | PROT_WRITE,
 			MAP_SHARED | additional_flags, fd, offset);
@@ -170,15 +180,23 @@ pci_map_resource(void *requested_addr, int fd, off_t offset, size_t size,
 void
 pci_unmap_resource(void *requested_addr, size_t size)
 {
+	void *mapped;
 	if (requested_addr == NULL)
 		return;
 
+	mapped = mmap(requested_addr, size, PROT_READ,
+			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
 	/* Unmap the PCI memory resource of device */
-	if (munmap(requested_addr, size)) {
+	if (mapped == MAP_FAILED) {
 		RTE_LOG(ERR, EAL, "%s(): cannot munmap(%p, %#zx): %s\n",
 			__func__, requested_addr, size,
 			strerror(errno));
 	} else
 		RTE_LOG(DEBUG, EAL, "  PCI memory unmapped at %p\n",
 				requested_addr);
+	if (rte_mem_dev_memory_free(requested_addr, size))
+		RTE_LOG(ERR, EAL, "%s(): cannot mark %p-%p as free\n",
+			__func__, requested_addr,
+			RTE_PTR_ADD(requested_addr, size));
 }
-- 
2.17.0

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [dpdk-dev] [RFC 0/3] Make device mapping more reliable
  2018-05-31 10:57 [dpdk-dev] [RFC 0/3] Make device mapping more reliable Anatoly Burakov
                   ` (2 preceding siblings ...)
  2018-05-31 10:57 ` [dpdk-dev] [RFC 3/3] bus/pci: use the new device memory API for BAR mapping Anatoly Burakov
@ 2018-08-14 10:13 ` Burakov, Anatoly
  3 siblings, 0 replies; 5+ messages in thread
From: Burakov, Anatoly @ 2018-08-14 10:13 UTC (permalink / raw)
  To: dev
  Cc: thomas, hemant.agrawal, bruce.richardson, ferruh.yigit,
	konstantin.ananyev, jerin.jacob, olivier.matz, stephen, nhorman,
	david.marchand, gowrishankar.m

On 31-May-18 11:57 AM, Anatoly Burakov wrote:
> Currently, memory for device maps is allocated ad-hoc, by calculating
> end of VA space allocated for hugepages and crossing fingers in hopes that
> those addresses will be free in primary and secondary processes. This leads
> to situations such as this:
> 
> EAL: Detected 88 lcore(s)
> EAL: Detected 2 NUMA nodes
> EAL: Multi-process socket /var/run/dpdk/rte/mp_socket_178323_8af2229603de4
> EAL: Probing VFIO support...
> EAL: VFIO support initialized
> EAL: PCI device 0000:81:00.0 on NUMA socket 1
> EAL:   probe driver: 8086:1563 net_ixgbe
> EAL: Cannot mmap device resource file /sys/bus/pci/devices/0000:81:00.0/resource0 to address: 0x7ff7f5800000
> EAL: Requested device 0000:81:00.0 cannot be used
> EAL: Error - exiting with code: 1
>    Cause: No Ethernet ports - bye
> 
> As can be seen from the above log, secondary process has initialized
> successfully, but device BAR mapping has failed, which resulted in missing ports
> in the secondary process.
> 
> This patchset is an attempt to fix this problem once and for all, by using
> the same method we use for memory to do device mappings as well. That is,
> by preallocating all of the device memory in advance, so that initialization
> either succeeds and allows for device mappings, or it fails outright (whereas
> currently we may be in an in-between kind of situation, where init has
> succeeded but device mappings have failed).
> 
> This change breaks the ABI, so it is not for this release. However, i'd like
> to hear feedback on the approach and whether there are potential problems with
> other buses/use cases that i didn't think of.
> 

I would like to draw attention to the RFC, now that the release has gone 
out. We have a deprecation notice for ABI break due to external memory 
work, so we now have an opportunity to implement this for 18.11.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-08-14 10:13 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-31 10:57 [dpdk-dev] [RFC 0/3] Make device mapping more reliable Anatoly Burakov
2018-05-31 10:57 ` [dpdk-dev] [RFC 1/3] fbarray: allow zero-sized elements Anatoly Burakov
2018-05-31 10:57 ` [dpdk-dev] [RFC 2/3] mem: add device memory reserve/free API Anatoly Burakov
2018-05-31 10:57 ` [dpdk-dev] [RFC 3/3] bus/pci: use the new device memory API for BAR mapping Anatoly Burakov
2018-08-14 10:13 ` [dpdk-dev] [RFC 0/3] Make device mapping more reliable Burakov, Anatoly

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).