DPDK patches and discussions
* [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
@ 2017-12-19 11:14 Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 01/23] eal: move get_virtual_area out of linuxapp eal_memory.c Anatoly Burakov
                   ` (27 more replies)
  0 siblings, 28 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

This patchset introduces a prototype implementation of dynamic memory allocation
for DPDK. It is intended to start a conversation and build consensus on the best
way to implement this functionality. The patchset works well enough to pass all
unit tests, and to work with traffic forwarding, provided the device drivers are
adjusted to ensure contiguous memory allocation where it matters.

The vast majority of changes are in the EAL and malloc, so the external API
disruption is minimal: a new set of APIs is added for contiguous memory
allocation (for rte_malloc and rte_memzone), along with a few API additions
in rte_memory. Every other API change is internal to the EAL, and all memory
allocation/freeing is handled through rte_malloc, with no externally visible
API changes, aside from the call to get the physmem layout, which no longer
makes sense given that there are multiple memseg lists.

Quick outline of all changes done as part of this patchset:

 * Malloc heap adjusted to handle holes in address space
 * Single memseg list replaced by multiple expandable memseg lists
 * VA space for hugepages is preallocated in advance
 * Added dynamic alloc/free for pages, happening as needed on malloc/free
 * Added contiguous memory allocation APIs for rte_malloc and rte_memzone
 * Integrated Pawel Wodkowski's patch [1] for registering/unregistering memory
   with VFIO

The biggest difference is that a "memseg" now represents a single page (as
opposed to being a big contiguous block of pages). As a consequence, both
memzones and malloc elements are no longer guaranteed to be physically
contiguous, unless the user explicitly asks for it. To preserve whatever
functionality depended on the previous behavior, a legacy memory option is
also provided, although it is expected to be a temporary solution. The
drivers weren't adjusted in this patchset, and it is expected that whoever
tests the drivers with this patchset will modify the relevant drivers to
support the new set of APIs. Basic testing with traffic forwarding was
performed, both with UIO and VFIO, and no performance degradation was
observed.
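
As an illustration of what explicitly asking for contiguous memory could
look like from the application side, here is a minimal sketch. The name
rte_malloc_contig() is only a placeholder for the contiguous rte_malloc API
added later in this series; the actual name and signature are defined in the
relevant patch.

    /* placeholder sketch -- rte_malloc_contig() stands in for the
     * contiguous allocation API introduced later in this series */
    uint16_t *hw_ring = rte_malloc_contig("hw_ring",
            512 * sizeof(uint16_t), /* size in bytes */
            4096);                  /* alignment */
    if (hw_ring == NULL)
        rte_panic("no physically contiguous memory available\n");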

Why multiple memseg lists instead of one? It makes things easier on a number
of fronts. Since a memseg is now a single page, the list will get quite big,
and we need to locate pages somehow when we allocate and free them. We could
of course just walk the list and allocate one contiguous chunk of VA space
for memsegs, but I chose to use separate lists instead, to speed up many
operations on the list.

It would be great to see discussion within the community on the following
points, regarding both the current implementation and future work:

 * Any suggestions to improve the current implementation. The whole system
   with multiple memseg lists is kind of unwieldy, so maybe there are better
   ways to do the same thing. Maybe use a single list after all? We're not
   expecting malloc/free on the hot path, so maybe it doesn't matter that we
   have to walk a list of potentially thousands of pages?
 * Pluggable memory allocators. Right now, allocators are hardcoded, but
   down the line it would be great to have custom allocators (e.g. for
   externally allocated memory). I've tried to keep the memalloc API minimal
   and generic enough to be easy to change later, but suggestions are
   welcome. Memory drivers, with ops for alloc/free etc.?
 * Memory tagging. This is related to previous item. Right now, we can only ask
   malloc to allocate memory by page size, but one could potentially have
   different memory regions backed by pages of similar sizes (for example,
   locked 1G pages, to completely avoid TLB misses, alongside regular 1G pages),
   and it would be good to have that kind of mechanism to distinguish between
   different memory types available to a DPDK application. One could, for example,
   tag memory by "purpose" (i.e. "fast", "slow"), or in other ways.
 * Secondary process implementation, in particular when it comes to
   allocating/freeing new memory. The current plan is to use the RPC
   mechanism proposed by Jianfeng [2] to communicate between primary and
   secondary processes, but other suggestions are welcome.
 * Support for non-hugepage memory. This work is planned down the line.
   Aside from the obvious concerns about physical addresses, 4K pages are
   small and would eat up enormous amounts of memseg list space (tracking
   4 GB at 4K granularity takes roughly a million entries, versus about two
   thousand at 2 MB granularity), so my proposal would be to allocate 4K
   pages in bigger blocks (say, 2MB).
 * 32-bit support. The current implementation lacks it, and I don't see a
   trivial way to make it work if we are to preallocate huge chunks of VA
   space in advance. We could limit it to 1G per page size, but even that,
   on multiple sockets, won't work well, and we can't know in advance what
   kind of memory the user will try to allocate. Drop it? Leave it in legacy
   mode only?
 * Preallocation. Right now, malloc will free any and all memory that it
   can, which could lead to a (perhaps counterintuitive) situation where a
   user starts DPDK with --socket-mem=1024,1024, does a single "rte_free"
   and loses all of the preallocated memory in the process. Would
   preallocating memory *and keeping it no matter what* be a valid use case?
   E.g. if DPDK was run without any memory requirements specified, grow and
   shrink as needed; but if DPDK was asked to preallocate memory, we can
   grow but can't shrink past the preallocated amount?

Any other feedback about things I didn't think of or missed is greatly
appreciated.

[1] http://dpdk.org/dev/patchwork/patch/24484/
[2] http://dpdk.org/dev/patchwork/patch/31838/

Anatoly Burakov (23):
  eal: move get_virtual_area out of linuxapp eal_memory.c
  eal: add function to report number of detected sockets
  eal: add rte_fbarray
  eal: move all locking to heap
  eal: protect malloc heap stats with a lock
  eal: make malloc a doubly-linked list
  eal: make malloc_elem_join_adjacent_free public
  eal: add "single file segments" command-line option
  eal: add "legacy memory" option
  eal: read hugepage counts from node-specific sysfs path
  eal: replace memseg with memseg lists
  eal: add support for dynamic memory allocation
  eal: make use of dynamic memory allocation for init
  eal: add support for dynamic unmapping of pages
  eal: add API to check if memory is physically contiguous
  eal: enable dynamic memory allocation/free on malloc/free
  eal: add backend support for contiguous memory allocation
  eal: add rte_malloc support for allocating contiguous memory
  eal: enable reserving physically contiguous memzones
  eal: make memzones use rte_fbarray
  mempool: add support for the new memory allocation methods
  vfio: allow to map other memory regions
  eal: map/unmap memory with VFIO when alloc/free pages

 config/common_base                                |   5 +-
 drivers/bus/pci/linux/pci.c                       |  29 +-
 drivers/net/ena/ena_ethdev.c                      |  10 +-
 drivers/net/virtio/virtio_user/vhost_kernel.c     | 106 ++--
 lib/librte_eal/common/Makefile                    |   2 +-
 lib/librte_eal/common/eal_common_fbarray.c        | 585 ++++++++++++++++++++++
 lib/librte_eal/common/eal_common_lcore.c          |  11 +
 lib/librte_eal/common/eal_common_memalloc.c       |  79 +++
 lib/librte_eal/common/eal_common_memory.c         | 315 +++++++++++-
 lib/librte_eal/common/eal_common_memzone.c        | 250 ++++++---
 lib/librte_eal/common/eal_common_options.c        |   8 +
 lib/librte_eal/common/eal_filesystem.h            |  13 +
 lib/librte_eal/common/eal_hugepages.h             |   1 +
 lib/librte_eal/common/eal_internal_cfg.h          |   6 +
 lib/librte_eal/common/eal_memalloc.h              |  55 ++
 lib/librte_eal/common/eal_options.h               |   4 +
 lib/librte_eal/common/eal_private.h               |  29 ++
 lib/librte_eal/common/include/rte_eal.h           |   1 +
 lib/librte_eal/common/include/rte_eal_memconfig.h |  26 +-
 lib/librte_eal/common/include/rte_fbarray.h       |  98 ++++
 lib/librte_eal/common/include/rte_lcore.h         |   8 +
 lib/librte_eal/common/include/rte_malloc.h        | 181 +++++++
 lib/librte_eal/common/include/rte_malloc_heap.h   |   6 +
 lib/librte_eal/common/include/rte_memory.h        |  16 +
 lib/librte_eal/common/include/rte_memzone.h       | 158 ++++++
 lib/librte_eal/common/malloc_elem.c               | 411 ++++++++++++---
 lib/librte_eal/common/malloc_elem.h               |  30 +-
 lib/librte_eal/common/malloc_heap.c               | 433 ++++++++++++++--
 lib/librte_eal/common/malloc_heap.h               |  14 +-
 lib/librte_eal/common/rte_malloc.c                | 139 +++--
 lib/librte_eal/linuxapp/eal/Makefile              |   4 +
 lib/librte_eal/linuxapp/eal/eal.c                 |  23 +-
 lib/librte_eal/linuxapp/eal/eal_hugepage_info.c   |  73 ++-
 lib/librte_eal/linuxapp/eal/eal_memalloc.c        | 556 ++++++++++++++++++++
 lib/librte_eal/linuxapp/eal/eal_memory.c          | 452 ++++++++++-------
 lib/librte_eal/linuxapp/eal/eal_vfio.c            | 280 ++++++++---
 lib/librte_eal/linuxapp/eal/eal_vfio.h            |  11 +
 lib/librte_mempool/rte_mempool.c                  |  84 +++-
 test/test/test_malloc.c                           |  29 +-
 test/test/test_memory.c                           |  44 +-
 test/test/test_memzone.c                          |  17 +-
 41 files changed, 3999 insertions(+), 603 deletions(-)
 create mode 100755 lib/librte_eal/common/eal_common_fbarray.c
 create mode 100755 lib/librte_eal/common/eal_common_memalloc.c
 create mode 100755 lib/librte_eal/common/eal_memalloc.h
 create mode 100755 lib/librte_eal/common/include/rte_fbarray.h
 create mode 100755 lib/librte_eal/linuxapp/eal/eal_memalloc.c

-- 
2.7.4


* [dpdk-dev] [RFC v2 01/23] eal: move get_virtual_area out of linuxapp eal_memory.c
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 02/23] eal: add function to report number of detected sockets Anatoly Burakov
                   ` (26 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

Move get_virtual_area out of linuxapp EAL memory and make it
common to the EAL, so that other code can reserve virtual areas
as well.
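
For context, here is a minimal sketch of how an EAL-internal caller would
use the relocated function, based on the signature and flags introduced
below (the 1 GB figure is arbitrary):

    /* reserve up to 1 GB of VA space, shrinking in 2 MB steps if the
     * full amount cannot be mapped; *size is updated when shrinking */
    uint64_t size = 1ULL << 30;
    void *va;

    va = eal_get_virtual_area(NULL, &size, RTE_PGSIZE_2M,
            EAL_VIRTUAL_AREA_ALLOW_SHRINK);
    if (va == NULL)
        return -1; /* could not reserve any VA space */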

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/eal_common_memory.c | 70 ++++++++++++++++++++++++++++++
 lib/librte_eal/common/eal_private.h       | 29 +++++++++++++
 lib/librte_eal/linuxapp/eal/eal_memory.c  | 71 ++-----------------------------
 3 files changed, 102 insertions(+), 68 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index fc6c44d..96570a7 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -31,6 +31,8 @@
  *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
+#include <errno.h>
+#include <string.h>
 #include <stdio.h>
 #include <stdint.h>
 #include <stdlib.h>
@@ -49,6 +51,74 @@
 #include "eal_internal_cfg.h"
 
 /*
+ * Try to mmap *size bytes in /dev/zero. If it is successful, return the
+ * pointer to the mmap'd area and keep *size unmodified. Else, retry
+ * with a smaller zone: decrease *size by hugepage_sz until it reaches
+ * 0. In this case, return NULL. Note: this function returns an address
+ * which is a multiple of hugepage size.
+ */
+
+static uint64_t baseaddr_offset;
+
+void *
+eal_get_virtual_area(void *requested_addr, uint64_t *size,
+		uint64_t page_sz, int flags)
+{
+	bool addr_is_hint, allow_shrink;
+	void *addr;
+
+	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);
+
+	addr_is_hint = (flags & EAL_VIRTUAL_AREA_ADDR_IS_HINT) > 0;
+	allow_shrink = (flags & EAL_VIRTUAL_AREA_ALLOW_SHRINK) > 0;
+
+	if (requested_addr == NULL && internal_config.base_virtaddr != 0) {
+		requested_addr = (void*) (internal_config.base_virtaddr +
+				baseaddr_offset);
+		addr_is_hint = true;
+	}
+
+	do {
+		// TODO: we may not necessarily be using memory mapped by this
+		// function for hugepage mapping, so... HUGETLB flag?
+
+		addr = mmap(requested_addr,
+				(*size) + page_sz, PROT_READ,
+#ifdef RTE_ARCH_PPC_64
+				MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
+#else
+				MAP_PRIVATE | MAP_ANONYMOUS,
+#endif
+				-1, 0);
+		if (addr == MAP_FAILED && allow_shrink)
+			*size -= page_sz;
+	} while (allow_shrink && addr == MAP_FAILED && *size > 0);
+
+	if (addr == MAP_FAILED) {
+		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
+			strerror(errno));
+		return NULL;
+	} else if (requested_addr != NULL && !addr_is_hint &&
+			addr != requested_addr) {
+		RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p\n",
+			requested_addr);
+		munmap(addr, (*size) + page_sz);
+		return NULL;
+	}
+
+	/* align addr to page size boundary */
+	addr = RTE_PTR_ALIGN(addr, page_sz);
+
+	RTE_LOG(DEBUG, EAL, "Virtual area found at %p (size = 0x%zx)\n",
+		addr, *size);
+
+	baseaddr_offset += *size;
+
+	return addr;
+}
+
+
+/*
  * Return a pointer to a read-only table of struct rte_physmem_desc
  * elements, containing the layout of all addressable physical
  * memory. The last element of the table contains a NULL address.
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 462226f..5d57fc1 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -34,6 +34,7 @@
 #ifndef _EAL_PRIVATE_H_
 #define _EAL_PRIVATE_H_
 
+#include <stdint.h>
 #include <stdbool.h>
 #include <stdint.h>
 #include <stdio.h>
@@ -224,4 +225,32 @@ int rte_eal_hugepage_attach(void);
  */
 struct rte_bus *rte_bus_find_by_device_name(const char *str);
 
+/**
+ * Get virtual area of specified size from the OS.
+ *
+ * This function is private to the EAL.
+ *
+ * @param requested_addr
+ *   Address where to request address space.
+ * @param size
+ *   Size of requested area.
+ * @param page_sz
+ *   Page size on which to align requested virtual area.
+ * @param flags
+ *   EAL_VIRTUAL_AREA_* flags.
+ *
+ * @return
+ *   Virtual area address if successful.
+ *   NULL if unsuccessful.
+ */
+
+#define EAL_VIRTUAL_AREA_ADDR_IS_HINT 0x1
+/**< don't fail if cannot get exact requested address */
+#define EAL_VIRTUAL_AREA_ALLOW_SHRINK 0x2
+/**< try getting smaller sized (decrement by page size) virtual areas if cannot
+ * get area of requested size. */
+void *
+eal_get_virtual_area(void *requested_addr, uint64_t *size,
+		uint64_t page_sz, int flags);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 16a181c..dd18d98 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -86,8 +86,6 @@
  * zone as well as a physical contiguous zone.
  */
 
-static uint64_t baseaddr_offset;
-
 static bool phys_addrs_available = true;
 
 #define RANDOMIZE_VA_SPACE_FILE "/proc/sys/kernel/randomize_va_space"
@@ -250,71 +248,6 @@ aslr_enabled(void)
 	}
 }
 
-/*
- * Try to mmap *size bytes in /dev/zero. If it is successful, return the
- * pointer to the mmap'd area and keep *size unmodified. Else, retry
- * with a smaller zone: decrease *size by hugepage_sz until it reaches
- * 0. In this case, return NULL. Note: this function returns an address
- * which is a multiple of hugepage size.
- */
-static void *
-get_virtual_area(size_t *size, size_t hugepage_sz)
-{
-	void *addr;
-	int fd;
-	long aligned_addr;
-
-	if (internal_config.base_virtaddr != 0) {
-		addr = (void*) (uintptr_t) (internal_config.base_virtaddr +
-				baseaddr_offset);
-	}
-	else addr = NULL;
-
-	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);
-
-	fd = open("/dev/zero", O_RDONLY);
-	if (fd < 0){
-		RTE_LOG(ERR, EAL, "Cannot open /dev/zero\n");
-		return NULL;
-	}
-	do {
-		addr = mmap(addr,
-				(*size) + hugepage_sz, PROT_READ,
-#ifdef RTE_ARCH_PPC_64
-				MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
-#else
-				MAP_PRIVATE,
-#endif
-				fd, 0);
-		if (addr == MAP_FAILED)
-			*size -= hugepage_sz;
-	} while (addr == MAP_FAILED && *size > 0);
-
-	if (addr == MAP_FAILED) {
-		close(fd);
-		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
-			strerror(errno));
-		return NULL;
-	}
-
-	munmap(addr, (*size) + hugepage_sz);
-	close(fd);
-
-	/* align addr to a huge page size boundary */
-	aligned_addr = (long)addr;
-	aligned_addr += (hugepage_sz - 1);
-	aligned_addr &= (~(hugepage_sz - 1));
-	addr = (void *)(aligned_addr);
-
-	RTE_LOG(DEBUG, EAL, "Virtual area found at %p (size = 0x%zx)\n",
-		addr, *size);
-
-	/* increment offset */
-	baseaddr_offset += *size;
-
-	return addr;
-}
-
 static sigjmp_buf huge_jmpenv;
 
 static void huge_sigbus_handler(int signo __rte_unused)
@@ -463,7 +396,9 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi,
 			/* get the biggest virtual memory area up to
 			 * vma_len. If it fails, vma_addr is NULL, so
 			 * let the kernel provide the address. */
-			vma_addr = get_virtual_area(&vma_len, hpi->hugepage_sz);
+			vma_addr = eal_get_virtual_area(NULL, &vma_len,
+					hpi->hugepage_sz,
+					EAL_VIRTUAL_AREA_ALLOW_SHRINK);
 			if (vma_addr == NULL)
 				vma_len = hugepage_sz;
 		}
-- 
2.7.4


* [dpdk-dev] [RFC v2 02/23] eal: add function to report number of detected sockets
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 01/23] eal: move get_virtual_area out of linuxapp eal_memory.c Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 03/23] eal: add rte_fbarray Anatoly Burakov
                   ` (25 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

At the moment, if we want to find out how many sockets we actually
have, we always rely on scanning every socket up to RTE_MAX_NUMA_NODES
and checking if there's a memseg associated with it. This becomes a
problem when a socket may have memory that is not allocated yet, so
instead we do the detection during the lcore scan and store the value
for later use.
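
A minimal usage sketch of the new API (detection happens during
rte_eal_cpu_init(), as shown in the diff below):

    #include <stdio.h>
    #include <rte_lcore.h>

    static void print_numa_nodes(void)
    {
        unsigned int i;

        /* numa_node_count is filled in during the lcore scan */
        for (i = 0; i < rte_num_sockets(); i++)
            printf("NUMA node %u detected\n", i);
    }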

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/eal_common_lcore.c  | 11 +++++++++++
 lib/librte_eal/common/include/rte_eal.h   |  1 +
 lib/librte_eal/common/include/rte_lcore.h |  8 ++++++++
 3 files changed, 20 insertions(+)

diff --git a/lib/librte_eal/common/eal_common_lcore.c b/lib/librte_eal/common/eal_common_lcore.c
index 0db1555..7566a6b 100644
--- a/lib/librte_eal/common/eal_common_lcore.c
+++ b/lib/librte_eal/common/eal_common_lcore.c
@@ -57,6 +57,7 @@ rte_eal_cpu_init(void)
 	struct rte_config *config = rte_eal_get_configuration();
 	unsigned lcore_id;
 	unsigned count = 0;
+	unsigned max_socket_id = 0;
 
 	/*
 	 * Parse the maximum set of logical cores, detect the subset of running
@@ -100,6 +101,8 @@ rte_eal_cpu_init(void)
 				lcore_id, lcore_config[lcore_id].core_id,
 				lcore_config[lcore_id].socket_id);
 		count++;
+		max_socket_id = RTE_MAX(max_socket_id,
+					lcore_config[lcore_id].socket_id);
 	}
 	/* Set the count of enabled logical cores of the EAL configuration */
 	config->lcore_count = count;
@@ -108,5 +111,13 @@ rte_eal_cpu_init(void)
 		RTE_MAX_LCORE);
 	RTE_LOG(INFO, EAL, "Detected %u lcore(s)\n", config->lcore_count);
 
+	config->numa_node_count = max_socket_id + 1;
+	RTE_LOG(INFO, EAL, "Detected %u NUMA nodes\n", config->numa_node_count);
+
 	return 0;
 }
+
+unsigned rte_num_sockets(void) {
+	const struct rte_config *config = rte_eal_get_configuration();
+	return config->numa_node_count;
+}
diff --git a/lib/librte_eal/common/include/rte_eal.h b/lib/librte_eal/common/include/rte_eal.h
index 8e4e71c..5b12914 100644
--- a/lib/librte_eal/common/include/rte_eal.h
+++ b/lib/librte_eal/common/include/rte_eal.h
@@ -83,6 +83,7 @@ enum rte_proc_type_t {
 struct rte_config {
 	uint32_t master_lcore;       /**< Id of the master lcore */
 	uint32_t lcore_count;        /**< Number of available logical cores. */
+	uint32_t numa_node_count;    /**< Number of detected NUMA nodes. */
 	uint32_t service_lcore_count;/**< Number of available service cores. */
 	enum rte_lcore_role_t lcore_role[RTE_MAX_LCORE]; /**< State of cores. */
 
diff --git a/lib/librte_eal/common/include/rte_lcore.h b/lib/librte_eal/common/include/rte_lcore.h
index c89e6ba..6a75c9b 100644
--- a/lib/librte_eal/common/include/rte_lcore.h
+++ b/lib/librte_eal/common/include/rte_lcore.h
@@ -148,6 +148,14 @@ rte_lcore_index(int lcore_id)
 unsigned rte_socket_id(void);
 
 /**
+ * Return number of physical sockets on the system.
+ * @return
+ *   the number of physical sockets as recognized by EAL
+ *
+ */
+unsigned rte_num_sockets(void);
+
+/**
  * Get the ID of the physical socket of the specified lcore
  *
  * @param lcore_id
-- 
2.7.4


* [dpdk-dev] [RFC v2 03/23] eal: add rte_fbarray
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 01/23] eal: move get_virtual_area out of linuxapp eal_memory.c Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 02/23] eal: add function to report number of detected sockets Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 04/23] eal: move all locking to heap Anatoly Burakov
                   ` (24 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

rte_fbarray is a simple resizable array, not unlike vectors in
higher-level languages. Rationale for its existence is the
following: since we are going to map memory page-by-page, there
could be quite a lot of memory segments to keep track of (for
smaller page sizes, page count can easily reach thousands). We
can't really make page lists truly dynamic and infinitely expandable,
because that involves reallocating memory (which is a big no-no in
multiprocess). What we can do instead is have a maximum capacity as
something really, really large, and preallocate address space for
that, but only use a small portion of that memory as needed, via
mmap()'ing portions of the address space to an actual file. This
also doubles as a mechanism to share fbarrays between processes
(although multiprocess is neither implemented nor tested at the
moment). Hence the name: file-backed array.

In addition, understanding that we will frequently need to scan this
array for free space, and that iterating over the array linearly can
become slow, rte_fbarray provides facilities to index the array's
usage. The following use cases are covered:
 - find next free/used slot (useful either for adding new elements
   to fbarray, or walking the list)
 - find starting index for next N free/used slots (useful for when
   we want to allocate chunk of VA-contiguous memory composed of
   several pages)
 - find how many contiguous free/used slots there are, starting
   from specified index (useful for when we want to figure out
   how many pages we have until next hole in allocated memory, to
   speed up some bulk operations where we would otherwise have to
   walk the array and add pages one by one)

This is accomplished by storing a usage mask in-memory, right
after the data section of the array, and using some bit-level
magic to figure out the info we need.

rte_fbarray is a bit clunky to use and its primary purpose is to
be used within EAL for certain things, but hopefully it is (or
can be made so) generic enough to be useful in other contexts.

Note that the current implementation leaks fds whenever new
allocations happen.
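
To make the intended usage more concrete, here is a minimal sketch based
on the API added below (the element type and sizes are made up for
illustration, and error handling is abbreviated):

    #include <stdint.h>
    #include <rte_fbarray.h>

    struct page_entry { void *va; uint64_t iova; };

    static int fbarray_example(void)
    {
        struct rte_fbarray arr;
        struct page_entry *pg;
        int idx;

        /* start with 64 usable slots, allow growth up to 1M slots */
        if (rte_fbarray_alloc(&arr, "page_list", 64, 1 << 20,
                sizeof(*pg)) < 0)
            return -1;

        /* find a free slot, fill it in and mark it as used */
        idx = rte_fbarray_find_next_free(&arr, 0);
        if (idx < 0)
            return -1;
        pg = rte_fbarray_get(&arr, idx);
        pg->va = NULL; /* would point at a freshly mapped page */
        pg->iova = 0;
        rte_fbarray_set_used(&arr, idx, true);

        /* grow the usable length when more slots are needed */
        if (rte_fbarray_resize(&arr, 128) < 0)
            return -1;

        rte_fbarray_free(&arr);
        return 0;
    }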

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/Makefile              |   2 +-
 lib/librte_eal/common/eal_common_fbarray.c  | 585 ++++++++++++++++++++++++++++
 lib/librte_eal/common/eal_filesystem.h      |  13 +
 lib/librte_eal/common/include/rte_fbarray.h |  98 +++++
 lib/librte_eal/linuxapp/eal/Makefile        |   1 +
 5 files changed, 698 insertions(+), 1 deletion(-)
 create mode 100755 lib/librte_eal/common/eal_common_fbarray.c
 create mode 100755 lib/librte_eal/common/include/rte_fbarray.h

diff --git a/lib/librte_eal/common/Makefile b/lib/librte_eal/common/Makefile
index 9effd0d..7868698 100644
--- a/lib/librte_eal/common/Makefile
+++ b/lib/librte_eal/common/Makefile
@@ -43,7 +43,7 @@ INC += rte_hexdump.h rte_devargs.h rte_bus.h rte_dev.h
 INC += rte_pci_dev_feature_defs.h rte_pci_dev_features.h
 INC += rte_malloc.h rte_keepalive.h rte_time.h
 INC += rte_service.h rte_service_component.h
-INC += rte_bitmap.h rte_vfio.h
+INC += rte_bitmap.h rte_vfio.h rte_fbarray.h
 
 GENERIC_INC := rte_atomic.h rte_byteorder.h rte_cycles.h rte_prefetch.h
 GENERIC_INC += rte_spinlock.h rte_memcpy.h rte_cpuflags.h rte_rwlock.h
diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
new file mode 100755
index 0000000..6e71909
--- /dev/null
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -0,0 +1,585 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <inttypes.h>
+#include <sys/mman.h>
+#include <stdint.h>
+#include <errno.h>
+#include <sys/file.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_log.h>
+
+#include "eal_filesystem.h"
+#include "eal_private.h"
+
+#include "rte_fbarray.h"
+
+#define MASK_SHIFT 6ULL
+#define MASK_ALIGN (1 << MASK_SHIFT)
+#define MASK_LEN_TO_IDX(x) ((x) >> MASK_SHIFT)
+#define MASK_LEN_TO_MOD(x) ((x) - RTE_ALIGN_FLOOR(x, MASK_ALIGN))
+#define MASK_GET_IDX(idx, mod) ((idx << MASK_SHIFT) + mod)
+
+/*
+ * This is a mask that is always stored at the end of array, to provide fast
+ * way of finding free/used spots without looping through each element.
+ */
+
+struct used_mask {
+	int n_masks;
+	uint64_t data[];
+};
+
+static size_t
+calc_mask_size(int len) {
+	return sizeof(struct used_mask) + sizeof(uint64_t) * MASK_LEN_TO_IDX(len);
+}
+
+static size_t
+calc_data_size(size_t page_sz, int elt_sz, int len) {
+	size_t data_sz = elt_sz * len;
+	size_t msk_sz = calc_mask_size(len);
+	return RTE_ALIGN_CEIL(data_sz + msk_sz, page_sz);
+}
+
+static struct used_mask *
+get_used_mask(void *data, int elt_sz, int len) {
+	return (struct used_mask *) RTE_PTR_ADD(data, elt_sz * len);
+}
+
+static void
+move_mask(void *data, int elt_sz, int old_len, int new_len) {
+	struct used_mask *old_msk, *new_msk;
+
+	old_msk = get_used_mask(data, elt_sz, old_len);
+	new_msk = get_used_mask(data, elt_sz, new_len);
+
+	memset(new_msk, 0, calc_mask_size(new_len));
+	memcpy(new_msk, old_msk, calc_mask_size(old_len));
+	memset(old_msk, 0, calc_mask_size(old_len));
+	new_msk->n_masks = MASK_LEN_TO_IDX(new_len);
+}
+
+static int
+expand_and_map(void *addr, const char *name, size_t old_len, size_t new_len) {
+	char path[PATH_MAX];
+	void *map_addr, *adj_addr;
+	size_t map_len;
+	int fd, ret = 0;
+
+	map_len = new_len - old_len;
+	adj_addr = RTE_PTR_ADD(addr, old_len);
+
+	eal_get_fbarray_path(path, sizeof(path), name);
+
+	/* open our file */
+	fd = open(path, O_CREAT | O_RDWR, 0600);
+	if (fd < 0) {
+		RTE_LOG(ERR, EAL, "Cannot open %s\n", path);
+		return -1;
+	}
+	if (ftruncate(fd, new_len)) {
+		RTE_LOG(ERR, EAL, "Cannot truncate %s\n", path);
+		ret = -1;
+		goto out;
+	}
+
+	map_addr = mmap(adj_addr, map_len, PROT_READ | PROT_WRITE,
+			MAP_SHARED | MAP_POPULATE | MAP_FIXED, fd, old_len);
+	if (map_addr != adj_addr) {
+		RTE_LOG(ERR, EAL, "mmap() failed: %s\n", strerror(errno));
+		ret = -1;
+		goto out;
+	}
+out:
+	close(fd);
+	return ret;
+}
+
+static int
+find_next_n(const struct used_mask *msk, int start, int n, bool used) {
+	int msk_idx, lookahead_idx, first, first_mod;
+	uint64_t first_msk;
+
+	/*
+	 * mask only has granularity of MASK_ALIGN, but start may not be aligned
+	 * on that boundary, so construct a special mask to exclude anything we
+	 * don't want to see to avoid confusing ctz.
+	 */
+	first = MASK_LEN_TO_IDX(start);
+	first_mod = MASK_LEN_TO_MOD(start);
+	first_msk = ~((1ULL << first_mod) - 1);
+
+	for (msk_idx = first; msk_idx < msk->n_masks; msk_idx++) {
+		uint64_t cur_msk, lookahead_msk;
+		int run_start, clz, left;
+		bool found = false;
+		/*
+		 * The process of getting n consecutive bits for arbitrary n is
+		 * a bit involved, but here it is in a nutshell:
+		 *
+		 *  1. let n be the number of consecutive bits we're looking for
+		 *  2. check if n can fit in one mask, and if so, do n-1
+		 *     rshift-ands to see if there is an appropriate run inside
+		 *     our current mask
+		 *    2a. if we found a run, bail out early
+		 *    2b. if we didn't find a run, proceed
+		 *  3. invert the mask and count leading zeroes (that is, count
+		 *     how many consecutive set bits we had starting from the
+		 *     end of current mask) as k
+		 *    3a. if k is 0, continue to next mask
+		 *    3b. if k is not 0, we have a potential run
+		 *  4. to satisfy our requirements, next mask must have n-k
+		 *     consecutive set bits right at the start, so we will do
+		 *     (n-k-1) rshift-ands and check if first bit is set.
+		 *
+		 * Step 4 will need to be repeated if (n-k) > MASK_ALIGN until
+		 * we either run out of masks, lose the run, or find what we
+		 * were looking for.
+		 */
+		cur_msk = msk->data[msk_idx];
+		left = n;
+
+		/* if we're looking for free spaces, invert the mask */
+		if (!used)
+			cur_msk = ~cur_msk;
+
+		/* ignore everything before start on first iteration */
+		if (msk_idx == first)
+			cur_msk &= first_msk;
+
+		/* if n can fit in within a single mask, do a search */
+		if (n <= MASK_ALIGN) {
+			uint64_t tmp_msk = cur_msk;
+			int s_idx;
+			for (s_idx = 0; s_idx < n - 1; s_idx++) {
+				tmp_msk &= tmp_msk >> 1ULL;
+			}
+			/* we found what we were looking for */
+			if (tmp_msk != 0) {
+				run_start = __builtin_ctzll(tmp_msk);
+				return MASK_GET_IDX(msk_idx, run_start);
+			}
+		}
+
+		/*
+		 * we didn't find our run within the mask, or n > MASK_ALIGN,
+		 * so we're going for plan B.
+		 */
+
+		/* count leading zeroes on inverted mask */
+		clz = __builtin_clzll(~cur_msk);
+
+		/* if there aren't any runs at the end either, just continue */
+		if (clz == 0)
+			continue;
+
+		/* we have a partial run at the end, so try looking ahead */
+		run_start = MASK_ALIGN - clz;
+		left -= clz;
+
+		for (lookahead_idx = msk_idx + 1; lookahead_idx < msk->n_masks;
+				lookahead_idx++) {
+			int s_idx, need;
+			lookahead_msk = msk->data[lookahead_idx];
+
+			/* if we're looking for free space, invert the mask */
+			if (!used)
+				lookahead_msk = ~lookahead_msk;
+
+			/* figure out how many consecutive bits we need here */
+			need = RTE_MIN(left, MASK_ALIGN);
+
+			for (s_idx = 0; s_idx < need - 1; s_idx++)
+				lookahead_msk &= lookahead_msk >> 1ULL;
+
+			/* if first bit is not set, we've lost the run */
+			if ((lookahead_msk & 1) == 0)
+				break;
+
+			left -= need;
+
+			/* check if we've found what we were looking for */
+			if (left == 0) {
+				found = true;
+				break;
+			}
+		}
+
+		/* we didn't find anything, so continue */
+		if (!found) {
+			continue;
+		}
+
+		return MASK_GET_IDX(msk_idx, run_start);
+	}
+	return used ? -ENOENT : -ENOSPC;
+}
+
+static int
+find_next(const struct used_mask *msk, int start, bool used) {
+	int idx, first, first_mod;
+	uint64_t first_msk;
+
+	/*
+	 * mask only has granularity of MASK_ALIGN, but start may not be aligned
+	 * on that boundary, so construct a special mask to exclude anything we
+	 * don't want to see to avoid confusing ctz.
+	 */
+	first = MASK_LEN_TO_IDX(start);
+	first_mod = MASK_LEN_TO_MOD(start);
+	first_msk = ~((1ULL << first_mod) - 1ULL);
+
+	for (idx = first; idx < msk->n_masks; idx++) {
+		uint64_t cur = msk->data[idx];
+		int found;
+
+		/* if we're looking for free entries, invert mask */
+		if (!used) {
+			cur = ~cur;
+		}
+
+		/* ignore everything before start on first iteration */
+		if (idx == first)
+			cur &= first_msk;
+
+		/* check if we have any entries */
+		if (cur == 0)
+			continue;
+
+		/*
+		 * find first set bit - that will correspond to whatever it is
+		 * that we're looking for.
+		 */
+		found = __builtin_ctzll(cur);
+		return MASK_GET_IDX(idx, found);
+	}
+	return used ? -ENOENT : -ENOSPC;
+}
+
+static int
+find_contig(const struct used_mask *msk, int start, bool used) {
+	int idx, first;
+	int need_len, result = 0;
+
+	first = MASK_LEN_TO_IDX(start);
+	for (idx = first; idx < msk->n_masks; idx++, result += need_len) {
+		uint64_t cur = msk->data[idx];
+		int run_len;
+
+		need_len = MASK_ALIGN;
+
+		/* if we're looking for free entries, invert mask */
+		if (!used) {
+			cur = ~cur;
+		}
+
+		/* ignore everything before start on first iteration */
+		if (idx == first) {
+			cur >>= start;
+			/* at the start, we don't need the full mask len */
+			need_len -= start;
+		}
+
+		/* we will be looking for zeroes, so invert the mask */
+		cur = ~cur;
+
+		/* if mask is zero, we have a complete run */
+		if (cur == 0)
+			continue;
+
+		/*
+		 * see if current run ends before mask end.
+		 */
+		run_len = __builtin_ctzll(cur);
+
+		/* add however many zeroes we've had in the last run and quit */
+		if (run_len < need_len) {
+			result += run_len;
+			break;
+		}
+	}
+	return result;
+}
+
+int
+rte_fbarray_alloc(struct rte_fbarray *arr, const char *name, int cur_len,
+		int max_len, int elt_sz) {
+	size_t max_mmap_len, cur_mmap_len, page_sz;
+	char path[PATH_MAX];
+	struct used_mask *msk;
+	void *data;
+
+	// TODO: validation
+
+	/* lengths must be aligned */
+	cur_len = RTE_ALIGN_CEIL(cur_len, MASK_ALIGN);
+	max_len = RTE_ALIGN_CEIL(max_len, MASK_ALIGN);
+
+	page_sz = sysconf(_SC_PAGESIZE);
+
+	cur_mmap_len = calc_data_size(page_sz, elt_sz, cur_len);
+	max_mmap_len = calc_data_size(page_sz, elt_sz, max_len);
+
+	data = eal_get_virtual_area(NULL, &max_mmap_len, page_sz, 0);
+	if (data == NULL)
+		return -1;
+
+	eal_get_fbarray_path(path, sizeof(path), name);
+	unlink(path);
+
+	if (expand_and_map(data, name, 0, cur_mmap_len)) {
+		return -1;
+	}
+
+	/* populate data structure */
+	snprintf(arr->name, sizeof(arr->name), "%s", name);
+	arr->data = data;
+	arr->capacity = max_len;
+	arr->len = cur_len;
+	arr->elt_sz = elt_sz;
+	arr->count = 0;
+
+	msk = get_used_mask(data, elt_sz, cur_len);
+	msk->n_masks = MASK_LEN_TO_IDX(cur_len);
+
+	return 0;
+}
+
+int
+rte_fbarray_attach(const struct rte_fbarray *arr) {
+	uint64_t max_mmap_len, cur_mmap_len, page_sz;
+	void *data;
+
+	page_sz = sysconf(_SC_PAGESIZE);
+
+	cur_mmap_len = calc_data_size(page_sz, arr->elt_sz, arr->len);
+	max_mmap_len = calc_data_size(page_sz, arr->elt_sz, arr->capacity);
+
+	data = eal_get_virtual_area(arr->data, &max_mmap_len, page_sz, 0);
+	if (data == NULL)
+		return -1;
+
+	if (expand_and_map(data, arr->name, 0, cur_mmap_len)) {
+		return -1;
+	}
+
+	return 0;
+}
+
+void
+rte_fbarray_free(struct rte_fbarray *arr) {
+	size_t page_sz = sysconf(_SC_PAGESIZE);
+	munmap(arr->data, calc_data_size(page_sz, arr->elt_sz, arr->capacity));
+	memset(arr, 0, sizeof(*arr));
+}
+
+int
+rte_fbarray_resize(struct rte_fbarray *arr, int new_len) {
+	size_t cur_mmap_len, new_mmap_len, page_sz;
+
+	// TODO: validation
+	if (arr->len >= new_len) {
+		RTE_LOG(ERR, EAL, "Invalid length: %i >= %i\n", arr->len, new_len);
+		return -1;
+	}
+
+	page_sz = sysconf(_SC_PAGESIZE);
+
+	new_len = RTE_ALIGN_CEIL(new_len, MASK_ALIGN);
+
+	cur_mmap_len = calc_data_size(page_sz, arr->elt_sz, arr->len);
+	new_mmap_len = calc_data_size(page_sz, arr->elt_sz, new_len);
+
+	if (cur_mmap_len != new_mmap_len &&
+			expand_and_map(arr->data, arr->name, cur_mmap_len,
+				new_mmap_len)) {
+		return -1;
+	}
+
+	move_mask(arr->data, arr->elt_sz, arr->len, new_len);
+
+	arr->len = new_len;
+
+	return 0;
+}
+
+void *
+rte_fbarray_get(const struct rte_fbarray *arr, int idx) {
+	if (idx >= arr->len || idx < 0)
+		return NULL;
+	return RTE_PTR_ADD(arr->data, idx * arr->elt_sz);
+}
+
+// TODO: replace -1 with debug sanity checks
+int
+rte_fbarray_set_used(struct rte_fbarray *arr, int idx, bool used) {
+	struct used_mask *msk = get_used_mask(arr->data, arr->elt_sz, arr->len);
+	bool already_used;
+	int msk_idx = MASK_LEN_TO_IDX(idx);
+	uint64_t msk_bit = 1ULL << MASK_LEN_TO_MOD(idx);
+
+	if (idx >= arr->len || idx < 0)
+		return -1;
+
+	already_used = (msk->data[msk_idx] & msk_bit) != 0;
+
+	/* nothing to be done */
+	if (used == already_used)
+		return 0;
+
+	if (used) {
+		msk->data[msk_idx] |= msk_bit;
+		arr->count++;
+	} else {
+		msk->data[msk_idx] &= ~msk_bit;
+		arr->count--;
+	}
+
+	return 0;
+}
+int
+rte_fbarray_is_used(const struct rte_fbarray *arr, int idx) {
+	struct used_mask *msk = get_used_mask(arr->data, arr->elt_sz, arr->len);
+	int msk_idx = MASK_LEN_TO_IDX(idx);
+	uint64_t msk_bit = 1ULL << MASK_LEN_TO_MOD(idx);
+
+	if (idx >= arr->len || idx < 0)
+		return -1;
+
+	return (msk->data[msk_idx] & msk_bit) != 0;
+}
+
+int
+rte_fbarray_find_next_free(const struct rte_fbarray *arr, int start) {
+	if (start >= arr->len || start < 0)
+		return -EINVAL;
+
+	if (arr->len == arr->count)
+		return -ENOSPC;
+
+	return find_next(get_used_mask(arr->data, arr->elt_sz, arr->len),
+			start, false);
+}
+
+int
+rte_fbarray_find_next_used(const struct rte_fbarray *arr, int start) {
+	if (start >= arr->len || start < 0)
+		return -EINVAL;
+
+	if (arr->count == 0)
+		return -1;
+
+	return find_next(get_used_mask(arr->data, arr->elt_sz, arr->len),
+			start, true);
+}
+
+int
+rte_fbarray_find_next_n_free(const struct rte_fbarray *arr, int start, int n) {
+	if (start >= arr->len || start < 0 || n > arr->len)
+		return -EINVAL;
+
+	if (arr->len == arr->count || arr->len - arr->count < n)
+		return -ENOSPC;
+
+	return find_next_n(get_used_mask(arr->data, arr->elt_sz, arr->len),
+			start, n, false);
+}
+
+int
+rte_fbarray_find_next_n_used(const struct rte_fbarray *arr, int start, int n) {
+	if (start >= arr->len || start < 0 || n > arr->len)
+		return -EINVAL;
+
+	if (arr->count < n)
+		return -ENOENT;
+
+	return find_next_n(get_used_mask(arr->data, arr->elt_sz, arr->len),
+			start, n, true);
+}
+
+int
+rte_fbarray_find_contig_free(const struct rte_fbarray *arr, int start) {
+	if (start >= arr->len || start < 0)
+		return -EINVAL;
+
+	if (arr->len == arr->count)
+		return -ENOSPC;
+
+	if (arr->count == 0)
+		return arr->len - start;
+
+	return find_contig(get_used_mask(arr->data, arr->elt_sz, arr->len),
+			start, false);
+}
+
+int
+rte_fbarray_find_contig_used(const struct rte_fbarray *arr, int start) {
+	if (start >= arr->len || start < 0)
+		return -EINVAL;
+
+	if (arr->count == 0)
+		return -ENOENT;
+
+	return find_contig(get_used_mask(arr->data, arr->elt_sz, arr->len),
+			start, true);
+}
+
+int
+rte_fbarray_find_idx(const struct rte_fbarray *arr, const void *elt) {
+	void *end;
+
+	end = RTE_PTR_ADD(arr->data, arr->elt_sz * arr->len);
+
+	if (elt < arr->data || elt >= end)
+		return -EINVAL;
+	return RTE_PTR_DIFF(elt, arr->data) / arr->elt_sz;
+}
+
+void
+rte_fbarray_dump_metadata(const struct rte_fbarray *arr, FILE *f) {
+	const struct used_mask *msk = get_used_mask(arr->data, arr->elt_sz, arr->len);
+
+	fprintf(f, "File-backed array: %s\n", arr->name);
+	fprintf(f, "size: %i occupied: %i capacity: %i elt_sz: %i\n",
+	       arr->len, arr->count, arr->capacity, arr->elt_sz);
+	if (!arr->data) {
+		fprintf(f, "not allocated\n");
+		return;
+	}
+
+	for (int i = 0; i < msk->n_masks; i++) {
+		fprintf(f, "msk idx %i: 0x%016lx\n", i, msk->data[i]);
+	}
+}
diff --git a/lib/librte_eal/common/eal_filesystem.h b/lib/librte_eal/common/eal_filesystem.h
index 8acbd99..10e7474 100644
--- a/lib/librte_eal/common/eal_filesystem.h
+++ b/lib/librte_eal/common/eal_filesystem.h
@@ -42,6 +42,7 @@
 
 /** Path of rte config file. */
 #define RUNTIME_CONFIG_FMT "%s/.%s_config"
+#define FBARRAY_FMT "%s/%s_%s"
 
 #include <stdint.h>
 #include <limits.h>
@@ -67,6 +68,18 @@ eal_runtime_config_path(void)
 	return buffer;
 }
 
+static inline const char *
+eal_get_fbarray_path(char *buffer, size_t buflen, const char *name) {
+	const char *directory = default_config_dir;
+	const char *home_dir = getenv("HOME");
+
+	if (getuid() != 0 && home_dir != NULL)
+		directory = home_dir;
+	snprintf(buffer, buflen - 1, FBARRAY_FMT, directory,
+			internal_config.hugefile_prefix, name);
+	return buffer;
+}
+
 /** Path of hugepage info file. */
 #define HUGEPAGE_INFO_FMT "%s/.%s_hugepage_info"
 
diff --git a/lib/librte_eal/common/include/rte_fbarray.h b/lib/librte_eal/common/include/rte_fbarray.h
new file mode 100755
index 0000000..d06c1ac
--- /dev/null
+++ b/lib/librte_eal/common/include/rte_fbarray.h
@@ -0,0 +1,98 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef RTE_FBARRAY_H
+#define RTE_FBARRAY_H
+
+#include <stdbool.h>
+#include <stdio.h>
+
+#define RTE_FBARRAY_NAME_LEN 64
+
+struct rte_fbarray {
+	char name[RTE_FBARRAY_NAME_LEN]; /**< name associated with an array */
+	int count;                       /**< number of entries stored */
+	int len;                         /**< current length of the array */
+	int capacity;                    /**< maximum length of the array */
+	int elt_sz;                      /**< size of each element */
+	void *data;                      /**< data pointer */
+};
+
+// TODO: tmp? shmget?
+
+int
+rte_fbarray_alloc(struct rte_fbarray *arr, const char *name, int cur_len,
+		int max_len, int elt_sz);
+
+int
+rte_fbarray_attach(const struct rte_fbarray *arr);
+
+void
+rte_fbarray_free(struct rte_fbarray *arr);
+
+int
+rte_fbarray_resize(struct rte_fbarray *arr, int new_len);
+
+void *
+rte_fbarray_get(const struct rte_fbarray *arr, int idx);
+
+int
+rte_fbarray_find_idx(const struct rte_fbarray *arr, const void *elt);
+
+int
+rte_fbarray_set_used(struct rte_fbarray *arr, int idx, bool used);
+
+int
+rte_fbarray_is_used(const struct rte_fbarray *arr, int idx);
+
+int
+rte_fbarray_find_next_free(const struct rte_fbarray *arr, int start);
+
+int
+rte_fbarray_find_next_used(const struct rte_fbarray *arr, int start);
+
+int
+rte_fbarray_find_next_n_free(const struct rte_fbarray *arr, int start, int n);
+
+int
+rte_fbarray_find_next_n_used(const struct rte_fbarray *arr, int start, int n);
+
+int
+rte_fbarray_find_contig_free(const struct rte_fbarray *arr, int start);
+
+int
+rte_fbarray_find_contig_used(const struct rte_fbarray *arr, int start);
+
+void
+rte_fbarray_dump_metadata(const struct rte_fbarray *arr, FILE *f);
+
+#endif // RTE_FBARRAY_H
diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 5a7b8b2..782e1ad 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -86,6 +86,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_dev.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_options.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_thread.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_proc.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_fbarray.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_malloc.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += malloc_elem.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += malloc_heap.c
-- 
2.7.4


* [dpdk-dev] [RFC v2 04/23] eal: move all locking to heap
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (2 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 03/23] eal: add rte_fbarray Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 05/23] eal: protect malloc heap stats with a lock Anatoly Burakov
                   ` (23 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

Down the line, we will need to do everything from the heap, as any
alloc or free may trigger allocating or freeing OS memory, which would
involve growing or shrinking the heap.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/malloc_elem.c | 16 ++--------------
 lib/librte_eal/common/malloc_heap.c | 36 ++++++++++++++++++++++++++++++++++++
 lib/librte_eal/common/malloc_heap.h |  6 ++++++
 lib/librte_eal/common/rte_malloc.c  |  4 ++--
 4 files changed, 46 insertions(+), 16 deletions(-)

diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c
index 98bcd37..6b4f2a5 100644
--- a/lib/librte_eal/common/malloc_elem.c
+++ b/lib/librte_eal/common/malloc_elem.c
@@ -271,10 +271,6 @@ join_elem(struct malloc_elem *elem1, struct malloc_elem *elem2)
 int
 malloc_elem_free(struct malloc_elem *elem)
 {
-	if (!malloc_elem_cookies_ok(elem) || elem->state != ELEM_BUSY)
-		return -1;
-
-	rte_spinlock_lock(&(elem->heap->lock));
 	size_t sz = elem->size - sizeof(*elem) - MALLOC_ELEM_TRAILER_LEN;
 	uint8_t *ptr = (uint8_t *)&elem[1];
 	struct malloc_elem *next = RTE_PTR_ADD(elem, elem->size);
@@ -302,8 +298,6 @@ malloc_elem_free(struct malloc_elem *elem)
 
 	memset(ptr, 0, sz);
 
-	rte_spinlock_unlock(&(elem->heap->lock));
-
 	return 0;
 }
 
@@ -320,11 +314,10 @@ malloc_elem_resize(struct malloc_elem *elem, size_t size)
 		return 0;
 
 	struct malloc_elem *next = RTE_PTR_ADD(elem, elem->size);
-	rte_spinlock_lock(&elem->heap->lock);
 	if (next ->state != ELEM_FREE)
-		goto err_return;
+		return -1;
 	if (elem->size + next->size < new_size)
-		goto err_return;
+		return -1;
 
 	/* we now know the element fits, so remove from free list,
 	 * join the two
@@ -339,10 +332,5 @@ malloc_elem_resize(struct malloc_elem *elem, size_t size)
 		split_elem(elem, split_pt);
 		malloc_elem_free_list_insert(split_pt);
 	}
-	rte_spinlock_unlock(&elem->heap->lock);
 	return 0;
-
-err_return:
-	rte_spinlock_unlock(&elem->heap->lock);
-	return -1;
 }
diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
index 267a4c6..099e448 100644
--- a/lib/librte_eal/common/malloc_heap.c
+++ b/lib/librte_eal/common/malloc_heap.c
@@ -174,6 +174,42 @@ malloc_heap_alloc(struct malloc_heap *heap,
 	return elem == NULL ? NULL : (void *)(&elem[1]);
 }
 
+int
+malloc_heap_free(struct malloc_elem *elem) {
+	struct malloc_heap *heap;
+	int ret;
+
+	if (!malloc_elem_cookies_ok(elem) || elem->state != ELEM_BUSY)
+		return -1;
+
+	/* elem may be merged with previous element, so keep heap address */
+	heap = elem->heap;
+
+	rte_spinlock_lock(&(heap->lock));
+
+	ret = malloc_elem_free(elem);
+
+	rte_spinlock_unlock(&(heap->lock));
+
+	return ret;
+}
+
+int
+malloc_heap_resize(struct malloc_elem *elem, size_t size) {
+	int ret;
+
+	if (!malloc_elem_cookies_ok(elem) || elem->state != ELEM_BUSY)
+		return -1;
+
+	rte_spinlock_lock(&(elem->heap->lock));
+
+	ret = malloc_elem_resize(elem, size);
+
+	rte_spinlock_unlock(&(elem->heap->lock));
+
+	return ret;
+}
+
 /*
  * Function to retrieve data for heap on given socket
  */
diff --git a/lib/librte_eal/common/malloc_heap.h b/lib/librte_eal/common/malloc_heap.h
index 3ccbef0..3767ef3 100644
--- a/lib/librte_eal/common/malloc_heap.h
+++ b/lib/librte_eal/common/malloc_heap.h
@@ -57,6 +57,12 @@ malloc_heap_alloc(struct malloc_heap *heap,	const char *type, size_t size,
 		unsigned flags, size_t align, size_t bound);
 
 int
+malloc_heap_free(struct malloc_elem *elem);
+
+int
+malloc_heap_resize(struct malloc_elem *elem, size_t size);
+
+int
 malloc_heap_get_stats(const struct malloc_heap *heap,
 		struct rte_malloc_socket_stats *socket_stats);
 
diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c
index fe2278b..74b5417 100644
--- a/lib/librte_eal/common/rte_malloc.c
+++ b/lib/librte_eal/common/rte_malloc.c
@@ -58,7 +58,7 @@
 void rte_free(void *addr)
 {
 	if (addr == NULL) return;
-	if (malloc_elem_free(malloc_elem_from_data(addr)) < 0)
+	if (malloc_heap_free(malloc_elem_from_data(addr)) < 0)
 		rte_panic("Fatal error: Invalid memory\n");
 }
 
@@ -169,7 +169,7 @@ rte_realloc(void *ptr, size_t size, unsigned align)
 	size = RTE_CACHE_LINE_ROUNDUP(size), align = RTE_CACHE_LINE_ROUNDUP(align);
 	/* check alignment matches first, and if ok, see if we can resize block */
 	if (RTE_PTR_ALIGN(ptr,align) == ptr &&
-			malloc_elem_resize(elem, size) == 0)
+			malloc_heap_resize(elem, size) == 0)
 		return ptr;
 
 	/* either alignment is off, or we have no room to expand,
-- 
2.7.4


* [dpdk-dev] [RFC v2 05/23] eal: protect malloc heap stats with a lock
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (3 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 04/23] eal: move all locking to heap Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 06/23] eal: make malloc a doubly-linked list Anatoly Burakov
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

This does not change the public API, as this API is not meant to be
called directly.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/malloc_heap.c | 7 ++++++-
 lib/librte_eal/common/malloc_heap.h | 2 +-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
index 099e448..b3a1043 100644
--- a/lib/librte_eal/common/malloc_heap.c
+++ b/lib/librte_eal/common/malloc_heap.c
@@ -214,12 +214,14 @@ malloc_heap_resize(struct malloc_elem *elem, size_t size) {
  * Function to retrieve data for heap on given socket
  */
 int
-malloc_heap_get_stats(const struct malloc_heap *heap,
+malloc_heap_get_stats(struct malloc_heap *heap,
 		struct rte_malloc_socket_stats *socket_stats)
 {
 	size_t idx;
 	struct malloc_elem *elem;
 
+	rte_spinlock_lock(&(heap->lock));
+
 	/* Initialise variables for heap */
 	socket_stats->free_count = 0;
 	socket_stats->heap_freesz_bytes = 0;
@@ -241,6 +243,9 @@ malloc_heap_get_stats(const struct malloc_heap *heap,
 	socket_stats->heap_allocsz_bytes = (socket_stats->heap_totalsz_bytes -
 			socket_stats->heap_freesz_bytes);
 	socket_stats->alloc_count = heap->alloc_count;
+
+	rte_spinlock_unlock(&(heap->lock));
+
 	return 0;
 }
 
diff --git a/lib/librte_eal/common/malloc_heap.h b/lib/librte_eal/common/malloc_heap.h
index 3767ef3..df04dd8 100644
--- a/lib/librte_eal/common/malloc_heap.h
+++ b/lib/librte_eal/common/malloc_heap.h
@@ -63,7 +63,7 @@ int
 malloc_heap_resize(struct malloc_elem *elem, size_t size);
 
 int
-malloc_heap_get_stats(const struct malloc_heap *heap,
+malloc_heap_get_stats(struct malloc_heap *heap,
 		struct rte_malloc_socket_stats *socket_stats);
 
 int
-- 
2.7.4


* [dpdk-dev] [RFC v2 06/23] eal: make malloc a doubly-linked list
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (4 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 05/23] eal: protect malloc heap stats with a lock Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 07/23] eal: make malloc_elem_join_adjacent_free public Anatoly Burakov
                   ` (21 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

As we are preparing for dynamic memory allocation, we need to be able
to handle holes in our malloc heap, hence we're switching to a doubly
linked list and preparing the infrastructure to support it.

Since our heap now knows where its first and last elements are, there
is no longer any need to have a dummy element at the end of each heap,
so get rid of that as well. Instead, let insert/remove/join/split
operations handle end-of-list conditions automatically.
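
For reference, a simplified sketch of the element header after this
change (fields abridged and qualifiers omitted -- see malloc_elem.h for
the authoritative definition):

    struct malloc_elem {
        struct malloc_heap *heap;          /* owning heap */
        struct malloc_elem *prev;          /* previous element by address */
        struct malloc_elem *next;          /* next element by address (new) */
        LIST_ENTRY(malloc_elem) free_list; /* free-list linkage */
        const struct rte_memseg *ms;       /* backing memseg */
        volatile enum elem_state state;    /* ELEM_FREE / ELEM_BUSY / ELEM_PAD */
        size_t size;                       /* total size, including header */
    };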

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/include/rte_malloc_heap.h |   6 +
 lib/librte_eal/common/malloc_elem.c             | 196 +++++++++++++++++++-----
 lib/librte_eal/common/malloc_elem.h             |   7 +-
 lib/librte_eal/common/malloc_heap.c             |   8 +-
 4 files changed, 170 insertions(+), 47 deletions(-)

diff --git a/lib/librte_eal/common/include/rte_malloc_heap.h b/lib/librte_eal/common/include/rte_malloc_heap.h
index b270356..48a46c9 100644
--- a/lib/librte_eal/common/include/rte_malloc_heap.h
+++ b/lib/librte_eal/common/include/rte_malloc_heap.h
@@ -42,12 +42,18 @@
 /* Number of free lists per heap, grouped by size. */
 #define RTE_HEAP_NUM_FREELISTS  13
 
+/* dummy definition, for pointers */
+struct malloc_elem;
+
 /**
  * Structure to hold malloc heap
  */
 struct malloc_heap {
 	rte_spinlock_t lock;
 	LIST_HEAD(, malloc_elem) free_head[RTE_HEAP_NUM_FREELISTS];
+	struct malloc_elem *first;
+	struct malloc_elem *last;
+
 	unsigned alloc_count;
 	size_t total_size;
 } __rte_cache_aligned;
diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c
index 6b4f2a5..7609a9b 100644
--- a/lib/librte_eal/common/malloc_elem.c
+++ b/lib/librte_eal/common/malloc_elem.c
@@ -60,6 +60,7 @@ malloc_elem_init(struct malloc_elem *elem,
 	elem->heap = heap;
 	elem->ms = ms;
 	elem->prev = NULL;
+	elem->next = NULL;
 	memset(&elem->free_list, 0, sizeof(elem->free_list));
 	elem->state = ELEM_FREE;
 	elem->size = size;
@@ -68,15 +69,56 @@ malloc_elem_init(struct malloc_elem *elem,
 	set_trailer(elem);
 }
 
-/*
- * Initialize a dummy malloc_elem header for the end-of-memseg marker
- */
 void
-malloc_elem_mkend(struct malloc_elem *elem, struct malloc_elem *prev)
+malloc_elem_insert(struct malloc_elem *elem)
 {
-	malloc_elem_init(elem, prev->heap, prev->ms, 0);
-	elem->prev = prev;
-	elem->state = ELEM_BUSY; /* mark busy so its never merged */
+	struct malloc_elem *prev_elem, *next_elem;
+	struct malloc_heap *heap = elem->heap;
+
+	if (heap->first == NULL && heap->last == NULL) {
+		/* if empty heap */
+		heap->first = elem;
+		heap->last = elem;
+		prev_elem = NULL;
+		next_elem = NULL;
+	} else if (elem < heap->first) {
+		/* if lower than start */
+		prev_elem = NULL;
+		next_elem = heap->first;
+		heap->first = elem;
+	} else if (elem > heap->last) {
+		/* if higher than end */
+		prev_elem = heap->last;
+		next_elem = NULL;
+		heap->last = elem;
+	} else {
+		/* the new memory is somewhere inbetween start and end */
+		uint64_t dist_from_start, dist_from_end;
+
+		dist_from_end = RTE_PTR_DIFF(heap->last, elem);
+		dist_from_start = RTE_PTR_DIFF(elem, heap->first);
+
+		/* check which is closer, and find closest list entries */
+		if (dist_from_start < dist_from_end) {
+			prev_elem = heap->first;
+			while (prev_elem->next < elem)
+				prev_elem = prev_elem->next;
+			next_elem = prev_elem->next;
+		} else {
+			next_elem = heap->last;
+			while (next_elem->prev > elem)
+				next_elem = next_elem->prev;
+			prev_elem = next_elem->prev;
+		}
+	}
+
+	/* insert new element */
+	elem->prev = prev_elem;
+	elem->next = next_elem;
+	if (prev_elem)
+		prev_elem->next = elem;
+	if (next_elem)
+		next_elem->prev = elem;
 }
 
 /*
@@ -126,18 +168,55 @@ malloc_elem_can_hold(struct malloc_elem *elem, size_t size,	unsigned align,
 static void
 split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt)
 {
-	struct malloc_elem *next_elem = RTE_PTR_ADD(elem, elem->size);
+	struct malloc_elem *next_elem = elem->next;
 	const size_t old_elem_size = (uintptr_t)split_pt - (uintptr_t)elem;
 	const size_t new_elem_size = elem->size - old_elem_size;
 
 	malloc_elem_init(split_pt, elem->heap, elem->ms, new_elem_size);
 	split_pt->prev = elem;
-	next_elem->prev = split_pt;
+	split_pt->next = next_elem;
+	if (next_elem)
+		next_elem->prev = split_pt;
+	else
+		elem->heap->last = split_pt;
+	elem->next = split_pt;
 	elem->size = old_elem_size;
 	set_trailer(elem);
 }
 
 /*
+ * our malloc heap is a doubly linked list, so doubly remove our element.
+ */
+static void __rte_unused
+remove_elem(struct malloc_elem *elem) {
+	struct malloc_elem *next, *prev;
+	next = elem->next;
+	prev = elem->prev;
+
+	if (next)
+		next->prev = prev;
+	else
+		elem->heap->last = prev;
+	if (prev)
+		prev->next = next;
+	else
+		elem->heap->first = next;
+
+	elem->prev = NULL;
+	elem->next = NULL;
+}
+
+static int
+next_elem_is_adjacent(struct malloc_elem *elem) {
+	return elem->next == RTE_PTR_ADD(elem, elem->size);
+}
+
+static int
+prev_elem_is_adjacent(struct malloc_elem *elem) {
+	return elem == RTE_PTR_ADD(elem->prev, elem->prev->size);
+}
+
+/*
  * Given an element size, compute its freelist index.
  * We free an element into the freelist containing similarly-sized elements.
  * We try to allocate elements starting with the freelist containing
@@ -220,6 +299,9 @@ malloc_elem_alloc(struct malloc_elem *elem, size_t size, unsigned align,
 
 		split_elem(elem, new_free_elem);
 		malloc_elem_free_list_insert(new_free_elem);
+
+		if (elem == elem->heap->last)
+			elem->heap->last = new_free_elem;
 	}
 
 	if (old_elem_size < MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE) {
@@ -258,9 +340,61 @@ malloc_elem_alloc(struct malloc_elem *elem, size_t size, unsigned align,
 static inline void
 join_elem(struct malloc_elem *elem1, struct malloc_elem *elem2)
 {
-	struct malloc_elem *next = RTE_PTR_ADD(elem2, elem2->size);
+	struct malloc_elem *next = elem2->next;
 	elem1->size += elem2->size;
-	next->prev = elem1;
+	if (next)
+		next->prev = elem1;
+	else
+		elem1->heap->last = elem1;
+	elem1->next = next;
+}
+
+static struct malloc_elem *
+elem_join_adjacent_free(struct malloc_elem *elem) {
+	/*
+	 * check if next element exists, is adjacent and is free, if so join
+	 * with it, need to remove from free list.
+	 */
+	if (elem->next != NULL && elem->next->state == ELEM_FREE &&
+			next_elem_is_adjacent(elem)){
+		void *erase;
+
+		/* we will want to erase the trailer and header */
+		erase = RTE_PTR_SUB(elem->next, MALLOC_ELEM_TRAILER_LEN);
+
+		/* remove from free list, join to this one */
+		elem_free_list_remove(elem->next);
+		join_elem(elem, elem->next);
+
+		/* erase header and trailer */
+		memset(erase, 0, MALLOC_ELEM_OVERHEAD);
+	}
+
+	/*
+	 * check if prev element exists, is adjacent and is free, if so join
+	 * with it, need to remove from free list.
+	 */
+	if (elem->prev != NULL && elem->prev->state == ELEM_FREE &&
+			prev_elem_is_adjacent(elem)) {
+		struct malloc_elem *new_elem;
+		void *erase;
+
+		/* we will want to erase trailer and header */
+		erase = RTE_PTR_SUB(elem, MALLOC_ELEM_TRAILER_LEN);
+
+		/* remove from free list, join to this one */
+		elem_free_list_remove(elem->prev);
+
+		new_elem = elem->prev;
+		join_elem(new_elem, elem);
+
+		/* erase header and trailer */
+		memset(erase, 0, MALLOC_ELEM_OVERHEAD);
+
+		elem = new_elem;
+	}
+
+	return elem;
 }
 
 /*
@@ -271,32 +405,20 @@ join_elem(struct malloc_elem *elem1, struct malloc_elem *elem2)
 int
 malloc_elem_free(struct malloc_elem *elem)
 {
-	size_t sz = elem->size - sizeof(*elem) - MALLOC_ELEM_TRAILER_LEN;
-	uint8_t *ptr = (uint8_t *)&elem[1];
-	struct malloc_elem *next = RTE_PTR_ADD(elem, elem->size);
-	if (next->state == ELEM_FREE){
-		/* remove from free list, join to this one */
-		elem_free_list_remove(next);
-		join_elem(elem, next);
-		sz += (sizeof(*elem) + MALLOC_ELEM_TRAILER_LEN);
-	}
+	void *ptr;
+	size_t data_len;
+
+	ptr = RTE_PTR_ADD(elem, sizeof(*elem));
+	data_len = elem->size - MALLOC_ELEM_OVERHEAD;
+
+	elem = elem_join_adjacent_free(elem);
 
-	/* check if previous element is free, if so join with it and return,
-	 * need to re-insert in free list, as that element's size is changing
-	 */
-	if (elem->prev != NULL && elem->prev->state == ELEM_FREE) {
-		elem_free_list_remove(elem->prev);
-		join_elem(elem->prev, elem);
-		sz += (sizeof(*elem) + MALLOC_ELEM_TRAILER_LEN);
-		ptr -= (sizeof(*elem) + MALLOC_ELEM_TRAILER_LEN);
-		elem = elem->prev;
-	}
 	malloc_elem_free_list_insert(elem);
 
 	/* decrease heap's count of allocated elements */
 	elem->heap->alloc_count--;
 
-	memset(ptr, 0, sz);
+	memset(ptr, 0, data_len);
 
 	return 0;
 }
@@ -309,21 +431,23 @@ int
 malloc_elem_resize(struct malloc_elem *elem, size_t size)
 {
 	const size_t new_size = size + elem->pad + MALLOC_ELEM_OVERHEAD;
+
 	/* if we request a smaller size, then always return ok */
 	if (elem->size >= new_size)
 		return 0;
 
-	struct malloc_elem *next = RTE_PTR_ADD(elem, elem->size);
-	if (next ->state != ELEM_FREE)
+	/* check if there is a next element, it's free and adjacent */
+	if (!elem->next || elem->next->state != ELEM_FREE ||
+			!next_elem_is_adjacent(elem))
 		return -1;
-	if (elem->size + next->size < new_size)
+	if (elem->size + elem->next->size < new_size)
 		return -1;
 
 	/* we now know the element fits, so remove from free list,
 	 * join the two
 	 */
-	elem_free_list_remove(next);
-	join_elem(elem, next);
+	elem_free_list_remove(elem->next);
+	join_elem(elem, elem->next);
 
 	if (elem->size - new_size >= MIN_DATA_SIZE + MALLOC_ELEM_OVERHEAD) {
 		/* now we have a big block together. Lets cut it down a bit, by splitting */
diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h
index ce39129..b3d39c0 100644
--- a/lib/librte_eal/common/malloc_elem.h
+++ b/lib/librte_eal/common/malloc_elem.h
@@ -48,6 +48,7 @@ enum elem_state {
 struct malloc_elem {
 	struct malloc_heap *heap;
 	struct malloc_elem *volatile prev;      /* points to prev elem in memseg */
+	struct malloc_elem *volatile next;      /* points to next elem in memseg */
 	LIST_ENTRY(malloc_elem) free_list;      /* list of free elements in heap */
 	const struct rte_memseg *ms;
 	volatile enum elem_state state;
@@ -139,12 +140,8 @@ malloc_elem_init(struct malloc_elem *elem,
 		const struct rte_memseg *ms,
 		size_t size);
 
-/*
- * initialise a dummy malloc_elem header for the end-of-memseg marker
- */
 void
-malloc_elem_mkend(struct malloc_elem *elem,
-		struct malloc_elem *prev_free);
+malloc_elem_insert(struct malloc_elem *elem);
 
 /*
  * return true if the current malloc_elem can hold a block of data
diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
index b3a1043..1b35468 100644
--- a/lib/librte_eal/common/malloc_heap.c
+++ b/lib/librte_eal/common/malloc_heap.c
@@ -99,15 +99,11 @@ check_hugepage_sz(unsigned flags, uint64_t hugepage_sz)
 static void
 malloc_heap_add_memseg(struct malloc_heap *heap, struct rte_memseg *ms)
 {
-	/* allocate the memory block headers, one at end, one at start */
 	struct malloc_elem *start_elem = (struct malloc_elem *)ms->addr;
-	struct malloc_elem *end_elem = RTE_PTR_ADD(ms->addr,
-			ms->len - MALLOC_ELEM_OVERHEAD);
-	end_elem = RTE_PTR_ALIGN_FLOOR(end_elem, RTE_CACHE_LINE_SIZE);
-	const size_t elem_size = (uintptr_t)end_elem - (uintptr_t)start_elem;
+	const size_t elem_size = ms->len - MALLOC_ELEM_OVERHEAD;
 
 	malloc_elem_init(start_elem, heap, ms, elem_size);
-	malloc_elem_mkend(end_elem, start_elem);
+	malloc_elem_insert(start_elem);
 	malloc_elem_free_list_insert(start_elem);
 
 	heap->total_size += elem_size;
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 07/23] eal: make malloc_elem_join_adjacent_free public
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (5 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 06/23] eal: make malloc a doubly-linked list Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 08/23] eal: add "single file segments" command-line option Anatoly Burakov
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

We need this function to join newly allocated segments with the heap.
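
As an illustration (a sketch, not code from this patch), newly
allocated memory could be stitched into a heap roughly like this once
the function is public. heap_add_memory_sketch() is a hypothetical
name, and the internal headers are assumed to be available:

#include "malloc_elem.h"
#include "malloc_heap.h"

/* hypothetical sketch: add a new memory area to a heap and merge it
 * with any adjacent free elements */
static struct malloc_elem *
heap_add_memory_sketch(struct malloc_heap *heap,
		const struct rte_memseg *ms, void *start, size_t len)
{
	struct malloc_elem *elem = start;

	malloc_elem_init(elem, heap, ms, len);	/* set up element header */
	malloc_elem_insert(elem);		/* link into address order */

	/* merge with free neighbours, then expose on a free list */
	elem = malloc_elem_join_adjacent_free(elem);
	malloc_elem_free_list_insert(elem);

	heap->total_size += len;
	return elem;
}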

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/malloc_elem.c | 6 +++---
 lib/librte_eal/common/malloc_elem.h | 3 +++
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c
index 7609a9b..782aaa7 100644
--- a/lib/librte_eal/common/malloc_elem.c
+++ b/lib/librte_eal/common/malloc_elem.c
@@ -349,8 +349,8 @@ join_elem(struct malloc_elem *elem1, struct malloc_elem *elem2)
 	elem1->next = next;
 }
 
-static struct malloc_elem *
-elem_join_adjacent_free(struct malloc_elem *elem) {
+struct malloc_elem *
+malloc_elem_join_adjacent_free(struct malloc_elem *elem) {
 	/*
 	 * check if next element exists, is adjacent and is free, if so join
 	 * with it, need to remove from free list.
@@ -411,7 +411,7 @@ malloc_elem_free(struct malloc_elem *elem)
 	ptr = RTE_PTR_ADD(elem, sizeof(*elem));
 	data_len = elem->size - MALLOC_ELEM_OVERHEAD;
 
-	elem = elem_join_adjacent_free(elem);
+	elem = malloc_elem_join_adjacent_free(elem);
 
 	malloc_elem_free_list_insert(elem);
 
diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h
index b3d39c0..cf27b59 100644
--- a/lib/librte_eal/common/malloc_elem.h
+++ b/lib/librte_eal/common/malloc_elem.h
@@ -167,6 +167,9 @@ malloc_elem_alloc(struct malloc_elem *elem, size_t size,
 int
 malloc_elem_free(struct malloc_elem *elem);
 
+struct malloc_elem *
+malloc_elem_join_adjacent_free(struct malloc_elem *elem);
+
 /*
  * attempt to resize a malloc_elem by expanding into any free space
  * immediately after it in memory.
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 08/23] eal: add "single file segments" command-line option
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (6 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 07/23] eal: make malloc_elem_join_adjacent_free public Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 09/23] eal: add "legacy memory" option Anatoly Burakov
                   ` (19 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

For now, this option does nothing, but it will be useful for
dynamic memory allocation down the line. Currently, DPDK stores
each page as a separate file in hugetlbfs. This option will allow
storing all pages in a single file (one file per socket, per page
size).
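
To illustrate the intent (a sketch only: the option is a no-op in
this patch, and the helper and file name formats below are
hypothetical), later allocation code could pick the backing file
like this:

#include <stdio.h>
#include <stdint.h>

#include "eal_internal_cfg.h"	/* internal_config */

/* hypothetical: choose a hugetlbfs backing file for a segment */
static void
pick_hugefile_path(char *path, size_t len, const char *hugedir,
		int socket_id, uint64_t page_sz, int seg_idx)
{
	if (internal_config.single_file_segments)
		/* one file per (page size, socket) pair */
		snprintf(path, len, "%s/seg-%uk-%i",
				hugedir, (unsigned)(page_sz >> 10), socket_id);
	else
		/* current behavior: one file per page */
		snprintf(path, len, "%s/seg-%i-%i",
				hugedir, socket_id, seg_idx);
}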

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/eal_common_options.c | 4 ++++
 lib/librte_eal/common/eal_internal_cfg.h   | 3 +++
 lib/librte_eal/common/eal_options.h        | 2 ++
 lib/librte_eal/linuxapp/eal/eal.c          | 1 +
 4 files changed, 10 insertions(+)

diff --git a/lib/librte_eal/common/eal_common_options.c b/lib/librte_eal/common/eal_common_options.c
index 996a034..c3f7c41 100644
--- a/lib/librte_eal/common/eal_common_options.c
+++ b/lib/librte_eal/common/eal_common_options.c
@@ -98,6 +98,7 @@ eal_long_options[] = {
 	{OPT_VDEV,              1, NULL, OPT_VDEV_NUM             },
 	{OPT_VFIO_INTR,         1, NULL, OPT_VFIO_INTR_NUM        },
 	{OPT_VMWARE_TSC_MAP,    0, NULL, OPT_VMWARE_TSC_MAP_NUM   },
+	{OPT_SINGLE_FILE_SEGMENTS, 0, NULL, OPT_SINGLE_FILE_SEGMENTS_NUM},
 	{0,                     0, NULL, 0                        }
 };
 
@@ -1158,6 +1159,9 @@ eal_parse_common_option(int opt, const char *optarg,
 		}
 		core_parsed = 1;
 		break;
+	case OPT_SINGLE_FILE_SEGMENTS_NUM:
+		conf->single_file_segments = 1;
+		break;
 
 	/* don't know what to do, leave this to caller */
 	default:
diff --git a/lib/librte_eal/common/eal_internal_cfg.h b/lib/librte_eal/common/eal_internal_cfg.h
index fa6ccbe..484a32e 100644
--- a/lib/librte_eal/common/eal_internal_cfg.h
+++ b/lib/librte_eal/common/eal_internal_cfg.h
@@ -76,6 +76,9 @@ struct internal_config {
 	volatile unsigned force_sockets;
 	volatile uint64_t socket_mem[RTE_MAX_NUMA_NODES]; /**< amount of memory per socket */
 	uintptr_t base_virtaddr;          /**< base address to try and reserve memory from */
+	volatile unsigned single_file_segments;
+	/**< true if storing all pages within single files (per-page-size,
+	 * per-node). */
 	volatile int syslog_facility;	  /**< facility passed to openlog() */
 	/** default interrupt mode for VFIO */
 	volatile enum rte_intr_mode vfio_intr_mode;
diff --git a/lib/librte_eal/common/eal_options.h b/lib/librte_eal/common/eal_options.h
index 30e6bb4..26a682a 100644
--- a/lib/librte_eal/common/eal_options.h
+++ b/lib/librte_eal/common/eal_options.h
@@ -83,6 +83,8 @@ enum {
 	OPT_VFIO_INTR_NUM,
 #define OPT_VMWARE_TSC_MAP    "vmware-tsc-map"
 	OPT_VMWARE_TSC_MAP_NUM,
+#define OPT_SINGLE_FILE_SEGMENTS    "single-file-segments"
+	OPT_SINGLE_FILE_SEGMENTS_NUM,
 	OPT_LONG_MAX_NUM
 };
 
diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c
index 229eec9..2a3127f 100644
--- a/lib/librte_eal/linuxapp/eal/eal.c
+++ b/lib/librte_eal/linuxapp/eal/eal.c
@@ -366,6 +366,7 @@ eal_usage(const char *prgname)
 	       "  --"OPT_BASE_VIRTADDR"     Base virtual address\n"
 	       "  --"OPT_CREATE_UIO_DEV"    Create /dev/uioX (usually done by hotplug)\n"
 	       "  --"OPT_VFIO_INTR"         Interrupt mode for VFIO (legacy|msi|msix)\n"
+	       "  --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n"
 	       "\n");
 	/* Allow the application to print its usage message too if hook is set */
 	if ( rte_application_usage_hook ) {
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 09/23] eal: add "legacy memory" option
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (7 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 08/23] eal: add "single file segments" command-line option Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 10/23] eal: read hugepage counts from node-specific sysfs path Anatoly Burakov
                   ` (18 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

This adds a "--legacy-mem" command-line switch. It will be used to
go back to the old memory behavior: one where we cannot dynamically
allocate/free memory (the downside), but where the user can get
physically contiguous memory, like before (the upside).

For now, nothing but the legacy behavior exists; the non-legacy
memory init sequence will be added later.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/eal_common_options.c |  4 ++++
 lib/librte_eal/common/eal_internal_cfg.h   |  3 +++
 lib/librte_eal/common/eal_options.h        |  2 ++
 lib/librte_eal/linuxapp/eal/eal.c          |  1 +
 lib/librte_eal/linuxapp/eal/eal_memory.c   | 22 ++++++++++++++++++----
 5 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_options.c b/lib/librte_eal/common/eal_common_options.c
index c3f7c41..88ff35a 100644
--- a/lib/librte_eal/common/eal_common_options.c
+++ b/lib/librte_eal/common/eal_common_options.c
@@ -99,6 +99,7 @@ eal_long_options[] = {
 	{OPT_VFIO_INTR,         1, NULL, OPT_VFIO_INTR_NUM        },
 	{OPT_VMWARE_TSC_MAP,    0, NULL, OPT_VMWARE_TSC_MAP_NUM   },
 	{OPT_SINGLE_FILE_SEGMENTS, 0, NULL, OPT_SINGLE_FILE_SEGMENTS_NUM},
+	{OPT_LEGACY_MEM,        0, NULL, OPT_LEGACY_MEM_NUM       },
 	{0,                     0, NULL, 0                        }
 };
 
@@ -1162,6 +1163,9 @@ eal_parse_common_option(int opt, const char *optarg,
 	case OPT_SINGLE_FILE_SEGMENTS_NUM:
 		conf->single_file_segments = 1;
 		break;
+	case OPT_LEGACY_MEM_NUM:
+		conf->legacy_mem = 1;
+		break;
 
 	/* don't know what to do, leave this to caller */
 	default:
diff --git a/lib/librte_eal/common/eal_internal_cfg.h b/lib/librte_eal/common/eal_internal_cfg.h
index 484a32e..62ab15b 100644
--- a/lib/librte_eal/common/eal_internal_cfg.h
+++ b/lib/librte_eal/common/eal_internal_cfg.h
@@ -79,6 +79,9 @@ struct internal_config {
 	volatile unsigned single_file_segments;
 	/**< true if storing all pages within single files (per-page-size,
 	 * per-node). */
+	volatile unsigned legacy_mem;
+	/**< true to enable legacy memory behavior (no dynamic allocation,
+	 * contiguous segments). */
 	volatile int syslog_facility;	  /**< facility passed to openlog() */
 	/** default interrupt mode for VFIO */
 	volatile enum rte_intr_mode vfio_intr_mode;
diff --git a/lib/librte_eal/common/eal_options.h b/lib/librte_eal/common/eal_options.h
index 26a682a..d09b034 100644
--- a/lib/librte_eal/common/eal_options.h
+++ b/lib/librte_eal/common/eal_options.h
@@ -85,6 +85,8 @@ enum {
 	OPT_VMWARE_TSC_MAP_NUM,
 #define OPT_SINGLE_FILE_SEGMENTS    "single-file-segments"
 	OPT_SINGLE_FILE_SEGMENTS_NUM,
+#define OPT_LEGACY_MEM    "legacy-mem"
+	OPT_LEGACY_MEM_NUM,
 	OPT_LONG_MAX_NUM
 };
 
diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c
index 2a3127f..37ae8e0 100644
--- a/lib/librte_eal/linuxapp/eal/eal.c
+++ b/lib/librte_eal/linuxapp/eal/eal.c
@@ -367,6 +367,7 @@ eal_usage(const char *prgname)
 	       "  --"OPT_CREATE_UIO_DEV"    Create /dev/uioX (usually done by hotplug)\n"
 	       "  --"OPT_VFIO_INTR"         Interrupt mode for VFIO (legacy|msi|msix)\n"
 	       "  --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n"
+	       "  --"OPT_LEGACY_MEM"        Legacy memory mode (no dynamic allocation, contiguous segments)\n"
 	       "\n");
 	/* Allow the application to print its usage message too if hook is set */
 	if ( rte_application_usage_hook ) {
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index dd18d98..5b18af9 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -940,8 +940,8 @@ huge_recover_sigbus(void)
  *  6. unmap the first mapping
  *  7. fill memsegs in configuration with contiguous zones
  */
-int
-rte_eal_hugepage_init(void)
+static int
+eal_legacy_hugepage_init(void)
 {
 	struct rte_mem_config *mcfg;
 	struct hugepage_file *hugepage = NULL, *tmp_hp = NULL;
@@ -1283,8 +1283,8 @@ getFileSize(int fd)
  * configuration and finds the hugepages which form that segment, mapping them
  * in order to form a contiguous block in the virtual memory space
  */
-int
-rte_eal_hugepage_attach(void)
+static int
+eal_legacy_hugepage_attach(void)
 {
 	const struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
 	struct hugepage_file *hp = NULL;
@@ -1435,6 +1435,20 @@ rte_eal_hugepage_attach(void)
 }
 
 int
+rte_eal_hugepage_init(void) {
+	if (internal_config.legacy_mem)
+		return eal_legacy_hugepage_init();
+	return -1;
+}
+
+int
+rte_eal_hugepage_attach(void) {
+	if (internal_config.legacy_mem)
+		return eal_legacy_hugepage_attach();
+	return -1;
+}
+
+int
 rte_eal_using_phys_addrs(void)
 {
 	return phys_addrs_available;
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 10/23] eal: read hugepage counts from node-specific sysfs path
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (8 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 09/23] eal: add "legacy memory" option Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 11/23] eal: replace memseg with memseg lists Anatoly Burakov
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

For non-legacy memory init mode, instead of looking at the generic
sysfs path, look at the per-NUMA-node sysfs paths for hugepage
counts. Note that the per-node paths do not provide information
about reserved pages, so we might not get the most accurate counts
from them. However, this saves us from the whole mapping/remapping
business needed before we can tell which page is on which socket,
because we no longer require our memory to be physically contiguous.

Legacy memory init will not use this.
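
Concretely, for 2M pages on socket 0 the path read is
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages.
A rough sketch of the lookup (free_hugepages_on_node() is a made-up
name; eal_parse_sysfs_value() is the existing internal EAL helper
used in the diff below):

#include <limits.h>
#include <stdio.h>

#include "eal_filesystem.h"	/* eal_parse_sysfs_value() */

/* sketch: read free hugepage count for one page size on one node */
static unsigned long
free_hugepages_on_node(unsigned socket, const char *size_subdir)
{
	char path[PATH_MAX];
	unsigned long num = 0;

	snprintf(path, sizeof(path),
		"/sys/devices/system/node/node%u/hugepages/%s/free_hugepages",
		socket, size_subdir);	/* e.g. "hugepages-2048kB" */
	if (eal_parse_sysfs_value(path, &num) < 0)
		return 0;
	return num;
}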

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/linuxapp/eal/eal_hugepage_info.c | 73 +++++++++++++++++++++++--
 1 file changed, 67 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
index 86e174f..a85c15a 100644
--- a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
+++ b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
@@ -59,6 +59,7 @@
 #include "eal_filesystem.h"
 
 static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
+static const char sys_pages_numa_dir_path[] = "/sys/devices/system/node";
 
 /* this function is only called from eal_hugepage_info_init which itself
  * is only called from a primary process */
@@ -99,6 +100,42 @@ get_num_hugepages(const char *subdir)
 	return num_pages;
 }
 
+static uint32_t
+get_num_hugepages_on_node(const char *subdir, unsigned socket) {
+	char path[PATH_MAX], socketpath[PATH_MAX];
+	DIR *socketdir;
+	long unsigned num_pages = 0;
+	const char *nr_hp_file = "free_hugepages";
+
+	snprintf(socketpath, sizeof(socketpath), "%s/node%u/hugepages",
+		 sys_pages_numa_dir_path, socket);
+
+	socketdir = opendir(socketpath);
+	if (socketdir) {
+		/* Keep calm and carry on */
+		closedir(socketdir);
+	} else {
+		/* Can't find socket dir, so ignore it */
+		return 0;
+	}
+
+	snprintf(path, sizeof(path), "%s/%s/%s",
+			socketpath, subdir, nr_hp_file);
+	if (eal_parse_sysfs_value(path, &num_pages) < 0)
+		return 0;
+
+	if (num_pages == 0)
+		RTE_LOG(WARNING, EAL, "No free hugepages reported in %s\n",
+				subdir);
+
+	/* we want to return a uint32_t and more than this looks suspicious
+	 * anyway ... */
+	if (num_pages > UINT32_MAX)
+		num_pages = UINT32_MAX;
+
+	return num_pages;
+}
+
 static uint64_t
 get_default_hp_size(void)
 {
@@ -277,7 +314,7 @@ eal_hugepage_info_init(void)
 {
 	const char dirent_start_text[] = "hugepages-";
 	const size_t dirent_start_len = sizeof(dirent_start_text) - 1;
-	unsigned i, num_sizes = 0;
+	unsigned i, total_pages, num_sizes = 0;
 	DIR *dir;
 	struct dirent *dirent;
 
@@ -331,9 +368,24 @@ eal_hugepage_info_init(void)
 		if (clear_hugedir(hpi->hugedir) == -1)
 			break;
 
-		/* for now, put all pages into socket 0,
-		 * later they will be sorted */
-		hpi->num_pages[0] = get_num_hugepages(dirent->d_name);
+		/* first, try to put all hugepages into relevant sockets, but
+		 * if first attempts fails, fall back to collecting all pages
+		 * in one socket and sorting them later */
+		total_pages = 0;
+		/* we also don't want to do this for legacy init */
+		if (!internal_config.legacy_mem)
+			for (i = 0; i < rte_num_sockets(); i++) {
+				unsigned num_pages =
+						get_num_hugepages_on_node(
+							dirent->d_name, i);
+				hpi->num_pages[i] = num_pages;
+				total_pages += num_pages;
+			}
+		/* we failed to sort memory from the get go, so fall
+		 * back to old way */
+		if (total_pages == 0) {
+			hpi->num_pages[0] = get_num_hugepages(dirent->d_name);
+		}
 
 #ifndef RTE_ARCH_64
 		/* for 32-bit systems, limit number of hugepages to
@@ -357,10 +409,19 @@ eal_hugepage_info_init(void)
 	      sizeof(internal_config.hugepage_info[0]), compare_hpi);
 
 	/* now we have all info, check we have at least one valid size */
-	for (i = 0; i < num_sizes; i++)
+	for (i = 0; i < num_sizes; i++) {
+		/* pages may no longer all be on socket 0, so check all */
+		unsigned j, num_pages = 0;
+
+		for (j = 0; j < RTE_MAX_NUMA_NODES; j++) {
+			struct hugepage_info *hpi =
+					&internal_config.hugepage_info[i];
+			num_pages += hpi->num_pages[j];
+		}
 		if (internal_config.hugepage_info[i].hugedir != NULL &&
-		    internal_config.hugepage_info[i].num_pages[0] > 0)
+				num_pages > 0)
 			return 0;
+	}
 
 	/* no valid hugepage mounts available, return error */
 	return -1;
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 11/23] eal: replace memseg with memseg lists
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (9 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 10/23] eal: read hugepage counts from node-specific sysfs path Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 12/23] eal: add support for dynamic memory allocation Anatoly Burakov
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

Before, we were aggregating multiple pages into one memseg, so the
number of memsegs was small. Now, each page gets its own memseg,
so the list of memsegs is huge. To accommodate the new memseg list
size and to keep the under-the-hood workings sane, the memseg list
is now not just a single list, but multiple lists. To be precise,
each hugepage size available on the system gets a memseg list per
socket (so, for example, on a 2-socket system with 2M and 1G
hugepages, we will get 4 memseg lists).

In order to support dynamic memory allocation, we reserve all VA
space in advance. That is, we do an anonymous mmap() of the entire
maximum amount of memory per hugepage size, limited to either
RTE_MAX_MEMSEG_PER_LIST pages or 128G worth of memory, whichever is
smaller. The limit is arbitrary.
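
For example (same arithmetic as get_mem_amount() in the diff below,
assuming the default RTE_MAX_MEMSEG_PER_LIST of 32768):

#include <stdint.h>
#include <rte_common.h>		/* RTE_MIN, rte_align64pow2 */

/* 2M pages: 32768 * 2M = 64G -> 64G of VA reserved per list  */
/* 1G pages: 32768 * 1G = 32T -> capped at 128G of VA per list */
static uint64_t
va_reservation_sketch(uint64_t page_sz)
{
	uint64_t sz = RTE_MIN(page_sz * 32768ULL, 1ULL << 37 /* 128G */);

	return rte_align64pow2(sz);
}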

So, for each hugepage size, we get (by default) up to 128G worth
of memory, per socket. The address space is claimed at the start,
in eal_common_memory.c. The actual page allocation code is in
eal_memalloc.c (Linux-only for now), and largely consists of
moved EAL memory init code.

Pages in a list are also indexed by address. That is, in non-legacy
mode, to figure out which list a page belongs to, one can simply
compare its address against each list's base address. Similarly,
figuring out the IOVA address of a memzone is a matter of finding
the right memseg list, taking the offset from its base address and
dividing it by the page size to get the appropriate memseg. In
legacy mode, the old behavior of walking the memseg list remains.
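
A sketch of that lookup (it mirrors rte_mem_virt2memseg() in the diff
below; lookup_sketch() is a made-up name and assumes the address is
already known to fall inside the given list):

#include <stdint.h>
#include <rte_eal_memconfig.h>	/* struct rte_memseg_list */
#include <rte_fbarray.h>

/* sketch: index = offset from list base divided by the list's page size */
static const struct rte_memseg *
lookup_sketch(const void *addr, const struct rte_memseg_list *msl)
{
	uintptr_t off = (uintptr_t)addr - (uintptr_t)msl->base_va;
	int ms_idx = off / msl->hugepage_sz;

	return rte_fbarray_get(&msl->memseg_arr, ms_idx);
}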

Due to the switch to fbarray, secondary processes are currently
neither supported nor tested. Also, one particular API call (dump
physmem layout) no longer makes sense, not only because there can
now be holes in a memseg list, but also because there are several
memseg lists to choose from.

In legacy mode, nothing is preallocated, and all memsegs are in
a list like before, but each segment still resides in an appropriate
memseg list.

The rest of the changes are really ripple effects from the memseg
change - heap changes, compile fixes, and rewrites to support
fbarray-backed memseg lists.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 config/common_base                                |   3 +-
 drivers/bus/pci/linux/pci.c                       |  29 ++-
 drivers/net/virtio/virtio_user/vhost_kernel.c     | 106 +++++---
 lib/librte_eal/common/eal_common_memory.c         | 245 ++++++++++++++++--
 lib/librte_eal/common/eal_common_memzone.c        |   5 +-
 lib/librte_eal/common/eal_hugepages.h             |   1 +
 lib/librte_eal/common/include/rte_eal_memconfig.h |  22 +-
 lib/librte_eal/common/include/rte_memory.h        |  16 ++
 lib/librte_eal/common/malloc_elem.c               |   8 +-
 lib/librte_eal/common/malloc_elem.h               |   6 +-
 lib/librte_eal/common/malloc_heap.c               |  88 +++++--
 lib/librte_eal/common/rte_malloc.c                |  20 +-
 lib/librte_eal/linuxapp/eal/eal.c                 |  21 +-
 lib/librte_eal/linuxapp/eal/eal_memory.c          | 299 ++++++++++++++--------
 lib/librte_eal/linuxapp/eal/eal_vfio.c            | 162 ++++++++----
 test/test/test_malloc.c                           |  29 ++-
 test/test/test_memory.c                           |  44 +++-
 test/test/test_memzone.c                          |  17 +-
 18 files changed, 815 insertions(+), 306 deletions(-)

diff --git a/config/common_base b/config/common_base
index e74febe..9730d4c 100644
--- a/config/common_base
+++ b/config/common_base
@@ -90,7 +90,8 @@ CONFIG_RTE_CACHE_LINE_SIZE=64
 CONFIG_RTE_LIBRTE_EAL=y
 CONFIG_RTE_MAX_LCORE=128
 CONFIG_RTE_MAX_NUMA_NODES=8
-CONFIG_RTE_MAX_MEMSEG=256
+CONFIG_RTE_MAX_MEMSEG_LISTS=16
+CONFIG_RTE_MAX_MEMSEG_PER_LIST=32768
 CONFIG_RTE_MAX_MEMZONE=2560
 CONFIG_RTE_MAX_TAILQ=32
 CONFIG_RTE_ENABLE_ASSERT=n
diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
index 5da6728..6d3100f 100644
--- a/drivers/bus/pci/linux/pci.c
+++ b/drivers/bus/pci/linux/pci.c
@@ -148,19 +148,30 @@ rte_pci_unmap_device(struct rte_pci_device *dev)
 void *
 pci_find_max_end_va(void)
 {
-	const struct rte_memseg *seg = rte_eal_get_physmem_layout();
-	const struct rte_memseg *last = seg;
-	unsigned i = 0;
+	void *cur_end, *max_end = NULL;
+	int i = 0;
 
-	for (i = 0; i < RTE_MAX_MEMSEG; i++, seg++) {
-		if (seg->addr == NULL)
-			break;
+	for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
+		const struct rte_mem_config *mcfg =
+				rte_eal_get_configuration()->mem_config;
+		const struct rte_memseg_list *msl = &mcfg->memsegs[i];
+		const struct rte_fbarray *arr = &msl->memseg_arr;
 
-		if (seg->addr > last->addr)
-			last = seg;
+		if (arr->capacity == 0)
+			continue;
 
+		/*
+		 * we need to handle legacy mem case, so don't rely on page size
+		 * to calculate max VA end
+		 */
+		while ((i = rte_fbarray_find_next_used(arr, i)) >= 0) {
+			struct rte_memseg *ms = rte_fbarray_get(arr, i);
+			cur_end = RTE_PTR_ADD(ms->addr, ms->len);
+			if (cur_end > max_end)
+				max_end = cur_end;
+		}
 	}
-	return RTE_PTR_ADD(last->addr, last->len);
+	return max_end;
 }
 
 /* parse one line of the "resource" sysfs file (note that the 'line'
diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c
index 68d28b1..f3f1549 100644
--- a/drivers/net/virtio/virtio_user/vhost_kernel.c
+++ b/drivers/net/virtio/virtio_user/vhost_kernel.c
@@ -99,6 +99,40 @@ static uint64_t vhost_req_user_to_kernel[] = {
 	[VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE,
 };
 
+/* returns number of segments processed */
+static int
+add_memory_region(struct vhost_memory_region *mr, const struct rte_fbarray *arr,
+		int reg_start_idx, int max) {
+	const struct rte_memseg *ms;
+	void *start_addr, *expected_addr;
+	uint64_t len;
+	int idx;
+
+	idx = reg_start_idx;
+	len = 0;
+	start_addr = NULL;
+	expected_addr = NULL;
+
+	/* we could've relied on page size, but we have to support legacy mem */
+	while (idx < max){
+		ms = rte_fbarray_get(arr, idx);
+		if (expected_addr == NULL) {
+			start_addr = ms->addr;
+			expected_addr = RTE_PTR_ADD(ms->addr, ms->len);
+		} else if (ms->addr != expected_addr)
+			break;
+		len += ms->len;
+		idx++;
+	}
+
+	mr->guest_phys_addr = (uint64_t)(uintptr_t) start_addr;
+	mr->userspace_addr = (uint64_t)(uintptr_t) start_addr;
+	mr->memory_size = len;
+	mr->mmap_offset = 0;
+
+	return idx;
+}
+
 /* By default, vhost kernel module allows 64 regions, but DPDK allows
  * 256 segments. As a relief, below function merges those virtually
  * adjacent memsegs into one region.
@@ -106,8 +140,7 @@ static uint64_t vhost_req_user_to_kernel[] = {
 static struct vhost_memory_kernel *
 prepare_vhost_memory_kernel(void)
 {
-	uint32_t i, j, k = 0;
-	struct rte_memseg *seg;
+	uint32_t list_idx, region_nr = 0;
 	struct vhost_memory_region *mr;
 	struct vhost_memory_kernel *vm;
 
@@ -117,52 +150,41 @@ prepare_vhost_memory_kernel(void)
 	if (!vm)
 		return NULL;
 
-	for (i = 0; i < RTE_MAX_MEMSEG; ++i) {
-		seg = &rte_eal_get_configuration()->mem_config->memseg[i];
-		if (!seg->addr)
-			break;
-
-		int new_region = 1;
-
-		for (j = 0; j < k; ++j) {
-			mr = &vm->regions[j];
+	for (list_idx = 0; list_idx < RTE_MAX_MEMSEG_LISTS; ++list_idx) {
+		const struct rte_mem_config *mcfg =
+				rte_eal_get_configuration()->mem_config;
+		const struct rte_memseg_list *msl = &mcfg->memsegs[list_idx];
+		const struct rte_fbarray *arr = &msl->memseg_arr;
+		int reg_start_idx, search_idx;
 
-			if (mr->userspace_addr + mr->memory_size ==
-			    (uint64_t)(uintptr_t)seg->addr) {
-				mr->memory_size += seg->len;
-				new_region = 0;
-				break;
-			}
-
-			if ((uint64_t)(uintptr_t)seg->addr + seg->len ==
-			    mr->userspace_addr) {
-				mr->guest_phys_addr =
-					(uint64_t)(uintptr_t)seg->addr;
-				mr->userspace_addr =
-					(uint64_t)(uintptr_t)seg->addr;
-				mr->memory_size += seg->len;
-				new_region = 0;
-				break;
-			}
-		}
-
-		if (new_region == 0)
+		/* skip empty segment lists */
+		if (arr->count == 0)
 			continue;
 
-		mr = &vm->regions[k++];
-		/* use vaddr here! */
-		mr->guest_phys_addr = (uint64_t)(uintptr_t)seg->addr;
-		mr->userspace_addr = (uint64_t)(uintptr_t)seg->addr;
-		mr->memory_size = seg->len;
-		mr->mmap_offset = 0;
-
-		if (k >= max_regions) {
-			free(vm);
-			return NULL;
+		search_idx = 0;
+		while ((reg_start_idx = rte_fbarray_find_next_used(arr,
+				search_idx)) >= 0) {
+			int reg_n_pages;
+			if (region_nr >= max_regions) {
+				free(vm);
+				return NULL;
+			}
+			mr = &vm->regions[region_nr++];
+
+			/*
+			 * we know memseg starts at search_idx, check how many
+			 * segments there are
+			 */
+			reg_n_pages = rte_fbarray_find_contig_used(arr,
+					search_idx);
+
+			/* look at at most reg_n_pages of memsegs */
+			search_idx = add_memory_region(mr, arr, reg_start_idx,
+					search_idx + reg_n_pages);
 		}
 	}
 
-	vm->nregions = k;
+	vm->nregions = region_nr;
 	vm->padding = 0;
 	return vm;
 }
diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index 96570a7..bdd465b 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -42,6 +42,7 @@
 #include <sys/mman.h>
 #include <sys/queue.h>
 
+#include <rte_fbarray.h>
 #include <rte_memory.h>
 #include <rte_eal.h>
 #include <rte_eal_memconfig.h>
@@ -58,6 +59,8 @@
  * which is a multiple of hugepage size.
  */
 
+#define MEMSEG_LIST_FMT "memseg-%luk-%i"
+
 static uint64_t baseaddr_offset;
 
 void *
@@ -117,6 +120,178 @@ eal_get_virtual_area(void *requested_addr, uint64_t *size,
 	return addr;
 }
 
+static uint64_t
+get_mem_amount(uint64_t page_sz) {
+	uint64_t area_sz;
+
+	// TODO: saner heuristics
+	/* limit to RTE_MAX_MEMSEG_PER_LIST pages or 128G worth of memory */
+	area_sz = RTE_MIN(page_sz * RTE_MAX_MEMSEG_PER_LIST, 1ULL << 37);
+
+	return rte_align64pow2(area_sz);
+}
+
+static int
+get_max_num_pages(uint64_t page_sz, uint64_t mem_amount) {
+	return mem_amount / page_sz;
+}
+
+static int
+get_min_num_pages(int max_pages) {
+	return RTE_MIN(256, max_pages);
+}
+
+static int
+alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+		int socket_id) {
+	char name[RTE_FBARRAY_NAME_LEN];
+	int min_pages, max_pages;
+	uint64_t mem_amount;
+	void *addr;
+
+	if (!internal_config.legacy_mem) {
+		mem_amount = get_mem_amount(page_sz);
+		max_pages = get_max_num_pages(page_sz, mem_amount);
+		min_pages = get_min_num_pages(max_pages);
+
+		// TODO: allow shrink?
+		addr = eal_get_virtual_area(NULL, &mem_amount, page_sz, 0);
+		if (addr == NULL) {
+			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
+			return -1;
+		}
+	} else {
+		addr = NULL;
+		min_pages = 256;
+		max_pages = 256;
+	}
+
+	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id);
+	if (rte_fbarray_alloc(&msl->memseg_arr, name, min_pages, max_pages,
+			sizeof(struct rte_memseg))) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
+		return -1;
+	}
+
+	msl->hugepage_sz = page_sz;
+	msl->socket_id = socket_id;
+	msl->base_va = addr;
+
+	return 0;
+}
+
+static int
+memseg_init(void) {
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	int socket_id, hpi_idx, msl_idx = 0;
+	struct rte_memseg_list *msl;
+
+	if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
+		RTE_LOG(ERR, EAL, "Secondary process not supported\n");
+		return -1;
+	}
+
+	/* create memseg lists */
+	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
+			hpi_idx++) {
+		struct hugepage_info *hpi;
+		uint64_t hugepage_sz;
+
+		hpi = &internal_config.hugepage_info[hpi_idx];
+		hugepage_sz = hpi->hugepage_sz;
+
+		for (socket_id = 0; socket_id < (int) rte_num_sockets();
+				socket_id++) {
+			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
+				RTE_LOG(ERR, EAL,
+					"No more space in memseg lists\n");
+				return -1;
+			}
+			msl = &mcfg->memsegs[msl_idx++];
+
+			if (alloc_memseg_list(msl, hugepage_sz, socket_id)) {
+				return -1;
+			}
+		}
+	}
+	return 0;
+}
+
+static const struct rte_memseg *
+virt2memseg(const void *addr, const struct rte_memseg_list *msl) {
+	const struct rte_mem_config *mcfg =
+		rte_eal_get_configuration()->mem_config;
+	const struct rte_fbarray *arr;
+	int msl_idx, ms_idx;
+
+	/* first, find appropriate memseg list, if it wasn't specified */
+	if (msl == NULL) {
+		for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) {
+			void *start, *end;
+			msl = &mcfg->memsegs[msl_idx];
+
+			start = msl->base_va;
+			end = RTE_PTR_ADD(start, msl->hugepage_sz *
+					msl->memseg_arr.capacity);
+			if (addr >= start && addr < end)
+				break;
+		}
+		/* if we didn't find our memseg list */
+		if (msl_idx == RTE_MAX_MEMSEG_LISTS)
+			return NULL;
+	} else {
+		/* a memseg list was specified, check if it's the right one */
+		void *start, *end;
+		start = msl->base_va;
+		end = RTE_PTR_ADD(start, msl->hugepage_sz *
+				msl->memseg_arr.capacity);
+
+		if (addr < start || addr >= end)
+			return NULL;
+	}
+
+	/* now, calculate index */
+	arr = &msl->memseg_arr;
+	ms_idx = RTE_PTR_DIFF(addr, msl->base_va) / msl->hugepage_sz;
+	return rte_fbarray_get(arr, ms_idx);
+}
+
+static const struct rte_memseg *
+virt2memseg_legacy(const void *addr) {
+	const struct rte_mem_config *mcfg =
+		rte_eal_get_configuration()->mem_config;
+	const struct rte_memseg_list *msl;
+	const struct rte_fbarray *arr;
+	int msl_idx, ms_idx;
+	for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) {
+		msl = &mcfg->memsegs[msl_idx];
+		arr = &msl->memseg_arr;
+
+		ms_idx = 0;
+		while ((ms_idx = rte_fbarray_find_next_used(arr, ms_idx)) >= 0) {
+			const struct rte_memseg *ms;
+			void *start, *end;
+			ms = rte_fbarray_get(arr, ms_idx);
+			start = ms->addr;
+			end = RTE_PTR_ADD(start, ms->len);
+			if (addr >= start && addr < end)
+				return ms;
+			ms_idx++;
+		}
+	}
+	return NULL;
+}
+
+const struct rte_memseg *
+rte_mem_virt2memseg(const void *addr, const struct rte_memseg_list *msl) {
+	/* for legacy memory, we just walk the list, like in the old days. */
+	if (internal_config.legacy_mem) {
+		return virt2memseg_legacy(addr);
+	} else {
+		return virt2memseg(addr, msl);
+	}
+}
+
 
 /*
  * Return a pointer to a read-only table of struct rte_physmem_desc
@@ -126,7 +301,9 @@ eal_get_virtual_area(void *requested_addr, uint64_t *size,
 const struct rte_memseg *
 rte_eal_get_physmem_layout(void)
 {
-	return rte_eal_get_configuration()->mem_config->memseg;
+	struct rte_fbarray *arr;
+	arr = &rte_eal_get_configuration()->mem_config->memsegs[0].memseg_arr;
+	return rte_fbarray_get(arr, 0);
 }
 
 
@@ -141,11 +318,24 @@ rte_eal_get_physmem_size(void)
 	/* get pointer to global configuration */
 	mcfg = rte_eal_get_configuration()->mem_config;
 
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
-		if (mcfg->memseg[i].addr == NULL)
-			break;
+	for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
+		const struct rte_memseg_list *msl = &mcfg->memsegs[i];
+
+		if (msl->memseg_arr.count == 0)
+			continue;
+
+		/* for legacy mem mode, walk the memsegs */
+		if (internal_config.legacy_mem) {
+			const struct rte_fbarray *arr = &msl->memseg_arr;
+			int ms_idx = 0;
 
-		total_len += mcfg->memseg[i].len;
+			while ((ms_idx = rte_fbarray_find_next_used(arr,
+					ms_idx)) >= 0) {
+				const struct rte_memseg *ms =
+						rte_fbarray_get(arr, ms_idx);
+				total_len += ms->len;
+				ms_idx++;
+			}
+		} else
+			total_len += msl->hugepage_sz * msl->memseg_arr.count;
 	}
 
 	return total_len;
@@ -161,21 +351,29 @@ rte_dump_physmem_layout(FILE *f)
 	/* get pointer to global configuration */
 	mcfg = rte_eal_get_configuration()->mem_config;
 
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
-		if (mcfg->memseg[i].addr == NULL)
-			break;
-
-		fprintf(f, "Segment %u: IOVA:0x%"PRIx64", len:%zu, "
-		       "virt:%p, socket_id:%"PRId32", "
-		       "hugepage_sz:%"PRIu64", nchannel:%"PRIx32", "
-		       "nrank:%"PRIx32"\n", i,
-		       mcfg->memseg[i].iova,
-		       mcfg->memseg[i].len,
-		       mcfg->memseg[i].addr,
-		       mcfg->memseg[i].socket_id,
-		       mcfg->memseg[i].hugepage_sz,
-		       mcfg->memseg[i].nchannel,
-		       mcfg->memseg[i].nrank);
+	for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
+		const struct rte_memseg_list *msl = &mcfg->memsegs[i];
+		const struct rte_fbarray *arr = &msl->memseg_arr;
+		int m_idx = 0;
+
+		if (arr->count == 0)
+			continue;
+
+		while ((m_idx = rte_fbarray_find_next_used(arr, m_idx)) >= 0) {
+			struct rte_memseg *ms = rte_fbarray_get(arr, m_idx);
+			fprintf(f, "Page %u-%u: iova:0x%"PRIx64", len:%zu, "
+			       "virt:%p, socket_id:%"PRId32", "
+			       "hugepage_sz:%"PRIu64", nchannel:%"PRIx32", "
+			       "nrank:%"PRIx32"\n", i, m_idx,
+			       ms->iova,
+			       ms->len,
+			       ms->addr,
+			       ms->socket_id,
+			       ms->hugepage_sz,
+			       ms->nchannel,
+			       ms->nrank);
+			m_idx++;
+		}
 	}
 }
 
@@ -220,9 +418,14 @@ rte_mem_lock_page(const void *virt)
 int
 rte_eal_memory_init(void)
 {
+	int retval;
 	RTE_LOG(DEBUG, EAL, "Setting up physically contiguous memory...\n");
 
-	const int retval = rte_eal_process_type() == RTE_PROC_PRIMARY ?
+	retval = memseg_init();
+	if (retval < 0)
+		return -1;
+
+	retval = rte_eal_process_type() == RTE_PROC_PRIMARY ?
 			rte_eal_hugepage_init() :
 			rte_eal_hugepage_attach();
 	if (retval < 0)
diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c
index ea072a2..f558ac2 100644
--- a/lib/librte_eal/common/eal_common_memzone.c
+++ b/lib/librte_eal/common/eal_common_memzone.c
@@ -254,10 +254,9 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len,
 	mz->iova = rte_malloc_virt2iova(mz_addr);
 	mz->addr = mz_addr;
 	mz->len = (requested_len == 0 ? elem->size : requested_len);
-	mz->hugepage_sz = elem->ms->hugepage_sz;
-	mz->socket_id = elem->ms->socket_id;
+	mz->hugepage_sz = elem->msl->hugepage_sz;
+	mz->socket_id = elem->msl->socket_id;
 	mz->flags = 0;
-	mz->memseg_id = elem->ms - rte_eal_get_configuration()->mem_config->memseg;
 
 	return mz;
 }
diff --git a/lib/librte_eal/common/eal_hugepages.h b/lib/librte_eal/common/eal_hugepages.h
index 68369f2..cf91009 100644
--- a/lib/librte_eal/common/eal_hugepages.h
+++ b/lib/librte_eal/common/eal_hugepages.h
@@ -52,6 +52,7 @@ struct hugepage_file {
 	int socket_id;      /**< NUMA socket ID */
 	int file_id;        /**< the '%d' in HUGEFILE_FMT */
 	int memseg_id;      /**< the memory segment to which page belongs */
+	int memseg_list_id; /**< the memory segment list to which page belongs */
 	char filepath[MAX_HUGEPAGE_PATH]; /**< path to backing file on filesystem */
 };
 
diff --git a/lib/librte_eal/common/include/rte_eal_memconfig.h b/lib/librte_eal/common/include/rte_eal_memconfig.h
index b9eee70..c9b57a4 100644
--- a/lib/librte_eal/common/include/rte_eal_memconfig.h
+++ b/lib/librte_eal/common/include/rte_eal_memconfig.h
@@ -40,12 +40,30 @@
 #include <rte_malloc_heap.h>
 #include <rte_rwlock.h>
 #include <rte_pause.h>
+#include <rte_fbarray.h>
 
 #ifdef __cplusplus
 extern "C" {
 #endif
 
 /**
+ * memseg list is a special case as we need to store a bunch of other data
+ * together with the array itself.
+ */
+struct rte_memseg_list {
+	RTE_STD_C11
+	union {
+		void *base_va;
+		/**< Base virtual address for this memseg list. */
+		uint64_t addr_64;
+		/**< Makes sure addr is always 64-bits */
+	};
+	int socket_id; /**< Socket ID for all memsegs in this list. */
+	uint64_t hugepage_sz; /**< page size for all memsegs in this list. */
+	struct rte_fbarray memseg_arr;
+};
+
+/**
  * the structure for the memory configuration for the RTE.
  * Used by the rte_config structure. It is separated out, as for multi-process
  * support, the memory details should be shared across instances
@@ -71,9 +89,11 @@ struct rte_mem_config {
 	uint32_t memzone_cnt; /**< Number of allocated memzones */
 
 	/* memory segments and zones */
-	struct rte_memseg memseg[RTE_MAX_MEMSEG];    /**< Physmem descriptors. */
 	struct rte_memzone memzone[RTE_MAX_MEMZONE]; /**< Memzone descriptors. */
 
+	struct rte_memseg_list memsegs[RTE_MAX_MEMSEG_LISTS];
+	/**< list of dynamic arrays holding memsegs */
+
 	struct rte_tailq_head tailq_head[RTE_MAX_TAILQ]; /**< Tailqs for objects */
 
 	/* Heaps of Malloc per socket */
diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h
index 14aacea..f005716 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -50,6 +50,9 @@ extern "C" {
 
 #include <rte_common.h>
 
+/* forward declaration for pointers */
+struct rte_memseg_list;
+
 __extension__
 enum rte_page_sizes {
 	RTE_PGSIZE_4K    = 1ULL << 12,
@@ -158,6 +161,19 @@ phys_addr_t rte_mem_virt2phy(const void *virt);
 rte_iova_t rte_mem_virt2iova(const void *virt);
 
 /**
+ * Get memseg corresponding to virtual memory address.
+ *
+ * @param virt
+ *   The virtual address.
+ * @param msl
+ *   Memseg list in which to look for memsegs (can be NULL).
+ * @return
+ *   Memseg to which this virtual address belongs to.
+ */
+const struct rte_memseg *rte_mem_virt2memseg(const void *virt,
+		const struct rte_memseg_list *msl);
+
+/**
  * Get the layout of the available physical memory.
  *
  * It can be useful for an application to have the full physical
diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c
index 782aaa7..ab09b94 100644
--- a/lib/librte_eal/common/malloc_elem.c
+++ b/lib/librte_eal/common/malloc_elem.c
@@ -54,11 +54,11 @@
  * Initialize a general malloc_elem header structure
  */
 void
-malloc_elem_init(struct malloc_elem *elem,
-		struct malloc_heap *heap, const struct rte_memseg *ms, size_t size)
+malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
+		const struct rte_memseg_list *msl, size_t size)
 {
 	elem->heap = heap;
-	elem->ms = ms;
+	elem->msl = msl;
 	elem->prev = NULL;
 	elem->next = NULL;
 	memset(&elem->free_list, 0, sizeof(elem->free_list));
@@ -172,7 +172,7 @@ split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt)
 	const size_t old_elem_size = (uintptr_t)split_pt - (uintptr_t)elem;
 	const size_t new_elem_size = elem->size - old_elem_size;
 
-	malloc_elem_init(split_pt, elem->heap, elem->ms, new_elem_size);
+	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size);
 	split_pt->prev = elem;
 	split_pt->next = next_elem;
 	if (next_elem)
diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h
index cf27b59..330bddc 100644
--- a/lib/librte_eal/common/malloc_elem.h
+++ b/lib/librte_eal/common/malloc_elem.h
@@ -34,7 +34,7 @@
 #ifndef MALLOC_ELEM_H_
 #define MALLOC_ELEM_H_
 
-#include <rte_memory.h>
+#include <rte_eal_memconfig.h>
 
 /* dummy definition of struct so we can use pointers to it in malloc_elem struct */
 struct malloc_heap;
@@ -50,7 +50,7 @@ struct malloc_elem {
 	struct malloc_elem *volatile prev;      /* points to prev elem in memseg */
 	struct malloc_elem *volatile next;      /* points to next elem in memseg */
 	LIST_ENTRY(malloc_elem) free_list;      /* list of free elements in heap */
-	const struct rte_memseg *ms;
+	const struct rte_memseg_list *msl;
 	volatile enum elem_state state;
 	uint32_t pad;
 	size_t size;
@@ -137,7 +137,7 @@ malloc_elem_from_data(const void *data)
 void
 malloc_elem_init(struct malloc_elem *elem,
 		struct malloc_heap *heap,
-		const struct rte_memseg *ms,
+		const struct rte_memseg_list *msl,
 		size_t size);
 
 void
diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
index 1b35468..5fa21fe 100644
--- a/lib/librte_eal/common/malloc_heap.c
+++ b/lib/librte_eal/common/malloc_heap.c
@@ -50,6 +50,7 @@
 #include <rte_memcpy.h>
 #include <rte_atomic.h>
 
+#include "eal_internal_cfg.h"
 #include "malloc_elem.h"
 #include "malloc_heap.h"
 
@@ -91,22 +92,25 @@ check_hugepage_sz(unsigned flags, uint64_t hugepage_sz)
 }
 
 /*
- * Expand the heap with a memseg.
- * This reserves the zone and sets a dummy malloc_elem header at the end
- * to prevent overflow. The rest of the zone is added to free list as a single
- * large free block
+ * Expand the heap with a memory area.
  */
-static void
-malloc_heap_add_memseg(struct malloc_heap *heap, struct rte_memseg *ms)
+static struct malloc_elem *
+malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl,
+		void *start, size_t len)
 {
-	struct malloc_elem *start_elem = (struct malloc_elem *)ms->addr;
-	const size_t elem_size = ms->len - MALLOC_ELEM_OVERHEAD;
+	struct malloc_elem *elem = start;
+
+	malloc_elem_init(elem, heap, msl, len);
+
+	malloc_elem_insert(elem);
+
+	elem = malloc_elem_join_adjacent_free(elem);
 
-	malloc_elem_init(start_elem, heap, ms, elem_size);
-	malloc_elem_insert(start_elem);
-	malloc_elem_free_list_insert(start_elem);
+	malloc_elem_free_list_insert(elem);
 
-	heap->total_size += elem_size;
+	heap->total_size += len;
+
+	return elem;
 }
 
 /*
@@ -127,7 +131,7 @@ find_suitable_element(struct malloc_heap *heap, size_t size,
 		for (elem = LIST_FIRST(&heap->free_head[idx]);
 				!!elem; elem = LIST_NEXT(elem, free_list)) {
 			if (malloc_elem_can_hold(elem, size, align, bound)) {
-				if (check_hugepage_sz(flags, elem->ms->hugepage_sz))
+				if (check_hugepage_sz(flags, elem->msl->hugepage_sz))
 					return elem;
 				if (alt_elem == NULL)
 					alt_elem = elem;
@@ -249,16 +253,62 @@ int
 rte_eal_malloc_heap_init(void)
 {
 	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
-	unsigned ms_cnt;
-	struct rte_memseg *ms;
+	int msl_idx;
+	struct rte_memseg_list *msl;
 
 	if (mcfg == NULL)
 		return -1;
 
-	for (ms = &mcfg->memseg[0], ms_cnt = 0;
-			(ms_cnt < RTE_MAX_MEMSEG) && (ms->len > 0);
-			ms_cnt++, ms++) {
-		malloc_heap_add_memseg(&mcfg->malloc_heaps[ms->socket_id], ms);
+	for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) {
+		int start;
+		struct rte_fbarray *arr;
+		struct malloc_heap *heap;
+
+		msl = &mcfg->memsegs[msl_idx];
+		arr = &msl->memseg_arr;
+		heap = &mcfg->malloc_heaps[msl->socket_id];
+
+		if (arr->capacity == 0)
+			continue;
+
+		/* for legacy mode, just walk the list */
+		if (internal_config.legacy_mem) {
+			int ms_idx = 0;
+			while ((ms_idx = rte_fbarray_find_next_used(arr, ms_idx)) >= 0) {
+				struct rte_memseg *ms = rte_fbarray_get(arr, ms_idx);
+				malloc_heap_add_memory(heap, msl, ms->addr, ms->len);
+				ms_idx++;
+				RTE_LOG(DEBUG, EAL, "Heap on socket %d was expanded by %zdMB\n",
+					msl->socket_id, ms->len >> 20ULL);
+			}
+			continue;
+		}
+
+		/* find first segment */
+		start = rte_fbarray_find_next_used(arr, 0);
+
+		while (start >= 0) {
+			int contig_segs;
+			struct rte_memseg *start_seg;
+			size_t len, hugepage_sz = msl->hugepage_sz;
+
+			/* find how many pages we can lump in together */
+			contig_segs = rte_fbarray_find_contig_used(arr, start);
+			start_seg = rte_fbarray_get(arr, start);
+			len = contig_segs * hugepage_sz;
+
+			/*
+			 * we've found (hopefully) a bunch of contiguous
+			 * segments, so add them to the heap.
+			 */
+			malloc_heap_add_memory(heap, msl, start_seg->addr, len);
+
+			RTE_LOG(DEBUG, EAL, "Heap on socket %d was expanded by %zdMB\n",
+				msl->socket_id, len >> 20ULL);
+
+			start = rte_fbarray_find_next_used(arr,
+					start + contig_segs);
+		}
 	}
 
 	return 0;
diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c
index 74b5417..92cd7d8 100644
--- a/lib/librte_eal/common/rte_malloc.c
+++ b/lib/librte_eal/common/rte_malloc.c
@@ -251,17 +251,21 @@ rte_malloc_set_limit(__rte_unused const char *type,
 rte_iova_t
 rte_malloc_virt2iova(const void *addr)
 {
-	rte_iova_t iova;
+	const struct rte_memseg *ms;
 	const struct malloc_elem *elem = malloc_elem_from_data(addr);
+
 	if (elem == NULL)
 		return RTE_BAD_IOVA;
-	if (elem->ms->iova == RTE_BAD_IOVA)
-		return RTE_BAD_IOVA;
 
 	if (rte_eal_iova_mode() == RTE_IOVA_VA)
-		iova = (uintptr_t)addr;
-	else
-		iova = elem->ms->iova +
-			RTE_PTR_DIFF(addr, elem->ms->addr);
-	return iova;
+		return (uintptr_t) addr;
+
+	ms = rte_mem_virt2memseg(addr, elem->msl);
+	if (ms == NULL)
+		return RTE_BAD_IOVA;
+
+	if (ms->iova == RTE_BAD_IOVA)
+		return RTE_BAD_IOVA;
+
+	return ms->iova + RTE_PTR_DIFF(addr, ms->addr);
 }
diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c
index 37ae8e0..a27536f 100644
--- a/lib/librte_eal/linuxapp/eal/eal.c
+++ b/lib/librte_eal/linuxapp/eal/eal.c
@@ -102,8 +102,8 @@ static int mem_cfg_fd = -1;
 static struct flock wr_lock = {
 		.l_type = F_WRLCK,
 		.l_whence = SEEK_SET,
-		.l_start = offsetof(struct rte_mem_config, memseg),
-		.l_len = sizeof(early_mem_config.memseg),
+		.l_start = offsetof(struct rte_mem_config, memsegs),
+		.l_len = sizeof(early_mem_config.memsegs),
 };
 
 /* Address of global and public configuration */
@@ -661,17 +661,20 @@ eal_parse_args(int argc, char **argv)
 static void
 eal_check_mem_on_local_socket(void)
 {
-	const struct rte_memseg *ms;
+	const struct rte_memseg_list *msl;
 	int i, socket_id;
 
 	socket_id = rte_lcore_to_socket_id(rte_config.master_lcore);
 
-	ms = rte_eal_get_physmem_layout();
-
-	for (i = 0; i < RTE_MAX_MEMSEG; i++)
-		if (ms[i].socket_id == socket_id &&
-				ms[i].len > 0)
-			return;
+	for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
+		msl = &rte_eal_get_configuration()->mem_config->memsegs[i];
+		if (msl->socket_id != socket_id)
+			continue;
+		/* for legacy memory, check if there's anything allocated */
+		if (internal_config.legacy_mem && msl->memseg_arr.count == 0)
+			continue;
+		return;
+	}
 
 	RTE_LOG(WARNING, EAL, "WARNING: Master core has no "
 			"memory on local socket!\n");
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 5b18af9..59f6889 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -929,6 +929,24 @@ huge_recover_sigbus(void)
 	}
 }
 
+static struct rte_memseg_list *
+get_memseg_list(int socket, uint64_t page_sz) {
+	struct rte_mem_config *mcfg =
+			rte_eal_get_configuration()->mem_config;
+	struct rte_memseg_list *msl;
+	int msl_idx;
+
+	for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) {
+		msl = &mcfg->memsegs[msl_idx];
+		if (msl->hugepage_sz != page_sz)
+			continue;
+		if (msl->socket_id != socket)
+			continue;
+		return msl;
+	}
+	return NULL;
+}
+
 /*
  * Prepare physical memory mapping: fill configuration structure with
  * these infos, return 0 on success.
@@ -946,11 +964,14 @@ eal_legacy_hugepage_init(void)
 	struct rte_mem_config *mcfg;
 	struct hugepage_file *hugepage = NULL, *tmp_hp = NULL;
 	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
+	struct rte_fbarray *arr;
+	struct rte_memseg *ms;
 
 	uint64_t memory[RTE_MAX_NUMA_NODES];
 
 	unsigned hp_offset;
 	int i, j, new_memseg;
+	int ms_idx, msl_idx;
 	int nr_hugefiles, nr_hugepages = 0;
 	void *addr;
 
@@ -963,6 +984,9 @@ eal_legacy_hugepage_init(void)
 
 	/* hugetlbfs can be disabled */
 	if (internal_config.no_hugetlbfs) {
+		arr = &mcfg->memsegs[0].memseg_arr;
+		ms = rte_fbarray_get(arr, 0);
+
 		addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
 				MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
 		if (addr == MAP_FAILED) {
@@ -970,14 +994,15 @@ eal_legacy_hugepage_init(void)
 					strerror(errno));
 			return -1;
 		}
+		rte_fbarray_set_used(arr, 0, true);
 		if (rte_eal_iova_mode() == RTE_IOVA_VA)
-			mcfg->memseg[0].iova = (uintptr_t)addr;
+			ms->iova = (uintptr_t)addr;
 		else
-			mcfg->memseg[0].iova = RTE_BAD_IOVA;
-		mcfg->memseg[0].addr = addr;
-		mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
-		mcfg->memseg[0].len = internal_config.memory;
-		mcfg->memseg[0].socket_id = 0;
+			ms->iova = RTE_BAD_IOVA;
+		ms->addr = addr;
+		ms->hugepage_sz = RTE_PGSIZE_4K;
+		ms->len = internal_config.memory;
+		ms->socket_id = 0;
 		return 0;
 	}
 
@@ -1218,27 +1243,59 @@ eal_legacy_hugepage_init(void)
 #endif
 
 		if (new_memseg) {
-			j += 1;
-			if (j == RTE_MAX_MEMSEG)
-				break;
+			struct rte_memseg_list *msl;
+			int socket;
+			uint64_t page_sz;
 
-			mcfg->memseg[j].iova = hugepage[i].physaddr;
-			mcfg->memseg[j].addr = hugepage[i].final_va;
-			mcfg->memseg[j].len = hugepage[i].size;
-			mcfg->memseg[j].socket_id = hugepage[i].socket_id;
-			mcfg->memseg[j].hugepage_sz = hugepage[i].size;
+			socket = hugepage[i].socket_id;
+			page_sz = hugepage[i].size;
+
+			if (page_sz == 0)
+				continue;
+
+			/* figure out where to put this memseg */
+			msl = get_memseg_list(socket, page_sz);
+			if (!msl)
+				rte_panic("Unknown socket or page sz: %i %lx\n",
+					socket, page_sz);
+			msl_idx = msl - &mcfg->memsegs[0];
+			arr = &msl->memseg_arr;
+			/*
+			 * we may run out of space, so check if we have enough
+			 * and expand if necessary
+			 */
+			if (arr->count >= arr->len) {
+				int new_len = arr->len * 2;
+				new_len = RTE_MIN(new_len, arr->capacity);
+				if (rte_fbarray_resize(arr, new_len)) {
+					RTE_LOG(ERR, EAL, "Couldn't expand memseg list\n");
+					break;
+				}
+			}
+			ms_idx = rte_fbarray_find_next_free(arr, 0);
+			ms = rte_fbarray_get(arr, ms_idx);
+
+			ms->iova = hugepage[i].physaddr;
+			ms->addr = hugepage[i].final_va;
+			ms->len = page_sz;
+			ms->socket_id = socket;
+			ms->hugepage_sz = page_sz;
+
+			/* segment may be empty */
+			rte_fbarray_set_used(arr, ms_idx, true);
 		}
 		/* continuation of previous memseg */
 		else {
 #ifdef RTE_ARCH_PPC_64
 		/* Use the phy and virt address of the last page as segment
 		 * address for IBM Power architecture */
-			mcfg->memseg[j].iova = hugepage[i].physaddr;
-			mcfg->memseg[j].addr = hugepage[i].final_va;
+			ms->iova = hugepage[i].physaddr;
+			ms->addr = hugepage[i].final_va;
 #endif
-			mcfg->memseg[j].len += mcfg->memseg[j].hugepage_sz;
+			ms->len += ms->hugepage_sz;
 		}
-		hugepage[i].memseg_id = j;
+		hugepage[i].memseg_id = ms_idx;
+		hugepage[i].memseg_list_id = msl_idx;
 	}
 
 	if (i < nr_hugefiles) {
@@ -1248,7 +1305,7 @@ eal_legacy_hugepage_init(void)
 			"Please either increase it or request less amount "
 			"of memory.\n",
 			i, nr_hugefiles, RTE_STR(CONFIG_RTE_MAX_MEMSEG),
-			RTE_MAX_MEMSEG);
+			RTE_MAX_MEMSEG_PER_LIST);
 		goto fail;
 	}
 
@@ -1289,8 +1346,9 @@ eal_legacy_hugepage_attach(void)
 	const struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
 	struct hugepage_file *hp = NULL;
 	unsigned num_hp = 0;
-	unsigned i, s = 0; /* s used to track the segment number */
-	unsigned max_seg = RTE_MAX_MEMSEG;
+	unsigned i;
+	int ms_idx, msl_idx;
+	unsigned cur_seg, max_seg;
 	off_t size = 0;
 	int fd, fd_zero = -1, fd_hugepage = -1;
 
@@ -1315,53 +1373,63 @@ eal_legacy_hugepage_attach(void)
 	}
 
 	/* map all segments into memory to make sure we get the addrs */
-	for (s = 0; s < RTE_MAX_MEMSEG; ++s) {
-		void *base_addr;
+	max_seg = 0;
+	for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) {
+		const struct rte_memseg_list *msl = &mcfg->memsegs[msl_idx];
+		const struct rte_fbarray *arr = &msl->memseg_arr;
 
-		/*
-		 * the first memory segment with len==0 is the one that
-		 * follows the last valid segment.
-		 */
-		if (mcfg->memseg[s].len == 0)
-			break;
+		ms_idx = 0;
+		while ((ms_idx = rte_fbarray_find_next_used(arr, ms_idx)) >= 0) {
+			struct rte_memseg *ms = rte_fbarray_get(arr, ms_idx);
+			void *base_addr;
 
-		/*
-		 * fdzero is mmapped to get a contiguous block of virtual
-		 * addresses of the appropriate memseg size.
-		 * use mmap to get identical addresses as the primary process.
-		 */
-		base_addr = mmap(mcfg->memseg[s].addr, mcfg->memseg[s].len,
-				 PROT_READ,
+			ms = rte_fbarray_get(arr, ms_idx);
+
+			/*
+			 * the first memory segment with len==0 is the one that
+			 * follows the last valid segment.
+			 */
+			if (ms->len == 0)
+				break;
+
+			/*
+			 * fdzero is mmapped to get a contiguous block of virtual
+			 * addresses of the appropriate memseg size.
+			 * use mmap to get identical addresses as the primary process.
+			 */
+			base_addr = mmap(ms->addr, ms->len,
+					PROT_READ,
 #ifdef RTE_ARCH_PPC_64
-				 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
+					MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
 #else
-				 MAP_PRIVATE,
+					MAP_PRIVATE,
 #endif
-				 fd_zero, 0);
-		if (base_addr == MAP_FAILED ||
-		    base_addr != mcfg->memseg[s].addr) {
-			max_seg = s;
-			if (base_addr != MAP_FAILED) {
-				/* errno is stale, don't use */
-				RTE_LOG(ERR, EAL, "Could not mmap %llu bytes "
-					"in /dev/zero at [%p], got [%p] - "
-					"please use '--base-virtaddr' option\n",
-					(unsigned long long)mcfg->memseg[s].len,
-					mcfg->memseg[s].addr, base_addr);
-				munmap(base_addr, mcfg->memseg[s].len);
-			} else {
-				RTE_LOG(ERR, EAL, "Could not mmap %llu bytes "
-					"in /dev/zero at [%p]: '%s'\n",
-					(unsigned long long)mcfg->memseg[s].len,
-					mcfg->memseg[s].addr, strerror(errno));
-			}
-			if (aslr_enabled() > 0) {
-				RTE_LOG(ERR, EAL, "It is recommended to "
-					"disable ASLR in the kernel "
-					"and retry running both primary "
-					"and secondary processes\n");
+					fd_zero, 0);
+			if (base_addr == MAP_FAILED || base_addr != ms->addr) {
+				if (base_addr != MAP_FAILED) {
+					/* errno is stale, don't use */
+					RTE_LOG(ERR, EAL, "Could not mmap %llu bytes "
+						"in /dev/zero at [%p], got [%p] - "
+						"please use '--base-virtaddr' option\n",
+						(unsigned long long)ms->len,
+						ms->addr, base_addr);
+					munmap(base_addr, ms->len);
+				} else {
+					RTE_LOG(ERR, EAL, "Could not mmap %llu bytes "
+						"in /dev/zero at [%p]: '%s'\n",
+						(unsigned long long)ms->len,
+						ms->addr, strerror(errno));
+				}
+				if (aslr_enabled() > 0) {
+					RTE_LOG(ERR, EAL, "It is recommended to "
+						"disable ASLR in the kernel "
+						"and retry running both primary "
+						"and secondary processes\n");
+				}
+				goto error;
 			}
-			goto error;
+			max_seg++;
+			ms_idx++;
 		}
 	}
 
@@ -1375,46 +1443,54 @@ eal_legacy_hugepage_attach(void)
 	num_hp = size / sizeof(struct hugepage_file);
 	RTE_LOG(DEBUG, EAL, "Analysing %u files\n", num_hp);
 
-	s = 0;
-	while (s < RTE_MAX_MEMSEG && mcfg->memseg[s].len > 0){
-		void *addr, *base_addr;
-		uintptr_t offset = 0;
-		size_t mapping_size;
-		/*
-		 * free previously mapped memory so we can map the
-		 * hugepages into the space
-		 */
-		base_addr = mcfg->memseg[s].addr;
-		munmap(base_addr, mcfg->memseg[s].len);
-
-		/* find the hugepages for this segment and map them
-		 * we don't need to worry about order, as the server sorted the
-		 * entries before it did the second mmap of them */
-		for (i = 0; i < num_hp && offset < mcfg->memseg[s].len; i++){
-			if (hp[i].memseg_id == (int)s){
-				fd = open(hp[i].filepath, O_RDWR);
-				if (fd < 0) {
-					RTE_LOG(ERR, EAL, "Could not open %s\n",
-						hp[i].filepath);
-					goto error;
-				}
-				mapping_size = hp[i].size;
-				addr = mmap(RTE_PTR_ADD(base_addr, offset),
-						mapping_size, PROT_READ | PROT_WRITE,
-						MAP_SHARED, fd, 0);
-				close(fd); /* close file both on success and on failure */
-				if (addr == MAP_FAILED ||
-						addr != RTE_PTR_ADD(base_addr, offset)) {
-					RTE_LOG(ERR, EAL, "Could not mmap %s\n",
-						hp[i].filepath);
-					goto error;
+	/* map all segments into memory to make sure we get the addrs */
+	for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) {
+		const struct rte_memseg_list *msl = &mcfg->memsegs[msl_idx];
+		const struct rte_fbarray *arr = &msl->memseg_arr;
+
+		ms_idx = 0;
+		while ((ms_idx = rte_fbarray_find_next_used(arr, ms_idx)) >= 0) {
+			struct rte_memseg *ms = rte_fbarray_get(arr, ms_idx);
+			void *addr, *base_addr;
+			uintptr_t offset = 0;
+			size_t mapping_size;
+
+			ms = rte_fbarray_get(arr, ms_idx);
+			/*
+			 * free previously mapped memory so we can map the
+			 * hugepages into the space
+			 */
+			base_addr = ms->addr;
+			munmap(base_addr, ms->len);
+
+			/* find the hugepages for this segment and map them
+			 * we don't need to worry about order, as the server sorted the
+			 * entries before it did the second mmap of them */
+			for (i = 0; i < num_hp && offset < ms->len; i++){
+				if (hp[i].memseg_id == ms_idx &&
+						hp[i].memseg_list_id == msl_idx) {
+					fd = open(hp[i].filepath, O_RDWR);
+					if (fd < 0) {
+						RTE_LOG(ERR, EAL, "Could not open %s\n",
+							hp[i].filepath);
+						goto error;
+					}
+					mapping_size = hp[i].size;
+					addr = mmap(RTE_PTR_ADD(base_addr, offset),
+							mapping_size, PROT_READ | PROT_WRITE,
+							MAP_SHARED, fd, 0);
+					close(fd); /* close file both on success and on failure */
+					if (addr == MAP_FAILED ||
+							addr != RTE_PTR_ADD(base_addr, offset)) {
+						RTE_LOG(ERR, EAL, "Could not mmap %s\n",
+							hp[i].filepath);
+						goto error;
+					}
+					offset+=mapping_size;
 				}
-				offset+=mapping_size;
 			}
-		}
-		RTE_LOG(DEBUG, EAL, "Mapped segment %u of size 0x%llx\n", s,
-				(unsigned long long)mcfg->memseg[s].len);
-		s++;
+			RTE_LOG(DEBUG, EAL, "Mapped segment of size 0x%llx\n",
+					(unsigned long long)ms->len);
+			ms_idx++;
+		}
 	}
 	/* unmap the hugepage config file, since we are done using it */
 	munmap(hp, size);
@@ -1423,8 +1499,27 @@ eal_legacy_hugepage_attach(void)
 	return 0;
 
 error:
-	for (i = 0; i < max_seg && mcfg->memseg[i].len > 0; i++)
-		munmap(mcfg->memseg[i].addr, mcfg->memseg[i].len);
+	/* map all segments into memory to make sure we get the addrs */
+	cur_seg = 0;
+	for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) {
+		const struct rte_memseg_list *msl = &mcfg->memsegs[msl_idx];
+		const struct rte_fbarray *arr = &msl->memseg_arr;
+
+		if (cur_seg >= max_seg)
+			break;
+
+		ms_idx = 0;
+		while ((ms_idx = rte_fbarray_find_next_used(arr, ms_idx)) >= 0) {
+			struct rte_memseg *ms = rte_fbarray_get(arr, ms_idx);
+
+			if (cur_seg >= max_seg)
+				break;
+			munmap(ms->addr, ms->len);
+
+			cur_seg++;
+			ms_idx++;
+		}
+	}
 	if (hp != NULL && hp != MAP_FAILED)
 		munmap(hp, size);
 	if (fd_zero >= 0)
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index 58f0123..09dfc68 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -696,33 +696,52 @@ vfio_get_group_no(const char *sysfs_base,
 static int
 vfio_type1_dma_map(int vfio_container_fd)
 {
-	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
 	int i, ret;
 
 	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
+	for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
 		struct vfio_iommu_type1_dma_map dma_map;
+		const struct rte_memseg_list *msl;
+		const struct rte_fbarray *arr;
+		int ms_idx, next_idx;
 
-		if (ms[i].addr == NULL)
-			break;
+		msl = &rte_eal_get_configuration()->mem_config->memsegs[i];
+		arr = &msl->memseg_arr;
 
-		memset(&dma_map, 0, sizeof(dma_map));
-		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
-		dma_map.vaddr = ms[i].addr_64;
-		dma_map.size = ms[i].len;
-		if (rte_eal_iova_mode() == RTE_IOVA_VA)
-			dma_map.iova = dma_map.vaddr;
-		else
-			dma_map.iova = ms[i].iova;
-		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+		/* skip empty memseg lists */
+		if (arr->count == 0)
+			continue;
 
-		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+		next_idx = 0;
 
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot set up DMA remapping, "
-					  "error %i (%s)\n", errno,
-					  strerror(errno));
-			return -1;
+		// TODO: don't bother with physical addresses?
+		while ((ms_idx = rte_fbarray_find_next_used(arr,
+				next_idx)) >= 0) {
+			uint64_t addr, len, hw_addr;
+			const struct rte_memseg *ms;
+			next_idx = ms_idx + 1;
+
+			ms = rte_fbarray_get(arr, ms_idx);
+
+			addr = ms->addr_64;
+			len = ms->hugepage_sz;
+			hw_addr = ms->iova;
+
+			memset(&dma_map, 0, sizeof(dma_map));
+			dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
+			dma_map.vaddr = addr;
+			dma_map.size = len;
+			dma_map.iova = hw_addr;
+			dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+
+			ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+
+			if (ret) {
+				RTE_LOG(ERR, EAL, "  cannot set up DMA remapping, "
+						  "error %i (%s)\n", errno,
+						  strerror(errno));
+				return -1;
+			}
 		}
 	}
 
@@ -732,8 +751,8 @@ vfio_type1_dma_map(int vfio_container_fd)
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
-	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
 	int i, ret;
+	uint64_t hugepage_sz = 0;
 
 	struct vfio_iommu_spapr_register_memory reg = {
 		.argsz = sizeof(reg),
@@ -767,17 +786,31 @@ vfio_spapr_dma_map(int vfio_container_fd)
 	}
 
 	/* create DMA window from 0 to max(phys_addr + len) */
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
-		if (ms[i].addr == NULL)
-			break;
-
-		create.window_size = RTE_MAX(create.window_size,
-				ms[i].iova + ms[i].len);
+	for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
+		const struct rte_mem_config *mcfg =
+				rte_eal_get_configuration()->mem_config;
+		const struct rte_memseg_list *msl = &mcfg->memsegs[i];
+		const struct rte_fbarray *arr = &msl->memseg_arr;
+		int idx, next_idx;
+
+		if (msl->base_va == NULL)
+			continue;
+		if (msl->memseg_arr.count == 0)
+			continue;
+
+		next_idx = 0;
+		while ((idx = rte_fbarray_find_next_used(arr, next_idx)) >= 0) {
+			const struct rte_memseg *ms = rte_fbarray_get(arr, idx);
+			hugepage_sz = RTE_MAX(hugepage_sz, ms->hugepage_sz);
+			create.window_size = RTE_MAX(create.window_size,
+					ms->iova + ms->len);
+			next_idx = idx + 1;
+		}
 	}
 
 	/* sPAPR requires window size to be a power of 2 */
 	create.window_size = rte_align64pow2(create.window_size);
-	create.page_shift = __builtin_ctzll(ms->hugepage_sz);
+	create.page_shift = __builtin_ctzll(hugepage_sz);
 	create.levels = 1;
 
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
@@ -793,41 +826,60 @@ vfio_spapr_dma_map(int vfio_container_fd)
 	}
 
 	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
+	for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
 		struct vfio_iommu_type1_dma_map dma_map;
+		const struct rte_memseg_list *msl;
+		const struct rte_fbarray *arr;
+		int ms_idx, next_idx;
 
-		if (ms[i].addr == NULL)
-			break;
+		msl = &rte_eal_get_configuration()->mem_config->memsegs[i];
+		arr = &msl->memseg_arr;
 
-		reg.vaddr = (uintptr_t) ms[i].addr;
-		reg.size = ms[i].len;
-		ret = ioctl(vfio_container_fd,
-			VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot register vaddr for IOMMU, "
-				"error %i (%s)\n", errno, strerror(errno));
-			return -1;
-		}
+		/* skip empty memseg lists */
+		if (arr->count == 0)
+			continue;
 
-		memset(&dma_map, 0, sizeof(dma_map));
-		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
-		dma_map.vaddr = ms[i].addr_64;
-		dma_map.size = ms[i].len;
-		if (rte_eal_iova_mode() == RTE_IOVA_VA)
-			dma_map.iova = dma_map.vaddr;
-		else
-			dma_map.iova = ms[i].iova;
-		dma_map.flags = VFIO_DMA_MAP_FLAG_READ |
-				 VFIO_DMA_MAP_FLAG_WRITE;
+		next_idx = 0;
 
-		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+		while ((ms_idx = rte_fbarray_find_next_used(arr,
+				next_idx)) >= 0) {
+			uint64_t addr, len, hw_addr;
+			const struct rte_memseg *ms;
+			next_idx = ms_idx + 1;
 
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot set up DMA remapping, "
-				"error %i (%s)\n", errno, strerror(errno));
-			return -1;
-		}
+			ms = rte_fbarray_get(arr, ms_idx);
+
+			addr = ms->addr_64;
+			len = ms->hugepage_sz;
+			hw_addr = ms->iova;
 
+			reg.vaddr = (uintptr_t) addr;
+			reg.size = len;
+			ret = ioctl(vfio_container_fd,
+				VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
+			if (ret) {
+				RTE_LOG(ERR, EAL, "  cannot register vaddr for IOMMU, error %i (%s)\n",
+						errno, strerror(errno));
+				return -1;
+			}
+
+			memset(&dma_map, 0, sizeof(dma_map));
+			dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
+			dma_map.vaddr = addr;
+			dma_map.size = len;
+			dma_map.iova = hw_addr;
+			dma_map.flags = VFIO_DMA_MAP_FLAG_READ |
+					VFIO_DMA_MAP_FLAG_WRITE;
+
+			ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+
+			if (ret) {
+				RTE_LOG(ERR, EAL, "  cannot set up DMA remapping, "
+						  "error %i (%s)\n", errno,
+						  strerror(errno));
+				return -1;
+			}
+		}
 	}
 
 	return 0;
diff --git a/test/test/test_malloc.c b/test/test/test_malloc.c
index 4572caf..ae24c33 100644
--- a/test/test/test_malloc.c
+++ b/test/test/test_malloc.c
@@ -41,6 +41,7 @@
 
 #include <rte_common.h>
 #include <rte_memory.h>
+#include <rte_eal_memconfig.h>
 #include <rte_per_lcore.h>
 #include <rte_launch.h>
 #include <rte_eal.h>
@@ -734,15 +735,23 @@ test_malloc_bad_params(void)
 	return -1;
 }
 
-/* Check if memory is available on a specific socket */
+/* Check if memory is available on a specific socket */
 static int
 is_mem_on_socket(int32_t socket)
 {
-	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
+	const struct rte_mem_config *mcfg =
+			rte_eal_get_configuration()->mem_config;
 	unsigned i;
 
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
-		if (socket == ms[i].socket_id)
+	for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
+		const struct rte_memseg_list *msl =
+				&mcfg->memsegs[i];
+		const struct rte_fbarray *arr = &msl->memseg_arr;
+
+		if (msl->socket_id != socket)
+			continue;
+
+		if (arr->count)
 			return 1;
 	}
 	return 0;
@@ -755,16 +764,8 @@ is_mem_on_socket(int32_t socket)
 static int32_t
 addr_to_socket(void * addr)
 {
-	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
-	unsigned i;
-
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
-		if ((ms[i].addr <= addr) &&
-				((uintptr_t)addr <
-				((uintptr_t)ms[i].addr + (uintptr_t)ms[i].len)))
-			return ms[i].socket_id;
-	}
-	return -1;
+	const struct rte_memseg *ms = rte_mem_virt2memseg(addr, NULL);
+	return ms == NULL ? -1 : ms->socket_id;
 }
 
 /* Test using rte_[c|m|zm]alloc_socket() on a specific socket */
diff --git a/test/test/test_memory.c b/test/test/test_memory.c
index 921bdc8..0d877c8 100644
--- a/test/test/test_memory.c
+++ b/test/test/test_memory.c
@@ -34,8 +34,11 @@
 #include <stdio.h>
 #include <stdint.h>
 
+#include <rte_eal.h>
+#include <rte_eal_memconfig.h>
 #include <rte_memory.h>
 #include <rte_common.h>
+#include <rte_memzone.h>
 
 #include "test.h"
 
@@ -54,10 +57,12 @@
 static int
 test_memory(void)
 {
+	const struct rte_memzone *mz = NULL;
 	uint64_t s;
 	unsigned i;
 	size_t j;
-	const struct rte_memseg *mem;
+	const struct rte_mem_config *mcfg =
+			rte_eal_get_configuration()->mem_config;
 
 	/*
 	 * dump the mapped memory: the python-expect script checks
@@ -69,20 +74,43 @@ test_memory(void)
 	/* check that memory size is != 0 */
 	s = rte_eal_get_physmem_size();
 	if (s == 0) {
-		printf("No memory detected\n");
-		return -1;
+		printf("No memory detected, attempting to allocate\n");
+		mz = rte_memzone_reserve("tmp", 1000, SOCKET_ID_ANY, 0);
+
+		if (!mz) {
+			printf("Failed to allocate a memzone\n");
+			return -1;
+		}
 	}
 
 	/* try to read memory (should not segfault) */
-	mem = rte_eal_get_physmem_layout();
-	for (i = 0; i < RTE_MAX_MEMSEG && mem[i].addr != NULL ; i++) {
+	for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
+		const struct rte_memseg_list *msl = &mcfg->memsegs[i];
+		const struct rte_fbarray *arr = &msl->memseg_arr;
+		int search_idx, cur_idx;
+
+		if (arr->count == 0)
+			continue;
+
+		search_idx = 0;
 
-		/* check memory */
-		for (j = 0; j<mem[i].len; j++) {
-			*((volatile uint8_t *) mem[i].addr + j);
+		while ((cur_idx = rte_fbarray_find_next_used(arr,
+				search_idx)) >= 0) {
+			const struct rte_memseg *ms;
+
+			ms = rte_fbarray_get(arr, cur_idx);
+
+			/* check memory */
+			for (j = 0; j < ms->len; j++) {
+				*((volatile uint8_t *) ms->addr + j);
+			}
+			search_idx = cur_idx + 1;
 		}
 	}
 
+	if (mz)
+		rte_memzone_free(mz);
+
 	return 0;
 }
 
diff --git a/test/test/test_memzone.c b/test/test/test_memzone.c
index 1cf235a..47af721 100644
--- a/test/test/test_memzone.c
+++ b/test/test/test_memzone.c
@@ -132,22 +132,25 @@ static int
 test_memzone_reserve_flags(void)
 {
 	const struct rte_memzone *mz;
-	const struct rte_memseg *ms;
 	int hugepage_2MB_avail = 0;
 	int hugepage_1GB_avail = 0;
 	int hugepage_16MB_avail = 0;
 	int hugepage_16GB_avail = 0;
 	const size_t size = 100;
 	int i = 0;
-	ms = rte_eal_get_physmem_layout();
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
-		if (ms[i].hugepage_sz == RTE_PGSIZE_2M)
+
+	for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
+		struct rte_mem_config *mcfg =
+				rte_eal_get_configuration()->mem_config;
+		struct rte_memseg_list *msl = &mcfg->memsegs[i];
+
+		if (msl->hugepage_sz == RTE_PGSIZE_2M)
 			hugepage_2MB_avail = 1;
-		if (ms[i].hugepage_sz == RTE_PGSIZE_1G)
+		if (msl->hugepage_sz == RTE_PGSIZE_1G)
 			hugepage_1GB_avail = 1;
-		if (ms[i].hugepage_sz == RTE_PGSIZE_16M)
+		if (msl->hugepage_sz == RTE_PGSIZE_16M)
 			hugepage_16MB_avail = 1;
-		if (ms[i].hugepage_sz == RTE_PGSIZE_16G)
+		if (msl->hugepage_sz == RTE_PGSIZE_16G)
 			hugepage_16GB_avail = 1;
 	}
 	/* Display the availability of 2MB ,1GB, 16MB, 16GB pages */
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 12/23] eal: add support for dynamic memory allocation
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (10 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 11/23] eal: replace memseg with memseg lists Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 13/23] eal: make use of dynamic memory allocation for init Anatoly Burakov
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

Nothing uses this code yet. The bulk of it is copied from the old
memory allocation code (eal_memory.c). We provide an API to
allocate either one page or multiple pages, guaranteeing that
we'll get contiguous VA for all of the pages that we requested.
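
For illustration only (not part of the patch), a caller of this
internal API could look roughly like the sketch below; the page size
and socket id are arbitrary examples:

    #include <rte_memory.h>
    #include "eal_memalloc.h"

    static int
    example_grab_pages(void)
    {
            struct rte_memseg *pages[4];
            struct rte_memseg *one;
            int n;

            /* request exactly four VA-contiguous 2M pages on socket 0 */
            n = eal_memalloc_alloc_page_bulk(pages, 4, RTE_PGSIZE_2M, 0, true);
            if (n < 0)
                    return -1;

            /* or request a single page */
            one = eal_memalloc_alloc_page(RTE_PGSIZE_2M, 0);
            return one == NULL ? -1 : 0;
    }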

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/eal_memalloc.h       |  47 ++++
 lib/librte_eal/linuxapp/eal/Makefile       |   2 +
 lib/librte_eal/linuxapp/eal/eal_memalloc.c | 416 +++++++++++++++++++++++++++++
 3 files changed, 465 insertions(+)
 create mode 100755 lib/librte_eal/common/eal_memalloc.h
 create mode 100755 lib/librte_eal/linuxapp/eal/eal_memalloc.c

diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h
new file mode 100755
index 0000000..59fd330
--- /dev/null
+++ b/lib/librte_eal/common/eal_memalloc.h
@@ -0,0 +1,47 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef EAL_MEMALLOC_H
+#define EAL_MEMALLOC_H
+
+#include <stdbool.h>
+
+#include <rte_memory.h>
+
+struct rte_memseg *
+eal_memalloc_alloc_page(uint64_t size, int socket);
+
+int
+eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, uint64_t size,
+		int socket, bool exact);
+
+#endif // EAL_MEMALLOC_H
diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 782e1ad..88f10e9 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -62,6 +62,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_thread.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_log.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_vfio.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_vfio_mp_sync.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_memalloc.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_debug.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_lcore.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_timer.c
@@ -105,6 +106,7 @@ CFLAGS_eal_interrupts.o := -D_GNU_SOURCE
 CFLAGS_eal_vfio_mp_sync.o := -D_GNU_SOURCE
 CFLAGS_eal_timer.o := -D_GNU_SOURCE
 CFLAGS_eal_lcore.o := -D_GNU_SOURCE
+CFLAGS_eal_memalloc.o := -D_GNU_SOURCE
 CFLAGS_eal_thread.o := -D_GNU_SOURCE
 CFLAGS_eal_log.o := -D_GNU_SOURCE
 CFLAGS_eal_common_log.o := -D_GNU_SOURCE
diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
new file mode 100755
index 0000000..527c2f6
--- /dev/null
+++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
@@ -0,0 +1,416 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#define _FILE_OFFSET_BITS 64
+#include <errno.h>
+#include <stdarg.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <inttypes.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/queue.h>
+#include <sys/file.h>
+#include <unistd.h>
+#include <limits.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <signal.h>
+#include <setjmp.h>
+#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
+#include <numa.h>
+#include <numaif.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_eal_memconfig.h>
+#include <rte_eal.h>
+#include <rte_memory.h>
+
+#include "eal_filesystem.h"
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+
+static sigjmp_buf huge_jmpenv;
+
+static void __rte_unused huge_sigbus_handler(int signo __rte_unused)
+{
+	siglongjmp(huge_jmpenv, 1);
+}
+
+/* Put setjmp into a wrap method to avoid compiling error. Any non-volatile,
+ * non-static local variable in the stack frame calling sigsetjmp might be
+ * clobbered by a call to longjmp.
+ */
+static int __rte_unused huge_wrap_sigsetjmp(void)
+{
+	return sigsetjmp(huge_jmpenv, 1);
+}
+
+static struct sigaction huge_action_old;
+static int huge_need_recover;
+
+static void __rte_unused
+huge_register_sigbus(void)
+{
+	sigset_t mask;
+	struct sigaction action;
+
+	sigemptyset(&mask);
+	sigaddset(&mask, SIGBUS);
+	action.sa_flags = 0;
+	action.sa_mask = mask;
+	action.sa_handler = huge_sigbus_handler;
+
+	huge_need_recover = !sigaction(SIGBUS, &action, &huge_action_old);
+}
+
+static void __rte_unused
+huge_recover_sigbus(void)
+{
+	if (huge_need_recover) {
+		sigaction(SIGBUS, &huge_action_old, NULL);
+		huge_need_recover = 0;
+	}
+}
+
+#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
+static bool
+prepare_numa(int *oldpolicy, struct bitmask *oldmask, int socket_id) {
+	bool have_numa = true;
+
+	/* Check if kernel supports NUMA. */
+	if (numa_available() != 0) {
+		RTE_LOG(DEBUG, EAL, "NUMA is not supported.\n");
+		have_numa = false;
+	}
+
+	if (have_numa) {
+		RTE_LOG(DEBUG, EAL, "Trying to obtain current memory policy.\n");
+		if (get_mempolicy(oldpolicy, oldmask->maskp,
+				  oldmask->size + 1, 0, 0) < 0) {
+			RTE_LOG(ERR, EAL,
+				"Failed to get current mempolicy: %s. "
+				"Assuming MPOL_DEFAULT.\n", strerror(errno));
+			*oldpolicy = MPOL_DEFAULT;
+		}
+		RTE_LOG(DEBUG, EAL,
+			"Setting policy MPOL_PREFERRED for socket %d\n",
+			socket_id);
+		numa_set_preferred(socket_id);
+	}
+	return have_numa;
+}
+
+static void
+restore_numa(int *oldpolicy, struct bitmask *oldmask) {
+	RTE_LOG(DEBUG, EAL,
+		"Restoring previous memory policy: %d\n", *oldpolicy);
+	if (*oldpolicy == MPOL_DEFAULT) {
+		numa_set_localalloc();
+	} else if (set_mempolicy(*oldpolicy, oldmask->maskp,
+				 oldmask->size + 1) < 0) {
+		RTE_LOG(ERR, EAL, "Failed to restore mempolicy: %s\n",
+			strerror(errno));
+		numa_set_localalloc();
+	}
+	numa_free_cpumask(oldmask);
+}
+#endif
+
+static int
+alloc_page(struct rte_memseg *ms, void *addr, uint64_t size, int socket_id,
+		struct hugepage_info *hi, unsigned list_idx, unsigned seg_idx) {
+	int cur_socket_id = 0;
+	uint64_t fa_offset;
+	char path[PATH_MAX];
+	int ret = 0;
+
+	if (internal_config.single_file_segments) {
+		eal_get_hugefile_path(path, sizeof(path), hi->hugedir, list_idx);
+	} else {
+		eal_get_hugefile_path(path, sizeof(path), hi->hugedir,
+				list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);
+	}
+
+	/* try to create hugepage file */
+	int fd = open(path, O_CREAT | O_RDWR, 0600);
+	if (fd < 0) {
+		RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n", __func__,
+				strerror(errno));
+		goto fname;
+	}
+	if (internal_config.single_file_segments) {
+		fa_offset = seg_idx * size;
+		if (fallocate(fd, 0, fa_offset, size)) {
+			RTE_LOG(DEBUG, EAL, "%s(): fallocate() failed: %s\n",
+				__func__, strerror(errno));
+			goto opened;
+		}
+	} else {
+		if (ftruncate(fd, size) < 0) {
+			RTE_LOG(DEBUG, EAL, "%s(): ftruncate() failed: %s\n",
+				__func__, strerror(errno));
+			goto opened;
+		}
+		fa_offset = 0;
+	}
+
+	/* map the segment, and populate page tables,
+	 * the kernel fills this segment with zeros */
+	void *va = mmap(addr, size, PROT_READ | PROT_WRITE,
+			MAP_SHARED | MAP_POPULATE | MAP_FIXED, fd, fa_offset);
+	if (va == MAP_FAILED) {
+		RTE_LOG(DEBUG, EAL, "%s(): mmap() failed: %s\n", __func__,
+			strerror(errno));
+		goto resized;
+	}
+	if (va != addr) {
+		RTE_LOG(DEBUG, EAL, "%s(): wrong mmap() address\n", __func__);
+		goto mapped;
+	}
+
+	rte_iova_t iova = rte_mem_virt2iova(addr);
+	if (iova == RTE_BAD_PHYS_ADDR) {
+		RTE_LOG(DEBUG, EAL, "%s(): can't get IOVA addr\n",
+			__func__);
+		goto mapped;
+	}
+
+#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
+	move_pages(getpid(), 1, &addr, NULL, &cur_socket_id, 0);
+
+	if (cur_socket_id != socket_id) {
+		RTE_LOG(DEBUG, EAL,
+				"%s(): allocation happened on wrong socket (wanted %d, got %d)\n",
+			__func__, socket_id, cur_socket_id);
+		goto mapped;
+	}
+#endif
+
+	/* In linux, hugetlb limitations, like cgroup, are
+	 * enforced at fault time instead of mmap(), even
+	 * with the option of MAP_POPULATE. Kernel will send
+	 * a SIGBUS signal. To avoid to be killed, save stack
+	 * environment here, if SIGBUS happens, we can jump
+	 * back here.
+	 */
+	if (huge_wrap_sigsetjmp()) {
+		RTE_LOG(DEBUG, EAL, "SIGBUS: Cannot mmap more hugepages of size %uMB\n",
+			(unsigned)(size / 0x100000));
+		goto mapped;
+	}
+	*(int *)addr = *(int *) addr;
+
+	close(fd);
+
+	ms->addr = addr;
+	ms->hugepage_sz = size;
+	ms->len = size;
+	ms->nchannel = rte_memory_get_nchannel();
+	ms->nrank = rte_memory_get_nrank();
+	ms->iova = iova;
+	ms->socket_id = socket_id;
+
+	goto out;
+
+mapped:
+	munmap(addr, size);
+resized:
+	if (internal_config.single_file_segments)
+		fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+				fa_offset, size);
+	else {
+		unlink(path);
+	}
+opened:
+	close(fd);
+fname:
+	/* anything but goto out is an error */
+	ret = -1;
+out:
+	return ret;
+}
+
+int
+eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n,
+		uint64_t size, int socket, bool exact) {
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_memseg_list *msl = NULL;
+	void *addr;
+	unsigned msl_idx;
+	int cur_idx, next_idx, end_idx, i, ret = 0;
+#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
+	bool have_numa;
+	int oldpolicy;
+	struct bitmask *oldmask = numa_allocate_nodemask();
+#endif
+	struct hugepage_info *hi = NULL;
+
+	/* dynamic allocation not supported in legacy mode */
+	if (internal_config.legacy_mem)
+		return -1;
+
+	for (i = 0; i < (int) RTE_DIM(internal_config.hugepage_info); i++) {
+		if (size ==
+				internal_config.hugepage_info[i].hugepage_sz) {
+			hi = &internal_config.hugepage_info[i];
+			break;
+		}
+	}
+	if (!hi) {
+		RTE_LOG(ERR, EAL, "%s(): can't find relevant hugepage_info entry\n",
+			__func__);
+		return -1;
+	}
+
+	/* find our memseg list */
+	for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) {
+		struct rte_memseg_list *cur_msl = &mcfg->memsegs[msl_idx];
+
+		if (cur_msl->hugepage_sz != size) {
+			continue;
+		}
+		if (cur_msl->socket_id != socket) {
+			continue;
+		}
+		msl = cur_msl;
+		break;
+	}
+	if (!msl) {
+		RTE_LOG(ERR, EAL, "%s(): couldn't find suitable memseg_list\n",
+			__func__);
+		return -1;
+	}
+
+	/* first, try finding space in already existing list */
+	cur_idx = rte_fbarray_find_next_n_free(&msl->memseg_arr, 0, n);
+
+	if (cur_idx < 0) {
+		int old_len = msl->memseg_arr.len;
+		int space = 0;
+		int new_len = old_len;
+
+		/* grow new len until we can either fit n or can't grow */
+		while (new_len < msl->memseg_arr.capacity &&
+				(space < n)) {
+			new_len = RTE_MIN(new_len * 2, msl->memseg_arr.capacity);
+			space = new_len - old_len;
+		}
+
+		/* check if we can expand the list */
+		if (old_len == new_len) {
+			/* can't expand, the list is full */
+			RTE_LOG(ERR, EAL, "%s(): no space in memseg list\n",
+				__func__);
+			return -1;
+		}
+
+		if (rte_fbarray_resize(&msl->memseg_arr, new_len)) {
+			RTE_LOG(ERR, EAL, "%s(): can't resize memseg list\n",
+				__func__);
+			return -1;
+		}
+
+		/*
+		 * we could conceivably end up with free space at the end of the
+		 * list that wasn't enough to cover everything but can cover
+		 * some of it, so start at (old_len - n) if possible.
+		 */
+		next_idx = RTE_MAX(0, old_len - n);
+
+		cur_idx = rte_fbarray_find_next_n_free(&msl->memseg_arr,
+				next_idx, n);
+
+		if (cur_idx < 0) {
+			/* still no space, bail out */
+			RTE_LOG(ERR, EAL, "%s(): no space in memseg list\n",
+				__func__);
+			return -1;
+		}
+	}
+
+	end_idx = cur_idx + n;
+
+#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
+	have_numa = prepare_numa(&oldpolicy, oldmask, socket);
+#endif
+
+	for (i = 0; cur_idx < end_idx; cur_idx++, i++) {
+		struct rte_memseg *cur;
+
+		cur = rte_fbarray_get(&msl->memseg_arr, cur_idx);
+		addr = RTE_PTR_ADD(msl->base_va,
+				cur_idx * msl->hugepage_sz);
+
+		if (alloc_page(cur, addr, size, socket, hi, msl_idx, cur_idx)) {
+			RTE_LOG(DEBUG, EAL, "attempted to allocate %i pages, but only %i were allocated\n",
+				n, i);
+
+			/* if exact number of pages wasn't requested, stop */
+			if (!exact) {
+				ret = i;
+				goto restore_numa;
+			}
+			if (ms)
+				memset(ms, 0, sizeof(struct rte_memseg*) * n);
+			ret = -1;
+			goto restore_numa;
+		}
+		if (ms)
+			ms[i] = cur;
+
+		rte_fbarray_set_used(&msl->memseg_arr, cur_idx, true);
+	}
+	ret = n;
+
+restore_numa:
+#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
+	if (have_numa)
+		restore_numa(&oldpolicy, oldmask);
+#endif
+	return ret;
+}
+
+struct rte_memseg *
+eal_memalloc_alloc_page(uint64_t size, int socket) {
+	struct rte_memseg *ms;
+	if (eal_memalloc_alloc_page_bulk(&ms, 1, size, socket, true) < 0)
+		return NULL;
+	return ms;
+}
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 13/23] eal: make use of dynamic memory allocation for init
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (11 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 12/23] eal: add support for dynamic memory allocation Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 14/23] eal: add support for dynamic unmapping of pages Anatoly Burakov
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

Add a new (non-legacy) memory init path for EAL. It uses the
new dynamic allocation facilities, although it's only being run
at startup.

If no -m or --socket-mem switches are specified, the new init
will not allocate anything, whereas if those switches are passed,
the appropriate number of pages will be requested, just like for
legacy init.

Since rte_malloc support for dynamic allocation comes in later
patches, running DPDK without --socket-mem or -m switches will
fail in this patch.

Also, allocated pages will be physically discontiguous (or rather,
they're not guaranteed to be physically contiguous - they may still
be, by accident) unless IOVA_AS_VA mode is used.
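
As a rough, illustrative sketch (the real logic lives in
eal_hugepage_init() below; the page-count math here is an assumption),
each per-socket memory amount essentially becomes one exact bulk
request:

    #include <rte_memory.h>
    #include "eal_memalloc.h"

    static int
    example_request_socket_mem(uint64_t bytes, uint64_t page_sz, int socket_id)
    {
            unsigned int num_pages = bytes / page_sz;

            if (num_pages == 0)
                    return 0;
            /* ask for the exact number of pages, fail init otherwise */
            if (eal_memalloc_alloc_page_bulk(NULL, num_pages, page_sz,
                            socket_id, true) < 0)
                    return -1;
            return 0;
    }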

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/linuxapp/eal/eal_memory.c | 60 ++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 59f6889..7cc4a55 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -68,6 +68,7 @@
 #include <rte_string_fns.h>
 
 #include "eal_private.h"
+#include "eal_memalloc.h"
 #include "eal_internal_cfg.h"
 #include "eal_filesystem.h"
 #include "eal_hugepages.h"
@@ -1322,6 +1323,61 @@ eal_legacy_hugepage_init(void)
 	return -1;
 }
 
+static int
+eal_hugepage_init(void) {
+	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
+	uint64_t memory[RTE_MAX_NUMA_NODES];
+	int hp_sz_idx, socket_id;
+
+	test_phys_addrs_available();
+
+	memset(used_hp, 0, sizeof(used_hp));
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+		/* meanwhile, also initialize used_hp hugepage sizes in used_hp */
+		struct hugepage_info *hpi;
+		hpi = &internal_config.hugepage_info[hp_sz_idx];
+		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
+	}
+
+	/* make a copy of socket_mem, needed for balanced allocation. */
+	for (hp_sz_idx = 0; hp_sz_idx < RTE_MAX_NUMA_NODES; hp_sz_idx++)
+		memory[hp_sz_idx] = internal_config.socket_mem[hp_sz_idx];
+
+	/* calculate final number of pages */
+	if (calc_num_pages_per_socket(memory,
+			internal_config.hugepage_info, used_hp,
+			internal_config.num_hugepage_sizes) < 0)
+		return -1;
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
+				socket_id++) {
+			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
+			unsigned num_pages = hpi->num_pages[socket_id];
+			int num_pages_alloc;
+
+			if (num_pages == 0)
+				continue;
+
+			RTE_LOG(DEBUG, EAL, "Allocating %u pages of size %luM on socket %i\n",
+				num_pages, hpi->hugepage_sz >> 20, socket_id);
+
+			num_pages_alloc = eal_memalloc_alloc_page_bulk(NULL,
+					num_pages,
+					hpi->hugepage_sz, socket_id,
+					true);
+			if (num_pages_alloc < 0)
+				return -1;
+		}
+	}
+	return 0;
+}
+
 /*
  * uses fstat to report the size of a file on disk
  */
@@ -1533,6 +1589,8 @@ int
 rte_eal_hugepage_init(void) {
 	if (internal_config.legacy_mem)
 		return eal_legacy_hugepage_init();
+	else
+		return eal_hugepage_init();
 	return -1;
 }
 
@@ -1540,6 +1598,8 @@ int
 rte_eal_hugepage_attach(void) {
 	if (internal_config.legacy_mem)
 		return eal_legacy_hugepage_attach();
+	else
+		RTE_LOG(ERR, EAL, "Secondary processes aren't supported yet\n");
 	return -1;
 }
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 14/23] eal: add support for dynamic unmapping of pages
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (12 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 13/23] eal: make use of dynamic memory allocation for init Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 15/23] eal: add API to check if memory is physically contiguous Anatoly Burakov
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

This isn't used anywhere yet, but the support is now there. Also,
add cleanup to the allocation procedures, so that if we fail to
allocate everything we asked for, we can free what was already allocated.
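
A minimal, hypothetical caller pairing the allocation API from the
previous patch with the new free (page size and socket are arbitrary):

    #include <rte_log.h>
    #include <rte_memory.h>
    #include "eal_memalloc.h"

    static void
    example_alloc_then_free(void)
    {
            struct rte_memseg *ms = eal_memalloc_alloc_page(RTE_PGSIZE_2M, 0);

            if (ms == NULL)
                    return;
            /* hand the page back: unmaps it and removes the hugetlbfs backing */
            if (eal_memalloc_free_page(ms) != 0)
                    RTE_LOG(ERR, EAL, "could not release page\n");
    }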

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/eal_memalloc.h       |   3 +
 lib/librte_eal/linuxapp/eal/eal_memalloc.c | 131 ++++++++++++++++++++++++++++-
 2 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h
index 59fd330..47e4367 100755
--- a/lib/librte_eal/common/eal_memalloc.h
+++ b/lib/librte_eal/common/eal_memalloc.h
@@ -44,4 +44,7 @@ int
 eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, uint64_t size,
 		int socket, bool exact);
 
+int
+eal_memalloc_free_page(struct rte_memseg *ms);
+
 #endif // EAL_MEMALLOC_H
diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
index 527c2f6..13172a0 100755
--- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
@@ -109,6 +109,18 @@ huge_recover_sigbus(void)
 	}
 }
 
+/*
+ * uses fstat to check if the file's size on disk is zero (no blocks allocated)
+ */
+static bool
+is_zero_length(int fd)
+{
+	struct stat st;
+	if (fstat(fd, &st) < 0)
+		return false;
+	return st.st_blocks == 0;
+}
+
 #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
 static bool
 prepare_numa(int *oldpolicy, struct bitmask *oldmask, int socket_id) {
@@ -267,6 +279,61 @@ alloc_page(struct rte_memseg *ms, void *addr, uint64_t size, int socket_id,
 	return ret;
 }
 
+static int
+free_page(struct rte_memseg *ms, struct hugepage_info *hi, unsigned list_idx,
+		unsigned seg_idx) {
+	uint64_t fa_offset;
+	char path[PATH_MAX];
+	int fd;
+
+	fa_offset = seg_idx * ms->hugepage_sz;
+
+	if (internal_config.single_file_segments) {
+		eal_get_hugefile_path(path, sizeof(path), hi->hugedir, list_idx);
+	} else {
+		eal_get_hugefile_path(path, sizeof(path), hi->hugedir,
+				list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);
+	}
+
+	munmap(ms->addr, ms->hugepage_sz);
+
+	// TODO: race condition?
+
+	if (mmap(ms->addr, ms->hugepage_sz, PROT_READ,
+			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) ==
+				MAP_FAILED) {
+		RTE_LOG(DEBUG, EAL, "couldn't unmap page\n");
+		return -1;
+	}
+
+	if (internal_config.single_file_segments) {
+		/* now, truncate or remove the original file */
+		fd = open(path, O_RDWR, 0600);
+		if (fd < 0) {
+			RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n", __func__,
+					strerror(errno));
+			// TODO: proper error handling
+			return -1;
+		}
+
+		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+				fa_offset, ms->hugepage_sz)) {
+			RTE_LOG(DEBUG, EAL, "Page deallocation failed: %s\n",
+				strerror(errno));
+		}
+		if (is_zero_length(fd)) {
+			unlink(path);
+		}
+		close(fd);
+	} else {
+		unlink(path);
+	}
+
+	memset(ms, 0, sizeof(*ms));
+
+	return 0;
+}
+
 int
 eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n,
 		uint64_t size, int socket, bool exact) {
@@ -274,7 +341,7 @@ eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n,
 	struct rte_memseg_list *msl = NULL;
 	void *addr;
 	unsigned msl_idx;
-	int cur_idx, next_idx, end_idx, i, ret = 0;
+	int cur_idx, next_idx, start_idx, end_idx, i, j, ret = 0;
 #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
 	bool have_numa;
 	int oldpolicy;
@@ -366,6 +433,7 @@ eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n,
 	}
 
 	end_idx = cur_idx + n;
+	start_idx = cur_idx;
 
 #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
 	have_numa = prepare_numa(&oldpolicy, oldmask, socket);
@@ -387,6 +455,20 @@ eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n,
 				ret = i;
 				goto restore_numa;
 			}
+			RTE_LOG(DEBUG, EAL, "exact number of pages was requested, so freeing %i already allocated pages\n",
+				i);
+
+			/* clean up */
+			for (j = start_idx; j < cur_idx; j++) {
+				struct rte_memseg *tmp;
+				struct rte_fbarray *arr = &msl->memseg_arr;
+
+				tmp = rte_fbarray_get(arr, j);
+				if (free_page(tmp, hi, msl_idx, j))
+					rte_panic("Cannot free page\n");
+
+				rte_fbarray_set_used(arr, j, false);
+			}
 			if (ms)
 				memset(ms, 0, sizeof(struct rte_memseg*) * n);
 			ret = -1;
@@ -414,3 +496,50 @@ eal_memalloc_alloc_page(uint64_t size, int socket) {
 		return NULL;
 	return ms;
 }
+
+int
+eal_memalloc_free_page(struct rte_memseg *ms) {
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_memseg_list *msl = NULL;
+	unsigned msl_idx, seg_idx;
+	struct hugepage_info *hi = NULL;
+
+	/* dynamic free not supported in legacy mode */
+	if (internal_config.legacy_mem)
+		return -1;
+
+	for (int i = 0; i < (int) RTE_DIM(internal_config.hugepage_info); i++) {
+		if (ms->hugepage_sz ==
+				internal_config.hugepage_info[i].hugepage_sz) {
+			hi = &internal_config.hugepage_info[i];
+			break;
+		}
+	}
+	if (!hi) {
+		RTE_LOG(ERR, EAL, "Can't find relevant hugepage_info entry\n");
+		return -1;
+	}
+
+	for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) {
+		uintptr_t start_addr, end_addr;
+		struct rte_memseg_list *cur = &mcfg->memsegs[msl_idx];
+
+		start_addr = (uintptr_t) cur->base_va;
+		end_addr = start_addr +
+				cur->memseg_arr.capacity * cur->hugepage_sz;
+
+		if ((uintptr_t) ms->addr < start_addr ||
+				(uintptr_t) ms->addr >= end_addr) {
+			continue;
+		}
+		msl = cur;
+		seg_idx = RTE_PTR_DIFF(ms->addr, start_addr) / ms->hugepage_sz;
+		break;
+	}
+	if (!msl) {
+		RTE_LOG(ERR, EAL, "Couldn't find memseg list\n");
+		return -1;
+	}
+	rte_fbarray_set_used(&msl->memseg_arr, seg_idx, false);
+	return free_page(ms, hi, msl_idx, seg_idx);
+}
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 15/23] eal: add API to check if memory is physically contiguous
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (13 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 14/23] eal: add support for dynamic unmapping of pages Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 16/23] eal: enable dynamic memory allocation/free on malloc/free Anatoly Burakov
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

This will be helpful down the line when we implement support for
allocating physically contiguous memory. We can no longer guarantee
physically contiguous memory unless we're in IOVA_AS_VA mode, but
we can certainly try and see if we succeed. In addition, this would
be useful e.g. for PMDs that may allocate chunks that are smaller
than the page size but must not cross a page boundary, in
which case we will be able to accommodate that request.
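
A trivial, hypothetical wrapper around the new check (only
eal_memalloc_is_contig() is the API added here, the rest is made up
for illustration):

    #include <stdbool.h>
    #include <stddef.h>
    #include "eal_memalloc.h"

    /* true if every page backing [addr, addr + len) has the expected IOVA,
     * i.e. the whole range could be covered by a single DMA mapping */
    static bool
    example_range_is_contig(const struct rte_memseg_list *msl, void *addr,
                    size_t len)
    {
            return eal_memalloc_is_contig(msl, addr, len);
    }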

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/eal_common_memalloc.c | 79 +++++++++++++++++++++++++++++
 lib/librte_eal/common/eal_memalloc.h        |  5 ++
 lib/librte_eal/linuxapp/eal/Makefile        |  1 +
 3 files changed, 85 insertions(+)
 create mode 100755 lib/librte_eal/common/eal_common_memalloc.c

diff --git a/lib/librte_eal/common/eal_common_memalloc.c b/lib/librte_eal/common/eal_common_memalloc.c
new file mode 100755
index 0000000..395753a
--- /dev/null
+++ b/lib/librte_eal/common/eal_common_memalloc.c
@@ -0,0 +1,79 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_lcore.h>
+#include <rte_fbarray.h>
+#include <rte_memzone.h>
+#include <rte_memory.h>
+#include <rte_eal_memconfig.h>
+
+#include "eal_private.h"
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+
+// TODO: secondary
+// TODO: 32-bit
+
+bool
+eal_memalloc_is_contig(const struct rte_memseg_list *msl, void *start,
+		size_t len) {
+	const struct rte_memseg *ms;
+	uint64_t page_sz;
+	void *end;
+	int start_page, end_page, cur_page;
+	rte_iova_t expected;
+
+	/* for legacy memory, it's always contiguous */
+	if (internal_config.legacy_mem)
+		return true;
+
+	/* figure out how many pages we need to fit in current data */
+	page_sz = msl->hugepage_sz;
+	end = RTE_PTR_ADD(start, len);
+
+	start_page = RTE_PTR_DIFF(start, msl->base_va) / page_sz;
+	end_page = RTE_PTR_DIFF(end, msl->base_va) / page_sz;
+
+	/* now, look for contiguous memory */
+	ms = rte_fbarray_get(&msl->memseg_arr, start_page);
+	expected = ms->iova + page_sz;
+
+	for (cur_page = start_page + 1; cur_page < end_page;
+			cur_page++, expected += page_sz) {
+		ms = rte_fbarray_get(&msl->memseg_arr, cur_page);
+
+		if (ms->iova != expected)
+			return false;
+	}
+
+	return true;
+}
diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h
index 47e4367..04f9b72 100755
--- a/lib/librte_eal/common/eal_memalloc.h
+++ b/lib/librte_eal/common/eal_memalloc.h
@@ -36,6 +36,7 @@
 #include <stdbool.h>
 
 #include <rte_memory.h>
+#include <rte_eal_memconfig.h>
 
 struct rte_memseg *
 eal_memalloc_alloc_page(uint64_t size, int socket);
@@ -47,4 +48,8 @@ eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, uint64_t size,
 int
 eal_memalloc_free_page(struct rte_memseg *ms);
 
+bool
+eal_memalloc_is_contig(const struct rte_memseg_list *msl, void *start,
+		size_t len);
+
 #endif // EAL_MEMALLOC_H
diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 88f10e9..c1fc557 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -75,6 +75,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_timer.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_memzone.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_log.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_launch.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_memalloc.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_memory.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_tailqs.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_errno.c
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 16/23] eal: enable dynamic memory allocation/free on malloc/free
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (14 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 15/23] eal: add API to check if memory is physically contiguous Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 17/23] eal: add backend support for contiguous memory allocation Anatoly Burakov
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

This set of changes enables rte_malloc to allocate and free memory
as needed. It works as follows: first, malloc checks if there is
enough memory already allocated to satisfy the user's request. If
there isn't, we try to allocate more memory. The reverse happens on
free: we free an element, check its size (after merging adjacent
free elements), and see whether it is bigger than the hugepage size
and whether its start and end span a hugepage or more. If so, we
remove that area from the malloc heap (adjusting element lengths
where appropriate) and deallocate the page(s).

For legacy mode, dynamic alloc/free is disabled.

It is worth noting that memseg lists are sorted by page size, and
that we try our best to satisfy the user's request. That is, if the
user requests an element from 2MB page memory, we first check if we
can satisfy that request from existing memory; if not, we try to
allocate more 2MB pages. If that fails and the user also specified
the "size is hint" flag, we then check other page sizes and try to
allocate from those. If that fails too, then, depending on flags, we
may try allocating from other sockets. In other words, we try our
best to give the user what they asked for, but going to other
sockets is a last resort - first we try to allocate more memory on
the same socket.
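
From the API side, the fallback order described above is driven by
the existing memzone page-size flags; the snippet below is only an
illustrative sketch of how a caller opts into it (the flags and
rte_memzone_reserve() are existing DPDK API, the wrapper itself is
hypothetical):

#include <rte_memzone.h>

/*
 * Ask for a zone backed by 2MB pages, but let the allocator fall back
 * to other page sizes if it cannot find or allocate enough 2MB pages.
 * With SOCKET_ID_ANY, other sockets remain the last resort.
 */
static const struct rte_memzone *
reserve_with_fallback(const char *name, size_t len)
{
	return rte_memzone_reserve(name, len, SOCKET_ID_ANY,
			RTE_MEMZONE_2MB | RTE_MEMZONE_SIZE_HINT_ONLY);
}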

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/eal_common_memzone.c |  19 +-
 lib/librte_eal/common/malloc_elem.c        |  96 +++++++++-
 lib/librte_eal/common/malloc_elem.h        |   8 +-
 lib/librte_eal/common/malloc_heap.c        | 280 +++++++++++++++++++++++++++--
 lib/librte_eal/common/malloc_heap.h        |   4 +-
 lib/librte_eal/common/rte_malloc.c         |  24 +--
 6 files changed, 373 insertions(+), 58 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c
index f558ac2..c571145 100644
--- a/lib/librte_eal/common/eal_common_memzone.c
+++ b/lib/librte_eal/common/eal_common_memzone.c
@@ -132,7 +132,7 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len,
 	struct rte_memzone *mz;
 	struct rte_mem_config *mcfg;
 	size_t requested_len;
-	int socket, i;
+	int socket;
 
 	/* get pointer to global configuration */
 	mcfg = rte_eal_get_configuration()->mem_config;
@@ -216,21 +216,8 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len,
 		socket = socket_id;
 
 	/* allocate memory on heap */
-	void *mz_addr = malloc_heap_alloc(&mcfg->malloc_heaps[socket], NULL,
-			requested_len, flags, align, bound);
-
-	if ((mz_addr == NULL) && (socket_id == SOCKET_ID_ANY)) {
-		/* try other heaps */
-		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
-			if (socket == i)
-				continue;
-
-			mz_addr = malloc_heap_alloc(&mcfg->malloc_heaps[i],
-					NULL, requested_len, flags, align, bound);
-			if (mz_addr != NULL)
-				break;
-		}
-	}
+	void *mz_addr = malloc_heap_alloc(NULL, requested_len, socket, flags,
+			align, bound);
 
 	if (mz_addr == NULL) {
 		rte_errno = ENOMEM;
diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c
index ab09b94..48ac604 100644
--- a/lib/librte_eal/common/malloc_elem.c
+++ b/lib/librte_eal/common/malloc_elem.c
@@ -269,8 +269,8 @@ malloc_elem_free_list_insert(struct malloc_elem *elem)
 /*
  * Remove the specified element from its heap's free list.
  */
-static void
-elem_free_list_remove(struct malloc_elem *elem)
+void
+malloc_elem_free_list_remove(struct malloc_elem *elem)
 {
 	LIST_REMOVE(elem, free_list);
 }
@@ -290,7 +290,7 @@ malloc_elem_alloc(struct malloc_elem *elem, size_t size, unsigned align,
 	const size_t trailer_size = elem->size - old_elem_size - size -
 		MALLOC_ELEM_OVERHEAD;
 
-	elem_free_list_remove(elem);
+	malloc_elem_free_list_remove(elem);
 
 	if (trailer_size > MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE) {
 		/* split it, too much free space after elem */
@@ -363,7 +363,7 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem) {
 		erase = RTE_PTR_SUB(elem->next, MALLOC_ELEM_TRAILER_LEN);
 
 		/* remove from free list, join to this one */
-		elem_free_list_remove(elem->next);
+		malloc_elem_free_list_remove(elem->next);
 		join_elem(elem, elem->next);
 
 		/* erase header and trailer */
@@ -383,7 +383,7 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem) {
 		erase = RTE_PTR_SUB(elem, MALLOC_ELEM_TRAILER_LEN);
 
 		/* remove from free list, join to this one */
-		elem_free_list_remove(elem->prev);
+		malloc_elem_free_list_remove(elem->prev);
 
 		new_elem = elem->prev;
 		join_elem(new_elem, elem);
@@ -402,7 +402,7 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem) {
  * blocks either immediately before or immediately after newly freed block
  * are also free, the blocks are merged together.
  */
-int
+struct malloc_elem *
 malloc_elem_free(struct malloc_elem *elem)
 {
 	void *ptr;
@@ -420,7 +420,87 @@ malloc_elem_free(struct malloc_elem *elem)
 
 	memset(ptr, 0, data_len);
 
-	return 0;
+	return elem;
+}
+
+/* assume all checks were already done */
+void
+malloc_elem_hide_region(struct malloc_elem *elem, void *start, size_t len) {
+	size_t len_before, len_after;
+	struct malloc_elem *prev, *next;
+	void *end, *elem_end;
+
+	end = RTE_PTR_ADD(start, len);
+	elem_end = RTE_PTR_ADD(elem, elem->size);
+	len_before = RTE_PTR_DIFF(start, elem);
+	len_after = RTE_PTR_DIFF(elem_end, end);
+
+	prev = elem->prev;
+	next = elem->next;
+
+	if (len_after >= MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE) {
+		/* split after */
+		struct malloc_elem *split_after = end;
+
+		split_elem(elem, split_after);
+
+		next = split_after;
+
+		malloc_elem_free_list_insert(split_after);
+	} else if (len_after >= MALLOC_ELEM_HEADER_LEN) {
+		struct malloc_elem *pad_elem = end;
+
+		/* shrink current element */
+		elem->size -= len_after;
+		memset(pad_elem, 0, sizeof(*pad_elem));
+
+		/* copy next element's data to our pad */
+		memcpy(pad_elem, next, sizeof(*pad_elem));
+
+		/* pad next element */
+		next->state = ELEM_PAD;
+		next->pad = len_after;
+
+		/* next element is busy, would've been merged otherwise */
+		pad_elem->pad = len_after;
+		pad_elem->size += len_after;
+	} else if (len_after > 0) {
+		rte_panic("Unaligned element, heap is probably corrupt\n");
+	}
+
+	if (len_before >= MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE) {
+		/* split before */
+		struct malloc_elem *split_before = start;
+
+		split_elem(elem, split_before);
+
+		prev = elem;
+		elem = split_before;
+
+		malloc_elem_free_list_insert(prev);
+	} else if (len_before > 0) {
+		/*
+		 * unlike with elements after current, here we don't need to
+		 * pad elements, but rather just increase the size of previous
+		 * element, copy the old header and set up the trailer.
+		 */
+		void *trailer = RTE_PTR_ADD(prev,
+				prev->size - MALLOC_ELEM_TRAILER_LEN);
+		struct malloc_elem *new_elem = start;
+
+		memcpy(new_elem, elem, sizeof(*elem));
+		new_elem->size -= len_before;
+
+		prev->size += len_before;
+		set_trailer(prev);
+
+		elem = new_elem;
+
+		/* erase old trailer */
+		memset(trailer, 0, MALLOC_ELEM_TRAILER_LEN);
+	}
+
+	remove_elem(elem);
 }
 
 /*
@@ -446,7 +526,7 @@ malloc_elem_resize(struct malloc_elem *elem, size_t size)
 	/* we now know the element fits, so remove from free list,
 	 * join the two
 	 */
-	elem_free_list_remove(elem->next);
+	malloc_elem_free_list_remove(elem->next);
 	join_elem(elem, elem->next);
 
 	if (elem->size - new_size >= MIN_DATA_SIZE + MALLOC_ELEM_OVERHEAD) {
diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h
index 330bddc..b47c55e 100644
--- a/lib/librte_eal/common/malloc_elem.h
+++ b/lib/librte_eal/common/malloc_elem.h
@@ -164,7 +164,7 @@ malloc_elem_alloc(struct malloc_elem *elem, size_t size,
  * blocks either immediately before or immediately after newly freed block
  * are also free, the blocks are merged together.
  */
-int
+struct malloc_elem *
 malloc_elem_free(struct malloc_elem *elem);
 
 struct malloc_elem *
@@ -177,6 +177,12 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem);
 int
 malloc_elem_resize(struct malloc_elem *elem, size_t size);
 
+void
+malloc_elem_hide_region(struct malloc_elem *elem, void *start, size_t len);
+
+void
+malloc_elem_free_list_remove(struct malloc_elem *elem);
+
 /*
  * Given an element size, compute its freelist index.
  */
diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
index 5fa21fe..0d61704 100644
--- a/lib/librte_eal/common/malloc_heap.c
+++ b/lib/librte_eal/common/malloc_heap.c
@@ -49,8 +49,10 @@
 #include <rte_spinlock.h>
 #include <rte_memcpy.h>
 #include <rte_atomic.h>
+#include <rte_fbarray.h>
 
 #include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
 #include "malloc_elem.h"
 #include "malloc_heap.h"
 
@@ -151,46 +153,304 @@ find_suitable_element(struct malloc_heap *heap, size_t size,
  * scan fails. Once the new memseg is added, it re-scans and should return
  * the new element after releasing the lock.
  */
-void *
-malloc_heap_alloc(struct malloc_heap *heap,
-		const char *type __attribute__((unused)), size_t size, unsigned flags,
-		size_t align, size_t bound)
+static void *
+heap_alloc(struct malloc_heap *heap, const char *type __rte_unused, size_t size,
+		unsigned flags, size_t align, size_t bound)
 {
 	struct malloc_elem *elem;
 
 	size = RTE_CACHE_LINE_ROUNDUP(size);
 	align = RTE_CACHE_LINE_ROUNDUP(align);
 
-	rte_spinlock_lock(&heap->lock);
-
 	elem = find_suitable_element(heap, size, flags, align, bound);
 	if (elem != NULL) {
 		elem = malloc_elem_alloc(elem, size, align, bound);
+
 		/* increase heap's count of allocated elements */
 		heap->alloc_count++;
 	}
-	rte_spinlock_unlock(&heap->lock);
 
 	return elem == NULL ? NULL : (void *)(&elem[1]);
 }
 
+static void *
+try_expand_heap(struct malloc_heap *heap, struct rte_memseg_list *msl,
+		const char *type, size_t size, int socket, unsigned flags,
+		size_t align, size_t bound) {
+	struct malloc_elem *elem;
+	struct rte_memseg **ms;
+	size_t map_len;
+	void *map_addr;
+	int i, n_pages, allocd_pages;
+	void *ret;
+
+	align = RTE_MAX(align, MALLOC_ELEM_HEADER_LEN);
+	map_len = RTE_ALIGN_CEIL(align + size + MALLOC_ELEM_TRAILER_LEN,
+			msl->hugepage_sz);
+
+	n_pages = map_len / msl->hugepage_sz;
+
+	/* we can't know in advance how many pages we'll need, so malloc */
+	ms = malloc(sizeof(*ms) * n_pages);
+
+	allocd_pages = eal_memalloc_alloc_page_bulk(ms, n_pages,
+			msl->hugepage_sz, socket, true);
+
+	/* make sure we've allocated our pages... */
+	if (allocd_pages != n_pages)
+		goto free_ms;
+
+	map_addr = ms[0]->addr;
+
+	/* add newly minted memsegs to malloc heap */
+	elem = malloc_heap_add_memory(heap, msl, map_addr, map_len);
+
+	RTE_LOG(DEBUG, EAL, "Heap on socket %d was expanded by %zdMB\n",
+		msl->socket_id, map_len >> 20ULL);
+
+	/* try once more, as now we have allocated new memory */
+	ret = heap_alloc(heap, type, size, flags,
+			align == 0 ? 1 : align, bound);
+
+	if (ret == NULL)
+		goto free_elem;
+
+	free(ms);
+	return ret;
+
+free_elem:
+	malloc_elem_free_list_remove(elem);
+	malloc_elem_hide_region(elem, map_addr, map_len);
+	heap->total_size -= map_len;
+
+	RTE_LOG(DEBUG, EAL, "%s(): couldn't allocate, so shrinking heap on socket %d by %zdMB\n",
+		__func__, socket, map_len >> 20ULL);
+
+	for (i = 0; i < n_pages; i++) {
+		eal_memalloc_free_page(ms[i]);
+	}
+free_ms:
+	free(ms);
+	return NULL;
+}
+
+static int
+compare_pagesz(const void *a, const void *b) {
+	const struct rte_memseg_list *msla = a;
+	const struct rte_memseg_list *mslb = b;
+
+	if (msla->hugepage_sz < mslb->hugepage_sz)
+		return 1;
+	if (msla->hugepage_sz > mslb->hugepage_sz)
+		return -1;
+	return 0;
+}
+
+/* this will try lower page sizes first */
+static void *
+heap_alloc_on_socket(const char *type, size_t size, int socket,
+		unsigned flags, size_t align, size_t bound) {
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct malloc_heap *heap = &mcfg->malloc_heaps[socket];
+	struct rte_memseg_list *requested_msls[RTE_MAX_MEMSEG_LISTS];
+	struct rte_memseg_list *other_msls[RTE_MAX_MEMSEG_LISTS];
+	int i, n_other_msls = 0, n_requested_msls = 0;
+	bool size_hint = (flags & RTE_MEMZONE_SIZE_HINT_ONLY) > 0;
+	unsigned size_flags = flags & ~RTE_MEMZONE_SIZE_HINT_ONLY;
+	void *ret;
+
+	rte_spinlock_lock(&(heap->lock));
+
+	/* for legacy mode, try once and with all flags */
+	if (internal_config.legacy_mem) {
+		ret = heap_alloc(heap, type, size, flags,
+				align == 0 ? 1 : align, bound);
+		goto alloc_unlock;
+	}
+
+	/*
+	 * we do not pass the size hint here, because even if allocation fails,
+	 * we may still be able to allocate memory from appropriate page sizes,
+	 * we just need to request more memory first.
+	 */
+	ret = heap_alloc(heap, type, size, size_flags, align == 0 ? 1 : align,
+			bound);
+	if (ret != NULL)
+		goto alloc_unlock;
+
+	memset(requested_msls, 0, sizeof(requested_msls));
+	memset(other_msls, 0, sizeof(other_msls));
+
+	/*
+	 * go through memseg list and take note of all the page sizes available,
+	 * and if any of them were specifically requested by the user.
+	 */
+	for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
+		struct rte_memseg_list *msl = &mcfg->memsegs[i];
+
+		if (msl->socket_id != socket)
+			continue;
+
+		if (msl->base_va == NULL)
+			continue;
+
+		/* if pages of specific size were requested */
+		if (size_flags != 0 && check_hugepage_sz(size_flags,
+				msl->hugepage_sz)) {
+			requested_msls[n_requested_msls++] = msl;
+		} else if (size_flags == 0 || size_hint) {
+			other_msls[n_other_msls++] = msl;
+		}
+	}
+
+	/* sort the lists, smallest first */
+	qsort(requested_msls, n_requested_msls, sizeof(requested_msls[0]),
+			compare_pagesz);
+	qsort(other_msls, n_other_msls, sizeof(other_msls[0]),
+			compare_pagesz);
+
+	for (i = 0; i < n_requested_msls; i++) {
+		struct rte_memseg_list *msl = requested_msls[i];
+
+		/*
+		 * do not pass the size hint here, as user expects other page
+		 * sizes first, before resorting to best effort allocation.
+		 */
+		ret = try_expand_heap(heap, msl, type, size, socket, size_flags,
+				align, bound);
+		if (ret != NULL)
+			goto alloc_unlock;
+	}
+	if (n_other_msls == 0)
+		goto alloc_unlock;
+
+	/* now, try reserving with size hint */
+	ret = heap_alloc(heap, type, size, flags, align == 0 ? 1 : align,
+			bound);
+	if (ret != NULL)
+		goto alloc_unlock;
+
+	/*
+	 * we still couldn't reserve memory, so try expanding heap with other
+	 * page sizes, if there are any
+	 */
+	for (i = 0; i < n_other_msls; i++) {
+		struct rte_memseg_list *msl = other_msls[i];
+
+		ret = try_expand_heap(heap, msl, type, size, socket, flags,
+				align, bound);
+		if (ret != NULL)
+			goto alloc_unlock;
+	}
+alloc_unlock:
+	rte_spinlock_unlock(&(heap->lock));
+	return ret;
+}
+
+void *
+malloc_heap_alloc(const char *type, size_t size, int socket_arg, unsigned flags,
+		size_t align, size_t bound) {
+	int socket, i;
+	void *ret;
+
+	/* return NULL if size is 0 or alignment is not power-of-2 */
+	if (size == 0 || (align && !rte_is_power_of_2(align)))
+		return NULL;
+
+	if (!rte_eal_has_hugepages())
+		socket_arg = SOCKET_ID_ANY;
+
+	if (socket_arg == SOCKET_ID_ANY)
+		socket = malloc_get_numa_socket();
+	else
+		socket = socket_arg;
+
+	/* Check socket parameter */
+	if (socket >= RTE_MAX_NUMA_NODES)
+		return NULL;
+
+	// TODO: add warning for alignments bigger than page size if not VFIO
+
+	ret = heap_alloc_on_socket(type, size, socket, flags, align, bound);
+	if (ret != NULL || socket_arg != SOCKET_ID_ANY)
+		return ret;
+
+	/* try other heaps */
+	for (i = 0; i < (int) rte_num_sockets(); i++) {
+		if (i == socket)
+			continue;
+		ret = heap_alloc_on_socket(type, size, i, flags,
+				align, bound);
+		if (ret != NULL)
+			return ret;
+	}
+	return NULL;
+}
+
 int
 malloc_heap_free(struct malloc_elem *elem) {
 	struct malloc_heap *heap;
-	int ret;
+	void *start, *aligned_start, *end, *aligned_end;
+	size_t len, aligned_len;
+	const struct rte_memseg_list *msl;
+	int n_pages, page_idx, max_page_idx, ret;
 
 	if (!malloc_elem_cookies_ok(elem) || elem->state != ELEM_BUSY)
 		return -1;
 
 	/* elem may be merged with previous element, so keep heap address */
 	heap = elem->heap;
+	msl = elem->msl;
 
 	rte_spinlock_lock(&(heap->lock));
 
-	ret = malloc_elem_free(elem);
+	elem = malloc_elem_free(elem);
 
-	rte_spinlock_unlock(&(heap->lock));
+	/* anything after this is a bonus */
+	ret = 0;
+
+	/* ...of which we can't avail if we are in legacy mode */
+	if (internal_config.legacy_mem)
+		goto free_unlock;
+
+	/* check if we can free any memory back to the system */
+	if (elem->size < msl->hugepage_sz)
+		goto free_unlock;
+
+	/* probably, but let's make sure, as we may not be using up a full page */
+	start = elem;
+	len = elem->size;
+	aligned_start = RTE_PTR_ALIGN_CEIL(start, msl->hugepage_sz);
+	end = RTE_PTR_ADD(elem, len);
+	aligned_end = RTE_PTR_ALIGN_FLOOR(end, msl->hugepage_sz);
 
+	aligned_len = RTE_PTR_DIFF(aligned_end, aligned_start);
+
+	/* can't free anything */
+	if (aligned_len < msl->hugepage_sz)
+		goto free_unlock;
+
+	malloc_elem_free_list_remove(elem);
+
+	malloc_elem_hide_region(elem, (void*) aligned_start, aligned_len);
+
+	/* we don't really care if we fail to deallocate memory */
+	n_pages = aligned_len / msl->hugepage_sz;
+	page_idx = RTE_PTR_DIFF(aligned_start, msl->base_va) / msl->hugepage_sz;
+	max_page_idx = page_idx + n_pages;
+
+	for (; page_idx < max_page_idx; page_idx++) {
+		struct rte_memseg *ms;
+
+		ms = rte_fbarray_get(&msl->memseg_arr, page_idx);
+		eal_memalloc_free_page(ms);
+		heap->total_size -= msl->hugepage_sz;
+	}
+
+	RTE_LOG(DEBUG, EAL, "Heap on socket %d was shrunk by %zdMB\n",
+		msl->socket_id, aligned_len >> 20ULL);
+free_unlock:
+	rte_spinlock_unlock(&(heap->lock));
 	return ret;
 }
 
diff --git a/lib/librte_eal/common/malloc_heap.h b/lib/librte_eal/common/malloc_heap.h
index df04dd8..3fcd14f 100644
--- a/lib/librte_eal/common/malloc_heap.h
+++ b/lib/librte_eal/common/malloc_heap.h
@@ -53,8 +53,8 @@ malloc_get_numa_socket(void)
 }
 
 void *
-malloc_heap_alloc(struct malloc_heap *heap,	const char *type, size_t size,
-		unsigned flags, size_t align, size_t bound);
+malloc_heap_alloc(const char *type, size_t size, int socket, unsigned flags,
+		size_t align, size_t bound);
 
 int
 malloc_heap_free(struct malloc_elem *elem);
diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c
index 92cd7d8..dc3199a 100644
--- a/lib/librte_eal/common/rte_malloc.c
+++ b/lib/librte_eal/common/rte_malloc.c
@@ -68,9 +68,7 @@ void rte_free(void *addr)
 void *
 rte_malloc_socket(const char *type, size_t size, unsigned align, int socket_arg)
 {
-	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
-	int socket, i;
-	void *ret;
+	int socket;
 
 	/* return NULL if size is 0 or alignment is not power-of-2 */
 	if (size == 0 || (align && !rte_is_power_of_2(align)))
@@ -88,24 +86,8 @@ rte_malloc_socket(const char *type, size_t size, unsigned align, int socket_arg)
 	if (socket >= RTE_MAX_NUMA_NODES)
 		return NULL;
 
-	ret = malloc_heap_alloc(&mcfg->malloc_heaps[socket], type,
-				size, 0, align == 0 ? 1 : align, 0);
-	if (ret != NULL || socket_arg != SOCKET_ID_ANY)
-		return ret;
-
-	/* try other heaps */
-	for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
-		/* we already tried this one */
-		if (i == socket)
-			continue;
-
-		ret = malloc_heap_alloc(&mcfg->malloc_heaps[i], type,
-					size, 0, align == 0 ? 1 : align, 0);
-		if (ret != NULL)
-			return ret;
-	}
-
-	return NULL;
+	return malloc_heap_alloc(type, size, socket_arg, 0,
+			align == 0 ? 1 : align, 0);
 }
 
 /*
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 17/23] eal: add backend support for contiguous memory allocation
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (15 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 16/23] eal: enable dynamic memory allocation/free on malloc/free Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 18/23] eal: add rte_malloc support for allocating contiguous memory Anatoly Burakov
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

No major changes: just add checks in a few key places, and thread a
new physical contiguousness parameter through the allocation paths.
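
For reference, after this patch the internal allocator entry points
simply grow a trailing bool. A sketch of an EAL-internal caller that
wants physically contiguous backing (this uses the internal
malloc_heap.h header, not a public API):

#include <stddef.h>

#include <rte_memory.h>		/* SOCKET_ID_ANY */

#include "malloc_heap.h"	/* EAL-internal header, not a public API */

/* sketch: EAL-internal request for physically contiguous backing */
static void *
internal_alloc_contig(size_t len)
{
	/* align 0 = default alignment, bound 0 = no boundary, contig = true */
	return malloc_heap_alloc(NULL, len, SOCKET_ID_ANY, 0, 0, 0, true);
}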

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/eal_common_memzone.c |  16 +++--
 lib/librte_eal/common/malloc_elem.c        | 105 +++++++++++++++++++++++------
 lib/librte_eal/common/malloc_elem.h        |   6 +-
 lib/librte_eal/common/malloc_heap.c        |  54 +++++++++------
 lib/librte_eal/common/malloc_heap.h        |   6 +-
 lib/librte_eal/common/rte_malloc.c         |  38 +++++++----
 6 files changed, 158 insertions(+), 67 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c
index c571145..542ae90 100644
--- a/lib/librte_eal/common/eal_common_memzone.c
+++ b/lib/librte_eal/common/eal_common_memzone.c
@@ -127,7 +127,8 @@ find_heap_max_free_elem(int *s, unsigned align)
 
 static const struct rte_memzone *
 memzone_reserve_aligned_thread_unsafe(const char *name, size_t len,
-		int socket_id, unsigned flags, unsigned align, unsigned bound)
+		int socket_id, unsigned flags, unsigned align, unsigned bound,
+		bool contig)
 {
 	struct rte_memzone *mz;
 	struct rte_mem_config *mcfg;
@@ -217,7 +218,7 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len,
 
 	/* allocate memory on heap */
 	void *mz_addr = malloc_heap_alloc(NULL, requested_len, socket, flags,
-			align, bound);
+			align, bound, contig);
 
 	if (mz_addr == NULL) {
 		rte_errno = ENOMEM;
@@ -251,7 +252,7 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len,
 static const struct rte_memzone *
 rte_memzone_reserve_thread_safe(const char *name, size_t len,
 				int socket_id, unsigned flags, unsigned align,
-				unsigned bound)
+				unsigned bound, bool contig)
 {
 	struct rte_mem_config *mcfg;
 	const struct rte_memzone *mz = NULL;
@@ -262,7 +263,7 @@ rte_memzone_reserve_thread_safe(const char *name, size_t len,
 	rte_rwlock_write_lock(&mcfg->mlock);
 
 	mz = memzone_reserve_aligned_thread_unsafe(
-		name, len, socket_id, flags, align, bound);
+		name, len, socket_id, flags, align, bound, contig);
 
 	rte_rwlock_write_unlock(&mcfg->mlock);
 
@@ -279,7 +280,7 @@ rte_memzone_reserve_bounded(const char *name, size_t len, int socket_id,
 			    unsigned flags, unsigned align, unsigned bound)
 {
 	return rte_memzone_reserve_thread_safe(name, len, socket_id, flags,
-					       align, bound);
+					       align, bound, false);
 }
 
 /*
@@ -291,7 +292,7 @@ rte_memzone_reserve_aligned(const char *name, size_t len, int socket_id,
 			    unsigned flags, unsigned align)
 {
 	return rte_memzone_reserve_thread_safe(name, len, socket_id, flags,
-					       align, 0);
+					       align, 0, false);
 }
 
 /*
@@ -303,7 +304,8 @@ rte_memzone_reserve(const char *name, size_t len, int socket_id,
 		    unsigned flags)
 {
 	return rte_memzone_reserve_thread_safe(name, len, socket_id,
-					       flags, RTE_CACHE_LINE_SIZE, 0);
+					       flags, RTE_CACHE_LINE_SIZE, 0,
+					       false);
 }
 
 int
diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c
index 48ac604..a7d7cef 100644
--- a/lib/librte_eal/common/malloc_elem.c
+++ b/lib/librte_eal/common/malloc_elem.c
@@ -45,6 +45,7 @@
 #include <rte_common.h>
 #include <rte_spinlock.h>
 
+#include "eal_memalloc.h"
 #include "malloc_elem.h"
 #include "malloc_heap.h"
 
@@ -122,32 +123,83 @@ malloc_elem_insert(struct malloc_elem *elem)
 }
 
 /*
+ * Attempt to find enough physically contiguous memory in this block to store
+ * our data. Assume that element has at least enough space to fit in the data,
+ * so we just check the page addresses.
+ */
+static bool
+elem_check_phys_contig(const struct rte_memseg_list *msl, void *start,
+		size_t size) {
+	uint64_t page_sz;
+	void *aligned_start, *end, *aligned_end;
+	size_t aligned_len;
+
+	/* figure out how many pages we need to fit in current data */
+	page_sz = msl->hugepage_sz;
+	aligned_start = RTE_PTR_ALIGN_FLOOR(start, page_sz);
+	end = RTE_PTR_ADD(start, size);
+	aligned_end = RTE_PTR_ALIGN_CEIL(end, page_sz);
+
+	aligned_len = RTE_PTR_DIFF(aligned_end, aligned_start);
+
+	return eal_memalloc_is_contig(msl, aligned_start, aligned_len);
+}
+
+/*
  * calculate the starting point of where data of the requested size
  * and alignment would fit in the current element. If the data doesn't
  * fit, return NULL.
  */
 static void *
 elem_start_pt(struct malloc_elem *elem, size_t size, unsigned align,
-		size_t bound)
+		size_t bound, bool contig)
 {
-	const size_t bmask = ~(bound - 1);
-	uintptr_t end_pt = (uintptr_t)elem +
-			elem->size - MALLOC_ELEM_TRAILER_LEN;
-	uintptr_t new_data_start = RTE_ALIGN_FLOOR((end_pt - size), align);
-	uintptr_t new_elem_start;
-
-	/* check boundary */
-	if ((new_data_start & bmask) != ((end_pt - 1) & bmask)) {
-		end_pt = RTE_ALIGN_FLOOR(end_pt, bound);
-		new_data_start = RTE_ALIGN_FLOOR((end_pt - size), align);
-		if (((end_pt - 1) & bmask) != (new_data_start & bmask))
-			return NULL;
-	}
+	size_t elem_size = elem->size;
 
-	new_elem_start = new_data_start - MALLOC_ELEM_HEADER_LEN;
+	/*
+	 * we're allocating from the end, so adjust the size of element by page
+	 * size each time
+	 */
+	while (elem_size >= size) {
+		const size_t bmask = ~(bound - 1);
+		uintptr_t end_pt = (uintptr_t)elem +
+				elem_size - MALLOC_ELEM_TRAILER_LEN;
+		uintptr_t new_data_start = RTE_ALIGN_FLOOR((end_pt - size), align);
+		uintptr_t new_elem_start;
+
+		/* check boundary */
+		if ((new_data_start & bmask) != ((end_pt - 1) & bmask)) {
+			end_pt = RTE_ALIGN_FLOOR(end_pt, bound);
+			new_data_start = RTE_ALIGN_FLOOR((end_pt - size), align);
+			end_pt = new_data_start + size;
+
+			if (((end_pt - 1) & bmask) != (new_data_start & bmask))
+				return NULL;
+		}
 
-	/* if the new start point is before the exist start, it won't fit */
-	return (new_elem_start < (uintptr_t)elem) ? NULL : (void *)new_elem_start;
+		new_elem_start = new_data_start - MALLOC_ELEM_HEADER_LEN;
+
+		/* if the new start point is before the existing start, it won't fit */
+		if (new_elem_start < (uintptr_t)elem)
+			return NULL;
+
+		if (contig) {
+			size_t new_data_size = end_pt - new_data_start;
+
+			/*
+			 * if physical contiguousness was requested and we
+			 * couldn't fit all data into one physically contiguous
+			 * block, try again with lower addresses.
+			 */
+			if (!elem_check_phys_contig(elem->msl,
+					(void*) new_data_start, new_data_size)) {
+				elem_size -= align;
+				continue;
+			}
+		}
+		return (void *) new_elem_start;
+	}
+	return NULL;
 }
 
 /*
@@ -156,9 +208,9 @@ elem_start_pt(struct malloc_elem *elem, size_t size, unsigned align,
  */
 int
 malloc_elem_can_hold(struct malloc_elem *elem, size_t size,	unsigned align,
-		size_t bound)
+		size_t bound, bool contig)
 {
-	return elem_start_pt(elem, size, align, bound) != NULL;
+	return elem_start_pt(elem, size, align, bound, contig) != NULL;
 }
 
 /*
@@ -283,9 +335,10 @@ malloc_elem_free_list_remove(struct malloc_elem *elem)
  */
 struct malloc_elem *
 malloc_elem_alloc(struct malloc_elem *elem, size_t size, unsigned align,
-		size_t bound)
+		size_t bound, bool contig)
 {
-	struct malloc_elem *new_elem = elem_start_pt(elem, size, align, bound);
+	struct malloc_elem *new_elem = elem_start_pt(elem, size, align, bound,
+			contig);
 	const size_t old_elem_size = (uintptr_t)new_elem - (uintptr_t)elem;
 	const size_t trailer_size = elem->size - old_elem_size - size -
 		MALLOC_ELEM_OVERHEAD;
@@ -508,9 +561,11 @@ malloc_elem_hide_region(struct malloc_elem *elem, void *start, size_t len) {
  * immediately after it in memory.
  */
 int
-malloc_elem_resize(struct malloc_elem *elem, size_t size)
+malloc_elem_resize(struct malloc_elem *elem, size_t size, bool contig)
 {
 	const size_t new_size = size + elem->pad + MALLOC_ELEM_OVERHEAD;
+	const size_t new_data_size = new_size - MALLOC_ELEM_OVERHEAD;
+	void *data_ptr = RTE_PTR_ADD(elem, MALLOC_ELEM_HEADER_LEN);
 
 	/* if we request a smaller size, then always return ok */
 	if (elem->size >= new_size)
@@ -523,6 +578,12 @@ malloc_elem_resize(struct malloc_elem *elem, size_t size)
 	if (elem->size + elem->next->size < new_size)
 		return -1;
 
+	/* if physical contiguousness was requested, check that as well */
+	if (contig && !elem_check_phys_contig(elem->msl,
+			data_ptr, new_data_size)) {
+		return -1;
+	}
+
 	/* we now know the element fits, so remove from free list,
 	 * join the two
 	 */
diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h
index b47c55e..02d6bd7 100644
--- a/lib/librte_eal/common/malloc_elem.h
+++ b/lib/librte_eal/common/malloc_elem.h
@@ -149,7 +149,7 @@ malloc_elem_insert(struct malloc_elem *elem);
  */
 int
 malloc_elem_can_hold(struct malloc_elem *elem, size_t size,
-		unsigned align, size_t bound);
+		unsigned align, size_t bound, bool contig);
 
 /*
  * reserve a block of data in an existing malloc_elem. If the malloc_elem
@@ -157,7 +157,7 @@ malloc_elem_can_hold(struct malloc_elem *elem, size_t size,
  */
 struct malloc_elem *
 malloc_elem_alloc(struct malloc_elem *elem, size_t size,
-		unsigned align, size_t bound);
+		unsigned align, size_t bound, bool contig);
 
 /*
  * free a malloc_elem block by adding it to the free list. If the
@@ -175,7 +175,7 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem);
  * immediately after it in memory.
  */
 int
-malloc_elem_resize(struct malloc_elem *elem, size_t size);
+malloc_elem_resize(struct malloc_elem *elem, size_t size, bool contig);
 
 void
 malloc_elem_hide_region(struct malloc_elem *elem, void *start, size_t len);
diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
index 0d61704..427f7c6 100644
--- a/lib/librte_eal/common/malloc_heap.c
+++ b/lib/librte_eal/common/malloc_heap.c
@@ -123,7 +123,7 @@ malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl,
  */
 static struct malloc_elem *
 find_suitable_element(struct malloc_heap *heap, size_t size,
-		unsigned flags, size_t align, size_t bound)
+		unsigned flags, size_t align, size_t bound, bool contig)
 {
 	size_t idx;
 	struct malloc_elem *elem, *alt_elem = NULL;
@@ -132,7 +132,8 @@ find_suitable_element(struct malloc_heap *heap, size_t size,
 			idx < RTE_HEAP_NUM_FREELISTS; idx++) {
 		for (elem = LIST_FIRST(&heap->free_head[idx]);
 				!!elem; elem = LIST_NEXT(elem, free_list)) {
-			if (malloc_elem_can_hold(elem, size, align, bound)) {
+			if (malloc_elem_can_hold(elem, size, align, bound,
+					contig)) {
 				if (check_hugepage_sz(flags, elem->msl->hugepage_sz))
 					return elem;
 				if (alt_elem == NULL)
@@ -155,16 +156,16 @@ find_suitable_element(struct malloc_heap *heap, size_t size,
  */
 static void *
 heap_alloc(struct malloc_heap *heap, const char *type __rte_unused, size_t size,
-		unsigned flags, size_t align, size_t bound)
+		unsigned flags, size_t align, size_t bound, bool contig)
 {
 	struct malloc_elem *elem;
 
 	size = RTE_CACHE_LINE_ROUNDUP(size);
 	align = RTE_CACHE_LINE_ROUNDUP(align);
 
-	elem = find_suitable_element(heap, size, flags, align, bound);
+	elem = find_suitable_element(heap, size, flags, align, bound, contig);
 	if (elem != NULL) {
-		elem = malloc_elem_alloc(elem, size, align, bound);
+		elem = malloc_elem_alloc(elem, size, align, bound, contig);
 
 		/* increase heap's count of allocated elements */
 		heap->alloc_count++;
@@ -176,13 +177,13 @@ heap_alloc(struct malloc_heap *heap, const char *type __rte_unused, size_t size,
 static void *
 try_expand_heap(struct malloc_heap *heap, struct rte_memseg_list *msl,
 		const char *type, size_t size, int socket, unsigned flags,
-		size_t align, size_t bound) {
+		size_t align, size_t bound, bool contig) {
 	struct malloc_elem *elem;
 	struct rte_memseg **ms;
-	size_t map_len;
+	size_t map_len, data_start_offset;
 	void *map_addr;
 	int i, n_pages, allocd_pages;
-	void *ret;
+	void *ret, *data_start;
 
 	align = RTE_MAX(align, MALLOC_ELEM_HEADER_LEN);
 	map_len = RTE_ALIGN_CEIL(align + size + MALLOC_ELEM_TRAILER_LEN,
@@ -200,6 +201,16 @@ try_expand_heap(struct malloc_heap *heap, struct rte_memseg_list *msl,
 	if (allocd_pages != n_pages)
 		goto free_ms;
 
+	/* check if we wanted contiguous memory but didn't get it */
+	data_start_offset = RTE_ALIGN(MALLOC_ELEM_HEADER_LEN, align);
+	data_start = RTE_PTR_ADD(ms[0]->addr, data_start_offset);
+	if (contig && !eal_memalloc_is_contig(msl, data_start,
+			n_pages * msl->hugepage_sz)) {
+		RTE_LOG(DEBUG, EAL, "%s(): couldn't allocate physically contiguous space\n",
+				__func__);
+		goto free_pages;
+	}
+
 	map_addr = ms[0]->addr;
 
 	/* add newly minted memsegs to malloc heap */
@@ -210,7 +221,7 @@ try_expand_heap(struct malloc_heap *heap, struct rte_memseg_list *msl,
 
 	/* try once more, as now we have allocated new memory */
 	ret = heap_alloc(heap, type, size, flags,
-			align == 0 ? 1 : align, bound);
+			align == 0 ? 1 : align, bound, contig);
 
 	if (ret == NULL)
 		goto free_elem;
@@ -225,7 +236,7 @@ try_expand_heap(struct malloc_heap *heap, struct rte_memseg_list *msl,
 
 	RTE_LOG(DEBUG, EAL, "%s(): couldn't allocate, so shrinking heap on socket %d by %zdMB\n",
 		__func__, socket, map_len >> 20ULL);
-
+free_pages:
 	for (i = 0; i < n_pages; i++) {
 		eal_memalloc_free_page(ms[i]);
 	}
@@ -249,7 +260,7 @@ compare_pagesz(const void *a, const void *b) {
 /* this will try lower page sizes first */
 static void *
 heap_alloc_on_socket(const char *type, size_t size, int socket,
-		unsigned flags, size_t align, size_t bound) {
+		unsigned flags, size_t align, size_t bound, bool contig) {
 	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
 	struct malloc_heap *heap = &mcfg->malloc_heaps[socket];
 	struct rte_memseg_list *requested_msls[RTE_MAX_MEMSEG_LISTS];
@@ -264,7 +275,7 @@ heap_alloc_on_socket(const char *type, size_t size, int socket,
 	/* for legacy mode, try once and with all flags */
 	if (internal_config.legacy_mem) {
 		ret = heap_alloc(heap, type, size, flags,
-				align == 0 ? 1 : align, bound);
+				align == 0 ? 1 : align, bound, contig);
 		goto alloc_unlock;
 	}
 
@@ -274,7 +285,7 @@ heap_alloc_on_socket(const char *type, size_t size, int socket,
 	 * we just need to request more memory first.
 	 */
 	ret = heap_alloc(heap, type, size, size_flags, align == 0 ? 1 : align,
-			bound);
+			bound, contig);
 	if (ret != NULL)
 		goto alloc_unlock;
 
@@ -317,7 +328,7 @@ heap_alloc_on_socket(const char *type, size_t size, int socket,
 		 * sizes first, before resorting to best effort allocation.
 		 */
 		ret = try_expand_heap(heap, msl, type, size, socket, size_flags,
-				align, bound);
+				align, bound, contig);
 		if (ret != NULL)
 			goto alloc_unlock;
 	}
@@ -326,7 +337,7 @@ heap_alloc_on_socket(const char *type, size_t size, int socket,
 
 	/* now, try reserving with size hint */
 	ret = heap_alloc(heap, type, size, flags, align == 0 ? 1 : align,
-			bound);
+			bound, contig);
 	if (ret != NULL)
 		goto alloc_unlock;
 
@@ -338,7 +349,7 @@ heap_alloc_on_socket(const char *type, size_t size, int socket,
 		struct rte_memseg_list *msl = other_msls[i];
 
 		ret = try_expand_heap(heap, msl, type, size, socket, flags,
-				align, bound);
+				align, bound, contig);
 		if (ret != NULL)
 			goto alloc_unlock;
 	}
@@ -349,7 +360,7 @@ heap_alloc_on_socket(const char *type, size_t size, int socket,
 
 void *
 malloc_heap_alloc(const char *type, size_t size, int socket_arg, unsigned flags,
-		size_t align, size_t bound) {
+		size_t align, size_t bound, bool contig) {
 	int socket, i;
 	void *ret;
 
@@ -371,7 +382,8 @@ malloc_heap_alloc(const char *type, size_t size, int socket_arg, unsigned flags,
 
 	// TODO: add warning for alignments bigger than page size if not VFIO
 
-	ret = heap_alloc_on_socket(type, size, socket, flags, align, bound);
+	ret = heap_alloc_on_socket(type, size, socket, flags, align, bound,
+			contig);
 	if (ret != NULL || socket_arg != SOCKET_ID_ANY)
 		return ret;
 
@@ -380,7 +392,7 @@ malloc_heap_alloc(const char *type, size_t size, int socket_arg, unsigned flags,
 		if (i == socket)
 			continue;
 		ret = heap_alloc_on_socket(type, size, i, flags,
-				align, bound);
+				align, bound, contig);
 		if (ret != NULL)
 			return ret;
 	}
@@ -455,7 +467,7 @@ malloc_heap_free(struct malloc_elem *elem) {
 }
 
 int
-malloc_heap_resize(struct malloc_elem *elem, size_t size) {
+malloc_heap_resize(struct malloc_elem *elem, size_t size, bool contig) {
 	int ret;
 
 	if (!malloc_elem_cookies_ok(elem) || elem->state != ELEM_BUSY)
@@ -463,7 +475,7 @@ malloc_heap_resize(struct malloc_elem *elem, size_t size) {
 
 	rte_spinlock_lock(&(elem->heap->lock));
 
-	ret = malloc_elem_resize(elem, size);
+	ret = malloc_elem_resize(elem, size, contig);
 
 	rte_spinlock_unlock(&(elem->heap->lock));
 
diff --git a/lib/librte_eal/common/malloc_heap.h b/lib/librte_eal/common/malloc_heap.h
index 3fcd14f..e95b526 100644
--- a/lib/librte_eal/common/malloc_heap.h
+++ b/lib/librte_eal/common/malloc_heap.h
@@ -34,6 +34,8 @@
 #ifndef MALLOC_HEAP_H_
 #define MALLOC_HEAP_H_
 
+#include <stdbool.h>
+
 #include <rte_malloc.h>
 #include <rte_malloc_heap.h>
 
@@ -54,13 +56,13 @@ malloc_get_numa_socket(void)
 
 void *
 malloc_heap_alloc(const char *type, size_t size, int socket, unsigned flags,
-		size_t align, size_t bound);
+		size_t align, size_t bound, bool contig);
 
 int
 malloc_heap_free(struct malloc_elem *elem);
 
 int
-malloc_heap_resize(struct malloc_elem *elem, size_t size);
+malloc_heap_resize(struct malloc_elem *elem, size_t size, bool contig);
 
 int
 malloc_heap_get_stats(struct malloc_heap *heap,
diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c
index dc3199a..623725e 100644
--- a/lib/librte_eal/common/rte_malloc.c
+++ b/lib/librte_eal/common/rte_malloc.c
@@ -62,12 +62,9 @@ void rte_free(void *addr)
 		rte_panic("Fatal error: Invalid memory\n");
 }
 
-/*
- * Allocate memory on specified heap.
- */
-void *
-rte_malloc_socket(const char *type, size_t size, unsigned align, int socket_arg)
-{
+static void *
+malloc_socket(const char *type, size_t size, unsigned align, int socket_arg,
+		bool contig) {
 	int socket;
 
 	/* return NULL if size is 0 or alignment is not power-of-2 */
@@ -86,8 +83,16 @@ rte_malloc_socket(const char *type, size_t size, unsigned align, int socket_arg)
 	if (socket >= RTE_MAX_NUMA_NODES)
 		return NULL;
 
-	return malloc_heap_alloc(type, size, socket_arg, 0,
-			align == 0 ? 1 : align, 0);
+	return malloc_heap_alloc(type, size, socket_arg, 0, align, 0, contig);
+}
+
+/*
+ * Allocate memory on specified heap.
+ */
+void *
+rte_malloc_socket(const char *type, size_t size, unsigned align, int socket_arg)
+{
+	return malloc_socket(type, size, align, socket_arg, false);
 }
 
 /*
@@ -138,8 +143,8 @@ rte_calloc(const char *type, size_t num, size_t size, unsigned align)
 /*
  * Resize allocated memory.
  */
-void *
-rte_realloc(void *ptr, size_t size, unsigned align)
+static void *
+do_realloc(void *ptr, size_t size, unsigned align, bool contig)
 {
 	if (ptr == NULL)
 		return rte_malloc(NULL, size, align);
@@ -151,12 +156,12 @@ rte_realloc(void *ptr, size_t size, unsigned align)
 	size = RTE_CACHE_LINE_ROUNDUP(size), align = RTE_CACHE_LINE_ROUNDUP(align);
 	/* check alignment matches first, and if ok, see if we can resize block */
 	if (RTE_PTR_ALIGN(ptr,align) == ptr &&
-			malloc_heap_resize(elem, size) == 0)
+			malloc_heap_resize(elem, size, contig) == 0)
 		return ptr;
 
 	/* either alignment is off, or we have no room to expand,
 	 * so move data. */
-	void *new_ptr = rte_malloc(NULL, size, align);
+	void *new_ptr = malloc_socket(NULL, size, align, SOCKET_ID_ANY, contig);
 	if (new_ptr == NULL)
 		return NULL;
 	const unsigned old_size = elem->size - MALLOC_ELEM_OVERHEAD;
@@ -166,6 +171,15 @@ rte_realloc(void *ptr, size_t size, unsigned align)
 	return new_ptr;
 }
 
+/*
+ * Resize allocated memory.
+ */
+void *
+rte_realloc(void *ptr, size_t size, unsigned align)
+{
+	return do_realloc(ptr, size, align, false);
+}
+
 int
 rte_malloc_validate(const void *ptr, size_t *size)
 {
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 18/23] eal: add rte_malloc support for allocating contiguous memory
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (16 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 17/23] eal: add backend support for contiguous memory allocation Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 19/23] eal: enable reserving physically contiguous memzones Anatoly Burakov
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

This adds a new set of _contig APIs to rte_malloc.
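
For example, with the API added here a driver could allocate a
physically contiguous buffer like this (a sketch; the buffer name and
IOVA handling are up to the caller):

#include <rte_malloc.h>

/* sketch: allocate a physically contiguous buffer on a given socket */
static void *
alloc_contig_buffer(size_t len, int socket)
{
	void *buf = rte_malloc_socket_contig("example_buf", len, 0, socket);

	if (buf == NULL)
		return NULL;	/* no IOVA-contiguous run of 'len' bytes */

	/* ... hand buf (and its IOVA) to the device ... */
	return buf;
}

/* the buffer is later released with plain rte_free(buf) */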

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/include/rte_malloc.h | 181 +++++++++++++++++++++++++++++
 lib/librte_eal/common/rte_malloc.c         |  63 ++++++++++
 2 files changed, 244 insertions(+)

diff --git a/lib/librte_eal/common/include/rte_malloc.h b/lib/librte_eal/common/include/rte_malloc.h
index 5d4c11a..c132d33 100644
--- a/lib/librte_eal/common/include/rte_malloc.h
+++ b/lib/librte_eal/common/include/rte_malloc.h
@@ -242,6 +242,187 @@ void *
 rte_calloc_socket(const char *type, size_t num, size_t size, unsigned align, int socket);
 
 /**
+ * This function allocates memory from the huge-page area of memory. The memory
+ * is not cleared. In NUMA systems, the memory allocated resides on the same
+ * NUMA socket as the core that calls this function.
+ *
+ * @param type
+ *   A string identifying the type of allocated objects (useful for debug
+ *   purposes, such as identifying the cause of a memory leak). Can be NULL.
+ * @param size
+ *   Size (in bytes) to be allocated.
+ * @param align
+ *   If 0, the return is a pointer that is suitably aligned for any kind of
+ *   variable (in the same manner as malloc()).
+ *   Otherwise, the return is a pointer that is a multiple of *align*. In
+ *   this case, it must be a power of two. (Minimum alignment is the
+ *   cacheline size, i.e. 64-bytes)
+ * @return
+ *   - NULL on error. Not enough memory, or invalid arguments (size is 0,
+ *     align is not a power of two).
+ *   - Otherwise, the pointer to the allocated object.
+ */
+void *
+rte_malloc_contig(const char *type, size_t size, unsigned align);
+
+/**
+ * Allocate zero'ed memory from the heap.
+ *
+ * Equivalent to rte_malloc() except that the memory zone is
+ * initialised with zeros. In NUMA systems, the memory allocated resides on the
+ * same NUMA socket as the core that calls this function.
+ *
+ * @param type
+ *   A string identifying the type of allocated objects (useful for debug
+ *   purposes, such as identifying the cause of a memory leak). Can be NULL.
+ * @param size
+ *   Size (in bytes) to be allocated.
+ * @param align
+ *   If 0, the return is a pointer that is suitably aligned for any kind of
+ *   variable (in the same manner as malloc()).
+ *   Otherwise, the return is a pointer that is a multiple of *align*. In
+ *   this case, it must obviously be a power of two. (Minimum alignment is the
+ *   cacheline size, i.e. 64-bytes)
+ * @return
+ *   - NULL on error. Not enough memory, or invalid arguments (size is 0,
+ *     align is not a power of two).
+ *   - Otherwise, the pointer to the allocated object.
+ */
+void *
+rte_zmalloc_contig(const char *type, size_t size, unsigned align);
+
+/**
+ * Replacement function for calloc(), using huge-page memory. Memory area is
+ * initialised with zeros. In NUMA systems, the memory allocated resides on the
+ * same NUMA socket as the core that calls this function.
+ *
+ * @param type
+ *   A string identifying the type of allocated objects (useful for debug
+ *   purposes, such as identifying the cause of a memory leak). Can be NULL.
+ * @param num
+ *   Number of elements to be allocated.
+ * @param size
+ *   Size (in bytes) of a single element.
+ * @param align
+ *   If 0, the return is a pointer that is suitably aligned for any kind of
+ *   variable (in the same manner as malloc()).
+ *   Otherwise, the return is a pointer that is a multiple of *align*. In
+ *   this case, it must obviously be a power of two. (Minimum alignment is the
+ *   cacheline size, i.e. 64-bytes)
+ * @return
+ *   - NULL on error. Not enough memory, or invalid arguments (size is 0,
+ *     align is not a power of two).
+ *   - Otherwise, the pointer to the allocated object.
+ */
+void *
+rte_calloc_contig(const char *type, size_t num, size_t size, unsigned align);
+
+/**
+ * Replacement function for realloc(), using huge-page memory. Reserved area
+ * memory is resized, preserving contents. In NUMA systems, the new area
+ * resides on the same NUMA socket as the old area.
+ *
+ * @param ptr
+ *   Pointer to already allocated memory
+ * @param size
+ *   Size (in bytes) of new area. If this is 0, memory is freed.
+ * @param align
+ *   If 0, the return is a pointer that is suitably aligned for any kind of
+ *   variable (in the same manner as malloc()).
+ *   Otherwise, the return is a pointer that is a multiple of *align*. In
+ *   this case, it must obviously be a power of two. (Minimum alignment is the
+ *   cacheline size, i.e. 64-bytes)
+ * @return
+ *   - NULL on error. Not enough memory, or invalid arguments (size is 0,
+ *     align is not a power of two).
+ *   - Otherwise, the pointer to the reallocated memory.
+ */
+void *
+rte_realloc_contig(void *ptr, size_t size, unsigned align);
+
+/**
+ * This function allocates memory from the huge-page area of memory. The memory
+ * is not cleared.
+ *
+ * @param type
+ *   A string identifying the type of allocated objects (useful for debug
+ *   purposes, such as identifying the cause of a memory leak). Can be NULL.
+ * @param size
+ *   Size (in bytes) to be allocated.
+ * @param align
+ *   If 0, the return is a pointer that is suitably aligned for any kind of
+ *   variable (in the same manner as malloc()).
+ *   Otherwise, the return is a pointer that is a multiple of *align*. In
+ *   this case, it must be a power of two. (Minimum alignment is the
+ *   cacheline size, i.e. 64-bytes)
+ * @param socket
+ *   NUMA socket to allocate memory on. If SOCKET_ID_ANY is used, this function
+ *   will behave the same as rte_malloc().
+ * @return
+ *   - NULL on error. Not enough memory, or invalid arguments (size is 0,
+ *     align is not a power of two).
+ *   - Otherwise, the pointer to the allocated object.
+ */
+void *
+rte_malloc_socket_contig(const char *type, size_t size, unsigned align, int socket);
+
+/**
+ * Allocate zero'ed memory from the heap.
+ *
+ * Equivalent to rte_malloc() except that the memory zone is
+ * initialised with zeros.
+ *
+ * @param type
+ *   A string identifying the type of allocated objects (useful for debug
+ *   purposes, such as identifying the cause of a memory leak). Can be NULL.
+ * @param size
+ *   Size (in bytes) to be allocated.
+ * @param align
+ *   If 0, the return is a pointer that is suitably aligned for any kind of
+ *   variable (in the same manner as malloc()).
+ *   Otherwise, the return is a pointer that is a multiple of *align*. In
+ *   this case, it must obviously be a power of two. (Minimum alignment is the
+ *   cacheline size, i.e. 64-bytes)
+ * @param socket
+ *   NUMA socket to allocate memory on. If SOCKET_ID_ANY is used, this function
+ *   will behave the same as rte_zmalloc().
+ * @return
+ *   - NULL on error. Not enough memory, or invalid arguments (size is 0,
+ *     align is not a power of two).
+ *   - Otherwise, the pointer to the allocated object.
+ */
+void *
+rte_zmalloc_socket_contig(const char *type, size_t size, unsigned align, int socket);
+
+/**
+ * Replacement function for calloc(), using huge-page memory. Memory area is
+ * initialised with zeros.
+ *
+ * @param type
+ *   A string identifying the type of allocated objects (useful for debug
+ *   purposes, such as identifying the cause of a memory leak). Can be NULL.
+ * @param num
+ *   Number of elements to be allocated.
+ * @param size
+ *   Size (in bytes) of a single element.
+ * @param align
+ *   If 0, the return is a pointer that is suitably aligned for any kind of
+ *   variable (in the same manner as malloc()).
+ *   Otherwise, the return is a pointer that is a multiple of *align*. In
+ *   this case, it must obviously be a power of two. (Minimum alignment is the
+ *   cacheline size, i.e. 64-bytes)
+ * @param socket
+ *   NUMA socket to allocate memory on. If SOCKET_ID_ANY is used, this function
+ *   will behave the same as rte_calloc().
+ * @return
+ *   - NULL on error. Not enough memory, or invalid arguments (size is 0,
+ *     align is not a power of two).
+ *   - Otherwise, the pointer to the allocated object.
+ */
+void *
+rte_calloc_socket_contig(const char *type, size_t num, size_t size, unsigned align, int socket);
+
+/**
  * Frees the memory space pointed to by the provided pointer.
  *
  * This pointer must have been returned by a previous call to
diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c
index 623725e..e8ad085 100644
--- a/lib/librte_eal/common/rte_malloc.c
+++ b/lib/librte_eal/common/rte_malloc.c
@@ -96,6 +96,15 @@ rte_malloc_socket(const char *type, size_t size, unsigned align, int socket_arg)
 }
 
 /*
+ * Allocate memory on specified heap.
+ */
+void *
+rte_malloc_socket_contig(const char *type, size_t size, unsigned align, int socket_arg)
+{
+	return malloc_socket(type, size, align, socket_arg, true);
+}
+
+/*
  * Allocate memory on default heap.
  */
 void *
@@ -105,6 +114,15 @@ rte_malloc(const char *type, size_t size, unsigned align)
 }
 
 /*
+ * Allocate memory on default heap.
+ */
+void *
+rte_malloc_contig(const char *type, size_t size, unsigned align)
+{
+	return rte_malloc_socket_contig(type, size, align, SOCKET_ID_ANY);
+}
+
+/*
  * Allocate zero'd memory on specified heap.
  */
 void *
@@ -114,6 +132,15 @@ rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket)
 }
 
 /*
+ * Allocate zero'd memory on specified heap.
+ */
+void *
+rte_zmalloc_socket_contig(const char *type, size_t size, unsigned align, int socket)
+{
+	return rte_malloc_socket_contig(type, size, align, socket);
+}
+
+/*
  * Allocate zero'd memory on default heap.
  */
 void *
@@ -123,6 +150,15 @@ rte_zmalloc(const char *type, size_t size, unsigned align)
 }
 
 /*
+ * Allocate zero'd memory on default heap.
+ */
+void *
+rte_zmalloc_contig(const char *type, size_t size, unsigned align)
+{
+	return rte_zmalloc_socket_contig(type, size, align, SOCKET_ID_ANY);
+}
+
+/*
  * Allocate zero'd memory on specified heap.
  */
 void *
@@ -132,6 +168,15 @@ rte_calloc_socket(const char *type, size_t num, size_t size, unsigned align, int
 }
 
 /*
+ * Allocate zero'd physically contiguous memory on specified heap.
+ */
+void *
+rte_calloc_socket_contig(const char *type, size_t num, size_t size, unsigned align, int socket)
+{
+	return rte_zmalloc_socket_contig(type, num * size, align, socket);
+}
+
+/*
  * Allocate zero'd memory on default heap.
  */
 void *
@@ -141,6 +186,15 @@ rte_calloc(const char *type, size_t num, size_t size, unsigned align)
 }
 
 /*
+ * Allocate zero'd physically contiguous memory on default heap.
+ */
+void *
+rte_calloc_contig(const char *type, size_t num, size_t size, unsigned align)
+{
+	return rte_zmalloc_contig(type, num * size, align);
+}
+
+/*
  * Resize allocated memory.
  */
 static void *
@@ -180,6 +234,15 @@ rte_realloc(void *ptr, size_t size, unsigned align)
 	return do_realloc(ptr, size, align, false);
 }
 
+/*
+ * Resize allocated physically contiguous memory.
+ */
+void *
+rte_realloc_contig(void *ptr, size_t size, unsigned align)
+{
+	return do_realloc(ptr, size, align, true);
+}
+
 int
 rte_malloc_validate(const void *ptr, size_t *size)
 {
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 19/23] eal: enable reserving physically contiguous memzones
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (17 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 18/23] eal: add rte_malloc support for allocating contiguous memory Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 20/23] eal: make memzones use rte_fbarray Anatoly Burakov
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

This adds a new set of _contig APIs to rte_memzone.
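
Usage mirrors the existing memzone API; e.g., reserving an
IOVA-contiguous zone for a hardware descriptor ring might look like
this (a sketch using the function added below):

#include <rte_memzone.h>

/* sketch: reserve a physically contiguous zone for a descriptor ring */
static const struct rte_memzone *
reserve_ring(const char *name, size_t len, int socket)
{
	const struct rte_memzone *mz;

	mz = rte_memzone_reserve_contig(name, len, socket, 0);
	if (mz == NULL)
		return NULL;	/* not enough contiguous memory, or name exists */
	return mz;
}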

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/eal_common_memzone.c  |  44 ++++++++
 lib/librte_eal/common/include/rte_memzone.h | 158 ++++++++++++++++++++++++++++
 2 files changed, 202 insertions(+)

diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c
index 542ae90..a9a4bef 100644
--- a/lib/librte_eal/common/eal_common_memzone.c
+++ b/lib/librte_eal/common/eal_common_memzone.c
@@ -200,6 +200,12 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len,
 		socket_id = SOCKET_ID_ANY;
 
 	if (len == 0) {
+		/* len == 0 is only allowed for non-contiguous zones */
+		// TODO: technically, we can make it work, is it worth it?
+		if (contig) {
+			rte_errno = EINVAL;
+			return NULL;
+		}
 		if (bound != 0)
 			requested_len = bound;
 		else {
@@ -285,6 +291,19 @@ rte_memzone_reserve_bounded(const char *name, size_t len, int socket_id,
 
 /*
  * Return a pointer to a correctly filled memzone descriptor (with a
+ * specified alignment and boundary). If the allocation cannot be done,
+ * return NULL.
+ */
+const struct rte_memzone *
+rte_memzone_reserve_bounded_contig(const char *name, size_t len, int socket_id,
+			    unsigned flags, unsigned align, unsigned bound)
+{
+	return rte_memzone_reserve_thread_safe(name, len, socket_id, flags,
+					       align, bound, true);
+}
+
+/*
+ * Return a pointer to a correctly filled memzone descriptor (with a
  * specified alignment). If the allocation cannot be done, return NULL.
  */
 const struct rte_memzone *
@@ -296,6 +315,18 @@ rte_memzone_reserve_aligned(const char *name, size_t len, int socket_id,
 }
 
 /*
+ * Return a pointer to a correctly filled memzone descriptor (with a
+ * specified alignment). If the allocation cannot be done, return NULL.
+ */
+const struct rte_memzone *
+rte_memzone_reserve_aligned_contig(const char *name, size_t len, int socket_id,
+			    unsigned flags, unsigned align)
+{
+	return rte_memzone_reserve_thread_safe(name, len, socket_id, flags,
+					       align, 0, true);
+}
+
+/*
  * Return a pointer to a correctly filled memzone descriptor. If the
  * allocation cannot be done, return NULL.
  */
@@ -308,6 +339,19 @@ rte_memzone_reserve(const char *name, size_t len, int socket_id,
 					       false);
 }
 
+/*
+ * Return a pointer to a correctly filled memzone descriptor. If the
+ * allocation cannot be done, return NULL.
+ */
+const struct rte_memzone *
+rte_memzone_reserve_contig(const char *name, size_t len, int socket_id,
+		    unsigned flags)
+{
+	return rte_memzone_reserve_thread_safe(name, len, socket_id,
+					       flags, RTE_CACHE_LINE_SIZE, 0,
+					       true);
+}
+
 int
 rte_memzone_free(const struct rte_memzone *mz)
 {
diff --git a/lib/librte_eal/common/include/rte_memzone.h b/lib/librte_eal/common/include/rte_memzone.h
index 6f0ba18..237fd31 100644
--- a/lib/librte_eal/common/include/rte_memzone.h
+++ b/lib/librte_eal/common/include/rte_memzone.h
@@ -257,6 +257,164 @@ const struct rte_memzone *rte_memzone_reserve_bounded(const char *name,
 			unsigned flags, unsigned align, unsigned bound);
 
 /**
+ * Reserve a physically contiguous portion of memory.
+ *
+ * This function reserves some physically contiguous memory and returns
+ * a pointer to a correctly filled memzone descriptor. If the allocation
+ * cannot be done, return NULL.
+ *
+ * @param name
+ *   The name of the memzone. If it already exists, the function will
+ *   fail and return NULL.
+ * @param len
+ *   The size of the memory to be reserved. If it
+ *   is 0, the biggest contiguous zone will be reserved.
+ * @param socket_id
+ *   The socket identifier in the case of
+ *   NUMA. The value can be SOCKET_ID_ANY if there is no NUMA
+ *   constraint for the reserved zone.
+ * @param flags
+ *   The flags parameter is used to request memzones to be
+ *   taken from specifically sized hugepages.
+ *   - RTE_MEMZONE_2MB - Reserved from 2MB pages
+ *   - RTE_MEMZONE_1GB - Reserved from 1GB pages
+ *   - RTE_MEMZONE_16MB - Reserved from 16MB pages
+ *   - RTE_MEMZONE_16GB - Reserved from 16GB pages
+ *   - RTE_MEMZONE_256KB - Reserved from 256KB pages
+ *   - RTE_MEMZONE_256MB - Reserved from 256MB pages
+ *   - RTE_MEMZONE_512MB - Reserved from 512MB pages
+ *   - RTE_MEMZONE_4GB - Reserved from 4GB pages
+ *   - RTE_MEMZONE_SIZE_HINT_ONLY - Allow alternative page size to be used if
+ *                                  the requested page size is unavailable.
+ *                                  If this flag is not set, the function
+ *                                  will return error on an unavailable size
+ *                                  request.
+ * @return
+ *   A pointer to a correctly-filled read-only memzone descriptor, or NULL
+ *   on error.
+ *   On error case, rte_errno will be set appropriately:
+ *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
+ *    - E_RTE_SECONDARY - function was called from a secondary process instance
+ *    - ENOSPC - the maximum number of memzones has already been allocated
+ *    - EEXIST - a memzone with the same name already exists
+ *    - ENOMEM - no appropriate memory area found in which to create memzone
+ *    - EINVAL - invalid parameters
+ */
+const struct rte_memzone *rte_memzone_reserve_contig(const char *name,
+					      size_t len, int socket_id,
+					      unsigned flags);
+
+/**
+ * Reserve a physically contiguous portion of memory with alignment on a
+ * specified boundary.
+ *
+ * This function reserves some physically contiguous memory with alignment
+ * on a specified boundary, and returns a pointer to a correctly filled
+ * memzone descriptor. If the allocation cannot be done or if the
+ * alignment is not a power of 2, returns NULL.
+ *
+ * @param name
+ *   The name of the memzone. If it already exists, the function will
+ *   fail and return NULL.
+ * @param len
+ *   The size of the memory to be reserved. If it
+ *   is 0, the biggest contiguous zone will be reserved.
+ * @param socket_id
+ *   The socket identifier in the case of
+ *   NUMA. The value can be SOCKET_ID_ANY if there is no NUMA
+ *   constraint for the reserved zone.
+ * @param flags
+ *   The flags parameter is used to request memzones to be
+ *   taken from specifically sized hugepages.
+ *   - RTE_MEMZONE_2MB - Reserved from 2MB pages
+ *   - RTE_MEMZONE_1GB - Reserved from 1GB pages
+ *   - RTE_MEMZONE_16MB - Reserved from 16MB pages
+ *   - RTE_MEMZONE_16GB - Reserved from 16GB pages
+ *   - RTE_MEMZONE_256KB - Reserved from 256KB pages
+ *   - RTE_MEMZONE_256MB - Reserved from 256MB pages
+ *   - RTE_MEMZONE_512MB - Reserved from 512MB pages
+ *   - RTE_MEMZONE_4GB - Reserved from 4GB pages
+ *   - RTE_MEMZONE_SIZE_HINT_ONLY - Allow alternative page size to be used if
+ *                                  the requested page size is unavailable.
+ *                                  If this flag is not set, the function
+ *                                  will return error on an unavailable size
+ *                                  request.
+ * @param align
+ *   Alignment for resulting memzone. Must be a power of 2.
+ * @return
+ *   A pointer to a correctly-filled read-only memzone descriptor, or NULL
+ *   on error.
+ *   On error case, rte_errno will be set appropriately:
+ *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
+ *    - E_RTE_SECONDARY - function was called from a secondary process instance
+ *    - ENOSPC - the maximum number of memzones has already been allocated
+ *    - EEXIST - a memzone with the same name already exists
+ *    - ENOMEM - no appropriate memory area found in which to create memzone
+ *    - EINVAL - invalid parameters
+ */
+const struct rte_memzone *rte_memzone_reserve_aligned_contig(const char *name,
+			size_t len, int socket_id,
+			unsigned flags, unsigned align);
+
+/**
+ * Reserve a physically contiguous portion of memory with the specified
+ * alignment and boundary.
+ *
+ * This function reserves some physically contiguous memory with the
+ * specified alignment and boundary, and returns a pointer to a correctly
+ * filled memzone descriptor. If the allocation cannot be done or if the
+ * alignment or boundary are not a power of 2, returns NULL.
+ * The memory buffer is reserved in a way that it does not cross the
+ * specified boundary. This implies that the requested length must be
+ * less than or equal to the boundary.
+ *
+ * @param name
+ *   The name of the memzone. If it already exists, the function will
+ *   fail and return NULL.
+ * @param len
+ *   The size of the memory to be reserved. If it
+ *   is 0, the biggest contiguous zone will be reserved.
+ * @param socket_id
+ *   The socket identifier in the case of
+ *   NUMA. The value can be SOCKET_ID_ANY if there is no NUMA
+ *   constraint for the reserved zone.
+ * @param flags
+ *   The flags parameter is used to request memzones to be
+ *   taken from specifically sized hugepages.
+ *   - RTE_MEMZONE_2MB - Reserved from 2MB pages
+ *   - RTE_MEMZONE_1GB - Reserved from 1GB pages
+ *   - RTE_MEMZONE_16MB - Reserved from 16MB pages
+ *   - RTE_MEMZONE_16GB - Reserved from 16GB pages
+ *   - RTE_MEMZONE_256KB - Reserved from 256KB pages
+ *   - RTE_MEMZONE_256MB - Reserved from 256MB pages
+ *   - RTE_MEMZONE_512MB - Reserved from 512MB pages
+ *   - RTE_MEMZONE_4GB - Reserved from 4GB pages
+ *   - RTE_MEMZONE_SIZE_HINT_ONLY - Allow alternative page size to be used if
+ *                                  the requested page size is unavailable.
+ *                                  If this flag is not set, the function
+ *                                  will return error on an unavailable size
+ *                                  request.
+ * @param align
+ *   Alignment for resulting memzone. Must be a power of 2.
+ * @param bound
+ *   Boundary for resulting memzone. Must be a power of 2 or zero.
+ *   Zero value implies no boundary condition.
+ * @return
+ *   A pointer to a correctly-filled read-only memzone descriptor, or NULL
+ *   on error.
+ *   On error case, rte_errno will be set appropriately:
+ *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
+ *    - E_RTE_SECONDARY - function was called from a secondary process instance
+ *    - ENOSPC - the maximum number of memzones has already been allocated
+ *    - EEXIST - a memzone with the same name already exists
+ *    - ENOMEM - no appropriate memory area found in which to create memzone
+ *    - EINVAL - invalid parameters
+ */
+const struct rte_memzone *rte_memzone_reserve_bounded_contig(const char *name,
+			size_t len, int socket_id,
+			unsigned flags, unsigned align, unsigned bound);
+
+/**
  * Free a memzone.
  *
  * @param mz
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 20/23] eal: make memzones use rte_fbarray
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (18 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 19/23] eal: enable reserving physically contiguous memzones Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 21/23] mempool: add support for the new memory allocation methods Anatoly Burakov
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

Make the memzone list use rte_fbarray instead of a fixed-size array.
This greatly expands the memzone list and makes some operations on it
faster; besides, the infrastructure is already there, so we might as
well use it.

As part of this commit, a potential memory leak is fixed (previously,
when a memzone was allocated but there was no room left in the config,
the backing memory was not freed), and a compile fix for the ENA driver
is included.
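
As an illustration, a minimal sketch of how the memzone list can now be
walked through the fbarray (header names are assumed from the earlier
patches in this set, and locking is left to the caller):

#include <stdio.h>

#include <rte_eal.h>
#include <rte_eal_memconfig.h>
#include <rte_fbarray.h>
#include <rte_memzone.h>

static void
dump_memzone_names(FILE *f)
{
	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
	struct rte_fbarray *arr = &mcfg->memzones;
	int i = 0;

	/* find_next_used() returns -1 once no further entries are in use */
	while ((i = rte_fbarray_find_next_used(arr, i)) >= 0) {
		const struct rte_memzone *mz = rte_fbarray_get(arr, i);

		fprintf(f, "memzone %d: %s, len 0x%zx\n", i, mz->name, mz->len);
		i++; /* resume the search after this entry */
	}
}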

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 config/common_base                                |   2 +-
 drivers/net/ena/ena_ethdev.c                      |  10 +-
 lib/librte_eal/common/eal_common_memzone.c        | 168 ++++++++++++++++------
 lib/librte_eal/common/include/rte_eal_memconfig.h |   4 +-
 4 files changed, 137 insertions(+), 47 deletions(-)

diff --git a/config/common_base b/config/common_base
index 9730d4c..cce464d 100644
--- a/config/common_base
+++ b/config/common_base
@@ -92,7 +92,7 @@ CONFIG_RTE_MAX_LCORE=128
 CONFIG_RTE_MAX_NUMA_NODES=8
 CONFIG_RTE_MAX_MEMSEG_LISTS=16
 CONFIG_RTE_MAX_MEMSEG_PER_LIST=32768
-CONFIG_RTE_MAX_MEMZONE=2560
+CONFIG_RTE_MAX_MEMZONE=32768
 CONFIG_RTE_MAX_TAILQ=32
 CONFIG_RTE_ENABLE_ASSERT=n
 CONFIG_RTE_LOG_LEVEL=RTE_LOG_INFO
diff --git a/drivers/net/ena/ena_ethdev.c b/drivers/net/ena/ena_ethdev.c
index 22db895..aa37cad 100644
--- a/drivers/net/ena/ena_ethdev.c
+++ b/drivers/net/ena/ena_ethdev.c
@@ -249,11 +249,15 @@ static const struct eth_dev_ops ena_dev_ops = {
 static inline int ena_cpu_to_node(int cpu)
 {
 	struct rte_config *config = rte_eal_get_configuration();
+	const struct rte_fbarray *arr = &config->mem_config->memzones;
+	const struct rte_memzone *mz;
 
-	if (likely(cpu < RTE_MAX_MEMZONE))
-		return config->mem_config->memzone[cpu].socket_id;
+	if (unlikely(cpu >= RTE_MAX_MEMZONE))
+		return NUMA_NO_NODE;
 
-	return NUMA_NO_NODE;
+	mz = rte_fbarray_get(arr, cpu);
+
+	return mz->socket_id;
 }
 
 static inline void ena_rx_mbuf_prepare(struct rte_mbuf *mbuf,
diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c
index a9a4bef..58a4f25 100644
--- a/lib/librte_eal/common/eal_common_memzone.c
+++ b/lib/librte_eal/common/eal_common_memzone.c
@@ -58,20 +58,23 @@ static inline const struct rte_memzone *
 memzone_lookup_thread_unsafe(const char *name)
 {
 	const struct rte_mem_config *mcfg;
+	const struct rte_fbarray *arr;
 	const struct rte_memzone *mz;
-	unsigned i = 0;
+	int i = 0;
 
 	/* get pointer to global configuration */
 	mcfg = rte_eal_get_configuration()->mem_config;
+	arr = &mcfg->memzones;
 
 	/*
 	 * the algorithm is not optimal (linear), but there are few
 	 * zones and this function should be called at init only
 	 */
-	for (i = 0; i < RTE_MAX_MEMZONE; i++) {
-		mz = &mcfg->memzone[i];
-		if (mz->addr != NULL && !strncmp(name, mz->name, RTE_MEMZONE_NAMESIZE))
-			return &mcfg->memzone[i];
+	while ((i = rte_fbarray_find_next_used(arr, i)) >= 0) {
+		mz = rte_fbarray_get(arr, i++);
+		if (mz->addr != NULL &&
+				!strncmp(name, mz->name, RTE_MEMZONE_NAMESIZE))
+			return mz;
 	}
 
 	return NULL;
@@ -81,17 +84,44 @@ static inline struct rte_memzone *
 get_next_free_memzone(void)
 {
 	struct rte_mem_config *mcfg;
-	unsigned i = 0;
+	struct rte_fbarray *arr;
+	int i = 0;
 
 	/* get pointer to global configuration */
 	mcfg = rte_eal_get_configuration()->mem_config;
+	arr = &mcfg->memzones;
+
+	i = rte_fbarray_find_next_free(arr, 0);
+	if (i < 0) {
+		/* no space in config, so try expanding the list */
+		int old_len = arr->len;
+		int new_len = old_len * 2;
+		new_len = RTE_MIN(new_len, arr->capacity);
+
+		if (old_len == new_len) {
+			/* can't expand, the list is full */
+			RTE_LOG(ERR, EAL, "%s(): no space in memzone list\n",
+				__func__);
+			return NULL;
+		}
 
-	for (i = 0; i < RTE_MAX_MEMZONE; i++) {
-		if (mcfg->memzone[i].addr == NULL)
-			return &mcfg->memzone[i];
-	}
+		if (rte_fbarray_resize(arr, new_len)) {
+			RTE_LOG(ERR, EAL, "%s(): can't resize memzone list\n",
+				__func__);
+			return NULL;
+		}
 
-	return NULL;
+		/* ensure we have free space */
+		i = rte_fbarray_find_next_free(arr, old_len);
+
+		if (i < 0) {
+			RTE_LOG(ERR, EAL, "%s(): Cannot find room in config!\n",
+				__func__);
+			return NULL;
+		}
+	}
+	rte_fbarray_set_used(arr, i, true);
+	return rte_fbarray_get(arr, i);
 }
 
 /* This function will return the greatest free block if a heap has been
@@ -132,14 +162,16 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len,
 {
 	struct rte_memzone *mz;
 	struct rte_mem_config *mcfg;
+	struct rte_fbarray *arr;
 	size_t requested_len;
 	int socket;
 
 	/* get pointer to global configuration */
 	mcfg = rte_eal_get_configuration()->mem_config;
+	arr = &mcfg->memzones;
 
 	/* no more room in config */
-	if (mcfg->memzone_cnt >= RTE_MAX_MEMZONE) {
+	if (arr->count >= arr->capacity) {
 		RTE_LOG(ERR, EAL, "%s(): No more room in config\n", __func__);
 		rte_errno = ENOSPC;
 		return NULL;
@@ -231,19 +263,19 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len,
 		return NULL;
 	}
 
-	const struct malloc_elem *elem = malloc_elem_from_data(mz_addr);
+	struct malloc_elem *elem = malloc_elem_from_data(mz_addr);
 
 	/* fill the zone in config */
 	mz = get_next_free_memzone();
 
 	if (mz == NULL) {
-		RTE_LOG(ERR, EAL, "%s(): Cannot find free memzone but there is room "
-				"in config!\n", __func__);
+		RTE_LOG(ERR, EAL, "%s(): Cannot find free memzone but there is room in config!\n",
+			__func__);
 		rte_errno = ENOSPC;
+		malloc_heap_free(elem);
 		return NULL;
 	}
 
-	mcfg->memzone_cnt++;
 	snprintf(mz->name, sizeof(mz->name), "%s", name);
 	mz->iova = rte_malloc_virt2iova(mz_addr);
 	mz->addr = mz_addr;
@@ -356,6 +388,8 @@ int
 rte_memzone_free(const struct rte_memzone *mz)
 {
 	struct rte_mem_config *mcfg;
+	struct rte_fbarray *arr;
+	struct rte_memzone *found_mz;
 	int ret = 0;
 	void *addr;
 	unsigned idx;
@@ -364,21 +398,22 @@ rte_memzone_free(const struct rte_memzone *mz)
 		return -EINVAL;
 
 	mcfg = rte_eal_get_configuration()->mem_config;
+	arr = &mcfg->memzones;
 
 	rte_rwlock_write_lock(&mcfg->mlock);
 
-	idx = ((uintptr_t)mz - (uintptr_t)mcfg->memzone);
-	idx = idx / sizeof(struct rte_memzone);
+	idx = rte_fbarray_find_idx(arr, mz);
+	found_mz = rte_fbarray_get(arr, idx);
 
-	addr = mcfg->memzone[idx].addr;
+	addr = found_mz->addr;
 	if (addr == NULL)
 		ret = -EINVAL;
-	else if (mcfg->memzone_cnt == 0) {
+	else if (arr->count == 0) {
 		rte_panic("%s(): memzone address not NULL but memzone_cnt is 0!\n",
 				__func__);
 	} else {
-		memset(&mcfg->memzone[idx], 0, sizeof(mcfg->memzone[idx]));
-		mcfg->memzone_cnt--;
+		memset(found_mz, 0, sizeof(*found_mz));
+		rte_fbarray_set_used(arr, idx, false);
 	}
 
 	rte_rwlock_write_unlock(&mcfg->mlock);
@@ -412,25 +447,71 @@ rte_memzone_lookup(const char *name)
 void
 rte_memzone_dump(FILE *f)
 {
+	struct rte_fbarray *arr;
 	struct rte_mem_config *mcfg;
-	unsigned i = 0;
+	int i = 0;
 
 	/* get pointer to global configuration */
 	mcfg = rte_eal_get_configuration()->mem_config;
+	arr = &mcfg->memzones;
 
 	rte_rwlock_read_lock(&mcfg->mlock);
 	/* dump all zones */
-	for (i=0; i<RTE_MAX_MEMZONE; i++) {
-		if (mcfg->memzone[i].addr == NULL)
-			break;
-		fprintf(f, "Zone %u: name:<%s>, IO:0x%"PRIx64", len:0x%zx"
+	while ((i = rte_fbarray_find_next_used(arr, i)) >= 0) {
+		void *cur_addr, *mz_end;
+		struct rte_memzone *mz;
+		struct rte_memseg_list *msl = NULL;
+		struct rte_memseg *ms;
+		int msl_idx, ms_idx;
+
+		mz = rte_fbarray_get(arr, i);
+
+		/*
+		 * memzones can span multiple physical pages, so dump addresses
+		 * of all physical pages this memzone spans.
+		 */
+
+		fprintf(f, "Zone %u: name:<%s>, len:0x%zx"
 		       ", virt:%p, socket_id:%"PRId32", flags:%"PRIx32"\n", i,
-		       mcfg->memzone[i].name,
-		       mcfg->memzone[i].iova,
-		       mcfg->memzone[i].len,
-		       mcfg->memzone[i].addr,
-		       mcfg->memzone[i].socket_id,
-		       mcfg->memzone[i].flags);
+		       mz->name,
+		       mz->len,
+		       mz->addr,
+		       mz->socket_id,
+		       mz->flags);
+
+		/* get pointer to appropriate memseg list */
+		for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) {
+			if (mcfg->memsegs[msl_idx].hugepage_sz != mz->hugepage_sz)
+				continue;
+			if (mcfg->memsegs[msl_idx].socket_id != mz->socket_id)
+				continue;
+			msl = &mcfg->memsegs[msl_idx];
+			break;
+		}
+		if (!msl) {
+			RTE_LOG(DEBUG, EAL, "Skipping bad memzone\n");
+			continue;
+		}
+
+		cur_addr = RTE_PTR_ALIGN_FLOOR(mz->addr, mz->hugepage_sz);
+		mz_end = RTE_PTR_ADD(cur_addr, mz->len);
+
+		ms_idx = RTE_PTR_DIFF(mz->addr, msl->base_va) /
+				msl->hugepage_sz;
+		ms = rte_fbarray_get(&msl->memseg_arr, ms_idx);
+
+		fprintf(f, "physical pages used:\n");
+		do {
+			fprintf(f, "  addr: %p iova: 0x%" PRIx64 " len: 0x%" PRIx64 " pagesz: 0x%" PRIx64 "\n",
+				cur_addr, ms->iova, ms->len, ms->hugepage_sz);
+
+			/* advance VA to next page */
+			cur_addr = RTE_PTR_ADD(cur_addr, ms->hugepage_sz);
+
+			/* memzones occupy contiguous segments */
+			++ms;
+		} while (cur_addr < mz_end);
+		i++;
 	}
 	rte_rwlock_read_unlock(&mcfg->mlock);
 }
@@ -459,9 +540,11 @@ rte_eal_memzone_init(void)
 
 	rte_rwlock_write_lock(&mcfg->mlock);
 
-	/* delete all zones */
-	mcfg->memzone_cnt = 0;
-	memset(mcfg->memzone, 0, sizeof(mcfg->memzone));
+	if (rte_fbarray_alloc(&mcfg->memzones, "memzone", 256,
+			RTE_MAX_MEMZONE, sizeof(struct rte_memzone))) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memzone list\n");
+		return -1;
+	}
 
 	rte_rwlock_write_unlock(&mcfg->mlock);
 
@@ -473,14 +556,19 @@ void rte_memzone_walk(void (*func)(const struct rte_memzone *, void *),
 		      void *arg)
 {
 	struct rte_mem_config *mcfg;
-	unsigned i;
+	struct rte_fbarray *arr;
+	int i;
 
 	mcfg = rte_eal_get_configuration()->mem_config;
+	arr = &mcfg->memzones;
+
+	i = 0;
 
 	rte_rwlock_read_lock(&mcfg->mlock);
-	for (i=0; i<RTE_MAX_MEMZONE; i++) {
-		if (mcfg->memzone[i].addr != NULL)
-			(*func)(&mcfg->memzone[i], arg);
+	while ((i = rte_fbarray_find_next_used(arr, i)) >= 0) {
+		struct rte_memzone *mz = rte_fbarray_get(arr, i);
+		(*func)(mz, arg);
+		i++;
 	}
 	rte_rwlock_read_unlock(&mcfg->mlock);
 }
diff --git a/lib/librte_eal/common/include/rte_eal_memconfig.h b/lib/librte_eal/common/include/rte_eal_memconfig.h
index c9b57a4..8f4cc34 100644
--- a/lib/librte_eal/common/include/rte_eal_memconfig.h
+++ b/lib/librte_eal/common/include/rte_eal_memconfig.h
@@ -86,10 +86,8 @@ struct rte_mem_config {
 	rte_rwlock_t qlock;   /**< used for tailq operation for thread safe. */
 	rte_rwlock_t mplock;  /**< only used by mempool LIB for thread-safe. */
 
-	uint32_t memzone_cnt; /**< Number of allocated memzones */
-
 	/* memory segments and zones */
-	struct rte_memzone memzone[RTE_MAX_MEMZONE]; /**< Memzone descriptors. */
+	struct rte_fbarray memzones; /**< Memzone descriptors. */
 
 	struct rte_memseg_list memsegs[RTE_MAX_MEMSEG_LISTS];
 	/**< list of dynamic arrays holding memsegs */
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 21/23] mempool: add support for the new memory allocation methods
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (19 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 20/23] eal: make memzones use rte_fbarray Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 22/23] vfio: allow to map other memory regions Anatoly Burakov
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

If the user has specified that the mempool must be backed by contiguous
memory, use the new _contig allocation APIs instead of the normal ones.
Otherwise, account for the fact that, unless we are in IOVA-as-VA mode,
we cannot guarantee that the pages will be physically contiguous, so we
calculate the memzone size and alignment as if we were getting the
smallest page size available.
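
For illustration, the user-facing side of this decision looks roughly as
follows (pool name and sizes are arbitrary; note that
MEMPOOL_F_CAPA_PHYS_CONTIG is a capability reported by the pool driver,
not a flag set by the application):

#include <rte_mempool.h>
#include <rte_lcore.h>

/* A pool whose objects do not need to be physically contiguous; with
 * this flag set, populate_default() skips the page size and alignment
 * calculations entirely.
 */
static struct rte_mempool *
create_non_contig_pool(void)
{
	return rte_mempool_create("pkt_meta_pool",
			8192,		/* number of elements */
			2048,		/* element size */
			256,		/* per-lcore cache size */
			0,		/* private data size */
			NULL, NULL,	/* no pool init callback */
			NULL, NULL,	/* no per-object init callback */
			rte_socket_id(),
			MEMPOOL_F_NO_PHYS_CONTIG);
}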

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_mempool/rte_mempool.c | 84 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 75 insertions(+), 9 deletions(-)

diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
index d50dba4..4b9ab22 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -127,6 +127,26 @@ static unsigned optimize_object_size(unsigned obj_size)
 	return new_obj_size * RTE_MEMPOOL_ALIGN;
 }
 
+static size_t
+get_min_page_size(void) {
+	const struct rte_mem_config *mcfg =
+			rte_eal_get_configuration()->mem_config;
+	int i;
+	size_t min_pagesz = SIZE_MAX;
+
+	for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
+		const struct rte_memseg_list *msl = &mcfg->memsegs[i];
+
+		if (msl->base_va == NULL)
+			continue;
+
+		if (msl->hugepage_sz < min_pagesz)
+			min_pagesz = msl->hugepage_sz;
+	}
+
+	return min_pagesz == SIZE_MAX ? (size_t) getpagesize() : min_pagesz;
+}
+
 static void
 mempool_add_elem(struct rte_mempool *mp, void *obj, rte_iova_t iova)
 {
@@ -568,6 +588,7 @@ rte_mempool_populate_default(struct rte_mempool *mp)
 	unsigned mz_id, n;
 	unsigned int mp_flags;
 	int ret;
+	bool force_contig, no_contig;
 
 	/* mempool must not be populated */
 	if (mp->nb_mem_chunks != 0)
@@ -582,10 +603,46 @@ rte_mempool_populate_default(struct rte_mempool *mp)
 	/* update mempool capabilities */
 	mp->flags |= mp_flags;
 
-	if (rte_eal_has_hugepages()) {
-		pg_shift = 0; /* not needed, zone is physically contiguous */
+	no_contig = mp->flags & MEMPOOL_F_NO_PHYS_CONTIG;
+	force_contig = mp->flags & MEMPOOL_F_CAPA_PHYS_CONTIG;
+
+	/*
+	 * there are several considerations for page size and page shift here.
+	 *
+	 * if we don't need our mempools to have physically contiguous objects,
+	 * then just set page shift and page size to 0, because the user has
+	 * indicated that there's no need to care about anything.
+	 *
+	 * if we do need contiguous objects, there is also an option to reserve
+	 * the entire mempool memory as one contiguous block of memory, in
+	 * which case the page shift and alignment wouldn't matter as well.
+	 *
+	 * if we require contiguous objects, but not necessarily the entire
+	 * mempool reserved space to be contiguous, then there are two options.
+	 *
+	 * if our IO addresses are virtual, not actual physical (IOVA as VA
+	 * case), then no page shift needed - our memory allocation will give us
+	 * contiguous physical memory as far as the hardware is concerned, so
+	 * act as if we're getting contiguous memory.
+	 *
+	 * if our IO addresses are physical, we may get memory from bigger
+	 * pages, or we might get memory from smaller pages, and how much of it
+	 * we require depends on whether we want bigger or smaller pages.
+	 * However, requesting each and every memory size is too much work, so
+	 * what we'll do instead is walk through the page sizes available, pick
+	 * the smallest one and set up page shift to match that one. We will be
+	 * wasting some space this way, but it's much nicer than looping around
+	 * trying to reserve each and every page size.
+	 */
+
+	if (no_contig || force_contig || rte_eal_iova_mode() == RTE_IOVA_VA) {
 		pg_sz = 0;
+		pg_shift = 0;
 		align = RTE_CACHE_LINE_SIZE;
+	} else if (rte_eal_has_hugepages()) {
+		pg_sz = get_min_page_size();
+		pg_shift = rte_bsf32(pg_sz);
+		align = pg_sz;
 	} else {
 		pg_sz = getpagesize();
 		pg_shift = rte_bsf32(pg_sz);
@@ -604,23 +661,32 @@ rte_mempool_populate_default(struct rte_mempool *mp)
 			goto fail;
 		}
 
-		mz = rte_memzone_reserve_aligned(mz_name, size,
-			mp->socket_id, mz_flags, align);
-		/* not enough memory, retry with the biggest zone we have */
-		if (mz == NULL)
-			mz = rte_memzone_reserve_aligned(mz_name, 0,
+		if (force_contig) {
+			/*
+			 * if contiguous memory for entire mempool memory was
+			 * requested, don't try reserving again if we fail.
+			 */
+			mz = rte_memzone_reserve_aligned_contig(mz_name, size,
+				mp->socket_id, mz_flags, align);
+		} else {
+			mz = rte_memzone_reserve_aligned(mz_name, size,
 				mp->socket_id, mz_flags, align);
+			/* not enough memory, retry with the biggest zone we have */
+			if (mz == NULL)
+				mz = rte_memzone_reserve_aligned(mz_name, 0,
+					mp->socket_id, mz_flags, align);
+		}
 		if (mz == NULL) {
 			ret = -rte_errno;
 			goto fail;
 		}
 
-		if (mp->flags & MEMPOOL_F_NO_PHYS_CONTIG)
+		if (no_contig)
 			iova = RTE_BAD_IOVA;
 		else
 			iova = mz->iova;
 
-		if (rte_eal_has_hugepages())
+		if (rte_eal_has_hugepages() && force_contig)
 			ret = rte_mempool_populate_iova(mp, mz->addr,
 				iova, mz->len,
 				rte_mempool_memchunk_mz_free,
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 22/23] vfio: allow to map other memory regions
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (20 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 21/23] mempool: add support for the new memory allocation methods Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 23/23] eal: map/unmap memory with VFIO when alloc/free pages Anatoly Burakov
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas, Pawel Wodkowski

Currently it is not possible to use memory that is not owned by DPDK to
perform DMA. This scenario might be needed in vhost applications (like
SPDK), where the guest sends its own memory table. To fill this gap,
provide an API to allow registering arbitrary addresses in the VFIO
container.
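
For illustration, a minimal sketch of how an application such as a vhost
backend could use the new calls (the wrapper names are made up, and the
caller is assumed to know the region's IOVA):

#include <stdint.h>

#include <rte_iommu.h>

static int
guest_region_add(void *va, uint64_t iova, uint64_t len)
{
	/* make the externally allocated buffer DMA-able */
	return rte_iommu_dma_map((uint64_t)(uintptr_t)va, iova, len);
}

static int
guest_region_del(void *va, uint64_t iova, uint64_t len)
{
	/* undo the mapping once the guest removes the region */
	return rte_iommu_dma_unmap((uint64_t)(uintptr_t)va, iova, len);
}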

Signed-off-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/linuxapp/eal/eal_vfio.c | 150 ++++++++++++++++++++++++++++-----
 lib/librte_eal/linuxapp/eal/eal_vfio.h |  11 +++
 2 files changed, 140 insertions(+), 21 deletions(-)

diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index 09dfc68..15d28ad 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -40,6 +40,7 @@
 #include <rte_memory.h>
 #include <rte_eal_memconfig.h>
 #include <rte_vfio.h>
+#include <rte_iommu.h>
 
 #include "eal_filesystem.h"
 #include "eal_vfio.h"
@@ -51,17 +52,35 @@
 static struct vfio_config vfio_cfg;
 
 static int vfio_type1_dma_map(int);
+static int vfio_type1_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int);
 static int vfio_spapr_dma_map(int);
 static int vfio_noiommu_dma_map(int);
+static int vfio_noiommu_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int);
 
 /* IOMMU types we support */
 static const struct vfio_iommu_type iommu_types[] = {
 	/* x86 IOMMU, otherwise known as type 1 */
-	{ RTE_VFIO_TYPE1, "Type 1", &vfio_type1_dma_map},
+	{
+		.type_id = RTE_VFIO_TYPE1,
+		.name = "Type 1",
+		.dma_map_func = &vfio_type1_dma_map,
+		.dma_user_map_func = &vfio_type1_dma_mem_map
+	},
 	/* ppc64 IOMMU, otherwise known as spapr */
-	{ RTE_VFIO_SPAPR, "sPAPR", &vfio_spapr_dma_map},
+	{
+		.type_id = RTE_VFIO_SPAPR,
+		.name = "sPAPR",
+		.dma_map_func = &vfio_spapr_dma_map,
+		.dma_user_map_func = NULL
+		// TODO: work with PPC64 people on enabling this, window size!
+	},
 	/* IOMMU-less mode */
-	{ RTE_VFIO_NOIOMMU, "No-IOMMU", &vfio_noiommu_dma_map},
+	{
+		.type_id = RTE_VFIO_NOIOMMU,
+		.name = "No-IOMMU",
+		.dma_map_func = &vfio_noiommu_dma_map,
+		.dma_user_map_func = &vfio_noiommu_dma_mem_map
+	},
 };
 
 int
@@ -362,9 +381,10 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		 */
 		if (internal_config.process_type == RTE_PROC_PRIMARY &&
 				vfio_cfg.vfio_active_groups == 1) {
+			const struct vfio_iommu_type *t;
+
 			/* select an IOMMU type which we will be using */
-			const struct vfio_iommu_type *t =
-				vfio_set_iommu_type(vfio_cfg.vfio_container_fd);
+			t = vfio_set_iommu_type(vfio_cfg.vfio_container_fd);
 			if (!t) {
 				RTE_LOG(ERR, EAL,
 					"  %s failed to select IOMMU type\n",
@@ -382,6 +402,8 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 				clear_group(vfio_group_fd);
 				return -1;
 			}
+
+			vfio_cfg.vfio_iommu_type = t;
 		}
 	}
 
@@ -694,13 +716,52 @@ vfio_get_group_no(const char *sysfs_base,
 }
 
 static int
+vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
+		uint64_t len, int do_map)
+{
+	struct vfio_iommu_type1_dma_map dma_map;
+	struct vfio_iommu_type1_dma_unmap dma_unmap;
+	int ret;
+
+	if (do_map != 0) {
+		memset(&dma_map, 0, sizeof(dma_map));
+		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
+		dma_map.vaddr = vaddr;
+		dma_map.size = len;
+		dma_map.iova = iova;
+		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+
+		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+		if (ret) {
+			RTE_LOG(ERR, EAL, "  cannot set up DMA remapping, error %i (%s)\n",
+				errno, strerror(errno));
+				return -1;
+		}
+
+	} else {
+		memset(&dma_unmap, 0, sizeof(dma_unmap));
+		dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
+		dma_unmap.size = len;
+		dma_unmap.iova = iova;
+
+		ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
+		if (ret) {
+			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
+					errno, strerror(errno));
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
+static int
 vfio_type1_dma_map(int vfio_container_fd)
 {
-	int i, ret;
+	int i;
 
 	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
 	for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
-		struct vfio_iommu_type1_dma_map dma_map;
 		const struct rte_memseg_list *msl;
 		const struct rte_fbarray *arr;
 		int ms_idx, next_idx;
@@ -727,21 +788,9 @@ vfio_type1_dma_map(int vfio_container_fd)
 			len = ms->hugepage_sz;
 			hw_addr = ms->iova;
 
-			memset(&dma_map, 0, sizeof(dma_map));
-			dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
-			dma_map.vaddr = addr;
-			dma_map.size = len;
-			dma_map.iova = hw_addr;
-			dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
-
-			ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
-
-			if (ret) {
-				RTE_LOG(ERR, EAL, "  cannot set up DMA remapping, "
-						  "error %i (%s)\n", errno,
-						  strerror(errno));
+			if (vfio_type1_dma_mem_map(vfio_container_fd, addr,
+					hw_addr, len, 1))
 				return -1;
-			}
 		}
 	}
 
@@ -892,6 +941,49 @@ vfio_noiommu_dma_map(int __rte_unused vfio_container_fd)
 	return 0;
 }
 
+static int
+vfio_noiommu_dma_mem_map(int __rte_unused vfio_container_fd,
+			 uint64_t __rte_unused vaddr,
+			 uint64_t __rte_unused iova, uint64_t __rte_unused len,
+			 int __rte_unused do_map)
+{
+	/* No-IOMMU mode does not need DMA mapping */
+	return 0;
+}
+
+static int
+vfio_dma_mem_map(uint64_t vaddr, uint64_t iova, uint64_t len, int do_map)
+{
+	const struct vfio_iommu_type *t = vfio_cfg.vfio_iommu_type;
+
+	if (!t) {
+		RTE_LOG(ERR, EAL, "  VFIO support not initialized\n");
+		return -1;
+	}
+
+	if (!t->dma_user_map_func) {
+		RTE_LOG(ERR, EAL,
+			"  VFIO custom DMA region mapping not supported by IOMMU %s\n",
+			t->name);
+		return -1;
+	}
+
+	return t->dma_user_map_func(vfio_cfg.vfio_container_fd, vaddr, iova,
+			len, do_map);
+}
+
+int
+rte_iommu_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
+{
+	return vfio_dma_mem_map(vaddr, iova, len, 1);
+}
+
+int
+rte_iommu_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
+{
+	return vfio_dma_mem_map(vaddr, iova, len, 0);
+}
+
 int
 rte_vfio_noiommu_is_enabled(void)
 {
@@ -911,4 +1003,20 @@ rte_vfio_noiommu_is_enabled(void)
 	return ret;
 }
 
+#else
+
+int
+rte_iommu_dma_map(uint64_t __rte_unused vaddr, __rte_unused uint64_t iova,
+		  __rte_unused uint64_t len)
+{
+	return 0;
+}
+
+int
+rte_iommu_dma_unmap(uint64_t __rte_unused vaddr, uint64_t __rte_unused iova,
+		    __rte_unused uint64_t len)
+{
+	return 0;
+}
+
 #endif
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.h b/lib/librte_eal/linuxapp/eal/eal_vfio.h
index ba7892b..bb669f0 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.h
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.h
@@ -48,6 +48,7 @@
 
 #ifdef VFIO_PRESENT
 
+#include <stdint.h>
 #include <linux/vfio.h>
 
 #define RTE_VFIO_TYPE1 VFIO_TYPE1_IOMMU
@@ -139,6 +140,7 @@ struct vfio_config {
 	int vfio_enabled;
 	int vfio_container_fd;
 	int vfio_active_groups;
+	const struct vfio_iommu_type *vfio_iommu_type;
 	struct vfio_group vfio_groups[VFIO_MAX_GROUPS];
 };
 
@@ -148,9 +150,18 @@ struct vfio_config {
  * */
 typedef int (*vfio_dma_func_t)(int);
 
+/* Custom memory region DMA mapping function prototype.
+ * Takes VFIO container fd, virtual address, physical address (IOVA), length
+ * and operation type (0 to unmap, 1 to map) as parameters.
+ * Returns 0 on success, -1 on error.
+ */
+typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova,
+		uint64_t len, int do_map);
+
 struct vfio_iommu_type {
 	int type_id;
 	const char *name;
+	vfio_dma_user_func_t dma_user_map_func;
 	vfio_dma_func_t dma_map_func;
 };
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [dpdk-dev] [RFC v2 23/23] eal: map/unmap memory with VFIO when alloc/free pages
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (21 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 22/23] vfio: allow to map other memory regions Anatoly Burakov
@ 2017-12-19 11:14 ` Anatoly Burakov
  2017-12-19 15:46 ` [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Stephen Hemminger
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 46+ messages in thread
From: Anatoly Burakov @ 2017-12-19 11:14 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/linuxapp/eal/eal_memalloc.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
index 13172a0..8b3f219 100755
--- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
@@ -61,6 +61,7 @@
 #include <rte_eal_memconfig.h>
 #include <rte_eal.h>
 #include <rte_memory.h>
+#include <rte_iommu.h>
 
 #include "eal_filesystem.h"
 #include "eal_internal_cfg.h"
@@ -259,6 +260,11 @@ alloc_page(struct rte_memseg *ms, void *addr, uint64_t size, int socket_id,
 	ms->iova = iova;
 	ms->socket_id = socket_id;
 
+	/* map the segment so that VFIO has access to it */
+	if (rte_iommu_dma_map(ms->addr_64, iova, size)) {
+		RTE_LOG(DEBUG, EAL, "Cannot register segment with VFIO\n");
+	}
+
 	goto out;
 
 mapped:
@@ -295,6 +301,11 @@ free_page(struct rte_memseg *ms, struct hugepage_info *hi, unsigned list_idx,
 				list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);
 	}
 
+	/* unmap the segment from VFIO */
+	if (rte_iommu_dma_unmap(ms->addr_64, ms->iova, ms->len)) {
+		RTE_LOG(DEBUG, EAL, "Cannot unregister segment with VFIO\n");
+	}
+
 	munmap(ms->addr, ms->hugepage_sz);
 
 	// TODO: race condition?
-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (22 preceding siblings ...)
  2017-12-19 11:14 ` [dpdk-dev] [RFC v2 23/23] eal: map/unmap memory with VFIO when alloc/free pages Anatoly Burakov
@ 2017-12-19 15:46 ` Stephen Hemminger
  2017-12-19 16:02   ` Burakov, Anatoly
  2017-12-21 21:38 ` Walker, Benjamin
                   ` (3 subsequent siblings)
  27 siblings, 1 reply; 46+ messages in thread
From: Stephen Hemminger @ 2017-12-19 15:46 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dev, andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

On Tue, 19 Dec 2017 11:14:27 +0000
Anatoly Burakov <anatoly.burakov@intel.com> wrote:

> This patchset introduces a prototype implementation of dynamic memory allocation
> for DPDK. It is intended to start a conversation and build consensus on the best
> way to implement this functionality. The patchset works well enough to pass all
> unit tests, and to work with traffic forwarding, provided the device drivers are
> adjusted to ensure contiguous memory allocation where it matters.


What exact functionality is this patchset trying to enable.
It isn't clear what is broken now. Is it a cleanup or something that
is being motivated by memory layout issues?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2017-12-19 15:46 ` [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Stephen Hemminger
@ 2017-12-19 16:02   ` Burakov, Anatoly
  2017-12-19 16:06     ` Stephen Hemminger
  0 siblings, 1 reply; 46+ messages in thread
From: Burakov, Anatoly @ 2017-12-19 16:02 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

On 19-Dec-17 3:46 PM, Stephen Hemminger wrote:
> On Tue, 19 Dec 2017 11:14:27 +0000
> Anatoly Burakov <anatoly.burakov@intel.com> wrote:
> 
>> This patchset introduces a prototype implementation of dynamic memory allocation
>> for DPDK. It is intended to start a conversation and build consensus on the best
>> way to implement this functionality. The patchset works well enough to pass all
>> unit tests, and to work with traffic forwarding, provided the device drivers are
>> adjusted to ensure contiguous memory allocation where it matters.
> 
> 
> What exact functionality is this patchset trying to enable.
> It isn't clear what is broken now. Is it a cleanup or something that
> is being motivated by memory layout issues?
> 

Hi Stephen,

Apologies for not making that clear enough in the cover letter.

The big issue this patchset is trying to solve is the static-ness of 
DPDK's memory allocation. I.e. you reserve memory on startup, and that's 
it - you can't allocate any more memory from the system, and you can't 
free it back without stopping the application.

With this patchset, you can do exactly that. You can basically start 
with zero memory preallocated, and allocate (and free) as you go. For 
example, if you apply this patchset and run malloc autotest, after 
startup you will have used perhaps a single 2MB page. While the test is 
running, you are going to allocate something to the tune of 14MB per 
socket, and at the end you're back at eating 2MB of hugepage memory, 
while all of the memory you used for autotest will be freed back to the 
system. That's the main use case this patchset is trying to address.

Down the line, there are other issues to be solved, which are outlined 
in the cover letter (the aforementioned "discussion points"), but for 
this iteration, dynamic allocation/free of DPDK memory is the one issue 
that is being addressed.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2017-12-19 16:02   ` Burakov, Anatoly
@ 2017-12-19 16:06     ` Stephen Hemminger
  2017-12-19 16:09       ` Burakov, Anatoly
  0 siblings, 1 reply; 46+ messages in thread
From: Stephen Hemminger @ 2017-12-19 16:06 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

On Tue, 19 Dec 2017 16:02:51 +0000
"Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:

> On 19-Dec-17 3:46 PM, Stephen Hemminger wrote:
> > On Tue, 19 Dec 2017 11:14:27 +0000
> > Anatoly Burakov <anatoly.burakov@intel.com> wrote:
> >   
> >> This patchset introduces a prototype implementation of dynamic memory allocation
> >> for DPDK. It is intended to start a conversation and build consensus on the best
> >> way to implement this functionality. The patchset works well enough to pass all
> >> unit tests, and to work with traffic forwarding, provided the device drivers are
> >> adjusted to ensure contiguous memory allocation where it matters.  
> > 
> > 
> > What exact functionality is this patchset trying to enable.
> > It isn't clear what is broken now. Is it a cleanup or something that
> > is being motivated by memory layout issues?
> >   
> 
> Hi Stephen,
> 
> Apologies for not making that clear enough in the cover letter.
> 
> The big issue this patchset is trying to solve is the static-ness of 
> DPDK's memory allocation. I.e. you reserve memory on startup, and that's 
> it - you can't allocate any more memory from the system, and you can't 
> free it back without stopping the application.
> 
> With this patchset, you can do exactly that. You can basically start 
> with zero memory preallocated, and allocate (and free) as you go. For 
> example, if you apply this patchset and run malloc autotest, after 
> startup you will have used perhaps a single 2MB page. While the test is 
> running, you are going to allocate something to the tune of 14MB per 
> socket, and at the end you're back at eating 2MB of hugepage memory, 
> while all of the memory you used for autotest will be freed back to the 
> system. That's the main use case this patchset is trying to address.
> 
> Down the line, there are other issues to be solved, which are outlined 
> in the cover letter (the aforementioned "discussion points"), but for 
> this iteration, dynamic allocation/free of DPDK memory is the one issue 
> that is being addressed.
> 

Ok, maybe name it "memory hot add/remove" since dynamic memory allocation
to me implies redoing malloc.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2017-12-19 16:06     ` Stephen Hemminger
@ 2017-12-19 16:09       ` Burakov, Anatoly
  0 siblings, 0 replies; 46+ messages in thread
From: Burakov, Anatoly @ 2017-12-19 16:09 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas

On 19-Dec-17 4:06 PM, Stephen Hemminger wrote:
> On Tue, 19 Dec 2017 16:02:51 +0000
> "Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:
> 
>> On 19-Dec-17 3:46 PM, Stephen Hemminger wrote:
>>> On Tue, 19 Dec 2017 11:14:27 +0000
>>> Anatoly Burakov <anatoly.burakov@intel.com> wrote:
>>>    
>>>> This patchset introduces a prototype implementation of dynamic memory allocation
>>>> for DPDK. It is intended to start a conversation and build consensus on the best
>>>> way to implement this functionality. The patchset works well enough to pass all
>>>> unit tests, and to work with traffic forwarding, provided the device drivers are
>>>> adjusted to ensure contiguous memory allocation where it matters.
>>>
>>>
>>> What exact functionality is this patchset trying to enable.
>>> It isn't clear what is broken now. Is it a cleanup or something that
>>> is being motivated by memory layout issues?
>>>    
>>
>> Hi Stephen,
>>
>> Apologies for not making that clear enough in the cover letter.
>>
>> The big issue this patchset is trying to solve is the static-ness of
>> DPDK's memory allocation. I.e. you reserve memory on startup, and that's
>> it - you can't allocate any more memory from the system, and you can't
>> free it back without stopping the application.
>>
>> With this patchset, you can do exactly that. You can basically start
>> with zero memory preallocated, and allocate (and free) as you go. For
>> example, if you apply this patchset and run malloc autotest, after
>> startup you will have used perhaps a single 2MB page. While the test is
>> running, you are going to allocate something to the tune of 14MB per
>> socket, and at the end you're back at eating 2MB of hugepage memory,
>> while all of the memory you used for autotest will be freed back to the
>> system. That's the main use case this patchset is trying to address.
>>
>> Down the line, there are other issues to be solved, which are outlined
>> in the cover letter (the aforementioned "discussion points"), but for
>> this iteration, dynamic allocation/free of DPDK memory is the one issue
>> that is being addressed.
>>
> 
> Ok, maybe name it "memory hot add/remove" since dynamic memory allocation
> to me implies redoing malloc.
> 

Well, it _kind of_ redoes malloc in the process as we need to handle 
holes in the address space, but sure, something like "memory hotplug" 
would've perhaps been a more suitable name. Thanks for your feedback, it 
will certainly be taken into account for when an inevitable v1 comes :)

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (23 preceding siblings ...)
  2017-12-19 15:46 ` [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Stephen Hemminger
@ 2017-12-21 21:38 ` Walker, Benjamin
  2017-12-22  9:13   ` Burakov, Anatoly
  2018-01-13 14:13 ` Burakov, Anatoly
                   ` (2 subsequent siblings)
  27 siblings, 1 reply; 46+ messages in thread
From: Walker, Benjamin @ 2017-12-21 21:38 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: thomas, andras.kovacs, Wiles, Keith, Richardson, Bruce

On Tue, 2017-12-19 at 11:14 +0000, Anatoly Burakov wrote:
> 

> Quick outline of all changes done as part of this patchset:
> 
>  * Malloc heap adjusted to handle holes in address space
>  * Single memseg list replaced by multiple expandable memseg lists
>  * VA space for hugepages is preallocated in advance
>  * Added dynamic alloc/free for pages, happening as needed on malloc/free

SPDK will need some way to register for a notification when pages are allocated
or freed. For storage, the number of requests per second is (relative to
networking) fairly small (hundreds of thousands per second in a traditional
block storage stack, or a few million per second with SPDK). Given that, we can
afford to do a dynamic lookup from va to pa/iova on each request in order to
greatly simplify our APIs (users can just pass pointers around instead of
mbufs). DPDK has a way to lookup the pa from a given va, but it does so by
scanning /proc/self/pagemap and is very slow. SPDK instead handles this by
implementing a lookup table of va to pa/iova which we populate by scanning
through the DPDK memory segments at start up, so the lookup in our table is
sufficiently fast for storage use cases. If the list of memory segments changes,
we need to know about it in order to update our map.
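
(For illustration only, a much-simplified sketch of such a table, using a
flat array at 2MB granularity; SPDK's actual implementation differs and is
not shown here:)

#include <stdint.h>

#define SHIFT_2MB 21
#define MASK_2MB  ((1ULL << SHIFT_2MB) - 1)

/* flat table covering the first 512 GB of VA space, one entry per 2MB page */
static uint64_t va2iova[1ULL << (39 - SHIFT_2MB)];

/* called for each 2MB page when a memory region is registered */
static void
map_2mb_page(uint64_t page_va, uint64_t page_iova)
{
	va2iova[page_va >> SHIFT_2MB] = page_iova;
}

/* fast-path translation of an arbitrary pointer within a mapped page */
static uint64_t
translate(const void *p)
{
	uint64_t va = (uint64_t)(uintptr_t)p;

	return va2iova[va >> SHIFT_2MB] + (va & MASK_2MB);
}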

Having the map also enables a number of other nice things - for instance we
allow users to register memory that wasn't allocated through DPDK and use it for
DMA operations. We keep that va to pa/iova mapping in the same map. I appreciate
you adding APIs to dynamically register this type of memory with the IOMMU on
our behalf. That allows us to eliminate a nasty hack where we were looking up
the vfio file descriptor through sysfs in order to send the registration ioctl.

>  * Added contiguous memory allocation API's for rte_malloc and rte_memzone
>  * Integrated Pawel Wodkowski's patch [1] for registering/unregistering memory
>    with VFIO
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2017-12-21 21:38 ` Walker, Benjamin
@ 2017-12-22  9:13   ` Burakov, Anatoly
  2017-12-26 17:19     ` Walker, Benjamin
  0 siblings, 1 reply; 46+ messages in thread
From: Burakov, Anatoly @ 2017-12-22  9:13 UTC (permalink / raw)
  To: Walker, Benjamin, dev
  Cc: thomas, andras.kovacs, Wiles, Keith, Richardson, Bruce

On 21-Dec-17 9:38 PM, Walker, Benjamin wrote:
> On Tue, 2017-12-19 at 11:14 +0000, Anatoly Burakov wrote:
>>
> 
>> Quick outline of all changes done as part of this patchset:
>>
>>   * Malloc heap adjusted to handle holes in address space
>>   * Single memseg list replaced by multiple expandable memseg lists
>>   * VA space for hugepages is preallocated in advance
>>   * Added dynamic alloc/free for pages, happening as needed on malloc/free
> 
> SPDK will need some way to register for a notification when pages are allocated
> or freed. For storage, the number of requests per second is (relative to
> networking) fairly small (hundreds of thousands per second in a traditional
> block storage stack, or a few million per second with SPDK). Given that, we can
> afford to do a dynamic lookup from va to pa/iova on each request in order to
> greatly simplify our APIs (users can just pass pointers around instead of
> mbufs). DPDK has a way to lookup the pa from a given va, but it does so by
> scanning /proc/self/pagemap and is very slow. SPDK instead handles this by
> implementing a lookup table of va to pa/iova which we populate by scanning
> through the DPDK memory segments at start up, so the lookup in our table is
> sufficiently fast for storage use cases. If the list of memory segments changes,
> we need to know about it in order to update our map.

Hi Benjamin,

So, in other words, we need callbacks on alloc/free. What information 
would SPDK need when receiving this notification? Since we can't really 
know in advance how many pages we allocate (it may be one, it may be a 
thousand) and they no longer are guaranteed to be contiguous, would a 
per-page callback be OK? Alternatively, we could have one callback per 
operation, but only provide VA and size of allocated memory, while 
leaving everything else to the user. I do add a virt2memseg() function 
which would allow you to look up segment physical addresses easier, so
you won't have to manually scan memseg lists to get IOVA for a given VA.
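
(Purely as a strawman, since none of these names exist in the patchset, a
per-operation notification could look something like this:)

#include <stddef.h>

/* hypothetical notification hook, sketched for discussion only */
enum mem_event {
	MEM_EVENT_ALLOC,
	MEM_EVENT_FREE,
};

typedef void (*mem_event_cb_t)(enum mem_event event, const void *addr,
		size_t len, void *arg);

/* invoked once per alloc/free operation with the VA and total length of
 * the affected region; per-page details could then be queried from the
 * callback via virt2memseg()
 */
int mem_event_callback_register(mem_event_cb_t cb, void *arg);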

Thanks for your feedback and suggestions!

> 
> Having the map also enables a number of other nice things - for instance we
> allow users to register memory that wasn't allocated through DPDK and use it for
> DMA operations. We keep that va to pa/iova mapping in the same map. I appreciate
> you adding APIs to dynamically register this type of memory with the IOMMU on
> our behalf. That allows us to eliminate a nasty hack where we were looking up
> the vfio file descriptor through sysfs in order to send the registration ioctl.
> 
>>   * Added contiguous memory allocation API's for rte_malloc and rte_memzone
>>   * Integrated Pawel Wodkowski's patch [1] for registering/unregistering memory
>>     with VFIO


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2017-12-22  9:13   ` Burakov, Anatoly
@ 2017-12-26 17:19     ` Walker, Benjamin
  2018-02-02 19:28       ` Yongseok Koh
  0 siblings, 1 reply; 46+ messages in thread
From: Walker, Benjamin @ 2017-12-26 17:19 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: thomas, andras.kovacs, Wiles, Keith, Richardson, Bruce

On Fri, 2017-12-22 at 09:13 +0000, Burakov, Anatoly wrote:
> On 21-Dec-17 9:38 PM, Walker, Benjamin wrote:
> > SPDK will need some way to register for a notification when pages are
> > allocated
> > or freed. For storage, the number of requests per second is (relative to
> > networking) fairly small (hundreds of thousands per second in a traditional
> > block storage stack, or a few million per second with SPDK). Given that, we
> > can
> > afford to do a dynamic lookup from va to pa/iova on each request in order to
> > greatly simplify our APIs (users can just pass pointers around instead of
> > mbufs). DPDK has a way to lookup the pa from a given va, but it does so by
> > scanning /proc/self/pagemap and is very slow. SPDK instead handles this by
> > implementing a lookup table of va to pa/iova which we populate by scanning
> > through the DPDK memory segments at start up, so the lookup in our table is
> > sufficiently fast for storage use cases. If the list of memory segments
> > changes,
> > we need to know about it in order to update our map.
> 
> Hi Benjamin,
> 
> So, in other words, we need callbacks on alloc/free. What information 
> would SPDK need when receiving this notification? Since we can't really 
> know in advance how many pages we allocate (it may be one, it may be a 
> thousand) and they no longer are guaranteed to be contiguous, would a 
> per-page callback be OK? Alternatively, we could have one callback per 
> operation, but only provide VA and size of allocated memory, while 
> leaving everything else to the user. I do add a virt2memseg() function 
> which would allow you to look up segment physical addresses easier, so
> you won't have to manually scan memseg lists to get IOVA for a given VA.
> 
> Thanks for your feedback and suggestions!

Yes - callbacks on alloc/free would be perfect. Ideally for us we want one
callback per virtual memory region allocated, plus a function we can call to
find the physical addresses/page break points on that virtual region. The
function that finds the physical addresses does not have to be efficient - we'll
just call that once when the new region is allocated and store the results in a
fast lookup table. One call per virtual region is better for us than one call
per physical page because we're actually keeping multiple different types of
memory address translation tables in SPDK. One translates from va to pa/iova, so
for this one we need to break this up into physical pages and it doesn't matter
if you do one call per virtual region or one per physical page. However another
one translates from va to RDMA lkey, so it is much more efficient if we can
register large virtual regions in a single call.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (24 preceding siblings ...)
  2017-12-21 21:38 ` Walker, Benjamin
@ 2018-01-13 14:13 ` Burakov, Anatoly
  2018-01-23 22:33 ` Yongseok Koh
  2018-02-14  8:04 ` Thomas Monjalon
  27 siblings, 0 replies; 46+ messages in thread
From: Burakov, Anatoly @ 2018-01-13 14:13 UTC (permalink / raw)
  To: dev
  Cc: andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, thomas, techboard, jerin.jacob, rosenbaumalex,
	Ananyev, Konstantin, ferruh.yigit

On 19-Dec-17 11:14 AM, Anatoly Burakov wrote:
> This patchset introduces a prototype implementation of dynamic memory allocation
> for DPDK. It is intended to start a conversation and build consensus on the best
> way to implement this functionality. The patchset works well enough to pass all
> unit tests, and to work with traffic forwarding, provided the device drivers are
> adjusted to ensure contiguous memory allocation where it matters.
> 
> The vast majority of changes are in the EAL and malloc, the external API
> disruption is minimal: a new set of API's are added for contiguous memory
> allocation (for rte_malloc and rte_memzone), and a few API additions in
> rte_memory. Every other API change is internal to EAL, and all of the memory
> allocation/freeing is handled through rte_malloc, with no externally visible
> API changes, aside from a call to get physmem layout, which no longer makes
> sense given that there are multiple memseg lists.
> 
> Quick outline of all changes done as part of this patchset:
> 
>   * Malloc heap adjusted to handle holes in address space
>   * Single memseg list replaced by multiple expandable memseg lists
>   * VA space for hugepages is preallocated in advance
>   * Added dynamic alloc/free for pages, happening as needed on malloc/free
>   * Added contiguous memory allocation API's for rte_malloc and rte_memzone
>   * Integrated Pawel Wodkowski's patch [1] for registering/unregistering memory
>     with VFIO
> 
> The biggest difference is a "memseg" now represents a single page (as opposed to
> being a big contiguous block of pages). As a consequence, both memzones and
> malloc elements are no longer guaranteed to be physically contiguous, unless
> the user asks for it. To preserve whatever functionality that was dependent
> on previous behavior, a legacy memory option is also provided, however it is
> expected to be temporary solution. The drivers weren't adjusted in this patchset,
> and it is expected that whoever shall test the drivers with this patchset will
> modify their relevant drivers to support the new set of API's. Basic testing
> with forwarding traffic was performed, both with UIO and VFIO, and no performance
> degradation was observed.
> 
> Why multiple memseg lists instead of one? It makes things easier on a number of
> fronts. Since memseg is a single page now, the list will get quite big, and we
> need to locate pages somehow when we allocate and free them. We could of course
> just walk the list and allocate one contiguous chunk of VA space for memsegs,
> but i chose to use separate lists instead, to speed up many operations with the
> list.
> 
> It would be great to see the following discussions within the community regarding
> both current implementation and future work:
> 
>   * Any suggestions to improve current implementation. The whole system with
>     multiple memseg lists is kind of unweildy, so maybe there are better ways to
>     do the same thing. Maybe use a single list after all? We're not expecting
>     malloc/free on hot path, so maybe it doesn't matter that we have to walk
>     the list of potentially thousands of pages?
>   * Pluggable memory allocators. Right now, allocators are hardcoded, but down
>     the line it would be great to have custom allocators (e.g. for externally
>     allocated memory). I've tried to keep the memalloc API minimal and generic
>     enough to be able to easily change it down the line, but suggestions are
>     welcome. Memory drivers, with ops for alloc/free etc.?
>   * Memory tagging. This is related to previous item. Right now, we can only ask
>     malloc to allocate memory by page size, but one could potentially have
>     different memory regions backed by pages of similar sizes (for example,
>     locked 1G pages, to completely avoid TLB misses, alongside regular 1G pages),
>     and it would be good to have that kind of mechanism to distinguish between
>     different memory types available to a DPDK application. One could, for example,
>     tag memory by "purpose" (i.e. "fast", "slow"), or in other ways.
>   * Secondary process implementation, in particular when it comes to allocating/
>     freeing new memory. Current plan is to make use of RPC mechanism proposed by
>     Jianfeng [2] to communicate between primary and secondary processes, however
>     other suggestions are welcome.
>   * Support for non-hugepage memory. This work is planned down the line. Aside
>     from obvious concerns about physical addresses, 4K pages are small and will
>     eat up enormous amounts of memseg list space, so my proposal would be to
>     allocate 4K pages in bigger blocks (say, 2MB).
>   * 32-bit support. Current implementation lacks it, and i don't see a trivial
>     way to make it work if we are to preallocate huge chunks of VA space in
>     advance. We could limit it to 1G per page size, but even that, on multiple
>     sockets, won't work that well, and we can't know in advance what kind of
>     memory user will try to allocate. Drop it? Leave it in legacy mode only?
>   * Preallocation. Right now, malloc will free any and all memory that it can,
>     which could lead to a (perhaps counterintuitive) situation where a user
>     calls DPDK with --socket-mem=1024,1024, does a single "rte_free" and loses
>     all of the preallocated memory in the process. Would preallocating memory
>     *and keeping it no matter what* be a valid use case? E.g. if DPDK was run
>     without any memory requirements specified, grow and shrink as needed, but
>     DPDK was asked to preallocate memory, we can grow but we can't shrink
>     past the preallocated amount?
> 
> Any other feedback about things i didn't think of or missed is greatly
> appreciated.
> 
> [1] http://dpdk.org/dev/patchwork/patch/24484/
> [2] http://dpdk.org/dev/patchwork/patch/31838/
> 
Hi all,

Could this proposal be discussed at the next tech board meeting?

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (25 preceding siblings ...)
  2018-01-13 14:13 ` Burakov, Anatoly
@ 2018-01-23 22:33 ` Yongseok Koh
  2018-01-25 16:18   ` Burakov, Anatoly
  2018-02-14  8:04 ` Thomas Monjalon
  27 siblings, 1 reply; 46+ messages in thread
From: Yongseok Koh @ 2018-01-23 22:33 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dev, andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, Thomas Monjalon


> On Dec 19, 2017, at 3:14 AM, Anatoly Burakov <anatoly.burakov@intel.com> wrote:
[...]
> Quick outline of all changes done as part of this patchset:
> 
> * Malloc heap adjusted to handle holes in address space
> * Single memseg list replaced by multiple expandable memseg lists
> * VA space for hugepages is preallocated in advance

Hi Anatoly,

I haven't looked through your patchset yet, but a quick question. As far as I
understand, currently EAL remaps virtual addresses to make the VA layout match the PA
layout. I'm not sure my expression is 100% correct.

By your comment above, do you mean VA space for all available physical memory
will always be contiguous?

I have been curious about why VA space is fragmented in DPDK.

Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2018-01-23 22:33 ` Yongseok Koh
@ 2018-01-25 16:18   ` Burakov, Anatoly
  0 siblings, 0 replies; 46+ messages in thread
From: Burakov, Anatoly @ 2018-01-25 16:18 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: dev, andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, Thomas Monjalon

On 23-Jan-18 10:33 PM, Yongseok Koh wrote:
> 
>> On Dec 19, 2017, at 3:14 AM, Anatoly Burakov <anatoly.burakov@intel.com> wrote:
> [...]
>> Quick outline of all changes done as part of this patchset:
>>
>> * Malloc heap adjusted to handle holes in address space
>> * Single memseg list replaced by multiple expandable memseg lists
>> * VA space for hugepages is preallocated in advance
> 
> Hi Anatoly,
> 
> I haven't looked through your patchset yet but quick question.  As far as I
> understand, currently EAL remaps virtual addresses to make VA layout matches PA
> layout. I'm not sure my expression is 100% correct.
> 
> By your comment above, do you mean VA space for all available physical memory
> will always be contiguous?
> 
> I have been curious about why VA space is fragmented in DPDK.
> 
> Thanks,
> Yongseok
> 
> 

Hi Yongseok,

Yes and no. Currently, VA space is allocated opportunistically - EAL 
tries to allocate VA space to match PA space layout, but due to varying 
page sizes and contiguous segment sizes, it doesn't always turn out that 
way - hence possible VA fragmentation even if underlying PA space may be 
contiguous.

With this patchset, we kind of do it the other way around - we allocate 
a contiguous VA space segment per socket, per page size, and then we map 
physical memory into it, without regard for PA layout whatsoever. So, 
assuming all VA space is mapped, VA space is contiguous while PA space 
may or may not be (depending on underlying physical memory layout and if 
you're using IOMMU).

However, since this is hotpluggable (and hot-unpluggable) memory, we can 
have holes in *mapped* VA space, even though *allocated* VA space would 
be contiguous within the boundaries of page size. You can think of it 
this way - we preallocate segments per page size, per socket, so e.g. on 
a machine with 2 sockets and both 2MB and 1GB pages enabled, you'll get 4 
contiguous chunks of VA space that are available for mapping. Underlying 
mappings may or may not be contiguous (i.e. mapped segments might become 
fragmented if you allocate/free memory all over the place), but they will 
reside within the same contiguous chunk of VA space.
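
To illustrate the "reserved chunk per (socket, page size)" idea in code, a
minimal sketch follows; the names (va_chunk, va_to_chunk) are hypothetical
and are not the actual memseg-list structures used by the patchset:

#include <stdint.h>
#include <stddef.h>

/* one preallocated, VA-contiguous chunk per (socket, page size) pair;
 * pages are mapped into it (or unmapped from it) on demand */
struct va_chunk {
	void *base;       /* start of the reserved VA area */
	size_t len;       /* total reserved length */
	uint64_t page_sz; /* 2MB or 1GB in the example above */
	int socket_id;
};

/* 2 sockets x 2 page sizes -> 4 contiguous VA chunks, as in the example */
#define NUM_CHUNKS 4
static struct va_chunk chunks[NUM_CHUNKS];

/* any DPDK-managed VA falls into exactly one reserved chunk,
 * whether or not a page is currently mapped at that address */
static const struct va_chunk *
va_to_chunk(const void *va)
{
	int i;

	for (i = 0; i < NUM_CHUNKS; i++) {
		const char *start = chunks[i].base;

		if ((const char *)va >= start &&
		    (const char *)va < start + chunks[i].len)
			return &chunks[i];
	}
	return NULL; /* not DPDK-managed memory */
}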

Hope that answers your question :)

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2017-12-26 17:19     ` Walker, Benjamin
@ 2018-02-02 19:28       ` Yongseok Koh
  2018-02-05 10:03         ` Burakov, Anatoly
  0 siblings, 1 reply; 46+ messages in thread
From: Yongseok Koh @ 2018-02-02 19:28 UTC (permalink / raw)
  To: Walker, Benjamin
  Cc: Burakov, Anatoly, dev, thomas, andras.kovacs, Wiles, Keith,
	Richardson, Bruce

On Tue, Dec 26, 2017 at 05:19:25PM +0000, Walker, Benjamin wrote:
> On Fri, 2017-12-22 at 09:13 +0000, Burakov, Anatoly wrote:
> > On 21-Dec-17 9:38 PM, Walker, Benjamin wrote:
> > > SPDK will need some way to register for a notification when pages are
> > > allocated
> > > or freed. For storage, the number of requests per second is (relative to
> > > networking) fairly small (hundreds of thousands per second in a traditional
> > > block storage stack, or a few million per second with SPDK). Given that, we
> > > can
> > > afford to do a dynamic lookup from va to pa/iova on each request in order to
> > > greatly simplify our APIs (users can just pass pointers around instead of
> > > mbufs). DPDK has a way to lookup the pa from a given va, but it does so by
> > > scanning /proc/self/pagemap and is very slow. SPDK instead handles this by
> > > implementing a lookup table of va to pa/iova which we populate by scanning
> > > through the DPDK memory segments at start up, so the lookup in our table is
> > > sufficiently fast for storage use cases. If the list of memory segments
> > > changes,
> > > we need to know about it in order to update our map.
> > 
> > Hi Benjamin,
> > 
> > So, in other words, we need callbacks on alloa/free. What information 
> > would SPDK need when receiving this notification? Since we can't really 
> > know in advance how many pages we allocate (it may be one, it may be a 
> > thousand) and they no longer are guaranteed to be contiguous, would a 
> > per-page callback be OK? Alternatively, we could have one callback per 
> > operation, but only provide VA and size of allocated memory, while 
> > leaving everything else to the user. I do add a virt2memseg() function 
> > which would allow you to look up segment physical addresses easier, so
> > you won't have to manually scan memseg lists to get IOVA for a given VA.
> > 
> > Thanks for your feedback and suggestions!
> 
> Yes - callbacks on alloc/free would be perfect. Ideally for us we want one
> callback per virtual memory region allocated, plus a function we can call to
> find the physical addresses/page break points on that virtual region. The
> function that finds the physical addresses does not have to be efficient - we'll
> just call that once when the new region is allocated and store the results in a
> fast lookup table. One call per virtual region is better for us than one call
> per physical page because we're actually keeping multiple different types of
> memory address translation tables in SPDK. One translates from va to pa/iova, so
> for this one we need to break this up into physical pages and it doesn't matter
> if you do one call per virtual region or one per physical page. However another
> one translates from va to RDMA lkey, so it is much more efficient if we can
> register large virtual regions in a single call.

Another yes to callbacks. Like Benjamin mentioned about RDMA, the MLX PMD has to
look up an LKEY for each packet DMA. Let me briefly explain this for your
understanding. For security reasons, we don't allow an application to initiate a DMA
transaction with unknown random physical addresses. Instead, the va-to-pa mapping
(we call it a Memory Region) should be pre-registered, and the LKEY is the index of the
translation entry registered in the device. With the current static memory model, it
is easy to manage because the v-p mapping is unchanged over time. But if it becomes
dynamic, the MLX PMD should get notified of the event in order to register/un-register
the Memory Region.

For the MLX PMD, it is also enough to get one notification per allocation/free of a
virtual memory region. It doesn't necessarily have to be a per-page call like Benjamin
mentioned, because the PA of the region doesn't need to be contiguous for registration.
But it doesn't need to know the physical address of the region (I'm not saying
it is unnecessary, but just FYI :-).
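
As a purely illustrative sketch of what such a notification could drive - MR
registration on alloc, deregistration on free, and a per-packet LKEY lookup -
see below. The structures and functions are hypothetical stand-ins (a real
implementation would go through verbs, e.g. ibv_reg_mr()), not the actual
mlx4/mlx5 PMD code:

#include <stdint.h>
#include <stddef.h>

enum mem_event { MEM_EVENT_ALLOC, MEM_EVENT_FREE };

/* toy stand-in for a registered Memory Region */
struct mem_region { uintptr_t start; size_t len; uint32_t lkey; };

#define MAX_MR 64
static struct mem_region mr_table[MAX_MR];
static size_t mr_count;
static uint32_t next_lkey = 1;

/* one notification per virtual region is enough: the MR covers the whole
 * VA range regardless of whether its backing PA is contiguous */
static void
mem_event_handler(enum mem_event ev, const void *va, size_t len)
{
	size_t i;

	if (ev == MEM_EVENT_ALLOC) {
		if (mr_count < MAX_MR) {
			mr_table[mr_count].start = (uintptr_t)va;
			mr_table[mr_count].len = len;
			/* the device would hand back the real LKEY */
			mr_table[mr_count].lkey = next_lkey++;
			mr_count++;
		}
		return;
	}
	for (i = 0; i < mr_count; i++)
		if (mr_table[i].start == (uintptr_t)va && mr_table[i].len == len)
			mr_table[i].len = 0; /* deregistered; compaction omitted */
}

/* per-packet data path: find the LKEY covering a buffer's VA */
static uint32_t
lkey_lookup(const void *va)
{
	size_t i;

	for (i = 0; i < mr_count; i++)
		if ((uintptr_t)va >= mr_table[i].start &&
		    (uintptr_t)va < mr_table[i].start + mr_table[i].len)
			return mr_table[i].lkey;
	return UINT32_MAX; /* not registered */
}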

Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2018-02-02 19:28       ` Yongseok Koh
@ 2018-02-05 10:03         ` Burakov, Anatoly
  2018-02-05 10:18           ` Nélio Laranjeiro
  2018-02-14  2:01           ` Yongseok Koh
  0 siblings, 2 replies; 46+ messages in thread
From: Burakov, Anatoly @ 2018-02-05 10:03 UTC (permalink / raw)
  To: Yongseok Koh, Walker, Benjamin
  Cc: dev, thomas, andras.kovacs, Wiles, Keith, Richardson, Bruce

On 02-Feb-18 7:28 PM, Yongseok Koh wrote:
> On Tue, Dec 26, 2017 at 05:19:25PM +0000, Walker, Benjamin wrote:
>> On Fri, 2017-12-22 at 09:13 +0000, Burakov, Anatoly wrote:
>>> On 21-Dec-17 9:38 PM, Walker, Benjamin wrote:
>>>> SPDK will need some way to register for a notification when pages are
>>>> allocated
>>>> or freed. For storage, the number of requests per second is (relative to
>>>> networking) fairly small (hundreds of thousands per second in a traditional
>>>> block storage stack, or a few million per second with SPDK). Given that, we
>>>> can
>>>> afford to do a dynamic lookup from va to pa/iova on each request in order to
>>>> greatly simplify our APIs (users can just pass pointers around instead of
>>>> mbufs). DPDK has a way to lookup the pa from a given va, but it does so by
>>>> scanning /proc/self/pagemap and is very slow. SPDK instead handles this by
>>>> implementing a lookup table of va to pa/iova which we populate by scanning
>>>> through the DPDK memory segments at start up, so the lookup in our table is
>>>> sufficiently fast for storage use cases. If the list of memory segments
>>>> changes,
>>>> we need to know about it in order to update our map.
>>>
>>> Hi Benjamin,
>>>
>>> So, in other words, we need callbacks on alloa/free. What information
>>> would SPDK need when receiving this notification? Since we can't really
>>> know in advance how many pages we allocate (it may be one, it may be a
>>> thousand) and they no longer are guaranteed to be contiguous, would a
>>> per-page callback be OK? Alternatively, we could have one callback per
>>> operation, but only provide VA and size of allocated memory, while
>>> leaving everything else to the user. I do add a virt2memseg() function
>>> which would allow you to look up segment physical addresses easier, so
>>> you won't have to manually scan memseg lists to get IOVA for a given VA.
>>>
>>> Thanks for your feedback and suggestions!
>>
>> Yes - callbacks on alloc/free would be perfect. Ideally for us we want one
>> callback per virtual memory region allocated, plus a function we can call to
>> find the physical addresses/page break points on that virtual region. The
>> function that finds the physical addresses does not have to be efficient - we'll
>> just call that once when the new region is allocated and store the results in a
>> fast lookup table. One call per virtual region is better for us than one call
>> per physical page because we're actually keeping multiple different types of
>> memory address translation tables in SPDK. One translates from va to pa/iova, so
>> for this one we need to break this up into physical pages and it doesn't matter
>> if you do one call per virtual region or one per physical page. However another
>> one translates from va to RDMA lkey, so it is much more efficient if we can
>> register large virtual regions in a single call.
> 
> Another yes to callbacks. Like Benjamin mentioned about RDMA, MLX PMD has to
> look up LKEY per each packet DMA. Let me briefly explain about this for your
> understanding. For security reason, we don't allow application initiates a DMA
> transaction with unknown random physical addresses. Instead, va-to-pa mapping
> (we call it Memory Region) should be pre-registered and LKEY is the index of the
> translation entry registered in device. With the current static memory model, it
> is easy to manage because v-p mapping is unchanged over time. But if it becomes
> dynamic, MLX PMD should get notified with the event to register/un-regsiter
> Memory Region.
> 
> For MLX PMD, it is also enough to get one notification per allocation/free of a
> virutal memory region. It shouldn't necessarily be a per-page call like Benjamin
> mentioned because PA of region doesn't need to be contiguous for registration.
> But it doesn't need to know about physical address of the region (I'm not saying
> it is unnecessary, but just FYI :-).
> 
> Thanks,
> Yongseok
> 

Thanks for your feedback, good to hear we're on the right track. I 
already have a prototype implementation of this working, due for v1 
submission :)

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2018-02-05 10:03         ` Burakov, Anatoly
@ 2018-02-05 10:18           ` Nélio Laranjeiro
  2018-02-05 10:36             ` Burakov, Anatoly
  2018-02-14  2:01           ` Yongseok Koh
  1 sibling, 1 reply; 46+ messages in thread
From: Nélio Laranjeiro @ 2018-02-05 10:18 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Yongseok Koh, Walker, Benjamin, dev, thomas, andras.kovacs,
	Wiles, Keith, Richardson, Bruce

On Mon, Feb 05, 2018 at 10:03:35AM +0000, Burakov, Anatoly wrote:
> On 02-Feb-18 7:28 PM, Yongseok Koh wrote:
> > On Tue, Dec 26, 2017 at 05:19:25PM +0000, Walker, Benjamin wrote:
> > > On Fri, 2017-12-22 at 09:13 +0000, Burakov, Anatoly wrote:
> > > > On 21-Dec-17 9:38 PM, Walker, Benjamin wrote:
> > > > > SPDK will need some way to register for a notification when pages are
> > > > > allocated
> > > > > or freed. For storage, the number of requests per second is (relative to
> > > > > networking) fairly small (hundreds of thousands per second in a traditional
> > > > > block storage stack, or a few million per second with SPDK). Given that, we
> > > > > can
> > > > > afford to do a dynamic lookup from va to pa/iova on each request in order to
> > > > > greatly simplify our APIs (users can just pass pointers around instead of
> > > > > mbufs). DPDK has a way to lookup the pa from a given va, but it does so by
> > > > > scanning /proc/self/pagemap and is very slow. SPDK instead handles this by
> > > > > implementing a lookup table of va to pa/iova which we populate by scanning
> > > > > through the DPDK memory segments at start up, so the lookup in our table is
> > > > > sufficiently fast for storage use cases. If the list of memory segments
> > > > > changes,
> > > > > we need to know about it in order to update our map.
> > > > 
> > > > Hi Benjamin,
> > > > 
> > > > So, in other words, we need callbacks on alloa/free. What information
> > > > would SPDK need when receiving this notification? Since we can't really
> > > > know in advance how many pages we allocate (it may be one, it may be a
> > > > thousand) and they no longer are guaranteed to be contiguous, would a
> > > > per-page callback be OK? Alternatively, we could have one callback per
> > > > operation, but only provide VA and size of allocated memory, while
> > > > leaving everything else to the user. I do add a virt2memseg() function
> > > > which would allow you to look up segment physical addresses easier, so
> > > > you won't have to manually scan memseg lists to get IOVA for a given VA.
> > > > 
> > > > Thanks for your feedback and suggestions!
> > > 
> > > Yes - callbacks on alloc/free would be perfect. Ideally for us we want one
> > > callback per virtual memory region allocated, plus a function we can call to
> > > find the physical addresses/page break points on that virtual region. The
> > > function that finds the physical addresses does not have to be efficient - we'll
> > > just call that once when the new region is allocated and store the results in a
> > > fast lookup table. One call per virtual region is better for us than one call
> > > per physical page because we're actually keeping multiple different types of
> > > memory address translation tables in SPDK. One translates from va to pa/iova, so
> > > for this one we need to break this up into physical pages and it doesn't matter
> > > if you do one call per virtual region or one per physical page. However another
> > > one translates from va to RDMA lkey, so it is much more efficient if we can
> > > register large virtual regions in a single call.
> > 
> > Another yes to callbacks. Like Benjamin mentioned about RDMA, MLX PMD has to
> > look up LKEY per each packet DMA. Let me briefly explain about this for your
> > understanding. For security reason, we don't allow application initiates a DMA
> > transaction with unknown random physical addresses. Instead, va-to-pa mapping
> > (we call it Memory Region) should be pre-registered and LKEY is the index of the
> > translation entry registered in device. With the current static memory model, it
> > is easy to manage because v-p mapping is unchanged over time. But if it becomes
> > dynamic, MLX PMD should get notified with the event to register/un-regsiter
> > Memory Region.
> > 
> > For MLX PMD, it is also enough to get one notification per allocation/free of a
> > virutal memory region. It shouldn't necessarily be a per-page call like Benjamin
> > mentioned because PA of region doesn't need to be contiguous for registration.
> > But it doesn't need to know about physical address of the region (I'm not saying
> > it is unnecessary, but just FYI :-).
> > 
> > Thanks,
> > Yongseok
> > 
> 
> Thanks for your feedback, good to hear we're on the right track. I already
> have a prototype implementation of this working, due for v1 submission :)

Hi Anatoly,

Good to know.
Do you see some performances impact with this series?

Thanks,

-- 
Nélio Laranjeiro
6WIND

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2018-02-05 10:18           ` Nélio Laranjeiro
@ 2018-02-05 10:36             ` Burakov, Anatoly
  2018-02-06  9:10               ` Nélio Laranjeiro
  0 siblings, 1 reply; 46+ messages in thread
From: Burakov, Anatoly @ 2018-02-05 10:36 UTC (permalink / raw)
  To: Nélio Laranjeiro
  Cc: Yongseok Koh, Walker, Benjamin, dev, thomas, andras.kovacs,
	Wiles, Keith, Richardson, Bruce

On 05-Feb-18 10:18 AM, Nélio Laranjeiro wrote:
> On Mon, Feb 05, 2018 at 10:03:35AM +0000, Burakov, Anatoly wrote:
>> On 02-Feb-18 7:28 PM, Yongseok Koh wrote:
>>> On Tue, Dec 26, 2017 at 05:19:25PM +0000, Walker, Benjamin wrote:
>>>> On Fri, 2017-12-22 at 09:13 +0000, Burakov, Anatoly wrote:
>>>>> On 21-Dec-17 9:38 PM, Walker, Benjamin wrote:
>>>>>> SPDK will need some way to register for a notification when pages are
>>>>>> allocated
>>>>>> or freed. For storage, the number of requests per second is (relative to
>>>>>> networking) fairly small (hundreds of thousands per second in a traditional
>>>>>> block storage stack, or a few million per second with SPDK). Given that, we
>>>>>> can
>>>>>> afford to do a dynamic lookup from va to pa/iova on each request in order to
>>>>>> greatly simplify our APIs (users can just pass pointers around instead of
>>>>>> mbufs). DPDK has a way to lookup the pa from a given va, but it does so by
>>>>>> scanning /proc/self/pagemap and is very slow. SPDK instead handles this by
>>>>>> implementing a lookup table of va to pa/iova which we populate by scanning
>>>>>> through the DPDK memory segments at start up, so the lookup in our table is
>>>>>> sufficiently fast for storage use cases. If the list of memory segments
>>>>>> changes,
>>>>>> we need to know about it in order to update our map.
>>>>>
>>>>> Hi Benjamin,
>>>>>
>>>>> So, in other words, we need callbacks on alloa/free. What information
>>>>> would SPDK need when receiving this notification? Since we can't really
>>>>> know in advance how many pages we allocate (it may be one, it may be a
>>>>> thousand) and they no longer are guaranteed to be contiguous, would a
>>>>> per-page callback be OK? Alternatively, we could have one callback per
>>>>> operation, but only provide VA and size of allocated memory, while
>>>>> leaving everything else to the user. I do add a virt2memseg() function
>>>>> which would allow you to look up segment physical addresses easier, so
>>>>> you won't have to manually scan memseg lists to get IOVA for a given VA.
>>>>>
>>>>> Thanks for your feedback and suggestions!
>>>>
>>>> Yes - callbacks on alloc/free would be perfect. Ideally for us we want one
>>>> callback per virtual memory region allocated, plus a function we can call to
>>>> find the physical addresses/page break points on that virtual region. The
>>>> function that finds the physical addresses does not have to be efficient - we'll
>>>> just call that once when the new region is allocated and store the results in a
>>>> fast lookup table. One call per virtual region is better for us than one call
>>>> per physical page because we're actually keeping multiple different types of
>>>> memory address translation tables in SPDK. One translates from va to pa/iova, so
>>>> for this one we need to break this up into physical pages and it doesn't matter
>>>> if you do one call per virtual region or one per physical page. However another
>>>> one translates from va to RDMA lkey, so it is much more efficient if we can
>>>> register large virtual regions in a single call.
>>>
>>> Another yes to callbacks. Like Benjamin mentioned about RDMA, MLX PMD has to
>>> look up LKEY per each packet DMA. Let me briefly explain about this for your
>>> understanding. For security reason, we don't allow application initiates a DMA
>>> transaction with unknown random physical addresses. Instead, va-to-pa mapping
>>> (we call it Memory Region) should be pre-registered and LKEY is the index of the
>>> translation entry registered in device. With the current static memory model, it
>>> is easy to manage because v-p mapping is unchanged over time. But if it becomes
>>> dynamic, MLX PMD should get notified with the event to register/un-regsiter
>>> Memory Region.
>>>
>>> For MLX PMD, it is also enough to get one notification per allocation/free of a
>>> virutal memory region. It shouldn't necessarily be a per-page call like Benjamin
>>> mentioned because PA of region doesn't need to be contiguous for registration.
>>> But it doesn't need to know about physical address of the region (I'm not saying
>>> it is unnecessary, but just FYI :-).
>>>
>>> Thanks,
>>> Yongseok
>>>
>>
>> Thanks for your feedback, good to hear we're on the right track. I already
>> have a prototype implementation of this working, due for v1 submission :)
> 
> Hi Anatoly,
> 
> Good to know.
> Do you see some performances impact with this series?
> 
> Thanks,
> 

In the general case, no impact is noticeable, since e.g. the underlying ring 
implementation does not depend on IO space layout whatsoever. In certain 
specific cases, some optimizations that were made on the assumption that 
physical space is contiguous would no longer be possible (e.g. 
calculating an offset spanning several pages) unless VFIO is in use: due 
to the unpredictability of the IO space layout, each page will have to be 
checked individually, rather than sharing a common base offset.
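
To make the difference concrete, a small sketch with hypothetical helpers
(page_iova() standing in for a per-page VA-to-IOVA lookup); this is not
actual DPDK code:

#include <stdint.h>
#include <stddef.h>

#define PAGE_SZ 0x200000ULL /* assume 2MB pages */

/* stand-in for a per-page VA -> IOVA lookup */
static uint64_t
page_iova(const void *page_va)
{
	return (uintptr_t)page_va; /* identity, for illustration only */
}

/* old assumption: the object is IOVA-contiguous, so one base address
 * plus an offset covers any byte inside it */
static uint64_t
iova_contig(uint64_t obj_iova, size_t off)
{
	return obj_iova + off;
}

/* without that guarantee (and without VFIO remapping IOVA contiguously),
 * the page that the offset lands on must be resolved individually */
static uint64_t
iova_paged(const void *obj_va, size_t off)
{
	uintptr_t p = (uintptr_t)obj_va + off;

	return page_iova((const void *)(uintptr_t)(p & ~(PAGE_SZ - 1))) +
		(p & (PAGE_SZ - 1));
}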

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2018-02-05 10:36             ` Burakov, Anatoly
@ 2018-02-06  9:10               ` Nélio Laranjeiro
  0 siblings, 0 replies; 46+ messages in thread
From: Nélio Laranjeiro @ 2018-02-06  9:10 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Yongseok Koh, Walker, Benjamin, dev, thomas, andras.kovacs,
	Wiles, Keith, Richardson, Bruce

On Mon, Feb 05, 2018 at 10:36:58AM +0000, Burakov, Anatoly wrote:
> On 05-Feb-18 10:18 AM, Nélio Laranjeiro wrote:
> > On Mon, Feb 05, 2018 at 10:03:35AM +0000, Burakov, Anatoly wrote:
> > > On 02-Feb-18 7:28 PM, Yongseok Koh wrote:
> > > > On Tue, Dec 26, 2017 at 05:19:25PM +0000, Walker, Benjamin wrote:
> > > > > On Fri, 2017-12-22 at 09:13 +0000, Burakov, Anatoly wrote:
> > > > > > On 21-Dec-17 9:38 PM, Walker, Benjamin wrote:
> > > > > > > SPDK will need some way to register for a notification when pages are
> > > > > > > allocated
> > > > > > > or freed. For storage, the number of requests per second is (relative to
> > > > > > > networking) fairly small (hundreds of thousands per second in a traditional
> > > > > > > block storage stack, or a few million per second with SPDK). Given that, we
> > > > > > > can
> > > > > > > afford to do a dynamic lookup from va to pa/iova on each request in order to
> > > > > > > greatly simplify our APIs (users can just pass pointers around instead of
> > > > > > > mbufs). DPDK has a way to lookup the pa from a given va, but it does so by
> > > > > > > scanning /proc/self/pagemap and is very slow. SPDK instead handles this by
> > > > > > > implementing a lookup table of va to pa/iova which we populate by scanning
> > > > > > > through the DPDK memory segments at start up, so the lookup in our table is
> > > > > > > sufficiently fast for storage use cases. If the list of memory segments
> > > > > > > changes,
> > > > > > > we need to know about it in order to update our map.
> > > > > > 
> > > > > > Hi Benjamin,
> > > > > > 
> > > > > > So, in other words, we need callbacks on alloa/free. What information
> > > > > > would SPDK need when receiving this notification? Since we can't really
> > > > > > know in advance how many pages we allocate (it may be one, it may be a
> > > > > > thousand) and they no longer are guaranteed to be contiguous, would a
> > > > > > per-page callback be OK? Alternatively, we could have one callback per
> > > > > > operation, but only provide VA and size of allocated memory, while
> > > > > > leaving everything else to the user. I do add a virt2memseg() function
> > > > > > which would allow you to look up segment physical addresses easier, so
> > > > > > you won't have to manually scan memseg lists to get IOVA for a given VA.
> > > > > > 
> > > > > > Thanks for your feedback and suggestions!
> > > > > 
> > > > > Yes - callbacks on alloc/free would be perfect. Ideally for us we want one
> > > > > callback per virtual memory region allocated, plus a function we can call to
> > > > > find the physical addresses/page break points on that virtual region. The
> > > > > function that finds the physical addresses does not have to be efficient - we'll
> > > > > just call that once when the new region is allocated and store the results in a
> > > > > fast lookup table. One call per virtual region is better for us than one call
> > > > > per physical page because we're actually keeping multiple different types of
> > > > > memory address translation tables in SPDK. One translates from va to pa/iova, so
> > > > > for this one we need to break this up into physical pages and it doesn't matter
> > > > > if you do one call per virtual region or one per physical page. However another
> > > > > one translates from va to RDMA lkey, so it is much more efficient if we can
> > > > > register large virtual regions in a single call.
> > > > 
> > > > Another yes to callbacks. Like Benjamin mentioned about RDMA, MLX PMD has to
> > > > look up LKEY per each packet DMA. Let me briefly explain about this for your
> > > > understanding. For security reason, we don't allow application initiates a DMA
> > > > transaction with unknown random physical addresses. Instead, va-to-pa mapping
> > > > (we call it Memory Region) should be pre-registered and LKEY is the index of the
> > > > translation entry registered in device. With the current static memory model, it
> > > > is easy to manage because v-p mapping is unchanged over time. But if it becomes
> > > > dynamic, MLX PMD should get notified with the event to register/un-regsiter
> > > > Memory Region.
> > > > 
> > > > For MLX PMD, it is also enough to get one notification per allocation/free of a
> > > > virutal memory region. It shouldn't necessarily be a per-page call like Benjamin
> > > > mentioned because PA of region doesn't need to be contiguous for registration.
> > > > But it doesn't need to know about physical address of the region (I'm not saying
> > > > it is unnecessary, but just FYI :-).
> > > > 
> > > > Thanks,
> > > > Yongseok
> > > > 
> > > 
> > > Thanks for your feedback, good to hear we're on the right track. I already
> > > have a prototype implementation of this working, due for v1 submission :)
> > 
> > Hi Anatoly,
> > 
> > Good to know.
> > Do you see some performances impact with this series?
> > 
> > Thanks,
> > 
> 
> In general case, no impact is noticeable, since e.g. underlying ring
> implementation does not depend on IO space layout whatsoever. In certain
> specific cases, some optimizations that were made on the assumption that
> physical space is contiguous, would no longer be possible (e.g. calculating
> offset spanning several pages) unless VFIO is in use, as due to
> unpredictability of IO space layout, each page will have to be checked
> individually, rather than sharing common base offset.

My concern is more related to some devices which use only virtual
memory to send/receive mbufs, like Mellanox NICs.
This modification may penalize some of them, as their performance will
be directly impacted.

It certainly needs more testing from maintainers and users with this
series to understand the real impact on each NIC.

I am waiting for the new revision to run some tests,

Thanks,

-- 
Nélio Laranjeiro
6WIND

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2018-02-05 10:03         ` Burakov, Anatoly
  2018-02-05 10:18           ` Nélio Laranjeiro
@ 2018-02-14  2:01           ` Yongseok Koh
  2018-02-14  9:32             ` Burakov, Anatoly
  1 sibling, 1 reply; 46+ messages in thread
From: Yongseok Koh @ 2018-02-14  2:01 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Walker, Benjamin, dev, Thomas Monjalon, andras.kovacs, Wiles,
	Keith, Richardson, Bruce, Nélio Laranjeiro, Shahaf Shuler,
	Xueming(Steven) Li


> On Feb 5, 2018, at 2:03 AM, Burakov, Anatoly <anatoly.burakov@intel.com> wrote:
> 
> Thanks for your feedback, good to hear we're on the right track. I already have a prototype implementation of this working, due for v1 submission :)

Anatoly,

One more suggestion. Currently, when populating a mempool, there's a chance of
having multiple chunks if system memory is highly fragmented. However, with your
new design, that is unlikely to happen unless the system is really low on memory,
since allocation will be dynamic and page by page. With your v2, you seemed to make
minimal changes to mempool. If allocation fails, it will still try to gather
fragments from malloc_heap until it acquires enough objects, and the resultant
mempool will have multiple chunks. But like I mentioned, that is very unlikely and
will only happen when the system is short of memory. Is my understanding
correct?

If so, how about making a change to drop the case where mempool has multiple
chunks?

Thanks
Yongseok

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
                   ` (26 preceding siblings ...)
  2018-01-23 22:33 ` Yongseok Koh
@ 2018-02-14  8:04 ` Thomas Monjalon
  2018-02-14 10:07   ` Burakov, Anatoly
  27 siblings, 1 reply; 46+ messages in thread
From: Thomas Monjalon @ 2018-02-14  8:04 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dev, andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, Yongseok Koh, nelio.laranjeiro, olivier.matz,
	rahul.lakkireddy, jerin.jacob, hemant.agrawal, alejandro.lucero,
	arybchenko, ferruh.yigit

Hi Anatoly,

19/12/2017 12:14, Anatoly Burakov:
>  * Memory tagging. This is related to previous item. Right now, we can only ask
>    malloc to allocate memory by page size, but one could potentially have
>    different memory regions backed by pages of similar sizes (for example,
>    locked 1G pages, to completely avoid TLB misses, alongside regular 1G pages),
>    and it would be good to have that kind of mechanism to distinguish between
>    different memory types available to a DPDK application. One could, for example,
>    tag memory by "purpose" (i.e. "fast", "slow"), or in other ways.

How do you imagine memory tagging?
Should it be a parameter when requesting some memory from rte_malloc
or rte_mempool?
Could it be a bit-field allowing to combine some properties?
Does it make sense to have "DMA" as one of the purpose?

How to transparently allocate the best memory for the NIC?
You take care of the NUMA socket property, but there can be more
requirements, like getting memory from the NIC itself.

+Cc more people (6WIND, Cavium, Chelsio, Mellanox, Netronome, NXP, Solarflare)
in order to trigger a discussion about the ideal requirements.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2018-02-14  2:01           ` Yongseok Koh
@ 2018-02-14  9:32             ` Burakov, Anatoly
  2018-02-14 18:13               ` Yongseok Koh
  0 siblings, 1 reply; 46+ messages in thread
From: Burakov, Anatoly @ 2018-02-14  9:32 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: Walker, Benjamin, dev, Thomas Monjalon, andras.kovacs, Wiles,
	Keith, Richardson, Bruce, Nélio Laranjeiro, Shahaf Shuler,
	Xueming(Steven) Li

On 14-Feb-18 2:01 AM, Yongseok Koh wrote:
> 
>> On Feb 5, 2018, at 2:03 AM, Burakov, Anatoly <anatoly.burakov@intel.com> wrote:
>>
>> Thanks for your feedback, good to hear we're on the right track. I already have a prototype implementation of this working, due for v1 submission :)
> 
> Anatoly,
> 
> One more suggestion. Currently, when populating mempool, there's a chance to
> have multiple chunks if system memory is highly fragmented. However, with your
> new design, it is unlikely to happen unless the system is really low on memory.
> Allocation will be dynamic and page by page. With your v2, you seemed to make
> minimal changes on mempool. If allocation fails, it will still try to gather
> fragments from malloc_heap until it acquires enough objects and the resultant
> mempool will have multiple chunks. But like I mentioned, it is very unlikely and
> this will only happen when the system is short of memory. Is my understanding
> correct?
> 
> If so, how about making a change to drop the case where mempool has multiple
> chunks?
> 
> Thanks
> Yongseok
> 

Hi Yongseok,

I would still like to keep it, as it may matter for low-memory cases such as 
containers.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2018-02-14  8:04 ` Thomas Monjalon
@ 2018-02-14 10:07   ` Burakov, Anatoly
  2018-04-25 16:02     ` Burakov, Anatoly
  0 siblings, 1 reply; 46+ messages in thread
From: Burakov, Anatoly @ 2018-02-14 10:07 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, Yongseok Koh, nelio.laranjeiro, olivier.matz,
	rahul.lakkireddy, jerin.jacob, hemant.agrawal, alejandro.lucero,
	arybchenko, ferruh.yigit

On 14-Feb-18 8:04 AM, Thomas Monjalon wrote:
> Hi Anatoly,
> 
> 19/12/2017 12:14, Anatoly Burakov:
>>   * Memory tagging. This is related to previous item. Right now, we can only ask
>>     malloc to allocate memory by page size, but one could potentially have
>>     different memory regions backed by pages of similar sizes (for example,
>>     locked 1G pages, to completely avoid TLB misses, alongside regular 1G pages),
>>     and it would be good to have that kind of mechanism to distinguish between
>>     different memory types available to a DPDK application. One could, for example,
>>     tag memory by "purpose" (i.e. "fast", "slow"), or in other ways.
> 
> How do you imagine memory tagging?
> Should it be a parameter when requesting some memory from rte_malloc
> or rte_mempool?

We can't make it a parameter for mempool without making it a parameter 
for rte_malloc, as every memory allocation in DPDK works through 
rte_malloc. So at the very least, rte_malloc will have it. And as long 
as rte_malloc has it, there's no reason why memzones and mempools 
couldn't - not much code to add.

> Could it be a bit-field allowing to combine some properties?
> Does it make sense to have "DMA" as one of the purpose?

Something like a bitfield would be my preference, yes. That way we could 
classify memory in certain ways and allocate based on that. Which 
"certain ways" these are, i'm not sure. For example, in addition to 
tagging memory as "DMA-capable" (which i think is a given), one might 
tag certain memory as "non-default", as in, never allocate from this 
chunk of memory unless explicitly asked to do so - this could be useful 
for types of memory that are a precious resource.

Then again, it is likely that we won't have many types of memory in 
DPDK, and any other type would be implementation-specific, so maybe just 
stringly-typing it is OK (maybe we can finally make use of "type" 
parameter in rte_malloc!).
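
Just to make the bitfield idea concrete, a hypothetical sketch (none of these
flags exist, and the prototype merely mirrors rte_malloc_socket()'s
arguments):

#include <stdint.h>
#include <stddef.h>

/* hypothetical tag bits */
#define MEMTAG_DMA_CAPABLE (1u << 0) /* safe to hand to devices */
#define MEMTAG_NON_DEFAULT (1u << 1) /* only used when explicitly requested */
#define MEMTAG_LOCKED_1G   (1u << 2) /* e.g. pinned 1G pages, a precious resource */

/* a tagged allocation request might then look something like this
 * (hypothetical prototype, not an existing DPDK API) */
void *rte_malloc_tagged(const char *type, size_t size, unsigned int align,
			int socket, uint32_t tags);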

> 
> How to transparently allocate the best memory for the NIC?
> You take care of the NUMA socket property, but there can be more
> requirements, like getting memory from the NIC itself.

I would think that we can't make it generic enough to cover all cases, 
so it's best to expose some API's and let PMD's handle this themselves.

> 
> +Cc more people (6WIND, Cavium, Chelsio, Mellanox, Netronome, NXP, Solarflare)
> in order to trigger a discussion about the ideal requirements.
> 



-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2018-02-14  9:32             ` Burakov, Anatoly
@ 2018-02-14 18:13               ` Yongseok Koh
  0 siblings, 0 replies; 46+ messages in thread
From: Yongseok Koh @ 2018-02-14 18:13 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Walker, Benjamin, dev, Thomas Monjalon, andras.kovacs, Wiles,
	Keith, Richardson, Bruce, Nélio Laranjeiro, Shahaf Shuler,
	Xueming(Steven) Li



> On Feb 14, 2018, at 1:32 AM, Burakov, Anatoly <anatoly.burakov@intel.com> wrote:
> 
> On 14-Feb-18 2:01 AM, Yongseok Koh wrote:
>>> On Feb 5, 2018, at 2:03 AM, Burakov, Anatoly <anatoly.burakov@intel.com> wrote:
>>> 
>>> Thanks for your feedback, good to hear we're on the right track. I already have a prototype implementation of this working, due for v1 submission :)
>> Anatoly,
>> One more suggestion. Currently, when populating mempool, there's a chance to
>> have multiple chunks if system memory is highly fragmented. However, with your
>> new design, it is unlikely to happen unless the system is really low on memory.
>> Allocation will be dynamic and page by page. With your v2, you seemed to make
>> minimal changes on mempool. If allocation fails, it will still try to gather
>> fragments from malloc_heap until it acquires enough objects and the resultant
>> mempool will have multiple chunks. But like I mentioned, it is very unlikely and
>> this will only happen when the system is short of memory. Is my understanding
>> correct?
>> If so, how about making a change to drop the case where mempool has multiple
>> chunks?
>> Thanks
>> Yongseok
> 
> Hi Yongseok,
> 
> I would still like to keep it, as it may impact low memory cases such as containers.

Agreed. I overlooked that kind of use-cases.

Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2018-02-14 10:07   ` Burakov, Anatoly
@ 2018-04-25 16:02     ` Burakov, Anatoly
  2018-04-25 16:12       ` Stephen Hemminger
  0 siblings, 1 reply; 46+ messages in thread
From: Burakov, Anatoly @ 2018-04-25 16:02 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, andras.kovacs, laszlo.vadkeri, keith.wiles, benjamin.walker,
	bruce.richardson, Yongseok Koh, nelio.laranjeiro, olivier.matz,
	rahul.lakkireddy, jerin.jacob, hemant.agrawal, alejandro.lucero,
	arybchenko, ferruh.yigit, Srinath Mannam

On 14-Feb-18 10:07 AM, Burakov, Anatoly wrote:
> On 14-Feb-18 8:04 AM, Thomas Monjalon wrote:
>> Hi Anatoly,
>>
>> 19/12/2017 12:14, Anatoly Burakov:
>>>   * Memory tagging. This is related to previous item. Right now, we 
>>> can only ask
>>>     malloc to allocate memory by page size, but one could potentially 
>>> have
>>>     different memory regions backed by pages of similar sizes (for 
>>> example,
>>>     locked 1G pages, to completely avoid TLB misses, alongside 
>>> regular 1G pages),
>>>     and it would be good to have that kind of mechanism to 
>>> distinguish between
>>>     different memory types available to a DPDK application. One 
>>> could, for example,
>>>     tag memory by "purpose" (i.e. "fast", "slow"), or in other ways.
>>
>> How do you imagine memory tagging?
>> Should it be a parameter when requesting some memory from rte_malloc
>> or rte_mempool?
> 
> We can't make it a parameter for mempool without making it a parameter 
> for rte_malloc, as every memory allocation in DPDK works through 
> rte_malloc. So at the very least, rte_malloc will have it. And as long 
> as rte_malloc has it, there's no reason why memzones and mempools 
> couldn't - not much code to add.
> 
>> Could it be a bit-field allowing to combine some properties?
>> Does it make sense to have "DMA" as one of the purpose?
> 
> Something like a bitfield would be my preference, yes. That way we could 
> classify memory in certain ways and allocate based on that. Which 
> "certain ways" these are, i'm not sure. For example, in addition to 
> tagging memory as "DMA-capable" (which i think is a given), one might 
> tag certain memory as "non-default", as in, never allocate from this 
> chunk of memory unless explicitly asked to do so - this could be useful 
> for types of memory that are a precious resource.
> 
> Then again, it is likely that we won't have many types of memory in 
> DPDK, and any other type would be implementation-specific, so maybe just 
> stringly-typing it is OK (maybe we can finally make use of "type" 
> parameter in rte_malloc!).
> 
>>
>> How to transparently allocate the best memory for the NIC?
>> You take care of the NUMA socket property, but there can be more
>> requirements, like getting memory from the NIC itself.
> 
> I would think that we can't make it generic enough to cover all cases, 
> so it's best to expose some API's and let PMD's handle this themselves.
> 
>>
>> +Cc more people (6WIND, Cavium, Chelsio, Mellanox, Netronome, NXP, 
>> Solarflare)
>> in order to trigger a discussion about the ideal requirements.
>>
> 

Hi all,

I would like to restart this discussion, again :) and hear some 
feedback on my thoughts below.

I've done some more thinking about it, and while i have lots of use-cases 
in mind, i suspect covering them all while keeping a sane API is 
unrealistic.

So, first things first.

The main issue we have is the 1:1 correspondence between malloc heap and socket 
ID. This has led to various attempts to hijack socket id's to do 
something else - i've seen this approach a few times before, most 
recently in a patch by Srinath/Broadcom [1]. We need to break this 
dependency somehow, and have a unique heap identifier.

Also, since memory allocators are expected to behave roughly like 
drivers (e.g. have a driver API and provide hooks for init/alloc/free 
functions, etc.), a request to allocate memory may not just go to the 
heap itself (which is handled internally by rte_malloc), but also go to 
its respective allocator. This is roughly similar to what is happening 
currently, except that which allocator functions to call will then 
depend on which driver allocated that heap.

So, we arrive at a dependency - heap => allocator. Each heap must know 
to which allocator it belongs - so, we also need some kind of way to 
identify not just the heap, but the allocator as well.

In the above quotes from previous mails i suggested categorizing memory 
by "types", but now that i think of it, the API would've been too 
complex, as we would've ideally had to cover use cases such as "allocate 
memory of this type, no matter from which allocator it comes from", 
"allocate memory from this particular heap", "allocate memory from this 
particular allocator"... It gets complicated pretty fast.

What i propose instead is this. In 99% of cases, the user wants our hugepage 
allocator. So, by default, all allocations will come through that. In 
the event that the user needs memory from a specific heap, we need to 
provide a new set of API's to request memory from a specific heap.

Do we expect situations where the user might *not* want the default allocator, 
but also *not* know which exact heap he wants? If the answer is no 
(which i'm counting on :) ), then allocating from a specific malloc 
driver becomes as simple as something like this:

mem = rte_malloc_from_heap("my_very_special_heap");

(stringly-typed heap ID is just an example)

So, old API's remain intact, and are always passed through to a default 
allocator, while new API's will grant access to other allocators.
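
To sketch it out, the new API surface could be as small as the following
hypothetical prototypes (fleshing out the example above with size/align
arguments; nothing here exists in DPDK yet):

#include <stddef.h>

struct malloc_heap_ops; /* hypothetical driver hooks: init/alloc/free/... */

/* a malloc driver would register its heap under a unique name */
int rte_malloc_heap_register(const char *heap_name,
			     const struct malloc_heap_ops *ops);

/* allocate from an explicitly named heap; plain rte_malloc() keeps
 * going to the default hugepage allocator */
void *rte_malloc_from_heap(const char *heap_name, const char *type,
			   size_t size, unsigned int align);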

Heap ID alone, however, may not provide enough flexibility. For example, 
if a malloc driver allocates a specific kind of memory that is 
NUMA-aware, it would perhaps be awkward to call different heap ID's when 
the memory being allocated is arguably the same, just subdivided into 
several blocks. Moreover, figuring out situations like this would likely 
require some cooperation from the allocator itself (possibly some 
allocator-specific API's), but should we add malloc heap arguments, 
those would have to be generic. I'm not sure if we want to go that far, 
though.

Does that sound reasonable?

Another tangentially related issue, raised by Olivier [2], is that of 
allocating memory in blocks, rather than using rte_malloc. Current 
implementation has rte_malloc storing its metadata right in the memory - 
this leads to unnecessary memory fragmentation in certain cases, such as 
allocating memory page-by-page, and in general polluting memory we might 
not want to pollute with malloc metadata.

To fix this, the memory allocator would have to store malloc metadata 
externally, which comes with a few caveats (reverse mapping of pointers 
to malloc elements, storing, looking up and accounting for said 
elements, etc.). It's not currently planned to work on it, but it's 
certainly something to think about :)

[1] http://dpdk.org/dev/patchwork/patch/36596/
[2] http://dpdk.org/ml/archives/dev/2018-March/093212.html

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
  2018-04-25 16:02     ` Burakov, Anatoly
@ 2018-04-25 16:12       ` Stephen Hemminger
  0 siblings, 0 replies; 46+ messages in thread
From: Stephen Hemminger @ 2018-04-25 16:12 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Thomas Monjalon, dev, andras.kovacs, laszlo.vadkeri, keith.wiles,
	benjamin.walker, bruce.richardson, Yongseok Koh,
	nelio.laranjeiro, olivier.matz, rahul.lakkireddy, jerin.jacob,
	hemant.agrawal, alejandro.lucero, arybchenko, ferruh.yigit,
	Srinath Mannam

On Wed, 25 Apr 2018 17:02:48 +0100
"Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:

> On 14-Feb-18 10:07 AM, Burakov, Anatoly wrote:
> > On 14-Feb-18 8:04 AM, Thomas Monjalon wrote:  
> >> Hi Anatoly,
> >>
> >> 19/12/2017 12:14, Anatoly Burakov:  
> >>>   * Memory tagging. This is related to previous item. Right now, we 
> >>> can only ask
> >>>     malloc to allocate memory by page size, but one could potentially 
> >>> have
> >>>     different memory regions backed by pages of similar sizes (for 
> >>> example,
> >>>     locked 1G pages, to completely avoid TLB misses, alongside 
> >>> regular 1G pages),
> >>>     and it would be good to have that kind of mechanism to 
> >>> distinguish between
> >>>     different memory types available to a DPDK application. One 
> >>> could, for example,
> >>>     tag memory by "purpose" (i.e. "fast", "slow"), or in other ways.  
> >>
> >> How do you imagine memory tagging?
> >> Should it be a parameter when requesting some memory from rte_malloc
> >> or rte_mempool?  
> > 
> > We can't make it a parameter for mempool without making it a parameter 
> > for rte_malloc, as every memory allocation in DPDK works through 
> > rte_malloc. So at the very least, rte_malloc will have it. And as long 
> > as rte_malloc has it, there's no reason why memzones and mempools 
> > couldn't - not much code to add.
> >   
> >> Could it be a bit-field allowing to combine some properties?
> >> Does it make sense to have "DMA" as one of the purpose?  
> > 
> > Something like a bitfield would be my preference, yes. That way we could 
> > classify memory in certain ways and allocate based on that. Which 
> > "certain ways" these are, i'm not sure. For example, in addition to 
> > tagging memory as "DMA-capable" (which i think is a given), one might 
> > tag certain memory as "non-default", as in, never allocate from this 
> > chunk of memory unless explicitly asked to do so - this could be useful 
> > for types of memory that are a precious resource.
> > 
> > Then again, it is likely that we won't have many types of memory in 
> > DPDK, and any other type would be implementation-specific, so maybe just 
> > stringly-typing it is OK (maybe we can finally make use of "type" 
> > parameter in rte_malloc!).
> >   
> >>
> >> How to transparently allocate the best memory for the NIC?
> >> You take care of the NUMA socket property, but there can be more
> >> requirements, like getting memory from the NIC itself.  
> > 
> > I would think that we can't make it generic enough to cover all cases, 
> > so it's best to expose some API's and let PMD's handle this themselves.
> >   
> >>
> >> +Cc more people (6WIND, Cavium, Chelsio, Mellanox, Netronome, NXP, 
> >> Solarflare)
> >> in order to trigger a discussion about the ideal requirements.
> >>  
> >   
> 
> Hi all,
> 
> I would like to restart this discussion, again :) I would like to hear
> some feedback on my thoughts below.
> 
> I've done some more thinking about it, and while I have lots of use cases
> in mind, I suspect covering them all while keeping a sane API is
> unrealistic.
> 
> So, first things first.
> 
> The main issue we have is the 1:1 correspondence between a malloc heap
> and a socket ID. This has led to various attempts to hijack socket IDs to
> do something else - I've seen this approach a few times before, most
> recently in a patch by Srinath/Broadcom [1]. We need to break this
> dependency somehow, and have a unique heap identifier.
> 
> Also, since memory allocators are expected to behave roughly like drivers
> (e.g. have a driver API and provide hooks for init/alloc/free functions,
> etc.), a request to allocate memory may not just go to the heap itself
> (which is handled internally by rte_malloc), but also to its respective
> allocator. This is roughly what happens today, except that which
> allocator functions to call will then depend on which driver allocated
> that heap.
> 
> So, we arrive at a dependency - heap => allocator. Each heap must know 
> to which allocator it belongs - so, we also need some kind of way to 
> identify not just the heap, but the allocator as well.
> 
> In the quotes from previous mails above, I suggested categorizing memory
> by "type", but now that I think of it, the API would have been too
> complex, as we would ideally have had to cover use cases such as
> "allocate memory of this type, no matter which allocator it comes from",
> "allocate memory from this particular heap", "allocate memory from this
> particular allocator"... It gets complicated pretty fast.
> 
> What I propose instead is this. In 99% of cases, the user wants our
> hugepage allocator, so, by default, all allocations will go through that.
> For the event that the user needs memory from a specific heap, we provide
> a new set of API's to request memory from that particular heap.
> 
> Do we expect situations where the user might *not* want the default
> allocator, but also *not* know which exact heap they want? If the answer
> is no (which I'm counting on :) ), then allocating from a specific malloc
> driver becomes as simple as something like this:
> 
> mem = rte_malloc_from_heap("my_very_special_heap");
> 
> (stringly-typed heap ID is just an example)
> 
> So, old API's remain intact and are always passed through to the default
> allocator, while the new API's will grant access to other allocators.
> 
> Heap ID alone, however, may not provide enough flexibility. For example,
> if a malloc driver allocates a specific kind of memory that is
> NUMA-aware, it would perhaps be awkward to use different heap IDs when
> the memory being allocated is arguably the same, just subdivided into
> several blocks. Moreover, figuring out situations like this would likely
> require some cooperation from the allocator itself (possibly some
> allocator-specific API's), but should we add malloc heap arguments, those
> would have to be generic. I'm not sure we want to go that far, though.
> 
> Does that sound reasonable?
> 
> Another, tangentially related, issue raised by Olivier [2] is that of
> allocating memory in blocks, rather than through rte_malloc. The current
> implementation has rte_malloc store its metadata right in the allocated
> memory - this leads to unnecessary memory fragmentation in certain cases,
> such as allocating memory page-by-page, and in general pollutes memory
> that we might not want to pollute with malloc metadata.
> 
> To fix this, the memory allocator would have to store malloc metadata
> externally, which comes with a few caveats (reverse mapping of pointers
> to malloc elements; storing, looking up and accounting for said elements;
> etc.). There are no current plans to work on this, but it's certainly
> something to think about :)
> 
> [1] http://dpdk.org/dev/patchwork/patch/36596/
> [2] http://dpdk.org/ml/archives/dev/2018-March/093212.html
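
To make the proposal above concrete, here is a minimal C sketch of what a
"named heap plus tag bits" allocation might look like. None of the names
below (hyp_malloc_from_heap(), the HYP_MEM_* flags, struct
hyp_malloc_driver_ops) exist in DPDK - they are hypothetical stand-ins for
the rte_malloc_from_heap() idea quoted above, and the body simply falls
back to plain malloc() so the sketch compiles on its own:

    /* Hypothetical sketch only: illustrates the "named heap + tag bits"
     * proposal above. These names are not part of DPDK. */
    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Combinable "memory tag" bits, as discussed for the bitfield idea. */
    #define HYP_MEM_DMA_CAPABLE (1u << 0) /* safe to hand to a device */
    #define HYP_MEM_NON_DEFAULT (1u << 1) /* used only when asked for */

    /* "Allocators behave like drivers": each heap would remember which
     * set of hooks created it, and requests would be routed to them. */
    struct hyp_malloc_driver_ops {
        void *(*alloc)(size_t size, unsigned int align);
        void (*free_cb)(void *ptr);
    };

    static void *
    hyp_malloc_from_heap(const char *heap_id, size_t size,
                         unsigned int align, unsigned int tags)
    {
        /* A real implementation would look up the heap by ID, find its
         * malloc driver and call that driver's alloc hook; plain
         * malloc() keeps this sketch self-contained. */
        (void)heap_id; (void)align; (void)tags;
        return malloc(size);
    }

    int
    main(void)
    {
        /* "Give me 4K of DMA-capable memory from this specific heap." */
        void *mem = hyp_malloc_from_heap("my_very_special_heap", 4096, 64,
                                         HYP_MEM_DMA_CAPABLE);

        if (mem == NULL)
            return 1;
        printf("allocated %p\n", mem);
        free(mem);
        return 0;
    }

The existing rte_malloc()/rte_malloc_socket() calls would keep going to the
default hugepage allocator, exactly as described in the quoted mail.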

Maybe the existing rte_malloc, which always tries to work like malloc, is not
the best API for applications? I have always thought the Samba talloc API was
less error-prone, since it supports reference counting and hierarchical
allocation.
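
For readers unfamiliar with it, this is roughly what that hierarchical model
looks like with Samba's libtalloc (link with -ltalloc); it is shown purely
for comparison with the rte_malloc model, not as a concrete proposal for
DPDK:

    #include <stdio.h>
    #include <talloc.h> /* Samba's libtalloc */

    struct connection {
        int fd;
        char *peer_name;
    };

    int
    main(void)
    {
        /* Top-level context: everything parented to it is freed with it. */
        TALLOC_CTX *pool = talloc_new(NULL);
        if (pool == NULL)
            return 1;

        /* Allocations form a tree: conn is a child of pool, the string is
         * a child of conn. talloc_reference() can add extra owners, so a
         * block stays alive until its last reference is released. */
        struct connection *conn = talloc_zero(pool, struct connection);
        if (conn == NULL) {
            talloc_free(pool);
            return 1;
        }
        conn->fd = -1;
        conn->peer_name = talloc_strdup(conn, "example-peer");
        if (conn->peer_name != NULL)
            printf("peer: %s\n", conn->peer_name);

        /* One call releases the whole hierarchy: pool, conn, the string. */
        talloc_free(pool);
        return 0;
    }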

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2018-04-25 16:12 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-19 11:14 [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 01/23] eal: move get_virtual_area out of linuxapp eal_memory.c Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 02/23] eal: add function to report number of detected sockets Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 03/23] eal: add rte_fbarray Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 04/23] eal: move all locking to heap Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 05/23] eal: protect malloc heap stats with a lock Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 06/23] eal: make malloc a doubly-linked list Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 07/23] eal: make malloc_elem_join_adjacent_free public Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 08/23] eal: add "single file segments" command-line option Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 09/23] eal: add "legacy memory" option Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 10/23] eal: read hugepage counts from node-specific sysfs path Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 11/23] eal: replace memseg with memseg lists Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 12/23] eal: add support for dynamic memory allocation Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 13/23] eal: make use of dynamic memory allocation for init Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 14/23] eal: add support for dynamic unmapping of pages Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 15/23] eal: add API to check if memory is physically contiguous Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 16/23] eal: enable dynamic memory allocation/free on malloc/free Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 17/23] eal: add backend support for contiguous memory allocation Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 18/23] eal: add rte_malloc support for allocating contiguous memory Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 19/23] eal: enable reserving physically contiguous memzones Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 20/23] eal: make memzones use rte_fbarray Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 21/23] mempool: add support for the new memory allocation methods Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 22/23] vfio: allow to map other memory regions Anatoly Burakov
2017-12-19 11:14 ` [dpdk-dev] [RFC v2 23/23] eal: map/unmap memory with VFIO when alloc/free pages Anatoly Burakov
2017-12-19 15:46 ` [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK Stephen Hemminger
2017-12-19 16:02   ` Burakov, Anatoly
2017-12-19 16:06     ` Stephen Hemminger
2017-12-19 16:09       ` Burakov, Anatoly
2017-12-21 21:38 ` Walker, Benjamin
2017-12-22  9:13   ` Burakov, Anatoly
2017-12-26 17:19     ` Walker, Benjamin
2018-02-02 19:28       ` Yongseok Koh
2018-02-05 10:03         ` Burakov, Anatoly
2018-02-05 10:18           ` Nélio Laranjeiro
2018-02-05 10:36             ` Burakov, Anatoly
2018-02-06  9:10               ` Nélio Laranjeiro
2018-02-14  2:01           ` Yongseok Koh
2018-02-14  9:32             ` Burakov, Anatoly
2018-02-14 18:13               ` Yongseok Koh
2018-01-13 14:13 ` Burakov, Anatoly
2018-01-23 22:33 ` Yongseok Koh
2018-01-25 16:18   ` Burakov, Anatoly
2018-02-14  8:04 ` Thomas Monjalon
2018-02-14 10:07   ` Burakov, Anatoly
2018-04-25 16:02     ` Burakov, Anatoly
2018-04-25 16:12       ` Stephen Hemminger
