DPDK patches and discussions
From: Anatoly Burakov <anatoly.burakov@intel.com>
To: dev@dpdk.org
Cc: ray.kinsella@intel.com, kuralamudhan.ramakrishnan@intel.com,
	louise.m.daly@intel.com, bruce.richardson@intel.com,
	ferruh.yigit@intel.com, konstantin.ananyev@intel.com,
	thomas@monjalon.net
Subject: [dpdk-dev] [PATCH 9/9] mem: support in-memory mode
Date: Fri,  1 Jun 2018 18:15:18 +0100
Message-ID: <403e50c5a09d7bf41e2a7264be85ee6a086a4eb2.1527872626.git.anatoly.burakov@intel.com>
In-Reply-To: <cover.1527872626.git.anatoly.burakov@intel.com>

Implement the final piece of the in-memory mode puzzle - enable running
DPDK entirely in memory, without creating any files.

To do this, use mmap() with MAP_HUGETLB and page size flags so that DPDK
can work without hugetlbfs mountpoints. Enabling this requires a few
changes.
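
A minimal sketch of the kind of anonymous mapping described above, assuming
a Linux build where <sys/mman.h> provides MAP_HUGETLB. The helper names
huge_mmap_flags() and map_anon_hugepage() are hypothetical and not part of
DPDK; the fallback shift of 26 mirrors the value this patch uses when
MAP_HUGE_SHIFT is not defined.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Same fallback as the patch: assume a shift of 26 if headers lack it. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif

/* Hypothetical helper: encode log2(page_sz) into the mmap() flags. */
static int
huge_mmap_flags(size_t page_sz)
{
	unsigned int log2 = 0;

	/* hugepage sizes are powers of two */
	assert(page_sz != 0 && (page_sz & (page_sz - 1)) == 0);
	while (((size_t)1 << log2) < page_sz)
		log2++;

	/* shift as unsigned so large page sizes cannot overflow an int */
	return (int)(log2 << MAP_HUGE_SHIFT) | MAP_HUGETLB |
			MAP_PRIVATE | MAP_ANONYMOUS;
}

/*
 * Hypothetical helper: map one anonymous hugepage of the given size.
 * Returns MAP_FAILED if the kernel cannot provide such a page.
 */
static void *
map_anon_hugepage(size_t page_sz)
{
	return mmap(NULL, page_sz, PROT_READ | PROT_WRITE,
			huge_mmap_flags(page_sz), -1, 0);
}
```

Unlike the patch, this sketch lets the kernel pick the address instead of
passing MAP_FIXED, since it has no preallocated VA space to map into.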

First of all, we need to allow empty hugetlbfs mountpoints in
hugepage_info, and handle them correctly (by not trying to create any
files or lock any directories).

Next, we need to reorder the mapping sequence, because the page is not
really allocated until the page fault, and we cannot get its IOVA
address before we trigger the page fault.
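
The ordering constraint can be illustrated with a small sketch on a regular
anonymous page. It assumes /proc/self/pagemap is readable, where bit 63 of
each 64-bit entry is the "page present" flag. The helpers page_is_present()
and present_after_touch() are hypothetical names used only for this
illustration; touch_page() uses the same self-assignment trick as
alloc_seg().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the first word back to itself: forces a fault, keeps contents. */
static void
touch_page(void *addr)
{
	*(volatile int *)addr = *(volatile int *)addr;
}

/*
 * Hypothetical helper: 1 if the page backing addr is present, 0 if not,
 * -1 if /proc/self/pagemap cannot be read.
 */
static int
page_is_present(const void *addr)
{
	uint64_t entry;
	long page_sz = sysconf(_SC_PAGESIZE);
	off_t off;
	ssize_t n;
	int fd;

	if (page_sz <= 0)
		return -1;
	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;
	/* pagemap holds one 64-bit entry per virtual page */
	off = (off_t)((uintptr_t)addr / (uintptr_t)page_sz) *
			(off_t)sizeof(entry);
	n = pread(fd, &entry, sizeof(entry), off);
	close(fd);
	if (n != (ssize_t)sizeof(entry))
		return -1;
	return (int)((entry >> 63) & 1);
}

/* Map a regular anonymous page, fault it in, then query its presence. */
static int
present_after_touch(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	void *va = mmap(NULL, len, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int ret;

	if (va == MAP_FAILED)
		return -1;
	touch_page(va);
	ret = page_is_present(va);
	munmap(va, len);
	return ret;
}
```

Querying the page before touch_page() would find no backing page at all,
which is why the IOVA lookup has to move after the write.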

Finally, decide at compile time whether we are going to be supporting
anonymous hugepages or not, because we cannot check for it at runtime.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    RFC->v1:
    - Drop memfd and instead use mmap() with MAP_HUGETLB. This will drop the
      kernel requirements down to 3.8, and does not impose any restrictions
      on glibc (as far as I know).
    
      Unfortunately, there's a bit of an issue with this approach, because
      mmap() is stupid and will happily ignore unsupported arguments. This
      means that if the binary were to be compiled on a 3.8+ kernel but run
      on a pre-3.8 kernel (such as the currently supported minimum of 3.2),
      then most likely the memory would be allocated using regular pages,
      causing severe performance degradation. No solution to this problem is
      currently known to me.
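
One possible heuristic, which is not part of this patch and is offered only
as a sketch, is to compare the running kernel's release string against 3.8
before relying on the size flags. The helper names release_at_least() and
kernel_has_map_huge_sizes() are hypothetical. Distribution kernels may
backport features, so the version string is at best an approximation.

```c
#include <stdio.h>
#include <sys/utsname.h>

/*
 * Hypothetical helper: 1 if the "major.minor" prefix of release is at
 * least the requested version, 0 if it is older, -1 if unparsable.
 */
static int
release_at_least(const char *release, int major, int minor)
{
	int maj, min;

	if (sscanf(release, "%d.%d", &maj, &min) != 2)
		return -1;
	return maj > major || (maj == major && min >= minor);
}

/* Hypothetical helper: apply the check to the running kernel. */
static int
kernel_has_map_huge_sizes(void)
{
	struct utsname u;

	if (uname(&u) != 0)
		return -1;
	/* 3.8 is the minimum kernel version named in the note above */
	return release_at_least(u.release, 3, 8);
}
```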

 .../linuxapp/eal/eal_hugepage_info.c          |  91 +++++++-----
 lib/librte_eal/linuxapp/eal/eal_memalloc.c    | 130 +++++++++++-------
 lib/librte_eal/linuxapp/eal/eal_memory.c      |   3 +-
 3 files changed, 139 insertions(+), 85 deletions(-)

diff --git a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
index 7f8e2fd9c..3a7d4b222 100644
--- a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
+++ b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
@@ -18,6 +18,8 @@
 #include <sys/queue.h>
 #include <sys/stat.h>
 
+#include <linux/mman.h> /* for hugetlb-related flags */
+
 #include <rte_memory.h>
 #include <rte_eal.h>
 #include <rte_launch.h>
@@ -313,11 +315,49 @@ compare_hpi(const void *a, const void *b)
 	return hpi_b->hugepage_sz - hpi_a->hugepage_sz;
 }
 
+static void
+calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
+{
+	uint64_t total_pages = 0;
+	unsigned int i;
+
+	/*
+	 * first, try to put all hugepages into relevant sockets, but
+	 * if first attempts fails, fall back to collecting all pages
+	 * in one socket and sorting them later
+	 */
+	total_pages = 0;
+	/* we also don't want to do this for legacy init */
+	if (!internal_config.legacy_mem)
+		for (i = 0; i < rte_socket_count(); i++) {
+			int socket = rte_socket_id_by_idx(i);
+			unsigned int num_pages =
+					get_num_hugepages_on_node(
+						dirent->d_name, socket);
+			hpi->num_pages[socket] = num_pages;
+			total_pages += num_pages;
+		}
+	/*
+	 * we failed to sort memory from the get go, so fall
+	 * back to old way
+	 */
+	if (total_pages == 0) {
+		hpi->num_pages[0] = get_num_hugepages(dirent->d_name);
+
+#ifndef RTE_ARCH_64
+		/* for 32-bit systems, limit number of hugepages to
+		 * 1GB per page size */
+		hpi->num_pages[0] = RTE_MIN(hpi->num_pages[0],
+				RTE_PGSIZE_1G / hpi->hugepage_sz);
+#endif
+	}
+}
+
 static int
 hugepage_info_init(void)
 {	const char dirent_start_text[] = "hugepages-";
 	const size_t dirent_start_len = sizeof(dirent_start_text) - 1;
-	unsigned int i, total_pages, num_sizes = 0;
+	unsigned int i, num_sizes = 0;
 	DIR *dir;
 	struct dirent *dirent;
 
@@ -355,6 +395,22 @@ hugepage_info_init(void)
 					"%" PRIu64 " reserved, but no mounted "
 					"hugetlbfs found for that size\n",
 					num_pages, hpi->hugepage_sz);
+			/* if we have kernel support for reserving hugepages
+			 * through mmap, and we're in in-memory mode, treat this
+			 * page size as valid. we cannot be in legacy mode at
+			 * this point because we've checked this earlier in the
+			 * init process.
+			 */
+#ifdef MAP_HUGE_SHIFT
+			if (internal_config.in_memory) {
+				RTE_LOG(DEBUG, EAL, "In-memory mode enabled, "
+					"hugepages of size %" PRIu64 " bytes "
+					"will be allocated anonymously\n",
+					hpi->hugepage_sz);
+				calc_num_pages(hpi, dirent);
+				num_sizes++;
+			}
+#endif
 			continue;
 		}
 
@@ -371,35 +427,7 @@ hugepage_info_init(void)
 		if (clear_hugedir(hpi->hugedir) == -1)
 			break;
 
-		/*
-		 * first, try to put all hugepages into relevant sockets, but
-		 * if first attempts fails, fall back to collecting all pages
-		 * in one socket and sorting them later
-		 */
-		total_pages = 0;
-		/* we also don't want to do this for legacy init */
-		if (!internal_config.legacy_mem)
-			for (i = 0; i < rte_socket_count(); i++) {
-				int socket = rte_socket_id_by_idx(i);
-				unsigned int num_pages =
-						get_num_hugepages_on_node(
-							dirent->d_name, socket);
-				hpi->num_pages[socket] = num_pages;
-				total_pages += num_pages;
-			}
-		/*
-		 * we failed to sort memory from the get go, so fall
-		 * back to old way
-		 */
-		if (total_pages == 0)
-			hpi->num_pages[0] = get_num_hugepages(dirent->d_name);
-
-#ifndef RTE_ARCH_64
-		/* for 32-bit systems, limit number of hugepages to
-		 * 1GB per page size */
-		hpi->num_pages[0] = RTE_MIN(hpi->num_pages[0],
-					    RTE_PGSIZE_1G / hpi->hugepage_sz);
-#endif
+		calc_num_pages(hpi, dirent);
 
 		num_sizes++;
 	}
@@ -423,8 +451,7 @@ hugepage_info_init(void)
 
 		for (j = 0; j < RTE_MAX_NUMA_NODES; j++)
 			num_pages += hpi->num_pages[j];
-		if (strnlen(hpi->hugedir, sizeof(hpi->hugedir)) != 0 &&
-				num_pages > 0)
+		if (num_pages > 0)
 			return 0;
 	}
 
diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
index f1b6d9744..19c53e7af 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
@@ -28,6 +28,7 @@
 #include <numaif.h>
 #endif
 #include <linux/falloc.h>
+#include <linux/mman.h> /* for hugetlb-related mmap flags */
 
 #include <rte_common.h>
 #include <rte_log.h>
@@ -40,6 +41,15 @@
 #include "eal_internal_cfg.h"
 #include "eal_memalloc.h"
 
+const int anonymous_hugepages_supported =
+#ifdef MAP_HUGE_SHIFT
+		1;
+#define RTE_MAP_HUGE_SHIFT MAP_HUGE_SHIFT
+#else
+		0;
+#define RTE_MAP_HUGE_SHIFT 26
+#endif
+
 /*
  * not all kernel version support fallocate on hugetlbfs, so fall back to
  * ftruncate and disallow deallocation if fallocate is not supported.
@@ -486,47 +496,63 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	int cur_socket_id = 0;
 #endif
 	uint64_t map_offset;
+	rte_iova_t iova;
+	void *va;
 	char path[PATH_MAX];
 	int ret = 0;
 	int fd;
 	size_t alloc_sz;
 
-	/* takes out a read lock on segment or segment list */
-	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
-	if (fd < 0) {
-		RTE_LOG(ERR, EAL, "Couldn't get fd on hugepage file\n");
-		return -1;
-	}
-
 	alloc_sz = hi->hugepage_sz;
-	if (internal_config.single_file_segments) {
-		map_offset = seg_idx * alloc_sz;
-		ret = resize_hugefile(fd, path, list_idx, seg_idx, map_offset,
-				alloc_sz, true);
-		if (ret < 0)
-			goto resized;
+	if (internal_config.in_memory && anonymous_hugepages_supported) {
+		int log2, flags;
+
+		log2 = rte_log2_u32(alloc_sz);
+		/* as per mmap() manpage, all page sizes are log2 of page size
+		 * shifted by MAP_HUGE_SHIFT
+		 */
+		flags = (log2 << RTE_MAP_HUGE_SHIFT) | MAP_HUGETLB | MAP_FIXED |
+				MAP_PRIVATE | MAP_ANONYMOUS;
+		fd = -1;
+		va = mmap(addr, alloc_sz, PROT_READ | PROT_WRITE, flags, -1, 0);
 	} else {
-		map_offset = 0;
-		if (ftruncate(fd, alloc_sz) < 0) {
-			RTE_LOG(DEBUG, EAL, "%s(): ftruncate() failed: %s\n",
-				__func__, strerror(errno));
-			goto resized;
+		/* takes out a read lock on segment or segment list */
+		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+		if (fd < 0) {
+			RTE_LOG(ERR, EAL, "Couldn't get fd on hugepage file\n");
+			return -1;
 		}
-		if (internal_config.hugepage_unlink) {
-			if (unlink(path)) {
-				RTE_LOG(DEBUG, EAL, "%s(): unlink() failed: %s\n",
+
+		if (internal_config.single_file_segments) {
+			map_offset = seg_idx * alloc_sz;
+			ret = resize_hugefile(fd, path, list_idx, seg_idx,
+					map_offset, alloc_sz, true);
+			if (ret < 0)
+				goto resized;
+		} else {
+			map_offset = 0;
+			if (ftruncate(fd, alloc_sz) < 0) {
+				RTE_LOG(DEBUG, EAL, "%s(): ftruncate() failed: %s\n",
 					__func__, strerror(errno));
 				goto resized;
 			}
+			if (internal_config.hugepage_unlink) {
+				if (unlink(path)) {
+					RTE_LOG(DEBUG, EAL, "%s(): unlink() failed: %s\n",
+						__func__, strerror(errno));
+					goto resized;
+				}
+			}
 		}
-	}
 
-	/*
-	 * map the segment, and populate page tables, the kernel fills this
-	 * segment with zeros if it's a new page.
-	 */
-	void *va = mmap(addr, alloc_sz, PROT_READ | PROT_WRITE,
-			MAP_SHARED | MAP_POPULATE | MAP_FIXED, fd, map_offset);
+		/*
+		 * map the segment, and populate page tables, the kernel fills
+		 * this segment with zeros if it's a new page.
+		 */
+		va = mmap(addr, alloc_sz, PROT_READ | PROT_WRITE,
+				MAP_SHARED | MAP_POPULATE | MAP_FIXED, fd,
+				map_offset);
+	}
 
 	if (va == MAP_FAILED) {
 		RTE_LOG(DEBUG, EAL, "%s(): mmap() failed: %s\n", __func__,
@@ -539,24 +565,6 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		goto resized;
 	}
 
-	rte_iova_t iova = rte_mem_virt2iova(addr);
-	if (iova == RTE_BAD_PHYS_ADDR) {
-		RTE_LOG(DEBUG, EAL, "%s(): can't get IOVA addr\n",
-			__func__);
-		goto mapped;
-	}
-
-#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
-	move_pages(getpid(), 1, &addr, NULL, &cur_socket_id, 0);
-
-	if (cur_socket_id != socket_id) {
-		RTE_LOG(DEBUG, EAL,
-				"%s(): allocation happened on wrong socket (wanted %d, got %d)\n",
-			__func__, socket_id, cur_socket_id);
-		goto mapped;
-	}
-#endif
-
 	/* In linux, hugetlb limitations, like cgroup, are
 	 * enforced at fault time instead of mmap(), even
 	 * with the option of MAP_POPULATE. Kernel will send
@@ -569,9 +577,6 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 			(unsigned int)(alloc_sz >> 20));
 		goto mapped;
 	}
-	/* for non-single file segments, we can close fd here */
-	if (!internal_config.single_file_segments)
-		close(fd);
 
 	/* we need to trigger a write to the page to enforce page fault and
 	 * ensure that page is accessible to us, but we can't overwrite value
@@ -580,6 +585,28 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	 */
 	*(volatile int *)addr = *(volatile int *)addr;
 
+	iova = rte_mem_virt2iova(addr);
+	if (iova == RTE_BAD_PHYS_ADDR) {
+		RTE_LOG(DEBUG, EAL, "%s(): can't get IOVA addr\n",
+			__func__);
+		goto mapped;
+	}
+
+#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
+	move_pages(getpid(), 1, &addr, NULL, &cur_socket_id, 0);
+
+	if (cur_socket_id != socket_id) {
+		RTE_LOG(DEBUG, EAL,
+				"%s(): allocation happened on wrong socket (wanted %d, got %d)\n",
+			__func__, socket_id, cur_socket_id);
+		goto mapped;
+	}
+#endif
+	/* for non-single file segments that aren't in-memory, we can close fd
+	 * here */
+	if (!internal_config.single_file_segments && !internal_config.in_memory)
+		close(fd);
+
 	ms->addr = addr;
 	ms->hugepage_sz = alloc_sz;
 	ms->len = alloc_sz;
@@ -600,6 +627,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	} else {
 		/* only remove file if we can take out a write lock */
 		if (internal_config.hugepage_unlink == 0 &&
+				internal_config.in_memory == 0 &&
 				lock(fd, LOCK_EX) == 1)
 			unlink(path);
 		close(fd);
@@ -709,7 +737,7 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	 * during init, we already hold a write lock, so don't try to take out
 	 * another one.
 	 */
-	if (wa->hi->lock_descriptor == -1) {
+	if (wa->hi->lock_descriptor == -1 && !internal_config.in_memory) {
 		dir_fd = open(wa->hi->hugedir, O_RDONLY);
 		if (dir_fd < 0) {
 			RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n",
@@ -813,7 +841,7 @@ free_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	 * during init, we already hold a write lock, so don't try to take out
 	 * another one.
 	 */
-	if (wa->hi->lock_descriptor == -1) {
+	if (wa->hi->lock_descriptor == -1 && !internal_config.in_memory) {
 		dir_fd = open(wa->hi->hugedir, O_RDONLY);
 		if (dir_fd < 0) {
 			RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n",
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index cb784e1c3..a98d8c036 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -1060,8 +1060,7 @@ get_socket_mem_size(int socket)
 
 	for (i = 0; i < internal_config.num_hugepage_sizes; i++){
 		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
-		if (strnlen(hpi->hugedir, sizeof(hpi->hugedir)) != 0)
-			size += hpi->hugepage_sz * hpi->num_pages[socket];
+		size += hpi->hugepage_sz * hpi->num_pages[socket];
 	}
 
 	return size;
-- 
2.17.0

