DPDK patches and discussions
* [dpdk-dev] [PATCH 21.11 0/3] eal: add memory pre-allocation from existing files
@ 2021-07-05 12:49 Dmitry Kozlyuk
  2021-07-05 12:49 ` [dpdk-dev] [PATCH 21.11 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
                   ` (3 more replies)
  0 siblings, 4 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-07-05 12:49 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov

Hugepage allocation from the system takes time, resulting in slow
startup or sporadic delays later. Most of the time spent in the kernel
is zero-filling memory for security reasons, which may be irrelevant
in a controlled environment. The bottleneck is memory access speed,
so to speed up allocation, the amount of memory cleared must be reduced.
We propose a new EAL option, --mem-file FILE1,FILE2,..., to quickly
allocate dirty pages from existing files and clear them as necessary.
A new malloc_perf_autotest is provided to estimate the impact.
More details are explained in the relevant patches.

Dmitry Kozlyuk (2):
  eal/linux: make hugetlbfs analysis reusable
  app/test: add allocator performance autotest

Viacheslav Ovsiienko (1):
  eal: add memory pre-allocation from existing files

 app/test/meson.build                          |   2 +
 app/test/test_malloc_perf.c                   | 155 +++++++++
 doc/guides/linux_gsg/linux_eal_parameters.rst |  17 +
 lib/eal/common/eal_common_dynmem.c            |   6 +
 lib/eal/common/eal_common_options.c           |  23 ++
 lib/eal/common/eal_internal_cfg.h             |   4 +
 lib/eal/common/eal_memalloc.h                 |   8 +-
 lib/eal/common/eal_options.h                  |   2 +
 lib/eal/common/malloc_elem.c                  |   5 +
 lib/eal/common/malloc_heap.h                  |   8 +
 lib/eal/common/rte_malloc.c                   |  16 +-
 lib/eal/include/rte_memory.h                  |   4 +-
 lib/eal/linux/eal.c                           |  28 ++
 lib/eal/linux/eal_hugepage_info.c             | 158 ++++++---
 lib/eal/linux/eal_hugepage_info.h             |  39 +++
 lib/eal/linux/eal_memalloc.c                  | 328 +++++++++++++++++-
 16 files changed, 733 insertions(+), 70 deletions(-)
 create mode 100644 app/test/test_malloc_perf.c
 create mode 100644 lib/eal/linux/eal_hugepage_info.h

-- 
2.25.1



* [dpdk-dev] [PATCH 21.11 1/3] eal/linux: make hugetlbfs analysis reusable
  2021-07-05 12:49 [dpdk-dev] [PATCH 21.11 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
@ 2021-07-05 12:49 ` Dmitry Kozlyuk
  2021-07-05 12:49 ` [dpdk-dev] [PATCH 21.11 2/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-07-05 12:49 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Viacheslav Ovsiienko

get_hugepage_dir() searched for a hugetlbfs mount with a given page size
using handcrafted parsing of /proc/mounts, mixing traversal logic with
the selection of the needed entry. Separate the code that enumerates
hugetlbfs mounts into eal_hugepage_mount_walk(), which takes a callback
that can inspect already-parsed entries. Use the mntent(3) API for
parsing. This allows the enumeration logic to be reused in subsequent
patches.
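
For illustration, a caller of the new walk API might look like this
(hypothetical snippet, not part of this patch; assumes <inttypes.h>
and EAL logging are available):

  static int
  count_mounts(const char *path, uint64_t hugepage_sz, void *arg)
  {
  	unsigned int *count = arg; /* user data passed via cb_arg */

  	RTE_LOG(DEBUG, EAL, "hugetlbfs at %s, page size %" PRIu64 "\n",
  			path, hugepage_sz);
  	(*count)++;
  	return 0; /* 0 continues the walk, 1 would stop it */
  }

  unsigned int count = 0;
  if (eal_hugepage_mount_walk(count_mounts, &count) < 0)
  	RTE_LOG(ERR, EAL, "Cannot enumerate hugetlbfs mounts\n");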

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 lib/eal/linux/eal_hugepage_info.c | 153 +++++++++++++++++++-----------
 lib/eal/linux/eal_hugepage_info.h |  39 ++++++++
 2 files changed, 135 insertions(+), 57 deletions(-)
 create mode 100644 lib/eal/linux/eal_hugepage_info.h

diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index d97792cade..a090c0a5b5 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -12,6 +12,7 @@
 #include <stdio.h>
 #include <fnmatch.h>
 #include <inttypes.h>
+#include <mntent.h>
 #include <stdarg.h>
 #include <unistd.h>
 #include <errno.h>
@@ -34,6 +35,7 @@
 #include "eal_private.h"
 #include "eal_internal_cfg.h"
 #include "eal_hugepages.h"
+#include "eal_hugepage_info.h"
 #include "eal_filesystem.h"
 
 static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
@@ -195,73 +197,110 @@ get_default_hp_size(void)
 	return size;
 }
 
-static int
-get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
+int
+eal_hugepage_mount_walk(eal_hugepage_mount_walk_cb *cb, void *cb_arg)
 {
-	enum proc_mount_fieldnames {
-		DEVICE = 0,
-		MOUNTPT,
-		FSTYPE,
-		OPTIONS,
-		_FIELDNAME_MAX
-	};
-	static uint64_t default_size = 0;
-	const char proc_mounts[] = "/proc/mounts";
-	const char hugetlbfs_str[] = "hugetlbfs";
-	const size_t htlbfs_str_len = sizeof(hugetlbfs_str) - 1;
-	const char pagesize_opt[] = "pagesize=";
-	const size_t pagesize_opt_len = sizeof(pagesize_opt) - 1;
-	const char split_tok = ' ';
-	char *splitstr[_FIELDNAME_MAX];
-	char buf[BUFSIZ];
-	int retval = -1;
-	const struct internal_config *internal_conf =
-		eal_get_internal_configuration();
-
-	FILE *fd = fopen(proc_mounts, "r");
-	if (fd == NULL)
-		rte_panic("Cannot open %s\n", proc_mounts);
+	static const char PATH[] = "/proc/mounts";
+	static const char OPTION[] = "pagesize";
+
+	static uint64_t default_size;
+
+	FILE *f = NULL;
+	struct mntent *m;
+	char *hugepage_sz_str;
+	uint64_t hugepage_sz;
+	int ret = -1;
+
+	f = setmntent(PATH, "r");
+	if (f == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): setmntent(%s): %s\n",
+				__func__, PATH, strerror(errno));
+		goto exit;
+	}
 
 	if (default_size == 0)
 		default_size = get_default_hp_size();
 
-	while (fgets(buf, sizeof(buf), fd)){
-		if (rte_strsplit(buf, sizeof(buf), splitstr, _FIELDNAME_MAX,
-				split_tok) != _FIELDNAME_MAX) {
-			RTE_LOG(ERR, EAL, "Error parsing %s\n", proc_mounts);
-			break; /* return NULL */
-		}
+	ret = 0;
+	do {
+		m = getmntent(f);
+		if (m == NULL)
+			break;
 
-		/* we have a specified --huge-dir option, only examine that dir */
-		if (internal_conf->hugepage_dir != NULL &&
-				strcmp(splitstr[MOUNTPT], internal_conf->hugepage_dir) != 0)
+		if (strcmp(m->mnt_fsname, "hugetlbfs") != 0)
 			continue;
 
-		if (strncmp(splitstr[FSTYPE], hugetlbfs_str, htlbfs_str_len) == 0){
-			const char *pagesz_str = strstr(splitstr[OPTIONS], pagesize_opt);
-
-			/* if no explicit page size, the default page size is compared */
-			if (pagesz_str == NULL){
-				if (hugepage_sz == default_size){
-					strlcpy(hugedir, splitstr[MOUNTPT], len);
-					retval = 0;
-					break;
-				}
-			}
-			/* there is an explicit page size, so check it */
-			else {
-				uint64_t pagesz = rte_str_to_size(&pagesz_str[pagesize_opt_len]);
-				if (pagesz == hugepage_sz) {
-					strlcpy(hugedir, splitstr[MOUNTPT], len);
-					retval = 0;
-					break;
-				}
+		hugepage_sz_str = hasmntopt(m, OPTION);
+		if (hugepage_sz_str != NULL) {
+			hugepage_sz_str += strlen(OPTION) + 1; /* +1 for '=' */
+			hugepage_sz = rte_str_to_size(hugepage_sz_str);
+			if (hugepage_sz == 0) {
+				RTE_LOG(DEBUG, EAL, "Cannot parse hugepage size from '%s' for %s\n",
+						m->mnt_opts, m->mnt_dir);
+				continue;
 			}
-		} /* end if strncmp hugetlbfs */
-	} /* end while fgets */
+		} else {
+			RTE_LOG(DEBUG, EAL, "Hugepage filesystem at %s without %s option\n",
+					m->mnt_dir, OPTION);
+			hugepage_sz = default_size;
+		}
 
-	fclose(fd);
-	return retval;
+		if (cb(m->mnt_dir, hugepage_sz, cb_arg) != 0)
+			break;
+	} while (m != NULL);
+
+	if (ferror(f) && !feof(f)) {
+		RTE_LOG(DEBUG, EAL, "%s(): getmntent(): %s\n",
+				__func__, strerror(errno));
+		ret = -1;
+		goto exit;
+	}
+
+exit:
+	if (f != NULL)
+		endmntent(f);
+	return ret;
+}
+
+struct match_hugepage_mount_arg {
+	uint64_t hugepage_sz;
+	char *hugedir;
+	int hugedir_len;
+	bool done;
+};
+
+static int
+match_hugepage_mount(const char *path, uint64_t hugepage_sz, void *cb_arg)
+{
+	const struct internal_config *internal_conf =
+		eal_get_internal_configuration();
+	struct match_hugepage_mount_arg *arg = cb_arg;
+
+	/* we have a specified --huge-dir option, only examine that dir */
+	if (internal_conf->hugepage_dir != NULL &&
+			strcmp(path, internal_conf->hugepage_dir) != 0)
+		return 0;
+
+	if (hugepage_sz == arg->hugepage_sz) {
+		strlcpy(arg->hugedir, path, arg->hugedir_len);
+		arg->done = true;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int
+get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
+{
+	struct match_hugepage_mount_arg arg = {
+		.hugepage_sz = hugepage_sz,
+		.hugedir = hugedir,
+		.hugedir_len = len,
+		.done = false,
+	};
+	int ret = eal_hugepage_mount_walk(match_hugepage_mount, &arg);
+	return ret == 0 && arg.done ? 0 : -1;
 }
 
 /*
diff --git a/lib/eal/linux/eal_hugepage_info.h b/lib/eal/linux/eal_hugepage_info.h
new file mode 100644
index 0000000000..c7efa37c66
--- /dev/null
+++ b/lib/eal/linux/eal_hugepage_info.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 NVIDIA CORPORATION & AFFILIATES.
+ */
+
+#ifndef _EAL_HUGEPAGE_INFO_
+#define _EAL_HUGEPAGE_INFO_
+
+#include <stdint.h>
+
+/**
+ * Function called for each hugetlbfs mount point.
+ *
+ * @param path
+ *  Mount point directory.
+ * @param hugepage_sz
+ *  Hugepage size for the mount or default system hugepage size.
+ * @param arg
+ *  User data.
+ *
+ * @return
+ *  0 to continue walking, 1 to stop.
+ */
+typedef int (eal_hugepage_mount_walk_cb)(const char *path, uint64_t hugepage_sz,
+					 void *arg);
+
+/**
+ * Enumerate hugetlbfs mount points.
+ *
+ * @param cb
+ *  Function called for each mount point.
+ * @param cb_arg
+ *  User data passed to the callback.
+ *
+ * @return
+ *  0 on success, negative on failure.
+ */
+int eal_hugepage_mount_walk(eal_hugepage_mount_walk_cb *cb, void *cb_arg);
+
+#endif /* _EAL_HUGEPAGE_INFO_ */
-- 
2.25.1



* [dpdk-dev] [PATCH 21.11 2/3] eal: add memory pre-allocation from existing files
  2021-07-05 12:49 [dpdk-dev] [PATCH 21.11 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
  2021-07-05 12:49 ` [dpdk-dev] [PATCH 21.11 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
@ 2021-07-05 12:49 ` Dmitry Kozlyuk
  2021-07-05 12:49 ` [dpdk-dev] [PATCH 21.11 3/3] app/test: add allocator performance autotest Dmitry Kozlyuk
  2021-07-16 11:08 ` [dpdk-dev] [PATCH 21.11 v2 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
  3 siblings, 0 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-07-05 12:49 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Viacheslav Ovsiienko

From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

The primary DPDK process launch might take a long time if the initially
allocated memory is large. In practice, allocating 1 TB of memory
over 1 GB hugepages on Linux takes tens of seconds. Fast restart
is highly desired for some applications, and the launch delay presents
a problem.

The primary delay happens in this call trace:
  rte_eal_init()
    rte_eal_memory_init()
      rte_eal_hugepage_init()
        eal_dynmem_hugepage_init()
          eal_memalloc_alloc_seg_bulk()
            alloc_seg()
              mmap()

The largest part of the time spent in mmap() is filling the memory
with zeros. The kernel does so to prevent data leakage from the process
that last used the page. However, in a controlled environment this
may not be an issue, while performance is. (The Linux-specific
MAP_UNINITIALIZED flag allows mapping without clearing, but it is
disabled in all popular distributions for the reason above.)

It is proposed to add a new EAL option, --mem-file FILE1,FILE2,...,
to map hugepages "as is" from the specified FILEs in hugetlbfs.
Compared to using external memory for the task, the EAL option requires
no changes to application code, while allowing the administrator
to control hugepage sizes and their NUMA affinity.

Limitations of the feature:

* Linux-specific (only Linux maps hugepages from files).
* Incompatible with --legacy-mem (partially replaces it).
* Incompatible with --single-file-segments
  (--mem-file FILEs can contain as many segments as needed).
* Incompatible with --in-memory (logically).

A warning about possible security implications is printed
when --mem-file is used.
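
Segments allocated from these files are marked with a new flag,
RTE_MEMSEG_FLAG_PRE_ALLOCATED, in rte_memory.h, so they can be
identified with the existing rte_memseg_walk(); a hypothetical
inspection callback (not part of this patch):

  static int
  dump_preallocated(const struct rte_memseg_list *msl __rte_unused,
  		const struct rte_memseg *ms, void *arg __rte_unused)
  {
  	if (ms->flags & RTE_MEMSEG_FLAG_PRE_ALLOCATED)
  		printf("pre-allocated segment at %p, len %zu\n",
  				ms->addr, ms->len);
  	return 0; /* continue the walk */
  }

  rte_memseg_walk(dump_preallocated, NULL);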

Until this patch, the DPDK allocator always cleared memory on freeing,
so that it did not have to do so on allocation, while new memory
was cleared by the kernel. When --mem-file is in use, DPDK clears memory
after allocation in rte_zmalloc() and does not clear it on freeing.
Effectively, the user trades fast startup for an occasional allocation
slowdown whenever clearing is absolutely necessary. When memory is
recycled, it is cleared again, which is suboptimal per se but saves
the complication of memory management.
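
From the application's point of view, the rte_zmalloc() contract is
unchanged; a trivial sketch (assuming <assert.h>, <stdint.h> and
rte_malloc.h):

  static void
  zmalloc_contract(void)
  {
  	uint32_t *counters = rte_zmalloc("counters",
  			64 * sizeof(*counters), 0);

  	if (counters == NULL)
  		return;
  	/* Zero-filled either way; only the clearing point differs:
  	 * on free by default, on allocation with --mem-file.
  	 */
  	assert(counters[0] == 0);
  	rte_free(counters);
  }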

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 doc/guides/linux_gsg/linux_eal_parameters.rst |  17 +
 lib/eal/common/eal_common_dynmem.c            |   6 +
 lib/eal/common/eal_common_options.c           |  23 ++
 lib/eal/common/eal_internal_cfg.h             |   4 +
 lib/eal/common/eal_memalloc.h                 |   8 +-
 lib/eal/common/eal_options.h                  |   2 +
 lib/eal/common/malloc_elem.c                  |   5 +
 lib/eal/common/malloc_heap.h                  |   8 +
 lib/eal/common/rte_malloc.c                   |  16 +-
 lib/eal/include/rte_memory.h                  |   4 +-
 lib/eal/linux/eal.c                           |  28 ++
 lib/eal/linux/eal_hugepage_info.c             |   5 +
 lib/eal/linux/eal_memalloc.c                  | 328 +++++++++++++++++-
 13 files changed, 441 insertions(+), 13 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst
index bd3977cb3d..b465feaea8 100644
--- a/doc/guides/linux_gsg/linux_eal_parameters.rst
+++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
@@ -92,6 +92,23 @@ Memory-related options
 
     Free hugepages back to system exactly as they were originally allocated.
 
+*   ``--mem-file <pre-allocated files>``
+
+    Use memory from pre-allocated files in ``hugetlbfs`` without clearing it;
+    when this memory is exhausted, switch to default dynamic allocation.
+    This speeds up startup compared to ``--legacy-mem`` while also avoiding
+    later delays for allocating new hugepages. One downside is slowdown
+    of all zeroed memory allocations. Security warning: an application
+    can access contents left by previous users of hugepages. Multiple files
+    can be pre-allocated in ``hugetlbfs`` with different page sizes,
+    on desired NUMA nodes, using ``mount`` options and ``numactl``:
+
+        --mem-file /mnt/huge-1G/node0,/mnt/huge-1G/node1,/mnt/huge-2M/extra
+
+    This option is incompatible with ``--legacy-mem``, ``--in-memory``,
+    and ``--single-file-segments``. Primary and secondary processes
+    must specify exactly the same list of files.
+
 Other options
 ~~~~~~~~~~~~~
 
diff --git a/lib/eal/common/eal_common_dynmem.c b/lib/eal/common/eal_common_dynmem.c
index 7c5437ddfa..abcf22f097 100644
--- a/lib/eal/common/eal_common_dynmem.c
+++ b/lib/eal/common/eal_common_dynmem.c
@@ -272,6 +272,12 @@ eal_dynmem_hugepage_init(void)
 			internal_conf->num_hugepage_sizes) < 0)
 		return -1;
 
+#ifdef RTE_EXEC_ENV_LINUX
+	/* pre-allocate pages from --mem-file option files */
+	if (eal_memalloc_memfile_alloc(used_hp) < 0)
+		return -1;
+#endif
+
 	for (hp_sz_idx = 0;
 			hp_sz_idx < (int)internal_conf->num_hugepage_sizes;
 			hp_sz_idx++) {
diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index ff5861b5f3..c729c36630 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -86,6 +86,7 @@ eal_long_options[] = {
 	{OPT_MASTER_LCORE,      1, NULL, OPT_MASTER_LCORE_NUM     },
 	{OPT_MAIN_LCORE,        1, NULL, OPT_MAIN_LCORE_NUM       },
 	{OPT_MBUF_POOL_OPS_NAME, 1, NULL, OPT_MBUF_POOL_OPS_NAME_NUM},
+	{OPT_MEM_FILE,          1, NULL, OPT_MEM_FILE_NUM         },
 	{OPT_NO_HPET,           0, NULL, OPT_NO_HPET_NUM          },
 	{OPT_NO_HUGE,           0, NULL, OPT_NO_HUGE_NUM          },
 	{OPT_NO_PCI,            0, NULL, OPT_NO_PCI_NUM           },
@@ -1898,6 +1899,8 @@ eal_cleanup_config(struct internal_config *internal_cfg)
 		free(internal_cfg->hugepage_dir);
 	if (internal_cfg->user_mbuf_pool_ops_name != NULL)
 		free(internal_cfg->user_mbuf_pool_ops_name);
+	if (internal_cfg->mem_file[0])
+		free(internal_cfg->mem_file[0]);
 
 	return 0;
 }
@@ -2018,6 +2021,26 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"amount of reserved memory can be adjusted with "
 			"-m or --"OPT_SOCKET_MEM"\n");
 	}
+	if (internal_cfg->mem_file[0] && internal_conf->legacy_mem) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_LEGACY_MEM"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_conf->no_hugetlbfs) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_NO_HUGE"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_conf->in_memory) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_IN_MEMORY"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_conf->single_file_segments) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_SINGLE_FILE_SEGMENTS"\n");
+		return -1;
+	}
 
 	return 0;
 }
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index d6c0470eb8..814d5c66e1 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -22,6 +22,9 @@
 #define MAX_HUGEPAGE_SIZES 3  /**< support up to 3 page sizes */
 #endif
 
+#define MAX_MEMFILE_ITEMS (MAX_HUGEPAGE_SIZES * RTE_MAX_NUMA_NODES)
+/**< Maximal number of mem-file parameters. */
+
 /*
  * internal configuration structure for the number, size and
  * mount points of hugepages
@@ -83,6 +86,7 @@ struct internal_config {
 	rte_uuid_t vfio_vf_token;
 	char *hugefile_prefix;      /**< the base filename of hugetlbfs files */
 	char *hugepage_dir;         /**< specific hugetlbfs directory to use */
+	char *mem_file[MAX_MEMFILE_ITEMS]; /**< pre-allocated memory files */
 	char *user_mbuf_pool_ops_name;
 			/**< user defined mbuf pool ops name */
 	unsigned num_hugepage_sizes;      /**< how many sizes on this system */
diff --git a/lib/eal/common/eal_memalloc.h b/lib/eal/common/eal_memalloc.h
index ebc3a6f6c1..d92c9a167b 100644
--- a/lib/eal/common/eal_memalloc.h
+++ b/lib/eal/common/eal_memalloc.h
@@ -8,7 +8,7 @@
 #include <stdbool.h>
 
 #include <rte_memory.h>
-
+#include "eal_internal_cfg.h"
 /*
  * Allocate segment of specified page size.
  */
@@ -96,4 +96,10 @@ eal_memalloc_init(void);
 int
 eal_memalloc_cleanup(void);
 
+int
+eal_memalloc_memfile_init(void);
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa);
+
 #endif /* EAL_MEMALLOC_H */
diff --git a/lib/eal/common/eal_options.h b/lib/eal/common/eal_options.h
index 7b348e707f..c6c634b2b2 100644
--- a/lib/eal/common/eal_options.h
+++ b/lib/eal/common/eal_options.h
@@ -93,6 +93,8 @@ enum {
 	OPT_NO_TELEMETRY_NUM,
 #define OPT_FORCE_MAX_SIMD_BITWIDTH  "force-max-simd-bitwidth"
 	OPT_FORCE_MAX_SIMD_BITWIDTH_NUM,
+#define OPT_MEM_FILE          "mem-file"
+	OPT_MEM_FILE_NUM,
 
 	/* legacy option that will be removed in future */
 #define OPT_PCI_BLACKLIST     "pci-blacklist"
diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
index c2c9461f1d..6e71029a3c 100644
--- a/lib/eal/common/malloc_elem.c
+++ b/lib/eal/common/malloc_elem.c
@@ -578,8 +578,13 @@ malloc_elem_free(struct malloc_elem *elem)
 	/* decrease heap's count of allocated elements */
 	elem->heap->alloc_count--;
 
+#ifdef MALLOC_DEBUG
 	/* poison memory */
 	memset(ptr, MALLOC_POISON, data_len);
+#else
+	if (!malloc_clear_on_alloc())
+		memset(ptr, 0, data_len);
+#endif
 
 	return elem;
 }
diff --git a/lib/eal/common/malloc_heap.h b/lib/eal/common/malloc_heap.h
index 772736b53f..72b64d8052 100644
--- a/lib/eal/common/malloc_heap.h
+++ b/lib/eal/common/malloc_heap.h
@@ -10,6 +10,7 @@
 
 #include <rte_malloc.h>
 #include <rte_spinlock.h>
+#include "eal_private.h"
 
 /* Number of free lists per heap, grouped by size. */
 #define RTE_HEAP_NUM_FREELISTS  13
@@ -48,6 +49,13 @@ malloc_get_numa_socket(void)
 	return socket_id;
 }
 
+static inline bool
+malloc_clear_on_alloc(void)
+{
+	const struct internal_config *cfg = eal_get_internal_configuration();
+	return cfg->mem_file[0] != NULL;
+}
+
 void *
 malloc_heap_alloc(const char *type, size_t size, int socket, unsigned int flags,
 		size_t align, size_t bound, bool contig);
diff --git a/lib/eal/common/rte_malloc.c b/lib/eal/common/rte_malloc.c
index 9d39e58c08..ce94268aca 100644
--- a/lib/eal/common/rte_malloc.c
+++ b/lib/eal/common/rte_malloc.c
@@ -113,17 +113,23 @@ rte_malloc(const char *type, size_t size, unsigned align)
 void *
 rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket)
 {
+	bool zero;
 	void *ptr = rte_malloc_socket(type, size, align, socket);
 
-#ifdef RTE_MALLOC_DEBUG
 	/*
 	 * If DEBUG is enabled, then freed memory is marked with poison
-	 * value and set to zero on allocation.
-	 * If DEBUG is not enabled then  memory is already zeroed.
+	 * value and must be set to zero on allocation.
+	 * If DEBUG is not enabled then it is configurable
+	 * whether memory comes already set to zero by memalloc or on free
+	 * or it must be set to zero here.
 	 */
-	if (ptr != NULL)
-		memset(ptr, 0, size);
+#ifdef RTE_MALLOC_DEBUG
+	zero = true;
+#else
+	zero = malloc_clear_on_alloc();
 #endif
+	if (ptr != NULL && zero)
+		memset(ptr, 0, size);
 
 	rte_eal_trace_mem_zmalloc(type, size, align, socket, ptr);
 	return ptr;
diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h
index bba9b5300a..9a2b191314 100644
--- a/lib/eal/include/rte_memory.h
+++ b/lib/eal/include/rte_memory.h
@@ -40,7 +40,9 @@ extern "C" {
 /**
  * Physical memory segment descriptor.
  */
-#define RTE_MEMSEG_FLAG_DO_NOT_FREE (1 << 0)
+#define RTE_MEMSEG_FLAG_DO_NOT_FREE   (1 << 0)
+#define RTE_MEMSEG_FLAG_PRE_ALLOCATED (1 << 1)
+
 /**< Prevent this segment from being freed back to the OS. */
 struct rte_memseg {
 	rte_iova_t iova;            /**< Start IO address. */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index ba19fc6347..13c469a510 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -548,6 +548,7 @@ eal_usage(const char *prgname)
 	       "  --"OPT_LEGACY_MEM"        Legacy memory mode (no dynamic allocation, contiguous segments)\n"
 	       "  --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n"
 	       "  --"OPT_MATCH_ALLOCATIONS" Free hugepages exactly as allocated\n"
+	       "  --"OPT_MEM_FILE"          Comma-separated list of files in hugetlbfs.\n"
 	       "\n");
 	/* Allow the application to print its usage message too if hook is set */
 	if (hook) {
@@ -678,6 +679,22 @@ eal_log_level_parse(int argc, char **argv)
 	optarg = old_optarg;
 }
 
+static int
+eal_parse_memfile_arg(const char *arg, char **mem_file)
+{
+	int ret;
+
+	char *copy = strdup(arg);
+	if (copy == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot store --"OPT_MEM_FILE" names\n");
+		return -1;
+	}
+
+	ret = rte_strsplit(copy, strlen(copy), mem_file,
+			MAX_MEMFILE_ITEMS, ',');
+	return ret <= 0 ? -1 : 0;
+}
+
 /* Parse the argument given in the command line of the application */
 static int
 eal_parse_args(int argc, char **argv)
@@ -819,6 +836,17 @@ eal_parse_args(int argc, char **argv)
 			internal_conf->match_allocations = 1;
 			break;
 
+		case OPT_MEM_FILE_NUM:
+			if (eal_parse_memfile_arg(optarg,
+					internal_conf->mem_file) < 0) {
+				RTE_LOG(ERR, EAL, "invalid parameters for --"
+						OPT_MEM_FILE "\n");
+				eal_usage(prgname);
+				ret = -1;
+				goto out;
+			}
+			break;
+
 		default:
 			if (opt < OPT_LONG_MIN_NUM && isprint(opt)) {
 				RTE_LOG(ERR, EAL, "Option %c is not supported "
diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index a090c0a5b5..0c262342a5 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -37,6 +37,7 @@
 #include "eal_hugepages.h"
 #include "eal_hugepage_info.h"
 #include "eal_filesystem.h"
+#include "eal_memalloc.h"
 
 static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
 static const char sys_pages_numa_dir_path[] = "/sys/devices/system/node";
@@ -515,6 +516,10 @@ hugepage_info_init(void)
 	qsort(&internal_conf->hugepage_info[0], num_sizes,
 	      sizeof(internal_conf->hugepage_info[0]), compare_hpi);
 
+	/* add pre-allocated pages with --mem-file option to available ones */
+	if (eal_memalloc_memfile_init())
+		return -1;
+
 	/* now we have all info, check we have at least one valid size */
 	for (i = 0; i < num_sizes; i++) {
 		/* pages may no longer all be on socket 0, so check all */
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 0ec8542283..c2b3586204 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 #include <limits.h>
 #include <fcntl.h>
+#include <mntent.h>
 #include <sys/ioctl.h>
 #include <sys/time.h>
 #include <signal.h>
@@ -41,6 +42,7 @@
 #include <rte_spinlock.h>
 
 #include "eal_filesystem.h"
+#include "eal_hugepage_info.h"
 #include "eal_internal_cfg.h"
 #include "eal_memalloc.h"
 #include "eal_memcfg.h"
@@ -102,6 +104,19 @@ static struct {
 	int count; /**< entries used in an array */
 } fd_list[RTE_MAX_MEMSEG_LISTS];
 
+struct memfile {
+	char *fname;		/**< file name */
+	uint64_t hugepage_sz;	/**< size of a huge page */
+	uint32_t num_pages;	/**< number of pages */
+	uint32_t num_allocated;	/**< number of already allocated pages */
+	int socket_id;		/**< Socket ID  */
+	int fd;			/**< file descriptor */
+};
+
+struct memfile mem_file[MAX_MEMFILE_ITEMS];
+
+static int alloc_memfile;
+
 /** local copy of a memory map, used to synchronize memory hotplug in MP */
 static struct rte_memseg_list local_memsegs[RTE_MAX_MEMSEG_LISTS];
 
@@ -542,6 +557,26 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		 * stage.
 		 */
 		map_offset = 0;
+	} else if (alloc_memfile) {
+		uint32_t mf;
+
+		for (mf = 0; mf < RTE_DIM(mem_file); mf++) {
+			if (alloc_sz == mem_file[mf].hugepage_sz &&
+			    socket_id == mem_file[mf].socket_id &&
+			    mem_file[mf].num_allocated < mem_file[mf].num_pages)
+				break;
+		}
+		if (mf >= RTE_DIM(mem_file)) {
+			RTE_LOG(ERR, EAL,
+				"%s() cannot allocate from memfile\n",
+				__func__);
+			return -1;
+		}
+		fd = mem_file[mf].fd;
+		fd_list[list_idx].fds[seg_idx] = fd;
+		map_offset = mem_file[mf].num_allocated * alloc_sz;
+		mmap_flags = MAP_SHARED | MAP_POPULATE | MAP_FIXED;
+		mem_file[mf].num_allocated++;
 	} else {
 		/* takes out a read lock on segment or segment list */
 		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
@@ -683,6 +718,10 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	if (fd < 0)
 		return -1;
 
+	/* don't cleanup pre-allocated files */
+	if (alloc_memfile)
+		return -1;
+
 	if (internal_conf->single_file_segments) {
 		resize_hugefile(fd, map_offset, alloc_sz, false);
 		/* ignore failure, can't make it any worse */
@@ -712,8 +751,9 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	/* erase page data */
-	memset(ms->addr, 0, ms->len);
+	/* Erase page data unless it's pre-allocated files. */
+	if (!alloc_memfile)
+		memset(ms->addr, 0, ms->len);
 
 	if (mmap(ms->addr, ms->len, PROT_NONE,
 			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) ==
@@ -724,8 +764,12 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 
 	eal_mem_set_dump(ms->addr, ms->len, false);
 
-	/* if we're using anonymous hugepages, nothing to be done */
-	if (internal_conf->in_memory && !memfd_create_supported) {
+	/*
+	 * if we're using anonymous hugepages or pre-allocated files,
+	 * nothing to be done
+	 */
+	if ((internal_conf->in_memory && !memfd_create_supported) ||
+			alloc_memfile) {
 		memset(ms, 0, sizeof(*ms));
 		return 0;
 	}
@@ -838,7 +882,9 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	 * during init, we already hold a write lock, so don't try to take out
 	 * another one.
 	 */
-	if (wa->hi->lock_descriptor == -1 && !internal_conf->in_memory) {
+	if (wa->hi->lock_descriptor == -1 &&
+	    !internal_conf->in_memory &&
+	    !alloc_memfile) {
 		dir_fd = open(wa->hi->hugedir, O_RDONLY);
 		if (dir_fd < 0) {
 			RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n",
@@ -868,7 +914,7 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 				need, i);
 
 			/* if exact number wasn't requested, stop */
-			if (!wa->exact)
+			if (!wa->exact || alloc_memfile)
 				goto out;
 
 			/* clean up */
@@ -1120,6 +1166,262 @@ eal_memalloc_free_seg(struct rte_memseg *ms)
 	return eal_memalloc_free_seg_bulk(&ms, 1);
 }
 
+static int
+memfile_fill_socket_id(struct memfile *mf)
+{
+#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
+	void *va;
+	int ret;
+
+	va = mmap(NULL, mf->hugepage_sz, PROT_READ | PROT_WRITE,
+			MAP_SHARED | MAP_POPULATE, mf->fd, 0);
+	if (va == MAP_FAILED) {
+		RTE_LOG(ERR, EAL, "%s(): %s: mmap(): %s\n",
+				__func__, mf->fname, strerror(errno));
+		return -1;
+	}
+
+	ret = 0;
+	if (check_numa()) {
+		if (get_mempolicy(&mf->socket_id, NULL, 0, va,
+				MPOL_F_NODE | MPOL_F_ADDR) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: get_mempolicy(): %s\n",
+				__func__, mf->fname, strerror(errno));
+			ret = -1;
+		}
+	} else
+		mf->socket_id = 0;
+
+	munmap(va, mf->hugepage_sz);
+	return ret;
+#else
+	mf->socket_id = 0;
+	return 0;
+#endif
+}
+
+struct match_memfile_path_arg {
+	const char *path;
+	uint64_t file_sz;
+	uint64_t hugepage_sz;
+	size_t best_len;
+};
+
+/*
+ * While it is unlikely for hugetlbfs, mount points can be nested.
+ * Find the deepest mount point that contains the file.
+ */
+static int
+match_memfile_path(const char *path, uint64_t hugepage_sz, void *cb_arg)
+{
+	struct match_memfile_path_arg *arg = cb_arg;
+	size_t dir_len = strlen(path);
+
+	if (dir_len < arg->best_len)
+		return 0;
+	if (strncmp(path, arg->path, dir_len) != 0)
+		return 0;
+	if (arg->file_sz % hugepage_sz != 0)
+		return 0;
+
+	arg->hugepage_sz = hugepage_sz;
+	arg->best_len = dir_len;
+	return 0;
+}
+
+/* Determine hugepage size from the path to a file in hugetlbfs. */
+static int
+memfile_fill_hugepage_sz(struct memfile *mf, uint64_t file_sz)
+{
+	char abspath[PATH_MAX];
+	struct match_memfile_path_arg arg;
+
+	if (realpath(mf->fname, abspath) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): realpath(): %s\n",
+				__func__, strerror(errno));
+		return -1;
+	}
+
+	memset(&arg, 0, sizeof(arg));
+	arg.path = abspath;
+	arg.file_sz = file_sz;
+	if (eal_hugepage_mount_walk(match_memfile_path, &arg) == 0 &&
+			arg.hugepage_sz != 0) {
+		mf->hugepage_sz = arg.hugepage_sz;
+		return 0;
+	}
+	return -1;
+}
+
+int
+eal_memalloc_memfile_init(void)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	int err = -1, fd;
+	uint32_t i;
+
+	if (internal_conf->mem_file[0] == NULL)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(internal_conf->mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t fsize;
+
+		if (internal_conf->mem_file[i] == NULL) {
+			err = 0;
+			break;
+		}
+		mf->fname = internal_conf->mem_file[i];
+		fd = open(mf->fname, O_RDWR, 0600);
+		mf->fd = fd;
+		if (fd < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: open(): %s\n",
+					__func__, mf->fname, strerror(errno));
+			break;
+		}
+
+		/* take out a read lock and keep it indefinitely */
+		if (lock(fd, LOCK_SH) != 1) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot lock file\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		fsize = get_file_size(fd);
+		if (!fsize) {
+			RTE_LOG(ERR, EAL, "%s(): %s: zero file length\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		if (memfile_fill_hugepage_sz(mf, fsize) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect page size\n",
+					__func__, mf->fname);
+			break;
+		}
+		mf->num_pages = fsize / mf->hugepage_sz;
+
+		if (memfile_fill_socket_id(mf) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect NUMA node\n",
+					__func__, mf->fname);
+			break;
+		}
+	}
+
+	/* check if some problem happened */
+	if (err && i < RTE_DIM(internal_conf->mem_file)) {
+		/* some error occurred, do rollback */
+		do {
+			fd = mem_file[i].fd;
+			/* closing fd drops the lock */
+			if (fd >= 0)
+				close(fd);
+			mem_file[i].fd = -1;
+		} while (i--);
+		return -1;
+	}
+
+	/* update hugepage_info with pages allocated in files */
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		const struct memfile *mf = &mem_file[i];
+		struct hugepage_info *hpi = NULL;
+		uint64_t sz;
+
+		if (!mf->hugepage_sz)
+			break;
+
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			hpi = &internal_conf->hugepage_info[sz];
+
+			if (mf->hugepage_sz == hpi->hugepage_sz) {
+				hpi->num_pages[mf->socket_id] += mf->num_pages;
+				break;
+			}
+		}
+
+		/* it seems hugepage info is not socket aware yet */
+		if (hpi != NULL && sz >= internal_conf->num_hugepage_sizes)
+			hpi->num_pages[0] += mf->num_pages;
+	}
+	return 0;
+}
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	uint32_t i, sz;
+
+	if (internal_conf->mem_file[0] == NULL ||
+			rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t hugepage_sz = mf->hugepage_sz;
+		int socket_id = mf->socket_id;
+		struct rte_memseg **pages;
+
+		if (!hugepage_sz)
+			break;
+
+		while (mf->num_allocated < mf->num_pages) {
+			int needed, allocated, j;
+			uint32_t prev;
+
+			prev = mf->num_allocated;
+			needed = mf->num_pages - mf->num_allocated;
+			pages = malloc(sizeof(*pages) * needed);
+			if (pages == NULL)
+				return -1;
+
+			/* memalloc is locked, it's safe to switch allocator */
+			alloc_memfile = 1;
+			allocated = eal_memalloc_alloc_seg_bulk(pages,
+					needed, hugepage_sz, socket_id,	false);
+			/* switch allocator back */
+			alloc_memfile = 0;
+			if (allocated <= 0) {
+				RTE_LOG(ERR, EAL, "%s(): %s: allocation failed\n",
+						__func__, mf->fname);
+				free(pages);
+				return -1;
+			}
+
+			/* mark preallocated pages as unfreeable */
+			for (j = 0; j < allocated; j++) {
+				struct rte_memseg *ms = pages[j];
+
+				ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE |
+					     RTE_MEMSEG_FLAG_PRE_ALLOCATED;
+			}
+
+			free(pages);
+
+			/* check whether we allocated from expected file */
+			if (prev + allocated != mf->num_allocated) {
+				RTE_LOG(ERR, EAL, "%s(): %s: incorrect allocation\n",
+						__func__, mf->fname);
+				return -1;
+			}
+		}
+
+		/* reflect we pre-allocated some memory */
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			struct hugepage_info *hpi = &hpa[sz];
+
+			if (hpi->hugepage_sz != hugepage_sz)
+				continue;
+			hpi->num_pages[socket_id] -=
+					RTE_MIN(hpi->num_pages[socket_id],
+						mf->num_allocated);
+		}
+	}
+	return 0;
+}
+
 static int
 sync_chunk(struct rte_memseg_list *primary_msl,
 		struct rte_memseg_list *local_msl, struct hugepage_info *hi,
@@ -1178,6 +1480,14 @@ sync_chunk(struct rte_memseg_list *primary_msl,
 		if (l_ms == NULL || p_ms == NULL)
 			return -1;
 
+		/*
+		 * Switch allocator for this segment.
+		 * This function is only called during init,
+		 * so don't try to restore allocator on failure.
+		 */
+		if (p_ms->flags & RTE_MEMSEG_FLAG_PRE_ALLOCATED)
+			alloc_memfile = 1;
+
 		if (used) {
 			ret = alloc_seg(l_ms, p_ms->addr,
 					p_ms->socket_id, hi,
@@ -1191,6 +1501,9 @@ sync_chunk(struct rte_memseg_list *primary_msl,
 			if (ret < 0)
 				return -1;
 		}
+
+		/* Reset the allocator. */
+		alloc_memfile = 0;
 	}
 
 	/* if we just allocated memory, notify the application */
@@ -1392,6 +1705,9 @@ eal_memalloc_sync_with_primary(void)
 	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
 		return 0;
 
+	if (eal_memalloc_memfile_init() < 0)
+		return -1;
+
 	/* memalloc is locked, so it's safe to call thread-unsafe version */
 	if (rte_memseg_list_walk_thread_unsafe(sync_walk, NULL))
 		return -1;
-- 
2.25.1



* [dpdk-dev] [PATCH 21.11 3/3] app/test: add allocator performance autotest
  2021-07-05 12:49 [dpdk-dev] [PATCH 21.11 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
  2021-07-05 12:49 ` [dpdk-dev] [PATCH 21.11 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
  2021-07-05 12:49 ` [dpdk-dev] [PATCH 21.11 2/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
@ 2021-07-05 12:49 ` Dmitry Kozlyuk
  2021-07-16 11:08 ` [dpdk-dev] [PATCH 21.11 v2 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
  3 siblings, 0 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-07-05 12:49 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Viacheslav Ovsiienko

Memory allocator performance is crucial to applications that deal
with large amounts of memory or allocate frequently. DPDK allocator
performance is affected by EAL options, the API used and, at the very
least, the allocation size. The new autotest is intended to be run with
different EAL options. It measures performance with a range of sizes
for different APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.

Work distribution between allocation and deallocation depends on EAL
options. The test prints both times and the total time to ease
comparison.

Memory can be filled with zeroes at different points of the allocation
path, but this always takes a considerable fraction of the overall
timing. This is why the test measures the filling speed and prints,
as a hint, how long clearing would take for each size.
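
Each API is abstracted behind a pair of function pointers (alloc_t and
free_t below), so covering another allocator is one wrapper plus one
test_alloc_perf() call; a hypothetical example for rte_calloc (not part
of this patch):

  static void *
  calloc_wrapper(const char *name, size_t size, unsigned int align)
  {
  	return rte_calloc(name, 1, size, align);
  }

  /* ... then, inside test_malloc_perf(): */
  if (test_alloc_perf("rte_calloc", calloc_wrapper, rte_free,
  		MAX_RUNS, memset_gb_us) < 0)
  	return -1;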

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 app/test/meson.build        |   2 +
 app/test/test_malloc_perf.c | 155 ++++++++++++++++++++++++++++++++++++
 2 files changed, 157 insertions(+)
 create mode 100644 app/test/test_malloc_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 0a5f425578..4fd2694267 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -77,6 +77,7 @@ test_sources = files(
         'test_lpm6_perf.c',
         'test_lpm_perf.c',
         'test_malloc.c',
+        'test_malloc_perf.c',
         'test_mbuf.c',
         'test_member.c',
         'test_member_perf.c',
@@ -273,6 +274,7 @@ fast_tests = [
 
 perf_test_names = [
         'ring_perf_autotest',
+        'malloc_perf_autotest',
         'mempool_perf_autotest',
         'memcpy_perf_autotest',
         'hash_perf_autotest',
diff --git a/app/test/test_malloc_perf.c b/app/test/test_malloc_perf.c
new file mode 100644
index 0000000000..5e9396a5d8
--- /dev/null
+++ b/app/test/test_malloc_perf.c
@@ -0,0 +1,155 @@
+#include <inttypes.h>
+#include <rte_errno.h>
+#include <rte_malloc.h>
+#include <rte_memzone.h>
+
+#include "test.h"
+
+typedef void * (alloc_t)(const char *name, size_t size, unsigned int align);
+typedef void (free_t)(void *addr);
+
+static const uint64_t KB = 1 << 10;
+static const uint64_t GB = 1 << 30;
+
+static double
+tsc_to_us(uint64_t tsc, size_t runs)
+{
+	return (double)tsc / rte_get_tsc_hz() * US_PER_S / runs;
+}
+
+static int
+test_memset_perf(double *us_per_gb)
+{
+	static const size_t RUNS = 20;
+
+	void *ptr;
+	size_t i;
+	uint64_t tsc;
+
+	puts("Performance: memset");
+
+	ptr = rte_malloc(NULL, GB, 0);
+	if (ptr == NULL) {
+		printf("rte_malloc(size=%"PRIx64") failed\n", GB);
+		return -1;
+	}
+
+	tsc = rte_rdtsc_precise();
+	for (i = 0; i < RUNS; i++)
+		memset(ptr, 0, GB);
+	tsc = rte_rdtsc_precise() - tsc;
+
+	*us_per_gb = tsc_to_us(tsc, RUNS);
+	printf("Result: %f.3 GiB/s <=> %.2f us/MiB\n",
+			US_PER_S / *us_per_gb, *us_per_gb / KB);
+
+	rte_free(ptr);
+	putchar('\n');
+	return 0;
+}
+
+static int
+test_alloc_perf(const char *name, alloc_t *alloc_fn, free_t free_fn,
+		size_t max_runs, double memset_gb_us)
+{
+	static const size_t SIZES[] = {
+			1 << 6, 1 << 7, 1 << 10, 1 << 12, 1 << 16, 1 << 20,
+			1 << 21, 1 << 22, 1 << 24, 1 << 30 };
+
+	size_t i, j;
+	void **ptrs;
+
+	printf("Performance: %s\n", name);
+
+	ptrs = calloc(max_runs, sizeof(ptrs[0]));
+	if (ptrs == NULL) {
+		puts("Cannot allocate memory for pointers");
+		return -1;
+	}
+
+	printf("%12s%8s%12s%12s%12s%12s\n",
+			"Size (B)", "Runs", "Alloc (us)", "Free (us)",
+			"Total (us)", "memset (us)");
+	for (i = 0; i < RTE_DIM(SIZES); i++) {
+		size_t size = SIZES[i];
+		size_t runs_done;
+		uint64_t tsc_start, tsc_alloc, tsc_free;
+		double alloc_time, free_time, memset_time;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < max_runs; j++) {
+			ptrs[j] = alloc_fn(NULL, size, 0);
+			if (ptrs[j] == NULL)
+				break;
+		}
+		tsc_alloc = rte_rdtsc_precise() - tsc_start;
+
+		if (j == 0) {
+			printf("%12zu Interrupted: out of memory.\n", size);
+			break;
+		}
+		runs_done = j;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+			free_fn(ptrs[j]);
+		tsc_free = rte_rdtsc_precise() - tsc_start;
+
+		alloc_time = tsc_to_us(tsc_alloc, runs_done);
+		free_time = tsc_to_us(tsc_free, runs_done);
+		memset_time = memset_gb_us * size / GB;
+		printf("%12zu%8lu%12.2f%12.2f%12.2f%12.2f\n",
+				size, runs_done, alloc_time, free_time,
+				alloc_time + free_time, memset_time);
+
+		memset(ptrs, 0, max_runs * sizeof(ptrs[0]));
+	}
+
+	free(ptrs);
+	putchar('\n');
+	return 0;
+}
+
+static void *
+memzone_alloc(const char *name __rte_unused, size_t size, unsigned int align)
+{
+	const struct rte_memzone *mz;
+	char gen_name[RTE_MEMZONE_NAMESIZE];
+
+	snprintf(gen_name, sizeof(gen_name), "test-mz-%"PRIx64, rte_rdtsc());
+	mz = rte_memzone_reserve_aligned(gen_name, size, SOCKET_ID_ANY,
+			RTE_MEMZONE_1GB | RTE_MEMZONE_SIZE_HINT_ONLY, align);
+	return (void *)(uintptr_t)mz;
+}
+
+static void
+memzone_free(void *addr)
+{
+	rte_memzone_free((struct rte_memzone *)addr);
+}
+
+static int
+test_malloc_perf(void)
+{
+	static const size_t MAX_RUNS = 10000;
+
+	double memset_gb_us;
+
+	if (test_memset_perf(&memset_gb_us) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_malloc", rte_malloc, rte_free,
+			MAX_RUNS, memset_gb_us) < 0)
+		return -1;
+	if (test_alloc_perf("rte_zmalloc", rte_zmalloc, rte_free,
+			MAX_RUNS, memset_gb_us) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_memzone_reserve", memzone_alloc, memzone_free,
+			RTE_MAX_MEMZONE - 1, memset_gb_us) < 0)
+		return -1;
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(malloc_perf_autotest, test_malloc_perf);
-- 
2.25.1



* [dpdk-dev] [PATCH 21.11 v2 0/3] eal: add memory pre-allocation from existing files
  2021-07-05 12:49 [dpdk-dev] [PATCH 21.11 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
                   ` (2 preceding siblings ...)
  2021-07-05 12:49 ` [dpdk-dev] [PATCH 21.11 3/3] app/test: add allocator performance autotest Dmitry Kozlyuk
@ 2021-07-16 11:08 ` Dmitry Kozlyuk
  2021-07-16 11:08   ` [dpdk-dev] [PATCH 21.11 v2 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
                     ` (4 more replies)
  3 siblings, 5 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-07-16 11:08 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov

Hugepage allocation from the system takes time, resulting in slow
startup or sporadic delays later. Most of the time spent in the kernel
is zero-filling memory for security reasons, which may be irrelevant
in a controlled environment. The bottleneck is memory access speed,
so to speed up allocation, the amount of memory cleared must be reduced.
We propose a new EAL option, --mem-file FILE1,FILE2,..., to quickly
allocate dirty pages from existing files and clear them as necessary.
A new malloc_perf_autotest is provided to estimate the impact.
More details are explained in the relevant patches.

v2: fix CI failures

Dmitry Kozlyuk (2):
  eal/linux: make hugetlbfs analysis reusable
  app/test: add allocator performance autotest

Viacheslav Ovsiienko (1):
  eal: add memory pre-allocation from existing files

 app/test/meson.build                          |   2 +
 app/test/test_malloc_perf.c                   | 157 +++++++++
 doc/guides/linux_gsg/linux_eal_parameters.rst |  17 +
 lib/eal/common/eal_common_dynmem.c            |   6 +
 lib/eal/common/eal_common_options.c           |  23 ++
 lib/eal/common/eal_internal_cfg.h             |   4 +
 lib/eal/common/eal_memalloc.h                 |   8 +-
 lib/eal/common/eal_options.h                  |   2 +
 lib/eal/common/malloc_elem.c                  |   5 +
 lib/eal/common/malloc_heap.h                  |   8 +
 lib/eal/common/rte_malloc.c                   |  16 +-
 lib/eal/include/rte_memory.h                  |   4 +-
 lib/eal/linux/eal.c                           |  28 ++
 lib/eal/linux/eal_hugepage_info.c             | 158 ++++++---
 lib/eal/linux/eal_hugepage_info.h             |  39 +++
 lib/eal/linux/eal_memalloc.c                  | 328 +++++++++++++++++-
 16 files changed, 735 insertions(+), 70 deletions(-)
 create mode 100644 app/test/test_malloc_perf.c
 create mode 100644 lib/eal/linux/eal_hugepage_info.h

-- 
2.25.1



* [dpdk-dev] [PATCH 21.11 v2 1/3] eal/linux: make hugetlbfs analysis reusable
  2021-07-16 11:08 ` [dpdk-dev] [PATCH 21.11 v2 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
@ 2021-07-16 11:08   ` Dmitry Kozlyuk
  2021-07-16 11:08   ` [dpdk-dev] [PATCH 21.11 v2 2/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-07-16 11:08 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Viacheslav Ovsiienko

get_hugepage_dir() searched for a hugetlbfs mount with a given page size
using handcrafted parsing of /proc/mounts, mixing traversal logic with
the selection of the needed entry. Separate the code that enumerates
hugetlbfs mounts into eal_hugepage_mount_walk(), which takes a callback
that can inspect already-parsed entries. Use the mntent(3) API for
parsing. This allows the enumeration logic to be reused in subsequent
patches.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 lib/eal/linux/eal_hugepage_info.c | 153 +++++++++++++++++++-----------
 lib/eal/linux/eal_hugepage_info.h |  39 ++++++++
 2 files changed, 135 insertions(+), 57 deletions(-)
 create mode 100644 lib/eal/linux/eal_hugepage_info.h

diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index d97792cade..a090c0a5b5 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -12,6 +12,7 @@
 #include <stdio.h>
 #include <fnmatch.h>
 #include <inttypes.h>
+#include <mntent.h>
 #include <stdarg.h>
 #include <unistd.h>
 #include <errno.h>
@@ -34,6 +35,7 @@
 #include "eal_private.h"
 #include "eal_internal_cfg.h"
 #include "eal_hugepages.h"
+#include "eal_hugepage_info.h"
 #include "eal_filesystem.h"
 
 static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
@@ -195,73 +197,110 @@ get_default_hp_size(void)
 	return size;
 }
 
-static int
-get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
+int
+eal_hugepage_mount_walk(eal_hugepage_mount_walk_cb *cb, void *cb_arg)
 {
-	enum proc_mount_fieldnames {
-		DEVICE = 0,
-		MOUNTPT,
-		FSTYPE,
-		OPTIONS,
-		_FIELDNAME_MAX
-	};
-	static uint64_t default_size = 0;
-	const char proc_mounts[] = "/proc/mounts";
-	const char hugetlbfs_str[] = "hugetlbfs";
-	const size_t htlbfs_str_len = sizeof(hugetlbfs_str) - 1;
-	const char pagesize_opt[] = "pagesize=";
-	const size_t pagesize_opt_len = sizeof(pagesize_opt) - 1;
-	const char split_tok = ' ';
-	char *splitstr[_FIELDNAME_MAX];
-	char buf[BUFSIZ];
-	int retval = -1;
-	const struct internal_config *internal_conf =
-		eal_get_internal_configuration();
-
-	FILE *fd = fopen(proc_mounts, "r");
-	if (fd == NULL)
-		rte_panic("Cannot open %s\n", proc_mounts);
+	static const char PATH[] = "/proc/mounts";
+	static const char OPTION[] = "pagesize";
+
+	static uint64_t default_size;
+
+	FILE *f = NULL;
+	struct mntent *m;
+	char *hugepage_sz_str;
+	uint64_t hugepage_sz;
+	int ret = -1;
+
+	f = setmntent(PATH, "r");
+	if (f == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): setmntent(%s): %s\n",
+				__func__, PATH, strerror(errno));
+		goto exit;
+	}
 
 	if (default_size == 0)
 		default_size = get_default_hp_size();
 
-	while (fgets(buf, sizeof(buf), fd)){
-		if (rte_strsplit(buf, sizeof(buf), splitstr, _FIELDNAME_MAX,
-				split_tok) != _FIELDNAME_MAX) {
-			RTE_LOG(ERR, EAL, "Error parsing %s\n", proc_mounts);
-			break; /* return NULL */
-		}
+	ret = 0;
+	do {
+		m = getmntent(f);
+		if (m == NULL)
+			break;
 
-		/* we have a specified --huge-dir option, only examine that dir */
-		if (internal_conf->hugepage_dir != NULL &&
-				strcmp(splitstr[MOUNTPT], internal_conf->hugepage_dir) != 0)
+		if (strcmp(m->mnt_fsname, "hugetlbfs") != 0)
 			continue;
 
-		if (strncmp(splitstr[FSTYPE], hugetlbfs_str, htlbfs_str_len) == 0){
-			const char *pagesz_str = strstr(splitstr[OPTIONS], pagesize_opt);
-
-			/* if no explicit page size, the default page size is compared */
-			if (pagesz_str == NULL){
-				if (hugepage_sz == default_size){
-					strlcpy(hugedir, splitstr[MOUNTPT], len);
-					retval = 0;
-					break;
-				}
-			}
-			/* there is an explicit page size, so check it */
-			else {
-				uint64_t pagesz = rte_str_to_size(&pagesz_str[pagesize_opt_len]);
-				if (pagesz == hugepage_sz) {
-					strlcpy(hugedir, splitstr[MOUNTPT], len);
-					retval = 0;
-					break;
-				}
+		hugepage_sz_str = hasmntopt(m, OPTION);
+		if (hugepage_sz_str != NULL) {
+			hugepage_sz_str += strlen(OPTION) + 1; /* +1 for '=' */
+			hugepage_sz = rte_str_to_size(hugepage_sz_str);
+			if (hugepage_sz == 0) {
+				RTE_LOG(DEBUG, EAL, "Cannot parse hugepage size from '%s' for %s\n",
+						m->mnt_opts, m->mnt_dir);
+				continue;
 			}
-		} /* end if strncmp hugetlbfs */
-	} /* end while fgets */
+		} else {
+			RTE_LOG(DEBUG, EAL, "Hugepage filesystem at %s without %s option\n",
+					m->mnt_dir, OPTION);
+			hugepage_sz = default_size;
+		}
 
-	fclose(fd);
-	return retval;
+		if (cb(m->mnt_dir, hugepage_sz, cb_arg) != 0)
+			break;
+	} while (m != NULL);
+
+	if (ferror(f) && !feof(f)) {
+		RTE_LOG(DEBUG, EAL, "%s(): getmntent(): %s\n",
+				__func__, strerror(errno));
+		ret = -1;
+		goto exit;
+	}
+
+exit:
+	if (f != NULL)
+		endmntent(f);
+	return ret;
+}
+
+struct match_hugepage_mount_arg {
+	uint64_t hugepage_sz;
+	char *hugedir;
+	int hugedir_len;
+	bool done;
+};
+
+static int
+match_hugepage_mount(const char *path, uint64_t hugepage_sz, void *cb_arg)
+{
+	const struct internal_config *internal_conf =
+		eal_get_internal_configuration();
+	struct match_hugepage_mount_arg *arg = cb_arg;
+
+	/* we have a specified --huge-dir option, only examine that dir */
+	if (internal_conf->hugepage_dir != NULL &&
+			strcmp(path, internal_conf->hugepage_dir) != 0)
+		return 0;
+
+	if (hugepage_sz == arg->hugepage_sz) {
+		strlcpy(arg->hugedir, path, arg->hugedir_len);
+		arg->done = true;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int
+get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
+{
+	struct match_hugepage_mount_arg arg = {
+		.hugepage_sz = hugepage_sz,
+		.hugedir = hugedir,
+		.hugedir_len = len,
+		.done = false,
+	};
+	int ret = eal_hugepage_mount_walk(match_hugepage_mount, &arg);
+	return ret == 0 && arg.done ? 0 : -1;
 }
 
 /*
diff --git a/lib/eal/linux/eal_hugepage_info.h b/lib/eal/linux/eal_hugepage_info.h
new file mode 100644
index 0000000000..c7efa37c66
--- /dev/null
+++ b/lib/eal/linux/eal_hugepage_info.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 NVIDIA CORPORATION & AFFILIATES.
+ */
+
+#ifndef _EAL_HUGEPAGE_INFO_
+#define _EAL_HUGEPAGE_INFO_
+
+#include <stdint.h>
+
+/**
+ * Function called for each hugetlbfs mount point.
+ *
+ * @param path
+ *  Mount point directory.
+ * @param hugepage_sz
+ *  Hugepage size for the mount or default system hugepage size.
+ * @param arg
+ *  User data.
+ *
+ * @return
+ *  0 to continue walking, 1 to stop.
+ */
+typedef int (eal_hugepage_mount_walk_cb)(const char *path, uint64_t hugepage_sz,
+					 void *arg);
+
+/**
+ * Enumerate hugetlbfs mount points.
+ *
+ * @param cb
+ *  Function called for each mount point.
+ * @param cb_arg
+ *  User data passed to the callback.
+ *
+ * @return
+ *  0 on success, negative on failure.
+ */
+int eal_hugepage_mount_walk(eal_hugepage_mount_walk_cb *cb, void *cb_arg);
+
+#endif /* _EAL_HUGEPAGE_INFO_ */
-- 
2.25.1



* [dpdk-dev] [PATCH 21.11 v2 2/3] eal: add memory pre-allocation from existing files
  2021-07-16 11:08 ` [dpdk-dev] [PATCH 21.11 v2 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
  2021-07-16 11:08   ` [dpdk-dev] [PATCH 21.11 v2 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
@ 2021-07-16 11:08   ` Dmitry Kozlyuk
  2021-07-16 11:08   ` [dpdk-dev] [PATCH 21.11 v2 3/3] app/test: add allocator performance autotest Dmitry Kozlyuk
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-07-16 11:08 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Viacheslav Ovsiienko

From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

The primary DPDK process launch might take a long time if the initially
allocated memory is large. In practice, allocating 1 TB of memory
over 1 GB hugepages on Linux takes tens of seconds. Fast restart
is highly desired for some applications, and the launch delay presents
a problem.

The primary delay happens in this call trace:
  rte_eal_init()
    rte_eal_memory_init()
      rte_eal_hugepage_init()
        eal_dynmem_hugepage_init()
          eal_memalloc_alloc_seg_bulk()
            alloc_seg()
              mmap()

The largest part of the time spent in mmap() is filling the memory
with zeros. The kernel does so to prevent data leakage from the process
that last used the page. However, in a controlled environment this
may not be an issue, while performance is. (The Linux-specific
MAP_UNINITIALIZED flag allows mapping without clearing, but it is
disabled in all popular distributions for the reason above.)

It is proposed to add a new EAL option, --mem-file FILE1,FILE2,...,
to map hugepages "as is" from the specified FILEs in hugetlbfs.
Compared to using external memory for the task, the EAL option requires
no changes to application code, while allowing the administrator
to control hugepage sizes and their NUMA affinity.

Limitations of the feature:

* Linux-specific (only Linux maps hugepages from files).
* Incompatible with --legacy-mem (partially replaces it).
* Incompatible with --single-file-segments
  (--mem-file FILEs can contain as many segments as needed).
* Incompatible with --in-memory (logically).

A warning about possible security implications is printed
when --mem-file is used.

Until this patch, the DPDK allocator always cleared memory on freeing,
so that it did not have to do so on allocation, while new memory
was cleared by the kernel. When --mem-file is in use, DPDK clears memory
after allocation in rte_zmalloc() and does not clear it on freeing.
Effectively, the user trades fast startup for an occasional allocation
slowdown whenever clearing is absolutely necessary. When memory is
recycled, it is cleared again, which is suboptimal per se but saves
the complication of memory management.

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 doc/guides/linux_gsg/linux_eal_parameters.rst |  17 +
 lib/eal/common/eal_common_dynmem.c            |   6 +
 lib/eal/common/eal_common_options.c           |  23 ++
 lib/eal/common/eal_internal_cfg.h             |   4 +
 lib/eal/common/eal_memalloc.h                 |   8 +-
 lib/eal/common/eal_options.h                  |   2 +
 lib/eal/common/malloc_elem.c                  |   5 +
 lib/eal/common/malloc_heap.h                  |   8 +
 lib/eal/common/rte_malloc.c                   |  16 +-
 lib/eal/include/rte_memory.h                  |   4 +-
 lib/eal/linux/eal.c                           |  28 ++
 lib/eal/linux/eal_hugepage_info.c             |   5 +
 lib/eal/linux/eal_memalloc.c                  | 328 +++++++++++++++++-
 13 files changed, 441 insertions(+), 13 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst
index bd3977cb3d..b465feaea8 100644
--- a/doc/guides/linux_gsg/linux_eal_parameters.rst
+++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
@@ -92,6 +92,23 @@ Memory-related options
 
     Free hugepages back to system exactly as they were originally allocated.
 
+*   ``--mem-file <pre-allocated files>``
+
+    Use memory from pre-allocated files in ``hugetlbfs`` without clearing it;
+    when this memory is exhausted, switch to default dynamic allocation.
+    This speeds up startup compared to ``--legacy-mem`` while also avoiding
+    later delays for allocating new hugepages. One downside is slowdown
+    of all zeroed memory allocations. Security warning: an application
+    can access contents left by previous users of hugepages. Multiple files
+    can be pre-allocated in ``hugetlbfs`` with different page sizes,
+    on desired NUMA nodes, using ``mount`` options and ``numactl``:
+
+        --mem-file /mnt/huge-1G/node0,/mnt/huge-1G/node1,/mnt/huge-2M/extra
+
+    This option is incompatible with ``--legacy-mem``, ``--in-memory``,
+    and ``--single-file-segments``. Primary and secondary processes
+    must specify exactly the same list of files.
+
 Other options
 ~~~~~~~~~~~~~
 
diff --git a/lib/eal/common/eal_common_dynmem.c b/lib/eal/common/eal_common_dynmem.c
index 7c5437ddfa..abcf22f097 100644
--- a/lib/eal/common/eal_common_dynmem.c
+++ b/lib/eal/common/eal_common_dynmem.c
@@ -272,6 +272,12 @@ eal_dynmem_hugepage_init(void)
 			internal_conf->num_hugepage_sizes) < 0)
 		return -1;
 
+#ifdef RTE_EXEC_ENV_LINUX
+	/* pre-allocate pages from --mem-file option files */
+	if (eal_memalloc_memfile_alloc(used_hp) < 0)
+		return -1;
+#endif
+
 	for (hp_sz_idx = 0;
 			hp_sz_idx < (int)internal_conf->num_hugepage_sizes;
 			hp_sz_idx++) {
diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index ff5861b5f3..c729c36630 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -86,6 +86,7 @@ eal_long_options[] = {
 	{OPT_MASTER_LCORE,      1, NULL, OPT_MASTER_LCORE_NUM     },
 	{OPT_MAIN_LCORE,        1, NULL, OPT_MAIN_LCORE_NUM       },
 	{OPT_MBUF_POOL_OPS_NAME, 1, NULL, OPT_MBUF_POOL_OPS_NAME_NUM},
+	{OPT_MEM_FILE,          1, NULL, OPT_MEM_FILE_NUM         },
 	{OPT_NO_HPET,           0, NULL, OPT_NO_HPET_NUM          },
 	{OPT_NO_HUGE,           0, NULL, OPT_NO_HUGE_NUM          },
 	{OPT_NO_PCI,            0, NULL, OPT_NO_PCI_NUM           },
@@ -1898,6 +1899,8 @@ eal_cleanup_config(struct internal_config *internal_cfg)
 		free(internal_cfg->hugepage_dir);
 	if (internal_cfg->user_mbuf_pool_ops_name != NULL)
 		free(internal_cfg->user_mbuf_pool_ops_name);
+	if (internal_cfg->mem_file[0])
+		free(internal_cfg->mem_file[0]);
 
 	return 0;
 }
@@ -2018,6 +2021,26 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"amount of reserved memory can be adjusted with "
 			"-m or --"OPT_SOCKET_MEM"\n");
 	}
+	if (internal_cfg->mem_file[0] && internal_conf->legacy_mem) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_LEGACY_MEM"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_conf->no_hugetlbfs) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_NO_HUGE"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_conf->in_memory) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_IN_MEMORY"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_conf->single_file_segments) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_SINGLE_FILE_SEGMENTS"\n");
+		return -1;
+	}
 
 	return 0;
 }
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index d6c0470eb8..814d5c66e1 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -22,6 +22,9 @@
 #define MAX_HUGEPAGE_SIZES 3  /**< support up to 3 page sizes */
 #endif
 
+#define MAX_MEMFILE_ITEMS (MAX_HUGEPAGE_SIZES * RTE_MAX_NUMA_NODES)
+/**< Maximal number of mem-file parameters. */
+
 /*
  * internal configuration structure for the number, size and
  * mount points of hugepages
@@ -83,6 +86,7 @@ struct internal_config {
 	rte_uuid_t vfio_vf_token;
 	char *hugefile_prefix;      /**< the base filename of hugetlbfs files */
 	char *hugepage_dir;         /**< specific hugetlbfs directory to use */
+	char *mem_file[MAX_MEMFILE_ITEMS]; /**< pre-allocated memory files */
 	char *user_mbuf_pool_ops_name;
 			/**< user defined mbuf pool ops name */
 	unsigned num_hugepage_sizes;      /**< how many sizes on this system */
diff --git a/lib/eal/common/eal_memalloc.h b/lib/eal/common/eal_memalloc.h
index ebc3a6f6c1..d92c9a167b 100644
--- a/lib/eal/common/eal_memalloc.h
+++ b/lib/eal/common/eal_memalloc.h
@@ -8,7 +8,7 @@
 #include <stdbool.h>
 
 #include <rte_memory.h>
-
+#include "eal_internal_cfg.h"
 /*
  * Allocate segment of specified page size.
  */
@@ -96,4 +96,10 @@ eal_memalloc_init(void);
 int
 eal_memalloc_cleanup(void);
 
+int
+eal_memalloc_memfile_init(void);
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa);
+
 #endif /* EAL_MEMALLOC_H */
diff --git a/lib/eal/common/eal_options.h b/lib/eal/common/eal_options.h
index 7b348e707f..c6c634b2b2 100644
--- a/lib/eal/common/eal_options.h
+++ b/lib/eal/common/eal_options.h
@@ -93,6 +93,8 @@ enum {
 	OPT_NO_TELEMETRY_NUM,
 #define OPT_FORCE_MAX_SIMD_BITWIDTH  "force-max-simd-bitwidth"
 	OPT_FORCE_MAX_SIMD_BITWIDTH_NUM,
+#define OPT_MEM_FILE          "mem-file"
+	OPT_MEM_FILE_NUM,
 
 	/* legacy option that will be removed in future */
 #define OPT_PCI_BLACKLIST     "pci-blacklist"
diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
index c2c9461f1d..6e71029a3c 100644
--- a/lib/eal/common/malloc_elem.c
+++ b/lib/eal/common/malloc_elem.c
@@ -578,8 +578,13 @@ malloc_elem_free(struct malloc_elem *elem)
 	/* decrease heap's count of allocated elements */
 	elem->heap->alloc_count--;
 
+#ifdef MALLOC_DEBUG
 	/* poison memory */
 	memset(ptr, MALLOC_POISON, data_len);
+#else
+	if (!malloc_clear_on_alloc())
+		memset(ptr, 0, data_len);
+#endif
 
 	return elem;
 }
diff --git a/lib/eal/common/malloc_heap.h b/lib/eal/common/malloc_heap.h
index 772736b53f..72b64d8052 100644
--- a/lib/eal/common/malloc_heap.h
+++ b/lib/eal/common/malloc_heap.h
@@ -10,6 +10,7 @@
 
 #include <rte_malloc.h>
 #include <rte_spinlock.h>
+#include "eal_private.h"
 
 /* Number of free lists per heap, grouped by size. */
 #define RTE_HEAP_NUM_FREELISTS  13
@@ -48,6 +49,13 @@ malloc_get_numa_socket(void)
 	return socket_id;
 }
 
+static inline bool
+malloc_clear_on_alloc(void)
+{
+	const struct internal_config *cfg = eal_get_internal_configuration();
+	return cfg->mem_file[0] != NULL;
+}
+
 void *
 malloc_heap_alloc(const char *type, size_t size, int socket, unsigned int flags,
 		size_t align, size_t bound, bool contig);
diff --git a/lib/eal/common/rte_malloc.c b/lib/eal/common/rte_malloc.c
index 9d39e58c08..ce94268aca 100644
--- a/lib/eal/common/rte_malloc.c
+++ b/lib/eal/common/rte_malloc.c
@@ -113,17 +113,23 @@ rte_malloc(const char *type, size_t size, unsigned align)
 void *
 rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket)
 {
+	bool zero;
 	void *ptr = rte_malloc_socket(type, size, align, socket);
 
-#ifdef RTE_MALLOC_DEBUG
 	/*
 	 * If DEBUG is enabled, then freed memory is marked with poison
-	 * value and set to zero on allocation.
-	 * If DEBUG is not enabled then  memory is already zeroed.
+	 * value and must be set to zero on allocation.
+	 * If DEBUG is not enabled then it is configurable
+	 * whether memory comes already set to zero by memalloc or on free
+	 * or it must be set to zero here.
 	 */
-	if (ptr != NULL)
-		memset(ptr, 0, size);
+#ifdef RTE_MALLOC_DEBUG
+	zero = true;
+#else
+	zero = malloc_clear_on_alloc();
 #endif
+	if (ptr != NULL && zero)
+		memset(ptr, 0, size);
 
 	rte_eal_trace_mem_zmalloc(type, size, align, socket, ptr);
 	return ptr;
diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h
index bba9b5300a..9a2b191314 100644
--- a/lib/eal/include/rte_memory.h
+++ b/lib/eal/include/rte_memory.h
@@ -40,7 +40,9 @@ extern "C" {
 /**
  * Physical memory segment descriptor.
  */
-#define RTE_MEMSEG_FLAG_DO_NOT_FREE (1 << 0)
+#define RTE_MEMSEG_FLAG_DO_NOT_FREE   (1 << 0)
+#define RTE_MEMSEG_FLAG_PRE_ALLOCATED (1 << 1)
+
 /**< Prevent this segment from being freed back to the OS. */
 struct rte_memseg {
 	rte_iova_t iova;            /**< Start IO address. */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 3577eaeaa4..d0afcd8326 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -548,6 +548,7 @@ eal_usage(const char *prgname)
 	       "  --"OPT_LEGACY_MEM"        Legacy memory mode (no dynamic allocation, contiguous segments)\n"
 	       "  --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n"
 	       "  --"OPT_MATCH_ALLOCATIONS" Free hugepages exactly as allocated\n"
+	       "  --"OPT_MEM_FILE"          Comma-separated list of files in hugetlbfs.\n"
 	       "\n");
 	/* Allow the application to print its usage message too if hook is set */
 	if (hook) {
@@ -678,6 +679,22 @@ eal_log_level_parse(int argc, char **argv)
 	optarg = old_optarg;
 }
 
+static int
+eal_parse_memfile_arg(const char *arg, char **mem_file)
+{
+	int ret;
+
+	char *copy = strdup(arg);
+	if (copy == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot store --"OPT_MEM_FILE" names\n");
+		return -1;
+	}
+
+	ret = rte_strsplit(copy, strlen(copy), mem_file,
+			MAX_MEMFILE_ITEMS, ',');
+	return ret <= 0 ? -1 : 0;
+}
+
 /* Parse the argument given in the command line of the application */
 static int
 eal_parse_args(int argc, char **argv)
@@ -819,6 +836,17 @@ eal_parse_args(int argc, char **argv)
 			internal_conf->match_allocations = 1;
 			break;
 
+		case OPT_MEM_FILE_NUM:
+			if (eal_parse_memfile_arg(optarg,
+					internal_conf->mem_file) < 0) {
+				RTE_LOG(ERR, EAL, "invalid parameters for --"
+						OPT_MEM_FILE "\n");
+				eal_usage(prgname);
+				ret = -1;
+				goto out;
+			}
+			break;
+
 		default:
 			if (opt < OPT_LONG_MIN_NUM && isprint(opt)) {
 				RTE_LOG(ERR, EAL, "Option %c is not supported "
diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index a090c0a5b5..0c262342a5 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -37,6 +37,7 @@
 #include "eal_hugepages.h"
 #include "eal_hugepage_info.h"
 #include "eal_filesystem.h"
+#include "eal_memalloc.h"
 
 static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
 static const char sys_pages_numa_dir_path[] = "/sys/devices/system/node";
@@ -515,6 +516,10 @@ hugepage_info_init(void)
 	qsort(&internal_conf->hugepage_info[0], num_sizes,
 	      sizeof(internal_conf->hugepage_info[0]), compare_hpi);
 
+	/* add pre-allocated pages with --mem-file option to available ones */
+	if (eal_memalloc_memfile_init())
+		return -1;
+
 	/* now we have all info, check we have at least one valid size */
 	for (i = 0; i < num_sizes; i++) {
 		/* pages may no longer all be on socket 0, so check all */
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 0ec8542283..c2b3586204 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 #include <limits.h>
 #include <fcntl.h>
+#include <mntent.h>
 #include <sys/ioctl.h>
 #include <sys/time.h>
 #include <signal.h>
@@ -41,6 +42,7 @@
 #include <rte_spinlock.h>
 
 #include "eal_filesystem.h"
+#include "eal_hugepage_info.h"
 #include "eal_internal_cfg.h"
 #include "eal_memalloc.h"
 #include "eal_memcfg.h"
@@ -102,6 +104,19 @@ static struct {
 	int count; /**< entries used in an array */
 } fd_list[RTE_MAX_MEMSEG_LISTS];
 
+struct memfile {
+	char *fname;		/**< file name */
+	uint64_t hugepage_sz;	/**< size of a huge page */
+	uint32_t num_pages;	/**< number of pages */
+	uint32_t num_allocated;	/**< number of already allocated pages */
+	int socket_id;		/**< Socket ID  */
+	int fd;			/**< file descriptor */
+};
+
+struct memfile mem_file[MAX_MEMFILE_ITEMS];
+
+static int alloc_memfile;
+
 /** local copy of a memory map, used to synchronize memory hotplug in MP */
 static struct rte_memseg_list local_memsegs[RTE_MAX_MEMSEG_LISTS];
 
@@ -542,6 +557,26 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		 * stage.
 		 */
 		map_offset = 0;
+	} else if (alloc_memfile) {
+		uint32_t mf;
+
+		for (mf = 0; mf < RTE_DIM(mem_file); mf++) {
+			if (alloc_sz == mem_file[mf].hugepage_sz &&
+			    socket_id == mem_file[mf].socket_id &&
+			    mem_file[mf].num_allocated < mem_file[mf].num_pages)
+				break;
+		}
+		if (mf >= RTE_DIM(mem_file)) {
+			RTE_LOG(ERR, EAL,
+				"%s() cannot allocate from memfile\n",
+				__func__);
+			return -1;
+		}
+		fd = mem_file[mf].fd;
+		fd_list[list_idx].fds[seg_idx] = fd;
+		map_offset = mem_file[mf].num_allocated * alloc_sz;
+		mmap_flags = MAP_SHARED | MAP_POPULATE | MAP_FIXED;
+		mem_file[mf].num_allocated++;
 	} else {
 		/* takes out a read lock on segment or segment list */
 		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
@@ -683,6 +718,10 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	if (fd < 0)
 		return -1;
 
+	/* don't cleanup pre-allocated files */
+	if (alloc_memfile)
+		return -1;
+
 	if (internal_conf->single_file_segments) {
 		resize_hugefile(fd, map_offset, alloc_sz, false);
 		/* ignore failure, can't make it any worse */
@@ -712,8 +751,9 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	/* erase page data */
-	memset(ms->addr, 0, ms->len);
+	/* Erase page data unless it's pre-allocated files. */
+	if (!alloc_memfile)
+		memset(ms->addr, 0, ms->len);
 
 	if (mmap(ms->addr, ms->len, PROT_NONE,
 			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) ==
@@ -724,8 +764,12 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 
 	eal_mem_set_dump(ms->addr, ms->len, false);
 
-	/* if we're using anonymous hugepages, nothing to be done */
-	if (internal_conf->in_memory && !memfd_create_supported) {
+	/*
+	 * if we're using anonymous hugepages or pre-allocated files,
+	 * nothing to be done
+	 */
+	if ((internal_conf->in_memory && !memfd_create_supported) ||
+			alloc_memfile) {
 		memset(ms, 0, sizeof(*ms));
 		return 0;
 	}
@@ -838,7 +882,9 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	 * during init, we already hold a write lock, so don't try to take out
 	 * another one.
 	 */
-	if (wa->hi->lock_descriptor == -1 && !internal_conf->in_memory) {
+	if (wa->hi->lock_descriptor == -1 &&
+	    !internal_conf->in_memory &&
+	    !alloc_memfile) {
 		dir_fd = open(wa->hi->hugedir, O_RDONLY);
 		if (dir_fd < 0) {
 			RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n",
@@ -868,7 +914,7 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 				need, i);
 
 			/* if exact number wasn't requested, stop */
-			if (!wa->exact)
+			if (!wa->exact || alloc_memfile)
 				goto out;
 
 			/* clean up */
@@ -1120,6 +1166,262 @@ eal_memalloc_free_seg(struct rte_memseg *ms)
 	return eal_memalloc_free_seg_bulk(&ms, 1);
 }
 
+static int
+memfile_fill_socket_id(struct memfile *mf)
+{
+#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
+	void *va;
+	int ret;
+
+	va = mmap(NULL, mf->hugepage_sz, PROT_READ | PROT_WRITE,
+			MAP_SHARED | MAP_POPULATE, mf->fd, 0);
+	if (va == MAP_FAILED) {
+		RTE_LOG(ERR, EAL, "%s(): %s: mmap(): %s\n",
+				__func__, mf->fname, strerror(errno));
+		return -1;
+	}
+
+	ret = 0;
+	if (check_numa()) {
+		if (get_mempolicy(&mf->socket_id, NULL, 0, va,
+				MPOL_F_NODE | MPOL_F_ADDR) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: get_mempolicy(): %s\n",
+				__func__, mf->fname, strerror(errno));
+			ret = -1;
+		}
+	} else
+		mf->socket_id = 0;
+
+	munmap(va, mf->hugepage_sz);
+	return ret;
+#else
+	mf->socket_id = 0;
+	return 0;
+#endif
+}
+
+struct match_memfile_path_arg {
+	const char *path;
+	uint64_t file_sz;
+	uint64_t hugepage_sz;
+	size_t best_len;
+};
+
+/*
+ * While it is unlikely for hugetlbfs, mount points can be nested.
+ * Find the deepest mount point that contains the file.
+ */
+static int
+match_memfile_path(const char *path, uint64_t hugepage_sz, void *cb_arg)
+{
+	struct match_memfile_path_arg *arg = cb_arg;
+	size_t dir_len = strlen(path);
+
+	if (dir_len < arg->best_len)
+		return 0;
+	if (strncmp(path, arg->path, dir_len) != 0)
+		return 0;
+	if (arg->file_sz % hugepage_sz != 0)
+		return 0;
+
+	arg->hugepage_sz = hugepage_sz;
+	arg->best_len = dir_len;
+	return 0;
+}
+
+/* Determine hugepage size from the path to a file in hugetlbfs. */
+static int
+memfile_fill_hugepage_sz(struct memfile *mf, uint64_t file_sz)
+{
+	char abspath[PATH_MAX];
+	struct match_memfile_path_arg arg;
+
+	if (realpath(mf->fname, abspath) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): realpath(): %s\n",
+				__func__, strerror(errno));
+		return -1;
+	}
+
+	memset(&arg, 0, sizeof(arg));
+	arg.path = abspath;
+	arg.file_sz = file_sz;
+	if (eal_hugepage_mount_walk(match_memfile_path, &arg) == 0 &&
+			arg.hugepage_sz != 0) {
+		mf->hugepage_sz = arg.hugepage_sz;
+		return 0;
+	}
+	return -1;
+}
+
+int
+eal_memalloc_memfile_init(void)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	int err = -1, fd;
+	uint32_t i;
+
+	if (internal_conf->mem_file[0] == NULL)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(internal_conf->mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t fsize;
+
+		if (internal_conf->mem_file[i] == NULL) {
+			err = 0;
+			break;
+		}
+		mf->fname = internal_conf->mem_file[i];
+		fd = open(mf->fname, O_RDWR, 0600);
+		mf->fd = fd;
+		if (fd < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: open(): %s\n",
+					__func__, mf->fname, strerror(errno));
+			break;
+		}
+
+		/* take out a read lock and keep it indefinitely */
+		if (lock(fd, LOCK_SH) != 1) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot lock file\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		fsize = get_file_size(fd);
+		if (!fsize) {
+			RTE_LOG(ERR, EAL, "%s(): %s: zero file length\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		if (memfile_fill_hugepage_sz(mf, fsize) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect page size\n",
+					__func__, mf->fname);
+			break;
+		}
+		mf->num_pages = fsize / mf->hugepage_sz;
+
+		if (memfile_fill_socket_id(mf) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect NUMA node\n",
+					__func__, mf->fname);
+			break;
+		}
+	}
+
+	/* check if some problem happened */
+	if (err && i < RTE_DIM(internal_conf->mem_file)) {
+		/* some error occurred, do rollback */
+		do {
+			fd = mem_file[i].fd;
+			/* closing fd drops the lock */
+			if (fd >= 0)
+				close(fd);
+			mem_file[i].fd = -1;
+		} while (i--);
+		return -1;
+	}
+
+	/* update hugepage_info with pages allocated in files */
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		const struct memfile *mf = &mem_file[i];
+		struct hugepage_info *hpi = NULL;
+		uint64_t sz;
+
+		if (!mf->hugepage_sz)
+			break;
+
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			hpi = &internal_conf->hugepage_info[sz];
+
+			if (mf->hugepage_sz == hpi->hugepage_sz) {
+				hpi->num_pages[mf->socket_id] += mf->num_pages;
+				break;
+			}
+		}
+
+		/* it seems hugepage info is not socket aware yet */
+		if (hpi != NULL && sz >= internal_conf->num_hugepage_sizes)
+			hpi->num_pages[0] += mf->num_pages;
+	}
+	return 0;
+}
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	uint32_t i, sz;
+
+	if (internal_conf->mem_file[0] == NULL ||
+			rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t hugepage_sz = mf->hugepage_sz;
+		int socket_id = mf->socket_id;
+		struct rte_memseg **pages;
+
+		if (!hugepage_sz)
+			break;
+
+		while (mf->num_allocated < mf->num_pages) {
+			int needed, allocated, j;
+			uint32_t prev;
+
+			prev = mf->num_allocated;
+			needed = mf->num_pages - mf->num_allocated;
+			pages = malloc(sizeof(*pages) * needed);
+			if (pages == NULL)
+				return -1;
+
+			/* memalloc is locked, it's safe to switch allocator */
+			alloc_memfile = 1;
+			allocated = eal_memalloc_alloc_seg_bulk(pages,
+					needed, hugepage_sz, socket_id,	false);
+			/* switch allocator back */
+			alloc_memfile = 0;
+			if (allocated <= 0) {
+				RTE_LOG(ERR, EAL, "%s(): %s: allocation failed\n",
+						__func__, mf->fname);
+				free(pages);
+				return -1;
+			}
+
+			/* mark preallocated pages as unfreeable */
+			for (j = 0; j < allocated; j++) {
+				struct rte_memseg *ms = pages[j];
+
+				ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE |
+					     RTE_MEMSEG_FLAG_PRE_ALLOCATED;
+			}
+
+			free(pages);
+
+			/* check whether we allocated from expected file */
+			if (prev + allocated != mf->num_allocated) {
+				RTE_LOG(ERR, EAL, "%s(): %s: incorrect allocation\n",
+						__func__, mf->fname);
+				return -1;
+			}
+		}
+
+		/* reflect we pre-allocated some memory */
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			struct hugepage_info *hpi = &hpa[sz];
+
+			if (hpi->hugepage_sz != hugepage_sz)
+				continue;
+			hpi->num_pages[socket_id] -=
+					RTE_MIN(hpi->num_pages[socket_id],
+						mf->num_allocated);
+		}
+	}
+	return 0;
+}
+
 static int
 sync_chunk(struct rte_memseg_list *primary_msl,
 		struct rte_memseg_list *local_msl, struct hugepage_info *hi,
@@ -1178,6 +1480,14 @@ sync_chunk(struct rte_memseg_list *primary_msl,
 		if (l_ms == NULL || p_ms == NULL)
 			return -1;
 
+		/*
+		 * Switch allocator for this segment.
+		 * This function is only called during init,
+		 * so don't try to restore allocator on failure.
+		 */
+		if (p_ms->flags & RTE_MEMSEG_FLAG_PRE_ALLOCATED)
+			alloc_memfile = 1;
+
 		if (used) {
 			ret = alloc_seg(l_ms, p_ms->addr,
 					p_ms->socket_id, hi,
@@ -1191,6 +1501,9 @@ sync_chunk(struct rte_memseg_list *primary_msl,
 			if (ret < 0)
 				return -1;
 		}
+
+		/* Reset the allocator. */
+		alloc_memfile = 0;
 	}
 
 	/* if we just allocated memory, notify the application */
@@ -1392,6 +1705,9 @@ eal_memalloc_sync_with_primary(void)
 	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
 		return 0;
 
+	if (eal_memalloc_memfile_init() < 0)
+		return -1;
+
 	/* memalloc is locked, so it's safe to call thread-unsafe version */
 	if (rte_memseg_list_walk_thread_unsafe(sync_walk, NULL))
 		return -1;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH 21.11 v2 3/3] app/test: add allocator performance autotest
  2021-07-16 11:08 ` [dpdk-dev] [PATCH 21.11 v2 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
  2021-07-16 11:08   ` [dpdk-dev] [PATCH 21.11 v2 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
  2021-07-16 11:08   ` [dpdk-dev] [PATCH 21.11 v2 2/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
@ 2021-07-16 11:08   ` Dmitry Kozlyuk
  2021-08-09  9:45   ` [dpdk-dev] [PATCH 21.11 v2 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
  2021-09-14 10:34   ` [dpdk-dev] [PATCH v3 " Dmitry Kozlyuk
  4 siblings, 0 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-07-16 11:08 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Viacheslav Ovsiienko

Memory allocator performance is crucial to applications that deal
with large amounts of memory or allocate frequently. DPDK allocator
performance is affected by EAL options, the API used and, not least,
the allocation size. The new autotest is intended to be run with
different EAL options. It measures performance with a range of sizes
for different APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.

Work distribution between allocation and deallocation depends on EAL
options. The test prints both times and the total time to ease comparison.

Memory can be filled with zeros at different points of the allocation
path, but it always takes a considerable fraction of the overall timing.
This is why the test measures the filling speed and prints, as a hint,
how long clearing would take for each size. For example, at a fill rate
of 10 GiB/s, clearing a 1 GiB allocation would add roughly 100 ms.
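
As a usage sketch (assuming the standard dpdk-test harness, which reads
test names from stdin; the mount point below is illustrative), the test
can be run with and without the new option to compare:

    echo malloc_perf_autotest | dpdk-test
    echo malloc_perf_autotest | dpdk-test --mem-file /mnt/huge-1G/node0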

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 app/test/meson.build        |   2 +
 app/test/test_malloc_perf.c | 157 ++++++++++++++++++++++++++++++++++++
 2 files changed, 159 insertions(+)
 create mode 100644 app/test/test_malloc_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index a7611686ad..a48dc79463 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -84,6 +84,7 @@ test_sources = files(
         'test_lpm6_perf.c',
         'test_lpm_perf.c',
         'test_malloc.c',
+        'test_malloc_perf.c',
         'test_mbuf.c',
         'test_member.c',
         'test_member_perf.c',
@@ -281,6 +282,7 @@ fast_tests = [
 
 perf_test_names = [
         'ring_perf_autotest',
+        'malloc_perf_autotest',
         'mempool_perf_autotest',
         'memcpy_perf_autotest',
         'hash_perf_autotest',
diff --git a/app/test/test_malloc_perf.c b/app/test/test_malloc_perf.c
new file mode 100644
index 0000000000..4435894095
--- /dev/null
+++ b/app/test/test_malloc_perf.c
@@ -0,0 +1,157 @@
+#include <inttypes.h>
+#include <string.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_malloc.h>
+#include <rte_memzone.h>
+
+#include "test.h"
+
+typedef void * (alloc_t)(const char *name, size_t size, unsigned int align);
+typedef void (free_t)(void *addr);
+
+static const uint64_t KB = 1 << 10;
+static const uint64_t GB = 1 << 30;
+
+static double
+tsc_to_us(uint64_t tsc, size_t runs)
+{
+	return (double)tsc / rte_get_tsc_hz() * US_PER_S / runs;
+}
+
+static int
+test_memset_perf(double *us_per_gb)
+{
+	static const size_t RUNS = 20;
+
+	void *ptr;
+	size_t i;
+	uint64_t tsc;
+
+	puts("Performance: memset");
+
+	ptr = rte_malloc(NULL, GB, 0);
+	if (ptr == NULL) {
+		printf("rte_malloc(size=%"PRIx64") failed\n", GB);
+		return -1;
+	}
+
+	tsc = rte_rdtsc_precise();
+	for (i = 0; i < RUNS; i++)
+		memset(ptr, 0, GB);
+	tsc = rte_rdtsc_precise() - tsc;
+
+	*us_per_gb = tsc_to_us(tsc, RUNS);
+	printf("Result: %f.3 GiB/s <=> %.2f us/MiB\n",
+			US_PER_S / *us_per_gb, *us_per_gb / KB);
+
+	rte_free(ptr);
+	putchar('\n');
+	return 0;
+}
+
+static int
+test_alloc_perf(const char *name, alloc_t *alloc_fn, free_t free_fn,
+		size_t max_runs, double memset_gb_us)
+{
+	static const size_t SIZES[] = {
+			1 << 6, 1 << 7, 1 << 10, 1 << 12, 1 << 16, 1 << 20,
+			1 << 21, 1 << 22, 1 << 24, 1 << 30 };
+
+	size_t i, j;
+	void **ptrs;
+
+	printf("Performance: %s\n", name);
+
+	ptrs = calloc(max_runs, sizeof(ptrs[0]));
+	if (ptrs == NULL) {
+		puts("Cannot allocate memory for pointers");
+		return -1;
+	}
+
+	printf("%12s%8s%12s%12s%12s%12s\n",
+			"Size (B)", "Runs", "Alloc (us)", "Free (us)",
+			"Total (us)", "memset (us)");
+	for (i = 0; i < RTE_DIM(SIZES); i++) {
+		size_t size = SIZES[i];
+		size_t runs_done;
+		uint64_t tsc_start, tsc_alloc, tsc_free;
+		double alloc_time, free_time, memset_time;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < max_runs; j++) {
+			ptrs[j] = alloc_fn(NULL, size, 0);
+			if (ptrs[j] == NULL)
+				break;
+		}
+		tsc_alloc = rte_rdtsc_precise() - tsc_start;
+
+		if (j == 0) {
+			printf("%12zu Interrupted: out of memory.\n", size);
+			break;
+		}
+		runs_done = j;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+			free_fn(ptrs[j]);
+		tsc_free = rte_rdtsc_precise() - tsc_start;
+
+		alloc_time = tsc_to_us(tsc_alloc, runs_done);
+		free_time = tsc_to_us(tsc_free, runs_done);
+		memset_time = memset_gb_us * size / GB;
+		printf("%12zu%8zu%12.2f%12.2f%12.2f%12.2f\n",
+				size, runs_done, alloc_time, free_time,
+				alloc_time + free_time, memset_time);
+
+		memset(ptrs, 0, max_runs * sizeof(ptrs[0]));
+	}
+
+	free(ptrs);
+	putchar('\n');
+	return 0;
+}
+
+static void *
+memzone_alloc(const char *name __rte_unused, size_t size, unsigned int align)
+{
+	const struct rte_memzone *mz;
+	char gen_name[RTE_MEMZONE_NAMESIZE];
+
+	snprintf(gen_name, sizeof(gen_name), "test-mz-%"PRIx64, rte_rdtsc());
+	mz = rte_memzone_reserve_aligned(gen_name, size, SOCKET_ID_ANY,
+			RTE_MEMZONE_1GB | RTE_MEMZONE_SIZE_HINT_ONLY, align);
+	return (void *)(uintptr_t)mz;
+}
+
+static void
+memzone_free(void *addr)
+{
+	rte_memzone_free((struct rte_memzone *)addr);
+}
+
+static int
+test_malloc_perf(void)
+{
+	static const size_t MAX_RUNS = 10000;
+
+	double memset_gb_us;
+
+	if (test_memset_perf(&memset_gb_us) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_malloc", rte_malloc, rte_free,
+			MAX_RUNS, memset_gb_us) < 0)
+		return -1;
+	if (test_alloc_perf("rte_zmalloc", rte_zmalloc, rte_free,
+			MAX_RUNS, memset_gb_us) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_memzone_reserve", memzone_alloc, memzone_free,
+			RTE_MAX_MEMZONE - 1, memset_gb_us) < 0)
+		return -1;
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(malloc_perf_autotest, test_malloc_perf);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 21.11 v2 0/3] eal: add memory pre-allocation from existing files
  2021-07-16 11:08 ` [dpdk-dev] [PATCH 21.11 v2 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
                     ` (2 preceding siblings ...)
  2021-07-16 11:08   ` [dpdk-dev] [PATCH 21.11 v2 3/3] app/test: add allocator performance autotest Dmitry Kozlyuk
@ 2021-08-09  9:45   ` Dmitry Kozlyuk
  2021-08-30  8:21     ` Dmitry Kozlyuk
  2021-09-14 10:34   ` [dpdk-dev] [PATCH v3 " Dmitry Kozlyuk
  4 siblings, 1 reply; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-08-09  9:45 UTC (permalink / raw)
  To: anatoly.burakov; +Cc: dkozlyuk, dev

2021-07-16 14:08 (UTC+0300), Dmitry Kozlyuk:
> Hugepage allocation from the system takes time, resulting in slow
> startup or sporadic delays later. Most of the time spent in kernel
> is zero-filling memory for security reasons, which may be irrelevant
> in a controlled environment. The bottleneck is memory access speed,
> so for speedup, the amount of memory cleared must be reduced.
> We propose a new EAL option --mem-file FILE1,FILE2,... to quickly
> allocate dirty pages from existing files and clean them as necessary.
> A new malloc_perf_autotest is provided to estimate the impact.
> More details are explained in relevant patches.
> 
> v2: fix CI failures

Hi Anatoly, could you review please?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 21.11 v2 0/3] eal: add memory pre-allocation from existing files
  2021-08-09  9:45   ` [dpdk-dev] [PATCH 21.11 v2 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
@ 2021-08-30  8:21     ` Dmitry Kozlyuk
  0 siblings, 0 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-08-30  8:21 UTC (permalink / raw)
  To: anatoly.burakov; +Cc: dev

> -----Original Message-----
> From: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Sent: August 9, 2021, 12:45
> [...]
> Hi Anatoly, could you review please?

Hi Anatoly, did you have time to look?
Patch 1/3 may be useful on its own, because it makes adding and testing patches like [1] easier.

[1]: http://patchwork.dpdk.org/project/dpdk/patch/20210809112434.383123-1-john.levon@nutanix.com/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v3 0/3] eal: add memory pre-allocation from existing files
  2021-07-16 11:08 ` [dpdk-dev] [PATCH 21.11 v2 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
                     ` (3 preceding siblings ...)
  2021-08-09  9:45   ` [dpdk-dev] [PATCH 21.11 v2 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
@ 2021-09-14 10:34   ` Dmitry Kozlyuk
  2021-09-14 10:34     ` [dpdk-dev] [PATCH v3 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
                       ` (3 more replies)
  4 siblings, 4 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-09-14 10:34 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov

Hugepage allocation from the system takes time, resulting in slow
startup or sporadic delays later. Most of the time spent in kernel
is zero-filling memory for security reasons, which may be irrelevant
in a controlled environment. The bottleneck is memory access speed,
so for speedup, the amount of memory cleared must be reduced.
We propose a new EAL option --mem-file FILE1,FILE2,... to quickly
allocate dirty pages from existing files and clean them as necessary.
A new malloc_perf_autotest is provided to estimate the impact.
More details are explained in relevant patches.
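
For reference, the pre-allocated files might be prepared like this
(a sketch only: paths and sizes are illustrative, and fallocate(1)
support on hugetlbfs is assumed, as available on Linux 4.3+):

    mount -t hugetlbfs -o pagesize=1G none /mnt/huge-1G
    numactl --membind=0 fallocate -l 8G /mnt/huge-1G/node0
    numactl --membind=1 fallocate -l 8G /mnt/huge-1G/node1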

v3: fix hugepage mount point detection
v2: fix CI failures

Dmitry Kozlyuk (2):
  eal/linux: make hugetlbfs analysis reusable
  app/test: add allocator performance autotest

Viacheslav Ovsiienko (1):
  eal: add memory pre-allocation from existing files

 app/test/meson.build                          |   2 +
 app/test/test_malloc_perf.c                   | 157 +++++++++
 doc/guides/linux_gsg/linux_eal_parameters.rst |  17 +
 lib/eal/common/eal_common_dynmem.c            |   6 +
 lib/eal/common/eal_common_options.c           |  23 ++
 lib/eal/common/eal_internal_cfg.h             |   4 +
 lib/eal/common/eal_memalloc.h                 |   8 +-
 lib/eal/common/eal_options.h                  |   2 +
 lib/eal/common/malloc_elem.c                  |   5 +
 lib/eal/common/malloc_heap.h                  |   8 +
 lib/eal/common/rte_malloc.c                   |  16 +-
 lib/eal/include/rte_memory.h                  |   4 +-
 lib/eal/linux/eal.c                           |  28 ++
 lib/eal/linux/eal_hugepage_info.c             | 158 ++++++---
 lib/eal/linux/eal_hugepage_info.h             |  39 +++
 lib/eal/linux/eal_memalloc.c                  | 328 +++++++++++++++++-
 16 files changed, 735 insertions(+), 70 deletions(-)
 create mode 100644 app/test/test_malloc_perf.c
 create mode 100644 lib/eal/linux/eal_hugepage_info.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v3 1/3] eal/linux: make hugetlbfs analysis reusable
  2021-09-14 10:34   ` [dpdk-dev] [PATCH v3 " Dmitry Kozlyuk
@ 2021-09-14 10:34     ` Dmitry Kozlyuk
  2021-09-14 12:48       ` John Levon
  2021-09-16 12:08       ` John Levon
  2021-09-14 10:34     ` [dpdk-dev] [PATCH v3 2/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-09-14 10:34 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Viacheslav Ovsiienko, John Levon

get_hugepage_dir() searched for a hugetlbfs mount with a given page size
using handcrafted parsing of /proc/mounts, mixing traversal logic with
the selection of the needed entry. Separate the code that enumerates
hugetlbfs mounts into eal_hugepage_mount_walk(), which takes a callback
that can inspect already parsed entries. Use the mntent(3) API for
parsing. This allows the enumeration logic to be reused in subsequent
patches.
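
A minimal sketch of a hypothetical caller inside EAL (illustrative
only, not part of this patch):

    static int
    count_mounts(const char *path, uint64_t hugepage_sz, void *arg)
    {
            unsigned int *n = arg;

            RTE_LOG(DEBUG, EAL, "%s: hugepage size %" PRIu64 "\n",
                            path, hugepage_sz);
            (*n)++;
            return 0; /* 0 = continue walking, 1 = stop early */
    }

    /* ... */
    unsigned int n = 0;

    if (eal_hugepage_mount_walk(count_mounts, &n) < 0)
            RTE_LOG(ERR, EAL, "cannot enumerate hugetlbfs mounts\n");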

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
Cc: John Levon <john.levon@nutanix.com>

 lib/eal/linux/eal_hugepage_info.c | 153 +++++++++++++++++++-----------
 lib/eal/linux/eal_hugepage_info.h |  39 ++++++++
 2 files changed, 135 insertions(+), 57 deletions(-)
 create mode 100644 lib/eal/linux/eal_hugepage_info.h

diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index d97792cade..726a086ab3 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -12,6 +12,7 @@
 #include <stdio.h>
 #include <fnmatch.h>
 #include <inttypes.h>
+#include <mntent.h>
 #include <stdarg.h>
 #include <unistd.h>
 #include <errno.h>
@@ -34,6 +35,7 @@
 #include "eal_private.h"
 #include "eal_internal_cfg.h"
 #include "eal_hugepages.h"
+#include "eal_hugepage_info.h"
 #include "eal_filesystem.h"
 
 static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
@@ -195,73 +197,110 @@ get_default_hp_size(void)
 	return size;
 }
 
-static int
-get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
+int
+eal_hugepage_mount_walk(eal_hugepage_mount_walk_cb *cb, void *cb_arg)
 {
-	enum proc_mount_fieldnames {
-		DEVICE = 0,
-		MOUNTPT,
-		FSTYPE,
-		OPTIONS,
-		_FIELDNAME_MAX
-	};
-	static uint64_t default_size = 0;
-	const char proc_mounts[] = "/proc/mounts";
-	const char hugetlbfs_str[] = "hugetlbfs";
-	const size_t htlbfs_str_len = sizeof(hugetlbfs_str) - 1;
-	const char pagesize_opt[] = "pagesize=";
-	const size_t pagesize_opt_len = sizeof(pagesize_opt) - 1;
-	const char split_tok = ' ';
-	char *splitstr[_FIELDNAME_MAX];
-	char buf[BUFSIZ];
-	int retval = -1;
-	const struct internal_config *internal_conf =
-		eal_get_internal_configuration();
-
-	FILE *fd = fopen(proc_mounts, "r");
-	if (fd == NULL)
-		rte_panic("Cannot open %s\n", proc_mounts);
+	static const char PATH[] = "/proc/mounts";
+	static const char OPTION[] = "pagesize";
+
+	static uint64_t default_size;
+
+	FILE *f = NULL;
+	struct mntent *m;
+	char *hugepage_sz_str;
+	uint64_t hugepage_sz;
+	int ret = -1;
+
+	f = setmntent(PATH, "r");
+	if (f == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): setmntent(%s): %s\n",
+				__func__, PATH, strerror(errno));
+		goto exit;
+	}
 
 	if (default_size == 0)
 		default_size = get_default_hp_size();
 
-	while (fgets(buf, sizeof(buf), fd)){
-		if (rte_strsplit(buf, sizeof(buf), splitstr, _FIELDNAME_MAX,
-				split_tok) != _FIELDNAME_MAX) {
-			RTE_LOG(ERR, EAL, "Error parsing %s\n", proc_mounts);
-			break; /* return NULL */
-		}
+	ret = 0;
+	do {
+		m = getmntent(f);
+		if (m == NULL)
+			break;
 
-		/* we have a specified --huge-dir option, only examine that dir */
-		if (internal_conf->hugepage_dir != NULL &&
-				strcmp(splitstr[MOUNTPT], internal_conf->hugepage_dir) != 0)
+		if (strcmp(m->mnt_type, "hugetlbfs") != 0)
 			continue;
 
-		if (strncmp(splitstr[FSTYPE], hugetlbfs_str, htlbfs_str_len) == 0){
-			const char *pagesz_str = strstr(splitstr[OPTIONS], pagesize_opt);
-
-			/* if no explicit page size, the default page size is compared */
-			if (pagesz_str == NULL){
-				if (hugepage_sz == default_size){
-					strlcpy(hugedir, splitstr[MOUNTPT], len);
-					retval = 0;
-					break;
-				}
-			}
-			/* there is an explicit page size, so check it */
-			else {
-				uint64_t pagesz = rte_str_to_size(&pagesz_str[pagesize_opt_len]);
-				if (pagesz == hugepage_sz) {
-					strlcpy(hugedir, splitstr[MOUNTPT], len);
-					retval = 0;
-					break;
-				}
+		hugepage_sz_str = hasmntopt(m, OPTION);
+		if (hugepage_sz_str != NULL) {
+			hugepage_sz_str += strlen(OPTION) + 1; /* +1 for '=' */
+			hugepage_sz = rte_str_to_size(hugepage_sz_str);
+			if (hugepage_sz == 0) {
+				RTE_LOG(DEBUG, EAL, "Cannot parse hugepage size from '%s' for %s\n",
+						m->mnt_opts, m->mnt_dir);
+				continue;
 			}
-		} /* end if strncmp hugetlbfs */
-	} /* end while fgets */
+		} else {
+			RTE_LOG(DEBUG, EAL, "Hugepage filesystem at %s without %s option\n",
+					m->mnt_dir, OPTION);
+			hugepage_sz = default_size;
+		}
 
-	fclose(fd);
-	return retval;
+		if (cb(m->mnt_dir, hugepage_sz, cb_arg) != 0)
+			break;
+	} while (m != NULL);
+
+	if (ferror(f) && !feof(f)) {
+		RTE_LOG(DEBUG, EAL, "%s(): getmntent(): %s\n",
+				__func__, strerror(errno));
+		ret = -1;
+		goto exit;
+	}
+
+exit:
+	if (f != NULL)
+		endmntent(f);
+	return ret;
+}
+
+struct match_hugepage_mount_arg {
+	uint64_t hugepage_sz;
+	char *hugedir;
+	int hugedir_len;
+	bool done;
+};
+
+static int
+match_hugepage_mount(const char *path, uint64_t hugepage_sz, void *cb_arg)
+{
+	const struct internal_config *internal_conf =
+		eal_get_internal_configuration();
+	struct match_hugepage_mount_arg *arg = cb_arg;
+
+	/* we have a specified --huge-dir option, only examine that dir */
+	if (internal_conf->hugepage_dir != NULL &&
+			strcmp(path, internal_conf->hugepage_dir) != 0)
+		return 0;
+
+	if (hugepage_sz == arg->hugepage_sz) {
+		strlcpy(arg->hugedir, path, arg->hugedir_len);
+		arg->done = true;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int
+get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
+{
+	struct match_hugepage_mount_arg arg = {
+		.hugepage_sz = hugepage_sz,
+		.hugedir = hugedir,
+		.hugedir_len = len,
+		.done = false,
+	};
+	int ret = eal_hugepage_mount_walk(match_hugepage_mount, &arg);
+	return ret == 0 && arg.done ? 0 : -1;
 }
 
 /*
diff --git a/lib/eal/linux/eal_hugepage_info.h b/lib/eal/linux/eal_hugepage_info.h
new file mode 100644
index 0000000000..c7efa37c66
--- /dev/null
+++ b/lib/eal/linux/eal_hugepage_info.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 NVIDIA CORPORATION & AFFILIATES.
+ */
+
+#ifndef _EAL_HUGEPAGE_INFO_
+#define _EAL_HUGEPAGE_INFO_
+
+#include <stdint.h>
+
+/**
+ * Function called for each hugetlbfs mount point.
+ *
+ * @param path
+ *  Mount point directory.
+ * @param hugepage_sz
+ *  Hugepage size for the mount or default system hugepage size.
+ * @param arg
+ *  User data.
+ *
+ * @return
+ *  0 to continue walking, 1 to stop.
+ */
+typedef int (eal_hugepage_mount_walk_cb)(const char *path, uint64_t hugepage_sz,
+					 void *arg);
+
+/**
+ * Enumerate hugetlbfs mount points.
+ *
+ * @param cb
+ *  Function called for each mount point.
+ * @param cb_arg
+ *  User data passed to the callback.
+ *
+ * @return
+ *  0 on success, negative on failure.
+ */
+int eal_hugepage_mount_walk(eal_hugepage_mount_walk_cb *cb, void *cb_arg);
+
+#endif /* _EAL_HUGEPAGE_INFO_ */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v3 2/3] eal: add memory pre-allocation from existing files
  2021-09-14 10:34   ` [dpdk-dev] [PATCH v3 " Dmitry Kozlyuk
  2021-09-14 10:34     ` [dpdk-dev] [PATCH v3 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
@ 2021-09-14 10:34     ` Dmitry Kozlyuk
  2021-09-14 10:34     ` [dpdk-dev] [PATCH v3 3/3] app/test: add allocator performance autotest Dmitry Kozlyuk
  2021-09-20 12:52     ` [dpdk-dev] [PATCH v4 0/3] eal: add memory pre-allocation from existing files dkozlyuk
  3 siblings, 0 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-09-14 10:34 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Viacheslav Ovsiienko

From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

The primary DPDK process launch might take a long time if the initially
allocated memory is large. In practice, allocating 1 TB of memory
over 1 GB hugepages on Linux takes tens of seconds. Fast restart
is highly desired for some applications, and the launch delay presents
a problem.

The primary delay happens in this call trace:
  rte_eal_init()
    rte_eal_memory_init()
      rte_eal_hugepage_init()
        eal_dynmem_hugepage_init()
          eal_memalloc_alloc_seg_bulk()
            alloc_seg()
              mmap()

The largest part of the time spent in mmap() is filling the memory
with zeros. The kernel does so to prevent data leakage from a process
that was last using the page. However, in a controlled environment
it may not be an issue, while performance is. (The Linux-specific
MAP_UNINITIALIZED flag allows mapping without clearing, but it is
disabled in all popular distributions for the reason above.)

It is proposed to add a new EAL option: --mem-file FILE1,FILE2,...
to map hugepages "as is" from the specified FILEs in hugetlbfs.
Compared to using external memory for the task, the EAL option requires
no change to application code, while allowing the administrator
to control hugepage sizes and their NUMA affinity.

Limitations of the feature:

* Linux-specific (only Linux maps hugepages from files).
* Incompatible with --legacy-mem (partially replaces it).
* Incompatible with --single-file-segments
  (--mem-file FILEs can contain as many segments as needed).
* Incompatible with --in-memory (logically).

A warning about possible security implications is printed
when --mem-file is used.

Until this patch, the DPDK allocator always cleared memory on freeing,
so that it did not have to do so on allocation, while new memory
was cleared by the kernel. When --mem-file is in use, DPDK clears memory
after allocation in rte_zmalloc() and does not clear it on freeing.
Effectively, the user trades fast startup for an occasional allocation
slowdown whenever it is absolutely necessary. When memory is recycled,
it is cleared again, which is suboptimal per se, but avoids complicating
the memory management.
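
The essence of the new allocation path can be sketched as follows
(simplified from alloc_seg() below; "mf" is the internal record of one
--mem-file file, "addr" the target segment address, error handling
omitted):

    /* Map the next unused page of a pre-allocated file at the
     * requested address, "as is", without zeroing.
     */
    off_t offset = (off_t)mf->num_allocated * mf->hugepage_sz;
    void *va = mmap(addr, mf->hugepage_sz, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_POPULATE | MAP_FIXED, mf->fd, offset);

    if (va == MAP_FAILED)
            return -1;
    mf->num_allocated++;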

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 doc/guides/linux_gsg/linux_eal_parameters.rst |  17 +
 lib/eal/common/eal_common_dynmem.c            |   6 +
 lib/eal/common/eal_common_options.c           |  23 ++
 lib/eal/common/eal_internal_cfg.h             |   4 +
 lib/eal/common/eal_memalloc.h                 |   8 +-
 lib/eal/common/eal_options.h                  |   2 +
 lib/eal/common/malloc_elem.c                  |   5 +
 lib/eal/common/malloc_heap.h                  |   8 +
 lib/eal/common/rte_malloc.c                   |  16 +-
 lib/eal/include/rte_memory.h                  |   4 +-
 lib/eal/linux/eal.c                           |  28 ++
 lib/eal/linux/eal_hugepage_info.c             |   5 +
 lib/eal/linux/eal_memalloc.c                  | 328 +++++++++++++++++-
 13 files changed, 441 insertions(+), 13 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst
index bd3977cb3d..b465feaea8 100644
--- a/doc/guides/linux_gsg/linux_eal_parameters.rst
+++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
@@ -92,6 +92,23 @@ Memory-related options
 
     Free hugepages back to system exactly as they were originally allocated.
 
+*   ``--mem-file <pre-allocated files>``
+
+    Use memory from pre-allocated files in ``hugetlbfs`` without clearing it;
+    when this memory is exhausted, switch to default dynamic allocation.
+    This speeds up startup compared to ``--legacy-mem`` while also avoiding
+    later delays for allocating new hugepages. One downside is slowdown
+    of all zeroed memory allocations. Security warning: an application
+    can access contents left by previous users of hugepages. Multiple files
+    can be pre-allocated in ``hugetlbfs`` with different page sizes,
+    on desired NUMA nodes, using ``mount`` options and ``numactl``:
+
+        --mem-file /mnt/huge-1G/node0,/mnt/huge-1G/node1,/mnt/huge-2M/extra
+
+    This option is incompatible with ``--legacy-mem``, ``--in-memory``,
+    and ``--single-file-segments``. Primary and secondary processes
+    must specify exactly the same list of files.
+
 Other options
 ~~~~~~~~~~~~~
 
diff --git a/lib/eal/common/eal_common_dynmem.c b/lib/eal/common/eal_common_dynmem.c
index 7c5437ddfa..abcf22f097 100644
--- a/lib/eal/common/eal_common_dynmem.c
+++ b/lib/eal/common/eal_common_dynmem.c
@@ -272,6 +272,12 @@ eal_dynmem_hugepage_init(void)
 			internal_conf->num_hugepage_sizes) < 0)
 		return -1;
 
+#ifdef RTE_EXEC_ENV_LINUX
+	/* pre-allocate pages from --mem-file option files */
+	if (eal_memalloc_memfile_alloc(used_hp) < 0)
+		return -1;
+#endif
+
 	for (hp_sz_idx = 0;
 			hp_sz_idx < (int)internal_conf->num_hugepage_sizes;
 			hp_sz_idx++) {
diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index ff5861b5f3..c729c36630 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -86,6 +86,7 @@ eal_long_options[] = {
 	{OPT_MASTER_LCORE,      1, NULL, OPT_MASTER_LCORE_NUM     },
 	{OPT_MAIN_LCORE,        1, NULL, OPT_MAIN_LCORE_NUM       },
 	{OPT_MBUF_POOL_OPS_NAME, 1, NULL, OPT_MBUF_POOL_OPS_NAME_NUM},
+	{OPT_MEM_FILE,          1, NULL, OPT_MEM_FILE_NUM         },
 	{OPT_NO_HPET,           0, NULL, OPT_NO_HPET_NUM          },
 	{OPT_NO_HUGE,           0, NULL, OPT_NO_HUGE_NUM          },
 	{OPT_NO_PCI,            0, NULL, OPT_NO_PCI_NUM           },
@@ -1898,6 +1899,8 @@ eal_cleanup_config(struct internal_config *internal_cfg)
 		free(internal_cfg->hugepage_dir);
 	if (internal_cfg->user_mbuf_pool_ops_name != NULL)
 		free(internal_cfg->user_mbuf_pool_ops_name);
+	if (internal_cfg->mem_file[0])
+		free(internal_cfg->mem_file[0]);
 
 	return 0;
 }
@@ -2018,6 +2021,26 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"amount of reserved memory can be adjusted with "
 			"-m or --"OPT_SOCKET_MEM"\n");
 	}
+	if (internal_cfg->mem_file[0] && internal_conf->legacy_mem) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_LEGACY_MEM"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_conf->no_hugetlbfs) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_NO_HUGE"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_conf->in_memory) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_IN_MEMORY"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_conf->single_file_segments) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_SINGLE_FILE_SEGMENTS"\n");
+		return -1;
+	}
 
 	return 0;
 }
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index d6c0470eb8..814d5c66e1 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -22,6 +22,9 @@
 #define MAX_HUGEPAGE_SIZES 3  /**< support up to 3 page sizes */
 #endif
 
+#define MAX_MEMFILE_ITEMS (MAX_HUGEPAGE_SIZES * RTE_MAX_NUMA_NODES)
+/**< Maximal number of mem-file parameters. */
+
 /*
  * internal configuration structure for the number, size and
  * mount points of hugepages
@@ -83,6 +86,7 @@ struct internal_config {
 	rte_uuid_t vfio_vf_token;
 	char *hugefile_prefix;      /**< the base filename of hugetlbfs files */
 	char *hugepage_dir;         /**< specific hugetlbfs directory to use */
+	char *mem_file[MAX_MEMFILE_ITEMS]; /**< pre-allocated memory files */
 	char *user_mbuf_pool_ops_name;
 			/**< user defined mbuf pool ops name */
 	unsigned num_hugepage_sizes;      /**< how many sizes on this system */
diff --git a/lib/eal/common/eal_memalloc.h b/lib/eal/common/eal_memalloc.h
index ebc3a6f6c1..d92c9a167b 100644
--- a/lib/eal/common/eal_memalloc.h
+++ b/lib/eal/common/eal_memalloc.h
@@ -8,7 +8,7 @@
 #include <stdbool.h>
 
 #include <rte_memory.h>
-
+#include "eal_internal_cfg.h"
 /*
  * Allocate segment of specified page size.
  */
@@ -96,4 +96,10 @@ eal_memalloc_init(void);
 int
 eal_memalloc_cleanup(void);
 
+int
+eal_memalloc_memfile_init(void);
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa);
+
 #endif /* EAL_MEMALLOC_H */
diff --git a/lib/eal/common/eal_options.h b/lib/eal/common/eal_options.h
index 7b348e707f..c6c634b2b2 100644
--- a/lib/eal/common/eal_options.h
+++ b/lib/eal/common/eal_options.h
@@ -93,6 +93,8 @@ enum {
 	OPT_NO_TELEMETRY_NUM,
 #define OPT_FORCE_MAX_SIMD_BITWIDTH  "force-max-simd-bitwidth"
 	OPT_FORCE_MAX_SIMD_BITWIDTH_NUM,
+#define OPT_MEM_FILE          "mem-file"
+	OPT_MEM_FILE_NUM,
 
 	/* legacy option that will be removed in future */
 #define OPT_PCI_BLACKLIST     "pci-blacklist"
diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
index c2c9461f1d..6e71029a3c 100644
--- a/lib/eal/common/malloc_elem.c
+++ b/lib/eal/common/malloc_elem.c
@@ -578,8 +578,13 @@ malloc_elem_free(struct malloc_elem *elem)
 	/* decrease heap's count of allocated elements */
 	elem->heap->alloc_count--;
 
+#ifdef MALLOC_DEBUG
 	/* poison memory */
 	memset(ptr, MALLOC_POISON, data_len);
+#else
+	if (!malloc_clear_on_alloc())
+		memset(ptr, 0, data_len);
+#endif
 
 	return elem;
 }
diff --git a/lib/eal/common/malloc_heap.h b/lib/eal/common/malloc_heap.h
index 772736b53f..72b64d8052 100644
--- a/lib/eal/common/malloc_heap.h
+++ b/lib/eal/common/malloc_heap.h
@@ -10,6 +10,7 @@
 
 #include <rte_malloc.h>
 #include <rte_spinlock.h>
+#include "eal_private.h"
 
 /* Number of free lists per heap, grouped by size. */
 #define RTE_HEAP_NUM_FREELISTS  13
@@ -48,6 +49,13 @@ malloc_get_numa_socket(void)
 	return socket_id;
 }
 
+static inline bool
+malloc_clear_on_alloc(void)
+{
+	const struct internal_config *cfg = eal_get_internal_configuration();
+	return cfg->mem_file[0] != NULL;
+}
+
 void *
 malloc_heap_alloc(const char *type, size_t size, int socket, unsigned int flags,
 		size_t align, size_t bound, bool contig);
diff --git a/lib/eal/common/rte_malloc.c b/lib/eal/common/rte_malloc.c
index 9d39e58c08..ce94268aca 100644
--- a/lib/eal/common/rte_malloc.c
+++ b/lib/eal/common/rte_malloc.c
@@ -113,17 +113,23 @@ rte_malloc(const char *type, size_t size, unsigned align)
 void *
 rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket)
 {
+	bool zero;
 	void *ptr = rte_malloc_socket(type, size, align, socket);
 
-#ifdef RTE_MALLOC_DEBUG
 	/*
 	 * If DEBUG is enabled, then freed memory is marked with poison
-	 * value and set to zero on allocation.
-	 * If DEBUG is not enabled then  memory is already zeroed.
+	 * value and must be set to zero on allocation.
+	 * If DEBUG is not enabled then it is configurable
+	 * whether memory comes already set to zero by memalloc or on free
+	 * or it must be set to zero here.
 	 */
-	if (ptr != NULL)
-		memset(ptr, 0, size);
+#ifdef RTE_MALLOC_DEBUG
+	zero = true;
+#else
+	zero = malloc_clear_on_alloc();
 #endif
+	if (ptr != NULL && zero)
+		memset(ptr, 0, size);
 
 	rte_eal_trace_mem_zmalloc(type, size, align, socket, ptr);
 	return ptr;
diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h
index bba9b5300a..9a2b191314 100644
--- a/lib/eal/include/rte_memory.h
+++ b/lib/eal/include/rte_memory.h
@@ -40,7 +40,9 @@ extern "C" {
 /**
  * Physical memory segment descriptor.
  */
-#define RTE_MEMSEG_FLAG_DO_NOT_FREE (1 << 0)
+#define RTE_MEMSEG_FLAG_DO_NOT_FREE   (1 << 0)
+#define RTE_MEMSEG_FLAG_PRE_ALLOCATED (1 << 1)
+
 /**< Prevent this segment from being freed back to the OS. */
 struct rte_memseg {
 	rte_iova_t iova;            /**< Start IO address. */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 3577eaeaa4..d0afcd8326 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -548,6 +548,7 @@ eal_usage(const char *prgname)
 	       "  --"OPT_LEGACY_MEM"        Legacy memory mode (no dynamic allocation, contiguous segments)\n"
 	       "  --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n"
 	       "  --"OPT_MATCH_ALLOCATIONS" Free hugepages exactly as allocated\n"
+	       "  --"OPT_MEM_FILE"          Comma-separated list of files in hugetlbfs.\n"
 	       "\n");
 	/* Allow the application to print its usage message too if hook is set */
 	if (hook) {
@@ -678,6 +679,22 @@ eal_log_level_parse(int argc, char **argv)
 	optarg = old_optarg;
 }
 
+static int
+eal_parse_memfile_arg(const char *arg, char **mem_file)
+{
+	int ret;
+
+	char *copy = strdup(arg);
+	if (copy == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot store --"OPT_MEM_FILE" names\n");
+		return -1;
+	}
+
+	ret = rte_strsplit(copy, strlen(copy), mem_file,
+			MAX_MEMFILE_ITEMS, ',');
+	return ret <= 0 ? -1 : 0;
+}
+
 /* Parse the argument given in the command line of the application */
 static int
 eal_parse_args(int argc, char **argv)
@@ -819,6 +836,17 @@ eal_parse_args(int argc, char **argv)
 			internal_conf->match_allocations = 1;
 			break;
 
+		case OPT_MEM_FILE_NUM:
+			if (eal_parse_memfile_arg(optarg,
+					internal_conf->mem_file) < 0) {
+				RTE_LOG(ERR, EAL, "invalid parameters for --"
+						OPT_MEM_FILE "\n");
+				eal_usage(prgname);
+				ret = -1;
+				goto out;
+			}
+			break;
+
 		default:
 			if (opt < OPT_LONG_MIN_NUM && isprint(opt)) {
 				RTE_LOG(ERR, EAL, "Option %c is not supported "
diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index 726a086ab3..08dc0e5620 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -37,6 +37,7 @@
 #include "eal_hugepages.h"
 #include "eal_hugepage_info.h"
 #include "eal_filesystem.h"
+#include "eal_memalloc.h"
 
 static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
 static const char sys_pages_numa_dir_path[] = "/sys/devices/system/node";
@@ -515,6 +516,10 @@ hugepage_info_init(void)
 	qsort(&internal_conf->hugepage_info[0], num_sizes,
 	      sizeof(internal_conf->hugepage_info[0]), compare_hpi);
 
+	/* add pre-allocated pages with --mem-file option to available ones */
+	if (eal_memalloc_memfile_init())
+		return -1;
+
 	/* now we have all info, check we have at least one valid size */
 	for (i = 0; i < num_sizes; i++) {
 		/* pages may no longer all be on socket 0, so check all */
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 0ec8542283..c2b3586204 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 #include <limits.h>
 #include <fcntl.h>
+#include <mntent.h>
 #include <sys/ioctl.h>
 #include <sys/time.h>
 #include <signal.h>
@@ -41,6 +42,7 @@
 #include <rte_spinlock.h>
 
 #include "eal_filesystem.h"
+#include "eal_hugepage_info.h"
 #include "eal_internal_cfg.h"
 #include "eal_memalloc.h"
 #include "eal_memcfg.h"
@@ -102,6 +104,19 @@ static struct {
 	int count; /**< entries used in an array */
 } fd_list[RTE_MAX_MEMSEG_LISTS];
 
+struct memfile {
+	char *fname;		/**< file name */
+	uint64_t hugepage_sz;	/**< size of a huge page */
+	uint32_t num_pages;	/**< number of pages */
+	uint32_t num_allocated;	/**< number of already allocated pages */
+	int socket_id;		/**< Socket ID  */
+	int fd;			/**< file descriptor */
+};
+
+struct memfile mem_file[MAX_MEMFILE_ITEMS];
+
+static int alloc_memfile;
+
 /** local copy of a memory map, used to synchronize memory hotplug in MP */
 static struct rte_memseg_list local_memsegs[RTE_MAX_MEMSEG_LISTS];
 
@@ -542,6 +557,26 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		 * stage.
 		 */
 		map_offset = 0;
+	} else if (alloc_memfile) {
+		uint32_t mf;
+
+		for (mf = 0; mf < RTE_DIM(mem_file); mf++) {
+			if (alloc_sz == mem_file[mf].hugepage_sz &&
+			    socket_id == mem_file[mf].socket_id &&
+			    mem_file[mf].num_allocated < mem_file[mf].num_pages)
+				break;
+		}
+		if (mf >= RTE_DIM(mem_file)) {
+			RTE_LOG(ERR, EAL,
+				"%s() cannot allocate from memfile\n",
+				__func__);
+			return -1;
+		}
+		fd = mem_file[mf].fd;
+		fd_list[list_idx].fds[seg_idx] = fd;
+		map_offset = mem_file[mf].num_allocated * alloc_sz;
+		mmap_flags = MAP_SHARED | MAP_POPULATE | MAP_FIXED;
+		mem_file[mf].num_allocated++;
 	} else {
 		/* takes out a read lock on segment or segment list */
 		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
@@ -683,6 +718,10 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	if (fd < 0)
 		return -1;
 
+	/* don't cleanup pre-allocated files */
+	if (alloc_memfile)
+		return -1;
+
 	if (internal_conf->single_file_segments) {
 		resize_hugefile(fd, map_offset, alloc_sz, false);
 		/* ignore failure, can't make it any worse */
@@ -712,8 +751,9 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	/* erase page data */
-	memset(ms->addr, 0, ms->len);
+	/* Erase page data unless it comes from pre-allocated files. */
+	if (!alloc_memfile)
+		memset(ms->addr, 0, ms->len);
 
 	if (mmap(ms->addr, ms->len, PROT_NONE,
 			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) ==
@@ -724,8 +764,12 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 
 	eal_mem_set_dump(ms->addr, ms->len, false);
 
-	/* if we're using anonymous hugepages, nothing to be done */
-	if (internal_conf->in_memory && !memfd_create_supported) {
+	/*
+	 * if we're using anonymous hugepages or pre-allocated files,
+	 * nothing to be done
+	 */
+	if ((internal_conf->in_memory && !memfd_create_supported) ||
+			alloc_memfile) {
 		memset(ms, 0, sizeof(*ms));
 		return 0;
 	}
@@ -838,7 +882,9 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	 * during init, we already hold a write lock, so don't try to take out
 	 * another one.
 	 */
-	if (wa->hi->lock_descriptor == -1 && !internal_conf->in_memory) {
+	if (wa->hi->lock_descriptor == -1 &&
+	    !internal_conf->in_memory &&
+	    !alloc_memfile) {
 		dir_fd = open(wa->hi->hugedir, O_RDONLY);
 		if (dir_fd < 0) {
 			RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n",
@@ -868,7 +914,7 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 				need, i);
 
 			/* if exact number wasn't requested, stop */
-			if (!wa->exact)
+			if (!wa->exact || alloc_memfile)
 				goto out;
 
 			/* clean up */
@@ -1120,6 +1166,262 @@ eal_memalloc_free_seg(struct rte_memseg *ms)
 	return eal_memalloc_free_seg_bulk(&ms, 1);
 }
 
+static int
+memfile_fill_socket_id(struct memfile *mf)
+{
+#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
+	void *va;
+	int ret;
+
+	va = mmap(NULL, mf->hugepage_sz, PROT_READ | PROT_WRITE,
+			MAP_SHARED | MAP_POPULATE, mf->fd, 0);
+	if (va == MAP_FAILED) {
+		RTE_LOG(ERR, EAL, "%s(): %s: mmap(): %s\n",
+				__func__, mf->fname, strerror(errno));
+		return -1;
+	}
+
+	ret = 0;
+	if (check_numa()) {
+		if (get_mempolicy(&mf->socket_id, NULL, 0, va,
+				MPOL_F_NODE | MPOL_F_ADDR) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: get_mempolicy(): %s\n",
+				__func__, mf->fname, strerror(errno));
+			ret = -1;
+		}
+	} else
+		mf->socket_id = 0;
+
+	munmap(va, mf->hugepage_sz);
+	return ret;
+#else
+	mf->socket_id = 0;
+	return 0;
+#endif
+}
+
+struct match_memfile_path_arg {
+	const char *path;
+	uint64_t file_sz;
+	uint64_t hugepage_sz;
+	size_t best_len;
+};
+
+/*
+ * While it is unlikely for hugetlbfs, mount points can be nested.
+ * Find the deepest mount point that contains the file.
+ */
+static int
+match_memfile_path(const char *path, uint64_t hugepage_sz, void *cb_arg)
+{
+	struct match_memfile_path_arg *arg = cb_arg;
+	size_t dir_len = strlen(path);
+
+	if (dir_len < arg->best_len)
+		return 0;
+	if (strncmp(path, arg->path, dir_len) != 0)
+		return 0;
+	if (arg->file_sz % hugepage_sz != 0)
+		return 0;
+
+	arg->hugepage_sz = hugepage_sz;
+	arg->best_len = dir_len;
+	return 0;
+}
+
+/* Determine hugepage size from the path to a file in hugetlbfs. */
+static int
+memfile_fill_hugepage_sz(struct memfile *mf, uint64_t file_sz)
+{
+	char abspath[PATH_MAX];
+	struct match_memfile_path_arg arg;
+
+	if (realpath(mf->fname, abspath) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): realpath(): %s\n",
+				__func__, strerror(errno));
+		return -1;
+	}
+
+	memset(&arg, 0, sizeof(arg));
+	arg.path = abspath;
+	arg.file_sz = file_sz;
+	if (eal_hugepage_mount_walk(match_memfile_path, &arg) == 0 &&
+			arg.hugepage_sz != 0) {
+		mf->hugepage_sz = arg.hugepage_sz;
+		return 0;
+	}
+	return -1;
+}
+
+int
+eal_memalloc_memfile_init(void)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	int err = -1, fd;
+	uint32_t i;
+
+	if (internal_conf->mem_file[0] == NULL)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(internal_conf->mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t fsize;
+
+		if (internal_conf->mem_file[i] == NULL) {
+			err = 0;
+			break;
+		}
+		mf->fname = internal_conf->mem_file[i];
+		fd = open(mf->fname, O_RDWR, 0600);
+		mf->fd = fd;
+		if (fd < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: open(): %s\n",
+					__func__, mf->fname, strerror(errno));
+			break;
+		}
+
+		/* take out a read lock and keep it indefinitely */
+		if (lock(fd, LOCK_SH) != 1) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot lock file\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		fsize = get_file_size(fd);
+		if (!fsize) {
+			RTE_LOG(ERR, EAL, "%s(): %s: zero file length\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		if (memfile_fill_hugepage_sz(mf, fsize) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect page size\n",
+					__func__, mf->fname);
+			break;
+		}
+		mf->num_pages = fsize / mf->hugepage_sz;
+
+		if (memfile_fill_socket_id(mf) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect NUMA node\n",
+					__func__, mf->fname);
+			break;
+		}
+	}
+
+	/* check if some problem happened */
+	if (err && i < RTE_DIM(internal_conf->mem_file)) {
+		/* some error occurred, do rollback */
+		do {
+			fd = mem_file[i].fd;
+			/* closing fd drops the lock */
+			if (fd >= 0)
+				close(fd);
+			mem_file[i].fd = -1;
+		} while (i--);
+		return -1;
+	}
+
+	/* update hugepage_info with pages allocated in files */
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		const struct memfile *mf = &mem_file[i];
+		struct hugepage_info *hpi = NULL;
+		uint64_t sz;
+
+		if (!mf->hugepage_sz)
+			break;
+
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			hpi = &internal_conf->hugepage_info[sz];
+
+			if (mf->hugepage_sz == hpi->hugepage_sz) {
+				hpi->num_pages[mf->socket_id] += mf->num_pages;
+				break;
+			}
+		}
+
+		/* it seems hugepage info is not socket aware yet */
+		if (hpi != NULL && sz >= internal_conf->num_hugepage_sizes)
+			hpi->num_pages[0] += mf->num_pages;
+	}
+	return 0;
+}
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	uint32_t i, sz;
+
+	if (internal_conf->mem_file[0] == NULL ||
+			rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t hugepage_sz = mf->hugepage_sz;
+		int socket_id = mf->socket_id;
+		struct rte_memseg **pages;
+
+		if (!hugepage_sz)
+			break;
+
+		while (mf->num_allocated < mf->num_pages) {
+			int needed, allocated, j;
+			uint32_t prev;
+
+			prev = mf->num_allocated;
+			needed = mf->num_pages - mf->num_allocated;
+			pages = malloc(sizeof(*pages) * needed);
+			if (pages == NULL)
+				return -1;
+
+			/* memalloc is locked, it's safe to switch allocator */
+			alloc_memfile = 1;
+			allocated = eal_memalloc_alloc_seg_bulk(pages,
+					needed, hugepage_sz, socket_id,	false);
+			/* switch allocator back */
+			alloc_memfile = 0;
+			if (allocated <= 0) {
+				RTE_LOG(ERR, EAL, "%s(): %s: allocation failed\n",
+						__func__, mf->fname);
+				free(pages);
+				return -1;
+			}
+
+			/* mark preallocated pages as unfreeable */
+			for (j = 0; j < allocated; j++) {
+				struct rte_memseg *ms = pages[j];
+
+				ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE |
+					     RTE_MEMSEG_FLAG_PRE_ALLOCATED;
+			}
+
+			free(pages);
+
+			/* check whether we allocated from expected file */
+			if (prev + allocated != mf->num_allocated) {
+				RTE_LOG(ERR, EAL, "%s(): %s: incorrect allocation\n",
+						__func__, mf->fname);
+				return -1;
+			}
+		}
+
+		/* reflect we pre-allocated some memory */
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			struct hugepage_info *hpi = &hpa[sz];
+
+			if (hpi->hugepage_sz != hugepage_sz)
+				continue;
+			hpi->num_pages[socket_id] -=
+					RTE_MIN(hpi->num_pages[socket_id],
+						mf->num_allocated);
+		}
+	}
+	return 0;
+}
+
 static int
 sync_chunk(struct rte_memseg_list *primary_msl,
 		struct rte_memseg_list *local_msl, struct hugepage_info *hi,
@@ -1178,6 +1480,14 @@ sync_chunk(struct rte_memseg_list *primary_msl,
 		if (l_ms == NULL || p_ms == NULL)
 			return -1;
 
+		/*
+		 * Switch allocator for this segment.
+		 * This function is only called during init,
+		 * so don't try to restore allocator on failure.
+		 */
+		if (p_ms->flags & RTE_MEMSEG_FLAG_PRE_ALLOCATED)
+			alloc_memfile = 1;
+
 		if (used) {
 			ret = alloc_seg(l_ms, p_ms->addr,
 					p_ms->socket_id, hi,
@@ -1191,6 +1501,9 @@ sync_chunk(struct rte_memseg_list *primary_msl,
 			if (ret < 0)
 				return -1;
 		}
+
+		/* Reset the allocator. */
+		alloc_memfile = 0;
 	}
 
 	/* if we just allocated memory, notify the application */
@@ -1392,6 +1705,9 @@ eal_memalloc_sync_with_primary(void)
 	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
 		return 0;
 
+	if (eal_memalloc_memfile_init() < 0)
+		return -1;
+
 	/* memalloc is locked, so it's safe to call thread-unsafe version */
 	if (rte_memseg_list_walk_thread_unsafe(sync_walk, NULL))
 		return -1;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v3 3/3] app/test: add allocator performance autotest
  2021-09-14 10:34   ` [dpdk-dev] [PATCH v3 " Dmitry Kozlyuk
  2021-09-14 10:34     ` [dpdk-dev] [PATCH v3 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
  2021-09-14 10:34     ` [dpdk-dev] [PATCH v3 2/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
@ 2021-09-14 10:34     ` Dmitry Kozlyuk
  2021-09-20 12:52     ` [dpdk-dev] [PATCH v4 0/3] eal: add memory pre-allocation from existing files dkozlyuk
  3 siblings, 0 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-09-14 10:34 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Viacheslav Ovsiienko

Memory allocator performance is crucial to applications that deal
with large amounts of memory or allocate frequently. DPDK allocator
performance is affected by EAL options, the API used and, not least,
the allocation size. The new autotest is intended to be run with
different EAL options. It measures performance with a range of sizes
for different APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.

Work distribution between allocation and deallocation depends on EAL
options. The test prints both times and total time to ease comparison.

Memory can be filled with zeroes at different points of the allocation
path, but it always takes a considerable fraction of the overall timing.
This is why the test measures filling speed and prints how long clearing
would take for each size as a hint.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 app/test/meson.build        |   2 +
 app/test/test_malloc_perf.c | 157 ++++++++++++++++++++++++++++++++++++
 2 files changed, 159 insertions(+)
 create mode 100644 app/test/test_malloc_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index a7611686ad..a48dc79463 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -84,6 +84,7 @@ test_sources = files(
         'test_lpm6_perf.c',
         'test_lpm_perf.c',
         'test_malloc.c',
+        'test_malloc_perf.c',
         'test_mbuf.c',
         'test_member.c',
         'test_member_perf.c',
@@ -281,6 +282,7 @@ fast_tests = [
 
 perf_test_names = [
         'ring_perf_autotest',
+        'malloc_perf_autotest',
         'mempool_perf_autotest',
         'memcpy_perf_autotest',
         'hash_perf_autotest',
diff --git a/app/test/test_malloc_perf.c b/app/test/test_malloc_perf.c
new file mode 100644
index 0000000000..4435894095
--- /dev/null
+++ b/app/test/test_malloc_perf.c
@@ -0,0 +1,157 @@
+#include <inttypes.h>
+#include <string.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_malloc.h>
+#include <rte_memzone.h>
+
+#include "test.h"
+
+typedef void * (alloc_t)(const char *name, size_t size, unsigned int align);
+typedef void (free_t)(void *addr);
+
+static const uint64_t KB = 1 << 10;
+static const uint64_t GB = 1 << 30;
+
+static double
+tsc_to_us(uint64_t tsc, size_t runs)
+{
+	return (double)tsc / rte_get_tsc_hz() * US_PER_S / runs;
+}
+
+static int
+test_memset_perf(double *us_per_gb)
+{
+	static const size_t RUNS = 20;
+
+	void *ptr;
+	size_t i;
+	uint64_t tsc;
+
+	puts("Performance: memset");
+
+	ptr = rte_malloc(NULL, GB, 0);
+	if (ptr == NULL) {
+		printf("rte_malloc(size=%"PRIx64") failed\n", GB);
+		return -1;
+	}
+
+	tsc = rte_rdtsc_precise();
+	for (i = 0; i < RUNS; i++)
+		memset(ptr, 0, GB);
+	tsc = rte_rdtsc_precise() - tsc;
+
+	*us_per_gb = tsc_to_us(tsc, RUNS);
+	printf("Result: %f.3 GiB/s <=> %.2f us/MiB\n",
+			US_PER_S / *us_per_gb, *us_per_gb / KB);
+
+	rte_free(ptr);
+	putchar('\n');
+	return 0;
+}
+
+static int
+test_alloc_perf(const char *name, alloc_t *alloc_fn, free_t free_fn,
+		size_t max_runs, double memset_gb_us)
+{
+	static const size_t SIZES[] = {
+			1 << 6, 1 << 7, 1 << 10, 1 << 12, 1 << 16, 1 << 20,
+			1 << 21, 1 << 22, 1 << 24, 1 << 30 };
+
+	size_t i, j;
+	void **ptrs;
+
+	printf("Performance: %s\n", name);
+
+	ptrs = calloc(max_runs, sizeof(ptrs[0]));
+	if (ptrs == NULL) {
+		puts("Cannot allocate memory for pointers");
+		return -1;
+	}
+
+	printf("%12s%8s%12s%12s%12s%12s\n",
+			"Size (B)", "Runs", "Alloc (us)", "Free (us)",
+			"Total (us)", "memset (us)");
+	for (i = 0; i < RTE_DIM(SIZES); i++) {
+		size_t size = SIZES[i];
+		size_t runs_done;
+		uint64_t tsc_start, tsc_alloc, tsc_free;
+		double alloc_time, free_time, memset_time;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < max_runs; j++) {
+			ptrs[j] = alloc_fn(NULL, size, 0);
+			if (ptrs[j] == NULL)
+				break;
+		}
+		tsc_alloc = rte_rdtsc_precise() - tsc_start;
+
+		if (j == 0) {
+			printf("%12zu Interrupted: out of memory.\n", size);
+			break;
+		}
+		runs_done = j;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+			free_fn(ptrs[j]);
+		tsc_free = rte_rdtsc_precise() - tsc_start;
+
+		alloc_time = tsc_to_us(tsc_alloc, runs_done);
+		free_time = tsc_to_us(tsc_free, runs_done);
+		memset_time = memset_gb_us * size / GB;
+		printf("%12zu%8zu%12.2f%12.2f%12.2f%12.2f\n",
+				size, runs_done, alloc_time, free_time,
+				alloc_time + free_time, memset_time);
+
+		memset(ptrs, 0, max_runs * sizeof(ptrs[0]));
+	}
+
+	free(ptrs);
+	putchar('\n');
+	return 0;
+}
+
+static void *
+memzone_alloc(const char *name __rte_unused, size_t size, unsigned int align)
+{
+	const struct rte_memzone *mz;
+	char gen_name[RTE_MEMZONE_NAMESIZE];
+
+	snprintf(gen_name, sizeof(gen_name), "test-mz-%"PRIx64, rte_rdtsc());
+	mz = rte_memzone_reserve_aligned(gen_name, size, SOCKET_ID_ANY,
+			RTE_MEMZONE_1GB | RTE_MEMZONE_SIZE_HINT_ONLY, align);
+	return (void *)(uintptr_t)mz;
+}
+
+static void
+memzone_free(void *addr)
+{
+	rte_memzone_free((struct rte_memzone *)addr);
+}
+
+static int
+test_malloc_perf(void)
+{
+	static const size_t MAX_RUNS = 10000;
+
+	double memset_gb_us;
+
+	if (test_memset_perf(&memset_gb_us) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_malloc", rte_malloc, rte_free,
+			MAX_RUNS, memset_gb_us) < 0)
+		return -1;
+	if (test_alloc_perf("rte_zmalloc", rte_zmalloc, rte_free,
+			MAX_RUNS, memset_gb_us) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_memzone_reserve", memzone_alloc, memzone_free,
+			RTE_MAX_MEMZONE - 1, memset_gb_us) < 0)
+		return -1;
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(malloc_perf_autotest, test_malloc_perf);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] eal/linux: make hugetlbfs analysis reusable
  2021-09-14 10:34     ` [dpdk-dev] [PATCH v3 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
@ 2021-09-14 12:48       ` John Levon
  2021-09-14 12:57         ` Dmitry Kozlyuk
  2021-09-16 12:08       ` John Levon
  1 sibling, 1 reply; 48+ messages in thread
From: John Levon @ 2021-09-14 12:48 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Anatoly Burakov, Viacheslav Ovsiienko

On Tue, Sep 14, 2021 at 01:34:54PM +0300, Dmitry Kozlyuk wrote:

> get_hugepage_dir() searched for a hugetlbfs mount with a given page size
> using handcraft parsing of /proc/mounts and mixing traversal logic with
> selecting the needed entry. Separate code to enumerate hugetlbfs mounts
> to eal_hugepage_mount_walk() taking a callback that can inspect already
> parsed entries. Use mntent(3) API for parsing. This allows to reuse
> enumeration logic in subsequent patches.

Hi, are you planning to implement my pending change on top of this?

thanks
john

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] eal/linux: make hugetlbfs analysis reusable
  2021-09-14 12:48       ` John Levon
@ 2021-09-14 12:57         ` Dmitry Kozlyuk
  0 siblings, 0 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-09-14 12:57 UTC (permalink / raw)
  To: John Levon; +Cc: dev, Anatoly Burakov, Slava Ovsiienko

> -----Original Message-----
> From: John Levon <john.levon@nutanix.com>
> Sent: 14 September 2021 15:48
> To: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Cc: dev@dpdk.org; Anatoly Burakov <anatoly.burakov@intel.com>; Slava
> Ovsiienko <viacheslavo@nvidia.com>
> Subject: Re: [PATCH v3 1/3] eal/linux: make hugetlbfs analysis reusable
> 
> On Tue, Sep 14, 2021 at 01:34:54PM +0300, Dmitry Kozlyuk wrote:
> 
> > get_hugepage_dir() searched for a hugetlbfs mount with a given page
> > size using handcraft parsing of /proc/mounts and mixing traversal
> > logic with selecting the needed entry. Separate code to enumerate
> > hugetlbfs mounts to eal_hugepage_mount_walk() taking a callback that
> > can inspect already parsed entries. Use mntent(3) API for parsing.
> > This allows to reuse enumeration logic in subsequent patches.
> 
> Hi, are you planning to implement my pending change on top of this?

Yes, that's what I have in mind after your patch is merged.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] eal/linux: make hugetlbfs analysis reusable
  2021-09-14 10:34     ` [dpdk-dev] [PATCH v3 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
  2021-09-14 12:48       ` John Levon
@ 2021-09-16 12:08       ` John Levon
  1 sibling, 0 replies; 48+ messages in thread
From: John Levon @ 2021-09-16 12:08 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Anatoly Burakov, Viacheslav Ovsiienko

On Tue, Sep 14, 2021 at 01:34:54PM +0300, Dmitry Kozlyuk wrote:

> +	do {
> +		m = getmntent(f);

Should you be using getmntent_r() etc?

Nice cleanup!

regards
john

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v4 0/3] eal: add memory pre-allocation from existing files
  2021-09-14 10:34   ` [dpdk-dev] [PATCH v3 " Dmitry Kozlyuk
                       ` (2 preceding siblings ...)
  2021-09-14 10:34     ` [dpdk-dev] [PATCH v3 3/3] app/test: add allocator performance autotest Dmitry Kozlyuk
@ 2021-09-20 12:52     ` dkozlyuk
  2021-09-20 12:53       ` [dpdk-dev] [PATCH v4 1/3] eal/linux: make hugetlbfs analysis reusable dkozlyuk
                         ` (3 more replies)
  3 siblings, 4 replies; 48+ messages in thread
From: dkozlyuk @ 2021-09-20 12:52 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Dmitry Kozlyuk

From: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>

Hugepage allocation from the system takes time, resulting in slow
startup or sporadic delays later. Most of the time spent in kernel
is zero-filling memory for security reasons, which may be irrelevant
in a controlled environment. The bottleneck is memory access speed,
so to speed up startup, the amount of memory cleared must be reduced.
We propose a new EAL option --mem-file FILE1,FILE2,... to quickly
allocate dirty pages from existing files and clean them as necessary.
A new malloc_perf_autotest is provided to estimate the impact.
More details are explained in relevant patches.
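
For illustration, a minimal sketch of preparing such a file from C
(the path is hypothetical; it assumes a hugetlbfs mount at /mnt/huge-1G
and a kernel that supports ftruncate() on hugetlbfs):

#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a hugetlbfs file of `size` bytes (a multiple of the hugepage
 * size) and fault its pages in, so that a later DPDK run can map them
 * via --mem-file. NUMA placement follows the current memory policy,
 * e.g. as set with numactl.
 */
static int
prepare_mem_file(const char *path, size_t size)
{
	void *va;
	int fd;

	fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return -1;
	/* hugetlbfs supports ftruncate() for sizing the file */
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return -1;
	}
	/* MAP_POPULATE faults all hugepages in right away */
	va = mmap(NULL, size, PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_POPULATE, fd, 0);
	if (va == MAP_FAILED) {
		close(fd);
		return -1;
	}
	munmap(va, size);
	close(fd); /* the pages persist in the file after close */
	return 0;
}

Equivalently, the files can be created by any tool able to size and
touch hugetlbfs files under the desired NUMA policy.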

v4: getmntent() -> getmntent_r(), better error detection (John Levon)
v3: fix hugepage mount point detection
v2: fix CI failures

Dmitry Kozlyuk (2):
  eal/linux: make hugetlbfs analysis reusable
  app/test: add allocator performance autotest

Viacheslav Ovsiienko (1):
  eal: add memory pre-allocation from existing files

 app/test/meson.build                          |   2 +
 app/test/test_malloc_perf.c                   | 157 +++++++++
 doc/guides/linux_gsg/linux_eal_parameters.rst |  17 +
 lib/eal/common/eal_common_dynmem.c            |   6 +
 lib/eal/common/eal_common_options.c           |  23 ++
 lib/eal/common/eal_internal_cfg.h             |   4 +
 lib/eal/common/eal_memalloc.h                 |   8 +-
 lib/eal/common/eal_options.h                  |   2 +
 lib/eal/common/malloc_elem.c                  |   5 +
 lib/eal/common/malloc_heap.h                  |   8 +
 lib/eal/common/rte_malloc.c                   |  16 +-
 lib/eal/include/rte_memory.h                  |   4 +-
 lib/eal/linux/eal.c                           |  28 ++
 lib/eal/linux/eal_hugepage_info.c             | 158 ++++++---
 lib/eal/linux/eal_hugepage_info.h             |  39 +++
 lib/eal/linux/eal_memalloc.c                  | 328 +++++++++++++++++-
 16 files changed, 735 insertions(+), 70 deletions(-)
 create mode 100644 app/test/test_malloc_perf.c
 create mode 100644 lib/eal/linux/eal_hugepage_info.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v4 1/3] eal/linux: make hugetlbfs analysis reusable
  2021-09-20 12:52     ` [dpdk-dev] [PATCH v4 0/3] eal: add memory pre-allocation from existing files dkozlyuk
@ 2021-09-20 12:53       ` dkozlyuk
  2021-09-20 12:53       ` [dpdk-dev] [PATCH v4 2/3] eal: add memory pre-allocation from existing files dkozlyuk
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: dkozlyuk @ 2021-09-20 12:53 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Dmitry Kozlyuk, John Levon, Viacheslav Ovsiienko

From: Dmitry Kozlyuk <dkozlyuk@nvidia.com>

get_hugepage_dir() searched for a hugetlbfs mount with a given page size
using hand-crafted parsing of /proc/mounts, mixing traversal logic with
selection of the needed entry. Separate the code that enumerates
hugetlbfs mounts into eal_hugepage_mount_walk(), which takes a callback
that can inspect already parsed entries. Use the mntent(3) API for
parsing. This allows subsequent patches to reuse the enumeration logic.
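
As an illustration only (a hypothetical caller, not part of this patch),
a callback that counts hugetlbfs mount points via the new API could
look like:

#include <inttypes.h>
#include <stdio.h>

#include "eal_hugepage_info.h"

/* Count hugetlbfs mount points; `arg` carries the counter. */
static int
count_mount(const char *path, uint64_t hugepage_sz, void *arg)
{
	unsigned int *count = arg;

	printf("%s: %" PRIu64 "-byte pages\n", path, hugepage_sz);
	(*count)++;
	return 0; /* return 0 to continue walking, 1 to stop */
}

static unsigned int
count_hugetlbfs_mounts(void)
{
	unsigned int n = 0;

	if (eal_hugepage_mount_walk(count_mount, &n) < 0)
		return 0;
	return n;
}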

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 lib/eal/linux/eal_hugepage_info.c | 153 +++++++++++++++++++-----------
 lib/eal/linux/eal_hugepage_info.h |  39 ++++++++
 2 files changed, 135 insertions(+), 57 deletions(-)
 create mode 100644 lib/eal/linux/eal_hugepage_info.h

diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index d97792cade..193282e779 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -12,6 +12,7 @@
 #include <stdio.h>
 #include <fnmatch.h>
 #include <inttypes.h>
+#include <mntent.h>
 #include <stdarg.h>
 #include <unistd.h>
 #include <errno.h>
@@ -34,6 +35,7 @@
 #include "eal_private.h"
 #include "eal_internal_cfg.h"
 #include "eal_hugepages.h"
+#include "eal_hugepage_info.h"
 #include "eal_filesystem.h"
 
 static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
@@ -195,73 +197,110 @@ get_default_hp_size(void)
 	return size;
 }
 
-static int
-get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
+int
+eal_hugepage_mount_walk(eal_hugepage_mount_walk_cb *cb, void *cb_arg)
 {
-	enum proc_mount_fieldnames {
-		DEVICE = 0,
-		MOUNTPT,
-		FSTYPE,
-		OPTIONS,
-		_FIELDNAME_MAX
-	};
-	static uint64_t default_size = 0;
-	const char proc_mounts[] = "/proc/mounts";
-	const char hugetlbfs_str[] = "hugetlbfs";
-	const size_t htlbfs_str_len = sizeof(hugetlbfs_str) - 1;
-	const char pagesize_opt[] = "pagesize=";
-	const size_t pagesize_opt_len = sizeof(pagesize_opt) - 1;
-	const char split_tok = ' ';
-	char *splitstr[_FIELDNAME_MAX];
-	char buf[BUFSIZ];
-	int retval = -1;
-	const struct internal_config *internal_conf =
-		eal_get_internal_configuration();
-
-	FILE *fd = fopen(proc_mounts, "r");
-	if (fd == NULL)
-		rte_panic("Cannot open %s\n", proc_mounts);
+	static const char PATH[] = "/proc/mounts";
+	static const char OPTION[] = "pagesize";
+
+	static uint64_t default_size;
+
+	FILE *f = NULL;
+	struct mntent mntent;
+	char strings[PATH_MAX];
+	char *hugepage_sz_str;
+	uint64_t hugepage_sz;
+	bool stopped = false;
+	int ret = -1;
+
+	f = setmntent(PATH, "r");
+	if (f == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): setmntent(%s): %s\n",
+				__func__, PATH, strerror(errno));
+		goto exit;
+	}
 
 	if (default_size == 0)
 		default_size = get_default_hp_size();
 
-	while (fgets(buf, sizeof(buf), fd)){
-		if (rte_strsplit(buf, sizeof(buf), splitstr, _FIELDNAME_MAX,
-				split_tok) != _FIELDNAME_MAX) {
-			RTE_LOG(ERR, EAL, "Error parsing %s\n", proc_mounts);
-			break; /* return NULL */
+	ret = 0;
+	while (getmntent_r(f, &mntent, strings, sizeof(strings)) != NULL) {
+		if (strcmp(mntent.mnt_type, "hugetlbfs") != 0)
+			continue;
+
+		hugepage_sz_str = hasmntopt(&mntent, OPTION);
+		if (hugepage_sz_str != NULL) {
+			hugepage_sz_str += strlen(OPTION) + 1; /* +1 for '=' */
+			hugepage_sz = rte_str_to_size(hugepage_sz_str);
+			if (hugepage_sz == 0) {
+				RTE_LOG(DEBUG, EAL, "Cannot parse hugepage size from '%s' for %s\n",
+						mntent.mnt_opts, mntent.mnt_dir);
+				continue;
+			}
+		} else {
+			RTE_LOG(DEBUG, EAL, "Hugepage filesystem at %s without %s option\n",
+					mntent.mnt_dir, OPTION);
+			hugepage_sz = default_size;
 		}
 
-		/* we have a specified --huge-dir option, only examine that dir */
-		if (internal_conf->hugepage_dir != NULL &&
-				strcmp(splitstr[MOUNTPT], internal_conf->hugepage_dir) != 0)
-			continue;
+		if (cb(mntent.mnt_dir, hugepage_sz, cb_arg) != 0) {
+			stopped = true;
+			break;
+		}
+	}
 
-		if (strncmp(splitstr[FSTYPE], hugetlbfs_str, htlbfs_str_len) == 0){
-			const char *pagesz_str = strstr(splitstr[OPTIONS], pagesize_opt);
+	if (ferror(f) || (!stopped && !feof(f))) {
+		RTE_LOG(ERR, EAL, "%s(): getmntent_r(): %s\n",
+				__func__, strerror(errno));
+		ret = -1;
+		goto exit;
+	}
 
-			/* if no explicit page size, the default page size is compared */
-			if (pagesz_str == NULL){
-				if (hugepage_sz == default_size){
-					strlcpy(hugedir, splitstr[MOUNTPT], len);
-					retval = 0;
-					break;
-				}
-			}
-			/* there is an explicit page size, so check it */
-			else {
-				uint64_t pagesz = rte_str_to_size(&pagesz_str[pagesize_opt_len]);
-				if (pagesz == hugepage_sz) {
-					strlcpy(hugedir, splitstr[MOUNTPT], len);
-					retval = 0;
-					break;
-				}
-			}
-		} /* end if strncmp hugetlbfs */
-	} /* end while fgets */
+exit:
+	if (f != NULL)
+		endmntent(f);
+	return ret;
+}
 
-	fclose(fd);
-	return retval;
+struct match_hugepage_mount_arg {
+	uint64_t hugepage_sz;
+	char *hugedir;
+	int hugedir_len;
+	bool done;
+};
+
+static int
+match_hugepage_mount(const char *path, uint64_t hugepage_sz, void *cb_arg)
+{
+	const struct internal_config *internal_conf =
+		eal_get_internal_configuration();
+	struct match_hugepage_mount_arg *arg = cb_arg;
+
+	/* we have a specified --huge-dir option, only examine that dir */
+	if (internal_conf->hugepage_dir != NULL &&
+			strcmp(path, internal_conf->hugepage_dir) != 0)
+		return 0;
+
+	if (hugepage_sz == arg->hugepage_sz) {
+		strlcpy(arg->hugedir, path, arg->hugedir_len);
+		arg->done = true;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int
+get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
+{
+	struct match_hugepage_mount_arg arg = {
+		.hugepage_sz = hugepage_sz,
+		.hugedir = hugedir,
+		.hugedir_len = len,
+		.done = false,
+	};
+	int ret = eal_hugepage_mount_walk(match_hugepage_mount, &arg);
+	return ret == 0 && arg.done ? 0 : -1;
 }
 
 /*
diff --git a/lib/eal/linux/eal_hugepage_info.h b/lib/eal/linux/eal_hugepage_info.h
new file mode 100644
index 0000000000..c7efa37c66
--- /dev/null
+++ b/lib/eal/linux/eal_hugepage_info.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 NVIDIA CORPORATION & AFFILIATES.
+ */
+
+#ifndef _EAL_HUGEPAGE_INFO_
+#define _EAL_HUGEPAGE_INFO_
+
+#include <stdint.h>
+
+/**
+ * Function called for each hugetlbfs mount point.
+ *
+ * @param path
+ *  Mount point directory.
+ * @param hugepage_sz
+ *  Hugepage size for the mount or default system hugepage size.
+ * @param arg
+ *  User data.
+ *
+ * @return
+ *  0 to continue walking, 1 to stop.
+ */
+typedef int (eal_hugepage_mount_walk_cb)(const char *path, uint64_t hugepage_sz,
+					 void *arg);
+
+/**
+ * Enumerate hugetlbfs mount points.
+ *
+ * @param cb
+ *  Function called for each mount point.
+ * @param cb_arg
+ *  User data passed to the callback.
+ *
+ * @return
+ *  0 on success, negative on failure.
+ */
+int eal_hugepage_mount_walk(eal_hugepage_mount_walk_cb *cb, void *cb_arg);
+
+#endif /* _EAL_HUGEPAGE_INFO_ */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v4 2/3] eal: add memory pre-allocation from existing files
  2021-09-20 12:52     ` [dpdk-dev] [PATCH v4 0/3] eal: add memory pre-allocation from existing files dkozlyuk
  2021-09-20 12:53       ` [dpdk-dev] [PATCH v4 1/3] eal/linux: make hugetlbfs analysis reusable dkozlyuk
@ 2021-09-20 12:53       ` dkozlyuk
  2021-09-20 12:53       ` [dpdk-dev] [PATCH v4 3/3] app/test: add allocator performance autotest dkozlyuk
  2021-09-21  8:16       ` [dpdk-dev] [PATCH v5 0/3] eal: add memory pre-allocation from existing files dkozlyuk
  3 siblings, 0 replies; 48+ messages in thread
From: dkozlyuk @ 2021-09-20 12:53 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Viacheslav Ovsiienko, Dmitry Kozlyuk

From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

The primary DPDK process launch might take a long time if the initially
allocated memory is large. In practice, allocating 1 TB of memory over
1 GB hugepages on Linux takes tens of seconds. Fast restart is highly
desired for some applications, and this launch delay presents a problem.

The primary delay happens in this call trace:
  rte_eal_init()
    rte_eal_memory_init()
      rte_eal_hugepage_init()
        eal_dynmem_hugepage_init()
	  eal_memalloc_alloc_seg_bulk()
	    alloc_seg()
              mmap()

The largest part of the time spent in mmap() is filling the memory
with zeros. The kernel does so to prevent data leakage from the process
that last used the page. However, in a controlled environment leakage
may not be an issue, while performance is. (The Linux-specific
MAP_UNINITIALIZED flag allows mapping without clearing, but it is
disabled in all popular distributions for the reason above.)

It is proposed to add a new EAL option: --mem-file FILE1,FILE2,...
to map hugepages "as is" from specified FILEs in hugetlbfs.
Compared to using external memory for the task, the EAL option requires
no change to application code, while allowing the administrator
to control hugepage sizes and their NUMA affinity.

Limitations of the feature:

* Linux-specific (only Linux maps hugepages from files).
* Incompatible with --legacy-mem (partially replaces it).
* Incompatible with --single-file-segments
  (--mem-file FILEs can contain as many segments as needed).
* Incompatible with --in-memory (logically).

A warning about possible security implications is printed
when --mem-file is used.

Until this patch, the DPDK allocator always cleared memory on freeing,
so that it did not have to do so on allocation, while new memory was
cleared by the kernel. When --mem-file is in use, DPDK clears memory
after allocation in rte_zmalloc() and does not clear it on freeing.
Effectively, the user trades fast startup for an occasional allocation
slowdown whenever zeroing is actually necessary. When memory is
recycled, it is cleared again, which is suboptimal per se, but avoids
complicating memory management.
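
For illustration, a minimal sketch of launching EAL with the new option
from application code (the paths are hypothetical and the files must
already exist in hugetlbfs):

#include <stdlib.h>

#include <rte_eal.h>
#include <rte_malloc.h>

int
main(void)
{
	char *argv[] = { "app", "--mem-file",
			 "/mnt/huge-1G/node0,/mnt/huge-1G/node1" };

	if (rte_eal_init(3, argv) < 0)
		return EXIT_FAILURE;

	/* In this mode rte_zmalloc() clears memory after allocation,
	 * because pages taken from the files may hold stale data.
	 */
	void *p = rte_zmalloc(NULL, 1 << 20, 0);
	rte_free(p);

	rte_eal_cleanup();
	return EXIT_SUCCESS;
}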

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 doc/guides/linux_gsg/linux_eal_parameters.rst |  17 +
 lib/eal/common/eal_common_dynmem.c            |   6 +
 lib/eal/common/eal_common_options.c           |  23 ++
 lib/eal/common/eal_internal_cfg.h             |   4 +
 lib/eal/common/eal_memalloc.h                 |   8 +-
 lib/eal/common/eal_options.h                  |   2 +
 lib/eal/common/malloc_elem.c                  |   5 +
 lib/eal/common/malloc_heap.h                  |   8 +
 lib/eal/common/rte_malloc.c                   |  16 +-
 lib/eal/include/rte_memory.h                  |   4 +-
 lib/eal/linux/eal.c                           |  28 ++
 lib/eal/linux/eal_hugepage_info.c             |   5 +
 lib/eal/linux/eal_memalloc.c                  | 328 +++++++++++++++++-
 13 files changed, 441 insertions(+), 13 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst
index bd3977cb3d..b465feaea8 100644
--- a/doc/guides/linux_gsg/linux_eal_parameters.rst
+++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
@@ -92,6 +92,23 @@ Memory-related options
 
     Free hugepages back to system exactly as they were originally allocated.
 
+*   ``--mem-file <pre-allocated files>``
+
+    Use memory from pre-allocated files in ``hugetlbfs`` without clearing it;
+    when this memory is exhausted, switch to default dynamic allocation.
+    This speeds up startup compared to ``--legacy-mem`` while also avoiding
+    later delays for allocating new hugepages. One downside is slowdown
+    of all zeroed memory allocations. Security warning: an application
+    can access contents left by previous users of hugepages. Multiple files
+    can be pre-allocated in ``hugetlbfs`` with different page sizes,
+    on desired NUMA nodes, using ``mount`` options and ``numactl``:
+
+        --mem-file /mnt/huge-1G/node0,/mnt/huge-1G/node1,/mnt/huge-2M/extra
+
+    This option is incompatible with ``--legacy-mem``, ``--in-memory``,
+    and ``--single-file-segments``. Primary and secondary processes
+    must specify exactly the same list of files.
+
 Other options
 ~~~~~~~~~~~~~
 
diff --git a/lib/eal/common/eal_common_dynmem.c b/lib/eal/common/eal_common_dynmem.c
index 7c5437ddfa..abcf22f097 100644
--- a/lib/eal/common/eal_common_dynmem.c
+++ b/lib/eal/common/eal_common_dynmem.c
@@ -272,6 +272,12 @@ eal_dynmem_hugepage_init(void)
 			internal_conf->num_hugepage_sizes) < 0)
 		return -1;
 
+#ifdef RTE_EXEC_ENV_LINUX
+	/* pre-allocate pages from --mem-file option files */
+	if (eal_memalloc_memfile_alloc(used_hp) < 0)
+		return -1;
+#endif
+
 	for (hp_sz_idx = 0;
 			hp_sz_idx < (int)internal_conf->num_hugepage_sizes;
 			hp_sz_idx++) {
diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index ff5861b5f3..c729c36630 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -86,6 +86,7 @@ eal_long_options[] = {
 	{OPT_MASTER_LCORE,      1, NULL, OPT_MASTER_LCORE_NUM     },
 	{OPT_MAIN_LCORE,        1, NULL, OPT_MAIN_LCORE_NUM       },
 	{OPT_MBUF_POOL_OPS_NAME, 1, NULL, OPT_MBUF_POOL_OPS_NAME_NUM},
+	{OPT_MEM_FILE,          1, NULL, OPT_MEM_FILE_NUM         },
 	{OPT_NO_HPET,           0, NULL, OPT_NO_HPET_NUM          },
 	{OPT_NO_HUGE,           0, NULL, OPT_NO_HUGE_NUM          },
 	{OPT_NO_PCI,            0, NULL, OPT_NO_PCI_NUM           },
@@ -1898,6 +1899,8 @@ eal_cleanup_config(struct internal_config *internal_cfg)
 		free(internal_cfg->hugepage_dir);
 	if (internal_cfg->user_mbuf_pool_ops_name != NULL)
 		free(internal_cfg->user_mbuf_pool_ops_name);
+	if (internal_cfg->mem_file[0])
+		free(internal_cfg->mem_file[0]);
 
 	return 0;
 }
@@ -2018,6 +2021,26 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"amount of reserved memory can be adjusted with "
 			"-m or --"OPT_SOCKET_MEM"\n");
 	}
+	if (internal_cfg->mem_file[0] && internal_cfg->legacy_mem) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_LEGACY_MEM"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_cfg->no_hugetlbfs) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_NO_HUGE"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_cfg->in_memory) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_IN_MEMORY"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_cfg->single_file_segments) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_SINGLE_FILE_SEGMENTS"\n");
+		return -1;
+	}
 
 	return 0;
 }
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index d6c0470eb8..814d5c66e1 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -22,6 +22,9 @@
 #define MAX_HUGEPAGE_SIZES 3  /**< support up to 3 page sizes */
 #endif
 
+#define MAX_MEMFILE_ITEMS (MAX_HUGEPAGE_SIZES * RTE_MAX_NUMA_NODES)
+/**< Maximal number of mem-file parameters. */
+
 /*
  * internal configuration structure for the number, size and
  * mount points of hugepages
@@ -83,6 +86,7 @@ struct internal_config {
 	rte_uuid_t vfio_vf_token;
 	char *hugefile_prefix;      /**< the base filename of hugetlbfs files */
 	char *hugepage_dir;         /**< specific hugetlbfs directory to use */
+	char *mem_file[MAX_MEMFILE_ITEMS]; /**< pre-allocated memory files */
 	char *user_mbuf_pool_ops_name;
 			/**< user defined mbuf pool ops name */
 	unsigned num_hugepage_sizes;      /**< how many sizes on this system */
diff --git a/lib/eal/common/eal_memalloc.h b/lib/eal/common/eal_memalloc.h
index ebc3a6f6c1..d92c9a167b 100644
--- a/lib/eal/common/eal_memalloc.h
+++ b/lib/eal/common/eal_memalloc.h
@@ -8,7 +8,7 @@
 #include <stdbool.h>
 
 #include <rte_memory.h>
-
+#include "eal_internal_cfg.h"
 /*
  * Allocate segment of specified page size.
  */
@@ -96,4 +96,10 @@ eal_memalloc_init(void);
 int
 eal_memalloc_cleanup(void);
 
+int
+eal_memalloc_memfile_init(void);
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa);
+
 #endif /* EAL_MEMALLOC_H */
diff --git a/lib/eal/common/eal_options.h b/lib/eal/common/eal_options.h
index 7b348e707f..c6c634b2b2 100644
--- a/lib/eal/common/eal_options.h
+++ b/lib/eal/common/eal_options.h
@@ -93,6 +93,8 @@ enum {
 	OPT_NO_TELEMETRY_NUM,
 #define OPT_FORCE_MAX_SIMD_BITWIDTH  "force-max-simd-bitwidth"
 	OPT_FORCE_MAX_SIMD_BITWIDTH_NUM,
+#define OPT_MEM_FILE          "mem-file"
+	OPT_MEM_FILE_NUM,
 
 	/* legacy option that will be removed in future */
 #define OPT_PCI_BLACKLIST     "pci-blacklist"
diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
index c2c9461f1d..6e71029a3c 100644
--- a/lib/eal/common/malloc_elem.c
+++ b/lib/eal/common/malloc_elem.c
@@ -578,8 +578,13 @@ malloc_elem_free(struct malloc_elem *elem)
 	/* decrease heap's count of allocated elements */
 	elem->heap->alloc_count--;
 
+#ifdef MALLOC_DEBUG
 	/* poison memory */
 	memset(ptr, MALLOC_POISON, data_len);
+#else
+	if (!malloc_clear_on_alloc())
+		memset(ptr, 0, data_len);
+#endif
 
 	return elem;
 }
diff --git a/lib/eal/common/malloc_heap.h b/lib/eal/common/malloc_heap.h
index 772736b53f..72b64d8052 100644
--- a/lib/eal/common/malloc_heap.h
+++ b/lib/eal/common/malloc_heap.h
@@ -10,6 +10,7 @@
 
 #include <rte_malloc.h>
 #include <rte_spinlock.h>
+#include "eal_private.h"
 
 /* Number of free lists per heap, grouped by size. */
 #define RTE_HEAP_NUM_FREELISTS  13
@@ -48,6 +49,13 @@ malloc_get_numa_socket(void)
 	return socket_id;
 }
 
+static inline bool
+malloc_clear_on_alloc(void)
+{
+	const struct internal_config *cfg = eal_get_internal_configuration();
+	return cfg->mem_file[0] != NULL;
+}
+
 void *
 malloc_heap_alloc(const char *type, size_t size, int socket, unsigned int flags,
 		size_t align, size_t bound, bool contig);
diff --git a/lib/eal/common/rte_malloc.c b/lib/eal/common/rte_malloc.c
index 9d39e58c08..ce94268aca 100644
--- a/lib/eal/common/rte_malloc.c
+++ b/lib/eal/common/rte_malloc.c
@@ -113,17 +113,23 @@ rte_malloc(const char *type, size_t size, unsigned align)
 void *
 rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket)
 {
+	bool zero;
 	void *ptr = rte_malloc_socket(type, size, align, socket);
 
-#ifdef RTE_MALLOC_DEBUG
 	/*
 	 * If DEBUG is enabled, then freed memory is marked with poison
-	 * value and set to zero on allocation.
-	 * If DEBUG is not enabled then  memory is already zeroed.
+	 * value and must be set to zero on allocation.
+	 * If DEBUG is not enabled then it is configurable
+	 * whether memory comes already set to zero by memalloc or on free
+	 * or it must be set to zero here.
 	 */
-	if (ptr != NULL)
-		memset(ptr, 0, size);
+#ifdef RTE_MALLOC_DEBUG
+	zero = true;
+#else
+	zero = malloc_clear_on_alloc();
 #endif
+	if (ptr != NULL && zero)
+		memset(ptr, 0, size);
 
 	rte_eal_trace_mem_zmalloc(type, size, align, socket, ptr);
 	return ptr;
diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h
index bba9b5300a..9a2b191314 100644
--- a/lib/eal/include/rte_memory.h
+++ b/lib/eal/include/rte_memory.h
@@ -40,7 +40,9 @@ extern "C" {
 /**
  * Physical memory segment descriptor.
  */
-#define RTE_MEMSEG_FLAG_DO_NOT_FREE (1 << 0)
+#define RTE_MEMSEG_FLAG_DO_NOT_FREE   (1 << 0)
+#define RTE_MEMSEG_FLAG_PRE_ALLOCATED (1 << 1)
+
 /**< Prevent this segment from being freed back to the OS. */
 struct rte_memseg {
 	rte_iova_t iova;            /**< Start IO address. */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 3577eaeaa4..d0afcd8326 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -548,6 +548,7 @@ eal_usage(const char *prgname)
 	       "  --"OPT_LEGACY_MEM"        Legacy memory mode (no dynamic allocation, contiguous segments)\n"
 	       "  --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n"
 	       "  --"OPT_MATCH_ALLOCATIONS" Free hugepages exactly as allocated\n"
+	       "  --"OPT_MEM_FILE"          Comma-separated list of files in hugetlbfs.\n"
 	       "\n");
 	/* Allow the application to print its usage message too if hook is set */
 	if (hook) {
@@ -678,6 +679,22 @@ eal_log_level_parse(int argc, char **argv)
 	optarg = old_optarg;
 }
 
+static int
+eal_parse_memfile_arg(const char *arg, char **mem_file)
+{
+	int ret;
+
+	char *copy = strdup(arg);
+	if (copy == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot store --"OPT_MEM_FILE" names\n");
+		return -1;
+	}
+
+	ret = rte_strsplit(copy, strlen(copy), mem_file,
+			MAX_MEMFILE_ITEMS, ',');
+	return ret <= 0 ? -1 : 0;
+}
+
 /* Parse the argument given in the command line of the application */
 static int
 eal_parse_args(int argc, char **argv)
@@ -819,6 +836,17 @@ eal_parse_args(int argc, char **argv)
 			internal_conf->match_allocations = 1;
 			break;
 
+		case OPT_MEM_FILE_NUM:
+			if (eal_parse_memfile_arg(optarg,
+					internal_conf->mem_file) < 0) {
+				RTE_LOG(ERR, EAL, "invalid parameters for --"
+						OPT_MEM_FILE "\n");
+				eal_usage(prgname);
+				ret = -1;
+				goto out;
+			}
+			break;
+
 		default:
 			if (opt < OPT_LONG_MIN_NUM && isprint(opt)) {
 				RTE_LOG(ERR, EAL, "Option %c is not supported "
diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index 193282e779..dfbb49ada9 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -37,6 +37,7 @@
 #include "eal_hugepages.h"
 #include "eal_hugepage_info.h"
 #include "eal_filesystem.h"
+#include "eal_memalloc.h"
 
 static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
 static const char sys_pages_numa_dir_path[] = "/sys/devices/system/node";
@@ -515,6 +516,10 @@ hugepage_info_init(void)
 	qsort(&internal_conf->hugepage_info[0], num_sizes,
 	      sizeof(internal_conf->hugepage_info[0]), compare_hpi);
 
+	/* add pre-allocated pages with --mem-file option to available ones */
+	if (eal_memalloc_memfile_init())
+		return -1;
+
 	/* now we have all info, check we have at least one valid size */
 	for (i = 0; i < num_sizes; i++) {
 		/* pages may no longer all be on socket 0, so check all */
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 0ec8542283..c2b3586204 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 #include <limits.h>
 #include <fcntl.h>
+#include <mntent.h>
 #include <sys/ioctl.h>
 #include <sys/time.h>
 #include <signal.h>
@@ -41,6 +42,7 @@
 #include <rte_spinlock.h>
 
 #include "eal_filesystem.h"
+#include "eal_hugepage_info.h"
 #include "eal_internal_cfg.h"
 #include "eal_memalloc.h"
 #include "eal_memcfg.h"
@@ -102,6 +104,19 @@ static struct {
 	int count; /**< entries used in an array */
 } fd_list[RTE_MAX_MEMSEG_LISTS];
 
+struct memfile {
+	char *fname;		/**< file name */
+	uint64_t hugepage_sz;	/**< size of a huge page */
+	uint32_t num_pages;	/**< number of pages */
+	uint32_t num_allocated;	/**< number of already allocated pages */
+	int socket_id;		/**< Socket ID  */
+	int fd;			/**< file descriptor */
+};
+
+struct memfile mem_file[MAX_MEMFILE_ITEMS];
+
+static int alloc_memfile;
+
 /** local copy of a memory map, used to synchronize memory hotplug in MP */
 static struct rte_memseg_list local_memsegs[RTE_MAX_MEMSEG_LISTS];
 
@@ -542,6 +557,26 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		 * stage.
 		 */
 		map_offset = 0;
+	} else if (alloc_memfile) {
+		uint32_t mf;
+
+		for (mf = 0; mf < RTE_DIM(mem_file); mf++) {
+			if (alloc_sz == mem_file[mf].hugepage_sz &&
+			    socket_id == mem_file[mf].socket_id &&
+			    mem_file[mf].num_allocated < mem_file[mf].num_pages)
+				break;
+		}
+		if (mf >= RTE_DIM(mem_file)) {
+			RTE_LOG(ERR, EAL,
+				"%s() cannot allocate from memfile\n",
+				__func__);
+			return -1;
+		}
+		fd = mem_file[mf].fd;
+		fd_list[list_idx].fds[seg_idx] = fd;
+		map_offset = mem_file[mf].num_allocated * alloc_sz;
+		mmap_flags = MAP_SHARED | MAP_POPULATE | MAP_FIXED;
+		mem_file[mf].num_allocated++;
 	} else {
 		/* takes out a read lock on segment or segment list */
 		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
@@ -683,6 +718,10 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	if (fd < 0)
 		return -1;
 
+	/* don't cleanup pre-allocated files */
+	if (alloc_memfile)
+		return -1;
+
 	if (internal_conf->single_file_segments) {
 		resize_hugefile(fd, map_offset, alloc_sz, false);
 		/* ignore failure, can't make it any worse */
@@ -712,8 +751,9 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	/* erase page data */
-	memset(ms->addr, 0, ms->len);
+	/* Erase page data unless it comes from pre-allocated files. */
+	if (!alloc_memfile)
+		memset(ms->addr, 0, ms->len);
 
 	if (mmap(ms->addr, ms->len, PROT_NONE,
 			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) ==
@@ -724,8 +764,12 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 
 	eal_mem_set_dump(ms->addr, ms->len, false);
 
-	/* if we're using anonymous hugepages, nothing to be done */
-	if (internal_conf->in_memory && !memfd_create_supported) {
+	/*
+	 * if we're using anonymous hugepages or pre-allocated files,
+	 * nothing to be done
+	 */
+	if ((internal_conf->in_memory && !memfd_create_supported) ||
+			alloc_memfile) {
 		memset(ms, 0, sizeof(*ms));
 		return 0;
 	}
@@ -838,7 +882,9 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	 * during init, we already hold a write lock, so don't try to take out
 	 * another one.
 	 */
-	if (wa->hi->lock_descriptor == -1 && !internal_conf->in_memory) {
+	if (wa->hi->lock_descriptor == -1 &&
+	    !internal_conf->in_memory &&
+	    !alloc_memfile) {
 		dir_fd = open(wa->hi->hugedir, O_RDONLY);
 		if (dir_fd < 0) {
 			RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n",
@@ -868,7 +914,7 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 				need, i);
 
 			/* if exact number wasn't requested, stop */
-			if (!wa->exact)
+			if (!wa->exact || alloc_memfile)
 				goto out;
 
 			/* clean up */
@@ -1120,6 +1166,262 @@ eal_memalloc_free_seg(struct rte_memseg *ms)
 	return eal_memalloc_free_seg_bulk(&ms, 1);
 }
 
+static int
+memfile_fill_socket_id(struct memfile *mf)
+{
+#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
+	void *va;
+	int ret;
+
+	va = mmap(NULL, mf->hugepage_sz, PROT_READ | PROT_WRITE,
+			MAP_SHARED | MAP_POPULATE, mf->fd, 0);
+	if (va == MAP_FAILED) {
+		RTE_LOG(ERR, EAL, "%s(): %s: mmap(): %s\n",
+				__func__, mf->fname, strerror(errno));
+		return -1;
+	}
+
+	ret = 0;
+	if (check_numa()) {
+		if (get_mempolicy(&mf->socket_id, NULL, 0, va,
+				MPOL_F_NODE | MPOL_F_ADDR) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: get_mempolicy(): %s\n",
+				__func__, mf->fname, strerror(errno));
+			ret = -1;
+		}
+	} else
+		mf->socket_id = 0;
+
+	munmap(va, mf->hugepage_sz);
+	return ret;
+#else
+	mf->socket_id = 0;
+	return 0;
+#endif
+}
+
+struct match_memfile_path_arg {
+	const char *path;
+	uint64_t file_sz;
+	uint64_t hugepage_sz;
+	size_t best_len;
+};
+
+/*
+ * While it is unlikely for hugetlbfs, mount points can be nested.
+ * Find the deepest mount point that contains the file.
+ */
+static int
+match_memfile_path(const char *path, uint64_t hugepage_sz, void *cb_arg)
+{
+	struct match_memfile_path_arg *arg = cb_arg;
+	size_t dir_len = strlen(path);
+
+	if (dir_len < arg->best_len)
+		return 0;
+	if (strncmp(path, arg->path, dir_len) != 0)
+		return 0;
+	if (arg->file_sz % hugepage_sz != 0)
+		return 0;
+
+	arg->hugepage_sz = hugepage_sz;
+	arg->best_len = dir_len;
+	return 0;
+}
+
+/* Determine hugepage size from the path to a file in hugetlbfs. */
+static int
+memfile_fill_hugepage_sz(struct memfile *mf, uint64_t file_sz)
+{
+	char abspath[PATH_MAX];
+	struct match_memfile_path_arg arg;
+
+	if (realpath(mf->fname, abspath) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): realpath(): %s\n",
+				__func__, strerror(errno));
+		return -1;
+	}
+
+	memset(&arg, 0, sizeof(arg));
+	arg.path = abspath;
+	arg.file_sz = file_sz;
+	if (eal_hugepage_mount_walk(match_memfile_path, &arg) == 0 &&
+			arg.hugepage_sz != 0) {
+		mf->hugepage_sz = arg.hugepage_sz;
+		return 0;
+	}
+	return -1;
+}
+
+int
+eal_memalloc_memfile_init(void)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	int err = -1, fd;
+	uint32_t i;
+
+	if (internal_conf->mem_file[0] == NULL)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(internal_conf->mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t fsize;
+
+		if (internal_conf->mem_file[i] == NULL) {
+			err = 0;
+			break;
+		}
+		mf->fname = internal_conf->mem_file[i];
+		fd = open(mf->fname, O_RDWR, 0600);
+		mf->fd = fd;
+		if (fd < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: open(): %s\n",
+					__func__, mf->fname, strerror(errno));
+			break;
+		}
+
+		/* take out a read lock and keep it indefinitely */
+		if (lock(fd, LOCK_SH) != 1) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot lock file\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		fsize = get_file_size(fd);
+		if (!fsize) {
+			RTE_LOG(ERR, EAL, "%s(): %s: zero file length\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		if (memfile_fill_hugepage_sz(mf, fsize) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect page size\n",
+					__func__, mf->fname);
+			break;
+		}
+		mf->num_pages = fsize / mf->hugepage_sz;
+
+		if (memfile_fill_socket_id(mf) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect NUMA node\n",
+					__func__, mf->fname);
+			break;
+		}
+	}
+
+	/* check if some problem happened */
+	if (err && i < RTE_DIM(internal_conf->mem_file)) {
+		/* some error occurred, do rollback */
+		do {
+			fd = mem_file[i].fd;
+			/* closing fd drops the lock */
+			if (fd >= 0)
+				close(fd);
+			mem_file[i].fd = -1;
+		} while (i--);
+		return -1;
+	}
+
+	/* update hugepage_info with pages allocated in files */
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		const struct memfile *mf = &mem_file[i];
+		struct hugepage_info *hpi = NULL;
+		uint64_t sz;
+
+		if (!mf->hugepage_sz)
+			break;
+
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			hpi = &internal_conf->hugepage_info[sz];
+
+			if (mf->hugepage_sz == hpi->hugepage_sz) {
+				hpi->num_pages[mf->socket_id] += mf->num_pages;
+				break;
+			}
+		}
+
+		/* it seems hugepage info is not socket aware yet */
+		if (hpi != NULL && sz >= internal_conf->num_hugepage_sizes)
+			hpi->num_pages[0] += mf->num_pages;
+	}
+	return 0;
+}
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	uint32_t i, sz;
+
+	if (internal_conf->mem_file[0] == NULL ||
+			rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t hugepage_sz = mf->hugepage_sz;
+		int socket_id = mf->socket_id;
+		struct rte_memseg **pages;
+
+		if (!hugepage_sz)
+			break;
+
+		while (mf->num_allocated < mf->num_pages) {
+			int needed, allocated, j;
+			uint32_t prev;
+
+			prev = mf->num_allocated;
+			needed = mf->num_pages - mf->num_allocated;
+			pages = malloc(sizeof(*pages) * needed);
+			if (pages == NULL)
+				return -1;
+
+			/* memalloc is locked, it's safe to switch allocator */
+			alloc_memfile = 1;
+			allocated = eal_memalloc_alloc_seg_bulk(pages,
+					needed, hugepage_sz, socket_id,	false);
+			/* switch allocator back */
+			alloc_memfile = 0;
+			if (allocated <= 0) {
+				RTE_LOG(ERR, EAL, "%s(): %s: allocation failed\n",
+						__func__, mf->fname);
+				free(pages);
+				return -1;
+			}
+
+			/* mark preallocated pages as unfreeable */
+			for (j = 0; j < allocated; j++) {
+				struct rte_memseg *ms = pages[j];
+
+				ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE |
+					     RTE_MEMSEG_FLAG_PRE_ALLOCATED;
+			}
+
+			free(pages);
+
+			/* check whether we allocated from expected file */
+			if (prev + allocated != mf->num_allocated) {
+				RTE_LOG(ERR, EAL, "%s(): %s: incorrect allocation\n",
+						__func__, mf->fname);
+				return -1;
+			}
+		}
+
+		/* reflect we pre-allocated some memory */
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			struct hugepage_info *hpi = &hpa[sz];
+
+			if (hpi->hugepage_sz != hugepage_sz)
+				continue;
+			hpi->num_pages[socket_id] -=
+					RTE_MIN(hpi->num_pages[socket_id],
+						mf->num_allocated);
+		}
+	}
+	return 0;
+}
+
 static int
 sync_chunk(struct rte_memseg_list *primary_msl,
 		struct rte_memseg_list *local_msl, struct hugepage_info *hi,
@@ -1178,6 +1480,14 @@ sync_chunk(struct rte_memseg_list *primary_msl,
 		if (l_ms == NULL || p_ms == NULL)
 			return -1;
 
+		/*
+		 * Switch allocator for this segment.
+		 * This function is only called during init,
+		 * so don't try to restore allocator on failure.
+		 */
+		if (p_ms->flags & RTE_MEMSEG_FLAG_PRE_ALLOCATED)
+			alloc_memfile = 1;
+
 		if (used) {
 			ret = alloc_seg(l_ms, p_ms->addr,
 					p_ms->socket_id, hi,
@@ -1191,6 +1501,9 @@ sync_chunk(struct rte_memseg_list *primary_msl,
 			if (ret < 0)
 				return -1;
 		}
+
+		/* Reset the allocator. */
+		alloc_memfile = 0;
 	}
 
 	/* if we just allocated memory, notify the application */
@@ -1392,6 +1705,9 @@ eal_memalloc_sync_with_primary(void)
 	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
 		return 0;
 
+	if (eal_memalloc_memfile_init() < 0)
+		return -1;
+
 	/* memalloc is locked, so it's safe to call thread-unsafe version */
 	if (rte_memseg_list_walk_thread_unsafe(sync_walk, NULL))
 		return -1;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v4 3/3] app/test: add allocator performance autotest
  2021-09-20 12:52     ` [dpdk-dev] [PATCH v4 0/3] eal: add memory pre-allocation from existing files dkozlyuk
  2021-09-20 12:53       ` [dpdk-dev] [PATCH v4 1/3] eal/linux: make hugetlbfs analysis reusable dkozlyuk
  2021-09-20 12:53       ` [dpdk-dev] [PATCH v4 2/3] eal: add memory pre-allocation from existing files dkozlyuk
@ 2021-09-20 12:53       ` dkozlyuk
  2021-09-21  8:16       ` [dpdk-dev] [PATCH v5 0/3] eal: add memory pre-allocation from existing files dkozlyuk
  3 siblings, 0 replies; 48+ messages in thread
From: dkozlyuk @ 2021-09-20 12:53 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Dmitry Kozlyuk, Viacheslav Ovsiienko

From: Dmitry Kozlyuk <dkozlyuk@nvidia.com>

Memory allocator performance is crucial to applications that deal
with large amounts of memory or allocate frequently. DPDK allocator
performance is affected by EAL options, the API used, and, at the very
least, the allocation size. The new autotest is intended to be run with
different EAL options. It measures performance with a range of sizes
for different APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.

The work distribution between allocation and deallocation depends on
EAL options. The test prints both times and the total time to ease
comparison.

Memory can be filled with zeroes at different points of the allocation
path, but it always takes a considerable fraction of the overall timing.
This is why the test measures the filling speed and prints how long
clearing would take for each size, as a hint.
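
The measurement pattern is a plain TSC-timed loop; a simplified sketch
of what the test does for each size (runs, size, and the ptrs array are
as in the code below):

	uint64_t start = rte_rdtsc_precise();
	for (j = 0; j < runs; j++)
		ptrs[j] = rte_malloc(NULL, size, 0);
	uint64_t tsc_alloc = rte_rdtsc_precise() - start;
	/* free in a second timed loop, then convert both TSC deltas
	 * to microseconds per call for the report
	 */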

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 app/test/meson.build        |   2 +
 app/test/test_malloc_perf.c | 157 ++++++++++++++++++++++++++++++++++++
 2 files changed, 159 insertions(+)
 create mode 100644 app/test/test_malloc_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index a7611686ad..a48dc79463 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -84,6 +84,7 @@ test_sources = files(
         'test_lpm6_perf.c',
         'test_lpm_perf.c',
         'test_malloc.c',
+        'test_malloc_perf.c',
         'test_mbuf.c',
         'test_member.c',
         'test_member_perf.c',
@@ -281,6 +282,7 @@ fast_tests = [
 
 perf_test_names = [
         'ring_perf_autotest',
+        'malloc_perf_autotest',
         'mempool_perf_autotest',
         'memcpy_perf_autotest',
         'hash_perf_autotest',
diff --git a/app/test/test_malloc_perf.c b/app/test/test_malloc_perf.c
new file mode 100644
index 0000000000..4435894095
--- /dev/null
+++ b/app/test/test_malloc_perf.c
@@ -0,0 +1,157 @@
+#include <inttypes.h>
+#include <string.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_malloc.h>
+#include <rte_memzone.h>
+
+#include "test.h"
+
+typedef void * (alloc_t)(const char *name, size_t size, unsigned int align);
+typedef void (free_t)(void *addr);
+
+static const uint64_t KB = 1 << 10;
+static const uint64_t GB = 1 << 30;
+
+static double
+tsc_to_us(uint64_t tsc, size_t runs)
+{
+	return (double)tsc / rte_get_tsc_hz() * US_PER_S / runs;
+}
+
+static int
+test_memset_perf(double *us_per_gb)
+{
+	static const size_t RUNS = 20;
+
+	void *ptr;
+	size_t i;
+	uint64_t tsc;
+
+	puts("Performance: memset");
+
+	ptr = rte_malloc(NULL, GB, 0);
+	if (ptr == NULL) {
+		printf("rte_malloc(size=%"PRIu64") failed\n", GB);
+		return -1;
+	}
+
+	tsc = rte_rdtsc_precise();
+	for (i = 0; i < RUNS; i++)
+		memset(ptr, 0, GB);
+	tsc = rte_rdtsc_precise() - tsc;
+
+	*us_per_gb = tsc_to_us(tsc, RUNS);
+	printf("Result: %.3f GiB/s <=> %.2f us/MiB\n",
+			US_PER_S / *us_per_gb, *us_per_gb / KB);
+
+	rte_free(ptr);
+	putchar('\n');
+	return 0;
+}
+
+static int
+test_alloc_perf(const char *name, alloc_t *alloc_fn, free_t free_fn,
+		size_t max_runs, double memset_gb_us)
+{
+	static const size_t SIZES[] = {
+			1 << 6, 1 << 7, 1 << 10, 1 << 12, 1 << 16, 1 << 20,
+			1 << 21, 1 << 22, 1 << 24, 1 << 30 };
+
+	size_t i, j;
+	void **ptrs;
+
+	printf("Performance: %s\n", name);
+
+	ptrs = calloc(max_runs, sizeof(ptrs[0]));
+	if (ptrs == NULL) {
+		puts("Cannot allocate memory for pointers");
+		return -1;
+	}
+
+	printf("%12s%8s%12s%12s%12s%12s\n",
+			"Size (B)", "Runs", "Alloc (us)", "Free (us)",
+			"Total (us)", "memset (us)");
+	for (i = 0; i < RTE_DIM(SIZES); i++) {
+		size_t size = SIZES[i];
+		size_t runs_done;
+		uint64_t tsc_start, tsc_alloc, tsc_free;
+		double alloc_time, free_time, memset_time;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < max_runs; j++) {
+			ptrs[j] = alloc_fn(NULL, size, 0);
+			if (ptrs[j] == NULL)
+				break;
+		}
+		tsc_alloc = rte_rdtsc_precise() - tsc_start;
+
+		if (j == 0) {
+			printf("%12zu Interrupted: out of memory.\n", size);
+			break;
+		}
+		runs_done = j;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+			free_fn(ptrs[j]);
+		tsc_free = rte_rdtsc_precise() - tsc_start;
+
+		alloc_time = tsc_to_us(tsc_alloc, runs_done);
+		free_time = tsc_to_us(tsc_free, runs_done);
+		memset_time = memset_gb_us * size / GB;
+		printf("%12zu%8zu%12.2f%12.2f%12.2f%12.2f\n",
+				size, runs_done, alloc_time, free_time,
+				alloc_time + free_time, memset_time);
+
+		memset(ptrs, 0, max_runs * sizeof(ptrs[0]));
+	}
+
+	free(ptrs);
+	putchar('\n');
+	return 0;
+}
+
+static void *
+memzone_alloc(const char *name __rte_unused, size_t size, unsigned int align)
+{
+	const struct rte_memzone *mz;
+	char gen_name[RTE_MEMZONE_NAMESIZE];
+
+	snprintf(gen_name, sizeof(gen_name), "test-mz-%"PRIx64, rte_rdtsc());
+	mz = rte_memzone_reserve_aligned(gen_name, size, SOCKET_ID_ANY,
+			RTE_MEMZONE_1GB | RTE_MEMZONE_SIZE_HINT_ONLY, align);
+	return (void *)(uintptr_t)mz;
+}
+
+static void
+memzone_free(void *addr)
+{
+	rte_memzone_free((struct rte_memzone *)addr);
+}
+
+static int
+test_malloc_perf(void)
+{
+	static const size_t MAX_RUNS = 10000;
+
+	double memset_gb_us;
+
+	if (test_memset_perf(&memset_gb_us) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_malloc", rte_malloc, rte_free,
+			MAX_RUNS, memset_gb_us) < 0)
+		return -1;
+	if (test_alloc_perf("rte_zmalloc", rte_zmalloc, rte_free,
+			MAX_RUNS, memset_gb_us) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_memzone_reserve", memzone_alloc, memzone_free,
+			RTE_MAX_MEMZONE - 1, memset_gb_us) < 0)
+		return -1;
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(malloc_perf_autotest, test_malloc_perf);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v5 0/3] eal: add memory pre-allocation from existing files
  2021-09-20 12:52     ` [dpdk-dev] [PATCH v4 0/3] eal: add memory pre-allocation from existing files dkozlyuk
                         ` (2 preceding siblings ...)
  2021-09-20 12:53       ` [dpdk-dev] [PATCH v4 3/3] app/test: add allocator performance autotest dkozlyuk
@ 2021-09-21  8:16       ` dkozlyuk
  2021-09-21  8:16         ` [dpdk-dev] [PATCH v5 1/3] eal/linux: make hugetlbfs analysis reusable dkozlyuk
                           ` (3 more replies)
  3 siblings, 4 replies; 48+ messages in thread
From: dkozlyuk @ 2021-09-21  8:16 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Dmitry Kozlyuk

From: Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com>

Hugepage allocation from the system takes time, resulting in slow
startup or sporadic delays later. Most of the time spent in the kernel
is zero-filling memory for security reasons, which may be irrelevant
in a controlled environment. The bottleneck is memory access speed,
so to speed up allocation, the amount of memory cleared must be reduced.
We propose a new EAL option --mem-file FILE1,FILE2,... to quickly
allocate dirty pages from existing files and clean them as necessary.
A new malloc_perf_autotest is provided to estimate the impact.
More details are explained in relevant patches.
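
For example, files could be prepared like this (a sketch only: the
mount point, file name, and sizes are illustrative, and fallocate(2)
support for hugetlbfs depends on the kernel version):

	# mount a hugetlbfs with 1 GiB pages and pre-allocate a 4 GiB file
	mount -t hugetlbfs -o pagesize=1G nodev /mnt/huge-1G
	numactl --membind=0 fallocate -l 4G /mnt/huge-1G/node0
	# reuse the pre-allocated pages on the next application start
	dpdk-testpmd --mem-file /mnt/huge-1G/node0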

v5: rebase
v4: getmntent() -> getmntent_r(), better error detection (John Levon)
v3: fix hugepage mount point detection
v2: fix CI failures

Dmitry Kozlyuk (2):
  eal/linux: make hugetlbfs analysis reusable
  app/test: add allocator performance autotest

Viacheslav Ovsiienko (1):
  eal: add memory pre-allocation from existing files

 app/test/meson.build                          |   2 +
 app/test/test_malloc_perf.c                   | 157 +++++++++
 doc/guides/linux_gsg/linux_eal_parameters.rst |  17 +
 lib/eal/common/eal_common_dynmem.c            |   6 +
 lib/eal/common/eal_common_options.c           |  23 ++
 lib/eal/common/eal_internal_cfg.h             |   4 +
 lib/eal/common/eal_memalloc.h                 |   8 +-
 lib/eal/common/eal_options.h                  |   2 +
 lib/eal/common/malloc_elem.c                  |   5 +
 lib/eal/common/malloc_heap.h                  |   8 +
 lib/eal/common/rte_malloc.c                   |  16 +-
 lib/eal/include/rte_memory.h                  |   4 +-
 lib/eal/linux/eal.c                           |  28 ++
 lib/eal/linux/eal_hugepage_info.c             | 158 ++++++---
 lib/eal/linux/eal_hugepage_info.h             |  39 +++
 lib/eal/linux/eal_memalloc.c                  | 328 +++++++++++++++++-
 16 files changed, 735 insertions(+), 70 deletions(-)
 create mode 100644 app/test/test_malloc_perf.c
 create mode 100644 lib/eal/linux/eal_hugepage_info.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v5 1/3] eal/linux: make hugetlbfs analysis reusable
  2021-09-21  8:16       ` [dpdk-dev] [PATCH v5 0/3] eal: add memory pre-allocation from existing files dkozlyuk
@ 2021-09-21  8:16         ` dkozlyuk
  2021-09-22 13:52           ` John Levon
  2021-10-05 17:36           ` Thomas Monjalon
  2021-09-21  8:16         ` [dpdk-dev] [PATCH v5 2/3] eal: add memory pre-allocation from existing files dkozlyuk
                           ` (2 subsequent siblings)
  3 siblings, 2 replies; 48+ messages in thread
From: dkozlyuk @ 2021-09-21  8:16 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Dmitry Kozlyuk, John Levon, Viacheslav Ovsiienko

From: Dmitry Kozlyuk <dkozlyuk@nvidia.com>

get_hugepage_dir() searched for a hugetlbfs mount with a given page size
using handcrafted parsing of /proc/mounts, mixing traversal logic with
selection of the needed entry. Separate the code that enumerates hugetlbfs
mounts into eal_hugepage_mount_walk(), which takes a callback that can
inspect already parsed entries. Use the mntent(3) API for parsing. This
allows the enumeration logic to be reused in subsequent patches.
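
For illustration, a minimal sketch of a hypothetical caller (the
callback and the counting logic are invented for this example; only
the eal_hugepage_mount_walk() contract comes from this patch):

	/* count hugetlbfs mounts that carry a given page size */
	struct count_arg {
		uint64_t hugepage_sz; /* page size to look for */
		unsigned int count;   /* matching mounts found */
	};

	static int
	count_mounts(const char *path __rte_unused, uint64_t hugepage_sz,
			void *cb_arg)
	{
		struct count_arg *arg = cb_arg;

		if (hugepage_sz == arg->hugepage_sz)
			arg->count++;
		return 0; /* 0 continues the walk, 1 would stop it */
	}

	/* in the caller: */
	struct count_arg arg = { .hugepage_sz = RTE_PGSIZE_1G };

	if (eal_hugepage_mount_walk(count_mounts, &arg) < 0)
		return -1; /* e.g. /proc/mounts could not be opened */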

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 lib/eal/linux/eal_hugepage_info.c | 153 +++++++++++++++++++-----------
 lib/eal/linux/eal_hugepage_info.h |  39 ++++++++
 2 files changed, 135 insertions(+), 57 deletions(-)
 create mode 100644 lib/eal/linux/eal_hugepage_info.h

diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index d97792cade..193282e779 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -12,6 +12,7 @@
 #include <stdio.h>
 #include <fnmatch.h>
 #include <inttypes.h>
+#include <mntent.h>
 #include <stdarg.h>
 #include <unistd.h>
 #include <errno.h>
@@ -34,6 +35,7 @@
 #include "eal_private.h"
 #include "eal_internal_cfg.h"
 #include "eal_hugepages.h"
+#include "eal_hugepage_info.h"
 #include "eal_filesystem.h"
 
 static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
@@ -195,73 +197,110 @@ get_default_hp_size(void)
 	return size;
 }
 
-static int
-get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
+int
+eal_hugepage_mount_walk(eal_hugepage_mount_walk_cb *cb, void *cb_arg)
 {
-	enum proc_mount_fieldnames {
-		DEVICE = 0,
-		MOUNTPT,
-		FSTYPE,
-		OPTIONS,
-		_FIELDNAME_MAX
-	};
-	static uint64_t default_size = 0;
-	const char proc_mounts[] = "/proc/mounts";
-	const char hugetlbfs_str[] = "hugetlbfs";
-	const size_t htlbfs_str_len = sizeof(hugetlbfs_str) - 1;
-	const char pagesize_opt[] = "pagesize=";
-	const size_t pagesize_opt_len = sizeof(pagesize_opt) - 1;
-	const char split_tok = ' ';
-	char *splitstr[_FIELDNAME_MAX];
-	char buf[BUFSIZ];
-	int retval = -1;
-	const struct internal_config *internal_conf =
-		eal_get_internal_configuration();
-
-	FILE *fd = fopen(proc_mounts, "r");
-	if (fd == NULL)
-		rte_panic("Cannot open %s\n", proc_mounts);
+	static const char PATH[] = "/proc/mounts";
+	static const char OPTION[] = "pagesize";
+
+	static uint64_t default_size;
+
+	FILE *f = NULL;
+	struct mntent mntent;
+	char strings[PATH_MAX];
+	char *hugepage_sz_str;
+	uint64_t hugepage_sz;
+	bool stopped = false;
+	int ret = -1;
+
+	f = setmntent(PATH, "r");
+	if (f == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): setmntent(%s): %s\n",
+				__func__, PATH, strerror(errno));
+		goto exit;
+	}
 
 	if (default_size == 0)
 		default_size = get_default_hp_size();
 
-	while (fgets(buf, sizeof(buf), fd)){
-		if (rte_strsplit(buf, sizeof(buf), splitstr, _FIELDNAME_MAX,
-				split_tok) != _FIELDNAME_MAX) {
-			RTE_LOG(ERR, EAL, "Error parsing %s\n", proc_mounts);
-			break; /* return NULL */
+	ret = 0;
+	while (getmntent_r(f, &mntent, strings, sizeof(strings)) != NULL) {
+		if (strcmp(mntent.mnt_type, "hugetlbfs") != 0)
+			continue;
+
+		hugepage_sz_str = hasmntopt(&mntent, OPTION);
+		if (hugepage_sz_str != NULL) {
+			hugepage_sz_str += strlen(OPTION) + 1; /* +1 for '=' */
+			hugepage_sz = rte_str_to_size(hugepage_sz_str);
+			if (hugepage_sz == 0) {
+				RTE_LOG(DEBUG, EAL, "Cannot parse hugepage size from '%s' for %s\n",
+						mntent.mnt_opts, mntent.mnt_dir);
+				continue;
+			}
+		} else {
+			RTE_LOG(DEBUG, EAL, "Hugepage filesystem at %s without %s option\n",
+					mntent.mnt_dir, OPTION);
+			hugepage_sz = default_size;
 		}
 
-		/* we have a specified --huge-dir option, only examine that dir */
-		if (internal_conf->hugepage_dir != NULL &&
-				strcmp(splitstr[MOUNTPT], internal_conf->hugepage_dir) != 0)
-			continue;
+		if (cb(mntent.mnt_dir, hugepage_sz, cb_arg) != 0) {
+			stopped = true;
+			break;
+		}
+	}
 
-		if (strncmp(splitstr[FSTYPE], hugetlbfs_str, htlbfs_str_len) == 0){
-			const char *pagesz_str = strstr(splitstr[OPTIONS], pagesize_opt);
+	if (ferror(f) || (!stopped && !feof(f))) {
+		RTE_LOG(ERR, EAL, "%s(): getmntent_r(): %s\n",
+				__func__, strerror(errno));
+		ret = -1;
+		goto exit;
+	}
 
-			/* if no explicit page size, the default page size is compared */
-			if (pagesz_str == NULL){
-				if (hugepage_sz == default_size){
-					strlcpy(hugedir, splitstr[MOUNTPT], len);
-					retval = 0;
-					break;
-				}
-			}
-			/* there is an explicit page size, so check it */
-			else {
-				uint64_t pagesz = rte_str_to_size(&pagesz_str[pagesize_opt_len]);
-				if (pagesz == hugepage_sz) {
-					strlcpy(hugedir, splitstr[MOUNTPT], len);
-					retval = 0;
-					break;
-				}
-			}
-		} /* end if strncmp hugetlbfs */
-	} /* end while fgets */
+exit:
+	if (f != NULL)
+		endmntent(f);
+	return ret;
+}
 
-	fclose(fd);
-	return retval;
+struct match_hugepage_mount_arg {
+	uint64_t hugepage_sz;
+	char *hugedir;
+	int hugedir_len;
+	bool done;
+};
+
+static int
+match_hugepage_mount(const char *path, uint64_t hugepage_sz, void *cb_arg)
+{
+	const struct internal_config *internal_conf =
+		eal_get_internal_configuration();
+	struct match_hugepage_mount_arg *arg = cb_arg;
+
+	/* we have a specified --huge-dir option, only examine that dir */
+	if (internal_conf->hugepage_dir != NULL &&
+			strcmp(path, internal_conf->hugepage_dir) != 0)
+		return 0;
+
+	if (hugepage_sz == arg->hugepage_sz) {
+		strlcpy(arg->hugedir, path, arg->hugedir_len);
+		arg->done = true;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int
+get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
+{
+	struct match_hugepage_mount_arg arg = {
+		.hugepage_sz = hugepage_sz,
+		.hugedir = hugedir,
+		.hugedir_len = len,
+		.done = false,
+	};
+	int ret = eal_hugepage_mount_walk(match_hugepage_mount, &arg);
+	return ret == 0 && arg.done ? 0 : -1;
 }
 
 /*
diff --git a/lib/eal/linux/eal_hugepage_info.h b/lib/eal/linux/eal_hugepage_info.h
new file mode 100644
index 0000000000..c7efa37c66
--- /dev/null
+++ b/lib/eal/linux/eal_hugepage_info.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 NVIDIA CORPORATION & AFFILIATES.
+ */
+
+#ifndef _EAL_HUGEPAGE_INFO_
+#define _EAL_HUGEPAGE_INFO_
+
+#include <stdint.h>
+
+/**
+ * Function called for each hugetlbfs mount point.
+ *
+ * @param path
+ *  Mount point directory.
+ * @param hugepage_sz
+ *  Hugepage size for the mount or default system hugepage size.
+ * @param arg
+ *  User data.
+ *
+ * @return
+ *  0 to continue walking, 1 to stop.
+ */
+typedef int (eal_hugepage_mount_walk_cb)(const char *path, uint64_t hugepage_sz,
+					 void *arg);
+
+/**
+ * Enumerate hugetlbfs mount points.
+ *
+ * @param cb
+ *  Function called for each mount point.
+ * @param cb_arg
+ *  User data passed to the callback.
+ *
+ * @return
+ *  0 on success, negative on failure.
+ */
+int eal_hugepage_mount_walk(eal_hugepage_mount_walk_cb *cb, void *cb_arg);
+
+#endif /* _EAL_HUGEPAGE_INFO_ */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v5 2/3] eal: add memory pre-allocation from existing files
  2021-09-21  8:16       ` [dpdk-dev] [PATCH v5 0/3] eal: add memory pre-allocation from existing files dkozlyuk
  2021-09-21  8:16         ` [dpdk-dev] [PATCH v5 1/3] eal/linux: make hugetlbfs analysis reusable dkozlyuk
@ 2021-09-21  8:16         ` dkozlyuk
  2021-09-21  8:16         ` [dpdk-dev] [PATCH v5 3/3] app/test: add allocator performance autotest dkozlyuk
  2021-10-11  8:56         ` [dpdk-dev] [PATCH v6 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
  3 siblings, 0 replies; 48+ messages in thread
From: dkozlyuk @ 2021-09-21  8:16 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Viacheslav Ovsiienko, Dmitry Kozlyuk

From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

The primary DPDK process launch might take a long time if the initially
allocated memory is large. In practice, allocating 1 TB of memory
over 1 GB hugepages on Linux takes tens of seconds. Fast restart
is highly desired for some applications, and the launch delay presents
a problem.

The primary delay happens in this call trace:
  rte_eal_init()
    rte_eal_memory_init()
      rte_eal_hugepage_init()
        eal_dynmem_hugepage_init()
          eal_memalloc_alloc_seg_bulk()
            alloc_seg()
              mmap()

The largest part of the time spent in mmap() is filling the memory
with zeros. The kernel does so to prevent data leakage from a process
that was last using the page. However, in a controlled environment
it may not be an issue, while performance is. (The Linux-specific
MAP_UNINITIALIZED flag allows mapping without clearing, but it is
disabled in all popular distributions for the reason above.)

It is proposed to add a new EAL option: --mem-file FILE1,FILE2,...
to map hugepages "as is" from specified FILEs in hugetlbfs.
Compared to using external memory for the task, the EAL option requires
no change to application code, while allowing the administrator
to control hugepage sizes and their NUMA affinity.

Limitations of the feature:

* Linux-specific (only Linux maps hugepages from files).
* Incompatible with --legacy-mem (partially replaces it).
* Incompatible with --single-file-segments
  (--mem-file FILEs can contain as many segments as needed).
* Incompatible with --in-memory (logically).

A warning about possible security implications is printed
when --mem-file is used.

Until this patch, the DPDK allocator always cleared memory on freeing,
so that it did not have to do so on allocation, while new memory
was cleared by the kernel. When --mem-file is in use, DPDK clears memory
after allocation in rte_zmalloc() and does not clear it on freeing.
Effectively, the user trades fast startup for an occasional allocation
slowdown whenever it is absolutely necessary. When memory is recycled,
it is cleared again, which is suboptimal per se, but saves the
complication of memory management.
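
To illustrate the core of the mechanism, a simplified sketch of how one
page is taken from a pre-allocated file (fd, page_idx, hugepage_sz, and
addr are placeholders here; the real logic lives in alloc_seg() below):

	/* Map the page "as is": shared, pre-faulted, at a fixed VA.
	 * No zero-filling happens here, hence the fast startup.
	 */
	void *va = mmap(addr, hugepage_sz, PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_POPULATE | MAP_FIXED,
			fd, page_idx * hugepage_sz);
	if (va == MAP_FAILED)
		return -1;
	/* The page may hold data left by a previous user, which is why
	 * rte_zmalloc() now clears memory on allocation instead.
	 */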

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 doc/guides/linux_gsg/linux_eal_parameters.rst |  17 +
 lib/eal/common/eal_common_dynmem.c            |   6 +
 lib/eal/common/eal_common_options.c           |  23 ++
 lib/eal/common/eal_internal_cfg.h             |   4 +
 lib/eal/common/eal_memalloc.h                 |   8 +-
 lib/eal/common/eal_options.h                  |   2 +
 lib/eal/common/malloc_elem.c                  |   5 +
 lib/eal/common/malloc_heap.h                  |   8 +
 lib/eal/common/rte_malloc.c                   |  16 +-
 lib/eal/include/rte_memory.h                  |   4 +-
 lib/eal/linux/eal.c                           |  28 ++
 lib/eal/linux/eal_hugepage_info.c             |   5 +
 lib/eal/linux/eal_memalloc.c                  | 328 +++++++++++++++++-
 13 files changed, 441 insertions(+), 13 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst
index bd3977cb3d..b465feaea8 100644
--- a/doc/guides/linux_gsg/linux_eal_parameters.rst
+++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
@@ -92,6 +92,23 @@ Memory-related options
 
     Free hugepages back to system exactly as they were originally allocated.
 
+*   ``--mem-file <pre-allocated files>``
+
+    Use memory from pre-allocated files in ``hugetlbfs`` without clearing it;
+    when this memory is exhausted, switch to default dynamic allocation.
+    This speeds up startup compared to ``--legacy-mem`` while also avoiding
+    later delays for allocating new hugepages. One downside is slowdown
+    of all zeroed memory allocations. Security warning: an application
+    can access contents left by previous users of hugepages. Multiple files
+    can be pre-allocated in ``hugetlbfs`` with different page sizes,
+    on desired NUMA nodes, using ``mount`` options and ``numactl``:
+
+        --mem-file /mnt/huge-1G/node0,/mnt/huge-1G/node1,/mnt/huge-2M/extra
+
+    This option is incompatible with ``--legacy-mem``, ``--in-memory``,
+    and ``--single-file-segments``. Primary and secondary processes
+    must specify exactly the same list of files.
+
 Other options
 ~~~~~~~~~~~~~
 
diff --git a/lib/eal/common/eal_common_dynmem.c b/lib/eal/common/eal_common_dynmem.c
index 7c5437ddfa..abcf22f097 100644
--- a/lib/eal/common/eal_common_dynmem.c
+++ b/lib/eal/common/eal_common_dynmem.c
@@ -272,6 +272,12 @@ eal_dynmem_hugepage_init(void)
 			internal_conf->num_hugepage_sizes) < 0)
 		return -1;
 
+#ifdef RTE_EXEC_ENV_LINUX
+	/* pre-allocate pages from --mem-file option files */
+	if (eal_memalloc_memfile_alloc(used_hp) < 0)
+		return -1;
+#endif
+
 	for (hp_sz_idx = 0;
 			hp_sz_idx < (int)internal_conf->num_hugepage_sizes;
 			hp_sz_idx++) {
diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index eaef57312f..3cc567a0c0 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -84,6 +84,7 @@ eal_long_options[] = {
 	{OPT_TRACE_MODE,        1, NULL, OPT_TRACE_MODE_NUM       },
 	{OPT_MAIN_LCORE,        1, NULL, OPT_MAIN_LCORE_NUM       },
 	{OPT_MBUF_POOL_OPS_NAME, 1, NULL, OPT_MBUF_POOL_OPS_NAME_NUM},
+	{OPT_MEM_FILE,          1, NULL, OPT_MEM_FILE_NUM         },
 	{OPT_NO_HPET,           0, NULL, OPT_NO_HPET_NUM          },
 	{OPT_NO_HUGE,           0, NULL, OPT_NO_HUGE_NUM          },
 	{OPT_NO_PCI,            0, NULL, OPT_NO_PCI_NUM           },
@@ -1879,6 +1880,8 @@ eal_cleanup_config(struct internal_config *internal_cfg)
 		free(internal_cfg->hugepage_dir);
 	if (internal_cfg->user_mbuf_pool_ops_name != NULL)
 		free(internal_cfg->user_mbuf_pool_ops_name);
+	if (internal_cfg->mem_file[0])
+		free(internal_cfg->mem_file[0]);
 
 	return 0;
 }
@@ -1999,6 +2002,26 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"amount of reserved memory can be adjusted with "
 			"-m or --"OPT_SOCKET_MEM"\n");
 	}
+	if (internal_cfg->mem_file[0] && internal_conf->legacy_mem) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_LEGACY_MEM"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_conf->no_hugetlbfs) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_NO_HUGE"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_conf->in_memory) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_IN_MEMORY"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_conf->single_file_segments) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_SINGLE_FILE_SEGMENTS"\n");
+		return -1;
+	}
 
 	return 0;
 }
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index d6c0470eb8..814d5c66e1 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -22,6 +22,9 @@
 #define MAX_HUGEPAGE_SIZES 3  /**< support up to 3 page sizes */
 #endif
 
+#define MAX_MEMFILE_ITEMS (MAX_HUGEPAGE_SIZES * RTE_MAX_NUMA_NODES)
+/**< Maximal number of mem-file parameters. */
+
 /*
  * internal configuration structure for the number, size and
  * mount points of hugepages
@@ -83,6 +86,7 @@ struct internal_config {
 	rte_uuid_t vfio_vf_token;
 	char *hugefile_prefix;      /**< the base filename of hugetlbfs files */
 	char *hugepage_dir;         /**< specific hugetlbfs directory to use */
+	char *mem_file[MAX_MEMFILE_ITEMS]; /**< pre-allocated memory files */
 	char *user_mbuf_pool_ops_name;
 			/**< user defined mbuf pool ops name */
 	unsigned num_hugepage_sizes;      /**< how many sizes on this system */
diff --git a/lib/eal/common/eal_memalloc.h b/lib/eal/common/eal_memalloc.h
index ebc3a6f6c1..d92c9a167b 100644
--- a/lib/eal/common/eal_memalloc.h
+++ b/lib/eal/common/eal_memalloc.h
@@ -8,7 +8,7 @@
 #include <stdbool.h>
 
 #include <rte_memory.h>
-
+#include "eal_internal_cfg.h"
 /*
  * Allocate segment of specified page size.
  */
@@ -96,4 +96,10 @@ eal_memalloc_init(void);
 int
 eal_memalloc_cleanup(void);
 
+int
+eal_memalloc_memfile_init(void);
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa);
+
 #endif /* EAL_MEMALLOC_H */
diff --git a/lib/eal/common/eal_options.h b/lib/eal/common/eal_options.h
index 8e4f7202a2..5c012c8125 100644
--- a/lib/eal/common/eal_options.h
+++ b/lib/eal/common/eal_options.h
@@ -87,6 +87,8 @@ enum {
 	OPT_NO_TELEMETRY_NUM,
 #define OPT_FORCE_MAX_SIMD_BITWIDTH  "force-max-simd-bitwidth"
 	OPT_FORCE_MAX_SIMD_BITWIDTH_NUM,
+#define OPT_MEM_FILE          "mem-file"
+	OPT_MEM_FILE_NUM,
 
 	OPT_LONG_MAX_NUM
 };
diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
index c2c9461f1d..6e71029a3c 100644
--- a/lib/eal/common/malloc_elem.c
+++ b/lib/eal/common/malloc_elem.c
@@ -578,8 +578,13 @@ malloc_elem_free(struct malloc_elem *elem)
 	/* decrease heap's count of allocated elements */
 	elem->heap->alloc_count--;
 
+#ifdef MALLOC_DEBUG
 	/* poison memory */
 	memset(ptr, MALLOC_POISON, data_len);
+#else
+	if (!malloc_clear_on_alloc())
+		memset(ptr, 0, data_len);
+#endif
 
 	return elem;
 }
diff --git a/lib/eal/common/malloc_heap.h b/lib/eal/common/malloc_heap.h
index 772736b53f..72b64d8052 100644
--- a/lib/eal/common/malloc_heap.h
+++ b/lib/eal/common/malloc_heap.h
@@ -10,6 +10,7 @@
 
 #include <rte_malloc.h>
 #include <rte_spinlock.h>
+#include "eal_private.h"
 
 /* Number of free lists per heap, grouped by size. */
 #define RTE_HEAP_NUM_FREELISTS  13
@@ -48,6 +49,13 @@ malloc_get_numa_socket(void)
 	return socket_id;
 }
 
+static inline bool
+malloc_clear_on_alloc(void)
+{
+	const struct internal_config *cfg = eal_get_internal_configuration();
+	return cfg->mem_file[0] != NULL;
+}
+
 void *
 malloc_heap_alloc(const char *type, size_t size, int socket, unsigned int flags,
 		size_t align, size_t bound, bool contig);
diff --git a/lib/eal/common/rte_malloc.c b/lib/eal/common/rte_malloc.c
index 9d39e58c08..ce94268aca 100644
--- a/lib/eal/common/rte_malloc.c
+++ b/lib/eal/common/rte_malloc.c
@@ -113,17 +113,23 @@ rte_malloc(const char *type, size_t size, unsigned align)
 void *
 rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket)
 {
+	bool zero;
 	void *ptr = rte_malloc_socket(type, size, align, socket);
 
-#ifdef RTE_MALLOC_DEBUG
 	/*
 	 * If DEBUG is enabled, then freed memory is marked with poison
-	 * value and set to zero on allocation.
-	 * If DEBUG is not enabled then  memory is already zeroed.
+	 * value and must be set to zero on allocation.
+	 * If DEBUG is not enabled then it is configurable
+	 * whether memory comes already set to zero by memalloc or on free
+	 * or it must be set to zero here.
 	 */
-	if (ptr != NULL)
-		memset(ptr, 0, size);
+#ifdef RTE_MALLOC_DEBUG
+	zero = true;
+#else
+	zero = malloc_clear_on_alloc();
 #endif
+	if (ptr != NULL && zero)
+		memset(ptr, 0, size);
 
 	rte_eal_trace_mem_zmalloc(type, size, align, socket, ptr);
 	return ptr;
diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h
index bba9b5300a..9a2b191314 100644
--- a/lib/eal/include/rte_memory.h
+++ b/lib/eal/include/rte_memory.h
@@ -40,7 +40,9 @@ extern "C" {
 /**
  * Physical memory segment descriptor.
  */
-#define RTE_MEMSEG_FLAG_DO_NOT_FREE (1 << 0)
+#define RTE_MEMSEG_FLAG_DO_NOT_FREE   (1 << 0)
+#define RTE_MEMSEG_FLAG_PRE_ALLOCATED (1 << 1)
+
 /**< Prevent this segment from being freed back to the OS. */
 struct rte_memseg {
 	rte_iova_t iova;            /**< Start IO address. */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 3577eaeaa4..d0afcd8326 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -548,6 +548,7 @@ eal_usage(const char *prgname)
 	       "  --"OPT_LEGACY_MEM"        Legacy memory mode (no dynamic allocation, contiguous segments)\n"
 	       "  --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n"
 	       "  --"OPT_MATCH_ALLOCATIONS" Free hugepages exactly as allocated\n"
+	       "  --"OPT_MEM_FILE"          Comma-separated list of files in hugetlbfs.\n"
 	       "\n");
 	/* Allow the application to print its usage message too if hook is set */
 	if (hook) {
@@ -678,6 +679,22 @@ eal_log_level_parse(int argc, char **argv)
 	optarg = old_optarg;
 }
 
+static int
+eal_parse_memfile_arg(const char *arg, char **mem_file)
+{
+	int ret;
+
+	char *copy = strdup(arg);
+	if (copy == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot store --"OPT_MEM_FILE" names\n");
+		return -1;
+	}
+
+	ret = rte_strsplit(copy, strlen(copy), mem_file,
+			MAX_MEMFILE_ITEMS, ',');
+	return ret <= 0 ? -1 : 0;
+}
+
 /* Parse the argument given in the command line of the application */
 static int
 eal_parse_args(int argc, char **argv)
@@ -819,6 +836,17 @@ eal_parse_args(int argc, char **argv)
 			internal_conf->match_allocations = 1;
 			break;
 
+		case OPT_MEM_FILE_NUM:
+			if (eal_parse_memfile_arg(optarg,
+					internal_conf->mem_file) < 0) {
+				RTE_LOG(ERR, EAL, "invalid parameters for --"
+						OPT_MEM_FILE "\n");
+				eal_usage(prgname);
+				ret = -1;
+				goto out;
+			}
+			break;
+
 		default:
 			if (opt < OPT_LONG_MIN_NUM && isprint(opt)) {
 				RTE_LOG(ERR, EAL, "Option %c is not supported "
diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index 193282e779..dfbb49ada9 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -37,6 +37,7 @@
 #include "eal_hugepages.h"
 #include "eal_hugepage_info.h"
 #include "eal_filesystem.h"
+#include "eal_memalloc.h"
 
 static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
 static const char sys_pages_numa_dir_path[] = "/sys/devices/system/node";
@@ -515,6 +516,10 @@ hugepage_info_init(void)
 	qsort(&internal_conf->hugepage_info[0], num_sizes,
 	      sizeof(internal_conf->hugepage_info[0]), compare_hpi);
 
+	/* add pre-allocated pages with --mem-file option to available ones */
+	if (eal_memalloc_memfile_init())
+		return -1;
+
 	/* now we have all info, check we have at least one valid size */
 	for (i = 0; i < num_sizes; i++) {
 		/* pages may no longer all be on socket 0, so check all */
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 0ec8542283..c2b3586204 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 #include <limits.h>
 #include <fcntl.h>
+#include <mntent.h>
 #include <sys/ioctl.h>
 #include <sys/time.h>
 #include <signal.h>
@@ -41,6 +42,7 @@
 #include <rte_spinlock.h>
 
 #include "eal_filesystem.h"
+#include "eal_hugepage_info.h"
 #include "eal_internal_cfg.h"
 #include "eal_memalloc.h"
 #include "eal_memcfg.h"
@@ -102,6 +104,19 @@ static struct {
 	int count; /**< entries used in an array */
 } fd_list[RTE_MAX_MEMSEG_LISTS];
 
+struct memfile {
+	char *fname;		/**< file name */
+	uint64_t hugepage_sz;	/**< size of a huge page */
+	uint32_t num_pages;	/**< number of pages */
+	uint32_t num_allocated;	/**< number of already allocated pages */
+	int socket_id;		/**< Socket ID  */
+	int fd;			/**< file descriptor */
+};
+
+struct memfile mem_file[MAX_MEMFILE_ITEMS];
+
+static int alloc_memfile;
+
 /** local copy of a memory map, used to synchronize memory hotplug in MP */
 static struct rte_memseg_list local_memsegs[RTE_MAX_MEMSEG_LISTS];
 
@@ -542,6 +557,26 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		 * stage.
 		 */
 		map_offset = 0;
+	} else if (alloc_memfile) {
+		uint32_t mf;
+
+		for (mf = 0; mf < RTE_DIM(mem_file); mf++) {
+			if (alloc_sz == mem_file[mf].hugepage_sz &&
+			    socket_id == mem_file[mf].socket_id &&
+			    mem_file[mf].num_allocated < mem_file[mf].num_pages)
+				break;
+		}
+		if (mf >= RTE_DIM(mem_file)) {
+			RTE_LOG(ERR, EAL,
+				"%s() cannot allocate from memfile\n",
+				__func__);
+			return -1;
+		}
+		fd = mem_file[mf].fd;
+		fd_list[list_idx].fds[seg_idx] = fd;
+		map_offset = mem_file[mf].num_allocated * alloc_sz;
+		mmap_flags = MAP_SHARED | MAP_POPULATE | MAP_FIXED;
+		mem_file[mf].num_allocated++;
 	} else {
 		/* takes out a read lock on segment or segment list */
 		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
@@ -683,6 +718,10 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	if (fd < 0)
 		return -1;
 
+	/* don't cleanup pre-allocated files */
+	if (alloc_memfile)
+		return -1;
+
 	if (internal_conf->single_file_segments) {
 		resize_hugefile(fd, map_offset, alloc_sz, false);
 		/* ignore failure, can't make it any worse */
@@ -712,8 +751,9 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	/* erase page data */
-	memset(ms->addr, 0, ms->len);
+	/* Erase page data unless it's pre-allocated files. */
+	if (!alloc_memfile)
+		memset(ms->addr, 0, ms->len);
 
 	if (mmap(ms->addr, ms->len, PROT_NONE,
 			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) ==
@@ -724,8 +764,12 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 
 	eal_mem_set_dump(ms->addr, ms->len, false);
 
-	/* if we're using anonymous hugepages, nothing to be done */
-	if (internal_conf->in_memory && !memfd_create_supported) {
+	/*
+	 * if we're using anonymous hugepages or pre-allocated files,
+	 * nothing to be done
+	 */
+	if ((internal_conf->in_memory && !memfd_create_supported) ||
+			alloc_memfile) {
 		memset(ms, 0, sizeof(*ms));
 		return 0;
 	}
@@ -838,7 +882,9 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	 * during init, we already hold a write lock, so don't try to take out
 	 * another one.
 	 */
-	if (wa->hi->lock_descriptor == -1 && !internal_conf->in_memory) {
+	if (wa->hi->lock_descriptor == -1 &&
+	    !internal_conf->in_memory &&
+	    !alloc_memfile) {
 		dir_fd = open(wa->hi->hugedir, O_RDONLY);
 		if (dir_fd < 0) {
 			RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n",
@@ -868,7 +914,7 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 				need, i);
 
 			/* if exact number wasn't requested, stop */
-			if (!wa->exact)
+			if (!wa->exact || alloc_memfile)
 				goto out;
 
 			/* clean up */
@@ -1120,6 +1166,262 @@ eal_memalloc_free_seg(struct rte_memseg *ms)
 	return eal_memalloc_free_seg_bulk(&ms, 1);
 }
 
+static int
+memfile_fill_socket_id(struct memfile *mf)
+{
+#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
+	void *va;
+	int ret;
+
+	va = mmap(NULL, mf->hugepage_sz, PROT_READ | PROT_WRITE,
+			MAP_SHARED | MAP_POPULATE, mf->fd, 0);
+	if (va == MAP_FAILED) {
+		RTE_LOG(ERR, EAL, "%s(): %s: mmap(): %s\n",
+				__func__, mf->fname, strerror(errno));
+		return -1;
+	}
+
+	ret = 0;
+	if (check_numa()) {
+		if (get_mempolicy(&mf->socket_id, NULL, 0, va,
+				MPOL_F_NODE | MPOL_F_ADDR) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: get_mempolicy(): %s\n",
+				__func__, mf->fname, strerror(errno));
+			ret = -1;
+		}
+	} else
+		mf->socket_id = 0;
+
+	munmap(va, mf->hugepage_sz);
+	return ret;
+#else
+	mf->socket_id = 0;
+	return 0;
+#endif
+}
+
+struct match_memfile_path_arg {
+	const char *path;
+	uint64_t file_sz;
+	uint64_t hugepage_sz;
+	size_t best_len;
+};
+
+/*
+ * While it is unlikely for hugetlbfs, mount points can be nested.
+ * Find the deepest mount point that contains the file.
+ */
+static int
+match_memfile_path(const char *path, uint64_t hugepage_sz, void *cb_arg)
+{
+	struct match_memfile_path_arg *arg = cb_arg;
+	size_t dir_len = strlen(path);
+
+	if (dir_len < arg->best_len)
+		return 0;
+	if (strncmp(path, arg->path, dir_len) != 0)
+		return 0;
+	if (arg->file_sz % hugepage_sz != 0)
+		return 0;
+
+	arg->hugepage_sz = hugepage_sz;
+	arg->best_len = dir_len;
+	return 0;
+}
+
+/* Determine hugepage size from the path to a file in hugetlbfs. */
+static int
+memfile_fill_hugepage_sz(struct memfile *mf, uint64_t file_sz)
+{
+	char abspath[PATH_MAX];
+	struct match_memfile_path_arg arg;
+
+	if (realpath(mf->fname, abspath) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): realpath(): %s\n",
+				__func__, strerror(errno));
+		return -1;
+	}
+
+	memset(&arg, 0, sizeof(arg));
+	arg.path = abspath;
+	arg.file_sz = file_sz;
+	if (eal_hugepage_mount_walk(match_memfile_path, &arg) == 0 &&
+			arg.hugepage_sz != 0) {
+		mf->hugepage_sz = arg.hugepage_sz;
+		return 0;
+	}
+	return -1;
+}
+
+int
+eal_memalloc_memfile_init(void)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	int err = -1, fd;
+	uint32_t i;
+
+	if (internal_conf->mem_file[0] == NULL)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(internal_conf->mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t fsize;
+
+		if (internal_conf->mem_file[i] == NULL) {
+			err = 0;
+			break;
+		}
+		mf->fname = internal_conf->mem_file[i];
+		fd = open(mf->fname, O_RDWR, 0600);
+		mf->fd = fd;
+		if (fd < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: open(): %s\n",
+					__func__, mf->fname, strerror(errno));
+			break;
+		}
+
+		/* take out a read lock and keep it indefinitely */
+		if (lock(fd, LOCK_SH) != 1) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot lock file\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		fsize = get_file_size(fd);
+		if (!fsize) {
+			RTE_LOG(ERR, EAL, "%s(): %s: zero file length\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		if (memfile_fill_hugepage_sz(mf, fsize) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect page size\n",
+					__func__, mf->fname);
+			break;
+		}
+		mf->num_pages = fsize / mf->hugepage_sz;
+
+		if (memfile_fill_socket_id(mf) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect NUMA node\n",
+					__func__, mf->fname);
+			break;
+		}
+	}
+
+	/* check whether an error occurred */
+	if (err && i < RTE_DIM(internal_conf->mem_file)) {
+		/* some error occurred, do rollback */
+		do {
+			fd = mem_file[i].fd;
+			/* closing fd drops the lock */
+			if (fd >= 0)
+				close(fd);
+			mem_file[i].fd = -1;
+		} while (i--);
+		return -1;
+	}
+
+	/* update hugepage_info with pages allocated in files */
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		const struct memfile *mf = &mem_file[i];
+		struct hugepage_info *hpi = NULL;
+		uint64_t sz;
+
+		if (!mf->hugepage_sz)
+			break;
+
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			hpi = &internal_conf->hugepage_info[sz];
+
+			if (mf->hugepage_sz == hpi->hugepage_sz) {
+				hpi->num_pages[mf->socket_id] += mf->num_pages;
+				break;
+			}
+		}
+
+		/* it seems hugepage info is not socket aware yet */
+		if (hpi != NULL && sz >= internal_conf->num_hugepage_sizes)
+			hpi->num_pages[0] += mf->num_pages;
+	}
+	return 0;
+}
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	uint32_t i, sz;
+
+	if (internal_conf->mem_file[0] == NULL ||
+			rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t hugepage_sz = mf->hugepage_sz;
+		int socket_id = mf->socket_id;
+		struct rte_memseg **pages;
+
+		if (!hugepage_sz)
+			break;
+
+		while (mf->num_allocated < mf->num_pages) {
+			int needed, allocated, j;
+			uint32_t prev;
+
+			prev = mf->num_allocated;
+			needed = mf->num_pages - mf->num_allocated;
+			pages = malloc(sizeof(*pages) * needed);
+			if (pages == NULL)
+				return -1;
+
+			/* memalloc is locked, it's safe to switch allocator */
+			alloc_memfile = 1;
+			allocated = eal_memalloc_alloc_seg_bulk(pages,
+					needed, hugepage_sz, socket_id,	false);
+			/* switch allocator back */
+			alloc_memfile = 0;
+			if (allocated <= 0) {
+				RTE_LOG(ERR, EAL, "%s(): %s: allocation failed\n",
+						__func__, mf->fname);
+				free(pages);
+				return -1;
+			}
+
+			/* mark preallocated pages as unfreeable */
+			for (j = 0; j < allocated; j++) {
+				struct rte_memseg *ms = pages[j];
+
+				ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE |
+					     RTE_MEMSEG_FLAG_PRE_ALLOCATED;
+			}
+
+			free(pages);
+
+		/* check whether we allocated from the expected file */
+			if (prev + allocated != mf->num_allocated) {
+				RTE_LOG(ERR, EAL, "%s(): %s: incorrect allocation\n",
+						__func__, mf->fname);
+				return -1;
+			}
+		}
+
+		/* reflect that we pre-allocated some memory */
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			struct hugepage_info *hpi = &hpa[sz];
+
+			if (hpi->hugepage_sz != hugepage_sz)
+				continue;
+			hpi->num_pages[socket_id] -=
+					RTE_MIN(hpi->num_pages[socket_id],
+						mf->num_allocated);
+		}
+	}
+	return 0;
+}
+
 static int
 sync_chunk(struct rte_memseg_list *primary_msl,
 		struct rte_memseg_list *local_msl, struct hugepage_info *hi,
@@ -1178,6 +1480,14 @@ sync_chunk(struct rte_memseg_list *primary_msl,
 		if (l_ms == NULL || p_ms == NULL)
 			return -1;
 
+		/*
+		 * Switch allocator for this segment.
+		 * This function is only called during init,
+		 * so don't try to restore allocator on failure.
+		 */
+		if (p_ms->flags & RTE_MEMSEG_FLAG_PRE_ALLOCATED)
+			alloc_memfile = 1;
+
 		if (used) {
 			ret = alloc_seg(l_ms, p_ms->addr,
 					p_ms->socket_id, hi,
@@ -1191,6 +1501,9 @@ sync_chunk(struct rte_memseg_list *primary_msl,
 			if (ret < 0)
 				return -1;
 		}
+
+		/* Reset the allocator. */
+		alloc_memfile = 0;
 	}
 
 	/* if we just allocated memory, notify the application */
@@ -1392,6 +1705,9 @@ eal_memalloc_sync_with_primary(void)
 	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
 		return 0;
 
+	if (eal_memalloc_memfile_init() < 0)
+		return -1;
+
 	/* memalloc is locked, so it's safe to call thread-unsafe version */
 	if (rte_memseg_list_walk_thread_unsafe(sync_walk, NULL))
 		return -1;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v5 3/3] app/test: add allocator performance autotest
  2021-09-21  8:16       ` [dpdk-dev] [PATCH v5 0/3] eal: add memory pre-allocation from existing files dkozlyuk
  2021-09-21  8:16         ` [dpdk-dev] [PATCH v5 1/3] eal/linux: make hugetlbfs analysis reusable dkozlyuk
  2021-09-21  8:16         ` [dpdk-dev] [PATCH v5 2/3] eal: add memory pre-allocation from existing files dkozlyuk
@ 2021-09-21  8:16         ` dkozlyuk
  2021-10-11  8:56         ` [dpdk-dev] [PATCH v6 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
  3 siblings, 0 replies; 48+ messages in thread
From: dkozlyuk @ 2021-09-21  8:16 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Dmitry Kozlyuk, Viacheslav Ovsiienko

From: Dmitry Kozlyuk <dkozlyuk@nvidia.com>

Memory allocator performance is crucial to applications that deal
with large amounts of memory or allocate frequently. DPDK allocator
performance is affected by EAL options, the API used, and, at the very
least, the allocation size. The new autotest is intended to be run with
different EAL options. It measures performance with a range of sizes
for different APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.

The work distribution between allocation and deallocation depends on
EAL options. The test prints both times and the total time to ease
comparison.

Memory can be filled with zeroes at different points of the allocation
path, but it always takes a considerable fraction of the overall timing.
This is why the test measures the filling speed and prints how long
clearing would take for each size, as a hint.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 app/test/meson.build        |   2 +
 app/test/test_malloc_perf.c | 157 ++++++++++++++++++++++++++++++++++++
 2 files changed, 159 insertions(+)
 create mode 100644 app/test/test_malloc_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index a7611686ad..a48dc79463 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -84,6 +84,7 @@ test_sources = files(
         'test_lpm6_perf.c',
         'test_lpm_perf.c',
         'test_malloc.c',
+        'test_malloc_perf.c',
         'test_mbuf.c',
         'test_member.c',
         'test_member_perf.c',
@@ -281,6 +282,7 @@ fast_tests = [
 
 perf_test_names = [
         'ring_perf_autotest',
+        'malloc_perf_autotest',
         'mempool_perf_autotest',
         'memcpy_perf_autotest',
         'hash_perf_autotest',
diff --git a/app/test/test_malloc_perf.c b/app/test/test_malloc_perf.c
new file mode 100644
index 0000000000..4435894095
--- /dev/null
+++ b/app/test/test_malloc_perf.c
@@ -0,0 +1,157 @@
+#include <inttypes.h>
+#include <string.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_malloc.h>
+#include <rte_memzone.h>
+
+#include "test.h"
+
+typedef void * (alloc_t)(const char *name, size_t size, unsigned int align);
+typedef void (free_t)(void *addr);
+
+static const uint64_t KB = 1 << 10;
+static const uint64_t GB = 1 << 30;
+
+static double
+tsc_to_us(uint64_t tsc, size_t runs)
+{
+	return (double)tsc / rte_get_tsc_hz() * US_PER_S / runs;
+}
+
+static int
+test_memset_perf(double *us_per_gb)
+{
+	static const size_t RUNS = 20;
+
+	void *ptr;
+	size_t i;
+	uint64_t tsc;
+
+	puts("Performance: memset");
+
+	ptr = rte_malloc(NULL, GB, 0);
+	if (ptr == NULL) {
+		printf("rte_malloc(size=%"PRIu64") failed\n", GB);
+		return -1;
+	}
+
+	tsc = rte_rdtsc_precise();
+	for (i = 0; i < RUNS; i++)
+		memset(ptr, 0, GB);
+	tsc = rte_rdtsc_precise() - tsc;
+
+	*us_per_gb = tsc_to_us(tsc, RUNS);
+	printf("Result: %.3f GiB/s <=> %.2f us/MiB\n",
+			US_PER_S / *us_per_gb, *us_per_gb / KB);
+
+	rte_free(ptr);
+	putchar('\n');
+	return 0;
+}
+
+static int
+test_alloc_perf(const char *name, alloc_t *alloc_fn, free_t free_fn,
+		size_t max_runs, double memset_gb_us)
+{
+	static const size_t SIZES[] = {
+			1 << 6, 1 << 7, 1 << 10, 1 << 12, 1 << 16, 1 << 20,
+			1 << 21, 1 << 22, 1 << 24, 1 << 30 };
+
+	size_t i, j;
+	void **ptrs;
+
+	printf("Performance: %s\n", name);
+
+	ptrs = calloc(max_runs, sizeof(ptrs[0]));
+	if (ptrs == NULL) {
+		puts("Cannot allocate memory for pointers");
+		return -1;
+	}
+
+	printf("%12s%8s%12s%12s%12s%12s\n",
+			"Size (B)", "Runs", "Alloc (us)", "Free (us)",
+			"Total (us)", "memset (us)");
+	for (i = 0; i < RTE_DIM(SIZES); i++) {
+		size_t size = SIZES[i];
+		size_t runs_done;
+		uint64_t tsc_start, tsc_alloc, tsc_free;
+		double alloc_time, free_time, memset_time;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < max_runs; j++) {
+			ptrs[j] = alloc_fn(NULL, size, 0);
+			if (ptrs[j] == NULL)
+				break;
+		}
+		tsc_alloc = rte_rdtsc_precise() - tsc_start;
+
+		if (j == 0) {
+			printf("%12zu Interrupted: out of memory.\n", size);
+			break;
+		}
+		runs_done = j;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+			free_fn(ptrs[j]);
+		tsc_free = rte_rdtsc_precise() - tsc_start;
+
+		alloc_time = tsc_to_us(tsc_alloc, runs_done);
+		free_time = tsc_to_us(tsc_free, runs_done);
+		memset_time = memset_gb_us * size / GB;
+		printf("%12zu%8zu%12.2f%12.2f%12.2f%12.2f\n",
+				size, runs_done, alloc_time, free_time,
+				alloc_time + free_time, memset_time);
+
+		memset(ptrs, 0, max_runs * sizeof(ptrs[0]));
+	}
+
+	free(ptrs);
+	putchar('\n');
+	return 0;
+}
+
+static void *
+memzone_alloc(const char *name __rte_unused, size_t size, unsigned int align)
+{
+	const struct rte_memzone *mz;
+	char gen_name[RTE_MEMZONE_NAMESIZE];
+
+	snprintf(gen_name, sizeof(gen_name), "test-mz-%"PRIx64, rte_rdtsc());
+	mz = rte_memzone_reserve_aligned(gen_name, size, SOCKET_ID_ANY,
+			RTE_MEMZONE_1GB | RTE_MEMZONE_SIZE_HINT_ONLY, align);
+	return (void *)(uintptr_t)mz;
+}
+
+static void
+memzone_free(void *addr)
+{
+	rte_memzone_free((struct rte_memzone *)addr);
+}
+
+static int
+test_malloc_perf(void)
+{
+	static const size_t MAX_RUNS = 10000;
+
+	double memset_gb_us;
+
+	if (test_memset_perf(&memset_gb_us) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_malloc", rte_malloc, rte_free,
+			MAX_RUNS, memset_gb_us) < 0)
+		return -1;
+	if (test_alloc_perf("rte_zmalloc", rte_zmalloc, rte_free,
+			MAX_RUNS, memset_gb_us) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_memzone_reserve", memzone_alloc, memzone_free,
+			RTE_MAX_MEMZONE - 1, memset_gb_us) < 0)
+		return -1;
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(malloc_perf_autotest, test_malloc_perf);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v5 1/3] eal/linux: make hugetlbfs analysis reusable
  2021-09-21  8:16         ` [dpdk-dev] [PATCH v5 1/3] eal/linux: make hugetlbfs analysis reusable dkozlyuk
@ 2021-09-22 13:52           ` John Levon
  2021-10-05 17:36           ` Thomas Monjalon
  1 sibling, 0 replies; 48+ messages in thread
From: John Levon @ 2021-09-22 13:52 UTC (permalink / raw)
  To: dkozlyuk; +Cc: dev, Anatoly Burakov, Viacheslav Ovsiienko

On Tue, Sep 21, 2021 at 11:16:30AM +0300, dkozlyuk@nvidia.com wrote:

> From: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> 
> get_hugepage_dir() searched for a hugetlbfs mount with a given page size
> using handcraft parsing of /proc/mounts and mixing traversal logic with
> selecting the needed entry. Separate code to enumerate hugetlbfs mounts
> to eal_hugepage_mount_walk() taking a callback that can inspect already
> parsed entries. Use mntent(3) API for parsing. This allows to reuse
> enumeration logic in subsequent patches.
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

Reviewed-by: John Levon <john.levon@nutanix.com>

regards
john

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v5 1/3] eal/linux: make hugetlbfs analysis reusable
  2021-09-21  8:16         ` [dpdk-dev] [PATCH v5 1/3] eal/linux: make hugetlbfs analysis reusable dkozlyuk
  2021-09-22 13:52           ` John Levon
@ 2021-10-05 17:36           ` Thomas Monjalon
  2021-10-08 15:33             ` John Levon
  1 sibling, 1 reply; 48+ messages in thread
From: Thomas Monjalon @ 2021-10-05 17:36 UTC (permalink / raw)
  To: Anatoly Burakov, Dmitry Kozlyuk
  Cc: dev, John Levon, Viacheslav Ovsiienko, dkozlyuk

21/09/2021 10:16, dkozlyuk@oss.nvidia.com:
> From: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> 
> get_hugepage_dir() searched for a hugetlbfs mount with a given page size
> using handcraft parsing of /proc/mounts and mixing traversal logic with
> selecting the needed entry. Separate code to enumerate hugetlbfs mounts
> to eal_hugepage_mount_walk() taking a callback that can inspect already
> parsed entries. Use mntent(3) API for parsing. This allows to reuse
> enumeration logic in subsequent patches.
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

First version was sent in July.
Anatoly, please are you available to review?

> +++ b/lib/eal/linux/eal_hugepage_info.h
> @@ -0,0 +1,39 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright 2021 NVIDIA CORPORATION & AFFILIATES.

Please use this exact format:

Copyright (c) 2021 NVIDIA Corporation & Affiliates




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v5 1/3] eal/linux: make hugetlbfs analysis reusable
  2021-10-05 17:36           ` Thomas Monjalon
@ 2021-10-08 15:33             ` John Levon
  2021-10-08 15:50               ` Dmitry Kozlyuk
  0 siblings, 1 reply; 48+ messages in thread
From: John Levon @ 2021-10-08 15:33 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Anatoly Burakov, Dmitry Kozlyuk, dev, Viacheslav Ovsiienko

On Tue, Oct 05, 2021 at 07:36:21PM +0200, Thomas Monjalon wrote:

> 21/09/2021 10:16, dkozlyuk@oss.nvidia.com:
> > From: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> > 
> > get_hugepage_dir() searched for a hugetlbfs mount with a given page size
> > using handcraft parsing of /proc/mounts and mixing traversal logic with
> > selecting the needed entry. Separate code to enumerate hugetlbfs mounts
> > to eal_hugepage_mount_walk() taking a callback that can inspect already
> > parsed entries. Use mntent(3) API for parsing. This allows to reuse
> > enumeration logic in subsequent patches.
> > 
> > Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> > Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> 
> First version was sent in July.
> Anatoly, please are you available to review?

Any progress on these? My original patch ("eal: allow hugetlbfs
sub-directories") now has to wait behind this series, since nobody
responded to the review of the last version.

Is it usual in DPDK to have to wait months?

thanks
john

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v5 1/3] eal/linux: make hugetlbfs analysis reusable
  2021-10-08 15:33             ` John Levon
@ 2021-10-08 15:50               ` Dmitry Kozlyuk
  0 siblings, 0 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-08 15:50 UTC (permalink / raw)
  To: John Levon, NBU-Contact-Thomas Monjalon
  Cc: Anatoly Burakov, dev, Slava Ovsiienko

Hello John,

> Any progress on these? Since now my original patch ("eal: allow hugetlbfs
> sub-directories") is going to have to wait behind this series, since nobody
> responded to review of the last version.

Your patch does not directly depend on this one and can be merged earlier.
Maybe my previous comment wasn't clear enough. If your patch is merged first,
I will rebase mine, updating your added code to use the new functions.
If my patch is merged first, it's vice versa. For the EAL part, it has had my ack since v3.

> Is it usual in DPDK to have to wait months?

It is not normal, but admittedly happens :(

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v6 0/3] eal: add memory pre-allocation from existing files
  2021-09-21  8:16       ` [dpdk-dev] [PATCH v5 0/3] eal: add memory pre-allocation from existing files dkozlyuk
                           ` (2 preceding siblings ...)
  2021-09-21  8:16         ` [dpdk-dev] [PATCH v5 3/3] app/test: add allocator performance autotest dkozlyuk
@ 2021-10-11  8:56         ` Dmitry Kozlyuk
  2021-10-11  8:56           ` [dpdk-dev] [PATCH v6 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
                             ` (3 more replies)
  3 siblings, 4 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-11  8:56 UTC (permalink / raw)
  To: dev

Hugepage allocation from the system takes time, resulting in slow
startup or sporadic delays later. Most of the time spent in the kernel
is zero-filling memory for security reasons, which may be irrelevant
in a controlled environment. The bottleneck is memory access speed,
so to speed things up, the amount of memory cleared must be reduced.
We propose a new EAL option, --mem-file FILE1,FILE2,..., to quickly
allocate dirty pages from existing files and clean them as necessary.
A new malloc_perf_autotest is provided to estimate the impact.
More details are explained in the relevant patches.

v6: fix copyright line (Thomas), add SPDX header for the new test file
    (BTW, why didn't the CI complain in previous versions?)
v5: rebase
v4: getmntent() -> getmntent_r(), better error detection (John Levon)
v3: fix hugepage mount point detection
v2: fix CI failures

Dmitry Kozlyuk (2):
  eal/linux: make hugetlbfs analysis reusable
  app/test: add allocator performance autotest

Viacheslav Ovsiienko (1):
  eal: add memory pre-allocation from existing files

 app/test/meson.build                          |   2 +
 app/test/test_malloc_perf.c                   | 161 +++++++++
 doc/guides/linux_gsg/linux_eal_parameters.rst |  17 +
 lib/eal/common/eal_common_dynmem.c            |   6 +
 lib/eal/common/eal_common_options.c           |  23 ++
 lib/eal/common/eal_internal_cfg.h             |   4 +
 lib/eal/common/eal_memalloc.h                 |   8 +-
 lib/eal/common/eal_options.h                  |   2 +
 lib/eal/common/malloc_elem.c                  |   5 +
 lib/eal/common/malloc_heap.h                  |   8 +
 lib/eal/common/rte_malloc.c                   |  16 +-
 lib/eal/include/rte_memory.h                  |   4 +-
 lib/eal/linux/eal.c                           |  28 ++
 lib/eal/linux/eal_hugepage_info.c             | 158 ++++++---
 lib/eal/linux/eal_hugepage_info.h             |  39 +++
 lib/eal/linux/eal_memalloc.c                  | 328 +++++++++++++++++-
 16 files changed, 739 insertions(+), 70 deletions(-)
 create mode 100644 app/test/test_malloc_perf.c
 create mode 100644 lib/eal/linux/eal_hugepage_info.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v6 1/3] eal/linux: make hugetlbfs analysis reusable
  2021-10-11  8:56         ` [dpdk-dev] [PATCH v6 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
@ 2021-10-11  8:56           ` Dmitry Kozlyuk
  2021-10-13  8:16             ` David Marchand
  2021-10-11  8:56           ` [dpdk-dev] [PATCH v6 2/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-11  8:56 UTC (permalink / raw)
  To: dev; +Cc: Viacheslav Ovsiienko, John Levon

get_hugepage_dir() searched for a hugetlbfs mount with a given page size
using handcrafted parsing of /proc/mounts, mixing traversal logic with
selection of the needed entry. Separate the code that enumerates
hugetlbfs mounts into eal_hugepage_mount_walk(), taking a callback that
can inspect already parsed entries. Use the mntent(3) API for parsing.
This allows reusing the enumeration logic in subsequent patches.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Reviewed-by: John Levon <john.levon@nutanix.com>
---
 lib/eal/linux/eal_hugepage_info.c | 153 +++++++++++++++++++-----------
 lib/eal/linux/eal_hugepage_info.h |  39 ++++++++
 2 files changed, 135 insertions(+), 57 deletions(-)
 create mode 100644 lib/eal/linux/eal_hugepage_info.h

diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index d97792cade..193282e779 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -12,6 +12,7 @@
 #include <stdio.h>
 #include <fnmatch.h>
 #include <inttypes.h>
+#include <mntent.h>
 #include <stdarg.h>
 #include <unistd.h>
 #include <errno.h>
@@ -34,6 +35,7 @@
 #include "eal_private.h"
 #include "eal_internal_cfg.h"
 #include "eal_hugepages.h"
+#include "eal_hugepage_info.h"
 #include "eal_filesystem.h"
 
 static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
@@ -195,73 +197,110 @@ get_default_hp_size(void)
 	return size;
 }
 
-static int
-get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
+int
+eal_hugepage_mount_walk(eal_hugepage_mount_walk_cb *cb, void *cb_arg)
 {
-	enum proc_mount_fieldnames {
-		DEVICE = 0,
-		MOUNTPT,
-		FSTYPE,
-		OPTIONS,
-		_FIELDNAME_MAX
-	};
-	static uint64_t default_size = 0;
-	const char proc_mounts[] = "/proc/mounts";
-	const char hugetlbfs_str[] = "hugetlbfs";
-	const size_t htlbfs_str_len = sizeof(hugetlbfs_str) - 1;
-	const char pagesize_opt[] = "pagesize=";
-	const size_t pagesize_opt_len = sizeof(pagesize_opt) - 1;
-	const char split_tok = ' ';
-	char *splitstr[_FIELDNAME_MAX];
-	char buf[BUFSIZ];
-	int retval = -1;
-	const struct internal_config *internal_conf =
-		eal_get_internal_configuration();
-
-	FILE *fd = fopen(proc_mounts, "r");
-	if (fd == NULL)
-		rte_panic("Cannot open %s\n", proc_mounts);
+	static const char PATH[] = "/proc/mounts";
+	static const char OPTION[] = "pagesize";
+
+	static uint64_t default_size;
+
+	FILE *f = NULL;
+	struct mntent mntent;
+	char strings[PATH_MAX];
+	char *hugepage_sz_str;
+	uint64_t hugepage_sz;
+	bool stopped = false;
+	int ret = -1;
+
+	f = setmntent(PATH, "r");
+	if (f == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): setmntent(%s): %s\n",
+				__func__, PATH, strerror(errno));
+		goto exit;
+	}
 
 	if (default_size == 0)
 		default_size = get_default_hp_size();
 
-	while (fgets(buf, sizeof(buf), fd)){
-		if (rte_strsplit(buf, sizeof(buf), splitstr, _FIELDNAME_MAX,
-				split_tok) != _FIELDNAME_MAX) {
-			RTE_LOG(ERR, EAL, "Error parsing %s\n", proc_mounts);
-			break; /* return NULL */
+	ret = 0;
+	while (getmntent_r(f, &mntent, strings, sizeof(strings)) != NULL) {
+		if (strcmp(mntent.mnt_type, "hugetlbfs") != 0)
+			continue;
+
+		hugepage_sz_str = hasmntopt(&mntent, OPTION);
+		if (hugepage_sz_str != NULL) {
+			hugepage_sz_str += strlen(OPTION) + 1; /* +1 for '=' */
+			hugepage_sz = rte_str_to_size(hugepage_sz_str);
+			if (hugepage_sz == 0) {
+				RTE_LOG(DEBUG, EAL, "Cannot parse hugepage size from '%s' for %s\n",
+						mntent.mnt_opts, mntent.mnt_dir);
+				continue;
+			}
+		} else {
+			RTE_LOG(DEBUG, EAL, "Hugepage filesystem at %s without %s option\n",
+					mntent.mnt_dir, OPTION);
+			hugepage_sz = default_size;
 		}
 
-		/* we have a specified --huge-dir option, only examine that dir */
-		if (internal_conf->hugepage_dir != NULL &&
-				strcmp(splitstr[MOUNTPT], internal_conf->hugepage_dir) != 0)
-			continue;
+		if (cb(mntent.mnt_dir, hugepage_sz, cb_arg) != 0) {
+			stopped = true;
+			break;
+		}
+	}
 
-		if (strncmp(splitstr[FSTYPE], hugetlbfs_str, htlbfs_str_len) == 0){
-			const char *pagesz_str = strstr(splitstr[OPTIONS], pagesize_opt);
+	if (ferror(f) || (!stopped && !feof(f))) {
+		RTE_LOG(ERR, EAL, "%s(): getmntent_r(): %s\n",
+				__func__, strerror(errno));
+		ret = -1;
+		goto exit;
+	}
 
-			/* if no explicit page size, the default page size is compared */
-			if (pagesz_str == NULL){
-				if (hugepage_sz == default_size){
-					strlcpy(hugedir, splitstr[MOUNTPT], len);
-					retval = 0;
-					break;
-				}
-			}
-			/* there is an explicit page size, so check it */
-			else {
-				uint64_t pagesz = rte_str_to_size(&pagesz_str[pagesize_opt_len]);
-				if (pagesz == hugepage_sz) {
-					strlcpy(hugedir, splitstr[MOUNTPT], len);
-					retval = 0;
-					break;
-				}
-			}
-		} /* end if strncmp hugetlbfs */
-	} /* end while fgets */
+exit:
+	if (f != NULL)
+		endmntent(f);
+	return ret;
+}
 
-	fclose(fd);
-	return retval;
+struct match_hugepage_mount_arg {
+	uint64_t hugepage_sz;
+	char *hugedir;
+	int hugedir_len;
+	bool done;
+};
+
+static int
+match_hugepage_mount(const char *path, uint64_t hugepage_sz, void *cb_arg)
+{
+	const struct internal_config *internal_conf =
+		eal_get_internal_configuration();
+	struct match_hugepage_mount_arg *arg = cb_arg;
+
+	/* we have a specified --huge-dir option, only examine that dir */
+	if (internal_conf->hugepage_dir != NULL &&
+			strcmp(path, internal_conf->hugepage_dir) != 0)
+		return 0;
+
+	if (hugepage_sz == arg->hugepage_sz) {
+		strlcpy(arg->hugedir, path, arg->hugedir_len);
+		arg->done = true;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int
+get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
+{
+	struct match_hugepage_mount_arg arg = {
+		.hugepage_sz = hugepage_sz,
+		.hugedir = hugedir,
+		.hugedir_len = len,
+		.done = false,
+	};
+	int ret = eal_hugepage_mount_walk(match_hugepage_mount, &arg);
+	return ret == 0 && arg.done ? 0 : -1;
 }
 
 /*
diff --git a/lib/eal/linux/eal_hugepage_info.h b/lib/eal/linux/eal_hugepage_info.h
new file mode 100644
index 0000000000..bc0e0a616c
--- /dev/null
+++ b/lib/eal/linux/eal_hugepage_info.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#ifndef _EAL_HUGEPAGE_INFO_
+#define _EAL_HUGEPAGE_INFO_
+
+#include <stdint.h>
+
+/**
+ * Function called for each hugetlbfs mount point.
+ *
+ * @param path
+ *  Mount point directory.
+ * @param hugepage_sz
+ *  Hugepage size for the mount or default system hugepage size.
+ * @param arg
+ *  User data.
+ *
+ * @return
+ *  0 to continue walking, 1 to stop.
+ */
+typedef int (eal_hugepage_mount_walk_cb)(const char *path, uint64_t hugepage_sz,
+					 void *arg);
+
+/**
+ * Enumerate hugetlbfs mount points.
+ *
+ * @param cb
+ *  Function called for each mount point.
+ * @param cb_arg
+ *  User data passed to the callback.
+ *
+ * @return
+ *  0 on success, negative on failure.
+ */
+int eal_hugepage_mount_walk(eal_hugepage_mount_walk_cb *cb, void *cb_arg);
+
+#endif /* _EAL_HUGEPAGE_INFO_ */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v6 2/3] eal: add memory pre-allocation from existing files
  2021-10-11  8:56         ` [dpdk-dev] [PATCH v6 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
  2021-10-11  8:56           ` [dpdk-dev] [PATCH v6 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
@ 2021-10-11  8:56           ` Dmitry Kozlyuk
  2021-10-12 15:37             ` David Marchand
  2021-10-11  8:56           ` [dpdk-dev] [PATCH v6 3/3] app/test: add allocator performance autotest Dmitry Kozlyuk
  2021-10-11 18:52           ` [dpdk-dev] [PATCH v6 0/3] eal: add memory pre-allocation from existing files Thomas Monjalon
  3 siblings, 1 reply; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-11  8:56 UTC (permalink / raw)
  To: dev; +Cc: Viacheslav Ovsiienko, Anatoly Burakov

From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

The primary DPDK process launch might take a long time if the initially
allocated memory is large. In practice, allocating 1 TB of memory
over 1 GB hugepages on Linux takes tens of seconds. Fast restart
is highly desired for some applications, and the launch delay presents
a problem.

The primary delay happens in this call trace:
  rte_eal_init()
    rte_eal_memory_init()
      rte_eal_hugepage_init()
        eal_dynmem_hugepage_init()
          eal_memalloc_alloc_seg_bulk()
            alloc_seg()
              mmap()

The largest part of the time spent in mmap() is filling the memory
with zeros. The kernel does so to prevent data leakage from the process
that last used the page. However, in a controlled environment this
may not be an issue, while performance is. (The Linux-specific
MAP_UNINITIALIZED flag allows mapping without clearing, but it is
disabled in all popular distributions for the reason above.)

It is proposed to add a new EAL option, --mem-file FILE1,FILE2,...,
to map hugepages "as is" from the specified FILEs in hugetlbfs.
Compared to using external memory for the task, the EAL option requires
no change to application code, while allowing the administrator
to control hugepage sizes and their NUMA affinity.

Limitations of the feature:

* Linux-specific (only Linux maps hugepages from files).
* Incompatible with --legacy-mem (partially replaces it).
* Incompatible with --single-file-segments
  (--mem-file FILEs can contain as many segments as needed).
* Incompatible with --in-memory (logically).

A warning about possible security implications is printed
when --mem-file is used.

Until this patch, the DPDK allocator always cleared memory on freeing,
so that it did not have to do so on allocation, while new memory
was cleared by the kernel. When --mem-file is in use, DPDK clears memory
after allocation in rte_zmalloc() and does not clear it on freeing.
Effectively, the user trades fast startup for an occasional allocation
slowdown whenever clearing is absolutely necessary. When memory is
recycled, it is cleared again, which is suboptimal per se, but saves
the complication of memory management.
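
In code, the new clearing policy amounts to roughly the following
sketch (zmalloc_sketch() is illustrative only; the actual change is in
rte_zmalloc_socket() below, using the malloc_clear_on_alloc() helper
this patch adds):

#include <string.h>
#include <rte_malloc.h>
#include "malloc_heap.h" /* malloc_clear_on_alloc() */

static void *
zmalloc_sketch(size_t size)
{
	void *p = rte_malloc(NULL, size, 0);

	/* With --mem-file, a freshly allocated page may hold stale data,
	 * so clear it on allocation; otherwise it is already zero
	 * (zero-filled by the kernel or cleared on free). */
	if (p != NULL && malloc_clear_on_alloc())
		memset(p, 0, size);
	return p;
}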

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 doc/guides/linux_gsg/linux_eal_parameters.rst |  17 +
 lib/eal/common/eal_common_dynmem.c            |   6 +
 lib/eal/common/eal_common_options.c           |  23 ++
 lib/eal/common/eal_internal_cfg.h             |   4 +
 lib/eal/common/eal_memalloc.h                 |   8 +-
 lib/eal/common/eal_options.h                  |   2 +
 lib/eal/common/malloc_elem.c                  |   5 +
 lib/eal/common/malloc_heap.h                  |   8 +
 lib/eal/common/rte_malloc.c                   |  16 +-
 lib/eal/include/rte_memory.h                  |   4 +-
 lib/eal/linux/eal.c                           |  28 ++
 lib/eal/linux/eal_hugepage_info.c             |   5 +
 lib/eal/linux/eal_memalloc.c                  | 328 +++++++++++++++++-
 13 files changed, 441 insertions(+), 13 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst
index bd3977cb3d..b465feaea8 100644
--- a/doc/guides/linux_gsg/linux_eal_parameters.rst
+++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
@@ -92,6 +92,23 @@ Memory-related options
 
     Free hugepages back to system exactly as they were originally allocated.
 
+*   ``--mem-file <pre-allocated files>``
+
+    Use memory from pre-allocated files in ``hugetlbfs`` without clearing it;
+    when this memory is exhausted, switch to default dynamic allocation.
+    This speeds up startup compared to ``--legacy-mem`` while also avoiding
+    later delays for allocating new hugepages. One downside is slowdown
+    of all zeroed memory allocations. Security warning: an application
+    can access contents left by previous users of hugepages. Multiple files
+    can be pre-allocated in ``hugetlbfs`` with different page sizes,
+    on desired NUMA nodes, using ``mount`` options and ``numactl``:
+
+        --mem-file /mnt/huge-1G/node0,/mnt/huge-1G/node1,/mnt/huge-2M/extra
+
+    This option is incompatible with ``--legacy-mem``, ``--in-memory``,
+    and ``--single-file-segments``. Primary and secondary processes
+    must specify exactly the same list of files.
+
 Other options
 ~~~~~~~~~~~~~
 
diff --git a/lib/eal/common/eal_common_dynmem.c b/lib/eal/common/eal_common_dynmem.c
index 7c5437ddfa..abcf22f097 100644
--- a/lib/eal/common/eal_common_dynmem.c
+++ b/lib/eal/common/eal_common_dynmem.c
@@ -272,6 +272,12 @@ eal_dynmem_hugepage_init(void)
 			internal_conf->num_hugepage_sizes) < 0)
 		return -1;
 
+#ifdef RTE_EXEC_ENV_LINUX
+	/* pre-allocate pages from --mem-file option files */
+	if (eal_memalloc_memfile_alloc(used_hp) < 0)
+		return -1;
+#endif
+
 	for (hp_sz_idx = 0;
 			hp_sz_idx < (int)internal_conf->num_hugepage_sizes;
 			hp_sz_idx++) {
diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index 1802e3d9e1..1265720484 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -84,6 +84,7 @@ eal_long_options[] = {
 	{OPT_TRACE_MODE,        1, NULL, OPT_TRACE_MODE_NUM       },
 	{OPT_MAIN_LCORE,        1, NULL, OPT_MAIN_LCORE_NUM       },
 	{OPT_MBUF_POOL_OPS_NAME, 1, NULL, OPT_MBUF_POOL_OPS_NAME_NUM},
+	{OPT_MEM_FILE,          1, NULL, OPT_MEM_FILE_NUM         },
 	{OPT_NO_HPET,           0, NULL, OPT_NO_HPET_NUM          },
 	{OPT_NO_HUGE,           0, NULL, OPT_NO_HUGE_NUM          },
 	{OPT_NO_PCI,            0, NULL, OPT_NO_PCI_NUM           },
@@ -1879,6 +1880,8 @@ eal_cleanup_config(struct internal_config *internal_cfg)
 		free(internal_cfg->hugepage_dir);
 	if (internal_cfg->user_mbuf_pool_ops_name != NULL)
 		free(internal_cfg->user_mbuf_pool_ops_name);
+	if (internal_cfg->mem_file[0])
+		free(internal_cfg->mem_file[0]);
 
 	return 0;
 }
@@ -1999,6 +2002,26 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"amount of reserved memory can be adjusted with "
 			"-m or --"OPT_SOCKET_MEM"\n");
 	}
+	if (internal_cfg->mem_file[0] && internal_cfg->legacy_mem) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_LEGACY_MEM"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_cfg->no_hugetlbfs) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_NO_HUGE"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_cfg->in_memory) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_IN_MEMORY"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_cfg->single_file_segments) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_SINGLE_FILE_SEGMENTS"\n");
+		return -1;
+	}
 
 	return 0;
 }
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index d6c0470eb8..814d5c66e1 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -22,6 +22,9 @@
 #define MAX_HUGEPAGE_SIZES 3  /**< support up to 3 page sizes */
 #endif
 
+#define MAX_MEMFILE_ITEMS (MAX_HUGEPAGE_SIZES * RTE_MAX_NUMA_NODES)
+/**< Maximal number of mem-file parameters. */
+
 /*
  * internal configuration structure for the number, size and
  * mount points of hugepages
@@ -83,6 +86,7 @@ struct internal_config {
 	rte_uuid_t vfio_vf_token;
 	char *hugefile_prefix;      /**< the base filename of hugetlbfs files */
 	char *hugepage_dir;         /**< specific hugetlbfs directory to use */
+	char *mem_file[MAX_MEMFILE_ITEMS]; /**< pre-allocated memory files */
 	char *user_mbuf_pool_ops_name;
 			/**< user defined mbuf pool ops name */
 	unsigned num_hugepage_sizes;      /**< how many sizes on this system */
diff --git a/lib/eal/common/eal_memalloc.h b/lib/eal/common/eal_memalloc.h
index ebc3a6f6c1..d92c9a167b 100644
--- a/lib/eal/common/eal_memalloc.h
+++ b/lib/eal/common/eal_memalloc.h
@@ -8,7 +8,7 @@
 #include <stdbool.h>
 
 #include <rte_memory.h>
-
+#include "eal_internal_cfg.h"
 /*
  * Allocate segment of specified page size.
  */
@@ -96,4 +96,10 @@ eal_memalloc_init(void);
 int
 eal_memalloc_cleanup(void);
 
+int
+eal_memalloc_memfile_init(void);
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa);
+
 #endif /* EAL_MEMALLOC_H */
diff --git a/lib/eal/common/eal_options.h b/lib/eal/common/eal_options.h
index 8e4f7202a2..5c012c8125 100644
--- a/lib/eal/common/eal_options.h
+++ b/lib/eal/common/eal_options.h
@@ -87,6 +87,8 @@ enum {
 	OPT_NO_TELEMETRY_NUM,
 #define OPT_FORCE_MAX_SIMD_BITWIDTH  "force-max-simd-bitwidth"
 	OPT_FORCE_MAX_SIMD_BITWIDTH_NUM,
+#define OPT_MEM_FILE          "mem-file"
+	OPT_MEM_FILE_NUM,
 
 	OPT_LONG_MAX_NUM
 };
diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
index c2c9461f1d..6e71029a3c 100644
--- a/lib/eal/common/malloc_elem.c
+++ b/lib/eal/common/malloc_elem.c
@@ -578,8 +578,13 @@ malloc_elem_free(struct malloc_elem *elem)
 	/* decrease heap's count of allocated elements */
 	elem->heap->alloc_count--;
 
+#ifdef MALLOC_DEBUG
 	/* poison memory */
 	memset(ptr, MALLOC_POISON, data_len);
+#else
+	if (!malloc_clear_on_alloc())
+		memset(ptr, 0, data_len);
+#endif
 
 	return elem;
 }
diff --git a/lib/eal/common/malloc_heap.h b/lib/eal/common/malloc_heap.h
index 3a6ec6ecf0..cb1b5a5dd5 100644
--- a/lib/eal/common/malloc_heap.h
+++ b/lib/eal/common/malloc_heap.h
@@ -10,6 +10,7 @@
 
 #include <rte_malloc.h>
 #include <rte_spinlock.h>
+#include "eal_private.h"
 
 /* Number of free lists per heap, grouped by size. */
 #define RTE_HEAP_NUM_FREELISTS  13
@@ -44,6 +45,13 @@ malloc_get_numa_socket(void)
 	return socket_id;
 }
 
+static inline bool
+malloc_clear_on_alloc(void)
+{
+	const struct internal_config *cfg = eal_get_internal_configuration();
+	return cfg->mem_file[0] != NULL;
+}
+
 void *
 malloc_heap_alloc(const char *type, size_t size, int socket, unsigned int flags,
 		size_t align, size_t bound, bool contig);
diff --git a/lib/eal/common/rte_malloc.c b/lib/eal/common/rte_malloc.c
index 9d39e58c08..ce94268aca 100644
--- a/lib/eal/common/rte_malloc.c
+++ b/lib/eal/common/rte_malloc.c
@@ -113,17 +113,23 @@ rte_malloc(const char *type, size_t size, unsigned align)
 void *
 rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket)
 {
+	bool zero;
 	void *ptr = rte_malloc_socket(type, size, align, socket);
 
-#ifdef RTE_MALLOC_DEBUG
 	/*
 	 * If DEBUG is enabled, then freed memory is marked with poison
-	 * value and set to zero on allocation.
-	 * If DEBUG is not enabled then  memory is already zeroed.
+	 * value and must be set to zero on allocation.
+	 * If DEBUG is not enabled then it is configurable
+	 * whether memory comes already set to zero by memalloc or on free
+	 * or it must be set to zero here.
 	 */
-	if (ptr != NULL)
-		memset(ptr, 0, size);
+#ifdef RTE_MALLOC_DEBUG
+	zero = true;
+#else
+	zero = malloc_clear_on_alloc();
 #endif
+	if (ptr != NULL && zero)
+		memset(ptr, 0, size);
 
 	rte_eal_trace_mem_zmalloc(type, size, align, socket, ptr);
 	return ptr;
diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h
index 6d018629ae..579358e29e 100644
--- a/lib/eal/include/rte_memory.h
+++ b/lib/eal/include/rte_memory.h
@@ -40,7 +40,9 @@ extern "C" {
 /**
  * Physical memory segment descriptor.
  */
-#define RTE_MEMSEG_FLAG_DO_NOT_FREE (1 << 0)
+#define RTE_MEMSEG_FLAG_DO_NOT_FREE   (1 << 0)
+#define RTE_MEMSEG_FLAG_PRE_ALLOCATED (1 << 1)
+
 /**< Prevent this segment from being freed back to the OS. */
 struct rte_memseg {
 	rte_iova_t iova;            /**< Start IO address. */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 3577eaeaa4..d0afcd8326 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -548,6 +548,7 @@ eal_usage(const char *prgname)
 	       "  --"OPT_LEGACY_MEM"        Legacy memory mode (no dynamic allocation, contiguous segments)\n"
 	       "  --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n"
 	       "  --"OPT_MATCH_ALLOCATIONS" Free hugepages exactly as allocated\n"
+	       "  --"OPT_MEM_FILE"          Comma-separated list of files in hugetlbfs.\n"
 	       "\n");
 	/* Allow the application to print its usage message too if hook is set */
 	if (hook) {
@@ -678,6 +679,22 @@ eal_log_level_parse(int argc, char **argv)
 	optarg = old_optarg;
 }
 
+static int
+eal_parse_memfile_arg(const char *arg, char **mem_file)
+{
+	int ret;
+
+	char *copy = strdup(arg);
+	if (copy == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot store --"OPT_MEM_FILE" names\n");
+		return -1;
+	}
+
+	ret = rte_strsplit(copy, strlen(copy), mem_file,
+			MAX_MEMFILE_ITEMS, ',');
+	return ret <= 0 ? -1 : 0;
+}
+
 /* Parse the argument given in the command line of the application */
 static int
 eal_parse_args(int argc, char **argv)
@@ -819,6 +836,17 @@ eal_parse_args(int argc, char **argv)
 			internal_conf->match_allocations = 1;
 			break;
 
+		case OPT_MEM_FILE_NUM:
+			if (eal_parse_memfile_arg(optarg,
+					internal_conf->mem_file) < 0) {
+				RTE_LOG(ERR, EAL, "invalid parameters for --"
+						OPT_MEM_FILE "\n");
+				eal_usage(prgname);
+				ret = -1;
+				goto out;
+			}
+			break;
+
 		default:
 			if (opt < OPT_LONG_MIN_NUM && isprint(opt)) {
 				RTE_LOG(ERR, EAL, "Option %c is not supported "
diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index 193282e779..dfbb49ada9 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -37,6 +37,7 @@
 #include "eal_hugepages.h"
 #include "eal_hugepage_info.h"
 #include "eal_filesystem.h"
+#include "eal_memalloc.h"
 
 static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
 static const char sys_pages_numa_dir_path[] = "/sys/devices/system/node";
@@ -515,6 +516,10 @@ hugepage_info_init(void)
 	qsort(&internal_conf->hugepage_info[0], num_sizes,
 	      sizeof(internal_conf->hugepage_info[0]), compare_hpi);
 
+	/* add pre-allocated pages with --mem-file option to available ones */
+	if (eal_memalloc_memfile_init())
+		return -1;
+
 	/* now we have all info, check we have at least one valid size */
 	for (i = 0; i < num_sizes; i++) {
 		/* pages may no longer all be on socket 0, so check all */
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 0ec8542283..c2b3586204 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -18,6 +18,7 @@
 #include <unistd.h>
 #include <limits.h>
 #include <fcntl.h>
+#include <mntent.h>
 #include <sys/ioctl.h>
 #include <sys/time.h>
 #include <signal.h>
@@ -41,6 +42,7 @@
 #include <rte_spinlock.h>
 
 #include "eal_filesystem.h"
+#include "eal_hugepage_info.h"
 #include "eal_internal_cfg.h"
 #include "eal_memalloc.h"
 #include "eal_memcfg.h"
@@ -102,6 +104,19 @@ static struct {
 	int count; /**< entries used in an array */
 } fd_list[RTE_MAX_MEMSEG_LISTS];
 
+struct memfile {
+	char *fname;		/**< file name */
+	uint64_t hugepage_sz;	/**< size of a huge page */
+	uint32_t num_pages;	/**< number of pages */
+	uint32_t num_allocated;	/**< number of already allocated pages */
+	int socket_id;		/**< Socket ID  */
+	int fd;			/**< file descriptor */
+};
+
+struct memfile mem_file[MAX_MEMFILE_ITEMS];
+
+static int alloc_memfile;
+
 /** local copy of a memory map, used to synchronize memory hotplug in MP */
 static struct rte_memseg_list local_memsegs[RTE_MAX_MEMSEG_LISTS];
 
@@ -542,6 +557,26 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		 * stage.
 		 */
 		map_offset = 0;
+	} else if (alloc_memfile) {
+		uint32_t mf;
+
+		for (mf = 0; mf < RTE_DIM(mem_file); mf++) {
+			if (alloc_sz == mem_file[mf].hugepage_sz &&
+			    socket_id == mem_file[mf].socket_id &&
+			    mem_file[mf].num_allocated < mem_file[mf].num_pages)
+				break;
+		}
+		if (mf >= RTE_DIM(mem_file)) {
+			RTE_LOG(ERR, EAL,
+				"%s() cannot allocate from memfile\n",
+				__func__);
+			return -1;
+		}
+		fd = mem_file[mf].fd;
+		fd_list[list_idx].fds[seg_idx] = fd;
+		map_offset = mem_file[mf].num_allocated * alloc_sz;
+		mmap_flags = MAP_SHARED | MAP_POPULATE | MAP_FIXED;
+		mem_file[mf].num_allocated++;
 	} else {
 		/* takes out a read lock on segment or segment list */
 		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
@@ -683,6 +718,10 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	if (fd < 0)
 		return -1;
 
+	/* don't clean up pre-allocated files */
+	if (alloc_memfile)
+		return -1;
+
 	if (internal_conf->single_file_segments) {
 		resize_hugefile(fd, map_offset, alloc_sz, false);
 		/* ignore failure, can't make it any worse */
@@ -712,8 +751,9 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	/* erase page data */
-	memset(ms->addr, 0, ms->len);
+	/* Erase page data unless it comes from pre-allocated files. */
+	if (!alloc_memfile)
+		memset(ms->addr, 0, ms->len);
 
 	if (mmap(ms->addr, ms->len, PROT_NONE,
 			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) ==
@@ -724,8 +764,12 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 
 	eal_mem_set_dump(ms->addr, ms->len, false);
 
-	/* if we're using anonymous hugepages, nothing to be done */
-	if (internal_conf->in_memory && !memfd_create_supported) {
+	/*
+	 * if we're using anonymous hugepages or pre-allocated files,
+	 * nothing to be done
+	 */
+	if ((internal_conf->in_memory && !memfd_create_supported) ||
+			alloc_memfile) {
 		memset(ms, 0, sizeof(*ms));
 		return 0;
 	}
@@ -838,7 +882,9 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	 * during init, we already hold a write lock, so don't try to take out
 	 * another one.
 	 */
-	if (wa->hi->lock_descriptor == -1 && !internal_conf->in_memory) {
+	if (wa->hi->lock_descriptor == -1 &&
+	    !internal_conf->in_memory &&
+	    !alloc_memfile) {
 		dir_fd = open(wa->hi->hugedir, O_RDONLY);
 		if (dir_fd < 0) {
 			RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n",
@@ -868,7 +914,7 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 				need, i);
 
 			/* if exact number wasn't requested, stop */
-			if (!wa->exact)
+			if (!wa->exact || alloc_memfile)
 				goto out;
 
 			/* clean up */
@@ -1120,6 +1166,262 @@ eal_memalloc_free_seg(struct rte_memseg *ms)
 	return eal_memalloc_free_seg_bulk(&ms, 1);
 }
 
+static int
+memfile_fill_socket_id(struct memfile *mf)
+{
+#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
+	void *va;
+	int ret;
+
+	va = mmap(NULL, mf->hugepage_sz, PROT_READ | PROT_WRITE,
+			MAP_SHARED | MAP_POPULATE, mf->fd, 0);
+	if (va == MAP_FAILED) {
+		RTE_LOG(ERR, EAL, "%s(): %s: mmap(): %s\n",
+				__func__, mf->fname, strerror(errno));
+		return -1;
+	}
+
+	ret = 0;
+	if (check_numa()) {
+		if (get_mempolicy(&mf->socket_id, NULL, 0, va,
+				MPOL_F_NODE | MPOL_F_ADDR) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: get_mempolicy(): %s\n",
+				__func__, mf->fname, strerror(errno));
+			ret = -1;
+		}
+	} else
+		mf->socket_id = 0;
+
+	munmap(va, mf->hugepage_sz);
+	return ret;
+#else
+	mf->socket_id = 0;
+	return 0;
+#endif
+}
+
+struct match_memfile_path_arg {
+	const char *path;
+	uint64_t file_sz;
+	uint64_t hugepage_sz;
+	size_t best_len;
+};
+
+/*
+ * While it is unlikely for hugetlbfs, mount points can be nested.
+ * Find the deepest mount point that contains the file.
+ */
+static int
+match_memfile_path(const char *path, uint64_t hugepage_sz, void *cb_arg)
+{
+	struct match_memfile_path_arg *arg = cb_arg;
+	size_t dir_len = strlen(path);
+
+	if (dir_len < arg->best_len)
+		return 0;
+	if (strncmp(path, arg->path, dir_len) != 0)
+		return 0;
+	if (arg->file_sz % hugepage_sz != 0)
+		return 0;
+
+	arg->hugepage_sz = hugepage_sz;
+	arg->best_len = dir_len;
+	return 0;
+}
+
+/* Determine hugepage size from the path to a file in hugetlbfs. */
+static int
+memfile_fill_hugepage_sz(struct memfile *mf, uint64_t file_sz)
+{
+	char abspath[PATH_MAX];
+	struct match_memfile_path_arg arg;
+
+	if (realpath(mf->fname, abspath) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): realpath(): %s\n",
+				__func__, strerror(errno));
+		return -1;
+	}
+
+	memset(&arg, 0, sizeof(arg));
+	arg.path = abspath;
+	arg.file_sz = file_sz;
+	if (eal_hugepage_mount_walk(match_memfile_path, &arg) == 0 &&
+			arg.hugepage_sz != 0) {
+		mf->hugepage_sz = arg.hugepage_sz;
+		return 0;
+	}
+	return -1;
+}
+
+int
+eal_memalloc_memfile_init(void)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	int err = -1, fd;
+	uint32_t i;
+
+	if (internal_conf->mem_file[0] == NULL)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(internal_conf->mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t fsize;
+
+		if (internal_conf->mem_file[i] == NULL) {
+			err = 0;
+			break;
+		}
+		mf->fname = internal_conf->mem_file[i];
+		fd = open(mf->fname, O_RDWR, 0600);
+		mf->fd = fd;
+		if (fd < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: open(): %s\n",
+					__func__, mf->fname, strerror(errno));
+			break;
+		}
+
+		/* take out a read lock and keep it indefinitely */
+		if (lock(fd, LOCK_SH) != 1) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot lock file\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		fsize = get_file_size(fd);
+		if (!fsize) {
+			RTE_LOG(ERR, EAL, "%s(): %s: zero file length\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		if (memfile_fill_hugepage_sz(mf, fsize) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect page size\n",
+					__func__, mf->fname);
+			break;
+		}
+		mf->num_pages = fsize / mf->hugepage_sz;
+
+		if (memfile_fill_socket_id(mf) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect NUMA node\n",
+					__func__, mf->fname);
+			break;
+		}
+	}
+
+	/* check if some problem happened */
+	if (err && i < RTE_DIM(internal_conf->mem_file)) {
+		/* some error occurred, do rollback */
+		do {
+			fd = mem_file[i].fd;
+			/* closing fd drops the lock */
+			if (fd >= 0)
+				close(fd);
+			mem_file[i].fd = -1;
+		} while (i--);
+		return -1;
+	}
+
+	/* update hugepage_info with pages allocated in files */
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		const struct memfile *mf = &mem_file[i];
+		struct hugepage_info *hpi = NULL;
+		uint64_t sz;
+
+		if (!mf->hugepage_sz)
+			break;
+
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			hpi = &internal_conf->hugepage_info[sz];
+
+			if (mf->hugepage_sz == hpi->hugepage_sz) {
+				hpi->num_pages[mf->socket_id] += mf->num_pages;
+				break;
+			}
+		}
+
+		/* it seems hugepage info is not socket aware yet */
+		if (hpi != NULL && sz >= internal_conf->num_hugepage_sizes)
+			hpi->num_pages[0] += mf->num_pages;
+	}
+	return 0;
+}
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	uint32_t i, sz;
+
+	if (internal_conf->mem_file[0] == NULL ||
+			rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t hugepage_sz = mf->hugepage_sz;
+		int socket_id = mf->socket_id;
+		struct rte_memseg **pages;
+
+		if (!hugepage_sz)
+			break;
+
+		while (mf->num_allocated < mf->num_pages) {
+			int needed, allocated, j;
+			uint32_t prev;
+
+			prev = mf->num_allocated;
+			needed = mf->num_pages - mf->num_allocated;
+			pages = malloc(sizeof(*pages) * needed);
+			if (pages == NULL)
+				return -1;
+
+			/* memalloc is locked, it's safe to switch allocator */
+			alloc_memfile = 1;
+			allocated = eal_memalloc_alloc_seg_bulk(pages,
+					needed, hugepage_sz, socket_id,	false);
+			/* switch allocator back */
+			alloc_memfile = 0;
+			if (allocated <= 0) {
+				RTE_LOG(ERR, EAL, "%s(): %s: allocation failed\n",
+						__func__, mf->fname);
+				free(pages);
+				return -1;
+			}
+
+			/* mark preallocated pages as unfreeable */
+			for (j = 0; j < allocated; j++) {
+				struct rte_memseg *ms = pages[j];
+
+				ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE |
+					     RTE_MEMSEG_FLAG_PRE_ALLOCATED;
+			}
+
+			free(pages);
+
+			/* check whether we allocated from expected file */
+			if (prev + allocated != mf->num_allocated) {
+				RTE_LOG(ERR, EAL, "%s(): %s: incorrect allocation\n",
+						__func__, mf->fname);
+				return -1;
+			}
+		}
+
+		/* reflect we pre-allocated some memory */
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			struct hugepage_info *hpi = &hpa[sz];
+
+			if (hpi->hugepage_sz != hugepage_sz)
+				continue;
+			hpi->num_pages[socket_id] -=
+					RTE_MIN(hpi->num_pages[socket_id],
+						mf->num_allocated);
+		}
+	}
+	return 0;
+}
+
 static int
 sync_chunk(struct rte_memseg_list *primary_msl,
 		struct rte_memseg_list *local_msl, struct hugepage_info *hi,
@@ -1178,6 +1480,14 @@ sync_chunk(struct rte_memseg_list *primary_msl,
 		if (l_ms == NULL || p_ms == NULL)
 			return -1;
 
+		/*
+		 * Switch allocator for this segment.
+		 * This function is only called during init,
+		 * so don't try to restore allocator on failure.
+		 */
+		if (p_ms->flags & RTE_MEMSEG_FLAG_PRE_ALLOCATED)
+			alloc_memfile = 1;
+
 		if (used) {
 			ret = alloc_seg(l_ms, p_ms->addr,
 					p_ms->socket_id, hi,
@@ -1191,6 +1501,9 @@ sync_chunk(struct rte_memseg_list *primary_msl,
 			if (ret < 0)
 				return -1;
 		}
+
+		/* Reset the allocator. */
+		alloc_memfile = 0;
 	}
 
 	/* if we just allocated memory, notify the application */
@@ -1392,6 +1705,9 @@ eal_memalloc_sync_with_primary(void)
 	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
 		return 0;
 
+	if (eal_memalloc_memfile_init() < 0)
+		return -1;
+
 	/* memalloc is locked, so it's safe to call thread-unsafe version */
 	if (rte_memseg_list_walk_thread_unsafe(sync_walk, NULL))
 		return -1;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v6 3/3] app/test: add allocator performance autotest
  2021-10-11  8:56         ` [dpdk-dev] [PATCH v6 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
  2021-10-11  8:56           ` [dpdk-dev] [PATCH v6 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
  2021-10-11  8:56           ` [dpdk-dev] [PATCH v6 2/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
@ 2021-10-11  8:56           ` Dmitry Kozlyuk
  2021-10-12 13:53             ` Aaron Conole
  2021-10-11 18:52           ` [dpdk-dev] [PATCH v6 0/3] eal: add memory pre-allocation from existing files Thomas Monjalon
  3 siblings, 1 reply; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-11  8:56 UTC (permalink / raw)
  To: dev; +Cc: Viacheslav Ovsiienko

Memory allocator performance is crucial to applications that deal
with large amounts of memory or allocate frequently. DPDK allocator
performance is affected by EAL options, the API used, and, not least,
the allocation size. The new autotest is intended to be run with
different EAL options. It measures performance with a range of sizes
for different APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.

Work distribution between allocation and deallocation depends on EAL
options. The test prints both times and the total time to ease
comparison.

Memory can be filled with zeroes at different points of the allocation
path, but it always takes a considerable fraction of the overall time.
This is why the test measures the filling speed and prints how long
clearing would take for each size, as a hint.
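
For example, with hypothetical figures: if memset() sustains 10 GiB/s,
clearing 1 GiB takes ~100,000 us, so the hint column would show about
98 us for a 1 MiB allocation (100,000 / 1024).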

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 app/test/meson.build        |   2 +
 app/test/test_malloc_perf.c | 161 ++++++++++++++++++++++++++++++++++++
 2 files changed, 163 insertions(+)
 create mode 100644 app/test/test_malloc_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index f144d8b8ed..47d1d60ded 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -85,6 +85,7 @@ test_sources = files(
         'test_lpm6_perf.c',
         'test_lpm_perf.c',
         'test_malloc.c',
+        'test_malloc_perf.c',
         'test_mbuf.c',
         'test_member.c',
         'test_member_perf.c',
@@ -282,6 +283,7 @@ fast_tests = [
 
 perf_test_names = [
         'ring_perf_autotest',
+        'malloc_perf_autotest',
         'mempool_perf_autotest',
         'memcpy_perf_autotest',
         'hash_perf_autotest',
diff --git a/app/test/test_malloc_perf.c b/app/test/test_malloc_perf.c
new file mode 100644
index 0000000000..fa7357f540
--- /dev/null
+++ b/app/test/test_malloc_perf.c
@@ -0,0 +1,161 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <inttypes.h>
+#include <string.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_malloc.h>
+#include <rte_memzone.h>
+
+#include "test.h"
+
+typedef void * (alloc_t)(const char *name, size_t size, unsigned int align);
+typedef void (free_t)(void *addr);
+
+static const uint64_t KB = 1 << 10;
+static const uint64_t GB = 1 << 30;
+
+static double
+tsc_to_us(uint64_t tsc, size_t runs)
+{
+	return (double)tsc / rte_get_tsc_hz() * US_PER_S / runs;
+}
+
+static int
+test_memset_perf(double *us_per_gb)
+{
+	static const size_t RUNS = 20;
+
+	void *ptr;
+	size_t i;
+	uint64_t tsc;
+
+	puts("Performance: memset");
+
+	ptr = rte_malloc(NULL, GB, 0);
+	if (ptr == NULL) {
+		printf("rte_malloc(size=%"PRIx64") failed\n", GB);
+		return -1;
+	}
+
+	tsc = rte_rdtsc_precise();
+	for (i = 0; i < RUNS; i++)
+		memset(ptr, 0, GB);
+	tsc = rte_rdtsc_precise() - tsc;
+
+	*us_per_gb = tsc_to_us(tsc, RUNS);
+	printf("Result: %.3f GiB/s <=> %.2f us/MiB\n",
+			US_PER_S / *us_per_gb, *us_per_gb / KB);
+
+	rte_free(ptr);
+	putchar('\n');
+	return 0;
+}
+
+static int
+test_alloc_perf(const char *name, alloc_t *alloc_fn, free_t free_fn,
+		size_t max_runs, double memset_gb_us)
+{
+	static const size_t SIZES[] = {
+			1 << 6, 1 << 7, 1 << 10, 1 << 12, 1 << 16, 1 << 20,
+			1 << 21, 1 << 22, 1 << 24, 1 << 30 };
+
+	size_t i, j;
+	void **ptrs;
+
+	printf("Performance: %s\n", name);
+
+	ptrs = calloc(max_runs, sizeof(ptrs[0]));
+	if (ptrs == NULL) {
+		puts("Cannot allocate memory for pointers");
+		return -1;
+	}
+
+	printf("%12s%8s%12s%12s%12s%12s\n",
+			"Size (B)", "Runs", "Alloc (us)", "Free (us)",
+			"Total (us)", "memset (us)");
+	for (i = 0; i < RTE_DIM(SIZES); i++) {
+		size_t size = SIZES[i];
+		size_t runs_done;
+		uint64_t tsc_start, tsc_alloc, tsc_free;
+		double alloc_time, free_time, memset_time;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < max_runs; j++) {
+			ptrs[j] = alloc_fn(NULL, size, 0);
+			if (ptrs[j] == NULL)
+				break;
+		}
+		tsc_alloc = rte_rdtsc_precise() - tsc_start;
+
+		if (j == 0) {
+			printf("%12zu Interrupted: out of memory.\n", size);
+			break;
+		}
+		runs_done = j;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+			free_fn(ptrs[j]);
+		tsc_free = rte_rdtsc_precise() - tsc_start;
+
+		alloc_time = tsc_to_us(tsc_alloc, runs_done);
+		free_time = tsc_to_us(tsc_free, runs_done);
+		memset_time = memset_gb_us * size / GB;
+		printf("%12zu%8zu%12.2f%12.2f%12.2f%12.2f\n",
+				size, runs_done, alloc_time, free_time,
+				alloc_time + free_time, memset_time);
+
+		memset(ptrs, 0, max_runs * sizeof(ptrs[0]));
+	}
+
+	free(ptrs);
+	putchar('\n');
+	return 0;
+}
+
+static void *
+memzone_alloc(const char *name __rte_unused, size_t size, unsigned int align)
+{
+	const struct rte_memzone *mz;
+	char gen_name[RTE_MEMZONE_NAMESIZE];
+
+	snprintf(gen_name, sizeof(gen_name), "test-mz-%"PRIx64, rte_rdtsc());
+	mz = rte_memzone_reserve_aligned(gen_name, size, SOCKET_ID_ANY,
+			RTE_MEMZONE_1GB | RTE_MEMZONE_SIZE_HINT_ONLY, align);
+	return (void *)(uintptr_t)mz;
+}
+
+static void
+memzone_free(void *addr)
+{
+	rte_memzone_free((struct rte_memzone *)addr);
+}
+
+static int
+test_malloc_perf(void)
+{
+	static const size_t MAX_RUNS = 10000;
+
+	double memset_gb_us;
+
+	if (test_memset_perf(&memset_gb_us) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_malloc", rte_malloc, rte_free,
+			MAX_RUNS, memset_gb_us) < 0)
+		return -1;
+	if (test_alloc_perf("rte_zmalloc", rte_zmalloc, rte_free,
+			MAX_RUNS, memset_gb_us) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_memzone_reserve", memzone_alloc, memzone_free,
+			RTE_MAX_MEMZONE - 1, memset_gb_us) < 0)
+		return -1;
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(malloc_perf_autotest, test_malloc_perf);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 0/3] eal: add memory pre-allocation from existing files
  2021-10-11  8:56         ` [dpdk-dev] [PATCH v6 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
                             ` (2 preceding siblings ...)
  2021-10-11  8:56           ` [dpdk-dev] [PATCH v6 3/3] app/test: add allocator performance autotest Dmitry Kozlyuk
@ 2021-10-11 18:52           ` Thomas Monjalon
  2021-10-11 21:12             ` [dpdk-dev] [dpdk-ci] " Lincoln Lavoie
  3 siblings, 1 reply; 48+ messages in thread
From: Thomas Monjalon @ 2021-10-11 18:52 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, ci

11/10/2021 10:56, Dmitry Kozlyuk:
> v6: fix copyright line (Thomas), add SPDX header for the new test file
>     (BTW, why didn't the CI complain in previous versions?)

Probably because the CI doesn't run the script devtools/check-spdx-tag.sh
Cc ci@dpdk.org to add this test.




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [dpdk-ci] [PATCH v6 0/3] eal: add memory pre-allocation from existing files
  2021-10-11 18:52           ` [dpdk-dev] [PATCH v6 0/3] eal: add memory pre-allocation from existing files Thomas Monjalon
@ 2021-10-11 21:12             ` Lincoln Lavoie
  2021-10-12  6:54               ` Thomas Monjalon
  0 siblings, 1 reply; 48+ messages in thread
From: Lincoln Lavoie @ 2021-10-11 21:12 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: Dmitry Kozlyuk, dev, ci

Hi All,

I'm assuming this should be added to the "checkpatch" job / report?
Correct, or do folks feel this should be a separate run / report?

Cheers,
Lincoln

On Mon, Oct 11, 2021 at 2:52 PM Thomas Monjalon <thomas@monjalon.net> wrote:

> 11/10/2021 10:56, Dmitry Kozlyuk:
> > v6: fix copyright line (Thomas), add SPDX header for the new test file
> >     (BTW, why didn't the CI complain in previous versions?)
>
> Probably because the CI doesn't run the script devtools/check-spdx-tag.sh
> Cc ci@dpdk.org to add this test.
>
>
>
>

-- 
*Lincoln Lavoie*
Principal Engineer, Broadband Technologies
21 Madbury Rd., Ste. 100, Durham, NH 03824
lylavoie@iol.unh.edu
https://www.iol.unh.edu
+1-603-674-2755 (m)
<https://www.iol.unh.edu>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [dpdk-ci] [PATCH v6 0/3] eal: add memory pre-allocation from existing files
  2021-10-11 21:12             ` [dpdk-dev] [dpdk-ci] " Lincoln Lavoie
@ 2021-10-12  6:54               ` Thomas Monjalon
  0 siblings, 0 replies; 48+ messages in thread
From: Thomas Monjalon @ 2021-10-12  6:54 UTC (permalink / raw)
  To: Lincoln Lavoie; +Cc: Dmitry Kozlyuk, dev, ci

11/10/2021 23:12, Lincoln Lavoie:
> Hi All,
> 
> I'm assuming this should be added to the "checkpatch" job / report?

I think yes.

> Correct, or do folks feel this should be a separate run / report?

What are the tests already run in this job, please?
I may suggest more.


> On Mon, Oct 11, 2021 at 2:52 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > 11/10/2021 10:56, Dmitry Kozlyuk:
> > > v6: fix copyright line (Thomas), add SPDX header for the new test file
> > >     (BTW, why didn't the CI complain in previous versions?)
> >
> > Probably because the CI doesn't run the script devtools/check-spdx-tag.sh
> > Cc ci@dpdk.org to add this test.




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/3] app/test: add allocator performance autotest
  2021-10-11  8:56           ` [dpdk-dev] [PATCH v6 3/3] app/test: add allocator performance autotest Dmitry Kozlyuk
@ 2021-10-12 13:53             ` Aaron Conole
  2021-10-12 14:48               ` Dmitry Kozlyuk
  0 siblings, 1 reply; 48+ messages in thread
From: Aaron Conole @ 2021-10-12 13:53 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Viacheslav Ovsiienko, Anatoly Burakov

Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com> writes:

> Memory allocator performance is crucial to applications that deal
> with large amounts of memory or allocate frequently. DPDK allocator
> performance is affected by EAL options, the API used, and, not least,
> the allocation size. The new autotest is intended to be run with
> different EAL options. It measures performance with a range of sizes
> for different APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.
>
> Work distribution between allocation and deallocation depends on EAL
> options. The test prints both times and the total time to ease
> comparison.
>
> Memory can be filled with zeroes at different points of the allocation
> path, but it always takes a considerable fraction of the overall time.
> This is why the test measures the filling speed and prints how long
> clearing would take for each size, as a hint.
>
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> ---

This isn't really a test, imho.  There are no assert()s.  How does a
developer who tries to fix a bug in this area know what is acceptable?

Please switch the printf()s to RTE_LOG calls, and add some
RTE_TEST_ASSERT calls to enforce some time range at the least.
Otherwise this test will not really be checking the performance - just
giving a report somewhere.

Also, I don't understand the way the memset test works here.  You do one
large memset at the very beginning and then extrapolate the time it
would take.  Does that hold any value or should we do a memset in each
iteration and enforce a scaled time?

>  app/test/meson.build        |   2 +
>  app/test/test_malloc_perf.c | 161 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 163 insertions(+)
>  create mode 100644 app/test/test_malloc_perf.c
>
> diff --git a/app/test/meson.build b/app/test/meson.build
> index f144d8b8ed..47d1d60ded 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -85,6 +85,7 @@ test_sources = files(
>          'test_lpm6_perf.c',
>          'test_lpm_perf.c',
>          'test_malloc.c',
> +        'test_malloc_perf.c',
>          'test_mbuf.c',
>          'test_member.c',
>          'test_member_perf.c',
> @@ -282,6 +283,7 @@ fast_tests = [
>  
>  perf_test_names = [
>          'ring_perf_autotest',
> +        'malloc_perf_autotest',
>          'mempool_perf_autotest',
>          'memcpy_perf_autotest',
>          'hash_perf_autotest',
> diff --git a/app/test/test_malloc_perf.c b/app/test/test_malloc_perf.c
> new file mode 100644
> index 0000000000..fa7357f540
> --- /dev/null
> +++ b/app/test/test_malloc_perf.c
> @@ -0,0 +1,161 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2021 NVIDIA Corporation & Affiliates
> + */
> +
> +#include <inttypes.h>
> +#include <string.h>
> +#include <rte_cycles.h>
> +#include <rte_errno.h>
> +#include <rte_malloc.h>
> +#include <rte_memzone.h>
> +
> +#include "test.h"
> +
> +typedef void * (alloc_t)(const char *name, size_t size, unsigned int align);
> +typedef void (free_t)(void *addr);
> +
> +static const uint64_t KB = 1 << 10;
> +static const uint64_t GB = 1 << 30;
> +
> +static double
> +tsc_to_us(uint64_t tsc, size_t runs)
> +{
> +	return (double)tsc / rte_get_tsc_hz() * US_PER_S / runs;
> +}
> +
> +static int
> +test_memset_perf(double *us_per_gb)
> +{
> +	static const size_t RUNS = 20;
> +
> +	void *ptr;
> +	size_t i;
> +	uint64_t tsc;
> +
> +	puts("Performance: memset");
> +
> +	ptr = rte_malloc(NULL, GB, 0);
> +	if (ptr == NULL) {
> +		printf("rte_malloc(size=%"PRIx64") failed\n", GB);
> +		return -1;
> +	}
> +
> +	tsc = rte_rdtsc_precise();
> +	for (i = 0; i < RUNS; i++)
> +		memset(ptr, 0, GB);
> +	tsc = rte_rdtsc_precise() - tsc;
> +
> +	*us_per_gb = tsc_to_us(tsc, RUNS);
> +	printf("Result: %.3f GiB/s <=> %.2f us/MiB\n",
> +			US_PER_S / *us_per_gb, *us_per_gb / KB);
> +
> +	rte_free(ptr);
> +	putchar('\n');
> +	return 0;
> +}
> +
> +static int
> +test_alloc_perf(const char *name, alloc_t *alloc_fn, free_t free_fn,
> +		size_t max_runs, double memset_gb_us)
> +{
> +	static const size_t SIZES[] = {
> +			1 << 6, 1 << 7, 1 << 10, 1 << 12, 1 << 16, 1 << 20,
> +			1 << 21, 1 << 22, 1 << 24, 1 << 30 };
> +
> +	size_t i, j;
> +	void **ptrs;
> +
> +	printf("Performance: %s\n", name);
> +
> +	ptrs = calloc(max_runs, sizeof(ptrs[0]));
> +	if (ptrs == NULL) {
> +		puts("Cannot allocate memory for pointers");
> +		return -1;
> +	}
> +
> +	printf("%12s%8s%12s%12s%12s%12s\n",
> +			"Size (B)", "Runs", "Alloc (us)", "Free (us)",
> +			"Total (us)", "memset (us)");
> +	for (i = 0; i < RTE_DIM(SIZES); i++) {
> +		size_t size = SIZES[i];
> +		size_t runs_done;
> +		uint64_t tsc_start, tsc_alloc, tsc_free;
> +		double alloc_time, free_time, memset_time;
> +
> +		tsc_start = rte_rdtsc_precise();
> +		for (j = 0; j < max_runs; j++) {
> +			ptrs[j] = alloc_fn(NULL, size, 0);
> +			if (ptrs[j] == NULL)
> +				break;
> +		}
> +		tsc_alloc = rte_rdtsc_precise() - tsc_start;
> +
> +		if (j == 0) {
> +			printf("%12zu Interrupted: out of memory.\n", size);
> +			break;
> +		}
> +		runs_done = j;
> +
> +		tsc_start = rte_rdtsc_precise();
> +		for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
> +			free_fn(ptrs[j]);
> +		tsc_free = rte_rdtsc_precise() - tsc_start;
> +
> +		alloc_time = tsc_to_us(tsc_alloc, runs_done);
> +		free_time = tsc_to_us(tsc_free, runs_done);
> +		memset_time = memset_gb_us * size / GB;
> +		printf("%12zu%8zu%12.2f%12.2f%12.2f%12.2f\n",
> +				size, runs_done, alloc_time, free_time,
> +				alloc_time + free_time, memset_time);
> +
> +		memset(ptrs, 0, max_runs * sizeof(ptrs[0]));
> +	}
> +
> +	free(ptrs);
> +	putchar('\n');
> +	return 0;
> +}
> +
> +static void *
> +memzone_alloc(const char *name __rte_unused, size_t size, unsigned int align)
> +{
> +	const struct rte_memzone *mz;
> +	char gen_name[RTE_MEMZONE_NAMESIZE];
> +
> +	snprintf(gen_name, sizeof(gen_name), "test-mz-%"PRIx64, rte_rdtsc());
> +	mz = rte_memzone_reserve_aligned(gen_name, size, SOCKET_ID_ANY,
> +			RTE_MEMZONE_1GB | RTE_MEMZONE_SIZE_HINT_ONLY, align);
> +	return (void *)(uintptr_t)mz;
> +}
> +
> +static void
> +memzone_free(void *addr)
> +{
> +	rte_memzone_free((struct rte_memzone *)addr);
> +}
> +
> +static int
> +test_malloc_perf(void)
> +{
> +	static const size_t MAX_RUNS = 10000;
> +
> +	double memset_gb_us;
> +
> +	if (test_memset_perf(&memset_gb_us) < 0)
> +		return -1;
> +
> +	if (test_alloc_perf("rte_malloc", rte_malloc, rte_free,
> +			MAX_RUNS, memset_gb_us) < 0)
> +		return -1;
> +	if (test_alloc_perf("rte_zmalloc", rte_zmalloc, rte_free,
> +			MAX_RUNS, memset_gb_us) < 0)
> +		return -1;
> +
> +	if (test_alloc_perf("rte_memzone_reserve", memzone_alloc, memzone_free,
> +			RTE_MAX_MEMZONE - 1, memset_gb_us) < 0)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +REGISTER_TEST_COMMAND(malloc_perf_autotest, test_malloc_perf);


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/3] app/test: add allocator performance autotest
  2021-10-12 13:53             ` Aaron Conole
@ 2021-10-12 14:48               ` Dmitry Kozlyuk
  2021-10-15 13:47                 ` Aaron Conole
  0 siblings, 1 reply; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-12 14:48 UTC (permalink / raw)
  To: Aaron Conole; +Cc: dev, Slava Ovsiienko, Anatoly Burakov

> This isn't really a test, imho.  There are no assert()s.  How does a developer who
> tries to fix a bug in this area know what is acceptable?
> 
> Please switch the printf()s to RTE_LOG calls, and add some RTE_TEST_ASSERT
> calls to enforce some time range at the least.
> Otherwise this test will not really be checking the performance - just giving a
> report somewhere.

I just followed the DPDK naming convention of test_xxx_perf.c /
xxx_perf_autotest. They all should really be called benchmarks.
They help developers see how code changes affect performance.
I don't understand how this "perf test" is not in line with the existing
ones, or where it should properly reside.

I'm not totally opposed to replacing printf() with RTE_LOG(), but all
other tests use printf(). The drawback of the change is inconsistency;
what is the benefit?

> Also, I don't understand the way the memset test works here.  You do one large
> memset at the very beginning and then extrapolate the time it would take.  Does
> that hold any value or should we do a memset in each iteration and enforce a
> scaled time?

As explained above, we don't need to enforce anything; we want a report.
I've never seen a case with one NUMA node where memset() time would not
scale linearly, but benchmarks should be precise, so I'll change it to
memset()'ing the allocated area, thanks.
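
Concretely, something like the following sketch, placed between the
allocation and free loops of test_alloc_perf() and reusing its
variables (not final code):

	tsc_start = rte_rdtsc_precise();
	for (j = 0; j < runs_done; j++)
		memset(ptrs[j], 0, size);
	memset_time = tsc_to_us(rte_rdtsc_precise() - tsc_start, runs_done);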

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 2/3] eal: add memory pre-allocation from existing files
  2021-10-11  8:56           ` [dpdk-dev] [PATCH v6 2/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
@ 2021-10-12 15:37             ` David Marchand
  2021-10-12 15:55               ` Dmitry Kozlyuk
  0 siblings, 1 reply; 48+ messages in thread
From: David Marchand @ 2021-10-12 15:37 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Viacheslav Ovsiienko, Anatoly Burakov

Hello Dmitry, Slava,

On Mon, Oct 11, 2021 at 10:57 AM Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com> wrote:
>
> From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
>
> The primary DPDK process launch might take a long time if the
> initially allocated memory is large. In practice, allocating 1 TB of
> memory over 1 GB hugepages on Linux takes tens of seconds. Fast restart
> is highly desired for some applications, and the launch delay presents
> a problem.
>
> The primary delay happens in this call trace:
>   rte_eal_init()
>     rte_eal_memory_init()
>       rte_eal_hugepage_init()
>         eal_dynmem_hugepage_init()
>           eal_memalloc_alloc_seg_bulk()
>             alloc_seg()
>               mmap()
>
> The largest part of the time spent in mmap() is filling the memory
> with zeros. The kernel does so to prevent data leakage from a process
> that was last using the page. However, in a controlled environment
> it may not be the issue, while performance is. (Linux-specific
> MAP_UNINITIALIZED flag allows mapping without clearing, but it is
> disabled in all popular distributions for the reason above.)
>
> It is proposed to add a new EAL option: --mem-file FILE1,FILE2,...
> to map hugepages "as is" from specified FILEs in hugetlbfs.
> Compared to using external memory for the task, the EAL option requires
> no change to application code, while allowing the administrator
> to control hugepage sizes and their NUMA affinity.
>
> Limitations of the feature:
>
> * Linux-specific (only Linux maps hugepages from files).
> * Incompatible with --legacy-mem (partially replaces it).
> * Incompatible with --single-file-segments
>   (--mem-file FILEs can contain as many segments as needed).
> * Incompatible with --in-memory (logically).
>
> A warning about possible security implications is printed
> when --mem-file is used.
>
> Until this patch, the DPDK allocator always cleared memory on freeing,
> so that it did not have to do that on allocation, while new memory
> was cleared by the kernel. When --mem-file is in use, DPDK clears memory
> after allocation in rte_zmalloc() and does not clear it on freeing.
> Effectively user trades fast startup for occasional allocation slowdown
> whenever it is absolutely necessary. When memory is recycled, it is
> cleared again, which is suboptimal per se, but saves the complication
> of memory management.
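
For illustration (the paths are hypothetical, not from the patch), an
invocation could look like:

	./dpdk-app --mem-file /mnt/huge1G/node0,/mnt/huge1G/node1 -- ...

where each FILE is a pre-created file in a hugetlbfs mount whose pages
are mapped as-is, without the kernel zero-filling them.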

I have some trouble figuring the need for the list of files.
Why not use a global knob --mem-clear-on-alloc for this behavior change?


>
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>


-- 
David Marchand


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 2/3] eal: add memory pre-allocation from existing files
  2021-10-12 15:37             ` David Marchand
@ 2021-10-12 15:55               ` Dmitry Kozlyuk
  2021-10-12 17:32                 ` David Marchand
  0 siblings, 1 reply; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-12 15:55 UTC (permalink / raw)
  To: David Marchand; +Cc: dev, Slava Ovsiienko, Anatoly Burakov

Hello David,

> I have some trouble figuring the need for the list of files.
> Why not use a global knob --mem-clear-on-alloc for this behavior change?

Moving memset() doesn't speed anything up, it's a forced step for the reasons below.
Currently, memory is cleared by the kernel when a page is mapped during an allocation.
This cannot be turned off in stock kernels. The issue is that initial allocations are longer
by the time needed to clear the pages, which is >90%. For the memory
intended for DMA this time is just wasted. If allocations are large,
application startup and restart take long. The only way to get hugepages
mapped without the kernel clearing them is to map existing files in
hugetlbfs. However, rte_zmalloc() needs to return clean memory, that's
why we move memset() there. Memory intended for DMA is just never
cleared this way. But memory freed and allocated again will be cleared
again, unfortunately.
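
A rough sketch of the resulting rte_zmalloc() behavior (simplified and
illustrative: the flag name is made up, and the real logic lives in
rte_malloc.c and malloc_elem.c):

	void *
	rte_zmalloc_socket(const char *type, size_t size, unsigned int align,
			int socket)
	{
		void *ptr = rte_malloc_socket(type, size, align, socket);

		/* With --mem-file, pages come from existing files and freed
		 * elements are no longer cleared, so clear on allocation. */
		if (ptr != NULL && mem_file_in_use) /* hypothetical flag */
			memset(ptr, 0, size);
		return ptr;
	}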

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 2/3] eal: add memory pre-allocation from existing files
  2021-10-12 15:55               ` Dmitry Kozlyuk
@ 2021-10-12 17:32                 ` David Marchand
  2021-10-12 21:09                   ` Dmitry Kozlyuk
  0 siblings, 1 reply; 48+ messages in thread
From: David Marchand @ 2021-10-12 17:32 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Slava Ovsiienko, Anatoly Burakov, Thomas Monjalon

On Tue, Oct 12, 2021 at 5:55 PM Dmitry Kozlyuk <dkozlyuk@nvidia.com> wrote:
> > I have some trouble figuring the need for the list of files.
> > Why not use a global knob --mem-clear-on-alloc for this behavior change?
>
> Moving memset() doesn't speed anything up, it's a forced step for the reasons below.
> Currently, memory is cleared by the kernel when a page is mapped during an allocation.
> This cannot be turned off in stock kernels. The issue is that initial allocations are longer
> by the time needed to clear the pages, which is >90%. For the memory
> intended for DMA this time is just wasted. If allocations are large,
> application startup and restart take long. The only way to get hugepages
> mapped without the kernel clearing them is to map existing files in
> hugetlbfs. However, rte_zmalloc() needs to return clean memory, that's
> why we move memset() there. Memory intended for DMA is just never
> cleared this way. But memory freed and allocated again will be cleared
> again, unfortunately.

Writing my limited understanding, please correct me.

The --mem-file that is proposed does:
- preallocate files which is something close to --socket-mem with the
following differences
  - --mem-file lets the user decide on DPDK hugepage file names, which I
think conflicts with --huge-dir and --file-prefix,
  - --mem-file lets the user decide on hugepage size, which I think could
be achieved with some --huge-dir option,
- bypasses unlink() of existing hugepage files, which I had overlooked
but which is the main pain point,
- enforces "clear on alloc" in rte_malloc/rte_free.


From this, I see two parts in this patch:
- faster restart, reusing hugepage files as is (combination of not
calling unlink() and doing "clear on alloc"),
  This part is interesting, and I think a single knob for this would be enough.
- finegrained control of hugepage files, but it has the drawback of
imposing primary/secondary run with the same options.
  The second part seems complex to configure. I see conflicts with
existing options, so it seems a good way to get caught up in the
carpet (sorry if it translates badly from French :p).


-- 
David Marchand


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 2/3] eal: add memory pre-allocation from existing files
  2021-10-12 17:32                 ` David Marchand
@ 2021-10-12 21:09                   ` Dmitry Kozlyuk
  2021-10-13 10:18                     ` David Marchand
  0 siblings, 1 reply; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-12 21:09 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, Slava Ovsiienko, Anatoly Burakov, NBU-Contact-Thomas Monjalon

> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: 12 October 2021 20:33
> To: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Cc: dev <dev@dpdk.org>; Slava Ovsiienko <viacheslavo@nvidia.com>; Anatoly
> Burakov <anatoly.burakov@intel.com>; NBU-Contact-Thomas Monjalon
> <thomas@monjalon.net>
> Subject: Re: [dpdk-dev] [PATCH v6 2/3] eal: add memory pre-allocation from
> existing files
> 
> On Tue, Oct 12, 2021 at 5:55 PM Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> wrote:
> > > I have some trouble figuring the need for the list of files.
> > > Why not use a global knob --mem-clear-on-alloc for this behavior
> change?
> >
> > Moving memset() doesn't speed anything up, it's a forced step for the
> reasons below.
> > Currently, memory is cleared by the kernel when a page is mapped during
> an allocation.
> > This cannot be turned off in stock kernels. The issue is that initial
> > allocations are longer by the time needed to clear the pages, which is
> >90%. For the memory intended for DMA this time is just wasted. If
> allocations are large, application startup and restart take long. The only
> way to get hugepages mapped without the kernel clearing them is to map
> existing files in hugetlbfs. However, rte_zmalloc() needs to return clean
> memory, that's why we move memset() there. Memory intended for DMA is just
> never cleared this way. But memory freed and allocated again will be
> cleared again, unfortunately.
> 
> Writing my limited understanding, please correct me.
> 
> The --mem-file that is proposed does:
> - preallocate files which is something close to --socket-mem with the
> following differences
>   - --mem-file lets the user decide on DPDK hugepage file names, which I
> think conflicts with --huge-dir and --file-prefix,
>   - --mem-file lets the user decide on hugepage size, which I think could
> be achieved with some --huge-dir option,

The comparison to --socket-mem is valid, because preallocated files form
the initial amount of memory allocated from the system. However, using
--mem-file does not preclude DPDK from allocating more memory according
to --huge-dir and --file-prefix when the application runs out of
preallocated blocks.

> - bypasses unlink() of existing hugepage files, which I had overlooked
> but which is the main pain point,
> - enforces "clear on alloc" in rte_malloc/rte_free.
> 
> 
> From this, I see two parts in this patch:
> - faster restart, reusing hugepage files as is (combination of not calling
> unlink() and doing "clear on alloc"),
>   This part is interesting, and I think a single knob for this would be
> enough.

In combination with the rte_extmem* API this knob would indeed allow
implementing the feature in the app. However, the drawback is that all
the logic to select hugepage size, NUMA, and names would need to be done
from the app, probably with its own options. OTOH, there are already
hugetlbfs and numactl to avoid apps duplicating this logic. Also, it's
not only the fast restart, but also the fast initial start on a prepared
system.
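
For comparison, roughly what an application would have to do by itself
with extmem (a simplified sketch: the path is hypothetical, error
handling and IOVA setup are omitted, and attaching the area to a malloc
heap via rte_malloc_heap_memory_add() is left out):

	int fd = open("/mnt/huge1G/node0", O_RDWR); /* pre-created file */
	size_t len = RTE_PGSIZE_1G;
	void *va = mmap(NULL, len, PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_POPULATE, fd, 0);
	/* Existing page contents are mapped as-is, no kernel zeroing. */
	rte_extmem_register(va, len, NULL, 0, RTE_PGSIZE_1G);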

> - finegrained control of hugepage files, but it has the drawback of
> imposing primary/secondary run with the same options.
>   The second part seems complex to configure. I see conflicts with
> existing options, so it seems a good way to get caught up in the carpet
> (sorry if it translates badly from French :p).

I don't see why synchronizing memory options is a big issue.
Primary and secondary processes are inherently interdependent.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 1/3] eal/linux: make hugetlbfs analysis reusable
  2021-10-11  8:56           ` [dpdk-dev] [PATCH v6 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
@ 2021-10-13  8:16             ` David Marchand
  2021-10-13  9:21               ` Dmitry Kozlyuk
  0 siblings, 1 reply; 48+ messages in thread
From: David Marchand @ 2021-10-13  8:16 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Viacheslav Ovsiienko, John Levon

Hello,

On Mon, Oct 11, 2021 at 10:57 AM Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com> wrote:
>
> get_hugepage_dir() searched for a hugetlbfs mount with a given page size
> using handcraft parsing of /proc/mounts and mixing traversal logic with
> selecting the needed entry. Separate code to enumerate hugetlbfs mounts
> to eal_hugepage_mount_walk() taking a callback that can inspect already
> parsed entries. Use mntent(3) API for parsing. This allows to reuse
> enumeration logic in subsequent patches.
>
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> Reviewed-by: John Levon <john.levon@nutanix.com>

As you probably noticed, I merged John's patch.
Could you rebase this series on the main branch please?

Two minor comments below:

> ---
>  lib/eal/linux/eal_hugepage_info.c | 153 +++++++++++++++++++-----------
>  lib/eal/linux/eal_hugepage_info.h |  39 ++++++++
>  2 files changed, 135 insertions(+), 57 deletions(-)
>  create mode 100644 lib/eal/linux/eal_hugepage_info.h
>
> diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
> index d97792cade..193282e779 100644
> --- a/lib/eal/linux/eal_hugepage_info.c
> +++ b/lib/eal/linux/eal_hugepage_info.c
> @@ -12,6 +12,7 @@
>  #include <stdio.h>
>  #include <fnmatch.h>
>  #include <inttypes.h>
> +#include <mntent.h>
>  #include <stdarg.h>
>  #include <unistd.h>
>  #include <errno.h>
> @@ -34,6 +35,7 @@
>  #include "eal_private.h"
>  #include "eal_internal_cfg.h"
>  #include "eal_hugepages.h"
> +#include "eal_hugepage_info.h"
>  #include "eal_filesystem.h"
>
>  static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
> @@ -195,73 +197,110 @@ get_default_hp_size(void)
>         return size;
>  }
>
> -static int
> -get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
> +int
> +eal_hugepage_mount_walk(eal_hugepage_mount_walk_cb *cb, void *cb_arg)
>  {
> -       enum proc_mount_fieldnames {
> -               DEVICE = 0,
> -               MOUNTPT,
> -               FSTYPE,
> -               OPTIONS,
> -               _FIELDNAME_MAX
> -       };
> -       static uint64_t default_size = 0;
> -       const char proc_mounts[] = "/proc/mounts";
> -       const char hugetlbfs_str[] = "hugetlbfs";
> -       const size_t htlbfs_str_len = sizeof(hugetlbfs_str) - 1;
> -       const char pagesize_opt[] = "pagesize=";
> -       const size_t pagesize_opt_len = sizeof(pagesize_opt) - 1;
> -       const char split_tok = ' ';
> -       char *splitstr[_FIELDNAME_MAX];
> -       char buf[BUFSIZ];
> -       int retval = -1;
> -       const struct internal_config *internal_conf =
> -               eal_get_internal_configuration();
> -
> -       FILE *fd = fopen(proc_mounts, "r");
> -       if (fd == NULL)
> -               rte_panic("Cannot open %s\n", proc_mounts);
> +       static const char PATH[] = "/proc/mounts";
> +       static const char OPTION[] = "pagesize";

Nit: please avoid PATH and OPTION as variable names.

All-uppercase words are usually for macros/defines in DPDK.
Plus, in PATH's case, it is a well-known shell variable.
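
Lower-case names would do, e.g. (illustrative only):

	static const char mounts_path[] = "/proc/mounts";
	static const char pagesize_opt[] = "pagesize";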


> +
> +       static uint64_t default_size;
> +
> +       FILE *f = NULL;
> +       struct mntent mntent;
> +       char strings[PATH_MAX];
> +       char *hugepage_sz_str;
> +       uint64_t hugepage_sz;
> +       bool stopped = false;
> +       int ret = -1;
> +
> +       f = setmntent(PATH, "r");
> +       if (f == NULL) {
> +               RTE_LOG(ERR, EAL, "%s(): setmntent(%s): %s\n",
> +                               __func__, PATH, strerror(errno));
> +               goto exit;

We are in a rather generic helper function.
Error messages should be logged by callers of this helper, because the
caller knows better what the impact of failing to list mountpoints is.
In the helper itself, this log should probably be info or debug level.

If you think this error-level log should be kept in the helper, can
you make it a bit higher level so that users understand what is wrong
and what actions should be done to fix the situation?


> +       }
>
>         if (default_size == 0)
>                 default_size = get_default_hp_size();
>
> -       while (fgets(buf, sizeof(buf), fd)){
> -               if (rte_strsplit(buf, sizeof(buf), splitstr, _FIELDNAME_MAX,
> -                               split_tok) != _FIELDNAME_MAX) {
> -                       RTE_LOG(ERR, EAL, "Error parsing %s\n", proc_mounts);
> -                       break; /* return NULL */
> +       ret = 0;
> +       while (getmntent_r(f, &mntent, strings, sizeof(strings)) != NULL) {
> +               if (strcmp(mntent.mnt_type, "hugetlbfs") != 0)
> +                       continue;
> +
> +               hugepage_sz_str = hasmntopt(&mntent, OPTION);
> +               if (hugepage_sz_str != NULL) {
> +                       hugepage_sz_str += strlen(OPTION) + 1; /* +1 for '=' */
> +                       hugepage_sz = rte_str_to_size(hugepage_sz_str);
> +                       if (hugepage_sz == 0) {
> +                               RTE_LOG(DEBUG, EAL, "Cannot parse hugepage size from '%s' for %s\n",
> +                                               mntent.mnt_opts, mntent.mnt_dir);
> +                               continue;
> +                       }
> +               } else {
> +                       RTE_LOG(DEBUG, EAL, "Hugepage filesystem at %s without %s option\n",
> +                                       mntent.mnt_dir, OPTION);
> +                       hugepage_sz = default_size;
>                 }
>
> -               /* we have a specified --huge-dir option, only examine that dir */
> -               if (internal_conf->hugepage_dir != NULL &&
> -                               strcmp(splitstr[MOUNTPT], internal_conf->hugepage_dir) != 0)
> -                       continue;
> +               if (cb(mntent.mnt_dir, hugepage_sz, cb_arg) != 0) {
> +                       stopped = true;
> +                       break;
> +               }
> +       }
>
> -               if (strncmp(splitstr[FSTYPE], hugetlbfs_str, htlbfs_str_len) == 0){
> -                       const char *pagesz_str = strstr(splitstr[OPTIONS], pagesize_opt);
> +       if (ferror(f) || (!stopped && !feof(f))) {
> +               RTE_LOG(ERR, EAL, "%s(): getmntent_r(): %s\n",
> +                               __func__, strerror(errno));

Idem.


> +               ret = -1;
> +               goto exit;
> +       }
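
For illustration, a caller-side sketch in the spirit of this suggestion
(the callback signature is inferred from the call site above, other
names are made up):

	struct find_ctx {
		uint64_t want_sz;
		char dir[PATH_MAX];
		bool found;
	};

	static int
	find_dir_cb(const char *path, uint64_t hugepage_sz, void *arg)
	{
		struct find_ctx *ctx = arg;

		if (hugepage_sz != ctx->want_sz)
			return 0; /* keep walking */
		strlcpy(ctx->dir, path, sizeof(ctx->dir));
		ctx->found = true;
		return 1; /* non-zero return stops the walk */
	}

	struct find_ctx ctx = { .want_sz = RTE_PGSIZE_1G };

	/* The caller knows the impact and logs at the right level: */
	if (eal_hugepage_mount_walk(find_dir_cb, &ctx) < 0)
		RTE_LOG(ERR, EAL, "Cannot list hugetlbfs mount points\n");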


-- 
David Marchand


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 1/3] eal/linux: make hugetlbfs analysis reusable
  2021-10-13  8:16             ` David Marchand
@ 2021-10-13  9:21               ` Dmitry Kozlyuk
  0 siblings, 0 replies; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-10-13  9:21 UTC (permalink / raw)
  To: David Marchand; +Cc: dev, Slava Ovsiienko, John Levon

Hello,

> [...]
> As you probably noticed, I merged John's patch.
> Could you rebase this series on the main branch please?

Of course. Would you accept, though, that for now I just keep the tests
John has added? With the new helper, the directory selection logic can be
tested in isolation from parsing and mkdir(), but we have little time
until RC1. Tests can be improved anytime.

> [...]
> We are in a rather generic helper function.
> Error messages should be logged by callers of this helper, because the
> caller knows better what the impact of failing to list mountpoints is.
> In the helper itself, this log should probably be info or debug level.
> 
> If you think this error-level log should be kept in the helper, can you
> make it a bit higher level so that users understand what is wrong and what
> actions should be done to fix the situation?

No, I agree that DEBUG is better for system errors.
I don't often add generic code, but for Windows
we log all system errors as DEBUG and higher-level ones as ERR.

Will send v7, thanks.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 2/3] eal: add memory pre-allocation from existing files
  2021-10-12 21:09                   ` Dmitry Kozlyuk
@ 2021-10-13 10:18                     ` David Marchand
  2021-11-08 14:27                       ` Dmitry Kozlyuk
  0 siblings, 1 reply; 48+ messages in thread
From: David Marchand @ 2021-10-13 10:18 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Slava Ovsiienko, Anatoly Burakov, NBU-Contact-Thomas Monjalon

On Tue, Oct 12, 2021 at 11:09 PM Dmitry Kozlyuk <dkozlyuk@nvidia.com> wrote:
> > From this, I see two parts in this patch:
> > - faster restart, reusing hugepage files as is (combination of not calling
> > unlink() and doing "clear on alloc"),
> >   This part is interesting, and I think a single knob for this would be
> > enough.
>
> In combination with the rte_extmem* API this knob would indeed allow
> implementing the feature in the app. However, the drawback is that all
> the logic to select hugepage size, NUMA, and names would need to be done
> from the app, probably with its own options. OTOH, there are already
> hugetlbfs and numactl to avoid apps duplicating this logic. Also, it's
> not only the fast restart, but also the fast initial start on a prepared
> system.

How do you "prepare" a system?


>
> > - finegrained control of hugepage files, but it has the drawback of
> > imposing primary/secondary run with the same options.
> >   The second part seems complex to configure. I see conflicts with
> > existing options, so it seems a good way to get caught up in the carpet
> > (sorry if it translates badly from French :p).
>
> I don't see why synchronizing memory options is a big issue.

We have too many options for the memory subsystem.

I mentioned --socket-mem, --huge-dir, --file-prefix.
But there is also --huge-unlink, --no-shconf, --in-memory,
--legacy-mem, --single-file-segments, --match-allocations and
--socket-limit.
Some of those do part of the job, others are incompatible with this
new option and probably some are orthogonal.

Sure, we can add a new one that prepares your toast and coffee and wakes
up the kids (that's progress!).

Maybe you can provide an example on how this is used?

Thanks.

-- 
David Marchand


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/3] app/test: add allocator performance autotest
  2021-10-12 14:48               ` Dmitry Kozlyuk
@ 2021-10-15 13:47                 ` Aaron Conole
  0 siblings, 0 replies; 48+ messages in thread
From: Aaron Conole @ 2021-10-15 13:47 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Slava Ovsiienko, Anatoly Burakov, David Marchand

Dmitry Kozlyuk <dkozlyuk@nvidia.com> writes:

>> This isn't really a test, imho.  There are no assert()s.  How does a developer who
>> tries to fix a bug in this area know what is acceptable?
>> 
>> Please switch the printf()s to RTE_LOG calls, and add some RTE_TEST_ASSERT
>> calls to enforce some time range at the least.
>> Otherwise this test will not really be checking the performance - just giving a
>> report somewhere.
>
> I just followed DPDK naming convention of test_xxx_perf.c / xxx_perf_autotest.
> They all should really be called benchmarks.

Agreed - they are not really tests and it makes me wonder why we label
them as such.  It will be confusing.  A developer who runs the perf test
suite will just see "OK" everywhere and assume that all the tests are
working - even if they introduce a performance regression.

Maybe it would make sense to relabel them (perf-benchmark or something),
so that there isn't an expectation that we have PASS / FAIL.  That's a
larger scope than this patch, though.

> They help developers to see how the code changes affect performance.
> I don't understand how this "perf test" is not in line with existing ones
> and where it should properly reside.
>
> I'm not totally opposed to replacing printf() with RTE_LOG(), but all
> other tests use printf(). The drawback of the change is inconsistency;
> what is the benefit?

RTE_LOG is captured in other places as well.  printf(), depending on how
the test app is run, might not go anywhere.  Also, at least the ipsec
perf test starts introducing RTE_LOG() calls - although even there they
use printf() for reports.

I guess it's very confusing to call all of these 'tests' since they
aren't.

But that's an aside, and I guess this is consistent with existing
_perf.c files.

>> Also, I don't understand the way the memset test works here.  You do one large
>> memset at the very beginning and then extrapolate the time it would take.  Does
>> that hold any value or should we do a memset in each iteration and enforce a
>> scaled time?
>
> As explained above, we don't need to enforce anything; we want a report.
> I've never seen a case with one NUMA node where memset() time would not scale linearly,
> but benchmarks should be precise so I'll change it to memset()'ing the allocated area, thanks. 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 2/3] eal: add memory pre-allocation from existing files
  2021-10-13 10:18                     ` David Marchand
@ 2021-11-08 14:27                       ` Dmitry Kozlyuk
  2021-11-08 17:45                         ` David Marchand
  0 siblings, 1 reply; 48+ messages in thread
From: Dmitry Kozlyuk @ 2021-11-08 14:27 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, Slava Ovsiienko, Anatoly Burakov, NBU-Contact-Thomas Monjalon

Hi David,

> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
[...]
> > > - finegrained control of hugepage files, but it has the drawback of
> > > imposing primary/secondary run with the same options.
> > >   The second part seems complex to configure. I see conflicts with
> > > existing options, so it seems a good way to get caught up in the
> > > carpet (sorry if it translates badly from French :p).
> >
> > I don't see why synchronizing memory options is a big issue.
> 
> We have too many options for the memory subsystem.
> 
> I mentioned --socket-mem, --huge-dir, --file-prefix.
> But there is also --huge-unlink, --no-shconf, --in-memory, --legacy-mem,
> --single-file-segments, --match-allocations and --socket-limit.
> Some of those do part of the job, others are incompatible with this new
> option and probably some are orthogonal.
> 
> Sure, we can add a new one that prepares your toast and coffee and wakes
> up the kids (that's progress!).
>
> Maybe you can provide an example on how this is used?

Sorry for the late reply.

After more consideration offline with Thomas
we concluded that the --mem-file option is indeed too intrusive.
I'm going to propose a new solution for the slow restart issue for 22.02,
probably with a knob like you proposed,
only not just changing when the memory is zeroed,
but most importantly allowing EAL to reuse hugepages.
So that in the end the usage would be as follows,
and if it's a restart, memory clearing would be bypassed:

	./dpdk-app --huge-reuse -- ...

Refactoring and benchmark patches may still be useful,
so review efforts were hopefully not in vain.
Thank you for asking the right questions!

FWIW, I agree that memory options should be cleaned up independently.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 2/3] eal: add memory pre-allocation from existing files
  2021-11-08 14:27                       ` Dmitry Kozlyuk
@ 2021-11-08 17:45                         ` David Marchand
  0 siblings, 0 replies; 48+ messages in thread
From: David Marchand @ 2021-11-08 17:45 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Slava Ovsiienko, Anatoly Burakov, NBU-Contact-Thomas Monjalon

On Mon, Nov 8, 2021 at 3:27 PM Dmitry Kozlyuk <dkozlyuk@nvidia.com> wrote:
>
> Hi David,
>
> > -----Original Message-----
> > From: David Marchand <david.marchand@redhat.com>
> [...]
> > > > - finegrained control of hugepage files, but it has the drawback of
> > > > imposing primary/secondary run with the same options.
> > > >   The second part seems complex to configure. I see conflicts with
> > > > existing options, so it seems a good way to get caught up in the
> > > > carpet (sorry if it translates badly from French :p).
> > >
> > > I don't see why synchronizing memory options is a big issue.
> >
> > We have too many options for the memory subsystem.
> >
> > I mentioned --socket-mem, --huge-dir, --file-prefix.
> > But there is also --huge-unlink, --no-shconf, --in-memory, --legacy-mem,
> > --single-file-segments, --match-allocations and --socket-limit.
> > Some of those do part of the job, others are incompatible with this new
> > option and probably some are orthogonal.
> >
> > Sure, we can add a new one that prepares your toast and coffee and wakes
> > up the kids (that's progress!).
> >
> > Maybe you can provide an example on how this is used?
>
> Sorry for the late reply.

No problem.

>
> After more consideration offline with Thomas
> we concluded that the --mem-file option is indeed too intrusive.
> I'm going to propose a new solution for the slow restart issue for 22.02,
> probably with a knob like you proposed,
> only not just changing when the memory is zeroed,
> but most importantly allowing EAL to reuse hugepages.
> So that in the end the usage would be as follows,
> and if it's a restart, memory clearing would be bypassed:
>
>         ./dpdk-app --huge-reuse -- ...
>
> Refactoring and benchmark patches may still be useful,
> so review efforts were hopefully not in vain.
> Thank you for asking the right questions!
>
> FWIW, I agree that memory options should be cleaned up independently.

Looking forward to 22.02 :-).
Thanks Dmitry.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2021-11-08 17:45 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-05 12:49 [dpdk-dev] [PATCH 21.11 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
2021-07-05 12:49 ` [dpdk-dev] [PATCH 21.11 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
2021-07-05 12:49 ` [dpdk-dev] [PATCH 21.11 2/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
2021-07-05 12:49 ` [dpdk-dev] [PATCH 21.11 3/3] app/test: add allocator performance autotest Dmitry Kozlyuk
2021-07-16 11:08 ` [dpdk-dev] [PATCH 21.11 v2 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
2021-07-16 11:08   ` [dpdk-dev] [PATCH 21.11 v2 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
2021-07-16 11:08   ` [dpdk-dev] [PATCH 21.11 v2 2/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
2021-07-16 11:08   ` [dpdk-dev] [PATCH 21.11 v2 3/3] app/test: add allocator performance autotest Dmitry Kozlyuk
2021-08-09  9:45   ` [dpdk-dev] [PATCH 21.11 v2 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
2021-08-30  8:21     ` Dmitry Kozlyuk
2021-09-14 10:34   ` [dpdk-dev] [PATCH v3 " Dmitry Kozlyuk
2021-09-14 10:34     ` [dpdk-dev] [PATCH v3 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
2021-09-14 12:48       ` John Levon
2021-09-14 12:57         ` Dmitry Kozlyuk
2021-09-16 12:08       ` John Levon
2021-09-14 10:34     ` [dpdk-dev] [PATCH v3 2/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
2021-09-14 10:34     ` [dpdk-dev] [PATCH v3 3/3] app/test: add allocator performance autotest Dmitry Kozlyuk
2021-09-20 12:52     ` [dpdk-dev] [PATCH v4 0/3] eal: add memory pre-allocation from existing files dkozlyuk
2021-09-20 12:53       ` [dpdk-dev] [PATCH v4 1/3] eal/linux: make hugetlbfs analysis reusable dkozlyuk
2021-09-20 12:53       ` [dpdk-dev] [PATCH v4 2/3] eal: add memory pre-allocation from existing files dkozlyuk
2021-09-20 12:53       ` [dpdk-dev] [PATCH v4 3/3] app/test: add allocator performance autotest dkozlyuk
2021-09-21  8:16       ` [dpdk-dev] [PATCH v5 0/3] eal: add memory pre-allocation from existing files dkozlyuk
2021-09-21  8:16         ` [dpdk-dev] [PATCH v5 1/3] eal/linux: make hugetlbfs analysis reusable dkozlyuk
2021-09-22 13:52           ` John Levon
2021-10-05 17:36           ` Thomas Monjalon
2021-10-08 15:33             ` John Levon
2021-10-08 15:50               ` Dmitry Kozlyuk
2021-09-21  8:16         ` [dpdk-dev] [PATCH v5 2/3] eal: add memory pre-allocation from existing files dkozlyuk
2021-09-21  8:16         ` [dpdk-dev] [PATCH v5 3/3] app/test: add allocator performance autotest dkozlyuk
2021-10-11  8:56         ` [dpdk-dev] [PATCH v6 0/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
2021-10-11  8:56           ` [dpdk-dev] [PATCH v6 1/3] eal/linux: make hugetlbfs analysis reusable Dmitry Kozlyuk
2021-10-13  8:16             ` David Marchand
2021-10-13  9:21               ` Dmitry Kozlyuk
2021-10-11  8:56           ` [dpdk-dev] [PATCH v6 2/3] eal: add memory pre-allocation from existing files Dmitry Kozlyuk
2021-10-12 15:37             ` David Marchand
2021-10-12 15:55               ` Dmitry Kozlyuk
2021-10-12 17:32                 ` David Marchand
2021-10-12 21:09                   ` Dmitry Kozlyuk
2021-10-13 10:18                     ` David Marchand
2021-11-08 14:27                       ` Dmitry Kozlyuk
2021-11-08 17:45                         ` David Marchand
2021-10-11  8:56           ` [dpdk-dev] [PATCH v6 3/3] app/test: add allocator performance autotest Dmitry Kozlyuk
2021-10-12 13:53             ` Aaron Conole
2021-10-12 14:48               ` Dmitry Kozlyuk
2021-10-15 13:47                 ` Aaron Conole
2021-10-11 18:52           ` [dpdk-dev] [PATCH v6 0/3] eal: add memory pre-allocation from existing files Thomas Monjalon
2021-10-11 21:12             ` [dpdk-dev] [dpdk-ci] " Lincoln Lavoie
2021-10-12  6:54               ` Thomas Monjalon
