DPDK patches and discussions
 help / color / mirror / Atom feed
* [RFC PATCH 0/6] Fast restart with many hugepages
@ 2021-12-30 14:37 Dmitry Kozlyuk
  2021-12-30 14:37 ` [RFC PATCH 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
                   ` (6 more replies)
  0 siblings, 7 replies; 53+ messages in thread
From: Dmitry Kozlyuk @ 2021-12-30 14:37 UTC (permalink / raw)
  To: dev
  Cc: Anatoly Burakov, Viacheslav Ovsiienko, David Marchand,
	Thomas Monjalon, Lior Margalit

This patchset is a new design and implementation of [1].

# Problem Statement

Large allocations that involve mapping new hugepages are slow.
This is problematic, for example, in the following use case.
A single-process application allocates ~1TB of mempools at startup.
Sometimes the app needs to restart as quick as possible.
Allocating the hugepages anew takes as long as 15 seconds,
while the new process could just pick up all the memory
left by the old one (reinitializing the contents as needed).

Almost all of mmap(2) time spent in the kernel
is clearing the memory, i.e. filling it with zeros.
This is done if a file in hugetlbfs is mapped
for the first time system-wide, i.e. a hugepage is committed
to prevent data leaks from the previous users of the same hugepage.
For example, mapping 32 GB from a new file may take 2.16 seconds,
while mapping the same pages again takes only 0.3 ms.
Security put aside, e.g. when the environment is controlled,
this effort is wasted for the memory intended for DMA,
because its content will be overwritten anyway.

Linux EAL explicitly removes hugetlbfs files at initialization
and before mapping to force the kernel clear the memory.
This allows the memory allocator to clean memory on only on freeing.

# Solution

Add a new mode allowing EAL to remap existing hugepage files.
While it is intended to make restarts faster in the first place,
it makes any startup faster except the cold one
(with no existing files).

It is the administrator who accepts security risks
implied by reusing hugepages.
The new mode is an opt-in and a warning is logged.

The feature is Linux-only as it is related
to mapping hugepages from files which only Linux does.
It is inherently incompatible with --in-memory,
for --huge-unlink see below.

There is formally no breakage of API contract,
but there is a behavior change in the new mode:
rte_malloc*() and rte_memzone_reserve*() may return dirty memory
(previously they were returning clean memory from free heap elements).
Their contract has always explicitly allowed this,
but still there may be users relying on the traditional behavior.
Such users will need to fix their code to use the new mode.

# Implementation

## User Interface

There is --huge-unlink switch in the same area to remove hugepage files
before mapping them. It is infeasible to use with the new mode,
because the point is to keep hugepage files for fast future restarts.
Extend --huge-unlink option to represent only valid combinations:

* --huge-unlink=existing OR no option (for compatibility):
  unlink files at initialization
  and before opening them as a precaution.

* --huge-unlink=always OR just --huge-unlink (for compatibility):
  same as above + unlink created files before mapping.

* --huge-unlink=never:
  the new mode, do not unlink hugepages files, reuse them.

This option was always Linux-only, but it is kept as common
in case there are users who expect it to be a no-op on other systems.
(Adding a separate --huge-reuse option was also considered,
but there is no obvious benefit and more combinations to test.)

## EAL

If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
so that the memory allocator may clear the memory if need be.
See patch 4/6 description for details.

The memory manager tracks whether an element is clean or dirty.
If rte_zmalloc*() allocates from a dirty element,
the memory is cleared before handling it to the user.
On freeing, the allocator joins adjacent free elements,
but in the new mode it may not be feasible to clear the free memory
if the joint element is dirty (contains dirty parts).
In any case, memory will be cleared only once,
either on freeing or on allocation. See patch 2/6 for details.
Patch 6/6 adds a benchmark to see how time is distributed
between allocation and freeing in different modes.

Besides clearing memory, each mmap() call takes some time which adds up.
EAL does one call per hugepage, 1024 calls for 1 TB may take ~300 ms.
It does so in order to be able to unmap the segments one by one.
However, segments from initial allocation (-m) are never unmapped.
Ideally, initial allocation should take one mmap() call per memory type
(i.e. per NUMA node per page size) if --single-file-segments is used.
This further optimization is not implemented in current version.

[1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozlyuk@nvidia.com/

Dmitry Kozlyuk (6):
  doc: add hugepage mapping details
  mem: add dirty malloc element support
  eal: refactor --huge-unlink storage
  eal/linux: allow hugepage file reuse
  eal: allow hugepage file reuse with --huge-unlink
  app/test: add allocator performance benchmark

 app/test/meson.build                          |   2 +
 app/test/test_malloc_perf.c                   | 174 ++++++++++++++++++
 doc/guides/linux_gsg/linux_eal_parameters.rst |  21 ++-
 .../prog_guide/env_abstraction_layer.rst      |  94 +++++++++-
 doc/guides/rel_notes/release_22_03.rst        |   7 +
 lib/eal/common/eal_common_options.c           |  46 ++++-
 lib/eal/common/eal_internal_cfg.h             |  10 +-
 lib/eal/common/malloc_elem.c                  |  22 ++-
 lib/eal/common/malloc_elem.h                  |  11 +-
 lib/eal/common/malloc_heap.c                  |  18 +-
 lib/eal/common/rte_malloc.c                   |  21 ++-
 lib/eal/include/rte_memory.h                  |   8 +-
 lib/eal/linux/eal_hugepage_info.c             |  59 ++++--
 lib/eal/linux/eal_memalloc.c                  | 164 ++++++++++-------
 lib/eal/linux/eal_memory.c                    |   2 +-
 15 files changed, 537 insertions(+), 122 deletions(-)
 create mode 100644 app/test/test_malloc_perf.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [RFC PATCH 1/6] doc: add hugepage mapping details
  2021-12-30 14:37 [RFC PATCH 0/6] Fast restart with many hugepages Dmitry Kozlyuk
@ 2021-12-30 14:37 ` Dmitry Kozlyuk
  2021-12-30 14:37 ` [RFC PATCH 2/6] mem: add dirty malloc element support Dmitry Kozlyuk
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 53+ messages in thread
From: Dmitry Kozlyuk @ 2021-12-30 14:37 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov, Thomas Monjalon

Hugepage mapping is a layer of EAL malloc builds upon.
There were implicit references to its details,
like mentions of segment file descriptors,
but no explicit description of its modes and operation.
Add an overview of mechanics used on ech supported OS.
Convert memory management subsections from list items
to level 4 headers: they are big and important enough.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 .../prog_guide/env_abstraction_layer.rst      | 85 +++++++++++++++++--
 1 file changed, 76 insertions(+), 9 deletions(-)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 29f6fefc48..6cddb86467 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -86,7 +86,7 @@ See chapter
 Memory Mapping Discovery and Memory Reservation
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The allocation of large contiguous physical memory is done using the hugetlbfs kernel filesystem.
+The allocation of large contiguous physical memory is done using hugepages.
 The EAL provides an API to reserve named memory zones in this contiguous memory.
 The physical address of the reserved memory for that memory zone is also returned to the user by the memory zone reservation API.
 
@@ -95,11 +95,12 @@ and legacy mode. Both modes are explained below.
 
 .. note::
 
-    Memory reservations done using the APIs provided by rte_malloc are also backed by pages from the hugetlbfs filesystem.
+    Memory reservations done using the APIs provided by rte_malloc are also backed by hugepages.
 
-+ Dynamic memory mode
+Dynamic Memory Mode
+^^^^^^^^^^^^^^^^^^^
 
-Currently, this mode is only supported on Linux.
+Currently, this mode is only supported on Linux and Windows.
 
 In this mode, usage of hugepages by DPDK application will grow and shrink based
 on application's requests. Any memory allocation through ``rte_malloc()``,
@@ -155,7 +156,8 @@ of memory that can be used by DPDK application.
     :ref:`Multi-process Support <Multi-process_Support>` for more details about
     DPDK IPC.
 
-+ Legacy memory mode
+Legacy Memory Mode
+^^^^^^^^^^^^^^^^^^
 
 This mode is enabled by specifying ``--legacy-mem`` command-line switch to the
 EAL. This switch will have no effect on FreeBSD as FreeBSD only supports
@@ -168,7 +170,8 @@ not allow acquiring or releasing hugepages from the system at runtime.
 If neither ``-m`` nor ``--socket-mem`` were specified, the entire available
 hugepage memory will be preallocated.
 
-+ Hugepage allocation matching
+Hugepage Allocation Matching
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This behavior is enabled by specifying the ``--match-allocations`` command-line
 switch to the EAL. This switch is Linux-only and not supported with
@@ -182,7 +185,8 @@ matching can be used by these types of applications to satisfy both of these
 requirements. This can result in some increased memory usage which is
 very dependent on the memory allocation patterns of the application.
 
-+ 32-bit support
+32-bit Support
+^^^^^^^^^^^^^^
 
 Additional restrictions are present when running in 32-bit mode. In dynamic
 memory mode, by default maximum of 2 gigabytes of VA space will be preallocated,
@@ -192,7 +196,8 @@ used.
 In legacy mode, VA space will only be preallocated for segments that were
 requested (plus padding, to keep IOVA-contiguousness).
 
-+ Maximum amount of memory
+Maximum Amount of Memory
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 All possible virtual memory space that can ever be used for hugepage mapping in
 a DPDK process is preallocated at startup, thereby placing an upper limit on how
@@ -222,7 +227,68 @@ Normally, these options do not need to be changed.
     can later be mapped into that preallocated VA space (if dynamic memory mode
     is enabled), and can optionally be mapped into it at startup.
 
-+ Segment file descriptors
+Hugepage Mapping
+^^^^^^^^^^^^^^^^
+
+Below is an overview of methods used for each OS to obtain hugepages,
+explaining why certain limitations and options exist in EAL.
+See the user guide for a specific OS for configuration details.
+
+FreeBSD uses ``contigmem`` kernel module
+to reserve a fixed number of hugepages at system start,
+which are mapped by EAL at initialization using a specific ``sysctl()``.
+
+Windows EAL allocates hugepages from the OS as needed using Win32 API,
+so available amount depends on the system load.
+It uses ``virt2phys`` kernel module to obtain physical addresses,
+unless running in IOVA-as-VA mode (e.g. forced with ``--iova-mode=va``).
+
+Linux implements a variety of methods:
+
+* mapping each hugepage from its own file in hugetlbfs;
+* mapping multiple hugepages from a shared file in hugetlbfs;
+* anonymous mapping.
+
+Mapping hugepages from files in hugetlbfs is essential for multi-process,
+because secondary processes need to map the same hugepages.
+EAL creates files like ``rtemap_0``
+in directories specified with ``--huge-dir`` option
+(or in the mount point for a specific hugepage size).
+The ``rtemap_`` prefix can be changed using ``--file-prefix``.
+This may be needed for running multiple primary processes
+that share a hugetlbfs mount point.
+Each backing file by default corresponds to one hugepage,
+it is opened and locked for the entire time the hugepage is used.
+See :ref:`segment-file-descriptors` section
+on how the number of open backing file descriptors can be reduced.
+
+Backing files may persist after the corresponding hugepage is freed
+and even after the application terminates,
+reducing the number of hugepages available to other processes.
+EAL removes existing files at startup
+and can remove newly created files before mapping them with ``--huge-unlink``.
+However, since it disables multi-process anyway,
+using anonymous mapping (``--in-memory``) is recommended instead.
+
+:ref:`EAL memory allocator <malloc>` relies on hugepages being zero-filled.
+Hugepages are cleared by the kernel when a file in hugetlbfs or its part
+is mapped for the first time system-wide
+to prevent data leaks from previous users of the same hugepage.
+EAL ensures this behavior by removing existing backing files at startup
+and by recreating them before opening for mapping (as a precaution).
+
+Anonymous mapping does not allow multi-process architecture,
+but it is free of filename conflicts and leftover files on hugetlbfs.
+If memfd_create(2) is supported both at build and run time,
+DPDK memory manager can provide file descriptors for memory segments,
+which are required for VirtIO with vhost-user backend.
+This means open file descriptor issues may also affect this mode,
+with the same solution.
+
+.. _segment-file-descriptors:
+
+Segment File Descriptors
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 On Linux, in most cases, EAL will store segment file descriptors in EAL. This
 can become a problem when using smaller page sizes due to underlying limitations
@@ -731,6 +797,7 @@ We expect only 50% of CPU spend on packet IO.
     echo 100000 > pkt_io/cpu.cfs_period_us
     echo  50000 > pkt_io/cpu.cfs_quota_us
 
+.. _malloc:
 
 Malloc
 ------
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [RFC PATCH 2/6] mem: add dirty malloc element support
  2021-12-30 14:37 [RFC PATCH 0/6] Fast restart with many hugepages Dmitry Kozlyuk
  2021-12-30 14:37 ` [RFC PATCH 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
@ 2021-12-30 14:37 ` Dmitry Kozlyuk
  2021-12-30 14:37 ` [RFC PATCH 3/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 53+ messages in thread
From: Dmitry Kozlyuk @ 2021-12-30 14:37 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov

EAL malloc layer assumed all free elements content
is filled with zeros ("clean"), as opposed to uninitialized ("dirty").
This assumption was ensured in two ways:
1. EAL memalloc layer always returned clean memory.
2. Freed memory was cleared before returning into the heap.

Clearing the memory can be as slow as around 14 GiB/s.
To save doing so, memalloc layer is allowed to return dirty memory.
Such segments being marked with RTE_MEMSEG_FLAG_DIRTY.
The allocator tracks elements that contain dirty memory
using the new flag in the element header.
When clean memory is requested via rte_zmalloc*()
and the suitable element is dirty, it is cleared on allocation.
When memory is deallocated, the freed element is joined
with adjacent free elements, and the dirty flag is updated:

    dirty + freed + dirty = dirty  =>  no need to clean
            freed + dirty = dirty      the freed memory

    clean + freed + clean = clean  =>  freed memory
    clean + freed         = clean      must be cleared
            freed + clean = clean
            freed         = clean

As a result, memory is either cleared on free, as before,
or it will be cleared on allocation if need be, but never twice.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 lib/eal/common/malloc_elem.c | 22 +++++++++++++++++++---
 lib/eal/common/malloc_elem.h | 11 +++++++++--
 lib/eal/common/malloc_heap.c | 18 ++++++++++++------
 lib/eal/common/rte_malloc.c  | 21 ++++++++++++++-------
 lib/eal/include/rte_memory.h |  8 ++++++--
 5 files changed, 60 insertions(+), 20 deletions(-)

diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
index bdd20a162e..e04e0890fb 100644
--- a/lib/eal/common/malloc_elem.c
+++ b/lib/eal/common/malloc_elem.c
@@ -129,7 +129,7 @@ malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
 void
 malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
 		struct rte_memseg_list *msl, size_t size,
-		struct malloc_elem *orig_elem, size_t orig_size)
+		struct malloc_elem *orig_elem, size_t orig_size, bool dirty)
 {
 	elem->heap = heap;
 	elem->msl = msl;
@@ -137,6 +137,7 @@ malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
 	elem->next = NULL;
 	memset(&elem->free_list, 0, sizeof(elem->free_list));
 	elem->state = ELEM_FREE;
+	elem->dirty = dirty;
 	elem->size = size;
 	elem->pad = 0;
 	elem->orig_elem = orig_elem;
@@ -300,7 +301,7 @@ split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt)
 	const size_t new_elem_size = elem->size - old_elem_size;
 
 	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size,
-			 elem->orig_elem, elem->orig_size);
+			elem->orig_elem, elem->orig_size, elem->dirty);
 	split_pt->prev = elem;
 	split_pt->next = next_elem;
 	if (next_elem)
@@ -506,6 +507,7 @@ join_elem(struct malloc_elem *elem1, struct malloc_elem *elem2)
 	else
 		elem1->heap->last = elem1;
 	elem1->next = next;
+	elem1->dirty |= elem2->dirty;
 	if (elem1->pad) {
 		struct malloc_elem *inner = RTE_PTR_ADD(elem1, elem1->pad);
 		inner->size = elem1->size - elem1->pad;
@@ -579,6 +581,14 @@ malloc_elem_free(struct malloc_elem *elem)
 	ptr = RTE_PTR_ADD(elem, MALLOC_ELEM_HEADER_LEN);
 	data_len = elem->size - MALLOC_ELEM_OVERHEAD;
 
+	/*
+	 * Consider the element clean for the purposes of joining.
+	 * If both neighbors are clean or non-existent,
+	 * the joint element will be clean,
+	 * which means the memory should be cleared.
+	 * There is no need to clear the memory if the joint element is dirty.
+	 */
+	elem->dirty = false;
 	elem = malloc_elem_join_adjacent_free(elem);
 
 	malloc_elem_free_list_insert(elem);
@@ -588,8 +598,14 @@ malloc_elem_free(struct malloc_elem *elem)
 	/* decrease heap's count of allocated elements */
 	elem->heap->alloc_count--;
 
-	/* poison memory */
+#ifndef RTE_MALLOC_DEBUG
+	/* Normally clear the memory when needed. */
+	if (!elem->dirty)
+		memset(ptr, 0, data_len);
+#else
+	/* Always poison the memory in debug mode. */
 	memset(ptr, MALLOC_POISON, data_len);
+#endif
 
 	return elem;
 }
diff --git a/lib/eal/common/malloc_elem.h b/lib/eal/common/malloc_elem.h
index 15d8ba7af2..f2aa98821b 100644
--- a/lib/eal/common/malloc_elem.h
+++ b/lib/eal/common/malloc_elem.h
@@ -27,7 +27,13 @@ struct malloc_elem {
 	LIST_ENTRY(malloc_elem) free_list;
 	/**< list of free elements in heap */
 	struct rte_memseg_list *msl;
-	volatile enum elem_state state;
+	/** Element state, @c dirty and @c pad validity depends on it. */
+	/* An extra bit is needed to represent enum elem_state as signed int. */
+	enum elem_state state : 3;
+	/** If state == ELEM_FREE: the memory is not filled with zeroes. */
+	uint32_t dirty : 1;
+	/** Reserved for future use. */
+	uint32_t reserved : 28;
 	uint32_t pad;
 	size_t size;
 	struct malloc_elem *orig_elem;
@@ -320,7 +326,8 @@ malloc_elem_init(struct malloc_elem *elem,
 		struct rte_memseg_list *msl,
 		size_t size,
 		struct malloc_elem *orig_elem,
-		size_t orig_size);
+		size_t orig_size,
+		bool dirty);
 
 void
 malloc_elem_insert(struct malloc_elem *elem);
diff --git a/lib/eal/common/malloc_heap.c b/lib/eal/common/malloc_heap.c
index 55aad2711b..24080fc473 100644
--- a/lib/eal/common/malloc_heap.c
+++ b/lib/eal/common/malloc_heap.c
@@ -93,11 +93,11 @@ malloc_socket_to_heap_id(unsigned int socket_id)
  */
 static struct malloc_elem *
 malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl,
-		void *start, size_t len)
+		void *start, size_t len, bool dirty)
 {
 	struct malloc_elem *elem = start;
 
-	malloc_elem_init(elem, heap, msl, len, elem, len);
+	malloc_elem_init(elem, heap, msl, len, elem, len, dirty);
 
 	malloc_elem_insert(elem);
 
@@ -135,7 +135,8 @@ malloc_add_seg(const struct rte_memseg_list *msl,
 
 	found_msl = &mcfg->memsegs[msl_idx];
 
-	malloc_heap_add_memory(heap, found_msl, ms->addr, len);
+	malloc_heap_add_memory(heap, found_msl, ms->addr, len,
+			ms->flags & RTE_MEMSEG_FLAG_DIRTY);
 
 	heap->total_size += len;
 
@@ -303,7 +304,8 @@ alloc_pages_on_heap(struct malloc_heap *heap, uint64_t pg_sz, size_t elt_size,
 	struct rte_memseg_list *msl;
 	struct malloc_elem *elem = NULL;
 	size_t alloc_sz;
-	int allocd_pages;
+	int allocd_pages, i;
+	bool dirty = false;
 	void *ret, *map_addr;
 
 	alloc_sz = (size_t)pg_sz * n_segs;
@@ -372,8 +374,12 @@ alloc_pages_on_heap(struct malloc_heap *heap, uint64_t pg_sz, size_t elt_size,
 		goto fail;
 	}
 
+	/* Element is dirty if it contains at least one dirty page. */
+	for (i = 0; i < allocd_pages; i++)
+		dirty |= ms[i]->flags & RTE_MEMSEG_FLAG_DIRTY;
+
 	/* add newly minted memsegs to malloc heap */
-	elem = malloc_heap_add_memory(heap, msl, map_addr, alloc_sz);
+	elem = malloc_heap_add_memory(heap, msl, map_addr, alloc_sz, dirty);
 
 	/* try once more, as now we have allocated new memory */
 	ret = find_suitable_element(heap, elt_size, flags, align, bound,
@@ -1260,7 +1266,7 @@ malloc_heap_add_external_memory(struct malloc_heap *heap,
 	memset(msl->base_va, 0, msl->len);
 
 	/* now, add newly minted memory to the malloc heap */
-	malloc_heap_add_memory(heap, msl, msl->base_va, msl->len);
+	malloc_heap_add_memory(heap, msl, msl->base_va, msl->len, false);
 
 	heap->total_size += msl->len;
 
diff --git a/lib/eal/common/rte_malloc.c b/lib/eal/common/rte_malloc.c
index d0bec26920..71a3f7ecb4 100644
--- a/lib/eal/common/rte_malloc.c
+++ b/lib/eal/common/rte_malloc.c
@@ -115,15 +115,22 @@ rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket)
 {
 	void *ptr = rte_malloc_socket(type, size, align, socket);
 
+	if (ptr != NULL) {
+		struct malloc_elem *elem = malloc_elem_from_data(ptr);
+
+		if (elem->dirty) {
+			memset(ptr, 0, size);
+		} else {
 #ifdef RTE_MALLOC_DEBUG
-	/*
-	 * If DEBUG is enabled, then freed memory is marked with poison
-	 * value and set to zero on allocation.
-	 * If DEBUG is not enabled then  memory is already zeroed.
-	 */
-	if (ptr != NULL)
-		memset(ptr, 0, size);
+			/*
+			 * If DEBUG is enabled, then freed memory is marked
+			 * with a poison value and set to zero on allocation.
+			 * If DEBUG is disabled then memory is already zeroed.
+			 */
+			memset(ptr, 0, size);
 #endif
+		}
+	}
 
 	rte_eal_trace_mem_zmalloc(type, size, align, socket, ptr);
 	return ptr;
diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h
index 6d018629ae..d76e7ba780 100644
--- a/lib/eal/include/rte_memory.h
+++ b/lib/eal/include/rte_memory.h
@@ -19,6 +19,7 @@
 extern "C" {
 #endif
 
+#include <rte_bitops.h>
 #include <rte_common.h>
 #include <rte_compat.h>
 #include <rte_config.h>
@@ -37,11 +38,14 @@ extern "C" {
 
 #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
 
+/** Prevent this segment from being freed back to the OS. */
+#define RTE_MEMSEG_FLAG_DO_NOT_FREE RTE_BIT32(0)
+/** This segment is not fileld with zeros. */
+#define RTE_MEMSEG_FLAG_DIRTY RTE_BIT32(1)
+
 /**
  * Physical memory segment descriptor.
  */
-#define RTE_MEMSEG_FLAG_DO_NOT_FREE (1 << 0)
-/**< Prevent this segment from being freed back to the OS. */
 struct rte_memseg {
 	rte_iova_t iova;            /**< Start IO address. */
 	RTE_STD_C11
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [RFC PATCH 3/6] eal: refactor --huge-unlink storage
  2021-12-30 14:37 [RFC PATCH 0/6] Fast restart with many hugepages Dmitry Kozlyuk
  2021-12-30 14:37 ` [RFC PATCH 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
  2021-12-30 14:37 ` [RFC PATCH 2/6] mem: add dirty malloc element support Dmitry Kozlyuk
@ 2021-12-30 14:37 ` Dmitry Kozlyuk
  2021-12-30 14:37 ` [RFC PATCH 4/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 53+ messages in thread
From: Dmitry Kozlyuk @ 2021-12-30 14:37 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov

In preparation to extend --huge-unlink option semantics
refactor how it is stored in the internal configuration.
It makes future changes more isolated.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 lib/eal/common/eal_common_options.c | 9 +++++----
 lib/eal/common/eal_internal_cfg.h   | 8 +++++++-
 lib/eal/linux/eal_memalloc.c        | 7 ++++---
 lib/eal/linux/eal_memory.c          | 2 +-
 4 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index 1cfdd75f3b..7520ebda8e 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -1737,7 +1737,7 @@ eal_parse_common_option(int opt, const char *optarg,
 
 	/* long options */
 	case OPT_HUGE_UNLINK_NUM:
-		conf->hugepage_unlink = 1;
+		conf->hugepage_file.unlink_before_mapping = true;
 		break;
 
 	case OPT_NO_HUGE_NUM:
@@ -1766,7 +1766,7 @@ eal_parse_common_option(int opt, const char *optarg,
 		conf->in_memory = 1;
 		/* in-memory is a superset of noshconf and huge-unlink */
 		conf->no_shconf = 1;
-		conf->hugepage_unlink = 1;
+		conf->hugepage_file.unlink_before_mapping = true;
 		break;
 
 	case OPT_PROC_TYPE_NUM:
@@ -2050,7 +2050,8 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"be specified together with --"OPT_NO_HUGE"\n");
 		return -1;
 	}
-	if (internal_cfg->no_hugetlbfs && internal_cfg->hugepage_unlink &&
+	if (internal_cfg->no_hugetlbfs &&
+			internal_cfg->hugepage_file.unlink_before_mapping &&
 			!internal_cfg->in_memory) {
 		RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
 			"be specified together with --"OPT_NO_HUGE"\n");
@@ -2061,7 +2062,7 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			" is only supported in non-legacy memory mode\n");
 	}
 	if (internal_cfg->single_file_segments &&
-			internal_cfg->hugepage_unlink &&
+			internal_cfg->hugepage_file.unlink_before_mapping &&
 			!internal_cfg->in_memory) {
 		RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE_SEGMENTS" is "
 			"not compatible with --"OPT_HUGE_UNLINK"\n");
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index d6c0470eb8..b5e6942578 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -40,6 +40,12 @@ struct simd_bitwidth {
 	uint16_t bitwidth; /**< bitwidth value */
 };
 
+/** Hugepage backing files discipline. */
+struct hugepage_file_discipline {
+	/** Unlink files before mapping them to leave no trace in hugetlbfs. */
+	bool unlink_before_mapping;
+};
+
 /**
  * internal configuration
  */
@@ -48,7 +54,7 @@ struct internal_config {
 	volatile unsigned force_nchannel; /**< force number of channels */
 	volatile unsigned force_nrank;    /**< force number of ranks */
 	volatile unsigned no_hugetlbfs;   /**< true to disable hugetlbfs */
-	unsigned hugepage_unlink;         /**< true to unlink backing files */
+	struct hugepage_file_discipline hugepage_file;
 	volatile unsigned no_pci;         /**< true to disable PCI */
 	volatile unsigned no_hpet;        /**< true to disable HPET */
 	volatile unsigned vmware_tsc_map; /**< true to use VMware TSC mapping
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 337f2bc739..abbe605e49 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -564,7 +564,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 					__func__, strerror(errno));
 				goto resized;
 			}
-			if (internal_conf->hugepage_unlink &&
+			if (internal_conf->hugepage_file.unlink_before_mapping &&
 					!internal_conf->in_memory) {
 				if (unlink(path)) {
 					RTE_LOG(DEBUG, EAL, "%s(): unlink() failed: %s\n",
@@ -697,7 +697,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 			close_hugefile(fd, path, list_idx);
 	} else {
 		/* only remove file if we can take out a write lock */
-		if (internal_conf->hugepage_unlink == 0 &&
+		if (!internal_conf->hugepage_file.unlink_before_mapping &&
 				internal_conf->in_memory == 0 &&
 				lock(fd, LOCK_EX) == 1)
 			unlink(path);
@@ -756,7 +756,8 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		/* if we're able to take out a write lock, we're the last one
 		 * holding onto this page.
 		 */
-		if (!internal_conf->in_memory && !internal_conf->hugepage_unlink) {
+		if (!internal_conf->in_memory &&
+				internal_conf->hugepage_file.unlink_before_mapping) {
 			ret = lock(fd, LOCK_EX);
 			if (ret >= 0) {
 				/* no one else is using this page */
diff --git a/lib/eal/linux/eal_memory.c b/lib/eal/linux/eal_memory.c
index 03a4f2dd2d..83eec078a4 100644
--- a/lib/eal/linux/eal_memory.c
+++ b/lib/eal/linux/eal_memory.c
@@ -1428,7 +1428,7 @@ eal_legacy_hugepage_init(void)
 	}
 
 	/* free the hugepage backing files */
-	if (internal_conf->hugepage_unlink &&
+	if (internal_conf->hugepage_file.unlink_before_mapping &&
 		unlink_hugepage_files(tmp_hp, internal_conf->num_hugepage_sizes) < 0) {
 		RTE_LOG(ERR, EAL, "Unlinking hugepage files failed!\n");
 		goto fail;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [RFC PATCH 4/6] eal/linux: allow hugepage file reuse
  2021-12-30 14:37 [RFC PATCH 0/6] Fast restart with many hugepages Dmitry Kozlyuk
                   ` (2 preceding siblings ...)
  2021-12-30 14:37 ` [RFC PATCH 3/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
@ 2021-12-30 14:37 ` Dmitry Kozlyuk
  2021-12-30 14:48 ` [RFC PATCH 5/6] eal: allow hugepage file reuse with --huge-unlink Dmitry Kozlyuk
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 53+ messages in thread
From: Dmitry Kozlyuk @ 2021-12-30 14:37 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov

Linux EAL ensured that mapped hugepages are clean
by always mapping from newly created files:
existing hugepage backing files were always removed.
In this case, the kernel clears the page to prevent data leaks,
because the mapped memory may contain leftover data
from the previous process that was using this memory.
Clearing takes the bulk of the time spent in mmap(2),
increasing EAL initialization time.

Introduce a mode to keep existing files and reuse them
in order to speed up initial memory allocation in EAL.
Hugepages mapped from such files may contain data
left by the previous process that used this memory,
so RTE_MEMSEG_FLAG_DIRTY is set for their segments.
If multiple hugepages are mapped from the same file:
1. When fallocate(2) is used, all memory mapped from this file
   is considered dirty, because it is unknown
   which parts of the file are holes.
2. When ftruncate(3) is used, memory mapped from this file
   is considered dirty unless the file is extended
   to create a new mapping, which implies clean memory.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 lib/eal/common/eal_internal_cfg.h |   2 +
 lib/eal/linux/eal_hugepage_info.c |  59 +++++++----
 lib/eal/linux/eal_memalloc.c      | 157 ++++++++++++++++++------------
 3 files changed, 140 insertions(+), 78 deletions(-)

diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index b5e6942578..3685aa7c52 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -44,6 +44,8 @@ struct simd_bitwidth {
 struct hugepage_file_discipline {
 	/** Unlink files before mapping them to leave no trace in hugetlbfs. */
 	bool unlink_before_mapping;
+	/** Reuse existing files, never delete or re-create them. */
+	bool keep_existing;
 };
 
 /**
diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index 9fb0e968db..55debdedf0 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -84,7 +84,7 @@ static int get_hp_sysfs_value(const char *subdir, const char *file, unsigned lon
 /* this function is only called from eal_hugepage_info_init which itself
  * is only called from a primary process */
 static uint32_t
-get_num_hugepages(const char *subdir, size_t sz)
+get_num_hugepages(const char *subdir, size_t sz, unsigned int reusable_pages)
 {
 	unsigned long resv_pages, num_pages, over_pages, surplus_pages;
 	const char *nr_hp_file = "free_hugepages";
@@ -116,7 +116,7 @@ get_num_hugepages(const char *subdir, size_t sz)
 	else
 		over_pages = 0;
 
-	if (num_pages == 0 && over_pages == 0)
+	if (num_pages == 0 && over_pages == 0 && reusable_pages)
 		RTE_LOG(WARNING, EAL, "No available %zu kB hugepages reported\n",
 				sz >> 10);
 
@@ -124,6 +124,10 @@ get_num_hugepages(const char *subdir, size_t sz)
 	if (num_pages < over_pages) /* overflow */
 		num_pages = UINT32_MAX;
 
+	num_pages += reusable_pages;
+	if (num_pages < reusable_pages) /* overflow */
+		num_pages = UINT32_MAX;
+
 	/* we want to return a uint32_t and more than this looks suspicious
 	 * anyway ... */
 	if (num_pages > UINT32_MAX)
@@ -298,12 +302,12 @@ get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
 }
 
 /*
- * Clear the hugepage directory of whatever hugepage files
- * there are. Checks if the file is locked (i.e.
- * if it's in use by another DPDK process).
+ * Search the hugepage directory for whatever hugepage files there are.
+ * Check if the file is in use by another DPDK process.
+ * If not, either remove it, or keep and count the page as reusable.
  */
 static int
-clear_hugedir(const char * hugedir)
+clear_hugedir(const char *hugedir, bool keep, unsigned int *reusable_pages)
 {
 	DIR *dir;
 	struct dirent *dirent;
@@ -346,8 +350,12 @@ clear_hugedir(const char * hugedir)
 		lck_result = flock(fd, LOCK_EX | LOCK_NB);
 
 		/* if lock succeeds, remove the file */
-		if (lck_result != -1)
-			unlinkat(dir_fd, dirent->d_name, 0);
+		if (lck_result != -1) {
+			if (keep)
+				(*reusable_pages)++;
+			else
+				unlinkat(dir_fd, dirent->d_name, 0);
+		}
 		close (fd);
 		dirent = readdir(dir);
 	}
@@ -375,7 +383,8 @@ compare_hpi(const void *a, const void *b)
 }
 
 static void
-calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
+calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent,
+		unsigned int reusable_pages)
 {
 	uint64_t total_pages = 0;
 	unsigned int i;
@@ -388,8 +397,15 @@ calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
 	 * in one socket and sorting them later
 	 */
 	total_pages = 0;
-	/* we also don't want to do this for legacy init */
-	if (!internal_conf->legacy_mem)
+
+	/*
+	 * We also don't want to do this for legacy init.
+	 * When there are hugepage files to reuse it is unknown
+	 * what NUMA node the pages are on.
+	 * This could be determined by mapping,
+	 * but it is precisely what hugepage file reuse is trying to avoid.
+	 */
+	if (!internal_conf->legacy_mem && reusable_pages == 0)
 		for (i = 0; i < rte_socket_count(); i++) {
 			int socket = rte_socket_id_by_idx(i);
 			unsigned int num_pages =
@@ -405,7 +421,7 @@ calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
 	 */
 	if (total_pages == 0) {
 		hpi->num_pages[0] = get_num_hugepages(dirent->d_name,
-				hpi->hugepage_sz);
+				hpi->hugepage_sz, reusable_pages);
 
 #ifndef RTE_ARCH_64
 		/* for 32-bit systems, limit number of hugepages to
@@ -421,6 +437,7 @@ hugepage_info_init(void)
 {	const char dirent_start_text[] = "hugepages-";
 	const size_t dirent_start_len = sizeof(dirent_start_text) - 1;
 	unsigned int i, num_sizes = 0;
+	unsigned int reusable_pages;
 	DIR *dir;
 	struct dirent *dirent;
 	struct internal_config *internal_conf =
@@ -454,7 +471,7 @@ hugepage_info_init(void)
 			uint32_t num_pages;
 
 			num_pages = get_num_hugepages(dirent->d_name,
-					hpi->hugepage_sz);
+					hpi->hugepage_sz, 0);
 			if (num_pages > 0)
 				RTE_LOG(NOTICE, EAL,
 					"%" PRIu32 " hugepages of size "
@@ -473,7 +490,7 @@ hugepage_info_init(void)
 					"hugepages of size %" PRIu64 " bytes "
 					"will be allocated anonymously\n",
 					hpi->hugepage_sz);
-				calc_num_pages(hpi, dirent);
+				calc_num_pages(hpi, dirent, 0);
 				num_sizes++;
 			}
 #endif
@@ -489,11 +506,17 @@ hugepage_info_init(void)
 				"Failed to lock hugepage directory!\n");
 			break;
 		}
-		/* clear out the hugepages dir from unused pages */
-		if (clear_hugedir(hpi->hugedir) == -1)
-			break;
 
-		calc_num_pages(hpi, dirent);
+		/*
+		 * Check for existing hugepage files and either remove them
+		 * or count how many of them can be reused.
+		 */
+		reusable_pages = 0;
+		if (clear_hugedir(hpi->hugedir,
+				internal_conf->hugepage_file.keep_existing,
+				&reusable_pages) == -1)
+			break;
+		calc_num_pages(hpi, dirent, reusable_pages);
 
 		num_sizes++;
 	}
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index abbe605e49..cbd7c9cbee 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -287,12 +287,19 @@ get_seg_memfd(struct hugepage_info *hi __rte_unused,
 
 static int
 get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
-		unsigned int list_idx, unsigned int seg_idx)
+		unsigned int list_idx, unsigned int seg_idx,
+		bool *dirty)
 {
 	int fd;
+	int *out_fd;
+	struct stat st;
+	int ret;
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
+	if (dirty != NULL)
+		*dirty = false;
+
 	/* for in-memory mode, we only make it here when we're sure we support
 	 * memfd, and this is a special case.
 	 */
@@ -300,66 +307,68 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 		return get_seg_memfd(hi, list_idx, seg_idx);
 
 	if (internal_conf->single_file_segments) {
-		/* create a hugepage file path */
+		out_fd = &fd_list[list_idx].memseg_list_fd;
 		eal_get_hugefile_path(path, buflen, hi->hugedir, list_idx);
-
-		fd = fd_list[list_idx].memseg_list_fd;
-
-		if (fd < 0) {
-			fd = open(path, O_CREAT | O_RDWR, 0600);
-			if (fd < 0) {
-				RTE_LOG(ERR, EAL, "%s(): open failed: %s\n",
-					__func__, strerror(errno));
-				return -1;
-			}
-			/* take out a read lock and keep it indefinitely */
-			if (lock(fd, LOCK_SH) < 0) {
-				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
-					__func__, strerror(errno));
-				close(fd);
-				return -1;
-			}
-			fd_list[list_idx].memseg_list_fd = fd;
-		}
 	} else {
-		/* create a hugepage file path */
+		out_fd = &fd_list[list_idx].fds[seg_idx];
 		eal_get_hugefile_path(path, buflen, hi->hugedir,
 				list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);
+	}
+	fd = *out_fd;
+	if (fd >= 0)
+		return fd;
 
-		fd = fd_list[list_idx].fds[seg_idx];
-
-		if (fd < 0) {
-			/* A primary process is the only one creating these
-			 * files. If there is a leftover that was not cleaned
-			 * by clear_hugedir(), we must *now* make sure to drop
-			 * the file or we will remap old stuff while the rest
-			 * of the code is built on the assumption that a new
-			 * page is clean.
-			 */
-			if (rte_eal_process_type() == RTE_PROC_PRIMARY &&
-					unlink(path) == -1 &&
-					errno != ENOENT) {
+	/*
+	 * The kernel clears a hugepage only when it is mapped
+	 * from a particular file for the first time.
+	 * If the file already exists, mapped will be the old
+	 * content of the hugepages. If the memory manager
+	 * assumes all mapped pages to be clean,
+	 * the file must be removed and created anew.
+	 * Otherwise the primary caller must be notified
+	 * that mapped pages will be dirty (secondary callers
+	 * receive the segment state from the primary one).
+	 * When multiple hugepages are mapped from the same file,
+	 * whether they will be dirty depends on the part that is mapped.
+	 *
+	 * There is no TOCTOU between stat() and unlink()/open()
+	 * because the hugepage directory is locked.
+	 */
+	if (!internal_conf->single_file_segments) {
+		ret = stat(path, &st);
+		if (ret < 0 && errno != ENOENT) {
+			RTE_LOG(DEBUG, EAL, "%s(): stat() for '%s' failed: %s\n",
+				__func__, path, strerror(errno));
+			return -1;
+		}
+		if (rte_eal_process_type() == RTE_PROC_PRIMARY && ret == 0) {
+			if (internal_conf->hugepage_file.keep_existing &&
+					dirty != NULL) {
+				*dirty = true;
+			/* coverity[toctou] */
+			} else if (unlink(path) < 0) {
 				RTE_LOG(DEBUG, EAL, "%s(): could not remove '%s': %s\n",
 					__func__, path, strerror(errno));
 				return -1;
 			}
-
-			fd = open(path, O_CREAT | O_RDWR, 0600);
-			if (fd < 0) {
-				RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n",
-					__func__, strerror(errno));
-				return -1;
-			}
-			/* take out a read lock */
-			if (lock(fd, LOCK_SH) < 0) {
-				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
-					__func__, strerror(errno));
-				close(fd);
-				return -1;
-			}
-			fd_list[list_idx].fds[seg_idx] = fd;
 		}
 	}
+
+	/* coverity[toctou] */
+	fd = open(path, O_CREAT | O_RDWR, 0600);
+	if (fd < 0) {
+		RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n",
+			__func__, strerror(errno));
+		return -1;
+	}
+	/* take out a read lock */
+	if (lock(fd, LOCK_SH) < 0) {
+		RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
+			__func__, strerror(errno));
+		close(fd);
+		return -1;
+	}
+	*out_fd = fd;
 	return fd;
 }
 
@@ -385,8 +394,10 @@ resize_hugefile_in_memory(int fd, uint64_t fa_offset,
 
 static int
 resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
-		bool grow)
+		bool grow, bool *dirty)
 {
+	const struct internal_config *internal_conf =
+			eal_get_internal_configuration();
 	bool again = false;
 
 	do {
@@ -405,6 +416,8 @@ resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
 			uint64_t cur_size = get_file_size(fd);
 
 			/* fallocate isn't supported, fall back to ftruncate */
+			if (dirty != NULL)
+				*dirty = new_size <= cur_size;
 			if (new_size > cur_size &&
 					ftruncate(fd, new_size) < 0) {
 				RTE_LOG(DEBUG, EAL, "%s(): ftruncate() failed: %s\n",
@@ -447,8 +460,17 @@ resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
 						strerror(errno));
 					return -1;
 				}
-			} else
+			} else {
 				fallocate_supported = 1;
+				/*
+				 * It is unknown which portions of an existing
+				 * hugepage file were allocated previously,
+				 * so all pages within the file are considered
+				 * dirty, unless the file is a fresh one.
+				 */
+				if (dirty != NULL)
+					*dirty = internal_conf->hugepage_file.keep_existing;
+			}
 		}
 	} while (again);
 
@@ -475,7 +497,8 @@ close_hugefile(int fd, char *path, int list_idx)
 }
 
 static int
-resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow)
+resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow,
+		bool *dirty)
 {
 	/* in-memory mode is a special case, because we can be sure that
 	 * fallocate() is supported.
@@ -483,12 +506,15 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow)
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	if (internal_conf->in_memory)
+	if (internal_conf->in_memory) {
+		if (dirty != NULL)
+			*dirty = false;
 		return resize_hugefile_in_memory(fd, fa_offset,
 				page_sz, grow);
+	}
 
 	return resize_hugefile_in_filesystem(fd, fa_offset, page_sz,
-				grow);
+			grow, dirty);
 }
 
 static int
@@ -505,6 +531,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	char path[PATH_MAX];
 	int ret = 0;
 	int fd;
+	bool dirty;
 	size_t alloc_sz;
 	int flags;
 	void *new_addr;
@@ -534,6 +561,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 
 		pagesz_flag = pagesz_flags(alloc_sz);
 		fd = -1;
+		dirty = false;
 		mmap_flags = in_memory_flags | pagesz_flag;
 
 		/* single-file segments codepath will never be active
@@ -544,7 +572,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		map_offset = 0;
 	} else {
 		/* takes out a read lock on segment or segment list */
-		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx,
+				&dirty);
 		if (fd < 0) {
 			RTE_LOG(ERR, EAL, "Couldn't get fd on hugepage file\n");
 			return -1;
@@ -552,7 +581,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 
 		if (internal_conf->single_file_segments) {
 			map_offset = seg_idx * alloc_sz;
-			ret = resize_hugefile(fd, map_offset, alloc_sz, true);
+			ret = resize_hugefile(fd, map_offset, alloc_sz, true,
+					&dirty);
 			if (ret < 0)
 				goto resized;
 
@@ -662,6 +692,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	ms->nrank = rte_memory_get_nrank();
 	ms->iova = iova;
 	ms->socket_id = socket_id;
+	ms->flags = dirty ? RTE_MEMSEG_FLAG_DIRTY : 0;
 
 	return 0;
 
@@ -689,7 +720,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		return -1;
 
 	if (internal_conf->single_file_segments) {
-		resize_hugefile(fd, map_offset, alloc_sz, false);
+		resize_hugefile(fd, map_offset, alloc_sz, false, NULL);
 		/* ignore failure, can't make it any worse */
 
 		/* if refcount is at zero, close the file */
@@ -739,13 +770,13 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	 * segment and thus drop the lock on original fd, but hugepage dir is
 	 * now locked so we can take out another one without races.
 	 */
-	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx, NULL);
 	if (fd < 0)
 		return -1;
 
 	if (internal_conf->single_file_segments) {
 		map_offset = seg_idx * ms->len;
-		if (resize_hugefile(fd, map_offset, ms->len, false))
+		if (resize_hugefile(fd, map_offset, ms->len, false, NULL))
 			return -1;
 
 		if (--(fd_list[list_idx].count) == 0)
@@ -1743,6 +1774,12 @@ eal_memalloc_init(void)
 			RTE_LOG(ERR, EAL, "Using anonymous memory is not supported\n");
 			return -1;
 		}
+		/* safety net, should be impossible to configure */
+		if (internal_conf->hugepage_file.unlink_before_mapping &&
+				internal_conf->hugepage_file.keep_existing) {
+			RTE_LOG(ERR, EAL, "Unable both to keep existing hugepage files and to unlink them.\n");
+			return -1;
+		}
 	}
 
 	/* initialize all of the fd lists */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [RFC PATCH 5/6] eal: allow hugepage file reuse with --huge-unlink
  2021-12-30 14:37 [RFC PATCH 0/6] Fast restart with many hugepages Dmitry Kozlyuk
                   ` (3 preceding siblings ...)
  2021-12-30 14:37 ` [RFC PATCH 4/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
@ 2021-12-30 14:48 ` Dmitry Kozlyuk
  2021-12-30 14:49 ` [RFC PATCH 6/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
  2022-01-17  8:07 ` [PATCH v1 0/6] Fast restart with many hugepages Dmitry Kozlyuk
  6 siblings, 0 replies; 53+ messages in thread
From: Dmitry Kozlyuk @ 2021-12-30 14:48 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov

Expose Linux EAL ability to reuse existing hugepage files
via --huge-unlink=never switch.
Default behavior is unchanged, it can also be specified
using --huge-unlink=existing for consistency.
Old --huge-unlink switch is kept,
it is an alias for --huge-unlink=always.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 doc/guides/linux_gsg/linux_eal_parameters.rst | 21 ++++++++--
 .../prog_guide/env_abstraction_layer.rst      |  9 +++++
 doc/guides/rel_notes/release_22_03.rst        |  7 ++++
 lib/eal/common/eal_common_options.c           | 39 +++++++++++++++++--
 4 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst
index 74df2611b5..64cd73b497 100644
--- a/doc/guides/linux_gsg/linux_eal_parameters.rst
+++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
@@ -84,10 +84,23 @@ Memory-related options
     Use specified hugetlbfs directory instead of autodetected ones. This can be
     a sub-directory within a hugetlbfs mountpoint.
 
-*   ``--huge-unlink``
-
-    Unlink hugepage files after creating them (implies no secondary process
-    support).
+*   ``--huge-unlink[=existing|always|never]``
+
+    No ``--huge-unlink`` option or ``--huge-unlink=existing`` is the default:
+    existing hugepage files are removed and re-created
+    to ensure the kernel clears the memory and prevents any data leaks.
+
+    With ``--huge-unlink`` (no value) or ``--huge-unlink=always``,
+    hugepage files are also removed after creating them,
+    so that the application leaves no files in hugetlbfs.
+    This mode implies no multi-process support.
+
+    When ``--huge-unlink=never`` is specified, existing hugepage files
+    are not removed neither before nor after mapping them.
+    This makes restart faster by saving time to clear memory at initialization,
+    but it may slow down zeroed allocations later.
+    Reused hugepages can contain data from previous processes that used them,
+    which may be a security concern.
 
 *   ``--match-allocations``
 
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 6cddb86467..d8940f5e2e 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -277,6 +277,15 @@ to prevent data leaks from previous users of the same hugepage.
 EAL ensures this behavior by removing existing backing files at startup
 and by recreating them before opening for mapping (as a precaution).
 
+One expection is ``--huge-unlink=never`` mode.
+It is used to speed up EAL initialization, usually on application restart.
+Clearing memory constitutes more than 95% of hugepage mapping time.
+EAL can save it by remapping existing backing files
+with all the data left in the mapped hugepages ("dirty" memory).
+Such segments are marked with ``RTE_MEMSEG_FLAG_DIRTY``.
+Memory allocator detects dirty segments handles them accordingly,
+in particular, it clears memory requested with ``rte_zmalloc*()``.
+
 Anonymous mapping does not allow multi-process architecture,
 but it is free of filename conflicts and leftover files on hugetlbfs.
 If memfd_create(2) is supported both at build and run time,
diff --git a/doc/guides/rel_notes/release_22_03.rst b/doc/guides/rel_notes/release_22_03.rst
index 6d99d1eaa9..0b882362cf 100644
--- a/doc/guides/rel_notes/release_22_03.rst
+++ b/doc/guides/rel_notes/release_22_03.rst
@@ -55,6 +55,13 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added ability to reuse hugepages in Linux.**
+
+  It is possible to reuse files in hugetlbfs to speed up hugepage mapping,
+  which may be useful for fast restart and large allocations.
+  The new mode is activated with ``--huge-unlink=never``
+  and has security implications, refer to the user and programmer guides.
+
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index 7520ebda8e..905a7769bd 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -74,7 +74,7 @@ eal_long_options[] = {
 	{OPT_FILE_PREFIX,       1, NULL, OPT_FILE_PREFIX_NUM      },
 	{OPT_HELP,              0, NULL, OPT_HELP_NUM             },
 	{OPT_HUGE_DIR,          1, NULL, OPT_HUGE_DIR_NUM         },
-	{OPT_HUGE_UNLINK,       0, NULL, OPT_HUGE_UNLINK_NUM      },
+	{OPT_HUGE_UNLINK,       2, NULL, OPT_HUGE_UNLINK_NUM      },
 	{OPT_IOVA_MODE,	        1, NULL, OPT_IOVA_MODE_NUM        },
 	{OPT_LCORES,            1, NULL, OPT_LCORES_NUM           },
 	{OPT_LOG_LEVEL,         1, NULL, OPT_LOG_LEVEL_NUM        },
@@ -1596,6 +1596,28 @@ available_cores(void)
 	return str;
 }
 
+#define HUGE_UNLINK_NEVER "never"
+
+static int
+eal_parse_huge_unlink(const char *arg, struct hugepage_file_discipline *out)
+{
+	if (arg == NULL || strcmp(arg, "always") == 0) {
+		out->unlink_before_mapping = true;
+		return 0;
+	}
+	if (strcmp(arg, "existing") == 0) {
+		/* same as not specifying the option */
+		return 0;
+	}
+	if (strcmp(arg, HUGE_UNLINK_NEVER) == 0) {
+		RTE_LOG(WARNING, EAL, "Using --"OPT_HUGE_UNLINK"="
+			HUGE_UNLINK_NEVER" may create data leaks.\n");
+		out->keep_existing = true;
+		return 0;
+	}
+	return -1;
+}
+
 int
 eal_parse_common_option(int opt, const char *optarg,
 			struct internal_config *conf)
@@ -1737,7 +1759,10 @@ eal_parse_common_option(int opt, const char *optarg,
 
 	/* long options */
 	case OPT_HUGE_UNLINK_NUM:
-		conf->hugepage_file.unlink_before_mapping = true;
+		if (eal_parse_huge_unlink(optarg, &conf->hugepage_file) < 0) {
+			RTE_LOG(ERR, EAL, "invalid --"OPT_HUGE_UNLINK" option\n");
+			return -1;
+		}
 		break;
 
 	case OPT_NO_HUGE_NUM:
@@ -2068,6 +2093,12 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"not compatible with --"OPT_HUGE_UNLINK"\n");
 		return -1;
 	}
+	if (internal_cfg->hugepage_file.keep_existing &&
+			internal_cfg->in_memory) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_IN_MEMORY" is not compatible "
+			"with --"OPT_HUGE_UNLINK"="HUGE_UNLINK_NEVER"\n");
+		return -1;
+	}
 	if (internal_cfg->legacy_mem &&
 			internal_cfg->in_memory) {
 		RTE_LOG(ERR, EAL, "Option --"OPT_LEGACY_MEM" is not compatible "
@@ -2200,7 +2231,9 @@ eal_common_usage(void)
 	       "  --"OPT_NO_TELEMETRY"   Disable telemetry support\n"
 	       "  --"OPT_FORCE_MAX_SIMD_BITWIDTH" Force the max SIMD bitwidth\n"
 	       "\nEAL options for DEBUG use only:\n"
-	       "  --"OPT_HUGE_UNLINK"       Unlink hugepage files after init\n"
+	       "  --"OPT_HUGE_UNLINK"[=existing|always|never]\n"
+	       "                      When to unlink files in hugetlbfs\n"
+	       "                      ('existing' by default, no value means 'always')\n"
 	       "  --"OPT_NO_HUGE"           Use malloc instead of hugetlbfs\n"
 	       "  --"OPT_NO_PCI"            Disable PCI\n"
 	       "  --"OPT_NO_HPET"           Disable HPET\n"
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [RFC PATCH 6/6] app/test: add allocator performance benchmark
  2021-12-30 14:37 [RFC PATCH 0/6] Fast restart with many hugepages Dmitry Kozlyuk
                   ` (4 preceding siblings ...)
  2021-12-30 14:48 ` [RFC PATCH 5/6] eal: allow hugepage file reuse with --huge-unlink Dmitry Kozlyuk
@ 2021-12-30 14:49 ` Dmitry Kozlyuk
  2022-01-17  8:07 ` [PATCH v1 0/6] Fast restart with many hugepages Dmitry Kozlyuk
  6 siblings, 0 replies; 53+ messages in thread
From: Dmitry Kozlyuk @ 2021-12-30 14:49 UTC (permalink / raw)
  To: dev; +Cc: Aaron Conole, Viacheslav Ovsiienko

Memory allocator performance is crucial to applications that deal
with large amount of memory or allocate frequently. DPDK allocator
performance is affected by EAL options, API used and, at least,
allocation size. New autotest is intended to be run with different
EAL options. It measures performance with a range of sizes
for dirrerent APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.

Work distribution between allocation and deallocation depends on EAL
options. The test prints both times and total time to ease comparison.

Memory can be filled with zeroes at different points of allocation path,
but it always takes considerable fraction of overall timing. This is why
the test measures filling speed and prints how long clearing takes
for each size as a reference (for rte_memzone_reserve estimations
are printed).

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 app/test/meson.build        |   2 +
 app/test/test_malloc_perf.c | 174 ++++++++++++++++++++++++++++++++++++
 2 files changed, 176 insertions(+)
 create mode 100644 app/test/test_malloc_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 2b480adfba..899034fc2a 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -88,6 +88,7 @@ test_sources = files(
         'test_lpm6_perf.c',
         'test_lpm_perf.c',
         'test_malloc.c',
+        'test_malloc_perf.c',
         'test_mbuf.c',
         'test_member.c',
         'test_member_perf.c',
@@ -295,6 +296,7 @@ extra_test_names = [
 
 perf_test_names = [
         'ring_perf_autotest',
+        'malloc_perf_autotest',
         'mempool_perf_autotest',
         'memcpy_perf_autotest',
         'hash_perf_autotest',
diff --git a/app/test/test_malloc_perf.c b/app/test/test_malloc_perf.c
new file mode 100644
index 0000000000..9686fc8af5
--- /dev/null
+++ b/app/test/test_malloc_perf.c
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <inttypes.h>
+#include <string.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_malloc.h>
+#include <rte_memzone.h>
+
+#include "test.h"
+
+#define TEST_LOG(level, ...) RTE_LOG(level, USER1, __VA_ARGS__)
+
+typedef void * (alloc_t)(const char *name, size_t size, unsigned int align);
+typedef void (free_t)(void *addr);
+typedef void * (memset_t)(void *addr, int value, size_t size);
+
+static const uint64_t KB = 1 << 10;
+static const uint64_t GB = 1 << 30;
+
+static double
+tsc_to_us(uint64_t tsc, size_t runs)
+{
+	return (double)tsc / rte_get_tsc_hz() * US_PER_S / runs;
+}
+
+static int
+test_memset_perf(double *us_per_gb)
+{
+	static const size_t RUNS = 20;
+
+	void *ptr;
+	size_t i;
+	uint64_t tsc;
+
+	TEST_LOG(INFO, "Reference: memset\n");
+
+	ptr = rte_malloc(NULL, GB, 0);
+	if (ptr == NULL) {
+		TEST_LOG(ERR, "rte_malloc(size=%"PRIx64") failed\n", GB);
+		return -1;
+	}
+
+	tsc = rte_rdtsc_precise();
+	for (i = 0; i < RUNS; i++)
+		memset(ptr, 0, GB);
+	tsc = rte_rdtsc_precise() - tsc;
+
+	*us_per_gb = tsc_to_us(tsc, RUNS);
+	TEST_LOG(INFO, "Result: %f.3 GiB/s <=> %.2f us/MiB\n",
+			US_PER_S / *us_per_gb, *us_per_gb / KB);
+
+	rte_free(ptr);
+	TEST_LOG(INFO, "\n");
+	return 0;
+}
+
+static int
+test_alloc_perf(const char *name, alloc_t *alloc_fn, free_t *free_fn,
+		memset_t *memset_fn, double memset_gb_us, size_t max_runs)
+{
+	static const size_t SIZES[] = {
+			1 << 6, 1 << 7, 1 << 10, 1 << 12, 1 << 16, 1 << 20,
+			1 << 21, 1 << 22, 1 << 24, 1 << 30 };
+
+	size_t i, j;
+	void **ptrs;
+
+	TEST_LOG(INFO, "Performance: %s\n", name);
+
+	ptrs = calloc(max_runs, sizeof(ptrs[0]));
+	if (ptrs == NULL) {
+		TEST_LOG(ERR, "Cannot allocate memory for pointers");
+		return -1;
+	}
+
+	TEST_LOG(INFO, "%12s%8s%12s%12s%12s%17s\n", "Size (B)", "Runs",
+			"Alloc (us)", "Free (us)", "Total (us)",
+			memset_fn != NULL ? "memset (us)" : "est.memset (us)");
+	for (i = 0; i < RTE_DIM(SIZES); i++) {
+		size_t size = SIZES[i];
+		size_t runs_done;
+		uint64_t tsc_start, tsc_alloc, tsc_memset = 0, tsc_free;
+		double alloc_time, free_time, memset_time;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < max_runs; j++) {
+			ptrs[j] = alloc_fn(NULL, size, 0);
+			if (ptrs[j] == NULL)
+				break;
+		}
+		tsc_alloc = rte_rdtsc_precise() - tsc_start;
+
+		if (j == 0) {
+			TEST_LOG(INFO, "%12zu Interrupted: out of memory.\n",
+					size);
+			break;
+		}
+		runs_done = j;
+
+		if (memset_fn != NULL) {
+			tsc_start = rte_rdtsc_precise();
+			for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+				memset_fn(ptrs[j], 0, size);
+			tsc_memset = rte_rdtsc_precise() - tsc_start;
+		}
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+			free_fn(ptrs[j]);
+		tsc_free = rte_rdtsc_precise() - tsc_start;
+
+		alloc_time = tsc_to_us(tsc_alloc, runs_done);
+		free_time = tsc_to_us(tsc_free, runs_done);
+		memset_time = memset_fn != NULL ?
+				tsc_to_us(tsc_memset, runs_done) :
+				memset_gb_us * size / GB;
+		TEST_LOG(INFO, "%12zu%8zu%12.2f%12.2f%12.2f%17.2f\n",
+				size, runs_done, alloc_time, free_time,
+				alloc_time + free_time, memset_time);
+
+		memset(ptrs, 0, max_runs * sizeof(ptrs[0]));
+	}
+
+	free(ptrs);
+	TEST_LOG(INFO, "\n");
+	return 0;
+}
+
+static void *
+memzone_alloc(const char *name __rte_unused, size_t size, unsigned int align)
+{
+	const struct rte_memzone *mz;
+	char gen_name[RTE_MEMZONE_NAMESIZE];
+
+	snprintf(gen_name, sizeof(gen_name), "test-mz-%"PRIx64, rte_rdtsc());
+	mz = rte_memzone_reserve_aligned(gen_name, size, SOCKET_ID_ANY,
+			RTE_MEMZONE_1GB | RTE_MEMZONE_SIZE_HINT_ONLY, align);
+	return (void *)(uintptr_t)mz;
+}
+
+static void
+memzone_free(void *addr)
+{
+	rte_memzone_free((struct rte_memzone *)addr);
+}
+
+static int
+test_malloc_perf(void)
+{
+	static const size_t MAX_RUNS = 10000;
+
+	double memset_us_gb;
+
+	if (test_memset_perf(&memset_us_gb) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_malloc", rte_malloc, rte_free, memset,
+			memset_us_gb, MAX_RUNS) < 0)
+		return -1;
+	if (test_alloc_perf("rte_zmalloc", rte_zmalloc, rte_free, memset,
+			memset_us_gb, MAX_RUNS) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_memzone_reserve", memzone_alloc, memzone_free,
+			NULL, memset_us_gb, RTE_MAX_MEMZONE - 1) < 0)
+		return -1;
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(malloc_perf_autotest, test_malloc_perf);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v1 0/6] Fast restart with many hugepages
  2021-12-30 14:37 [RFC PATCH 0/6] Fast restart with many hugepages Dmitry Kozlyuk
                   ` (5 preceding siblings ...)
  2021-12-30 14:49 ` [RFC PATCH 6/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
@ 2022-01-17  8:07 ` Dmitry Kozlyuk
  2022-01-17  8:07   ` [PATCH v1 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
                     ` (7 more replies)
  6 siblings, 8 replies; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-17  8:07 UTC (permalink / raw)
  To: dev
  Cc: Anatoly Burakov, Viacheslav Ovsiienko, David Marchand,
	Thomas Monjalon, Lior Margalit

This patchset is a new design and implementation of [1].
Changes since RFC:
* Fix bugs with -m and --single-file-segments.
* Reject optimization of mmap() call number (see below).

# Problem Statement

Large allocations that involve mapping new hugepages are slow.
This is problematic, for example, in the following use case.
A single-process application allocates ~1TB of mempools at startup.
Sometimes the app needs to restart as quick as possible.
Allocating the hugepages anew takes as long as 15 seconds,
while the new process could just pick up all the memory
left by the old one (reinitializing the contents as needed).

Almost all of mmap(2) time spent in the kernel
is clearing the memory, i.e. filling it with zeros.
This is done if a file in hugetlbfs is mapped
for the first time system-wide, i.e. a hugepage is committed
to prevent data leaks from the previous users of the same hugepage.
For example, mapping 32 GB from a new file may take 2.16 seconds,
while mapping the same pages again takes only 0.3 ms.
Security put aside, e.g. when the environment is controlled,
this effort is wasted for the memory intended for DMA,
because its content will be overwritten anyway.

Linux EAL explicitly removes hugetlbfs files at initialization
and before mapping to force the kernel clear the memory.
This allows the memory allocator to clean memory on only on freeing.

# Solution

Add a new mode allowing EAL to remap existing hugepage files.
While it is intended to make restarts faster in the first place,
it makes any startup faster except the cold one
(with no existing files).

It is the administrator who accepts security risks
implied by reusing hugepages.
The new mode is an opt-in and a warning is logged.

The feature is Linux-only as it is related
to mapping hugepages from files which only Linux does.
It is inherently incompatible with --in-memory,
for --huge-unlink see below.

There is formally no breakage of API contract,
but there is a behavior change in the new mode:
rte_malloc*() and rte_memzone_reserve*() may return dirty memory
(previously they were returning clean memory from free heap elements).
Their contract has always explicitly allowed this,
but still there may be users relying on the traditional behavior.
Such users will need to fix their code to use the new mode.

# Implementation

## User Interface

There is --huge-unlink switch in the same area to remove hugepage files
before mapping them. It is infeasible to use with the new mode,
because the point is to keep hugepage files for fast future restarts.
Extend --huge-unlink option to represent only valid combinations:

* --huge-unlink=existing OR no option (for compatibility):
  unlink files at initialization
  and before opening them as a precaution.

* --huge-unlink=always OR just --huge-unlink (for compatibility):
  same as above + unlink created files before mapping.

* --huge-unlink=never:
  the new mode, do not unlink hugepages files, reuse them.

This option was always Linux-only, but it is kept as common
in case there are users who expect it to be a no-op on other systems.
(Adding a separate --huge-reuse option was also considered,
but there is no obvious benefit and more combinations to test.)

## EAL

If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
so that the memory allocator may clear the memory if need be.
See patch 5/6 description for details how this is done
in different memory mapping modes.

The memory manager tracks whether an element is clean or dirty.
If rte_zmalloc*() allocates from a dirty element,
the memory is cleared before handling it to the user.
On freeing, the allocator joins adjacent free elements,
but in the new mode it may not be feasible to clear the free memory
if the joint element is dirty (contains dirty parts).
In any case, memory will be cleared only once,
either on freeing or on allocation.
See patch 3/6 for details.
Patch 2/6 adds a benchmark to see how time is distributed
between allocation and freeing in different modes.

Besides clearing memory, each mmap() call takes some time.
For example, 1024 calls for 1 TB may take ~300 ms.
The time of one call mapping N hugepages is O(N),
because inside the kernel hugepages are allocated ony by one.
Syscall overhead is negligeable even for one page.
Hence, it does not make sense to reduce the number of mmap() calls,
which would essentially move the loop over pages into the kernel.

[1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozlyuk@nvidia.com/

Dmitry Kozlyuk (6):
  doc: add hugepage mapping details
  app/test: add allocator performance benchmark
  mem: add dirty malloc element support
  eal: refactor --huge-unlink storage
  eal/linux: allow hugepage file reuse
  eal: extend --huge-unlink for hugepage file reuse

 app/test/meson.build                          |   2 +
 app/test/test_malloc_perf.c                   | 174 ++++++++++++++++++
 doc/guides/linux_gsg/linux_eal_parameters.rst |  21 ++-
 .../prog_guide/env_abstraction_layer.rst      |  94 +++++++++-
 doc/guides/rel_notes/release_22_03.rst        |   7 +
 lib/eal/common/eal_common_options.c           |  46 ++++-
 lib/eal/common/eal_internal_cfg.h             |  10 +-
 lib/eal/common/malloc_elem.c                  |  22 ++-
 lib/eal/common/malloc_elem.h                  |  11 +-
 lib/eal/common/malloc_heap.c                  |  18 +-
 lib/eal/common/rte_malloc.c                   |  21 ++-
 lib/eal/include/rte_memory.h                  |   8 +-
 lib/eal/linux/eal_hugepage_info.c             | 118 +++++++++---
 lib/eal/linux/eal_memalloc.c                  | 171 ++++++++++-------
 lib/eal/linux/eal_memory.c                    |   2 +-
 15 files changed, 597 insertions(+), 128 deletions(-)
 create mode 100644 app/test/test_malloc_perf.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v1 1/6] doc: add hugepage mapping details
  2022-01-17  8:07 ` [PATCH v1 0/6] Fast restart with many hugepages Dmitry Kozlyuk
@ 2022-01-17  8:07   ` Dmitry Kozlyuk
  2022-01-17  9:20     ` Thomas Monjalon
  2022-01-17  8:07   ` [PATCH v1 2/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
                     ` (6 subsequent siblings)
  7 siblings, 1 reply; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-17  8:07 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov

Hugepage mapping is a layer of EAL malloc builds upon.
There were implicit references to its details,
like mentions of segment file descriptors,
but no explicit description of its modes and operation.
Add an overview of mechanics used on ech supported OS.
Convert memory management subsections from list items
to level 4 headers: they are big and important enough.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 .../prog_guide/env_abstraction_layer.rst      | 85 +++++++++++++++++--
 1 file changed, 76 insertions(+), 9 deletions(-)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index c6accce701..bfe4594bf1 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -86,7 +86,7 @@ See chapter
 Memory Mapping Discovery and Memory Reservation
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The allocation of large contiguous physical memory is done using the hugetlbfs kernel filesystem.
+The allocation of large contiguous physical memory is done using hugepages.
 The EAL provides an API to reserve named memory zones in this contiguous memory.
 The physical address of the reserved memory for that memory zone is also returned to the user by the memory zone reservation API.
 
@@ -95,11 +95,12 @@ and legacy mode. Both modes are explained below.
 
 .. note::
 
-    Memory reservations done using the APIs provided by rte_malloc are also backed by pages from the hugetlbfs filesystem.
+    Memory reservations done using the APIs provided by rte_malloc are also backed by hugepages.
 
-+ Dynamic memory mode
+Dynamic Memory Mode
+^^^^^^^^^^^^^^^^^^^
 
-Currently, this mode is only supported on Linux.
+Currently, this mode is only supported on Linux and Windows.
 
 In this mode, usage of hugepages by DPDK application will grow and shrink based
 on application's requests. Any memory allocation through ``rte_malloc()``,
@@ -155,7 +156,8 @@ of memory that can be used by DPDK application.
     :ref:`Multi-process Support <Multi-process_Support>` for more details about
     DPDK IPC.
 
-+ Legacy memory mode
+Legacy Memory Mode
+^^^^^^^^^^^^^^^^^^
 
 This mode is enabled by specifying ``--legacy-mem`` command-line switch to the
 EAL. This switch will have no effect on FreeBSD as FreeBSD only supports
@@ -168,7 +170,8 @@ not allow acquiring or releasing hugepages from the system at runtime.
 If neither ``-m`` nor ``--socket-mem`` were specified, the entire available
 hugepage memory will be preallocated.
 
-+ Hugepage allocation matching
+Hugepage Allocation Matching
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This behavior is enabled by specifying the ``--match-allocations`` command-line
 switch to the EAL. This switch is Linux-only and not supported with
@@ -182,7 +185,8 @@ matching can be used by these types of applications to satisfy both of these
 requirements. This can result in some increased memory usage which is
 very dependent on the memory allocation patterns of the application.
 
-+ 32-bit support
+32-bit Support
+^^^^^^^^^^^^^^
 
 Additional restrictions are present when running in 32-bit mode. In dynamic
 memory mode, by default maximum of 2 gigabytes of VA space will be preallocated,
@@ -192,7 +196,8 @@ used.
 In legacy mode, VA space will only be preallocated for segments that were
 requested (plus padding, to keep IOVA-contiguousness).
 
-+ Maximum amount of memory
+Maximum Amount of Memory
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 All possible virtual memory space that can ever be used for hugepage mapping in
 a DPDK process is preallocated at startup, thereby placing an upper limit on how
@@ -222,7 +227,68 @@ Normally, these options do not need to be changed.
     can later be mapped into that preallocated VA space (if dynamic memory mode
     is enabled), and can optionally be mapped into it at startup.
 
-+ Segment file descriptors
+Hugepage Mapping
+^^^^^^^^^^^^^^^^
+
+Below is an overview of methods used for each OS to obtain hugepages,
+explaining why certain limitations and options exist in EAL.
+See the user guide for a specific OS for configuration details.
+
+FreeBSD uses ``contigmem`` kernel module
+to reserve a fixed number of hugepages at system start,
+which are mapped by EAL at initialization using a specific ``sysctl()``.
+
+Windows EAL allocates hugepages from the OS as needed using Win32 API,
+so available amount depends on the system load.
+It uses ``virt2phys`` kernel module to obtain physical addresses,
+unless running in IOVA-as-VA mode (e.g. forced with ``--iova-mode=va``).
+
+Linux implements a variety of methods:
+
+* mapping each hugepage from its own file in hugetlbfs;
+* mapping multiple hugepages from a shared file in hugetlbfs;
+* anonymous mapping.
+
+Mapping hugepages from files in hugetlbfs is essential for multi-process,
+because secondary processes need to map the same hugepages.
+EAL creates files like ``rtemap_0``
+in directories specified with ``--huge-dir`` option
+(or in the mount point for a specific hugepage size).
+The ``rtemap_`` prefix can be changed using ``--file-prefix``.
+This may be needed for running multiple primary processes
+that share a hugetlbfs mount point.
+Each backing file by default corresponds to one hugepage,
+it is opened and locked for the entire time the hugepage is used.
+See :ref:`segment-file-descriptors` section
+on how the number of open backing file descriptors can be reduced.
+
+Backing files may persist after the corresponding hugepage is freed
+and even after the application terminates,
+reducing the number of hugepages available to other processes.
+EAL removes existing files at startup
+and can remove newly created files before mapping them with ``--huge-unlink``.
+However, since it disables multi-process anyway,
+using anonymous mapping (``--in-memory``) is recommended instead.
+
+:ref:`EAL memory allocator <malloc>` relies on hugepages being zero-filled.
+Hugepages are cleared by the kernel when a file in hugetlbfs or its part
+is mapped for the first time system-wide
+to prevent data leaks from previous users of the same hugepage.
+EAL ensures this behavior by removing existing backing files at startup
+and by recreating them before opening for mapping (as a precaution).
+
+Anonymous mapping does not allow multi-process architecture,
+but it is free of filename conflicts and leftover files on hugetlbfs.
+If memfd_create(2) is supported both at build and run time,
+DPDK memory manager can provide file descriptors for memory segments,
+which are required for VirtIO with vhost-user backend.
+This means open file descriptor issues may also affect this mode,
+with the same solution.
+
+.. _segment-file-descriptors:
+
+Segment File Descriptors
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 On Linux, in most cases, EAL will store segment file descriptors in EAL. This
 can become a problem when using smaller page sizes due to underlying limitations
@@ -731,6 +797,7 @@ We expect only 50% of CPU spend on packet IO.
     echo 100000 > pkt_io/cpu.cfs_period_us
     echo  50000 > pkt_io/cpu.cfs_quota_us
 
+.. _malloc:
 
 Malloc
 ------
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v1 2/6] app/test: add allocator performance benchmark
  2022-01-17  8:07 ` [PATCH v1 0/6] Fast restart with many hugepages Dmitry Kozlyuk
  2022-01-17  8:07   ` [PATCH v1 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
@ 2022-01-17  8:07   ` Dmitry Kozlyuk
  2022-01-17 15:47     ` Bruce Richardson
  2022-01-17 16:06     ` Aaron Conole
  2022-01-17  8:07   ` [PATCH v1 3/6] mem: add dirty malloc element support Dmitry Kozlyuk
                     ` (5 subsequent siblings)
  7 siblings, 2 replies; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-17  8:07 UTC (permalink / raw)
  To: dev; +Cc: Aaron Conole, Viacheslav Ovsiienko

Memory allocator performance is crucial to applications that deal
with large amount of memory or allocate frequently. DPDK allocator
performance is affected by EAL options, API used and, at least,
allocation size. New autotest is intended to be run with different
EAL options. It measures performance with a range of sizes
for dirrerent APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.

Work distribution between allocation and deallocation depends on EAL
options. The test prints both times and total time to ease comparison.

Memory can be filled with zeroes at different points of allocation path,
but it always takes considerable fraction of overall timing. This is why
the test measures filling speed and prints how long clearing takes
for each size as a reference (for rte_memzone_reserve estimations
are printed).

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 app/test/meson.build        |   2 +
 app/test/test_malloc_perf.c | 174 ++++++++++++++++++++++++++++++++++++
 2 files changed, 176 insertions(+)
 create mode 100644 app/test/test_malloc_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 344a609a4d..50cf2602a9 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -88,6 +88,7 @@ test_sources = files(
         'test_lpm6_perf.c',
         'test_lpm_perf.c',
         'test_malloc.c',
+        'test_malloc_perf.c',
         'test_mbuf.c',
         'test_member.c',
         'test_member_perf.c',
@@ -295,6 +296,7 @@ extra_test_names = [
 
 perf_test_names = [
         'ring_perf_autotest',
+        'malloc_perf_autotest',
         'mempool_perf_autotest',
         'memcpy_perf_autotest',
         'hash_perf_autotest',
diff --git a/app/test/test_malloc_perf.c b/app/test/test_malloc_perf.c
new file mode 100644
index 0000000000..9686fc8af5
--- /dev/null
+++ b/app/test/test_malloc_perf.c
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <inttypes.h>
+#include <string.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_malloc.h>
+#include <rte_memzone.h>
+
+#include "test.h"
+
+#define TEST_LOG(level, ...) RTE_LOG(level, USER1, __VA_ARGS__)
+
+typedef void * (alloc_t)(const char *name, size_t size, unsigned int align);
+typedef void (free_t)(void *addr);
+typedef void * (memset_t)(void *addr, int value, size_t size);
+
+static const uint64_t KB = 1 << 10;
+static const uint64_t GB = 1 << 30;
+
+static double
+tsc_to_us(uint64_t tsc, size_t runs)
+{
+	return (double)tsc / rte_get_tsc_hz() * US_PER_S / runs;
+}
+
+static int
+test_memset_perf(double *us_per_gb)
+{
+	static const size_t RUNS = 20;
+
+	void *ptr;
+	size_t i;
+	uint64_t tsc;
+
+	TEST_LOG(INFO, "Reference: memset\n");
+
+	ptr = rte_malloc(NULL, GB, 0);
+	if (ptr == NULL) {
+		TEST_LOG(ERR, "rte_malloc(size=%"PRIx64") failed\n", GB);
+		return -1;
+	}
+
+	tsc = rte_rdtsc_precise();
+	for (i = 0; i < RUNS; i++)
+		memset(ptr, 0, GB);
+	tsc = rte_rdtsc_precise() - tsc;
+
+	*us_per_gb = tsc_to_us(tsc, RUNS);
+	TEST_LOG(INFO, "Result: %f.3 GiB/s <=> %.2f us/MiB\n",
+			US_PER_S / *us_per_gb, *us_per_gb / KB);
+
+	rte_free(ptr);
+	TEST_LOG(INFO, "\n");
+	return 0;
+}
+
+static int
+test_alloc_perf(const char *name, alloc_t *alloc_fn, free_t *free_fn,
+		memset_t *memset_fn, double memset_gb_us, size_t max_runs)
+{
+	static const size_t SIZES[] = {
+			1 << 6, 1 << 7, 1 << 10, 1 << 12, 1 << 16, 1 << 20,
+			1 << 21, 1 << 22, 1 << 24, 1 << 30 };
+
+	size_t i, j;
+	void **ptrs;
+
+	TEST_LOG(INFO, "Performance: %s\n", name);
+
+	ptrs = calloc(max_runs, sizeof(ptrs[0]));
+	if (ptrs == NULL) {
+		TEST_LOG(ERR, "Cannot allocate memory for pointers");
+		return -1;
+	}
+
+	TEST_LOG(INFO, "%12s%8s%12s%12s%12s%17s\n", "Size (B)", "Runs",
+			"Alloc (us)", "Free (us)", "Total (us)",
+			memset_fn != NULL ? "memset (us)" : "est.memset (us)");
+	for (i = 0; i < RTE_DIM(SIZES); i++) {
+		size_t size = SIZES[i];
+		size_t runs_done;
+		uint64_t tsc_start, tsc_alloc, tsc_memset = 0, tsc_free;
+		double alloc_time, free_time, memset_time;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < max_runs; j++) {
+			ptrs[j] = alloc_fn(NULL, size, 0);
+			if (ptrs[j] == NULL)
+				break;
+		}
+		tsc_alloc = rte_rdtsc_precise() - tsc_start;
+
+		if (j == 0) {
+			TEST_LOG(INFO, "%12zu Interrupted: out of memory.\n",
+					size);
+			break;
+		}
+		runs_done = j;
+
+		if (memset_fn != NULL) {
+			tsc_start = rte_rdtsc_precise();
+			for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+				memset_fn(ptrs[j], 0, size);
+			tsc_memset = rte_rdtsc_precise() - tsc_start;
+		}
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+			free_fn(ptrs[j]);
+		tsc_free = rte_rdtsc_precise() - tsc_start;
+
+		alloc_time = tsc_to_us(tsc_alloc, runs_done);
+		free_time = tsc_to_us(tsc_free, runs_done);
+		memset_time = memset_fn != NULL ?
+				tsc_to_us(tsc_memset, runs_done) :
+				memset_gb_us * size / GB;
+		TEST_LOG(INFO, "%12zu%8zu%12.2f%12.2f%12.2f%17.2f\n",
+				size, runs_done, alloc_time, free_time,
+				alloc_time + free_time, memset_time);
+
+		memset(ptrs, 0, max_runs * sizeof(ptrs[0]));
+	}
+
+	free(ptrs);
+	TEST_LOG(INFO, "\n");
+	return 0;
+}
+
+static void *
+memzone_alloc(const char *name __rte_unused, size_t size, unsigned int align)
+{
+	const struct rte_memzone *mz;
+	char gen_name[RTE_MEMZONE_NAMESIZE];
+
+	snprintf(gen_name, sizeof(gen_name), "test-mz-%"PRIx64, rte_rdtsc());
+	mz = rte_memzone_reserve_aligned(gen_name, size, SOCKET_ID_ANY,
+			RTE_MEMZONE_1GB | RTE_MEMZONE_SIZE_HINT_ONLY, align);
+	return (void *)(uintptr_t)mz;
+}
+
+static void
+memzone_free(void *addr)
+{
+	rte_memzone_free((struct rte_memzone *)addr);
+}
+
+static int
+test_malloc_perf(void)
+{
+	static const size_t MAX_RUNS = 10000;
+
+	double memset_us_gb;
+
+	if (test_memset_perf(&memset_us_gb) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_malloc", rte_malloc, rte_free, memset,
+			memset_us_gb, MAX_RUNS) < 0)
+		return -1;
+	if (test_alloc_perf("rte_zmalloc", rte_zmalloc, rte_free, memset,
+			memset_us_gb, MAX_RUNS) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_memzone_reserve", memzone_alloc, memzone_free,
+			NULL, memset_us_gb, RTE_MAX_MEMZONE - 1) < 0)
+		return -1;
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(malloc_perf_autotest, test_malloc_perf);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v1 3/6] mem: add dirty malloc element support
  2022-01-17  8:07 ` [PATCH v1 0/6] Fast restart with many hugepages Dmitry Kozlyuk
  2022-01-17  8:07   ` [PATCH v1 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
  2022-01-17  8:07   ` [PATCH v1 2/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
@ 2022-01-17  8:07   ` Dmitry Kozlyuk
  2022-01-17 14:07     ` Thomas Monjalon
  2022-01-17  8:07   ` [PATCH v1 4/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-17  8:07 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov

EAL malloc layer assumed all free elements content
is filled with zeros ("clean"), as opposed to uninitialized ("dirty").
This assumption was ensured in two ways:
1. EAL memalloc layer always returned clean memory.
2. Freed memory was cleared before returning into the heap.

Clearing the memory can be as slow as around 14 GiB/s.
To save doing so, memalloc layer is allowed to return dirty memory.
Such segments being marked with RTE_MEMSEG_FLAG_DIRTY.
The allocator tracks elements that contain dirty memory
using the new flag in the element header.
When clean memory is requested via rte_zmalloc*()
and the suitable element is dirty, it is cleared on allocation.
When memory is deallocated, the freed element is joined
with adjacent free elements, and the dirty flag is updated:

    dirty + freed + dirty = dirty  =>  no need to clean
            freed + dirty = dirty      the freed memory

    clean + freed + clean = clean  =>  freed memory
    clean + freed         = clean      must be cleared
            freed + clean = clean
            freed         = clean

As a result, memory is either cleared on free, as before,
or it will be cleared on allocation if need be, but never twice.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 lib/eal/common/malloc_elem.c | 22 +++++++++++++++++++---
 lib/eal/common/malloc_elem.h | 11 +++++++++--
 lib/eal/common/malloc_heap.c | 18 ++++++++++++------
 lib/eal/common/rte_malloc.c  | 21 ++++++++++++++-------
 lib/eal/include/rte_memory.h |  8 ++++++--
 5 files changed, 60 insertions(+), 20 deletions(-)

diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
index bdd20a162e..e04e0890fb 100644
--- a/lib/eal/common/malloc_elem.c
+++ b/lib/eal/common/malloc_elem.c
@@ -129,7 +129,7 @@ malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
 void
 malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
 		struct rte_memseg_list *msl, size_t size,
-		struct malloc_elem *orig_elem, size_t orig_size)
+		struct malloc_elem *orig_elem, size_t orig_size, bool dirty)
 {
 	elem->heap = heap;
 	elem->msl = msl;
@@ -137,6 +137,7 @@ malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
 	elem->next = NULL;
 	memset(&elem->free_list, 0, sizeof(elem->free_list));
 	elem->state = ELEM_FREE;
+	elem->dirty = dirty;
 	elem->size = size;
 	elem->pad = 0;
 	elem->orig_elem = orig_elem;
@@ -300,7 +301,7 @@ split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt)
 	const size_t new_elem_size = elem->size - old_elem_size;
 
 	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size,
-			 elem->orig_elem, elem->orig_size);
+			elem->orig_elem, elem->orig_size, elem->dirty);
 	split_pt->prev = elem;
 	split_pt->next = next_elem;
 	if (next_elem)
@@ -506,6 +507,7 @@ join_elem(struct malloc_elem *elem1, struct malloc_elem *elem2)
 	else
 		elem1->heap->last = elem1;
 	elem1->next = next;
+	elem1->dirty |= elem2->dirty;
 	if (elem1->pad) {
 		struct malloc_elem *inner = RTE_PTR_ADD(elem1, elem1->pad);
 		inner->size = elem1->size - elem1->pad;
@@ -579,6 +581,14 @@ malloc_elem_free(struct malloc_elem *elem)
 	ptr = RTE_PTR_ADD(elem, MALLOC_ELEM_HEADER_LEN);
 	data_len = elem->size - MALLOC_ELEM_OVERHEAD;
 
+	/*
+	 * Consider the element clean for the purposes of joining.
+	 * If both neighbors are clean or non-existent,
+	 * the joint element will be clean,
+	 * which means the memory should be cleared.
+	 * There is no need to clear the memory if the joint element is dirty.
+	 */
+	elem->dirty = false;
 	elem = malloc_elem_join_adjacent_free(elem);
 
 	malloc_elem_free_list_insert(elem);
@@ -588,8 +598,14 @@ malloc_elem_free(struct malloc_elem *elem)
 	/* decrease heap's count of allocated elements */
 	elem->heap->alloc_count--;
 
-	/* poison memory */
+#ifndef RTE_MALLOC_DEBUG
+	/* Normally clear the memory when needed. */
+	if (!elem->dirty)
+		memset(ptr, 0, data_len);
+#else
+	/* Always poison the memory in debug mode. */
 	memset(ptr, MALLOC_POISON, data_len);
+#endif
 
 	return elem;
 }
diff --git a/lib/eal/common/malloc_elem.h b/lib/eal/common/malloc_elem.h
index 15d8ba7af2..f2aa98821b 100644
--- a/lib/eal/common/malloc_elem.h
+++ b/lib/eal/common/malloc_elem.h
@@ -27,7 +27,13 @@ struct malloc_elem {
 	LIST_ENTRY(malloc_elem) free_list;
 	/**< list of free elements in heap */
 	struct rte_memseg_list *msl;
-	volatile enum elem_state state;
+	/** Element state, @c dirty and @c pad validity depends on it. */
+	/* An extra bit is needed to represent enum elem_state as signed int. */
+	enum elem_state state : 3;
+	/** If state == ELEM_FREE: the memory is not filled with zeroes. */
+	uint32_t dirty : 1;
+	/** Reserved for future use. */
+	uint32_t reserved : 28;
 	uint32_t pad;
 	size_t size;
 	struct malloc_elem *orig_elem;
@@ -320,7 +326,8 @@ malloc_elem_init(struct malloc_elem *elem,
 		struct rte_memseg_list *msl,
 		size_t size,
 		struct malloc_elem *orig_elem,
-		size_t orig_size);
+		size_t orig_size,
+		bool dirty);
 
 void
 malloc_elem_insert(struct malloc_elem *elem);
diff --git a/lib/eal/common/malloc_heap.c b/lib/eal/common/malloc_heap.c
index 55aad2711b..24080fc473 100644
--- a/lib/eal/common/malloc_heap.c
+++ b/lib/eal/common/malloc_heap.c
@@ -93,11 +93,11 @@ malloc_socket_to_heap_id(unsigned int socket_id)
  */
 static struct malloc_elem *
 malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl,
-		void *start, size_t len)
+		void *start, size_t len, bool dirty)
 {
 	struct malloc_elem *elem = start;
 
-	malloc_elem_init(elem, heap, msl, len, elem, len);
+	malloc_elem_init(elem, heap, msl, len, elem, len, dirty);
 
 	malloc_elem_insert(elem);
 
@@ -135,7 +135,8 @@ malloc_add_seg(const struct rte_memseg_list *msl,
 
 	found_msl = &mcfg->memsegs[msl_idx];
 
-	malloc_heap_add_memory(heap, found_msl, ms->addr, len);
+	malloc_heap_add_memory(heap, found_msl, ms->addr, len,
+			ms->flags & RTE_MEMSEG_FLAG_DIRTY);
 
 	heap->total_size += len;
 
@@ -303,7 +304,8 @@ alloc_pages_on_heap(struct malloc_heap *heap, uint64_t pg_sz, size_t elt_size,
 	struct rte_memseg_list *msl;
 	struct malloc_elem *elem = NULL;
 	size_t alloc_sz;
-	int allocd_pages;
+	int allocd_pages, i;
+	bool dirty = false;
 	void *ret, *map_addr;
 
 	alloc_sz = (size_t)pg_sz * n_segs;
@@ -372,8 +374,12 @@ alloc_pages_on_heap(struct malloc_heap *heap, uint64_t pg_sz, size_t elt_size,
 		goto fail;
 	}
 
+	/* Element is dirty if it contains at least one dirty page. */
+	for (i = 0; i < allocd_pages; i++)
+		dirty |= ms[i]->flags & RTE_MEMSEG_FLAG_DIRTY;
+
 	/* add newly minted memsegs to malloc heap */
-	elem = malloc_heap_add_memory(heap, msl, map_addr, alloc_sz);
+	elem = malloc_heap_add_memory(heap, msl, map_addr, alloc_sz, dirty);
 
 	/* try once more, as now we have allocated new memory */
 	ret = find_suitable_element(heap, elt_size, flags, align, bound,
@@ -1260,7 +1266,7 @@ malloc_heap_add_external_memory(struct malloc_heap *heap,
 	memset(msl->base_va, 0, msl->len);
 
 	/* now, add newly minted memory to the malloc heap */
-	malloc_heap_add_memory(heap, msl, msl->base_va, msl->len);
+	malloc_heap_add_memory(heap, msl, msl->base_va, msl->len, false);
 
 	heap->total_size += msl->len;
 
diff --git a/lib/eal/common/rte_malloc.c b/lib/eal/common/rte_malloc.c
index d0bec26920..71a3f7ecb4 100644
--- a/lib/eal/common/rte_malloc.c
+++ b/lib/eal/common/rte_malloc.c
@@ -115,15 +115,22 @@ rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket)
 {
 	void *ptr = rte_malloc_socket(type, size, align, socket);
 
+	if (ptr != NULL) {
+		struct malloc_elem *elem = malloc_elem_from_data(ptr);
+
+		if (elem->dirty) {
+			memset(ptr, 0, size);
+		} else {
 #ifdef RTE_MALLOC_DEBUG
-	/*
-	 * If DEBUG is enabled, then freed memory is marked with poison
-	 * value and set to zero on allocation.
-	 * If DEBUG is not enabled then  memory is already zeroed.
-	 */
-	if (ptr != NULL)
-		memset(ptr, 0, size);
+			/*
+			 * If DEBUG is enabled, then freed memory is marked
+			 * with a poison value and set to zero on allocation.
+			 * If DEBUG is disabled then memory is already zeroed.
+			 */
+			memset(ptr, 0, size);
 #endif
+		}
+	}
 
 	rte_eal_trace_mem_zmalloc(type, size, align, socket, ptr);
 	return ptr;
diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h
index 6d018629ae..68b069fd04 100644
--- a/lib/eal/include/rte_memory.h
+++ b/lib/eal/include/rte_memory.h
@@ -19,6 +19,7 @@
 extern "C" {
 #endif
 
+#include <rte_bitops.h>
 #include <rte_common.h>
 #include <rte_compat.h>
 #include <rte_config.h>
@@ -37,11 +38,14 @@ extern "C" {
 
 #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
 
+/** Prevent this segment from being freed back to the OS. */
+#define RTE_MEMSEG_FLAG_DO_NOT_FREE RTE_BIT32(0)
+/** This segment is not filled with zeros. */
+#define RTE_MEMSEG_FLAG_DIRTY RTE_BIT32(1)
+
 /**
  * Physical memory segment descriptor.
  */
-#define RTE_MEMSEG_FLAG_DO_NOT_FREE (1 << 0)
-/**< Prevent this segment from being freed back to the OS. */
 struct rte_memseg {
 	rte_iova_t iova;            /**< Start IO address. */
 	RTE_STD_C11
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v1 4/6] eal: refactor --huge-unlink storage
  2022-01-17  8:07 ` [PATCH v1 0/6] Fast restart with many hugepages Dmitry Kozlyuk
                     ` (2 preceding siblings ...)
  2022-01-17  8:07   ` [PATCH v1 3/6] mem: add dirty malloc element support Dmitry Kozlyuk
@ 2022-01-17  8:07   ` Dmitry Kozlyuk
  2022-01-17 14:10     ` Thomas Monjalon
  2022-01-17  8:14   ` [PATCH v1 5/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
                     ` (3 subsequent siblings)
  7 siblings, 1 reply; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-17  8:07 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov

In preparation to extend --huge-unlink option semantics
refactor how it is stored in the internal configuration.
It makes future changes more isolated.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 lib/eal/common/eal_common_options.c | 9 +++++----
 lib/eal/common/eal_internal_cfg.h   | 8 +++++++-
 lib/eal/linux/eal_memalloc.c        | 7 ++++---
 lib/eal/linux/eal_memory.c          | 2 +-
 4 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index 1cfdd75f3b..7520ebda8e 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -1737,7 +1737,7 @@ eal_parse_common_option(int opt, const char *optarg,
 
 	/* long options */
 	case OPT_HUGE_UNLINK_NUM:
-		conf->hugepage_unlink = 1;
+		conf->hugepage_file.unlink_before_mapping = true;
 		break;
 
 	case OPT_NO_HUGE_NUM:
@@ -1766,7 +1766,7 @@ eal_parse_common_option(int opt, const char *optarg,
 		conf->in_memory = 1;
 		/* in-memory is a superset of noshconf and huge-unlink */
 		conf->no_shconf = 1;
-		conf->hugepage_unlink = 1;
+		conf->hugepage_file.unlink_before_mapping = true;
 		break;
 
 	case OPT_PROC_TYPE_NUM:
@@ -2050,7 +2050,8 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"be specified together with --"OPT_NO_HUGE"\n");
 		return -1;
 	}
-	if (internal_cfg->no_hugetlbfs && internal_cfg->hugepage_unlink &&
+	if (internal_cfg->no_hugetlbfs &&
+			internal_cfg->hugepage_file.unlink_before_mapping &&
 			!internal_cfg->in_memory) {
 		RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
 			"be specified together with --"OPT_NO_HUGE"\n");
@@ -2061,7 +2062,7 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			" is only supported in non-legacy memory mode\n");
 	}
 	if (internal_cfg->single_file_segments &&
-			internal_cfg->hugepage_unlink &&
+			internal_cfg->hugepage_file.unlink_before_mapping &&
 			!internal_cfg->in_memory) {
 		RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE_SEGMENTS" is "
 			"not compatible with --"OPT_HUGE_UNLINK"\n");
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index d6c0470eb8..b5e6942578 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -40,6 +40,12 @@ struct simd_bitwidth {
 	uint16_t bitwidth; /**< bitwidth value */
 };
 
+/** Hugepage backing files discipline. */
+struct hugepage_file_discipline {
+	/** Unlink files before mapping them to leave no trace in hugetlbfs. */
+	bool unlink_before_mapping;
+};
+
 /**
  * internal configuration
  */
@@ -48,7 +54,7 @@ struct internal_config {
 	volatile unsigned force_nchannel; /**< force number of channels */
 	volatile unsigned force_nrank;    /**< force number of ranks */
 	volatile unsigned no_hugetlbfs;   /**< true to disable hugetlbfs */
-	unsigned hugepage_unlink;         /**< true to unlink backing files */
+	struct hugepage_file_discipline hugepage_file;
 	volatile unsigned no_pci;         /**< true to disable PCI */
 	volatile unsigned no_hpet;        /**< true to disable HPET */
 	volatile unsigned vmware_tsc_map; /**< true to use VMware TSC mapping
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 337f2bc739..abbe605e49 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -564,7 +564,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 					__func__, strerror(errno));
 				goto resized;
 			}
-			if (internal_conf->hugepage_unlink &&
+			if (internal_conf->hugepage_file.unlink_before_mapping &&
 					!internal_conf->in_memory) {
 				if (unlink(path)) {
 					RTE_LOG(DEBUG, EAL, "%s(): unlink() failed: %s\n",
@@ -697,7 +697,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 			close_hugefile(fd, path, list_idx);
 	} else {
 		/* only remove file if we can take out a write lock */
-		if (internal_conf->hugepage_unlink == 0 &&
+		if (!internal_conf->hugepage_file.unlink_before_mapping &&
 				internal_conf->in_memory == 0 &&
 				lock(fd, LOCK_EX) == 1)
 			unlink(path);
@@ -756,7 +756,8 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		/* if we're able to take out a write lock, we're the last one
 		 * holding onto this page.
 		 */
-		if (!internal_conf->in_memory && !internal_conf->hugepage_unlink) {
+		if (!internal_conf->in_memory &&
+				internal_conf->hugepage_file.unlink_before_mapping) {
 			ret = lock(fd, LOCK_EX);
 			if (ret >= 0) {
 				/* no one else is using this page */
diff --git a/lib/eal/linux/eal_memory.c b/lib/eal/linux/eal_memory.c
index 03a4f2dd2d..83eec078a4 100644
--- a/lib/eal/linux/eal_memory.c
+++ b/lib/eal/linux/eal_memory.c
@@ -1428,7 +1428,7 @@ eal_legacy_hugepage_init(void)
 	}
 
 	/* free the hugepage backing files */
-	if (internal_conf->hugepage_unlink &&
+	if (internal_conf->hugepage_file.unlink_before_mapping &&
 		unlink_hugepage_files(tmp_hp, internal_conf->num_hugepage_sizes) < 0) {
 		RTE_LOG(ERR, EAL, "Unlinking hugepage files failed!\n");
 		goto fail;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v1 5/6] eal/linux: allow hugepage file reuse
  2022-01-17  8:07 ` [PATCH v1 0/6] Fast restart with many hugepages Dmitry Kozlyuk
                     ` (3 preceding siblings ...)
  2022-01-17  8:07   ` [PATCH v1 4/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
@ 2022-01-17  8:14   ` Dmitry Kozlyuk
  2022-01-17 14:24     ` Thomas Monjalon
  2022-01-17  8:14   ` [PATCH v1 6/6] eal: extend --huge-unlink for " Dmitry Kozlyuk
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-17  8:14 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov

Linux EAL ensured that mapped hugepages are clean
by always mapping from newly created files:
existing hugepage backing files were always removed.
In this case, the kernel clears the page to prevent data leaks,
because the mapped memory may contain leftover data
from the previous process that was using this memory.
Clearing takes the bulk of the time spent in mmap(2),
increasing EAL initialization time.

Introduce a mode to keep existing files and reuse them
in order to speed up initial memory allocation in EAL.
Hugepages mapped from such files may contain data
left by the previous process that used this memory,
so RTE_MEMSEG_FLAG_DIRTY is set for their segments.
If multiple hugepages are mapped from the same file:
1. When fallocate(2) is used, all memory mapped from this file
   is considered dirty, because it is unknown
   which parts of the file are holes.
2. When ftruncate(3) is used, memory mapped from this file
   is considered dirty unless the file is extended
   to create a new mapping, which implies clean memory.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
Coverity complains that "path" may be uninitialized in get_seg_fd()
at line 327, but it is always initialized with eal_get_hugefile_path()
at lines 309-316.

 lib/eal/common/eal_internal_cfg.h |   2 +
 lib/eal/linux/eal_hugepage_info.c | 118 +++++++++++++++++----
 lib/eal/linux/eal_memalloc.c      | 164 ++++++++++++++++++------------
 3 files changed, 200 insertions(+), 84 deletions(-)

diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index b5e6942578..3685aa7c52 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -44,6 +44,8 @@ struct simd_bitwidth {
 struct hugepage_file_discipline {
 	/** Unlink files before mapping them to leave no trace in hugetlbfs. */
 	bool unlink_before_mapping;
+	/** Reuse existing files, never delete or re-create them. */
+	bool keep_existing;
 };
 
 /**
diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index 9fb0e968db..6607fe5906 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -84,7 +84,7 @@ static int get_hp_sysfs_value(const char *subdir, const char *file, unsigned lon
 /* this function is only called from eal_hugepage_info_init which itself
  * is only called from a primary process */
 static uint32_t
-get_num_hugepages(const char *subdir, size_t sz)
+get_num_hugepages(const char *subdir, size_t sz, unsigned int reusable_pages)
 {
 	unsigned long resv_pages, num_pages, over_pages, surplus_pages;
 	const char *nr_hp_file = "free_hugepages";
@@ -116,7 +116,7 @@ get_num_hugepages(const char *subdir, size_t sz)
 	else
 		over_pages = 0;
 
-	if (num_pages == 0 && over_pages == 0)
+	if (num_pages == 0 && over_pages == 0 && reusable_pages)
 		RTE_LOG(WARNING, EAL, "No available %zu kB hugepages reported\n",
 				sz >> 10);
 
@@ -124,6 +124,10 @@ get_num_hugepages(const char *subdir, size_t sz)
 	if (num_pages < over_pages) /* overflow */
 		num_pages = UINT32_MAX;
 
+	num_pages += reusable_pages;
+	if (num_pages < reusable_pages) /* overflow */
+		num_pages = UINT32_MAX;
+
 	/* we want to return a uint32_t and more than this looks suspicious
 	 * anyway ... */
 	if (num_pages > UINT32_MAX)
@@ -297,20 +301,28 @@ get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
 	return -1;
 }
 
+struct walk_hugedir_data {
+	int dir_fd;
+	int file_fd;
+	const char *file_name;
+	void *user_data;
+};
+
+typedef void (walk_hugedir_t)(const struct walk_hugedir_data *whd);
+
 /*
- * Clear the hugepage directory of whatever hugepage files
- * there are. Checks if the file is locked (i.e.
- * if it's in use by another DPDK process).
+ * Search the hugepage directory for whatever hugepage files there are.
+ * Check if the file is in use by another DPDK process.
+ * If not, execute a callback on it.
  */
 static int
-clear_hugedir(const char * hugedir)
+walk_hugedir(const char *hugedir, walk_hugedir_t *cb, void *user_data)
 {
 	DIR *dir;
 	struct dirent *dirent;
 	int dir_fd, fd, lck_result;
 	const char filter[] = "*map_*"; /* matches hugepage files */
 
-	/* open directory */
 	dir = opendir(hugedir);
 	if (!dir) {
 		RTE_LOG(ERR, EAL, "Unable to open hugepage directory %s\n",
@@ -326,7 +338,7 @@ clear_hugedir(const char * hugedir)
 		goto error;
 	}
 
-	while(dirent != NULL){
+	while (dirent != NULL) {
 		/* skip files that don't match the hugepage pattern */
 		if (fnmatch(filter, dirent->d_name, 0) > 0) {
 			dirent = readdir(dir);
@@ -345,9 +357,15 @@ clear_hugedir(const char * hugedir)
 		/* non-blocking lock */
 		lck_result = flock(fd, LOCK_EX | LOCK_NB);
 
-		/* if lock succeeds, remove the file */
+		/* if lock succeeds, execute callback */
 		if (lck_result != -1)
-			unlinkat(dir_fd, dirent->d_name, 0);
+			cb(&(struct walk_hugedir_data){
+				.dir_fd = dir_fd,
+				.file_fd = fd,
+				.file_name = dirent->d_name,
+				.user_data = user_data,
+			});
+
 		close (fd);
 		dirent = readdir(dir);
 	}
@@ -359,12 +377,48 @@ clear_hugedir(const char * hugedir)
 	if (dir)
 		closedir(dir);
 
-	RTE_LOG(ERR, EAL, "Error while clearing hugepage dir: %s\n",
+	RTE_LOG(ERR, EAL, "Error while walking hugepage dir: %s\n",
 		strerror(errno));
 
 	return -1;
 }
 
+static void
+clear_hugedir_cb(const struct walk_hugedir_data *whd)
+{
+	unlinkat(whd->dir_fd, whd->file_name, 0);
+}
+
+/* Remove hugepage files not used by other DPDK processes from a directory. */
+static int
+clear_hugedir(const char *hugedir)
+{
+	return walk_hugedir(hugedir, clear_hugedir_cb, NULL);
+}
+
+static void
+inspect_hugedir_cb(const struct walk_hugedir_data *whd)
+{
+	uint64_t *total_size = whd->user_data;
+	struct stat st;
+
+	if (fstat(whd->file_fd, &st) < 0)
+		RTE_LOG(DEBUG, EAL, "%s(): stat(\"%s\") failed: %s",
+				__func__, whd->file_name, strerror(errno));
+	else
+		(*total_size) += st.st_size;
+}
+
+/*
+ * Count the total size in bytes of all files in the directory
+ * not mapped by other DPDK process.
+ */
+static int
+inspect_hugedir(const char *hugedir, uint64_t *total_size)
+{
+	return walk_hugedir(hugedir, inspect_hugedir_cb, total_size);
+}
+
 static int
 compare_hpi(const void *a, const void *b)
 {
@@ -375,7 +429,8 @@ compare_hpi(const void *a, const void *b)
 }
 
 static void
-calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
+calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent,
+		unsigned int reusable_pages)
 {
 	uint64_t total_pages = 0;
 	unsigned int i;
@@ -388,8 +443,15 @@ calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
 	 * in one socket and sorting them later
 	 */
 	total_pages = 0;
-	/* we also don't want to do this for legacy init */
-	if (!internal_conf->legacy_mem)
+
+	/*
+	 * We also don't want to do this for legacy init.
+	 * When there are hugepage files to reuse it is unknown
+	 * what NUMA node the pages are on.
+	 * This could be determined by mapping,
+	 * but it is precisely what hugepage file reuse is trying to avoid.
+	 */
+	if (!internal_conf->legacy_mem && reusable_pages == 0)
 		for (i = 0; i < rte_socket_count(); i++) {
 			int socket = rte_socket_id_by_idx(i);
 			unsigned int num_pages =
@@ -405,7 +467,7 @@ calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
 	 */
 	if (total_pages == 0) {
 		hpi->num_pages[0] = get_num_hugepages(dirent->d_name,
-				hpi->hugepage_sz);
+				hpi->hugepage_sz, reusable_pages);
 
 #ifndef RTE_ARCH_64
 		/* for 32-bit systems, limit number of hugepages to
@@ -421,6 +483,8 @@ hugepage_info_init(void)
 {	const char dirent_start_text[] = "hugepages-";
 	const size_t dirent_start_len = sizeof(dirent_start_text) - 1;
 	unsigned int i, num_sizes = 0;
+	uint64_t reusable_bytes;
+	unsigned int reusable_pages;
 	DIR *dir;
 	struct dirent *dirent;
 	struct internal_config *internal_conf =
@@ -454,7 +518,7 @@ hugepage_info_init(void)
 			uint32_t num_pages;
 
 			num_pages = get_num_hugepages(dirent->d_name,
-					hpi->hugepage_sz);
+					hpi->hugepage_sz, 0);
 			if (num_pages > 0)
 				RTE_LOG(NOTICE, EAL,
 					"%" PRIu32 " hugepages of size "
@@ -473,7 +537,7 @@ hugepage_info_init(void)
 					"hugepages of size %" PRIu64 " bytes "
 					"will be allocated anonymously\n",
 					hpi->hugepage_sz);
-				calc_num_pages(hpi, dirent);
+				calc_num_pages(hpi, dirent, 0);
 				num_sizes++;
 			}
 #endif
@@ -489,11 +553,23 @@ hugepage_info_init(void)
 				"Failed to lock hugepage directory!\n");
 			break;
 		}
-		/* clear out the hugepages dir from unused pages */
-		if (clear_hugedir(hpi->hugedir) == -1)
-			break;
 
-		calc_num_pages(hpi, dirent);
+		/*
+		 * Check for existing hugepage files and either remove them
+		 * or count how many of them can be reused.
+		 */
+		reusable_pages = 0;
+		if (internal_conf->hugepage_file.keep_existing) {
+			reusable_bytes = 0;
+			if (inspect_hugedir(hpi->hugedir,
+					&reusable_bytes) < 0)
+				break;
+			RTE_ASSERT(reusable_bytes % hpi->hugepage_sz == 0);
+			reusable_pages = reusable_bytes / hpi->hugepage_sz;
+		} else if (clear_hugedir(hpi->hugedir) < 0) {
+			break;
+		}
+		calc_num_pages(hpi, dirent, reusable_pages);
 
 		num_sizes++;
 	}
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index abbe605e49..e4cd10b195 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -287,12 +287,19 @@ get_seg_memfd(struct hugepage_info *hi __rte_unused,
 
 static int
 get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
-		unsigned int list_idx, unsigned int seg_idx)
+		unsigned int list_idx, unsigned int seg_idx,
+		bool *dirty)
 {
 	int fd;
+	int *out_fd;
+	struct stat st;
+	int ret;
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
+	if (dirty != NULL)
+		*dirty = false;
+
 	/* for in-memory mode, we only make it here when we're sure we support
 	 * memfd, and this is a special case.
 	 */
@@ -300,66 +307,69 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 		return get_seg_memfd(hi, list_idx, seg_idx);
 
 	if (internal_conf->single_file_segments) {
-		/* create a hugepage file path */
+		out_fd = &fd_list[list_idx].memseg_list_fd;
 		eal_get_hugefile_path(path, buflen, hi->hugedir, list_idx);
-
-		fd = fd_list[list_idx].memseg_list_fd;
-
-		if (fd < 0) {
-			fd = open(path, O_CREAT | O_RDWR, 0600);
-			if (fd < 0) {
-				RTE_LOG(ERR, EAL, "%s(): open failed: %s\n",
-					__func__, strerror(errno));
-				return -1;
-			}
-			/* take out a read lock and keep it indefinitely */
-			if (lock(fd, LOCK_SH) < 0) {
-				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
-					__func__, strerror(errno));
-				close(fd);
-				return -1;
-			}
-			fd_list[list_idx].memseg_list_fd = fd;
-		}
 	} else {
-		/* create a hugepage file path */
+		out_fd = &fd_list[list_idx].fds[seg_idx];
 		eal_get_hugefile_path(path, buflen, hi->hugedir,
 				list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);
+	}
+	fd = *out_fd;
+	if (fd >= 0)
+		return fd;
 
-		fd = fd_list[list_idx].fds[seg_idx];
-
-		if (fd < 0) {
-			/* A primary process is the only one creating these
-			 * files. If there is a leftover that was not cleaned
-			 * by clear_hugedir(), we must *now* make sure to drop
-			 * the file or we will remap old stuff while the rest
-			 * of the code is built on the assumption that a new
-			 * page is clean.
-			 */
-			if (rte_eal_process_type() == RTE_PROC_PRIMARY &&
-					unlink(path) == -1 &&
-					errno != ENOENT) {
-				RTE_LOG(DEBUG, EAL, "%s(): could not remove '%s': %s\n",
-					__func__, path, strerror(errno));
-				return -1;
-			}
+	/*
+	 * There is no TOCTOU between stat() and unlink()/open()
+	 * because the hugepage directory is locked.
+	 */
+	ret = stat(path, &st);
+	if (ret < 0 && errno != ENOENT) {
+		RTE_LOG(DEBUG, EAL, "%s(): stat() for '%s' failed: %s\n",
+			__func__, path, strerror(errno));
+		return -1;
+	}
+	if (internal_conf->hugepage_file.keep_existing && ret == 0 &&
+			dirty != NULL)
+		*dirty = true;
 
-			fd = open(path, O_CREAT | O_RDWR, 0600);
-			if (fd < 0) {
-				RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n",
-					__func__, strerror(errno));
-				return -1;
-			}
-			/* take out a read lock */
-			if (lock(fd, LOCK_SH) < 0) {
-				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
-					__func__, strerror(errno));
-				close(fd);
-				return -1;
-			}
-			fd_list[list_idx].fds[seg_idx] = fd;
+	/*
+	 * The kernel clears a hugepage only when it is mapped
+	 * from a particular file for the first time.
+	 * If the file already exists, the old content will be mapped.
+	 * If the memory manager assumes all mapped pages to be clean,
+	 * the file must be removed and created anew.
+	 * Otherwise, the primary caller must be notified
+	 * that mapped pages will be dirty
+	 * (secondary callers receive the segment state from the primary one).
+	 * When multiple hugepages are mapped from the same file,
+	 * whether they will be dirty depends on the part that is mapped.
+	 */
+	if (!internal_conf->single_file_segments &&
+			rte_eal_process_type() == RTE_PROC_PRIMARY &&
+			ret == 0) {
+		/* coverity[toctou] */
+		if (unlink(path) < 0) {
+			RTE_LOG(DEBUG, EAL, "%s(): could not remove '%s': %s\n",
+				__func__, path, strerror(errno));
+			return -1;
 		}
 	}
+
+	/* coverity[toctou] */
+	fd = open(path, O_CREAT | O_RDWR, 0600);
+	if (fd < 0) {
+		RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n",
+			__func__, strerror(errno));
+		return -1;
+	}
+	/* take out a read lock */
+	if (lock(fd, LOCK_SH) < 0) {
+		RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
+			__func__, strerror(errno));
+		close(fd);
+		return -1;
+	}
+	*out_fd = fd;
 	return fd;
 }
 
@@ -385,8 +395,10 @@ resize_hugefile_in_memory(int fd, uint64_t fa_offset,
 
 static int
 resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
-		bool grow)
+		bool grow, bool *dirty)
 {
+	const struct internal_config *internal_conf =
+			eal_get_internal_configuration();
 	bool again = false;
 
 	do {
@@ -405,6 +417,8 @@ resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
 			uint64_t cur_size = get_file_size(fd);
 
 			/* fallocate isn't supported, fall back to ftruncate */
+			if (dirty != NULL)
+				*dirty = new_size <= cur_size;
 			if (new_size > cur_size &&
 					ftruncate(fd, new_size) < 0) {
 				RTE_LOG(DEBUG, EAL, "%s(): ftruncate() failed: %s\n",
@@ -447,8 +461,17 @@ resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
 						strerror(errno));
 					return -1;
 				}
-			} else
+			} else {
 				fallocate_supported = 1;
+				/*
+				 * It is unknown which portions of an existing
+				 * hugepage file were allocated previously,
+				 * so all pages within the file are considered
+				 * dirty, unless the file is a fresh one.
+				 */
+				if (dirty != NULL)
+					*dirty &= internal_conf->hugepage_file.keep_existing;
+			}
 		}
 	} while (again);
 
@@ -475,7 +498,8 @@ close_hugefile(int fd, char *path, int list_idx)
 }
 
 static int
-resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow)
+resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow,
+		bool *dirty)
 {
 	/* in-memory mode is a special case, because we can be sure that
 	 * fallocate() is supported.
@@ -483,12 +507,15 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow)
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	if (internal_conf->in_memory)
+	if (internal_conf->in_memory) {
+		if (dirty != NULL)
+			*dirty = false;
 		return resize_hugefile_in_memory(fd, fa_offset,
 				page_sz, grow);
+	}
 
 	return resize_hugefile_in_filesystem(fd, fa_offset, page_sz,
-				grow);
+			grow, dirty);
 }
 
 static int
@@ -505,6 +532,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	char path[PATH_MAX];
 	int ret = 0;
 	int fd;
+	bool dirty;
 	size_t alloc_sz;
 	int flags;
 	void *new_addr;
@@ -534,6 +562,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 
 		pagesz_flag = pagesz_flags(alloc_sz);
 		fd = -1;
+		dirty = false;
 		mmap_flags = in_memory_flags | pagesz_flag;
 
 		/* single-file segments codepath will never be active
@@ -544,7 +573,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		map_offset = 0;
 	} else {
 		/* takes out a read lock on segment or segment list */
-		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx,
+				&dirty);
 		if (fd < 0) {
 			RTE_LOG(ERR, EAL, "Couldn't get fd on hugepage file\n");
 			return -1;
@@ -552,7 +582,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 
 		if (internal_conf->single_file_segments) {
 			map_offset = seg_idx * alloc_sz;
-			ret = resize_hugefile(fd, map_offset, alloc_sz, true);
+			ret = resize_hugefile(fd, map_offset, alloc_sz, true,
+					&dirty);
 			if (ret < 0)
 				goto resized;
 
@@ -662,6 +693,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	ms->nrank = rte_memory_get_nrank();
 	ms->iova = iova;
 	ms->socket_id = socket_id;
+	ms->flags = dirty ? RTE_MEMSEG_FLAG_DIRTY : 0;
 
 	return 0;
 
@@ -689,7 +721,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		return -1;
 
 	if (internal_conf->single_file_segments) {
-		resize_hugefile(fd, map_offset, alloc_sz, false);
+		resize_hugefile(fd, map_offset, alloc_sz, false, NULL);
 		/* ignore failure, can't make it any worse */
 
 		/* if refcount is at zero, close the file */
@@ -739,13 +771,13 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	 * segment and thus drop the lock on original fd, but hugepage dir is
 	 * now locked so we can take out another one without races.
 	 */
-	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx, NULL);
 	if (fd < 0)
 		return -1;
 
 	if (internal_conf->single_file_segments) {
 		map_offset = seg_idx * ms->len;
-		if (resize_hugefile(fd, map_offset, ms->len, false))
+		if (resize_hugefile(fd, map_offset, ms->len, false, NULL))
 			return -1;
 
 		if (--(fd_list[list_idx].count) == 0)
@@ -1743,6 +1775,12 @@ eal_memalloc_init(void)
 			RTE_LOG(ERR, EAL, "Using anonymous memory is not supported\n");
 			return -1;
 		}
+		/* safety net, should be impossible to configure */
+		if (internal_conf->hugepage_file.unlink_before_mapping &&
+				internal_conf->hugepage_file.keep_existing) {
+			RTE_LOG(ERR, EAL, "Unable both to keep existing hugepage files and to unlink them.\n");
+			return -1;
+		}
 	}
 
 	/* initialize all of the fd lists */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v1 6/6] eal: extend --huge-unlink for hugepage file reuse
  2022-01-17  8:07 ` [PATCH v1 0/6] Fast restart with many hugepages Dmitry Kozlyuk
                     ` (4 preceding siblings ...)
  2022-01-17  8:14   ` [PATCH v1 5/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
@ 2022-01-17  8:14   ` Dmitry Kozlyuk
  2022-01-17 14:27     ` Thomas Monjalon
  2022-01-17 16:40   ` [PATCH v1 0/6] Fast restart with many hugepages Bruce Richardson
  2022-01-19 21:09   ` [PATCH v2 " Dmitry Kozlyuk
  7 siblings, 1 reply; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-17  8:14 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov

Expose Linux EAL ability to reuse existing hugepage files
via --huge-unlink=never switch.
Default behavior is unchanged, it can also be specified
using --huge-unlink=existing for consistency.
Old --huge-unlink switch is kept,
it is an alias for --huge-unlink=always.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 doc/guides/linux_gsg/linux_eal_parameters.rst | 21 ++++++++--
 .../prog_guide/env_abstraction_layer.rst      |  9 +++++
 doc/guides/rel_notes/release_22_03.rst        |  7 ++++
 lib/eal/common/eal_common_options.c           | 39 +++++++++++++++++--
 4 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst
index 74df2611b5..7586f15ce3 100644
--- a/doc/guides/linux_gsg/linux_eal_parameters.rst
+++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
@@ -84,10 +84,23 @@ Memory-related options
     Use specified hugetlbfs directory instead of autodetected ones. This can be
     a sub-directory within a hugetlbfs mountpoint.
 
-*   ``--huge-unlink``
-
-    Unlink hugepage files after creating them (implies no secondary process
-    support).
+*   ``--huge-unlink[=existing|always|never]``
+
+    No ``--huge-unlink`` option or ``--huge-unlink=existing`` is the default:
+    existing hugepage files are removed and re-created
+    to ensure the kernel clears the memory and prevents any data leaks.
+
+    With ``--huge-unlink`` (no value) or ``--huge-unlink=always``,
+    hugepage files are also removed after creating them,
+    so that the application leaves no files in hugetlbfs.
+    This mode implies no multi-process support.
+
+    When ``--huge-unlink=never`` is specified, existing hugepage files
+    are not removed either before or after mapping them.
+    This makes restart faster by saving time to clear memory at initialization,
+    but it may slow down zeroed allocations later.
+    Reused hugepages can contain data from previous processes that used them,
+    which may be a security concern.
 
 *   ``--match-allocations``
 
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index bfe4594bf1..c7dc4a0e6a 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -277,6 +277,15 @@ to prevent data leaks from previous users of the same hugepage.
 EAL ensures this behavior by removing existing backing files at startup
 and by recreating them before opening for mapping (as a precaution).
 
+One exception is ``--huge-unlink=never`` mode.
+It is used to speed up EAL initialization, usually on application restart.
+Clearing memory constitutes more than 95% of hugepage mapping time.
+EAL can save it by remapping existing backing files
+with all the data left in the mapped hugepages ("dirty" memory).
+Such segments are marked with ``RTE_MEMSEG_FLAG_DIRTY``.
+Memory allocator detects dirty segments handles them accordingly,
+in particular, it clears memory requested with ``rte_zmalloc*()``.
+
 Anonymous mapping does not allow multi-process architecture,
 but it is free of filename conflicts and leftover files on hugetlbfs.
 If memfd_create(2) is supported both at build and run time,
diff --git a/doc/guides/rel_notes/release_22_03.rst b/doc/guides/rel_notes/release_22_03.rst
index 6d99d1eaa9..0b882362cf 100644
--- a/doc/guides/rel_notes/release_22_03.rst
+++ b/doc/guides/rel_notes/release_22_03.rst
@@ -55,6 +55,13 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added ability to reuse hugepages in Linux.**
+
+  It is possible to reuse files in hugetlbfs to speed up hugepage mapping,
+  which may be useful for fast restart and large allocations.
+  The new mode is activated with ``--huge-unlink=never``
+  and has security implications, refer to the user and programmer guides.
+
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index 7520ebda8e..905a7769bd 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -74,7 +74,7 @@ eal_long_options[] = {
 	{OPT_FILE_PREFIX,       1, NULL, OPT_FILE_PREFIX_NUM      },
 	{OPT_HELP,              0, NULL, OPT_HELP_NUM             },
 	{OPT_HUGE_DIR,          1, NULL, OPT_HUGE_DIR_NUM         },
-	{OPT_HUGE_UNLINK,       0, NULL, OPT_HUGE_UNLINK_NUM      },
+	{OPT_HUGE_UNLINK,       2, NULL, OPT_HUGE_UNLINK_NUM      },
 	{OPT_IOVA_MODE,	        1, NULL, OPT_IOVA_MODE_NUM        },
 	{OPT_LCORES,            1, NULL, OPT_LCORES_NUM           },
 	{OPT_LOG_LEVEL,         1, NULL, OPT_LOG_LEVEL_NUM        },
@@ -1596,6 +1596,28 @@ available_cores(void)
 	return str;
 }
 
+#define HUGE_UNLINK_NEVER "never"
+
+static int
+eal_parse_huge_unlink(const char *arg, struct hugepage_file_discipline *out)
+{
+	if (arg == NULL || strcmp(arg, "always") == 0) {
+		out->unlink_before_mapping = true;
+		return 0;
+	}
+	if (strcmp(arg, "existing") == 0) {
+		/* same as not specifying the option */
+		return 0;
+	}
+	if (strcmp(arg, HUGE_UNLINK_NEVER) == 0) {
+		RTE_LOG(WARNING, EAL, "Using --"OPT_HUGE_UNLINK"="
+			HUGE_UNLINK_NEVER" may create data leaks.\n");
+		out->keep_existing = true;
+		return 0;
+	}
+	return -1;
+}
+
 int
 eal_parse_common_option(int opt, const char *optarg,
 			struct internal_config *conf)
@@ -1737,7 +1759,10 @@ eal_parse_common_option(int opt, const char *optarg,
 
 	/* long options */
 	case OPT_HUGE_UNLINK_NUM:
-		conf->hugepage_file.unlink_before_mapping = true;
+		if (eal_parse_huge_unlink(optarg, &conf->hugepage_file) < 0) {
+			RTE_LOG(ERR, EAL, "invalid --"OPT_HUGE_UNLINK" option\n");
+			return -1;
+		}
 		break;
 
 	case OPT_NO_HUGE_NUM:
@@ -2068,6 +2093,12 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"not compatible with --"OPT_HUGE_UNLINK"\n");
 		return -1;
 	}
+	if (internal_cfg->hugepage_file.keep_existing &&
+			internal_cfg->in_memory) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_IN_MEMORY" is not compatible "
+			"with --"OPT_HUGE_UNLINK"="HUGE_UNLINK_NEVER"\n");
+		return -1;
+	}
 	if (internal_cfg->legacy_mem &&
 			internal_cfg->in_memory) {
 		RTE_LOG(ERR, EAL, "Option --"OPT_LEGACY_MEM" is not compatible "
@@ -2200,7 +2231,9 @@ eal_common_usage(void)
 	       "  --"OPT_NO_TELEMETRY"   Disable telemetry support\n"
 	       "  --"OPT_FORCE_MAX_SIMD_BITWIDTH" Force the max SIMD bitwidth\n"
 	       "\nEAL options for DEBUG use only:\n"
-	       "  --"OPT_HUGE_UNLINK"       Unlink hugepage files after init\n"
+	       "  --"OPT_HUGE_UNLINK"[=existing|always|never]\n"
+	       "                      When to unlink files in hugetlbfs\n"
+	       "                      ('existing' by default, no value means 'always')\n"
 	       "  --"OPT_NO_HUGE"           Use malloc instead of hugetlbfs\n"
 	       "  --"OPT_NO_PCI"            Disable PCI\n"
 	       "  --"OPT_NO_HPET"           Disable HPET\n"
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 1/6] doc: add hugepage mapping details
  2022-01-17  8:07   ` [PATCH v1 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
@ 2022-01-17  9:20     ` Thomas Monjalon
  0 siblings, 0 replies; 53+ messages in thread
From: Thomas Monjalon @ 2022-01-17  9:20 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Anatoly Burakov, david.marchand, bruce.richardson

Thanks for the nice addition to the documentation, this is really needed.
Some comments below.

17/01/2022 09:07, Dmitry Kozlyuk:
> --- a/doc/guides/prog_guide/env_abstraction_layer.rst
> +++ b/doc/guides/prog_guide/env_abstraction_layer.rst
> -    Memory reservations done using the APIs provided by rte_malloc are also backed by pages from the hugetlbfs filesystem.
> +    Memory reservations done using the APIs provided by rte_malloc are also backed by hugepages.

Should we mention except if --no-huge is used?

> +Hugepage Mapping
> +^^^^^^^^^^^^^^^^
> +
> +Below is an overview of methods used for each OS to obtain hugepages,
> +explaining why certain limitations and options exist in EAL.
> +See the user guide for a specific OS for configuration details.
> +
> +FreeBSD uses ``contigmem`` kernel module
> +to reserve a fixed number of hugepages at system start,
> +which are mapped by EAL at initialization using a specific ``sysctl()``.
> +
> +Windows EAL allocates hugepages from the OS as needed using Win32 API,
> +so available amount depends on the system load.
> +It uses ``virt2phys`` kernel module to obtain physical addresses,
> +unless running in IOVA-as-VA mode (e.g. forced with ``--iova-mode=va``).
> +
> +Linux implements a variety of methods:
> +
> +* mapping each hugepage from its own file in hugetlbfs;
> +* mapping multiple hugepages from a shared file in hugetlbfs;
> +* anonymous mapping.
> +
> +Mapping hugepages from files in hugetlbfs is essential for multi-process,
> +because secondary processes need to map the same hugepages.
> +EAL creates files like ``rtemap_0``
> +in directories specified with ``--huge-dir`` option
> +(or in the mount point for a specific hugepage size).
> +The ``rtemap_`` prefix can be changed using ``--file-prefix``.
> +This may be needed for running multiple primary processes
> +that share a hugetlbfs mount point.
> +Each backing file by default corresponds to one hugepage,
> +it is opened and locked for the entire time the hugepage is used.
> +See :ref:`segment-file-descriptors` section
> +on how the number of open backing file descriptors can be reduced.
> +
> +Backing files may persist after the corresponding hugepage is freed
> +and even after the application terminates,
> +reducing the number of hugepages available to other processes.
> +EAL removes existing files at startup
> +and can remove newly created files before mapping them with ``--huge-unlink``.

This sentence require more explanations, as it is not clear when and why.

> +However, since it disables multi-process anyway,
> +using anonymous mapping (``--in-memory``) is recommended instead.
> +
> +:ref:`EAL memory allocator <malloc>` relies on hugepages being zero-filled.
> +Hugepages are cleared by the kernel when a file in hugetlbfs or its part
> +is mapped for the first time system-wide
> +to prevent data leaks from previous users of the same hugepage.
> +EAL ensures this behavior by removing existing backing files at startup
> +and by recreating them before opening for mapping (as a precaution).
> +
> +Anonymous mapping does not allow multi-process architecture,
> +but it is free of filename conflicts and leftover files on hugetlbfs.

It is also easier to run as non-root.

> +If memfd_create(2) is supported both at build and run time,
> +DPDK memory manager can provide file descriptors for memory segments,
> +which are required for VirtIO with vhost-user backend.
> +This means open file descriptor issues may also affect this mode,
> +with the same solution.

This is not clear. Which issues? Which mode? Which solution?




^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 3/6] mem: add dirty malloc element support
  2022-01-17  8:07   ` [PATCH v1 3/6] mem: add dirty malloc element support Dmitry Kozlyuk
@ 2022-01-17 14:07     ` Thomas Monjalon
  0 siblings, 0 replies; 53+ messages in thread
From: Thomas Monjalon @ 2022-01-17 14:07 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Anatoly Burakov

17/01/2022 09:07, Dmitry Kozlyuk:
> EAL malloc layer assumed all free elements content
> is filled with zeros ("clean"), as opposed to uninitialized ("dirty").
> This assumption was ensured in two ways:
> 1. EAL memalloc layer always returned clean memory.
> 2. Freed memory was cleared before returning into the heap.
> 
> Clearing the memory can be as slow as around 14 GiB/s.
> To save doing so, memalloc layer is allowed to return dirty memory.
> Such segments being marked with RTE_MEMSEG_FLAG_DIRTY.
> The allocator tracks elements that contain dirty memory
> using the new flag in the element header.
> When clean memory is requested via rte_zmalloc*()
> and the suitable element is dirty, it is cleared on allocation.
> When memory is deallocated, the freed element is joined
> with adjacent free elements, and the dirty flag is updated:
> 
>     dirty + freed + dirty = dirty  =>  no need to clean
>             freed + dirty = dirty      the freed memory

It is not said why dirty parts are not cleaned.

> 
>     clean + freed + clean = clean  =>  freed memory
>     clean + freed         = clean      must be cleared
>             freed + clean = clean
>             freed         = clean
> 
> As a result, memory is either cleared on free, as before,
> or it will be cleared on allocation if need be, but never twice.

It is not said whether it is a change for everybody,
or only when enabling an option.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 4/6] eal: refactor --huge-unlink storage
  2022-01-17  8:07   ` [PATCH v1 4/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
@ 2022-01-17 14:10     ` Thomas Monjalon
  0 siblings, 0 replies; 53+ messages in thread
From: Thomas Monjalon @ 2022-01-17 14:10 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Anatoly Burakov

17/01/2022 09:07, Dmitry Kozlyuk:
> In preparation to extend --huge-unlink option semantics
> refactor how it is stored in the internal configuration.
> It makes future changes more isolated.
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> ---
> +/** Hugepage backing files discipline. */
> +struct hugepage_file_discipline {
> +	/** Unlink files before mapping them to leave no trace in hugetlbfs. */
> +	bool unlink_before_mapping;
> +};
[...]
> -	unsigned hugepage_unlink;         /**< true to unlink backing files */
> +	struct hugepage_file_discipline hugepage_file;

That's clearer, thanks.

Acked-by: Thomas Monjalon <thomas@monjalon.net>



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 5/6] eal/linux: allow hugepage file reuse
  2022-01-17  8:14   ` [PATCH v1 5/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
@ 2022-01-17 14:24     ` Thomas Monjalon
  0 siblings, 0 replies; 53+ messages in thread
From: Thomas Monjalon @ 2022-01-17 14:24 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Anatoly Burakov

17/01/2022 09:14, Dmitry Kozlyuk:
> Linux EAL ensured that mapped hugepages are clean
> by always mapping from newly created files:
> existing hugepage backing files were always removed.
> In this case, the kernel clears the page to prevent data leaks,
> because the mapped memory may contain leftover data
> from the previous process that was using this memory.
> Clearing takes the bulk of the time spent in mmap(2),
> increasing EAL initialization time.
> 
> Introduce a mode to keep existing files and reuse them
> in order to speed up initial memory allocation in EAL.
> Hugepages mapped from such files may contain data
> left by the previous process that used this memory,
> so RTE_MEMSEG_FLAG_DIRTY is set for their segments.
> If multiple hugepages are mapped from the same file:
> 1. When fallocate(2) is used, all memory mapped from this file
>    is considered dirty, because it is unknown
>    which parts of the file are holes.
> 2. When ftruncate(3) is used, memory mapped from this file
>    is considered dirty unless the file is extended
>    to create a new mapping, which implies clean memory.
[...]
>  struct hugepage_file_discipline {
>  	/** Unlink files before mapping them to leave no trace in hugetlbfs. */
>  	bool unlink_before_mapping;
> +	/** Reuse existing files, never delete or re-create them. */
> +	bool keep_existing;
>  };

That's a bit confusing to mix "unlink" and "keep".
I would prefer focusing on what is done, i.e. unlink when.
I like "unlink_before_mapping" because it is a real action.
The other action should be "unlink_existing" or "unlink_before_creating".



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 6/6] eal: extend --huge-unlink for hugepage file reuse
  2022-01-17  8:14   ` [PATCH v1 6/6] eal: extend --huge-unlink for " Dmitry Kozlyuk
@ 2022-01-17 14:27     ` Thomas Monjalon
  0 siblings, 0 replies; 53+ messages in thread
From: Thomas Monjalon @ 2022-01-17 14:27 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Anatoly Burakov

17/01/2022 09:14, Dmitry Kozlyuk:
> Expose Linux EAL ability to reuse existing hugepage files
> via --huge-unlink=never switch.
> Default behavior is unchanged, it can also be specified
> using --huge-unlink=existing for consistency.
> Old --huge-unlink switch is kept,
> it is an alias for --huge-unlink=always.
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> ---
>  doc/guides/linux_gsg/linux_eal_parameters.rst | 21 ++++++++--
>  .../prog_guide/env_abstraction_layer.rst      |  9 +++++
>  doc/guides/rel_notes/release_22_03.rst        |  7 ++++
>  lib/eal/common/eal_common_options.c           | 39 +++++++++++++++++--
>  4 files changed, 69 insertions(+), 7 deletions(-)
> 
> diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst
> index 74df2611b5..7586f15ce3 100644
> --- a/doc/guides/linux_gsg/linux_eal_parameters.rst
> +++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
> @@ -84,10 +84,23 @@ Memory-related options
>      Use specified hugetlbfs directory instead of autodetected ones. This can be
>      a sub-directory within a hugetlbfs mountpoint.
>  
> -*   ``--huge-unlink``
> -
> -    Unlink hugepage files after creating them (implies no secondary process
> -    support).
> +*   ``--huge-unlink[=existing|always|never]``
> +
> +    No ``--huge-unlink`` option or ``--huge-unlink=existing`` is the default:
> +    existing hugepage files are removed and re-created
> +    to ensure the kernel clears the memory and prevents any data leaks.
> +
> +    With ``--huge-unlink`` (no value) or ``--huge-unlink=always``,
> +    hugepage files are also removed after creating them,
> +    so that the application leaves no files in hugetlbfs.
> +    This mode implies no multi-process support.
> +
> +    When ``--huge-unlink=never`` is specified, existing hugepage files
> +    are not removed either before or after mapping them.

One detail not clear: the second unlink is before or after mapping?

> +    This makes restart faster by saving time to clear memory at initialization,
> +    but it may slow down zeroed allocations later.
> +    Reused hugepages can contain data from previous processes that used them,
> +    which may be a security concern.

I absolutely love these options.
It keeps compability while making things consistent and understandable.

Acked-by: Thomas Monjalon <thomas@monjalon.net>



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 2/6] app/test: add allocator performance benchmark
  2022-01-17  8:07   ` [PATCH v1 2/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
@ 2022-01-17 15:47     ` Bruce Richardson
  2022-01-17 15:51       ` Bruce Richardson
  2022-01-17 16:06     ` Aaron Conole
  1 sibling, 1 reply; 53+ messages in thread
From: Bruce Richardson @ 2022-01-17 15:47 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Aaron Conole, Viacheslav Ovsiienko

On Mon, Jan 17, 2022 at 10:07:57AM +0200, Dmitry Kozlyuk wrote:
> Memory allocator performance is crucial to applications that deal
> with large amount of memory or allocate frequently. DPDK allocator
> performance is affected by EAL options, API used and, at least,
> allocation size. New autotest is intended to be run with different
> EAL options. It measures performance with a range of sizes
> for dirrerent APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.
> 
> Work distribution between allocation and deallocation depends on EAL
> options. The test prints both times and total time to ease comparison.
> 
> Memory can be filled with zeroes at different points of allocation path,
> but it always takes considerable fraction of overall timing. This is why
> the test measures filling speed and prints how long clearing takes
> for each size as a reference (for rte_memzone_reserve estimations
> are printed).
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> ---
What is the expected running time of this test? When I tried it out on my
machine it appears to hang after the following output:

	USER1:         4096   10000        3.44        1.11        4.56             0.67
	USER1:        65536   10000       21.85       14.75       36.60             9.38
	USER1:      1048576   10000      481.40      329.96      811.36           147.62


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 2/6] app/test: add allocator performance benchmark
  2022-01-17 15:47     ` Bruce Richardson
@ 2022-01-17 15:51       ` Bruce Richardson
  2022-01-19 21:12         ` Dmitry Kozlyuk
  0 siblings, 1 reply; 53+ messages in thread
From: Bruce Richardson @ 2022-01-17 15:51 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Aaron Conole, Viacheslav Ovsiienko

On Mon, Jan 17, 2022 at 03:47:41PM +0000, Bruce Richardson wrote:
> On Mon, Jan 17, 2022 at 10:07:57AM +0200, Dmitry Kozlyuk wrote:
> > Memory allocator performance is crucial to applications that deal
> > with large amount of memory or allocate frequently. DPDK allocator
> > performance is affected by EAL options, API used and, at least,
> > allocation size. New autotest is intended to be run with different
> > EAL options. It measures performance with a range of sizes
> > for dirrerent APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.
> > 
> > Work distribution between allocation and deallocation depends on EAL
> > options. The test prints both times and total time to ease comparison.
> > 
> > Memory can be filled with zeroes at different points of allocation path,
> > but it always takes considerable fraction of overall timing. This is why
> > the test measures filling speed and prints how long clearing takes
> > for each size as a reference (for rte_memzone_reserve estimations
> > are printed).
> > 
> > Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> > Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> > ---
> What is the expected running time of this test? When I tried it out on my
> machine it appears to hang after the following output:
> 
> 	USER1:         4096   10000        3.44        1.11        4.56             0.67
> 	USER1:        65536   10000       21.85       14.75       36.60             9.38
> 	USER1:      1048576   10000      481.40      329.96      811.36           147.62
> 
Just realised I stripped a bit too much context here, including section
title too:

USER1: Performance: rte_malloc
USER1:     Size (B)    Runs  Alloc (us)   Free (us)  Total (us)      memset (us)
USER1:           64   10000        0.10        0.04        0.14             0.02
USER1:          128   10000        0.14        0.05        0.20             0.01
USER1:         1024   10000        2.39        0.15        2.54             0.06
USER1:         4096   10000        3.44        1.11        4.56             0.67
USER1:        65536   10000       21.85       14.75       36.60             9.38
USER1:      1048576   10000      481.40      329.96      811.36           147.62


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 2/6] app/test: add allocator performance benchmark
  2022-01-17  8:07   ` [PATCH v1 2/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
  2022-01-17 15:47     ` Bruce Richardson
@ 2022-01-17 16:06     ` Aaron Conole
  1 sibling, 0 replies; 53+ messages in thread
From: Aaron Conole @ 2022-01-17 16:06 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Viacheslav Ovsiienko

Dmitry Kozlyuk <dkozlyuk@nvidia.com> writes:

> Memory allocator performance is crucial to applications that deal
> with large amount of memory or allocate frequently. DPDK allocator
> performance is affected by EAL options, API used and, at least,
> allocation size. New autotest is intended to be run with different
> EAL options. It measures performance with a range of sizes
> for dirrerent APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.
>
> Work distribution between allocation and deallocation depends on EAL
> options. The test prints both times and total time to ease comparison.
>
> Memory can be filled with zeroes at different points of allocation path,
> but it always takes considerable fraction of overall timing. This is why
> the test measures filling speed and prints how long clearing takes
> for each size as a reference (for rte_memzone_reserve estimations
> are printed).
>
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> ---

Thanks for making the changes.

Acked-by: Aaron Conole <aconole@redhat.com>

>  app/test/meson.build        |   2 +
>  app/test/test_malloc_perf.c | 174 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 176 insertions(+)
>  create mode 100644 app/test/test_malloc_perf.c
>
> diff --git a/app/test/meson.build b/app/test/meson.build
> index 344a609a4d..50cf2602a9 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -88,6 +88,7 @@ test_sources = files(
>          'test_lpm6_perf.c',
>          'test_lpm_perf.c',
>          'test_malloc.c',
> +        'test_malloc_perf.c',
>          'test_mbuf.c',
>          'test_member.c',
>          'test_member_perf.c',
> @@ -295,6 +296,7 @@ extra_test_names = [
>  
>  perf_test_names = [
>          'ring_perf_autotest',
> +        'malloc_perf_autotest',
>          'mempool_perf_autotest',
>          'memcpy_perf_autotest',
>          'hash_perf_autotest',
> diff --git a/app/test/test_malloc_perf.c b/app/test/test_malloc_perf.c
> new file mode 100644
> index 0000000000..9686fc8af5
> --- /dev/null
> +++ b/app/test/test_malloc_perf.c
> @@ -0,0 +1,174 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2021 NVIDIA Corporation & Affiliates
> + */
> +
> +#include <inttypes.h>
> +#include <string.h>
> +#include <rte_cycles.h>
> +#include <rte_errno.h>
> +#include <rte_malloc.h>
> +#include <rte_memzone.h>
> +
> +#include "test.h"
> +
> +#define TEST_LOG(level, ...) RTE_LOG(level, USER1, __VA_ARGS__)
> +
> +typedef void * (alloc_t)(const char *name, size_t size, unsigned int align);
> +typedef void (free_t)(void *addr);
> +typedef void * (memset_t)(void *addr, int value, size_t size);
> +
> +static const uint64_t KB = 1 << 10;
> +static const uint64_t GB = 1 << 30;
> +
> +static double
> +tsc_to_us(uint64_t tsc, size_t runs)
> +{
> +	return (double)tsc / rte_get_tsc_hz() * US_PER_S / runs;
> +}
> +
> +static int
> +test_memset_perf(double *us_per_gb)
> +{
> +	static const size_t RUNS = 20;
> +
> +	void *ptr;
> +	size_t i;
> +	uint64_t tsc;
> +
> +	TEST_LOG(INFO, "Reference: memset\n");
> +
> +	ptr = rte_malloc(NULL, GB, 0);
> +	if (ptr == NULL) {
> +		TEST_LOG(ERR, "rte_malloc(size=%"PRIx64") failed\n", GB);
> +		return -1;
> +	}
> +
> +	tsc = rte_rdtsc_precise();
> +	for (i = 0; i < RUNS; i++)
> +		memset(ptr, 0, GB);
> +	tsc = rte_rdtsc_precise() - tsc;
> +
> +	*us_per_gb = tsc_to_us(tsc, RUNS);
> +	TEST_LOG(INFO, "Result: %f.3 GiB/s <=> %.2f us/MiB\n",
> +			US_PER_S / *us_per_gb, *us_per_gb / KB);
> +
> +	rte_free(ptr);
> +	TEST_LOG(INFO, "\n");
> +	return 0;
> +}
> +
> +static int
> +test_alloc_perf(const char *name, alloc_t *alloc_fn, free_t *free_fn,
> +		memset_t *memset_fn, double memset_gb_us, size_t max_runs)
> +{
> +	static const size_t SIZES[] = {
> +			1 << 6, 1 << 7, 1 << 10, 1 << 12, 1 << 16, 1 << 20,
> +			1 << 21, 1 << 22, 1 << 24, 1 << 30 };
> +
> +	size_t i, j;
> +	void **ptrs;
> +
> +	TEST_LOG(INFO, "Performance: %s\n", name);
> +
> +	ptrs = calloc(max_runs, sizeof(ptrs[0]));
> +	if (ptrs == NULL) {
> +		TEST_LOG(ERR, "Cannot allocate memory for pointers");
> +		return -1;
> +	}
> +
> +	TEST_LOG(INFO, "%12s%8s%12s%12s%12s%17s\n", "Size (B)", "Runs",
> +			"Alloc (us)", "Free (us)", "Total (us)",
> +			memset_fn != NULL ? "memset (us)" : "est.memset (us)");
> +	for (i = 0; i < RTE_DIM(SIZES); i++) {
> +		size_t size = SIZES[i];
> +		size_t runs_done;
> +		uint64_t tsc_start, tsc_alloc, tsc_memset = 0, tsc_free;
> +		double alloc_time, free_time, memset_time;
> +
> +		tsc_start = rte_rdtsc_precise();
> +		for (j = 0; j < max_runs; j++) {
> +			ptrs[j] = alloc_fn(NULL, size, 0);
> +			if (ptrs[j] == NULL)
> +				break;
> +		}
> +		tsc_alloc = rte_rdtsc_precise() - tsc_start;
> +
> +		if (j == 0) {
> +			TEST_LOG(INFO, "%12zu Interrupted: out of memory.\n",
> +					size);
> +			break;
> +		}
> +		runs_done = j;
> +
> +		if (memset_fn != NULL) {
> +			tsc_start = rte_rdtsc_precise();
> +			for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
> +				memset_fn(ptrs[j], 0, size);
> +			tsc_memset = rte_rdtsc_precise() - tsc_start;
> +		}
> +
> +		tsc_start = rte_rdtsc_precise();
> +		for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
> +			free_fn(ptrs[j]);
> +		tsc_free = rte_rdtsc_precise() - tsc_start;
> +
> +		alloc_time = tsc_to_us(tsc_alloc, runs_done);
> +		free_time = tsc_to_us(tsc_free, runs_done);
> +		memset_time = memset_fn != NULL ?
> +				tsc_to_us(tsc_memset, runs_done) :
> +				memset_gb_us * size / GB;
> +		TEST_LOG(INFO, "%12zu%8zu%12.2f%12.2f%12.2f%17.2f\n",
> +				size, runs_done, alloc_time, free_time,
> +				alloc_time + free_time, memset_time);
> +
> +		memset(ptrs, 0, max_runs * sizeof(ptrs[0]));
> +	}
> +
> +	free(ptrs);
> +	TEST_LOG(INFO, "\n");
> +	return 0;
> +}
> +
> +static void *
> +memzone_alloc(const char *name __rte_unused, size_t size, unsigned int align)
> +{
> +	const struct rte_memzone *mz;
> +	char gen_name[RTE_MEMZONE_NAMESIZE];
> +
> +	snprintf(gen_name, sizeof(gen_name), "test-mz-%"PRIx64, rte_rdtsc());
> +	mz = rte_memzone_reserve_aligned(gen_name, size, SOCKET_ID_ANY,
> +			RTE_MEMZONE_1GB | RTE_MEMZONE_SIZE_HINT_ONLY, align);
> +	return (void *)(uintptr_t)mz;
> +}
> +
> +static void
> +memzone_free(void *addr)
> +{
> +	rte_memzone_free((struct rte_memzone *)addr);
> +}
> +
> +static int
> +test_malloc_perf(void)
> +{
> +	static const size_t MAX_RUNS = 10000;
> +
> +	double memset_us_gb;
> +
> +	if (test_memset_perf(&memset_us_gb) < 0)
> +		return -1;
> +
> +	if (test_alloc_perf("rte_malloc", rte_malloc, rte_free, memset,
> +			memset_us_gb, MAX_RUNS) < 0)
> +		return -1;
> +	if (test_alloc_perf("rte_zmalloc", rte_zmalloc, rte_free, memset,
> +			memset_us_gb, MAX_RUNS) < 0)
> +		return -1;
> +
> +	if (test_alloc_perf("rte_memzone_reserve", memzone_alloc, memzone_free,
> +			NULL, memset_us_gb, RTE_MAX_MEMZONE - 1) < 0)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +REGISTER_TEST_COMMAND(malloc_perf_autotest, test_malloc_perf);


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 0/6] Fast restart with many hugepages
  2022-01-17  8:07 ` [PATCH v1 0/6] Fast restart with many hugepages Dmitry Kozlyuk
                     ` (5 preceding siblings ...)
  2022-01-17  8:14   ` [PATCH v1 6/6] eal: extend --huge-unlink for " Dmitry Kozlyuk
@ 2022-01-17 16:40   ` Bruce Richardson
  2022-01-19 21:12     ` Dmitry Kozlyuk
  2022-01-19 21:09   ` [PATCH v2 " Dmitry Kozlyuk
  7 siblings, 1 reply; 53+ messages in thread
From: Bruce Richardson @ 2022-01-17 16:40 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Anatoly Burakov, Viacheslav Ovsiienko, David Marchand,
	Thomas Monjalon, Lior Margalit

On Mon, Jan 17, 2022 at 10:07:55AM +0200, Dmitry Kozlyuk wrote:
> This patchset is a new design and implementation of [1].
> Changes since RFC:
> * Fix bugs with -m and --single-file-segments.
> * Reject optimization of mmap() call number (see below).
> 
> # Problem Statement
> 
> Large allocations that involve mapping new hugepages are slow.
> This is problematic, for example, in the following use case.
> A single-process application allocates ~1TB of mempools at startup.
> Sometimes the app needs to restart as quick as possible.
> Allocating the hugepages anew takes as long as 15 seconds,
> while the new process could just pick up all the memory
> left by the old one (reinitializing the contents as needed).
> 
> Almost all of mmap(2) time spent in the kernel
> is clearing the memory, i.e. filling it with zeros.
> This is done if a file in hugetlbfs is mapped
> for the first time system-wide, i.e. a hugepage is committed
> to prevent data leaks from the previous users of the same hugepage.
> For example, mapping 32 GB from a new file may take 2.16 seconds,
> while mapping the same pages again takes only 0.3 ms.
> Security put aside, e.g. when the environment is controlled,
> this effort is wasted for the memory intended for DMA,
> because its content will be overwritten anyway.
> 
> Linux EAL explicitly removes hugetlbfs files at initialization
> and before mapping to force the kernel clear the memory.
> This allows the memory allocator to clean memory on only on freeing.
> 
> # Solution
> 
> Add a new mode allowing EAL to remap existing hugepage files.
> While it is intended to make restarts faster in the first place,
> it makes any startup faster except the cold one
> (with no existing files).
> 
> It is the administrator who accepts security risks
> implied by reusing hugepages.
> The new mode is an opt-in and a warning is logged.
> 
> The feature is Linux-only as it is related
> to mapping hugepages from files which only Linux does.
> It is inherently incompatible with --in-memory,
> for --huge-unlink see below.
> 
> There is formally no breakage of API contract,
> but there is a behavior change in the new mode:
> rte_malloc*() and rte_memzone_reserve*() may return dirty memory
> (previously they were returning clean memory from free heap elements).
> Their contract has always explicitly allowed this,
> but still there may be users relying on the traditional behavior.
> Such users will need to fix their code to use the new mode.
> 
> # Implementation
> 
> ## User Interface
> 
> There is --huge-unlink switch in the same area to remove hugepage files
> before mapping them. It is infeasible to use with the new mode,
> because the point is to keep hugepage files for fast future restarts.
> Extend --huge-unlink option to represent only valid combinations:
> 
> * --huge-unlink=existing OR no option (for compatibility):
>   unlink files at initialization
>   and before opening them as a precaution.
> 
> * --huge-unlink=always OR just --huge-unlink (for compatibility):
>   same as above + unlink created files before mapping.
> 
> * --huge-unlink=never:
>   the new mode, do not unlink hugepages files, reuse them.
> 
> This option was always Linux-only, but it is kept as common
> in case there are users who expect it to be a no-op on other systems.
> (Adding a separate --huge-reuse option was also considered,
> but there is no obvious benefit and more combinations to test.)
> 
> ## EAL
> 
> If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
> so that the memory allocator may clear the memory if need be.
> See patch 5/6 description for details how this is done
> in different memory mapping modes.
> 
> The memory manager tracks whether an element is clean or dirty.
> If rte_zmalloc*() allocates from a dirty element,
> the memory is cleared before handling it to the user.
> On freeing, the allocator joins adjacent free elements,
> but in the new mode it may not be feasible to clear the free memory
> if the joint element is dirty (contains dirty parts).
> In any case, memory will be cleared only once,
> either on freeing or on allocation.
> See patch 3/6 for details.
> Patch 2/6 adds a benchmark to see how time is distributed
> between allocation and freeing in different modes.
> 
> Besides clearing memory, each mmap() call takes some time.
> For example, 1024 calls for 1 TB may take ~300 ms.
> The time of one call mapping N hugepages is O(N),
> because inside the kernel hugepages are allocated ony by one.
> Syscall overhead is negligeable even for one page.
> Hence, it does not make sense to reduce the number of mmap() calls,
> which would essentially move the loop over pages into the kernel.
> 
> [1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozlyuk@nvidia.com/
> 

Hi,

this seems really interesting, but in the absense of TB of memory being
used, is it easily possible to see the benefits of this work? I've been
playing with adding large memory allocations to helloworld example and
checking the runtime. Allocating 1GB using malloc per thread seems to show
a small (<0.5 second at most) benefit, and using a fixed 10GB allocation
using memzone_reserve at startup shows runtimes within the margin of error
when run with --huge-unlink=existing vs huge-unlink=never. At what size of
memory footprint is it expected to make a clear improvement?

Thanks,
/Bruce

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v2 0/6] Fast restart with many hugepages
  2022-01-17  8:07 ` [PATCH v1 0/6] Fast restart with many hugepages Dmitry Kozlyuk
                     ` (6 preceding siblings ...)
  2022-01-17 16:40   ` [PATCH v1 0/6] Fast restart with many hugepages Bruce Richardson
@ 2022-01-19 21:09   ` Dmitry Kozlyuk
  2022-01-19 21:09     ` [PATCH v2 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
                       ` (8 more replies)
  7 siblings, 9 replies; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-19 21:09 UTC (permalink / raw)
  To: dev
  Cc: Bruce Richardson, Anatoly Burakov, Viacheslav Ovsiienko,
	David Marchand, Thomas Monjalon, Lior Margalit

This patchset is a new design and implementation of [1].

v2:
  * Fix hugepage file removal when they are no longer used.
    Disable removal with --huge-unlink=never as intended.
    Document this behavior difference. (Bruce)
  * Improve documentation, commit messages, and naming. (Thomas)

# Problem Statement

Large allocations that involve mapping new hugepages are slow.
This is problematic, for example, in the following use case.
A single-process application allocates ~1TB of mempools at startup.
Sometimes the app needs to restart as quick as possible.
Allocating the hugepages anew takes as long as 15 seconds,
while the new process could just pick up all the memory
left by the old one (reinitializing the contents as needed).

Almost all of mmap(2) time spent in the kernel
is clearing the memory, i.e. filling it with zeros.
This is done if a file in hugetlbfs is mapped
for the first time system-wide, i.e. a hugepage is committed
to prevent data leaks from the previous users of the same hugepage.
For example, mapping 32 GB from a new file may take 2.16 seconds,
while mapping the same pages again takes only 0.3 ms.
Security put aside, e.g. when the environment is controlled,
this effort is wasted for the memory intended for DMA,
because its content will be overwritten anyway.

Linux EAL explicitly removes hugetlbfs files at initialization
and before mapping to force the kernel clear the memory.
This allows the memory allocator to clean memory on only on freeing.

# Solution

Add a new mode allowing EAL to remap existing hugepage files.
While it is intended to make restarts faster in the first place,
it makes any startup faster except the cold one
(with no existing files).

It is the administrator who accepts security risks
implied by reusing hugepages.
The new mode is an opt-in and a warning is logged.

The feature is Linux-only as it is related
to mapping hugepages from files which only Linux does.
It is inherently incompatible with --in-memory,
for --huge-unlink see below.

There is formally no breakage of API contract,
but there is a behavior change in the new mode:
rte_malloc*() and rte_memzone_reserve*() may return dirty memory
(previously they were returning clean memory from free heap elements).
Their contract has always explicitly allowed this,
but still there may be users relying on the traditional behavior.
Such users will need to fix their code to use the new mode.

# Implementation

## User Interface

There is --huge-unlink switch in the same area to remove hugepage files
before mapping them. It is infeasible to use with the new mode,
because the point is to keep hugepage files for fast future restarts.
Extend --huge-unlink option to represent only valid combinations:

* --huge-unlink=existing OR no option (for compatibility):
  unlink files at initialization
  and before opening them as a precaution.

* --huge-unlink=always OR just --huge-unlink (for compatibility):
  same as above + unlink created files before mapping.

* --huge-unlink=never:
  the new mode, do not unlink hugepages files, reuse them.

This option was always Linux-only, but it is kept as common
in case there are users who expect it to be a no-op on other systems.
(Adding a separate --huge-reuse option was also considered,
but there is no obvious benefit and more combinations to test.)

## EAL

If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
so that the memory allocator may clear the memory if need be.
See patch 5/6 description for details how this is done
in different memory mapping modes.

The memory manager tracks whether an element is clean or dirty.
If rte_zmalloc*() allocates from a dirty element,
the memory is cleared before handling it to the user.
On freeing, the allocator joins adjacent free elements,
but in the new mode it may not be feasible to clear the free memory
if the joint element is dirty (contains dirty parts).
In any case, memory will be cleared only once,
either on freeing or on allocation.
See patch 3/6 for details.
Patch 2/6 adds a benchmark to see how time is distributed
between allocation and freeing in different modes.

Besides clearing memory, each mmap() call takes some time.
For example, 1024 calls for 1 TB may take ~300 ms.
The time of one call mapping N hugepages is O(N),
because inside the kernel hugepages are allocated ony by one.
Syscall overhead is negligeable even for one page.
Hence, it does not make sense to reduce the number of mmap() calls,
which would essentially move the loop over pages into the kernel.

[1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozlyuk@nvidia.com/

Dmitry Kozlyuk (6):
  doc: add hugepage mapping details
  app/test: add allocator performance benchmark
  mem: add dirty malloc element support
  eal: refactor --huge-unlink storage
  eal/linux: allow hugepage file reuse
  eal: extend --huge-unlink for hugepage file reuse

 app/test/meson.build                          |   2 +
 app/test/test_eal_flags.c                     |  25 +++
 app/test/test_malloc_perf.c                   | 174 ++++++++++++++++++
 doc/guides/linux_gsg/linux_eal_parameters.rst |  24 ++-
 .../prog_guide/env_abstraction_layer.rst      | 107 ++++++++++-
 doc/guides/rel_notes/release_22_03.rst        |   7 +
 lib/eal/common/eal_common_options.c           |  48 ++++-
 lib/eal/common/eal_internal_cfg.h             |  10 +-
 lib/eal/common/malloc_elem.c                  |  22 ++-
 lib/eal/common/malloc_elem.h                  |  11 +-
 lib/eal/common/malloc_heap.c                  |  18 +-
 lib/eal/common/rte_malloc.c                   |  21 ++-
 lib/eal/include/rte_memory.h                  |   8 +-
 lib/eal/linux/eal.c                           |   3 +-
 lib/eal/linux/eal_hugepage_info.c             | 118 +++++++++---
 lib/eal/linux/eal_memalloc.c                  | 173 ++++++++++-------
 lib/eal/linux/eal_memory.c                    |   2 +-
 17 files changed, 644 insertions(+), 129 deletions(-)
 create mode 100644 app/test/test_malloc_perf.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v2 1/6] doc: add hugepage mapping details
  2022-01-19 21:09   ` [PATCH v2 " Dmitry Kozlyuk
@ 2022-01-19 21:09     ` Dmitry Kozlyuk
  2022-01-27 13:59       ` Bruce Richardson
  2022-01-19 21:09     ` [PATCH v2 2/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
                       ` (7 subsequent siblings)
  8 siblings, 1 reply; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-19 21:09 UTC (permalink / raw)
  To: dev; +Cc: Bruce Richardson, Anatoly Burakov

Hugepage mapping is a layer of EAL malloc builds upon.
There were implicit references to its details,
like mentions of segment file descriptors,
but no explicit description of its modes and operation.
Add an overview of mechanics used on ech supported OS.
Convert memory management subsections from list items
to level 4 headers: they are big and important enough.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 .../prog_guide/env_abstraction_layer.rst      | 95 +++++++++++++++++--
 1 file changed, 86 insertions(+), 9 deletions(-)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index c6accce701..fede7fe69d 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -86,7 +86,7 @@ See chapter
 Memory Mapping Discovery and Memory Reservation
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The allocation of large contiguous physical memory is done using the hugetlbfs kernel filesystem.
+The allocation of large contiguous physical memory is done using hugepages.
 The EAL provides an API to reserve named memory zones in this contiguous memory.
 The physical address of the reserved memory for that memory zone is also returned to the user by the memory zone reservation API.
 
@@ -95,11 +95,13 @@ and legacy mode. Both modes are explained below.
 
 .. note::
 
-    Memory reservations done using the APIs provided by rte_malloc are also backed by pages from the hugetlbfs filesystem.
+    Memory reservations done using the APIs provided by rte_malloc
+    are also backed by hugepages unless ``--no-huge`` option is given.
 
-+ Dynamic memory mode
+Dynamic Memory Mode
+^^^^^^^^^^^^^^^^^^^
 
-Currently, this mode is only supported on Linux.
+Currently, this mode is only supported on Linux and Windows.
 
 In this mode, usage of hugepages by DPDK application will grow and shrink based
 on application's requests. Any memory allocation through ``rte_malloc()``,
@@ -155,7 +157,8 @@ of memory that can be used by DPDK application.
     :ref:`Multi-process Support <Multi-process_Support>` for more details about
     DPDK IPC.
 
-+ Legacy memory mode
+Legacy Memory Mode
+^^^^^^^^^^^^^^^^^^
 
 This mode is enabled by specifying ``--legacy-mem`` command-line switch to the
 EAL. This switch will have no effect on FreeBSD as FreeBSD only supports
@@ -168,7 +171,8 @@ not allow acquiring or releasing hugepages from the system at runtime.
 If neither ``-m`` nor ``--socket-mem`` were specified, the entire available
 hugepage memory will be preallocated.
 
-+ Hugepage allocation matching
+Hugepage Allocation Matching
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This behavior is enabled by specifying the ``--match-allocations`` command-line
 switch to the EAL. This switch is Linux-only and not supported with
@@ -182,7 +186,8 @@ matching can be used by these types of applications to satisfy both of these
 requirements. This can result in some increased memory usage which is
 very dependent on the memory allocation patterns of the application.
 
-+ 32-bit support
+32-bit Support
+^^^^^^^^^^^^^^
 
 Additional restrictions are present when running in 32-bit mode. In dynamic
 memory mode, by default maximum of 2 gigabytes of VA space will be preallocated,
@@ -192,7 +197,8 @@ used.
 In legacy mode, VA space will only be preallocated for segments that were
 requested (plus padding, to keep IOVA-contiguousness).
 
-+ Maximum amount of memory
+Maximum Amount of Memory
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 All possible virtual memory space that can ever be used for hugepage mapping in
 a DPDK process is preallocated at startup, thereby placing an upper limit on how
@@ -222,7 +228,77 @@ Normally, these options do not need to be changed.
     can later be mapped into that preallocated VA space (if dynamic memory mode
     is enabled), and can optionally be mapped into it at startup.
 
-+ Segment file descriptors
+Hugepage Mapping
+^^^^^^^^^^^^^^^^
+
+Below is an overview of methods used for each OS to obtain hugepages,
+explaining why certain limitations and options exist in EAL.
+See the user guide for a specific OS for configuration details.
+
+FreeBSD uses ``contigmem`` kernel module
+to reserve a fixed number of hugepages at system start,
+which are mapped by EAL at initialization using a specific ``sysctl()``.
+
+Windows EAL allocates hugepages from the OS as needed using Win32 API,
+so available amount depends on the system load.
+It uses ``virt2phys`` kernel module to obtain physical addresses,
+unless running in IOVA-as-VA mode (e.g. forced with ``--iova-mode=va``).
+
+Linux implements a variety of methods:
+
+* mapping each hugepage from its own file in hugetlbfs;
+* mapping multiple hugepages from a shared file in hugetlbfs;
+* anonymous mapping.
+
+Mapping hugepages from files in hugetlbfs is essential for multi-process,
+because secondary processes need to map the same hugepages.
+EAL creates files like ``rtemap_0``
+in directories specified with ``--huge-dir`` option
+(or in the mount point for a specific hugepage size).
+The ``rte`` prefix can be changed using ``--file-prefix``.
+This may be needed for running multiple primary processes
+that share a hugetlbfs mount point.
+Each backing file by default corresponds to one hugepage,
+it is opened and locked for the entire time the hugepage is used.
+This may exhaust the number of open files limit (``NOFILE``).
+See :ref:`segment-file-descriptors` section
+on how the number of open backing file descriptors can be reduced.
+
+In dynamic memory mode, EAL removes a backing hugepage file
+when all pages mapped from it are freed back to the system.
+However, backing files may persist after the application terminates
+in case of a crash or a leak of DPDK memory (e.g. ``rte_free()`` is missing).
+This reduces the number of hugepages available to other processes
+as reported by ``/sys/kernel/mm/hugepages/hugepages-*/free_hugepages``.
+EAL can remove the backing files after opening them for mapping
+if ``--huge-unlink`` is given to avoid polluting hugetlbfs.
+However, since it disables multi-process anyway,
+using anonymous mapping (``--in-memory``) is recommended instead.
+
+:ref:`EAL memory allocator <malloc>` relies on hugepages being zero-filled.
+Hugepages are cleared by the kernel when a file in hugetlbfs or its part
+is mapped for the first time system-wide
+to prevent data leaks from previous users of the same hugepage.
+EAL ensures this behavior by removing existing backing files at startup
+and by recreating them before opening for mapping (as a precaution).
+
+Anonymous mapping does not allow multi-process architecture,
+but it is free of filename conflicts and leftover files on hugetlbfs.
+It makes running as non-root easier,
+because memory management does not require root permissions in this case
+(the limit of locked memory amount, ``MEMLOCK``, still applies).
+If memfd_create(2) is supported both at build and run time,
+DPDK memory manager can provide file descriptors for memory segments,
+which are required for VirtIO with vhost-user backend.
+This can exhaust the number of open files limit (``NOFILE``)
+despite not creating any files in hugetlbfs.
+See :ref:`segment-file-descriptors` section
+on how the number of open file descriptors used by EAL can be reduced.
+
+.. _segment-file-descriptors:
+
+Segment File Descriptors
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 On Linux, in most cases, EAL will store segment file descriptors in EAL. This
 can become a problem when using smaller page sizes due to underlying limitations
@@ -731,6 +807,7 @@ We expect only 50% of CPU spend on packet IO.
     echo 100000 > pkt_io/cpu.cfs_period_us
     echo  50000 > pkt_io/cpu.cfs_quota_us
 
+.. _malloc:
 
 Malloc
 ------
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v2 2/6] app/test: add allocator performance benchmark
  2022-01-19 21:09   ` [PATCH v2 " Dmitry Kozlyuk
  2022-01-19 21:09     ` [PATCH v2 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
@ 2022-01-19 21:09     ` Dmitry Kozlyuk
  2022-01-19 21:09     ` [PATCH v2 3/6] mem: add dirty malloc element support Dmitry Kozlyuk
                       ` (6 subsequent siblings)
  8 siblings, 0 replies; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-19 21:09 UTC (permalink / raw)
  To: dev; +Cc: Bruce Richardson, Viacheslav Ovsiienko, Aaron Conole

Memory allocator performance is crucial to applications that deal
with large amount of memory or allocate frequently. DPDK allocator
performance is affected by EAL options, API used and, at least,
allocation size. New autotest is intended to be run with different
EAL options. It measures performance with a range of sizes
for dirrerent APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.

Work distribution between allocation and deallocation depends on EAL
options. The test prints both times and total time to ease comparison.

Memory can be filled with zeroes at different points of allocation path,
but it always takes considerable fraction of overall timing. This is why
the test measures filling speed and prints how long clearing takes
for each size as a reference (for rte_memzone_reserve estimations
are printed).

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Acked-by: Aaron Conole <aconole@redhat.com>
---
 app/test/meson.build        |   2 +
 app/test/test_malloc_perf.c | 174 ++++++++++++++++++++++++++++++++++++
 2 files changed, 176 insertions(+)
 create mode 100644 app/test/test_malloc_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 344a609a4d..50cf2602a9 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -88,6 +88,7 @@ test_sources = files(
         'test_lpm6_perf.c',
         'test_lpm_perf.c',
         'test_malloc.c',
+        'test_malloc_perf.c',
         'test_mbuf.c',
         'test_member.c',
         'test_member_perf.c',
@@ -295,6 +296,7 @@ extra_test_names = [
 
 perf_test_names = [
         'ring_perf_autotest',
+        'malloc_perf_autotest',
         'mempool_perf_autotest',
         'memcpy_perf_autotest',
         'hash_perf_autotest',
diff --git a/app/test/test_malloc_perf.c b/app/test/test_malloc_perf.c
new file mode 100644
index 0000000000..9686fc8af5
--- /dev/null
+++ b/app/test/test_malloc_perf.c
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <inttypes.h>
+#include <string.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_malloc.h>
+#include <rte_memzone.h>
+
+#include "test.h"
+
+#define TEST_LOG(level, ...) RTE_LOG(level, USER1, __VA_ARGS__)
+
+typedef void * (alloc_t)(const char *name, size_t size, unsigned int align);
+typedef void (free_t)(void *addr);
+typedef void * (memset_t)(void *addr, int value, size_t size);
+
+static const uint64_t KB = 1 << 10;
+static const uint64_t GB = 1 << 30;
+
+static double
+tsc_to_us(uint64_t tsc, size_t runs)
+{
+	return (double)tsc / rte_get_tsc_hz() * US_PER_S / runs;
+}
+
+static int
+test_memset_perf(double *us_per_gb)
+{
+	static const size_t RUNS = 20;
+
+	void *ptr;
+	size_t i;
+	uint64_t tsc;
+
+	TEST_LOG(INFO, "Reference: memset\n");
+
+	ptr = rte_malloc(NULL, GB, 0);
+	if (ptr == NULL) {
+		TEST_LOG(ERR, "rte_malloc(size=%"PRIx64") failed\n", GB);
+		return -1;
+	}
+
+	tsc = rte_rdtsc_precise();
+	for (i = 0; i < RUNS; i++)
+		memset(ptr, 0, GB);
+	tsc = rte_rdtsc_precise() - tsc;
+
+	*us_per_gb = tsc_to_us(tsc, RUNS);
+	TEST_LOG(INFO, "Result: %f.3 GiB/s <=> %.2f us/MiB\n",
+			US_PER_S / *us_per_gb, *us_per_gb / KB);
+
+	rte_free(ptr);
+	TEST_LOG(INFO, "\n");
+	return 0;
+}
+
+static int
+test_alloc_perf(const char *name, alloc_t *alloc_fn, free_t *free_fn,
+		memset_t *memset_fn, double memset_gb_us, size_t max_runs)
+{
+	static const size_t SIZES[] = {
+			1 << 6, 1 << 7, 1 << 10, 1 << 12, 1 << 16, 1 << 20,
+			1 << 21, 1 << 22, 1 << 24, 1 << 30 };
+
+	size_t i, j;
+	void **ptrs;
+
+	TEST_LOG(INFO, "Performance: %s\n", name);
+
+	ptrs = calloc(max_runs, sizeof(ptrs[0]));
+	if (ptrs == NULL) {
+		TEST_LOG(ERR, "Cannot allocate memory for pointers");
+		return -1;
+	}
+
+	TEST_LOG(INFO, "%12s%8s%12s%12s%12s%17s\n", "Size (B)", "Runs",
+			"Alloc (us)", "Free (us)", "Total (us)",
+			memset_fn != NULL ? "memset (us)" : "est.memset (us)");
+	for (i = 0; i < RTE_DIM(SIZES); i++) {
+		size_t size = SIZES[i];
+		size_t runs_done;
+		uint64_t tsc_start, tsc_alloc, tsc_memset = 0, tsc_free;
+		double alloc_time, free_time, memset_time;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < max_runs; j++) {
+			ptrs[j] = alloc_fn(NULL, size, 0);
+			if (ptrs[j] == NULL)
+				break;
+		}
+		tsc_alloc = rte_rdtsc_precise() - tsc_start;
+
+		if (j == 0) {
+			TEST_LOG(INFO, "%12zu Interrupted: out of memory.\n",
+					size);
+			break;
+		}
+		runs_done = j;
+
+		if (memset_fn != NULL) {
+			tsc_start = rte_rdtsc_precise();
+			for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+				memset_fn(ptrs[j], 0, size);
+			tsc_memset = rte_rdtsc_precise() - tsc_start;
+		}
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+			free_fn(ptrs[j]);
+		tsc_free = rte_rdtsc_precise() - tsc_start;
+
+		alloc_time = tsc_to_us(tsc_alloc, runs_done);
+		free_time = tsc_to_us(tsc_free, runs_done);
+		memset_time = memset_fn != NULL ?
+				tsc_to_us(tsc_memset, runs_done) :
+				memset_gb_us * size / GB;
+		TEST_LOG(INFO, "%12zu%8zu%12.2f%12.2f%12.2f%17.2f\n",
+				size, runs_done, alloc_time, free_time,
+				alloc_time + free_time, memset_time);
+
+		memset(ptrs, 0, max_runs * sizeof(ptrs[0]));
+	}
+
+	free(ptrs);
+	TEST_LOG(INFO, "\n");
+	return 0;
+}
+
+static void *
+memzone_alloc(const char *name __rte_unused, size_t size, unsigned int align)
+{
+	const struct rte_memzone *mz;
+	char gen_name[RTE_MEMZONE_NAMESIZE];
+
+	snprintf(gen_name, sizeof(gen_name), "test-mz-%"PRIx64, rte_rdtsc());
+	mz = rte_memzone_reserve_aligned(gen_name, size, SOCKET_ID_ANY,
+			RTE_MEMZONE_1GB | RTE_MEMZONE_SIZE_HINT_ONLY, align);
+	return (void *)(uintptr_t)mz;
+}
+
+static void
+memzone_free(void *addr)
+{
+	rte_memzone_free((struct rte_memzone *)addr);
+}
+
+static int
+test_malloc_perf(void)
+{
+	static const size_t MAX_RUNS = 10000;
+
+	double memset_us_gb;
+
+	if (test_memset_perf(&memset_us_gb) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_malloc", rte_malloc, rte_free, memset,
+			memset_us_gb, MAX_RUNS) < 0)
+		return -1;
+	if (test_alloc_perf("rte_zmalloc", rte_zmalloc, rte_free, memset,
+			memset_us_gb, MAX_RUNS) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_memzone_reserve", memzone_alloc, memzone_free,
+			NULL, memset_us_gb, RTE_MAX_MEMZONE - 1) < 0)
+		return -1;
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(malloc_perf_autotest, test_malloc_perf);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v2 3/6] mem: add dirty malloc element support
  2022-01-19 21:09   ` [PATCH v2 " Dmitry Kozlyuk
  2022-01-19 21:09     ` [PATCH v2 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
  2022-01-19 21:09     ` [PATCH v2 2/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
@ 2022-01-19 21:09     ` Dmitry Kozlyuk
  2022-01-19 21:09     ` [PATCH v2 4/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
                       ` (5 subsequent siblings)
  8 siblings, 0 replies; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-19 21:09 UTC (permalink / raw)
  To: dev; +Cc: Bruce Richardson, Anatoly Burakov

EAL malloc layer assumed all free elements content
is filled with zeros ("clean"), as opposed to uninitialized ("dirty").
This assumption was ensured in two ways:
1. EAL memalloc layer always returned clean memory.
2. Freed memory was cleared before returning into the heap.

Clearing the memory can be as slow as around 14 GiB/s.
To save doing so, memalloc layer is allowed to return dirty memory.
Such segments being marked with RTE_MEMSEG_FLAG_DIRTY.
The allocator tracks elements that contain dirty memory
using the new flag in the element header.
When clean memory is requested via rte_zmalloc*()
and the suitable element is dirty, it is cleared on allocation.
When memory is deallocated, the freed element is joined
with adjacent free elements, and the dirty flag is updated:

a) If the joint element contains dirty parts, it is dirty:

    dirty + freed + dirty = dirty  =>  no need to clean
            freed + dirty = dirty      the freed memory

   Dirty parts may be large (e.g. initial allocation),
   so clearing them could create unpredictable slowdown.

b) If the only dirty part of the joint element
   is the freed memory, the joint element can be made clean:

    clean + freed + clean = clean  =>  freed memory
    clean + freed         = clean      must be cleared
            freed + clean = clean
            freed         = clean

   This logic naturally reproduces the old behavior
   and always applies in modes when EAL memalloc layer
   returns only clean segments.

As a result, memory is either cleared on free, as before,
or it will be cleared on allocation if need be, but never twice.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 lib/eal/common/malloc_elem.c | 22 +++++++++++++++++++---
 lib/eal/common/malloc_elem.h | 11 +++++++++--
 lib/eal/common/malloc_heap.c | 18 ++++++++++++------
 lib/eal/common/rte_malloc.c  | 21 ++++++++++++++-------
 lib/eal/include/rte_memory.h |  8 ++++++--
 5 files changed, 60 insertions(+), 20 deletions(-)

diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
index bdd20a162e..e04e0890fb 100644
--- a/lib/eal/common/malloc_elem.c
+++ b/lib/eal/common/malloc_elem.c
@@ -129,7 +129,7 @@ malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
 void
 malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
 		struct rte_memseg_list *msl, size_t size,
-		struct malloc_elem *orig_elem, size_t orig_size)
+		struct malloc_elem *orig_elem, size_t orig_size, bool dirty)
 {
 	elem->heap = heap;
 	elem->msl = msl;
@@ -137,6 +137,7 @@ malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
 	elem->next = NULL;
 	memset(&elem->free_list, 0, sizeof(elem->free_list));
 	elem->state = ELEM_FREE;
+	elem->dirty = dirty;
 	elem->size = size;
 	elem->pad = 0;
 	elem->orig_elem = orig_elem;
@@ -300,7 +301,7 @@ split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt)
 	const size_t new_elem_size = elem->size - old_elem_size;
 
 	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size,
-			 elem->orig_elem, elem->orig_size);
+			elem->orig_elem, elem->orig_size, elem->dirty);
 	split_pt->prev = elem;
 	split_pt->next = next_elem;
 	if (next_elem)
@@ -506,6 +507,7 @@ join_elem(struct malloc_elem *elem1, struct malloc_elem *elem2)
 	else
 		elem1->heap->last = elem1;
 	elem1->next = next;
+	elem1->dirty |= elem2->dirty;
 	if (elem1->pad) {
 		struct malloc_elem *inner = RTE_PTR_ADD(elem1, elem1->pad);
 		inner->size = elem1->size - elem1->pad;
@@ -579,6 +581,14 @@ malloc_elem_free(struct malloc_elem *elem)
 	ptr = RTE_PTR_ADD(elem, MALLOC_ELEM_HEADER_LEN);
 	data_len = elem->size - MALLOC_ELEM_OVERHEAD;
 
+	/*
+	 * Consider the element clean for the purposes of joining.
+	 * If both neighbors are clean or non-existent,
+	 * the joint element will be clean,
+	 * which means the memory should be cleared.
+	 * There is no need to clear the memory if the joint element is dirty.
+	 */
+	elem->dirty = false;
 	elem = malloc_elem_join_adjacent_free(elem);
 
 	malloc_elem_free_list_insert(elem);
@@ -588,8 +598,14 @@ malloc_elem_free(struct malloc_elem *elem)
 	/* decrease heap's count of allocated elements */
 	elem->heap->alloc_count--;
 
-	/* poison memory */
+#ifndef RTE_MALLOC_DEBUG
+	/* Normally clear the memory when needed. */
+	if (!elem->dirty)
+		memset(ptr, 0, data_len);
+#else
+	/* Always poison the memory in debug mode. */
 	memset(ptr, MALLOC_POISON, data_len);
+#endif
 
 	return elem;
 }
diff --git a/lib/eal/common/malloc_elem.h b/lib/eal/common/malloc_elem.h
index 15d8ba7af2..f2aa98821b 100644
--- a/lib/eal/common/malloc_elem.h
+++ b/lib/eal/common/malloc_elem.h
@@ -27,7 +27,13 @@ struct malloc_elem {
 	LIST_ENTRY(malloc_elem) free_list;
 	/**< list of free elements in heap */
 	struct rte_memseg_list *msl;
-	volatile enum elem_state state;
+	/** Element state, @c dirty and @c pad validity depends on it. */
+	/* An extra bit is needed to represent enum elem_state as signed int. */
+	enum elem_state state : 3;
+	/** If state == ELEM_FREE: the memory is not filled with zeroes. */
+	uint32_t dirty : 1;
+	/** Reserved for future use. */
+	uint32_t reserved : 28;
 	uint32_t pad;
 	size_t size;
 	struct malloc_elem *orig_elem;
@@ -320,7 +326,8 @@ malloc_elem_init(struct malloc_elem *elem,
 		struct rte_memseg_list *msl,
 		size_t size,
 		struct malloc_elem *orig_elem,
-		size_t orig_size);
+		size_t orig_size,
+		bool dirty);
 
 void
 malloc_elem_insert(struct malloc_elem *elem);
diff --git a/lib/eal/common/malloc_heap.c b/lib/eal/common/malloc_heap.c
index 55aad2711b..24080fc473 100644
--- a/lib/eal/common/malloc_heap.c
+++ b/lib/eal/common/malloc_heap.c
@@ -93,11 +93,11 @@ malloc_socket_to_heap_id(unsigned int socket_id)
  */
 static struct malloc_elem *
 malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl,
-		void *start, size_t len)
+		void *start, size_t len, bool dirty)
 {
 	struct malloc_elem *elem = start;
 
-	malloc_elem_init(elem, heap, msl, len, elem, len);
+	malloc_elem_init(elem, heap, msl, len, elem, len, dirty);
 
 	malloc_elem_insert(elem);
 
@@ -135,7 +135,8 @@ malloc_add_seg(const struct rte_memseg_list *msl,
 
 	found_msl = &mcfg->memsegs[msl_idx];
 
-	malloc_heap_add_memory(heap, found_msl, ms->addr, len);
+	malloc_heap_add_memory(heap, found_msl, ms->addr, len,
+			ms->flags & RTE_MEMSEG_FLAG_DIRTY);
 
 	heap->total_size += len;
 
@@ -303,7 +304,8 @@ alloc_pages_on_heap(struct malloc_heap *heap, uint64_t pg_sz, size_t elt_size,
 	struct rte_memseg_list *msl;
 	struct malloc_elem *elem = NULL;
 	size_t alloc_sz;
-	int allocd_pages;
+	int allocd_pages, i;
+	bool dirty = false;
 	void *ret, *map_addr;
 
 	alloc_sz = (size_t)pg_sz * n_segs;
@@ -372,8 +374,12 @@ alloc_pages_on_heap(struct malloc_heap *heap, uint64_t pg_sz, size_t elt_size,
 		goto fail;
 	}
 
+	/* Element is dirty if it contains at least one dirty page. */
+	for (i = 0; i < allocd_pages; i++)
+		dirty |= ms[i]->flags & RTE_MEMSEG_FLAG_DIRTY;
+
 	/* add newly minted memsegs to malloc heap */
-	elem = malloc_heap_add_memory(heap, msl, map_addr, alloc_sz);
+	elem = malloc_heap_add_memory(heap, msl, map_addr, alloc_sz, dirty);
 
 	/* try once more, as now we have allocated new memory */
 	ret = find_suitable_element(heap, elt_size, flags, align, bound,
@@ -1260,7 +1266,7 @@ malloc_heap_add_external_memory(struct malloc_heap *heap,
 	memset(msl->base_va, 0, msl->len);
 
 	/* now, add newly minted memory to the malloc heap */
-	malloc_heap_add_memory(heap, msl, msl->base_va, msl->len);
+	malloc_heap_add_memory(heap, msl, msl->base_va, msl->len, false);
 
 	heap->total_size += msl->len;
 
diff --git a/lib/eal/common/rte_malloc.c b/lib/eal/common/rte_malloc.c
index d0bec26920..71a3f7ecb4 100644
--- a/lib/eal/common/rte_malloc.c
+++ b/lib/eal/common/rte_malloc.c
@@ -115,15 +115,22 @@ rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket)
 {
 	void *ptr = rte_malloc_socket(type, size, align, socket);
 
+	if (ptr != NULL) {
+		struct malloc_elem *elem = malloc_elem_from_data(ptr);
+
+		if (elem->dirty) {
+			memset(ptr, 0, size);
+		} else {
 #ifdef RTE_MALLOC_DEBUG
-	/*
-	 * If DEBUG is enabled, then freed memory is marked with poison
-	 * value and set to zero on allocation.
-	 * If DEBUG is not enabled then  memory is already zeroed.
-	 */
-	if (ptr != NULL)
-		memset(ptr, 0, size);
+			/*
+			 * If DEBUG is enabled, then freed memory is marked
+			 * with a poison value and set to zero on allocation.
+			 * If DEBUG is disabled then memory is already zeroed.
+			 */
+			memset(ptr, 0, size);
 #endif
+		}
+	}
 
 	rte_eal_trace_mem_zmalloc(type, size, align, socket, ptr);
 	return ptr;
diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h
index 6d018629ae..68b069fd04 100644
--- a/lib/eal/include/rte_memory.h
+++ b/lib/eal/include/rte_memory.h
@@ -19,6 +19,7 @@
 extern "C" {
 #endif
 
+#include <rte_bitops.h>
 #include <rte_common.h>
 #include <rte_compat.h>
 #include <rte_config.h>
@@ -37,11 +38,14 @@ extern "C" {
 
 #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
 
+/** Prevent this segment from being freed back to the OS. */
+#define RTE_MEMSEG_FLAG_DO_NOT_FREE RTE_BIT32(0)
+/** This segment is not filled with zeros. */
+#define RTE_MEMSEG_FLAG_DIRTY RTE_BIT32(1)
+
 /**
  * Physical memory segment descriptor.
  */
-#define RTE_MEMSEG_FLAG_DO_NOT_FREE (1 << 0)
-/**< Prevent this segment from being freed back to the OS. */
 struct rte_memseg {
 	rte_iova_t iova;            /**< Start IO address. */
 	RTE_STD_C11
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v2 4/6] eal: refactor --huge-unlink storage
  2022-01-19 21:09   ` [PATCH v2 " Dmitry Kozlyuk
                       ` (2 preceding siblings ...)
  2022-01-19 21:09     ` [PATCH v2 3/6] mem: add dirty malloc element support Dmitry Kozlyuk
@ 2022-01-19 21:09     ` Dmitry Kozlyuk
  2022-01-19 21:11     ` [PATCH v2 5/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
                       ` (4 subsequent siblings)
  8 siblings, 0 replies; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-19 21:09 UTC (permalink / raw)
  To: dev; +Cc: Bruce Richardson, Thomas Monjalon, Anatoly Burakov

In preparation to extend --huge-unlink option semantics
refactor how it is stored in the internal configuration.
It makes future changes more isolated.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
---
 lib/eal/common/eal_common_options.c | 9 +++++----
 lib/eal/common/eal_internal_cfg.h   | 8 +++++++-
 lib/eal/linux/eal_memalloc.c        | 7 ++++---
 lib/eal/linux/eal_memory.c          | 2 +-
 4 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index 1cfdd75f3b..7520ebda8e 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -1737,7 +1737,7 @@ eal_parse_common_option(int opt, const char *optarg,
 
 	/* long options */
 	case OPT_HUGE_UNLINK_NUM:
-		conf->hugepage_unlink = 1;
+		conf->hugepage_file.unlink_before_mapping = true;
 		break;
 
 	case OPT_NO_HUGE_NUM:
@@ -1766,7 +1766,7 @@ eal_parse_common_option(int opt, const char *optarg,
 		conf->in_memory = 1;
 		/* in-memory is a superset of noshconf and huge-unlink */
 		conf->no_shconf = 1;
-		conf->hugepage_unlink = 1;
+		conf->hugepage_file.unlink_before_mapping = true;
 		break;
 
 	case OPT_PROC_TYPE_NUM:
@@ -2050,7 +2050,8 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"be specified together with --"OPT_NO_HUGE"\n");
 		return -1;
 	}
-	if (internal_cfg->no_hugetlbfs && internal_cfg->hugepage_unlink &&
+	if (internal_cfg->no_hugetlbfs &&
+			internal_cfg->hugepage_file.unlink_before_mapping &&
 			!internal_cfg->in_memory) {
 		RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
 			"be specified together with --"OPT_NO_HUGE"\n");
@@ -2061,7 +2062,7 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			" is only supported in non-legacy memory mode\n");
 	}
 	if (internal_cfg->single_file_segments &&
-			internal_cfg->hugepage_unlink &&
+			internal_cfg->hugepage_file.unlink_before_mapping &&
 			!internal_cfg->in_memory) {
 		RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE_SEGMENTS" is "
 			"not compatible with --"OPT_HUGE_UNLINK"\n");
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index d6c0470eb8..b5e6942578 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -40,6 +40,12 @@ struct simd_bitwidth {
 	uint16_t bitwidth; /**< bitwidth value */
 };
 
+/** Hugepage backing files discipline. */
+struct hugepage_file_discipline {
+	/** Unlink files before mapping them to leave no trace in hugetlbfs. */
+	bool unlink_before_mapping;
+};
+
 /**
  * internal configuration
  */
@@ -48,7 +54,7 @@ struct internal_config {
 	volatile unsigned force_nchannel; /**< force number of channels */
 	volatile unsigned force_nrank;    /**< force number of ranks */
 	volatile unsigned no_hugetlbfs;   /**< true to disable hugetlbfs */
-	unsigned hugepage_unlink;         /**< true to unlink backing files */
+	struct hugepage_file_discipline hugepage_file;
 	volatile unsigned no_pci;         /**< true to disable PCI */
 	volatile unsigned no_hpet;        /**< true to disable HPET */
 	volatile unsigned vmware_tsc_map; /**< true to use VMware TSC mapping
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 337f2bc739..56a1ddb32b 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -564,7 +564,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 					__func__, strerror(errno));
 				goto resized;
 			}
-			if (internal_conf->hugepage_unlink &&
+			if (internal_conf->hugepage_file.unlink_before_mapping &&
 					!internal_conf->in_memory) {
 				if (unlink(path)) {
 					RTE_LOG(DEBUG, EAL, "%s(): unlink() failed: %s\n",
@@ -697,7 +697,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 			close_hugefile(fd, path, list_idx);
 	} else {
 		/* only remove file if we can take out a write lock */
-		if (internal_conf->hugepage_unlink == 0 &&
+		if (!internal_conf->hugepage_file.unlink_before_mapping &&
 				internal_conf->in_memory == 0 &&
 				lock(fd, LOCK_EX) == 1)
 			unlink(path);
@@ -756,7 +756,8 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		/* if we're able to take out a write lock, we're the last one
 		 * holding onto this page.
 		 */
-		if (!internal_conf->in_memory && !internal_conf->hugepage_unlink) {
+		if (!internal_conf->in_memory &&
+				!internal_conf->hugepage_file.unlink_before_mapping) {
 			ret = lock(fd, LOCK_EX);
 			if (ret >= 0) {
 				/* no one else is using this page */
diff --git a/lib/eal/linux/eal_memory.c b/lib/eal/linux/eal_memory.c
index 03a4f2dd2d..83eec078a4 100644
--- a/lib/eal/linux/eal_memory.c
+++ b/lib/eal/linux/eal_memory.c
@@ -1428,7 +1428,7 @@ eal_legacy_hugepage_init(void)
 	}
 
 	/* free the hugepage backing files */
-	if (internal_conf->hugepage_unlink &&
+	if (internal_conf->hugepage_file.unlink_before_mapping &&
 		unlink_hugepage_files(tmp_hp, internal_conf->num_hugepage_sizes) < 0) {
 		RTE_LOG(ERR, EAL, "Unlinking hugepage files failed!\n");
 		goto fail;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v2 5/6] eal/linux: allow hugepage file reuse
  2022-01-19 21:09   ` [PATCH v2 " Dmitry Kozlyuk
                       ` (3 preceding siblings ...)
  2022-01-19 21:09     ` [PATCH v2 4/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
@ 2022-01-19 21:11     ` Dmitry Kozlyuk
  2022-01-19 21:11       ` [PATCH v2 6/6] eal: extend --huge-unlink for " Dmitry Kozlyuk
  2022-01-27 12:07     ` [PATCH v2 0/6] Fast restart with many hugepages Bruce Richardson
                       ` (3 subsequent siblings)
  8 siblings, 1 reply; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-19 21:11 UTC (permalink / raw)
  To: dev; +Cc: Bruce Richardson, Anatoly Burakov

Linux EAL ensured that mapped hugepages are clean
by always mapping from newly created files:
existing hugepage backing files were always removed.
In this case, the kernel clears the page to prevent data leaks,
because the mapped memory may contain leftover data
from the previous process that was using this memory.
Clearing takes the bulk of the time spent in mmap(2),
increasing EAL initialization time.

Introduce a mode to keep existing files and reuse them
in order to speed up initial memory allocation in EAL.
Hugepages mapped from such files may contain data
left by the previous process that used this memory,
so RTE_MEMSEG_FLAG_DIRTY is set for their segments.
If multiple hugepages are mapped from the same file:
1. When fallocate(2) is used, all memory mapped from this file
   is considered dirty, because it is unknown
   which parts of the file are holes.
2. When ftruncate(3) is used, memory mapped from this file
   is considered dirty unless the file is extended
   to create a new mapping, which implies clean memory.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
Coverity complains that "path" may be uninitialized in get_seg_fd()
at line 327, but it is always initialized with eal_get_hugefile_path()
at lines 309-316.

 lib/eal/common/eal_common_options.c |   2 +
 lib/eal/common/eal_internal_cfg.h   |   2 +
 lib/eal/linux/eal.c                 |   3 +-
 lib/eal/linux/eal_hugepage_info.c   | 118 ++++++++++++++++----
 lib/eal/linux/eal_memalloc.c        | 166 +++++++++++++++++-----------
 5 files changed, 206 insertions(+), 85 deletions(-)

diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index 7520ebda8e..cdd2284b0c 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -311,6 +311,8 @@ eal_reset_internal_config(struct internal_config *internal_cfg)
 	internal_cfg->force_nchannel = 0;
 	internal_cfg->hugefile_prefix = NULL;
 	internal_cfg->hugepage_dir = NULL;
+	internal_cfg->hugepage_file.unlink_before_mapping = false;
+	internal_cfg->hugepage_file.unlink_existing = true;
 	internal_cfg->force_sockets = 0;
 	/* zero out the NUMA config */
 	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index b5e6942578..d2be7bfa57 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -44,6 +44,8 @@ struct simd_bitwidth {
 struct hugepage_file_discipline {
 	/** Unlink files before mapping them to leave no trace in hugetlbfs. */
 	bool unlink_before_mapping;
+	/** Unlink exisiting files at startup, re-create them before mapping. */
+	bool unlink_existing;
 };
 
 /**
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 60b4924838..9c8395ab14 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -1360,7 +1360,8 @@ rte_eal_cleanup(void)
 	struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY &&
+			internal_conf->hugepage_file.unlink_existing)
 		rte_memseg_walk(mark_freeable, NULL);
 	rte_service_finalize();
 	rte_mp_channel_cleanup();
diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index 9fb0e968db..ec172ef4b8 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -84,7 +84,7 @@ static int get_hp_sysfs_value(const char *subdir, const char *file, unsigned lon
 /* this function is only called from eal_hugepage_info_init which itself
  * is only called from a primary process */
 static uint32_t
-get_num_hugepages(const char *subdir, size_t sz)
+get_num_hugepages(const char *subdir, size_t sz, unsigned int reusable_pages)
 {
 	unsigned long resv_pages, num_pages, over_pages, surplus_pages;
 	const char *nr_hp_file = "free_hugepages";
@@ -116,7 +116,7 @@ get_num_hugepages(const char *subdir, size_t sz)
 	else
 		over_pages = 0;
 
-	if (num_pages == 0 && over_pages == 0)
+	if (num_pages == 0 && over_pages == 0 && reusable_pages)
 		RTE_LOG(WARNING, EAL, "No available %zu kB hugepages reported\n",
 				sz >> 10);
 
@@ -124,6 +124,10 @@ get_num_hugepages(const char *subdir, size_t sz)
 	if (num_pages < over_pages) /* overflow */
 		num_pages = UINT32_MAX;
 
+	num_pages += reusable_pages;
+	if (num_pages < reusable_pages) /* overflow */
+		num_pages = UINT32_MAX;
+
 	/* we want to return a uint32_t and more than this looks suspicious
 	 * anyway ... */
 	if (num_pages > UINT32_MAX)
@@ -297,20 +301,28 @@ get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
 	return -1;
 }
 
+struct walk_hugedir_data {
+	int dir_fd;
+	int file_fd;
+	const char *file_name;
+	void *user_data;
+};
+
+typedef void (walk_hugedir_t)(const struct walk_hugedir_data *whd);
+
 /*
- * Clear the hugepage directory of whatever hugepage files
- * there are. Checks if the file is locked (i.e.
- * if it's in use by another DPDK process).
+ * Search the hugepage directory for whatever hugepage files there are.
+ * Check if the file is in use by another DPDK process.
+ * If not, execute a callback on it.
  */
 static int
-clear_hugedir(const char * hugedir)
+walk_hugedir(const char *hugedir, walk_hugedir_t *cb, void *user_data)
 {
 	DIR *dir;
 	struct dirent *dirent;
 	int dir_fd, fd, lck_result;
 	const char filter[] = "*map_*"; /* matches hugepage files */
 
-	/* open directory */
 	dir = opendir(hugedir);
 	if (!dir) {
 		RTE_LOG(ERR, EAL, "Unable to open hugepage directory %s\n",
@@ -326,7 +338,7 @@ clear_hugedir(const char * hugedir)
 		goto error;
 	}
 
-	while(dirent != NULL){
+	while (dirent != NULL) {
 		/* skip files that don't match the hugepage pattern */
 		if (fnmatch(filter, dirent->d_name, 0) > 0) {
 			dirent = readdir(dir);
@@ -345,9 +357,15 @@ clear_hugedir(const char * hugedir)
 		/* non-blocking lock */
 		lck_result = flock(fd, LOCK_EX | LOCK_NB);
 
-		/* if lock succeeds, remove the file */
+		/* if lock succeeds, execute callback */
 		if (lck_result != -1)
-			unlinkat(dir_fd, dirent->d_name, 0);
+			cb(&(struct walk_hugedir_data){
+				.dir_fd = dir_fd,
+				.file_fd = fd,
+				.file_name = dirent->d_name,
+				.user_data = user_data,
+			});
+
 		close (fd);
 		dirent = readdir(dir);
 	}
@@ -359,12 +377,48 @@ clear_hugedir(const char * hugedir)
 	if (dir)
 		closedir(dir);
 
-	RTE_LOG(ERR, EAL, "Error while clearing hugepage dir: %s\n",
+	RTE_LOG(ERR, EAL, "Error while walking hugepage dir: %s\n",
 		strerror(errno));
 
 	return -1;
 }
 
+static void
+clear_hugedir_cb(const struct walk_hugedir_data *whd)
+{
+	unlinkat(whd->dir_fd, whd->file_name, 0);
+}
+
+/* Remove hugepage files not used by other DPDK processes from a directory. */
+static int
+clear_hugedir(const char *hugedir)
+{
+	return walk_hugedir(hugedir, clear_hugedir_cb, NULL);
+}
+
+static void
+inspect_hugedir_cb(const struct walk_hugedir_data *whd)
+{
+	uint64_t *total_size = whd->user_data;
+	struct stat st;
+
+	if (fstat(whd->file_fd, &st) < 0)
+		RTE_LOG(DEBUG, EAL, "%s(): stat(\"%s\") failed: %s",
+				__func__, whd->file_name, strerror(errno));
+	else
+		(*total_size) += st.st_size;
+}
+
+/*
+ * Count the total size in bytes of all files in the directory
+ * not mapped by other DPDK process.
+ */
+static int
+inspect_hugedir(const char *hugedir, uint64_t *total_size)
+{
+	return walk_hugedir(hugedir, inspect_hugedir_cb, total_size);
+}
+
 static int
 compare_hpi(const void *a, const void *b)
 {
@@ -375,7 +429,8 @@ compare_hpi(const void *a, const void *b)
 }
 
 static void
-calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
+calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent,
+		unsigned int reusable_pages)
 {
 	uint64_t total_pages = 0;
 	unsigned int i;
@@ -388,8 +443,15 @@ calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
 	 * in one socket and sorting them later
 	 */
 	total_pages = 0;
-	/* we also don't want to do this for legacy init */
-	if (!internal_conf->legacy_mem)
+
+	/*
+	 * We also don't want to do this for legacy init.
+	 * When there are hugepage files to reuse it is unknown
+	 * what NUMA node the pages are on.
+	 * This could be determined by mapping,
+	 * but it is precisely what hugepage file reuse is trying to avoid.
+	 */
+	if (!internal_conf->legacy_mem && reusable_pages == 0)
 		for (i = 0; i < rte_socket_count(); i++) {
 			int socket = rte_socket_id_by_idx(i);
 			unsigned int num_pages =
@@ -405,7 +467,7 @@ calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
 	 */
 	if (total_pages == 0) {
 		hpi->num_pages[0] = get_num_hugepages(dirent->d_name,
-				hpi->hugepage_sz);
+				hpi->hugepage_sz, reusable_pages);
 
 #ifndef RTE_ARCH_64
 		/* for 32-bit systems, limit number of hugepages to
@@ -421,6 +483,8 @@ hugepage_info_init(void)
 {	const char dirent_start_text[] = "hugepages-";
 	const size_t dirent_start_len = sizeof(dirent_start_text) - 1;
 	unsigned int i, num_sizes = 0;
+	uint64_t reusable_bytes;
+	unsigned int reusable_pages;
 	DIR *dir;
 	struct dirent *dirent;
 	struct internal_config *internal_conf =
@@ -454,7 +518,7 @@ hugepage_info_init(void)
 			uint32_t num_pages;
 
 			num_pages = get_num_hugepages(dirent->d_name,
-					hpi->hugepage_sz);
+					hpi->hugepage_sz, 0);
 			if (num_pages > 0)
 				RTE_LOG(NOTICE, EAL,
 					"%" PRIu32 " hugepages of size "
@@ -473,7 +537,7 @@ hugepage_info_init(void)
 					"hugepages of size %" PRIu64 " bytes "
 					"will be allocated anonymously\n",
 					hpi->hugepage_sz);
-				calc_num_pages(hpi, dirent);
+				calc_num_pages(hpi, dirent, 0);
 				num_sizes++;
 			}
 #endif
@@ -489,11 +553,23 @@ hugepage_info_init(void)
 				"Failed to lock hugepage directory!\n");
 			break;
 		}
-		/* clear out the hugepages dir from unused pages */
-		if (clear_hugedir(hpi->hugedir) == -1)
-			break;
 
-		calc_num_pages(hpi, dirent);
+		/*
+		 * Check for existing hugepage files and either remove them
+		 * or count how many of them can be reused.
+		 */
+		reusable_pages = 0;
+		if (!internal_conf->hugepage_file.unlink_existing) {
+			reusable_bytes = 0;
+			if (inspect_hugedir(hpi->hugedir,
+					&reusable_bytes) < 0)
+				break;
+			RTE_ASSERT(reusable_bytes % hpi->hugepage_sz == 0);
+			reusable_pages = reusable_bytes / hpi->hugepage_sz;
+		} else if (clear_hugedir(hpi->hugedir) < 0) {
+			break;
+		}
+		calc_num_pages(hpi, dirent, reusable_pages);
 
 		num_sizes++;
 	}
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 56a1ddb32b..be109db3d5 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -287,12 +287,19 @@ get_seg_memfd(struct hugepage_info *hi __rte_unused,
 
 static int
 get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
-		unsigned int list_idx, unsigned int seg_idx)
+		unsigned int list_idx, unsigned int seg_idx,
+		bool *dirty)
 {
 	int fd;
+	int *out_fd;
+	struct stat st;
+	int ret;
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
+	if (dirty != NULL)
+		*dirty = false;
+
 	/* for in-memory mode, we only make it here when we're sure we support
 	 * memfd, and this is a special case.
 	 */
@@ -300,66 +307,70 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 		return get_seg_memfd(hi, list_idx, seg_idx);
 
 	if (internal_conf->single_file_segments) {
-		/* create a hugepage file path */
+		out_fd = &fd_list[list_idx].memseg_list_fd;
 		eal_get_hugefile_path(path, buflen, hi->hugedir, list_idx);
-
-		fd = fd_list[list_idx].memseg_list_fd;
-
-		if (fd < 0) {
-			fd = open(path, O_CREAT | O_RDWR, 0600);
-			if (fd < 0) {
-				RTE_LOG(ERR, EAL, "%s(): open failed: %s\n",
-					__func__, strerror(errno));
-				return -1;
-			}
-			/* take out a read lock and keep it indefinitely */
-			if (lock(fd, LOCK_SH) < 0) {
-				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
-					__func__, strerror(errno));
-				close(fd);
-				return -1;
-			}
-			fd_list[list_idx].memseg_list_fd = fd;
-		}
 	} else {
-		/* create a hugepage file path */
+		out_fd = &fd_list[list_idx].fds[seg_idx];
 		eal_get_hugefile_path(path, buflen, hi->hugedir,
 				list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);
+	}
+	fd = *out_fd;
+	if (fd >= 0)
+		return fd;
 
-		fd = fd_list[list_idx].fds[seg_idx];
-
-		if (fd < 0) {
-			/* A primary process is the only one creating these
-			 * files. If there is a leftover that was not cleaned
-			 * by clear_hugedir(), we must *now* make sure to drop
-			 * the file or we will remap old stuff while the rest
-			 * of the code is built on the assumption that a new
-			 * page is clean.
-			 */
-			if (rte_eal_process_type() == RTE_PROC_PRIMARY &&
-					unlink(path) == -1 &&
-					errno != ENOENT) {
-				RTE_LOG(DEBUG, EAL, "%s(): could not remove '%s': %s\n",
-					__func__, path, strerror(errno));
-				return -1;
-			}
+	/*
+	 * There is no TOCTOU between stat() and unlink()/open()
+	 * because the hugepage directory is locked.
+	 */
+	ret = stat(path, &st);
+	if (ret < 0 && errno != ENOENT) {
+		RTE_LOG(DEBUG, EAL, "%s(): stat() for '%s' failed: %s\n",
+			__func__, path, strerror(errno));
+		return -1;
+	}
+	if (!internal_conf->hugepage_file.unlink_existing && ret == 0 &&
+			dirty != NULL)
+		*dirty = true;
 
-			fd = open(path, O_CREAT | O_RDWR, 0600);
-			if (fd < 0) {
-				RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n",
-					__func__, strerror(errno));
-				return -1;
-			}
-			/* take out a read lock */
-			if (lock(fd, LOCK_SH) < 0) {
-				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
-					__func__, strerror(errno));
-				close(fd);
-				return -1;
-			}
-			fd_list[list_idx].fds[seg_idx] = fd;
+	/*
+	 * The kernel clears a hugepage only when it is mapped
+	 * from a particular file for the first time.
+	 * If the file already exists, the old content will be mapped.
+	 * If the memory manager assumes all mapped pages to be clean,
+	 * the file must be removed and created anew.
+	 * Otherwise, the primary caller must be notified
+	 * that mapped pages will be dirty
+	 * (secondary callers receive the segment state from the primary one).
+	 * When multiple hugepages are mapped from the same file,
+	 * whether they will be dirty depends on the part that is mapped.
+	 */
+	if (!internal_conf->single_file_segments &&
+			internal_conf->hugepage_file.unlink_existing &&
+			rte_eal_process_type() == RTE_PROC_PRIMARY &&
+			ret == 0) {
+		/* coverity[toctou] */
+		if (unlink(path) < 0) {
+			RTE_LOG(DEBUG, EAL, "%s(): could not remove '%s': %s\n",
+				__func__, path, strerror(errno));
+			return -1;
 		}
 	}
+
+	/* coverity[toctou] */
+	fd = open(path, O_CREAT | O_RDWR, 0600);
+	if (fd < 0) {
+		RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n",
+			__func__, strerror(errno));
+		return -1;
+	}
+	/* take out a read lock */
+	if (lock(fd, LOCK_SH) < 0) {
+		RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
+			__func__, strerror(errno));
+		close(fd);
+		return -1;
+	}
+	*out_fd = fd;
 	return fd;
 }
 
@@ -385,8 +396,10 @@ resize_hugefile_in_memory(int fd, uint64_t fa_offset,
 
 static int
 resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
-		bool grow)
+		bool grow, bool *dirty)
 {
+	const struct internal_config *internal_conf =
+			eal_get_internal_configuration();
 	bool again = false;
 
 	do {
@@ -405,6 +418,8 @@ resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
 			uint64_t cur_size = get_file_size(fd);
 
 			/* fallocate isn't supported, fall back to ftruncate */
+			if (dirty != NULL)
+				*dirty = new_size <= cur_size;
 			if (new_size > cur_size &&
 					ftruncate(fd, new_size) < 0) {
 				RTE_LOG(DEBUG, EAL, "%s(): ftruncate() failed: %s\n",
@@ -447,8 +462,17 @@ resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
 						strerror(errno));
 					return -1;
 				}
-			} else
+			} else {
 				fallocate_supported = 1;
+				/*
+				 * It is unknown which portions of an existing
+				 * hugepage file were allocated previously,
+				 * so all pages within the file are considered
+				 * dirty, unless the file is a fresh one.
+				 */
+				if (dirty != NULL)
+					*dirty &= !internal_conf->hugepage_file.unlink_existing;
+			}
 		}
 	} while (again);
 
@@ -475,7 +499,8 @@ close_hugefile(int fd, char *path, int list_idx)
 }
 
 static int
-resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow)
+resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow,
+		bool *dirty)
 {
 	/* in-memory mode is a special case, because we can be sure that
 	 * fallocate() is supported.
@@ -483,12 +508,15 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow)
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	if (internal_conf->in_memory)
+	if (internal_conf->in_memory) {
+		if (dirty != NULL)
+			*dirty = false;
 		return resize_hugefile_in_memory(fd, fa_offset,
 				page_sz, grow);
+	}
 
 	return resize_hugefile_in_filesystem(fd, fa_offset, page_sz,
-				grow);
+			grow, dirty);
 }
 
 static int
@@ -505,6 +533,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	char path[PATH_MAX];
 	int ret = 0;
 	int fd;
+	bool dirty;
 	size_t alloc_sz;
 	int flags;
 	void *new_addr;
@@ -534,6 +563,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 
 		pagesz_flag = pagesz_flags(alloc_sz);
 		fd = -1;
+		dirty = false;
 		mmap_flags = in_memory_flags | pagesz_flag;
 
 		/* single-file segments codepath will never be active
@@ -544,7 +574,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		map_offset = 0;
 	} else {
 		/* takes out a read lock on segment or segment list */
-		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx,
+				&dirty);
 		if (fd < 0) {
 			RTE_LOG(ERR, EAL, "Couldn't get fd on hugepage file\n");
 			return -1;
@@ -552,7 +583,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 
 		if (internal_conf->single_file_segments) {
 			map_offset = seg_idx * alloc_sz;
-			ret = resize_hugefile(fd, map_offset, alloc_sz, true);
+			ret = resize_hugefile(fd, map_offset, alloc_sz, true,
+					&dirty);
 			if (ret < 0)
 				goto resized;
 
@@ -662,6 +694,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	ms->nrank = rte_memory_get_nrank();
 	ms->iova = iova;
 	ms->socket_id = socket_id;
+	ms->flags = dirty ? RTE_MEMSEG_FLAG_DIRTY : 0;
 
 	return 0;
 
@@ -689,7 +722,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		return -1;
 
 	if (internal_conf->single_file_segments) {
-		resize_hugefile(fd, map_offset, alloc_sz, false);
+		resize_hugefile(fd, map_offset, alloc_sz, false, NULL);
 		/* ignore failure, can't make it any worse */
 
 		/* if refcount is at zero, close the file */
@@ -739,13 +772,13 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	 * segment and thus drop the lock on original fd, but hugepage dir is
 	 * now locked so we can take out another one without races.
 	 */
-	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx, NULL);
 	if (fd < 0)
 		return -1;
 
 	if (internal_conf->single_file_segments) {
 		map_offset = seg_idx * ms->len;
-		if (resize_hugefile(fd, map_offset, ms->len, false))
+		if (resize_hugefile(fd, map_offset, ms->len, false, NULL))
 			return -1;
 
 		if (--(fd_list[list_idx].count) == 0)
@@ -757,6 +790,7 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		 * holding onto this page.
 		 */
 		if (!internal_conf->in_memory &&
+				internal_conf->hugepage_file.unlink_existing &&
 				!internal_conf->hugepage_file.unlink_before_mapping) {
 			ret = lock(fd, LOCK_EX);
 			if (ret >= 0) {
@@ -1743,6 +1777,12 @@ eal_memalloc_init(void)
 			RTE_LOG(ERR, EAL, "Using anonymous memory is not supported\n");
 			return -1;
 		}
+		/* safety net, should be impossible to configure */
+		if (internal_conf->hugepage_file.unlink_before_mapping &&
+				!internal_conf->hugepage_file.unlink_existing) {
+			RTE_LOG(ERR, EAL, "Unlinking existing hugepage files is prohibited, cannot unlink them before mapping.\n");
+			return -1;
+		}
 	}
 
 	/* initialize all of the fd lists */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v2 6/6] eal: extend --huge-unlink for hugepage file reuse
  2022-01-19 21:11     ` [PATCH v2 5/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
@ 2022-01-19 21:11       ` Dmitry Kozlyuk
  0 siblings, 0 replies; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-19 21:11 UTC (permalink / raw)
  To: dev; +Cc: Bruce Richardson, Thomas Monjalon, Anatoly Burakov

Expose Linux EAL ability to reuse existing hugepage files
via --huge-unlink=never switch.
Default behavior is unchanged, it can also be specified
using --huge-unlink=existing for consistency.
Old --huge-unlink switch is kept,
it is an alias for --huge-unlink=always.
Add a test case for the --huge-unlink=never mode.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
---
 app/test/test_eal_flags.c                     | 25 ++++++++++++
 doc/guides/linux_gsg/linux_eal_parameters.rst | 24 ++++++++++--
 .../prog_guide/env_abstraction_layer.rst      | 12 ++++++
 doc/guides/rel_notes/release_22_03.rst        |  7 ++++
 lib/eal/common/eal_common_options.c           | 39 +++++++++++++++++--
 5 files changed, 100 insertions(+), 7 deletions(-)

diff --git a/app/test/test_eal_flags.c b/app/test/test_eal_flags.c
index d7f4c2cd47..e2696cda63 100644
--- a/app/test/test_eal_flags.c
+++ b/app/test/test_eal_flags.c
@@ -1122,6 +1122,11 @@ test_file_prefix(void)
 		DEFAULT_MEM_SIZE, "--single-file-segments",
 		"--file-prefix=" memtest1 };
 
+	/* primary process with memtest1 and --huge-unlink=never mode */
+	const char * const argv9[] = {prgname, "-m",
+		DEFAULT_MEM_SIZE, "--huge-unlink=never",
+		"--file-prefix=" memtest1 };
+
 	/* check if files for current prefix are present */
 	if (process_hugefiles(prefix, HUGEPAGE_CHECK_EXISTS) != 1) {
 		printf("Error - hugepage files for %s were not created!\n", prefix);
@@ -1290,6 +1295,26 @@ test_file_prefix(void)
 		return -1;
 	}
 
+	/* this process will run with --huge-unlink,
+	 * so it should not remove hugepage files when it exits
+	 */
+	if (launch_proc(argv9) != 0) {
+		printf("Error - failed to run with --huge-unlink=never\n");
+		return -1;
+	}
+
+	/* check if hugefiles for memtest1 are present */
+	if (process_hugefiles(memtest1, HUGEPAGE_CHECK_EXISTS) == 0) {
+		printf("Error - hugepage files for %s were deleted!\n",
+				memtest1);
+		return -1;
+	} else {
+		if (process_hugefiles(memtest1, HUGEPAGE_DELETE) != 1) {
+			printf("Error - deleting hugepages failed!\n");
+			return -1;
+		}
+	}
+
 	return 0;
 }
 
diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst
index 74df2611b5..ea8f381391 100644
--- a/doc/guides/linux_gsg/linux_eal_parameters.rst
+++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
@@ -84,10 +84,26 @@ Memory-related options
     Use specified hugetlbfs directory instead of autodetected ones. This can be
     a sub-directory within a hugetlbfs mountpoint.
 
-*   ``--huge-unlink``
-
-    Unlink hugepage files after creating them (implies no secondary process
-    support).
+*   ``--huge-unlink[=existing|always|never]``
+
+    No ``--huge-unlink`` option or ``--huge-unlink=existing`` is the default:
+    existing hugepage files are removed and re-created
+    to ensure the kernel clears the memory and prevents any data leaks.
+
+    With ``--huge-unlink`` (no value) or ``--huge-unlink=always``,
+    hugepage files are also removed before mapping them,
+    so that the application leaves no files in hugetlbfs.
+    This mode implies no multi-process support.
+
+    When ``--huge-unlink=never`` is specified, existing hugepage files
+    are never removed, but are remapped instead, allowing hugepage reuse.
+    This makes restart faster by saving time to clear memory at initialization,
+    but it may slow down zeroed allocations later.
+    Reused hugepages can contain data from previous processes that used them,
+    which may be a security concern.
+    Hugepage files created in this mode are also not removed
+    when all the hugepages mapped from them are freed,
+    which allows to reuse these files after a restart.
 
 *   ``--match-allocations``
 
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index fede7fe69d..b1eae592ab 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -282,6 +282,18 @@ to prevent data leaks from previous users of the same hugepage.
 EAL ensures this behavior by removing existing backing files at startup
 and by recreating them before opening for mapping (as a precaution).
 
+One exception is ``--huge-unlink=never`` mode.
+It is used to speed up EAL initialization, usually on application restart.
+Clearing memory constitutes more than 95% of hugepage mapping time.
+EAL can save it by remapping existing backing files
+with all the data left in the mapped hugepages ("dirty" memory).
+Such segments are marked with ``RTE_MEMSEG_FLAG_DIRTY``.
+Memory allocator detects dirty segments handles them accordingly,
+in particular, it clears memory requested with ``rte_zmalloc*()``.
+In this mode EAL also does not remove a backing file
+when all pages mapped from it are freed,
+because they are intended to be reusable at restart.
+
 Anonymous mapping does not allow multi-process architecture,
 but it is free of filename conflicts and leftover files on hugetlbfs.
 It makes running as non-root easier,
diff --git a/doc/guides/rel_notes/release_22_03.rst b/doc/guides/rel_notes/release_22_03.rst
index 6d99d1eaa9..0b882362cf 100644
--- a/doc/guides/rel_notes/release_22_03.rst
+++ b/doc/guides/rel_notes/release_22_03.rst
@@ -55,6 +55,13 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added ability to reuse hugepages in Linux.**
+
+  It is possible to reuse files in hugetlbfs to speed up hugepage mapping,
+  which may be useful for fast restart and large allocations.
+  The new mode is activated with ``--huge-unlink=never``
+  and has security implications, refer to the user and programmer guides.
+
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index cdd2284b0c..45d393b393 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -74,7 +74,7 @@ eal_long_options[] = {
 	{OPT_FILE_PREFIX,       1, NULL, OPT_FILE_PREFIX_NUM      },
 	{OPT_HELP,              0, NULL, OPT_HELP_NUM             },
 	{OPT_HUGE_DIR,          1, NULL, OPT_HUGE_DIR_NUM         },
-	{OPT_HUGE_UNLINK,       0, NULL, OPT_HUGE_UNLINK_NUM      },
+	{OPT_HUGE_UNLINK,       2, NULL, OPT_HUGE_UNLINK_NUM      },
 	{OPT_IOVA_MODE,	        1, NULL, OPT_IOVA_MODE_NUM        },
 	{OPT_LCORES,            1, NULL, OPT_LCORES_NUM           },
 	{OPT_LOG_LEVEL,         1, NULL, OPT_LOG_LEVEL_NUM        },
@@ -1598,6 +1598,28 @@ available_cores(void)
 	return str;
 }
 
+#define HUGE_UNLINK_NEVER "never"
+
+static int
+eal_parse_huge_unlink(const char *arg, struct hugepage_file_discipline *out)
+{
+	if (arg == NULL || strcmp(arg, "always") == 0) {
+		out->unlink_before_mapping = true;
+		return 0;
+	}
+	if (strcmp(arg, "existing") == 0) {
+		/* same as not specifying the option */
+		return 0;
+	}
+	if (strcmp(arg, HUGE_UNLINK_NEVER) == 0) {
+		RTE_LOG(WARNING, EAL, "Using --"OPT_HUGE_UNLINK"="
+			HUGE_UNLINK_NEVER" may create data leaks.\n");
+		out->unlink_existing = false;
+		return 0;
+	}
+	return -1;
+}
+
 int
 eal_parse_common_option(int opt, const char *optarg,
 			struct internal_config *conf)
@@ -1739,7 +1761,10 @@ eal_parse_common_option(int opt, const char *optarg,
 
 	/* long options */
 	case OPT_HUGE_UNLINK_NUM:
-		conf->hugepage_file.unlink_before_mapping = true;
+		if (eal_parse_huge_unlink(optarg, &conf->hugepage_file) < 0) {
+			RTE_LOG(ERR, EAL, "invalid --"OPT_HUGE_UNLINK" option\n");
+			return -1;
+		}
 		break;
 
 	case OPT_NO_HUGE_NUM:
@@ -2070,6 +2095,12 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"not compatible with --"OPT_HUGE_UNLINK"\n");
 		return -1;
 	}
+	if (!internal_cfg->hugepage_file.unlink_existing &&
+			internal_cfg->in_memory) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_IN_MEMORY" is not compatible "
+			"with --"OPT_HUGE_UNLINK"="HUGE_UNLINK_NEVER"\n");
+		return -1;
+	}
 	if (internal_cfg->legacy_mem &&
 			internal_cfg->in_memory) {
 		RTE_LOG(ERR, EAL, "Option --"OPT_LEGACY_MEM" is not compatible "
@@ -2202,7 +2233,9 @@ eal_common_usage(void)
 	       "  --"OPT_NO_TELEMETRY"   Disable telemetry support\n"
 	       "  --"OPT_FORCE_MAX_SIMD_BITWIDTH" Force the max SIMD bitwidth\n"
 	       "\nEAL options for DEBUG use only:\n"
-	       "  --"OPT_HUGE_UNLINK"       Unlink hugepage files after init\n"
+	       "  --"OPT_HUGE_UNLINK"[=existing|always|never]\n"
+	       "                      When to unlink files in hugetlbfs\n"
+	       "                      ('existing' by default, no value means 'always')\n"
 	       "  --"OPT_NO_HUGE"           Use malloc instead of hugetlbfs\n"
 	       "  --"OPT_NO_PCI"            Disable PCI\n"
 	       "  --"OPT_NO_HPET"           Disable HPET\n"
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [PATCH v1 0/6] Fast restart with many hugepages
  2022-01-17 16:40   ` [PATCH v1 0/6] Fast restart with many hugepages Bruce Richardson
@ 2022-01-19 21:12     ` Dmitry Kozlyuk
  2022-01-20  9:05       ` Bruce Richardson
  0 siblings, 1 reply; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-19 21:12 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: dev, Anatoly Burakov, Slava Ovsiienko, David Marchand,
	NBU-Contact-Thomas Monjalon (EXTERNAL),
	Lior Margalit

Hi Bruce,

> From: Bruce Richardson <bruce.richardson@intel.com>
> [...]
> this seems really interesting, but in the absense of TB of memory
> being
> used, is it easily possible to see the benefits of this work? I've
> been
> playing with adding large memory allocations to helloworld example and
> checking the runtime. Allocating 1GB using malloc per thread seems to
> show
> a small (<0.5 second at most) benefit, and using a fixed 10GB
> allocation
> using memzone_reserve at startup shows runtimes within the margin of
> error
> when run with --huge-unlink=existing vs huge-unlink=never. At what
> size of
> memory footprint is it expected to make a clear improvement?

Sorry, there was a bug in v1 that completely broke the testing.
I should've double-checked
after what I considered a quick rebase before sending.
Version 2 can be simply tested even without modifyin the code:

time sh -c 'echo quit | sudo ../_build/dpdk/app/test/dpdk-test
	--huge-unlink=never -m 8192 --single-file-segments --no-pci
	2>/dev/null >/dev/null'

With --huge-unlink=existing:
real    0m1.450s
user    0m0.574s
sys     0m0.706s	(1)

With --huge-unlink=never, first run (no hugepage files to reuse):
real    0m0.892s
user    0m0.002s
sys     0m0.718s	(2)

With --huge-unlink=never, second run (hugepage files left):
real    0m0.210s
user    0m0.010s
sys     0m0.021s	(3)

Notice that (1) and (2) are close since there is no reuse,
but (2) and (3) are differ by 0.7 seconds for 8GB,
which correlates with 14 GB/sec memset() speed on this machine.
Results without --single-file-segments are nearly identical.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [PATCH v1 2/6] app/test: add allocator performance benchmark
  2022-01-17 15:51       ` Bruce Richardson
@ 2022-01-19 21:12         ` Dmitry Kozlyuk
  2022-01-20  9:04           ` Bruce Richardson
  0 siblings, 1 reply; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-01-19 21:12 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev, Aaron Conole, Slava Ovsiienko

> From: Bruce Richardson <bruce.richardson@intel.com>
> [...]
> > What is the expected running time of this test? When I tried it out
> on my
> > machine it appears to hang after the following output:
> > [...]

It always runs within 50 seconds on my machine (E5-1650 v3 @ 3.50GHz).
Judging by the output, it runs faster than yours
(203 vs 811 total microseconds in 1M allocation case):

USER1: Reference: memset
USER1: Result: 14.557848.3 GiB/s <=> 67.08 us/MiB
USER1: 
USER1: Performance: rte_malloc
USER1:     Size (B)    Runs  Alloc (us)   Free (us)  Total (us)      memset (us)
USER1:           64   10000        0.09        0.04        0.13             0.01
USER1:          128   10000        0.09        0.04        0.13             0.01
USER1:         1024   10000        0.12        0.09        0.21             0.11
USER1:         4096   10000        0.15        0.40        0.55             0.27
USER1:        65536   10000        0.16        4.37        4.53             4.25
USER1:      1048576   10000       73.85      129.23      203.07            67.26
USER1:      2097152    7154      148.98      259.42      408.39           134.34
USER1:      4194304    3570      298.28      519.76      818.04           268.65
USER1:     16777216     882     1206.85     2093.46     3300.30          1074.25
USER1:   1073741824       6   188765.01   206544.04   395309.06         68739.83
[...]

Note that to see --huge-unlink effect you must run it twice:
the first run creates and leaves the files, the second reuses them.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 2/6] app/test: add allocator performance benchmark
  2022-01-19 21:12         ` Dmitry Kozlyuk
@ 2022-01-20  9:04           ` Bruce Richardson
  0 siblings, 0 replies; 53+ messages in thread
From: Bruce Richardson @ 2022-01-20  9:04 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Aaron Conole, Slava Ovsiienko

On Wed, Jan 19, 2022 at 09:12:35PM +0000, Dmitry Kozlyuk wrote:
> > From: Bruce Richardson <bruce.richardson@intel.com>
> > [...]
> > > What is the expected running time of this test? When I tried it out
> > on my
> > > machine it appears to hang after the following output:
> > > [...]
> 
> It always runs within 50 seconds on my machine (E5-1650 v3 @ 3.50GHz).
> Judging by the output, it runs faster than yours
> (203 vs 811 total microseconds in 1M allocation case):
> 
> USER1: Reference: memset
> USER1: Result: 14.557848.3 GiB/s <=> 67.08 us/MiB
> USER1: 
> USER1: Performance: rte_malloc
> USER1:     Size (B)    Runs  Alloc (us)   Free (us)  Total (us)      memset (us)
> USER1:           64   10000        0.09        0.04        0.13             0.01
> USER1:          128   10000        0.09        0.04        0.13             0.01
> USER1:         1024   10000        0.12        0.09        0.21             0.11
> USER1:         4096   10000        0.15        0.40        0.55             0.27
> USER1:        65536   10000        0.16        4.37        4.53             4.25
> USER1:      1048576   10000       73.85      129.23      203.07            67.26
> USER1:      2097152    7154      148.98      259.42      408.39           134.34
> USER1:      4194304    3570      298.28      519.76      818.04           268.65
> USER1:     16777216     882     1206.85     2093.46     3300.30          1074.25
> USER1:   1073741824       6   188765.01   206544.04   395309.06         68739.83
> [...]
> 
> Note that to see --huge-unlink effect you must run it twice:
> the first run creates and leaves the files, the second reuses them.

My run seems to hang when doing the 2M size tests, which I also notice is
the first run above where the number of runs is not 10000. What is the
termination condition for each of the runs, and is that something that
could cause hangs on slower machines?

/Bruce

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v1 0/6] Fast restart with many hugepages
  2022-01-19 21:12     ` Dmitry Kozlyuk
@ 2022-01-20  9:05       ` Bruce Richardson
  0 siblings, 0 replies; 53+ messages in thread
From: Bruce Richardson @ 2022-01-20  9:05 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Anatoly Burakov, Slava Ovsiienko, David Marchand,
	NBU-Contact-Thomas Monjalon (EXTERNAL),
	Lior Margalit

On Wed, Jan 19, 2022 at 09:12:27PM +0000, Dmitry Kozlyuk wrote:
> Hi Bruce,
> 
> > From: Bruce Richardson <bruce.richardson@intel.com>
> > [...]
> > this seems really interesting, but in the absense of TB of memory
> > being
> > used, is it easily possible to see the benefits of this work? I've
> > been
> > playing with adding large memory allocations to helloworld example and
> > checking the runtime. Allocating 1GB using malloc per thread seems to
> > show
> > a small (<0.5 second at most) benefit, and using a fixed 10GB
> > allocation
> > using memzone_reserve at startup shows runtimes within the margin of
> > error
> > when run with --huge-unlink=existing vs huge-unlink=never. At what
> > size of
> > memory footprint is it expected to make a clear improvement?
> 
> Sorry, there was a bug in v1 that completely broke the testing.
> I should've double-checked
> after what I considered a quick rebase before sending.
> Version 2 can be simply tested even without modifyin the code:
> 
> time sh -c 'echo quit | sudo ../_build/dpdk/app/test/dpdk-test
> 	--huge-unlink=never -m 8192 --single-file-segments --no-pci
> 	2>/dev/null >/dev/null'
> 
> With --huge-unlink=existing:
> real    0m1.450s
> user    0m0.574s
> sys     0m0.706s	(1)
> 
> With --huge-unlink=never, first run (no hugepage files to reuse):
> real    0m0.892s
> user    0m0.002s
> sys     0m0.718s	(2)
> 
> With --huge-unlink=never, second run (hugepage files left):
> real    0m0.210s
> user    0m0.010s
> sys     0m0.021s	(3)
> 
> Notice that (1) and (2) are close since there is no reuse,
> but (2) and (3) are differ by 0.7 seconds for 8GB,
> which correlates with 14 GB/sec memset() speed on this machine.
> Results without --single-file-segments are nearly identical.

Thanks, glad to hear it wasn't just me! I'll check again the v2 when I get
the chance.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/6] Fast restart with many hugepages
  2022-01-19 21:09   ` [PATCH v2 " Dmitry Kozlyuk
                       ` (4 preceding siblings ...)
  2022-01-19 21:11     ` [PATCH v2 5/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
@ 2022-01-27 12:07     ` Bruce Richardson
  2022-02-02 14:12     ` Thomas Monjalon
                       ` (2 subsequent siblings)
  8 siblings, 0 replies; 53+ messages in thread
From: Bruce Richardson @ 2022-01-27 12:07 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Anatoly Burakov, Viacheslav Ovsiienko, David Marchand,
	Thomas Monjalon, Lior Margalit

On Wed, Jan 19, 2022 at 11:09:11PM +0200, Dmitry Kozlyuk wrote:
> This patchset is a new design and implementation of [1].
> 
> v2:
>   * Fix hugepage file removal when they are no longer used.
>     Disable removal with --huge-unlink=never as intended.
>     Document this behavior difference. (Bruce)
>   * Improve documentation, commit messages, and naming. (Thomas)
> 
Thanks for the v2, I now see the promised perf improvements when running
some quick tests with testpmd. Some quick numbers below, summary version is
that for testpmd with default mempool size startup/exit time drops from
1.7s to 1.4s, and when I increase mempool size to 4M mbufs, time drops
from 7.6s to 3.9s.

/Bruce

cmd: "time echo "quit" | sudo ./build/app/dpdk-testpmd -c F --no-pci -- -i"

Baseline (no patches) - 1.7 sec
Baseline (with patches) - 1.7 sec
Huge-unlink=never - 1.4 sec

Adding --total-num-mbufs=4096000

Baseline (with patches) - 7.6 sec
Huge-unlink=never - 3.9 sec

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 1/6] doc: add hugepage mapping details
  2022-01-19 21:09     ` [PATCH v2 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
@ 2022-01-27 13:59       ` Bruce Richardson
  0 siblings, 0 replies; 53+ messages in thread
From: Bruce Richardson @ 2022-01-27 13:59 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev, Anatoly Burakov

On Wed, Jan 19, 2022 at 11:09:12PM +0200, Dmitry Kozlyuk wrote:
> Hugepage mapping is a layer of EAL malloc builds upon.
> There were implicit references to its details,
> like mentions of segment file descriptors,
> but no explicit description of its modes and operation.
> Add an overview of mechanics used on ech supported OS.
> Convert memory management subsections from list items
> to level 4 headers: they are big and important enough.
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> ---

Some good cleanup and doc enhancements here. Some comments inline below.
One could argue that this patch should perhaps to be 2, with the conversion
of bullets to subsections being separate, but personally I think it's fine
having it in one patch as here.

Acked-by: Bruce Richardson <bruce.richardson@intel.com>

>  .../prog_guide/env_abstraction_layer.rst      | 95 +++++++++++++++++--
>  1 file changed, 86 insertions(+), 9 deletions(-)
> 
> diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
> index c6accce701..fede7fe69d 100644
> --- a/doc/guides/prog_guide/env_abstraction_layer.rst
> +++ b/doc/guides/prog_guide/env_abstraction_layer.rst
> @@ -86,7 +86,7 @@ See chapter
>  Memory Mapping Discovery and Memory Reservation
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>  
> -The allocation of large contiguous physical memory is done using the hugetlbfs kernel filesystem.
> +The allocation of large contiguous physical memory is done using hugepages.
>  The EAL provides an API to reserve named memory zones in this contiguous memory.
>  The physical address of the reserved memory for that memory zone is also returned to the user by the memory zone reservation API.
>  
> @@ -95,11 +95,13 @@ and legacy mode. Both modes are explained below.
>  
>  .. note::
>  
> -    Memory reservations done using the APIs provided by rte_malloc are also backed by pages from the hugetlbfs filesystem.
> +    Memory reservations done using the APIs provided by rte_malloc
> +    are also backed by hugepages unless ``--no-huge`` option is given.
>  
> -+ Dynamic memory mode
> +Dynamic Memory Mode
> +^^^^^^^^^^^^^^^^^^^
>  
> -Currently, this mode is only supported on Linux.
> +Currently, this mode is only supported on Linux and Windows.
>  
>  In this mode, usage of hugepages by DPDK application will grow and shrink based
>  on application's requests. Any memory allocation through ``rte_malloc()``,
> @@ -155,7 +157,8 @@ of memory that can be used by DPDK application.
>      :ref:`Multi-process Support <Multi-process_Support>` for more details about
>      DPDK IPC.
>  
> -+ Legacy memory mode
> +Legacy Memory Mode
> +^^^^^^^^^^^^^^^^^^
>  
>  This mode is enabled by specifying ``--legacy-mem`` command-line switch to the
>  EAL. This switch will have no effect on FreeBSD as FreeBSD only supports
> @@ -168,7 +171,8 @@ not allow acquiring or releasing hugepages from the system at runtime.
>  If neither ``-m`` nor ``--socket-mem`` were specified, the entire available
>  hugepage memory will be preallocated.
>  
> -+ Hugepage allocation matching
> +Hugepage Allocation Matching
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  
>  This behavior is enabled by specifying the ``--match-allocations`` command-line
>  switch to the EAL. This switch is Linux-only and not supported with
> @@ -182,7 +186,8 @@ matching can be used by these types of applications to satisfy both of these
>  requirements. This can result in some increased memory usage which is
>  very dependent on the memory allocation patterns of the application.
>  
> -+ 32-bit support
> +32-bit Support
> +^^^^^^^^^^^^^^
>  
>  Additional restrictions are present when running in 32-bit mode. In dynamic
>  memory mode, by default maximum of 2 gigabytes of VA space will be preallocated,
> @@ -192,7 +197,8 @@ used.
>  In legacy mode, VA space will only be preallocated for segments that were
>  requested (plus padding, to keep IOVA-contiguousness).
>  
> -+ Maximum amount of memory
> +Maximum Amount of Memory
> +^^^^^^^^^^^^^^^^^^^^^^^^
>  
>  All possible virtual memory space that can ever be used for hugepage mapping in
>  a DPDK process is preallocated at startup, thereby placing an upper limit on how
> @@ -222,7 +228,77 @@ Normally, these options do not need to be changed.
>      can later be mapped into that preallocated VA space (if dynamic memory mode
>      is enabled), and can optionally be mapped into it at startup.
>  
> -+ Segment file descriptors
> +Hugepage Mapping
> +^^^^^^^^^^^^^^^^
> +
> +Below is an overview of methods used for each OS to obtain hugepages,
> +explaining why certain limitations and options exist in EAL.
> +See the user guide for a specific OS for configuration details.
> +
> +FreeBSD uses ``contigmem`` kernel module
> +to reserve a fixed number of hugepages at system start,
> +which are mapped by EAL at initialization using a specific ``sysctl()``.
> +
> +Windows EAL allocates hugepages from the OS as needed using Win32 API,
> +so available amount depends on the system load.
> +It uses ``virt2phys`` kernel module to obtain physical addresses,
> +unless running in IOVA-as-VA mode (e.g. forced with ``--iova-mode=va``).
> +
> +Linux implements a variety of methods:
> +
> +* mapping each hugepage from its own file in hugetlbfs;
> +* mapping multiple hugepages from a shared file in hugetlbfs;
> +* anonymous mapping.

I think we need to be clearer about how each of these modes is got i.e. how
they correspond to different EAL args. At least for me, "anonymous mapping"
is better known as "in-memory mode".

> +
> +Mapping hugepages from files in hugetlbfs is essential for multi-process,
> +because secondary processes need to map the same hugepages.
> +EAL creates files like ``rtemap_0``
> +in directories specified with ``--huge-dir`` option
> +(or in the mount point for a specific hugepage size).
> +The ``rte`` prefix can be changed using ``--file-prefix``.
> +This may be needed for running multiple primary processes
> +that share a hugetlbfs mount point.
> +Each backing file by default corresponds to one hugepage,
> +it is opened and locked for the entire time the hugepage is used.
> +This may exhaust the number of open files limit (``NOFILE``).
> +See :ref:`segment-file-descriptors` section
> +on how the number of open backing file descriptors can be reduced.
> +
> +In dynamic memory mode, EAL removes a backing hugepage file
> +when all pages mapped from it are freed back to the system.
> +However, backing files may persist after the application terminates
> +in case of a crash or a leak of DPDK memory (e.g. ``rte_free()`` is missing).
> +This reduces the number of hugepages available to other processes
> +as reported by ``/sys/kernel/mm/hugepages/hugepages-*/free_hugepages``.
> +EAL can remove the backing files after opening them for mapping
> +if ``--huge-unlink`` is given to avoid polluting hugetlbfs.
> +However, since it disables multi-process anyway,
> +using anonymous mapping (``--in-memory``) is recommended instead.
> +
> +:ref:`EAL memory allocator <malloc>` relies on hugepages being zero-filled.
> +Hugepages are cleared by the kernel when a file in hugetlbfs or its part
> +is mapped for the first time system-wide
> +to prevent data leaks from previous users of the same hugepage.
> +EAL ensures this behavior by removing existing backing files at startup
> +and by recreating them before opening for mapping (as a precaution).
> +
> +Anonymous mapping does not allow multi-process architecture,
> +but it is free of filename conflicts and leftover files on hugetlbfs.
> +It makes running as non-root easier,
> +because memory management does not require root permissions in this case
> +(the limit of locked memory amount, ``MEMLOCK``, still applies).
> +If memfd_create(2) is supported both at build and run time,
> +DPDK memory manager can provide file descriptors for memory segments,
> +which are required for VirtIO with vhost-user backend.
> +This can exhaust the number of open files limit (``NOFILE``)
> +despite not creating any files in hugetlbfs.
> +See :ref:`segment-file-descriptors` section
> +on how the number of open file descriptors used by EAL can be reduced.
> +

For this last paragraph, I think we need to clarify that this is
"in-memory" mode, and also make it clearer that it's not just that it
doesn't leave hugepage files on the FS, but that in most cases it does not
use hugetlbfs at all.

> +.. _segment-file-descriptors:
> +
> +Segment File Descriptors
> +^^^^^^^^^^^^^^^^^^^^^^^^
>  
>  On Linux, in most cases, EAL will store segment file descriptors in EAL. This
>  can become a problem when using smaller page sizes due to underlying limitations
> @@ -731,6 +807,7 @@ We expect only 50% of CPU spend on packet IO.
>      echo 100000 > pkt_io/cpu.cfs_period_us
>      echo  50000 > pkt_io/cpu.cfs_quota_us
>  
> +.. _malloc:
>  
>  Malloc
>  ------
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/6] Fast restart with many hugepages
  2022-01-19 21:09   ` [PATCH v2 " Dmitry Kozlyuk
                       ` (5 preceding siblings ...)
  2022-01-27 12:07     ` [PATCH v2 0/6] Fast restart with many hugepages Bruce Richardson
@ 2022-02-02 14:12     ` Thomas Monjalon
  2022-02-02 21:54     ` David Marchand
  2022-02-03 18:13     ` [PATCH v3 " Dmitry Kozlyuk
  8 siblings, 0 replies; 53+ messages in thread
From: Thomas Monjalon @ 2022-02-02 14:12 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dev, Bruce Richardson, Viacheslav Ovsiienko, David Marchand,
	Lior Margalit, Dmitry Kozlyuk

2 weeks passed without any new comment except a test by Bruce.
I would prefer avoiding a merge in the last minute.
Anatoly, any comment?


19/01/2022 22:09, Dmitry Kozlyuk:
> This patchset is a new design and implementation of [1].
> 
> v2:
>   * Fix hugepage file removal when they are no longer used.
>     Disable removal with --huge-unlink=never as intended.
>     Document this behavior difference. (Bruce)
>   * Improve documentation, commit messages, and naming. (Thomas)
> 
> # Problem Statement
> 
> Large allocations that involve mapping new hugepages are slow.
> This is problematic, for example, in the following use case.
> A single-process application allocates ~1TB of mempools at startup.
> Sometimes the app needs to restart as quick as possible.
> Allocating the hugepages anew takes as long as 15 seconds,
> while the new process could just pick up all the memory
> left by the old one (reinitializing the contents as needed).
> 
> Almost all of mmap(2) time spent in the kernel
> is clearing the memory, i.e. filling it with zeros.
> This is done if a file in hugetlbfs is mapped
> for the first time system-wide, i.e. a hugepage is committed
> to prevent data leaks from the previous users of the same hugepage.
> For example, mapping 32 GB from a new file may take 2.16 seconds,
> while mapping the same pages again takes only 0.3 ms.
> Security put aside, e.g. when the environment is controlled,
> this effort is wasted for the memory intended for DMA,
> because its content will be overwritten anyway.
> 
> Linux EAL explicitly removes hugetlbfs files at initialization
> and before mapping to force the kernel clear the memory.
> This allows the memory allocator to clean memory on only on freeing.
> 
> # Solution
> 
> Add a new mode allowing EAL to remap existing hugepage files.
> While it is intended to make restarts faster in the first place,
> it makes any startup faster except the cold one
> (with no existing files).
> 
> It is the administrator who accepts security risks
> implied by reusing hugepages.
> The new mode is an opt-in and a warning is logged.
> 
> The feature is Linux-only as it is related
> to mapping hugepages from files which only Linux does.
> It is inherently incompatible with --in-memory,
> for --huge-unlink see below.
> 
> There is formally no breakage of API contract,
> but there is a behavior change in the new mode:
> rte_malloc*() and rte_memzone_reserve*() may return dirty memory
> (previously they were returning clean memory from free heap elements).
> Their contract has always explicitly allowed this,
> but still there may be users relying on the traditional behavior.
> Such users will need to fix their code to use the new mode.
> 
> # Implementation
> 
> ## User Interface
> 
> There is --huge-unlink switch in the same area to remove hugepage files
> before mapping them. It is infeasible to use with the new mode,
> because the point is to keep hugepage files for fast future restarts.
> Extend --huge-unlink option to represent only valid combinations:
> 
> * --huge-unlink=existing OR no option (for compatibility):
>   unlink files at initialization
>   and before opening them as a precaution.
> 
> * --huge-unlink=always OR just --huge-unlink (for compatibility):
>   same as above + unlink created files before mapping.
> 
> * --huge-unlink=never:
>   the new mode, do not unlink hugepages files, reuse them.
> 
> This option was always Linux-only, but it is kept as common
> in case there are users who expect it to be a no-op on other systems.
> (Adding a separate --huge-reuse option was also considered,
> but there is no obvious benefit and more combinations to test.)
> 
> ## EAL
> 
> If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
> so that the memory allocator may clear the memory if need be.
> See patch 5/6 description for details how this is done
> in different memory mapping modes.
> 
> The memory manager tracks whether an element is clean or dirty.
> If rte_zmalloc*() allocates from a dirty element,
> the memory is cleared before handling it to the user.
> On freeing, the allocator joins adjacent free elements,
> but in the new mode it may not be feasible to clear the free memory
> if the joint element is dirty (contains dirty parts).
> In any case, memory will be cleared only once,
> either on freeing or on allocation.
> See patch 3/6 for details.
> Patch 2/6 adds a benchmark to see how time is distributed
> between allocation and freeing in different modes.
> 
> Besides clearing memory, each mmap() call takes some time.
> For example, 1024 calls for 1 TB may take ~300 ms.
> The time of one call mapping N hugepages is O(N),
> because inside the kernel hugepages are allocated ony by one.
> Syscall overhead is negligeable even for one page.
> Hence, it does not make sense to reduce the number of mmap() calls,
> which would essentially move the loop over pages into the kernel.
> 
> [1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozlyuk@nvidia.com/
> 
> Dmitry Kozlyuk (6):
>   doc: add hugepage mapping details
>   app/test: add allocator performance benchmark
>   mem: add dirty malloc element support
>   eal: refactor --huge-unlink storage
>   eal/linux: allow hugepage file reuse
>   eal: extend --huge-unlink for hugepage file reuse
> 
>  app/test/meson.build                          |   2 +
>  app/test/test_eal_flags.c                     |  25 +++
>  app/test/test_malloc_perf.c                   | 174 ++++++++++++++++++
>  doc/guides/linux_gsg/linux_eal_parameters.rst |  24 ++-
>  .../prog_guide/env_abstraction_layer.rst      | 107 ++++++++++-
>  doc/guides/rel_notes/release_22_03.rst        |   7 +
>  lib/eal/common/eal_common_options.c           |  48 ++++-
>  lib/eal/common/eal_internal_cfg.h             |  10 +-
>  lib/eal/common/malloc_elem.c                  |  22 ++-
>  lib/eal/common/malloc_elem.h                  |  11 +-
>  lib/eal/common/malloc_heap.c                  |  18 +-
>  lib/eal/common/rte_malloc.c                   |  21 ++-
>  lib/eal/include/rte_memory.h                  |   8 +-
>  lib/eal/linux/eal.c                           |   3 +-
>  lib/eal/linux/eal_hugepage_info.c             | 118 +++++++++---
>  lib/eal/linux/eal_memalloc.c                  | 173 ++++++++++-------
>  lib/eal/linux/eal_memory.c                    |   2 +-
>  17 files changed, 644 insertions(+), 129 deletions(-)
>  create mode 100644 app/test/test_malloc_perf.c
> 
> 






^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/6] Fast restart with many hugepages
  2022-01-19 21:09   ` [PATCH v2 " Dmitry Kozlyuk
                       ` (6 preceding siblings ...)
  2022-02-02 14:12     ` Thomas Monjalon
@ 2022-02-02 21:54     ` David Marchand
  2022-02-03 10:26       ` David Marchand
  2022-02-03 18:13     ` [PATCH v3 " Dmitry Kozlyuk
  8 siblings, 1 reply; 53+ messages in thread
From: David Marchand @ 2022-02-02 21:54 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Bruce Richardson, Anatoly Burakov, Viacheslav Ovsiienko,
	Thomas Monjalon, Lior Margalit

Hello Dmitry,

On Wed, Jan 19, 2022 at 10:09 PM Dmitry Kozlyuk <dkozlyuk@nvidia.com> wrote:
>
> This patchset is a new design and implementation of [1].
>
> v2:
>   * Fix hugepage file removal when they are no longer used.
>     Disable removal with --huge-unlink=never as intended.
>     Document this behavior difference. (Bruce)
>   * Improve documentation, commit messages, and naming. (Thomas)
>
> # Problem Statement
>
> Large allocations that involve mapping new hugepages are slow.
> This is problematic, for example, in the following use case.
> A single-process application allocates ~1TB of mempools at startup.
> Sometimes the app needs to restart as quick as possible.
> Allocating the hugepages anew takes as long as 15 seconds,
> while the new process could just pick up all the memory
> left by the old one (reinitializing the contents as needed).
>
> Almost all of mmap(2) time spent in the kernel
> is clearing the memory, i.e. filling it with zeros.
> This is done if a file in hugetlbfs is mapped
> for the first time system-wide, i.e. a hugepage is committed
> to prevent data leaks from the previous users of the same hugepage.
> For example, mapping 32 GB from a new file may take 2.16 seconds,
> while mapping the same pages again takes only 0.3 ms.
> Security put aside, e.g. when the environment is controlled,
> this effort is wasted for the memory intended for DMA,
> because its content will be overwritten anyway.
>
> Linux EAL explicitly removes hugetlbfs files at initialization
> and before mapping to force the kernel clear the memory.
> This allows the memory allocator to clean memory on only on freeing.
>
> # Solution
>
> Add a new mode allowing EAL to remap existing hugepage files.
> While it is intended to make restarts faster in the first place,
> it makes any startup faster except the cold one
> (with no existing files).
>
> It is the administrator who accepts security risks
> implied by reusing hugepages.
> The new mode is an opt-in and a warning is logged.
>
> The feature is Linux-only as it is related
> to mapping hugepages from files which only Linux does.
> It is inherently incompatible with --in-memory,
> for --huge-unlink see below.
>
> There is formally no breakage of API contract,
> but there is a behavior change in the new mode:
> rte_malloc*() and rte_memzone_reserve*() may return dirty memory
> (previously they were returning clean memory from free heap elements).
> Their contract has always explicitly allowed this,
> but still there may be users relying on the traditional behavior.
> Such users will need to fix their code to use the new mode.
>
> # Implementation
>
> ## User Interface
>
> There is --huge-unlink switch in the same area to remove hugepage files
> before mapping them. It is infeasible to use with the new mode,
> because the point is to keep hugepage files for fast future restarts.
> Extend --huge-unlink option to represent only valid combinations:
>
> * --huge-unlink=existing OR no option (for compatibility):
>   unlink files at initialization
>   and before opening them as a precaution.
>
> * --huge-unlink=always OR just --huge-unlink (for compatibility):
>   same as above + unlink created files before mapping.
>
> * --huge-unlink=never:
>   the new mode, do not unlink hugepages files, reuse them.
>
> This option was always Linux-only, but it is kept as common
> in case there are users who expect it to be a no-op on other systems.
> (Adding a separate --huge-reuse option was also considered,
> but there is no obvious benefit and more combinations to test.)
>
> ## EAL
>
> If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
> so that the memory allocator may clear the memory if need be.
> See patch 5/6 description for details how this is done
> in different memory mapping modes.
>
> The memory manager tracks whether an element is clean or dirty.
> If rte_zmalloc*() allocates from a dirty element,
> the memory is cleared before handling it to the user.
> On freeing, the allocator joins adjacent free elements,
> but in the new mode it may not be feasible to clear the free memory
> if the joint element is dirty (contains dirty parts).
> In any case, memory will be cleared only once,
> either on freeing or on allocation.
> See patch 3/6 for details.
> Patch 2/6 adds a benchmark to see how time is distributed
> between allocation and freeing in different modes.
>
> Besides clearing memory, each mmap() call takes some time.
> For example, 1024 calls for 1 TB may take ~300 ms.
> The time of one call mapping N hugepages is O(N),
> because inside the kernel hugepages are allocated ony by one.
> Syscall overhead is negligeable even for one page.
> Hence, it does not make sense to reduce the number of mmap() calls,
> which would essentially move the loop over pages into the kernel.
>
> [1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozlyuk@nvidia.com/
>
> Dmitry Kozlyuk (6):
>   doc: add hugepage mapping details
>   app/test: add allocator performance benchmark
>   mem: add dirty malloc element support
>   eal: refactor --huge-unlink storage
>   eal/linux: allow hugepage file reuse
>   eal: extend --huge-unlink for hugepage file reuse
>
>  app/test/meson.build                          |   2 +
>  app/test/test_eal_flags.c                     |  25 +++
>  app/test/test_malloc_perf.c                   | 174 ++++++++++++++++++
>  doc/guides/linux_gsg/linux_eal_parameters.rst |  24 ++-
>  .../prog_guide/env_abstraction_layer.rst      | 107 ++++++++++-
>  doc/guides/rel_notes/release_22_03.rst        |   7 +
>  lib/eal/common/eal_common_options.c           |  48 ++++-
>  lib/eal/common/eal_internal_cfg.h             |  10 +-
>  lib/eal/common/malloc_elem.c                  |  22 ++-
>  lib/eal/common/malloc_elem.h                  |  11 +-
>  lib/eal/common/malloc_heap.c                  |  18 +-
>  lib/eal/common/rte_malloc.c                   |  21 ++-
>  lib/eal/include/rte_memory.h                  |   8 +-
>  lib/eal/linux/eal.c                           |   3 +-
>  lib/eal/linux/eal_hugepage_info.c             | 118 +++++++++---
>  lib/eal/linux/eal_memalloc.c                  | 173 ++++++++++-------
>  lib/eal/linux/eal_memory.c                    |   2 +-
>  17 files changed, 644 insertions(+), 129 deletions(-)
>  create mode 100644 app/test/test_malloc_perf.c

Thanks for the series, the documentation update and keeping the EAL
options count the same as before :-).

It passes my checks (compilation per patch for Linux x86 native, arm86
and ppc cross compil), running unit tests, running malloc tests with
ASan enabled.


I could not check all unit tests with RTE_MALLOC_DEBUG (I passed
-Dc_args=-DRTE_MALLOC_DEBUG to meson).
mbuf_autotest fails but I reproduced the same error before the series
so I'll report and investigate this separately.
Fwiw, the failure is:
1: [/home/dmarchan/builds/build-gcc-shared/app/test/../../lib/librte_eal.so.22(rte_dump_stack+0x1b)
[0x7f860c482dab]]
Test mbuf linearize API
mbuf test FAILED (l.2035): <test_pktmbuf_read_from_offset: Incorrect
data length!
>
mbuf test FAILED (l.2539): <test_pktmbuf_ext_pinned_buffer:
test_rte_pktmbuf_read_from_offset(pinned) failed
>
test_pktmbuf_ext_pinned_buffer() failed
Test Failed


I have one comment on documentation: we have a detailed description of
internal malloc_elem structure and implementation of the dpdk mem
allocator.
https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#internal-implementation
The addition of the "dirty/clean" notion should be described, as it
would help others who want to look into this subsystem.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/6] Fast restart with many hugepages
  2022-02-02 21:54     ` David Marchand
@ 2022-02-03 10:26       ` David Marchand
  0 siblings, 0 replies; 53+ messages in thread
From: David Marchand @ 2022-02-03 10:26 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Bruce Richardson, Anatoly Burakov, Viacheslav Ovsiienko,
	Thomas Monjalon, Lior Margalit

On Wed, Feb 2, 2022 at 10:54 PM David Marchand
<david.marchand@redhat.com> wrote:
> I could not check all unit tests with RTE_MALLOC_DEBUG (I passed
> -Dc_args=-DRTE_MALLOC_DEBUG to meson).
> mbuf_autotest fails but I reproduced the same error before the series
> so I'll report and investigate this separately.
> Fwiw, the failure is:
> 1: [/home/dmarchan/builds/build-gcc-shared/app/test/../../lib/librte_eal.so.22(rte_dump_stack+0x1b)
> [0x7f860c482dab]]
> Test mbuf linearize API
> mbuf test FAILED (l.2035): <test_pktmbuf_read_from_offset: Incorrect
> data length!
> >
> mbuf test FAILED (l.2539): <test_pktmbuf_ext_pinned_buffer:
> test_rte_pktmbuf_read_from_offset(pinned) failed
> >
> test_pktmbuf_ext_pinned_buffer() failed
> Test Failed

This should be fixed with:
https://patchwork.dpdk.org/project/dpdk/patch/20220203093912.25032-1-david.marchand@redhat.com/


-- 
David Marchand


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v3 0/6] Fast restart with many hugepages
  2022-01-19 21:09   ` [PATCH v2 " Dmitry Kozlyuk
                       ` (7 preceding siblings ...)
  2022-02-02 21:54     ` David Marchand
@ 2022-02-03 18:13     ` Dmitry Kozlyuk
  2022-02-03 18:13       ` [PATCH v3 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
                         ` (6 more replies)
  8 siblings, 7 replies; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-02-03 18:13 UTC (permalink / raw)
  To: dev
  Cc: Anatoly Burakov, Viacheslav Ovsiienko, David Marchand,
	Thomas Monjalon, Lior Margalit

v3:
  * Improve Linux modes and anonymous mapping doc (Bruce).
  * Document new "dirty" field of "malloc_elem" structure (David).
v2:
  * Fix hugepage file removal when they are no longer used.
    Disable removal with --huge-unlink=never as intended.
    Document this behavior difference. (Bruce)
  * Improve documentation, commit messages, and naming. (Thomas)

This patchset is a new design and implementation of [1].

# Problem Statement

Large allocations that involve mapping new hugepages are slow.
This is problematic, for example, in the following use case.
A single-process application allocates ~1TB of mempools at startup.
Sometimes the app needs to restart as quick as possible.
Allocating the hugepages anew takes as long as 15 seconds,
while the new process could just pick up all the memory
left by the old one (reinitializing the contents as needed).

Almost all of mmap(2) time spent in the kernel
is clearing the memory, i.e. filling it with zeros.
This is done if a file in hugetlbfs is mapped
for the first time system-wide, i.e. a hugepage is committed
to prevent data leaks from the previous users of the same hugepage.
For example, mapping 32 GB from a new file may take 2.16 seconds,
while mapping the same pages again takes only 0.3 ms.
Security put aside, e.g. when the environment is controlled,
this effort is wasted for the memory intended for DMA,
because its content will be overwritten anyway.

Linux EAL explicitly removes hugetlbfs files at initialization
and before mapping to force the kernel clear the memory.
This allows the memory allocator to clean memory on only on freeing.

# Solution

Add a new mode allowing EAL to remap existing hugepage files.
While it is intended to make restarts faster in the first place,
it makes any startup faster except the cold one
(with no existing files).

It is the administrator who accepts security risks
implied by reusing hugepages.
The new mode is an opt-in and a warning is logged.

The feature is Linux-only as it is related
to mapping hugepages from files which only Linux does.
It is inherently incompatible with --in-memory,
for --huge-unlink see below.

There is formally no breakage of API contract,
but there is a behavior change in the new mode:
rte_malloc*() and rte_memzone_reserve*() may return dirty memory
(previously they were returning clean memory from free heap elements).
Their contract has always explicitly allowed this,
but still there may be users relying on the traditional behavior.
Such users will need to fix their code to use the new mode.

# Implementation

## User Interface

There is --huge-unlink switch in the same area to remove hugepage files
before mapping them. It is infeasible to use with the new mode,
because the point is to keep hugepage files for fast future restarts.
Extend --huge-unlink option to represent only valid combinations:

* --huge-unlink=existing OR no option (for compatibility):
  unlink files at initialization
  and before opening them as a precaution.

* --huge-unlink=always OR just --huge-unlink (for compatibility):
  same as above + unlink created files before mapping.

* --huge-unlink=never:
  the new mode, do not unlink hugepages files, reuse them.

This option was always Linux-only, but it is kept as common
in case there are users who expect it to be a no-op on other systems.
(Adding a separate --huge-reuse option was also considered,
but there is no obvious benefit and more combinations to test.)

## EAL

If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
so that the memory allocator may clear the memory if need be.
See patch 5/6 description for details how this is done
in different memory mapping modes.

The memory manager tracks whether an element is clean or dirty.
If rte_zmalloc*() allocates from a dirty element,
the memory is cleared before handling it to the user.
On freeing, the allocator joins adjacent free elements,
but in the new mode it may not be feasible to clear the free memory
if the joint element is dirty (contains dirty parts).
In any case, memory will be cleared only once,
either on freeing or on allocation.
See patch 3/6 for details.
Patch 2/6 adds a benchmark to see how time is distributed
between allocation and freeing in different modes.

Besides clearing memory, each mmap() call takes some time.
For example, 1024 calls for 1 TB may take ~300 ms.
The time of one call mapping N hugepages is O(N),
because inside the kernel hugepages are allocated ony by one.
Syscall overhead is negligeable even for one page.
Hence, it does not make sense to reduce the number of mmap() calls,
which would essentially move the loop over pages into the kernel.

[1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozlyuk@nvidia.com/


Dmitry Kozlyuk (6):
  doc: add hugepage mapping details
  app/test: add allocator performance benchmark
  mem: add dirty malloc element support
  eal: refactor --huge-unlink storage
  eal/linux: allow hugepage file reuse
  eal: extend --huge-unlink for hugepage file reuse

 app/test/meson.build                          |   2 +
 app/test/test_eal_flags.c                     |  25 +++
 app/test/test_malloc_perf.c                   | 174 ++++++++++++++++++
 doc/guides/linux_gsg/linux_eal_parameters.rst |  24 ++-
 .../prog_guide/env_abstraction_layer.rst      | 113 +++++++++++-
 doc/guides/rel_notes/release_22_03.rst        |   7 +
 lib/eal/common/eal_common_options.c           |  48 ++++-
 lib/eal/common/eal_internal_cfg.h             |  10 +-
 lib/eal/common/malloc_elem.c                  |  22 ++-
 lib/eal/common/malloc_elem.h                  |  11 +-
 lib/eal/common/malloc_heap.c                  |  18 +-
 lib/eal/common/rte_malloc.c                   |  21 ++-
 lib/eal/include/rte_memory.h                  |   8 +-
 lib/eal/linux/eal.c                           |   3 +-
 lib/eal/linux/eal_hugepage_info.c             | 118 +++++++++---
 lib/eal/linux/eal_memalloc.c                  | 173 ++++++++++-------
 lib/eal/linux/eal_memory.c                    |   2 +-
 17 files changed, 650 insertions(+), 129 deletions(-)
 create mode 100644 app/test/test_malloc_perf.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v3 1/6] doc: add hugepage mapping details
  2022-02-03 18:13     ` [PATCH v3 " Dmitry Kozlyuk
@ 2022-02-03 18:13       ` Dmitry Kozlyuk
  2022-02-08 15:28         ` Burakov, Anatoly
  2022-02-03 18:13       ` [PATCH v3 2/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
                         ` (5 subsequent siblings)
  6 siblings, 1 reply; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-02-03 18:13 UTC (permalink / raw)
  To: dev; +Cc: Bruce Richardson, Anatoly Burakov

Hugepage mapping is a layer of EAL malloc builds upon.
There were implicit references to its details,
like mentions of segment file descriptors,
but no explicit description of its modes and operation.
Add an overview of mechanics used on ech supported OS.
Convert memory management subsections from list items
to level 4 headers: they are big and important enough.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
---
 .../prog_guide/env_abstraction_layer.rst      | 96 +++++++++++++++++--
 1 file changed, 87 insertions(+), 9 deletions(-)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index c6accce701..def5480997 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -86,7 +86,7 @@ See chapter
 Memory Mapping Discovery and Memory Reservation
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The allocation of large contiguous physical memory is done using the hugetlbfs kernel filesystem.
+The allocation of large contiguous physical memory is done using hugepages.
 The EAL provides an API to reserve named memory zones in this contiguous memory.
 The physical address of the reserved memory for that memory zone is also returned to the user by the memory zone reservation API.
 
@@ -95,11 +95,13 @@ and legacy mode. Both modes are explained below.
 
 .. note::
 
-    Memory reservations done using the APIs provided by rte_malloc are also backed by pages from the hugetlbfs filesystem.
+    Memory reservations done using the APIs provided by rte_malloc
+    are also backed by hugepages unless ``--no-huge`` option is given.
 
-+ Dynamic memory mode
+Dynamic Memory Mode
+^^^^^^^^^^^^^^^^^^^
 
-Currently, this mode is only supported on Linux.
+Currently, this mode is only supported on Linux and Windows.
 
 In this mode, usage of hugepages by DPDK application will grow and shrink based
 on application's requests. Any memory allocation through ``rte_malloc()``,
@@ -155,7 +157,8 @@ of memory that can be used by DPDK application.
     :ref:`Multi-process Support <Multi-process_Support>` for more details about
     DPDK IPC.
 
-+ Legacy memory mode
+Legacy Memory Mode
+^^^^^^^^^^^^^^^^^^
 
 This mode is enabled by specifying ``--legacy-mem`` command-line switch to the
 EAL. This switch will have no effect on FreeBSD as FreeBSD only supports
@@ -168,7 +171,8 @@ not allow acquiring or releasing hugepages from the system at runtime.
 If neither ``-m`` nor ``--socket-mem`` were specified, the entire available
 hugepage memory will be preallocated.
 
-+ Hugepage allocation matching
+Hugepage Allocation Matching
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This behavior is enabled by specifying the ``--match-allocations`` command-line
 switch to the EAL. This switch is Linux-only and not supported with
@@ -182,7 +186,8 @@ matching can be used by these types of applications to satisfy both of these
 requirements. This can result in some increased memory usage which is
 very dependent on the memory allocation patterns of the application.
 
-+ 32-bit support
+32-bit Support
+^^^^^^^^^^^^^^
 
 Additional restrictions are present when running in 32-bit mode. In dynamic
 memory mode, by default maximum of 2 gigabytes of VA space will be preallocated,
@@ -192,7 +197,8 @@ used.
 In legacy mode, VA space will only be preallocated for segments that were
 requested (plus padding, to keep IOVA-contiguousness).
 
-+ Maximum amount of memory
+Maximum Amount of Memory
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 All possible virtual memory space that can ever be used for hugepage mapping in
 a DPDK process is preallocated at startup, thereby placing an upper limit on how
@@ -222,7 +228,78 @@ Normally, these options do not need to be changed.
     can later be mapped into that preallocated VA space (if dynamic memory mode
     is enabled), and can optionally be mapped into it at startup.
 
-+ Segment file descriptors
+Hugepage Mapping
+^^^^^^^^^^^^^^^^
+
+Below is an overview of methods used for each OS to obtain hugepages,
+explaining why certain limitations and options exist in EAL.
+See the user guide for a specific OS for configuration details.
+
+FreeBSD uses ``contigmem`` kernel module
+to reserve a fixed number of hugepages at system start,
+which are mapped by EAL at initialization using a specific ``sysctl()``.
+
+Windows EAL allocates hugepages from the OS as needed using Win32 API,
+so available amount depends on the system load.
+It uses ``virt2phys`` kernel module to obtain physical addresses,
+unless running in IOVA-as-VA mode (e.g. forced with ``--iova-mode=va``).
+
+Linux allows to select any combination of the following:
+
+* use files in hugetlbfs (the default)
+  or anonymous mappings (``--in-memory``);
+* map each hugepage from its own file (the default)
+  or map multiple hugepages from one big file (``--single-file-segments``).
+
+Mapping hugepages from files in hugetlbfs is essential for multi-process,
+because secondary processes need to map the same hugepages.
+EAL creates files like ``rtemap_0``
+in directories specified with ``--huge-dir`` option
+(or in the mount point for a specific hugepage size).
+The ``rte`` prefix can be changed using ``--file-prefix``.
+This may be needed for running multiple primary processes
+that share a hugetlbfs mount point.
+Each backing file by default corresponds to one hugepage,
+it is opened and locked for the entire time the hugepage is used.
+This may exhaust the number of open files limit (``NOFILE``).
+See :ref:`segment-file-descriptors` section
+on how the number of open backing file descriptors can be reduced.
+
+In dynamic memory mode, EAL removes a backing hugepage file
+when all pages mapped from it are freed back to the system.
+However, backing files may persist after the application terminates
+in case of a crash or a leak of DPDK memory (e.g. ``rte_free()`` is missing).
+This reduces the number of hugepages available to other processes
+as reported by ``/sys/kernel/mm/hugepages/hugepages-*/free_hugepages``.
+EAL can remove the backing files after opening them for mapping
+if ``--huge-unlink`` is given to avoid polluting hugetlbfs.
+However, since it disables multi-process anyway,
+using anonymous mapping (``--in-memory``) is recommended instead.
+
+:ref:`EAL memory allocator <malloc>` relies on hugepages being zero-filled.
+Hugepages are cleared by the kernel when a file in hugetlbfs or its part
+is mapped for the first time system-wide
+to prevent data leaks from previous users of the same hugepage.
+EAL ensures this behavior by removing existing backing files at startup
+and by recreating them before opening for mapping (as a precaution).
+
+Anonymous mapping does not allow multi-process architecture.
+This mode does not use hugetlbfs
+and thus does not require root permissions for memory management
+(the limit of locked memory amount, ``MEMLOCK``, still applies).
+It is free of filename conflict and leftover file issues.
+If memfd_create(2) is supported both at build and run time,
+DPDK memory manager can provide file descriptors for memory segments,
+which are required for VirtIO with vhost-user backend.
+This can exhaust the number of open files limit (``NOFILE``)
+despite not creating any visible files.
+See :ref:`segment-file-descriptors` section
+on how the number of open file descriptors used by EAL can be reduced.
+
+.. _segment-file-descriptors:
+
+Segment File Descriptors
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 On Linux, in most cases, EAL will store segment file descriptors in EAL. This
 can become a problem when using smaller page sizes due to underlying limitations
@@ -731,6 +808,7 @@ We expect only 50% of CPU spend on packet IO.
     echo 100000 > pkt_io/cpu.cfs_period_us
     echo  50000 > pkt_io/cpu.cfs_quota_us
 
+.. _malloc:
 
 Malloc
 ------
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v3 2/6] app/test: add allocator performance benchmark
  2022-02-03 18:13     ` [PATCH v3 " Dmitry Kozlyuk
  2022-02-03 18:13       ` [PATCH v3 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
@ 2022-02-03 18:13       ` Dmitry Kozlyuk
  2022-02-08 16:20         ` Burakov, Anatoly
  2022-02-03 18:13       ` [PATCH v3 3/6] mem: add dirty malloc element support Dmitry Kozlyuk
                         ` (4 subsequent siblings)
  6 siblings, 1 reply; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-02-03 18:13 UTC (permalink / raw)
  To: dev; +Cc: Viacheslav Ovsiienko, Aaron Conole

Memory allocator performance is crucial to applications that deal
with large amount of memory or allocate frequently. DPDK allocator
performance is affected by EAL options, API used and, at least,
allocation size. New autotest is intended to be run with different
EAL options. It measures performance with a range of sizes
for dirrerent APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.

Work distribution between allocation and deallocation depends on EAL
options. The test prints both times and total time to ease comparison.

Memory can be filled with zeroes at different points of allocation path,
but it always takes considerable fraction of overall timing. This is why
the test measures filling speed and prints how long clearing takes
for each size as a reference (for rte_memzone_reserve estimations
are printed).

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Acked-by: Aaron Conole <aconole@redhat.com>
---
 app/test/meson.build        |   2 +
 app/test/test_malloc_perf.c | 174 ++++++++++++++++++++++++++++++++++++
 2 files changed, 176 insertions(+)
 create mode 100644 app/test/test_malloc_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 5476c180ee..b0486cd129 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -87,6 +87,7 @@ test_sources = files(
         'test_lpm6_perf.c',
         'test_lpm_perf.c',
         'test_malloc.c',
+        'test_malloc_perf.c',
         'test_mbuf.c',
         'test_member.c',
         'test_member_perf.c',
@@ -256,6 +257,7 @@ extra_test_names = [
 
 perf_test_names = [
         'ring_perf_autotest',
+        'malloc_perf_autotest',
         'mempool_perf_autotest',
         'memcpy_perf_autotest',
         'hash_perf_autotest',
diff --git a/app/test/test_malloc_perf.c b/app/test/test_malloc_perf.c
new file mode 100644
index 0000000000..ccec43ae84
--- /dev/null
+++ b/app/test/test_malloc_perf.c
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <inttypes.h>
+#include <string.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_malloc.h>
+#include <rte_memzone.h>
+
+#include "test.h"
+
+#define TEST_LOG(level, ...) RTE_LOG(level, USER1, __VA_ARGS__)
+
+typedef void * (alloc_t)(const char *name, size_t size, unsigned int align);
+typedef void (free_t)(void *addr);
+typedef void * (memset_t)(void *addr, int value, size_t size);
+
+static const uint64_t KB = 1 << 10;
+static const uint64_t GB = 1 << 30;
+
+static double
+tsc_to_us(uint64_t tsc, size_t runs)
+{
+	return (double)tsc / rte_get_tsc_hz() * US_PER_S / runs;
+}
+
+static int
+test_memset_perf(double *us_per_gb)
+{
+	static const size_t RUNS = 20;
+
+	void *ptr;
+	size_t i;
+	uint64_t tsc;
+
+	TEST_LOG(INFO, "Reference: memset\n");
+
+	ptr = rte_malloc(NULL, GB, 0);
+	if (ptr == NULL) {
+		TEST_LOG(ERR, "rte_malloc(size=%"PRIx64") failed\n", GB);
+		return -1;
+	}
+
+	tsc = rte_rdtsc_precise();
+	for (i = 0; i < RUNS; i++)
+		memset(ptr, 0, GB);
+	tsc = rte_rdtsc_precise() - tsc;
+
+	*us_per_gb = tsc_to_us(tsc, RUNS);
+	TEST_LOG(INFO, "Result: %f.3 GiB/s <=> %.2f us/MiB\n",
+			US_PER_S / *us_per_gb, *us_per_gb / KB);
+
+	rte_free(ptr);
+	TEST_LOG(INFO, "\n");
+	return 0;
+}
+
+static int
+test_alloc_perf(const char *name, alloc_t *alloc_fn, free_t *free_fn,
+		memset_t *memset_fn, double memset_gb_us, size_t max_runs)
+{
+	static const size_t SIZES[] = {
+			1 << 6, 1 << 7, 1 << 10, 1 << 12, 1 << 16, 1 << 20,
+			1 << 21, 1 << 22, 1 << 24, 1 << 30 };
+
+	size_t i, j;
+	void **ptrs;
+
+	TEST_LOG(INFO, "Performance: %s\n", name);
+
+	ptrs = calloc(max_runs, sizeof(ptrs[0]));
+	if (ptrs == NULL) {
+		TEST_LOG(ERR, "Cannot allocate memory for pointers");
+		return -1;
+	}
+
+	TEST_LOG(INFO, "%12s%8s%12s%12s%12s%17s\n", "Size (B)", "Runs",
+			"Alloc (us)", "Free (us)", "Total (us)",
+			memset_fn != NULL ? "memset (us)" : "est.memset (us)");
+	for (i = 0; i < RTE_DIM(SIZES); i++) {
+		size_t size = SIZES[i];
+		size_t runs_done;
+		uint64_t tsc_start, tsc_alloc, tsc_memset = 0, tsc_free;
+		double alloc_time, free_time, memset_time;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < max_runs; j++) {
+			ptrs[j] = alloc_fn(NULL, size, 0);
+			if (ptrs[j] == NULL)
+				break;
+		}
+		tsc_alloc = rte_rdtsc_precise() - tsc_start;
+
+		if (j == 0) {
+			TEST_LOG(INFO, "%12zu Interrupted: out of memory.\n",
+					size);
+			break;
+		}
+		runs_done = j;
+
+		if (memset_fn != NULL) {
+			tsc_start = rte_rdtsc_precise();
+			for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+				memset_fn(ptrs[j], 0, size);
+			tsc_memset = rte_rdtsc_precise() - tsc_start;
+		}
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+			free_fn(ptrs[j]);
+		tsc_free = rte_rdtsc_precise() - tsc_start;
+
+		alloc_time = tsc_to_us(tsc_alloc, runs_done);
+		free_time = tsc_to_us(tsc_free, runs_done);
+		memset_time = memset_fn != NULL ?
+				tsc_to_us(tsc_memset, runs_done) :
+				memset_gb_us * size / GB;
+		TEST_LOG(INFO, "%12zu%8zu%12.2f%12.2f%12.2f%17.2f\n",
+				size, runs_done, alloc_time, free_time,
+				alloc_time + free_time, memset_time);
+
+		memset(ptrs, 0, max_runs * sizeof(ptrs[0]));
+	}
+
+	free(ptrs);
+	TEST_LOG(INFO, "\n");
+	return 0;
+}
+
+static void *
+memzone_alloc(const char *name __rte_unused, size_t size, unsigned int align)
+{
+	const struct rte_memzone *mz;
+	char gen_name[RTE_MEMZONE_NAMESIZE];
+
+	snprintf(gen_name, sizeof(gen_name), "test-mz-%"PRIx64, rte_rdtsc());
+	mz = rte_memzone_reserve_aligned(gen_name, size, SOCKET_ID_ANY,
+			RTE_MEMZONE_1GB | RTE_MEMZONE_SIZE_HINT_ONLY, align);
+	return (void *)(uintptr_t)mz;
+}
+
+static void
+memzone_free(void *addr)
+{
+	rte_memzone_free((struct rte_memzone *)addr);
+}
+
+static int
+test_malloc_perf(void)
+{
+	static const size_t MAX_RUNS = 10000;
+
+	double memset_us_gb = 0;
+
+	if (test_memset_perf(&memset_us_gb) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_malloc", rte_malloc, rte_free, memset,
+			memset_us_gb, MAX_RUNS) < 0)
+		return -1;
+	if (test_alloc_perf("rte_zmalloc", rte_zmalloc, rte_free, memset,
+			memset_us_gb, MAX_RUNS) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_memzone_reserve", memzone_alloc, memzone_free,
+			NULL, memset_us_gb, RTE_MAX_MEMZONE - 1) < 0)
+		return -1;
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(malloc_perf_autotest, test_malloc_perf);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v3 3/6] mem: add dirty malloc element support
  2022-02-03 18:13     ` [PATCH v3 " Dmitry Kozlyuk
  2022-02-03 18:13       ` [PATCH v3 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
  2022-02-03 18:13       ` [PATCH v3 2/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
@ 2022-02-03 18:13       ` Dmitry Kozlyuk
  2022-02-08 16:36         ` Burakov, Anatoly
  2022-02-03 18:13       ` [PATCH v3 4/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
                         ` (3 subsequent siblings)
  6 siblings, 1 reply; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-02-03 18:13 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov

EAL malloc layer assumed all free elements content
is filled with zeros ("clean"), as opposed to uninitialized ("dirty").
This assumption was ensured in two ways:
1. EAL memalloc layer always returned clean memory.
2. Freed memory was cleared before returning into the heap.

Clearing the memory can be as slow as around 14 GiB/s.
To save doing so, memalloc layer is allowed to return dirty memory.
Such segments being marked with RTE_MEMSEG_FLAG_DIRTY.
The allocator tracks elements that contain dirty memory
using the new flag in the element header.
When clean memory is requested via rte_zmalloc*()
and the suitable element is dirty, it is cleared on allocation.
When memory is deallocated, the freed element is joined
with adjacent free elements, and the dirty flag is updated:

a) If the joint element contains dirty parts, it is dirty:

    dirty + freed + dirty = dirty  =>  no need to clean
            freed + dirty = dirty      the freed memory

   Dirty parts may be large (e.g. initial allocation),
   so clearing them could create unpredictable slowdown.

b) If the only dirty part of the joint element
   is the freed memory, the joint element can be made clean:

    clean + freed + clean = clean  =>  freed memory
    clean + freed         = clean      must be cleared
            freed + clean = clean
            freed         = clean

   This logic naturally reproduces the old behavior
   and always applies in modes when EAL memalloc layer
   returns only clean segments.

As a result, memory is either cleared on free, as before,
or it will be cleared on allocation if need be, but never twice.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 .../prog_guide/env_abstraction_layer.rst      |  4 ++++
 lib/eal/common/malloc_elem.c                  | 22 ++++++++++++++++---
 lib/eal/common/malloc_elem.h                  | 11 ++++++++--
 lib/eal/common/malloc_heap.c                  | 18 ++++++++++-----
 lib/eal/common/rte_malloc.c                   | 21 ++++++++++++------
 lib/eal/include/rte_memory.h                  |  8 +++++--
 6 files changed, 64 insertions(+), 20 deletions(-)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index def5480997..b467bdf004 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -956,6 +956,10 @@ to be virtually contiguous.
     In that case, the pad header is used to locate the actual malloc element
     header for the block.
 
+*   dirty - this flag is only meaningful when ``state`` is ``FREE``.
+    It indicates that the content of the element is not fully zero-filled.
+    Memory from such blocks must be cleared when requested via ``rte_zmalloc*()``.
+
 *   pad - this holds the length of the padding present at the start of the block.
     In the case of a normal block header, it is added to the address of the end
     of the header to give the address of the start of the data area, i.e. the
diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
index bdd20a162e..e04e0890fb 100644
--- a/lib/eal/common/malloc_elem.c
+++ b/lib/eal/common/malloc_elem.c
@@ -129,7 +129,7 @@ malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
 void
 malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
 		struct rte_memseg_list *msl, size_t size,
-		struct malloc_elem *orig_elem, size_t orig_size)
+		struct malloc_elem *orig_elem, size_t orig_size, bool dirty)
 {
 	elem->heap = heap;
 	elem->msl = msl;
@@ -137,6 +137,7 @@ malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
 	elem->next = NULL;
 	memset(&elem->free_list, 0, sizeof(elem->free_list));
 	elem->state = ELEM_FREE;
+	elem->dirty = dirty;
 	elem->size = size;
 	elem->pad = 0;
 	elem->orig_elem = orig_elem;
@@ -300,7 +301,7 @@ split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt)
 	const size_t new_elem_size = elem->size - old_elem_size;
 
 	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size,
-			 elem->orig_elem, elem->orig_size);
+			elem->orig_elem, elem->orig_size, elem->dirty);
 	split_pt->prev = elem;
 	split_pt->next = next_elem;
 	if (next_elem)
@@ -506,6 +507,7 @@ join_elem(struct malloc_elem *elem1, struct malloc_elem *elem2)
 	else
 		elem1->heap->last = elem1;
 	elem1->next = next;
+	elem1->dirty |= elem2->dirty;
 	if (elem1->pad) {
 		struct malloc_elem *inner = RTE_PTR_ADD(elem1, elem1->pad);
 		inner->size = elem1->size - elem1->pad;
@@ -579,6 +581,14 @@ malloc_elem_free(struct malloc_elem *elem)
 	ptr = RTE_PTR_ADD(elem, MALLOC_ELEM_HEADER_LEN);
 	data_len = elem->size - MALLOC_ELEM_OVERHEAD;
 
+	/*
+	 * Consider the element clean for the purposes of joining.
+	 * If both neighbors are clean or non-existent,
+	 * the joint element will be clean,
+	 * which means the memory should be cleared.
+	 * There is no need to clear the memory if the joint element is dirty.
+	 */
+	elem->dirty = false;
 	elem = malloc_elem_join_adjacent_free(elem);
 
 	malloc_elem_free_list_insert(elem);
@@ -588,8 +598,14 @@ malloc_elem_free(struct malloc_elem *elem)
 	/* decrease heap's count of allocated elements */
 	elem->heap->alloc_count--;
 
-	/* poison memory */
+#ifndef RTE_MALLOC_DEBUG
+	/* Normally clear the memory when needed. */
+	if (!elem->dirty)
+		memset(ptr, 0, data_len);
+#else
+	/* Always poison the memory in debug mode. */
 	memset(ptr, MALLOC_POISON, data_len);
+#endif
 
 	return elem;
 }
diff --git a/lib/eal/common/malloc_elem.h b/lib/eal/common/malloc_elem.h
index 15d8ba7af2..f2aa98821b 100644
--- a/lib/eal/common/malloc_elem.h
+++ b/lib/eal/common/malloc_elem.h
@@ -27,7 +27,13 @@ struct malloc_elem {
 	LIST_ENTRY(malloc_elem) free_list;
 	/**< list of free elements in heap */
 	struct rte_memseg_list *msl;
-	volatile enum elem_state state;
+	/** Element state, @c dirty and @c pad validity depends on it. */
+	/* An extra bit is needed to represent enum elem_state as signed int. */
+	enum elem_state state : 3;
+	/** If state == ELEM_FREE: the memory is not filled with zeroes. */
+	uint32_t dirty : 1;
+	/** Reserved for future use. */
+	uint32_t reserved : 28;
 	uint32_t pad;
 	size_t size;
 	struct malloc_elem *orig_elem;
@@ -320,7 +326,8 @@ malloc_elem_init(struct malloc_elem *elem,
 		struct rte_memseg_list *msl,
 		size_t size,
 		struct malloc_elem *orig_elem,
-		size_t orig_size);
+		size_t orig_size,
+		bool dirty);
 
 void
 malloc_elem_insert(struct malloc_elem *elem);
diff --git a/lib/eal/common/malloc_heap.c b/lib/eal/common/malloc_heap.c
index 55aad2711b..24080fc473 100644
--- a/lib/eal/common/malloc_heap.c
+++ b/lib/eal/common/malloc_heap.c
@@ -93,11 +93,11 @@ malloc_socket_to_heap_id(unsigned int socket_id)
  */
 static struct malloc_elem *
 malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl,
-		void *start, size_t len)
+		void *start, size_t len, bool dirty)
 {
 	struct malloc_elem *elem = start;
 
-	malloc_elem_init(elem, heap, msl, len, elem, len);
+	malloc_elem_init(elem, heap, msl, len, elem, len, dirty);
 
 	malloc_elem_insert(elem);
 
@@ -135,7 +135,8 @@ malloc_add_seg(const struct rte_memseg_list *msl,
 
 	found_msl = &mcfg->memsegs[msl_idx];
 
-	malloc_heap_add_memory(heap, found_msl, ms->addr, len);
+	malloc_heap_add_memory(heap, found_msl, ms->addr, len,
+			ms->flags & RTE_MEMSEG_FLAG_DIRTY);
 
 	heap->total_size += len;
 
@@ -303,7 +304,8 @@ alloc_pages_on_heap(struct malloc_heap *heap, uint64_t pg_sz, size_t elt_size,
 	struct rte_memseg_list *msl;
 	struct malloc_elem *elem = NULL;
 	size_t alloc_sz;
-	int allocd_pages;
+	int allocd_pages, i;
+	bool dirty = false;
 	void *ret, *map_addr;
 
 	alloc_sz = (size_t)pg_sz * n_segs;
@@ -372,8 +374,12 @@ alloc_pages_on_heap(struct malloc_heap *heap, uint64_t pg_sz, size_t elt_size,
 		goto fail;
 	}
 
+	/* Element is dirty if it contains at least one dirty page. */
+	for (i = 0; i < allocd_pages; i++)
+		dirty |= ms[i]->flags & RTE_MEMSEG_FLAG_DIRTY;
+
 	/* add newly minted memsegs to malloc heap */
-	elem = malloc_heap_add_memory(heap, msl, map_addr, alloc_sz);
+	elem = malloc_heap_add_memory(heap, msl, map_addr, alloc_sz, dirty);
 
 	/* try once more, as now we have allocated new memory */
 	ret = find_suitable_element(heap, elt_size, flags, align, bound,
@@ -1260,7 +1266,7 @@ malloc_heap_add_external_memory(struct malloc_heap *heap,
 	memset(msl->base_va, 0, msl->len);
 
 	/* now, add newly minted memory to the malloc heap */
-	malloc_heap_add_memory(heap, msl, msl->base_va, msl->len);
+	malloc_heap_add_memory(heap, msl, msl->base_va, msl->len, false);
 
 	heap->total_size += msl->len;
 
diff --git a/lib/eal/common/rte_malloc.c b/lib/eal/common/rte_malloc.c
index d0bec26920..71a3f7ecb4 100644
--- a/lib/eal/common/rte_malloc.c
+++ b/lib/eal/common/rte_malloc.c
@@ -115,15 +115,22 @@ rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket)
 {
 	void *ptr = rte_malloc_socket(type, size, align, socket);
 
+	if (ptr != NULL) {
+		struct malloc_elem *elem = malloc_elem_from_data(ptr);
+
+		if (elem->dirty) {
+			memset(ptr, 0, size);
+		} else {
 #ifdef RTE_MALLOC_DEBUG
-	/*
-	 * If DEBUG is enabled, then freed memory is marked with poison
-	 * value and set to zero on allocation.
-	 * If DEBUG is not enabled then  memory is already zeroed.
-	 */
-	if (ptr != NULL)
-		memset(ptr, 0, size);
+			/*
+			 * If DEBUG is enabled, then freed memory is marked
+			 * with a poison value and set to zero on allocation.
+			 * If DEBUG is disabled then memory is already zeroed.
+			 */
+			memset(ptr, 0, size);
 #endif
+		}
+	}
 
 	rte_eal_trace_mem_zmalloc(type, size, align, socket, ptr);
 	return ptr;
diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h
index 6d018629ae..68b069fd04 100644
--- a/lib/eal/include/rte_memory.h
+++ b/lib/eal/include/rte_memory.h
@@ -19,6 +19,7 @@
 extern "C" {
 #endif
 
+#include <rte_bitops.h>
 #include <rte_common.h>
 #include <rte_compat.h>
 #include <rte_config.h>
@@ -37,11 +38,14 @@ extern "C" {
 
 #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
 
+/** Prevent this segment from being freed back to the OS. */
+#define RTE_MEMSEG_FLAG_DO_NOT_FREE RTE_BIT32(0)
+/** This segment is not filled with zeros. */
+#define RTE_MEMSEG_FLAG_DIRTY RTE_BIT32(1)
+
 /**
  * Physical memory segment descriptor.
  */
-#define RTE_MEMSEG_FLAG_DO_NOT_FREE (1 << 0)
-/**< Prevent this segment from being freed back to the OS. */
 struct rte_memseg {
 	rte_iova_t iova;            /**< Start IO address. */
 	RTE_STD_C11
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v3 4/6] eal: refactor --huge-unlink storage
  2022-02-03 18:13     ` [PATCH v3 " Dmitry Kozlyuk
                         ` (2 preceding siblings ...)
  2022-02-03 18:13       ` [PATCH v3 3/6] mem: add dirty malloc element support Dmitry Kozlyuk
@ 2022-02-03 18:13       ` Dmitry Kozlyuk
  2022-02-08 16:39         ` Burakov, Anatoly
  2022-02-03 18:13       ` [PATCH v3 5/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
                         ` (2 subsequent siblings)
  6 siblings, 1 reply; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-02-03 18:13 UTC (permalink / raw)
  To: dev; +Cc: Thomas Monjalon, Anatoly Burakov

In preparation to extend --huge-unlink option semantics
refactor how it is stored in the internal configuration.
It makes future changes more isolated.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
---
 lib/eal/common/eal_common_options.c | 9 +++++----
 lib/eal/common/eal_internal_cfg.h   | 8 +++++++-
 lib/eal/linux/eal_memalloc.c        | 7 ++++---
 lib/eal/linux/eal_memory.c          | 2 +-
 4 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index 1cfdd75f3b..7520ebda8e 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -1737,7 +1737,7 @@ eal_parse_common_option(int opt, const char *optarg,
 
 	/* long options */
 	case OPT_HUGE_UNLINK_NUM:
-		conf->hugepage_unlink = 1;
+		conf->hugepage_file.unlink_before_mapping = true;
 		break;
 
 	case OPT_NO_HUGE_NUM:
@@ -1766,7 +1766,7 @@ eal_parse_common_option(int opt, const char *optarg,
 		conf->in_memory = 1;
 		/* in-memory is a superset of noshconf and huge-unlink */
 		conf->no_shconf = 1;
-		conf->hugepage_unlink = 1;
+		conf->hugepage_file.unlink_before_mapping = true;
 		break;
 
 	case OPT_PROC_TYPE_NUM:
@@ -2050,7 +2050,8 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"be specified together with --"OPT_NO_HUGE"\n");
 		return -1;
 	}
-	if (internal_cfg->no_hugetlbfs && internal_cfg->hugepage_unlink &&
+	if (internal_cfg->no_hugetlbfs &&
+			internal_cfg->hugepage_file.unlink_before_mapping &&
 			!internal_cfg->in_memory) {
 		RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
 			"be specified together with --"OPT_NO_HUGE"\n");
@@ -2061,7 +2062,7 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			" is only supported in non-legacy memory mode\n");
 	}
 	if (internal_cfg->single_file_segments &&
-			internal_cfg->hugepage_unlink &&
+			internal_cfg->hugepage_file.unlink_before_mapping &&
 			!internal_cfg->in_memory) {
 		RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE_SEGMENTS" is "
 			"not compatible with --"OPT_HUGE_UNLINK"\n");
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index d6c0470eb8..b5e6942578 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -40,6 +40,12 @@ struct simd_bitwidth {
 	uint16_t bitwidth; /**< bitwidth value */
 };
 
+/** Hugepage backing files discipline. */
+struct hugepage_file_discipline {
+	/** Unlink files before mapping them to leave no trace in hugetlbfs. */
+	bool unlink_before_mapping;
+};
+
 /**
  * internal configuration
  */
@@ -48,7 +54,7 @@ struct internal_config {
 	volatile unsigned force_nchannel; /**< force number of channels */
 	volatile unsigned force_nrank;    /**< force number of ranks */
 	volatile unsigned no_hugetlbfs;   /**< true to disable hugetlbfs */
-	unsigned hugepage_unlink;         /**< true to unlink backing files */
+	struct hugepage_file_discipline hugepage_file;
 	volatile unsigned no_pci;         /**< true to disable PCI */
 	volatile unsigned no_hpet;        /**< true to disable HPET */
 	volatile unsigned vmware_tsc_map; /**< true to use VMware TSC mapping
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 16b58d861b..5f5531830d 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -564,7 +564,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 					__func__, strerror(errno));
 				goto resized;
 			}
-			if (internal_conf->hugepage_unlink &&
+			if (internal_conf->hugepage_file.unlink_before_mapping &&
 					!internal_conf->in_memory) {
 				if (unlink(path)) {
 					RTE_LOG(DEBUG, EAL, "%s(): unlink() failed: %s\n",
@@ -697,7 +697,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 			close_hugefile(fd, path, list_idx);
 	} else {
 		/* only remove file if we can take out a write lock */
-		if (internal_conf->hugepage_unlink == 0 &&
+		if (!internal_conf->hugepage_file.unlink_before_mapping &&
 				internal_conf->in_memory == 0 &&
 				lock(fd, LOCK_EX) == 1)
 			unlink(path);
@@ -756,7 +756,8 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		/* if we're able to take out a write lock, we're the last one
 		 * holding onto this page.
 		 */
-		if (!internal_conf->in_memory && !internal_conf->hugepage_unlink) {
+		if (!internal_conf->in_memory &&
+				!internal_conf->hugepage_file.unlink_before_mapping) {
 			ret = lock(fd, LOCK_EX);
 			if (ret >= 0) {
 				/* no one else is using this page */
diff --git a/lib/eal/linux/eal_memory.c b/lib/eal/linux/eal_memory.c
index 03a4f2dd2d..83eec078a4 100644
--- a/lib/eal/linux/eal_memory.c
+++ b/lib/eal/linux/eal_memory.c
@@ -1428,7 +1428,7 @@ eal_legacy_hugepage_init(void)
 	}
 
 	/* free the hugepage backing files */
-	if (internal_conf->hugepage_unlink &&
+	if (internal_conf->hugepage_file.unlink_before_mapping &&
 		unlink_hugepage_files(tmp_hp, internal_conf->num_hugepage_sizes) < 0) {
 		RTE_LOG(ERR, EAL, "Unlinking hugepage files failed!\n");
 		goto fail;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v3 5/6] eal/linux: allow hugepage file reuse
  2022-02-03 18:13     ` [PATCH v3 " Dmitry Kozlyuk
                         ` (3 preceding siblings ...)
  2022-02-03 18:13       ` [PATCH v3 4/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
@ 2022-02-03 18:13       ` Dmitry Kozlyuk
  2022-02-08 17:05         ` Burakov, Anatoly
  2022-02-03 18:13       ` [PATCH v3 6/6] eal: extend --huge-unlink for " Dmitry Kozlyuk
  2022-02-08 20:40       ` [PATCH v3 0/6] Fast restart with many hugepages David Marchand
  6 siblings, 1 reply; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-02-03 18:13 UTC (permalink / raw)
  To: dev; +Cc: Anatoly Burakov

Linux EAL ensured that mapped hugepages are clean
by always mapping from newly created files:
existing hugepage backing files were always removed.
In this case, the kernel clears the page to prevent data leaks,
because the mapped memory may contain leftover data
from the previous process that was using this memory.
Clearing takes the bulk of the time spent in mmap(2),
increasing EAL initialization time.

Introduce a mode to keep existing files and reuse them
in order to speed up initial memory allocation in EAL.
Hugepages mapped from such files may contain data
left by the previous process that used this memory,
so RTE_MEMSEG_FLAG_DIRTY is set for their segments.
If multiple hugepages are mapped from the same file:
1. When fallocate(2) is used, all memory mapped from this file
   is considered dirty, because it is unknown
   which parts of the file are holes.
2. When ftruncate(3) is used, memory mapped from this file
   is considered dirty unless the file is extended
   to create a new mapping, which implies clean memory.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
Coverity complains that "path" may be uninitialized in get_seg_fd()
at line 327, but it is always initialized with eal_get_hugefile_path()
at lines 309-316.

 lib/eal/common/eal_common_options.c |   2 +
 lib/eal/common/eal_internal_cfg.h   |   2 +
 lib/eal/linux/eal.c                 |   3 +-
 lib/eal/linux/eal_hugepage_info.c   | 118 ++++++++++++++++----
 lib/eal/linux/eal_memalloc.c        | 166 +++++++++++++++++-----------
 5 files changed, 206 insertions(+), 85 deletions(-)

diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index 7520ebda8e..cdd2284b0c 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -311,6 +311,8 @@ eal_reset_internal_config(struct internal_config *internal_cfg)
 	internal_cfg->force_nchannel = 0;
 	internal_cfg->hugefile_prefix = NULL;
 	internal_cfg->hugepage_dir = NULL;
+	internal_cfg->hugepage_file.unlink_before_mapping = false;
+	internal_cfg->hugepage_file.unlink_existing = true;
 	internal_cfg->force_sockets = 0;
 	/* zero out the NUMA config */
 	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index b5e6942578..d2be7bfa57 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -44,6 +44,8 @@ struct simd_bitwidth {
 struct hugepage_file_discipline {
 	/** Unlink files before mapping them to leave no trace in hugetlbfs. */
 	bool unlink_before_mapping;
+	/** Unlink exisiting files at startup, re-create them before mapping. */
+	bool unlink_existing;
 };
 
 /**
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 60b4924838..9c8395ab14 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -1360,7 +1360,8 @@ rte_eal_cleanup(void)
 	struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY &&
+			internal_conf->hugepage_file.unlink_existing)
 		rte_memseg_walk(mark_freeable, NULL);
 	rte_service_finalize();
 	rte_mp_channel_cleanup();
diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index 9fb0e968db..ec172ef4b8 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -84,7 +84,7 @@ static int get_hp_sysfs_value(const char *subdir, const char *file, unsigned lon
 /* this function is only called from eal_hugepage_info_init which itself
  * is only called from a primary process */
 static uint32_t
-get_num_hugepages(const char *subdir, size_t sz)
+get_num_hugepages(const char *subdir, size_t sz, unsigned int reusable_pages)
 {
 	unsigned long resv_pages, num_pages, over_pages, surplus_pages;
 	const char *nr_hp_file = "free_hugepages";
@@ -116,7 +116,7 @@ get_num_hugepages(const char *subdir, size_t sz)
 	else
 		over_pages = 0;
 
-	if (num_pages == 0 && over_pages == 0)
+	if (num_pages == 0 && over_pages == 0 && reusable_pages)
 		RTE_LOG(WARNING, EAL, "No available %zu kB hugepages reported\n",
 				sz >> 10);
 
@@ -124,6 +124,10 @@ get_num_hugepages(const char *subdir, size_t sz)
 	if (num_pages < over_pages) /* overflow */
 		num_pages = UINT32_MAX;
 
+	num_pages += reusable_pages;
+	if (num_pages < reusable_pages) /* overflow */
+		num_pages = UINT32_MAX;
+
 	/* we want to return a uint32_t and more than this looks suspicious
 	 * anyway ... */
 	if (num_pages > UINT32_MAX)
@@ -297,20 +301,28 @@ get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
 	return -1;
 }
 
+struct walk_hugedir_data {
+	int dir_fd;
+	int file_fd;
+	const char *file_name;
+	void *user_data;
+};
+
+typedef void (walk_hugedir_t)(const struct walk_hugedir_data *whd);
+
 /*
- * Clear the hugepage directory of whatever hugepage files
- * there are. Checks if the file is locked (i.e.
- * if it's in use by another DPDK process).
+ * Search the hugepage directory for whatever hugepage files there are.
+ * Check if the file is in use by another DPDK process.
+ * If not, execute a callback on it.
  */
 static int
-clear_hugedir(const char * hugedir)
+walk_hugedir(const char *hugedir, walk_hugedir_t *cb, void *user_data)
 {
 	DIR *dir;
 	struct dirent *dirent;
 	int dir_fd, fd, lck_result;
 	const char filter[] = "*map_*"; /* matches hugepage files */
 
-	/* open directory */
 	dir = opendir(hugedir);
 	if (!dir) {
 		RTE_LOG(ERR, EAL, "Unable to open hugepage directory %s\n",
@@ -326,7 +338,7 @@ clear_hugedir(const char * hugedir)
 		goto error;
 	}
 
-	while(dirent != NULL){
+	while (dirent != NULL) {
 		/* skip files that don't match the hugepage pattern */
 		if (fnmatch(filter, dirent->d_name, 0) > 0) {
 			dirent = readdir(dir);
@@ -345,9 +357,15 @@ clear_hugedir(const char * hugedir)
 		/* non-blocking lock */
 		lck_result = flock(fd, LOCK_EX | LOCK_NB);
 
-		/* if lock succeeds, remove the file */
+		/* if lock succeeds, execute callback */
 		if (lck_result != -1)
-			unlinkat(dir_fd, dirent->d_name, 0);
+			cb(&(struct walk_hugedir_data){
+				.dir_fd = dir_fd,
+				.file_fd = fd,
+				.file_name = dirent->d_name,
+				.user_data = user_data,
+			});
+
 		close (fd);
 		dirent = readdir(dir);
 	}
@@ -359,12 +377,48 @@ clear_hugedir(const char * hugedir)
 	if (dir)
 		closedir(dir);
 
-	RTE_LOG(ERR, EAL, "Error while clearing hugepage dir: %s\n",
+	RTE_LOG(ERR, EAL, "Error while walking hugepage dir: %s\n",
 		strerror(errno));
 
 	return -1;
 }
 
+static void
+clear_hugedir_cb(const struct walk_hugedir_data *whd)
+{
+	unlinkat(whd->dir_fd, whd->file_name, 0);
+}
+
+/* Remove hugepage files not used by other DPDK processes from a directory. */
+static int
+clear_hugedir(const char *hugedir)
+{
+	return walk_hugedir(hugedir, clear_hugedir_cb, NULL);
+}
+
+static void
+inspect_hugedir_cb(const struct walk_hugedir_data *whd)
+{
+	uint64_t *total_size = whd->user_data;
+	struct stat st;
+
+	if (fstat(whd->file_fd, &st) < 0)
+		RTE_LOG(DEBUG, EAL, "%s(): stat(\"%s\") failed: %s",
+				__func__, whd->file_name, strerror(errno));
+	else
+		(*total_size) += st.st_size;
+}
+
+/*
+ * Count the total size in bytes of all files in the directory
+ * not mapped by other DPDK process.
+ */
+static int
+inspect_hugedir(const char *hugedir, uint64_t *total_size)
+{
+	return walk_hugedir(hugedir, inspect_hugedir_cb, total_size);
+}
+
 static int
 compare_hpi(const void *a, const void *b)
 {
@@ -375,7 +429,8 @@ compare_hpi(const void *a, const void *b)
 }
 
 static void
-calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
+calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent,
+		unsigned int reusable_pages)
 {
 	uint64_t total_pages = 0;
 	unsigned int i;
@@ -388,8 +443,15 @@ calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
 	 * in one socket and sorting them later
 	 */
 	total_pages = 0;
-	/* we also don't want to do this for legacy init */
-	if (!internal_conf->legacy_mem)
+
+	/*
+	 * We also don't want to do this for legacy init.
+	 * When there are hugepage files to reuse it is unknown
+	 * what NUMA node the pages are on.
+	 * This could be determined by mapping,
+	 * but it is precisely what hugepage file reuse is trying to avoid.
+	 */
+	if (!internal_conf->legacy_mem && reusable_pages == 0)
 		for (i = 0; i < rte_socket_count(); i++) {
 			int socket = rte_socket_id_by_idx(i);
 			unsigned int num_pages =
@@ -405,7 +467,7 @@ calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
 	 */
 	if (total_pages == 0) {
 		hpi->num_pages[0] = get_num_hugepages(dirent->d_name,
-				hpi->hugepage_sz);
+				hpi->hugepage_sz, reusable_pages);
 
 #ifndef RTE_ARCH_64
 		/* for 32-bit systems, limit number of hugepages to
@@ -421,6 +483,8 @@ hugepage_info_init(void)
 {	const char dirent_start_text[] = "hugepages-";
 	const size_t dirent_start_len = sizeof(dirent_start_text) - 1;
 	unsigned int i, num_sizes = 0;
+	uint64_t reusable_bytes;
+	unsigned int reusable_pages;
 	DIR *dir;
 	struct dirent *dirent;
 	struct internal_config *internal_conf =
@@ -454,7 +518,7 @@ hugepage_info_init(void)
 			uint32_t num_pages;
 
 			num_pages = get_num_hugepages(dirent->d_name,
-					hpi->hugepage_sz);
+					hpi->hugepage_sz, 0);
 			if (num_pages > 0)
 				RTE_LOG(NOTICE, EAL,
 					"%" PRIu32 " hugepages of size "
@@ -473,7 +537,7 @@ hugepage_info_init(void)
 					"hugepages of size %" PRIu64 " bytes "
 					"will be allocated anonymously\n",
 					hpi->hugepage_sz);
-				calc_num_pages(hpi, dirent);
+				calc_num_pages(hpi, dirent, 0);
 				num_sizes++;
 			}
 #endif
@@ -489,11 +553,23 @@ hugepage_info_init(void)
 				"Failed to lock hugepage directory!\n");
 			break;
 		}
-		/* clear out the hugepages dir from unused pages */
-		if (clear_hugedir(hpi->hugedir) == -1)
-			break;
 
-		calc_num_pages(hpi, dirent);
+		/*
+		 * Check for existing hugepage files and either remove them
+		 * or count how many of them can be reused.
+		 */
+		reusable_pages = 0;
+		if (!internal_conf->hugepage_file.unlink_existing) {
+			reusable_bytes = 0;
+			if (inspect_hugedir(hpi->hugedir,
+					&reusable_bytes) < 0)
+				break;
+			RTE_ASSERT(reusable_bytes % hpi->hugepage_sz == 0);
+			reusable_pages = reusable_bytes / hpi->hugepage_sz;
+		} else if (clear_hugedir(hpi->hugedir) < 0) {
+			break;
+		}
+		calc_num_pages(hpi, dirent, reusable_pages);
 
 		num_sizes++;
 	}
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 5f5531830d..b68f5e165d 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -287,12 +287,19 @@ get_seg_memfd(struct hugepage_info *hi __rte_unused,
 
 static int
 get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
-		unsigned int list_idx, unsigned int seg_idx)
+		unsigned int list_idx, unsigned int seg_idx,
+		bool *dirty)
 {
 	int fd;
+	int *out_fd;
+	struct stat st;
+	int ret;
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
+	if (dirty != NULL)
+		*dirty = false;
+
 	/* for in-memory mode, we only make it here when we're sure we support
 	 * memfd, and this is a special case.
 	 */
@@ -300,66 +307,70 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 		return get_seg_memfd(hi, list_idx, seg_idx);
 
 	if (internal_conf->single_file_segments) {
-		/* create a hugepage file path */
+		out_fd = &fd_list[list_idx].memseg_list_fd;
 		eal_get_hugefile_path(path, buflen, hi->hugedir, list_idx);
-
-		fd = fd_list[list_idx].memseg_list_fd;
-
-		if (fd < 0) {
-			fd = open(path, O_CREAT | O_RDWR, 0600);
-			if (fd < 0) {
-				RTE_LOG(ERR, EAL, "%s(): open '%s' failed: %s\n",
-					__func__, path, strerror(errno));
-				return -1;
-			}
-			/* take out a read lock and keep it indefinitely */
-			if (lock(fd, LOCK_SH) < 0) {
-				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
-					__func__, strerror(errno));
-				close(fd);
-				return -1;
-			}
-			fd_list[list_idx].memseg_list_fd = fd;
-		}
 	} else {
-		/* create a hugepage file path */
+		out_fd = &fd_list[list_idx].fds[seg_idx];
 		eal_get_hugefile_path(path, buflen, hi->hugedir,
 				list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);
+	}
+	fd = *out_fd;
+	if (fd >= 0)
+		return fd;
 
-		fd = fd_list[list_idx].fds[seg_idx];
-
-		if (fd < 0) {
-			/* A primary process is the only one creating these
-			 * files. If there is a leftover that was not cleaned
-			 * by clear_hugedir(), we must *now* make sure to drop
-			 * the file or we will remap old stuff while the rest
-			 * of the code is built on the assumption that a new
-			 * page is clean.
-			 */
-			if (rte_eal_process_type() == RTE_PROC_PRIMARY &&
-					unlink(path) == -1 &&
-					errno != ENOENT) {
-				RTE_LOG(DEBUG, EAL, "%s(): could not remove '%s': %s\n",
-					__func__, path, strerror(errno));
-				return -1;
-			}
+	/*
+	 * There is no TOCTOU between stat() and unlink()/open()
+	 * because the hugepage directory is locked.
+	 */
+	ret = stat(path, &st);
+	if (ret < 0 && errno != ENOENT) {
+		RTE_LOG(DEBUG, EAL, "%s(): stat() for '%s' failed: %s\n",
+			__func__, path, strerror(errno));
+		return -1;
+	}
+	if (!internal_conf->hugepage_file.unlink_existing && ret == 0 &&
+			dirty != NULL)
+		*dirty = true;
 
-			fd = open(path, O_CREAT | O_RDWR, 0600);
-			if (fd < 0) {
-				RTE_LOG(ERR, EAL, "%s(): open '%s' failed: %s\n",
-					__func__, path, strerror(errno));
-				return -1;
-			}
-			/* take out a read lock */
-			if (lock(fd, LOCK_SH) < 0) {
-				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
-					__func__, strerror(errno));
-				close(fd);
-				return -1;
-			}
-			fd_list[list_idx].fds[seg_idx] = fd;
+	/*
+	 * The kernel clears a hugepage only when it is mapped
+	 * from a particular file for the first time.
+	 * If the file already exists, the old content will be mapped.
+	 * If the memory manager assumes all mapped pages to be clean,
+	 * the file must be removed and created anew.
+	 * Otherwise, the primary caller must be notified
+	 * that mapped pages will be dirty
+	 * (secondary callers receive the segment state from the primary one).
+	 * When multiple hugepages are mapped from the same file,
+	 * whether they will be dirty depends on the part that is mapped.
+	 */
+	if (!internal_conf->single_file_segments &&
+			internal_conf->hugepage_file.unlink_existing &&
+			rte_eal_process_type() == RTE_PROC_PRIMARY &&
+			ret == 0) {
+		/* coverity[toctou] */
+		if (unlink(path) < 0) {
+			RTE_LOG(DEBUG, EAL, "%s(): could not remove '%s': %s\n",
+				__func__, path, strerror(errno));
+			return -1;
 		}
 	}
+
+	/* coverity[toctou] */
+	fd = open(path, O_CREAT | O_RDWR, 0600);
+	if (fd < 0) {
+		RTE_LOG(DEBUG, EAL, "%s(): open '%s' failed: %s\n",
+			__func__, path, strerror(errno));
+		return -1;
+	}
+	/* take out a read lock */
+	if (lock(fd, LOCK_SH) < 0) {
+		RTE_LOG(ERR, EAL, "%s(): lock '%s' failed: %s\n",
+			__func__, path, strerror(errno));
+		close(fd);
+		return -1;
+	}
+	*out_fd = fd;
 	return fd;
 }
 
@@ -385,8 +396,10 @@ resize_hugefile_in_memory(int fd, uint64_t fa_offset,
 
 static int
 resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
-		bool grow)
+		bool grow, bool *dirty)
 {
+	const struct internal_config *internal_conf =
+			eal_get_internal_configuration();
 	bool again = false;
 
 	do {
@@ -405,6 +418,8 @@ resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
 			uint64_t cur_size = get_file_size(fd);
 
 			/* fallocate isn't supported, fall back to ftruncate */
+			if (dirty != NULL)
+				*dirty = new_size <= cur_size;
 			if (new_size > cur_size &&
 					ftruncate(fd, new_size) < 0) {
 				RTE_LOG(DEBUG, EAL, "%s(): ftruncate() failed: %s\n",
@@ -447,8 +462,17 @@ resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
 						strerror(errno));
 					return -1;
 				}
-			} else
+			} else {
 				fallocate_supported = 1;
+				/*
+				 * It is unknown which portions of an existing
+				 * hugepage file were allocated previously,
+				 * so all pages within the file are considered
+				 * dirty, unless the file is a fresh one.
+				 */
+				if (dirty != NULL)
+					*dirty &= !internal_conf->hugepage_file.unlink_existing;
+			}
 		}
 	} while (again);
 
@@ -475,7 +499,8 @@ close_hugefile(int fd, char *path, int list_idx)
 }
 
 static int
-resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow)
+resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow,
+		bool *dirty)
 {
 	/* in-memory mode is a special case, because we can be sure that
 	 * fallocate() is supported.
@@ -483,12 +508,15 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow)
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	if (internal_conf->in_memory)
+	if (internal_conf->in_memory) {
+		if (dirty != NULL)
+			*dirty = false;
 		return resize_hugefile_in_memory(fd, fa_offset,
 				page_sz, grow);
+	}
 
 	return resize_hugefile_in_filesystem(fd, fa_offset, page_sz,
-				grow);
+			grow, dirty);
 }
 
 static int
@@ -505,6 +533,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	char path[PATH_MAX];
 	int ret = 0;
 	int fd;
+	bool dirty;
 	size_t alloc_sz;
 	int flags;
 	void *new_addr;
@@ -534,6 +563,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 
 		pagesz_flag = pagesz_flags(alloc_sz);
 		fd = -1;
+		dirty = false;
 		mmap_flags = in_memory_flags | pagesz_flag;
 
 		/* single-file segments codepath will never be active
@@ -544,7 +574,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		map_offset = 0;
 	} else {
 		/* takes out a read lock on segment or segment list */
-		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx,
+				&dirty);
 		if (fd < 0) {
 			RTE_LOG(ERR, EAL, "Couldn't get fd on hugepage file\n");
 			return -1;
@@ -552,7 +583,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 
 		if (internal_conf->single_file_segments) {
 			map_offset = seg_idx * alloc_sz;
-			ret = resize_hugefile(fd, map_offset, alloc_sz, true);
+			ret = resize_hugefile(fd, map_offset, alloc_sz, true,
+					&dirty);
 			if (ret < 0)
 				goto resized;
 
@@ -662,6 +694,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	ms->nrank = rte_memory_get_nrank();
 	ms->iova = iova;
 	ms->socket_id = socket_id;
+	ms->flags = dirty ? RTE_MEMSEG_FLAG_DIRTY : 0;
 
 	return 0;
 
@@ -689,7 +722,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		return -1;
 
 	if (internal_conf->single_file_segments) {
-		resize_hugefile(fd, map_offset, alloc_sz, false);
+		resize_hugefile(fd, map_offset, alloc_sz, false, NULL);
 		/* ignore failure, can't make it any worse */
 
 		/* if refcount is at zero, close the file */
@@ -739,13 +772,13 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	 * segment and thus drop the lock on original fd, but hugepage dir is
 	 * now locked so we can take out another one without races.
 	 */
-	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx, NULL);
 	if (fd < 0)
 		return -1;
 
 	if (internal_conf->single_file_segments) {
 		map_offset = seg_idx * ms->len;
-		if (resize_hugefile(fd, map_offset, ms->len, false))
+		if (resize_hugefile(fd, map_offset, ms->len, false, NULL))
 			return -1;
 
 		if (--(fd_list[list_idx].count) == 0)
@@ -757,6 +790,7 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		 * holding onto this page.
 		 */
 		if (!internal_conf->in_memory &&
+				internal_conf->hugepage_file.unlink_existing &&
 				!internal_conf->hugepage_file.unlink_before_mapping) {
 			ret = lock(fd, LOCK_EX);
 			if (ret >= 0) {
@@ -1743,6 +1777,12 @@ eal_memalloc_init(void)
 			RTE_LOG(ERR, EAL, "Using anonymous memory is not supported\n");
 			return -1;
 		}
+		/* safety net, should be impossible to configure */
+		if (internal_conf->hugepage_file.unlink_before_mapping &&
+				!internal_conf->hugepage_file.unlink_existing) {
+			RTE_LOG(ERR, EAL, "Unlinking existing hugepage files is prohibited, cannot unlink them before mapping.\n");
+			return -1;
+		}
 	}
 
 	/* initialize all of the fd lists */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v3 6/6] eal: extend --huge-unlink for hugepage file reuse
  2022-02-03 18:13     ` [PATCH v3 " Dmitry Kozlyuk
                         ` (4 preceding siblings ...)
  2022-02-03 18:13       ` [PATCH v3 5/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
@ 2022-02-03 18:13       ` Dmitry Kozlyuk
  2022-02-08 17:14         ` Burakov, Anatoly
  2022-02-08 20:40       ` [PATCH v3 0/6] Fast restart with many hugepages David Marchand
  6 siblings, 1 reply; 53+ messages in thread
From: Dmitry Kozlyuk @ 2022-02-03 18:13 UTC (permalink / raw)
  To: dev; +Cc: Thomas Monjalon, Anatoly Burakov

Expose Linux EAL ability to reuse existing hugepage files
via --huge-unlink=never switch.
Default behavior is unchanged, it can also be specified
using --huge-unlink=existing for consistency.
Old --huge-unlink switch is kept,
it is an alias for --huge-unlink=always.
Add a test case for the --huge-unlink=never mode.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
---
 app/test/test_eal_flags.c                     | 25 ++++++++++++
 doc/guides/linux_gsg/linux_eal_parameters.rst | 24 ++++++++++--
 .../prog_guide/env_abstraction_layer.rst      | 13 +++++++
 doc/guides/rel_notes/release_22_03.rst        |  7 ++++
 lib/eal/common/eal_common_options.c           | 39 +++++++++++++++++--
 5 files changed, 101 insertions(+), 7 deletions(-)

diff --git a/app/test/test_eal_flags.c b/app/test/test_eal_flags.c
index d7f4c2cd47..e2696cda63 100644
--- a/app/test/test_eal_flags.c
+++ b/app/test/test_eal_flags.c
@@ -1122,6 +1122,11 @@ test_file_prefix(void)
 		DEFAULT_MEM_SIZE, "--single-file-segments",
 		"--file-prefix=" memtest1 };
 
+	/* primary process with memtest1 and --huge-unlink=never mode */
+	const char * const argv9[] = {prgname, "-m",
+		DEFAULT_MEM_SIZE, "--huge-unlink=never",
+		"--file-prefix=" memtest1 };
+
 	/* check if files for current prefix are present */
 	if (process_hugefiles(prefix, HUGEPAGE_CHECK_EXISTS) != 1) {
 		printf("Error - hugepage files for %s were not created!\n", prefix);
@@ -1290,6 +1295,26 @@ test_file_prefix(void)
 		return -1;
 	}
 
+	/* this process will run with --huge-unlink,
+	 * so it should not remove hugepage files when it exits
+	 */
+	if (launch_proc(argv9) != 0) {
+		printf("Error - failed to run with --huge-unlink=never\n");
+		return -1;
+	}
+
+	/* check if hugefiles for memtest1 are present */
+	if (process_hugefiles(memtest1, HUGEPAGE_CHECK_EXISTS) == 0) {
+		printf("Error - hugepage files for %s were deleted!\n",
+				memtest1);
+		return -1;
+	} else {
+		if (process_hugefiles(memtest1, HUGEPAGE_DELETE) != 1) {
+			printf("Error - deleting hugepages failed!\n");
+			return -1;
+		}
+	}
+
 	return 0;
 }
 
diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst
index 74df2611b5..ea8f381391 100644
--- a/doc/guides/linux_gsg/linux_eal_parameters.rst
+++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
@@ -84,10 +84,26 @@ Memory-related options
     Use specified hugetlbfs directory instead of autodetected ones. This can be
     a sub-directory within a hugetlbfs mountpoint.
 
-*   ``--huge-unlink``
-
-    Unlink hugepage files after creating them (implies no secondary process
-    support).
+*   ``--huge-unlink[=existing|always|never]``
+
+    No ``--huge-unlink`` option or ``--huge-unlink=existing`` is the default:
+    existing hugepage files are removed and re-created
+    to ensure the kernel clears the memory and prevents any data leaks.
+
+    With ``--huge-unlink`` (no value) or ``--huge-unlink=always``,
+    hugepage files are also removed before mapping them,
+    so that the application leaves no files in hugetlbfs.
+    This mode implies no multi-process support.
+
+    When ``--huge-unlink=never`` is specified, existing hugepage files
+    are never removed, but are remapped instead, allowing hugepage reuse.
+    This makes restart faster by saving time to clear memory at initialization,
+    but it may slow down zeroed allocations later.
+    Reused hugepages can contain data from previous processes that used them,
+    which may be a security concern.
+    Hugepage files created in this mode are also not removed
+    when all the hugepages mapped from them are freed,
+    which allows to reuse these files after a restart.
 
 *   ``--match-allocations``
 
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index b467bdf004..8f06c34c22 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -283,6 +283,18 @@ to prevent data leaks from previous users of the same hugepage.
 EAL ensures this behavior by removing existing backing files at startup
 and by recreating them before opening for mapping (as a precaution).
 
+One exception is ``--huge-unlink=never`` mode.
+It is used to speed up EAL initialization, usually on application restart.
+Clearing memory constitutes more than 95% of hugepage mapping time.
+EAL can save it by remapping existing backing files
+with all the data left in the mapped hugepages ("dirty" memory).
+Such segments are marked with ``RTE_MEMSEG_FLAG_DIRTY``.
+Memory allocator detects dirty segments and handles them accordingly,
+in particular, it clears memory requested with ``rte_zmalloc*()``.
+In this mode EAL also does not remove a backing file
+when all pages mapped from it are freed,
+because they are intended to be reusable at restart.
+
 Anonymous mapping does not allow multi-process architecture.
 This mode does not use hugetlbfs
 and thus does not require root permissions for memory management
@@ -959,6 +971,7 @@ to be virtually contiguous.
 *   dirty - this flag is only meaningful when ``state`` is ``FREE``.
     It indicates that the content of the element is not fully zero-filled.
     Memory from such blocks must be cleared when requested via ``rte_zmalloc*()``.
+    Dirty elements only appear with ``--huge-unlink=never``.
 
 *   pad - this holds the length of the padding present at the start of the block.
     In the case of a normal block header, it is added to the address of the end
diff --git a/doc/guides/rel_notes/release_22_03.rst b/doc/guides/rel_notes/release_22_03.rst
index 746f50e84f..58361db687 100644
--- a/doc/guides/rel_notes/release_22_03.rst
+++ b/doc/guides/rel_notes/release_22_03.rst
@@ -73,6 +73,13 @@ New Features
 
   The new API ``rte_event_eth_rx_adapter_event_port_get()`` was added.
 
+* **Added ability to reuse hugepages in Linux.**
+
+  It is possible to reuse files in hugetlbfs to speed up hugepage mapping,
+  which may be useful for fast restart and large allocations.
+  The new mode is activated with ``--huge-unlink=never``
+  and has security implications, refer to the user and programmer guides.
+
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index cdd2284b0c..45d393b393 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -74,7 +74,7 @@ eal_long_options[] = {
 	{OPT_FILE_PREFIX,       1, NULL, OPT_FILE_PREFIX_NUM      },
 	{OPT_HELP,              0, NULL, OPT_HELP_NUM             },
 	{OPT_HUGE_DIR,          1, NULL, OPT_HUGE_DIR_NUM         },
-	{OPT_HUGE_UNLINK,       0, NULL, OPT_HUGE_UNLINK_NUM      },
+	{OPT_HUGE_UNLINK,       2, NULL, OPT_HUGE_UNLINK_NUM      },
 	{OPT_IOVA_MODE,	        1, NULL, OPT_IOVA_MODE_NUM        },
 	{OPT_LCORES,            1, NULL, OPT_LCORES_NUM           },
 	{OPT_LOG_LEVEL,         1, NULL, OPT_LOG_LEVEL_NUM        },
@@ -1598,6 +1598,28 @@ available_cores(void)
 	return str;
 }
 
+#define HUGE_UNLINK_NEVER "never"
+
+static int
+eal_parse_huge_unlink(const char *arg, struct hugepage_file_discipline *out)
+{
+	if (arg == NULL || strcmp(arg, "always") == 0) {
+		out->unlink_before_mapping = true;
+		return 0;
+	}
+	if (strcmp(arg, "existing") == 0) {
+		/* same as not specifying the option */
+		return 0;
+	}
+	if (strcmp(arg, HUGE_UNLINK_NEVER) == 0) {
+		RTE_LOG(WARNING, EAL, "Using --"OPT_HUGE_UNLINK"="
+			HUGE_UNLINK_NEVER" may create data leaks.\n");
+		out->unlink_existing = false;
+		return 0;
+	}
+	return -1;
+}
+
 int
 eal_parse_common_option(int opt, const char *optarg,
 			struct internal_config *conf)
@@ -1739,7 +1761,10 @@ eal_parse_common_option(int opt, const char *optarg,
 
 	/* long options */
 	case OPT_HUGE_UNLINK_NUM:
-		conf->hugepage_file.unlink_before_mapping = true;
+		if (eal_parse_huge_unlink(optarg, &conf->hugepage_file) < 0) {
+			RTE_LOG(ERR, EAL, "invalid --"OPT_HUGE_UNLINK" option\n");
+			return -1;
+		}
 		break;
 
 	case OPT_NO_HUGE_NUM:
@@ -2070,6 +2095,12 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"not compatible with --"OPT_HUGE_UNLINK"\n");
 		return -1;
 	}
+	if (!internal_cfg->hugepage_file.unlink_existing &&
+			internal_cfg->in_memory) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_IN_MEMORY" is not compatible "
+			"with --"OPT_HUGE_UNLINK"="HUGE_UNLINK_NEVER"\n");
+		return -1;
+	}
 	if (internal_cfg->legacy_mem &&
 			internal_cfg->in_memory) {
 		RTE_LOG(ERR, EAL, "Option --"OPT_LEGACY_MEM" is not compatible "
@@ -2202,7 +2233,9 @@ eal_common_usage(void)
 	       "  --"OPT_NO_TELEMETRY"   Disable telemetry support\n"
 	       "  --"OPT_FORCE_MAX_SIMD_BITWIDTH" Force the max SIMD bitwidth\n"
 	       "\nEAL options for DEBUG use only:\n"
-	       "  --"OPT_HUGE_UNLINK"       Unlink hugepage files after init\n"
+	       "  --"OPT_HUGE_UNLINK"[=existing|always|never]\n"
+	       "                      When to unlink files in hugetlbfs\n"
+	       "                      ('existing' by default, no value means 'always')\n"
 	       "  --"OPT_NO_HUGE"           Use malloc instead of hugetlbfs\n"
 	       "  --"OPT_NO_PCI"            Disable PCI\n"
 	       "  --"OPT_NO_HPET"           Disable HPET\n"
-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v3 1/6] doc: add hugepage mapping details
  2022-02-03 18:13       ` [PATCH v3 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
@ 2022-02-08 15:28         ` Burakov, Anatoly
  0 siblings, 0 replies; 53+ messages in thread
From: Burakov, Anatoly @ 2022-02-08 15:28 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev; +Cc: Bruce Richardson

On 03-Feb-22 6:13 PM, Dmitry Kozlyuk wrote:
> Hugepage mapping is a layer of EAL malloc builds upon.
> There were implicit references to its details,
> like mentions of segment file descriptors,
> but no explicit description of its modes and operation.
> Add an overview of mechanics used on ech supported OS.
> Convert memory management subsections from list items
> to level 4 headers: they are big and important enough.
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> ---


> +In dynamic memory mode, EAL removes a backing hugepage file
> +when all pages mapped from it are freed back to the system.
> +However, backing files may persist after the application terminates
> +in case of a crash or a leak of DPDK memory (e.g. ``rte_free()`` is missing).
> +This reduces the number of hugepages available to other processes
> +as reported by ``/sys/kernel/mm/hugepages/hugepages-*/free_hugepages``.
> +EAL can remove the backing files after opening them for mapping
> +if ``--huge-unlink`` is given to avoid polluting hugetlbfs.
> +However, since it disables multi-process anyway,
> +using anonymous mapping (``--in-memory``) is recommended instead.
> +
> +:ref:`EAL memory allocator <malloc>` relies on hugepages being zero-filled.
> +Hugepages are cleared by the kernel when a file in hugetlbfs or its part
> +is mapped for the first time system-wide
> +to prevent data leaks from previous users of the same hugepage.
> +EAL ensures this behavior by removing existing backing files at startup
> +and by recreating them before opening for mapping (as a precaution).
> +
> +Anonymous mapping does not allow multi-process architecture.
> +This mode does not use hugetlbfs
> +and thus does not require root permissions for memory management
> +(the limit of locked memory amount, ``MEMLOCK``, still applies).
> +It is free of filename conflict and leftover file issues.
> +If memfd_create(2) is supported both at build and run time,

Nitpick, quote memfd? e.g. `memfd_create(2)`

Otherwise,

Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v3 2/6] app/test: add allocator performance benchmark
  2022-02-03 18:13       ` [PATCH v3 2/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
@ 2022-02-08 16:20         ` Burakov, Anatoly
  0 siblings, 0 replies; 53+ messages in thread
From: Burakov, Anatoly @ 2022-02-08 16:20 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev; +Cc: Viacheslav Ovsiienko, Aaron Conole

On 03-Feb-22 6:13 PM, Dmitry Kozlyuk wrote:
> Memory allocator performance is crucial to applications that deal
> with large amount of memory or allocate frequently. DPDK allocator
> performance is affected by EAL options, API used and, at least,
> allocation size. New autotest is intended to be run with different
> EAL options. It measures performance with a range of sizes
> for dirrerent APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.
> 
> Work distribution between allocation and deallocation depends on EAL
> options. The test prints both times and total time to ease comparison.
> 
> Memory can be filled with zeroes at different points of allocation path,
> but it always takes considerable fraction of overall timing. This is why
> the test measures filling speed and prints how long clearing takes
> for each size as a reference (for rte_memzone_reserve estimations
> are printed).
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> Acked-by: Aaron Conole <aconole@redhat.com>
> ---

Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v3 3/6] mem: add dirty malloc element support
  2022-02-03 18:13       ` [PATCH v3 3/6] mem: add dirty malloc element support Dmitry Kozlyuk
@ 2022-02-08 16:36         ` Burakov, Anatoly
  0 siblings, 0 replies; 53+ messages in thread
From: Burakov, Anatoly @ 2022-02-08 16:36 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev

On 03-Feb-22 6:13 PM, Dmitry Kozlyuk wrote:
> EAL malloc layer assumed all free elements content
> is filled with zeros ("clean"), as opposed to uninitialized ("dirty").
> This assumption was ensured in two ways:
> 1. EAL memalloc layer always returned clean memory.
> 2. Freed memory was cleared before returning into the heap.
> 
> Clearing the memory can be as slow as around 14 GiB/s.
> To save doing so, memalloc layer is allowed to return dirty memory.
> Such segments being marked with RTE_MEMSEG_FLAG_DIRTY.
> The allocator tracks elements that contain dirty memory
> using the new flag in the element header.
> When clean memory is requested via rte_zmalloc*()
> and the suitable element is dirty, it is cleared on allocation.
> When memory is deallocated, the freed element is joined
> with adjacent free elements, and the dirty flag is updated:
> 
> a) If the joint element contains dirty parts, it is dirty:
> 
>      dirty + freed + dirty = dirty  =>  no need to clean
>              freed + dirty = dirty      the freed memory
> 
>     Dirty parts may be large (e.g. initial allocation),
>     so clearing them could create unpredictable slowdown.
> 
> b) If the only dirty part of the joint element
>     is the freed memory, the joint element can be made clean:
> 
>      clean + freed + clean = clean  =>  freed memory
>      clean + freed         = clean      must be cleared
>              freed + clean = clean
>              freed         = clean
> 
>     This logic naturally reproduces the old behavior
>     and always applies in modes when EAL memalloc layer
>     returns only clean segments.
> 
> As a result, memory is either cleared on free, as before,
> or it will be cleared on allocation if need be, but never twice.
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> ---

Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v3 4/6] eal: refactor --huge-unlink storage
  2022-02-03 18:13       ` [PATCH v3 4/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
@ 2022-02-08 16:39         ` Burakov, Anatoly
  0 siblings, 0 replies; 53+ messages in thread
From: Burakov, Anatoly @ 2022-02-08 16:39 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev; +Cc: Thomas Monjalon

On 03-Feb-22 6:13 PM, Dmitry Kozlyuk wrote:
> In preparation to extend --huge-unlink option semantics
> refactor how it is stored in the internal configuration.
> It makes future changes more isolated.
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Acked-by: Thomas Monjalon <thomas@monjalon.net>
> ---

I would question the need to keep the huge-unlink option at all (as well 
as its distant cousin, `--noshconf`), because it's functionally 
equivalent to memfd. However, that's not really the purposes of this 
patch, so that's besides the point :)

Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v3 5/6] eal/linux: allow hugepage file reuse
  2022-02-03 18:13       ` [PATCH v3 5/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
@ 2022-02-08 17:05         ` Burakov, Anatoly
  0 siblings, 0 replies; 53+ messages in thread
From: Burakov, Anatoly @ 2022-02-08 17:05 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev

On 03-Feb-22 6:13 PM, Dmitry Kozlyuk wrote:
> Linux EAL ensured that mapped hugepages are clean
> by always mapping from newly created files:
> existing hugepage backing files were always removed.
> In this case, the kernel clears the page to prevent data leaks,
> because the mapped memory may contain leftover data
> from the previous process that was using this memory.
> Clearing takes the bulk of the time spent in mmap(2),
> increasing EAL initialization time.
> 
> Introduce a mode to keep existing files and reuse them
> in order to speed up initial memory allocation in EAL.
> Hugepages mapped from such files may contain data
> left by the previous process that used this memory,
> so RTE_MEMSEG_FLAG_DIRTY is set for their segments.
> If multiple hugepages are mapped from the same file:
> 1. When fallocate(2) is used, all memory mapped from this file
>     is considered dirty, because it is unknown
>     which parts of the file are holes.
> 2. When ftruncate(3) is used, memory mapped from this file
>     is considered dirty unless the file is extended
>     to create a new mapping, which implies clean memory.
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> ---


> -	while(dirent != NULL){
> +	while (dirent != NULL) {
>   		/* skip files that don't match the hugepage pattern */
>   		if (fnmatch(filter, dirent->d_name, 0) > 0) {
>   			dirent = readdir(dir);
> @@ -345,9 +357,15 @@ clear_hugedir(const char * hugedir)
>   		/* non-blocking lock */
>   		lck_result = flock(fd, LOCK_EX | LOCK_NB);
>   
> -		/* if lock succeeds, remove the file */
> +		/* if lock succeeds, execute callback */
>   		if (lck_result != -1)
> -			unlinkat(dir_fd, dirent->d_name, 0);
> +			cb(&(struct walk_hugedir_data){
> +				.dir_fd = dir_fd,
> +				.file_fd = fd,
> +				.file_name = dirent->d_name,
> +				.user_data = user_data,
> +			});

Off topic, but nice trick! Didn't know C allowed for this.

Otherwise, LGTM

Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v3 6/6] eal: extend --huge-unlink for hugepage file reuse
  2022-02-03 18:13       ` [PATCH v3 6/6] eal: extend --huge-unlink for " Dmitry Kozlyuk
@ 2022-02-08 17:14         ` Burakov, Anatoly
  0 siblings, 0 replies; 53+ messages in thread
From: Burakov, Anatoly @ 2022-02-08 17:14 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev; +Cc: Thomas Monjalon

On 03-Feb-22 6:13 PM, Dmitry Kozlyuk wrote:
> Expose Linux EAL ability to reuse existing hugepage files
> via --huge-unlink=never switch.
> Default behavior is unchanged, it can also be specified
> using --huge-unlink=existing for consistency.
> Old --huge-unlink switch is kept,
> it is an alias for --huge-unlink=always.
> Add a test case for the --huge-unlink=never mode.
> 
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Acked-by: Thomas Monjalon <thomas@monjalon.net>
> ---

Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v3 0/6] Fast restart with many hugepages
  2022-02-03 18:13     ` [PATCH v3 " Dmitry Kozlyuk
                         ` (5 preceding siblings ...)
  2022-02-03 18:13       ` [PATCH v3 6/6] eal: extend --huge-unlink for " Dmitry Kozlyuk
@ 2022-02-08 20:40       ` David Marchand
  6 siblings, 0 replies; 53+ messages in thread
From: David Marchand @ 2022-02-08 20:40 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Anatoly Burakov, Viacheslav Ovsiienko, Thomas Monjalon,
	Lior Margalit, Bruce Richardson

On Thu, Feb 3, 2022 at 7:13 PM Dmitry Kozlyuk <dkozlyuk@nvidia.com> wrote:
>
> This patchset is a new design and implementation of [1].
>
> # Problem Statement
>
> Large allocations that involve mapping new hugepages are slow.
> This is problematic, for example, in the following use case.
> A single-process application allocates ~1TB of mempools at startup.
> Sometimes the app needs to restart as quick as possible.
> Allocating the hugepages anew takes as long as 15 seconds,
> while the new process could just pick up all the memory
> left by the old one (reinitializing the contents as needed).
>
> Almost all of mmap(2) time spent in the kernel
> is clearing the memory, i.e. filling it with zeros.
> This is done if a file in hugetlbfs is mapped
> for the first time system-wide, i.e. a hugepage is committed
> to prevent data leaks from the previous users of the same hugepage.
> For example, mapping 32 GB from a new file may take 2.16 seconds,
> while mapping the same pages again takes only 0.3 ms.
> Security put aside, e.g. when the environment is controlled,
> this effort is wasted for the memory intended for DMA,
> because its content will be overwritten anyway.
>
> Linux EAL explicitly removes hugetlbfs files at initialization
> and before mapping to force the kernel clear the memory.
> This allows the memory allocator to clean memory on only on freeing.
>
> # Solution
>
> Add a new mode allowing EAL to remap existing hugepage files.
> While it is intended to make restarts faster in the first place,
> it makes any startup faster except the cold one
> (with no existing files).
>
> It is the administrator who accepts security risks
> implied by reusing hugepages.
> The new mode is an opt-in and a warning is logged.
>
> The feature is Linux-only as it is related
> to mapping hugepages from files which only Linux does.
> It is inherently incompatible with --in-memory,
> for --huge-unlink see below.
>
> There is formally no breakage of API contract,
> but there is a behavior change in the new mode:
> rte_malloc*() and rte_memzone_reserve*() may return dirty memory
> (previously they were returning clean memory from free heap elements).
> Their contract has always explicitly allowed this,
> but still there may be users relying on the traditional behavior.
> Such users will need to fix their code to use the new mode.
>
> # Implementation
>
> ## User Interface
>
> There is --huge-unlink switch in the same area to remove hugepage files
> before mapping them. It is infeasible to use with the new mode,
> because the point is to keep hugepage files for fast future restarts.
> Extend --huge-unlink option to represent only valid combinations:
>
> * --huge-unlink=existing OR no option (for compatibility):
>   unlink files at initialization
>   and before opening them as a precaution.
>
> * --huge-unlink=always OR just --huge-unlink (for compatibility):
>   same as above + unlink created files before mapping.
>
> * --huge-unlink=never:
>   the new mode, do not unlink hugepages files, reuse them.
>
> This option was always Linux-only, but it is kept as common
> in case there are users who expect it to be a no-op on other systems.
> (Adding a separate --huge-reuse option was also considered,
> but there is no obvious benefit and more combinations to test.)
>
> ## EAL
>
> If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
> so that the memory allocator may clear the memory if need be.
> See patch 5/6 description for details how this is done
> in different memory mapping modes.
>
> The memory manager tracks whether an element is clean or dirty.
> If rte_zmalloc*() allocates from a dirty element,
> the memory is cleared before handling it to the user.
> On freeing, the allocator joins adjacent free elements,
> but in the new mode it may not be feasible to clear the free memory
> if the joint element is dirty (contains dirty parts).
> In any case, memory will be cleared only once,
> either on freeing or on allocation.
> See patch 3/6 for details.
> Patch 2/6 adds a benchmark to see how time is distributed
> between allocation and freeing in different modes.
>
> Besides clearing memory, each mmap() call takes some time.
> For example, 1024 calls for 1 TB may take ~300 ms.
> The time of one call mapping N hugepages is O(N),
> because inside the kernel hugepages are allocated ony by one.
> Syscall overhead is negligeable even for one page.
> Hence, it does not make sense to reduce the number of mmap() calls,
> which would essentially move the loop over pages into the kernel.
>
> [1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozlyuk@nvidia.com/
>

I fixed some checkpatch warnings, updated MAINTAINERS for the added
test and kept ERR level for a log message when creating files in
get_seg_fd().

Thanks again for enhancing the documentation, Dmitry.
And thanks to testers and reviewers.

Series applied.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2022-02-08 20:40 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-30 14:37 [RFC PATCH 0/6] Fast restart with many hugepages Dmitry Kozlyuk
2021-12-30 14:37 ` [RFC PATCH 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
2021-12-30 14:37 ` [RFC PATCH 2/6] mem: add dirty malloc element support Dmitry Kozlyuk
2021-12-30 14:37 ` [RFC PATCH 3/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
2021-12-30 14:37 ` [RFC PATCH 4/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
2021-12-30 14:48 ` [RFC PATCH 5/6] eal: allow hugepage file reuse with --huge-unlink Dmitry Kozlyuk
2021-12-30 14:49 ` [RFC PATCH 6/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
2022-01-17  8:07 ` [PATCH v1 0/6] Fast restart with many hugepages Dmitry Kozlyuk
2022-01-17  8:07   ` [PATCH v1 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
2022-01-17  9:20     ` Thomas Monjalon
2022-01-17  8:07   ` [PATCH v1 2/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
2022-01-17 15:47     ` Bruce Richardson
2022-01-17 15:51       ` Bruce Richardson
2022-01-19 21:12         ` Dmitry Kozlyuk
2022-01-20  9:04           ` Bruce Richardson
2022-01-17 16:06     ` Aaron Conole
2022-01-17  8:07   ` [PATCH v1 3/6] mem: add dirty malloc element support Dmitry Kozlyuk
2022-01-17 14:07     ` Thomas Monjalon
2022-01-17  8:07   ` [PATCH v1 4/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
2022-01-17 14:10     ` Thomas Monjalon
2022-01-17  8:14   ` [PATCH v1 5/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
2022-01-17 14:24     ` Thomas Monjalon
2022-01-17  8:14   ` [PATCH v1 6/6] eal: extend --huge-unlink for " Dmitry Kozlyuk
2022-01-17 14:27     ` Thomas Monjalon
2022-01-17 16:40   ` [PATCH v1 0/6] Fast restart with many hugepages Bruce Richardson
2022-01-19 21:12     ` Dmitry Kozlyuk
2022-01-20  9:05       ` Bruce Richardson
2022-01-19 21:09   ` [PATCH v2 " Dmitry Kozlyuk
2022-01-19 21:09     ` [PATCH v2 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
2022-01-27 13:59       ` Bruce Richardson
2022-01-19 21:09     ` [PATCH v2 2/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
2022-01-19 21:09     ` [PATCH v2 3/6] mem: add dirty malloc element support Dmitry Kozlyuk
2022-01-19 21:09     ` [PATCH v2 4/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
2022-01-19 21:11     ` [PATCH v2 5/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
2022-01-19 21:11       ` [PATCH v2 6/6] eal: extend --huge-unlink for " Dmitry Kozlyuk
2022-01-27 12:07     ` [PATCH v2 0/6] Fast restart with many hugepages Bruce Richardson
2022-02-02 14:12     ` Thomas Monjalon
2022-02-02 21:54     ` David Marchand
2022-02-03 10:26       ` David Marchand
2022-02-03 18:13     ` [PATCH v3 " Dmitry Kozlyuk
2022-02-03 18:13       ` [PATCH v3 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
2022-02-08 15:28         ` Burakov, Anatoly
2022-02-03 18:13       ` [PATCH v3 2/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
2022-02-08 16:20         ` Burakov, Anatoly
2022-02-03 18:13       ` [PATCH v3 3/6] mem: add dirty malloc element support Dmitry Kozlyuk
2022-02-08 16:36         ` Burakov, Anatoly
2022-02-03 18:13       ` [PATCH v3 4/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
2022-02-08 16:39         ` Burakov, Anatoly
2022-02-03 18:13       ` [PATCH v3 5/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
2022-02-08 17:05         ` Burakov, Anatoly
2022-02-03 18:13       ` [PATCH v3 6/6] eal: extend --huge-unlink for " Dmitry Kozlyuk
2022-02-08 17:14         ` Burakov, Anatoly
2022-02-08 20:40       ` [PATCH v3 0/6] Fast restart with many hugepages David Marchand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).