DPDK patches and discussions
 help / color / mirror / Atom feed
From: David Marchand <david.marchand@redhat.com>
To: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Cc: dev <dev@dpdk.org>, Bruce Richardson <bruce.richardson@intel.com>,
	 Anatoly Burakov <anatoly.burakov@intel.com>,
	Viacheslav Ovsiienko <viacheslavo@nvidia.com>,
	 Thomas Monjalon <thomas@monjalon.net>,
	Lior Margalit <lmargalit@nvidia.com>
Subject: Re: [PATCH v2 0/6] Fast restart with many hugepages
Date: Wed, 2 Feb 2022 22:54:59 +0100	[thread overview]
Message-ID: <CAJFAV8w5XXRnnEEw4A2SsS_xv0xBh-aOm+PKb5hXoQjSZMhJTg@mail.gmail.com> (raw)
In-Reply-To: <20220119210917.765505-1-dkozlyuk@nvidia.com>

Hello Dmitry,

On Wed, Jan 19, 2022 at 10:09 PM Dmitry Kozlyuk <dkozlyuk@nvidia.com> wrote:
>
> This patchset is a new design and implementation of [1].
>
> v2:
>   * Fix hugepage file removal when they are no longer used.
>     Disable removal with --huge-unlink=never as intended.
>     Document this behavior difference. (Bruce)
>   * Improve documentation, commit messages, and naming. (Thomas)
>
> # Problem Statement
>
> Large allocations that involve mapping new hugepages are slow.
> This is problematic, for example, in the following use case.
> A single-process application allocates ~1TB of mempools at startup.
> Sometimes the app needs to restart as quick as possible.
> Allocating the hugepages anew takes as long as 15 seconds,
> while the new process could just pick up all the memory
> left by the old one (reinitializing the contents as needed).
>
> Almost all of mmap(2) time spent in the kernel
> is clearing the memory, i.e. filling it with zeros.
> This is done if a file in hugetlbfs is mapped
> for the first time system-wide, i.e. a hugepage is committed
> to prevent data leaks from the previous users of the same hugepage.
> For example, mapping 32 GB from a new file may take 2.16 seconds,
> while mapping the same pages again takes only 0.3 ms.
> Security put aside, e.g. when the environment is controlled,
> this effort is wasted for the memory intended for DMA,
> because its content will be overwritten anyway.
>
> Linux EAL explicitly removes hugetlbfs files at initialization
> and before mapping to force the kernel clear the memory.
> This allows the memory allocator to clean memory on only on freeing.
>
> # Solution
>
> Add a new mode allowing EAL to remap existing hugepage files.
> While it is intended to make restarts faster in the first place,
> it makes any startup faster except the cold one
> (with no existing files).
>
> It is the administrator who accepts security risks
> implied by reusing hugepages.
> The new mode is an opt-in and a warning is logged.
>
> The feature is Linux-only as it is related
> to mapping hugepages from files which only Linux does.
> It is inherently incompatible with --in-memory,
> for --huge-unlink see below.
>
> There is formally no breakage of API contract,
> but there is a behavior change in the new mode:
> rte_malloc*() and rte_memzone_reserve*() may return dirty memory
> (previously they were returning clean memory from free heap elements).
> Their contract has always explicitly allowed this,
> but still there may be users relying on the traditional behavior.
> Such users will need to fix their code to use the new mode.
>
> # Implementation
>
> ## User Interface
>
> There is --huge-unlink switch in the same area to remove hugepage files
> before mapping them. It is infeasible to use with the new mode,
> because the point is to keep hugepage files for fast future restarts.
> Extend --huge-unlink option to represent only valid combinations:
>
> * --huge-unlink=existing OR no option (for compatibility):
>   unlink files at initialization
>   and before opening them as a precaution.
>
> * --huge-unlink=always OR just --huge-unlink (for compatibility):
>   same as above + unlink created files before mapping.
>
> * --huge-unlink=never:
>   the new mode, do not unlink hugepages files, reuse them.
>
> This option was always Linux-only, but it is kept as common
> in case there are users who expect it to be a no-op on other systems.
> (Adding a separate --huge-reuse option was also considered,
> but there is no obvious benefit and more combinations to test.)
>
> ## EAL
>
> If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
> so that the memory allocator may clear the memory if need be.
> See patch 5/6 description for details how this is done
> in different memory mapping modes.
>
> The memory manager tracks whether an element is clean or dirty.
> If rte_zmalloc*() allocates from a dirty element,
> the memory is cleared before handling it to the user.
> On freeing, the allocator joins adjacent free elements,
> but in the new mode it may not be feasible to clear the free memory
> if the joint element is dirty (contains dirty parts).
> In any case, memory will be cleared only once,
> either on freeing or on allocation.
> See patch 3/6 for details.
> Patch 2/6 adds a benchmark to see how time is distributed
> between allocation and freeing in different modes.
>
> Besides clearing memory, each mmap() call takes some time.
> For example, 1024 calls for 1 TB may take ~300 ms.
> The time of one call mapping N hugepages is O(N),
> because inside the kernel hugepages are allocated ony by one.
> Syscall overhead is negligeable even for one page.
> Hence, it does not make sense to reduce the number of mmap() calls,
> which would essentially move the loop over pages into the kernel.
>
> [1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozlyuk@nvidia.com/
>
> Dmitry Kozlyuk (6):
>   doc: add hugepage mapping details
>   app/test: add allocator performance benchmark
>   mem: add dirty malloc element support
>   eal: refactor --huge-unlink storage
>   eal/linux: allow hugepage file reuse
>   eal: extend --huge-unlink for hugepage file reuse
>
>  app/test/meson.build                          |   2 +
>  app/test/test_eal_flags.c                     |  25 +++
>  app/test/test_malloc_perf.c                   | 174 ++++++++++++++++++
>  doc/guides/linux_gsg/linux_eal_parameters.rst |  24 ++-
>  .../prog_guide/env_abstraction_layer.rst      | 107 ++++++++++-
>  doc/guides/rel_notes/release_22_03.rst        |   7 +
>  lib/eal/common/eal_common_options.c           |  48 ++++-
>  lib/eal/common/eal_internal_cfg.h             |  10 +-
>  lib/eal/common/malloc_elem.c                  |  22 ++-
>  lib/eal/common/malloc_elem.h                  |  11 +-
>  lib/eal/common/malloc_heap.c                  |  18 +-
>  lib/eal/common/rte_malloc.c                   |  21 ++-
>  lib/eal/include/rte_memory.h                  |   8 +-
>  lib/eal/linux/eal.c                           |   3 +-
>  lib/eal/linux/eal_hugepage_info.c             | 118 +++++++++---
>  lib/eal/linux/eal_memalloc.c                  | 173 ++++++++++-------
>  lib/eal/linux/eal_memory.c                    |   2 +-
>  17 files changed, 644 insertions(+), 129 deletions(-)
>  create mode 100644 app/test/test_malloc_perf.c

Thanks for the series, the documentation update and keeping the EAL
options count the same as before :-).

It passes my checks (compilation per patch for Linux x86 native, arm86
and ppc cross compil), running unit tests, running malloc tests with
ASan enabled.


I could not check all unit tests with RTE_MALLOC_DEBUG (I passed
-Dc_args=-DRTE_MALLOC_DEBUG to meson).
mbuf_autotest fails but I reproduced the same error before the series
so I'll report and investigate this separately.
Fwiw, the failure is:
1: [/home/dmarchan/builds/build-gcc-shared/app/test/../../lib/librte_eal.so.22(rte_dump_stack+0x1b)
[0x7f860c482dab]]
Test mbuf linearize API
mbuf test FAILED (l.2035): <test_pktmbuf_read_from_offset: Incorrect
data length!
>
mbuf test FAILED (l.2539): <test_pktmbuf_ext_pinned_buffer:
test_rte_pktmbuf_read_from_offset(pinned) failed
>
test_pktmbuf_ext_pinned_buffer() failed
Test Failed


I have one comment on documentation: we have a detailed description of
internal malloc_elem structure and implementation of the dpdk mem
allocator.
https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#internal-implementation
The addition of the "dirty/clean" notion should be described, as it
would help others who want to look into this subsystem.


-- 
David Marchand


  parent reply	other threads:[~2022-02-02 21:55 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-30 14:37 [RFC PATCH " Dmitry Kozlyuk
2021-12-30 14:37 ` [RFC PATCH 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
2021-12-30 14:37 ` [RFC PATCH 2/6] mem: add dirty malloc element support Dmitry Kozlyuk
2021-12-30 14:37 ` [RFC PATCH 3/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
2021-12-30 14:37 ` [RFC PATCH 4/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
2021-12-30 14:48 ` [RFC PATCH 5/6] eal: allow hugepage file reuse with --huge-unlink Dmitry Kozlyuk
2021-12-30 14:49 ` [RFC PATCH 6/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
2022-01-17  8:07 ` [PATCH v1 0/6] Fast restart with many hugepages Dmitry Kozlyuk
2022-01-17  8:07   ` [PATCH v1 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
2022-01-17  9:20     ` Thomas Monjalon
2022-01-17  8:07   ` [PATCH v1 2/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
2022-01-17 15:47     ` Bruce Richardson
2022-01-17 15:51       ` Bruce Richardson
2022-01-19 21:12         ` Dmitry Kozlyuk
2022-01-20  9:04           ` Bruce Richardson
2022-01-17 16:06     ` Aaron Conole
2022-01-17  8:07   ` [PATCH v1 3/6] mem: add dirty malloc element support Dmitry Kozlyuk
2022-01-17 14:07     ` Thomas Monjalon
2022-01-17  8:07   ` [PATCH v1 4/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
2022-01-17 14:10     ` Thomas Monjalon
2022-01-17  8:14   ` [PATCH v1 5/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
2022-01-17 14:24     ` Thomas Monjalon
2022-01-17  8:14   ` [PATCH v1 6/6] eal: extend --huge-unlink for " Dmitry Kozlyuk
2022-01-17 14:27     ` Thomas Monjalon
2022-01-17 16:40   ` [PATCH v1 0/6] Fast restart with many hugepages Bruce Richardson
2022-01-19 21:12     ` Dmitry Kozlyuk
2022-01-20  9:05       ` Bruce Richardson
2022-01-19 21:09   ` [PATCH v2 " Dmitry Kozlyuk
2022-01-19 21:09     ` [PATCH v2 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
2022-01-27 13:59       ` Bruce Richardson
2022-01-19 21:09     ` [PATCH v2 2/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
2022-01-19 21:09     ` [PATCH v2 3/6] mem: add dirty malloc element support Dmitry Kozlyuk
2022-01-19 21:09     ` [PATCH v2 4/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
2022-01-19 21:11     ` [PATCH v2 5/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
2022-01-19 21:11       ` [PATCH v2 6/6] eal: extend --huge-unlink for " Dmitry Kozlyuk
2022-01-27 12:07     ` [PATCH v2 0/6] Fast restart with many hugepages Bruce Richardson
2022-02-02 14:12     ` Thomas Monjalon
2022-02-02 21:54     ` David Marchand [this message]
2022-02-03 10:26       ` David Marchand
2022-02-03 18:13     ` [PATCH v3 " Dmitry Kozlyuk
2022-02-03 18:13       ` [PATCH v3 1/6] doc: add hugepage mapping details Dmitry Kozlyuk
2022-02-08 15:28         ` Burakov, Anatoly
2022-02-03 18:13       ` [PATCH v3 2/6] app/test: add allocator performance benchmark Dmitry Kozlyuk
2022-02-08 16:20         ` Burakov, Anatoly
2022-02-03 18:13       ` [PATCH v3 3/6] mem: add dirty malloc element support Dmitry Kozlyuk
2022-02-08 16:36         ` Burakov, Anatoly
2022-02-03 18:13       ` [PATCH v3 4/6] eal: refactor --huge-unlink storage Dmitry Kozlyuk
2022-02-08 16:39         ` Burakov, Anatoly
2022-02-03 18:13       ` [PATCH v3 5/6] eal/linux: allow hugepage file reuse Dmitry Kozlyuk
2022-02-08 17:05         ` Burakov, Anatoly
2022-02-03 18:13       ` [PATCH v3 6/6] eal: extend --huge-unlink for " Dmitry Kozlyuk
2022-02-08 17:14         ` Burakov, Anatoly
2022-02-08 20:40       ` [PATCH v3 0/6] Fast restart with many hugepages David Marchand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJFAV8w5XXRnnEEw4A2SsS_xv0xBh-aOm+PKb5hXoQjSZMhJTg@mail.gmail.com \
    --to=david.marchand@redhat.com \
    --cc=anatoly.burakov@intel.com \
    --cc=bruce.richardson@intel.com \
    --cc=dev@dpdk.org \
    --cc=dkozlyuk@nvidia.com \
    --cc=lmargalit@nvidia.com \
    --cc=thomas@monjalon.net \
    --cc=viacheslavo@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).