From mboxrd@z Thu Jan 1 00:00:00 1970
From: Thomas Monjalon
To: Anatoly Burakov
Cc: dev@dpdk.org, Bruce Richardson, Viacheslav Ovsiienko, David Marchand,
 Lior Margalit, Dmitry Kozlyuk
Subject: Re: [PATCH v2 0/6] Fast restart with many hugepages
Date: Wed, 02 Feb 2022 15:12:48 +0100
Message-ID: <36530803.XM6RcZxFsP@thomas>
In-Reply-To: <20220119210917.765505-1-dkozlyuk@nvidia.com>
References: <20220117080801.481568-1-dkozlyuk@nvidia.com>
 <20220119210917.765505-1-dkozlyuk@nvidia.com>
List-Id: DPDK patches and discussions

2 weeks passed without any new comment except a test by Bruce.
I would prefer avoiding a merge at the last minute.
Anatoly, any comment?

19/01/2022 22:09, Dmitry Kozlyuk:
> This patchset is a new design and implementation of [1].
>
> v2:
>   * Fix hugepage file removal when the files are no longer used.
>     Disable removal with --huge-unlink=never as intended.
>     Document this behavior difference. (Bruce)
>   * Improve documentation, commit messages, and naming. (Thomas)
>
> # Problem Statement
>
> Large allocations that involve mapping new hugepages are slow.
> This is problematic, for example, in the following use case.
> A single-process application allocates ~1 TB of mempools at startup.
> Sometimes the app needs to restart as quickly as possible.
> Allocating the hugepages anew takes as long as 15 seconds,
> while the new process could just pick up all the memory
> left by the old one (reinitializing the contents as needed).
>
> Almost all of the time that mmap(2) spends in the kernel
> goes to clearing the memory, i.e. filling it with zeros.
> This is done when a file in hugetlbfs is mapped
> for the first time system-wide, i.e. when a hugepage is committed,
> to prevent data leaks from the previous users of the same hugepage.
> For example, mapping 32 GB from a new file may take 2.16 seconds,
> while mapping the same pages again takes only 0.3 ms.
> Security aside, e.g. when the environment is controlled,
> this effort is wasted for memory intended for DMA,
> because its content will be overwritten anyway.
>
> Linux EAL explicitly removes hugetlbfs files at initialization
> and before mapping, to force the kernel to clear the memory.
> This allows the memory allocator to clean memory only on freeing.
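>
> The first-touch cost is easy to reproduce with a minimal standalone
> sketch (not part of the patchset; the hugetlbfs mount point, the 2 MB
> page size, and the file name are assumptions to adjust per system):
>
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <sys/mman.h>
>     #include <time.h>
>     #include <unistd.h>
>
>     #define MAP_SIZE (512UL << 21) /* 512 x 2 MB hugepages = 1 GB */
>
>     /* Map the whole file with MAP_POPULATE so that the page faults
>      * (and any kernel zeroing) happen inside mmap() itself. */
>     static double map_once(int fd)
>     {
>         struct timespec t0, t1;
>         void *p;
>
>         clock_gettime(CLOCK_MONOTONIC, &t0);
>         p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
>                  MAP_SHARED | MAP_POPULATE, fd, 0);
>         clock_gettime(CLOCK_MONOTONIC, &t1);
>         if (p == MAP_FAILED)
>             return -1.0;
>         munmap(p, MAP_SIZE);
>         return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
>     }
>
>     int main(void)
>     {
>         /* The path is an assumption: any hugetlbfs mount works. */
>         int fd = open("/dev/hugepages/reuse_demo", O_CREAT | O_RDWR, 0600);
>
>         if (fd < 0 || ftruncate(fd, MAP_SIZE) < 0)
>             return 1;
>         /* First mapping commits the hugepages: the kernel zeroes them. */
>         printf("first map:  %.3f s\n", map_once(fd));
>         /* Second mapping reuses the committed pages: no zeroing. */
>         printf("second map: %.3f s\n", map_once(fd));
>         close(fd);
>         unlink("/dev/hugepages/reuse_demo");
>         return 0;
>     }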
>
> # Solution
>
> Add a new mode allowing EAL to remap existing hugepage files.
> While it is intended to make restarts faster in the first place,
> it makes any startup faster except the cold one
> (with no existing files).
>
> It is the administrator who accepts the security risks
> implied by reusing hugepages.
> The new mode is opt-in, and a warning is logged when it is used.
>
> The feature is Linux-only, as it relies on mapping hugepages
> from files, which only Linux does.
> It is inherently incompatible with --in-memory;
> for --huge-unlink, see below.
>
> Formally there is no breakage of the API contract,
> but there is a behavior change in the new mode:
> rte_malloc*() and rte_memzone_reserve*() may return dirty memory
> (previously they returned clean memory from free heap elements).
> Their contract has always explicitly allowed this,
> but there may still be users relying on the traditional behavior.
> Such users will need to fix their code to use the new mode.
>
> # Implementation
>
> ## User Interface
>
> There is a --huge-unlink switch in the same area,
> which removes hugepage files before mapping them.
> It is infeasible to use with the new mode,
> because the point is to keep hugepage files for fast future restarts.
> The --huge-unlink option is extended to represent only valid combinations:
>
> * --huge-unlink=existing OR no option (for compatibility):
>   unlink files at initialization
>   and before opening them, as a precaution.
>
> * --huge-unlink=always OR just --huge-unlink (for compatibility):
>   same as above + unlink created files before mapping.
>
> * --huge-unlink=never:
>   the new mode; do not unlink hugepage files, reuse them.
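>
> For illustration, here is a minimal sketch of opting in from
> application code (the flag can equally be passed on the real command
> line; the argument list and error handling are assumptions):
>
>     #include <rte_eal.h>
>     #include <rte_debug.h>
>
>     int main(int argc, char **argv)
>     {
>         /* Reuse hugepage files left over from a previous run, if any. */
>         char *eal_args[] = { argv[0], "--huge-unlink=never" };
>
>         (void)argc; /* unused in this sketch */
>         if (rte_eal_init(2, eal_args) < 0)
>             rte_panic("cannot init EAL\n");
>         /* ... reserve mempools/memzones and run the application ... */
>         return rte_eal_cleanup();
>     }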
>
> This option was always Linux-only, but it is kept as a common option
> in case there are users who expect it to be a no-op on other systems.
> (Adding a separate --huge-reuse option was also considered,
> but there is no obvious benefit and more combinations to test.)
>
> ## EAL
>
> If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
> so that the memory allocator may clear the memory if need be.
> See the patch 5/6 description for details on how this is done
> in the different memory mapping modes.
>
> The memory manager tracks whether an element is clean or dirty.
> If rte_zmalloc*() allocates from a dirty element,
> the memory is cleared before handing it to the user.
> On freeing, the allocator joins adjacent free elements,
> but in the new mode it may not be feasible to clear the free memory
> if the joined element is dirty (contains dirty parts).
> In any case, memory will be cleared only once,
> either on freeing or on allocation.
> See patch 3/6 for details.
> Patch 2/6 adds a benchmark to see how time is distributed
> between allocation and freeing in the different modes.
>
> Besides clearing memory, each mmap() call takes some time.
> For example, 1024 calls for 1 TB may take ~300 ms.
> The time of one call mapping N hugepages is O(N),
> because inside the kernel hugepages are allocated one by one.
> Syscall overhead is negligible even for one page.
> Hence, it does not make sense to reduce the number of mmap() calls,
> which would essentially move the loop over pages into the kernel.
>
> [1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozlyuk@nvidia.com/
>
> Dmitry Kozlyuk (6):
>   doc: add hugepage mapping details
>   app/test: add allocator performance benchmark
>   mem: add dirty malloc element support
>   eal: refactor --huge-unlink storage
>   eal/linux: allow hugepage file reuse
>   eal: extend --huge-unlink for hugepage file reuse
>
>  app/test/meson.build                          |   2 +
>  app/test/test_eal_flags.c                     |  25 +++
>  app/test/test_malloc_perf.c                   | 174 ++++++++++++++++++
>  doc/guides/linux_gsg/linux_eal_parameters.rst |  24 ++-
>  .../prog_guide/env_abstraction_layer.rst      | 107 ++++++++++-
>  doc/guides/rel_notes/release_22_03.rst        |   7 +
>  lib/eal/common/eal_common_options.c           |  48 ++++-
>  lib/eal/common/eal_internal_cfg.h             |  10 +-
>  lib/eal/common/malloc_elem.c                  |  22 ++-
>  lib/eal/common/malloc_elem.h                  |  11 +-
>  lib/eal/common/malloc_heap.c                  |  18 +-
>  lib/eal/common/rte_malloc.c                   |  21 ++-
>  lib/eal/include/rte_memory.h                  |   8 +-
>  lib/eal/linux/eal.c                           |   3 +-
>  lib/eal/linux/eal_hugepage_info.c             | 118 +++++++++---
>  lib/eal/linux/eal_memalloc.c                  | 173 ++++++++++-------
>  lib/eal/linux/eal_memory.c                    |   2 +-
>  17 files changed, 644 insertions(+), 129 deletions(-)
>  create mode 100644 app/test/test_malloc_perf.c
>