From mboxrd@z Thu Jan 1 00:00:00 1970
From: Thomas Monjalon
To: Anatoly Burakov
Cc: dev@dpdk.org, Bruce Richardson, Viacheslav Ovsiienko, David Marchand,
 Lior Margalit, Dmitry Kozlyuk
Subject: Re: [PATCH v2 0/6] Fast restart with many hugepages
Date: Wed, 02 Feb 2022 15:12:48 +0100
Message-ID: <36530803.XM6RcZxFsP@thomas>
In-Reply-To: <20220119210917.765505-1-dkozlyuk@nvidia.com>
References: <20220117080801.481568-1-dkozlyuk@nvidia.com>
 <20220119210917.765505-1-dkozlyuk@nvidia.com>
List-Id: DPDK patches and discussions

2 weeks passed without any new comment except a test by Bruce.
I would prefer avoiding a merge at the last minute.
Anatoly, any comment?

19/01/2022 22:09, Dmitry Kozlyuk:
> This patchset is a new design and implementation of [1].
>
> v2:
>   * Fix hugepage file removal when the files are no longer used.
>     Disable removal with --huge-unlink=never as intended.
>     Document this behavior difference. (Bruce)
>   * Improve documentation, commit messages, and naming. (Thomas)
>
> # Problem Statement
>
> Large allocations that involve mapping new hugepages are slow.
> This is problematic, for example, in the following use case.
> A single-process application allocates ~1 TB of mempools at startup.
> Sometimes the app needs to restart as quickly as possible.
> Allocating the hugepages anew takes as long as 15 seconds,
> while the new process could just pick up all the memory
> left by the old one (reinitializing the contents as needed).
>
> Almost all of the time that mmap(2) spends in the kernel
> goes to clearing the memory, i.e. filling it with zeros.
> This is done when a file in hugetlbfs is mapped
> for the first time system-wide, i.e. when a hugepage is committed,
> to prevent data leaks from the previous users of the same hugepage.
> For example, mapping 32 GB from a new file may take 2.16 seconds,
> while mapping the same pages again takes only 0.3 ms.
> Security aside, e.g. when the environment is controlled,
> this effort is wasted for memory intended for DMA,
> because its content will be overwritten anyway.
>
> Linux EAL explicitly removes hugetlbfs files at initialization
> and before mapping, to force the kernel to clear the memory.
> This allows the memory allocator to clean memory only on freeing.
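>
> The first-touch cost is easy to reproduce with a minimal standalone
> sketch (not part of the patchset; the hugetlbfs mount point, the 2 MB
> page size, and the file name are assumptions to adjust per system):
>
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <sys/mman.h>
>     #include <time.h>
>     #include <unistd.h>
>
>     #define MAP_SIZE (512UL << 21) /* 512 x 2 MB hugepages = 1 GB */
>
>     /* Map the whole file with MAP_POPULATE so that the page faults
>      * (and any kernel zeroing) happen inside mmap() itself. */
>     static double map_once(int fd)
>     {
>         struct timespec t0, t1;
>         void *p;
>
>         clock_gettime(CLOCK_MONOTONIC, &t0);
>         p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
>                  MAP_SHARED | MAP_POPULATE, fd, 0);
>         clock_gettime(CLOCK_MONOTONIC, &t1);
>         if (p == MAP_FAILED)
>             return -1.0;
>         munmap(p, MAP_SIZE);
>         return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
>     }
>
>     int main(void)
>     {
>         /* The path is an assumption: any hugetlbfs mount works. */
>         int fd = open("/dev/hugepages/reuse_demo", O_CREAT | O_RDWR, 0600);
>
>         if (fd < 0 || ftruncate(fd, MAP_SIZE) < 0)
>             return 1;
>         /* First mapping commits the hugepages: the kernel zeroes them. */
>         printf("first map:  %.3f s\n", map_once(fd));
>         /* Second mapping reuses the committed pages: no zeroing. */
>         printf("second map: %.3f s\n", map_once(fd));
>         close(fd);
>         unlink("/dev/hugepages/reuse_demo");
>         return 0;
>     }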
>
> # Solution
>
> Add a new mode allowing EAL to remap existing hugepage files.
> While it is intended to make restarts faster in the first place,
> it makes any startup faster except the cold one
> (with no existing files).
>
> It is the administrator who accepts the security risks
> implied by reusing hugepages.
> The new mode is opt-in, and a warning is logged when it is used.
>
> The feature is Linux-only, as it relies on mapping hugepages
> from files, which only Linux does.
> It is inherently incompatible with --in-memory;
> for --huge-unlink, see below.
>
> Formally there is no breakage of the API contract,
> but there is a behavior change in the new mode:
> rte_malloc*() and rte_memzone_reserve*() may return dirty memory
> (previously they returned clean memory from free heap elements).
> Their contract has always explicitly allowed this,
> but there may still be users relying on the traditional behavior.
> Such users will need to fix their code to use the new mode.
>
> # Implementation
>
> ## User Interface
>
> There is a --huge-unlink switch in the same area,
> which removes hugepage files before mapping them.
> It is infeasible to use with the new mode,
> because the point is to keep hugepage files for fast future restarts.
> The --huge-unlink option is extended to represent only valid combinations:
>
> * --huge-unlink=existing OR no option (for compatibility):
>   unlink files at initialization
>   and before opening them, as a precaution.
>
> * --huge-unlink=always OR just --huge-unlink (for compatibility):
>   same as above + unlink created files before mapping.
>
> * --huge-unlink=never:
>   the new mode; do not unlink hugepage files, reuse them.
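>
> For illustration, here is a minimal sketch of opting in from
> application code (the flag can equally be passed on the real command
> line; the argument list and error handling are assumptions):
>
>     #include <rte_eal.h>
>     #include <rte_debug.h>
>
>     int main(int argc, char **argv)
>     {
>         /* Reuse hugepage files left over from a previous run, if any. */
>         char *eal_args[] = { argv[0], "--huge-unlink=never" };
>
>         (void)argc; /* unused in this sketch */
>         if (rte_eal_init(2, eal_args) < 0)
>             rte_panic("cannot init EAL\n");
>         /* ... reserve mempools/memzones and run the application ... */
>         return rte_eal_cleanup();
>     }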
>
> This option was always Linux-only, but it is kept as a common option
> in case there are users who expect it to be a no-op on other systems.
> (Adding a separate --huge-reuse option was also considered,
> but there is no obvious benefit and more combinations to test.)
>
> ## EAL
>
> If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
> so that the memory allocator may clear the memory if need be.
> See the patch 5/6 description for details on how this is done
> in the different memory mapping modes.
>
> The memory manager tracks whether an element is clean or dirty.
> If rte_zmalloc*() allocates from a dirty element,
> the memory is cleared before handing it to the user.
> On freeing, the allocator joins adjacent free elements,
> but in the new mode it may not be feasible to clear the free memory
> if the joined element is dirty (contains dirty parts).
> In any case, memory will be cleared only once,
> either on freeing or on allocation.
> See patch 3/6 for details.
> Patch 2/6 adds a benchmark to see how time is distributed
> between allocation and freeing in the different modes.
>
> Besides clearing memory, each mmap() call takes some time.
> For example, 1024 calls for 1 TB may take ~300 ms.
> The time of one call mapping N hugepages is O(N),
> because inside the kernel hugepages are allocated one by one.
> Syscall overhead is negligible even for one page.
> Hence, it does not make sense to reduce the number of mmap() calls,
> which would essentially move the loop over pages into the kernel.
>
> [1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozlyuk@nvidia.com/
>
> Dmitry Kozlyuk (6):
>   doc: add hugepage mapping details
>   app/test: add allocator performance benchmark
>   mem: add dirty malloc element support
>   eal: refactor --huge-unlink storage
>   eal/linux: allow hugepage file reuse
>   eal: extend --huge-unlink for hugepage file reuse
>
>  app/test/meson.build                          |   2 +
>  app/test/test_eal_flags.c                     |  25 +++
>  app/test/test_malloc_perf.c                   | 174 ++++++++++++++++++
>  doc/guides/linux_gsg/linux_eal_parameters.rst |  24 ++-
>  .../prog_guide/env_abstraction_layer.rst      | 107 ++++++++++-
>  doc/guides/rel_notes/release_22_03.rst        |   7 +
>  lib/eal/common/eal_common_options.c           |  48 ++++-
>  lib/eal/common/eal_internal_cfg.h             |  10 +-
>  lib/eal/common/malloc_elem.c                  |  22 ++-
>  lib/eal/common/malloc_elem.h                  |  11 +-
>  lib/eal/common/malloc_heap.c                  |  18 +-
>  lib/eal/common/rte_malloc.c                   |  21 ++-
>  lib/eal/include/rte_memory.h                  |   8 +-
>  lib/eal/linux/eal.c                           |   3 +-
>  lib/eal/linux/eal_hugepage_info.c             | 118 +++++++++---
>  lib/eal/linux/eal_memalloc.c                  | 173 ++++++++++-------
>  lib/eal/linux/eal_memory.c                    |   2 +-
>  17 files changed, 644 insertions(+), 129 deletions(-)
>  create mode 100644 app/test/test_malloc_perf.c
>