From: David Marchand <dmarchan@redhat.com>
Date: Wed, 2 Feb 2022 22:54:59 +0100
Subject: Re: [PATCH v2 0/6] Fast restart with many hugepages
To: Dmitry Kozlyuk
Cc: dev, Bruce Richardson, Anatoly Burakov, Viacheslav Ovsiienko,
 Thomas Monjalon, Lior Margalit
In-Reply-To: <20220119210917.765505-1-dkozlyuk@nvidia.com>
References: <20220117080801.481568-1-dkozlyuk@nvidia.com>
 <20220119210917.765505-1-dkozlyuk@nvidia.com>
List-Id: DPDK patches and discussions

Hello Dmitry,

On Wed, Jan 19, 2022 at 10:09 PM Dmitry Kozlyuk wrote:
>
> This patchset is a new design and implementation of [1].
>
> v2:
>   * Fix removal of hugepage files when they are no longer used.
>     Disable removal with --huge-unlink=never as intended.
>     Document this behavior difference. (Bruce)
>   * Improve documentation, commit messages, and naming. (Thomas)
>
> # Problem Statement
>
> Large allocations that involve mapping new hugepages are slow.
> This is problematic, for example, in the following use case.
> A single-process application allocates ~1TB of mempools at startup.
> Sometimes the app needs to restart as quickly as possible.
> Allocating the hugepages anew takes as long as 15 seconds,
> while the new process could just pick up all the memory
> left by the old one (reinitializing the contents as needed).
>
> Almost all of the mmap(2) time is spent in the kernel
> clearing the memory, i.e. filling it with zeros.
> This is done when a file in hugetlbfs is mapped
> for the first time system-wide, i.e. when a hugepage is committed,
> to prevent data leaks from the previous users of the same hugepage.
> For example, mapping 32 GB from a new file may take 2.16 seconds,
> while mapping the same pages again takes only 0.3 ms.
> Security put aside, e.g. when the environment is controlled,
> this effort is wasted on memory intended for DMA,
> because its content will be overwritten anyway.
>
> Linux EAL explicitly removes hugetlbfs files at initialization
> and before mapping to force the kernel to clear the memory.
> This allows the memory allocator to clean memory only on freeing.
>
> # Solution
>
> Add a new mode allowing EAL to remap existing hugepage files.
> While it is intended to make restarts faster in the first place,
> it makes any startup faster except the cold one
> (with no existing files).
>
> It is the administrator who accepts the security risks
> implied by reusing hugepages.
> The new mode is opt-in, and a warning is logged.
>
> The feature is Linux-only, as it is related
> to mapping hugepages from files, which only Linux does.
> It is inherently incompatible with --in-memory;
> for --huge-unlink, see below.
>
> There is formally no breakage of the API contract,
> but there is a behavior change in the new mode:
> rte_malloc*() and rte_memzone_reserve*() may return dirty memory
> (previously they returned clean memory from free heap elements).
> Their contract has always explicitly allowed this,
> but still there may be users relying on the traditional behavior.
> Such users will need to fix their code to use the new mode.
>
> # Implementation
>
> ## User Interface
>
> There is already a --huge-unlink switch in the same area
> that removes hugepage files before mapping them.
> It cannot be combined with the new mode,
> because the point is to keep hugepage files for fast future restarts.
> Extend the --huge-unlink option to represent only valid combinations:
>
> * --huge-unlink=existing OR no option (for compatibility):
>   unlink files at initialization
>   and before opening them as a precaution.
>
> * --huge-unlink=always OR just --huge-unlink (for compatibility):
>   same as above + unlink created files before mapping.
>
> * --huge-unlink=never:
>   the new mode; do not unlink hugepage files, reuse them.
>
> This option was always Linux-only, but it is kept as common
> in case there are users who expect it to be a no-op on other systems.
> (Adding a separate --huge-reuse option was also considered,
> but there is no obvious benefit and more combinations to test.)
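>
> For illustration, here is a minimal sketch of an application
> opting in to the new mode (a hypothetical snippet, not part of
> the series; error handling trimmed):
>
>     #include <rte_eal.h>
>
>     int
>     main(int argc, char **argv)
>     {
>         /* Hypothetical argument vector: reuse hugepage files
>          * left by a previous run instead of unlinking them.
>          * EAL logs a warning in this mode, since the administrator
>          * thereby accepts the content of the reused pages. */
>         char *eal_args[] = { argv[0], "--huge-unlink=never" };
>
>         (void)argc;
>         if (rte_eal_init(2, eal_args) < 0)
>             return -1;
>         /* ... create mempools, run the application ... */
>         rte_eal_cleanup();
>         return 0;
>     }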
>
> ## EAL
>
> If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
> so that the memory allocator may clear the memory if need be.
> See the patch 5/6 description for details on how this is done
> in the different memory mapping modes.
>
> The memory manager tracks whether an element is clean or dirty.
> If rte_zmalloc*() allocates from a dirty element,
> the memory is cleared before handing it to the user.
> On freeing, the allocator joins adjacent free elements,
> but in the new mode it may not be feasible to clear the free memory
> if the joined element is dirty (contains dirty parts).
> In any case, memory will be cleared only once,
> either on freeing or on allocation.
> See patch 3/6 for details.
> Patch 2/6 adds a benchmark to see how time is distributed
> between allocation and freeing in different modes.
>
> Besides clearing memory, each mmap() call takes some time.
> For example, 1024 calls for 1 TB may take ~300 ms.
> The time of one call mapping N hugepages is O(N),
> because inside the kernel hugepages are allocated one by one.
> Syscall overhead is negligible even for one page.
> Hence, it does not make sense to reduce the number of mmap() calls,
> which would essentially move the loop over pages into the kernel.
>
> [1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozlyuk@nvidia.com/
>
> Dmitry Kozlyuk (6):
>   doc: add hugepage mapping details
>   app/test: add allocator performance benchmark
>   mem: add dirty malloc element support
>   eal: refactor --huge-unlink storage
>   eal/linux: allow hugepage file reuse
>   eal: extend --huge-unlink for hugepage file reuse
>
>  app/test/meson.build                          |   2 +
>  app/test/test_eal_flags.c                     |  25 +++
>  app/test/test_malloc_perf.c                   | 174 ++++++++++++++++++
>  doc/guides/linux_gsg/linux_eal_parameters.rst |  24 ++-
>  .../prog_guide/env_abstraction_layer.rst      | 107 ++++++++++-
>  doc/guides/rel_notes/release_22_03.rst        |   7 +
>  lib/eal/common/eal_common_options.c           |  48 ++++-
>  lib/eal/common/eal_internal_cfg.h             |  10 +-
>  lib/eal/common/malloc_elem.c                  |  22 ++-
>  lib/eal/common/malloc_elem.h                  |  11 +-
>  lib/eal/common/malloc_heap.c                  |  18 +-
>  lib/eal/common/rte_malloc.c                   |  21 ++-
>  lib/eal/include/rte_memory.h                  |   8 +-
>  lib/eal/linux/eal.c                           |   3 +-
>  lib/eal/linux/eal_hugepage_info.c             | 118 +++++++++---
>  lib/eal/linux/eal_memalloc.c                  | 173 ++++++++++-------
>  lib/eal/linux/eal_memory.c                    |   2 +-
>  17 files changed, 644 insertions(+), 129 deletions(-)
>  create mode 100644 app/test/test_malloc_perf.c

Thanks for the series, the documentation update, and keeping the EAL
options count the same as before :-).

It passes my checks: per-patch compilation (Linux x86 native, arm64 and
ppc cross-compilation), running unit tests, and running malloc tests
with ASan enabled.

I could not check all unit tests with RTE_MALLOC_DEBUG (I passed
-Dc_args=-DRTE_MALLOC_DEBUG to meson).
mbuf_autotest fails, but I reproduced the same error before the series,
so I'll report and investigate it separately.
Fwiw, the failure is:

1: [/home/dmarchan/builds/build-gcc-shared/app/test/../../lib/librte_eal.so.22(rte_dump_stack+0x1b) [0x7f860c482dab]]
Test mbuf linearize API
mbuf test FAILED (l.2035):
mbuf test FAILED (l.2539): test_pktmbuf_ext_pinned_buffer() failed
Test Failed

I have one comment on documentation: we have a detailed description of
the internal malloc_elem structure and of the implementation of the
DPDK memory allocator:
https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#internal-implementation

The addition of the "dirty/clean" notion should be described there,
as it would help others who want to look into this subsystem.
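For example, the new flag is at least observable from application code;
a minimal sketch (my own illustration, untested, assuming
RTE_MEMSEG_FLAG_DIRTY lands as described in this series and using the
existing rte_memseg_walk() API):

    #include <rte_common.h>
    #include <rte_memory.h>

    /* rte_memseg_walk() callback: count segments mapped from reused
     * hugepage files, i.e. still holding content from a previous run. */
    static int
    count_dirty(const struct rte_memseg_list *msl __rte_unused,
                const struct rte_memseg *ms, void *arg)
    {
        unsigned int *dirty = arg;

        if (ms->flags & RTE_MEMSEG_FLAG_DIRTY)
            (*dirty)++;
        return 0; /* 0 means: continue the walk */
    }

    /* Usage: unsigned int n = 0; rte_memseg_walk(count_dirty, &n); */


-- 
David Marchand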