From: David Marchand <dmarchan@redhat.com>
To: Dmitry Kozlyuk
Cc: dev <dev@dpdk.org>, Anatoly Burakov, Viacheslav Ovsiienko,
 Thomas Monjalon, Lior Margalit, Bruce Richardson
Date: Tue, 8 Feb 2022 21:40:35 +0100
Subject: Re: [PATCH v3 0/6] Fast restart with many hugepages
In-Reply-To: <20220203181337.759161-1-dkozlyuk@nvidia.com>
References: <20220119210917.765505-1-dkozlyuk@nvidia.com>
 <20220203181337.759161-1-dkozlyuk@nvidia.com>
List-Id: DPDK patches and discussions

On Thu, Feb 3, 2022 at 7:13 PM Dmitry Kozlyuk wrote:
>
> This patchset is a new design and implementation of [1].
>
> # Problem Statement
>
> Large allocations that involve mapping new hugepages are slow.
> This is problematic, for example, in the following use case.
> A single-process application allocates ~1 TB of mempools at startup.
> Sometimes the app needs to restart as quickly as possible.
> Allocating the hugepages anew takes as long as 15 seconds,
> while the new process could just pick up all the memory
> left by the old one (reinitializing the contents as needed).
>
> Almost all of the mmap(2) time is spent in the kernel
> clearing the memory, i.e. filling it with zeros.
> This is done when a file in hugetlbfs is mapped
> for the first time system-wide, i.e. when a hugepage is committed,
> to prevent data leaks from the previous users of the same hugepage.
> For example, mapping 32 GB from a new file may take 2.16 seconds,
> while mapping the same pages again takes only 0.3 ms.
> Security put aside, e.g. when the environment is controlled,
> this effort is wasted for memory intended for DMA,
> because its content will be overwritten anyway.
>
> Linux EAL explicitly removes hugetlbfs files at initialization
> and before mapping to force the kernel to clear the memory.
> This allows the memory allocator to clear memory only on freeing.
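
The first-touch cost is easy to reproduce outside of DPDK. Below is a
minimal standalone sketch, not part of the patchset; the mount point
/dev/hugepages, the file name, and 512 free 2 MB pages are assumptions
for illustration:

    /* Time the first and second mapping of the same hugetlbfs file.
     * MAP_POPULATE faults the pages in during mmap(), so the kernel's
     * zeroing cost shows up in the first call; the second call only
     * remaps pages already committed to the file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    #define MAP_SIZE (512UL * 2 * 1024 * 1024) /* 512 x 2 MB = 1 GB */

    static double map_seconds(int fd)
    {
        struct timespec a, b;
        void *p;

        clock_gettime(CLOCK_MONOTONIC, &a);
        p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_POPULATE, fd, 0);
        clock_gettime(CLOCK_MONOTONIC, &b);
        if (p == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
        munmap(p, MAP_SIZE);
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        int fd = open("/dev/hugepages/demo", O_CREAT | O_RDWR, 0600);

        if (fd < 0 || ftruncate(fd, MAP_SIZE) < 0) {
            perror("open/ftruncate");
            return 1;
        }
        printf("first map:  %f s\n", map_seconds(fd)); /* slow: zeroing */
        printf("second map: %f s\n", map_seconds(fd)); /* fast: reuse */
        close(fd);
        unlink("/dev/hugepages/demo");
        return 0;
    }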
>
> # Solution
>
> Add a new mode allowing EAL to remap existing hugepage files.
> While it is intended to make restarts faster in the first place,
> it makes any startup faster except the cold one
> (with no existing files).
>
> It is the administrator who accepts the security risks
> implied by reusing hugepages.
> The new mode is opt-in, and a warning is logged.
>
> The feature is Linux-only, as it is related
> to mapping hugepages from files, which only Linux does.
> It is inherently incompatible with --in-memory;
> for --huge-unlink, see below.
>
> There is formally no breakage of the API contract,
> but there is a behavior change in the new mode:
> rte_malloc*() and rte_memzone_reserve*() may return dirty memory
> (previously they returned clean memory from free heap elements).
> Their contract has always explicitly allowed this,
> but there may still be users relying on the traditional behavior.
> Such users will need to fix their code to use the new mode.
>
> # Implementation
>
> ## User Interface
>
> There is a --huge-unlink switch in the same area to remove hugepage
> files before mapping them. It is infeasible to use with the new mode,
> because the point is to keep hugepage files for fast future restarts.
> Extend the --huge-unlink option to represent only valid combinations:
>
> * --huge-unlink=existing OR no option (for compatibility):
>   unlink files at initialization
>   and before opening them as a precaution.
>
> * --huge-unlink=always OR just --huge-unlink (for compatibility):
>   same as above + unlink created files before mapping.
>
> * --huge-unlink=never:
>   the new mode; do not unlink hugepage files, reuse them.
>
> This option was always Linux-only, but it is kept common
> in case there are users who expect it to be a no-op on other systems.
> (Adding a separate --huge-reuse option was also considered,
> but there is no obvious benefit and more combinations to test.)
>
> ## EAL
>
> If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
> so that the memory allocator may clear the memory if need be.
> See the patch 5/6 description for details on how this is done
> in different memory mapping modes.
>
> The memory manager tracks whether an element is clean or dirty.
> If rte_zmalloc*() allocates from a dirty element,
> the memory is cleared before handing it to the user.
> On freeing, the allocator joins adjacent free elements,
> but in the new mode it may not be feasible to clear the free memory
> if the joined element is dirty (contains dirty parts).
> In any case, memory will be cleared only once,
> either on freeing or on allocation.
> See patch 3/6 for details.
> Patch 2/6 adds a benchmark to see how time is distributed
> between allocation and freeing in different modes.
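
To picture the clean/dirty tracking, here is a toy model; the names
and element layout are simplified for illustration and do not match
the actual rte_malloc heap code:

    /* Toy model of the clear-once policy described above. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    struct elem {
        void *data;
        size_t len;
        bool dirty; /* content is undefined, e.g. reused hugepages */
    };

    /* rte_zmalloc*()-like path: clear only if the element is dirty,
     * so each byte is cleared at most once. */
    static void *zalloc_from(struct elem *e)
    {
        if (e->dirty) {
            memset(e->data, 0, e->len); /* the one and only clearing */
            e->dirty = false;
        }
        return e->data;
    }

    /* Freeing path: joining a clean element with a dirty neighbor
     * cannot yield a clean element without extra work, so the joined
     * element stays dirty and clearing is deferred to allocation. */
    static void join(struct elem *dst, const struct elem *src)
    {
        dst->len += src->len;
        dst->dirty = dst->dirty || src->dirty;
    }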
>
> Besides clearing memory, each mmap() call takes some time.
> For example, 1024 calls for 1 TB may take ~300 ms.
> The time of one call mapping N hugepages is O(N),
> because inside the kernel hugepages are allocated one by one.
> Syscall overhead is negligible even for one page.
> Hence, it does not make sense to reduce the number of mmap() calls,
> which would essentially move the loop over pages into the kernel.
>
> [1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozlyuk@nvidia.com/

I fixed some checkpatch warnings, updated MAINTAINERS for the added test,
and kept the ERR level for a log message when creating files in
get_seg_fd().

Thanks again for enhancing the documentation, Dmitry.
And thanks to testers and reviewers.

Series applied.


-- 
David Marchand