From: Dmitry Kozlyuk
CC: Anatoly Burakov, Viacheslav Ovsiienko, David Marchand, Thomas Monjalon, Lior Margalit
Subject: [RFC PATCH 0/6] Fast restart with many hugepages
Date: Thu, 30 Dec 2021 16:37:38 +0200
Message-ID: <20211230143744.3550098-1-dkozlyuk@nvidia.com>
List-Id: DPDK patches and discussions

This patchset is a new design and implementation of [1].

# Problem Statement

Large allocations that involve mapping new hugepages are slow.
This is problematic, for example, in the following use case.
A single-process application allocates ~1TB of mempools at startup.
Sometimes the app needs to restart as quickly as possible.
Allocating the hugepages anew takes as long as 15 seconds,
while the new process could just pick up all the memory
left by the old one (reinitializing the contents as needed).

Almost all of mmap(2) time spent in the kernel
is clearing the memory, i.e. filling it with zeros.
This is done when a file in hugetlbfs is mapped
for the first time system-wide, i.e. when a hugepage is committed,
to prevent data leaks from the previous users of the same hugepage.
For example, mapping 32 GB from a new file may take 2.16 seconds,
while mapping the same pages again takes only 0.3 ms.
Security put aside, e.g. when the environment is controlled,
this effort is wasted for the memory intended for DMA,
because its content will be overwritten anyway.

Linux EAL explicitly removes hugetlbfs files at initialization
and before mapping to force the kernel to clear the memory.
This allows the memory allocator to clear memory only on freeing.

# Solution

Add a new mode allowing EAL to remap existing hugepage files.
While it is intended to make restarts faster in the first place,
it makes any startup faster except the cold one
(with no existing files).

It is the administrator who accepts security risks
implied by reusing hugepages.
The new mode is opt-in and a warning is logged.

The feature is Linux-only as it is related
to mapping hugepages from files, which only Linux does.
It is inherently incompatible with --in-memory;
for --huge-unlink see below.

There is formally no breakage of API contract,
but there is a behavior change in the new mode:
rte_malloc*() and rte_memzone_reserve*() may return dirty memory
(previously they were returning clean memory from free heap elements).
Their contract has always explicitly allowed this,
but still there may be users relying on the traditional behavior.
Such users will need to fix their code to use the new mode.

# Implementation

## User Interface

There is a --huge-unlink switch in the same area
to remove hugepage files before mapping them.
It is infeasible to use with the new mode,
because the point is to keep hugepage files for fast future restarts.

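To illustrate why unlinking and reuse exclude each other,
here is a minimal sketch, not code from this series.
It assumes an ordinary scratch file under /tmp instead of hugetlbfs,
and the function name and path template are made up for the example.
It shows the general POSIX behavior: a file unlinked before mapping
stays usable through the open descriptor,
but no later process can open it by path again.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a scratch file, unlink it, map it, then try to reopen it by path.
 * Returns 1 if the path is gone while the mapping works,
 * 0 if the path could still be opened, -1 on a setup error.
 * Hypothetical helper for illustration only.
 */
static int
unlinked_file_is_unreachable(void)
{
	char path[] = "/tmp/toy_hugefile_XXXXXX"; /* regular file, not hugetlbfs */
	const size_t len = 4096;
	void *va;
	int fd, again, result;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) != 0) {
		unlink(path);
		close(fd);
		return -1;
	}
	/* Remove the name first, like unlinking before mapping. */
	if (unlink(path) != 0) {
		close(fd);
		return -1;
	}
	va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (va == MAP_FAILED) {
		close(fd);
		return -1;
	}
	/* The mapping is fully usable although the name no longer exists. */
	memset(va, 0x5A, len);
	assert(((unsigned char *)va)[0] == 0x5A);

	/* A restarted process would have to open by path, which now fails. */
	again = open(path, O_RDWR);
	result = again < 0 ? 1 : 0;
	if (again >= 0)
		close(again);

	munmap(va, len);
	close(fd);
	return result;
}
```

Calling unlinked_file_is_unreachable() from a test program
is expected to return 1 on a POSIX system with a writable /tmp.
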
Extend the --huge-unlink option to represent only valid combinations:

* --huge-unlink=existing OR no option (for compatibility):
  unlink files at initialization
  and before opening them as a precaution.

* --huge-unlink=always OR just --huge-unlink (for compatibility):
  same as above + unlink created files before mapping.

* --huge-unlink=never:
  the new mode, do not unlink hugepage files, reuse them.

This option was always Linux-only, but it is kept as common
in case there are users who expect it to be a no-op on other systems.
(Adding a separate --huge-reuse option was also considered,
but there is no obvious benefit and more combinations to test.)

## EAL

If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
so that the memory allocator may clear the memory if need be.
See patch 4/6 description for details.

The memory manager tracks whether an element is clean or dirty.
If rte_zmalloc*() allocates from a dirty element,
the memory is cleared before handing it to the user.
On freeing, the allocator joins adjacent free elements,
but in the new mode it may not be feasible to clear the free memory
if the joined element is dirty (contains dirty parts).
In any case, memory will be cleared only once,
either on freeing or on allocation.
See patch 2/6 for details.
Patch 6/6 adds a benchmark to see how time is distributed
between allocation and freeing in different modes.

Besides clearing memory, each mmap() call takes some time which adds up.
EAL does one call per hugepage; 1024 calls for 1 TB may take ~300 ms.
It does so in order to be able to unmap the segments one by one.
However, segments from initial allocation (-m) are never unmapped.
Ideally, initial allocation should take one mmap() call
per memory type (i.e. per NUMA node per page size)
if --single-file-segments is used.
This further optimization is not implemented in the current version.

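The clean/dirty bookkeeping described in the EAL section
can be pictured with a toy model. This is only a sketch under assumptions:
struct toy_elem and the toy_*() functions are hypothetical,
they are not the malloc_elem code from patch 2/6,
and joining of adjacent elements is left out.
The point is that each allocate/free cycle clears the memory at most once:
on freeing for a clean element, on zeroed allocation for a dirty one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy element; DPDK's real struct malloc_elem is different. */
struct toy_elem {
	unsigned char *data; /* user-visible memory of the element */
	size_t len;          /* size of that memory in bytes */
	bool dirty;          /* true: contents are not known to be zero */
	unsigned int clears; /* counts memset() calls, for illustration only */
};

/* malloc-like path: the contract allows returning dirty memory. */
static unsigned char *
toy_malloc(struct toy_elem *e)
{
	assert(e != NULL);
	return e->data;
}

/* zmalloc-like path: a dirty element is cleared before the user sees it. */
static unsigned char *
toy_zmalloc(struct toy_elem *e)
{
	assert(e != NULL);
	if (e->dirty) {
		memset(e->data, 0, e->len);
		e->clears++;
	}
	return e->data;
}

/*
 * free path: a clean element is cleared so that it stays clean;
 * a dirty element is left as is, clearing is deferred to toy_zmalloc().
 */
static void
toy_free(struct toy_elem *e)
{
	assert(e != NULL);
	if (!e->dirty) {
		memset(e->data, 0, e->len);
		e->clears++;
	}
}
```

In this model a dirty element pays for clearing in toy_zmalloc()
and nothing in toy_free(), while a clean element pays only in toy_free().
A caller of toy_malloc() on a dirty element sees whatever was left there,
which mirrors the behavior change described for the new mode.
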
[1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozlyuk@nvidia.com/

Dmitry Kozlyuk (6):
  doc: add hugepage mapping details
  mem: add dirty malloc element support
  eal: refactor --huge-unlink storage
  eal/linux: allow hugepage file reuse
  eal: allow hugepage file reuse with --huge-unlink
  app/test: add allocator performance benchmark

 app/test/meson.build                          |   2 +
 app/test/test_malloc_perf.c                   | 174 ++++++++++++++++++
 doc/guides/linux_gsg/linux_eal_parameters.rst |  21 ++-
 .../prog_guide/env_abstraction_layer.rst      |  94 +++++++++-
 doc/guides/rel_notes/release_22_03.rst        |   7 +
 lib/eal/common/eal_common_options.c           |  46 ++++-
 lib/eal/common/eal_internal_cfg.h             |  10 +-
 lib/eal/common/malloc_elem.c                  |  22 ++-
 lib/eal/common/malloc_elem.h                  |  11 +-
 lib/eal/common/malloc_heap.c                  |  18 +-
 lib/eal/common/rte_malloc.c                   |  21 ++-
 lib/eal/include/rte_memory.h                  |   8 +-
 lib/eal/linux/eal_hugepage_info.c             |  59 ++++--
 lib/eal/linux/eal_memalloc.c                  | 164 ++++++++++-------
 lib/eal/linux/eal_memory.c                    |   2 +-
 15 files changed, 537 insertions(+), 122 deletions(-)
 create mode 100644 app/test/test_malloc_perf.c

-- 
2.25.1