From: Dmitry Kozlyuk
CC: Anatoly Burakov
Subject: [RFC PATCH 4/6] eal/linux: allow hugepage file reuse
Date: Thu, 30 Dec 2021 16:37:42 +0200
Message-ID: <20211230143744.3550098-5-dkozlyuk@nvidia.com>
In-Reply-To: <20211230143744.3550098-1-dkozlyuk@nvidia.com>
References: <20211230143744.3550098-1-dkozlyuk@nvidia.com>
List-Id: DPDK patches and discussions

Linux EAL ensured that mapped hugepages are clean by always mapping
from newly created files: any existing hugepage backing files were
removed first. The kernel clears each page on its first mapping to
prevent data leaks, because the memory may contain leftover data from
a previous process that used it. This clearing takes the bulk of the
time spent in mmap(2), increasing EAL initialization time.

Introduce a mode to keep existing files and reuse them in order to
speed up initial memory allocation in EAL. Hugepages mapped from such
files may contain data left by the previous process that used this
memory, so RTE_MEMSEG_FLAG_DIRTY is set for their segments.
If multiple hugepages are mapped from the same file:
1. When fallocate(2) is used, all memory mapped from this file is
   considered dirty, because it is unknown which parts of the file
   are holes.
2. When ftruncate(2) is used, memory mapped from this file is
   considered dirty unless the file is extended to create a new
   mapping, which implies clean memory.

Signed-off-by: Dmitry Kozlyuk
---
 lib/eal/common/eal_internal_cfg.h |   2 +
 lib/eal/linux/eal_hugepage_info.c |  59 +++++++----
 lib/eal/linux/eal_memalloc.c      | 157 ++++++++++++++++++------------
 3 files changed, 140 insertions(+), 78 deletions(-)

diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index b5e6942578..3685aa7c52 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -44,6 +44,8 @@ struct simd_bitwidth {
 struct hugepage_file_discipline {
 	/** Unlink files before mapping them to leave no trace in hugetlbfs. */
 	bool unlink_before_mapping;
+	/** Reuse existing files, never delete or re-create them. */
+	bool keep_existing;
 };
 
 /**
diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index 9fb0e968db..55debdedf0 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -84,7 +84,7 @@ static int get_hp_sysfs_value(const char *subdir, const char *file, unsigned lon
 /* this function is only called from eal_hugepage_info_init which itself
  * is only called from a primary process */
 static uint32_t
-get_num_hugepages(const char *subdir, size_t sz)
+get_num_hugepages(const char *subdir, size_t sz, unsigned int reusable_pages)
 {
 	unsigned long resv_pages, num_pages, over_pages, surplus_pages;
 	const char *nr_hp_file = "free_hugepages";
@@ -116,7 +116,7 @@ get_num_hugepages(const char *subdir, size_t sz)
 	else
 		over_pages = 0;
 
-	if (num_pages == 0 && over_pages == 0)
+	if (num_pages == 0 && over_pages == 0 && reusable_pages == 0)
 		RTE_LOG(WARNING, EAL,
 				"No available %zu kB hugepages reported\n",
 				sz >> 10);
@@ -124,6 +124,10 @@ get_num_hugepages(const char *subdir, size_t sz)
 	if (num_pages < over_pages) /* overflow */
 		num_pages = UINT32_MAX;
 
+	num_pages += reusable_pages;
+	if (num_pages < reusable_pages) /* overflow */
+		num_pages = UINT32_MAX;
+
 	/* we want to return a uint32_t and more than this looks suspicious
 	 * anyway ... */
 	if (num_pages > UINT32_MAX)
@@ -298,12 +302,12 @@ get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
 }
 
 /*
- * Clear the hugepage directory of whatever hugepage files
- * there are. Checks if the file is locked (i.e.
- * if it's in use by another DPDK process).
+ * Search the hugepage directory for whatever hugepage files there are.
+ * Check if the file is in use by another DPDK process.
+ * If not, either remove it, or keep and count the page as reusable.
 */
 static int
-clear_hugedir(const char * hugedir)
+clear_hugedir(const char *hugedir, bool keep, unsigned int *reusable_pages)
 {
 	DIR *dir;
 	struct dirent *dirent;
@@ -346,8 +350,12 @@ clear_hugedir(const char * hugedir)
 		lck_result = flock(fd, LOCK_EX | LOCK_NB);
 
 		/* if lock succeeds, remove the file */
-		if (lck_result != -1)
-			unlinkat(dir_fd, dirent->d_name, 0);
+		if (lck_result != -1) {
+			if (keep)
+				(*reusable_pages)++;
+			else
+				unlinkat(dir_fd, dirent->d_name, 0);
+		}
 		close (fd);
 		dirent = readdir(dir);
 	}
@@ -375,7 +383,8 @@ compare_hpi(const void *a, const void *b)
 }
 
 static void
-calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
+calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent,
+		unsigned int reusable_pages)
 {
 	uint64_t total_pages = 0;
 	unsigned int i;
@@ -388,8 +397,15 @@ calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
 	 * in one socket and sorting them later
 	 */
 	total_pages = 0;
-	/* we also don't want to do this for legacy init */
-	if (!internal_conf->legacy_mem)
+
+	/*
+	 * We also don't want to do this for legacy init.
+	 * When there are hugepage files to reuse it is unknown
+	 * what NUMA node the pages are on.
+	 * This could be determined by mapping,
+	 * but it is precisely what hugepage file reuse is trying to avoid.
+	 */
+	if (!internal_conf->legacy_mem && reusable_pages == 0)
 		for (i = 0; i < rte_socket_count(); i++) {
 			int socket = rte_socket_id_by_idx(i);
 			unsigned int num_pages =
@@ -405,7 +421,7 @@ calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
 	 */
 	if (total_pages == 0) {
 		hpi->num_pages[0] = get_num_hugepages(dirent->d_name,
-				hpi->hugepage_sz);
+				hpi->hugepage_sz, reusable_pages);
 
 #ifndef RTE_ARCH_64
 		/* for 32-bit systems, limit number of hugepages to
@@ -421,6 +437,7 @@ hugepage_info_init(void)
 	const char dirent_start_text[] = "hugepages-";
 	const size_t dirent_start_len = sizeof(dirent_start_text) - 1;
 	unsigned int i, num_sizes = 0;
+	unsigned int reusable_pages;
 	DIR *dir;
 	struct dirent *dirent;
 	struct internal_config *internal_conf =
@@ -454,7 +471,7 @@ hugepage_info_init(void)
 			uint32_t num_pages;
 
 			num_pages = get_num_hugepages(dirent->d_name,
-					hpi->hugepage_sz);
+					hpi->hugepage_sz, 0);
 			if (num_pages > 0)
 				RTE_LOG(NOTICE, EAL,
 					"%" PRIu32 " hugepages of size "
@@ -473,7 +490,7 @@ hugepage_info_init(void)
 				"hugepages of size %" PRIu64 " bytes "
 				"will be allocated anonymously\n",
 				hpi->hugepage_sz);
-			calc_num_pages(hpi, dirent);
+			calc_num_pages(hpi, dirent, 0);
 			num_sizes++;
 		}
 #endif
@@ -489,11 +506,17 @@ hugepage_info_init(void)
 				"Failed to lock hugepage directory!\n");
 			break;
 		}
-		/* clear out the hugepages dir from unused pages */
-		if (clear_hugedir(hpi->hugedir) == -1)
-			break;
 
-		calc_num_pages(hpi, dirent);
+		/*
+		 * Check for existing hugepage files and either remove them
+		 * or count how many of them can be reused.
+		 */
+		reusable_pages = 0;
+		if (clear_hugedir(hpi->hugedir,
+				internal_conf->hugepage_file.keep_existing,
+				&reusable_pages) == -1)
+			break;
+		calc_num_pages(hpi, dirent, reusable_pages);
 
 		num_sizes++;
 	}
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index abbe605e49..cbd7c9cbee 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -287,12 +287,19 @@ get_seg_memfd(struct hugepage_info *hi __rte_unused,
 
 static int
 get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
-		unsigned int list_idx, unsigned int seg_idx)
+		unsigned int list_idx, unsigned int seg_idx,
+		bool *dirty)
 {
 	int fd;
+	int *out_fd;
+	struct stat st;
+	int ret;
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
+	if (dirty != NULL)
+		*dirty = false;
+
 	/* for in-memory mode, we only make it here when we're sure we support
 	 * memfd, and this is a special case.
 	 */
@@ -300,66 +307,68 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 		return get_seg_memfd(hi, list_idx, seg_idx);
 
 	if (internal_conf->single_file_segments) {
-		/* create a hugepage file path */
+		out_fd = &fd_list[list_idx].memseg_list_fd;
 		eal_get_hugefile_path(path, buflen, hi->hugedir, list_idx);
-
-		fd = fd_list[list_idx].memseg_list_fd;
-
-		if (fd < 0) {
-			fd = open(path, O_CREAT | O_RDWR, 0600);
-			if (fd < 0) {
-				RTE_LOG(ERR, EAL, "%s(): open failed: %s\n",
-					__func__, strerror(errno));
-				return -1;
-			}
-			/* take out a read lock and keep it indefinitely */
-			if (lock(fd, LOCK_SH) < 0) {
-				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
-					__func__, strerror(errno));
-				close(fd);
-				return -1;
-			}
-			fd_list[list_idx].memseg_list_fd = fd;
-		}
 	} else {
-		/* create a hugepage file path */
+		out_fd = &fd_list[list_idx].fds[seg_idx];
 		eal_get_hugefile_path(path, buflen, hi->hugedir,
 				list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);
+	}
+	fd = *out_fd;
+	if (fd >= 0)
+		return fd;
 
-		fd = fd_list[list_idx].fds[seg_idx];
-
-		if (fd < 0) {
-			/* A primary process is the only one creating these
-			 * files. If there is a leftover that was not cleaned
-			 * by clear_hugedir(), we must *now* make sure to drop
-			 * the file or we will remap old stuff while the rest
-			 * of the code is built on the assumption that a new
-			 * page is clean.
-			 */
-			if (rte_eal_process_type() == RTE_PROC_PRIMARY &&
-					unlink(path) == -1 &&
-					errno != ENOENT) {
+	/*
+	 * The kernel clears a hugepage only when it is mapped
+	 * from a particular file for the first time.
+	 * If the file already exists, the old content
+	 * of the hugepages will be mapped. If the memory manager
+	 * assumes all mapped pages to be clean,
+	 * the file must be removed and created anew.
+	 * Otherwise the primary caller must be notified
+	 * that mapped pages will be dirty (secondary callers
+	 * receive the segment state from the primary one).
+	 * When multiple hugepages are mapped from the same file,
+	 * whether they will be dirty depends on the part that is mapped.
+	 *
+	 * There is no TOCTOU between stat() and unlink()/open()
+	 * because the hugepage directory is locked.
+	 */
+	if (!internal_conf->single_file_segments) {
+		ret = stat(path, &st);
+		if (ret < 0 && errno != ENOENT) {
+			RTE_LOG(DEBUG, EAL, "%s(): stat() for '%s' failed: %s\n",
+				__func__, path, strerror(errno));
+			return -1;
+		}
+		if (rte_eal_process_type() == RTE_PROC_PRIMARY && ret == 0) {
+			if (internal_conf->hugepage_file.keep_existing &&
+					dirty != NULL) {
+				*dirty = true;
+			/* coverity[toctou] */
+			} else if (unlink(path) < 0) {
 				RTE_LOG(DEBUG, EAL, "%s(): could not remove '%s': %s\n",
 					__func__, path, strerror(errno));
 				return -1;
 			}
-
-			fd = open(path, O_CREAT | O_RDWR, 0600);
-			if (fd < 0) {
-				RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n",
-					__func__, strerror(errno));
-				return -1;
-			}
-			/* take out a read lock */
-			if (lock(fd, LOCK_SH) < 0) {
-				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
-					__func__, strerror(errno));
-				close(fd);
-				return -1;
-			}
-			fd_list[list_idx].fds[seg_idx] = fd;
 		}
 	}
+
+	/* coverity[toctou] */
+	fd = open(path, O_CREAT | O_RDWR, 0600);
+	if (fd < 0) {
+		RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n",
+			__func__, strerror(errno));
+		return -1;
+	}
+	/* take out a read lock */
+	if (lock(fd, LOCK_SH) < 0) {
+		RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
+			__func__, strerror(errno));
+		close(fd);
+		return -1;
+	}
+	*out_fd = fd;
 
 	return fd;
 }
@@ -385,8 +394,10 @@ resize_hugefile_in_memory(int fd, uint64_t fa_offset,
 
 static int
 resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
-		bool grow)
+		bool grow, bool *dirty)
 {
+	const struct internal_config *internal_conf =
+		eal_get_internal_configuration();
 	bool again = false;
 
 	do {
@@ -405,6 +416,8 @@ resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
 			uint64_t cur_size = get_file_size(fd);
 
 			/* fallocate isn't supported, fall back to ftruncate */
+			if (dirty != NULL)
+				*dirty = new_size <= cur_size;
 			if (new_size > cur_size &&
 					ftruncate(fd, new_size) < 0) {
 				RTE_LOG(DEBUG, EAL, "%s(): ftruncate() failed: %s\n",
@@ -447,8 +460,17 @@ resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
 						strerror(errno));
 					return -1;
 				}
-			} else
+			} else {
 				fallocate_supported = 1;
+				/*
+				 * It is unknown which portions of an existing
+				 * hugepage file were allocated previously,
+				 * so all pages within the file are considered
+				 * dirty, unless the file is a fresh one.
+				 */
+				if (dirty != NULL)
+					*dirty = internal_conf->hugepage_file.keep_existing;
+			}
 		}
 	} while (again);
 
@@ -475,7 +497,8 @@ close_hugefile(int fd, char *path, int list_idx)
 }
 
 static int
-resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow)
+resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow,
+		bool *dirty)
 {
 	/* in-memory mode is a special case, because we can be sure that
 	 * fallocate() is supported.
@@ -483,12 +506,15 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow)
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	if (internal_conf->in_memory)
+	if (internal_conf->in_memory) {
+		if (dirty != NULL)
+			*dirty = false;
 		return resize_hugefile_in_memory(fd, fa_offset,
 				page_sz, grow);
+	}
 
 	return resize_hugefile_in_filesystem(fd, fa_offset, page_sz,
-			grow);
+			grow, dirty);
 }
 
 static int
@@ -505,6 +531,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	char path[PATH_MAX];
 	int ret = 0;
 	int fd;
+	bool dirty;
 	size_t alloc_sz;
 	int flags;
 	void *new_addr;
@@ -534,6 +561,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		pagesz_flag = pagesz_flags(alloc_sz);
 
 		fd = -1;
+		dirty = false;
 		mmap_flags = in_memory_flags | pagesz_flag;
 
 		/* single-file segments codepath will never be active
@@ -544,7 +572,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		map_offset = 0;
 	} else {
 		/* takes out a read lock on segment or segment list */
-		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx,
+				&dirty);
 		if (fd < 0) {
 			RTE_LOG(ERR, EAL, "Couldn't get fd on hugepage file\n");
 			return -1;
@@ -552,7 +581,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 
 		if (internal_conf->single_file_segments) {
 			map_offset = seg_idx * alloc_sz;
-			ret = resize_hugefile(fd, map_offset, alloc_sz, true);
+			ret = resize_hugefile(fd, map_offset, alloc_sz, true,
+					&dirty);
 			if (ret < 0)
 				goto resized;
 
@@ -662,6 +692,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	ms->nrank = rte_memory_get_nrank();
 	ms->iova = iova;
 	ms->socket_id = socket_id;
+	ms->flags = dirty ? RTE_MEMSEG_FLAG_DIRTY : 0;
 
 	return 0;
 
@@ -689,7 +720,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		return -1;
 
 	if (internal_conf->single_file_segments) {
-		resize_hugefile(fd, map_offset, alloc_sz, false);
+		resize_hugefile(fd, map_offset, alloc_sz, false, NULL);
 		/* ignore failure, can't make it any worse */
 
 		/* if refcount is at zero, close the file */
@@ -739,13 +770,13 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	 * segment and thus drop the lock on original fd, but hugepage dir is
 	 * now locked so we can take out another one without races.
 	 */
-	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx, NULL);
	if (fd < 0)
 		return -1;
 
 	if (internal_conf->single_file_segments) {
 		map_offset = seg_idx * ms->len;
-		if (resize_hugefile(fd, map_offset, ms->len, false))
+		if (resize_hugefile(fd, map_offset, ms->len, false, NULL))
 			return -1;
 
 		if (--(fd_list[list_idx].count) == 0)
@@ -1743,6 +1774,12 @@ eal_memalloc_init(void)
 			RTE_LOG(ERR, EAL, "Using anonymous memory is not supported\n");
 			return -1;
 		}
+		/* safety net, should be impossible to configure */
+		if (internal_conf->hugepage_file.unlink_before_mapping &&
+				internal_conf->hugepage_file.keep_existing) {
+			RTE_LOG(ERR, EAL, "Unable both to keep existing hugepage files and to unlink them\n");
+			return -1;
+		}
 	}
 
 	/* initialize all of the fd lists */
-- 
2.25.1