From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dmitry Kozlyuk
To: dev@dpdk.org
Cc: Anatoly Burakov
Subject: [PATCH v3 5/6] eal/linux: allow hugepage file reuse
Date: Thu, 3 Feb 2022 20:13:35 +0200
Message-ID: <20220203181337.759161-6-dkozlyuk@nvidia.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20220203181337.759161-1-dkozlyuk@nvidia.com>
References: <20220119210917.765505-1-dkozlyuk@nvidia.com>
 <20220203181337.759161-1-dkozlyuk@nvidia.com>
MIME-Version: 1.0
Content-Type: text/plain
List-Id: DPDK patches and discussions

Linux EAL ensured that mapped hugepages are clean
by always mapping from newly created files:
existing hugepage backing files were always removed.
In this case, the kernel clears the page to prevent data leaks,
because the mapped memory may contain leftover data
from the previous process that was using this memory.
Clearing takes the bulk of the time spent in mmap(2),
increasing EAL initialization time.

Introduce a mode to keep existing files and reuse them
in order to speed up initial memory allocation in EAL.
Hugepages mapped from such files may contain data
left by the previous process that used this memory,
so RTE_MEMSEG_FLAG_DIRTY is set for their segments.
If multiple hugepages are mapped from the same file:
1. When fallocate(2) is used, all memory mapped from this file
   is considered dirty, because it is unknown
   which parts of the file are holes.
2. When ftruncate(3) is used, memory mapped from this file
   is considered dirty unless the file is extended
   to create a new mapping, which implies clean memory.

Signed-off-by: Dmitry Kozlyuk
---

Coverity complains that "path" may be uninitialized in get_seg_fd()
at line 327, but it is always initialized
with eal_get_hugefile_path() at lines 309-316.

 lib/eal/common/eal_common_options.c |   2 +
 lib/eal/common/eal_internal_cfg.h   |   2 +
 lib/eal/linux/eal.c                 |   3 +-
 lib/eal/linux/eal_hugepage_info.c   | 118 ++++++++++++++++----
 lib/eal/linux/eal_memalloc.c        | 166 +++++++++++++++++-----
 5 files changed, 206 insertions(+), 85 deletions(-)

diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index 7520ebda8e..cdd2284b0c 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -311,6 +311,8 @@ eal_reset_internal_config(struct internal_config *internal_cfg)
 	internal_cfg->force_nchannel = 0;
 	internal_cfg->hugefile_prefix = NULL;
 	internal_cfg->hugepage_dir = NULL;
+	internal_cfg->hugepage_file.unlink_before_mapping = false;
+	internal_cfg->hugepage_file.unlink_existing = true;
 	internal_cfg->force_sockets = 0;
 	/* zero out the NUMA config */
 	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index b5e6942578..d2be7bfa57 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -44,6 +44,8 @@ struct simd_bitwidth {
 struct hugepage_file_discipline {
 	/** Unlink files before mapping them to leave no trace in hugetlbfs. */
 	bool unlink_before_mapping;
+	/** Unlink existing files at startup, re-create them before mapping. */
+	bool unlink_existing;
 };
 
 /**
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 60b4924838..9c8395ab14 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -1360,7 +1360,8 @@ rte_eal_cleanup(void)
 	struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY &&
+			internal_conf->hugepage_file.unlink_existing)
 		rte_memseg_walk(mark_freeable, NULL);
 	rte_service_finalize();
 	rte_mp_channel_cleanup();
diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index 9fb0e968db..ec172ef4b8 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -84,7 +84,7 @@ static int get_hp_sysfs_value(const char *subdir, const char *file, unsigned lon
 /* this function is only called from eal_hugepage_info_init which itself
  * is only called from a primary process */
 static uint32_t
-get_num_hugepages(const char *subdir, size_t sz)
+get_num_hugepages(const char *subdir, size_t sz, unsigned int reusable_pages)
 {
 	unsigned long resv_pages, num_pages, over_pages, surplus_pages;
 	const char *nr_hp_file = "free_hugepages";
@@ -116,7 +116,7 @@ get_num_hugepages(const char *subdir, size_t sz)
 	else
 		over_pages = 0;
 
-	if (num_pages == 0 && over_pages == 0)
+	if (num_pages == 0 && over_pages == 0 && reusable_pages)
 		RTE_LOG(WARNING, EAL, "No available %zu kB hugepages reported\n",
 				sz >> 10);
 
@@ -124,6 +124,10 @@ get_num_hugepages(const char *subdir, size_t sz)
 	if (num_pages < over_pages) /* overflow */
 		num_pages = UINT32_MAX;
 
+	num_pages += reusable_pages;
+	if (num_pages < reusable_pages) /* overflow */
+		num_pages = UINT32_MAX;
+
 	/* we want to return a uint32_t and more than this looks suspicious
 	 * anyway ...
 	 */
 	if (num_pages > UINT32_MAX)
@@ -297,20 +301,28 @@ get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
 	return -1;
 }
 
+struct walk_hugedir_data {
+	int dir_fd;
+	int file_fd;
+	const char *file_name;
+	void *user_data;
+};
+
+typedef void (walk_hugedir_t)(const struct walk_hugedir_data *whd);
+
 /*
- * Clear the hugepage directory of whatever hugepage files
- * there are. Checks if the file is locked (i.e.
- * if it's in use by another DPDK process).
+ * Search the hugepage directory for whatever hugepage files there are.
+ * Check if the file is in use by another DPDK process.
+ * If not, execute a callback on it.
  */
 static int
-clear_hugedir(const char * hugedir)
+walk_hugedir(const char *hugedir, walk_hugedir_t *cb, void *user_data)
 {
 	DIR *dir;
 	struct dirent *dirent;
 	int dir_fd, fd, lck_result;
 	const char filter[] = "*map_*"; /* matches hugepage files */
 
-	/* open directory */
 	dir = opendir(hugedir);
 	if (!dir) {
 		RTE_LOG(ERR, EAL, "Unable to open hugepage directory %s\n",
@@ -326,7 +338,7 @@ clear_hugedir(const char * hugedir)
 		goto error;
 	}
 
-	while(dirent != NULL){
+	while (dirent != NULL) {
 		/* skip files that don't match the hugepage pattern */
 		if (fnmatch(filter, dirent->d_name, 0) > 0) {
 			dirent = readdir(dir);
@@ -345,9 +357,15 @@ clear_hugedir(const char * hugedir)
 		/* non-blocking lock */
 		lck_result = flock(fd, LOCK_EX | LOCK_NB);
 
-		/* if lock succeeds, remove the file */
+		/* if lock succeeds, execute callback */
 		if (lck_result != -1)
-			unlinkat(dir_fd, dirent->d_name, 0);
+			cb(&(struct walk_hugedir_data){
+				.dir_fd = dir_fd,
+				.file_fd = fd,
+				.file_name = dirent->d_name,
+				.user_data = user_data,
+			});
+
 		close (fd);
 		dirent = readdir(dir);
 	}
@@ -359,12 +377,48 @@ clear_hugedir(const char * hugedir)
 	if (dir)
 		closedir(dir);
 
-	RTE_LOG(ERR, EAL, "Error while clearing hugepage dir: %s\n",
+	RTE_LOG(ERR, EAL, "Error while walking hugepage dir: %s\n",
 		strerror(errno));
 
 	return -1;
 }
 
+static void
+clear_hugedir_cb(const struct walk_hugedir_data *whd)
+{
+	unlinkat(whd->dir_fd, whd->file_name, 0);
+}
+
+/* Remove hugepage files not used by other DPDK processes from a directory. */
+static int
+clear_hugedir(const char *hugedir)
+{
+	return walk_hugedir(hugedir, clear_hugedir_cb, NULL);
+}
+
+static void
+inspect_hugedir_cb(const struct walk_hugedir_data *whd)
+{
+	uint64_t *total_size = whd->user_data;
+	struct stat st;
+
+	if (fstat(whd->file_fd, &st) < 0)
+		RTE_LOG(DEBUG, EAL, "%s(): stat(\"%s\") failed: %s",
+				__func__, whd->file_name, strerror(errno));
+	else
+		(*total_size) += st.st_size;
+}
+
+/*
+ * Count the total size in bytes of all files in the directory
+ * not mapped by other DPDK process.
+ */
+static int
+inspect_hugedir(const char *hugedir, uint64_t *total_size)
+{
+	return walk_hugedir(hugedir, inspect_hugedir_cb, total_size);
+}
+
 static int
 compare_hpi(const void *a, const void *b)
 {
@@ -375,7 +429,8 @@ compare_hpi(const void *a, const void *b)
 }
 
 static void
-calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
+calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent,
+		unsigned int reusable_pages)
 {
 	uint64_t total_pages = 0;
 	unsigned int i;
@@ -388,8 +443,15 @@ calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
 	 * in one socket and sorting them later
 	 */
 	total_pages = 0;
-	/* we also don't want to do this for legacy init */
-	if (!internal_conf->legacy_mem)
+
+	/*
+	 * We also don't want to do this for legacy init.
+	 * When there are hugepage files to reuse it is unknown
+	 * what NUMA node the pages are on.
+	 * This could be determined by mapping,
+	 * but it is precisely what hugepage file reuse is trying to avoid.
+	 */
+	if (!internal_conf->legacy_mem && reusable_pages == 0)
 		for (i = 0; i < rte_socket_count(); i++) {
 			int socket = rte_socket_id_by_idx(i);
 			unsigned int num_pages =
@@ -405,7 +467,7 @@ calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
 	 */
 	if (total_pages == 0) {
 		hpi->num_pages[0] = get_num_hugepages(dirent->d_name,
-				hpi->hugepage_sz);
+				hpi->hugepage_sz, reusable_pages);
 
 #ifndef RTE_ARCH_64
 		/* for 32-bit systems, limit number of hugepages to
@@ -421,6 +483,8 @@ hugepage_info_init(void)
 {	const char dirent_start_text[] = "hugepages-";
 	const size_t dirent_start_len = sizeof(dirent_start_text) - 1;
 	unsigned int i, num_sizes = 0;
+	uint64_t reusable_bytes;
+	unsigned int reusable_pages;
 	DIR *dir;
 	struct dirent *dirent;
 	struct internal_config *internal_conf =
@@ -454,7 +518,7 @@ hugepage_info_init(void)
 			uint32_t num_pages;
 
 			num_pages = get_num_hugepages(dirent->d_name,
-					hpi->hugepage_sz);
+					hpi->hugepage_sz, 0);
 			if (num_pages > 0)
 				RTE_LOG(NOTICE, EAL,
 					"%" PRIu32 " hugepages of size "
@@ -473,7 +537,7 @@ hugepage_info_init(void)
 				"hugepages of size %" PRIu64 " bytes "
 				"will be allocated anonymously\n",
 				hpi->hugepage_sz);
-			calc_num_pages(hpi, dirent);
+			calc_num_pages(hpi, dirent, 0);
 			num_sizes++;
 		}
 #endif
@@ -489,11 +553,23 @@ hugepage_info_init(void)
 				"Failed to lock hugepage directory!\n");
 			break;
 		}
-		/* clear out the hugepages dir from unused pages */
-		if (clear_hugedir(hpi->hugedir) == -1)
-			break;
 
-		calc_num_pages(hpi, dirent);
+		/*
+		 * Check for existing hugepage files and either remove them
+		 * or count how many of them can be reused.
+		 */
+		reusable_pages = 0;
+		if (!internal_conf->hugepage_file.unlink_existing) {
+			reusable_bytes = 0;
+			if (inspect_hugedir(hpi->hugedir,
+					&reusable_bytes) < 0)
+				break;
+			RTE_ASSERT(reusable_bytes % hpi->hugepage_sz == 0);
+			reusable_pages = reusable_bytes / hpi->hugepage_sz;
+		} else if (clear_hugedir(hpi->hugedir) < 0) {
+			break;
+		}
 
+		calc_num_pages(hpi, dirent, reusable_pages);
 		num_sizes++;
 	}
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 5f5531830d..b68f5e165d 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -287,12 +287,19 @@ get_seg_memfd(struct hugepage_info *hi __rte_unused,
 
 static int
 get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
-		unsigned int list_idx, unsigned int seg_idx)
+		unsigned int list_idx, unsigned int seg_idx,
+		bool *dirty)
 {
 	int fd;
+	int *out_fd;
+	struct stat st;
+	int ret;
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
+	if (dirty != NULL)
+		*dirty = false;
+
 	/* for in-memory mode, we only make it here when we're sure we support
 	 * memfd, and this is a special case.
 	 */
@@ -300,66 +307,70 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 		return get_seg_memfd(hi, list_idx, seg_idx);
 
 	if (internal_conf->single_file_segments) {
-		/* create a hugepage file path */
+		out_fd = &fd_list[list_idx].memseg_list_fd;
 		eal_get_hugefile_path(path, buflen, hi->hugedir, list_idx);
-
-		fd = fd_list[list_idx].memseg_list_fd;
-
-		if (fd < 0) {
-			fd = open(path, O_CREAT | O_RDWR, 0600);
-			if (fd < 0) {
-				RTE_LOG(ERR, EAL, "%s(): open '%s' failed: %s\n",
-					__func__, path, strerror(errno));
-				return -1;
-			}
-			/* take out a read lock and keep it indefinitely */
-			if (lock(fd, LOCK_SH) < 0) {
-				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
-					__func__, strerror(errno));
-				close(fd);
-				return -1;
-			}
-			fd_list[list_idx].memseg_list_fd = fd;
-		}
 	} else {
-		/* create a hugepage file path */
+		out_fd = &fd_list[list_idx].fds[seg_idx];
 		eal_get_hugefile_path(path, buflen, hi->hugedir,
 				list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);
+	}
+	fd = *out_fd;
+	if (fd >= 0)
+		return fd;
 
-		fd = fd_list[list_idx].fds[seg_idx];
-
-		if (fd < 0) {
-			/* A primary process is the only one creating these
-			 * files. If there is a leftover that was not cleaned
-			 * by clear_hugedir(), we must *now* make sure to drop
-			 * the file or we will remap old stuff while the rest
-			 * of the code is built on the assumption that a new
-			 * page is clean.
-			 */
-			if (rte_eal_process_type() == RTE_PROC_PRIMARY &&
-					unlink(path) == -1 &&
-					errno != ENOENT) {
-				RTE_LOG(DEBUG, EAL, "%s(): could not remove '%s': %s\n",
-					__func__, path, strerror(errno));
-				return -1;
-			}
+	/*
+	 * There is no TOCTOU between stat() and unlink()/open()
+	 * because the hugepage directory is locked.
+	 */
+	ret = stat(path, &st);
+	if (ret < 0 && errno != ENOENT) {
+		RTE_LOG(DEBUG, EAL, "%s(): stat() for '%s' failed: %s\n",
+			__func__, path, strerror(errno));
+		return -1;
+	}
+	if (!internal_conf->hugepage_file.unlink_existing && ret == 0 &&
+			dirty != NULL)
+		*dirty = true;
 
-			fd = open(path, O_CREAT | O_RDWR, 0600);
-			if (fd < 0) {
-				RTE_LOG(ERR, EAL, "%s(): open '%s' failed: %s\n",
-					__func__, path, strerror(errno));
-				return -1;
-			}
-			/* take out a read lock */
-			if (lock(fd, LOCK_SH) < 0) {
-				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
-					__func__, strerror(errno));
-				close(fd);
-				return -1;
-			}
-			fd_list[list_idx].fds[seg_idx] = fd;
+	/*
+	 * The kernel clears a hugepage only when it is mapped
+	 * from a particular file for the first time.
+	 * If the file already exists, the old content will be mapped.
+	 * If the memory manager assumes all mapped pages to be clean,
+	 * the file must be removed and created anew.
+	 * Otherwise, the primary caller must be notified
+	 * that mapped pages will be dirty
+	 * (secondary callers receive the segment state from the primary one).
+	 * When multiple hugepages are mapped from the same file,
+	 * whether they will be dirty depends on the part that is mapped.
+	 */
+	if (!internal_conf->single_file_segments &&
+			internal_conf->hugepage_file.unlink_existing &&
+			rte_eal_process_type() == RTE_PROC_PRIMARY &&
+			ret == 0) {
+		/* coverity[toctou] */
+		if (unlink(path) < 0) {
+			RTE_LOG(DEBUG, EAL, "%s(): could not remove '%s': %s\n",
+				__func__, path, strerror(errno));
+			return -1;
 		}
 	}
+
+	/* coverity[toctou] */
+	fd = open(path, O_CREAT | O_RDWR, 0600);
+	if (fd < 0) {
+		RTE_LOG(DEBUG, EAL, "%s(): open '%s' failed: %s\n",
+			__func__, path, strerror(errno));
+		return -1;
+	}
+	/* take out a read lock */
+	if (lock(fd, LOCK_SH) < 0) {
+		RTE_LOG(ERR, EAL, "%s(): lock '%s' failed: %s\n",
+			__func__, path, strerror(errno));
+		close(fd);
+		return -1;
+	}
+	*out_fd = fd;
 	return fd;
 }
 
@@ -385,8 +396,10 @@ resize_hugefile_in_memory(int fd, uint64_t fa_offset,
 
 static int
 resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
-		bool grow)
+		bool grow, bool *dirty)
 {
+	const struct internal_config *internal_conf =
+		eal_get_internal_configuration();
 	bool again = false;
 
 	do {
@@ -405,6 +418,8 @@ resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
 			uint64_t cur_size = get_file_size(fd);
 
 			/* fallocate isn't supported, fall back to ftruncate */
+			if (dirty != NULL)
+				*dirty = new_size <= cur_size;
 			if (new_size > cur_size &&
 					ftruncate(fd, new_size) < 0) {
 				RTE_LOG(DEBUG, EAL, "%s(): ftruncate() failed: %s\n",
@@ -447,8 +462,17 @@ resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
 						strerror(errno));
 					return -1;
 				}
-			} else
+			} else {
 				fallocate_supported = 1;
+				/*
+				 * It is unknown which portions of an existing
+				 * hugepage file were allocated previously,
+				 * so all pages within the file are considered
+				 * dirty, unless the file is a fresh one.
+				 */
+				if (dirty != NULL)
+					*dirty &= !internal_conf->hugepage_file.unlink_existing;
+			}
 		}
 	} while (again);
 
@@ -475,7 +499,8 @@ close_hugefile(int fd, char *path, int list_idx)
 }
 
 static int
-resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow)
+resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow,
+		bool *dirty)
 {
 	/* in-memory mode is a special case, because we can be sure that
 	 * fallocate() is supported.
@@ -483,12 +508,15 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow)
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	if (internal_conf->in_memory)
+	if (internal_conf->in_memory) {
+		if (dirty != NULL)
+			*dirty = false;
 		return resize_hugefile_in_memory(fd, fa_offset,
 				page_sz, grow);
+	}
 
 	return resize_hugefile_in_filesystem(fd, fa_offset, page_sz,
-			grow);
+			grow, dirty);
 }
 
 static int
@@ -505,6 +533,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	char path[PATH_MAX];
 	int ret = 0;
 	int fd;
+	bool dirty;
 	size_t alloc_sz;
 	int flags;
 	void *new_addr;
@@ -534,6 +563,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		pagesz_flag = pagesz_flags(alloc_sz);
 
 		fd = -1;
+		dirty = false;
 		mmap_flags = in_memory_flags | pagesz_flag;
 
 		/* single-file segments codepath will never be active
@@ -544,7 +574,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		map_offset = 0;
 	} else {
 		/* takes out a read lock on segment or segment list */
-		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx,
+				&dirty);
 		if (fd < 0) {
 			RTE_LOG(ERR, EAL, "Couldn't get fd on hugepage file\n");
 			return -1;
@@ -552,7 +583,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 
 		if (internal_conf->single_file_segments) {
 			map_offset = seg_idx * alloc_sz;
-			ret = resize_hugefile(fd, map_offset, alloc_sz, true);
+			ret = resize_hugefile(fd, map_offset, alloc_sz, true,
+					&dirty);
 			if (ret < 0)
 				goto resized;
@@ -662,6 +694,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	ms->nrank = rte_memory_get_nrank();
 	ms->iova = iova;
 	ms->socket_id = socket_id;
+	ms->flags = dirty ? RTE_MEMSEG_FLAG_DIRTY : 0;
 
 	return 0;
 
@@ -689,7 +722,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		return -1;
 
 	if (internal_conf->single_file_segments) {
-		resize_hugefile(fd, map_offset, alloc_sz, false);
+		resize_hugefile(fd, map_offset, alloc_sz, false, NULL);
 		/* ignore failure, can't make it any worse */
 
 		/* if refcount is at zero, close the file */
@@ -739,13 +772,13 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	 * segment and thus drop the lock on original fd, but hugepage dir is
 	 * now locked so we can take out another one without races.
 	 */
-	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx, NULL);
	if (fd < 0)
 		return -1;
 
 	if (internal_conf->single_file_segments) {
 		map_offset = seg_idx * ms->len;
-		if (resize_hugefile(fd, map_offset, ms->len, false))
+		if (resize_hugefile(fd, map_offset, ms->len, false, NULL))
 			return -1;
 
 		if (--(fd_list[list_idx].count) == 0)
@@ -757,6 +790,7 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	 * holding onto this page.
 	 */
 	if (!internal_conf->in_memory &&
+			internal_conf->hugepage_file.unlink_existing &&
 			!internal_conf->hugepage_file.unlink_before_mapping) {
 		ret = lock(fd, LOCK_EX);
 		if (ret >= 0) {
@@ -1743,6 +1777,12 @@ eal_memalloc_init(void)
 			RTE_LOG(ERR, EAL, "Using anonymous memory is not supported\n");
 			return -1;
 		}
+		/* safety net, should be impossible to configure */
+		if (internal_conf->hugepage_file.unlink_before_mapping &&
+				!internal_conf->hugepage_file.unlink_existing) {
+			RTE_LOG(ERR, EAL, "Unlinking existing hugepage files is prohibited, cannot unlink them before mapping.\n");
+			return -1;
+		}
 	}
 
 	/* initialize all of the fd lists */
-- 
2.25.1