From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28])
 by dpdk.org (Postfix) with ESMTP id B12231B5A3
 for ; Mon, 26 Nov 2018 12:36:44 +0100 (CET)
Received: from smtp.corp.redhat.com
 (int-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.11])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.redhat.com (Postfix) with ESMTPS id 175D6309B6C2;
 Mon, 26 Nov 2018 11:36:44 +0000 (UTC)
Received: from ktraynor.remote.csb (ovpn-117-230.ams2.redhat.com
 [10.36.117.230])
 by smtp.corp.redhat.com (Postfix) with ESMTP id 66A83600CC;
 Mon, 26 Nov 2018 11:36:42 +0000 (UTC)
To: "Burakov, Anatoly"
Cc: dpdk stable
References: <20181122164957.13003-1-ktraynor@redhat.com>
 <20181122164957.13003-18-ktraynor@redhat.com>
From: Kevin Traynor
Organization: Red Hat
Message-ID:
Date: Mon, 26 Nov 2018 11:36:41 +0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.2.0
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.11
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
 (mx1.redhat.com [10.5.110.47]); Mon, 26 Nov 2018 11:36:44 +0000 (UTC)
Subject: Re: [dpdk-stable] patch 'mem: improve segment list preallocation'
 has been queued to stable release 18.08.1
X-BeenThere: stable@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches for DPDK stable branches
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 26 Nov 2018 11:36:45 -0000

On 11/26/2018 11:16 AM, Burakov, Anatoly wrote:
> Hi Kevin,
>
> FYI
>
> http://patches.dpdk.org/patch/48338/
>

Thanks, Anatoly. There are a couple more "batches of patches" still to do,
so I will make sure it's included (after it's applied on dpdk).

> Thanks,
> Anatoly
>
>
>> -----Original Message-----
>> From: Kevin Traynor [mailto:ktraynor@redhat.com]
>> Sent: Thursday, November 22, 2018 4:49 PM
>> To: Burakov, Anatoly
>> Cc: dpdk stable
>> Subject: patch 'mem: improve segment list preallocation' has been queued
>> to stable release 18.08.1
>>
>> Hi,
>>
>> FYI, your patch has been queued to stable release 18.08.1.
>>
>> Note it hasn't been pushed to http://dpdk.org/browse/dpdk-stable yet.
>> It will be pushed if I get no objections before 11/28/18. So please
>> shout if anyone has objections.
>>
>> Also note that after the patch there's a diff of the upstream commit vs
>> the patch applied to the branch. If the code is different (i.e. not only
>> metadata diffs), due for example to a change in context or macro names,
>> please double-check it.
>>
>> Thanks.
>>
>> Kevin Traynor
>>
>> ---
>> From 6d552c83eacba7be4b4f2efbafd58724e07e2330 Mon Sep 17 00:00:00 2001
>> From: Anatoly Burakov
>> Date: Fri, 5 Oct 2018 09:29:44 +0100
>> Subject: [PATCH] mem: improve segment list preallocation
>>
>> [ upstream commit 1dd342d0fdc4f72102f0b48c89b6a39f029004fe ]
>>
>> Current code to preallocate segment lists is trying to do everything in
>> one go, and thus ends up being convoluted, hard to understand, and, most
>> importantly, does not scale beyond initial assumptions about number of
>> NUMA nodes and number of page sizes, and therefore has issues on some
>> configurations.
>>
>> Instead of fixing these issues in the existing code, simply rewrite it
>> to be slightly less clever but much more logical, and provide ample
>> comments to explain exactly what is going on.
>>
>> We cannot use the same approach for 32-bit code because the limitations
>> of the target dictate current socket-centric approach rather than
>> type-centric approach we use on 64-bit target, so 32-bit code is left
>> unmodified. FreeBSD doesn't support NUMA so there's no complexity
>> involved there, and thus its code is much more readable and not worth
>> changing.
>>
>> Fixes: 1d406458db47 ("mem: make segment preallocation OS-specific")
>>
>> Signed-off-by: Anatoly Burakov
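[Editor's note, not part of the queued patch: the "type-centric" approach
mentioned above treats each (NUMA socket, page size) pair as one memory
type. A minimal standalone C sketch of that enumeration follows; the
socket and page-size counts are made-up example values standing in for
what DPDK actually reads from internal_config and rte_socket_count().]

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct memtype {
            uint64_t page_sz;
            int socket_id;
    };

    int main(void)
    {
            /* hypothetical box: 2 sockets, 2M and 1G hugepages */
            const uint64_t page_sizes[] = { 2ULL << 20, 1ULL << 30 };
            const unsigned int n_sockets = 2, n_pagesizes = 2;
            unsigned int n_memtypes = n_pagesizes * n_sockets;
            struct memtype *memtypes = calloc(n_memtypes, sizeof(*memtypes));
            unsigned int cur_type = 0, i, s;

            if (memtypes == NULL)
                    return 1;
            /* same shape as the patch: page sizes outer, sockets inner */
            for (i = 0; i < n_pagesizes; i++) {
                    for (s = 0; s < n_sockets; s++, cur_type++) {
                            memtypes[cur_type].page_sz = page_sizes[i];
                            memtypes[cur_type].socket_id = (int)s;
                            printf("type %u: socket_id:%d hugepage_sz:%llu\n",
                                    cur_type, (int)s,
                                    (unsigned long long)page_sizes[i]);
                    }
            }
            free(memtypes);
            return 0;
    }

[With 2 sockets and 2 page sizes this prints 4 types, which is exactly the
n_memtypes the patch then sizes all of its limits against.]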
>> ---
>>  lib/librte_eal/linuxapp/eal/eal_memory.c | 179 +++++++++++++++++------
>>  1 file changed, 137 insertions(+), 42 deletions(-)
>>
>> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c
>> b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> index 6131bfde2..cc2d3fb69 100644
>> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
>> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> @@ -2096,7 +2096,13 @@ memseg_primary_init(void)
>>  {
>>  	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
>> -	int i, socket_id, hpi_idx, msl_idx = 0;
>> +	struct memtype {
>> +		uint64_t page_sz;
>> +		int socket_id;
>> +	} *memtypes = NULL;
>> +	int i, hpi_idx, msl_idx;
>>  	struct rte_memseg_list *msl;
>> -	uint64_t max_mem, total_mem;
>> +	uint64_t max_mem, max_mem_per_type;
>> +	unsigned int max_seglists_per_type;
>> +	unsigned int n_memtypes, cur_type;
>>
>>  	/* no-huge does not need this at all */
>> @@ -2104,8 +2110,49 @@ memseg_primary_init(void)
>>  		return 0;
>>
>> -	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
>> -	total_mem = 0;
>> +	/*
>> +	 * figuring out amount of memory we're going to have is a long and very
>> +	 * involved process. the basic element we're operating with is a memory
>> +	 * type, defined as a combination of NUMA node ID and page size (so that
>> +	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
>> +	 *
>> +	 * deciding amount of memory going towards each memory type is a
>> +	 * balancing act between maximum segments per type, maximum memory per
>> +	 * type, and number of detected NUMA nodes. the goal is to make sure
>> +	 * each memory type gets at least one memseg list.
>> +	 *
>> +	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
>> +	 *
>> +	 * the total amount of memory per type is limited by either
>> +	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
>> +	 * of detected NUMA nodes. additionally, maximum number of segments per
>> +	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
>> +	 * smaller page sizes, it can take hundreds of thousands of segments to
>> +	 * reach the above specified per-type memory limits.
>> +	 *
>> +	 * additionally, each type may have multiple memseg lists associated
>> +	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
>> +	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
>> +	 *
>> +	 * the number of memseg lists per type is decided based on the above
>> +	 * limits, and also taking number of detected NUMA nodes, to make sure
>> +	 * that we don't run out of memseg lists before we populate all NUMA
>> +	 * nodes with memory.
>> +	 *
>> +	 * we do this in three stages. first, we collect the number of types.
>> +	 * then, we figure out memory constraints and populate the list of
>> +	 * would-be memseg lists. then, we go ahead and allocate the memseg
>> +	 * lists.
>> +	 */
>>
>> -	/* create memseg lists */
>> +	/* create space for mem types */
>> +	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
>> +	memtypes = calloc(n_memtypes, sizeof(*memtypes));
>> +	if (memtypes == NULL) {
>> +		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
>> +		return -1;
>> +	}
>> +
>> +	/* populate mem types */
>> +	cur_type = 0;
>>  	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
>>  			hpi_idx++) {
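[Editor's note, not part of the queued patch: the limits described in the
comment block in the hunk above boil down to a chain of min() operations.
Below is a small standalone sketch that replays that arithmetic for the
1G-page type on the hypothetical 2-socket box (4 memory types). The
#define values are the 18.08 defaults as I read them in
config/common_base -- double-check them against your build; MIN() stands
in for RTE_MIN().]

    #include <stdint.h>
    #include <stdio.h>

    /* assumed 18.08 defaults from config/common_base */
    #define MAX_MEMSEG_LISTS    64      /* CONFIG_RTE_MAX_MEMSEG_LISTS */
    #define MAX_MEMSEG_PER_LIST 8192    /* CONFIG_RTE_MAX_MEMSEG_PER_LIST */
    #define MAX_MEM_MB_PER_LIST 32768   /* CONFIG_RTE_MAX_MEM_MB_PER_LIST */
    #define MAX_MEMSEG_PER_TYPE 32768   /* CONFIG_RTE_MAX_MEMSEG_PER_TYPE */
    #define MAX_MEM_MB_PER_TYPE 131072  /* CONFIG_RTE_MAX_MEM_MB_PER_TYPE */
    #define MAX_MEM_MB          524288  /* CONFIG_RTE_MAX_MEM_MB */

    #define MIN(a, b) ((a) < (b) ? (a) : (b))  /* stand-in for RTE_MIN */

    int main(void)
    {
            unsigned int n_memtypes = 4;   /* 2 sockets x 2 page sizes */
            uint64_t pagesz = 1ULL << 30;  /* the 1G type */

            /* global and per-type memory budgets */
            uint64_t max_mem = (uint64_t)MAX_MEM_MB << 20;
            uint64_t max_mem_per_type =
                    MIN((uint64_t)MAX_MEM_MB_PER_TYPE << 20,
                            max_mem / n_memtypes);
            /* leave room so every type gets at least one list */
            uint64_t max_seglists_per_type = MAX_MEMSEG_LISTS / n_memtypes;

            /* segment budgets, overall for the type and per list */
            uint64_t max_segs_per_type = MIN(max_mem_per_type / pagesz,
                            (uint64_t)MAX_MEMSEG_PER_TYPE);
            uint64_t max_segs_per_list = MIN(max_segs_per_type,
                            (uint64_t)MAX_MEMSEG_PER_LIST);
            uint64_t max_mem_per_list = MIN(max_segs_per_list * pagesz,
                            (uint64_t)MAX_MEM_MB_PER_LIST << 20);

            /* segments per list, then list count, clamped to the maximum */
            uint64_t n_segs = MIN(max_segs_per_list,
                            max_mem_per_list / pagesz);
            uint64_t n_seglists = MIN(max_segs_per_type / n_segs,
                            max_mem_per_type / max_mem_per_list);
            n_seglists = MIN(n_seglists, max_seglists_per_type);

            printf("1G type: n_seglists:%llu n_segs:%llu\n",
                            (unsigned long long)n_seglists,
                            (unsigned long long)n_segs);
            return 0;
    }

[Run as-is this prints "1G type: n_seglists:4 n_segs:32", i.e. four lists
of 32 one-gigabyte segments for that type; the same arithmetic gives four
lists of 8192 segments for the 2M type.]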
>> @@ -2116,9 +2163,6 @@ memseg_primary_init(void)
>>  		hugepage_sz = hpi->hugepage_sz;
>>
>> -		for (i = 0; i < (int) rte_socket_count(); i++) {
>> -			uint64_t max_type_mem, total_type_mem = 0;
>> -			int type_msl_idx, max_segs, total_segs = 0;
>> -
>> -			socket_id = rte_socket_id_by_idx(i);
>> +		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
>> +			int socket_id = rte_socket_id_by_idx(i);
>>
>>  #ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
>> @@ -2126,47 +2170,98 @@ memseg_primary_init(void)
>>  				break;
>>  #endif
>> +			memtypes[cur_type].page_sz = hugepage_sz;
>> +			memtypes[cur_type].socket_id = socket_id;
>>
>> -			if (total_mem >= max_mem)
>> -				break;
>> +			RTE_LOG(DEBUG, EAL, "Detected memory type: "
>> +				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
>> +				socket_id, hugepage_sz);
>> +		}
>> +	}
>>
>> -			max_type_mem = RTE_MIN(max_mem - total_mem,
>> -				(uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20);
>> -			max_segs = RTE_MAX_MEMSEG_PER_TYPE;
>> +	/* set up limits for types */
>> +	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
>> +	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
>> +			max_mem / n_memtypes);
>> +	/*
>> +	 * limit maximum number of segment lists per type to ensure there's
>> +	 * space for memseg lists for all NUMA nodes with all page sizes
>> +	 */
>> +	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
>>
>> -			type_msl_idx = 0;
>> -			while (total_type_mem < max_type_mem &&
>> -					total_segs < max_segs) {
>> -				uint64_t cur_max_mem, cur_mem;
>> -				unsigned int n_segs;
>> +	if (max_seglists_per_type == 0) {
>> +		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
>> +			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
>> +		return -1;
>> +	}
>>
>> -				if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
>> -					RTE_LOG(ERR, EAL,
>> -						"No more space in memseg lists, please increase %s\n",
>> -						RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
>> -					return -1;
>> -				}
>> +	/* go through all mem types and create segment lists */
>> +	msl_idx = 0;
>> +	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
>> +		unsigned int cur_seglist, n_seglists, n_segs;
>> +		unsigned int max_segs_per_type, max_segs_per_list;
>> +		struct memtype *type = &memtypes[cur_type];
>> +		uint64_t max_mem_per_list, pagesz;
>> +		int socket_id;
>>
>> -				msl = &mcfg->memsegs[msl_idx++];
>> +		pagesz = type->page_sz;
>> +		socket_id = type->socket_id;
>>
>> -				cur_max_mem = max_type_mem - total_type_mem;
>> +		/*
>> +		 * we need to create segment lists for this type. we must take
>> +		 * into account the following things:
>> +		 *
>> +		 * 1. total amount of memory we can use for this memory type
>> +		 * 2. total amount of memory per memseg list allowed
>> +		 * 3. number of segments needed to fit the amount of memory
>> +		 * 4. number of segments allowed per type
>> +		 * 5. number of segments allowed per memseg list
>> +		 * 6. number of memseg lists we are allowed to take up
>> +		 */
>>
>> -				cur_mem = get_mem_amount(hugepage_sz,
>> -						cur_max_mem);
>> -				n_segs = cur_mem / hugepage_sz;
>> +		/* calculate how much segments we will need in total */
>> +		max_segs_per_type = max_mem_per_type / pagesz;
>> +		/* limit number of segments to maximum allowed per type */
>> +		max_segs_per_type = RTE_MIN(max_segs_per_type,
>> +				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
>> +		/* limit number of segments to maximum allowed per list */
>> +		max_segs_per_list = RTE_MIN(max_segs_per_type,
>> +				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
>>
>> -				if (alloc_memseg_list(msl, hugepage_sz, n_segs,
>> -						socket_id, type_msl_idx))
>> -					return -1;
>> +		/* calculate how much memory we can have per segment list */
>> +		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
>> +				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
>>
>> -				total_segs += msl->memseg_arr.len;
>> -				total_type_mem = total_segs * hugepage_sz;
>> -				type_msl_idx++;
>> +		/* calculate how many segments each segment list will have */
>> +		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
>>
>> -				if (alloc_va_space(msl)) {
>> -					RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
>> -					return -1;
>> -				}
>> +		/* calculate how many segment lists we can have */
>> +		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
>> +				max_mem_per_type / max_mem_per_list);
>> +
>> +		/* limit number of segment lists according to our maximum */
>> +		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
>> +
>> +		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
>> +				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
>> +			n_seglists, n_segs, socket_id, pagesz);
>> +
>> +		/* create all segment lists */
>> +		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
>> +			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
>> +				RTE_LOG(ERR, EAL,
>> +					"No more space in memseg lists, please increase %s\n",
>> +					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
>> +				return -1;
>> +			}
>> +			msl = &mcfg->memsegs[msl_idx++];
>> +
>> +			if (alloc_memseg_list(msl, pagesz, n_segs,
>> +					socket_id, cur_seglist))
>> +				return -1;
>> +
>> +			if (alloc_va_space(msl)) {
>> +				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
>> +				return -1;
>>  			}
>> -			total_mem += total_type_mem;
>>  		}
>>  	}
>> --
>> 2.19.0
>>
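[Editor's note: when verifying the backport, the new DEBUG logs above make
the result easy to eyeball. On the hypothetical 2-socket box with 2M and
1G hugepages and the assumed default config values, running with EAL
logging at debug level (e.g. --log-level=8) should produce output along
these lines -- exact values and ordering depend on your configuration:

  EAL: Detected memory type: socket_id:0 hugepage_sz:2097152
  EAL: Detected memory type: socket_id:1 hugepage_sz:2097152
  EAL: Detected memory type: socket_id:0 hugepage_sz:1073741824
  EAL: Detected memory type: socket_id:1 hugepage_sz:1073741824
  EAL: Creating 4 segment lists: n_segs:8192 socket_id:0 hugepage_sz:2097152
  EAL: Creating 4 segment lists: n_segs:32 socket_id:0 hugepage_sz:1073741824

One "Creating ..." line per memory type; the values follow the min() chain
sketched earlier.]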
>> ---
>> Diff of the applied patch vs upstream commit (please double-check if
>> non-empty):
>> ---
>> --- -	2018-11-22 16:47:32.741839268 +0000
>> +++ 0018-mem-improve-segment-list-preallocation.patch	2018-11-22 16:47:32.000000000 +0000
>> @@ -1,8 +1,10 @@
>> -From 1dd342d0fdc4f72102f0b48c89b6a39f029004fe Mon Sep 17 00:00:00 2001
>> +From 6d552c83eacba7be4b4f2efbafd58724e07e2330 Mon Sep 17 00:00:00 2001
>>  From: Anatoly Burakov
>>  Date: Fri, 5 Oct 2018 09:29:44 +0100
>>  Subject: [PATCH] mem: improve segment list preallocation
>>
>> +[ upstream commit 1dd342d0fdc4f72102f0b48c89b6a39f029004fe ]
>> +
>>  Current code to preallocate segment lists is trying to do everything in one
>>  go, and thus ends up being convoluted, hard to understand, and, most
>>  importantly, does not scale beyond
>> @@ -21,7 +23,6 @@ its code is much more readable and not worth changing.
>>
>>  Fixes: 1d406458db47 ("mem: make segment preallocation OS-specific")
>> -Cc: stable@dpdk.org
>>
>>  Signed-off-by: Anatoly Burakov
>>  ---
>> @@ -29,10 +30,10 @@
>>   1 file changed, 137 insertions(+), 42 deletions(-)
>>
>>  diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c
>>  b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> -index 04f264818..19e686eb6 100644
>> +index 6131bfde2..cc2d3fb69 100644
>>  --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
>>  +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> -@@ -2132,7 +2132,13 @@ memseg_primary_init(void)
>> +@@ -2096,7 +2096,13 @@ memseg_primary_init(void)
>>   {
>>   	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
>>  -	int i, socket_id, hpi_idx, msl_idx = 0;
>> @@ -48,7 +49,7 @@
>>  +	unsigned int n_memtypes, cur_type;
>>
>>   	/* no-huge does not need this at all */
>> -@@ -2140,8 +2146,49 @@ memseg_primary_init(void)
>> +@@ -2104,8 +2110,49 @@ memseg_primary_init(void)
>>   		return 0;
>>
>>  -	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
>> @@ -101,7 +102,7 @@
>>  +	cur_type = 0;
>>   	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
>>   			hpi_idx++) {
>> -@@ -2152,9 +2199,6 @@ memseg_primary_init(void)
>> +@@ -2116,9 +2163,6 @@ memseg_primary_init(void)
>>   		hugepage_sz = hpi->hugepage_sz;
>>
>>  -		for (i = 0; i < (int) rte_socket_count(); i++) {
>> @@ -113,7 +114,7 @@
>>  +			int socket_id = rte_socket_id_by_idx(i);
>>
>>   #ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
>> -@@ -2162,47 +2206,98 @@ memseg_primary_init(void)
>> +@@ -2126,47 +2170,98 @@ memseg_primary_init(void)
>>   				break;
>>   #endif
>>  +			memtypes[cur_type].page_sz = hugepage_sz;