From: Ilya Maximets <i.maximets@samsung.com>
To: dev@dpdk.org, David Marchand, Sergio Gonzalez Monroy, Thomas Monjalon
Cc: Heetae Ahn, Yuanhan Liu, Jianfeng Tan, Neil Horman, Yulong Pei, Ilya Maximets
Date: Tue, 06 Jun 2017 11:13:51 +0300
Message-id: <1496736832-835-2-git-send-email-i.maximets@samsung.com>
X-Mailer: git-send-email 2.7.4
In-reply-to: <1496736832-835-1-git-send-email-i.maximets@samsung.com>
References: <1496730138-32056-1-git-send-email-i.maximets@samsung.com>
 <1496736832-835-1-git-send-email-i.maximets@samsung.com>
Subject: [dpdk-dev] [PATCH v4 1/2] mem: balanced allocation of hugepages
List-Id: DPDK patches and discussions

Currently, EAL allocates hugepages one by one, paying no attention to the
NUMA node each page comes from. This behaviour leads to allocation failures
when the number of hugepages available to the application is limited by
cgroups or hugetlbfs and memory is requested from more than just the first
socket.

Example:
	# 90 x 1GB hugepages available in the system

	cgcreate -g hugetlb:/test
	# Limit to 32GB of hugepages
	cgset -r hugetlb.1GB.limit_in_bytes=34359738368 test
	# Request 4GB from each of 2 sockets
	cgexec -g hugetlb:test testpmd --socket-mem=4096,4096 ...

	EAL: SIGBUS: Cannot mmap more hugepages of size 1024 MB
	EAL: 32 not 90 hugepages of size 1024 MB allocated
	EAL: Not enough memory available on socket 1!
	     Requested: 4096MB, available: 0MB
	PANIC in rte_eal_init():
	Cannot init memory

This happens because all the allocated pages end up on socket 0.
Fix this issue by setting the MPOL_PREFERRED mempolicy for each hugepage
to one of the requested nodes, using the following scheme:

1) Allocate the essential hugepages:
   1.1) Allocate as many hugepages from NUMA node N as are needed to fit
        the memory requested for that node.
   1.2) Repeat 1.1 for all NUMA nodes.
2) Try to map all remaining free hugepages in a round-robin fashion.
3) Sort the pages and choose the most suitable ones.

This way all the essential memory is allocated, and the remaining pages
are fairly distributed between the requested nodes.

libnuma is added as a general dependency for EAL.

Fixes: 77988fc08dc5 ("mem: fix allocating all free hugepages")

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
---
 lib/librte_eal/Makefile                  |  2 +
 lib/librte_eal/linuxapp/eal/eal_memory.c | 94 ++++++++++++++++++++++++++++++--
 mk/rte.app.mk                            |  1 +
 3 files changed, 93 insertions(+), 4 deletions(-)

diff --git a/lib/librte_eal/Makefile b/lib/librte_eal/Makefile
index 5690bb4..0a1af3a 100644
--- a/lib/librte_eal/Makefile
+++ b/lib/librte_eal/Makefile
@@ -37,4 +37,6 @@ DEPDIRS-linuxapp := common
 DIRS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += bsdapp
 DEPDIRS-bsdapp := common
 
+LDLIBS += -lnuma
+
 include $(RTE_SDK)/mk/rte.subdir.mk
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 9c9baf6..5947434 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -54,6 +54,7 @@
 #include <sys/queue.h>
 #include <sys/file.h>
 #include <unistd.h>
+#include <numaif.h>
 #include <limits.h>
 #include <sys/ioctl.h>
 #include <sys/time.h>
@@ -358,6 +359,19 @@ static int huge_wrap_sigsetjmp(void)
 	return sigsetjmp(huge_jmpenv, 1);
 }
 
+#ifndef ULONG_SIZE
+#define ULONG_SIZE sizeof(unsigned long)
+#endif
+#ifndef ULONG_BITS
+#define ULONG_BITS (ULONG_SIZE * CHAR_BIT)
+#endif
+#ifndef DIV_ROUND_UP
+#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
+#endif
+#ifndef BITS_TO_LONGS
+#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, ULONG_SIZE)
+#endif
+
 /*
  * Mmap all hugepages of hugepage table: it first open a file in
  * hugetlbfs, then mmap() hugepage_sz data in it. If orig is set, the
@@ -366,18 +380,78 @@ static int huge_wrap_sigsetjmp(void)
  * map continguous physical blocks in contiguous virtual blocks.
  */
 static unsigned
-map_all_hugepages(struct hugepage_file *hugepg_tbl,
-		struct hugepage_info *hpi, int orig)
+map_all_hugepages(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi,
+		  uint64_t *essential_memory, int orig)
 {
 	int fd;
 	unsigned i;
 	void *virtaddr;
 	void *vma_addr = NULL;
 	size_t vma_len = 0;
+	unsigned long nodemask[BITS_TO_LONGS(RTE_MAX_NUMA_NODES)] = {0UL};
+	unsigned long maxnode = 0;
+	int node_id = -1;
+	bool numa_available = true;
+
+	/* Check if kernel supports NUMA. */
+	if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 && errno == ENOSYS) {
+		RTE_LOG(DEBUG, EAL, "NUMA is not supported.\n");
+		numa_available = false;
+	}
+
+	if (orig && numa_available) {
+		for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+			if (internal_config.socket_mem[i])
+				maxnode = i + 1;
+	}
 
 	for (i = 0; i < hpi->num_pages[0]; i++) {
 		uint64_t hugepage_sz = hpi->hugepage_sz;
 
+		if (maxnode) {
+			unsigned int j;
+
+			for (j = 0; j < RTE_MAX_NUMA_NODES; j++)
+				if (essential_memory[j])
+					break;
+
+			if (j == RTE_MAX_NUMA_NODES) {
+				node_id = (node_id + 1) % RTE_MAX_NUMA_NODES;
+				while (!internal_config.socket_mem[node_id]) {
+					node_id++;
+					node_id %= RTE_MAX_NUMA_NODES;
+				}
+			} else {
+				node_id = j;
+				if (essential_memory[j] < hugepage_sz)
+					essential_memory[j] = 0;
+				else
+					essential_memory[j] -= hugepage_sz;
+			}
+
+			nodemask[node_id / ULONG_BITS] =
+						1UL << (node_id % ULONG_BITS);
+
+			RTE_LOG(DEBUG, EAL,
+				"Setting policy MPOL_PREFERRED for socket %d\n",
+				node_id);
+			/*
+			 * Due to old linux kernel bug (feature?) we have to
+			 * increase maxnode by 1. It will be unconditionally
+			 * decreased back to normal value inside the syscall
+			 * handler.
+			 */
+			if (set_mempolicy(MPOL_PREFERRED,
+					  nodemask, maxnode + 1) < 0) {
+				RTE_LOG(ERR, EAL,
+					"Failed to set policy MPOL_PREFERRED: "
+					"%s\n", strerror(errno));
+				return i;
+			}
+
+			nodemask[node_id / ULONG_BITS] = 0UL;
+		}
+
 		if (orig) {
 			hugepg_tbl[i].file_id = i;
 			hugepg_tbl[i].size = hugepage_sz;
@@ -488,6 +562,9 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl,
 		vma_len -= hugepage_sz;
 	}
 
+	if (maxnode && set_mempolicy(MPOL_DEFAULT, NULL, 0) < 0)
+		RTE_LOG(ERR, EAL, "Failed to set mempolicy MPOL_DEFAULT\n");
+
 	return i;
 }
 
@@ -572,6 +649,9 @@ find_numasocket(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi)
 			if (hugepg_tbl[i].orig_va == va) {
 				hugepg_tbl[i].socket_id = socket_id;
 				hp_count++;
+				RTE_LOG(DEBUG, EAL,
+					"Hugepage %s is on socket %d\n",
+					hugepg_tbl[i].filepath, socket_id);
 			}
 		}
 	}
@@ -1010,6 +1090,11 @@ rte_eal_hugepage_init(void)
 
 	huge_register_sigbus();
 
+	/* make a copy of socket_mem, needed for balanced allocation. */
+	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+		memory[i] = internal_config.socket_mem[i];
+
+	/* map all hugepages and sort them */
 	for (i = 0; i < (int)internal_config.num_hugepage_sizes; i ++){
 		unsigned pages_old, pages_new;
 
@@ -1027,7 +1112,8 @@ rte_eal_hugepage_init(void)
 
 		/* map all hugepages available */
 		pages_old = hpi->num_pages[0];
-		pages_new = map_all_hugepages(&tmp_hp[hp_offset], hpi, 1);
+		pages_new = map_all_hugepages(&tmp_hp[hp_offset], hpi,
+					      memory, 1);
 		if (pages_new < pages_old) {
 			RTE_LOG(DEBUG, EAL,
 				"%d not %d hugepages of size %u MB allocated\n",
@@ -1070,7 +1156,7 @@ rte_eal_hugepage_init(void)
 		      sizeof(struct hugepage_file), cmp_physaddr);
 
 		/* remap all hugepages */
-		if (map_all_hugepages(&tmp_hp[hp_offset], hpi, 0) !=
+		if (map_all_hugepages(&tmp_hp[hp_offset], hpi, NULL, 0) !=
 		    hpi->num_pages[0]) {
 			RTE_LOG(ERR, EAL, "Failed to remap %u MB pages\n",
 				(unsigned)(hpi->hugepage_sz / 0x100000));
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index bcaf1b3..b208e88 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -186,6 +186,7 @@ ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
 # The static libraries do not know their dependencies.
 # So linking with static library requires explicit dependencies.
 _LDLIBS-$(CONFIG_RTE_LIBRTE_EAL)            += -lrt
+_LDLIBS-$(CONFIG_RTE_LIBRTE_EAL)            += -lnuma
 _LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED)          += -lm
 _LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED)          += -lrt
 _LDLIBS-$(CONFIG_RTE_LIBRTE_METER)          += -lm
-- 
2.7.4
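[Editor's note] The node-selection logic of steps 1 and 2 in the commit message can be modelled as a small standalone function. This is an illustrative sketch, not EAL code: `pick_node`, `MAX_NODES`, and the array names here are made up for the example, and the real logic lives inside map_all_hugepages() in the patch above.

```c
#include <stdint.h>

#define MAX_NODES 8	/* stand-in for RTE_MAX_NUMA_NODES */

/*
 * Choose the NUMA node for the next hugepage:
 * step 1 - while some node still has "essential" (explicitly requested)
 *          memory outstanding, serve the first such node and decrement
 *          its remaining need;
 * step 2 - otherwise hand out spare pages round-robin over the nodes
 *          that requested any memory at all.
 * prev is the node picked for the previous page (-1 before the first).
 */
static int
pick_node(uint64_t essential[], const uint64_t socket_mem[],
	  uint64_t page_sz, int prev)
{
	int j;

	/* Step 1: essential memory first. */
	for (j = 0; j < MAX_NODES; j++)
		if (essential[j])
			break;

	if (j < MAX_NODES) {
		if (essential[j] < page_sz)
			essential[j] = 0;
		else
			essential[j] -= page_sz;
		return j;
	}

	/* Step 2: round-robin over the requesting nodes. */
	j = (prev + 1) % MAX_NODES;
	while (!socket_mem[j])
		j = (j + 1) % MAX_NODES;
	return j;
}
```

With a --socket-mem=4096,4096 request and 1GB pages, this picks socket 0 four times, then socket 1 four times, then alternates for any spare pages, mirroring the fair distribution the commit message describes.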
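[Editor's note] The nodemask passed to set_mempolicy(2) in the patch is a bit array packed into unsigned long words, one bit per NUMA node. A minimal sketch of that packing follows; the helper names are hypothetical, and the word-count macro here divides by bits per long, whereas the patch's BITS_TO_LONGS divides by ULONG_SIZE (bytes per long), which over-sizes the array but is harmless.

```c
#include <limits.h>	/* CHAR_BIT */

#define NODE_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS(nr)	(((nr) + NODE_BITS - 1) / NODE_BITS)

/* Set bit `node` in a nodemask packed into unsigned long words. */
static void
nodemask_set(unsigned long *mask, unsigned int node)
{
	mask[node / NODE_BITS] |= 1UL << (node % NODE_BITS);
}

/* Return nonzero if bit `node` is set in the nodemask. */
static int
nodemask_isset(const unsigned long *mask, unsigned int node)
{
	return (mask[node / NODE_BITS] >> (node % NODE_BITS)) & 1UL;
}
```

The patch sets exactly one bit before each set_mempolicy(MPOL_PREFERRED, ...) call and clears it again afterwards, so its plain assignment (rather than |=) works there; the |= form above is the general-purpose variant.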