From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ilya Maximets
To: dev@dpdk.org, David Marchand, Sergio Gonzalez Monroy, Thomas Monjalon
Cc: Heetae Ahn, Yuanhan Liu, Jianfeng Tan, Neil Horman, Yulong Pei, Ilya Maximets
Date: Tue, 06 Jun 2017 09:22:17 +0300
Message-id: <1496730138-32056-2-git-send-email-i.maximets@samsung.com>
X-Mailer: git-send-email 2.7.4
In-reply-to: <1496730138-32056-1-git-send-email-i.maximets@samsung.com>
References: <1491811459-1647-1-git-send-email-i.maximets@samsung.com> <1496730138-32056-1-git-send-email-i.maximets@samsung.com>
Subject: [dpdk-dev] [PATCH v3 1/2] mem: balanced allocation of hugepages
List-Id: DPDK patches and discussions

Currently EAL allocates hugepages one by one without paying attention
to which NUMA node each allocation comes from.

Such behaviour leads to allocation failure if the number of hugepages
available to the application is limited by cgroups or hugetlbfs and
memory is requested not only from the first socket.

Example:
	# 90 x 1GB hugepages available in a system

	cgcreate -g hugetlb:/test
	# Limit to 32GB of hugepages
	cgset -r hugetlb.1GB.limit_in_bytes=34359738368 test
	# Request 4GB from each of 2 sockets
	cgexec -g hugetlb:test testpmd --socket-mem=4096,4096 ...

	EAL: SIGBUS: Cannot mmap more hugepages of size 1024 MB
	EAL: 32 not 90 hugepages of size 1024 MB allocated
	EAL: Not enough memory available on socket 1!
	     Requested: 4096MB, available: 0MB
	PANIC in rte_eal_init():
	Cannot init memory

This happens because all allocated pages are on socket 0.

Fix this issue by setting mempolicy MPOL_PREFERRED for each hugepage
to one of the requested nodes using the following scheme:

	1) Allocate essential hugepages:
		1.1) Allocate only as many hugepages from NUMA node N
		     as needed to fit the memory requested for that node.
		1.2) Repeat 1.1 for all NUMA nodes.
	2) Try to map all remaining free hugepages in a round-robin
	   fashion.
	3) Sort pages and choose the most suitable.

In this case all essential memory will be allocated and all remaining
pages will be fairly distributed between all requested nodes.

libnuma is added as a general dependency for EAL.

Fixes: 77988fc08dc5 ("mem: fix allocating all free hugepages")

Signed-off-by: Ilya Maximets
---
 lib/librte_eal/Makefile                  |  2 +
 lib/librte_eal/linuxapp/eal/eal_memory.c | 87 ++++++++++++++++++++++++++++++--
 mk/rte.app.mk                            |  1 +
 3 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/lib/librte_eal/Makefile b/lib/librte_eal/Makefile
index 5690bb4..0a1af3a 100644
--- a/lib/librte_eal/Makefile
+++ b/lib/librte_eal/Makefile
@@ -37,4 +37,6 @@ DEPDIRS-linuxapp := common
 DIRS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += bsdapp
 DEPDIRS-bsdapp := common
 
+LDLIBS += -lnuma
+
 include $(RTE_SDK)/mk/rte.subdir.mk
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 9c9baf6..35e5bce 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -54,6 +54,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -358,6 +359,19 @@ static int huge_wrap_sigsetjmp(void)
 	return sigsetjmp(huge_jmpenv, 1);
 }
 
+#ifndef ULONG_SIZE
+#define ULONG_SIZE sizeof(unsigned long)
+#endif
+#ifndef ULONG_BITS
+#define ULONG_BITS (ULONG_SIZE * CHAR_BIT)
+#endif
+#ifndef DIV_ROUND_UP
+#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
+#endif
+#ifndef BITS_TO_LONGS
+#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, ULONG_BITS)
+#endif
+
 /*
  * Mmap all hugepages of hugepage table: it first open a file in
  * hugetlbfs, then mmap() hugepage_sz data in it. If orig is set, the
@@ -366,18 +380,71 @@ static int huge_wrap_sigsetjmp(void)
  * map continguous physical blocks in contiguous virtual blocks.
  */
 static unsigned
-map_all_hugepages(struct hugepage_file *hugepg_tbl,
-		struct hugepage_info *hpi, int orig)
+map_all_hugepages(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi,
+		  uint64_t *essential_memory, int orig)
 {
 	int fd;
 	unsigned i;
 	void *virtaddr;
 	void *vma_addr = NULL;
 	size_t vma_len = 0;
+	unsigned long nodemask[BITS_TO_LONGS(RTE_MAX_NUMA_NODES)] = {0UL};
+	unsigned long maxnode = 0;
+	int node_id = -1;
+
+	if (orig) {
+		for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+			if (internal_config.socket_mem[i])
+				maxnode = i + 1;
+	}
 
 	for (i = 0; i < hpi->num_pages[0]; i++) {
 		uint64_t hugepage_sz = hpi->hugepage_sz;
 
+		if (maxnode) {
+			unsigned int j;
+
+			for (j = 0; j < RTE_MAX_NUMA_NODES; j++)
+				if (essential_memory[j])
+					break;
+
+			if (j == RTE_MAX_NUMA_NODES) {
+				node_id = (node_id + 1) % RTE_MAX_NUMA_NODES;
+				while (!internal_config.socket_mem[node_id]) {
+					node_id++;
+					node_id %= RTE_MAX_NUMA_NODES;
+				}
+			} else {
+				node_id = j;
+				if (essential_memory[j] < hugepage_sz)
+					essential_memory[j] = 0;
+				else
+					essential_memory[j] -= hugepage_sz;
+			}
+
+			nodemask[node_id / ULONG_BITS] =
+						1UL << (node_id % ULONG_BITS);
+
+			RTE_LOG(DEBUG, EAL,
+				"Setting policy MPOL_PREFERRED for socket %d\n",
+				node_id);
+			/*
+			 * Due to old linux kernel bug (feature?) we have to
+			 * increase maxnode by 1. It will be unconditionally
+			 * decreased back to normal value inside the syscall
+			 * handler.
+			 */
+			if (set_mempolicy(MPOL_PREFERRED,
+					  nodemask, maxnode + 1) < 0) {
+				RTE_LOG(ERR, EAL,
+					"Failed to set policy MPOL_PREFERRED: "
+					"%s\n", strerror(errno));
+				return i;
+			}
+
+			nodemask[node_id / ULONG_BITS] = 0UL;
+		}
+
 		if (orig) {
 			hugepg_tbl[i].file_id = i;
 			hugepg_tbl[i].size = hugepage_sz;
@@ -488,6 +555,9 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl,
 		vma_len -= hugepage_sz;
 	}
 
+	if (maxnode && set_mempolicy(MPOL_DEFAULT, NULL, 0) < 0)
+		RTE_LOG(ERR, EAL, "Failed to set mempolicy MPOL_DEFAULT\n");
+
 	return i;
 }
 
@@ -572,6 +642,9 @@ find_numasocket(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi)
 			if (hugepg_tbl[i].orig_va == va) {
 				hugepg_tbl[i].socket_id = socket_id;
 				hp_count++;
+				RTE_LOG(DEBUG, EAL,
+					"Hugepage %s is on socket %d\n",
+					hugepg_tbl[i].filepath, socket_id);
 			}
 		}
 	}
@@ -1010,6 +1083,11 @@ rte_eal_hugepage_init(void)
 
 	huge_register_sigbus();
 
+	/* make a copy of socket_mem, needed for balanced allocation. */
+	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+		memory[i] = internal_config.socket_mem[i];
+
+
 	/* map all hugepages and sort them */
 	for (i = 0; i < (int)internal_config.num_hugepage_sizes; i ++){
 		unsigned pages_old, pages_new;
@@ -1027,7 +1105,8 @@ rte_eal_hugepage_init(void)
 
 		/* map all hugepages available */
 		pages_old = hpi->num_pages[0];
-		pages_new = map_all_hugepages(&tmp_hp[hp_offset], hpi, 1);
+		pages_new = map_all_hugepages(&tmp_hp[hp_offset], hpi,
+					      memory, 1);
 		if (pages_new < pages_old) {
 			RTE_LOG(DEBUG, EAL,
 				"%d not %d hugepages of size %u MB allocated\n",
@@ -1070,7 +1149,7 @@ rte_eal_hugepage_init(void)
 		      sizeof(struct hugepage_file), cmp_physaddr);
 
 		/* remap all hugepages */
-		if (map_all_hugepages(&tmp_hp[hp_offset], hpi, 0) !=
+		if (map_all_hugepages(&tmp_hp[hp_offset], hpi, NULL, 0) !=
 		    hpi->num_pages[0]) {
 			RTE_LOG(ERR, EAL, "Failed to remap %u MB pages\n",
 					(unsigned)(hpi->hugepage_sz / 0x100000));
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index bcaf1b3..b208e88 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -186,6 +186,7 @@ ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
 # The static libraries do not know their dependencies.
 # So linking with static library requires explicit dependencies.
 _LDLIBS-$(CONFIG_RTE_LIBRTE_EAL) += -lrt
+_LDLIBS-$(CONFIG_RTE_LIBRTE_EAL) += -lnuma
 _LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED) += -lm
 _LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED) += -lrt
 _LDLIBS-$(CONFIG_RTE_LIBRTE_METER) += -lm
-- 
2.7.4
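
As a reading aid for the allocation scheme in the commit message, here is
a minimal standalone sketch of the per-page node choice: essential memory
is served first, then the remaining pages go round-robin over the
requested nodes. It is not the patch code. MAX_NODES and pick_node() are
hypothetical names used only for illustration, and the sketch does not
call set_mempolicy() or map any memory.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 8	/* hypothetical stand-in for RTE_MAX_NUMA_NODES */

/*
 * Choose the preferred NUMA node for the next hugepage.
 *
 * essential[]: bytes still owed to each node (step 1); updated in place.
 * requested[]: original per-node request; a nonzero entry makes the node
 *              eligible for the round-robin phase (step 2).
 * prev:        node returned by the previous call, or -1 on the first call.
 * page_sz:     size of one hugepage in bytes.
 */
static int
pick_node(uint64_t essential[MAX_NODES], const uint64_t requested[MAX_NODES],
	  int prev, uint64_t page_sz)
{
	int j, any = 0;

	/* The round-robin loop below needs at least one requested node. */
	for (j = 0; j < MAX_NODES; j++)
		any |= (requested[j] != 0);
	assert(any);

	/* Step 1: the lowest-numbered node that still needs essential memory. */
	for (j = 0; j < MAX_NODES; j++)
		if (essential[j])
			break;

	if (j < MAX_NODES) {
		/* Charge one page to this node, clamping at zero. */
		if (essential[j] < page_sz)
			essential[j] = 0;
		else
			essential[j] -= page_sz;
		return j;
	}

	/* Step 2: next requested node after prev, wrapping around. */
	j = prev;
	do {
		j = (j + 1) % MAX_NODES;
	} while (!requested[j]);

	return j;
}
```

Assuming 4 GB requested on nodes 0 and 1 with 1 GB pages, starting from
prev = -1 and feeding each result back in as prev, ten successive calls
pick nodes 0, 0, 0, 0, 1, 1, 1, 1, 0, 1: the essential pages are placed
first and the extra pages then alternate between the two requested nodes.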