From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mailout2.w1.samsung.com (mailout2.w1.samsung.com [210.118.77.12])
 by dpdk.org (Postfix) with ESMTP id 976362C5
 for ; Mon, 10 Apr 2017 10:04:28 +0200 (CEST)
Received: from eucas1p1.samsung.com (unknown [182.198.249.206])
 by mailout2.w1.samsung.com
 (Oracle Communications Messaging Server 7.0.5.31.0 64bit (built May 5 2014))
 with ESMTP id <0OO600BA4P3EYK60@mailout2.w1.samsung.com> for dev@dpdk.org;
 Mon, 10 Apr 2017 09:04:26 +0100 (BST)
Received: from eusmges2.samsung.com (unknown [203.254.199.241])
 by eucas1p1.samsung.com (KnoxPortal) with ESMTP id
 20170410080426eucas1p1003a5d470c270861a4ea3d2df369e70f~z_p-k80hM0277902779eucas1p1y;
 Mon, 10 Apr 2017 08:04:26 +0000 (GMT)
Received: from eucas1p2.samsung.com ( [182.198.249.207])
 by eusmges2.samsung.com (EUCPMTA) with SMTP id ED.1F.04459.98C3BE85;
 Mon, 10 Apr 2017 09:04:25 +0100 (BST)
Received: from eusmgms1.samsung.com (unknown [182.198.249.179])
 by eucas1p2.samsung.com (KnoxPortal) with ESMTP id
 20170410080425eucas1p27fd424ae58151f13b1a7a3723aa4ad1e~z_p_8_DLT0505105051eucas1p2b;
 Mon, 10 Apr 2017 08:04:25 +0000 (GMT)
X-AuditID: cbfec7f1-f796e6d00000116b-e9-58eb3c896ffd
Received: from eusync1.samsung.com ( [203.254.199.211])
 by eusmgms1.samsung.com (EUCPMTA) with SMTP id 62.11.17452.30D3BE85;
 Mon, 10 Apr 2017 09:06:27 +0100 (BST)
Received: from imaximets.rnd.samsung.ru ([106.109.129.180])
 by eusync1.samsung.com
 (Oracle Communications Messaging Server 7.0.5.31.0 64bit (built May 5 2014))
 with ESMTPA id <0OO6004ZRP39JL10@eusync1.samsung.com>;
 Mon, 10 Apr 2017 09:04:25 +0100 (BST)
From: Ilya Maximets
To: dev@dpdk.org, David Marchand, Sergio Gonzalez Monroy, Thomas Monjalon
Cc: Heetae Ahn, Yuanhan Liu, Jianfeng Tan, Neil Horman, Yulong Pei,
 Ilya Maximets
Date: Mon, 10 Apr 2017 11:04:19 +0300
Message-id: <1491811459-1647-1-git-send-email-i.maximets@samsung.com>
X-Mailer: git-send-email 2.7.4
In-reply-to: <1487250070-13973-1-git-send-email-i.maximets@samsung.com>
X-Brightmail-Tracker:
 H4sIAAAAAAAAA+NgFrrHIsWRmVeSWpSXmKPExsWy7djP87qdNq8jDKY+trZY0dHOYvHu03Ym
 i2mfb7NbXGn/yW7RPfsLm8Wt5pNsFismHGG0+LJpOpvF9QkXWC2+PfjO7MDlcbH/DqPHrwVL
 WT0W73nJ5DHvZKBH35ZVjB5Xvq9mDGCL4rJJSc3JLEst0rdL4Mr4vv0wW8Fjw4qzl+8wNTD+
 Ue9i5OSQEDCR6J4zixHCFpO4cG89WxcjF4eQwFJGiR3rFzFDOJ8ZJX49/8oG0/Hw3h1WEFtI
 YBmjxOsPahBFzUwSG/5CFLEJ6EicWn2EESQhIrCcUWL9l3tgO5gFPjJKtPz262Lk4BAWsJS4
 +S8GJMwioCqxqfkHC4jNK+Aq0fLkL9RJchI3z3Uyg9icAu4SjyZvYQeZKSEwmV1i+c7rjCBz
 JARkJTYdYIaod5FoO/IXyhaWeHUcpB7ElpHo7DjIBNHbzCjRsOoSI4QzgVHiS/NyJogqe4lT
 N68yQRzKJzFp23RmiAW8Eh1tQhAlHhL7D31lhbAdJd4s3cMO8f0sYHgtf84+gVFmASPDKkaR
 1NLi3PTUYiO94sTc4tK8dL3k/NxNjMBYP/3v+McdjO9PWB1iFOBgVOLh/VHxKkKINbGsuDL3
 EKMEB7OSCO8N6dcRQrwpiZVVqUX58UWlOanFhxilOViUxHm5Tl2LEBJITyxJzU5NLUgtgsky
 cXBKNTC6dbnm3D3otFqWN3D/uxjZpF6faxv9px4o31p3TWLShVfTDnsv2b9g5c6Ptxc9iz64
 603DKv7OiogAjeVFCfHmS9+7WXLxRyXX++6xtRNk5J7hsjyObZfzvyvbEiex7365QEMmU7VM
 5doB67dMMYsX/ErSnXpxu9D6sMzfaRcMvIuWd9xIKjupxFKckWioxVxUnAgAQ71mNvECAAA=
X-Brightmail-Tracker:
 H4sIAAAAAAAAA+NgFvrHLMWRmVeSWpSXmKPExsVy+t/xy7rMtq8jDJomKFus6GhnsXj3aTuT
 xbTPt9ktrrT/ZLfonv2FzeJW80k2ixUTjjBafNk0nc3i+oQLrBbfHnxnduDyuNh/h9Hj14Kl
 rB6L97xk8ph3MtCjb8sqRo8r31czBrBFudlkpCampBYppOYl56dk5qXbKoWGuOlaKCnkJeam
 2ipF6PqGBCkplCXmlAJ5RgZowME5wD1YSd8uwS3j+/bDbAWPDSvOXr7D1MD4R72LkZNDQsBE
 4uG9O6wQtpjEhXvr2boYuTiEBJYwSkz88QrKaWWSaF//hR2kik1AR+LU6iOMIAkRgeWMEos6
 +sCqmAU+Mkoc3/GHpYuRg0NYwFLi5r8YkAYWAVWJTc0/WEBsXgFXiZYnfxkh1slJ3DzXyQxi
 cwq4SzyavAVsgZCAm0TXoocsExh5FzAyrGIUSS0tzk3PLTbUK07MLS7NS9dLzs/dxAgM+23H
 fm7ewXhpY/AhRgEORiUe3oDqVxFCrIllxZW5hxglOJiVRHhvSL+OEOJNSaysSi3Kjy8qzUkt
 PsRoCnTURGYp0eR8YEzmlcQbmhiaWxoaGVtYmBsZKYnzlny4Ei4kkJ5YkpqdmlqQWgTTx8TB
 KdXA6Olb05Mq/ee+gF2nocUX5pddf9SnmCy+V/q8dIGY5MInx7dYNi5dpb9z858dZxfsNHxZ
 vmbZtVNKFdKKYo1ROaY9R/cmLwxyzHBwNfDWOSM9pWE2r2SloW7qQ9Okp2kR9z4J6i3tdT0h
 uFK0//KeBRrPTBia8rWWTXb9I/D+5r0Pdziq82d+V2Ipzkg01GIuKk4EAIyTKcmRAgAA
X-MTR: 20000000000000000@CPGS
X-CMS-MailID: 20170410080425eucas1p27fd424ae58151f13b1a7a3723aa4ad1e
X-Msg-Generator: CA
X-Sender-IP: 182.198.249.179
X-Local-Sender: =?UTF-8?B?SWx5YSBNYXhpbWV0cxtTUlItVmlydHVhbGl6YXRpb24gTGFi?=
 =?UTF-8?B?G+yCvOyEseyghOyekBtMZWFkaW5nIEVuZ2luZWVy?=
X-Global-Sender: =?UTF-8?B?SWx5YSBNYXhpbWV0cxtTUlItVmlydHVhbGl6YXRpb24gTGFi?=
 =?UTF-8?B?G1NhbXN1bmcgRWxlY3Ryb25pY3MbTGVhZGluZyBFbmdpbmVlcg==?=
X-Sender-Code: =?UTF-8?B?QzEwG0NJU0hRG0MxMEdEMDFHRDAxMDE1NA==?=
CMS-TYPE: 201P
X-HopCount: 7
X-CMS-RootMailID: 20170410080425eucas1p27fd424ae58151f13b1a7a3723aa4ad1e
X-RootMTR: 20170410080425eucas1p27fd424ae58151f13b1a7a3723aa4ad1e
References: <1487250070-13973-1-git-send-email-i.maximets@samsung.com>
Subject: [dpdk-dev] [PATCH v2] mem: balanced allocation of hugepages
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 10 Apr 2017 08:04:28 -0000

Currently EAL allocates hugepages one by one, not paying attention
to which NUMA node each allocation comes from.

Such behaviour leads to allocation failure if the number of hugepages
available to the application is limited by cgroups or hugetlbfs and
memory is requested not only from the first socket.

Example:
	# 90 x 1GB hugepages available in a system

	cgcreate -g hugetlb:/test
	# Limit to 32GB of hugepages
	cgset -r hugetlb.1GB.limit_in_bytes=34359738368 test
	# Request 4GB from each of 2 sockets
	cgexec -g hugetlb:test testpmd --socket-mem=4096,4096 ...

	EAL: SIGBUS: Cannot mmap more hugepages of size 1024 MB
	EAL: 32 not 90 hugepages of size 1024 MB allocated
	EAL: Not enough memory available on socket 1!
	     Requested: 4096MB, available: 0MB
	PANIC in rte_eal_init():
	Cannot init memory

	This happens because all allocated pages are
	on socket 0.

Fix this issue by setting mempolicy MPOL_PREFERRED for each hugepage
to one of the requested nodes in a round-robin fashion.
In this case all allocated pages will be fairly distributed between
all requested nodes.

A new config option, RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES, is introduced
and disabled by default because of the external dependency on libnuma.

Fixes: 77988fc08dc5 ("mem: fix allocating all free hugepages")

Signed-off-by: Ilya Maximets
---

Version 2:
	* rebased (fuzz in Makefile)

 config/common_base                       |  1 +
 lib/librte_eal/Makefile                  |  4 ++
 lib/librte_eal/linuxapp/eal/eal_memory.c | 65 ++++++++++++++++++++++++++++++++
 mk/rte.app.mk                            |  3 ++
 4 files changed, 73 insertions(+)

diff --git a/config/common_base b/config/common_base
index 5f2ad94..09782ff 100644
--- a/config/common_base
+++ b/config/common_base
@@ -102,6 +102,7 @@ CONFIG_RTE_EAL_ALWAYS_PANIC_ON_ERROR=n
 CONFIG_RTE_EAL_IGB_UIO=n
 CONFIG_RTE_EAL_VFIO=n
 CONFIG_RTE_MALLOC_DEBUG=n
+CONFIG_RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES=n
 
 # Default driver path (or "" to disable)
 CONFIG_RTE_EAL_PMD_PATH=""
diff --git a/lib/librte_eal/Makefile b/lib/librte_eal/Makefile
index 5690bb4..e5f552a 100644
--- a/lib/librte_eal/Makefile
+++ b/lib/librte_eal/Makefile
@@ -37,4 +37,8 @@ DEPDIRS-linuxapp := common
 DIRS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += bsdapp
 DEPDIRS-bsdapp := common
 
+ifeq ($(CONFIG_RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES),y)
+LDLIBS += -lnuma
+endif
+
 include $(RTE_SDK)/mk/rte.subdir.mk
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 657c6f4..8cb7432 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -83,6 +83,9 @@
 #include
 #include
 #include
+#ifdef RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
+#include
+#endif
 
 #include
 #include
@@ -377,6 +380,21 @@ static int huge_wrap_sigsetjmp(void)
 	return sigsetjmp(huge_jmpenv, 1);
 }
 
+#ifdef RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
+#ifndef ULONG_SIZE
+#define ULONG_SIZE sizeof(unsigned long)
+#endif
+#ifndef ULONG_BITS
+#define ULONG_BITS (ULONG_SIZE * CHAR_BIT)
+#endif
+#ifndef DIV_ROUND_UP
+#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
+#endif
+#ifndef BITS_TO_LONGS
+#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, ULONG_BITS)
+#endif
+#endif
+
 /*
  * Mmap all hugepages of hugepage table: it first open a file in
  * hugetlbfs, then mmap() hugepage_sz data in it. If orig is set, the
@@ -393,10 +411,48 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl,
 	void *virtaddr;
 	void *vma_addr = NULL;
 	size_t vma_len = 0;
+#ifdef RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
+	unsigned long nodemask[BITS_TO_LONGS(RTE_MAX_NUMA_NODES)] = {0UL};
+	unsigned long maxnode = 0;
+	int node_id = -1;
+
+	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+		if (internal_config.socket_mem[i])
+			maxnode = i + 1;
+#endif
 
 	for (i = 0; i < hpi->num_pages[0]; i++) {
 		uint64_t hugepage_sz = hpi->hugepage_sz;
 
+#ifdef RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
+		if (maxnode) {
+			node_id = (node_id + 1) % RTE_MAX_NUMA_NODES;
+			while (!internal_config.socket_mem[node_id])
+				node_id = (node_id + 1) % RTE_MAX_NUMA_NODES;
+
+			nodemask[node_id / ULONG_BITS] =
+						1UL << (node_id % ULONG_BITS);
+
+			RTE_LOG(DEBUG, EAL,
+				"Setting policy MPOL_PREFERRED for socket %d\n",
+				node_id);
+			/*
+			 * Due to old linux kernel bug (feature?) we have to
+			 * increase maxnode by 1. It will be unconditionally
+			 * decreased back to normal value inside the syscall
+			 * handler.
+			 */
+			if (set_mempolicy(MPOL_PREFERRED,
+					  nodemask, maxnode + 1) < 0) {
+				RTE_LOG(ERR, EAL,
+					"Failed to set policy MPOL_PREFERRED: "
+					"%s\n", strerror(errno));
+				return i;
+			}
+
+			nodemask[node_id / ULONG_BITS] = 0UL;
+		}
+#endif
 		if (orig) {
 			hugepg_tbl[i].file_id = i;
 			hugepg_tbl[i].size = hugepage_sz;
@@ -507,6 +563,10 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl,
 		vma_len -= hugepage_sz;
 	}
 
+#ifdef RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
+	if (maxnode && set_mempolicy(MPOL_DEFAULT, NULL, 0) < 0)
+		RTE_LOG(ERR, EAL, "Failed to set mempolicy MPOL_DEFAULT\n");
+#endif
 	return i;
 }
 
@@ -591,6 +651,11 @@ find_numasocket(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi)
 				if (hugepg_tbl[i].orig_va == va) {
 					hugepg_tbl[i].socket_id = socket_id;
 					hp_count++;
+#ifdef RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
+					RTE_LOG(DEBUG, EAL,
+						"Hugepage %s is on socket %d\n",
+						hugepg_tbl[i].filepath, socket_id);
+#endif
 				}
 			}
 		}
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 4c659e9..ca8e5fe 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -173,6 +173,9 @@ ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
 # The static libraries do not know their dependencies.
 # So linking with static library requires explicit dependencies.
 _LDLIBS-$(CONFIG_RTE_LIBRTE_EAL) += -lrt
+ifeq ($(CONFIG_RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES),y)
+_LDLIBS-$(CONFIG_RTE_LIBRTE_EAL) += -lnuma
+endif
 _LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED) += -lm
 _LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED) += -lrt
 _LDLIBS-$(CONFIG_RTE_LIBRTE_METER) += -lm
-- 
2.7.4
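
Illustration only, not part of the patch: the round-robin node selection
described in the commit message can be sketched in isolation as below. This is
a minimal sketch under stated assumptions. The SKETCH_* macros and sketch_*
functions are hypothetical names that do not exist in DPDK, SKETCH_MAX_NODES
stands in for RTE_MAX_NUMA_NODES, and the socket_mem array stands in for
internal_config.socket_mem. The sketch deliberately does not call
set_mempolicy(), which is declared in <numaif.h> and needs -lnuma, so it
builds without libnuma.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for RTE_MAX_NUMA_NODES. */
#define SKETCH_MAX_NODES 8
/* Number of bits in one unsigned long word of the node mask. */
#define SKETCH_ULONG_BITS (sizeof(unsigned long) * CHAR_BIT)
/* Number of unsigned long words needed to hold 'nr' bits. */
#define SKETCH_BITS_TO_LONGS(nr) \
	(((nr) + SKETCH_ULONG_BITS - 1) / SKETCH_ULONG_BITS)

/*
 * Return the next node after 'prev' (pass -1 on the first call) that has a
 * non-zero memory request, wrapping around the node range. Returns -1 if
 * no node requested any memory.
 */
static int
sketch_next_node(const uint64_t socket_mem[SKETCH_MAX_NODES], int prev)
{
	int step;

	for (step = 1; step <= SKETCH_MAX_NODES; step++) {
		int node = (prev + step) % SKETCH_MAX_NODES;

		if (socket_mem[node] != 0)
			return node;
	}
	return -1;
}

/* Clear the whole mask, then set only the bit that belongs to 'node'. */
static void
sketch_single_node_mask(unsigned long *mask, size_t nlongs, int node)
{
	memset(mask, 0, nlongs * sizeof(*mask));
	mask[node / SKETCH_ULONG_BITS] = 1UL << (node % SKETCH_ULONG_BITS);
}
```

A caller would start with sketch_next_node(socket_mem, -1) and feed each
result back in as 'prev' before mapping the next page. With memory requested
on nodes 0 and 2 only, successive calls yield 0, 2, 0, and so on. The mask
built by sketch_single_node_mask() is the kind of single-bit mask the patch
hands to set_mempolicy(MPOL_PREFERRED, ...) for each page.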