From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Tan, Jianfeng"
To: Ilya Maximets, "dev@dpdk.org", David Marchand, "Gonzalez Monroy, Sergio"
CC: Heetae Ahn, Yuanhan Liu, Neil Horman, "Pei, Yulong", "stable@dpdk.org"
Date: Thu, 16 Feb 2017 13:26:26 +0000
References: <1487250070-13973-1-git-send-email-i.maximets@samsung.com>
In-Reply-To: <1487250070-13973-1-git-send-email-i.maximets@samsung.com>
Subject: Re: [dpdk-dev] [PATCH] mem: balanced allocation of hugepages
List-Id: DPDK patches and discussions

Hi,

> -----Original Message-----
> From: Ilya Maximets [mailto:i.maximets@samsung.com]
> Sent: Thursday, February 16, 2017 9:01 PM
> To: dev@dpdk.org; David Marchand; Gonzalez Monroy, Sergio
> Cc: Heetae Ahn; Yuanhan Liu; Tan, Jianfeng; Neil Horman; Pei, Yulong; Ilya
> Maximets; stable@dpdk.org
> Subject: [PATCH] mem: balanced allocation of hugepages
>
> Currently EAL allocates hugepages one by one not paying
> attention from which NUMA node allocation was done.
>
> Such behaviour leads to allocation failure if number of
> available hugepages for application limited by cgroups
> or hugetlbfs and memory requested not only from the first
> socket.
>
> Example:
>     # 90 x 1GB hugepages available in a system
>
>     cgcreate -g hugetlb:/test
>     # Limit to 32GB of hugepages
>     cgset -r hugetlb.1GB.limit_in_bytes=34359738368 test
>     # Request 4GB from each of 2 sockets
>     cgexec -g hugetlb:test testpmd --socket-mem=4096,4096 ...
>
>     EAL: SIGBUS: Cannot mmap more hugepages of size 1024 MB
>     EAL: 32 not 90 hugepages of size 1024 MB allocated
>     EAL: Not enough memory available on socket 1!
>          Requested: 4096MB, available: 0MB
>     PANIC in rte_eal_init():
>     Cannot init memory
>
>     This happens because all allocated pages are
>     on socket 0.

For such a use case, why not just use "numactl --interleave=0,1 xxx"?

Do you see a use case like --socket-mem 2048,1024 where only three 1GB
hugepages are allowed?

Thanks,
Jianfeng

>
> Fix this issue by setting mempolicy MPOL_PREFERRED for each
> hugepage to one of requested nodes in a round-robin fashion.
> In this case all allocated pages will be fairly distributed
> between all requested nodes.
>
> New config option RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
> introduced and disabled by default because of external
> dependency from libnuma.
>
> Cc:
> Fixes: 77988fc08dc5 ("mem: fix allocating all free hugepages")
>
> Signed-off-by: Ilya Maximets
> ---
>  config/common_base                       |  1 +
>  lib/librte_eal/Makefile                  |  4 ++
>  lib/librte_eal/linuxapp/eal/eal_memory.c | 66 ++++++++++++++++++++++++++++++++
>  mk/rte.app.mk                            |  3 ++
>  4 files changed, 74 insertions(+)
>
> diff --git a/config/common_base b/config/common_base
> index 71a4fcb..fbcebbd 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -97,6 +97,7 @@ CONFIG_RTE_EAL_ALWAYS_PANIC_ON_ERROR=n
>  CONFIG_RTE_EAL_IGB_UIO=n
>  CONFIG_RTE_EAL_VFIO=n
>  CONFIG_RTE_MALLOC_DEBUG=n
> +CONFIG_RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES=n
>
>  # Default driver path (or "" to disable)
>  CONFIG_RTE_EAL_PMD_PATH=""
> diff --git a/lib/librte_eal/Makefile b/lib/librte_eal/Makefile
> index cf11a09..5ae3846 100644
> --- a/lib/librte_eal/Makefile
> +++ b/lib/librte_eal/Makefile
> @@ -35,4 +35,8 @@ DIRS-y += common
>  DIRS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += linuxapp
>  DIRS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += bsdapp
>
> +ifeq ($(CONFIG_RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES),y)
> +LDLIBS += -lnuma
> +endif
> +
>  include $(RTE_SDK)/mk/rte.subdir.mk
> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
> index a956bb2..8536a36 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
> @@ -82,6 +82,9 @@
>  #include
>  #include
>  #include
> +#ifdef RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
> +#include
> +#endif
>
>  #include
>  #include
> @@ -359,6 +362,21 @@ static int huge_wrap_sigsetjmp(void)
>  	return sigsetjmp(huge_jmpenv, 1);
>  }
>
> +#ifdef RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
> +#ifndef ULONG_SIZE
> +#define ULONG_SIZE sizeof(unsigned long)
> +#endif
> +#ifndef ULONG_BITS
> +#define ULONG_BITS (ULONG_SIZE * CHAR_BIT)
> +#endif
> +#ifndef DIV_ROUND_UP
> +#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
> +#endif
> +#ifndef BITS_TO_LONGS
> +#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, ULONG_SIZE)
> +#endif
> +#endif
> +
>  /*
>   * Mmap all hugepages of hugepage table: it first open a file in
>   * hugetlbfs, then mmap() hugepage_sz data in it. If orig is set, the
> @@ -375,10 +393,48 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl,
>  	void *virtaddr;
>  	void *vma_addr = NULL;
>  	size_t vma_len = 0;
> +#ifdef RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
> +	unsigned long nodemask[BITS_TO_LONGS(RTE_MAX_NUMA_NODES)] = {0UL};
> +	unsigned long maxnode = 0;
> +	int node_id = -1;
> +
> +	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
> +		if (internal_config.socket_mem[i])
> +			maxnode = i + 1;
> +#endif
>
>  	for (i = 0; i < hpi->num_pages[0]; i++) {
>  		uint64_t hugepage_sz = hpi->hugepage_sz;
>
> +#ifdef RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
> +		if (maxnode) {
> +			node_id = (node_id + 1) % RTE_MAX_NUMA_NODES;
> +			while (!internal_config.socket_mem[node_id])
> +				node_id = (node_id + 1) % RTE_MAX_NUMA_NODES;
> +
> +			nodemask[node_id / ULONG_BITS] =
> +						1UL << (node_id % ULONG_BITS);
> +
> +			RTE_LOG(DEBUG, EAL,
> +				"Setting policy MPOL_PREFERRED for socket %d\n",
> +				node_id);
> +			/*
> +			 * Due to old linux kernel bug (feature?) we have to
> +			 * increase maxnode by 1. It will be unconditionally
> +			 * decreased back to normal value inside the syscall
> +			 * handler.
> +			 */
> +			if (set_mempolicy(MPOL_PREFERRED,
> +					  nodemask, maxnode + 1) < 0) {
> +				RTE_LOG(ERR, EAL,
> +					"Failed to set policy MPOL_PREFERRED: "
> +					"%s\n", strerror(errno));
> +				return i;
> +			}
> +
> +			nodemask[node_id / ULONG_BITS] = 0UL;
> +		}
> +#endif
>  		if (orig) {
>  			hugepg_tbl[i].file_id = i;
>  			hugepg_tbl[i].size = hugepage_sz;
> @@ -489,6 +545,10 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl,
>  		vma_len -= hugepage_sz;
>  	}
>
> +#ifdef RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
> +	if (maxnode && set_mempolicy(MPOL_DEFAULT, NULL, 0) < 0)
> +		RTE_LOG(ERR, EAL, "Failed to set mempolicy MPOL_DEFAULT\n");
> +#endif
>  	return i;
>  }
>
> @@ -573,6 +634,11 @@ find_numasocket(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi)
>  			if (hugepg_tbl[i].orig_va == va) {
>  				hugepg_tbl[i].socket_id = socket_id;
>  				hp_count++;
> +#ifdef RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
> +				RTE_LOG(DEBUG, EAL,
> +					"Hugepage %s is on socket %d\n",
> +					hugepg_tbl[i].filepath, socket_id);
> +#endif
>  			}
>  		}
>  	}
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> index 92f3635..c2153b9 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -159,6 +159,9 @@ ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
>  # The static libraries do not know their dependencies.
>  # So linking with static library requires explicit dependencies.
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_EAL) += -lrt
> +ifeq ($(CONFIG_RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES),y)
> +_LDLIBS-$(CONFIG_RTE_LIBRTE_EAL) += -lnuma
> +endif
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED) += -lm
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED) += -lrt
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_METER) += -lm
> --
> 2.7.4
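
For reference, the round-robin node selection in the quoted map_all_hugepages() hunk can be modelled without libnuma. The sketch below is only an illustration: MAX_NODES, next_requested_node() and distribute_pages() are hypothetical names that do not exist in DPDK, and it assumes every page really lands on its preferred node, whereas MPOL_PREFERRED lets the kernel fall back to another node when the preferred one has no free pages.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for RTE_MAX_NUMA_NODES. */
#define MAX_NODES 8

/*
 * Return the next node after prev (wrapping around) that has a non-zero
 * memory request, or -1 if no node was requested at all.
 * Pass prev = -1 for the first call.
 */
static int
next_requested_node(const uint64_t socket_mem[MAX_NODES], int prev)
{
	int step;

	assert(prev >= -1 && prev < MAX_NODES);
	for (step = 1; step <= MAX_NODES; step++) {
		int node = (prev + step) % MAX_NODES;

		if (socket_mem[node] != 0)
			return node;
	}
	return -1;
}

/*
 * Count how many of nb_pages would end up on each node if every page
 * were placed on the node chosen for it in round-robin order.
 */
static void
distribute_pages(const uint64_t socket_mem[MAX_NODES], unsigned int nb_pages,
		 unsigned int pages_per_node[MAX_NODES])
{
	int node = -1;
	unsigned int i;

	for (i = 0; i < MAX_NODES; i++)
		pages_per_node[i] = 0;

	for (i = 0; i < nb_pages; i++) {
		node = next_requested_node(socket_mem, node);
		if (node < 0)
			return; /* nothing requested, nothing to balance */
		pages_per_node[node]++;
	}
}
```

Under that assumption, 32 pages with --socket-mem=4096,4096 split 16/16 between sockets 0 and 1, and three pages with --socket-mem 2048,1024 split 2/1.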