To: Ilya Maximets, Thomas Monjalon
References: <2a9b03bd-a4c0-8d20-0bbd-77730140eef0@samsung.com> <1945759.SoJb5dzy87@xps13>
Cc: dev@dpdk.org, David Marchand, Heetae Ahn, Yuanhan Liu, Jianfeng Tan, Neil Horman, Yulong Pei, stable@dpdk.org, Bruce Richardson
From: Sergio Gonzalez Monroy
Date: Mon, 10 Apr 2017 08:51:52 +0100
Subject: Re: [dpdk-dev] [PATCH] mem: balanced allocation of hugepages

On 10/04/2017 08:11, Ilya Maximets wrote:
> On 07.04.2017 18:44, Thomas Monjalon wrote:
>> 2017-04-07 18:14, Ilya Maximets:
>>> Hi All.
>>>
>>> I wanted to ask (just to clarify current status):
>>> Will this patch be included in current release (acked by maintainer)
>>> and then I will upgrade it to hybrid logic or I will just prepare v3
>>> with hybrid logic for 17.08 ?
>> What is your preferred option Ilya?
> I have no strong opinion on this.
> One thought is that it could be
> nice if someone else could test this functionality with current
> release before enabling it by default in 17.08.
>
> Tomorrow I'm going on vacation. So I'll post rebased version today
> (there are few fuzzes with current master) and you with Sergio may
> decide what to do.
>
> Best regards, Ilya Maximets.
>
>> Sergio?

I would be inclined towards v3 targeting v17.08. IMHO it would be
cleaner this way.

Sergio

>>
>>> On 27.03.2017 17:43, Ilya Maximets wrote:
>>>> On 27.03.2017 16:01, Sergio Gonzalez Monroy wrote:
>>>>> On 09/03/2017 12:57, Ilya Maximets wrote:
>>>>>> On 08.03.2017 16:46, Sergio Gonzalez Monroy wrote:
>>>>>>> Hi Ilya,
>>>>>>>
>>>>>>> I have done similar tests and as you already pointed out, 'numactl --interleave' does not seem to work as expected.
>>>>>>> I have also checked that the issue can be reproduced with quota limit on hugetlbfs mount point.
>>>>>>>
>>>>>>> I would be inclined towards *adding libnuma as dependency* to DPDK to make memory allocation a bit more reliable.
>>>>>>>
>>>>>>> Currently at a high level regarding hugepages per numa node:
>>>>>>> 1) Try to map all free hugepages. The total number of mapped hugepages depends if there were any limits, such as cgroups or quota in mount point.
>>>>>>> 2) Find out numa node of each hugepage.
>>>>>>> 3) Check if we have enough hugepages for requested memory in each numa socket/node.
>>>>>>>
>>>>>>> Using libnuma we could try to allocate hugepages per numa:
>>>>>>> 1) Try to map as many hugepages from numa 0.
>>>>>>> 2) Check if we have enough hugepages for requested memory in numa 0.
>>>>>>> 3) Try to map as many hugepages from numa 1.
>>>>>>> 4) Check if we have enough hugepages for requested memory in numa 1.
>>>>>>>
>>>>>>> This approach would improve failing scenarios caused by limits but it would still not fix issues regarding non-contiguous hugepages (worst case each hugepage is a memseg).
>>>>>>> The non-contiguous hugepages issues are not as critical now that mempools can span over multiple memsegs/hugepages, but it is still a problem for any other library requiring big chunks of memory.
>>>>>>>
>>>>>>> Potentially if we were to add an option such as 'iommu-only' when all devices are bound to vfio-pci, we could have a reliable way to allocate hugepages by just requesting the number of pages from each numa.
>>>>>>>
>>>>>>> Thoughts?
>>>>>> Hi Sergio,
>>>>>>
>>>>>> Thanks for your attention to this.
>>>>>>
>>>>>> For now, as we have some issues with non-contiguous
>>>>>> hugepages, I'm thinking about following hybrid schema:
>>>>>> 1) Allocate essential hugepages:
>>>>>>    1.1) Allocate as many hugepages from numa N to
>>>>>>         only fit requested memory for this numa.
>>>>>>    1.2) repeat 1.1 for all numa nodes.
>>>>>> 2) Try to map all remaining free hugepages in a round-robin
>>>>>>    fashion like in this patch.
>>>>>> 3) Sort pages and choose the most suitable.
>>>>>>
>>>>>> This solution should decrease number of issues connected with
>>>>>> non-contiguous memory.
>>>>> Sorry for late reply, I was hoping for more comments from the community.
>>>>>
>>>>> IMHO this should be default behavior, which means no config option and libnuma as EAL dependency.
>>>>> I think your proposal is good, could you consider implementing such approach on next release?
>>>> Sure, I can implement this for 17.08 release.
>>>>
>>>>>>> On 06/03/2017 09:34, Ilya Maximets wrote:
>>>>>>>> Hi all.
>>>>>>>>
>>>>>>>> So, what about this change?
>>>>>>>>
>>>>>>>> Best regards, Ilya Maximets.
>>>>>>>>
>>>>>>>> On 16.02.2017 16:01, Ilya Maximets wrote:
>>>>>>>>> Currently EAL allocates hugepages one by one not paying
>>>>>>>>> attention from which NUMA node allocation was done.
>>>>>>>>>
>>>>>>>>> Such behaviour leads to allocation failure if number of
>>>>>>>>> available hugepages for application limited by cgroups
>>>>>>>>> or hugetlbfs and memory requested not only from the first
>>>>>>>>> socket.
>>>>>>>>>
>>>>>>>>> Example:
>>>>>>>>>     # 90 x 1GB hugepages available in a system
>>>>>>>>>
>>>>>>>>>     cgcreate -g hugetlb:/test
>>>>>>>>>     # Limit to 32GB of hugepages
>>>>>>>>>     cgset -r hugetlb.1GB.limit_in_bytes=34359738368 test
>>>>>>>>>     # Request 4GB from each of 2 sockets
>>>>>>>>>     cgexec -g hugetlb:test testpmd --socket-mem=4096,4096 ...
>>>>>>>>>
>>>>>>>>>     EAL: SIGBUS: Cannot mmap more hugepages of size 1024 MB
>>>>>>>>>     EAL: 32 not 90 hugepages of size 1024 MB allocated
>>>>>>>>>     EAL: Not enough memory available on socket 1!
>>>>>>>>>          Requested: 4096MB, available: 0MB
>>>>>>>>>     PANIC in rte_eal_init():
>>>>>>>>>     Cannot init memory
>>>>>>>>>
>>>>>>>>> This happens because all allocated pages are
>>>>>>>>> on socket 0.
>>>>>>>>>
>>>>>>>>> Fix this issue by setting mempolicy MPOL_PREFERRED for each
>>>>>>>>> hugepage to one of requested nodes in a round-robin fashion.
>>>>>>>>> In this case all allocated pages will be fairly distributed
>>>>>>>>> between all requested nodes.
>>>>>>>>>
>>>>>>>>> New config option RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
>>>>>>>>> introduced and disabled by default because of external
>>>>>>>>> dependency from libnuma.
>>>>>>>>>
>>>>>>>>> Cc:
>>>>>>>>> Fixes: 77988fc08dc5 ("mem: fix allocating all free hugepages")
>>>>>>>>>
>>>>>>>>> Signed-off-by: Ilya Maximets
>>>>>>>>> ---
>>>>>>>>>  config/common_base                       |  1 +
>>>>>>>>>  lib/librte_eal/Makefile                  |  4 ++
>>>>>>>>>  lib/librte_eal/linuxapp/eal/eal_memory.c | 66 ++++++++++++++++++++++++++++++++
>>>>>>>>>  mk/rte.app.mk                            |  3 ++
>>>>>>>>>  4 files changed, 74 insertions(+)
>>>>> Acked-by: Sergio Gonzalez Monroy
>>>> Thanks.
>>>>
>>>> Best regards, Ilya Maximets.
>>>>
>>
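
For readers following the thread, the difference between the current behaviour and the hybrid scheme proposed above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the EAL code and not the patch: the function names are hypothetical, the 90 hugepages are assumed to be split 45/45 across two nodes, the cgroup limit is modelled as a simple page budget, and "sequential" allocation is modelled as draining node 0 first, which matches the reported outcome (all pages on socket 0). Step 3 of the proposal (sorting pages and choosing the most suitable) is not modelled.

```python
def allocate_sequential(free_per_node, limit):
    """Model of the reported behaviour: pages are taken node by node
    until the global limit (e.g. a hugetlb cgroup limit) is reached."""
    mapped = []
    budget = limit
    for free in free_per_node:
        take = min(free, budget)
        mapped.append(take)
        budget -= take
    return mapped


def allocate_hybrid(free_per_node, requested_per_node, limit):
    """Model of the proposed hybrid scheme (steps 1 and 2 only).

    Step 1: map just enough pages on each node to cover its request.
    Step 2: map the remaining pages round-robin over requested nodes.
    """
    mapped = [0] * len(free_per_node)
    budget = limit

    # Step 1: essential pages, one node at a time.
    for node, want in enumerate(requested_per_node):
        take = min(want, free_per_node[node], budget)
        mapped[node] += take
        budget -= take

    # Step 2: round-robin over the nodes that were actually requested.
    nodes = [n for n, want in enumerate(requested_per_node) if want > 0]
    progress = True
    while budget > 0 and progress:
        progress = False
        for node in nodes:
            if budget == 0:
                break
            if mapped[node] < free_per_node[node]:
                mapped[node] += 1
                budget -= 1
                progress = True
    return mapped


if __name__ == "__main__":
    # Numbers from the example in the patch description:
    # 90 x 1GB pages (assumed 45 per node), 32-page limit, 4 pages per socket.
    print(allocate_sequential([45, 45], 32))       # [32, 0]
    print(allocate_hybrid([45, 45], [4, 4], 32))   # [16, 16]
```

In this model the sequential strategy leaves socket 1 with no pages, as in the log quoted in the patch description, while the hybrid strategy satisfies both 4-page requests before spreading the rest of the budget evenly.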