From: Sergio Gonzalez Monroy
To: Ilya Maximets, dev@dpdk.org, David Marchand
Cc: Heetae Ahn, Yuanhan Liu, Jianfeng Tan, Neil Horman, Yulong Pei, stable@dpdk.org, Thomas Monjalon, Bruce Richardson
Subject: Re: [dpdk-dev] [PATCH] mem: balanced allocation of hugepages
Date: Mon, 27 Mar 2017 14:01:59 +0100
Message-ID: <077682cf-8534-7890-9453-7c9e822bd3e6@intel.com>
References: <1487250070-13973-1-git-send-email-i.maximets@samsung.com> <50517d4c-5174-f4b2-e77e-143f7aac2c00@samsung.com>

On 09/03/2017 12:57, Ilya Maximets wrote:
> On 08.03.2017 16:46, Sergio Gonzalez Monroy
> wrote:
>> Hi Ilya,
>>
>> I have done similar tests and as you already pointed out, 'numactl --interleave' does not seem to work as expected.
>> I have also checked that the issue can be reproduced with quota limit on hugetlbfs mount point.
>>
>> I would be inclined towards *adding libnuma as dependency* to DPDK to make memory allocation a bit more reliable.
>>
>> Currently at a high level regarding hugepages per numa node:
>> 1) Try to map all free hugepages. The total number of mapped hugepages depends on whether there were any limits, such as cgroups or quota in mount point.
>> 2) Find out numa node of each hugepage.
>> 3) Check if we have enough hugepages for requested memory in each numa socket/node.
>>
>> Using libnuma we could try to allocate hugepages per numa:
>> 1) Try to map as many hugepages from numa 0.
>> 2) Check if we have enough hugepages for requested memory in numa 0.
>> 3) Try to map as many hugepages from numa 1.
>> 4) Check if we have enough hugepages for requested memory in numa 1.
>>
>> This approach would improve failing scenarios caused by limits but it would still not fix issues regarding non-contiguous hugepages (worst case each hugepage is a memseg).
>> The non-contiguous hugepages issues are not as critical now that mempools can span over multiple memsegs/hugepages, but it is still a problem for any other library requiring big chunks of memory.
>>
>> Potentially if we were to add an option such as 'iommu-only' when all devices are bound to vfio-pci, we could have a reliable way to allocate hugepages by just requesting the number of pages from each numa.
>>
>> Thoughts?
> Hi Sergio,
>
> Thanks for your attention to this.
>
> For now, as we have some issues with non-contiguous
> hugepages, I'm thinking about following hybrid schema:
> 1) Allocate essential hugepages:
>     1.1) Allocate as many hugepages from numa N to
>          only fit requested memory for this numa.
>     1.2) repeat 1.1 for all numa nodes.
> 2) Try to map all remaining free hugepages in a round-robin
>    fashion like in this patch.
> 3) Sort pages and choose the most suitable.
>
> This solution should decrease number of issues connected with
> non-contiguous memory.

Sorry for late reply, I was hoping for more comments from the community.

IMHO this should be default behavior, which means no config option and libnuma as EAL dependency.
I think your proposal is good, could you consider implementing such approach on next release?

Regards.

> Best regards, Ilya Maximets.
>
>> On 06/03/2017 09:34, Ilya Maximets wrote:
>>> Hi all.
>>>
>>> So, what about this change?
>>>
>>> Best regards, Ilya Maximets.
>>>
>>> On 16.02.2017 16:01, Ilya Maximets wrote:
>>>> Currently EAL allocates hugepages one by one not paying
>>>> attention from which NUMA node allocation was done.
>>>>
>>>> Such behaviour leads to allocation failure if number of
>>>> available hugepages for application limited by cgroups
>>>> or hugetlbfs and memory requested not only from the first
>>>> socket.
>>>>
>>>> Example:
>>>>     # 90 x 1GB hugepages available in a system
>>>>
>>>>     cgcreate -g hugetlb:/test
>>>>     # Limit to 32GB of hugepages
>>>>     cgset -r hugetlb.1GB.limit_in_bytes=34359738368 test
>>>>     # Request 4GB from each of 2 sockets
>>>>     cgexec -g hugetlb:test testpmd --socket-mem=4096,4096 ...
>>>>
>>>>     EAL: SIGBUS: Cannot mmap more hugepages of size 1024 MB
>>>>     EAL: 32 not 90 hugepages of size 1024 MB allocated
>>>>     EAL: Not enough memory available on socket 1!
>>>>          Requested: 4096MB, available: 0MB
>>>>     PANIC in rte_eal_init():
>>>>     Cannot init memory
>>>>
>>>> This happens because all allocated pages are
>>>> on socket 0.
>>>>
>>>> Fix this issue by setting mempolicy MPOL_PREFERRED for each
>>>> hugepage to one of requested nodes in a round-robin fashion.
>>>> In this case all allocated pages will be fairly distributed
>>>> between all requested nodes.
>>>>
>>>> New config option RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
>>>> introduced and disabled by default because of external
>>>> dependency from libnuma.
>>>>
>>>> Cc:
>>>> Fixes: 77988fc08dc5 ("mem: fix allocating all free hugepages")
>>>>
>>>> Signed-off-by: Ilya Maximets
>>>> ---
>>>>  config/common_base                       |  1 +
>>>>  lib/librte_eal/Makefile                  |  4 ++
>>>>  lib/librte_eal/linuxapp/eal/eal_memory.c | 66 ++++++++++++++++++++++++++++++++
>>>>  mk/rte.app.mk                            |  3 ++
>>>>  4 files changed, 74 insertions(+)

Acked-by: Sergio Gonzalez Monroy
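
For anyone following the thread, the hybrid scheme discussed in this message (steps 1 and 2) can be modelled with a small C sketch that only counts pages. It is not the patch and does not call libnuma, mmap or hugetlbfs. The function names are hypothetical, and so is the assumption that un-steered mapping drains node 0 before node 1; that assumption is there only to mirror the cgroup example from the commit message. Step 3 (sorting pages and choosing the most suitable) is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two page counts. */
static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }

/*
 * Model of un-steered mapping: pages are taken node by node until the
 * process-wide limit (cgroup or hugetlbfs quota) is hit.
 * ASSUMPTION: node 0 is drained first; real kernel ordering may differ.
 */
void alloc_sequential(const unsigned *free_pages, size_t n_nodes,
                      unsigned limit, unsigned *mapped)
{
    assert(free_pages && mapped);
    for (size_t n = 0; n < n_nodes; n++) {
        mapped[n] = min_u(free_pages[n], limit);
        limit -= mapped[n];
    }
}

/* Model of the hybrid scheme, steps 1 and 2 only. */
void alloc_hybrid(const unsigned *free_pages, const unsigned *req_pages,
                  size_t n_nodes, unsigned limit, unsigned *mapped)
{
    assert(free_pages && req_pages && mapped);

    /* Step 1: per node, map only as many pages as that node requested. */
    for (size_t n = 0; n < n_nodes; n++) {
        mapped[n] = min_u(min_u(req_pages[n], free_pages[n]), limit);
        limit -= mapped[n];
    }

    /* Step 2: hand out the remaining budget one page per node, round-robin. */
    int progress = 1;
    while (limit > 0 && progress) {
        progress = 0;
        for (size_t n = 0; n < n_nodes && limit > 0; n++) {
            if (mapped[n] < free_pages[n]) {
                mapped[n]++;
                limit--;
                progress = 1;
            }
        }
    }
}

/* Returns 1 if every node got at least what was requested, else 0. */
int request_satisfied(const unsigned *mapped, const unsigned *req_pages,
                      size_t n_nodes)
{
    assert(mapped && req_pages);
    for (size_t n = 0; n < n_nodes; n++)
        if (mapped[n] < req_pages[n])
            return 0;
    return 1;
}
```

Taking the numbers from the commit message (a 32-page limit, since 34359738368 bytes is 32 x 1GB, and 4 pages requested per socket) and assuming the 90 free pages are split 45/45 across two nodes, the sequential model maps 32 pages on node 0 and none on node 1, which matches the "available: 0MB" failure in the log. The hybrid model ends with 16 pages on each node, so both requests are met.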