From: Thomas Monjalon <thomas.monjalon@6wind.com>
To: Ilya Maximets <i.maximets@samsung.com>,
 Sergio Gonzalez Monroy <sergio.gonzalez.monroy@intel.com>
Cc: dev@dpdk.org, David Marchand <david.marchand@6wind.com>,
 Heetae Ahn <heetae82.ahn@samsung.com>,
 Yuanhan Liu <yuanhan.liu@linux.intel.com>,
 Jianfeng Tan <jianfeng.tan@intel.com>, Neil Horman <nhorman@tuxdriver.com>,
 Yulong Pei <yulong.pei@intel.com>, stable@dpdk.org,
 Bruce Richardson <bruce.richardson@intel.com>
Date: Fri, 07 Apr 2017 17:44:56 +0200
Message-ID: <1945759.SoJb5dzy87@xps13>
User-Agent: KMail/4.14.10 (Linux/4.5.4-1-ARCH; KDE/4.14.11; x86_64; ; )
In-Reply-To: <2a9b03bd-a4c0-8d20-0bbd-77730140eef0@samsung.com>
References: <CGME20170216130139eucas1p2512567d6f5db9eaac5ee840b56bf920a@eucas1p2.samsung.com>
 <d90f300a-dc8c-8788-d3ef-7970a6d36508@samsung.com>
 <2a9b03bd-a4c0-8d20-0bbd-77730140eef0@samsung.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 7Bit
Content-Type: text/plain; charset="us-ascii"
Subject: Re: [dpdk-dev] [PATCH] mem: balanced allocation of hugepages

2017-04-07 18:14, Ilya Maximets:
> Hi All.
> 
> I wanted to ask (just to clarify the current status):
> Will this patch be included in the current release (acked by the
> maintainer), after which I would upgrade it to the hybrid logic, or
> should I just prepare a v3 with the hybrid logic for 17.08?

What is your preferred option, Ilya?
Sergio?


> On 27.03.2017 17:43, Ilya Maximets wrote:
> > On 27.03.2017 16:01, Sergio Gonzalez Monroy wrote:
> >> On 09/03/2017 12:57, Ilya Maximets wrote:
> >>> On 08.03.2017 16:46, Sergio Gonzalez Monroy wrote:
> >>>> Hi Ilya,
> >>>>
> >>>> I have done similar tests and, as you already pointed out, 'numactl --interleave' does not seem to work as expected.
> >>>> I have also checked that the issue can be reproduced with a quota limit on the hugetlbfs mount point.
> >>>>
> >>>> I would be inclined towards *adding libnuma as a dependency* to DPDK to make memory allocation a bit more reliable.
> >>>>
> >>>> Currently, at a high level, hugepage allocation per NUMA node works as follows:
> >>>> 1) Try to map all free hugepages. The total number of mapped hugepages depends on whether there are any limits, such as cgroups or a quota on the mount point.
> >>>> 2) Find out the NUMA node of each hugepage.
> >>>> 3) Check whether we have enough hugepages for the requested memory on each NUMA socket/node.
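
Just to illustrate step 2 above, here is a minimal sketch (not the
actual eal_memory.c code, which derives this from /proc/self/numa_maps;
the helper name is made up) of how the backing node of an already-mapped
and touched hugepage can be queried through libnuma:

    #include <numaif.h> /* get_mempolicy(), MPOL_F_NODE, MPOL_F_ADDR; -lnuma */

    /* Hypothetical helper: return the NUMA node that currently backs 'addr',
     * or -1 on error.  The page must already be faulted in (touched once),
     * otherwise there is no physical page whose node could be reported. */
    static int page_numa_node(void *addr)
    {
            int node = -1;

            /* With MPOL_F_NODE | MPOL_F_ADDR the kernel reports the node of
             * the page at 'addr' instead of the memory policy itself. */
            if (get_mempolicy(&node, NULL, 0, addr,
                              MPOL_F_NODE | MPOL_F_ADDR) < 0)
                    return -1;
            return node;
    }
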
> >>>>
> >>>> Using libnuma we could try to allocate hugepages per NUMA node:
> >>>> 1) Try to map as many hugepages as possible from NUMA node 0.
> >>>> 2) Check whether we have enough hugepages for the requested memory on node 0.
> >>>> 3) Try to map as many hugepages as possible from NUMA node 1.
> >>>> 4) Check whether we have enough hugepages for the requested memory on node 1.
> >>>>
> >>>> This approach would improve the failure scenarios caused by limits, but it would still not fix the issues regarding non-contiguous hugepages (in the worst case each hugepage is a memseg).
> >>>> The non-contiguous hugepage issues are not as critical now that mempools can span multiple memsegs/hugepages, but it is still a problem for any other library requiring big chunks of memory.
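
To make the per-node pass concrete, a rough, hedged sketch using
libnuma's numa_set_preferred() could look like the following; the file
naming, page size and helper name are invented for illustration and
error handling is minimal:

    #include <numa.h>      /* numa_available(), numa_set_preferred(); -lnuma */
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define HUGEPAGE_SZ (1024UL * 1024 * 1024)      /* example: 1 GB pages */

    /* Hypothetical helper implementing steps 1)-4): fill one node at a time
     * until its --socket-mem request is satisfied. */
    static int map_pages_for_node(int node, size_t requested)
    {
            size_t got;
            char path[64];

            if (numa_available() < 0)
                    return -1;
            numa_set_preferred(node);  /* new pages should come from 'node' */

            for (got = 0; got < requested; got += HUGEPAGE_SZ) {
                    snprintf(path, sizeof(path), "/dev/hugepages/rte_n%d_%zu",
                             node, got / HUGEPAGE_SZ);
                    int fd = open(path, O_CREAT | O_RDWR, 0600);
                    if (fd < 0)
                            return -1;
                    void *va = mmap(NULL, HUGEPAGE_SZ, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);
                    close(fd);
                    if (va == MAP_FAILED)
                            return -1;
                    /* Touch the page so it is actually allocated, preferably
                     * on 'node'; a SIGBUS here means the quota ran out. */
                    *(volatile char *)va = 0;
            }
            return 0;
    }
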
> >>>>
> >>>> Potentially, if we were to add an option such as 'iommu-only' for when all devices are bound to vfio-pci, we could have a reliable way to allocate hugepages by just requesting the number of pages from each NUMA node.
> >>>>
> >>>> Thoughts?
> >>> Hi Sergio,
> >>>
> >>> Thanks for your attention to this.
> >>>
> >>> For now, as we have some issues with non-contiguous
> >>> hugepages, I'm thinking about the following hybrid scheme:
> >>> 1) Allocate the essential hugepages:
> >>>     1.1) Allocate only as many hugepages from NUMA node N
> >>>          as are needed to fit the requested memory for this node.
> >>>     1.2) Repeat 1.1 for all NUMA nodes.
> >>> 2) Try to map all remaining free hugepages in a round-robin
> >>>     fashion, as in this patch.
> >>> 3) Sort the pages and choose the most suitable ones.
> >>>
> >>> This solution should decrease the number of issues connected with
> >>> non-contiguous memory.
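
A rough outline of how that hybrid scheme could hang together is below;
every helper name is hypothetical (including map_pages_for_node() from
the earlier sketch), so this is a shape, not an implementation:

    /* Hypothetical outline of the hybrid scheme described above. */
    static int alloc_hugepages_hybrid(const size_t requested[], int nb_nodes)
    {
            int node;

            /* 1) Essential pages: give each node just enough hugepages to
             *    cover its --socket-mem request (steps 1.1/1.2). */
            for (node = 0; node < nb_nodes; node++)
                    if (map_pages_for_node(node, requested[node]) < 0)
                            return -1;

            /* 2) Map the remaining free hugepages round-robin across the
             *    requested nodes, as the current patch already does. */
            map_remaining_pages_round_robin(requested, nb_nodes);

            /* 3) Sort the mapped pages (e.g. by physical address) and keep
             *    the most suitable ones, unmapping the rest. */
            sort_and_trim_pages();
            return 0;
    }
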
> >>
> >> Sorry for the late reply, I was hoping for more comments from the community.
> >>
> >> IMHO this should be the default behavior, which means no config option and libnuma as an EAL dependency.
> >> I think your proposal is good; could you consider implementing such an approach for the next release?
> > 
> > Sure, I can implement this for the 17.08 release.
> > 
> >>>
> >>>> On 06/03/2017 09:34, Ilya Maximets wrote:
> >>>>> Hi all.
> >>>>>
> >>>>> So, what about this change?
> >>>>>
> >>>>> Best regards, Ilya Maximets.
> >>>>>
> >>>>> On 16.02.2017 16:01, Ilya Maximets wrote:
> >>>>>> Currently EAL allocates hugepages one by one, not paying
> >>>>>> attention to which NUMA node the allocation came from.
> >>>>>>
> >>>>>> Such behaviour leads to allocation failures if the number of
> >>>>>> hugepages available to the application is limited by cgroups
> >>>>>> or hugetlbfs and memory is requested from more than just the
> >>>>>> first socket.
> >>>>>>
> >>>>>> Example:
> >>>>>>      # 90 x 1GB hugepages available in the system
> >>>>>>
> >>>>>>      cgcreate -g hugetlb:/test
> >>>>>>      # Limit to 32GB of hugepages
> >>>>>>      cgset -r hugetlb.1GB.limit_in_bytes=34359738368 test
> >>>>>>      # Request 4GB from each of 2 sockets
> >>>>>>      cgexec -g hugetlb:test testpmd --socket-mem=4096,4096 ...
> >>>>>>
> >>>>>>      EAL: SIGBUS: Cannot mmap more hugepages of size 1024 MB
> >>>>>>      EAL: 32 not 90 hugepages of size 1024 MB allocated
> >>>>>>      EAL: Not enough memory available on socket 1!
> >>>>>>           Requested: 4096MB, available: 0MB
> >>>>>>      PANIC in rte_eal_init():
> >>>>>>      Cannot init memory
> >>>>>>
> >>>>>>      This happens because all allocated pages are
> >>>>>>      on socket 0.
> >>>>>>
> >>>>>> Fix this issue by setting the MPOL_PREFERRED mempolicy for each
> >>>>>> hugepage to one of the requested nodes in a round-robin fashion.
> >>>>>> In this case all allocated pages will be fairly distributed
> >>>>>> among all requested nodes.
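
(For readers unfamiliar with mempolicies, a minimal sketch of what such a
round-robin preference can look like; the names are illustrative and this
is not the patch itself:)

    #include <numaif.h>   /* set_mempolicy(), MPOL_PREFERRED, MPOL_DEFAULT */

    /* Illustrative only: before mapping hugepage number 'i', prefer the
     * next requested node in round-robin order, so the page is faulted
     * onto that node when it is first touched. */
    static void prefer_next_node(unsigned int i, const int *req_nodes,
                                 unsigned int nb_req_nodes)
    {
            unsigned long nodemask = 1UL << req_nodes[i % nb_req_nodes];

            if (set_mempolicy(MPOL_PREFERRED, &nodemask,
                              sizeof(nodemask) * 8) < 0)
                    set_mempolicy(MPOL_DEFAULT, NULL, 0);  /* best effort */
    }

    /* Once all hugepages are mapped and touched, the default policy would
     * be restored with set_mempolicy(MPOL_DEFAULT, NULL, 0). */
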
> >>>>>>
> >>>>>> A new config option, RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES, is
> >>>>>> introduced and disabled by default because of the external
> >>>>>> dependency on libnuma.
> >>>>>>
> >>>>>> Cc: <stable@dpdk.org>
> >>>>>> Fixes: 77988fc08dc5 ("mem: fix allocating all free hugepages")
> >>>>>>
> >>>>>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> >>>>>> ---
> >>>>>>    config/common_base                       |  1 +
> >>>>>>    lib/librte_eal/Makefile                  |  4 ++
> >>>>>>    lib/librte_eal/linuxapp/eal/eal_memory.c | 66 ++++++++++++++++++++++++++++++++
> >>>>>>    mk/rte.app.mk                            |  3 ++
> >>>>>>    4 files changed, 74 insertions(+)
> >>
> >> Acked-by: Sergio Gonzalez Monroy <sergio.gonzalez.monroy@intel.com>
> > 
> > Thanks.
> > 
> > Best regards, Ilya Maximets.
> >