From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 27 Dec 2019 09:11:22 +0100
From: Olivier Matz <olivier.matz@6wind.com>
To: Bao-Long Tran <tranbaolong@niometrics.com>
Cc: anatoly.burakov@intel.com, arybchenko@solarflare.com, dev@dpdk.org,
 users@dpdk.org, ricudis@niometrics.com
Message-ID: <20191227081122.GL22738@platinum>
References: <AEEF393A-B56D-4F06-B54F-5AF4022B1F2D@niometrics.com>
 <20191226154524.GG22738@platinum>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20191226154524.GG22738@platinum>
User-Agent: Mutt/1.10.1 (2018-07-13)
Subject: Re: [dpdk-dev] Inconsistent behavior of mempool with regards to
 hugepage allocation

On Thu, Dec 26, 2019 at 04:45:24PM +0100, Olivier Matz wrote:
> Hi Bao-Long,
> 
> On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
> > Hi,
> > 
> > I'm not sure if this is a bug, but I've seen an inconsistency in the behavior 
> > of DPDK with regards to hugepage allocation for rte_mempool. Basically, for the
> > same mempool size, the number of hugepages allocated changes from run to run.
> > 
> > Here's how I reproduce it with DPDK 19.11, IOVA=pa (default):
> > 
> > 1. Reserve 16x1G hugepages on socket 0 
> > 2. Replace examples/skeleton/basicfwd.c with the code below, build and run
> > make && ./build/basicfwd 
> > 3. At the same time, watch the number of hugepages allocated 
> > "watch -n.1 ls /dev/hugepages"
> > 4. Repeat step 2
> > 
> > If you can reproduce it, you should see that on some runs DPDK allocates 5
> > hugepages, while on other runs it allocates 6. When it allocates 6, if you
> > watch the output from step 3, you will see that DPDK first tries to allocate
> > 5 hugepages, then unmaps all 5, retries, and gets 6.
> 
> I cannot reproduce it under the same conditions as yours (with 16 hugepages
> on socket 0), but I think I can see a similar issue:
> 
> If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
> are used). If I reserve 5 hugepages, it takes more time,
> taking/releasing hugepages several times, and it finally succeeds with 5
> hugepages.
> 
> > For our use case, it's important that DPDK allocates the same number of
> > hugepages on every run so we can get reproducible results.
> 
> One possibility is to use the --legacy-mem EAL option. It will try to
> reserve all hugepages first.

Passing --socket-mem=5120,0 also does the job.
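
I.e. with the patched basicfwd from your mail, something like:

  ./build/basicfwd --legacy-mem
or
  ./build/basicfwd --socket-mem=5120,0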

> > Studying the code, this seems to be the behavior of
> > rte_mempool_populate_default(). If I understand correctly, if the first try fails
> > to get 5 IOVA-contiguous pages, it retries, relaxing the IOVA-contiguous
> > condition, and eventually winds up with 6 hugepages.
> 
> No, I think you don't have the IOVA-contiguous constraint in your
> case. This is what I see:
> 
> a- reserve 5 hugepages on socket 0, and start your patched basicfwd
> b- it tries to allocate 2097151 objects of size 2304, pg_size = 1073741824
> c- the total element size (with header) is 2304 + 64 = 2368
> d- in rte_mempool_op_calc_mem_size_helper(), it calculates
>    obj_per_page = 453438    (453438 * 2368 = 1073741184)
>    mem_size = 4966058495
> e- it tries to allocate 4966058495 bytes, which is less than 5 x 1G, with:
>    rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
>      mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY,
>      align=64)
>    For some reason, it fails: we can see that the number of map'd hugepages
>    increases in /dev/hugepages, then returns to its original value.
>    I don't think it should fail here.
> f- then, it will try to allocate the biggest available contiguous zone. In
>    my case, it is 1055291776 bytes (almost all of the single map'd hugepage).
>    This is a second problem: if we call it again, it returns NULL, because
>    it won't map another hugepage.
> g- by luck, calling rte_mempool_populate_virt() allocates a small area
>    (the mempool header), and it triggers the mapping of a new hugepage, which
>    will be used in the next loop iteration, back at step d with a smaller mem_size.
> 
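To give a bit more detail on step d, the size is computed more or less as in
the snippet below (a simplified standalone version, not the exact mempool code,
using the numbers from the run above):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t obj_num = 2097151;       /* objects requested */
	size_t total_elt_sz = 2368;       /* 2304 + 64 bytes of header */
	unsigned int pg_shift = 30;       /* 1G pages */

	size_t pg_sz = (size_t)1 << pg_shift;
	size_t obj_per_page = pg_sz / total_elt_sz;                    /* 453438 */
	size_t objs_in_last_page = ((obj_num - 1) % obj_per_page) + 1; /* 283399 */

	/* room for the last, partially filled page */
	size_t mem_size = objs_in_last_page * total_elt_sz;
	/* room for the full pages */
	mem_size += ((obj_num - objs_in_last_page) / obj_per_page) << pg_shift;
	/* margin in case the reserved zone is not page-aligned */
	mem_size += total_elt_sz - 1;

	printf("%zu\n", mem_size);        /* prints 4966058495 */
	return 0;
}

So the 4966058495 bytes are 4 full 1G pages, plus a last partially filled page,
plus a small alignment margin.
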
> > Questions:
> > 1. Why does the API sometimes fail to get IOVA-contiguous memory when
> > hugepage memory is abundant?
> 
> In my case, it looks like we have a bit less than 1G free at the end of the
> heap, and then we call rte_memzone_reserve_aligned(size=5G). The allocator
> ends up mapping 5 new pages (and fails), while only 4 are needed.
> 
> Anatoly, do you have any idea? Shouldn't we take into account the amount
> of free space at the end of the heap when expanding it?
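
(To put rough numbers on it: the requested size is 4966058495 bytes (~4.63G)
and a bit less than 1G is already free at the end of the heap, so 4 extra 1G
pages should be enough; but 5 get requested, and since one of the 5 reserved
hugepages is presumably already map'd (cf. step f), only 4 are left and the
expansion fails.)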
> 
> > 2. Why does the 2nd retry need N+1 hugepages?
> 
> When the first alloc fails, the mempool code tries to allocate in
> several chunks which are not virtually contiguous. This is needed in
> case the memory is fragmented.
> 
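To illustrate, the populate logic is roughly the following (very simplified
sketch from memory, not the exact code, error handling omitted):

	/* simplified view of the loop in rte_mempool_populate_default() */
	for (n = mp->size; n > 0; n -= ret) {
		/* memory needed for the n remaining objects */
		mem_size = rte_mempool_ops_calc_mem_size(mp, n, pg_shift,
				&min_chunk_size, &align);
		/* first try to get one zone big enough for all of them */
		mz = rte_memzone_reserve_aligned(mz_name, mem_size,
				mp->socket_id, mz_flags, align);
		if (mz == NULL)
			/* fallback: size 0 means "biggest free zone", so the
			 * pool gets populated chunk by chunk */
			mz = rte_memzone_reserve_aligned(mz_name, 0,
					mp->socket_id, mz_flags, align);
		/* add whatever we got to the pool, then loop on the rest;
		 * free_cb is the internal callback that frees the memzone */
		ret = rte_mempool_populate_virt(mp, mz->addr, mz->len, pg_sz,
				free_cb, mz);
	}

The size-0 fallback is what you see at step f above.
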
> > Some insights for Q1: From my experiments, it seems like the IOVA of the first
> > hugepage is not guaranteed to be at the start of the IOVA space (understandably).
> > It could explain the retry when the IOVA of the first hugepage is near the end of
> > the IOVA space. But I have also seen situations where the 1st hugepage is near
> > the beginning of the IOVA space and it still failed the 1st time.
> > 
> > Here's the code:
> > #include <rte_eal.h>
> > #include <rte_mbuf.h>
> > 
> > int
> > main(int argc, char *argv[])
> > {
> > 	struct rte_mempool *mbuf_pool;
> > 	unsigned mbuf_pool_size = 2097151;
> > 
> > 	int ret = rte_eal_init(argc, argv);
> > 	if (ret < 0)
> > 		rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
> > 
> > 	printf("Creating mbuf pool size=%u\n", mbuf_pool_size);
> > 	mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", mbuf_pool_size,
> > 		256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, 0);
> > 
> > 	printf("mbuf_pool %p\n", mbuf_pool);
> > 
> > 	return 0;
> > }
> > 
> > Best regards,
> > BL
> 
> Regards,
> Olivier