From mboxrd@z Thu Jan  1 00:00:00 1970
From: Bao-Long Tran <tranbaolong@niometrics.com>
In-Reply-To: <20191227081122.GL22738@platinum>
Date: Fri, 27 Dec 2019 18:05:57 +0800
Cc: anatoly.burakov@intel.com, arybchenko@solarflare.com, dev@dpdk.org,
 users@dpdk.org, ricudis@niometrics.com
Message-Id: <6C9B12E3-7C6C-4D0F-981B-12A49F71E467@niometrics.com>
References: <AEEF393A-B56D-4F06-B54F-5AF4022B1F2D@niometrics.com>
 <20191226154524.GG22738@platinum> <20191227081122.GL22738@platinum>
To: Olivier Matz <olivier.matz@6wind.com>
Subject: Re: [dpdk-dev] Inconsistent behavior of mempool with regards to
 hugepage allocation

Hi Olivier,

> On 27 Dec 2019, at 4:11 PM, Olivier Matz <olivier.matz@6wind.com> wrote:
>
> On Thu, Dec 26, 2019 at 04:45:24PM +0100, Olivier Matz wrote:
>> Hi Bao-Long,
>>
>> On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
>>> Hi,
>>>
>>> I'm not sure if this is a bug, but I've seen an inconsistency in the behavior
>>> of DPDK with regards to hugepage allocation for rte_mempool. Basically, for the
>>> same mempool size, the number of hugepages allocated changes from run to run.
>>>
>>> Here's how I reproduce with DPDK 19.11. IOVA=pa (default)
>>>
>>> 1. Reserve 16x1G hugepages on socket 0
>>> 2. Replace examples/skeleton/basicfwd.c with the code below, build and run
>>> make && ./build/basicfwd
>>> 3. At the same time, watch the number of hugepages allocated
>>> "watch -n.1 ls /dev/hugepages"
>>> 4. Repeat step 2
>>>
>>> If you can reproduce, you should see that for some runs, DPDK allocates 5
>>> hugepages, other times it allocates 6. When it allocates 6, if you watch the
>>> output from step 3, you will see that DPDK first tries to allocate 5 hugepages,
>>> then unmaps all 5, retries, and gets 6.
>>
>> I cannot reproduce in the same conditions as yours (with 16 hugepages
>> on socket 0), but I think I can see a similar issue:
>>
>> If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
>> are used). If I reserve 5 hugepages, it takes more time,
>> taking/releasing hugepages several times, and it finally succeeds with 5
>> hugepages.

My apologies: I just checked again, and I was using DPDK 19.05, not 19.11 or master.
Let me try to see if I can repro my issue with 19.11. Sorry for the confusion.

I also saw your patch to reduce wasted memory (eba11e). Seems like it resolves
the problem with the IOVA-contig constraint that I described in my first message.
I'll look into it to confirm.

If I cannot repro my issue (different number of hugepages) with 19.11, from our
side we can upgrade to 19.11 and that's all we need for now. But let me also try
to repro the issue you described (multiple attempts to allocate hugepages).

>>
>>> For our use case, it's important that DPDK allocates the same number of
>>> hugepages on every run so we can get reproducible results.
>>
>> One possibility is to use the --legacy-mem EAL option. It will try to
>> reserve all hugepages first.
>
> Passing --socket-mem=5120,0 also does the job.
>
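
Thanks, noted. For reference, on our side invoking it would look roughly like this
(just a sketch; whether we go with --legacy-mem or --socket-mem, and the exact
amount, depends on our final pool sizing):

  ./build/basicfwd --legacy-mem
  ./build/basicfwd --socket-mem=5120,0    (5120 MB on socket 0, none on socket 1)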

>>> Studying the code, this seems to be the behavior of
>>> rte_mempool_populate_default(). If I understand correctly, if the first try fails
>>> to get 5 IOVA-contiguous pages, it retries, relaxing the IOVA-contiguous
>>> condition, and eventually winds up with 6 hugepages.
>>
>> No, I think you don't have the IOVA-contiguous constraint in your
>> case. This is what I see:
>>
>> a- reserve 5 hugepages on socket 0, and start your patched basicfwd
>> b- it tries to allocate 2097151 objects of size 2304, pg_size = 1073741824
>> c- the total element size (with header) is 2304 + 64 = 2368
>> d- in rte_mempool_op_calc_mem_size_helper(), it calculates
>>   obj_per_page = 453438    (453438 * 2368 = 1073741184)
>>   mem_size = 4966058495
>> e- it tries to allocate 4966058495 bytes, which is less than 5 x 1G, with:
>>   rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
>>     mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY,
>>     align=64)
>>   For some reason, it fails: we can see that the number of mapped hugepages
>>   increases in /dev/hugepages, then returns to its original value.
>>   I don't think it should fail here.
>> f- then, it will try to allocate the biggest available contiguous zone. In
>>   my case, it is 1055291776 bytes (almost all of the single mapped hugepage).
>>   This is a second problem: if we call it again, it returns NULL, because
>>   it won't map another hugepage.
>> g- by luck, calling rte_mempool_populate_virt() allocates a small area
>>   (the mempool header), and it triggers the mapping of a new hugepage, that
>>   will be used in the next loop, back at step d with a smaller mem_size.
>>
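
Thanks for the detailed breakdown. To make sure I follow the numbers in d-/e-,
I put them through a tiny standalone check. Note that the formula below is just
my reading of how the size is computed (full pages, plus a partially filled last
page, plus an alignment margin), not code taken from the mempool library:

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
	uint64_t pg_sz = 1073741824ULL;    /* 1G hugepage */
	uint64_t elt_sz = 2304 + 64;       /* object + header = 2368 */
	uint64_t obj_num = 2097151;

	uint64_t obj_per_page = pg_sz / elt_sz;        /* 453438 */
	uint64_t full_pages = obj_num / obj_per_page;  /* 4 */
	uint64_t leftover = obj_num % obj_per_page;    /* 283399 */

	/* 4 full pages + partially filled last page + alignment margin */
	uint64_t mem_size = full_pages * pg_sz + leftover * elt_sz + elt_sz - 1;

	printf("obj_per_page=%" PRIu64 " mem_size=%" PRIu64 "\n",
	       obj_per_page, mem_size);    /* 453438 and 4966058495 */
	return 0;
}

That's about 4.63G, so it should indeed fit in the 5 reserved hugepages, which
matches your point that the reserve at e- shouldn't fail.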

>>> Questions:
>>> 1. Why does the API sometimes fail to get IOVA contig mem, when hugepage memory
>>> is abundant?
>>
>> In my case, it looks like we have a bit less than 1G which is free at
>> the end of the heap, then we call rte_memzone_reserve_aligned(size=5G).
>> The allocator ends up mapping 5 pages (and fails), while only 4 are
>> needed.
>>
>> Anatoly, do you have any idea? Shouldn't we take into account the amount
>> of free space at the end of the heap when expanding?
>>
>>> 2. Why does the 2nd retry need N+1 hugepages?
>>
>> When the first alloc fails, the mempool code tries to allocate in
>> several chunks which are not virtually contiguous. This is needed in
>> case the memory is fragmented.
>>
>>> Some insights for Q1: From my experiments, it seems like the IOVA of the first
>>> hugepage is not guaranteed to be at the start of the IOVA space (understandably).
>>> It could explain the retry when the IOVA of the first hugepage is near the end of
>>> the IOVA space. But I have also seen situations where the 1st hugepage is near
>>> the beginning of the IOVA space and it still failed the 1st time.
>>>
>>> Here's the code:
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <rte_eal.h>
>>> #include <rte_mbuf.h>
>>>
>>> int
>>> main(int argc, char *argv[])
>>> {
>>> 	struct rte_mempool *mbuf_pool;
>>> 	unsigned mbuf_pool_size = 2097151;
>>>
>>> 	int ret = rte_eal_init(argc, argv);
>>> 	if (ret < 0)
>>> 		rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
>>>
>>> 	printf("Creating mbuf pool size=%u\n", mbuf_pool_size);
>>> 	mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", mbuf_pool_size,
>>> 		256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, 0);
>>>
>>> 	printf("mbuf_pool %p\n", mbuf_pool);
>>>
>>> 	return 0;
>>> }
>>>
>>> Best regards,
>>> BL
>>
>> Regards,
>> Olivier
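
In case it helps with reproducing the Q1 observation: something like the sketch
below (untested, just to illustrate; it only uses rte_memseg_walk() from
<rte_memory.h>) dumps where each mapped hugepage lands in IOVA space, which is
how I'd double-check whether the first page's IOVA is near the end of the range.

#include <stdio.h>
#include <inttypes.h>
#include <rte_memory.h>

static int
dump_memseg(const struct rte_memseg *ms, void *arg)
{
	(void)arg;
	printf("va=%p iova=0x%" PRIx64 " len=%zu hugepage_sz=%" PRIu64 "\n",
	       ms->addr, (uint64_t)ms->iova, ms->len, ms->hugepage_sz);
	return 0; /* 0 = keep walking */
}

/* call after rte_eal_init(): rte_memseg_walk(dump_memseg, NULL); */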

Thanks,
BL