From: Bao-Long Tran
Date: Fri, 27 Dec 2019 18:05:57 +0800
To: Olivier Matz
Cc: anatoly.burakov@intel.com, arybchenko@solarflare.com, dev@dpdk.org,
 users@dpdk.org, ricudis@niometrics.com
Subject: Re: [dpdk-dev] Inconsistent behavior of mempool with regards to
 hugepage allocation

Hi Olivier,

> On 27 Dec 2019, at 4:11 PM, Olivier Matz wrote:
>
> On Thu, Dec 26, 2019 at 04:45:24PM +0100, Olivier Matz wrote:
>> Hi Bao-Long,
>>
>> On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
>>> Hi,
>>>
>>> I'm not sure if this is a bug, but I've seen an inconsistency in the
>>> behavior of DPDK with regard to hugepage allocation for rte_mempool.
>>> Basically, for the same mempool size, the number of hugepages allocated
>>> changes from run to run.
>>>
>>> Here's how I reproduce it with DPDK 19.11, IOVA=pa (the default):
>>>
>>> 1. Reserve 16x1G hugepages on socket 0
>>> 2. Replace examples/skeleton/basicfwd.c with the code below, build and run:
>>>    make && ./build/basicfwd
>>> 3. At the same time, watch the number of hugepages allocated:
>>>    "watch -n.1 ls /dev/hugepages"
>>> 4. Repeat step 2
>>>
>>> If you can reproduce, you should see that for some runs, DPDK allocates 5
>>> hugepages, other times it allocates 6.
>>> When it allocates 6, if you watch the output from step 3, you will see
>>> that DPDK first tries to allocate 5 hugepages, then unmaps all 5,
>>> retries, and gets 6.
>>
>> I cannot reproduce in the same conditions as yours (with 16 hugepages
>> on socket 0), but I think I can see a similar issue:
>>
>> If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
>> are used). If I reserve 5 hugepages, it takes more time,
>> taking/releasing hugepages several times, and it finally succeeds with 5
>> hugepages.

My apologies: I just checked again, and I was using DPDK 19.05, not 19.11 or
master. Let me try to see if I can reproduce my issue with 19.11. Sorry for
the confusion.

I also saw your patch to reduce wasted memory (eba11e). It seems to resolve
the problem with the IOVA-contig constraint that I described in my first
message. I'll look into it to confirm.

If I cannot reproduce my issue (different number of hugepages) with 19.11,
then from our side we can simply upgrade to 19.11, and that's all we need for
now. But let me also try to reproduce the issue you described (multiple
attempts to allocate hugepages).

>>
>>> For our use case, it's important that DPDK allocates the same number of
>>> hugepages on every run so we can get reproducible results.
>>
>> One possibility is to use the --legacy-mem EAL option. It will try to
>> reserve all hugepages first.
>
> Passing --socket-mem=5120,0 also does the job.
>
>>> Studying the code, this seems to be the behavior of
>>> rte_mempool_populate_default(). If I understand correctly, if the first
>>> try fails to get 5 IOVA-contiguous pages, it retries, relaxing the
>>> IOVA-contiguous condition, and eventually winds up with 6 hugepages.
>>
>> No, I don't think you have the IOVA-contiguous constraint in your
>> case. This is what I see:
>>
>> a- reserve 5 hugepages on socket 0, and start your patched basicfwd
>> b- it tries to allocate 2097151 objects of size 2304, pg_size = 1073741824
>> c- the total element size (with header) is 2304 + 64 = 2368
>> d- in rte_mempool_op_calc_mem_size_helper(), it calculates
>>      obj_per_page = 453438 (453438 * 2368 = 1073741184)
>>      mem_size = 4966058495
>> e- it tries to allocate 4966058495 bytes, which is less than 5 x 1G, with:
>>      rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
>>        mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY,
>>        align=64)
>>    For some reason, it fails: we can see that the number of mapped
>>    hugepages increases in /dev/hugepages, then returns to its original
>>    value. I don't think it should fail here.
>> f- then, it tries to allocate the biggest available contiguous zone. In
>>    my case, it is 1055291776 bytes (almost all of the single mapped
>>    hugepage). This is a second problem: if we call it again, it returns
>>    NULL, because it won't map another hugepage.
>> g- by luck, calling rte_mempool_populate_virt() allocates a small area
>>    (the mempool header), and that triggers the mapping of a new hugepage,
>>    which will be used in the next loop, back at step d with a smaller
>>    mem_size.
>>
>>> Questions:
>>> 1. Why does the API sometimes fail to get IOVA-contiguous memory, when
>>>    hugepage memory is abundant?
>>
>> In my case, it looks like we have a bit less than 1G free at the end of
>> the heap when we call rte_memzone_reserve_aligned(size=5G). The allocator
>> ends up mapping 5 pages (and fails), while only 4 are needed.
>>
>> Anatoly, do you have any idea? Shouldn't we take into account the amount
>> of free space at the end of the heap when expanding?
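
For my own understanding, I also wrote a tiny standalone program that
reproduces the numbers you quote in steps b-e. It only mirrors my reading of
rte_mempool_op_calc_mem_size_helper(), so the real DPDK code may differ in
the details, but the arithmetic gives the same obj_per_page and mem_size:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int
main(void)
{
        /* numbers taken from steps b-d above */
        uint64_t obj_num = 2097151;          /* objects requested for the pool */
        uint64_t total_elt_sz = 2304 + 64;   /* element + per-object header = 2368 */
        uint64_t pg_sz = UINT64_C(1) << 30;  /* 1G hugepage */

        uint64_t obj_per_page = pg_sz / total_elt_sz;                  /* 453438 */
        uint64_t objs_in_last_page = (obj_num - 1) % obj_per_page + 1; /* 283399 */

        /* full pages, plus the partially used last page, plus a margin in
         * case the start address is not page-aligned (again, just my reading
         * of the helper) */
        uint64_t mem_size =
                ((obj_num - objs_in_last_page) / obj_per_page) * pg_sz +
                objs_in_last_page * total_elt_sz +
                total_elt_sz - 1;

        printf("obj_per_page=%" PRIu64 " mem_size=%" PRIu64 "\n",
               obj_per_page, mem_size);
        /* prints: obj_per_page=453438 mem_size=4966058495 */
        return 0;
}

So the request in step e is indeed just under 5 x 1G, consistent with your
point that the reservation should not have to fail there.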
>>
>>> 2. Why does the 2nd retry need N+1 hugepages?
>>
>> When the first alloc fails, the mempool code tries to allocate in
>> several chunks which are not virtually contiguous. This is needed in
>> case the memory is fragmented.
>>
>>> Some insights for Q1: From my experiments, it seems the IOVA of the first
>>> hugepage is not guaranteed to be at the start of the IOVA space
>>> (understandably). That could explain the retry when the IOVA of the first
>>> hugepage is near the end of the IOVA space. But I have also seen
>>> situations where the 1st hugepage is near the beginning of the IOVA space
>>> and it still failed the 1st time.
>>>
>>> Here's the code:
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <rte_eal.h>
>>> #include <rte_debug.h>
>>> #include <rte_mbuf.h>
>>>
>>> int
>>> main(int argc, char *argv[])
>>> {
>>>         struct rte_mempool *mbuf_pool;
>>>         unsigned mbuf_pool_size = 2097151;
>>>
>>>         int ret = rte_eal_init(argc, argv);
>>>         if (ret < 0)
>>>                 rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
>>>
>>>         printf("Creating mbuf pool size=%u\n", mbuf_pool_size);
>>>         mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", mbuf_pool_size,
>>>                 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, 0);
>>>
>>>         printf("mbuf_pool %p\n", (void *)mbuf_pool);
>>>
>>>         return 0;
>>> }
>>>
>>> Best regards,
>>> BL
>>
>> Regards,
>> Olivier

Thanks,
BL
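
P.S. Until I've confirmed the behavior on 19.11, I'll probably make our runs
deterministic by hard-wiring the EAL options you suggested into the test
program instead of passing them on the command line. Something along these
lines (an untested sketch; the option values are just the ones from this
thread, and --socket-mem=5120,0 could be swapped for --legacy-mem):

#include <stdio.h>
#include <rte_eal.h>
#include <rte_common.h> /* RTE_DIM */

int
main(void)
{
        /* Untested sketch: pass the EAL options directly so every run
         * reserves the same hugepages up front. */
        char *eal_args[] = {
                "basicfwd",
                "--socket-mem=5120,0",
        };
        int eal_argc = RTE_DIM(eal_args);

        int ret = rte_eal_init(eal_argc, eal_args);
        if (ret < 0) {
                printf("EAL init failed\n");
                return 1;
        }

        /* ... then create the mbuf pool exactly as in the program above ... */

        return 0;
}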