From: Olivier Matz <olivier.matz@6wind.com>
To: Bao-Long Tran <tranbaolong@niometrics.com>
Cc: anatoly.burakov@intel.com, arybchenko@solarflare.com,
dev@dpdk.org, users@dpdk.org, ricudis@niometrics.com
Subject: Re: [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation
Date: Fri, 27 Dec 2019 09:11:22 +0100
Message-ID: <20191227081122.GL22738@platinum>
In-Reply-To: <20191226154524.GG22738@platinum>

On Thu, Dec 26, 2019 at 04:45:24PM +0100, Olivier Matz wrote:
> Hi Bao-Long,
>
> On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
> > Hi,
> >
> > I'm not sure if this is a bug, but I've seen an inconsistency in the behavior
> > of DPDK with regard to hugepage allocation for rte_mempool. Basically, for the
> > same mempool size, the number of hugepages allocated changes from run to run.
> >
> > Here's how I reproduce with DPDK 19.11. IOVA=pa (default)
> >
> > 1. Reserve 16x1G hugepages on socket 0
> > 2. Replace examples/skeleton/basicfwd.c with the code below, build and run
> >    make && ./build/basicfwd
> > 3. At the same time, watch the number of hugepages allocated
> >    "watch -n.1 ls /dev/hugepages"
> > 4. Repeat step 2
> >
> > If you can reproduce, you should see that for some runs, DPDK allocates 5
> > hugepages, other times it allocates 6. When it allocates 6, if you watch the
> > output from step 3, you will see that DPDK first tries to allocate 5 hugepages,
> > then unmaps all 5, retries, and gets 6.
>
> I cannot reproduce under the same conditions as yours (with 16 hugepages
> on socket 0), but I think I can see a similar issue:
>
> If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
> are used). If I reserve 5 hugepages, it takes more time,
> taking/releasing hugepages several times, and it finally succeeds with 5
> hugepages.
>
> > For our use case, it's important that DPDK allocates the same number of
> > hugepages on every run so we can get reproducible results.
>
> One possibility is to use the --legacy-mem EAL option. It will try to
> reserve all hugepages first.

Passing --socket-mem=5120,0 also does the job.

> > Studying the code, this seems to be the behavior of
> > rte_mempool_populate_default(). If I understand correctly, if the first try fails
> > to get 5 IOVA-contiguous pages, it retries, relaxing the IOVA-contiguous
> > condition, and eventually winds up with 6 hugepages.
>
> No, I think you don't have the IOVA-contiguous constraint in your
> case. This is what I see:
>
> a- reserve 5 hugepages on socket 0, and start your patched basicfwd
> b- it tries to allocate 2097151 objects of size 2304, pg_size = 1073741824
> c- the total element size (with header) is 2304 + 64 = 2368
> d- in rte_mempool_op_calc_mem_size_helper(), it calculates
>      obj_per_page = 453438    (453438 * 2368 = 1073741184)
>      mem_size = 4966058495
> e- it tries to allocate 4966058495 bytes, which is less than 5 x 1G, with:
>      rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
>        mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY,
>        align=64)
>    For some reason, it fails: we can see that the number of mapped hugepages
>    increases in /dev/hugepages, then returns to its original value.
>    I don't think it should fail here.
> f- then, it will try to allocate the biggest available contiguous zone. In
>    my case, it is 1055291776 bytes (almost all of the only mapped hugepage).
>    This is a second problem: if we call it again, it returns NULL, because
>    it won't map another hugepage.
> g- by luck, calling rte_mempool_populate_virt() allocates a small area
>    (mempool header), and it triggers the mapping of a new hugepage, which
>    will be used in the next loop, back at step d with a smaller mem_size.
>
> > Questions:
> > 1. Why does the API sometimes fail to get IOVA contig mem, when hugepage memory
> > is abundant?
>
> In my case, it looks like we have a bit less than 1G free at the end of
> the heap, then we call rte_memzone_reserve_aligned(size=5G).
> The allocator ends up mapping 5 pages (and fails), while only 4 are
> needed.
>
> Anatoly, do you have any idea? Shouldn't we take into account the amount
> of free space at the end of the heap when expanding?
>
> > 2. Why does the 2nd retry need N+1 hugepages?
>
> When the first alloc fails, the mempool code tries to allocate in
> several chunks which are not virtually contiguous. This is needed in
> case the memory is fragmented.
>
> > Some insights for Q1: From my experiments, it seems that the IOVA of the first
> > hugepage is not guaranteed to be at the start of the IOVA space (understandably).
> > It could explain the retry when the IOVA of the first hugepage is near the end of
> > the IOVA space. But I have also seen situations where the 1st hugepage is near
> > the beginning of the IOVA space and it still failed the 1st time.
> >
> > Here's the code:
> > #include <stdio.h>
> > #include <stdlib.h>
> >
> > #include <rte_eal.h>
> > #include <rte_mbuf.h>
> >
> > int
> > main(int argc, char *argv[])
> > {
> > 	struct rte_mempool *mbuf_pool;
> > 	unsigned mbuf_pool_size = 2097151;
> >
> > 	int ret = rte_eal_init(argc, argv);
> > 	if (ret < 0)
> > 		rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
> >
> > 	printf("Creating mbuf pool size=%u\n", mbuf_pool_size);
> > 	/* cache_size=256, priv_size=0, default data room, socket 0 */
> > 	mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", mbuf_pool_size,
> > 		256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, 0);
> >
> > 	/* NULL here means the pool could not be created */
> > 	printf("mbuf_pool %p\n", (void *)mbuf_pool);
> >
> > 	return 0;
> > }
> >
> > Best regards,
> > BL
>
> Regards,
> Olivier