* [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation
@ 2019-12-23 11:09 Bao-Long Tran
2019-12-26 15:45 ` Olivier Matz
0 siblings, 1 reply; 7+ messages in thread
From: Bao-Long Tran @ 2019-12-23 11:09 UTC (permalink / raw)
To: anatoly.burakov, olivier.matz, arybchenko; +Cc: dev, users, ricudis
Hi,
I'm not sure if this is a bug, but I've seen an inconsistency in the behavior
of DPDK with regards to hugepage allocation for rte_mempool. Basically, for the
same mempool size, the number of hugepages allocated changes from run to run.
Here's how I reproduce it with DPDK 19.11, IOVA=pa (default):
1. Reserve 16x1G hugepages on socket 0
2. Replace examples/skeleton/basicfwd.c with the code below, build and run:
   make && ./build/basicfwd
3. At the same time, watch the number of hugepages allocated:
   watch -n.1 ls /dev/hugepages
4. Repeat step 2
If you can reproduce, you should see that for some runs, DPDK allocates 5
hugepages, other times it allocates 6. When it allocates 6, if you watch the
output from step 3, you will see that DPDK first tries to allocate 5 hugepages,
then unmaps all 5, retries, and gets 6.
For our use case, it's important that DPDK allocates the same number of
hugepages on every run so we can get reproducible results.
Studying the code, this seems to be the behavior of
rte_mempool_populate_default(). If I understand correctly, if the first try fails
to get 5 IOVA-contiguous pages, it retries, relaxing the IOVA-contiguous
condition, and eventually winds up with 6 hugepages.
Questions:
1. Why does the API sometimes fail to get IOVA contig mem, when hugepage memory
is abundant?
2. Why does the 2nd retry need N+1 hugepages?
Some insights for Q1: From my experiments, it seems that the IOVA of the first
hugepage is not guaranteed to be at the start of the IOVA space (understandably).
That could explain the retry when the IOVA of the first hugepage is near the end of
the IOVA space. But I have also seen situations where the 1st hugepage is near
the beginning of the IOVA space and it still failed the 1st time.
Here's the code:
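#include <stdio.h>
#include <stdlib.h>

#include <rte_eal.h>
#include <rte_mbuf.h>

int
main(int argc, char *argv[])
{
    struct rte_mempool *mbuf_pool;
    unsigned mbuf_pool_size = 2097151;

    /* Initialize the EAL; it parses the EAL command-line arguments. */
    int ret = rte_eal_init(argc, argv);
    if (ret < 0)
        rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");

    /* 2097151 mbufs, cache size 256, no private area, socket 0. */
    printf("Creating mbuf pool size=%u\n", mbuf_pool_size);
    mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", mbuf_pool_size,
        256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, 0);

    /* The pointer is NULL if the pool could not be created. */
    printf("mbuf_pool %p\n", (void *)mbuf_pool);

    return 0;
}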
Best regards,
BL
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation
2019-12-23 11:09 [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation Bao-Long Tran
@ 2019-12-26 15:45 ` Olivier Matz
2019-12-27 8:11 ` Olivier Matz
0 siblings, 1 reply; 7+ messages in thread
From: Olivier Matz @ 2019-12-26 15:45 UTC (permalink / raw)
To: Bao-Long Tran; +Cc: anatoly.burakov, arybchenko, dev, users, ricudis
Hi Bao-Long,
On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
> Hi,
>
> I'm not sure if this is a bug, but I've seen an inconsistency in the behavior
> of DPDK with regards to hugepage allocation for rte_mempool. Basically, for the
> same mempool size, the number of hugepages allocated changes from run to run.
>
> Here's how I reproduce with DPDK 19.11. IOVA=pa (default)
>
> 1. Reserve 16x1G hugepages on socket 0
> 2. Replace examples/skeleton/basicfwd.c with the code below, build and run
> make && ./build/basicfwd
> 3. At the same time, watch the number of hugepages allocated
> "watch -n.1 ls /dev/hugepages"
> 4. Repeat step 2
>
> If you can reproduce, you should see that for some runs, DPDK allocates 5
> hugepages, other times it allocates 6. When it allocates 6, if you watch the
> output from step 3., you will see that DPDK first try to allocate 5 hugepages,
> then unmap all 5, retry, and got 6.
I cannot reproduce it under the same conditions as yours (with 16 hugepages
on socket 0), but I think I can see a similar issue:
If I reserve at least 6 hugepages, the behavior seems reproducible (6 hugepages
are used). If I reserve 5 hugepages, it takes more time,
taking and releasing hugepages several times, and it finally succeeds with 5
hugepages.
> For our use case, it's important that DPDK allocate the same number of
> hugepages on every run so we can get reproducable results.
One possibility is to use the --legacy-mem EAL option. It will try to
reserve all hugepages first.
> Studying the code, this seems to be the behavior of
> rte_mempool_populate_default(). If I understand correctly, if the first try fail
> to get 5 IOVA-contiguous pages, it retries, relaxing the IOVA-contiguous
> condition, and eventually wound up with 6 hugepages.
No, I don't think the IOVA-contiguous constraint applies in your
case. This is what I see:
a- reserve 5 hugepages on socket 0, and start your patched basicfwd
b- it tries to allocate 2097151 objects of size 2304, pg_size = 1073741824
c- the total element size (with header) is 2304 + 64 = 2368
d- in rte_mempool_op_calc_mem_size_helper(), it calculates
   obj_per_page = 453438 (453438 * 2368 = 1073741184)
   mem_size = 4966058495
e- it tries to allocate 4966058495 bytes, which is less than 5 x 1G, with:
   rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
       mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY,
       align=64)
   For some reason, it fails: we can see that the number of mapped hugepages
   increases in /dev/hugepages, then returns to its original value.
   I don't think it should fail here.
f- then, it tries to allocate the biggest available contiguous zone. In
   my case, it is 1055291776 bytes (almost all of the only mapped hugepage).
   This is a second problem: if we call it again, it returns NULL, because
   it won't map another hugepage.
g- by luck, calling rte_mempool_populate_virt() allocates a small area
   (mempool header), which triggers the mapping of a new hugepage that
   will be used in the next loop, back at step d with a smaller mem_size.
> Questions:
> 1. Why does the API sometimes fail to get IOVA contig mem, when hugepage memory
> is abundant?
In my case, it looks like we have a bit less than 1G free at
the end of the heap, then we call rte_memzone_reserve_aligned(size=5G).
The allocator ends up mapping 5 pages (and fails), while only 4 are
needed.
Anatoly, do you have any idea? Shouldn't we take into account the amount
of free space at the end of the heap when expanding?
> 2. Why does the 2nd retry need N+1 hugepages?
When the first alloc fails, the mempool code tries to allocate in
several chunks which are not virtually contiguous. This is needed in
case the memory is fragmented.
Regards,
Olivier
* Re: [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation
2019-12-26 15:45 ` Olivier Matz
@ 2019-12-27 8:11 ` Olivier Matz
2019-12-27 10:05 ` Bao-Long Tran
0 siblings, 1 reply; 7+ messages in thread
From: Olivier Matz @ 2019-12-27 8:11 UTC (permalink / raw)
To: Bao-Long Tran; +Cc: anatoly.burakov, arybchenko, dev, users, ricudis
On Thu, Dec 26, 2019 at 04:45:24PM +0100, Olivier Matz wrote:
> Hi Bao-Long,
>
> On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
> > Hi,
> >
> > I'm not sure if this is a bug, but I've seen an inconsistency in the behavior
> > of DPDK with regards to hugepage allocation for rte_mempool. Basically, for the
> > same mempool size, the number of hugepages allocated changes from run to run.
> >
> > Here's how I reproduce with DPDK 19.11. IOVA=pa (default)
> >
> > 1. Reserve 16x1G hugepages on socket 0
> > 2. Replace examples/skeleton/basicfwd.c with the code below, build and run
> > make && ./build/basicfwd
> > 3. At the same time, watch the number of hugepages allocated
> > "watch -n.1 ls /dev/hugepages"
> > 4. Repeat step 2
> >
> > If you can reproduce, you should see that for some runs, DPDK allocates 5
> > hugepages, other times it allocates 6. When it allocates 6, if you watch the
> > output from step 3., you will see that DPDK first try to allocate 5 hugepages,
> > then unmap all 5, retry, and got 6.
>
> I cannot reproduce in the same conditions than yours (with 16 hugepages
> on socket 0), but I think I can see a similar issue:
>
> If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
> are used). If I reserve 5 hugepages, it takes more time,
> taking/releasing hugepages several times, and it finally succeeds with 5
> hugepages.
>
> > For our use case, it's important that DPDK allocate the same number of
> > hugepages on every run so we can get reproducable results.
>
> One possibility is to use the --legacy-mem EAL option. It will try to
> reserve all hugepages first.
Passing --socket-mem=5120,0 also does the job.
* Re: [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation
2019-12-27 8:11 ` Olivier Matz
@ 2019-12-27 10:05 ` Bao-Long Tran
2019-12-27 11:11 ` Olivier Matz
0 siblings, 1 reply; 7+ messages in thread
From: Bao-Long Tran @ 2019-12-27 10:05 UTC (permalink / raw)
To: Olivier Matz; +Cc: anatoly.burakov, arybchenko, dev, users, ricudis
Hi Olivier,
> On 27 Dec 2019, at 4:11 PM, Olivier Matz <olivier.matz@6wind.com> wrote:
>
> On Thu, Dec 26, 2019 at 04:45:24PM +0100, Olivier Matz wrote:
>> Hi Bao-Long,
>>
>> On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
>>> Hi,
>>>
>>> I'm not sure if this is a bug, but I've seen an inconsistency in the behavior
>>> of DPDK with regards to hugepage allocation for rte_mempool. Basically, for the
>>> same mempool size, the number of hugepages allocated changes from run to run.
>>>
>>> Here's how I reproduce with DPDK 19.11. IOVA=pa (default)
>>>
>>> 1. Reserve 16x1G hugepages on socket 0
>>> 2. Replace examples/skeleton/basicfwd.c with the code below, build and run
>>> make && ./build/basicfwd
>>> 3. At the same time, watch the number of hugepages allocated
>>> "watch -n.1 ls /dev/hugepages"
>>> 4. Repeat step 2
>>>
>>> If you can reproduce, you should see that for some runs, DPDK allocates 5
>>> hugepages, other times it allocates 6. When it allocates 6, if you watch the
>>> output from step 3., you will see that DPDK first try to allocate 5 hugepages,
>>> then unmap all 5, retry, and got 6.
>>
>> I cannot reproduce in the same conditions than yours (with 16 hugepages
>> on socket 0), but I think I can see a similar issue:
>>
>> If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
>> are used). If I reserve 5 hugepages, it takes more time,
>> taking/releasing hugepages several times, and it finally succeeds with 5
>> hugepages.
My apologies: I just checked again, and I was using DPDK 19.05, not 19.11 or master.
Let me see if I can repro my issue with 19.11. Sorry for the confusion.
I also saw your patch to reduce wasted memory (eba11e). It seems to resolve
the problem with the IOVA-contig constraint that I described in my first message.
I'll look into it to confirm.
If I cannot repro my issue (different number of hugepages) with 19.11, then on our
side we can upgrade to 19.11, and that's all we need for now. But let me also try
to repro the issue you described (multiple attempts to allocate hugepages).
Thanks,
BL
* Re: [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation
2019-12-27 10:05 ` Bao-Long Tran
@ 2019-12-27 11:11 ` Olivier Matz
2020-01-07 13:06 ` Burakov, Anatoly
0 siblings, 1 reply; 7+ messages in thread
From: Olivier Matz @ 2019-12-27 11:11 UTC (permalink / raw)
To: Bao-Long Tran; +Cc: anatoly.burakov, arybchenko, dev, users, ricudis
Hi Bao-Long,
On Fri, Dec 27, 2019 at 06:05:57PM +0800, Bao-Long Tran wrote:
> Hi Olivier,
>
> > On 27 Dec 2019, at 4:11 PM, Olivier Matz <olivier.matz@6wind.com> wrote:
> >
> > On Thu, Dec 26, 2019 at 04:45:24PM +0100, Olivier Matz wrote:
> >> Hi Bao-Long,
> >>
> >> On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
> >>> Hi,
> >>>
> >>> I'm not sure if this is a bug, but I've seen an inconsistency in the behavior
> >>> of DPDK with regards to hugepage allocation for rte_mempool. Basically, for the
> >>> same mempool size, the number of hugepages allocated changes from run to run.
> >>>
> >>> Here's how I reproduce with DPDK 19.11. IOVA=pa (default)
> >>>
> >>> 1. Reserve 16x1G hugepages on socket 0
> >>> 2. Replace examples/skeleton/basicfwd.c with the code below, build and run
> >>> make && ./build/basicfwd
> >>> 3. At the same time, watch the number of hugepages allocated
> >>> "watch -n.1 ls /dev/hugepages"
> >>> 4. Repeat step 2
> >>>
> >>> If you can reproduce, you should see that for some runs, DPDK allocates 5
> >>> hugepages, other times it allocates 6. When it allocates 6, if you watch the
> >>> output from step 3., you will see that DPDK first try to allocate 5 hugepages,
> >>> then unmap all 5, retry, and got 6.
> >>
> >> I cannot reproduce in the same conditions than yours (with 16 hugepages
> >> on socket 0), but I think I can see a similar issue:
> >>
> >> If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
> >> are used). If I reserve 5 hugepages, it takes more time,
> >> taking/releasing hugepages several times, and it finally succeeds with 5
> >> hugepages.
>
> My apology: I just checked again, I was using DPDK 19.05, not 19.11 or master.
> Let me try to see if I can repro my issue with 19.11. Sorry for the confusion.
>
> I also saw your patch to reduce wasted memory (eba11e). Seems like it resolves
> the problem with the IOVA-contig constraint that I described in my first message.
> I'll look into it to confirm.
>
> If I cannot repro my issue (different number of hugepages) with 19.11, from our
> side we can upgrade to 19.11 and that's all we need for now. But let me also try
> to repro the issue you described (multiple attempts to allocate hugepages).
OK, thanks.
Anyway, I think there is an issue on 19.11, and it is even worse with
2M hugepages. Let's say we reserve 500x 2M hugepages, and try to
allocate a mempool of 5G:
1/ mempool_populate tries to allocate one virtually contiguous block,
   which maps all 500 hugepages, then fails and unmaps them
2/ it tries to allocate the largest zone, which returns ~2MB
3/ this zone is added to the mempool, and for that, it allocates a
   mem_header struct, which triggers the mapping of a new page
4/ back to 1... until it fails after 3 mins
The memzone allocation of the "largest available area" does not have the
same semantics in every memory model (pre-mapped hugepages or
not). With dynamic hugepage mapping, it won't map any additional
hugepage.
To solve the issue, we could either change it to allocate all available
hugepages, or change mempool populate so that it does not use the "largest
available area" allocation and does the search itself.
* Re: [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation
2019-12-27 11:11 ` Olivier Matz
@ 2020-01-07 13:06 ` Burakov, Anatoly
2020-01-09 13:32 ` Olivier Matz
0 siblings, 1 reply; 7+ messages in thread
From: Burakov, Anatoly @ 2020-01-07 13:06 UTC (permalink / raw)
To: Olivier Matz, Bao-Long Tran; +Cc: arybchenko, dev, users, ricudis
On 27-Dec-19 11:11 AM, Olivier Matz wrote:
> Hi Bao-Long,
>
> On Fri, Dec 27, 2019 at 06:05:57PM +0800, Bao-Long Tran wrote:
>> Hi Olivier,
>>
>>> On 27 Dec 2019, at 4:11 PM, Olivier Matz <olivier.matz@6wind.com> wrote:
>>>
>>> On Thu, Dec 26, 2019 at 04:45:24PM +0100, Olivier Matz wrote:
>>>> Hi Bao-Long,
>>>>
>>>> On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
>>>>> Hi,
>>>>>
>>>>> I'm not sure if this is a bug, but I've seen an inconsistency in the behavior
>>>>> of DPDK with regards to hugepage allocation for rte_mempool. Basically, for the
>>>>> same mempool size, the number of hugepages allocated changes from run to run.
>>>>>
>>>>> Here's how I reproduce with DPDK 19.11. IOVA=pa (default)
>>>>>
>>>>> 1. Reserve 16x1G hugepages on socket 0
>>>>> 2. Replace examples/skeleton/basicfwd.c with the code below, build and run
>>>>> make && ./build/basicfwd
>>>>> 3. At the same time, watch the number of hugepages allocated
>>>>> "watch -n.1 ls /dev/hugepages"
>>>>> 4. Repeat step 2
>>>>>
>>>>> If you can reproduce, you should see that for some runs, DPDK allocates 5
>>>>> hugepages, other times it allocates 6. When it allocates 6, if you watch the
>>>>> output from step 3., you will see that DPDK first try to allocate 5 hugepages,
>>>>> then unmap all 5, retry, and got 6.
>>>>
>>>> I cannot reproduce in the same conditions than yours (with 16 hugepages
>>>> on socket 0), but I think I can see a similar issue:
>>>>
>>>> If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
>>>> are used). If I reserve 5 hugepages, it takes more time,
>>>> taking/releasing hugepages several times, and it finally succeeds with 5
>>>> hugepages.
>>
>> My apology: I just checked again, I was using DPDK 19.05, not 19.11 or master.
>> Let me try to see if I can repro my issue with 19.11. Sorry for the confusion.
>>
>> I also saw your patch to reduce wasted memory (eba11e). Seems like it resolves
>> the problem with the IOVA-contig constraint that I described in my first message.
>> I'll look into it to confirm.
>>
>> If I cannot repro my issue (different number of hugepages) with 19.11, from our
>> side we can upgrade to 19.11 and that's all we need for now. But let me also try
>> to repro the issue you described (multiple attempts to allocate hugepages).
>
> OK, thanks.
>
> Anyway, I think there is an issue on 19.11. And it is is even worse with
> 2M hugepages. Let's say we reserve 500x 2M hugepages, and try to
> allocate a mempool of 5G:
>
> 1/ mempool_populate tries to allocate in one virtually contiguous block,
> which maps all 500 hugepages, then fail, unmapping them
> 2/ it tries to allocate the largest zone, which returns ~2MB.
> 3/ this zone is added to the mempool, and for that, it allocates a
> mem_header struct, which triggers the mapping of a new page.
> 4/ Back to 1... until it fails after 3 mins
>
> The memzone allocation of "largest available area" does not have the
> same semantic depending on the memory model (pre-mapped hugepages or
> not). When using dynamic hugepage mapping, it won't map any additional
> hugepage.
>
> To solve the issue, we could either change it to allocate all available
> hugepages, or change mempool populate, by not using the "largest
> available area" allocation, doing the search by ourself.
Yep, this is currently an unsolved problem in
the allocator. I am not sure that either behavior is "more correct" than
the other, so I don't think allocating "all available" hugepages is more
correct than not doing it.
Besides, there's no reliable way to get the "biggest" chunk of memory,
because while you might get *some* memory from 2M pages, there's no
guarantee that the amount you could get from 1G pages isn't bigger. So we
either momentarily take over all of the user's memory and figure out
what we need and what we don't, or we use the first available page size
and hope that it's enough.
That said, there's an internal API to allocate "up to X" pages, so in
principle, we could build this kind of infrastructure.
>
>
>>
>>>>
>>>>> For our use case, it's important that DPDK allocate the same number of
>>>>> hugepages on every run so we can get reproducable results.
>>>>
>>>> One possibility is to use the --legacy-mem EAL option. It will try to
>>>> reserve all hugepages first.
>>>
>>> Passing --socket-mem=5120,0 also does the job.
>>>
>>
>>>>> Studying the code, this seems to be the behavior of
>>>>> rte_mempool_populate_default(). If I understand correctly, if the first try fail
>>>>> to get 5 IOVA-contiguous pages, it retries, relaxing the IOVA-contiguous
>>>>> condition, and eventually wound up with 6 hugepages.
>>>>
>>>> No, I think you don't have the IOVA-contiguous constraint in your
>>>> case. This is what I see:
>>>>
>>>> a- reserve 5 hugepages on socket 0, and start your patched basicfwd
>>>> b- it tries to allocate 2097151 objects of size 2304, pg_size = 1073741824
>>>> c- the total element size (with header) is 2304 + 64 = 2368
>>>> d- in rte_mempool_op_calc_mem_size_helper(), it calculates
>>>> obj_per_page = 453438 (453438 * 2368 = 1073741184)
>>>> mem_size = 4966058495
>>>> e- it tries to allocate 4966058495 bytes, which is less than 5 x 1G, with:
>>>> rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
>>>> mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY,
>>>> align=64)
>>>> For some reason, it fails: we can see that the number of mapped hugepages
>>>> increases in /dev/hugepages, then returns to its original value.
>>>> I don't think it should fail here.
>>>> f- then, it will try to allocate the biggest available contiguous zone. In
>>>> my case, it is 1055291776 bytes (almost all of the only mapped hugepage).
>>>> This is a second problem: if we call it again, it returns NULL, because
>>>> it won't map another hugepage.
>>>> g- by luck, calling rte_mempool_populate_virt() allocates a small area
>>>> (mempool header), and it triggers the mapping of a new hugepage, which
>>>> will be used in the next loop, back at step d with a smaller mem_size.
>>>>
>>
>>>>> Questions:
>>>>> 1. Why does the API sometimes fail to get IOVA contig mem, when hugepage memory
>>>>> is abundant?
>>>>
>>>> In my case, it looks like we have a bit less than 1G free at
>>>> the end of the heap, then we call rte_memzone_reserve_aligned(size=5G).
>>>> The allocator ends up mapping 5 pages (and failing), while only 4 are
>>>> needed.
>>>>
>>>> Anatoly, do you have any idea? Shouldn't we take into account the amount
>>>> of free space at the end of the heap when expanding?
>>>>
>>>>> 2. Why does the 2nd retry need N+1 hugepages?
>>>>
>>>> When the first alloc fails, the mempool code tries to allocate in
>>>> several chunks which are not virtually contiguous. This is needed in
>>>> case the memory is fragmented.
>>>>
>>>>> Some insights for Q1: From my experiments, seems like the IOVA of the first
>>>>> hugepage is not guaranteed to be at the start of the IOVA space (understandably).
>>>>> It could explain the retry when the IOVA of the first hugepage is near the end of
>>>>> the IOVA space. But I have also seen situations where the 1st hugepage is near
>>>>> the beginning of the IOVA space and it still failed the 1st time.
>>>>>
>>>>> Here's the code:
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>>
>>>>> #include <rte_eal.h>
>>>>> #include <rte_mbuf.h>
>>>>>
>>>>> int
>>>>> main(int argc, char *argv[])
>>>>> {
>>>>>     struct rte_mempool *mbuf_pool;
>>>>>     /* 2^21 - 1 mbufs */
>>>>>     unsigned mbuf_pool_size = 2097151;
>>>>>
>>>>>     int ret = rte_eal_init(argc, argv);
>>>>>     if (ret < 0)
>>>>>         rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
>>>>>
>>>>>     printf("Creating mbuf pool size=%u\n", mbuf_pool_size);
>>>>>     /* cache_size=256, priv_size=0, default data room, socket 0 */
>>>>>     mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", mbuf_pool_size,
>>>>>         256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, 0);
>>>>>
>>>>>     printf("mbuf_pool %p\n", (void *)mbuf_pool);
>>>>>
>>>>>     return 0;
>>>>> }
>>>>>
>>>>> Best regards,
>>>>> BL
>>>>
>>>> Regards,
>>>> Olivier
>>
>> Thanks,
>> BL
>>
>
--
Thanks,
Anatoly
* Re: [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation
2020-01-07 13:06 ` Burakov, Anatoly
@ 2020-01-09 13:32 ` Olivier Matz
0 siblings, 0 replies; 7+ messages in thread
From: Olivier Matz @ 2020-01-09 13:32 UTC (permalink / raw)
To: Burakov, Anatoly; +Cc: Bao-Long Tran, arybchenko, dev, users, ricudis
On Tue, Jan 07, 2020 at 01:06:01PM +0000, Burakov, Anatoly wrote:
> On 27-Dec-19 11:11 AM, Olivier Matz wrote:
> > Hi Bao-Long,
> >
> > On Fri, Dec 27, 2019 at 06:05:57PM +0800, Bao-Long Tran wrote:
> > > Hi Olivier,
> > >
> > > > On 27 Dec 2019, at 4:11 PM, Olivier Matz <olivier.matz@6wind.com> wrote:
> > > >
> > > > On Thu, Dec 26, 2019 at 04:45:24PM +0100, Olivier Matz wrote:
> > > > > Hi Bao-Long,
> > > > >
> > > > > On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I'm not sure if this is a bug, but I've seen an inconsistency in the behavior
> > > > > > of DPDK with regards to hugepage allocation for rte_mempool. Basically, for the
> > > > > > same mempool size, the number of hugepages allocated changes from run to run.
> > > > > >
> > > > > > Here's how I reproduce with DPDK 19.11. IOVA=pa (default)
> > > > > >
> > > > > > 1. Reserve 16x1G hugepages on socket 0
> > > > > > 2. Replace examples/skeleton/basicfwd.c with the code below, build and run
> > > > > > make && ./build/basicfwd
> > > > > > 3. At the same time, watch the number of hugepages allocated
> > > > > > "watch -n.1 ls /dev/hugepages"
> > > > > > 4. Repeat step 2
> > > > > >
> > > > > > If you can reproduce, you should see that for some runs, DPDK allocates 5
> > > > > > hugepages, other times it allocates 6. When it allocates 6, if you watch the
> > > > > > output from step 3., you will see that DPDK first tries to allocate 5 hugepages,
> > > > > > then unmaps all 5, retries, and gets 6.
> > > > >
> > > > > I cannot reproduce in the same conditions as yours (with 16 hugepages
> > > > > on socket 0), but I think I can see a similar issue:
> > > > >
> > > > > If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
> > > > > are used). If I reserve 5 hugepages, it takes more time,
> > > > > taking/releasing hugepages several times, and it finally succeeds with 5
> > > > > hugepages.
> > >
> > > My apologies: I just checked again, I was using DPDK 19.05, not 19.11 or master.
> > > Let me try to see if I can repro my issue with 19.11. Sorry for the confusion.
> > >
> > > I also saw your patch to reduce wasted memory (eba11e). Seems like it resolves
> > > the problem with the IOVA-contig constraint that I described in my first message.
> > > I'll look into it to confirm.
> > >
> > > If I cannot repro my issue (different number of hugepages) with 19.11, from our
> > > side we can upgrade to 19.11 and that's all we need for now. But let me also try
> > > to repro the issue you described (multiple attempts to allocate hugepages).
> >
> > OK, thanks.
> >
> > Anyway, I think there is an issue on 19.11. And it is even worse with
> > 2M hugepages. Let's say we reserve 500x 2M hugepages, and try to
> > allocate a mempool of 5G:
> >
> > 1/ mempool_populate tries to allocate in one virtually contiguous block,
> > which maps all 500 hugepages, then fails, unmapping them
> > 2/ it tries to allocate the largest zone, which returns ~2MB.
> > 3/ this zone is added to the mempool, and for that, it allocates a
> > mem_header struct, which triggers the mapping of a new page.
> > 4/ Back to 1... until it fails after 3 mins
> >
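As a rough illustration of why this loop takes so long, here is a toy model
in C of steps 1/ to 4/. It is only a sketch: the struct and function names
are made up for this mail, it assumes that the large attempt maps every
remaining free page before failing, that each round keeps exactly one more
page, and that the request is always bigger than what is reserved. It counts
page mappings, it does not measure time.

```c
#include <assert.h>

/* Hypothetical counters for the toy model, not a DPDK structure. */
struct loop_stats {
    unsigned rounds;        /* trips through steps 1/ to 4/ */
    unsigned long map_ops;  /* page mappings performed overall */
};

/* Toy model of the retry loop; assumes the request never fits. */
static struct loop_stats
simulate_retry_loop(unsigned total_pages)
{
    struct loop_stats st = { 0, 0 };
    unsigned held = 0; /* pages kept by the mempool so far */

    assert(total_pages > 0);
    while (held < total_pages) {
        /* 1/ the big attempt maps all free pages, fails, unmaps them */
        st.map_ops += total_pages - held;
        /* 2/ and 3/ about one page worth of memory is kept */
        st.map_ops += 1;
        held++;
        /* 4/ back to 1/ */
        st.rounds++;
    }
    return st;
}
```

With total_pages = 500, this model runs 500 rounds and counts 125750 page
mappings, so under these assumptions the cost grows roughly quadratically
with the number of reserved pages.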
> > The memzone allocation of "largest available area" does not have the
> > same semantic depending on the memory model (pre-mapped hugepages or
> > not). When using dynamic hugepage mapping, it won't map any additional
> > hugepage.
> >
> > To solve the issue, we could either change it to allocate all available
> > hugepages, or change mempool populate so that it does not use the "largest
> > available area" allocation and does the search itself.
>
> Yep, this is one of the things that is currently an unsolved problem in the
> allocator. I am not sure if any one behavior is "more correct" than the
> other, so I don't think allocating "all available" hugepages is more correct
> than not doing it.
>
> Besides, there's no reliable way to get the "biggest" chunk of memory, because
> while you might get *some* memory from 2M pages, there's no guarantee that
> the amount you may get from 1G pages isn't bigger. So, we either momentarily
> take over the user's entire memory and figure out what we need and what we
> don't, or we use the first available page size and hope that that's enough.
>
> That said, there's an internal API to allocate "up to X" pages, so in
> principle, we could build this kind of infrastructure.
I tried to solve the issue in mempool, without using the memzone_alloc(size=0)
feature. See https://patches.dpdk.org/patch/64370/
>
> >
> >
> > >
> > > > >
> > > > > > For our use case, it's important that DPDK allocate the same number of
> > > > > > hugepages on every run so we can get reproducible results.
> > > > >
> > > > > One possibility is to use the --legacy-mem EAL option. It will try to
> > > > > reserve all hugepages first.
> > > >
> > > > Passing --socket-mem=5120,0 also does the job.
> > > >
> > >
> > > > > > Studying the code, this seems to be the behavior of
> > > > > > rte_mempool_populate_default(). If I understand correctly, if the first try fails
> > > > > > to get 5 IOVA-contiguous pages, it retries, relaxing the IOVA-contiguous
> > > > > > condition, and eventually ends up with 6 hugepages.
> > > > >
> > > > > No, I think you don't have the IOVA-contiguous constraint in your
> > > > > case. This is what I see:
> > > > >
> > > > > a- reserve 5 hugepages on socket 0, and start your patched basicfwd
> > > > > b- it tries to allocate 2097151 objects of size 2304, pg_size = 1073741824
> > > > > c- the total element size (with header) is 2304 + 64 = 2368
> > > > > d- in rte_mempool_op_calc_mem_size_helper(), it calculates
> > > > > obj_per_page = 453438 (453438 * 2368 = 1073741184)
> > > > > mem_size = 4966058495
> > > > > e- it tries to allocate 4966058495 bytes, which is less than 5 x 1G, with:
> > > > > rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
> > > > > mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY,
> > > > > align=64)
> > > > > For some reason, it fails: we can see that the number of mapped hugepages
> > > > > increases in /dev/hugepages, then returns to its original value.
> > > > > I don't think it should fail here.
> > > > > f- then, it will try to allocate the biggest available contiguous zone. In
> > > > > my case, it is 1055291776 bytes (almost all of the only mapped hugepage).
> > > > > This is a second problem: if we call it again, it returns NULL, because
> > > > > it won't map another hugepage.
> > > > > g- by luck, calling rte_mempool_populate_virt() allocates a small area
> > > > > (mempool header), and it triggers the mapping of a new hugepage, which
> > > > > will be used in the next loop, back at step d with a smaller mem_size.
> > > > >
> > >
> > > > > > Questions:
> > > > > > 1. Why does the API sometimes fail to get IOVA contig mem, when hugepage memory
> > > > > > is abundant?
> > > > >
> > > > > In my case, it looks like we have a bit less than 1G free at
> > > > > the end of the heap, then we call rte_memzone_reserve_aligned(size=5G).
> > > > > The allocator ends up mapping 5 pages (and failing), while only 4 are
> > > > > needed.
> > > > >
> > > > > Anatoly, do you have any idea? Shouldn't we take into account the amount
> > > > > of free space at the end of the heap when expanding?
> > > > >
> > > > > > 2. Why does the 2nd retry need N+1 hugepages?
> > > > >
> > > > > When the first alloc fails, the mempool code tries to allocate in
> > > > > several chunks which are not virtually contiguous. This is needed in
> > > > > case the memory is fragmented.
> > > > >
> > > > > > Some insights for Q1: From my experiments, seems like the IOVA of the first
> > > > > > hugepage is not guaranteed to be at the start of the IOVA space (understandably).
> > > > > > It could explain the retry when the IOVA of the first hugepage is near the end of
> > > > > > the IOVA space. But I have also seen situations where the 1st hugepage is near
> > > > > > the beginning of the IOVA space and it still failed the 1st time.
> > > > > >
> > > > > > Here's the code:
> > > > > > #include <stdio.h>
> > > > > > #include <stdlib.h>
> > > > > >
> > > > > > #include <rte_eal.h>
> > > > > > #include <rte_mbuf.h>
> > > > > >
> > > > > > int
> > > > > > main(int argc, char *argv[])
> > > > > > {
> > > > > >     struct rte_mempool *mbuf_pool;
> > > > > >     /* 2^21 - 1 mbufs */
> > > > > >     unsigned mbuf_pool_size = 2097151;
> > > > > >
> > > > > >     int ret = rte_eal_init(argc, argv);
> > > > > >     if (ret < 0)
> > > > > >         rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
> > > > > >
> > > > > >     printf("Creating mbuf pool size=%u\n", mbuf_pool_size);
> > > > > >     /* cache_size=256, priv_size=0, default data room, socket 0 */
> > > > > >     mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", mbuf_pool_size,
> > > > > >         256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, 0);
> > > > > >
> > > > > >     printf("mbuf_pool %p\n", (void *)mbuf_pool);
> > > > > >
> > > > > >     return 0;
> > > > > > }
> > > > > >
> > > > > > Best regards,
> > > > > > BL
> > > > >
> > > > > Regards,
> > > > > Olivier
> > >
> > > Thanks,
> > > BL
> > >
> >
>
>
> --
> Thanks,
> Anatoly
Thread overview: 7+ messages
2019-12-23 11:09 [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation Bao-Long Tran
2019-12-26 15:45 ` Olivier Matz
2019-12-27 8:11 ` Olivier Matz
2019-12-27 10:05 ` Bao-Long Tran
2019-12-27 11:11 ` Olivier Matz
2020-01-07 13:06 ` Burakov, Anatoly
2020-01-09 13:32 ` Olivier Matz