Failure while allocating 1GB hugepages

DPDK usage discussions

* Failure while allocating 1GB hugepages
@ 2024-05-10  9:33 Antonio Di Bacco
  2024-05-10 15:07 ` Dmitry Kozlyuk
  0 siblings, 1 reply; 7+ messages in thread
From: Antonio Di Bacco @ 2024-05-10  9:33 UTC
  To: users

I have 16 free 1 GB hugepages per NUMA node on a 4-node system:

[user@node-1 hugepages]$ cat /sys/devices/system/node/*/hugepages/hugepages-1048576kB/free_hugepages
16
16
16
16

With the following program on DPDK 21.11, I can sometimes allocate
a few pages, but most of the time the allocation fails. I also tried
removing the rtemap_* files under /dev/hugepages.
Is rte_memzone_reserve_aligned() always supposed to use a new page?

#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

#include <rte_eal.h>
#include <rte_errno.h>
#include <rte_memzone.h>

int main(int argc, char **argv)
{
    const struct rte_memzone *mz;
    int ret;

    printf("pid: %d\n", getpid());

    // Initialize EAL
    ret = rte_eal_init(argc, argv);
    if (ret < 0) {
        fprintf(stderr, "Error with EAL initialization\n");
        return -1;
    }

    for (int socket = 0; socket < 4; socket++) {
        for (int i = 0; i < 16; i++) {
            // Reserve one 1 GB, 1 GB-aligned, IOVA-contiguous memzone
            char name[32];

            snprintf(name, sizeof(name), "my_memzone%d-%d", i, socket);
            mz = rte_memzone_reserve_aligned(name, 1ULL << 30, socket,
                                             RTE_MEMZONE_IOVA_CONTIG, 1ULL << 30);
            if (mz == NULL) {
                printf("errno %s\n", rte_strerror(rte_errno));
                fprintf(stderr, "Memory allocation failed\n");
                rte_eal_cleanup();
                return -1;
            }

            // iova and addr_64 are 64-bit integers, so print them with PRIx64
            printf("Memory allocated with name %s at socket %d physical "
                   "address: 0x%" PRIx64 ", addr %p addr64 %" PRIx64
                   " size: %zu\n",
                   name, mz->socket_id, mz->iova, mz->addr, mz->addr_64,
                   mz->len);
        }
    }

    // Clean up EAL
    rte_eal_cleanup();
    return 0;
}


* Re: Failure while allocating 1GB hugepages
  2024-05-10  9:33 Failure while allocating 1GB hugepages Antonio Di Bacco
@ 2024-05-10 15:07 ` Dmitry Kozlyuk
  2024-05-22 10:22   ` Antonio Di Bacco
  0 siblings, 1 reply; 7+ messages in thread
From: Dmitry Kozlyuk @ 2024-05-10 15:07 UTC
  To: Antonio Di Bacco; +Cc: users

Hi Antonio,

Does it succeed without RTE_MEMZONE_IOVA_CONTIG?
If so, does your system/app have ASLR enabled?

When the memzone size is 1G and the hugepage size is 1G,
two hugepages are required: one for the requested amount of memory
and one for the memory allocator element header,
which cannot fit into the same page.
I suspect that the two allocated hugepages get non-contiguous IOVAs
and that's why the function fails.
There are no useful logs in EAL to check this suspicion,
but you can hack elem_check_phys_contig() in malloc_elem.c.
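
The page count can be worked out with simple arithmetic. The sketch below is plain C with no DPDK dependency. The function name pages_spanned() and the 128-byte header size are assumptions made for illustration only (the real header is struct malloc_elem, whose size depends on the build), so treat the numbers as indicative rather than exact.

```c
#include <stdint.h>

/* Hypothetical stand-in for the allocator's per-element header size.
 * The real value depends on struct malloc_elem and its padding. */
#define ASSUMED_ELEM_HEADER 128ULL

/* Number of hugepages of size pgsz needed to back `size` bytes of
 * payload plus one allocator header stored in front of it. */
static uint64_t pages_spanned(uint64_t size, uint64_t pgsz)
{
    uint64_t total = size + ASSUMED_ELEM_HEADER;

    return (total + pgsz - 1) / pgsz; /* round up to whole pages */
}
```

Under these assumptions, a 1G request on 1G pages spans two pages even though the payload alone would fit in one, and both pages then have to pass the contiguity check.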


* Re: Failure while allocating 1GB hugepages
  2024-05-10 15:07 ` Dmitry Kozlyuk
@ 2024-05-22 10:22   ` Antonio Di Bacco
  2024-05-30 10:28     ` Antonio Di Bacco
  0 siblings, 1 reply; 7+ messages in thread
From: Antonio Di Bacco @ 2024-05-22 10:22 UTC
  To: Dmitry Kozlyuk; +Cc: users

That was really useful. Thx


* Re: Failure while allocating 1GB hugepages
  2024-05-22 10:22   ` Antonio Di Bacco
@ 2024-05-30 10:28     ` Antonio Di Bacco
  2024-05-30 15:00       ` Dmitry Kozlyuk
  0 siblings, 1 reply; 7+ messages in thread
From: Antonio Di Bacco @ 2024-05-30 10:28 UTC
  To: Dmitry Kozlyuk; +Cc: users

Just in case I need, let us say, 1.5 GB CONTIGUOUS memory zone,
would it be fine to use something like this as GRUB config in Linux?

default_hugepagesz=2G hugepagesz=2G hugepages=4



* Re: Failure while allocating 1GB hugepages
  2024-05-30 10:28     ` Antonio Di Bacco
@ 2024-05-30 15:00       ` Dmitry Kozlyuk
  2024-06-03 12:39         ` Antonio Di Bacco
  0 siblings, 1 reply; 7+ messages in thread
From: Dmitry Kozlyuk @ 2024-05-30 15:00 UTC
  To: Antonio Di Bacco; +Cc: users

2024-05-30 12:28 (UTC+0200), Antonio Di Bacco:
> Just in case I need, let us say, 1.5 GB CONTIGUOUS memory zone,
> would it be fine to use something like this as GRUB config in Linux?
> 
> default_hugepagesz=2G hugepagesz=2G hugepages=4

On x86, "hugepagesz" and "default_hugepagesz" may be either 2M or 1G.
There is no way to *guarantee* that there will be
two physically adjacent 1G hugepages forming 1.5GB contiguous space,
but in practice these options, with the above correction, will do.

Note that by default the kernel will spread hugepages across NUMA nodes.
You can control this with a more elaborate form of the "hugepages" option:

	https://docs.kernel.org/admin-guide/mm/hugetlbpage.html


* Re: Failure while allocating 1GB hugepages
  2024-05-30 15:00       ` Dmitry Kozlyuk
@ 2024-06-03 12:39         ` Antonio Di Bacco
  2024-06-04 22:50           ` Dmitry Kozlyuk
  0 siblings, 1 reply; 7+ messages in thread
From: Antonio Di Bacco @ 2024-06-03 12:39 UTC
  To: Dmitry Kozlyuk; +Cc: users

Hi,
I have the same behaviour with the code in this message.

The first rte_memzone_reserve_aligned() call requesting 1.5GB
contiguous memory always fails, while the second one is always
successful.

It seems that in eal_memalloc_is_contig() the 'msl->memseg_arr' items are inverted:
with the sequence FC0000000, F80000000 the allocation fails,
while the sequence F80000000, FC0000000 is fine.
From my understanding, 'msl->memseg_arr' comes from
'rte_eal_get_configuration()->mem_config', which belongs to the rte_config
declared in eal_common_config.c.

Is there an explanation for this swinging behaviour?

Br,


Here is the source code:

#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

#include <rte_eal.h>
#include <rte_errno.h>
#include <rte_lcore.h>
#include <rte_memzone.h>

int main(int argc, char **argv)
{
    const struct rte_memzone *mz;

    if (rte_eal_init(argc, argv) < 0)
        return -1;

    printf("Allocating : 1.5GB\n");

    mz = rte_memzone_reserve_aligned("my_huge_mem", 0x60000000,
            rte_socket_id(), RTE_MEMZONE_1GB | RTE_MEMZONE_IOVA_CONTIG,
            RTE_CACHE_LINE_SIZE);
    if (mz == NULL) {
        printf("  Fail(1): errno %s\n", rte_strerror(rte_errno));

        /* Second attempt with identical arguments */
        mz = rte_memzone_reserve_aligned("my_huge_mem", 0x60000000,
                rte_socket_id(), RTE_MEMZONE_1GB | RTE_MEMZONE_IOVA_CONTIG,
                RTE_CACHE_LINE_SIZE);
        if (mz == NULL) {
            printf("  Fail(2): errno %s\n", rte_strerror(rte_errno));
            rte_eal_cleanup();
            return -2;
        }
    }

    /* mz->iova is a 64-bit integer, not a pointer */
    printf("  Success: phy[0x%" PRIx64 "] size[%zu]\n", mz->iova, mz->len);
    rte_memzone_free(mz);
    rte_eal_cleanup();
    return 0;
}

I added two RTE_LOG notices to eal_memalloc_is_contig() in
eal_common_memalloc.c, around line 324, after each rte_fbarray_get()
call, to print ms->iova:

    /* skip first iteration */
    ms = rte_fbarray_get(&msl->memseg_arr, start_seg);
    RTE_LOG(NOTICE, EAL, "memseg_arr[0] = %lX \n", ms->iova); // DEBUG
    cur = ms->iova;
    expected = cur + pgsz;

    /* if we can't access IOVA addresses, assume non-contiguous */
    if (cur == RTE_BAD_IOVA)
        return false;

    for (cur_seg = start_seg + 1; cur_seg < end_seg;
            cur_seg++, expected += pgsz) {
        ms = rte_fbarray_get(&msl->memseg_arr, cur_seg);
        RTE_LOG(NOTICE, EAL, "memseg_arr[%d] = %lX \n", cur_seg, ms->iova); // DEBUG

        if (ms->iova != expected)
            return false;
    }

The output is:

Allocating : 1.5GB
EAL: memseg_arr[0] = FC0000000
EAL: memseg_arr[1] = F80000000
  Fail(1): errno Cannot allocate memory
EAL: memseg_arr[0] = F80000000
EAL: memseg_arr[1] = FC0000000
EAL: memseg_arr[0] = F80000000
EAL: memseg_arr[1] = FC0000000
EAL: memseg_arr[0] = F80000000
EAL: memseg_arr[1] = FC0000000
EAL: memseg_arr[0] = F80000000
EAL: memseg_arr[1] = FC0000000
  Success: phy[0xfa0000000] size[1610612736]
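
To make the order dependence concrete, here is a standalone sketch of the same in-order comparison, applied to a plain array of addresses. It is not DPDK code: iova_array_is_contig() is a hypothetical name, and the array stands in for the entries of msl->memseg_arr.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when iova[0..n-1] describe pgsz-sized pages whose
 * physical addresses increase by exactly pgsz from one array slot to
 * the next. The comparison is order-sensitive, like the loop in the
 * instrumented snippet above. */
static bool iova_array_is_contig(const uint64_t *iova, size_t n, uint64_t pgsz)
{
    uint64_t expected;
    size_t i;

    if (n == 0)
        return false;

    expected = iova[0] + pgsz;
    for (i = 1; i < n; i++, expected += pgsz) {
        if (iova[i] != expected)
            return false;
    }
    return true;
}
```

With a 1G page size, the pair { 0xFC0000000, 0xF80000000 } is rejected, while { 0xF80000000, 0xFC0000000 } is accepted, which matches the two cases in the log.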


* Re: Failure while allocating 1GB hugepages
  2024-06-03 12:39         ` Antonio Di Bacco
@ 2024-06-04 22:50           ` Dmitry Kozlyuk
  0 siblings, 0 replies; 7+ messages in thread
From: Dmitry Kozlyuk @ 2024-06-04 22:50 UTC
  To: Antonio Di Bacco; +Cc: users

2024-06-03 14:39 (UTC+0200), Antonio Di Bacco:
> Hi,
> I have the same behaviour with the code in this message.
> 
> The first rte_memzone_reserve_aligned() call requesting 1.5GB
> contiguous memory always fails, while the second one is always
> successful.

Hi,

I can't explain the "always" part, but the unstable behavior comes from
the unpredictable IOVAs (physical addresses) that DPDK gets from the kernel.
On the first try:

1. DPDK has no 1G hugepages mapped, so it needs 2 more 1G hugepages.
   	alloc_pages_on_heap() -> eal_memalloc_alloc_seg_bulk()

2. DPDK asks the kernel for one 1G hugepage,
   the kernel maps the hugepage with IOVA = 0xFC0000000,
   DPDK stores it in memseg_arr[0].
	eal_memalloc_alloc_seg_bulk() -> alloc_seg()

3. Same for another hugepage, and memseg_arr[1]->iova = 0xF80000000.

4. DPDK checks if the pages are contiguous.
	alloc_pages_on_heap() -> eal_memalloc_is_contig() = false

5. Since it's a failure, DPDK frees the newly allocated pages.
	alloc_pages_on_heap() -> rollback_expand_heap()

On the second try:

6. Steps 1 and 2 repeat, but now memseg_arr[0]->iova = 0xF80000000.
7. Step 3 repeats, but now memseg_arr[1]->iova = 0xFC0000000.
8. IOVAs are contiguous, success.

Just a wild guess why the second try may be likely to succeed:
memseg_arr[1] with IOVA = 0xF80000000 is freed last at step 5,
so maybe this is why the kernel is likely to reuse this page first at step 6.

I'm afraid the simplest way to get PA-contiguous 1.5G reliably
is indeed to try several times.
The preferred way is to use an IOMMU and IOVA-as-VA if the HW permits.
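
A minimal sketch of the "try several times" approach follows. It is kept independent of DPDK so it can be compiled on its own: reserve_fn, reserve_with_retry() and fail_then_succeed() are hypothetical names, and in a real application the callback would wrap rte_memzone_reserve_aligned() with RTE_MEMZONE_IOVA_CONTIG. How many attempts are reasonable is an open choice, since each failed attempt rolls back the newly mapped pages.

```c
#include <stddef.h>

/* Callback that attempts one reservation and returns a non-NULL handle
 * on success. In a DPDK application it would wrap
 * rte_memzone_reserve_aligned() (assumption, not shown here). */
typedef const void *(*reserve_fn)(void *ctx);

/* Calls try_once up to max_attempts times and returns the first
 * non-NULL result, or NULL if every attempt failed. */
static const void *reserve_with_retry(reserve_fn try_once, void *ctx,
                                      unsigned int max_attempts)
{
    unsigned int i;

    for (i = 0; i < max_attempts; i++) {
        const void *handle = try_once(ctx);

        if (handle != NULL)
            return handle;
    }
    return NULL;
}

/* Stand-in callback for demonstration: fails until the counter that
 * ctx points to reaches zero, then returns a dummy handle. */
static const void *fail_then_succeed(void *ctx)
{
    static const int token = 0;
    unsigned int *remaining_failures = ctx;

    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return NULL;
    }
    return &token;
}
```

With a counter of 1 and three allowed attempts, the stand-in reproduces the pattern reported in this thread: the first call fails and the second succeeds.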

> It seems that in eal_memalloc_is_contig() the 'msl->memseg_arr' items are inverted:
> with the sequence FC0000000, F80000000 the allocation fails,
> while the sequence F80000000, FC0000000 is fine.
> From my understanding, 'msl->memseg_arr' comes from
> 'rte_eal_get_configuration()->mem_config', which belongs to the rte_config
> declared in eal_common_config.c.

Not quite, msl->memseg_arr content is dynamic, see above.

P.S. One may say DPDK could do better.
It does have N hugepages occupying a contiguous range of IOVA.
DPDK could make them VA-contiguous by remapping.
But this would be more work, it still wouldn't be 100% reliable,
and it would still be insecure and inflexible compared to an IOMMU.
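
The detection half of that idea can be sketched in a few lines: sort the page addresses and check whether they form one gap-free range, regardless of the order in which the kernel handed them out. This is only an illustration with hypothetical names (cmp_u64(), iova_set_covers_range()); it is not something DPDK does today, and the remapping of virtual addresses that would have to follow is not shown.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort() comparator for 64-bit addresses, ascending. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/* Sorts iova[] in place, then reports whether the pgsz-sized pages
 * cover a single physical range with no gaps. */
static bool iova_set_covers_range(uint64_t *iova, size_t n, uint64_t pgsz)
{
    size_t i;

    if (n == 0)
        return false;

    qsort(iova, n, sizeof(iova[0]), cmp_u64);
    for (i = 1; i < n; i++) {
        if (iova[i] != iova[i - 1] + pgsz)
            return false;
    }
    return true;
}
```

Applied to the failing case from the log, { 0xFC0000000, 0xF80000000 } with 1G pages, this check passes after sorting, which is why the pages could in principle be used if their virtual mappings were rearranged to match.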
