* Failure while allocating 1GB hugepages
From: Antonio Di Bacco @ 2024-05-10 9:33 UTC
To: users

I have 16 free 1GB hugepages per NUMA node on a 4-node NUMA system:

[user@node-1 hugepages]$ cat /sys/devices/system/node/*/hugepages/hugepages-1048576kB/free_hugepages
16
16
16
16

Using the following program with DPDK 21.11, sometimes I can allocate
a few pages, but most of the time I cannot. I also tried removing the
rtemap_* files under /dev/hugepages.
Is rte_memzone_reserve_aligned always supposed to use a new page?

#include <stdio.h>
#include <inttypes.h>
#include <unistd.h>

#include <rte_eal.h>
#include <rte_memzone.h>
#include <rte_errno.h>

int main(int argc, char **argv)
{
    const struct rte_memzone *mz;
    int ret;

    printf("pid: %d\n", getpid());

    // Initialize EAL
    ret = rte_eal_init(argc, argv);
    if (ret < 0) {
        fprintf(stderr, "Error with EAL initialization\n");
        return -1;
    }

    for (int socket = 0; socket < 4; socket++) {
        for (int i = 0; i < 16; i++) {
            // Allocate memory using rte_memzone_reserve_aligned
            char name[32];
            snprintf(name, sizeof(name), "my_memzone%d-%d", i, socket);
            mz = rte_memzone_reserve_aligned(name, 1ULL << 30, socket,
                    RTE_MEMZONE_IOVA_CONTIG, 1ULL << 30);

            if (mz == NULL) {
                printf("errno %s\n", rte_strerror(rte_errno));
                fprintf(stderr, "Memory allocation failed\n");
                rte_eal_cleanup();
                return -1;
            }

            // iova and addr_64 are 64-bit integers, so print them with PRIx64
            printf("Memory allocated with name %s at socket %d physical address: 0x%" PRIx64
                   ", addr %p addr64 %" PRIx64 " size: %zu\n",
                   name, mz->socket_id, mz->iova, mz->addr, mz->addr_64, mz->len);
        }
    }

    // Clean up EAL
    rte_eal_cleanup();
    return 0;
}

* Re: Failure while allocating 1GB hugepages
From: Dmitry Kozlyuk @ 2024-05-10 15:07 UTC
To: Antonio Di Bacco; +Cc: users

2024-05-10 11:33 (UTC+0200), Antonio Di Bacco:
> Using the following program with dpdk 21.11, sometimes I can allocate
> a few pages but most of the time I cannot. I tried also to remove
> rtemap_* under /dev/hugepages.
> rte_memzone_reserve_aligned is always supposed to use a new page?
> [...]

Hi Antonio,

Does it succeed without RTE_MEMZONE_IOVA_CONTIG?
If so, does your system/app have ASLR enabled?

When the memzone size is 1G and the hugepage size is 1G,
two hugepages are required: one for the requested amount of memory,
and one for the memory allocator element header,
which obviously does not fit into the same page.
I suspect that the two allocated hugepages get non-contiguous IOVAs
and that is why the function fails.
There are no useful logs in EAL to check this suspicion,
but you can hack elem_check_phys_contig() in malloc_elem.c.
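
The page-count arithmetic in this reply can be illustrated with a minimal sketch. It assumes a hypothetical per-element overhead of 128 bytes (ELEM_OVERHEAD is an illustrative value, not DPDK's actual header size) and ignores alignment padding; hugepages_needed() is a name invented for this example and is not a DPDK function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical allocator overhead per element, for illustration only. */
#define ELEM_OVERHEAD 128u

/*
 * Number of hugepages needed to back `size` bytes of payload plus the
 * assumed overhead, rounding up to whole pages.
 */
static inline uint64_t hugepages_needed(uint64_t size, uint64_t page_size)
{
    assert(page_size > 0);
    uint64_t total = size + ELEM_OVERHEAD;
    return (total + page_size - 1) / page_size;
}
```

With these assumptions, a 1G request on 1G pages needs two pages, while a request that leaves room for the overhead fits in one. Whenever two pages are needed and RTE_MEMZONE_IOVA_CONTIG is set, the request can only succeed if the kernel happens to hand out physically adjacent pages.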

* Re: Failure while allocating 1GB hugepages
From: Antonio Di Bacco @ 2024-05-22 10:22 UTC
To: Dmitry Kozlyuk; +Cc: users

That was really useful. Thx

* Re: Failure while allocating 1GB hugepages
From: Antonio Di Bacco @ 2024-05-30 10:28 UTC
To: Dmitry Kozlyuk; +Cc: users

Just in case I need, let us say, a 1.5 GB CONTIGUOUS memory zone,
would it be fine to use something like this as GRUB config in Linux?

default_hugepagesz=2G hugepagesz=2G hugepages=4

* Re: Failure while allocating 1GB hugepages
From: Dmitry Kozlyuk @ 2024-05-30 15:00 UTC
To: Antonio Di Bacco; +Cc: users

2024-05-30 12:28 (UTC+0200), Antonio Di Bacco:
> Just in case I need, let us say, 1.5 GB CONTIGUOUS memory zone,
> would it be fine to use something like this as GRUB config in Linux?
>
> default_hugepagesz=2G hugepagesz=2G hugepages=4

On x86, "hugepagesz" and "default_hugepagesz" may be either 2M or 1G.
There is no way to *guarantee* that there will be
two physically adjacent 1G hugepages forming 1.5GB contiguous space,
but in practice these options, with the above correction, will do.

Note that by default the kernel will spread hugepages between NUMA nodes.
You can control this by a more elaborate form of "hugepages" option:

https://docs.kernel.org/admin-guide/mm/hugetlbpage.html

* Re: Failure while allocating 1GB hugepages
From: Antonio Di Bacco @ 2024-06-03 12:39 UTC
To: Dmitry Kozlyuk; +Cc: users

Hi,
I have the same behaviour with the code in this message.

The first rte_memzone_reserve_aligned() call requesting 1.5GB of
contiguous memory always fails, while the second one is always
successful.

It seems that in eal_memalloc_is_contig() the 'msl->memseg_arr' items are inverted:
when the sequence is FC0000000, F80000000 the allocation fails,
while the segment sequence F80000000, FC0000000 is fine.

From my understanding, 'msl->memseg_arr' comes from
'rte_eal_get_configuration()->mem_config;' which is rte_config
declared in eal_common_config.c

Is there an explanation for this inconsistent behaviour?

Br,

Here is the source code:

#include <stdio.h>
#include <inttypes.h>
#include <unistd.h>

#include <rte_eal.h>
#include <rte_memzone.h>
#include <rte_errno.h>

int main(int argc, char **argv)
{
    const struct rte_memzone *mz;

    if (rte_eal_init(argc, argv) < 0)
        return -1;

    printf("Allocating : 1.5GB\n");
    mz = rte_memzone_reserve_aligned("my_huge_mem", 0x60000000, rte_socket_id(),
            RTE_MEMZONE_1GB | RTE_MEMZONE_IOVA_CONTIG, RTE_CACHE_LINE_SIZE);
    if (mz == NULL) {
        printf(" Fail(1): errno %s\n", rte_strerror(rte_errno));
        /* Second attempt with exactly the same parameters */
        mz = rte_memzone_reserve_aligned("my_huge_mem", 0x60000000, rte_socket_id(),
                RTE_MEMZONE_1GB | RTE_MEMZONE_IOVA_CONTIG, RTE_CACHE_LINE_SIZE);
        if (mz == NULL) {
            printf(" Fail(2): errno %s\n", rte_strerror(rte_errno));
            return -2;
        }
    }
    /* mz->iova is a 64-bit integer, not a pointer */
    printf(" Success: phy[0x%" PRIx64 "] size[%zu]\n", mz->iova, mz->len);

    rte_memzone_free(mz);
    rte_eal_cleanup();
    return 0;
}

I added two RTE_LOG notices in eal_memalloc_is_contig() in
eal_common_memalloc.c, around line 324, after rte_fbarray_get(), to print
ms->iova:

    /* skip first iteration */
    ms = rte_fbarray_get(&msl->memseg_arr, start_seg);
    RTE_LOG(NOTICE, EAL, "memseg_arr[0] = %" PRIX64 " \n", ms->iova); // DEBUG
    cur = ms->iova;
    expected = cur + pgsz;

    /* if we can't access IOVA addresses, assume non-contiguous */
    if (cur == RTE_BAD_IOVA)
        return false;

    for (cur_seg = start_seg + 1; cur_seg < end_seg;
            cur_seg++, expected += pgsz) {
        ms = rte_fbarray_get(&msl->memseg_arr, cur_seg);
        RTE_LOG(NOTICE, EAL, "memseg_arr[%d] = %" PRIX64 " \n", cur_seg, ms->iova); // DEBUG
        if (ms->iova != expected)
            return false;
    }

The output is:

Allocating : 1.5GB
EAL: memseg_arr[0] = FC0000000
EAL: memseg_arr[1] = F80000000
 Fail(1): errno Cannot allocate memory
EAL: memseg_arr[0] = F80000000
EAL: memseg_arr[1] = FC0000000
EAL: memseg_arr[0] = F80000000
EAL: memseg_arr[1] = FC0000000
EAL: memseg_arr[0] = F80000000
EAL: memseg_arr[1] = FC0000000
EAL: memseg_arr[0] = F80000000
EAL: memseg_arr[1] = FC0000000
 Success: phy[0xfa0000000] size[1610612736]
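
The check performed by eal_memalloc_is_contig() can be mimicked outside DPDK to see why the order of the two segments matters. The following is a minimal standalone sketch, not the EAL code: iovas_are_contiguous() is a name invented for this example, and the IOVA values in the note below are the ones from the log in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns true if every IOVA equals the previous one plus page_size,
 * i.e. the segments are physically adjacent in array order.
 */
static bool iovas_are_contiguous(const uint64_t *iova, size_t n,
                                 uint64_t page_size)
{
    assert(iova != NULL && n > 0);
    uint64_t expected = iova[0] + page_size;

    for (size_t i = 1; i < n; i++, expected += page_size) {
        if (iova[i] != expected)
            return false;
    }
    return true;
}
```

Called with a 1G page size, the sequence FC0000000, F80000000 is rejected because the second page sits below the first, while F80000000, FC0000000 is accepted. The same two physical pages are involved in both cases; only the order in which they were mapped differs.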

* Re: Failure while allocating 1GB hugepages
From: Dmitry Kozlyuk @ 2024-06-04 22:50 UTC
To: Antonio Di Bacco; +Cc: users

2024-06-03 14:39 (UTC+0200), Antonio Di Bacco:
> Hi,
> I have the same behaviour with the code in this message.
>
> The first rte_memzone_reserve_aligned() call requesting 1.5GB
> contiguous memory always fails, while the second one is always
> successful.

Hi,

I can't explain the "always" part, but the unstable behavior comes from
the unpredictable IOVA (physical address) that DPDK gets from the kernel.

On the first try:

1. DPDK has no 1G hugepages mapped, it needs 2 more 1G hugepages.
   alloc_pages_on_heap() -> eal_memalloc_alloc_seg_bulk()
2. DPDK asks the kernel for one 1G hugepage,
   the kernel maps the hugepage with IOVA = 0xFC0000000,
   DPDK stores it in memseg_arr[0].
   eal_memalloc_alloc_seg_bulk() -> alloc_seg()
3. Same for another hugepage and memseg_arr[1]->iova = 0xF80000000.
4. DPDK checks if the pages are contiguous.
   alloc_pages_on_heap() -> eal_memalloc_is_contig() = false
5. Since it's a failure, DPDK frees the newly allocated pages.
   alloc_pages_on_heap() -> rollback_expand_heap()

On the second try:

6. Steps 1 and 2 repeat, but now memseg_arr[0]->iova = 0xF80000000.
7. Step 3 repeats, but now memseg_arr[1]->iova = 0xFC0000000.
8. IOVAs are contiguous, success.

Just a wild guess why the second try may be likely to succeed:
memseg_arr[1] with IOVA = 0xF80000000 is freed last at step 5,
so maybe this is why the kernel is likely to reuse this page at step 6.

I'm afraid the simplest way to get PA-contiguous 1.5G reliably
is indeed to try several times.
The preferred way is to use IOMMU and IOVA-as-VA if HW permits.

> It seems in eal_memalloc_is_contig() the 'msl->memseg_arr' items are inverted:
> when there is the sequence FC0000000, F80000000 the allocation fails,
> while the segments sequence F80000000, FC0000000 is fine.
>
> From my understanding 'msl->memseg_arr' comes from
> 'rte_eal_get_configuration()->mem_config;' which is rte_config
> declared in eal_common_config.c

Not quite, msl->memseg_arr content is dynamic, see above.

P.S. One may say DPDK could do better.
It does have N hugepages occupying a contiguous range of IOVA.
DPDK could make them VA-contiguous by remapping.
But this would be more work, it still wouldn't be 100% reliable,
and it would still be insecure and inflexible compared to IOMMU.
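
The "try several times" suggestion could look roughly like the sketch below. It is kept independent of DPDK so it can be compiled on its own: alloc_fn, alloc_with_retries() and fake_alloc() are hypothetical names introduced for this example, and fake_alloc() only simulates an allocator that fails a given number of times before succeeding.

```c
#include <assert.h>
#include <stddef.h>

/* One allocation attempt; returns NULL on failure. */
typedef void *(*alloc_fn)(void *ctx);

/*
 * Calls try_alloc up to max_tries times and returns the first non-NULL
 * result. The number of attempts made is stored in *tries_used if given.
 */
static void *alloc_with_retries(alloc_fn try_alloc, void *ctx,
                                unsigned max_tries, unsigned *tries_used)
{
    assert(try_alloc != NULL && max_tries > 0);
    void *p = NULL;
    unsigned n = 0;

    while (n < max_tries) {
        n++;
        p = try_alloc(ctx);
        if (p != NULL)
            break;
    }
    if (tries_used != NULL)
        *tries_used = n;
    return p;
}

/* Stand-in allocator: fails while the counter in ctx is above zero. */
static void *fake_alloc(void *ctx)
{
    static char zone;
    unsigned *remaining_failures = ctx;

    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return NULL;
    }
    return &zone;
}
```

In a DPDK application the callback would presumably wrap rte_memzone_reserve_aligned() with RTE_MEMZONE_IOVA_CONTIG and return the memzone pointer. A bounded retry count matters because, as noted above, adjacency of the pages handed out by the kernel is never guaranteed.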