* Mempool bigger than 1 page causes segmentation fault
From: MOD @ 2022-07-27 11:59 UTC
To: users

Hi All,

My team and I have encountered a problem where allocation of a mempool
larger than 1GB (== 1 hugepage) fails.
We are in a multi-process environment, and the `rte_mempool_create` call
happens in the secondary process.

Sometimes the allocation succeeds, but after some successes (for me
specifically, two) the following occurs:
the secondary process segfaults in `malloc_elem_can_hold`, inside a call
stack starting from `rte_mempool_create`.

Restarting the secondary process does not work, as it gets stuck on
`EAL: Probing VFIO support`, and restarting the main process is the only
option.

Has anyone had this problem, or does anyone know of a possible solution?
Thanks!

* Re: Mempool bigger than 1 page causes segmentation fault
From: Dmitry Kozlyuk @ 2022-07-27 12:30 UTC
To: MOD; +Cc: users

2022-07-27 14:59 (UTC+0300), MOD:
> Hi All,
>
> My team and I have encountered a problem where allocation of a mempool
> larger than 1GB (== 1 Hugepage) fails.
> We are in a multi-process environment, and the `rte_mempool_create`
> happens in the secondary process.
>
> Sometimes the allocation succeeds but after some successes (for me
> specifically, two) the following occurs:
> the secondary process segfaults on `malloc_elem_can_hold`, inside a stack
> starting from `rte_mempool_create`.
>
> Restarting the secondary process does not work as it is stuck on `EAL:
> Probing VFIO support`, and restarting
> the main process is the only option.
>
> Has anyone had this problem, or knows any possible solution?
> Thanks!

Please tell us the DPDK version and attach the stack trace.
If possible, try rebuilding DPDK with RTE_MALLOC_DEBUG defined and,
if your DPDK version supports it, with AddressSanitizer enabled.
A segfault in a function that traverses the malloc element list
suggests the heap may be corrupted, but it's only a guess.

Restarting the secondary process after a segfault is hardly a viable idea,
because at this point the shared memory may already be corrupted and
some lock may have been taken and never released
(which is a possible reason it gets stuck, BTW).

* Re: Mempool bigger than 1 page causes segmentation fault
From: Dmitry Kozlyuk @ 2022-07-28 13:10 UTC
To: MOD; +Cc: users

2022-07-28 15:05 (UTC+0300), MOD:
> Hi, Thanks for the response!
> the DPDK version is 20.11.4
>
> the stack trace is:
> malloc_elem_can_hold() // librte_eal.so.21
> find_suitable_element() // librte_eal.so.21
> malloc_heap_alloc() // librte_eal.so.21
> rte_memzone_reserve_thread_safe() // librte_eal.so.21
> rte_mempool_populate_default() // librte_mempool.so.21
> rte_mempool_create() // librte_mempool.so.21

Is this all the info, with no arguments and no lines?
You're using a debug build of DPDK, right?

> RTE_MALLOC_DEBUG doesn't seem to change anything,
> but I noticed that I have been wrong about the allocation succeeding
> (not because of RTE_MALLOC_DEBUG)
>
> the error happens right on the first attempt.

Did you try running with ASAN (meson -Db_sanitize=address)?

Can you provide a short program that reproduces it,
or does it happen only in a larger program?

Please keep Cc: users@dpdk.org so that more people can join if they want.

* Re: Mempool bigger than 1 page causes segmentation fault
From: MOD @ 2022-07-31 11:32 UTC
To: Dmitry Kozlyuk; +Cc: users

Hi,
The issue is probably not with my code but with the compilation of DPDK,
because I managed to reproduce it in a separate program
where I set up an EAL with the flags `-l 1 --no-pci`
(just rte_eal_init and rte_mempool_create).

This seems to be a memseg_list issue.
When running the program above and requesting large amounts of memory
(200M elements of 8 bytes each),
I don't crash, but I get a `couldnt find suitable memseg_list` error.
This also happens when trying to allocate from the main process.

This error is probably related to these parameters from rte_config.h:

    /* EAL defines */
    #define RTE_MAX_HEAPS 32
    #define RTE_MAX_MEMSEG_LISTS 128
    #define RTE_MAX_MEMSEG_PER_LIST 8192
    #define RTE_MAX_MEM_MB_PER_LIST 32768
    #define RTE_MAX_MEMSEG_PER_TYPE 32768
    #define RTE_MAX_MEM_MB_PER_TYPE 65536
    #define RTE_MAX_MEMZONE 2560
    #define RTE_MAX_TAILQ 32

I could not find good documentation on how to calculate the proper
values for these parameters.

* Re: Mempool bigger than 1 page causes segmentation fault
From: MOD @ 2022-09-07 11:00 UTC
To: Dmitry Kozlyuk; +Cc: users

Hello,
This is an update on this bug investigation, as I have had time to look at
it again.

I have created an example program (code below) and tried it with a debug
build and RTE_MALLOC_DEBUG, using DPDK 20.11 and 22.07.
The results are the same for both versions and are also shown below.

I now suspect it could be a bug in DPDK's dynamic memory mode (it doesn't
happen in legacy mode) and may be related to a long allocation time causing
a timeout.
The application code is very minimal and should, at most, get an error
from `rte_mempool_create`.

More information about the system, firmware and DPDK compilation can be
provided if it may be relevant.

The primary process code:

    #include <rte_eal.h>
    #include <cstdio>

    int main(void)
    {
        /* Note: rte_eal_init() treats argv[0] as the program name,
         * so "-l" is not parsed as an option here. */
        const char *flags[] = {"-l", "1", "--no-pci"};
        rte_eal_init(sizeof(flags) / sizeof(char *),
                     const_cast<char **>(flags));
        printf("primary started");
        while (true) {}
        return 0;
    }

The secondary process code:

    #include <rte_eal.h>
    #include <rte_mempool.h>
    #include <cstdio>

    int main(void)
    {
        /* Same argv[0] caveat as in the primary process. */
        const char *flags[] = {"-l", "1", "--no-pci",
                               "--proc-type", "secondary"};
        rte_eal_init(sizeof(flags) / sizeof(char *),
                     const_cast<char **>(flags));
        /* 150M elements * 40B = 6GB mempool */
        rte_mempool *pool = rte_mempool_create("my_pool", 150000000, 40, 0, 0,
                                               NULL, NULL, NULL, NULL, 0, 0);
        if (pool) {
            printf("allocation success");
        } else {
            printf("allocation failure");
        }
        fflush(stdout);
        return 0;
    }

The result in the primary process:

    EAL: Detected CPU lcores: 96
    EAL: Detected NUMA nodes: 2
    EAL: Detected shared linkage of DPDK
    EAL: Multi-Process socket /var/run/dpdk/rte/mp_socket
    EAL: Selected IOVA mode 'PA'
    TELEMETRY: No legacy callbacks, legacy socket not created
    primary started

The results in the secondary process:

    EAL: Detected CPU lcores: 96
    EAL: Detected NUMA nodes: 2
    EAL: Detected shared linkage of DPDK
    EAL: Multi-Process socket /var/run/dpdk/rte/mp_socket_.......
    EAL: Selected IOVA mode 'PA'
    EAL: Request timed out // <--------------- This is the rte_mempool_create
    EAL: Request timed out
    EAL: Request timed out
    *** crashes with retcode 139

The main process looks fine from the CLI, but the secondary will not be able
to start again (it gets stuck at EAL: Selected IOVA mode 'PA').

What should my next step be, as far as debugging / solving / reporting this?