DPDK usage discussions
* Mempool bigger than 1 page causes segmentation fault
From: MOD @ 2022-07-27 11:59 UTC (permalink / raw)
  To: users

Hi All,

My team and I have encountered a problem where allocation of a mempool
larger than 1GB (== 1 Hugepage) fails.
We are in a multi-process environment, and the `rte_mempool_create`
happens in the secondary process.

Sometimes the allocation succeeds but after some successes (for me
specifically, two) the following occurs:
the secondary process segfaults on `malloc_elem_can_hold`, inside a stack
starting from `rte_mempool_create`.

Restarting the secondary process does not work as it is stuck on `EAL:
Probing VFIO support`, and restarting
the main process is the only option.

Has anyone had this problem, or does anyone know of a possible solution?
Thanks!

* Re: Mempool bigger than 1 page causes segmentation fault
From: Dmitry Kozlyuk @ 2022-07-27 12:30 UTC (permalink / raw)
  To: MOD; +Cc: users

2022-07-27 14:59 (UTC+0300), MOD:
> Hi All,
> 
> My team and I have encountered a problem where allocation of a mempool
> larger than 1GB (== 1 Hugepage) fails.
> We are in a multi-process environment, and the `rte_mempool_create`
> happens in the secondary process.
> 
> Sometimes the allocation succeeds but after some successes (for me
> specifically, two) the following occurs:
> the secondary process segfaults on `malloc_elem_can_hold`, inside a stack
> starting from `rte_mempool_create`.
> 
> Restarting the secondary process does not work as it is stuck on `EAL:
> Probing VFIO support`, and restarting
> the main process is the only option.
> 
> Has anyone had this problem, or does anyone know of a possible solution?
> Thanks!

Please tell us the DPDK version and attach the stack trace.

If possible, try rebuilding DPDK with RTE_MALLOC_DEBUG defined,
and if your DPDK version supports it, with AddressSanitizer enabled.
Segfault in a function that traverses the malloc element list
suggests the heap may be corrupted, but it's only a guess.

Restarting the secondary process after a segfault is hardly a viable idea
because at this point the shared memory may already be corrupted,
or some lock may have been taken and never released
(which is a possible reason it gets stuck, BTW).

* Re: Mempool bigger than 1 page causes segmentation fault
From: Dmitry Kozlyuk @ 2022-07-28 13:10 UTC (permalink / raw)
  To: MOD; +Cc: users

2022-07-28 15:05 (UTC+0300), MOD:
> Hi, Thanks for the response!
> the DPDK version is 20.11.4
> 
> the stack trace is:
> malloc_elem_can_hold() // librte_eal.so.21
> find_suitable_element() // librte_eal.so.21
> malloc_heap_alloc()  // librte_eal.so.21
> rte_memzone_reserve_thread_safe()  // librte_eal.so.21
> rte_mempool_populate_default()  // librte_mempool.so.21
> rte_mempool_create() // librte_mempool.so.21

Is this all the info---no arguments, no lines?
You're using a debug build of DPDK, right?
 
> RTE_MALLOC_DEBUG doesn't seem to change anything,
> but I noticed that I have been wrong about the allocation succeeding
> (not because of RTE_MALLOC_DEBUG)
> 
> the error happens right on the first attempt.

Did you try running with ASAN (meson -Db_sanitize=address)?

Can you provide a short program that reproduces it,
or does it happen only in a larger program?

Please keep Cc: users@dpdk.org so that more people can join if they want.

* Re: Mempool bigger than 1 page causes segmentation fault
From: MOD @ 2022-07-31 11:32 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: users

Hi,
The issue is probably not with my code but with the compilation of DPDK,
because I reproduced it in a separate program
where I set up the EAL with the flags `-l 1 --no-pci`
(just rte_eal_init and rte_mempool_create).

This seems to be a memseg_list issue.
When running the program above and requesting a large amount of memory
(200M elements of 8 bytes each),
I don't crash, but get a `couldnt find suitable memseg_list` error.
This also happens when trying to allocate from the main process.

This error is probably related to these parameters from rte_config.h:
/* EAL defines */
#define RTE_MAX_HEAPS 32
#define RTE_MAX_MEMSEG_LISTS 128
#define RTE_MAX_MEMSEG_PER_LIST 8192
#define RTE_MAX_MEM_MB_PER_LIST 32768
#define RTE_MAX_MEMSEG_PER_TYPE 32768
#define RTE_MAX_MEM_MB_PER_TYPE 65536
#define RTE_MAX_MEMZONE 2560
#define RTE_MAX_TAILQ 32


I could not find good documentation on how to calculate the proper values
for these parameters.



* Re: Mempool bigger than 1 page causes segmentation fault
From: MOD @ 2022-09-07 11:00 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: users

Hello,
This is an update on this bug investigation, as I have had time to look at
it again.
I have created an example program (code below) and tried it with debug and
RTE_MALLOC_DEBUG builds, using DPDK 20.11 and 22.07.
The results are the same, and are also included below.
I now suspect it could be a bug in DPDK dynamic memory mode (it doesn't
happen in legacy mode),
and it may be related to a long allocation time causing a timeout.
The application code is very minimal, and should at most get an error
at `rte_mempool_create`.

More information about the system, firmware and DPDK compilation can be
provided if it may be relevant.

The primary process code:
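#include <rte_eal.h>
#include <cstdio>

int main(void)
{
    // NOTE: EAL treats argv[0] as the program name, so "-l" itself
    // is not parsed as an option here.
    const char *flags[] = {"-l", "1", "--no-pci"};
    rte_eal_init(sizeof(flags) / sizeof(char *), const_cast<char **>(flags));
    printf("primary started");
    while (true) {} // keep the primary alive
    return 0;
}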

The secondary process code:
#include <rte_eal.h>
#include <rte_mempool.h>
#include <cstdio>

int main(void)
{
    // NOTE: EAL treats argv[0] as the program name, so "-l" itself
    // is not parsed as an option here.
    const char *flags[] = {"-l", "1", "--no-pci", "--proc-type", "secondary"};
    rte_eal_init(sizeof(flags) / sizeof(char *), const_cast<char **>(flags));

    // 150M elements * 40B = 6GB mempool
    rte_mempool *pool = rte_mempool_create("my_pool", 150000000, 40, 0, 0,
                                           NULL, NULL, NULL, NULL, 0, 0);
    if (pool) {
        printf("allocation success");
    } else {
        printf("allocation failure");
    }
    fflush(stdout);
    return 0;
}

The result in the primary process:
EAL: Detected CPU lcores: 96
EAL: Detected NUMA nodes: 2
EAL: Detected shared linkage of DPDK
EAL: Multi-Process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
TELEMETRY: No legacy callbacks, legacy socket not created
primary started

The results in the secondary process:
EAL: Detected CPU lcores: 96
EAL: Detected NUMA nodes: 2
EAL: Detected shared linkage of DPDK
EAL: Multi-Process socket /var/run/dpdk/rte/mp_socket_.......
EAL: Selected IOVA mode 'PA'
EAL: Request timed out // <---------------This is the rte_mempool_create
EAL: Request timed out
EAL: Request timed out
*** crashes with retcode 139

The main process looks fine from the CLI, but the secondary will not be
able to start again (it gets stuck at EAL: Selected IOVA mode 'PA').

What should my next step be as far as debugging, solving, or reporting this?


