https://bugs.dpdk.org/show_bug.cgi?id=1277

            Bug ID: 1277
           Summary: memory_hotplug_lock deadlock during initialization
           Product: DPDK
           Version: unspecified
          Hardware: All
                OS: All
            Status: UNCONFIRMED
          Severity: normal
          Priority: Normal
         Component: core
          Assignee: dev@dpdk.org
          Reporter: artemyko@nvidia.com
  Target Milestone: ---

It seems the issue arose due to changes in the DPDK read-write lock
implementation. Following these changes, the RW-lock no longer supports
recursion: a thread must not take a read lock if it already holds one.

The problem arises during initialization: rte_eal_memory_init() acquires
memory_hotplug_lock for reading, and later the call sequence
eal_memalloc_init() -> rte_memseg_list_walk() acquires it for reading again
without the first lock having been released. This can deadlock when another
thread or process requests the write lock on the same memory_hotplug_lock
between the two read acquisitions. The stack traces below are consistent
with this: one process is blocked in rte_mcfg_mem_write_lock() while another
is blocked in the nested rte_mcfg_mem_read_lock().

As a local workaround, we replaced rte_memseg_list_walk() with
rte_memseg_list_walk_thread_unsafe().

Reproduction:

Create an mp_deadlock directory under dpdk/examples/, then add main.c:

/* SPDX-License-Identifier: BSD-3-Clause
 * Copyright(c) 2010-2014 Intel Corporation
 */

#include <stdio.h>

#include <rte_eal.h>
#include <rte_debug.h>
#include <rte_lcore.h>
#include <rte_malloc.h>

/* Initialization of Environment Abstraction Layer (EAL). 8< */
int
main(int argc, char **argv)
{
    int ret;

    ret = rte_eal_init(argc, argv);
    if (ret < 0)
        rte_panic("Cannot init EAL\n");
    /* >8 End of initialization of Environment Abstraction Layer */

    if (rte_eal_process_type() == RTE_PROC_PRIMARY)
        /* The primary only keeps the shared memory alive. */
        getchar();
    else {
        if (rte_lcore_id() <= 1) {
            int i = 0;
            void *p;

            /* Allocate and free in a loop; rte_free() takes the
             * memory hotplug write lock (see the trace for pid 7334).
             */
            while (1) {
                p = rte_malloc_socket(NULL, 0x1000000, 0x1000, -1);
                rte_free(p);
                printf("malloc %d times\n", i++);
            }
        }
    }

    /* clean up the EAL */
    rte_eal_cleanup();

    return 0;
}

Compile:

I followed https://doc.dpdk.org/guides/prog_guide/build_app.html and some
tips from related web pages.

Run primary:

./examples/mp_deadlock/build/mp_deadlock -l 0 --file-prefix=dpdk1 --proc-type=primary

Run secondary 1:

./examples/mp_deadlock/build/mp_deadlock -l 1 --file-prefix=dpdk1 --proc-type=secondary

Run secondary 2:

while true
do
    ./examples/mp_deadlock/build/mp_deadlock -l 2 --file-prefix=dpdk1 --proc-type=secondary
done

Stack trace.

It looks like the following frames caused the deadlock:

#0  0x00007f850e97a3f2 in rte_mcfg_mem_write_lock () from /usr/local/lib64/librte_eal.so.23

and

#0  0x00007f3f591b5362 in rte_mcfg_mem_read_lock () from /usr/local/lib64/librte_eal.so.23

[root@fedora dpdk]# ps -ef | grep deadlock
root  7328  1004   0 20:47 pts/0  00:00:00 bash ./mp_deadlock1.sh
root  7329  7328   4 20:47 pts/0  00:00:01 ./examples/mp_deadlock/build/mp_deadlock -l 0 --file-prefix=dpdk1 --proc-type=primary
root  7333  5693   0 20:47 pts/4  00:00:00 bash ./mp_deadlock2.sh
root  7334  7333  94 20:47 pts/4  00:00:31 ./examples/mp_deadlock/build/mp_deadlock -l 1 --file-prefix=dpdk1 --proc-type=secondary
root  7337  5267   0 20:47 pts/1  00:00:00 bash ./mp_deadlock.sh
root  7338  7337  98 20:47 pts/1  00:00:29 ./examples/mp_deadlock/build/mp_deadlock -l 2 --file-prefix=dpdk1 --proc-type=secondary
root  7342  5480   0 20:47 pts/2  00:00:00 grep --color=auto deadlock

[root@fedora dpdk]# pstack 7329
Thread 4 (Thread 0x7f20ae487640 (LWP 7332) "telemetry-v2"):
#0  0x00007f20b200ae6f in accept () from /lib64/libc.so.6
#1  0x00007f20b1e004a3 in socket_listener () from /usr/local/lib64/librte_telemetry.so.23
#2  0x00007f20b1f85b17 in start_thread () from /lib64/libc.so.6
#3  0x00007f20b200a6a0 in clone3 () from /lib64/libc.so.6
Thread 3 (Thread 0x7f20aec88640 (LWP 7331) "rte_mp_handle"):
#0  0x00007f20b200b23d in recvmsg () from /lib64/libc.so.6
#1  0x00007f20b2137ecf in mp_handle () from /usr/local/lib64/librte_eal.so.23
#2  0x00007f20b1f85b17 in start_thread () from /lib64/libc.so.6
#3  0x00007f20b200a6a0 in clone3 () from /lib64/libc.so.6
Thread 2 (Thread 0x7f20af489640 (LWP 7330) "eal-intr-thread"):
#0  0x00007f20b2009c7e in epoll_wait () from /lib64/libc.so.6
#1  0x00007f20b2141c54 in eal_intr_thread_main () from /usr/local/lib64/librte_eal.so.23
#2  0x00007f20b1f85b17 in start_thread () from /lib64/libc.so.6
#3  0x00007f20b200a6a0 in clone3 () from /lib64/libc.so.6
Thread 1 (Thread 0x7f20b1df9900 (LWP 7329) "mp_deadlock"):
#0  0x00007f20b1ff984c in read () from /lib64/libc.so.6
#1  0x00007f20b1f7e914 in __GI__IO_file_underflow () from /lib64/libc.so.6
#2  0x00007f20b1f7f946 in _IO_default_uflow () from /lib64/libc.so.6
#3  0x00007f20b1f7a328 in getc () from /lib64/libc.so.6
#4  0x000000000040113e in main ()

[root@fedora dpdk]# pstack 7334
Thread 3 (Thread 0x7f850b4da640 (LWP 7336) "rte_mp_handle"):
#0  0x00007f850e85d23d in recvmsg () from /lib64/libc.so.6
#1  0x00007f850e989ecf in mp_handle () from /usr/local/lib64/librte_eal.so.23
#2  0x00007f850e7d7b17 in start_thread () from /lib64/libc.so.6
#3  0x00007f850e85c6a0 in clone3 () from /lib64/libc.so.6
Thread 2 (Thread 0x7f850bcdb640 (LWP 7335) "eal-intr-thread"):
#0  0x00007f850e85bc7e in epoll_wait () from /lib64/libc.so.6
#1  0x00007f850e993c54 in eal_intr_thread_main () from /usr/local/lib64/librte_eal.so.23
#2  0x00007f850e7d7b17 in start_thread () from /lib64/libc.so.6
#3  0x00007f850e85c6a0 in clone3 () from /lib64/libc.so.6
Thread 1 (Thread 0x7f850e64b900 (LWP 7334) "mp_deadlock"):
#0  0x00007f850e97a3f2 in rte_mcfg_mem_write_lock () from /usr/local/lib64/librte_eal.so.23
#1  0x00007f850e984509 in malloc_heap_free () from /usr/local/lib64/librte_eal.so.23
#2  0x00007f850e98508f in rte_free () from /usr/local/lib64/librte_eal.so.23
#3  0x0000000000401126 in main ()

[root@fedora dpdk]# pstack 7338
Thread 3 (Thread 0x7f3f55d15640 (LWP 7340) "rte_mp_handle"):
#0  0x00007f3f5909823d in recvmsg () from /lib64/libc.so.6
#1  0x00007f3f591c4ecf in mp_handle () from /usr/local/lib64/librte_eal.so.23
#2  0x00007f3f59012b17 in start_thread () from /lib64/libc.so.6
#3  0x00007f3f590976a0 in clone3 () from /lib64/libc.so.6
Thread 2 (Thread 0x7f3f56516640 (LWP 7339) "eal-intr-thread"):
#0  0x00007f3f59096c7e in epoll_wait () from /lib64/libc.so.6
#1  0x00007f3f591cec54 in eal_intr_thread_main () from /usr/local/lib64/librte_eal.so.23
#2  0x00007f3f59012b17 in start_thread () from /lib64/libc.so.6
#3  0x00007f3f590976a0 in clone3 () from /lib64/libc.so.6
Thread 1 (Thread 0x7f3f58e86900 (LWP 7338) "mp_deadlock"):
#0  0x00007f3f591b5362 in rte_mcfg_mem_read_lock () from /usr/local/lib64/librte_eal.so.23
#1  0x00007f3f591b6bf2 in rte_memseg_list_walk () from /usr/local/lib64/librte_eal.so.23
#2  0x00007f3f591d2f65 in eal_memalloc_init () from /usr/local/lib64/librte_eal.so.23
#3  0x00007f3f591b741b in rte_eal_memory_init () from /usr/local/lib64/librte_eal.so.23
#4  0x00007f3f591aab64 in rte_eal_init.cold () from /usr/local/lib64/librte_eal.so.23
#5  0x00000000004010d9 in main ()
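
The suspected interaction between the two blocked processes can be sketched with a toy model. This is not DPDK's rte_rwlock: toy_rwlock and every function below are hypothetical names introduced for illustration, and the writer-preferring behaviour (new readers are refused while a writer is waiting) is an assumption based on the description in this report. Try-lock calls stand in for the blocking calls so the sequence can be stepped through in a single thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy writer-preferring read-write lock (hypothetical, not DPDK code). */
struct toy_rwlock {
    int readers;          /* read locks currently held */
    bool writer_waiting;  /* a writer has asked for the lock and is waiting */
    bool writer_active;   /* a writer currently holds the lock */
};

/* Read acquire: refused while a writer holds the lock or is waiting. */
static bool toy_read_trylock(struct toy_rwlock *l)
{
    if (l->writer_active || l->writer_waiting)
        return false;
    l->readers++;
    return true;
}

static void toy_read_unlock(struct toy_rwlock *l)
{
    assert(l->readers > 0);
    l->readers--;
}

/* Write acquire: if readers are present, register as waiting and fail. */
static bool toy_write_trylock(struct toy_rwlock *l)
{
    if (l->writer_active)
        return false;
    if (l->readers > 0) {
        l->writer_waiting = true;
        return false;
    }
    l->writer_waiting = false;
    l->writer_active = true;
    return true;
}

/* Nested read lock, as in rte_eal_memory_init() -> rte_memseg_list_walk().
 * Returns 1 if the inner read is refused behind a waiting writer. */
static int nested_read_would_block(void)
{
    struct toy_rwlock lock = {0};
    bool outer = toy_read_trylock(&lock);   /* init path, first read lock */
    bool writer = toy_write_trylock(&lock); /* other process, e.g. rte_free() */
    bool inner = toy_read_trylock(&lock);   /* init path, nested read lock */

    /* The writer waits for 'outer'; 'inner' waits for the writer. */
    return outer && !writer && !inner;
}

/* Workaround shape: the inner walk takes no lock, so the outer read lock
 * is eventually released and the waiting writer can proceed.
 * Returns 1 if the writer gets the lock after the reader unlocks. */
static int unsafe_walk_lets_writer_proceed(void)
{
    struct toy_rwlock lock = {0};
    bool outer = toy_read_trylock(&lock);
    bool writer_first = toy_write_trylock(&lock);
    /* walk without locking happens here */
    toy_read_unlock(&lock);
    bool writer_second = toy_write_trylock(&lock);

    return outer && !writer_first && writer_second;
}
```

In this model both functions return 1: the nested read is refused once a writer is waiting, and dropping the inner lock lets the writer through. Whether the real lock behaves the same way should be checked against the rte_rwlock implementation in the affected DPDK version.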