From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id E5A2B41E65; Fri, 10 Mar 2023 18:31:56 +0100 (CET) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 885A540A8B; Fri, 10 Mar 2023 18:31:56 +0100 (CET) Received: from inbox.dpdk.org (inbox.dpdk.org [95.142.172.178]) by mails.dpdk.org (Postfix) with ESMTP id 14F9340685 for ; Fri, 10 Mar 2023 18:31:55 +0100 (CET) Received: by inbox.dpdk.org (Postfix, from userid 33) id EF09341E6C; Fri, 10 Mar 2023 18:31:54 +0100 (CET) From: bugzilla@dpdk.org To: dev@dpdk.org Subject: [Bug 1178] Deadlock with heap lock Date: Fri, 10 Mar 2023 17:31:54 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: new X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: DPDK X-Bugzilla-Component: core X-Bugzilla-Version: unspecified X-Bugzilla-Keywords: X-Bugzilla-Severity: normal X-Bugzilla-Who: maxime.coquelin@redhat.com X-Bugzilla-Status: UNCONFIRMED X-Bugzilla-Resolution: X-Bugzilla-Priority: Normal X-Bugzilla-Assigned-To: dev@dpdk.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: bug_id short_desc product version rep_platform op_sys bug_status bug_severity priority component assigned_to reporter target_milestone Message-ID: Content-Type: multipart/alternative; boundary=16784695140.a4256179F.3353905 Content-Transfer-Encoding: 7bit X-Bugzilla-URL: http://bugs.dpdk.org/ Auto-Submitted: auto-generated X-Auto-Response-Suppress: All MIME-Version: 1.0 X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org --16784695140.a4256179F.3353905 Date: Fri, 10 Mar 2023 18:31:54 +0100 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://bugs.dpdk.org/ Auto-Submitted: auto-generated X-Auto-Response-Suppress: All https://bugs.dpdk.org/show_bug.cgi?id=3D1178 Bug ID: 1178 Summary: Deadlock with heap lock Product: DPDK Version: unspecified Hardware: All OS: All Status: UNCONFIRMED Severity: normal Priority: Normal Component: core Assignee: dev@dpdk.org Reporter: maxime.coquelin@redhat.com Target Milestone: --- While doing test using testpmd with virtio-user PMD, I got a deadlock I man= aged to reproduce systematically. It looks like my system was under hugepages pressure. After a reboot, I cannot reproduce the deadlock anymore. This is the backtrace I collected before having to reboot: (gdb) bt #0 0x0000000000b80e87 in rte_spinlock_lock (sl=3D0x100000100) at ../lib/eal/x86/include/rte_spinlock.h:29 #1 0x0000000000b827fb in malloc_heap_free (elem=3D0x1003a8b00) at ../lib/eal/common/malloc_heap.c:878 #2 0x0000000000b83ba5 in mem_free (addr=3D0x1003a8b80, trace_ena=3Dtrue) at ../lib/eal/common/rte_malloc.c:37 #3 0x0000000000b83be9 in rte_free (addr=3D0x1003a8b80) at ../lib/eal/common/rte_malloc.c:44 #4 0x0000000003a2ce3d in virtio_user_free_vrings (dev=3D0x1003a8d00) at ../drivers/net/virtio/virtio_user/virtio_user_dev.c:652 #5 0x0000000003a2d5ae in virtio_user_dev_uninit (dev=3D0x1003a8d00) at ../drivers/net/virtio/virtio_user/virtio_user_dev.c:823 #6 0x0000000003a11a50 in virtio_user_dev_close (hw=3D0x1003a8d00) at ../drivers/net/virtio/virtio_user_ethdev.c:284 #7 0x00000000039dcbaf in virtio_dev_close (dev=3D0x98c0580 ) at ../drivers/net/virtio/virtio_ethdev.c:382 #8 0x0000000000aecd0a in rte_eth_dev_close (port_id=3D0) at ../lib/ethdev/rte_ethdev.c:1720 #9 0x0000000003a129c6 in virtio_user_pmd_remove (vdev=3D0xa008b30) at ../drivers/net/virtio/virtio_user_ethdev.c:715 #10 0x0000000000c3fefd in vdev_cleanup () at ../drivers/bus/vdev/vdev.c:588 #11 0x0000000000b667eb in eal_bus_cleanup () at ../lib/eal/common/eal_common_bus.c:104 #12 0x0000000000b97431 in rte_eal_cleanup () at ../lib/eal/linux/eal.c:1374 #13 0x0000000000b67557 in rte_exit (exit_code=3D1, format=3D0x8b4f3f8 "rxq = %d invalid - must be >=3D 0 && <=3D %u\n") at ../lib/eal/common/eal_common_debug.c:42 #14 0x000000000064822a in launch_args_parse (argc=3D4, argv=3D0x7fff45669a6= 0) at ../app/test-pmd/parameters.c:1053 #15 0x000000000065322a in main (argc=3D4, argv=3D0x7fff45669a60) at ../app/test-pmd/testpmd.c:4564 (gdb) info threads Id Target Id Frame * 1 Thread 0x7f7eb7841c00 (LWP 386037) "dpdk-testpmd" 0x0000000000b80= e87 in rte_spinlock_lock (sl=3D0x100000100) at ../lib/eal/x86/include/rte_spinlock.h:29 2 Thread 0x7f7eb6fff400 (LWP 386038) "eal-intr-thread" 0x00007f7eb7111= c0e in epoll_wait (epfd=3D6, events=3D0x7f7eb6ffc440, maxevents=3D1, timeout=3D-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 3 Thread 0x7f7eb5ffd400 (LWP 386040) "rte-worker-7" __GI___libc_read (nbytes=3D1, buf=3D0x7f7eb5ffa333, fd=3D10) at ../sysdeps/unix/sysv/linux/read.c:26 4 Thread 0x7f7eb57fc400 (LWP 386041) "telemetry-v2" 0x00007f7eb7113= 23f in __libc_accept (fd=3D25, addr=3D..., len=3D0x0) at ../sysdeps/unix/sysv/linux/accept.c:26 (gdb) t 2 [Switching to thread 2 (Thread 0x7f7eb6fff400 (LWP 386038))] #0 0x00007f7eb7111c0e in epoll_wait (epfd=3D6, events=3D0x7f7eb6ffc440, maxevents=3D1, timeout=3D-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 Downloading 0.00 MB source file /usr/src/debug/glibc-2.35-22.fc36.x86_64/misc/../sysdeps/unix/sysv/linux/ep= oll_wait.c 30 return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeo= ut); (gdb) bt #0 0x00007f7eb7111c0e in epoll_wait (epfd=3D6, events=3D0x7f7eb6ffc440, maxevents=3D1, timeout=3D-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x0000000000b9ce94 in eal_intr_handle_interrupts (pfd=3D6, totalfds=3D1= ) at ../lib/eal/linux/eal_interrupts.c:1077 #2 0x0000000000b9d0c7 in eal_intr_thread_main (arg=3D0x0) at ../lib/eal/linux/eal_interrupts.c:1163 #3 0x0000000000b7d6a6 in ctrl_thread_start (arg=3D0xa024ec0) at ../lib/eal/common/eal_common_thread.c:276 #4 0x00007f7eb708cdcd in start_thread (arg=3D) at pthread_create.c:442 #5 0x00007f7eb7112630 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 (gdb) t 3 [Switching to thread 3 (Thread 0x7f7eb5ffd400 (LWP 386040))] #0 __GI___libc_read (nbytes=3D1, buf=3D0x7f7eb5ffa333, fd=3D10) at ../sysdeps/unix/sysv/linux/read.c:26 Downloading 0.00 MB source file /usr/src/debug/glibc-2.35-22.fc36.x86_64/io/../sysdeps/unix/sysv/linux/read= .c 26 return SYSCALL_CANCEL (read, fd, buf, nbytes); (gdb) bt #0 __GI___libc_read (nbytes=3D1, buf=3D0x7f7eb5ffa333, fd=3D10) at ../sysdeps/unix/sysv/linux/read.c:26 #1 __GI___libc_read (fd=3D10, buf=3D0x7f7eb5ffa333, nbytes=3D1) at ../sysdeps/unix/sysv/linux/read.c:24 #2 0x0000000000b94220 in eal_thread_wait_command () at ../lib/eal/unix/eal_unix_thread.c:44 #3 0x0000000000b7d497 in eal_thread_loop (arg=3D0x7) at ../lib/eal/common/eal_common_thread.c:189 #4 0x0000000000b96580 in eal_worker_thread_loop (arg=3D0x7) at ../lib/eal/linux/eal.c:915 #5 0x00007f7eb708cdcd in start_thread (arg=3D) at pthread_create.c:442 #6 0x00007f7eb7112630 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 (gdb) t 4 [Switching to thread 4 (Thread 0x7f7eb57fc400 (LWP 386041))] #0 0x00007f7eb711323f in __libc_accept (fd=3D25, addr=3D..., len=3D0x0) at ../sysdeps/unix/sysv/linux/accept.c:26 Downloading 0.00 MB source file /usr/src/debug/glibc-2.35-22.fc36.x86_64/socket/../sysdeps/unix/sysv/linux/= accept.c 26 return SYSCALL_CANCEL (accept, fd, addr.__sockaddr__, len); (gdb) bt #0 0x00007f7eb711323f in __libc_accept (fd=3D25, addr=3D..., len=3D0x0) at ../sysdeps/unix/sysv/linux/accept.c:26 #1 0x0000000000bac798 in socket_listener (socket=3D0x99f3220 ) = at ../lib/telemetry/telemetry.c:405 #2 0x00007f7eb708cdcd in start_thread (arg=3D) at pthread_create.c:442 #3 0x00007f7eb7112630 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 So the deadlock happens on heap->lock. Looking at the usage of this lock I found something... interesting: int rte_malloc_heap_destroy(const char *heap_name) { struct malloc_heap *heap =3D NULL; int ret; if (heap_name =3D=3D NULL || strnlen(heap_name, RTE_HEAP_NAME_MAX_LEN) =3D=3D 0 = || strnlen(heap_name, RTE_HEAP_NAME_MAX_LEN) =3D=3D RTE_HEAP_NAME_MAX_LEN) { rte_errno =3D EINVAL; return -1; } rte_mcfg_mem_write_lock(); /* start from non-socket heaps */ heap =3D find_named_heap(heap_name); if (heap =3D=3D NULL) { RTE_LOG(ERR, EAL, "Heap %s not found\n", heap_name); rte_errno =3D ENOENT; ret =3D -1; goto unlock; } /* we shouldn't be able to destroy internal heaps */ if (heap->socket_id < RTE_MAX_NUMA_NODES) { rte_errno =3D EPERM; ret =3D -1; goto unlock; } /* sanity checks done, now we can destroy the heap */ rte_spinlock_lock(&heap->lock); ret =3D malloc_heap_destroy(heap); /* if we failed, lock is still active */ if (ret < 0) rte_spinlock_unlock(&heap->lock); unlock: rte_mcfg_mem_write_unlock(); return ret; } If malloc_heap_destroy() fails, we don't unlock the spinlock. It does not look sane, why not just unlock and propagate the error? How can it be preferable to freeze the application without letting the user know something went wrong? The commit introducing this is not explicit about this choice, so I'd like = to understand the reasoning before posting a fix: commit 15d6dd023c29fc526d16616137844bd91b6afdfe Author: Anatoly Burakov Date: Tue Oct 2 14:34:50 2018 +0100 malloc: allow destroying heaps Add an API to destroy specified heap. Signed-off-by: Anatoly Burakov Thanks in advance, Maxime --=20 You are receiving this mail because: You are the assignee for the bug.= --16784695140.a4256179F.3353905 Date: Fri, 10 Mar 2023 18:31:54 +0100 MIME-Version: 1.0 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://bugs.dpdk.org/ Auto-Submitted: auto-generated X-Auto-Response-Suppress: All
Bug ID 1178
Summary Deadlock with heap lock
Product DPDK
Version unspecified
Hardware All
OS All
Status UNCONFIRMED
Severity normal
Priority Normal
Component core
Assignee dev@dpdk.org
Reporter maxime.coquelin@redhat.com
Target Milestone ---

While doing test using testpmd wit=
h virtio-user PMD, I got a deadlock I managed
to reproduce systematically.
It looks like my system was under hugepages pressure.
After a reboot, I cannot reproduce the deadlock anymore.

This is the backtrace I collected before having to reboot:

(gdb) bt
#0  0x0000000000b80e87 in rte_spinlock_lock (sl=3D0x100000100) at
../lib/eal/x86/include/rte_spinlock.h:29
#1  0x0000000000b827fb in malloc_heap_free (elem=3D0x1003a8b00) at
../lib/eal/common/malloc_heap.c:878
#2  0x0000000000b83ba5 in mem_free (addr=3D0x1003a8b80, trace_ena=3Dtrue) at
../lib/eal/common/rte_malloc.c:37
#3  0x0000000000b83be9 in rte_free (addr=3D0x1003a8b80) at
../lib/eal/common/rte_malloc.c:44
#4  0x0000000003a2ce3d in virtio_user_free_vrings (dev=3D0x1003a8d00) at
../drivers/net/virtio/virtio_user/virtio_user_dev.c:652
#5  0x0000000003a2d5ae in virtio_user_dev_uninit (dev=3D0x1003a8d00) at
../drivers/net/virtio/virtio_user/virtio_user_dev.c:823
#6  0x0000000003a11a50 in virtio_user_dev_close (hw=3D0x1003a8d00) at
../drivers/net/virtio/virtio_user_ethdev.c:284
#7  0x00000000039dcbaf in virtio_dev_close (dev=3D0x98c0580 <rte_eth_dev=
ices>) at
../drivers/net/virtio/virtio_ethdev.c:382
#8  0x0000000000aecd0a in rte_eth_dev_close (port_id=3D0) at
../lib/ethdev/rte_ethdev.c:1720
#9  0x0000000003a129c6 in virtio_user_pmd_remove (vdev=3D0xa008b30) at
../drivers/net/virtio/virtio_user_ethdev.c:715
#10 0x0000000000c3fefd in vdev_cleanup () at ../drivers/bus/vdev/vdev.c:588
#11 0x0000000000b667eb in eal_bus_cleanup () at
../lib/eal/common/eal_common_bus.c:104
#12 0x0000000000b97431 in rte_eal_cleanup () at ../lib/eal/linux/eal.c:1374
#13 0x0000000000b67557 in rte_exit (exit_code=3D1, format=3D0x8b4f3f8 "=
;rxq %d
invalid - must be >=3D 0 && <=3D %u\n")
    at ../lib/eal/common/eal_common_debug.c:42
#14 0x000000000064822a in launch_args_parse (argc=3D4, argv=3D0x7fff45669a6=
0) at
../app/test-pmd/parameters.c:1053
#15 0x000000000065322a in main (argc=3D4, argv=3D0x7fff45669a60) at
../app/test-pmd/testpmd.c:4564
(gdb) info threads
  Id   Target Id                                            Frame
* 1    Thread 0x7f7eb7841c00 (LWP 386037) "dpdk-testpmd"    0x000=
0000000b80e87
in rte_spinlock_lock (sl=3D0x100000100)
    at ../lib/eal/x86/include/rte_spinlock.h:29
  2    Thread 0x7f7eb6fff400 (LWP 386038) "eal-intr-thread" 0x000=
07f7eb7111c0e
in epoll_wait (epfd=3D6, events=3D0x7f7eb6ffc440, maxevents=3D1,
    timeout=3D-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  3    Thread 0x7f7eb5ffd400 (LWP 386040) "rte-worker-7"    __GI_=
__libc_read
(nbytes=3D1, buf=3D0x7f7eb5ffa333, fd=3D10)
    at ../sysdeps/unix/sysv/linux/read.c:26
  4    Thread 0x7f7eb57fc400 (LWP 386041) "telemetry-v2"    0x000=
07f7eb711323f
in __libc_accept (fd=3D25, addr=3D..., len=3D0x0)
    at ../sysdeps/unix/sysv/linux/accept.c:26
(gdb) t 2
[Switching to thread 2 (Thread 0x7f7eb6fff400 (LWP 386038))]
#0  0x00007f7eb7111c0e in epoll_wait (epfd=3D6, events=3D0x7f7eb6ffc440,
maxevents=3D1, timeout=3D-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
Downloading 0.00 MB source file
/usr/src/debug/glibc-2.35-22.fc36.x86_64/misc/../sysdeps/unix/sysv/linux/ep=
oll_wait.c
30        return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeo=
ut);
(gdb) bt
#0  0x00007f7eb7111c0e in epoll_wait (epfd=3D6, events=3D0x7f7eb6ffc440,
maxevents=3D1, timeout=3D-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x0000000000b9ce94 in eal_intr_handle_interrupts (pfd=3D6, totalfds=3D1=
) at
../lib/eal/linux/eal_interrupts.c:1077
#2  0x0000000000b9d0c7 in eal_intr_thread_main (arg=3D0x0) at
../lib/eal/linux/eal_interrupts.c:1163
#3  0x0000000000b7d6a6 in ctrl_thread_start (arg=3D0xa024ec0) at
../lib/eal/common/eal_common_thread.c:276
#4  0x00007f7eb708cdcd in start_thread (arg=3D<optimized out>) at
pthread_create.c:442
#5  0x00007f7eb7112630 in clone3 () at
../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) t 3
[Switching to thread 3 (Thread 0x7f7eb5ffd400 (LWP 386040))]
#0  __GI___libc_read (nbytes=3D1, buf=3D0x7f7eb5ffa333, fd=3D10) at
../sysdeps/unix/sysv/linux/read.c:26
Downloading 0.00 MB source file
/usr/src/debug/glibc-2.35-22.fc36.x86_64/io/../sysdeps/unix/sysv/linux/read=
.c
26        return SYSCALL_CANCEL (read, fd, buf, nbytes);
(gdb) bt
#0  __GI___libc_read (nbytes=3D1, buf=3D0x7f7eb5ffa333, fd=3D10) at
../sysdeps/unix/sysv/linux/read.c:26
#1  __GI___libc_read (fd=3D10, buf=3D0x7f7eb5ffa333, nbytes=3D1) at
../sysdeps/unix/sysv/linux/read.c:24
#2  0x0000000000b94220 in eal_thread_wait_command () at
../lib/eal/unix/eal_unix_thread.c:44
#3  0x0000000000b7d497 in eal_thread_loop (arg=3D0x7) at
../lib/eal/common/eal_common_thread.c:189
#4  0x0000000000b96580 in eal_worker_thread_loop (arg=3D0x7) at
../lib/eal/linux/eal.c:915
#5  0x00007f7eb708cdcd in start_thread (arg=3D<optimized out>) at
pthread_create.c:442
#6  0x00007f7eb7112630 in clone3 () at
../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) t 4
[Switching to thread 4 (Thread 0x7f7eb57fc400 (LWP 386041))]
#0  0x00007f7eb711323f in __libc_accept (fd=3D25, addr=3D..., len=3D0x0) at
../sysdeps/unix/sysv/linux/accept.c:26
Downloading 0.00 MB source file
/usr/src/debug/glibc-2.35-22.fc36.x86_64/socket/../sysdeps/unix/sysv/linux/=
accept.c
26        return SYSCALL_CANCEL (accept, fd, addr.__sockaddr__, len);
(gdb) bt
#0  0x00007f7eb711323f in __libc_accept (fd=3D25, addr=3D..., len=3D0x0) at
../sysdeps/unix/sysv/linux/accept.c:26
#1  0x0000000000bac798 in socket_listener (socket=3D0x99f3220 <v2_socket=
>) at
../lib/telemetry/telemetry.c:405
#2  0x00007f7eb708cdcd in start_thread (arg=3D<optimized out>) at
pthread_create.c:442
#3  0x00007f7eb7112630 in clone3 () at
../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

So the deadlock happens on heap->lock.
Looking at the usage of this lock I found something... interesting:

int
rte_malloc_heap_destroy(const char *heap_name)
{
        struct malloc_heap *heap =3D NULL;
        int ret;

        if (heap_name =3D=3D NULL ||
                        strnlen(heap_name, RTE_HEAP_NAME_MAX_LEN) =3D=3D 0 =
||
                        strnlen(heap_name, RTE_HEAP_NAME_MAX_LEN) =3D=3D
                                RTE_HEAP_NAME_MAX_LEN) {
                rte_errno =3D EINVAL;
                return -1;
        }
        rte_mcfg_mem_write_lock();

        /* start from non-socket heaps */
        heap =3D find_named_heap(heap_name);
        if (heap =3D=3D NULL) {
                RTE_LOG(ERR, EAL, "Heap %s not found\n", heap_nam=
e);
                rte_errno =3D ENOENT;
                ret =3D -1;
                goto unlock;
        }
        /* we shouldn't be able to destroy internal heaps */
        if (heap->socket_id < RTE_MAX_NUMA_NODES) {
                rte_errno =3D EPERM;
                ret =3D -1;
                goto unlock;
        }
        /* sanity checks done, now we can destroy the heap */
        rte_spinlock_lock(&heap->lock);
        ret =3D malloc_heap_destroy(heap);

        /* if we failed, lock is still active */
        if (ret < 0)
                rte_spinlock_unlock(&heap->lock);
unlock:
        rte_mcfg_mem_write_unlock();

        return ret;
}

If malloc_heap_destroy() fails, we don't unlock the spinlock.
It does not look sane, why not just unlock and propagate the error?
How can it be preferable to freeze the application without letting the user
know something went wrong?

The commit introducing this is not explicit about this choice, so I'd like =
to
understand the reasoning before posting a fix:

commit 15d6dd023c29fc526d16616137844bd91b6afdfe
Author: Anatoly Burakov <anatoly.burakov@intel.com>
Date:   Tue Oct 2 14:34:50 2018 +0100

    malloc: allow destroying heaps

    Add an API to destroy specified heap.

    Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

Thanks in advance,
Maxime
          


You are receiving this mail because:
  • You are the assignee for the bug.
=20=20=20=20=20=20=20=20=20=20
= --16784695140.a4256179F.3353905--