DPDK patches and discussions
 help / color / mirror / Atom feed
From: bugzilla@dpdk.org
To: dev@dpdk.org
Subject: [Bug 1178] Deadlock with heap lock
Date: Fri, 10 Mar 2023 17:31:54 +0000	[thread overview]
Message-ID: <bug-1178-3@http.bugs.dpdk.org/> (raw)

[-- Attachment #1: Type: text/plain, Size: 8339 bytes --]

https://bugs.dpdk.org/show_bug.cgi?id=1178

            Bug ID: 1178
           Summary: Deadlock with heap lock
           Product: DPDK
           Version: unspecified
          Hardware: All
                OS: All
            Status: UNCONFIRMED
          Severity: normal
          Priority: Normal
         Component: core
          Assignee: dev@dpdk.org
          Reporter: maxime.coquelin@redhat.com
  Target Milestone: ---

While doing test using testpmd with virtio-user PMD, I got a deadlock I managed
to reproduce systematically.
It looks like my system was under hugepages pressure.
After a reboot, I cannot reproduce the deadlock anymore.

This is the backtrace I collected before having to reboot:

(gdb) bt
#0  0x0000000000b80e87 in rte_spinlock_lock (sl=0x100000100) at
../lib/eal/x86/include/rte_spinlock.h:29
#1  0x0000000000b827fb in malloc_heap_free (elem=0x1003a8b00) at
../lib/eal/common/malloc_heap.c:878
#2  0x0000000000b83ba5 in mem_free (addr=0x1003a8b80, trace_ena=true) at
../lib/eal/common/rte_malloc.c:37
#3  0x0000000000b83be9 in rte_free (addr=0x1003a8b80) at
../lib/eal/common/rte_malloc.c:44
#4  0x0000000003a2ce3d in virtio_user_free_vrings (dev=0x1003a8d00) at
../drivers/net/virtio/virtio_user/virtio_user_dev.c:652
#5  0x0000000003a2d5ae in virtio_user_dev_uninit (dev=0x1003a8d00) at
../drivers/net/virtio/virtio_user/virtio_user_dev.c:823
#6  0x0000000003a11a50 in virtio_user_dev_close (hw=0x1003a8d00) at
../drivers/net/virtio/virtio_user_ethdev.c:284
#7  0x00000000039dcbaf in virtio_dev_close (dev=0x98c0580 <rte_eth_devices>) at
../drivers/net/virtio/virtio_ethdev.c:382
#8  0x0000000000aecd0a in rte_eth_dev_close (port_id=0) at
../lib/ethdev/rte_ethdev.c:1720
#9  0x0000000003a129c6 in virtio_user_pmd_remove (vdev=0xa008b30) at
../drivers/net/virtio/virtio_user_ethdev.c:715
#10 0x0000000000c3fefd in vdev_cleanup () at ../drivers/bus/vdev/vdev.c:588
#11 0x0000000000b667eb in eal_bus_cleanup () at
../lib/eal/common/eal_common_bus.c:104
#12 0x0000000000b97431 in rte_eal_cleanup () at ../lib/eal/linux/eal.c:1374
#13 0x0000000000b67557 in rte_exit (exit_code=1, format=0x8b4f3f8 "rxq %d
invalid - must be >= 0 && <= %u\n")
    at ../lib/eal/common/eal_common_debug.c:42
#14 0x000000000064822a in launch_args_parse (argc=4, argv=0x7fff45669a60) at
../app/test-pmd/parameters.c:1053
#15 0x000000000065322a in main (argc=4, argv=0x7fff45669a60) at
../app/test-pmd/testpmd.c:4564
(gdb) info threads
  Id   Target Id                                            Frame
* 1    Thread 0x7f7eb7841c00 (LWP 386037) "dpdk-testpmd"    0x0000000000b80e87
in rte_spinlock_lock (sl=0x100000100)
    at ../lib/eal/x86/include/rte_spinlock.h:29
  2    Thread 0x7f7eb6fff400 (LWP 386038) "eal-intr-thread" 0x00007f7eb7111c0e
in epoll_wait (epfd=6, events=0x7f7eb6ffc440, maxevents=1,
    timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  3    Thread 0x7f7eb5ffd400 (LWP 386040) "rte-worker-7"    __GI___libc_read
(nbytes=1, buf=0x7f7eb5ffa333, fd=10)
    at ../sysdeps/unix/sysv/linux/read.c:26
  4    Thread 0x7f7eb57fc400 (LWP 386041) "telemetry-v2"    0x00007f7eb711323f
in __libc_accept (fd=25, addr=..., len=0x0)
    at ../sysdeps/unix/sysv/linux/accept.c:26
(gdb) t 2
[Switching to thread 2 (Thread 0x7f7eb6fff400 (LWP 386038))]
#0  0x00007f7eb7111c0e in epoll_wait (epfd=6, events=0x7f7eb6ffc440,
maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
Downloading 0.00 MB source file
/usr/src/debug/glibc-2.35-22.fc36.x86_64/misc/../sysdeps/unix/sysv/linux/epoll_wait.c
30        return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout);
(gdb) bt
#0  0x00007f7eb7111c0e in epoll_wait (epfd=6, events=0x7f7eb6ffc440,
maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x0000000000b9ce94 in eal_intr_handle_interrupts (pfd=6, totalfds=1) at
../lib/eal/linux/eal_interrupts.c:1077
#2  0x0000000000b9d0c7 in eal_intr_thread_main (arg=0x0) at
../lib/eal/linux/eal_interrupts.c:1163
#3  0x0000000000b7d6a6 in ctrl_thread_start (arg=0xa024ec0) at
../lib/eal/common/eal_common_thread.c:276
#4  0x00007f7eb708cdcd in start_thread (arg=<optimized out>) at
pthread_create.c:442
#5  0x00007f7eb7112630 in clone3 () at
../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) t 3
[Switching to thread 3 (Thread 0x7f7eb5ffd400 (LWP 386040))]
#0  __GI___libc_read (nbytes=1, buf=0x7f7eb5ffa333, fd=10) at
../sysdeps/unix/sysv/linux/read.c:26
Downloading 0.00 MB source file
/usr/src/debug/glibc-2.35-22.fc36.x86_64/io/../sysdeps/unix/sysv/linux/read.c
26        return SYSCALL_CANCEL (read, fd, buf, nbytes);
(gdb) bt
#0  __GI___libc_read (nbytes=1, buf=0x7f7eb5ffa333, fd=10) at
../sysdeps/unix/sysv/linux/read.c:26
#1  __GI___libc_read (fd=10, buf=0x7f7eb5ffa333, nbytes=1) at
../sysdeps/unix/sysv/linux/read.c:24
#2  0x0000000000b94220 in eal_thread_wait_command () at
../lib/eal/unix/eal_unix_thread.c:44
#3  0x0000000000b7d497 in eal_thread_loop (arg=0x7) at
../lib/eal/common/eal_common_thread.c:189
#4  0x0000000000b96580 in eal_worker_thread_loop (arg=0x7) at
../lib/eal/linux/eal.c:915
#5  0x00007f7eb708cdcd in start_thread (arg=<optimized out>) at
pthread_create.c:442
#6  0x00007f7eb7112630 in clone3 () at
../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) t 4
[Switching to thread 4 (Thread 0x7f7eb57fc400 (LWP 386041))]
#0  0x00007f7eb711323f in __libc_accept (fd=25, addr=..., len=0x0) at
../sysdeps/unix/sysv/linux/accept.c:26
Downloading 0.00 MB source file
/usr/src/debug/glibc-2.35-22.fc36.x86_64/socket/../sysdeps/unix/sysv/linux/accept.c
26        return SYSCALL_CANCEL (accept, fd, addr.__sockaddr__, len);
(gdb) bt
#0  0x00007f7eb711323f in __libc_accept (fd=25, addr=..., len=0x0) at
../sysdeps/unix/sysv/linux/accept.c:26
#1  0x0000000000bac798 in socket_listener (socket=0x99f3220 <v2_socket>) at
../lib/telemetry/telemetry.c:405
#2  0x00007f7eb708cdcd in start_thread (arg=<optimized out>) at
pthread_create.c:442
#3  0x00007f7eb7112630 in clone3 () at
../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

So the deadlock happens on heap->lock.
Looking at the usage of this lock I found something... interesting:

int
rte_malloc_heap_destroy(const char *heap_name)
{
        struct malloc_heap *heap = NULL;
        int ret;

        if (heap_name == NULL ||
                        strnlen(heap_name, RTE_HEAP_NAME_MAX_LEN) == 0 ||
                        strnlen(heap_name, RTE_HEAP_NAME_MAX_LEN) ==
                                RTE_HEAP_NAME_MAX_LEN) {
                rte_errno = EINVAL;
                return -1;
        }
        rte_mcfg_mem_write_lock();

        /* start from non-socket heaps */
        heap = find_named_heap(heap_name);
        if (heap == NULL) {
                RTE_LOG(ERR, EAL, "Heap %s not found\n", heap_name);
                rte_errno = ENOENT;
                ret = -1;
                goto unlock;
        }
        /* we shouldn't be able to destroy internal heaps */
        if (heap->socket_id < RTE_MAX_NUMA_NODES) {
                rte_errno = EPERM;
                ret = -1;
                goto unlock;
        }
        /* sanity checks done, now we can destroy the heap */
        rte_spinlock_lock(&heap->lock);
        ret = malloc_heap_destroy(heap);

        /* if we failed, lock is still active */
        if (ret < 0)
                rte_spinlock_unlock(&heap->lock);
unlock:
        rte_mcfg_mem_write_unlock();

        return ret;
}

If malloc_heap_destroy() fails, we don't unlock the spinlock.
It does not look sane, why not just unlock and propagate the error?
How can it be preferable to freeze the application without letting the user
know something went wrong?

The commit introducing this is not explicit about this choice, so I'd like to
understand the reasoning before posting a fix:

commit 15d6dd023c29fc526d16616137844bd91b6afdfe
Author: Anatoly Burakov <anatoly.burakov@intel.com>
Date:   Tue Oct 2 14:34:50 2018 +0100

    malloc: allow destroying heaps

    Add an API to destroy specified heap.

    Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

Thanks in advance,
Maxime

-- 
You are receiving this mail because:
You are the assignee for the bug.

[-- Attachment #2: Type: text/html, Size: 10442 bytes --]

                 reply	other threads:[~2023-03-10 17:31 UTC|newest]

Thread overview: [no followups] expand[flat|nested]  mbox.gz  Atom feed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bug-1178-3@http.bugs.dpdk.org/ \
    --to=bugzilla@dpdk.org \
    --cc=dev@dpdk.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).