From: Olivier Matz
Date: Mon, 30 Apr 2018 20:46:58 +0200
To: Maxime Coquelin, dev@dpdk.org
CC: Anatoly Burakov, Jianfeng Tan, Thomas Monjalon
Subject: Re: [dpdk-dev] pthread_barrier_deadlock in -rc1 (was: "Re: [PATCH v3 0/5] fix control thread affinities")
Message-ID: <4256B2F0-EF9D-4B22-AC1A-D440C002360A@6wind.com>
References: <20180403130439.11151-1-olivier.matz@6wind.com> <20180424144651.13145-1-olivier.matz@6wind.com>
List-Id: DPDK patches and discussions

Hi Maxime,

On 30 April 2018 17:45:52 GMT+02:00, Maxime Coquelin wrote:
>Hi Olivier,
>
>On 04/24/2018 04:46 PM, Olivier Matz wrote:
>> Some parts of dpdk use their own management threads. Most of the time,
>> the affinity of the thread is not properly set: it should not be scheduled
>> on the dataplane cores, because interrupting them can cause packet losses.
>>
>> This patchset introduces a new wrapper for thread creation that does
>> the job automatically, avoiding code duplication.
>>
>> v3:
>> * new patch: use this API in examples when relevant.
>> * replace pthread_kill by pthread_cancel. Note that pthread_join()
>>   is still needed.
>> * rebase: vfio and pdump do not have control pthreads anymore, and eal
>>   has 2 new pthreads
>> * remove all calls to snprintf/strlcpy that truncate the thread name:
>>   all strings lengths are already < 16.
>>
>> v2:
>> * set affinity to master core if no core is off, as suggested by
>>   Anatoly
>>
>> Olivier Matz (5):
>>   eal: use sizeof to avoid a double use of a define
>>   eal: new function to create control threads
>>   eal: set name when creating a control thread
>>   eal: set affinity for control threads
>>   examples: use new API to create control threads
>>
>>  drivers/net/kni/Makefile                     |  1 +
>>  drivers/net/kni/rte_eth_kni.c                |  3 +-
>>  examples/tep_termination/main.c              | 16 +++----
>>  examples/vhost/main.c                        | 19 +++-----
>>  lib/librte_eal/bsdapp/eal/eal.c              |  4 +-
>>  lib/librte_eal/bsdapp/eal/eal_thread.c       |  2 +-
>>  lib/librte_eal/common/eal_common_proc.c      | 15 ++----
>>  lib/librte_eal/common/eal_common_thread.c    | 72 ++++++++++++++++++++++++++++
>>  lib/librte_eal/common/include/rte_lcore.h    | 26 ++++++++++
>>  lib/librte_eal/linuxapp/eal/eal.c            |  4 +-
>>  lib/librte_eal/linuxapp/eal/eal_interrupts.c | 17 ++-----
>>  lib/librte_eal/linuxapp/eal/eal_thread.c     |  2 +-
>>  lib/librte_eal/linuxapp/eal/eal_timer.c      | 12 +----
>>  lib/librte_eal/rte_eal_version.map           |  1 +
>>  lib/librte_vhost/socket.c                    | 25 ++--------
>>  15 files changed, 135 insertions(+), 84 deletions(-)
>>
>
>I face a deadlock issue with your series, that Jianfeng's patch does not
>resolve ("eal: fix threads block on barrier"). Reverting the series and
>Jianfeng's patch makes the issue disappear.
>
>I face the problem in a VM (not seen on the host):
># ./install/bin/testpmd -l 0,1,2 --socket-mem 1024 -n 4 --proc-type auto --file-prefix pg -- --portmask=3 --forward-mode=macswap --port-topology=chained --disable-rss -i --rxq=1 --txq=1 --rxd=256 --txd=256 --nb-cores=2 --auto-start
>EAL: Detected 3 lcore(s)
>EAL: Detected 1 NUMA nodes
>EAL: Auto-detected process type: PRIMARY
>EAL: Multi-process socket /var/run/.pg_unix
>
>
>Then it is stuck. Attaching with GDB, I get the backtrace information below:
>
>(gdb) info threads
>  Id   Target Id         Frame
>  3    Thread 0x7f63e1f9f700 (LWP 8808) "rte_mp_handle" 0x00007f63e2591bfd in recvmsg () at ../sysdeps/unix/syscall-template.S:81
>  2    Thread 0x7f63e179e700 (LWP 8809) "rte_mp_async" pthread_barrier_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_barrier_wait.S:71
>* 1    Thread 0x7f63e32cec00 (LWP 8807) "testpmd" pthread_barrier_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_barrier_wait.S:71
>(gdb) bt full
>#0  pthread_barrier_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_barrier_wait.S:71
>No locals.
>#1  0x0000000000520c54 in rte_ctrl_thread_create (thread=thread@entry=0x7ffe5c895020, name=name@entry=0x869d86 "rte_mp_async", attr=attr@entry=0x0, start_routine=start_routine@entry=0x521030 , arg=arg@entry=0x0)
>    at /root/src/dpdk/lib/librte_eal/common/eal_common_thread.c:207
>        params = 0x17b1e40
>        lcore_id =
>        cpuset = {__bits = {1, 0 }}
>        cpu_found =
>        ret = 0
>#2  0x00000000005220b6 in rte_mp_channel_init () at /root/src/dpdk/lib/librte_eal/common/eal_common_proc.c:674
>        path = "/var/run\000.pg_unix_*", '\000' ...
>        dir_fd = 4
>        mp_handle_tid = 140066969745152
>        async_reply_handle_tid = 140066961352448
>#3  0x000000000050c227 in rte_eal_init (argc=argc@entry=23, argv=argv@entry=0x7ffe5c896378) at /root/src/dpdk/lib/librte_eal/linuxapp/eal/eal.c:775
>        i =
>        fctret = 11
>        ret =
>        thread_id = 140066989861888
>        run_once = {cnt = 1}
>        logid = 0x17b1e00 "testpmd"
>        cpuset = "T}\211\\\376\177", '\000' , "\020", '\000' ...
>        thread_name = "X}\211\\\376\177\000\000\226\301\036\342c\177\000"
>        __func__ = "rte_eal_init"
>#4  0x0000000000473214 in main (argc=23, argv=0x7ffe5c896378) at /root/src/dpdk/app/test-pmd/testpmd.c:2597
>        diag =
>        port_id =
>        ret =
>        __func__ = "main"
>(gdb) thread 2
>[Switching to thread 2 (Thread 0x7f63e179e700 (LWP 8809))]
>#0  pthread_barrier_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_barrier_wait.S:71
>71      cmpl %edx, (%rdi)
>(gdb) bt full
>#0  pthread_barrier_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_barrier_wait.S:71
>No locals.
>#1  0x0000000000520777 in rte_thread_init (arg=) at /root/src/dpdk/lib/librte_eal/common/eal_common_thread.c:156
>        params =
>        start_routine = 0x521030
>        routine_arg = 0x0
>#2  0x00007f63e258add5 in start_thread (arg=0x7f63e179e700) at pthread_create.c:308
>        __res =
>        pd = 0x7f63e179e700
>        now =
>        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140066961352448, 1212869169857371576, 0, 8392704, 0, 140066961352448, -1291626103561052744, -1291619793368703560}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>        not_first_call =
>        pagesize_m1 =
>        sp =
>        freesize =
>#3  0x00007f63e22b4b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
>No locals.
>(gdb) thread 3
>[Switching to thread 3 (Thread 0x7f63e1f9f700 (LWP 8808))]
>#0  0x00007f63e2591bfd in recvmsg () at ../sysdeps/unix/syscall-template.S:81
>81      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
>(gdb) bt full
>#0  0x00007f63e2591bfd in recvmsg () at ../sysdeps/unix/syscall-template.S:81
>No locals.
>#1  0x000000000052194e in read_msg (s=0x7f63e1f9d3b0, m=0x7f63e1f9d5a0) at /root/src/dpdk/lib/librte_eal/common/eal_common_proc.c:258
>        msglen =
>        control = "\000\000\000\000\000\000\000\000\336~\f\343c\177\000\000\005", '\000' , "\360\371\033\342c\177\000"
>        cmsg =
>        iov = {iov_base = 0x7f63e1f9d5a0, iov_len = 332}
>        msgh = {msg_name = 0x7f63e1f9d3b0, msg_namelen = 110, msg_iov = 0x7f63e1f9d370, msg_iovlen = 1, msg_control = 0x7f63e1f9d380, msg_controllen = 48, msg_flags = 0}
>#2  mp_handle (arg=) at /root/src/dpdk/lib/librte_eal/common/eal_common_proc.c:346
>        msg = {type = 0, msg = {name = '\000' , len_param = 0, num_fds = 0, param = '\000' , "\002", '\000' , fds = {0, 0, 0, 0, 0, 0, 0, 0}}}
>        sa = {sun_family = 55104, sun_path = "\371\341c\177\000\000\352\372\f\343c\177\000\000\000\000\000\000\000\000\000\000\377\377\377\377\377\377\377\377\000\367\371\341c\177\000\000\030\000\000\000\000\000\000\000p\327\371\341c\177\000\000\000\367\371\341c\177\000\000\000\367\371\341c\177", '\000' , "\200\037\000\000\377\377"}
>#3  0x00007f63e258add5 in start_thread (arg=0x7f63e1f9f700) at pthread_create.c:308
>        __res =
>        pd = 0x7f63e1f9f700
>        now =
>        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140066969745152, 1212869169857371576, 0, 8392704, 0, 140066969745152, -1291625004586295880, -1291619793368703560}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>        not_first_call =
>        pagesize_m1 =
>        sp =
>        freesize =
>#4  0x00007f63e22b4b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
>No locals.
>
>I don't have more info for now.
>

Thanks for the feedback on this issue. I don't see an obvious reason for
this deadlock yet.
I'll investigate it ASAP (not tomorrow, but Wednesday). In the worst case,
we can revert the series if I cannot find the root cause quickly.

Olivier