From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Marchand
Date: Wed, 10 Jul 2019 10:18:46 +0200
Subject: Re: [dpdk-dev] [PATCH v2 00/15] Unit tests fixes for CI
To: Michael Santana Francisco
Cc: Aaron Conole, Thomas Monjalon, dev, JananeeX M Parthasarathy, David Hunt
In-Reply-To: <139fd420-dbee-0a33-1885-00c9593fe201@redhat.com>
References: <1559638792-8608-1-git-send-email-david.marchand@redhat.com>
 <1560580950-16754-1-git-send-email-david.marchand@redhat.com>
 <70986373.KVGszKu7e3@xps>
 <139fd420-dbee-0a33-1885-00c9593fe201@redhat.com>
List-Id: DPDK patches and discussions

On Tue, Jul 9, 2019 at 5:50 PM Michael Santana Francisco
<msantana@redhat.com> wrote:

> On 7/1/19 2:07 PM, Michael Santana Francisco wrote:
> >>
> >>
> >> On Mon, Jul 1, 2019 at 6:04 PM Aaron Conole wrote:
> >>>>> - rwlock_autotest and hash_readwrite_lf_autotest are taking a
> >>>>> little more than 10s,
> >>> Occasionally the distributor test times out as well. I've moved them
> >>> as part of a separate patch, that I'll post along with a bigger
> >>> series to enable the unit tests under travis. Michael and I are
> >>> leaning toward introducing a new variable called RUN_TESTS which
> >>> will do the docs and unit testing since those combined would add
> >>> quite a bit to the execution time of each job (and feel free to bike
> >>> shed the name, since the patches aren't final).
> >>
> >> Seeing how the distributor autotest usually takes less than a second
> >> to complete, this sounds like a bug.
> >> I don't think I caught this so far.
> > So I actually ran into the distributor test timing out. I agree with
> > David that it is a bug in the test. Looking at the logs, that test
> > normally finishes in less than 1/2 a second, so running to 10 seconds
> > and timing out is a big jump in run time. I ran into the issue where
> > it timed out, so I restarted the job and it finished with no problem.
> > The test fails every so often for no good reason and the logs[1] don't
> > really say much. I speculate that it is waiting for a resource to
> > become available or in the worst case a deadlock.
> > Seeing that it only fails every so often and it passes when
> > restarted, I don't think it's a big deal. Nevertheless, it's worth
> > investing time in figuring out what's wrong.
> >
> > [1] https://api.travis-ci.com/v3/job/212335916/log.txt
>
> I investigated this test a little bit. CC'd David Hunt.
>
> I was able to reproduce the problem on v19.08-rc1 with:
>
> `while sudo sh -c "echo 'distributor_autotest' | ./build/app/test/dpdk-test"; do :; done`
>
> It runs fine a couple of times, showing output and progress, but after
> a couple of seconds it just stops and no further output appears. I let
> it sit there for a whole minute and nothing happened. So I attached gdb
> to try to figure out what was happening. One thread seems to be stuck
> in a while loop, see lib/librte_distributor/rte_distributor.c:310.
>
> I looked at the assembly code (layout asm, ni) and I saw the four
> lines below (which correspond to the while loop) being executed
> repeatedly and indefinitely. It looks like this thread is waiting for
> the variable bufptr64[0] to change state.
>
> 0xa064d0 pause
> 0xa064d2 mov 0x3840(%rdx),%rax
> 0xa064d9 test $0x1,%al
> 0xa064db je 0xa064d0
>
>
> While the first thread is waiting on bufptr64[0] to change state, there
> is another thread that is also stuck in another while loop at
> lib/librte_distributor/rte_distributor.c:53. It seems that this thread
> is stuck waiting for retptr64 to change state.
> Corresponding assembly being executed indefinitely:
>
> 0xa06de0 mov 0x38c0(%r8),%rax
> 0xa06de7 test $0x1,%al
> 0xa06de9 je 0xa06bbd
>
> 0xa06def nop
> 0xa06df0 pause
> 0xa06df2 rdtsc
> 0xa06df4 mov %rdx,%r10
> 0xa06df7 shl $0x20,%r10
> 0xa06dfb mov %eax,%eax
> 0xa06dfd or %r10,%rax
> 0xa06e00 lea 0x64(%rax),%r10
> 0xa06e04 jmp 0xa06e12
>
> 0xa06e06 nopw %cs:0x0(%rax,%rax,1)
> 0xa06e10 pause
> 0xa06e12 rdtsc
> 0xa06e14 shl $0x20,%rdx
> 0xa06e18 mov %eax,%eax
> 0xa06e1a or %rdx,%rax
> 0xa06e1d cmp %rax,%r10
> 0xa06e20 ja 0xa06e10
>
> 0xa06e22 jmp 0xa06de0
>
>
>
> My guess is that these threads are interdependent, so one thread is
> waiting for the other thread to change the state of the control
> variable. I can't say for sure if this is what is happening or why
> these variables don't change state, so I would like to ask someone who
> is more familiar with this particular code to take a look.
>

Ah cool, thanks for the analysis.
Can you create a bz with this description and assign it to the
librte_distributor maintainer?

-- 
David Marchand
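
For reference when filing the bz, here is a minimal sketch in C of the two
waits as they read from the disassembly quoted above. It is a model under
stated assumptions, not the librte_distributor code: the first loop appears
to spin until bit 0 of a 64-bit word becomes set, and the second appears to
spin (with a short TSC-based delay) until bit 0 of another word becomes
clear. The names wait_for_bit, both_waits_stall, FLAG_BIT and the max_polls
cap are hypothetical and exist only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the bit that "test $0x1,%al" checks. */
#define FLAG_BIT UINT64_C(1)

/*
 * Poll *word until bit 0 matches want_set, giving up after max_polls
 * reads. Returns true if the wanted state was seen, false on timeout.
 * The real loops have no such cap, which is why they can spin forever.
 */
static bool wait_for_bit(_Atomic uint64_t *word, bool want_set,
                         uint64_t max_polls)
{
    for (uint64_t i = 0; i < max_polls; i++) {
        uint64_t v = atomic_load_explicit(word, memory_order_acquire);
        bool is_set = (v & FLAG_BIT) != 0;

        if (is_set == want_set)
            return true;
    }
    return false;
}

/*
 * Single-threaded model of the suspected stall: one waiter needs the
 * first word's bit to be set, the other needs the second word's bit to
 * be clear, and nobody is left to flip either bit. Returns true when
 * both waits time out. The local names only stand in for bufptr64[0]
 * and retptr64 from the report.
 */
static bool both_waits_stall(uint64_t max_polls)
{
    _Atomic uint64_t bufptr64_word;
    _Atomic uint64_t retptr64_word;

    atomic_init(&bufptr64_word, 0);        /* bit clear: first wait spins */
    atomic_init(&retptr64_word, FLAG_BIT); /* bit set: second wait spins */

    bool first_done = wait_for_bit(&bufptr64_word, true, max_polls);
    bool second_done = wait_for_bit(&retptr64_word, false, max_polls);

    return !first_done && !second_done;
}
```

A capped wait like this would turn the silent hang into a reportable
timeout in a test harness. It does not explain why the bits never change,
which is still the open question for the maintainer.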