From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from dpdk.org (dpdk.org [92.243.14.124])
    by inbox.dpdk.org (Postfix) with ESMTP id C179AA00E6
    for ; Tue, 9 Jul 2019 17:51:00 +0200 (CEST)
Received: from [92.243.14.124] (localhost [127.0.0.1])
    by dpdk.org (Postfix) with ESMTP id DCC7B324D;
    Tue, 9 Jul 2019 17:50:59 +0200 (CEST)
Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28])
    by dpdk.org (Postfix) with ESMTP id E689CA3
    for ; Tue, 9 Jul 2019 17:50:57 +0200 (CEST)
Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15])
    (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
    (No client certificate requested)
    by mx1.redhat.com (Postfix) with ESMTPS id 2FEF730872FD;
    Tue, 9 Jul 2019 15:50:52 +0000 (UTC)
Received: from localhost.localdomain (unknown [10.18.25.137])
    by smtp.corp.redhat.com (Postfix) with ESMTP id 4E8B18705E;
    Tue, 9 Jul 2019 15:50:50 +0000 (UTC)
From: Michael Santana Francisco
To: David Marchand
Cc: Aaron Conole, Thomas Monjalon, dev, JananeeX M Parthasarathy,
    david.hunt@intel.com
References: <1559638792-8608-1-git-send-email-david.marchand@redhat.com>
    <1560580950-16754-1-git-send-email-david.marchand@redhat.com>
    <70986373.KVGszKu7e3@xps>
Organization: Red Hat
Message-ID: <139fd420-dbee-0a33-1885-00c9593fe201@redhat.com>
Date: Tue, 9 Jul 2019 11:50:49 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
    Thunderbird/60.4.0
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Content-Language: en-US
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
    (mx1.redhat.com [10.5.110.47]); Tue, 09 Jul 2019 15:50:57 +0000 (UTC)
Subject: Re: [dpdk-dev] [PATCH v2 00/15] Unit tests fixes for CI
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions
Errors-To: dev-bounces@dpdk.org
Sender: "dev"

On 7/1/19 2:07 PM, Michael Santana Francisco wrote:
>>
>>
>> On Mon, Jul 1, 2019 at 6:04 PM Aaron Conole wrote:
>>>>> - rwlock_autotest and hash_readwrite_lf_autotest are taking a little more
>>>>> than 10s,
>>> Occasionally the distributor test times out as well. I've moved them as
>>> part of a separate patch, that I'll post along with a bigger series to
>>> enable the unit tests under travis. Michael and I are leaning toward
>>> introducing a new variable called RUN_TESTS which will do the docs and
>>> unit testing since those combined would add quite a bit to the execution
>>> time of each job (and feel free to bike shed the name, since the patches
>>> aren't final).
>>
>> Seeing how the distributor autotest usually takes less than a second to complete, this sounds like a bug.
>> I don't think I caught this so far.
> So I actually ran into the distributor test timing out. I agree with
> David in that it is a bug with the test. Looking at the logs that test
> normally finishes in less than 1/2 a second, so running to 10 seconds
> and timing out is a big jump in run time. I ran into the issue where
> it timedout, so I restarted the job and it finished no problem.
> The test fails every so often for no good reason and the logs[1] dont
> really say much. I speculate that it is waiting for a resource to
> become available or in the worse case a deadlock. Seeing that it only
> fails every so often and it passes when restarted I don't think it's a
> big deal, nevertheless it's worth investing time figuring out what's
> wrong
>
> [1] https://api.travis-ci.com/v3/job/212335916/log.txt

I investigated this test a little bit.
CC'd David Hunt.

I was able to reproduce the problem on v19.08-rc1 with:

`while sudo sh -c "echo 'distributor_autotest' | ./build/app/test/dpdk-test"; do :; done`

The test runs fine a few times, printing output and showing progress, but after a few seconds it stops producing any output. I let it sit for a whole minute and nothing happened, so I attached gdb to find out what it was doing.

One thread seems to be stuck in a while loop, see lib/librte_distributor/rte_distributor.c:310. I stepped through the assembly (layout asm, ni) and saw the four instructions below, which correspond to the while loop, being executed repeatedly and indefinitely. It looks like this thread is waiting for bufptr64[0] to change state.

    0xa064d0    pause
    0xa064d2    mov    0x3840(%rdx),%rax
    0xa064d9    test   $0x1,%al
    0xa064db    je     0xa064d0

While the first thread is waiting on bufptr64[0], another thread is stuck in a different while loop at lib/librte_distributor/rte_distributor.c:53. This one seems to be waiting for retptr64 to change state. The corresponding assembly, also executed indefinitely:

    0xa06de0    mov    0x38c0(%r8),%rax
    0xa06de7    test   $0x1,%al
    0xa06de9    je     0xa06bbd
    0xa06def    nop
    0xa06df0    pause
    0xa06df2    rdtsc
    0xa06df4    mov    %rdx,%r10
    0xa06df7    shl    $0x20,%r10
    0xa06dfb    mov    %eax,%eax
    0xa06dfd    or     %r10,%rax
    0xa06e00    lea    0x64(%rax),%r10
    0xa06e04    jmp    0xa06e12
    0xa06e06    nopw   %cs:0x0(%rax,%rax,1)
    0xa06e10    pause
    0xa06e12    rdtsc
    0xa06e14    shl    $0x20,%rdx
    0xa06e18    mov    %eax,%eax
    0xa06e1a    or     %rdx,%rax
    0xa06e1d    cmp    %rax,%r10
    0xa06e20    ja     0xa06e10
    0xa06e22    jmp    0xa06de0

My guess is that these threads are interdependent, so each one is waiting for the other to change the state of the variable it is polling.
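To make that guess concrete, here is a minimal single-threaded sketch in C of the kind of wait I think both threads are in. Going by the `je` targets, the first loop appears to spin while bit 0 of its word is clear and the second while bit 0 is set, though that is only my reading of the disassembly. The names `spin_until_bit0_is` and `model_mutual_wait`, and the poll bound, are made up for illustration and are not taken from rte_distributor.c. The real loops have no bound, which would explain why the test hangs instead of failing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not part of librte_distributor. */
enum spin_result { SPIN_OK = 0, SPIN_TIMEOUT = -1 };

/*
 * Poll *word until bit 0 equals `want` (0 or 1), giving up after
 * max_polls reads. The loops seen in gdb have no such bound.
 */
int
spin_until_bit0_is(const volatile uint64_t *word, int want, uint64_t max_polls)
{
	assert(word != NULL);
	for (uint64_t i = 0; i < max_polls; i++) {
		if ((*word & 1u) == (uint64_t)(want != 0))
			return SPIN_OK;
		/* the real code calls rte_pause() between reads */
	}
	return SPIN_TIMEOUT;
}

/*
 * Sequential model of the suspected hang (an assumption, not a trace):
 * side A waits for bit 0 of flag_a to be set, side B waits for bit 0 of
 * flag_b to be cleared, and each side would only update the other's
 * flag after its own wait succeeded. Returns 1 if both waits time out.
 */
int
model_mutual_wait(uint64_t max_polls)
{
	volatile uint64_t flag_a = 0; /* A wants bit 0 set   */
	volatile uint64_t flag_b = 1; /* B wants bit 0 clear */

	int a = spin_until_bit0_is(&flag_a, 1, max_polls);
	if (a == SPIN_OK)
		flag_b = 0; /* never reached in this model */

	int b = spin_until_bit0_is(&flag_b, 0, max_polls);
	if (b == SPIN_OK)
		flag_a = 1; /* never reached in this model */

	return a == SPIN_TIMEOUT && b == SPIN_TIMEOUT;
}
```

If the real code has this shape, a bounded wait like the one above would turn the silent hang into a reportable timeout, which might make the failure easier to debug in CI.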
I can't say for sure whether this is what is happening, or why these variables don't change state, so I would like to ask someone who is more familiar with this particular code to take a look.

>>
>> Yes, we need a variable to control this and select the targets that will do the tests and/or build the doc.
>> About the name, RUN_TESTS is ok for me.
>>
>> What do you want to make of this variable?
>> Have it as a simple boolean that enables everything? Or a selector with strings like unit-tests+doc+perf-tests?
>>
>>
>>>>> - librte_table unit test crashes on ipv6 [2],
>>> I guess we're waiting on a patch from Jananee (CC'd)?
>>
>> Yep.
>>
>>
>> --
>> David Marchand