From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from dpdk.org (dpdk.org [92.243.14.124])
 by inbox.dpdk.org (Postfix) with ESMTP id C179AA00E6
 for <public@inbox.dpdk.org>; Tue, 9 Jul 2019 17:51:00 +0200 (CEST)
Received: from [92.243.14.124] (localhost [127.0.0.1])
 by dpdk.org (Postfix) with ESMTP id DCC7B324D;
 Tue, 9 Jul 2019 17:50:59 +0200 (CEST)
Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28])
 by dpdk.org (Postfix) with ESMTP id E689CA3
 for <dev@dpdk.org>; Tue, 9 Jul 2019 17:50:57 +0200 (CEST)
Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.redhat.com (Postfix) with ESMTPS id 2FEF730872FD;
 Tue, 9 Jul 2019 15:50:52 +0000 (UTC)
Received: from localhost.localdomain (unknown [10.18.25.137])
 by smtp.corp.redhat.com (Postfix) with ESMTP id 4E8B18705E;
 Tue, 9 Jul 2019 15:50:50 +0000 (UTC)
From: Michael Santana Francisco <msantana@redhat.com>
To: David Marchand <david.marchand@redhat.com>
Cc: Aaron Conole <aconole@redhat.com>,
 Thomas Monjalon <thomas@monjalon.net>, dev <dev@dpdk.org>,
 JananeeX M Parthasarathy <jananeex.m.parthasarathy@intel.com>,
 david.hunt@intel.com
References: <1559638792-8608-1-git-send-email-david.marchand@redhat.com>
 <1560580950-16754-1-git-send-email-david.marchand@redhat.com>
 <70986373.KVGszKu7e3@xps>
 <f7tk1d1rbay.fsf@dhcp-25.97.bos.redhat.com>
 <CAJFAV8zYBgMoA7fBgDOM4ECiRkBsf6U+NPMWW=f+zNBmjFPUQw@mail.gmail.com>
 <CABzctQ8T9GpGkoeBn+snqaDfb369gWjnHEAO_6YWS7e+VuDsxw@mail.gmail.com>
Organization: Red Hat
Message-ID: <139fd420-dbee-0a33-1885-00c9593fe201@redhat.com>
Date: Tue, 9 Jul 2019 11:50:49 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
 Thunderbird/60.4.0
MIME-Version: 1.0
In-Reply-To: <CABzctQ8T9GpGkoeBn+snqaDfb369gWjnHEAO_6YWS7e+VuDsxw@mail.gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Content-Language: en-US
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
 (mx1.redhat.com [10.5.110.47]); Tue, 09 Jul 2019 15:50:57 +0000 (UTC)
Subject: Re: [dpdk-dev] [PATCH v2 00/15] Unit tests fixes for CI
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>

On 7/1/19 2:07 PM, Michael Santana Francisco wrote:
>>
>>
>> On Mon, Jul 1, 2019 at 6:04 PM Aaron Conole <aconole@redhat.com> wrote:
>>>>> - rwlock_autotest and hash_readwrite_lf_autotest are taking a little more
>>>>> than 10s,
>>> Occasionally the distributor test times out as well. I've moved them as
>>> part of a separate patch, that I'll post along with a bigger series to
>>> enable the unit tests under travis. Michael and I are leaning toward
>>> introducing a new variable called RUN_TESTS which will do the docs and
>>> unit testing since those combined would add quite a bit to the execution
>>> time of each job (and feel free to bike shed the name, since the patches
>>> aren't final).
>>
>> Seeing how the distributor autotest usually takes less than a second to complete, this sounds like a bug.
>> I don't think I caught this so far.
> So I actually ran into the distributor test timing out. I agree with
> David in that it is a bug with the test. Looking at the logs that test
> normally finishes in less than 1/2 a second, so running to 10 seconds
> and timing out is a big jump in run time.
> I ran into the issue where
> it timed out, so I restarted the job and it finished no problem.
> The test fails every so often for no good reason and the logs[1] don't
> really say much. I speculate that it is waiting for a resource to
> become available or, in the worst case, a deadlock. Seeing that it only
> fails every so often and it passes when restarted I don't think it's a
> big deal, nevertheless it's worth investing time figuring out what's
> wrong
>
> [1] https://api.travis-ci.com/v3/job/212335916/log.txt

I investigated this test a little. CC'd David Hunt.

I was able to reproduce the problem on v19.08-rc1 with:

`while sudo sh -c "echo 'distributor_autotest' | ./build/app/test/dpdk-test"; do :; done`

It runs fine a couple of times, showing output and progress, but after
a few seconds it stops producing any output. I let it sit for a whole
minute and nothing happened.

So I attached gdb to try to figure out what was happening. One thread
seems to be stuck in a while loop, see
lib/librte_distributor/rte_distributor.c:310.

I looked at the assembly code (layout asm, ni) and saw the four lines
below (which correspond to the while loop) being executed repeatedly and
indefinitely. It looks like this thread is waiting for the variable
bufptr64[0] to change state.

0xa064d0 <release+32>   pause
0xa064d2 <release+34>   mov    0x3840(%rdx),%rax
0xa064d9 <release+41>   test   $0x1,%al
0xa064db <release+43>   je     0xa064d0 <release+32>

While the first thread is waiting on bufptr64[0] to change state,
another thread is also stuck in a while loop, at
lib/librte_distributor/rte_distributor.c:53. It seems that this thread
is stuck waiting for retptr64 to change state.
Corresponding assembly being executed indefinitely:

0xa06de0 <rte_distributor_request_pkt_v1705+592>   mov    0x38c0(%r8),%rax
0xa06de7 <rte_distributor_request_pkt_v1705+599>   test   $0x1,%al
0xa06de9 <rte_distributor_request_pkt_v1705+601>   je     0xa06bbd <rte_distributor_request_pkt_v1705+45>
0xa06def <rte_distributor_request_pkt_v1705+607>   nop
0xa06df0 <rte_distributor_request_pkt_v1705+608>   pause
0xa06df2 <rte_distributor_request_pkt_v1705+610>   rdtsc
0xa06df4 <rte_distributor_request_pkt_v1705+612>   mov    %rdx,%r10
0xa06df7 <rte_distributor_request_pkt_v1705+615>   shl    $0x20,%r10
0xa06dfb <rte_distributor_request_pkt_v1705+619>   mov    %eax,%eax
0xa06dfd <rte_distributor_request_pkt_v1705+621>   or     %r10,%rax
0xa06e00 <rte_distributor_request_pkt_v1705+624>   lea    0x64(%rax),%r10
0xa06e04 <rte_distributor_request_pkt_v1705+628>   jmp    0xa06e12 <rte_distributor_request_pkt_v1705+642>
0xa06e06 <rte_distributor_request_pkt_v1705+630>   nopw   %cs:0x0(%rax,%rax,1)
0xa06e10 <rte_distributor_request_pkt_v1705+640>   pause
0xa06e12 <rte_distributor_request_pkt_v1705+642>   rdtsc
0xa06e14 <rte_distributor_request_pkt_v1705+644>   shl    $0x20,%rdx
0xa06e18 <rte_distributor_request_pkt_v1705+648>   mov    %eax,%eax
0xa06e1a <rte_distributor_request_pkt_v1705+650>   or     %rdx,%rax
0xa06e1d <rte_distributor_request_pkt_v1705+653>   cmp    %rax,%r10
0xa06e20 <rte_distributor_request_pkt_v1705+656>   ja     0xa06e10 <rte_distributor_request_pkt_v1705+640>
0xa06e22 <rte_distributor_request_pkt_v1705+658>   jmp    0xa06de0 <rte_distributor_request_pkt_v1705+592>

My guess is that these threads are interdependent, so each thread is
waiting for the other to change the state of its control variable. I
can't say for sure whether this is what is happening, or why these
variables don't change state, so I would like to ask someone who is more
familiar with this particular code to take a look.

>>
>> Yes, we need a variable to control this and select the targets that will do the tests and/or build the doc.
>> About the name, RUN_TESTS is ok for me.
>>
>> What do you want to make of this variable?
>> Have it as a simple boolean that enables everything? Or a selector with strings like unit-tests+doc+perf-tests?
>>
>>
>>>>> - librte_table unit test crashes on ipv6 [2],
>>> I guess we're waiting on a patch from Jananee (CC'd)?
>>
>> Yep.
>>
>>
>> --
>> David Marchand