From: Aaron Conole
To: David Marchand
Cc: Ruifeng Wang, David Hunt, dev, hkalra@marvell.com, Gavin Hu,
 Honnappa Nagarahalli, nd, dpdk stable
References: <20191008095524.1585-1-ruifeng.wang@arm.com>
Date: Tue, 08 Oct 2019 16:08:15 -0400
In-Reply-To: (David Marchand's message of "Tue, 8 Oct 2019 21:46:37 +0200")
Subject: Re: [dpdk-stable] [dpdk-dev] [PATCH] lib/distributor: fix deadlock
 issue for aarch64
List-Id: patches for DPDK stable branches

David Marchand writes:

> On Tue, Oct 8, 2019 at 7:06 PM Aaron Conole wrote:
>>
>> Ruifeng Wang writes:
>>
>> >
>> > Distributor and worker threads rely on data structs in cache line
>> > for synchronization. The shared data structs were not protected.
>> > This caused deadlock issues on weaker memory ordering platforms
>> > such as aarch64.
>> > Fix this issue by adding memory barriers to ensure synchronization
>> > among cores.
>> >
>> > Bugzilla ID: 342
>> > Fixes: 775003ad2f96 ("distributor: add new burst-capable library")
>> > Cc: stable@dpdk.org
>> >
>> > Signed-off-by: Ruifeng Wang
>> > Reviewed-by: Gavin Hu
>> > ---
>>
>> I see a failure in the distributor_autotest (on one of the builds):
>>
>> 64/82 DPDK:fast-tests / distributor_autotest  FAIL  0.37 s (exit
>> status 255 or signal 127 SIGinvalid)
>>
>> --- command ---
>>
>> DPDK_TEST='distributor_autotest'
>> /home/travis/build/ovsrobot/dpdk/build/app/test/dpdk-test -l 0-1
>> --file-prefix=distributor_autotest
>>
>> --- stdout ---
>>
>> EAL: Probing VFIO support...
>>
>> APP: HPET is not enabled, using TSC as default timer
>>
>> RTE>>distributor_autotest
>>
>> === Basic distributor sanity tests ===
>>
>> Worker 0 handled 32 packets
>>
>> Sanity test with all zero hashes done.
>>
>> Worker 0 handled 32 packets
>>
>> Sanity test with non-zero hashes done
>>
>> === testing big burst (single) ===
>>
>> Sanity test of returned packets done
>>
>> === Sanity test with mbuf alloc/free (single) ===
>>
>> Sanity test with mbuf alloc/free passed
>>
>> Too few cores to run worker shutdown test
>>
>> === Basic distributor sanity tests ===
>>
>> Worker 0 handled 32 packets
>>
>> Sanity test with all zero hashes done.
>>
>> Worker 0 handled 32 packets
>>
>> Sanity test with non-zero hashes done
>>
>> === testing big burst (burst) ===
>>
>> Sanity test of returned packets done
>>
>> === Sanity test with mbuf alloc/free (burst) ===
>>
>> Line 326: Packet count is incorrect, 1048568, expected 1048576
>>
>> Test Failed
>>
>> RTE>>
>>
>> --- stderr ---
>>
>> EAL: Detected 2 lcore(s)
>>
>> EAL: Detected 1 NUMA nodes
>>
>> EAL: Multi-process socket /var/run/dpdk/distributor_autotest/mp_socket
>>
>> EAL: Selected IOVA mode 'PA'
>>
>> EAL: No available hugepages reported in hugepages-1048576kB
>>
>> -------
>>
>> Not sure how to help debug further.  I'll re-start the job to see if
>> it 'clears' up - but I guess there may be a delicate synchronization
>> somewhere that needs to be accounted.
>
> Idem, and with the same loop I used before, it can be caught quickly.
>
> # time (log=/tmp/$$.log; while true; do echo distributor_autotest
> |taskset -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8
> -l 0-1 >$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm
> -f $log)

Probably good to document it, yes.  It seems to be a good technique for
reproducing failures.

> [snip]
>
> RTE>>distributor_autotest
> EAL: Trying to obtain current memory policy.
> EAL: Setting policy MPOL_PREFERRED for socket 0
> EAL: Restoring previous memory policy: 0
> EAL: request: mp_malloc_sync
> EAL: Heap on socket 0 was expanded by 2MB
> EAL: Trying to obtain current memory policy.
> EAL: Setting policy MPOL_PREFERRED for socket 0
> EAL: Restoring previous memory policy: 0
> EAL: alloc_pages_on_heap(): couldn't allocate physically contiguous space
> EAL: Trying to obtain current memory policy.
> EAL: Setting policy MPOL_PREFERRED for socket 0
> EAL: Restoring previous memory policy: 0
> EAL: request: mp_malloc_sync
> EAL: Heap on socket 0 was expanded by 8MB
> === Basic distributor sanity tests ===
> Worker 0 handled 32 packets
> Sanity test with all zero hashes done.
> Worker 0 handled 32 packets
> Sanity test with non-zero hashes done
> === testing big burst (single) ===
> Sanity test of returned packets done
>
> === Sanity test with mbuf alloc/free (single) ===
> Sanity test with mbuf alloc/free passed
>
> Too few cores to run worker shutdown test
> === Basic distributor sanity tests ===
> Worker 0 handled 32 packets
> Sanity test with all zero hashes done.
> Worker 0 handled 32 packets
> Sanity test with non-zero hashes done
> === testing big burst (burst) ===
> Sanity test of returned packets done
>
> === Sanity test with mbuf alloc/free (burst) ===
> Line 326: Packet count is incorrect, 1048568, expected 1048576
> Test Failed
> RTE>>
> real 0m36.668s
> user 1m7.293s
> sys 0m1.560s
>
> Could be worth running this loop on all tests? (not talking about the
> CI, it would be a manual effort to catch lurking issues).