Hi Jerin

On 10/13/2017 9:49 AM, Jerin Jacob Wrote:
> -----Original Message-----
>> Date: Fri, 13 Oct 2017 09:16:31 +0800
>> From: Jia He
>> To: Jerin Jacob, "Ananyev, Konstantin"
>> Cc: Olivier MATZ, "dev@dpdk.org",
>>  "jia.he@hxt-semitech.com",
>>  "jie2.liu@hxt-semitech.com",
>>  "bing.zhao@hxt-semitech.com"
>> Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when
>>  doing enqueue/dequeue
>> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>>  Thunderbird/52.3.0
>>
>> Hi
>>
>>
>> On 10/13/2017 9:02 AM, Jia He Wrote:
>>> Hi Jerin
>>>
>>>
>>> On 10/13/2017 1:23 AM, Jerin Jacob Wrote:
>>>> -----Original Message-----
>>>>> Date: Thu, 12 Oct 2017 17:05:50 +0000
>>>>>
>> [...]
>>>> On the same lines,
>>>>
>>>> Jia He, jie2.liu, bing.zhao,
>>>>
>>>> Is this patch based on code review or do you saw this issue on any
>>>> of the
>>>> arm/ppc target? arm64 will have performance impact with this change.
>> sorry, miss one important information
>> Our platform is an aarch64 server with 46 cpus.
> Is this an OOO(Out of order execution) aarch64 CPU implementation?
I think so, it is a server cpu (ARMv8-A), but do you know how to confirm it?

cat /proc/cpuinfo
processor       : 0
BogoMIPS        : 40.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid asimdrdm
CPU implementer : 0x51
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0x800
CPU revision    : 0

>> If we reduced the involved cpu numbers, the bug occurred less frequently.
>>
>> Yes, mb barrier impact the performance, but correctness is more important,
>> isn't it ;-)
> Yes.
>
>> Maybe we can  find any other lightweight barrier here?
> Yes, Regarding the lightweight barrier, arm64 has native support for acquire and release
> semantics, which is exposed through gcc as architecture agnostic
> functions.
> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> http://preshing.com/20130922/acquire-and-release-fences/
>
> Good to know,
> 1) How much overhead this patch in your platform? Just relative
> numbers are enough
I created a *standalone* test case for test_mbuf; the debug patch is attached.
It is hard to believe, but the performance with the rmb barrier added is
better than without it.

With this patch (4 times running):
time ./test_good --no-huge -l 1-20
real    0m23.311s
user    7m21.870s
sys     0m0.021s

Without this patch:
time ./test_bad --no-huge -l 1-20
real    0m38.972s
user    12m35.271s
sys     0m0.030s

Cheers,
Jia
> 2) As a prototype, Is Changing to acquire and release schematics
> reduces the overhead in your platform?
>
> Reference FreeBSD ring/DPDK style ring implementation through acquire
> and release schematics
> https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
>
> I will also spend on cycles on this.
>
>
>> Cheers,
>> Jia
>>> Based on mbuf_autotest, the rte_panic will be invoked in seconds.
>>>
>>> PANIC in test_refcnt_iter():
>>> (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
>>> 1: [./test(rte_dump_stack+0x38) [0x58d868]]
>>> Aborted (core dumped)
>>>
>>> Cheers,
>>> Jia
>>>>
>>>>> Konstantin
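
For reference, a rough sketch of what the acquire/release approach suggested in this thread could look like with the gcc __atomic builtins. This is only a minimal single-producer/single-consumer illustration, not the rte_ring code and not the patch under discussion. All names (demo_ring, demo_enqueue, demo_dequeue, DEMO_RING_SIZE) are hypothetical, and a multi-producer ring would additionally need a compare-and-swap on the head index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical demo ring; size must be a power of two. */
#define DEMO_RING_SIZE 8u
#define DEMO_RING_MASK (DEMO_RING_SIZE - 1u)

struct demo_ring {
    uint32_t prod_tail;            /* written only by the producer */
    uint32_t cons_tail;            /* written only by the consumer */
    void *slots[DEMO_RING_SIZE];
};

static void demo_ring_init(struct demo_ring *r)
{
    assert((DEMO_RING_SIZE & DEMO_RING_MASK) == 0);
    r->prod_tail = 0;
    r->cons_tail = 0;
}

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int demo_enqueue(struct demo_ring *r, void *obj)
{
    uint32_t prod = __atomic_load_n(&r->prod_tail, __ATOMIC_RELAXED);
    /* Acquire: the consumer's read of a slot happens before we reuse it. */
    uint32_t cons = __atomic_load_n(&r->cons_tail, __ATOMIC_ACQUIRE);

    if (prod - cons == DEMO_RING_SIZE)
        return -1;

    r->slots[prod & DEMO_RING_MASK] = obj;
    /* Release: the slot write is visible before the new tail is. */
    __atomic_store_n(&r->prod_tail, prod + 1, __ATOMIC_RELEASE);
    return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
static int demo_dequeue(struct demo_ring *r, void **obj)
{
    uint32_t cons = __atomic_load_n(&r->cons_tail, __ATOMIC_RELAXED);
    /* Acquire: pairs with the producer's release store of prod_tail. */
    uint32_t prod = __atomic_load_n(&r->prod_tail, __ATOMIC_ACQUIRE);

    if (prod == cons)
        return -1;

    *obj = r->slots[cons & DEMO_RING_MASK];
    /* Release: the slot read completes before the slot is handed back. */
    __atomic_store_n(&r->cons_tail, cons + 1, __ATOMIC_RELEASE);
    return 0;
}
```

On arm64, gcc typically lowers these acquire loads and release stores to load-acquire/store-release instructions rather than a full barrier, which is the reason this style is worth prototyping against the rmb variant.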