From: Jia He
Date: Tue, 24 Oct 2017 10:04:26 +0800
To: Jerin Jacob
Cc: "Ananyev, Konstantin", "Zhao, Bing", Olivier MATZ, dev@dpdk.org,
 jia.he@hxt-semitech.com, jie2.liu@hxt-semitech.com,
 bing.zhao@hxt-semitech.com, "Richardson, Bruce"
Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
 loading when doing enqueue/dequeue
In-Reply-To: <20171023100617.GA17957@jerin>

Hi Jerin


On 10/23/2017 6:06 PM, Jerin Jacob wrote:
> -----Original Message-----
>>
>> Hi Jerin
>>
>>
>> On 10/20/2017 1:43 PM, Jerin Jacob wrote:
>>> -----Original Message-----
>> [...]
>>>> dependent on each other.
>>>> Thus a memory barrier is necessary.
>>> Yes. The barrier is necessary.
>>> In fact, upstream FreeBSD fixed this issue for arm64. The DPDK ring
>>> implementation is derived from FreeBSD's buf_ring.h.
>>> https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
>>>
>>> I think the only outstanding issue is how to reduce the performance
>>> impact for arm64. I believe using acquire/release semantics instead
>>> of rte_smp_rmb() will reduce the performance overhead, as in the
>>> similar ring implementations below:
>>> freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
>>> odp: https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
>>>
>>> Jia,
>>> 1) Can you verify that the use of acquire/release semantics fixes the
>>> problem on your platform? e.g. the use of atomic_load_acq* in the
>>> reference code.
>>> 2) If so, what is the overhead between acquire/release and a plain
>>> rte_smp_rmb() barrier? Based on that, we can decide which path to take.
>> I've tested 3 cases. The new 3rd case uses the load_acquire barrier
>> (half barrier) you mentioned at the link above.
>> The patch looks like:
>>
>> @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
>>                 /* Reset n to the initial burst count */
>>                 n = max;
>>
>> -               *old_head = r->prod.head;
>> -               const uint32_t cons_tail = r->cons.tail;
>> +               *old_head = atomic_load_acq_32(&r->prod.head);
>> +               const uint32_t cons_tail = atomic_load_acq_32(&r->cons.tail);
>>
>> @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>>                 /* Restore n as it may change every loop */
>>                 n = max;
>>
>> -               *old_head = r->cons.head;
>> -               const uint32_t prod_tail = r->prod.tail;
>> +               *old_head = atomic_load_acq_32(&r->cons.head);
>> +               const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail);
>>                 /* The subtraction is done between two unsigned 32-bit values
>>                  * (the result is always modulo 32 bits even if we have
>>                  * cons_head > prod_tail). So 'entries' is always between 0
>>                  * and size(ring)-1. */
>>
>> The half barrier patch passed the functional test.
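>>
>> For reference, a minimal sketch of what such an acquire load looks
>> like with the GCC/C11 builtins (an illustration of the semantics
>> only, not the patch as tested; the name load_acq_32 is just for this
>> sketch, mirroring FreeBSD's atomic_load_acq_32):
>>
>> #include <stdint.h>
>>
>> /* One-way "half" barrier: no load after this one may be reordered
>>  * before it; unlike rte_smp_rmb(), loads before it may still be
>>  * reordered past it. On arm64 this compiles to a single ldar. */
>> static inline uint32_t
>> load_acq_32(const volatile uint32_t *p)
>> {
>>         return __atomic_load_n(p, __ATOMIC_ACQUIRE);
>> }
>>
>> /* e.g. in __rte_ring_move_prod_head():
>>  *   *old_head = load_acq_32(&r->prod.head);
>>  */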
>>
>> As for the performance comparison on *arm64* (the debug patch is at
>> http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see
>> the test results below:
>>
>> [case 1] old code, no barrier
>> ============================================
>>  Performance counter stats for './test --no-huge -l 1-10':
>>
>>      689275.001200      task-clock (msec)         #    9.771 CPUs utilized
>>               6223      context-switches          #    0.009 K/sec
>>                 10      cpu-migrations            #    0.000 K/sec
>>                653      page-faults               #    0.001 K/sec
>>      1721190914583      cycles                    #    2.497 GHz
>>      3363238266430      instructions              #    1.95  insn per cycle
>>                         branches
>>           27804740      branch-misses             #    0.00% of all branches
>>
>>       70.540618825 seconds time elapsed
>>
>> [case 2] full barrier with rte_smp_rmb()
>> ============================================
>>  Performance counter stats for './test --no-huge -l 1-10':
>>
>>      582557.895850      task-clock (msec)         #    9.752 CPUs utilized
>>               5242      context-switches          #    0.009 K/sec
>>                 10      cpu-migrations            #    0.000 K/sec
>>                665      page-faults               #    0.001 K/sec
>>      1454360730055      cycles                    #    2.497 GHz
>>       587197839907      instructions              #    0.40  insn per cycle
>>                         branches
>>           27799687      branch-misses             #    0.00% of all branches
>>
>>       59.735582356 seconds time elapsed
>>
>> [case 3] half barrier with load_acquire
>> ============================================
>>  Performance counter stats for './test --no-huge -l 1-10':
>>
>>      660758.877050      task-clock (msec)         #    9.764 CPUs utilized
>>               5982      context-switches          #    0.009 K/sec
>>                 11      cpu-migrations            #    0.000 K/sec
>>                657      page-faults               #    0.001 K/sec
>>      1649875318044      cycles                    #    2.497 GHz
>>       591583257765      instructions              #    0.36  insn per cycle
>>                         branches
>>           27994903      branch-misses             #    0.00% of all branches
>>
>>       67.672855107 seconds time elapsed
>>
>> Please note the context-switch counts in the perf results.
>> The test results sorted by running time are:
>> full barrier < half barrier < no barrier
>>
>> AFAICT, in this case the cpu reordering adds the possibility of
>> context switching and increases the running time.
>> Any ideas?
> Regarding the performance test, it is better to use the ring perf test
> case on _isolated_ cores to measure the impact on the number of
> enqueue/dequeue operations.
>
> example:
> ./build/app/test -c 0xff -n 4
> RTE>> ring_perf_autotest

It seems that on our arm64 server, ring_perf_autotest finishes in a few
seconds. Is anything wrong with the configuration or environment setup?

root@ubuntu:/home/hj/dpdk/build/build/test/test# ./test -c 0xff -n 4
EAL: Detected 44 lcore(s)
EAL: Probing VFIO support...
APP: HPET is not enabled, using TSC as default timer
RTE>>per_lcore_autotest
RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 0
MP/MC single enq/dequeue: 2
SP/SC burst enq/dequeue (size: 8): 0
MP/MC burst enq/dequeue (size: 8): 0
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.02
MC empty dequeue: 0.04

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 0.12
MP/MC bulk enq/dequeue (size: 8): 0.31
SP/SC bulk enq/dequeue (size: 32): 0.05
MP/MC bulk enq/dequeue (size: 32): 0.09

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 0.12
MP/MC bulk enq/dequeue (size: 8): 0.39
SP/SC bulk enq/dequeue (size: 32): 0.04
MP/MC bulk enq/dequeue (size: 32): 0.12

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 0.37
MP/MC bulk enq/dequeue (size: 8): 0.92
SP/SC bulk enq/dequeue (size: 32): 0.12
MP/MC bulk enq/dequeue (size: 32): 0.26
Test OK
RTE>>

Cheers,
Jia

> By default, arm64 + dpdk uses the el0 counter to measure cycles. I
> think, in your SoC, it will be running at 50MHz or 100MHz. So you can
> follow the scheme below to get an accurate cycle measurement:
>
> See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
> check: 44.2.2. High-resolution cycle counter
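>
> For reference, a rough sketch of reading the two arm64 counters in
> question (these are standard AArch64 system registers, not DPDK APIs;
> just an illustration of why the numbers above round down to 0):
>
> #include <stdint.h>
>
> /* Generic timer counter: always readable at EL0, but it typically
>  * runs at only 50-100 MHz, so short enqueue/dequeue loops can round
>  * down to 0 "cycles" per operation. */
> static inline uint64_t read_cntvct(void)
> {
>         uint64_t v;
>         asm volatile("mrs %0, cntvct_el0" : "=r"(v));
>         return v;
> }
>
> /* PMU cycle counter: full CPU clock resolution, but EL0 access must
>  * first be enabled from the kernel, as described in the profiling
>  * guide above. */
> static inline uint64_t read_pmccntr(void)
> {
>         uint64_t v;
>         asm volatile("mrs %0, pmccntr_el0" : "=r"(v));
>         return v;
> }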