From: Bruce Richardson
To: "Kuusisaari, Juhamatti"
Cc: Jia He, Jerin Jacob, "Ananyev, Konstantin", "Zhao, Bing", Olivier MATZ,
 dev@dpdk.org, jia.he@hxt-semitech.com, jie2.liu@hxt-semitech.com,
 bing.zhao@hxt-semitech.com
Date: Mon, 23 Oct 2017 10:10:59 +0100
Message-ID: <20171023091059.GA17728@bricha3-MOBL3.ger.corp.intel.com>
Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

On Mon, Oct 23, 2017 at 09:05:24AM +0000, Kuusisaari, Juhamatti wrote:
>
> Hi,
>
> > On 10/20/2017 1:43 PM, Jerin Jacob wrote:
> > > -----Original Message-----
> > >>
> > [...]
> > >> dependent on each other. Thus a memory barrier is necessary.
> > >
> > > Yes. The barrier is necessary. In fact, upstream FreeBSD fixed this
> > > issue for arm64. The DPDK ring implementation is derived from
> > > FreeBSD's buf_ring.h:
> > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > >
> > > I think the only outstanding issue is how to reduce the performance
> > > impact for arm64. I believe using acquire/release semantics instead
> > > of rte_smp_rmb() will reduce the performance overhead, as in the
> > > similar ring implementations below.
> > > freebsd:
> > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > > odp:
> > > https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
> > >
> > > Jia,
> > > 1) Can you verify that acquire/release semantics fix the problem on
> > > your platform, e.g. using atomic_load_acq* as in the reference code?
> > > 2) If so, what is the overhead of acquire/release versus plain
> > > rte_smp_rmb() barriers? Based on that we need to decide which path
> > > to take.
> >
> > I've tested 3 cases. The new 3rd case is to use the load_acquire
> > barrier (half barrier) you mentioned at the link above.
> > The patch seems like:
> >
> > @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> >                 /* Reset n to the initial burst count */
> >                 n = max;
> >
> > -               *old_head = r->prod.head;
> > -               const uint32_t cons_tail = r->cons.tail;
> > +               *old_head = atomic_load_acq_32(&r->prod.head);
> > +               const uint32_t cons_tail = atomic_load_acq_32(&r->cons.tail);
> >
> > @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s
> >                 /* Restore n as it may change every loop */
> >                 n = max;
> >
> > -               *old_head = r->cons.head;
> > -               const uint32_t prod_tail = r->prod.tail;
> > +               *old_head = atomic_load_acq_32(&r->cons.head);
> > +               const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail);
> >                 /* The subtraction is done between two unsigned 32bits value
> >                  * (the result is always modulo 32 bits even if we have
> >                  * cons_head > prod_tail). So 'entries' is always between 0
> >                  * and size(ring)-1. */
> >
> > The half barrier patch passed the functional test.
> >
> > As for the performance comparison on *arm64* (the debug patch is at
> > http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see
> > the test results below:
> >
> > [case 1] old code, no barrier
> > ============================================
> >  Performance counter stats for './test --no-huge -l 1-10':
> >
> >      689275.001200      task-clock (msec)         #    9.771 CPUs utilized
> >               6223      context-switches          #    0.009 K/sec
> >                 10      cpu-migrations            #    0.000 K/sec
> >                653      page-faults               #    0.001 K/sec
> >      1721190914583      cycles                    #    2.497 GHz
> >      3363238266430      instructions              #    1.95  insn per cycle
> >                         branches
> >           27804740      branch-misses             #    0.00% of all branches
> >
> >       70.540618825 seconds time elapsed
> >
> > [case 2] full barrier with rte_smp_rmb()
> > ============================================
> >  Performance counter stats for './test --no-huge -l 1-10':
> >
> >      582557.895850      task-clock (msec)         #    9.752 CPUs utilized
> >               5242      context-switches          #    0.009 K/sec
> >                 10      cpu-migrations            #    0.000 K/sec
> >                665      page-faults               #    0.001 K/sec
> >      1454360730055      cycles                    #    2.497 GHz
> >       587197839907      instructions              #    0.40  insn per cycle
> >                         branches
> >           27799687      branch-misses             #    0.00% of all branches
> >
> >       59.735582356 seconds time elapsed
> >
> > [case 3] half barrier with load_acquire
> > ============================================
> >  Performance counter stats for './test --no-huge -l 1-10':
> >
> >      660758.877050      task-clock (msec)         #    9.764 CPUs utilized
> >               5982      context-switches          #    0.009 K/sec
> >                 11      cpu-migrations            #    0.000 K/sec
> >                657      page-faults               #    0.001 K/sec
> >      1649875318044      cycles                    #    2.497 GHz
> >       591583257765      instructions              #    0.36  insn per cycle
> >                         branches
> >           27994903      branch-misses             #    0.00% of all branches
> >
> >       67.672855107 seconds time elapsed
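
The quoted hunks use the FreeBSD-style atomic_load_acq_32() helper, which
comes from FreeBSD's sys/atomic.h rather than from DPDK. For anyone who wants
to reproduce the half-barrier case without that helper, a minimal sketch using
the GCC/C11 __atomic builtins is below; the name load_acq_u32() is made up for
illustration, and this snippet is not the patch itself.

    #include <stdint.h>

    /* Hypothetical helper: 32-bit load with acquire ordering. On arm64 an
     * acquire load typically compiles to ldar, instead of a plain ldr plus a
     * separate dmb barrier as with rte_smp_rmb(). */
    static inline uint32_t
    load_acq_u32(const volatile uint32_t *addr)
    {
            return __atomic_load_n(addr, __ATOMIC_ACQUIRE);
    }

    /* Usage corresponding to the hunk in __rte_ring_move_prod_head():
     *
     *     *old_head = load_acq_u32(&r->prod.head);
     *     const uint32_t cons_tail = load_acq_u32(&r->cons.tail);
     *
     * The acquire ordering stops later loads from being hoisted above these
     * reads, which is the reordering the plain loads allowed on arm64. */
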
> > Please see the context-switches in the perf results. The test results
> > sorted by running time are: full barrier < half barrier < no barrier.
> >
> > AFAICT, in this case the CPU reordering adds to the possibility of
> > context switching and increases the running time.
> >
> > Any ideas?
>
> You could also try a case where you rearrange the rte_ring structure so
> that prod.head and cons.tail are part of a union with a 64-bit variable,
> and then read this 64-bit variable with one atomic read. I do not think
> the half barrier is even needed with this approach, as long as you can
> really read the 64-bit variable fully at once. This should speed things
> up.
>
> Cheers,
> --

Yes, that would work, but unfortunately it means mixing producer and
consumer data onto the same cacheline, which may have other negative
effects on performance (depending on arch/platform) due to cores
invalidating each other's cachelines.

/Bruce
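
For concreteness, the union idea above might look something like the sketch
below. All names are hypothetical and this is not the actual rte_ring layout;
it only illustrates the single-64-bit-load read that the suggestion relies on.

    #include <stdint.h>

    /* Hypothetical layout for illustration only -- NOT struct rte_ring.
     * Producer head and consumer tail share one 64-bit word so both can be
     * observed with a single atomic load. */
    union head_tail_snapshot {
            uint64_t raw;
            struct {
                    uint32_t prod_head;
                    uint32_t cons_tail;
            };
    };

    static inline union head_tail_snapshot
    read_snapshot(const volatile uint64_t *src)
    {
            union head_tail_snapshot s;

            /* One naturally aligned 64-bit load returns both indexes from
             * the same point in time, so no ordering between two separate
             * 32-bit loads is needed. */
            s.raw = __atomic_load_n(src, __ATOMIC_RELAXED);
            return s;
    }

The trade-off is the one raised in the reply above: packing prod_head and
cons_tail into one word also puts them on the same cacheline, so producer
updates and consumer reads start bouncing that line between cores, whereas the
existing rte_ring keeps the producer and consumer state on separate
cache-aligned lines.
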