From: Bruce Richardson
To: "Kuusisaari, Juhamatti"
Cc: Jia He, Jerin Jacob, "Ananyev, Konstantin", "Zhao, Bing", Olivier MATZ,
 dev@dpdk.org, jia.he@hxt-semitech.com, jie2.liu@hxt-semitech.com,
 bing.zhao@hxt-semitech.com
Date: Mon, 23 Oct 2017 10:10:59 +0100
Message-ID: <20171023091059.GA17728@bricha3-MOBL3.ger.corp.intel.com>
Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

On Mon, Oct 23, 2017 at 09:05:24AM +0000, Kuusisaari, Juhamatti wrote:
>
> Hi,
>
> > On 10/20/2017 1:43 PM, Jerin Jacob wrote:
> > > -----Original Message-----
> > >>
> > [...]
> > >> dependent on each other. Thus a memory barrier is necessary.
> > >
> > > Yes. The barrier is necessary. In fact, upstream FreeBSD fixed this
> > > issue for arm64. The DPDK ring implementation is derived from
> > > FreeBSD's buf_ring.h:
> > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > >
> > > I think the only outstanding issue is how to reduce the performance
> > > impact for arm64. I believe using acquire/release semantics instead
> > > of rte_smp_rmb() will reduce the performance overhead, as in the
> > > similar ring implementations below.
> > > freebsd:
> > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > > odp:
> > > https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
> > >
> > > Jia,
> > > 1) Can you verify that acquire/release semantics fix the problem on
> > > your platform, e.g. using atomic_load_acq* as in the reference code?
> > > 2) If so, what is the overhead of acquire/release versus plain
> > > rte_smp_rmb() barriers? Based on that we need to decide which path
> > > to take.
> >
> > I've tested 3 cases. The new 3rd case is to use the load_acquire
> > barrier (half barrier) you mentioned at the link above.
> > The patch seems like:
> >
> > @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> >                 /* Reset n to the initial burst count */
> >                 n = max;
> >
> > -               *old_head = r->prod.head;
> > -               const uint32_t cons_tail = r->cons.tail;
> > +               *old_head = atomic_load_acq_32(&r->prod.head);
> > +               const uint32_t cons_tail = atomic_load_acq_32(&r->cons.tail);
> >
> > @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s
> >                 /* Restore n as it may change every loop */
> >                 n = max;
> >
> > -               *old_head = r->cons.head;
> > -               const uint32_t prod_tail = r->prod.tail;
> > +               *old_head = atomic_load_acq_32(&r->cons.head);
> > +               const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail);
> >                 /* The subtraction is done between two unsigned 32bits value
> >                  * (the result is always modulo 32 bits even if we have
> >                  * cons_head > prod_tail). So 'entries' is always between 0
> >                  * and size(ring)-1. */
> >
> > The half barrier patch passed the functional test.
> >
> > As for the performance comparison on *arm64* (the debug patch is at
> > http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see
> > the test results below:
> >
> > [case 1] old code, no barrier
> > ============================================
> >  Performance counter stats for './test --no-huge -l 1-10':
> >
> >      689275.001200      task-clock (msec)         #    9.771 CPUs utilized
> >               6223      context-switches          #    0.009 K/sec
> >                 10      cpu-migrations            #    0.000 K/sec
> >                653      page-faults               #    0.001 K/sec
> >      1721190914583      cycles                    #    2.497 GHz
> >      3363238266430      instructions              #    1.95  insn per cycle
> >                         branches
> >           27804740      branch-misses             #    0.00% of all branches
> >
> >       70.540618825 seconds time elapsed
> >
> > [case 2] full barrier with rte_smp_rmb()
> > ============================================
> >  Performance counter stats for './test --no-huge -l 1-10':
> >
> >      582557.895850      task-clock (msec)         #    9.752 CPUs utilized
> >               5242      context-switches          #    0.009 K/sec
> >                 10      cpu-migrations            #    0.000 K/sec
> >                665      page-faults               #    0.001 K/sec
> >      1454360730055      cycles                    #    2.497 GHz
> >       587197839907      instructions              #    0.40  insn per cycle
> >                         branches
> >           27799687      branch-misses             #    0.00% of all branches
> >
> >       59.735582356 seconds time elapsed
> >
> > [case 3] half barrier with load_acquire
> > ============================================
> >  Performance counter stats for './test --no-huge -l 1-10':
> >
> >      660758.877050      task-clock (msec)         #    9.764 CPUs utilized
> >               5982      context-switches          #    0.009 K/sec
> >                 11      cpu-migrations            #    0.000 K/sec
> >                657      page-faults               #    0.001 K/sec
> >      1649875318044      cycles                    #    2.497 GHz
> >       591583257765      instructions              #    0.36  insn per cycle
> >                         branches
> >           27994903      branch-misses             #    0.00% of all branches
> >
> >       67.672855107 seconds time elapsed
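
The quoted hunks use the FreeBSD-style atomic_load_acq_32() helper, which
comes from FreeBSD's sys/atomic.h rather than from DPDK. For anyone who wants
to reproduce the half-barrier case without that helper, a minimal sketch using
the GCC/C11 __atomic builtins is below; the name load_acq_u32() is made up for
illustration, and this snippet is not the patch itself.

    #include <stdint.h>

    /* Hypothetical helper: 32-bit load with acquire ordering. On arm64 an
     * acquire load typically compiles to ldar, instead of a plain ldr plus a
     * separate dmb barrier as with rte_smp_rmb(). */
    static inline uint32_t
    load_acq_u32(const volatile uint32_t *addr)
    {
            return __atomic_load_n(addr, __ATOMIC_ACQUIRE);
    }

    /* Usage corresponding to the hunk in __rte_ring_move_prod_head():
     *
     *     *old_head = load_acq_u32(&r->prod.head);
     *     const uint32_t cons_tail = load_acq_u32(&r->cons.tail);
     *
     * The acquire ordering stops later loads from being hoisted above these
     * reads, which is the reordering the plain loads allowed on arm64. */
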
> > Please see the context-switches in the perf results. The test results
> > sorted by running time are: full barrier < half barrier < no barrier.
> >
> > AFAICT, in this case the CPU reordering adds to the possibility of
> > context switching and increases the running time.
> >
> > Any ideas?
>
> You could also try a case where you rearrange the rte_ring structure so
> that prod.head and cons.tail are part of a union with a 64-bit variable,
> and then read this 64-bit variable with one atomic read. I do not think
> the half barrier is even needed with this approach, as long as you can
> really read the 64-bit variable fully at once. This should speed things
> up.
>
> Cheers,
> --

Yes, that would work, but unfortunately it means mixing producer and
consumer data onto the same cacheline, which may have other negative
effects on performance (depending on arch/platform) due to cores
invalidating each other's cachelines.

/Bruce
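
For concreteness, the union idea above might look something like the sketch
below. All names are hypothetical and this is not the actual rte_ring layout;
it only illustrates the single-64-bit-load read that the suggestion relies on.

    #include <stdint.h>

    /* Hypothetical layout for illustration only -- NOT struct rte_ring.
     * Producer head and consumer tail share one 64-bit word so both can be
     * observed with a single atomic load. */
    union head_tail_snapshot {
            uint64_t raw;
            struct {
                    uint32_t prod_head;
                    uint32_t cons_tail;
            };
    };

    static inline union head_tail_snapshot
    read_snapshot(const volatile uint64_t *src)
    {
            union head_tail_snapshot s;

            /* One naturally aligned 64-bit load returns both indexes from
             * the same point in time, so no ordering between two separate
             * 32-bit loads is needed. */
            s.raw = __atomic_load_n(src, __ATOMIC_RELAXED);
            return s;
    }

The trade-off is the one raised in the reply above: packing prod_head and
cons_tail into one word also puts them on the same cacheline, so producer
updates and consumer reads start bouncing that line between cores, whereas the
existing rte_ring keeps the producer and consumer state on separate
cache-aligned lines.
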