From: Bruce Richardson <bruce.richardson@intel.com>
To: "Kuusisaari, Juhamatti" <Juhamatti.Kuusisaari@coriant.com>
Cc: Jia He <hejianet@gmail.com>,
	Jerin Jacob <jerin.jacob@caviumnetworks.com>,
	"Ananyev, Konstantin" <konstantin.ananyev@intel.com>,
	"Zhao, Bing" <ilovethull@163.com>,
	Olivier MATZ <olivier.matz@6wind.com>,
	"dev@dpdk.org" <dev@dpdk.org>,
	"jia.he@hxt-semitech.com" <jia.he@hxt-semitech.com>,
	"jie2.liu@hxt-semitech.com" <jie2.liu@hxt-semitech.com>,
	"bing.zhao@hxt-semitech.com" <bing.zhao@hxt-semitech.com>
Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Date: Mon, 23 Oct 2017 10:10:59 +0100	[thread overview]
Message-ID: <20171023091059.GA17728@bricha3-MOBL3.ger.corp.intel.com> (raw)
In-Reply-To: <AM6PR0402MB339814E0657993F76B9801D49D460@AM6PR0402MB3398.eurprd04.prod.outlook.com>

On Mon, Oct 23, 2017 at 09:05:24AM +0000, Kuusisaari, Juhamatti wrote:
> 
> Hi,
> 
> > On 10/20/2017 1:43 PM, Jerin Jacob Wrote:
> > > -----Original Message-----
> > >>
> > [...]
> > >> dependent on each other.  Thus a memory barrier is necessary.
> > > Yes. The barrier is necessary.  In fact, upstream freebsd fixed
> > > this issue for arm64. DPDK ring implementation is derived from
> > > freebsd's buf_ring.h.
> > >
> > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > >
> > > I think, the only outstanding issue is how to reduce the
> > > performance impact for arm64. I believe using acquire/release
> > > semantics instead of rte_smp_rmb() will reduce the performance
> > > overhead, like the similar ring implementations below. freebsd:
> > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > > odp:
> > > https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
> > >
> > > Jia, 1) Can you verify that the use of acquire/release semantics
> > > fixes the problem on your platform? E.g. the use of atomic_load_acq*
> > > in the reference code.
> > > 2) If so, what is the overhead between acquire/release and plain
> > > smp_rmb() barriers? Based on that we need to decide which path to
> > > take.
> > I've tested 3 cases.  The new 3rd case is to use the load_acquire
> > barrier (half barrier) you mentioned at the above link.  The patch
> > looks like:
> >
> > @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> >                 /* Reset n to the initial burst count */
> >                 n = max;
> > 
> > -               *old_head = r->prod.head;
> > -               const uint32_t cons_tail = r->cons.tail;
> > +               *old_head = atomic_load_acq_32(&r->prod.head);
> > +               const uint32_t cons_tail = atomic_load_acq_32(&r->cons.tail);
> > 
> > @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s
> >                 /* Restore n as it may change every loop */
> >                 n = max;
> > 
> > -               *old_head = r->cons.head;
> > -               const uint32_t prod_tail = r->prod.tail;
> > +               *old_head = atomic_load_acq_32(&r->cons.head);
> > +               const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail);
> >                 /* The subtraction is done between two unsigned 32bits value
> >                  * (the result is always modulo 32 bits even if we have
> >                  * cons_head > prod_tail). So 'entries' is always between 0
> >                  * and size(ring)-1. */
> >
> > The half barrier patch passed the functional test.
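[Editor's note: the acquire-load idea in the quoted patch can be sketched in portable C11 atomics. This is a minimal illustration only, not DPDK's actual code; `demo_ring` and `free_entries` are made-up names.]

```c
#include <stdatomic.h>
#include <stdint.h>
#include <assert.h>

/* Hypothetical, simplified ring index pair -- not DPDK's real
 * struct rte_ring, just an illustration of the half-barrier idea. */
struct demo_ring {
    _Atomic uint32_t prod_head;
    _Atomic uint32_t cons_tail;
};

/* Load the ring indices with acquire semantics: on arm64 this compiles
 * to ldar (a one-way "half" barrier) instead of plain loads followed by
 * a full dmb, which is what the quoted patch achieves with
 * atomic_load_acq_32(). */
static uint32_t free_entries(struct demo_ring *r, uint32_t capacity)
{
    uint32_t head = atomic_load_explicit(&r->prod_head, memory_order_acquire);
    uint32_t tail = atomic_load_explicit(&r->cons_tail, memory_order_acquire);
    /* Unsigned subtraction is modulo 2^32, so the result is correct
     * even after the 32-bit indices wrap around. */
    return capacity + tail - head;
}
```

The acquire loads forbid the CPU from reordering later reads of the ring contents before the index loads, which is exactly the reordering the thread is discussing.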
> > 
> > As for the performance comparison on *arm64* (the debug patch is at
> > http://dpdk.org/ml/archives/dev/2017-October/079012.html), please
> > see the test results below:
> > 
> > [case 1] old code, no barrier
> > ============================================
> > Performance counter stats for './test --no-huge -l 1-10':
> > 
> >       689275.001200      task-clock (msec)         #    9.771 CPUs utilized
> >                6223      context-switches          #    0.009 K/sec
> >                  10      cpu-migrations            #    0.000 K/sec
> >                 653      page-faults               #    0.001 K/sec
> >       1721190914583      cycles                    #    2.497 GHz
> >       3363238266430      instructions              #    1.95  insn per cycle
> >     <not supported>      branches
> >            27804740      branch-misses             #    0.00% of all branches
> > 
> >        70.540618825 seconds time elapsed
> > 
> > [case 2] full barrier with rte_smp_rmb()
> > ============================================
> > Performance counter stats for './test --no-huge -l 1-10':
> > 
> >       582557.895850      task-clock (msec)         #    9.752 CPUs utilized
> >                5242      context-switches          #    0.009 K/sec
> >                  10      cpu-migrations            #    0.000 K/sec
> >                 665      page-faults               #    0.001 K/sec
> >       1454360730055      cycles                    #    2.497 GHz
> >        587197839907      instructions              #    0.40  insn per cycle
> >     <not supported>      branches
> >            27799687      branch-misses             #    0.00% of all branches
> > 
> >        59.735582356 seconds time elapsed
> > 
> > [case 3] half barrier with load_acquire
> > ============================================
> > Performance counter stats for './test --no-huge -l 1-10':
> > 
> >       660758.877050      task-clock (msec)         #    9.764 CPUs utilized
> >                5982      context-switches          #    0.009 K/sec
> >                  11      cpu-migrations            #    0.000 K/sec
> >                 657      page-faults               #    0.001 K/sec
> >       1649875318044      cycles                    #    2.497 GHz
> >        591583257765      instructions              #    0.36  insn per cycle
> >     <not supported>      branches
> >            27994903      branch-misses             #    0.00% of all branches
> > 
> >        67.672855107 seconds time elapsed
> > 
> > Please note the context-switch counts in the perf results. The
> > results sorted by elapsed time are: full barrier < half barrier < no
> > barrier.
> > 
> > AFAICT, in this case the CPU reordering adds the possibility of
> > context switching and increases the running time.
> > 
> > Any ideas?
> 
> You could also try a case where you rearrange the rte_ring structure
> so that prod.head and cons.tail are part of a union with a 64-bit
> variable, and then read that 64-bit variable with one atomic read. I do
> not think the half barrier is even needed with this approach, as
> long as you can really read the whole 64-bit variable atomically at
> once. This should speed things up.
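[Editor's note: that suggestion can be sketched as follows. The names and the packed layout are illustrative only -- this is not how DPDK's struct rte_ring is actually defined.]

```c
#include <stdatomic.h>
#include <stdint.h>
#include <assert.h>

/* Pack the two 32-bit indices into one 64-bit word so that a single
 * atomic load observes both together. */
static inline uint64_t pack_indices(uint32_t prod_head, uint32_t cons_tail)
{
    return ((uint64_t)cons_tail << 32) | prod_head;
}

/* One 64-bit atomic load yields a consistent snapshot of both indices,
 * so no ordering between two separate 32-bit loads is required. */
static inline void snapshot_indices(_Atomic uint64_t *both,
                                    uint32_t *prod_head, uint32_t *cons_tail)
{
    uint64_t v = atomic_load_explicit(both, memory_order_relaxed);
    *prod_head = (uint32_t)v;
    *cons_tail = (uint32_t)(v >> 32);
}
```

The snapshot is consistent because the hardware cannot tear a naturally aligned 64-bit atomic load, so the two halves always come from the same moment in time.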
> 
> Cheers, --
Yes, that would work, but unfortunately it means mixing producer and
consumer data onto the same cacheline, which may have other negative
effects on performance (depending on arch/platform) due to cores
invalidating each other's cachelines.
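[Editor's note: the trade-off Bruce describes can be made concrete with a layout sketch. `demo_ring` is a made-up, simplified structure, not DPDK's actual struct rte_ring.]

```c
#include <stdalign.h>
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define CACHE_LINE 64 /* a typical cacheline size; platform-dependent */

/* Keeping producer and consumer state on separate cachelines avoids
 * cores invalidating each other's lines on every enqueue/dequeue, at
 * the cost of needing two separate index loads -- and therefore some
 * ordering guarantee between them. */
struct demo_ring {
    alignas(CACHE_LINE) struct { uint32_t head, tail; } prod;
    alignas(CACHE_LINE) struct { uint32_t head, tail; } cons;
};
```

Merging prod.head and cons.tail into one 64-bit word would undo this separation: every producer update would invalidate the line the consumer reads, which is the negative effect on some platforms that the reply warns about.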

/Bruce


Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171023091059.GA17728@bricha3-MOBL3.ger.corp.intel.com \
    --to=bruce.richardson@intel.com \
    --cc=Juhamatti.Kuusisaari@coriant.com \
    --cc=bing.zhao@hxt-semitech.com \
    --cc=dev@dpdk.org \
    --cc=hejianet@gmail.com \
    --cc=ilovethull@163.com \
    --cc=jerin.jacob@caviumnetworks.com \
    --cc=jia.he@hxt-semitech.com \
    --cc=jie2.liu@hxt-semitech.com \
    --cc=konstantin.ananyev@intel.com \
    --cc=olivier.matz@6wind.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).