From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <konstantin.ananyev@intel.com>
Received: from mga11.intel.com (mga11.intel.com [192.55.52.93])
 by dpdk.org (Postfix) with ESMTP id 9D4B02A5F
 for <dev@dpdk.org>; Tue, 12 Jul 2016 13:01:22 +0200 (CEST)
Received: from orsmga002.jf.intel.com ([10.7.209.21])
 by fmsmga102.fm.intel.com with ESMTP; 12 Jul 2016 04:01:21 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.28,351,1464678000"; d="scan'208";a="1015262620"
Received: from irsmsx107.ger.corp.intel.com ([163.33.3.99])
 by orsmga002.jf.intel.com with ESMTP; 12 Jul 2016 04:01:20 -0700
Received: from irsmsx105.ger.corp.intel.com ([169.254.7.51]) by
 IRSMSX107.ger.corp.intel.com ([169.254.10.96]) with mapi id 14.03.0248.002;
 Tue, 12 Jul 2016 12:01:19 +0100
From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
To: "Kuusisaari, Juhamatti" <Juhamatti.Kuusisaari@coriant.com>, "dev@dpdk.org"
 <dev@dpdk.org>
Thread-Topic: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct
 location
Thread-Index: AQHR214YM0lag5RGeUO4brWmDxiYS6ATCkbA///7T4CAACNU4IABC8IAgABhnkA=
Date: Tue, 12 Jul 2016 11:01:19 +0000
Message-ID: <2601191342CEEE43887BDE71AB97725836B7D20B@irsmsx105.ger.corp.intel.com>
References: <20160711102055.14855-1-juhamatti.kuusisaari@coriant.com>
 <2601191342CEEE43887BDE71AB97725836B7C858@irsmsx105.ger.corp.intel.com>
 <HE1PR04MB1337E26DB33E9E9F8196FF0B9D3F0@HE1PR04MB1337.eurprd04.prod.outlook.com>
 <2601191342CEEE43887BDE71AB97725836B7C916@irsmsx105.ger.corp.intel.com>
 <HE1PR04MB1337546AF3AF7176E7B5FCF79D300@HE1PR04MB1337.eurprd04.prod.outlook.com>
In-Reply-To: <HE1PR04MB1337546AF3AF7176E7B5FCF79D300@HE1PR04MB1337.eurprd04.prod.outlook.com>
Accept-Language: en-IE, en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [163.33.239.182]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Subject: Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to
	correct location
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Tue, 12 Jul 2016 11:01:23 -0000


Hi Juhamatti,

>
> Hello,
>
> > > > > -----Original Message-----
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Juhamatti
> > > > > Kuusisaari
> > > > > Sent: Monday, July 11, 2016 11:21 AM
> > > > > To: dev@dpdk.org
> > > > > Subject: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to
> > > > > correct location
> > > > >
> > > > > Fix the location of the rte_ring data dependency read barrier.
> > > > > It needs to be called before accessing indexed data to ensure that
> > > > > the data itself is guaranteed to be correctly updated.
> > > > >
> > > > > See more details at kernel/Documentation/memory-barriers.txt
> > > > > section 'Data dependency barriers'.
> > > >
> > > >
> > > > Any explanation why?
> > > > From my point of view, the smp_rmb()s are in the proper places here :) Konstantin
> > >
> > > The problem here is that on a weak memory model system the CPU is
> > > allowed to load the address data out-of-order in advance.
> > > If the read barrier is after the DEQUEUE, you might end up with the
> > > old data there in a race situation when the buffer is continuously
> > > full. Having it before the DEQUEUE guarantees that the load is not
> > > done in advance.
> >
> > Sorry, I still don't see any race condition in the current code.
> > Can you provide any particular example?
> > On the other side, moving smp_rmb() before dequeuing the objects could
> > introduce a race condition on CPUs where later writes can be reordered
> > with earlier reads.
>
> Here is a simplified example sequence from a time perspective:
> 1. Consumer CPU (CCPU) loads value y from r->ring[x] out-of-order
> (the key of the problem)

To read the value of ring[x], the CPU has to calculate x first.
And to calculate x, it needs to read cons.head and prod.tail first.
Are you saying that some modern CPU can:
 - 'speculate' the value of cons.head and prod.tail
   (based on what?)
 - calculate x based on these speculated values
 - read ring[x]
 - read cons.head and prod.tail
 - if the read values are not equal to the speculated ones, then
   re-calculate x and re-read ring[x]
 - else use the speculatively read ring[x]
?
If such a thing is possible (is it really? and if yes, on which CPU?),
then yes, we might need an extra smp_rmb() before DEQUEUE_PTRS()
for __rte_ring_sc_do_dequeue().
For __rte_ring_mc_do_dequeue(), I think we are ok, as
there is a CAS just before DEQUEUE_PTRS().
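
Just to make sure we are talking about the same thing, here is a minimal
sketch of the consumer-side ordering in question, written with C11
atomics (the toy_* names, the single prod_tail field and the fixed 1024
size are mine for illustration only, not the rte_ring internals):

#include <stdatomic.h>
#include <stddef.h>

struct toy_ring {
    _Atomic unsigned int prod_tail;  /* producer publishes up to here */
    void *slots[1024];               /* power-of-two ring storage */
};

void *
toy_dequeue_one(struct toy_ring *r, unsigned int cons_head)
{
    unsigned int tail = atomic_load_explicit(&r->prod_tail,
                                             memory_order_relaxed);
    if (cons_head == tail)
        return NULL;                 /* ring empty */

    /*
     * Only a control dependency links the prod_tail check above to
     * the slots[] load below, so a weakly ordered CPU may in principle
     * speculate past the branch and return a stale value.  The acquire
     * fence (the role rte_smp_rmb() plays in DPDK) forbids that.
     */
    atomic_thread_fence(memory_order_acquire);

    return r->slots[cons_head & 1023];
}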

> 2. Producer CPU (PCPU) updates r->ring[x] to value z
> 3. PCPU updates prod_tail to be x
> 4. CCPU updates cons_head to be x
> 5. CCPU loads r->ring[x] by using the out-of-order loaded value y [it is z in reality]
>
> The problem here is that on a weak memory model, the CCPU is allowed to
> load the r->ring[x] value in advance if it decides to do so (the CCPU
> needs to be able to see in advance that x will be an interesting index
> worth loading). The index value x is updated atomically, but that does
> not matter here. Also, the write barrier on the PCPU side guarantees
> that the CCPU cannot see the update of x before the PCPU has really
> updated r->ring[x] to z and moved the tail, but it still allows the
> out-of-order loads without a proper read barrier.
>
> When the read barrier is moved between steps 4 and 5, it invalidates any
> out-of-order loads done so far and forces the CPU to drop the stale
> r->ring[x] value y and load the current value z.
>
> The ring queue appears to work well because this is a rare corner case.
> Due to the head,tail structure, the problem requires the queue to be
> full, and the CCPU also needs to see the r->ring[x] update later than it
> does the out-of-order load. In addition, the HW needs to be able to
> predict and choose the load to the future index (which should be quite
> possible, considering modern CPUs). If you have seen problems in the
> past and noticed that a larger ring queue works better as a workaround,
> you may have encountered the problem already.

I don't understand what 'larger rings work better' means here.
What we are talking about is a race condition that, if hit, would
cause data corruption and most likely a crash.

>
> It is quite safe to move the barrier before DEQUEUE, because after the
> DEQUEUE there is nothing really that we would want to protect with a
> read barrier.

I don't think so.
If you remove the barrier after DEQUEUE(), that means that on systems with
relaxed memory ordering cons.tail could be updated before DEQUEUE() is
finished, and the producer could overwrite queue entries that were not yet
dequeued.
So if the CPU can really do such speculative out-of-order loads,
then for __rte_ring_sc_do_dequeue() we do need something like:

rte_smp_rmb();
DEQUEUE_PTRS();
rte_smp_rmb();
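
I.e., something like the following stripped-down sketch (my toy code,
not the actual __rte_ring_sc_do_dequeue(); the real function also
handles watermarks, behavior flags and stats, and the plain copy loop
below stands in for what DEQUEUE_PTRS() expands to):

#include <errno.h>
#include <rte_atomic.h>
#include <rte_ring.h>

static inline int
toy_sc_dequeue(struct rte_ring *r, void **obj_table, unsigned n)
{
    uint32_t cons_head = r->cons.head;
    uint32_t prod_tail = r->prod.tail;
    uint32_t mask = r->prod.mask;
    uint32_t i;

    if (prod_tail - cons_head < n)
        return -ENOENT;              /* not enough entries queued */

    r->cons.head = cons_head + n;

    /* Barrier 1: keep the ring[] loads below from being satisfied
     * ahead of the prod.tail load above (the speculation concern). */
    rte_smp_rmb();

    for (i = 0; i < n; i++)
        obj_table[i] = r->ring[(cons_head + i) & mask];

    /* Barrier 2: make sure the loads above have completed before we
     * publish cons.tail and let the producer reuse those slots. */
    rte_smp_rmb();

    r->cons.tail = cons_head + n;
    return 0;
}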

Konstantin

> The read barrier is mapped to a compiler barrier on strong memory model
> systems, and this works fine too, as the order of the head,tail updates
> is still guaranteed at the new location. Even if the problem is
> theoretical on most systems, it is worth fixing, as the risk of the fix
> causing problems is very low.
>
> --
>  Juhamatti
>
> > Konstantin
>
> > > > >
> > > > > Signed-off-by: Juhamatti Kuusisaari
> > > > > <juhamatti.kuusisaari@coriant.com>
> > > > > ---
> > > > >  lib/librte_ring/rte_ring.h | 6 ++++--
> > > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/lib/librte_ring/rte_ring.h
> > > > > b/lib/librte_ring/rte_ring.h index eb45e41..a923e49 100644
> > > > > --- a/lib/librte_ring/rte_ring.h
> > > > > +++ b/lib/librte_ring/rte_ring.h
> > > > > @@ -662,9 +662,10 @@ __rte_ring_mc_do_dequeue(struct rte_ring *r, void **obj_table,
> > > > >                                               cons_next);
> > > > >         } while (unlikely(success == 0));
> > > > >
> > > > > +       rte_smp_rmb();
> > > > > +
> > > > >         /* copy in table */
> > > > >         DEQUEUE_PTRS();
> > > > > -       rte_smp_rmb();
> > > > >
> > > > >         /*
> > > > >          * If there are other dequeues in progress that preceded us,
> > > > > @@ -746,9 +747,10 @@ __rte_ring_sc_do_dequeue(struct rte_ring *r, void **obj_table,
> > > > >         cons_next = cons_head + n;
> > > > >         r->cons.head = cons_next;
> > > > >
> > > > > +       rte_smp_rmb();
> > > > > +
> > > > >         /* copy in table */
> > > > >         DEQUEUE_PTRS();
> > > > > -       rte_smp_rmb();
> > > > >
> > > > >         __RING_STAT_ADD(r, deq_success, n);
> > > > >         r->cons.tail = cons_next;
> > > > > --
> > > > > 2.9.0
> > > > >
> > > > >
> > > > >
> > > >