From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by dpdk.org (Postfix) with ESMTP id 9D4B02A5F for ; Tue, 12 Jul 2016 13:01:22 +0200 (CEST) Received: from orsmga002.jf.intel.com ([10.7.209.21]) by fmsmga102.fm.intel.com with ESMTP; 12 Jul 2016 04:01:21 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.28,351,1464678000"; d="scan'208";a="1015262620" Received: from irsmsx107.ger.corp.intel.com ([163.33.3.99]) by orsmga002.jf.intel.com with ESMTP; 12 Jul 2016 04:01:20 -0700 Received: from irsmsx105.ger.corp.intel.com ([169.254.7.51]) by IRSMSX107.ger.corp.intel.com ([169.254.10.96]) with mapi id 14.03.0248.002; Tue, 12 Jul 2016 12:01:19 +0100 From: "Ananyev, Konstantin" To: "Kuusisaari, Juhamatti" , "dev@dpdk.org" Thread-Topic: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location Thread-Index: AQHR214YM0lag5RGeUO4brWmDxiYS6ATCkbA///7T4CAACNU4IABC8IAgABhnkA= Date: Tue, 12 Jul 2016 11:01:19 +0000 Message-ID: <2601191342CEEE43887BDE71AB97725836B7D20B@irsmsx105.ger.corp.intel.com> References: <20160711102055.14855-1-juhamatti.kuusisaari@coriant.com> <2601191342CEEE43887BDE71AB97725836B7C858@irsmsx105.ger.corp.intel.com> <2601191342CEEE43887BDE71AB97725836B7C916@irsmsx105.ger.corp.intel.com> In-Reply-To: Accept-Language: en-IE, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [163.33.239.182] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Subject: Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 12 Jul 2016 11:01:23 -0000 Hi Juhamatti, >=20 > Hello, >=20 > > > > > -----Original Message----- > > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Juhamatti > > > > > Kuusisaari > > > > > Sent: Monday, July 11, 2016 11:21 AM > > > > > To: dev@dpdk.org > > > > > Subject: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to > > > > > correct location > > > > > > > > > > Fix the location of the rte_ring data dependency read barrier. > > > > > It needs to be called before accessing indexed data to ensure tha= t > > > > > the data itself is guaranteed to be correctly updated. > > > > > > > > > > See more details at kernel/Documentation/memory-barriers.txt > > > > > section 'Data dependency barriers'. > > > > > > > > > > > > Any explanation why? > > > > From my point smp_rmb()s are on the proper places here :) Konstanti= n > > > > > > The problem here is that on a weak memory model system the CPU is > > > allowed to load the address data out-of-order in advance. > > > If the read barrier is after the DEQUEUE, you might end up having the > > > old data there on a race situation when the buffer is continuously fu= ll. > > > Having it before the DEQUEUE guarantees that the load is not done in > > > advance. > > > > Sorry, still didn't see any race condition in the current code. > > Can you provide any particular example? > > From other side, moving smp_rmb() before dequeueing the objects, could > > introduce a race condition, on cpus where later writes can be reordered= with > > earlier reads. >=20 > Here is a simplified example sequence from time perspective: > 1. Consumer CPU (CCPU) loads value y from r->ring[x] out-of-order > (the key of the problem) To read the value of ring[x] cpu has to calculate x first. And to calculate x it needs to read cons.head and prod.tail first. Are you saying that some modern cpu can: -'speculate' value of cons.head and prod.tail (based on what?) -calculate x based on these speculated values. - read ring[x] - read cons.head and prod.tail - if read values are not equal to speculated ones , then re-caluclate x and re-read ring[x] - else use speculatively read ring[x]=20 ? If such thing is possible (is it really? and if yes on which cpu?), then yes, we might need an extra smp_rmb() before DEQUEUE_PTRS() for __rte_ring_sc_do_dequeue(). For __rte_ring_mc_do_dequeue(), I think we are ok, as there is CAS just before DEQUEUE_PTRS(). > 2. Producer CPU (PCPU) updates r->ring[x] to value be z > 3. PCPU updates prod_tail to be x > 4. CCPU updates cons_head to be x > 5. CCPU loads r->ring[x] by using out-of-order loaded value y [is z in re= ality] >=20 > The problem here is that on weak memory model, the CCPU is allowed to loa= d > r->ring[x] value in advance, if it decides to do so (CCPU needs to be abl= e to see > in advance that x will be an interesting index worth loading). The index = value x > is updated atomically, but it does not matter here. Also, the write barr= ier on PCPU > side guarantees that CCPU cannot see update of x before PCPU has really u= pdated > the r->ring[x] to z and moved the tail, but still allows to do the out-of= -order loads > without proper read barrier. >=20 > When the read barrier is moved between steps 4 and 5, it disallows to use > any out-of-order loads so far and forces to drop r->ring[x] y value and > load current value z. >=20 > The ring queue appears to work well as this is a rare corner case. Due to= the > head,tail-structure the problem needs queue to be full and also CCPU need= s > to see r->ring[x] update later than it does the out-of-order load. In add= ition, > the HW needs to be able to predict and choose the load to the future inde= x > (which should be quite possible, considering modern CPUs). If you have se= en > in the past problems and noticed that a larger ring queue works better as= a > workaround, you may have encountered the problem already. I don't understand what means 'larger rings works better' here.=20 What we are talking about is race condition, that if hit, would cause data corruption and most likely a crash. >=20 > It is quite safe to move the barrier before DEQUEUE because after the DEQ= UEUE > there is nothing really that we would want to protect with a read barrier= . I don't think so. If you remove barrier after DEQUEUE(), that means on systems with relaxed m= emory ordering cons.tail could be updated before DEQUEUE() will be finished and producer c= an overwrite queue entries that were not yet dequeued. So if cpu can really do such speculative out of order loads, then we do need for __rte_ring_sc_do_dequeue() something like: =20 rte_smp_rmb(); DEQUEUE_PTRS(); rte_smp_rmb(); Konstantin > The read > barrier is mapped to a compiler barrier on strong memory model systems an= d this > works fine too as the order of the head,tail updates is still guaranteed = on the new > location. Even if the problem would be theoretical on most systems, it is= worth fixing > as the risk for problems is very low. >=20 > -- > Juhamatti >=20 > > Konstantin >=20 >=20 >=20 >=20 > > > > > > > > > > Signed-off-by: Juhamatti Kuusisaari > > > > > > > > > > --- > > > > > lib/librte_ring/rte_ring.h | 6 ++++-- > > > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/lib/librte_ring/rte_ring.h > > > > > b/lib/librte_ring/rte_ring.h index eb45e41..a923e49 100644 > > > > > --- a/lib/librte_ring/rte_ring.h > > > > > +++ b/lib/librte_ring/rte_ring.h > > > > > @@ -662,9 +662,10 @@ __rte_ring_mc_do_dequeue(struct rte_ring *r, > > > > void **obj_table, > > > > > cons_next); > > > > > } while (unlikely(success =3D=3D 0)); > > > > > > > > > > + rte_smp_rmb(); > > > > > + > > > > > /* copy in table */ > > > > > DEQUEUE_PTRS(); > > > > > - rte_smp_rmb(); > > > > > > > > > > /* > > > > > * If there are other dequeues in progress that preceded > > > > > us, @@ -746,9 +747,10 @@ __rte_ring_sc_do_dequeue(struct rte_ring > > > > > *r, > > > > void **obj_table, > > > > > cons_next =3D cons_head + n; > > > > > r->cons.head =3D cons_next; > > > > > > > > > > + rte_smp_rmb(); > > > > > + > > > > > /* copy in table */ > > > > > DEQUEUE_PTRS(); > > > > > - rte_smp_rmb(); > > > > > > > > > > __RING_STAT_ADD(r, deq_success, n); > > > > > r->cons.tail =3D cons_next; > > > > > -- > > > > > 2.9.0 > > > > > > > > > > > > > > > > > > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > =3D=3D > > > > > The information contained in this message may be privileged and > > > > > confidential and protected from disclosure. If the reader of this > > > > > message is not the intended recipient, or an employee or agent > > > > > responsible for delivering this message to the intended recipient= , > > > > > you are hereby notified that any reproduction, dissemination or > > > > > distribution of this communication is strictly prohibited. If you > > > > > have received this communication in error, please notify us > > > > > immediately by replying to the message and deleting it from your > > computer. Thank you. > > > > > Coriant-Tellabs > > > > > > > > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > =3D=3D