From: "Ananyev, Konstantin"
To: "Kuusisaari, Juhamatti", "'dev@dpdk.org'"
Date: Wed, 13 Jul 2016 13:00:36 +0000
Subject: Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
Message-ID: <2601191342CEEE43887BDE71AB97725836B7D850@irsmsx105.ger.corp.intel.com>

Hi Juhamatti,

> Hello,
>
> > > Hi Juhamatti,
> > >
> > > > > > Hello,
> > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Juhamatti Kuusisaari
> > > > > > > > Sent: Monday, July 11, 2016 11:21 AM
> > > > > > > > To: dev@dpdk.org
> > > > > > > > Subject: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
> > > > > > > >
> > > > > > > > Fix the location of the rte_ring data dependency read barrier.
> > > > > > > > It needs to be called before accessing indexed data to ensure
> > > > > > > > that the data itself is guaranteed to be correctly updated.
> > > > > > > >
> > > > > > > > See more details in kernel/Documentation/memory-barriers.txt,
> > > > > > > > section 'Data dependency barriers'.
> > > > > > >
> > > > > > > Any explanation why?
> > > > > > > From my point of view the smp_rmb()s are in the proper places here :)
> > > > > > > Konstantin
> > > > > >
> > > > > > The problem here is that on a weak memory model system the CPU is
> > > > > > allowed to load the addressed data out-of-order, in advance.
> > > > > > If the read barrier is after the DEQUEUE, you might end up having
> > > > > > the old data there in a race situation when the buffer is
> > > > > > continuously full. Having it before the DEQUEUE guarantees that
> > > > > > the load is not done in advance.
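For reference, the single-consumer dequeue path being discussed looks
roughly like this, a simplified sketch rather than the verbatim code from
lib/librte_ring/rte_ring.h (size checks, index masking and statistics
omitted):

    /* __rte_ring_sc_do_dequeue(), abbreviated */
    cons_head = r->cons.head;
    prod_tail = r->prod.tail;
    entries = prod_tail - cons_head;   /* objects available to dequeue */
    cons_next = cons_head + n;
    r->cons.head = cons_next;          /* claim n entries */

    /* copy in table */
    DEQUEUE_PTRS();                    /* loads from r->ring[] into obj_table */
    rte_smp_rmb();                     /* current barrier location */

    r->cons.tail = cons_next;          /* let the producer reuse the slots */

The patch moves the rte_smp_rmb() above DEQUEUE_PTRS().
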
> > > > >
> > > > > Sorry, I still don't see any race condition in the current code.
> > > > > Can you provide any particular example?
> > > > > On the other hand, moving smp_rmb() before dequeuing the objects
> > > > > could introduce a race condition on CPUs where later writes can be
> > > > > reordered with earlier reads.
> > > >
> > > > Here is a simplified example sequence from a time perspective:
> > > > 1. Consumer CPU (CCPU) loads value y from r->ring[x] out-of-order
> > > >    (the key of the problem)
> > >
> > > To read the value of ring[x] the cpu has to calculate x first.
> > > And to calculate x it needs to read cons.head and prod.tail first.
> > > Are you saying that some modern cpu can:
> > > - 'speculate' the value of cons.head and prod.tail (based on what?)
> > > - calculate x based on these speculated values
> > > - read ring[x]
> > > - read cons.head and prod.tail
> > > - if the read values are not equal to the speculated ones, then
> > >   recalculate x and re-read ring[x]
> > > - else use the speculatively read ring[x]
> > > ?
> > > If such a thing is possible (is it really? and if yes, on which cpu?),
> >
> > As far as I can see, neither ARM nor PPC supports such things.
> > Both of them obey address dependencies.
> > (ARM & PPC guys, feel free to correct me if I am wrong here.)
> > So what cpu are we talking about?
>
> I checked that too; indeed, the problem I described seems to be more
> academic than even theoretical, and it does not apply to current CPUs.
> So I agree here, and this makes the patch unneeded; I'll withdraw it.
> However, the implementation may still have another issue, see below.
>
> > > then yes, we might need an extra smp_rmb() before DEQUEUE_PTRS() for
> > > __rte_ring_sc_do_dequeue().
> > > For __rte_ring_mc_do_dequeue(), I think we are ok, as there is a CAS
> > > just before DEQUEUE_PTRS().
> > >
> > > > 2. Producer CPU (PCPU) updates r->ring[x] to value z
> > > > 3. PCPU updates prod_tail to be x
> > > > 4. CCPU updates cons_head to be x
> > > > 5. CCPU loads r->ring[x] by using the out-of-order loaded value y
> > > >    [it is z in reality]
> > > >
> > > > The problem here is that on a weak memory model the CCPU is allowed
> > > > to load the r->ring[x] value in advance, if it decides to do so (the
> > > > CCPU needs to be able to see in advance that x will be an interesting
> > > > index worth loading). The index value x is updated atomically, but
> > > > that does not matter here. Also, the write barrier on the PCPU side
> > > > guarantees that the CCPU cannot see the update of x before the PCPU
> > > > has really updated r->ring[x] to z and moved the tail, but it still
> > > > allows the out-of-order loads to happen without a proper read barrier.
> > > >
> > > > When the read barrier is moved between steps 4 and 5, it invalidates
> > > > any out-of-order loads done so far and forces the CCPU to drop the
> > > > r->ring[x] value y and load the current value z.
> > > >
> > > > The ring queue appears to work well because this is a rare corner
> > > > case. Due to the head,tail structure, the problem requires the queue
> > > > to be full, and the CCPU also needs to see the r->ring[x] update
> > > > later than it does the out-of-order load. In addition, the HW needs
> > > > to be able to predict and choose the load to the future index (which
> > > > should be quite possible, considering modern CPUs). If you have seen
> > > > problems in the past and noticed that a larger ring queue works
> > > > better as a workaround, you may have encountered this problem already.
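For clarity, the five steps above written as a single interleaving (a
hypothetical trace; as concluded above, CPUs that honour address
dependencies do not actually produce it):

    /* initially r->ring[x] holds the stale value y */

    /* CCPU */ tmp = r->ring[x];    /* 1. speculative early load: sees y */
    /* PCPU */ r->ring[x] = z;      /* 2. producer stores the new object */
    /* PCPU */ rte_smp_wmb();
    /* PCPU */ r->prod.tail = x;    /* 3. producer publishes slot x      */
    /* CCPU */                      /* sees prod.tail == x: slot x ready */
    /* CCPU */ r->cons.head = x;    /* 4. consumer claims slot x         */
    /* CCPU */ obj = tmp;           /* 5. consumes stale y instead of z  */
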
> > >
> > > I don't understand what 'larger rings work better' means here.
> > > What we are talking about is a race condition that, if hit, would
> > > cause data corruption and most likely a crash.
>
> A larger ring queue makes the problem less frequent, as the queue has
> more available free space and the problem does not occur unless the
> queue is full. The same symptoms apply to the new problem I describe
> below, too.
>
> > > > It is quite safe to move the barrier before DEQUEUE because after
> > > > the DEQUEUE there is nothing really that we would want to protect
> > > > with a read barrier.
> > >
> > > I don't think so.
> > > If you remove the barrier after DEQUEUE(), that means on systems with
> > > relaxed memory ordering cons.tail could be updated before DEQUEUE()
> > > is finished, and the producer could overwrite queue entries that were
> > > not yet dequeued.
> > > So if the cpu can really do such speculative out-of-order loads, then
> > > for __rte_ring_sc_do_dequeue() we do need something like:
> > >
> > > rte_smp_rmb();
> > > DEQUEUE_PTRS();
> > > rte_smp_rmb();
>
> You have a valid point here: there needs to be a guarantee that cons_tail
> cannot be updated before DEQUEUE is completed. Nevertheless, my point was
> that it is not guaranteed with a read barrier anyway. The implementation
> has the following sequence:
>
> DEQUEUE_PTRS(); (i.e. READ/LOAD)
> rte_smp_rmb();
> ..
> r->cons.tail = cons_next; (i.e. WRITE/STORE)
>
> The read barrier above does not guarantee any ordering for the following
> writes/stores. As a guarantee is needed, I think we in fact need to
> change the read barrier on the dequeue to a full barrier, which
> guarantees the read+write order, as follows:
>
> DEQUEUE_PTRS();
> rte_smp_mb();
> ..
> r->cons.tail = cons_next;
>
> If you agree, I can for sure prepare another patch for this issue.

Hmm, I think for __rte_ring_mc_do_dequeue() we are ok with smp_rmb(), as
we have to read cons.tail anyway.

For __rte_ring_sc_do_dequeue(), I think you are right, we might need
something stronger.
I don't want to put rte_smp_mb() here, as it would cause a full HW barrier
even on machines with strong memory ordering (IA).
I think that rte_smp_wmb() might be enough here:
it would force the cpu to wait till the writes in DEQUEUE_PTRS() become
visible, which means the reads have to be completed too.

Another option would be to define a new macro, rte_weak_mb() or so, that
would be expanded into a compiler barrier on boxes with a strong memory
model, and into a full MB on machines with relaxed ones.

Interested to hear what the ARM and PPC guys think.
Konstantin

P.S. Another thing, a bit off-topic, for the PPC guys:
as far as I can see, smp_rmb/smp_wmb are just compiler barriers:

find lib/librte_eal/common/include/arch/ppc_64/ -type f | xargs grep smp_
lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h:#define rte_smp_mb() rte_mb()
lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h:#define rte_smp_wmb() rte_compiler_barrier()
lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h:#define rte_smp_rmb() rte_compiler_barrier()

My knowledge of the PPC architecture is rudimentary, but is that really
enough?
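To make the rte_weak_mb() idea concrete, a rough sketch of what it could
look like (the macro does not exist in DPDK today; the per-arch mappings
below are assumptions for illustration):

    /* hypothetical rte_weak_mb(): a compiler barrier on strongly ordered
     * cpus, a full memory barrier on weakly ordered ones */

    /* arch/x86 (strong memory order) */
    #define rte_weak_mb() rte_compiler_barrier()

    /* arch/ppc_64 (relaxed memory order) */
    #define rte_weak_mb() rte_mb()

The single-consumer dequeue would then end with:

    DEQUEUE_PTRS();
    rte_weak_mb();    /* keep the cons.tail store after the copies */
    ...
    r->cons.tail = cons_next;
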
> > > > > > > > -- > > > > Juhamatti > > > > > > > > > Konstantin > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Juhamatti Kuusisaari > > > > > > > > > > > > > > > > --- > > > > > > > > lib/librte_ring/rte_ring.h | 6 ++++-- > > > > > > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > > > > > > > > > > > diff --git a/lib/librte_ring/rte_ring.h > > > > > > > > b/lib/librte_ring/rte_ring.h index eb45e41..a923e49 100644 > > > > > > > > --- a/lib/librte_ring/rte_ring.h > > > > > > > > +++ b/lib/librte_ring/rte_ring.h > > > > > > > > @@ -662,9 +662,10 @@ __rte_ring_mc_do_dequeue(struct > > > > > > > > rte_ring *r, > > > > > > > void **obj_table, > > > > > > > > cons_next); > > > > > > > > } while (unlikely(success =3D=3D 0)); > > > > > > > > > > > > > > > > + rte_smp_rmb(); > > > > > > > > + > > > > > > > > /* copy in table */ > > > > > > > > DEQUEUE_PTRS(); > > > > > > > > - rte_smp_rmb(); > > > > > > > > > > > > > > > > /* > > > > > > > > * If there are other dequeues in progress that > > > > > > > > preceded us, @@ -746,9 +747,10 @@ > > > > > > > > __rte_ring_sc_do_dequeue(struct rte_ring *r, > > > > > > > void **obj_table, > > > > > > > > cons_next =3D cons_head + n; > > > > > > > > r->cons.head =3D cons_next; > > > > > > > > > > > > > > > > + rte_smp_rmb(); > > > > > > > > + > > > > > > > > /* copy in table */ > > > > > > > > DEQUEUE_PTRS(); > > > > > > > > - rte_smp_rmb(); > > > > > > > > > > > > > > > > __RING_STAT_ADD(r, deq_success, n); > > > > > > > > r->cons.tail =3D cons_next; > > > > > > > > -- > > > > > > > > 2.9.0 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > > > > =3D=3D > > > > > > > > The information contained in this message may be privileged > > > > > > > > and confidential and protected from disclosure. If the > > > > > > > > reader of this message is not the intended recipient, or an > > > > > > > > employee or agent responsible for delivering this message t= o > > > > > > > > the intended recipient, you are hereby notified that any > > > > > > > > reproduction, dissemination or distribution of this > > > > > > > > communication is strictly prohibited. If you have received > > > > > > > > this communication in error, please notify us immediately b= y > > > > > > > > replying to the message and deleting it from your > > > > > computer. Thank you. > > > > > > > > Coriant-Tellabs > > > > > > > > > > > > > > > > > > > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > > > > =3D=3D