* [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
From: Juhamatti Kuusisaari @ 2016-07-11 10:20 UTC
To: dev

Fix the location of the rte_ring data dependency read barrier.
It needs to be called before accessing indexed data to ensure
that the data itself is guaranteed to be correctly updated.

See more details at kernel/Documentation/memory-barriers.txt
section 'Data dependency barriers'.

Signed-off-by: Juhamatti Kuusisaari <juhamatti.kuusisaari@coriant.com>
---
 lib/librte_ring/rte_ring.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index eb45e41..a923e49 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -662,9 +662,10 @@ __rte_ring_mc_do_dequeue(struct rte_ring *r, void **obj_table,
 					      cons_next);
 	} while (unlikely(success == 0));
 
+	rte_smp_rmb();
+
 	/* copy in table */
 	DEQUEUE_PTRS();
-	rte_smp_rmb();
 
 	/*
 	 * If there are other dequeues in progress that preceded us,
@@ -746,9 +747,10 @@ __rte_ring_sc_do_dequeue(struct rte_ring *r, void **obj_table,
 	cons_next = cons_head + n;
 	r->cons.head = cons_next;
 
+	rte_smp_rmb();
+
 	/* copy in table */
 	DEQUEUE_PTRS();
-	rte_smp_rmb();
 
 	__RING_STAT_ADD(r, deq_success, n);
 	r->cons.tail = cons_next;
-- 
2.9.0

============================================================
The information contained in this message may be privileged
and confidential and protected from disclosure. If the reader
of this message is not the intended recipient, or an employee
or agent responsible for delivering this message to the
intended recipient, you are hereby notified that any reproduction,
dissemination or distribution of this communication is strictly
prohibited. If you have received this communication in error,
please notify us immediately by replying to the message and
deleting it from your computer. Thank you. Coriant-Tellabs
============================================================

* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
From: Ananyev, Konstantin @ 2016-07-11 10:41 UTC
To: Juhamatti Kuusisaari, dev

Hi,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Juhamatti Kuusisaari
> Sent: Monday, July 11, 2016 11:21 AM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
>
> Fix the location of the rte_ring data dependency read barrier.
> It needs to be called before accessing indexed data to ensure
> that the data itself is guaranteed to be correctly updated.
>
> See more details at kernel/Documentation/memory-barriers.txt
> section 'Data dependency barriers'.

Any explanation why?
From my point smp_rmb()s are on the proper places here :)

Konstantin

* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
From: Kuusisaari, Juhamatti @ 2016-07-11 11:22 UTC
To: Ananyev, Konstantin, dev

Hi,

> > Fix the location of the rte_ring data dependency read barrier.
> > It needs to be called before accessing indexed data to ensure that the
> > data itself is guaranteed to be correctly updated.
> >
> > See more details at kernel/Documentation/memory-barriers.txt
> > section 'Data dependency barriers'.
>
> Any explanation why?
> From my point smp_rmb()s are on the proper places here :)
> Konstantin

The problem here is that on a weak memory model system the CPU is
allowed to load the address data out-of-order in advance.
If the read barrier is after the DEQUEUE, you might end up having the old
data there on a race situation when the buffer is continuously full.
Having it before the DEQUEUE guarantees that the load is not done
in advance.

On Intel, it should not matter due to different memory model, so this is
limited to weak memory model systems.

--
 Juhamatti

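The ordering argued for in the message above can be sketched outside DPDK with C11 atomics. This is only a minimal single-producer/single-consumer sketch under stated assumptions, not the rte_ring code: struct demo_ring, demo_enqueue() and demo_dequeue() are hypothetical names, and atomic_thread_fence(memory_order_acquire) is assumed here as a rough stand-in for rte_smp_rmb() placed before the slot is read.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RING_SIZE 8u                     /* hypothetical; power of two */
#define DEMO_RING_MASK (DEMO_RING_SIZE - 1u)

/* Hypothetical miniature ring, not struct rte_ring. */
struct demo_ring {
	_Atomic uint32_t prod_tail;           /* published by the producer */
	_Atomic uint32_t cons_tail;           /* published by the consumer */
	void *slots[DEMO_RING_SIZE];
};

/* Producer: write the slot first, then publish the new tail. */
static int demo_enqueue(struct demo_ring *r, void *obj)
{
	uint32_t tail = atomic_load_explicit(&r->prod_tail, memory_order_relaxed);
	uint32_t cons = atomic_load_explicit(&r->cons_tail, memory_order_acquire);

	if (tail - cons == DEMO_RING_SIZE)
		return -1;                    /* ring is full */

	r->slots[tail & DEMO_RING_MASK] = obj;
	/* Release store: the slot write is visible before the tail update. */
	atomic_store_explicit(&r->prod_tail, tail + 1, memory_order_release);
	return 0;
}

/* Consumer: read the indexes, then the barrier, then the slot. */
static int demo_dequeue(struct demo_ring *r, void **obj)
{
	uint32_t head, prod;

	assert(obj != NULL);
	head = atomic_load_explicit(&r->cons_tail, memory_order_relaxed);
	prod = atomic_load_explicit(&r->prod_tail, memory_order_relaxed);

	if (prod == head)
		return -1;                    /* ring is empty */

	/*
	 * Assumed equivalent of placing the read barrier before the copy:
	 * the index loads above are ordered before the slot load below.
	 */
	atomic_thread_fence(memory_order_acquire);

	*obj = r->slots[head & DEMO_RING_MASK];

	/* Release store: the slot read is ordered before freeing the slot. */
	atomic_store_explicit(&r->cons_tail, head + 1, memory_order_release);
	return 0;
}
```

In this sketch the consumer only needs a fence because the index load is relaxed; loading prod_tail with memory_order_acquire would be an alternative way to express the same ordering.
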
* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
From: Olivier Matz @ 2016-07-11 11:40 UTC
To: Kuusisaari, Juhamatti, Ananyev, Konstantin, dev

Hi,

On 07/11/2016 01:22 PM, Kuusisaari, Juhamatti wrote:
> The problem here is that on a weak memory model system the CPU is
> allowed to load the address data out-of-order in advance.
> If the read barrier is after the DEQUEUE, you might end up having the old
> data there on a race situation when the buffer is continuously full.
> Having it before the DEQUEUE guarantees that the load is not done
> in advance.
>
> On Intel, it should not matter due to different memory model, so this is
> limited to weak memory model systems.

I agree with Juhamatti. To me, the reading of consumer_head must occur
before the reading of objects ptrs.

That was the case before, and this is something I already noticed when
I sent that mail:
http://dpdk.org/ml/archives/dev/2014-March/001742.html

At that time, only Intel CPUs were supported, so it did not make any
difference.

Juhamatti, do you have a setup where you can trigger the issue or is it
something you've seen by code review?

Thanks,
Olivier

* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
From: Kuusisaari, Juhamatti @ 2016-07-12 4:10 UTC
To: Olivier Matz, Ananyev, Konstantin, dev

Hello,

> Juhamatti, do you have a setup where you can trigger the issue or is it
> something you've seen by code review?

This was found on a code review when we investigated a problem that could
have caused issues that this kind of bug would introduce. I suppose one
would be able to see this with very short ring queue lengths and high load,
but it depends on the HW used of course too.

BR,
--
 Juhamatti

* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
From: Ananyev, Konstantin @ 2016-07-11 12:34 UTC
To: Kuusisaari, Juhamatti, dev

> The problem here is that on a weak memory model system the CPU is
> allowed to load the address data out-of-order in advance.
> If the read barrier is after the DEQUEUE, you might end up having the old
> data there on a race situation when the buffer is continuously full.
> Having it before the DEQUEUE guarantees that the load is not done
> in advance.

Sorry, still didn't see any race condition in the current code.
Can you provide any particular example?
From other side, moving smp_rmb() before dequeueing the objects,
could introduce a race condition, on cpus where later writes can be
reordered with earlier reads.

Konstantin

* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
From: Kuusisaari, Juhamatti @ 2016-07-12 5:27 UTC
To: Ananyev, Konstantin, dev

Hello,

> Sorry, still didn't see any race condition in the current code.
> Can you provide any particular example?
> From other side, moving smp_rmb() before dequeueing the objects, could
> introduce a race condition, on cpus where later writes can be reordered with
> earlier reads.

Here is a simplified example sequence from time perspective:
1. Consumer CPU (CCPU) loads value y from r->ring[x] out-of-order
   (the key of the problem)
2. Producer CPU (PCPU) updates r->ring[x] to be value z
3. PCPU updates prod_tail to be x
4. CCPU updates cons_head to be x
5. CCPU loads r->ring[x] by using out-of-order loaded value y [is z in reality]

The problem here is that on weak memory model, the CCPU is allowed to load
r->ring[x] value in advance, if it decides to do so (CCPU needs to be able to see
in advance that x will be an interesting index worth loading). The index value x
is updated atomically, but it does not matter here. Also, the write barrier on PCPU
side guarantees that CCPU cannot see update of x before PCPU has really updated
the r->ring[x] to z and moved the tail, but still allows to do the out-of-order loads
without proper read barrier.

When the read barrier is moved between steps 4 and 5, it disallows to use
any out-of-order loads so far and forces to drop r->ring[x] y value and
load current value z.

The ring queue appears to work well as this is a rare corner case. Due to the
head,tail-structure the problem needs queue to be full and also CCPU needs
to see r->ring[x] update later than it does the out-of-order load. In addition,
the HW needs to be able to predict and choose the load to the future index
(which should be quite possible, considering modern CPUs). If you have seen
in the past problems and noticed that a larger ring queue works better as a
workaround, you may have encountered the problem already.

It is quite safe to move the barrier before DEQUEUE because after the DEQUEUE
there is nothing really that we would want to protect with a read barrier. The read
barrier is mapped to a compiler barrier on strong memory model systems and this
works fine too as the order of the head,tail updates is still guaranteed on the new
location. Even if the problem would be theoretical on most systems, it is worth fixing
as the risk for problems is very low.

--
 Juhamatti

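The five-step sequence in the message above can be replayed deterministically on a single thread to make the claimed outcome concrete. This is a toy model under loud assumptions: it does not show that any real CPU performs the early load in step 1 (that is exactly what is disputed in the thread), the barrier is modelled simply as "discard the early load and read the slot again", and replay_dequeue(), OLD_Y and NEW_Z are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

enum { OLD_Y = 100, NEW_Z = 200 };    /* hypothetical slot contents y and z */

/*
 * Replays steps 1..5 in a fixed order and returns the value the consumer
 * ends up with. Purely illustrative; no real concurrency is involved.
 */
static int replay_dequeue(bool barrier_before_copy)
{
	int slot = OLD_Y;             /* r->ring[x] still holds the old entry y */
	unsigned prod_tail = 0;       /* nothing published yet */
	unsigned cons_head = 0;

	int early = slot;             /* 1. hypothetical early load of ring[x] */
	slot = NEW_Z;                 /* 2. producer writes the new entry z */
	prod_tail = 1;                /* 3. producer publishes the entry */

	assert(prod_tail - cons_head == 1);   /* one entry is now visible */
	cons_head = 1;                /* 4. consumer claims index x */
	(void)cons_head;

	/* 5. with the modelled barrier the early load is dropped and re-read */
	return barrier_before_copy ? slot : early;
}
```

Calling replay_dequeue(false) models the barrier placed after the copy and yields the stale value; replay_dequeue(true) models the barrier placed between steps 4 and 5.
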
* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
From: Ananyev, Konstantin @ 2016-07-12 11:01 UTC
To: Kuusisaari, Juhamatti, dev

Hi Juhamatti,

> Here is a simplified example sequence from time perspective:
> 1. Consumer CPU (CCPU) loads value y from r->ring[x] out-of-order
>    (the key of the problem)

To read the value of ring[x] cpu has to calculate x first.
And to calculate x it needs to read cons.head and prod.tail first.
Are you saying that some modern cpu can:
- 'speculate' value of cons.head and prod.tail
  (based on what?)
- calculate x based on these speculated values.
- read ring[x]
- read cons.head and prod.tail
- if read values are not equal to speculated ones, then
  re-calculate x and re-read ring[x]
- else use speculatively read ring[x]
?
If such thing is possible (is it really? and if yes on which cpu?),
then yes, we might need an extra smp_rmb() before DEQUEUE_PTRS()
for __rte_ring_sc_do_dequeue().
For __rte_ring_mc_do_dequeue(), I think we are ok, as
there is CAS just before DEQUEUE_PTRS().

> 2. Producer CPU (PCPU) updates r->ring[x] to be value z
> 3. PCPU updates prod_tail to be x
> 4. CCPU updates cons_head to be x
> 5. CCPU loads r->ring[x] by using out-of-order loaded value y [is z in reality]
>
> The problem here is that on weak memory model, the CCPU is allowed to load
> r->ring[x] value in advance, if it decides to do so (CCPU needs to be able to see
> in advance that x will be an interesting index worth loading). The index value x
> is updated atomically, but it does not matter here. Also, the write barrier on PCPU
> side guarantees that CCPU cannot see update of x before PCPU has really updated
> the r->ring[x] to z and moved the tail, but still allows to do the out-of-order loads
> without proper read barrier.
>
> When the read barrier is moved between steps 4 and 5, it disallows to use
> any out-of-order loads so far and forces to drop r->ring[x] y value and
> load current value z.
>
> The ring queue appears to work well as this is a rare corner case. Due to the
> head,tail-structure the problem needs queue to be full and also CCPU needs
> to see r->ring[x] update later than it does the out-of-order load. In addition,
> the HW needs to be able to predict and choose the load to the future index
> (which should be quite possible, considering modern CPUs). If you have seen
> in the past problems and noticed that a larger ring queue works better as a
> workaround, you may have encountered the problem already.

I don't understand what 'larger rings work better' means here.
What we are talking about is race condition, that if hit, would
cause data corruption and most likely a crash.

> It is quite safe to move the barrier before DEQUEUE because after the DEQUEUE
> there is nothing really that we would want to protect with a read barrier.

I don't think so.
If you remove barrier after DEQUEUE(), that means on systems with relaxed memory ordering
cons.tail could be updated before DEQUEUE() will be finished and producer can overwrite
queue entries that were not yet dequeued.
So if cpu can really do such speculative out of order loads,
then we do need for __rte_ring_sc_do_dequeue() something like:

rte_smp_rmb();
DEQUEUE_PTRS();
rte_smp_rmb();

Konstantin

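The two ordering points discussed in the message above (one before the copy, one between the copy and the cons.tail update) can be sketched for a multi-consumer dequeue with C11 atomics. This is a sketch under stated assumptions and not the rte_ring implementation: struct mc_ring and mc_dequeue_bulk() are hypothetical names, the head is claimed with a C11 compare-and-swap, and the second ordering point is expressed as a release store on cons_tail rather than the second rte_smp_rmb() written in the message, which is a substitution made for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MC_RING_SIZE 8u                       /* hypothetical; power of two */
#define MC_RING_MASK (MC_RING_SIZE - 1u)

/* Hypothetical miniature ring, not struct rte_ring. */
struct mc_ring {
	_Atomic uint32_t prod_tail;           /* published by producers */
	_Atomic uint32_t cons_head;           /* claimed by consumers */
	_Atomic uint32_t cons_tail;           /* published by consumers */
	void *slots[MC_RING_SIZE];
};

/* Dequeue exactly n entries or none; returns the number dequeued. */
static unsigned mc_dequeue_bulk(struct mc_ring *r, void **objs, unsigned n)
{
	uint32_t head, next;

	assert(n <= MC_RING_SIZE);

	/* Claim n entries by advancing cons_head with a compare-and-swap. */
	head = atomic_load_explicit(&r->cons_head, memory_order_relaxed);
	do {
		uint32_t avail =
			atomic_load_explicit(&r->prod_tail, memory_order_acquire) - head;
		if (n > avail)
			return 0;             /* not enough entries */
		next = head + n;
	} while (!atomic_compare_exchange_weak_explicit(&r->cons_head, &head, next,
			memory_order_acquire, memory_order_relaxed));

	/* First ordering point: index reads complete before the slot reads. */
	atomic_thread_fence(memory_order_acquire);

	for (unsigned i = 0; i < n; i++)
		objs[i] = r->slots[(head + i) & MC_RING_MASK];

	/* Wait for dequeues that claimed earlier entries to publish first. */
	while (atomic_load_explicit(&r->cons_tail, memory_order_acquire) != head)
		;

	/*
	 * Second ordering point: the slot reads above must not be reordered
	 * after this store, otherwise a producer could overwrite entries
	 * that have not been copied yet.
	 */
	atomic_store_explicit(&r->cons_tail, next, memory_order_release);
	return n;
}
```

The release store is chosen here because the hazard described is a later store (cons_tail) passing earlier loads (the slot reads), which in C11 terms is release ordering rather than load-load ordering.
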
* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
  2016-07-12 11:01 ` Ananyev, Konstantin
@ 2016-07-12 17:58 ` Ananyev, Konstantin
  2016-07-13  5:27 ` Kuusisaari, Juhamatti
  0 siblings, 1 reply; 17+ messages in thread
From: Ananyev, Konstantin @ 2016-07-12 17:58 UTC (permalink / raw)
  To: 'Kuusisaari, Juhamatti', 'dev@dpdk.org'

>
> Hi Juhamatti,
>
> >
> > Hello,
> >
> > > > > > -----Original Message-----
> > > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Juhamatti
> > > > > > Kuusisaari
> > > > > > Sent: Monday, July 11, 2016 11:21 AM
> > > > > > To: dev@dpdk.org
> > > > > > Subject: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to
> > > > > > correct location
> > > > > >
> > > > > > Fix the location of the rte_ring data dependency read barrier.
> > > > > > It needs to be called before accessing indexed data to ensure that
> > > > > > the data itself is guaranteed to be correctly updated.
> > > > > >
> > > > > > See more details at kernel/Documentation/memory-barriers.txt
> > > > > > section 'Data dependency barriers'.
> > > > > >
> > > > >
> > > > > Any explanation why?
> > > > > From my point smp_rmb()s are on the proper places here :) Konstantin
> > > >
> > > > The problem here is that on a weak memory model system the CPU is
> > > > allowed to load the address data out-of-order in advance.
> > > > If the read barrier is after the DEQUEUE, you might end up having the
> > > > old data there on a race situation when the buffer is continuously full.
> > > > Having it before the DEQUEUE guarantees that the load is not done in
> > > > advance.
> > >
> > > Sorry, still didn't see any race condition in the current code.
> > > Can you provide any particular example?
> > > From other side, moving smp_rmb() before dequeueing the objects, could
> > > introduce a race condition, on cpus where later writes can be reordered with
> > > earlier reads.
> >
> > Here is a simplified example sequence from time perspective:
> > 1. Consumer CPU (CCPU) loads value y from r->ring[x] out-of-order
> > (the key of the problem)
>
> To read the value of ring[x] cpu has to calculate x first.
> And to calculate x it needs to read cons.head and prod.tail first.
> Are you saying that some modern cpu can:
> - 'speculate' value of cons.head and prod.tail
>   (based on what?)
> - calculate x based on these speculated values.
> - read ring[x]
> - read cons.head and prod.tail
> - if read values are not equal to speculated ones, then
>   re-calculate x and re-read ring[x]
> - else use speculatively read ring[x]
> ?
> If such thing is possible (is it really? and if yes on which cpu?),

As far as I can see, neither ARM nor PPC supports such things.
Both of them do obey address dependency.
(ARM & PPC guys, feel free to correct me if I am wrong here.)
So what CPU are we talking about?
Konstantin

> then yes, we might need an extra smp_rmb() before DEQUEUE_PTRS()
> for __rte_ring_sc_do_dequeue().
> For __rte_ring_mc_do_dequeue(), I think we are ok, as
> there is CAS just before DEQUEUE_PTRS().
>
> > 2. Producer CPU (PCPU) updates r->ring[x] to value be z
> > 3. PCPU updates prod_tail to be x
> > 4. CCPU updates cons_head to be x
> > 5. CCPU loads r->ring[x] by using out-of-order loaded value y [is z in reality]
> >
> > The problem here is that on weak memory model, the CCPU is allowed to load
> > r->ring[x] value in advance, if it decides to do so (CCPU needs to be able to see
> > in advance that x will be an interesting index worth loading). The index value x
> > is updated atomically, but it does not matter here. Also, the write barrier on PCPU
> > side guarantees that CCPU cannot see update of x before PCPU has really updated
> > the r->ring[x] to z and moved the tail, but still allows to do the out-of-order loads
> > without proper read barrier.
> >
> > When the read barrier is moved between steps 4 and 5, it disallows to use
> > any out-of-order loads so far and forces to drop r->ring[x] y value and
> > load current value z.
> >
> > The ring queue appears to work well as this is a rare corner case. Due to the
> > head,tail-structure the problem needs queue to be full and also CCPU needs
> > to see r->ring[x] update later than it does the out-of-order load. In addition,
> > the HW needs to be able to predict and choose the load to the future index
> > (which should be quite possible, considering modern CPUs). If you have seen
> > in the past problems and noticed that a larger ring queue works better as a
> > workaround, you may have encountered the problem already.
>
> I don't understand what means 'larger rings works better' here.
> What we are talking about is race condition, that if hit, would
> cause data corruption and most likely a crash.
>
> >
> > It is quite safe to move the barrier before DEQUEUE because after the DEQUEUE
> > there is nothing really that we would want to protect with a read barrier.
>
> I don't think so.
> If you remove barrier after DEQUEUE(), that means on systems with relaxed memory ordering
> cons.tail could be updated before DEQUEUE() will be finished and producer can overwrite
> queue entries that were not yet dequeued.
> So if cpu can really do such speculative out of order loads,
> then we do need for __rte_ring_sc_do_dequeue() something like:
>
> rte_smp_rmb();
> DEQUEUE_PTRS();
> rte_smp_rmb();
>
> Konstantin
>
> > The read
> > barrier is mapped to a compiler barrier on strong memory model systems and this
> > works fine too as the order of the head,tail updates is still guaranteed on the new
> > location. Even if the problem would be theoretical on most systems, it is worth fixing
> > as the risk for problems is very low.
* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
  2016-07-12 17:58 ` Ananyev, Konstantin
@ 2016-07-13  5:27 ` Kuusisaari, Juhamatti
  2016-07-13 13:00 ` Ananyev, Konstantin
  0 siblings, 1 reply; 17+ messages in thread
From: Kuusisaari, Juhamatti @ 2016-07-13  5:27 UTC (permalink / raw)
  To: Ananyev, Konstantin, 'dev@dpdk.org'

Hello,

> As I can see, neither ARM or PPC support such things.
> Both of them do obey address dependency.
> (ARM & PPC guys feel free to correct me here, if I am wrong here).
> So what cpu we are talking about?

I checked that too. Indeed, the problem I described seems to be more academic
than even theoretical and does not apply to current CPUs. So I agree here, and
this makes the patch unneeded; I'll withdraw it. However, the implementation
may still have another issue, see below.

> > I don't understand what means 'larger rings works better' here.
> > What we are talking about is race condition, that if hit, would cause
> > data corruption and most likely a crash.

A larger ring makes the problem less frequent, because the queue has more free
space and the problem does not occur unless the queue is full. The same
symptoms also apply to the new problem I describe below.

> > If you remove barrier after DEQUEUE(), that means on systems with
> > relaxed memory ordering cons.tail could be updated before DEQUEUE()
> > will be finished and producer can overwrite queue entries that were not
> > yet dequeued.
> > So if cpu can really do such speculative out of order loads, then we
> > do need for __rte_ring_sc_do_dequeue() something like:
> >
> > rte_smp_rmb();
> > DEQUEUE_PTRS();
> > rte_smp_rmb();

You have a valid point here: there needs to be a guarantee that cons_tail
cannot be updated before DEQUEUE is completed. Nevertheless, my point was that
a read barrier does not guarantee this anyway. The implementation has the
following sequence:

    DEQUEUE_PTRS();              (i.e. READ/LOAD)
    rte_smp_rmb();
    ..
    r->cons.tail = cons_next;    (i.e. WRITE/STORE)

The read barrier above does not guarantee any ordering for the writes/stores
that follow it. As a guarantee is needed, I think we in fact need to change the
read barrier on the dequeue to a full barrier, which guarantees the read+write
order, as follows:

    DEQUEUE_PTRS();
    rte_smp_mb();
    ..
    r->cons.tail = cons_next;

If you agree, I can prepare another patch for this issue.

Thanks,
--
 Juhamatti
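Editorial sketch: the load-then-store ordering that the message above asks for can be written down in portable C11 terms. This is a minimal single-producer/single-consumer toy, not the rte_ring.h code under discussion. The names `toy_ring`, `toy_sp_enqueue` and `toy_sc_dequeue` are hypothetical, and C11 acquire/release atomics stand in for the `rte_smp_*()` macros as an assumption of this sketch. The point it illustrates is that publishing the consumer tail with release semantics keeps the slot load from being reordered after the tail store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_RING_SIZE 8u                 /* hypothetical; must be a power of two */
#define TOY_RING_MASK (TOY_RING_SIZE - 1u)

struct toy_ring {
    atomic_uint prod_tail;               /* written only by the producer */
    atomic_uint cons_tail;               /* written only by the consumer */
    void *slots[TOY_RING_SIZE];
};

/* Single-producer enqueue of one pointer; returns 0 on success, -1 if full. */
static int toy_sp_enqueue(struct toy_ring *r, void *obj)
{
    unsigned tail = atomic_load_explicit(&r->prod_tail, memory_order_relaxed);
    /* Acquire pairs with the consumer's release store of cons_tail. */
    unsigned cons = atomic_load_explicit(&r->cons_tail, memory_order_acquire);

    if (tail - cons == TOY_RING_SIZE)
        return -1;
    r->slots[tail & TOY_RING_MASK] = obj;
    /* Release: the slot write becomes visible before the new prod_tail. */
    atomic_store_explicit(&r->prod_tail, tail + 1, memory_order_release);
    return 0;
}

/* Single-consumer dequeue of one pointer; returns 0 on success, -1 if empty. */
static int toy_sc_dequeue(struct toy_ring *r, void **obj)
{
    unsigned tail = atomic_load_explicit(&r->cons_tail, memory_order_relaxed);
    /* Acquire pairs with the producer's release store of prod_tail. */
    unsigned prod = atomic_load_explicit(&r->prod_tail, memory_order_acquire);

    assert(obj != NULL);
    if (prod == tail)
        return -1;
    /* Stand-in for DEQUEUE_PTRS(): a plain load from the ring slot. */
    *obj = r->slots[tail & TOY_RING_MASK];
    /*
     * Release: the slot load above may not be reordered after this store,
     * so the producer cannot reuse the slot before it has been read.
     */
    atomic_store_explicit(&r->cons_tail, tail + 1, memory_order_release);
    return 0;
}
```

A release store is weaker than the full barrier proposed above, since it only orders earlier accesses before the store, which is the one direction the dequeue path needs.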
* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
  2016-07-13  5:27 ` Kuusisaari, Juhamatti
@ 2016-07-13 13:00 ` Ananyev, Konstantin
  2016-07-14  4:17 ` Kuusisaari, Juhamatti
  0 siblings, 1 reply; 17+ messages in thread
From: Ananyev, Konstantin @ 2016-07-13 13:00 UTC (permalink / raw)
  To: Kuusisaari, Juhamatti, 'dev@dpdk.org'

Hi Juhamatti,

> You have a valid point here, there needs to be a guarantee that cons_tail cannot
> be updated before DEQUEUE is completed. Nevertheless, my point was that it is
> not guaranteed with a read barrier anyway. The implementation has the following
> sequence
>
> DEQUEUE_PTRS(); (i.e. READ/LOAD)
> rte_smp_rmb();
> ..
> r->cons.tail = cons_next; (i.e WRITE/STORE)
>
> Above read barrier does not guarantee any ordering for the following writes/stores.
> As a guarantee is needed, I think we in fact need to change the read barrier on the
> dequeue to a full barrier, which guarantees the read+write order, as follows
>
> DEQUEUE_PTRS();
> rte_smp_mb();
> ..
> r->cons.tail = cons_next;
>
> If you agree, I can for sure prepare another patch for this issue.

Hmm, I think for __rte_ring_mc_do_dequeue() we are ok with smp_rmb(),
as we have to read cons.tail anyway.
For __rte_ring_sc_do_dequeue(), I think you are right, we might need something stronger.
I don't want to put rte_smp_mb() here, as it would cause a full HW barrier even on machines
with strong memory order (IA).
I think that rte_smp_wmb() might be enough here:
it would force the cpu to wait till the writes in DEQUEUE_PTRS() become visible, which
means the reads have to be completed too.
Another option would be to define a new macro, rte_weak_mb() or so,
that would expand into a compiler barrier (CB) on boxes with a strong memory model,
and into a full memory barrier (MB) on machines with relaxed ones.
Interested to hear what ARM and PPC guys think.
Konstantin

P.S. Another thing, a bit off-topic, for PPC guys:
As far as I can see, smp_rmb/smp_wmb are just compiler barriers:
find lib/librte_eal/common/include/arch/ppc_64/ -type f | xargs grep smp_
lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h:#define rte_smp_mb() rte_mb()
lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h:#define rte_smp_wmb() rte_compiler_barrier()
lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h:#define rte_smp_rmb() rte_compiler_barrier()
My knowledge of the PPC architecture is rudimentary, but is that really enough?
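Editorial sketch: the rte_weak_mb() idea floated in the message above could look roughly like the following. No such macro is confirmed by this thread to exist in DPDK, so `toy_weak_mb`, `TOY_WEAK_MB_IS_HW_BARRIER` and `toy_copy_then_publish` are hypothetical names. The x86-only preprocessor test is a simplifying assumption about which targets count as having a strong memory model, and C11 fences stand in for the DPDK barrier macros.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
/*
 * Strong memory model (assumption: x86 only). The hardware does not reorder
 * a store ahead of an older load, so stopping the compiler is enough.
 */
#define toy_weak_mb() atomic_signal_fence(memory_order_seq_cst)
#define TOY_WEAK_MB_IS_HW_BARRIER 0
#else
/* Relaxed memory model: fall back to a full hardware barrier. */
#define toy_weak_mb() atomic_thread_fence(memory_order_seq_cst)
#define TOY_WEAK_MB_IS_HW_BARRIER 1
#endif

/*
 * Copy n pointers out of the ring slots (stand-in for DEQUEUE_PTRS()),
 * then publish the new consumer tail once the copy has been ordered.
 */
static void toy_copy_then_publish(void *const *slots, void **obj_table,
                                  size_t n, atomic_uint *cons_tail,
                                  unsigned cons_next)
{
    assert(slots != NULL && obj_table != NULL && cons_tail != NULL);
    for (size_t i = 0; i < n; i++)
        obj_table[i] = slots[i];
    toy_weak_mb();                       /* loads above, tail store below */
    atomic_store_explicit(cons_tail, cons_next, memory_order_relaxed);
}
```

The design trade-off is the one stated in the message: a full barrier is only paid for on architectures that actually need it.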
* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
  2016-07-13 13:00 ` Ananyev, Konstantin
@ 2016-07-14  4:17 ` Kuusisaari, Juhamatti
  2016-07-14 12:56 ` Ananyev, Konstantin
  0 siblings, 1 reply; 17+ messages in thread
From: Kuusisaari, Juhamatti @ 2016-07-14  4:17 UTC (permalink / raw)
  To: Ananyev, Konstantin, 'dev@dpdk.org'

Hi Konstantin,

> Hmm, I think for __rte_ring_mc_do_dequeue() we are ok with smp_rmb(),
> as we have to read cons.tail anyway.

Are you certain that this read creates a strong enough dependency between the
read of cons.tail and the write of it in mc_do_dequeue()? I think it does not
really create any control dependency there, as the next write is not dependent
on the result of the read. The CPU also already knows the value that will be
written to cons.tail, and that value does not depend on the previous read
either. The CPU does not know we are planning to do a spinlock there, so it
might do things out-of-order without proper dependencies.

> For __rte_ring_sc_do_dequeue(), I think you right, we might need
> something stronger.
> I don't want to put rte_smp_mb() here as it would cause full HW barrier even
> on machines with strong memory order (IA).
> I think that rte_smp_wmb() might be enough here:
> it would force cpu to wait till writes in DEQUEUE_PTRS() are become visible,
> which means reads have to be completed too.

In practice I think that rte_smp_wmb() would work fine, even though it is not
strictly according to the book. The solution below would be my proposal as a
fix for the sc dequeueing issue (and also for mc dequeueing, if in reality we
have the problem of the CPU completely ignoring the spinlock there):

    DEQUEUE_PTRS();
    ..
    rte_smp_wmb();
    r->cons.tail = cons_next;

--
 Juhamatti
* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
  2016-07-14  4:17 ` Kuusisaari, Juhamatti
@ 2016-07-14 12:56 ` Ananyev, Konstantin
  2016-07-15  5:40 ` Kuusisaari, Juhamatti
  2016-07-15  6:29 ` Jerin Jacob
  0 siblings, 2 replies; 17+ messages in thread
From: Ananyev, Konstantin @ 2016-07-14 12:56 UTC (permalink / raw)
  To: Kuusisaari, Juhamatti, 'dev@dpdk.org'
  Cc: Jerin Jacob (jerin.jacob@caviumnetworks.com),
      Jan Viktorin (viktorin@rehivetech.com), Chao Zhu (bjzhuc@cn.ibm.com)

Hi Juhamatti,

>
> Hi Konstantin,
>
> > > > > > It is quite safe to move the barrier before DEQUEUE because
> > > > > > after the DEQUEUE there is nothing really that we would want
> > > > > > to protect with a read barrier.
> > > > >
> > > > > I don't think so.
> > > > > If you remove barrier after DEQUEUE(), that means on systems
> > > > > with relaxed memory ordering cons.tail could be updated before
> > > > > DEQUEUE() will be finished and producer can overwrite queue
> > > > > entries that were not yet dequeued.
> > > > > So if cpu can really do such speculative out of order loads,
> > > > > then we do need for __rte_ring_sc_do_dequeue() something like:
> > > > >
> > > > > rte_smp_rmb();
> > > > > DEQUEUE_PTRS();
> > > > > rte_smp_rmb();
> > >
> > > You have a valid point here, there needs to be a guarantee that
> > > cons_tail cannot be updated before DEQUEUE is completed. Nevertheless,
> > > my point was that it is not guaranteed with a read barrier anyway.
> > > The implementation has the following sequence
> > >
> > > DEQUEUE_PTRS(); (i.e. READ/LOAD)
> > > rte_smp_rmb();
> > > ..
> > > r->cons.tail = cons_next; (i.e WRITE/STORE)
> > >
> > > Above read barrier does not guarantee any ordering for the following
> > > writes/stores.
> > > As a guarantee is needed, I think we in fact need to change the read
> > > barrier on the dequeue to a full barrier, which guarantees the
> > > read+write order, as follows
> > >
> > > DEQUEUE_PTRS();
> > > rte_smp_mb();
> > > ..
> > > r->cons.tail = cons_next;
> > >
> > > If you agree, I can for sure prepare another patch for this issue.
> >
> > Hmm, I think for __rte_ring_mc_do_dequeue() we are ok with smp_rmb(),
> > as we have to read cons.tail anyway.
>
> Are you certain that this read creates strong enough dependency between
> read of cons.tail and the write of it on the mc_do_dequeue()?

Yes, I believe so.

> I think it does not really create any control dependency there as the
> next write is not dependent of the result of the read.

I think it is dependent: cons.tail can be updated only if its current value
is equal to the previously computed cons_head.
So the CPU has to read the cons.tail value first.

> The CPU also
> knows already the value that will be written to cons.tail and that value
> does not depend on the previous read either. The CPU does not
> know we are planning to do a spinlock there, so it might do things
> out-of-order without proper dependencies.
>
> > For __rte_ring_sc_do_dequeue(), I think you right, we might need
> > something stronger.
> > I don't want to put rte_smp_mb() here as it would cause full HW
> > barrier even on machines with strong memory order (IA).
> > I think that rte_smp_wmb() might be enough here:
> > it would force cpu to wait till writes in DEQUEUE_PTRS() are become
> > visible, which means reads have to be completed too.
>
> In practice I think that rte_smp_wmb() would work fine, even though it is
> not strictly according to the book. Below solution would be my
> proposal as a fix to the issue of sc dequeueing (and also to mc
> dequeueing, if we have the problem of CPU completely ignoring the spinlock
> in reality there):
>
> DEQUEUE_PTRS();
> ..
> rte_smp_wmb();
> r->cons.tail = cons_next;

As I said in my previous email, it looks good to me for __rte_ring_sc_do_dequeue(),
but I am interested to hear what the ARM and PPC maintainers think about it.
Jan, Jerin, do you have any comments on it?
Chao, sorry, but I am still not sure why PPC is considered an architecture
with strong memory ordering.
Maybe I am missing something obvious here.
Thanks
Konstantin

>
> --
> Juhamatti
>
> > Another option would be to define a new macro: rte_weak_mb() or so,
> > that would be expanded into CB on boxes with strong memory model, and
> > to full MB on machines with relaxed ones.
> > Interested to hear what ARM and PPC guys think.
> > Konstantin
> >
> > P.S. Another thing a bit off-topic - for PPC guys:
> > As I can see smp_rmb/smp_wmb are just compiler barriers:
> > find lib/librte_eal/common/include/arch/ppc_64/ -type f | xargs grep smp_
> > lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h:#define rte_smp_mb() rte_mb()
> > lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h:#define rte_smp_wmb() rte_compiler_barrier()
> > lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h:#define rte_smp_rmb() rte_compiler_barrier()
> > My knowledge about PPC architecture is rudimentary, but is that really enough?
> >
> > > Thanks,
> > > --
> > > Juhamatti
> > >
> > > > Konstantin
> > > >
> > > > > > The read
> > > > > > barrier is mapped to a compiler barrier on strong memory model
> > > > > > systems and this works fine too as the order of the head,tail
> > > > > > updates is still guaranteed on the new location. Even if the
> > > > > > problem would be theoretical on most systems, it is worth
> > > > > > fixing as the risk for problems is very low.

^ permalink raw reply	[flat|nested] 17+ messages in thread
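[Editor's illustration] The rte_weak_mb() idea quoted in the message above can be sketched roughly as follows. This is only a sketch under stated assumptions: rte_weak_mb() does not exist in DPDK, so the macro is named toy_weak_mb() here, and toy_sc_ring / toy_sc_dequeue are hypothetical stand-ins for the real ring and DEQUEUE_PTRS(). It expands to a compiler barrier on IA and to a full barrier elsewhere, and is placed between the slot loads and the tail store.

```c
#include <stdint.h>

/* Compiler-only barrier: blocks compiler reordering, emits no instruction. */
#define toy_compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical stand-in for the proposed rte_weak_mb(); not a DPDK macro. */
#if defined(__x86_64__) || defined(__i386__)
#define toy_weak_mb() toy_compiler_barrier()   /* strong memory model (IA) */
#else
#define toy_weak_mb() __sync_synchronize()     /* full barrier on relaxed CPUs */
#endif

#define TOY_SIZE 4u   /* ring size, power of two (assumption) */

struct toy_sc_ring {
	volatile uint32_t cons_tail;   /* read by the producer to find free slots */
	void *slots[TOY_SIZE];
};

/* Copy n entries into obj_table, then publish the new consumer tail. */
static void toy_sc_dequeue(struct toy_sc_ring *r, void **obj_table, uint32_t n)
{
	uint32_t head = r->cons_tail;
	uint32_t i;

	/* Stands in for DEQUEUE_PTRS(): loads from ring memory. */
	for (i = 0; i < n; i++)
		obj_table[i] = r->slots[(head + i) & (TOY_SIZE - 1u)];

	/* The slot loads must complete before the tail store is visible. */
	toy_weak_mb();
	r->cons_tail = head + n;
}
```

The design choice here is the one discussed in the thread: avoid paying for a hardware barrier on machines that already keep a store ordered after earlier loads, while still getting a full barrier on weakly ordered ones.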
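* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
  2016-07-14 12:56 ` Ananyev, Konstantin
@ 2016-07-15  5:40 ` Kuusisaari, Juhamatti
  2016-07-15  6:29 ` Jerin Jacob
  1 sibling, 0 replies; 17+ messages in thread
From: Kuusisaari, Juhamatti @ 2016-07-15 5:40 UTC (permalink / raw)
  To: Ananyev, Konstantin, 'dev@dpdk.org'

Hi Konstantin,

> Hi Juhamatti,
>
> > Hi Konstantin,
> >
> > > > > > > It is quite safe to move the barrier before DEQUEUE because
> > > > > > > after the DEQUEUE there is nothing really that we would want
> > > > > > > to protect with a read barrier.
> > > > > >
> > > > > > I don't think so.
> > > > > > If you remove barrier after DEQUEUE(), that means on systems
> > > > > > with relaxed memory ordering cons.tail could be updated before
> > > > > > DEQUEUE() will be finished and producer can overwrite queue
> > > > > > entries that were not yet dequeued.
> > > > > > So if cpu can really do such speculative out of order loads,
> > > > > > then we do need for __rte_ring_sc_do_dequeue() something like:
> > > > > >
> > > > > > rte_smp_rmb();
> > > > > > DEQUEUE_PTRS();
> > > > > > rte_smp_rmb();
> > > >
> > > > You have a valid point here, there needs to be a guarantee that
> > > > cons_tail cannot be updated before DEQUEUE is completed. Nevertheless,
> > > > my point was that it is not guaranteed with a read barrier anyway.
> > > > The implementation has the following sequence
> > > >
> > > > DEQUEUE_PTRS(); (i.e. READ/LOAD)
> > > > rte_smp_rmb();
> > > > ..
> > > > r->cons.tail = cons_next; (i.e WRITE/STORE)
> > > >
> > > > Above read barrier does not guarantee any ordering for the
> > > > following writes/stores.
> > > > As a guarantee is needed, I think we in fact need to change the
> > > > read barrier on the dequeue to a full barrier, which guarantees
> > > > the read+write order, as follows
> > > >
> > > > DEQUEUE_PTRS();
> > > > rte_smp_mb();
> > > > ..
> > > > r->cons.tail = cons_next;
> > > >
> > > > If you agree, I can for sure prepare another patch for this issue.
> > >
> > > Hmm, I think for __rte_ring_mc_do_dequeue() we are ok with
> > > smp_rmb(), as we have to read cons.tail anyway.
> >
> > Are you certain that this read creates strong enough dependency between
> > read of cons.tail and the write of it on the mc_do_dequeue()?
>
> Yes, I believe so.
>
> > I think it does not really create any control dependency there as the next
> > write is not dependent of the result of the read.
>
> I think it is dependent: cons.tail can be updated only if its current value is
> equal to the previously computed cons_head.
> So the CPU has to read the cons.tail value first.

I was thinking that it might match this processing pattern:

S1. if (a == b)
S2.     a = a + b
S3. b = a + b

Above, S3 has no control dependency on S1 and can in theory be executed in
parallel with it.

In the (simplified) implementation we have:

X1. while (a != b)
X2.     { }
X3. a = c

If we consider S3 and X3 equivalent, there might be a coherence problem
without a write barrier after the while(). It may of course be purely
theoretical, depending on the HW. This is anyway the reason for my suggestion
to have a write barrier also in mc_do_dequeue(), just before the tail update.

--
Juhamatti

> > The CPU also
> > knows already the value that will be written to cons.tail and that
> > value does not depend on the previous read either. The CPU does not
> > know we are planning to do a spinlock there, so it might do things
> > out-of-order without proper dependencies.
> >
> > > For __rte_ring_sc_do_dequeue(), I think you right, we might need
> > > something stronger.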

^ permalink raw reply	[flat|nested] 17+ messages in thread
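[Editor's illustration] The X1..X3 pattern from the message above (wait until the tail reaches our head, then store the new tail) can be sketched with a barrier between the wait loop and the store. This is a minimal sketch, not the DPDK code: toy_mc_publish_tail and its parameters are hypothetical, and a GCC release fence is swapped in for the rte_smp_wmb() proposed in the thread, because a release fence orders both earlier loads and earlier stores before the following store.

```c
#include <stdint.h>

/*
 * Hypothetical helper modelling the end of a multi-consumer dequeue:
 *   X1/X2: wait until consumers that preceded us have published their tail
 *   X3:    publish our own tail
 */
static void toy_mc_publish_tail(uint32_t *tail, uint32_t my_head,
				uint32_t my_next)
{
	/* X1/X2: spin until *tail equals the head value we started from. */
	while (__atomic_load_n(tail, __ATOMIC_RELAXED) != my_head)
		;

	/*
	 * Without a barrier here, nothing in X3 depends on the value read in
	 * X1, which is the concern raised above. The release fence keeps all
	 * earlier loads and stores ordered before the tail store.
	 */
	__atomic_thread_fence(__ATOMIC_RELEASE);

	/* X3: publish the new tail. */
	__atomic_store_n(tail, my_next, __ATOMIC_RELAXED);
}
```

For example, with a tail of 5, a call toy_mc_publish_tail(&tail, 5, 8) returns immediately and leaves the tail at 8; with any other starting tail it would keep waiting for another consumer to advance it.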
* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
  2016-07-14 12:56 ` Ananyev, Konstantin
  2016-07-15  5:40 ` Kuusisaari, Juhamatti
@ 2016-07-15  6:29 ` Jerin Jacob
  2016-07-15 10:34 ` Ananyev, Konstantin
  1 sibling, 1 reply; 17+ messages in thread
From: Jerin Jacob @ 2016-07-15 6:29 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Kuusisaari, Juhamatti, 'dev@dpdk.org',
      Jan Viktorin (viktorin@rehivetech.com), Chao Zhu (bjzhuc@cn.ibm.com)

On Thu, Jul 14, 2016 at 12:56:11PM +0000, Ananyev, Konstantin wrote:
>
> > The CPU also
> > knows already the value that will be written to cons.tail and that value
> > does not depend on the previous read either. The CPU does not
> > know we are planning to do a spinlock there, so it might do things
> > out-of-order without proper dependencies.
> >
> > > For __rte_ring_sc_do_dequeue(), I think you right, we might need
> > > something stronger.
> > > I don't want to put rte_smp_mb() here as it would cause full HW
> > > barrier even on machines with strong memory order (IA).
> > > I think that rte_smp_wmb() might be enough here:
> > > it would force cpu to wait till writes in DEQUEUE_PTRS() are become
> > > visible, which means reads have to be completed too.
> >
> > In practice I think that rte_smp_wmb() would work fine, even though it
> > is not strictly according to the book. Below solution would be my
> > proposal as a fix to the issue of sc dequeueing (and also to mc
> > dequeueing, if we have the problem of CPU completely ignoring the spinlock
> > in reality there):
> >
> > DEQUEUE_PTRS();
> > ..
> > rte_smp_wmb();
> > r->cons.tail = cons_next;
>
> As I said in previous email - it looks good for me for _rte_ring_sc_do_dequeue(),
> but I am interested to hear what ARM and PPC maintainers think about it.
> Jan, Jerin do you have any comments on it?

Actually, it is NOT performance-effective, and it is difficult to capture the
ORDER dependency with plain store and load barriers on WEAKLY ordered machines.
Beyond plain store and load barriers, we need to express the #LoadLoad,
#LoadStore and #StoreStore barrier dependencies with acquire and release
semantics in arch-neutral code (it looks like this is a compiler barrier on IA):
http://preshing.com/20120913/acquire-and-release-semantics/

For instance, the full-barrier CAS (__sync_bool_compare_and_swap) will not be
required on weakly ordered machines in the MP case.
I can send out an RFC version of the ring implementation changes required with
acquire-and-release semantics.
If it causes a performance degradation on IA, then we can separate it out
through a conditional compilation flag.

GCC Built-in Functions for Memory Model Aware Atomic Operations
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

Thoughts?

Jerin

> Chao, sorry but I still not sure why PPC is considered as architecture with strong memory ordering?
> Might be I am missing something obvious here.
> Thank
> Konstantin
>

^ permalink raw reply	[flat|nested] 17+ messages in thread
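[Editor's illustration] As a rough idea of what acquire and release semantics could look like with the GCC __atomic builtins linked above, here is a minimal single-producer/single-consumer sketch. It is not the RFC mentioned in the message and not DPDK code: acqrel_ring, acqrel_sp_enqueue and acqrel_sc_dequeue are hypothetical names, and the layout is simplified to one tail index per side.

```c
#include <stdint.h>

#define ACQREL_SIZE 8u                    /* power of two (assumption) */
#define ACQREL_MASK (ACQREL_SIZE - 1u)

struct acqrel_ring {
	uint32_t prod_tail;               /* written by the producer only */
	uint32_t cons_tail;               /* written by the consumer only */
	void *slots[ACQREL_SIZE];
};

/* Enqueue one pointer; returns 0 on success, -1 if the ring is full. */
static int acqrel_sp_enqueue(struct acqrel_ring *r, void *obj)
{
	uint32_t tail = __atomic_load_n(&r->prod_tail, __ATOMIC_RELAXED);
	/* Acquire pairs with the consumer's release store of cons_tail. */
	uint32_t cons = __atomic_load_n(&r->cons_tail, __ATOMIC_ACQUIRE);

	if (tail - cons == ACQREL_SIZE)
		return -1;
	r->slots[tail & ACQREL_MASK] = obj;
	/* Release: the slot store is visible before the new prod_tail. */
	__atomic_store_n(&r->prod_tail, tail + 1, __ATOMIC_RELEASE);
	return 0;
}

/* Dequeue one pointer; returns 0 on success, -1 if the ring is empty. */
static int acqrel_sc_dequeue(struct acqrel_ring *r, void **obj)
{
	uint32_t head = __atomic_load_n(&r->cons_tail, __ATOMIC_RELAXED);
	/* Acquire pairs with the producer's release store of prod_tail. */
	uint32_t prod = __atomic_load_n(&r->prod_tail, __ATOMIC_ACQUIRE);

	if (head == prod)
		return -1;
	*obj = r->slots[head & ACQREL_MASK];
	/* Release: the slot load completes before cons_tail is published. */
	__atomic_store_n(&r->cons_tail, head + 1, __ATOMIC_RELEASE);
	return 0;
}
```

The release store on cons_tail is what expresses the load-to-store ordering debated in this thread: the producer cannot observe the slot as free until the consumer's load from it has completed.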
* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
  2016-07-15  6:29 ` Jerin Jacob
@ 2016-07-15 10:34 ` Ananyev, Konstantin
  2016-07-18  2:47 ` Jerin Jacob
  0 siblings, 1 reply; 17+ messages in thread
From: Ananyev, Konstantin @ 2016-07-15 10:34 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Kuusisaari, Juhamatti, 'dev@dpdk.org',
      Jan Viktorin (viktorin@rehivetech.com), Chao Zhu (bjzhuc@cn.ibm.com)

Hi Jerin,

> >
> > > The CPU also
> > > knows already the value that will be written to cons.tail and that
> > > value does not depend on the previous read either. The CPU does not
> > > know we are planning to do a spinlock there, so it might do things
> > > out-of-order without proper dependencies.
> > >
> > > > For __rte_ring_sc_do_dequeue(), I think you right, we might need
> > > > something stronger.
> > > > I don't want to put rte_smp_mb() here as it would cause full HW
> > > > barrier even on machines with strong memory order (IA).
> > > > I think that rte_smp_wmb() might be enough here:
> > > > it would force cpu to wait till writes in DEQUEUE_PTRS() are
> > > > become visible, which means reads have to be completed too.
> > >
> > > In practice I think that rte_smp_wmb() would work fine, even though
> > > it is not strictly according to the book. Below solution would be my
> > > proposal as a fix to the issue of sc dequeueing (and also to mc
> > > dequeueing, if we have the problem of CPU completely ignoring the
> > > spinlock in reality there):
> > >
> > > DEQUEUE_PTRS();
> > > ..
> > > rte_smp_wmb();
> > > r->cons.tail = cons_next;
> >
> > As I said in previous email - it looks good for me for
> > _rte_ring_sc_do_dequeue(), but I am interested to hear what ARM and PPC
> > maintainers think about it.
> > Jan, Jerin do you have any comments on it?
>
> Actually it is NOT performance effective and difficult to capture the ORDER
> dependency with plane store and load barriers on WEAK ordered machines.
> Beyond plane store and load barriers, We need to express #LoadLoad,
> #LoadStore,#StoreStore barrier dependency with Acquire and
> Release Semantics in Arch neutral code(Looks like this is compiler barrier on IA)
> http://preshing.com/20120913/acquire-and-release-semantics/
>
> For instance, Full barrier CAS(__sync_bool_compare_and_swap) will not be
> required for weak ordered machine in MP case.
> I can send out a RFC version of ring implementation changes required with
> acquire-and-release-semantics.
> If it has performance degradation on IA then we can separate it out through
> conditional compilation flag.
>
> GCC Built-in Functions for Memory Model Aware Atomic Operations
> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

I am not sure exactly what changes you are planning,
but I suppose I'll just wait for your RFC here.
My question, though, was: what do you think about the current __rte_ring_sc_do_dequeue()?
Do you agree that rmb() is not sufficient here, and does Juhamatti's patch:
http://dpdk.org/dev/patchwork/patch/14846/
look good to you?
It looks good to me, and I am going to ACK it, but I thought you had better
have a look too.
Thanks
Konstantin

>
> Thoughts ?
>
> Jerin
>
> > Chao, sorry but I still not sure why PPC is considered as architecture with strong memory ordering?
> > Might be I am missing something obvious here.
> > Thank
> > Konstantin
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location
  2016-07-15 10:34 ` Ananyev, Konstantin
@ 2016-07-18  2:47 ` Jerin Jacob
  0 siblings, 0 replies; 17+ messages in thread
From: Jerin Jacob @ 2016-07-18 2:47 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Kuusisaari, Juhamatti, 'dev@dpdk.org',
      Jan Viktorin (viktorin@rehivetech.com), Chao Zhu (bjzhuc@cn.ibm.com),
      jianbo.liu

On Fri, Jul 15, 2016 at 10:34:40AM +0000, Ananyev, Konstantin wrote:
> Hi Jerin,
>
> > > >
> > > > The CPU also
> > > > knows already the value that will be written to cons.tail and that
> > > > value does not depend on the previous read either. The CPU does not
> > > > know we are planning to do a spinlock there, so it might do things
> > > > out-of-order without proper dependencies.
> > > >
> > > > > For __rte_ring_sc_do_dequeue(), I think you right, we might need
> > > > > something stronger.
> > > > > I don't want to put rte_smp_mb() here as it would cause full HW
> > > > > barrier even on machines with strong memory order (IA).
> > > > > I think that rte_smp_wmb() might be enough here:
> > > > > it would force cpu to wait till writes in DEQUEUE_PTRS() are
> > > > > become visible, which means reads have to be completed too.
> > > >
> > > > In practice I think that rte_smp_wmb() would work fine, even though
> > > > it is not strictly according to the book. Below solution would be my
> > > > proposal as a fix to the issue of sc dequeueing (and also to mc
> > > > dequeueing, if we have the problem of CPU completely ignoring the
> > > > spinlock in reality there):
> > > >
> > > > DEQUEUE_PTRS();
> > > > ..
> > > > rte_smp_wmb();
> > > > r->cons.tail = cons_next;
> > >
> > > As I said in previous email - it looks good for me for
> > > _rte_ring_sc_do_dequeue(), but I am interested to hear what ARM and PPC
> > > maintainers think about it.
> > > Jan, Jerin do you have any comments on it?
> >
> > Actually it is NOT performance effective and difficult to capture the ORDER
> > dependency with plane store and load barriers on WEAK ordered machines.
> > Beyond plane store and load barriers, We need to express #LoadLoad,
> > #LoadStore,#StoreStore barrier dependency with Acquire and
> > Release Semantics in Arch neutral code(Looks like this is compiler barrier on IA)
> > http://preshing.com/20120913/acquire-and-release-semantics/
> >
> > For instance, Full barrier CAS(__sync_bool_compare_and_swap) will not be
> > required for weak ordered machine in MP case.
> > I can send out a RFC version of ring implementation changes required with
> > acquire-and-release-semantics.
> > If it has performance degradation on IA then we can separate it out through
> > conditional compilation flag.
> >
> > GCC Built-in Functions for Memory Model Aware Atomic Operations
> > https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
>
> I am not sure what exactly changes you are planning,
> but I suppose I'd just wait for your RFC here.
> Though my question was: what do you think about current _rte_ring_sc_do_dequeue()?
> Do you agree that rmb() is not sufficient here and does Juhamatti patch:
> http://dpdk.org/dev/patchwork/patch/14846/
> looks good to you?
> It looks good to me ,and I am going to ACK it, but thought you'd better
> have a look too.

I think rte_smp_rmb() is enough in this case, i.e.

DEQUEUE_PTRS(); (i.e. READ/LOAD)
rte_smp_rmb(); // DMB ISHLD
..
r->cons.tail = cons_next; (i.e WRITE/STORE)

rte_smp_rmb()/DMB ISHLD will create a barrier such that the r->cons.tail write
waits for the earlier LOADS to complete.

I think we need only a LOAD-STORE barrier here. A STORE-STORE barrier would
also work here, as DEQUEUE_PTRS has loads from ring memory and stores to local
memory. IMO, as far as the barrier is concerned, we only need to wait for the
LOADS to complete before we update r->cons.tail.
Waiting for the STORES (to local memory) to complete would be overkill here.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CHDGACJD.html
ISHLD provides a LOAD-STORE barrier and maps to rte_smp_rmb() on arm64
ISHST provides a STORE-STORE barrier and maps to rte_smp_wmb() on arm64

Jerin

> Thanks
> Konstantin
>
> >
> > Thoughts ?
> >
> > Jerin
> >
> > > Chao, sorry but I still not sure why PPC is considered as architecture with strong memory ordering?
> > > Might be I am missing something obvious here.
> > > Thank
> > > Konstantin
> > >

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2016-07-18 2:47 UTC | newest]

Thread overview: 17+ messages
2016-07-11 10:20 [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location Juhamatti Kuusisaari
2016-07-11 10:41 ` Ananyev, Konstantin
2016-07-11 11:22 ` Kuusisaari, Juhamatti
2016-07-11 11:40 ` Olivier Matz
2016-07-12  4:10 ` Kuusisaari, Juhamatti
2016-07-11 12:34 ` Ananyev, Konstantin
2016-07-12  5:27 ` Kuusisaari, Juhamatti
2016-07-12 11:01 ` Ananyev, Konstantin
2016-07-12 17:58 ` Ananyev, Konstantin
2016-07-13  5:27 ` Kuusisaari, Juhamatti
2016-07-13 13:00 ` Ananyev, Konstantin
2016-07-14  4:17 ` Kuusisaari, Juhamatti
2016-07-14 12:56 ` Ananyev, Konstantin
2016-07-15  5:40 ` Kuusisaari, Juhamatti
2016-07-15  6:29 ` Jerin Jacob
2016-07-15 10:34 ` Ananyev, Konstantin
2016-07-18  2:47 ` Jerin Jacob