* [dpdk-dev] [PATCH v1 1/2] ring: keep the deterministic order allowing retry to work
2018-11-09 11:42 [dpdk-dev] [PATCH v1 0/2] ring C11 library fix and optimization Gavin Hu
@ 2018-11-09 11:42 ` Gavin Hu
2018-11-09 11:42 ` [dpdk-dev] [PATCH v1 2/2] ring: relaxed ordering for load and store the head Gavin Hu
2018-11-13 15:57 ` [dpdk-dev] [PATCH v1 0/2] ring C11 library fix and optimization Thomas Monjalon
2 siblings, 0 replies; 4+ messages in thread
From: Gavin Hu @ 2018-11-09 11:42 UTC (permalink / raw)
To: dev
Cc: thomas, stephen, olivier.matz, chaozhu, bruce.richardson,
konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu,
stable
Use case scenario:
1) Thread 1 is enqueuing. It reads prod.head and gets stalled for some
reasons (running out of cpu time, preempted,...)
2) Thread 2 is enqueuing. It succeeds in enqueuing and moves prod.head
forward.
3) Thread 3 is dequeuing. It succeeds in dequeuing and moves the cons.tail
beyond the prod.head read by thread 1.
4) Thread 1 is re-scheduled. It reads cons.tail.
cpu1(producer) cpu2(producer) cpu3(consumer)
load r->prod.head
^ load r->prod.head
| load r->cons.tail
| store r->prod.head(+n)
stalled <-- enqueue ----->
| store r->prod.tail(+n)
| load r->cons.head
| load r->prod.tail
| store r->cons.head(+n)
| <...dequeue.....>
v store r->cons.tail(+n)
load r->cons.tail
For thread 1, the __atomic_compare_exchange_n detects the outdated
prod.head and retry the flow with the new one. This retry flow works ok on
strong ordering platform(eg:x86). But for weak ordering platforms(arm,
ppc), loading cons.tail and prod.head might be re-ordered, prod.head is new
but cons.tail becomes too old, the retry flow, based on the detection of
outdated head, does not trigger as expected, thus the outdate cons.tail
causes wrong free_entries.
Similarly, for dequeuing, outdated prod.tail leads to wrong avail_entries.
The fix is to keep the deterministic order of two loads allowing the retry
to work.
Run the ring perf test on the following testbed:
HW: ThunderX2 B0 CPU CN9975 v2.0, 2 sockets, 28core, 4 threads/core, 2.5GHz
OS: Ubuntu 16.04.5 LTS, Kernel: 4.15.0-36-generic
DPDK: 18.08, Configuration: arm64-armv8a-linuxapp-gcc
gcc: 8.1.0
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 \
--socket-mem=1024 -- -i
Without the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.64
MP/MC bulk enq/dequeue (size: 8): 9.58
SP/SC bulk enq/dequeue (size: 32): 1.98
MP/MC bulk enq/dequeue (size: 32): 2.30
With the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.75
MP/MC bulk enq/dequeue (size: 8): 10.18
SP/SC bulk enq/dequeue (size: 32): 1.80
MP/MC bulk enq/dequeue (size: 32): 2.34
The results showed the thread fence degrade the performance slightly, but
it is required for correctness.
Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
---
lib/librte_ring/rte_ring_c11_mem.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 7bc74a4..dc49a99 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -66,6 +66,9 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
/* Reset n to the initial burst count */
n = max;
+ /* Ensure the head is read before tail */
+ __atomic_thread_fence(__ATOMIC_ACQUIRE);
+
/* load-acquire synchronize with store-release of ht->tail
* in update_tail.
*/
@@ -139,6 +142,9 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
/* Restore n as it may change every loop */
n = max;
+ /* Ensure the head is read before tail */
+ __atomic_thread_fence(__ATOMIC_ACQUIRE);
+
/* this load-acquire synchronize with store-release of ht->tail
* in update_tail.
*/
--
2.7.4
^ permalink raw reply [flat|nested] 4+ messages in thread
* [dpdk-dev] [PATCH v1 2/2] ring: relaxed ordering for load and store the head
2018-11-09 11:42 [dpdk-dev] [PATCH v1 0/2] ring C11 library fix and optimization Gavin Hu
2018-11-09 11:42 ` [dpdk-dev] [PATCH v1 1/2] ring: keep the deterministic order allowing retry to work Gavin Hu
@ 2018-11-09 11:42 ` Gavin Hu
2018-11-13 15:57 ` [dpdk-dev] [PATCH v1 0/2] ring C11 library fix and optimization Thomas Monjalon
2 siblings, 0 replies; 4+ messages in thread
From: Gavin Hu @ 2018-11-09 11:42 UTC (permalink / raw)
To: dev
Cc: thomas, stephen, olivier.matz, chaozhu, bruce.richardson,
konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu,
stable
When calling __atomic_compare_exchange_n, use relaxed ordering for the
success case, as multiple producers/consumers do not release updates to
each other so no need for acquire or release ordering.
Because the thread fence in place, ordering for the first iteration can be
relaxed.
Run the ring perf test on the following testbed:
HW: ThunderX2 B0 CPU CN9975 v2.0, 2 sockets, 28core,4 threads/core,2.5GHz
OS: Ubuntu 16.04.5 LTS, Kernel: 4.15.0-36-generic
DPDK: 18.08, Configuration: arm64-armv8a-linuxapp-gcc
gcc: 8.1.0
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 \
--socket-mem=1024 -- -i
Without the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.75
MP/MC bulk enq/dequeue (size: 8): 10.18
SP/SC bulk enq/dequeue (size: 32): 1.80
MP/MC bulk enq/dequeue (size: 32): 2.34
With the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.59
MP/MC bulk enq/dequeue (size: 8): 10.54
SP/SC bulk enq/dequeue (size: 32): 1.73
MP/MC bulk enq/dequeue (size: 32): 2.38
No significant improvement, nor regression was seen,as the optimisation is
not at the critical path.
Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
---
lib/librte_ring/rte_ring_c11_mem.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index dc49a99..0fb73a3 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -61,7 +61,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
unsigned int max = n;
int success;
- *old_head = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
+ *old_head = __atomic_load_n(&r->prod.head, __ATOMIC_RELAXED);
do {
/* Reset n to the initial burst count */
n = max;
@@ -97,7 +97,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
/* on failure, *old_head is updated */
success = __atomic_compare_exchange_n(&r->prod.head,
old_head, *new_head,
- 0, __ATOMIC_ACQUIRE,
+ 0, __ATOMIC_RELAXED,
__ATOMIC_RELAXED);
} while (unlikely(success == 0));
return n;
@@ -137,7 +137,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
int success;
/* move cons.head atomically */
- *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
+ *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);
do {
/* Restore n as it may change every loop */
n = max;
@@ -172,7 +172,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
/* on failure, *old_head will be updated */
success = __atomic_compare_exchange_n(&r->cons.head,
old_head, *new_head,
- 0, __ATOMIC_ACQUIRE,
+ 0, __ATOMIC_RELAXED,
__ATOMIC_RELAXED);
} while (unlikely(success == 0));
return n;
--
2.7.4
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [dpdk-dev] [PATCH v1 0/2] ring C11 library fix and optimization
2018-11-09 11:42 [dpdk-dev] [PATCH v1 0/2] ring C11 library fix and optimization Gavin Hu
2018-11-09 11:42 ` [dpdk-dev] [PATCH v1 1/2] ring: keep the deterministic order allowing retry to work Gavin Hu
2018-11-09 11:42 ` [dpdk-dev] [PATCH v1 2/2] ring: relaxed ordering for load and store the head Gavin Hu
@ 2018-11-13 15:57 ` Thomas Monjalon
2 siblings, 0 replies; 4+ messages in thread
From: Thomas Monjalon @ 2018-11-13 15:57 UTC (permalink / raw)
To: Gavin Hu
Cc: dev, stephen, olivier.matz, chaozhu, bruce.richardson,
konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli
09/11/2018 12:42, Gavin Hu:
> V1:
> Update the ring C11 library including the following changes:
> 1) On relaxed ordering platforms(like Arm64,PPPC), in ring C11 implementation,
> loading head and tail might be reodered, this makes CAS(compare and retry
> the flow if the head is outdated) not working as expected, the fix is to
> ensure the head is read before the tail, leaving no chances of the combination
> of outdated head and new tail.
> 2) With the above memory fence introduced, some loading can be relaxed.
>
> Gavin Hu (2):
> ring: keep the deterministic order allowing retry to work
> ring: relaxed ordering for load and store the head
No comment, so I guess it can enter in 18.11-rc3.
Applied, thanks
^ permalink raw reply [flat|nested] 4+ messages in thread