[RFC] ring: remove unnecessary fences in C11 ring for performance

DPDK patches and discussions
 help / color / mirror / Atom feed

* [RFC] ring: remove unnecessary fences in C11 ring for performance
@ 2023-06-12 19:47 Wathsala Vithanage
  2023-06-12 19:47 ` Wathsala Vithanage
  0 siblings, 1 reply; 2+ messages in thread
From: Wathsala Vithanage @ 2023-06-12 19:47 UTC (permalink / raw)
  To: honnappa.nagarahalli, konstantin.v.ananyev, thomas, ruifeng.wang
  Cc: dev, nd, Wathsala Vithanage

ACQUIRE fence between the two loads established program order to
prevent use of an outdated consumer or producer tail values in the
computation of free_entries and entries in the two functions
respectively. However, (A) the problem described above cannot be solved
by establishing the program order by using a fence, and (B) the
resulting values for free_entries and entries are safe for use even
with an outdated head value.

(A) First example below shows an example where load of the old_head
and the cons_tail is reordered resulting in a free_entries computation
that involves an outdated consumer tail. The second is a counter
example where the same outdated consumer tail is used to compute the
free_entries even with an ACQUIRE fence.

Initial values for both (i) example, and (ii) counter example are
capacity = 32
r->cons.tail = 5
r->cons.head = 5
r->prod.head = 10
r-prod.tail = 10

Assume CPU2 adds 2 elements to the ring and CPU3 removes 5 elements
from the ring.

(i) Example: The two loads are reversed. Program order violated.
---------------------------------------------------------------------
   cpu1(producer)        cpu2(producer)          cpu3(consumer)
load r->cons.tail=5
      ^
      |
      |
      |              load r->prod.head=10
      |              load r->cons.tail=5
      |            store r->prod.head(+n=2)=12
   stalled             <-- enqueue ----->
      |            store r->prod.tail(+n)=12
      |
      |
      |                                      load r->cons.head  = 5
      |                                      load r->prod.tail  = 12
      |                                   store r->cons.head(+n=5)=10
      |                                         <...dequeue.....>
      |                                     store r->cons.tail(+n)=10
      |
      v
load r->prod.head=12

Above shows that the value read from r->cons.tail to cons_tail in cpu1
is 5 which is outdated because r->cons.tail was later updated by cpu3.
In this instance the state is...
CPU1: cons_tail=5, old_head=12, free_entries=(32+5-12)=25
RING: r->cons.tail=10, r->prod.head=12

(ii) Example: ACQUIRE fence between the loads. Program order preserved.
-----------------------------------------------------------------------
   cpu1(producer)        cpu2(producer)          cpu3(consumer)
      |
      |
      |              load r->prod.head=10
      |              load r->cons.tail=5
      |            store r->prod.head(+n=2)=12
      |               <-- enqueue ----->
      |            store r->prod.tail(+n)=12
      |
      |
load r->prod.head=12
      ^
      |
      |                                      load r->cons.head  = 5
      |                                      load r->prod.tail  = 12
   stalled                                 store r->cons.head(+n=5)=10
      |                                        <...dequeue.....>
      |                                               ^
      |                                               |
      v                                               |
load r->cons.tail=5                                stalled
                                                      |
                                                      |
                                                      v
                                           store r->cons.tail(+n)= 10

Above shows that the value read from r->cons.tail to cons_tail in cpu1
is 5 which is outdated because r->cons.tail was later updated by cpu3.
In this instance with the ACQUIRE fence the state is...
CPU1: cons_tail=5, old_head=12, free_entries=(32+5-12)=25
RING: r->cons.tail=10, r->prod.head=12

Example ii even with an ACQUIRE fence to prevent reordering of the
loads of old_head and cons_tail produces the same result for the
free_entries variable, and the system ends up in the same state up
to that point as example i. Same can be shown for the consumer threads
as well. Therefore, the ACQUIRE fences do not solve the problem they
intended to solve.
If each thread is required to use the latest r->cons.head,r->cons.tail,
r->prod.head, and r->prod.tail values in their computations of
free_entries and entries, there should be a real-time ordering of all
the threads. Establishing program order in a single thread only provides
guarantees of sequential-consistency which does not include real-time
ordering of all threads.

(B) The values computed for free_entries, and entries are always safe
without the ACQUIRE fence, because the resulting values are equal or
less than the number of actual free solts or elements in the ring.
This can be seen in both example i and ii, where free_entries
computation ended up with 25, even when ring has 30 free_entries.
Enqueueing 25 elements into a ring that has space for 30 is safe.
Same can be shown for consumer threads as well. Therefore, the ACQUIRE
fence is not required for safety reasons.

Due to (A) and (B) above the ACQUIRE fence between the load of the head
and load of the tail in both consumer and producer head updates can be
removed to improve the performance of the ring library.

Performance on Arm N1
Gain relative to generic implementation
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 8 (Arm N1)			    |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 766730 | Total count: 651686  | Total count: 728546  |
|                     |        Gain: -15%    |        Gain: -4%     |
+-------------------------------------------------------------------+
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 32 (Arm N1)                        |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 816745 | Total count: 646385  | Total count: 784019  |
|                     |        Gain: -21%    |        Gain: -4%     |
+-------------------------------------------------------------------+

Performance on x86-64 Cascade Lake
Gain relative to generic implementation
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 8                                  |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 165786 | Total count: 168971  | Total count: 168811  |
|                     |        Gain: 1.9%    |        Gain: %1.8    |
+-------------------------------------------------------------------+
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 32                                 |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 150729 | Total count: 153162  | Total count: 150303  |
|                     |        Gain: 1.6%    |        Gain: -0.2%   |
+-------------------------------------------------------------------+

Wathsala Vithanage (1):
  ring: improving performance with C11

 .mailmap                    | 1 +
 lib/ring/rte_ring_c11_pvt.h | 6 ------
 2 files changed, 1 insertion(+), 6 deletions(-)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 2+ messages in thread

* [RFC] ring: remove unnecessary fences in C11 ring for performance
  2023-06-12 19:47 [RFC] ring: remove unnecessary fences in C11 ring for performance Wathsala Vithanage
@ 2023-06-12 19:47 ` Wathsala Vithanage
  0 siblings, 0 replies; 2+ messages in thread
From: Wathsala Vithanage @ 2023-06-12 19:47 UTC (permalink / raw)
  To: honnappa.nagarahalli, konstantin.v.ananyev, thomas, ruifeng.wang
  Cc: dev, nd, Wathsala Vithanage

For improved performance over the current C11 based ring
implementation, two ACQUIRE fences between the load of the old_head
and the load of the const_tail in __rte_ring_move_prod_head and the
load of old_head and load of the prod_tail in
__rte_ring_move_cons_head functions have been removed. These two
fences are not required for the safety of the ring library.

Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 .mailmap                    | 1 +
 lib/ring/rte_ring_c11_pvt.h | 6 ------
 2 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/.mailmap b/.mailmap
index 4018f0fc47..367115d134 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1430,6 +1430,7 @@ Walter Heymans <walter.heymans@corigine.com>
 Wang Sheng-Hui <shhuiw@gmail.com>
 Wangyu (Eric) <seven.wangyu@huawei.com>
 Waterman Cao <waterman.cao@intel.com>
+Wathsala Vithanage <wathsala.vithanage@arm.com>
 Weichun Chen <weichunx.chen@intel.com>
 Wei Dai <wei.dai@intel.com>
 Weifeng Li <liweifeng96@126.com>
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index f895950df4..eb59c61b6a 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -66,9 +66,6 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		/* Reset n to the initial burst count */
 		n = max;
 
-		/* Ensure the head is read before tail */
-		__atomic_thread_fence(__ATOMIC_ACQUIRE);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -142,9 +139,6 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		/* Restore n as it may change every loop */
 		n = max;
 
-		/* Ensure the head is read before tail */
-		__atomic_thread_fence(__ATOMIC_ACQUIRE);
-
 		/* this load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2023-06-12 19:48 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-12 19:47 [RFC] ring: remove unnecessary fences in C11 ring for performance Wathsala Vithanage
2023-06-12 19:47 ` Wathsala Vithanage

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).