DPDK patches and discussions
* [RFC] ring: Further performance improvements with C11
@ 2023-06-15 20:13 Wathsala Vithanage
  2023-06-15 20:13 ` [RFC] ring: further " Wathsala Vithanage
  0 siblings, 1 reply; 9+ messages in thread
From: Wathsala Vithanage @ 2023-06-15 20:13 UTC (permalink / raw)
  To: honnappa.nagarahalli, konstantin.v.ananyev, thomas, ruifeng.wang
  Cc: dev, nd, Wathsala Vithanage



(1) Replace the tail store that uses RELEASE semantics with a RELEASE
fence, and replace the loads of the tail that use ACQUIRE semantics in
the producer/consumer head updates with ACQUIRE fences.
------------------------------------------------------------------------
Use __ATOMIC_ACQUIRE fences between the producer/consumer head updates
and the ring element accesses. A CAS instruction always loads the memory
location it is supposed to conditionally update. An ACQUIRE fence between
the CAS (a LOAD) and the STOREs to the ring elements that follow prevents
reordering of these operations in a producer thread.
Similarly, it prevents reordering of the CAS and the LOADs from the ring
elements that follow in a consumer thread.

An __ATOMIC_RELEASE fence in __rte_ring_update_tail prevents reordering
of the tail update (a STORE) with the LOADs from the ring elements in
consumer threads. It also prevents reordering of the tail update with the
STOREs to the ring in producer threads.

This pairing can improve performance by eliminating the loading of the
tail with ACQUIRE semantics in the consumer/producer head update loop.
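
In code terms, the pairing described above boils down to the following
(condensed from the patch that follows in this thread; only the affected
lines are shown):

	/* __rte_ring_update_tail: the RELEASE store becomes a RELEASE
	 * fence plus a RELAXED store */
	rte_atomic_thread_fence(__ATOMIC_RELEASE);
	...
	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED);

	/* __rte_ring_move_prod_head: the ACQUIRE load of the opposing
	 * tail becomes a RELAXED load, with an ACQUIRE fence placed
	 * after the head-update CAS loop instead */
	cons_tail = __atomic_load_n(&r->cons.tail, __ATOMIC_RELAXED);
	...
	} while (unlikely(success == 0));
	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);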

(2) Removal of the ACQUIRE fence between the load of old_head and the
load of cons_tail in the __rte_ring_move_prod_head and
__rte_ring_move_cons_head functions.
-----------------------------------------------------------------------

The ACQUIRE fence between the two loads establishes program order to
prevent the use of an outdated consumer or producer tail value in the
computation of free_entries and entries in the two functions
respectively. However, (A) the problem described above cannot be solved
by establishing program order with a fence, and (B) the resulting
values for free_entries and entries are safe to use even with an
outdated tail value.

(A) The first example below shows a case where the loads of old_head
and cons_tail are reordered, resulting in a free_entries computation
that involves an outdated consumer tail. The second is a counterexample
where the same outdated consumer tail is used to compute free_entries
even with an ACQUIRE fence in place.

The initial values for both (i) the example and (ii) the counterexample
are:
capacity = 32
r->cons.tail = 5
r->cons.head = 5
r->prod.head = 10
r->prod.tail = 10

Assume CPU2 adds 2 elements to the ring and CPU3 removes 5 elements
from the ring.

(i) Example: The two loads are reversed. Program order violated.
---------------------------------------------------------------------
   cpu1(producer)        cpu2(producer)          cpu3(consumer)
load r->cons.tail=5
      ^
      |
      |
      |              load r->prod.head=10
      |              load r->cons.tail=5
      |            store r->prod.head(+n=2)=12
   stalled             <-- enqueue ----->
      |            store r->prod.tail(+n)=12
      |
      |
      |                                      load r->cons.head  = 5
      |                                      load r->prod.tail  = 12
      |                                   store r->cons.head(+n=5)=10
      |                                         <...dequeue.....>
      |                                     store r->cons.tail(+n)=10
      |
      v
load r->prod.head=12

The above shows that the value read from r->cons.tail into cons_tail in
cpu1 is 5, which is outdated because r->cons.tail was later updated by
cpu3. In this instance the state is:
CPU1: cons_tail=5, old_head=12, free_entries=(32+5-12)=25
RING: r->cons.tail=10, r->prod.head=12

(ii) Example: ACQUIRE fence between the loads. Program order preserved.
-----------------------------------------------------------------------
   cpu1(producer)        cpu2(producer)          cpu3(consumer)
      |
      |
      |              load r->prod.head=10
      |              load r->cons.tail=5
      |            store r->prod.head(+n=2)=12
      |               <-- enqueue ----->
      |            store r->prod.tail(+n)=12
      |
      |
load r->prod.head=12
      ^
      |
      |                                      load r->cons.head  = 5
      |                                      load r->prod.tail  = 12
   stalled                                 store r->cons.head(+n=5)=10
      |                                        <...dequeue.....>
      |                                               ^
      |                                               |
      v                                               |
load r->cons.tail=5                                stalled
                                                      |
                                                      |
                                                      v
                                           store r->cons.tail(+n)= 10

The above shows that the value read from r->cons.tail into cons_tail in
cpu1 is 5, which is outdated because r->cons.tail was later updated by
cpu3. In this instance, even with the ACQUIRE fence, the state is:
CPU1: cons_tail=5, old_head=12, free_entries=(32+5-12)=25
RING: r->cons.tail=10, r->prod.head=12

Example (ii), even with an ACQUIRE fence to prevent reordering of the
loads of old_head and cons_tail, produces the same result for the
free_entries variable, and the system ends up in the same state up
to that point as example (i). The same can be shown for consumer threads
as well. Therefore, the ACQUIRE fences do not solve the problem they
are intended to solve.
If each thread were required to use the latest r->cons.head, r->cons.tail,
r->prod.head, and r->prod.tail values in its computation of
free_entries and entries, there would have to be a real-time ordering of
all the threads. Establishing program order within a single thread only
provides sequential consistency, which does not include real-time
ordering across threads.

(B) The values computed for free_entries and entries are always safe to
use without the ACQUIRE fence, because the resulting values are equal to
or less than the actual number of free slots or elements in the ring.
This can be seen in both examples (i) and (ii), where the free_entries
computation ends up with 25 even though the ring has 30 free slots.
Enqueueing 25 elements into a ring that has space for 30 is safe.
The same can be shown for consumer threads as well. Therefore, the
ACQUIRE fence is not required for safety.
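
As a plain-C restatement of the arithmetic in (i)/(ii) (values taken
from the examples above, for illustration only):

	uint32_t capacity  = 32;
	uint32_t cons_tail = 5;   /* outdated; the current value is 10 */
	uint32_t old_head  = 12;

	uint32_t free_entries = capacity + cons_tail - old_head; /* 25 */
	/* the ring actually has 32 + 10 - 12 = 30 free slots, so an
	 * enqueue bounded by 25 cannot overwrite unconsumed entries */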

Due to (A) and (B) above, the ACQUIRE fence between the load of the head
and the load of the tail in both the consumer and producer head updates
can be removed to improve the performance of the ring library.

Performance on Arm N1
Gain relative to generic implementation
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 8 (Arm N1)                         |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 766730 | Total count: 651686  | Total count: 959621  |
|                     |        Gain: -15%    |        Gain: 25%     |
+-------------------------------------------------------------------+
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 32 (Arm N1)                        |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 816745 | Total count: 646385  | Total count: 766730  |
|                     |        Gain: -21%    |        Gain: -6%     |
+-------------------------------------------------------------------+

Performance on x86-64 Cascade Lake
Gain relative to generic implementation
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 8                                  |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 165042 | Total count: 168971  | Total count: 168089  |
|                     |        Gain: 2.3%    |        Gain: 1.8%    |
+-------------------------------------------------------------------+
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 32                                 |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 150884 | Total count: 153162  | Total count: 159661  |
|                     |        Gain: 1.5%    |        Gain: 5.8%    |
+-------------------------------------------------------------------+

Wathsala Vithanage (1):
  ring: improving performance with C11

 .mailmap                    |  1 +
 lib/ring/rte_ring_c11_pvt.h | 35 ++++++++++++++++++++---------------
 2 files changed, 21 insertions(+), 15 deletions(-)

-- 
2.25.1



* [RFC] ring: further performance improvements with C11
  2023-06-15 20:13 [RFC] ring: Further performance improvements with C11 Wathsala Vithanage
@ 2023-06-15 20:13 ` Wathsala Vithanage
  2023-07-31 12:31   ` Thomas Monjalon
  2023-08-02  9:42   ` Konstantin Ananyev
  0 siblings, 2 replies; 9+ messages in thread
From: Wathsala Vithanage @ 2023-06-15 20:13 UTC (permalink / raw)
  To: honnappa.nagarahalli, konstantin.v.ananyev, thomas, ruifeng.wang
  Cc: dev, nd, Wathsala Vithanage

For improved performance over the current C11 based ring implementation
the following changes were made.
(1) Replace tail store with RELEASE semantics in __rte_ring_update_tail
with a RELEASE fence. Replace load of the tail with ACQUIRE semantics 
in __rte_ring_move_prod_head and __rte_ring_move_cons_head with ACQUIRE
fences.
(2) Remove ACQUIRE fences between load of the old_head and load of the
cons_tail in __rte_ring_move_prod_head and __rte_ring_move_cons_head.
These two fences are not required for the safety of the ring library.

Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 .mailmap                    |  1 +
 lib/ring/rte_ring_c11_pvt.h | 35 ++++++++++++++++++++---------------
 2 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/.mailmap b/.mailmap
index 4018f0fc47..367115d134 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1430,6 +1430,7 @@ Walter Heymans <walter.heymans@corigine.com>
 Wang Sheng-Hui <shhuiw@gmail.com>
 Wangyu (Eric) <seven.wangyu@huawei.com>
 Waterman Cao <waterman.cao@intel.com>
+Wathsala Vithanage <wathsala.vithanage@arm.com>
 Weichun Chen <weichunx.chen@intel.com>
 Wei Dai <wei.dai@intel.com>
 Weifeng Li <liweifeng96@126.com>
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index f895950df4..63fe58ce9e 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
 		uint32_t new_val, uint32_t single, uint32_t enqueue)
 {
 	RTE_SET_USED(enqueue);
+	/*
+	 * Updating of ht->tail cannot happen before elements are added to or
+	 * removed from the ring, as it could result in data races between
+	 * producer and consumer threads. Therefore we need a release
+	 * barrier here.
+	 */
+	rte_atomic_thread_fence(__ATOMIC_RELEASE);
 
 	/*
 	 * If there are other enqueues/dequeues in progress that preceded us,
@@ -24,7 +31,7 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
 	if (!single)
 		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
 
-	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
+	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED);
 }
 
 /**
@@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		/* Reset n to the initial burst count */
 		n = max;
 
-		/* Ensure the head is read before tail */
-		__atomic_thread_fence(__ATOMIC_ACQUIRE);
-
-		/* load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
-		 */
 		cons_tail = __atomic_load_n(&r->cons.tail,
-					__ATOMIC_ACQUIRE);
+					__ATOMIC_RELAXED);
 
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
@@ -100,6 +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 					0, __ATOMIC_RELAXED,
 					__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
+	/*
+	 * Ensure that updates to the ring doesn't rise above
+	 * load of the new_head in SP and MP cases.
+	 */
+	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
 	return n;
 }
 
@@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		/* Restore n as it may change every loop */
 		n = max;
 
-		/* Ensure the head is read before tail */
-		__atomic_thread_fence(__ATOMIC_ACQUIRE);
-
-		/* this load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
-		 */
 		prod_tail = __atomic_load_n(&r->prod.tail,
-					__ATOMIC_ACQUIRE);
+					__ATOMIC_RELAXED);
 
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
@@ -175,6 +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 							0, __ATOMIC_RELAXED,
 							__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
+	/*
+	 * Ensure that updates to the ring doesn't rise above
+	 * load of the new_head in SP and MP cases.
+	 */
+	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
 	return n;
 }
 
-- 
2.25.1



* Re: [RFC] ring: further performance improvements with C11
  2023-06-15 20:13 ` [RFC] ring: further " Wathsala Vithanage
@ 2023-07-31 12:31   ` Thomas Monjalon
  2023-08-03  2:56     ` Honnappa Nagarahalli
  2023-08-02  9:42   ` Konstantin Ananyev
  1 sibling, 1 reply; 9+ messages in thread
From: Thomas Monjalon @ 2023-07-31 12:31 UTC (permalink / raw)
  To: Wathsala Vithanage
  Cc: honnappa.nagarahalli, konstantin.v.ananyev, ruifeng.wang, dev,
	nd, stephen, jerinj, Morten Brørup, Tyler Retzlaff

15/06/2023 22:13, Wathsala Vithanage:
> For improved performance over the current C11 based ring implementation
> following changes were made.
> (1) Replace tail store with RELEASE semantics in __rte_ring_update_tail
> with a RELEASE fence. Replace load of the tail with ACQUIRE semantics 
> in __rte_ring_move_prod_head and __rte_ring_move_cons_head with ACQUIRE
> fences.
> (2) Remove ACQUIRE fences between load of the old_head and load of the
> cons_tail in __rte_ring_move_prod_head and __rte_ring_move_cons_head.
> These two fences are not required for the safety of the ring library.
> 
> Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>

Are we waiting for more reviews?




* RE: [RFC] ring: further performance improvements with C11
  2023-06-15 20:13 ` [RFC] ring: further " Wathsala Vithanage
  2023-07-31 12:31   ` Thomas Monjalon
@ 2023-08-02  9:42   ` Konstantin Ananyev
  2023-08-04 22:50     ` Wathsala Wathawana Vithanage
  1 sibling, 1 reply; 9+ messages in thread
From: Konstantin Ananyev @ 2023-08-02  9:42 UTC (permalink / raw)
  To: Wathsala Vithanage, honnappa.nagarahalli, konstantin.v.ananyev,
	thomas, ruifeng.wang
  Cc: dev, nd, justin.he



> For improved performance over the current C11 based ring implementation
> following changes were made.
> (1) Replace tail store with RELEASE semantics in __rte_ring_update_tail
> with a RELEASE fence. Replace load of the tail with ACQUIRE semantics
> in __rte_ring_move_prod_head and __rte_ring_move_cons_head with ACQUIRE
> fences.
> (2) Remove ACQUIRE fences between load of the old_head and load of the
> cons_tail in __rte_ring_move_prod_head and __rte_ring_move_cons_head.
> These two fences are not required for the safety of the ring library.

Hmm... with these changes, aren't we re-introducing the old bug fixed by
this commit:

commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654
Author: Jia He <justin.he@arm.com>
Date:   Fri Nov 10 03:30:42 2017 +0000

    ring: guarantee load/load order in enqueue and dequeue

    We watched a rte panic of mbuf_autotest in our qualcomm arm64 server
    (Amberwing).

    Root cause:
    In __rte_ring_move_cons_head()
    ...
            do {
                    /* Restore n as it may change every loop */
                    n = max;

                    *old_head = r->cons.head;                //1st load
                    const uint32_t prod_tail = r->prod.tail; //2nd load

    In weak memory order architectures (powerpc,arm), the 2nd load might be
    reodered before the 1st load, that makes *entries is bigger than we wanted.
    This nasty reording messed enque/deque up. 
    ....
?

> 
> Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  .mailmap                    |  1 +
>  lib/ring/rte_ring_c11_pvt.h | 35 ++++++++++++++++++++---------------
>  2 files changed, 21 insertions(+), 15 deletions(-)
> 
> diff --git a/.mailmap b/.mailmap
> index 4018f0fc47..367115d134 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -1430,6 +1430,7 @@ Walter Heymans <walter.heymans@corigine.com>
>  Wang Sheng-Hui <shhuiw@gmail.com>
>  Wangyu (Eric) <seven.wangyu@huawei.com>
>  Waterman Cao <waterman.cao@intel.com>
> +Wathsala Vithanage <wathsala.vithanage@arm.com>
>  Weichun Chen <weichunx.chen@intel.com>
>  Wei Dai <wei.dai@intel.com>
>  Weifeng Li <liweifeng96@126.com>
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index f895950df4..63fe58ce9e 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
>  		uint32_t new_val, uint32_t single, uint32_t enqueue)
>  {
>  	RTE_SET_USED(enqueue);
> +	/*
> +	 * Updating of ht->tail cannot happen before elements are added to or
> +	 * removed from the ring, as it could result in data races between
> +	 * producer and consumer threads. Therefore we need a release
> +	 * barrier here.
> +	 */
> +	rte_atomic_thread_fence(__ATOMIC_RELEASE);
> 
>  	/*
>  	 * If there are other enqueues/dequeues in progress that preceded us,
> @@ -24,7 +31,7 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
>  	if (!single)
>  		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
> 
> -	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> +	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED);
>  }
> 
>  /**
> @@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  		/* Reset n to the initial burst count */
>  		n = max;
> 
> -		/* Ensure the head is read before tail */
> -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> -
> -		/* load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		cons_tail = __atomic_load_n(&r->cons.tail,
> -					__ATOMIC_ACQUIRE);
> +					__ATOMIC_RELAXED);
> 
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
> @@ -100,6 +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  					0, __ATOMIC_RELAXED,
>  					__ATOMIC_RELAXED);
>  	} while (unlikely(success == 0));
> +	/*
> +	 * Ensure that updates to the ring doesn't rise above
> +	 * load of the new_head in SP and MP cases.
> +	 */
> +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
>  	return n;
>  }
> 
> @@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  		/* Restore n as it may change every loop */
>  		n = max;
> 
> -		/* Ensure the head is read before tail */
> -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> -
> -		/* this load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		prod_tail = __atomic_load_n(&r->prod.tail,
> -					__ATOMIC_ACQUIRE);
> +					__ATOMIC_RELAXED);
> 
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
> @@ -175,6 +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  							0, __ATOMIC_RELAXED,
>  							__ATOMIC_RELAXED);
>  	} while (unlikely(success == 0));
> +	/*
> +	 * Ensure that updates to the ring doesn't rise above
> +	 * load of the new_head in SP and MP cases.
> +	 */
> +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
>  	return n;
>  }
> 
> --
> 2.25.1
> 



* RE: [RFC] ring: further performance improvements with C11
  2023-07-31 12:31   ` Thomas Monjalon
@ 2023-08-03  2:56     ` Honnappa Nagarahalli
  0 siblings, 0 replies; 9+ messages in thread
From: Honnappa Nagarahalli @ 2023-08-03  2:56 UTC (permalink / raw)
  To: thomas, Wathsala Wathawana Vithanage
  Cc: konstantin.v.ananyev, Ruifeng Wang, dev, nd, stephen, jerinj,
	Morten Brørup, Tyler Retzlaff, nd



> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Monday, July 31, 2023 7:31 AM
> To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> konstantin.v.ananyev@yandex.ru; Ruifeng Wang <Ruifeng.Wang@arm.com>;
> dev@dpdk.org; nd <nd@arm.com>; stephen@networkplumber.org;
> jerinj@marvell.com; Morten Brørup <mb@smartsharesystems.com>; Tyler
> Retzlaff <roretzla@linux.microsoft.com>
> Subject: Re: [RFC] ring: further performance improvements with C11
> 
> 15/06/2023 22:13, Wathsala Vithanage:
> > For improved performance over the current C11 based ring
> > implementation following changes were made.
> > (1) Replace tail store with RELEASE semantics in
> > __rte_ring_update_tail with a RELEASE fence. Replace load of the tail
> > with ACQUIRE semantics in __rte_ring_move_prod_head and
> > __rte_ring_move_cons_head with ACQUIRE fences.
> > (2) Remove ACQUIRE fences between load of the old_head and load of the
> > cons_tail in __rte_ring_move_prod_head and __rte_ring_move_cons_head.
> > These two fences are not required for the safety of the ring library.
> >
> > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> 
> Are we waiting for more reviews?
We do not have a good solution. We should discuss this in the Techboard meeting.

> 



* RE: [RFC] ring: further performance improvements with C11
  2023-08-02  9:42   ` Konstantin Ananyev
@ 2023-08-04 22:50     ` Wathsala Wathawana Vithanage
  2023-08-09 18:18       ` Konstantin Ananyev
  0 siblings, 1 reply; 9+ messages in thread
From: Wathsala Wathawana Vithanage @ 2023-08-04 22:50 UTC (permalink / raw)
  To: Konstantin Ananyev, Honnappa Nagarahalli, konstantin.v.ananyev,
	thomas, Ruifeng Wang
  Cc: dev, nd, Justin He, nd



> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Wednesday, August 2, 2023 4:43 AM
> To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> konstantin.v.ananyev@yandex.ru; thomas@monjalon.net; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Justin He <Justin.He@arm.com>
> Subject: RE: [RFC] ring: further performance improvements with C11
> 
> 
> 
> > For improved performance over the current C11 based ring
> > implementation following changes were made.
> > (1) Replace tail store with RELEASE semantics in
> > __rte_ring_update_tail with a RELEASE fence. Replace load of the tail
> > with ACQUIRE semantics in __rte_ring_move_prod_head and
> > __rte_ring_move_cons_head with ACQUIRE fences.
> > (2) Remove ACQUIRE fences between load of the old_head and load of the
> > cons_tail in __rte_ring_move_prod_head and __rte_ring_move_cons_head.
> > These two fences are not required for the safety of the ring library.
> 
> Hmm... with these changes, aren't we re-introducing the old bug fixed by this
> commit:

The cover letter explains why this barrier does not solve what it intends
to solve and why it should not matter:
https://mails.dpdk.org/archives/dev/2023-June/270874.html

> 
> commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654
> Author: Jia He <justin.he@arm.com>
> Date:   Fri Nov 10 03:30:42 2017 +0000
> 
>     ring: guarantee load/load order in enqueue and dequeue
> 
>     We watched a rte panic of mbuf_autotest in our qualcomm arm64 server
>     (Amberwing).
> 
>     Root cause:
>     In __rte_ring_move_cons_head()
>     ...
>             do {
>                     /* Restore n as it may change every loop */
>                     n = max;
> 
>                     *old_head = r->cons.head;                //1st load
>                     const uint32_t prod_tail = r->prod.tail; //2nd load
> 
>     In weak memory order architectures (powerpc,arm), the 2nd load might be
>     reodered before the 1st load, that makes *entries is bigger than we wanted.
>     This nasty reording messed enque/deque up.
>     ....
> ?
> 
> >
> > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> >  .mailmap                    |  1 +
> >  lib/ring/rte_ring_c11_pvt.h | 35 ++++++++++++++++++++---------------
> >  2 files changed, 21 insertions(+), 15 deletions(-)
> >
> > diff --git a/.mailmap b/.mailmap
> > index 4018f0fc47..367115d134 100644
> > --- a/.mailmap
> > +++ b/.mailmap
> > @@ -1430,6 +1430,7 @@ Walter Heymans
> <walter.heymans@corigine.com>
> > Wang Sheng-Hui <shhuiw@gmail.com>  Wangyu (Eric)
> > <seven.wangyu@huawei.com>  Waterman Cao <waterman.cao@intel.com>
> > +Wathsala Vithanage <wathsala.vithanage@arm.com>
> >  Weichun Chen <weichunx.chen@intel.com>  Wei Dai <wei.dai@intel.com>
> > Weifeng Li <liweifeng96@126.com> diff --git
> > a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h index
> > f895950df4..63fe58ce9e 100644
> > --- a/lib/ring/rte_ring_c11_pvt.h
> > +++ b/lib/ring/rte_ring_c11_pvt.h
> > @@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht,
> uint32_t old_val,
> >  		uint32_t new_val, uint32_t single, uint32_t enqueue)  {
> >  	RTE_SET_USED(enqueue);
> > +	/*
> > +	 * Updating of ht->tail cannot happen before elements are added to or
> > +	 * removed from the ring, as it could result in data races between
> > +	 * producer and consumer threads. Therefore we need a release
> > +	 * barrier here.
> > +	 */
> > +	rte_atomic_thread_fence(__ATOMIC_RELEASE);
> >
> >  	/*
> >  	 * If there are other enqueues/dequeues in progress that preceded
> > us, @@ -24,7 +31,7 @@ __rte_ring_update_tail(struct rte_ring_headtail
> *ht, uint32_t old_val,
> >  	if (!single)
> >  		rte_wait_until_equal_32(&ht->tail, old_val,
> __ATOMIC_RELAXED);
> >
> > -	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > +	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED);
> >  }
> >
> >  /**
> > @@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r,
> unsigned int is_sp,
> >  		/* Reset n to the initial burst count */
> >  		n = max;
> >
> > -		/* Ensure the head is read before tail */
> > -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> > -
> > -		/* load-acquire synchronize with store-release of ht->tail
> > -		 * in update_tail.
> > -		 */
> >  		cons_tail = __atomic_load_n(&r->cons.tail,
> > -					__ATOMIC_ACQUIRE);
> > +					__ATOMIC_RELAXED);
> >
> >  		/* The subtraction is done between two unsigned 32bits value
> >  		 * (the result is always modulo 32 bits even if we have @@ -
> 100,6
> > +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int
> is_sp,
> >  					0, __ATOMIC_RELAXED,
> >  					__ATOMIC_RELAXED);
> >  	} while (unlikely(success == 0));
> > +	/*
> > +	 * Ensure that updates to the ring doesn't rise above
> > +	 * load of the new_head in SP and MP cases.
> > +	 */
> > +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> >  	return n;
> >  }
> >
> > @@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int
> is_sc,
> >  		/* Restore n as it may change every loop */
> >  		n = max;
> >
> > -		/* Ensure the head is read before tail */
> > -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> > -
> > -		/* this load-acquire synchronize with store-release of ht->tail
> > -		 * in update_tail.
> > -		 */
> >  		prod_tail = __atomic_load_n(&r->prod.tail,
> > -					__ATOMIC_ACQUIRE);
> > +					__ATOMIC_RELAXED);
> >
> >  		/* The subtraction is done between two unsigned 32bits value
> >  		 * (the result is always modulo 32 bits even if we have @@ -
> 175,6
> > +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> >  							0,
> __ATOMIC_RELAXED,
> >  							__ATOMIC_RELAXED);
> >  	} while (unlikely(success == 0));
> > +	/*
> > +	 * Ensure that updates to the ring doesn't rise above
> > +	 * load of the new_head in SP and MP cases.
> > +	 */
> > +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> >  	return n;
> >  }
> >
> > --
> > 2.25.1
> >



* RE: [RFC] ring: further performance improvements with C11
  2023-08-04 22:50     ` Wathsala Wathawana Vithanage
@ 2023-08-09 18:18       ` Konstantin Ananyev
  2023-08-15  5:14         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 9+ messages in thread
From: Konstantin Ananyev @ 2023-08-09 18:18 UTC (permalink / raw)
  To: Wathsala Wathawana Vithanage, Honnappa Nagarahalli,
	konstantin.v.ananyev, thomas, Ruifeng Wang
  Cc: dev, nd, Justin He, nd


> > > For improved performance over the current C11 based ring
> > > implementation following changes were made.
> > > (1) Replace tail store with RELEASE semantics in
> > > __rte_ring_update_tail with a RELEASE fence. Replace load of the tail
> > > with ACQUIRE semantics in __rte_ring_move_prod_head and
> > > __rte_ring_move_cons_head with ACQUIRE fences.
> > > (2) Remove ACQUIRE fences between load of the old_head and load of the
> > > cons_tail in __rte_ring_move_prod_head and __rte_ring_move_cons_head.
> > > These two fences are not required for the safety of the ring library.
> >
> > Hmm... with these changes, aren't we re-introducing the old bug fixed by this
> > commit:
> 
> Cover letter explains why this barrier does not solve what it intends to solve and
> Why it should not matter.
> https://mails.dpdk.org/archives/dev/2023-June/270874.html

Ok, let's consider a case similar to yours (i), but where r->prod.head was moved a distance greater than r->capacity.
To be more specific, let's start with the same initial state:
capacity = 32
r->cons.tail = 5
r->cons.head = 5
r->prod.head = 10
r-prod.tail = 10

time 0,  thread1: 
    /* re-ordered load */
     cons_tail = r->cons.tail; // = 5

Now, thread1 was stalled for a bit; meanwhile a few enqueues/dequeues were
done by other threads, so the current state of the ring is:
r->cons.tail = 105
r->cons.head = 105
r->prod.head = 110
r-prod.tail = 110

time 1,  thread1:
old_head =  r->prod.head; // 110
*free_entries = (capacity + cons_tail - old_head); // = (uint32_t)(32 + 5 - 110) ==  (uint32_t)-73 == 4294967223 
 
 So, the free_entries value is way too big, and this comparison:
 
if (unlikely(n > *free_entries))

might give a wrong result.
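
Written out as C, with the values from the scenario above (illustration
only):

	uint32_t capacity  = 32;
	uint32_t cons_tail = 5;   /* stale value read before the stall */
	uint32_t old_head  = 110; /* current r->prod.head */

	uint32_t free_entries = capacity + cons_tail - old_head;
	/* = (uint32_t)-73 = 4294967223, although the ring only has
	 * 32 + 105 - 110 = 27 free slots, so the (n > free_entries)
	 * check lets far too large an enqueue through */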

So I still think we do need some sort of _read_fence_ between these two loads.
As I said before, that looks exactly like the old bug, fixed a while ago:
http://git.dpdk.org/dpdk/commit/?id=9bc2cbb007c0a3335c5582357ae9f6d37ea0b654
but now re-introduced for C11 case.

> >
> > commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654
> > Author: Jia He <justin.he@arm.com>
> > Date:   Fri Nov 10 03:30:42 2017 +0000
> >
> >     ring: guarantee load/load order in enqueue and dequeue
> >
> >     We watched a rte panic of mbuf_autotest in our qualcomm arm64 server
> >     (Amberwing).
> >
> >     Root cause:
> >     In __rte_ring_move_cons_head()
> >     ...
> >             do {
> >                     /* Restore n as it may change every loop */
> >                     n = max;
> >
> >                     *old_head = r->cons.head;                //1st load
> >                     const uint32_t prod_tail = r->prod.tail; //2nd load
> >
> >     In weak memory order architectures (powerpc,arm), the 2nd load might be
> >     reodered before the 1st load, that makes *entries is bigger than we wanted.
> >     This nasty reording messed enque/deque up.
> >     ....
> > ?
> >
> > >
> > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > ---
> > >  .mailmap                    |  1 +
> > >  lib/ring/rte_ring_c11_pvt.h | 35 ++++++++++++++++++++---------------
> > >  2 files changed, 21 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/.mailmap b/.mailmap
> > > index 4018f0fc47..367115d134 100644
> > > --- a/.mailmap
> > > +++ b/.mailmap
> > > @@ -1430,6 +1430,7 @@ Walter Heymans
> > <walter.heymans@corigine.com>
> > > Wang Sheng-Hui <shhuiw@gmail.com>  Wangyu (Eric)
> > > <seven.wangyu@huawei.com>  Waterman Cao <waterman.cao@intel.com>
> > > +Wathsala Vithanage <wathsala.vithanage@arm.com>
> > >  Weichun Chen <weichunx.chen@intel.com>  Wei Dai <wei.dai@intel.com>
> > > Weifeng Li <liweifeng96@126.com> diff --git
> > > a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h index
> > > f895950df4..63fe58ce9e 100644
> > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > @@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht,
> > uint32_t old_val,
> > >  		uint32_t new_val, uint32_t single, uint32_t enqueue)  {
> > >  	RTE_SET_USED(enqueue);
> > > +	/*
> > > +	 * Updating of ht->tail cannot happen before elements are added to or
> > > +	 * removed from the ring, as it could result in data races between
> > > +	 * producer and consumer threads. Therefore we need a release
> > > +	 * barrier here.
> > > +	 */
> > > +	rte_atomic_thread_fence(__ATOMIC_RELEASE);
> > >
> > >  	/*
> > >  	 * If there are other enqueues/dequeues in progress that preceded
> > > us, @@ -24,7 +31,7 @@ __rte_ring_update_tail(struct rte_ring_headtail
> > *ht, uint32_t old_val,
> > >  	if (!single)
> > >  		rte_wait_until_equal_32(&ht->tail, old_val,
> > __ATOMIC_RELAXED);
> > >
> > > -	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > > +	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED);
> > >  }
> > >
> > >  /**
> > > @@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r,
> > unsigned int is_sp,
> > >  		/* Reset n to the initial burst count */
> > >  		n = max;
> > >
> > > -		/* Ensure the head is read before tail */
> > > -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > -
> > > -		/* load-acquire synchronize with store-release of ht->tail
> > > -		 * in update_tail.
> > > -		 */
> > >  		cons_tail = __atomic_load_n(&r->cons.tail,
> > > -					__ATOMIC_ACQUIRE);
> > > +					__ATOMIC_RELAXED);
> > >
> > >  		/* The subtraction is done between two unsigned 32bits value
> > >  		 * (the result is always modulo 32 bits even if we have @@ -
> > 100,6
> > > +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int
> > is_sp,
> > >  					0, __ATOMIC_RELAXED,
> > >  					__ATOMIC_RELAXED);
> > >  	} while (unlikely(success == 0));
> > > +	/*
> > > +	 * Ensure that updates to the ring doesn't rise above
> > > +	 * load of the new_head in SP and MP cases.
> > > +	 */
> > > +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> > >  	return n;
> > >  }
> > >
> > > @@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int
> > is_sc,
> > >  		/* Restore n as it may change every loop */
> > >  		n = max;
> > >
> > > -		/* Ensure the head is read before tail */
> > > -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > -
> > > -		/* this load-acquire synchronize with store-release of ht->tail
> > > -		 * in update_tail.
> > > -		 */
> > >  		prod_tail = __atomic_load_n(&r->prod.tail,
> > > -					__ATOMIC_ACQUIRE);
> > > +					__ATOMIC_RELAXED);
> > >
> > >  		/* The subtraction is done between two unsigned 32bits value
> > >  		 * (the result is always modulo 32 bits even if we have @@ -
> > 175,6
> > > +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> > >  							0,
> > __ATOMIC_RELAXED,
> > >  							__ATOMIC_RELAXED);
> > >  	} while (unlikely(success == 0));
> > > +	/*
> > > +	 * Ensure that updates to the ring doesn't rise above
> > > +	 * load of the new_head in SP and MP cases.
> > > +	 */
> > > +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> > >  	return n;
> > >  }
> > >
> > > --
> > > 2.25.1
> > >



* RE: [RFC] ring: further performance improvements with C11
  2023-08-09 18:18       ` Konstantin Ananyev
@ 2023-08-15  5:14         ` Honnappa Nagarahalli
  2023-08-21 13:27           ` Konstantin Ananyev
  0 siblings, 1 reply; 9+ messages in thread
From: Honnappa Nagarahalli @ 2023-08-15  5:14 UTC (permalink / raw)
  To: Konstantin Ananyev, Wathsala Wathawana Vithanage,
	konstantin.v.ananyev, thomas, Ruifeng Wang
  Cc: dev, nd, Justin He, nd



> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Wednesday, August 9, 2023 1:19 PM
> To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> konstantin.v.ananyev@yandex.ru; thomas@monjalon.net; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Justin He <Justin.He@arm.com>; nd
> <nd@arm.com>
> Subject: RE: [RFC] ring: further performance improvements with C11
> 
> 
> > > > For improved performance over the current C11 based ring
> > > > implementation following changes were made.
> > > > (1) Replace tail store with RELEASE semantics in
> > > > __rte_ring_update_tail with a RELEASE fence. Replace load of the
> > > > tail with ACQUIRE semantics in __rte_ring_move_prod_head and
> > > > __rte_ring_move_cons_head with ACQUIRE fences.
> > > > (2) Remove ACQUIRE fences between load of the old_head and load of
> > > > the cons_tail in __rte_ring_move_prod_head and
> __rte_ring_move_cons_head.
> > > > These two fences are not required for the safety of the ring library.
> > >
> > > Hmm... with these changes, aren't we re-introducing the old bug
> > > fixed by this
> > > commit:
> >
> > Cover letter explains why this barrier does not solve what it intends
> > to solve and Why it should not matter.
> > https://mails.dpdk.org/archives/dev/2023-June/270874.html
> 
> Ok, let's consider the case similar to yours (i), but when r->prod.head was
> moved for distance greater then r->capacity.
> To be more specific, let' start with the same initial state:
> capacity = 32
> r->cons.tail = 5
> r->cons.head = 5
> r->prod.head = 10
> r-prod.tail = 10
> 
> time 0,  thread1:
>     /* re-ordered load */
>      const_tail = r->cons.tail; //= 5
> 
> Now, thread1 was stalled for a bit, meanwhile there were few
What exactly do you mean by 'stalled'? If you mean the thread is preempted, then the ring algorithm is not designed for that; there are restrictions mentioned in [1].
However, we do need to handle the re-ordering case.

[1] https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#known-issues

> enqueue/dequeus done by other threads, so current state of the ring:
> r->cons.tail = 105
> r->cons.head = 105
> r->prod.head = 110
> r-prod.tail = 110
> 
> time 1,  thread1:
> old_head =  r->prod.head; // 110
> *free_entries = (capacity + cons_tail - old_head); // = (uint32_t)(32 + 5 - 110)
> ==  (uint32_t)-73 == 4294967223
> 
>  So, free_entries value is way too big, and that comparison:
> 
> if (unlikely(n > *free_entries))
> 
> might provide wrong result.
> 
> So I still think we do need some sort of _read_fence_ between these two loads.
> As I said before, that looks exactly like the old bug, fixed a while ago:
> http://git.dpdk.org/dpdk/commit/?id=9bc2cbb007c0a3335c5582357ae9f6d37
> ea0b654
> but now re-introduced for C11 case.
Agreed that the re-ordering case should be handled. I am thinking that a check for (*free_entries > capacity) and restarting the loop might suffice (without the barrier)?
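
Not a tested change, just roughly what I have in mind inside the
__rte_ring_move_prod_head loop (illustrative sketch only; variable names
assumed to match rte_ring_c11_pvt.h):

	cons_tail = __atomic_load_n(&r->cons.tail, __ATOMIC_RELAXED);

	*free_entries = (capacity + cons_tail - *old_head);

	/* cons_tail was read out of order and is too stale to be usable;
	 * retry the head/tail reads instead of relying on a fence */
	if (unlikely(*free_entries > capacity)) {
		success = 0;
		continue;
	}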

> 
> > >
> > > commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654
> > > Author: Jia He <justin.he@arm.com>
> > > Date:   Fri Nov 10 03:30:42 2017 +0000
> > >
> > >     ring: guarantee load/load order in enqueue and dequeue
> > >
> > >     We watched a rte panic of mbuf_autotest in our qualcomm arm64 server
> > >     (Amberwing).
> > >
> > >     Root cause:
> > >     In __rte_ring_move_cons_head()
> > >     ...
> > >             do {
> > >                     /* Restore n as it may change every loop */
> > >                     n = max;
> > >
> > >                     *old_head = r->cons.head;                //1st load
> > >                     const uint32_t prod_tail = r->prod.tail; //2nd
> > > load
> > >
> > >     In weak memory order architectures (powerpc,arm), the 2nd load might
> be
> > >     reodered before the 1st load, that makes *entries is bigger than we
> wanted.
> > >     This nasty reording messed enque/deque up.
> > >     ....
> > > ?
> > >
> > > >
> > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > ---
> > > >  .mailmap                    |  1 +
> > > >  lib/ring/rte_ring_c11_pvt.h | 35
> > > > ++++++++++++++++++++---------------
> > > >  2 files changed, 21 insertions(+), 15 deletions(-)
> > > >
> > > > diff --git a/.mailmap b/.mailmap
> > > > index 4018f0fc47..367115d134 100644
> > > > --- a/.mailmap
> > > > +++ b/.mailmap
> > > > @@ -1430,6 +1430,7 @@ Walter Heymans
> > > <walter.heymans@corigine.com>
> > > > Wang Sheng-Hui <shhuiw@gmail.com>  Wangyu (Eric)
> > > > <seven.wangyu@huawei.com>  Waterman Cao
> <waterman.cao@intel.com>
> > > > +Wathsala Vithanage <wathsala.vithanage@arm.com>
> > > >  Weichun Chen <weichunx.chen@intel.com>  Wei Dai
> > > > <wei.dai@intel.com> Weifeng Li <liweifeng96@126.com> diff --git
> > > > a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h index
> > > > f895950df4..63fe58ce9e 100644
> > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > @@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail
> > > > *ht,
> > > uint32_t old_val,
> > > >  		uint32_t new_val, uint32_t single, uint32_t enqueue)  {
> > > >  	RTE_SET_USED(enqueue);
> > > > +	/*
> > > > +	 * Updating of ht->tail cannot happen before elements are added to or
> > > > +	 * removed from the ring, as it could result in data races between
> > > > +	 * producer and consumer threads. Therefore we need a release
> > > > +	 * barrier here.
> > > > +	 */
> > > > +	rte_atomic_thread_fence(__ATOMIC_RELEASE);
> > > >
> > > >  	/*
> > > >  	 * If there are other enqueues/dequeues in progress that
> > > > preceded us, @@ -24,7 +31,7 @@ __rte_ring_update_tail(struct
> > > > rte_ring_headtail
> > > *ht, uint32_t old_val,
> > > >  	if (!single)
> > > >  		rte_wait_until_equal_32(&ht->tail, old_val,
> > > __ATOMIC_RELAXED);
> > > >
> > > > -	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > > > +	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED);
> > > >  }
> > > >
> > > >  /**
> > > > @@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r,
> > > unsigned int is_sp,
> > > >  		/* Reset n to the initial burst count */
> > > >  		n = max;
> > > >
> > > > -		/* Ensure the head is read before tail */
> > > > -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > > -
> > > > -		/* load-acquire synchronize with store-release of ht->tail
> > > > -		 * in update_tail.
> > > > -		 */
> > > >  		cons_tail = __atomic_load_n(&r->cons.tail,
> > > > -					__ATOMIC_ACQUIRE);
> > > > +					__ATOMIC_RELAXED);
> > > >
> > > >  		/* The subtraction is done between two unsigned 32bits value
> > > >  		 * (the result is always modulo 32 bits even if we have @@ -
> > > 100,6
> > > > +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned
> > > > +int
> > > is_sp,
> > > >  					0, __ATOMIC_RELAXED,
> > > >  					__ATOMIC_RELAXED);
> > > >  	} while (unlikely(success == 0));
> > > > +	/*
> > > > +	 * Ensure that updates to the ring doesn't rise above
> > > > +	 * load of the new_head in SP and MP cases.
> > > > +	 */
> > > > +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > >  	return n;
> > > >  }
> > > >
> > > > @@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r,
> > > > int
> > > is_sc,
> > > >  		/* Restore n as it may change every loop */
> > > >  		n = max;
> > > >
> > > > -		/* Ensure the head is read before tail */
> > > > -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > > -
> > > > -		/* this load-acquire synchronize with store-release of ht->tail
> > > > -		 * in update_tail.
> > > > -		 */
> > > >  		prod_tail = __atomic_load_n(&r->prod.tail,
> > > > -					__ATOMIC_ACQUIRE);
> > > > +					__ATOMIC_RELAXED);
> > > >
> > > >  		/* The subtraction is done between two unsigned 32bits value
> > > >  		 * (the result is always modulo 32 bits even if we have @@ -
> > > 175,6
> > > > +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int
> > > > +is_sc,
> > > >  							0,
> > > __ATOMIC_RELAXED,
> > > >
> 	__ATOMIC_RELAXED);
> > > >  	} while (unlikely(success == 0));
> > > > +	/*
> > > > +	 * Ensure that updates to the ring doesn't rise above
> > > > +	 * load of the new_head in SP and MP cases.
> > > > +	 */
> > > > +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > >  	return n;
> > > >  }
> > > >
> > > > --
> > > > 2.25.1
> > > >



* RE: [RFC] ring: further performance improvements with C11
  2023-08-15  5:14         ` Honnappa Nagarahalli
@ 2023-08-21 13:27           ` Konstantin Ananyev
  0 siblings, 0 replies; 9+ messages in thread
From: Konstantin Ananyev @ 2023-08-21 13:27 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Wathsala Wathawana Vithanage,
	konstantin.v.ananyev, thomas, Ruifeng Wang
  Cc: dev, nd, Justin He, nd


> > > > > For improved performance over the current C11 based ring
> > > > > implementation following changes were made.
> > > > > (1) Replace tail store with RELEASE semantics in
> > > > > __rte_ring_update_tail with a RELEASE fence. Replace load of the
> > > > > tail with ACQUIRE semantics in __rte_ring_move_prod_head and
> > > > > __rte_ring_move_cons_head with ACQUIRE fences.
> > > > > (2) Remove ACQUIRE fences between load of the old_head and load of
> > > > > the cons_tail in __rte_ring_move_prod_head and
> > __rte_ring_move_cons_head.
> > > > > These two fences are not required for the safety of the ring library.
> > > >
> > > > Hmm... with these changes, aren't we re-introducing the old bug
> > > > fixed by this
> > > > commit:
> > >
> > > Cover letter explains why this barrier does not solve what it intends
> > > to solve and Why it should not matter.
> > > https://mails.dpdk.org/archives/dev/2023-June/270874.html
> >
> > Ok, let's consider the case similar to yours (i), but when r->prod.head was
> > moved for distance greater then r->capacity.
> > To be more specific, let' start with the same initial state:
> > capacity = 32
> > r->cons.tail = 5
> > r->cons.head = 5
> > r->prod.head = 10
> > r-prod.tail = 10
> >
> > time 0,  thread1:
> >     /* re-ordered load */
> >      const_tail = r->cons.tail; //= 5
> >
> > Now, thread1 was stalled for a bit, meanwhile there were few
> What exactly do you mean by 'stalled'? 

I mean: the CPU pipeline the thread was running on gets stalled for whatever reason -
memory load latency, TLB miss, etc.

> If you are meaning, thread is preempted,
> then the ring algorithm is not designed for it. There
> are restrictions mentioned in [1].

I am not talking about SW thread preemption here.
With the example I provided, I think it is clear that the problem can happen even
with 'ideal' conditions: each thread runs non-preempted on a separate core.

> However, we do need to handle the re-ordering case.

To be clear: for me, right now this patch is bogus; it has to be either reworked or abandoned.

> 
> [1] https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#known-issues
> 
> > enqueue/dequeus done by other threads, so current state of the ring:
> > r->cons.tail = 105
> > r->cons.head = 105
> > r->prod.head = 110
> > r-prod.tail = 110
> >
> > time 1,  thread1:
> > old_head =  r->prod.head; // 110
> > *free_entries = (capacity + cons_tail - old_head); // = (uint32_t)(32 + 5 - 110)
> > ==  (uint32_t)-73 == 4294967223
> >

> >  So, free_entries value is way too big, and that comparison:
> >
> > if (unlikely(n > *free_entries))
> >
> > might provide wrong result.
> >
> > So I still think we do need some sort of _read_fence_ between these two loads.
> > As I said before, that looks exactly like the old bug, fixed a while ago:
> > http://git.dpdk.org/dpdk/commit/?id=9bc2cbb007c0a3335c5582357ae9f6d37
> > ea0b654
> > but now re-introduced for C11 case.
> Agree that the re-ordering case should be handled.

Either handled, or simply not allowed. 

> I am thinking a check (*free_entries > capacity) and restarting the loop might
> suffice (without the barrier)?

I thought about the same thing, and at first glance it seems workable in principle.
Though I still hesitate to remove the ordering completely here:
as both compiler and CPU reordering will be possible, it will probably introduce the possibility of a sort-of ABA problem:
the old cons.tail is read at a very early stage;
after that, the current cons.tail value wraps around 2^32 and is now less than the old cons.tail;
then we read prod.head.
So:
(invalid_free_ent = capacity + cons.tail._old - prod.head) <= capacity
and
(valid_free_ent = capacity + cons.tail._cur - prod.head) <= capacity

are both true, but invalid > valid, so we overestimate the number of free entries.

In the majority of cases the possibility of such a situation is negligible,
but for huge values of 'capacity' it will grow.
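
To put hypothetical numbers on that (purely illustrative, chosen only to
satisfy both inequalities above):

	uint32_t capacity      = 1024;
	uint32_t cons_tail_old = 4800; /* stale value read before the stall */
	/* the ring then advances by (2^32 - 300) elements during the stall,
	 * so cons.tail wraps around and ends up below its old value */
	uint32_t cons_tail_cur = 4500;
	uint32_t prod_head     = 5000;

	uint32_t invalid_free = capacity + cons_tail_old - prod_head; /* 824 */
	uint32_t valid_free   = capacity + cons_tail_cur - prod_head; /* 524 */
	/* both results are <= capacity, so a (free_entries > capacity)
	 * check passes, yet free space is overestimated by 300 slots */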

Can I ask a slightly different question: as I remember, this series started as an attempt
to improve the C11 ring enqueue/dequeue implementation.
Though looking at the non-C11 one, there also exists a read fence between these two loads:
https://elixir.bootlin.com/dpdk/v23.07/source/lib/ring/rte_ring_generic_pvt.h#L73
Which makes me wonder - maybe it is not the read ordering that causes the slowdown of the C11 version.
Did anyone try to compare the generated code for both cases? I wonder what the difference is.
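
(For reference, the generic implementation keeps roughly the following
between the two loads - paraphrasing, not quoting, rte_ring_generic_pvt.h:)

	*old_head = r->prod.head;

	/* read barrier between the two loads; amounts to no more than a
	 * compiler barrier on strongly ordered CPUs such as x86 */
	rte_smp_rmb();

	cons_tail = r->cons.tail;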

Konstantin

> >
> > > >
> > > > commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654
> > > > Author: Jia He <justin.he@arm.com>
> > > > Date:   Fri Nov 10 03:30:42 2017 +0000
> > > >
> > > >     ring: guarantee load/load order in enqueue and dequeue
> > > >
> > > >     We watched a rte panic of mbuf_autotest in our qualcomm arm64 server
> > > >     (Amberwing).
> > > >
> > > >     Root cause:
> > > >     In __rte_ring_move_cons_head()
> > > >     ...
> > > >             do {
> > > >                     /* Restore n as it may change every loop */
> > > >                     n = max;
> > > >
> > > >                     *old_head = r->cons.head;                //1st load
> > > >                     const uint32_t prod_tail = r->prod.tail; //2nd
> > > > load
> > > >
> > > >     In weak memory order architectures (powerpc,arm), the 2nd load might
> > be
> > > >     reodered before the 1st load, that makes *entries is bigger than we
> > wanted.
> > > >     This nasty reording messed enque/deque up.
> > > >     ....
> > > > ?
> > > >
> > > > >
> > > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > ---
> > > > >  .mailmap                    |  1 +
> > > > >  lib/ring/rte_ring_c11_pvt.h | 35
> > > > > ++++++++++++++++++++---------------
> > > > >  2 files changed, 21 insertions(+), 15 deletions(-)
> > > > >
> > > > > diff --git a/.mailmap b/.mailmap
> > > > > index 4018f0fc47..367115d134 100644
> > > > > --- a/.mailmap
> > > > > +++ b/.mailmap
> > > > > @@ -1430,6 +1430,7 @@ Walter Heymans
> > > > <walter.heymans@corigine.com>
> > > > > Wang Sheng-Hui <shhuiw@gmail.com>  Wangyu (Eric)
> > > > > <seven.wangyu@huawei.com>  Waterman Cao
> > <waterman.cao@intel.com>
> > > > > +Wathsala Vithanage <wathsala.vithanage@arm.com>
> > > > >  Weichun Chen <weichunx.chen@intel.com>  Wei Dai
> > > > > <wei.dai@intel.com> Weifeng Li <liweifeng96@126.com> diff --git
> > > > > a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h index
> > > > > f895950df4..63fe58ce9e 100644
> > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > @@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail
> > > > > *ht,
> > > > uint32_t old_val,
> > > > >  		uint32_t new_val, uint32_t single, uint32_t enqueue)  {
> > > > >  	RTE_SET_USED(enqueue);
> > > > > +	/*
> > > > > +	 * Updating of ht->tail cannot happen before elements are added to or
> > > > > +	 * removed from the ring, as it could result in data races between
> > > > > +	 * producer and consumer threads. Therefore we need a release
> > > > > +	 * barrier here.
> > > > > +	 */
> > > > > +	rte_atomic_thread_fence(__ATOMIC_RELEASE);
> > > > >
> > > > >  	/*
> > > > >  	 * If there are other enqueues/dequeues in progress that
> > > > > preceded us, @@ -24,7 +31,7 @@ __rte_ring_update_tail(struct
> > > > > rte_ring_headtail
> > > > *ht, uint32_t old_val,
> > > > >  	if (!single)
> > > > >  		rte_wait_until_equal_32(&ht->tail, old_val,
> > > > __ATOMIC_RELAXED);
> > > > >
> > > > > -	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > > > > +	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED);
> > > > >  }
> > > > >
> > > > >  /**
> > > > > @@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r,
> > > > unsigned int is_sp,
> > > > >  		/* Reset n to the initial burst count */
> > > > >  		n = max;
> > > > >
> > > > > -		/* Ensure the head is read before tail */
> > > > > -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > > > -
> > > > > -		/* load-acquire synchronize with store-release of ht->tail
> > > > > -		 * in update_tail.
> > > > > -		 */
> > > > >  		cons_tail = __atomic_load_n(&r->cons.tail,
> > > > > -					__ATOMIC_ACQUIRE);
> > > > > +					__ATOMIC_RELAXED);
> > > > >
> > > > >  		/* The subtraction is done between two unsigned 32bits value
> > > > >  		 * (the result is always modulo 32 bits even if we have @@ -
> > > > 100,6
> > > > > +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned
> > > > > +int
> > > > is_sp,
> > > > >  					0, __ATOMIC_RELAXED,
> > > > >  					__ATOMIC_RELAXED);
> > > > >  	} while (unlikely(success == 0));
> > > > > +	/*
> > > > > +	 * Ensure that updates to the ring doesn't rise above
> > > > > +	 * load of the new_head in SP and MP cases.
> > > > > +	 */
> > > > > +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > > >  	return n;
> > > > >  }
> > > > >
> > > > > @@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r,
> > > > > int
> > > > is_sc,
> > > > >  		/* Restore n as it may change every loop */
> > > > >  		n = max;
> > > > >
> > > > > -		/* Ensure the head is read before tail */
> > > > > -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > > > -
> > > > > -		/* this load-acquire synchronize with store-release of ht->tail
> > > > > -		 * in update_tail.
> > > > > -		 */
> > > > >  		prod_tail = __atomic_load_n(&r->prod.tail,
> > > > > -					__ATOMIC_ACQUIRE);
> > > > > +					__ATOMIC_RELAXED);
> > > > >
> > > > >  		/* The subtraction is done between two unsigned 32bits value
> > > > >  		 * (the result is always modulo 32 bits even if we have @@ -
> > > > 175,6
> > > > > +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int
> > > > > +is_sc,
> > > > >  							0,
> > > > __ATOMIC_RELAXED,
> > > > >
> > 	__ATOMIC_RELAXED);
> > > > >  	} while (unlikely(success == 0));
> > > > > +	/*
> > > > > +	 * Ensure that updates to the ring doesn't rise above
> > > > > +	 * load of the new_head in SP and MP cases.
> > > > > +	 */
> > > > > +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > > >  	return n;
> > > > >  }
> > > > >
> > > > > --
> > > > > 2.25.1
> > > > >
