DPDK patches and discussions
 help / color / mirror / Atom feed
* [RFC] ring: improve ring performance with C11 atomics
@ 2023-04-21 19:14 Wathsala Vithanage
  2023-04-21 21:27 ` Stephen Hemminger
  0 siblings, 1 reply; 5+ messages in thread
From: Wathsala Vithanage @ 2023-04-21 19:14 UTC (permalink / raw)
  To: honnappa.nagarahalli, konstantin.v.ananyev, feifei.wang2
  Cc: dev, nd, Wathsala Vithanage

Tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head
can be changed to __ATOMIC_RELAXED from __ATOMIC_ACQUIRE.
Because to calculate the addresses of the dequeue
elements __rte_ring_dequeue_elems uses the old_head updated by the
__atomic_compare_exchange_n intrinsic used in
__rte_ring_move_prod_head. This results in an address dependency
between the two operations. Therefore __rte_ring_dequeue_elems cannot
happen before  __rte_ring_move_prod_head.
Similarly __rte_ring_enqueue_elems and __rte_ring_move_cons_head
won't be reordered either.

Performance on Arm N1
Gain relative to generic implementation
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 8 (Arm N1)                         |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 766730 | Total count: 651686  | Total count: 812125  |
|                     |        Gain: -15%    |        Gain: 6%      |
+-------------------------------------------------------------------+
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 32 (Arm N1)                        |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 816745 | Total count: 646385  | Total count: 830935  |
|                     |        Gain: -21%    |        Gain: 2%      |
+-------------------------------------------------------------------+

Performance on x86-64 Cascade Lake
Gain relative to generic implementation
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 8                                  |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 181640 | Total count: 181995  | Total count: 182791  |
|                     |        Gain: 0.2%    |        Gain: 0.6%
+-------------------------------------------------------------------+
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 32                                 |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 167495 | Total count: 161536  | Total count: 163190  |
|                     |        Gain: -3.5%   |        Gain: -2.6%   |
+-------------------------------------------------------------------+

Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Feifei Wang <feifei.wang2@arm.com>
---
 .mailmap                    |  1 +
 lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/.mailmap b/.mailmap
index 4018f0fc47..367115d134 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1430,6 +1430,7 @@ Walter Heymans <walter.heymans@corigine.com>
 Wang Sheng-Hui <shhuiw@gmail.com>
 Wangyu (Eric) <seven.wangyu@huawei.com>
 Waterman Cao <waterman.cao@intel.com>
+Wathsala Vithanage <wathsala.vithanage@arm.com>
 Weichun Chen <weichunx.chen@intel.com>
 Wei Dai <wei.dai@intel.com>
 Weifeng Li <liweifeng96@126.com>
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index f895950df4..1895f2bb0e 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -24,6 +24,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
 	if (!single)
 		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
 
+	/*
+	 * Updating of ht->tail cannot happen before elements are added to or
+	 * removed from the ring, as it could result in data races between
+	 * producer and consumer threads. Therefore ht->tail should be updated
+	 * with release semantics to prevent ring data copy phase from sinking
+	 * below it.
+	 */
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
 
@@ -69,11 +76,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
-		 */
 		cons_tail = __atomic_load_n(&r->cons.tail,
-					__ATOMIC_ACQUIRE);
+					__ATOMIC_RELAXED);
 
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
@@ -145,12 +149,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* this load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
-		 */
 		prod_tail = __atomic_load_n(&r->prod.tail,
-					__ATOMIC_ACQUIRE);
-
+					__ATOMIC_RELAXED);
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
-- 
2.25.1


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] ring: improve ring performance with C11 atomics
  2023-04-21 19:14 [RFC] ring: improve ring performance with C11 atomics Wathsala Vithanage
@ 2023-04-21 21:27 ` Stephen Hemminger
  0 siblings, 0 replies; 5+ messages in thread
From: Stephen Hemminger @ 2023-04-21 21:27 UTC (permalink / raw)
  To: Wathsala Vithanage
  Cc: honnappa.nagarahalli, konstantin.v.ananyev, feifei.wang2, dev, nd

On Fri, 21 Apr 2023 19:14:46 +0000
Wathsala Vithanage <wathsala.vithanage@arm.com> wrote:

>  		/* Ensure the head is read before tail */
>  		__atomic_thread_fence(__ATOMIC_ACQUIRE);
>  
> -		/* load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */

This makes sense since tail is already synchronized by the previous thread fence.

Probably best to add the comment about ht->tail into the existing thread_fence comment.
Something like:
		/* Ensure that head is read before tail and that tail synchronized with update_tail */

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [RFC] ring: improve ring performance with C11 atomics
  2023-04-21 19:16 Wathsala Vithanage
  2023-04-22 13:19 ` Morten Brørup
@ 2023-06-08 13:21 ` Wathsala Wathawana Vithanage
  1 sibling, 0 replies; 5+ messages in thread
From: Wathsala Wathawana Vithanage @ 2023-06-08 13:21 UTC (permalink / raw)
  To: Honnappa Nagarahalli, konstantin.v.ananyev, Feifei Wang; +Cc: dev, nd, nd

The solution presented in this RFC is not C11 compliant. 
C11 __atomic_compare_exchange_n updates "expected" only when CAS instruction fails. 
Therefore, the assumption that there is an address dependency from CAS instructions in both producer/consumer head update to the ring element accesses falls apart with respect to semantics of the C11 memory model. 
Thus, this solution may corrupt ring elements when executed on CPUs with weak memory models.
Therefore, this RFC will be retracted.

> -----Original Message-----
> From: Wathsala Vithanage <wathsala.vithanage@arm.com>
> Sent: Friday, April 21, 2023 3:17 PM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> konstantin.v.ananyev@yandex.ru; Feifei Wang <Feifei.Wang2@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Wathsala Wathawana Vithanage
> <wathsala.vithanage@arm.com>
> Subject: [RFC] ring: improve ring performance with C11 atomics
> 
> Tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head can
> be changed to __ATOMIC_RELAXED from __ATOMIC_ACQUIRE.
> Because to calculate the addresses of the dequeue elements
> __rte_ring_dequeue_elems uses the old_head updated by the
> __atomic_compare_exchange_n intrinsic used in __rte_ring_move_prod_head.
> This results in an address dependency between the two operations. Therefore
> __rte_ring_dequeue_elems cannot happen before
> __rte_ring_move_prod_head.
> Similarly __rte_ring_enqueue_elems and __rte_ring_move_cons_head won't be
> reordered either.
> 
> Performance on Arm N1
> Gain relative to generic implementation
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8 (Arm N1)                         |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 766730 | Total count: 651686  | Total count: 812125  |
> |                     |        Gain: -15%    |        Gain: 6%      |
> +-------------------------------------------------------------------+
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32 (Arm N1)                        |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 816745 | Total count: 646385  | Total count: 830935  |
> |                     |        Gain: -21%    |        Gain: 2%      |
> +-------------------------------------------------------------------+
> 
> Performance on x86-64 Cascade Lake
> Gain relative to generic implementation
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8                                  |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 181640 | Total count: 181995  | Total count: 182791  |
> |                     |        Gain: 0.2%    |        Gain: 0.6%
> +-------------------------------------------------------------------+
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32                                 |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 167495 | Total count: 161536  | Total count: 163190  |
> |                     |        Gain: -3.5%   |        Gain: -2.6%   |
> +-------------------------------------------------------------------+
> 
> Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Feifei Wang <feifei.wang2@arm.com>
> ---
>  .mailmap                    |  1 +
>  lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/.mailmap b/.mailmap
> index 4018f0fc47..367115d134 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -1430,6 +1430,7 @@ Walter Heymans <walter.heymans@corigine.com>
> Wang Sheng-Hui <shhuiw@gmail.com>  Wangyu (Eric)
> <seven.wangyu@huawei.com>  Waterman Cao <waterman.cao@intel.com>
> +Wathsala Vithanage <wathsala.vithanage@arm.com>
>  Weichun Chen <weichunx.chen@intel.com>
>  Wei Dai <wei.dai@intel.com>
>  Weifeng Li <liweifeng96@126.com>
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h index
> f895950df4..1895f2bb0e 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -24,6 +24,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht,
> uint32_t old_val,
>  	if (!single)
>  		rte_wait_until_equal_32(&ht->tail, old_val,
> __ATOMIC_RELAXED);
> 
> +	/*
> +	 * Updating of ht->tail cannot happen before elements are added to or
> +	 * removed from the ring, as it could result in data races between
> +	 * producer and consumer threads. Therefore ht->tail should be
> updated
> +	 * with release semantics to prevent ring data copy phase from sinking
> +	 * below it.
> +	 */
>  	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);  }
> 
> @@ -69,11 +76,8 @@ __rte_ring_move_prod_head(struct rte_ring *r,
> unsigned int is_sp,
>  		/* Ensure the head is read before tail */
>  		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> 
> -		/* load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		cons_tail = __atomic_load_n(&r->cons.tail,
> -					__ATOMIC_ACQUIRE);
> +					__ATOMIC_RELAXED);
> 
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have @@ -
> 145,12 +149,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  		/* Ensure the head is read before tail */
>  		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> 
> -		/* this load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		prod_tail = __atomic_load_n(&r->prod.tail,
> -					__ATOMIC_ACQUIRE);
> -
> +					__ATOMIC_RELAXED);
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * cons_head > prod_tail). So 'entries' is always between 0
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [RFC] ring: improve ring performance with C11 atomics
  2023-04-21 19:16 Wathsala Vithanage
@ 2023-04-22 13:19 ` Morten Brørup
  2023-06-08 13:21 ` Wathsala Wathawana Vithanage
  1 sibling, 0 replies; 5+ messages in thread
From: Morten Brørup @ 2023-04-22 13:19 UTC (permalink / raw)
  To: Wathsala Vithanage, honnappa.nagarahalli, konstantin.v.ananyev,
	feifei.wang2
  Cc: dev, nd

> From: Wathsala Vithanage [mailto:wathsala.vithanage@arm.com]
> Sent: Friday, 21 April 2023 21.17
> 
> Tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head
> can be changed to __ATOMIC_RELAXED from __ATOMIC_ACQUIRE.
> Because to calculate the addresses of the dequeue
> elements __rte_ring_dequeue_elems uses the old_head updated by the
> __atomic_compare_exchange_n intrinsic used in
> __rte_ring_move_prod_head. This results in an address dependency
> between the two operations. Therefore __rte_ring_dequeue_elems cannot
> happen before  __rte_ring_move_prod_head.
> Similarly __rte_ring_enqueue_elems and __rte_ring_move_cons_head
> won't be reordered either.

These preconditions should be added as comments in the source code.

> 
> Performance on Arm N1
> Gain relative to generic implementation
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8 (Arm N1)                         |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 766730 | Total count: 651686  | Total count: 812125  |
> |                     |        Gain: -15%    |        Gain: 6%      |
> +-------------------------------------------------------------------+
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32 (Arm N1)                        |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 816745 | Total count: 646385  | Total count: 830935  |
> |                     |        Gain: -21%    |        Gain: 2%      |
> +-------------------------------------------------------------------+

Big performance gain compared to pre-improved C11 atomics! Excellent.

> 
> Performance on x86-64 Cascade Lake
> Gain relative to generic implementation
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8                                  |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 181640 | Total count: 181995  | Total count: 182791  |
> |                     |        Gain: 0.2%    |        Gain: 0.6%
> +-------------------------------------------------------------------+
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32                                 |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 167495 | Total count: 161536  | Total count: 163190  |
> |                     |        Gain: -3.5%   |        Gain: -2.6%   |
> +-------------------------------------------------------------------+

I noticed that the larger size (32 objects) had a larger relative drop in performance than the smaller size (8 objects), so I am wondering what the performance numbers are for size 512, the default RTE_MEMPOOL_CACHE_MAX_SIZE? It's probably not going to change anything regarding the patch acceptance, but I'm curious about the numbers.

> 
> Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Feifei Wang <feifei.wang2@arm.com>
> ---
>  .mailmap                    |  1 +
>  lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/.mailmap b/.mailmap
> index 4018f0fc47..367115d134 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -1430,6 +1430,7 @@ Walter Heymans <walter.heymans@corigine.com>
>  Wang Sheng-Hui <shhuiw@gmail.com>
>  Wangyu (Eric) <seven.wangyu@huawei.com>
>  Waterman Cao <waterman.cao@intel.com>
> +Wathsala Vithanage <wathsala.vithanage@arm.com>
>  Weichun Chen <weichunx.chen@intel.com>
>  Wei Dai <wei.dai@intel.com>
>  Weifeng Li <liweifeng96@126.com>
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index f895950df4..1895f2bb0e 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -24,6 +24,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht,
> uint32_t old_val,
>  	if (!single)
>  		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
> 
> +	/*
> +	 * Updating of ht->tail cannot happen before elements are added to or
> +	 * removed from the ring, as it could result in data races between
> +	 * producer and consumer threads. Therefore ht->tail should be updated
> +	 * with release semantics to prevent ring data copy phase from sinking
> +	 * below it.
> +	 */

I think this comment should clarified as:

Updating of ht->tail SHOULD NOT happen before elements are added to or
removed from the ring, as it WOULD result in data races between
producer and consumer threads. Therefore ht->tail MUST be updated
with release semantics to prevent ring data copy phase from sinking
below it.

Don't use capitals; I only used them to highlight my modifications.

NB: English is not my native language, so please ignore me if I am mistaken!

>  	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
>  }
> 
> @@ -69,11 +76,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int
> is_sp,
>  		/* Ensure the head is read before tail */
>  		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> 
> -		/* load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		cons_tail = __atomic_load_n(&r->cons.tail,
> -					__ATOMIC_ACQUIRE);
> +					__ATOMIC_RELAXED);

The __ATOMIC_ACQUIRE had a comment describing its purpose.

Please don't just remove that comment. Please replace it with a new comment describing why __ATOMIC_RELAXED is valid here.

> 
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
> @@ -145,12 +149,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  		/* Ensure the head is read before tail */
>  		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> 
> -		/* this load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		prod_tail = __atomic_load_n(&r->prod.tail,
> -					__ATOMIC_ACQUIRE);
> -
> +					__ATOMIC_RELAXED);

Same comment as above: Don't just remove the old comment, please replace it with a new comment.

>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * cons_head > prod_tail). So 'entries' is always between 0
> --
> 2.25.1

With comments regarding __ATOMIC_RELAXED added to the source code,

Acked-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [RFC] ring: improve ring performance with C11 atomics
@ 2023-04-21 19:16 Wathsala Vithanage
  2023-04-22 13:19 ` Morten Brørup
  2023-06-08 13:21 ` Wathsala Wathawana Vithanage
  0 siblings, 2 replies; 5+ messages in thread
From: Wathsala Vithanage @ 2023-04-21 19:16 UTC (permalink / raw)
  To: honnappa.nagarahalli, konstantin.v.ananyev, feifei.wang2
  Cc: dev, nd, Wathsala Vithanage

Tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head
can be changed to __ATOMIC_RELAXED from __ATOMIC_ACQUIRE.
Because to calculate the addresses of the dequeue
elements __rte_ring_dequeue_elems uses the old_head updated by the
__atomic_compare_exchange_n intrinsic used in
__rte_ring_move_prod_head. This results in an address dependency
between the two operations. Therefore __rte_ring_dequeue_elems cannot
happen before  __rte_ring_move_prod_head.
Similarly __rte_ring_enqueue_elems and __rte_ring_move_cons_head
won't be reordered either.

Performance on Arm N1
Gain relative to generic implementation
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 8 (Arm N1)                         |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 766730 | Total count: 651686  | Total count: 812125  |
|                     |        Gain: -15%    |        Gain: 6%      |
+-------------------------------------------------------------------+
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 32 (Arm N1)                        |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 816745 | Total count: 646385  | Total count: 830935  |
|                     |        Gain: -21%    |        Gain: 2%      |
+-------------------------------------------------------------------+

Performance on x86-64 Cascade Lake
Gain relative to generic implementation
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 8                                  |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 181640 | Total count: 181995  | Total count: 182791  |
|                     |        Gain: 0.2%    |        Gain: 0.6%
+-------------------------------------------------------------------+
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 32                                 |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 167495 | Total count: 161536  | Total count: 163190  |
|                     |        Gain: -3.5%   |        Gain: -2.6%   |
+-------------------------------------------------------------------+

Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Feifei Wang <feifei.wang2@arm.com>
---
 .mailmap                    |  1 +
 lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/.mailmap b/.mailmap
index 4018f0fc47..367115d134 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1430,6 +1430,7 @@ Walter Heymans <walter.heymans@corigine.com>
 Wang Sheng-Hui <shhuiw@gmail.com>
 Wangyu (Eric) <seven.wangyu@huawei.com>
 Waterman Cao <waterman.cao@intel.com>
+Wathsala Vithanage <wathsala.vithanage@arm.com>
 Weichun Chen <weichunx.chen@intel.com>
 Wei Dai <wei.dai@intel.com>
 Weifeng Li <liweifeng96@126.com>
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index f895950df4..1895f2bb0e 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -24,6 +24,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
 	if (!single)
 		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
 
+	/*
+	 * Updating of ht->tail cannot happen before elements are added to or
+	 * removed from the ring, as it could result in data races between
+	 * producer and consumer threads. Therefore ht->tail should be updated
+	 * with release semantics to prevent ring data copy phase from sinking
+	 * below it.
+	 */
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
 
@@ -69,11 +76,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
-		 */
 		cons_tail = __atomic_load_n(&r->cons.tail,
-					__ATOMIC_ACQUIRE);
+					__ATOMIC_RELAXED);
 
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
@@ -145,12 +149,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* this load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
-		 */
 		prod_tail = __atomic_load_n(&r->prod.tail,
-					__ATOMIC_ACQUIRE);
-
+					__ATOMIC_RELAXED);
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
-- 
2.25.1


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-06-08 13:21 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-21 19:14 [RFC] ring: improve ring performance with C11 atomics Wathsala Vithanage
2023-04-21 21:27 ` Stephen Hemminger
2023-04-21 19:16 Wathsala Vithanage
2023-04-22 13:19 ` Morten Brørup
2023-06-08 13:21 ` Wathsala Wathawana Vithanage

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).