DPDK patches and discussions
* [dpdk-dev] [PATCH] ring: fix c11 memory ordering issue
@ 2018-08-06  1:18 Gavin Hu
  2018-08-06  9:19 ` Thomas Monjalon
  2018-08-07  3:19 ` [dpdk-dev] [PATCH v2] " Gavin Hu
  0 siblings, 2 replies; 131+ messages in thread
From: Gavin Hu @ 2018-08-06  1:18 UTC (permalink / raw)
  To: dev
  Cc: gavin.hu, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl,
	jerin.jacob, hemant.agrawal, jia.he, stable

1) In update_tail, read ht->tail using __atomic_load.
2) In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
   the do {} while loop, as old_head is updated on CAS failure, so
   another load is not necessary.
3) Synchronize the load-acquires of prod.tail and cons.tail with the
   store-releases of update_tail, which release all ring updates up to
   the value of ht->tail.
4) When calling __atomic_compare_exchange_n, use relaxed ordering for
   both success and failure: multiple threads can work independently on
   the same end of the ring (either enqueue or dequeue) without
   synchronization, unlike operations on the tail, which must complete
   in sequence.

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 94df3c4a6..cfa3be4a7 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -21,7 +21,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
+		while (unlikely(old_val != __atomic_load_n(&ht->tail,
+						__ATOMIC_RELAXED)))
 			rte_pause();
 
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
@@ -60,20 +61,24 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 	unsigned int max = n;
 	int success;
 
+	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_RELAXED);
 	do {
 		/* Reset n to the initial burst count */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->prod.head,
-					__ATOMIC_ACQUIRE);
 
-		/*
-		 *  The subtraction is done between two unsigned 32bits value
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
+							__ATOMIC_ACQUIRE);
+
+		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * *old_head > cons_tail). So 'free_entries' is always between 0
 		 * and capacity (which is < size).
 		 */
-		*free_entries = (capacity + r->cons.tail - *old_head);
+		*free_entries = (capacity + cons_tail - *old_head);
 
 		/* check that we have enough room in ring */
 		if (unlikely(n > *free_entries))
@@ -87,9 +92,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		if (is_sp)
 			r->prod.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head is updated */
 			success = __atomic_compare_exchange_n(&r->prod.head,
 					old_head, *new_head,
-					0, __ATOMIC_ACQUIRE,
+					/*weak=*/0, __ATOMIC_RELAXED,
 					__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
 	return n;
@@ -128,18 +134,23 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	int success;
 
 	/* move cons.head atomically */
+	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
-		*old_head = __atomic_load_n(&r->cons.head,
-					__ATOMIC_ACQUIRE);
+
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
+							__ATOMIC_ACQUIRE);
 
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
 		 * and size(ring)-1.
 		 */
-		*entries = (r->prod.tail - *old_head);
+		*entries = (prod_tail - *old_head);
 
 		/* Set the actual entries for dequeue */
 		if (n > *entries)
@@ -152,10 +163,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		if (is_sc)
 			r->cons.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head will be updated */
 			success = __atomic_compare_exchange_n(&r->cons.head,
-							old_head, *new_head,
-							0, __ATOMIC_ACQUIRE,
-							__ATOMIC_RELAXED);
+						old_head, *new_head,
+						/*weak=*/0, __ATOMIC_RELAXED,
+						__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
 	return n;
 }
-- 
2.11.0

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH] ring: fix c11 memory ordering issue
  2018-08-06  1:18 [dpdk-dev] [PATCH] ring: fix c11 memory ordering issue Gavin Hu
@ 2018-08-06  9:19 ` Thomas Monjalon
  2018-08-08  1:39   ` Gavin Hu
  2018-08-07  3:19 ` [dpdk-dev] [PATCH v2] " Gavin Hu
  1 sibling, 1 reply; 131+ messages in thread
From: Thomas Monjalon @ 2018-08-06  9:19 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl,
	jerin.jacob, hemant.agrawal, jia.he, stable

Hi,

Please start your patch by explaining the issue
you are solving.
What is the consequence of the bug on the behaviour?
What is the scope of the bug? only PPC?

06/08/2018 03:18, Gavin Hu:
> 1) In update_tail, read ht->tail using __atomic_load.
> 2) In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
>    the do {} while loop, as old_head is updated on CAS failure, so
>    another load is not necessary.
> 3) Synchronize the load-acquires of prod.tail and cons.tail with the
>    store-releases of update_tail, which release all ring updates up to
>    the value of ht->tail.
> 4) When calling __atomic_compare_exchange_n, use relaxed ordering for
>    both success and failure: multiple threads can work independently on
>    the same end of the ring (either enqueue or dequeue) without
>    synchronization, unlike operations on the tail, which must complete
>    in sequence.
> 
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
> 
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>

Your Signed-off-by should be first in the list.
These tags are added in the chronological order.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v2] ring: fix c11 memory ordering issue
  2018-08-06  1:18 [dpdk-dev] [PATCH] ring: fix c11 memory ordering issue Gavin Hu
  2018-08-06  9:19 ` Thomas Monjalon
@ 2018-08-07  3:19 ` Gavin Hu
  2018-08-07  5:56   ` He, Jia
                     ` (2 more replies)
  1 sibling, 3 replies; 131+ messages in thread
From: Gavin Hu @ 2018-08-07  3:19 UTC (permalink / raw)
  To: dev
  Cc: gavin.hu, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl,
	jerin.jacob, hemant.agrawal, jia.he, stable

This patch includes two bug fixes (#1 and #2) and two optimisations (#3
and #4).
1) In update_tail, read ht->tail using __atomic_load. Although the
   compiler currently seems to be doing the right thing even without
   __atomic_load, we don't want to give the compiler the freedom to
   optimise what should be an atomic load; it should not be arbitrarily
   moved around.
2) Synchronize the load-acquire of the tail with the store-release
   within update_tail. The store-release ensures all the ring
   operations, enqueue or dequeue, are seen by the observers as soon as
   they see the updated tail. The load-acquire is required to correctly
   compute the free_entries or avail_entries, respectively for enqueue
   and dequeue operations; the data dependency is not reliable for
   ordering, as the compiler might break it by saving to temporary
   values to boost performance.
3) In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
   the do {} while loop, as old_head is updated on CAS failure, so
   another load is costly and not necessary.
4) When calling __atomic_compare_exchange_n, use relaxed ordering for
   both success and failure: multiple threads can work independently on
   the same end of the ring (either enqueue or dequeue) without
   synchronization, unlike operations on the tail, which must complete
   in sequence.

The patch was benchmarked with test/ring_perf_autotest; it decreases
the enqueue/dequeue latency by 5% ~ 24.6% with two lcores. The real
gains depend on the number of lcores, the depth of the ring, and SPSC
vs MPMC. For 1 lcore, it also improves a little, about 3 ~ 4%.
The biggest improvement is the MPMC case with a ring size of 32, which
saves up to (6.90-5.20)/6.90 = 24.6% of the latency.

Test result data (first block before the patch, second block after):

SP/SC bulk enq/dequeue (size: 8): 13.19
MP/MC bulk enq/dequeue (size: 8): 25.79
SP/SC bulk enq/dequeue (size: 32): 3.85
MP/MC bulk enq/dequeue (size: 32): 6.90

SP/SC bulk enq/dequeue (size: 8): 12.05
MP/MC bulk enq/dequeue (size: 8): 23.06
SP/SC bulk enq/dequeue (size: 32): 3.62
MP/MC bulk enq/dequeue (size: 32): 5.20

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 94df3c4a6..cfa3be4a7 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -21,7 +21,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
+		while (unlikely(old_val != __atomic_load_n(&ht->tail,
+						__ATOMIC_RELAXED)))
 			rte_pause();
 
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
@@ -60,20 +61,24 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 	unsigned int max = n;
 	int success;
 
+	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_RELAXED);
 	do {
 		/* Reset n to the initial burst count */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->prod.head,
-					__ATOMIC_ACQUIRE);
 
-		/*
-		 *  The subtraction is done between two unsigned 32bits value
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
+							__ATOMIC_ACQUIRE);
+
+		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * *old_head > cons_tail). So 'free_entries' is always between 0
 		 * and capacity (which is < size).
 		 */
-		*free_entries = (capacity + r->cons.tail - *old_head);
+		*free_entries = (capacity + cons_tail - *old_head);
 
 		/* check that we have enough room in ring */
 		if (unlikely(n > *free_entries))
@@ -87,9 +92,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		if (is_sp)
 			r->prod.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head is updated */
 			success = __atomic_compare_exchange_n(&r->prod.head,
 					old_head, *new_head,
-					0, __ATOMIC_ACQUIRE,
+					/*weak=*/0, __ATOMIC_RELAXED,
 					__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
 	return n;
@@ -128,18 +134,23 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	int success;
 
 	/* move cons.head atomically */
+	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
-		*old_head = __atomic_load_n(&r->cons.head,
-					__ATOMIC_ACQUIRE);
+
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
+							__ATOMIC_ACQUIRE);
 
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
 		 * and size(ring)-1.
 		 */
-		*entries = (r->prod.tail - *old_head);
+		*entries = (prod_tail - *old_head);
 
 		/* Set the actual entries for dequeue */
 		if (n > *entries)
@@ -152,10 +163,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		if (is_sc)
 			r->cons.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head will be updated */
 			success = __atomic_compare_exchange_n(&r->cons.head,
-							old_head, *new_head,
-							0, __ATOMIC_ACQUIRE,
-							__ATOMIC_RELAXED);
+						old_head, *new_head,
+						/*weak=*/0, __ATOMIC_RELAXED,
+						__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
 	return n;
 }
-- 
2.11.0

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v2] ring: fix c11 memory ordering issue
  2018-08-07  3:19 ` [dpdk-dev] [PATCH v2] " Gavin Hu
@ 2018-08-07  5:56   ` He, Jia
  2018-08-07  7:56     ` Gavin Hu
  2018-09-17  7:47   ` [dpdk-dev] [PATCH v3 1/3] app/testpmd: show errno along with flow API errors Gavin Hu
  2018-09-17  8:17   ` [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load Gavin Hu
  2 siblings, 1 reply; 131+ messages in thread
From: He, Jia @ 2018-08-07  5:56 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl, jerin.jacob,
	hemant.agrawal, stable

Hi Gavin
> -----Original Message-----
> From: Gavin Hu [mailto:gavin.hu@arm.com]
> Sent: 2018年8月7日 11:20
> To: dev@dpdk.org
> Cc: gavin.hu@arm.com; Honnappa.Nagarahalli@arm.com;
> steve.capper@arm.com; Ola.Liljedahl@arm.com;
> jerin.jacob@caviumnetworks.com; hemant.agrawal@nxp.com; He, Jia
> <jia.he@hxt-semitech.com>; stable@dpdk.org
> Subject: [PATCH v2] ring: fix c11 memory ordering issue
> 
> This patch includes two bug fixes(#1 and 2) and two optimisations(#3 and #4).

Maybe you need to split this into smaller parts.

> 1) In update_tail, read ht->tail using __atomic_load. Although the
>    compiler currently seems to be doing the right thing even without
>    __atomic_load, we don't want to give the compiler the freedom to
>    optimise what should be an atomic load; it should not be arbitrarily
>    moved around.
> 2) Synchronize the load-acquire of the tail with the store-release
>    within update_tail. The store-release ensures all the ring
>    operations, enqueue or dequeue, are seen by the observers as soon as
>    they see the updated tail. The load-acquire is required to correctly
>    compute the free_entries or avail_entries, respectively for enqueue
>    and dequeue operations; the data dependency is not reliable for
>    ordering, as the compiler might break it by saving to temporary
>    values to boost performance.

Could you describe the race condition in detail?
e.g.
cpu 1			cpu2
code1
				code2

Cheers,
Jia
> 3) In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
>    the do {} while loop, as old_head is updated on CAS failure, so
>    another load is costly and not necessary.
> 4) When calling __atomic_compare_exchange_n, use relaxed ordering for
>    both success and failure: multiple threads can work independently on
>    the same end of the ring (either enqueue or dequeue) without
>    synchronization, unlike operations on the tail, which must complete
>    in sequence.
> 
> The patch was benchmarked with test/ring_perf_autotest and it decreases the
> enqueue/dequeue latency by 5% ~ 24.6% with two lcores, the real gains are
> dependent on the number of lcores, depth of the ring, SPSC or MPMC.
> For 1 lcore, it also improves a little, about 3 ~ 4%.
> It is a big improvement, in case of MPMC, with rings size of 32, it saves latency up
> to (6.90-5.20)/6.90 = 24.6%.
> 
> Test result data:
> 
> SP/SC bulk enq/dequeue (size: 8): 13.19
> MP/MC bulk enq/dequeue (size: 8): 25.79
> SP/SC bulk enq/dequeue (size: 32): 3.85
> MP/MC bulk enq/dequeue (size: 32): 6.90
> 
> SP/SC bulk enq/dequeue (size: 8): 12.05
> MP/MC bulk enq/dequeue (size: 8): 23.06
> SP/SC bulk enq/dequeue (size: 32): 3.62
> MP/MC bulk enq/dequeue (size: 32): 5.20
> 
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> ---
>  lib/librte_ring/rte_ring_c11_mem.h | 38
> +++++++++++++++++++++++++-------------
>  1 file changed, 25 insertions(+), 13 deletions(-)
> 
> diff --git a/lib/librte_ring/rte_ring_c11_mem.h
> b/lib/librte_ring/rte_ring_c11_mem.h
> index 94df3c4a6..cfa3be4a7 100644
> --- a/lib/librte_ring/rte_ring_c11_mem.h
> +++ b/lib/librte_ring/rte_ring_c11_mem.h
> @@ -21,7 +21,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
>  	 * we need to wait for them to complete
>  	 */
>  	if (!single)
> -		while (unlikely(ht->tail != old_val))
> +		while (unlikely(old_val != __atomic_load_n(&ht->tail,
> +						__ATOMIC_RELAXED)))
>  			rte_pause();
> 
>  	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> @@ -60,20 +61,24 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  	unsigned int max = n;
>  	int success;
> 
> +	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_RELAXED);
>  	do {
>  		/* Reset n to the initial burst count */
>  		n = max;
> 
> -		*old_head = __atomic_load_n(&r->prod.head,
> -					__ATOMIC_ACQUIRE);
> 
> -		/*
> -		 *  The subtraction is done between two unsigned 32bits value
> +		/* load-acquire synchronize with store-release of ht->tail
> +		 * in update_tail.
> +		 */
> +		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
> +							__ATOMIC_ACQUIRE);
> +
> +		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * *old_head > cons_tail). So 'free_entries' is always between 0
>  		 * and capacity (which is < size).
>  		 */
> -		*free_entries = (capacity + r->cons.tail - *old_head);
> +		*free_entries = (capacity + cons_tail - *old_head);
> 
>  		/* check that we have enough room in ring */
>  		if (unlikely(n > *free_entries))
> @@ -87,9 +92,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  		if (is_sp)
>  			r->prod.head = *new_head, success = 1;
>  		else
> +			/* on failure, *old_head is updated */
>  			success = __atomic_compare_exchange_n(&r->prod.head,
>  					old_head, *new_head,
> -					0, __ATOMIC_ACQUIRE,
> +					/*weak=*/0, __ATOMIC_RELAXED,
>  					__ATOMIC_RELAXED);
>  	} while (unlikely(success == 0));
>  	return n;
> @@ -128,18 +134,23 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  	int success;
> 
>  	/* move cons.head atomically */
> +	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);
>  	do {
>  		/* Restore n as it may change every loop */
>  		n = max;
> -		*old_head = __atomic_load_n(&r->cons.head,
> -					__ATOMIC_ACQUIRE);
> +
> +		/* this load-acquire synchronize with store-release of ht->tail
> +		 * in update_tail.
> +		 */
> +		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
> +							__ATOMIC_ACQUIRE);
> 
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * cons_head > prod_tail). So 'entries' is always between 0
>  		 * and size(ring)-1.
>  		 */
> -		*entries = (r->prod.tail - *old_head);
> +		*entries = (prod_tail - *old_head);
> 
>  		/* Set the actual entries for dequeue */
>  		if (n > *entries)
> @@ -152,10 +163,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  		if (is_sc)
>  			r->cons.head = *new_head, success = 1;
>  		else
> +			/* on failure, *old_head will be updated */
>  			success = __atomic_compare_exchange_n(&r->cons.head,
> -							old_head, *new_head,
> -							0, __ATOMIC_ACQUIRE,
> -							__ATOMIC_RELAXED);
> +						old_head, *new_head,
> +						/*weak=*/0, __ATOMIC_RELAXED,
> +						__ATOMIC_RELAXED);
>  	} while (unlikely(success == 0));
>  	return n;
>  }
> --
> 2.11.0


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v2] ring: fix c11 memory ordering issue
  2018-08-07  5:56   ` He, Jia
@ 2018-08-07  7:56     ` Gavin Hu
  2018-08-08  3:07       ` Jerin Jacob
  0 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu @ 2018-08-07  7:56 UTC (permalink / raw)
  To: He, Jia, dev
  Cc: Honnappa Nagarahalli, Steve Capper, Ola Liljedahl, jerin.jacob,
	hemant.agrawal, stable

Hi Jia,

Thanks for your feedback; let's see if there are requests from others to split the fix.

There is a race condition between updating the tail and getting free_/avail_entries, which depends on the tails; it should be synchronized by load-acquire and store-release. In simple words, in the outline below, steps #1 and #5 should be synchronized with each other, mutually; otherwise the free_/avail_entries calculation can get a wrong result.

On each lcore, whether enqueuing or dequeuing, the order of steps #1 and #2 must be maintained, as #2 has a dependency on #1. That's why acquire ordering is necessary.

Please raise new questions if I don't get this across clearly.

Ring enqueue / lcore #0                Ring dequeue / lcore #1
1. Load-acquire cons_tail              1. Load-acquire prod_tail
2. Get free_entries                    2. Get avail_entries
3. Move prod_head accordingly          3. Move cons_head accordingly
4. Do enqueue operations               4. Do dequeue operations
5. Store-release prod_tail             5. Store-release cons_tail

Best Regards,
Gavin
-----Original Message-----
From: He, Jia <jia.he@hxt-semitech.com>
Sent: Tuesday, August 7, 2018 1:57 PM
To: Gavin Hu <Gavin.Hu@arm.com>; dev@dpdk.org
Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>; jerin.jacob@caviumnetworks.com; hemant.agrawal@nxp.com; stable@dpdk.org
Subject: RE: [PATCH v2] ring: fix c11 memory ordering issue

Hi Gavin
> -----Original Message-----
> From: Gavin Hu [mailto:gavin.hu@arm.com]
> Sent: 2018年8月7日 11:20
> To: dev@dpdk.org
> Cc: gavin.hu@arm.com; Honnappa.Nagarahalli@arm.com;
> steve.capper@arm.com; Ola.Liljedahl@arm.com;
> jerin.jacob@caviumnetworks.com; hemant.agrawal@nxp.com; He, Jia
> <jia.he@hxt-semitech.com>; stable@dpdk.org
> Subject: [PATCH v2] ring: fix c11 memory ordering issue
>
> This patch includes two bug fixes (#1 and #2) and two optimisations (#3 and #4).

Maybe you need to split this into small parts.

> 1) In update_tail, read ht->tail using __atomic_load. Although the
>    compiler currently seems to be doing the right thing even without
>    __atomic_load, we don't want to give the compiler the freedom to
>    optimise what should be an atomic load; it should not be arbitrarily
>    moved around.
> 2) Synchronize the load-acquire of the tail with the store-release
>    within update_tail. The store-release ensures all the ring
>    operations, enqueue or dequeue, are seen by the observers as soon as
>    they see the updated tail. The load-acquire is required to correctly
>    compute the free_entries or avail_entries, respectively for enqueue
>    and dequeue operations; the data dependency is not reliable for
>    ordering, as the compiler might break it by saving to temporary
>    values to boost performance.

Could you describe the race condition in detail?
e.g.
cpu 1			cpu2
code1
				code2

Cheers,
Jia
> 3) In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
>    the do {} while loop, as old_head is updated on CAS failure, so
>    another load is costly and not necessary.
> 4) When calling __atomic_compare_exchange_n, use relaxed ordering for
>    both success and failure: multiple threads can work independently on
>    the same end of the ring (either enqueue or dequeue) without
>    synchronization, unlike operations on the tail, which has to be
>    finished in sequence.
>
> The patch was benchmarked with test/ring_perf_autotest and it
> decreases the enqueue/dequeue latency by 5% ~ 24.6% with two lcores,
> the real gains are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
> For 1 lcore, it also improves a little, about 3 ~ 4%.
> It is a big improvement, in case of MPMC, with rings size of 32, it
> saves latency up to (6.90-5.20)/6.90 = 24.6%.
>
> Test result data:
>
> SP/SC bulk enq/dequeue (size: 8): 13.19
> MP/MC bulk enq/dequeue (size: 8): 25.79
> SP/SC bulk enq/dequeue (size: 32): 3.85
> MP/MC bulk enq/dequeue (size: 32): 6.90
>
> SP/SC bulk enq/dequeue (size: 8): 12.05
> MP/MC bulk enq/dequeue (size: 8): 23.06
> SP/SC bulk enq/dequeue (size: 32): 3.62
> MP/MC bulk enq/dequeue (size: 32): 5.20
>
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> ---
>  lib/librte_ring/rte_ring_c11_mem.h | 38
> +++++++++++++++++++++++++-------------
>  1 file changed, 25 insertions(+), 13 deletions(-)
>
> diff --git a/lib/librte_ring/rte_ring_c11_mem.h
> b/lib/librte_ring/rte_ring_c11_mem.h
> index 94df3c4a6..cfa3be4a7 100644
> --- a/lib/librte_ring/rte_ring_c11_mem.h
> +++ b/lib/librte_ring/rte_ring_c11_mem.h
> @@ -21,7 +21,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
>  	 * we need to wait for them to complete
>  	 */
>  	if (!single)
> -		while (unlikely(ht->tail != old_val))
> +		while (unlikely(old_val != __atomic_load_n(&ht->tail,
> +						__ATOMIC_RELAXED)))
>  			rte_pause();
>
>  	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> @@ -60,20 +61,24 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  	unsigned int max = n;
>  	int success;
>
> +	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_RELAXED);
>  	do {
>  		/* Reset n to the initial burst count */
>  		n = max;
>
> -		*old_head = __atomic_load_n(&r->prod.head,
> -					__ATOMIC_ACQUIRE);
>
> -		/*
> -		 *  The subtraction is done between two unsigned 32bits value
> +		/* load-acquire synchronize with store-release of ht->tail
> +		 * in update_tail.
> +		 */
> +		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
> +							__ATOMIC_ACQUIRE);
> +
> +		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * *old_head > cons_tail). So 'free_entries' is always between 0
>  		 * and capacity (which is < size).
>  		 */
> -		*free_entries = (capacity + r->cons.tail - *old_head);
> +		*free_entries = (capacity + cons_tail - *old_head);
>
>  		/* check that we have enough room in ring */
>  		if (unlikely(n > *free_entries))
> @@ -87,9 +92,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  		if (is_sp)
>  			r->prod.head = *new_head, success = 1;
>  		else
> +			/* on failure, *old_head is updated */
>  			success = __atomic_compare_exchange_n(&r->prod.head,
>  					old_head, *new_head,
> -					0, __ATOMIC_ACQUIRE,
> +					/*weak=*/0, __ATOMIC_RELAXED,
>  					__ATOMIC_RELAXED);
>  	} while (unlikely(success == 0));
>  	return n;
> @@ -128,18 +134,23 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  	int success;
>
>  	/* move cons.head atomically */
> +	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);
>  	do {
>  		/* Restore n as it may change every loop */
>  		n = max;
> -		*old_head = __atomic_load_n(&r->cons.head,
> -					__ATOMIC_ACQUIRE);
> +
> +		/* this load-acquire synchronize with store-release of ht->tail
> +		 * in update_tail.
> +		 */
> +		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
> +							__ATOMIC_ACQUIRE);
>
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * cons_head > prod_tail). So 'entries' is always between 0
>  		 * and size(ring)-1.
>  		 */
> -		*entries = (r->prod.tail - *old_head);
> +		*entries = (prod_tail - *old_head);
>
>  		/* Set the actual entries for dequeue */
>  		if (n > *entries)
> @@ -152,10 +163,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  		if (is_sc)
>  			r->cons.head = *new_head, success = 1;
>  		else
> +			/* on failure, *old_head will be updated */
>  			success = __atomic_compare_exchange_n(&r->cons.head,
> -							old_head, *new_head,
> -							0, __ATOMIC_ACQUIRE,
> -							__ATOMIC_RELAXED);
> +						old_head, *new_head,
> +						/*weak=*/0, __ATOMIC_RELAXED,
> +						__ATOMIC_RELAXED);
>  	} while (unlikely(success == 0));
>  	return n;
>  }
> --
> 2.11.0

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH] ring: fix c11 memory ordering issue
  2018-08-06  9:19 ` Thomas Monjalon
@ 2018-08-08  1:39   ` Gavin Hu
  0 siblings, 0 replies; 131+ messages in thread
From: Gavin Hu @ 2018-08-08  1:39 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Honnappa Nagarahalli, Steve Capper, Ola Liljedahl,
	jerin.jacob, hemant.agrawal, jia.he, stable

Hi Thomas,

I updated the commit message in my v2 patch.
Could you check whether your questions are answered by the new commit message and this mail?

And what is your opinion on splitting the patch into 4 smaller ones (2 bug fixes and 2 optimizations)? I got this comment from Jia He, who is the author of this file.

>> What is the consequence of the bug on the behaviour?
The potential effect is reading stale values.

>> What is the scope of the bug? only PPC?
The scope is all weakly ordered architectures, such as ARM (32- and 64-bit) and POWER.


-----Original Message-----
From: Thomas Monjalon <thomas@monjalon.net>
Sent: 2018年8月6日 17:19
To: Gavin Hu <Gavin.Hu@arm.com>
Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>; jerin.jacob@caviumnetworks.com; hemant.agrawal@nxp.com; jia.he@hxt-semitech.com; stable@dpdk.org
Subject: Re: [dpdk-dev] [PATCH] ring: fix c11 memory ordering issue

Hi,

Please start your patch by explaining the issue you are solving.
What is the consequence of the bug on the behaviour?
What is the scope of the bug? only PPC?

06/08/2018 03:18, Gavin Hu:
> 1) In update_tail, read ht->tail using __atomic_load.
> 2) In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
>    the do {} while loop, as old_head is updated on CAS failure, so
>    another load is not necessary.
> 3) Synchronize the load-acquires of prod.tail and cons.tail with the
>    store-releases of update_tail, which release all ring updates up to
>    the value of ht->tail.
> 4) When calling __atomic_compare_exchange_n, use relaxed ordering for
>    both success and failure: multiple threads can work independently on
>    the same end of the ring (either enqueue or dequeue) without
>    synchronization, unlike operations on the tail, which must complete
>    in sequence.
>
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>

Your Signed-off-by should be first in the list.
These tags are added in the chronological order.


IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v2] ring: fix c11 memory ordering issue
  2018-08-07  7:56     ` Gavin Hu
@ 2018-08-08  3:07       ` Jerin Jacob
  2018-08-08  7:23         ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
  0 siblings, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2018-08-08  3:07 UTC (permalink / raw)
  To: Gavin Hu
  Cc: He, Jia, dev, Honnappa Nagarahalli, Steve Capper, Ola Liljedahl,
	hemant.agrawal, stable

-----Original Message-----
> Date: Tue, 7 Aug 2018 07:56:08 +0000
> From: Gavin Hu <Gavin.Hu@arm.com>
> To: "He, Jia" <jia.he@hxt-semitech.com>, "dev@dpdk.org" <dev@dpdk.org>
> CC: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>, Steve Capper
>  <Steve.Capper@arm.com>, Ola Liljedahl <Ola.Liljedahl@arm.com>,
>  "jerin.jacob@caviumnetworks.com" <jerin.jacob@caviumnetworks.com>,
>  "hemant.agrawal@nxp.com" <hemant.agrawal@nxp.com>, "stable@dpdk.org"
>  <stable@dpdk.org>
> Subject: RE: [PATCH v2] ring: fix c11 memory ordering issue
> 
> 
> Hi Jia,
> 
> Thanks for your feedback; let's see if there are requests from others to split the fix.

+1 to split it as small patches. If possible, 4 patches, 2 for bug fix
and 2 for optimization.

Since you mentioned the overall performance improvement data, please add it
per optimization patch if possible.

> 
> There is a race condition between updating the tail and computing free_/avail_entries, which depend on the tails.
> The two should be synchronized by load-acquire and store-release. In the simple terms below, steps #1 and #5 must synchronize with each other, mutually; otherwise the free_/avail_entries calculation can go wrong.
> 
> On each lcore, whether enqueueing or dequeueing, the order of steps #1 and #2 must be maintained, as #2 depends on #1;
> that's why acquire ordering is necessary.
> 
> Please raise new questions if I don't get across this clearly.
> 
> Ring enqueue / lcore #0              Ring dequeue / lcore #1
> 1. Load-acquire prod_tail            1. Load-acquire cons_tail
> 2. Get free_entries                  2. Get avail_entries
> 3. Move prod_head accordingly        3. Move cons_head accordingly
> 4. Do enqueue operations             4. Do dequeue operations
> 5. Store-release prod_tail           5. Store-release cons_tail
> 
> Best Regards,
> Gavin
> -----Original Message-----
> From: He, Jia <jia.he@hxt-semitech.com>
> Sent: Tuesday, August 7, 2018 1:57 PM
> To: Gavin Hu <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>; jerin.jacob@caviumnetworks.com; hemant.agrawal@nxp.com; stable@dpdk.org
> Subject: RE: [PATCH v2] ring: fix c11 memory ordering issue
> 
> Hi Gavin
> > -----Original Message-----
> > From: Gavin Hu [mailto:gavin.hu@arm.com]
> > Sent: 2018年8月7日 11:20
> > To: dev@dpdk.org
> > Cc: gavin.hu@arm.com; Honnappa.Nagarahalli@arm.com;
> > steve.capper@arm.com; Ola.Liljedahl@arm.com;
> > jerin.jacob@caviumnetworks.com; hemant.agrawal@nxp.com; He, Jia
> > <jia.he@hxt-semitech.com>; stable@dpdk.org
> > Subject: [PATCH v2] ring: fix c11 memory ordering issue
> >
> > This patch includes two bug fixes (#1 and #2) and two optimisations (#3 and #4).
> 
> Maybe you need to split this into small parts.
> 
> > 1) In update_tail, read ht->tail using __atomic_load. Although the
> >    compiler currently seems to be doing the right thing even without
> >    __atomic_load, we don't want to give the compiler freedom to optimise
> >    what should be an atomic load; it should not be arbitrarily moved
> >    around.
> > 2) Synchronize the load-acquire of the tail with the store-release
> >    within update_tail; the store-release ensures all the ring operations,
> >    enqueue or dequeue, are seen by the observers as soon as they see
> >    the updated tail. The load-acquire is required to correctly compute
> >    the free_entries or avail_entries, respectively for enqueue and
> >    dequeue operations; the data dependency is not reliable for ordering,
> >    as the compiler might break it by saving to temporary values to boost
> >    performance.
> 
> Could you describe the race condition in details?
> e.g.
> cpu1                  cpu2
> code1
>                       code2
> 
> Cheers,
> Jia
> > 3) In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
> >    the do {} while loop, as upon failure the old_head will be updated;
> >    another load is costly and not necessary.
> > 4) When calling __atomic_compare_exchange_n, use relaxed ordering for
> >    both success and failure, as multiple threads can work independently
> >    on the same end of the ring (either enqueue or dequeue) without
> >    synchronization, unlike operations on the tail, which have to finish
> >    in sequence.
> >
> > The patch was benchmarked with test/ring_perf_autotest and it
> > decreases the enqueue/dequeue latency by 5% ~ 24.6% with two lcores;
> > the real gains depend on the number of lcores, the depth of the ring,
> > and SPSC vs MPMC. For 1 lcore, it also improves a little, about 3 ~ 4%.
> > It is a big improvement: in the MPMC case with a ring size of 32, it
> > saves latency up to (6.90-5.20)/6.90 = 24.6%.
> >
> > Test result data:
> >
> > Without the patch:
> >  SP/SC bulk enq/dequeue (size: 8): 13.19
> >  MP/MC bulk enq/dequeue (size: 8): 25.79
> >  SP/SC bulk enq/dequeue (size: 32): 3.85
> >  MP/MC bulk enq/dequeue (size: 32): 6.90
> >
> > With the patch:
> >  SP/SC bulk enq/dequeue (size: 8): 12.05
> >  MP/MC bulk enq/dequeue (size: 8): 23.06
> >  SP/SC bulk enq/dequeue (size: 32): 3.62
> >  MP/MC bulk enq/dequeue (size: 32): 5.20
> >
> > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > ---
> >  lib/librte_ring/rte_ring_c11_mem.h | 38
> > +++++++++++++++++++++++++-------------
> >  1 file changed, 25 insertions(+), 13 deletions(-)
> >
> > diff --git a/lib/librte_ring/rte_ring_c11_mem.h
> > b/lib/librte_ring/rte_ring_c11_mem.h
> > index 94df3c4a6..cfa3be4a7 100644
> > --- a/lib/librte_ring/rte_ring_c11_mem.h
> > +++ b/lib/librte_ring/rte_ring_c11_mem.h
> > @@ -21,7 +21,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
> >  	 * we need to wait for them to complete
> >  	 */
> >  	if (!single)
> > -		while (unlikely(ht->tail != old_val))
> > +		while (unlikely(old_val != __atomic_load_n(&ht->tail,
> > +						__ATOMIC_RELAXED)))
> >  			rte_pause();
> >
> >  	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > @@ -60,20 +61,24 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> >  	unsigned int max = n;
> >  	int success;
> >
> > +	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_RELAXED);
> >  	do {
> >  		/* Reset n to the initial burst count */
> >  		n = max;
> >
> > -		*old_head = __atomic_load_n(&r->prod.head,
> > -					__ATOMIC_ACQUIRE);
> >
> > -		/*
> > -		 *  The subtraction is done between two unsigned 32bits value
> > +		/* load-acquire synchronize with store-release of ht->tail
> > +		 * in update_tail.
> > +		 */
> > +		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
> > +							__ATOMIC_ACQUIRE);
> > +
> > +		/* The subtraction is done between two unsigned 32bits value
> >  		 * (the result is always modulo 32 bits even if we have
> >  		 * *old_head > cons_tail). So 'free_entries' is always between 0
> >  		 * and capacity (which is < size).
> >  		 */
> > -		*free_entries = (capacity + r->cons.tail - *old_head);
> > +		*free_entries = (capacity + cons_tail - *old_head);
> >
> >  		/* check that we have enough room in ring */
> >  		if (unlikely(n > *free_entries))
> > @@ -87,9 +92,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> >  		if (is_sp)
> >  			r->prod.head = *new_head, success = 1;
> >  		else
> > +			/* on failure, *old_head is updated */
> >  			success = __atomic_compare_exchange_n(&r->prod.head,
> >  					old_head, *new_head,
> > -					0, __ATOMIC_ACQUIRE,
> > +					/*weak=*/0, __ATOMIC_RELAXED,
> >  					__ATOMIC_RELAXED);
> >  	} while (unlikely(success == 0));
> >  	return n;
> > @@ -128,18 +134,23 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> >  	int success;
> >
> >  	/* move cons.head atomically */
> > +	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);
> >  	do {
> >  		/* Restore n as it may change every loop */
> >  		n = max;
> > -		*old_head = __atomic_load_n(&r->cons.head,
> > -					__ATOMIC_ACQUIRE);
> > +
> > +		/* this load-acquire synchronize with store-release of ht->tail
> > +		 * in update_tail.
> > +		 */
> > +		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
> > +					__ATOMIC_ACQUIRE);
> >
> >  		/* The subtraction is done between two unsigned 32bits value
> >  		 * (the result is always modulo 32 bits even if we have
> >  		 * cons_head > prod_tail). So 'entries' is always between 0
> >  		 * and size(ring)-1.
> >  		 */
> > -		*entries = (r->prod.tail - *old_head);
> > +		*entries = (prod_tail - *old_head);
> >
> >  		/* Set the actual entries for dequeue */
> >  		if (n > *entries)
> > @@ -152,10 +163,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> >  		if (is_sc)
> >  			r->cons.head = *new_head, success = 1;
> >  		else
> > +			/* on failure, *old_head will be updated */
> >  			success = __atomic_compare_exchange_n(&r->cons.head,
> > -							old_head, *new_head,
> > -							0, __ATOMIC_ACQUIRE,
> > -							__ATOMIC_RELAXED);
> > +							old_head, *new_head,
> > +							/*weak=*/0, __ATOMIC_RELAXED,
> > +							__ATOMIC_RELAXED);
> >  	} while (unlikely(success == 0));
> >  	return n;
> >  }
> > --
> > 2.11.0
> 

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [dpdk-stable] [PATCH v2] ring: fix c11 memory ordering issue
  2018-08-08  3:07       ` Jerin Jacob
@ 2018-08-08  7:23         ` Thomas Monjalon
  0 siblings, 0 replies; 131+ messages in thread
From: Thomas Monjalon @ 2018-08-08  7:23 UTC (permalink / raw)
  To: Gavin Hu
  Cc: stable, Jerin Jacob, He, Jia, dev, Honnappa Nagarahalli,
	Steve Capper, Ola Liljedahl, hemant.agrawal

08/08/2018 05:07, Jerin Jacob:
> From: Gavin Hu <Gavin.Hu@arm.com>
> > 
> > Hi Jia,
> > 
> > Thanks for your feedback; let's see if there are requests from others to split the fix.
> 
> +1 to split it as small patches. If possible, 4 patches, 2 for bug fix
> and 2 for optimization.
> 
> Since you mentioned the overall performance improvement data, please add it
> per optimization patch if possible.

Yes each fix deserves a separate patch with full explanation
of the bug, how it is fixed and how much it is improved.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v3 1/3] app/testpmd: show errno along with flow API errors
  2018-08-07  3:19 ` [dpdk-dev] [PATCH v2] " Gavin Hu
  2018-08-07  5:56   ` He, Jia
@ 2018-09-17  7:47   ` Gavin Hu
  2018-09-17  7:47     ` [dpdk-dev] [PATCH v3 2/3] net/i40e: remove invalid comment Gavin Hu
                       ` (2 more replies)
  2018-09-17  8:17   ` [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load Gavin Hu
  2 siblings, 3 replies; 131+ messages in thread
From: Gavin Hu @ 2018-09-17  7:47 UTC (permalink / raw)
  To: dev
  Cc: gavin.hu, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl,
	jerin.jacob, nd, Adrien Mazarguil

From: Adrien Mazarguil <adrien.mazarguil@6wind.com>

Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 app/test-pmd/config.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 92686a05f..a0f934932 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -1346,11 +1346,12 @@ port_flow_complain(struct rte_flow_error *error)
 		errstr = "unknown type";
 	else
 		errstr = errstrlist[error->type];
-	printf("Caught error type %d (%s): %s%s\n",
+	printf("Caught error type %d (%s): %s%s: %s\n",
 	       error->type, errstr,
 	       error->cause ? (snprintf(buf, sizeof(buf), "cause: %p, ",
 					error->cause), buf) : "",
-	       error->message ? error->message : "(no stated reason)");
+	       error->message ? error->message : "(no stated reason)",
+	       rte_strerror(err));
 	return -err;
 }
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v3 2/3] net/i40e: remove invalid comment
  2018-09-17  7:47   ` [dpdk-dev] [PATCH v3 1/3] app/testpmd: show errno along with flow API errors Gavin Hu
@ 2018-09-17  7:47     ` Gavin Hu
  2018-09-17  8:25       ` Gavin Hu (Arm Technology China)
  2018-09-17  7:47     ` [dpdk-dev] [PATCH v3 3/3] doc: add cross compile part for sample applications Gavin Hu
  2018-09-17  8:11     ` [dpdk-dev] [PATCH v4 1/4] bus/fslmc: fix undefined reference of memsegs Gavin Hu
  2 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu @ 2018-09-17  7:47 UTC (permalink / raw)
  To: dev
  Cc: gavin.hu, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl,
	jerin.jacob, nd, Ferruh Yigit

From: Ferruh Yigit <ferruh.yigit@intel.com>

Comment says "no csum error report support" but there is no check
related to csum offloads. Removing the comment.

Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Qi Zhang <qi.z.zhang@intel.com>
---
 drivers/net/i40e/i40e_rxtx_vec_common.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index 63cb17742..f00f6d648 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -199,9 +199,7 @@ i40e_rx_vec_dev_conf_condition_check_default(struct rte_eth_dev *dev)
 	if (fconf->mode != RTE_FDIR_MODE_NONE)
 		return -1;
 
-	 /* - no csum error report support
-	 * - no header split support
-	 */
+	 /* no header split support */
 	if (rxmode->offloads & DEV_RX_OFFLOAD_HEADER_SPLIT)
 		return -1;
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v3 3/3] doc: add cross compile part for sample applications
  2018-09-17  7:47   ` [dpdk-dev] [PATCH v3 1/3] app/testpmd: show errno along with flow API errors Gavin Hu
  2018-09-17  7:47     ` [dpdk-dev] [PATCH v3 2/3] net/i40e: remove invalid comment Gavin Hu
@ 2018-09-17  7:47     ` Gavin Hu
  2018-09-17  9:48       ` Jerin Jacob
  2018-09-17 10:49       ` [dpdk-dev] [PATCH v4] " Gavin Hu
  2018-09-17  8:11     ` [dpdk-dev] [PATCH v4 1/4] bus/fslmc: fix undefined reference of memsegs Gavin Hu
  2 siblings, 2 replies; 131+ messages in thread
From: Gavin Hu @ 2018-09-17  7:47 UTC (permalink / raw)
  To: dev
  Cc: gavin.hu, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl,
	jerin.jacob, nd, stable

Fixes: 7cacb05655 ("doc: add generic build instructions for sample apps")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 doc/guides/sample_app_ug/compiling.rst | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/doc/guides/sample_app_ug/compiling.rst b/doc/guides/sample_app_ug/compiling.rst
index a2d75ed22..6f04743c8 100644
--- a/doc/guides/sample_app_ug/compiling.rst
+++ b/doc/guides/sample_app_ug/compiling.rst
@@ -9,7 +9,6 @@ This section explains how to compile the DPDK sample applications.
 To compile all the sample applications
 --------------------------------------
 
-
 Set the path to DPDK source code if its not set:
 
     .. code-block:: console
@@ -93,3 +92,17 @@ Build the application:
 
         export RTE_TARGET=build
         make
+
+To cross compile the sample application(s)
+------------------------------------------
+
+For cross compiling the sample application(s), please append 'CROSS=$(CROSS_COMPILER_PREFIX)' to the 'make' command.
+For example, for AArch64 cross compiling:
+
+    .. code-block:: console
+
+        export RTE_TARGET=build
+        export RTE_SDK=/path/to/rte_sdk
+        make -C examples CROSS=aarch64-linux-gnu-
+               or
+        make CROSS=aarch64-linux-gnu-
-- 
2.11.0

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v4 1/4] bus/fslmc: fix undefined reference of memsegs
  2018-09-17  7:47   ` [dpdk-dev] [PATCH v3 1/3] app/testpmd: show errno along with flow API errors Gavin Hu
  2018-09-17  7:47     ` [dpdk-dev] [PATCH v3 2/3] net/i40e: remove invalid comment Gavin Hu
  2018-09-17  7:47     ` [dpdk-dev] [PATCH v3 3/3] doc: add cross compile part for sample applications Gavin Hu
@ 2018-09-17  8:11     ` Gavin Hu
  2018-09-17  8:11       ` [dpdk-dev] [PATCH v4 2/4] ring: read tail using atomic load Gavin Hu
                         ` (2 more replies)
  2 siblings, 3 replies; 131+ messages in thread
From: Gavin Hu @ 2018-09-17  8:11 UTC (permalink / raw)
  To: dev
  Cc: gavin.hu, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl,
	jerin.jacob, nd, Hemant Agrawal, stable

From: Hemant Agrawal <hemant.agrawal@nxp.com>

This patch fixes the undefined reference issue with rte_dpaa2_memsegs
when compiled in shared lib mode with EXTRA_CFLAGS="-g -O0".

Bugzilla ID: 61
Fixes: 365fb925d3b3 ("bus/fslmc: optimize physical to virtual address search")
Cc: stable@dpdk.org

Reported-by: Keith Wiles <keith.wiles@intel.com>
Signed-off-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
---
 drivers/bus/fslmc/portal/dpaa2_hw_dpbp.c            | 7 +++++++
 drivers/bus/fslmc/rte_bus_fslmc_version.map         | 1 +
 drivers/mempool/dpaa2/dpaa2_hw_mempool.c            | 7 -------
 drivers/mempool/dpaa2/rte_mempool_dpaa2_version.map | 1 -
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/bus/fslmc/portal/dpaa2_hw_dpbp.c b/drivers/bus/fslmc/portal/dpaa2_hw_dpbp.c
index 39c5adf..db49d63 100644
--- a/drivers/bus/fslmc/portal/dpaa2_hw_dpbp.c
+++ b/drivers/bus/fslmc/portal/dpaa2_hw_dpbp.c
@@ -28,6 +28,13 @@
 #include "portal/dpaa2_hw_pvt.h"
 #include "portal/dpaa2_hw_dpio.h"
 
+/* List of all the memseg information locally maintained in dpaa2 driver. This
+ * is to optimize the PA_to_VA searches until a better mechanism (algo) is
+ * available.
+ */
+struct dpaa2_memseg_list rte_dpaa2_memsegs
+	= TAILQ_HEAD_INITIALIZER(rte_dpaa2_memsegs);
+
 TAILQ_HEAD(dpbp_dev_list, dpaa2_dpbp_dev);
 static struct dpbp_dev_list dpbp_dev_list
 	= TAILQ_HEAD_INITIALIZER(dpbp_dev_list); /*!< DPBP device list */
diff --git a/drivers/bus/fslmc/rte_bus_fslmc_version.map b/drivers/bus/fslmc/rte_bus_fslmc_version.map
index fe45a11..b4a8817 100644
--- a/drivers/bus/fslmc/rte_bus_fslmc_version.map
+++ b/drivers/bus/fslmc/rte_bus_fslmc_version.map
@@ -114,5 +114,6 @@ DPDK_18.05 {
 	dpdmai_open;
 	dpdmai_set_rx_queue;
 	rte_dpaa2_free_dpci_dev;
+	rte_dpaa2_memsegs;
 
 } DPDK_18.02;
diff --git a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
index 7d0435f..84ff128 100644
--- a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
+++ b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
@@ -33,13 +33,6 @@
 struct dpaa2_bp_info rte_dpaa2_bpid_info[MAX_BPID];
 static struct dpaa2_bp_list *h_bp_list;
 
-/* List of all the memseg information locally maintained in dpaa2 driver. This
- * is to optimize the PA_to_VA searches until a better mechanism (algo) is
- * available.
- */
-struct dpaa2_memseg_list rte_dpaa2_memsegs
-	= TAILQ_HEAD_INITIALIZER(rte_dpaa2_memsegs);
-
 /* Dynamic logging identified for mempool */
 int dpaa2_logtype_mempool;
 
diff --git a/drivers/mempool/dpaa2/rte_mempool_dpaa2_version.map b/drivers/mempool/dpaa2/rte_mempool_dpaa2_version.map
index b9d996a..b45e7a9 100644
--- a/drivers/mempool/dpaa2/rte_mempool_dpaa2_version.map
+++ b/drivers/mempool/dpaa2/rte_mempool_dpaa2_version.map
@@ -3,7 +3,6 @@ DPDK_17.05 {
 
 	rte_dpaa2_bpid_info;
 	rte_dpaa2_mbuf_alloc_bulk;
-	rte_dpaa2_memsegs;
 
 	local: *;
 };
-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v4 2/4] ring: read tail using atomic load
  2018-09-17  8:11     ` [dpdk-dev] [PATCH v4 1/4] bus/fslmc: fix undefined reference of memsegs Gavin Hu
@ 2018-09-17  8:11       ` Gavin Hu
  2018-09-20  6:41         ` Jerin Jacob
  2018-09-17  8:11       ` [dpdk-dev] [PATCH v4 3/4] ring: synchronize the load and store of the tail Gavin Hu
  2018-09-17  8:11       ` [dpdk-dev] [PATCH v4 4/4] ring: move the atomic load of head above the loop Gavin Hu
  2 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu @ 2018-09-17  8:11 UTC (permalink / raw)
  To: dev
  Cc: gavin.hu, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl,
	jerin.jacob, nd, stable

In update_tail, read ht->tail using __atomic_load. Although the
compiler currently seems to be doing the right thing even without
__atomic_load, we don't want to give the compiler freedom to optimise
what should be an atomic load; it should not be arbitrarily moved
around.

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 94df3c4..234fea0 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -21,7 +21,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
+		while (unlikely(old_val != __atomic_load_n(&ht->tail,
+						__ATOMIC_RELAXED)))
 			rte_pause();
 
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v4 3/4] ring: synchronize the load and store of the tail
  2018-09-17  8:11     ` [dpdk-dev] [PATCH v4 1/4] bus/fslmc: fix undefined reference of memsegs Gavin Hu
  2018-09-17  8:11       ` [dpdk-dev] [PATCH v4 2/4] ring: read tail using atomic load Gavin Hu
@ 2018-09-17  8:11       ` Gavin Hu
  2018-09-17  8:11       ` [dpdk-dev] [PATCH v4 4/4] ring: move the atomic load of head above the loop Gavin Hu
  2 siblings, 0 replies; 131+ messages in thread
From: Gavin Hu @ 2018-09-17  8:11 UTC (permalink / raw)
  To: dev
  Cc: gavin.hu, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl,
	jerin.jacob, nd, stable

Synchronize the load-acquire of the tail with the store-release
within update_tail; the store-release ensures all the ring operations,
enqueue or dequeue, are seen by the observers on the other side as soon
as they see the updated tail. The load-acquire is needed here as the
data dependency is not a reliable way of ordering: the compiler might
break it by saving to temporary values to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.

The patch was benchmarked with test/ring_perf_autotest and it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores; the real gains
depend on the number of lcores, the depth of the ring, and SPSC vs MPMC.
For 1 lcore, it also improves a little, about 3 ~ 4%.
It is a big improvement: in the MPMC case with two lcores and a ring size
of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.

This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting the
load-acquire from __atomic_compare_exchange_n up above.

The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i

Test result with this patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.86
 MP/MC bulk enq/dequeue (size: 8): 10.15
 SP/SC bulk enq/dequeue (size: 32): 1.94
 MP/MC bulk enq/dequeue (size: 32): 2.36

In comparison of the test result without this patch:
 SP/SC bulk enq/dequeue (size: 8): 6.67
 MP/MC bulk enq/dequeue (size: 8): 13.12
 SP/SC bulk enq/dequeue (size: 32): 2.04
 MP/MC bulk enq/dequeue (size: 32): 3.26

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 234fea0..0eae3b3 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -68,13 +68,18 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		*old_head = __atomic_load_n(&r->prod.head,
 					__ATOMIC_ACQUIRE);
 
-		/*
-		 *  The subtraction is done between two unsigned 32bits value
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
+							__ATOMIC_ACQUIRE);
+
+		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * *old_head > cons_tail). So 'free_entries' is always between 0
 		 * and capacity (which is < size).
 		 */
-		*free_entries = (capacity + r->cons.tail - *old_head);
+		*free_entries = (capacity + cons_tail - *old_head);
 
 		/* check that we have enough room in ring */
 		if (unlikely(n > *free_entries))
@@ -132,15 +137,22 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
+
 		*old_head = __atomic_load_n(&r->cons.head,
 					__ATOMIC_ACQUIRE);
 
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
+					__ATOMIC_ACQUIRE);
+
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
 		 * and size(ring)-1.
 		 */
-		*entries = (r->prod.tail - *old_head);
+		*entries = (prod_tail - *old_head);
 
 		/* Set the actual entries for dequeue */
 		if (n > *entries)
-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v4 4/4] ring: move the atomic load of head above the loop
  2018-09-17  8:11     ` [dpdk-dev] [PATCH v4 1/4] bus/fslmc: fix undefined reference of memsegs Gavin Hu
  2018-09-17  8:11       ` [dpdk-dev] [PATCH v4 2/4] ring: read tail using atomic load Gavin Hu
  2018-09-17  8:11       ` [dpdk-dev] [PATCH v4 3/4] ring: synchronize the load and store of the tail Gavin Hu
@ 2018-09-17  8:11       ` Gavin Hu
  2018-10-27 14:21         ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
  2 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu @ 2018-09-17  8:11 UTC (permalink / raw)
  To: dev
  Cc: gavin.hu, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl,
	jerin.jacob, nd, stable

In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
the do {} while loop, as upon failure the old_head will be updated;
another load is costly and not necessary.

This helps the latency a little, by about 1~5%.

 Test result with the patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.64
 MP/MC bulk enq/dequeue (size: 8): 9.58
 SP/SC bulk enq/dequeue (size: 32): 1.98
 MP/MC bulk enq/dequeue (size: 32): 2.30

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0eae3b3..95cc508 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -61,13 +61,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 	unsigned int max = n;
 	int success;
 
+	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Reset n to the initial burst count */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->prod.head,
-					__ATOMIC_ACQUIRE);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -93,6 +91,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		if (is_sp)
 			r->prod.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head is updated */
 			success = __atomic_compare_exchange_n(&r->prod.head,
 					old_head, *new_head,
 					0, __ATOMIC_ACQUIRE,
@@ -134,13 +133,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	int success;
 
 	/* move cons.head atomically */
+	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->cons.head,
-					__ATOMIC_ACQUIRE);
-
 		/* this load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -165,6 +162,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		if (is_sc)
 			r->cons.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head will be updated */
 			success = __atomic_compare_exchange_n(&r->cons.head,
 							old_head, *new_head,
 							0, __ATOMIC_ACQUIRE,
-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-08-07  3:19 ` [dpdk-dev] [PATCH v2] " Gavin Hu
  2018-08-07  5:56   ` He, Jia
  2018-09-17  7:47   ` [dpdk-dev] [PATCH v3 1/3] app/testpmd: show errno along with flow API errors Gavin Hu
@ 2018-09-17  8:17   ` Gavin Hu
  2018-09-17  8:17     ` [dpdk-dev] [PATCH v3 2/3] ring: synchronize the load and store of the tail Gavin Hu
                       ` (4 more replies)
  2 siblings, 5 replies; 131+ messages in thread
From: Gavin Hu @ 2018-09-17  8:17 UTC (permalink / raw)
  To: dev
  Cc: gavin.hu, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl,
	jerin.jacob, nd, stable

In update_tail, read ht->tail using __atomic_load. Although the
compiler currently seems to be doing the right thing even without
__atomic_load, we don't want to give the compiler freedom to optimise
what should be an atomic load; it should not be arbitrarily moved
around.

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 94df3c4..234fea0 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -21,7 +21,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
+		while (unlikely(old_val != __atomic_load_n(&ht->tail,
+						__ATOMIC_RELAXED)))
 			rte_pause();
 
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
-- 
2.7.4


* [dpdk-dev] [PATCH v3 2/3] ring: synchronize the load and store of the tail
  2018-09-17  8:17   ` [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load Gavin Hu
@ 2018-09-17  8:17     ` Gavin Hu
  2018-09-26  9:29       ` Gavin Hu (Arm Technology China)
                         ` (2 more replies)
  2018-09-17  8:17     ` [dpdk-dev] [PATCH v3 3/3] " Gavin Hu
                       ` (3 subsequent siblings)
  4 siblings, 3 replies; 131+ messages in thread
From: Gavin Hu @ 2018-09-17  8:17 UTC (permalink / raw)
  To: dev
  Cc: gavin.hu, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl,
	jerin.jacob, nd, stable

Synchronize the load-acquire of the tail with the store-release
within update_tail; the store-release ensures that all the ring
operations, enqueue or dequeue, are seen by observers on the other side
as soon as they see the updated tail. The load-acquire is needed here
because a data dependency is not a reliable way to order the accesses:
the compiler might break it by caching values in temporaries to boost
performance. When computing the free_entries and avail_entries, load
the heads and tails with atomic semantics instead.
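The pairing can be illustrated with a minimal single-producer/single-consumer sketch (hypothetical names, not the DPDK code): the producer's store-release of the tail publishes its slot writes, and the consumer's load-acquire of the same tail synchronizes with that release, so the slot contents are guaranteed visible once the new tail value is observed.

```c
#include <stdint.h>

#define RING_SIZE 8	/* must be a power of two for the mask trick */
static uint32_t ring[RING_SIZE];
static uint32_t prod_tail;	/* index one past the last published slot */

static void
enqueue_one(uint32_t val)
{
	uint32_t t = __atomic_load_n(&prod_tail, __ATOMIC_RELAXED);

	ring[t & (RING_SIZE - 1)] = val;	/* write the slot ... */
	/* ... then publish it: release orders the slot write before the tail */
	__atomic_store_n(&prod_tail, t + 1, __ATOMIC_RELEASE);
}

static int
dequeue_one(uint32_t *val, uint32_t cons_head)
{
	/* load-acquire pairs with the store-release above */
	uint32_t t = __atomic_load_n(&prod_tail, __ATOMIC_ACQUIRE);

	if (cons_head == t)
		return 0;	/* ring empty */
	*val = ring[cons_head & (RING_SIZE - 1)];	/* slot write is visible */
	return 1;
}
```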

The patch was benchmarked with test/ring_perf_autotest; it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores. The real
gains depend on the number of lcores, the depth of the ring, and
whether the ring is SPSC or MPMC. For one lcore it also improves a
little, about 3 ~ 4%. The improvement is biggest for MPMC: with two
lcores and a ring size of 32, it cuts latency by up to
(3.26-2.36)/3.26 = 27.6%.

This patch is a bug fix; the performance improvement is a bonus. In our
analysis the improvement comes from the cache line pre-filling after
hoisting the load-acquire of the tail up above
__atomic_compare_exchange_n.

The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i

Test result with this patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.86
 MP/MC bulk enq/dequeue (size: 8): 10.15
 SP/SC bulk enq/dequeue (size: 32): 1.94
 MP/MC bulk enq/dequeue (size: 32): 2.36

In comparison of the test result without this patch:
 SP/SC bulk enq/dequeue (size: 8): 6.67
 MP/MC bulk enq/dequeue (size: 8): 13.12
 SP/SC bulk enq/dequeue (size: 32): 2.04
 MP/MC bulk enq/dequeue (size: 32): 3.26

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 234fea0..0eae3b3 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -68,13 +68,18 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		*old_head = __atomic_load_n(&r->prod.head,
 					__ATOMIC_ACQUIRE);
 
-		/*
-		 *  The subtraction is done between two unsigned 32bits value
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
+							__ATOMIC_ACQUIRE);
+
+		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * *old_head > cons_tail). So 'free_entries' is always between 0
 		 * and capacity (which is < size).
 		 */
-		*free_entries = (capacity + r->cons.tail - *old_head);
+		*free_entries = (capacity + cons_tail - *old_head);
 
 		/* check that we have enough room in ring */
 		if (unlikely(n > *free_entries))
@@ -132,15 +137,22 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
+
 		*old_head = __atomic_load_n(&r->cons.head,
 					__ATOMIC_ACQUIRE);
 
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
+					__ATOMIC_ACQUIRE);
+
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
 		 * and size(ring)-1.
 		 */
-		*entries = (r->prod.tail - *old_head);
+		*entries = (prod_tail - *old_head);
 
 		/* Set the actual entries for dequeue */
 		if (n > *entries)
-- 
2.7.4


* [dpdk-dev] [PATCH v3 3/3] ring: move the atomic load of head above the loop
  2018-09-17  8:17   ` [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load Gavin Hu
  2018-09-17  8:17     ` [dpdk-dev] [PATCH v3 2/3] ring: synchronize the load and store of the tail Gavin Hu
@ 2018-09-17  8:17     ` Gavin Hu
  2018-09-26  9:29       ` Gavin Hu (Arm Technology China)
  2018-09-29 10:59       ` Jerin Jacob
  2018-09-26  9:29     ` [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load Gavin Hu (Arm Technology China)
                       ` (2 subsequent siblings)
  4 siblings, 2 replies; 131+ messages in thread
From: Gavin Hu @ 2018-09-17  8:17 UTC (permalink / raw)
  To: dev
  Cc: gavin.hu, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl,
	jerin.jacob, nd, stable

In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
the do {} while loop: upon failure old_head is already updated by the
compare-exchange, so another load is costly and unnecessary.

This helps the latency a little, by about 1~5%.
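A sketch of the hoisted-load pattern (simplified, with hypothetical names; not the actual ring function): on failure, __atomic_compare_exchange_n writes the currently observed value back into its "expected" argument, so a single load before the loop suffices.

```c
#include <stdint.h>

/* Claim n entries starting at *head; return the old head on success.
 * The head is loaded once before the loop; each failed CAS refreshes
 * old_head for free, so no reload is needed at the top of the loop. */
static uint32_t
claim_entries(uint32_t *head, uint32_t n)
{
	uint32_t old_head = __atomic_load_n(head, __ATOMIC_ACQUIRE);
	uint32_t new_head;
	int success;

	do {
		new_head = old_head + n;
		/* relaxed on failure: competing threads on the same end
		 * need no ordering among themselves */
		success = __atomic_compare_exchange_n(head, &old_head,
						new_head, 0, __ATOMIC_ACQUIRE,
						__ATOMIC_RELAXED);
	} while (!success);

	return old_head;	/* [old_head, old_head + n) is now ours */
}
```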

 Test result with the patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.64
 MP/MC bulk enq/dequeue (size: 8): 9.58
 SP/SC bulk enq/dequeue (size: 32): 1.98
 MP/MC bulk enq/dequeue (size: 32): 2.30

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0eae3b3..95cc508 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -61,13 +61,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 	unsigned int max = n;
 	int success;
 
+	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Reset n to the initial burst count */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->prod.head,
-					__ATOMIC_ACQUIRE);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -93,6 +91,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		if (is_sp)
 			r->prod.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head is updated */
 			success = __atomic_compare_exchange_n(&r->prod.head,
 					old_head, *new_head,
 					0, __ATOMIC_ACQUIRE,
@@ -134,13 +133,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	int success;
 
 	/* move cons.head atomically */
+	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->cons.head,
-					__ATOMIC_ACQUIRE);
-
 		/* this load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -165,6 +162,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		if (is_sc)
 			r->cons.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head will be updated */
 			success = __atomic_compare_exchange_n(&r->cons.head,
 							old_head, *new_head,
 							0, __ATOMIC_ACQUIRE,
-- 
2.7.4


* Re: [dpdk-dev] [PATCH v3 2/3] net/i40e: remove invalid comment
  2018-09-17  7:47     ` [dpdk-dev] [PATCH v3 2/3] net/i40e: remove invalid comment Gavin Hu
@ 2018-09-17  8:25       ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-09-17  8:25 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), dev
  Cc: Honnappa Nagarahalli, Steve Capper, Ola Liljedahl, jerin.jacob,
	nd, Ferruh Yigit

Hi All,

I am really sorry: I made a mistake by submitting patches from the wrong branch, and in trying to fix it quickly I made another mistake (submitted one more patch that is not mine).

Please skip these wrong patches (I have already superseded them) and help review the new 3 patches for rte ring.

Really sorry again for the confusion and inconvenience!

Best Regards,
Gavin

> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Monday, September 17, 2018 3:48 PM
> To: dev@dpdk.org
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper
> <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> jerin.jacob@caviumnetworks.com; nd <nd@arm.com>; Ferruh Yigit
> <ferruh.yigit@intel.com>
> Subject: [PATCH v3 2/3] net/i40e: remove invalid comment
> 
> From: Ferruh Yigit <ferruh.yigit@intel.com>
> 
> Comments says "no csum error report support" but there is no check related
> csum offloads. Removing the comment.
> 
> Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
> Acked-by: Qi Zhang <qi.z.zhang@intel.com>
> ---
>  drivers/net/i40e/i40e_rxtx_vec_common.h | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h
> b/drivers/net/i40e/i40e_rxtx_vec_common.h
> index 63cb17742..f00f6d648 100644
> --- a/drivers/net/i40e/i40e_rxtx_vec_common.h
> +++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
> @@ -199,9 +199,7 @@
> i40e_rx_vec_dev_conf_condition_check_default(struct rte_eth_dev *dev)
>  	if (fconf->mode != RTE_FDIR_MODE_NONE)
>  		return -1;
> 
> -	 /* - no csum error report support
> -	 * - no header split support
> -	 */
> +	 /* no header split support */
>  	if (rxmode->offloads & DEV_RX_OFFLOAD_HEADER_SPLIT)
>  		return -1;
> 
> --
> 2.11.0


* Re: [dpdk-dev] [PATCH v3 3/3] doc: add cross compile part for sample applications
  2018-09-17  7:47     ` [dpdk-dev] [PATCH v3 3/3] doc: add cross compile part for sample applications Gavin Hu
@ 2018-09-17  9:48       ` Jerin Jacob
  2018-09-17 10:28         ` Gavin Hu (Arm Technology China)
  2018-09-17 10:49       ` [dpdk-dev] [PATCH v4] " Gavin Hu
  1 sibling, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2018-09-17  9:48 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl, nd, stable

-----Original Message-----
> Date: Mon, 17 Sep 2018 15:47:35 +0800
> From: Gavin Hu <gavin.hu@arm.com>
> To: dev@dpdk.org
> CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com, steve.capper@arm.com,
>  Ola.Liljedahl@arm.com, jerin.jacob@caviumnetworks.com, nd@arm.com,
>  stable@dpdk.org
> Subject: [PATCH v3 3/3] doc: add cross compile part for sample applications
> X-Mailer: git-send-email 2.11.0
> 
> External Email
> 
> Fixes: 7cacb05655 ("doc: add generic build instructions for sample apps")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>  doc/guides/sample_app_ug/compiling.rst | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/guides/sample_app_ug/compiling.rst b/doc/guides/sample_app_ug/compiling.rst
> index a2d75ed22..6f04743c8 100644
> --- a/doc/guides/sample_app_ug/compiling.rst
> +++ b/doc/guides/sample_app_ug/compiling.rst
> @@ -9,7 +9,6 @@ This section explains how to compile the DPDK sample applications.
>  To compile all the sample applications
>  --------------------------------------
> 
> -
>  Set the path to DPDK source code if its not set:
> 
>      .. code-block:: console
> @@ -93,3 +92,17 @@ Build the application:
> 
>          export RTE_TARGET=build
>          make
> +
> +To cross compile the sample application(s)
> +------------------------------------------
> +
> +For cross compiling the sample application(s), please append 'CROSS=$(CROSS_COMPILER_PREFIX)' to the 'make' command.
> +In example of AARCH64 cross compiling:
> +
> +    .. code-block:: console
> +
> +        export RTE_TARGET=build
> +        export RTE_SDK=/path/to/rte_sdk
> +        make -C examples CROSS=aarch64-linux-gnu-
> +               or
> +        make CROSS=aarch64-linux-gnu-

It should be make -C examples/l3fwd CROSS=aarch64-linux-gnu-, right?
Without giving a directory it builds only the SDK.

> --
> 2.11.0
> 


* Re: [dpdk-dev] [PATCH v3 3/3] doc: add cross compile part for sample applications
  2018-09-17  9:48       ` Jerin Jacob
@ 2018-09-17 10:28         ` Gavin Hu (Arm Technology China)
  2018-09-17 10:34           ` Jerin Jacob
  0 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-09-17 10:28 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Steve Capper, Ola Liljedahl, nd, stable



> -----Original Message-----
> From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Sent: Monday, September 17, 2018 5:48 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Steve Capper
> <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>; nd
> <nd@arm.com>; stable@dpdk.org
> Subject: Re: [PATCH v3 3/3] doc: add cross compile part for sample
> applications
> 
> -----Original Message-----
> > Date: Mon, 17 Sep 2018 15:47:35 +0800
> > From: Gavin Hu <gavin.hu@arm.com>
> > To: dev@dpdk.org
> > CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com,
> > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
> > jerin.jacob@caviumnetworks.com, nd@arm.com,  stable@dpdk.org
> > Subject: [PATCH v3 3/3] doc: add cross compile part for sample
> > applications
> > X-Mailer: git-send-email 2.11.0
> >
> > External Email
> >
> > Fixes: 7cacb05655 ("doc: add generic build instructions for sample
> > apps")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> >  doc/guides/sample_app_ug/compiling.rst | 15 ++++++++++++++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/doc/guides/sample_app_ug/compiling.rst
> > b/doc/guides/sample_app_ug/compiling.rst
> > index a2d75ed22..6f04743c8 100644
> > --- a/doc/guides/sample_app_ug/compiling.rst
> > +++ b/doc/guides/sample_app_ug/compiling.rst
> > @@ -9,7 +9,6 @@ This section explains how to compile the DPDK sample
> applications.
> >  To compile all the sample applications
> >  --------------------------------------
> >
> > -
> >  Set the path to DPDK source code if its not set:
> >
> >      .. code-block:: console
> > @@ -93,3 +92,17 @@ Build the application:
> >
> >          export RTE_TARGET=build
> >          make
> > +
> > +To cross compile the sample application(s)
> > +------------------------------------------
> > +
> > +For cross compiling the sample application(s), please append
> 'CROSS=$(CROSS_COMPILER_PREFIX)' to the 'make' command.
> > +In example of AARCH64 cross compiling:
> > +
> > +    .. code-block:: console
> > +
> > +        export RTE_TARGET=build
> > +        export RTE_SDK=/path/to/rte_sdk
> > +        make -C examples CROSS=aarch64-linux-gnu-
> > +               or
> > +        make CROSS=aarch64-linux-gnu-
> 
> It should be make -C examples/l3fwd CROSS=aarch64-linux-gnu-, Right? as
> without giving directory it builds the SDK only.

-C examples/l3fwd can be omitted if $(pwd) is already that directory.

> 
> > --
> > 2.11.0
> >


* Re: [dpdk-dev] [PATCH v3 3/3] doc: add cross compile part for sample applications
  2018-09-17 10:28         ` Gavin Hu (Arm Technology China)
@ 2018-09-17 10:34           ` Jerin Jacob
  2018-09-17 10:55             ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2018-09-17 10:34 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China)
  Cc: dev, Honnappa Nagarahalli, Steve Capper, Ola Liljedahl, nd, stable

-----Original Message-----
> Date: Mon, 17 Sep 2018 10:28:57 +0000
> From: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>  <Honnappa.Nagarahalli@arm.com>, Steve Capper <Steve.Capper@arm.com>, Ola
>  Liljedahl <Ola.Liljedahl@arm.com>, nd <nd@arm.com>, "stable@dpdk.org"
>  <stable@dpdk.org>
> Subject: RE: [PATCH v3 3/3] doc: add cross compile part for sample
>  applications
> 
> > -----Original Message-----
> > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > Sent: Monday, September 17, 2018 5:48 PM
> > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> > Cc: dev@dpdk.org; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; Steve Capper
> > <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>; nd
> > <nd@arm.com>; stable@dpdk.org
> > Subject: Re: [PATCH v3 3/3] doc: add cross compile part for sample
> > applications
> >
> > -----Original Message-----
> > > Date: Mon, 17 Sep 2018 15:47:35 +0800
> > > From: Gavin Hu <gavin.hu@arm.com>
> > > To: dev@dpdk.org
> > > CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com,
> > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
> > > jerin.jacob@caviumnetworks.com, nd@arm.com,  stable@dpdk.org
> > > Subject: [PATCH v3 3/3] doc: add cross compile part for sample
> > > applications
> > > X-Mailer: git-send-email 2.11.0
> > >
> > > External Email
> > >
> > > Fixes: 7cacb05655 ("doc: add generic build instructions for sample
> > > apps")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > ---
> > >  doc/guides/sample_app_ug/compiling.rst | 15 ++++++++++++++-
> > >  1 file changed, 14 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/doc/guides/sample_app_ug/compiling.rst
> > > b/doc/guides/sample_app_ug/compiling.rst
> > > index a2d75ed22..6f04743c8 100644
> > > --- a/doc/guides/sample_app_ug/compiling.rst
> > > +++ b/doc/guides/sample_app_ug/compiling.rst
> > > @@ -9,7 +9,6 @@ This section explains how to compile the DPDK sample
> > applications.
> > >  To compile all the sample applications
> > >  --------------------------------------
> > >
> > > -
> > >  Set the path to DPDK source code if its not set:
> > >
> > >      .. code-block:: console
> > > @@ -93,3 +92,17 @@ Build the application:
> > >
> > >          export RTE_TARGET=build
> > >          make
> > > +
> > > +To cross compile the sample application(s)
> > > +------------------------------------------
> > > +
> > > +For cross compiling the sample application(s), please append
> > 'CROSS=$(CROSS_COMPILER_PREFIX)' to the 'make' command.
> > > +In example of AARCH64 cross compiling:
> > > +
> > > +    .. code-block:: console
> > > +
> > > +        export RTE_TARGET=build
> > > +        export RTE_SDK=/path/to/rte_sdk
> > > +        make -C examples CROSS=aarch64-linux-gnu-
> > > +               or
> > > +        make CROSS=aarch64-linux-gnu-
> >
> > It should be make -C examples/l3fwd CROSS=aarch64-linux-gnu-, Right? as
> > without giving directory it builds the SDK only.
> 
> -C examples/l3fwd can be ignored if the $(pwd) is already in there.

Yes. Since it is presented as an "or" alternative, it is better to
mention the directory explicitly, i.e.:

make -C examples CROSS=aarch64-linux-gnu-

or

cd $(pwd)/examples/<example_app>
make CROSS=aarch64-linux-gnu-

or

make -C examples/<example_app> CROSS=aarch64-linux-gnu-

> 
> >
> > > --
> > > 2.11.0
> > >


* [dpdk-dev] [PATCH v4] doc: add cross compile part for sample applications
  2018-09-17  7:47     ` [dpdk-dev] [PATCH v3 3/3] doc: add cross compile part for sample applications Gavin Hu
  2018-09-17  9:48       ` Jerin Jacob
@ 2018-09-17 10:49       ` Gavin Hu
  2018-09-17 10:53         ` [dpdk-dev] [PATCH v5] " Gavin Hu
  1 sibling, 1 reply; 131+ messages in thread
From: Gavin Hu @ 2018-09-17 10:49 UTC (permalink / raw)
  To: dev; +Cc: gavin.hu, Honnappa.Nagarahalli, jerin.jacob, stable

Fixes: 7cacb05655 ("doc: add generic build instructions for sample apps")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 doc/guides/sample_app_ug/compiling.rst | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/doc/guides/sample_app_ug/compiling.rst b/doc/guides/sample_app_ug/compiling.rst
index a2d75ed22..f6b31fc7b 100644
--- a/doc/guides/sample_app_ug/compiling.rst
+++ b/doc/guides/sample_app_ug/compiling.rst
@@ -9,7 +9,6 @@ This section explains how to compile the DPDK sample applications.
 To compile all the sample applications
 --------------------------------------
 
-
 Set the path to DPDK source code if its not set:
 
     .. code-block:: console
@@ -93,3 +92,18 @@ Build the application:
 
         export RTE_TARGET=build
         make
+
+To cross compile the sample application(s)
+------------------------------------------
+
+For cross compiling the sample application(s), please append 'CROSS=$(CROSS_COMPILER_PREFIX)' to the 'make' command.
+In example of AARCH64 cross compiling:
+
+    .. code-block:: console
+
+        export RTE_TARGET=build
+        export RTE_SDK=/path/to/rte_sdk
+        make -C examples CROSS=aarch64-linux-gnu-
+               or
+		cd $(pwd)/examples/<example_app>
+        make CROSS=aarch64-linux-gnu-
-- 
2.11.0


* [dpdk-dev] [PATCH v5] doc: add cross compile part for sample applications
  2018-09-17 10:49       ` [dpdk-dev] [PATCH v4] " Gavin Hu
@ 2018-09-17 10:53         ` Gavin Hu
  2018-09-18 11:00           ` Jerin Jacob
  2018-09-19  0:33           ` [dpdk-dev] [PATCH v6] " Gavin Hu
  0 siblings, 2 replies; 131+ messages in thread
From: Gavin Hu @ 2018-09-17 10:53 UTC (permalink / raw)
  To: dev; +Cc: gavin.hu, Honnappa.Nagarahalli, jerin.jacob, stable

Fixes: 7cacb05655 ("doc: add generic build instructions for sample apps")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 doc/guides/sample_app_ug/compiling.rst | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/doc/guides/sample_app_ug/compiling.rst b/doc/guides/sample_app_ug/compiling.rst
index a2d75ed22..9ff531906 100644
--- a/doc/guides/sample_app_ug/compiling.rst
+++ b/doc/guides/sample_app_ug/compiling.rst
@@ -9,7 +9,6 @@ This section explains how to compile the DPDK sample applications.
 To compile all the sample applications
 --------------------------------------
 
-
 Set the path to DPDK source code if its not set:
 
     .. code-block:: console
@@ -93,3 +92,18 @@ Build the application:
 
         export RTE_TARGET=build
         make
+
+To cross compile the sample application(s)
+------------------------------------------
+
+For cross compiling the sample application(s), please append 'CROSS=$(CROSS_COMPILER_PREFIX)' to the 'make' command.
+In example of AARCH64 cross compiling:
+
+    .. code-block:: console
+
+        export RTE_TARGET=build
+        export RTE_SDK=/path/to/rte_sdk
+        make -C examples CROSS=aarch64-linux-gnu-
+               or
+        cd $(pwd)/examples/<example_app>
+        make CROSS=aarch64-linux-gnu-
-- 
2.11.0


* Re: [dpdk-dev] [PATCH v3 3/3] doc: add cross compile part for sample applications
  2018-09-17 10:34           ` Jerin Jacob
@ 2018-09-17 10:55             ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-09-17 10:55 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Steve Capper, Ola Liljedahl, nd, stable

Hi Jerin,

Thanks for the review; could you help review the v5 version?

Best Regards,
Gavin

> -----Original Message-----
> From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Sent: Monday, September 17, 2018 6:35 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Steve Capper
> <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>; nd
> <nd@arm.com>; stable@dpdk.org
> Subject: Re: [PATCH v3 3/3] doc: add cross compile part for sample
> applications
> 
> -----Original Message-----
> > Date: Mon, 17 Sep 2018 10:28:57 +0000
> > From: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>
> > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>, Steve Capper
> <Steve.Capper@arm.com>,
> > Ola  Liljedahl <Ola.Liljedahl@arm.com>, nd <nd@arm.com>,
> "stable@dpdk.org"
> >  <stable@dpdk.org>
> > Subject: RE: [PATCH v3 3/3] doc: add cross compile part for sample
> > applications
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > Sent: Monday, September 17, 2018 5:48 PM
> > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> > > Cc: dev@dpdk.org; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>; Steve Capper
> <Steve.Capper@arm.com>;
> > > Ola Liljedahl <Ola.Liljedahl@arm.com>; nd <nd@arm.com>;
> > > stable@dpdk.org
> > > Subject: Re: [PATCH v3 3/3] doc: add cross compile part for sample
> > > applications
> > >
> > > -----Original Message-----
> > > > Date: Mon, 17 Sep 2018 15:47:35 +0800
> > > > From: Gavin Hu <gavin.hu@arm.com>
> > > > To: dev@dpdk.org
> > > > CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com,
> > > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
> > > > jerin.jacob@caviumnetworks.com, nd@arm.com,  stable@dpdk.org
> > > > Subject: [PATCH v3 3/3] doc: add cross compile part for sample
> > > > applications
> > > > X-Mailer: git-send-email 2.11.0
> > > >
> > > > External Email
> > > >
> > > > Fixes: 7cacb05655 ("doc: add generic build instructions for sample
> > > > apps")
> > > > Cc: stable@dpdk.org
> > > >
> > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > ---
> > > >  doc/guides/sample_app_ug/compiling.rst | 15 ++++++++++++++-
> > > >  1 file changed, 14 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/doc/guides/sample_app_ug/compiling.rst
> > > > b/doc/guides/sample_app_ug/compiling.rst
> > > > index a2d75ed22..6f04743c8 100644
> > > > --- a/doc/guides/sample_app_ug/compiling.rst
> > > > +++ b/doc/guides/sample_app_ug/compiling.rst
> > > > @@ -9,7 +9,6 @@ This section explains how to compile the DPDK
> > > > sample
> > > applications.
> > > >  To compile all the sample applications
> > > >  --------------------------------------
> > > >
> > > > -
> > > >  Set the path to DPDK source code if its not set:
> > > >
> > > >      .. code-block:: console
> > > > @@ -93,3 +92,17 @@ Build the application:
> > > >
> > > >          export RTE_TARGET=build
> > > >          make
> > > > +
> > > > +To cross compile the sample application(s)
> > > > +------------------------------------------
> > > > +
> > > > +For cross compiling the sample application(s), please append
> > > 'CROSS=$(CROSS_COMPILER_PREFIX)' to the 'make' command.
> > > > +In example of AARCH64 cross compiling:
> > > > +
> > > > +    .. code-block:: console
> > > > +
> > > > +        export RTE_TARGET=build
> > > > +        export RTE_SDK=/path/to/rte_sdk
> > > > +        make -C examples CROSS=aarch64-linux-gnu-
> > > > +               or
> > > > +        make CROSS=aarch64-linux-gnu-
> > >
> > > It should be make -C examples/l3fwd CROSS=aarch64-linux-gnu-, Right?
> > > as without giving directory it builds the SDK only.
> >
> > -C examples/l3fwd can be ignored if the $(pwd) is already in there.
> 
> Yes. Since it mentioned as "or" it better to explicitly mentioned in it.
> 
> i.e
> 
> make -C examples CROSS=aarch64-linux-gnu-
> 
> or
> 
> cd $(pwd)/examples/<example_app>
> make CROSS=aarch64-linux-gnu-
> 
> or
> 
> make -C examples/<example_app> CROSS=aarch64-linux-gnu-
> 
> >
> > >
> > > > --
> > > > 2.11.0
> > > >


* Re: [dpdk-dev] [PATCH v5] doc: add cross compile part for sample applications
  2018-09-17 10:53         ` [dpdk-dev] [PATCH v5] " Gavin Hu
@ 2018-09-18 11:00           ` Jerin Jacob
  2018-09-19  0:33           ` [dpdk-dev] [PATCH v6] " Gavin Hu
  1 sibling, 0 replies; 131+ messages in thread
From: Jerin Jacob @ 2018-09-18 11:00 UTC (permalink / raw)
  To: Gavin Hu; +Cc: dev, Honnappa.Nagarahalli, stable

-----Original Message-----
> Date: Mon, 17 Sep 2018 18:53:43 +0800
> From: Gavin Hu <gavin.hu@arm.com>
> To: dev@dpdk.org
> CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com,
>  jerin.jacob@caviumnetworks.com, stable@dpdk.org
> Subject: [PATCH v5] doc: add cross compile part for sample applications
> X-Mailer: git-send-email 2.11.0
> 
> External Email
> 
> Fixes: 7cacb05655 ("doc: add generic build instructions for sample apps")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>  doc/guides/sample_app_ug/compiling.rst | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/guides/sample_app_ug/compiling.rst b/doc/guides/sample_app_ug/compiling.rst
> index a2d75ed22..9ff531906 100644
> --- a/doc/guides/sample_app_ug/compiling.rst
> +++ b/doc/guides/sample_app_ug/compiling.rst
> @@ -9,7 +9,6 @@ This section explains how to compile the DPDK sample applications.
>  To compile all the sample applications
>  --------------------------------------
> 
> -
>  Set the path to DPDK source code if its not set:
> 
>      .. code-block:: console
> @@ -93,3 +92,18 @@ Build the application:
> 
>          export RTE_TARGET=build
>          make
> +
> +To cross compile the sample application(s)
> +------------------------------------------
> +
> +For cross compiling the sample application(s), please append 'CROSS=$(CROSS_COMPILER_PREFIX)' to the 'make' command.

IMO, you can remove "please".

> +In example of AARCH64 cross compiling:

I think, it is better to change to "AARCH64 cross compiling example:"

> +
> +    .. code-block:: console
> +
> +        export RTE_TARGET=build
> +        export RTE_SDK=/path/to/rte_sdk
> +        make -C examples CROSS=aarch64-linux-gnu-
> +               or
> +        cd $(pwd)/examples/<example_app>

Better to change to:
cd $(RTE_SDK)/examples/<example_app>

> +        make CROSS=aarch64-linux-gnu-
> --
> 2.11.0


With the above changes you can add my Acked-by:
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>

> 


* [dpdk-dev] [PATCH v6] doc: add cross compile part for sample applications
  2018-09-17 10:53         ` [dpdk-dev] [PATCH v5] " Gavin Hu
  2018-09-18 11:00           ` Jerin Jacob
@ 2018-09-19  0:33           ` Gavin Hu
  1 sibling, 0 replies; 131+ messages in thread
From: Gavin Hu @ 2018-09-19  0:33 UTC (permalink / raw)
  To: dev; +Cc: gavin.hu, Honnappa.Nagarahalli, jerin.jacob, stable

Fixes: 7cacb05655 ("doc: add generic build instructions for sample apps")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 doc/guides/sample_app_ug/compiling.rst | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/doc/guides/sample_app_ug/compiling.rst b/doc/guides/sample_app_ug/compiling.rst
index a2d75ed22..984725306 100644
--- a/doc/guides/sample_app_ug/compiling.rst
+++ b/doc/guides/sample_app_ug/compiling.rst
@@ -9,7 +9,6 @@ This section explains how to compile the DPDK sample applications.
 To compile all the sample applications
 --------------------------------------
 
-
 Set the path to DPDK source code if its not set:
 
     .. code-block:: console
@@ -93,3 +92,18 @@ Build the application:
 
         export RTE_TARGET=build
         make
+
+To cross compile the sample application(s)
+------------------------------------------
+
+For cross compiling the sample application(s), append 'CROSS=$(CROSS_COMPILER_PREFIX)' to the 'make' command.
+AARCH64 cross compiling example:
+
+    .. code-block:: console
+
+        export RTE_TARGET=build
+        export RTE_SDK=/path/to/rte_sdk
+        make -C examples CROSS=aarch64-linux-gnu-
+               or
+        cd $(RTE_SDK)/examples/<example_app>
+        make CROSS=aarch64-linux-gnu-
-- 
2.11.0

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] ring: read tail using atomic load
  2018-09-17  8:11       ` [dpdk-dev] [PATCH v4 2/4] ring: read tail using atomic load Gavin Hu
@ 2018-09-20  6:41         ` Jerin Jacob
  2018-09-25  9:26           ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2018-09-20  6:41 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl, nd,
	stable, jia.he

-----Original Message-----
> Date: Mon, 17 Sep 2018 16:11:17 +0800
> From: Gavin Hu <gavin.hu@arm.com>
> To: dev@dpdk.org
> CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com, steve.capper@arm.com,
>  Ola.Liljedahl@arm.com, jerin.jacob@caviumnetworks.com, nd@arm.com,
>  stable@dpdk.org
> Subject: [PATCH v4 2/4] ring: read tail using atomic load
> X-Mailer: git-send-email 2.7.4
> 
> 
> In update_tail, read ht->tail using __atomic_load. Although the
> compiler currently seems to be doing the right thing even without
> __atomic_load, we don't want to give the compiler freedom to optimise
> what should be an atomic load; it should not be arbitrarily moved
> around.
> 
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org


+ Jia He <jia.he@hxt-semitech.com>

> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> ---
>  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
> index 94df3c4..234fea0 100644
> --- a/lib/librte_ring/rte_ring_c11_mem.h
> +++ b/lib/librte_ring/rte_ring_c11_mem.h
> @@ -21,7 +21,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
>          * we need to wait for them to complete
>          */
>         if (!single)
> -               while (unlikely(ht->tail != old_val))
> +               while (unlikely(old_val != __atomic_load_n(&ht->tail,
> +                                               __ATOMIC_RELAXED)))
>                         rte_pause();
> 
>         __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> --
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/4] ring: read tail using atomic load
  2018-09-20  6:41         ` Jerin Jacob
@ 2018-09-25  9:26           ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-09-25  9:26 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Steve Capper, Ola Liljedahl, nd,
	stable, Justin He

+ Justin He as Jerin requested. 

> -----Original Message-----
> From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Sent: Thursday, September 20, 2018 2:41 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Steve Capper
> <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>; nd
> <nd@arm.com>; stable@dpdk.org; jia.he@hxt-semitech.com
> Subject: Re: [PATCH v4 2/4] ring: read tail using atomic load
> 
> -----Original Message-----
> > Date: Mon, 17 Sep 2018 16:11:17 +0800
> > From: Gavin Hu <gavin.hu@arm.com>
> > To: dev@dpdk.org
> > CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com,
> > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
> > jerin.jacob@caviumnetworks.com, nd@arm.com,  stable@dpdk.org
> > Subject: [PATCH v4 2/4] ring: read tail using atomic load
> > X-Mailer: git-send-email 2.7.4
> >
> >
> > In update_tail, read ht->tail using __atomic_load. Although the
> > compiler currently seems to be doing the right thing even without
> > __atomic_load, we don't want to give the compiler freedom to optimise
> > what should be an atomic load; it should not be arbitrarily moved
> > around.
> >
> > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > Cc: stable@dpdk.org
> 
> 
> + Jia He <jia.he@hxt-semitech.com>
> 
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > ---
> >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
> > index 94df3c4..234fea0 100644
> > --- a/lib/librte_ring/rte_ring_c11_mem.h
> > +++ b/lib/librte_ring/rte_ring_c11_mem.h
> > @@ -21,7 +21,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
> >          * we need to wait for them to complete
> >          */
> >         if (!single)
> > -               while (unlikely(ht->tail != old_val))
> > +               while (unlikely(old_val != __atomic_load_n(&ht->tail,
> > +                                               __ATOMIC_RELAXED)))
> >                         rte_pause();
> >
> >         __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > --
> > 2.7.4
> >

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-09-17  8:17   ` [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load Gavin Hu
  2018-09-17  8:17     ` [dpdk-dev] [PATCH v3 2/3] ring: synchronize the load and store of the tail Gavin Hu
  2018-09-17  8:17     ` [dpdk-dev] [PATCH v3 3/3] " Gavin Hu
@ 2018-09-26  9:29     ` Gavin Hu (Arm Technology China)
  2018-09-26 10:09       ` Justin He
  2018-09-29 10:48     ` Jerin Jacob
  2018-10-27 14:17     ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
  4 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-09-26  9:29 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), dev
  Cc: Honnappa Nagarahalli, Steve Capper, Ola Liljedahl, jerin.jacob,
	nd, stable, Justin He

+Justin He

> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Monday, September 17, 2018 4:17 PM
> To: dev@dpdk.org
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper
> <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> jerin.jacob@caviumnetworks.com; nd <nd@arm.com>; stable@dpdk.org
> Subject: [PATCH v3 1/3] ring: read tail using atomic load
> 
> In update_tail, read ht->tail using __atomic_load. Although the compiler
> currently seems to be doing the right thing even without __atomic_load, we
> don't want to give the compiler freedom to optimise what should be an
> atomic load; it should not be arbitrarily moved around.
> 
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> ---
>  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
> index 94df3c4..234fea0 100644
> --- a/lib/librte_ring/rte_ring_c11_mem.h
> +++ b/lib/librte_ring/rte_ring_c11_mem.h
> @@ -21,7 +21,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
>  	 * we need to wait for them to complete
>  	 */
>  	if (!single)
> -		while (unlikely(ht->tail != old_val))
> +		while (unlikely(old_val != __atomic_load_n(&ht->tail,
> +						__ATOMIC_RELAXED)))
>  			rte_pause();
> 
>  	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> --
> 2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/3] ring: synchronize the load and store of the tail
  2018-09-17  8:17     ` [dpdk-dev] [PATCH v3 2/3] ring: synchronize the load and store of the tail Gavin Hu
@ 2018-09-26  9:29       ` Gavin Hu (Arm Technology China)
  2018-09-26  9:59         ` Justin He
  2018-09-29 10:57       ` Jerin Jacob
  2018-10-17  6:29       ` [dpdk-dev] [PATCH 1/2] " Gavin Hu
  2 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-09-26  9:29 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), dev
  Cc: Honnappa Nagarahalli, Steve Capper, Ola Liljedahl, jerin.jacob,
	nd, stable, Justin He

+Justin He for review.

> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Monday, September 17, 2018 4:17 PM
> To: dev@dpdk.org
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper
> <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> jerin.jacob@caviumnetworks.com; nd <nd@arm.com>; stable@dpdk.org
> Subject: [PATCH v3 2/3] ring: synchronize the load and store of the tail
> 
> Synchronize the load-acquire of the tail and the store-release within
> update_tail. The store-release ensures all the ring operations, enqueue or
> dequeue, are seen by the observers on the other side as soon as they see
> the updated tail. The load-acquire is needed here as the data dependency is
> not a reliable way for ordering as the compiler might break it by saving to
> temporary values to boost performance.
> When computing the free_entries and avail_entries, use atomic semantics to
> load the heads and tails instead.
> 
> The patch was benchmarked with test/ring_perf_autotest and it decreases
> the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
> are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
> For 1 lcore, it also improves a little, about 3 ~ 4%.
> It is a big improvement, in case of MPMC, with two lcores and ring size of 32,
> it saves latency up to (3.26-2.36)/3.26 = 27.6%.
> 
> This patch is a bug fix, while the improvement is a bonus. In our analysis the
> improvement comes from the cacheline pre-filling after hoisting the
> load-acquire from __atomic_compare_exchange_n up above.
> 
> The test command:
> $sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
> 1024 -- -i
> 
> Test result with this patch (two cores):
>  SP/SC bulk enq/dequeue (size: 8): 5.86
>  MP/MC bulk enq/dequeue (size: 8): 10.15
>  SP/SC bulk enq/dequeue (size: 32): 1.94
>  MP/MC bulk enq/dequeue (size: 32): 2.36
> 
> In comparison, the test result without this patch:
>  SP/SC bulk enq/dequeue (size: 8): 6.67
>  MP/MC bulk enq/dequeue (size: 8): 13.12
>  SP/SC bulk enq/dequeue (size: 32): 2.04
>  MP/MC bulk enq/dequeue (size: 32): 3.26
> 
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> ---
>  lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
> index 234fea0..0eae3b3 100644
> --- a/lib/librte_ring/rte_ring_c11_mem.h
> +++ b/lib/librte_ring/rte_ring_c11_mem.h
> @@ -68,13 +68,18 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  		*old_head = __atomic_load_n(&r->prod.head,
>  					__ATOMIC_ACQUIRE);
> 
> -		/*
> -		 *  The subtraction is done between two unsigned 32bits value
> +		/* load-acquire synchronize with store-release of ht->tail
> +		 * in update_tail.
> +		 */
> +		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
> +					__ATOMIC_ACQUIRE);
> +
> +		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * *old_head > cons_tail). So 'free_entries' is always between 0
>  		 * and capacity (which is < size).
>  		 */
> -		*free_entries = (capacity + r->cons.tail - *old_head);
> +		*free_entries = (capacity + cons_tail - *old_head);
> 
>  		/* check that we have enough room in ring */
>  		if (unlikely(n > *free_entries))
> @@ -132,15 +137,22 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  	do {
>  		/* Restore n as it may change every loop */
>  		n = max;
> +
>  		*old_head = __atomic_load_n(&r->cons.head,
>  					__ATOMIC_ACQUIRE);
> 
> +		/* this load-acquire synchronize with store-release of ht->tail
> +		 * in update_tail.
> +		 */
> +		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
> +					__ATOMIC_ACQUIRE);
> +
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * cons_head > prod_tail). So 'entries' is always between 0
>  		 * and size(ring)-1.
>  		 */
> -		*entries = (r->prod.tail - *old_head);
> +		*entries = (prod_tail - *old_head);
> 
>  		/* Set the actual entries for dequeue */
>  		if (n > *entries)
> --
> 2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/3] ring: move the atomic load of head above the loop
  2018-09-17  8:17     ` [dpdk-dev] [PATCH v3 3/3] " Gavin Hu
@ 2018-09-26  9:29       ` Gavin Hu (Arm Technology China)
  2018-09-26 10:06         ` Justin He
  2018-09-29 10:59       ` Jerin Jacob
  1 sibling, 1 reply; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-09-26  9:29 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), dev
  Cc: Honnappa Nagarahalli, Steve Capper, Ola Liljedahl, jerin.jacob,
	nd, stable, Justin He

+Justin He for review

> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Monday, September 17, 2018 4:17 PM
> To: dev@dpdk.org
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper
> <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> jerin.jacob@caviumnetworks.com; nd <nd@arm.com>; stable@dpdk.org
> Subject: [PATCH v3 3/3] ring: move the atomic load of head above the loop
> 
> In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
> the do {} while loop, as upon failure the old_head will be updated; another
> load is costly and not necessary.
> 
> This helps a little on the latency,about 1~5%.
> 
>  Test result with the patch (two cores):
>  SP/SC bulk enq/dequeue (size: 8): 5.64
>  MP/MC bulk enq/dequeue (size: 8): 9.58
>  SP/SC bulk enq/dequeue (size: 32): 1.98
>  MP/MC bulk enq/dequeue (size: 32): 2.30
> 
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> ---
>  lib/librte_ring/rte_ring_c11_mem.h | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
> index 0eae3b3..95cc508 100644
> --- a/lib/librte_ring/rte_ring_c11_mem.h
> +++ b/lib/librte_ring/rte_ring_c11_mem.h
> @@ -61,13 +61,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  	unsigned int max = n;
>  	int success;
> 
> +	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
>  	do {
>  		/* Reset n to the initial burst count */
>  		n = max;
> 
> -		*old_head = __atomic_load_n(&r->prod.head,
> -					__ATOMIC_ACQUIRE);
> -
>  		/* load-acquire synchronize with store-release of ht->tail
>  		 * in update_tail.
>  		 */
> @@ -93,6 +91,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  		if (is_sp)
>  			r->prod.head = *new_head, success = 1;
>  		else
> +			/* on failure, *old_head is updated */
>  			success = __atomic_compare_exchange_n(&r->prod.head,
>  					old_head, *new_head,
>  					0, __ATOMIC_ACQUIRE,
> @@ -134,13 +133,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  	int success;
> 
>  	/* move cons.head atomically */
> +	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
>  	do {
>  		/* Restore n as it may change every loop */
>  		n = max;
> 
> -		*old_head = __atomic_load_n(&r->cons.head,
> -					__ATOMIC_ACQUIRE);
> -
>  		/* this load-acquire synchronize with store-release of ht->tail
>  		 * in update_tail.
>  		 */
> @@ -165,6 +162,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  		if (is_sc)
>  			r->cons.head = *new_head, success = 1;
>  		else
> +			/* on failure, *old_head will be updated */
>  			success = __atomic_compare_exchange_n(&r->cons.head,
>  							old_head, *new_head,
>  							0, __ATOMIC_ACQUIRE,
> --
> 2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/3] ring: synchronize the load and store of the tail
  2018-09-26  9:29       ` Gavin Hu (Arm Technology China)
@ 2018-09-26  9:59         ` Justin He
  0 siblings, 0 replies; 131+ messages in thread
From: Justin He @ 2018-09-26  9:59 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), dev
  Cc: Honnappa Nagarahalli, Steve Capper, Ola Liljedahl, jerin.jacob,
	nd, stable

Hi Gavin

> -----Original Message-----
> From: Gavin Hu (Arm Technology China)
> Sent: 2018年9月26日 17:30
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper
> <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> jerin.jacob@caviumnetworks.com; nd <nd@arm.com>; stable@dpdk.org; Justin
> He <Justin.He@arm.com>
> Subject: RE: [PATCH v3 2/3] ring: synchronize the load and store of the tail
>
> +Justin He for review.
>
> > -----Original Message-----
> > From: Gavin Hu <gavin.hu@arm.com>
> > Sent: Monday, September 17, 2018 4:17 PM
> > To: dev@dpdk.org
> > Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa
> > Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper
> > <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> > jerin.jacob@caviumnetworks.com; nd <nd@arm.com>; stable@dpdk.org
> > Subject: [PATCH v3 2/3] ring: synchronize the load and store of the
> > tail
> >
> > Synchronize the load-acquire of the tail and the store-release within
> > update_tail, the store release ensures all the ring operations,
> > enqueue or dequeue, are seen by the observers on the other side as
> > soon as they see the updated tail. The load-acquire is needed here as
> > the data dependency is not a reliable way for ordering as the compiler
> > might break it by saving to temporary values to boost performance.
> > When computing the free_entries and avail_entries, use atomic
> > semantics to load the heads and tails instead.
> >
> > The patch was benchmarked with test/ring_perf_autotest and it
> > decreases the enqueue/dequeue latency by 5% ~ 27.6% with two lcores,
> > the real gains are dependent on the number of lcores, depth of the ring, SPSC
> or MPMC.
> > For 1 lcore, it also improves a little, about 3 ~ 4%.
> > It is a big improvement, in case of MPMC, with two lcores and ring
> > size of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.
> >
> > This patch is a bug fix, while the improvement is a bonus. In our
> > analysis the improvement comes from the cacheline pre-filling after
> > hoisting the load-acquire from __atomic_compare_exchange_n up above.
> >
> > The test command:
> > $sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4
> > --socket-mem=\
> > 1024 -- -i
> >
> > Test result with this patch(two cores):
> >  SP/SC bulk enq/dequeue (size: 8): 5.86
> >  MP/MC bulk enq/dequeue (size: 8): 10.15
> >  SP/SC bulk enq/dequeue (size: 32): 1.94
> >  MP/MC bulk enq/dequeue (size: 32): 2.36
> >
> > In comparison of the test result without this patch:
> >  SP/SC bulk enq/dequeue (size: 8): 6.67
> >  MP/MC bulk enq/dequeue (size: 8): 13.12
> >  SP/SC bulk enq/dequeue (size: 32): 2.04
> >  MP/MC bulk enq/dequeue (size: 32): 3.26
> >
> > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > ---
> >  lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
> >  1 file changed, 16 insertions(+), 4 deletions(-)
> >
> > diff --git a/lib/librte_ring/rte_ring_c11_mem.h
> > b/lib/librte_ring/rte_ring_c11_mem.h
> > index 234fea0..0eae3b3 100644
> > --- a/lib/librte_ring/rte_ring_c11_mem.h
> > +++ b/lib/librte_ring/rte_ring_c11_mem.h
> > @@ -68,13 +68,18 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> >  *old_head = __atomic_load_n(&r->prod.head,
> >  __ATOMIC_ACQUIRE);
> >
> > -/*
> > - *  The subtraction is done between two unsigned 32bits value
> > +/* load-acquire synchronize with store-release of ht->tail
> > + * in update_tail.
> > + */
> > +const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
> > +__ATOMIC_ACQUIRE);
I noticed earlier that FreeBSD also uses the double __atomic_load_n. As we
discussed several months ago [1], the second load_acquire seems
unnecessary. But as you have verified it, it looks good to me.
[1] http://mails.dpdk.org/archives/dev/2017-November/080983.html
So,
Reviewed-by: Jia He <justin.he@arm.com>

Cheers,
Justin (Jia He)
> > +
> > +/* The subtraction is done between two unsigned 32bits value
> >   * (the result is always modulo 32 bits even if we have
> >   * *old_head > cons_tail). So 'free_entries' is always between 0
> >   * and capacity (which is < size).
> >   */
> > -*free_entries = (capacity + r->cons.tail - *old_head);
> > +*free_entries = (capacity + cons_tail - *old_head);
> >
> >  /* check that we have enough room in ring */
> >  if (unlikely(n > *free_entries))
> > @@ -132,15 +137,22 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> >  do {
> >  /* Restore n as it may change every loop */
> >  n = max;
> > +
> >  *old_head = __atomic_load_n(&r->cons.head,
> >  __ATOMIC_ACQUIRE);
> >
> > +/* this load-acquire synchronize with store-release of ht->tail
> > + * in update_tail.
> > + */
> > +const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
> > +__ATOMIC_ACQUIRE);
> > +
> >  /* The subtraction is done between two unsigned 32bits value
> >   * (the result is always modulo 32 bits even if we have
> >   * cons_head > prod_tail). So 'entries' is always between 0
> >   * and size(ring)-1.
> >   */
> > -*entries = (r->prod.tail - *old_head);
> > +*entries = (prod_tail - *old_head);
> >
> >  /* Set the actual entries for dequeue */
> >  if (n > *entries)
> > --
> > 2.7.4

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/3] ring: move the atomic load of head above the loop
  2018-09-26  9:29       ` Gavin Hu (Arm Technology China)
@ 2018-09-26 10:06         ` Justin He
  2018-09-29  7:19           ` Stephen Hemminger
  0 siblings, 1 reply; 131+ messages in thread
From: Justin He @ 2018-09-26 10:06 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), dev
  Cc: Honnappa Nagarahalli, Steve Capper, Ola Liljedahl, jerin.jacob,
	nd, stable



> -----Original Message-----
> From: Gavin Hu (Arm Technology China)
> Sent: 2018年9月26日 17:30
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper
> <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> jerin.jacob@caviumnetworks.com; nd <nd@arm.com>; stable@dpdk.org; Justin
> He <Justin.He@arm.com>
> Subject: RE: [PATCH v3 3/3] ring: move the atomic load of head above the loop
>
> +Justin He for review
>
> > -----Original Message-----
> > From: Gavin Hu <gavin.hu@arm.com>
> > Sent: Monday, September 17, 2018 4:17 PM
> > To: dev@dpdk.org
> > Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa
> > Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper
> > <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> > jerin.jacob@caviumnetworks.com; nd <nd@arm.com>; stable@dpdk.org
> > Subject: [PATCH v3 3/3] ring: move the atomic load of head above the loop
> >
> > In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
> > the do {} while loop as upon failure the old_head will be updated, another
> > load is costly and not necessary.
> >
> > This helps a little on the latency, about 1~5%.
> >
> >  Test result with the patch(two cores):
> >  SP/SC bulk enq/dequeue (size: 8): 5.64
> >  MP/MC bulk enq/dequeue (size: 8): 9.58
> > >  SP/SC bulk enq/dequeue (size: 32): 1.98
> > >  MP/MC bulk enq/dequeue (size: 32): 2.30
> >
> > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > ---
> >  lib/librte_ring/rte_ring_c11_mem.h | 10 ++++------
> >  1 file changed, 4 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/librte_ring/rte_ring_c11_mem.h
> > b/lib/librte_ring/rte_ring_c11_mem.h
> > index 0eae3b3..95cc508 100644
> > --- a/lib/librte_ring/rte_ring_c11_mem.h
> > +++ b/lib/librte_ring/rte_ring_c11_mem.h
> > @@ -61,13 +61,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> >  unsigned int max = n;
> >  int success;
> >
> > +*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
> >  do {
> >  /* Reset n to the initial burst count */
> >  n = max;
> >
> > -*old_head = __atomic_load_n(&r->prod.head,
> > -__ATOMIC_ACQUIRE);
> > -
> >  /* load-acquire synchronize with store-release of ht->tail
> >   * in update_tail.
> >   */
> > @@ -93,6 +91,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> >  if (is_sp)
> >  r->prod.head = *new_head, success = 1;
> >  else
> > +/* on failure, *old_head is updated */
> > success = __atomic_compare_exchange_n(&r->prod.head,
> >  old_head, *new_head,
> >  0, __ATOMIC_ACQUIRE,
> > @@ -134,13 +133,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> >  int success;
> >
> >  /* move cons.head atomically */
> > +*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
> >  do {
> >  /* Restore n as it may change every loop */
> >  n = max;
> >
> > -*old_head = __atomic_load_n(&r->cons.head,
> > -__ATOMIC_ACQUIRE);
> > -
> >  /* this load-acquire synchronize with store-release of ht->tail
> >   * in update_tail.
> >   */
> > @@ -165,6 +162,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> >  if (is_sc)
> >  r->cons.head = *new_head, success = 1;
> >  else
> > +/* on failure, *old_head will be updated */
> > success = __atomic_compare_exchange_n(&r->cons.head,
> > old_head, *new_head,
> > 0, __ATOMIC_ACQUIRE,
> > --
> > 2.7.4
Reviewed-by: Jia He <justin.he@arm.com>

Cheers,
Justin (Jia He)

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-09-26  9:29     ` [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load Gavin Hu (Arm Technology China)
@ 2018-09-26 10:09       ` Justin He
  0 siblings, 0 replies; 131+ messages in thread
From: Justin He @ 2018-09-26 10:09 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), dev
  Cc: Honnappa Nagarahalli, Steve Capper, Ola Liljedahl, jerin.jacob,
	nd, stable



> -----Original Message-----
> From: Gavin Hu (Arm Technology China)
> Sent: 2018年9月26日 17:29
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper
> <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> jerin.jacob@caviumnetworks.com; nd <nd@arm.com>; stable@dpdk.org; Justin
> He <Justin.He@arm.com>
> Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
>
> +Justin He
>
> > -----Original Message-----
> > From: Gavin Hu <gavin.hu@arm.com>
> > Sent: Monday, September 17, 2018 4:17 PM
> > To: dev@dpdk.org
> > Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa
> > Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper
> > <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> > jerin.jacob@caviumnetworks.com; nd <nd@arm.com>; stable@dpdk.org
> > Subject: [PATCH v3 1/3] ring: read tail using atomic load
> >
> > In update_tail, read ht->tail using __atomic_load. Although the
> > compiler currently seems to be doing the right thing even without
> > __atomic_load, we don't want to give the compiler freedom to optimise
> > what should be an atomic load; it should not be arbitrarily moved around.
> >
> > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > ---
> >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
> > index 94df3c4..234fea0 100644
> > --- a/lib/librte_ring/rte_ring_c11_mem.h
> > +++ b/lib/librte_ring/rte_ring_c11_mem.h
> > @@ -21,7 +21,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
> >   * we need to wait for them to complete
> >   */
> >  if (!single)
> > -while (unlikely(ht->tail != old_val))
> > +while (unlikely(old_val != __atomic_load_n(&ht->tail,
> > +__ATOMIC_RELAXED)))

I still wonder why an atomic load is needed here?
Cheers,
Justin (Jia He)
> >  rte_pause();
> >
> >  __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > --
> > 2.7.4

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/3] ring: move the atomic load of head above the loop
  2018-09-26 10:06         ` Justin He
@ 2018-09-29  7:19           ` Stephen Hemminger
  0 siblings, 0 replies; 131+ messages in thread
From: Stephen Hemminger @ 2018-09-29  7:19 UTC (permalink / raw)
  To: Justin He
  Cc: Gavin Hu (Arm Technology China),
	dev, Honnappa Nagarahalli, Steve Capper, Ola Liljedahl,
	jerin.jacob, nd, stable

On Wed, 26 Sep 2018 10:06:36 +0000
Justin He <Justin.He@arm.com> wrote:

> Reviewed-by: Jia He <justin.he@arm.com>
> 
> Cheers,
> Justin (Jia He)
> IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

Please adjust your corporate mail settings to remove this automatic footer
because the footer wording creates a legal conflict with the open and public
mailing lists.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-09-17  8:17   ` [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load Gavin Hu
                       ` (2 preceding siblings ...)
  2018-09-26  9:29     ` [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load Gavin Hu (Arm Technology China)
@ 2018-09-29 10:48     ` Jerin Jacob
  2018-10-05  0:47       ` Gavin Hu (Arm Technology China)
  2018-10-27 14:17     ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
  4 siblings, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2018-09-29 10:48 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl, nd, stable

-----Original Message-----
> Date: Mon, 17 Sep 2018 16:17:22 +0800
> From: Gavin Hu <gavin.hu@arm.com>
> To: dev@dpdk.org
> CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com, steve.capper@arm.com,
>  Ola.Liljedahl@arm.com, jerin.jacob@caviumnetworks.com, nd@arm.com,
>  stable@dpdk.org
> Subject: [PATCH v3 1/3] ring: read tail using atomic load
> X-Mailer: git-send-email 2.7.4
> 
> External Email
> 
> In update_tail, read ht->tail using __atomic_load. Although the
> compiler currently seems to be doing the right thing even without
> __atomic_load, we don't want to give the compiler freedom to optimise
> what should be an atomic load; it should not be arbitrarily moved
> around.
> 
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> ---
>  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
> index 94df3c4..234fea0 100644
> --- a/lib/librte_ring/rte_ring_c11_mem.h
> +++ b/lib/librte_ring/rte_ring_c11_mem.h
> @@ -21,7 +21,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
>          * we need to wait for them to complete
>          */
>         if (!single)
> -               while (unlikely(ht->tail != old_val))
> +               while (unlikely(old_val != __atomic_load_n(&ht->tail,
> +                                               __ATOMIC_RELAXED)))
>                         rte_pause();

Since it is a while loop with rte_pause(), IMO, there is no scope for false compiler optimization.
IMO, this change may not be required, though I don't see any performance
difference with the two-core ring_perf_autotest test. Maybe with more cores
it may have an effect. IMO, if it is not absolutely required, we can avoid
this change.

> 
>         __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> --
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/3] ring: synchronize the load and store of the tail
  2018-09-17  8:17     ` [dpdk-dev] [PATCH v3 2/3] ring: synchronize the load and store of the tail Gavin Hu
  2018-09-26  9:29       ` Gavin Hu (Arm Technology China)
@ 2018-09-29 10:57       ` Jerin Jacob
  2018-10-17  6:29       ` [dpdk-dev] [PATCH 1/2] " Gavin Hu
  2 siblings, 0 replies; 131+ messages in thread
From: Jerin Jacob @ 2018-09-29 10:57 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl, nd, stable

-----Original Message-----
> Date: Mon, 17 Sep 2018 16:17:23 +0800
> From: Gavin Hu <gavin.hu@arm.com>
> To: dev@dpdk.org
> CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com, steve.capper@arm.com,
>  Ola.Liljedahl@arm.com, jerin.jacob@caviumnetworks.com, nd@arm.com,
>  stable@dpdk.org
> Subject: [PATCH v3 2/3] ring: synchronize the load and store of the tail
> X-Mailer: git-send-email 2.7.4
> 
> 
> Synchronize the load-acquire of the tail and the store-release
> within update_tail, the store release ensures all the ring operations,
> enqueue or dequeue, are seen by the observers on the other side as soon
> as they see the updated tail. The load-acquire is needed here as the
> data dependency is not a reliable way of ordering, as the compiler might
> break it by caching values in temporaries to boost performance.
> When computing the free_entries and avail_entries, use atomic semantics
> to load the heads and tails instead.
> 
> The patch was benchmarked with test/ring_perf_autotest and it decreases
> the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
> are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
> For 1 lcore, it also improves a little, about 3 ~ 4%.
> It is a big improvement, in case of MPMC, with two lcores and ring size
> of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.
> 
> This patch is a bug fix, while the improvement is a bonus. In our analysis
> the improvement comes from the cacheline pre-filling after hoisting the
> load-acquire from __atomic_compare_exchange_n up above.
> 
> The test command:
> $sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
> 1024 -- -i
> 
> Test result with this patch(two cores):
>  SP/SC bulk enq/dequeue (size: 8): 5.86
>  MP/MC bulk enq/dequeue (size: 8): 10.15
>  SP/SC bulk enq/dequeue (size: 32): 1.94
>  MP/MC bulk enq/dequeue (size: 32): 2.36
> 
> In comparison of the test result without this patch:
>  SP/SC bulk enq/dequeue (size: 8): 6.67
>  MP/MC bulk enq/dequeue (size: 8): 13.12
>  SP/SC bulk enq/dequeue (size: 32): 2.04
>  MP/MC bulk enq/dequeue (size: 32): 3.26
> 
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>

Tested with the ThunderX2 server platform. Though it has a minor performance
impact on non-burst variants, for the burst variant I could see a similar
performance improvement, and it is C11-semantically correct too.

Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/3] ring: move the atomic load of head above the loop
  2018-09-17  8:17     ` [dpdk-dev] [PATCH v3 3/3] " Gavin Hu
  2018-09-26  9:29       ` Gavin Hu (Arm Technology China)
@ 2018-09-29 10:59       ` Jerin Jacob
  1 sibling, 0 replies; 131+ messages in thread
From: Jerin Jacob @ 2018-09-29 10:59 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl, nd, stable

-----Original Message-----
> Date: Mon, 17 Sep 2018 16:17:24 +0800
> From: Gavin Hu <gavin.hu@arm.com>
> To: dev@dpdk.org
> CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com, steve.capper@arm.com,
>  Ola.Liljedahl@arm.com, jerin.jacob@caviumnetworks.com, nd@arm.com,
>  stable@dpdk.org
> Subject: [PATCH v3 3/3] ring: move the atomic load of head above the loop
> X-Mailer: git-send-email 2.7.4
> 
> External Email
> 
> In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
> the do {} while loop as upon failure the old_head will be updated,
> another load is costly and not necessary.
> 
> This helps a little on the latency, about 1~5%.
> 
>  Test result with the patch(two cores):
>  SP/SC bulk enq/dequeue (size: 8): 5.64
>  MP/MC bulk enq/dequeue (size: 8): 9.58
>  SP/SC bulk enq/dequeue (size: 32): 1.98
>  MP/MC bulk enq/dequeue (size: 32): 2.30
> 
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>

Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-09-29 10:48     ` Jerin Jacob
@ 2018-10-05  0:47       ` Gavin Hu (Arm Technology China)
  2018-10-05  8:21         ` Ananyev, Konstantin
  0 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-10-05  0:47 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Steve Capper, Ola Liljedahl, nd, stable

Hi Jerin,

Thanks for your review, inline comments from our internal discussions.

BR. Gavin

> -----Original Message-----
> From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Sent: Saturday, September 29, 2018 6:49 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Steve Capper
> <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>; nd
> <nd@arm.com>; stable@dpdk.org
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> 
> -----Original Message-----
> > Date: Mon, 17 Sep 2018 16:17:22 +0800
> > From: Gavin Hu <gavin.hu@arm.com>
> > To: dev@dpdk.org
> > CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com,
> > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
> > jerin.jacob@caviumnetworks.com, nd@arm.com,  stable@dpdk.org
> > Subject: [PATCH v3 1/3] ring: read tail using atomic load
> > X-Mailer: git-send-email 2.7.4
> >
> > External Email
> >
> > In update_tail, read ht->tail using __atomic_load. Although the
> > compiler currently seems to be doing the right thing even without
> > __atomic_load, we don't want to give the compiler freedom to optimise
> > what should be an atomic load; it should not be arbitrarily moved
> > around.
> >
> > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > ---
> >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_ring/rte_ring_c11_mem.h
> > b/lib/librte_ring/rte_ring_c11_mem.h
> > index 94df3c4..234fea0 100644
> > --- a/lib/librte_ring/rte_ring_c11_mem.h
> > +++ b/lib/librte_ring/rte_ring_c11_mem.h
> > @@ -21,7 +21,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t
> old_val, uint32_t new_val,
> >          * we need to wait for them to complete
> >          */
> >         if (!single)
> > -               while (unlikely(ht->tail != old_val))
> > +               while (unlikely(old_val != __atomic_load_n(&ht->tail,
> > +                                               __ATOMIC_RELAXED)))
> >                         rte_pause();
> 
> Since it is a while loop with rte_pause(), IMO, there is no scope for false
> compiler optimization.
> IMO, this change may not be required, though I don't see any performance
> difference with the two-core ring_perf_autotest test. Maybe with more cores
> it may have an effect. IMO, if it is not absolutely required, we can avoid this change.
> 

Using __atomic_load_n() has two purposes:
1) the old code only works because ht->tail is declared volatile which is not a requirement for C11 or for the use of __atomic builtins. If ht->tail was not declared volatile and __atomic_load_n() not used, the compiler would likely hoist the load above the loop. 
2) I think all memory locations used for synchronization should use __atomic operations for access in order to clearly indicate that these locations (and these accesses) are used for synchronization.

The read of ht->tail needs to be atomic, a non-atomic read would not be correct. But there are no memory ordering requirements (with regards to other loads and/or stores by this thread) so relaxed memory order is sufficient.
Another aspect of using __atomic_load_n() is that the compiler cannot "optimise" this load (e.g. combine, hoist etc), it has to be done as specified in the source code which is also what we need here.

One point worth mentioning though is that this change is for the rte_ring_c11_mem.h file, not the legacy ring. It may be worth persisting with getting the C11 code right when people are less excited about sending a release out?

We can explain that for C11 we would prefer to do loads and stores as per the C11 memory model. In the case of rte_ring, the code is separated cleanly into C11 specific files anyway.

I think reading ht->tail using __atomic_load_n() is the most appropriate way. We show that ht->tail is used for synchronization, we acknowledge that ht->tail may be written by other threads without any other kind of synchronization (e.g. no lock involved) and we require an atomic load (any write to ht->tail must also be atomic).

Using volatile and explicit compiler (or processor) memory barriers (fences) is the legacy pre-C11 way of accomplishing these things. There's a reason why C11/C++11 moved away from the old ways.
> >
> >         __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > --
> > 2.7.4
> >

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-05  0:47       ` Gavin Hu (Arm Technology China)
@ 2018-10-05  8:21         ` Ananyev, Konstantin
  2018-10-05 11:15           ` Ola Liljedahl
  0 siblings, 1 reply; 131+ messages in thread
From: Ananyev, Konstantin @ 2018-10-05  8:21 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Steve Capper, Ola Liljedahl, nd, stable



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Gavin Hu (Arm Technology China)
> Sent: Friday, October 5, 2018 1:47 AM
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper <Steve.Capper@arm.com>; Ola Liljedahl
> <Ola.Liljedahl@arm.com>; nd <nd@arm.com>; stable@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
> 
> Hi Jerin,
> 
> Thanks for your review, inline comments from our internal discussions.
> 
> BR. Gavin
> 
> > -----Original Message-----
> > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > Sent: Saturday, September 29, 2018 6:49 PM
> > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> > Cc: dev@dpdk.org; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; Steve Capper
> > <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>; nd
> > <nd@arm.com>; stable@dpdk.org
> > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> >
> > -----Original Message-----
> > > Date: Mon, 17 Sep 2018 16:17:22 +0800
> > > From: Gavin Hu <gavin.hu@arm.com>
> > > To: dev@dpdk.org
> > > CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com,
> > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
> > > jerin.jacob@caviumnetworks.com, nd@arm.com,  stable@dpdk.org
> > > Subject: [PATCH v3 1/3] ring: read tail using atomic load
> > > X-Mailer: git-send-email 2.7.4
> > >
> > > External Email
> > >
> > > In update_tail, read ht->tail using __atomic_load. Although the
> > > compiler currently seems to be doing the right thing even without
> > > __atomic_load, we don't want to give the compiler freedom to optimise
> > > what should be an atomic load; it should not be arbitrarily moved
> > > around.
> > >
> > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > > ---
> > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/lib/librte_ring/rte_ring_c11_mem.h
> > > b/lib/librte_ring/rte_ring_c11_mem.h
> > > index 94df3c4..234fea0 100644
> > > --- a/lib/librte_ring/rte_ring_c11_mem.h
> > > +++ b/lib/librte_ring/rte_ring_c11_mem.h
> > > @@ -21,7 +21,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t
> > old_val, uint32_t new_val,
> > >          * we need to wait for them to complete
> > >          */
> > >         if (!single)
> > > -               while (unlikely(ht->tail != old_val))
> > > +               while (unlikely(old_val != __atomic_load_n(&ht->tail,
> > > +                                               __ATOMIC_RELAXED)))
> > >                         rte_pause();
> >
> > Since it is a while loop with rte_pause(), IMO, there is no scope for false
> > compiler optimization.
> > IMO, this change may not be required, though I don't see any performance
> > difference with the two-core ring_perf_autotest test. Maybe with more cores
> > it may have an effect. IMO, if it is not absolutely required, we can avoid this change.
> >
> 
> Using __atomic_load_n() has two purposes:
> 1) the old code only works because ht->tail is declared volatile which is not a requirement for C11 or for the use of __atomic builtins. If ht-
> >tail was not declared volatile and __atomic_load_n() not used, the compiler would likely hoist the load above the loop.
> 2) I think all memory locations used for synchronization should use __atomic operations for access in order to clearly indicate that these
> locations (and these accesses) are used for synchronization.
> 
> The read of ht->tail needs to be atomic, a non-atomic read would not be correct.

That's a 32bit value load.
AFAIK on all CPUs that we support it is an atomic operation.

> But there are no memory ordering requirements (with
> regards to other loads and/or stores by this thread) so relaxed memory order is sufficient.
> Another aspect of using __atomic_load_n() is that the compiler cannot "optimise" this load (e.g. combine, hoist etc), it has to be done as
> specified in the source code which is also what we need here.

I think Jerin's point is that rte_pause() acts here as a compiler barrier too,
so no need to worry that the compiler would optimize out the loop.
Konstantin

> 
> One point worth mentioning though is that this change is for the rte_ring_c11_mem.h file, not the legacy ring. It may be worth persisting
> with getting the C11 code right when people are less excited about sending a release out?
> 
> We can explain that for C11 we would prefer to do loads and stores as per the C11 memory model. In the case of rte_ring, the code is
> separated cleanly into C11 specific files anyway.
> 
> I think reading ht->tail using __atomic_load_n() is the most appropriate way. We show that ht->tail is used for synchronization, we
> acknowledge that ht->tail may be written by other threads without any other kind of synchronization (e.g. no lock involved) and we require
> an atomic load (any write to ht->tail must also be atomic).
> 
> Using volatile and explicit compiler (or processor) memory barriers (fences) is the legacy pre-C11 way of accomplishing these things. There's
> a reason why C11/C++11 moved away from the old ways.
> > >
> > >         __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > > --
> > > 2.7.4
> > >

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-05  8:21         ` Ananyev, Konstantin
@ 2018-10-05 11:15           ` Ola Liljedahl
  2018-10-05 11:36             ` Ola Liljedahl
  0 siblings, 1 reply; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-05 11:15 UTC (permalink / raw)
  To: Ananyev, Konstantin, Gavin Hu (Arm Technology China), Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Steve Capper, nd, stable



On 05/10/2018, 10:22, "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:

    
    
    > -----Original Message-----
    > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Gavin Hu (Arm Technology China)
    > Sent: Friday, October 5, 2018 1:47 AM
    > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    > Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper <Steve.Capper@arm.com>; Ola Liljedahl
    > <Ola.Liljedahl@arm.com>; nd <nd@arm.com>; stable@dpdk.org
    > Subject: Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
    > 
    > Hi Jerin,
    > 
    > Thanks for your review, inline comments from our internal discussions.
    > 
    > BR. Gavin
    > 
    > > -----Original Message-----
    > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    > > Sent: Saturday, September 29, 2018 6:49 PM
    > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
    > > Cc: dev@dpdk.org; Honnappa Nagarahalli
    > > <Honnappa.Nagarahalli@arm.com>; Steve Capper
    > > <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>; nd
    > > <nd@arm.com>; stable@dpdk.org
    > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    > >
    > > -----Original Message-----
    > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
    > > > From: Gavin Hu <gavin.hu@arm.com>
    > > > To: dev@dpdk.org
    > > > CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com,
    > > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
    > > > jerin.jacob@caviumnetworks.com, nd@arm.com,  stable@dpdk.org
    > > > Subject: [PATCH v3 1/3] ring: read tail using atomic load
    > > > X-Mailer: git-send-email 2.7.4
    > > >
    > > > External Email
    > > >
    > > > In update_tail, read ht->tail using __atomic_load. Although the
    > > > compiler currently seems to be doing the right thing even without
    > > > __atomic_load, we don't want to give the compiler freedom to optimise
    > > > what should be an atomic load; it should not be arbitrarily moved
    > > > around.
    > > >
    > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
    > > > Cc: stable@dpdk.org
    > > >
    > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
    > > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
    > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
    > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
    > > > ---
    > > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
    > > >  1 file changed, 2 insertions(+), 1 deletion(-)
    > > > 
    > The read of ht->tail needs to be atomic, a non-atomic read would not be correct.
    
    That's a 32bit value load.
    AFAIK on all CPUs that we support it is an atomic operation.
[Ola] But that the ordinary C load is translated to an atomic load for the target architecture is incidental.

If the design requires an atomic load (which is the case here), we should use an atomic load on the language level. Then we can be sure it will always be translated to an atomic load for the target in question or compilation will fail. We don't have to depend on assumptions.


    
    > But there are no memory ordering requirements (with
    > regards to other loads and/or stores by this thread) so relaxed memory order is sufficient.
    > Another aspect of using __atomic_load_n() is that the compiler cannot "optimise" this load (e.g. combine, hoist etc), it has to be done as
    > specified in the source code which is also what we need here.
    
    I think Jerin's point is that rte_pause() acts here as a compiler barrier too,
    so no need to worry that the compiler would optimize out the loop.
[Ola] Sorry, missed that. But the barrier behaviour of rte_pause() is not part of C11; it is essentially a hand-made feature to support the legacy multithreaded memory model (which uses explicit HW and compiler barriers). I'd prefer code using the C11 memory model not to depend on such legacy features.



    Konstantin
    
    > 
    > One point worth mentioning though is that this change is for the rte_ring_c11_mem.h file, not the legacy ring. It may be worth persisting
    > with getting the C11 code right when people are less excited about sending a release out?
    > 
    > We can explain that for C11 we would prefer to do loads and stores as per the C11 memory model. In the case of rte_ring, the code is
    > separated cleanly into C11 specific files anyway.
    > 
    > I think reading ht->tail using __atomic_load_n() is the most appropriate way. We show that ht->tail is used for synchronization, we
    > acknowledge that ht->tail may be written by other threads without any other kind of synchronization (e.g. no lock involved) and we require
    > an atomic load (any write to ht->tail must also be atomic).
    > 
    > Using volatile and explicit compiler (or processor) memory barriers (fences) is the legacy pre-C11 way of accomplishing these things. There's
    > a reason why C11/C++11 moved away from the old ways.
    > > >
    > > >         __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
    > > > --
    > > > 2.7.4
    > > >
    


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-05 11:15           ` Ola Liljedahl
@ 2018-10-05 11:36             ` Ola Liljedahl
  2018-10-05 13:44               ` Ananyev, Konstantin
  0 siblings, 1 reply; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-05 11:36 UTC (permalink / raw)
  To: Ananyev, Konstantin, Gavin Hu (Arm Technology China), Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Steve Capper, nd, stable

https://en.cppreference.com/w/cpp/language/memory_model
<quote>
When an evaluation of an expression writes to a memory location and another evaluation reads or modifies the same memory location, the expressions are said to conflict. A program that has two conflicting evaluations has a data race unless

* both evaluations execute on the same thread or in the same signal handler, or
* both conflicting evaluations are atomic operations (see std::atomic), or
* one of the conflicting evaluations happens-before another (see std::memory_order)
If a data race occurs, the behavior of the program is undefined.
</quote>

C11 and C++11 have the same memory model (otherwise interoperability would be difficult).

Or as Jeff Preshing explains it:
"Any time two threads operate on a shared variable concurrently, and one of those operations performs a write, both threads must use atomic operations."
https://preshing.com/20130618/atomic-vs-non-atomic-operations/

So if ht->tail is written using e.g. __atomic_store_n(&ht->tail, val, mo), we need to also read it using e.g. __atomic_load_n().

-- Ola


On 05/10/2018, 13:15, "Ola Liljedahl" <Ola.Liljedahl@arm.com> wrote:

    
    
    On 05/10/2018, 10:22, "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
    
        
        
        > -----Original Message-----
        > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Gavin Hu (Arm Technology China)
        > Sent: Friday, October 5, 2018 1:47 AM
        > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
        > Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper <Steve.Capper@arm.com>; Ola Liljedahl
        > <Ola.Liljedahl@arm.com>; nd <nd@arm.com>; stable@dpdk.org
        > Subject: Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
        > 
        > Hi Jerin,
        > 
        > Thanks for your review, inline comments from our internal discussions.
        > 
        > BR. Gavin
        > 
        > > -----Original Message-----
        > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
        > > Sent: Saturday, September 29, 2018 6:49 PM
        > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
        > > Cc: dev@dpdk.org; Honnappa Nagarahalli
        > > <Honnappa.Nagarahalli@arm.com>; Steve Capper
        > > <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>; nd
        > > <nd@arm.com>; stable@dpdk.org
        > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
        > >
        > > -----Original Message-----
        > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
        > > > From: Gavin Hu <gavin.hu@arm.com>
        > > > To: dev@dpdk.org
        > > > CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com,
        > > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
        > > > jerin.jacob@caviumnetworks.com, nd@arm.com,  stable@dpdk.org
        > > > Subject: [PATCH v3 1/3] ring: read tail using atomic load
        > > > X-Mailer: git-send-email 2.7.4
        > > >
        > > > External Email
        > > >
        > > > In update_tail, read ht->tail using __atomic_load. Although the
        > > > compiler currently seems to be doing the right thing even without
        > > > __atomic_load, we don't want to give the compiler freedom to optimise
        > > > what should be an atomic load; it should not be arbitrarily moved
        > > > around.
        > > >
        > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
        > > > Cc: stable@dpdk.org
        > > >
        > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
        > > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
        > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
        > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
        > > > ---
        > > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
        > > >  1 file changed, 2 insertions(+), 1 deletion(-)
        > > > 
        > The read of ht->tail needs to be atomic, a non-atomic read would not be correct.
        
        That's a 32bit value load.
        AFAIK on all CPUs that we support it is an atomic operation.
    [Ola] But that the ordinary C load is translated to an atomic load for the target architecture is incidental.
    
    If the design requires an atomic load (which is the case here), we should use an atomic load on the language level. Then we can be sure it will always be translated to an atomic load for the target in question or compilation will fail. We don't have to depend on assumptions.
    
    
        
        > But there are no memory ordering requirements (with
        > regards to other loads and/or stores by this thread) so relaxed memory order is sufficient.
        > Another aspect of using __atomic_load_n() is that the compiler cannot "optimise" this load (e.g. combine, hoist, etc.); it has to be done
        > as specified in the source code, which is also what we need here.
        
        I think Jerin points that rte_pause() acts here as compiler barrier too,
        so no need to worry that compiler would optimize out the loop.
    [Ola] Sorry, I missed that. But the barrier behaviour of rte_pause() is not part of C11; it is essentially a hand-made feature to support the legacy multithreaded memory model (which uses explicit HW and compiler barriers). I'd prefer code using the C11 memory model not to depend on such legacy features.
    
    
    
        Konstantin
        
        > 
        > One point worth mentioning though is that this change is for the rte_ring_c11_mem.h file, not the legacy ring. It may be worth persisting
        > with getting the C11 code right when people are less excited about sending a release out?
        > 
        > We can explain that for C11 we would prefer to do loads and stores as per the C11 memory model. In the case of rte_ring, the code is
        > separated cleanly into C11 specific files anyway.
        > 
        > I think reading ht->tail using __atomic_load_n() is the most appropriate way. We show that ht->tail is used for synchronization, we
        > acknowledge that ht->tail may be written by other threads without any other kind of synchronization (e.g. no lock involved) and we require
        > an atomic load (any write to ht->tail must also be atomic).
        > 
        > Using volatile and explicit compiler (or processor) memory barriers (fences) is the legacy pre-C11 way of accomplishing these things. There's
        > a reason why C11/C++11 moved away from the old ways.
        > > >
        > > >         __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
        > > > --
        > > > 2.7.4
        > > >
        
    
    


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-05 11:36             ` Ola Liljedahl
@ 2018-10-05 13:44               ` Ananyev, Konstantin
  2018-10-05 14:21                 ` Ola Liljedahl
  2018-10-05 15:11                 ` Honnappa Nagarahalli
  0 siblings, 2 replies; 131+ messages in thread
From: Ananyev, Konstantin @ 2018-10-05 13:44 UTC (permalink / raw)
  To: Ola Liljedahl, Gavin Hu (Arm Technology China), Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Steve Capper, nd, stable



> -----Original Message-----
> From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> Sent: Friday, October 5, 2018 12:37 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Jerin Jacob
> <jerin.jacob@caviumnetworks.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper <Steve.Capper@arm.com>; nd
> <nd@arm.com>; stable@dpdk.org
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> 
> https://en.cppreference.com/w/cpp/language/memory_model
> <quote>
> When an evaluation of an expression writes to a memory location and another evaluation reads or modifies the same memory location, the
> expressions are said to conflict. A program that has two conflicting evaluations has a data race unless
> 
> * both evaluations execute on the same thread or in the same signal handler, or
> * both conflicting evaluations are atomic operations (see std::atomic), or
> * one of the conflicting evaluations happens-before another (see std::memory_order)
> If a data race occurs, the behavior of the program is undefined.
> </quote>
> 
> C11 and C++11 have the same memory model (otherwise interoperability would be difficult).
> 
> Or as Jeff Preshing explains it:
> "Any time two threads operate on a shared variable concurrently, and one of those operations performs a write, both threads must use
> atomic operations."
> https://preshing.com/20130618/atomic-vs-non-atomic-operations/
> 
> So if ht->tail is written using e.g. __atomic_store_n(&ht->tail, val, mo), we need to also read it using e.g. __atomic_load_n().
> 
> -- Ola
> 
> 
> On 05/10/2018, 13:15, "Ola Liljedahl" <Ola.Liljedahl@arm.com> wrote:
> 
> 
> 
>     On 05/10/2018, 10:22, "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
> 
> 
> 
>         > -----Original Message-----
>         > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Gavin Hu (Arm Technology China)
>         > Sent: Friday, October 5, 2018 1:47 AM
>         > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>         > Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper <Steve.Capper@arm.com>; Ola
> Liljedahl
>         > <Ola.Liljedahl@arm.com>; nd <nd@arm.com>; stable@dpdk.org
>         > Subject: Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
>         >
>         > Hi Jerin,
>         >
>         > Thanks for your review, inline comments from our internal discussions.
>         >
>         > BR. Gavin
>         >
>         > > -----Original Message-----
>         > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>         > > Sent: Saturday, September 29, 2018 6:49 PM
>         > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
>         > > Cc: dev@dpdk.org; Honnappa Nagarahalli
>         > > <Honnappa.Nagarahalli@arm.com>; Steve Capper
>         > > <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>; nd
>         > > <nd@arm.com>; stable@dpdk.org
>         > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>         > >
>         > > -----Original Message-----
>         > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
>         > > > From: Gavin Hu <gavin.hu@arm.com>
>         > > > To: dev@dpdk.org
>         > > > CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com,
>         > > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
>         > > > jerin.jacob@caviumnetworks.com, nd@arm.com,  stable@dpdk.org
>         > > > Subject: [PATCH v3 1/3] ring: read tail using atomic load
>         > > > X-Mailer: git-send-email 2.7.4
>         > > >
>         > > > External Email
>         > > >
>         > > > In update_tail, read ht->tail using __atomic_load_n(). Although
>         > > > the compiler currently seems to be doing the right thing even
>         > > > without __atomic_load_n(), we don't want to give the compiler the
>         > > > freedom to optimise what should be an atomic load; it should not
>         > > > be arbitrarily moved around.
>         > > >
>         > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
>         > > > Cc: stable@dpdk.org
>         > > >
>         > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
>         > > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
>         > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
>         > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
>         > > > ---
>         > > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
>         > > >  1 file changed, 2 insertions(+), 1 deletion(-)
>         > > >
>         > The read of ht->tail needs to be atomic, a non-atomic read would not be correct.
> 
>         That's a 32bit value load.
>         AFAIK on all CPUs that we support it is an atomic operation.
>     [Ola] But that the ordinary C load is translated to an atomic load for the target architecture is incidental.
> 
>     If the design requires an atomic load (which is the case here), we should use an atomic load on the language level. Then we can be sure it
> will always be translated to an atomic load for the target in question or compilation will fail. We don't have to depend on assumptions.

We all know that 32-bit loads/stores on the CPUs we support are atomic.
If that weren't the case, DPDK would be broken in dozens of places.
So what is the point of pretending that "it might not be atomic" when we know for sure that it is?
I do understand that you want to use atomic_load(relaxed) here for consistency,
and to conform with the C11 memory model, and I don't see any harm in that.
But the argument that we shouldn't assume 32-bit load/store ops are atomic sounds a bit flaky to me.
Konstantin
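[For reference, the wait loop under discussion looks roughly like this after the patch -- a simplified sketch of update_tail() from rte_ring_c11_mem.h, with a hypothetical cpu_pause() standing in for rte_pause():]

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified stand-in for rte_ring's head/tail pair. */
struct headtail {
	uint32_t head;
	uint32_t tail;
};

static inline void cpu_pause(void) { /* rte_pause() in the real code */ }

/* If there are preceding enqueues/dequeues in flight, wait for them to
 * complete by polling ht->tail with a relaxed atomic load, then publish
 * the new tail with release semantics. */
static void update_tail(struct headtail *ht, uint32_t old_val,
			uint32_t new_val, bool single)
{
	if (!single)
		while (old_val != __atomic_load_n(&ht->tail, __ATOMIC_RELAXED))
			cpu_pause();

	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
}
```

[Relaxed ordering on the load is enough here: the load only needs to be atomic; the ordering guarantee is provided by the release store that eventually publishes the value being waited for.]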


> 
> 
> 
>         > But there are no memory ordering requirements (with
>         > regards to other loads and/or stores by this thread) so relaxed memory order is sufficient.
>         > Another aspect of using __atomic_load_n() is that the compiler cannot "optimise" this load (e.g. combine, hoist etc), it has to be done
> as
>         > specified in the source code which is also what we need here.
> 
>         I think Jerin points that rte_pause() acts here as compiler barrier too,
>         so no need to worry that compiler would optimize out the loop.
>     [Ola] Sorry, I missed that. But the barrier behaviour of rte_pause() is not part of C11; it is essentially a hand-made feature to support the
> legacy multithreaded memory model (which uses explicit HW and compiler barriers). I'd prefer code using the C11 memory model not to
> depend on such legacy features.
> 
> 
> 
>         Konstantin
> 
>         >
>         > One point worth mentioning though is that this change is for the rte_ring_c11_mem.h file, not the legacy ring. It may be worth
> persisting
>         > with getting the C11 code right when people are less excited about sending a release out?
>         >
>         > We can explain that for C11 we would prefer to do loads and stores as per the C11 memory model. In the case of rte_ring, the code is
>         > separated cleanly into C11 specific files anyway.
>         >
>         > I think reading ht->tail using __atomic_load_n() is the most appropriate way. We show that ht->tail is used for synchronization, we
>         > acknowledge that ht->tail may be written by other threads without any other kind of synchronization (e.g. no lock involved) and we
> require
>         > an atomic load (any write to ht->tail must also be atomic).
>         >
>         > Using volatile and explicit compiler (or processor) memory barriers (fences) is the legacy pre-C11 way of accomplishing these things.
> There's
>         > a reason why C11/C++11 moved away from the old ways.
>         > > >
>         > > >         __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
>         > > > --
>         > > > 2.7.4
>         > > >
> 
> 
> 


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-05 13:44               ` Ananyev, Konstantin
@ 2018-10-05 14:21                 ` Ola Liljedahl
  2018-10-05 15:11                 ` Honnappa Nagarahalli
  1 sibling, 0 replies; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-05 14:21 UTC (permalink / raw)
  To: Ananyev, Konstantin, Gavin Hu (Arm Technology China), Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Steve Capper, nd, stable



On 05/10/2018, 15:45, "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:

    We all know that 32bit load/store on cpu we support - are atomic.
Well, not necessarily true for unaligned loads and stores. But the "problem" here is that we are not directly generating 32-bit load and store instructions (that would require inline assembly); we are performing C-level reads and writes and trusting the compiler to generate the machine instructions.

    If it wouldn't be the case - DPDK would be broken in dozen places.
And maybe it is, if you compile for a new architecture or with a new compiler (which are getting more and more aggressive with regards to utilising e.g. undefined behavior in order to optimise the generated code).

    So what the point to pretend that "it might be not atomic" if we do know for sure that it is?
Any argument that includes the words "for sure" is surely suspect.

    I do understand that you want to use atomic_load(relaxed) here for consistency,
    and to conform with C11 mem-model and I don't see any harm in that.
    But argument that we shouldn't assume 32bit load/store ops as atomic sounds a bit flaky to me.
    Konstantin
I prefer to declare intent and requirements to the compiler, not to depend on assumptions even if I can be reasonably sure my assumptions are correct right here right now. Compilers will find new ways to break non-compliant code if that means it can improve the execution time of compliant code.

Someone may modify the code or follow the model for some similar thing but change that 32-bit variable to a 64-bit variable. Now for a 32-bit target, the plain C read from a volatile 64-bit variable will not be atomic but there will be no warning from the compiler, it will happily generate a sequence of non-atomic loads. If you instead use __atomic_load_n() to read the variable, you would get a compiler or linker error (unless you link with -latomic which will actually provide an atomic load, e.g. by using a lock to protect the access).
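[A sketch of the scenario described above, with hypothetical names. On a 64-bit target this compiles to a single load; on a 32-bit target the builtin either stays atomic or forces a link against libatomic, surfacing the issue instead of silently splitting the access into two loads:]

```c
#include <stdint.h>

/* Hypothetical 64-bit shared counter. A plain C read of this on a
 * 32-bit target may be compiled into two separate 32-bit loads with
 * no diagnostic; the __atomic builtin declares the atomicity
 * requirement at the language level. */
static uint64_t shared_counter;

static uint64_t read_counter(void)
{
	return __atomic_load_n(&shared_counter, __ATOMIC_RELAXED);
}
```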



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-05 13:44               ` Ananyev, Konstantin
  2018-10-05 14:21                 ` Ola Liljedahl
@ 2018-10-05 15:11                 ` Honnappa Nagarahalli
  2018-10-05 17:07                   ` Jerin Jacob
  1 sibling, 1 reply; 131+ messages in thread
From: Honnappa Nagarahalli @ 2018-10-05 15:11 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ola Liljedahl,
	Gavin Hu (Arm Technology China),
	Jerin Jacob
  Cc: dev, Steve Capper, nd, stable

> >         >
> >         > Hi Jerin,
> >         >
> >         > Thanks for your review, inline comments from our internal
> discussions.
> >         >
> >         > BR. Gavin
> >         >
> >         > > -----Original Message-----
> >         > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> >         > > Sent: Saturday, September 29, 2018 6:49 PM
> >         > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> >         > > Cc: dev@dpdk.org; Honnappa Nagarahalli
> >         > > <Honnappa.Nagarahalli@arm.com>; Steve Capper
> >         > > <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> nd
> >         > > <nd@arm.com>; stable@dpdk.org
> >         > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> >         > >
> >         > > -----Original Message-----
> >         > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
> >         > > > From: Gavin Hu <gavin.hu@arm.com>
> >         > > > To: dev@dpdk.org
> >         > > > CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com,
> >         > > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
> >         > > > jerin.jacob@caviumnetworks.com, nd@arm.com,
> stable@dpdk.org
> >         > > > Subject: [PATCH v3 1/3] ring: read tail using atomic load
> >         > > > X-Mailer: git-send-email 2.7.4
> >         > > >
> >         > > > External Email
> >         > > >
> >         > > > In update_tail, read ht->tail using __atomic_load_n(). Although
> >         > > > the compiler currently seems to be doing the right thing even
> >         > > > without __atomic_load_n(), we don't want to give the compiler the
> >         > > > freedom to optimise what should be an atomic load; it should not
> >         > > > be arbitrarily moved around.
> >         > > >
> >         > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier
> option")
> >         > > > Cc: stable@dpdk.org
> >         > > >
> >         > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> >         > > > Reviewed-by: Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> >         > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> >         > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> >         > > > ---
> >         > > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
> >         > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> >         > > >
> >         > The read of ht->tail needs to be atomic, a non-atomic read would not
> be correct.
> >
> >         That's a 32bit value load.
> >         AFAIK on all CPUs that we support it is an atomic operation.
> >     [Ola] But that the ordinary C load is translated to an atomic load for the
> target architecture is incidental.
> >
> >     If the design requires an atomic load (which is the case here), we
> > should use an atomic load on the language level. Then we can be sure it will
> always be translated to an atomic load for the target in question or
> compilation will fail. We don't have to depend on assumptions.
> 
> We all know that 32bit load/store on cpu we support - are atomic.
> If it wouldn't be the case - DPDK would be broken in dozen places.
> So what the point to pretend that "it might be not atomic" if we do know for
> sure that it is?
> I do understand that you want to use atomic_load(relaxed) here for
> consistency, and to conform with C11 mem-model and I don't see any harm in
> that.
We can continue to discuss the topic; it is a good discussion. But, as far as this patch is concerned, can I consider this as us having a consensus? The file rte_ring_c11_mem.h is specifically for the C11 memory model, and I also do not see any harm in having code that completely conforms to the C11 memory model.

> But argument that we shouldn't assume 32bit load/store ops as atomic
> sounds a bit flaky to me.
> Konstantin
> 
> 
> >
> >
> >
> >         > But there are no memory ordering requirements (with
> >         > regards to other loads and/or stores by this thread) so relaxed
> memory order is sufficient.
> >         > Another aspect of using __atomic_load_n() is that the
> > compiler cannot "optimise" this load (e.g. combine, hoist etc), it has to be
> done as
> >         > specified in the source code which is also what we need here.
> >
> >         I think Jerin points that rte_pause() acts here as compiler barrier too,
> >         so no need to worry that compiler would optimize out the loop.
> >     [Ola] Sorry, I missed that. But the barrier behaviour of rte_pause()
> > is not part of C11; it is essentially a hand-made feature to support
> > the legacy multithreaded memory model (which uses explicit HW and
> > compiler barriers). I'd prefer code using the C11 memory model not to
> > depend on such legacy features.
> >
> >
> >
> >         Konstantin
> >
> >         >
> >         > One point worth mentioning though is that this change is for
> > the rte_ring_c11_mem.h file, not the legacy ring. It may be worth persisting
> >         > with getting the C11 code right when people are less excited about
> sending a release out?
> >         >
> >         > We can explain that for C11 we would prefer to do loads and stores
> as per the C11 memory model. In the case of rte_ring, the code is
> >         > separated cleanly into C11 specific files anyway.
> >         >
> >         > I think reading ht->tail using __atomic_load_n() is the most
> appropriate way. We show that ht->tail is used for synchronization, we
> >         > acknowledge that ht->tail may be written by other threads
> > without any other kind of synchronization (e.g. no lock involved) and we
> require
> >         > an atomic load (any write to ht->tail must also be atomic).
> >         >
> >         > Using volatile and explicit compiler (or processor) memory barriers
> (fences) is the legacy pre-C11 way of accomplishing these things.
> > There's
> >         > a reason why C11/C++11 moved away from the old ways.
> >         > > >
> >         > > >         __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> >         > > > --
> >         > > > 2.7.4
> >         > > >
> >
> >
> >


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-05 15:11                 ` Honnappa Nagarahalli
@ 2018-10-05 17:07                   ` Jerin Jacob
  2018-10-05 18:05                     ` Ola Liljedahl
  0 siblings, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2018-10-05 17:07 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Ananyev, Konstantin, Ola Liljedahl,
	Gavin Hu (Arm Technology China),
	dev, Steve Capper, nd, stable

-----Original Message-----
> Date: Fri, 5 Oct 2018 15:11:44 +0000
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, Ola Liljedahl
>  <Ola.Liljedahl@arm.com>, "Gavin Hu (Arm Technology China)"
>  <Gavin.Hu@arm.com>, Jerin Jacob <jerin.jacob@caviumnetworks.com>
> CC: "dev@dpdk.org" <dev@dpdk.org>, Steve Capper <Steve.Capper@arm.com>, nd
>  <nd@arm.com>, "stable@dpdk.org" <stable@dpdk.org>
> Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
> 
> > >         > Hi Jerin,
> > >         >
> > >         > Thanks for your review, inline comments from our internal
> > discussions.
> > >         >
> > >         > BR. Gavin
> > >         >
> > >         > > -----Original Message-----
> > >         > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > >         > > Sent: Saturday, September 29, 2018 6:49 PM
> > >         > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> > >         > > Cc: dev@dpdk.org; Honnappa Nagarahalli
> > >         > > <Honnappa.Nagarahalli@arm.com>; Steve Capper
> > >         > > <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> > nd
> > >         > > <nd@arm.com>; stable@dpdk.org
> > >         > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> > >         > >
> > >         > > -----Original Message-----
> > >         > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
> > >         > > > From: Gavin Hu <gavin.hu@arm.com>
> > >         > > > To: dev@dpdk.org
> > >         > > > CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com,
> > >         > > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
> > >         > > > jerin.jacob@caviumnetworks.com, nd@arm.com,
> > stable@dpdk.org
> > >         > > > Subject: [PATCH v3 1/3] ring: read tail using atomic load
> > >         > > > X-Mailer: git-send-email 2.7.4
> > >         > > >
> > >         > > > External Email
> > >         > > >
> > >         > > > In update_tail, read ht->tail using __atomic_load_n(). Although
> > >         > > > the compiler currently seems to be doing the right thing even
> > >         > > > without __atomic_load_n(), we don't want to give the compiler
> > >         > > > the freedom to optimise what should be an atomic load; it
> > >         > > > should not be arbitrarily moved around.
> > >         > > >
> > >         > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier
> > option")
> > >         > > > Cc: stable@dpdk.org
> > >         > > >
> > >         > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > >         > > > Reviewed-by: Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>
> > >         > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > >         > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > >         > > > ---
> > >         > > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
> > >         > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >         > > >
> > >         > The read of ht->tail needs to be atomic, a non-atomic read would not
> > be correct.
> > >
> > >         That's a 32bit value load.
> > >         AFAIK on all CPUs that we support it is an atomic operation.
> > >     [Ola] But that the ordinary C load is translated to an atomic load for the
> > target architecture is incidental.
> > >
> > >     If the design requires an atomic load (which is the case here), we
> > > should use an atomic load on the language level. Then we can be sure it will
> > always be translated to an atomic load for the target in question or
> > compilation will fail. We don't have to depend on assumptions.
> >
> > We all know that 32bit load/store on cpu we support - are atomic.
> > If it wouldn't be the case - DPDK would be broken in dozen places.
> > So what the point to pretend that "it might be not atomic" if we do know for
> > sure that it is?
> > I do understand that you want to use atomic_load(relaxed) here for
> > consistency, and to conform with C11 mem-model and I don't see any harm in
> > that.
> We can continue to discuss the topic; it is a good discussion. But, as far as this patch is concerned, can I consider this as us having a consensus? The file rte_ring_c11_mem.h is specifically for the C11 memory model, and I also do not see any harm in having code that completely conforms to the C11 memory model.

Have you guys checked the output assembly with and without the atomic load?
There is an extra "add" instruction, at least in the code I have checked.
I think the compiler is not smart enough to understand that it is dead code
for arm64.

➜ [~] $ aarch64-linux-gnu-gcc -v
Using built-in specs.
COLLECT_GCC=aarch64-linux-gnu-gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-wrapper
Target: aarch64-linux-gnu
Configured with: /build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure
--prefix=/usr --program-prefix=aarch64-linux-gnu-
--with-local-prefix=/usr/aarch64-linux-gnu
--with-sysroot=/usr/aarch64-linux-gnu
--with-build-sysroot=/usr/aarch64-linux-gnu --libdir=/usr/lib
--libexecdir=/usr/lib --target=aarch64-linux-gnu
--host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-nls
--enable-languages=c,c++ --enable-shared --enable-threads=posix
--with-system-zlib --with-isl --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-clocale=gnu
--disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object
--enable-linker-build-id --enable-lto --enable-plugin
--enable-install-libiberty --with-linker-hash-style=gnu
--enable-gnu-indirect-function --disable-multilib --disable-werror
--enable-checking=release
Thread model: posix
gcc version 8.2.0 (GCC)


# build setup
make -j 8 config T=arm64-armv8a-linuxapp-gcc  CROSS=aarch64-linux-gnu-
make -j 8 test-build CROSS=aarch64-linux-gnu-

# generate asm
aarch64-linux-gnu-gdb -batch -ex 'file build/app/test ' -ex 'disassemble /rs bucket_enqueue_single' 

I have uploaded generated file for your convenience
with_atomic_load.txt(includes patch 1,2,3)
-----------------------
https://pastebin.com/SQ6w1yRu

without_atomic_load.txt(includes patch 2,3)
-----------------------
https://pastebin.com/BpvnD0CA


without_atomic
-------------
23              if (!single)
   0x000000000068d290 <+240>:   85 00 00 35     cbnz    w5, 0x68d2a0 <bucket_enqueue_single+256>
   0x000000000068d294 <+244>:   82 04 40 b9     ldr     w2, [x4, #4]
   0x000000000068d298 <+248>:   5f 00 01 6b     cmp     w2, w1
   0x000000000068d29c <+252>:   21 01 00 54     b.ne    0x68d2c0 <bucket_enqueue_single+288>  // b.any

24                      while (unlikely(ht->tail != old_val))
25                              rte_pause();


with_atomic
-----------
23              if (!single)
   0x000000000068ceb0 <+240>:   00 10 04 91     add     x0, x0, #0x104
   0x000000000068ceb4 <+244>:   84 00 00 35     cbnz    w4, 0x68cec4 <bucket_enqueue_single+260>
   0x000000000068ceb8 <+248>:   02 00 40 b9     ldr     w2, [x0]
   0x000000000068cebc <+252>:   3f 00 02 6b     cmp     w1, w2
   0x000000000068cec0 <+256>:   01 09 00 54     b.ne    0x68cfe0 <bucket_enqueue_single+544>  // b.any

24                      while (unlikely(old_val != __atomic_load_n(&ht->tail, __ATOMIC_RELAXED)))


I don't want to block this series of patches due to this patch. Can we
re-spin one series with patches 2 and 3, and wait for patch 1 to conclude?

Thoughts?




> 
> > But argument that we shouldn't assume 32bit load/store ops as atomic
> > sounds a bit flaky to me.
> > Konstantin
> >
> >
> > >
> > >
> > >
> > >         > But there are no memory ordering requirements (with
> > >         > regards to other loads and/or stores by this thread) so relaxed
> > memory order is sufficient.
> > >         > Another aspect of using __atomic_load_n() is that the
> > > compiler cannot "optimise" this load (e.g. combine, hoist etc), it has to be
> > done as
> > >         > specified in the source code which is also what we need here.
> > >
> > >         I think Jerin points that rte_pause() acts here as compiler barrier too,
> > >         so no need to worry that compiler would optimize out the loop.
> > >     [Ola] Sorry, I missed that. But the barrier behaviour of rte_pause()
> > > is not part of C11; it is essentially a hand-made feature to support
> > > the legacy multithreaded memory model (which uses explicit HW and
> > > compiler barriers). I'd prefer code using the C11 memory model not to
> > > depend on such legacy features.
> > >
> > >
> > >
> > >         Konstantin
> > >
> > >         >
> > >         > One point worth mentioning though is that this change is for
> > > the rte_ring_c11_mem.h file, not the legacy ring. It may be worth persisting
> > >         > with getting the C11 code right when people are less excited about
> > sending a release out?
> > >         >
> > >         > We can explain that for C11 we would prefer to do loads and stores
> > as per the C11 memory model. In the case of rte_ring, the code is
> > >         > separated cleanly into C11 specific files anyway.
> > >         >
> > >         > I think reading ht->tail using __atomic_load_n() is the most
> > appropriate way. We show that ht->tail is used for synchronization, we
> > >         > acknowledge that ht->tail may be written by other threads
> > > without any other kind of synchronization (e.g. no lock involved) and we
> > require
> > >         > an atomic load (any write to ht->tail must also be atomic).
> > >         >
> > >         > Using volatile and explicit compiler (or processor) memory barriers
> > (fences) is the legacy pre-C11 way of accomplishing these things.
> > > There's
> > >         > a reason why C11/C++11 moved away from the old ways.
> > >         > > >
> > >         > > >         __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > >         > > > --
> > >         > > > 2.7.4
> > >         > > >
> > >
> > >
> > >
> 

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-05 17:07                   ` Jerin Jacob
@ 2018-10-05 18:05                     ` Ola Liljedahl
  2018-10-05 20:06                       ` Honnappa Nagarahalli
  0 siblings, 1 reply; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-05 18:05 UTC (permalink / raw)
  To: Jerin Jacob, Honnappa Nagarahalli
  Cc: Ananyev, Konstantin, Gavin Hu (Arm Technology China),
	dev, Steve Capper, nd, stable

So you don't want to write the proper C11 code because the compiler generates one extra instruction that way?
You don't even know if that one extra instruction has any measurable impact on performance. E.g. it could be issued the cycle before together with other instructions.

We can complain to the compiler writers that the code generation for __atomic_load_n(, __ATOMIC_RELAXED) is not optimal (at least on ARM/A64). I think the problem is that the __atomic builtins only accept a base address without any offset and this is possibly because e.g. load/store exclusive (LDX/STX) and load-acquire (LDAR) and store-release (STLR) only accept a base register with no offset. So any offset has to be added before the actual "atomic" instruction, LDR in this case.


-- Ola


On 05/10/2018, 19:07, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:

    -----Original Message-----
    > Date: Fri, 5 Oct 2018 15:11:44 +0000
    > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
    > To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, Ola Liljedahl
    >  <Ola.Liljedahl@arm.com>, "Gavin Hu (Arm Technology China)"
    >  <Gavin.Hu@arm.com>, Jerin Jacob <jerin.jacob@caviumnetworks.com>
    > CC: "dev@dpdk.org" <dev@dpdk.org>, Steve Capper <Steve.Capper@arm.com>, nd
    >  <nd@arm.com>, "stable@dpdk.org" <stable@dpdk.org>
    > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
    > 
    > > >         > Hi Jerin,
    > > >         >
    > > >         > Thanks for your review, inline comments from our internal
    > > discussions.
    > > >         >
    > > >         > BR. Gavin
    > > >         >
    > > >         > > -----Original Message-----
    > > >         > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    > > >         > > Sent: Saturday, September 29, 2018 6:49 PM
    > > >         > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
    > > >         > > Cc: dev@dpdk.org; Honnappa Nagarahalli
    > > >         > > <Honnappa.Nagarahalli@arm.com>; Steve Capper
    > > >         > > <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
    > > nd
    > > >         > > <nd@arm.com>; stable@dpdk.org
    > > >         > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    > > >         > >
    > > >         > > -----Original Message-----
    > > >         > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
    > > >         > > > From: Gavin Hu <gavin.hu@arm.com>
    > > >         > > > To: dev@dpdk.org
    > > >         > > > CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com,
    > > >         > > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
    > > >         > > > jerin.jacob@caviumnetworks.com, nd@arm.com,
    > > stable@dpdk.org
    > > >         > > > Subject: [PATCH v3 1/3] ring: read tail using atomic load
    > > >         > > > X-Mailer: git-send-email 2.7.4
    > > >         > > >
    > > >         > > > External Email
    > > >         > > >
    > > >         > > > In update_tail, read ht->tail using __atomic_load.Although the
    > > >         > > > compiler currently seems to be doing the right thing even without
    > > >         > > > _atomic_load, we don't want to give the compiler freedom to
    > > optimise
    > > >         > > > what should be an atomic load, it should not be arbitarily moved
    > > >         > > > around.
    > > >         > > >
    > > >         > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier
    > > option")
    > > >         > > > Cc: stable@dpdk.org
    > > >         > > >
    > > >         > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
    > > >         > > > Reviewed-by: Honnappa Nagarahalli
    > > <Honnappa.Nagarahalli@arm.com>
    > > >         > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
    > > >         > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
    > > >         > > > ---
    > > >         > > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
    > > >         > > >  1 file changed, 2 insertions(+), 1 deletion(-)
    > > >         > > >
    > > >         > The read of ht->tail needs to be atomic, a non-atomic read would not
    > > be correct.
    > > >
    > > >         That's a 32bit value load.
    > > >         AFAIK on all CPUs that we support it is an atomic operation.
    > > >     [Ola] But that the ordinary C load is translated to an atomic load for the
    > > target architecture is incidental.
    > > >
    > > >     If the design requires an atomic load (which is the case here), we
    > > > should use an atomic load on the language level. Then we can be sure it will
    > > always be translated to an atomic load for the target in question or
    > > compilation will fail. We don't have to depend on assumptions.
    > >
    > > We all know that 32bit load/store on cpu we support - are atomic.
    > > If it wouldn't be the case - DPDK would be broken in dozen places.
    > > So what the point to pretend that "it might be not atomic" if we do know for
    > > sure that it is?
    > > I do understand that you want to use atomic_load(relaxed) here for
    > > consistency, and to conform with C11 mem-model and I don't see any harm in
    > > that.
    > We can continue to discuss the topic, it is a good discussion. But, as far this patch is concerned, can I consider this as us having a consensus? The file rte_ring_c11_mem.h is specifically for C11 memory model and I also do not see any harm in having code that completely conforms to C11 memory model.
    
    Have you guys checked the output assembly with and without the atomic load?
    There is an extra "add" instruction in at least the code I have checked.
    I think the compiler is not smart enough to understand it is dead code for
    arm64.
    
    ➜ [~] $ aarch64-linux-gnu-gcc -v
    Using built-in specs.
    COLLECT_GCC=aarch64-linux-gnu-gcc
    COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-wrapper
    Target: aarch64-linux-gnu
    Configured with: /build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure
    --prefix=/usr --program-prefix=aarch64-linux-gnu-
    --with-local-prefix=/usr/aarch64-linux-gnu
    --with-sysroot=/usr/aarch64-linux-gnu
    --with-build-sysroot=/usr/aarch64-linux-gnu --libdir=/usr/lib
    --libexecdir=/usr/lib --target=aarch64-linux-gnu
    --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-nls
    --enable-languages=c,c++ --enable-shared --enable-threads=posix
    --with-system-zlib --with-isl --enable-__cxa_atexit
    --disable-libunwind-exceptions --enable-clocale=gnu
    --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object
    --enable-linker-build-id --enable-lto --enable-plugin
    --enable-install-libiberty --with-linker-hash-style=gnu
    --enable-gnu-indirect-function --disable-multilib --disable-werror
    --enable-checking=release
    Thread model: posix
    gcc version 8.2.0 (GCC)
    
    
    # build setup
    make -j 8 config T=arm64-armv8a-linuxapp-gcc  CROSS=aarch64-linux-gnu-
    make -j 8 test-build CROSS=aarch64-linux-gnu-
    
    # generate asm
    aarch64-linux-gnu-gdb -batch -ex 'file build/app/test ' -ex 'disassemble /rs bucket_enqueue_single' 
    
    I have uploaded generated file for your convenience
    with_atomic_load.txt(includes patch 1,2,3)
    -----------------------
    https://pastebin.com/SQ6w1yRu
    
    without_atomic_load.txt(includes patch 2,3)
    -----------------------
    https://pastebin.com/BpvnD0CA
    
    
    without_atomic
    -------------
    23              if (!single)
       0x000000000068d290 <+240>:   85 00 00 35     cbnz    w5, 0x68d2a0 <bucket_enqueue_single+256>
       0x000000000068d294 <+244>:   82 04 40 b9     ldr     w2, [x4, #4]
       0x000000000068d298 <+248>:   5f 00 01 6b     cmp     w2, w1
       0x000000000068d29c <+252>:   21 01 00 54     b.ne    0x68d2c0 <bucket_enqueue_single+288>  // b.any
    
    24                      while (unlikely(ht->tail != old_val))
    25                              rte_pause();
    
    
    with_atomic
    -----------
    23              if (!single)
       0x000000000068ceb0 <+240>:   00 10 04 91     add     x0, x0, #0x104
       0x000000000068ceb4 <+244>:   84 00 00 35     cbnz    w4, 0x68cec4 <bucket_enqueue_single+260>
       0x000000000068ceb8 <+248>:   02 00 40 b9     ldr     w2, [x0]
       0x000000000068cebc <+252>:   3f 00 02 6b     cmp     w1, w2
       0x000000000068cec0 <+256>:   01 09 00 54     b.ne    0x68cfe0 <bucket_enqueue_single+544>  // b.any
    
    24                      while (unlikely(old_val != __atomic_load_n(&ht->tail, __ATOMIC_RELAXED)))
    
    
    I don't want to block this series of patches due to this patch. Can we
    respin one series with patches 2 and 3, and wait for patch 1 to conclude?
    
    Thoughts?
    
    
    
    
    > 
    > > But argument that we shouldn't assume 32bit load/store ops as atomic
    > > sounds a bit flaky to me.
    > > Konstantin
    > >
    > >
    > > >
    > > >
    > > >
    > > >         > But there are no memory ordering requirements (with
    > > >         > regards to other loads and/or stores by this thread) so relaxed
    > > memory order is sufficient.
    > > >         > Another aspect of using __atomic_load_n() is that the
    > > > compiler cannot "optimise" this load (e.g. combine, hoist etc), it has to be
    > > done as
    > > >         > specified in the source code which is also what we need here.
    > > >
    > > >         I think Jerin points that rte_pause() acts here as compiler barrier too,
    > > >         so no need to worry that compiler would optimize out the loop.
    > > >     [Ola] Sorry missed that. But the barrier behaviour of rte_pause()
    > > > is not part of C11, is it essentially a hand-made feature to support
    > > > the legacy multithreaded memory model (which uses explicit HW and
    > > compiler barriers). I'd prefer code using the C11 memory model not to
    > > depend on such legacy features.
    > > >
    > > >
    > > >
    > > >         Konstantin
    > > >
    > > >         >
    > > >         > One point worth mentioning though is that this change is for
    > > > the rte_ring_c11_mem.h file, not the legacy ring. It may be worth persisting
    > > >         > with getting the C11 code right when people are less excited about
    > > sending a release out?
    > > >         >
    > > >         > We can explain that for C11 we would prefer to do loads and stores
    > > as per the C11 memory model. In the case of rte_ring, the code is
    > > >         > separated cleanly into C11 specific files anyway.
    > > >         >
    > > >         > I think reading ht->tail using __atomic_load_n() is the most
    > > appropriate way. We show that ht->tail is used for synchronization, we
    > > >         > acknowledge that ht->tail may be written by other threads
    > > > without any other kind of synchronization (e.g. no lock involved) and we
    > > require
    > > >         > an atomic load (any write to ht->tail must also be atomic).
    > > >         >
    > > >         > Using volatile and explicit compiler (or processor) memory barriers
    > > (fences) is the legacy pre-C11 way of accomplishing these things.
    > > > There's
    > > >         > a reason why C11/C++11 moved away from the old ways.
    > > >         > > >
    > > >         > > >         __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
    > > >         > > > --
    > > >         > > > 2.7.4
    > > >         > > >
    > > >
    > > >
    > > >
    > 
    


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-05 18:05                     ` Ola Liljedahl
@ 2018-10-05 20:06                       ` Honnappa Nagarahalli
  2018-10-05 20:17                         ` Ola Liljedahl
  0 siblings, 1 reply; 131+ messages in thread
From: Honnappa Nagarahalli @ 2018-10-05 20:06 UTC (permalink / raw)
  To: Ola Liljedahl, Jerin Jacob
  Cc: Ananyev, Konstantin, Gavin Hu (Arm Technology China),
	dev, Steve Capper, nd, stable

Hi Jerin,
	Thank you for generating the disassembly, that is really helpful. I agree with you that we have the option of moving parts 2 and 3 forward. I will let Gavin take a decision.

I suggest that we run benchmarks on this patch alone and in combination with other patches in the series. We have a few Arm machines and we will run on all of them along with x86. We will take a decision based on that.

Would that be a way to move forward? I think this should address both your and Ola's concerns.

I am open for other suggestions as well.

Thank you,
Honnappa

> 
> So you don't want to write the proper C11 code because the compiler
> generates one extra instruction that way?
> You don't even know if that one extra instruction has any measurable
> impact on performance. E.g. it could be issued the cycle before together
> with other instructions.
> 
> We can complain to the compiler writers that the code generation for
> __atomic_load_n(, __ATOMIC_RELAXED) is not optimal (at least on
> ARM/A64). I think the problem is that the __atomic builtins only accept a
> base address without any offset and this is possibly because e.g. load/store
> exclusive (LDX/STX) and load-acquire (LDAR) and store-release (STLR) only
> accept a base register with no offset. So any offset has to be added before
> the actual "atomic" instruction, LDR in this case.
> 
> 
> -- Ola
> 
> 
> On 05/10/2018, 19:07, "Jerin Jacob" <jerin.jacob@caviumnetworks.com>
> wrote:
> 
>     -----Original Message-----
>     > Date: Fri, 5 Oct 2018 15:11:44 +0000
>     > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
>     > To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, Ola
> Liljedahl
>     >  <Ola.Liljedahl@arm.com>, "Gavin Hu (Arm Technology China)"
>     >  <Gavin.Hu@arm.com>, Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     > CC: "dev@dpdk.org" <dev@dpdk.org>, Steve Capper
> <Steve.Capper@arm.com>, nd
>     >  <nd@arm.com>, "stable@dpdk.org" <stable@dpdk.org>
>     > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
>     >
>     > > >         > Hi Jerin,
>     > > >         >
>     > > >         > Thanks for your review, inline comments from our internal
>     > > discussions.
>     > > >         >
>     > > >         > BR. Gavin
>     > > >         >
>     > > >         > > -----Original Message-----
>     > > >         > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     > > >         > > Sent: Saturday, September 29, 2018 6:49 PM
>     > > >         > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
>     > > >         > > Cc: dev@dpdk.org; Honnappa Nagarahalli
>     > > >         > > <Honnappa.Nagarahalli@arm.com>; Steve Capper
>     > > >         > > <Steve.Capper@arm.com>; Ola Liljedahl
> <Ola.Liljedahl@arm.com>;
>     > > nd
>     > > >         > > <nd@arm.com>; stable@dpdk.org
>     > > >         > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     > > >         > >
>     > > >         > > -----Original Message-----
>     > > >         > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
>     > > >         > > > From: Gavin Hu <gavin.hu@arm.com>
>     > > >         > > > To: dev@dpdk.org
>     > > >         > > > CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com,
>     > > >         > > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
>     > > >         > > > jerin.jacob@caviumnetworks.com, nd@arm.com,
>     > > stable@dpdk.org
>     > > >         > > > Subject: [PATCH v3 1/3] ring: read tail using atomic load
>     > > >         > > > X-Mailer: git-send-email 2.7.4
>     > > >         > > >
>     > > >         > > > External Email
>     > > >         > > >
>     > > >         > > > In update_tail, read ht->tail using
> __atomic_load.Although the
>     > > >         > > > compiler currently seems to be doing the right thing even
> without
>     > > >         > > > _atomic_load, we don't want to give the compiler
> freedom to
>     > > optimise
>     > > >         > > > what should be an atomic load, it should not be arbitarily
> moved
>     > > >         > > > around.
>     > > >         > > >
>     > > >         > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model
> barrier
>     > > option")
>     > > >         > > > Cc: stable@dpdk.org
>     > > >         > > >
>     > > >         > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
>     > > >         > > > Reviewed-by: Honnappa Nagarahalli
>     > > <Honnappa.Nagarahalli@arm.com>
>     > > >         > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
>     > > >         > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     > > >         > > > ---
>     > > >         > > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
>     > > >         > > >  1 file changed, 2 insertions(+), 1 deletion(-)
>     > > >         > > >
>     > > >         > The read of ht->tail needs to be atomic, a non-atomic read
> would not
>     > > be correct.
>     > > >
>     > > >         That's a 32bit value load.
>     > > >         AFAIK on all CPUs that we support it is an atomic operation.
>     > > >     [Ola] But that the ordinary C load is translated to an atomic load
> for the
>     > > target architecture is incidental.
>     > > >
>     > > >     If the design requires an atomic load (which is the case here), we
>     > > > should use an atomic load on the language level. Then we can be
> sure it will
>     > > always be translated to an atomic load for the target in question or
>     > > compilation will fail. We don't have to depend on assumptions.
>     > >
>     > > We all know that 32bit load/store on cpu we support - are atomic.
>     > > If it wouldn't be the case - DPDK would be broken in dozen places.
>     > > So what the point to pretend that "it might be not atomic" if we do
> know for
>     > > sure that it is?
>     > > I do understand that you want to use atomic_load(relaxed) here for
>     > > consistency, and to conform with C11 mem-model and I don't see any
> harm in
>     > > that.
>     > We can continue to discuss the topic, it is a good discussion. But, as far
> this patch is concerned, can I consider this as us having a consensus? The
> file rte_ring_c11_mem.h is specifically for C11 memory model and I also do
> not see any harm in having code that completely conforms to C11 memory
> model.
> 
>     Have you guys checked the output assembly with and without atomic
> load?
>     There is an extra "add" instruction with at least the code I have checked.
>     I think, compiler is not smart enough to understand it is a dead code for
>     arm64.
> 
>     ➜ [~] $ aarch64-linux-gnu-gcc -v
>     Using built-in specs.
>     COLLECT_GCC=aarch64-linux-gnu-gcc
>     COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-
> wrapper
>     Target: aarch64-linux-gnu
>     Configured with: /build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure
>     --prefix=/usr --program-prefix=aarch64-linux-gnu-
>     --with-local-prefix=/usr/aarch64-linux-gnu
>     --with-sysroot=/usr/aarch64-linux-gnu
>     --with-build-sysroot=/usr/aarch64-linux-gnu --libdir=/usr/lib
>     --libexecdir=/usr/lib --target=aarch64-linux-gnu
>     --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-nls
>     --enable-languages=c,c++ --enable-shared --enable-threads=posix
>     --with-system-zlib --with-isl --enable-__cxa_atexit
>     --disable-libunwind-exceptions --enable-clocale=gnu
>     --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object
>     --enable-linker-build-id --enable-lto --enable-plugin
>     --enable-install-libiberty --with-linker-hash-style=gnu
>     --enable-gnu-indirect-function --disable-multilib --disable-werror
>     --enable-checking=release
>     Thread model: posix
>     gcc version 8.2.0 (GCC)
> 
> 
>     # build setup
>     make -j 8 config T=arm64-armv8a-linuxapp-gcc  CROSS=aarch64-linux-gnu-
>     make -j 8 test-build CROSS=aarch64-linux-gnu-
> 
>     # generate asm
>     aarch64-linux-gnu-gdb -batch -ex 'file build/app/test ' -ex 'disassemble /rs
> bucket_enqueue_single'
> 
>     I have uploaded generated file for your convenience
>     with_atomic_load.txt(includes patch 1,2,3)
>     -----------------------
>     https://pastebin.com/SQ6w1yRu
> 
>     without_atomic_load.txt(includes patch 2,3)
>     -----------------------
>     https://pastebin.com/BpvnD0CA
> 
> 
>     without_atomic
>     -------------
>     23              if (!single)
>        0x000000000068d290 <+240>:   85 00 00 35     cbnz    w5, 0x68d2a0
> <bucket_enqueue_single+256>
>        0x000000000068d294 <+244>:   82 04 40 b9     ldr     w2, [x4, #4]
>        0x000000000068d298 <+248>:   5f 00 01 6b     cmp     w2, w1
>        0x000000000068d29c <+252>:   21 01 00 54     b.ne    0x68d2c0
> <bucket_enqueue_single+288>  // b.any
> 
>     24                      while (unlikely(ht->tail != old_val))
>     25                              rte_pause();
> 
> 
>     with_atomic
>     -----------
>     23              if (!single)
>        0x000000000068ceb0 <+240>:   00 10 04 91     add     x0, x0, #0x104
>        0x000000000068ceb4 <+244>:   84 00 00 35     cbnz    w4, 0x68cec4
> <bucket_enqueue_single+260>
>        0x000000000068ceb8 <+248>:   02 00 40 b9     ldr     w2, [x0]
>        0x000000000068cebc <+252>:   3f 00 02 6b     cmp     w1, w2
>        0x000000000068cec0 <+256>:   01 09 00 54     b.ne    0x68cfe0
> <bucket_enqueue_single+544>  // b.any
> 
>     24                      while (unlikely(old_val != __atomic_load_n(&ht->tail,
> __ATOMIC_RELAXED)))
> 
> 
>     I don't want to block this series of patches due this patch. Can we make
>     re spin one series with 2 and 3 patches. And Wait for patch 1 to conclude?
> 
>     Thoughts?
> 
> 
> 
> 
>     >
>     > > But argument that we shouldn't assume 32bit load/store ops as
> atomic
>     > > sounds a bit flaky to me.
>     > > Konstantin
>     > >
>     > >
>     > > >
>     > > >
>     > > >
>     > > >         > But there are no memory ordering requirements (with
>     > > >         > regards to other loads and/or stores by this thread) so
> relaxed
>     > > memory order is sufficient.
>     > > >         > Another aspect of using __atomic_load_n() is that the
>     > > > compiler cannot "optimise" this load (e.g. combine, hoist etc), it has
> to be
>     > > done as
>     > > >         > specified in the source code which is also what we need here.
>     > > >
>     > > >         I think Jerin points that rte_pause() acts here as compiler
> barrier too,
>     > > >         so no need to worry that compiler would optimize out the loop.
>     > > >     [Ola] Sorry missed that. But the barrier behaviour of rte_pause()
>     > > > is not part of C11, is it essentially a hand-made feature to support
>     > > > the legacy multithreaded memory model (which uses explicit HW
> and
>     > > compiler barriers). I'd prefer code using the C11 memory model not to
>     > > depend on such legacy features.
>     > > >
>     > > >
>     > > >
>     > > >         Konstantin
>     > > >
>     > > >         >
>     > > >         > One point worth mentioning though is that this change is for
>     > > > the rte_ring_c11_mem.h file, not the legacy ring. It may be worth
> persisting
>     > > >         > with getting the C11 code right when people are less excited
> about
>     > > sending a release out?
>     > > >         >
>     > > >         > We can explain that for C11 we would prefer to do loads and
> stores
>     > > as per the C11 memory model. In the case of rte_ring, the code is
>     > > >         > separated cleanly into C11 specific files anyway.
>     > > >         >
>     > > >         > I think reading ht->tail using __atomic_load_n() is the most
>     > > appropriate way. We show that ht->tail is used for synchronization,
> we
>     > > >         > acknowledge that ht->tail may be written by other threads
>     > > > without any other kind of synchronization (e.g. no lock involved)
> and we
>     > > require
>     > > >         > an atomic load (any write to ht->tail must also be atomic).
>     > > >         >
>     > > >         > Using volatile and explicit compiler (or processor) memory
> barriers
>     > > (fences) is the legacy pre-C11 way of accomplishing these things.
>     > > > There's
>     > > >         > a reason why C11/C++11 moved away from the old ways.
>     > > >         > > >
>     > > >         > > >         __atomic_store_n(&ht->tail, new_val,
> __ATOMIC_RELEASE);
>     > > >         > > > --
>     > > >         > > > 2.7.4
>     > > >         > > >
>     > > >
>     > > >
>     > > >
>     >
> 


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-05 20:06                       ` Honnappa Nagarahalli
@ 2018-10-05 20:17                         ` Ola Liljedahl
  2018-10-05 20:29                           ` Honnappa Nagarahalli
  0 siblings, 1 reply; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-05 20:17 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Jerin Jacob
  Cc: Ananyev, Konstantin, Gavin Hu (Arm Technology China),
	dev, Steve Capper, nd, stable

I doubt it is possible to benchmark with such precision as to see the potential difference of one ADD instruction.
Just changes in function alignment can affect performance by percents. And the natural variation when not using a 100% deterministic system is going to be a lot larger than one cycle per ring buffer operation.

Some of the other patches are also for correctness (e.g. load-acquire of tail) so while performance measurements may be interesting, we can't skip a bug fix just because it proves to decrease performance.

-- Ola

On 05/10/2018, 22:06, "Honnappa Nagarahalli" <Honnappa.Nagarahalli@arm.com> wrote:

    Hi Jerin,
    	Thank you for generating the disassembly, that is really helpful. I agree with you that we have the option of moving parts 2 and 3 forward. I will let Gavin take a decision.
    
    I suggest that we run benchmarks on this patch alone and in combination with other patches in the series. We have few Arm machines and we will run on all of them along with x86. We take a decision based on that.
    
    Would that be a way to move forward? I think this should address both your and Ola's concerns.
    
    I am open for other suggestions as well.
    
    Thank you,
    Honnappa
    
    > 
    > So you don't want to write the proper C11 code because the compiler
    > generates one extra instruction that way?
    > You don't even know if that one extra instruction has any measurable
    > impact on performance. E.g. it could be issued the cycle before together
    > with other instructions.
    > 
    > We can complain to the compiler writers that the code generation for
    > __atomic_load_n(, __ATOMIC_RELAXED) is not optimal (at least on
    > ARM/A64). I think the problem is that the __atomic builtins only accept a
    > base address without any offset and this is possibly because e.g. load/store
    > exclusive (LDX/STX) and load-acquire (LDAR) and store-release (STLR) only
    > accept a base register with no offset. So any offset has to be added before
    > the actual "atomic" instruction, LDR in this case.
    > 
    > 
    > -- Ola
    > 
    > 
    > On 05/10/2018, 19:07, "Jerin Jacob" <jerin.jacob@caviumnetworks.com>
    > wrote:
    > 
    >     -----Original Message-----
    >     > Date: Fri, 5 Oct 2018 15:11:44 +0000
    >     > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
    >     > To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, Ola
    > Liljedahl
    >     >  <Ola.Liljedahl@arm.com>, "Gavin Hu (Arm Technology China)"
    >     >  <Gavin.Hu@arm.com>, Jerin Jacob <jerin.jacob@caviumnetworks.com>
    >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Steve Capper
    > <Steve.Capper@arm.com>, nd
    >     >  <nd@arm.com>, "stable@dpdk.org" <stable@dpdk.org>
    >     > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
    >     >
    >     > > >         > Hi Jerin,
    >     > > >         >
    >     > > >         > Thanks for your review, inline comments from our internal
    >     > > discussions.
    >     > > >         >
    >     > > >         > BR. Gavin
    >     > > >         >
    >     > > >         > > -----Original Message-----
    >     > > >         > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    >     > > >         > > Sent: Saturday, September 29, 2018 6:49 PM
    >     > > >         > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
    >     > > >         > > Cc: dev@dpdk.org; Honnappa Nagarahalli
    >     > > >         > > <Honnappa.Nagarahalli@arm.com>; Steve Capper
    >     > > >         > > <Steve.Capper@arm.com>; Ola Liljedahl
    > <Ola.Liljedahl@arm.com>;
    >     > > nd
    >     > > >         > > <nd@arm.com>; stable@dpdk.org
    >     > > >         > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    >     > > >         > >
    >     > > >         > > -----Original Message-----
    >     > > >         > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
    >     > > >         > > > From: Gavin Hu <gavin.hu@arm.com>
    >     > > >         > > > To: dev@dpdk.org
    >     > > >         > > > CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com,
    >     > > >         > > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
    >     > > >         > > > jerin.jacob@caviumnetworks.com, nd@arm.com,
    >     > > stable@dpdk.org
    >     > > >         > > > Subject: [PATCH v3 1/3] ring: read tail using atomic load
    >     > > >         > > > X-Mailer: git-send-email 2.7.4
    >     > > >         > > >
    >     > > >         > > > External Email
    >     > > >         > > >
    >     > > >         > > > In update_tail, read ht->tail using
    > __atomic_load.Although the
    >     > > >         > > > compiler currently seems to be doing the right thing even
    > without
    >     > > >         > > > _atomic_load, we don't want to give the compiler
    > freedom to
    >     > > optimise
    >     > > >         > > > what should be an atomic load, it should not be arbitarily
    > moved
    >     > > >         > > > around.
    >     > > >         > > >
    >     > > >         > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model
    > barrier
    >     > > option")
    >     > > >         > > > Cc: stable@dpdk.org
    >     > > >         > > >
    >     > > >         > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
    >     > > >         > > > Reviewed-by: Honnappa Nagarahalli
    >     > > <Honnappa.Nagarahalli@arm.com>
    >     > > >         > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
    >     > > >         > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
    >     > > >         > > > ---
    >     > > >         > > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
    >     > > >         > > >  1 file changed, 2 insertions(+), 1 deletion(-)
    >     > > >         > > >
    >     > > >         > The read of ht->tail needs to be atomic, a non-atomic read
    > would not
    >     > > be correct.
    >     > > >
    >     > > >         That's a 32bit value load.
    >     > > >         AFAIK on all CPUs that we support it is an atomic operation.
    >     > > >     [Ola] But that the ordinary C load is translated to an atomic load
    > for the
    >     > > target architecture is incidental.
    >     > > >
    >     > > >     If the design requires an atomic load (which is the case here), we
    >     > > > should use an atomic load on the language level. Then we can be
    > sure it will
    >     > > always be translated to an atomic load for the target in question or
    >     > > compilation will fail. We don't have to depend on assumptions.
    >     > >
    >     > > We all know that 32bit load/store on cpu we support - are atomic.
    >     > > If it wouldn't be the case - DPDK would be broken in dozen places.
    >     > > So what the point to pretend that "it might be not atomic" if we do
    > know for
    >     > > sure that it is?
    >     > > I do understand that you want to use atomic_load(relaxed) here for
    >     > > consistency, and to conform with C11 mem-model and I don't see any
    > harm in
    >     > > that.
    >     > We can continue to discuss the topic, it is a good discussion. But, as far
    > this patch is concerned, can I consider this as us having a consensus? The
    > file rte_ring_c11_mem.h is specifically for C11 memory model and I also do
    > not see any harm in having code that completely conforms to C11 memory
    > model.
    > 
    >     Have you guys checked the output assembly with and without atomic
    > load?
    >     There is an extra "add" instruction with at least the code I have checked.
    >     I think, compiler is not smart enough to understand it is a dead code for
    >     arm64.
    > 
    >     ➜ [~] $ aarch64-linux-gnu-gcc -v
    >     Using built-in specs.
    >     COLLECT_GCC=aarch64-linux-gnu-gcc
    >     COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-
    > wrapper
    >     Target: aarch64-linux-gnu
    >     Configured with: /build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure
    >     --prefix=/usr --program-prefix=aarch64-linux-gnu-
    >     --with-local-prefix=/usr/aarch64-linux-gnu
    >     --with-sysroot=/usr/aarch64-linux-gnu
    >     --with-build-sysroot=/usr/aarch64-linux-gnu --libdir=/usr/lib
    >     --libexecdir=/usr/lib --target=aarch64-linux-gnu
    >     --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-nls
    >     --enable-languages=c,c++ --enable-shared --enable-threads=posix
    >     --with-system-zlib --with-isl --enable-__cxa_atexit
    >     --disable-libunwind-exceptions --enable-clocale=gnu
    >     --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object
    >     --enable-linker-build-id --enable-lto --enable-plugin
    >     --enable-install-libiberty --with-linker-hash-style=gnu
    >     --enable-gnu-indirect-function --disable-multilib --disable-werror
    >     --enable-checking=release
    >     Thread model: posix
    >     gcc version 8.2.0 (GCC)
    > 
    > 
    >     # build setup
    >     make -j 8 config T=arm64-armv8a-linuxapp-gcc  CROSS=aarch64-linux-gnu-
    >     make -j 8 test-build CROSS=aarch64-linux-gnu-
    > 
    >     # generate asm
    >     aarch64-linux-gnu-gdb -batch -ex 'file build/app/test ' -ex 'disassemble /rs
    > bucket_enqueue_single'
    > 
    >     I have uploaded the generated files for your convenience:
    >     with_atomic_load.txt(includes patch 1,2,3)
    >     -----------------------
    >     https://pastebin.com/SQ6w1yRu
    > 
    >     without_atomic_load.txt(includes patch 2,3)
    >     -----------------------
    >     https://pastebin.com/BpvnD0CA
    > 
    > 
    >     without_atomic
    >     -------------
    >     23              if (!single)
    >        0x000000000068d290 <+240>:   85 00 00 35     cbnz    w5, 0x68d2a0
    > <bucket_enqueue_single+256>
    >        0x000000000068d294 <+244>:   82 04 40 b9     ldr     w2, [x4, #4]
    >        0x000000000068d298 <+248>:   5f 00 01 6b     cmp     w2, w1
    >        0x000000000068d29c <+252>:   21 01 00 54     b.ne    0x68d2c0
    > <bucket_enqueue_single+288>  // b.any
    > 
    >     24                      while (unlikely(ht->tail != old_val))
    >     25                              rte_pause();
    > 
    > 
    >     with_atomic
    >     -----------
    >     23              if (!single)
    >        0x000000000068ceb0 <+240>:   00 10 04 91     add     x0, x0, #0x104
    >        0x000000000068ceb4 <+244>:   84 00 00 35     cbnz    w4, 0x68cec4
    > <bucket_enqueue_single+260>
    >        0x000000000068ceb8 <+248>:   02 00 40 b9     ldr     w2, [x0]
    >        0x000000000068cebc <+252>:   3f 00 02 6b     cmp     w1, w2
    >        0x000000000068cec0 <+256>:   01 09 00 54     b.ne    0x68cfe0
    > <bucket_enqueue_single+544>  // b.any
    > 
    >     24                      while (unlikely(old_val != __atomic_load_n(&ht->tail,
    > __ATOMIC_RELAXED)))
    > 
    > 
    >     I don't want to block this series of patches due to this patch. Can we
    >     re-spin the series with patches 2 and 3, and wait for patch 1 to conclude?
    > 
    >     Thoughts?
    > 
    > 
    > 
    > 
    >     >
    >     > > But argument that we shouldn't assume 32bit load/store ops as
    > atomic
    >     > > sounds a bit flaky to me.
    >     > > Konstantin
    >     > >
    >     > >
    >     > > >
    >     > > >
    >     > > >
    >     > > >         > But there are no memory ordering requirements (with
    >     > > >         > regards to other loads and/or stores by this thread) so
    > relaxed
    >     > > memory order is sufficient.
    >     > > >         > Another aspect of using __atomic_load_n() is that the
    >     > > > compiler cannot "optimise" this load (e.g. combine, hoist etc), it has
    > to be
    >     > > done as
    >     > > >         > specified in the source code which is also what we need here.
    >     > > >
    >     > > >         I think Jerin points that rte_pause() acts here as compiler
    > barrier too,
    >     > > >         so no need to worry that compiler would optimize out the loop.
    >     > > >     [Ola] Sorry missed that. But the barrier behaviour of rte_pause()
    >     > > > is not part of C11, is it essentially a hand-made feature to support
    >     > > > the legacy multithreaded memory model (which uses explicit HW
    > and
    >     > > compiler barriers). I'd prefer code using the C11 memory model not to
    >     > > depend on such legacy features.
    >     > > >
    >     > > >
    >     > > >
    >     > > >         Konstantin
    >     > > >
    >     > > >         >
    >     > > >         > One point worth mentioning though is that this change is for
    >     > > > the rte_ring_c11_mem.h file, not the legacy ring. It may be worth
    > persisting
    >     > > >         > with getting the C11 code right when people are less excited
    > about
    >     > > sending a release out?
    >     > > >         >
    >     > > >         > We can explain that for C11 we would prefer to do loads and
    > stores
    >     > > as per the C11 memory model. In the case of rte_ring, the code is
    >     > > >         > separated cleanly into C11 specific files anyway.
    >     > > >         >
    >     > > >         > I think reading ht->tail using __atomic_load_n() is the most
    >     > > appropriate way. We show that ht->tail is used for synchronization,
    > we
    >     > > >         > acknowledge that ht->tail may be written by other threads
    >     > > > without any other kind of synchronization (e.g. no lock involved)
    > and we
    >     > > require
    >     > > >         > an atomic load (any write to ht->tail must also be atomic).
    >     > > >         >
    >     > > >         > Using volatile and explicit compiler (or processor) memory
    > barriers
    >     > > (fences) is the legacy pre-C11 way of accomplishing these things.
    >     > > > There's
    >     > > >         > a reason why C11/C++11 moved away from the old ways.
    >     > > >         > > >
    >     > > >         > > >         __atomic_store_n(&ht->tail, new_val,
    > __ATOMIC_RELEASE);
    >     > > >         > > > --
    >     > > >         > > > 2.7.4
    >     > > >         > > >
    >     > > >
    >     > > >
    >     > > >
    >     >
    > 
    
    


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-05 20:17                         ` Ola Liljedahl
@ 2018-10-05 20:29                           ` Honnappa Nagarahalli
  2018-10-05 20:34                             ` Ola Liljedahl
  0 siblings, 1 reply; 131+ messages in thread
From: Honnappa Nagarahalli @ 2018-10-05 20:29 UTC (permalink / raw)
  To: Ola Liljedahl, Jerin Jacob
  Cc: Ananyev, Konstantin, Gavin Hu (Arm Technology China),
	dev, Steve Capper, nd, stable

> 
> I doubt it is possible to benchmark with such a precision so to see the
> potential difference of one ADD instruction.
> Just changes in function alignment can affect performance by percents. And
> the natural variation when not using a 100% deterministic system is going to
> be a lot larger than one cycle per ring buffer operation.
> 
> Some of the other patches are also for correctness (e.g. load-acquire of tail)
The discussion is about this patch alone. Other patches are already Acked.

> so while performance measurements may be interesting, we can't skip a bug
> fix just because it proves to decrease performance.
IMO, this patch is not a bug fix, in the sense that it does not fix any observed failures with the current code.

> 
> -- Ola
> 
> On 05/10/2018, 22:06, "Honnappa Nagarahalli"
> <Honnappa.Nagarahalli@arm.com> wrote:
> 
>     Hi Jerin,
>     	Thank you for generating the disassembly, that is really helpful. I
> agree with you that we have the option of moving parts 2 and 3 forward. I
> will let Gavin take a decision.
> 
>     I suggest that we run benchmarks on this patch alone and in combination
> with other patches in the series. We have few Arm machines and we will run
> on all of them along with x86. We take a decision based on that.
> 
>     Would that be a way to move forward? I think this should address both
> your and Ola's concerns.
> 
>     I am open for other suggestions as well.
> 
>     Thank you,
>     Honnappa
> 
>     >
>     > So you don't want to write the proper C11 code because the compiler
>     > generates one extra instruction that way?
>     > You don't even know if that one extra instruction has any measurable
>     > impact on performance. E.g. it could be issued the cycle before together
>     > with other instructions.
>     >
>     > We can complain to the compiler writers that the code generation for
>     > __atomic_load_n(, __ATOMIC_RELAXED) is not optimal (at least on
>     > ARM/A64). I think the problem is that the __atomic builtins only accept
> a
>     > base address without any offset and this is possibly because e.g.
> load/store
>     > exclusive (LDX/STX) and load-acquire (LDAR) and store-release (STLR)
> only
>     > accept a base register with no offset. So any offset has to be added
> before
>     > the actual "atomic" instruction, LDR in this case.
>     >
>     >
>     > -- Ola
>     >
>     >
>     > On 05/10/2018, 19:07, "Jerin Jacob" <jerin.jacob@caviumnetworks.com>
>     > wrote:
>     >
>     >     -----Original Message-----
>     >     > Date: Fri, 5 Oct 2018 15:11:44 +0000
>     >     > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
>     >     > To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, Ola
>     > Liljedahl
>     >     >  <Ola.Liljedahl@arm.com>, "Gavin Hu (Arm Technology China)"
>     >     >  <Gavin.Hu@arm.com>, Jerin Jacob
> <jerin.jacob@caviumnetworks.com>
>     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Steve Capper
>     > <Steve.Capper@arm.com>, nd
>     >     >  <nd@arm.com>, "stable@dpdk.org" <stable@dpdk.org>
>     >     > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
>     >     >
>     >     > > >         > Hi Jerin,
>     >     > > >         >
>     >     > > >         > Thanks for your review, inline comments from our
> internal
>     >     > > discussions.
>     >     > > >         >
>     >     > > >         > BR. Gavin
>     >     > > >         >
>     >     > > >         > > -----Original Message-----
>     >     > > >         > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     >     > > >         > > Sent: Saturday, September 29, 2018 6:49 PM
>     >     > > >         > > To: Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>
>     >     > > >         > > Cc: dev@dpdk.org; Honnappa Nagarahalli
>     >     > > >         > > <Honnappa.Nagarahalli@arm.com>; Steve Capper
>     >     > > >         > > <Steve.Capper@arm.com>; Ola Liljedahl
>     > <Ola.Liljedahl@arm.com>;
>     >     > > nd
>     >     > > >         > > <nd@arm.com>; stable@dpdk.org
>     >     > > >         > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic
> load
>     >     > > >         > >
>     >     > > >         > > -----Original Message-----
>     >     > > >         > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
>     >     > > >         > > > From: Gavin Hu <gavin.hu@arm.com>
>     >     > > >         > > > To: dev@dpdk.org
>     >     > > >         > > > CC: gavin.hu@arm.com,
> Honnappa.Nagarahalli@arm.com,
>     >     > > >         > > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
>     >     > > >         > > > jerin.jacob@caviumnetworks.com, nd@arm.com,
>     >     > > stable@dpdk.org
>     >     > > >         > > > Subject: [PATCH v3 1/3] ring: read tail using atomic
> load
>     >     > > >         > > > X-Mailer: git-send-email 2.7.4
>     >     > > >         > > >
>     >     > > >         > > > External Email
>     >     > > >         > > >
>     >     > > >         > > > In update_tail, read ht->tail using
>     > __atomic_load.Although the
>     >     > > >         > > > compiler currently seems to be doing the right thing
> even
>     > without
>     >     > > >         > > > _atomic_load, we don't want to give the compiler
>     > freedom to
>     >     > > optimise
>     >     > > >         > > > what should be an atomic load, it should not be
> arbitarily
>     > moved
>     >     > > >         > > > around.
>     >     > > >         > > >
>     >     > > >         > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model
>     > barrier
>     >     > > option")
>     >     > > >         > > > Cc: stable@dpdk.org
>     >     > > >         > > >
>     >     > > >         > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
>     >     > > >         > > > Reviewed-by: Honnappa Nagarahalli
>     >     > > <Honnappa.Nagarahalli@arm.com>
>     >     > > >         > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
>     >     > > >         > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     >     > > >         > > > ---
>     >     > > >         > > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
>     >     > > >         > > >  1 file changed, 2 insertions(+), 1 deletion(-)
>     >     > > >         > > >
>     >     > > >         > The read of ht->tail needs to be atomic, a non-atomic
> read
>     > would not
>     >     > > be correct.
>     >     > > >
>     >     > > >         That's a 32bit value load.
>     >     > > >         AFAIK on all CPUs that we support it is an atomic operation.
>     >     > > >     [Ola] But that the ordinary C load is translated to an atomic
> load
>     > for the
>     >     > > target architecture is incidental.
>     >     > > >
>     >     > > >     If the design requires an atomic load (which is the case here),
> we
>     >     > > > should use an atomic load on the language level. Then we can
> be
>     > sure it will
>     >     > > always be translated to an atomic load for the target in question
> or
>     >     > > compilation will fail. We don't have to depend on assumptions.
>     >     > >
>     >     > > We all know that 32bit load/store on cpu we support - are atomic.
>     >     > > If it wouldn't be the case - DPDK would be broken in dozen places.
>     >     > > So what the point to pretend that "it might be not atomic" if we
> do
>     > know for
>     >     > > sure that it is?
>     >     > > I do understand that you want to use atomic_load(relaxed) here
> for
>     >     > > consistency, and to conform with C11 mem-model and I don't see
> any
>     > harm in
>     >     > > that.
>     >     > We can continue to discuss the topic, it is a good discussion. But, as
> far
>     > this patch is concerned, can I consider this as us having a consensus?
> The
>     > file rte_ring_c11_mem.h is specifically for C11 memory model and I also
> do
>     > not see any harm in having code that completely conforms to C11
> memory
>     > model.
>     >
>     >     Have you guys checked the output assembly with and without atomic
>     > load?
>     >     There is an extra "add" instruction with at least the code I have
> checked.
>     >     I think, compiler is not smart enough to understand it is a dead code
> for
>     >     arm64.
>     >
>     >     ➜ [~] $ aarch64-linux-gnu-gcc -v
>     >     Using built-in specs.
>     >     COLLECT_GCC=aarch64-linux-gnu-gcc
>     >     COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-
>     > wrapper
>     >     Target: aarch64-linux-gnu
>     >     Configured with: /build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure
>     >     --prefix=/usr --program-prefix=aarch64-linux-gnu-
>     >     --with-local-prefix=/usr/aarch64-linux-gnu
>     >     --with-sysroot=/usr/aarch64-linux-gnu
>     >     --with-build-sysroot=/usr/aarch64-linux-gnu --libdir=/usr/lib
>     >     --libexecdir=/usr/lib --target=aarch64-linux-gnu
>     >     --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-nls
>     >     --enable-languages=c,c++ --enable-shared --enable-threads=posix
>     >     --with-system-zlib --with-isl --enable-__cxa_atexit
>     >     --disable-libunwind-exceptions --enable-clocale=gnu
>     >     --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object
>     >     --enable-linker-build-id --enable-lto --enable-plugin
>     >     --enable-install-libiberty --with-linker-hash-style=gnu
>     >     --enable-gnu-indirect-function --disable-multilib --disable-werror
>     >     --enable-checking=release
>     >     Thread model: posix
>     >     gcc version 8.2.0 (GCC)
>     >
>     >
>     >     # build setup
>     >     make -j 8 config T=arm64-armv8a-linuxapp-gcc  CROSS=aarch64-linux-
> gnu-
>     >     make -j 8 test-build CROSS=aarch64-linux-gnu-
>     >
>     >     # generate asm
>     >     aarch64-linux-gnu-gdb -batch -ex 'file build/app/test ' -ex
> 'disassemble /rs
>     > bucket_enqueue_single'
>     >
>     >     I have uploaded generated file for your convenience
>     >     with_atomic_load.txt(includes patch 1,2,3)
>     >     -----------------------
>     >     https://pastebin.com/SQ6w1yRu
>     >
>     >     without_atomic_load.txt(includes patch 2,3)
>     >     -----------------------
>     >     https://pastebin.com/BpvnD0CA
>     >
>     >
>     >     without_atomic
>     >     -------------
>     >     23              if (!single)
>     >        0x000000000068d290 <+240>:   85 00 00 35     cbnz    w5, 0x68d2a0
>     > <bucket_enqueue_single+256>
>     >        0x000000000068d294 <+244>:   82 04 40 b9     ldr     w2, [x4, #4]
>     >        0x000000000068d298 <+248>:   5f 00 01 6b     cmp     w2, w1
>     >        0x000000000068d29c <+252>:   21 01 00 54     b.ne    0x68d2c0
>     > <bucket_enqueue_single+288>  // b.any
>     >
>     >     24                      while (unlikely(ht->tail != old_val))
>     >     25                              rte_pause();
>     >
>     >
>     >     with_atomic
>     >     -----------
>     >     23              if (!single)
>     >        0x000000000068ceb0 <+240>:   00 10 04 91     add     x0, x0, #0x104
>     >        0x000000000068ceb4 <+244>:   84 00 00 35     cbnz    w4, 0x68cec4
>     > <bucket_enqueue_single+260>
>     >        0x000000000068ceb8 <+248>:   02 00 40 b9     ldr     w2, [x0]
>     >        0x000000000068cebc <+252>:   3f 00 02 6b     cmp     w1, w2
>     >        0x000000000068cec0 <+256>:   01 09 00 54     b.ne    0x68cfe0
>     > <bucket_enqueue_single+544>  // b.any
>     >
>     >     24                      while (unlikely(old_val != __atomic_load_n(&ht->tail,
>     > __ATOMIC_RELAXED)))
>     >
>     >
    >     I don't want to block this series of patches due to this patch. Can we
    >     re-spin the series with patches 2 and 3, and wait for patch 1 to
    > conclude?
>     >
>     >     Thoughts?
>     >
>     >
>     >
>     >
>     >     >
>     >     > > But argument that we shouldn't assume 32bit load/store ops as
>     > atomic
>     >     > > sounds a bit flaky to me.
>     >     > > Konstantin
>     >     > >
>     >     > >
>     >     > > >
>     >     > > >
>     >     > > >
>     >     > > >         > But there are no memory ordering requirements (with
>     >     > > >         > regards to other loads and/or stores by this thread) so
>     > relaxed
>     >     > > memory order is sufficient.
>     >     > > >         > Another aspect of using __atomic_load_n() is that the
>     >     > > > compiler cannot "optimise" this load (e.g. combine, hoist etc), it
> has
>     > to be
>     >     > > done as
>     >     > > >         > specified in the source code which is also what we need
> here.
>     >     > > >
>     >     > > >         I think Jerin points that rte_pause() acts here as compiler
>     > barrier too,
>     >     > > >         so no need to worry that compiler would optimize out the
> loop.
>     >     > > >     [Ola] Sorry missed that. But the barrier behaviour of
> rte_pause()
>     >     > > > is not part of C11, is it essentially a hand-made feature to
> support
>     >     > > > the legacy multithreaded memory model (which uses explicit
> HW
>     > and
>     >     > > compiler barriers). I'd prefer code using the C11 memory model
> not to
>     >     > > depend on such legacy features.
>     >     > > >
>     >     > > >
>     >     > > >
>     >     > > >         Konstantin
>     >     > > >
>     >     > > >         >
>     >     > > >         > One point worth mentioning though is that this change is
> for
>     >     > > > the rte_ring_c11_mem.h file, not the legacy ring. It may be
> worth
>     > persisting
>     >     > > >         > with getting the C11 code right when people are less
> excited
>     > about
>     >     > > sending a release out?
>     >     > > >         >
>     >     > > >         > We can explain that for C11 we would prefer to do loads
> and
>     > stores
>     >     > > as per the C11 memory model. In the case of rte_ring, the code is
>     >     > > >         > separated cleanly into C11 specific files anyway.
>     >     > > >         >
>     >     > > >         > I think reading ht->tail using __atomic_load_n() is the
> most
>     >     > > appropriate way. We show that ht->tail is used for
> synchronization,
>     > we
>     >     > > >         > acknowledge that ht->tail may be written by other
> threads
>     >     > > > without any other kind of synchronization (e.g. no lock involved)
>     > and we
>     >     > > require
>     >     > > >         > an atomic load (any write to ht->tail must also be atomic).
>     >     > > >         >
>     >     > > >         > Using volatile and explicit compiler (or processor)
> memory
>     > barriers
>     >     > > (fences) is the legacy pre-C11 way of accomplishing these things.
>     >     > > > There's
>     >     > > >         > a reason why C11/C++11 moved away from the old ways.
>     >     > > >         > > >
>     >     > > >         > > >         __atomic_store_n(&ht->tail, new_val,
>     > __ATOMIC_RELEASE);
>     >     > > >         > > > --
>     >     > > >         > > > 2.7.4
>     >     > > >         > > >
>     >     > > >
>     >     > > >
>     >     > > >
>     >     >
>     >
> 
> 


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-05 20:29                           ` Honnappa Nagarahalli
@ 2018-10-05 20:34                             ` Ola Liljedahl
  2018-10-06  7:41                               ` Jerin Jacob
  2018-10-08  5:27                               ` Honnappa Nagarahalli
  0 siblings, 2 replies; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-05 20:34 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Jerin Jacob
  Cc: Ananyev, Konstantin, Gavin Hu (Arm Technology China),
	dev, Steve Capper, nd, stable



On 05/10/2018, 22:29, "Honnappa Nagarahalli" <Honnappa.Nagarahalli@arm.com> wrote:

    > 
    > I doubt it is possible to benchmark with such a precision so to see the
    > potential difference of one ADD instruction.
    > Just changes in function alignment can affect performance by percents. And
    > the natural variation when not using a 100% deterministic system is going to
    > be a lot larger than one cycle per ring buffer operation.
    > 
    > Some of the other patches are also for correctness (e.g. load-acquire of tail)
    The discussion is about this patch alone. Other patches are already Acked.
So the benchmarking then makes zero sense.

    
    > so while performance measurements may be interesting, we can't skip a bug
    > fix just because it proves to decrease performance.
    IMO, this patch is not a bug fix - in terms of it fixing any failures with the current code.
It's a fix for correctness. Per the C++11 standard (and probably C11 as well, given the shared memory model), we have undefined behaviour here: a data race on ht->tail. If the compiler detects UB, it is allowed to do anything. Current compilers might not exploit this, but future compilers could.


    
    > 
    > -- Ola
    > 
    > On 05/10/2018, 22:06, "Honnappa Nagarahalli"
    > <Honnappa.Nagarahalli@arm.com> wrote:
    > 
    >     Hi Jerin,
    >     	Thank you for generating the disassembly, that is really helpful. I
    > agree with you that we have the option of moving parts 2 and 3 forward. I
    > will let Gavin take a decision.
    > 
    >     I suggest that we run benchmarks on this patch alone and in combination
    > with other patches in the series. We have few Arm machines and we will run
    > on all of them along with x86. We take a decision based on that.
    > 
    >     Would that be a way to move forward? I think this should address both
    > your and Ola's concerns.
    > 
    >     I am open for other suggestions as well.
    > 
    >     Thank you,
    >     Honnappa
    > 
    >     >
    >     > So you don't want to write the proper C11 code because the compiler
    >     > generates one extra instruction that way?
    >     > You don't even know if that one extra instruction has any measurable
    >     > impact on performance. E.g. it could be issued the cycle before together
    >     > with other instructions.
    >     >
    >     > We can complain to the compiler writers that the code generation for
    >     > __atomic_load_n(, __ATOMIC_RELAXED) is not optimal (at least on
    >     > ARM/A64). I think the problem is that the __atomic builtins only accept
    > a
    >     > base address without any offset and this is possibly because e.g.
    > load/store
    >     > exclusive (LDX/STX) and load-acquire (LDAR) and store-release (STLR)
    > only
    >     > accept a base register with no offset. So any offset has to be added
    > before
    >     > the actual "atomic" instruction, LDR in this case.
    >     >
    >     >
    >     > -- Ola
    >     >
    >     >
    >     > On 05/10/2018, 19:07, "Jerin Jacob" <jerin.jacob@caviumnetworks.com>
    >     > wrote:
    >     >
    >     >     -----Original Message-----
    >     >     > Date: Fri, 5 Oct 2018 15:11:44 +0000
    >     >     > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
    >     >     > To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, Ola
    >     > Liljedahl
    >     >     >  <Ola.Liljedahl@arm.com>, "Gavin Hu (Arm Technology China)"
    >     >     >  <Gavin.Hu@arm.com>, Jerin Jacob
    > <jerin.jacob@caviumnetworks.com>
    >     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Steve Capper
    >     > <Steve.Capper@arm.com>, nd
    >     >     >  <nd@arm.com>, "stable@dpdk.org" <stable@dpdk.org>
    >     >     > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
    >     >     >
    >     >     > > >         > Hi Jerin,
    >     >     > > >         >
    >     >     > > >         > Thanks for your review, inline comments from our
    > internal
    >     >     > > discussions.
    >     >     > > >         >
    >     >     > > >         > BR. Gavin
    >     >     > > >         >
    >     >     > > >         > > -----Original Message-----
    >     >     > > >         > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    >     >     > > >         > > Sent: Saturday, September 29, 2018 6:49 PM
    >     >     > > >         > > To: Gavin Hu (Arm Technology China)
    > <Gavin.Hu@arm.com>
    >     >     > > >         > > Cc: dev@dpdk.org; Honnappa Nagarahalli
    >     >     > > >         > > <Honnappa.Nagarahalli@arm.com>; Steve Capper
    >     >     > > >         > > <Steve.Capper@arm.com>; Ola Liljedahl
    >     > <Ola.Liljedahl@arm.com>;
    >     >     > > nd
    >     >     > > >         > > <nd@arm.com>; stable@dpdk.org
    >     >     > > >         > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic
    > load
    >     >     > > >         > >
    >     >     > > >         > > -----Original Message-----
    >     >     > > >         > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
    >     >     > > >         > > > From: Gavin Hu <gavin.hu@arm.com>
    >     >     > > >         > > > To: dev@dpdk.org
    >     >     > > >         > > > CC: gavin.hu@arm.com,
    > Honnappa.Nagarahalli@arm.com,
    >     >     > > >         > > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
    >     >     > > >         > > > jerin.jacob@caviumnetworks.com, nd@arm.com,
    >     >     > > stable@dpdk.org
    >     >     > > >         > > > Subject: [PATCH v3 1/3] ring: read tail using atomic
    > load
    >     >     > > >         > > > X-Mailer: git-send-email 2.7.4
    >     >     > > >         > > >
    >     >     > > >         > > > External Email
    >     >     > > >         > > >
    >     >     > > >         > > > In update_tail, read ht->tail using
    >     > __atomic_load.Although the
    >     >     > > >         > > > compiler currently seems to be doing the right thing
    > even
    >     > without
    >     >     > > >         > > > _atomic_load, we don't want to give the compiler
    >     > freedom to
    >     >     > > optimise
    >     >     > > >         > > > what should be an atomic load, it should not be
    > arbitarily
    >     > moved
    >     >     > > >         > > > around.
    >     >     > > >         > > >
    >     >     > > >         > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model
    >     > barrier
    >     >     > > option")
    >     >     > > >         > > > Cc: stable@dpdk.org
    >     >     > > >         > > >
    >     >     > > >         > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
    >     >     > > >         > > > Reviewed-by: Honnappa Nagarahalli
    >     >     > > <Honnappa.Nagarahalli@arm.com>
    >     >     > > >         > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
    >     >     > > >         > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
    >     >     > > >         > > > ---
    >     >     > > >         > > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
    >     >     > > >         > > >  1 file changed, 2 insertions(+), 1 deletion(-)
    >     >     > > >         > > >
    >     >     > > >         > The read of ht->tail needs to be atomic, a non-atomic
    > read
    >     > would not
    >     >     > > be correct.
    >     >     > > >
    >     >     > > >         That's a 32bit value load.
    >     >     > > >         AFAIK on all CPUs that we support it is an atomic operation.
    >     >     > > >     [Ola] But that the ordinary C load is translated to an atomic
    > load
    >     > for the
    >     >     > > target architecture is incidental.
    >     >     > > >
    >     >     > > >     If the design requires an atomic load (which is the case here),
    > we
    >     >     > > > should use an atomic load on the language level. Then we can
    > be
    >     > sure it will
    >     >     > > always be translated to an atomic load for the target in question
    > or
    >     >     > > compilation will fail. We don't have to depend on assumptions.
    >     >     > >
    >     >     > > We all know that 32bit load/store on cpu we support - are atomic.
    >     >     > > If it wouldn't be the case - DPDK would be broken in dozen places.
    >     >     > > So what the point to pretend that "it might be not atomic" if we
    > do
    >     > know for
    >     >     > > sure that it is?
    >     >     > > I do understand that you want to use atomic_load(relaxed) here
    > for
    >     >     > > consistency, and to conform with C11 mem-model and I don't see
    > any
    >     > harm in
    >     >     > > that.
    >     >     > We can continue to discuss the topic, it is a good discussion. But, as
    > far
    >     > this patch is concerned, can I consider this as us having a consensus?
    > The
    >     > file rte_ring_c11_mem.h is specifically for C11 memory model and I also
    > do
    >     > not see any harm in having code that completely conforms to C11
    > memory
    >     > model.
    >     >
    >     >     Have you guys checked the output assembly with and without atomic
    >     > load?
    >     >     There is an extra "add" instruction with at least the code I have
    > checked.
    >     >     I think, compiler is not smart enough to understand it is a dead code
    > for
    >     >     arm64.
    >     >
    >     >     ➜ [~] $ aarch64-linux-gnu-gcc -v
    >     >     Using built-in specs.
    >     >     COLLECT_GCC=aarch64-linux-gnu-gcc
    >     >     COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-
    >     > wrapper
    >     >     Target: aarch64-linux-gnu
    >     >     Configured with: /build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure
    >     >     --prefix=/usr --program-prefix=aarch64-linux-gnu-
    >     >     --with-local-prefix=/usr/aarch64-linux-gnu
    >     >     --with-sysroot=/usr/aarch64-linux-gnu
    >     >     --with-build-sysroot=/usr/aarch64-linux-gnu --libdir=/usr/lib
    >     >     --libexecdir=/usr/lib --target=aarch64-linux-gnu
    >     >     --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-nls
    >     >     --enable-languages=c,c++ --enable-shared --enable-threads=posix
    >     >     --with-system-zlib --with-isl --enable-__cxa_atexit
    >     >     --disable-libunwind-exceptions --enable-clocale=gnu
    >     >     --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object
    >     >     --enable-linker-build-id --enable-lto --enable-plugin
    >     >     --enable-install-libiberty --with-linker-hash-style=gnu
    >     >     --enable-gnu-indirect-function --disable-multilib --disable-werror
    >     >     --enable-checking=release
    >     >     Thread model: posix
    >     >     gcc version 8.2.0 (GCC)
    >     >
    >     >
    >     >     # build setup
    >     >     make -j 8 config T=arm64-armv8a-linuxapp-gcc  CROSS=aarch64-linux-
    > gnu-
    >     >     make -j 8 test-build CROSS=aarch64-linux-gnu-
    >     >
    >     >     # generate asm
    >     >     aarch64-linux-gnu-gdb -batch -ex 'file build/app/test ' -ex
    > 'disassemble /rs
    >     > bucket_enqueue_single'
    >     >
    >     >     I have uploaded generated file for your convenience
    >     >     with_atomic_load.txt(includes patch 1,2,3)
    >     >     -----------------------
    >     >     https://pastebin.com/SQ6w1yRu
    >     >
    >     >     without_atomic_load.txt(includes patch 2,3)
    >     >     -----------------------
    >     >     https://pastebin.com/BpvnD0CA
    >     >
    >     >
    >     >     without_atomic
    >     >     -------------
    >     >     23              if (!single)
    >     >        0x000000000068d290 <+240>:   85 00 00 35     cbnz    w5, 0x68d2a0
    >     > <bucket_enqueue_single+256>
    >     >        0x000000000068d294 <+244>:   82 04 40 b9     ldr     w2, [x4, #4]
    >     >        0x000000000068d298 <+248>:   5f 00 01 6b     cmp     w2, w1
    >     >        0x000000000068d29c <+252>:   21 01 00 54     b.ne    0x68d2c0
    >     > <bucket_enqueue_single+288>  // b.any
    >     >
    >     >     24                      while (unlikely(ht->tail != old_val))
    >     >     25                              rte_pause();
    >     >
    >     >
    >     >     with_atomic
    >     >     -----------
    >     >     23              if (!single)
    >     >        0x000000000068ceb0 <+240>:   00 10 04 91     add     x0, x0, #0x104
    >     >        0x000000000068ceb4 <+244>:   84 00 00 35     cbnz    w4, 0x68cec4
    >     > <bucket_enqueue_single+260>
    >     >        0x000000000068ceb8 <+248>:   02 00 40 b9     ldr     w2, [x0]
    >     >        0x000000000068cebc <+252>:   3f 00 02 6b     cmp     w1, w2
    >     >        0x000000000068cec0 <+256>:   01 09 00 54     b.ne    0x68cfe0
    >     > <bucket_enqueue_single+544>  // b.any
    >     >
    >     >     24                      while (unlikely(old_val != __atomic_load_n(&ht->tail,
    >     > __ATOMIC_RELAXED)))
    >     >
    >     >
    >     >     I don't want to block this series of patches due this patch. Can we
    > make
    >     >     re spin one series with 2 and 3 patches. And Wait for patch 1 to
    > conclude?
    >     >
    >     >     Thoughts?
    >     >
    >     >
    >     >
    >     >
    >     >     >
    >     >     > > But argument that we shouldn't assume 32bit load/store ops as
    >     > atomic
    >     >     > > sounds a bit flaky to me.
    >     >     > > Konstantin
    >     >     > >
    >     >     > >
    >     >     > > >
    >     >     > > >
    >     >     > > >
    >     >     > > >         > But there are no memory ordering requirements (with
    >     >     > > >         > regards to other loads and/or stores by this thread) so
    >     > relaxed
    >     >     > > memory order is sufficient.
    >     >     > > >         > Another aspect of using __atomic_load_n() is that the
    >     >     > > > compiler cannot "optimise" this load (e.g. combine, hoist etc), it
    > has
    >     > to be
    >     >     > > done as
    >     >     > > >         > specified in the source code which is also what we need
    > here.
    >     >     > > >
    >     >     > > >         I think Jerin points that rte_pause() acts here as compiler
    >     > barrier too,
    >     >     > > >         so no need to worry that compiler would optimize out the
    > loop.
    >     >     > > >     [Ola] Sorry missed that. But the barrier behaviour of
    > rte_pause()
    >     >     > > > is not part of C11, is it essentially a hand-made feature to
    > support
    >     >     > > > the legacy multithreaded memory model (which uses explicit
    > HW
    >     > and
    >     >     > > compiler barriers). I'd prefer code using the C11 memory model
    > not to
    >     >     > > depend on such legacy features.
    >     >     > > >
    >     >     > > >
    >     >     > > >
    >     >     > > >         Konstantin
    >     >     > > >
    >     >     > > >         >
    >     >     > > >         > One point worth mentioning though is that this change is
    > for
    >     >     > > > the rte_ring_c11_mem.h file, not the legacy ring. It may be
    > worth
    >     > persisting
    >     >     > > >         > with getting the C11 code right when people are less
    > excited
    >     > about
    >     >     > > sending a release out?
    >     >     > > >         >
    >     >     > > >         > We can explain that for C11 we would prefer to do loads
    > and
    >     > stores
    >     >     > > as per the C11 memory model. In the case of rte_ring, the code is
    >     >     > > >         > separated cleanly into C11 specific files anyway.
    >     >     > > >         >
    >     >     > > >         > I think reading ht->tail using __atomic_load_n() is the
    > most
    >     >     > > appropriate way. We show that ht->tail is used for
    > synchronization,
    >     > we
    >     >     > > >         > acknowledge that ht->tail may be written by other
    > threads
    >     >     > > > without any other kind of synchronization (e.g. no lock involved)
    >     > and we
    >     >     > > require
    >     >     > > >         > an atomic load (any write to ht->tail must also be atomic).
    >     >     > > >         >
    >     >     > > >         > Using volatile and explicit compiler (or processor)
    > memory
    >     > barriers
    >     >     > > (fences) is the legacy pre-C11 way of accomplishing these things.
    >     >     > > > There's
    >     >     > > >         > a reason why C11/C++11 moved away from the old ways.
    >     >     > > >         > > >
    >     >     > > >         > > >         __atomic_store_n(&ht->tail, new_val,
    >     > __ATOMIC_RELEASE);
    >     >     > > >         > > > --
    >     >     > > >         > > > 2.7.4
    >     >     > > >         > > >
    >     >     > > >
    >     >     > > >
    >     >     > > >
    >     >     >
    >     >
    > 
    > 
    
    


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-05 20:34                             ` Ola Liljedahl
@ 2018-10-06  7:41                               ` Jerin Jacob
  2018-10-06 19:44                                 ` Ola Liljedahl
  2018-10-08  5:27                               ` Honnappa Nagarahalli
  1 sibling, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2018-10-06  7:41 UTC (permalink / raw)
  To: Ola Liljedahl
  Cc: Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	dev, Steve Capper, nd, stable

-----Original Message-----
> Date: Fri, 5 Oct 2018 20:34:15 +0000
> From: Ola Liljedahl <Ola.Liljedahl@arm.com>
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>, Jerin Jacob
>  <jerin.jacob@caviumnetworks.com>
> CC: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Gavin Hu (Arm
>  Technology China)" <Gavin.Hu@arm.com>, "dev@dpdk.org" <dev@dpdk.org>,
>  Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>, "stable@dpdk.org"
>  <stable@dpdk.org>
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> user-agent: Microsoft-MacOutlook/10.10.0.180812
> 
> External Email
> 
> On 05/10/2018, 22:29, "Honnappa Nagarahalli" <Honnappa.Nagarahalli@arm.com> wrote:
> 
>     >
>     > I doubt it is possible to benchmark with such a precision so to see the
>     > potential difference of one ADD instruction.
>     > Just changes in function alignment can affect performance by percents. And
>     > the natural variation when not using a 100% deterministic system is going to
>     > be a lot larger than one cycle per ring buffer operation.
>     >
>     > Some of the other patches are also for correctness (e.g. load-acquire of tail)
>     The discussion is about this patch alone. Other patches are already Acked.
> So the benchmarking then makes zero sense.

Why?


> 
> 
>     > so while performance measurements may be interesting, we can't skip a bug
>     > fix just because it proves to decrease performance.
>     IMO, this patch is not a bug fix - in terms of it fixing any failures with the current code.
> It's a fix for correctness. Per the C++11 (and probably C11 as well due to the shared memory model), we have undefined behaviour here. If the compiler detects UB, it is allowed to do anything. Current compilers might not exploit this but future compilers could.

All I am saying is that the code is not the same, and the compiler (the very
latest gcc 8.2) is not smart enough to understand that it is dead code. I think
the moment any __atomic builtin appears, the compiler emits a predefined
template which has the additional "add" instruction. In this specific case,
we ALL know that:
a) ht->tail will be 32 bit for the lifetime of DPDK, and it will be atomic on
all DPDK-supported processors;
b) the rte_pause() down below prevents any compiler reordering anyway.
So why lose one cycle in the worst case? It is easy to lose one cycle and very
difficult to get one back in the fastpath.


> 
> 
> 
>     >
>     > -- Ola
>     >
>     > On 05/10/2018, 22:06, "Honnappa Nagarahalli"
>     > <Honnappa.Nagarahalli@arm.com> wrote:
>     >
>     >     Hi Jerin,
>     >           Thank you for generating the disassembly, that is really helpful. I
>     > agree with you that we have the option of moving parts 2 and 3 forward. I
>     > will let Gavin take a decision.
>     >
>     >     I suggest that we run benchmarks on this patch alone and in combination
>     > with other patches in the series. We have few Arm machines and we will run
>     > on all of them along with x86. We take a decision based on that.
>     >
>     >     Would that be a way to move forward? I think this should address both
>     > your and Ola's concerns.
>     >
>     >     I am open for other suggestions as well.
>     >
>     >     Thank you,
>     >     Honnappa
>     >
>     >     >
>     >     > So you don't want to write the proper C11 code because the compiler
>     >     > generates one extra instruction that way?
>     >     > You don't even know if that one extra instruction has any measurable
>     >     > impact on performance. E.g. it could be issued the cycle before together
>     >     > with other instructions.
>     >     >
>     >     > We can complain to the compiler writers that the code generation for
>     >     > __atomic_load_n(, __ATOMIC_RELAXED) is not optimal (at least on
>     >     > ARM/A64). I think the problem is that the __atomic builtins only accept
>     > a
>     >     > base address without any offset and this is possibly because e.g.
>     > load/store
>     >     > exclusive (LDX/STX) and load-acquire (LDAR) and store-release (STLR)
>     > only
>     >     > accept a base register with no offset. So any offset has to be added
>     > before
>     >     > the actual "atomic" instruction, LDR in this case.
>     >     >
>     >     >
>     >     > -- Ola
>     >     >
>     >     >
>     >     > On 05/10/2018, 19:07, "Jerin Jacob" <jerin.jacob@caviumnetworks.com>
>     >     > wrote:
>     >     >
>     >     >     -----Original Message-----
>     >     >     > Date: Fri, 5 Oct 2018 15:11:44 +0000
>     >     >     > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
>     >     >     > To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, Ola
>     >     > Liljedahl
>     >     >     >  <Ola.Liljedahl@arm.com>, "Gavin Hu (Arm Technology China)"
>     >     >     >  <Gavin.Hu@arm.com>, Jerin Jacob
>     > <jerin.jacob@caviumnetworks.com>
>     >     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Steve Capper
>     >     > <Steve.Capper@arm.com>, nd
>     >     >     >  <nd@arm.com>, "stable@dpdk.org" <stable@dpdk.org>
>     >     >     > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
>     >     >     >
>     >     >     > > >         > Hi Jerin,
>     >     >     > > >         >
>     >     >     > > >         > Thanks for your review, inline comments from our
>     > internal
>     >     >     > > discussions.
>     >     >     > > >         >
>     >     >     > > >         > BR. Gavin
>     >     >     > > >         >
>     >     >     > > >         > > -----Original Message-----
>     >     >     > > >         > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     >     >     > > >         > > Sent: Saturday, September 29, 2018 6:49 PM
>     >     >     > > >         > > To: Gavin Hu (Arm Technology China)
>     > <Gavin.Hu@arm.com>
>     >     >     > > >         > > Cc: dev@dpdk.org; Honnappa Nagarahalli
>     >     >     > > >         > > <Honnappa.Nagarahalli@arm.com>; Steve Capper
>     >     >     > > >         > > <Steve.Capper@arm.com>; Ola Liljedahl
>     >     > <Ola.Liljedahl@arm.com>;
>     >     >     > > nd
>     >     >     > > >         > > <nd@arm.com>; stable@dpdk.org
>     >     >     > > >         > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic
>     > load
>     >     >     > > >         > >
>     >     >     > > >         > > -----Original Message-----
>     >     >     > > >         > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
>     >     >     > > >         > > > From: Gavin Hu <gavin.hu@arm.com>
>     >     >     > > >         > > > To: dev@dpdk.org
>     >     >     > > >         > > > CC: gavin.hu@arm.com,
>     > Honnappa.Nagarahalli@arm.com,
>     >     >     > > >         > > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
>     >     >     > > >         > > > jerin.jacob@caviumnetworks.com, nd@arm.com,
>     >     >     > > stable@dpdk.org
>     >     >     > > >         > > > Subject: [PATCH v3 1/3] ring: read tail using atomic
>     > load
>     >     >     > > >         > > > X-Mailer: git-send-email 2.7.4
>     >     >     > > >         > > >
>     >     >     > > >         > > > External Email
>     >     >     > > >         > > >
>     >     >     > > >         > > > In update_tail, read ht->tail using
>     >     > __atomic_load.Although the
>     >     >     > > >         > > > compiler currently seems to be doing the right thing
>     > even
>     >     > without
>     >     >     > > >         > > > _atomic_load, we don't want to give the compiler
>     >     > freedom to
>     >     >     > > optimise
>     >     >     > > >         > > > what should be an atomic load, it should not be
>     > arbitarily
>     >     > moved
>     >     >     > > >         > > > around.
>     >     >     > > >         > > >
>     >     >     > > >         > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model
>     >     > barrier
>     >     >     > > option")
>     >     >     > > >         > > > Cc: stable@dpdk.org
>     >     >     > > >         > > >
>     >     >     > > >         > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
>     >     >     > > >         > > > Reviewed-by: Honnappa Nagarahalli
>     >     >     > > <Honnappa.Nagarahalli@arm.com>
>     >     >     > > >         > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
>     >     >     > > >         > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     >     >     > > >         > > > ---
>     >     >     > > >         > > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
>     >     >     > > >         > > >  1 file changed, 2 insertions(+), 1 deletion(-)
>     >     >     > > >         > > >
>     >     >     > > >         > The read of ht->tail needs to be atomic, a non-atomic
>     > read
>     >     > would not
>     >     >     > > be correct.
>     >     >     > > >
>     >     >     > > >         That's a 32bit value load.
>     >     >     > > >         AFAIK on all CPUs that we support it is an atomic operation.
>     >     >     > > >     [Ola] But that the ordinary C load is translated to an atomic
>     > load
>     >     > for the
>     >     >     > > target architecture is incidental.
>     >     >     > > >
>     >     >     > > >     If the design requires an atomic load (which is the case here),
>     > we
>     >     >     > > > should use an atomic load on the language level. Then we can
>     > be
>     >     > sure it will
>     >     >     > > always be translated to an atomic load for the target in question
>     > or
>     >     >     > > compilation will fail. We don't have to depend on assumptions.
>     >     >     > >
>     >     >     > > We all know that 32bit load/store on cpu we support - are atomic.
>     >     >     > > If it wouldn't be the case - DPDK would be broken in dozen places.
>     >     >     > > So what the point to pretend that "it might be not atomic" if we
>     > do
>     >     > know for
>     >     >     > > sure that it is?
>     >     >     > > I do understand that you want to use atomic_load(relaxed) here
>     > for
>     >     >     > > consistency, and to conform with C11 mem-model and I don't see
>     > any
>     >     > harm in
>     >     >     > > that.
>     >     >     > We can continue to discuss the topic, it is a good discussion. But, as
>     > far
>     >     > this patch is concerned, can I consider this as us having a consensus?
>     > The
>     >     > file rte_ring_c11_mem.h is specifically for C11 memory model and I also
>     > do
>     >     > not see any harm in having code that completely conforms to C11
>     > memory
>     >     > model.
>     >     >
>     >     >     Have you guys checked the output assembly with and without atomic
>     >     > load?
>     >     >     There is an extra "add" instruction with at least the code I have
>     > checked.
>     >     >     I think, compiler is not smart enough to understand it is a dead code
>     > for
>     >     >     arm64.
>     >     >
>     >     >     ➜ [~] $ aarch64-linux-gnu-gcc -v
>     >     >     Using built-in specs.
>     >     >     COLLECT_GCC=aarch64-linux-gnu-gcc
>     >     >     COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-
>     >     > wrapper
>     >     >     Target: aarch64-linux-gnu
>     >     >     Configured with: /build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure
>     >     >     --prefix=/usr --program-prefix=aarch64-linux-gnu-
>     >     >     --with-local-prefix=/usr/aarch64-linux-gnu
>     >     >     --with-sysroot=/usr/aarch64-linux-gnu
>     >     >     --with-build-sysroot=/usr/aarch64-linux-gnu --libdir=/usr/lib
>     >     >     --libexecdir=/usr/lib --target=aarch64-linux-gnu
>     >     >     --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-nls
>     >     >     --enable-languages=c,c++ --enable-shared --enable-threads=posix
>     >     >     --with-system-zlib --with-isl --enable-__cxa_atexit
>     >     >     --disable-libunwind-exceptions --enable-clocale=gnu
>     >     >     --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object
>     >     >     --enable-linker-build-id --enable-lto --enable-plugin
>     >     >     --enable-install-libiberty --with-linker-hash-style=gnu
>     >     >     --enable-gnu-indirect-function --disable-multilib --disable-werror
>     >     >     --enable-checking=release
>     >     >     Thread model: posix
>     >     >     gcc version 8.2.0 (GCC)
>     >     >
>     >     >
>     >     >     # build setup
>     >     >     make -j 8 config T=arm64-armv8a-linuxapp-gcc  CROSS=aarch64-linux-
>     > gnu-
>     >     >     make -j 8 test-build CROSS=aarch64-linux-gnu-
>     >     >
>     >     >     # generate asm
>     >     >     aarch64-linux-gnu-gdb -batch -ex 'file build/app/test ' -ex
>     > 'disassemble /rs
>     >     > bucket_enqueue_single'
>     >     >
>     >     >     I have uploaded generated file for your convenience
>     >     >     with_atomic_load.txt(includes patch 1,2,3)
>     >     >     -----------------------
>     >     >     https://pastebin.com/SQ6w1yRu
>     >     >
>     >     >     without_atomic_load.txt(includes patch 2,3)
>     >     >     -----------------------
>     >     >     https://pastebin.com/BpvnD0CA
>     >     >
>     >     >
>     >     >     without_atomic
>     >     >     -------------
>     >     >     23              if (!single)
>     >     >        0x000000000068d290 <+240>:   85 00 00 35     cbnz    w5, 0x68d2a0
>     >     > <bucket_enqueue_single+256>
>     >     >        0x000000000068d294 <+244>:   82 04 40 b9     ldr     w2, [x4, #4]
>     >     >        0x000000000068d298 <+248>:   5f 00 01 6b     cmp     w2, w1
>     >     >        0x000000000068d29c <+252>:   21 01 00 54     b.ne    0x68d2c0
>     >     > <bucket_enqueue_single+288>  // b.any
>     >     >
>     >     >     24                      while (unlikely(ht->tail != old_val))
>     >     >     25                              rte_pause();
>     >     >
>     >     >
>     >     >     with_atomic
>     >     >     -----------
>     >     >     23              if (!single)
>     >     >        0x000000000068ceb0 <+240>:   00 10 04 91     add     x0, x0, #0x104
>     >     >        0x000000000068ceb4 <+244>:   84 00 00 35     cbnz    w4, 0x68cec4
>     >     > <bucket_enqueue_single+260>
>     >     >        0x000000000068ceb8 <+248>:   02 00 40 b9     ldr     w2, [x0]
>     >     >        0x000000000068cebc <+252>:   3f 00 02 6b     cmp     w1, w2
>     >     >        0x000000000068cec0 <+256>:   01 09 00 54     b.ne    0x68cfe0
>     >     > <bucket_enqueue_single+544>  // b.any
>     >     >
>     >     >     24                      while (unlikely(old_val != __atomic_load_n(&ht->tail,
>     >     > __ATOMIC_RELAXED)))
>     >     >
>     >     >
>     >     >     I don't want to block this series of patches due this patch. Can we
>     > make
>     >     >     re spin one series with 2 and 3 patches. And Wait for patch 1 to
>     > conclude?
>     >     >
>     >     >     Thoughts?
>     >     >
>     >     >
>     >     >
>     >     >
>     >     >     >
>     >     >     > > But argument that we shouldn't assume 32bit load/store ops as
>     >     > atomic
>     >     >     > > sounds a bit flaky to me.
>     >     >     > > Konstantin
>     >     >     > >
>     >     >     > >
>     >     >     > > >
>     >     >     > > >
>     >     >     > > >
>     >     >     > > >         > But there are no memory ordering requirements (with
>     >     >     > > >         > regards to other loads and/or stores by this thread) so
>     >     > relaxed
>     >     >     > > memory order is sufficient.
>     >     >     > > >         > Another aspect of using __atomic_load_n() is that the
>     >     >     > > > compiler cannot "optimise" this load (e.g. combine, hoist etc), it
>     > has
>     >     > to be
>     >     >     > > done as
>     >     >     > > >         > specified in the source code which is also what we need
>     > here.
>     >     >     > > >
>     >     >     > > >         I think Jerin points that rte_pause() acts here as compiler
>     >     > barrier too,
>     >     >     > > >         so no need to worry that compiler would optimize out the
>     > loop.
>     >     >     > > >     [Ola] Sorry missed that. But the barrier behaviour of
>     > rte_pause()
>     >     >     > > > is not part of C11, is it essentially a hand-made feature to
>     > support
>     >     >     > > > the legacy multithreaded memory model (which uses explicit
>     > HW
>     >     > and
>     >     >     > > compiler barriers). I'd prefer code using the C11 memory model
>     > not to
>     >     >     > > depend on such legacy features.
>     >     >     > > >
>     >     >     > > >
>     >     >     > > >
>     >     >     > > >         Konstantin
>     >     >     > > >
>     >     >     > > >         >
>     >     >     > > >         > One point worth mentioning though is that this change is
>     > for
>     >     >     > > > the rte_ring_c11_mem.h file, not the legacy ring. It may be
>     > worth
>     >     > persisting
>     >     >     > > >         > with getting the C11 code right when people are less
>     > excited
>     >     > about
>     >     >     > > sending a release out?
>     >     >     > > >         >
>     >     >     > > >         > We can explain that for C11 we would prefer to do loads
>     > and
>     >     > stores
>     >     >     > > as per the C11 memory model. In the case of rte_ring, the code is
>     >     >     > > >         > separated cleanly into C11 specific files anyway.
>     >     >     > > >         >
>     >     >     > > >         > I think reading ht->tail using __atomic_load_n() is the
>     > most
>     >     >     > > appropriate way. We show that ht->tail is used for
>     > synchronization,
>     >     > we
>     >     >     > > >         > acknowledge that ht->tail may be written by other
>     > threads
>     >     >     > > > without any other kind of synchronization (e.g. no lock involved)
>     >     > and we
>     >     >     > > require
>     >     >     > > >         > an atomic load (any write to ht->tail must also be atomic).
>     >     >     > > >         >
>     >     >     > > >         > Using volatile and explicit compiler (or processor)
>     > memory
>     >     > barriers
>     >     >     > > (fences) is the legacy pre-C11 way of accomplishing these things.
>     >     >     > > > There's
>     >     >     > > >         > a reason why C11/C++11 moved away from the old ways.
>     >     >     > > >         > > >
>     >     >     > > >         > > >         __atomic_store_n(&ht->tail, new_val,
>     >     > __ATOMIC_RELEASE);
>     >     >     > > >         > > > --
>     >     >     > > >         > > > 2.7.4
>     >     >     > > >         > > >
>     >     >     > > >
>     >     >     > > >
>     >     >     > > >
>     >     >     >
>     >     >
>     >
>     >
> 
> 
> 

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-06  7:41                               ` Jerin Jacob
@ 2018-10-06 19:44                                 ` Ola Liljedahl
  2018-10-06 19:59                                   ` Ola Liljedahl
  2018-10-07  4:02                                   ` Jerin Jacob
  0 siblings, 2 replies; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-06 19:44 UTC (permalink / raw)
  To: Jerin Jacob, dev
  Cc: Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable



On 06/10/2018, 09:42, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:

    -----Original Message-----
    > Date: Fri, 5 Oct 2018 20:34:15 +0000
    > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
    > To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>, Jerin Jacob
    >  <jerin.jacob@caviumnetworks.com>
    > CC: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Gavin Hu (Arm
    >  Technology China)" <Gavin.Hu@arm.com>, "dev@dpdk.org" <dev@dpdk.org>,
    >  Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>, "stable@dpdk.org"
    >  <stable@dpdk.org>
    > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    > user-agent: Microsoft-MacOutlook/10.10.0.180812
    > 
    > External Email
    > 
    > On 05/10/2018, 22:29, "Honnappa Nagarahalli" <Honnappa.Nagarahalli@arm.com> wrote:
    > 
    >     >
    >     > I doubt it is possible to benchmark with such a precision so to see the
    >     > potential difference of one ADD instruction.
    >     > Just changes in function alignment can affect performance by percents. And
    >     > the natural variation when not using a 100% deterministic system is going to
    >     > be a lot larger than one cycle per ring buffer operation.
    >     >
    >     > Some of the other patches are also for correctness (e.g. load-acquire of tail)
    >     The discussion is about this patch alone. Other patches are already Acked.
    > So the benchmarking then makes zero sense.
    
    Why ?
Because the noise in benchmarking (due to e.g. non-deterministic systems, potential changes in function and loop alignment) will be much larger than 1 cycle additional overhead per ring buffer operation. What will benchmarking tell us about the performance impact of this change that adds one ALU operation?
    
    
    > 
    > 
    >     > so while performance measurements may be interesting, we can't skip a bug
    >     > fix just because it proves to decrease performance.
    >     IMO, this patch is not a bug fix - in terms of it fixing any failures with the current code.
    > It's a fix for correctness. Per the C++11 (and probably C11 as well due to the shared memory model), we have undefined behaviour here. If the compiler detects UB, it is allowed to do anything. Current compilers might not exploit this but future compilers could.
    
    All I am saying this, The code is not same and compiler(the very latest
    gcc 8.2) is not smart enough understand it is a dead code.
What code is dead? The ADD instruction has a purpose: it is adding an offset (from the ring buffer start to the tail field) to a base pointer. It is merely (most likely) not the optimal code sequence for an ARM processor.

 I think,
    The moment any __builtin_gcc comes the compiler add predefined template
    which has additional "add" instruction.
I suspect the add instruction is because this atomic_load operates on a struct member at a non-zero offset and GCC's "template(s)" for atomic operations don't support register + immediate offset (because basically on AArch64/A64 ISA, all atomic operations except atomic_load(RELAXED) and atomic_store(RELAXED) only support the addressing mode base register without offset). I would be surprised if this minor instance of non-optimal code generation couldn't be corrected in the compiler.

    I think this specific case,
    we ALL know that, 
    a) ht->tail will be 32 bit for life long of DPDK, it will be atomic in
    all DPDK supported processors
    b) The rte_pause() down which never make and compiler reordering etc.
For 32-bit ARM and 64-bit POWER (ppc64), the rte_pause() implementation looks like this (DPDK 18.08-rc0):
static inline void rte_pause(void)
{
}

How does calling this function prevent compiler optimisations e.g. of the loop or of surrounding memory accesses?
Is rte_pause() supposed to behave like some kind of compiler barrier? I can't see anything in the DPDK documentation for rte_pause() that claims this.


    so why to loose one cycle at worst case? It is easy loose one cycle and it very
    difficult to get one back in fastpath.
I suggest you read up on why undefined behaviour is bad in C. Have a chat with Andrew Pinski.

Not depending on the compiler memory barrier in rte_pause() would allow the compiler to make optimisations (e.g. schedule loads earlier) that actually increase performance. Since the atomic load of ht->tail here has relaxed MO, the compiler is allowed hoist later loads (and stores) ahead of it (and also push down and/or merge with stores after the load of ht->tail). But the compiler (memory) barrier in (that supposedly is part of) rte_pause() prevents such optimisations (well some accesses could be pushed down between the atomic load and the compiler barrier but not further than that).

A C compiler that supports C11 and beyond implements the C11 memory model. The compiler understands the memory model and can optimise memory accesses according to the semantics of the model and the ordering directives in the code. Atomic operations using ATOMIC_SEQ_CST, ATOMIC_ACQUIRE, ATOMIC_RELEASE (and ATOMIC_ACQ_REL, ignoring ATOMIC_CONSUME here) each allow and disallow certain kinds of movements and other optimisations of memory accesses (loads and stores, I assume prefetches are also included). Atomic operations with ATOMIC_RELAXED don't impose any ordering constraints so give maximum flexibility to the compiler. Using a compiler memory barrier (e.g. asm volatile ("":::"memory")) is a much more brutal way of constraining the compiler.
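The two styles being compared can be sketched as follows (an illustrative reduction, not the exact DPDK source):

```c
#include <stdint.h>

/* Simplified stand-in for struct rte_ring_headtail. */
struct headtail {
	uint32_t head;
	uint32_t tail;
};

/* C11 style: the load itself is atomic; __ATOMIC_RELAXED imposes no
 * ordering on other accesses, leaving the compiler free to schedule
 * surrounding loads and stores around it. */
static inline uint32_t load_tail_c11(struct headtail *ht)
{
	return __atomic_load_n(&ht->tail, __ATOMIC_RELAXED);
}

/* Legacy style: a plain load followed by a full compiler barrier,
 * which blocks movement of *all* memory accesses across it, not just
 * this one. */
static inline uint32_t load_tail_legacy(struct headtail *ht)
{
	uint32_t v = ht->tail;
	__asm__ __volatile__ ("" ::: "memory");
	return v;
}
```

Both return the same value single-threaded; the difference is only in what reordering freedom the compiler retains.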
    
    
    > 
    > 
    > 
    >     >
    >     > -- Ola
    >     >
    >     > On 05/10/2018, 22:06, "Honnappa Nagarahalli"
    >     > <Honnappa.Nagarahalli@arm.com> wrote:
    >     >
    >     >     Hi Jerin,
    >     >           Thank you for generating the disassembly, that is really helpful. I
    >     > agree with you that we have the option of moving parts 2 and 3 forward. I
    >     > will let Gavin take a decision.
    >     >
    >     >     I suggest that we run benchmarks on this patch alone and in combination
    >     > with other patches in the series. We have few Arm machines and we will run
    >     > on all of them along with x86. We take a decision based on that.
    >     >
    >     >     Would that be a way to move forward? I think this should address both
    >     > your and Ola's concerns.
    >     >
    >     >     I am open for other suggestions as well.
    >     >
    >     >     Thank you,
    >     >     Honnappa
    >     >
    >     >     >
    >     >     > So you don't want to write the proper C11 code because the compiler
    >     >     > generates one extra instruction that way?
    >     >     > You don't even know if that one extra instruction has any measurable
    >     >     > impact on performance. E.g. it could be issued the cycle before together
    >     >     > with other instructions.
    >     >     >
    >     >     > We can complain to the compiler writers that the code generation for
    >     >     > __atomic_load_n(, __ATOMIC_RELAXED) is not optimal (at least on
    >     >     > ARM/A64). I think the problem is that the __atomic builtins only accept
    >     > a
    >     >     > base address without any offset and this is possibly because e.g.
    >     > load/store
    >     >     > exclusive (LDX/STX) and load-acquire (LDAR) and store-release (STLR)
    >     > only
    >     >     > accept a base register with no offset. So any offset has to be added
    >     > before
    >     >     > the actual "atomic" instruction, LDR in this case.
    >     >     >
    >     >     >
    >     >     > -- Ola
    >     >     >
    >     >     >
    >     >     > On 05/10/2018, 19:07, "Jerin Jacob" <jerin.jacob@caviumnetworks.com>
    >     >     > wrote:
    >     >     >
    >     >     >     -----Original Message-----
    >     >     >     > Date: Fri, 5 Oct 2018 15:11:44 +0000
    >     >     >     > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
    >     >     >     > To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, Ola
    >     >     > Liljedahl
    >     >     >     >  <Ola.Liljedahl@arm.com>, "Gavin Hu (Arm Technology China)"
    >     >     >     >  <Gavin.Hu@arm.com>, Jerin Jacob
    >     > <jerin.jacob@caviumnetworks.com>
    >     >     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Steve Capper
    >     >     > <Steve.Capper@arm.com>, nd
    >     >     >     >  <nd@arm.com>, "stable@dpdk.org" <stable@dpdk.org>
    >     >     >     > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
    >     >     >     >
    >     >     >     > > >         > Hi Jerin,
    >     >     >     > > >         >
    >     >     >     > > >         > Thanks for your review, inline comments from our
    >     > internal
    >     >     >     > > discussions.
    >     >     >     > > >         >
    >     >     >     > > >         > BR. Gavin
    >     >     >     > > >         >
    >     >     >     > > >         > > -----Original Message-----
    >     >     >     > > >         > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    >     >     >     > > >         > > Sent: Saturday, September 29, 2018 6:49 PM
    >     >     >     > > >         > > To: Gavin Hu (Arm Technology China)
    >     > <Gavin.Hu@arm.com>
    >     >     >     > > >         > > Cc: dev@dpdk.org; Honnappa Nagarahalli
    >     >     >     > > >         > > <Honnappa.Nagarahalli@arm.com>; Steve Capper
    >     >     >     > > >         > > <Steve.Capper@arm.com>; Ola Liljedahl
    >     >     > <Ola.Liljedahl@arm.com>;
    >     >     >     > > nd
    >     >     >     > > >         > > <nd@arm.com>; stable@dpdk.org
    >     >     >     > > >         > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic
    >     > load
    >     >     >     > > >         > >
    >     >     >     > > >         > > -----Original Message-----
    >     >     >     > > >         > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
    >     >     >     > > >         > > > From: Gavin Hu <gavin.hu@arm.com>
    >     >     >     > > >         > > > To: dev@dpdk.org
    >     >     >     > > >         > > > CC: gavin.hu@arm.com,
    >     > Honnappa.Nagarahalli@arm.com,
    >     >     >     > > >         > > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
    >     >     >     > > >         > > > jerin.jacob@caviumnetworks.com, nd@arm.com,
    >     >     >     > > stable@dpdk.org
    >     >     >     > > >         > > > Subject: [PATCH v3 1/3] ring: read tail using atomic
    >     > load
    >     >     >     > > >         > > > X-Mailer: git-send-email 2.7.4
    >     >     >     > > >         > > >
    >     >     >     > > >         > > > External Email
    >     >     >     > > >         > > >
    >     >     >     > > >         > > > In update_tail, read ht->tail using
    >     >     > __atomic_load.Although the
    >     >     >     > > >         > > > compiler currently seems to be doing the right thing
    >     > even
    >     >     > without
    >     >     >     > > >         > > > _atomic_load, we don't want to give the compiler
    >     >     > freedom to
    >     >     >     > > optimise
    >     >     >     > > >         > > > what should be an atomic load, it should not be
    >     > arbitarily
    >     >     > moved
    >     >     >     > > >         > > > around.
    >     >     >     > > >         > > >
    >     >     >     > > >         > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model
    >     >     > barrier
    >     >     >     > > option")
    >     >     >     > > >         > > > Cc: stable@dpdk.org
    >     >     >     > > >         > > >
    >     >     >     > > >         > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
    >     >     >     > > >         > > > Reviewed-by: Honnappa Nagarahalli
    >     >     >     > > <Honnappa.Nagarahalli@arm.com>
    >     >     >     > > >         > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
    >     >     >     > > >         > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
    >     >     >     > > >         > > > ---
    >     >     >     > > >         > > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
    >     >     >     > > >         > > >  1 file changed, 2 insertions(+), 1 deletion(-)
    >     >     >     > > >         > > >
    >     >     >     > > >         > The read of ht->tail needs to be atomic, a non-atomic
    >     > read
    >     >     > would not
    >     >     >     > > be correct.
    >     >     >     > > >
    >     >     >     > > >         That's a 32bit value load.
    >     >     >     > > >         AFAIK on all CPUs that we support it is an atomic operation.
    >     >     >     > > >     [Ola] But that the ordinary C load is translated to an atomic
    >     > load
    >     >     > for the
    >     >     >     > > target architecture is incidental.
    >     >     >     > > >
    >     >     >     > > >     If the design requires an atomic load (which is the case here),
    >     > we
    >     >     >     > > > should use an atomic load on the language level. Then we can
    >     > be
    >     >     > sure it will
    >     >     >     > > always be translated to an atomic load for the target in question
    >     > or
    >     >     >     > > compilation will fail. We don't have to depend on assumptions.
    >     >     >     > >
    >     >     >     > > We all know that 32bit load/store on cpu we support - are atomic.
    >     >     >     > > If it wouldn't be the case - DPDK would be broken in dozen places.
    >     >     >     > > So what the point to pretend that "it might be not atomic" if we
    >     > do
    >     >     > know for
    >     >     >     > > sure that it is?
    >     >     >     > > I do understand that you want to use atomic_load(relaxed) here
    >     > for
    >     >     >     > > consistency, and to conform with C11 mem-model and I don't see
    >     > any
    >     >     > harm in
    >     >     >     > > that.
    >     >     >     > We can continue to discuss the topic, it is a good discussion. But, as
    >     > far
    >     >     > this patch is concerned, can I consider this as us having a consensus?
    >     > The
    >     >     > file rte_ring_c11_mem.h is specifically for C11 memory model and I also
    >     > do
    >     >     > not see any harm in having code that completely conforms to C11
    >     > memory
    >     >     > model.
    >     >     >
    >     >     >     Have you guys checked the output assembly with and without atomic
    >     >     > load?
    >     >     >     There is an extra "add" instruction with at least the code I have
    >     > checked.
    >     >     >     I think, compiler is not smart enough to understand it is a dead code
    >     > for
    >     >     >     arm64.
    >     >     >
    >     >     >     ➜ [~] $ aarch64-linux-gnu-gcc -v
    >     >     >     Using built-in specs.
    >     >     >     COLLECT_GCC=aarch64-linux-gnu-gcc
    >     >     >     COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-
    >     >     > wrapper
    >     >     >     Target: aarch64-linux-gnu
    >     >     >     Configured with: /build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure
    >     >     >     --prefix=/usr --program-prefix=aarch64-linux-gnu-
    >     >     >     --with-local-prefix=/usr/aarch64-linux-gnu
    >     >     >     --with-sysroot=/usr/aarch64-linux-gnu
    >     >     >     --with-build-sysroot=/usr/aarch64-linux-gnu --libdir=/usr/lib
    >     >     >     --libexecdir=/usr/lib --target=aarch64-linux-gnu
    >     >     >     --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-nls
    >     >     >     --enable-languages=c,c++ --enable-shared --enable-threads=posix
    >     >     >     --with-system-zlib --with-isl --enable-__cxa_atexit
    >     >     >     --disable-libunwind-exceptions --enable-clocale=gnu
    >     >     >     --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object
    >     >     >     --enable-linker-build-id --enable-lto --enable-plugin
    >     >     >     --enable-install-libiberty --with-linker-hash-style=gnu
    >     >     >     --enable-gnu-indirect-function --disable-multilib --disable-werror
    >     >     >     --enable-checking=release
    >     >     >     Thread model: posix
    >     >     >     gcc version 8.2.0 (GCC)
    >     >     >
    >     >     >
    >     >     >     # build setup
    >     >     >     make -j 8 config T=arm64-armv8a-linuxapp-gcc  CROSS=aarch64-linux-
    >     > gnu-
    >     >     >     make -j 8 test-build CROSS=aarch64-linux-gnu-
    >     >     >
    >     >     >     # generate asm
    >     >     >     aarch64-linux-gnu-gdb -batch -ex 'file build/app/test ' -ex
    >     > 'disassemble /rs
    >     >     > bucket_enqueue_single'
    >     >     >
    >     >     >     I have uploaded generated file for your convenience
    >     >     >     with_atomic_load.txt(includes patch 1,2,3)
    >     >     >     -----------------------
    >     >     >     https://pastebin.com/SQ6w1yRu
    >     >     >
    >     >     >     without_atomic_load.txt(includes patch 2,3)
    >     >     >     -----------------------
    >     >     >     https://pastebin.com/BpvnD0CA
    >     >     >
    >     >     >
    >     >     >     without_atomic
    >     >     >     -------------
    >     >     >     23              if (!single)
    >     >     >        0x000000000068d290 <+240>:   85 00 00 35     cbnz    w5, 0x68d2a0
    >     >     > <bucket_enqueue_single+256>
    >     >     >        0x000000000068d294 <+244>:   82 04 40 b9     ldr     w2, [x4, #4]
    >     >     >        0x000000000068d298 <+248>:   5f 00 01 6b     cmp     w2, w1
    >     >     >        0x000000000068d29c <+252>:   21 01 00 54     b.ne    0x68d2c0
    >     >     > <bucket_enqueue_single+288>  // b.any
    >     >     >
    >     >     >     24                      while (unlikely(ht->tail != old_val))
    >     >     >     25                              rte_pause();
    >     >     >
    >     >     >
    >     >     >     with_atomic
    >     >     >     -----------
    >     >     >     23              if (!single)
    >     >     >        0x000000000068ceb0 <+240>:   00 10 04 91     add     x0, x0, #0x104
    >     >     >        0x000000000068ceb4 <+244>:   84 00 00 35     cbnz    w4, 0x68cec4
    >     >     > <bucket_enqueue_single+260>
    >     >     >        0x000000000068ceb8 <+248>:   02 00 40 b9     ldr     w2, [x0]
    >     >     >        0x000000000068cebc <+252>:   3f 00 02 6b     cmp     w1, w2
    >     >     >        0x000000000068cec0 <+256>:   01 09 00 54     b.ne    0x68cfe0
    >     >     > <bucket_enqueue_single+544>  // b.any
    >     >     >
    >     >     >     24                      while (unlikely(old_val != __atomic_load_n(&ht->tail,
    >     >     > __ATOMIC_RELAXED)))
    >     >     >
    >     >     >
    >     >     >     I don't want to block this series of patches due this patch. Can we
    >     > make
    >     >     >     re spin one series with 2 and 3 patches. And Wait for patch 1 to
    >     > conclude?
    >     >     >
    >     >     >     Thoughts?
    >     >     >
    >     >     >
    >     >     >
    >     >     >
    >     >     >     >
    >     >     >     > > But argument that we shouldn't assume 32bit load/store ops as
    >     >     > atomic
    >     >     >     > > sounds a bit flaky to me.
    >     >     >     > > Konstantin
    >     >     >     > >
    >     >     >     > >
    >     >     >     > > >
    >     >     >     > > >
    >     >     >     > > >
    >     >     >     > > >         > But there are no memory ordering requirements (with
    >     >     >     > > >         > regards to other loads and/or stores by this thread) so
    >     >     > relaxed
    >     >     >     > > memory order is sufficient.
    >     >     >     > > >         > Another aspect of using __atomic_load_n() is that the
    >     >     >     > > > compiler cannot "optimise" this load (e.g. combine, hoist etc), it
    >     > has
    >     >     > to be
    >     >     >     > > done as
    >     >     >     > > >         > specified in the source code which is also what we need
    >     > here.
    >     >     >     > > >
    >     >     >     > > >         I think Jerin points that rte_pause() acts here as compiler
    >     >     > barrier too,
    >     >     >     > > >         so no need to worry that compiler would optimize out the
    >     > loop.
    >     >     >     > > >     [Ola] Sorry missed that. But the barrier behaviour of
    >     > rte_pause()
    >     >     >     > > > is not part of C11, is it essentially a hand-made feature to
    >     > support
    >     >     >     > > > the legacy multithreaded memory model (which uses explicit
    >     > HW
    >     >     > and
    >     >     >     > > compiler barriers). I'd prefer code using the C11 memory model
    >     > not to
    >     >     >     > > depend on such legacy features.
    >     >     >     > > >
    >     >     >     > > >
    >     >     >     > > >
    >     >     >     > > >         Konstantin
    >     >     >     > > >
    >     >     >     > > >         >
    >     >     >     > > >         > One point worth mentioning though is that this change is
    >     > for
    >     >     >     > > > the rte_ring_c11_mem.h file, not the legacy ring. It may be
    >     > worth
    >     >     > persisting
    >     >     >     > > >         > with getting the C11 code right when people are less
    >     > excited
    >     >     > about
    >     >     >     > > sending a release out?
    >     >     >     > > >         >
    >     >     >     > > >         > We can explain that for C11 we would prefer to do loads
    >     > and
    >     >     > stores
    >     >     >     > > as per the C11 memory model. In the case of rte_ring, the code is
    >     >     >     > > >         > separated cleanly into C11 specific files anyway.
    >     >     >     > > >         >
    >     >     >     > > >         > I think reading ht->tail using __atomic_load_n() is the
    >     > most
    >     >     >     > > appropriate way. We show that ht->tail is used for
    >     > synchronization,
    >     >     > we
    >     >     >     > > >         > acknowledge that ht->tail may be written by other
    >     > threads
    >     >     >     > > > without any other kind of synchronization (e.g. no lock involved)
    >     >     > and we
    >     >     >     > > require
    >     >     >     > > >         > an atomic load (any write to ht->tail must also be atomic).
    >     >     >     > > >         >
    >     >     >     > > >         > Using volatile and explicit compiler (or processor)
    >     > memory
    >     >     > barriers
    >     >     >     > > (fences) is the legacy pre-C11 way of accomplishing these things.
    >     >     >     > > > There's
    >     >     >     > > >         > a reason why C11/C++11 moved away from the old ways.
    >     >     >     > > >         > > >
    >     >     >     > > >         > > >         __atomic_store_n(&ht->tail, new_val,
    >     >     > __ATOMIC_RELEASE);
    >     >     >     > > >         > > > --
    >     >     >     > > >         > > > 2.7.4
    >     >     >     > > >         > > >
    >     >     >     > > >
    >     >     >     > > >
    >     >     >     > > >
    >     >     >     >
    >     >     >
    >     >
    >     >
    > 
    > 
    > 
    


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-06 19:44                                 ` Ola Liljedahl
@ 2018-10-06 19:59                                   ` Ola Liljedahl
  2018-10-07  4:02                                   ` Jerin Jacob
  1 sibling, 0 replies; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-06 19:59 UTC (permalink / raw)
  To: Jerin Jacob, dev
  Cc: Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable

Some blogs posts about undefined behaviour in C/C++:
https://blog.regehr.org/archives/213
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html

-- Ola
 


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-06 19:44                                 ` Ola Liljedahl
  2018-10-06 19:59                                   ` Ola Liljedahl
@ 2018-10-07  4:02                                   ` Jerin Jacob
  2018-10-07 20:11                                     ` Ola Liljedahl
  2018-10-07 20:44                                     ` Ola Liljedahl
  1 sibling, 2 replies; 131+ messages in thread
From: Jerin Jacob @ 2018-10-07  4:02 UTC (permalink / raw)
  To: Ola Liljedahl
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable

-----Original Message-----
> Date: Sat, 6 Oct 2018 19:44:35 +0000
> From: Ola Liljedahl <Ola.Liljedahl@arm.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>, "dev@dpdk.org"
>  <dev@dpdk.org>
> CC: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>, "Ananyev,
>  Konstantin" <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology
>  China)" <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd
>  <nd@arm.com>, "stable@dpdk.org" <stable@dpdk.org>
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> user-agent: Microsoft-MacOutlook/10.10.0.180812
> 
> 
> On 06/10/2018, 09:42, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
> 
>     -----Original Message-----
>     > Date: Fri, 5 Oct 2018 20:34:15 +0000
>     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     > To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>, Jerin Jacob
>     >  <jerin.jacob@caviumnetworks.com>
>     > CC: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Gavin Hu (Arm
>     >  Technology China)" <Gavin.Hu@arm.com>, "dev@dpdk.org" <dev@dpdk.org>,
>     >  Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>, "stable@dpdk.org"
>     >  <stable@dpdk.org>
>     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     > user-agent: Microsoft-MacOutlook/10.10.0.180812
>     >
>     > External Email
>     >
>     > On 05/10/2018, 22:29, "Honnappa Nagarahalli" <Honnappa.Nagarahalli@arm.com> wrote:
>     >
>     >     >
>     >     > I doubt it is possible to benchmark with such a precision so to see the
>     >     > potential difference of one ADD instruction.
>     >     > Just changes in function alignment can affect performance by percents. And
>     >     > the natural variation when not using a 100% deterministic system is going to
>     >     > be a lot larger than one cycle per ring buffer operation.
>     >     >
>     >     > Some of the other patches are also for correctness (e.g. load-acquire of tail)
>     >     The discussion is about this patch alone. Other patches are already Acked.
>     > So the benchmarking then makes zero sense.
> 
>     Why ?
> Because the noise in benchmarking (due to e.g. non-deterministic systems, potential changes in function and loop alignment) will be much larger than 1 cycle additional overhead per ring buffer operation. What will benchmarking tell us about the performance impact of this change that adds one ALU operation?

Yes, there will be noise in benchmarking. That is the reason why I checked
only the generated assembly code for this patch, and found the LDR vs
LDR + ADD case. How much overhead that adds depends entirely on the
microarchitecture: how many issue slots it has, constraints on executing
LD/ST on a specific issue port, etc.

In any case, a lone LDR will be no worse than LDR + ADD on any
microarchitecture.
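The difference is easy to reproduce in isolation. The following minimal pair (an assumption about what the disassembly above reduces to, with the expected aarch64 GCC 8.2 -O2 code noted in comments) shows why the atomic builtin costs an ADD when the field sits at a non-zero offset:

```c
#include <stdint.h>

struct headtail {
	uint32_t head;
	uint32_t tail;	/* at offset 4 from the base pointer */
};

/* Plain load: GCC folds the offset into the addressing mode,
 * e.g.  ldr w0, [x0, #4]  */
uint32_t tail_plain(const struct headtail *ht)
{
	return ht->tail;
}

/* Relaxed atomic load: the GCC releases discussed here materialise the
 * address first, e.g.  add x0, x0, #4 ; ldr w0, [x0]  -- even though a
 * relaxed LDR would permit the offset form. */
uint32_t tail_atomic(const struct headtail *ht)
{
	return __atomic_load_n(&ht->tail, __ATOMIC_RELAXED);
}
```

Both functions return the same value; only the generated instruction sequence differs, and that is a compiler code-generation matter rather than a semantic one.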
> 
> 
>     >
>     >
>     >     > so while performance measurements may be interesting, we can't skip a bug
>     >     > fix just because it proves to decrease performance.
>     >     IMO, this patch is not a bug fix - in terms of it fixing any failures with the current code.
>     > It's a fix for correctness. Per the C++11 (and probably C11 as well due to the shared memory model), we have undefined behaviour here. If the compiler detects UB, it is allowed to do anything. Current compilers might not exploit this but future compilers could.
> 
>     All I am saying this, The code is not same and compiler(the very latest
>     gcc 8.2) is not smart enough understand it is a dead code.
> What code is dead? The ADD instruction has a purpose, it is adding an offset (from ring buffer start to the tail field) to a base pointer. It's is merely (most likely) not the optimal code sequence for any ARM processor.
> 
>  I think,
>     The moment any __builtin_gcc comes the compiler add predefined template
>     which has additional "add" instruction.
> I suspect the add instruction is because this atomic_load operates on a struct member at a non-zero offset and GCC's "template(s)" for atomic operations don't support register + immediate offset (because basically on AArch64/A64 ISA, all atomic operations except atomic_load(RELAXED) and atomic_store(RELAXED) only support the addressing mode base register without offset). I would be surprised if this minor instance of non-optimal code generation couldn't be corrected in the compiler.
> 
>     I think this specific case,
>     we ALL know that,
>     a) ht->tail will be 32 bit for life long of DPDK, it will be atomic in
>     all DPDK supported processors
>     b) The rte_pause() down which never make and compiler reordering etc.
> For 32-bit ARM and 64-bit POWER (ppc64), the rte_pause() implementation looks like this (DPDK 18.08-rc0):
> static inline void rte_pause(void)
> {
> }
> 
> How does calling this function prevent compiler optimisations e.g. of the loop or of surrounding memory accesses?
> Is rte_pause() supposed to behave like some kind of compiler barrier? I can't see anything in the DPDK documentation for rte_pause() that claims this.

How about fixing rte_pause() then, i.e. issuing power-saving
instructions on the architectures where it is currently missing?

> 
> 
>     so why to loose one cycle at worst case? It is easy loose one cycle and it very
>     difficult to get one back in fastpath.
> I suggest you read up on why undefined behaviour is bad in C. Have a chat with Andrew Pinski.
> 
> Not depending on the compiler memory barrier in rte_pause() would allow the compiler to make optimisations (e.g. schedule loads earlier) that actually increase performance. Since the atomic load of ht->tail here has relaxed MO, the compiler is allowed hoist later loads (and stores) ahead of it (and also push down and/or merge with stores after the load of ht->tail). But the compiler (memory) barrier in (that supposedly is part of) rte_pause() prevents such optimisations (well some accesses could be pushed down between the atomic load and the compiler barrier but not further than that).
> 
> A C compiler that supports C11 and beyond implements the C11 memory model. The compiler understands the memory model and can optimise memory accesses according to the semantics of the model and the ordering directives in the code. Atomic operations using ATOMIC_SEQ_CST, ATOMIC_ACQUIRE, ATOMIC_RELEASE (and ATOMIC_ACQ_REL, ignoring ATOMIC_CONSUME here) each allow and disallow certain kinds of movements and other optimisations of memory accesses (loads and stores, I assume prefetches are also included). Atomic operations with ATOMIC_RELAXED don't impose any ordering constraints so give maximum flexibility to the compiler. Using a compiler memory barrier (e.g. asm volatile ("":::"memory")) is a much more brutal way of constraining the compiler.

In the arm64 case, we then have the __ATOMIC_RELAXED load followed by the asm volatile ("":::"memory") of rte_pause().
I wouldn't have any issue if the generated code were the same as or better than the existing case, but that is not so, right?


> 
> 
>     >
>     >
>     >
>     >     >
>     >     > -- Ola
>     >     >
>     >     > On 05/10/2018, 22:06, "Honnappa Nagarahalli"
>     >     > <Honnappa.Nagarahalli@arm.com> wrote:
>     >     >
>     >     >     Hi Jerin,
>     >     >           Thank you for generating the disassembly, that is really helpful. I
>     >     > agree with you that we have the option of moving parts 2 and 3 forward. I
>     >     > will let Gavin take a decision.
>     >     >
>     >     >     I suggest that we run benchmarks on this patch alone and in combination
>     >     > with other patches in the series. We have few Arm machines and we will run
>     >     > on all of them along with x86. We take a decision based on that.
>     >     >
>     >     >     Would that be a way to move forward? I think this should address both
>     >     > your and Ola's concerns.
>     >     >
>     >     >     I am open for other suggestions as well.
>     >     >
>     >     >     Thank you,
>     >     >     Honnappa
>     >     >
>     >     >     >
>     >     >     > So you don't want to write the proper C11 code because the compiler
>     >     >     > generates one extra instruction that way?
>     >     >     > You don't even know if that one extra instruction has any measurable
>     >     >     > impact on performance. E.g. it could be issued the cycle before together
>     >     >     > with other instructions.
>     >     >     >
>     >     >     > We can complain to the compiler writers that the code generation for
>     >     >     > __atomic_load_n(, __ATOMIC_RELAXED) is not optimal (at least on
>     >     >     > ARM/A64). I think the problem is that the __atomic builtins only accept
>     >     > a
>     >     >     > base address without any offset and this is possibly because e.g.
>     >     > load/store
>     >     >     > exclusive (LDX/STX) and load-acquire (LDAR) and store-release (STLR)
>     >     > only
>     >     >     > accept a base register with no offset. So any offset has to be added
>     >     > before
>     >     >     > the actual "atomic" instruction, LDR in this case.
>     >     >     >
>     >     >     >
>     >     >     > -- Ola
>     >     >     >
>     >     >     >
>     >     >     > On 05/10/2018, 19:07, "Jerin Jacob" <jerin.jacob@caviumnetworks.com>
>     >     >     > wrote:
>     >     >     >
>     >     >     >     -----Original Message-----
>     >     >     >     > Date: Fri, 5 Oct 2018 15:11:44 +0000
>     >     >     >     > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
>     >     >     >     > To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, Ola
>     >     >     > Liljedahl
>     >     >     >     >  <Ola.Liljedahl@arm.com>, "Gavin Hu (Arm Technology China)"
>     >     >     >     >  <Gavin.Hu@arm.com>, Jerin Jacob
>     >     > <jerin.jacob@caviumnetworks.com>
>     >     >     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Steve Capper
>     >     >     > <Steve.Capper@arm.com>, nd
>     >     >     >     >  <nd@arm.com>, "stable@dpdk.org" <stable@dpdk.org>
>     >     >     >     > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
>     >     >     >     >
>     >     >     >     > > >         > Hi Jerin,
>     >     >     >     > > >         >
>     >     >     >     > > >         > Thanks for your review, inline comments from our
>     >     > internal
>     >     >     >     > > discussions.
>     >     >     >     > > >         >
>     >     >     >     > > >         > BR. Gavin
>     >     >     >     > > >         >
>     >     >     >     > > >         > > -----Original Message-----
>     >     >     >     > > >         > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     >     >     >     > > >         > > Sent: Saturday, September 29, 2018 6:49 PM
>     >     >     >     > > >         > > To: Gavin Hu (Arm Technology China)
>     >     > <Gavin.Hu@arm.com>
>     >     >     >     > > >         > > Cc: dev@dpdk.org; Honnappa Nagarahalli
>     >     >     >     > > >         > > <Honnappa.Nagarahalli@arm.com>; Steve Capper
>     >     >     >     > > >         > > <Steve.Capper@arm.com>; Ola Liljedahl
>     >     >     > <Ola.Liljedahl@arm.com>;
>     >     >     >     > > nd
>     >     >     >     > > >         > > <nd@arm.com>; stable@dpdk.org
>     >     >     >     > > >         > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic
>     >     > load
>     >     >     >     > > >         > >
>     >     >     >     > > >         > > -----Original Message-----
>     >     >     >     > > >         > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
>     >     >     >     > > >         > > > From: Gavin Hu <gavin.hu@arm.com>
>     >     >     >     > > >         > > > To: dev@dpdk.org
>     >     >     >     > > >         > > > CC: gavin.hu@arm.com,
>     >     > Honnappa.Nagarahalli@arm.com,
>     >     >     >     > > >         > > > steve.capper@arm.com,  Ola.Liljedahl@arm.com,
>     >     >     >     > > >         > > > jerin.jacob@caviumnetworks.com, nd@arm.com,
>     >     >     >     > > stable@dpdk.org
>     >     >     >     > > >         > > > Subject: [PATCH v3 1/3] ring: read tail using atomic
>     >     > load
>     >     >     >     > > >         > > > X-Mailer: git-send-email 2.7.4
>     >     >     >     > > >         > > >
>     >     >     >     > > >         > > > External Email
>     >     >     >     > > >         > > >
>     >     >     >     > > >         > > > In update_tail, read ht->tail using
>     >     >     > __atomic_load.Although the
>     >     >     >     > > >         > > > compiler currently seems to be doing the right thing
>     >     > even
>     >     >     > without
>     >     >     >     > > >         > > > _atomic_load, we don't want to give the compiler
>     >     >     > freedom to
>     >     >     >     > > optimise
>     >     >     >     > > >         > > > what should be an atomic load, it should not be
>     >     > arbitarily
>     >     >     > moved
>     >     >     >     > > >         > > > around.
>     >     >     >     > > >         > > >
>     >     >     >     > > >         > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model
>     >     >     > barrier
>     >     >     >     > > option")
>     >     >     >     > > >         > > > Cc: stable@dpdk.org
>     >     >     >     > > >         > > >
>     >     >     >     > > >         > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
>     >     >     >     > > >         > > > Reviewed-by: Honnappa Nagarahalli
>     >     >     >     > > <Honnappa.Nagarahalli@arm.com>
>     >     >     >     > > >         > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
>     >     >     >     > > >         > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     >     >     >     > > >         > > > ---
>     >     >     >     > > >         > > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
>     >     >     >     > > >         > > >  1 file changed, 2 insertions(+), 1 deletion(-)
>     >     >     >     > > >         > > >
>     >     >     >     > > >         > The read of ht->tail needs to be atomic, a non-atomic
>     >     > read
>     >     >     > would not
>     >     >     >     > > be correct.
>     >     >     >     > > >
>     >     >     >     > > >         That's a 32bit value load.
>     >     >     >     > > >         AFAIK on all CPUs that we support it is an atomic operation.
>     >     >     >     > > >     [Ola] But that the ordinary C load is translated to an atomic
>     >     > load
>     >     >     > for the
>     >     >     >     > > target architecture is incidental.
>     >     >     >     > > >
>     >     >     >     > > >     If the design requires an atomic load (which is the case here),
>     >     > we
>     >     >     >     > > > should use an atomic load on the language level. Then we can
>     >     > be
>     >     >     > sure it will
>     >     >     >     > > always be translated to an atomic load for the target in question
>     >     > or
>     >     >     >     > > compilation will fail. We don't have to depend on assumptions.
>     >     >     >     > >
>     >     >     >     > > We all know that 32bit load/store on cpu we support - are atomic.
>     >     >     >     > > If it wouldn't be the case - DPDK would be broken in dozen places.
>     >     >     >     > > So what the point to pretend that "it might be not atomic" if we
>     >     > do
>     >     >     > know for
>     >     >     >     > > sure that it is?
>     >     >     >     > > I do understand that you want to use atomic_load(relaxed) here
>     >     > for
>     >     >     >     > > consistency, and to conform with C11 mem-model and I don't see
>     >     > any
>     >     >     > harm in
>     >     >     >     > > that.
>     >     >     >     > We can continue to discuss the topic, it is a good discussion. But, as
>     >     > far
>     >     >     > this patch is concerned, can I consider this as us having a consensus?
>     >     > The
>     >     >     > file rte_ring_c11_mem.h is specifically for C11 memory model and I also
>     >     > do
>     >     >     > not see any harm in having code that completely conforms to C11
>     >     > memory
>     >     >     > model.
>     >     >     >
>     >     >     >     Have you guys checked the output assembly with and without atomic
>     >     >     > load?
>     >     >     >     There is an extra "add" instruction with at least the code I have
>     >     > checked.
>     >     >     >     I think, compiler is not smart enough to understand it is a dead code
>     >     > for
>     >     >     >     arm64.
>     >     >     >
>     >     >     >     ➜ [~] $ aarch64-linux-gnu-gcc -v
>     >     >     >     Using built-in specs.
>     >     >     >     COLLECT_GCC=aarch64-linux-gnu-gcc
>     >     >     >     COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-
>     >     >     > wrapper
>     >     >     >     Target: aarch64-linux-gnu
>     >     >     >     Configured with: /build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure
>     >     >     >     --prefix=/usr --program-prefix=aarch64-linux-gnu-
>     >     >     >     --with-local-prefix=/usr/aarch64-linux-gnu
>     >     >     >     --with-sysroot=/usr/aarch64-linux-gnu
>     >     >     >     --with-build-sysroot=/usr/aarch64-linux-gnu --libdir=/usr/lib
>     >     >     >     --libexecdir=/usr/lib --target=aarch64-linux-gnu
>     >     >     >     --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-nls
>     >     >     >     --enable-languages=c,c++ --enable-shared --enable-threads=posix
>     >     >     >     --with-system-zlib --with-isl --enable-__cxa_atexit
>     >     >     >     --disable-libunwind-exceptions --enable-clocale=gnu
>     >     >     >     --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object
>     >     >     >     --enable-linker-build-id --enable-lto --enable-plugin
>     >     >     >     --enable-install-libiberty --with-linker-hash-style=gnu
>     >     >     >     --enable-gnu-indirect-function --disable-multilib --disable-werror
>     >     >     >     --enable-checking=release
>     >     >     >     Thread model: posix
>     >     >     >     gcc version 8.2.0 (GCC)
>     >     >     >
>     >     >     >
>     >     >     >     # build setup
>     >     >     >     make -j 8 config T=arm64-armv8a-linuxapp-gcc  CROSS=aarch64-linux-
>     >     > gnu-
>     >     >     >     make -j 8 test-build CROSS=aarch64-linux-gnu-
>     >     >     >
>     >     >     >     # generate asm
>     >     >     >     aarch64-linux-gnu-gdb -batch -ex 'file build/app/test ' -ex
>     >     > 'disassemble /rs
>     >     >     > bucket_enqueue_single'
>     >     >     >
>     >     >     >     I have uploaded generated file for your convenience
>     >     >     >     with_atomic_load.txt(includes patch 1,2,3)
>     >     >     >     -----------------------
>     >     >     >     https://pastebin.com/SQ6w1yRu
>     >     >     >
>     >     >     >     without_atomic_load.txt(includes patch 2,3)
>     >     >     >     -----------------------
>     >     >     >     https://pastebin.com/BpvnD0CA
>     >     >     >
>     >     >     >
>     >     >     >     without_atomic
>     >     >     >     -------------
>     >     >     >     23              if (!single)
>     >     >     >        0x000000000068d290 <+240>:   85 00 00 35     cbnz    w5, 0x68d2a0
>     >     >     > <bucket_enqueue_single+256>
>     >     >     >        0x000000000068d294 <+244>:   82 04 40 b9     ldr     w2, [x4, #4]
>     >     >     >        0x000000000068d298 <+248>:   5f 00 01 6b     cmp     w2, w1
>     >     >     >        0x000000000068d29c <+252>:   21 01 00 54     b.ne    0x68d2c0
>     >     >     > <bucket_enqueue_single+288>  // b.any
>     >     >     >
>     >     >     >     24                      while (unlikely(ht->tail != old_val))
>     >     >     >     25                              rte_pause();
>     >     >     >
>     >     >     >
>     >     >     >     with_atomic
>     >     >     >     -----------
>     >     >     >     23              if (!single)
>     >     >     >        0x000000000068ceb0 <+240>:   00 10 04 91     add     x0, x0, #0x104
>     >     >     >        0x000000000068ceb4 <+244>:   84 00 00 35     cbnz    w4, 0x68cec4
>     >     >     > <bucket_enqueue_single+260>
>     >     >     >        0x000000000068ceb8 <+248>:   02 00 40 b9     ldr     w2, [x0]
>     >     >     >        0x000000000068cebc <+252>:   3f 00 02 6b     cmp     w1, w2
>     >     >     >        0x000000000068cec0 <+256>:   01 09 00 54     b.ne    0x68cfe0
>     >     >     > <bucket_enqueue_single+544>  // b.any
>     >     >     >
>     >     >     >     24                      while (unlikely(old_val != __atomic_load_n(&ht->tail,
>     >     >     > __ATOMIC_RELAXED)))
>     >     >     >
>     >     >     >
>     >     >     >     I don't want to block this series of patches due this patch. Can we
>     >     > make
>     >     >     >     re spin one series with 2 and 3 patches. And Wait for patch 1 to
>     >     > conclude?
>     >     >     >
>     >     >     >     Thoughts?
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >     >     >
>     >     >     >     > > But argument that we shouldn't assume 32bit load/store ops as
>     >     >     > atomic
>     >     >     >     > > sounds a bit flaky to me.
>     >     >     >     > > Konstantin
>     >     >     >     > >
>     >     >     >     > >
>     >     >     >     > > >
>     >     >     >     > > >
>     >     >     >     > > >
>     >     >     >     > > >         > But there are no memory ordering requirements (with
>     >     >     >     > > >         > regards to other loads and/or stores by this thread) so
>     >     >     > relaxed
>     >     >     >     > > memory order is sufficient.
>     >     >     >     > > >         > Another aspect of using __atomic_load_n() is that the
>     >     >     >     > > > compiler cannot "optimise" this load (e.g. combine, hoist etc), it
>     >     > has
>     >     >     > to be
>     >     >     >     > > done as
>     >     >     >     > > >         > specified in the source code which is also what we need
>     >     > here.
>     >     >     >     > > >
>     >     >     >     > > >         I think Jerin points that rte_pause() acts here as compiler
>     >     >     > barrier too,
>     >     >     >     > > >         so no need to worry that compiler would optimize out the
>     >     > loop.
>     >     >     >     > > >     [Ola] Sorry missed that. But the barrier behaviour of
>     >     > rte_pause()
>     >     >     >     > > > is not part of C11, is it essentially a hand-made feature to
>     >     > support
>     >     >     >     > > > the legacy multithreaded memory model (which uses explicit
>     >     > HW
>     >     >     > and
>     >     >     >     > > compiler barriers). I'd prefer code using the C11 memory model
>     >     > not to
>     >     >     >     > > depend on such legacy features.
>     >     >     >     > > >
>     >     >     >     > > >
>     >     >     >     > > >
>     >     >     >     > > >         Konstantin
>     >     >     >     > > >
>     >     >     >     > > >         >
>     >     >     >     > > >         > One point worth mentioning though is that this change is
>     >     > for
>     >     >     >     > > > the rte_ring_c11_mem.h file, not the legacy ring. It may be
>     >     > worth
>     >     >     > persisting
>     >     >     >     > > >         > with getting the C11 code right when people are less
>     >     > excited
>     >     >     > about
>     >     >     >     > > sending a release out?
>     >     >     >     > > >         >
>     >     >     >     > > >         > We can explain that for C11 we would prefer to do loads
>     >     > and
>     >     >     > stores
>     >     >     >     > > as per the C11 memory model. In the case of rte_ring, the code is
>     >     >     >     > > >         > separated cleanly into C11 specific files anyway.
>     >     >     >     > > >         >
>     >     >     >     > > >         > I think reading ht->tail using __atomic_load_n() is the
>     >     > most
>     >     >     >     > > appropriate way. We show that ht->tail is used for
>     >     > synchronization,
>     >     >     > we
>     >     >     >     > > >         > acknowledge that ht->tail may be written by other
>     >     > threads
>     >     >     >     > > > without any other kind of synchronization (e.g. no lock involved)
>     >     >     > and we
>     >     >     >     > > require
>     >     >     >     > > >         > an atomic load (any write to ht->tail must also be atomic).
>     >     >     >     > > >         >
>     >     >     >     > > >         > Using volatile and explicit compiler (or processor)
>     >     > memory
>     >     >     > barriers
>     >     >     >     > > (fences) is the legacy pre-C11 way of accomplishing these things.
>     >     >     >     > > > There's
>     >     >     >     > > >         > a reason why C11/C++11 moved away from the old ways.
>     >     >     >     > > >         > > >
>     >     >     >     > > >         > > >         __atomic_store_n(&ht->tail, new_val,
>     >     >     > __ATOMIC_RELEASE);
>     >     >     >     > > >         > > > --
>     >     >     >     > > >         > > > 2.7.4
>     >     >     >     > > >         > > >
>     >     >     >     > > >
>     >     >     >     > > >
>     >     >     >     > > >
>     >     >     >     >
>     >     >     >
>     >     >
>     >     >
>     >
>     >
>     >
> 
> 

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-07  4:02                                   ` Jerin Jacob
@ 2018-10-07 20:11                                     ` Ola Liljedahl
  2018-10-07 20:44                                     ` Ola Liljedahl
  1 sibling, 0 replies; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-07 20:11 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable

On 07/10/2018, 06:03, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:

    How about fixing rte_pause() then?
    Meaning issuing power saving instructions on missing archs.
rte_pause() implemented as NOP or YIELD on ARM will likely not save any power. You should use WFE for that.
I use this portable pattern:

        //Wait for our turn to signal consumers (producers) 
        if (UNLIKELY(__atomic_load_n(loc, __ATOMIC_RELAXED) != idx))
        {
            SEVL();
            while (WFE() && LDXR32(loc, __ATOMIC_RELAXED) != idx)
            {   
                DOZE();
            }
        }

    
For AArch64 with WFE usage enabled:
#define SEVL() sevl()
#define WFE() wfe()
#define LDXR32(a, b)  ldx32((a), (b))
#define DOZE() (void)0
static inline void sevl(void)
{
    __asm__ volatile("sevl" : : : );
}
static inline int wfe(void)
{
    __asm__ volatile("wfe" : : : "memory");
    return 1;
}

For architectures without WFE support:
#define SEVL() (void)0
#define WFE() 1
#define LDXR32(a, b) __atomic_load_n((a), (b))
#define DOZE() doze()
static inline void doze(void)
{
    __asm__ volatile("rep; nop" : : : );
}

-- Ola



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-07  4:02                                   ` Jerin Jacob
  2018-10-07 20:11                                     ` Ola Liljedahl
@ 2018-10-07 20:44                                     ` Ola Liljedahl
  2018-10-08  6:06                                       ` Jerin Jacob
  1 sibling, 1 reply; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-07 20:44 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable


On 07/10/2018, 06:03, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:

    In arm64 case, it will have ATOMIC_RELAXED followed by asm volatile ("":::"memory") of rte_pause().
    I would n't have any issue, if the generated code code is same or better than the exiting case. but it not the case, Right?
The existing case is actually not interesting (IMO) as it exposes undefined behaviour which allows the compiler to do anything. But you seem to be satisfied with "works for me, right here right now". I think the cost of avoiding undefined behaviour is acceptable (actually I don't think it even will be noticeable).

Skipping the compiler memory barrier in rte_pause() potentially allows for optimisations that provide much more benefit, e.g. hiding some cache miss latency for later loads. The DPDK ring buffer implementation is defined so as to enable inlining of the enqueue/dequeue functions into the caller; any code could immediately follow these calls.

From INTERNATIONAL STANDARD ISO/IEC 9899:201x
Programming languages — C

5.1.2.4
4 Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.

25 The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.

-- Ola

    


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-05 20:34                             ` Ola Liljedahl
  2018-10-06  7:41                               ` Jerin Jacob
@ 2018-10-08  5:27                               ` Honnappa Nagarahalli
  2018-10-08 10:01                                 ` Ola Liljedahl
  1 sibling, 1 reply; 131+ messages in thread
From: Honnappa Nagarahalli @ 2018-10-08  5:27 UTC (permalink / raw)
  To: Ola Liljedahl, Jerin Jacob
  Cc: Ananyev, Konstantin, Gavin Hu (Arm Technology China),
	dev, Steve Capper, nd, stable

>     >
>     > I doubt it is possible to benchmark with such a precision so to see the
>     > potential difference of one ADD instruction.
>     > Just changes in function alignment can affect performance by percents.
> And
>     > the natural variation when not using a 100% deterministic system is going
> to
>     > be a lot larger than one cycle per ring buffer operation.
>     >
>     > Some of the other patches are also for correctness (e.g. load-acquire of
> tail)
>     The discussion is about this patch alone. Other patches are already Acked.
> So the benchmarking then makes zero sense.
The whole point is to prove the effect of one instruction either way. IMO, it is simple enough: follow the memory model to the full extent. We have to keep other architectures in mind as well; maybe that additional instruction is not required on other architectures.

> 
> 

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-07 20:44                                     ` Ola Liljedahl
@ 2018-10-08  6:06                                       ` Jerin Jacob
  2018-10-08  9:22                                         ` Ola Liljedahl
  0 siblings, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2018-10-08  6:06 UTC (permalink / raw)
  To: Ola Liljedahl
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable

-----Original Message-----
> Date: Sun, 7 Oct 2018 20:44:54 +0000
> From: Ola Liljedahl <Ola.Liljedahl@arm.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
>  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
>  "stable@dpdk.org" <stable@dpdk.org>
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> user-agent: Microsoft-MacOutlook/10.11.0.180909
> 


Could you please fix the email client for inline reply.

https://www.kernel.org/doc/html/v4.19-rc7/process/email-clients.html


> 
> On 07/10/2018, 06:03, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
> 
>     In arm64 case, it will have ATOMIC_RELAXED followed by asm volatile ("":::"memory") of rte_pause().
>     I would n't have any issue, if the generated code code is same or better than the exiting case. but it not the case, Right?
> The existing case is actually not interesting (IMO) as it exposes undefined behaviour which allows the compiler to do anything. But you seem to be satisfied with "works for me, right here right now". I think the cost of avoiding undefined behaviour is acceptable (actually I don't think it even will be noticeable).

I am not convinced because of the use of volatile in the head and tail indexes.
For me that brings the defined behavior. That's the reason why I shared
the generated assembly code. If you think otherwise, pick any compiler
and see the generated output.

And

The FreeBSD implementation of the ring buffer (which DPDK derived from) doesn't have
such logic, see https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108

See below too.

> 
> Skipping the compiler memory barrier in rte_pause() potentially allows for optimisations that provide much more benefit, e.g. hiding some cache miss latency for later loads. The DPDK ring buffer implementation is defined so to enable inlining of enqueue/dequeue functions into the caller, any code could immediately follow these calls.
> 
> From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x
> Programming languages — C
> 
> 5.1.2.4
> 4 Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.
> 
> 25 The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.

IMO, both conditions are satisfied if the variable is volatile and a 32-bit read is atomic
on 32b and 64b machines. If not, the problem persists for the generic case
as well (lib/librte_ring/rte_ring_generic.h).


I agree with you on using C11 memory model semantics. The reason why I
proposed naming the file rte_ring_c11_mem.h is that DPDK itself did not
have definitions for load-acquire and store-release semantics.
I was looking to take load-acquire and store-release semantics
from C11 instead of creating a new API like the Linux kernel or FreeBSD (APIs
like atomic_load_acq_32(), atomic_store_rel_32()). If the file name is your
concern then we could create new abstractions as well. That would help
the existing KNI problem as well.

I think it is currently mixed usage because the same variable declaration
is used for C11 vs non-C11 usage. Ideally we won't need "volatile" for the C11
case. Either we need to change only to C11 mode OR have APIs for
atomic_load_acq_() and atomic_store_rel_() to allow both models, like the
Linux kernel and FreeBSD.

> 
> -- Ola
> 
> 
> 

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08  6:06                                       ` Jerin Jacob
@ 2018-10-08  9:22                                         ` Ola Liljedahl
  2018-10-08 10:00                                           ` Jerin Jacob
  2018-10-08 14:43                                           ` Bruce Richardson
  0 siblings, 2 replies; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-08  9:22 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable

On 08/10/2018, 08:06, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:

    -----Original Message-----
    > Date: Sun, 7 Oct 2018 20:44:54 +0000
    > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
    > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
    >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
    >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
    >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
    >  "stable@dpdk.org" <stable@dpdk.org>
    > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    > user-agent: Microsoft-MacOutlook/10.11.0.180909
    > 
    
    
    Could you please fix the email client for inline reply.
Sorry that doesn't seem to be possible with Outlook for Mac 16 or Office365. The official Office365/Outlook
documentation doesn't match the actual user interface...


    
    https://www.kernel.org/doc/html/v4.19-rc7/process/email-clients.html
    
    
    > 
    > On 07/10/2018, 06:03, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
    > 
    >     In arm64 case, it will have ATOMIC_RELAXED followed by asm volatile ("":::"memory") of rte_pause().
    >     I would n't have any issue, if the generated code code is same or better than the exiting case. but it not the case, Right?
    > The existing case is actually not interesting (IMO) as it exposes undefined behaviour which allows the compiler to do anything. But you seem to be satisfied with "works for me, right here right now". I think the cost of avoiding undefined behaviour is acceptable (actually I don't think it even will be noticeable).
    
    I am not convinced because of use of volatile in head and tail indexes.
    For me that brings the defined behavior.
As long as you don't mix in C11 atomic accesses (just use "plain" accesses to volatile objects),
it is AFAIK defined behaviour (but not necessarily using atomic loads and stores). But I quoted
the C11 spec where it explicitly mentions that mixing atomic and non-atomic accesses to the same
object is undefined behaviour. Don't argue with me, argue with the C11 spec.
If you want to disobey the spec, this should at least be called out in the code with a comment.


    That the reason why I shared
    the generated assembly code. If you think other way, Pick any compiler
    and see generated output.
This is what one compiler for one architecture generates today. These things change. Other things
that used to work, or worked for some specific architecture, have stopped working in newer versions of
the compiler.

    
    And
    
    Freebsd implementation of ring buffer(Which DPDK derived from), Don't have
    such logic, See https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
It looks like FreeBSD uses some kind of C11 atomic memory model-inspired API although I don't see
exactly how e.g. atomic_store_rel_int() is implemented. The code also mixes in explicit barriers
so definitely not pure C11 memory model usage. And finally, it doesn't establish the proper
load-acquire/store-release relationships (e.g. store-release cons_tail requires a load-acquire cons_tail,
same for prod_tail).

"* multi-producer safe lock-free ring buffer enqueue"
The comment is also wrong. This design is not lock-free, how could it be when there is spinning
(waiting) for other threads in the code? If a thread must wait for other threads, then by definition
the design is blocking.

So you are saying that because FreeBSD is doing it wrong, DPDK can also do it wrong?

    
    See below too.
    
    > 
    > Skipping the compiler memory barrier in rte_pause() potentially allows for optimisations that provide much more benefit, e.g. hiding some cache miss latency for later loads. The DPDK ring buffer implementation is defined so to enable inlining of enqueue/dequeue functions into the caller, any code could immediately follow these calls.
    > 
    > From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x
    > Programming languages — C
    > 
    > 5.1.2.4
    > 4 Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.
    > 
    > 25 The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
    
    IMO, Both condition will satisfy if the variable is volatile and 32bit read will atomic
    for 32b and 64b machines. If not, the problem persist for generic case
    as well(lib/librte_ring/rte_ring_generic.h)
The read from a volatile object is not an atomic access per the C11 spec. It just happens to
be translated to an instruction (on x86-64 and AArch64/A64) that implements an atomic load.
I don't think any compiler would change this code generation and suddenly generate some
non-atomic load instruction for a program that *only* uses volatile to do "atomic" accesses.
But a future compiler could detect the mix of atomic and non-atomic accesses and mark this
expression as causing undefined behaviour and that would have consequences for code generation.
    
    
    I agree with you on C11 memory model semantics usage. The reason why I
    propose name for the file as rte_ring_c11_mem.h as DPDK it self did not
    had definitions for load acquire and store release semantics.
    I was looking for taking load acquire and store release semantics
    from C11 instead of creating new API like Linux kernel for FreeBSD(APIs
    like  atomic_load_acq_32(), atomic_store_rel_32()). If the file name is your
    concern then we could create new abstractions as well. That would help
    exiting KNI problem as well.
I appreciate your embrace of the C11 memory model. I think it is better for describing
(both to the compiler and to humans) which and how objects are used for synchronisation.

However, I don't think an API as you suggest (and others have suggested before, e.g. as
done in ODP) is a good idea. There is an unbounded number of possible base types, an
increasing number of operations, and a range of different memory orderings; a "complete"
API would be very large and difficult to test, and most members of the API would never be used.

GCC and Clang both support the __atomic intrinsics. This API avoids the problems I
described above. Or we could use the official C11 syntax (stdatomic.h). But then we
have the problem with using pre-C11 compilers...



    
    I think, currently it mixed usage because, the same variable declaration
    used for C11 vs non C11 usage.Ideally we wont need "volatile" for C11
    case. Either we need to change only to C11 mode OR have APIs for 
    atomic_load_acq_() and atomic_store_rel_() to allow both models like
    Linux kernel and FreeBSD.
    
    > 
    > -- Ola
    > 
    > 
    > 
    


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08  9:22                                         ` Ola Liljedahl
@ 2018-10-08 10:00                                           ` Jerin Jacob
  2018-10-08 10:25                                             ` Ola Liljedahl
  2018-10-09  3:16                                             ` Honnappa Nagarahalli
  2018-10-08 14:43                                           ` Bruce Richardson
  1 sibling, 2 replies; 131+ messages in thread
From: Jerin Jacob @ 2018-10-08 10:00 UTC (permalink / raw)
  To: Ola Liljedahl
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable

-----Original Message-----
> Date: Mon, 8 Oct 2018 09:22:05 +0000
> From: Ola Liljedahl <Ola.Liljedahl@arm.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
>  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
>  "stable@dpdk.org" <stable@dpdk.org>
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> user-agent: Microsoft-MacOutlook/10.11.0.180909
> 
> External Email
> 
> On 08/10/2018, 08:06, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
> 
>     -----Original Message-----
>     > Date: Sun, 7 Oct 2018 20:44:54 +0000
>     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
>     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
>     >  "stable@dpdk.org" <stable@dpdk.org>
>     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     >
> 
> 
>     Could you please fix the email client for inline reply.
> Sorry that doesn't seem to be possible with Outlook for Mac 16 or Office365. The official Office365/Outlook
> documentation doesn't match the actual user interface...
> 
> 
> 
>     https://www.kernel.org/doc/html/v4.19-rc7/process/email-clients.html
> 
> 
>     >
>     > On 07/10/2018, 06:03, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
>     >
>     >     In arm64 case, it will have ATOMIC_RELAXED followed by asm volatile ("":::"memory") of rte_pause().
>     >     I would n't have any issue, if the generated code code is same or better than the exiting case. but it not the case, Right?
>     > The existing case is actually not interesting (IMO) as it exposes undefined behaviour which allows the compiler to do anything. But you seem to be satisfied with "works for me, right here right now". I think the cost of avoiding undefined behaviour is acceptable (actually I don't think it even will be noticeable).
> 
>     I am not convinced because of use of volatile in head and tail indexes.
>     For me that brings the defined behavior.
> As long as you don't mix in C11 atomic accesses (just use "plain" accesses to volatile objects),
> it is AFAIK defined behaviour (but not necessarily using atomic loads and stores). But I quoted
> the C11 spec where it explicitly mentions that mixing atomic and non-atomic accesses to the same
> object is undefined behaviour. Don't argue with me, argue with the C11 spec.
> If you want to disobey the spec, this should at least be called out for in the code with a comment.

That boils down to only one question: should we follow the C11 spec? Why not take only
load-acquire and store-release semantics, just like the Linux kernel and FreeBSD?
The C11 memory model does not look super efficient in terms of the gcc
implementation.

> 
> 
>     That the reason why I shared
>     the generated assembly code. If you think other way, Pick any compiler
>     and see generated output.
> This is what one compiler for one architecture generates today. These things change. Other things
> that used to work or worked for some specific architecture has stopped working in newer versions of
> the compiler.
> 
> 
>     And
> 
>     Freebsd implementation of ring buffer(Which DPDK derived from), Don't have
>     such logic, See https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
> It looks like FreeBSD uses some kind of C11 atomic memory model-inspired API although I don't see
> exactly how e.g. atomic_store_rel_int() is implemented. The code also mixes in explicit barriers
> so definitively not pure C11 memory model usage. And finally, it doesn't establish the proper
> load-acquire/store-release relationships (e.g. store-release cons_tail requires a load-acquire cons_tail,
> same for prod_tail).
> 
> "* multi-producer safe lock-free ring buffer enqueue"
> The comment is also wrong. This design is not lock-free, how could it be when there is spinning
> (waiting) for other threads in the code? If a thread must wait for other threads, then by definition
> the design is blocking.
> 
> So you are saying that because FreeBSD is doing it wrong, DPDK can also do it wrong?
> 
> 
>     See below too.
> 
>     >
>     > Skipping the compiler memory barrier in rte_pause() potentially allows for optimisations that provide much more benefit, e.g. hiding some cache miss latency for later loads. The DPDK ring buffer implementation is defined so to enable inlining of enqueue/dequeue functions into the caller, any code could immediately follow these calls.
>     >
>     > From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x
>     > Programming languages — C
>     >
>     > 5.1.2.4
>     > 4 Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.
>     >
>     > 25 The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
> 
>     IMO, Both condition will satisfy if the variable is volatile and 32bit read will atomic
>     for 32b and 64b machines. If not, the problem persist for generic case
>     as well(lib/librte_ring/rte_ring_generic.h)
> The read from a volatile object is not an atomic access per the C11 spec. It just happens to
> be translated to an instruction (on x86-64 and AArch64/A64) that implements an atomic load.
> I don't think any compiler would change this code generation and suddenly generate some
> non-atomic load instruction for a program that *only* uses volatile to do "atomic" accesses.
> But a future compiler could detect the mix of atomic and non-atomic accesses and mark this
> expression as causing undefined behaviour and that would have consequences for code generation.
> 
> 
>     I agree with you on C11 memory model semantics usage. The reason why I
>     propose name for the file as rte_ring_c11_mem.h as DPDK it self did not
>     had definitions for load acquire and store release semantics.
>     I was looking for taking load acquire and store release semantics
>     from C11 instead of creating new API like Linux kernel for FreeBSD(APIs
>     like  atomic_load_acq_32(), atomic_store_rel_32()). If the file name is your
>     concern then we could create new abstractions as well. That would help
>     exiting KNI problem as well.
> I appreciate your embrace of the C11 memory model. I think it is better for describing
> (both to the compiler and to humans) which and how objects are used for synchronisation.
> 
> However, I don't think an API as you suggest (and others have suggested before, e.g. as
> done in ODP) is a good idea. There is an infinite amount of possible base types, an
> increasing number of operations and a bunch of different memory orderings, a "complete"
> API would be very large and difficult to test, and most members of the API would never be used.
> GCC and Clang both support the __atomic intrinsics. This API avoids the problems I
> described above. Or we could use the official C11 syntax (stdatomic.h). But then we
> have the problem with using pre-C11 compilers...

I have no objection if everyone agrees to move to the C11 memory model
with __atomic intrinsics. But if we need to keep both, then an
atomic_load_acq_32() kind of API makes sense.


> 
> 
> 
> 
>     I think, currently it mixed usage because, the same variable declaration
>     used for C11 vs non C11 usage.Ideally we wont need "volatile" for C11
>     case. Either we need to change only to C11 mode OR have APIs for
>     atomic_load_acq_() and atomic_store_rel_() to allow both models like
>     Linux kernel and FreeBSD.
> 
>     >
>     > -- Ola
>     >
>     >
>     >
> 
> 


* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08  5:27                               ` Honnappa Nagarahalli
@ 2018-10-08 10:01                                 ` Ola Liljedahl
  0 siblings, 0 replies; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-08 10:01 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Jerin Jacob
  Cc: Ananyev, Konstantin, Gavin Hu (Arm Technology China),
	dev, Steve Capper, nd, stable

Or maybe performance gets worse, not because of that one additional instruction/cycle in ring buffer enqueue and dequeue, but because function or loop alignment changed for one or more functions.

When the benchmarking noise (possibly several % due to changes in code alignment) is bigger than the effect you are trying to measure (1 cycle per ring buffer enqueue/dequeue), benchmarking is not the right approach.

-- Ola

On 08/10/2018, 07:27, "Honnappa Nagarahalli" <Honnappa.Nagarahalli@arm.com> wrote:

    >     >
    >     > I doubt it is possible to benchmark with such a precision so to see the
    >     > potential difference of one ADD instruction.
    >     > Just changes in function alignment can affect performance by percents.
    > And
    >     > the natural variation when not using a 100% deterministic system is going
    > to
    >     > be a lot larger than one cycle per ring buffer operation.
    >     >
    >     > Some of the other patches are also for correctness (e.g. load-acquire of
    > tail)
    >     The discussion is about this patch alone. Other patches are already Acked.
    > So the benchmarking then makes zero sense.
    The whole point is to prove the effect of 1 instruction either way. IMO, it is simple enough: follow the memory model to the full extent. We have to keep other architectures in mind as well. Maybe that additional instruction is not required on other architectures.
    
    > 
    > 
    



* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 10:00                                           ` Jerin Jacob
@ 2018-10-08 10:25                                             ` Ola Liljedahl
  2018-10-08 10:33                                               ` Gavin Hu (Arm Technology China)
  2018-10-08 10:46                                               ` Jerin Jacob
  2018-10-09  3:16                                             ` Honnappa Nagarahalli
  1 sibling, 2 replies; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-08 10:25 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable



On 08/10/2018, 12:00, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:

    -----Original Message-----
    > Date: Mon, 8 Oct 2018 09:22:05 +0000
    > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
    > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
    >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
    >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
    >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
    >  "stable@dpdk.org" <stable@dpdk.org>
    > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    > user-agent: Microsoft-MacOutlook/10.11.0.180909
    > 
    > External Email
    > 
    > On 08/10/2018, 08:06, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
    > 
    >     -----Original Message-----
    >     > Date: Sun, 7 Oct 2018 20:44:54 +0000
    >     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
    >     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
    >     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
    >     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
    >     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
    >     >  "stable@dpdk.org" <stable@dpdk.org>
    >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
    >     >
    > 
    > 
    >     Could you please fix the email client for inline reply.
    > Sorry that doesn't seem to be possible with Outlook for Mac 16 or Office365. The official Office365/Outlook
    > documentation doesn't match the actual user interface...
    > 
    > 
    > 
    >     https://www.kernel.org/doc/html/v4.19-rc7/process/email-clients.html
    > 
    > 
    >     >
    >     > On 07/10/2018, 06:03, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
    >     >
    >     >     In arm64 case, it will have ATOMIC_RELAXED followed by asm volatile ("":::"memory") of rte_pause().
    >     >     I would n't have any issue, if the generated code code is same or better than the exiting case. but it not the case, Right?
    >     > The existing case is actually not interesting (IMO) as it exposes undefined behaviour which allows the compiler to do anything. But you seem to be satisfied with "works for me, right here right now". I think the cost of avoiding undefined behaviour is acceptable (actually I don't think it even will be noticeable).
    > 
    >     I am not convinced because of use of volatile in head and tail indexes.
    >     For me that brings the defined behavior.
    > As long as you don't mix in C11 atomic accesses (just use "plain" accesses to volatile objects),
    > it is AFAIK defined behaviour (but not necessarily using atomic loads and stores). But I quoted
    > the C11 spec where it explicitly mentions that mixing atomic and non-atomic accesses to the same
    > object is undefined behaviour. Don't argue with me, argue with the C11 spec.
    > If you want to disobey the spec, this should at least be called out for in the code with a comment.
    
    That's boils down only one question, should we follow C11 spec? Why not only take load
    acquire and store release semantics only just like Linux kernel and FreeBSD.
And introduce even more undefined behaviour?

    Does not look like C11 memory model is super efficient in term of gcc
    implementation.
You are making a mountain out of a molehill.

I think this "problem" with one additional ADD instruction will only concern __atomic_load_n(__ATOMIC_RELAXED) and __atomic_store_n(__ATOMIC_RELAXED) because the compiler separates the address generation (add offset of struct member) from the load or store itself. For other atomic operations and memory orderings (e.g. __atomic_load_n(__ATOMIC_ACQUIRE), the extra ADD instruction will be included anyway (as long as we access a non-first struct member) because e.g. LDAR only accepts a base register with no offset.

I suggest that minimising the imposed memory orderings can have a much larger (positive) effect on performance than avoiding one ADD instruction (memory accesses are much slower than CPU ALU instructions).
Using the C11 memory model and identifying exactly which objects are used for synchronisation, and whether any updates to shared memory are acquired or released (no updates to shared memory means relaxed order can be used), will give the compiler and hardware maximum freedom to get the best result.

The FreeBSD and DPDK ring buffers show some fundamental misunderstandings here. Instead, excessive orderings and explicit barriers have been used as band-aids, with unknown effects on performance.

    
    > 
    > 
    >     That the reason why I shared
    >     the generated assembly code. If you think other way, Pick any compiler
    >     and see generated output.
    > This is what one compiler for one architecture generates today. These things change. Other things
    > that used to work or worked for some specific architecture has stopped working in newer versions of
    > the compiler.
    > 
    > 
    >     And
    > 
    >     Freebsd implementation of ring buffer(Which DPDK derived from), Don't have
    >     such logic, See https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
    > It looks like FreeBSD uses some kind of C11 atomic memory model-inspired API although I don't see
    > exactly how e.g. atomic_store_rel_int() is implemented. The code also mixes in explicit barriers
    > so definitively not pure C11 memory model usage. And finally, it doesn't establish the proper
    > load-acquire/store-release relationships (e.g. store-release cons_tail requires a load-acquire cons_tail,
    > same for prod_tail).
    > 
    > "* multi-producer safe lock-free ring buffer enqueue"
    > The comment is also wrong. This design is not lock-free, how could it be when there is spinning
    > (waiting) for other threads in the code? If a thread must wait for other threads, then by definition
    > the design is blocking.
    > 
    > So you are saying that because FreeBSD is doing it wrong, DPDK can also do it wrong?
    > 
    > 
    >     See below too.
    > 
    >     >
    >     > Skipping the compiler memory barrier in rte_pause() potentially allows for optimisations that provide much more benefit, e.g. hiding some cache miss latency for later loads. The DPDK ring buffer implementation is defined so to enable inlining of enqueue/dequeue functions into the caller, any code could immediately follow these calls.
    >     >
    >     > From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x
    >     > Programming languages — C
    >     >
    >     > 5.1.2.4
    >     > 4 Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.
    >     >
    >     > 25 The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
    > 
    >     IMO, Both condition will satisfy if the variable is volatile and 32bit read will atomic
    >     for 32b and 64b machines. If not, the problem persist for generic case
    >     as well(lib/librte_ring/rte_ring_generic.h)
    > The read from a volatile object is not an atomic access per the C11 spec. It just happens to
    > be translated to an instruction (on x86-64 and AArch64/A64) that implements an atomic load.
    > I don't think any compiler would change this code generation and suddenly generate some
    > non-atomic load instruction for a program that *only* uses volatile to do "atomic" accesses.
    > But a future compiler could detect the mix of atomic and non-atomic accesses and mark this
    > expression as causing undefined behaviour and that would have consequences for code generation.
    > 
    > 
    >     I agree with you on C11 memory model semantics usage. The reason why I
    >     propose name for the file as rte_ring_c11_mem.h as DPDK it self did not
    >     had definitions for load acquire and store release semantics.
    >     I was looking for taking load acquire and store release semantics
    >     from C11 instead of creating new API like Linux kernel for FreeBSD(APIs
    >     like  atomic_load_acq_32(), atomic_store_rel_32()). If the file name is your
    >     concern then we could create new abstractions as well. That would help
    >     exiting KNI problem as well.
    > I appreciate your embrace of the C11 memory model. I think it is better for describing
    > (both to the compiler and to humans) which and how objects are used for synchronisation.
    > 
    > However, I don't think an API as you suggest (and others have suggested before, e.g. as
    > done in ODP) is a good idea. There is an infinite amount of possible base types, an
    > increasing number of operations and a bunch of different memory orderings, a "complete"
    > API would be very large and difficult to test, and most members of the API would never be used.
    > GCC and Clang both support the __atomic intrinsics. This API avoids the problems I
    > described above. Or we could use the official C11 syntax (stdatomic.h). But then we
    > have the problem with using pre-C11 compilers...
    
    I have no objection, if everyone agrees to move C11 memory model
    with __atomic intrinsics. But if we need to keep both have then
    atomic_load_acq_32() kind of API make sense.
    
    
    > 
    > 
    > 
    > 
    >     I think, currently it mixed usage because, the same variable declaration
    >     used for C11 vs non C11 usage.Ideally we wont need "volatile" for C11
    >     case. Either we need to change only to C11 mode OR have APIs for
    >     atomic_load_acq_() and atomic_store_rel_() to allow both models like
    >     Linux kernel and FreeBSD.
    > 
    >     >
    >     > -- Ola
    >     >
    >     >
    >     >
    > 
    > 
    



* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 10:25                                             ` Ola Liljedahl
@ 2018-10-08 10:33                                               ` Gavin Hu (Arm Technology China)
  2018-10-08 10:39                                                 ` Ola Liljedahl
  2018-10-08 10:49                                                 ` Jerin Jacob
  2018-10-08 10:46                                               ` Jerin Jacob
  1 sibling, 2 replies; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-10-08 10:33 UTC (permalink / raw)
  To: Ola Liljedahl, Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin, Steve Capper, nd, stable

I did benchmarking w/o and w/ the patch; it did not show any noticeable differences in terms of latency.
Here is the full log( 3 runs w/o the patch and 2 runs w/ the patch).

sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=1024  -- -i

RTE>>ring_perf_autotest (#1 run of test without the patch)
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 8
MP/MC single enq/dequeue: 10
SP/SC burst enq/dequeue (size: 8): 1
MP/MC burst enq/dequeue (size: 8): 1
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.24
MC empty dequeue: 0.24

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 1.40
MP/MC bulk enq/dequeue (size: 8): 1.74
SP/SC bulk enq/dequeue (size: 32): 0.59
MP/MC bulk enq/dequeue (size: 32): 0.68

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 2.49
MP/MC bulk enq/dequeue (size: 8): 3.09
SP/SC bulk enq/dequeue (size: 32): 1.07
MP/MC bulk enq/dequeue (size: 32): 1.13

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 6.55
MP/MC bulk enq/dequeue (size: 8): 12.99
SP/SC bulk enq/dequeue (size: 32): 1.97
MP/MC bulk enq/dequeue (size: 32): 3.41

RTE>>ring_perf_autotest(#2 run of test without the patch)
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 8
MP/MC single enq/dequeue: 10
SP/SC burst enq/dequeue (size: 8): 1
MP/MC burst enq/dequeue (size: 8): 1
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.24
MC empty dequeue: 0.24

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 1.40
MP/MC bulk enq/dequeue (size: 8): 1.74
SP/SC bulk enq/dequeue (size: 32): 0.59
MP/MC bulk enq/dequeue (size: 32): 0.68

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 2.50
MP/MC bulk enq/dequeue (size: 8): 3.08
SP/SC bulk enq/dequeue (size: 32): 1.07
MP/MC bulk enq/dequeue (size: 32): 1.13

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 6.57
MP/MC bulk enq/dequeue (size: 8): 13.00
SP/SC bulk enq/dequeue (size: 32): 1.98
MP/MC bulk enq/dequeue (size: 32): 3.41
Test OK

RTE>>ring_perf_autotest(#3 run of test without the patch)
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 8
MP/MC single enq/dequeue: 10
SP/SC burst enq/dequeue (size: 8): 1
MP/MC burst enq/dequeue (size: 8): 1
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.24
MC empty dequeue: 0.24

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 1.40
MP/MC bulk enq/dequeue (size: 8): 1.74
SP/SC bulk enq/dequeue (size: 32): 0.59
MP/MC bulk enq/dequeue (size: 32): 0.68

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 2.49
MP/MC bulk enq/dequeue (size: 8): 3.08
SP/SC bulk enq/dequeue (size: 32): 1.07
MP/MC bulk enq/dequeue (size: 32): 1.13

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 6.55
MP/MC bulk enq/dequeue (size: 8): 12.96
SP/SC bulk enq/dequeue (size: 32): 1.99
MP/MC bulk enq/dequeue (size: 32): 3.37


RTE>>ring_perf_autotest(#1 run of the test with the patch)
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 8
MP/MC single enq/dequeue: 10
SP/SC burst enq/dequeue (size: 8): 1
MP/MC burst enq/dequeue (size: 8): 1
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.24
MC empty dequeue: 0.32

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 1.43
MP/MC bulk enq/dequeue (size: 8): 1.74
SP/SC bulk enq/dequeue (size: 32): 0.60
MP/MC bulk enq/dequeue (size: 32): 0.68

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 2.53
MP/MC bulk enq/dequeue (size: 8): 3.09
SP/SC bulk enq/dequeue (size: 32): 1.06
MP/MC bulk enq/dequeue (size: 32): 1.18

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 6.57
MP/MC bulk enq/dequeue (size: 8): 12.93
SP/SC bulk enq/dequeue (size: 32): 1.98
MP/MC bulk enq/dequeue (size: 32): 3.44
Test OK
RTE>>ring_perf_autotest (#2 run of the test with the patch)
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 8
MP/MC single enq/dequeue: 10
SP/SC burst enq/dequeue (size: 8): 1
MP/MC burst enq/dequeue (size: 8): 1
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.24
MC empty dequeue: 0.32

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 1.43
MP/MC bulk enq/dequeue (size: 8): 1.74
SP/SC bulk enq/dequeue (size: 32): 0.60
MP/MC bulk enq/dequeue (size: 32): 0.68

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 2.53
MP/MC bulk enq/dequeue (size: 8): 3.09
SP/SC bulk enq/dequeue (size: 32): 1.06
MP/MC bulk enq/dequeue (size: 32): 1.18

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 6.57
MP/MC bulk enq/dequeue (size: 8): 12.93
SP/SC bulk enq/dequeue (size: 32): 1.98
MP/MC bulk enq/dequeue (size: 32): 3.46
Test OK

> -----Original Message-----
> From: Ola Liljedahl
> Sent: Monday, October 8, 2018 6:26 PM
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; Steve Capper <Steve.Capper@arm.com>; nd
> <nd@arm.com>; stable@dpdk.org
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> 
> 
> 
> On 08/10/2018, 12:00, "Jerin Jacob" <jerin.jacob@caviumnetworks.com>
> wrote:
> 
>     -----Original Message-----
>     > Date: Mon, 8 Oct 2018 09:22:05 +0000
>     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
>     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd
> <nd@arm.com>,
>     >  "stable@dpdk.org" <stable@dpdk.org>
>     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     >
>     > External Email
>     >
>     > On 08/10/2018, 08:06, "Jerin Jacob" <jerin.jacob@caviumnetworks.com>
> wrote:
>     >
>     >     -----Original Message-----
>     >     > Date: Sun, 7 Oct 2018 20:44:54 +0000
>     >     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     >     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>     >     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>     >     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology
> China)"
>     >     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd
> <nd@arm.com>,
>     >     >  "stable@dpdk.org" <stable@dpdk.org>
>     >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     >     >
>     >
>     >
>     >     Could you please fix the email client for inline reply.
>     > Sorry that doesn't seem to be possible with Outlook for Mac 16 or
> Office365. The official Office365/Outlook
>     > documentation doesn't match the actual user interface...
>     >
>     >
>     >
>     >     https://www.kernel.org/doc/html/v4.19-rc7/process/email-
> clients.html
>     >
>     >
>     >     >
>     >     > On 07/10/2018, 06:03, "Jerin Jacob"
> <jerin.jacob@caviumnetworks.com> wrote:
>     >     >
>     >     >     In arm64 case, it will have ATOMIC_RELAXED followed by asm
> volatile ("":::"memory") of rte_pause().
>     >     >     I would n't have any issue, if the generated code code is same or
> better than the exiting case. but it not the case, Right?
>     >     > The existing case is actually not interesting (IMO) as it exposes
> undefined behaviour which allows the compiler to do anything. But you
> seem to be satisfied with "works for me, right here right now". I think the
> cost of avoiding undefined behaviour is acceptable (actually I don't think it
> even will be noticeable).
>     >
>     >     I am not convinced because of use of volatile in head and tail indexes.
>     >     For me that brings the defined behavior.
>     > As long as you don't mix in C11 atomic accesses (just use "plain" accesses
> to volatile objects),
>     > it is AFAIK defined behaviour (but not necessarily using atomic loads and
> stores). But I quoted
>     > the C11 spec where it explicitly mentions that mixing atomic and non-
> atomic accesses to the same
>     > object is undefined behaviour. Don't argue with me, argue with the C11
> spec.
>     > If you want to disobey the spec, this should at least be called out for in
> the code with a comment.
> 
>     That's boils down only one question, should we follow C11 spec? Why not
> only take load
>     acquire and store release semantics only just like Linux kernel and FreeBSD.
> And introduce even more undefined behaviour?
> 
>     Does not look like C11 memory model is super efficient in term of gcc
>     implementation.
> You are making a chicken out of a feather.
> 
> I think this "problem" with one additional ADD instruction will only concern
> __atomic_load_n(__ATOMIC_RELAXED) and
> __atomic_store_n(__ATOMIC_RELAXED) because the compiler separates
> the address generation (add offset of struct member) from the load or store
> itself. For other atomic operations and memory orderings (e.g.
> __atomic_load_n(__ATOMIC_ACQUIRE), the extra ADD instruction will be
> included anyway (as long as we access a non-first struct member) because
> e.g. LDAR only accepts a base register with no offset.
> 
> I suggest minimising the imposed memory orderings can have a much larger
> (positive) effect on performance compared to avoiding one ADD instruction
> (memory accesses are much slower than CPU ALU instructions).
> Using C11 memory model and identifying exactly which objects are used for
> synchronisation and whether (any) updates to shared memory are acquired
> or released (no updates to shared memory means relaxed order can be used)
> will provide maximum freedom to the compiler and hardware to get the best
> result.
> 
> The FreeBSD and DPDK ring buffers show some fundamental
> misunderstandings here. Instead, excessive orderings and explicit barriers
> have been used as band-aids, with unknown effects on performance.
> 
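> A minimal sketch of the acquire/release pairing described above, using the
> GCC/Clang __atomic builtins (simplified, hypothetical types and names; not
> the actual rte_ring code):

```c
#include <stdint.h>

/* Simplified stand-in for DPDK's producer/consumer index pair;
 * names are illustrative only. */
struct ring_ht { uint32_t head; uint32_t tail; };

/* Consumer: a load-acquire of the producer's tail synchronises with
 * the store-release below, so every ring-slot write made before the
 * release is visible after this load returns. */
static inline uint32_t tail_load_acquire(const struct ring_ht *p)
{
	return __atomic_load_n(&p->tail, __ATOMIC_ACQUIRE);
}

/* Producer: a store-release of tail publishes all prior slot writes;
 * no other ordering (and no explicit barrier) is imposed. */
static inline void tail_store_release(struct ring_ht *p, uint32_t v)
{
	__atomic_store_n(&p->tail, v, __ATOMIC_RELEASE);
}
```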
> 
>     >
>     >
>     >     That's the reason why I shared
>     >     the generated assembly code. If you think otherwise, pick any compiler
>     >     and look at the generated output.
>     > This is what one compiler for one architecture generates today. These
>     > things change. Other things that used to work for some specific
>     > architecture have stopped working in newer versions of the compiler.
>     >
>     >
>     >     And
>     >
>     >     The FreeBSD implementation of the ring buffer (which DPDK derived
>     >     from) doesn't have such logic; see
>     >     https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
>     > It looks like FreeBSD uses some kind of C11 atomic memory model-
> inspired API although I don't see
>     > exactly how e.g. atomic_store_rel_int() is implemented. The code also
> mixes in explicit barriers
>     > so it is definitely not pure C11 memory model usage. And finally, it doesn't
> establish the proper
>     > load-acquire/store-release relationships (e.g. store-release cons_tail
> requires a load-acquire cons_tail,
>     > same for prod_tail).
>     >
>     > "* multi-producer safe lock-free ring buffer enqueue"
>     > The comment is also wrong. This design is not lock-free, how could it be
> when there is spinning
>     > (waiting) for other threads in the code? If a thread must wait for other
> threads, then by definition
>     > the design is blocking.
>     >
>     > So you are saying that because FreeBSD is doing it wrong, DPDK can also
> do it wrong?
>     >
>     >
>     >     See below too.
>     >
>     >     >
>     >     > Skipping the compiler memory barrier in rte_pause() potentially
> allows for optimisations that provide much more benefit, e.g. hiding some
> cache miss latency for later loads. The DPDK ring buffer implementation is
> defined so as to enable inlining of enqueue/dequeue functions into the caller;
> any code could immediately follow these calls.
>     >     >
>     >     > From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x
>     >     > Programming languages — C
>     >     >
>     >     > 5.1.2.4
>     >     > 4 Two expression evaluations conflict if one of them modifies a
> memory location and the other one reads or modifies the same memory
> location.
>     >     >
>     >     > 25 The execution of a program contains a data race if it contains two
> conflicting actions in different threads, at least one of which is not atomic,
> and neither happens before the other. Any such data race results in
> undefined behavior.
>     >
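>     > To make the quoted definitions concrete: a plain (even volatile) read of
>     > an index that another thread updates with __atomic_store_n is a
>     > non-atomic access racing with an atomic one, i.e. undefined behaviour
>     > per 5.1.2.4. A sketch of the conforming alternative (the helper name is
>     > hypothetical):

```c
#include <stdint.h>

/* was (racy under C11 if writers use __atomic stores):
 *     return *(volatile const uint32_t *)idx;
 * A relaxed atomic load keeps the access atomic without imposing any
 * ordering, so it should cost the same as the plain load. */
static inline uint32_t index_read(const uint32_t *idx)
{
	return __atomic_load_n(idx, __ATOMIC_RELAXED);
}
```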
>     >     IMO, both conditions will be satisfied if the variable is volatile and a
>     >     32-bit read is atomic on 32b and 64b machines. If not, the problem
>     >     persists for the generic case as well (lib/librte_ring/rte_ring_generic.h).
>     > The read from a volatile object is not an atomic access per the C11 spec. It
> just happens to
>     > be translated to an instruction (on x86-64 and AArch64/A64) that
> implements an atomic load.
>     > I don't think any compiler would change this code generation and
> suddenly generate some
>     > non-atomic load instruction for a program that *only* uses volatile to do
> "atomic" accesses.
>     > But a future compiler could detect the mix of atomic and non-atomic
> accesses and mark this
>     > expression as causing undefined behaviour and that would have
> consequences for code generation.
>     >
>     >
>     >     I agree with you on C11 memory model semantics usage. The reason why I
>     >     proposed the name rte_ring_c11_mem.h for the file is that DPDK itself did
>     >     not have definitions for load-acquire and store-release semantics.
>     >     I was looking to take load-acquire and store-release semantics
>     >     from C11 instead of creating a new API like the Linux kernel or FreeBSD
>     >     (APIs like atomic_load_acq_32(), atomic_store_rel_32()). If the file name
>     >     is your concern then we could create new abstractions as well. That would
>     >     help the existing KNI problem as well.
>     > I appreciate your embrace of the C11 memory model. I think it is better
> for describing
>     > (both to the compiler and to humans) which and how objects are used
> for synchronisation.
>     >
>     > However, I don't think an API as you suggest (and others have suggested
> before, e.g. as
>     > done in ODP) is a good idea. There is an infinite amount of possible base
> types, an
>     > increasing number of operations and a bunch of different memory
> orderings, a "complete"
>     > API would be very large and difficult to test, and most members of the
> API would never be used.
>     > GCC and Clang both support the __atomic intrinsics. This API avoids the
> problems I
>     > described above. Or we could use the official C11 syntax (stdatomic.h).
> But then we
>     > have the problem with using pre-C11 compilers...
> 
>     I have no objection if everyone agrees to move to the C11 memory model
>     with __atomic intrinsics. But if we need to keep both, then an
>     atomic_load_acq_32() kind of API makes sense.
> 
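> If both models were kept, the FreeBSD-style names could be thin wrappers
> over the __atomic builtins. A sketch only: DPDK defines no such API, and
> these signatures are assumptions for illustration.

```c
#include <stdint.h>

/* Hypothetical FreeBSD-style wrappers over GCC/Clang __atomic builtins. */
static inline uint32_t atomic_load_acq_32(const uint32_t *p)
{
	return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}

static inline void atomic_store_rel_32(uint32_t *p, uint32_t v)
{
	__atomic_store_n(p, v, __ATOMIC_RELEASE);
}
```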
> 
>     >
>     >
>     >
>     >
>     >     I think it is currently mixed usage because the same variable declaration
>     >     is used for C11 vs non-C11 usage. Ideally we won't need "volatile" for the
>     >     C11 case. Either we need to change only to C11 mode OR have APIs for
>     >     atomic_load_acq_() and atomic_store_rel_() to allow both models like the
>     >     Linux kernel and FreeBSD.
>     >
>     >     >
>     >     > -- Ola
>     >     >
>     >     >
>     >     >
>     >
>     >
> 


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 10:33                                               ` Gavin Hu (Arm Technology China)
@ 2018-10-08 10:39                                                 ` Ola Liljedahl
  2018-10-08 10:41                                                   ` Gavin Hu (Arm Technology China)
  2018-10-08 10:49                                                 ` Jerin Jacob
  1 sibling, 1 reply; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-08 10:39 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin, Steve Capper, nd, stable

On 08/10/2018, 12:33, "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com> wrote:

    I did benchmarking w/o and w/ the patch; it did not show any noticeable differences in terms of latency.
Which platform is this?


    Here is the full log (3 runs w/o the patch and 2 runs w/ the patch).
    
    sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=1024  -- -i
    
    RTE>>ring_perf_autotest (#1 run of test without the patch)
    ### Testing single element and burst enq/deq ###
    SP/SC single enq/dequeue: 8
    MP/MC single enq/dequeue: 10
    SP/SC burst enq/dequeue (size: 8): 1
    MP/MC burst enq/dequeue (size: 8): 1
    SP/SC burst enq/dequeue (size: 32): 0
    MP/MC burst enq/dequeue (size: 32): 0
    
    ### Testing empty dequeue ###
    SC empty dequeue: 0.24
    MC empty dequeue: 0.24
    
    ### Testing using a single lcore ###
    SP/SC bulk enq/dequeue (size: 8): 1.40
    MP/MC bulk enq/dequeue (size: 8): 1.74
    SP/SC bulk enq/dequeue (size: 32): 0.59
    MP/MC bulk enq/dequeue (size: 32): 0.68
    
    ### Testing using two hyperthreads ###
    SP/SC bulk enq/dequeue (size: 8): 2.49
    MP/MC bulk enq/dequeue (size: 8): 3.09
    SP/SC bulk enq/dequeue (size: 32): 1.07
    MP/MC bulk enq/dequeue (size: 32): 1.13
    
    ### Testing using two physical cores ###
    SP/SC bulk enq/dequeue (size: 8): 6.55
    MP/MC bulk enq/dequeue (size: 8): 12.99
    SP/SC bulk enq/dequeue (size: 32): 1.97
    MP/MC bulk enq/dequeue (size: 32): 3.41
    
    RTE>>ring_perf_autotest(#2 run of test without the patch)
    ### Testing single element and burst enq/deq ###
    SP/SC single enq/dequeue: 8
    MP/MC single enq/dequeue: 10
    SP/SC burst enq/dequeue (size: 8): 1
    MP/MC burst enq/dequeue (size: 8): 1
    SP/SC burst enq/dequeue (size: 32): 0
    MP/MC burst enq/dequeue (size: 32): 0
    
    ### Testing empty dequeue ###
    SC empty dequeue: 0.24
    MC empty dequeue: 0.24
    
    ### Testing using a single lcore ###
    SP/SC bulk enq/dequeue (size: 8): 1.40
    MP/MC bulk enq/dequeue (size: 8): 1.74
    SP/SC bulk enq/dequeue (size: 32): 0.59
    MP/MC bulk enq/dequeue (size: 32): 0.68
    
    ### Testing using two hyperthreads ###
    SP/SC bulk enq/dequeue (size: 8): 2.50
    MP/MC bulk enq/dequeue (size: 8): 3.08
    SP/SC bulk enq/dequeue (size: 32): 1.07
    MP/MC bulk enq/dequeue (size: 32): 1.13
    
    ### Testing using two physical cores ###
    SP/SC bulk enq/dequeue (size: 8): 6.57
    MP/MC bulk enq/dequeue (size: 8): 13.00
    SP/SC bulk enq/dequeue (size: 32): 1.98
    MP/MC bulk enq/dequeue (size: 32): 3.41
    Test OK
    
    RTE>>ring_perf_autotest(#3 run of test without the patch)
    ### Testing single element and burst enq/deq ###
    SP/SC single enq/dequeue: 8
    MP/MC single enq/dequeue: 10
    SP/SC burst enq/dequeue (size: 8): 1
    MP/MC burst enq/dequeue (size: 8): 1
    SP/SC burst enq/dequeue (size: 32): 0
    MP/MC burst enq/dequeue (size: 32): 0
    
    ### Testing empty dequeue ###
    SC empty dequeue: 0.24
    MC empty dequeue: 0.24
    
    ### Testing using a single lcore ###
    SP/SC bulk enq/dequeue (size: 8): 1.40
    MP/MC bulk enq/dequeue (size: 8): 1.74
    SP/SC bulk enq/dequeue (size: 32): 0.59
    MP/MC bulk enq/dequeue (size: 32): 0.68
    
    ### Testing using two hyperthreads ###
    SP/SC bulk enq/dequeue (size: 8): 2.49
    MP/MC bulk enq/dequeue (size: 8): 3.08
    SP/SC bulk enq/dequeue (size: 32): 1.07
    MP/MC bulk enq/dequeue (size: 32): 1.13
    
    ### Testing using two physical cores ###
    SP/SC bulk enq/dequeue (size: 8): 6.55
    MP/MC bulk enq/dequeue (size: 8): 12.96
    SP/SC bulk enq/dequeue (size: 32): 1.99
    MP/MC bulk enq/dequeue (size: 32): 3.37
    
    
    RTE>>ring_perf_autotest(#1 run of the test with the patch)
    ### Testing single element and burst enq/deq ###
    SP/SC single enq/dequeue: 8
    MP/MC single enq/dequeue: 10
    SP/SC burst enq/dequeue (size: 8): 1
    MP/MC burst enq/dequeue (size: 8): 1
    SP/SC burst enq/dequeue (size: 32): 0
    MP/MC burst enq/dequeue (size: 32): 0
    
    ### Testing empty dequeue ###
    SC empty dequeue: 0.24
    MC empty dequeue: 0.32
    
    ### Testing using a single lcore ###
    SP/SC bulk enq/dequeue (size: 8): 1.43
    MP/MC bulk enq/dequeue (size: 8): 1.74
    SP/SC bulk enq/dequeue (size: 32): 0.60
    MP/MC bulk enq/dequeue (size: 32): 0.68
    
    ### Testing using two hyperthreads ###
    SP/SC bulk enq/dequeue (size: 8): 2.53
    MP/MC bulk enq/dequeue (size: 8): 3.09
    SP/SC bulk enq/dequeue (size: 32): 1.06
    MP/MC bulk enq/dequeue (size: 32): 1.18
    
    ### Testing using two physical cores ###
    SP/SC bulk enq/dequeue (size: 8): 6.57
    MP/MC bulk enq/dequeue (size: 8): 12.93
    SP/SC bulk enq/dequeue (size: 32): 1.98
    MP/MC bulk enq/dequeue (size: 32): 3.44
    Test OK
    RTE>>ring_perf_autotest (#2 run of the test with the patch)
    ### Testing single element and burst enq/deq ###
    SP/SC single enq/dequeue: 8
    MP/MC single enq/dequeue: 10
    SP/SC burst enq/dequeue (size: 8): 1
    MP/MC burst enq/dequeue (size: 8): 1
    SP/SC burst enq/dequeue (size: 32): 0
    MP/MC burst enq/dequeue (size: 32): 0
    
    ### Testing empty dequeue ###
    SC empty dequeue: 0.24
    MC empty dequeue: 0.32
    
    ### Testing using a single lcore ###
    SP/SC bulk enq/dequeue (size: 8): 1.43
    MP/MC bulk enq/dequeue (size: 8): 1.74
    SP/SC bulk enq/dequeue (size: 32): 0.60
    MP/MC bulk enq/dequeue (size: 32): 0.68
    
    ### Testing using two hyperthreads ###
    SP/SC bulk enq/dequeue (size: 8): 2.53
    MP/MC bulk enq/dequeue (size: 8): 3.09
    SP/SC bulk enq/dequeue (size: 32): 1.06
    MP/MC bulk enq/dequeue (size: 32): 1.18
    
    ### Testing using two physical cores ###
    SP/SC bulk enq/dequeue (size: 8): 6.57
    MP/MC bulk enq/dequeue (size: 8): 12.93
    SP/SC bulk enq/dequeue (size: 32): 1.98
    MP/MC bulk enq/dequeue (size: 32): 3.46
    Test OK
    
    > -----Original Message-----
    > From: Ola Liljedahl
    > Sent: Monday, October 8, 2018 6:26 PM
    > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    > Cc: dev@dpdk.org; Honnappa Nagarahalli
    > <Honnappa.Nagarahalli@arm.com>; Ananyev, Konstantin
    > <konstantin.ananyev@intel.com>; Gavin Hu (Arm Technology China)
    > <Gavin.Hu@arm.com>; Steve Capper <Steve.Capper@arm.com>; nd
    > <nd@arm.com>; stable@dpdk.org
    > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    > 
    > 
    > 
    > On 08/10/2018, 12:00, "Jerin Jacob" <jerin.jacob@caviumnetworks.com>
    > wrote:
    > 
    >     -----Original Message-----
    >     > Date: Mon, 8 Oct 2018 09:22:05 +0000
    >     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
    >     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
    >     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
    >     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
    >     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd
    > <nd@arm.com>,
    >     >  "stable@dpdk.org" <stable@dpdk.org>
    >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
    >     >
    >     > External Email
    >     >
    >     > On 08/10/2018, 08:06, "Jerin Jacob" <jerin.jacob@caviumnetworks.com>
    > wrote:
    >     >
    >     >     -----Original Message-----
    >     >     > Date: Sun, 7 Oct 2018 20:44:54 +0000
    >     >     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
    >     >     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    >     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
    >     >     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
    >     >     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology
    > China)"
    >     >     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd
    > <nd@arm.com>,
    >     >     >  "stable@dpdk.org" <stable@dpdk.org>
    >     >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    >     >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
    >     >     >
    >     >
    >     >
    >     >     Could you please fix the email client for inline reply.
    >     > Sorry that doesn't seem to be possible with Outlook for Mac 16 or
    > Office365. The official Office365/Outlook
    >     > documentation doesn't match the actual user interface...
    >     >
    >     >
    >     >
    >     >     https://www.kernel.org/doc/html/v4.19-rc7/process/email-
    > clients.html
    >     >
    >     >
    >     >     >
    >     >     > On 07/10/2018, 06:03, "Jerin Jacob"
    > <jerin.jacob@caviumnetworks.com> wrote:
    >     >     >
    >     >     >     In arm64 case, it will have ATOMIC_RELAXED followed by asm
    > volatile ("":::"memory") of rte_pause().
    >     >     >     I would n't have any issue, if the generated code code is same or
    > better than the exiting case. but it not the case, Right?
    >     >     > The existing case is actually not interesting (IMO) as it exposes
    > undefined behaviour which allows the compiler to do anything. But you
    > seem to be satisfied with "works for me, right here right now". I think the
    > cost of avoiding undefined behaviour is acceptable (actually I don't think it
    > even will be noticeable).
    >     [...]


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 10:39                                                 ` Ola Liljedahl
@ 2018-10-08 10:41                                                   ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-10-08 10:41 UTC (permalink / raw)
  To: Ola Liljedahl, Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin, Steve Capper, nd, stable

This is ThunderX2.

> -----Original Message-----
> From: Ola Liljedahl
> Sent: Monday, October 8, 2018 6:39 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Jerin Jacob
> <jerin.jacob@caviumnetworks.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Steve Capper <Steve.Capper@arm.com>;
> nd <nd@arm.com>; stable@dpdk.org
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> 
> On 08/10/2018, 12:33, "Gavin Hu (Arm Technology China)"
> <Gavin.Hu@arm.com> wrote:
> 
>     I did benchmarking w/o and w/ the patch; it did not show any noticeable
>     differences in terms of latency.
> Which platform is this?
> 
> 
>     [...]
> 
>     > -----Original Message-----
>     > From: Ola Liljedahl
>     > Sent: Monday, October 8, 2018 6:26 PM
>     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     > Cc: dev@dpdk.org; Honnappa Nagarahalli
>     > <Honnappa.Nagarahalli@arm.com>; Ananyev, Konstantin
>     > <konstantin.ananyev@intel.com>; Gavin Hu (Arm Technology China)
>     > <Gavin.Hu@arm.com>; Steve Capper <Steve.Capper@arm.com>; nd
>     > <nd@arm.com>; stable@dpdk.org
>     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     >
>     >
>     >
>     > On 08/10/2018, 12:00, "Jerin Jacob" <jerin.jacob@caviumnetworks.com>
>     > wrote:
>     >
>     >     -----Original Message-----
>     >     > Date: Mon, 8 Oct 2018 09:22:05 +0000
>     >     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     >     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>     >     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>     >     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology
> China)"
>     >     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd
>     > <nd@arm.com>,
>     >     >  "stable@dpdk.org" <stable@dpdk.org>
>     >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     >     >
>     >     > External Email
>     >     >
>     >     > On 08/10/2018, 08:06, "Jerin Jacob"
> <jerin.jacob@caviumnetworks.com>
>     > wrote:
>     >     >
>     >     >     -----Original Message-----
>     >     >     > Date: Sun, 7 Oct 2018 20:44:54 +0000
>     >     >     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     >     >     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     >     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>     >     >     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>     >     >     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology
>     > China)"
>     >     >     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>,
> nd
>     > <nd@arm.com>,
>     >     >     >  "stable@dpdk.org" <stable@dpdk.org>
>     >     >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     >     >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     >     >     >
>     >     >
>     >     >
>     >     >     Could you please fix the email client for inline reply.
>     >     > Sorry that doesn't seem to be possible with Outlook for Mac 16 or
>     > Office365. The official Office365/Outlook
>     >     > documentation doesn't match the actual user interface...
>     >     >
>     >     >
>     >     >
>     >     >     https://www.kernel.org/doc/html/v4.19-rc7/process/email-
>     > clients.html
>     >     >
>     >     >
>     >     >     >
>     >     >     > On 07/10/2018, 06:03, "Jerin Jacob"
>     > <jerin.jacob@caviumnetworks.com> wrote:
>     >     >     >
>     >     >     >     In arm64 case, it will have ATOMIC_RELAXED followed by asm
>     > volatile ("":::"memory") of rte_pause().
>     >     >     >     I would n't have any issue, if the generated code code is same
> or
>     > better than the exiting case. but it not the case, Right?
>     >     >     > The existing case is actually not interesting (IMO) as it exposes
>     > undefined behaviour which allows the compiler to do anything. But you
>     > seem to be satisfied with "works for me, right here right now". I think the
>     > cost of avoiding undefined behaviour is acceptable (actually I don't think
> it
>     > even will be noticeable).
>     >     >
>     >     >     I am not convinced because of use of volatile in head and tail
> indexes.
>     >     >     For me that brings the defined behavior.
>     >     > As long as you don't mix in C11 atomic accesses (just use "plain"
> accesses
>     > to volatile objects),
>     >     > it is AFAIK defined behaviour (but not necessarily using atomic loads
> and
>     > stores). But I quoted
>     >     > the C11 spec where it explicitly mentions that mixing atomic and non-
>     > atomic accesses to the same
>     >     > object is undefined behaviour. Don't argue with me, argue with the
> C11
>     > spec.
>     >     > If you want to disobey the spec, this should at least be called out for
> in
>     > the code with a comment.
>     >
>     >     That's boils down only one question, should we follow C11 spec? Why
> not
>     > only take load
>     >     acquire and store release semantics only just like Linux kernel and
> FreeBSD.
>     > And introduce even more undefined behaviour?
>     >
>     >     Does not look like C11 memory model is super efficient in term of gcc
>     >     implementation.
>     > You are making a chicken out of a feather.
>     >
>     > I think this "problem" with one additional ADD instruction will only
> concern
>     > __atomic_load_n(__ATOMIC_RELAXED) and
>     > __atomic_store_n(__ATOMIC_RELAXED) because the compiler
> separates
>     > the address generation (add offset of struct member) from the load or
> store
>     > itself. For other atomic operations and memory orderings (e.g.
>     > __atomic_load_n(__ATOMIC_ACQUIRE), the extra ADD instruction will
> be
>     > included anyway (as long as we access a non-first struct member)
> because
>     > e.g. LDAR only accepts a base register with no offset.
>     >
>     > I suggest minimising the imposed memory orderings can have a much
> larger
>     > (positive) effect on performance compared to avoiding one ADD
> instruction
>     > (memory accesses are much slower than CPU ALU instructions).
>     > Using C11 memory model and identifying exactly which objects are used
> for
>     > synchronisation and whether (any) updates to shared memory are
> acquired
>     > or released (no updates to shared memory means relaxed order can be
> used)
>     > will provide maximum freedom to the compiler and hardware to get the
> best
>     > result.
>     >
>     > The FreeBSD and DPDK ring buffers show some fundamental
>     > misunderstandings here. Instead excessive orderings and explicit barriers
>     > have been used as band-aids, with unknown effects on performance.
>     >
>     >
>     >     >
>     >     >
>     >     >     That the reason why I shared
>     >     >     the generated assembly code. If you think other way, Pick any
> compiler
>     >     >     and see generated output.
>     >     > This is what one compiler for one architecture generates today.
> These
>     > things change. Other things
>     >     > that used to work or worked for some specific architecture has
> stopped
>     > working in newer versions of
>     >     > the compiler.
>     >     >
>     >     >
>     >     >     And
>     >     >
>     >     >     Freebsd implementation of ring buffer(Which DPDK derived from),
>     > Don't have
>     >     >     such logic, See
>     >
> https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
>     >     > It looks like FreeBSD uses some kind of C11 atomic memory model-
>     > inspired API although I don't see
>     >     > exactly how e.g. atomic_store_rel_int() is implemented. The code
> also
>     > mixes in explicit barriers
>     >     > so definitively not pure C11 memory model usage. And finally, it
> doesn't
>     > establish the proper
>     >     > load-acquire/store-release relationships (e.g. store-release cons_tail
>     > requires a load-acquire cons_tail,
>     >     > same for prod_tail).
>     >     >
>     >     > "* multi-producer safe lock-free ring buffer enqueue"
>     >     > The comment is also wrong. This design is not lock-free, how could it
> be
>     > when there is spinning
>     >     > (waiting) for other threads in the code? If a thread must wait for
> other
>     > threads, then by definition
>     >     > the design is blocking.
>     >     >
>     >     > So you are saying that because FreeBSD is doing it wrong, DPDK can
> also
>     > do it wrong?
>     >     >
>     >     >
>     >     >     See below too.
>     >     >
>     >     >     >
>     >     >     > Skipping the compiler memory barrier in rte_pause() potentially
>     > allows for optimisations that provide much more benefit, e.g. hiding
> some
>     > cache miss latency for later loads. The DPDK ring buffer implementation is
>     > defined so to enable inlining of enqueue/dequeue functions into the
> caller,
>     > any code could immediately follow these calls.
>     >     >     >
>     >     >     > From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x
>     >     >     > Programming languages — C
>     >     >     >
>     >     >     > 5.1.2.4
>     >     >     > 4 Two expression evaluations conflict if one of them modifies a
>     > memory location and the other one reads or modifies the same memory
>     > location.
>     >     >     >
>     >     >     > 25 The execution of a program contains a data race if it contains
> two
>     > conflicting actions in different threads, at least one of which is not atomic,
>     > and neither happens before the other. Any such data race results in
>     > undefined behavior.
>     >     >
>     >     >     IMO, Both condition will satisfy if the variable is volatile and 32bit
> read
>     > will atomic
>     >     >     for 32b and 64b machines. If not, the problem persist for generic
> case
>     >     >     as well(lib/librte_ring/rte_ring_generic.h)
>     >     > The read from a volatile object is not an atomic access per the C11
> spec. It
>     > just happens to
>     >     > be translated to an instruction (on x86-64 and AArch64/A64) that
>     > implements an atomic load.
>     >     > I don't think any compiler would change this code generation and
>     > suddenly generate some
>     >     > non-atomic load instruction for a program that *only* uses volatile to
> do
>     > "atomic" accesses.
>     >     > But a future compiler could detect the mix of atomic and non-atomic
>     > accesses and mark this
>     >     > expression as causing undefined behaviour and that would have
>     > consequences for code generation.
>     >     >
>     >     >
>     >     >     I agree with you on C11 memory model semantics usage. The
> reason
>     > why I
>     >     >     propose name for the file as rte_ring_c11_mem.h as DPDK it self
> did
>     > not
>     >     >     had definitions for load acquire and store release semantics.
>     >     >     I was looking for taking load acquire and store release semantics
>     >     >     from C11 instead of creating new API like Linux kernel for
> FreeBSD(APIs
>     >     >     like  atomic_load_acq_32(), atomic_store_rel_32()). If the file name
> is
>     > your
>     >     >     concern then we could create new abstractions as well. That would
>     > help
>     >     >     exiting KNI problem as well.
>     >     > I appreciate your embrace of the C11 memory model. I think it is
> better
>     > for describing
>     >     > (both to the compiler and to humans) which and how objects are
> used
>     > for synchronisation.
>     >     >
>     >     > However, I don't think an API as you suggest (and others have
> suggested
>     > before, e.g. as
>     >     > done in ODP) is a good idea. There is an infinite amount of possible
> base
>     > types, an
>     >     > increasing number of operations and a bunch of different memory
>     > orderings, a "complete"
>     >     > API would be very large and difficult to test, and most members of
> the
>     > API would never be used.
>     >     > GCC and Clang both support the __atomic intrinsics. This API avoids
> the
>     > problems I
>     >     > described above. Or we could use the official C11 syntax
> (stdatomic.h).
>     > But then we
>     >     > have the problem with using pre-C11 compilers...
>     >
>     >     I have no objection, if everyone agrees to move C11 memory model
>     >     with __atomic intrinsics. But if we need to keep both have then
>     >     atomic_load_acq_32() kind of API make sense.
>     >
>     >
>     >     >
>     >     >
>     >     >
>     >     >
>     >     >     I think, currently it mixed usage because, the same variable
> declaration
>     >     >     used for C11 vs non C11 usage.Ideally we wont need "volatile" for
> C11
>     >     >     case. Either we need to change only to C11 mode OR have APIs for
>     >     >     atomic_load_acq_() and atomic_store_rel_() to allow both models
> like
>     >     >     Linux kernel and FreeBSD.
>     >     >
>     >     >     >
>     >     >     > -- Ola
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >
>     >     >
>     >
> 
> 
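[Editor's note] The acquire/release pairing argued for in this thread (and implemented by the patch at the top) can be sketched as below. This is a simplified stand-in, not the actual DPDK `rte_ring` code: field and function names mirror `rte_ring_c11_mem.h`, but the structure and the `ht_demo` helper are illustrative assumptions.

```c
#include <stdint.h>

/* Simplified stand-in for DPDK's head/tail pair. */
struct headtail {
	uint32_t head;
	uint32_t tail;
};

/* Sketch of the patched update_tail(): while waiting for preceding
 * enqueues/dequeues, the tail is read with a relaxed atomic load (no
 * ordering needed, the loop only spins); the new tail is then published
 * with a store-release so that all ring updates made before it become
 * visible to any thread that load-acquires the tail. */
static void
update_tail(struct headtail *ht, uint32_t old_val, uint32_t new_val,
	    int single)
{
	if (!single)
		while (old_val != __atomic_load_n(&ht->tail,
						  __ATOMIC_RELAXED))
			; /* rte_pause() in the real code */
	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
}

/* The matching read on the opposite side: a load-acquire synchronises
 * with the store-release above. */
static uint32_t
read_tail_acquire(struct headtail *ht)
{
	return __atomic_load_n(&ht->tail, __ATOMIC_ACQUIRE);
}

/* Single-threaded driver for the sketch. */
static uint32_t
ht_demo(void)
{
	struct headtail ht = { 0, 0 };

	update_tail(&ht, 0, 4, 0);
	return read_tail_acquire(&ht);
}
```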


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 10:25                                             ` Ola Liljedahl
  2018-10-08 10:33                                               ` Gavin Hu (Arm Technology China)
@ 2018-10-08 10:46                                               ` Jerin Jacob
  2018-10-08 11:21                                                 ` Ola Liljedahl
  1 sibling, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2018-10-08 10:46 UTC (permalink / raw)
  To: Ola Liljedahl
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable

-----Original Message-----
> Date: Mon, 8 Oct 2018 10:25:45 +0000
> From: Ola Liljedahl <Ola.Liljedahl@arm.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
>  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
>  "stable@dpdk.org" <stable@dpdk.org>
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> user-agent: Microsoft-MacOutlook/10.11.0.180909
> 
> 
> On 08/10/2018, 12:00, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
> 
>     -----Original Message-----
>     > Date: Mon, 8 Oct 2018 09:22:05 +0000
>     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
>     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
>     >  "stable@dpdk.org" <stable@dpdk.org>
>     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     >
>     > External Email
>     >
>     > On 08/10/2018, 08:06, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
>     >
>     >     -----Original Message-----
>     >     > Date: Sun, 7 Oct 2018 20:44:54 +0000
>     >     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     >     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>     >     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>     >     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
>     >     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
>     >     >  "stable@dpdk.org" <stable@dpdk.org>
>     >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     >     >
>     >
>     >
>     >     Could you please fix the email client for inline reply.
>     > Sorry that doesn't seem to be possible with Outlook for Mac 16 or Office365. The official Office365/Outlook
>     > documentation doesn't match the actual user interface...
>     >
>     >
>     >
>     >     https://www.kernel.org/doc/html/v4.19-rc7/process/email-clients.html
>     >
>     >
>     >     >
>     >     > On 07/10/2018, 06:03, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
>     >     >
>     >     >     In arm64 case, it will have ATOMIC_RELAXED followed by asm volatile ("":::"memory") of rte_pause().
>     >     >     I would n't have any issue, if the generated code code is same or better than the exiting case. but it not the case, Right?
>     >     > The existing case is actually not interesting (IMO) as it exposes undefined behaviour which allows the compiler to do anything. But you seem to be satisfied with "works for me, right here right now". I think the cost of avoiding undefined behaviour is acceptable (actually I don't think it even will be noticeable).
>     >
>     >     I am not convinced because of use of volatile in head and tail indexes.
>     >     For me that brings the defined behavior.
>     > As long as you don't mix in C11 atomic accesses (just use "plain" accesses to volatile objects),
>     > it is AFAIK defined behaviour (but not necessarily using atomic loads and stores). But I quoted
>     > the C11 spec where it explicitly mentions that mixing atomic and non-atomic accesses to the same
>     > object is undefined behaviour. Don't argue with me, argue with the C11 spec.
>     > If you want to disobey the spec, this should at least be called out for in the code with a comment.
> 
>     That's boils down only one question, should we follow C11 spec? Why not only take load
>     acquire and store release semantics only just like Linux kernel and FreeBSD.
> And introduce even more undefined behaviour?


Yes. The whole world (Linux and FreeBSD) runs with this undefined behavior and it still works.



> 
>     Does not look like C11 memory model is super efficient in term of gcc
>     implementation.
> You are making a chicken out of a feather.
> 
> I think this "problem" with one additional ADD instruction will only concern __atomic_load_n(__ATOMIC_RELAXED) and __atomic_store_n(__ATOMIC_RELAXED) because the compiler separates the address generation (add offset of struct member) from the load or store itself. For other atomic operations and memory orderings (e.g. __atomic_load_n(__ATOMIC_ACQUIRE), the extra ADD instruction will be included anyway (as long as we access a non-first struct member) because e.g. LDAR only accepts a base register with no offset.
> 
> I suggest minimising the imposed memory orderings can have a much larger (positive) effect on performance compared to avoiding one ADD instruction (memory accesses are much slower than CPU ALU instructions).
> Using C11 memory model and identifying exactly which objects are used for synchronisation and whether (any) updates to shared memory are acquired or released (no updates to shared memory means relaxed order can be used) will provide maximum freedom to the compiler and hardware to get the best result.

No more comments on this. It is not data-driven.
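[Editor's note] The "one additional ADD" point above can be checked from codegen. The sketch below is a hypothetical layout (not the DPDK structure): on AArch64, a relaxed load of a non-first struct member can compile to one LDR with an immediate offset, whereas the acquire load uses LDAR, which accepts only a base register, so the compiler emits a separate ADD for the offset either way.

```c
#include <stdint.h>

/* Hypothetical layout: 'tail' is a non-first member, so its address is
 * base + offset. */
struct ring_ht {
	uint32_t head;
	uint32_t tail;
};

/* Relaxed load: typically a single LDR w, [x, #4] on AArch64. */
static uint32_t
load_tail_relaxed(struct ring_ht *ht)
{
	return __atomic_load_n(&ht->tail, __ATOMIC_RELAXED);
}

/* Acquire load: LDAR takes no offset, so an ADD precedes it. */
static uint32_t
load_tail_acquire(struct ring_ht *ht)
{
	return __atomic_load_n(&ht->tail, __ATOMIC_ACQUIRE);
}
```

Both functions return the same value; only the instruction sequence differs, which is the cost being debated.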


> 
> The FreeBSD and DPDK ring buffers show some fundamental misunderstandings here. Instead excessive orderings and explicit barriers have been used as band-aids, with unknown effects on performance.
> 
> 
>     >
>     >
>     >     That the reason why I shared
>     >     the generated assembly code. If you think other way, Pick any compiler
>     >     and see generated output.
>     > This is what one compiler for one architecture generates today. These things change. Other things
>     > that used to work or worked for some specific architecture has stopped working in newer versions of
>     > the compiler.
>     >
>     >
>     >     And
>     >
>     >     Freebsd implementation of ring buffer(Which DPDK derived from), Don't have
>     >     such logic, See https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
>     > It looks like FreeBSD uses some kind of C11 atomic memory model-inspired API although I don't see
>     > exactly how e.g. atomic_store_rel_int() is implemented. The code also mixes in explicit barriers
>     > so definitively not pure C11 memory model usage. And finally, it doesn't establish the proper
>     > load-acquire/store-release relationships (e.g. store-release cons_tail requires a load-acquire cons_tail,
>     > same for prod_tail).
>     >
>     > "* multi-producer safe lock-free ring buffer enqueue"
>     > The comment is also wrong. This design is not lock-free, how could it be when there is spinning
>     > (waiting) for other threads in the code? If a thread must wait for other threads, then by definition
>     > the design is blocking.
>     >
>     > So you are saying that because FreeBSD is doing it wrong, DPDK can also do it wrong?
>     >
>     >
>     >     See below too.
>     >
>     >     >
>     >     > Skipping the compiler memory barrier in rte_pause() potentially allows for optimisations that provide much more benefit, e.g. hiding some cache miss latency for later loads. The DPDK ring buffer implementation is defined so to enable inlining of enqueue/dequeue functions into the caller, any code could immediately follow these calls.
>     >     >
>     >     > From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x
>     >     > Programming languages — C
>     >     >
>     >     > 5.1.2.4
>     >     > 4 Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.
>     >     >
>     >     > 25 The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
>     >
>     >     IMO, Both condition will satisfy if the variable is volatile and 32bit read will atomic
>     >     for 32b and 64b machines. If not, the problem persist for generic case
>     >     as well(lib/librte_ring/rte_ring_generic.h)
>     > The read from a volatile object is not an atomic access per the C11 spec. It just happens to
>     > be translated to an instruction (on x86-64 and AArch64/A64) that implements an atomic load.
>     > I don't think any compiler would change this code generation and suddenly generate some
>     > non-atomic load instruction for a program that *only* uses volatile to do "atomic" accesses.
>     > But a future compiler could detect the mix of atomic and non-atomic accesses and mark this
>     > expression as causing undefined behaviour and that would have consequences for code generation.
>     >
>     >
>     >     I agree with you on C11 memory model semantics usage. The reason why I
>     >     propose name for the file as rte_ring_c11_mem.h as DPDK it self did not
>     >     had definitions for load acquire and store release semantics.
>     >     I was looking for taking load acquire and store release semantics
>     >     from C11 instead of creating new API like Linux kernel for FreeBSD(APIs
>     >     like  atomic_load_acq_32(), atomic_store_rel_32()). If the file name is your
>     >     concern then we could create new abstractions as well. That would help
>     >     exiting KNI problem as well.
>     > I appreciate your embrace of the C11 memory model. I think it is better for describing
>     > (both to the compiler and to humans) which and how objects are used for synchronisation.
>     >
>     > However, I don't think an API as you suggest (and others have suggested before, e.g. as
>     > done in ODP) is a good idea. There is an infinite amount of possible base types, an
>     > increasing number of operations and a bunch of different memory orderings, a "complete"
>     > API would be very large and difficult to test, and most members of the API would never be used.
>     > GCC and Clang both support the __atomic intrinsics. This API avoids the problems I
>     > described above. Or we could use the official C11 syntax (stdatomic.h). But then we
>     > have the problem with using pre-C11 compilers...
> 
>     I have no objection, if everyone agrees to move C11 memory model
>     with __atomic intrinsics. But if we need to keep both have then
>     atomic_load_acq_32() kind of API make sense.
> 
> 
>     >
>     >
>     >
>     >
>     >     I think, currently it mixed usage because, the same variable declaration
>     >     used for C11 vs non C11 usage.Ideally we wont need "volatile" for C11
>     >     case. Either we need to change only to C11 mode OR have APIs for
>     >     atomic_load_acq_() and atomic_store_rel_() to allow both models like
>     >     Linux kernel and FreeBSD.
>     >
>     >     >
>     >     > -- Ola
>     >     >
>     >     >
>     >     >
>     >
>     >
> 
> 
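[Editor's note] The "pure C11" alternative discussed in this message, where an object is never accessed both atomically and non-atomically (the undefined behaviour quoted from C11 5.1.2.4), can be sketched as follows. The structure and helper names are illustrative assumptions, not DPDK code.

```c
#include <stdint.h>

/* The synchronisation variable is a plain (non-volatile) object and
 * every access to it goes through __atomic builtins, so atomic and
 * non-atomic accesses are never mixed on the same object. */
struct slot {
	uint32_t data;
	uint32_t tail;
};

static void
publish(struct slot *s, uint32_t value, uint32_t new_tail)
{
	s->data = value;	/* ordinary store ... */
	/* ... made visible to any thread that load-acquires 'tail' */
	__atomic_store_n(&s->tail, new_tail, __ATOMIC_RELEASE);
}

static int
try_consume(struct slot *s, uint32_t expected_tail, uint32_t *out)
{
	if (__atomic_load_n(&s->tail, __ATOMIC_ACQUIRE) != expected_tail)
		return 0;
	*out = s->data;		/* safe: ordered after the acquire */
	return 1;
}

/* Single-threaded driver for the sketch. */
static uint32_t
c11_demo(void)
{
	struct slot s = { 0, 0 };
	uint32_t v = 0;

	publish(&s, 42, 1);
	if (!try_consume(&s, 1, &v))
		return 0;
	return v;
}
```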

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 10:33                                               ` Gavin Hu (Arm Technology China)
  2018-10-08 10:39                                                 ` Ola Liljedahl
@ 2018-10-08 10:49                                                 ` Jerin Jacob
  2018-10-10  6:28                                                   ` Gavin Hu (Arm Technology China)
  1 sibling, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2018-10-08 10:49 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China)
  Cc: Ola Liljedahl, dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Steve Capper, nd, stable

-----Original Message-----
> Date: Mon, 8 Oct 2018 10:33:43 +0000
> From: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>
> To: Ola Liljedahl <Ola.Liljedahl@arm.com>, Jerin Jacob
>  <jerin.jacob@caviumnetworks.com>
> CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>  <konstantin.ananyev@intel.com>, Steve Capper <Steve.Capper@arm.com>, nd
>  <nd@arm.com>, "stable@dpdk.org" <stable@dpdk.org>
> Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
> 
> 
> I did benchmarking w/o and w/ the patch, it did not show any noticeable differences in terms of latency.
> Here is the full log( 3 runs w/o the patch and 2 runs w/ the patch).
> 
> sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=1024  -- -i

These counters are running at 100 MHz. Use PMU counters to get more
accurate results.

https://doc.dpdk.org/guides/prog_guide/profile_app.html
See: 55.2. Profiling on ARM64

> > -----Original Message-----
> > From: Ola Liljedahl
> > Sent: Monday, October 8, 2018 6:26 PM
> > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > Cc: dev@dpdk.org; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; Gavin Hu (Arm Technology China)
> > <Gavin.Hu@arm.com>; Steve Capper <Steve.Capper@arm.com>; nd
> > <nd@arm.com>; stable@dpdk.org
> > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> >
> >
> >
> > On 08/10/2018, 12:00, "Jerin Jacob" <jerin.jacob@caviumnetworks.com>
> > wrote:
> >
> >     -----Original Message-----
> >     > Date: Mon, 8 Oct 2018 09:22:05 +0000
> >     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
> >     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
> >     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
> >     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
> >     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd
> > <nd@arm.com>,
> >     >  "stable@dpdk.org" <stable@dpdk.org>
> >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
> >     >
> >     > External Email
> >     >
> >     > On 08/10/2018, 08:06, "Jerin Jacob" <jerin.jacob@caviumnetworks.com>
> > wrote:
> >     >
> >     >     -----Original Message-----
> >     >     > Date: Sun, 7 Oct 2018 20:44:54 +0000
> >     >     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
> >     >     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> >     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
> >     >     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
> >     >     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology
> > China)"
> >     >     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd
> > <nd@arm.com>,
> >     >     >  "stable@dpdk.org" <stable@dpdk.org>
> >     >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> >     >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
> >     >     >
> >     >
> >     >
> >     >     Could you please fix the email client for inline reply.
> >     > Sorry that doesn't seem to be possible with Outlook for Mac 16 or
> > Office365. The official Office365/Outlook
> >     > documentation doesn't match the actual user interface...
> >     >
> >     >
> >     >
> >     >     https://www.kernel.org/doc/html/v4.19-rc7/process/email-
> > clients.html
> >     >
> >     >
> >     >     >
> >     >     > On 07/10/2018, 06:03, "Jerin Jacob"
> > <jerin.jacob@caviumnetworks.com> wrote:
> >     >     >
> >     >     >     In arm64 case, it will have ATOMIC_RELAXED followed by asm
> > volatile ("":::"memory") of rte_pause().
> >     >     >     I would n't have any issue, if the generated code code is same or
> > better than the exiting case. but it not the case, Right?
> >     >     > The existing case is actually not interesting (IMO) as it exposes
> > undefined behaviour which allows the compiler to do anything. But you
> > seem to be satisfied with "works for me, right here right now". I think the
> > cost of avoiding undefined behaviour is acceptable (actually I don't think it
> > even will be noticeable).
> >     >
> >     >     I am not convinced because of use of volatile in head and tail indexes.
> >     >     For me that brings the defined behavior.
> >     > As long as you don't mix in C11 atomic accesses (just use "plain" accesses
> > to volatile objects),
> >     > it is AFAIK defined behaviour (but not necessarily using atomic loads and
> > stores). But I quoted
> >     > the C11 spec where it explicitly mentions that mixing atomic and non-
> > atomic accesses to the same
> >     > object is undefined behaviour. Don't argue with me, argue with the C11
> > spec.
> >     > If you want to disobey the spec, this should at least be called out for in
> > the code with a comment.
> >
> >     That's boils down only one question, should we follow C11 spec? Why not
> > only take load
> >     acquire and store release semantics only just like Linux kernel and FreeBSD.
> > And introduce even more undefined behaviour?
> >
> >     Does not look like C11 memory model is super efficient in term of gcc
> >     implementation.
> > You are making a chicken out of a feather.
> >
> > I think this "problem" with one additional ADD instruction will only concern
> > __atomic_load_n(__ATOMIC_RELAXED) and
> > __atomic_store_n(__ATOMIC_RELAXED) because the compiler separates
> > the address generation (add offset of struct member) from the load or store
> > itself. For other atomic operations and memory orderings (e.g.
> > __atomic_load_n(__ATOMIC_ACQUIRE), the extra ADD instruction will be
> > included anyway (as long as we access a non-first struct member) because
> > e.g. LDAR only accepts a base register with no offset.
> >
> > I suggest minimising the imposed memory orderings can have a much larger
> > (positive) effect on performance compared to avoiding one ADD instruction
> > (memory accesses are much slower than CPU ALU instructions).
> > Using C11 memory model and identifying exactly which objects are used for
> > synchronisation and whether (any) updates to shared memory are acquired
> > or released (no updates to shared memory means relaxed order can be used)
> > will provide maximum freedom to the compiler and hardware to get the best
> > result.
> >
> > The FreeBSD and DPDK ring buffers show some fundamental
> > misunderstandings here. Instead excessive orderings and explicit barriers
> > have been used as band-aids, with unknown effects on performance.
> >
> >
> >     >
> >     >
> >     >     That's the reason why I shared
> >     >     the generated assembly code. If you think otherwise, pick any compiler
> >     >     and see the generated output.
> >     > This is what one compiler for one architecture generates today. These
> > things change. Other things
> >     > that used to work or worked for some specific architecture have stopped
> > working in newer versions of
> >     > the compiler.
> >     >
> >     >
> >     >     And
> >     >
> >     >     The FreeBSD implementation of the ring buffer (which DPDK derived
> > from) doesn't have
> >     >     such logic, see
> > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
> >     > It looks like FreeBSD uses some kind of C11 atomic memory model-
> > inspired API although I don't see
> >     > exactly how e.g. atomic_store_rel_int() is implemented. The code also
> > mixes in explicit barriers
> >     > so definitively not pure C11 memory model usage. And finally, it doesn't
> > establish the proper
> >     > load-acquire/store-release relationships (e.g. store-release cons_tail
> > requires a load-acquire cons_tail,
> >     > same for prod_tail).
> >     >
> >     > "* multi-producer safe lock-free ring buffer enqueue"
> >     > The comment is also wrong. This design is not lock-free, how could it be
> > when there is spinning
> >     > (waiting) for other threads in the code? If a thread must wait for other
> > threads, then by definition
> >     > the design is blocking.
> >     >
> >     > So you are saying that because FreeBSD is doing it wrong, DPDK can also
> > do it wrong?
> >     >
> >     >
> >     >     See below too.
> >     >
> >     >     >
> >     >     > Skipping the compiler memory barrier in rte_pause() potentially
> > allows for optimisations that provide much more benefit, e.g. hiding some
> > cache miss latency for later loads. The DPDK ring buffer implementation is
> > defined so to enable inlining of enqueue/dequeue functions into the caller,
> > any code could immediately follow these calls.
> >     >     >
> >     >     > From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x
> >     >     > Programming languages — C
> >     >     >
> >     >     > 5.1.2.4
> >     >     > 4 Two expression evaluations conflict if one of them modifies a
> > memory location and the other one reads or modifies the same memory
> > location.
> >     >     >
> >     >     > 25 The execution of a program contains a data race if it contains two
> > conflicting actions in different threads, at least one of which is not atomic,
> > and neither happens before the other. Any such data race results in
> > undefined behavior.
> >     >
> >     >     IMO, both conditions will be satisfied if the variable is volatile and a 32-bit read
> > is atomic
> >     >     on 32b and 64b machines. If not, the problem persists for the generic case
> >     >     as well (lib/librte_ring/rte_ring_generic.h).
> >     > The read from a volatile object is not an atomic access per the C11 spec. It
> > just happens to
> >     > be translated to an instruction (on x86-64 and AArch64/A64) that
> > implements an atomic load.
> >     > I don't think any compiler would change this code generation and
> > suddenly generate some
> >     > non-atomic load instruction for a program that *only* uses volatile to do
> > "atomic" accesses.
> >     > But a future compiler could detect the mix of atomic and non-atomic
> > accesses and mark this
> >     > expression as causing undefined behaviour and that would have
> > consequences for code generation.
> >     >
> >     >
> >     >     I agree with you on C11 memory model semantics usage. The reason
> > why I
> >     >     proposed the file name rte_ring_c11_mem.h is that DPDK itself did
> > not
> >     >     have definitions for load-acquire and store-release semantics.
> >     >     I was looking at taking load-acquire and store-release semantics
> >     >     from C11 instead of creating a new API like the Linux kernel or FreeBSD
> >     >     (APIs like atomic_load_acq_32(), atomic_store_rel_32()). If the file name
> > is your
> >     >     concern then we could create new abstractions as well. That would
> > help
> >     >     the existing KNI problem as well.
> >     > I appreciate your embrace of the C11 memory model. I think it is better
> > for describing
> >     > (both to the compiler and to humans) which and how objects are used
> > for synchronisation.
> >     >
> >     > However, I don't think an API as you suggest (and others have suggested
> > before, e.g. as
> >     > done in ODP) is a good idea. There is an infinite amount of possible base
> > types, an
> >     > increasing number of operations and a bunch of different memory
> > orderings, a "complete"
> >     > API would be very large and difficult to test, and most members of the
> > API would never be used.
> >     > GCC and Clang both support the __atomic intrinsics. This API avoids the
> > problems I
> >     > described above. Or we could use the official C11 syntax (stdatomic.h).
> > But then we
> >     > have the problem with using pre-C11 compilers...
> >
> >     I have no objection if everyone agrees to move to the C11 memory model
> >     with __atomic intrinsics. But if we need to keep both, then an
> >     atomic_load_acq_32() kind of API makes sense.
> >
> >
> >     >
> >     >
> >     >
> >     >
> >     >     I think it is currently mixed usage because the same variable declaration
> >     >     is used for C11 vs non-C11 usage. Ideally we won't need "volatile" for the
> >     >     C11 case. Either we need to change only to C11 mode OR have APIs for
> >     >     atomic_load_acq_() and atomic_store_rel_() to allow both models like the
> >     >     Linux kernel and FreeBSD.
> >     >
> >     >     >
> >     >     > -- Ola
> >     >     >
> >     >     >
> >     >     >
> >     >
> >     >
> >
> 

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 10:46                                               ` Jerin Jacob
@ 2018-10-08 11:21                                                 ` Ola Liljedahl
  2018-10-08 11:50                                                   ` Jerin Jacob
  0 siblings, 1 reply; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-08 11:21 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable



On 08/10/2018, 12:47, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:

    -----Original Message-----
    > Date: Mon, 8 Oct 2018 10:25:45 +0000
    > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
    > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
    >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
    >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
    >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
    >  "stable@dpdk.org" <stable@dpdk.org>
    > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    > user-agent: Microsoft-MacOutlook/10.11.0.180909
    > 
    > 
    > On 08/10/2018, 12:00, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
    > 
    >     -----Original Message-----
    >     > Date: Mon, 8 Oct 2018 09:22:05 +0000
    >     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
    >     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
    >     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
    >     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
    >     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
    >     >  "stable@dpdk.org" <stable@dpdk.org>
    >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
    >     >
    >     > External Email
    >     >
    >     > On 08/10/2018, 08:06, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
    >     >
    >     >     -----Original Message-----
    >     >     > Date: Sun, 7 Oct 2018 20:44:54 +0000
    >     >     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
    >     >     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    >     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
    >     >     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
    >     >     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
    >     >     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
    >     >     >  "stable@dpdk.org" <stable@dpdk.org>
    >     >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    >     >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
    >     >     >
    >     >
    >     >
    >     >     Could you please fix the email client for inline reply.
    >     > Sorry that doesn't seem to be possible with Outlook for Mac 16 or Office365. The official Office365/Outlook
    >     > documentation doesn't match the actual user interface...
    >     >
    >     >
    >     >
    >     >     https://www.kernel.org/doc/html/v4.19-rc7/process/email-clients.html
    >     >
    >     >
    >     >     >
    >     >     > On 07/10/2018, 06:03, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
    >     >     >
    >     >     >     In arm64 case, it will have ATOMIC_RELAXED followed by asm volatile ("":::"memory") of rte_pause().
>     >     >     >     I wouldn't have any issue if the generated code is the same or better than in the existing case, but that is not the case, right?
    >     >     > The existing case is actually not interesting (IMO) as it exposes undefined behaviour which allows the compiler to do anything. But you seem to be satisfied with "works for me, right here right now". I think the cost of avoiding undefined behaviour is acceptable (actually I don't think it even will be noticeable).
    >     >
    >     >     I am not convinced because of use of volatile in head and tail indexes.
    >     >     For me that brings the defined behavior.
    >     > As long as you don't mix in C11 atomic accesses (just use "plain" accesses to volatile objects),
    >     > it is AFAIK defined behaviour (but not necessarily using atomic loads and stores). But I quoted
    >     > the C11 spec where it explicitly mentions that mixing atomic and non-atomic accesses to the same
    >     > object is undefined behaviour. Don't argue with me, argue with the C11 spec.
>     >     > If you want to disobey the spec, this should at least be called out in the code with a comment.
    > 
>     That boils down to only one question: should we follow the C11 spec? Why not just take
>     load-acquire and store-release semantics, like the Linux kernel and FreeBSD.
    > And introduce even more undefined behaviour?
    
    
    Yes. The whole world (Linux and FreeBSD) is running with undefined behavior, and still it runs.
Are you saying that Linux kernel is using GCC __atomic or C11 _Atomic/stdatomic.h features?
I can't see any traces of this in e.g. Linux 4.6.

It seems like you don't understand. The undefined behaviour comes from mixing non-atomic and atomic accesses
in the C11 source code. It has nothing to do with whether a read or write of a volatile object is translated
to an instruction that may or may not provide an atomic load or store for some specific architecture that
you are compiling for.

Since the Linux kernel (AFAIK) doesn't use C11 _Atomic datatypes or GCC __atomic builtins, there is no
undefined behaviour per the C11 standard. But it is relying on implementation specific behaviour (which
I grant is not so likely to change due to backwards compatibility requirements).

    
    
    
    > 
>     It does not look like the C11 memory model is super efficient in terms of
>     the gcc implementation.
    > You are making a chicken out of a feather.
    > 
>     > I think this "problem" with one additional ADD instruction will only concern __atomic_load_n(__ATOMIC_RELAXED) and __atomic_store_n(__ATOMIC_RELAXED) because the compiler separates the address generation (add offset of struct member) from the load or store itself. For other atomic operations and memory orderings (e.g. __atomic_load_n(__ATOMIC_ACQUIRE)), the extra ADD instruction will be included anyway (as long as we access a non-first struct member) because e.g. LDAR only accepts a base register with no offset.
    > 
>     > I suggest that minimising the imposed memory orderings can have a much larger (positive) effect on performance compared to avoiding one ADD instruction (memory accesses are much slower than CPU ALU instructions).
    > Using C11 memory model and identifying exactly which objects are used for synchronisation and whether (any) updates to shared memory are acquired or released (no updates to shared memory means relaxed order can be used) will provide maximum freedom to the compiler and hardware to get the best result.
    
    No more comments on this. It is not data driven.
Now this is something that would be interesting to benchmark. I don't know to what
extent current compilers actually utilise the freedoms to optimise memory accesses
according to e.g. acquire and release ordering of surrounding atomic operations.
But I would like to know and this could possibly also lead to suggestions to compiler
developers. I don't think there are too many multithreaded benchmarks and even fewer
which don't use locks and other synchronisation mechanisms from OS's and libraries.
But I have some multithreaded applications and benchmarks.


    
    
    > 
    > The FreeBSD and DPDK ring buffers show some fundamental misunderstandings here. Instead excessive orderings and explicit barriers have been used as band-aids, with unknown effects on performance.
    > 
    > 
    >     >
    >     >
>     >     >     That's the reason why I shared
>     >     >     the generated assembly code. If you think otherwise, pick any compiler
>     >     >     and see the generated output.
    >     > This is what one compiler for one architecture generates today. These things change. Other things
>     >     > that used to work or worked for some specific architecture have stopped working in newer versions of
    >     > the compiler.
    >     >
    >     >
    >     >     And
    >     >
>     >     >     The FreeBSD implementation of the ring buffer (which DPDK derived from) doesn't have
>     >     >     such logic, see https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
    >     > It looks like FreeBSD uses some kind of C11 atomic memory model-inspired API although I don't see
    >     > exactly how e.g. atomic_store_rel_int() is implemented. The code also mixes in explicit barriers
    >     > so definitively not pure C11 memory model usage. And finally, it doesn't establish the proper
    >     > load-acquire/store-release relationships (e.g. store-release cons_tail requires a load-acquire cons_tail,
    >     > same for prod_tail).
    >     >
    >     > "* multi-producer safe lock-free ring buffer enqueue"
    >     > The comment is also wrong. This design is not lock-free, how could it be when there is spinning
    >     > (waiting) for other threads in the code? If a thread must wait for other threads, then by definition
    >     > the design is blocking.
    >     >
    >     > So you are saying that because FreeBSD is doing it wrong, DPDK can also do it wrong?
    >     >
    >     >
    >     >     See below too.
    >     >
    >     >     >
    >     >     > Skipping the compiler memory barrier in rte_pause() potentially allows for optimisations that provide much more benefit, e.g. hiding some cache miss latency for later loads. The DPDK ring buffer implementation is defined so to enable inlining of enqueue/dequeue functions into the caller, any code could immediately follow these calls.
    >     >     >
    >     >     > From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x
    >     >     > Programming languages — C
    >     >     >
    >     >     > 5.1.2.4
    >     >     > 4 Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.
    >     >     >
    >     >     > 25 The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
    >     >
>     >     >     IMO, both conditions will be satisfied if the variable is volatile and a 32-bit read is atomic
>     >     >     on 32b and 64b machines. If not, the problem persists for the generic case
>     >     >     as well (lib/librte_ring/rte_ring_generic.h).
    >     > The read from a volatile object is not an atomic access per the C11 spec. It just happens to
    >     > be translated to an instruction (on x86-64 and AArch64/A64) that implements an atomic load.
    >     > I don't think any compiler would change this code generation and suddenly generate some
    >     > non-atomic load instruction for a program that *only* uses volatile to do "atomic" accesses.
    >     > But a future compiler could detect the mix of atomic and non-atomic accesses and mark this
    >     > expression as causing undefined behaviour and that would have consequences for code generation.
    >     >
    >     >
>     >     >     I agree with you on C11 memory model semantics usage. The reason why I
>     >     >     proposed the file name rte_ring_c11_mem.h is that DPDK itself did not
>     >     >     have definitions for load-acquire and store-release semantics.
>     >     >     I was looking at taking load-acquire and store-release semantics
>     >     >     from C11 instead of creating a new API like the Linux kernel or FreeBSD
>     >     >     (APIs like atomic_load_acq_32(), atomic_store_rel_32()). If the file name is your
>     >     >     concern then we could create new abstractions as well. That would help
>     >     >     the existing KNI problem as well.
    >     > I appreciate your embrace of the C11 memory model. I think it is better for describing
    >     > (both to the compiler and to humans) which and how objects are used for synchronisation.
    >     >
    >     > However, I don't think an API as you suggest (and others have suggested before, e.g. as
    >     > done in ODP) is a good idea. There is an infinite amount of possible base types, an
    >     > increasing number of operations and a bunch of different memory orderings, a "complete"
    >     > API would be very large and difficult to test, and most members of the API would never be used.
    >     > GCC and Clang both support the __atomic intrinsics. This API avoids the problems I
    >     > described above. Or we could use the official C11 syntax (stdatomic.h). But then we
    >     > have the problem with using pre-C11 compilers...
    > 
>     I have no objection if everyone agrees to move to the C11 memory model
>     with __atomic intrinsics. But if we need to keep both, then an
>     atomic_load_acq_32() kind of API makes sense.
    > 
    > 
    >     >
    >     >
    >     >
    >     >
>     >     >     I think it is currently mixed usage because the same variable declaration
>     >     >     is used for C11 vs non-C11 usage. Ideally we won't need "volatile" for the
>     >     >     C11 case. Either we need to change only to C11 mode OR have APIs for
>     >     >     atomic_load_acq_() and atomic_store_rel_() to allow both models like the
>     >     >     Linux kernel and FreeBSD.
    >     >
    >     >     >
    >     >     > -- Ola
    >     >     >
    >     >     >
    >     >     >
    >     >
    >     >
    > 
    > 
    


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 11:21                                                 ` Ola Liljedahl
@ 2018-10-08 11:50                                                   ` Jerin Jacob
  2018-10-08 11:59                                                     ` Ola Liljedahl
  0 siblings, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2018-10-08 11:50 UTC (permalink / raw)
  To: Ola Liljedahl
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable

-----Original Message-----
> Date: Mon, 8 Oct 2018 11:21:42 +0000
> From: Ola Liljedahl <Ola.Liljedahl@arm.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
>  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
>  "stable@dpdk.org" <stable@dpdk.org>
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> user-agent: Microsoft-MacOutlook/10.11.0.180909
> 
> 
> On 08/10/2018, 12:47, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
> 
>     -----Original Message-----
>     > Date: Mon, 8 Oct 2018 10:25:45 +0000
>     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
>     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
>     >  "stable@dpdk.org" <stable@dpdk.org>
>     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     >
>     >
>     > On 08/10/2018, 12:00, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
>     >
>     >     -----Original Message-----
>     >     > Date: Mon, 8 Oct 2018 09:22:05 +0000
>     >     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     >     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>     >     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>     >     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
>     >     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
>     >     >  "stable@dpdk.org" <stable@dpdk.org>
>     >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     >     >
>     >     > External Email
>     >     >
>     >     > On 08/10/2018, 08:06, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
>     >     >
>     >     >     -----Original Message-----
>     >     >     > Date: Sun, 7 Oct 2018 20:44:54 +0000
>     >     >     > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     >     >     > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     >     >     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>     >     >     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>     >     >     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
>     >     >     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
>     >     >     >  "stable@dpdk.org" <stable@dpdk.org>
>     >     >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     >     >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     >     >     >
>     >     >
>     >     >
>     >     >     Could you please fix the email client for inline reply.
>     >     > Sorry that doesn't seem to be possible with Outlook for Mac 16 or Office365. The official Office365/Outlook
>     >     > documentation doesn't match the actual user interface...
>     >     >
>     >     >
>     >     >
>     >     >     https://www.kernel.org/doc/html/v4.19-rc7/process/email-clients.html
>     >     >
>     >     >
>     >     >     >
>     >     >     > On 07/10/2018, 06:03, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
>     >     >     >
>     >     >     >     In arm64 case, it will have ATOMIC_RELAXED followed by asm volatile ("":::"memory") of rte_pause().
>     >     >     >     I wouldn't have any issue if the generated code is the same or better than in the existing case, but that is not the case, right?
>     >     >     > The existing case is actually not interesting (IMO) as it exposes undefined behaviour which allows the compiler to do anything. But you seem to be satisfied with "works for me, right here right now". I think the cost of avoiding undefined behaviour is acceptable (actually I don't think it even will be noticeable).
>     >     >
>     >     >     I am not convinced because of use of volatile in head and tail indexes.
>     >     >     For me that brings the defined behavior.
>     >     > As long as you don't mix in C11 atomic accesses (just use "plain" accesses to volatile objects),
>     >     > it is AFAIK defined behaviour (but not necessarily using atomic loads and stores). But I quoted
>     >     > the C11 spec where it explicitly mentions that mixing atomic and non-atomic accesses to the same
>     >     > object is undefined behaviour. Don't argue with me, argue with the C11 spec.
>     >     > If you want to disobey the spec, this should at least be called out in the code with a comment.
>     >
>     >     That boils down to only one question: should we follow the C11 spec? Why not just take
>     >     load-acquire and store-release semantics, like the Linux kernel and FreeBSD.
>     > And introduce even more undefined behaviour?
> 
> 
>     Yes. The whole world (Linux and FreeBSD) is running with undefined behavior, and still it runs.
> Are you saying that Linux kernel is using GCC __atomic or C11 _Atomic/stdatomic.h features?

No.

Since your email client is removing the context, I will clarify.

I asked:
That boils down to only one question: should we follow the C11 spec? Why not
just take load-acquire and store-release semantics, like the Linux kernel and FreeBSD.

You replied:
And introduce even more undefined behavior?

I don't know how that creates more undefined behavior. So I replied, in the
context of your reply, that according to your view even Linux is running
with undefined behavior.



> I can't see any traces of this in e.g. Linux 4.6.
> 
> It seems like you don't understand. The undefined behaviour comes from mixing non-atomic and atomic accesses
> in the C11 source code. It has nothing to do with whether a read or write of a volatile object is translated
> to an instruction that may or may not provide an atomic load or store for some specific architecture that
> you are compiling for.
> 
> Since the Linux kernel (AFAIK) doesn't use C11 _Atomic datatypes or GCC __atomic builtins, there is no
> undefined behaviour per the C11 standard. But it is relying on implementation specific behaviour (which
> I grant is not so likely to change due to backwards compatibility requirements).
> 
> 
> 
> 
>     >
>     >     It does not look like the C11 memory model is super efficient in terms of
>     >     the gcc implementation.
>     > You are making a chicken out of a feather.
>     >
>     >     > I think this "problem" with one additional ADD instruction will only concern __atomic_load_n(__ATOMIC_RELAXED) and __atomic_store_n(__ATOMIC_RELAXED) because the compiler separates the address generation (add offset of struct member) from the load or store itself. For other atomic operations and memory orderings (e.g. __atomic_load_n(__ATOMIC_ACQUIRE)), the extra ADD instruction will be included anyway (as long as we access a non-first struct member) because e.g. LDAR only accepts a base register with no offset.
>     >
>     >     > I suggest that minimising the imposed memory orderings can have a much larger (positive) effect on performance compared to avoiding one ADD instruction (memory accesses are much slower than CPU ALU instructions).
>     > Using C11 memory model and identifying exactly which objects are used for synchronisation and whether (any) updates to shared memory are acquired or released (no updates to shared memory means relaxed order can be used) will provide maximum freedom to the compiler and hardware to get the best result.
> 
>     No more comments on this. It is not data driven.
> Now this is something that would be interesting to benchmark. I don't know to what
> extent current compilers actually utilise the freedoms to optimise memory accesses
> according to e.g. acquire and release ordering of surrounding atomic operations.
> But I would like to know and this could possibly also lead to suggestions to compiler
> developers. I don't think there are too many multithreaded benchmarks and even fewer
> which don't use locks and other synchronisation mechanisms from OS's and libraries.
> But I have some multithreaded applications and benchmarks.
> 
> 
> 
> 
>     >
>     > The FreeBSD and DPDK ring buffers show some fundamental misunderstandings here. Instead excessive orderings and explicit barriers have been used as band-aids, with unknown effects on performance.
>     >
>     >
>     >     >
>     >     >
>     >     >     That's the reason why I shared
>     >     >     the generated assembly code. If you think otherwise, pick any compiler
>     >     >     and see the generated output.
>     >     > This is what one compiler for one architecture generates today. These things change. Other things
>     >     > that used to work or worked for some specific architecture have stopped working in newer versions of
>     >     > the compiler.
>     >     >
>     >     >
>     >     >     And
>     >     >
>     >     >     The FreeBSD implementation of the ring buffer (which DPDK derived from) doesn't have
>     >     >     such logic, see https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
>     >     > It looks like FreeBSD uses some kind of C11 atomic memory model-inspired API although I don't see
>     >     > exactly how e.g. atomic_store_rel_int() is implemented. The code also mixes in explicit barriers
>     >     > so definitively not pure C11 memory model usage. And finally, it doesn't establish the proper
>     >     > load-acquire/store-release relationships (e.g. store-release cons_tail requires a load-acquire cons_tail,
>     >     > same for prod_tail).
>     >     >
>     >     > "* multi-producer safe lock-free ring buffer enqueue"
>     >     > The comment is also wrong. This design is not lock-free, how could it be when there is spinning
>     >     > (waiting) for other threads in the code? If a thread must wait for other threads, then by definition
>     >     > the design is blocking.
>     >     >
>     >     > So you are saying that because FreeBSD is doing it wrong, DPDK can also do it wrong?
>     >     >
>     >     >
>     >     >     See below too.
>     >     >
>     >     >     >
>     >     >     > Skipping the compiler memory barrier in rte_pause() potentially allows for optimisations that provide much more benefit, e.g. hiding some cache miss latency for later loads. The DPDK ring buffer implementation is defined so to enable inlining of enqueue/dequeue functions into the caller, any code could immediately follow these calls.
>     >     >     >
>     >     >     > From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x
>     >     >     > Programming languages — C
>     >     >     >
>     >     >     > 5.1.2.4
>     >     >     > 4 Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.
>     >     >     >
>     >     >     > 25 The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
>     >     >
>     >     >     IMO, both conditions will be satisfied if the variable is volatile and a 32-bit read is atomic
>     >     >     on 32b and 64b machines. If not, the problem persists for the generic case
>     >     >     as well (lib/librte_ring/rte_ring_generic.h)
>     >     > The read from a volatile object is not an atomic access per the C11 spec. It just happens to
>     >     > be translated to an instruction (on x86-64 and AArch64/A64) that implements an atomic load.
>     >     > I don't think any compiler would change this code generation and suddenly generate some
>     >     > non-atomic load instruction for a program that *only* uses volatile to do "atomic" accesses.
>     >     > But a future compiler could detect the mix of atomic and non-atomic accesses and mark this
>     >     > expression as causing undefined behaviour and that would have consequences for code generation.
>     >     >
>     >     >
>     >     >     I agree with you on C11 memory model semantics usage. The reason why I
>     >     >     propose the name rte_ring_c11_mem.h for the file is that DPDK itself did not
>     >     >     have definitions for load acquire and store release semantics.
>     >     >     I was looking for taking load acquire and store release semantics
>     >     >     from C11 instead of creating a new API like the Linux kernel or FreeBSD (APIs
>     >     >     like atomic_load_acq_32(), atomic_store_rel_32()). If the file name is your
>     >     >     concern then we could create new abstractions as well. That would help the
>     >     >     existing KNI problem as well.
>     >     > I appreciate your embrace of the C11 memory model. I think it is better for describing
>     >     > (both to the compiler and to humans) which and how objects are used for synchronisation.
>     >     >
>     >     > However, I don't think an API as you suggest (and others have suggested before, e.g. as
>     >     > done in ODP) is a good idea. There is an infinite amount of possible base types, an
>     >     > increasing number of operations and a bunch of different memory orderings, a "complete"
>     >     > API would be very large and difficult to test, and most members of the API would never be used.
>     >     > GCC and Clang both support the __atomic intrinsics. This API avoids the problems I
>     >     > described above. Or we could use the official C11 syntax (stdatomic.h). But then we
>     >     > have the problem with using pre-C11 compilers...
>     >
>     >     I have no objection, if everyone agrees to move C11 memory model
>     >     with __atomic intrinsics. But if we need to keep both have then
>     >     atomic_load_acq_32() kind of API make sense.
>     >
>     >
>     >     >
>     >     >
>     >     >
>     >     >
>     >     >     I think, currently it mixed usage because, the same variable declaration
>     >     >     used for C11 vs non-C11 usage. Ideally we won't need "volatile" for C11
>     >     >     case. Either we need to change only to C11 mode OR have APIs for
>     >     >     atomic_load_acq_() and atomic_store_rel_() to allow both models like
>     >     >     Linux kernel and FreeBSD.
>     >     >
>     >     >     >
>     >     >     > -- Ola
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >
>     >     >
>     >
>     >
> 
> 

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 11:50                                                   ` Jerin Jacob
@ 2018-10-08 11:59                                                     ` Ola Liljedahl
  2018-10-08 12:05                                                       ` Jerin Jacob
  0 siblings, 1 reply; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-08 11:59 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable



On 08/10/2018, 13:50, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:

    
    I don't know how that creates more undefined behavior. So replied in the
    context of your reply that, according to your view even Linux is running
    with undefined behavior.
    
As I explained, Linux does not use C11 atomics (nor GCC __atomic builtins) so
cannot express the kind of undefined behaviour caused by mixing conflicting atomic
(as defined by the C11 standard) and non-atomic accesses to the same object.

Checked the latest version from https://github.com/torvalds/linux



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 11:59                                                     ` Ola Liljedahl
@ 2018-10-08 12:05                                                       ` Jerin Jacob
  2018-10-08 12:20                                                         ` Jerin Jacob
  0 siblings, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2018-10-08 12:05 UTC (permalink / raw)
  To: Ola Liljedahl
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable

-----Original Message-----
> Date: Mon, 8 Oct 2018 11:59:16 +0000
> From: Ola Liljedahl <Ola.Liljedahl@arm.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
>  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
>  "stable@dpdk.org" <stable@dpdk.org>
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> user-agent: Microsoft-MacOutlook/10.11.0.180909
> 
> 
> On 08/10/2018, 13:50, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
> 
> 
>     I don't know how that creates more undefined behavior. So replied in the
>     context of your reply that, according to your view even Linux is running
>     with undefined behavior.
> 
> As I explained, Linux does not use C11 atomics (nor GCC __atomic builtins) so
> cannot express the kind of undefined behaviour caused by mixing conflicting atomic
> (as defined by the C11 standard) and non-atomic accesses to the same object.
> 
> Checked the latest version from https://github.com/torvalds/linux

Yet another top post. So you removed the complete earlier context. Never
mind.

I am not saying Linux is using C11 atomic. I asked, Can't we follow
like Linux to use the HW feature of load acquire and store release
semantics with introducing C11 memory model.



> 
> 

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 12:05                                                       ` Jerin Jacob
@ 2018-10-08 12:20                                                         ` Jerin Jacob
  2018-10-08 12:30                                                           ` Ola Liljedahl
  0 siblings, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2018-10-08 12:20 UTC (permalink / raw)
  To: Ola Liljedahl
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable

-----Original Message-----
> Date: Mon, 8 Oct 2018 17:35:25 +0530
> From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> To: Ola Liljedahl <Ola.Liljedahl@arm.com>
> CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
>  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
>  "stable@dpdk.org" <stable@dpdk.org>
> Subject: Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
> User-Agent: Mutt/1.10.1 (2018-07-13)
> 
> External Email
> 
> -----Original Message-----
> > Date: Mon, 8 Oct 2018 11:59:16 +0000
> > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
> >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
> >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
> >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
> >  "stable@dpdk.org" <stable@dpdk.org>
> > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> > user-agent: Microsoft-MacOutlook/10.11.0.180909
> >
> >
> > On 08/10/2018, 13:50, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
> >
> >
> >     I don't know how that creates more undefined behavior. So replied in the
> >     context of your reply that, according to your view even Linux is running
> >     with undefined behavior.
> >
> > As I explained, Linux does not use C11 atomics (nor GCC __atomic builtins) so
> > cannot express the kind of undefined behaviour caused by mixing conflicting atomic
> > (as defined by the C11 standard) and non-atomic accesses to the same object.
> >
> > Checked the latest version from https://github.com/torvalds/linux
> 
> Yet another top post. So you removed the complete earlier context. Never
> mind.
> 
> I am not saying Linux is using C11 atomic. I asked, Can't we follow
> like Linux to use the HW feature of load acquire and store release
> semantics with introducing C11 memory model.

correction:

s/with introducing C11 memory model/with out introducing C11 memory model

> 
> 
> 
> >
> >

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 12:20                                                         ` Jerin Jacob
@ 2018-10-08 12:30                                                           ` Ola Liljedahl
  2018-10-09  8:53                                                             ` Olivier Matz
  0 siblings, 1 reply; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-08 12:30 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable



On 08/10/2018, 14:21, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:

    -----Original Message-----
    > Date: Mon, 8 Oct 2018 17:35:25 +0530
    > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    > To: Ola Liljedahl <Ola.Liljedahl@arm.com>
    > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
    >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
    >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
    >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
    >  "stable@dpdk.org" <stable@dpdk.org>
    > Subject: Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
    > User-Agent: Mutt/1.10.1 (2018-07-13)
    > 
    > External Email
    > 
    > -----Original Message-----
    > > Date: Mon, 8 Oct 2018 11:59:16 +0000
    > > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
    > > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
    > > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
    > >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
    > >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
    > >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
    > >  "stable@dpdk.org" <stable@dpdk.org>
    > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    > > user-agent: Microsoft-MacOutlook/10.11.0.180909
    > >
    > >
    > > On 08/10/2018, 13:50, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
    > >
    > >
    > >     I don't know how that creates more undefined behavior. So replied in the
    > >     context of your reply that, according to your view even Linux is running
    > >     with undefined behavior.
    > >
    > > As I explained, Linux does not use C11 atomics (nor GCC __atomic builtins) so
    > > cannot express the kind of undefined behaviour caused by mixing conflicting atomic
    > > (as defined by the C11 standard) and non-atomic accesses to the same object.
    > >
    > > Checked the latest version from https://github.com/torvalds/linux
    > 
    > Yet another top post. So you removed the complete earlier context. Never
    > mind.
Top post? My reply is under your text. As is this.

Don't blame my stupid mail agent on your misunderstanding of C11.

    > 
    > I am not saying Linux is using C11 atomic. I asked, Can't we follow
    > like Linux to use the HW feature of load acquire and store release
    > semantics with introducing C11 memory model.
    
    correction:
    
    s/with introducing C11 memory model/with out introducing C11 memory model
You can generate e.g. AArch64/A64 LDAR and STLR instructions using inline assembler.
But you won't be able to specify acquire and release ordering to the compiler, so you
must specify a full memory barrier instead.

But why create a C11-like but custom DPDK specific memory model when the compiler
already supports a standardised, well defined and tested memory model? You would just be
creating a mountain of technical debt.

    
    > 
    > 
    > 
    > >
    > >
    


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08  9:22                                         ` Ola Liljedahl
  2018-10-08 10:00                                           ` Jerin Jacob
@ 2018-10-08 14:43                                           ` Bruce Richardson
  2018-10-08 14:46                                             ` Ola Liljedahl
  1 sibling, 1 reply; 131+ messages in thread
From: Bruce Richardson @ 2018-10-08 14:43 UTC (permalink / raw)
  To: Ola Liljedahl
  Cc: Jerin Jacob, dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable

On Mon, Oct 08, 2018 at 09:22:05AM +0000, Ola Liljedahl wrote:
<snip> 
> "* multi-producer safe lock-free ring buffer enqueue"
> The comment is also wrong. This design is not lock-free, how could it be when there is spinning
> (waiting) for other threads in the code? If a thread must wait for other threads, then by definition
> the design is blocking.
>
My understanding is that the code is lock-free but not wait-free, though
I'm not an expert in this area.

/Bruce

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 14:43                                           ` Bruce Richardson
@ 2018-10-08 14:46                                             ` Ola Liljedahl
  2018-10-08 15:45                                               ` Ola Liljedahl
  0 siblings, 1 reply; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-08 14:46 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Jerin Jacob, dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable



On 08/10/2018, 16:44, "Bruce Richardson" <bruce.richardson@intel.com> wrote:

    On Mon, Oct 08, 2018 at 09:22:05AM +0000, Ola Liljedahl wrote:
    <snip> 
    > "* multi-producer safe lock-free ring buffer enqueue"
    > The comment is also wrong. This design is not lock-free, how could it be when there is spinning
    > (waiting) for other threads in the code? If a thread must wait for other threads, then by definition
    > the design is blocking.
    >
    My understanding is that the code is lock-free but not wait-free, though
    I'm not an expert in this area.
Notice this code:
	while (br->br_cons_tail != cons_head)
		cpu_spinwait();
Waiting for another thread to update a specific location => blocking.
Sure, the code doesn't use locks but that doesn't make it lock-free (in the computer science meaning).


    
    /Bruce
    


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 14:46                                             ` Ola Liljedahl
@ 2018-10-08 15:45                                               ` Ola Liljedahl
  0 siblings, 0 replies; 131+ messages in thread
From: Ola Liljedahl @ 2018-10-08 15:45 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Jerin Jacob, dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable



On 08/10/2018, 16:46, "Ola Liljedahl" <Ola.Liljedahl@arm.com> wrote:

    
    
    On 08/10/2018, 16:44, "Bruce Richardson" <bruce.richardson@intel.com> wrote:
    
        On Mon, Oct 08, 2018 at 09:22:05AM +0000, Ola Liljedahl wrote:
        <snip> 
        > "* multi-producer safe lock-free ring buffer enqueue"
        > The comment is also wrong. This design is not lock-free, how could it be when there is spinning
        > (waiting) for other threads in the code? If a thread must wait for other threads, then by definition
        > the design is blocking.
        >
        My understanding is that the code is lock-free but not wait-free, though
        I'm not an expert in this area.
    Notice this code:
    	while (br->br_cons_tail != cons_head)
    		cpu_spinwait();
    Waiting for another thread to update a specific location => blocking.
    Sure, the code doesn't use locks but that doesn't make it lock-free (in the computer science meaning).
Well to be more specific. The BSD and DPDK ring buffer is lock-free between consumer-side and producer-side
(regardless whether SP/SC or MP/MC). But it is blocking between multiple producers and between multiple consumers.
The waiting for tail here is similar to a ticket lock. You take a ticket using CAS (because the ticket space
isn't infinite), you can overlap processing (because all threads read or write different slots in the ring),
but then you must release your ticket in the order in which it was taken. So you wait for any earlier thread(s).
Waiting is blocking. If any earlier thread doesn't release its updates, you get stuck.
    
    
        
        /Bruce
        
    
    


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 10:00                                           ` Jerin Jacob
  2018-10-08 10:25                                             ` Ola Liljedahl
@ 2018-10-09  3:16                                             ` Honnappa Nagarahalli
  1 sibling, 0 replies; 131+ messages in thread
From: Honnappa Nagarahalli @ 2018-10-09  3:16 UTC (permalink / raw)
  To: Jerin Jacob, Ola Liljedahl
  Cc: dev, Ananyev, Konstantin, Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable

> >     > On 07/10/2018, 06:03, "Jerin Jacob" <jerin.jacob@caviumnetworks.com>
> wrote:
> >     >
> >     >     In arm64 case, it will have ATOMIC_RELAXED followed by asm volatile
> ("":::"memory") of rte_pause().
> >     >     I wouldn't have any issue if the generated code is the same or
> better than the existing case, but that is not the case, right?
> >     > The existing case is actually not interesting (IMO) as it exposes
> undefined behaviour which allows the compiler to do anything. But you seem
> to be satisfied with "works for me, right here right now". I think the cost of
> avoiding undefined behaviour is acceptable (actually I don't think it even will
> be noticeable).
> >
> >     I am not convinced because of use of volatile in head and tail indexes.
> >     For me that brings the defined behavior.
> > As long as you don't mix in C11 atomic accesses (just use "plain"
> > accesses to volatile objects), it is AFAIK defined behaviour (but not
> > necessarily using atomic loads and stores). But I quoted the C11 spec
> > where it explicitly mentions that mixing atomic and non-atomic accesses to
> the same object is undefined behaviour. Don't argue with me, argue with the
> C11 spec.
> > If you want to disobey the spec, this should at least be called out for in the
> code with a comment.
> 
> That boils down to one question: should we follow the C11 spec? Why not
> take only load acquire and store release semantics, just like the Linux kernel
> and FreeBSD?
> The C11 memory model does not look super efficient in terms of the gcc
> implementation.
> 
> >
> >
> >     That the reason why I shared
> >     the generated assembly code. If you think other way, Pick any compiler
> >     and see generated output.
> > This is what one compiler for one architecture generates today. These
> > things change. Other things that used to work or worked for some
> > specific architecture has stopped working in newer versions of the compiler.
> >
> >
> >     And
> >
> >     Freebsd implementation of ring buffer(Which DPDK derived from), Don't
> have
> >     such logic, See
> > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
> > It looks like FreeBSD uses some kind of C11 atomic memory
> > model-inspired API although I don't see exactly how e.g.
> > atomic_store_rel_int() is implemented. The code also mixes in explicit
> > barriers so definitively not pure C11 memory model usage. And finally,
> > it doesn't establish the proper load-acquire/store-release relationships (e.g.
> store-release cons_tail requires a load-acquire cons_tail, same for prod_tail).
> >
> > "* multi-producer safe lock-free ring buffer enqueue"
> > The comment is also wrong. This design is not lock-free, how could it
> > be when there is spinning
> > (waiting) for other threads in the code? If a thread must wait for
> > other threads, then by definition the design is blocking.
> >
> > So you are saying that because FreeBSD is doing it wrong, DPDK can also do
> it wrong?
> >
> >
> >     See below too.
> >
> >     >
> >     > Skipping the compiler memory barrier in rte_pause() potentially allows
> for optimisations that provide much more benefit, e.g. hiding some cache
> miss latency for later loads. The DPDK ring buffer implementation is defined
> so to enable inlining of enqueue/dequeue functions into the caller, any code
> could immediately follow these calls.
> >     >
> >     > From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x
> >     > Programming languages — C
> >     >
> >     > 5.1.2.4
> >     > 4 Two expression evaluations conflict if one of them modifies a memory
> location and the other one reads or modifies the same memory location.
> >     >
> >     > 25 The execution of a program contains a data race if it contains two
> conflicting actions in different threads, at least one of which is not atomic,
> and neither happens before the other. Any such data race results in undefined
> behavior.
> >
> >     IMO, Both condition will satisfy if the variable is volatile and 32bit read
> will atomic
> >     for 32b and 64b machines. If not, the problem persist for generic case
> >     as well(lib/librte_ring/rte_ring_generic.h)
> > The read from a volatile object is not an atomic access per the C11
> > spec. It just happens to be translated to an instruction (on x86-64 and
> AArch64/A64) that implements an atomic load.
> > I don't think any compiler would change this code generation and
> > suddenly generate some non-atomic load instruction for a program that
> *only* uses volatile to do "atomic" accesses.
> > But a future compiler could detect the mix of atomic and non-atomic
> > accesses and mark this expression as causing undefined behaviour and that
> would have consequences for code generation.
> >
> >
> >     I agree with you on C11 memory model semantics usage. The reason why
> I
> >     propose name for the file as rte_ring_c11_mem.h as DPDK it self did not
> >     had definitions for load acquire and store release semantics.
> >     I was looking for taking load acquire and store release semantics
> >     from C11 instead of creating new API like Linux kernel for FreeBSD(APIs
> >     like  atomic_load_acq_32(), atomic_store_rel_32()). If the file name is
> your
> >     concern then we could create new abstractions as well. That would help
> >     existing KNI problem as well.
We tried this in KNI. Creating these abstractions with optimal performance is not possible as release/acquire semantics are one-way barriers. We will end up using full memory-barriers.

> > I appreciate your embrace of the C11 memory model. I think it is
> > better for describing (both to the compiler and to humans) which and how
> objects are used for synchronisation.
> >
> > However, I don't think an API as you suggest (and others have
> > suggested before, e.g. as done in ODP) is a good idea. There is an
> > infinite amount of possible base types, an increasing number of operations
> and a bunch of different memory orderings, a "complete"
> > API would be very large and difficult to test, and most members of the API
> would never be used.
> > GCC and Clang both support the __atomic intrinsics. This API avoids
> > the problems I described above. Or we could use the official C11
> > syntax (stdatomic.h). But then we have the problem with using pre-C11
> compilers...
> 
> I have no objection, if everyone agrees to move C11 memory model with
> __atomic intrinsics. But if we need to keep both have then
> atomic_load_acq_32() kind of API make sense.
> 
> 
> >
> >
> >
> >
> >     I think, currently it mixed usage because, the same variable declaration
> >     used for C11 vs non C11 usage.Ideally we wont need "volatile" for C11
> >     case. Either we need to change only to C11 mode OR have APIs for
> >     atomic_load_acq_() and atomic_store_rel_() to allow both models like
> >     Linux kernel and FreeBSD.
> >
> >     >
> >     > -- Ola
> >     >
> >     >
> >     >
> >
> >

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 12:30                                                           ` Ola Liljedahl
@ 2018-10-09  8:53                                                             ` Olivier Matz
  0 siblings, 0 replies; 131+ messages in thread
From: Olivier Matz @ 2018-10-09  8:53 UTC (permalink / raw)
  To: Ola Liljedahl
  Cc: Jerin Jacob, dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	Steve Capper, nd, stable

Hi,

On Mon, Oct 08, 2018 at 12:30:15PM +0000, Ola Liljedahl wrote:
> 
> 
> On 08/10/2018, 14:21, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
> 
>     -----Original Message-----
>     > Date: Mon, 8 Oct 2018 17:35:25 +0530
>     > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     > To: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>     >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>     >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
>     >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
>     >  "stable@dpdk.org" <stable@dpdk.org>
>     > Subject: Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
>     > User-Agent: Mutt/1.10.1 (2018-07-13)
>     > 
>     > External Email
>     > 
>     > -----Original Message-----
>     > > Date: Mon, 8 Oct 2018 11:59:16 +0000
>     > > From: Ola Liljedahl <Ola.Liljedahl@arm.com>
>     > > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>     > > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>     > >  <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
>     > >  <konstantin.ananyev@intel.com>, "Gavin Hu (Arm Technology China)"
>     > >  <Gavin.Hu@arm.com>, Steve Capper <Steve.Capper@arm.com>, nd <nd@arm.com>,
>     > >  "stable@dpdk.org" <stable@dpdk.org>
>     > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     > > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     > >
>     > >
>     > > On 08/10/2018, 13:50, "Jerin Jacob" <jerin.jacob@caviumnetworks.com> wrote:
>     > >
>     > >
>     > >     I don't know how that creates more undefined behavior. So replied in the
>     > >     context of your reply that, according to your view even Linux is running
>     > >     with undefined behavior.
>     > >
>     > > As I explained, Linux does not use C11 atomics (nor GCC __atomic builtins) so
>     > > cannot express the kind of undefined behaviour caused by mixing conflicting atomic
>     > > (as defined by the C11 standard) and non-atomic accesses to the same object.
>     > >
>     > > Checked the latest version from https://github.com/torvalds/linux
>     > 
>     > Yet another top post. So you removed the complete earlier context. Never
>     > mind.
> Top post? My reply is under your text. As is this.
> 
> Don't blame my stupid mail agent on your misunderstanding of C11.

Sorry, but honestly, your mail agent configuration makes the thread harder
to read for many people.


>     > I am not saying Linux is using C11 atomic. I asked, Can't we follow
>     > like Linux to use the HW feature of load acquire and store release
>     > semantics with introducing C11 memory model.
>     
>     correction:
>     
>     s/with introducing C11 memory model/with out introducing C11 memory model
> You can generate e.g. AArch64/A64 LDAR and STLR instructions using inline assembler.
> But you won't be able to specify acquire and release ordering to the compiler, so you
> must specify a full memory barrier instead.
>
> But why create a C11-like but custom DPDK specific memory model when the compiler
> already supports a standardised, well defined and tested memory model? You would just be
> creating a mountain of technical debt.

Let's try to summarize my understanding of the discussion:

- mixing standard access and C11-like access is undefined according to the
  C11 specification. However, it works today with current compilers and hardware.
- the patch does not fix any identified issue, but it can make the code
  more consistent by using C11 access.
- with the patch, one "add" instruction is added. The impact is difficult
  to measure, but it is expected to be smaller than the noise generated
  by a code alignment change.

I'm unfortunately not an expert about C11 atomic memory model. But, if
it is more consistent with the rest of the code and if we cannot detect
any performance impact, well, I think it could go in.


Regards,
Olivier

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-08 10:49                                                 ` Jerin Jacob
@ 2018-10-10  6:28                                                   ` Gavin Hu (Arm Technology China)
  2018-10-10 19:26                                                     ` Honnappa Nagarahalli
  0 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-10-10  6:28 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Ola Liljedahl, dev, Honnappa Nagarahalli, Ananyev, Konstantin,
	Steve Capper, nd, stable

Hi Jerin,

Following the guide to use the PMU counters (KO inserted and DPDK recompiled), the numbers increased more than tenfold (do the bigger numbers here mean more precision?). Is this valid and expected?
No significant difference was seen between the runs with and without the patch.

gavin@net-arm-thunderx2:~/community/dpdk$ sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=1024  -- -i
RTE>>ring_perf_autotest (#1 run w/o the patch)
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 103
MP/MC single enq/dequeue: 130
SP/SC burst enq/dequeue (size: 8): 18
MP/MC burst enq/dequeue (size: 8): 21
SP/SC burst enq/dequeue (size: 32): 7
MP/MC burst enq/dequeue (size: 32): 8

### Testing empty dequeue ###
SC empty dequeue: 3.00
MC empty dequeue: 3.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 17.48
MP/MC bulk enq/dequeue (size: 8): 21.77
SP/SC bulk enq/dequeue (size: 32): 7.39
MP/MC bulk enq/dequeue (size: 32): 8.52

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 31.32
MP/MC bulk enq/dequeue (size: 8): 38.52
SP/SC bulk enq/dequeue (size: 32): 13.39
MP/MC bulk enq/dequeue (size: 32): 14.15

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 75.00
MP/MC bulk enq/dequeue (size: 8): 141.97
SP/SC bulk enq/dequeue (size: 32): 23.85
MP/MC bulk enq/dequeue (size: 32): 36.13
Test OK
RTE>>ring_perf_autotest (#2 run w/o the patch)
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 103
MP/MC single enq/dequeue: 130
SP/SC burst enq/dequeue (size: 8): 18
MP/MC burst enq/dequeue (size: 8): 21
SP/SC burst enq/dequeue (size: 32): 7
MP/MC burst enq/dequeue (size: 32): 8

### Testing empty dequeue ###
SC empty dequeue: 3.00
MC empty dequeue: 3.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 17.48
MP/MC bulk enq/dequeue (size: 8): 21.77
SP/SC bulk enq/dequeue (size: 32): 7.38
MP/MC bulk enq/dequeue (size: 32): 8.52

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 31.31
MP/MC bulk enq/dequeue (size: 8): 38.52
SP/SC bulk enq/dequeue (size: 32): 13.33
MP/MC bulk enq/dequeue (size: 32): 14.16

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 75.74
MP/MC bulk enq/dequeue (size: 8): 147.33
SP/SC bulk enq/dequeue (size: 32): 24.79
MP/MC bulk enq/dequeue (size: 32): 40.09
Test OK

RTE>>ring_perf_autotest (#1 run w/ the patch)
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 103
MP/MC single enq/dequeue: 129
SP/SC burst enq/dequeue (size: 8): 18
MP/MC burst enq/dequeue (size: 8): 22
SP/SC burst enq/dequeue (size: 32): 7
MP/MC burst enq/dequeue (size: 32): 8

### Testing empty dequeue ###
SC empty dequeue: 3.00
MC empty dequeue: 4.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 17.89
MP/MC bulk enq/dequeue (size: 8): 21.77
SP/SC bulk enq/dequeue (size: 32): 7.50
MP/MC bulk enq/dequeue (size: 32): 8.52

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 31.24
MP/MC bulk enq/dequeue (size: 8): 38.14
SP/SC bulk enq/dequeue (size: 32): 13.24
MP/MC bulk enq/dequeue (size: 32): 14.69

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 74.63
MP/MC bulk enq/dequeue (size: 8): 137.61
SP/SC bulk enq/dequeue (size: 32): 24.82
MP/MC bulk enq/dequeue (size: 32): 36.64
Test OK
RTE>>ring_perf_autotest (#2 run w/ the patch)
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 103
MP/MC single enq/dequeue: 129
SP/SC burst enq/dequeue (size: 8): 18
MP/MC burst enq/dequeue (size: 8): 22
SP/SC burst enq/dequeue (size: 32): 7
MP/MC burst enq/dequeue (size: 32): 8

### Testing empty dequeue ###
SC empty dequeue: 3.00
MC empty dequeue: 4.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 17.89
MP/MC bulk enq/dequeue (size: 8): 21.77
SP/SC bulk enq/dequeue (size: 32): 7.50
MP/MC bulk enq/dequeue (size: 32): 8.52

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 31.53
MP/MC bulk enq/dequeue (size: 8): 38.59
SP/SC bulk enq/dequeue (size: 32): 13.24
MP/MC bulk enq/dequeue (size: 32): 14.69

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 75.60
MP/MC bulk enq/dequeue (size: 8): 149.14
SP/SC bulk enq/dequeue (size: 32): 25.13
MP/MC bulk enq/dequeue (size: 32): 40.60
Test OK


> -----Original Message-----
> From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Sent: Monday, October 8, 2018 6:50 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: Ola Liljedahl <Ola.Liljedahl@arm.com>; dev@dpdk.org; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Steve Capper <Steve.Capper@arm.com>;
> nd <nd@arm.com>; stable@dpdk.org
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> 
> -----Original Message-----
> > Date: Mon, 8 Oct 2018 10:33:43 +0000
> > From: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>
> > To: Ola Liljedahl <Ola.Liljedahl@arm.com>, Jerin Jacob
> > <jerin.jacob@caviumnetworks.com>
> > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
> >  <konstantin.ananyev@intel.com>, Steve Capper
> <Steve.Capper@arm.com>,
> > nd  <nd@arm.com>, "stable@dpdk.org" <stable@dpdk.org>
> > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
> >
> >
> > I did benchmarking w/o and w/ the patch, it did not show any noticeable
> differences in terms of latency.
> > Here is the full log( 3 runs w/o the patch and 2 runs w/ the patch).
> >
> > sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4
> > --socket-mem=1024  -- -i
> 
> These counters are running at 100MHz. Use PMU counters to get more
> accurate results.
> 
> https://doc.dpdk.org/guides/prog_guide/profile_app.html
> See: 55.2. Profiling on ARM64
> 

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
  2018-10-10  6:28                                                   ` Gavin Hu (Arm Technology China)
@ 2018-10-10 19:26                                                     ` Honnappa Nagarahalli
  0 siblings, 0 replies; 131+ messages in thread
From: Honnappa Nagarahalli @ 2018-10-10 19:26 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), Jerin Jacob
  Cc: Ola Liljedahl, dev, Ananyev, Konstantin, Steve Capper, nd, stable

> 
> Hi Jerin,
> 
> Following the guide to use the PMU counters(KO inserted and DPDK
> recompiled), the numbers increased 10+ folds(bigger numbers here mean
> more precise?), is this valid and expected?
This is correct; big numbers mean more precise/granular results.

> No significant difference was seen.
This is what we are interested in. Do you have any before and after numbers for this change?

> 
> gavin@net-arm-thunderx2:~/community/dpdk$ sudo ./test/test/test -l 16-
> 19,44-47,72-75,100-103 -n 4 --socket-mem=1024  -- -i
> RTE>>ring_perf_autotest (#1 run w/o the patch)
> ### Testing single element and burst enq/deq ### SP/SC single
> enq/dequeue: 103 MP/MC single enq/dequeue: 130 SP/SC burst
> enq/dequeue (size: 8): 18 MP/MC burst enq/dequeue (size: 8): 21 SP/SC
> burst enq/dequeue (size: 32): 7 MP/MC burst enq/dequeue (size: 32): 8
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 3.00
> MC empty dequeue: 3.00
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 17.48
> MP/MC bulk enq/dequeue (size: 8): 21.77
> SP/SC bulk enq/dequeue (size: 32): 7.39
> MP/MC bulk enq/dequeue (size: 32): 8.52
> 
> ### Testing using two hyperthreads ###
> SP/SC bulk enq/dequeue (size: 8): 31.32
> MP/MC bulk enq/dequeue (size: 8): 38.52
> SP/SC bulk enq/dequeue (size: 32): 13.39 MP/MC bulk enq/dequeue (size:
> 32): 14.15
> 
> ### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8):
> 75.00 MP/MC bulk enq/dequeue (size: 8): 141.97 SP/SC bulk enq/dequeue
> (size: 32): 23.85 MP/MC bulk enq/dequeue (size: 32): 36.13 Test OK
> RTE>>ring_perf_autotest (#2 run w/o the patch)
> ### Testing single element and burst enq/deq ### SP/SC single
> enq/dequeue: 103 MP/MC single enq/dequeue: 130 SP/SC burst
> enq/dequeue (size: 8): 18 MP/MC burst enq/dequeue (size: 8): 21 SP/SC
> burst enq/dequeue (size: 32): 7 MP/MC burst enq/dequeue (size: 32): 8
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 3.00
> MC empty dequeue: 3.00
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 17.48
> MP/MC bulk enq/dequeue (size: 8): 21.77
> SP/SC bulk enq/dequeue (size: 32): 7.38
> MP/MC bulk enq/dequeue (size: 32): 8.52
> 
> ### Testing using two hyperthreads ###
> SP/SC bulk enq/dequeue (size: 8): 31.31
> MP/MC bulk enq/dequeue (size: 8): 38.52
> SP/SC bulk enq/dequeue (size: 32): 13.33 MP/MC bulk enq/dequeue (size:
> 32): 14.16
> 
> ### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8):
> 75.74 MP/MC bulk enq/dequeue (size: 8): 147.33 SP/SC bulk enq/dequeue
> (size: 32): 24.79 MP/MC bulk enq/dequeue (size: 32): 40.09 Test OK
> 
> RTE>>ring_perf_autotest (#1 run w/ the patch)
> ### Testing single element and burst enq/deq ### SP/SC single
> enq/dequeue: 103 MP/MC single enq/dequeue: 129 SP/SC burst
> enq/dequeue (size: 8): 18 MP/MC burst enq/dequeue (size: 8): 22 SP/SC
> burst enq/dequeue (size: 32): 7 MP/MC burst enq/dequeue (size: 32): 8
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 3.00
> MC empty dequeue: 4.00
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 17.89
> MP/MC bulk enq/dequeue (size: 8): 21.77
> SP/SC bulk enq/dequeue (size: 32): 7.50
> MP/MC bulk enq/dequeue (size: 32): 8.52
> 
> ### Testing using two hyperthreads ###
> SP/SC bulk enq/dequeue (size: 8): 31.24
> MP/MC bulk enq/dequeue (size: 8): 38.14
> SP/SC bulk enq/dequeue (size: 32): 13.24 MP/MC bulk enq/dequeue (size:
> 32): 14.69
> 
> ### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8):
> 74.63 MP/MC bulk enq/dequeue (size: 8): 137.61 SP/SC bulk enq/dequeue
> (size: 32): 24.82 MP/MC bulk enq/dequeue (size: 32): 36.64 Test OK
> RTE>>ring_perf_autotest (#1 run w/ the patch)
> ### Testing single element and burst enq/deq ### SP/SC single
> enq/dequeue: 103 MP/MC single enq/dequeue: 129 SP/SC burst
> enq/dequeue (size: 8): 18 MP/MC burst enq/dequeue (size: 8): 22 SP/SC
> burst enq/dequeue (size: 32): 7 MP/MC burst enq/dequeue (size: 32): 8
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 3.00
> MC empty dequeue: 4.00
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 17.89
> MP/MC bulk enq/dequeue (size: 8): 21.77
> SP/SC bulk enq/dequeue (size: 32): 7.50
> MP/MC bulk enq/dequeue (size: 32): 8.52
> 
> ### Testing using two hyperthreads ###
> SP/SC bulk enq/dequeue (size: 8): 31.53
> MP/MC bulk enq/dequeue (size: 8): 38.59
> SP/SC bulk enq/dequeue (size: 32): 13.24 MP/MC bulk enq/dequeue (size:
> 32): 14.69
> 
> ### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8):
> 75.60 MP/MC bulk enq/dequeue (size: 8): 149.14 SP/SC bulk enq/dequeue
> (size: 32): 25.13 MP/MC bulk enq/dequeue (size: 32): 40.60 Test OK
> 
> 
> > -----Original Message-----
> > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > Sent: Monday, October 8, 2018 6:50 PM
> > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> > Cc: Ola Liljedahl <Ola.Liljedahl@arm.com>; dev@dpdk.org; Honnappa
> > Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; Steve Capper
> <Steve.Capper@arm.com>;
> > nd <nd@arm.com>; stable@dpdk.org
> > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> >
> > -----Original Message-----
> > > Date: Mon, 8 Oct 2018 10:33:43 +0000
> > > From: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>
> > > To: Ola Liljedahl <Ola.Liljedahl@arm.com>, Jerin Jacob
> > > <jerin.jacob@caviumnetworks.com>
> > > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>, "Ananyev, Konstantin"
> > >  <konstantin.ananyev@intel.com>, Steve Capper
> > <Steve.Capper@arm.com>,
> > > nd  <nd@arm.com>, "stable@dpdk.org" <stable@dpdk.org>
> > > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
> > >
> > >
> > > I did benchmarking w/o and w/ the patch, it did not show any
> > > noticeable
> > differences in terms of latency.
> > > Here is the full log( 3 runs w/o the patch and 2 runs w/ the patch).
> > >
> > > sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4
> > > --socket-mem=1024  -- -i
> >
> > These counters are running at 100MHz. Use PMU counters to get more
> > accurate results.
> >
> > https://doc.dpdk.org/guides/prog_guide/profile_app.html
> > See: 55.2. Profiling on ARM64
> >

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail
  2018-09-17  8:17     ` [dpdk-dev] [PATCH v3 2/3] ring: synchronize the load and store of the tail Gavin Hu
  2018-09-26  9:29       ` Gavin Hu (Arm Technology China)
  2018-09-29 10:57       ` Jerin Jacob
@ 2018-10-17  6:29       ` Gavin Hu
  2018-10-17  6:29         ` [dpdk-dev] [PATCH 2/2] ring: move the atomic load of head above the loop Gavin Hu
                           ` (5 more replies)
  2 siblings, 6 replies; 131+ messages in thread
From: Gavin Hu @ 2018-10-17  6:29 UTC (permalink / raw)
  To: dev; +Cc: gavin.hu, Honnappa.Nagarahalli, jerin.jacob, stable

Synchronize the load-acquire of the tail with the store-release within
update_tail: the store-release ensures that all ring operations, enqueue
or dequeue, are seen by the observers on the other side as soon as they
see the updated tail. The load-acquire is needed here because a data
dependency is not a reliable ordering guarantee; the compiler might
break it by caching values in temporaries to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.

The patch was benchmarked with test/ring_perf_autotest and it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
For a single lcore, it also improves a little, about 3 ~ 4%.
The biggest gain is in the MPMC case: with two lcores and a ring size
of 32, it cuts latency by up to (3.26-2.36)/3.26 = 27.6%.

This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting the
load-acquire from __atomic_compare_exchange_n up above.

The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i

Test result with this patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.86
 MP/MC bulk enq/dequeue (size: 8): 10.15
 SP/SC bulk enq/dequeue (size: 32): 1.94
 MP/MC bulk enq/dequeue (size: 32): 2.36

In comparison of the test result without this patch:
 SP/SC bulk enq/dequeue (size: 8): 6.67
 MP/MC bulk enq/dequeue (size: 8): 13.12
 SP/SC bulk enq/dequeue (size: 32): 2.04
 MP/MC bulk enq/dequeue (size: 32): 3.26

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 94df3c4..4851763 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -67,13 +67,18 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		*old_head = __atomic_load_n(&r->prod.head,
 					__ATOMIC_ACQUIRE);
 
-		/*
-		 *  The subtraction is done between two unsigned 32bits value
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
+							__ATOMIC_ACQUIRE);
+
+		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * *old_head > cons_tail). So 'free_entries' is always between 0
 		 * and capacity (which is < size).
 		 */
-		*free_entries = (capacity + r->cons.tail - *old_head);
+		*free_entries = (capacity + cons_tail - *old_head);
 
 		/* check that we have enough room in ring */
 		if (unlikely(n > *free_entries))
@@ -131,15 +136,22 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
+
 		*old_head = __atomic_load_n(&r->cons.head,
 					__ATOMIC_ACQUIRE);
 
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
+					__ATOMIC_ACQUIRE);
+
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
 		 * and size(ring)-1.
 		 */
-		*entries = (r->prod.tail - *old_head);
+		*entries = (prod_tail - *old_head);
 
 		/* Set the actual entries for dequeue */
 		if (n > *entries)
-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH 2/2] ring: move the atomic load of head above the loop
  2018-10-17  6:29       ` [dpdk-dev] [PATCH 1/2] " Gavin Hu
@ 2018-10-17  6:29         ` Gavin Hu
  2018-10-17  6:35         ` [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail Gavin Hu (Arm Technology China)
                           ` (4 subsequent siblings)
  5 siblings, 0 replies; 131+ messages in thread
From: Gavin Hu @ 2018-10-17  6:29 UTC (permalink / raw)
  To: dev; +Cc: gavin.hu, Honnappa.Nagarahalli, jerin.jacob, stable

In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
the do {} while loop: on CAS failure, old_head is updated by
__atomic_compare_exchange_n, so another load is costly and unnecessary.

This helps the latency a little, about 1~5%.

 Test result with the patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.64
 MP/MC bulk enq/dequeue (size: 8): 9.58
 SP/SC bulk enq/dequeue (size: 32): 1.98
 MP/MC bulk enq/dequeue (size: 32): 2.30

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 4851763..fdab2b9 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -60,13 +60,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 	unsigned int max = n;
 	int success;
 
+	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Reset n to the initial burst count */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->prod.head,
-					__ATOMIC_ACQUIRE);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -92,6 +90,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		if (is_sp)
 			r->prod.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head is updated */
 			success = __atomic_compare_exchange_n(&r->prod.head,
 					old_head, *new_head,
 					0, __ATOMIC_ACQUIRE,
@@ -133,13 +132,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	int success;
 
 	/* move cons.head atomically */
+	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->cons.head,
-					__ATOMIC_ACQUIRE);
-
 		/* this load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -164,6 +161,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		if (is_sc)
 			r->cons.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head will be updated */
 			success = __atomic_compare_exchange_n(&r->cons.head,
 							old_head, *new_head,
 							0, __ATOMIC_ACQUIRE,
-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail
  2018-10-17  6:29       ` [dpdk-dev] [PATCH 1/2] " Gavin Hu
  2018-10-17  6:29         ` [dpdk-dev] [PATCH 2/2] ring: move the atomic load of head above the loop Gavin Hu
@ 2018-10-17  6:35         ` Gavin Hu (Arm Technology China)
  2018-10-27 14:39           ` Thomas Monjalon
  2018-10-31  3:35         ` [dpdk-dev] [PATCH v2 0/2] rte ring c11 bug fix and optimization Gavin Hu
                           ` (3 subsequent siblings)
  5 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-10-17  6:35 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), dev, jerin.jacob
  Cc: Honnappa Nagarahalli, stable, Ola Liljedahl

Hi Jerin

Since the 1st patch of the 3-patch set was not concluded, I am submitting this 2-patch series to unblock the merge.

Best Regards,
Gavin

> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Wednesday, October 17, 2018 2:30 PM
> To: dev@dpdk.org
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> jerin.jacob@caviumnetworks.com; stable@dpdk.org
> Subject: [PATCH 1/2] ring: synchronize the load and store of the tail
>
> Synchronize the load-acquire of the tail and the store-release within
> update_tail, the store release ensures all the ring operations, enqueue or
> dequeue, are seen by the observers on the other side as soon as they see
> the updated tail. The load-acquire is needed here as the data dependency is
> not a reliable way for ordering as the compiler might break it by saving to
> temporary values to boost performance.
> When computing the free_entries and avail_entries, use atomic semantics to
> load the heads and tails instead.
>
> The patch was benchmarked with test/ring_perf_autotest and it decreases
> the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
> are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
> For 1 lcore, it also improves a little, about 3 ~ 4%.
> It is a big improvement, in case of MPMC, with two lcores and ring size of 32,
> it saves latency up to (3.26-2.36)/3.26 = 27.6%.
>
> This patch is a bug fix, while the improvement is a bonus. In our analysis the
> improvement comes from the cacheline pre-filling after hoisting load-
> acquire from _atomic_compare_exchange_n up above.
>
> The test command:
> $sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
> 1024 -- -i
>
> Test result with this patch(two cores):
>  SP/SC bulk enq/dequeue (size: 8): 5.86
>  MP/MC bulk enq/dequeue (size: 8): 10.15  SP/SC bulk enq/dequeue (size:
> 32): 1.94  MP/MC bulk enq/dequeue (size: 32): 2.36
>
> In comparison of the test result without this patch:
>  SP/SC bulk enq/dequeue (size: 8): 6.67
>  MP/MC bulk enq/dequeue (size: 8): 13.12  SP/SC bulk enq/dequeue (size:
> 32): 2.04  MP/MC bulk enq/dequeue (size: 32): 3.26
>
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> Reviewed-by: Jia He <justin.he@arm.com>
> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> ---
>  lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/lib/librte_ring/rte_ring_c11_mem.h
> b/lib/librte_ring/rte_ring_c11_mem.h
> index 94df3c4..4851763 100644
> --- a/lib/librte_ring/rte_ring_c11_mem.h
> +++ b/lib/librte_ring/rte_ring_c11_mem.h
> @@ -67,13 +67,18 @@ __rte_ring_move_prod_head(struct rte_ring *r,
> unsigned int is_sp,
>  *old_head = __atomic_load_n(&r->prod.head,
>  __ATOMIC_ACQUIRE);
>
> -/*
> - *  The subtraction is done between two unsigned 32bits
> value
> +/* load-acquire synchronize with store-release of ht->tail
> + * in update_tail.
> + */
> +const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
> +
> __ATOMIC_ACQUIRE);
> +
> +/* The subtraction is done between two unsigned 32bits
> value
>   * (the result is always modulo 32 bits even if we have
>   * *old_head > cons_tail). So 'free_entries' is always
> between 0
>   * and capacity (which is < size).
>   */
> -*free_entries = (capacity + r->cons.tail - *old_head);
> +*free_entries = (capacity + cons_tail - *old_head);
>
>  /* check that we have enough room in ring */
>  if (unlikely(n > *free_entries))
> @@ -131,15 +136,22 @@ __rte_ring_move_cons_head(struct rte_ring *r, int
> is_sc,
>  do {
>  /* Restore n as it may change every loop */
>  n = max;
> +
>  *old_head = __atomic_load_n(&r->cons.head,
>  __ATOMIC_ACQUIRE);
>
> +/* this load-acquire synchronize with store-release of ht->tail
> + * in update_tail.
> + */
> +const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
> +__ATOMIC_ACQUIRE);
> +
>  /* The subtraction is done between two unsigned 32bits
> value
>   * (the result is always modulo 32 bits even if we have
>   * cons_head > prod_tail). So 'entries' is always between 0
>   * and size(ring)-1.
>   */
> -*entries = (r->prod.tail - *old_head);
> +*entries = (prod_tail - *old_head);
>
>  /* Set the actual entries for dequeue */
>  if (n > *entries)
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [dpdk-stable] [PATCH v3 1/3] ring: read tail using atomic load
  2018-09-17  8:17   ` [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load Gavin Hu
                       ` (3 preceding siblings ...)
  2018-09-29 10:48     ` Jerin Jacob
@ 2018-10-27 14:17     ` Thomas Monjalon
  4 siblings, 0 replies; 131+ messages in thread
From: Thomas Monjalon @ 2018-10-27 14:17 UTC (permalink / raw)
  To: Gavin Hu
  Cc: stable, dev, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl,
	jerin.jacob, nd, bruce.richardson, Justin He, konstantin.ananyev

17/09/2018 10:17, Gavin Hu:
> In update_tail, read ht->tail using __atomic_load. Although the
> compiler currently seems to be doing the right thing even without
> __atomic_load, we don't want to give the compiler freedom to optimise
> what should be an atomic load; it should not be arbitrarily moved
> around.
> 
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>

It seems there is no consensus on this series?

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [dpdk-stable] [PATCH v4 4/4] ring: move the atomic load of head above the loop
  2018-09-17  8:11       ` [dpdk-dev] [PATCH v4 4/4] ring: move the atomic load of head above the loop Gavin Hu
@ 2018-10-27 14:21         ` Thomas Monjalon
  0 siblings, 0 replies; 131+ messages in thread
From: Thomas Monjalon @ 2018-10-27 14:21 UTC (permalink / raw)
  To: Gavin Hu
  Cc: stable, dev, Honnappa.Nagarahalli, steve.capper, Ola.Liljedahl,
	jerin.jacob, nd, justin.he, bruce.richardson, konstantin.ananyev,
	olivier.matz

17/09/2018 10:11, Gavin Hu:
> In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
> the do {} while loop as upon failure the old_head will be updated,
> another load is costly and not necessary.
> 
> This helps a little on the latency,about 1~5%.
> 
>  Test result with the patch(two cores):
>  SP/SC bulk enq/dequeue (size: 8): 5.64
>  MP/MC bulk enq/dequeue (size: 8): 9.58
>  SP/SC bulk enq/dequeue (size: 32): 1.98
>  MP/MC bulk enq/dequeue (size: 32): 2.30
> 
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>

We are missing reviews and acknowledgements on this series.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail
  2018-10-17  6:35         ` [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail Gavin Hu (Arm Technology China)
@ 2018-10-27 14:39           ` Thomas Monjalon
  2018-10-27 15:00             ` Jerin Jacob
  0 siblings, 1 reply; 131+ messages in thread
From: Thomas Monjalon @ 2018-10-27 14:39 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China)
  Cc: dev, jerin.jacob, Honnappa Nagarahalli, stable, Ola Liljedahl,
	olivier.matz, chaozhu, bruce.richardson, konstantin.ananyev

17/10/2018 08:35, Gavin Hu (Arm Technology China):
> Hi Jerin
> 
> As the 1st one of the 3-patch set was not concluded, I submit this 2-patch series to unblock the merge.

The thread is totally messed up because:
	- there is no cover letter
	- some different series (testpmd, i40e and doc) are in the same thread
	- v4 replies to a different series
	- this version should be a v5 but has no number
	- this version replies to the v3
	- patchwork still shows v3 and "v5"
	- replies from Ola are not quoting previous discussion

Because of all of this, it is really difficult to follow.
This is probably the reason for the lack of review outside of Arm.

One more issue: you must Cc the relevant maintainers.
Here:
	- Olivier for rte_ring
	- Chao for IBM platform
	- Bruce and Konstantin for x86

Guys, it is really cool to have more Arm developers in DPDK.
But please consider formatting your discussions better; it is really
important in our contribution workflow.

I don't know what to do.
I suggest to wait for more feedbacks and integrate it in -rc2.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail
  2018-10-27 14:39           ` Thomas Monjalon
@ 2018-10-27 15:00             ` Jerin Jacob
  2018-10-27 15:13               ` Thomas Monjalon
  0 siblings, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2018-10-27 15:00 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Gavin Hu (Arm Technology China),
	dev, Honnappa Nagarahalli, stable, Ola Liljedahl, olivier.matz,
	chaozhu, bruce.richardson, konstantin.ananyev

-----Original Message-----
> Date: Sat, 27 Oct 2018 16:39:58 +0200
> From: Thomas Monjalon <thomas@monjalon.net>
> To: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org, "jerin.jacob@caviumnetworks.com"
>  <jerin.jacob@caviumnetworks.com>, Honnappa Nagarahalli
>  <Honnappa.Nagarahalli@arm.com>, "stable@dpdk.org" <stable@dpdk.org>, Ola
>  Liljedahl <Ola.Liljedahl@arm.com>, olivier.matz@6wind.com,
>  chaozhu@linux.vnet.ibm.com, bruce.richardson@intel.com,
>  konstantin.ananyev@intel.com
> Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
>  the tail
> 
> External Email
> 
> 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > Hi Jerin
> >
> > As the 1st one of the 3-patch set was not concluded, I submit this 2-patch series to unblock the merge.
> 
> The thread is totally messed up because:
>         - there is no cover letter
>         - some different series (testpmd, i40e and doc) are in the same thread
>         - v4 replies to a different series
>         - this version should be a v5 but has no number
>         - this version replies to the v3
>         - patchwork still shows v3 and "v5"
>         - replies from Ola are not quoting previous discussion
> 
> Because of all of this, it is really difficult to follow.
> This is probably the reason of the lack of review outside of Arm.
> 
> One more issue: you must Cc the relevant maintainers.
> Here:
>         - Olivier for rte_ring
>         - Chao for IBM platform
>         - Bruce and Konstantin for x86
> 
> Guys, it is really cool to have more Arm developpers in DPDK.
> But please consider better formatting your discussions, it is really
> important in our contribution workflow.
> 
> I don't know what to do.
> I suggest to wait for more feedbacks and integrate it in -rc2.

This series has been acked and tested. Sure, if we are looking for some
more feedback we can push it to -rc2; if not, it is a good candidate to be
selected for -rc1.


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail
  2018-10-27 15:00             ` Jerin Jacob
@ 2018-10-27 15:13               ` Thomas Monjalon
  2018-10-27 15:34                 ` Jerin Jacob
  2018-11-03 20:12                 ` Mattias Rönnblom
  0 siblings, 2 replies; 131+ messages in thread
From: Thomas Monjalon @ 2018-10-27 15:13 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Gavin Hu (Arm Technology China),
	dev, Honnappa Nagarahalli, stable, Ola Liljedahl, olivier.matz,
	chaozhu, bruce.richardson, konstantin.ananyev

27/10/2018 17:00, Jerin Jacob:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > Hi Jerin
> > >
> > > As the 1st one of the 3-patch set was not concluded, I submit this 2-patch series to unblock the merge.
> > 
> > The thread is totally messed up because:
> >         - there is no cover letter
> >         - some different series (testpmd, i40e and doc) are in the same thread
> >         - v4 replies to a different series
> >         - this version should be a v5 but has no number
> >         - this version replies to the v3
> >         - patchwork still shows v3 and "v5"
> >         - replies from Ola are not quoting previous discussion
> > 
> > Because of all of this, it is really difficult to follow.
> > This is probably the reason of the lack of review outside of Arm.
> > 
> > One more issue: you must Cc the relevant maintainers.
> > Here:
> >         - Olivier for rte_ring
> >         - Chao for IBM platform
> >         - Bruce and Konstantin for x86
> > 
> > Guys, it is really cool to have more Arm developpers in DPDK.
> > But please consider better formatting your discussions, it is really
> > important in our contribution workflow.
> > 
> > I don't know what to do.
> > I suggest to wait for more feedbacks and integrate it in -rc2.
> 
> This series has been acked and tested. Sure, if we are looking for some
> more feedback we can push to -rc2 if not it a good candidate to be
> selected for -rc1.

It has been acked and tested only on Arm platforms.
And Olivier, the ring maintainer, was not Cc'ed.

I feel it is not enough.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail
  2018-10-27 15:13               ` Thomas Monjalon
@ 2018-10-27 15:34                 ` Jerin Jacob
  2018-10-27 15:48                   ` Thomas Monjalon
                                     ` (2 more replies)
  2018-11-03 20:12                 ` Mattias Rönnblom
  1 sibling, 3 replies; 131+ messages in thread
From: Jerin Jacob @ 2018-10-27 15:34 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Gavin Hu (Arm Technology China),
	dev, Honnappa Nagarahalli, stable, Ola Liljedahl, olivier.matz,
	chaozhu, bruce.richardson, konstantin.ananyev

-----Original Message-----
> Date: Sat, 27 Oct 2018 17:13:10 +0200
> From: Thomas Monjalon <thomas@monjalon.net>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>, "dev@dpdk.org"
>  <dev@dpdk.org>, Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
>  "stable@dpdk.org" <stable@dpdk.org>, Ola Liljedahl
>  <Ola.Liljedahl@arm.com>, "olivier.matz@6wind.com"
>  <olivier.matz@6wind.com>, "chaozhu@linux.vnet.ibm.com"
>  <chaozhu@linux.vnet.ibm.com>, "bruce.richardson@intel.com"
>  <bruce.richardson@intel.com>, "konstantin.ananyev@intel.com"
>  <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
>  the tail
> 
> 
> 27/10/2018 17:00, Jerin Jacob:
> > From: Thomas Monjalon <thomas@monjalon.net>
> > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > Hi Jerin
> > > >
> > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-patch series to unblock the merge.
> > >
> > > The thread is totally messed up because:
> > >         - there is no cover letter
> > >         - some different series (testpmd, i40e and doc) are in the same thread
> > >         - v4 replies to a different series
> > >         - this version should be a v5 but has no number
> > >         - this version replies to the v3
> > >         - patchwork still shows v3 and "v5"
> > >         - replies from Ola are not quoting previous discussion
> > >
> > > Because of all of this, it is really difficult to follow.
> > > This is probably the reason of the lack of review outside of Arm.
> > >
> > > One more issue: you must Cc the relevant maintainers.
> > > Here:
> > >         - Olivier for rte_ring
> > >         - Chao for IBM platform
> > >         - Bruce and Konstantin for x86
> > >
> > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > But please consider better formatting your discussions, it is really
> > > important in our contribution workflow.
> > >
> > > I don't know what to do.
> > > I suggest to wait for more feedbacks and integrate it in -rc2.
> >
> > This series has been acked and tested. Sure, if we are looking for some
> > more feedback we can push to -rc2 if not it a good candidate to be
> > selected for -rc1.
> 
> It has been acked and tested only for Arm platforms.
> And Olivier, the ring maintainer, was not Cc.
> 
> I feel it is not enough.

Sure, more reviews are always better. But let's keep -rc2 as the target.



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail
  2018-10-27 15:34                 ` Jerin Jacob
@ 2018-10-27 15:48                   ` Thomas Monjalon
  2018-10-29  2:51                   ` Gavin Hu (Arm Technology China)
  2018-10-29  2:57                   ` Gavin Hu (Arm Technology China)
  2 siblings, 0 replies; 131+ messages in thread
From: Thomas Monjalon @ 2018-10-27 15:48 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Gavin Hu (Arm Technology China),
	dev, Honnappa Nagarahalli, stable, Ola Liljedahl, olivier.matz,
	chaozhu, bruce.richardson, konstantin.ananyev

27/10/2018 17:34, Jerin Jacob:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 27/10/2018 17:00, Jerin Jacob:
> > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > > Hi Jerin
> > > > >
> > > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-patch series to unblock the merge.
> > > >
> > > > The thread is totally messed up because:
> > > >         - there is no cover letter
> > > >         - some different series (testpmd, i40e and doc) are in the same thread
> > > >         - v4 replies to a different series
> > > >         - this version should be a v5 but has no number
> > > >         - this version replies to the v3
> > > >         - patchwork still shows v3 and "v5"
> > > >         - replies from Ola are not quoting previous discussion
> > > >
> > > > Because of all of this, it is really difficult to follow.
> > > > This is probably the reason of the lack of review outside of Arm.
> > > >
> > > > One more issue: you must Cc the relevant maintainers.
> > > > Here:
> > > >         - Olivier for rte_ring
> > > >         - Chao for IBM platform
> > > >         - Bruce and Konstantin for x86
> > > >
> > > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > > But please consider better formatting your discussions, it is really
> > > > important in our contribution workflow.
> > > >
> > > > I don't know what to do.
> > > > I suggest to wait for more feedbacks and integrate it in -rc2.
> > >
> > > This series has been acked and tested. Sure, if we are looking for some
> > > more feedback we can push to -rc2 if not it a good candidate to be
> > > selected for -rc1.
> > 
> > It has been acked and tested only for Arm platforms.
> > And Olivier, the ring maintainer, was not Cc.
> > 
> > I feel it is not enough.
> 
> Sure, More reviews is already better. But lets keep as -rc2 target.

Yes. If there is no objection, it will enter -rc2.
After -rc2, we cannot make such a core change anyway.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail
  2018-10-27 15:34                 ` Jerin Jacob
  2018-10-27 15:48                   ` Thomas Monjalon
@ 2018-10-29  2:51                   ` Gavin Hu (Arm Technology China)
  2018-10-29  2:57                   ` Gavin Hu (Arm Technology China)
  2 siblings, 0 replies; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-10-29  2:51 UTC (permalink / raw)
  To: Jerin Jacob, Thomas Monjalon
  Cc: dev, Honnappa Nagarahalli, stable, Ola Liljedahl, olivier.matz,
	chaozhu, bruce.richardson, konstantin.ananyev

Hi Thomas and Jerin,

The patches were extensively reviewed by Arm internally. As the 1st patch was not concluded, I created a new series (2 patches).


> -----Original Message-----
> From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Sent: Saturday, October 27, 2018 11:34 PM
> To: Thomas Monjalon <thomas@monjalon.net>
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org;
> Ola Liljedahl <Ola.Liljedahl@arm.com>; olivier.matz@6wind.com;
> chaozhu@linux.vnet.ibm.com; bruce.richardson@intel.com;
> konstantin.ananyev@intel.com
> Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
> the tail
>
> -----Original Message-----
> > Date: Sat, 27 Oct 2018 17:13:10 +0200
> > From: Thomas Monjalon <thomas@monjalon.net>
> > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > Cc: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>,
> "dev@dpdk.org"
> >  <dev@dpdk.org>, Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>,
> > "stable@dpdk.org" <stable@dpdk.org>, Ola Liljedahl
> > <Ola.Liljedahl@arm.com>, "olivier.matz@6wind.com"
> >  <olivier.matz@6wind.com>, "chaozhu@linux.vnet.ibm.com"
> >  <chaozhu@linux.vnet.ibm.com>, "bruce.richardson@intel.com"
> >  <bruce.richardson@intel.com>, "konstantin.ananyev@intel.com"
> >  <konstantin.ananyev@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and
> > store of  the tail
> >
> >
> > 27/10/2018 17:00, Jerin Jacob:
> > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > > Hi Jerin
> > > > >
> > > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-
> patch series to unblock the merge.
> > > >
> > > > The thread is totally messed up because:
> > > >         - there is no cover letter
> > > >         - some different series (testpmd, i40e and doc) are in the same
> thread
> > > >         - v4 replies to a different series
> > > >         - this version should be a v5 but has no number
> > > >         - this version replies to the v3
> > > >         - patchwork still shows v3 and "v5"
> > > >         - replies from Ola are not quoting previous discussion
> > > >
> > > > Because of all of this, it is really difficult to follow.
> > > > This is probably the reason of the lack of review outside of Arm.
> > > >
> > > > One more issue: you must Cc the relevant maintainers.
> > > > Here:
> > > >         - Olivier for rte_ring
> > > >         - Chao for IBM platform
> > > >         - Bruce and Konstantin for x86
> > > >
> > > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > > But please consider better formatting your discussions, it is
> > > > really important in our contribution workflow.
> > > >
> > > > I don't know what to do.
> > > > I suggest to wait for more feedbacks and integrate it in -rc2.
> > >
> > > This series has been acked and tested. Sure, if we are looking for
> > > some more feedback we can push to -rc2 if not it a good candidate to
> > > be selected for -rc1.
> >
> > It has been acked and tested only for Arm platforms.
> > And Olivier, the ring maintainer, was not Cc.
> >
> > I feel it is not enough.
>
> Sure, More reviews is already better. But lets keep as -rc2 target.
>
>
> >
> >
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail
  2018-10-27 15:34                 ` Jerin Jacob
  2018-10-27 15:48                   ` Thomas Monjalon
  2018-10-29  2:51                   ` Gavin Hu (Arm Technology China)
@ 2018-10-29  2:57                   ` Gavin Hu (Arm Technology China)
  2018-10-29 10:16                     ` Jerin Jacob
  2 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-10-29  2:57 UTC (permalink / raw)
  To: Jerin Jacob, Thomas Monjalon
  Cc: dev, Honnappa Nagarahalli, stable, Ola Liljedahl, olivier.matz,
	chaozhu, bruce.richardson, konstantin.ananyev

Hi Thomas and Jerin,

The patches were extensively reviewed by Arm internally; as the 1st patch could not be concluded, I created a new patch series (2 patches).
How can I clean up this mess?
1. Should I mark all the previous patches as Superseded?
2. We have two more new patches; should I submit the 4 patches (the old 2 patches + 2 new patches) as a v2?

Best Regards,
Gavin


> -----Original Message-----
> From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Sent: Saturday, October 27, 2018 11:34 PM
> To: Thomas Monjalon <thomas@monjalon.net>
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org;
> Ola Liljedahl <Ola.Liljedahl@arm.com>; olivier.matz@6wind.com;
> chaozhu@linux.vnet.ibm.com; bruce.richardson@intel.com;
> konstantin.ananyev@intel.com
> Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
> the tail
>
> -----Original Message-----
> > Date: Sat, 27 Oct 2018 17:13:10 +0200
> > From: Thomas Monjalon <thomas@monjalon.net>
> > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > Cc: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>,
> "dev@dpdk.org"
> >  <dev@dpdk.org>, Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>,
> > "stable@dpdk.org" <stable@dpdk.org>, Ola Liljedahl
> > <Ola.Liljedahl@arm.com>, "olivier.matz@6wind.com"
> >  <olivier.matz@6wind.com>, "chaozhu@linux.vnet.ibm.com"
> >  <chaozhu@linux.vnet.ibm.com>, "bruce.richardson@intel.com"
> >  <bruce.richardson@intel.com>, "konstantin.ananyev@intel.com"
> >  <konstantin.ananyev@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and
> > store of  the tail
> >
> >
> > 27/10/2018 17:00, Jerin Jacob:
> > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > > Hi Jerin
> > > > >
> > > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-
> patch series to unblock the merge.
> > > >
> > > > The thread is totally messed up because:
> > > >         - there is no cover letter
> > > >         - some different series (testpmd, i40e and doc) are in the same
> thread
> > > >         - v4 replies to a different series
> > > >         - this version should be a v5 but has no number
> > > >         - this version replies to the v3
> > > >         - patchwork still shows v3 and "v5"
> > > >         - replies from Ola are not quoting previous discussion
> > > >
> > > > Because of all of this, it is really difficult to follow.
> > > > This is probably the reason of the lack of review outside of Arm.
> > > >
> > > > One more issue: you must Cc the relevant maintainers.
> > > > Here:
> > > >         - Olivier for rte_ring
> > > >         - Chao for IBM platform
> > > >         - Bruce and Konstantin for x86
> > > >
> > > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > > But please consider better formatting your discussions, it is
> > > > really important in our contribution workflow.
> > > >
> > > > I don't know what to do.
> > > > I suggest to wait for more feedbacks and integrate it in -rc2.
> > >
> > > This series has been acked and tested. Sure, if we are looking for
> > > some more feedback we can push to -rc2 if not it a good candidate to
> > > be selected for -rc1.
> >
> > It has been acked and tested only for Arm platforms.
> > And Olivier, the ring maintainer, was not Cc.
> >
> > I feel it is not enough.
>
> Sure, More reviews is already better. But lets keep as -rc2 target.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail
  2018-10-29  2:57                   ` Gavin Hu (Arm Technology China)
@ 2018-10-29 10:16                     ` Jerin Jacob
  2018-10-29 10:47                       ` Thomas Monjalon
  0 siblings, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2018-10-29 10:16 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China)
  Cc: Thomas Monjalon, dev, Honnappa Nagarahalli, stable,
	Ola Liljedahl, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev

-----Original Message-----
> Date: Mon, 29 Oct 2018 02:57:17 +0000
> From: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>, Thomas Monjalon
>  <thomas@monjalon.net>
> CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>  <Honnappa.Nagarahalli@arm.com>, "stable@dpdk.org" <stable@dpdk.org>, Ola
>  Liljedahl <Ola.Liljedahl@arm.com>, "olivier.matz@6wind.com"
>  <olivier.matz@6wind.com>, "chaozhu@linux.vnet.ibm.com"
>  <chaozhu@linux.vnet.ibm.com>, "bruce.richardson@intel.com"
>  <bruce.richardson@intel.com>, "konstantin.ananyev@intel.com"
>  <konstantin.ananyev@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
>  the tail
> 
> 
> Hi Thomas and Jerin,
> 
> The patches were extensively reviewed by Arm internally, as the 1st patch was not able to be concluded, I created a new patch series(2 patches).
> How can I clean up this mess?
> 1. make all the previous patches Superseded?
> 2. We have two more new patches, should I submit the 4 patches (the old 2 patches + 2 new patches) with V2?

I would suggest superseding the old patches (not only in this case, but in
any case when you send a new version and update the version number).

I would suggest sending the new patches as a separate series. If it has a
dependency on existing acked patches, please mention that in the cover letter.


> 
> Best Regards,
> Gavin
> 
> 
> > -----Original Message-----
> > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > Sent: Saturday, October 27, 2018 11:34 PM
> > To: Thomas Monjalon <thomas@monjalon.net>
> > Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org;
> > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org;
> > Ola Liljedahl <Ola.Liljedahl@arm.com>; olivier.matz@6wind.com;
> > chaozhu@linux.vnet.ibm.com; bruce.richardson@intel.com;
> > konstantin.ananyev@intel.com
> > Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
> > the tail
> >
> > -----Original Message-----
> > > Date: Sat, 27 Oct 2018 17:13:10 +0200
> > > From: Thomas Monjalon <thomas@monjalon.net>
> > > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > Cc: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>,
> > "dev@dpdk.org"
> > >  <dev@dpdk.org>, Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>,
> > > "stable@dpdk.org" <stable@dpdk.org>, Ola Liljedahl
> > > <Ola.Liljedahl@arm.com>, "olivier.matz@6wind.com"
> > >  <olivier.matz@6wind.com>, "chaozhu@linux.vnet.ibm.com"
> > >  <chaozhu@linux.vnet.ibm.com>, "bruce.richardson@intel.com"
> > >  <bruce.richardson@intel.com>, "konstantin.ananyev@intel.com"
> > >  <konstantin.ananyev@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and
> > > store of  the tail
> > >
> > >
> > > 27/10/2018 17:00, Jerin Jacob:
> > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > > > Hi Jerin
> > > > > >
> > > > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-
> > patch series to unblock the merge.
> > > > >
> > > > > The thread is totally messed up because:
> > > > >         - there is no cover letter
> > > > >         - some different series (testpmd, i40e and doc) are in the same
> > thread
> > > > >         - v4 replies to a different series
> > > > >         - this version should be a v5 but has no number
> > > > >         - this version replies to the v3
> > > > >         - patchwork still shows v3 and "v5"
> > > > >         - replies from Ola are not quoting previous discussion
> > > > >
> > > > > Because of all of this, it is really difficult to follow.
> > > > > This is probably the reason of the lack of review outside of Arm.
> > > > >
> > > > > One more issue: you must Cc the relevant maintainers.
> > > > > Here:
> > > > >         - Olivier for rte_ring
> > > > >         - Chao for IBM platform
> > > > >         - Bruce and Konstantin for x86
> > > > >
> > > > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > > > But please consider better formatting your discussions, it is
> > > > > really important in our contribution workflow.
> > > > >
> > > > > I don't know what to do.
> > > > > I suggest to wait for more feedbacks and integrate it in -rc2.
> > > >
> > > > This series has been acked and tested. Sure, if we are looking for
> > > > some more feedback we can push to -rc2 if not it a good candidate to
> > > > be selected for -rc1.
> > >
> > > It has been acked and tested only for Arm platforms.
> > > And Olivier, the ring maintainer, was not Cc.
> > >
> > > I feel it is not enough.
> >
> > Sure, More reviews is already better. But lets keep as -rc2 target.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail
  2018-10-29 10:16                     ` Jerin Jacob
@ 2018-10-29 10:47                       ` Thomas Monjalon
  2018-10-29 11:10                         ` Jerin Jacob
  0 siblings, 1 reply; 131+ messages in thread
From: Thomas Monjalon @ 2018-10-29 10:47 UTC (permalink / raw)
  To: Jerin Jacob, Gavin Hu (Arm Technology China)
  Cc: dev, Honnappa Nagarahalli, stable, Ola Liljedahl, olivier.matz,
	chaozhu, bruce.richardson, konstantin.ananyev

29/10/2018 11:16, Jerin Jacob:
> From: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>
> > 
> > Hi Thomas and Jerin,
> > 
> > The patches were extensively reviewed by Arm internally, as the 1st patch was not able to be concluded, I created a new patch series(2 patches).
> > How can I clean up this mess?
> > 1. make all the previous patches Superseded?
> > 2. We have two more new patches, should I submit the 4 patches (the old 2 patches + 2 new patches) with V2?
> 
> I would suggest to supersede the old patches(not in this case, in any case when you
> send new version and update the version number).

Why not in this case?
There are some old patches in patchwork which should be superseded.

> I would suggest send new patches as separate series. If it has dependency on
> exiting Acked patches please mention that in cover letter.

I would also suggest to stop top-posting; it doesn't help reading threads.


> > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > 27/10/2018 17:00, Jerin Jacob:
> > > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > > > > Hi Jerin
> > > > > > >
> > > > > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-
> > > patch series to unblock the merge.
> > > > > >
> > > > > > The thread is totally messed up because:
> > > > > >         - there is no cover letter
> > > > > >         - some different series (testpmd, i40e and doc) are in the same
> > > thread
> > > > > >         - v4 replies to a different series
> > > > > >         - this version should be a v5 but has no number
> > > > > >         - this version replies to the v3
> > > > > >         - patchwork still shows v3 and "v5"
> > > > > >         - replies from Ola are not quoting previous discussion
> > > > > >
> > > > > > Because of all of this, it is really difficult to follow.
> > > > > > This is probably the reason of the lack of review outside of Arm.
> > > > > >
> > > > > > One more issue: you must Cc the relevant maintainers.
> > > > > > Here:
> > > > > >         - Olivier for rte_ring
> > > > > >         - Chao for IBM platform
> > > > > >         - Bruce and Konstantin for x86
> > > > > >
> > > > > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > > > > But please consider better formatting your discussions, it is
> > > > > > really important in our contribution workflow.
> > > > > >
> > > > > > I don't know what to do.
> > > > > > I suggest to wait for more feedbacks and integrate it in -rc2.
> > > > >
> > > > > This series has been acked and tested. Sure, if we are looking for
> > > > > some more feedback we can push to -rc2 if not it a good candidate to
> > > > > be selected for -rc1.
> > > >
> > > > It has been acked and tested only for Arm platforms.
> > > > And Olivier, the ring maintainer, was not Cc.
> > > >
> > > > I feel it is not enough.
> > >
> > > Sure, More reviews is already better. But lets keep as -rc2 target.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail
  2018-10-29 10:47                       ` Thomas Monjalon
@ 2018-10-29 11:10                         ` Jerin Jacob
  0 siblings, 0 replies; 131+ messages in thread
From: Jerin Jacob @ 2018-10-29 11:10 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Gavin Hu (Arm Technology China),
	dev, Honnappa Nagarahalli, stable, Ola Liljedahl, olivier.matz,
	chaozhu, bruce.richardson, konstantin.ananyev

-----Original Message-----
> Date: Mon, 29 Oct 2018 11:47:05 +0100
> From: Thomas Monjalon <thomas@monjalon.net>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>, "Gavin Hu (Arm Technology
>  China)" <Gavin.Hu@arm.com>
> Cc: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>  <Honnappa.Nagarahalli@arm.com>, "stable@dpdk.org" <stable@dpdk.org>, Ola
>  Liljedahl <Ola.Liljedahl@arm.com>, "olivier.matz@6wind.com"
>  <olivier.matz@6wind.com>, "chaozhu@linux.vnet.ibm.com"
>  <chaozhu@linux.vnet.ibm.com>, "bruce.richardson@intel.com"
>  <bruce.richardson@intel.com>, "konstantin.ananyev@intel.com"
>  <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
>  the tail
> 
> 
> 29/10/2018 11:16, Jerin Jacob:
> > From: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>
> > >
> > > Hi Thomas and Jerin,
> > >
> > > The patches were extensively reviewed by Arm internally, as the 1st patch was not able to be concluded, I created a new patch series(2 patches).
> > > How can I clean up this mess?
> > > 1. make all the previous patches Superseded?
> > > 2. We have two more new patches, should I submit the 4 patches (the old 2 patches + 2 new patches) with V2?
> >
> > I would suggest to supersede the old patches(not in this case, in any case when you
> > send new version and update the version number).
> 
> Why not in this case?

I did not mean this case in particular. I meant in all cases.

> There are some old patches in patchwork which should be superseded.
> 
> > I would suggest send new patches as separate series. If it has dependency on
> > exiting Acked patches please mention that in cover letter.
> 
> I would suggest also to stop top-posting, it doesn't help reading threads.
> 
> 
> > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > 27/10/2018 17:00, Jerin Jacob:
> > > > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > > > > > Hi Jerin
> > > > > > > >
> > > > > > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-
> > > > patch series to unblock the merge.
> > > > > > >
> > > > > > > The thread is totally messed up because:
> > > > > > >         - there is no cover letter
> > > > > > >         - some different series (testpmd, i40e and doc) are in the same
> > > > thread
> > > > > > >         - v4 replies to a different series
> > > > > > >         - this version should be a v5 but has no number
> > > > > > >         - this version replies to the v3
> > > > > > >         - patchwork still shows v3 and "v5"
> > > > > > >         - replies from Ola are not quoting previous discussion
> > > > > > >
> > > > > > > Because of all of this, it is really difficult to follow.
> > > > > > > This is probably the reason of the lack of review outside of Arm.
> > > > > > >
> > > > > > > One more issue: you must Cc the relevant maintainers.
> > > > > > > Here:
> > > > > > >         - Olivier for rte_ring
> > > > > > >         - Chao for IBM platform
> > > > > > >         - Bruce and Konstantin for x86
> > > > > > >
> > > > > > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > > > > > But please consider better formatting your discussions, it is
> > > > > > > really important in our contribution workflow.
> > > > > > >
> > > > > > > I don't know what to do.
> > > > > > > I suggest to wait for more feedbacks and integrate it in -rc2.
> > > > > >
> > > > > > This series has been acked and tested. Sure, if we are looking for
> > > > > > some more feedback we can push to -rc2 if not it a good candidate to
> > > > > > be selected for -rc1.
> > > > >
> > > > > It has been acked and tested only for Arm platforms.
> > > > > And Olivier, the ring maintainer, was not Cc.
> > > > >
> > > > > I feel it is not enough.
> > > >
> > > > Sure, More reviews is already better. But lets keep as -rc2 target.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v2 0/2] rte ring c11 bug fix and optimization
  2018-10-17  6:29       ` [dpdk-dev] [PATCH 1/2] " Gavin Hu
  2018-10-17  6:29         ` [dpdk-dev] [PATCH 2/2] ring: move the atomic load of head above the loop Gavin Hu
  2018-10-17  6:35         ` [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail Gavin Hu (Arm Technology China)
@ 2018-10-31  3:35         ` Gavin Hu
  2018-10-31 10:26           ` [dpdk-dev] [PATCH v3 0/2] ring library with c11 memory model " Gavin Hu
                             ` (2 more replies)
  2018-10-31  3:35         ` [dpdk-dev] [PATCH v2 1/2] ring: synchronize the load and store of the tail Gavin Hu
                           ` (2 subsequent siblings)
  5 siblings, 3 replies; 131+ messages in thread
From: Gavin Hu @ 2018-10-31  3:35 UTC (permalink / raw)
  To: dev
  Cc: thomas, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu

v1->v2
1) Add the changes to the 18.11 release notes 

V1:
Updated the rte ring C11 driver including the following changes
1) Synchronize the load and store of the tail to ensure the enqueue/dequeue
   operations are really completed before seen by the observers on the other
   sides.
2) Move the atomic load of head above the loop; for iterations after the
   first it is unnecessary and degrades performance, as the head was already
   loaded in the failure case of CAS.

Gavin Hu (2):
  ring: synchronize the load and store of the tail
  ring: move the atomic load of head above the loop

 doc/guides/rel_notes/release_18_11.rst |  7 +++++++
 lib/librte_ring/rte_ring_c11_mem.h     | 24 +++++++++++++++++-------
 2 files changed, 24 insertions(+), 7 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v2 1/2] ring: synchronize the load and store of the tail
  2018-10-17  6:29       ` [dpdk-dev] [PATCH 1/2] " Gavin Hu
                           ` (2 preceding siblings ...)
  2018-10-31  3:35         ` [dpdk-dev] [PATCH v2 0/2] rte ring c11 bug fix and optimization Gavin Hu
@ 2018-10-31  3:35         ` Gavin Hu
  2018-10-31  3:35         ` [dpdk-dev] [PATCH v2 2/2] ring: move the atomic load of head above the loop Gavin Hu
  2018-11-01  9:53         ` [dpdk-dev] [PATCH v4 0/2] ring library with c11 memory model bug fix and optimization Gavin Hu
  5 siblings, 0 replies; 131+ messages in thread
From: Gavin Hu @ 2018-10-31  3:35 UTC (permalink / raw)
  To: dev
  Cc: thomas, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu,
	stable

Synchronize the load-acquire of the tail with the store-release
within update_tail; the store-release ensures all the ring operations,
enqueue or dequeue, are seen by the observers on the other side as soon
as they see the updated tail. The load-acquire is needed here as the
data dependency is not a reliable way of ordering: the compiler might
break it by saving to temporary values to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.

The patch was benchmarked with test/ring_perf_autotest and it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
For 1 lcore, it also improves a little, about 3 ~ 4%.
It is a big improvement: in the MPMC case, with two lcores and a ring size
of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.

This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting the
load-acquire above __atomic_compare_exchange_n.

The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i

Test result with this patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.86
 MP/MC bulk enq/dequeue (size: 8): 10.15
 SP/SC bulk enq/dequeue (size: 32): 1.94
 MP/MC bulk enq/dequeue (size: 32): 2.36

In comparison of the test result without this patch:
 SP/SC bulk enq/dequeue (size: 8): 6.67
 MP/MC bulk enq/dequeue (size: 8): 13.12
 SP/SC bulk enq/dequeue (size: 32): 2.04
 MP/MC bulk enq/dequeue (size: 32): 3.26

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 94df3c4..4851763 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -67,13 +67,18 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		*old_head = __atomic_load_n(&r->prod.head,
 					__ATOMIC_ACQUIRE);
 
-		/*
-		 *  The subtraction is done between two unsigned 32bits value
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
+							__ATOMIC_ACQUIRE);
+
+		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * *old_head > cons_tail). So 'free_entries' is always between 0
 		 * and capacity (which is < size).
 		 */
-		*free_entries = (capacity + r->cons.tail - *old_head);
+		*free_entries = (capacity + cons_tail - *old_head);
 
 		/* check that we have enough room in ring */
 		if (unlikely(n > *free_entries))
@@ -131,15 +136,22 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
+
 		*old_head = __atomic_load_n(&r->cons.head,
 					__ATOMIC_ACQUIRE);
 
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
+					__ATOMIC_ACQUIRE);
+
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
 		 * and size(ring)-1.
 		 */
-		*entries = (r->prod.tail - *old_head);
+		*entries = (prod_tail - *old_head);
 
 		/* Set the actual entries for dequeue */
 		if (n > *entries)
-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v2 2/2] ring: move the atomic load of head above the loop
  2018-10-17  6:29       ` [dpdk-dev] [PATCH 1/2] " Gavin Hu
                           ` (3 preceding siblings ...)
  2018-10-31  3:35         ` [dpdk-dev] [PATCH v2 1/2] ring: synchronize the load and store of the tail Gavin Hu
@ 2018-10-31  3:35         ` Gavin Hu
  2018-10-31  9:36           ` Thomas Monjalon
  2018-11-01  9:53         ` [dpdk-dev] [PATCH v4 0/2] ring library with c11 memory model bug fix and optimization Gavin Hu
  5 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu @ 2018-10-31  3:35 UTC (permalink / raw)
  To: dev
  Cc: thomas, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu,
	stable

In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
the do {} while loop: upon CAS failure the old_head is updated, so
another load is costly and not necessary.

This helps latency a little, by about 1~5%.

 Test result with the patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.64
 MP/MC bulk enq/dequeue (size: 8): 9.58
 SP/SC bulk enq/dequeue (size: 32): 1.98
 MP/MC bulk enq/dequeue (size: 32): 2.30

Fixes: 39368ebfc606 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 doc/guides/rel_notes/release_18_11.rst |  7 +++++++
 lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/doc/guides/rel_notes/release_18_11.rst b/doc/guides/rel_notes/release_18_11.rst
index 376128f..3bbbe2a 100644
--- a/doc/guides/rel_notes/release_18_11.rst
+++ b/doc/guides/rel_notes/release_18_11.rst
@@ -158,6 +158,13 @@ New Features
   * Support for runtime Rx and Tx queues setup.
   * Support multicast MAC address set.
 
+* **Updated rte ring C11 driver.**
+
+  Updated the rte ring C11 driver including the following changes:
+
+  * Synchronize the load and store of the tail
+  * Move the atomic load of head above the loop
+
 * **Added a devarg to use PCAP interface physical MAC address.**
   A new devarg ``phy_mac`` was introduced to allow users to use physical
   MAC address of the selected PCAP interface.
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 4851763..fdab2b9 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -60,13 +60,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 	unsigned int max = n;
 	int success;
 
+	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Reset n to the initial burst count */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->prod.head,
-					__ATOMIC_ACQUIRE);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -92,6 +90,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		if (is_sp)
 			r->prod.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head is updated */
 			success = __atomic_compare_exchange_n(&r->prod.head,
 					old_head, *new_head,
 					0, __ATOMIC_ACQUIRE,
@@ -133,13 +132,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	int success;
 
 	/* move cons.head atomically */
+	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->cons.head,
-					__ATOMIC_ACQUIRE);
-
 		/* this load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -164,6 +161,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		if (is_sc)
 			r->cons.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head will be updated */
 			success = __atomic_compare_exchange_n(&r->cons.head,
 							old_head, *new_head,
 							0, __ATOMIC_ACQUIRE,
-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/2] ring: move the atomic load of head above the loop
  2018-10-31  3:35         ` [dpdk-dev] [PATCH v2 2/2] ring: move the atomic load of head above the loop Gavin Hu
@ 2018-10-31  9:36           ` Thomas Monjalon
  2018-10-31 10:27             ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 131+ messages in thread
From: Thomas Monjalon @ 2018-10-31  9:36 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, olivier.matz, chaozhu, bruce.richardson, konstantin.ananyev,
	jerin.jacob, Honnappa.Nagarahalli, stable

31/10/2018 04:35, Gavin Hu:
> --- a/doc/guides/rel_notes/release_18_11.rst
> +++ b/doc/guides/rel_notes/release_18_11.rst
> +* **Updated rte ring C11 driver.**
> +
> +  Updated the rte ring C11 driver including the following changes:

It is not a driver, it is the ring library with C11 memory model.
Please reword and move it after memory related notes (at the beginning).
Thanks

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v3 0/2] ring library with c11 memory model bug fix and optimization
  2018-10-31  3:35         ` [dpdk-dev] [PATCH v2 0/2] rte ring c11 bug fix and optimization Gavin Hu
@ 2018-10-31 10:26           ` Gavin Hu
  2018-10-31 16:58             ` Thomas Monjalon
                               ` (2 more replies)
  2018-10-31 10:26           ` [dpdk-dev] [PATCH v3 1/2] ring: synchronize the load and store of the tail Gavin Hu
  2018-10-31 10:26           ` [dpdk-dev] [PATCH v3 2/2] ring: move the atomic load of head above the loop Gavin Hu
  2 siblings, 3 replies; 131+ messages in thread
From: Gavin Hu @ 2018-10-31 10:26 UTC (permalink / raw)
  To: dev
  Cc: thomas, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu

v2->v3
1) reword the change and relocate it in the release note.

v1->v2
1) Add the changes to the 18.11 release note.

V1:
Updated the ring library with C11 memory model including the following changes
1) Synchronize the load and store of the tail to ensure the enqueue/dequeue
   operations are really completed before seen by the observers on the other
   sides.
2) Move the atomic load of head above the loop; for iterations after the
   first it is unnecessary and degrades performance, as the head was already
   loaded in the failure case of CAS.

Gavin Hu (2):
  ring: synchronize the load and store of the tail
  ring: move the atomic load of head above the loop

 doc/guides/rel_notes/release_18_11.rst |  7 +++++++
 lib/librte_ring/rte_ring_c11_mem.h     | 24 +++++++++++++++++-------
 2 files changed, 24 insertions(+), 7 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v3 1/2] ring: synchronize the load and store of the tail
  2018-10-31  3:35         ` [dpdk-dev] [PATCH v2 0/2] rte ring c11 bug fix and optimization Gavin Hu
  2018-10-31 10:26           ` [dpdk-dev] [PATCH v3 0/2] ring library with c11 memory model " Gavin Hu
@ 2018-10-31 10:26           ` Gavin Hu
  2018-10-31 22:07             ` Stephen Hemminger
  2018-10-31 10:26           ` [dpdk-dev] [PATCH v3 2/2] ring: move the atomic load of head above the loop Gavin Hu
  2 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu @ 2018-10-31 10:26 UTC (permalink / raw)
  To: dev
  Cc: thomas, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu,
	stable

Synchronize the load-acquire of the tail with the store-release
within update_tail; the store-release ensures all the ring operations,
enqueue or dequeue, are seen by the observers on the other side as soon
as they see the updated tail. The load-acquire is needed here as the
data dependency is not a reliable way of ordering: the compiler might
break it by saving to temporary values to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.

The patch was benchmarked with test/ring_perf_autotest and it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
For 1 lcore, it also improves a little, about 3 ~ 4%.
It is a big improvement: in the MPMC case, with two lcores and a ring size
of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.

This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting the
load-acquire above __atomic_compare_exchange_n.

The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i

Test result with this patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.86
 MP/MC bulk enq/dequeue (size: 8): 10.15
 SP/SC bulk enq/dequeue (size: 32): 1.94
 MP/MC bulk enq/dequeue (size: 32): 2.36

In comparison of the test result without this patch:
 SP/SC bulk enq/dequeue (size: 8): 6.67
 MP/MC bulk enq/dequeue (size: 8): 13.12
 SP/SC bulk enq/dequeue (size: 32): 2.04
 MP/MC bulk enq/dequeue (size: 32): 3.26

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 94df3c4..4851763 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -67,13 +67,18 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		*old_head = __atomic_load_n(&r->prod.head,
 					__ATOMIC_ACQUIRE);
 
-		/*
-		 *  The subtraction is done between two unsigned 32bits value
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
+							__ATOMIC_ACQUIRE);
+
+		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * *old_head > cons_tail). So 'free_entries' is always between 0
 		 * and capacity (which is < size).
 		 */
-		*free_entries = (capacity + r->cons.tail - *old_head);
+		*free_entries = (capacity + cons_tail - *old_head);
 
 		/* check that we have enough room in ring */
 		if (unlikely(n > *free_entries))
@@ -131,15 +136,22 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
+
 		*old_head = __atomic_load_n(&r->cons.head,
 					__ATOMIC_ACQUIRE);
 
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
+					__ATOMIC_ACQUIRE);
+
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
 		 * and size(ring)-1.
 		 */
-		*entries = (r->prod.tail - *old_head);
+		*entries = (prod_tail - *old_head);
 
 		/* Set the actual entries for dequeue */
 		if (n > *entries)
-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v3 2/2] ring: move the atomic load of head above the loop
  2018-10-31  3:35         ` [dpdk-dev] [PATCH v2 0/2] rte ring c11 bug fix and optimization Gavin Hu
  2018-10-31 10:26           ` [dpdk-dev] [PATCH v3 0/2] ring library with c11 memory model " Gavin Hu
  2018-10-31 10:26           ` [dpdk-dev] [PATCH v3 1/2] ring: synchronize the load and store of the tail Gavin Hu
@ 2018-10-31 10:26           ` Gavin Hu
  2 siblings, 0 replies; 131+ messages in thread
From: Gavin Hu @ 2018-10-31 10:26 UTC (permalink / raw)
  To: dev
  Cc: thomas, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu,
	stable

In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
the do {} while loop: upon CAS failure the old_head is updated, so
another load is costly and not necessary.

This helps latency a little, by about 1~5%.

 Test result with the patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.64
 MP/MC bulk enq/dequeue (size: 8): 9.58
 SP/SC bulk enq/dequeue (size: 32): 1.98
 MP/MC bulk enq/dequeue (size: 32): 2.30

Fixes: 39368ebfc606 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 doc/guides/rel_notes/release_18_11.rst |  7 +++++++
 lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/doc/guides/rel_notes/release_18_11.rst b/doc/guides/rel_notes/release_18_11.rst
index 376128f..c9c2b86 100644
--- a/doc/guides/rel_notes/release_18_11.rst
+++ b/doc/guides/rel_notes/release_18_11.rst
@@ -69,6 +69,13 @@ New Features
   checked out against that dma mask and rejected if out of range. If more than
   one device has addressing limitations, the dma mask is the more restricted one.
 
+* **Updated the ring library with C11 memory model.**
+
+  Updated the ring library with C11 memory model including the following changes:
+
+  * Synchronize the load and store of the tail
+  * Move the atomic load of head above the loop
+
 * **Added hot-unplug handle mechanism.**
 
   ``rte_dev_hotplug_handle_enable`` and ``rte_dev_hotplug_handle_disable`` are
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 4851763..fdab2b9 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -60,13 +60,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 	unsigned int max = n;
 	int success;
 
+	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Reset n to the initial burst count */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->prod.head,
-					__ATOMIC_ACQUIRE);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -92,6 +90,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		if (is_sp)
 			r->prod.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head is updated */
 			success = __atomic_compare_exchange_n(&r->prod.head,
 					old_head, *new_head,
 					0, __ATOMIC_ACQUIRE,
@@ -133,13 +132,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	int success;
 
 	/* move cons.head atomically */
+	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->cons.head,
-					__ATOMIC_ACQUIRE);
-
 		/* this load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -164,6 +161,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		if (is_sc)
 			r->cons.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head will be updated */
 			success = __atomic_compare_exchange_n(&r->cons.head,
 							old_head, *new_head,
 							0, __ATOMIC_ACQUIRE,
-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/2] ring: move the atomic load of head above the loop
  2018-10-31  9:36           ` Thomas Monjalon
@ 2018-10-31 10:27             ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-10-31 10:27 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, olivier.matz, chaozhu, bruce.richardson, konstantin.ananyev,
	jerin.jacob, Honnappa Nagarahalli, stable



> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Wednesday, October 31, 2018 5:37 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; olivier.matz@6wind.com; chaozhu@linux.vnet.ibm.com;
> bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> jerin.jacob@caviumnetworks.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org
> Subject: Re: [PATCH v2 2/2] ring: move the atomic load of head above the
> loop
>
> 31/10/2018 04:35, Gavin Hu:
> > --- a/doc/guides/rel_notes/release_18_11.rst
> > +++ b/doc/guides/rel_notes/release_18_11.rst
> > +* **Updated rte ring C11 driver.**
> > +
> > +  Updated the rte ring C11 driver including the following changes:
>
> It is not a driver, it is the ring library with C11 memory model.
> Please reword and move it after memory related notes (at the beginning).
> Thanks
>

Done, v3 just submitted, thanks!


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/2] ring library with c11 memory model bug fix and optimization
  2018-10-31 10:26           ` [dpdk-dev] [PATCH v3 0/2] ring library with c11 memory model " Gavin Hu
@ 2018-10-31 16:58             ` Thomas Monjalon
  2018-11-01  9:53             ` [dpdk-dev] [PATCH v4 1/2] ring: synchronize the load and store of the tail Gavin Hu
  2018-11-01  9:53             ` [dpdk-dev] [PATCH v4 2/2] ring: move the atomic load of head above the loop Gavin Hu
  2 siblings, 0 replies; 131+ messages in thread
From: Thomas Monjalon @ 2018-10-31 16:58 UTC (permalink / raw)
  To: dev, qian.q.xu
  Cc: Gavin Hu, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli

Last call for review.

Qian, it probably makes sense to have a validation test of this patchset.


31/10/2018 11:26, Gavin Hu:
> v2->v3
> 1) reword the change and relocate it in the release note.
> 
> v1->v2
> 1) Add the changes to the 18.11 release note.
> 
> V1:
> Updated the ring library with C11 memory model including the following changes
> 1) Synchronize the load and store of the tail to ensure the enqueue/dequeue
>    operations are really completed before seen by the observers on the other
>    sides.
> 2) Move the atomic load of head above the loop; for iterations after the
>    first it is unnecessary and degrades performance, as the head was already
>    loaded in the failure case of CAS.
> 
> Gavin Hu (2):
>   ring: synchronize the load and store of the tail
>   ring: move the atomic load of head above the loop
> 
>  doc/guides/rel_notes/release_18_11.rst |  7 +++++++
>  lib/librte_ring/rte_ring_c11_mem.h     | 24 +++++++++++++++++-------
>  2 files changed, 24 insertions(+), 7 deletions(-)

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/2] ring: synchronize the load and store of the tail
  2018-10-31 10:26           ` [dpdk-dev] [PATCH v3 1/2] ring: synchronize the load and store of the tail Gavin Hu
@ 2018-10-31 22:07             ` Stephen Hemminger
  2018-11-01  9:56               ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 131+ messages in thread
From: Stephen Hemminger @ 2018-10-31 22:07 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, thomas, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, stable

On Wed, 31 Oct 2018 18:26:26 +0800
Gavin Hu <gavin.hu@arm.com> wrote:

> -		/*
> -		 *  The subtraction is done between two unsigned 32bits value
> +		/* load-acquire synchronize with store-release of ht->tail
> +		 * in update_tail.
> +		 */
> +		const uint32_t cons_tail

Please don't mix declarations and code. Although it is sometimes used in DPDK,
in general the style is to have declarations at the start of the block scope.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v4 0/2] ring library with c11 memory model bug fix and optimization
  2018-10-17  6:29       ` [dpdk-dev] [PATCH 1/2] " Gavin Hu
                           ` (4 preceding siblings ...)
  2018-10-31  3:35         ` [dpdk-dev] [PATCH v2 2/2] ring: move the atomic load of head above the loop Gavin Hu
@ 2018-11-01  9:53         ` Gavin Hu
  2018-11-02 11:21           ` [dpdk-dev] [PATCH v5 " Gavin Hu
                             ` (2 more replies)
  5 siblings, 3 replies; 131+ messages in thread
From: Gavin Hu @ 2018-11-01  9:53 UTC (permalink / raw)
  To: dev
  Cc: thomas, stephen, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu

v3->v4
1) Move the variable declarations to the beginning of the block.

v2->v3
1) reword the change and relocate it in the release note.

v1->v2
1) Add the changes to the 18.11 release note.

V1:
Updated the ring library with C11 memory model including the following changes
1) Synchronize the load and store of the tail to ensure the enqueue/dequeue
   operations are really completed before seen by the observers on the other
   sides.
2) Move the atomic load of head above the loop; for iterations after the
   first it is unnecessary and degrades performance, as the head was already
   loaded in the failure case of CAS.

Gavin Hu (2):
  ring: synchronize the load and store of the tail
  ring: move the atomic load of head above the loop

 doc/guides/rel_notes/release_18_11.rst |  7 +++++++
 lib/librte_ring/rte_ring_c11_mem.h     | 24 ++++++++++++++++++------
 2 files changed, 25 insertions(+), 6 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v4 1/2] ring: synchronize the load and store of the tail
  2018-10-31 10:26           ` [dpdk-dev] [PATCH v3 0/2] ring library with c11 memory model " Gavin Hu
  2018-10-31 16:58             ` Thomas Monjalon
@ 2018-11-01  9:53             ` Gavin Hu
  2018-11-01  9:53             ` [dpdk-dev] [PATCH v4 2/2] ring: move the atomic load of head above the loop Gavin Hu
  2 siblings, 0 replies; 131+ messages in thread
From: Gavin Hu @ 2018-11-01  9:53 UTC (permalink / raw)
  To: dev
  Cc: thomas, stephen, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu,
	stable

Synchronize the load-acquire of the tail with the store-release
within update_tail; the store-release ensures all the ring operations,
enqueue or dequeue, are seen by the observers on the other side as soon
as they see the updated tail. The load-acquire is needed here as the
data dependency is not a reliable way of ordering: the compiler might
break it by saving to temporary values to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.

The patch was benchmarked with test/ring_perf_autotest and it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
For 1 lcore, it also improves a little, about 3 ~ 4%.
It is a big improvement: in the MPMC case, with two lcores and a ring size
of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.

This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting the
load-acquire above __atomic_compare_exchange_n.

The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i

Test result with this patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.86
 MP/MC bulk enq/dequeue (size: 8): 10.15
 SP/SC bulk enq/dequeue (size: 32): 1.94
 MP/MC bulk enq/dequeue (size: 32): 2.36

In comparison of the test result without this patch:
 SP/SC bulk enq/dequeue (size: 8): 6.67
 MP/MC bulk enq/dequeue (size: 8): 13.12
 SP/SC bulk enq/dequeue (size: 32): 2.04
 MP/MC bulk enq/dequeue (size: 32): 3.26

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 94df3c4..52da95a 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -57,6 +57,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		uint32_t *free_entries)
 {
 	const uint32_t capacity = r->capacity;
+	uint32_t cons_tail;
 	unsigned int max = n;
 	int success;
 
@@ -67,13 +68,18 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		*old_head = __atomic_load_n(&r->prod.head,
 					__ATOMIC_ACQUIRE);
 
-		/*
-		 *  The subtraction is done between two unsigned 32bits value
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		cons_tail = __atomic_load_n(&r->cons.tail,
+					__ATOMIC_ACQUIRE);
+
+		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * *old_head > cons_tail). So 'free_entries' is always between 0
 		 * and capacity (which is < size).
 		 */
-		*free_entries = (capacity + r->cons.tail - *old_head);
+		*free_entries = (capacity + cons_tail - *old_head);
 
 		/* check that we have enough room in ring */
 		if (unlikely(n > *free_entries))
@@ -125,21 +131,29 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		uint32_t *entries)
 {
 	unsigned int max = n;
+	uint32_t prod_tail;
 	int success;
 
 	/* move cons.head atomically */
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
+
 		*old_head = __atomic_load_n(&r->cons.head,
 					__ATOMIC_ACQUIRE);
 
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		prod_tail = __atomic_load_n(&r->prod.tail,
+					__ATOMIC_ACQUIRE);
+
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
 		 * and size(ring)-1.
 		 */
-		*entries = (r->prod.tail - *old_head);
+		*entries = (prod_tail - *old_head);
 
 		/* Set the actual entries for dequeue */
 		if (n > *entries)
-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v4 2/2] ring: move the atomic load of head above the loop
  2018-10-31 10:26           ` [dpdk-dev] [PATCH v3 0/2] ring library with c11 memory model " Gavin Hu
  2018-10-31 16:58             ` Thomas Monjalon
  2018-11-01  9:53             ` [dpdk-dev] [PATCH v4 1/2] ring: synchronize the load and store of the tail Gavin Hu
@ 2018-11-01  9:53             ` Gavin Hu
  2018-11-01 17:26               ` Stephen Hemminger
  2 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu @ 2018-11-01  9:53 UTC (permalink / raw)
  To: dev
  Cc: thomas, stephen, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu,
	stable

In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
the do {} while loop: upon CAS failure the old_head is updated by the
CAS itself, so another load is costly and unnecessary.

This helps latency a little, by about 1~5%.

 Test result with the patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.64
 MP/MC bulk enq/dequeue (size: 8): 9.58
 SP/SC bulk enq/dequeue (size: 32): 1.98
 MP/MC bulk enq/dequeue (size: 32): 2.30

Fixes: 39368ebfc606 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 doc/guides/rel_notes/release_18_11.rst |  7 +++++++
 lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/doc/guides/rel_notes/release_18_11.rst b/doc/guides/rel_notes/release_18_11.rst
index 376128f..c9c2b86 100644
--- a/doc/guides/rel_notes/release_18_11.rst
+++ b/doc/guides/rel_notes/release_18_11.rst
@@ -69,6 +69,13 @@ New Features
   checked out against that dma mask and rejected if out of range. If more than
   one device has addressing limitations, the dma mask is the more restricted one.
 
+* **Updated the ring library with C11 memory model.**
+
+  Updated the ring library with C11 memory model including the following changes:
+
+  * Synchronize the load and store of the tail
+  * Move the atomic load of head above the loop
+
 * **Added hot-unplug handle mechanism.**
 
   ``rte_dev_hotplug_handle_enable`` and ``rte_dev_hotplug_handle_disable`` are
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 52da95a..7bc74a4 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -61,13 +61,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 	unsigned int max = n;
 	int success;
 
+	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Reset n to the initial burst count */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->prod.head,
-					__ATOMIC_ACQUIRE);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -93,6 +91,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		if (is_sp)
 			r->prod.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head is updated */
 			success = __atomic_compare_exchange_n(&r->prod.head,
 					old_head, *new_head,
 					0, __ATOMIC_ACQUIRE,
@@ -135,13 +134,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	int success;
 
 	/* move cons.head atomically */
+	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->cons.head,
-					__ATOMIC_ACQUIRE);
-
 		/* this load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -166,6 +163,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		if (is_sc)
 			r->cons.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head will be updated */
 			success = __atomic_compare_exchange_n(&r->cons.head,
 							old_head, *new_head,
 							0, __ATOMIC_ACQUIRE,
-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/2] ring: synchronize the load and store of the tail
  2018-10-31 22:07             ` Stephen Hemminger
@ 2018-11-01  9:56               ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-11-01  9:56 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, thomas, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa Nagarahalli, stable



> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Thursday, November 1, 2018 6:08 AM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; thomas@monjalon.net; olivier.matz@6wind.com;
> chaozhu@linux.vnet.ibm.com; bruce.richardson@intel.com;
> konstantin.ananyev@intel.com; jerin.jacob@caviumnetworks.com;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 1/2] ring: synchronize the load and store
> of the tail
>
> On Wed, 31 Oct 2018 18:26:26 +0800
> Gavin Hu <gavin.hu@arm.com> wrote:
>
> > -/*
> > - *  The subtraction is done between two unsigned 32bits
> value
> > +/* load-acquire synchronize with store-release of ht->tail
> > + * in update_tail.
> > + */
> > +const uint32_t cons_tail
>
> Please don't mix declarations and code. Although it is sometimes used in
> DPDK, in general the style is to have declarations at the start of the block
> scope.

Thanks for the review; your comment was addressed in v4.
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] ring: move the atomic load of head above the loop
  2018-11-01  9:53             ` [dpdk-dev] [PATCH v4 2/2] ring: move the atomic load of head above the loop Gavin Hu
@ 2018-11-01 17:26               ` Stephen Hemminger
  2018-11-02  0:53                 ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 131+ messages in thread
From: Stephen Hemminger @ 2018-11-01 17:26 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, thomas, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, stable

On Thu,  1 Nov 2018 17:53:51 +0800
Gavin Hu <gavin.hu@arm.com> wrote:

> +* **Updated the ring library with C11 memory model.**
> +
> +  Updated the ring library with C11 memory model including the following changes:
> +
> +  * Synchronize the load and store of the tail
> +  * Move the atomic load of head above the loop
> +

Does this really need to be in the release notes? Is it a user visible change
or just an internal/optimization and fix.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] ring: move the atomic load of head above the loop
  2018-11-01 17:26               ` Stephen Hemminger
@ 2018-11-02  0:53                 ` Gavin Hu (Arm Technology China)
  2018-11-02  4:30                   ` Honnappa Nagarahalli
  0 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-11-02  0:53 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, thomas, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa Nagarahalli, stable

Hi Stephen,

There are no API changes, but this is a significant change, as the ring is fundamental and widely used. It decreases latency by 25% in our tests, and it may do even better for cases with more contending producers/consumers or deeper rings.

Best regards
Gavin


________________________________
From: Stephen Hemminger <stephen@networkplumber.org>
Sent: Friday, November 2, 2018 1:26 AM
To: Gavin Hu (Arm Technology China)
Cc: dev@dpdk.org; thomas@monjalon.net; olivier.matz@6wind.com; chaozhu@linux.vnet.ibm.com; bruce.richardson@intel.com; konstantin.ananyev@intel.com; jerin.jacob@caviumnetworks.com; Honnappa Nagarahalli; stable@dpdk.org
Subject: Re: [PATCH v4 2/2] ring: move the atomic load of head above the loop

On Thu, 1 Nov 2018 17:53:51 +0800
Gavin Hu <gavin.hu@arm.com> wrote:

> +* **Updated the ring library with C11 memory model.**
> +
> + Updated the ring library with C11 memory model including the following changes:
> +
> + * Synchronize the load and store of the tail
> + * Move the atomic load of head above the loop
> +

Does this really need to be in the release notes? Is it a user visible change
or just an internal/optimization and fix.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] ring: move the atomic load of head above the loop
  2018-11-02  0:53                 ` Gavin Hu (Arm Technology China)
@ 2018-11-02  4:30                   ` Honnappa Nagarahalli
  2018-11-02  7:15                     ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 131+ messages in thread
From: Honnappa Nagarahalli @ 2018-11-02  4:30 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), Stephen Hemminger
  Cc: dev, thomas, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, stable, nd

<Fixing this to make the reply inline, making email plain text>

 
On Thu, 1 Nov 2018 17:53:51 +0800 
Gavin Hu <mailto:gavin.hu@arm.com> wrote: 

> +* **Updated the ring library with C11 memory model.** 
> + 
> + Updated the ring library with C11 memory model including the following changes: 
> + 
> + * Synchronize the load and store of the tail 
> + * Move the atomic load of head above the loop 
> + 

Does this really need to be in the release notes? Is it a user visible change 
or just an internal/optimization and fix. 

[Gavin] There is no api changes, but this is a significant change as ring is fundamental and widely used, it decreases latency by 25% in our tests, it may do even better for cases with more contending producers/consumers or deeper depth of rings.

[Honnappa] I agree with Stephen. Release notes should be written from DPDK user perspective. In the rte_ring case, the user has the option of choosing between c11 and non-c11 algorithms. Performance would be one of the criteria to choose between these 2 algorithms. IMO, it probably makes sense to indicate that the performance of c11 based algorithm has been improved. However, I do not know what DPDK has followed historically regarding performance optimizations. I would prefer to follow whatever has been followed so far.
I do not think that we need to document the details of the internal changes since it does not help the user make a decision.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] ring: move the atomic load of head above the loop
  2018-11-02  4:30                   ` Honnappa Nagarahalli
@ 2018-11-02  7:15                     ` Gavin Hu (Arm Technology China)
  2018-11-02  9:36                       ` Thomas Monjalon
  0 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-11-02  7:15 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Stephen Hemminger
  Cc: dev, thomas, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, stable, nd



> -----Original Message-----
> From: Honnappa Nagarahalli
> Sent: Friday, November 2, 2018 12:31 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Stephen
> Hemminger <stephen@networkplumber.org>
> Cc: dev@dpdk.org; thomas@monjalon.net; olivier.matz@6wind.com;
> chaozhu@linux.vnet.ibm.com; bruce.richardson@intel.com;
> konstantin.ananyev@intel.com; jerin.jacob@caviumnetworks.com;
> stable@dpdk.org; nd <nd@arm.com>
> Subject: RE: [PATCH v4 2/2] ring: move the atomic load of head above the
> loop
> 
> <Fixing this to make the reply inline, making email plain text>
> 
> 
> On Thu, 1 Nov 2018 17:53:51 +0800
> Gavin Hu <mailto:gavin.hu@arm.com> wrote:
> 
> > +* **Updated the ring library with C11 memory model.**
> > +
> > + Updated the ring library with C11 memory model including the following
> changes:
> > +
> > + * Synchronize the load and store of the tail
> > + * Move the atomic load of head above the loop
> > +
> 
> Does this really need to be in the release notes? Is it a user visible change or
> just an internal/optimization and fix.
> 
> [Gavin] There is no api changes, but this is a significant change as ring is
> fundamental and widely used, it decreases latency by 25% in our tests, it may
> do even better for cases with more contending producers/consumers or
> deeper depth of rings.
> 
> [Honnappa] I agree with Stephen. Release notes should be written from
> DPDK user perspective. In the rte_ring case, the user has the option of
> choosing between c11 and non-c11 algorithms. Performance would be one
> of the criteria to choose between these 2 algorithms. IMO, it probably makes
> sense to indicate that the performance of c11 based algorithm has been
> improved. However, I do not know what DPDK has followed historically
> regarding performance optimizations. I would prefer to follow whatever has
> been followed so far.
> I do not think that we need to document the details of the internal changes
> since it does not help the user make a decision.

I read through the online guidelines for release notes: besides API changes and new features, resolved issues that are significant and not newly introduced in this release cycle should also be included.
In this case, the resolved issue existed for a long time, across multiple release cycles, and ring is a core lib, so it is a candidate for the release notes.

https://doc.dpdk.org/guides-18.08/contributing/patches.html
section 5.5 says: 
Important changes will require an addition to the release notes in doc/guides/rel_notes/. 
See the Release Notes section of the Documentation Guidelines for details.
https://doc.dpdk.org/guides-18.08/contributing/documentation.html#doc-guidelines
"Developers should include updates to the Release Notes with patch sets that relate to any of the following sections:
New Features
Resolved Issues (see below)
Known Issues
API Changes
ABI Changes
Shared Library Versions
Resolved Issues should only include issues from previous releases that have been resolved in the current release. Issues that are introduced and then fixed within a release cycle do not have to be included here."

     Suggested order in release notes items:
     * Core libs (EAL, mempool, ring, mbuf, buses)
     * Device abstraction libs and PMDs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] ring: move the atomic load of head above the loop
  2018-11-02  7:15                     ` Gavin Hu (Arm Technology China)
@ 2018-11-02  9:36                       ` Thomas Monjalon
  2018-11-02 11:23                         ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 131+ messages in thread
From: Thomas Monjalon @ 2018-11-02  9:36 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), Honnappa Nagarahalli
  Cc: Stephen Hemminger, dev, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, stable, nd

02/11/2018 08:15, Gavin Hu (Arm Technology China):
> 
> > -----Original Message-----
> > From: Honnappa Nagarahalli
> > Sent: Friday, November 2, 2018 12:31 PM
> > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Stephen
> > Hemminger <stephen@networkplumber.org>
> > Cc: dev@dpdk.org; thomas@monjalon.net; olivier.matz@6wind.com;
> > chaozhu@linux.vnet.ibm.com; bruce.richardson@intel.com;
> > konstantin.ananyev@intel.com; jerin.jacob@caviumnetworks.com;
> > stable@dpdk.org; nd <nd@arm.com>
> > Subject: RE: [PATCH v4 2/2] ring: move the atomic load of head above the
> > loop
> > 
> > <Fixing this to make the reply inline, making email plain text>
> > 
> > 
> > On Thu, 1 Nov 2018 17:53:51 +0800
> > Gavin Hu <mailto:gavin.hu@arm.com> wrote:
> > 
> > > +* **Updated the ring library with C11 memory model.**
> > > +
> > > + Updated the ring library with C11 memory model including the following
> > changes:
> > > +
> > > + * Synchronize the load and store of the tail
> > > + * Move the atomic load of head above the loop
> > > +
> > 
> > Does this really need to be in the release notes? Is it a user visible change or
> > just an internal/optimization and fix.
> > 
> > [Gavin] There is no api changes, but this is a significant change as ring is
> > fundamental and widely used, it decreases latency by 25% in our tests, it may
> > do even better for cases with more contending producers/consumers or
> > deeper depth of rings.
> > 
> > [Honnappa] I agree with Stephen. Release notes should be written from
> > DPDK user perspective. In the rte_ring case, the user has the option of
> > choosing between c11 and non-c11 algorithms. Performance would be one
> > of the criteria to choose between these 2 algorithms. IMO, it probably makes
> > sense to indicate that the performance of c11 based algorithm has been
> > improved. However, I do not know what DPDK has followed historically
> > regarding performance optimizations. I would prefer to follow whatever has
> > been followed so far.
> > I do not think that we need to document the details of the internal changes
> > since it does not help the user make a decision.
> 
> I read through the online guidelines for release notes, besides API and new features, resolved issues which are significant and not newly introduced in this release cycle, should also be included.
> In this case, the resolved issue existed for long time, across multiple release cycles and ring is a core lib, so it should be a candidate for release notes. 
> 
> https://doc.dpdk.org/guides-18.08/contributing/patches.html
> section 5.5 says: 
> Important changes will require an addition to the release notes in doc/guides/rel_notes/. 
> See the Release Notes section of the Documentation Guidelines for details.
> https://doc.dpdk.org/guides-18.08/contributing/documentation.html#doc-guidelines
> "Developers should include updates to the Release Notes with patch sets that relate to any of the following sections:
> New Features
> Resolved Issues (see below)
> Known Issues
> API Changes
> ABI Changes
> Shared Library Versions
> Resolved Issues should only include issues from previous releases that have been resolved in the current release. Issues that are introduced and then fixed within a release cycle do not have to be included here."
> 
>      Suggested order in release notes items:
>      * Core libs (EAL, mempool, ring, mbuf, buses)
>      * Device abstraction libs and PMDs

I agree with Honnappa.
You don't need to give details, but can explain that performance of
C11 version is improved.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v5 0/2] ring library with c11 memory model bug fix and optimization
  2018-11-01  9:53         ` [dpdk-dev] [PATCH v4 0/2] ring library with c11 memory model bug fix and optimization Gavin Hu
@ 2018-11-02 11:21           ` Gavin Hu
  2018-11-02 11:21           ` [dpdk-dev] [PATCH v5 1/2] ring: synchronize the load and store of the tail Gavin Hu
  2018-11-02 11:21           ` [dpdk-dev] [PATCH v5 2/2] ring: move the atomic load of head above the loop Gavin Hu
  2 siblings, 0 replies; 131+ messages in thread
From: Gavin Hu @ 2018-11-02 11:21 UTC (permalink / raw)
  To: dev
  Cc: thomas, stephen, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu

v4->v5
1) Indicate the improvement by the change in the release note.

v3->v4
1) Move the variable declarations to the beginning of the block.

v2->v3
1) Reword the change and relocate it in the release note.

v1->v2
1) Add the changes to the 18.11 release note.

V1:
Updated the ring library with C11 memory model including the following changes
1) Synchronize the load and store of the tail to ensure the enqueue/dequeue
   operations are really completed before being seen by the observers on the
   other side.
2) Move the atomic load of head above the loop; reloading it on every
   iteration is unnecessary and degrades performance, as the head is already
   updated in the failure case of CAS.

Gavin Hu (2):
  ring: synchronize the load and store of the tail
  ring: move the atomic load of head above the loop

 doc/guides/rel_notes/release_18_11.rst |  7 +++++++
 lib/librte_ring/rte_ring_c11_mem.h     | 24 ++++++++++++++++++------
 2 files changed, 25 insertions(+), 6 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v5 1/2] ring: synchronize the load and store of the tail
  2018-11-01  9:53         ` [dpdk-dev] [PATCH v4 0/2] ring library with c11 memory model bug fix and optimization Gavin Hu
  2018-11-02 11:21           ` [dpdk-dev] [PATCH v5 " Gavin Hu
@ 2018-11-02 11:21           ` Gavin Hu
  2018-11-05  9:30             ` Olivier Matz
  2018-11-02 11:21           ` [dpdk-dev] [PATCH v5 2/2] ring: move the atomic load of head above the loop Gavin Hu
  2 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu @ 2018-11-02 11:21 UTC (permalink / raw)
  To: dev
  Cc: thomas, stephen, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu,
	stable

Synchronize the load-acquire of the tail with the store-release
within update_tail. The store-release ensures all the ring operations,
enqueue or dequeue, are seen by the observers on the other side as soon
as they see the updated tail. The load-acquire is needed here because
data dependency is not a reliable way to enforce ordering: the compiler
might break it by saving to temporary values to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.

The patch was benchmarked with test/ring_perf_autotest; it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores. The real gains
depend on the number of lcores, the depth of the ring, and SPSC vs MPMC.
With one lcore it also improves slightly, by about 3 ~ 4%.
The biggest improvement is for MPMC: with two lcores and a ring size
of 32, it cuts latency by up to (3.26-2.36)/3.26 = 27.6%.

This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting the
load-acquire of the tail above __atomic_compare_exchange_n.

The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i

Test result with this patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.86
 MP/MC bulk enq/dequeue (size: 8): 10.15
 SP/SC bulk enq/dequeue (size: 32): 1.94
 MP/MC bulk enq/dequeue (size: 32): 2.36

In comparison of the test result without this patch:
 SP/SC bulk enq/dequeue (size: 8): 6.67
 MP/MC bulk enq/dequeue (size: 8): 13.12
 SP/SC bulk enq/dequeue (size: 32): 2.04
 MP/MC bulk enq/dequeue (size: 32): 3.26

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 94df3c4..52da95a 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -57,6 +57,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		uint32_t *free_entries)
 {
 	const uint32_t capacity = r->capacity;
+	uint32_t cons_tail;
 	unsigned int max = n;
 	int success;
 
@@ -67,13 +68,18 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		*old_head = __atomic_load_n(&r->prod.head,
 					__ATOMIC_ACQUIRE);
 
-		/*
-		 *  The subtraction is done between two unsigned 32bits value
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		cons_tail = __atomic_load_n(&r->cons.tail,
+					__ATOMIC_ACQUIRE);
+
+		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * *old_head > cons_tail). So 'free_entries' is always between 0
 		 * and capacity (which is < size).
 		 */
-		*free_entries = (capacity + r->cons.tail - *old_head);
+		*free_entries = (capacity + cons_tail - *old_head);
 
 		/* check that we have enough room in ring */
 		if (unlikely(n > *free_entries))
@@ -125,21 +131,29 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		uint32_t *entries)
 {
 	unsigned int max = n;
+	uint32_t prod_tail;
 	int success;
 
 	/* move cons.head atomically */
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
+
 		*old_head = __atomic_load_n(&r->cons.head,
 					__ATOMIC_ACQUIRE);
 
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		prod_tail = __atomic_load_n(&r->prod.tail,
+					__ATOMIC_ACQUIRE);
+
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
 		 * and size(ring)-1.
 		 */
-		*entries = (r->prod.tail - *old_head);
+		*entries = (prod_tail - *old_head);
 
 		/* Set the actual entries for dequeue */
 		if (n > *entries)
-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [dpdk-dev] [PATCH v5 2/2] ring: move the atomic load of head above the loop
  2018-11-01  9:53         ` [dpdk-dev] [PATCH v4 0/2] ring library with c11 memory model bug fix and optimization Gavin Hu
  2018-11-02 11:21           ` [dpdk-dev] [PATCH v5 " Gavin Hu
  2018-11-02 11:21           ` [dpdk-dev] [PATCH v5 1/2] ring: synchronize the load and store of the tail Gavin Hu
@ 2018-11-02 11:21           ` Gavin Hu
  2018-11-02 11:43             ` Bruce Richardson
  2 siblings, 1 reply; 131+ messages in thread
From: Gavin Hu @ 2018-11-02 11:21 UTC (permalink / raw)
  To: dev
  Cc: thomas, stephen, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu,
	stable

In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
the do {} while loop: upon CAS failure the old_head is updated by the
CAS itself, so another load is costly and unnecessary.

This helps latency a little, by about 1~5%.

 Test result with the patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.64
 MP/MC bulk enq/dequeue (size: 8): 9.58
 SP/SC bulk enq/dequeue (size: 32): 1.98
 MP/MC bulk enq/dequeue (size: 32): 2.30

Fixes: 39368ebfc606 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 doc/guides/rel_notes/release_18_11.rst |  7 +++++++
 lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/doc/guides/rel_notes/release_18_11.rst b/doc/guides/rel_notes/release_18_11.rst
index 376128f..b68afab 100644
--- a/doc/guides/rel_notes/release_18_11.rst
+++ b/doc/guides/rel_notes/release_18_11.rst
@@ -69,6 +69,13 @@ New Features
   checked out against that dma mask and rejected if out of range. If more than
   one device has addressing limitations, the dma mask is the more restricted one.
 
+* **Updated the ring library with C11 memory model.**
+
+  Updated the ring library with C11 memory model, in our tests the changes
+  decreased latency by 27~29% and 3~15% for MPMC and SPSC cases respectively.
+  The real improvements may vary with the number of contending lcores and the
+  size of ring.
+
 * **Added hot-unplug handle mechanism.**
 
   ``rte_dev_hotplug_handle_enable`` and ``rte_dev_hotplug_handle_disable`` are
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 52da95a..7bc74a4 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -61,13 +61,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 	unsigned int max = n;
 	int success;
 
+	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Reset n to the initial burst count */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->prod.head,
-					__ATOMIC_ACQUIRE);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -93,6 +91,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		if (is_sp)
 			r->prod.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head is updated */
 			success = __atomic_compare_exchange_n(&r->prod.head,
 					old_head, *new_head,
 					0, __ATOMIC_ACQUIRE,
@@ -135,13 +134,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	int success;
 
 	/* move cons.head atomically */
+	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->cons.head,
-					__ATOMIC_ACQUIRE);
-
 		/* this load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -166,6 +163,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		if (is_sc)
 			r->cons.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head will be updated */
 			success = __atomic_compare_exchange_n(&r->cons.head,
 							old_head, *new_head,
 							0, __ATOMIC_ACQUIRE,
-- 
2.7.4

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] ring: move the atomic load of head above the loop
  2018-11-02  9:36                       ` Thomas Monjalon
@ 2018-11-02 11:23                         ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-11-02 11:23 UTC (permalink / raw)
  To: Thomas Monjalon, Honnappa Nagarahalli
  Cc: Stephen Hemminger, dev, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, stable, nd



> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Friday, November 2, 2018 5:37 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Cc: Stephen Hemminger <stephen@networkplumber.org>; dev@dpdk.org;
> olivier.matz@6wind.com; chaozhu@linux.vnet.ibm.com;
> bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> jerin.jacob@caviumnetworks.com; stable@dpdk.org; nd <nd@arm.com>
> Subject: Re: [PATCH v4 2/2] ring: move the atomic load of head above the
> loop
> 
> 02/11/2018 08:15, Gavin Hu (Arm Technology China):
> >
> > > -----Original Message-----
> > > From: Honnappa Nagarahalli
> > > Sent: Friday, November 2, 2018 12:31 PM
> > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Stephen
> > > Hemminger <stephen@networkplumber.org>
> > > Cc: dev@dpdk.org; thomas@monjalon.net; olivier.matz@6wind.com;
> > > chaozhu@linux.vnet.ibm.com; bruce.richardson@intel.com;
> > > konstantin.ananyev@intel.com; jerin.jacob@caviumnetworks.com;
> > > stable@dpdk.org; nd <nd@arm.com>
> > > Subject: RE: [PATCH v4 2/2] ring: move the atomic load of head above
> > > the loop
> > >
> > > <Fixing this to make the reply inline, making email plain text>
> > >
> > >
> > > On Thu, 1 Nov 2018 17:53:51 +0800
> > > Gavin Hu <mailto:gavin.hu@arm.com> wrote:
> > >
> > > > +* **Updated the ring library with C11 memory model.**
> > > > +
> > > > + Updated the ring library with C11 memory model including the
> > > > + following
> > > changes:
> > > > +
> > > > + * Synchronize the load and store of the tail
> > > > + * Move the atomic load of head above the loop
> > > > +
> > >
> > > Does this really need to be in the release notes? Is it a user
> > > visible change or just an internal/optimization and fix.
> > >
> > > [Gavin] There are no API changes, but this is a significant change,
> > > as ring is fundamental and widely used; it decreases latency by 25% in
> > > our tests, and it may do even better for cases with more contending
> > > producers/consumers or deeper rings.
> > >
> > > [Honnappa] I agree with Stephen. Release notes should be written
> > > from DPDK user perspective. In the rte_ring case, the user has the
> > > option of choosing between c11 and non-c11 algorithms. Performance
> > > would be one of the criteria to choose between these 2 algorithms.
> > > IMO, it probably makes sense to indicate that the performance of c11
> > > based algorithm has been improved. However, I do not know what DPDK
> > > has followed historically regarding performance optimizations. I
> > > would prefer to follow whatever has been followed so far.
> > > I do not think that we need to document the details of the internal
> > > changes since it does not help the user make a decision.
> >
> > I read through the online guidelines for release notes: besides API changes and new
> features, resolved issues which are significant and not newly introduced in
> this release cycle should also be included.
> > In this case, the resolved issue has existed for a long time, across multiple
> release cycles, and ring is a core lib, so it should be a candidate for release
> notes.
> >
> > https://doc.dpdk.org/guides-18.08/contributing/patches.html
> > section 5.5 says:
> > Important changes will require an addition to the release notes in
> doc/guides/rel_notes/.
> > See the Release Notes section of the Documentation Guidelines for details.
> > https://doc.dpdk.org/guides-18.08/contributing/documentation.html#doc-
> > guidelines "Developers should include updates to the Release Notes
> > with patch sets that relate to any of the following sections:
> > New Features
> > Resolved Issues (see below)
> > Known Issues
> > API Changes
> > ABI Changes
> > Shared Library Versions
> > Resolved Issues should only include issues from previous releases that
> have been resolved in the current release. Issues that are introduced and
> then fixed within a release cycle do not have to be included here."
> >
> >      Suggested order in release notes items:
> >      * Core libs (EAL, mempool, ring, mbuf, buses)
> >      * Device abstraction libs and PMDs
> 
> I agree with Honnappa.
> You don't need to give details, but can explain that performance of
> C11 version is improved.
> 

A v5 was submitted that indicates the improvement brought by the change without giving more technical details; please have a review, thanks!

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v5 2/2] ring: move the atomic load of head above the loop
  2018-11-02 11:21           ` [dpdk-dev] [PATCH v5 2/2] ring: move the atomic load of head above the loop Gavin Hu
@ 2018-11-02 11:43             ` Bruce Richardson
  2018-11-03  1:19               ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 131+ messages in thread
From: Bruce Richardson @ 2018-11-02 11:43 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, thomas, stephen, olivier.matz, chaozhu, konstantin.ananyev,
	jerin.jacob, Honnappa.Nagarahalli, stable

On Fri, Nov 02, 2018 at 07:21:28PM +0800, Gavin Hu wrote:
> In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
> the do {} while loop as upon failure the old_head will be updated,
> another load is costly and not necessary.
> 
> This helps a little on the latency,about 1~5%.
> 
>  Test result with the patch(two cores):
>  SP/SC bulk enq/dequeue (size: 8): 5.64
>  MP/MC bulk enq/dequeue (size: 8): 9.58
>  SP/SC bulk enq/dequeue (size: 32): 1.98
>  MP/MC bulk enq/dequeue (size: 32): 2.30
> 
> Fixes: 39368ebfc606 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> Reviewed-by: Jia He <justin.he@arm.com>
> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> ---
>  doc/guides/rel_notes/release_18_11.rst |  7 +++++++
>  lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/doc/guides/rel_notes/release_18_11.rst b/doc/guides/rel_notes/release_18_11.rst
> index 376128f..b68afab 100644
> --- a/doc/guides/rel_notes/release_18_11.rst
> +++ b/doc/guides/rel_notes/release_18_11.rst
> @@ -69,6 +69,13 @@ New Features
>    checked out against that dma mask and rejected if out of range. If more than
>    one device has addressing limitations, the dma mask is the more restricted one.
>  
> +* **Updated the ring library with C11 memory model.**
> +
> +  Updated the ring library with C11 memory model, in our tests the changes
> +  decreased latency by 27~29% and 3~15% for MPMC and SPSC cases respectively.
> +  The real improvements may vary with the number of contending lcores and the
> +  size of ring.
> +
Is this a little misleading, and will users expect massive performance
improvements generally? The C11 model seems to be used only on some, but
not all, arm platforms, and then only with "make" builds.

config/arm/meson.build: ['RTE_USE_C11_MEM_MODEL', false]]
config/common_armv8a_linuxapp:CONFIG_RTE_USE_C11_MEM_MODEL=y
config/common_base:CONFIG_RTE_USE_C11_MEM_MODEL=n
config/defconfig_arm64-thunderx-linuxapp-gcc:CONFIG_RTE_USE_C11_MEM_MODEL=n

/Bruce

^ permalink raw reply	[flat|nested] 131+ messages in thread
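
Bruce's point above is that the C11 implementation is selected at compile time, so most builds never execute the patched path. A rough sketch of the mechanism (paraphrased from `lib/librte_ring/rte_ring.h`; treat the exact wording as an assumption):

```c
/* Compile-time selection of the ring synchronization implementation.
 * Only builds with CONFIG_RTE_USE_C11_MEM_MODEL=y pull in the C11
 * header touched by this patch set; all others use the
 * rte_smp_*mb() based generic implementation.
 */
#ifdef RTE_USE_C11_MEM_MODEL
#include "rte_ring_c11_mem.h"
#else
#include "rte_ring_generic.h"
#endif
```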

* Re: [dpdk-dev] [PATCH v5 2/2] ring: move the atomic load of head above the loop
  2018-11-02 11:43             ` Bruce Richardson
@ 2018-11-03  1:19               ` Gavin Hu (Arm Technology China)
  2018-11-03  9:34                 ` Honnappa Nagarahalli
  2018-11-05  9:44                 ` [dpdk-dev] " Olivier Matz
  0 siblings, 2 replies; 131+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-11-03  1:19 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: dev, thomas, stephen, olivier.matz, chaozhu, konstantin.ananyev,
	jerin.jacob, Honnappa Nagarahalli, stable



> -----Original Message-----
> From: Bruce Richardson <bruce.richardson@intel.com>
> Sent: Friday, November 2, 2018 7:44 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; thomas@monjalon.net; stephen@networkplumber.org;
> olivier.matz@6wind.com; chaozhu@linux.vnet.ibm.com;
> konstantin.ananyev@intel.com; jerin.jacob@caviumnetworks.com;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org
> Subject: Re: [PATCH v5 2/2] ring: move the atomic load of head above the
> loop
>
> On Fri, Nov 02, 2018 at 07:21:28PM +0800, Gavin Hu wrote:
> > In __rte_ring_move_prod_head, move the __atomic_load_n up and out
> of
> > the do {} while loop as upon failure the old_head will be updated,
> > another load is costly and not necessary.
> >
> > This helps a little on the latency,about 1~5%.
> >
> >  Test result with the patch(two cores):
> >  SP/SC bulk enq/dequeue (size: 8): 5.64  MP/MC bulk enq/dequeue (size:
> > 8): 9.58  SP/SC bulk enq/dequeue (size: 32): 1.98  MP/MC bulk
> > enq/dequeue (size: 32): 2.30
> >
> > Fixes: 39368ebfc606 ("ring: introduce C11 memory model barrier
> > option")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > Reviewed-by: Jia He <justin.he@arm.com>
> > Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > ---
> >  doc/guides/rel_notes/release_18_11.rst |  7 +++++++
> >  lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
> >  2 files changed, 11 insertions(+), 6 deletions(-)
> >
> > diff --git a/doc/guides/rel_notes/release_18_11.rst
> > b/doc/guides/rel_notes/release_18_11.rst
> > index 376128f..b68afab 100644
> > --- a/doc/guides/rel_notes/release_18_11.rst
> > +++ b/doc/guides/rel_notes/release_18_11.rst
> > @@ -69,6 +69,13 @@ New Features
> >    checked out against that dma mask and rejected if out of range. If more
> than
> >    one device has addressing limitations, the dma mask is the more
> restricted one.
> >
> > +* **Updated the ring library with C11 memory model.**
> > +
> > +  Updated the ring library with C11 memory model, in our tests the
> > + changes  decreased latency by 27~29% and 3~15% for MPMC and SPSC
> cases respectively.
> > +  The real improvements may vary with the number of contending lcores
> > + and the  size of ring.
> > +
> Is this a little misleading, and will users expect massive performance
> improvements generally? The C11 model seems to be used only on some,
> but not all, arm platforms, and then only with "make" builds.
>
> config/arm/meson.build: ['RTE_USE_C11_MEM_MODEL', false]]
> config/common_armv8a_linuxapp:CONFIG_RTE_USE_C11_MEM_MODEL=y
> config/common_base:CONFIG_RTE_USE_C11_MEM_MODEL=n
> config/defconfig_arm64-thunderx-linuxapp-
> gcc:CONFIG_RTE_USE_C11_MEM_MODEL=n
>
> /Bruce

Thank you Bruce for the review. To limit the claimed scope of the improvement, I have rewritten the note as follows; could you help review? Feel free to change anything if you like.
" Updated the ring library with C11 memory model, running ring_perf_autotest on Cavium ThunderX2 platform, the changes  decreased latency by 27~29% and 3~15% for MPMC and SPSC cases (2 lcores) respectively. Note the changes help the relaxed memory ordering architectures (arm, ppc) only when CONFIG_RTE_USE_C11_MEM_MODEL=y was configured, no impact on strong memory ordering architectures like x86. To what extent they help the real use cases depends on other factors, like the number of contending readers/writers, size of the ring, whether or not it is on the critical path."

/Gavin

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v5 2/2] ring: move the atomic load of head above the loop
  2018-11-03  1:19               ` Gavin Hu (Arm Technology China)
@ 2018-11-03  9:34                 ` Honnappa Nagarahalli
  2018-11-05 13:17                   ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
  2018-11-05  9:44                 ` [dpdk-dev] " Olivier Matz
  1 sibling, 1 reply; 131+ messages in thread
From: Honnappa Nagarahalli @ 2018-11-03  9:34 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), Bruce Richardson
  Cc: dev, thomas, stephen, olivier.matz, chaozhu, konstantin.ananyev,
	jerin.jacob, stable, nd

> > > ---
> > >  doc/guides/rel_notes/release_18_11.rst |  7 +++++++
> > >  lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
> > >  2 files changed, 11 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/doc/guides/rel_notes/release_18_11.rst
> > > b/doc/guides/rel_notes/release_18_11.rst
> > > index 376128f..b68afab 100644
> > > --- a/doc/guides/rel_notes/release_18_11.rst
> > > +++ b/doc/guides/rel_notes/release_18_11.rst
> > > @@ -69,6 +69,13 @@ New Features
> > >    checked out against that dma mask and rejected if out of range.
> > > If more
> > than
> > >    one device has addressing limitations, the dma mask is the more
> > restricted one.
> > >
> > > +* **Updated the ring library with C11 memory model.**
> > > +
> > > +  Updated the ring library with C11 memory model, in our tests the
> > > + changes  decreased latency by 27~29% and 3~15% for MPMC and SPSC
> > cases respectively.
> > > +  The real improvements may vary with the number of contending
> > > + lcores and the  size of ring.
> > > +
> > Is this a little misleading, and will users expect massive performance
> > improvements generally? The C11 model seems to be used only on some,
> > but not all, arm platforms, and then only with "make" builds.
> >
> > config/arm/meson.build: ['RTE_USE_C11_MEM_MODEL', false]]
This is an error. There is already an agreement that on Arm-based platforms the C11 memory model would be used by default; specific platforms can override it if required.
Would this be an acceptable change for RC2 or RC3?

> > config/common_armv8a_linuxapp:CONFIG_RTE_USE_C11_MEM_MODEL=y
> > config/common_base:CONFIG_RTE_USE_C11_MEM_MODEL=n
> > config/defconfig_arm64-thunderx-linuxapp-
> > gcc:CONFIG_RTE_USE_C11_MEM_MODEL=n
> >
> > /Bruce
> 
> Thank you Bruce for the review, to limit the scope of improvement, I rewrite
> the note as follows, could you help review? Feel free to change anything if you
> like.
> " Updated the ring library with C11 memory model, running
> ring_perf_autotest on Cavium ThunderX2 platform, the changes  decreased
> latency by 27~29% and 3~15% for MPMC and SPSC cases (2 lcores)
> respectively. Note the changes help the relaxed memory ordering
> architectures (arm, ppc) only when CONFIG_RTE_USE_C11_MEM_MODEL=y
> was configured, no impact on strong memory ordering architectures like x86.
> To what extent they help the real use cases depends on other factors, like the
> number of contending readers/writers, size of the ring, whether or not it is on
> the critical path."
> 
> /Gavin
IMO, mentioning the performance numbers requires mentioning system configurations. I suggest we keep this somewhat vague (which will prompt users to run the test on their specific platform) and simple. Can I suggest the following:

"C11 memory model algorithm for ring library is updated. This results in improved performance on some Arm based platforms."

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail
  2018-10-27 15:13               ` Thomas Monjalon
  2018-10-27 15:34                 ` Jerin Jacob
@ 2018-11-03 20:12                 ` Mattias Rönnblom
  2018-11-05 21:51                   ` Honnappa Nagarahalli
  1 sibling, 1 reply; 131+ messages in thread
From: Mattias Rönnblom @ 2018-11-03 20:12 UTC (permalink / raw)
  To: Thomas Monjalon, Jerin Jacob
  Cc: Gavin Hu (Arm Technology China),
	dev, Honnappa Nagarahalli, stable, Ola Liljedahl, olivier.matz,
	chaozhu, bruce.richardson, konstantin.ananyev

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Thomas Monjalon
> Sent: den 27 oktober 2018 17:13
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> stable@dpdk.org; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> olivier.matz@6wind.com; chaozhu@linux.vnet.ibm.com;
> bruce.richardson@intel.com; konstantin.ananyev@intel.com
> Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
> the tail
> 
> 27/10/2018 17:00, Jerin Jacob:
> > From: Thomas Monjalon <thomas@monjalon.net>
> > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > Hi Jerin
> > > >
> > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-
> patch series to unblock the merge.
> > >
> > > The thread is totally messed up because:
> > >         - there is no cover letter
> > >         - some different series (testpmd, i40e and doc) are in the same
> thread
> > >         - v4 replies to a different series
> > >         - this version should be a v5 but has no number
> > >         - this version replies to the v3
> > >         - patchwork still shows v3 and "v5"
> > >         - replies from Ola are not quoting previous discussion
> > >
> > > Because of all of this, it is really difficult to follow.
> > > This is probably the reason of the lack of review outside of Arm.
> > >
> > > One more issue: you must Cc the relevant maintainers.
> > > Here:
> > >         - Olivier for rte_ring
> > >         - Chao for IBM platform
> > >         - Bruce and Konstantin for x86
> > >
> > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > But please consider better formatting your discussions, it is really
> > > important in our contribution workflow.
> > >
> > > I don't know what to do.
> > > I suggest to wait for more feedbacks and integrate it in -rc2.
> >
> > This series has been acked and tested. Sure, if we are looking for
> > some more feedback we can push to -rc2 if not it a good candidate to
> > be selected for -rc1.
> 
> It has been acked and tested only for Arm platforms.
> And Olivier, the ring maintainer, was not Cc.
> 
> I feel it is not enough.
> 

I've just run an out-of-tree test program I have for the DSW scheduler, which verifies scheduler atomic semantics. The results are:
Non-C11 mode: pass
C11 mode before this patch set: fail
C11 mode after this patch set: pass

This suggests the current C11 mode is broken even on x86_64. I haven't been following this thread closely, so maybe this is known already.

I've also run an out-of-tree DSW throughput benchmark, and I've found that going from non-C11 to C11 gives a 4% slowdown. After this patch, the slowdown is only 2.8%.

GCC 7.3.0 and a Skylake x86_64.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH v5 1/2] ring: synchronize the load and store of the tail
  2018-11-02 11:21           ` [dpdk-dev] [PATCH v5 1/2] ring: synchronize the load and store of the tail Gavin Hu
@ 2018-11-05  9:30             ` Olivier Matz
  0 siblings, 0 replies; 131+ messages in thread
From: Olivier Matz @ 2018-11-05  9:30 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, thomas, stephen, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, stable

On Fri, Nov 02, 2018 at 07:21:27PM +0800, Gavin Hu wrote:
> Synchronize the load-acquire of the tail and the store-release
> within update_tail, the store release ensures all the ring operations,
> enqueue or dequeue, are seen by the observers on the other side as soon
> as they see the updated tail. The load-acquire is needed here as the
> data dependency is not a reliable way for ordering as the compiler might
> break it by saving to temporary values to boost performance.
> When computing the free_entries and avail_entries, use atomic semantics
> to load the heads and tails instead.
> 
> The patch was benchmarked with test/ring_perf_autotest and it decreases
> the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
> are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
> For 1 lcore, it also improves a little, about 3 ~ 4%.
> It is a big improvement, in case of MPMC, with two lcores and ring size
> of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.
> 
> This patch is a bug fix, while the improvement is a bonus. In our analysis
> the improvement comes from the cacheline pre-filling after hoisting load-
> acquire from _atomic_compare_exchange_n up above.
> 
> The test command:
> $sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
> 1024 -- -i
> 
> Test result with this patch(two cores):
>  SP/SC bulk enq/dequeue (size: 8): 5.86
>  MP/MC bulk enq/dequeue (size: 8): 10.15
>  SP/SC bulk enq/dequeue (size: 32): 1.94
>  MP/MC bulk enq/dequeue (size: 32): 2.36
> 
> In comparison of the test result without this patch:
>  SP/SC bulk enq/dequeue (size: 8): 6.67
>  MP/MC bulk enq/dequeue (size: 8): 13.12
>  SP/SC bulk enq/dequeue (size: 32): 2.04
>  MP/MC bulk enq/dequeue (size: 32): 3.26
> 
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> Reviewed-by: Jia He <justin.he@arm.com>
> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>

Acked-by: Olivier Matz <olivier.matz@6wind.com>

^ permalink raw reply	[flat|nested] 131+ messages in thread
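
The load-acquire/store-release pairing described in the commit message above can be sketched in isolation (hypothetical names; not DPDK's API):

```c
#include <stdint.h>
#include <assert.h>

/* Minimal sketch of the synchronization the patch establishes: the
 * producer publishes ring contents with a store-release on the tail,
 * and the consumer's load-acquire of that tail guarantees it observes
 * those contents, without relying on the compiler preserving data
 * dependencies.
 */
#define RING_SIZE 8

static uint32_t slots[RING_SIZE];
static uint32_t prod_tail;

static void
producer_publish(uint32_t val)
{
	uint32_t t = __atomic_load_n(&prod_tail, __ATOMIC_RELAXED);

	slots[t % RING_SIZE] = val;	/* plain store of the payload */
	/* store-release: all prior stores are visible before the new tail */
	__atomic_store_n(&prod_tail, t + 1, __ATOMIC_RELEASE);
}

static int
consumer_poll(uint32_t old_tail, uint32_t *val)
{
	/* load-acquire: synchronizes with the producer's store-release */
	uint32_t t = __atomic_load_n(&prod_tail, __ATOMIC_ACQUIRE);

	if (t == old_tail)
		return 0;			/* nothing new published */
	*val = slots[old_tail % RING_SIZE];	/* ordered after the acquire */
	return 1;
}
```

Without the release/acquire pair, the compiler or a weakly ordered CPU (arm, ppc) may reorder the payload store past the tail store, or the payload load before the tail load, which is exactly the failure mode this patch fixes.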

* Re: [dpdk-dev] [PATCH v5 2/2] ring: move the atomic load of head above the loop
  2018-11-03  1:19               ` Gavin Hu (Arm Technology China)
  2018-11-03  9:34                 ` Honnappa Nagarahalli
@ 2018-11-05  9:44                 ` Olivier Matz
  2018-11-05 13:36                   ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
  1 sibling, 1 reply; 131+ messages in thread
From: Olivier Matz @ 2018-11-05  9:44 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China)
  Cc: Bruce Richardson, dev, thomas, stephen, chaozhu,
	konstantin.ananyev, jerin.jacob, Honnappa Nagarahalli, stable

Hi,

On Sat, Nov 03, 2018 at 01:19:29AM +0000, Gavin Hu (Arm Technology China) wrote:
> 
> 
> > -----Original Message-----
> > From: Bruce Richardson <bruce.richardson@intel.com>
> > Sent: Friday, November 2, 2018 7:44 PM
> > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> > Cc: dev@dpdk.org; thomas@monjalon.net; stephen@networkplumber.org;
> > olivier.matz@6wind.com; chaozhu@linux.vnet.ibm.com;
> > konstantin.ananyev@intel.com; jerin.jacob@caviumnetworks.com;
> > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org
> > Subject: Re: [PATCH v5 2/2] ring: move the atomic load of head above the
> > loop
> >
> > On Fri, Nov 02, 2018 at 07:21:28PM +0800, Gavin Hu wrote:
> > > In __rte_ring_move_prod_head, move the __atomic_load_n up and out
> > of
> > > the do {} while loop as upon failure the old_head will be updated,
> > > another load is costly and not necessary.
> > >
> > > This helps a little on the latency,about 1~5%.
> > >
> > >  Test result with the patch(two cores):
> > >  SP/SC bulk enq/dequeue (size: 8): 5.64  MP/MC bulk enq/dequeue (size:
> > > 8): 9.58  SP/SC bulk enq/dequeue (size: 32): 1.98  MP/MC bulk
> > > enq/dequeue (size: 32): 2.30
> > >
> > > Fixes: 39368ebfc606 ("ring: introduce C11 memory model barrier
> > > option")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > > Reviewed-by: Jia He <justin.he@arm.com>
> > > Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > ---
> > >  doc/guides/rel_notes/release_18_11.rst |  7 +++++++
> > >  lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
> > >  2 files changed, 11 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/doc/guides/rel_notes/release_18_11.rst
> > > b/doc/guides/rel_notes/release_18_11.rst
> > > index 376128f..b68afab 100644
> > > --- a/doc/guides/rel_notes/release_18_11.rst
> > > +++ b/doc/guides/rel_notes/release_18_11.rst
> > > @@ -69,6 +69,13 @@ New Features
> > >    checked out against that dma mask and rejected if out of range. If more
> > than
> > >    one device has addressing limitations, the dma mask is the more
> > restricted one.
> > >
> > > +* **Updated the ring library with C11 memory model.**
> > > +
> > > +  Updated the ring library with C11 memory model, in our tests the
> > > + changes  decreased latency by 27~29% and 3~15% for MPMC and SPSC
> > cases respectively.
> > > +  The real improvements may vary with the number of contending lcores
> > > + and the  size of ring.
> > > +
> > Is this a little misleading, and will users expect massive performance
> > improvements generally? The C11 model seems to be used only on some,
> > but not all, arm platforms, and then only with "make" builds.
> >
> > config/arm/meson.build: ['RTE_USE_C11_MEM_MODEL', false]]
> > config/common_armv8a_linuxapp:CONFIG_RTE_USE_C11_MEM_MODEL=y
> > config/common_base:CONFIG_RTE_USE_C11_MEM_MODEL=n
> > config/defconfig_arm64-thunderx-linuxapp-
> > gcc:CONFIG_RTE_USE_C11_MEM_MODEL=n
> >
> > /Bruce
> 
> Thank you Bruce for the review, to limit the scope of improvement, I rewrite the note as follows, could you help review? Feel free to change anything if you like.
> " Updated the ring library with C11 memory model, running ring_perf_autotest on Cavium ThunderX2 platform, the changes  decreased latency by 27~29% and 3~15% for MPMC and SPSC cases (2 lcores) respectively. Note the changes help the relaxed memory ordering architectures (arm, ppc) only when CONFIG_RTE_USE_C11_MEM_MODEL=y was configured, no impact on strong memory ordering architectures like x86. To what extent they help the real use cases depends on other factors, like the number of contending readers/writers, size of the ring, whether or not it is on the critical path."

I prefer your initial proposal which is more concise. What about
something like this?


* **Updated the C11 memory model version of ring library.**

  The latency is decreased for architectures using the C11 memory model
  version of the ring library.

  On Cavium ThunderX2 platform, the changes decreased latency by 27~29%
  and 3~15% for MPMC and SPSC cases respectively (with 2 lcores). The
  real improvements may vary with the number of contending lcores and
  the size of ring.


About the patch itself:
Acked-by: Olivier Matz <olivier.matz@6wind.com>

Thanks

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [dpdk-stable] [PATCH v5 2/2] ring: move the atomic load of head above the loop
  2018-11-03  9:34                 ` Honnappa Nagarahalli
@ 2018-11-05 13:17                   ` Thomas Monjalon
  2018-11-05 13:41                     ` Jerin Jacob
  0 siblings, 1 reply; 131+ messages in thread
From: Thomas Monjalon @ 2018-11-05 13:17 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: stable, Gavin Hu (Arm Technology China),
	Bruce Richardson, dev, stephen, olivier.matz, chaozhu,
	konstantin.ananyev, jerin.jacob, nd, hemant.agrawal,
	shreyansh.jain

03/11/2018 10:34, Honnappa Nagarahalli:
> > > > ---
> > > >  doc/guides/rel_notes/release_18_11.rst |  7 +++++++
> > > >  lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
> > > >  2 files changed, 11 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/doc/guides/rel_notes/release_18_11.rst
> > > > b/doc/guides/rel_notes/release_18_11.rst
> > > > index 376128f..b68afab 100644
> > > > --- a/doc/guides/rel_notes/release_18_11.rst
> > > > +++ b/doc/guides/rel_notes/release_18_11.rst
> > > > @@ -69,6 +69,13 @@ New Features
> > > >    checked out against that dma mask and rejected if out of range.
> > > > If more
> > > than
> > > >    one device has addressing limitations, the dma mask is the more
> > > restricted one.
> > > >
> > > > +* **Updated the ring library with C11 memory model.**
> > > > +
> > > > +  Updated the ring library with C11 memory model, in our tests the
> > > > + changes  decreased latency by 27~29% and 3~15% for MPMC and SPSC
> > > cases respectively.
> > > > +  The real improvements may vary with the number of contending
> > > > + lcores and the  size of ring.
> > > > +
> > > Is this a little misleading, and will users expect massive performance
> > > improvements generally? The C11 model seems to be used only on some,
> > > but not all, arm platforms, and then only with "make" builds.
> > >
> > > config/arm/meson.build: ['RTE_USE_C11_MEM_MODEL', false]]
> This is an error. There is already an agreement that on Arm based platforms, C11 memory model would be used by default. Specific platforms can override it if required.
> Would this be ab acceptable change for RC2 or RC3?

If NXP and Cavium agrees, I think it can go in RC2.
For RC3, not sure.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [dpdk-stable] [PATCH v5 2/2] ring: move the atomic load of head above the loop
  2018-11-05  9:44                 ` [dpdk-dev] " Olivier Matz
@ 2018-11-05 13:36                   ` Thomas Monjalon
  0 siblings, 0 replies; 131+ messages in thread
From: Thomas Monjalon @ 2018-11-05 13:36 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China)
  Cc: stable, Olivier Matz, Bruce Richardson, dev, stephen, chaozhu,
	konstantin.ananyev, jerin.jacob, Honnappa Nagarahalli

05/11/2018 10:44, Olivier Matz:
> Hi,
> 
> On Sat, Nov 03, 2018 at 01:19:29AM +0000, Gavin Hu (Arm Technology China) wrote:
> > From: Bruce Richardson <bruce.richardson@intel.com>
> > > On Fri, Nov 02, 2018 at 07:21:28PM +0800, Gavin Hu wrote:
> > > > In __rte_ring_move_prod_head, move the __atomic_load_n up and out
> > > of
> > > > the do {} while loop as upon failure the old_head will be updated,
> > > > another load is costly and not necessary.
> > > >
> > > > This helps a little on the latency,about 1~5%.
> > > >
> > > >  Test result with the patch(two cores):
> > > >  SP/SC bulk enq/dequeue (size: 8): 5.64  MP/MC bulk enq/dequeue (size:
> > > > 8): 9.58  SP/SC bulk enq/dequeue (size: 32): 1.98  MP/MC bulk
> > > > enq/dequeue (size: 32): 2.30
> > > >
> > > > Fixes: 39368ebfc606 ("ring: introduce C11 memory model barrier option")
> > > > Cc: stable@dpdk.org
> > > >
> > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > > > Reviewed-by: Jia He <justin.he@arm.com>
> > > > Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > > Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > > ---
> > > >  doc/guides/rel_notes/release_18_11.rst |  7 +++++++
> > > >  lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
> > > >  2 files changed, 11 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/doc/guides/rel_notes/release_18_11.rst
> > > > b/doc/guides/rel_notes/release_18_11.rst
> > > > index 376128f..b68afab 100644
> > > > --- a/doc/guides/rel_notes/release_18_11.rst
> > > > +++ b/doc/guides/rel_notes/release_18_11.rst
> > > > @@ -69,6 +69,13 @@ New Features
> > > >    checked out against that dma mask and rejected if out of range. If more than
> > > >    one device has addressing limitations, the dma mask is the more restricted one.
> > > >
> > > > +* **Updated the ring library with C11 memory model.**
> > > > +
> > > > +  Updated the ring library with C11 memory model, in our tests the changes
> > > > +  decreased latency by 27~29% and 3~15% for MPMC and SPSC cases respectively.
> > > > +  The real improvements may vary with the number of contending lcores and
> > > > +  the size of ring.
> > > > +
> > > Is this a little misleading, and will users expect massive performance
> > > improvements generally? The C11 model seems to be used only on some,
> > > but not all, arm platforms, and then only with "make" builds.
> > >
> > > config/arm/meson.build: ['RTE_USE_C11_MEM_MODEL', false]]
> > > config/common_armv8a_linuxapp:CONFIG_RTE_USE_C11_MEM_MODEL=y
> > > config/common_base:CONFIG_RTE_USE_C11_MEM_MODEL=n
> > > config/defconfig_arm64-thunderx-linuxapp-
> > > gcc:CONFIG_RTE_USE_C11_MEM_MODEL=n
> > >
> > > /Bruce
> > 
> > Thank you Bruce for the review. To limit the scope of the improvement, I rewrote the note as follows; could you help review? Feel free to change anything if you like.
> > " Updated the ring library with C11 memory model, running ring_perf_autotest on Cavium ThunderX2 platform, the changes  decreased latency by 27~29% and 3~15% for MPMC and SPSC cases (2 lcores) respectively. Note the changes help the relaxed memory ordering architectures (arm, ppc) only when CONFIG_RTE_USE_C11_MEM_MODEL=y was configured, no impact on strong memory ordering architectures like x86. To what extent they help the real use cases depends on other factors, like the number of contending readers/writers, size of the ring, whether or not it is on the critical path."
> 
> I prefer your initial proposal which is more concise. What about
> something like this?
> 
> 
> * **Updated the C11 memory model version of ring library.**
> 
>   The latency is decreased for architectures using the C11 memory model
>   version of the ring library.
> 
>   On Cavium ThunderX2 platform, the changes decreased latency by 27~29%
>   and 3~15% for MPMC and SPSC cases respectively (with 2 lcores). The
>   real improvements may vary with the number of contending lcores and
>   the size of ring.
> 
> 
> About the patch itself:
> Acked-by: Olivier Matz <olivier.matz@6wind.com>

Series applied with the suggested notes.

Thanks

^ permalink raw reply	[flat|nested] 131+ messages in thread
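
[Editorial aside] As a sketch of the optimization discussed in this sub-thread, hoisting the atomic load of the head out of the CAS loop looks roughly like this. All names are simplified stand-ins (a bare `prod_head` global instead of `r->prod.head`); the real code is `__rte_ring_move_prod_head()` in `lib/librte_ring/rte_ring_c11_mem.h` and operates on `struct rte_ring`:

```c
/*
 * Simplified sketch of the head-update CAS loop before and after the
 * hoisting described in the quoted commit message. Stand-in names;
 * not the actual DPDK implementation.
 */
#include <stdbool.h>
#include <stdint.h>

static uint32_t prod_head; /* stand-in for r->prod.head */

/* Before: the head is re-loaded at the top of every iteration. */
static uint32_t
move_head_reload(uint32_t n)
{
	uint32_t old_head, new_head;
	bool success;

	do {
		old_head = __atomic_load_n(&prod_head, __ATOMIC_ACQUIRE);
		new_head = old_head + n;
		success = __atomic_compare_exchange_n(&prod_head,
				&old_head, new_head, /* weak */ false,
				__ATOMIC_RELAXED, __ATOMIC_RELAXED);
	} while (!success);
	return old_head;
}

/*
 * After: load once, before the loop. On CAS failure,
 * __atomic_compare_exchange_n has already written the currently
 * observed value of prod_head back into old_head, so the explicit
 * re-load inside the loop is redundant.
 */
static uint32_t
move_head_hoisted(uint32_t n)
{
	uint32_t old_head, new_head;
	bool success;

	old_head = __atomic_load_n(&prod_head, __ATOMIC_ACQUIRE);
	do {
		new_head = old_head + n;
		success = __atomic_compare_exchange_n(&prod_head,
				&old_head, new_head, false,
				__ATOMIC_RELAXED, __ATOMIC_RELAXED);
	} while (!success);
	return old_head;
}
```

Both variants return the previous head value; the point of the patch is only that the second avoids a redundant load per retry on contended MP/MC paths.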

* Re: [dpdk-dev] [dpdk-stable] [PATCH v5 2/2] ring: move the atomic load of head above the loop
  2018-11-05 13:17                   ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
@ 2018-11-05 13:41                     ` Jerin Jacob
  0 siblings, 0 replies; 131+ messages in thread
From: Jerin Jacob @ 2018-11-05 13:41 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Honnappa Nagarahalli, stable, Gavin Hu (Arm Technology China),
	Bruce Richardson, dev, stephen, olivier.matz, chaozhu,
	konstantin.ananyev, nd, hemant.agrawal, shreyansh.jain

-----Original Message-----
> Date: Mon, 05 Nov 2018 14:17:27 +0100
> From: Thomas Monjalon <thomas@monjalon.net>
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Cc: stable@dpdk.org, "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>,
>  Bruce Richardson <bruce.richardson@intel.com>, "dev@dpdk.org"
>  <dev@dpdk.org>, "stephen@networkplumber.org" <stephen@networkplumber.org>,
>  "olivier.matz@6wind.com" <olivier.matz@6wind.com>,
>  "chaozhu@linux.vnet.ibm.com" <chaozhu@linux.vnet.ibm.com>,
>  "konstantin.ananyev@intel.com" <konstantin.ananyev@intel.com>,
>  "jerin.jacob@caviumnetworks.com" <jerin.jacob@caviumnetworks.com>, nd
>  <nd@arm.com>, hemant.agrawal@nxp.com, shreyansh.jain@nxp.com
> Subject: Re: [dpdk-stable] [PATCH v5 2/2] ring: move the atomic load of
>  head above the loop
> 
> 03/11/2018 10:34, Honnappa Nagarahalli:
> > > > > ---
> > > > >  doc/guides/rel_notes/release_18_11.rst |  7 +++++++
> > > > >  lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
> > > > >  2 files changed, 11 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/doc/guides/rel_notes/release_18_11.rst
> > > > > b/doc/guides/rel_notes/release_18_11.rst
> > > > > index 376128f..b68afab 100644
> > > > > --- a/doc/guides/rel_notes/release_18_11.rst
> > > > > +++ b/doc/guides/rel_notes/release_18_11.rst
> > > > > @@ -69,6 +69,13 @@ New Features
> > > > >    checked out against that dma mask and rejected if out of range. If more
> > > > >    than one device has addressing limitations, the dma mask is the more
> > > > >    restricted one.
> > > > >
> > > > > +* **Updated the ring library with C11 memory model.**
> > > > > +
> > > > > +  Updated the ring library with C11 memory model, in our tests the changes
> > > > > +  decreased latency by 27~29% and 3~15% for MPMC and SPSC cases respectively.
> > > > > +  The real improvements may vary with the number of contending lcores and
> > > > > +  the size of ring.
> > > > > +
> > > > Is this a little misleading, and will users expect massive performance
> > > > improvements generally? The C11 model seems to be used only on some,
> > > > but not all, arm platforms, and then only with "make" builds.
> > > >
> > > > config/arm/meson.build: ['RTE_USE_C11_MEM_MODEL', false]]
> > This is an error. There is already an agreement that on Arm-based platforms, the C11 memory model would be used by default. Specific platforms can override it if required.
> > Would this be an acceptable change for RC2 or RC3?
> 
> If NXP and Cavium agree, I think it can go in RC2.

Yes, the meson and make configs should be the same, i.e. on Arm-based
platforms the C11 memory model would be used by default. Specific
platforms can override it if required.

I think the meson config needs to be updated to be in line with the make
config.


> For RC3, not sure.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail
  2018-11-03 20:12                 ` Mattias Rönnblom
@ 2018-11-05 21:51                   ` Honnappa Nagarahalli
  2018-11-06 11:03                     ` Mattias Rönnblom
  0 siblings, 1 reply; 131+ messages in thread
From: Honnappa Nagarahalli @ 2018-11-05 21:51 UTC (permalink / raw)
  To: Mattias Rönnblom, Thomas Monjalon, Jerin Jacob
  Cc: Gavin Hu (Arm Technology China),
	dev, stable, Ola Liljedahl, olivier.matz, chaozhu,
	bruce.richardson, konstantin.ananyev, nd

> >
> > 27/10/2018 17:00, Jerin Jacob:
> > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > The thread is totally messed up because:
> > > >         - there is no cover letter
> > > >         - some different series (testpmd, i40e and doc) are in the
> > > > same
> > thread
> > > >         - v4 replies to a different series
> > > >         - this version should be a v5 but has no number
> > > >         - this version replies to the v3
> > > >         - patchwork still shows v3 and "v5"
> > > >         - replies from Ola are not quoting previous discussion
> > > >
> > > > Because of all of this, it is really difficult to follow.
> > > > This is probably the reason of the lack of review outside of Arm.
> > > >
> > > > One more issue: you must Cc the relevant maintainers.
> > > > Here:
> > > >         - Olivier for rte_ring
> > > >         - Chao for IBM platform
> > > >         - Bruce and Konstantin for x86
> > > >
> > > > Guys, it is really cool to have more Arm developers in DPDK.
> > > > But please consider better formatting your discussions, it is
> > > > really important in our contribution workflow.
> > > >
> > > > I don't know what to do.
> > > > I suggest to wait for more feedbacks and integrate it in -rc2.
> > >
> > > This series has been acked and tested. Sure, if we are looking for
> > > some more feedback we can push to -rc2 if not it a good candidate to
> > > be selected for -rc1.
> >
> > It has been acked and tested only for Arm platforms.
> > And Olivier, the ring maintainer, was not Cc.
> >
> > I feel it is not enough.
> >
> 
> I've just run an out-of-tree test program I have for the DSW scheduler, which
> verifies scheduler atomic semantics. The results are:
> Non-C11 mode: pass
> C11 mode before this patch set: fail
> C11 mode after this patch set: pass
> 
> This suggests the current C11 mode is broken even on x86_64. I haven't
> been following this thread closely, so maybe this is known already.
> 
> I've also run an out-of-tree DSW throughput benchmark, and I've found that
> going from Non-C11 to C11 gives a 4% slowdown. After this patch, the
> slowdown is only 2.8%.
This is interesting. The general understanding seems to be that C11 atomics should not add any additional instructions on x86, yet we still see some drop in performance. Is this attributable to the compiler not being allowed to reorder?

> 
> GCC 7.3.0 and a Skylake x86_64.

^ permalink raw reply	[flat|nested] 131+ messages in thread
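
[Editorial aside] A minimal stand-alone sketch of the tail update at issue in this "synchronize the load and store of the tail" series may help. The wait on `ht->tail` must be an atomic load (relaxed suffices for the spin) rather than a plain read, and the final store is a release that publishes all preceding ring writes to a consumer performing a load-acquire on the same tail. Names are simplified; the real `update_tail()` in `lib/librte_ring/rte_ring_c11_mem.h` also calls `rte_pause()` while spinning:

```c
/*
 * Simplified sketch of update_tail(); not the actual DPDK code.
 */
#include <stdbool.h>
#include <stdint.h>

struct headtail {
	uint32_t head;
	uint32_t tail;
};

static void
update_tail(struct headtail *ht, uint32_t old_val, uint32_t new_val,
	    bool single)
{
	/*
	 * In multi-producer/consumer mode, wait until earlier threads
	 * have finished updating the tail before advancing it. The
	 * atomic load keeps the compiler from hoisting the read out
	 * of the loop.
	 */
	if (!single)
		while (old_val != __atomic_load_n(&ht->tail,
				__ATOMIC_RELAXED))
			; /* rte_pause() in the real code */

	/*
	 * Store-release: makes all prior stores (the ring entries)
	 * visible to any thread that load-acquires this tail.
	 */
	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
}
```

On x86 the release store compiles to a plain `mov`, so any residual cost there is about restricting compiler reordering rather than extra instructions.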

* Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail
  2018-11-05 21:51                   ` Honnappa Nagarahalli
@ 2018-11-06 11:03                     ` Mattias Rönnblom
  0 siblings, 0 replies; 131+ messages in thread
From: Mattias Rönnblom @ 2018-11-06 11:03 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Thomas Monjalon, Jerin Jacob
  Cc: Gavin Hu (Arm Technology China),
	dev, stable, Ola Liljedahl, olivier.matz, chaozhu,
	bruce.richardson, konstantin.ananyev, nd

On 2018-11-05 22:51, Honnappa Nagarahalli wrote:
>> I've also run an out-of-tree DSW throughput benchmark, and I've found that
>> going from Non-C11 to C11 gives a 4% slowdown. After this patch, the
>> slowdown is only 2.8%.
> This is interesting. The general understanding seems to be that C11 atomics should not add any additional instructions on x86, yet we still see some drop in performance. Is this attributable to the compiler not being allowed to reorder?
> 

I was lazy enough not to disassemble, so I don't know.

I would suggest non-C11 mode stays as the default on x86_64.

^ permalink raw reply	[flat|nested] 131+ messages in thread

end of thread, other threads:[~2018-11-06 11:03 UTC | newest]

Thread overview: 131+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-06  1:18 [dpdk-dev] [PATCH] ring: fix c11 memory ordering issue Gavin Hu
2018-08-06  9:19 ` Thomas Monjalon
2018-08-08  1:39   ` Gavin Hu
2018-08-07  3:19 ` [dpdk-dev] [PATCH v2] " Gavin Hu
2018-08-07  5:56   ` He, Jia
2018-08-07  7:56     ` Gavin Hu
2018-08-08  3:07       ` Jerin Jacob
2018-08-08  7:23         ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
2018-09-17  7:47   ` [dpdk-dev] [PATCH v3 1/3] app/testpmd: show errno along with flow API errors Gavin Hu
2018-09-17  7:47     ` [dpdk-dev] [PATCH v3 2/3] net/i40e: remove invalid comment Gavin Hu
2018-09-17  8:25       ` Gavin Hu (Arm Technology China)
2018-09-17  7:47     ` [dpdk-dev] [PATCH v3 3/3] doc: add cross compile part for sample applications Gavin Hu
2018-09-17  9:48       ` Jerin Jacob
2018-09-17 10:28         ` Gavin Hu (Arm Technology China)
2018-09-17 10:34           ` Jerin Jacob
2018-09-17 10:55             ` Gavin Hu (Arm Technology China)
2018-09-17 10:49       ` [dpdk-dev] [PATCH v4] " Gavin Hu
2018-09-17 10:53         ` [dpdk-dev] [PATCH v5] " Gavin Hu
2018-09-18 11:00           ` Jerin Jacob
2018-09-19  0:33           ` [dpdk-dev] [PATCH v6] " Gavin Hu
2018-09-17  8:11     ` [dpdk-dev] [PATCH v4 1/4] bus/fslmc: fix undefined reference of memsegs Gavin Hu
2018-09-17  8:11       ` [dpdk-dev] [PATCH v4 2/4] ring: read tail using atomic load Gavin Hu
2018-09-20  6:41         ` Jerin Jacob
2018-09-25  9:26           ` Gavin Hu (Arm Technology China)
2018-09-17  8:11       ` [dpdk-dev] [PATCH v4 3/4] ring: synchronize the load and store of the tail Gavin Hu
2018-09-17  8:11       ` [dpdk-dev] [PATCH v4 4/4] ring: move the atomic load of head above the loop Gavin Hu
2018-10-27 14:21         ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
2018-09-17  8:17   ` [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load Gavin Hu
2018-09-17  8:17     ` [dpdk-dev] [PATCH v3 2/3] ring: synchronize the load and store of the tail Gavin Hu
2018-09-26  9:29       ` Gavin Hu (Arm Technology China)
2018-09-26  9:59         ` Justin He
2018-09-29 10:57       ` Jerin Jacob
2018-10-17  6:29       ` [dpdk-dev] [PATCH 1/2] " Gavin Hu
2018-10-17  6:29         ` [dpdk-dev] [PATCH 2/2] ring: move the atomic load of head above the loop Gavin Hu
2018-10-17  6:35         ` [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of the tail Gavin Hu (Arm Technology China)
2018-10-27 14:39           ` Thomas Monjalon
2018-10-27 15:00             ` Jerin Jacob
2018-10-27 15:13               ` Thomas Monjalon
2018-10-27 15:34                 ` Jerin Jacob
2018-10-27 15:48                   ` Thomas Monjalon
2018-10-29  2:51                   ` Gavin Hu (Arm Technology China)
2018-10-29  2:57                   ` Gavin Hu (Arm Technology China)
2018-10-29 10:16                     ` Jerin Jacob
2018-10-29 10:47                       ` Thomas Monjalon
2018-10-29 11:10                         ` Jerin Jacob
2018-11-03 20:12                 ` Mattias Rönnblom
2018-11-05 21:51                   ` Honnappa Nagarahalli
2018-11-06 11:03                     ` Mattias Rönnblom
2018-10-31  3:35         ` [dpdk-dev] [PATCH v2 0/2] rte ring c11 bug fix and optimization Gavin Hu
2018-10-31 10:26           ` [dpdk-dev] [PATCH v3 0/2] ring library with c11 memory model " Gavin Hu
2018-10-31 16:58             ` Thomas Monjalon
2018-11-01  9:53             ` [dpdk-dev] [PATCH v4 1/2] ring: synchronize the load and store of the tail Gavin Hu
2018-11-01  9:53             ` [dpdk-dev] [PATCH v4 2/2] ring: move the atomic load of head above the loop Gavin Hu
2018-11-01 17:26               ` Stephen Hemminger
2018-11-02  0:53                 ` Gavin Hu (Arm Technology China)
2018-11-02  4:30                   ` Honnappa Nagarahalli
2018-11-02  7:15                     ` Gavin Hu (Arm Technology China)
2018-11-02  9:36                       ` Thomas Monjalon
2018-11-02 11:23                         ` Gavin Hu (Arm Technology China)
2018-10-31 10:26           ` [dpdk-dev] [PATCH v3 1/2] ring: synchronize the load and store of the tail Gavin Hu
2018-10-31 22:07             ` Stephen Hemminger
2018-11-01  9:56               ` Gavin Hu (Arm Technology China)
2018-10-31 10:26           ` [dpdk-dev] [PATCH v3 2/2] ring: move the atomic load of head above the loop Gavin Hu
2018-10-31  3:35         ` [dpdk-dev] [PATCH v2 1/2] ring: synchronize the load and store of the tail Gavin Hu
2018-10-31  3:35         ` [dpdk-dev] [PATCH v2 2/2] ring: move the atomic load of head above the loop Gavin Hu
2018-10-31  9:36           ` Thomas Monjalon
2018-10-31 10:27             ` Gavin Hu (Arm Technology China)
2018-11-01  9:53         ` [dpdk-dev] [PATCH v4 0/2] ring library with c11 memory model bug fix and optimization Gavin Hu
2018-11-02 11:21           ` [dpdk-dev] [PATCH v5 " Gavin Hu
2018-11-02 11:21           ` [dpdk-dev] [PATCH v5 1/2] ring: synchronize the load and store of the tail Gavin Hu
2018-11-05  9:30             ` Olivier Matz
2018-11-02 11:21           ` [dpdk-dev] [PATCH v5 2/2] ring: move the atomic load of head above the loop Gavin Hu
2018-11-02 11:43             ` Bruce Richardson
2018-11-03  1:19               ` Gavin Hu (Arm Technology China)
2018-11-03  9:34                 ` Honnappa Nagarahalli
2018-11-05 13:17                   ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
2018-11-05 13:41                     ` Jerin Jacob
2018-11-05  9:44                 ` [dpdk-dev] " Olivier Matz
2018-11-05 13:36                   ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
2018-09-17  8:17     ` [dpdk-dev] [PATCH v3 3/3] " Gavin Hu
2018-09-26  9:29       ` Gavin Hu (Arm Technology China)
2018-09-26 10:06         ` Justin He
2018-09-29  7:19           ` Stephen Hemminger
2018-09-29 10:59       ` Jerin Jacob
2018-09-26  9:29     ` [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load Gavin Hu (Arm Technology China)
2018-09-26 10:09       ` Justin He
2018-09-29 10:48     ` Jerin Jacob
2018-10-05  0:47       ` Gavin Hu (Arm Technology China)
2018-10-05  8:21         ` Ananyev, Konstantin
2018-10-05 11:15           ` Ola Liljedahl
2018-10-05 11:36             ` Ola Liljedahl
2018-10-05 13:44               ` Ananyev, Konstantin
2018-10-05 14:21                 ` Ola Liljedahl
2018-10-05 15:11                 ` Honnappa Nagarahalli
2018-10-05 17:07                   ` Jerin Jacob
2018-10-05 18:05                     ` Ola Liljedahl
2018-10-05 20:06                       ` Honnappa Nagarahalli
2018-10-05 20:17                         ` Ola Liljedahl
2018-10-05 20:29                           ` Honnappa Nagarahalli
2018-10-05 20:34                             ` Ola Liljedahl
2018-10-06  7:41                               ` Jerin Jacob
2018-10-06 19:44                                 ` Ola Liljedahl
2018-10-06 19:59                                   ` Ola Liljedahl
2018-10-07  4:02                                   ` Jerin Jacob
2018-10-07 20:11                                     ` Ola Liljedahl
2018-10-07 20:44                                     ` Ola Liljedahl
2018-10-08  6:06                                       ` Jerin Jacob
2018-10-08  9:22                                         ` Ola Liljedahl
2018-10-08 10:00                                           ` Jerin Jacob
2018-10-08 10:25                                             ` Ola Liljedahl
2018-10-08 10:33                                               ` Gavin Hu (Arm Technology China)
2018-10-08 10:39                                                 ` Ola Liljedahl
2018-10-08 10:41                                                   ` Gavin Hu (Arm Technology China)
2018-10-08 10:49                                                 ` Jerin Jacob
2018-10-10  6:28                                                   ` Gavin Hu (Arm Technology China)
2018-10-10 19:26                                                     ` Honnappa Nagarahalli
2018-10-08 10:46                                               ` Jerin Jacob
2018-10-08 11:21                                                 ` Ola Liljedahl
2018-10-08 11:50                                                   ` Jerin Jacob
2018-10-08 11:59                                                     ` Ola Liljedahl
2018-10-08 12:05                                                       ` Jerin Jacob
2018-10-08 12:20                                                         ` Jerin Jacob
2018-10-08 12:30                                                           ` Ola Liljedahl
2018-10-09  8:53                                                             ` Olivier Matz
2018-10-09  3:16                                             ` Honnappa Nagarahalli
2018-10-08 14:43                                           ` Bruce Richardson
2018-10-08 14:46                                             ` Ola Liljedahl
2018-10-08 15:45                                               ` Ola Liljedahl
2018-10-08  5:27                               ` Honnappa Nagarahalli
2018-10-08 10:01                                 ` Ola Liljedahl
2018-10-27 14:17     ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
