DPDK patches and discussions
* [dpdk-dev] [PATCH v3 0/6] spinlock optimization and test case enhancements
@ 2018-12-27  4:13 Gavin Hu
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 1/6] eal: fix clang compilation error on x86 Gavin Hu
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Gavin Hu @ 2018-12-27  4:13 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, stephen, david.marchand, nd, Gavin Hu

V3:
1. Implemented the ticket lock to improve fairness and predictability.
   Locks are granted in the order they are requested.

V2:
1. FORCE_INTRINSICS is still an option for ppc/x86, although not in use
   by default, so don't remove it from the generic file.
2. Fix the clang compiler error on x86 when the above FORCE_INTRINSICS
   is enabled.

V1:
1. Remove the 1us delay outside of the locked region to really benchmark
   the spinlock acquire/release performance, not the delay API.
2. Use the precise version of getting timestamps for more precise
   benchmarking results.
3. Amortize the overhead of getting the timestamp over 10000 loops.
4. Move the arm specific implementation to arm folder to remove the
   hardcoded implementation.
5. Use atomic primitives, which translate to one-way barriers, instead of
   two-way sync primitives, to optimize for performance.

Gavin Hu (5):
  eal: fix clang compilation error on x86
  test/spinlock: remove 1us delay for correct benchmarking
  test/spinlock: get timestamp more precisely
  test/spinlock: amortize the cost of getting time
  spinlock: reimplement with atomic one-way barrier builtins

Joyce Kong (1):
  spinlock: ticket based to improve fairness

 .../common/include/arch/ppc_64/rte_spinlock.h      |  5 +++
 .../common/include/arch/x86/rte_spinlock.h         |  6 +++
 lib/librte_eal/common/include/generic/rte_atomic.h |  6 +--
 .../common/include/generic/rte_spinlock.h          | 45 +++++++++++++++-------
 test/test/test_spinlock.c                          | 32 +++++++--------
 5 files changed, 61 insertions(+), 33 deletions(-)

-- 
2.11.0

* [dpdk-dev] [PATCH v3 1/6] eal: fix clang compilation error on x86
  2018-12-27  4:13 [dpdk-dev] [PATCH v3 0/6] spinlock optimization and test case enhancements Gavin Hu
@ 2018-12-27  4:13 ` Gavin Hu
  2018-12-27  6:36   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 2/6] test/spinlock: remove 1us delay for correct benchmarking Gavin Hu
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Gavin Hu @ 2018-12-27  4:13 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, stephen, david.marchand, nd, Gavin Hu,
	stable

When CONFIG_RTE_FORCE_INTRINSICS is enabled for x86, the clang
compilation error was:
	include/generic/rte_atomic.h:215:9: error:
		implicit declaration of function '__atomic_exchange_2'
		is invalid in C99
	include/generic/rte_atomic.h:494:9: error:
		implicit declaration of function '__atomic_exchange_4'
		is invalid in C99
	include/generic/rte_atomic.h:772:9: error:
		implicit declaration of function '__atomic_exchange_8'
		is invalid in C99

Use __atomic_exchange_n instead of __atomic_exchange_(2/4/8).
For more information, please refer to:
http://mails.dpdk.org/archives/dev/2018-April/096776.html
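
For illustration (not part of the patch), the generic builtin infers
the operand size from the pointer type, so one spelling covers all
widths and is accepted by both gcc and clang:

	uint16_t old;
	/* size-suffixed form: clang does not provide it as a builtin,
	 * hence the implicit-declaration error above: */
	old = __atomic_exchange_2(dst, val, __ATOMIC_SEQ_CST);
	/* generic form: the size is inferred from *dst: */
	old = __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST);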

Fixes: 7bdccb93078e ("eal: fix ARM build with clang")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/common/include/generic/rte_atomic.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
index b99ba4688..ed5b125b3 100644
--- a/lib/librte_eal/common/include/generic/rte_atomic.h
+++ b/lib/librte_eal/common/include/generic/rte_atomic.h
@@ -212,7 +212,7 @@ rte_atomic16_exchange(volatile uint16_t *dst, uint16_t val);
 static inline uint16_t
 rte_atomic16_exchange(volatile uint16_t *dst, uint16_t val)
 {
-#if defined(RTE_ARCH_ARM64) && defined(RTE_TOOLCHAIN_CLANG)
+#if defined(RTE_TOOLCHAIN_CLANG)
 	return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST);
 #else
 	return __atomic_exchange_2(dst, val, __ATOMIC_SEQ_CST);
@@ -495,7 +495,7 @@ rte_atomic32_exchange(volatile uint32_t *dst, uint32_t val);
 static inline uint32_t
 rte_atomic32_exchange(volatile uint32_t *dst, uint32_t val)
 {
-#if defined(RTE_ARCH_ARM64) && defined(RTE_TOOLCHAIN_CLANG)
+#if defined(RTE_TOOLCHAIN_CLANG)
 	return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST);
 #else
 	return __atomic_exchange_4(dst, val, __ATOMIC_SEQ_CST);
@@ -777,7 +777,7 @@ rte_atomic64_exchange(volatile uint64_t *dst, uint64_t val);
 static inline uint64_t
 rte_atomic64_exchange(volatile uint64_t *dst, uint64_t val)
 {
-#if defined(RTE_ARCH_ARM64) && defined(RTE_TOOLCHAIN_CLANG)
+#if defined(RTE_TOOLCHAIN_CLANG)
 	return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST);
 #else
 	return __atomic_exchange_8(dst, val, __ATOMIC_SEQ_CST);
-- 
2.11.0

* [dpdk-dev] [PATCH v3 2/6] test/spinlock: remove 1us delay for correct benchmarking
  2018-12-27  4:13 [dpdk-dev] [PATCH v3 0/6] spinlock optimization and test case enhancements Gavin Hu
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 1/6] eal: fix clang compilation error on x86 Gavin Hu
@ 2018-12-27  4:13 ` Gavin Hu
  2018-12-27  7:20   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 3/6] test/spinlock: get timestamp more precisely Gavin Hu
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Gavin Hu @ 2018-12-27  4:13 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, stephen, david.marchand, nd, Gavin Hu

The test benchmarks spinlock performance by counting the number of
spinlock acquire and release operations within a specified time.

A typical pair of lock and unlock operations costs tens or hundreds of
nanoseconds; in comparison, delaying 1 us outside of the locked region
is far too much and compromises the goal of benchmarking the lock and
unlock performance. For example, with a ~100 ns lock/unlock pair, the
1 us delay would account for over 90% of each iteration.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <Ruifeng.Wang@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
---
 test/test/test_spinlock.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/test/test/test_spinlock.c b/test/test/test_spinlock.c
index 73bff128e..6795195ae 100644
--- a/test/test/test_spinlock.c
+++ b/test/test/test_spinlock.c
@@ -120,8 +120,6 @@ load_loop_fn(void *func_param)
 		lcount++;
 		if (use_lock)
 			rte_spinlock_unlock(&lk);
-		/* delay to make lock duty cycle slighlty realistic */
-		rte_delay_us(1);
 		time_diff = rte_get_timer_cycles() - begin;
 	}
 	lock_count[lcore] = lcount;
-- 
2.11.0

* [dpdk-dev] [PATCH v3 3/6] test/spinlock: get timestamp more precisely
  2018-12-27  4:13 [dpdk-dev] [PATCH v3 0/6] spinlock optimization and test case enhancements Gavin Hu
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 1/6] eal: fix clang compilation error on x86 Gavin Hu
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 2/6] test/spinlock: remove 1us delay for correct benchmarking Gavin Hu
@ 2018-12-27  4:13 ` Gavin Hu
  2018-12-27  7:27   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 4/6] test/spinlock: amortize the cost of getting time Gavin Hu
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Gavin Hu @ 2018-12-27  4:13 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, stephen, david.marchand, nd, Gavin Hu

To precisely benchmark the spinlock performance, use the precise
version of getting timestamps, which enforces that the timestamps are
obtained at the expected places.
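
For reference, the precise variant pairs the counter read with a full
barrier so it cannot be reordered with the measured code; this is
essentially the existing generic helper (sketch, paraphrased from
rte_cycles.h):

	static inline uint64_t
	rte_rdtsc_precise(void)
	{
		rte_mb(); /* keep prior loads/stores before the read */
		return rte_rdtsc();
	}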

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 test/test/test_spinlock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/test/test/test_spinlock.c b/test/test/test_spinlock.c
index 6795195ae..648474833 100644
--- a/test/test/test_spinlock.c
+++ b/test/test/test_spinlock.c
@@ -113,14 +113,14 @@ load_loop_fn(void *func_param)
 	if (lcore != rte_get_master_lcore())
 		while (rte_atomic32_read(&synchro) == 0);
 
-	begin = rte_get_timer_cycles();
+	begin = rte_rdtsc_precise();
 	while (time_diff < hz * TIME_MS / 1000) {
 		if (use_lock)
 			rte_spinlock_lock(&lk);
 		lcount++;
 		if (use_lock)
 			rte_spinlock_unlock(&lk);
-		time_diff = rte_get_timer_cycles() - begin;
+		time_diff = rte_rdtsc_precise() - begin;
 	}
 	lock_count[lcore] = lcount;
 	return 0;
-- 
2.11.0

* [dpdk-dev] [PATCH v3 4/6] test/spinlock: amortize the cost of getting time
  2018-12-27  4:13 [dpdk-dev] [PATCH v3 0/6] spinlock optimization and test case enhancements Gavin Hu
                   ` (2 preceding siblings ...)
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 3/6] test/spinlock: get timestamp more precisely Gavin Hu
@ 2018-12-27  4:13 ` Gavin Hu
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 6/6] spinlock: ticket based to improve fairness Gavin Hu
  5 siblings, 0 replies; 28+ messages in thread
From: Gavin Hu @ 2018-12-27  4:13 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, stephen, david.marchand, nd, Gavin Hu

Instead of getting a timestamp on every iteration, amortizing its
overhead over many iterations helps to get more precise benchmarking
results.
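
In sketch form, the measurement pattern becomes two timestamp reads
around the whole loop instead of one read per iteration (MAX_LOOP and
the us conversion as in the diff below):

	begin = rte_rdtsc_precise();
	while (lcount++ < MAX_LOOP) {
		rte_spinlock_lock(&lk);
		rte_spinlock_unlock(&lk);
	}
	time_diff = rte_rdtsc_precise() - begin;
	/* total time in us; per-pair cost is time_diff / MAX_LOOP */
	time_count[lcore] = time_diff * 1000000 / hz;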

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
---
 test/test/test_spinlock.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/test/test/test_spinlock.c b/test/test/test_spinlock.c
index 648474833..e9839b979 100644
--- a/test/test/test_spinlock.c
+++ b/test/test/test_spinlock.c
@@ -96,9 +96,9 @@ test_spinlock_recursive_per_core(__attribute__((unused)) void *arg)
 }
 
 static rte_spinlock_t lk = RTE_SPINLOCK_INITIALIZER;
-static uint64_t lock_count[RTE_MAX_LCORE] = {0};
+static uint64_t time_count[RTE_MAX_LCORE] = {0};
 
-#define TIME_MS 100
+#define MAX_LOOP 10000
 
 static int
 load_loop_fn(void *func_param)
@@ -114,15 +114,14 @@ load_loop_fn(void *func_param)
 		while (rte_atomic32_read(&synchro) == 0);
 
 	begin = rte_rdtsc_precise();
-	while (time_diff < hz * TIME_MS / 1000) {
+	while (lcount++ < MAX_LOOP) {
 		if (use_lock)
 			rte_spinlock_lock(&lk);
-		lcount++;
 		if (use_lock)
 			rte_spinlock_unlock(&lk);
-		time_diff = rte_rdtsc_precise() - begin;
 	}
-	lock_count[lcore] = lcount;
+	time_diff = rte_rdtsc_precise() - begin;
+	time_count[lcore] = time_diff * 1000000 / hz;
 	return 0;
 }
 
@@ -136,14 +135,16 @@ test_spinlock_perf(void)
 
 	printf("\nTest with no lock on single core...\n");
 	load_loop_fn(&lock);
-	printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
-	memset(lock_count, 0, sizeof(lock_count));
+	printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+						time_count[lcore]);
+	memset(time_count, 0, sizeof(time_count));
 
 	printf("\nTest with lock on single core...\n");
 	lock = 1;
 	load_loop_fn(&lock);
-	printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
-	memset(lock_count, 0, sizeof(lock_count));
+	printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+						time_count[lcore]);
+	memset(time_count, 0, sizeof(time_count));
 
 	printf("\nTest with lock on %u cores...\n", rte_lcore_count());
 
@@ -158,11 +159,12 @@ test_spinlock_perf(void)
 	rte_eal_mp_wait_lcore();
 
 	RTE_LCORE_FOREACH(i) {
-		printf("Core [%u] count = %"PRIu64"\n", i, lock_count[i]);
-		total += lock_count[i];
+		printf("Core [%u] Cost Time = %"PRIu64" us\n", i,
+						time_count[i]);
+		total += time_count[i];
 	}
 
-	printf("Total count = %"PRIu64"\n", total);
+	printf("Total Cost Time = %"PRIu64" us\n", total);
 
 	return 0;
 }
-- 
2.11.0

* [dpdk-dev] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins
  2018-12-27  4:13 [dpdk-dev] [PATCH v3 0/6] spinlock optimization and test case enhancements Gavin Hu
                   ` (3 preceding siblings ...)
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 4/6] test/spinlock: amortize the cost of getting time Gavin Hu
@ 2018-12-27  4:13 ` Gavin Hu
  2018-12-27  7:42   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 6/6] spinlock: ticket based to improve fairness Gavin Hu
  5 siblings, 1 reply; 28+ messages in thread
From: Gavin Hu @ 2018-12-27  4:13 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, stephen, david.marchand, nd, Gavin Hu

The __sync builtin based implementation generates full memory barriers
('dmb ish') on Arm platforms. Use C11 atomic builtins instead to
generate one-way barriers.

Here is the assembly code of __sync_compare_and_swap builtin.
__sync_bool_compare_and_swap(dst, exp, src);
   0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
   0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
   0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
   0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
   0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
   0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
   0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
<rte_atomic16_cmpset+52>  // b.any
   0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
   0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
<rte_atomic16_cmpset+32>
   0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
   0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq = none
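
With the acquire-ordered C11 builtin, the same operation is expected to
map to a load-acquire exclusive pair (ldaxrh/stxrh) with no trailing
'dmb ish' (illustrative; the exact code is compiler dependent):

	__atomic_compare_exchange_n(dst, &exp, src, 0,
			__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);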

The benchmarking results showed a 3X performance gain on Cavium
ThunderX2, 13% on Qualcomm Falkor and 3.7% on the quad-core A72
Marvell MACCHIATObin. Here is an example test result on TX2:

*** spinlock_autotest without this patch ***
Core [123] Cost Time = 639822 us
Core [124] Cost Time = 633253 us
Core [125] Cost Time = 646030 us
Core [126] Cost Time = 643189 us
Core [127] Cost Time = 647039 us
Total Cost Time = 95433298 us

*** spinlock_autotest with this patch ***
Core [123] Cost Time = 163615 us
Core [124] Cost Time = 166471 us
Core [125] Cost Time = 189044 us
Core [126] Cost Time = 195745 us
Core [127] Cost Time = 78423 us
Total Cost Time = 27339656 us

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Steve Capper <Steve.Capper@arm.com>
---
 lib/librte_eal/common/include/generic/rte_spinlock.h | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index c4c3fc31e..87ae7a4f1 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl)
 {
-	while (__sync_lock_test_and_set(&sl->locked, 1))
-		while(sl->locked)
+	int exp = 0;
+
+	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
+				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
+		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
 			rte_pause();
+		exp = 0;
+	}
 }
 #endif
 
@@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
 static inline void
 rte_spinlock_unlock (rte_spinlock_t *sl)
 {
-	__sync_lock_release(&sl->locked);
+	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
 }
 #endif
 
@@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
 static inline int
 rte_spinlock_trylock (rte_spinlock_t *sl)
 {
-	return __sync_lock_test_and_set(&sl->locked,1) == 0;
+	int exp = 0;
+	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
+				0, /* disallow spurious failure */
+				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
 }
 #endif
 
@@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
  */
 static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
 {
-	return sl->locked;
+	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
2.11.0

* [dpdk-dev] [PATCH v3 6/6] spinlock: ticket based to improve fairness
  2018-12-27  4:13 [dpdk-dev] [PATCH v3 0/6] spinlock optimization and test case enhancements Gavin Hu
                   ` (4 preceding siblings ...)
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
@ 2018-12-27  4:13 ` Gavin Hu
  2018-12-27  6:58   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
  5 siblings, 1 reply; 28+ messages in thread
From: Gavin Hu @ 2018-12-27  4:13 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, stephen, david.marchand, nd, Joyce Kong

From: Joyce Kong <joyce.kong@arm.com>

The old implementation is unfair, some threads may take locks aggressively
while leaving the other threads starving for a long time. As shown in the
following test, within the same period of time, some threads take locks
many more times than the others.

The new implementation gives each waiting thread a ticket, and the threads
take the lock one by one, first come, first served. This avoids long
starvation and is more predictable.
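
To make the ordering concrete, here is a hypothetical trace with three
contending threads (initially current = next = 0):

	/*
	 * T0: fetch_add(next) -> ticket 0; current == 0, T0 enters.
	 * T1: fetch_add(next) -> ticket 1; spins until current == 1.
	 * T2: fetch_add(next) -> ticket 2; spins until current == 2.
	 * T0 unlocks: current = 1 -> T1 enters.
	 * T1 unlocks: current = 2 -> T2 enters.
	 * Arrival order equals service order, hence the fairer counts
	 * below.
	 */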

*** spinlock_autotest without this patch ***
Core [0] count = 89
Core [1] count = 84
Core [2] count = 94
...
Core [208] count = 171
Core [209] count = 152
Core [210] count = 161
Core [211] count = 187

*** spinlock_autotest with this patch ***
Core [0] count = 534
Core [1] count = 533
Core [2] count = 542
...
Core [208] count = 554
Core [209] count = 556
Core [210] count = 555
Core [211] count = 551

The overall spinlock fairness increased on ThunderX2.

Signed-off-by: Joyce Kong <joyce.kong@arm.com>
---
 .../common/include/arch/ppc_64/rte_spinlock.h      |  5 ++
 .../common/include/arch/x86/rte_spinlock.h         |  6 +++
 .../common/include/generic/rte_spinlock.h          | 53 +++++++++++++---------
 3 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/ppc_64/rte_spinlock.h b/lib/librte_eal/common/include/arch/ppc_64/rte_spinlock.h
index 39815d9ee..9fa904f92 100644
--- a/lib/librte_eal/common/include/arch/ppc_64/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/ppc_64/rte_spinlock.h
@@ -65,6 +65,11 @@ rte_spinlock_trylock(rte_spinlock_t *sl)
 	return __sync_lock_test_and_set(&sl->locked, 1) == 0;
 }
 
+static inline int
+rte_spinlock_is_locked(rte_spinlock_t *sl)
+{
+	return sl->locked;
+}
 #endif
 
 static inline int rte_tm_supported(void)
diff --git a/lib/librte_eal/common/include/arch/x86/rte_spinlock.h b/lib/librte_eal/common/include/arch/x86/rte_spinlock.h
index e2e2b2643..db80fa420 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_spinlock.h
@@ -65,6 +65,12 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
 
 	return lockval == 0;
 }
+
+static inline int
+rte_spinlock_is_locked(rte_spinlock_t *sl)
+{
+	return sl->locked;
+}
 #endif
 
 extern uint8_t rte_rtm_supported;
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index 87ae7a4f1..607abd400 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -27,8 +27,12 @@
 /**
  * The rte_spinlock_t type.
  */
-typedef struct {
-	volatile int locked; /**< lock status 0 = unlocked, 1 = locked */
+typedef union {
+	volatile int locked; /* lock status 0 = unlocked, 1 = locked */
+	struct {
+		uint16_t current;
+		uint16_t next;
+	} s;
 } rte_spinlock_t;
 
 /**
@@ -45,7 +49,8 @@ typedef struct {
 static inline void
 rte_spinlock_init(rte_spinlock_t *sl)
 {
-	sl->locked = 0;
+	__atomic_store_n(&sl->s.current, 0, __ATOMIC_RELAXED);
+	__atomic_store_n(&sl->s.next, 0, __ATOMIC_RELAXED);
 }
 
 /**
@@ -61,14 +66,9 @@ rte_spinlock_lock(rte_spinlock_t *sl);
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl)
 {
-	int exp = 0;
-
-	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
-				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
-		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
-			rte_pause();
-		exp = 0;
-	}
+	uint16_t me = __atomic_fetch_add(&sl->s.next, 1, __ATOMIC_RELAXED);
+	while (__atomic_load_n(&sl->s.current, __ATOMIC_ACQUIRE) != me)
+		rte_pause();
 }
 #endif
 
@@ -79,13 +79,15 @@ rte_spinlock_lock(rte_spinlock_t *sl)
  *   A pointer to the spinlock.
  */
 static inline void
-rte_spinlock_unlock (rte_spinlock_t *sl);
+rte_spinlock_unlock(rte_spinlock_t *sl);
 
 #ifdef RTE_FORCE_INTRINSICS
 static inline void
-rte_spinlock_unlock (rte_spinlock_t *sl)
+rte_spinlock_unlock(rte_spinlock_t *sl)
 {
-	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
+	uint16_t i = __atomic_load_n(&sl->s.current, __ATOMIC_RELAXED);
+	i++;
+	__atomic_store_n(&sl->s.current, i, __ATOMIC_RELAXED);
 }
 #endif
 
@@ -98,16 +100,19 @@ rte_spinlock_unlock (rte_spinlock_t *sl)
  *   1 if the lock is successfully taken; 0 otherwise.
  */
 static inline int
-rte_spinlock_trylock (rte_spinlock_t *sl);
+rte_spinlock_trylock(rte_spinlock_t *sl);
 
 #ifdef RTE_FORCE_INTRINSICS
 static inline int
-rte_spinlock_trylock (rte_spinlock_t *sl)
+rte_spinlock_trylock(rte_spinlock_t *sl)
 {
-	int exp = 0;
-	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
-				0, /* disallow spurious failure */
-				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
+	uint16_t me = __atomic_fetch_add(&sl->s.next, 1, __ATOMIC_RELAXED);
+	while (__atomic_load_n(&sl->s.current, __ATOMIC_RELAXED) != me) {
+		__atomic_sub_fetch(&sl->s.next, 1, __ATOMIC_RELAXED);
+		return 0;
+	}
+
+	return 1;
 }
 #endif
 
@@ -119,10 +124,14 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
  * @return
  *   1 if the lock is currently taken; 0 otherwise.
  */
-static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
+#ifdef RTE_FORCE_INTRINSICS
+static inline int
+rte_spinlock_is_locked(rte_spinlock_t *sl)
 {
-	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
+	return (__atomic_load_n(&sl->s.current, __ATOMIC_RELAXED) !=
+			__atomic_load_n(&sl->s.next, __ATOMIC_RELAXED));
 }
+#endif
 
 /**
  * Test if hardware transactional memory (lock elision) is supported
-- 
2.11.0

* Re: [dpdk-dev] [EXT] [PATCH v3 1/6] eal: fix clang compilation error on x86
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 1/6] eal: fix clang compilation error on x86 Gavin Hu
@ 2018-12-27  6:36   ` Jerin Jacob Kollanukkaran
  0 siblings, 0 replies; 28+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2018-12-27  6:36 UTC (permalink / raw)
  To: gavin.hu, dev
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas,
	hemant.agrawal, stable, stephen, Honnappa.Nagarahalli

On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> When CONFIG_RTE_FORCE_INTRINSICS is enabled for x86, the clang
> compilation error was:
> 	include/generic/rte_atomic.h:215:9: error:
> 		implicit declaration of function '__atomic_exchange_2'
> 		is invalid in C99
> 	include/generic/rte_atomic.h:494:9: error:
> 		implicit declaration of function '__atomic_exchange_4'
> 		is invalid in C99
> 	include/generic/rte_atomic.h:772:9: error:
> 		implicit declaration of function '__atomic_exchange_8'
> 		is invalid in C99
> 
> Use __atomic_exchange_n instead of __atomic_exchange_(2/4/8).
> For more information, please refer to:
> http://mails.dpdk.org/archives/dev/2018-April/096776.html
> 
> Fixes: 7bdccb93078e ("eal: fix ARM build with clang")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>

Acked-by: Jerin Jacob <jerinj@marvell.com>



* Re: [dpdk-dev] [EXT] [PATCH v3 6/6] spinlock: ticket based to improve fairness
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 6/6] spinlock: ticket based to improve fairness Gavin Hu
@ 2018-12-27  6:58   ` Jerin Jacob Kollanukkaran
  2018-12-27 10:05     ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 28+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2018-12-27  6:58 UTC (permalink / raw)
  To: gavin.hu, dev
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas,
	joyce.kong, hemant.agrawal, stephen, Honnappa.Nagarahalli

On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> -------------------------------------------------------------------
> ---
> From: Joyce Kong <joyce.kong@arm.com>
> 
> The old implementation is unfair, some threads may take locks
> aggressively

I think one issue here is that x86 and ppc follow the traditional
spinlock while arm64 will be following the ticket lock for its spinlock
implementation. This would change application behaviour on arm64
compared to x86 and ppc.

How about having a separate API for the ticket lock? That would give:
# the application a choice of locking strategy
# the same application behaviour across all arches.

The initial ticket lock implementation can be generic with C11 memory
primitives; later, arches can optimize it if required.

> while leaving the other threads starving for a long time. As shown in
> the following test, within the same period of time, some threads take
> locks many more times than the others.
> 
>  
>  #ifdef RTE_FORCE_INTRINSICS
>  static inline void
> -rte_spinlock_unlock (rte_spinlock_t *sl)
> +rte_spinlock_unlock(rte_spinlock_t *sl)
>  {
> -	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
> +	uint16_t i = __atomic_load_n(&sl->s.current, __ATOMIC_RELAXED);
> +	i++;
> +	__atomic_store_n(&sl->s.current, i, __ATOMIC_RELAXED);

Shouldn't we use __ATOMIC_RELEASE here to pair with lock() ?


>  }
>  #endif
>  
> @@ -98,16 +100,19 @@ rte_spinlock_unlock (rte_spinlock_t *sl)
>   *   1 if the lock is successfully taken; 0 otherwise.
>   */
>  static inline int
> -rte_spinlock_trylock (rte_spinlock_t *sl);
> +rte_spinlock_trylock(rte_spinlock_t *sl);
>  
>  #ifdef RTE_FORCE_INTRINSICS
>  static inline int
> -rte_spinlock_trylock (rte_spinlock_t *sl)
> +rte_spinlock_trylock(rte_spinlock_t *sl)
>  {
> -	int exp = 0;
> -	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> -				0, /* disallow spurious failure */
> -				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> +	uint16_t me = __atomic_fetch_add(&sl->s.next, 1,
> __ATOMIC_RELAXED);
> +	while (__atomic_load_n(&sl->s.current, __ATOMIC_RELAXED) != me)
> {
> +		__atomic_sub_fetch(&sl->s.next, 1, __ATOMIC_RELAXED);
> +		return 0;
> +	}
> +

Don't we need a CAS here?
There is a similar implementation here:
https://git.linaro.org/lng/odp.git/tree/platform/linux-generic/include/odp/api/plat/ticketlock_inlines.h


> +	return 1;
>  }
>  #endif
>  
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v3 2/6] test/spinlock: remove 1us delay for correct benchmarking
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 2/6] test/spinlock: remove 1us delay for correct benchmarking Gavin Hu
@ 2018-12-27  7:20   ` Jerin Jacob Kollanukkaran
  0 siblings, 0 replies; 28+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2018-12-27  7:20 UTC (permalink / raw)
  To: gavin.hu, dev
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas,
	hemant.agrawal, stephen, Honnappa.Nagarahalli

On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> -------------------------------------------------------------------
> ---
> The test benchmarks spinlock performance by counting the number of
> spinlock acquire and release operations within a specified time.
> 
> A typical pair of lock and unlock operations costs tens or hundreds of
> nanoseconds; in comparison, delaying 1 us outside of the locked region
> is far too much and compromises the goal of benchmarking the lock and
> unlock performance. For example, with a ~100 ns lock/unlock pair, the
> 1 us delay would account for over 90% of each iteration.
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <Ruifeng.Wang@arm.com>
> Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> ---

Acked-by: Jerin Jacob <jerinj@marvell.com>

* Re: [dpdk-dev] [EXT] [PATCH v3 3/6] test/spinlock: get timestamp more precisely
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 3/6] test/spinlock: get timestamp more precisely Gavin Hu
@ 2018-12-27  7:27   ` Jerin Jacob Kollanukkaran
  2019-01-03 18:22     ` Honnappa Nagarahalli
  0 siblings, 1 reply; 28+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2018-12-27  7:27 UTC (permalink / raw)
  To: gavin.hu, dev
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas,
	hemant.agrawal, stephen, Honnappa.Nagarahalli

On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> -------------------------------------------------------------------
> ---
> To precisely benchmark the spinlock performance, use the precise
> version of getting timestamps, which enforces that the timestamps are
> obtained at the expected places.
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> ---
>  test/test/test_spinlock.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/test/test/test_spinlock.c b/test/test/test_spinlock.c
> index 6795195ae..648474833 100644
> --- a/test/test/test_spinlock.c
> +++ b/test/test/test_spinlock.c
> @@ -113,14 +113,14 @@ load_loop_fn(void *func_param)
>  	if (lcore != rte_get_master_lcore())
>  		while (rte_atomic32_read(&synchro) == 0);
>  
> -	begin = rte_get_timer_cycles();
> +	begin = rte_rdtsc_precise();
>  	while (time_diff < hz * TIME_MS / 1000) {
>  		if (use_lock)
>  			rte_spinlock_lock(&lk);
>  		lcount++;
>  		if (use_lock)
>  			rte_spinlock_unlock(&lk);
> -		time_diff = rte_get_timer_cycles() - begin;
> +		time_diff = rte_rdtsc_precise() - begin;

Since the _precise() versions add a full barrier, time_diff would also
include the delay of a full barrier. As mentioned in the commit
message, do you see rte_get_timer_cycles() called in unexpected places?
Is there a difference in time_diff, apart from the delay of rte_mb(),
when using the _precise() version?





>  	}
>  	lock_count[lcore] = lcount;
>  	return 0;

* Re: [dpdk-dev] [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins
  2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
@ 2018-12-27  7:42   ` Jerin Jacob Kollanukkaran
  2018-12-27  9:02     ` Gavin Hu (Arm Technology China)
  2019-01-11 13:52     ` Gavin Hu (Arm Technology China)
  0 siblings, 2 replies; 28+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2018-12-27  7:42 UTC (permalink / raw)
  To: gavin.hu, dev
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas,
	hemant.agrawal, stephen, Honnappa.Nagarahalli

On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
-------------------------------------------------------------------
> ---
> The __sync builtin based implementation generates full memory barriers
> ('dmb ish') on Arm platforms. Use C11 atomic builtins instead to
> generate one-way barriers.
> 
> Here is the assembly code of __sync_compare_and_swap builtin.
> __sync_bool_compare_and_swap(dst, exp, src);
>    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
>    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
>    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
>    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
>    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
>    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
>    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> <rte_atomic16_cmpset+52>  // b.any
>    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
>    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
> <rte_atomic16_cmpset+32>
>    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
>    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq =
> none
> 
> The benchmarking results showed a 3X performance gain on Cavium
> ThunderX2, 13% on Qualcomm Falkor and 3.7% on the quad-core A72
> Marvell MACCHIATObin. Here is an example test result on TX2:
> 
> *** spinlock_autotest without this patch ***
> Core [123] Cost Time = 639822 us
> Core [124] Cost Time = 633253 us
> Core [125] Cost Time = 646030 us
> Core [126] Cost Time = 643189 us
> Core [127] Cost Time = 647039 us
> Total Cost Time = 95433298 us
> 
> *** spinlock_autotest with this patch ***
> Core [123] Cost Time = 163615 us
> Core [124] Cost Time = 166471 us
> Core [125] Cost Time = 189044 us
> Core [126] Cost Time = 195745 us
> Core [127] Cost Time = 78423 us
> Total Cost Time = 27339656 us
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> Reviewed-by: Steve Capper <Steve.Capper@arm.com>
> ---
>  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> b/lib/librte_eal/common/include/generic/rte_spinlock.h
> index c4c3fc31e..87ae7a4f1 100644
> --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
>  static inline void
>  rte_spinlock_lock(rte_spinlock_t *sl)
>  {
> -	while (__sync_lock_test_and_set(&sl->locked, 1))
> -		while(sl->locked)
> +	int exp = 0;
> +
> +	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {

How about removing the explicit exp = 0 and changing to
__atomic_test_and_set(flag, __ATOMIC_ACQUIRE);

i.e.
while (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE))
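
Spelled out, that lock loop would be (a sketch of the suggestion; note
that __atomic_test_and_set operates on a single byte, so the int-typed
'locked' field would effectively be used as a byte flag):

	static inline void
	rte_spinlock_lock(rte_spinlock_t *sl)
	{
		while (__atomic_test_and_set(&sl->locked, __ATOMIC_ACQUIRE)) {
			/* spin read-only until the lock looks free */
			while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
				rte_pause();
		}
	}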



> +		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
>  			rte_pause();
> +		exp = 0;

We can remove exp = 0 with the above scheme.

> +	}
>  }
>  #endif
>  
> @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
>  static inline void
>  rte_spinlock_unlock (rte_spinlock_t *sl)
>  {
> -	__sync_lock_release(&sl->locked);
> +	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
 }
>  #endif
>  
> @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
>  static inline int
>  rte_spinlock_trylock (rte_spinlock_t *sl)
>  {
> -	return __sync_lock_test_and_set(&sl->locked,1) == 0;
> +	int exp = 0;
> +	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> +				0, /* disallow spurious failure */
> +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);

Same here, to remove the explicit exp:

return (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE) == 0);


>  }
>  #endif
>  
> @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
>   */
>  static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
>  {
> -	return sl->locked;
> +	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);

__ATOMIC_RELAXED would be enough here. Right ?


>  }
>  
>  /**

* Re: [dpdk-dev] [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins
  2018-12-27  7:42   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
@ 2018-12-27  9:02     ` Gavin Hu (Arm Technology China)
  2019-01-03 20:35       ` Honnappa Nagarahalli
  2019-01-11 13:52     ` Gavin Hu (Arm Technology China)
  1 sibling, 1 reply; 28+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-12-27  9:02 UTC (permalink / raw)
  To: Jerin Jacob Kollanukkaran, dev
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas,
	hemant.agrawal, stephen, Honnappa Nagarahalli, nd



> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Sent: Thursday, December 27, 2018 3:42 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> hemant.agrawal@nxp.com; stephen@networkplumber.org; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-way
> barrier builtins
> 
> On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> -------------------------------------------------------------------
> > ---
> > The __sync builtin based implementation generates full memory barriers
> > ('dmb ish') on Arm platforms. Use C11 atomic builtins instead to
> > generate one-way barriers.
> >
> > Here is the assembly code of __sync_compare_and_swap builtin.
> > __sync_bool_compare_and_swap(dst, exp, src);
> >    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
> >    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
> >    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
> >    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
> >    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
> >    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
> >    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> > <rte_atomic16_cmpset+52>  // b.any
> >    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
> >    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
> > <rte_atomic16_cmpset+32>
> >    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
> >    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq =
> > none
> >
> > The benchmarking results showed a 3X performance gain on Cavium
> > ThunderX2, 13% on Qualcomm Falkor and 3.7% on the quad-core A72
> > Marvell MACCHIATObin. Here is an example test result on TX2:
> >
> > *** spinlock_autotest without this patch ***
> > Core [123] Cost Time = 639822 us
> > Core [124] Cost Time = 633253 us
> > Core [125] Cost Time = 646030 us
> > Core [126] Cost Time = 643189 us
> > Core [127] Cost Time = 647039 us
> > Total Cost Time = 95433298 us
> >
> > *** spinlock_autotest with this patch ***
> > Core [123] Cost Time = 163615 us
> > Core [124] Cost Time = 166471 us
> > Core [125] Cost Time = 189044 us
> > Core [126] Cost Time = 195745 us
> > Core [127] Cost Time = 78423 us
> > Total Cost Time = 27339656 us
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > Reviewed-by: Steve Capper <Steve.Capper@arm.com>
> > ---
> >  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > +++++++++++++-----
> >  1 file changed, 13 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > index c4c3fc31e..87ae7a4f1 100644
> > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
> >  static inline void
> >  rte_spinlock_lock(rte_spinlock_t *sl)
> >  {
> > -	while (__sync_lock_test_and_set(&sl->locked, 1))
> > -		while(sl->locked)
> > +	int exp = 0;
> > +
> > +	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> {
> 
> How about removing the explicit exp = 0 and changing to
> __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);
> 
> i.e.
> while (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE))
> 
Will do it in v4.
> 
> > +		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> >  			rte_pause();
> > +		exp = 0;
> 
> We can remove exp = 0 with the above scheme.
> 
> > +	}
> >  }
> >  #endif
> >
> > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
> >  static inline void
> >  rte_spinlock_unlock (rte_spinlock_t *sl)
> >  {
> > -	__sync_lock_release(&sl->locked);
> > +	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
>  }
> >  #endif
> >
> > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
> >  static inline int
> >  rte_spinlock_trylock (rte_spinlock_t *sl)
> >  {
> > -	return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > +	int exp = 0;
> > +	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > +				0, /* disallow spurious failure */
> > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> 
> Same here, to remove the explicit exp:
> 
> return (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE) == 0);

Will do it in v4.

> >  }
> >  #endif
> >
> > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> >   */
> >  static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
> >  {
> > -	return sl->locked;
> > +	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
> 
> __ATOMIC_RELAXED would be enough here. Right ?
Yes, it is enough for current DPDK uses; it is used for testing and
assertions only.

For general applicability, we chose acquire because we were concerned
about the case where it is used for reading protected data while the
lock is not taken by anybody. In this use case, acquire will properly
see all updates from before the lock was released, but this is still
dangerous, as someone else might have taken the lock and changed the
data in the meantime.

Anyway, I will use relaxed in v4, as the above scenario is not
recommended and not present in DPDK.
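
The pattern of concern, sketched (hypothetical usage;
read_protected_data() is a placeholder, not a DPDK API):

	if (!rte_spinlock_is_locked(&sl))
		read_protected_data(); /* races with a new lock holder */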

> 
> >  }
> >
> >  /**

* Re: [dpdk-dev] [EXT] [PATCH v3 6/6] spinlock: ticket based to improve fairness
  2018-12-27  6:58   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
@ 2018-12-27 10:05     ` Gavin Hu (Arm Technology China)
  2018-12-27 12:08       ` Jerin Jacob Kollanukkaran
  0 siblings, 1 reply; 28+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-12-27 10:05 UTC (permalink / raw)
  To: Jerin Jacob Kollanukkaran, dev
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas,
	Joyce Kong (Arm Technology China),
	hemant.agrawal, stephen, Honnappa Nagarahalli, nd



> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Sent: Thursday, December 27, 2018 2:58 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net; Joyce
> Kong (Arm Technology China) <Joyce.Kong@arm.com>;
> hemant.agrawal@nxp.com; stephen@networkplumber.org; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [EXT] [PATCH v3 6/6] spinlock: ticket based to improve fairness
> 
> On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > -------------------------------------------------------------------
> > ---
> > From: Joyce Kong <joyce.kong@arm.com>
> >
> > The old implementation is unfair, some threads may take locks
> > aggressively
> 
> I think one issue here is that x86 and ppc follow the traditional
> spinlock while arm64 will be following the ticket lock for its
> spinlock implementation. This would change application behaviour on
> arm64 compared to x86 and ppc.
> 
> How about having a separate API for the ticket lock? That would give:
> # the application a choice of locking strategy
> # the same application behaviour across all arches.

Ok, will do in v4 to add a new rte_ticket_spinlock API.

> The initial ticket lock implementation can be generic with C11 memory
> primitives; later, arches can optimize it if required.
Yes, later we might optimize it with new instructions.

> > while leaving the other threads starving for long time. As shown in
> > the
> > following test, within same period of time, there are threads taking
> > locks
> > much more times than the others.
> >
> >
> >  #ifdef RTE_FORCE_INTRINSICS
> >  static inline void
> > -rte_spinlock_unlock (rte_spinlock_t *sl)
> > +rte_spinlock_unlock(rte_spinlock_t *sl)
> >  {
> > -	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
> > +	uint16_t i = __atomic_load_n(&sl->s.current, __ATOMIC_RELAXED);
> > +	i++;
> > +	__atomic_store_n(&sl->s.current, i, __ATOMIC_RELAXED);
> 
> Shouldn't we use __ATOMIC_RELEASE here to pair with lock() ?
> 
> 
> >  }
> >  #endif
> >
> > @@ -98,16 +100,19 @@ rte_spinlock_unlock (rte_spinlock_t *sl)
> >   *   1 if the lock is successfully taken; 0 otherwise.
> >   */
> >  static inline int
> > -rte_spinlock_trylock (rte_spinlock_t *sl);
> > +rte_spinlock_trylock(rte_spinlock_t *sl);
> >
> >  #ifdef RTE_FORCE_INTRINSICS
> >  static inline int
> > -rte_spinlock_trylock (rte_spinlock_t *sl)
> > +rte_spinlock_trylock(rte_spinlock_t *sl)
> >  {
> > -	int exp = 0;
> > -	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > -				0, /* disallow spurious failure */
> > -				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> > +	uint16_t me = __atomic_fetch_add(&sl->s.next, 1,
> > __ATOMIC_RELAXED);
> > +	while (__atomic_load_n(&sl->s.current, __ATOMIC_RELAXED) != me)
> > {
> > +		__atomic_sub_fetch(&sl->s.next, 1, __ATOMIC_RELAXED);
> > +		return 0;
> > +	}
> > +
> 
> Don't we need a CAS here?
> There is a similar implementation here:
> https://git.linaro.org/lng/odp.git/tree/platform/linux-generic/include/odp/api/plat/ticketlock_inlines.h

That is correct, a CAS is required.
Assume T2 takes precedence in requesting a ticket, e.g. T2.next = 2,
but has not taken the lock yet, e.g. T2.current = 1; then T1 calls
trylock, so T1.next = 3; then T2 takes the lock, so T2.next =
T2.current = 2; T1 falls back, so T1.next = 2. Both next values are 2,
which is not correct: the next current will be 3, and T1 will never get
the lock.
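
A CAS-based trylock along the lines of the ODP code would take a ticket
only when the lock can be taken immediately (a sketch under the ticket
lock layout above, not the final v4 code; type and function names are
illustrative):

	static inline int
	rte_ticketlock_trylock(rte_ticketlock_t *tl)
	{
		uint16_t me = __atomic_load_n(&tl->s.next, __ATOMIC_RELAXED);

		/* lock held or contended: do not take a ticket at all */
		if (__atomic_load_n(&tl->s.current, __ATOMIC_RELAXED) != me)
			return 0;

		/* take the ticket only if nobody else took it meanwhile */
		return __atomic_compare_exchange_n(&tl->s.next, &me,
				(uint16_t)(me + 1), 0,
				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
	}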
> 
> > +	return 1;
> >  }
> >  #endif
> >
> >

* Re: [dpdk-dev] [EXT] [PATCH v3 6/6] spinlock: ticket based to improve fairness
  2018-12-27 10:05     ` Gavin Hu (Arm Technology China)
@ 2018-12-27 12:08       ` Jerin Jacob Kollanukkaran
  2018-12-27 23:41         ` Stephen Hemminger
  0 siblings, 1 reply; 28+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2018-12-27 12:08 UTC (permalink / raw)
  To: Gavin.Hu, dev
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas,
	Joyce.Kong, hemant.agrawal, stephen, Honnappa.Nagarahalli

On Thu, 2018-12-27 at 10:05 +0000, Gavin Hu (Arm Technology China)
wrote:
> > -----Original Message-----
> > From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> > Sent: Thursday, December 27, 2018 2:58 PM
> > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; 
> > dev@dpdk.org
> > Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> > <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> > Joyce
> > Kong (Arm Technology China) <Joyce.Kong@arm.com>;
> > hemant.agrawal@nxp.com; stephen@networkplumber.org; Honnappa
> > Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Subject: Re: [EXT] [PATCH v3 6/6] spinlock: ticket based to improve
> > fairness
> > 
> > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > > ---------------------------------------------------------------
> > > ----
> > > ---
> > > From: Joyce Kong <joyce.kong@arm.com>
> > > 
> > > The old implementation is unfair, some threads may take locks
> > > aggressively
> > 
> > I think one issue here is that x86 and ppc follow the traditional
> > spinlock while arm64 will be following the ticket lock for its
> > spinlock implementation. This would change application behaviour on
> > arm64 compared to x86 and ppc.
> > 
> > How about having a separate API for the ticket lock? That would give:
> > # the application a choice of locking strategy
> > # the same application behaviour across all arches.
> 
> Ok, will do in v4 to add a new rte_ticket_spinlock API.

I would prefer rte_ticketlock_[lock/unlock/trylock/is_locked] name
instead of rte_ticket_spinlock_lock etc to reduce the length of the
API.
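
i.e. an API surface along these lines (hypothetical sketch based on the
names above):

	void rte_ticketlock_init(rte_ticketlock_t *tl);
	void rte_ticketlock_lock(rte_ticketlock_t *tl);
	void rte_ticketlock_unlock(rte_ticketlock_t *tl);
	int  rte_ticketlock_trylock(rte_ticketlock_t *tl);
	int  rte_ticketlock_is_locked(rte_ticketlock_t *tl);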





* Re: [dpdk-dev] [EXT] [PATCH v3 6/6] spinlock: ticket based to improve fairness
  2018-12-27 12:08       ` Jerin Jacob Kollanukkaran
@ 2018-12-27 23:41         ` Stephen Hemminger
  2018-12-28  4:39           ` Jerin Jacob Kollanukkaran
  0 siblings, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2018-12-27 23:41 UTC (permalink / raw)
  To: Jerin Jacob Kollanukkaran
  Cc: Gavin.Hu, dev, david.marchand, chaozhu, nd, bruce.richardson,
	thomas, Joyce.Kong, hemant.agrawal, Honnappa.Nagarahalli

On Thu, 27 Dec 2018 12:08:26 +0000
Jerin Jacob Kollanukkaran <jerinj@marvell.com> wrote:

> On Thu, 2018-12-27 at 10:05 +0000, Gavin Hu (Arm Technology China)
> wrote:
> > > -----Original Message-----
> > > From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> > > Sent: Thursday, December 27, 2018 2:58 PM
> > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; 
> > > dev@dpdk.org
> > > Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> > > <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> > > Joyce
> > > Kong (Arm Technology China) <Joyce.Kong@arm.com>;
> > > hemant.agrawal@nxp.com; stephen@networkplumber.org; Honnappa
> > > Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Subject: Re: [EXT] [PATCH v3 6/6] spinlock: ticket based to improve
> > > fairness
> > > 
> > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:  
> > > > ---------------------------------------------------------------
> > > > ----
> > > > ---
> > > > From: Joyce Kong <joyce.kong@arm.com>
> > > > 
> > > > The old implementation is unfair, some threads may take locks
> > > > aggressively  
> > > 
> > > I think one issue here is that x86 and ppc follow the traditional
> > > spinlock while arm64 will be following the ticket lock for its
> > > spinlock implementation. This would change application behaviour
> > > on arm64 compared to x86 and ppc.
> > > 
> > > How about having a separate API for the ticket lock? That would
> > > give:
> > > # the application a choice of locking strategy
> > > # the same application behaviour across all arches.
> > 
> > Ok, will do in v4 to add a new rte_ticket_spinlock API.
> 
> I would prefer rte_ticketlock_[lock/unlock/trylock/is_locked] name
> instead of rte_ticket_spinlock_lock etc to reduce the length of the
> API.


NAK to adding a new API for this.

I want the best possible locks for all applications and all architectures.
These should be called spinlock so there is no requirement for applications
to change to get better performance. Why not just implement the best algorithm
across the board? Yes, this means collaboration, or working on the other guy's
architecture.

* Re: [dpdk-dev] [EXT] [PATCH v3 6/6] spinlock: ticket based to improve fairness
  2018-12-27 23:41         ` Stephen Hemminger
@ 2018-12-28  4:39           ` Jerin Jacob Kollanukkaran
  2018-12-28 10:04             ` Gavin Hu (Arm Technology China)
  2019-01-03 18:35             ` Honnappa Nagarahalli
  0 siblings, 2 replies; 28+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2018-12-28  4:39 UTC (permalink / raw)
  To: stephen
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas, dev,
	Joyce.Kong, hemant.agrawal, Honnappa.Nagarahalli, Gavin.Hu

On Thu, 2018-12-27 at 15:41 -0800, Stephen Hemminger wrote:
> On Thu, 27 Dec 2018 12:08:26 +0000
> Jerin Jacob Kollanukkaran <jerinj@marvell.com> wrote:
> 
> > On Thu, 2018-12-27 at 10:05 +0000, Gavin Hu (Arm Technology China)
> > wrote:
> > > > -----Original Message-----
> > > > From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> > > > Sent: Thursday, December 27, 2018 2:58 PM
> > > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; 
> > > > dev@dpdk.org
> > > > Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> > > > <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> > > > Joyce
> > > > Kong (Arm Technology China) <Joyce.Kong@arm.com>;
> > > > hemant.agrawal@nxp.com; stephen@networkplumber.org; Honnappa
> > > > Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > > Subject: Re: [EXT] [PATCH v3 6/6] spinlock: ticket based to
> > > > improve
> > > > fairness
> > > > 
> > > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:  
> > > > > -----------------------------------------------------------
> > > > > ----
> > > > > ----
> > > > > ---
> > > > > From: Joyce Kong <joyce.kong@arm.com>
> > > > > 
> > > > > The old implementation is unfair, some threads may take locks
> > > > > aggressively  
> > > > 
> > > > I think one issue here is that x86 and ppc follow the
> > > > traditional spinlock while arm64 will be following the ticket
> > > > lock for its spinlock implementation. This would change
> > > > application behaviour on arm64 compared to x86 and ppc.
> > > > 
> > > > How about having a separate API for the ticket lock? That would
> > > > give:
> > > > # the application a choice of locking strategy
> > > > # the same application behaviour across all arches.
> > > 
> > > Ok, will do in v4 to add a new rte_ticket_spinlock API.
> > 
> > I would prefer rte_ticketlock_[lock/unlock/trylock/is_locked] name
> > instead of rte_ticket_spinlock_lock etc to reduce the length of the
> > API.
> 
> NAK to adding a new API for this.
> 
> I want the best possible locks for all applications and all
> architectures. These should be called spinlock so there is no
> requirement for applications to change to get better performance. Why
> not just implement the best algorithm across the board? Yes, this
> means collaboration, or working on the other guy's architecture.

Then the 6/6 patch needs to be put on hold if every arch needs to make
the ticket lock the default spinlock strategy.

How about the following to make forward progress:
1) Introduce the rte_ticketlock_[lock/unlock/trylock/is_locked] API now
as experimental, with a default implementation.
2) Provide a timeline to switch every arch to an optimized ticketlock
implementation, if needed.
3) Switch the rte_ticketlock_ API over to be the rte_spinlock_ API.
4) Keep the old version of the spinlock as a new API for applications
that do not need fairness between threads and want the lighter-weight
spinlock implementation.

I don't want arm64 to behave differently than other arch(s).









* Re: [dpdk-dev] [EXT] [PATCH v3 6/6] spinlock: ticket based to improve fairness
  2018-12-28  4:39           ` Jerin Jacob Kollanukkaran
@ 2018-12-28 10:04             ` Gavin Hu (Arm Technology China)
  2019-01-03 18:35             ` Honnappa Nagarahalli
  1 sibling, 0 replies; 28+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-12-28 10:04 UTC (permalink / raw)
  To: Jerin Jacob Kollanukkaran, stephen
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas, dev,
	Joyce Kong (Arm Technology China),
	hemant.agrawal, Honnappa Nagarahalli, nd



> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Sent: Friday, December 28, 2018 12:39 PM
> To: stephen@networkplumber.org
> Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> dev@dpdk.org; Joyce Kong (Arm Technology China) <Joyce.Kong@arm.com>;
> hemant.agrawal@nxp.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>
> Subject: Re: [EXT] [PATCH v3 6/6] spinlock: ticket based to improve fairness
> 
> On Thu, 2018-12-27 at 15:41 -0800, Stephen Hemminger wrote:
> > On Thu, 27 Dec 2018 12:08:26 +0000
> > Jerin Jacob Kollanukkaran <jerinj@marvell.com> wrote:
> >
> > > On Thu, 2018-12-27 at 10:05 +0000, Gavin Hu (Arm Technology China)
> > > wrote:
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> > > > > Sent: Thursday, December 27, 2018 2:58 PM
> > > > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> > > > > dev@dpdk.org
> > > > > Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> > > > > <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> > > > > Joyce
> > > > > Kong (Arm Technology China) <Joyce.Kong@arm.com>;
> > > > > hemant.agrawal@nxp.com; stephen@networkplumber.org; Honnappa
> > > > > Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > > > Subject: Re: [EXT] [PATCH v3 6/6] spinlock: ticket based to
> > > > > improve
> > > > > fairness
> > > > >
> > > > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > > > > > -----------------------------------------------------------
> > > > > > ----
> > > > > > ----
> > > > > > ---
> > > > > > From: Joyce Kong <joyce.kong@arm.com>
> > > > > >
> > > > > > The old implementation is unfair, some threads may take locks
> > > > > > aggressively
> > > > >
> > > > > I think, one issue here is x86 and ppc follows traditional
> > > > > spinlock
> > > > > and
> > > > > arm64 will be following ticket lock for spinlock
> > > > > implementation.
> > > > > This would change application behaviour on arm64 compared to
> > > > > x86
> > > > > and
> > > > > ppc.
> > > > >
> > > > > How about having a separate API for ticket lock? That would
> > > > > give,
> > > > > # application choice to use the locking strategy
> > > > > # application behaviour will be same across all arch.
> > > >
> > > > Ok, will do in v4 to have a new named rte_ticket_spinlock API.
> > >
> > > I would prefer rte_ticketlock_[lock/unlock/trylock/is_locked] name
> > > instead of rte_ticket_spinlock_lock etc to reduce the length of the
> > > API.
> >
> > NAK to adding new API for this.
> >
> > I want the best possible locks for all applications and all
> > architectures.
> > These should be called spinlock so there is no requirement for
> > application
> > to change to get better performance. Why not just implement the best
> > algorithm
> > across the board. Yes, this means collaboration or working on the
> > other guys
> > architecture.
> 
> Then 6/6 patch needs to put on hold if every arch needs to make ticket
> lock as default spinlock lock strategy.
> 
> How about following to make forward progress:
> 1) Introduce rte_ticketlock_[lock/unlock/trylock/is_locked] API now as
> experimental with default implementation
> 2) Provide a time line to switch every arch for optimized ticketlock
> implementation if needed.
> 3) Switch rte_ticketlock_ as rte_spinlock_ API.
> 4) Keep old version of spinlock as new API if some application does not
> need fairness between threads at the cost of light weight spinlock
> implementation.

We will rework the patches following this proposal, and we can switch over at some later point.
 
> I don't want arm64 to behave differently than other arch(s).

This is the generic implementation; x86 and ppc can also choose it by setting CONFIG_RTE_FORCE_INTRINSICS=y in the config file.
The arch-specific implementations are instruction-level based; arm will implement those as well, and they are not ticket based.
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v3 3/6] test/spinlock: get timestamp more precisely
  2018-12-27  7:27   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
@ 2019-01-03 18:22     ` Honnappa Nagarahalli
  0 siblings, 0 replies; 28+ messages in thread
From: Honnappa Nagarahalli @ 2019-01-03 18:22 UTC (permalink / raw)
  To: Jerin Jacob Kollanukkaran, Gavin Hu (Arm Technology China), dev
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas,
	hemant.agrawal, stephen, nd

> 
> On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > -------------------------------------------------------------------
> > ---
> > To precisely benchmark the spinlock performance, uses the precise
> > version of getting timestamps, which enforces the timestamps are
> > obtained at the expected places.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > ---
> >  test/test/test_spinlock.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/test/test/test_spinlock.c b/test/test/test_spinlock.c
> > index 6795195ae..648474833 100644
> > --- a/test/test/test_spinlock.c
> > +++ b/test/test/test_spinlock.c
> > @@ -113,14 +113,14 @@ load_loop_fn(void *func_param)
> >  	if (lcore != rte_get_master_lcore())
> >  		while (rte_atomic32_read(&synchro) == 0);
> >
> > -	begin = rte_get_timer_cycles();
> > +	begin = rte_rdtsc_precise();
> >  	while (time_diff < hz * TIME_MS / 1000) {
> >  		if (use_lock)
> >  			rte_spinlock_lock(&lk);
> >  		lcount++;
> >  		if (use_lock)
> >  			rte_spinlock_unlock(&lk);
> > -		time_diff = rte_get_timer_cycles() - begin;
> > +		time_diff = rte_rdtsc_precise() - begin;
> 
> Since _precise() versions add a full barrier, time_diff would include delay of
> a full barrier also. As mentioned in the commit message, Do you see
> rte_get_timer_cycles() called in unexpected places?
> Is there difference in time_diff apart from delay in rte_mb() when using with
> _precise() version?
> 
I do not see the need for this patch in the presence of 4/6 [1] in this series.

[1] http://mails.dpdk.org/archives/dev/2018-December/122110.html
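
For reference, the idea in 4/6 (as I read the cover letter) is to read the
timestamp once per batch of iterations instead of once per iteration, so the
timer (and any barrier) overhead is amortized across the batch. A sketch
only, reusing the variables from the test above and the 10000-loop batch
size mentioned in the cover letter:

#define MAX_LOOP 10000

        begin = rte_rdtsc_precise();
        while (lcount < MAX_LOOP) {
                if (use_lock)
                        rte_spinlock_lock(&lk);
                lcount++;
                if (use_lock)
                        rte_spinlock_unlock(&lk);
        }
        /* only two timestamp reads for the whole batch */
        time_diff = rte_rdtsc_precise() - begin;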

> 
> 
> 
> 
> >  	}
> >  	lock_count[lcore] = lcount;
> >  	return 0;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v3 6/6] spinlock: ticket based to improve fairness
  2018-12-28  4:39           ` Jerin Jacob Kollanukkaran
  2018-12-28 10:04             ` Gavin Hu (Arm Technology China)
@ 2019-01-03 18:35             ` Honnappa Nagarahalli
  2019-01-03 19:53               ` Stephen Hemminger
  1 sibling, 1 reply; 28+ messages in thread
From: Honnappa Nagarahalli @ 2019-01-03 18:35 UTC (permalink / raw)
  To: Jerin Jacob Kollanukkaran, stephen
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas, dev,
	Joyce Kong (Arm Technology China),
	hemant.agrawal, Gavin Hu (Arm Technology China),
	nd

> > > > >
> > > > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > > > > > -----------------------------------------------------------
> > > > > > ----
> > > > > > ----
> > > > > > ---
> > > > > > From: Joyce Kong <joyce.kong@arm.com>
> > > > > >
> > > > > > The old implementation is unfair, some threads may take locks
> > > > > > aggressively
> > > > >
> > > > > I think, one issue here is x86 and ppc follows traditional
> > > > > spinlock and
> > > > > arm64 will be following ticket lock for spinlock implementation.
> > > > > This would change application behaviour on arm64 compared to
> > > > > x86
> > > > > and
> > > > > ppc.
> > > > >
> > > > > How about having a separate API for ticket lock? That would
> > > > > give, # application choice to use the locking strategy #
> > > > > application behaviour will be same across all arch.
> > > >
> > > > Ok, will do in v4 to have a new named rte_ticket_spinlock API.
> > >
> > > I would prefer rte_ticketlock_[lock/unlock/trylock/is_locked] name
> > > instead of rte_ticket_spinlock_lock etc to reduce the length of the
> > > API.
> >
> > NAK to adding new API for this.
> >
> > I want the best possible locks for all applications and all
> > architectures.
> > These should be called spinlock so there is no requirement for
> > application to change to get better performance. Why not just
> > implement the best algorithm across the board. Yes, this means
> > collaboration or working on the other guys architecture.
IMO, the ticket lock should be a separate API. Spin lock (as implemented today) and ticket lock have different characteristics in terms of behavior as well as the resources required. For example, a spin lock might be sufficient for a low-contention use case and requires fewer cache lines. A ticket lock needs more cache lines and might be better for use cases with more contention. The decision to use a spin lock vs a ticket lock should be left to the application.

> 
> Then 6/6 patch needs to put on hold if every arch needs to make ticket lock
> as default spinlock lock strategy.
+1 for doing 6/6 as a separate patch.

> 
> How about following to make forward progress:
> 1) Introduce rte_ticketlock_[lock/unlock/trylock/is_locked] API now as
> experimental with default implementation
> 2) Provide a time line to switch every arch for optimized ticketlock
> implementation if needed.
> 3) Switch rte_ticketlock_ as rte_spinlock_ API.
> 4) Keep old version of spinlock as new API if some application does not need
> fairness between threads at the cost of light weight spinlock
> implementation.
> 
> I don't want arm64 to behave differently than other arch(s).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v3 6/6] spinlock: ticket based to improve fairness
  2019-01-03 18:35             ` Honnappa Nagarahalli
@ 2019-01-03 19:53               ` Stephen Hemminger
  2019-01-04  7:06                 ` Honnappa Nagarahalli
  0 siblings, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2019-01-03 19:53 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Jerin Jacob Kollanukkaran, david.marchand, chaozhu, nd,
	bruce.richardson, thomas, dev, Joyce Kong (Arm Technology China),
	hemant.agrawal, Gavin Hu (Arm Technology China)

On Thu, 3 Jan 2019 18:35:31 +0000
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:

> > > > > >
> > > > > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:  
> > > > > > > -----------------------------------------------------------
> > > > > > > ----
> > > > > > > ----
> > > > > > > ---
> > > > > > > From: Joyce Kong <joyce.kong@arm.com>
> > > > > > >
> > > > > > > The old implementation is unfair, some threads may take locks
> > > > > > > aggressively  
> > > > > >
> > > > > > I think, one issue here is x86 and ppc follows traditional
> > > > > > spinlock and
> > > > > > arm64 will be following ticket lock for spinlock implementation.
> > > > > > This would change application behaviour on arm64 compared to
> > > > > > x86
> > > > > > and
> > > > > > ppc.
> > > > > >
> > > > > > How about having a separate API for ticket lock? That would
> > > > > > give, # application choice to use the locking strategy #
> > > > > > application behaviour will be same across all arch.  
> > > > >
> > > > > Ok, will do in v4 to have a new named rte_ticket_spinlock API.  
> > > >
> > > > I would prefer rte_ticketlock_[lock/unlock/trylock/is_locked] name
> > > > instead of rte_ticket_spinlock_lock etc to reduce the length of the
> > > > API.  
> > >
> > > NAK to adding new API for this.
> > >
> > > I want the best possible locks for all applications and all
> > > architectures.
> > > These should be called spinlock so there is no requirement for
> > > application to change to get better performance. Why not just
> > > implement the best algorithm across the board. Yes, this means
> > > collaboration or working on the other guys architecture.  
> IMO, the ticket lock should be a separate API. Spin lock (as implemented today) and ticket lock have different characteristics in terms of behavior as well as resources required. For ex: spin lock might be sufficient for a low contention use case and it requires less number of cache lines. Ticket lock needs more number of cache lines and might be good for use cases with more contention. The decision to use spin lock vs ticket lock should be left to the application.

The problem is that changing applications is hard. Real-world applications are non-trivial
and large. That is why doing things globally or at the library level is easier.
Do you want to go back and evaluate every lock in VPP, NFF-go, OVS, Tungsten Fabric, ....

Either a config or runtime option seems best to me.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins
  2018-12-27  9:02     ` Gavin Hu (Arm Technology China)
@ 2019-01-03 20:35       ` Honnappa Nagarahalli
  0 siblings, 0 replies; 28+ messages in thread
From: Honnappa Nagarahalli @ 2019-01-03 20:35 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), Jerin Jacob Kollanukkaran, dev
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas,
	hemant.agrawal, stephen, nd

> >
> > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > -------------------------------------------------------------------
> > > ---
> > > The __sync builtin based implementation generates full memory
> > > barriers ('dmb ish') on Arm platforms. Using C11 atomic builtins to
> > > generate one way barriers.
> > >
> > > Here is the assembly code of __sync_compare_and_swap builtin.
> > > __sync_bool_compare_and_swap(dst, exp, src);
> > >    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
> > >    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
> > >    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
> > >    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
> > >    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
> > >    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
> > >    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> > > <rte_atomic16_cmpset+52>  // b.any
> > >    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
> > >    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
> > > <rte_atomic16_cmpset+32>
> > >    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
> > >    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq =
> > > none
> > >
> > > The benchmarking results showed 3X performance gain on Cavium
> > > ThunderX2 and
> > > 13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
                                            ^^^^^^
Typo, should be Falkor

> > > Here is the example test result on TX2:
> > >
> > > *** spinlock_autotest without this patch *** Core [123] Cost Time =
> > > 639822 us Core [124] Cost Time = 633253 us Core [125] Cost Time =
> > > 646030 us Core [126] Cost Time = 643189 us Core [127] Cost Time =
> > > 647039 us Total Cost Time = 95433298 us
> > >
> > > *** spinlock_autotest with this patch *** Core [123] Cost Time =
> > > 163615 us Core [124] Cost Time = 166471 us Core [125] Cost Time =
> > > 189044 us Core [126] Cost Time = 195745 us Core [127] Cost Time =
> > > 78423 us Total Cost Time = 27339656 us
> > >
> > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > > Reviewed-by: Steve Capper <Steve.Capper@arm.com>
> > > ---
> > >  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > > +++++++++++++-----
> > >  1 file changed, 13 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > index c4c3fc31e..87ae7a4f1 100644
> > > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);  static
> > > inline void  rte_spinlock_lock(rte_spinlock_t *sl)  {
> > > -	while (__sync_lock_test_and_set(&sl->locked, 1))
> > > -		while(sl->locked)
> > > +	int exp = 0;
> > > +
> > > +	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> > {
> >
> > How about remove explict exp = 0 and change to
> > __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);
> >
> > i.e
> > while (_atomic_test_and_set(flag, __ATOMIC_ACQUIRE))
> >
> Will do it in v4.
The operand for '__atomic_test_and_set' needs to be a bool or char. This means sl->locked needs to be changed to bool or char. But the API 'rte_spinlock_lock_tm' also uses sl->locked, so the requirements of 'rte_spinlock_lock_tm' need to be considered as well.
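
A minimal illustration of the width issue (an assumed example, per the GCC
documentation for the builtin):

        /* __atomic_test_and_set operates on a single byte, so with a
         * 32-bit lock word only the low byte would be touched */
        uint32_t locked = 0;
        __atomic_test_and_set(&locked, __ATOMIC_ACQUIRE);
        /* GCC documents the builtin for bool/char operands only; with
         * other types "only part of the value may be set" */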

> >
> > > +		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> > >  			rte_pause();
> > > +		exp = 0;
> >
> > We can remove exp = 0 with above scheme.
> >
> > > +	}
> > >  }
> > >  #endif
> > >
> > > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);  static
> > > inline void  rte_spinlock_unlock (rte_spinlock_t *sl)  {
> > > -	__sync_lock_release(&sl->locked);
> > > +	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
> >  }
> > >  #endif
> > >
> > > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
> > > static inline int  rte_spinlock_trylock (rte_spinlock_t *sl)  {
> > > -	return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > > +	int exp = 0;
> > > +	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > > +				0, /* disallow spurious failure */
> > > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> >
> > Here to remove explicit exp.
> >
> > return (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE) == 0)
> 
> Will do it in v4.
> 
> > >  }
> > >  #endif
> > >
> > > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> > >   */
> > >  static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)  {
> > > -	return sl->locked;
> > > +	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
> >
> > __ATOMIC_RELAXED would be enough here. Right ?
> Yes, it is enough for current DPDK uses, where it is used for testing and
> assertions only.
> 
> For general applicability, we chose acquire because we were concerned
> about the case where it is used for reading protected data while the lock
> is not taken by anybody.
> In that use case, acquire will properly see all updates from before the
> lock was released, but this is still dangerous, as in the meantime someone
> else might have taken the lock and changed the data.
> 
> Anyway, I will set relaxed in v4, as the above scenario is not recommended
> and not present in DPDK.
IMO, once the API is provided, the application can make use of it in any way it wants. One use case I can think of is as follows:

if (rte_spinlock_is_locked(&sl) == true) {
	access_shared_data();
}

access_shared_data() can get hoisted/executed speculatively before the 'if' statement. This needs to be prevented, hence we need 'acquire' semantics.
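
An illustrative contrast, using the builtins from this patch:

        /* acquire: the load is a one-way barrier, so the reads inside
         * access_shared_data() cannot be hoisted above it */
        if (__atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE))
                access_shared_data();

        /* relaxed: the CPU/compiler may perform the reads inside
         * access_shared_data() speculatively, before the load of
         * sl->locked completes */
        if (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
                access_shared_data();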

> 
> >
> > >  }
> > >
> > >  /**

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v3 6/6] spinlock: ticket based to improve fairness
  2019-01-03 19:53               ` Stephen Hemminger
@ 2019-01-04  7:06                 ` Honnappa Nagarahalli
  0 siblings, 0 replies; 28+ messages in thread
From: Honnappa Nagarahalli @ 2019-01-04  7:06 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Jerin Jacob Kollanukkaran, david.marchand, chaozhu, nd,
	bruce.richardson, thomas, dev, Joyce Kong (Arm Technology China),
	hemant.agrawal, Gavin Hu (Arm Technology China),
	nd

> 
> On Thu, 3 Jan 2019 18:35:31 +0000
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> 
> > > > > > >
> > > > > > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > > > > > > > ----------------------------------------------------------
> > > > > > > > -
> > > > > > > > ----
> > > > > > > > ----
> > > > > > > > ---
> > > > > > > > From: Joyce Kong <joyce.kong@arm.com>
> > > > > > > >
> > > > > > > > The old implementation is unfair, some threads may take
> > > > > > > > locks aggressively
> > > > > > >
> > > > > > > I think, one issue here is x86 and ppc follows traditional
> > > > > > > spinlock and
> > > > > > > arm64 will be following ticket lock for spinlock implementation.
> > > > > > > This would change application behaviour on arm64 compared to
> > > > > > > x86
> > > > > > > and
> > > > > > > ppc.
> > > > > > >
> > > > > > > How about having a separate API for ticket lock? That would
> > > > > > > give, # application choice to use the locking strategy #
> > > > > > > application behaviour will be same across all arch.
> > > > > >
> > > > > > Ok, will do in v4 to have a new named rte_ticket_spinlock API.
> > > > >
> > > > > I would prefer rte_ticketlock_[lock/unlock/trylock/is_locked]
> > > > > name instead of rte_ticket_spinlock_lock etc to reduce the
> > > > > length of the API.
> > > >
> > > > NAK to adding new API for this.
> > > >
> > > > I want the best possible locks for all applications and all
> > > > architectures.
> > > > These should be called spinlock so there is no requirement for
> > > > application to change to get better performance. Why not just
> > > > implement the best algorithm across the board. Yes, this means
> > > > collaboration or working on the other guys architecture.
> > IMO, the ticket lock should be a separate API. Spin lock (as implemented
> today) and ticket lock have different characteristics in terms of behavior as
> well as resources required. For ex: spin lock might be sufficient for a low
> contention use case and it requires less number of cache lines. Ticket lock
> needs more number of cache lines and might be good for use cases with more
> contention. The decision to use spin lock vs ticket lock should be left to the
> application.
> 
> The problem is that changing applications is hard. Real world applications are
> non-trivial and large. That is while doing things global or at the library level
> are easier.
> Do you want to go back and evaluate every lock in VPP, NFF-go, OVS,
> Tungsten Fabric, ....
> 
> Either a config or runtime option seems best to me.
IMO, we should provide both APIs and a config to use ticket lock as the implementation for spin lock. (similar to Jerin's proposal?)
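
A sketch of what that config mapping could look like (the config name and
mapping below are purely hypothetical, to illustrate the proposal, not an
agreed design):

        /* Both APIs are always available; a build-time option remaps the
         * spinlock API onto the ticketlock implementation so existing
         * applications gain fairness without code changes.
         * RTE_USE_TICKETLOCK_AS_SPINLOCK is an assumed name. */
        #ifdef RTE_USE_TICKETLOCK_AS_SPINLOCK
        #define rte_spinlock_t           rte_ticketlock_t
        #define rte_spinlock_lock(sl)    rte_ticketlock_lock(sl)
        #define rte_spinlock_unlock(sl)  rte_ticketlock_unlock(sl)
        #define rte_spinlock_trylock(sl) rte_ticketlock_trylock(sl)
        #endif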

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins
  2018-12-27  7:42   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
  2018-12-27  9:02     ` Gavin Hu (Arm Technology China)
@ 2019-01-11 13:52     ` Gavin Hu (Arm Technology China)
  2019-01-14  5:54       ` Honnappa Nagarahalli
  1 sibling, 1 reply; 28+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-01-11 13:52 UTC (permalink / raw)
  To: jerinj, dev
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas,
	hemant.agrawal, stephen, Honnappa Nagarahalli,
	Joyce Kong (Arm Technology China)



> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Sent: Thursday, December 27, 2018 3:42 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> dev@dpdk.org
> Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> hemant.agrawal@nxp.com; stephen@networkplumber.org; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-
> way barrier builtins
> 
> On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> -------------------------------------------------------------------
> > ---
> > The __sync builtin based implementation generates full memory
> > barriers
> > ('dmb ish') on Arm platforms. Using C11 atomic builtins to generate
> > one way
> > barriers.
> >
> > Here is the assembly code of __sync_compare_and_swap builtin.
> > __sync_bool_compare_and_swap(dst, exp, src);
> >    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
> >    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
> >    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
> >    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
> >    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
> >    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
> >    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> > <rte_atomic16_cmpset+52>  // b.any
> >    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
> >    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
> > <rte_atomic16_cmpset+32>
> >    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
> >    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq =
> > none
> >
> > The benchmarking results showed 3X performance gain on Cavium
> > ThunderX2 and
> > 13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
> > Here is the example test result on TX2:
> >
> > *** spinlock_autotest without this patch ***
> > Core [123] Cost Time = 639822 us
> > Core [124] Cost Time = 633253 us
> > Core [125] Cost Time = 646030 us
> > Core [126] Cost Time = 643189 us
> > Core [127] Cost Time = 647039 us
> > Total Cost Time = 95433298 us
> >
> > *** spinlock_autotest with this patch ***
> > Core [123] Cost Time = 163615 us
> > Core [124] Cost Time = 166471 us
> > Core [125] Cost Time = 189044 us
> > Core [126] Cost Time = 195745 us
> > Core [127] Cost Time = 78423 us
> > Total Cost Time = 27339656 us
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > Reviewed-by: Steve Capper <Steve.Capper@arm.com>
> > ---
> >  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > +++++++++++++-----
> >  1 file changed, 13 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > index c4c3fc31e..87ae7a4f1 100644
> > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
> >  static inline void
> >  rte_spinlock_lock(rte_spinlock_t *sl)
> >  {
> > -	while (__sync_lock_test_and_set(&sl->locked, 1))
> > -		while(sl->locked)
> > +	int exp = 0;
> > +
> > +	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> {
> 
> How about remove explict exp = 0 and change to
> __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);

Yes, __atomic_test_and_set means simpler and better code, but __atomic_test_and_set takes its first argument as a pointer to type bool or char, while in our case sl->locked is of type uint32.
We can force it to uint8, or just pass in the 32-bit pointer; only one byte/bit is really used in this case. Is that ok?

"It should be only used for operands of type bool or char. For other types only part of the value may be set."
https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/_005f_005fatomic-Builtins.html

From a performance perspective, in our testing the performance was very close to the __atomic version.

> 
> i.e
> while (_atomic_test_and_set(flag, __ATOMIC_ACQUIRE))
> 
> 
> 
> > +		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> >  			rte_pause();
> > +		exp = 0;
> 
> We can remove exp = 0 with above scheme.
> 
> > +	}
> >  }
> >  #endif
> >
> > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
> >  static inline void
> >  rte_spinlock_unlock (rte_spinlock_t *sl)
> >  {
> > -	__sync_lock_release(&sl->locked);
> > +	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
>  }
> >  #endif
> >
> > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
> >  static inline int
> >  rte_spinlock_trylock (rte_spinlock_t *sl)
> >  {
> > -	return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > +	int exp = 0;
> > +	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > +				0, /* disallow spurious failure */
> > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> 
> Here to remove explicit exp.
> 
> return (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE) == 0)
> 
> 
> >  }
> >  #endif
> >
> > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> >   */
> >  static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
> >  {
> > -	return sl->locked;
> > +	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
> 
> __ATOMIC_RELAXED would be enough here. Right ?
> 
> 
> >  }
> >
> >  /**

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins
  2019-01-11 13:52     ` Gavin Hu (Arm Technology China)
@ 2019-01-14  5:54       ` Honnappa Nagarahalli
  2019-01-14  7:39         ` Jerin Jacob Kollanukkaran
  2019-01-14  7:57         ` Gavin Hu (Arm Technology China)
  0 siblings, 2 replies; 28+ messages in thread
From: Honnappa Nagarahalli @ 2019-01-14  5:54 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), jerinj, dev
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas,
	hemant.agrawal, stephen, Joyce Kong (Arm Technology China),
	nd

> >
> > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > -------------------------------------------------------------------
> > > ---
> > > The __sync builtin based implementation generates full memory
> > > barriers ('dmb ish') on Arm platforms. Using C11 atomic builtins to
> > > generate one way barriers.
> > >
> > > Here is the assembly code of __sync_compare_and_swap builtin.
> > > __sync_bool_compare_and_swap(dst, exp, src);
> > >    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
> > >    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
> > >    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
> > >    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
> > >    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
> > >    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
> > >    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> > > <rte_atomic16_cmpset+52>  // b.any
> > >    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
> > >    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
> > > <rte_atomic16_cmpset+32>
> > >    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
> > >    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq =
> > > none
> > >
> > > The benchmarking results showed 3X performance gain on Cavium
> > > ThunderX2 and
> > > 13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
> > > Here is the example test result on TX2:
> > >
> > > *** spinlock_autotest without this patch *** Core [123] Cost Time =
> > > 639822 us Core [124] Cost Time = 633253 us Core [125] Cost Time =
> > > 646030 us Core [126] Cost Time = 643189 us Core [127] Cost Time =
> > > 647039 us Total Cost Time = 95433298 us
> > >
> > > *** spinlock_autotest with this patch *** Core [123] Cost Time =
> > > 163615 us Core [124] Cost Time = 166471 us Core [125] Cost Time =
> > > 189044 us Core [126] Cost Time = 195745 us Core [127] Cost Time =
> > > 78423 us Total Cost Time = 27339656 us
> > >
> > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > > Reviewed-by: Steve Capper <Steve.Capper@arm.com>
> > > ---
> > >  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > > +++++++++++++-----
> > >  1 file changed, 13 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > index c4c3fc31e..87ae7a4f1 100644
> > > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);  static
> > > inline void  rte_spinlock_lock(rte_spinlock_t *sl)  {
> > > -	while (__sync_lock_test_and_set(&sl->locked, 1))
> > > -		while(sl->locked)
> > > +	int exp = 0;
> > > +
> > > +	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> > {
> >
> > How about remove explict exp = 0 and change to
> > __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);
> 
> Yes, __atomic_test_and_set means simpler code and better, but
> __atomic_test_and_set takes the first argument as a pointer to type bool or
> char, in our case, sl->locked is of type uint32.
> We can force it to uint8, or just pass in the 32bit pointer, only one byte/bit is
> really used in this case, is that ok?
> 
> "It should be only used for operands of type bool or char. For other types only
> part of the value may be set."
> https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/_005f_005fatomic-
> Builtins.html
> 
> From performance perspective, in our testing, the performance was very close,
> compared to __atomic.
If performance is close, I suggest we go with the existing patch. Changing sl->locked to bool/char would be an ABI change and would affect the x86 TM-based implementation as well.
Jerin, what do you think?

> 
> >
> > i.e
> > while (_atomic_test_and_set(flag, __ATOMIC_ACQUIRE))
> >
> >
> >
> > > +		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> > >  			rte_pause();
> > > +		exp = 0;
> >
> > We can remove exp = 0 with above scheme.
> >
> > > +	}
> > >  }
> > >  #endif
> > >
> > > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);  static
> > > inline void  rte_spinlock_unlock (rte_spinlock_t *sl)  {
> > > -	__sync_lock_release(&sl->locked);
> > > +	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
> >  }
> > >  #endif
> > >
> > > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
> > > static inline int  rte_spinlock_trylock (rte_spinlock_t *sl)  {
> > > -	return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > > +	int exp = 0;
> > > +	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > > +				0, /* disallow spurious failure */
> > > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> >
> > Here to remove explicit exp.
> >
> > return (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE) == 0)
> >
> >
> > >  }
> > >  #endif
> > >
> > > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> > >   */
> > >  static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)  {
> > > -	return sl->locked;
> > > +	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
> >
> > __ATOMIC_RELAXED would be enough here. Right ?
> >
> >
> > >  }
> > >
> > >  /**

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins
  2019-01-14  5:54       ` Honnappa Nagarahalli
@ 2019-01-14  7:39         ` Jerin Jacob Kollanukkaran
  2019-01-14 17:08           ` Gavin Hu (Arm Technology China)
  2019-01-14  7:57         ` Gavin Hu (Arm Technology China)
  1 sibling, 1 reply; 28+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-01-14  7:39 UTC (permalink / raw)
  To: Honnappa.Nagarahalli, Gavin.Hu, dev
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas,
	Joyce.Kong, hemant.agrawal, stephen

On Mon, 2019-01-14 at 05:54 +0000, Honnappa Nagarahalli wrote:
> > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > > ---------------------------------------------------------------
> > > ----
> > > > ---
> > > > The __sync builtin based implementation generates full memory
> > > > barriers ('dmb ish') on Arm platforms. Using C11 atomic
> > > > builtins to
> > > > generate one way barriers.
> > > > 
> > > > Here is the assembly code of __sync_compare_and_swap builtin.
> > > > __sync_bool_compare_and_swap(dst, exp, src);
> > > >    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
> > > >    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp,
> > > > #6]
> > > >    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp,
> > > > #4]
> > > >    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
> > > >    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
> > > >    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
> > > >    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> > > > <rte_atomic16_cmpset+52>  // b.any
> > > >    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2,
> > > > [x0]
> > > >    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4,
> > > > 0x90f1c0
> > > > <rte_atomic16_cmpset+32>
> > > >    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
> > > >    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  //
> > > > eq =
> > > > none
> > > > 
> > > > The benchmarking results showed 3X performance gain on Cavium
> > > > ThunderX2 and
> > > > 13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
> > > > Here is the example test result on TX2:
> > > > 
> > > > *** spinlock_autotest without this patch *** Core [123] Cost
> > > > Time =
> > > > 639822 us Core [124] Cost Time = 633253 us Core [125] Cost Time
> > > > =
> > > > 646030 us Core [126] Cost Time = 643189 us Core [127] Cost Time
> > > > =
> > > > 647039 us Total Cost Time = 95433298 us
> > > > 
> > > > *** spinlock_autotest with this patch *** Core [123] Cost Time
> > > > =
> > > > 163615 us Core [124] Cost Time = 166471 us Core [125] Cost Time
> > > > =
> > > > 189044 us Core [126] Cost Time = 195745 us Core [127] Cost Time
> > > > =
> > > > 78423 us Total Cost Time = 27339656 us
> > > > 
> > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com
> > > > >
> > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > > > Reviewed-by: Steve Capper <Steve.Capper@arm.com>
> > > > ---
> > > >  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > > > +++++++++++++-----
> > > >  1 file changed, 13 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git
> > > > a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > index c4c3fc31e..87ae7a4f1 100644
> > > > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t
> > > > *sl);  static
> > > > inline void  rte_spinlock_lock(rte_spinlock_t *sl)  {
> > > > -	while (__sync_lock_test_and_set(&sl->locked, 1))
> > > > -		while(sl->locked)
> > > > +	int exp = 0;
> > > > +
> > > > +	while (!__atomic_compare_exchange_n(&sl->locked, &exp,
> > > > 1, 0,
> > > > +				__ATOMIC_ACQUIRE,
> > > > __ATOMIC_RELAXED))
> > > {
> > > 
> > > How about remove explict exp = 0 and change to
> > > __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);
> > 
> > Yes, __atomic_test_and_set means simpler code and better, but
> > __atomic_test_and_set takes the first argument as a pointer to type
> > bool or
> > char, in our case, sl->locked is of type uint32.
> > We can force it to uint8, or just pass in the 32bit pointer, only
> > one byte/bit is
> > really used in this case, is that ok?
> > 
> > "It should be only used for operands of type bool or char. For
> > other types only
> > part of the value may be set."
> > https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/_005f_005fatomic-
> > Builtins.html
> > 
> > From performance perspective, in our testing, the performance was
> > very close,
> > compared to __atomic.
> If performance is close, I suggest we go with the existing patch.
> Changing sl->locked to bool/char would be an ABI change and will
> affect x86 TM based implementation as well.
> Jerin, what do you think?

Looks good to me.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins
  2019-01-14  5:54       ` Honnappa Nagarahalli
  2019-01-14  7:39         ` Jerin Jacob Kollanukkaran
@ 2019-01-14  7:57         ` Gavin Hu (Arm Technology China)
  1 sibling, 0 replies; 28+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-01-14  7:57 UTC (permalink / raw)
  To: Honnappa Nagarahalli, jerinj, dev
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas,
	hemant.agrawal, stephen, Joyce Kong (Arm Technology China),
	nd


> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Monday, January 14, 2019 1:55 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> jerinj@marvell.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> hemant.agrawal@nxp.com; stephen@networkplumber.org; Joyce Kong (Arm
> Technology China) <Joyce.Kong@arm.com>; nd <nd@arm.com>
> Subject: RE: [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-
> way barrier builtins
> 
> > >
> > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > > -------------------------------------------------------------------
> > > > ---
> > > > The __sync builtin based implementation generates full memory
> > > > barriers ('dmb ish') on Arm platforms. Using C11 atomic builtins to
> > > > generate one way barriers.
> > > >
> > > > Here is the assembly code of __sync_compare_and_swap builtin.
> > > > __sync_bool_compare_and_swap(dst, exp, src);
> > > >    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
> > > >    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
> > > >    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
> > > >    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
> > > >    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
> > > >    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
> > > >    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> > > > <rte_atomic16_cmpset+52>  // b.any
> > > >    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
> > > >    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
> > > > <rte_atomic16_cmpset+32>
> > > >    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
> > > >    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq =
> > > > none
> > > >
> > > > The benchmarking results showed 3X performance gain on Cavium
> > > > ThunderX2 and
> > > > 13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
> > > > Here is the example test result on TX2:
> > > >
> > > > *** spinlock_autotest without this patch *** Core [123] Cost Time =
> > > > 639822 us Core [124] Cost Time = 633253 us Core [125] Cost Time =
> > > > 646030 us Core [126] Cost Time = 643189 us Core [127] Cost Time =
> > > > 647039 us Total Cost Time = 95433298 us
> > > >
> > > > *** spinlock_autotest with this patch *** Core [123] Cost Time =
> > > > 163615 us Core [124] Cost Time = 166471 us Core [125] Cost Time =
> > > > 189044 us Core [126] Cost Time = 195745 us Core [127] Cost Time =
> > > > 78423 us Total Cost Time = 27339656 us
> > > >
> > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > > > Reviewed-by: Steve Capper <Steve.Capper@arm.com>
> > > > ---
> > > >  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > > > +++++++++++++-----
> > > >  1 file changed, 13 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > index c4c3fc31e..87ae7a4f1 100644
> > > > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);  static
> > > > inline void  rte_spinlock_lock(rte_spinlock_t *sl)  {
> > > > -	while (__sync_lock_test_and_set(&sl->locked, 1))
> > > > -		while(sl->locked)
> > > > +	int exp = 0;
> > > > +
> > > > +	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > > > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> > > {
> > >
> > > How about remove explict exp = 0 and change to
> > > __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);
> >
> > Yes, __atomic_test_and_set means simpler code and better, but
> > __atomic_test_and_set takes the first argument as a pointer to type bool
> or
> > char, in our case, sl->locked is of type uint32.
> > We can force it to uint8, or just pass in the 32bit pointer, only one byte/bit
> is
> > really used in this case, is that ok?
> >
> > "It should be only used for operands of type bool or char. For other types
> only
> > part of the value may be set."
> > https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/_005f_005fatomic-
> > Builtins.html
> >
> > From performance perspective, in our testing, the performance was very
> close,
> > compared to __atomic.
> If performance is close, I suggest we go with the existing patch. Changing sl-
> >locked to bool/char would be an ABI change and will affect x86 TM based
> implementation as well.
> Jerin, what do you think?

I have benchmarked on Qualcomm, ThunderX2 and Cortex-A72.
Compared to the existing patch, with the new patch using __atomic_test_and_set, Qualcomm Falkor gained 60% in performance, the 4-core A72 degraded by 13%, and ThunderX2 was even worse, degrading by 10x.
I am not sure why ThunderX2 degraded so much; maybe it was caused by too many cores (128 cores) with high contention?

> 
> >
> > >
> > > i.e
> > > while (_atomic_test_and_set(flag, __ATOMIC_ACQUIRE))
> > >
> > >
> > >
> > > > +		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> > > >  			rte_pause();
> > > > +		exp = 0;
> > >
> > > We can remove exp = 0 with above scheme.
> > >
> > > > +	}
> > > >  }
> > > >  #endif
> > > >
> > > > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);  static
> > > > inline void  rte_spinlock_unlock (rte_spinlock_t *sl)  {
> > > > -	__sync_lock_release(&sl->locked);
> > > > +	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
> > >  }
> > > >  #endif
> > > >
> > > > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
> > > > static inline int  rte_spinlock_trylock (rte_spinlock_t *sl)  {
> > > > -	return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > > > +	int exp = 0;
> > > > +	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > > > +				0, /* disallow spurious failure */
> > > > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> > >
> > > Here to remove explicit exp.
> > >
> > > return (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE) == 0)
> > >
> > >
> > > >  }
> > > >  #endif
> > > >
> > > > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> > > >   */
> > > >  static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)  {
> > > > -	return sl->locked;
> > > > +	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
> > >
> > > __ATOMIC_RELAXED would be enough here. Right ?
> > >
> > >
> > > >  }
> > > >
> > > >  /**

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins
  2019-01-14  7:39         ` Jerin Jacob Kollanukkaran
@ 2019-01-14 17:08           ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 28+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-01-14 17:08 UTC (permalink / raw)
  To: jerinj, Honnappa Nagarahalli, dev
  Cc: david.marchand, chaozhu, nd, bruce.richardson, thomas,
	Joyce Kong (Arm Technology China),
	hemant.agrawal, stephen

> > > > > *sl);  static
> > > > > inline void  rte_spinlock_lock(rte_spinlock_t *sl)  {
> > > > > -	while (__sync_lock_test_and_set(&sl->locked, 1))
> > > > > -		while(sl->locked)
> > > > > +	int exp = 0;
> > > > > +
> > > > > +	while (!__atomic_compare_exchange_n(&sl->locked, &exp,
> > > > > 1, 0,
> > > > > +				__ATOMIC_ACQUIRE,
> > > > > __ATOMIC_RELAXED))
> > > > {
> > > >
> > > > How about remove explict exp = 0 and change to
> > > > __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);
> > >
> > > Yes, __atomic_test_and_set means simpler code and better, but
> > > __atomic_test_and_set takes the first argument as a pointer to type
> > > bool or
> > > char, in our case, sl->locked is of type uint32.
> > > We can force it to uint8, or just pass in the 32bit pointer, only
> > > one byte/bit is
> > > really used in this case, is that ok?
> > >
> > > "It should be only used for operands of type bool or char. For
> > > other types only
> > > part of the value may be set."
> > > https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/_005f_005fatomic-
> > > Builtins.html
> > >
> > > From performance perspective, in our testing, the performance was
> > > very close,
> > > compared to __atomic.
> > If performance is close, I suggest we go with the existing patch.
> > Changing sl->locked to bool/char would be an ABI change and will
> > affect x86 TM based implementation as well.
> > Jerin, what do you think?
> 
> Looks good to me.
> 
I tested the __atomic_test_and_set based patch (I did not push this; it is in internal review), and it caused a 10x performance degradation on ThunderX2.
In the internal review, Ola Liljedahl's comment explained the degradation well:
"Unlike the old code, the new code will write the lock location (and thus get exclusive access to the whole cache line) repeatedly while it is busy waiting. This is bad for the lock owner if it is accessing other data located in the same cache line as the lock (which is often the case)."
The real difference is that __atomic_test_and_set keeps writing the lock location (whether it is locked or not) and monopolizing the cache line, which is bad for the lock owner and the other contenders.
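
To illustrate the point, the two busy-wait loops side by side (the second
is the loop from this patch; the first is a sketch of the
__atomic_test_and_set variant from the internal review):

        /* test-and-set variant: every poll is a write, so each waiter
         * repeatedly pulls the cache line in exclusive state and
         * disturbs the lock owner */
        while (__atomic_test_and_set(&sl->locked, __ATOMIC_ACQUIRE))
                rte_pause();

        /* CAS + relaxed-load variant (this patch): waiters spin on a
         * plain load and keep the cache line shared; the write is only
         * attempted once the lock looks free */
        int exp = 0;
        while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
                while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
                        rte_pause();
                exp = 0;
        }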

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2019-01-14 17:08 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-27  4:13 [dpdk-dev] [PATCH v3 0/6] spinlock optimization and test case enhancements Gavin Hu
2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 1/6] eal: fix clang compilation error on x86 Gavin Hu
2018-12-27  6:36   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 2/6] test/spinlock: remove 1us delay for correct benchmarking Gavin Hu
2018-12-27  7:20   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 3/6] test/spinlock: get timestamp more precisely Gavin Hu
2018-12-27  7:27   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
2019-01-03 18:22     ` Honnappa Nagarahalli
2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 4/6] test/spinlock: amortize the cost of getting time Gavin Hu
2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
2018-12-27  7:42   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
2018-12-27  9:02     ` Gavin Hu (Arm Technology China)
2019-01-03 20:35       ` Honnappa Nagarahalli
2019-01-11 13:52     ` Gavin Hu (Arm Technology China)
2019-01-14  5:54       ` Honnappa Nagarahalli
2019-01-14  7:39         ` Jerin Jacob Kollanukkaran
2019-01-14 17:08           ` Gavin Hu (Arm Technology China)
2019-01-14  7:57         ` Gavin Hu (Arm Technology China)
2018-12-27  4:13 ` [dpdk-dev] [PATCH v3 6/6] spinlock: ticket based to improve fairness Gavin Hu
2018-12-27  6:58   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
2018-12-27 10:05     ` Gavin Hu (Arm Technology China)
2018-12-27 12:08       ` Jerin Jacob Kollanukkaran
2018-12-27 23:41         ` Stephen Hemminger
2018-12-28  4:39           ` Jerin Jacob Kollanukkaran
2018-12-28 10:04             ` Gavin Hu (Arm Technology China)
2019-01-03 18:35             ` Honnappa Nagarahalli
2019-01-03 19:53               ` Stephen Hemminger
2019-01-04  7:06                 ` Honnappa Nagarahalli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).