* [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements
@ 2018-12-20 10:42 Gavin Hu
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 1/5] test/spinlock: remove 1us delay for correct spinlock benchmarking Gavin Hu
` (27 more replies)
0 siblings, 28 replies; 54+ messages in thread
From: Gavin Hu @ 2018-12-20 10:42 UTC (permalink / raw)
To: dev
Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu, nd,
Honnappa.Nagarahalli, Gavin Hu
V1:
1. Remove the 1us delay outside of the locked region to really benchmark
the spinlock acquire/release performance, not the delay API.
2. Use the precise version of getting timestamps for more precise
benchmarking results.
3. Amortize the overhead of getting the timestamp by 10000 loops
4. Move the arm specific implementation to arm folder to remove the
hardcoded implementation.
5. Use atomic primitives, which translate to one-way barriers, instead of
two-way sync primitives, to optimize for performance.
Gavin Hu (5):
test/spinlock: remove 1us delay to create contention
test/spinlock: get timestamp more precisely
test/spinlock: amortize the cost of getting time
spinlock: move the implementation to arm specific file
spinlock: reimplement with atomic one-way barrier builtins
.../common/include/arch/arm/rte_spinlock.h | 28 +++++++++++++++++
.../common/include/generic/rte_spinlock.h | 28 +----------------
test/test/test_spinlock.c | 35 +++++++++++-----------
3 files changed, 47 insertions(+), 44 deletions(-)
--
2.11.0
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v1 1/5] test/spinlock: remove 1us delay for correct spinlock benchmarking
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
@ 2018-12-20 10:42 ` Gavin Hu
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 2/5] test/spinlock: get timestamp more precisely Gavin Hu
` (26 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2018-12-20 10:42 UTC (permalink / raw)
To: dev
Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu, nd,
Honnappa.Nagarahalli, Gavin Hu
The test is to benchmark the performance of spinlock by counting the
number of spinlock acquire and release operations within the specified
time.
A typical pair of lock and unlock operations costs tens or hundreds of
nano seconds, in comparison to this, delaying 1 us outside of the locked
region is too much, compromising the goal of benchmarking the lock and
unlock performance.
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <Ruifeng.Wang@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
---
test/test/test_spinlock.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/test/test/test_spinlock.c b/test/test/test_spinlock.c
index 73bff128e..6795195ae 100644
--- a/test/test/test_spinlock.c
+++ b/test/test/test_spinlock.c
@@ -120,8 +120,6 @@ load_loop_fn(void *func_param)
lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- /* delay to make lock duty cycle slighlty realistic */
- rte_delay_us(1);
time_diff = rte_get_timer_cycles() - begin;
}
lock_count[lcore] = lcount;
--
2.11.0
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v1 2/5] test/spinlock: get timestamp more precisely
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 1/5] test/spinlock: remove 1us delay for correct spinlock benchmarking Gavin Hu
@ 2018-12-20 10:42 ` Gavin Hu
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 3/5] test/spinlock: amortize the cost of getting time Gavin Hu
` (25 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2018-12-20 10:42 UTC (permalink / raw)
To: dev
Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu, nd,
Honnappa.Nagarahalli, Gavin Hu
To precisely benchmark the spinlock performance, uses the precise
version of getting timestamps, which enforces the timestamps are
obtained at the expected places.
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
test/test/test_spinlock.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/test/test/test_spinlock.c b/test/test/test_spinlock.c
index 6795195ae..648474833 100644
--- a/test/test/test_spinlock.c
+++ b/test/test/test_spinlock.c
@@ -113,14 +113,14 @@ load_loop_fn(void *func_param)
if (lcore != rte_get_master_lcore())
while (rte_atomic32_read(&synchro) == 0);
- begin = rte_get_timer_cycles();
+ begin = rte_rdtsc_precise();
while (time_diff < hz * TIME_MS / 1000) {
if (use_lock)
rte_spinlock_lock(&lk);
lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- time_diff = rte_get_timer_cycles() - begin;
+ time_diff = rte_rdtsc_precise() - begin;
}
lock_count[lcore] = lcount;
return 0;
--
2.11.0
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v1 3/5] test/spinlock: amortize the cost of getting time
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 1/5] test/spinlock: remove 1us delay for correct spinlock benchmarking Gavin Hu
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 2/5] test/spinlock: get timestamp more precisely Gavin Hu
@ 2018-12-20 10:42 ` Gavin Hu
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 4/5] spinlock: move the implementation to arm specific file Gavin Hu
` (24 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2018-12-20 10:42 UTC (permalink / raw)
To: dev
Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu, nd,
Honnappa.Nagarahalli, Gavin Hu
Instead of getting timestamps per iteration, amortize its overhead
can help getting more precise benchmarking results.
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
---
test/test/test_spinlock.c | 28 +++++++++++++++-------------
1 file changed, 15 insertions(+), 13 deletions(-)
diff --git a/test/test/test_spinlock.c b/test/test/test_spinlock.c
index 648474833..e9839b979 100644
--- a/test/test/test_spinlock.c
+++ b/test/test/test_spinlock.c
@@ -96,9 +96,9 @@ test_spinlock_recursive_per_core(__attribute__((unused)) void *arg)
}
static rte_spinlock_t lk = RTE_SPINLOCK_INITIALIZER;
-static uint64_t lock_count[RTE_MAX_LCORE] = {0};
+static uint64_t time_count[RTE_MAX_LCORE] = {0};
-#define TIME_MS 100
+#define MAX_LOOP 10000
static int
load_loop_fn(void *func_param)
@@ -114,15 +114,14 @@ load_loop_fn(void *func_param)
while (rte_atomic32_read(&synchro) == 0);
begin = rte_rdtsc_precise();
- while (time_diff < hz * TIME_MS / 1000) {
+ while (lcount < MAX_LOOP) {
if (use_lock)
rte_spinlock_lock(&lk);
- lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- time_diff = rte_rdtsc_precise() - begin;
}
- lock_count[lcore] = lcount;
+ time_diff = rte_rdtsc_precise() - begin;
+ time_count[lcore] = time_diff * 1000000 / hz;
return 0;
}
@@ -136,14 +135,16 @@ test_spinlock_perf(void)
printf("\nTest with no lock on single core...\n");
load_loop_fn(&lock);
- printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
- memset(lock_count, 0, sizeof(lock_count));
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+ time_count[lcore]);
+ memset(time_count, 0, sizeof(time_count));
printf("\nTest with lock on single core...\n");
lock = 1;
load_loop_fn(&lock);
- printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
- memset(lock_count, 0, sizeof(lock_count));
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+ time_count[lcore]);
+ memset(time_count, 0, sizeof(time_count));
printf("\nTest with lock on %u cores...\n", rte_lcore_count());
@@ -158,11 +159,12 @@ test_spinlock_perf(void)
rte_eal_mp_wait_lcore();
RTE_LCORE_FOREACH(i) {
- printf("Core [%u] count = %"PRIu64"\n", i, lock_count[i]);
- total += lock_count[i];
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", i,
+ time_count[i]);
+ total += time_count[i];
}
- printf("Total count = %"PRIu64"\n", total);
+ printf("Total Cost Time = %"PRIu64" us\n", total);
return 0;
}
--
2.11.0
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v1 4/5] spinlock: move the implementation to arm specific file
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (2 preceding siblings ...)
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 3/5] test/spinlock: amortize the cost of getting time Gavin Hu
@ 2018-12-20 10:42 ` Gavin Hu
2018-12-20 12:47 ` David Marchand
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 5/5] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
` (23 subsequent siblings)
27 siblings, 1 reply; 54+ messages in thread
From: Gavin Hu @ 2018-12-20 10:42 UTC (permalink / raw)
To: dev
Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu, nd,
Honnappa.Nagarahalli, Gavin Hu
remove the hard code #ifdef RTE_FORCE_INTRINSICS, move the implementation
to the arm specific file, x86 and POWER have their own implementations.
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <Ruifeng.Wang@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
---
.../common/include/arch/arm/rte_spinlock.h | 20 +++++++++++++++++
.../common/include/generic/rte_spinlock.h | 26 ----------------------
2 files changed, 20 insertions(+), 26 deletions(-)
diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
index 1a6916b6e..25d22fd1e 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
@@ -16,6 +16,26 @@ extern "C" {
#include <rte_common.h>
#include "generic/rte_spinlock.h"
+static inline void
+rte_spinlock_lock(rte_spinlock_t *sl)
+{
+ while (__sync_lock_test_and_set(&sl->locked, 1))
+ while (sl->locked)
+ rte_pause();
+}
+
+static inline void
+rte_spinlock_unlock(rte_spinlock_t *sl)
+{
+ __sync_lock_release(&sl->locked);
+}
+
+static inline int
+rte_spinlock_trylock(rte_spinlock_t *sl)
+{
+ return __sync_lock_test_and_set(&sl->locked, 1) == 0;
+}
+
static inline int rte_tm_supported(void)
{
return 0;
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index c4c3fc31e..e555ecb95 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -57,16 +57,6 @@ rte_spinlock_init(rte_spinlock_t *sl)
static inline void
rte_spinlock_lock(rte_spinlock_t *sl);
-#ifdef RTE_FORCE_INTRINSICS
-static inline void
-rte_spinlock_lock(rte_spinlock_t *sl)
-{
- while (__sync_lock_test_and_set(&sl->locked, 1))
- while(sl->locked)
- rte_pause();
-}
-#endif
-
/**
* Release the spinlock.
*
@@ -76,14 +66,6 @@ rte_spinlock_lock(rte_spinlock_t *sl)
static inline void
rte_spinlock_unlock (rte_spinlock_t *sl);
-#ifdef RTE_FORCE_INTRINSICS
-static inline void
-rte_spinlock_unlock (rte_spinlock_t *sl)
-{
- __sync_lock_release(&sl->locked);
-}
-#endif
-
/**
* Try to take the lock.
*
@@ -95,14 +77,6 @@ rte_spinlock_unlock (rte_spinlock_t *sl)
static inline int
rte_spinlock_trylock (rte_spinlock_t *sl);
-#ifdef RTE_FORCE_INTRINSICS
-static inline int
-rte_spinlock_trylock (rte_spinlock_t *sl)
-{
- return __sync_lock_test_and_set(&sl->locked,1) == 0;
-}
-#endif
-
/**
* Test if the lock is taken.
*
--
2.11.0
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v1 5/5] spinlock: reimplement with atomic one-way barrier builtins
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (3 preceding siblings ...)
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 4/5] spinlock: move the implementation to arm specific file Gavin Hu
@ 2018-12-20 10:42 ` Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 0/5] spinlock optimization and test case enhancements Gavin Hu
` (22 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2018-12-20 10:42 UTC (permalink / raw)
To: dev
Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu, nd,
Honnappa.Nagarahalli, Gavin Hu
The __sync builtin based implementation generates full memory barriers
('dmb ish') on Arm platforms. Using C11 atomic builtins to generate one way
barriers.
Here is the assembly code of __sync_compare_and_swap builtin.
__sync_bool_compare_and_swap(dst, exp, src);
0x000000000090f1b0 <+16>: e0 07 40 f9 ldr x0, [sp, #8]
0x000000000090f1b4 <+20>: e1 0f 40 79 ldrh w1, [sp, #6]
0x000000000090f1b8 <+24>: e2 0b 40 79 ldrh w2, [sp, #4]
0x000000000090f1bc <+28>: 21 3c 00 12 and w1, w1, #0xffff
0x000000000090f1c0 <+32>: 03 7c 5f 48 ldxrh w3, [x0]
0x000000000090f1c4 <+36>: 7f 00 01 6b cmp w3, w1
0x000000000090f1c8 <+40>: 61 00 00 54 b.ne 0x90f1d4
<rte_atomic16_cmpset+52> // b.any
0x000000000090f1cc <+44>: 02 fc 04 48 stlxrh w4, w2, [x0]
0x000000000090f1d0 <+48>: 84 ff ff 35 cbnz w4, 0x90f1c0
<rte_atomic16_cmpset+32>
0x000000000090f1d4 <+52>: bf 3b 03 d5 dmb ish
0x000000000090f1d8 <+56>: e0 17 9f 1a cset w0, eq // eq = none
The benchmarking results showed 3X performance gain on Cavium ThunderX2 and
13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
Here is the example test result on TX2:
*** spinlock_autotest without this patch ***
Core [123] Cost Time = 639822 us
Core [124] Cost Time = 633253 us
Core [125] Cost Time = 646030 us
Core [126] Cost Time = 643189 us
Core [127] Cost Time = 647039 us
Total Cost Time = 95433298 us
*** spinlock_autotest with this patch ***
Core [123] Cost Time = 163615 us
Core [124] Cost Time = 166471 us
Core [125] Cost Time = 189044 us
Core [126] Cost Time = 195745 us
Core [127] Cost Time = 78423 us
Total Cost Time = 27339656 us
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Steve Capper <Steve.Capper@arm.com>
---
lib/librte_eal/common/include/arch/arm/rte_spinlock.h | 16 ++++++++++++----
lib/librte_eal/common/include/generic/rte_spinlock.h | 2 +-
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
index 25d22fd1e..33a6d382f 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
@@ -19,21 +19,29 @@ extern "C" {
static inline void
rte_spinlock_lock(rte_spinlock_t *sl)
{
- while (__sync_lock_test_and_set(&sl->locked, 1))
- while (sl->locked)
+ int exp = 0;
+
+ while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 1,
+ __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
+ while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
rte_pause();
+ exp = 0;
+ }
}
static inline void
rte_spinlock_unlock(rte_spinlock_t *sl)
{
- __sync_lock_release(&sl->locked);
+ __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
static inline int
rte_spinlock_trylock(rte_spinlock_t *sl)
{
- return __sync_lock_test_and_set(&sl->locked, 1) == 0;
+ int exp = 0;
+ return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
+ false, /* disallow spurious failure */
+ __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}
static inline int rte_tm_supported(void)
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index e555ecb95..a8cc75adc 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -87,7 +87,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
*/
static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
{
- return sl->locked;
+ return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
}
/**
--
2.11.0
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [PATCH v1 4/5] spinlock: move the implementation to arm specific file
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 4/5] spinlock: move the implementation to arm specific file Gavin Hu
@ 2018-12-20 12:47 ` David Marchand
2018-12-20 12:55 ` David Marchand
2018-12-20 14:36 ` Gavin Hu (Arm Technology China)
0 siblings, 2 replies; 54+ messages in thread
From: David Marchand @ 2018-12-20 12:47 UTC (permalink / raw)
To: Gavin Hu
Cc: dev, Thomas Monjalon, jerinj, hemant.agrawal, bruce.richardson,
chaozhu, nd, Honnappa.Nagarahalli
On Thu, Dec 20, 2018 at 11:44 AM Gavin Hu <gavin.hu@arm.com> wrote:
> remove the hard code #ifdef RTE_FORCE_INTRINSICS, move the implementation
> to the arm specific file, x86 and POWER have their own implementations.
>
No, x86 and ppc define their own implementation when _not_ having
RTE_FORCE_INTRINSICS.
--
David Marchand
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [PATCH v1 4/5] spinlock: move the implementation to arm specific file
2018-12-20 12:47 ` David Marchand
@ 2018-12-20 12:55 ` David Marchand
2018-12-20 14:40 ` Gavin Hu (Arm Technology China)
2018-12-20 14:36 ` Gavin Hu (Arm Technology China)
1 sibling, 1 reply; 54+ messages in thread
From: David Marchand @ 2018-12-20 12:55 UTC (permalink / raw)
To: Gavin Hu
Cc: dev, Thomas Monjalon, jerinj, hemant.agrawal, bruce.richardson,
chaozhu, nd, Honnappa.Nagarahalli
On Thu, Dec 20, 2018 at 1:47 PM David Marchand <david.marchand@redhat.com>
wrote:
> On Thu, Dec 20, 2018 at 11:44 AM Gavin Hu <gavin.hu@arm.com> wrote:
>
>> remove the hard code #ifdef RTE_FORCE_INTRINSICS, move the implementation
>> to the arm specific file, x86 and POWER have their own implementations.
>>
>
> No, x86 and ppc define their own implementation when _not_ having
> RTE_FORCE_INTRINSICS.
>
Actually, I wondered in the past if we realled needed those double
implementations...
Is there really a point in maintaining both ? I am pretty sure people only
use the default configuration per architecture.
--
David Marchand
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [PATCH v1 4/5] spinlock: move the implementation to arm specific file
2018-12-20 12:47 ` David Marchand
2018-12-20 12:55 ` David Marchand
@ 2018-12-20 14:36 ` Gavin Hu (Arm Technology China)
2018-12-20 15:09 ` David Marchand
1 sibling, 1 reply; 54+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-12-20 14:36 UTC (permalink / raw)
To: David Marchand
Cc: dev, Thomas Monjalon, jerinj, hemant.agrawal, bruce.richardson,
chaozhu, nd, Honnappa Nagarahalli, nd
>> On Thu, Dec 20, 2018 at 11:44 AM Gavin Hu <mailto:gavin.hu@arm.com> wrote:
>> remove the hard code #ifdef RTE_FORCE_INTRINSICS, move the implementation
>> to the arm specific file, x86 and POWER have their own implementations.
> No, x86 and ppc define their own implementation when _not_ having RTE_FORCE_INTRINSICS.
> David Marchand
Hi David,
Your reply is out of format, I re-format it to text based.
Yes, x86 and ppc define their own implementation, so this change is arm specific.
Only arm have RTE_FORCE_INTRINSICS, x86 and ppc don't define it in the config files.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [PATCH v1 4/5] spinlock: move the implementation to arm specific file
2018-12-20 12:55 ` David Marchand
@ 2018-12-20 14:40 ` Gavin Hu (Arm Technology China)
0 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-12-20 14:40 UTC (permalink / raw)
To: David Marchand
Cc: dev, Thomas Monjalon, jerinj, hemant.agrawal, bruce.richardson,
chaozhu, nd, Honnappa Nagarahalli, nd
>> > On Thu, Dec 20, 2018 at 1:47 PM David Marchand <mailto:david.marchand@redhat.com> wrote:
>> > On Thu, Dec 20, 2018 at 11:44 AM Gavin Hu <mailto:gavin.hu@arm.com> wrote:
>>> remove the hard code #ifdef RTE_FORCE_INTRINSICS, move the implementation
>>> to the arm specific file, x86 and POWER have their own implementations.
>> No, x86 and ppc define their own implementation when _not_ having RTE_FORCE_INTRINSICS.
> Actually, I wondered in the past if we realled needed those double implementations...
> Is there really a point in maintaining both ? I am pretty sure people only use the default configuration per architecture.
> David Marchand
X86 implementation is xchg instruction based, it is arch specific.
Ppc has its own code base, but in reality it is same as arm implementation, but in future, arm will change to its own instruction level based implementation.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [PATCH v1 4/5] spinlock: move the implementation to arm specific file
2018-12-20 14:36 ` Gavin Hu (Arm Technology China)
@ 2018-12-20 15:09 ` David Marchand
2018-12-20 15:58 ` Gavin Hu (Arm Technology China)
0 siblings, 1 reply; 54+ messages in thread
From: David Marchand @ 2018-12-20 15:09 UTC (permalink / raw)
To: Gavin Hu (Arm Technology China)
Cc: dev, Thomas Monjalon, jerinj, hemant.agrawal, bruce.richardson,
chaozhu, nd, Honnappa Nagarahalli
On Thu, Dec 20, 2018 at 3:36 PM Gavin Hu (Arm Technology China) <
Gavin.Hu@arm.com> wrote:
> >> On Thu, Dec 20, 2018 at 11:44 AM Gavin Hu <mailto:gavin.hu@arm.com>
> wrote:
> >> remove the hard code #ifdef RTE_FORCE_INTRINSICS, move the
> implementation
> >> to the arm specific file, x86 and POWER have their own implementations.
>
> > No, x86 and ppc define their own implementation when _not_ having
> RTE_FORCE_INTRINSICS.
> > David Marchand
>
> Hi David,
>
> Your reply is out of format, I re-format it to text based.
>
I suppose this is an issue with your mail client.
> Yes, x86 and ppc define their own implementation, so this change is arm
> specific.
> Only arm have RTE_FORCE_INTRINSICS, x86 and ppc don't define it in the
> config files.
>
This change breaks the use of intrinsics in x86 case at least.
$ git reset --hard origin/master
HEAD is now at 476c847 malloc: add option --match-allocations
$ git am
~/Downloads/v1-4-5-spinlock-move-the-implementation-to-arm-specific-file.patch
Applying: spinlock: move the implementation to arm specific file
# default config
$ rm -rf master; make defconfig O=master && make -j4 O=master
[...]
Build complete [x86_64-native-linuxapp-gcc]
# then enable use of intrinsics
$ echo CONFIG_RTE_FORCE_INTRINSICS=y >> master/.config && make O=master
[...]
CC eal.o
In file included from
/home/dmarchan/git/pub/dpdk/master/include/rte_spinlock.h:12:0,
from
/home/dmarchan/git/pub/dpdk/master/include/rte_malloc_heap.h:10,
from
/home/dmarchan/git/pub/dpdk/master/include/rte_eal_memconfig.h:12,
from
/home/dmarchan/git/pub/dpdk/lib/librte_eal/linuxapp/eal/eal.c:35:
/home/dmarchan/git/pub/dpdk/master/include/generic/rte_spinlock.h:58:1:
error: ‘rte_spinlock_lock’ used but never defined [-Werror]
rte_spinlock_lock(rte_spinlock_t *sl);
^
/home/dmarchan/git/pub/dpdk/master/include/generic/rte_spinlock.h:67:1:
error: ‘rte_spinlock_unlock’ used but never defined [-Werror]
rte_spinlock_unlock (rte_spinlock_t *sl);
^
/home/dmarchan/git/pub/dpdk/master/include/generic/rte_spinlock.h:78:1:
error: ‘rte_spinlock_trylock’ used but never defined [-Werror]
rte_spinlock_trylock (rte_spinlock_t *sl);
^
cc1: all warnings being treated as errors
make[5]: *** [eal.o] Error 1
make[4]: *** [eal] Error 2
--
David Marchand
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [PATCH v1 4/5] spinlock: move the implementation to arm specific file
2018-12-20 15:09 ` David Marchand
@ 2018-12-20 15:58 ` Gavin Hu (Arm Technology China)
2018-12-20 15:59 ` David Marchand
0 siblings, 1 reply; 54+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2018-12-20 15:58 UTC (permalink / raw)
To: David Marchand
Cc: dev, Thomas Monjalon, jerinj, hemant.agrawal, bruce.richardson,
chaozhu, nd, Honnappa Nagarahalli, nd
> This change breaks the use of intrinsics in x86 case at least.
This is true, if intrinsics is still an option for x86 even not is use by default.
I will fix this in v2.
Don't know why mail mail client has this issue, I worked with other mail threads correctly.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [PATCH v1 4/5] spinlock: move the implementation to arm specific file
2018-12-20 15:58 ` Gavin Hu (Arm Technology China)
@ 2018-12-20 15:59 ` David Marchand
0 siblings, 0 replies; 54+ messages in thread
From: David Marchand @ 2018-12-20 15:59 UTC (permalink / raw)
To: Gavin Hu (Arm Technology China)
Cc: dev, Thomas Monjalon, jerinj, hemant.agrawal, bruce.richardson,
chaozhu, nd, Honnappa Nagarahalli
On Thu, Dec 20, 2018 at 4:58 PM Gavin Hu (Arm Technology China) <
Gavin.Hu@arm.com> wrote:
>
> > This change breaks the use of intrinsics in x86 case at least.
>
> This is true, if intrinsics is still an option for x86 even not is use by
> default.
> I will fix this in v2.
>
Please, also check ppc, same pb afaics.
Thanks.
--
David Marchand
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v2 0/5] spinlock optimization and test case enhancements
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (4 preceding siblings ...)
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 5/5] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
@ 2018-12-20 17:42 ` Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 1/5] test/spinlock: remove 1us delay for correct benchmarking Gavin Hu
` (4 more replies)
2019-01-15 7:54 ` [dpdk-dev] [PATCH v4 0/4] spinlock optimization and test case enhancements gavin hu
` (21 subsequent siblings)
27 siblings, 5 replies; 54+ messages in thread
From: Gavin Hu @ 2018-12-20 17:42 UTC (permalink / raw)
To: dev
Cc: thomas, bruce.richardson, jerinj, hemant.agrawal, ferruh.yigit,
Honnappa.Nagarahalli, nd, Gavin Hu
V2:
1. FORCE_INTRINCIS is still an option for ppc/x86, although not is use
by default, so don't remove it from generic file.
2. Fix the clang compiler error on x86 when the above FORCE_INTRINSICS
is enabled.
V1:
1. Remove the 1us delay outside of the locked region to really benchmark
the spinlock acquire/release performance, not the delay API.
2. Use the precise version of getting timestamps for more precise
benchmarking results.
3. Amortize the overhead of getting the timestamp by 10000 loops.
4. Move the arm specific implementation to arm folder to remove the
hardcoded implementation.
5. Use atomic primitives, which translate to one-way barriers, instead of
two-way sync primitives, to optimize for performance.
Gavin Hu (5):
test/spinlock: remove 1us delay for correct benchmarking
test/spinlock: get timestamp more precisely
test/spinlock: amortize the cost of getting time
spinlock: reimplement with atomic one-way barrier builtins
eal: fix clang compilation error on x86
lib/librte_eal/common/include/generic/rte_atomic.h | 6 ++--
.../common/include/generic/rte_spinlock.h | 18 ++++++++----
test/test/test_spinlock.c | 32 +++++++++++-----------
3 files changed, 32 insertions(+), 24 deletions(-)
--
2.11.0
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v2 1/5] test/spinlock: remove 1us delay for correct benchmarking
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 0/5] spinlock optimization and test case enhancements Gavin Hu
@ 2018-12-20 17:42 ` Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 2/5] test/spinlock: get timestamp more precisely Gavin Hu
` (3 subsequent siblings)
4 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2018-12-20 17:42 UTC (permalink / raw)
To: dev
Cc: thomas, bruce.richardson, jerinj, hemant.agrawal, ferruh.yigit,
Honnappa.Nagarahalli, nd, Gavin Hu
The test is to benchmark the performance of spinlock by counting the
number of spinlock acquire and release operations within the specified
time.
A typical pair of lock and unlock operations costs tens or hundreds of
nano seconds, in comparison to this, delaying 1 us outside of the locked
region is too much, compromising the goal of benchmarking the lock and
unlock performance.
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <Ruifeng.Wang@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
---
test/test/test_spinlock.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/test/test/test_spinlock.c b/test/test/test_spinlock.c
index 73bff128e..6795195ae 100644
--- a/test/test/test_spinlock.c
+++ b/test/test/test_spinlock.c
@@ -120,8 +120,6 @@ load_loop_fn(void *func_param)
lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- /* delay to make lock duty cycle slighlty realistic */
- rte_delay_us(1);
time_diff = rte_get_timer_cycles() - begin;
}
lock_count[lcore] = lcount;
--
2.11.0
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v2 2/5] test/spinlock: get timestamp more precisely
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 0/5] spinlock optimization and test case enhancements Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 1/5] test/spinlock: remove 1us delay for correct benchmarking Gavin Hu
@ 2018-12-20 17:42 ` Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 3/5] test/spinlock: amortize the cost of getting time Gavin Hu
` (2 subsequent siblings)
4 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2018-12-20 17:42 UTC (permalink / raw)
To: dev
Cc: thomas, bruce.richardson, jerinj, hemant.agrawal, ferruh.yigit,
Honnappa.Nagarahalli, nd, Gavin Hu
To precisely benchmark the spinlock performance, uses the precise
version of getting timestamps, which enforces the timestamps are
obtained at the expected places.
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
test/test/test_spinlock.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/test/test/test_spinlock.c b/test/test/test_spinlock.c
index 6795195ae..648474833 100644
--- a/test/test/test_spinlock.c
+++ b/test/test/test_spinlock.c
@@ -113,14 +113,14 @@ load_loop_fn(void *func_param)
if (lcore != rte_get_master_lcore())
while (rte_atomic32_read(&synchro) == 0);
- begin = rte_get_timer_cycles();
+ begin = rte_rdtsc_precise();
while (time_diff < hz * TIME_MS / 1000) {
if (use_lock)
rte_spinlock_lock(&lk);
lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- time_diff = rte_get_timer_cycles() - begin;
+ time_diff = rte_rdtsc_precise() - begin;
}
lock_count[lcore] = lcount;
return 0;
--
2.11.0
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v2 3/5] test/spinlock: amortize the cost of getting time
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 0/5] spinlock optimization and test case enhancements Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 1/5] test/spinlock: remove 1us delay for correct benchmarking Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 2/5] test/spinlock: get timestamp more precisely Gavin Hu
@ 2018-12-20 17:42 ` Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 4/5] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 5/5] eal: fix clang compilation error on x86 Gavin Hu
4 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2018-12-20 17:42 UTC (permalink / raw)
To: dev
Cc: thomas, bruce.richardson, jerinj, hemant.agrawal, ferruh.yigit,
Honnappa.Nagarahalli, nd, Gavin Hu
Instead of getting timestamps per iteration, amortize its overhead
can help getting more precise benchmarking results.
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
---
test/test/test_spinlock.c | 28 +++++++++++++++-------------
1 file changed, 15 insertions(+), 13 deletions(-)
diff --git a/test/test/test_spinlock.c b/test/test/test_spinlock.c
index 648474833..e9839b979 100644
--- a/test/test/test_spinlock.c
+++ b/test/test/test_spinlock.c
@@ -96,9 +96,9 @@ test_spinlock_recursive_per_core(__attribute__((unused)) void *arg)
}
static rte_spinlock_t lk = RTE_SPINLOCK_INITIALIZER;
-static uint64_t lock_count[RTE_MAX_LCORE] = {0};
+static uint64_t time_count[RTE_MAX_LCORE] = {0};
-#define TIME_MS 100
+#define MAX_LOOP 10000
static int
load_loop_fn(void *func_param)
@@ -114,15 +114,14 @@ load_loop_fn(void *func_param)
while (rte_atomic32_read(&synchro) == 0);
begin = rte_rdtsc_precise();
- while (time_diff < hz * TIME_MS / 1000) {
+ while (lcount < MAX_LOOP) {
if (use_lock)
rte_spinlock_lock(&lk);
- lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- time_diff = rte_rdtsc_precise() - begin;
}
- lock_count[lcore] = lcount;
+ time_diff = rte_rdtsc_precise() - begin;
+ time_count[lcore] = time_diff * 1000000 / hz;
return 0;
}
@@ -136,14 +135,16 @@ test_spinlock_perf(void)
printf("\nTest with no lock on single core...\n");
load_loop_fn(&lock);
- printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
- memset(lock_count, 0, sizeof(lock_count));
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+ time_count[lcore]);
+ memset(time_count, 0, sizeof(time_count));
printf("\nTest with lock on single core...\n");
lock = 1;
load_loop_fn(&lock);
- printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
- memset(lock_count, 0, sizeof(lock_count));
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+ time_count[lcore]);
+ memset(time_count, 0, sizeof(time_count));
printf("\nTest with lock on %u cores...\n", rte_lcore_count());
@@ -158,11 +159,12 @@ test_spinlock_perf(void)
rte_eal_mp_wait_lcore();
RTE_LCORE_FOREACH(i) {
- printf("Core [%u] count = %"PRIu64"\n", i, lock_count[i]);
- total += lock_count[i];
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", i,
+ time_count[i]);
+ total += time_count[i];
}
- printf("Total count = %"PRIu64"\n", total);
+ printf("Total Cost Time = %"PRIu64" us\n", total);
return 0;
}
--
2.11.0
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v2 4/5] spinlock: reimplement with atomic one-way barrier builtins
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 0/5] spinlock optimization and test case enhancements Gavin Hu
` (2 preceding siblings ...)
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 3/5] test/spinlock: amortize the cost of getting time Gavin Hu
@ 2018-12-20 17:42 ` Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 5/5] eal: fix clang compilation error on x86 Gavin Hu
4 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2018-12-20 17:42 UTC (permalink / raw)
To: dev
Cc: thomas, bruce.richardson, jerinj, hemant.agrawal, ferruh.yigit,
Honnappa.Nagarahalli, nd, Gavin Hu
The __sync builtin based implementation generates full memory barriers
('dmb ish') on Arm platforms. Using C11 atomic builtins to generate one way
barriers.
Here is the assembly code of __sync_compare_and_swap builtin.
__sync_bool_compare_and_swap(dst, exp, src);
0x000000000090f1b0 <+16>: e0 07 40 f9 ldr x0, [sp, #8]
0x000000000090f1b4 <+20>: e1 0f 40 79 ldrh w1, [sp, #6]
0x000000000090f1b8 <+24>: e2 0b 40 79 ldrh w2, [sp, #4]
0x000000000090f1bc <+28>: 21 3c 00 12 and w1, w1, #0xffff
0x000000000090f1c0 <+32>: 03 7c 5f 48 ldxrh w3, [x0]
0x000000000090f1c4 <+36>: 7f 00 01 6b cmp w3, w1
0x000000000090f1c8 <+40>: 61 00 00 54 b.ne 0x90f1d4
<rte_atomic16_cmpset+52> // b.any
0x000000000090f1cc <+44>: 02 fc 04 48 stlxrh w4, w2, [x0]
0x000000000090f1d0 <+48>: 84 ff ff 35 cbnz w4, 0x90f1c0
<rte_atomic16_cmpset+32>
0x000000000090f1d4 <+52>: bf 3b 03 d5 dmb ish
0x000000000090f1d8 <+56>: e0 17 9f 1a cset w0, eq // eq = none
The benchmarking results showed 3X performance gain on Cavium ThunderX2 and
13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
Here is the example test result on TX2:
*** spinlock_autotest without this patch ***
Core [123] Cost Time = 639822 us
Core [124] Cost Time = 633253 us
Core [125] Cost Time = 646030 us
Core [126] Cost Time = 643189 us
Core [127] Cost Time = 647039 us
Total Cost Time = 95433298 us
*** spinlock_autotest with this patch ***
Core [123] Cost Time = 163615 us
Core [124] Cost Time = 166471 us
Core [125] Cost Time = 189044 us
Core [126] Cost Time = 195745 us
Core [127] Cost Time = 78423 us
Total Cost Time = 27339656 us
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Steve Capper <Steve.Capper@arm.com>
---
lib/librte_eal/common/include/generic/rte_spinlock.h | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index c4c3fc31e..87ae7a4f1 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
static inline void
rte_spinlock_lock(rte_spinlock_t *sl)
{
- while (__sync_lock_test_and_set(&sl->locked, 1))
- while(sl->locked)
+ int exp = 0;
+
+ while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
+ __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
+ while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
rte_pause();
+ exp = 0;
+ }
}
#endif
@@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
static inline void
rte_spinlock_unlock (rte_spinlock_t *sl)
{
- __sync_lock_release(&sl->locked);
+ __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
#endif
@@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
static inline int
rte_spinlock_trylock (rte_spinlock_t *sl)
{
- return __sync_lock_test_and_set(&sl->locked,1) == 0;
+ int exp = 0;
+ return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
+ 0, /* disallow spurious failure */
+ __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}
#endif
@@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
*/
static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
{
- return sl->locked;
+ return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
}
/**
--
2.11.0
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v2 5/5] eal: fix clang compilation error on x86
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 0/5] spinlock optimization and test case enhancements Gavin Hu
` (3 preceding siblings ...)
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 4/5] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
@ 2018-12-20 17:42 ` Gavin Hu
4 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2018-12-20 17:42 UTC (permalink / raw)
To: dev
Cc: thomas, bruce.richardson, jerinj, hemant.agrawal, ferruh.yigit,
Honnappa.Nagarahalli, nd, Gavin Hu, stable
When CONFIG_RTE_FORCE_INTRINSICS is enabled for x86, the clang
compilation error was:
include/generic/rte_atomic.h:215:9: error:
implicit declaration of function '__atomic_exchange_2'
is invalid in C99
include/generic/rte_atomic.h:494:9: error:
implicit declaration of function '__atomic_exchange_4'
is invalid in C99
include/generic/rte_atomic.h:772:9: error:
implicit declaration of function '__atomic_exchange_8'
is invalid in C99
Use __atomic_exchange_n instead of __atomic_exchange_(2/4/8).
For more information, please refer to:
http://mails.dpdk.org/archives/dev/2018-April/096776.html
Fixes: 7bdccb93078e ("eal: fix ARM build with clang")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
---
lib/librte_eal/common/include/generic/rte_atomic.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
index b99ba4688..ed5b125b3 100644
--- a/lib/librte_eal/common/include/generic/rte_atomic.h
+++ b/lib/librte_eal/common/include/generic/rte_atomic.h
@@ -212,7 +212,7 @@ rte_atomic16_exchange(volatile uint16_t *dst, uint16_t val);
static inline uint16_t
rte_atomic16_exchange(volatile uint16_t *dst, uint16_t val)
{
-#if defined(RTE_ARCH_ARM64) && defined(RTE_TOOLCHAIN_CLANG)
+#if defined(RTE_TOOLCHAIN_CLANG)
return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST);
#else
return __atomic_exchange_2(dst, val, __ATOMIC_SEQ_CST);
@@ -495,7 +495,7 @@ rte_atomic32_exchange(volatile uint32_t *dst, uint32_t val);
static inline uint32_t
rte_atomic32_exchange(volatile uint32_t *dst, uint32_t val)
{
-#if defined(RTE_ARCH_ARM64) && defined(RTE_TOOLCHAIN_CLANG)
+#if defined(RTE_TOOLCHAIN_CLANG)
return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST);
#else
return __atomic_exchange_4(dst, val, __ATOMIC_SEQ_CST);
@@ -777,7 +777,7 @@ rte_atomic64_exchange(volatile uint64_t *dst, uint64_t val);
static inline uint64_t
rte_atomic64_exchange(volatile uint64_t *dst, uint64_t val)
{
-#if defined(RTE_ARCH_ARM64) && defined(RTE_TOOLCHAIN_CLANG)
+#if defined(RTE_TOOLCHAIN_CLANG)
return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST);
#else
return __atomic_exchange_8(dst, val, __ATOMIC_SEQ_CST);
--
2.11.0
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v4 0/4] spinlock optimization and test case enhancements
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (5 preceding siblings ...)
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 0/5] spinlock optimization and test case enhancements Gavin Hu
@ 2019-01-15 7:54 ` gavin hu
2019-01-15 7:54 ` [dpdk-dev] [PATCH v4 1/4] eal: fix clang compilation error on x86 gavin hu
` (20 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: gavin hu @ 2019-01-15 7:54 UTC (permalink / raw)
To: dev
Cc: thomas, jerinj, hemant.agrawal, stephen, Honnappa.Nagarahalli,
gavin.hu, nd
V4:
1. Drop one patch for the test case to get time precisely as the overhead
of getting time is amortized already in another patch.
2. Drop the ticket lock patch from this series as there are no dependency
between them, the ticket lock patch was submitted separately:
http://patchwork.dpdk.org/patch/49770/
3. Define volatile variable in patch #3 to be more realistic for spinlock
protection(avoid optimization be compiler).
4. Fix typos.
V3:
1. Implemented the ticket lock to improve the fairness and predictability.
The locks are obtained in the order of requested.
V2:
1. FORCE_INTRINCIS is still an option for ppc/x86, although not is use
by default, so don't remove it from generic file.
2. Fix the clang compiler error on x86 when the above FORCE_INTRINSICS
is enabled.
V1:
1. Remove the 1us delay outside of the locked region to really benchmark
the spinlock acquire/release performance, not the delay API.
2. Use the precise version of getting timestamps for more precise
benchmarking results.
3. Amortize the overhead of getting the timestamp by 10000 loops.
4. Move the arm specific implementation to arm folder to remove the
hardcoded implementation.
5. Use atomic primitives, which translate to one-way barriers, instead of
two-way sync primitives, to optimize for performance.
Gavin Hu (4):
eal: fix clang compilation error on x86
test/spinlock: remove 1us delay for correct benchmarking
test/spinlock: amortize the cost of getting time
spinlock: reimplement with atomic one-way barrier builtins
lib/librte_eal/common/include/generic/rte_atomic.h | 6 ++---
.../common/include/generic/rte_spinlock.h | 18 +++++++++----
test/test/test_spinlock.c | 31 +++++++++++-----------
3 files changed, 32 insertions(+), 23 deletions(-)
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v4 1/4] eal: fix clang compilation error on x86
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (6 preceding siblings ...)
2019-01-15 7:54 ` [dpdk-dev] [PATCH v4 0/4] spinlock optimization and test case enhancements gavin hu
@ 2019-01-15 7:54 ` gavin hu
2019-01-15 7:54 ` [dpdk-dev] [PATCH v4 2/4] test/spinlock: remove 1us delay for correct benchmarking gavin hu
` (19 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: gavin hu @ 2019-01-15 7:54 UTC (permalink / raw)
To: dev
Cc: thomas, jerinj, hemant.agrawal, stephen, Honnappa.Nagarahalli,
gavin.hu, nd, stable
From: Gavin Hu <gavin.hu@arm.com>
When CONFIG_RTE_FORCE_INTRINSICS is enabled for x86, the clang
compilation error was:
include/generic/rte_atomic.h:215:9: error:
implicit declaration of function '__atomic_exchange_2'
is invalid in C99
include/generic/rte_atomic.h:494:9: error:
implicit declaration of function '__atomic_exchange_4'
is invalid in C99
include/generic/rte_atomic.h:772:9: error:
implicit declaration of function '__atomic_exchange_8'
is invalid in C99
Use __atomic_exchange_n instead of __atomic_exchange_(2/4/8).
For more information, please refer to:
http://mails.dpdk.org/archives/dev/2018-April/096776.html
Fixes: 7bdccb93078e ("eal: fix ARM build with clang")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
lib/librte_eal/common/include/generic/rte_atomic.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
index b99ba46..ed5b125 100644
--- a/lib/librte_eal/common/include/generic/rte_atomic.h
+++ b/lib/librte_eal/common/include/generic/rte_atomic.h
@@ -212,7 +212,7 @@ rte_atomic16_exchange(volatile uint16_t *dst, uint16_t val);
static inline uint16_t
rte_atomic16_exchange(volatile uint16_t *dst, uint16_t val)
{
-#if defined(RTE_ARCH_ARM64) && defined(RTE_TOOLCHAIN_CLANG)
+#if defined(RTE_TOOLCHAIN_CLANG)
return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST);
#else
return __atomic_exchange_2(dst, val, __ATOMIC_SEQ_CST);
@@ -495,7 +495,7 @@ rte_atomic32_exchange(volatile uint32_t *dst, uint32_t val);
static inline uint32_t
rte_atomic32_exchange(volatile uint32_t *dst, uint32_t val)
{
-#if defined(RTE_ARCH_ARM64) && defined(RTE_TOOLCHAIN_CLANG)
+#if defined(RTE_TOOLCHAIN_CLANG)
return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST);
#else
return __atomic_exchange_4(dst, val, __ATOMIC_SEQ_CST);
@@ -777,7 +777,7 @@ rte_atomic64_exchange(volatile uint64_t *dst, uint64_t val);
static inline uint64_t
rte_atomic64_exchange(volatile uint64_t *dst, uint64_t val)
{
-#if defined(RTE_ARCH_ARM64) && defined(RTE_TOOLCHAIN_CLANG)
+#if defined(RTE_TOOLCHAIN_CLANG)
return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST);
#else
return __atomic_exchange_8(dst, val, __ATOMIC_SEQ_CST);
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v4 2/4] test/spinlock: remove 1us delay for correct benchmarking
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (7 preceding siblings ...)
2019-01-15 7:54 ` [dpdk-dev] [PATCH v4 1/4] eal: fix clang compilation error on x86 gavin hu
@ 2019-01-15 7:54 ` gavin hu
2019-01-15 7:54 ` [dpdk-dev] [PATCH v4 3/4] test/spinlock: amortize the cost of getting time gavin hu
` (18 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: gavin hu @ 2019-01-15 7:54 UTC (permalink / raw)
To: dev
Cc: thomas, jerinj, hemant.agrawal, stephen, Honnappa.Nagarahalli,
gavin.hu, nd
From: Gavin Hu <gavin.hu@arm.com>
The test is to benchmark the performance of spinlock by counting the
number of spinlock acquire and release operations within the specified
time.
A typical pair of lock and unlock operations costs tens or hundreds of
nano seconds, in comparison to this, delaying 1 us outside of the locked
region is too much, compromising the goal of benchmarking the lock and
unlock performance.
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <Ruifeng.Wang@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
test/test/test_spinlock.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/test/test/test_spinlock.c b/test/test/test_spinlock.c
index 73bff12..6795195 100644
--- a/test/test/test_spinlock.c
+++ b/test/test/test_spinlock.c
@@ -120,8 +120,6 @@ load_loop_fn(void *func_param)
lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- /* delay to make lock duty cycle slighlty realistic */
- rte_delay_us(1);
time_diff = rte_get_timer_cycles() - begin;
}
lock_count[lcore] = lcount;
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v4 3/4] test/spinlock: amortize the cost of getting time
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (8 preceding siblings ...)
2019-01-15 7:54 ` [dpdk-dev] [PATCH v4 2/4] test/spinlock: remove 1us delay for correct benchmarking gavin hu
@ 2019-01-15 7:54 ` gavin hu
2019-01-15 7:54 ` [dpdk-dev] [PATCH v4 4/4] spinlock: reimplement with atomic one-way barrier builtins gavin hu
` (17 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: gavin hu @ 2019-01-15 7:54 UTC (permalink / raw)
To: dev
Cc: thomas, jerinj, hemant.agrawal, stephen, Honnappa.Nagarahalli,
gavin.hu, nd
From: Gavin Hu <gavin.hu@arm.com>
Instead of getting timestamps per iteration, amortize its overhead
can help getting more precise benchmarking results.
Change-Id: I32642b181f883cd57bc1f7c4f56a4744b0438781
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
---
test/test/test_spinlock.c | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)
diff --git a/test/test/test_spinlock.c b/test/test/test_spinlock.c
index 6795195..6ac7495 100644
--- a/test/test/test_spinlock.c
+++ b/test/test/test_spinlock.c
@@ -96,16 +96,16 @@ test_spinlock_recursive_per_core(__attribute__((unused)) void *arg)
}
static rte_spinlock_t lk = RTE_SPINLOCK_INITIALIZER;
-static uint64_t lock_count[RTE_MAX_LCORE] = {0};
+static uint64_t time_count[RTE_MAX_LCORE] = {0};
-#define TIME_MS 100
+#define MAX_LOOP 10000
static int
load_loop_fn(void *func_param)
{
uint64_t time_diff = 0, begin;
uint64_t hz = rte_get_timer_hz();
- uint64_t lcount = 0;
+ volatile uint64_t lcount = 0;
const int use_lock = *(int*)func_param;
const unsigned lcore = rte_lcore_id();
@@ -114,15 +114,15 @@ load_loop_fn(void *func_param)
while (rte_atomic32_read(&synchro) == 0);
begin = rte_get_timer_cycles();
- while (time_diff < hz * TIME_MS / 1000) {
+ while (lcount < MAX_LOOP) {
if (use_lock)
rte_spinlock_lock(&lk);
lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- time_diff = rte_get_timer_cycles() - begin;
}
- lock_count[lcore] = lcount;
+ time_diff = rte_get_timer_cycles() - begin;
+ time_count[lcore] = time_diff * 1000000 / hz;
return 0;
}
@@ -136,14 +136,16 @@ test_spinlock_perf(void)
printf("\nTest with no lock on single core...\n");
load_loop_fn(&lock);
- printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
- memset(lock_count, 0, sizeof(lock_count));
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+ time_count[lcore]);
+ memset(time_count, 0, sizeof(time_count));
printf("\nTest with lock on single core...\n");
lock = 1;
load_loop_fn(&lock);
- printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
- memset(lock_count, 0, sizeof(lock_count));
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+ time_count[lcore]);
+ memset(time_count, 0, sizeof(time_count));
printf("\nTest with lock on %u cores...\n", rte_lcore_count());
@@ -158,11 +160,12 @@ test_spinlock_perf(void)
rte_eal_mp_wait_lcore();
RTE_LCORE_FOREACH(i) {
- printf("Core [%u] count = %"PRIu64"\n", i, lock_count[i]);
- total += lock_count[i];
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", i,
+ time_count[i]);
+ total += time_count[i];
}
- printf("Total count = %"PRIu64"\n", total);
+ printf("Total Cost Time = %"PRIu64" us\n", total);
return 0;
}
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v4 4/4] spinlock: reimplement with atomic one-way barrier builtins
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (9 preceding siblings ...)
2019-01-15 7:54 ` [dpdk-dev] [PATCH v4 3/4] test/spinlock: amortize the cost of getting time gavin hu
@ 2019-01-15 7:54 ` gavin hu
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 0/4] spinlock optimization and test case enhancements gavin hu
` (16 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: gavin hu @ 2019-01-15 7:54 UTC (permalink / raw)
To: dev
Cc: thomas, jerinj, hemant.agrawal, stephen, Honnappa.Nagarahalli,
gavin.hu, nd
From: Gavin Hu <gavin.hu@arm.com>
The __sync builtin based implementation generates full memory barriers
('dmb ish') on Arm platforms. Using C11 atomic builtins to generate one way
barriers.
Here is the assembly code of __sync_compare_and_swap builtin.
__sync_bool_compare_and_swap(dst, exp, src);
0x000000000090f1b0 <+16>: e0 07 40 f9 ldr x0, [sp, #8]
0x000000000090f1b4 <+20>: e1 0f 40 79 ldrh w1, [sp, #6]
0x000000000090f1b8 <+24>: e2 0b 40 79 ldrh w2, [sp, #4]
0x000000000090f1bc <+28>: 21 3c 00 12 and w1, w1, #0xffff
0x000000000090f1c0 <+32>: 03 7c 5f 48 ldxrh w3, [x0]
0x000000000090f1c4 <+36>: 7f 00 01 6b cmp w3, w1
0x000000000090f1c8 <+40>: 61 00 00 54 b.ne 0x90f1d4
<rte_atomic16_cmpset+52> // b.any
0x000000000090f1cc <+44>: 02 fc 04 48 stlxrh w4, w2, [x0]
0x000000000090f1d0 <+48>: 84 ff ff 35 cbnz w4, 0x90f1c0
<rte_atomic16_cmpset+32>
0x000000000090f1d4 <+52>: bf 3b 03 d5 dmb ish
0x000000000090f1d8 <+56>: e0 17 9f 1a cset w0, eq // eq = none
The benchmarking results showed 3X performance gain on Cavium ThunderX2 and
13% on Qualcomm Falkor and 3.7% on 4-A72 Marvell macchiatobin.
Here is the example test result on TX2:
*** spinlock_autotest without this patch ***
Core [123] Cost Time = 639822 us
Core [124] Cost Time = 633253 us
Core [125] Cost Time = 646030 us
Core [126] Cost Time = 643189 us
Core [127] Cost Time = 647039 us
Total Cost Time = 95433298 us
*** spinlock_autotest with this patch ***
Core [123] Cost Time = 163615 us
Core [124] Cost Time = 166471 us
Core [125] Cost Time = 189044 us
Core [126] Cost Time = 195745 us
Core [127] Cost Time = 78423 us
Total Cost Time = 27339656 us
Change-Id: I888f22120697d42d5a63fda6b4f77d93a0409aab
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Steve Capper <Steve.Capper@arm.com>
---
lib/librte_eal/common/include/generic/rte_spinlock.h | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index c4c3fc3..87ae7a4 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
static inline void
rte_spinlock_lock(rte_spinlock_t *sl)
{
- while (__sync_lock_test_and_set(&sl->locked, 1))
- while(sl->locked)
+ int exp = 0;
+
+ while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
+ __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
+ while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
rte_pause();
+ exp = 0;
+ }
}
#endif
@@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
static inline void
rte_spinlock_unlock (rte_spinlock_t *sl)
{
- __sync_lock_release(&sl->locked);
+ __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
#endif
@@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
static inline int
rte_spinlock_trylock (rte_spinlock_t *sl)
{
- return __sync_lock_test_and_set(&sl->locked,1) == 0;
+ int exp = 0;
+ return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
+ 0, /* disallow spurious failure */
+ __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}
#endif
@@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
*/
static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
{
- return sl->locked;
+ return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
}
/**
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v5 0/4] spinlock optimization and test case enhancements
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (10 preceding siblings ...)
2019-01-15 7:54 ` [dpdk-dev] [PATCH v4 4/4] spinlock: reimplement with atomic one-way barrier builtins gavin hu
@ 2019-01-15 10:32 ` gavin hu
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 1/4] eal: fix clang compilation error on x86 gavin hu
` (15 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: gavin hu @ 2019-01-15 10:32 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
gavin.hu, olivier.matz, bruce.richardson
V5: Remove ChangeId(sorry for that)
V4:
1. Drop one patch for the test case to get time precisely as the overhead
of getting time is amortized already in another patch.
2. Drop the ticket lock patch from this series as there are no dependency
between them, the ticket lock patch was submitted separately:
http://patchwork.dpdk.org/patch/49770/
3. Define volatile variable in patch #3 to be more realistic for spinlock
protection(avoid optimization be compiler).
4. Fix typos.
V3:
1. Implemented the ticket lock to improve the fairness and predictability.
The locks are obtained in the order of requested.
V2:
1. FORCE_INTRINCIS is still an option for ppc/x86, although not is use
by default, so don't remove it from generic file.
2. Fix the clang compiler error on x86 when the above FORCE_INTRINSICS
is enabled.
V1:
1. Remove the 1us delay outside of the locked region to really benchmark
the spinlock acquire/release performance, not the delay API.
2. Use the precise version of getting timestamps for more precise
benchmarking results.
3. Amortize the overhead of getting the timestamp by 10000 loops.
4. Move the arm specific implementation to arm folder to remove the
hardcoded implementation.
5. Use atomic primitives, which translate to one-way barriers, instead of
two-way sync primitives, to optimize for performance.
Gavin Hu (4):
eal: fix clang compilation error on x86
test/spinlock: remove 1us delay for correct benchmarking
test/spinlock: amortize the cost of getting time
spinlock: reimplement with atomic one-way barrier builtins
lib/librte_eal/common/include/generic/rte_atomic.h | 6 ++---
.../common/include/generic/rte_spinlock.h | 18 +++++++++----
test/test/test_spinlock.c | 31 +++++++++++-----------
3 files changed, 32 insertions(+), 23 deletions(-)
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v5 1/4] eal: fix clang compilation error on x86
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (11 preceding siblings ...)
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 0/4] spinlock optimization and test case enhancements gavin hu
@ 2019-01-15 10:32 ` gavin hu
2019-01-15 17:42 ` Honnappa Nagarahalli
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 2/4] test/spinlock: remove 1us delay for correct benchmarking gavin hu
` (14 subsequent siblings)
27 siblings, 1 reply; 54+ messages in thread
From: gavin hu @ 2019-01-15 10:32 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
gavin.hu, olivier.matz, bruce.richardson, stable
From: Gavin Hu <gavin.hu@arm.com>
When CONFIG_RTE_FORCE_INTRINSICS is enabled for x86, the clang
compilation error was:
include/generic/rte_atomic.h:215:9: error:
implicit declaration of function '__atomic_exchange_2'
is invalid in C99
include/generic/rte_atomic.h:494:9: error:
implicit declaration of function '__atomic_exchange_4'
is invalid in C99
include/generic/rte_atomic.h:772:9: error:
implicit declaration of function '__atomic_exchange_8'
is invalid in C99
Use __atomic_exchange_n instead of __atomic_exchange_(2/4/8).
For more information, please refer to:
http://mails.dpdk.org/archives/dev/2018-April/096776.html
Fixes: 7bdccb93078e ("eal: fix ARM build with clang")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
lib/librte_eal/common/include/generic/rte_atomic.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
index b99ba46..ed5b125 100644
--- a/lib/librte_eal/common/include/generic/rte_atomic.h
+++ b/lib/librte_eal/common/include/generic/rte_atomic.h
@@ -212,7 +212,7 @@ rte_atomic16_exchange(volatile uint16_t *dst, uint16_t val);
static inline uint16_t
rte_atomic16_exchange(volatile uint16_t *dst, uint16_t val)
{
-#if defined(RTE_ARCH_ARM64) && defined(RTE_TOOLCHAIN_CLANG)
+#if defined(RTE_TOOLCHAIN_CLANG)
return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST);
#else
return __atomic_exchange_2(dst, val, __ATOMIC_SEQ_CST);
@@ -495,7 +495,7 @@ rte_atomic32_exchange(volatile uint32_t *dst, uint32_t val);
static inline uint32_t
rte_atomic32_exchange(volatile uint32_t *dst, uint32_t val)
{
-#if defined(RTE_ARCH_ARM64) && defined(RTE_TOOLCHAIN_CLANG)
+#if defined(RTE_TOOLCHAIN_CLANG)
return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST);
#else
return __atomic_exchange_4(dst, val, __ATOMIC_SEQ_CST);
@@ -777,7 +777,7 @@ rte_atomic64_exchange(volatile uint64_t *dst, uint64_t val);
static inline uint64_t
rte_atomic64_exchange(volatile uint64_t *dst, uint64_t val)
{
-#if defined(RTE_ARCH_ARM64) && defined(RTE_TOOLCHAIN_CLANG)
+#if defined(RTE_TOOLCHAIN_CLANG)
return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST);
#else
return __atomic_exchange_8(dst, val, __ATOMIC_SEQ_CST);
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v5 2/4] test/spinlock: remove 1us delay for correct benchmarking
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (12 preceding siblings ...)
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 1/4] eal: fix clang compilation error on x86 gavin hu
@ 2019-01-15 10:32 ` gavin hu
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 3/4] test/spinlock: amortize the cost of getting time gavin hu
` (13 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: gavin hu @ 2019-01-15 10:32 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
gavin.hu, olivier.matz, bruce.richardson
From: Gavin Hu <gavin.hu@arm.com>
The test is to benchmark the performance of spinlock by counting the
number of spinlock acquire and release operations within the specified
time.
A typical pair of lock and unlock operations costs tens or hundreds of
nano seconds, in comparison to this, delaying 1 us outside of the locked
region is too much, compromising the goal of benchmarking the lock and
unlock performance.
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <Ruifeng.Wang@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
test/test/test_spinlock.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/test/test/test_spinlock.c b/test/test/test_spinlock.c
index 73bff12..6795195 100644
--- a/test/test/test_spinlock.c
+++ b/test/test/test_spinlock.c
@@ -120,8 +120,6 @@ load_loop_fn(void *func_param)
lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- /* delay to make lock duty cycle slighlty realistic */
- rte_delay_us(1);
time_diff = rte_get_timer_cycles() - begin;
}
lock_count[lcore] = lcount;
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v5 3/4] test/spinlock: amortize the cost of getting time
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (13 preceding siblings ...)
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 2/4] test/spinlock: remove 1us delay for correct benchmarking gavin hu
@ 2019-01-15 10:32 ` gavin hu
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 4/4] spinlock: reimplement with atomic one-way barrier builtins gavin hu
` (12 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: gavin hu @ 2019-01-15 10:32 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
gavin.hu, olivier.matz, bruce.richardson
From: Gavin Hu <gavin.hu@arm.com>
Instead of getting timestamps per iteration, amortize its overhead
can help getting more precise benchmarking results.
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
---
test/test/test_spinlock.c | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)
diff --git a/test/test/test_spinlock.c b/test/test/test_spinlock.c
index 6795195..6ac7495 100644
--- a/test/test/test_spinlock.c
+++ b/test/test/test_spinlock.c
@@ -96,16 +96,16 @@ test_spinlock_recursive_per_core(__attribute__((unused)) void *arg)
}
static rte_spinlock_t lk = RTE_SPINLOCK_INITIALIZER;
-static uint64_t lock_count[RTE_MAX_LCORE] = {0};
+static uint64_t time_count[RTE_MAX_LCORE] = {0};
-#define TIME_MS 100
+#define MAX_LOOP 10000
static int
load_loop_fn(void *func_param)
{
uint64_t time_diff = 0, begin;
uint64_t hz = rte_get_timer_hz();
- uint64_t lcount = 0;
+ volatile uint64_t lcount = 0;
const int use_lock = *(int*)func_param;
const unsigned lcore = rte_lcore_id();
@@ -114,15 +114,15 @@ load_loop_fn(void *func_param)
while (rte_atomic32_read(&synchro) == 0);
begin = rte_get_timer_cycles();
- while (time_diff < hz * TIME_MS / 1000) {
+ while (lcount < MAX_LOOP) {
if (use_lock)
rte_spinlock_lock(&lk);
lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- time_diff = rte_get_timer_cycles() - begin;
}
- lock_count[lcore] = lcount;
+ time_diff = rte_get_timer_cycles() - begin;
+ time_count[lcore] = time_diff * 1000000 / hz;
return 0;
}
@@ -136,14 +136,16 @@ test_spinlock_perf(void)
printf("\nTest with no lock on single core...\n");
load_loop_fn(&lock);
- printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
- memset(lock_count, 0, sizeof(lock_count));
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+ time_count[lcore]);
+ memset(time_count, 0, sizeof(time_count));
printf("\nTest with lock on single core...\n");
lock = 1;
load_loop_fn(&lock);
- printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
- memset(lock_count, 0, sizeof(lock_count));
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+ time_count[lcore]);
+ memset(time_count, 0, sizeof(time_count));
printf("\nTest with lock on %u cores...\n", rte_lcore_count());
@@ -158,11 +160,12 @@ test_spinlock_perf(void)
rte_eal_mp_wait_lcore();
RTE_LCORE_FOREACH(i) {
- printf("Core [%u] count = %"PRIu64"\n", i, lock_count[i]);
- total += lock_count[i];
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", i,
+ time_count[i]);
+ total += time_count[i];
}
- printf("Total count = %"PRIu64"\n", total);
+ printf("Total Cost Time = %"PRIu64" us\n", total);
return 0;
}
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v5 4/4] spinlock: reimplement with atomic one-way barrier builtins
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (14 preceding siblings ...)
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 3/4] test/spinlock: amortize the cost of getting time gavin hu
@ 2019-01-15 10:32 ` gavin hu
2019-03-08 7:16 ` [dpdk-dev] [PATCH v6 0/3] generic spinlock optimization and test case enhancements Gavin Hu
` (11 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: gavin hu @ 2019-01-15 10:32 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
gavin.hu, olivier.matz, bruce.richardson
From: Gavin Hu <gavin.hu@arm.com>
The __sync builtin based implementation generates full memory barriers
('dmb ish') on Arm platforms. Using C11 atomic builtins to generate one way
barriers.
Here is the assembly code of __sync_compare_and_swap builtin.
__sync_bool_compare_and_swap(dst, exp, src);
0x000000000090f1b0 <+16>: e0 07 40 f9 ldr x0, [sp, #8]
0x000000000090f1b4 <+20>: e1 0f 40 79 ldrh w1, [sp, #6]
0x000000000090f1b8 <+24>: e2 0b 40 79 ldrh w2, [sp, #4]
0x000000000090f1bc <+28>: 21 3c 00 12 and w1, w1, #0xffff
0x000000000090f1c0 <+32>: 03 7c 5f 48 ldxrh w3, [x0]
0x000000000090f1c4 <+36>: 7f 00 01 6b cmp w3, w1
0x000000000090f1c8 <+40>: 61 00 00 54 b.ne 0x90f1d4
<rte_atomic16_cmpset+52> // b.any
0x000000000090f1cc <+44>: 02 fc 04 48 stlxrh w4, w2, [x0]
0x000000000090f1d0 <+48>: 84 ff ff 35 cbnz w4, 0x90f1c0
<rte_atomic16_cmpset+32>
0x000000000090f1d4 <+52>: bf 3b 03 d5 dmb ish
0x000000000090f1d8 <+56>: e0 17 9f 1a cset w0, eq // eq = none
The benchmarking results showed 3X performance gain on Cavium ThunderX2 and
13% on Qualcomm Falkor and 3.7% on 4-A72 Marvell macchiatobin.
Here is the example test result on TX2:
*** spinlock_autotest without this patch ***
Core [123] Cost Time = 639822 us
Core [124] Cost Time = 633253 us
Core [125] Cost Time = 646030 us
Core [126] Cost Time = 643189 us
Core [127] Cost Time = 647039 us
Total Cost Time = 95433298 us
*** spinlock_autotest with this patch ***
Core [123] Cost Time = 163615 us
Core [124] Cost Time = 166471 us
Core [125] Cost Time = 189044 us
Core [126] Cost Time = 195745 us
Core [127] Cost Time = 78423 us
Total Cost Time = 27339656 us
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Steve Capper <Steve.Capper@arm.com>
---
lib/librte_eal/common/include/generic/rte_spinlock.h | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index c4c3fc3..87ae7a4 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
static inline void
rte_spinlock_lock(rte_spinlock_t *sl)
{
- while (__sync_lock_test_and_set(&sl->locked, 1))
- while(sl->locked)
+ int exp = 0;
+
+ while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
+ __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
+ while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
rte_pause();
+ exp = 0;
+ }
}
#endif
@@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
static inline void
rte_spinlock_unlock (rte_spinlock_t *sl)
{
- __sync_lock_release(&sl->locked);
+ __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
#endif
@@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
static inline int
rte_spinlock_trylock (rte_spinlock_t *sl)
{
- return __sync_lock_test_and_set(&sl->locked,1) == 0;
+ int exp = 0;
+ return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
+ 0, /* disallow spurious failure */
+ __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}
#endif
@@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
*/
static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
{
- return sl->locked;
+ return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
}
/**
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [PATCH v5 1/4] eal: fix clang compilation error on x86
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 1/4] eal: fix clang compilation error on x86 gavin hu
@ 2019-01-15 17:42 ` Honnappa Nagarahalli
0 siblings, 0 replies; 54+ messages in thread
From: Honnappa Nagarahalli @ 2019-01-15 17:42 UTC (permalink / raw)
To: Gavin Hu (Arm Technology China), dev
Cc: nd, thomas, jerinj, hemant.agrawal,
Gavin Hu (Arm Technology China),
olivier.matz, bruce.richardson, stable, nd
>
> From: Gavin Hu <gavin.hu@arm.com>
>
> When CONFIG_RTE_FORCE_INTRINSICS is enabled for x86, the clang
> compilation error was:
> include/generic/rte_atomic.h:215:9: error:
> implicit declaration of function '__atomic_exchange_2'
> is invalid in C99
> include/generic/rte_atomic.h:494:9: error:
> implicit declaration of function '__atomic_exchange_4'
> is invalid in C99
> include/generic/rte_atomic.h:772:9: error:
> implicit declaration of function '__atomic_exchange_8'
> is invalid in C99
>
> Use __atomic_exchange_n instead of __atomic_exchange_(2/4/8).
> For more information, please refer to:
> http://mails.dpdk.org/archives/dev/2018-April/096776.html
>
> Fixes: 7bdccb93078e ("eal: fix ARM build with clang")
> Cc: stable@dpdk.org
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
> ---
> lib/librte_eal/common/include/generic/rte_atomic.h | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h
> b/lib/librte_eal/common/include/generic/rte_atomic.h
> index b99ba46..ed5b125 100644
> --- a/lib/librte_eal/common/include/generic/rte_atomic.h
> +++ b/lib/librte_eal/common/include/generic/rte_atomic.h
> @@ -212,7 +212,7 @@ rte_atomic16_exchange(volatile uint16_t *dst,
> uint16_t val); static inline uint16_t rte_atomic16_exchange(volatile
> uint16_t *dst, uint16_t val) { -#if defined(RTE_ARCH_ARM64) &&
> defined(RTE_TOOLCHAIN_CLANG)
> +#if defined(RTE_TOOLCHAIN_CLANG)
Please check http://mails.dpdk.org/archives/dev/2019-January/123331.html
This needs to be changed to (__clang__). This applies for other similar changes here.
> return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST); #else
> return __atomic_exchange_2(dst, val, __ATOMIC_SEQ_CST); @@ -
> 495,7 +495,7 @@ rte_atomic32_exchange(volatile uint32_t *dst, uint32_t
> val); static inline uint32_t rte_atomic32_exchange(volatile uint32_t *dst,
> uint32_t val) { -#if defined(RTE_ARCH_ARM64) &&
> defined(RTE_TOOLCHAIN_CLANG)
> +#if defined(RTE_TOOLCHAIN_CLANG)
> return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST); #else
> return __atomic_exchange_4(dst, val, __ATOMIC_SEQ_CST); @@ -
> 777,7 +777,7 @@ rte_atomic64_exchange(volatile uint64_t *dst, uint64_t
> val); static inline uint64_t rte_atomic64_exchange(volatile uint64_t *dst,
> uint64_t val) { -#if defined(RTE_ARCH_ARM64) &&
> defined(RTE_TOOLCHAIN_CLANG)
> +#if defined(RTE_TOOLCHAIN_CLANG)
> return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST); #else
> return __atomic_exchange_8(dst, val, __ATOMIC_SEQ_CST);
> --
> 2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v6 0/3] generic spinlock optimization and test case enhancements
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (15 preceding siblings ...)
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 4/4] spinlock: reimplement with atomic one-way barrier builtins gavin hu
@ 2019-03-08 7:16 ` Gavin Hu
2019-03-08 7:16 ` [dpdk-dev] [PATCH v6 1/3] test/spinlock: dealy 1 us to create contention Gavin Hu
` (10 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2019-03-08 7:16 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, gavin.hu, i.maximets, chaozhu
V6: Rebase and drop the first patch as a similar fix was already merged.
V5: Remove ChangeId(sorry for that)
V4:
1. Drop one patch for the test case to get time precisely as the overhead
of getting time is amortized already in another patch.
2. Drop the ticket lock patch from this series as there are no dependency
between them, the ticket lock patch was submitted separately:
http://patchwork.dpdk.org/patch/49770/
3. Define volatile variable in patch #3 to be more realistic for spinlock
protection(avoid optimization be compiler).
4. Fix typos.
V3:
1. Implemented the ticket lock to improve the fairness and predictability.
The locks are obtained in the order of requested.
V2:
1. FORCE_INTRINCIS is still an option for ppc/x86, although not is use
by default, so don't remove it from generic file.
2. Fix the clang compiler error on x86 when the above FORCE_INTRINSICS
is enabled.
V1:
1. Remove the 1us delay outside of the locked region to really benchmark
the spinlock acquire/release performance, not the delay API.
2. Use the precise version of getting timestamps for more precise
benchmarking results.
3. Amortize the overhead of getting the timestamp by 10000 loops.
4. Move the arm specific implementation to arm folder to remove the
hardcoded implementation.
5. Use atomic primitives, which translate to one-way barriers, instead of
two-way sync primitives, to optimize for performance.
Gavin Hu (3):
test/spinlock: dealy 1 us to create contention
test/spinlock: amortize the cost of getting time
spinlock: reimplement with atomic one-way barrier builtins
app/test/test_spinlock.c | 31 +++++++++++-----------
.../common/include/generic/rte_spinlock.h | 18 +++++++++----
2 files changed, 29 insertions(+), 20 deletions(-)
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v6 1/3] test/spinlock: dealy 1 us to create contention
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (16 preceding siblings ...)
2019-03-08 7:16 ` [dpdk-dev] [PATCH v6 0/3] generic spinlock optimization and test case enhancements Gavin Hu
@ 2019-03-08 7:16 ` Gavin Hu
2019-03-08 7:16 ` [dpdk-dev] [PATCH v6 2/3] test/spinlock: amortize the cost of getting time Gavin Hu
` (9 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2019-03-08 7:16 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, gavin.hu, i.maximets, chaozhu, stable
Quickly taking and releasing the spinlock can't hit the contentions,
delay 1 us to create contention stress, this can help really show
the performance of spinlock implementation.
Fixes: af75078fece3 ("first public release")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <Ruifeng.Wang@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
app/test/test_spinlock.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/app/test/test_spinlock.c b/app/test/test_spinlock.c
index 73bff12..6795195 100644
--- a/app/test/test_spinlock.c
+++ b/app/test/test_spinlock.c
@@ -120,8 +120,6 @@ load_loop_fn(void *func_param)
lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- /* delay to make lock duty cycle slighlty realistic */
- rte_delay_us(1);
time_diff = rte_get_timer_cycles() - begin;
}
lock_count[lcore] = lcount;
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v6 2/3] test/spinlock: amortize the cost of getting time
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (17 preceding siblings ...)
2019-03-08 7:16 ` [dpdk-dev] [PATCH v6 1/3] test/spinlock: dealy 1 us to create contention Gavin Hu
@ 2019-03-08 7:16 ` Gavin Hu
2019-03-08 7:16 ` [dpdk-dev] [PATCH v6 3/3] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
` (8 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2019-03-08 7:16 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, gavin.hu, i.maximets, chaozhu, stable
Instead of getting timestamps per iteration, amortize its overhead
can help getting more precise benchmarking results.
Fixes: af75078fece3 ("first public release")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
Reviewed-by: ruifeng wang <ruifeng.wang@arm.com>
Reviewed-by: honnappa nagarahalli <honnappa.nagarahalli@arm.com>
---
app/test/test_spinlock.c | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)
diff --git a/app/test/test_spinlock.c b/app/test/test_spinlock.c
index 6795195..6ac7495 100644
--- a/app/test/test_spinlock.c
+++ b/app/test/test_spinlock.c
@@ -96,16 +96,16 @@ test_spinlock_recursive_per_core(__attribute__((unused)) void *arg)
}
static rte_spinlock_t lk = RTE_SPINLOCK_INITIALIZER;
-static uint64_t lock_count[RTE_MAX_LCORE] = {0};
+static uint64_t time_count[RTE_MAX_LCORE] = {0};
-#define TIME_MS 100
+#define MAX_LOOP 10000
static int
load_loop_fn(void *func_param)
{
uint64_t time_diff = 0, begin;
uint64_t hz = rte_get_timer_hz();
- uint64_t lcount = 0;
+ volatile uint64_t lcount = 0;
const int use_lock = *(int*)func_param;
const unsigned lcore = rte_lcore_id();
@@ -114,15 +114,15 @@ load_loop_fn(void *func_param)
while (rte_atomic32_read(&synchro) == 0);
begin = rte_get_timer_cycles();
- while (time_diff < hz * TIME_MS / 1000) {
+ while (lcount < MAX_LOOP) {
if (use_lock)
rte_spinlock_lock(&lk);
lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- time_diff = rte_get_timer_cycles() - begin;
}
- lock_count[lcore] = lcount;
+ time_diff = rte_get_timer_cycles() - begin;
+ time_count[lcore] = time_diff * 1000000 / hz;
return 0;
}
@@ -136,14 +136,16 @@ test_spinlock_perf(void)
printf("\nTest with no lock on single core...\n");
load_loop_fn(&lock);
- printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
- memset(lock_count, 0, sizeof(lock_count));
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+ time_count[lcore]);
+ memset(time_count, 0, sizeof(time_count));
printf("\nTest with lock on single core...\n");
lock = 1;
load_loop_fn(&lock);
- printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
- memset(lock_count, 0, sizeof(lock_count));
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+ time_count[lcore]);
+ memset(time_count, 0, sizeof(time_count));
printf("\nTest with lock on %u cores...\n", rte_lcore_count());
@@ -158,11 +160,12 @@ test_spinlock_perf(void)
rte_eal_mp_wait_lcore();
RTE_LCORE_FOREACH(i) {
- printf("Core [%u] count = %"PRIu64"\n", i, lock_count[i]);
- total += lock_count[i];
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", i,
+ time_count[i]);
+ total += time_count[i];
}
- printf("Total count = %"PRIu64"\n", total);
+ printf("Total Cost Time = %"PRIu64" us\n", total);
return 0;
}
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v6 3/3] spinlock: reimplement with atomic one-way barrier builtins
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (18 preceding siblings ...)
2019-03-08 7:16 ` [dpdk-dev] [PATCH v6 2/3] test/spinlock: amortize the cost of getting time Gavin Hu
@ 2019-03-08 7:16 ` Gavin Hu
2019-03-08 7:37 ` [dpdk-dev] [PATCH v7 0/3] generic spinlock optimization and test case enhancements Gavin Hu
` (7 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2019-03-08 7:16 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, gavin.hu, i.maximets, chaozhu, stable
The __sync builtin based implementation generates full memory barriers
('dmb ish') on Arm platforms. Using C11 atomic builtins to generate one way
barriers.
Here is the assembly code of __sync_compare_and_swap builtin.
__sync_bool_compare_and_swap(dst, exp, src);
0x000000000090f1b0 <+16>: e0 07 40 f9 ldr x0, [sp, #8]
0x000000000090f1b4 <+20>: e1 0f 40 79 ldrh w1, [sp, #6]
0x000000000090f1b8 <+24>: e2 0b 40 79 ldrh w2, [sp, #4]
0x000000000090f1bc <+28>: 21 3c 00 12 and w1, w1, #0xffff
0x000000000090f1c0 <+32>: 03 7c 5f 48 ldxrh w3, [x0]
0x000000000090f1c4 <+36>: 7f 00 01 6b cmp w3, w1
0x000000000090f1c8 <+40>: 61 00 00 54 b.ne 0x90f1d4
<rte_atomic16_cmpset+52> // b.any
0x000000000090f1cc <+44>: 02 fc 04 48 stlxrh w4, w2, [x0]
0x000000000090f1d0 <+48>: 84 ff ff 35 cbnz w4, 0x90f1c0
<rte_atomic16_cmpset+32>
0x000000000090f1d4 <+52>: bf 3b 03 d5 dmb ish
0x000000000090f1d8 <+56>: e0 17 9f 1a cset w0, eq // eq = none
The benchmarking results showed constant improvements on all available
platforms:
1. Cavium ThunderX2: 126% performance;
2. Hisilicon 1616: 30%;
3. Qualcomm Falkor: 13%;
4. Marvell ARMADA 8040 with A72 cores on macchiatobin: 3.7%
Here is the example test result on TX2:
$sudo ./build/app/test -l 16-27 -- i
RTE>>spinlock_autotest
*** spinlock_autotest without this patch ***
Test with lock on 12 cores...
Core [16] Cost Time = 53886 us
Core [17] Cost Time = 53605 us
Core [18] Cost Time = 53163 us
Core [19] Cost Time = 49419 us
Core [20] Cost Time = 34317 us
Core [21] Cost Time = 53408 us
Core [22] Cost Time = 53970 us
Core [23] Cost Time = 53930 us
Core [24] Cost Time = 53283 us
Core [25] Cost Time = 51504 us
Core [26] Cost Time = 50718 us
Core [27] Cost Time = 51730 us
Total Cost Time = 612933 us
*** spinlock_autotest with this patch ***
Test with lock on 12 cores...
Core [16] Cost Time = 18808 us
Core [17] Cost Time = 29497 us
Core [18] Cost Time = 29132 us
Core [19] Cost Time = 26150 us
Core [20] Cost Time = 21892 us
Core [21] Cost Time = 24377 us
Core [22] Cost Time = 27211 us
Core [23] Cost Time = 11070 us
Core [24] Cost Time = 29802 us
Core [25] Cost Time = 15793 us
Core [26] Cost Time = 7474 us
Core [27] Cost Time = 29550 us
Total Cost Time = 270756 us
In the tests on ThunderX2, with more cores contending, the performance gain
was even higher, indicating the __atomic implementation scales better than
__sync.
Fixes: af75078fece3 ("first public release")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Steve Capper <Steve.Capper@arm.com>
---
lib/librte_eal/common/include/generic/rte_spinlock.h | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index c4c3fc3..87ae7a4 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
static inline void
rte_spinlock_lock(rte_spinlock_t *sl)
{
- while (__sync_lock_test_and_set(&sl->locked, 1))
- while(sl->locked)
+ int exp = 0;
+
+ while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
+ __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
+ while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
rte_pause();
+ exp = 0;
+ }
}
#endif
@@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
static inline void
rte_spinlock_unlock (rte_spinlock_t *sl)
{
- __sync_lock_release(&sl->locked);
+ __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
#endif
@@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
static inline int
rte_spinlock_trylock (rte_spinlock_t *sl)
{
- return __sync_lock_test_and_set(&sl->locked,1) == 0;
+ int exp = 0;
+ return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
+ 0, /* disallow spurious failure */
+ __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}
#endif
@@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
*/
static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
{
- return sl->locked;
+ return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
}
/**
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v7 0/3] generic spinlock optimization and test case enhancements
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (19 preceding siblings ...)
2019-03-08 7:16 ` [dpdk-dev] [PATCH v6 3/3] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
@ 2019-03-08 7:37 ` Gavin Hu
2019-03-08 7:37 ` [dpdk-dev] [PATCH v7 1/3] test/spinlock: remove 1us delay for correct benchmarking Gavin Hu
` (6 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2019-03-08 7:37 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, gavin.hu, i.maximets, chaozhu
V7: Update the 1/3 patch headline and commit message
V6: Rebase and drop the first patch as a similar fix was already merged.
V5: Remove ChangeId(sorry for that)
V4:
1. Drop one patch for the test case to get time precisely as the overhead
of getting time is amortized already in another patch.
2. Drop the ticket lock patch from this series as there are no dependency
between them, the ticket lock patch was submitted separately:
http://patchwork.dpdk.org/patch/49770/
3. Define volatile variable in patch #3 to be more realistic for spinlock
protection(avoid optimization be compiler).
4. Fix typos.
V3:
1. Implemented the ticket lock to improve the fairness and predictability.
The locks are obtained in the order of requested.
V2:
1. FORCE_INTRINCIS is still an option for ppc/x86, although not is use
by default, so don't remove it from generic file.
2. Fix the clang compiler error on x86 when the above FORCE_INTRINSICS
is enabled.
V1:
1. Remove the 1us delay outside of the locked region to really benchmark
the spinlock acquire/release performance, not the delay API.
2. Use the precise version of getting timestamps for more precise
benchmarking results.
3. Amortize the overhead of getting the timestamp by 10000 loops.
4. Move the arm specific implementation to arm folder to remove the
hardcoded implementation.
5. Use atomic primitives, which translate to one-way barriers, instead of
two-way sync primitives, to optimize for performance.
Gavin Hu (3):
test/spinlock: remove 1us delay for correct benchmarking
test/spinlock: amortize the cost of getting time
spinlock: reimplement with atomic one-way barrier builtins
app/test/test_spinlock.c | 31 +++++++++++-----------
.../common/include/generic/rte_spinlock.h | 18 +++++++++----
2 files changed, 29 insertions(+), 20 deletions(-)
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v7 1/3] test/spinlock: remove 1us delay for correct benchmarking
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (20 preceding siblings ...)
2019-03-08 7:37 ` [dpdk-dev] [PATCH v7 0/3] generic spinlock optimization and test case enhancements Gavin Hu
@ 2019-03-08 7:37 ` Gavin Hu
2019-03-08 7:37 ` [dpdk-dev] [PATCH v7 2/3] test/spinlock: amortize the cost of getting time Gavin Hu
` (5 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2019-03-08 7:37 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, gavin.hu, i.maximets, chaozhu, stable
The test is to benchmark the performance of spinlock by counting the
number of spinlock acquire and release operations within the specified
time.
A typical pair of lock and unlock operations costs tens or hundreds of
nano seconds, in comparison to this, delaying 1 us outside of the locked
region is too much, compromising the goal of benchmarking the lock and
unlock performance.
Fixes: af75078fece3 ("first public release")
Cc: stable@dpdk.org
Change-Id: I7cc025e76082bb84de3d7cd5002e850f89b30eae
Jira: ENTNET-1047
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <Ruifeng.Wang@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
app/test/test_spinlock.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/app/test/test_spinlock.c b/app/test/test_spinlock.c
index 73bff12..6795195 100644
--- a/app/test/test_spinlock.c
+++ b/app/test/test_spinlock.c
@@ -120,8 +120,6 @@ load_loop_fn(void *func_param)
lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- /* delay to make lock duty cycle slighlty realistic */
- rte_delay_us(1);
time_diff = rte_get_timer_cycles() - begin;
}
lock_count[lcore] = lcount;
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v7 2/3] test/spinlock: amortize the cost of getting time
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (21 preceding siblings ...)
2019-03-08 7:37 ` [dpdk-dev] [PATCH v7 1/3] test/spinlock: remove 1us delay for correct benchmarking Gavin Hu
@ 2019-03-08 7:37 ` Gavin Hu
2019-03-08 7:37 ` [dpdk-dev] [PATCH v7 3/3] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
` (4 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2019-03-08 7:37 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, gavin.hu, i.maximets, chaozhu, stable
Instead of getting timestamps per iteration, amortize its overhead
can help getting more precise benchmarking results.
Fixes: af75078fece3 ("first public release")
Cc: stable@dpdk.org
Change-Id: I5460f585937f65772c2eabe9ebc3d23a682e8af2
Jira: ENTNET-1047
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
Reviewed-by: ruifeng wang <ruifeng.wang@arm.com>
Reviewed-by: honnappa nagarahalli <honnappa.nagarahalli@arm.com>
---
app/test/test_spinlock.c | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)
diff --git a/app/test/test_spinlock.c b/app/test/test_spinlock.c
index 6795195..6ac7495 100644
--- a/app/test/test_spinlock.c
+++ b/app/test/test_spinlock.c
@@ -96,16 +96,16 @@ test_spinlock_recursive_per_core(__attribute__((unused)) void *arg)
}
static rte_spinlock_t lk = RTE_SPINLOCK_INITIALIZER;
-static uint64_t lock_count[RTE_MAX_LCORE] = {0};
+static uint64_t time_count[RTE_MAX_LCORE] = {0};
-#define TIME_MS 100
+#define MAX_LOOP 10000
static int
load_loop_fn(void *func_param)
{
uint64_t time_diff = 0, begin;
uint64_t hz = rte_get_timer_hz();
- uint64_t lcount = 0;
+ volatile uint64_t lcount = 0;
const int use_lock = *(int*)func_param;
const unsigned lcore = rte_lcore_id();
@@ -114,15 +114,15 @@ load_loop_fn(void *func_param)
while (rte_atomic32_read(&synchro) == 0);
begin = rte_get_timer_cycles();
- while (time_diff < hz * TIME_MS / 1000) {
+ while (lcount < MAX_LOOP) {
if (use_lock)
rte_spinlock_lock(&lk);
lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- time_diff = rte_get_timer_cycles() - begin;
}
- lock_count[lcore] = lcount;
+ time_diff = rte_get_timer_cycles() - begin;
+ time_count[lcore] = time_diff * 1000000 / hz;
return 0;
}
@@ -136,14 +136,16 @@ test_spinlock_perf(void)
printf("\nTest with no lock on single core...\n");
load_loop_fn(&lock);
- printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
- memset(lock_count, 0, sizeof(lock_count));
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+ time_count[lcore]);
+ memset(time_count, 0, sizeof(time_count));
printf("\nTest with lock on single core...\n");
lock = 1;
load_loop_fn(&lock);
- printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
- memset(lock_count, 0, sizeof(lock_count));
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+ time_count[lcore]);
+ memset(time_count, 0, sizeof(time_count));
printf("\nTest with lock on %u cores...\n", rte_lcore_count());
@@ -158,11 +160,12 @@ test_spinlock_perf(void)
rte_eal_mp_wait_lcore();
RTE_LCORE_FOREACH(i) {
- printf("Core [%u] count = %"PRIu64"\n", i, lock_count[i]);
- total += lock_count[i];
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", i,
+ time_count[i]);
+ total += time_count[i];
}
- printf("Total count = %"PRIu64"\n", total);
+ printf("Total Cost Time = %"PRIu64" us\n", total);
return 0;
}
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v7 3/3] spinlock: reimplement with atomic one-way barrier builtins
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (22 preceding siblings ...)
2019-03-08 7:37 ` [dpdk-dev] [PATCH v7 2/3] test/spinlock: amortize the cost of getting time Gavin Hu
@ 2019-03-08 7:37 ` Gavin Hu
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 0/3] generic spinlock optimization and test case enhancements Gavin Hu
` (3 subsequent siblings)
27 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2019-03-08 7:37 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, gavin.hu, i.maximets, chaozhu, stable
The __sync builtin based implementation generates full memory barriers
('dmb ish') on Arm platforms. Using C11 atomic builtins to generate one way
barriers.
Here is the assembly code of __sync_compare_and_swap builtin.
__sync_bool_compare_and_swap(dst, exp, src);
0x000000000090f1b0 <+16>: e0 07 40 f9 ldr x0, [sp, #8]
0x000000000090f1b4 <+20>: e1 0f 40 79 ldrh w1, [sp, #6]
0x000000000090f1b8 <+24>: e2 0b 40 79 ldrh w2, [sp, #4]
0x000000000090f1bc <+28>: 21 3c 00 12 and w1, w1, #0xffff
0x000000000090f1c0 <+32>: 03 7c 5f 48 ldxrh w3, [x0]
0x000000000090f1c4 <+36>: 7f 00 01 6b cmp w3, w1
0x000000000090f1c8 <+40>: 61 00 00 54 b.ne 0x90f1d4
<rte_atomic16_cmpset+52> // b.any
0x000000000090f1cc <+44>: 02 fc 04 48 stlxrh w4, w2, [x0]
0x000000000090f1d0 <+48>: 84 ff ff 35 cbnz w4, 0x90f1c0
<rte_atomic16_cmpset+32>
0x000000000090f1d4 <+52>: bf 3b 03 d5 dmb ish
0x000000000090f1d8 <+56>: e0 17 9f 1a cset w0, eq // eq = none
The benchmarking results showed constant improvements on all available
platforms:
1. Cavium ThunderX2: 126% performance;
2. Hisilicon 1616: 30%;
3. Qualcomm Falkor: 13%;
4. Marvell ARMADA 8040 with A72 cores on macchiatobin: 3.7%
Here is the example test result on TX2:
$sudo ./build/app/test -l 16-27 -- i
RTE>>spinlock_autotest
*** spinlock_autotest without this patch ***
Test with lock on 12 cores...
Core [16] Cost Time = 53886 us
Core [17] Cost Time = 53605 us
Core [18] Cost Time = 53163 us
Core [19] Cost Time = 49419 us
Core [20] Cost Time = 34317 us
Core [21] Cost Time = 53408 us
Core [22] Cost Time = 53970 us
Core [23] Cost Time = 53930 us
Core [24] Cost Time = 53283 us
Core [25] Cost Time = 51504 us
Core [26] Cost Time = 50718 us
Core [27] Cost Time = 51730 us
Total Cost Time = 612933 us
*** spinlock_autotest with this patch ***
Test with lock on 12 cores...
Core [16] Cost Time = 18808 us
Core [17] Cost Time = 29497 us
Core [18] Cost Time = 29132 us
Core [19] Cost Time = 26150 us
Core [20] Cost Time = 21892 us
Core [21] Cost Time = 24377 us
Core [22] Cost Time = 27211 us
Core [23] Cost Time = 11070 us
Core [24] Cost Time = 29802 us
Core [25] Cost Time = 15793 us
Core [26] Cost Time = 7474 us
Core [27] Cost Time = 29550 us
Total Cost Time = 270756 us
In the tests on ThunderX2, with more cores contending, the performance gain
was even higher, indicating the __atomic implementation scales better than
__sync.
Fixes: af75078fece3 ("first public release")
Cc: stable@dpdk.org
Change-Id: Ibe82c1fa53a8409584aa84c1a2b4499aae8c5b4d
Jira: ENTNET-1047
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Steve Capper <Steve.Capper@arm.com>
---
lib/librte_eal/common/include/generic/rte_spinlock.h | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index c4c3fc3..87ae7a4 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
static inline void
rte_spinlock_lock(rte_spinlock_t *sl)
{
- while (__sync_lock_test_and_set(&sl->locked, 1))
- while(sl->locked)
+ int exp = 0;
+
+ while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
+ __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
+ while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
rte_pause();
+ exp = 0;
+ }
}
#endif
@@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
static inline void
rte_spinlock_unlock (rte_spinlock_t *sl)
{
- __sync_lock_release(&sl->locked);
+ __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
#endif
@@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
static inline int
rte_spinlock_trylock (rte_spinlock_t *sl)
{
- return __sync_lock_test_and_set(&sl->locked,1) == 0;
+ int exp = 0;
+ return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
+ 0, /* disallow spurious failure */
+ __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}
#endif
@@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
*/
static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
{
- return sl->locked;
+ return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
}
/**
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v8 0/3] generic spinlock optimization and test case enhancements
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (23 preceding siblings ...)
2019-03-08 7:37 ` [dpdk-dev] [PATCH v7 3/3] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
@ 2019-03-08 7:56 ` Gavin Hu
2019-03-11 12:21 ` Nipun Gupta
` (2 more replies)
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 1/3] test/spinlock: remove 1us delay for correct benchmarking Gavin Hu
` (2 subsequent siblings)
27 siblings, 3 replies; 54+ messages in thread
From: Gavin Hu @ 2019-03-08 7:56 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, gavin.hu, i.maximets, chaozhu
V8: Remove internal ChangeId
V7: Update the 1/3 patch headline and commit message
V6: Rebase and drop the first patch as a similar fix was already merged.
V5: Remove ChangeId(sorry for that)
V4:
1. Drop one patch for the test case to get time precisely as the overhead
of getting time is amortized already in another patch.
2. Drop the ticket lock patch from this series as there are no dependency
between them, the ticket lock patch was submitted separately:
http://patchwork.dpdk.org/patch/49770/
3. Define volatile variable in patch #3 to be more realistic for spinlock
protection(avoid optimization be compiler).
4. Fix typos.
V3:
1. Implemented the ticket lock to improve the fairness and predictability.
The locks are obtained in the order of requested.
V2:
1. FORCE_INTRINCIS is still an option for ppc/x86, although not is use
by default, so don't remove it from generic file.
2. Fix the clang compiler error on x86 when the above FORCE_INTRINSICS
is enabled.
V1:
1. Remove the 1us delay outside of the locked region to really benchmark
the spinlock acquire/release performance, not the delay API.
2. Use the precise version of getting timestamps for more precise
benchmarking results.
3. Amortize the overhead of getting the timestamp by 10000 loops.
4. Move the arm specific implementation to arm folder to remove the
hardcoded implementation.
5. Use atomic primitives, which translate to one-way barriers, instead of
two-way sync primitives, to optimize for performance.
Gavin Hu (3):
test/spinlock: remove 1us delay for correct benchmarking
test/spinlock: amortize the cost of getting time
spinlock: reimplement with atomic one-way barrier builtins
app/test/test_spinlock.c | 31 +++++++++++-----------
.../common/include/generic/rte_spinlock.h | 18 +++++++++----
2 files changed, 29 insertions(+), 20 deletions(-)
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v8 1/3] test/spinlock: remove 1us delay for correct benchmarking
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (24 preceding siblings ...)
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 0/3] generic spinlock optimization and test case enhancements Gavin Hu
@ 2019-03-08 7:56 ` Gavin Hu
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 2/3] test/spinlock: amortize the cost of getting time Gavin Hu
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 3/3] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
27 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2019-03-08 7:56 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, gavin.hu, i.maximets, chaozhu, stable
The test is to benchmark the performance of spinlock by counting the
number of spinlock acquire and release operations within the specified
time.
A typical pair of lock and unlock operations costs tens or hundreds of
nano seconds, in comparison to this, delaying 1 us outside of the locked
region is too much, compromising the goal of benchmarking the lock and
unlock performance.
Fixes: af75078fece3 ("first public release")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <Ruifeng.Wang@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
app/test/test_spinlock.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/app/test/test_spinlock.c b/app/test/test_spinlock.c
index 73bff12..6795195 100644
--- a/app/test/test_spinlock.c
+++ b/app/test/test_spinlock.c
@@ -120,8 +120,6 @@ load_loop_fn(void *func_param)
lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- /* delay to make lock duty cycle slighlty realistic */
- rte_delay_us(1);
time_diff = rte_get_timer_cycles() - begin;
}
lock_count[lcore] = lcount;
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v8 2/3] test/spinlock: amortize the cost of getting time
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (25 preceding siblings ...)
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 1/3] test/spinlock: remove 1us delay for correct benchmarking Gavin Hu
@ 2019-03-08 7:56 ` Gavin Hu
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 3/3] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
27 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu @ 2019-03-08 7:56 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, gavin.hu, i.maximets, chaozhu, stable
Instead of getting timestamps per iteration, amortize its overhead
can help getting more precise benchmarking results.
Fixes: af75078fece3 ("first public release")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Joyce Kong <Joyce.Kong@arm.com>
Reviewed-by: ruifeng wang <ruifeng.wang@arm.com>
Reviewed-by: honnappa nagarahalli <honnappa.nagarahalli@arm.com>
---
app/test/test_spinlock.c | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)
diff --git a/app/test/test_spinlock.c b/app/test/test_spinlock.c
index 6795195..6ac7495 100644
--- a/app/test/test_spinlock.c
+++ b/app/test/test_spinlock.c
@@ -96,16 +96,16 @@ test_spinlock_recursive_per_core(__attribute__((unused)) void *arg)
}
static rte_spinlock_t lk = RTE_SPINLOCK_INITIALIZER;
-static uint64_t lock_count[RTE_MAX_LCORE] = {0};
+static uint64_t time_count[RTE_MAX_LCORE] = {0};
-#define TIME_MS 100
+#define MAX_LOOP 10000
static int
load_loop_fn(void *func_param)
{
uint64_t time_diff = 0, begin;
uint64_t hz = rte_get_timer_hz();
- uint64_t lcount = 0;
+ volatile uint64_t lcount = 0;
const int use_lock = *(int*)func_param;
const unsigned lcore = rte_lcore_id();
@@ -114,15 +114,15 @@ load_loop_fn(void *func_param)
while (rte_atomic32_read(&synchro) == 0);
begin = rte_get_timer_cycles();
- while (time_diff < hz * TIME_MS / 1000) {
+ while (lcount < MAX_LOOP) {
if (use_lock)
rte_spinlock_lock(&lk);
lcount++;
if (use_lock)
rte_spinlock_unlock(&lk);
- time_diff = rte_get_timer_cycles() - begin;
}
- lock_count[lcore] = lcount;
+ time_diff = rte_get_timer_cycles() - begin;
+ time_count[lcore] = time_diff * 1000000 / hz;
return 0;
}
@@ -136,14 +136,16 @@ test_spinlock_perf(void)
printf("\nTest with no lock on single core...\n");
load_loop_fn(&lock);
- printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
- memset(lock_count, 0, sizeof(lock_count));
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+ time_count[lcore]);
+ memset(time_count, 0, sizeof(time_count));
printf("\nTest with lock on single core...\n");
lock = 1;
load_loop_fn(&lock);
- printf("Core [%u] count = %"PRIu64"\n", lcore, lock_count[lcore]);
- memset(lock_count, 0, sizeof(lock_count));
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", lcore,
+ time_count[lcore]);
+ memset(time_count, 0, sizeof(time_count));
printf("\nTest with lock on %u cores...\n", rte_lcore_count());
@@ -158,11 +160,12 @@ test_spinlock_perf(void)
rte_eal_mp_wait_lcore();
RTE_LCORE_FOREACH(i) {
- printf("Core [%u] count = %"PRIu64"\n", i, lock_count[i]);
- total += lock_count[i];
+ printf("Core [%u] Cost Time = %"PRIu64" us\n", i,
+ time_count[i]);
+ total += time_count[i];
}
- printf("Total count = %"PRIu64"\n", total);
+ printf("Total Cost Time = %"PRIu64" us\n", total);
return 0;
}
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* [dpdk-dev] [PATCH v8 3/3] spinlock: reimplement with atomic one-way barrier builtins
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
` (26 preceding siblings ...)
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 2/3] test/spinlock: amortize the cost of getting time Gavin Hu
@ 2019-03-08 7:56 ` Gavin Hu
2019-03-12 14:53 ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
2019-03-14 14:22 ` Jerin Jacob Kollanukkaran
27 siblings, 2 replies; 54+ messages in thread
From: Gavin Hu @ 2019-03-08 7:56 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, gavin.hu, i.maximets, chaozhu, stable
The __sync builtin based implementation generates full memory barriers
('dmb ish') on Arm platforms. Using C11 atomic builtins to generate one way
barriers.
Here is the assembly code of __sync_compare_and_swap builtin.
__sync_bool_compare_and_swap(dst, exp, src);
0x000000000090f1b0 <+16>: e0 07 40 f9 ldr x0, [sp, #8]
0x000000000090f1b4 <+20>: e1 0f 40 79 ldrh w1, [sp, #6]
0x000000000090f1b8 <+24>: e2 0b 40 79 ldrh w2, [sp, #4]
0x000000000090f1bc <+28>: 21 3c 00 12 and w1, w1, #0xffff
0x000000000090f1c0 <+32>: 03 7c 5f 48 ldxrh w3, [x0]
0x000000000090f1c4 <+36>: 7f 00 01 6b cmp w3, w1
0x000000000090f1c8 <+40>: 61 00 00 54 b.ne 0x90f1d4
<rte_atomic16_cmpset+52> // b.any
0x000000000090f1cc <+44>: 02 fc 04 48 stlxrh w4, w2, [x0]
0x000000000090f1d0 <+48>: 84 ff ff 35 cbnz w4, 0x90f1c0
<rte_atomic16_cmpset+32>
0x000000000090f1d4 <+52>: bf 3b 03 d5 dmb ish
0x000000000090f1d8 <+56>: e0 17 9f 1a cset w0, eq // eq = none
The benchmarking results showed constant improvements on all available
platforms:
1. Cavium ThunderX2: 126% performance;
2. Hisilicon 1616: 30%;
3. Qualcomm Falkor: 13%;
4. Marvell ARMADA 8040 with A72 cores on macchiatobin: 3.7%
Here is the example test result on TX2:
$sudo ./build/app/test -l 16-27 -- i
RTE>>spinlock_autotest
*** spinlock_autotest without this patch ***
Test with lock on 12 cores...
Core [16] Cost Time = 53886 us
Core [17] Cost Time = 53605 us
Core [18] Cost Time = 53163 us
Core [19] Cost Time = 49419 us
Core [20] Cost Time = 34317 us
Core [21] Cost Time = 53408 us
Core [22] Cost Time = 53970 us
Core [23] Cost Time = 53930 us
Core [24] Cost Time = 53283 us
Core [25] Cost Time = 51504 us
Core [26] Cost Time = 50718 us
Core [27] Cost Time = 51730 us
Total Cost Time = 612933 us
*** spinlock_autotest with this patch ***
Test with lock on 12 cores...
Core [16] Cost Time = 18808 us
Core [17] Cost Time = 29497 us
Core [18] Cost Time = 29132 us
Core [19] Cost Time = 26150 us
Core [20] Cost Time = 21892 us
Core [21] Cost Time = 24377 us
Core [22] Cost Time = 27211 us
Core [23] Cost Time = 11070 us
Core [24] Cost Time = 29802 us
Core [25] Cost Time = 15793 us
Core [26] Cost Time = 7474 us
Core [27] Cost Time = 29550 us
Total Cost Time = 270756 us
In the tests on ThunderX2, with more cores contending, the performance gain
was even higher, indicating the __atomic implementation scales up better
than __sync.
Fixes: af75078fece3 ("first public release")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Steve Capper <Steve.Capper@arm.com>
---
lib/librte_eal/common/include/generic/rte_spinlock.h | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index c4c3fc3..87ae7a4 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
static inline void
rte_spinlock_lock(rte_spinlock_t *sl)
{
- while (__sync_lock_test_and_set(&sl->locked, 1))
- while(sl->locked)
+ int exp = 0;
+
+ while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
+ __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
+ while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
rte_pause();
+ exp = 0;
+ }
}
#endif
@@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
static inline void
rte_spinlock_unlock (rte_spinlock_t *sl)
{
- __sync_lock_release(&sl->locked);
+ __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
#endif
@@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
static inline int
rte_spinlock_trylock (rte_spinlock_t *sl)
{
- return __sync_lock_test_and_set(&sl->locked,1) == 0;
+ int exp = 0;
+ return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
+ 0, /* disallow spurious failure */
+ __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}
#endif
@@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
*/
static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
{
- return sl->locked;
+ return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
}
/**
--
2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [PATCH v8 0/3] generic spinlock optimization and test case enhancements
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 0/3] generic spinlock optimization and test case enhancements Gavin Hu
@ 2019-03-11 12:21 ` Nipun Gupta
2019-03-15 12:21 ` Ananyev, Konstantin
2019-03-28 7:47 ` Thomas Monjalon
2 siblings, 0 replies; 54+ messages in thread
From: Nipun Gupta @ 2019-03-11 12:21 UTC (permalink / raw)
To: Gavin Hu, dev
Cc: nd, thomas, jerinj, Hemant Agrawal, Honnappa.Nagarahalli,
i.maximets, chaozhu
> -----Original Message-----
> From: Gavin Hu [mailto:gavin.hu@arm.com]
> Sent: Friday, March 8, 2019 1:27 PM
> To: dev@dpdk.org
> Cc: nd@arm.com; thomas@monjalon.net; jerinj@marvell.com; Hemant
> Agrawal <hemant.agrawal@nxp.com>; Nipun Gupta
> <nipun.gupta@nxp.com>; Honnappa.Nagarahalli@arm.com;
> gavin.hu@arm.com; i.maximets@samsung.com;
> chaozhu@linux.vnet.ibm.com
> Subject: [PATCH v8 0/3] generic spinlock optimization and test case
> enhancements
>
...
>
> Gavin Hu (3):
> test/spinlock: remove 1us delay for correct benchmarking
> test/spinlock: amortize the cost of getting time
> spinlock: reimplement with atomic one-way barrier builtins
>
> app/test/test_spinlock.c | 31 +++++++++++-----------
> .../common/include/generic/rte_spinlock.h | 18 +++++++++----
> 2 files changed, 29 insertions(+), 20 deletions(-)
>
> --
Seems good.
Series-Acked-by: Nipun Gupta <nipun.gupta@nxp.com>
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [EXT] [PATCH v8 3/3] spinlock: reimplement with atomic one-way barrier builtins
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 3/3] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
@ 2019-03-12 14:53 ` Jerin Jacob Kollanukkaran
2019-03-14 0:31 ` Honnappa Nagarahalli
2019-03-14 14:22 ` Jerin Jacob Kollanukkaran
1 sibling, 1 reply; 54+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-03-12 14:53 UTC (permalink / raw)
To: gavin.hu, dev
Cc: i.maximets, chaozhu, nd, nipun.gupta, thomas, hemant.agrawal,
stable, Honnappa.Nagarahalli
On Fri, 2019-03-08 at 15:56 +0800, Gavin Hu wrote:
> -------------------------------------------------------------------
> ---
> The __sync builtin based implementation generates full memory
> barriers
> ('dmb ish') on Arm platforms. Using C11 atomic builtins to generate
> one way
> barriers.
>
>
> lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> +++++++++++++-----
> 1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> b/lib/librte_eal/common/include/generic/rte_spinlock.h
> index c4c3fc3..87ae7a4 100644
> --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
> static inline void
> rte_spinlock_lock(rte_spinlock_t *sl)
> {
> - while (__sync_lock_test_and_set(&sl->locked, 1))
> - while(sl->locked)
> + int exp = 0;
> +
> + while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> + __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
Would it be clean to use __atomic_test_and_set()
to avoid explicit exp =
0.
> + while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> rte_pause();
> + exp = 0;
> + }
> }
> #endif
>
> @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
> static inline void
> rte_spinlock_unlock (rte_spinlock_t *sl)
> {
> - __sync_lock_release(&sl->locked);
> + __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
__atomic_clear(.., __ATOMIC_RELEASE) looks more clean to me.
> }
> #endif
>
> @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
> static inline int
> rte_spinlock_trylock (rte_spinlock_t *sl)
> {
> - return __sync_lock_test_and_set(&sl->locked,1) == 0;
> + int exp = 0;
> + return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> + 0, /* disallow spurious failure */
> + __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
return (__atomic_test_and_set(.., __ATOMIC_ACQUIRE) == 0) will be more
clean version.
> }
> #endif
>
> @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> */
> static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
> {
> - return sl->locked;
> + return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
Does __ATOMIC_RELAXED will be sufficient?
> }
>
> /**
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [EXT] [PATCH v8 3/3] spinlock: reimplement with atomic one-way barrier builtins
2019-03-12 14:53 ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
@ 2019-03-14 0:31 ` Honnappa Nagarahalli
2019-03-14 0:31 ` Honnappa Nagarahalli
2019-03-14 2:36 ` Gavin Hu (Arm Technology China)
0 siblings, 2 replies; 54+ messages in thread
From: Honnappa Nagarahalli @ 2019-03-14 0:31 UTC (permalink / raw)
To: jerinj, Gavin Hu (Arm Technology China), dev
Cc: i.maximets, chaozhu, nd, Nipun.gupta@nxp.com, thomas,
hemant.agrawal, stable, nd
> > -------------------------------------------------------------------
> > ---
> > The __sync builtin based implementation generates full memory barriers
> > ('dmb ish') on Arm platforms. Using C11 atomic builtins to generate
> > one way barriers.
> >
> >
> > lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > +++++++++++++-----
> > 1 file changed, 13 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > index c4c3fc3..87ae7a4 100644
> > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl); static
> > inline void rte_spinlock_lock(rte_spinlock_t *sl) {
> > - while (__sync_lock_test_and_set(&sl->locked, 1))
> > - while(sl->locked)
> > + int exp = 0;
> > +
> > + while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > + __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> {
>
> Would it be clean to use __atomic_test_and_set() to avoid explicit exp = 0.
We addressed it here: http://mails.dpdk.org/archives/dev/2019-January/122363.html
>
>
> > + while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> > rte_pause();
> > + exp = 0;
> > + }
> > }
> > #endif
> >
> > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl); static
> > inline void rte_spinlock_unlock (rte_spinlock_t *sl) {
> > - __sync_lock_release(&sl->locked);
> > + __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
>
> __atomic_clear(.., __ATOMIC_RELEASE) looks more clean to me.
This needs the operand to be of type bool.
>
> > }
> > #endif
> >
> > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl); static
> > inline int rte_spinlock_trylock (rte_spinlock_t *sl) {
> > - return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > + int exp = 0;
> > + return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > + 0, /* disallow spurious failure */
> > + __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
>
> return (__atomic_test_and_set(.., __ATOMIC_ACQUIRE) == 0) will be more
> clean version.
>
> > }
> > #endif
> >
> > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> > */
> > static inline int rte_spinlock_is_locked (rte_spinlock_t *sl) {
> > - return sl->locked;
> > + return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
>
> Does __ATOMIC_RELAXED will be sufficient?
This is also addressed here: http://mails.dpdk.org/archives/dev/2019-January/122363.html
I think you approved the patch here: http://mails.dpdk.org/archives/dev/2019-January/123238.html
I think this patch just needs your reviewed-by tag :)
>
>
> > }
> >
> > /**
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [EXT] [PATCH v8 3/3] spinlock: reimplement with atomic one-way barrier builtins
2019-03-14 0:31 ` Honnappa Nagarahalli
@ 2019-03-14 0:31 ` Honnappa Nagarahalli
2019-03-14 2:36 ` Gavin Hu (Arm Technology China)
1 sibling, 0 replies; 54+ messages in thread
From: Honnappa Nagarahalli @ 2019-03-14 0:31 UTC (permalink / raw)
To: jerinj, Gavin Hu (Arm Technology China), dev
Cc: i.maximets, chaozhu, nd, Nipun.gupta@nxp.com, thomas,
hemant.agrawal, stable, nd
> > -------------------------------------------------------------------
> > ---
> > The __sync builtin based implementation generates full memory barriers
> > ('dmb ish') on Arm platforms. Using C11 atomic builtins to generate
> > one way barriers.
> >
> >
> > lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > +++++++++++++-----
> > 1 file changed, 13 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > index c4c3fc3..87ae7a4 100644
> > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl); static
> > inline void rte_spinlock_lock(rte_spinlock_t *sl) {
> > - while (__sync_lock_test_and_set(&sl->locked, 1))
> > - while(sl->locked)
> > + int exp = 0;
> > +
> > + while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > + __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> {
>
> Would it be clean to use __atomic_test_and_set() to avoid explicit exp = 0.
We addressed it here: http://mails.dpdk.org/archives/dev/2019-January/122363.html
>
>
> > + while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> > rte_pause();
> > + exp = 0;
> > + }
> > }
> > #endif
> >
> > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl); static
> > inline void rte_spinlock_unlock (rte_spinlock_t *sl) {
> > - __sync_lock_release(&sl->locked);
> > + __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
>
> __atomic_clear(.., __ATOMIC_RELEASE) looks more clean to me.
This needs the operand to be of type bool.
>
> > }
> > #endif
> >
> > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl); static
> > inline int rte_spinlock_trylock (rte_spinlock_t *sl) {
> > - return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > + int exp = 0;
> > + return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > + 0, /* disallow spurious failure */
> > + __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
>
> return (__atomic_test_and_set(.., __ATOMIC_ACQUIRE) == 0) will be more
> clean version.
>
> > }
> > #endif
> >
> > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> > */
> > static inline int rte_spinlock_is_locked (rte_spinlock_t *sl) {
> > - return sl->locked;
> > + return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
>
> Does __ATOMIC_RELAXED will be sufficient?
This is also addressed here: http://mails.dpdk.org/archives/dev/2019-January/122363.html
I think you approved the patch here: http://mails.dpdk.org/archives/dev/2019-January/123238.html
I think this patch just needs your reviewed-by tag :)
>
>
> > }
> >
> > /**
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [EXT] [PATCH v8 3/3] spinlock: reimplement with atomic one-way barrier builtins
2019-03-14 0:31 ` Honnappa Nagarahalli
2019-03-14 0:31 ` Honnappa Nagarahalli
@ 2019-03-14 2:36 ` Gavin Hu (Arm Technology China)
2019-03-14 2:36 ` Gavin Hu (Arm Technology China)
1 sibling, 1 reply; 54+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-03-14 2:36 UTC (permalink / raw)
To: Honnappa Nagarahalli, jerinj, dev
Cc: i.maximets, chaozhu, nd, Nipun.gupta@nxp.com, thomas,
hemant.agrawal, stable, Gavin Hu (Arm Technology China)
> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Thursday, March 14, 2019 8:31 AM
> To: jerinj@marvell.com; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: i.maximets@samsung.com; chaozhu@linux.vnet.ibm.com; nd
> <nd@arm.com>; Nipun.gupta@nxp.com; thomas@monjalon.net;
> hemant.agrawal@nxp.com; stable@dpdk.org; nd <nd@arm.com>
> Subject: RE: [EXT] [PATCH v8 3/3] spinlock: reimplement with atomic one-
> way barrier builtins
>
> > > -------------------------------------------------------------------
> > > ---
> > > The __sync builtin based implementation generates full memory barriers
> > > ('dmb ish') on Arm platforms. Using C11 atomic builtins to generate
> > > one way barriers.
> > >
> > >
> > > lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > > +++++++++++++-----
> > > 1 file changed, 13 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > index c4c3fc3..87ae7a4 100644
> > > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl); static
> > > inline void rte_spinlock_lock(rte_spinlock_t *sl) {
> > > - while (__sync_lock_test_and_set(&sl->locked, 1))
> > > - while(sl->locked)
> > > + int exp = 0;
> > > +
> > > + while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > > + __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> > {
> >
> > Would it be clean to use __atomic_test_and_set() to avoid explicit exp = 0.
> We addressed it here: http://mails.dpdk.org/archives/dev/2019-
> January/122363.html
__atomic_test_and_set causes 10 times of performance degradation in our
micro benchmarking on ThunderX2. Here it is explained why:
http://mails.dpdk.org/archives/dev/2019-January/123340.html
>
> >
> >
> > > + while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> > > rte_pause();
> > > + exp = 0;
> > > + }
> > > }
> > > #endif
> > >
> > > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl); static
> > > inline void rte_spinlock_unlock (rte_spinlock_t *sl) {
> > > - __sync_lock_release(&sl->locked);
> > > + __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
> >
> > __atomic_clear(.., __ATOMIC_RELEASE) looks more clean to me.
> This needs the operand to be of type bool.
>
> >
> > > }
> > > #endif
> > >
> > > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl); static
> > > inline int rte_spinlock_trylock (rte_spinlock_t *sl) {
> > > - return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > > + int exp = 0;
> > > + return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > > + 0, /* disallow spurious failure */
> > > + __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> >
> > return (__atomic_test_and_set(.., __ATOMIC_ACQUIRE) == 0) will be
> more
> > clean version.
> >
> > > }
> > > #endif
> > >
> > > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> > > */
> > > static inline int rte_spinlock_is_locked (rte_spinlock_t *sl) {
> > > - return sl->locked;
> > > + return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
> >
> > Does __ATOMIC_RELAXED will be sufficient?
> This is also addressed here: http://mails.dpdk.org/archives/dev/2019-
> January/122363.html
>
> I think you approved the patch here:
> http://mails.dpdk.org/archives/dev/2019-January/123238.html
> I think this patch just needs your reviewed-by tag :)
>
> >
> >
> > > }
> > >
> > > /**
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [EXT] [PATCH v8 3/3] spinlock: reimplement with atomic one-way barrier builtins
2019-03-14 2:36 ` Gavin Hu (Arm Technology China)
@ 2019-03-14 2:36 ` Gavin Hu (Arm Technology China)
0 siblings, 0 replies; 54+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-03-14 2:36 UTC (permalink / raw)
To: Honnappa Nagarahalli, jerinj, dev
Cc: i.maximets, chaozhu, nd, Nipun.gupta@nxp.com, thomas,
hemant.agrawal, stable, Gavin Hu (Arm Technology China)
> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Thursday, March 14, 2019 8:31 AM
> To: jerinj@marvell.com; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: i.maximets@samsung.com; chaozhu@linux.vnet.ibm.com; nd
> <nd@arm.com>; Nipun.gupta@nxp.com; thomas@monjalon.net;
> hemant.agrawal@nxp.com; stable@dpdk.org; nd <nd@arm.com>
> Subject: RE: [EXT] [PATCH v8 3/3] spinlock: reimplement with atomic one-
> way barrier builtins
>
> > > -------------------------------------------------------------------
> > > ---
> > > The __sync builtin based implementation generates full memory barriers
> > > ('dmb ish') on Arm platforms. Using C11 atomic builtins to generate
> > > one way barriers.
> > >
> > >
> > > lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > > +++++++++++++-----
> > > 1 file changed, 13 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > index c4c3fc3..87ae7a4 100644
> > > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl); static
> > > inline void rte_spinlock_lock(rte_spinlock_t *sl) {
> > > - while (__sync_lock_test_and_set(&sl->locked, 1))
> > > - while(sl->locked)
> > > + int exp = 0;
> > > +
> > > + while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > > + __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> > {
> >
> > Would it be clean to use __atomic_test_and_set() to avoid explicit exp = 0.
> We addressed it here: http://mails.dpdk.org/archives/dev/2019-
> January/122363.html
__atomic_test_and_set causes 10 times of performance degradation in our
micro benchmarking on ThunderX2. Here it is explained why:
http://mails.dpdk.org/archives/dev/2019-January/123340.html
>
> >
> >
> > > + while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> > > rte_pause();
> > > + exp = 0;
> > > + }
> > > }
> > > #endif
> > >
> > > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl); static
> > > inline void rte_spinlock_unlock (rte_spinlock_t *sl) {
> > > - __sync_lock_release(&sl->locked);
> > > + __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
> >
> > __atomic_clear(.., __ATOMIC_RELEASE) looks more clean to me.
> This needs the operand to be of type bool.
>
> >
> > > }
> > > #endif
> > >
> > > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl); static
> > > inline int rte_spinlock_trylock (rte_spinlock_t *sl) {
> > > - return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > > + int exp = 0;
> > > + return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > > + 0, /* disallow spurious failure */
> > > + __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> >
> > return (__atomic_test_and_set(.., __ATOMIC_ACQUIRE) == 0) will be
> more
> > clean version.
> >
> > > }
> > > #endif
> > >
> > > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> > > */
> > > static inline int rte_spinlock_is_locked (rte_spinlock_t *sl) {
> > > - return sl->locked;
> > > + return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
> >
> > Does __ATOMIC_RELAXED will be sufficient?
> This is also addressed here: http://mails.dpdk.org/archives/dev/2019-
> January/122363.html
>
> I think you approved the patch here:
> http://mails.dpdk.org/archives/dev/2019-January/123238.html
> I think this patch just needs your reviewed-by tag :)
>
> >
> >
> > > }
> > >
> > > /**
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [EXT] [PATCH v8 3/3] spinlock: reimplement with atomic one-way barrier builtins
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 3/3] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
2019-03-12 14:53 ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
@ 2019-03-14 14:22 ` Jerin Jacob Kollanukkaran
2019-03-14 14:22 ` Jerin Jacob Kollanukkaran
1 sibling, 1 reply; 54+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-03-14 14:22 UTC (permalink / raw)
To: Gavin Hu
Cc: dev, nd, thomas, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, i.maximets, chaozhu, stable
On Fri, Mar 08, 2019 at 03:56:37PM +0800, Gavin Hu wrote:
> External Email
>
> ----------------------------------------------------------------------
> The __sync builtin based implementation generates full memory barriers
> ('dmb ish') on Arm platforms. Using C11 atomic builtins to generate one way
> barriers.
>
> Here is the assembly code of __sync_compare_and_swap builtin.
> __sync_bool_compare_and_swap(dst, exp, src);
> 0x000000000090f1b0 <+16>: e0 07 40 f9 ldr x0, [sp, #8]
> 0x000000000090f1b4 <+20>: e1 0f 40 79 ldrh w1, [sp, #6]
> 0x000000000090f1b8 <+24>: e2 0b 40 79 ldrh w2, [sp, #4]
> 0x000000000090f1bc <+28>: 21 3c 00 12 and w1, w1, #0xffff
> 0x000000000090f1c0 <+32>: 03 7c 5f 48 ldxrh w3, [x0]
> 0x000000000090f1c4 <+36>: 7f 00 01 6b cmp w3, w1
> 0x000000000090f1c8 <+40>: 61 00 00 54 b.ne 0x90f1d4
> <rte_atomic16_cmpset+52> // b.any
> 0x000000000090f1cc <+44>: 02 fc 04 48 stlxrh w4, w2, [x0]
> 0x000000000090f1d0 <+48>: 84 ff ff 35 cbnz w4, 0x90f1c0
> <rte_atomic16_cmpset+32>
> 0x000000000090f1d4 <+52>: bf 3b 03 d5 dmb ish
> 0x000000000090f1d8 <+56>: e0 17 9f 1a cset w0, eq // eq = none
>
> The benchmarking results showed constant improvements on all available
> platforms:
> 1. Cavium ThunderX2: 126% performance;
> 2. Hisilicon 1616: 30%;
> 3. Qualcomm Falkor: 13%;
> 4. Marvell ARMADA 8040 with A72 cores on macchiatobin: 3.7%
>
> Here is the example test result on TX2:
> $sudo ./build/app/test -l 16-27 -- i
> RTE>>spinlock_autotest
>
> *** spinlock_autotest without this patch ***
> Test with lock on 12 cores...
> Core [16] Cost Time = 53886 us
> Core [17] Cost Time = 53605 us
> Core [18] Cost Time = 53163 us
> Core [19] Cost Time = 49419 us
> Core [20] Cost Time = 34317 us
> Core [21] Cost Time = 53408 us
> Core [22] Cost Time = 53970 us
> Core [23] Cost Time = 53930 us
> Core [24] Cost Time = 53283 us
> Core [25] Cost Time = 51504 us
> Core [26] Cost Time = 50718 us
> Core [27] Cost Time = 51730 us
> Total Cost Time = 612933 us
>
> *** spinlock_autotest with this patch ***
> Test with lock on 12 cores...
> Core [16] Cost Time = 18808 us
> Core [17] Cost Time = 29497 us
> Core [18] Cost Time = 29132 us
> Core [19] Cost Time = 26150 us
> Core [20] Cost Time = 21892 us
> Core [21] Cost Time = 24377 us
> Core [22] Cost Time = 27211 us
> Core [23] Cost Time = 11070 us
> Core [24] Cost Time = 29802 us
> Core [25] Cost Time = 15793 us
> Core [26] Cost Time = 7474 us
> Core [27] Cost Time = 29550 us
> Total Cost Time = 270756 us
>
> In the tests on ThunderX2, with more cores contending, the performance gain
> was even higher, indicating the __atomic implementation scales up better
> than __sync.
>
> Fixes: af75078fece3 ("first public release")
> Cc: stable@dpdk.org
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> Reviewed-by: Steve Capper <Steve.Capper@arm.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [EXT] [PATCH v8 3/3] spinlock: reimplement with atomic one-way barrier builtins
2019-03-14 14:22 ` Jerin Jacob Kollanukkaran
@ 2019-03-14 14:22 ` Jerin Jacob Kollanukkaran
0 siblings, 0 replies; 54+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-03-14 14:22 UTC (permalink / raw)
To: Gavin Hu
Cc: dev, nd, thomas, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, i.maximets, chaozhu, stable
On Fri, Mar 08, 2019 at 03:56:37PM +0800, Gavin Hu wrote:
> External Email
>
> ----------------------------------------------------------------------
> The __sync builtin based implementation generates full memory barriers
> ('dmb ish') on Arm platforms. Using C11 atomic builtins to generate one way
> barriers.
>
> Here is the assembly code of __sync_compare_and_swap builtin.
> __sync_bool_compare_and_swap(dst, exp, src);
> 0x000000000090f1b0 <+16>: e0 07 40 f9 ldr x0, [sp, #8]
> 0x000000000090f1b4 <+20>: e1 0f 40 79 ldrh w1, [sp, #6]
> 0x000000000090f1b8 <+24>: e2 0b 40 79 ldrh w2, [sp, #4]
> 0x000000000090f1bc <+28>: 21 3c 00 12 and w1, w1, #0xffff
> 0x000000000090f1c0 <+32>: 03 7c 5f 48 ldxrh w3, [x0]
> 0x000000000090f1c4 <+36>: 7f 00 01 6b cmp w3, w1
> 0x000000000090f1c8 <+40>: 61 00 00 54 b.ne 0x90f1d4
> <rte_atomic16_cmpset+52> // b.any
> 0x000000000090f1cc <+44>: 02 fc 04 48 stlxrh w4, w2, [x0]
> 0x000000000090f1d0 <+48>: 84 ff ff 35 cbnz w4, 0x90f1c0
> <rte_atomic16_cmpset+32>
> 0x000000000090f1d4 <+52>: bf 3b 03 d5 dmb ish
> 0x000000000090f1d8 <+56>: e0 17 9f 1a cset w0, eq // eq = none
>
> The benchmarking results showed constant improvements on all available
> platforms:
> 1. Cavium ThunderX2: 126% performance;
> 2. Hisilicon 1616: 30%;
> 3. Qualcomm Falkor: 13%;
> 4. Marvell ARMADA 8040 with A72 cores on macchiatobin: 3.7%
>
> Here is the example test result on TX2:
> $sudo ./build/app/test -l 16-27 -- i
> RTE>>spinlock_autotest
>
> *** spinlock_autotest without this patch ***
> Test with lock on 12 cores...
> Core [16] Cost Time = 53886 us
> Core [17] Cost Time = 53605 us
> Core [18] Cost Time = 53163 us
> Core [19] Cost Time = 49419 us
> Core [20] Cost Time = 34317 us
> Core [21] Cost Time = 53408 us
> Core [22] Cost Time = 53970 us
> Core [23] Cost Time = 53930 us
> Core [24] Cost Time = 53283 us
> Core [25] Cost Time = 51504 us
> Core [26] Cost Time = 50718 us
> Core [27] Cost Time = 51730 us
> Total Cost Time = 612933 us
>
> *** spinlock_autotest with this patch ***
> Test with lock on 12 cores...
> Core [16] Cost Time = 18808 us
> Core [17] Cost Time = 29497 us
> Core [18] Cost Time = 29132 us
> Core [19] Cost Time = 26150 us
> Core [20] Cost Time = 21892 us
> Core [21] Cost Time = 24377 us
> Core [22] Cost Time = 27211 us
> Core [23] Cost Time = 11070 us
> Core [24] Cost Time = 29802 us
> Core [25] Cost Time = 15793 us
> Core [26] Cost Time = 7474 us
> Core [27] Cost Time = 29550 us
> Total Cost Time = 270756 us
>
> In the tests on ThunderX2, with more cores contending, the performance gain
> was even higher, indicating the __atomic implementation scales up better
> than __sync.
>
> Fixes: af75078fece3 ("first public release")
> Cc: stable@dpdk.org
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> Reviewed-by: Steve Capper <Steve.Capper@arm.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [PATCH v8 0/3] generic spinlock optimization and test case enhancements
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 0/3] generic spinlock optimization and test case enhancements Gavin Hu
2019-03-11 12:21 ` Nipun Gupta
@ 2019-03-15 12:21 ` Ananyev, Konstantin
2019-03-15 12:21 ` Ananyev, Konstantin
2019-03-28 7:47 ` Thomas Monjalon
2 siblings, 1 reply; 54+ messages in thread
From: Ananyev, Konstantin @ 2019-03-15 12:21 UTC (permalink / raw)
To: Gavin Hu, dev
Cc: nd, thomas, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, i.maximets, chaozhu
>
> V8: Remove internal ChangeId
>
> V7: Update the 1/3 patch headline and commit message
>
> V6: Rebase and drop the first patch as a similar fix was already merged.
>
> V5: Remove ChangeId(sorry for that)
>
> V4:
> 1. Drop one patch for the test case to get time precisely as the overhead
> of getting time is amortized already in another patch.
> 2. Drop the ticket lock patch from this series as there are no dependency
> between them, the ticket lock patch was submitted separately:
> http://patchwork.dpdk.org/patch/49770/
> 3. Define volatile variable in patch #3 to be more realistic for spinlock
> protection(avoid optimization be compiler).
> 4. Fix typos.
>
> V3:
> 1. Implemented the ticket lock to improve the fairness and predictability.
> The locks are obtained in the order of requested.
>
> V2:
> 1. FORCE_INTRINCIS is still an option for ppc/x86, although not is use
> by default, so don't remove it from generic file.
> 2. Fix the clang compiler error on x86 when the above FORCE_INTRINSICS
> is enabled.
>
> V1:
> 1. Remove the 1us delay outside of the locked region to really benchmark
> the spinlock acquire/release performance, not the delay API.
> 2. Use the precise version of getting timestamps for more precise
> benchmarking results.
> 3. Amortize the overhead of getting the timestamp by 10000 loops.
> 4. Move the arm specific implementation to arm folder to remove the
> hardcoded implementation.
> 5. Use atomic primitives, which translate to one-way barriers, instead of
> two-way sync primitives, to optimize for performance.
>
> Gavin Hu (3):
> test/spinlock: remove 1us delay for correct benchmarking
> test/spinlock: amortize the cost of getting time
> spinlock: reimplement with atomic one-way barrier builtins
>
> app/test/test_spinlock.c | 31 +++++++++++-----------
> .../common/include/generic/rte_spinlock.h | 18 +++++++++----
> 2 files changed, 29 insertions(+), 20 deletions(-)
>
> --
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> 2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [PATCH v8 0/3] generic spinlock optimization and test case enhancements
2019-03-15 12:21 ` Ananyev, Konstantin
@ 2019-03-15 12:21 ` Ananyev, Konstantin
0 siblings, 0 replies; 54+ messages in thread
From: Ananyev, Konstantin @ 2019-03-15 12:21 UTC (permalink / raw)
To: Gavin Hu, dev
Cc: nd, thomas, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, i.maximets, chaozhu
>
> V8: Remove internal ChangeId
>
> V7: Update the 1/3 patch headline and commit message
>
> V6: Rebase and drop the first patch as a similar fix was already merged.
>
> V5: Remove ChangeId(sorry for that)
>
> V4:
> 1. Drop one patch for the test case to get time precisely as the overhead
> of getting time is amortized already in another patch.
> 2. Drop the ticket lock patch from this series as there are no dependency
> between them, the ticket lock patch was submitted separately:
> http://patchwork.dpdk.org/patch/49770/
> 3. Define volatile variable in patch #3 to be more realistic for spinlock
> protection(avoid optimization be compiler).
> 4. Fix typos.
>
> V3:
> 1. Implemented the ticket lock to improve the fairness and predictability.
> The locks are obtained in the order of requested.
>
> V2:
> 1. FORCE_INTRINCIS is still an option for ppc/x86, although not is use
> by default, so don't remove it from generic file.
> 2. Fix the clang compiler error on x86 when the above FORCE_INTRINSICS
> is enabled.
>
> V1:
> 1. Remove the 1us delay outside of the locked region to really benchmark
> the spinlock acquire/release performance, not the delay API.
> 2. Use the precise version of getting timestamps for more precise
> benchmarking results.
> 3. Amortize the overhead of getting the timestamp by 10000 loops.
> 4. Move the arm specific implementation to arm folder to remove the
> hardcoded implementation.
> 5. Use atomic primitives, which translate to one-way barriers, instead of
> two-way sync primitives, to optimize for performance.
>
> Gavin Hu (3):
> test/spinlock: remove 1us delay for correct benchmarking
> test/spinlock: amortize the cost of getting time
> spinlock: reimplement with atomic one-way barrier builtins
>
> app/test/test_spinlock.c | 31 +++++++++++-----------
> .../common/include/generic/rte_spinlock.h | 18 +++++++++----
> 2 files changed, 29 insertions(+), 20 deletions(-)
>
> --
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> 2.7.4
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [PATCH v8 0/3] generic spinlock optimization and test case enhancements
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 0/3] generic spinlock optimization and test case enhancements Gavin Hu
2019-03-11 12:21 ` Nipun Gupta
2019-03-15 12:21 ` Ananyev, Konstantin
@ 2019-03-28 7:47 ` Thomas Monjalon
2019-03-28 7:47 ` Thomas Monjalon
2 siblings, 1 reply; 54+ messages in thread
From: Thomas Monjalon @ 2019-03-28 7:47 UTC (permalink / raw)
To: Gavin Hu
Cc: dev, nd, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, i.maximets, chaozhu
> Gavin Hu (3):
> test/spinlock: remove 1us delay for correct benchmarking
> test/spinlock: amortize the cost of getting time
> spinlock: reimplement with atomic one-way barrier builtins
Applied, thanks
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [dpdk-dev] [PATCH v8 0/3] generic spinlock optimization and test case enhancements
2019-03-28 7:47 ` Thomas Monjalon
@ 2019-03-28 7:47 ` Thomas Monjalon
0 siblings, 0 replies; 54+ messages in thread
From: Thomas Monjalon @ 2019-03-28 7:47 UTC (permalink / raw)
To: Gavin Hu
Cc: dev, nd, jerinj, hemant.agrawal, nipun.gupta,
Honnappa.Nagarahalli, i.maximets, chaozhu
> Gavin Hu (3):
> test/spinlock: remove 1us delay for correct benchmarking
> test/spinlock: amortize the cost of getting time
> spinlock: reimplement with atomic one-way barrier builtins
Applied, thanks
^ permalink raw reply [flat|nested] 54+ messages in thread
end of thread, other threads:[~2019-03-28 7:47 UTC | newest]
Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-20 10:42 [dpdk-dev] [PATCH v1 0/5] spinlock optimization and test case enhancements Gavin Hu
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 1/5] test/spinlock: remove 1us delay for correct spinlock benchmarking Gavin Hu
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 2/5] test/spinlock: get timestamp more precisely Gavin Hu
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 3/5] test/spinlock: amortize the cost of getting time Gavin Hu
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 4/5] spinlock: move the implementation to arm specific file Gavin Hu
2018-12-20 12:47 ` David Marchand
2018-12-20 12:55 ` David Marchand
2018-12-20 14:40 ` Gavin Hu (Arm Technology China)
2018-12-20 14:36 ` Gavin Hu (Arm Technology China)
2018-12-20 15:09 ` David Marchand
2018-12-20 15:58 ` Gavin Hu (Arm Technology China)
2018-12-20 15:59 ` David Marchand
2018-12-20 10:42 ` [dpdk-dev] [PATCH v1 5/5] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 0/5] spinlock optimization and test case enhancements Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 1/5] test/spinlock: remove 1us delay for correct benchmarking Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 2/5] test/spinlock: get timestamp more precisely Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 3/5] test/spinlock: amortize the cost of getting time Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 4/5] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
2018-12-20 17:42 ` [dpdk-dev] [PATCH v2 5/5] eal: fix clang compilation error on x86 Gavin Hu
2019-01-15 7:54 ` [dpdk-dev] [PATCH v4 0/4] spinlock optimization and test case enhancements gavin hu
2019-01-15 7:54 ` [dpdk-dev] [PATCH v4 1/4] eal: fix clang compilation error on x86 gavin hu
2019-01-15 7:54 ` [dpdk-dev] [PATCH v4 2/4] test/spinlock: remove 1us delay for correct benchmarking gavin hu
2019-01-15 7:54 ` [dpdk-dev] [PATCH v4 3/4] test/spinlock: amortize the cost of getting time gavin hu
2019-01-15 7:54 ` [dpdk-dev] [PATCH v4 4/4] spinlock: reimplement with atomic one-way barrier builtins gavin hu
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 0/4] spinlock optimization and test case enhancements gavin hu
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 1/4] eal: fix clang compilation error on x86 gavin hu
2019-01-15 17:42 ` Honnappa Nagarahalli
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 2/4] test/spinlock: remove 1us delay for correct benchmarking gavin hu
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 3/4] test/spinlock: amortize the cost of getting time gavin hu
2019-01-15 10:32 ` [dpdk-dev] [PATCH v5 4/4] spinlock: reimplement with atomic one-way barrier builtins gavin hu
2019-03-08 7:16 ` [dpdk-dev] [PATCH v6 0/3] generic spinlock optimization and test case enhancements Gavin Hu
2019-03-08 7:16 ` [dpdk-dev] [PATCH v6 1/3] test/spinlock: dealy 1 us to create contention Gavin Hu
2019-03-08 7:16 ` [dpdk-dev] [PATCH v6 2/3] test/spinlock: amortize the cost of getting time Gavin Hu
2019-03-08 7:16 ` [dpdk-dev] [PATCH v6 3/3] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
2019-03-08 7:37 ` [dpdk-dev] [PATCH v7 0/3] generic spinlock optimization and test case enhancements Gavin Hu
2019-03-08 7:37 ` [dpdk-dev] [PATCH v7 1/3] test/spinlock: remove 1us delay for correct benchmarking Gavin Hu
2019-03-08 7:37 ` [dpdk-dev] [PATCH v7 2/3] test/spinlock: amortize the cost of getting time Gavin Hu
2019-03-08 7:37 ` [dpdk-dev] [PATCH v7 3/3] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 0/3] generic spinlock optimization and test case enhancements Gavin Hu
2019-03-11 12:21 ` Nipun Gupta
2019-03-15 12:21 ` Ananyev, Konstantin
2019-03-15 12:21 ` Ananyev, Konstantin
2019-03-28 7:47 ` Thomas Monjalon
2019-03-28 7:47 ` Thomas Monjalon
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 1/3] test/spinlock: remove 1us delay for correct benchmarking Gavin Hu
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 2/3] test/spinlock: amortize the cost of getting time Gavin Hu
2019-03-08 7:56 ` [dpdk-dev] [PATCH v8 3/3] spinlock: reimplement with atomic one-way barrier builtins Gavin Hu
2019-03-12 14:53 ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
2019-03-14 0:31 ` Honnappa Nagarahalli
2019-03-14 0:31 ` Honnappa Nagarahalli
2019-03-14 2:36 ` Gavin Hu (Arm Technology China)
2019-03-14 2:36 ` Gavin Hu (Arm Technology China)
2019-03-14 14:22 ` Jerin Jacob Kollanukkaran
2019-03-14 14:22 ` Jerin Jacob Kollanukkaran
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).