From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: from dpdk.org (dpdk.org [92.243.14.124])
	by inbox.dpdk.org (Postfix) with ESMTP id 17A76A046B;
	Tue, 23 Jul 2019 03:02:51 +0200 (CEST)
Received: from [92.243.14.124] (localhost [127.0.0.1])
	by dpdk.org (Postfix) with ESMTP id 095C31BEF8;
	Tue, 23 Jul 2019 03:02:51 +0200 (CEST)
Received: from mellanox.co.il (mail-il-dmz.mellanox.com [193.47.165.129])
	by dpdk.org (Postfix) with ESMTP id 3AB7C1BEF8;
	Tue, 23 Jul 2019 03:02:50 +0200 (CEST)
Received: from Internal Mail-Server by MTLPINE2
	(envelope-from yskoh@mellanox.com)
	with ESMTPS (AES256-SHA encrypted); 23 Jul 2019 04:02:47 +0300
Received: from scfae-sc-2.mti.labs.mlnx (scfae-sc-2.mti.labs.mlnx [10.101.0.96])
	by labmailer.mlnx (8.13.8/8.13.8) with ESMTP id x6N11HfY026580;
	Tue, 23 Jul 2019 04:02:44 +0300
From: Yongseok Koh
To: Gavin Hu
Cc: Phil Yang, Honnappa Nagarahalli, Ola Liljedahl, Steve Capper,
	Jerin Jacob, Nipun Gupta, Konstantin Ananyev, dpdk stable
Date: Mon, 22 Jul 2019 18:00:15 -0700
Message-Id: <20190723010115.6446-48-yskoh@mellanox.com>
X-Mailer: git-send-email 2.21.0
In-Reply-To: <20190723010115.6446-1-yskoh@mellanox.com>
References: <20190723010115.6446-1-yskoh@mellanox.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: [dpdk-stable] patch 'spinlock: reimplement with atomic one-way
	barrier' has been queued to LTS release 17.11.7
X-BeenThere: stable@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches for DPDK stable branches
Errors-To: stable-bounces@dpdk.org
Sender: "stable"

Hi,

FYI, your patch has been queued to LTS release 17.11.7

Note it hasn't been pushed to http://dpdk.org/browse/dpdk-stable yet.
It will be pushed if I get no objection by 07/27/19. So please shout if
anyone has objection.

Also note that after the patch there's a diff of the upstream commit vs
the patch applied to the branch.
This will indicate if there was any rebasing needed to apply to the
stable branch. If there were code changes for rebasing (ie: not only
metadata diffs), please double check that the rebase was correctly done.

Thanks.

Yongseok

---
>From 2f66f485b44da1c7bad27a92fe847a8b7888a831 Mon Sep 17 00:00:00 2001
From: Gavin Hu
Date: Fri, 8 Mar 2019 15:56:37 +0800
Subject: [PATCH] spinlock: reimplement with atomic one-way barrier

[ upstream commit 453d8f736676255696bce3c5e01b43b543ff896c ]

The __sync builtin based implementation generates full memory barriers
('dmb ish') on Arm platforms. Using C11 atomic builtins to generate
one way barriers.

Here is the assembly code of __sync_compare_and_swap builtin.
__sync_bool_compare_and_swap(dst, exp, src);

   0x000000000090f1b0 <+16>:	e0 07 40 f9	ldr	x0, [sp, #8]
   0x000000000090f1b4 <+20>:	e1 0f 40 79	ldrh	w1, [sp, #6]
   0x000000000090f1b8 <+24>:	e2 0b 40 79	ldrh	w2, [sp, #4]
   0x000000000090f1bc <+28>:	21 3c 00 12	and	w1, w1, #0xffff
   0x000000000090f1c0 <+32>:	03 7c 5f 48	ldxrh	w3, [x0]
   0x000000000090f1c4 <+36>:	7f 00 01 6b	cmp	w3, w1
   0x000000000090f1c8 <+40>:	61 00 00 54	b.ne	0x90f1d4  // b.any
   0x000000000090f1cc <+44>:	02 fc 04 48	stlxrh	w4, w2, [x0]
   0x000000000090f1d0 <+48>:	84 ff ff 35	cbnz	w4, 0x90f1c0
   0x000000000090f1d4 <+52>:	bf 3b 03 d5	dmb	ish
   0x000000000090f1d8 <+56>:	e0 17 9f 1a	cset	w0, eq  // eq = none

The benchmarking results showed constant improvements on all available
platforms:
1. Cavium ThunderX2: 126% performance;
2. Hisilicon 1616: 30%;
3. Qualcomm Falkor: 13%;
4. Marvell ARMADA 8040 with A72 cores on macchiatobin: 3.7%

Here is the example test result on TX2:
$sudo ./build/app/test -l 16-27 -- i
RTE>>spinlock_autotest

*** spinlock_autotest without this patch ***
Test with lock on 12 cores...
Core [16] Cost Time = 53886 us
Core [17] Cost Time = 53605 us
Core [18] Cost Time = 53163 us
Core [19] Cost Time = 49419 us
Core [20] Cost Time = 34317 us
Core [21] Cost Time = 53408 us
Core [22] Cost Time = 53970 us
Core [23] Cost Time = 53930 us
Core [24] Cost Time = 53283 us
Core [25] Cost Time = 51504 us
Core [26] Cost Time = 50718 us
Core [27] Cost Time = 51730 us
Total Cost Time = 612933 us

*** spinlock_autotest with this patch ***
Test with lock on 12 cores...
Core [16] Cost Time = 18808 us
Core [17] Cost Time = 29497 us
Core [18] Cost Time = 29132 us
Core [19] Cost Time = 26150 us
Core [20] Cost Time = 21892 us
Core [21] Cost Time = 24377 us
Core [22] Cost Time = 27211 us
Core [23] Cost Time = 11070 us
Core [24] Cost Time = 29802 us
Core [25] Cost Time = 15793 us
Core [26] Cost Time = 7474 us
Core [27] Cost Time = 29550 us
Total Cost Time = 270756 us

In the tests on ThunderX2, with more cores contending, the performance
gain was even higher, indicating the __atomic implementation scales up
better than __sync.
Fixes: af75078fece3 ("first public release")

Signed-off-by: Gavin Hu
Reviewed-by: Phil Yang
Reviewed-by: Honnappa Nagarahalli
Reviewed-by: Ola Liljedahl
Reviewed-by: Steve Capper
Reviewed-by: Jerin Jacob
Acked-by: Nipun Gupta
Acked-by: Konstantin Ananyev
---
 .../common/include/generic/rte_spinlock.h     | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index 54f83a4c56..f260136f28 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -90,9 +90,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl)
 {
-	while (__sync_lock_test_and_set(&sl->locked, 1))
-		while(sl->locked)
+	int exp = 0;
+
+	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
+				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
+		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
 			rte_pause();
+		exp = 0;
+	}
 }
 #endif
 
@@ -109,7 +114,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
 static inline void
 rte_spinlock_unlock (rte_spinlock_t *sl)
 {
-	__sync_lock_release(&sl->locked);
+	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
 }
 #endif
 
@@ -128,7 +133,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
 static inline int
 rte_spinlock_trylock (rte_spinlock_t *sl)
 {
-	return __sync_lock_test_and_set(&sl->locked,1) == 0;
+	int exp = 0;
+	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
+				0, /* disallow spurious failure */
+				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
 }
 #endif
 
@@ -142,7 +150,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
  */
 static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
 {
-	return sl->locked;
+	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
2.21.0

---
  Diff of the applied patch vs upstream commit (please double-check if non-empty):
---
--- -	2019-07-22 17:55:09.104148109 -0700
+++
0048-spinlock-reimplement-with-atomic-one-way-barrier.patch	2019-07-22 17:55:06.100479000 -0700
@@ -1,8 +1,10 @@
-From 453d8f736676255696bce3c5e01b43b543ff896c Mon Sep 17 00:00:00 2001
+From 2f66f485b44da1c7bad27a92fe847a8b7888a831 Mon Sep 17 00:00:00 2001
 From: Gavin Hu
 Date: Fri, 8 Mar 2019 15:56:37 +0800
 Subject: [PATCH] spinlock: reimplement with atomic one-way barrier
 
+[ upstream commit 453d8f736676255696bce3c5e01b43b543ff896c ]
+
 The __sync builtin based implementation generates full memory barriers
 ('dmb ish') on Arm platforms. Using C11 atomic builtins to generate
 one way barriers.
@@ -71,7 +73,6 @@
 than __sync.
 
 Fixes: af75078fece3 ("first public release")
-Cc: stable@dpdk.org
 
 Signed-off-by: Gavin Hu
 Reviewed-by: Phil Yang
@@ -86,10 +87,10 @@
  1 file changed, 13 insertions(+), 5 deletions(-)
 
 diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
-index c4c3fc31ec..87ae7a4f18 100644
+index 54f83a4c56..f260136f28 100644
 --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
 +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
-@@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
+@@ -90,9 +90,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
  static inline void
  rte_spinlock_lock(rte_spinlock_t *sl)
  {
@@ -106,7 +107,7 @@
  }
  #endif
 
-@@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
+@@ -109,7 +114,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
  static inline void
  rte_spinlock_unlock (rte_spinlock_t *sl)
  {
@@ -115,7 +116,7 @@
  }
  #endif
 
-@@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
+@@ -128,7 +133,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
  static inline int
  rte_spinlock_trylock (rte_spinlock_t *sl)
  {
@@ -127,7 +128,7 @@
  }
  #endif
 
-@@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
+@@ -142,7 +150,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
  */
  static inline int
  rte_spinlock_is_locked (rte_spinlock_t *sl)
 {