DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH 0/5] lib/stack: improve lockfree C11 implementation
@ 2020-09-11 15:29 Steven Lariau
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 1/5] lib/stack: fix inconsistent weak / strong cas Steven Lariau
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Steven Lariau @ 2020-09-11 15:29 UTC (permalink / raw)
  Cc: dev, nd, dharmik.thakkar, Steven Lariau

One implementation of the DPDK stack library is lockfree,
based on C11 memory model for atomics.
Some of these atomic operations use unnecessary memory orders,
that can be relaxed.
This patch relax some of these operations in order to improve
the performance of the stack library.

The patch was tested on several architectures, to ensure that
the implementation is correct, and to measure performance.
Below are the results for a few architectures on multithread stack
lockfree test.
The cycles count is the average number of cycles per item to perform
a bulk push / pop.

$sudo ./builddir/app/dpdk-test
RTE>>stack_lf_perf_autotest
                              difference compared to main
Cycles count on ThunderX2
 2 cores, bulk size =  8:           -15.85%
 2 cores, bulk size = 32:           -04.56%
 4 cores, bulk size =  8:           -05.00%
 4 cores, bulk size = 32:           -04.35%
16 cores, bulk size =  8:           -02.38%
16 cores, bulk size = 32:           -01.88%

                              difference compared to main
Cycles count on N1SDP
 2 cores, batch size =  8:          +00.77%
 2 cores, batch size = 32:          -16.00%

                              difference compared to main
Cycles count on Skylake
 2 cores, bulk size =  8:           -00.18%
 2 cores, bulk size = 32:           -00.95%
 4 cores, bulk size =  8:           -01.19%
 4 cores, bulk size = 32:           +00.64%
16 cores, bulk size =  8:           +01.20%
16 cores, bulk size = 32:           +00.48%

Steven Lariau (5):
  lib/stack: fix inconsistent weak / strong cas
  lib/stack: remove push acquire fence
  lib/stack: remove redundant orderings for list->len
  lib/stack: reload head when pop fails
  lib/stack: remove pop cas release ordering

 lib/librte_stack/rte_stack_lf_c11.h | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH 1/5] lib/stack: fix inconsistent weak / strong cas
  2020-09-11 15:29 [dpdk-dev] [PATCH 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
@ 2020-09-11 15:29 ` Steven Lariau
  2020-09-21 17:16   ` Eads, Gage
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 2/5] lib/stack: remove push acquire fence Steven Lariau
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Steven Lariau @ 2020-09-11 15:29 UTC (permalink / raw)
  To: Gage Eads, Olivier Matz; +Cc: dev, nd, dharmik.thakkar, Steven Lariau

Fix cmpexchange usage of weak / strong.
The generated code is the same on x86 and ARM (there is no weak
cmpexchange), but the old usage was inconsistent.
For push and pop update size, weak is used because cmpexchange is inside
a loop.
For pop update root, strong is used even though cmpexchange is inside a
loop, because there may be a lot of operations to do in a loop iteration
(locate the new head).

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 999359f08..1e0ea0bef 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -96,7 +96,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 		/* len is updated on failure */
 		if (__atomic_compare_exchange_n(&list->len,
 						&len, len - num,
-						0, __ATOMIC_ACQUIRE,
+						1, __ATOMIC_ACQUIRE,
 						__ATOMIC_ACQUIRE))
 			break;
 	}
@@ -149,7 +149,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 				(rte_int128_t *)&list->head,
 				(rte_int128_t *)&old_head,
 				(rte_int128_t *)&new_head,
-				1, __ATOMIC_RELEASE,
+				0, __ATOMIC_RELEASE,
 				__ATOMIC_RELAXED);
 	} while (success == 0);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH 2/5] lib/stack: remove push acquire fence
  2020-09-11 15:29 [dpdk-dev] [PATCH 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 1/5] lib/stack: fix inconsistent weak / strong cas Steven Lariau
@ 2020-09-11 15:29 ` Steven Lariau
  2020-09-21 17:16   ` Eads, Gage
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 3/5] lib/stack: remove redundant orderings for list->len Steven Lariau
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Steven Lariau @ 2020-09-11 15:29 UTC (permalink / raw)
  To: Gage Eads, Olivier Matz; +Cc: dev, nd, dharmik.thakkar, Steven Lariau

An acquire fence is used to make sure loads after the fence can observe
all store operations before a specific store-release.
But push doesn't read any data, except for the head which is part of a
CAS operation (the items on the list are not read).
So there is no need for the acquire barrier.

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 1e0ea0bef..82b7287f1 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -44,12 +44,6 @@ __rte_stack_lf_push_elems(struct rte_stack_lf_list *list,
 	do {
 		struct rte_stack_lf_head new_head;
 
-		/* Use an acquire fence to establish a synchronized-with
-		 * relationship between the list->head load and store-release
-		 * operations (as part of the rte_atomic128_cmp_exchange()).
-		 */
-		__atomic_thread_fence(__ATOMIC_ACQUIRE);
-
 		/* Swing the top pointer to the first element in the list and
 		 * make the last element point to the old top.
 		 */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH 3/5] lib/stack: remove redundant orderings for list->len
  2020-09-11 15:29 [dpdk-dev] [PATCH 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 1/5] lib/stack: fix inconsistent weak / strong cas Steven Lariau
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 2/5] lib/stack: remove push acquire fence Steven Lariau
@ 2020-09-11 15:29 ` Steven Lariau
  2020-09-21 17:16   ` Eads, Gage
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 4/5] lib/stack: reload head when pop fails Steven Lariau
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Steven Lariau @ 2020-09-11 15:29 UTC (permalink / raw)
  To: Gage Eads, Olivier Matz; +Cc: dev, nd, dharmik.thakkar, Steven Lariau

The load-acquire of list->len on pop function is redundant.
Only the CAS success needs to be load-acquire.
It synchronizes with the store release in push, to ensure that the
updated head is visible when the new length is visible.
Without this, one thread in pop could see the increased length but the
old list, which doesn't have enough items yet for pop to succeed.

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 82b7287f1..2bc639419 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -80,7 +80,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 	int success;
 
 	/* Reserve num elements, if available */
-	len = __atomic_load_n(&list->len, __ATOMIC_ACQUIRE);
+	len = __atomic_load_n(&list->len, __ATOMIC_RELAXED);
 
 	while (1) {
 		/* Does the list contain enough elements? */
@@ -91,7 +91,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 		if (__atomic_compare_exchange_n(&list->len,
 						&len, len - num,
 						1, __ATOMIC_ACQUIRE,
-						__ATOMIC_ACQUIRE))
+						__ATOMIC_RELAXED))
 			break;
 	}
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH 4/5] lib/stack: reload head when pop fails
  2020-09-11 15:29 [dpdk-dev] [PATCH 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
                   ` (2 preceding siblings ...)
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 3/5] lib/stack: remove redundant orderings for list->len Steven Lariau
@ 2020-09-11 15:29 ` Steven Lariau
  2020-09-21 17:16   ` Eads, Gage
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 5/5] lib/stack: remove pop cas release ordering Steven Lariau
  2020-09-25 17:43 ` [dpdk-dev] [PATCH v2 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
  5 siblings, 1 reply; 22+ messages in thread
From: Steven Lariau @ 2020-09-11 15:29 UTC (permalink / raw)
  To: Gage Eads, Olivier Matz; +Cc: dev, nd, dharmik.thakkar, Steven Lariau

List head must be loaded right before continue (when failed to
find the new head).
Without this, one thread might keep trying and failing to pop items
without ever loading the new correct head.

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 2bc639419..adb9f590d 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -133,8 +133,10 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 		/* If NULL was encountered, the list was modified while
 		 * traversing it. Retry.
 		 */
-		if (i != num)
+		if (i != num) {
+			old_head = list->head;
 			continue;
+		}
 
 		new_head.top = tmp;
 		new_head.cnt = old_head.cnt + 1;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH 5/5] lib/stack: remove pop cas release ordering
  2020-09-11 15:29 [dpdk-dev] [PATCH 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
                   ` (3 preceding siblings ...)
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 4/5] lib/stack: reload head when pop fails Steven Lariau
@ 2020-09-11 15:29 ` Steven Lariau
  2020-09-21 17:17   ` Eads, Gage
  2020-09-25 17:43 ` [dpdk-dev] [PATCH v2 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
  5 siblings, 1 reply; 22+ messages in thread
From: Steven Lariau @ 2020-09-11 15:29 UTC (permalink / raw)
  To: Gage Eads, Olivier Matz; +Cc: dev, nd, dharmik.thakkar, Steven Lariau

Replace the store-release by relaxed for the CAS success at the end of
pop. Release isn't needed, because there is not write to data that need
to be synchronized.
The only preceding write is when the length is decreased, but the length
CAS loop already ensures the right synchronization.
The situation to avoid is when a thread sees the old length but the new
list, that doesn't have enough items for pop to success.
But the CAS success on length before the pop loop ensures any core reads
and updates the latest length, preventing this situation.

The store-release is also used to make sure that the items are read
before the head is updated, in order to prevent a core in pop to read an
incorrect value because another core rewrites it with push.
But this isn't needed, because items are read only when removed from the
used list. Right after this, they are pushed to the free list, and the
store-release in push makes sure the items are read before they are
visible in the free list.

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index adb9f590d..8800db42e 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -145,7 +145,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 				(rte_int128_t *)&list->head,
 				(rte_int128_t *)&old_head,
 				(rte_int128_t *)&new_head,
-				0, __ATOMIC_RELEASE,
+				0, __ATOMIC_RELAXED,
 				__ATOMIC_RELAXED);
 	} while (success == 0);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH 1/5] lib/stack: fix inconsistent weak / strong cas
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 1/5] lib/stack: fix inconsistent weak / strong cas Steven Lariau
@ 2020-09-21 17:16   ` Eads, Gage
  0 siblings, 0 replies; 22+ messages in thread
From: Eads, Gage @ 2020-09-21 17:16 UTC (permalink / raw)
  To: Steven Lariau, Olivier Matz; +Cc: dev, nd, dharmik.thakkar



> -----Original Message-----
> From: Steven Lariau <steven.lariau@arm.com>
> Sent: Friday, September 11, 2020 10:30 AM
> To: Eads, Gage <gage.eads@intel.com>; Olivier Matz <olivier.matz@6wind.com>
> Cc: dev@dpdk.org; nd@arm.com; dharmik.thakkar@arm.com; Steven Lariau
> <steven.lariau@arm.com>
> Subject: [PATCH 1/5] lib/stack: fix inconsistent weak / strong cas
> 
> Fix cmpexchange usage of weak / strong.
> The generated code is the same on x86 and ARM (there is no weak
> cmpexchange), but the old usage was inconsistent.
> For push and pop update size, weak is used because cmpexchange is inside
> a loop.
> For pop update root, strong is used even though cmpexchange is inside a
> loop, because there may be a lot of operations to do in a loop iteration
> (locate the new head).
> 
> Signed-off-by: Steven Lariau <steven.lariau@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>

Acked-by: Gage Eads <gage.eads@intel.com>

Thanks,
Gage

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH 2/5] lib/stack: remove push acquire fence
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 2/5] lib/stack: remove push acquire fence Steven Lariau
@ 2020-09-21 17:16   ` Eads, Gage
  0 siblings, 0 replies; 22+ messages in thread
From: Eads, Gage @ 2020-09-21 17:16 UTC (permalink / raw)
  To: Steven Lariau, Olivier Matz; +Cc: dev, nd, dharmik.thakkar



> -----Original Message-----
> From: Steven Lariau <steven.lariau@arm.com>
> Sent: Friday, September 11, 2020 10:30 AM
> To: Eads, Gage <gage.eads@intel.com>; Olivier Matz <olivier.matz@6wind.com>
> Cc: dev@dpdk.org; nd@arm.com; dharmik.thakkar@arm.com; Steven Lariau
> <steven.lariau@arm.com>
> Subject: [PATCH 2/5] lib/stack: remove push acquire fence
> 
> An acquire fence is used to make sure loads after the fence can observe
> all store operations before a specific store-release.
> But push doesn't read any data, except for the head which is part of a
> CAS operation (the items on the list are not read).
> So there is no need for the acquire barrier.
> 
> Signed-off-by: Steven Lariau <steven.lariau@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>

Acked-by: Gage Eads <gage.eads@intel.com>

Thanks,
Gage

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH 3/5] lib/stack: remove redundant orderings for list->len
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 3/5] lib/stack: remove redundant orderings for list->len Steven Lariau
@ 2020-09-21 17:16   ` Eads, Gage
  0 siblings, 0 replies; 22+ messages in thread
From: Eads, Gage @ 2020-09-21 17:16 UTC (permalink / raw)
  To: Steven Lariau, Olivier Matz; +Cc: dev, nd, dharmik.thakkar



> -----Original Message-----
> From: Steven Lariau <steven.lariau@arm.com>
> Sent: Friday, September 11, 2020 10:30 AM
> To: Eads, Gage <gage.eads@intel.com>; Olivier Matz <olivier.matz@6wind.com>
> Cc: dev@dpdk.org; nd@arm.com; dharmik.thakkar@arm.com; Steven Lariau
> <steven.lariau@arm.com>
> Subject: [PATCH 3/5] lib/stack: remove redundant orderings for list->len
> 
> The load-acquire of list->len on pop function is redundant.
> Only the CAS success needs to be load-acquire.
> It synchronizes with the store release in push, to ensure that the
> updated head is visible when the new length is visible.
> Without this, one thread in pop could see the increased length but the
> old list, which doesn't have enough items yet for pop to succeed.
> 
> Signed-off-by: Steven Lariau <steven.lariau@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>

Acked-by: Gage Eads <gage.eads@intel.com>

Thanks,
Gage

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH 4/5] lib/stack: reload head when pop fails
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 4/5] lib/stack: reload head when pop fails Steven Lariau
@ 2020-09-21 17:16   ` Eads, Gage
  0 siblings, 0 replies; 22+ messages in thread
From: Eads, Gage @ 2020-09-21 17:16 UTC (permalink / raw)
  To: Steven Lariau, Olivier Matz; +Cc: dev, nd, dharmik.thakkar



> -----Original Message-----
> From: Steven Lariau <steven.lariau@arm.com>
> Sent: Friday, September 11, 2020 10:30 AM
> To: Eads, Gage <gage.eads@intel.com>; Olivier Matz <olivier.matz@6wind.com>
> Cc: dev@dpdk.org; nd@arm.com; dharmik.thakkar@arm.com; Steven Lariau
> <steven.lariau@arm.com>
> Subject: [PATCH 4/5] lib/stack: reload head when pop fails
> 
> List head must be loaded right before continue (when failed to
> find the new head).
> Without this, one thread might keep trying and failing to pop items
> without ever loading the new correct head.
> 
> Signed-off-by: Steven Lariau <steven.lariau@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>

Good catch. Please add the following so the fix is considered for dpdk-stable:
Fixes: 7e6e609939a8 ("stack: add C11 atomic implementation")
Cc: stable@dpdk.org

Acked-by: Gage Eads <gage.eads@intel.com>

Thanks,
Gage

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH 5/5] lib/stack: remove pop cas release ordering
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 5/5] lib/stack: remove pop cas release ordering Steven Lariau
@ 2020-09-21 17:17   ` Eads, Gage
  2020-09-25 14:27     ` David Marchand
  0 siblings, 1 reply; 22+ messages in thread
From: Eads, Gage @ 2020-09-21 17:17 UTC (permalink / raw)
  To: Steven Lariau, Olivier Matz; +Cc: dev, nd, dharmik.thakkar



> -----Original Message-----
> From: Steven Lariau <steven.lariau@arm.com>
> Sent: Friday, September 11, 2020 10:30 AM
> To: Eads, Gage <gage.eads@intel.com>; Olivier Matz <olivier.matz@6wind.com>
> Cc: dev@dpdk.org; nd@arm.com; dharmik.thakkar@arm.com; Steven Lariau
> <steven.lariau@arm.com>
> Subject: [PATCH 5/5] lib/stack: remove pop cas release ordering
> 
> Replace the store-release by relaxed for the CAS success at the end of
> pop. Release isn't needed, because there is not write to data that need
> to be synchronized.
> The only preceding write is when the length is decreased, but the length
> CAS loop already ensures the right synchronization.
> The situation to avoid is when a thread sees the old length but the new
> list, that doesn't have enough items for pop to success.
> But the CAS success on length before the pop loop ensures any core reads
> and updates the latest length, preventing this situation.
> 
> The store-release is also used to make sure that the items are read
> before the head is updated, in order to prevent a core in pop to read an
> incorrect value because another core rewrites it with push.
> But this isn't needed, because items are read only when removed from the
> used list. Right after this, they are pushed to the free list, and the
> store-release in push makes sure the items are read before they are
> visible in the free list.

Please add a comment to this effect above the compare-exchange call. Depending
on this caller behavior for correctness is a little risky, but since this header
is private to the library I think it's ok as long as it's well-documented.

Thanks,
Gage

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH 5/5] lib/stack: remove pop cas release ordering
  2020-09-21 17:17   ` Eads, Gage
@ 2020-09-25 14:27     ` David Marchand
  0 siblings, 0 replies; 22+ messages in thread
From: David Marchand @ 2020-09-25 14:27 UTC (permalink / raw)
  To: Steven Lariau; +Cc: Olivier Matz, dev, nd, dharmik.thakkar, Eads, Gage

Hello Steven,

On Mon, Sep 21, 2020 at 7:17 PM Eads, Gage <gage.eads@intel.com> wrote:
> > -----Original Message-----
> > From: Steven Lariau <steven.lariau@arm.com>
> > Sent: Friday, September 11, 2020 10:30 AM
> > To: Eads, Gage <gage.eads@intel.com>; Olivier Matz <olivier.matz@6wind.com>
> > Cc: dev@dpdk.org; nd@arm.com; dharmik.thakkar@arm.com; Steven Lariau
> > <steven.lariau@arm.com>
> > Subject: [PATCH 5/5] lib/stack: remove pop cas release ordering
> >
> > Replace the store-release by relaxed for the CAS success at the end of
> > pop. Release isn't needed, because there is not write to data that need
> > to be synchronized.
> > The only preceding write is when the length is decreased, but the length
> > CAS loop already ensures the right synchronization.
> > The situation to avoid is when a thread sees the old length but the new
> > list, that doesn't have enough items for pop to success.
> > But the CAS success on length before the pop loop ensures any core reads
> > and updates the latest length, preventing this situation.
> >
> > The store-release is also used to make sure that the items are read
> > before the head is updated, in order to prevent a core in pop to read an
> > incorrect value because another core rewrites it with push.
> > But this isn't needed, because items are read only when removed from the
> > used list. Right after this, they are pushed to the free list, and the
> > store-release in push makes sure the items are read before they are
> > visible in the free list.
>
> Please add a comment to this effect above the compare-exchange call. Depending
> on this caller behavior for correctness is a little risky, but since this header
> is private to the library I think it's ok as long as it's well-documented.
>

Can you prepare a v2 for this comment?
Thanks.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH v2 0/5] lib/stack: improve lockfree C11 implementation
  2020-09-11 15:29 [dpdk-dev] [PATCH 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
                   ` (4 preceding siblings ...)
  2020-09-11 15:29 ` [dpdk-dev] [PATCH 5/5] lib/stack: remove pop cas release ordering Steven Lariau
@ 2020-09-25 17:43 ` Steven Lariau
  2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 1/5] lib/stack: fix inconsistent weak / strong cas Steven Lariau
                     ` (5 more replies)
  5 siblings, 6 replies; 22+ messages in thread
From: Steven Lariau @ 2020-09-25 17:43 UTC (permalink / raw)
  Cc: dev, nd, Steven Lariau

One implementation of the DPDK stack library is lockfree,
based on C11 memory model for atomics.
Some of these atomic operations use unnecessary memory orders,
that can be relaxed.
This patch relax some of these operations in order to improve
the performance of the stack library.

The patch was tested on several architectures, to ensure that
the implementation is correct, and to measure performance.
Below are the results for a few architectures on multithread stack
lockfree test.
The cycles count is the average number of cycles per item to perform
a bulk push / pop.

$sudo ./builddir/app/dpdk-test
RTE>>stack_lf_perf_autotest
                              difference compared to main
Cycles count on ThunderX2
 2 cores, bulk size =  8:           -15.85%
 2 cores, bulk size = 32:           -04.56%
 4 cores, bulk size =  8:           -05.00%
 4 cores, bulk size = 32:           -04.35%
16 cores, bulk size =  8:           -02.38%
16 cores, bulk size = 32:           -01.88%

                              difference compared to main
Cycles count on N1SDP
 2 cores, batch size =  8:          +00.77%
 2 cores, batch size = 32:          -16.00%

                              difference compared to main
Cycles count on Skylake
 2 cores, bulk size =  8:           -00.18%
 2 cores, bulk size = 32:           -00.95%
 4 cores, bulk size =  8:           -01.19%
 4 cores, bulk size = 32:           +00.64%
16 cores, bulk size =  8:           +01.20%
16 cores, bulk size = 32:           +00.48%

v2: add comment to explain why pop head CAS relaxed is valid
    added Fixes information

Steven Lariau (5):
  lib/stack: fix inconsistent weak / strong cas
  lib/stack: remove push acquire fence
  lib/stack: remove redundant orderings for list->len
  lib/stack: reload head when pop fails
  lib/stack: remove pop cas release ordering

 lib/librte_stack/rte_stack_lf_c11.h | 32 +++++++++++++++++++----------
 1 file changed, 21 insertions(+), 11 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH v2 1/5] lib/stack: fix inconsistent weak / strong cas
  2020-09-25 17:43 ` [dpdk-dev] [PATCH v2 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
@ 2020-09-25 17:43   ` Steven Lariau
  2020-09-28 10:22     ` David Marchand
  2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 2/5] lib/stack: remove push acquire fence Steven Lariau
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Steven Lariau @ 2020-09-25 17:43 UTC (permalink / raw)
  To: Gage Eads, Olivier Matz; +Cc: dev, nd, Steven Lariau

Fix cmpexchange usage of weak / strong.
The generated code is the same on x86 and ARM (there is no weak
cmpexchange), but the old usage was inconsistent.
For push and pop update size, weak is used because cmpexchange is inside
a loop.
For pop update root, strong is used even though cmpexchange is inside a
loop, because there may be a lot of operations to do in a loop iteration
(locate the new head).

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 999359f08..1e0ea0bef 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -96,7 +96,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 		/* len is updated on failure */
 		if (__atomic_compare_exchange_n(&list->len,
 						&len, len - num,
-						0, __ATOMIC_ACQUIRE,
+						1, __ATOMIC_ACQUIRE,
 						__ATOMIC_ACQUIRE))
 			break;
 	}
@@ -149,7 +149,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 				(rte_int128_t *)&list->head,
 				(rte_int128_t *)&old_head,
 				(rte_int128_t *)&new_head,
-				1, __ATOMIC_RELEASE,
+				0, __ATOMIC_RELEASE,
 				__ATOMIC_RELAXED);
 	} while (success == 0);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH v2 2/5] lib/stack: remove push acquire fence
  2020-09-25 17:43 ` [dpdk-dev] [PATCH v2 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
  2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 1/5] lib/stack: fix inconsistent weak / strong cas Steven Lariau
@ 2020-09-25 17:43   ` Steven Lariau
  2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 3/5] lib/stack: remove redundant orderings for list->len Steven Lariau
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Steven Lariau @ 2020-09-25 17:43 UTC (permalink / raw)
  To: Gage Eads, Olivier Matz; +Cc: dev, nd, Steven Lariau

An acquire fence is used to make sure loads after the fence can observe
all store operations before a specific store-release.
But push doesn't read any data, except for the head which is part of a
CAS operation (the items on the list are not read).
So there is no need for the acquire barrier.

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 1e0ea0bef..82b7287f1 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -44,12 +44,6 @@ __rte_stack_lf_push_elems(struct rte_stack_lf_list *list,
 	do {
 		struct rte_stack_lf_head new_head;
 
-		/* Use an acquire fence to establish a synchronized-with
-		 * relationship between the list->head load and store-release
-		 * operations (as part of the rte_atomic128_cmp_exchange()).
-		 */
-		__atomic_thread_fence(__ATOMIC_ACQUIRE);
-
 		/* Swing the top pointer to the first element in the list and
 		 * make the last element point to the old top.
 		 */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH v2 3/5] lib/stack: remove redundant orderings for list->len
  2020-09-25 17:43 ` [dpdk-dev] [PATCH v2 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
  2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 1/5] lib/stack: fix inconsistent weak / strong cas Steven Lariau
  2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 2/5] lib/stack: remove push acquire fence Steven Lariau
@ 2020-09-25 17:43   ` Steven Lariau
  2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 4/5] lib/stack: reload head when pop fails Steven Lariau
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Steven Lariau @ 2020-09-25 17:43 UTC (permalink / raw)
  To: Gage Eads, Olivier Matz; +Cc: dev, nd, Steven Lariau

The load-acquire of list->len on pop function is redundant.
Only the CAS success needs to be load-acquire.
It synchronizes with the store release in push, to ensure that the
updated head is visible when the new length is visible.
Without this, one thread in pop could see the increased length but the
old list, which doesn't have enough items yet for pop to succeed.

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 82b7287f1..2bc639419 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -80,7 +80,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 	int success;
 
 	/* Reserve num elements, if available */
-	len = __atomic_load_n(&list->len, __ATOMIC_ACQUIRE);
+	len = __atomic_load_n(&list->len, __ATOMIC_RELAXED);
 
 	while (1) {
 		/* Does the list contain enough elements? */
@@ -91,7 +91,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 		if (__atomic_compare_exchange_n(&list->len,
 						&len, len - num,
 						1, __ATOMIC_ACQUIRE,
-						__ATOMIC_ACQUIRE))
+						__ATOMIC_RELAXED))
 			break;
 	}
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH v2 4/5] lib/stack: reload head when pop fails
  2020-09-25 17:43 ` [dpdk-dev] [PATCH v2 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
                     ` (2 preceding siblings ...)
  2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 3/5] lib/stack: remove redundant orderings for list->len Steven Lariau
@ 2020-09-25 17:43   ` Steven Lariau
  2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 5/5] lib/stack: remove pop cas release ordering Steven Lariau
  2020-09-30 19:14   ` [dpdk-dev] [PATCH v2 0/5] lib/stack: improve lockfree C11 implementation David Marchand
  5 siblings, 0 replies; 22+ messages in thread
From: Steven Lariau @ 2020-09-25 17:43 UTC (permalink / raw)
  To: Gage Eads, Olivier Matz, Honnappa Nagarahalli
  Cc: dev, nd, Steven Lariau, stable

List head must be loaded right before continue (when failed to
find the new head).
Without this, one thread might keep trying and failing to pop items
without ever loading the new correct head.

Fixes: 7e6e609939a8 ("stack: add C11 atomic implementation")
Cc: gage.eads@intel.com
Cc: stable@dpdk.org

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 2bc639419..adb9f590d 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -133,8 +133,10 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 		/* If NULL was encountered, the list was modified while
 		 * traversing it. Retry.
 		 */
-		if (i != num)
+		if (i != num) {
+			old_head = list->head;
 			continue;
+		}
 
 		new_head.top = tmp;
 		new_head.cnt = old_head.cnt + 1;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [dpdk-dev] [PATCH v2 5/5] lib/stack: remove pop cas release ordering
  2020-09-25 17:43 ` [dpdk-dev] [PATCH v2 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
                     ` (3 preceding siblings ...)
  2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 4/5] lib/stack: reload head when pop fails Steven Lariau
@ 2020-09-25 17:43   ` Steven Lariau
  2020-09-25 17:57     ` Eads, Gage
  2020-09-30 19:14   ` [dpdk-dev] [PATCH v2 0/5] lib/stack: improve lockfree C11 implementation David Marchand
  5 siblings, 1 reply; 22+ messages in thread
From: Steven Lariau @ 2020-09-25 17:43 UTC (permalink / raw)
  To: Gage Eads, Olivier Matz; +Cc: dev, nd, Steven Lariau

Replace the store-release by relaxed for the CAS success at the end of
pop. Release isn't needed, because there is not write to data that need
to be synchronized.
The only preceding write is when the length is decreased, but the length
CAS loop already ensures the right synchronization.
The situation to avoid is when a thread sees the old length but the new
list, that doesn't have enough items for pop to success.
But the CAS success on length before the pop loop ensures any core reads
and updates the latest length, preventing this situation.

The store-release is also used to make sure that the items are read
before the head is updated, in order to prevent a core in pop to read an
incorrect value because another core rewrites it with push.
But this isn't needed, because items are read only when removed from the
used list. Right after this, they are pushed to the free list, and the
store-release in push makes sure the items are read before they are
visible in the free list.

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index adb9f590d..8403196d5 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -141,11 +141,25 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 		new_head.top = tmp;
 		new_head.cnt = old_head.cnt + 1;
 
+		/*
+		 * The CAS should have release semantics to ensure that
+		 * items are read before the head is updated.
+		 * But this function is internal, and items are read
+		 * only when __rte_stack_lf_pop calls this function to
+		 * pop items from used list.
+		 * Then, those items are pushed to the free list.
+		 * Push uses a CAS store-release on head, which makes
+		 * sure that items are read before they are pushed to
+		 * the free list, without need for a CAS release here.
+		 * This CAS could also be used to ensure that the new
+		 * length is visible before the head update, but
+		 * acquire semantics on the length update is enough.
+		 */
 		success = rte_atomic128_cmp_exchange(
 				(rte_int128_t *)&list->head,
 				(rte_int128_t *)&old_head,
 				(rte_int128_t *)&new_head,
-				0, __ATOMIC_RELEASE,
+				0, __ATOMIC_RELAXED,
 				__ATOMIC_RELAXED);
 	} while (success == 0);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 5/5] lib/stack: remove pop cas release ordering
  2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 5/5] lib/stack: remove pop cas release ordering Steven Lariau
@ 2020-09-25 17:57     ` Eads, Gage
  0 siblings, 0 replies; 22+ messages in thread
From: Eads, Gage @ 2020-09-25 17:57 UTC (permalink / raw)
  To: Steven Lariau, Olivier Matz; +Cc: dev, nd



> -----Original Message-----
> From: Steven Lariau <steven.lariau@arm.com>
> Sent: Friday, September 25, 2020 12:44 PM
> To: Eads, Gage <gage.eads@intel.com>; Olivier Matz <olivier.matz@6wind.com>
> Cc: dev@dpdk.org; nd@arm.com; Steven Lariau <steven.lariau@arm.com>
> Subject: [PATCH v2 5/5] lib/stack: remove pop cas release ordering
> 
> Replace the store-release by relaxed for the CAS success at the end of
> pop. Release isn't needed, because there is not write to data that need
> to be synchronized.
> The only preceding write is when the length is decreased, but the length
> CAS loop already ensures the right synchronization.
> The situation to avoid is when a thread sees the old length but the new
> list, that doesn't have enough items for pop to success.
> But the CAS success on length before the pop loop ensures any core reads
> and updates the latest length, preventing this situation.
> 
> The store-release is also used to make sure that the items are read
> before the head is updated, in order to prevent a core in pop to read an
> incorrect value because another core rewrites it with push.
> But this isn't needed, because items are read only when removed from the
> used list. Right after this, they are pushed to the free list, and the
> store-release in push makes sure the items are read before they are
> visible in the free list.
> 
> Signed-off-by: Steven Lariau <steven.lariau@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>

Acked-by: Gage Eads <gage.eads@intel.com>

Thanks,
Gage

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/5] lib/stack: fix inconsistent weak / strong cas
  2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 1/5] lib/stack: fix inconsistent weak / strong cas Steven Lariau
@ 2020-09-28 10:22     ` David Marchand
  2020-09-28 15:58       ` Eads, Gage
  0 siblings, 1 reply; 22+ messages in thread
From: David Marchand @ 2020-09-28 10:22 UTC (permalink / raw)
  To: Steven Lariau, Gage Eads; +Cc: Olivier Matz, dev, nd

On Fri, Sep 25, 2020 at 7:44 PM Steven Lariau <steven.lariau@arm.com> wrote:
>
> Fix cmpexchange usage of weak / strong.
> The generated code is the same on x86 and ARM (there is no weak
> cmpexchange), but the old usage was inconsistent.
> For push and pop update size, weak is used because cmpexchange is inside
> a loop.
> For pop update root, strong is used even though cmpexchange is inside a
> loop, because there may be a lot of operations to do in a loop iteration
> (locate the new head).

Is this patch backport material?

Thanks.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/5] lib/stack: fix inconsistent weak / strong cas
  2020-09-28 10:22     ` David Marchand
@ 2020-09-28 15:58       ` Eads, Gage
  0 siblings, 0 replies; 22+ messages in thread
From: Eads, Gage @ 2020-09-28 15:58 UTC (permalink / raw)
  To: David Marchand, Steven Lariau; +Cc: Olivier Matz, dev, nd

> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Monday, September 28, 2020 5:23 AM
> To: Steven Lariau <steven.lariau@arm.com>; Eads, Gage <gage.eads@intel.com>
> Cc: Olivier Matz <olivier.matz@6wind.com>; dev <dev@dpdk.org>; nd
> <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v2 1/5] lib/stack: fix inconsistent weak / strong
> cas
> 
> On Fri, Sep 25, 2020 at 7:44 PM Steven Lariau <steven.lariau@arm.com> wrote:
> >
> > Fix cmpexchange usage of weak / strong.
> > The generated code is the same on x86 and ARM (there is no weak
> > cmpexchange), but the old usage was inconsistent.
> > For push and pop update size, weak is used because cmpexchange is inside
> > a loop.
> > For pop update root, strong is used even though cmpexchange is inside a
> > loop, because there may be a lot of operations to do in a loop iteration
> > (locate the new head).
> 
> Is this patch backport material?

It's not a bugfix. It could help performance on a system with weak
cmpexchange -- e.g. the pop-update change would ensure no spurious
failures (which the code can handle, but would require another relatively
expensive pass through the pop loop.)

Thanks,
Gage

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/5] lib/stack: improve lockfree C11 implementation
  2020-09-25 17:43 ` [dpdk-dev] [PATCH v2 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
                     ` (4 preceding siblings ...)
  2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 5/5] lib/stack: remove pop cas release ordering Steven Lariau
@ 2020-09-30 19:14   ` David Marchand
  5 siblings, 0 replies; 22+ messages in thread
From: David Marchand @ 2020-09-30 19:14 UTC (permalink / raw)
  To: Steven Lariau
  Cc: dev, nd, Dharmik Thakkar, Ruifeng Wang (Arm Technology China), Gage Eads

On Fri, Sep 25, 2020 at 7:44 PM Steven Lariau <steven.lariau@arm.com> wrote:
>
> One implementation of the DPDK stack library is lockfree,
> based on C11 memory model for atomics.
> Some of these atomic operations use unnecessary memory orders,
> that can be relaxed.
> This patch relax some of these operations in order to improve
> the performance of the stack library.
>
> The patch was tested on several architectures, to ensure that
> the implementation is correct, and to measure performance.
> Below are the results for a few architectures on multithread stack
> lockfree test.
> The cycles count is the average number of cycles per item to perform
> a bulk push / pop.
>
> $sudo ./builddir/app/dpdk-test
> RTE>>stack_lf_perf_autotest
>                               difference compared to main
> Cycles count on ThunderX2
>  2 cores, bulk size =  8:           -15.85%
>  2 cores, bulk size = 32:           -04.56%
>  4 cores, bulk size =  8:           -05.00%
>  4 cores, bulk size = 32:           -04.35%
> 16 cores, bulk size =  8:           -02.38%
> 16 cores, bulk size = 32:           -01.88%
>
>                               difference compared to main
> Cycles count on N1SDP
>  2 cores, batch size =  8:          +00.77%
>  2 cores, batch size = 32:          -16.00%
>
>                               difference compared to main
> Cycles count on Skylake
>  2 cores, bulk size =  8:           -00.18%
>  2 cores, bulk size = 32:           -00.95%
>  4 cores, bulk size =  8:           -01.19%
>  4 cores, bulk size = 32:           +00.64%
> 16 cores, bulk size =  8:           +01.20%
> 16 cores, bulk size = 32:           +00.48%
>
> v2: add comment to explain why pop head CAS relaxed is valid
>     added Fixes information
>
> Steven Lariau (5):
>   lib/stack: fix inconsistent weak / strong cas
>   lib/stack: remove push acquire fence
>   lib/stack: remove redundant orderings for list->len
>   lib/stack: reload head when pop fails
>   lib/stack: remove pop cas release ordering
>
>  lib/librte_stack/rte_stack_lf_c11.h | 32 +++++++++++++++++++----------
>  1 file changed, 21 insertions(+), 11 deletions(-)

Series applied, thanks for those optimisations.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-09-30 19:15 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-11 15:29 [dpdk-dev] [PATCH 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
2020-09-11 15:29 ` [dpdk-dev] [PATCH 1/5] lib/stack: fix inconsistent weak / strong cas Steven Lariau
2020-09-21 17:16   ` Eads, Gage
2020-09-11 15:29 ` [dpdk-dev] [PATCH 2/5] lib/stack: remove push acquire fence Steven Lariau
2020-09-21 17:16   ` Eads, Gage
2020-09-11 15:29 ` [dpdk-dev] [PATCH 3/5] lib/stack: remove redundant orderings for list->len Steven Lariau
2020-09-21 17:16   ` Eads, Gage
2020-09-11 15:29 ` [dpdk-dev] [PATCH 4/5] lib/stack: reload head when pop fails Steven Lariau
2020-09-21 17:16   ` Eads, Gage
2020-09-11 15:29 ` [dpdk-dev] [PATCH 5/5] lib/stack: remove pop cas release ordering Steven Lariau
2020-09-21 17:17   ` Eads, Gage
2020-09-25 14:27     ` David Marchand
2020-09-25 17:43 ` [dpdk-dev] [PATCH v2 0/5] lib/stack: improve lockfree C11 implementation Steven Lariau
2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 1/5] lib/stack: fix inconsistent weak / strong cas Steven Lariau
2020-09-28 10:22     ` David Marchand
2020-09-28 15:58       ` Eads, Gage
2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 2/5] lib/stack: remove push acquire fence Steven Lariau
2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 3/5] lib/stack: remove redundant orderings for list->len Steven Lariau
2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 4/5] lib/stack: reload head when pop fails Steven Lariau
2020-09-25 17:43   ` [dpdk-dev] [PATCH v2 5/5] lib/stack: remove pop cas release ordering Steven Lariau
2020-09-25 17:57     ` Eads, Gage
2020-09-30 19:14   ` [dpdk-dev] [PATCH v2 0/5] lib/stack: improve lockfree C11 implementation David Marchand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).