One implementation of the DPDK stack library is lockfree, based on the
C11 memory model for atomics. Some of these atomic operations use
stronger memory orders than necessary, which can be relaxed.
This patch series relaxes some of these operations in order to improve
the performance of the stack library.

The series was tested on several architectures, to ensure that the
implementation is correct, and to measure performance.
Below are the results for a few architectures on the multithread stack
lockfree test.
The cycles count is the average number of cycles per item to perform
a bulk push / pop.

$sudo ./builddir/app/dpdk-test
RTE>>stack_lf_perf_autotest

difference compared to main
Cycles count on ThunderX2
2 cores, bulk size = 8:   -15.85%
2 cores, bulk size = 32:  -04.56%
4 cores, bulk size = 8:   -05.00%
4 cores, bulk size = 32:  -04.35%
16 cores, bulk size = 8:  -02.38%
16 cores, bulk size = 32: -01.88%

difference compared to main
Cycles count on N1SDP
2 cores, batch size = 8:  +00.77%
2 cores, batch size = 32: -16.00%

difference compared to main
Cycles count on Skylake
2 cores, bulk size = 8:   -00.18%
2 cores, bulk size = 32:  -00.95%
4 cores, bulk size = 8:   -01.19%
4 cores, bulk size = 32:  +00.64%
16 cores, bulk size = 8:  +01.20%
16 cores, bulk size = 32: +00.48%

Steven Lariau (5):
  lib/stack: fix inconsistent weak / strong cas
  lib/stack: remove push acquire fence
  lib/stack: remove redundant orderings for list->len
  lib/stack: reload head when pop fails
  lib/stack: remove pop cas release ordering

 lib/librte_stack/rte_stack_lf_c11.h | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

--
2.17.1
Fix cmpexchange usage of weak / strong.
The generated code is the same on x86 and ARM (there is no weak
cmpexchange), but the old usage was inconsistent.
For the push and pop size updates, weak is used because the cmpexchange
is inside a loop.
For the pop root update, strong is used even though the cmpexchange is
inside a loop, because there may be a lot of work to do in each loop
iteration (locating the new head).

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 999359f08..1e0ea0bef 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -96,7 +96,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 		/* len is updated on failure */
 		if (__atomic_compare_exchange_n(&list->len,
 						&len, len - num,
-						0, __ATOMIC_ACQUIRE,
+						1, __ATOMIC_ACQUIRE,
 						__ATOMIC_ACQUIRE))
 			break;
 	}
@@ -149,7 +149,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 				(rte_int128_t *)&list->head,
 				(rte_int128_t *)&old_head,
 				(rte_int128_t *)&new_head,
-				1, __ATOMIC_RELEASE,
+				0, __ATOMIC_RELEASE,
 				__ATOMIC_RELAXED);
 	} while (success == 0);

--
2.17.1
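The weak/strong rule this patch applies can be sketched with GCC's `__atomic` builtins. This is a hypothetical toy (the names `add_in_loop` and `set_if_equal` are invented, not DPDK code): a weak CAS belongs inside a cheap retry loop, a strong CAS is worth paying for when each retry is expensive.

```c
#include <stdint.h>

/* Weak CAS in a loop: a spurious failure (possible on LL/SC machines
 * such as ARM) just costs one more iteration; `cur` is reloaded with
 * the observed value on every failure. */
static void add_in_loop(uint64_t *v, uint64_t inc)
{
	uint64_t cur = __atomic_load_n(v, __ATOMIC_RELAXED);

	while (!__atomic_compare_exchange_n(v, &cur, cur + inc,
					    1 /* weak */,
					    __ATOMIC_RELAXED,
					    __ATOMIC_RELAXED))
		; /* retry with reloaded cur */
}

/* Strong CAS when a retry is expensive: imagine `desired` is the result
 * of a long list traversal, as in the pop root update. A strong CAS
 * never fails spuriously, so a failure really means contention. */
static int set_if_equal(uint64_t *v, uint64_t expected, uint64_t desired)
{
	return __atomic_compare_exchange_n(v, &expected, desired,
					   0 /* strong */,
					   __ATOMIC_RELAXED,
					   __ATOMIC_RELAXED);
}
```

On x86 both forms compile to the same `lock cmpxchg`, which is why the patch is about consistency rather than generated code.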
An acquire fence is used to make sure that loads after the fence can
observe all stores that happened before a given store-release.
But push doesn't read any data, except for the head, which is part of
a CAS operation (the items on the list are not read).
So there is no need for the acquire barrier.

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 1e0ea0bef..82b7287f1 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -44,12 +44,6 @@ __rte_stack_lf_push_elems(struct rte_stack_lf_list *list,
 	do {
 		struct rte_stack_lf_head new_head;

-		/* Use an acquire fence to establish a synchronized-with
-		 * relationship between the list->head load and store-release
-		 * operations (as part of the rte_atomic128_cmp_exchange()).
-		 */
-		__atomic_thread_fence(__ATOMIC_ACQUIRE);
-
 		/* Swing the top pointer to the first element in the list and
 		 * make the last element point to the old top.
 		 */

--
2.17.1
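The reasoning above is the standard release/acquire pairing. A minimal sketch (hypothetical names, not the library code): the producer publishes with a store-release and never needs an acquire fence of its own, because it reads nothing; only the consumer's load-acquire is required to make the payload visible.

```c
#include <stddef.h>

struct node { int payload; };

/* Producer: plain write to the payload, then a store-release to
 * publish the node. No acquire fence is needed on this side. */
static void publish(struct node **head, struct node *n, int value)
{
	n->payload = value;                          /* plain write */
	__atomic_store_n(head, n, __ATOMIC_RELEASE); /* publish */
}

/* Consumer: the load-acquire synchronizes with the store-release,
 * so the payload written before publication is guaranteed visible. */
static int consume(struct node **head)
{
	struct node *n = __atomic_load_n(head, __ATOMIC_ACQUIRE);

	return n != NULL ? n->payload : -1;
}
```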
The load-acquire of list->len in the pop function is redundant.
Only the CAS success ordering needs to be acquire.
It synchronizes with the store-release in push, to ensure that the
updated head is visible when the new length is visible.
Without this, a thread in pop could see the increased length but the
old list, which doesn't yet have enough items for pop to succeed.

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 82b7287f1..2bc639419 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -80,7 +80,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 	int success;

 	/* Reserve num elements, if available */
-	len = __atomic_load_n(&list->len, __ATOMIC_ACQUIRE);
+	len = __atomic_load_n(&list->len, __ATOMIC_RELAXED);

 	while (1) {
 		/* Does the list contain enough elements? */
@@ -91,7 +91,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 		if (__atomic_compare_exchange_n(&list->len,
 						&len, len - num,
 						1, __ATOMIC_ACQUIRE,
-						__ATOMIC_ACQUIRE))
+						__ATOMIC_RELAXED))
 			break;
 	}

--
2.17.1
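The length protocol described here can be sketched as follows (a hypothetical toy, assuming — as in the library — that push publishes the head and then bumps the length with a release-ordered add). The initial relaxed load only seeds the CAS loop; a stale value costs one extra retry, while the acquire-on-success CAS pairs with push's release to make the new head visible.

```c
#include <stdint.h>
#include <stddef.h>

struct toy_list { void *head; uint64_t len; };

/* Push side: publish the head, then make the items claimable by
 * increasing len with release ordering (orders the head update
 * before the len update). */
static void push_publish(struct toy_list *l, void *new_head, unsigned int num)
{
	__atomic_store_n(&l->head, new_head, __ATOMIC_RELAXED);
	__atomic_add_fetch(&l->len, num, __ATOMIC_RELEASE);
}

/* Pop side: reserve num items. The first load can be relaxed because
 * only the CAS decides; acquire on CAS success synchronizes with the
 * release in push_publish, so the matching head is visible. */
static int pop_reserve(struct toy_list *l, unsigned int num)
{
	uint64_t len = __atomic_load_n(&l->len, __ATOMIC_RELAXED);

	while (1) {
		if (len < num)
			return 0; /* not enough items */
		if (__atomic_compare_exchange_n(&l->len, &len, len - num,
						1, __ATOMIC_ACQUIRE,
						__ATOMIC_RELAXED))
			return 1; /* len reloaded on failure */
	}
}
```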
The list head must be reloaded right before continuing (when pop
failed to find the new head).
Without this, a thread might keep trying and failing to pop items
without ever loading the new, correct head.

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 2bc639419..adb9f590d 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -133,8 +133,10 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 		/* If NULL was encountered, the list was modified while
 		 * traversing it. Retry.
 		 */
-		if (i != num)
+		if (i != num) {
+			old_head = list->head;
 			continue;
+		}

 		new_head.top = tmp;
 		new_head.cnt = old_head.cnt + 1;

--
2.17.1
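The fix can be illustrated with a minimal traversal loop (toy code with invented names, using a plain pointer head rather than the 128-bit head-plus-counter the real library uses, and ignoring ABA): on a failed traversal the shared head is reloaded before retrying, otherwise the loop would spin forever on the same stale snapshot.

```c
#include <stddef.h>

struct lf_node { struct lf_node *next; };

/* Pop num nodes from the list; returns the old head (first popped
 * node). Assumes enough items were already reserved, as the length
 * CAS in the real code guarantees. */
static struct lf_node *pop_n(struct lf_node **head, unsigned int num)
{
	struct lf_node *old_head = __atomic_load_n(head, __ATOMIC_ACQUIRE);

	for (;;) {
		struct lf_node *tmp = old_head;
		unsigned int i;

		/* Locate the new head by walking num nodes. */
		for (i = 0; i < num && tmp != NULL; i++)
			tmp = tmp->next;

		if (i != num) {
			/* NULL encountered: the list changed while
			 * traversing it. The fix: reload the head
			 * before retrying. */
			old_head = __atomic_load_n(head, __ATOMIC_ACQUIRE);
			continue;
		}

		/* Strong CAS: on failure, old_head is updated with the
		 * currently observed value, so the retry uses it. */
		if (__atomic_compare_exchange_n(head, &old_head, tmp, 0,
						__ATOMIC_RELAXED,
						__ATOMIC_RELAXED))
			return old_head;
	}
}
```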
Replace the store-release with relaxed ordering for the CAS success at
the end of pop. Release isn't needed, because there is no write to
data that needs to be synchronized.
The only preceding write is the length decrease, but the length CAS
loop already provides the required synchronization.
The situation to avoid is a thread seeing the old length but the new
list, which doesn't have enough items for pop to succeed.
But the CAS success on the length before the pop loop ensures that any
core reads and updates the latest length, preventing this situation.

The store-release is also used to make sure that the items are read
before the head is updated, in order to prevent a core in pop from
reading an incorrect value because another core rewrites it with push.
But this isn't needed, because items are read only when removed from
the used list. Right after this, they are pushed to the free list, and
the store-release in push makes sure the items are read before they
become visible in the free list.

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index adb9f590d..8800db42e 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -145,7 +145,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 				(rte_int128_t *)&list->head,
 				(rte_int128_t *)&old_head,
 				(rte_int128_t *)&new_head,
-				0, __ATOMIC_RELEASE,
+				0, __ATOMIC_RELAXED,
 				__ATOMIC_RELAXED);
 	} while (success == 0);

--
2.17.1
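The argument above — pop's head CAS can be relaxed because push's store-release republishes the items before they become visible again — can be sketched as a pop-then-push recycle step. This is toy code with invented names, ignoring ABA and the 128-bit head the real library uses:

```c
#include <stddef.h>

struct fnode { struct fnode *next; int payload; };

/* Move one node from `used` to `free_list`. The pop CAS is relaxed:
 * the payload is accessed only after the node is privately owned, and
 * the subsequent push publishes it with a store-release, which orders
 * the payload access before the node reappears on the free list. */
static int recycle_one(struct fnode **used, struct fnode **free_list)
{
	struct fnode *n = __atomic_load_n(used, __ATOMIC_ACQUIRE);

	/* Pop: relaxed success ordering is enough here. */
	while (n != NULL &&
	       !__atomic_compare_exchange_n(used, &n, n->next, 0,
					    __ATOMIC_RELAXED,
					    __ATOMIC_RELAXED))
		; /* n reloaded on failure */
	if (n == NULL)
		return 0;

	n->payload++; /* access while privately owned */

	/* Push: this store-release is what keeps the payload access
	 * ordered before the node is visible to other cores again. */
	struct fnode *old = __atomic_load_n(free_list, __ATOMIC_RELAXED);
	do {
		n->next = old;
	} while (!__atomic_compare_exchange_n(free_list, &old, n, 1,
					      __ATOMIC_RELEASE,
					      __ATOMIC_RELAXED));
	return 1;
}
```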
> -----Original Message-----
> From: Steven Lariau <steven.lariau@arm.com>
> Sent: Friday, September 11, 2020 10:30 AM
> To: Eads, Gage <gage.eads@intel.com>; Olivier Matz <olivier.matz@6wind.com>
> Cc: dev@dpdk.org; nd@arm.com; dharmik.thakkar@arm.com; Steven Lariau
> <steven.lariau@arm.com>
> Subject: [PATCH 1/5] lib/stack: fix inconsistent weak / strong cas
>
> Fix cmpexchange usage of weak / strong.
> The generated code is the same on x86 and ARM (there is no weak
> cmpexchange), but the old usage was inconsistent.
> For push and pop update size, weak is used because cmpexchange is inside
> a loop.
> For pop update root, strong is used even though cmpexchange is inside a
> loop, because there may be a lot of operations to do in a loop iteration
> (locate the new head).
>
> Signed-off-by: Steven Lariau <steven.lariau@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Gage Eads <gage.eads@intel.com>
Thanks,
Gage
> -----Original Message-----
> From: Steven Lariau <steven.lariau@arm.com>
> Sent: Friday, September 11, 2020 10:30 AM
> To: Eads, Gage <gage.eads@intel.com>; Olivier Matz <olivier.matz@6wind.com>
> Cc: dev@dpdk.org; nd@arm.com; dharmik.thakkar@arm.com; Steven Lariau
> <steven.lariau@arm.com>
> Subject: [PATCH 2/5] lib/stack: remove push acquire fence
>
> An acquire fence is used to make sure loads after the fence can observe
> all store operations before a specific store-release.
> But push doesn't read any data, except for the head which is part of a
> CAS operation (the items on the list are not read).
> So there is no need for the acquire barrier.
>
> Signed-off-by: Steven Lariau <steven.lariau@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Gage Eads <gage.eads@intel.com>
Thanks,
Gage
> -----Original Message-----
> From: Steven Lariau <steven.lariau@arm.com>
> Sent: Friday, September 11, 2020 10:30 AM
> To: Eads, Gage <gage.eads@intel.com>; Olivier Matz <olivier.matz@6wind.com>
> Cc: dev@dpdk.org; nd@arm.com; dharmik.thakkar@arm.com; Steven Lariau
> <steven.lariau@arm.com>
> Subject: [PATCH 3/5] lib/stack: remove redundant orderings for list->len
>
> The load-acquire of list->len on pop function is redundant.
> Only the CAS success needs to be load-acquire.
> It synchronizes with the store release in push, to ensure that the
> updated head is visible when the new length is visible.
> Without this, one thread in pop could see the increased length but the
> old list, which doesn't have enough items yet for pop to succeed.
>
> Signed-off-by: Steven Lariau <steven.lariau@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Gage Eads <gage.eads@intel.com>
Thanks,
Gage
> -----Original Message-----
> From: Steven Lariau <steven.lariau@arm.com>
> Sent: Friday, September 11, 2020 10:30 AM
> To: Eads, Gage <gage.eads@intel.com>; Olivier Matz <olivier.matz@6wind.com>
> Cc: dev@dpdk.org; nd@arm.com; dharmik.thakkar@arm.com; Steven Lariau
> <steven.lariau@arm.com>
> Subject: [PATCH 4/5] lib/stack: reload head when pop fails
>
> List head must be loaded right before continue (when failed to
> find the new head).
> Without this, one thread might keep trying and failing to pop items
> without ever loading the new correct head.
>
> Signed-off-by: Steven Lariau <steven.lariau@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Good catch. Please add the following so the fix is considered for dpdk-stable:
Fixes: 7e6e609939a8 ("stack: add C11 atomic implementation")
Cc: stable@dpdk.org
Acked-by: Gage Eads <gage.eads@intel.com>
Thanks,
Gage
> -----Original Message-----
> From: Steven Lariau <steven.lariau@arm.com>
> Sent: Friday, September 11, 2020 10:30 AM
> To: Eads, Gage <gage.eads@intel.com>; Olivier Matz <olivier.matz@6wind.com>
> Cc: dev@dpdk.org; nd@arm.com; dharmik.thakkar@arm.com; Steven Lariau
> <steven.lariau@arm.com>
> Subject: [PATCH 5/5] lib/stack: remove pop cas release ordering
>
> Replace the store-release by relaxed for the CAS success at the end of
> pop. Release isn't needed, because there is not write to data that need
> to be synchronized.
> The only preceding write is when the length is decreased, but the length
> CAS loop already ensures the right synchronization.
> The situation to avoid is when a thread sees the old length but the new
> list, that doesn't have enough items for pop to success.
> But the CAS success on length before the pop loop ensures any core reads
> and updates the latest length, preventing this situation.
>
> The store-release is also used to make sure that the items are read
> before the head is updated, in order to prevent a core in pop to read an
> incorrect value because another core rewrites it with push.
> But this isn't needed, because items are read only when removed from the
> used list. Right after this, they are pushed to the free list, and the
> store-release in push makes sure the items are read before they are
> visible in the free list.
Please add a comment to this effect above the compare-exchange call. Depending
on this caller behavior for correctness is a little risky, but since this header
is private to the library I think it's ok as long as it's well-documented.
Thanks,
Gage
Hello Steven,
On Mon, Sep 21, 2020 at 7:17 PM Eads, Gage <gage.eads@intel.com> wrote:
> > -----Original Message-----
> > From: Steven Lariau <steven.lariau@arm.com>
> > Sent: Friday, September 11, 2020 10:30 AM
> > To: Eads, Gage <gage.eads@intel.com>; Olivier Matz <olivier.matz@6wind.com>
> > Cc: dev@dpdk.org; nd@arm.com; dharmik.thakkar@arm.com; Steven Lariau
> > <steven.lariau@arm.com>
> > Subject: [PATCH 5/5] lib/stack: remove pop cas release ordering
> >
> > Replace the store-release by relaxed for the CAS success at the end of
> > pop. Release isn't needed, because there is not write to data that need
> > to be synchronized.
> > The only preceding write is when the length is decreased, but the length
> > CAS loop already ensures the right synchronization.
> > The situation to avoid is when a thread sees the old length but the new
> > list, that doesn't have enough items for pop to success.
> > But the CAS success on length before the pop loop ensures any core reads
> > and updates the latest length, preventing this situation.
> >
> > The store-release is also used to make sure that the items are read
> > before the head is updated, in order to prevent a core in pop to read an
> > incorrect value because another core rewrites it with push.
> > But this isn't needed, because items are read only when removed from the
> > used list. Right after this, they are pushed to the free list, and the
> > store-release in push makes sure the items are read before they are
> > visible in the free list.
>
> Please add a comment to this effect above the compare-exchange call. Depending
> on this caller behavior for correctness is a little risky, but since this header
> is private to the library I think it's ok as long as it's well-documented.
>
Can you prepare a v2 for this comment?
Thanks.
--
David Marchand
One implementation of the DPDK stack library is lockfree, based on the
C11 memory model for atomics. Some of these atomic operations use
stronger memory orders than necessary, which can be relaxed.
This patch series relaxes some of these operations in order to improve
the performance of the stack library.

The series was tested on several architectures, to ensure that the
implementation is correct, and to measure performance.
Below are the results for a few architectures on the multithread stack
lockfree test.
The cycles count is the average number of cycles per item to perform
a bulk push / pop.

$sudo ./builddir/app/dpdk-test
RTE>>stack_lf_perf_autotest

difference compared to main
Cycles count on ThunderX2
2 cores, bulk size = 8:   -15.85%
2 cores, bulk size = 32:  -04.56%
4 cores, bulk size = 8:   -05.00%
4 cores, bulk size = 32:  -04.35%
16 cores, bulk size = 8:  -02.38%
16 cores, bulk size = 32: -01.88%

difference compared to main
Cycles count on N1SDP
2 cores, batch size = 8:  +00.77%
2 cores, batch size = 32: -16.00%

difference compared to main
Cycles count on Skylake
2 cores, bulk size = 8:   -00.18%
2 cores, bulk size = 32:  -00.95%
4 cores, bulk size = 8:   -01.19%
4 cores, bulk size = 32:  +00.64%
16 cores, bulk size = 8:  +01.20%
16 cores, bulk size = 32: +00.48%

v2: add comment to explain why pop head CAS relaxed is valid
    added Fixes information

Steven Lariau (5):
  lib/stack: fix inconsistent weak / strong cas
  lib/stack: remove push acquire fence
  lib/stack: remove redundant orderings for list->len
  lib/stack: reload head when pop fails
  lib/stack: remove pop cas release ordering

 lib/librte_stack/rte_stack_lf_c11.h | 32 +++++++++++++++++++----------
 1 file changed, 21 insertions(+), 11 deletions(-)

--
2.17.1
Fix cmpexchange usage of weak / strong.
The generated code is the same on x86 and ARM (there is no weak
cmpexchange), but the old usage was inconsistent.
For the push and pop size updates, weak is used because the cmpexchange
is inside a loop.
For the pop root update, strong is used even though the cmpexchange is
inside a loop, because there may be a lot of work to do in each loop
iteration (locating the new head).

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 999359f08..1e0ea0bef 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -96,7 +96,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 		/* len is updated on failure */
 		if (__atomic_compare_exchange_n(&list->len,
 						&len, len - num,
-						0, __ATOMIC_ACQUIRE,
+						1, __ATOMIC_ACQUIRE,
 						__ATOMIC_ACQUIRE))
 			break;
 	}
@@ -149,7 +149,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 				(rte_int128_t *)&list->head,
 				(rte_int128_t *)&old_head,
 				(rte_int128_t *)&new_head,
-				1, __ATOMIC_RELEASE,
+				0, __ATOMIC_RELEASE,
 				__ATOMIC_RELAXED);
 	} while (success == 0);

--
2.17.1
An acquire fence is used to make sure that loads after the fence can
observe all stores that happened before a given store-release.
But push doesn't read any data, except for the head, which is part of
a CAS operation (the items on the list are not read).
So there is no need for the acquire barrier.

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 1e0ea0bef..82b7287f1 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -44,12 +44,6 @@ __rte_stack_lf_push_elems(struct rte_stack_lf_list *list,
 	do {
 		struct rte_stack_lf_head new_head;

-		/* Use an acquire fence to establish a synchronized-with
-		 * relationship between the list->head load and store-release
-		 * operations (as part of the rte_atomic128_cmp_exchange()).
-		 */
-		__atomic_thread_fence(__ATOMIC_ACQUIRE);
-
 		/* Swing the top pointer to the first element in the list and
 		 * make the last element point to the old top.
 		 */

--
2.17.1
The load-acquire of list->len in the pop function is redundant.
Only the CAS success ordering needs to be acquire.
It synchronizes with the store-release in push, to ensure that the
updated head is visible when the new length is visible.
Without this, a thread in pop could see the increased length but the
old list, which doesn't yet have enough items for pop to succeed.

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 82b7287f1..2bc639419 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -80,7 +80,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 	int success;

 	/* Reserve num elements, if available */
-	len = __atomic_load_n(&list->len, __ATOMIC_ACQUIRE);
+	len = __atomic_load_n(&list->len, __ATOMIC_RELAXED);

 	while (1) {
 		/* Does the list contain enough elements? */
@@ -91,7 +91,7 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 		if (__atomic_compare_exchange_n(&list->len,
 						&len, len - num,
 						1, __ATOMIC_ACQUIRE,
-						__ATOMIC_ACQUIRE))
+						__ATOMIC_RELAXED))
 			break;
 	}

--
2.17.1
The list head must be reloaded right before continuing (when pop
failed to find the new head).
Without this, a thread might keep trying and failing to pop items
without ever loading the new, correct head.

Fixes: 7e6e609939a8 ("stack: add C11 atomic implementation")
Cc: gage.eads@intel.com
Cc: stable@dpdk.org

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index 2bc639419..adb9f590d 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -133,8 +133,10 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 		/* If NULL was encountered, the list was modified while
 		 * traversing it. Retry.
 		 */
-		if (i != num)
+		if (i != num) {
+			old_head = list->head;
 			continue;
+		}

 		new_head.top = tmp;
 		new_head.cnt = old_head.cnt + 1;

--
2.17.1
Replace the store-release with relaxed ordering for the CAS success at
the end of pop. Release isn't needed, because there is no write to
data that needs to be synchronized.
The only preceding write is the length decrease, but the length CAS
loop already provides the required synchronization.
The situation to avoid is a thread seeing the old length but the new
list, which doesn't have enough items for pop to succeed.
But the CAS success on the length before the pop loop ensures that any
core reads and updates the latest length, preventing this situation.

The store-release is also used to make sure that the items are read
before the head is updated, in order to prevent a core in pop from
reading an incorrect value because another core rewrites it with push.
But this isn't needed, because items are read only when removed from
the used list. Right after this, they are pushed to the free list, and
the store-release in push makes sure the items are read before they
become visible in the free list.

Signed-off-by: Steven Lariau <steven.lariau@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_stack/rte_stack_lf_c11.h | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/lib/librte_stack/rte_stack_lf_c11.h b/lib/librte_stack/rte_stack_lf_c11.h
index adb9f590d..8403196d5 100644
--- a/lib/librte_stack/rte_stack_lf_c11.h
+++ b/lib/librte_stack/rte_stack_lf_c11.h
@@ -141,11 +141,25 @@ __rte_stack_lf_pop_elems(struct rte_stack_lf_list *list,
 		new_head.top = tmp;
 		new_head.cnt = old_head.cnt + 1;

+		/*
+		 * The CAS should have release semantics to ensure that
+		 * items are read before the head is updated.
+		 * But this function is internal, and items are read
+		 * only when __rte_stack_lf_pop calls this function to
+		 * pop items from used list.
+		 * Then, those items are pushed to the free list.
+		 * Push uses a CAS store-release on head, which makes
+		 * sure that items are read before they are pushed to
+		 * the free list, without need for a CAS release here.
+		 * This CAS could also be used to ensure that the new
+		 * length is visible before the head update, but
+		 * acquire semantics on the length update is enough.
+		 */
 		success = rte_atomic128_cmp_exchange(
 				(rte_int128_t *)&list->head,
 				(rte_int128_t *)&old_head,
 				(rte_int128_t *)&new_head,
-				0, __ATOMIC_RELEASE,
+				0, __ATOMIC_RELAXED,
 				__ATOMIC_RELAXED);
 	} while (success == 0);

--
2.17.1
> -----Original Message-----
> From: Steven Lariau <steven.lariau@arm.com>
> Sent: Friday, September 25, 2020 12:44 PM
> To: Eads, Gage <gage.eads@intel.com>; Olivier Matz <olivier.matz@6wind.com>
> Cc: dev@dpdk.org; nd@arm.com; Steven Lariau <steven.lariau@arm.com>
> Subject: [PATCH v2 5/5] lib/stack: remove pop cas release ordering
>
> Replace the store-release by relaxed for the CAS success at the end of
> pop. Release isn't needed, because there is not write to data that need
> to be synchronized.
> The only preceding write is when the length is decreased, but the length
> CAS loop already ensures the right synchronization.
> The situation to avoid is when a thread sees the old length but the new
> list, that doesn't have enough items for pop to success.
> But the CAS success on length before the pop loop ensures any core reads
> and updates the latest length, preventing this situation.
>
> The store-release is also used to make sure that the items are read
> before the head is updated, in order to prevent a core in pop to read an
> incorrect value because another core rewrites it with push.
> But this isn't needed, because items are read only when removed from the
> used list. Right after this, they are pushed to the free list, and the
> store-release in push makes sure the items are read before they are
> visible in the free list.
>
> Signed-off-by: Steven Lariau <steven.lariau@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Gage Eads <gage.eads@intel.com>
Thanks,
Gage
On Fri, Sep 25, 2020 at 7:44 PM Steven Lariau <steven.lariau@arm.com> wrote:
>
> Fix cmpexchange usage of weak / strong.
> The generated code is the same on x86 and ARM (there is no weak
> cmpexchange), but the old usage was inconsistent.
> For push and pop update size, weak is used because cmpexchange is inside
> a loop.
> For pop update root, strong is used even though cmpexchange is inside a
> loop, because there may be a lot of operations to do in a loop iteration
> (locate the new head).
Is this patch backport material?
Thanks.
--
David Marchand
> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Monday, September 28, 2020 5:23 AM
> To: Steven Lariau <steven.lariau@arm.com>; Eads, Gage <gage.eads@intel.com>
> Cc: Olivier Matz <olivier.matz@6wind.com>; dev <dev@dpdk.org>; nd
> <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v2 1/5] lib/stack: fix inconsistent weak / strong
> cas
>
> On Fri, Sep 25, 2020 at 7:44 PM Steven Lariau <steven.lariau@arm.com> wrote:
> >
> > Fix cmpexchange usage of weak / strong.
> > The generated code is the same on x86 and ARM (there is no weak
> > cmpexchange), but the old usage was inconsistent.
> > For push and pop update size, weak is used because cmpexchange is inside
> > a loop.
> > For pop update root, strong is used even though cmpexchange is inside a
> > loop, because there may be a lot of operations to do in a loop iteration
> > (locate the new head).
>
> Is this patch backport material?
It's not a bugfix. It could help performance on a system with weak
cmpexchange -- e.g. the pop-update change would ensure no spurious
failures (which the code can handle, but would require another relatively
expensive pass through the pop loop).
Thanks,
Gage
On Fri, Sep 25, 2020 at 7:44 PM Steven Lariau <steven.lariau@arm.com> wrote:
>
> One implementation of the DPDK stack library is lockfree,
> based on C11 memory model for atomics.
> Some of these atomic operations use unnecessary memory orders,
> that can be relaxed.
> This patch relax some of these operations in order to improve
> the performance of the stack library.
>
> The patch was tested on several architectures, to ensure that
> the implementation is correct, and to measure performance.
> Below are the results for a few architectures on multithread stack
> lockfree test.
> The cycles count is the average number of cycles per item to perform
> a bulk push / pop.
>
> $sudo ./builddir/app/dpdk-test
> RTE>>stack_lf_perf_autotest
> difference compared to main
> Cycles count on ThunderX2
> 2 cores, bulk size = 8: -15.85%
> 2 cores, bulk size = 32: -04.56%
> 4 cores, bulk size = 8: -05.00%
> 4 cores, bulk size = 32: -04.35%
> 16 cores, bulk size = 8: -02.38%
> 16 cores, bulk size = 32: -01.88%
>
> difference compared to main
> Cycles count on N1SDP
> 2 cores, batch size = 8: +00.77%
> 2 cores, batch size = 32: -16.00%
>
> difference compared to main
> Cycles count on Skylake
> 2 cores, bulk size = 8: -00.18%
> 2 cores, bulk size = 32: -00.95%
> 4 cores, bulk size = 8: -01.19%
> 4 cores, bulk size = 32: +00.64%
> 16 cores, bulk size = 8: +01.20%
> 16 cores, bulk size = 32: +00.48%
>
> v2: add comment to explain why pop head CAS relaxed is valid
> added Fixes information
>
> Steven Lariau (5):
> lib/stack: fix inconsistent weak / strong cas
> lib/stack: remove push acquire fence
> lib/stack: remove redundant orderings for list->len
> lib/stack: reload head when pop fails
> lib/stack: remove pop cas release ordering
>
> lib/librte_stack/rte_stack_lf_c11.h | 32 +++++++++++++++++++----------
> 1 file changed, 21 insertions(+), 11 deletions(-)
Series applied, thanks for those optimisations.
--
David Marchand