* [RFC] ring: Further performance improvements with C11
@ 2023-06-15 20:13 Wathsala Vithanage
  2023-06-15 20:13 ` [RFC] ring: further " Wathsala Vithanage
  0 siblings, 1 reply; 9+ messages in thread
From: Wathsala Vithanage @ 2023-06-15 20:13 UTC (permalink / raw)
  To: honnappa.nagarahalli, konstantin.v.ananyev, thomas, ruifeng.wang
  Cc: dev, nd, Wathsala Vithanage

(1) Replace the tail store with RELEASE semantics with a RELEASE fence, and
replace the loads of the tail with ACQUIRE semantics in the producer/consumer
head updates with ACQUIRE fences.
------------------------------------------------------------------------
Use __ATOMIC_ACQUIRE fences between the producer/consumer head updates and the
ring element accesses. A CAS instruction always loads from the memory location
it is supposed to conditionally update. An ACQUIRE fence between the CAS (LOAD)
and the LOADs of the ring elements prevents reordering of these operations in a
consumer thread. Similarly, it also prevents reordering of the CAS and the
STOREs to the ring elements that follow in a producer thread.

An __ATOMIC_RELEASE fence in __rte_ring_update_tail prevents reordering of the
tail update (STORE) and the LOADs of the ring elements in consumer threads. It
also prevents reordering of the tail update and the STOREs to the ring in
producer threads.

This pairing can improve performance by eliminating the load of the tail with
ACQUIRE semantics in the consumer/producer head-update loop.

(2) Remove the ACQUIRE fence between the load of old_head and the load of the
opposing tail (cons_tail and prod_tail, respectively) in the
__rte_ring_move_prod_head and __rte_ring_move_cons_head functions.
-----------------------------------------------------------------------
The ACQUIRE fence between the two loads establishes program order to prevent
the use of an outdated consumer or producer tail value in the computation of
free_entries and entries in the two functions, respectively. However, (A) the
problem described above cannot be solved by establishing program order with a
fence, and (B) the resulting values of free_entries and entries are safe to
use even with an outdated tail value.

(A)
The first example below shows a case where the loads of old_head and cons_tail
are reordered, so the free_entries computation uses an outdated consumer tail.
The second is a counter-example in which the same outdated consumer tail is
used to compute free_entries even with an ACQUIRE fence in place.

Initial values for both example (i) and counter-example (ii) are

capacity = 32
r->cons.tail = 5
r->cons.head = 5
r->prod.head = 10
r->prod.tail = 10

Assume CPU2 adds 2 elements to the ring and CPU3 removes 5 elements from the
ring.

(i) Example: The two loads are reversed. Program order violated.
---------------------------------------------------------------------
 cpu1(producer)          cpu2(producer)                cpu3(consumer)
 load r->cons.tail=5
 ^                       |                             |
 |                       load r->prod.head=10          |
 |                       load r->cons.tail=5           |
 |                       store r->prod.head(+n=2)=12   |
 stalled                 <-- enqueue ----->            |
 |                       store r->prod.tail(+n)=12     |
 |                       |                             |
 |                                                     load r->cons.head = 5
 |                                                     load r->prod.tail = 12
 |                                                     store r->cons.head(+n=5)=10
 |                                                     <...dequeue.....>
 |                                                     store r->cons.tail(+n)=10
 v
 load r->prod.head=12

The diagram shows that the value read from r->cons.tail into cons_tail in cpu1
is 5, which is outdated because r->cons.tail was later updated by cpu3. In
this instance the state is...

CPU1: cons_tail=5, old_head=12, free_entries=(32+5-12)=25
RING: r->cons.tail=10, r->prod.head=12

(ii) Example: ACQUIRE fence between the loads. Program order preserved.
-----------------------------------------------------------------------
 cpu1(producer)          cpu2(producer)                cpu3(consumer)
 |                       |                             |
 |                       load r->prod.head=10          |
 |                       load r->cons.tail=5           |
 |                       store r->prod.head(+n=2)=12   |
 |                       <-- enqueue ----->            |
 |                       store r->prod.tail(+n)=12     |
 |                       |                             |
 load r->prod.head=12    |                             |
 ^                       |                             load r->cons.head = 5
 |                       |                             load r->prod.tail = 12
 stalled                 |                             store r->cons.head(+n=5)=10
 |                       |                             <...dequeue.....>
 |                       |                             ^
 v                       |                             |
 load r->cons.tail=5     |                             stalled
                         |                             |
                         |                             v
                         |                             store r->cons.tail(+n)=10

The diagram shows that the value read from r->cons.tail into cons_tail in cpu1
is again 5, which is outdated because r->cons.tail was later updated by cpu3.
In this instance, with the ACQUIRE fence, the state is...

CPU1: cons_tail=5, old_head=12, free_entries=(32+5-12)=25
RING: r->cons.tail=10, r->prod.head=12

Example (ii), even with an ACQUIRE fence preventing reordering of the loads of
old_head and cons_tail, produces the same result for the free_entries
variable, and the system ends up in the same state up to that point as in
example (i). The same can be shown for consumer threads as well. Therefore,
the ACQUIRE fences do not solve the problem they are intended to solve.

If each thread were required to use the latest r->cons.head, r->cons.tail,
r->prod.head, and r->prod.tail values in its computation of free_entries and
entries, a real-time ordering of all the threads would be needed. Establishing
program order within a single thread only provides sequential-consistency
guarantees, which do not include a real-time ordering of all threads.

(B)
The values computed for free_entries and entries are always safe without the
ACQUIRE fence, because the resulting values are less than or equal to the
actual number of free slots or elements in the ring. This can be seen in both
examples (i) and (ii), where the free_entries computation ended up with 25
even though the ring has 30 free slots. Enqueueing 25 elements into a ring
that has space for 30 is safe. The same can be shown for consumer threads as
well. Therefore, the ACQUIRE fence is not required for safety.

Due to (A) and (B) above, the ACQUIRE fence between the load of the head and
the load of the tail in both the consumer and producer head updates can be
removed to improve the performance of the ring library.
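As a purely illustrative aside, the unsigned arithmetic behind the numbers in
examples (i) and (ii) can be reproduced with a few lines of standalone C. The
variable names mirror __rte_ring_move_prod_head, but the snippet is only a
sketch of the computation, not code from the patch:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t capacity  = 32;
        uint32_t cons_tail = 5;   /* stale value observed by cpu1           */
        uint32_t old_head  = 12;  /* r->prod.head after cpu2's enqueue of 2 */

        /* Same unsigned 32-bit expression as in __rte_ring_move_prod_head. */
        uint32_t free_entries = capacity + cons_tail - old_head;

        /* With the up-to-date tail (10) the ring really has 32+10-12 = 30
         * free slots; the stale tail only underestimates that number. */
        printf("free_entries = %u\n", free_entries);   /* prints 25 */
        return 0;
    }

An underestimate of this kind is what argument (B) relies on: claiming 25
free slots when 30 are actually free is harmless, whereas an overestimate
would not be.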
Performance on Arm N1
Gain relative to generic implementation

+--------------------------------------------------------------------+
|              Bulk enq/dequeue count on size 8 (Arm N1)              |
+--------------------------------------------------------------------+
| Generic              | C11 atomics          | C11 atomics improved  |
+--------------------------------------------------------------------+
| Total count: 766730  | Total count: 651686  | Total count: 959621   |
|                      | Gain: -15%           | Gain: 25%             |
+--------------------------------------------------------------------+

+--------------------------------------------------------------------+
|              Bulk enq/dequeue count on size 32 (Arm N1)             |
+--------------------------------------------------------------------+
| Generic              | C11 atomics          | C11 atomics improved  |
+--------------------------------------------------------------------+
| Total count: 816745  | Total count: 646385  | Total count: 766730   |
|                      | Gain: -21%           | Gain: -6%             |
+--------------------------------------------------------------------+

Performance on x86-64 Cascade Lake
Gain relative to generic implementation

+--------------------------------------------------------------------+
|                  Bulk enq/dequeue count on size 8                   |
+--------------------------------------------------------------------+
| Generic              | C11 atomics          | C11 atomics improved  |
+--------------------------------------------------------------------+
| Total count: 165042  | Total count: 168971  | Total count: 168089   |
|                      | Gain: 2.3%           | Gain: 1.8%            |
+--------------------------------------------------------------------+

+--------------------------------------------------------------------+
|                  Bulk enq/dequeue count on size 32                  |
+--------------------------------------------------------------------+
| Generic              | C11 atomics          | C11 atomics improved  |
+--------------------------------------------------------------------+
| Total count: 150884  | Total count: 153162  | Total count: 159661   |
|                      | Gain: 1.5%           | Gain: 5.8%            |
+--------------------------------------------------------------------+

Wathsala Vithanage (1):
  ring: improving performance with C11

 .mailmap                    |  1 +
 lib/ring/rte_ring_c11_pvt.h | 35 ++++++++++++++++++++---------------
 2 files changed, 21 insertions(+), 15 deletions(-)

--
2.25.1

^ permalink raw reply	[flat|nested] 9+ messages in thread
* [RFC] ring: further performance improvements with C11 2023-06-15 20:13 [RFC] ring: Further performance improvements with C11 Wathsala Vithanage @ 2023-06-15 20:13 ` Wathsala Vithanage 2023-07-31 12:31 ` Thomas Monjalon 2023-08-02 9:42 ` Konstantin Ananyev 0 siblings, 2 replies; 9+ messages in thread From: Wathsala Vithanage @ 2023-06-15 20:13 UTC (permalink / raw) To: honnappa.nagarahalli, konstantin.v.ananyev, thomas, ruifeng.wang Cc: dev, nd, Wathsala Vithanage For improved performance over the current C11 based ring implementation following changes were made. (1) Replace tail store with RELEASE semantics in __rte_ring_update_tail with a RELEASE fence. Replace load of the tail with ACQUIRE semantics in __rte_ring_move_prod_head and __rte_ring_move_cons_head with ACQUIRE fences. (2) Remove ACQUIRE fences between load of the old_head and load of the cons_tail in __rte_ring_move_prod_head and __rte_ring_move_cons_head. These two fences are not required for the safety of the ring library. Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> --- .mailmap | 1 + lib/ring/rte_ring_c11_pvt.h | 35 ++++++++++++++++++++--------------- 2 files changed, 21 insertions(+), 15 deletions(-) diff --git a/.mailmap b/.mailmap index 4018f0fc47..367115d134 100644 --- a/.mailmap +++ b/.mailmap @@ -1430,6 +1430,7 @@ Walter Heymans <walter.heymans@corigine.com> Wang Sheng-Hui <shhuiw@gmail.com> Wangyu (Eric) <seven.wangyu@huawei.com> Waterman Cao <waterman.cao@intel.com> +Wathsala Vithanage <wathsala.vithanage@arm.com> Weichun Chen <weichunx.chen@intel.com> Wei Dai <wei.dai@intel.com> Weifeng Li <liweifeng96@126.com> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h index f895950df4..63fe58ce9e 100644 --- a/lib/ring/rte_ring_c11_pvt.h +++ b/lib/ring/rte_ring_c11_pvt.h @@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val, uint32_t single, uint32_t enqueue) { RTE_SET_USED(enqueue); + /* + * Updating of ht->tail cannot happen before elements are added to or + * removed from the ring, as it could result in data races between + * producer and consumer threads. Therefore we need a release + * barrier here. + */ + rte_atomic_thread_fence(__ATOMIC_RELEASE); /* * If there are other enqueues/dequeues in progress that preceded us, @@ -24,7 +31,7 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val, if (!single) rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED); - __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE); + __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED); } /** @@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp, /* Reset n to the initial burst count */ n = max; - /* Ensure the head is read before tail */ - __atomic_thread_fence(__ATOMIC_ACQUIRE); - - /* load-acquire synchronize with store-release of ht->tail - * in update_tail. - */ cons_tail = __atomic_load_n(&r->cons.tail, - __ATOMIC_ACQUIRE); + __ATOMIC_RELAXED); /* The subtraction is done between two unsigned 32bits value * (the result is always modulo 32 bits even if we have @@ -100,6 +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED); } while (unlikely(success == 0)); + /* + * Ensure that updates to the ring doesn't rise above + * load of the new_head in SP and MP cases. 
+ */ + rte_atomic_thread_fence(__ATOMIC_ACQUIRE); return n; } @@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc, /* Restore n as it may change every loop */ n = max; - /* Ensure the head is read before tail */ - __atomic_thread_fence(__ATOMIC_ACQUIRE); - - /* this load-acquire synchronize with store-release of ht->tail - * in update_tail. - */ prod_tail = __atomic_load_n(&r->prod.tail, - __ATOMIC_ACQUIRE); + __ATOMIC_RELAXED); /* The subtraction is done between two unsigned 32bits value * (the result is always modulo 32 bits even if we have @@ -175,6 +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED); } while (unlikely(success == 0)); + /* + * Ensure that updates to the ring doesn't rise above + * load of the new_head in SP and MP cases. + */ + rte_atomic_thread_fence(__ATOMIC_ACQUIRE); return n; } -- 2.25.1 ^ permalink raw reply [flat|nested] 9+ messages in thread
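For reference, a minimal standalone C11 sketch of the release-fence /
acquire-fence pairing that change (1) relies on. The names tail, ring_elem,
producer and consumer are illustrative only and are not part of the rte_ring
API; this is a sketch of the synchronization pattern, not the patched code:

    #include <stdatomic.h>
    #include <stdint.h>

    static uint32_t ring_elem;        /* stands in for one ring slot */
    static _Atomic uint32_t tail;     /* stands in for ht->tail      */

    /* Producer side, mirroring the patched __rte_ring_update_tail(): plain
     * writes to the ring, then a release fence, then a relaxed tail store. */
    void producer(void)
    {
        ring_elem = 42;
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&tail, 1, memory_order_relaxed);
    }

    /* Consumer side, mirroring the patched head move: a relaxed load of the
     * opposing tail, then an acquire fence before touching the ring. */
    uint32_t consumer(void)
    {
        while (atomic_load_explicit(&tail, memory_order_relaxed) != 1)
            ;
        atomic_thread_fence(memory_order_acquire);
        return ring_elem;   /* guaranteed to observe 42 */
    }

Per C11, a release fence followed by a relaxed store synchronizes with a
relaxed load that observes the stored value followed by an acquire fence,
which is the property the rewritten tail update and head move depend on.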
* Re: [RFC] ring: further performance improvements with C11 2023-06-15 20:13 ` [RFC] ring: further " Wathsala Vithanage @ 2023-07-31 12:31 ` Thomas Monjalon 2023-08-03 2:56 ` Honnappa Nagarahalli 2023-08-02 9:42 ` Konstantin Ananyev 1 sibling, 1 reply; 9+ messages in thread From: Thomas Monjalon @ 2023-07-31 12:31 UTC (permalink / raw) To: Wathsala Vithanage Cc: honnappa.nagarahalli, konstantin.v.ananyev, ruifeng.wang, dev, nd, stephen, jerinj, Morten Brørup, Tyler Retzlaff 15/06/2023 22:13, Wathsala Vithanage: > For improved performance over the current C11 based ring implementation > following changes were made. > (1) Replace tail store with RELEASE semantics in __rte_ring_update_tail > with a RELEASE fence. Replace load of the tail with ACQUIRE semantics > in __rte_ring_move_prod_head and __rte_ring_move_cons_head with ACQUIRE > fences. > (2) Remove ACQUIRE fences between load of the old_head and load of the > cons_tail in __rte_ring_move_prod_head and __rte_ring_move_cons_head. > These two fences are not required for the safety of the ring library. > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> Are we waiting for more reviews? ^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [RFC] ring: further performance improvements with C11 2023-07-31 12:31 ` Thomas Monjalon @ 2023-08-03 2:56 ` Honnappa Nagarahalli 0 siblings, 0 replies; 9+ messages in thread From: Honnappa Nagarahalli @ 2023-08-03 2:56 UTC (permalink / raw) To: thomas, Wathsala Wathawana Vithanage Cc: konstantin.v.ananyev, Ruifeng Wang, dev, nd, stephen, jerinj, Morten Brørup, Tyler Retzlaff, nd > -----Original Message----- > From: Thomas Monjalon <thomas@monjalon.net> > Sent: Monday, July 31, 2023 7:31 AM > To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com> > Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; > konstantin.v.ananyev@yandex.ru; Ruifeng Wang <Ruifeng.Wang@arm.com>; > dev@dpdk.org; nd <nd@arm.com>; stephen@networkplumber.org; > jerinj@marvell.com; Morten Brørup <mb@smartsharesystems.com>; Tyler > Retzlaff <roretzla@linux.microsoft.com> > Subject: Re: [RFC] ring: further performance improvements with C11 > > 15/06/2023 22:13, Wathsala Vithanage: > > For improved performance over the current C11 based ring > > implementation following changes were made. > > (1) Replace tail store with RELEASE semantics in > > __rte_ring_update_tail with a RELEASE fence. Replace load of the tail > > with ACQUIRE semantics in __rte_ring_move_prod_head and > > __rte_ring_move_cons_head with ACQUIRE fences. > > (2) Remove ACQUIRE fences between load of the old_head and load of the > > cons_tail in __rte_ring_move_prod_head and __rte_ring_move_cons_head. > > These two fences are not required for the safety of the ring library. > > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > > Are we waiting for more reviews? We do not have a good solution. We should discuss this in the Techboard meeting. > ^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [RFC] ring: further performance improvements with C11 2023-06-15 20:13 ` [RFC] ring: further " Wathsala Vithanage 2023-07-31 12:31 ` Thomas Monjalon @ 2023-08-02 9:42 ` Konstantin Ananyev 2023-08-04 22:50 ` Wathsala Wathawana Vithanage 1 sibling, 1 reply; 9+ messages in thread From: Konstantin Ananyev @ 2023-08-02 9:42 UTC (permalink / raw) To: Wathsala Vithanage, honnappa.nagarahalli, konstantin.v.ananyev, thomas, ruifeng.wang Cc: dev, nd, justin.he > For improved performance over the current C11 based ring implementation > following changes were made. > (1) Replace tail store with RELEASE semantics in __rte_ring_update_tail > with a RELEASE fence. Replace load of the tail with ACQUIRE semantics > in __rte_ring_move_prod_head and __rte_ring_move_cons_head with ACQUIRE > fences. > (2) Remove ACQUIRE fences between load of the old_head and load of the > cons_tail in __rte_ring_move_prod_head and __rte_ring_move_cons_head. > These two fences are not required for the safety of the ring library. Hmm... with these changes, aren't we re-introducing the old bug fixed by this commit: commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654 Author: Jia He <justin.he@arm.com> Date: Fri Nov 10 03:30:42 2017 +0000 ring: guarantee load/load order in enqueue and dequeue We watched a rte panic of mbuf_autotest in our qualcomm arm64 server (Amberwing). Root cause: In __rte_ring_move_cons_head() ... do { /* Restore n as it may change every loop */ n = max; *old_head = r->cons.head; //1st load const uint32_t prod_tail = r->prod.tail; //2nd load In weak memory order architectures (powerpc,arm), the 2nd load might be reodered before the 1st load, that makes *entries is bigger than we wanted. This nasty reording messed enque/deque up. .... ? > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > --- > .mailmap | 1 + > lib/ring/rte_ring_c11_pvt.h | 35 ++++++++++++++++++++--------------- > 2 files changed, 21 insertions(+), 15 deletions(-) > > diff --git a/.mailmap b/.mailmap > index 4018f0fc47..367115d134 100644 > --- a/.mailmap > +++ b/.mailmap > @@ -1430,6 +1430,7 @@ Walter Heymans <walter.heymans@corigine.com> > Wang Sheng-Hui <shhuiw@gmail.com> > Wangyu (Eric) <seven.wangyu@huawei.com> > Waterman Cao <waterman.cao@intel.com> > +Wathsala Vithanage <wathsala.vithanage@arm.com> > Weichun Chen <weichunx.chen@intel.com> > Wei Dai <wei.dai@intel.com> > Weifeng Li <liweifeng96@126.com> > diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h > index f895950df4..63fe58ce9e 100644 > --- a/lib/ring/rte_ring_c11_pvt.h > +++ b/lib/ring/rte_ring_c11_pvt.h > @@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val, > uint32_t new_val, uint32_t single, uint32_t enqueue) > { > RTE_SET_USED(enqueue); > + /* > + * Updating of ht->tail cannot happen before elements are added to or > + * removed from the ring, as it could result in data races between > + * producer and consumer threads. Therefore we need a release > + * barrier here. 
> + */ > + rte_atomic_thread_fence(__ATOMIC_RELEASE); > > /* > * If there are other enqueues/dequeues in progress that preceded us, > @@ -24,7 +31,7 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val, > if (!single) > rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED); > > - __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE); > + __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED); > } > > /** > @@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp, > /* Reset n to the initial burst count */ > n = max; > > - /* Ensure the head is read before tail */ > - __atomic_thread_fence(__ATOMIC_ACQUIRE); > - > - /* load-acquire synchronize with store-release of ht->tail > - * in update_tail. > - */ > cons_tail = __atomic_load_n(&r->cons.tail, > - __ATOMIC_ACQUIRE); > + __ATOMIC_RELAXED); > > /* The subtraction is done between two unsigned 32bits value > * (the result is always modulo 32 bits even if we have > @@ -100,6 +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp, > 0, __ATOMIC_RELAXED, > __ATOMIC_RELAXED); > } while (unlikely(success == 0)); > + /* > + * Ensure that updates to the ring doesn't rise above > + * load of the new_head in SP and MP cases. > + */ > + rte_atomic_thread_fence(__ATOMIC_ACQUIRE); > return n; > } > > @@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc, > /* Restore n as it may change every loop */ > n = max; > > - /* Ensure the head is read before tail */ > - __atomic_thread_fence(__ATOMIC_ACQUIRE); > - > - /* this load-acquire synchronize with store-release of ht->tail > - * in update_tail. > - */ > prod_tail = __atomic_load_n(&r->prod.tail, > - __ATOMIC_ACQUIRE); > + __ATOMIC_RELAXED); > > /* The subtraction is done between two unsigned 32bits value > * (the result is always modulo 32 bits even if we have > @@ -175,6 +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc, > 0, __ATOMIC_RELAXED, > __ATOMIC_RELAXED); > } while (unlikely(success == 0)); > + /* > + * Ensure that updates to the ring doesn't rise above > + * load of the new_head in SP and MP cases. > + */ > + rte_atomic_thread_fence(__ATOMIC_ACQUIRE); > return n; > } > > -- > 2.25.1 > ^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [RFC] ring: further performance improvements with C11 2023-08-02 9:42 ` Konstantin Ananyev @ 2023-08-04 22:50 ` Wathsala Wathawana Vithanage 2023-08-09 18:18 ` Konstantin Ananyev 0 siblings, 1 reply; 9+ messages in thread From: Wathsala Wathawana Vithanage @ 2023-08-04 22:50 UTC (permalink / raw) To: Konstantin Ananyev, Honnappa Nagarahalli, konstantin.v.ananyev, thomas, Ruifeng Wang Cc: dev, nd, Justin He, nd > -----Original Message----- > From: Konstantin Ananyev <konstantin.ananyev@huawei.com> > Sent: Wednesday, August 2, 2023 4:43 AM > To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>; > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; > konstantin.v.ananyev@yandex.ru; thomas@monjalon.net; Ruifeng Wang > <Ruifeng.Wang@arm.com> > Cc: dev@dpdk.org; nd <nd@arm.com>; Justin He <Justin.He@arm.com> > Subject: RE: [RFC] ring: further performance improvements with C11 > > > > > For improved performance over the current C11 based ring > > implementation following changes were made. > > (1) Replace tail store with RELEASE semantics in > > __rte_ring_update_tail with a RELEASE fence. Replace load of the tail > > with ACQUIRE semantics in __rte_ring_move_prod_head and > > __rte_ring_move_cons_head with ACQUIRE fences. > > (2) Remove ACQUIRE fences between load of the old_head and load of the > > cons_tail in __rte_ring_move_prod_head and __rte_ring_move_cons_head. > > These two fences are not required for the safety of the ring library. > > Hmm... with these changes, aren't we re-introducing the old bug fixed by this > commit: Cover letter explains why this barrier does not solve what it intends to solve and Why it should not matter. https://mails.dpdk.org/archives/dev/2023-June/270874.html > > commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654 > Author: Jia He <justin.he@arm.com> > Date: Fri Nov 10 03:30:42 2017 +0000 > > ring: guarantee load/load order in enqueue and dequeue > > We watched a rte panic of mbuf_autotest in our qualcomm arm64 server > (Amberwing). > > Root cause: > In __rte_ring_move_cons_head() > ... > do { > /* Restore n as it may change every loop */ > n = max; > > *old_head = r->cons.head; //1st load > const uint32_t prod_tail = r->prod.tail; //2nd load > > In weak memory order architectures (powerpc,arm), the 2nd load might be > reodered before the 1st load, that makes *entries is bigger than we wanted. > This nasty reording messed enque/deque up. > .... > ? 
> > > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > > --- > > .mailmap | 1 + > > lib/ring/rte_ring_c11_pvt.h | 35 ++++++++++++++++++++--------------- > > 2 files changed, 21 insertions(+), 15 deletions(-) > > > > diff --git a/.mailmap b/.mailmap > > index 4018f0fc47..367115d134 100644 > > --- a/.mailmap > > +++ b/.mailmap > > @@ -1430,6 +1430,7 @@ Walter Heymans > <walter.heymans@corigine.com> > > Wang Sheng-Hui <shhuiw@gmail.com> Wangyu (Eric) > > <seven.wangyu@huawei.com> Waterman Cao <waterman.cao@intel.com> > > +Wathsala Vithanage <wathsala.vithanage@arm.com> > > Weichun Chen <weichunx.chen@intel.com> Wei Dai <wei.dai@intel.com> > > Weifeng Li <liweifeng96@126.com> diff --git > > a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h index > > f895950df4..63fe58ce9e 100644 > > --- a/lib/ring/rte_ring_c11_pvt.h > > +++ b/lib/ring/rte_ring_c11_pvt.h > > @@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, > uint32_t old_val, > > uint32_t new_val, uint32_t single, uint32_t enqueue) { > > RTE_SET_USED(enqueue); > > + /* > > + * Updating of ht->tail cannot happen before elements are added to or > > + * removed from the ring, as it could result in data races between > > + * producer and consumer threads. Therefore we need a release > > + * barrier here. > > + */ > > + rte_atomic_thread_fence(__ATOMIC_RELEASE); > > > > /* > > * If there are other enqueues/dequeues in progress that preceded > > us, @@ -24,7 +31,7 @@ __rte_ring_update_tail(struct rte_ring_headtail > *ht, uint32_t old_val, > > if (!single) > > rte_wait_until_equal_32(&ht->tail, old_val, > __ATOMIC_RELAXED); > > > > - __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE); > > + __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED); > > } > > > > /** > > @@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, > unsigned int is_sp, > > /* Reset n to the initial burst count */ > > n = max; > > > > - /* Ensure the head is read before tail */ > > - __atomic_thread_fence(__ATOMIC_ACQUIRE); > > - > > - /* load-acquire synchronize with store-release of ht->tail > > - * in update_tail. > > - */ > > cons_tail = __atomic_load_n(&r->cons.tail, > > - __ATOMIC_ACQUIRE); > > + __ATOMIC_RELAXED); > > > > /* The subtraction is done between two unsigned 32bits value > > * (the result is always modulo 32 bits even if we have @@ - > 100,6 > > +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int > is_sp, > > 0, __ATOMIC_RELAXED, > > __ATOMIC_RELAXED); > > } while (unlikely(success == 0)); > > + /* > > + * Ensure that updates to the ring doesn't rise above > > + * load of the new_head in SP and MP cases. > > + */ > > + rte_atomic_thread_fence(__ATOMIC_ACQUIRE); > > return n; > > } > > > > @@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int > is_sc, > > /* Restore n as it may change every loop */ > > n = max; > > > > - /* Ensure the head is read before tail */ > > - __atomic_thread_fence(__ATOMIC_ACQUIRE); > > - > > - /* this load-acquire synchronize with store-release of ht->tail > > - * in update_tail. 
> > - */ > > prod_tail = __atomic_load_n(&r->prod.tail, > > - __ATOMIC_ACQUIRE); > > + __ATOMIC_RELAXED); > > > > /* The subtraction is done between two unsigned 32bits value > > * (the result is always modulo 32 bits even if we have @@ - > 175,6 > > +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc, > > 0, > __ATOMIC_RELAXED, > > __ATOMIC_RELAXED); > > } while (unlikely(success == 0)); > > + /* > > + * Ensure that updates to the ring doesn't rise above > > + * load of the new_head in SP and MP cases. > > + */ > > + rte_atomic_thread_fence(__ATOMIC_ACQUIRE); > > return n; > > } > > > > -- > > 2.25.1 > > ^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [RFC] ring: further performance improvements with C11 2023-08-04 22:50 ` Wathsala Wathawana Vithanage @ 2023-08-09 18:18 ` Konstantin Ananyev 2023-08-15 5:14 ` Honnappa Nagarahalli 0 siblings, 1 reply; 9+ messages in thread From: Konstantin Ananyev @ 2023-08-09 18:18 UTC (permalink / raw) To: Wathsala Wathawana Vithanage, Honnappa Nagarahalli, konstantin.v.ananyev, thomas, Ruifeng Wang Cc: dev, nd, Justin He, nd > > > For improved performance over the current C11 based ring > > > implementation following changes were made. > > > (1) Replace tail store with RELEASE semantics in > > > __rte_ring_update_tail with a RELEASE fence. Replace load of the tail > > > with ACQUIRE semantics in __rte_ring_move_prod_head and > > > __rte_ring_move_cons_head with ACQUIRE fences. > > > (2) Remove ACQUIRE fences between load of the old_head and load of the > > > cons_tail in __rte_ring_move_prod_head and __rte_ring_move_cons_head. > > > These two fences are not required for the safety of the ring library. > > > > Hmm... with these changes, aren't we re-introducing the old bug fixed by this > > commit: > > Cover letter explains why this barrier does not solve what it intends to solve and > Why it should not matter. > https://mails.dpdk.org/archives/dev/2023-June/270874.html Ok, let's consider the case similar to yours (i), but when r->prod.head was moved for distance greater then r->capacity. To be more specific, let' start with the same initial state: capacity = 32 r->cons.tail = 5 r->cons.head = 5 r->prod.head = 10 r-prod.tail = 10 time 0, thread1: /* re-ordered load */ const_tail = r->cons.tail; //= 5 Now, thread1 was stalled for a bit, meanwhile there were few enqueue/dequeus done by other threads, so current state of the ring: r->cons.tail = 105 r->cons.head = 105 r->prod.head = 110 r-prod.tail = 110 time 1, thread1: old_head = r->prod.head; // 110 *free_entries = (capacity + cons_tail - old_head); // = (uint32_t)(32 + 5 - 110) == (uint32_t)-73 == 4294967223 So, free_entries value is way too big, and that comparison: if (unlikely(n > *free_entries)) might provide wrong result. So I still think we do need some sort of _read_fence_ between these two loads. As I said before, that looks exactly like the old bug, fixed a while ago: http://git.dpdk.org/dpdk/commit/?id=9bc2cbb007c0a3335c5582357ae9f6d37ea0b654 but now re-introduced for C11 case. > > > > commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654 > > Author: Jia He <justin.he@arm.com> > > Date: Fri Nov 10 03:30:42 2017 +0000 > > > > ring: guarantee load/load order in enqueue and dequeue > > > > We watched a rte panic of mbuf_autotest in our qualcomm arm64 server > > (Amberwing). > > > > Root cause: > > In __rte_ring_move_cons_head() > > ... > > do { > > /* Restore n as it may change every loop */ > > n = max; > > > > *old_head = r->cons.head; //1st load > > const uint32_t prod_tail = r->prod.tail; //2nd load > > > > In weak memory order architectures (powerpc,arm), the 2nd load might be > > reodered before the 1st load, that makes *entries is bigger than we wanted. > > This nasty reording messed enque/deque up. > > .... > > ? 
> > > > > > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com> > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > --- > > > .mailmap | 1 + > > > lib/ring/rte_ring_c11_pvt.h | 35 ++++++++++++++++++++--------------- > > > 2 files changed, 21 insertions(+), 15 deletions(-) > > > > > > diff --git a/.mailmap b/.mailmap > > > index 4018f0fc47..367115d134 100644 > > > --- a/.mailmap > > > +++ b/.mailmap > > > @@ -1430,6 +1430,7 @@ Walter Heymans > > <walter.heymans@corigine.com> > > > Wang Sheng-Hui <shhuiw@gmail.com> Wangyu (Eric) > > > <seven.wangyu@huawei.com> Waterman Cao <waterman.cao@intel.com> > > > +Wathsala Vithanage <wathsala.vithanage@arm.com> > > > Weichun Chen <weichunx.chen@intel.com> Wei Dai <wei.dai@intel.com> > > > Weifeng Li <liweifeng96@126.com> diff --git > > > a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h index > > > f895950df4..63fe58ce9e 100644 > > > --- a/lib/ring/rte_ring_c11_pvt.h > > > +++ b/lib/ring/rte_ring_c11_pvt.h > > > @@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, > > uint32_t old_val, > > > uint32_t new_val, uint32_t single, uint32_t enqueue) { > > > RTE_SET_USED(enqueue); > > > + /* > > > + * Updating of ht->tail cannot happen before elements are added to or > > > + * removed from the ring, as it could result in data races between > > > + * producer and consumer threads. Therefore we need a release > > > + * barrier here. > > > + */ > > > + rte_atomic_thread_fence(__ATOMIC_RELEASE); > > > > > > /* > > > * If there are other enqueues/dequeues in progress that preceded > > > us, @@ -24,7 +31,7 @@ __rte_ring_update_tail(struct rte_ring_headtail > > *ht, uint32_t old_val, > > > if (!single) > > > rte_wait_until_equal_32(&ht->tail, old_val, > > __ATOMIC_RELAXED); > > > > > > - __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE); > > > + __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED); > > > } > > > > > > /** > > > @@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, > > unsigned int is_sp, > > > /* Reset n to the initial burst count */ > > > n = max; > > > > > > - /* Ensure the head is read before tail */ > > > - __atomic_thread_fence(__ATOMIC_ACQUIRE); > > > - > > > - /* load-acquire synchronize with store-release of ht->tail > > > - * in update_tail. > > > - */ > > > cons_tail = __atomic_load_n(&r->cons.tail, > > > - __ATOMIC_ACQUIRE); > > > + __ATOMIC_RELAXED); > > > > > > /* The subtraction is done between two unsigned 32bits value > > > * (the result is always modulo 32 bits even if we have @@ - > > 100,6 > > > +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int > > is_sp, > > > 0, __ATOMIC_RELAXED, > > > __ATOMIC_RELAXED); > > > } while (unlikely(success == 0)); > > > + /* > > > + * Ensure that updates to the ring doesn't rise above > > > + * load of the new_head in SP and MP cases. > > > + */ > > > + rte_atomic_thread_fence(__ATOMIC_ACQUIRE); > > > return n; > > > } > > > > > > @@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int > > is_sc, > > > /* Restore n as it may change every loop */ > > > n = max; > > > > > > - /* Ensure the head is read before tail */ > > > - __atomic_thread_fence(__ATOMIC_ACQUIRE); > > > - > > > - /* this load-acquire synchronize with store-release of ht->tail > > > - * in update_tail. 
> > > - */ > > > prod_tail = __atomic_load_n(&r->prod.tail, > > > - __ATOMIC_ACQUIRE); > > > + __ATOMIC_RELAXED); > > > > > > /* The subtraction is done between two unsigned 32bits value > > > * (the result is always modulo 32 bits even if we have @@ - > > 175,6 > > > +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc, > > > 0, > > __ATOMIC_RELAXED, > > > __ATOMIC_RELAXED); > > > } while (unlikely(success == 0)); > > > + /* > > > + * Ensure that updates to the ring doesn't rise above > > > + * load of the new_head in SP and MP cases. > > > + */ > > > + rte_atomic_thread_fence(__ATOMIC_ACQUIRE); > > > return n; > > > } > > > > > > -- > > > 2.25.1 > > > ^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [RFC] ring: further performance improvements with C11 2023-08-09 18:18 ` Konstantin Ananyev @ 2023-08-15 5:14 ` Honnappa Nagarahalli 2023-08-21 13:27 ` Konstantin Ananyev 0 siblings, 1 reply; 9+ messages in thread From: Honnappa Nagarahalli @ 2023-08-15 5:14 UTC (permalink / raw) To: Konstantin Ananyev, Wathsala Wathawana Vithanage, konstantin.v.ananyev, thomas, Ruifeng Wang Cc: dev, nd, Justin He, nd > -----Original Message----- > From: Konstantin Ananyev <konstantin.ananyev@huawei.com> > Sent: Wednesday, August 9, 2023 1:19 PM > To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>; > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; > konstantin.v.ananyev@yandex.ru; thomas@monjalon.net; Ruifeng Wang > <Ruifeng.Wang@arm.com> > Cc: dev@dpdk.org; nd <nd@arm.com>; Justin He <Justin.He@arm.com>; nd > <nd@arm.com> > Subject: RE: [RFC] ring: further performance improvements with C11 > > > > > > For improved performance over the current C11 based ring > > > > implementation following changes were made. > > > > (1) Replace tail store with RELEASE semantics in > > > > __rte_ring_update_tail with a RELEASE fence. Replace load of the > > > > tail with ACQUIRE semantics in __rte_ring_move_prod_head and > > > > __rte_ring_move_cons_head with ACQUIRE fences. > > > > (2) Remove ACQUIRE fences between load of the old_head and load of > > > > the cons_tail in __rte_ring_move_prod_head and > __rte_ring_move_cons_head. > > > > These two fences are not required for the safety of the ring library. > > > > > > Hmm... with these changes, aren't we re-introducing the old bug > > > fixed by this > > > commit: > > > > Cover letter explains why this barrier does not solve what it intends > > to solve and Why it should not matter. > > https://mails.dpdk.org/archives/dev/2023-June/270874.html > > Ok, let's consider the case similar to yours (i), but when r->prod.head was > moved for distance greater then r->capacity. > To be more specific, let' start with the same initial state: > capacity = 32 > r->cons.tail = 5 > r->cons.head = 5 > r->prod.head = 10 > r-prod.tail = 10 > > time 0, thread1: > /* re-ordered load */ > const_tail = r->cons.tail; //= 5 > > Now, thread1 was stalled for a bit, meanwhile there were few What exactly do you mean by 'stalled'? If you are meaning, thread is preempted, then the ring algorithm is not designed for it. There are restrictions mentioned in [1]. However, we do need to handle the re-ordering case. [1] https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#known-issues > enqueue/dequeus done by other threads, so current state of the ring: > r->cons.tail = 105 > r->cons.head = 105 > r->prod.head = 110 > r-prod.tail = 110 > > time 1, thread1: > old_head = r->prod.head; // 110 > *free_entries = (capacity + cons_tail - old_head); // = (uint32_t)(32 + 5 - 110) > == (uint32_t)-73 == 4294967223 > > So, free_entries value is way too big, and that comparison: > > if (unlikely(n > *free_entries)) > > might provide wrong result. > > So I still think we do need some sort of _read_fence_ between these two loads. > As I said before, that looks exactly like the old bug, fixed a while ago: > http://git.dpdk.org/dpdk/commit/?id=9bc2cbb007c0a3335c5582357ae9f6d37 > ea0b654 > but now re-introduced for C11 case. Agree that the re-ordering case should be handled. I am thinking a check (*free_entries > capacity) and restarting the loop might suffice (without the barrier)? 
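Purely as an illustration of the kind of guard being suggested here (the exact
placement inside the head-move retry loop and the retry mechanics are
assumptions, not a tested change):

    #include <stdint.h>

    /*
     * For a consistent snapshot, 0 <= old_head - cons_tail <= capacity, so
     * free_entries always lands in [0, capacity]. A cons_tail stale enough
     * to violate that makes the unsigned subtraction wrap, and the result
     * exceeds capacity -- a provably inconsistent value that the head-move
     * loop could discard and recompute instead of trusting.
     */
    static inline int
    free_entries_plausible(uint32_t capacity, uint32_t cons_tail,
                           uint32_t old_head)
    {
        uint32_t free_entries = capacity + cons_tail - old_head;

        return free_entries <= capacity;   /* 0 => retry the head-move loop */
    }

Inside __rte_ring_move_prod_head() this would amount to something like
"if (unlikely(*free_entries > capacity)) { /* re-read and retry */ }" placed
right after the free_entries computation.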
> > > > > > > commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654 > > > Author: Jia He <justin.he@arm.com> > > > Date: Fri Nov 10 03:30:42 2017 +0000 > > > > > > ring: guarantee load/load order in enqueue and dequeue > > > > > > We watched a rte panic of mbuf_autotest in our qualcomm arm64 server > > > (Amberwing). > > > > > > Root cause: > > > In __rte_ring_move_cons_head() > > > ... > > > do { > > > /* Restore n as it may change every loop */ > > > n = max; > > > > > > *old_head = r->cons.head; //1st load > > > const uint32_t prod_tail = r->prod.tail; //2nd > > > load > > > > > > In weak memory order architectures (powerpc,arm), the 2nd load might > be > > > reodered before the 1st load, that makes *entries is bigger than we > wanted. > > > This nasty reording messed enque/deque up. > > > .... > > > ? > > > > > > > > > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com> > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > --- > > > > .mailmap | 1 + > > > > lib/ring/rte_ring_c11_pvt.h | 35 > > > > ++++++++++++++++++++--------------- > > > > 2 files changed, 21 insertions(+), 15 deletions(-) > > > > > > > > diff --git a/.mailmap b/.mailmap > > > > index 4018f0fc47..367115d134 100644 > > > > --- a/.mailmap > > > > +++ b/.mailmap > > > > @@ -1430,6 +1430,7 @@ Walter Heymans > > > <walter.heymans@corigine.com> > > > > Wang Sheng-Hui <shhuiw@gmail.com> Wangyu (Eric) > > > > <seven.wangyu@huawei.com> Waterman Cao > <waterman.cao@intel.com> > > > > +Wathsala Vithanage <wathsala.vithanage@arm.com> > > > > Weichun Chen <weichunx.chen@intel.com> Wei Dai > > > > <wei.dai@intel.com> Weifeng Li <liweifeng96@126.com> diff --git > > > > a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h index > > > > f895950df4..63fe58ce9e 100644 > > > > --- a/lib/ring/rte_ring_c11_pvt.h > > > > +++ b/lib/ring/rte_ring_c11_pvt.h > > > > @@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail > > > > *ht, > > > uint32_t old_val, > > > > uint32_t new_val, uint32_t single, uint32_t enqueue) { > > > > RTE_SET_USED(enqueue); > > > > + /* > > > > + * Updating of ht->tail cannot happen before elements are added to or > > > > + * removed from the ring, as it could result in data races between > > > > + * producer and consumer threads. Therefore we need a release > > > > + * barrier here. > > > > + */ > > > > + rte_atomic_thread_fence(__ATOMIC_RELEASE); > > > > > > > > /* > > > > * If there are other enqueues/dequeues in progress that > > > > preceded us, @@ -24,7 +31,7 @@ __rte_ring_update_tail(struct > > > > rte_ring_headtail > > > *ht, uint32_t old_val, > > > > if (!single) > > > > rte_wait_until_equal_32(&ht->tail, old_val, > > > __ATOMIC_RELAXED); > > > > > > > > - __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE); > > > > + __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED); > > > > } > > > > > > > > /** > > > > @@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, > > > unsigned int is_sp, > > > > /* Reset n to the initial burst count */ > > > > n = max; > > > > > > > > - /* Ensure the head is read before tail */ > > > > - __atomic_thread_fence(__ATOMIC_ACQUIRE); > > > > - > > > > - /* load-acquire synchronize with store-release of ht->tail > > > > - * in update_tail. 
> > > > - */ > > > > cons_tail = __atomic_load_n(&r->cons.tail, > > > > - __ATOMIC_ACQUIRE); > > > > + __ATOMIC_RELAXED); > > > > > > > > /* The subtraction is done between two unsigned 32bits value > > > > * (the result is always modulo 32 bits even if we have @@ - > > > 100,6 > > > > +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned > > > > +int > > > is_sp, > > > > 0, __ATOMIC_RELAXED, > > > > __ATOMIC_RELAXED); > > > > } while (unlikely(success == 0)); > > > > + /* > > > > + * Ensure that updates to the ring doesn't rise above > > > > + * load of the new_head in SP and MP cases. > > > > + */ > > > > + rte_atomic_thread_fence(__ATOMIC_ACQUIRE); > > > > return n; > > > > } > > > > > > > > @@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, > > > > int > > > is_sc, > > > > /* Restore n as it may change every loop */ > > > > n = max; > > > > > > > > - /* Ensure the head is read before tail */ > > > > - __atomic_thread_fence(__ATOMIC_ACQUIRE); > > > > - > > > > - /* this load-acquire synchronize with store-release of ht->tail > > > > - * in update_tail. > > > > - */ > > > > prod_tail = __atomic_load_n(&r->prod.tail, > > > > - __ATOMIC_ACQUIRE); > > > > + __ATOMIC_RELAXED); > > > > > > > > /* The subtraction is done between two unsigned 32bits value > > > > * (the result is always modulo 32 bits even if we have @@ - > > > 175,6 > > > > +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int > > > > +is_sc, > > > > 0, > > > __ATOMIC_RELAXED, > > > > > __ATOMIC_RELAXED); > > > > } while (unlikely(success == 0)); > > > > + /* > > > > + * Ensure that updates to the ring doesn't rise above > > > > + * load of the new_head in SP and MP cases. > > > > + */ > > > > + rte_atomic_thread_fence(__ATOMIC_ACQUIRE); > > > > return n; > > > > } > > > > > > > > -- > > > > 2.25.1 > > > > ^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [RFC] ring: further performance improvements with C11 2023-08-15 5:14 ` Honnappa Nagarahalli @ 2023-08-21 13:27 ` Konstantin Ananyev 0 siblings, 0 replies; 9+ messages in thread From: Konstantin Ananyev @ 2023-08-21 13:27 UTC (permalink / raw) To: Honnappa Nagarahalli, Wathsala Wathawana Vithanage, konstantin.v.ananyev, thomas, Ruifeng Wang Cc: dev, nd, Justin He, nd > > > > > For improved performance over the current C11 based ring > > > > > implementation following changes were made. > > > > > (1) Replace tail store with RELEASE semantics in > > > > > __rte_ring_update_tail with a RELEASE fence. Replace load of the > > > > > tail with ACQUIRE semantics in __rte_ring_move_prod_head and > > > > > __rte_ring_move_cons_head with ACQUIRE fences. > > > > > (2) Remove ACQUIRE fences between load of the old_head and load of > > > > > the cons_tail in __rte_ring_move_prod_head and > > __rte_ring_move_cons_head. > > > > > These two fences are not required for the safety of the ring library. > > > > > > > > Hmm... with these changes, aren't we re-introducing the old bug > > > > fixed by this > > > > commit: > > > > > > Cover letter explains why this barrier does not solve what it intends > > > to solve and Why it should not matter. > > > https://mails.dpdk.org/archives/dev/2023-June/270874.html > > > > Ok, let's consider the case similar to yours (i), but when r->prod.head was > > moved for distance greater then r->capacity. > > To be more specific, let' start with the same initial state: > > capacity = 32 > > r->cons.tail = 5 > > r->cons.head = 5 > > r->prod.head = 10 > > r-prod.tail = 10 > > > > time 0, thread1: > > /* re-ordered load */ > > const_tail = r->cons.tail; //= 5 > > > > Now, thread1 was stalled for a bit, meanwhile there were few > What exactly do you mean by 'stalled'? I mean: CPU pipeline the thread was running on get stalled by whatever reasons - memory load latency, tlb miss, etc. > If you are meaning, thread is preempted, > then the ring algorithm is not designed for it. There > are restrictions mentioned in [1]. I am not talking about SW thread preemption here. With the example I provided, I think it is clear that the problem can happen even with 'ideal' conditions: each thread runs non-preempted on a separate core. > However, we do need to handle the re-ordering case. To be clear: for me right now this patch is bogus, it has to be either reworked or abandoned. > > [1] https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#known-issues > > > enqueue/dequeus done by other threads, so current state of the ring: > > r->cons.tail = 105 > > r->cons.head = 105 > > r->prod.head = 110 > > r-prod.tail = 110 > > > > time 1, thread1: > > old_head = r->prod.head; // 110 > > *free_entries = (capacity + cons_tail - old_head); // = (uint32_t)(32 + 5 - 110) > > == (uint32_t)-73 == 4294967223 > > > > So, free_entries value is way too big, and that comparison: > > > > if (unlikely(n > *free_entries)) > > > > might provide wrong result. > > > > So I still think we do need some sort of _read_fence_ between these two loads. > > As I said before, that looks exactly like the old bug, fixed a while ago: > > http://git.dpdk.org/dpdk/commit/?id=9bc2cbb007c0a3335c5582357ae9f6d37 > > ea0b654 > > but now re-introduced for C11 case. > Agree that the re-ordering case should be handled. Either handled, or simply not allowed. > I am thinking a check (*free_entries > capacity) and restarting the loop might > suffice (without the barrier)? 
I thought about the same thing, and at first glance it seems workable in principle. Though I am still hesitate to remove ordering completely here: As both compiler and CPU reordering will be possible, it will probably introduce a possibility of sort-of ABA problem: old cons.tail is read at a very early stage after that cur cons.tail value get wrapped around 2^32 and is now less then old cons.tail. Then we read prod.head So: (invalid_free_ent=capacity + cons.tail._old - prod.head) <= capacity and (valid_free_ent=capacity + cons.tail._cur - prod.head) <= capacity Are both true, but invalid > valid and we overestimate number of free entries. In majority of cases possibility of such situation is negligible. But for huge values for 'capacity' it will grow. Can I ask a bit different question: as I remember this series started as an attempt to improve C11 ring enqueue/dequeue implementation. Though looking at non-C11 one, there also exist a read-fence between these two loads: https://elixir.bootlin.com/dpdk/v23.07/source/lib/ring/rte_ring_generic_pvt.h#L73 Which makes me - might be it is not actual read-ordering that causes slowdown of C11 version. Did anyone try to compare generated code for both cases, wonder what is the difference? Konstantin > > > > > > > > > > commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654 > > > > Author: Jia He <justin.he@arm.com> > > > > Date: Fri Nov 10 03:30:42 2017 +0000 > > > > > > > > ring: guarantee load/load order in enqueue and dequeue > > > > > > > > We watched a rte panic of mbuf_autotest in our qualcomm arm64 server > > > > (Amberwing). > > > > > > > > Root cause: > > > > In __rte_ring_move_cons_head() > > > > ... > > > > do { > > > > /* Restore n as it may change every loop */ > > > > n = max; > > > > > > > > *old_head = r->cons.head; //1st load > > > > const uint32_t prod_tail = r->prod.tail; //2nd > > > > load > > > > > > > > In weak memory order architectures (powerpc,arm), the 2nd load might > > be > > > > reodered before the 1st load, that makes *entries is bigger than we > > wanted. > > > > This nasty reording messed enque/deque up. > > > > .... > > > > ? 
> > > > > > > > > > > > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com> > > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > > --- > > > > > .mailmap | 1 + > > > > > lib/ring/rte_ring_c11_pvt.h | 35 > > > > > ++++++++++++++++++++--------------- > > > > > 2 files changed, 21 insertions(+), 15 deletions(-) > > > > > > > > > > diff --git a/.mailmap b/.mailmap > > > > > index 4018f0fc47..367115d134 100644 > > > > > --- a/.mailmap > > > > > +++ b/.mailmap > > > > > @@ -1430,6 +1430,7 @@ Walter Heymans > > > > <walter.heymans@corigine.com> > > > > > Wang Sheng-Hui <shhuiw@gmail.com> Wangyu (Eric) > > > > > <seven.wangyu@huawei.com> Waterman Cao > > <waterman.cao@intel.com> > > > > > +Wathsala Vithanage <wathsala.vithanage@arm.com> > > > > > Weichun Chen <weichunx.chen@intel.com> Wei Dai > > > > > <wei.dai@intel.com> Weifeng Li <liweifeng96@126.com> diff --git > > > > > a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h index > > > > > f895950df4..63fe58ce9e 100644 > > > > > --- a/lib/ring/rte_ring_c11_pvt.h > > > > > +++ b/lib/ring/rte_ring_c11_pvt.h > > > > > @@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail > > > > > *ht, > > > > uint32_t old_val, > > > > > uint32_t new_val, uint32_t single, uint32_t enqueue) { > > > > > RTE_SET_USED(enqueue); > > > > > + /* > > > > > + * Updating of ht->tail cannot happen before elements are added to or > > > > > + * removed from the ring, as it could result in data races between > > > > > + * producer and consumer threads. Therefore we need a release > > > > > + * barrier here. > > > > > + */ > > > > > + rte_atomic_thread_fence(__ATOMIC_RELEASE); > > > > > > > > > > /* > > > > > * If there are other enqueues/dequeues in progress that > > > > > preceded us, @@ -24,7 +31,7 @@ __rte_ring_update_tail(struct > > > > > rte_ring_headtail > > > > *ht, uint32_t old_val, > > > > > if (!single) > > > > > rte_wait_until_equal_32(&ht->tail, old_val, > > > > __ATOMIC_RELAXED); > > > > > > > > > > - __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE); > > > > > + __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED); > > > > > } > > > > > > > > > > /** > > > > > @@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, > > > > unsigned int is_sp, > > > > > /* Reset n to the initial burst count */ > > > > > n = max; > > > > > > > > > > - /* Ensure the head is read before tail */ > > > > > - __atomic_thread_fence(__ATOMIC_ACQUIRE); > > > > > - > > > > > - /* load-acquire synchronize with store-release of ht->tail > > > > > - * in update_tail. > > > > > - */ > > > > > cons_tail = __atomic_load_n(&r->cons.tail, > > > > > - __ATOMIC_ACQUIRE); > > > > > + __ATOMIC_RELAXED); > > > > > > > > > > /* The subtraction is done between two unsigned 32bits value > > > > > * (the result is always modulo 32 bits even if we have @@ - > > > > 100,6 > > > > > +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned > > > > > +int > > > > is_sp, > > > > > 0, __ATOMIC_RELAXED, > > > > > __ATOMIC_RELAXED); > > > > > } while (unlikely(success == 0)); > > > > > + /* > > > > > + * Ensure that updates to the ring doesn't rise above > > > > > + * load of the new_head in SP and MP cases. 
> > > > > + */ > > > > > + rte_atomic_thread_fence(__ATOMIC_ACQUIRE); > > > > > return n; > > > > > } > > > > > > > > > > @@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, > > > > > int > > > > is_sc, > > > > > /* Restore n as it may change every loop */ > > > > > n = max; > > > > > > > > > > - /* Ensure the head is read before tail */ > > > > > - __atomic_thread_fence(__ATOMIC_ACQUIRE); > > > > > - > > > > > - /* this load-acquire synchronize with store-release of ht->tail > > > > > - * in update_tail. > > > > > - */ > > > > > prod_tail = __atomic_load_n(&r->prod.tail, > > > > > - __ATOMIC_ACQUIRE); > > > > > + __ATOMIC_RELAXED); > > > > > > > > > > /* The subtraction is done between two unsigned 32bits value > > > > > * (the result is always modulo 32 bits even if we have @@ - > > > > 175,6 > > > > > +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int > > > > > +is_sc, > > > > > 0, > > > > __ATOMIC_RELAXED, > > > > > > > __ATOMIC_RELAXED); > > > > > } while (unlikely(success == 0)); > > > > > + /* > > > > > + * Ensure that updates to the ring doesn't rise above > > > > > + * load of the new_head in SP and MP cases. > > > > > + */ > > > > > + rte_atomic_thread_fence(__ATOMIC_ACQUIRE); > > > > > return n; > > > > > } > > > > > > > > > > -- > > > > > 2.25.1 > > > > > ^ permalink raw reply [flat|nested] 9+ messages in thread
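For illustration, one concrete (entirely hypothetical) instance of the
overestimate described above, with numbers chosen only so that both computed
values pass a "<= capacity" plausibility check:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t capacity      = 32;
        uint32_t cons_tail_old = 5;   /* tail read before ~2^32 ring operations */
        uint32_t cons_tail_cur = 0;   /* tail after wrapping around 2^32        */
        uint32_t prod_head     = 30;  /* head read after the wrap               */

        uint32_t invalid = capacity + cons_tail_old - prod_head;  /* = 7 */
        uint32_t valid   = capacity + cons_tail_cur - prod_head;  /* = 2 */

        /* Both values are <= capacity, yet the stale tail overestimates the
         * free space by 5 slots -- the sort-of ABA effect described above. */
        printf("invalid=%u valid=%u\n", invalid, valid);
        return 0;
    }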
end of thread, other threads: [~2023-08-21 13:27 UTC | newest]

Thread overview: 9+ messages
2023-06-15 20:13 [RFC] ring: Further performance improvements with C11 Wathsala Vithanage
2023-06-15 20:13 ` [RFC] ring: further " Wathsala Vithanage
2023-07-31 12:31 ` Thomas Monjalon
2023-08-03  2:56 ` Honnappa Nagarahalli
2023-08-02  9:42 ` Konstantin Ananyev
2023-08-04 22:50 ` Wathsala Wathawana Vithanage
2023-08-09 18:18 ` Konstantin Ananyev
2023-08-15  5:14 ` Honnappa Nagarahalli
2023-08-21 13:27 ` Konstantin Ananyev