* [dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
@ 2018-07-12 2:44 Takeshi Yoshimura
2018-07-12 17:08 ` Jerin Jacob
0 siblings, 1 reply; 8+ messages in thread
From: Takeshi Yoshimura @ 2018-07-12 2:44 UTC (permalink / raw)
To: dev; +Cc: Takeshi Yoshimura, stable, Takeshi Yoshimura
SPDK blobfs encountered a crash around rte_ring dequeues in ppc64.
It uses a single consumer and multiple producers for a rte_ring.
The problem was a load-load reorder in rte_ring_sc_dequeue_bulk().
The reordered loads happened on r->prod.tail in
__rte_ring_move_cons_head() (rte_ring_generic.h) and ring[idx] in
DEQUEUE_PTRS() (rte_ring.h). They have a load-load control
dependency, but the code does not satisfy it. Note that they are
not reordered if __rte_ring_move_cons_head() with is_sc != 1 because
cmpset invokes a read barrier.
The paired stores on these loads are in ENQUEUE_PTRS() and
update_tail(). Simplified code around the reorder is the following.
Consumer Producer
load idx[ring]
store idx[ring]
store r->prod.tail
load r->prod.tail
In this case, the consumer loads old idx[ring] and confirms the load
is valid with the new r->prod.tail.
I added a read barrier in the case where __IS_SC is passed to
__rte_ring_move_cons_head(). I also fixed __rte_ring_move_prod_head()
to avoid similar problems with a single producer.
Cc: stable@dpdk.org
Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com>
---
lib/librte_ring/rte_ring_generic.h | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index ea7dbe5b9..477326180 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
return 0;
*new_head = *old_head + n;
- if (is_sp)
+ if (is_sp) {
+ rte_smp_rmb();
r->prod.head = *new_head, success = 1;
- else
+ } else
success = rte_atomic32_cmpset(&r->prod.head,
*old_head, *new_head);
} while (unlikely(success == 0));
@@ -158,9 +159,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
return 0;
*new_head = *old_head + n;
- if (is_sc)
+ if (is_sc) {
+ rte_smp_rmb();
r->cons.head = *new_head, success = 1;
- else
+ } else
success = rte_atomic32_cmpset(&r->cons.head, *old_head,
*new_head);
} while (unlikely(success == 0));
--
2.17.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
2018-07-12 2:44 [dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64 Takeshi Yoshimura
@ 2018-07-12 17:08 ` Jerin Jacob
2018-07-17 2:54 ` Takeshi Yoshimura
0 siblings, 1 reply; 8+ messages in thread
From: Jerin Jacob @ 2018-07-12 17:08 UTC (permalink / raw)
To: Takeshi Yoshimura; +Cc: dev, stable, Takeshi Yoshimura
-----Original Message-----
> Date: Thu, 12 Jul 2018 11:44:14 +0900
> From: Takeshi Yoshimura <t.yoshimura8869@gmail.com>
> To: dev@dpdk.org
> Cc: Takeshi Yoshimura <t.yoshimura8869@gmail.com>, stable@dpdk.org, Takeshi
> Yoshimura <tyos@jp.ibm.com>
> Subject: [dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
> X-Mailer: git-send-email 2.15.1
>
> External Email
>
> SPDK blobfs encountered a crash around rte_ring dequeues in ppc64.
> It uses a single consumer and multiple producers for a rte_ring.
> The problem was a load-load reorder in rte_ring_sc_dequeue_bulk().
Adding rte_smp_rmb() cause performance regression on non x86 platforms.
Having said that, load-load barrier can be expressed very well with C11 memory
model. I guess ppc64 supports C11 memory model. If so,
Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
original issue?
>
> The reordered loads happened on r->prod.tail in
> __rte_ring_move_cons_head() (rte_ring_generic.h) and ring[idx] in
> DEQUEUE_PTRS() (rte_ring.h). They have a load-load control
> dependency, but the code does not satisfy it. Note that they are
> not reordered if __rte_ring_move_cons_head() with is_sc != 1 because
> cmpset invokes a read barrier.
>
> The paired stores on these loads are in ENQUEUE_PTRS() and
> update_tail(). Simplified code around the reorder is the following.
>
> Consumer Producer
> load idx[ring]
> store idx[ring]
> store r->prod.tail
> load r->prod.tail
>
> In this case, the consumer loads old idx[ring] and confirms the load
> is valid with the new r->prod.tail.
>
> I added a read barrier in the case where __IS_SC is passed to
> __rte_ring_move_cons_head(). I also fixed __rte_ring_move_prod_head()
> to avoid similar problems with a single producer.
>
> Cc: stable@dpdk.org
>
> Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com>
> ---
> lib/librte_ring/rte_ring_generic.h | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
> index ea7dbe5b9..477326180 100644
> --- a/lib/librte_ring/rte_ring_generic.h
> +++ b/lib/librte_ring/rte_ring_generic.h
> @@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> return 0;
>
> *new_head = *old_head + n;
> - if (is_sp)
> + if (is_sp) {
> + rte_smp_rmb();
> r->prod.head = *new_head, success = 1;
> - else
> + } else
> success = rte_atomic32_cmpset(&r->prod.head,
> *old_head, *new_head);
> } while (unlikely(success == 0));
> @@ -158,9 +159,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
> return 0;
>
> *new_head = *old_head + n;
> - if (is_sc)
> + if (is_sc) {
> + rte_smp_rmb();
> r->cons.head = *new_head, success = 1;
> - else
> + } else
> success = rte_atomic32_cmpset(&r->cons.head, *old_head,
> *new_head);
> } while (unlikely(success == 0));
> --
> 2.17.1
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
2018-07-12 17:08 ` Jerin Jacob
@ 2018-07-17 2:54 ` Takeshi Yoshimura
2018-07-17 3:34 ` Jerin Jacob
0 siblings, 1 reply; 8+ messages in thread
From: Takeshi Yoshimura @ 2018-07-17 2:54 UTC (permalink / raw)
To: Jerin Jacob; +Cc: dev, stable
> Adding rte_smp_rmb() cause performance regression on non x86 platforms.
> Having said that, load-load barrier can be expressed very well with C11 memory
> model. I guess ppc64 supports C11 memory model. If so,
> Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
> original issue?
Yes, the performance regression happens on non-x86 with single
producer/consumer.
The average latency of an enqueue was increased from 21 nsec to 24 nsec in my
simple experiment. But, I think it is worth it.
I also tested C11 rte_ring, however, it caused the same race condition in ppc64.
I tried to fix the C11 problem as well, but I also found the C11
rte_ring had other potential
incorrect choices of memory orders, which caused another race
condition in ppc64.
For example,
__ATOMIC_ACQUIRE is passed to __atomic_compare_exchange_n(), but
I am not sure why the load-acquire is used for the compare exchange.
Also in update_tail, the pause can be called before the data copy because
of ht->tail load without atomic_load_n.
The memory order is simply difficult, so it might take a bit longer
time to check
if the code is correct. I think I can fix the C11 rte_ring as another patch.
2018-07-13 2:08 GMT+09:00 Jerin Jacob <jerin.jacob@caviumnetworks.com>:
> -----Original Message-----
>> Date: Thu, 12 Jul 2018 11:44:14 +0900
>> From: Takeshi Yoshimura <t.yoshimura8869@gmail.com>
>> To: dev@dpdk.org
>> Cc: Takeshi Yoshimura <t.yoshimura8869@gmail.com>, stable@dpdk.org, Takeshi
>> Yoshimura <tyos@jp.ibm.com>
>> Subject: [dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
>> X-Mailer: git-send-email 2.15.1
>>
>> External Email
>>
>> SPDK blobfs encountered a crash around rte_ring dequeues in ppc64.
>> It uses a single consumer and multiple producers for a rte_ring.
>> The problem was a load-load reorder in rte_ring_sc_dequeue_bulk().
>
> Adding rte_smp_rmb() cause performance regression on non x86 platforms.
> Having said that, load-load barrier can be expressed very well with C11 memory
> model. I guess ppc64 supports C11 memory model. If so,
> Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
> original issue?
>
>>
>> The reordered loads happened on r->prod.tail in
>> __rte_ring_move_cons_head() (rte_ring_generic.h) and ring[idx] in
>> DEQUEUE_PTRS() (rte_ring.h). They have a load-load control
>> dependency, but the code does not satisfy it. Note that they are
>> not reordered if __rte_ring_move_cons_head() with is_sc != 1 because
>> cmpset invokes a read barrier.
>>
>> The paired stores on these loads are in ENQUEUE_PTRS() and
>> update_tail(). Simplified code around the reorder is the following.
>>
>> Consumer Producer
>> load idx[ring]
>> store idx[ring]
>> store r->prod.tail
>> load r->prod.tail
>>
>> In this case, the consumer loads old idx[ring] and confirms the load
>> is valid with the new r->prod.tail.
>>
>> I added a read barrier in the case where __IS_SC is passed to
>> __rte_ring_move_cons_head(). I also fixed __rte_ring_move_prod_head()
>> to avoid similar problems with a single producer.
>>
>> Cc: stable@dpdk.org
>>
>> Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com>
>> ---
>> lib/librte_ring/rte_ring_generic.h | 10 ++++++----
>> 1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
>> index ea7dbe5b9..477326180 100644
>> --- a/lib/librte_ring/rte_ring_generic.h
>> +++ b/lib/librte_ring/rte_ring_generic.h
>> @@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>> return 0;
>>
>> *new_head = *old_head + n;
>> - if (is_sp)
>> + if (is_sp) {
>> + rte_smp_rmb();
>> r->prod.head = *new_head, success = 1;
>> - else
>> + } else
>> success = rte_atomic32_cmpset(&r->prod.head,
>> *old_head, *new_head);
>> } while (unlikely(success == 0));
>> @@ -158,9 +159,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
>> return 0;
>>
>> *new_head = *old_head + n;
>> - if (is_sc)
>> + if (is_sc) {
>> + rte_smp_rmb();
>> r->cons.head = *new_head, success = 1;
>> - else
>> + } else
>> success = rte_atomic32_cmpset(&r->cons.head, *old_head,
>> *new_head);
>> } while (unlikely(success == 0));
>> --
>> 2.17.1
>>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
2018-07-17 2:54 ` Takeshi Yoshimura
@ 2018-07-17 3:34 ` Jerin Jacob
2021-03-24 21:45 ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
0 siblings, 1 reply; 8+ messages in thread
From: Jerin Jacob @ 2018-07-17 3:34 UTC (permalink / raw)
To: Takeshi Yoshimura; +Cc: dev, stable, olivier.matz, chaozhu, konstantin.ananyev
-----Original Message-----
> Date: Tue, 17 Jul 2018 11:54:18 +0900
> From: Takeshi Yoshimura <t.yoshimura8869@gmail.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: dev@dpdk.org, stable@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
>
Cc: olivier.matz@6wind.com
Cc: chaozhu@linux.vnet.ibm.com
Cc: konstantin.ananyev@intel.com
>
> > Adding rte_smp_rmb() cause performance regression on non x86 platforms.
> > Having said that, load-load barrier can be expressed very well with C11 memory
> > model. I guess ppc64 supports C11 memory model. If so,
> > Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
> > original issue?
>
> Yes, the performance regression happens on non-x86 with single
> producer/consumer.
> The average latency of an enqueue was increased from 21 nsec to 24 nsec in my
> simple experiment. But, I think it is worth it.
That varies to machine to machine. What is the burst size etc.
>
>
> I also tested C11 rte_ring, however, it caused the same race condition in ppc64.
> I tried to fix the C11 problem as well, but I also found the C11
> rte_ring had other potential
> incorrect choices of memory orders, which caused another race
> condition in ppc64.
Does it happens on all ppc64 machines? Or on a specific machine?
Is following tests are passing on your system without the patch?
test/test/test_ring_perf.c
test/test/test_ring.c
>
> For example,
> __ATOMIC_ACQUIRE is passed to __atomic_compare_exchange_n(), but
> I am not sure why the load-acquire is used for the compare exchange.
It correct as per C11 acquire and release semantics.
> Also in update_tail, the pause can be called before the data copy because
> of ht->tail load without atomic_load_n.
>
> The memory order is simply difficult, so it might take a bit longer
> time to check
> if the code is correct. I think I can fix the C11 rte_ring as another patch.
>
> >>
> >> SPDK blobfs encountered a crash around rte_ring dequeues in ppc64.
> >> It uses a single consumer and multiple producers for a rte_ring.
> >> The problem was a load-load reorder in rte_ring_sc_dequeue_bulk().
> >
> > Adding rte_smp_rmb() cause performance regression on non x86 platforms.
> > Having said that, load-load barrier can be expressed very well with C11 memory
> > model. I guess ppc64 supports C11 memory model. If so,
> > Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
> > original issue?
> >
> >>
> >> The reordered loads happened on r->prod.tail in
There is rte_smp_rmb() just before reading r->prod.tail in
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_rte_ring_move_cons_head(). Would that not suffice the requirement?
Can you check adding compiler barrier and see is compiler is reordering
the stuff?
DPDK's ring implementation is based freebsd's ring implementation, I
don't see need for such barrier
https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h
If it is something specific to ppc64 or a specific ppc64 machine, we
could add a compile option as it is arch specific to avoid performance
impact on other architectures.
> >> __rte_ring_move_cons_head() (rte_ring_generic.h) and ring[idx] in
> >> DEQUEUE_PTRS() (rte_ring.h). They have a load-load control
> >> dependency, but the code does not satisfy it. Note that they are
> >> not reordered if __rte_ring_move_cons_head() with is_sc != 1 because
> >> cmpset invokes a read barrier.
> >>
> >> The paired stores on these loads are in ENQUEUE_PTRS() and
> >> update_tail(). Simplified code around the reorder is the following.
> >>
> >> Consumer Producer
> >> load idx[ring]
> >> store idx[ring]
> >> store r->prod.tail
> >> load r->prod.tail
> >>
> >> In this case, the consumer loads old idx[ring] and confirms the load
> >> is valid with the new r->prod.tail.
> >>
> >> I added a read barrier in the case where __IS_SC is passed to
> >> __rte_ring_move_cons_head(). I also fixed __rte_ring_move_prod_head()
> >> to avoid similar problems with a single producer.
> >>
> >> Cc: stable@dpdk.org
> >>
> >> Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com>
> >> ---
> >> lib/librte_ring/rte_ring_generic.h | 10 ++++++----
> >> 1 file changed, 6 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
> >> index ea7dbe5b9..477326180 100644
> >> --- a/lib/librte_ring/rte_ring_generic.h
> >> +++ b/lib/librte_ring/rte_ring_generic.h
> >> @@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> >> return 0;
> >>
> >> *new_head = *old_head + n;
> >> - if (is_sp)
> >> + if (is_sp) {
> >> + rte_smp_rmb();
> >> r->prod.head = *new_head, success = 1;
> >> - else
> >> + } else
> >> success = rte_atomic32_cmpset(&r->prod.head,
> >> *old_head, *new_head);
> >> } while (unlikely(success == 0));
> >> @@ -158,9 +159,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
> >> return 0;
> >>
> >> *new_head = *old_head + n;
> >> - if (is_sc)
> >> + if (is_sc) {
> >> + rte_smp_rmb();
> >> r->cons.head = *new_head, success = 1;
> >> - else
> >> + } else
> >> success = rte_atomic32_cmpset(&r->cons.head, *old_head,
> >> *new_head);
> >> } while (unlikely(success == 0));
> >> --
> >> 2.17.1
> >>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [dpdk-dev] [dpdk-stable] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
2018-07-17 3:34 ` Jerin Jacob
@ 2021-03-24 21:45 ` Thomas Monjalon
2021-03-28 1:00 ` Honnappa Nagarahalli
0 siblings, 1 reply; 8+ messages in thread
From: Thomas Monjalon @ 2021-03-24 21:45 UTC (permalink / raw)
To: Takeshi Yoshimura
Cc: stable, dev, olivier.matz, chaozhu, konstantin.ananyev,
Jerin Jacob, honnappa.nagarahalli
No reply after more than 2 years.
Unfortunately it is probably outdated now.
Classified as "Changes Requested".
17/07/2018 05:34, Jerin Jacob:
> From: Takeshi Yoshimura <t.yoshimura8869@gmail.com>
>
> Cc: olivier.matz@6wind.com
> Cc: chaozhu@linux.vnet.ibm.com
> Cc: konstantin.ananyev@intel.com
>
> >
> > > Adding rte_smp_rmb() cause performance regression on non x86 platforms.
> > > Having said that, load-load barrier can be expressed very well with C11 memory
> > > model. I guess ppc64 supports C11 memory model. If so,
> > > Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
> > > original issue?
> >
> > Yes, the performance regression happens on non-x86 with single
> > producer/consumer.
> > The average latency of an enqueue was increased from 21 nsec to 24 nsec in my
> > simple experiment. But, I think it is worth it.
>
> That varies to machine to machine. What is the burst size etc.
>
> >
> >
> > I also tested C11 rte_ring, however, it caused the same race condition in ppc64.
> > I tried to fix the C11 problem as well, but I also found the C11
> > rte_ring had other potential
> > incorrect choices of memory orders, which caused another race
> > condition in ppc64.
>
> Does it happens on all ppc64 machines? Or on a specific machine?
> Is following tests are passing on your system without the patch?
> test/test/test_ring_perf.c
> test/test/test_ring.c
>
> >
> > For example,
> > __ATOMIC_ACQUIRE is passed to __atomic_compare_exchange_n(), but
> > I am not sure why the load-acquire is used for the compare exchange.
>
> It correct as per C11 acquire and release semantics.
>
> > Also in update_tail, the pause can be called before the data copy because
> > of ht->tail load without atomic_load_n.
> >
> > The memory order is simply difficult, so it might take a bit longer
> > time to check
> > if the code is correct. I think I can fix the C11 rte_ring as another patch.
> >
> > >>
> > >> SPDK blobfs encountered a crash around rte_ring dequeues in ppc64.
> > >> It uses a single consumer and multiple producers for a rte_ring.
> > >> The problem was a load-load reorder in rte_ring_sc_dequeue_bulk().
> > >
> > > Adding rte_smp_rmb() cause performance regression on non x86 platforms.
> > > Having said that, load-load barrier can be expressed very well with C11 memory
> > > model. I guess ppc64 supports C11 memory model. If so,
> > > Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
> > > original issue?
> > >
> > >>
> > >> The reordered loads happened on r->prod.tail in
>
> There is rte_smp_rmb() just before reading r->prod.tail in
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> _rte_ring_move_cons_head(). Would that not suffice the requirement?
>
> Can you check adding compiler barrier and see is compiler is reordering
> the stuff?
>
> DPDK's ring implementation is based freebsd's ring implementation, I
> don't see need for such barrier
>
> https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h
>
> If it is something specific to ppc64 or a specific ppc64 machine, we
> could add a compile option as it is arch specific to avoid performance
> impact on other architectures.
>
> > >> __rte_ring_move_cons_head() (rte_ring_generic.h) and ring[idx] in
> > >> DEQUEUE_PTRS() (rte_ring.h). They have a load-load control
> > >> dependency, but the code does not satisfy it. Note that they are
> > >> not reordered if __rte_ring_move_cons_head() with is_sc != 1 because
> > >> cmpset invokes a read barrier.
> > >>
> > >> The paired stores on these loads are in ENQUEUE_PTRS() and
> > >> update_tail(). Simplified code around the reorder is the following.
> > >>
> > >> Consumer Producer
> > >> load idx[ring]
> > >> store idx[ring]
> > >> store r->prod.tail
> > >> load r->prod.tail
> > >>
> > >> In this case, the consumer loads old idx[ring] and confirms the load
> > >> is valid with the new r->prod.tail.
> > >>
> > >> I added a read barrier in the case where __IS_SC is passed to
> > >> __rte_ring_move_cons_head(). I also fixed __rte_ring_move_prod_head()
> > >> to avoid similar problems with a single producer.
> > >>
> > >> Cc: stable@dpdk.org
> > >>
> > >> Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com>
> > >> ---
> > >> lib/librte_ring/rte_ring_generic.h | 10 ++++++----
> > >> 1 file changed, 6 insertions(+), 4 deletions(-)
> > >>
> > >> diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
> > >> index ea7dbe5b9..477326180 100644
> > >> --- a/lib/librte_ring/rte_ring_generic.h
> > >> +++ b/lib/librte_ring/rte_ring_generic.h
> > >> @@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> > >> return 0;
> > >>
> > >> *new_head = *old_head + n;
> > >> - if (is_sp)
> > >> + if (is_sp) {
> > >> + rte_smp_rmb();
> > >> r->prod.head = *new_head, success = 1;
> > >> - else
> > >> + } else
> > >> success = rte_atomic32_cmpset(&r->prod.head,
> > >> *old_head, *new_head);
> > >> } while (unlikely(success == 0));
> > >> @@ -158,9 +159,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
> > >> return 0;
> > >>
> > >> *new_head = *old_head + n;
> > >> - if (is_sc)
> > >> + if (is_sc) {
> > >> + rte_smp_rmb();
> > >> r->cons.head = *new_head, success = 1;
> > >> - else
> > >> + } else
> > >> success = rte_atomic32_cmpset(&r->cons.head, *old_head,
> > >> *new_head);
> > >> } while (unlikely(success == 0));
> > >> --
> > >> 2.17.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [dpdk-dev] [dpdk-stable] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
2021-03-24 21:45 ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
@ 2021-03-28 1:00 ` Honnappa Nagarahalli
2021-06-16 7:14 ` [dpdk-dev] 回复: " Feifei Wang
0 siblings, 1 reply; 8+ messages in thread
From: Honnappa Nagarahalli @ 2021-03-28 1:00 UTC (permalink / raw)
To: thomas, Takeshi Yoshimura
Cc: stable, dev, olivier.matz, chaozhu, konstantin.ananyev,
Jerin Jacob, nd, nd
<snip>
> Subject: Re: [dpdk-stable] [dpdk-dev] [PATCH] rte_ring: fix racy
> dequeue/enqueue in ppc64
>
> No reply after more than 2 years.
> Unfortunately it is probably outdated now.
> Classified as "Changes Requested".
Looking at the code, I think this patch in fact fixes a bug. Appreciate rebasing this patch.
The problem is already fixed in '__rte_ring_move_cons_head' but needs to be fixed in '__rte_ring_move_prod_head'.
This problem is fixed for C11 version due to acquire load of cons.tail and prod.tail.
>
>
> 17/07/2018 05:34, Jerin Jacob:
> > From: Takeshi Yoshimura <t.yoshimura8869@gmail.com>
> >
> > Cc: olivier.matz@6wind.com
> > Cc: chaozhu@linux.vnet.ibm.com
> > Cc: konstantin.ananyev@intel.com
> >
> > >
> > > > Adding rte_smp_rmb() cause performance regression on non x86
> platforms.
> > > > Having said that, load-load barrier can be expressed very well
> > > > with C11 memory model. I guess ppc64 supports C11 memory model. If
> > > > so, Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64
> > > > and check original issue?
> > >
> > > Yes, the performance regression happens on non-x86 with single
> > > producer/consumer.
> > > The average latency of an enqueue was increased from 21 nsec to 24
> > > nsec in my simple experiment. But, I think it is worth it.
> >
> > That varies to machine to machine. What is the burst size etc.
> >
> > >
> > >
> > > I also tested C11 rte_ring, however, it caused the same race condition in
> ppc64.
> > > I tried to fix the C11 problem as well, but I also found the C11
> > > rte_ring had other potential incorrect choices of memory orders,
> > > which caused another race condition in ppc64.
> >
> > Does it happens on all ppc64 machines? Or on a specific machine?
> > Is following tests are passing on your system without the patch?
> > test/test/test_ring_perf.c
> > test/test/test_ring.c
> >
> > >
> > > For example,
> > > __ATOMIC_ACQUIRE is passed to __atomic_compare_exchange_n(), but I
> > > am not sure why the load-acquire is used for the compare exchange.
> >
> > It correct as per C11 acquire and release semantics.
> >
> > > Also in update_tail, the pause can be called before the data copy
> > > because of ht->tail load without atomic_load_n.
> > >
> > > The memory order is simply difficult, so it might take a bit longer
> > > time to check if the code is correct. I think I can fix the C11
> > > rte_ring as another patch.
> > >
> > > >>
> > > >> SPDK blobfs encountered a crash around rte_ring dequeues in ppc64.
> > > >> It uses a single consumer and multiple producers for a rte_ring.
> > > >> The problem was a load-load reorder in rte_ring_sc_dequeue_bulk().
> > > >
> > > > Adding rte_smp_rmb() cause performance regression on non x86
> platforms.
> > > > Having said that, load-load barrier can be expressed very well
> > > > with C11 memory model. I guess ppc64 supports C11 memory model. If
> > > > so, Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64
> > > > and check original issue?
> > > >
> > > >>
> > > >> The reordered loads happened on r->prod.tail in
> >
> > There is rte_smp_rmb() just before reading r->prod.tail in
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > _rte_ring_move_cons_head(). Would that not suffice the requirement?
> >
> > Can you check adding compiler barrier and see is compiler is
> > reordering the stuff?
> >
> > DPDK's ring implementation is based freebsd's ring implementation, I
> > don't see need for such barrier
> >
> > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h
> >
> > If it is something specific to ppc64 or a specific ppc64 machine, we
> > could add a compile option as it is arch specific to avoid performance
> > impact on other architectures.
> >
> > > >> __rte_ring_move_cons_head() (rte_ring_generic.h) and ring[idx] in
> > > >> DEQUEUE_PTRS() (rte_ring.h). They have a load-load control
> > > >> dependency, but the code does not satisfy it. Note that they are
> > > >> not reordered if __rte_ring_move_cons_head() with is_sc != 1
> > > >> because cmpset invokes a read barrier.
> > > >>
> > > >> The paired stores on these loads are in ENQUEUE_PTRS() and
> > > >> update_tail(). Simplified code around the reorder is the following.
> > > >>
> > > >> Consumer Producer
> > > >> load idx[ring]
> > > >> store idx[ring]
> > > >> store r->prod.tail load r->prod.tail
> > > >>
> > > >> In this case, the consumer loads old idx[ring] and confirms the
> > > >> load is valid with the new r->prod.tail.
> > > >>
> > > >> I added a read barrier in the case where __IS_SC is passed to
> > > >> __rte_ring_move_cons_head(). I also fixed
> > > >> __rte_ring_move_prod_head() to avoid similar problems with a single
> producer.
> > > >>
> > > >> Cc: stable@dpdk.org
> > > >>
> > > >> Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com>
> > > >> ---
> > > >> lib/librte_ring/rte_ring_generic.h | 10 ++++++----
> > > >> 1 file changed, 6 insertions(+), 4 deletions(-)
> > > >>
> > > >> diff --git a/lib/librte_ring/rte_ring_generic.h
> > > >> b/lib/librte_ring/rte_ring_generic.h
> > > >> index ea7dbe5b9..477326180 100644
> > > >> --- a/lib/librte_ring/rte_ring_generic.h
> > > >> +++ b/lib/librte_ring/rte_ring_generic.h
> > > >> @@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring *r,
> unsigned int is_sp,
> > > >> return 0;
> > > >>
> > > >> *new_head = *old_head + n;
> > > >> - if (is_sp)
> > > >> + if (is_sp) {
> > > >> + rte_smp_rmb();
> > > >> r->prod.head = *new_head, success = 1;
> > > >> - else
> > > >> + } else
> > > >> success = rte_atomic32_cmpset(&r->prod.head,
> > > >> *old_head, *new_head);
> > > >> } while (unlikely(success == 0)); @@ -158,9 +159,10 @@
> > > >> __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
> > > >> return 0;
> > > >>
> > > >> *new_head = *old_head + n;
> > > >> - if (is_sc)
> > > >> + if (is_sc) {
> > > >> + rte_smp_rmb();
> > > >> r->cons.head = *new_head, success = 1;
> > > >> - else
> > > >> + } else
> > > >> success = rte_atomic32_cmpset(&r->cons.head, *old_head,
> > > >> *new_head);
> > > >> } while (unlikely(success == 0));
> > > >> --
> > > >> 2.17.1
>
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* [dpdk-dev] 回复: [dpdk-stable] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
2021-03-28 1:00 ` Honnappa Nagarahalli
@ 2021-06-16 7:14 ` Feifei Wang
2021-06-16 16:37 ` [dpdk-dev] " Honnappa Nagarahalli
0 siblings, 1 reply; 8+ messages in thread
From: Feifei Wang @ 2021-06-16 7:14 UTC (permalink / raw)
To: Honnappa Nagarahalli, thomas, Takeshi Yoshimura
Cc: stable, dev, olivier.matz, chaozhu, konstantin.ananyev, jerinj, nd, nd
Hi, everyone
This patch can be closed with the following reasons.
> -----邮件原件-----
> 发件人: dev <dev-bounces@dpdk.org> 代表 Honnappa Nagarahalli
> 发送时间: 2021年3月28日 9:00
> 收件人: thomas@monjalon.net; Takeshi Yoshimura
> <t.yoshimura8869@gmail.com>
> 抄送: stable@dpdk.org; dev@dpdk.org; olivier.matz@6wind.com;
> chaozhu@linux.vnet.ibm.com; konstantin.ananyev@intel.com; Jerin Jacob
> <jerin.jacob@caviumnetworks.com>; nd <nd@arm.com>; nd <nd@arm.com>
> 主题: Re: [dpdk-dev] [dpdk-stable] [PATCH] rte_ring: fix racy
> dequeue/enqueue in ppc64
>
> <snip>
>
> > Subject: Re: [dpdk-stable] [dpdk-dev] [PATCH] rte_ring: fix racy
> > dequeue/enqueue in ppc64
> >
> > No reply after more than 2 years.
> > Unfortunately it is probably outdated now.
> > Classified as "Changes Requested".
> Looking at the code, I think this patch in fact fixes a bug. Appreciate rebasing
> this patch.
>
> The problem is already fixed in '__rte_ring_move_cons_head' but needs to
> be fixed in '__rte_ring_move_prod_head'.
> This problem is fixed for C11 version due to acquire load of cons.tail and
> prod.tail.
First, for consumer in dequeue:
the reason for that adding a rmb in move_cons_head of “generic” is based on this patch:
http://patches.dpdk.org/project/dpdk/patch/1552409933-45684-2-git-send-email-gavin.hu@arm.com/
Slot Consumer Producer
1 dequeue elements
2 update prod_tail
3 load new prod_tail
4 check room is enough(n < entries)
Dequeue elements maybe before load updated prod_tail, so consumer can load incorrect elements value.
For dequeue multiple consumers case, ‘rte_atomic32_cmpset’ with acquire and release order can prevent
dequeue before load prod_tail, no extra rmb is needed.
Second, for single producer in enqueue:
Slot Producer Consumer
1 enqueue elements(not commited)
2 update consumer_tail
3 load new consumer_tail
4 check room is enough(n < entries)
5 enqueued elements is committed
Though enqueue elements maybe reorder before load consumer_tail, these elements will not be committed until
‘check’ has finished. So from load to write control dependency is reliable and rmb is not needed here.
[1] https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf (page:15)
As a result, it is unnecessary to add a rmb for enqueue single producer due to control dependency. And this patch can
be closed.
Best Regards
Feifei
>
> >
> >
> > 17/07/2018 05:34, Jerin Jacob:
> > > From: Takeshi Yoshimura <t.yoshimura8869@gmail.com>
> > >
> > > Cc: olivier.matz@6wind.com
> > > Cc: chaozhu@linux.vnet.ibm.com
> > > Cc: konstantin.ananyev@intel.com
> > >
> > > >
> > > > > Adding rte_smp_rmb() cause performance regression on non x86
> > platforms.
> > > > > Having said that, load-load barrier can be expressed very well
> > > > > with C11 memory model. I guess ppc64 supports C11 memory model.
> > > > > If so, Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
> for
> > > > > ppc64 and check original issue?
> > > >
> > > > Yes, the performance regression happens on non-x86 with single
> > > > producer/consumer.
> > > > The average latency of an enqueue was increased from 21 nsec to 24
> > > > nsec in my simple experiment. But, I think it is worth it.
> > >
> > > That varies to machine to machine. What is the burst size etc.
> > >
> > > >
> > > >
> > > > I also tested C11 rte_ring, however, it caused the same race
> > > > condition in
> > ppc64.
> > > > I tried to fix the C11 problem as well, but I also found the C11
> > > > rte_ring had other potential incorrect choices of memory orders,
> > > > which caused another race condition in ppc64.
> > >
> > > Does it happens on all ppc64 machines? Or on a specific machine?
> > > Is following tests are passing on your system without the patch?
> > > test/test/test_ring_perf.c
> > > test/test/test_ring.c
> > >
> > > >
> > > > For example,
> > > > __ATOMIC_ACQUIRE is passed to __atomic_compare_exchange_n(),
> but I
> > > > am not sure why the load-acquire is used for the compare exchange.
> > >
> > > It correct as per C11 acquire and release semantics.
> > >
> > > > Also in update_tail, the pause can be called before the data copy
> > > > because of ht->tail load without atomic_load_n.
> > > >
> > > > The memory order is simply difficult, so it might take a bit
> > > > longer time to check if the code is correct. I think I can fix the
> > > > C11 rte_ring as another patch.
> > > >
> > > > >>
> > > > >> SPDK blobfs encountered a crash around rte_ring dequeues in ppc64.
> > > > >> It uses a single consumer and multiple producers for a rte_ring.
> > > > >> The problem was a load-load reorder in
> rte_ring_sc_dequeue_bulk().
> > > > >
> > > > > Adding rte_smp_rmb() cause performance regression on non x86
> > platforms.
> > > > > Having said that, load-load barrier can be expressed very well
> > > > > with C11 memory model. I guess ppc64 supports C11 memory model.
> > > > > If so, Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
> for
> > > > > ppc64 and check original issue?
> > > > >
> > > > >>
> > > > >> The reordered loads happened on r->prod.tail in
> > >
> > > There is rte_smp_rmb() just before reading r->prod.tail in
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > _rte_ring_move_cons_head(). Would that not suffice the requirement?
> > >
> > > Can you check adding compiler barrier and see is compiler is
> > > reordering the stuff?
> > >
> > > DPDK's ring implementation is based freebsd's ring implementation, I
> > > don't see need for such barrier
> > >
> > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h
> > >
> > > If it is something specific to ppc64 or a specific ppc64 machine, we
> > > could add a compile option as it is arch specific to avoid
> > > performance impact on other architectures.
> > >
> > > > >> __rte_ring_move_cons_head() (rte_ring_generic.h) and ring[idx]
> > > > >> in
> > > > >> DEQUEUE_PTRS() (rte_ring.h). They have a load-load control
> > > > >> dependency, but the code does not satisfy it. Note that they
> > > > >> are not reordered if __rte_ring_move_cons_head() with is_sc !=
> > > > >> 1 because cmpset invokes a read barrier.
> > > > >>
> > > > >> The paired stores on these loads are in ENQUEUE_PTRS() and
> > > > >> update_tail(). Simplified code around the reorder is the following.
> > > > >>
> > > > >> Consumer Producer
> > > > >> load idx[ring]
> > > > >> store idx[ring]
> > > > >> store r->prod.tail load r->prod.tail
> > > > >>
> > > > >> In this case, the consumer loads old idx[ring] and confirms the
> > > > >> load is valid with the new r->prod.tail.
> > > > >>
> > > > >> I added a read barrier in the case where __IS_SC is passed to
> > > > >> __rte_ring_move_cons_head(). I also fixed
> > > > >> __rte_ring_move_prod_head() to avoid similar problems with a
> > > > >> single
> > producer.
> > > > >>
> > > > >> Cc: stable@dpdk.org
> > > > >>
> > > > >> Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com>
> > > > >> ---
> > > > >> lib/librte_ring/rte_ring_generic.h | 10 ++++++----
> > > > >> 1 file changed, 6 insertions(+), 4 deletions(-)
> > > > >>
> > > > >> diff --git a/lib/librte_ring/rte_ring_generic.h
> > > > >> b/lib/librte_ring/rte_ring_generic.h
> > > > >> index ea7dbe5b9..477326180 100644
> > > > >> --- a/lib/librte_ring/rte_ring_generic.h
> > > > >> +++ b/lib/librte_ring/rte_ring_generic.h
> > > > >> @@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring
> > > > >> *r,
> > unsigned int is_sp,
> > > > >> return 0;
> > > > >>
> > > > >> *new_head = *old_head + n;
> > > > >> - if (is_sp)
> > > > >> + if (is_sp) {
> > > > >> + rte_smp_rmb();
> > > > >> r->prod.head = *new_head, success = 1;
> > > > >> - else
> > > > >> + } else
> > > > >> success = rte_atomic32_cmpset(&r->prod.head,
> > > > >> *old_head, *new_head);
> > > > >> } while (unlikely(success == 0)); @@ -158,9 +159,10 @@
> > > > >> __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
> > > > >> return 0;
> > > > >>
> > > > >> *new_head = *old_head + n;
> > > > >> - if (is_sc)
> > > > >> + if (is_sc) {
> > > > >> + rte_smp_rmb();
> > > > >> r->cons.head = *new_head, success = 1;
> > > > >> - else
> > > > >> + } else
> > > > >> success = rte_atomic32_cmpset(&r->cons.head,
> *old_head,
> > > > >> *new_head);
> > > > >> } while (unlikely(success == 0));
> > > > >> --
> > > > >> 2.17.1
> >
> >
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [dpdk-dev] [dpdk-stable] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
2021-06-16 7:14 ` [dpdk-dev] 回复: " Feifei Wang
@ 2021-06-16 16:37 ` Honnappa Nagarahalli
0 siblings, 0 replies; 8+ messages in thread
From: Honnappa Nagarahalli @ 2021-06-16 16:37 UTC (permalink / raw)
To: Feifei Wang, thomas, Takeshi Yoshimura
Cc: stable, dev, olivier.matz, chaozhu, konstantin.ananyev, jerinj,
nd, Honnappa Nagarahalli, nd
<snip>
>
> Hi, everyone
>
> This patch can be closed with the following reasons.
>
> > -----邮件原件-----
> > 发件人: dev <dev-bounces@dpdk.org> 代表 Honnappa Nagarahalli
> > 发送时间: 2021年3月28日 9:00
> > 收件人: thomas@monjalon.net; Takeshi Yoshimura
> > <t.yoshimura8869@gmail.com>
> > 抄送: stable@dpdk.org; dev@dpdk.org; olivier.matz@6wind.com;
> > chaozhu@linux.vnet.ibm.com; konstantin.ananyev@intel.com; Jerin Jacob
> > <jerin.jacob@caviumnetworks.com>; nd <nd@arm.com>; nd
> <nd@arm.com>
> > 主题: Re: [dpdk-dev] [dpdk-stable] [PATCH] rte_ring: fix racy
> > dequeue/enqueue in ppc64
> >
> > <snip>
> >
> > > Subject: Re: [dpdk-stable] [dpdk-dev] [PATCH] rte_ring: fix racy
> > > dequeue/enqueue in ppc64
> > >
> > > No reply after more than 2 years.
> > > Unfortunately it is probably outdated now.
> > > Classified as "Changes Requested".
> > Looking at the code, I think this patch in fact fixes a bug.
> > Appreciate rebasing this patch.
> >
> > The problem is already fixed in '__rte_ring_move_cons_head' but needs
> > to be fixed in '__rte_ring_move_prod_head'.
> > This problem is fixed for C11 version due to acquire load of cons.tail
> > and prod.tail.
>
> First, for consumer in dequeue:
> the reason for that adding a rmb in move_cons_head of “generic” is based on
> this patch:
> http://patches.dpdk.org/project/dpdk/patch/1552409933-45684-2-git-send-
> email-gavin.hu@arm.com/
>
> Slot Consumer Producer
> 1 dequeue elements
> 2 update prod_tail
> 3 load new prod_tail
> 4 check room is enough(n < entries)
>
> Dequeue elements maybe before load updated prod_tail, so consumer can
> load incorrect elements value.
> For dequeue multiple consumers case, ‘rte_atomic32_cmpset’ with acquire
> and release order can prevent dequeue before load prod_tail, no extra rmb is
> needed.
>
> Second, for single producer in enqueue:
>
> Slot Producer Consumer
> 1 enqueue elements(not commited)
> 2 update
> consumer_tail
> 3 load new consumer_tail
> 4 check room is enough(n < entries)
> 5 enqueued elements is committed
>
> Though enqueue elements maybe reorder before load consumer_tail, these
> elements will not be committed until ‘check’ has finished. So from load to
> write control dependency is reliable and rmb is not needed here.
> [1] https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf (page:15)
>
> As a result, it is unnecessary to add a rmb for enqueue single producer due to
> control dependency. And this patch can be closed.
Thanks Feifei, I did not consider the control dependency from load to store which is reliable in my comments below.
Agree, we can reject this patch.
>
> Best Regards
> Feifei
> >
> > >
> > >
> > > 17/07/2018 05:34, Jerin Jacob:
> > > > From: Takeshi Yoshimura <t.yoshimura8869@gmail.com>
> > > >
> > > > Cc: olivier.matz@6wind.com
> > > > Cc: chaozhu@linux.vnet.ibm.com
> > > > Cc: konstantin.ananyev@intel.com
> > > >
> > > > >
> > > > > > Adding rte_smp_rmb() cause performance regression on non x86
> > > platforms.
> > > > > > Having said that, load-load barrier can be expressed very
> > > > > > well with C11 memory model. I guess ppc64 supports C11 memory
> model.
> > > > > > If so, Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
> > for
> > > > > > ppc64 and check original issue?
> > > > >
> > > > > Yes, the performance regression happens on non-x86 with single
> > > > > producer/consumer.
> > > > > The average latency of an enqueue was increased from 21 nsec to
> > > > > 24 nsec in my simple experiment. But, I think it is worth it.
> > > >
> > > > That varies to machine to machine. What is the burst size etc.
> > > >
> > > > >
> > > > >
> > > > > I also tested C11 rte_ring, however, it caused the same race
> > > > > condition in
> > > ppc64.
> > > > > I tried to fix the C11 problem as well, but I also found the C11
> > > > > rte_ring had other potential incorrect choices of memory orders,
> > > > > which caused another race condition in ppc64.
> > > >
> > > > Does it happens on all ppc64 machines? Or on a specific machine?
> > > > Is following tests are passing on your system without the patch?
> > > > test/test/test_ring_perf.c
> > > > test/test/test_ring.c
> > > >
> > > > >
> > > > > For example,
> > > > > __ATOMIC_ACQUIRE is passed to __atomic_compare_exchange_n(),
> > but I
> > > > > am not sure why the load-acquire is used for the compare exchange.
> > > >
> > > > It correct as per C11 acquire and release semantics.
> > > >
> > > > > Also in update_tail, the pause can be called before the data
> > > > > copy because of ht->tail load without atomic_load_n.
> > > > >
> > > > > The memory order is simply difficult, so it might take a bit
> > > > > longer time to check if the code is correct. I think I can fix
> > > > > the
> > > > > C11 rte_ring as another patch.
> > > > >
> > > > > >>
> > > > > >> SPDK blobfs encountered a crash around rte_ring dequeues in
> ppc64.
> > > > > >> It uses a single consumer and multiple producers for a rte_ring.
> > > > > >> The problem was a load-load reorder in
> > rte_ring_sc_dequeue_bulk().
> > > > > >
> > > > > > Adding rte_smp_rmb() cause performance regression on non x86
> > > platforms.
> > > > > > Having said that, load-load barrier can be expressed very
> > > > > > well with C11 memory model. I guess ppc64 supports C11 memory
> model.
> > > > > > If so, Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
> > for
> > > > > > ppc64 and check original issue?
> > > > > >
> > > > > >>
> > > > > >> The reordered loads happened on r->prod.tail in
> > > >
> > > > There is rte_smp_rmb() just before reading r->prod.tail in
> > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > _rte_ring_move_cons_head(). Would that not suffice the requirement?
> > > >
> > > > Can you check adding compiler barrier and see is compiler is
> > > > reordering the stuff?
> > > >
> > > > DPDK's ring implementation is based freebsd's ring implementation,
> > > > I don't see need for such barrier
> > > >
> > > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h
> > > >
> > > > If it is something specific to ppc64 or a specific ppc64 machine,
> > > > we could add a compile option as it is arch specific to avoid
> > > > performance impact on other architectures.
> > > >
> > > > > >> __rte_ring_move_cons_head() (rte_ring_generic.h) and
> > > > > >> ring[idx] in
> > > > > >> DEQUEUE_PTRS() (rte_ring.h). They have a load-load control
> > > > > >> dependency, but the code does not satisfy it. Note that they
> > > > > >> are not reordered if __rte_ring_move_cons_head() with is_sc
> > > > > >> !=
> > > > > >> 1 because cmpset invokes a read barrier.
> > > > > >>
> > > > > >> The paired stores on these loads are in ENQUEUE_PTRS() and
> > > > > >> update_tail(). Simplified code around the reorder is the following.
> > > > > >>
> > > > > >> Consumer Producer
> > > > > >> load idx[ring]
> > > > > >> store idx[ring]
> > > > > >> store r->prod.tail load r->prod.tail
> > > > > >>
> > > > > >> In this case, the consumer loads old idx[ring] and confirms
> > > > > >> the load is valid with the new r->prod.tail.
> > > > > >>
> > > > > >> I added a read barrier in the case where __IS_SC is passed to
> > > > > >> __rte_ring_move_cons_head(). I also fixed
> > > > > >> __rte_ring_move_prod_head() to avoid similar problems with a
> > > > > >> single
> > > producer.
> > > > > >>
> > > > > >> Cc: stable@dpdk.org
> > > > > >>
> > > > > >> Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com>
> > > > > >> ---
> > > > > >> lib/librte_ring/rte_ring_generic.h | 10 ++++++----
> > > > > >> 1 file changed, 6 insertions(+), 4 deletions(-)
> > > > > >>
> > > > > >> diff --git a/lib/librte_ring/rte_ring_generic.h
> > > > > >> b/lib/librte_ring/rte_ring_generic.h
> > > > > >> index ea7dbe5b9..477326180 100644
> > > > > >> --- a/lib/librte_ring/rte_ring_generic.h
> > > > > >> +++ b/lib/librte_ring/rte_ring_generic.h
> > > > > >> @@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring
> > > > > >> *r,
> > > unsigned int is_sp,
> > > > > >> return 0;
> > > > > >>
> > > > > >> *new_head = *old_head + n;
> > > > > >> - if (is_sp)
> > > > > >> + if (is_sp) {
> > > > > >> + rte_smp_rmb();
> > > > > >> r->prod.head = *new_head, success = 1;
> > > > > >> - else
> > > > > >> + } else
> > > > > >> success = rte_atomic32_cmpset(&r->prod.head,
> > > > > >> *old_head, *new_head);
> > > > > >> } while (unlikely(success == 0)); @@ -158,9 +159,10
> > > > > >> @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int
> is_sc,
> > > > > >> return 0;
> > > > > >>
> > > > > >> *new_head = *old_head + n;
> > > > > >> - if (is_sc)
> > > > > >> + if (is_sc) {
> > > > > >> + rte_smp_rmb();
> > > > > >> r->cons.head = *new_head, success = 1;
> > > > > >> - else
> > > > > >> + } else
> > > > > >> success =
> > > > > >> rte_atomic32_cmpset(&r->cons.head,
> > *old_head,
> > > > > >> *new_head);
> > > > > >> } while (unlikely(success == 0));
> > > > > >> --
> > > > > >> 2.17.1
> > >
> > >
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2021-06-16 16:38 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-12 2:44 [dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64 Takeshi Yoshimura
2018-07-12 17:08 ` Jerin Jacob
2018-07-17 2:54 ` Takeshi Yoshimura
2018-07-17 3:34 ` Jerin Jacob
2021-03-24 21:45 ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
2021-03-28 1:00 ` Honnappa Nagarahalli
2021-06-16 7:14 ` [dpdk-dev] 回复: " Feifei Wang
2021-06-16 16:37 ` [dpdk-dev] " Honnappa Nagarahalli
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).