DPDK patches and discussions
* [PATCH 0/1] ring: correct ordering issue in head/tail update
@ 2025-09-15 18:54 Wathsala Vithanage
  2025-09-15 18:54 ` [PATCH 1/1] ring: safe partial ordering for " Wathsala Vithanage
  0 siblings, 1 reply; 8+ messages in thread
From: Wathsala Vithanage @ 2025-09-15 18:54 UTC (permalink / raw)
  Cc: dev, Wathsala Vithanage

Hi all,

This patch resolves a subtle ordering issue in the ring code that could
lead to incorrect behavior under certain conditions. The change favors a
performance-conscious fix while preserving existing behavior.

For background, motivation, and validation (including Herd7 litmus
tests), please see the accompanying write-up:
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-order

Wathsala Vithanage (1):
  ring: safe partial ordering for head/tail update

 lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-15 18:54 [PATCH 0/1] ring: correct ordering issue in head/tail update Wathsala Vithanage
@ 2025-09-15 18:54 ` Wathsala Vithanage
  2025-09-16 15:42   ` Bruce Richardson
  2025-09-16 22:57   ` Konstantin Ananyev
  0 siblings, 2 replies; 8+ messages in thread
From: Wathsala Vithanage @ 2025-09-15 18:54 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Konstantin Ananyev
  Cc: dev, Wathsala Vithanage, Ola Liljedahl, Dhruv Tripathi

The function __rte_ring_headtail_move_head() assumes that the barrier
(fence) between the load of the head and the load-acquire of the
opposing tail guarantees the following: if a first thread reads tail
and then writes head and a second thread reads the new value of head
and then reads tail, then it should observe the same (or a later)
value of tail.

This assumption is incorrect under the C11 memory model. If the barrier
(fence) is intended to establish a total ordering of ring operations,
it fails to do so. Instead, the current implementation only enforces a
partial ordering, which can lead to unsafe interleavings. In particular,
some partial orders can cause underflows in free slot or available
element computations, potentially resulting in data corruption.

The issue manifests when a CPU first acts as a producer and later as a
consumer. In this scenario, the barrier assumption may fail when another
core takes the consumer role. A Herd7 litmus test in C11 can demonstrate
this violation. The problem has not been widely observed so far because:
  (a) on strong memory models (e.g., x86-64) the assumption holds, and
  (b) on relaxed models with RCsc semantics the ordering is still strong
      enough to prevent hazards.
The problem becomes visible only on weaker models, when load-acquire is
implemented with RCpc semantics (e.g. some AArch64 CPUs which support
the LDAPR and LDAPUR instructions).

Three possible solutions exist:
  1. Strengthen ordering by upgrading release/acquire semantics to
     sequential consistency. This requires using seq-cst for stores,
     loads, and CAS operations. However, this approach introduces a
     significant performance penalty on relaxed-memory architectures.

  2. Establish a safe partial order by enforcing a pairwise
     happens-before relationship between threads of the same role:
     convert the CAS to release and the preceding load of the head to
     acquire. This approach makes the original barrier assumption
     unnecessary and allows its removal.

  3. Retain partial ordering but ensure only safe partial orders are
     committed. This can be done by detecting underflow conditions
     (producer < consumer) and quashing the update in such cases.
     This approach makes the original barrier assumption unnecessary
     and allows its removal.

This patch implements solution (3) for performance reasons.

Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com>
---
 lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index b9388af0da..e5ac1f6b9e 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 		/* Reset n to the initial burst count */
 		n = max;
 
-		/* Ensure the head is read before tail */
-		rte_atomic_thread_fence(rte_memory_order_acquire);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -99,6 +96,13 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 		 */
 		*entries = (capacity + stail - *old_head);
 
+		/*
+		 * Ensure the entries calculation was not based on a stale
+		 * and unsafe stail observation that causes underflow.
+		 */
+		if ((int)*entries < 0)
+			*entries = 0;
+
 		/* check that we have enough room in ring */
 		if (unlikely(n > *entries))
 			n = (behavior == RTE_RING_QUEUE_FIXED) ?
-- 
2.43.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-15 18:54 ` [PATCH 1/1] ring: safe partial ordering for " Wathsala Vithanage
@ 2025-09-16 15:42   ` Bruce Richardson
  2025-09-16 18:19     ` Ola Liljedahl
  2025-09-16 22:57   ` Konstantin Ananyev
  1 sibling, 1 reply; 8+ messages in thread
From: Bruce Richardson @ 2025-09-16 15:42 UTC (permalink / raw)
  To: Wathsala Vithanage
  Cc: Honnappa Nagarahalli, Konstantin Ananyev, dev, Ola Liljedahl,
	Dhruv Tripathi

On Mon, Sep 15, 2025 at 06:54:50PM +0000, Wathsala Vithanage wrote:
> The function __rte_ring_headtail_move_head() assumes that the barrier
> (fence) between the load of the head and the load-acquire of the
> opposing tail guarantees the following: if a first thread reads tail
> and then writes head and a second thread reads the new value of head
> and then reads tail, then it should observe the same (or a later)
> value of tail.
> 
> This assumption is incorrect under the C11 memory model. If the barrier
> (fence) is intended to establish a total ordering of ring operations,
> it fails to do so. Instead, the current implementation only enforces a
> partial ordering, which can lead to unsafe interleavings. In particular,
> some partial orders can cause underflows in free slot or available
> element computations, potentially resulting in data corruption.
> 
> The issue manifests when a CPU first acts as a producer and later as a
> consumer. In this scenario, the barrier assumption may fail when another
> core takes the consumer role. A Herd7 litmus test in C11 can demonstrate
> this violation. The problem has not been widely observed so far because:
>   (a) on strong memory models (e.g., x86-64) the assumption holds, and
>   (b) on relaxed models with RCsc semantics the ordering is still strong
>       enough to prevent hazards.
> The problem becomes visible only on weaker models, when load-acquire is
> implemented with RCpc semantics (e.g. some AArch64 CPUs which support
> the LDAPR and LDAPUR instructions).
> 
> Three possible solutions exist:
>   1. Strengthen ordering by upgrading release/acquire semantics to
>      sequential consistency. This requires using seq-cst for stores,
>      loads, and CAS operations. However, this approach introduces a
>      significant performance penalty on relaxed-memory architectures.
> 
>   2. Establish a safe partial order by enforcing a pair-wise
>      happens-before relationship between thread of same role by changing
>      the CAS and the preceding load of the head by converting them to
>      release and acquire respectively. This approach makes the original
>      barrier assumption unnecessary and allows its removal.
> 
>   3. Retain partial ordering but ensure only safe partial orders are
>      committed. This can be done by detecting underflow conditions
>      (producer < consumer) and quashing the update in such cases.
>      This approach makes the original barrier assumption unnecessary
>      and allows its removal.
> 
> This patch implements solution (3) for performance reasons.
> 
> Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com>
> ---
>  lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
Thank you for the very comprehensive write-up in the article about this.
It was very educational.

On the patch, are we sure that option #3 is safe to take as an approach? It
seems wrong to me to deliberately leave ordering issues in the code and
just try to fix them up later. Would there be a noticeable performance
difference for real-world apps if we took option #2, and actually used
correct ordering semantics? I realise the perf data in the blog post about
this shows it being slower, but for enqueues and dequeues of bursts, of
e.g. 8, rather than just 1, is there a very big delta?

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-16 15:42   ` Bruce Richardson
@ 2025-09-16 18:19     ` Ola Liljedahl
  2025-09-17  7:47       ` Bruce Richardson
  0 siblings, 1 reply; 8+ messages in thread
From: Ola Liljedahl @ 2025-09-16 18:19 UTC (permalink / raw)
  To: Bruce Richardson, Wathsala Vithanage
  Cc: Honnappa Nagarahalli, Konstantin Ananyev, dev, Dhruv Tripathi

(I am sending this from Outlook, I hope I have been able to fake a proper
email client)

> On 2025-09-16, 17:43, "Bruce Richardson" <bruce.richardson@intel.com> wrote:
>
>
> On Mon, Sep 15, 2025 at 06:54:50PM +0000, Wathsala Vithanage wrote:
> > The function __rte_ring_headtail_move_head() assumes that the barrier
> > (fence) between the load of the head and the load-acquire of the
> > opposing tail guarantees the following: if a first thread reads tail
> > and then writes head and a second thread reads the new value of head
> > and then reads tail, then it should observe the same (or a later)
> > value of tail.
> >
> > This assumption is incorrect under the C11 memory model. If the barrier
> > (fence) is intended to establish a total ordering of ring operations,
> > it fails to do so. Instead, the current implementation only enforces a
> > partial ordering, which can lead to unsafe interleavings. In particular,
> > some partial orders can cause underflows in free slot or available
> > element computations, potentially resulting in data corruption.
> >
> > The issue manifests when a CPU first acts as a producer and later as a
> > consumer. In this scenario, the barrier assumption may fail when another
> > core takes the consumer role. A Herd7 litmus test in C11 can demonstrate
> > this violation. The problem has not been widely observed so far because:
> > (a) on strong memory models (e.g., x86-64) the assumption holds, and
> > (b) on relaxed models with RCsc semantics the ordering is still strong
> > enough to prevent hazards.
> > The problem becomes visible only on weaker models, when load-acquire is
> > implemented with RCpc semantics (e.g. some AArch64 CPUs which support
> > the LDAPR and LDAPUR instructions).
> >
> > Three possible solutions exist:
> > 1. Strengthen ordering by upgrading release/acquire semantics to
> > sequential consistency. This requires using seq-cst for stores,
> > loads, and CAS operations. However, this approach introduces a
> > significant performance penalty on relaxed-memory architectures.
> >
> > 2. Establish a safe partial order by enforcing a pair-wise
> > happens-before relationship between thread of same role by changing
> > the CAS and the preceding load of the head by converting them to
> > release and acquire respectively. This approach makes the original
> > barrier assumption unnecessary and allows its removal.
> >
> > 3. Retain partial ordering but ensure only safe partial orders are
> > committed. This can be done by detecting underflow conditions
> > (producer < consumer) and quashing the update in such cases.
> > This approach makes the original barrier assumption unnecessary
> > and allows its removal.
> >
> > This patch implements solution (3) for performance reasons.
> >
> > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com>
> > ---
> > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > 1 file changed, 7 insertions(+), 3 deletions(-)
> >
> Thank you for the very comprehensive write-up in the article about this.
> It was very educational.
>
>
> On the patch, are we sure that option #3 is safe to take as an approach? It
> seems wrong to me to deliberately leave ordering issues in the code and
> just try and fix them up later. Would there be a noticable performance
> difference for real-world apps if we took option #2, and actually used
> correct ordering semantics?
I am pretty sure that all necessary orderings for safely transferring elements
from producers to consumers (and empty slots from consumers to producers)
are present in the code (I still have some questions on the use of memory
order relaxed in __rte_ring_update_tail, we should create a litmus test for
this, to see what is required by the C memory model). What other metrics
for correctness do you suggest?
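[Editor's note: for readers unfamiliar with herd7, a litmus test for the
basic store-release/load-acquire pairing that update_tail relies on might
look like the sketch below. Variable names are illustrative; this is the
classic message-passing pattern, not the exact test for the relaxed-load
question Ola raises.]

```
C ring_tail_mp

{}

P0(atomic_int *slot, atomic_int *tail)
{
	/* producer: publish the element, then release the tail */
	atomic_store_explicit(slot, 1, memory_order_relaxed);
	atomic_store_explicit(tail, 1, memory_order_release);
}

P1(atomic_int *slot, atomic_int *tail)
{
	/* consumer: acquire the tail, then read the element */
	int r0 = atomic_load_explicit(tail, memory_order_acquire);
	int r1 = atomic_load_explicit(slot, memory_order_relaxed);
}

exists (1:r0=1 /\ 1:r1=0)
```

herd7 reports the exists clause as never satisfied: a consumer whose
acquire observes the new tail cannot miss the element published before the
release.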

Some data structures are specified or expected (explicitly or implicitly) to be
e.g. sequentially consistent, someone claimed this is true for some
synchronization mechanisms in the Linux kernel and I have seen such claims
for the POSIX mutex. Are there any similar requirements for DPDK ring buffers?

- Ola

> I realise the perf data in the blog post about
> this shows it being slower, but for enqueues and dequeues of bursts, of
> e.g. 8, rather than just 1, is there a very big delta?
>
>
> Regards,
> /Bruce




^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-15 18:54 ` [PATCH 1/1] ring: safe partial ordering for " Wathsala Vithanage
  2025-09-16 15:42   ` Bruce Richardson
@ 2025-09-16 22:57   ` Konstantin Ananyev
  2025-09-16 23:08     ` Konstantin Ananyev
       [not found]     ` <2a611c3cf926d752a54b7655c27d6df874a2d0de.camel@arm.com>
  1 sibling, 2 replies; 8+ messages in thread
From: Konstantin Ananyev @ 2025-09-16 22:57 UTC (permalink / raw)
  To: Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Ola Liljedahl, Dhruv Tripathi



> The function __rte_ring_headtail_move_head() assumes that the barrier
> (fence) between the load of the head and the load-acquire of the
> opposing tail guarantees the following: if a first thread reads tail
> and then writes head and a second thread reads the new value of head
> and then reads tail, then it should observe the same (or a later)
> value of tail.
> 
> This assumption is incorrect under the C11 memory model. If the barrier
> (fence) is intended to establish a total ordering of ring operations,
> it fails to do so. Instead, the current implementation only enforces a
> partial ordering, which can lead to unsafe interleavings. In particular,
> some partial orders can cause underflows in free slot or available
> element computations, potentially resulting in data corruption.

Hmm... sounds exactly like the problem from the patch we discussed earlier this year:
https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-konstantin.ananyev@huawei.com/
In two words:
"... thread can see 'latest' 'cons.head' value, with 'previous' value for 'prod.tail' or visa-versa.
In other words: 'cons.head' value depends on 'prod.tail', so before making latest 'cons.head'
value visible to other threads, we need to ensure that latest 'prod.tail' is also visible."
Is that the one?

> The issue manifests when a CPU first acts as a producer and later as a
> consumer. In this scenario, the barrier assumption may fail when another
> core takes the consumer role. A Herd7 litmus test in C11 can demonstrate
> this violation. The problem has not been widely observed so far because:
>   (a) on strong memory models (e.g., x86-64) the assumption holds, and
>   (b) on relaxed models with RCsc semantics the ordering is still strong
>       enough to prevent hazards.
> The problem becomes visible only on weaker models, when load-acquire is
> implemented with RCpc semantics (e.g. some AArch64 CPUs which support
> the LDAPR and LDAPUR instructions).
> 
> Three possible solutions exist:
>   1. Strengthen ordering by upgrading release/acquire semantics to
>      sequential consistency. This requires using seq-cst for stores,
>      loads, and CAS operations. However, this approach introduces a
>      significant performance penalty on relaxed-memory architectures.
> 
>   2. Establish a safe partial order by enforcing a pair-wise
>      happens-before relationship between thread of same role by changing
>      the CAS and the preceding load of the head by converting them to
>      release and acquire respectively. This approach makes the original
>      barrier assumption unnecessary and allows its removal.

For the sake of clarity, can you outline what would be exact code changes for
approach #2? Same as in that patch:
https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-konstantin.ananyev@huawei.com/
Or something different?
 
>   3. Retain partial ordering but ensure only safe partial orders are
>      committed. This can be done by detecting underflow conditions
>      (producer < consumer) and quashing the update in such cases.
>      This approach makes the original barrier assumption unnecessary
>      and allows its removal.

> This patch implements solution (3) for performance reasons.
> 
> Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com>
> ---
>  lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index b9388af0da..e5ac1f6b9e 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail
> *d,
>  		/* Reset n to the initial burst count */
>  		n = max;
> 
> -		/* Ensure the head is read before tail */
> -		rte_atomic_thread_fence(rte_memory_order_acquire);
> -
>  		/* load-acquire synchronize with store-release of ht->tail
>  		 * in update_tail.
>  		 */

But then cons.head can be read before prod.tail (and vice versa), right?

> @@ -99,6 +96,13 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail
> *d,
>  		 */
>  		*entries = (capacity + stail - *old_head);
> 
> +		/*
> +		 * Ensure the entries calculation was not based on a stale
> +		 * and unsafe stail observation that causes underflow.
> +		 */
> +		if ((int)*entries < 0)
> +			*entries = 0;
> +
>  		/* check that we have enough room in ring */
>  		if (unlikely(n > *entries))
>  			n = (behavior == RTE_RING_QUEUE_FIXED) ?
> --
> 2.43.0
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-16 22:57   ` Konstantin Ananyev
@ 2025-09-16 23:08     ` Konstantin Ananyev
       [not found]     ` <2a611c3cf926d752a54b7655c27d6df874a2d0de.camel@arm.com>
  1 sibling, 0 replies; 8+ messages in thread
From: Konstantin Ananyev @ 2025-09-16 23:08 UTC (permalink / raw)
  To: Konstantin Ananyev, Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Ola Liljedahl, Dhruv Tripathi



> 
> 
> > The function __rte_ring_headtail_move_head() assumes that the barrier
> > (fence) between the load of the head and the load-acquire of the
> > opposing tail guarantees the following: if a first thread reads tail
> > and then writes head and a second thread reads the new value of head
> > and then reads tail, then it should observe the same (or a later)
> > value of tail.
> >
> > This assumption is incorrect under the C11 memory model. If the barrier
> > (fence) is intended to establish a total ordering of ring operations,
> > it fails to do so. Instead, the current implementation only enforces a
> > partial ordering, which can lead to unsafe interleavings. In particular,
> > some partial orders can cause underflows in free slot or available
> > element computations, potentially resulting in data corruption.
> 
> Hmm... sounds exactly like the problem from the patch we discussed earlier that
> year:
> https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-konstantin.ananyev@huawei.com/
> In two words:
> "... thread can see 'latest' 'cons.head' value, with 'previous' value for 'prod.tail' or
> visa-versa.
> In other words: 'cons.head' value depends on 'prod.tail', so before making latest
> 'cons.head'
> value visible to other threads, we need to ensure that latest 'prod.tail' is also visible."
> Is that the one?
> 
> > The issue manifests when a CPU first acts as a producer and later as a
> > consumer. In this scenario, the barrier assumption may fail when another
> > core takes the consumer role. A Herd7 litmus test in C11 can demonstrate
> > this violation. The problem has not been widely observed so far because:
> >   (a) on strong memory models (e.g., x86-64) the assumption holds, and
> >   (b) on relaxed models with RCsc semantics the ordering is still strong
> >       enough to prevent hazards.
> > The problem becomes visible only on weaker models, when load-acquire is
> > implemented with RCpc semantics (e.g. some AArch64 CPUs which support
> > the LDAPR and LDAPUR instructions).
> >
> > Three possible solutions exist:
> >   1. Strengthen ordering by upgrading release/acquire semantics to
> >      sequential consistency. This requires using seq-cst for stores,
> >      loads, and CAS operations. However, this approach introduces a
> >      significant performance penalty on relaxed-memory architectures.
> >
> >   2. Establish a safe partial order by enforcing a pair-wise
> >      happens-before relationship between thread of same role by changing
> >      the CAS and the preceding load of the head by converting them to
> >      release and acquire respectively. This approach makes the original
> >      barrier assumption unnecessary and allows its removal.
> 
> For the sake of clarity, can you outline what would be exact code changes for
> approach #2? Same as in that patch:
> https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-konstantin.ananyev@huawei.com/
> Or something different?
> 
> >   3. Retain partial ordering but ensure only safe partial orders are
> >      committed. This can be done by detecting underflow conditions
> >      (producer < consumer) and quashing the update in such cases.
> >      This approach makes the original barrier assumption unnecessary
> >      and allows its removal.
> 
> > This patch implements solution (3) for performance reasons.
> >
> > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com>
> > ---
> >  lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> > index b9388af0da..e5ac1f6b9e 100644
> > --- a/lib/ring/rte_ring_c11_pvt.h
> > +++ b/lib/ring/rte_ring_c11_pvt.h
> > @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail
> > *d,
> >  		/* Reset n to the initial burst count */
> >  		n = max;
> >
> > -		/* Ensure the head is read before tail */
> > -		rte_atomic_thread_fence(rte_memory_order_acquire);
> > -
> >  		/* load-acquire synchronize with store-release of ht->tail
> >  		 * in update_tail.
> >  		 */
> 
> But then cons.head can be read a before prod.tail (and visa-versa), right?
s/before/after/
 
> 
> > @@ -99,6 +96,13 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail
> > *d,
> >  		 */
> >  		*entries = (capacity + stail - *old_head);
> >
> > +		/*
> > +		 * Ensure the entries calculation was not based on a stale
> > +		 * and unsafe stail observation that causes underflow.
> > +		 */
> > +		if ((int)*entries < 0)
> > +			*entries = 0;
> > +
> >  		/* check that we have enough room in ring */
> >  		if (unlikely(n > *entries))
> >  			n = (behavior == RTE_RING_QUEUE_FIXED) ?
> > --
> > 2.43.0
> >


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-16 18:19     ` Ola Liljedahl
@ 2025-09-17  7:47       ` Bruce Richardson
  0 siblings, 0 replies; 8+ messages in thread
From: Bruce Richardson @ 2025-09-17  7:47 UTC (permalink / raw)
  To: Ola Liljedahl
  Cc: Wathsala Vithanage, Honnappa Nagarahalli, Konstantin Ananyev,
	dev, Dhruv Tripathi

On Tue, Sep 16, 2025 at 06:19:41PM +0000, Ola Liljedahl wrote:
> (I am sending this from Outlook, I hope I have been able to fake a proper
> email client)
> 
> > On 2025-09-16, 17:43, "Bruce Richardson" <bruce.richardson@intel.com> wrote:
> >
> >
> > On Mon, Sep 15, 2025 at 06:54:50PM +0000, Wathsala Vithanage wrote:
> > > The function __rte_ring_headtail_move_head() assumes that the barrier
> > > (fence) between the load of the head and the load-acquire of the
> > > opposing tail guarantees the following: if a first thread reads tail
> > > and then writes head and a second thread reads the new value of head
> > > and then reads tail, then it should observe the same (or a later)
> > > value of tail.
> > >
> > > This assumption is incorrect under the C11 memory model. If the barrier
> > > (fence) is intended to establish a total ordering of ring operations,
> > > it fails to do so. Instead, the current implementation only enforces a
> > > partial ordering, which can lead to unsafe interleavings. In particular,
> > > some partial orders can cause underflows in free slot or available
> > > element computations, potentially resulting in data corruption.
> > >
> > > The issue manifests when a CPU first acts as a producer and later as a
> > > consumer. In this scenario, the barrier assumption may fail when another
> > > core takes the consumer role. A Herd7 litmus test in C11 can demonstrate
> > > this violation. The problem has not been widely observed so far because:
> > > (a) on strong memory models (e.g., x86-64) the assumption holds, and
> > > (b) on relaxed models with RCsc semantics the ordering is still strong
> > > enough to prevent hazards.
> > > The problem becomes visible only on weaker models, when load-acquire is
> > > implemented with RCpc semantics (e.g. some AArch64 CPUs which support
> > > the LDAPR and LDAPUR instructions).
> > >
> > > Three possible solutions exist:
> > > 1. Strengthen ordering by upgrading release/acquire semantics to
> > > sequential consistency. This requires using seq-cst for stores,
> > > loads, and CAS operations. However, this approach introduces a
> > > significant performance penalty on relaxed-memory architectures.
> > >
> > > 2. Establish a safe partial order by enforcing a pair-wise
> > > happens-before relationship between thread of same role by changing
> > > the CAS and the preceding load of the head by converting them to
> > > release and acquire respectively. This approach makes the original
> > > barrier assumption unnecessary and allows its removal.
> > >
> > > 3. Retain partial ordering but ensure only safe partial orders are
> > > committed. This can be done by detecting underflow conditions
> > > (producer < consumer) and quashing the update in such cases.
> > > This approach makes the original barrier assumption unnecessary
> > > and allows its removal.
> > >
> > > This patch implements solution (3) for performance reasons.
> > >
> > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> > > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com>
> > > ---
> > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > > 1 file changed, 7 insertions(+), 3 deletions(-)
> > >
> > Thank you for the very comprehensive write-up in the article about this.
> > It was very educational.
> >
> >
> > On the patch, are we sure that option #3 is safe to take as an approach? It
> > seems wrong to me to deliberately leave ordering issues in the code and
> > just try and fix them up later. Would there be a noticable performance
> > difference for real-world apps if we took option #2, and actually used
> > correct ordering semantics?
> I am pretty sure that all necessary orderings for safely transferring elements
> from producers to consumers (and empty slots from consumers to producers)
> are present in the code (I still have some questions on the use of memory
> order relaxed in __rte_ring_update_tail, we should create a litmus test for
> this, to see what is required by the C memory model). What other metrics
> for correctness do you suggest?
>

Not suggesting any other metrics for correctness. I'm instead just wanting
to double-check the cost-benefit of taking the approach of putting in a
fix-up, or workaround, in the code here, rather than actually correcting
the memory ordering use.  Also, given that the workaround uses subtraction
to detect underflow, are we 100% sure that we have guaranteed correct
behaviour on counter wraparound at uint32_t max?

/Bruce 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
       [not found]     ` <2a611c3cf926d752a54b7655c27d6df874a2d0de.camel@arm.com>
@ 2025-09-17  7:58       ` Konstantin Ananyev
  0 siblings, 0 replies; 8+ messages in thread
From: Konstantin Ananyev @ 2025-09-17  7:58 UTC (permalink / raw)
  To: Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Ola Liljedahl, Dhruv Tripathi, Bruce Richardson

To avoid information loss, I combined replies to two of Wathsala's messages into one.

> > > The function __rte_ring_headtail_move_head() assumes that the
> > > barrier
> > (fence) between the load of the head and the load-acquire of the
> > > opposing tail guarantees the following: if a first thread reads
> > > tail
> > > and then writes head and a second thread reads the new value of
> > > head
> > > and then reads tail, then it should observe the same (or a later)
> > > value of tail.
> > >
> > > This assumption is incorrect under the C11 memory model. If the
> > > barrier
> > > (fence) is intended to establish a total ordering of ring
> > > operations,
> > > it fails to do so. Instead, the current implementation only
> > > enforces a
> > > partial ordering, which can lead to unsafe interleavings. In
> > > particular,
> > > some partial orders can cause underflows in free slot or available
> > > element computations, potentially resulting in data corruption.
> >
> > Hmm... sounds exactly like the problem from the patch we discussed
> > earlier that year:
> > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-konstantin.ananyev@huawei.com/
> > In two words:
> > "... thread can see 'latest' 'cons.head' value, with 'previous' value
> > for 'prod.tail' or visa-versa.
> > In other words: 'cons.head' value depends on 'prod.tail', so before
> > making latest 'cons.head'
> > value visible to other threads, we need to ensure that latest
> > 'prod.tail' is also visible."
> > Is that the one?

> Yes, the behavior occurs under RCpc (LDAPR) but not under RCsc (LDAR),
> which is why we didn’t catch it earlier. A fuller explanation, with
> Herd7 simulations, is in the blog post linked in the cover letter.
>
> https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-order

I see, so now it is reproducible with core rte_ring on real HW.

> >
> > > The issue manifests when a CPU first acts as a producer and later
> > > as a
> > > consumer. In this scenario, the barrier assumption may fail when
> > > another
> > > core takes the consumer role. A Herd7 litmus test in C11 can
> > > demonstrate
> > > this violation. The problem has not been widely observed so far
> > > because:
> > >   (a) on strong memory models (e.g., x86-64) the assumption holds,
> > > and
> > >   (b) on relaxed models with RCsc semantics the ordering is still
> > > strong
> > >       enough to prevent hazards.
> > > The problem becomes visible only on weaker models, when load-
> > > acquire is
> > > implemented with RCpc semantics (e.g. some AArch64 CPUs which
> > > support
> > > the LDAPR and LDAPUR instructions).
> > >
> > > Three possible solutions exist:
> > >   1. Strengthen ordering by upgrading release/acquire semantics to
> > >      sequential consistency. This requires using seq-cst for
> > > stores,
> > >      loads, and CAS operations. However, this approach introduces a
> > >      significant performance penalty on relaxed-memory
> > > architectures.
> > >
> > >   2. Establish a safe partial order by enforcing a pair-wise
> > >      happens-before relationship between thread of same role by
> > > changing
> > >      the CAS and the preceding load of the head by converting them
> > > to
> > >      release and acquire respectively. This approach makes the
> > > original
> > >      barrier assumption unnecessary and allows its removal.
> >
> > For the sake of clarity, can you outline what would be exact code
> > changes for
> > approach #2? Same as in that patch:
> > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-
> konstantin.ananyev@huawei.com/
> > Or something different?
> 
> Sorry, I missed the latter half of your comment before.
> Yes, you have proposed the same solution there.

Ok,  thanks for confirmation.

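[Editorial note: for reference in this exchange, approach #2 amounts to roughly the following schematic, with simplified, hypothetical types and names rather than the actual rte_ring code. The load of the head becomes acquire and the CAS becomes release, pairing threads of the same role so that each inherits the tail observation of its predecessor and the standalone acquire fence can be dropped.]

```c
#include <stdatomic.h>
#include <stdint.h>

/* Simplified stand-in for the rte_ring head/tail pair. */
struct headtail {
	_Atomic uint32_t head;
	_Atomic uint32_t tail;
};

/*
 * Schematic of approach #2: the acquire-load of d->head synchronizes
 * with the release-CAS performed by the previous thread in the same
 * role, so the opposing-tail value that thread observed is also
 * visible here before the fresh load of s->tail.
 */
static uint32_t
move_head(struct headtail *d, const struct headtail *s,
	  uint32_t capacity, uint32_t n, uint32_t *entries)
{
	uint32_t old_head, stail;
	int ok;

	do {
		/* Acquire: pairs with the release-CAS below in another thread. */
		old_head = atomic_load_explicit(&d->head, memory_order_acquire);
		/* Load-acquire of the opposing tail, as in the current code. */
		stail = atomic_load_explicit(&s->tail, memory_order_acquire);
		*entries = capacity + stail - old_head;
		if (n > *entries)
			n = *entries;
		if (n == 0)
			return 0;
		/* Release: publishes the new head together with the tail
		 * observation it was based on. */
		ok = atomic_compare_exchange_strong_explicit(
			&d->head, &old_head, old_head + n,
			memory_order_release, memory_order_relaxed);
	} while (!ok);

	return n;
}
```

The real code uses a weak CAS in a retry loop and handles fixed vs. variable burst behavior; this sketch only shows the ordering change.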
> >
> >
> > >   3. Retain partial ordering but ensure only safe partial orders
> > > are
> > >      committed. This can be done by detecting underflow conditions
> > >      (producer < consumer) and quashing the update in such cases.
> > >      This approach makes the original barrier assumption
> > > unnecessary
> > >      and allows its removal.
> >
> > > This patch implements solution (3) for performance reasons.
> > >
> > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> > > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com>
> > > ---
> > >  lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/lib/ring/rte_ring_c11_pvt.h
> > > b/lib/ring/rte_ring_c11_pvt.h
> > > index b9388af0da..e5ac1f6b9e 100644
> > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct
> > > rte_ring_headtail
> > > *d,
> > >             /* Reset n to the initial burst count */
> > >             n = max;
> > >
> > > -           /* Ensure the head is read before tail */
> > > -           rte_atomic_thread_fence(rte_memory_order_acquire);
> > > -
> > >             /* load-acquire synchronize with store-release of
> > > ht->tail
> > >              * in update_tail.
> > >              */
> >
> > But then cons.head can be read a before prod.tail (and visa-versa),
> > right?
> 
> Right, we let it happen but eliminate any resulting states that are
> semantically incorrect at the end.

Two comments here:
1) I think it is probably safer to do the check like this:
if (*entries > ring->capacity) ...
2) My concern is that without forcing a proper read ordering
(cons.head first, then prod.tail) we re-introduce a window for all
sorts of ABA-like problems, of the kind fixed by:

commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654
Author: Jia He <justin.he@arm.com>
Date:   Fri Nov 10 03:30:42 2017 +0000

    ring: guarantee load/load order in enqueue and dequeue
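[Editorial note: to make the difference between the two checks concrete, here is a small standalone sketch with hypothetical helper functions, not the patch itself. Both variants reject the wrapped-around value produced by a stale tail, but the capacity comparison is strictly stronger: it also rejects impossible values in (capacity, INT32_MAX] that a sign test would pass.]

```c
#include <stdint.h>

/* The check from the patch: quash values that went negative. */
static inline int
underflow_by_sign(uint32_t entries)
{
	return (int)entries < 0;
}

/*
 * The alternative suggested above: in any consistent snapshot
 * 0 <= entries <= capacity must hold, so anything larger can only
 * come from a stale tail observation.
 */
static inline int
underflow_by_capacity(uint32_t entries, uint32_t capacity)
{
	return entries > capacity;
}
```

For a ring of capacity 16, a stale snapshot giving `16 + 100 - 120` wraps to 4294967292: both checks flag it. A value such as 17, however, passes the sign test but fails the capacity test.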

> >
> > > @@ -99,6 +96,13 @@ __rte_ring_headtail_move_head(struct
> > > rte_ring_headtail
> > > *d,
> > >              */
> > >             *entries = (capacity + stail - *old_head);
> > >
> > > +           /*
> > > +            * Ensure the entries calculation was not based on
> > > a stale
> > > +            * and unsafe stail observation that causes
> > > underflow.
> > > +            */
> > > +           if ((int)*entries < 0)
> > > +                   *entries = 0;
> > > +
> > >             /* check that we have enough room in ring */
> > >             if (unlikely(n > *entries))
> > >                     n = (behavior == RTE_RING_QUEUE_FIXED) ?
> > > --
> > > 2.43.0
> > >
> >
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-09-17  7:58 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-15 18:54 [PATCH 0/1] ring: correct ordering issue in head/tail update Wathsala Vithanage
2025-09-15 18:54 ` [PATCH 1/1] ring: safe partial ordering for " Wathsala Vithanage
2025-09-16 15:42   ` Bruce Richardson
2025-09-16 18:19     ` Ola Liljedahl
2025-09-17  7:47       ` Bruce Richardson
2025-09-16 22:57   ` Konstantin Ananyev
2025-09-16 23:08     ` Konstantin Ananyev
     [not found]     ` <2a611c3cf926d752a54b7655c27d6df874a2d0de.camel@arm.com>
2025-09-17  7:58       ` Konstantin Ananyev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).