[PATCH 0/1] ring: correct ordering issue in head/tail update

DPDK patches and discussions
 help / color / mirror / Atom feed

* [PATCH 0/1] ring: correct ordering issue in head/tail update
@ 2025-09-15 18:54 Wathsala Vithanage
  2025-09-15 18:54 ` [PATCH 1/1] ring: safe partial ordering for " Wathsala Vithanage
  0 siblings, 1 reply; 25+ messages in thread
From: Wathsala Vithanage @ 2025-09-15 18:54 UTC (permalink / raw)
  Cc: dev, Wathsala Vithanage

Hi all,

This patch resolves a subtle ordering issue in the ring code that could
lead to incorrect behavior under certain conditions. The change favors a
performance-conscious fix while preserving existing behavior.

For background, motivation, and validation (including Herd7 litmus
tests), please see the accompanying write-up:https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-order

Wathsala Vithanage (1):
  ring: safe partial ordering for head/tail update

 lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-15 18:54 [PATCH 0/1] ring: correct ordering issue in head/tail update Wathsala Vithanage
@ 2025-09-15 18:54 ` Wathsala Vithanage
  2025-09-16 15:42   ` Bruce Richardson
  2025-09-16 22:57   ` Konstantin Ananyev
  0 siblings, 2 replies; 25+ messages in thread
From: Wathsala Vithanage @ 2025-09-15 18:54 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Konstantin Ananyev
  Cc: dev, Wathsala Vithanage, Ola Liljedahl, Dhruv Tripathi

The function __rte_ring_headtail_move_head() assumes that the barrier
(fence) between the load of the head and the load-acquire of the
opposing tail guarantees the following: if a first thread reads tail
and then writes head and a second thread reads the new value of head
and then reads tail, then it should observe the same (or a later)
value of tail.

This assumption is incorrect under the C11 memory model. If the barrier
(fence) is intended to establish a total ordering of ring operations,
it fails to do so. Instead, the current implementation only enforces a
partial ordering, which can lead to unsafe interleavings. In particular,
some partial orders can cause underflows in free slot or available
element computations, potentially resulting in data corruption.

The issue manifests when a CPU first acts as a producer and later as a
consumer. In this scenario, the barrier assumption may fail when another
core takes the consumer role. A Herd7 litmus test in C11 can demonstrate
this violation. The problem has not been widely observed so far because:
  (a) on strong memory models (e.g., x86-64) the assumption holds, and
  (b) on relaxed models with RCsc semantics the ordering is still strong
      enough to prevent hazards.
The problem becomes visible only on weaker models, when load-acquire is
implemented with RCpc semantics (e.g. some AArch64 CPUs which support
the LDAPR and LDAPUR instructions).

Three possible solutions exist:
  1. Strengthen ordering by upgrading release/acquire semantics to
     sequential consistency. This requires using seq-cst for stores,
     loads, and CAS operations. However, this approach introduces a
     significant performance penalty on relaxed-memory architectures.

  2. Establish a safe partial order by enforcing a pair-wise
     happens-before relationship between thread of same role by changing
     the CAS and the preceding load of the head by converting them to
     release and acquire respectively. This approach makes the original
     barrier assumption unnecessary and allows its removal.

  3. Retain partial ordering but ensure only safe partial orders are
     committed. This can be done by detecting underflow conditions
     (producer < consumer) and quashing the update in such cases.
     This approach makes the original barrier assumption unnecessary
     and allows its removal.

This patch implements solution (3) for performance reasons.

Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com>
---
 lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index b9388af0da..e5ac1f6b9e 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 		/* Reset n to the initial burst count */
 		n = max;
 
-		/* Ensure the head is read before tail */
-		rte_atomic_thread_fence(rte_memory_order_acquire);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -99,6 +96,13 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 		 */
 		*entries = (capacity + stail - *old_head);
 
+		/*
+		 * Ensure the entries calculation was not based on a stale
+		 * and unsafe stail observation that causes underflow.
+		 */
+		if ((int)*entries < 0)
+			*entries = 0;
+
 		/* check that we have enough room in ring */
 		if (unlikely(n > *entries))
 			n = (behavior == RTE_RING_QUEUE_FIXED) ?
-- 
2.43.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-15 18:54 ` [PATCH 1/1] ring: safe partial ordering for " Wathsala Vithanage
@ 2025-09-16 15:42   ` Bruce Richardson
  2025-09-16 18:19     ` Ola Liljedahl
  2025-09-16 22:57   ` Konstantin Ananyev
  1 sibling, 1 reply; 25+ messages in thread
From: Bruce Richardson @ 2025-09-16 15:42 UTC (permalink / raw)
  To: Wathsala Vithanage
  Cc: Honnappa Nagarahalli, Konstantin Ananyev, dev, Ola Liljedahl,
	Dhruv Tripathi

On Mon, Sep 15, 2025 at 06:54:50PM +0000, Wathsala Vithanage wrote:
> The function __rte_ring_headtail_move_head() assumes that the barrier
> (fence) between the load of the head and the load-acquire of the
> opposing tail guarantees the following: if a first thread reads tail
> and then writes head and a second thread reads the new value of head
> and then reads tail, then it should observe the same (or a later)
> value of tail.
> 
> This assumption is incorrect under the C11 memory model. If the barrier
> (fence) is intended to establish a total ordering of ring operations,
> it fails to do so. Instead, the current implementation only enforces a
> partial ordering, which can lead to unsafe interleavings. In particular,
> some partial orders can cause underflows in free slot or available
> element computations, potentially resulting in data corruption.
> 
> The issue manifests when a CPU first acts as a producer and later as a
> consumer. In this scenario, the barrier assumption may fail when another
> core takes the consumer role. A Herd7 litmus test in C11 can demonstrate
> this violation. The problem has not been widely observed so far because:
>   (a) on strong memory models (e.g., x86-64) the assumption holds, and
>   (b) on relaxed models with RCsc semantics the ordering is still strong
>       enough to prevent hazards.
> The problem becomes visible only on weaker models, when load-acquire is
> implemented with RCpc semantics (e.g. some AArch64 CPUs which support
> the LDAPR and LDAPUR instructions).
> 
> Three possible solutions exist:
>   1. Strengthen ordering by upgrading release/acquire semantics to
>      sequential consistency. This requires using seq-cst for stores,
>      loads, and CAS operations. However, this approach introduces a
>      significant performance penalty on relaxed-memory architectures.
> 
>   2. Establish a safe partial order by enforcing a pair-wise
>      happens-before relationship between thread of same role by changing
>      the CAS and the preceding load of the head by converting them to
>      release and acquire respectively. This approach makes the original
>      barrier assumption unnecessary and allows its removal.
> 
>   3. Retain partial ordering but ensure only safe partial orders are
>      committed. This can be done by detecting underflow conditions
>      (producer < consumer) and quashing the update in such cases.
>      This approach makes the original barrier assumption unnecessary
>      and allows its removal.
> 
> This patch implements solution (3) for performance reasons.
> 
> Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com>
> ---
>  lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
Thank you for the very comprehensive write-up in the article about this.
It was very educational.

On the patch, are we sure that option #3 is safe to take as an approach? It
seems wrong to me to deliberately leave ordering issues in the code and
just try and fix them up later. Would there be a noticable performance
difference for real-world apps if we took option #2, and actually used
correct ordering semantics? I realise the perf data in the blog post about
this shows it being slower, but for enqueues and dequeues of bursts, of
e.g. 8, rather than just 1, is there a very big delta?

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-16 15:42   ` Bruce Richardson
@ 2025-09-16 18:19     ` Ola Liljedahl
  2025-09-17  7:47       ` Bruce Richardson
  0 siblings, 1 reply; 25+ messages in thread
From: Ola Liljedahl @ 2025-09-16 18:19 UTC (permalink / raw)
  To: Bruce Richardson, Wathsala Vithanage
  Cc: Honnappa Nagarahalli, Konstantin Ananyev, dev, Dhruv Tripathi

(I am sending this from Outlook, I hope I have been able to fake a proper
email client)

> On 2025-09-16, 17:43, "Bruce Richardson" <bruce.richardson@intel.com > <mailto:bruce.richardson@intel.com>> wrote:
>
>
> On Mon, Sep 15, 2025 at 06:54:50PM +0000, Wathsala Vithanage wrote:
> > The function __rte_ring_headtail_move_head() assumes that the barrier
> > (fence) between the load of the head and the load-acquire of the
> > opposing tail guarantees the following: if a first thread reads tail
> > and then writes head and a second thread reads the new value of head
> > and then reads tail, then it should observe the same (or a later)
> > value of tail.
> >
> > This assumption is incorrect under the C11 memory model. If the barrier
> > (fence) is intended to establish a total ordering of ring operations,
> > it fails to do so. Instead, the current implementation only enforces a
> > partial ordering, which can lead to unsafe interleavings. In particular,
> > some partial orders can cause underflows in free slot or available
> > element computations, potentially resulting in data corruption.
> >
> > The issue manifests when a CPU first acts as a producer and later as a
> > consumer. In this scenario, the barrier assumption may fail when another
> > core takes the consumer role. A Herd7 litmus test in C11 can demonstrate
> > this violation. The problem has not been widely observed so far because:
> > (a) on strong memory models (e.g., x86-64) the assumption holds, and
> > (b) on relaxed models with RCsc semantics the ordering is still strong
> > enough to prevent hazards.
> > The problem becomes visible only on weaker models, when load-acquire is
> > implemented with RCpc semantics (e.g. some AArch64 CPUs which support
> > the LDAPR and LDAPUR instructions).
> >
> > Three possible solutions exist:
> > 1. Strengthen ordering by upgrading release/acquire semantics to
> > sequential consistency. This requires using seq-cst for stores,
> > loads, and CAS operations. However, this approach introduces a
> > significant performance penalty on relaxed-memory architectures.
> >
> > 2. Establish a safe partial order by enforcing a pair-wise
> > happens-before relationship between thread of same role by changing
> > the CAS and the preceding load of the head by converting them to
> > release and acquire respectively. This approach makes the original
> > barrier assumption unnecessary and allows its removal.
> >
> > 3. Retain partial ordering but ensure only safe partial orders are
> > committed. This can be done by detecting underflow conditions
> > (producer < consumer) and quashing the update in such cases.
> > This approach makes the original barrier assumption unnecessary
> > and allows its removal.
> >
> > This patch implements solution (3) for performance reasons.
> >
> > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com <mailto:wathsala.vithanage@arm.com>>
> > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com <mailto:ola.liljedahl@arm.com>>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com <mailto:honnappa.nagarahalli@arm.com>>
> > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com <mailto:dhruv.tripathi@arm.com>>
> > ---
> > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > 1 file changed, 7 insertions(+), 3 deletions(-)
> >
> Thank you for the very comprehensive write-up in the article about this.
> It was very educational.
>
>
> On the patch, are we sure that option #3 is safe to take as an approach? It
> seems wrong to me to deliberately leave ordering issues in the code and
> just try and fix them up later. Would there be a noticable performance
> difference for real-world apps if we took option #2, and actually used
> correct ordering semantics?
I am pretty sure that all necessary orderings for safely transferring elements
from producers to consumers (and empty slots from consumers to producers)
are present in the code (I still have some questions on the use of memory
order relaxed in __rte_ring_update_tail, we should create a litmus test for
this, to see what is required by the C memory model). What other metrics
for correctness do you suggest?

Some data structures are specified or expected (explicitly or implicitly) to be
e.g. sequentially consistent, someone claimed this is true for some
synchronization mechanisms in the Linux kernel and I have seen such claims
for the POSIX mutex. Are there any similar requirements for DPDK ring buffers?

- Ola

> I realise the perf data in the blog post about
> this shows it being slower, but for enqueues and dequeues of bursts, of
> e.g. 8, rather than just 1, is there a very big delta?
>
>
> Regards,
> /Bruce



IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-15 18:54 ` [PATCH 1/1] ring: safe partial ordering for " Wathsala Vithanage
  2025-09-16 15:42   ` Bruce Richardson
@ 2025-09-16 22:57   ` Konstantin Ananyev
  2025-09-16 23:08     ` Konstantin Ananyev
       [not found]     ` <2a611c3cf926d752a54b7655c27d6df874a2d0de.camel@arm.com>
  1 sibling, 2 replies; 25+ messages in thread
From: Konstantin Ananyev @ 2025-09-16 22:57 UTC (permalink / raw)
  To: Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Ola Liljedahl, Dhruv Tripathi



> The function __rte_ring_headtail_move_head() assumes that the barrier
> (fence) between the load of the head and the load-acquire of the
> opposing tail guarantees the following: if a first thread reads tail
> and then writes head and a second thread reads the new value of head
> and then reads tail, then it should observe the same (or a later)
> value of tail.
> 
> This assumption is incorrect under the C11 memory model. If the barrier
> (fence) is intended to establish a total ordering of ring operations,
> it fails to do so. Instead, the current implementation only enforces a
> partial ordering, which can lead to unsafe interleavings. In particular,
> some partial orders can cause underflows in free slot or available
> element computations, potentially resulting in data corruption.

Hmm... sounds exactly like the problem from the patch we discussed earlier that year:
https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-konstantin.ananyev@huawei.com/
In two words:
"... thread can see 'latest' 'cons.head' value, with 'previous' value for 'prod.tail' or visa-versa.
In other words: 'cons.head' value depends on 'prod.tail', so before making latest 'cons.head'
value visible to other threads, we need to ensure that latest 'prod.tail' is also visible."
Is that the one?

> The issue manifests when a CPU first acts as a producer and later as a
> consumer. In this scenario, the barrier assumption may fail when another
> core takes the consumer role. A Herd7 litmus test in C11 can demonstrate
> this violation. The problem has not been widely observed so far because:
>   (a) on strong memory models (e.g., x86-64) the assumption holds, and
>   (b) on relaxed models with RCsc semantics the ordering is still strong
>       enough to prevent hazards.
> The problem becomes visible only on weaker models, when load-acquire is
> implemented with RCpc semantics (e.g. some AArch64 CPUs which support
> the LDAPR and LDAPUR instructions).
> 
> Three possible solutions exist:
>   1. Strengthen ordering by upgrading release/acquire semantics to
>      sequential consistency. This requires using seq-cst for stores,
>      loads, and CAS operations. However, this approach introduces a
>      significant performance penalty on relaxed-memory architectures.
> 
>   2. Establish a safe partial order by enforcing a pair-wise
>      happens-before relationship between thread of same role by changing
>      the CAS and the preceding load of the head by converting them to
>      release and acquire respectively. This approach makes the original
>      barrier assumption unnecessary and allows its removal.

For the sake of clarity, can you outline what would be exact code changes for
approach #2? Same as in that patch:
https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-konstantin.ananyev@huawei.com/
Or something different?
 
>   3. Retain partial ordering but ensure only safe partial orders are
>      committed. This can be done by detecting underflow conditions
>      (producer < consumer) and quashing the update in such cases.
>      This approach makes the original barrier assumption unnecessary
>      and allows its removal.

> This patch implements solution (3) for performance reasons.
> 
> Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com>
> ---
>  lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index b9388af0da..e5ac1f6b9e 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail
> *d,
>  		/* Reset n to the initial burst count */
>  		n = max;
> 
> -		/* Ensure the head is read before tail */
> -		rte_atomic_thread_fence(rte_memory_order_acquire);
> -
>  		/* load-acquire synchronize with store-release of ht->tail
>  		 * in update_tail.
>  		 */

But then cons.head can be read a before prod.tail (and visa-versa), right?

> @@ -99,6 +96,13 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail
> *d,
>  		 */
>  		*entries = (capacity + stail - *old_head);
> 
> +		/*
> +		 * Ensure the entries calculation was not based on a stale
> +		 * and unsafe stail observation that causes underflow.
> +		 */
> +		if ((int)*entries < 0)
> +			*entries = 0;
> +
>  		/* check that we have enough room in ring */
>  		if (unlikely(n > *entries))
>  			n = (behavior == RTE_RING_QUEUE_FIXED) ?
> --
> 2.43.0
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-16 22:57   ` Konstantin Ananyev
@ 2025-09-16 23:08     ` Konstantin Ananyev
       [not found]     ` <2a611c3cf926d752a54b7655c27d6df874a2d0de.camel@arm.com>
  1 sibling, 0 replies; 25+ messages in thread
From: Konstantin Ananyev @ 2025-09-16 23:08 UTC (permalink / raw)
  To: Konstantin Ananyev, Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Ola Liljedahl, Dhruv Tripathi



> 
> 
> > The function __rte_ring_headtail_move_head() assumes that the barrier
> > (fence) between the load of the head and the load-acquire of the
> > opposing tail guarantees the following: if a first thread reads tail
> > and then writes head and a second thread reads the new value of head
> > and then reads tail, then it should observe the same (or a later)
> > value of tail.
> >
> > This assumption is incorrect under the C11 memory model. If the barrier
> > (fence) is intended to establish a total ordering of ring operations,
> > it fails to do so. Instead, the current implementation only enforces a
> > partial ordering, which can lead to unsafe interleavings. In particular,
> > some partial orders can cause underflows in free slot or available
> > element computations, potentially resulting in data corruption.
> 
> Hmm... sounds exactly like the problem from the patch we discussed earlier that
> year:
> https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-
> konstantin.ananyev@huawei.com/
> In two words:
> "... thread can see 'latest' 'cons.head' value, with 'previous' value for 'prod.tail' or
> visa-versa.
> In other words: 'cons.head' value depends on 'prod.tail', so before making latest
> 'cons.head'
> value visible to other threads, we need to ensure that latest 'prod.tail' is also visible."
> Is that the one?
> 
> > The issue manifests when a CPU first acts as a producer and later as a
> > consumer. In this scenario, the barrier assumption may fail when another
> > core takes the consumer role. A Herd7 litmus test in C11 can demonstrate
> > this violation. The problem has not been widely observed so far because:
> >   (a) on strong memory models (e.g., x86-64) the assumption holds, and
> >   (b) on relaxed models with RCsc semantics the ordering is still strong
> >       enough to prevent hazards.
> > The problem becomes visible only on weaker models, when load-acquire is
> > implemented with RCpc semantics (e.g. some AArch64 CPUs which support
> > the LDAPR and LDAPUR instructions).
> >
> > Three possible solutions exist:
> >   1. Strengthen ordering by upgrading release/acquire semantics to
> >      sequential consistency. This requires using seq-cst for stores,
> >      loads, and CAS operations. However, this approach introduces a
> >      significant performance penalty on relaxed-memory architectures.
> >
> >   2. Establish a safe partial order by enforcing a pair-wise
> >      happens-before relationship between thread of same role by changing
> >      the CAS and the preceding load of the head by converting them to
> >      release and acquire respectively. This approach makes the original
> >      barrier assumption unnecessary and allows its removal.
> 
> For the sake of clarity, can you outline what would be exact code changes for
> approach #2? Same as in that patch:
> https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-
> konstantin.ananyev@huawei.com/
> Or something different?
> 
> >   3. Retain partial ordering but ensure only safe partial orders are
> >      committed. This can be done by detecting underflow conditions
> >      (producer < consumer) and quashing the update in such cases.
> >      This approach makes the original barrier assumption unnecessary
> >      and allows its removal.
> 
> > This patch implements solution (3) for performance reasons.
> >
> > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com>
> > ---
> >  lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> > index b9388af0da..e5ac1f6b9e 100644
> > --- a/lib/ring/rte_ring_c11_pvt.h
> > +++ b/lib/ring/rte_ring_c11_pvt.h
> > @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail
> > *d,
> >  		/* Reset n to the initial burst count */
> >  		n = max;
> >
> > -		/* Ensure the head is read before tail */
> > -		rte_atomic_thread_fence(rte_memory_order_acquire);
> > -
> >  		/* load-acquire synchronize with store-release of ht->tail
> >  		 * in update_tail.
> >  		 */
> 
> But then cons.head can be read a before prod.tail (and visa-versa), right?
s/before/after/
 
> 
> > @@ -99,6 +96,13 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail
> > *d,
> >  		 */
> >  		*entries = (capacity + stail - *old_head);
> >
> > +		/*
> > +		 * Ensure the entries calculation was not based on a stale
> > +		 * and unsafe stail observation that causes underflow.
> > +		 */
> > +		if ((int)*entries < 0)
> > +			*entries = 0;
> > +
> >  		/* check that we have enough room in ring */
> >  		if (unlikely(n > *entries))
> >  			n = (behavior == RTE_RING_QUEUE_FIXED) ?
> > --
> > 2.43.0
> >


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-16 18:19     ` Ola Liljedahl
@ 2025-09-17  7:47       ` Bruce Richardson
  2025-09-17 15:06         ` Stephen Hemminger
  2025-09-18 17:40         ` Wathsala Vithanage
  0 siblings, 2 replies; 25+ messages in thread
From: Bruce Richardson @ 2025-09-17  7:47 UTC (permalink / raw)
  To: Ola Liljedahl
  Cc: Wathsala Vithanage, Honnappa Nagarahalli, Konstantin Ananyev,
	dev, Dhruv Tripathi

On Tue, Sep 16, 2025 at 06:19:41PM +0000, Ola Liljedahl wrote:
> (I am sending this from Outlook, I hope I have been able to fake a proper
> email client)
> 
> > On 2025-09-16, 17:43, "Bruce Richardson" <bruce.richardson@intel.com > <mailto:bruce.richardson@intel.com>> wrote:
> >
> >
> > On Mon, Sep 15, 2025 at 06:54:50PM +0000, Wathsala Vithanage wrote:
> > > The function __rte_ring_headtail_move_head() assumes that the barrier
> > > (fence) between the load of the head and the load-acquire of the
> > > opposing tail guarantees the following: if a first thread reads tail
> > > and then writes head and a second thread reads the new value of head
> > > and then reads tail, then it should observe the same (or a later)
> > > value of tail.
> > >
> > > This assumption is incorrect under the C11 memory model. If the barrier
> > > (fence) is intended to establish a total ordering of ring operations,
> > > it fails to do so. Instead, the current implementation only enforces a
> > > partial ordering, which can lead to unsafe interleavings. In particular,
> > > some partial orders can cause underflows in free slot or available
> > > element computations, potentially resulting in data corruption.
> > >
> > > The issue manifests when a CPU first acts as a producer and later as a
> > > consumer. In this scenario, the barrier assumption may fail when another
> > > core takes the consumer role. A Herd7 litmus test in C11 can demonstrate
> > > this violation. The problem has not been widely observed so far because:
> > > (a) on strong memory models (e.g., x86-64) the assumption holds, and
> > > (b) on relaxed models with RCsc semantics the ordering is still strong
> > > enough to prevent hazards.
> > > The problem becomes visible only on weaker models, when load-acquire is
> > > implemented with RCpc semantics (e.g. some AArch64 CPUs which support
> > > the LDAPR and LDAPUR instructions).
> > >
> > > Three possible solutions exist:
> > > 1. Strengthen ordering by upgrading release/acquire semantics to
> > > sequential consistency. This requires using seq-cst for stores,
> > > loads, and CAS operations. However, this approach introduces a
> > > significant performance penalty on relaxed-memory architectures.
> > >
> > > 2. Establish a safe partial order by enforcing a pair-wise
> > > happens-before relationship between thread of same role by changing
> > > the CAS and the preceding load of the head by converting them to
> > > release and acquire respectively. This approach makes the original
> > > barrier assumption unnecessary and allows its removal.
> > >
> > > 3. Retain partial ordering but ensure only safe partial orders are
> > > committed. This can be done by detecting underflow conditions
> > > (producer < consumer) and quashing the update in such cases.
> > > This approach makes the original barrier assumption unnecessary
> > > and allows its removal.
> > >
> > > This patch implements solution (3) for performance reasons.
> > >
> > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com <mailto:wathsala.vithanage@arm.com>>
> > > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com <mailto:ola.liljedahl@arm.com>>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com <mailto:honnappa.nagarahalli@arm.com>>
> > > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com <mailto:dhruv.tripathi@arm.com>>
> > > ---
> > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > > 1 file changed, 7 insertions(+), 3 deletions(-)
> > >
> > Thank you for the very comprehensive write-up in the article about this.
> > It was very educational.
> >
> >
> > On the patch, are we sure that option #3 is safe to take as an approach? It
> > seems wrong to me to deliberately leave ordering issues in the code and
> > just try and fix them up later. Would there be a noticable performance
> > difference for real-world apps if we took option #2, and actually used
> > correct ordering semantics?
> I am pretty sure that all necessary orderings for safely transferring elements
> from producers to consumers (and empty slots from consumers to producers)
> are present in the code (I still have some questions on the use of memory
> order relaxed in __rte_ring_update_tail, we should create a litmus test for
> this, to see what is required by the C memory model). What other metrics
> for correctness do you suggest?
>

Not suggesting any other metrics for correctness. I'm instead just wanting
to double-check the cost-benefit of taking the approach of putting in a
fix-up, or workaround, in the code here, rather than actually correcting
the memory ordering use.  Also, given that the workaround uses subtraction
to detect underflow, are we 100% sure that we have guaranteed correct
behaviour on counter wraparound at uint32_t max?

/Bruce 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
       [not found]     ` <2a611c3cf926d752a54b7655c27d6df874a2d0de.camel@arm.com>
@ 2025-09-17  7:58       ` Konstantin Ananyev
  2025-09-17  9:05         ` Ola Liljedahl
  0 siblings, 1 reply; 25+ messages in thread
From: Konstantin Ananyev @ 2025-09-17  7:58 UTC (permalink / raw)
  To: Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Ola Liljedahl, Dhruv Tripathi, Bruce Richardson

To avoid information loss I combined reply to two Wathsala replies into one.

> > > The function __rte_ring_headtail_move_head() assumes that the
> > > barrier
> > (fence) between the load of the head and the load-acquire of the
> > > opposing tail guarantees the following: if a first thread reads
> > > tail
> > > and then writes head and a second thread reads the new value of
> > > head
> > > and then reads tail, then it should observe the same (or a later)
> > > value of tail.
> > >
> > > This assumption is incorrect under the C11 memory model. If the
> > > barrier
> > > (fence) is intended to establish a total ordering of ring
> > > operations,
> > > it fails to do so. Instead, the current implementation only
> > > enforces a
> > > partial ordering, which can lead to unsafe interleavings. In
> > > particular,
> > > some partial orders can cause underflows in free slot or available
> > > element computations, potentially resulting in data corruption.
> >
> > Hmm... sounds exactly like the problem from the patch we discussed
> > earlier that year:
> > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-konstantin.ananyev@huawei.com/
> > In two words:
> > "... thread can see 'latest' 'cons.head' value, with 'previous' value
> > for 'prod.tail' or visa-versa.
> > In other words: 'cons.head' value depends on 'prod.tail', so before
> > making latest 'cons.head'
> > value visible to other threads, we need to ensure that latest
> > 'prod.tail' is also visible."
> > Is that the one?

> Yes, the behavior occurs under RCpc (LDAPR) but not under RCsc (LDAR),
> which is why we didn’t catch it earlier. A fuller explanation, with
> Herd7 simulations, is in the blog post linked in the cover letter.
>
> https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-order

I see, so now it is reproducible with core rte_ring on real HW.

> >
> > > The issue manifests when a CPU first acts as a producer and later
> > > as a
> > > consumer. In this scenario, the barrier assumption may fail when
> > > another
> > > core takes the consumer role. A Herd7 litmus test in C11 can
> > > demonstrate
> > > this violation. The problem has not been widely observed so far
> > > because:
> > >   (a) on strong memory models (e.g., x86-64) the assumption holds,
> > > and
> > >   (b) on relaxed models with RCsc semantics the ordering is still
> > > strong
> > >       enough to prevent hazards.
> > > The problem becomes visible only on weaker models, when load-
> > > acquire is
> > > implemented with RCpc semantics (e.g. some AArch64 CPUs which
> > > support
> > > the LDAPR and LDAPUR instructions).
> > >
> > > Three possible solutions exist:
> > >   1. Strengthen ordering by upgrading release/acquire semantics to
> > >      sequential consistency. This requires using seq-cst for
> > > stores,
> > >      loads, and CAS operations. However, this approach introduces a
> > >      significant performance penalty on relaxed-memory
> > > architectures.
> > >
> > >   2. Establish a safe partial order by enforcing a pair-wise
> > >      happens-before relationship between thread of same role by
> > > changing
> > >      the CAS and the preceding load of the head by converting them
> > > to
> > >      release and acquire respectively. This approach makes the
> > > original
> > >      barrier assumption unnecessary and allows its removal.
> >
> > For the sake of clarity, can you outline what would be exact code
> > changes for
> > approach #2? Same as in that patch:
> > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-
> konstantin.ananyev@huawei.com/
> > Or something different?
> 
> Sorry, I missed the later half you your comment before.
> Yes, you have proposed the same solution there.

Ok,  thanks for confirmation.

> >
> >
> > >   3. Retain partial ordering but ensure only safe partial orders
> > > are
> > >      committed. This can be done by detecting underflow conditions
> > >      (producer < consumer) and quashing the update in such cases.
> > >      This approach makes the original barrier assumption
> > > unnecessary
> > >      and allows its removal.
> >
> > > This patch implements solution (3) for performance reasons.
> > >
> > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> > > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com>
> > > ---
> > >  lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/lib/ring/rte_ring_c11_pvt.h
> > > b/lib/ring/rte_ring_c11_pvt.h
> > > index b9388af0da..e5ac1f6b9e 100644
> > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct
> > > rte_ring_headtail
> > > *d,
> > >             /* Reset n to the initial burst count */
> > >             n = max;
> > >
> > > -           /* Ensure the head is read before tail */
> > > -           rte_atomic_thread_fence(rte_memory_order_acquire);
> > > -
> > >             /* load-acquire synchronize with store-release of
> > > ht->tail
> > >              * in update_tail.
> > >              */
> >
> > But then cons.head can be read a before prod.tail (and visa-versa),
> > right?
> 
> Right, we let it happen but eliminate any resulting states that are
> semantically incorrect at the end.

Two comments here:
1) I think it is probably  safer to do the check like that: 
If (*entries > ring->capacity) ...
2) My concern that without forcing a proper read ordering
(cons.head first then prod.tail) we re-introduce a window for all sorts of
ABA-like problems.
ring: guarantee load/load order in enqueue and dequeue
commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654
Author: Jia He <justin.he@arm.com>
Date:   Fri Nov 10 03:30:42 2017 +0000

> >
> > > @@ -99,6 +96,13 @@ __rte_ring_headtail_move_head(struct
> > > rte_ring_headtail
> > > *d,
> > >              */
> > >             *entries = (capacity + stail - *old_head);
> > >
> > > +           /*
> > > +            * Ensure the entries calculation was not based on
> > > a stale
> > > +            * and unsafe stail observation that causes
> > > underflow.
> > > +            */
> > > +           if ((int)*entries < 0)
> > > +                   *entries = 0;
> > > +
> > >             /* check that we have enough room in ring */
> > >             if (unlikely(n > *entries))
> > >                     n = (behavior == RTE_RING_QUEUE_FIXED) ?
> > > --
> > > 2.43.0
> > >
> >
> 
> IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended recipient, please
> notify the sender immediately and do not disclose the contents to any other person,
> use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-17  7:58       ` Konstantin Ananyev
@ 2025-09-17  9:05         ` Ola Liljedahl
  2025-09-20 12:01           ` Konstantin Ananyev
  0 siblings, 1 reply; 25+ messages in thread
From: Ola Liljedahl @ 2025-09-17  9:05 UTC (permalink / raw)
  To: Konstantin Ananyev, Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Dhruv Tripathi, Bruce Richardson

> On 2025-09-17, 09:58, "Konstantin Ananyev" <konstantin.ananyev@huawei.com <mailto:konstantin.ananyev@huawei.com>> wrote:
>
> To avoid information loss I combined reply to two Wathsala replies into one.
>
>
> > > > The function __rte_ring_headtail_move_head() assumes that the
> > > > barrier
> > > (fence) between the load of the head and the load-acquire of the
> > > > opposing tail guarantees the following: if a first thread reads
> > > > tail
> > > > and then writes head and a second thread reads the new value of
> > > > head
> > > > and then reads tail, then it should observe the same (or a later)
> > > > value of tail.
> > > >
> > > > This assumption is incorrect under the C11 memory model. If the
> > > > barrier
> > > > (fence) is intended to establish a total ordering of ring
> > > > operations,
> > > > it fails to do so. Instead, the current implementation only
> > > > enforces a
> > > > partial ordering, which can lead to unsafe interleavings. In
> > > > particular,
> > > > some partial orders can cause underflows in free slot or available
> > > > element computations, potentially resulting in data corruption.
> > >
> > > Hmm... sounds exactly like the problem from the patch we discussed
> > > earlier that year:
> > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-konstantin.ananyev@huawei.com <mailto:20250521111432.207936-4-konstantin.ananyev@huawei.com>/
> > > In two words:
> > > "... thread can see 'latest' 'cons.head' value, with 'previous' value
> > > for 'prod.tail' or visa-versa.
> > > In other words: 'cons.head' value depends on 'prod.tail', so before
> > > making latest 'cons.head'
> > > value visible to other threads, we need to ensure that latest
> > > 'prod.tail' is also visible."
> > > Is that the one?
>
>
> > Yes, the behavior occurs under RCpc (LDAPR) but not under RCsc (LDAR),
> > which is why we didn’t catch it earlier. A fuller explanation, with
> > Herd7 simulations, is in the blog post linked in the cover letter.
> >
> > https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-order <https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-order>
>
>
> I see, so now it is reproducible with core rte_ring on real HW.
>
>
> > >
> > > > The issue manifests when a CPU first acts as a producer and later
> > > > as a
> > > > consumer. In this scenario, the barrier assumption may fail when
> > > > another
> > > > core takes the consumer role. A Herd7 litmus test in C11 can
> > > > demonstrate
> > > > this violation. The problem has not been widely observed so far
> > > > because:
> > > > (a) on strong memory models (e.g., x86-64) the assumption holds,
> > > > and
> > > > (b) on relaxed models with RCsc semantics the ordering is still
> > > > strong
> > > > enough to prevent hazards.
> > > > The problem becomes visible only on weaker models, when load-
> > > > acquire is
> > > > implemented with RCpc semantics (e.g. some AArch64 CPUs which
> > > > support
> > > > the LDAPR and LDAPUR instructions).
> > > >
> > > > Three possible solutions exist:
> > > > 1. Strengthen ordering by upgrading release/acquire semantics to
> > > > sequential consistency. This requires using seq-cst for
> > > > stores,
> > > > loads, and CAS operations. However, this approach introduces a
> > > > significant performance penalty on relaxed-memory
> > > > architectures.
> > > >
> > > > 2. Establish a safe partial order by enforcing a pair-wise
> > > > happens-before relationship between thread of same role by
> > > > changing
> > > > the CAS and the preceding load of the head by converting them
> > > > to
> > > > release and acquire respectively. This approach makes the
> > > > original
> > > > barrier assumption unnecessary and allows its removal.
> > >
> > > For the sake of clarity, can you outline what would be exact code
> > > changes for
> > > approach #2? Same as in that patch:
> > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4- <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4->
> > konstantin.ananyev@huawei.com <mailto:konstantin.ananyev@huawei.com>/
> > > Or something different?
> >
> > Sorry, I missed the later half you your comment before.
> > Yes, you have proposed the same solution there.
>
>
> Ok, thanks for confirmation.
>
>
> > >
> > >
> > > > 3. Retain partial ordering but ensure only safe partial orders
> > > > are
> > > > committed. This can be done by detecting underflow conditions
> > > > (producer < consumer) and quashing the update in such cases.
> > > > This approach makes the original barrier assumption
> > > > unnecessary
> > > > and allows its removal.
> > >
> > > > This patch implements solution (3) for performance reasons.
> > > >
> > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com <mailto:wathsala.vithanage@arm.com>>
> > > > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com <mailto:ola.liljedahl@arm.com>>
> > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com <mailto:honnappa.nagarahalli@arm.com>>
> > > > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com <mailto:dhruv.tripathi@arm.com>>
> > > > ---
> > > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > > > 1 file changed, 7 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/lib/ring/rte_ring_c11_pvt.h
> > > > b/lib/ring/rte_ring_c11_pvt.h
> > > > index b9388af0da..e5ac1f6b9e 100644
> > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct
> > > > rte_ring_headtail
> > > > *d,
> > > > /* Reset n to the initial burst count */
> > > > n = max;
> > > >
> > > > - /* Ensure the head is read before tail */
> > > > - rte_atomic_thread_fence(rte_memory_order_acquire);
> > > > -
> > > > /* load-acquire synchronize with store-release of
> > > > ht->tail
> > > > * in update_tail.
> > > > */
> > >
> > > But then cons.head can be read a before prod.tail (and visa-versa),
> > > right?
> >
> > Right, we let it happen but eliminate any resulting states that are
> > semantically incorrect at the end.
>
>
> Two comments here:
> 1) I think it is probably safer to do the check like that:
> If (*entries > ring->capacity) ...
Yes, this might be another way of handling underflow situations. We could study this.

I have used the check for negative without problems in my ring buffer implementations
https://github.com/ARM-software/progress64/blob/master/src/p64_ringbuf.c
but can't say that has been battle-tested.

> 2) My concern that without forcing a proper read ordering
> (cons.head first then prod.tail) we re-introduce a window for all sorts of
> ABA-like problems.
Head and tail indexes are monotonically increasing so I don't see a risk for ABA-like problems.
Indeed, adding a monotonically increasing tag to pointers is the common way of avoiding ABA
problems in lock-free designs.

- Ola

Ignore the following auto-added disclaimer.

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-17  7:47       ` Bruce Richardson
@ 2025-09-17 15:06         ` Stephen Hemminger
  2025-09-18 17:40         ` Wathsala Vithanage
  1 sibling, 0 replies; 25+ messages in thread
From: Stephen Hemminger @ 2025-09-17 15:06 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Ola Liljedahl, Wathsala Vithanage, Honnappa Nagarahalli,
	Konstantin Ananyev, dev, Dhruv Tripathi

On Wed, 17 Sep 2025 08:47:53 +0100
Bruce Richardson <bruce.richardson@intel.com> wrote:

> On Tue, Sep 16, 2025 at 06:19:41PM +0000, Ola Liljedahl wrote:
> > (I am sending this from Outlook, I hope I have been able to fake a proper
> > email client)
> >   
> > > On 2025-09-16, 17:43, "Bruce Richardson" <bruce.richardson@intel.com > <mailto:bruce.richardson@intel.com>> wrote:
> > >
> > >
> > > On Mon, Sep 15, 2025 at 06:54:50PM +0000, Wathsala Vithanage wrote:  
> > > > The function __rte_ring_headtail_move_head() assumes that the barrier
> > > > (fence) between the load of the head and the load-acquire of the
> > > > opposing tail guarantees the following: if a first thread reads tail
> > > > and then writes head and a second thread reads the new value of head
> > > > and then reads tail, then it should observe the same (or a later)
> > > > value of tail.
> > > >
> > > > This assumption is incorrect under the C11 memory model. If the barrier
> > > > (fence) is intended to establish a total ordering of ring operations,
> > > > it fails to do so. Instead, the current implementation only enforces a
> > > > partial ordering, which can lead to unsafe interleavings. In particular,
> > > > some partial orders can cause underflows in free slot or available
> > > > element computations, potentially resulting in data corruption.
> > > >
> > > > The issue manifests when a CPU first acts as a producer and later as a
> > > > consumer. In this scenario, the barrier assumption may fail when another
> > > > core takes the consumer role. A Herd7 litmus test in C11 can demonstrate
> > > > this violation. The problem has not been widely observed so far because:
> > > > (a) on strong memory models (e.g., x86-64) the assumption holds, and
> > > > (b) on relaxed models with RCsc semantics the ordering is still strong
> > > > enough to prevent hazards.
> > > > The problem becomes visible only on weaker models, when load-acquire is
> > > > implemented with RCpc semantics (e.g. some AArch64 CPUs which support
> > > > the LDAPR and LDAPUR instructions).
> > > >
> > > > Three possible solutions exist:
> > > > 1. Strengthen ordering by upgrading release/acquire semantics to
> > > > sequential consistency. This requires using seq-cst for stores,
> > > > loads, and CAS operations. However, this approach introduces a
> > > > significant performance penalty on relaxed-memory architectures.
> > > >
> > > > 2. Establish a safe partial order by enforcing a pair-wise
> > > > happens-before relationship between thread of same role by changing
> > > > the CAS and the preceding load of the head by converting them to
> > > > release and acquire respectively. This approach makes the original
> > > > barrier assumption unnecessary and allows its removal.
> > > >
> > > > 3. Retain partial ordering but ensure only safe partial orders are
> > > > committed. This can be done by detecting underflow conditions
> > > > (producer < consumer) and quashing the update in such cases.
> > > > This approach makes the original barrier assumption unnecessary
> > > > and allows its removal.
> > > >
> > > > This patch implements solution (3) for performance reasons.
> > > >
> > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com <mailto:wathsala.vithanage@arm.com>>
> > > > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com <mailto:ola.liljedahl@arm.com>>
> > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com <mailto:honnappa.nagarahalli@arm.com>>
> > > > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com <mailto:dhruv.tripathi@arm.com>>
> > > > ---
> > > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > > > 1 file changed, 7 insertions(+), 3 deletions(-)
> > > >  
> > > Thank you for the very comprehensive write-up in the article about this.
> > > It was very educational.
> > >
> > >
> > > On the patch, are we sure that option #3 is safe to take as an approach? It
> > > seems wrong to me to deliberately leave ordering issues in the code and
> > > just try and fix them up later. Would there be a noticable performance
> > > difference for real-world apps if we took option #2, and actually used
> > > correct ordering semantics?  
> > I am pretty sure that all necessary orderings for safely transferring elements
> > from producers to consumers (and empty slots from consumers to producers)
> > are present in the code (I still have some questions on the use of memory
> > order relaxed in __rte_ring_update_tail, we should create a litmus test for
> > this, to see what is required by the C memory model). What other metrics
> > for correctness do you suggest?
> >  
> 
> Not suggesting any other metrics for correctness. I'm instead just wanting
> to double-check the cost-benefit of taking the approach of putting in a
> fix-up, or workaround, in the code here, rather than actually correcting
> the memory ordering use.  Also, given that the workaround uses subtraction
> to detect underflow, are we 100% sure that we have guaranteed correct
> behaviour on counter wraparound at uint32_t max?

If you look at the code rabbit review demo, it flagged the same possible
underflow issue.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-17  7:47       ` Bruce Richardson
  2025-09-17 15:06         ` Stephen Hemminger
@ 2025-09-18 17:40         ` Wathsala Vithanage
  1 sibling, 0 replies; 25+ messages in thread
From: Wathsala Vithanage @ 2025-09-18 17:40 UTC (permalink / raw)
  To: Ola Liljedahl, bruce.richardson
  Cc: Honnappa Nagarahalli, konstantin.ananyev, dev, Dhruv Tripathi

On Wed, 2025-09-17 at 08:47 +0100, Bruce Richardson wrote:
> On Tue, Sep 16, 2025 at 06:19:41PM +0000, Ola Liljedahl wrote:
> > (I am sending this from Outlook, I hope I have been able to fake a
> > proper
> > email client)
> >
> > > On 2025-09-16, 17:43, "Bruce Richardson"
> > > <bruce.richardson@intel.com >
> > > <mailto:bruce.richardson@intel.com>> wrote:
> > >
> > >
> > > On Mon, Sep 15, 2025 at 06:54:50PM +0000, Wathsala Vithanage
> > > wrote:
> > > > The function __rte_ring_headtail_move_head() assumes that the
> > > > barrier
> > > > (fence) between the load of the head and the load-acquire of
> > > > the
> > > > opposing tail guarantees the following: if a first thread reads
> > > > tail
> > > > and then writes head and a second thread reads the new value of
> > > > head
> > > > and then reads tail, then it should observe the same (or a
> > > > later)
> > > > value of tail.
> > > >
> > > > This assumption is incorrect under the C11 memory model. If the
> > > > barrier
> > > > (fence) is intended to establish a total ordering of ring
> > > > operations,
> > > > it fails to do so. Instead, the current implementation only
> > > > enforces a
> > > > partial ordering, which can lead to unsafe interleavings. In
> > > > particular,
> > > > some partial orders can cause underflows in free slot or
> > > > available
> > > > element computations, potentially resulting in data corruption.
> > > >
> > > > The issue manifests when a CPU first acts as a producer and
> > > > later as a
> > > > consumer. In this scenario, the barrier assumption may fail
> > > > when another
> > > > core takes the consumer role. A Herd7 litmus test in C11 can
> > > > demonstrate
> > > > this violation. The problem has not been widely observed so far
> > > > because:
> > > > (a) on strong memory models (e.g., x86-64) the assumption
> > > > holds, and
> > > > (b) on relaxed models with RCsc semantics the ordering is still
> > > > strong
> > > > enough to prevent hazards.
> > > > The problem becomes visible only on weaker models, when load-
> > > > acquire is
> > > > implemented with RCpc semantics (e.g. some AArch64 CPUs which
> > > > support
> > > > the LDAPR and LDAPUR instructions).
> > > >
> > > > Three possible solutions exist:
> > > > 1. Strengthen ordering by upgrading release/acquire semantics
> > > > to
> > > > sequential consistency. This requires using seq-cst for stores,
> > > > loads, and CAS operations. However, this approach introduces a
> > > > significant performance penalty on relaxed-memory
> > > > architectures.
> > > >
> > > > 2. Establish a safe partial order by enforcing a pair-wise
> > > > happens-before relationship between thread of same role by
> > > > changing
> > > > the CAS and the preceding load of the head by converting them
> > > > to
> > > > release and acquire respectively. This approach makes the
> > > > original
> > > > barrier assumption unnecessary and allows its removal.
> > > >
> > > > 3. Retain partial ordering but ensure only safe partial orders
> > > > are
> > > > committed. This can be done by detecting underflow conditions
> > > > (producer < consumer) and quashing the update in such cases.
> > > > This approach makes the original barrier assumption unnecessary
> > > > and allows its removal.
> > > >
> > > > This patch implements solution (3) for performance reasons.
> > > >
> > > > Signed-off-by: Wathsala Vithanage
> > > > <wathsala.vithanage@arm.com <mailto:
> > > > wathsala.vithanage@arm.com>>
> > > > Signed-off-by: Ola Liljedahl
> > > > <ola.liljedahl@arm.com <mailto:ola.liljedahl@arm.com>>
> > > > Reviewed-by: Honnappa Nagarahalli
> > > > <honnappa.nagarahalli@arm.com <mailto:
> > > > honnappa.nagarahalli@arm.com>>
> > > > Reviewed-by: Dhruv Tripathi
> > > > <dhruv.tripathi@arm.com <mailto:dhruv.tripathi@arm.com>>
> > > > ---
> > > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > > > 1 file changed, 7 insertions(+), 3 deletions(-)
> > > >
> > > Thank you for the very comprehensive write-up in the article
> > > about this.
> > > It was very educational.
> > >
> > >
> > > On the patch, are we sure that option #3 is safe to take as an
> > > approach? It
> > > seems wrong to me to deliberately leave ordering issues in the
> > > code and
> > > just try and fix them up later. Would there be a noticable
> > > performance
> > > difference for real-world apps if we took option #2, and actually
> > > used
> > > correct ordering semantics?
> > I am pretty sure that all necessary orderings for safely
> > transferring elements
> > from producers to consumers (and empty slots from consumers to
> > producers)
> > are present in the code (I still have some questions on the use of
> > memory
> > order relaxed in __rte_ring_update_tail, we should create a litmus
> > test for
> > this, to see what is required by the C memory model). What other
> > metrics
> > for correctness do you suggest?
> >
>
> Not suggesting any other metrics for correctness. I'm instead just
> wanting
> to double-check the cost-benefit of taking the approach of putting in
> a
> fix-up, or workaround, in the code here, rather than actually
> correcting
> the memory ordering use.  Also, given that the workaround uses
> subtraction
> to detect underflow, are we 100% sure that we have guaranteed correct
> behaviour on counter wraparound at uint32_t max?
>

I agree—that's a valid concern. If the consumer reads a stale
producer_tail of 0xffffffff and the index has since wrapped to N, the
negative check detects the partial-order effect for N < 0x80000000 and,
not for N ≥ 0x80000000. In practice the likelihood of the latter is
unlikely.
However, comparing against the ring capacity would be even safer, since
it covers a larger range for N.

--wathsala


IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-17  9:05         ` Ola Liljedahl
@ 2025-09-20 12:01           ` Konstantin Ananyev
       [not found]             ` <cf7e14d4ba5e9d78fddf083b6c92d75942447931.camel@arm.com>
  2025-09-23 21:57             ` Ola Liljedahl
  0 siblings, 2 replies; 25+ messages in thread
From: Konstantin Ananyev @ 2025-09-20 12:01 UTC (permalink / raw)
  To: Ola Liljedahl, Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Dhruv Tripathi, Bruce Richardson


> >
> > To avoid information loss I combined reply to two Wathsala replies into one.
> >
> >
> > > > > The function __rte_ring_headtail_move_head() assumes that the
> > > > > barrier
> > > > (fence) between the load of the head and the load-acquire of the
> > > > > opposing tail guarantees the following: if a first thread reads
> > > > > tail
> > > > > and then writes head and a second thread reads the new value of
> > > > > head
> > > > > and then reads tail, then it should observe the same (or a later)
> > > > > value of tail.
> > > > >
> > > > > This assumption is incorrect under the C11 memory model. If the
> > > > > barrier
> > > > > (fence) is intended to establish a total ordering of ring
> > > > > operations,
> > > > > it fails to do so. Instead, the current implementation only
> > > > > enforces a
> > > > > partial ordering, which can lead to unsafe interleavings. In
> > > > > particular,
> > > > > some partial orders can cause underflows in free slot or available
> > > > > element computations, potentially resulting in data corruption.
> > > >
> > > > Hmm... sounds exactly like the problem from the patch we discussed
> > > > earlier that year:
> > > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-
> konstantin.ananyev@huawei.com <mailto:20250521111432.207936-4-
> konstantin.ananyev@huawei.com>/
> > > > In two words:
> > > > "... thread can see 'latest' 'cons.head' value, with 'previous' value
> > > > for 'prod.tail' or visa-versa.
> > > > In other words: 'cons.head' value depends on 'prod.tail', so before
> > > > making latest 'cons.head'
> > > > value visible to other threads, we need to ensure that latest
> > > > 'prod.tail' is also visible."
> > > > Is that the one?
> >
> >
> > > Yes, the behavior occurs under RCpc (LDAPR) but not under RCsc (LDAR),
> > > which is why we didn’t catch it earlier. A fuller explanation, with
> > > Herd7 simulations, is in the blog post linked in the cover letter.
> > >
> > > https://community.arm.com/arm-community-blogs/b/architectures-and-
> processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-order
> <https://community.arm.com/arm-community-blogs/b/architectures-and-
> processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-order>
> >
> >
> > I see, so now it is reproducible with core rte_ring on real HW.
> >
> >
> > > >
> > > > > The issue manifests when a CPU first acts as a producer and later
> > > > > as a
> > > > > consumer. In this scenario, the barrier assumption may fail when
> > > > > another
> > > > > core takes the consumer role. A Herd7 litmus test in C11 can
> > > > > demonstrate
> > > > > this violation. The problem has not been widely observed so far
> > > > > because:
> > > > > (a) on strong memory models (e.g., x86-64) the assumption holds,
> > > > > and
> > > > > (b) on relaxed models with RCsc semantics the ordering is still
> > > > > strong
> > > > > enough to prevent hazards.
> > > > > The problem becomes visible only on weaker models, when load-
> > > > > acquire is
> > > > > implemented with RCpc semantics (e.g. some AArch64 CPUs which
> > > > > support
> > > > > the LDAPR and LDAPUR instructions).
> > > > >
> > > > > Three possible solutions exist:
> > > > > 1. Strengthen ordering by upgrading release/acquire semantics to
> > > > > sequential consistency. This requires using seq-cst for
> > > > > stores,
> > > > > loads, and CAS operations. However, this approach introduces a
> > > > > significant performance penalty on relaxed-memory
> > > > > architectures.
> > > > >
> > > > > 2. Establish a safe partial order by enforcing a pair-wise
> > > > > happens-before relationship between thread of same role by
> > > > > changing
> > > > > the CAS and the preceding load of the head by converting them
> > > > > to
> > > > > release and acquire respectively. This approach makes the
> > > > > original
> > > > > barrier assumption unnecessary and allows its removal.
> > > >
> > > > For the sake of clarity, can you outline what would be exact code
> > > > changes for
> > > > approach #2? Same as in that patch:
> > > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-
> <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4->
> > > konstantin.ananyev@huawei.com <mailto:konstantin.ananyev@huawei.com>/
> > > > Or something different?
> > >
> > > Sorry, I missed the later half you your comment before.
> > > Yes, you have proposed the same solution there.
> >
> >
> > Ok, thanks for confirmation.
> >
> >
> > > >
> > > >
> > > > > 3. Retain partial ordering but ensure only safe partial orders
> > > > > are
> > > > > committed. This can be done by detecting underflow conditions
> > > > > (producer < consumer) and quashing the update in such cases.
> > > > > This approach makes the original barrier assumption
> > > > > unnecessary
> > > > > and allows its removal.
> > > >
> > > > > This patch implements solution (3) for performance reasons.
> > > > >
> > > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com
> <mailto:wathsala.vithanage@arm.com>>
> > > > > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com
> <mailto:ola.liljedahl@arm.com>>
> > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com
> <mailto:honnappa.nagarahalli@arm.com>>
> > > > > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com
> <mailto:dhruv.tripathi@arm.com>>
> > > > > ---
> > > > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > > > > 1 file changed, 7 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h
> > > > > b/lib/ring/rte_ring_c11_pvt.h
> > > > > index b9388af0da..e5ac1f6b9e 100644
> > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct
> > > > > rte_ring_headtail
> > > > > *d,
> > > > > /* Reset n to the initial burst count */
> > > > > n = max;
> > > > >
> > > > > - /* Ensure the head is read before tail */
> > > > > - rte_atomic_thread_fence(rte_memory_order_acquire);
> > > > > -
> > > > > /* load-acquire synchronize with store-release of
> > > > > ht->tail
> > > > > * in update_tail.
> > > > > */
> > > >
> > > > But then cons.head can be read a before prod.tail (and visa-versa),
> > > > right?
> > >
> > > Right, we let it happen but eliminate any resulting states that are
> > > semantically incorrect at the end.
> >
> >
> > Two comments here:
> > 1) I think it is probably safer to do the check like that:
> > If (*entries > ring->capacity) ...
> Yes, this might be another way of handling underflow situations. We could study
> this.
> 
> I have used the check for negative without problems in my ring buffer
> implementations
> https://github.com/ARM-software/progress64/blob/master/src/p64_ringbuf.c
> but can't say that has been battle-tested.

My thought was about the case (probably hypothetical) when the difference
between stale tail and head will be bigger then 2^31 + 1. 
 
> > 2) My concern that without forcing a proper read ordering
> > (cons.head first then prod.tail) we re-introduce a window for all sorts of
> > ABA-like problems.
> Head and tail indexes are monotonically increasing so I don't see a risk for ABA-like
> problems.

I understand that, but with current CPU speeds it can take rte_ring just few seconds to
wrap around head/tail values. If user doing something really fancy - like using rte_ring ZC API
(i.e. just moving head/tail without reading actual objects) that can probably happen even
faster (less than a second?).
Are we sure that the stale tail value will never persist that long?
Let say user calling move_head() in a loop till it succeeds?  

> Indeed, adding a monotonically increasing tag to pointers is the common way of
> avoiding ABA
> problems in lock-free designs.

Yep, using 64-bit values for head/tail counters will help to avoid these concerns.
But it will probably break HTS/RTS modes, plus it is an ABI change for sure.
 
Actually after another thought, I have one more concern here:

+               /*
+                * Ensure the entries calculation was not based on a stale
+                * and unsafe stail observation that causes underflow.
+                */
+               if ((int)*entries < 0)
+                       *entries = 0;
+
 
With that change, it might return not-valid information back to the user
about number of free/occupied entries in the ring.
Plus rte_ring_enqueue() now might fail even when there are enough free entries
in the ring (same for dequeue).
That looks like a change in our public API behavior that might break many things.
There are quite few places when caller expects enqueue/dequeue
operation to always succeed (let say there always should be enough free space in the ring).
For example: rte_mempool works like that.
I am pretty sure there are quite few other places like that inside DPDK,
not to mention third-party code.

Considering all of the above, I am actually more in favor
to combine approaches #2 and #3 for the final patch:  
establish a safe partial order (#2) and keep the check from #3 (should it become an assert()/verify()?) 

Another thing to note: whatever final approach we choose -
we need to make sure that the problem is addressed across all other
rte_ring flavors/modes too (generic implementation, rts/hts mode, soring).

Konstantin 
 




^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
       [not found]             ` <cf7e14d4ba5e9d78fddf083b6c92d75942447931.camel@arm.com>
@ 2025-09-22  7:12               ` Konstantin Ananyev
  0 siblings, 0 replies; 25+ messages in thread
From: Konstantin Ananyev @ 2025-09-22  7:12 UTC (permalink / raw)
  To: Wathsala Vithanage
  Cc: dev, Dhruv Tripathi, Bruce Richardson, Konstantin Ananyev,
	Ola Liljedahl, Wathsala Vithanage, Honnappa Nagarahalli



> > > >
> > > > To avoid information loss I combined reply to two Wathsala
> > > > replies into one.
> > > >
> > > >
> > > > > > > The function __rte_ring_headtail_move_head() assumes that
> > > > > > > the
> > > > > > > barrier
> > > > > > (fence) between the load of the head and the load-acquire of
> > > > > > the
> > > > > > > opposing tail guarantees the following: if a first thread
> > > > > > > reads
> > > > > > > tail
> > > > > > > and then writes head and a second thread reads the new
> > > > > > > value of
> > > > > > > head
> > > > > > > and then reads tail, then it should observe the same (or a
> > > > > > > later)
> > > > > > > value of tail.
> > > > > > >
> > > > > > > This assumption is incorrect under the C11 memory model. If
> > > > > > > the
> > > > > > > barrier
> > > > > > > (fence) is intended to establish a total ordering of ring
> > > > > > > operations,
> > > > > > > it fails to do so. Instead, the current implementation only
> > > > > > > enforces a
> > > > > > > partial ordering, which can lead to unsafe interleavings.
> > > > > > > In
> > > > > > > particular,
> > > > > > > some partial orders can cause underflows in free slot or
> > > > > > > available
> > > > > > > element computations, potentially resulting in data
> > > > > > > corruption.
> > > > > >
> > > > > > Hmm... sounds exactly like the problem from the patch we
> > > > > > discussed
> > > > > > earlier that year:
> > > > > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-
> 4
> > > > > > -
> > > konstantin.ananyev@huawei.com <mailto:20250521111432.207936-4-
> > > konstantin.ananyev@huawei.com>/
> > > > > > In two words:
> > > > > > "... thread can see 'latest' 'cons.head' value, with
> > > > > > 'previous' value
> > > > > > for 'prod.tail' or visa-versa.
> > > > > > In other words: 'cons.head' value depends on 'prod.tail', so
> > > > > > before
> > > > > > making latest 'cons.head'
> > > > > > value visible to other threads, we need to ensure that latest
> > > > > > 'prod.tail' is also visible."
> > > > > > Is that the one?
> > > >
> > > >
> > > > > Yes, the behavior occurs under RCpc (LDAPR) but not under RCsc
> > > > > (LDAR),
> > > > > which is why we didn’t catch it earlier. A fuller explanation,
> > > > > with
> > > > > Herd7 simulations, is in the blog post linked in the cover
> > > > > letter.
> > > > >
> > > > > https://community.arm.com/arm-community-blogs/b/architectures-and
> > > > > -
> > > processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-
> > > of-partial-order
> > > <https://community.arm.com/arm-community-blogs/b/architectures-and-
> > > processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-
> > > of-partial-order>
> > > >
> > > >
> > > > I see, so now it is reproducible with core rte_ring on real HW.
> > > >
> > > >
> > > > > >
> > > > > > > The issue manifests when a CPU first acts as a producer and
> > > > > > > later
> > > > > > > as a
> > > > > > > consumer. In this scenario, the barrier assumption may fail
> > > > > > > when
> > > > > > > another
> > > > > > > core takes the consumer role. A Herd7 litmus test in C11
> > > > > > > can
> > > > > > > demonstrate
> > > > > > > this violation. The problem has not been widely observed so
> > > > > > > far
> > > > > > > because:
> > > > > > > (a) on strong memory models (e.g., x86-64) the assumption
> > > > > > > holds,
> > > > > > > and
> > > > > > > (b) on relaxed models with RCsc semantics the ordering is
> > > > > > > still
> > > > > > > strong
> > > > > > > enough to prevent hazards.
> > > > > > > The problem becomes visible only on weaker models, when
> > > > > > > load-
> > > > > > > acquire is
> > > > > > > implemented with RCpc semantics (e.g. some AArch64 CPUs
> > > > > > > which
> > > > > > > support
> > > > > > > the LDAPR and LDAPUR instructions).
> > > > > > >
> > > > > > > Three possible solutions exist:
> > > > > > > 1. Strengthen ordering by upgrading release/acquire
> > > > > > > semantics to
> > > > > > > sequential consistency. This requires using seq-cst for
> > > > > > > stores,
> > > > > > > loads, and CAS operations. However, this approach
> > > > > > > introduces a
> > > > > > > significant performance penalty on relaxed-memory
> > > > > > > architectures.
> > > > > > >
> > > > > > > 2. Establish a safe partial order by enforcing a pair-wise
> > > > > > > happens-before relationship between thread of same role by
> > > > > > > changing
> > > > > > > the CAS and the preceding load of the head by converting
> > > > > > > them
> > > > > > > to
> > > > > > > release and acquire respectively. This approach makes the
> > > > > > > original
> > > > > > > barrier assumption unnecessary and allows its removal.
> > > > > >
> > > > > > For the sake of clarity, can you outline what would be exact
> > > > > > code
> > > > > > changes for
> > > > > > approach #2? Same as in that patch:
> > > > > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-
> 4
> > > > > > -
> > > <
> > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-
> > > 4->
> > > > > konstantin.ananyev@huawei.com <mailto:
> > > > > konstantin.ananyev@huawei.com>/
> > > > > > Or something different?
> > > > >
> > > > > Sorry, I missed the later half you your comment before.
> > > > > Yes, you have proposed the same solution there.
> > > >
> > > >
> > > > Ok, thanks for confirmation.
> > > >
> > > >
> > > > > >
> > > > > >
> > > > > > > 3. Retain partial ordering but ensure only safe partial
> > > > > > > orders
> > > > > > > are
> > > > > > > committed. This can be done by detecting underflow
> > > > > > > conditions
> > > > > > > (producer < consumer) and quashing the update in such
> > > > > > > cases.
> > > > > > > This approach makes the original barrier assumption
> > > > > > > unnecessary
> > > > > > > and allows its removal.
> > > > > >
> > > > > > > This patch implements solution (3) for performance reasons.
> > > > > > >
> > > > > > > Signed-off-by: Wathsala Vithanage
> > > > > > > <wathsala.vithanage@arm.com
> > > <mailto:wathsala.vithanage@arm.com>>
> > > > > > > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com
> > > <mailto:ola.liljedahl@arm.com>>
> > > > > > > Reviewed-by: Honnappa Nagarahalli
> > > > > > > <honnappa.nagarahalli@arm.com
> > > <mailto:honnappa.nagarahalli@arm.com>>
> > > > > > > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com
> > > <mailto:dhruv.tripathi@arm.com>>
> > > > > > > ---
> > > > > > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > > > > > > 1 file changed, 7 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h
> > > > > > > b/lib/ring/rte_ring_c11_pvt.h
> > > > > > > index b9388af0da..e5ac1f6b9e 100644
> > > > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > > > @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct
> > > > > > > rte_ring_headtail
> > > > > > > *d,
> > > > > > > /* Reset n to the initial burst count */
> > > > > > > n = max;
> > > > > > >
> > > > > > > - /* Ensure the head is read before tail */
> > > > > > > - rte_atomic_thread_fence(rte_memory_order_acquire);
> > > > > > > -
> > > > > > > /* load-acquire synchronize with store-release of
> > > > > > > ht->tail
> > > > > > > * in update_tail.
> > > > > > > */
> > > > > >
> > > > > > But then cons.head can be read a before prod.tail (and visa-
> > > > > > versa),
> > > > > > right?
> > > > >
> > > > > Right, we let it happen but eliminate any resulting states that
> > > > > are
> > > > > semantically incorrect at the end.
> > > >
> > > >
> > > > Two comments here:
> > > > 1) I think it is probably safer to do the check like that:
> > > > If (*entries > ring->capacity) ...
> > > Yes, this might be another way of handling underflow situations. We
> > > could study
> > > this.
> > >
> > > I have used the check for negative without problems in my ring
> > > buffer
> > > implementations
> > > https://github.com/ARM-software/progress64/blob/master/src/p64_ringbuf.c
> > > but can't say that has been battle-tested.
> >
> > My thought was about the case (probably hypothetical) when the
> > difference
> > between stale tail and head will be bigger then 2^31 + 1.
> >
> > > > 2) My concern that without forcing a proper read ordering
> > > > (cons.head first then prod.tail) we re-introduce a window for all
> > > > sorts of
> > > > ABA-like problems.
> 
> The ABA like problem you are referring here is index wrapping around
> then reaching the point of not resulting a negative value I assume.
> If so, the distance between a stale (tail) and a wrapped around head
> has to be at least 0x80000000.

Yes, I think so. 

> > > Head and tail indexes are monotonically increasing so I don't see a
> > > risk for ABA-like
> > > problems.
> >
> > I understand that, but with current CPU speeds it can take rte_ring
> > just few seconds to
> > wrap around head/tail values. If user doing something really fancy -
> > like using rte_ring ZC API
> > (i.e. just moving head/tail without reading actual objects) that can
> > probably happen even
> > faster (less than a second?).
> > Are we sure that the stale tail value will never persist that long?
> > Let say user calling move_head() in a loop till it succeeds?
> >
> 
> Systems with fast CPUs may also have shorter window of inconsistency.
> 
> > > Indeed, adding a monotonically increasing tag to pointers is the
> > > common way of
> > > avoiding ABA
> > > problems in lock-free designs.
> >
> > Yep, using 64-bit values for head/tail counters will help to avoid
> > these concerns.
> > But it will probably break HTS/RTS modes, plus it is an ABI change
> > for sure.
> >
> > Actually after another thought, I have one more concern here:
> >
> > +               /*
> > +                * Ensure the entries calculation was not based on a
> > stale
> > +                * and unsafe stail observation that causes
> > underflow.
> > +                */
> > +               if ((int)*entries < 0)
> > +                       *entries = 0;
> > +
> >
> > With that change, it might return not-valid information back to the
> > user
> > about number of free/occupied entries in the ring.
> > Plus rte_ring_enqueue() now might fail even when there are enough
> > free entries
> > in the ring (same for dequeue).
> > That looks like a change in our public API behavior that might break
> > many things.
> > There are quite few places when caller expects enqueue/dequeue
> > operation to always succeed (let say there always should be enough
> > free space in the ring).
> > For example: rte_mempool works like that.
> > I am pretty sure there are quite few other places like that inside
> > DPDK,
> > not to mention third-party code.
> >
> > Considering all of the above, I am actually more in favor
> > to combine approaches #2 and #3 for the final patch:
> > establish a safe partial order (#2) and keep the check from #3
> > (should it become an assert()/verify()?)
> >
> 
> #2 is OK too, difference in performance is only meaningful when at
> single core Figure 12 in blog post.

Yes, same thought here.

> But why combine #3? Litmus tests prove that this state won't be reached
> in #2 (Figure 10).

My intention was to convert it to RTE_ASSERT().
To help people catch situations when things went completely wrong.
Though if you believe it is completely unnecessary - I am ok with pure #2 approach. 

> > Another thing to note: whatever final approach we choose -
> > we need to make sure that the problem is addressed across all other
> > rte_ring flavors/modes too (generic implementation, rts/hts mode,
> > soring).
> 
> Agree.

Ok, ping me if you need a hand with v2.
Thanks
Konstantin
 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-20 12:01           ` Konstantin Ananyev
       [not found]             ` <cf7e14d4ba5e9d78fddf083b6c92d75942447931.camel@arm.com>
@ 2025-09-23 21:57             ` Ola Liljedahl
  2025-09-24  6:56               ` Konstantin Ananyev
  2025-09-24 15:24               ` Stephen Hemminger
  1 sibling, 2 replies; 25+ messages in thread
From: Ola Liljedahl @ 2025-09-23 21:57 UTC (permalink / raw)
  To: Konstantin Ananyev, Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Dhruv Tripathi, Bruce Richardson

>
>
> On 2025-09-20, 14:01, "Konstantin Ananyev" <konstantin.ananyev@huawei.com <mailto:konstantin.ananyev@huawei.com>> wrote:
>
>
>
>
> > >
> > > To avoid information loss I combined reply to two Wathsala replies into one.
> > >
> > >
> > > > > > The function __rte_ring_headtail_move_head() assumes that the
> > > > > > barrier
> > > > > (fence) between the load of the head and the load-acquire of the
> > > > > > opposing tail guarantees the following: if a first thread reads
> > > > > > tail
> > > > > > and then writes head and a second thread reads the new value of
> > > > > > head
> > > > > > and then reads tail, then it should observe the same (or a later)
> > > > > > value of tail.
> > > > > >
> > > > > > This assumption is incorrect under the C11 memory model. If the
> > > > > > barrier
> > > > > > (fence) is intended to establish a total ordering of ring
> > > > > > operations,
> > > > > > it fails to do so. Instead, the current implementation only
> > > > > > enforces a
> > > > > > partial ordering, which can lead to unsafe interleavings. In
> > > > > > particular,
> > > > > > some partial orders can cause underflows in free slot or available
> > > > > > element computations, potentially resulting in data corruption.
> > > > >
> > > > > Hmm... sounds exactly like the problem from the patch we discussed
> > > > > earlier that year:
> > > > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4- <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4->
> > konstantin.ananyev@huawei.com <mailto:konstantin.ananyev@huawei.com> <mailto:20250521111432.207936-4-
> > konstantin.ananyev@huawei.com <mailto:konstantin.ananyev@huawei.com>>/
> > > > > In two words:
> > > > > "... thread can see 'latest' 'cons.head' value, with 'previous' value
> > > > > for 'prod.tail' or visa-versa.
> > > > > In other words: 'cons.head' value depends on 'prod.tail', so before
> > > > > making latest 'cons.head'
> > > > > value visible to other threads, we need to ensure that latest
> > > > > 'prod.tail' is also visible."
> > > > > Is that the one?
> > >
> > >
> > > > Yes, the behavior occurs under RCpc (LDAPR) but not under RCsc (LDAR),
> > > > which is why we didn’t catch it earlier. A fuller explanation, with
> > > > Herd7 simulations, is in the blog post linked in the cover letter.
> > > >
> > > > https://community.arm.com/arm-community-blogs/b/architectures-and- <https://community.arm.com/arm-community-blogs/b/architectures-and->
> > processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-order
> > <https://community.arm.com/arm-community-blogs/b/architectures-and- <https://community.arm.com/arm-community-blogs/b/architectures-and->
> > processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-order>
> > >
> > >
> > > I see, so now it is reproducible with core rte_ring on real HW.
> > >
> > >
> > > > >
> > > > > > The issue manifests when a CPU first acts as a producer and later
> > > > > > as a
> > > > > > consumer. In this scenario, the barrier assumption may fail when
> > > > > > another
> > > > > > core takes the consumer role. A Herd7 litmus test in C11 can
> > > > > > demonstrate
> > > > > > this violation. The problem has not been widely observed so far
> > > > > > because:
> > > > > > (a) on strong memory models (e.g., x86-64) the assumption holds,
> > > > > > and
> > > > > > (b) on relaxed models with RCsc semantics the ordering is still
> > > > > > strong
> > > > > > enough to prevent hazards.
> > > > > > The problem becomes visible only on weaker models, when load-
> > > > > > acquire is
> > > > > > implemented with RCpc semantics (e.g. some AArch64 CPUs which
> > > > > > support
> > > > > > the LDAPR and LDAPUR instructions).
> > > > > >
> > > > > > Three possible solutions exist:
> > > > > > 1. Strengthen ordering by upgrading release/acquire semantics to
> > > > > > sequential consistency. This requires using seq-cst for
> > > > > > stores,
> > > > > > loads, and CAS operations. However, this approach introduces a
> > > > > > significant performance penalty on relaxed-memory
> > > > > > architectures.
> > > > > >
> > > > > > 2. Establish a safe partial order by enforcing a pair-wise
> > > > > > happens-before relationship between thread of same role by
> > > > > > changing
> > > > > > the CAS and the preceding load of the head by converting them
> > > > > > to
> > > > > > release and acquire respectively. This approach makes the
> > > > > > original
> > > > > > barrier assumption unnecessary and allows its removal.
> > > > >
> > > > > For the sake of clarity, can you outline what would be exact code
> > > > > changes for
> > > > > approach #2? Same as in that patch:
> > > > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4- <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4->
> > <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-> <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-&gt;>
> > > > konstantin.ananyev@huawei.com <mailto:konstantin.ananyev@huawei.com> <mailto:konstantin.ananyev@huawei.com <mailto:konstantin.ananyev@huawei.com>>/
> > > > > Or something different?
> > > >
> > > > Sorry, I missed the later half you your comment before.
> > > > Yes, you have proposed the same solution there.
> > >
> > >
> > > Ok, thanks for confirmation.
> > >
> > >
> > > > >
> > > > >
> > > > > > 3. Retain partial ordering but ensure only safe partial orders
> > > > > > are
> > > > > > committed. This can be done by detecting underflow conditions
> > > > > > (producer < consumer) and quashing the update in such cases.
> > > > > > This approach makes the original barrier assumption
> > > > > > unnecessary
> > > > > > and allows its removal.
> > > > >
> > > > > > This patch implements solution (3) for performance reasons.
> > > > > >
> > > > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com <mailto:wathsala.vithanage@arm.com>
> > <mailto:wathsala.vithanage@arm.com <mailto:wathsala.vithanage@arm.com>>>
> > > > > > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com <mailto:ola.liljedahl@arm.com>
> > <mailto:ola.liljedahl@arm.com <mailto:ola.liljedahl@arm.com>>>
> > > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com <mailto:honnappa.nagarahalli@arm.com>
> > <mailto:honnappa.nagarahalli@arm.com <mailto:honnappa.nagarahalli@arm.com>>>
> > > > > > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com <mailto:dhruv.tripathi@arm.com>
> > <mailto:dhruv.tripathi@arm.com <mailto:dhruv.tripathi@arm.com>>>
> > > > > > ---
> > > > > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > > > > > 1 file changed, 7 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h
> > > > > > b/lib/ring/rte_ring_c11_pvt.h
> > > > > > index b9388af0da..e5ac1f6b9e 100644
> > > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > > @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct
> > > > > > rte_ring_headtail
> > > > > > *d,
> > > > > > /* Reset n to the initial burst count */
> > > > > > n = max;
> > > > > >
> > > > > > - /* Ensure the head is read before tail */
> > > > > > - rte_atomic_thread_fence(rte_memory_order_acquire);
> > > > > > -
> > > > > > /* load-acquire synchronize with store-release of
> > > > > > ht->tail
> > > > > > * in update_tail.
> > > > > > */
> > > > >
> > > > > But then cons.head can be read a before prod.tail (and visa-versa),
> > > > > right?
> > > >
> > > > Right, we let it happen but eliminate any resulting states that are
> > > > semantically incorrect at the end.
> > >
> > >
> > > Two comments here:
> > > 1) I think it is probably safer to do the check like that:
> > > If (*entries > ring->capacity) ...
> > Yes, this might be another way of handling underflow situations. We could study
> > this.
> >
> > I have used the check for negative without problems in my ring buffer
> > implementations
> > https://github.com/ARM-software/progress64/blob/master/src/p64_ringbuf.c <https://github.com/ARM-software/progress64/blob/master/src/p64_ringbuf.c>
> > but can't say that has been battle-tested.
>
>
> My thought was about the case (probably hypothetical) when the difference
> between stale tail and head will be bigger then 2^31 + 1.
>
>
> > > 2) My concern that without forcing a proper read ordering
> > > (cons.head first then prod.tail) we re-introduce a window for all sorts of
> > > ABA-like problems.
> > Head and tail indexes are monotonically increasing so I don't see a risk for ABA-like
> > problems.
>
>
> I understand that, but with current CPU speeds it can take rte_ring just few seconds to
> wrap around head/tail values. If user doing something really fancy - like using rte_ring ZC API
> (i.e. just moving head/tail without reading actual objects) that can probably happen even
> faster (less than a second?).
> Are we sure that the stale tail value will never persist that long?
> Let say user calling move_head() in a loop till it succeeds?
>
>
> > Indeed, adding a monotonically increasing tag to pointers is the common way of
> > avoiding ABA
> > problems in lock-free designs.
>
>
> Yep, using 64-bit values for head/tail counters will help to avoid these concerns.
> But it will probably break HTS/RTS modes, plus it is an ABI change for sure.
>
>
> Actually after another thought, I have one more concern here:
>
>
> + /*
> + * Ensure the entries calculation was not based on a stale
> + * and unsafe stail observation that causes underflow.
> + */
> + if ((int)*entries < 0)
> + *entries = 0;
> +
>
>
> With that change, it might return not-valid information back to the user
> about number of free/occupied entries in the ring.
> Plus rte_ring_enqueue() now might fail even when there are enough free entries
> in the ring (same for dequeue).
How do you (or the thread) know there are enough free (or used) entries? Do you
assume sequentially consistent behaviour (a total order of memory accesses)?
Otherwise, you would need to explicitly create a happens-before relation
between threads, e.g. a consumer which made room in the ring buffer must
synchronize-with the producer that there is now room for more elements. That
synchronize-with edge will ensure the producer reads a fresh value of stail. But
without it, how can a thread know the state of the ring buffer that is being
manipulated by another thread?

> That looks like a change in our public API behavior that might break many things.
> There are quite few places when caller expects enqueue/dequeue
> operation to always succeed (let say there always should be enough free space in the ring).
Single-threaded scenarios are not a problem. Do you have a multithreaded scenario where
the caller expects enqueue/dequeue to always succeed? How are the threads involved in such
a scenario synchronizing with each other?

> For example: rte_mempool works like that.
> I am pretty sure there are quite few other places like that inside DPDK,
> not to mention third-party code.
>
>
> Considering all of the above, I am actually more in favor
> to combine approaches #2 and #3 for the final patch:
> establish a safe partial order (#2) and keep the check from #3 (should it become an assert()/verify()?)
I agree that using acquire/release for all prod/cons_head accesses will make it easier to
reason about the ring buffer state. Sequential consistency (total order) is the easiest to
reason about and often seems to be desired and expected by programmers (e.g. "I'll just
add a barrier here to ensure A happens before B in this thread, now there is a total order...").

- Ola

>
>
> Another thing to note: whatever final approach we choose -
> we need to make sure that the problem is addressed across all other
> rte_ring flavors/modes too (generic implementation, rts/hts mode, soring).
>
>
> Konstantin


IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-23 21:57             ` Ola Liljedahl
@ 2025-09-24  6:56               ` Konstantin Ananyev
  2025-09-24  7:50                 ` Konstantin Ananyev
  2025-09-24 15:24               ` Stephen Hemminger
  1 sibling, 1 reply; 25+ messages in thread
From: Konstantin Ananyev @ 2025-09-24  6:56 UTC (permalink / raw)
  To: Ola Liljedahl, Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Dhruv Tripathi, Bruce Richardson


> > > > To avoid information loss I combined reply to two Wathsala replies into one.
> > > >
> > > >
> > > > > > > The function __rte_ring_headtail_move_head() assumes that the
> > > > > > > barrier
> > > > > > (fence) between the load of the head and the load-acquire of the
> > > > > > > opposing tail guarantees the following: if a first thread reads
> > > > > > > tail
> > > > > > > and then writes head and a second thread reads the new value of
> > > > > > > head
> > > > > > > and then reads tail, then it should observe the same (or a later)
> > > > > > > value of tail.
> > > > > > >
> > > > > > > This assumption is incorrect under the C11 memory model. If the
> > > > > > > barrier
> > > > > > > (fence) is intended to establish a total ordering of ring
> > > > > > > operations,
> > > > > > > it fails to do so. Instead, the current implementation only
> > > > > > > enforces a
> > > > > > > partial ordering, which can lead to unsafe interleavings. In
> > > > > > > particular,
> > > > > > > some partial orders can cause underflows in free slot or available
> > > > > > > element computations, potentially resulting in data corruption.
> > > > > >
> > > > > > Hmm... sounds exactly like the problem from the patch we discussed
> > > > > > earlier that year:
> > > > > >
> https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-
> <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4->
> > > konstantin.ananyev@huawei.com
> <mailto:konstantin.ananyev@huawei.com> <mailto:20250521111432.207936-4-
> > > konstantin.ananyev@huawei.com
> <mailto:konstantin.ananyev@huawei.com>>/
> > > > > > In two words:
> > > > > > "... thread can see 'latest' 'cons.head' value, with 'previous' value
> > > > > > for 'prod.tail' or visa-versa.
> > > > > > In other words: 'cons.head' value depends on 'prod.tail', so before
> > > > > > making latest 'cons.head'
> > > > > > value visible to other threads, we need to ensure that latest
> > > > > > 'prod.tail' is also visible."
> > > > > > Is that the one?
> > > >
> > > >
> > > > > Yes, the behavior occurs under RCpc (LDAPR) but not under RCsc (LDAR),
> > > > > which is why we didn’t catch it earlier. A fuller explanation, with
> > > > > Herd7 simulations, is in the blog post linked in the cover letter.
> > > > >
> > > > > https://community.arm.com/arm-community-blogs/b/architectures-and-
> <https://community.arm.com/arm-community-blogs/b/architectures-and->
> > > processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-
> order
> > > <https://community.arm.com/arm-community-blogs/b/architectures-and-
> <https://community.arm.com/arm-community-blogs/b/architectures-and->
> > > processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-
> order>
> > > >
> > > >
> > > > I see, so now it is reproducible with core rte_ring on real HW.
> > > >
> > > >
> > > > > >
> > > > > > > The issue manifests when a CPU first acts as a producer and later
> > > > > > > as a
> > > > > > > consumer. In this scenario, the barrier assumption may fail when
> > > > > > > another
> > > > > > > core takes the consumer role. A Herd7 litmus test in C11 can
> > > > > > > demonstrate
> > > > > > > this violation. The problem has not been widely observed so far
> > > > > > > because:
> > > > > > > (a) on strong memory models (e.g., x86-64) the assumption holds,
> > > > > > > and
> > > > > > > (b) on relaxed models with RCsc semantics the ordering is still
> > > > > > > strong
> > > > > > > enough to prevent hazards.
> > > > > > > The problem becomes visible only on weaker models, when load-
> > > > > > > acquire is
> > > > > > > implemented with RCpc semantics (e.g. some AArch64 CPUs which
> > > > > > > support
> > > > > > > the LDAPR and LDAPUR instructions).
> > > > > > >
> > > > > > > Three possible solutions exist:
> > > > > > > 1. Strengthen ordering by upgrading release/acquire semantics to
> > > > > > > sequential consistency. This requires using seq-cst for
> > > > > > > stores,
> > > > > > > loads, and CAS operations. However, this approach introduces a
> > > > > > > significant performance penalty on relaxed-memory
> > > > > > > architectures.
> > > > > > >
> > > > > > > 2. Establish a safe partial order by enforcing a pair-wise
> > > > > > > happens-before relationship between thread of same role by
> > > > > > > changing
> > > > > > > the CAS and the preceding load of the head by converting them
> > > > > > > to
> > > > > > > release and acquire respectively. This approach makes the
> > > > > > > original
> > > > > > > barrier assumption unnecessary and allows its removal.
> > > > > >
> > > > > > For the sake of clarity, can you outline what would be exact code
> > > > > > changes for
> > > > > > approach #2? Same as in that patch:
> > > > > >
> https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-
> <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4->
> > > <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-
> > <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-
> &gt;>
> > > > > konstantin.ananyev@huawei.com
> <mailto:konstantin.ananyev@huawei.com>
> <mailto:konstantin.ananyev@huawei.com
> <mailto:konstantin.ananyev@huawei.com>>/
> > > > > > Or something different?
> > > > >
> > > > > Sorry, I missed the later half you your comment before.
> > > > > Yes, you have proposed the same solution there.
> > > >
> > > >
> > > > Ok, thanks for confirmation.
> > > >
> > > >
> > > > > >
> > > > > >
> > > > > > > 3. Retain partial ordering but ensure only safe partial orders
> > > > > > > are
> > > > > > > committed. This can be done by detecting underflow conditions
> > > > > > > (producer < consumer) and quashing the update in such cases.
> > > > > > > This approach makes the original barrier assumption
> > > > > > > unnecessary
> > > > > > > and allows its removal.
> > > > > >
> > > > > > > This patch implements solution (3) for performance reasons.
> > > > > > >
> > > > > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com
> <mailto:wathsala.vithanage@arm.com>
> > > <mailto:wathsala.vithanage@arm.com
> <mailto:wathsala.vithanage@arm.com>>>
> > > > > > > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com
> <mailto:ola.liljedahl@arm.com>
> > > <mailto:ola.liljedahl@arm.com <mailto:ola.liljedahl@arm.com>>>
> > > > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com
> <mailto:honnappa.nagarahalli@arm.com>
> > > <mailto:honnappa.nagarahalli@arm.com
> <mailto:honnappa.nagarahalli@arm.com>>>
> > > > > > > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com
> <mailto:dhruv.tripathi@arm.com>
> > > <mailto:dhruv.tripathi@arm.com <mailto:dhruv.tripathi@arm.com>>>
> > > > > > > ---
> > > > > > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > > > > > > 1 file changed, 7 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h
> > > > > > > b/lib/ring/rte_ring_c11_pvt.h
> > > > > > > index b9388af0da..e5ac1f6b9e 100644
> > > > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > > > @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct
> > > > > > > rte_ring_headtail
> > > > > > > *d,
> > > > > > > /* Reset n to the initial burst count */
> > > > > > > n = max;
> > > > > > >
> > > > > > > - /* Ensure the head is read before tail */
> > > > > > > - rte_atomic_thread_fence(rte_memory_order_acquire);
> > > > > > > -
> > > > > > > /* load-acquire synchronize with store-release of
> > > > > > > ht->tail
> > > > > > > * in update_tail.
> > > > > > > */
> > > > > >
> > > > > > But then cons.head can be read a before prod.tail (and visa-versa),
> > > > > > right?
> > > > >
> > > > > Right, we let it happen but eliminate any resulting states that are
> > > > > semantically incorrect at the end.
> > > >
> > > >
> > > > Two comments here:
> > > > 1) I think it is probably safer to do the check like that:
> > > > If (*entries > ring->capacity) ...
> > > Yes, this might be another way of handling underflow situations. We could
> study
> > > this.
> > >
> > > I have used the check for negative without problems in my ring buffer
> > > implementations
> > > https://github.com/ARM-
> software/progress64/blob/master/src/p64_ringbuf.c <https://github.com/ARM-
> software/progress64/blob/master/src/p64_ringbuf.c>
> > > but can't say that has been battle-tested.
> >
> >
> > My thought was about the case (probably hypothetical) when the difference
> > between stale tail and head will be bigger then 2^31 + 1.
> >
> >
> > > > 2) My concern that without forcing a proper read ordering
> > > > (cons.head first then prod.tail) we re-introduce a window for all sorts of
> > > > ABA-like problems.
> > > Head and tail indexes are monotonically increasing so I don't see a risk for
> ABA-like
> > > problems.
> >
> >
> > I understand that, but with current CPU speeds it can take rte_ring just few
> seconds to
> > wrap around head/tail values. If user doing something really fancy - like using
> rte_ring ZC API
> > (i.e. just moving head/tail without reading actual objects) that can probably
> happen even
> > faster (less than a second?).
> > Are we sure that the stale tail value will never persist that long?
> > Let say user calling move_head() in a loop till it succeeds?
> >
> >
> > > Indeed, adding a monotonically increasing tag to pointers is the common way
> of
> > > avoiding ABA
> > > problems in lock-free designs.
> >
> >
> > Yep, using 64-bit values for head/tail counters will help to avoid these concerns.
> > But it will probably break HTS/RTS modes, plus it is an ABI change for sure.
> >
> >
> > Actually after another thought, I have one more concern here:
> >
> >
> > + /*
> > + * Ensure the entries calculation was not based on a stale
> > + * and unsafe stail observation that causes underflow.
> > + */
> > + if ((int)*entries < 0)
> > + *entries = 0;
> > +
> >
> >
> > With that change, it might return not-valid information back to the user
> > about number of free/occupied entries in the ring.
> > Plus rte_ring_enqueue() now might fail even when there are enough free
> entries
> > in the ring (same for dequeue).
> How do you (or the thread) know there are enough free (or used) entries? Do
> you
> assume sequentially consistent behaviour (a total order of memory accesses)?
> Otherwise, you would need to explicitly create a happens-before relation
> between threads, e.g. a consumer which made room in the ring buffer must
> synchronize-with the producer that there is now room for more elements. That
> synchronize-with edge will ensure the producer reads a fresh value of stail. But
> without it, how can a thread know the state of the ring buffer that is being
> manipulated by another thread?
> 
> > That looks like a change in our public API behavior that might break many
> things.
> > There are quite few places when caller expects enqueue/dequeue
> > operation to always succeed (let say there always should be enough free space
> in the ring).
> Single-threaded scenarios are not a problem. Do you have a multithreaded
> scenario where
> the caller expects enqueue/dequeue to always succeed? How are the threads
> involved in such
> a scenario synchronizing with each other?

Sure, I am talking about MT scenario.
I think I already provided an example: DPDK mempool library (see below).
In brief, It works like that:
At init it allocates ring of N memory buffers and ring big enough to hold all of them.
Then it enqueues all allocated memory buffers into the ring.
mempool_get - retrieves (dequeues) buffers from the ring.
mempool_put - puts them back (enqueues) to the ring
get() might fail (ENOMEM), while put is expected to always succeed. 

> 
> > For example: rte_mempool works like that.
> > I am pretty sure there are quite few other places like that inside DPDK,
> > not to mention third-party code.
> >
> >
> > Considering all of the above, I am actually more in favor
> > to combine approaches #2 and #3 for the final patch:
> > establish a safe partial order (#2) and keep the check from #3 (should it become
> an assert()/verify()?)
> I agree that using acquire/release for all prod/cons_head accesses will make it
> easier to
> reason about the ring buffer state. Sequential consistency (total order) is the
> easiest to
> reason about and often seems to be desired and expected by programmers (e.g.
> "I'll just
> add a barrier here to ensure A happens before B in this thread, now there is a
> total order...").
> 
> - Ola
> 
> >
> >
> > Another thing to note: whatever final approach we choose -
> > we need to make sure that the problem is addressed across all other
> > rte_ring flavors/modes too (generic implementation, rts/hts mode, soring).
> >
> >
> > Konstantin
> 
> 
> IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended recipient,
> please notify the sender immediately and do not disclose the contents to any
> other person, use it for any purpose, or store or copy the information in any
> medium. Thank you.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-24  6:56               ` Konstantin Ananyev
@ 2025-09-24  7:50                 ` Konstantin Ananyev
  2025-09-24  8:51                   ` Ola Liljedahl
  0 siblings, 1 reply; 25+ messages in thread
From: Konstantin Ananyev @ 2025-09-24  7:50 UTC (permalink / raw)
  To: Konstantin Ananyev, Ola Liljedahl, Wathsala Vithanage,
	Honnappa Nagarahalli
  Cc: dev, Dhruv Tripathi, Bruce Richardson



> > > > > To avoid information loss I combined reply to two Wathsala replies into one.
> > > > >
> > > > >
> > > > > > > > The function __rte_ring_headtail_move_head() assumes that the
> > > > > > > > barrier
> > > > > > > (fence) between the load of the head and the load-acquire of the
> > > > > > > > opposing tail guarantees the following: if a first thread reads
> > > > > > > > tail
> > > > > > > > and then writes head and a second thread reads the new value of
> > > > > > > > head
> > > > > > > > and then reads tail, then it should observe the same (or a later)
> > > > > > > > value of tail.
> > > > > > > >
> > > > > > > > This assumption is incorrect under the C11 memory model. If the
> > > > > > > > barrier
> > > > > > > > (fence) is intended to establish a total ordering of ring
> > > > > > > > operations,
> > > > > > > > it fails to do so. Instead, the current implementation only
> > > > > > > > enforces a
> > > > > > > > partial ordering, which can lead to unsafe interleavings. In
> > > > > > > > particular,
> > > > > > > > some partial orders can cause underflows in free slot or available
> > > > > > > > element computations, potentially resulting in data corruption.
> > > > > > >
> > > > > > > Hmm... sounds exactly like the problem from the patch we discussed
> > > > > > > earlier that year:
> > > > > > >
> > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-
> > <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4->
> > > > konstantin.ananyev@huawei.com
> > <mailto:konstantin.ananyev@huawei.com> <mailto:20250521111432.207936-4-
> > > > konstantin.ananyev@huawei.com
> > <mailto:konstantin.ananyev@huawei.com>>/
> > > > > > > In two words:
> > > > > > > "... thread can see 'latest' 'cons.head' value, with 'previous' value
> > > > > > > for 'prod.tail' or visa-versa.
> > > > > > > In other words: 'cons.head' value depends on 'prod.tail', so before
> > > > > > > making latest 'cons.head'
> > > > > > > value visible to other threads, we need to ensure that latest
> > > > > > > 'prod.tail' is also visible."
> > > > > > > Is that the one?
> > > > >
> > > > >
> > > > > > Yes, the behavior occurs under RCpc (LDAPR) but not under RCsc (LDAR),
> > > > > > which is why we didn’t catch it earlier. A fuller explanation, with
> > > > > > Herd7 simulations, is in the blog post linked in the cover letter.
> > > > > >
> > > > > > https://community.arm.com/arm-community-blogs/b/architectures-and-
> > <https://community.arm.com/arm-community-blogs/b/architectures-and->
> > > > processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-
> > order
> > > > <https://community.arm.com/arm-community-blogs/b/architectures-and-
> > <https://community.arm.com/arm-community-blogs/b/architectures-and->
> > > > processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-
> > order>
> > > > >
> > > > >
> > > > > I see, so now it is reproducible with core rte_ring on real HW.
> > > > >
> > > > >
> > > > > > >
> > > > > > > > The issue manifests when a CPU first acts as a producer and later
> > > > > > > > as a
> > > > > > > > consumer. In this scenario, the barrier assumption may fail when
> > > > > > > > another
> > > > > > > > core takes the consumer role. A Herd7 litmus test in C11 can
> > > > > > > > demonstrate
> > > > > > > > this violation. The problem has not been widely observed so far
> > > > > > > > because:
> > > > > > > > (a) on strong memory models (e.g., x86-64) the assumption holds,
> > > > > > > > and
> > > > > > > > (b) on relaxed models with RCsc semantics the ordering is still
> > > > > > > > strong
> > > > > > > > enough to prevent hazards.
> > > > > > > > The problem becomes visible only on weaker models, when load-
> > > > > > > > acquire is
> > > > > > > > implemented with RCpc semantics (e.g. some AArch64 CPUs which
> > > > > > > > support
> > > > > > > > the LDAPR and LDAPUR instructions).
> > > > > > > >
> > > > > > > > Three possible solutions exist:
> > > > > > > > 1. Strengthen ordering by upgrading release/acquire semantics to
> > > > > > > > sequential consistency. This requires using seq-cst for
> > > > > > > > stores,
> > > > > > > > loads, and CAS operations. However, this approach introduces a
> > > > > > > > significant performance penalty on relaxed-memory
> > > > > > > > architectures.
> > > > > > > >
> > > > > > > > 2. Establish a safe partial order by enforcing a pair-wise
> > > > > > > > happens-before relationship between thread of same role by
> > > > > > > > changing
> > > > > > > > the CAS and the preceding load of the head by converting them
> > > > > > > > to
> > > > > > > > release and acquire respectively. This approach makes the
> > > > > > > > original
> > > > > > > > barrier assumption unnecessary and allows its removal.
> > > > > > >
> > > > > > > For the sake of clarity, can you outline what would be exact code
> > > > > > > changes for
> > > > > > > approach #2? Same as in that patch:
> > > > > > >
> > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-
> > <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4->
> > > > <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-
> > > <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-
> > &gt;>
> > > > > > konstantin.ananyev@huawei.com
> > <mailto:konstantin.ananyev@huawei.com>
> > <mailto:konstantin.ananyev@huawei.com
> > <mailto:konstantin.ananyev@huawei.com>>/
> > > > > > > Or something different?
> > > > > >
> > > > > > Sorry, I missed the later half you your comment before.
> > > > > > Yes, you have proposed the same solution there.
> > > > >
> > > > >
> > > > > Ok, thanks for confirmation.
> > > > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > 3. Retain partial ordering but ensure only safe partial orders
> > > > > > > > are
> > > > > > > > committed. This can be done by detecting underflow conditions
> > > > > > > > (producer < consumer) and quashing the update in such cases.
> > > > > > > > This approach makes the original barrier assumption
> > > > > > > > unnecessary
> > > > > > > > and allows its removal.
> > > > > > >
> > > > > > > > This patch implements solution (3) for performance reasons.
> > > > > > > >
> > > > > > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com
> > <mailto:wathsala.vithanage@arm.com>
> > > > <mailto:wathsala.vithanage@arm.com
> > <mailto:wathsala.vithanage@arm.com>>>
> > > > > > > > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com
> > <mailto:ola.liljedahl@arm.com>
> > > > <mailto:ola.liljedahl@arm.com <mailto:ola.liljedahl@arm.com>>>
> > > > > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com
> > <mailto:honnappa.nagarahalli@arm.com>
> > > > <mailto:honnappa.nagarahalli@arm.com
> > <mailto:honnappa.nagarahalli@arm.com>>>
> > > > > > > > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com
> > <mailto:dhruv.tripathi@arm.com>
> > > > <mailto:dhruv.tripathi@arm.com <mailto:dhruv.tripathi@arm.com>>>
> > > > > > > > ---
> > > > > > > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > > > > > > > 1 file changed, 7 insertions(+), 3 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h
> > > > > > > > b/lib/ring/rte_ring_c11_pvt.h
> > > > > > > > index b9388af0da..e5ac1f6b9e 100644
> > > > > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > > > > @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct
> > > > > > > > rte_ring_headtail
> > > > > > > > *d,
> > > > > > > > /* Reset n to the initial burst count */
> > > > > > > > n = max;
> > > > > > > >
> > > > > > > > - /* Ensure the head is read before tail */
> > > > > > > > - rte_atomic_thread_fence(rte_memory_order_acquire);
> > > > > > > > -
> > > > > > > > /* load-acquire synchronize with store-release of
> > > > > > > > ht->tail
> > > > > > > > * in update_tail.
> > > > > > > > */
> > > > > > >
> > > > > > > But then cons.head can be read a before prod.tail (and visa-versa),
> > > > > > > right?
> > > > > >
> > > > > > Right, we let it happen but eliminate any resulting states that are
> > > > > > semantically incorrect at the end.
> > > > >
> > > > >
> > > > > Two comments here:
> > > > > 1) I think it is probably safer to do the check like that:
> > > > > If (*entries > ring->capacity) ...
> > > > Yes, this might be another way of handling underflow situations. We could
> > study
> > > > this.
> > > >
> > > > I have used the check for negative without problems in my ring buffer
> > > > implementations
> > > > https://github.com/ARM-
> > software/progress64/blob/master/src/p64_ringbuf.c <https://github.com/ARM-
> > software/progress64/blob/master/src/p64_ringbuf.c>
> > > > but can't say that has been battle-tested.
> > >
> > >
> > > My thought was about the case (probably hypothetical) when the difference
> > > between stale tail and head will be bigger then 2^31 + 1.
> > >
> > >
> > > > > 2) My concern that without forcing a proper read ordering
> > > > > (cons.head first then prod.tail) we re-introduce a window for all sorts of
> > > > > ABA-like problems.
> > > > Head and tail indexes are monotonically increasing so I don't see a risk for
> > ABA-like
> > > > problems.
> > >
> > >
> > > I understand that, but with current CPU speeds it can take rte_ring just few
> > seconds to
> > > wrap around head/tail values. If user doing something really fancy - like using
> > rte_ring ZC API
> > > (i.e. just moving head/tail without reading actual objects) that can probably
> > happen even
> > > faster (less than a second?).
> > > Are we sure that the stale tail value will never persist that long?
> > > Let say user calling move_head() in a loop till it succeeds?
> > >
> > >
> > > > Indeed, adding a monotonically increasing tag to pointers is the common way
> > of
> > > > avoiding ABA
> > > > problems in lock-free designs.
> > >
> > >
> > > Yep, using 64-bit values for head/tail counters will help to avoid these concerns.
> > > But it will probably break HTS/RTS modes, plus it is an ABI change for sure.
> > >
> > >
> > > Actually after another thought, I have one more concern here:
> > >
> > >
> > > + /*
> > > + * Ensure the entries calculation was not based on a stale
> > > + * and unsafe stail observation that causes underflow.
> > > + */
> > > + if ((int)*entries < 0)
> > > + *entries = 0;
> > > +
> > >
> > >
> > > With that change, it might return not-valid information back to the user
> > > about number of free/occupied entries in the ring.
> > > Plus rte_ring_enqueue() now might fail even when there are enough free
> > entries
> > > in the ring (same for dequeue).
> > How do you (or the thread) know there are enough free (or used) entries? Do
> > you
> > assume sequentially consistent behaviour (a total order of memory accesses)?
> > Otherwise, you would need to explicitly create a happens-before relation
> > between threads, e.g. a consumer which made room in the ring buffer must
> > synchronize-with the producer that there is now room for more elements. That
> > synchronize-with edge will ensure the producer reads a fresh value of stail. But
> > without it, how can a thread know the state of the ring buffer that is being
> > manipulated by another thread?
> >
> > > That looks like a change in our public API behavior that might break many
> > things.
> > > There are quite few places when caller expects enqueue/dequeue
> > > operation to always succeed (let say there always should be enough free space
> > in the ring).
> > Single-threaded scenarios are not a problem. Do you have a multithreaded
> > scenario where
> > the caller expects enqueue/dequeue to always succeed? How are the threads
> > involved in such
> > a scenario synchronizing with each other?
> 
> Sure, I am talking about MT scenario.
> I think I already provided an example: DPDK mempool library (see below).
> In brief, It works like that:
> At init it allocates ring of N memory buffers and ring big enough to hold all of them.

Sorry, I meant to say: "it allocates N memory buffers and ring big enough to hold all of them".
 
> Then it enqueues all allocated memory buffers into the ring.
> mempool_get - retrieves (dequeues) buffers from the ring.
> mempool_put - puts them back (enqueues) to the ring
> get() might fail (ENOMEM), while put is expected to always succeed.
> 
> >
> > > For example: rte_mempool works like that.
> > > I am pretty sure there are quite few other places like that inside DPDK,
> > > not to mention third-party code.
> > >
> > >
> > > Considering all of the above, I am actually more in favor
> > > to combine approaches #2 and #3 for the final patch:
> > > establish a safe partial order (#2) and keep the check from #3 (should it become
> > an assert()/verify()?)
> > I agree that using acquire/release for all prod/cons_head accesses will make it
> > easier to
> > reason about the ring buffer state. Sequential consistency (total order) is the
> > easiest to
> > reason about and often seems to be desired and expected by programmers (e.g.
> > "I'll just
> > add a barrier here to ensure A happens before B in this thread, now there is a
> > total order...").
> >
> > - Ola
> >
> > >
> > >
> > > Another thing to note: whatever final approach we choose -
> > > we need to make sure that the problem is addressed across all other
> > > rte_ring flavors/modes too (generic implementation, rts/hts mode, soring).
> > >
> > >
> > > Konstantin
> >
> >
> > IMPORTANT NOTICE: The contents of this email and any attachments are
> > confidential and may also be privileged. If you are not the intended recipient,
> > please notify the sender immediately and do not disclose the contents to any
> > other person, use it for any purpose, or store or copy the information in any
> > medium. Thank you.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-24  7:50                 ` Konstantin Ananyev
@ 2025-09-24  8:51                   ` Ola Liljedahl
  2025-09-24 10:08                     ` Konstantin Ananyev
  0 siblings, 1 reply; 25+ messages in thread
From: Ola Liljedahl @ 2025-09-24  8:51 UTC (permalink / raw)
  To: Konstantin Ananyev, Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Dhruv Tripathi, Bruce Richardson

> On 2025-09-24, 09:50, "Konstantin Ananyev" <konstantin.ananyev@huawei.com <mailto:konstantin.ananyev@huawei.com>> wrote:
> > Sure, I am talking about MT scenario.
> > I think I already provided an example: DPDK mempool library (see below).
> > In brief, It works like that:
> > At init it allocates ring of N memory buffers and ring big enough to hold all of them.
>
> Sorry, I meant to say: "it allocates N memory buffers and ring big enough to hold all of them".
>
> > Then it enqueues all allocated memory buffers into the ring.
> > mempool_get - retrieves (dequeues) buffers from the ring.
> > mempool_put - puts them back (enqueues) to the ring
> > get() might fail (ENOMEM), while put is expected to always succeed.
But how does the thread which calls mempool_put() get hold of the memory buffers that
were obtained using mempool_get() by some other thread? Or this is not the scenario you
are worrying about?
Is it rather that multiple threads independently call mempool_get() and then mempool_put()
on their own buffers? And you are worried that a thread will fail to return (mempool_put) a
buffer that it earlier allocated (mempool_get)? We could create a litmus test for that.

- Ola


IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-24  8:51                   ` Ola Liljedahl
@ 2025-09-24 10:08                     ` Konstantin Ananyev
  2025-09-24 11:27                       ` Ola Liljedahl
  0 siblings, 1 reply; 25+ messages in thread
From: Konstantin Ananyev @ 2025-09-24 10:08 UTC (permalink / raw)
  To: Ola Liljedahl, Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Dhruv Tripathi, Bruce Richardson



> > > Sure, I am talking about MT scenario.
> > > I think I already provided an example: DPDK mempool library (see below).
> > > In brief, It works like that:
> > > At init it allocates ring of N memory buffers and ring big enough to hold all of
> them.
> >
> > Sorry, I meant to say: "it allocates N memory buffers and ring big enough to hold
> all of them".
> >
> > > Then it enqueues all allocated memory buffers into the ring.
> > > mempool_get - retrieves (dequeues) buffers from the ring.
> > > mempool_put - puts them back (enqueues) to the ring
> > > get() might fail (ENOMEM), while put is expected to always succeed.
> But how does the thread which calls mempool_put() get hold of the memory buffers
> that
> were obtained using mempool_get() by some other thread? Or this is not the
> scenario you
> are worrying about?
> Is it rather that multiple threads independently call mempool_get() and then
> mempool_put()
> on their own buffers? And you are worried that a thread will fail to return
> (mempool_put) a
> buffer that it earlier allocated (mempool_get)? We could create a litmus test for
> that.

Both scenarios are possible.
For Run-To-Completion model each thread usually does: allocate/use/free group of mbufs.
For pipleline model one thread can allocate bunch of mbufs, then pass them to other 
thread (via another ring for example) for further processing and then releasing.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-24 10:08                     ` Konstantin Ananyev
@ 2025-09-24 11:27                       ` Ola Liljedahl
  2025-09-24 11:50                         ` Konstantin Ananyev
  0 siblings, 1 reply; 25+ messages in thread
From: Ola Liljedahl @ 2025-09-24 11:27 UTC (permalink / raw)
  To: Konstantin Ananyev, Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Dhruv Tripathi, Bruce Richardson

>
> > On 2025-09-24, 12:08, "Konstantin Ananyev" <konstantin.ananyev@huawei.com <mailto:konstantin.ananyev@huawei.com>> wrote:
>
>
> > > > Sure, I am talking about MT scenario.
> > > > I think I already provided an example: DPDK mempool library (see below).
> > > > In brief, It works like that:
> > > > At init it allocates ring of N memory buffers and ring big enough to hold all of
> > them.
> > >
> > > Sorry, I meant to say: "it allocates N memory buffers and ring big enough to hold
> > all of them".
> > >
> > > > Then it enqueues all allocated memory buffers into the ring.
> > > > mempool_get - retrieves (dequeues) buffers from the ring.
> > > > mempool_put - puts them back (enqueues) to the ring
> > > > get() might fail (ENOMEM), while put is expected to always succeed.
> > But how does the thread which calls mempool_put() get hold of the memory buffers
> > that
> > were obtained using mempool_get() by some other thread? Or this is not the
> > scenario you
> > are worrying about?
> > Is it rather that multiple threads independently call mempool_get() and then
> > mempool_put()
> > on their own buffers? And you are worried that a thread will fail to return
> > (mempool_put) a
> > buffer that it earlier allocated (mempool_get)? We could create a litmus test for
> > that.
>
>
> Both scenarios are possible.
> For Run-To-Completion model each thread usually does: allocate/use/free group of mbufs.
> For pipleline model one thread can allocate bunch of mbufs, then pass them to other
> thread (via another ring for example) for further processing and then releasing.
In the pipeline model, if the last stage (thread) frees (enqueues) buffers onto some ring buffer
and the first stage (thread) allocates (dequeues) buffers from the same ring buffer but there
isn't any other type of synchronization between the threads, we can never guarantee that
the first thread will be able to dequeue buffers because it doesn't know whether the last
thread has enqueued any buffers.

However, enqueue ought to always succeed. We should be able to create a litmus test for that.
Ring 1 is used as mempool, it initially contains capacity elements (full).
Ring 2 is used as pipe between stages 1 and 2, it initially contains 0 elements (empty).
Thread 1 allocates/dequeues a buffer from ring 1.
Thread 1 enqueues that buffer onto ring 2.
Thread 2 dequeues a buffer from ring 2.
Thread 2 frees/enqueues that buffer onto ring 1.  <<< this must succeed!
Does this reflect the situation you worry about?

- Ola


IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-24 11:27                       ` Ola Liljedahl
@ 2025-09-24 11:50                         ` Konstantin Ananyev
  2025-09-24 13:28                           ` Ola Liljedahl
  0 siblings, 1 reply; 25+ messages in thread
From: Konstantin Ananyev @ 2025-09-24 11:50 UTC (permalink / raw)
  To: Ola Liljedahl, Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Dhruv Tripathi, Bruce Richardson



> > > > > Sure, I am talking about MT scenario.
> > > > > I think I already provided an example: DPDK mempool library (see below).
> > > > > In brief, It works like that:
> > > > > At init it allocates ring of N memory buffers and ring big enough to hold all of
> > > them.
> > > >
> > > > Sorry, I meant to say: "it allocates N memory buffers and ring big enough to
> hold
> > > all of them".
> > > >
> > > > > Then it enqueues all allocated memory buffers into the ring.
> > > > > mempool_get - retrieves (dequeues) buffers from the ring.
> > > > > mempool_put - puts them back (enqueues) to the ring
> > > > > get() might fail (ENOMEM), while put is expected to always succeed.
> > > But how does the thread which calls mempool_put() get hold of the memory
> buffers
> > > that
> > > were obtained using mempool_get() by some other thread? Or this is not the
> > > scenario you
> > > are worrying about?
> > > Is it rather that multiple threads independently call mempool_get() and then
> > > mempool_put()
> > > on their own buffers? And you are worried that a thread will fail to return
> > > (mempool_put) a
> > > buffer that it earlier allocated (mempool_get)? We could create a litmus test for
> > > that.
> >
> >
> > Both scenarios are possible.
> > For Run-To-Completion model each thread usually does: allocate/use/free group
> of mbufs.
> > For pipleline model one thread can allocate bunch of mbufs, then pass them to
> other
> > thread (via another ring for example) for further processing and then releasing.
> In the pipeline model, if the last stage (thread) frees (enqueues) buffers onto some
> ring buffer
> and the first stage (thread) allocates (dequeues) buffers from the same ring buffer
> but there
> isn't any other type of synchronization between the threads, we can never guarantee
> that
> the first thread will be able to dequeue buffers because it doesn't know whether the
> last
> thread has enqueued any buffers.

Yes, as I said above - for mempool use-case: dequeue can fail, enqueue should always succeed.
The closest analogy: malloc() can fail, free() should never fail.

> 
> However, enqueue ought to always succeed. We should be able to create a litmus
> test for that.
> Ring 1 is used as mempool, it initially contains capacity elements (full).
> Ring 2 is used as pipe between stages 1 and 2, it initially contains 0 elements (empty).
> Thread 1 allocates/dequeues a buffer from ring 1.
> Thread 1 enqueues that buffer onto ring 2.
> Thread 2 dequeues a buffer from ring 2.
> Thread 2 frees/enqueues that buffer onto ring 1.  <<< this must succeed!
> Does this reflect the situation you worry about?

This is one of the possible scenarios.
As I said above - mempool_put() is expected to always be able to enqueue element to the ring.
TBH, I am not sure what you are trying to prove with the litmus test.
Looking at the changes you proposed:

+               /*
+                * Ensure the entries calculation was not based on a stale
+                * and unsafe stail observation that causes underflow.
+                */
+               if ((int)*entries < 0)
+                       *entries = 0;
+ 

                /* check that we have enough room in ring */
                if (unlikely(n > *entries))
                        n = (behavior == RTE_RING_QUEUE_FIXED) ?
                                        0 : *entries;

                *new_head = *old_head + n;
                if (n == 0)
                        return 0;

It is clear that with these changes enqueue/dequeue might fail even
when there are available entries in the ring.
One simple way that probably will introduce a loop instead of 'if':
(keep reading head and tail values till we get a valid results)
but again I am not sure it is a good way. 
Konstantin

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-24 11:50                         ` Konstantin Ananyev
@ 2025-09-24 13:28                           ` Ola Liljedahl
  2025-09-24 15:03                             ` Konstantin Ananyev
  0 siblings, 1 reply; 25+ messages in thread
From: Ola Liljedahl @ 2025-09-24 13:28 UTC (permalink / raw)
  To: Konstantin Ananyev, Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Dhruv Tripathi, Bruce Richardson

> On 2025-09-24, 13:51, "Konstantin Ananyev" <konstantin.ananyev@huawei.com <mailto:konstantin.ananyev@huawei.com>> wrote:
>
> > > > > > Sure, I am talking about MT scenario.
> > > > > > I think I already provided an example: DPDK mempool library (see below).
> > > > > > In brief, It works like that:
> > > > > > At init it allocates ring of N memory buffers and ring big enough to hold all of
> > > > them.
> > > > >
> > > > > Sorry, I meant to say: "it allocates N memory buffers and ring big enough to
> > hold
> > > > all of them".
> > > > >
> > > > > > Then it enqueues all allocated memory buffers into the ring.
> > > > > > mempool_get - retrieves (dequeues) buffers from the ring.
> > > > > > mempool_put - puts them back (enqueues) to the ring
> > > > > > get() might fail (ENOMEM), while put is expected to always succeed.
> > > > But how does the thread which calls mempool_put() get hold of the memory
> > buffers
> > > > that
> > > > were obtained using mempool_get() by some other thread? Or this is not the
> > > > scenario you
> > > > are worrying about?
> > > > Is it rather that multiple threads independently call mempool_get() and then
> > > > mempool_put()
> > > > on their own buffers? And you are worried that a thread will fail to return
> > > > (mempool_put) a
> > > > buffer that it earlier allocated (mempool_get)? We could create a litmus test for
> > > > that.
> > >
> > >
> > > Both scenarios are possible.
> > > For Run-To-Completion model each thread usually does: allocate/use/free group
> > of mbufs.
> > > For pipleline model one thread can allocate bunch of mbufs, then pass them to
> > other
> > thread (via another ring for example) for further processing and then releasing.
> > In the pipeline model, if the last stage (thread) frees (enqueues) buffers onto some
> > ring buffer
> > and the first stage (thread) allocates (dequeues) buffers from the same ring buffer
> > but there
> > isn't any other type of synchronization between the threads, we can never guarantee
> > that
> > the first thread will be able to dequeue buffers because it doesn't know whether the
> > last
> > thread has enqueued any buffers.
>
>
> Yes, as I said above - for mempool use-case: dequeue can fail, enqueue should always succeed.
> The closest analogy: malloc() can fail, free() should never fail.
>
>
> >
> > However, enqueue ought to always succeed. We should be able to create a litmus
> > test for that.
> > Ring 1 is used as mempool, it initially contains capacity elements (full).
> > Ring 2 is used as pipe between stages 1 and 2, it initially contains 0 elements (empty).
> > Thread 1 allocates/dequeues a buffer from ring 1.
> > Thread 1 enqueues that buffer onto ring 2.
> > Thread 2 dequeues a buffer from ring 2.
> > Thread 2 frees/enqueues that buffer onto ring 1. <<< this must succeed!
> > Does this reflect the situation you worry about?
>
>
> This is one of the possible scenarios.
> As I said above - mempool_put() is expected to always be able to enqueue element to the ring.
> TBH, I am not sure what you are trying to prove with the litmus test.
With a litmus test, we can prove that a specific situation can or cannot occur when
executing the specified memory accesses in multiple threads. By experimenting with
different memory access sequences and orderings, we can find the most relaxed
sequence that still prohibits undesired results.

So with a litmus test, I could prove that enqueue in thread 2 above will succeed or
alternatively, may not succeed (which is undesired). And I could find out what orderings
are required to guarantee success.

> Looking at the changes you proposed:
>
>
> + /*
> + * Ensure the entries calculation was not based on a stale
> + * and unsafe stail observation that causes underflow.
> + */
> + if ((int)*entries < 0)
> + *entries = 0;
> +
>
>
> /* check that we have enough room in ring */
> if (unlikely(n > *entries))
> n = (behavior == RTE_RING_QUEUE_FIXED) ?
> 0 : *entries;
>
>
> *new_head = *old_head + n;
> if (n == 0)
> return 0;
>
>
> It is clear that with these changes enqueue/dequeue might fail even
> when there are available entries in the ring.
Without explicit synchronization between the threads, they will still race and
a consumer cannot be guaranteed to succeed with its dequeue operation, e.g.
the dequeue operation could occur before the (supposedly) matching enqueue
operation in another thread.

Using acquire/release for all ring buffer metadata accesses doesn't inform the
thread that its operation is guaranteed to succeed, it just ensures any data is
transferred (or passed ownership) in a safe way (establishes a happens-before
relation). But the ring buffer implementation itself cannot ensure that the
enqueue in thread P happens before the dequeue in thread C. The dequeue
can still fail and return 0 elements.

How do you define "there are available entries in the ring"? Just reading one
metadata variable doesn't tell you this. And I assume users usually don't access
internal ring buffer metadata. So how does a thread know there are available
entries in the ring that can be dequeued?

It seems to me that your perspective is still very much that of sequential
consistency and total order. But even so, you need synchronization (usually
based on memory accesses but there are other ways, e.g. using system calls)
to ensure that operation D in thread 2 is only initiated after operation E in
thread 1 has completed. Otherwise, the operations will race and the outcome
is not guaranteed.

- Ola

> One simple way that probably will introduce a loop instead of 'if':
> (keep reading head and tail values till we get a valid results)
> but again I am not sure it is a good way.
> Konstantin



IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-24 13:28                           ` Ola Liljedahl
@ 2025-09-24 15:03                             ` Konstantin Ananyev
  2025-09-25  4:29                               ` Morten Brørup
  0 siblings, 1 reply; 25+ messages in thread
From: Konstantin Ananyev @ 2025-09-24 15:03 UTC (permalink / raw)
  To: Ola Liljedahl, Wathsala Vithanage, Honnappa Nagarahalli
  Cc: dev, Dhruv Tripathi, Bruce Richardson



> > > > > > > Sure, I am talking about MT scenario.
> > > > > > > I think I already provided an example: DPDK mempool library (see below).
> > > > > > > In brief, It works like that:
> > > > > > > At init it allocates ring of N memory buffers and ring big enough to hold all
> of
> > > > > them.
> > > > > >
> > > > > > Sorry, I meant to say: "it allocates N memory buffers and ring big enough to
> > > hold
> > > > > all of them".
> > > > > >
> > > > > > > Then it enqueues all allocated memory buffers into the ring.
> > > > > > > mempool_get - retrieves (dequeues) buffers from the ring.
> > > > > > > mempool_put - puts them back (enqueues) to the ring
> > > > > > > get() might fail (ENOMEM), while put is expected to always succeed.
> > > > > But how does the thread which calls mempool_put() get hold of the memory
> > > buffers
> > > > > that
> > > > > were obtained using mempool_get() by some other thread? Or this is not the
> > > > > scenario you
> > > > > are worrying about?
> > > > > Is it rather that multiple threads independently call mempool_get() and then
> > > > > mempool_put()
> > > > > on their own buffers? And you are worried that a thread will fail to return
> > > > > (mempool_put) a
> > > > > buffer that it earlier allocated (mempool_get)? We could create a litmus test
> for
> > > > > that.
> > > >
> > > >
> > > > Both scenarios are possible.
> > > > For Run-To-Completion model each thread usually does: allocate/use/free
> group
> > > of mbufs.
> > > > For pipleline model one thread can allocate bunch of mbufs, then pass them to
> > > other
> > > thread (via another ring for example) for further processing and then releasing.
> > > In the pipeline model, if the last stage (thread) frees (enqueues) buffers onto
> some
> > > ring buffer
> > > and the first stage (thread) allocates (dequeues) buffers from the same ring
> buffer
> > > but there
> > > isn't any other type of synchronization between the threads, we can never
> guarantee
> > > that
> > > the first thread will be able to dequeue buffers because it doesn't know whether
> the
> > > last
> > > thread has enqueued any buffers.
> >
> >
> > Yes, as I said above - for mempool use-case: dequeue can fail, enqueue should
> always succeed.
> > The closest analogy: malloc() can fail, free() should never fail.
> >
> >
> > >
> > > However, enqueue ought to always succeed. We should be able to create a
> litmus
> > > test for that.
> > > Ring 1 is used as mempool, it initially contains capacity elements (full).
> > > Ring 2 is used as pipe between stages 1 and 2, it initially contains 0 elements
> (empty).
> > > Thread 1 allocates/dequeues a buffer from ring 1.
> > > Thread 1 enqueues that buffer onto ring 2.
> > > Thread 2 dequeues a buffer from ring 2.
> > > Thread 2 frees/enqueues that buffer onto ring 1. <<< this must succeed!
> > > Does this reflect the situation you worry about?
> >
> >
> > This is one of the possible scenarios.
> > As I said above - mempool_put() is expected to always be able to enqueue element
> to the ring.
> > TBH, I am not sure what you are trying to prove with the litmus test.
> With a litmus test, we can prove that a specific situation can or cannot occur when
> executing the specified memory accesses in multiple threads. By experimenting with
> different memory access sequences and orderings, we can find the most relaxed
> sequence that still prohibits undesired results.
> 
> So with a litmus test, I could prove that enqueue in thread 2 above will succeed or
> alternatively, may not succeed (which is undesired). And I could find out what
> orderings
> are required to guarantee success.

I understand that part, what I am trying to say: this is just one of possible scenarios.
We could have multiple threads, multiple allocators, and multiple releasers,
and/or threads that will perform both.
Not to mention that this is just one library (mempool), there probably some others
Within DPDK that have similar expectations, not to mention third-party code. 
I am not trying to stop you from run whatever tests you like,
I am just a bit skeptical that it will cover all possible scenarios.

> > Looking at the changes you proposed:
> >
> >
> > + /*
> > + * Ensure the entries calculation was not based on a stale
> > + * and unsafe stail observation that causes underflow.
> > + */
> > + if ((int)*entries < 0)
> > + *entries = 0;
> > +
> >
> >
> > /* check that we have enough room in ring */
> > if (unlikely(n > *entries))
> > n = (behavior == RTE_RING_QUEUE_FIXED) ?
> > 0 : *entries;
> >
> >
> > *new_head = *old_head + n;
> > if (n == 0)
> > return 0;
> >
> >
> > It is clear that with these changes enqueue/dequeue might fail even
> > when there are available entries in the ring.
> Without explicit synchronization between the threads, they will still race and
> a consumer cannot be guaranteed to succeed with its dequeue operation, e.g.
> the dequeue operation could occur before the (supposedly) matching enqueue
> operation in another thread.
> 
> Using acquire/release for all ring buffer metadata accesses doesn't inform the
> thread that its operation is guaranteed to succeed, it just ensures any data is
> transferred (or passed ownership) in a safe way (establishes a happens-before
> relation). But the ring buffer implementation itself cannot ensure that the
> enqueue in thread P happens before the dequeue in thread C. The dequeue
> can still fail and return 0 elements.
> 
> How do you define "there are available entries in the ring"?
> Just reading one
> metadata variable doesn't tell you this. And I assume users usually don't access
> internal ring buffer metadata. So how does a thread know there are available
> entries in the ring that can be dequeued?

As I said - dequeue can fail, that's ok.
enqueue never should. 
As I explained, that is dictated by mempool lib design.
We create a ring with capacity=N, then put N objects into it.
Then we just dequeue/enqueue these elements from/to the ring.
We never try to put any other elements into the ring, so there always
has to be enough space in the ring to return object back to it.  
 
> It seems to me that your perspective is still very much that of sequential
> consistency and total order. But even so, you need synchronization (usually
> based on memory accesses but there are other ways, e.g. using system calls)
> to ensure that operation D in thread 2 is only initiated after operation E in
> thread 1 has completed. Otherwise, the operations will race and the outcome
> is not guaranteed.
> 
> - Ola
> 
> > One simple way that probably will introduce a loop instead of 'if':
> > (keep reading head and tail values till we get a valid results)
> > but again I am not sure it is a good way.
> > Konstantin
> 
> 
> 
> IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended recipient, please
> notify the sender immediately and do not disclose the contents to any other person,
> use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-23 21:57             ` Ola Liljedahl
  2025-09-24  6:56               ` Konstantin Ananyev
@ 2025-09-24 15:24               ` Stephen Hemminger
  1 sibling, 0 replies; 25+ messages in thread
From: Stephen Hemminger @ 2025-09-24 15:24 UTC (permalink / raw)
  To: Ola Liljedahl
  Cc: Konstantin Ananyev, Wathsala Vithanage, Honnappa Nagarahalli,
	dev, Dhruv Tripathi, Bruce Richardson

On Tue, 23 Sep 2025 21:57:32 +0000
Ola Liljedahl <Ola.Liljedahl@arm.com> wrote:

> IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.


Please fix your mail system (or switch to another email provider).
Because this kind of boilerplate interferes with sharing in free software.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-24 15:03                             ` Konstantin Ananyev
@ 2025-09-25  4:29                               ` Morten Brørup
  2025-09-25  7:11                                 ` Konstantin Ananyev
  0 siblings, 1 reply; 25+ messages in thread
From: Morten Brørup @ 2025-09-25  4:29 UTC (permalink / raw)
  To: Konstantin Ananyev, Ola Liljedahl, Wathsala Vithanage,
	Honnappa Nagarahalli
  Cc: dev, Dhruv Tripathi, Bruce Richardson

I have been following this interesting discussion, and want to clarify:

For a generic ring, enqueue can fail if the ring doesn't have sufficient free space, and dequeue can fail if it doesn't have sufficient objects in queue.

However, when a ring is used as the backing store for a mempool, enqueue can never fail. (Dequeue can still fail if the mempool has been depleted.)

The reason enqueue into a mempool ring can never fail is:
On creation of the mempool ring, the objects held by the mempool (e.g. mbufs) are allocated in memory and enqueued into the ring. If the mempool ring has size SIZE, then SIZE objects are allocated in memory and enqueued into the mempool ring.
So, since only SIZE objects exist in the whole world, and the mempool ring has size SIZE, enqueue of those objects into the mempool ring cannot fail, and the mempool "put" API reflects this.

Note that this is a requirement for the mempool API, not the ring API.
So, if the ring API doesn't provide this guarantee (that if only SIZE objects exist, enqueue cannot fail), then this guarantee could be implemented in the mempool library where it interfaces to the ring "enqueue" API (instead of having the ring API provide this guarantee).

However, other libraries or applications might assume the same guarantee for a ring when no more than SIZE objects exist. (I don't know!) If this is the case, then it is a ring API requirement, not just a mempool API requirement.
One possible solution to this could be offering two ring enqueue APIs, a "fast" API without the guarantee and a "safe" API with the guarantee.
Somewhat like the iteration macros for linked lists: foreach() and foreach_safe().

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
  2025-09-25  4:29                               ` Morten Brørup
@ 2025-09-25  7:11                                 ` Konstantin Ananyev
  0 siblings, 0 replies; 25+ messages in thread
From: Konstantin Ananyev @ 2025-09-25  7:11 UTC (permalink / raw)
  To: Morten Brørup, Ola Liljedahl, Wathsala Vithanage,
	Honnappa Nagarahalli
  Cc: dev, Dhruv Tripathi, Bruce Richardson



> I have been following this interesting discussion, and want to clarify:
> 
> For a generic ring, enqueue can fail if the ring doesn't have sufficient free space, and
> dequeue can fail if it doesn't have sufficient objects in queue.
> 
> However, when a ring is used as the backing store for a mempool, enqueue can
> never fail. (Dequeue can still fail if the mempool has been depleted.)
> 
> The reason enqueue into a mempool ring can never fail is:
> On creation of the mempool ring, the objects held by the mempool (e.g. mbufs) are
> allocated in memory and enqueued into the ring. If the mempool ring has size SIZE,
> then SIZE objects are allocated in memory and enqueued into the mempool ring.
> So, since only SIZE objects exist in the whole world, and the mempool ring has size
> SIZE, enqueue of those objects into the mempool ring cannot fail, and the mempool
> "put" API reflects this.
> 
> Note that this is a requirement for the mempool API, not the ring API.
> So, if the ring API doesn't provide this guarantee (that if only SIZE objects exist,
> enqueue cannot fail), then this guarantee could be implemented in the mempool
> library where it interfaces to the ring "enqueue" API (instead of having the ring API
> provide this guarantee).
> 
> However, other libraries or applications might assume the same guarantee for a ring
> when no more than SIZE objects exist. (I don't know!) If this is the case, then it is a
> ring API requirement, not just a mempool API requirement.

Yep, there could be other libs (both DPDK and third-party) that rely on that.
That's why I think we need to preserve existing behavior of the public ring API.  
Providing some extra (fast) version of API is possible, though I am not big fan
of that idea: it will complicate and increase existing code quite a bit, while
I don't think the gain would be that huge.
But again - I think that new API shall be subject of a separate patch-set and discussion.
As first thing we do need a fix or existing one.
Thank you for great summary.  
Konstantin

> One possible solution to this could be offering two ring enqueue APIs, a "fast" API
> without the guarantee and a "safe" API with the guarantee.
> Somewhat like the iteration macros for linked lists: foreach() and foreach_safe().


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2025-09-25  7:11 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-15 18:54 [PATCH 0/1] ring: correct ordering issue in head/tail update Wathsala Vithanage
2025-09-15 18:54 ` [PATCH 1/1] ring: safe partial ordering for " Wathsala Vithanage
2025-09-16 15:42   ` Bruce Richardson
2025-09-16 18:19     ` Ola Liljedahl
2025-09-17  7:47       ` Bruce Richardson
2025-09-17 15:06         ` Stephen Hemminger
2025-09-18 17:40         ` Wathsala Vithanage
2025-09-16 22:57   ` Konstantin Ananyev
2025-09-16 23:08     ` Konstantin Ananyev
     [not found]     ` <2a611c3cf926d752a54b7655c27d6df874a2d0de.camel@arm.com>
2025-09-17  7:58       ` Konstantin Ananyev
2025-09-17  9:05         ` Ola Liljedahl
2025-09-20 12:01           ` Konstantin Ananyev
     [not found]             ` <cf7e14d4ba5e9d78fddf083b6c92d75942447931.camel@arm.com>
2025-09-22  7:12               ` Konstantin Ananyev
2025-09-23 21:57             ` Ola Liljedahl
2025-09-24  6:56               ` Konstantin Ananyev
2025-09-24  7:50                 ` Konstantin Ananyev
2025-09-24  8:51                   ` Ola Liljedahl
2025-09-24 10:08                     ` Konstantin Ananyev
2025-09-24 11:27                       ` Ola Liljedahl
2025-09-24 11:50                         ` Konstantin Ananyev
2025-09-24 13:28                           ` Ola Liljedahl
2025-09-24 15:03                             ` Konstantin Ananyev
2025-09-25  4:29                               ` Morten Brørup
2025-09-25  7:11                                 ` Konstantin Ananyev
2025-09-24 15:24               ` Stephen Hemminger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).