From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
To: Konstantin Ananyev <konstantin.ananyev@huawei.com>,
Ola Liljedahl <Ola.Liljedahl@arm.com>,
Wathsala Vithanage <wathsala.vithanage@arm.com>,
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>,
Dhruv Tripathi <Dhruv.Tripathi@arm.com>,
Bruce Richardson <bruce.richardson@intel.com>
Subject: RE: [PATCH 1/1] ring: safe partial ordering for head/tail update
Date: Wed, 24 Sep 2025 07:50:41 +0000
Message-ID: <d485010b97654d09a1921b97879faf90@huawei.com>
In-Reply-To: <8eab3da6eb2944a4a4a197aeb38ee2f1@huawei.com>
> > > > > To avoid information loss, I combined my replies to two of Wathsala's emails into one.
> > > > >
> > > > >
> > > > > > > > The function __rte_ring_headtail_move_head() assumes that the
> > > > > > > > barrier (fence) between the load of the head and the
> > > > > > > > load-acquire of the opposing tail guarantees the following: if a
> > > > > > > > first thread reads tail and then writes head and a second thread
> > > > > > > > reads the new value of head and then reads tail, then it should
> > > > > > > > observe the same (or a later) value of tail.
> > > > > > > >
> > > > > > > > This assumption is incorrect under the C11 memory model. If the
> > > > > > > > barrier (fence) is intended to establish a total ordering of
> > > > > > > > ring operations, it fails to do so. Instead, the current
> > > > > > > > implementation only enforces a partial ordering, which can lead
> > > > > > > > to unsafe interleavings. In particular, some partial orders can
> > > > > > > > cause underflows in free slot or available element computations,
> > > > > > > > potentially resulting in data corruption.
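
For illustration, here is a minimal C11 sketch of the interleaving being
described (a hedged example: the variable and function names are invented
for the sketch, this is not the actual DPDK code):

	#include <stdatomic.h>

	atomic_uint head, tail;

	/* Thread 1: reads the opposing tail, then publishes a new head. */
	void t1(void)
	{
		unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
		/* ... compute free space from t ... */
		atomic_store_explicit(&head, 1u, memory_order_relaxed);
	}

	/* Thread 2: takes over the same role - reads the new head, then
	 * reads the opposing tail after a fence. */
	void t2(void)
	{
		unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
		/* Under the C11 model, h == 1 does NOT imply that t is at
		 * least as fresh as the tail value thread 1 observed; a
		 * stale t can make the entries computation underflow. */
	}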
> > > > > > >
> > > > > > > Hmm... sounds exactly like the problem from the patch we discussed
> > > > > > > earlier that year:
> > > > > > >
> > > > > > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-konstantin.ananyev@huawei.com/
> > > > > > > In two words:
> > > > > > > "... thread can see 'latest' 'cons.head' value, with 'previous' value
> > > > > > > for 'prod.tail' or vice-versa.
> > > > > > > In other words: 'cons.head' value depends on 'prod.tail', so before
> > > > > > > making latest 'cons.head'
> > > > > > > value visible to other threads, we need to ensure that latest
> > > > > > > 'prod.tail' is also visible."
> > > > > > > Is that the one?
> > > > >
> > > > >
> > > > > > Yes, the behavior occurs under RCpc (LDAPR) but not under RCsc (LDAR),
> > > > > > which is why we didn’t catch it earlier. A fuller explanation, with
> > > > > > Herd7 simulations, is in the blog post linked in the cover letter.
> > > > > >
> > > > > > https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-order
> > > > >
> > > > >
> > > > > I see, so now it is reproducible with core rte_ring on real HW.
> > > > >
> > > > >
> > > > > > >
> > > > > > > > The issue manifests when a CPU first acts as a producer and
> > > > > > > > later as a consumer. In this scenario, the barrier assumption
> > > > > > > > may fail when another core takes the consumer role. A Herd7
> > > > > > > > litmus test in C11 can demonstrate this violation. The problem
> > > > > > > > has not been widely observed so far because:
> > > > > > > > (a) on strong memory models (e.g., x86-64) the assumption
> > > > > > > >     holds, and
> > > > > > > > (b) on relaxed models with RCsc semantics the ordering is still
> > > > > > > >     strong enough to prevent hazards.
> > > > > > > > The problem becomes visible only on weaker models, when
> > > > > > > > load-acquire is implemented with RCpc semantics (e.g. some
> > > > > > > > AArch64 CPUs which support the LDAPR and LDAPUR instructions).
> > > > > > > >
> > > > > > > > Three possible solutions exist:
> > > > > > > > 1. Strengthen ordering by upgrading release/acquire semantics to
> > > > > > > >    sequential consistency. This requires using seq-cst for
> > > > > > > >    stores, loads, and CAS operations. However, this approach
> > > > > > > >    introduces a significant performance penalty on relaxed-memory
> > > > > > > >    architectures.
> > > > > > > >
> > > > > > > > 2. Establish a safe partial order by enforcing a pair-wise
> > > > > > > >    happens-before relationship between threads of the same role,
> > > > > > > >    converting the CAS and the preceding load of the head to
> > > > > > > >    release and acquire respectively. This approach makes the
> > > > > > > >    original barrier assumption unnecessary and allows its removal.
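
As a rough, hedged sketch of what approach #2 amounts to (simplified from
__rte_ring_headtail_move_head(); burst handling, single-thread paths and
the HTS/RTS variants are omitted, so this is not the literal patch):

	do {
		n = max;

		/* acquire: pairs with the release CAS below performed by
		 * another thread of the same role, so the tail observation
		 * that thread made is also visible here. */
		*old_head = rte_atomic_load_explicit(&d->head,
				rte_memory_order_acquire);

		/* load-acquire: pairs with the store-release of s->tail
		 * in update_tail(). */
		stail = rte_atomic_load_explicit(&s->tail,
				rte_memory_order_acquire);

		/* entries/free-slots computation as today */
		*entries = (capacity + stail - *old_head);
		if (n > *entries)
			n = *entries;
		if (n == 0)
			return 0;

		*new_head = *old_head + n;
		/* release: publishes the new head together with everything
		 * this thread observed, including stail. */
		success = rte_atomic_compare_exchange_strong_explicit(
				&d->head, old_head, *new_head,
				rte_memory_order_release,
				rte_memory_order_relaxed);
	} while (success == 0);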
> > > > > > >
> > > > > > > For the sake of clarity, can you outline what the exact code
> > > > > > > changes would be for approach #2? Same as in that patch:
> > > > > > >
> > > > > > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-konstantin.ananyev@huawei.com/
> > > > > > > Or something different?
> > > > > >
> > > > > > Sorry, I missed the latter half of your comment before.
> > > > > > Yes, you have proposed the same solution there.
> > > > >
> > > > >
> > > > > Ok, thanks for confirmation.
> > > > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > 3. Retain partial ordering but ensure only safe partial orders
> > > > > > > >    are committed. This can be done by detecting underflow
> > > > > > > >    conditions (producer < consumer) and quashing the update in
> > > > > > > >    such cases. This approach makes the original barrier
> > > > > > > >    assumption unnecessary and allows its removal.
> > > > > > >
> > > > > > > > This patch implements solution (3) for performance reasons.
> > > > > > > >
> > > > > > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> > > > > > > > Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > > > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > > > > > Reviewed-by: Dhruv Tripathi <dhruv.tripathi@arm.com>
> > > > > > > > ---
> > > > > > > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > > > > > > > 1 file changed, 7 insertions(+), 3 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> > > > > > > > index b9388af0da..e5ac1f6b9e 100644
> > > > > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > > > > @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
> > > > > > > >  		/* Reset n to the initial burst count */
> > > > > > > >  		n = max;
> > > > > > > >
> > > > > > > > -		/* Ensure the head is read before tail */
> > > > > > > > -		rte_atomic_thread_fence(rte_memory_order_acquire);
> > > > > > > > -
> > > > > > > >  		/* load-acquire synchronize with store-release of ht->tail
> > > > > > > >  		 * in update_tail.
> > > > > > > >  		 */
> > > > > > >
> > > > > > > > But then cons.head can be read before prod.tail (and vice-versa),
> > > > > > > > right?
> > > > > >
> > > > > > Right, we let it happen but eliminate any resulting states that are
> > > > > > semantically incorrect at the end.
> > > > >
> > > > >
> > > > > Two comments here:
> > > > > 1) I think it is probably safer to do the check like that:
> > > > > if (*entries > ring->capacity) ...
> > > > Yes, this might be another way of handling underflow situations. We could
> > > > study this.
> > > >
> > > > I have used the check for negative without problems in my ring buffer
> > > > implementations
> > > > https://github.com/ARM-software/progress64/blob/master/src/p64_ringbuf.c
> > > > but I can't say it has been battle-tested.
> > >
> > >
> > > My thought was about the case (probably hypothetical) when the difference
> > > between stale tail and head will be bigger than 2^31 + 1.
> > >
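To make that concrete with 32-bit arithmetic (a hypothetical worked example
on the consumer side): entries = (uint32_t)(stale_tail - cons_head), so if
the stale tail lags the head by d, the unsigned result is 2^32 - d. For
0 < d < 2^31 that is >= 2^31, so (int)entries < 0 fires. But for d > 2^31
the result drops below 2^31, (int)entries is positive, and the underflow
goes undetected - whereas *entries > ring->capacity would still catch it
for any d < 2^32 - capacity.
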
> > >
> > > > > 2) My concern is that without forcing a proper read ordering
> > > > > (cons.head first then prod.tail) we re-introduce a window for all sorts of
> > > > > ABA-like problems.
> > > > Head and tail indexes are monotonically increasing so I don't see a risk
> > > > for ABA-like problems.
> > >
> > >
> > > I understand that, but with current CPU speeds it can take rte_ring just a
> > > few seconds to wrap around head/tail values. If a user is doing something
> > > really fancy - like using the rte_ring ZC API (i.e. just moving head/tail
> > > without reading actual objects) - that can probably happen even faster
> > > (less than a second?).
> > > Are we sure that the stale tail value will never persist that long?
> > > Let's say a user is calling move_head() in a loop till it succeeds?
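
For a rough sense of the timescale (a back-of-the-envelope estimate assuming
32-bit indexes): a ring sustaining 10^8 elements per second advances the head
by 2^32 in about 43 seconds, and a ZC-style user moving, say, 512 slots per
call at a comparable call rate would wrap the counter in well under a second.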
> > >
> > >
> > > > Indeed, adding a monotonically increasing tag to pointers is the common
> > > > way of avoiding ABA problems in lock-free designs.
> > >
> > >
> > > Yep, using 64-bit values for head/tail counters will help to avoid these concerns.
> > > But it will probably break HTS/RTS modes, plus it is an ABI change for sure.
> > >
> > >
> > > Actually, after another thought, I have one more concern here:
> > >
> > >
> > > + /*
> > > + * Ensure the entries calculation was not based on a stale
> > > + * and unsafe stail observation that causes underflow.
> > > + */
> > > + if ((int)*entries < 0)
> > > + *entries = 0;
> > > +
> > >
> > >
> > > With that change, it might return invalid information back to the user
> > > about the number of free/occupied entries in the ring.
> > > Plus rte_ring_enqueue() now might fail even when there are enough free
> > > entries in the ring (same for dequeue).
> > How do you (or the thread) know there are enough free (or used) entries?
> > Do you assume sequentially consistent behaviour (a total order of memory
> > accesses)?
> > Otherwise, you would need to explicitly create a happens-before relation
> > between threads, e.g. a consumer which made room in the ring buffer must
> > synchronize-with the producer that there is now room for more elements. That
> > synchronize-with edge will ensure the producer reads a fresh value of stail. But
> > without it, how can a thread know the state of the ring buffer that is being
> > manipulated by another thread?
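
(A minimal hedged sketch of such a synchronize-with edge, mixing C11 atomics
with the rte_ring bulk API; the room_made flag is an assumed variable, not an
existing rte_ring facility:)

	#include <stdatomic.h>
	#include <rte_ring.h>

	atomic_bool room_made;

	/* consumer thread: free N slots, then publish that fact */
	rte_ring_dequeue_bulk(r, objs, N, NULL);
	atomic_store_explicit(&room_made, true, memory_order_release);

	/* producer thread: the acquire load synchronizes-with the
	 * store-release above, so the consumer's store-release of
	 * cons.tail is visible and move_head() reads a fresh stail. */
	while (!atomic_load_explicit(&room_made, memory_order_acquire))
		;
	/* guaranteed room for N, assuming no other producer raced in */
	rte_ring_enqueue_bulk(r, objs, N, NULL);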
> >
> > > That looks like a change in our public API behavior that might break many
> > > things.
> > > There are quite a few places where the caller expects an enqueue/dequeue
> > > operation to always succeed (let's say there should always be enough free
> > > space in the ring).
> > Single-threaded scenarios are not a problem. Do you have a multithreaded
> > scenario where the caller expects enqueue/dequeue to always succeed?
> > How are the threads involved in such a scenario synchronizing with each
> > other?
>
> Sure, I am talking about an MT scenario.
> I think I already provided an example: the DPDK mempool library (see below).
> In brief, it works like that:
> At init it allocates ring of N memory buffers and ring big enough to hold all of them.
Sorry, I meant to say: "it allocates N memory buffers and a ring big enough to hold all of them".
> Then it enqueues all allocated memory buffers into the ring.
> mempool_get - retrieves (dequeues) buffers from the ring.
> mempool_put - puts them back (enqueues) to the ring.
> get() might fail (ENOMEM), while put() is expected to always succeed.
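
A hedged sketch of that usage pattern (simplified: the real rte_mempool also
has per-lcore caches, and N, ELT_SIZE and CACHE_SIZE are placeholders here):

	#include <rte_mempool.h>

	/* N objects and a ring sized to hold all N, so at most N objects
	 * are ever outstanding and put() can never find the ring full. */
	struct rte_mempool *mp = rte_mempool_create("example_pool",
			N, ELT_SIZE, CACHE_SIZE, 0,
			NULL, NULL, NULL, NULL, SOCKET_ID_ANY, 0);

	void *obj;
	if (rte_mempool_get(mp, &obj) == 0) {	/* may fail when empty */
		/* ... use obj ... */
		rte_mempool_put(mp, obj);	/* expected to always succeed */
	}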
>
> >
> > > For example: rte_mempool works like that.
> > > I am pretty sure there are quite a few other places like that inside DPDK,
> > > not to mention third-party code.
> > >
> > >
> > > Considering all of the above, I am actually more in favor
> > > of combining approaches #2 and #3 for the final patch:
> > > establish a safe partial order (#2) and keep the check from #3
> > > (should it become an assert()/verify()?)
> > I agree that using acquire/release for all prod/cons_head accesses will make
> > it easier to reason about the ring buffer state. Sequential consistency
> > (total order) is the easiest to reason about and often seems to be desired
> > and expected by programmers (e.g. "I'll just add a barrier here to ensure A
> > happens before B in this thread, now there is a total order...").
> >
> > - Ola
> >
> > >
> > >
> > > Another thing to note: whatever final approach we choose -
> > > we need to make sure that the problem is addressed across all other
> > > rte_ring flavors/modes too (generic implementation, rts/hts mode, soring).
> > >
> > >
> > > Konstantin
> >
> >
Thread overview: 25+ messages
2025-09-15 18:54 [PATCH 0/1] ring: correct ordering issue in " Wathsala Vithanage
2025-09-15 18:54 ` [PATCH 1/1] ring: safe partial ordering for " Wathsala Vithanage
2025-09-16 15:42 ` Bruce Richardson
2025-09-16 18:19 ` Ola Liljedahl
2025-09-17 7:47 ` Bruce Richardson
2025-09-17 15:06 ` Stephen Hemminger
2025-09-18 17:40 ` Wathsala Vithanage
2025-09-16 22:57 ` Konstantin Ananyev
2025-09-16 23:08 ` Konstantin Ananyev
[not found] ` <2a611c3cf926d752a54b7655c27d6df874a2d0de.camel@arm.com>
2025-09-17 7:58 ` Konstantin Ananyev
2025-09-17 9:05 ` Ola Liljedahl
2025-09-20 12:01 ` Konstantin Ananyev
[not found] ` <cf7e14d4ba5e9d78fddf083b6c92d75942447931.camel@arm.com>
2025-09-22 7:12 ` Konstantin Ananyev
2025-09-23 21:57 ` Ola Liljedahl
2025-09-24 6:56 ` Konstantin Ananyev
2025-09-24 7:50 ` Konstantin Ananyev [this message]
2025-09-24 8:51 ` Ola Liljedahl
2025-09-24 10:08 ` Konstantin Ananyev
2025-09-24 11:27 ` Ola Liljedahl
2025-09-24 11:50 ` Konstantin Ananyev
2025-09-24 13:28 ` Ola Liljedahl
2025-09-24 15:03 ` Konstantin Ananyev
2025-09-25 4:29 ` Morten Brørup
2025-09-25 7:11 ` Konstantin Ananyev
2025-09-24 15:24 ` Stephen Hemminger