DPDK patches and discussions
 help / color / mirror / Atom feed
From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
	"jerinj@marvell.com" <jerinj@marvell.com>,
	"drc@linux.ibm.com" <drc@linux.ibm.com>, nd <nd@arm.com>
Subject: RE: rte_ring move head  question for machines with relaxed MO (arm/ppc)
Date: Tue, 15 Oct 2024 15:11:36 +0000	[thread overview]
Message-ID: <8ab3aad0be784a4a9713d9d09ad80635@huawei.com> (raw)
In-Reply-To: <PAWPR08MB890977BA6D16EA389AA289589F792@PAWPR08MB8909.eurprd08.prod.outlook.com>


> > > > > > > > > 1. rte_ring_generic_pvt.h:
> > > > > > > > > =====================
> > > > > > > > >
> > > > > > > > > pseudo-c-code                                      //        related armv8 instructions
> > > > > > > > > --------------------                                                 ------------------------------
> > ----
> > > > ----
> > > > > > > > >  head.load()                                          //        ldr [head]
> > > > > > > > >  rte_smp_rmb()                                    //        dmb ishld
> > > > > > > > >  opposite_tail.load()                            //        ldr [opposite_tail]
> > > > > > > > >  ...
> > > > > > > > >  rte_atomic32_cmpset(head, ...)      //        ldrex[head];...
> > stlex[head]
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2. rte_ring_c11_pvt.h
> > > > > > > > > =====================
> > > > > > > > >
> > > > > > > > > pseudo-c-code                                       //        related armv8
> > instructions
> > > > > > > > > --------------------                                                 ------------------------------
> > ----
> > > > ----
> > > > > > > > > head.atomic_load(relaxed)                 //        ldr[head]
> > > > > > > > > atomic_thread_fence(acquire)           //        dmb ish
> > > > > > > > > opposite_tail.atomic_load(acquire)   //        lda[opposite_tail]
> > > > > > > > > ...
> > > > > > > > > head.atomic_cas(..., relaxed)              //        ldrex[haed]; ...
> > strex[head]
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 3.   rte_ring_hts_elem_pvt.h
> > > > > > > > > ==========================
> > > > > > > > >
> > > > > > > > > pseudo-c-code                                       //        related armv8
> > instructions
> > > > > > > > > --------------------                                                 ------------------------------
> > ----
> > > > ----
> > > > > > > > > head.atomic_load(acquire)                //        lda [head]
> > > > > > > > > opposite_tail.load()                             //        ldr [opposite_tail]
> > > > > > > > > ...
> > > > > > > > > head.atomic_cas(..., acquire)            //         ldaex[head]; ...
> > strex[head]
> > > > > > > > >
> > > > > > > > > The questions that arose from these observations:
> > > > > > > > > a) are all 3 approaches equivalent in terms of functionality?
> > > > > > > > Different, lda (Load with acquire semantics) and ldr (load) are
> > different.
> > > > > > >
> > > > > > > I understand that, my question was:
> > > > > > > lda {head]; ldr[tail]
> > > > > > > vs
> > > > > > > ldr [head]; dmb ishld; ldr [tail];
> > > > > > >
> > > > > > > Is there any difference in terms of functionality (memory ops
> > > > > > ordering/observability)?
> > > > > >
> > > > > > To be more precise:
> > > > > >
> > > > > > lda {head]; ldr[tail]
> > > > > > vs
> > > > > > ldr [head]; dmb ishld; ldr [tail]; vs ldr [head]; dmb ishld; lda
> > > > > > [tail];
> > > > > >
> > > > > > what would be the difference between these 3 cases?
> > > > >
> > > > > Case A: lda {head]; ldr[tail]
> > > > > load of the head will be observed by the memory subsystem before
> > > > > the load of the tail.
> > > > >
> > > > > Case B: ldr [head]; dmb ishld; ldr [tail]; load of the head will
> > > > > be observed by the memory subsystem Before the load of the tail.
> > > > >
> > > > >
> > > > > Essentially both cases A and B are the same.
> > > > > They preserve following program orders.
> > > > > LOAD-LOAD
> > > > > LOAD-STORE
> > > >
> > > > Ok, that is crystal clear, thanks for explanation.
> > > >
> > > >
> > > > > Case C: ldr [head]; dmb ishld; lda [tail]; load of the head will
> > > > > be observed by the memory subsystem before the load of the tail.
> > > >
> > > > Ok.
> > > >
> > > > > In addition, any load or store program order after lda[tail] will
> > > > > not be observed by the memory subsystem before the load of the tail.
> > > >
> > > > Ok... the question is why we need that extra hoisting barrier here?
> > > > From what unwanted  re-orderings we are protecting here?
> > > > Does it mean that without it, ldrex/strex (CAS) can be reordered
> > > > with load[cons.tail]?
> > > >
> > > > Actually, we probably need to look at whole picture:
> > > >
> > > > in rte_ring_generic_pvt.h
> > > > =====================
> > > >
> > > > ldr [prod.head]
> > > > dmb ishld
> > > > ldr [cons.tail]
> > > > ...
> > > > /* cas */
> > > > ldrex [prod.head]
> > > > stlex [prod.head]   /* sink barrier */
> > > >
> > > > in rte_ring_c11_pvt.h
> > > > =====================
> > > >
> > > > ldr [prod.head]
> > > > dmb ishld
> > > > lda [cons.tail]          /* exrea hoist */
> > > > ...
> > > > /* cas */
> > > > ldrex [prod.head]
> > > > strex [prod.head]
> > >
> > > Minor thing, ldrex and strex is how Arm 32 way of doing CAS.
> > > Aaarch64 has a cas instruction.
> > > Same code in aarch64 armv9-a https://godbolt.org/z/TMvWx6v4n
> >
> > Cool, thanks.
> >
> > >
> > > >
> > > > So, in _genereic_ we don't have that extra hoist barrier after
> > > > load[con.tail], but we have extra sink barrier at cas(prod.tail).
> > > >
> > >
> > > You are right, technically, that lda[cons.tail] is not required
> > > because due to the dependency chain up until CAS a memory reordering is
> > not possible.
> > > For that reason, it has no issue synchronizing with the strl[prod.tail] (in tail-
> > update).
> > > C11 standard calls it consume-memory-order (__ATOMIC_CONSUME in
> > GCC).
> > >
> > > So, ideally one could have written something like...
> > > __atomic_load_n(prod.head, __ATOMIC_CONSUME); instead of
> > > __atomic_load_n(prod.head, __ATOMIC_ACQUIRE);
> > >
> > > The compiler is then supposed to figure out if there are any
> > > dependencies in the code path to ensure that load of the prod.head is
> > > observed before any load/store that's program order after the load of the
> > prod.head.
> > > If not, the compiler is supposed to add required barrier to ensure
> > > that order is preserved.
> > >
> > > However, it's easier said than done. No, compiler implements it and
> > > C11 standard discourages use of memory-order-consume for that reason.
> > >
> > > This brings us to the next caveat. As per C11 standard, there cannot
> > > be a free standing Store with release semantics (stlr)  that isn't
> > > paired with a load with acquire or consume semantics. Since we can't
> > > use __ATOMIC_CONSUME (which would have resulted in ldr[prod.head]),
> > we are forced to use __ATOMIC_ACQUIRE (which results in lda[prod.head]).
> > >
> > >
> > > > If that's correct observation, can we change _c11_ implementation to
> > > > match _generic_ one by:
> > > >
> > > >  atomic_load(prod.head, releaxed);
> > > >  atomic_thread_fence(acquire);
> > > >  atomic_load(cons.tail, releaxed);
> > > > ....
> > > > atomic_cas(prod.head, release, relaxed); ?
> > > >
> > > > From my understanding that should help to make these 2 implantations
> > > > Identical, and then hopefully we can get rid of rte_ring_generic_pvt.h.
> > > >
> > >
> > > They both are functionally correct.
> > > _generic is correct but does not comply with the C11 specification.
> > > _c11 is correct and fully compliant with the C11 specification.
> >
> > So far, I didn't say they are 'functionally incorrect'.
> > The problem is - we have two implementations for exactly the same thing that
> > generate different code-sequence for the same platform.
> > By default It is switched on/off depending on the platform ('off' - at
> > arm/thunderx, ppc, x86, 'on' on other arms and riscv ).
> > So for some platforms _c11_ is probably never used (and less tested), while on
> 
> C11 is guaranteed to work on all platforms if built with a C11 compliant compiler.
> 
> > others _generic_ is not used and under tested.
> 
> _generic_ by chance sets a dependency chain up to the point array operations are
> done, so there is an implicit head.load(consume). It's possible that there can be a
> platform/compiler combination that does not do that.
> 
> > Obviously it is not a good situation and better be addressed.
> > As I remember the reason while we ended up with 2 code-paths here - on
> > some platforms _c11_ one was slower.
> > Honnappa, please correct me if I am wrong here.
> > So ideal solution seems be - make _c11_ generate exactly the same
> > code as current _generic_, so we can deprecate and remove _generic_.
> >
> 
> Ideal solution would be to make _generic_ generate the same code _c11_
> generates for the above reasons. But, at least for Arm, when compiled with
> GCC we know that there is a dependency chain that guarantees the program
> order (implicit consume), so it's safe to use and slightly more performant.
> 
> > > Replacing atomic_load(cons.tail, acquire) with load(cons.tail,
> > > relaxed) in
> > > _c11 would make it non-compliant with C11 the spec.
> >

Could you point me to exact place where sequence:
atomic_store(release); atomic_load(relaxed);
in C11 is marked and ' non-compliant'?
I suppose you are referring to:
"...
Release-Acquire ordering
If an atomic store in thread A is tagged memory_order_release, an atomic load in thread B from the same variable is tagged memory_order_acquire, and the load in thread B reads a value written by the store in thread A, then the store in thread A synchronizes-with the load in thread B.
All memory writes (including non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory. This promise only holds if B actually returns the value that A stored, or a value from later in the release sequence.
The synchronization is established only between the threads releasing and acquiring the same atomic variable. 
..."
From: https://en.cppreference.com/w/c/atomic/memory_order
Or something else? 

> > Hmm... where this constraint comes from exactly?
> > AFAIK, inside DPDK, we have several places where we do use
> > 'atomic_store(var, release);' in conjunction with 'var.atomic_load(relaxed);'
> 
> That's ok, such code can exist just like in _generic ring.
> As long as they are correct on all supported platforms and no one states it's C11 compliant, there is no issue, no?

From my perspective - not really.
We have 2 different implementations for the same functionality.
Both are platform independent, but reasons why one or the other should be
selected on particular platform are not clear. 
That's not ideal situation in terms of support and maintenance.
As I remember, when few years ago _c11_-style atomics were introduced into DPDK
the main agreement was: in all appropriate places  replace our home-grown sync primitives
with _c11_ atomics.
AFAIK, it was done for nearly whole DPDK code - probably ring _generic_ is one of few
remaining ones.
Again, as I remember - main reason for that was that _generic_ is slightly more performant,
then _c11_.
That's why my thought was - can we try to optimize _c11_ code-path?
If you insist that it is not possible - ok, then I still think we should consider to stick with _c11_
only for all platforms from some point.

Another thing - in rte_ring library now we have 4 different sync modes for cons and prod:
ST, MT, HTS, RTS  and user is free to mix and match them in a way he wants.
So we can have move_head() from one implementation and update_tail() 
from different one.
HTS and RTS also use _c11_ style atomics but their code for 'move_head()' differs a bit
from _c11_ ring.  
I particular they use for prod/cons.tail:
atomic_store(release);  /* in update_tail() */ 
atomic_load(relaxed);   /* in move_head() */
Which as I understand, you consider as 'non-compilant with C11',
and not guaranteed to work on all possible platforms, right? 

Again, there is  patch that will probably complicate things even more:
https://patchwork.dpdk.org/project/dpdk/patch/20241015130111.826-5-konstantin.v.ananyev@yandex.ru/

So my thought was: we better have one common principle for all
these different move_head()s. 

> >
> > Again, if that really that strict - why in _c11_ move_head() it is ok  to use
> > 'head.load(acquire)' in conjunction with 'head.cas(relaxed);'?
> > Following that logic, it should always be 'head.cas(release);', no?
> >
> _c11_move_head() does not head.load(acquire), it's head.load(relaxed).
> there is a tail.load(acquire) which must synchronize with tail.store(release).
> 
> Prod                                                            Cons
> ------                                                             -------
> ProdHead.load(relaxed)                          ConsHead.load(relaxed)

Correction: we have 
rte_atomic_thread_fence(rte_memory_order_acquire);
right here, between these 2 loads.
And as I understand:
prod.head.load(relaxed); fence(acquire);
is identical (in terms of synchronization) to prod.head.load(acquire). 

> ConsTail.load(acquire)                             ProdTail.load(acquire)
> ProdHead.cas(relaxed)                            ConsHead.cas(relaxed)

Another question - if head.cas() is relaxed here, what prevents it
to be reordered with relaxed loads/stores in AddToArray?
I presume just dependency order for head value?
Anything else?  

> AddToArray                                               RemoveFromArray
> ProdTail.store(release)                            ConsTail.store(release)
> 
> Above is how _c11 is implemented.
> ConsTail.load(acquire)  in Prod synchronizes with the ConsTail.store(release)
> In Cons.
> ProdTail.load(acquire) in Cons synchronizes with the ProdTail.store(release)
> In Prod.
> 
> If you want, you can switch to ConsTail.load(consume) and ProdTail.load(consume).
> But, due to lack of support as explained before that results identical to
> ConsTail.load(acquire) and ProdTail.load(acquire).
> Also, C11 standard highly discourage it's use for the time being.


  reply	other threads:[~2024-10-15 15:11 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-08 12:58 Konstantin Ananyev
2024-10-08 15:09 ` Wathsala Wathawana Vithanage
2024-10-08 15:12   ` Wathsala Wathawana Vithanage
2024-10-08 15:45   ` Konstantin Ananyev
2024-10-08 15:56     ` Konstantin Ananyev
2024-10-09 17:27       ` Wathsala Wathawana Vithanage
2024-10-10 16:54         ` Konstantin Ananyev
2024-10-11  0:11           ` Wathsala Wathawana Vithanage
2024-10-11 14:08             ` Konstantin Ananyev
2024-10-11 15:48               ` Wathsala Wathawana Vithanage
2024-10-15 15:11                 ` Konstantin Ananyev [this message]
2024-10-09  1:41     ` Wathsala Wathawana Vithanage
2024-10-09  2:22     ` Wathsala Wathawana Vithanage

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8ab3aad0be784a4a9713d9d09ad80635@huawei.com \
    --to=konstantin.ananyev@huawei.com \
    --cc=Honnappa.Nagarahalli@arm.com \
    --cc=dev@dpdk.org \
    --cc=drc@linux.ibm.com \
    --cc=jerinj@marvell.com \
    --cc=nd@arm.com \
    --cc=wathsala.vithanage@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).