RE: rte_ring move head question for machines with relaxed MO (arm/ppc)

DPDK patches and discussions
 help / color / mirror / Atom feed

From: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>
To: Konstantin Ananyev <konstantin.ananyev@huawei.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
	"jerinj@marvell.com" <jerinj@marvell.com>,
	"drc@linux.ibm.com" <drc@linux.ibm.com>, nd <nd@arm.com>
Subject: RE: rte_ring move head  question for machines with relaxed MO (arm/ppc)
Date: Fri, 11 Oct 2024 15:48:52 +0000	[thread overview]
Message-ID: <PAWPR08MB890977BA6D16EA389AA289589F792@PAWPR08MB8909.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <ddb22568929a44fdb5c0457f668dbb87@huawei.com>



> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Friday, October 11, 2024 9:09 AM
> To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>;
> dev@dpdk.org
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> jerinj@marvell.com; drc@linux.ibm.com
> Subject: RE: rte_ring move head question for machines with relaxed MO
> (arm/ppc)
> 
> 
> 
> > >
> > > > > > > > 1. rte_ring_generic_pvt.h:
> > > > > > > > =====================
> > > > > > > >
> > > > > > > > pseudo-c-code                                      //        related armv8 instructions
> > > > > > > > --------------------                                                 ------------------------------
> ----
> > > ----
> > > > > > > >  head.load()                                          //        ldr [head]
> > > > > > > >  rte_smp_rmb()                                    //        dmb ishld
> > > > > > > >  opposite_tail.load()                            //        ldr [opposite_tail]
> > > > > > > >  ...
> > > > > > > >  rte_atomic32_cmpset(head, ...)      //        ldrex[head];...
> stlex[head]
> > > > > > > >
> > > > > > > >
> > > > > > > > 2. rte_ring_c11_pvt.h
> > > > > > > > =====================
> > > > > > > >
> > > > > > > > pseudo-c-code                                       //        related armv8
> instructions
> > > > > > > > --------------------                                                 ------------------------------
> ----
> > > ----
> > > > > > > > head.atomic_load(relaxed)                 //        ldr[head]
> > > > > > > > atomic_thread_fence(acquire)           //        dmb ish
> > > > > > > > opposite_tail.atomic_load(acquire)   //        lda[opposite_tail]
> > > > > > > > ...
> > > > > > > > head.atomic_cas(..., relaxed)              //        ldrex[haed]; ...
> strex[head]
> > > > > > > >
> > > > > > > >
> > > > > > > > 3.   rte_ring_hts_elem_pvt.h
> > > > > > > > ==========================
> > > > > > > >
> > > > > > > > pseudo-c-code                                       //        related armv8
> instructions
> > > > > > > > --------------------                                                 ------------------------------
> ----
> > > ----
> > > > > > > > head.atomic_load(acquire)                //        lda [head]
> > > > > > > > opposite_tail.load()                             //        ldr [opposite_tail]
> > > > > > > > ...
> > > > > > > > head.atomic_cas(..., acquire)            //         ldaex[head]; ...
> strex[head]
> > > > > > > >
> > > > > > > > The questions that arose from these observations:
> > > > > > > > a) are all 3 approaches equivalent in terms of functionality?
> > > > > > > Different, lda (Load with acquire semantics) and ldr (load) are
> different.
> > > > > >
> > > > > > I understand that, my question was:
> > > > > > lda {head]; ldr[tail]
> > > > > > vs
> > > > > > ldr [head]; dmb ishld; ldr [tail];
> > > > > >
> > > > > > Is there any difference in terms of functionality (memory ops
> > > > > ordering/observability)?
> > > > >
> > > > > To be more precise:
> > > > >
> > > > > lda {head]; ldr[tail]
> > > > > vs
> > > > > ldr [head]; dmb ishld; ldr [tail]; vs ldr [head]; dmb ishld; lda
> > > > > [tail];
> > > > >
> > > > > what would be the difference between these 3 cases?
> > > >
> > > > Case A: lda {head]; ldr[tail]
> > > > load of the head will be observed by the memory subsystem before
> > > > the load of the tail.
> > > >
> > > > Case B: ldr [head]; dmb ishld; ldr [tail]; load of the head will
> > > > be observed by the memory subsystem Before the load of the tail.
> > > >
> > > >
> > > > Essentially both cases A and B are the same.
> > > > They preserve following program orders.
> > > > LOAD-LOAD
> > > > LOAD-STORE
> > >
> > > Ok, that is crystal clear, thanks for explanation.
> > >
> > >
> > > > Case C: ldr [head]; dmb ishld; lda [tail]; load of the head will
> > > > be observed by the memory subsystem before the load of the tail.
> > >
> > > Ok.
> > >
> > > > In addition, any load or store program order after lda[tail] will
> > > > not be observed by the memory subsystem before the load of the tail.
> > >
> > > Ok... the question is why we need that extra hoisting barrier here?
> > > From what unwanted  re-orderings we are protecting here?
> > > Does it mean that without it, ldrex/strex (CAS) can be reordered
> > > with load[cons.tail]?
> > >
> > > Actually, we probably need to look at whole picture:
> > >
> > > in rte_ring_generic_pvt.h
> > > =====================
> > >
> > > ldr [prod.head]
> > > dmb ishld
> > > ldr [cons.tail]
> > > ...
> > > /* cas */
> > > ldrex [prod.head]
> > > stlex [prod.head]   /* sink barrier */
> > >
> > > in rte_ring_c11_pvt.h
> > > =====================
> > >
> > > ldr [prod.head]
> > > dmb ishld
> > > lda [cons.tail]          /* exrea hoist */
> > > ...
> > > /* cas */
> > > ldrex [prod.head]
> > > strex [prod.head]
> >
> > Minor thing, ldrex and strex is how Arm 32 way of doing CAS.
> > Aaarch64 has a cas instruction.
> > Same code in aarch64 armv9-a https://godbolt.org/z/TMvWx6v4n
> 
> Cool, thanks.
> 
> >
> > >
> > > So, in _genereic_ we don't have that extra hoist barrier after
> > > load[con.tail], but we have extra sink barrier at cas(prod.tail).
> > >
> >
> > You are right, technically, that lda[cons.tail] is not required
> > because due to the dependency chain up until CAS a memory reordering is
> not possible.
> > For that reason, it has no issue synchronizing with the strl[prod.tail] (in tail-
> update).
> > C11 standard calls it consume-memory-order (__ATOMIC_CONSUME in
> GCC).
> >
> > So, ideally one could have written something like...
> > __atomic_load_n(prod.head, __ATOMIC_CONSUME); instead of
> > __atomic_load_n(prod.head, __ATOMIC_ACQUIRE);
> >
> > The compiler is then supposed to figure out if there are any
> > dependencies in the code path to ensure that load of the prod.head is
> > observed before any load/store that's program order after the load of the
> prod.head.
> > If not, the compiler is supposed to add required barrier to ensure
> > that order is preserved.
> >
> > However, it's easier said than done. No, compiler implements it and
> > C11 standard discourages use of memory-order-consume for that reason.
> >
> > This brings us to the next caveat. As per C11 standard, there cannot
> > be a free standing Store with release semantics (stlr)  that isn't
> > paired with a load with acquire or consume semantics. Since we can't
> > use __ATOMIC_CONSUME (which would have resulted in ldr[prod.head]),
> we are forced to use __ATOMIC_ACQUIRE (which results in lda[prod.head]).
> >
> >
> > > If that's correct observation, can we change _c11_ implementation to
> > > match _generic_ one by:
> > >
> > >  atomic_load(prod.head, releaxed);
> > >  atomic_thread_fence(acquire);
> > >  atomic_load(cons.tail, releaxed);
> > > ....
> > > atomic_cas(prod.head, release, relaxed); ?
> > >
> > > From my understanding that should help to make these 2 implantations
> > > Identical, and then hopefully we can get rid of rte_ring_generic_pvt.h.
> > >
> >
> > They both are functionally correct.
> > _generic is correct but does not comply with the C11 specification.
> > _c11 is correct and fully compliant with the C11 specification.
> 
> So far, I didn't say they are 'functionally incorrect'.
> The problem is - we have two implementations for exactly the same thing that
> generate different code-sequence for the same platform.
> By default It is switched on/off depending on the platform ('off' - at
> arm/thunderx, ppc, x86, 'on' on other arms and riscv ).
> So for some platforms _c11_ is probably never used (and less tested), while on

C11 is guaranteed to work on all platforms if built with a C11 compliant compiler.

> others _generic_ is not used and under tested.

_generic_ by chance sets a dependency chain up to the point array operations are
done, so there is an implicit head.load(consume). It's possible that there can be a
platform/compiler combination that does not do that.

> Obviously it is not a good situation and better be addressed.
> As I remember the reason while we ended up with 2 code-paths here - on
> some platforms _c11_ one was slower.
> Honnappa, please correct me if I am wrong here.
> So ideal solution seems be - make _c11_ generate exactly the same
> code as current _generic_, so we can deprecate and remove _generic_.
> 

Ideal solution would be to make _generic_ generate the same code _c11_
generates for the above reasons. But, at least for Arm, when compiled with
GCC we know that there is a dependency chain that guarantees the program 
order (implicit consume), so it's safe to use and slightly more performant. 

> > Replacing atomic_load(cons.tail, acquire) with load(cons.tail,
> > relaxed) in
> > _c11 would make it non-compliant with C11 the spec.
> 
> Hmm... where this constraint comes from exactly?
> AFAIK, inside DPDK, we have several places where we do use
> 'atomic_store(var, release);' in conjunction with 'var.atomic_load(relaxed);'

That's ok, such code can exist just like in _generic ring.
As long as they are correct on all supported platforms and no one states it's C11 compliant, there is no issue, no?

> 
> Again, if that really that strict - why in _c11_ move_head() it is ok  to use
> 'head.load(acquire)' in conjunction with 'head.cas(relaxed);'?
> Following that logic, it should always be 'head.cas(release);', no?
> 
_c11_move_head() does not head.load(acquire), it's head.load(relaxed).
there is a tail.load(acquire) which must synchronize with tail.store(release).

Prod                                                            Cons
------                                                             -------
ProdHead.load(relaxed)                          ConsHead.load(relaxed)
ConsTail.load(acquire)                             ProdTail.load(acquire)
ProdHead.cas(relaxed)                            ConsHead.cas(relaxed)
AddToArray                                               RemoveFromArray
ProdTail.store(release)                            ConsTail.store(release)

Above is how _c11 is implemented. 
ConsTail.load(acquire)  in Prod synchronizes with the ConsTail.store(release)
In Cons.
ProdTail.load(acquire) in Cons synchronizes with the ProdTail.store(release)
In Prod.

If you want, you can switch to ConsTail.load(consume) and ProdTail.load(consume).
But, due to lack of support as explained before that results identical to 
ConsTail.load(acquire) and ProdTail.load(acquire). 
Also, C11 standard highly discourage it's use for the time being.

next prev parent reply	other threads:[~2024-10-11 15:49 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-08 12:58 Konstantin Ananyev
2024-10-08 15:09 ` Wathsala Wathawana Vithanage
2024-10-08 15:12   ` Wathsala Wathawana Vithanage
2024-10-08 15:45   ` Konstantin Ananyev
2024-10-08 15:56     ` Konstantin Ananyev
2024-10-09 17:27       ` Wathsala Wathawana Vithanage
2024-10-10 16:54         ` Konstantin Ananyev
2024-10-11  0:11           ` Wathsala Wathawana Vithanage
2024-10-11 14:08             ` Konstantin Ananyev
2024-10-11 15:48               ` Wathsala Wathawana Vithanage [this message]
2024-10-15 15:11                 ` Konstantin Ananyev
2024-10-09  1:41     ` Wathsala Wathawana Vithanage
2024-10-09  2:22     ` Wathsala Wathawana Vithanage

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=PAWPR08MB890977BA6D16EA389AA289589F792@PAWPR08MB8909.eurprd08.prod.outlook.com \
    --to=wathsala.vithanage@arm.com \
    --cc=Honnappa.Nagarahalli@arm.com \
    --cc=dev@dpdk.org \
    --cc=drc@linux.ibm.com \
    --cc=jerinj@marvell.com \
    --cc=konstantin.ananyev@huawei.com \
    --cc=nd@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).