DPDK patches and discussions
 help / color / mirror / Atom feed
* rte_ring move head  question for machines with relaxed MO (arm/ppc)
@ 2024-10-08 12:58 Konstantin Ananyev
  2024-10-08 15:09 ` Wathsala Wathawana Vithanage
  0 siblings, 1 reply; 13+ messages in thread
From: Konstantin Ananyev @ 2024-10-08 12:58 UTC (permalink / raw)
  To: dev; +Cc: honnappa.nagarahalli, Jerin Jacob, Wathsala Vithanage, drc

Hi lads,

Looking at rte_ring move_head functions I noticed that all of them
use slightly different approach to guarantee desired order of memory accesses:


1. rte_ring_generic_pvt.h:
=====================

pseudo-c-code                                      //        related armv8 instructions
--------------------                                                 -------------------------------------- 
 head.load()                                          //        ldr [head]
 rte_smp_rmb()                                    //        dmb ishld
 opposite_tail.load()                            //        ldr [opposite_tail]
 ...
 rte_atomic32_cmpset(head, ...)      //        ldrex[head];... stlex[head]


2. rte_ring_c11_pvt.h
=====================

pseudo-c-code                                       //        related armv8 instructions
--------------------                                                 --------------------------------------
head.atomic_load(relaxed)                 //        ldr[head]
atomic_thread_fence(acquire)           //        dmb ish
opposite_tail.atomic_load(acquire)   //        lda[opposite_tail]
...
head.atomic_cas(..., relaxed)              //        ldrex[haed]; ... strex[head]


3.   rte_ring_hts_elem_pvt.h
==========================

pseudo-c-code                                       //        related armv8 instructions
--------------------                                                 --------------------------------------
head.atomic_load(acquire)                //        lda [head]
opposite_tail.load()                             //        ldr [opposite_tail] 
...
head.atomic_cas(..., acquire)            //         ldaex[head]; ... strex[head]         

The questions that arose from these observations:
a) are all 3 approaches equivalent in terms of functionality?
b) if yes, is there any difference in terms of performance between:
     "ldr; dmb; ldr;"   vs "lda; ldr;"
      ?
c) Comapring at 1) and 2) above, combination of 
   ldr [head]; dmb; lda [opposite_tail]:
   looks     like an overkill to me.  Wouldn't just:
   ldr [head]; dmb; ldr[opposite_tail];
   be sufficient here?
I.E.- for reading tail value - we don't need to use load(acquire).
Or probably I do miss something obvious here?

Thanks
Konstantin

For convenience, I created a godbot page with all these variants:
https://godbolt.org/z/Yjj73b8xa
   
#1 - __rte_ring_headtail_move_head()
#2 - __rte_ring_headtail_move_head_c11_v1  
#3 - __rte_ring_headtail_move_head_c11_v2
#2 with c) -  __rte_ring_headtail_move_head_c11_v3
 
          
  


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: rte_ring move head  question for machines with relaxed MO (arm/ppc)
  2024-10-08 12:58 rte_ring move head question for machines with relaxed MO (arm/ppc) Konstantin Ananyev
@ 2024-10-08 15:09 ` Wathsala Wathawana Vithanage
  2024-10-08 15:12   ` Wathsala Wathawana Vithanage
  2024-10-08 15:45   ` Konstantin Ananyev
  0 siblings, 2 replies; 13+ messages in thread
From: Wathsala Wathawana Vithanage @ 2024-10-08 15:09 UTC (permalink / raw)
  To: Konstantin Ananyev, dev; +Cc: Honnappa Nagarahalli, jerinj, drc, nd

> 1. rte_ring_generic_pvt.h:
> =====================
> 
> pseudo-c-code                                      //        related armv8 instructions
> --------------------                                                 --------------------------------------
>  head.load()                                          //        ldr [head]
>  rte_smp_rmb()                                    //        dmb ishld
>  opposite_tail.load()                            //        ldr [opposite_tail]
>  ...
>  rte_atomic32_cmpset(head, ...)      //        ldrex[head];... stlex[head]
> 
> 
> 2. rte_ring_c11_pvt.h
> =====================
> 
> pseudo-c-code                                       //        related armv8 instructions
> --------------------                                                 --------------------------------------
> head.atomic_load(relaxed)                 //        ldr[head]
> atomic_thread_fence(acquire)           //        dmb ish
> opposite_tail.atomic_load(acquire)   //        lda[opposite_tail]
> ...
> head.atomic_cas(..., relaxed)              //        ldrex[haed]; ... strex[head]
> 
> 
> 3.   rte_ring_hts_elem_pvt.h
> ==========================
> 
> pseudo-c-code                                       //        related armv8 instructions
> --------------------                                                 --------------------------------------
> head.atomic_load(acquire)                //        lda [head]
> opposite_tail.load()                             //        ldr [opposite_tail]
> ...
> head.atomic_cas(..., acquire)            //         ldaex[head]; ... strex[head]
> 
> The questions that arose from these observations:
> a) are all 3 approaches equivalent in terms of functionality?
Different, lda (Load with acquire semantics) and ldr (load) are different. 

> b) if yes, is there any difference in terms of performance between:
>      "ldr; dmb; ldr;"   vs "lda; ldr;"
>       ?
dmb is a full barrier, performance is poor.
I would assume (haven't measured) ldr; dmb; ldr to be less performant than lda;ldr;

> c) Comapring at 1) and 2) above, combination of
>    ldr [head]; dmb; lda [opposite_tail]:
>    looks     like an overkill to me.  Wouldn't just:
>    ldr [head]; dmb; ldr[opposite_tail];
>    be sufficient here?
lda [opposite_tail]: synchronizes with stlr in tail update that happens after array update.
So, it cannot be changed to ldr. 

lda can be replaced with ldapr (LDA with release consistency - processor consistency) 
which is more performant as lda is allowed to rise above stlr. Can be done with -mcpu=+rcpc

--wathsala



^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: rte_ring move head  question for machines with relaxed MO (arm/ppc)
  2024-10-08 15:09 ` Wathsala Wathawana Vithanage
@ 2024-10-08 15:12   ` Wathsala Wathawana Vithanage
  2024-10-08 15:45   ` Konstantin Ananyev
  1 sibling, 0 replies; 13+ messages in thread
From: Wathsala Wathawana Vithanage @ 2024-10-08 15:12 UTC (permalink / raw)
  To: Konstantin Ananyev, dev; +Cc: Honnappa Nagarahalli, jerinj, drc, nd, nd

> 
> lda can be replaced with ldapr (LDA with release consistency - processor
> consistency) which is more performant as lda is allowed to rise above stlr. Can
> be done with -mcpu=+rcpc
> 

Correction: lapr is allowed to rise above stlr.

-- wathsala

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: rte_ring move head  question for machines with relaxed MO (arm/ppc)
  2024-10-08 15:09 ` Wathsala Wathawana Vithanage
  2024-10-08 15:12   ` Wathsala Wathawana Vithanage
@ 2024-10-08 15:45   ` Konstantin Ananyev
  2024-10-08 15:56     ` Konstantin Ananyev
                       ` (2 more replies)
  1 sibling, 3 replies; 13+ messages in thread
From: Konstantin Ananyev @ 2024-10-08 15:45 UTC (permalink / raw)
  To: Wathsala Wathawana Vithanage, dev; +Cc: Honnappa Nagarahalli, jerinj, drc, nd


> > 1. rte_ring_generic_pvt.h:
> > =====================
> >
> > pseudo-c-code                                      //        related armv8 instructions
> > --------------------                                                 --------------------------------------
> >  head.load()                                          //        ldr [head]
> >  rte_smp_rmb()                                    //        dmb ishld
> >  opposite_tail.load()                            //        ldr [opposite_tail]
> >  ...
> >  rte_atomic32_cmpset(head, ...)      //        ldrex[head];... stlex[head]
> >
> >
> > 2. rte_ring_c11_pvt.h
> > =====================
> >
> > pseudo-c-code                                       //        related armv8 instructions
> > --------------------                                                 --------------------------------------
> > head.atomic_load(relaxed)                 //        ldr[head]
> > atomic_thread_fence(acquire)           //        dmb ish
> > opposite_tail.atomic_load(acquire)   //        lda[opposite_tail]
> > ...
> > head.atomic_cas(..., relaxed)              //        ldrex[haed]; ... strex[head]
> >
> >
> > 3.   rte_ring_hts_elem_pvt.h
> > ==========================
> >
> > pseudo-c-code                                       //        related armv8 instructions
> > --------------------                                                 --------------------------------------
> > head.atomic_load(acquire)                //        lda [head]
> > opposite_tail.load()                             //        ldr [opposite_tail]
> > ...
> > head.atomic_cas(..., acquire)            //         ldaex[head]; ... strex[head]
> >
> > The questions that arose from these observations:
> > a) are all 3 approaches equivalent in terms of functionality?
> Different, lda (Load with acquire semantics) and ldr (load) are different.

I understand that, my question was:
lda {head]; ldr[tail]
vs
ldr [head]; dmb ishld; ldr [tail];

Is there any difference in terms of functionality (memory ops ordering/observability)?  

> 
> > b) if yes, is there any difference in terms of performance between:
> >      "ldr; dmb; ldr;"   vs "lda; ldr;"
> >       ?
> dmb is a full barrier, performance is poor.
> I would assume (haven't measured) ldr; dmb; ldr to be less performant than lda;ldr;

Through all this mail am talking about 'dmb ishld', sorry for not being clear upfront. 

> 
> > c) Comapring at 1) and 2) above, combination of
> >    ldr [head]; dmb; lda [opposite_tail]:
> >    looks     like an overkill to me.  Wouldn't just:
> >    ldr [head]; dmb; ldr[opposite_tail];
> >    be sufficient here?
> lda [opposite_tail]: synchronizes with stlr in tail update that happens after array update.
> So, it cannot be changed to ldr.

Can you explain me a bit more here why it is not possible?
From here:
https://developer.arm.com/documentation/dui0802/b/A32-and-T32-Instructions/LDA-and-STL
"There is no requirement that a load-acquire and store-release be paired."
Do I misinterpret this statement somehow?
 
> lda can be replaced with ldapr (LDA with release consistency - processor consistency)
> which is more performant as lda is allowed to rise above stlr. Can be done with -mcpu=+rcpc
> 
> --wathsala
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: rte_ring move head  question for machines with relaxed MO (arm/ppc)
  2024-10-08 15:45   ` Konstantin Ananyev
@ 2024-10-08 15:56     ` Konstantin Ananyev
  2024-10-09 17:27       ` Wathsala Wathawana Vithanage
  2024-10-09  1:41     ` Wathsala Wathawana Vithanage
  2024-10-09  2:22     ` Wathsala Wathawana Vithanage
  2 siblings, 1 reply; 13+ messages in thread
From: Konstantin Ananyev @ 2024-10-08 15:56 UTC (permalink / raw)
  To: Konstantin Ananyev, Wathsala Wathawana Vithanage, dev
  Cc: Honnappa Nagarahalli, jerinj, drc, nd



> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Tuesday, October 8, 2024 4:46 PM
> To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>; dev@dpdk.org
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; jerinj@marvell.com; drc@linux.ibm.com; nd <nd@arm.com>
> Subject: RE: rte_ring move head question for machines with relaxed MO (arm/ppc)
> 
> 
> > > 1. rte_ring_generic_pvt.h:
> > > =====================
> > >
> > > pseudo-c-code                                      //        related armv8 instructions
> > > --------------------                                                 --------------------------------------
> > >  head.load()                                          //        ldr [head]
> > >  rte_smp_rmb()                                    //        dmb ishld
> > >  opposite_tail.load()                            //        ldr [opposite_tail]
> > >  ...
> > >  rte_atomic32_cmpset(head, ...)      //        ldrex[head];... stlex[head]
> > >
> > >
> > > 2. rte_ring_c11_pvt.h
> > > =====================
> > >
> > > pseudo-c-code                                       //        related armv8 instructions
> > > --------------------                                                 --------------------------------------
> > > head.atomic_load(relaxed)                 //        ldr[head]
> > > atomic_thread_fence(acquire)           //        dmb ish
> > > opposite_tail.atomic_load(acquire)   //        lda[opposite_tail]
> > > ...
> > > head.atomic_cas(..., relaxed)              //        ldrex[haed]; ... strex[head]
> > >
> > >
> > > 3.   rte_ring_hts_elem_pvt.h
> > > ==========================
> > >
> > > pseudo-c-code                                       //        related armv8 instructions
> > > --------------------                                                 --------------------------------------
> > > head.atomic_load(acquire)                //        lda [head]
> > > opposite_tail.load()                             //        ldr [opposite_tail]
> > > ...
> > > head.atomic_cas(..., acquire)            //         ldaex[head]; ... strex[head]
> > >
> > > The questions that arose from these observations:
> > > a) are all 3 approaches equivalent in terms of functionality?
> > Different, lda (Load with acquire semantics) and ldr (load) are different.
> 
> I understand that, my question was:
> lda {head]; ldr[tail]
> vs
> ldr [head]; dmb ishld; ldr [tail];
> 
> Is there any difference in terms of functionality (memory ops ordering/observability)?

To be more precise:

lda {head]; ldr[tail]
vs
ldr [head]; dmb ishld; ldr [tail]; 
vs
ldr [head]; dmb ishld; lda [tail];

what would be the difference between these 3 cases?

> 
> >
> > > b) if yes, is there any difference in terms of performance between:
> > >      "ldr; dmb; ldr;"   vs "lda; ldr;"
> > >       ?
> > dmb is a full barrier, performance is poor.
> > I would assume (haven't measured) ldr; dmb; ldr to be less performant than lda;ldr;
> 
> Through all this mail am talking about 'dmb ishld', sorry for not being clear upfront.
> 
> >
> > > c) Comapring at 1) and 2) above, combination of
> > >    ldr [head]; dmb; lda [opposite_tail]:
> > >    looks     like an overkill to me.  Wouldn't just:
> > >    ldr [head]; dmb; ldr[opposite_tail];
> > >    be sufficient here?
> > lda [opposite_tail]: synchronizes with stlr in tail update that happens after array update.
> > So, it cannot be changed to ldr.
> 
> Can you explain me a bit more here why it is not possible?
> From here:
> https://developer.arm.com/documentation/dui0802/b/A32-and-T32-Instructions/LDA-and-STL
> "There is no requirement that a load-acquire and store-release be paired."
> Do I misinterpret this statement somehow?
> 
> > lda can be replaced with ldapr (LDA with release consistency - processor consistency)
> > which is more performant as lda is allowed to rise above stlr. Can be done with -mcpu=+rcpc
> >
> > --wathsala
> >


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: rte_ring move head  question for machines with relaxed MO (arm/ppc)
  2024-10-08 15:45   ` Konstantin Ananyev
  2024-10-08 15:56     ` Konstantin Ananyev
@ 2024-10-09  1:41     ` Wathsala Wathawana Vithanage
  2024-10-09  2:22     ` Wathsala Wathawana Vithanage
  2 siblings, 0 replies; 13+ messages in thread
From: Wathsala Wathawana Vithanage @ 2024-10-09  1:41 UTC (permalink / raw)
  To: Konstantin Ananyev, dev; +Cc: Honnappa Nagarahalli, jerinj, drc, nd, nd

> 
> > > 1. rte_ring_generic_pvt.h:
> > > =====================
> > >
> > > pseudo-c-code                                      //        related armv8 instructions
> > > --------------------                                                 --------------------------------------
> > >  head.load()                                          //        ldr [head]
> > >  rte_smp_rmb()                                    //        dmb ishld
> > >  opposite_tail.load()                            //        ldr [opposite_tail]
> > >  ...
> > >  rte_atomic32_cmpset(head, ...)      //        ldrex[head];... stlex[head]
> > >
> > >
> > > 2. rte_ring_c11_pvt.h
> > > =====================
> > >
> > > pseudo-c-code                                       //        related armv8 instructions
> > > --------------------                                                 --------------------------------------
> > > head.atomic_load(relaxed)                 //        ldr[head]
> > > atomic_thread_fence(acquire)           //        dmb ish
> > > opposite_tail.atomic_load(acquire)   //        lda[opposite_tail]
> > > ...
> > > head.atomic_cas(..., relaxed)              //        ldrex[haed]; ... strex[head]
> > >
> > >
> > > 3.   rte_ring_hts_elem_pvt.h
> > > ==========================
> > >
> > > pseudo-c-code                                       //        related armv8 instructions
> > > --------------------                                                 --------------------------------------
> > > head.atomic_load(acquire)                //        lda [head]
> > > opposite_tail.load()                             //        ldr [opposite_tail]
> > > ...
> > > head.atomic_cas(..., acquire)            //         ldaex[head]; ... strex[head]
> > >
> > > The questions that arose from these observations:
> > > a) are all 3 approaches equivalent in terms of functionality?
> > Different, lda (Load with acquire semantics) and ldr (load) are different.
> 
> I understand that, my question was:
> lda {head]; ldr[tail]
> vs
> ldr [head]; dmb ishld; ldr [tail];
> 
> Is there any difference in terms of functionality (memory ops
> ordering/observability)?
> 
> >
> > > b) if yes, is there any difference in terms of performance between:
> > >      "ldr; dmb; ldr;"   vs "lda; ldr;"
> > >       ?
> > dmb is a full barrier, performance is poor.
> > I would assume (haven't measured) ldr; dmb; ldr to be less performant
> > than lda;ldr;
> 
> Through all this mail am talking about 'dmb ishld', sorry for not being clear
> upfront.
> 
> >
> > > c) Comapring at 1) and 2) above, combination of
> > >    ldr [head]; dmb; lda [opposite_tail]:
> > >    looks     like an overkill to me.  Wouldn't just:
> > >    ldr [head]; dmb; ldr[opposite_tail];
> > >    be sufficient here?
> > lda [opposite_tail]: synchronizes with stlr in tail update that happens after
> array update.
> > So, it cannot be changed to ldr.
> 
> Can you explain me a bit more here why it is not possible?
> From here:
> https://developer.arm.com/documentation/dui0802/b/A32-and-T32-
> Instructions/LDA-and-STL
> "There is no requirement that a load-acquire and store-release be paired."
> Do I misinterpret this statement somehow?

There is no architectural requirement for them to be paired.
But C11 seem to have such requirement, such that prod: lda[cons-tail] synchronizes with cons: stl[cons-tail].




^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: rte_ring move head  question for machines with relaxed MO (arm/ppc)
  2024-10-08 15:45   ` Konstantin Ananyev
  2024-10-08 15:56     ` Konstantin Ananyev
  2024-10-09  1:41     ` Wathsala Wathawana Vithanage
@ 2024-10-09  2:22     ` Wathsala Wathawana Vithanage
  2 siblings, 0 replies; 13+ messages in thread
From: Wathsala Wathawana Vithanage @ 2024-10-09  2:22 UTC (permalink / raw)
  To: Konstantin Ananyev, dev; +Cc: Honnappa Nagarahalli, jerinj, drc, nd, nd

> > > 1. rte_ring_generic_pvt.h:
> > > =====================
> > >
> > > pseudo-c-code                                      //        related armv8 instructions
> > > --------------------                                                 --------------------------------------
> > >  head.load()                                          //        ldr [head]
> > >  rte_smp_rmb()                                    //        dmb ishld
> > >  opposite_tail.load()                            //        ldr [opposite_tail]
> > >  ...
> > >  rte_atomic32_cmpset(head, ...)      //        ldrex[head];... stlex[head]
> > >
> > >
> > > 2. rte_ring_c11_pvt.h
> > > =====================
> > >
> > > pseudo-c-code                                       //        related armv8 instructions
> > > --------------------                                                 --------------------------------------
> > > head.atomic_load(relaxed)                 //        ldr[head]
> > > atomic_thread_fence(acquire)           //        dmb ish
> > > opposite_tail.atomic_load(acquire)   //        lda[opposite_tail]
> > > ...
> > > head.atomic_cas(..., relaxed)              //        ldrex[haed]; ... strex[head]
> > >
> > >
> > > 3.   rte_ring_hts_elem_pvt.h
> > > ==========================
> > >
> > > pseudo-c-code                                       //        related armv8 instructions
> > > --------------------                                                 --------------------------------------
> > > head.atomic_load(acquire)                //        lda [head]
> > > opposite_tail.load()                             //        ldr [opposite_tail]
> > > ...
> > > head.atomic_cas(..., acquire)            //         ldaex[head]; ... strex[head]
> > >
> > > The questions that arose from these observations:
> > > a) are all 3 approaches equivalent in terms of functionality?
> > Different, lda (Load with acquire semantics) and ldr (load) are different.
> 
> I understand that, my question was:
> lda {head]; ldr[tail]
> vs
> ldr [head]; dmb ishld; ldr [tail];
> 
> Is there any difference in terms of functionality (memory ops
> ordering/observability)?
> 
> >
> > > b) if yes, is there any difference in terms of performance between:
> > >      "ldr; dmb; ldr;"   vs "lda; ldr;"
> > >       ?
> > dmb is a full barrier, performance is poor.
> > I would assume (haven't measured) ldr; dmb; ldr to be less performant
> > than lda;ldr;
> 
> Through all this mail am talking about 'dmb ishld', sorry for not being clear
> upfront.
>
A: ldr; dmb ishld; ldr; -> load before the dmb ishld should be observed by the inner shareable 
domain before execution of the second ldr.
(Also applies to stores program order after dmb ishld.)

B: lda; ldr; -> second load cannot execute before the load acquire.
(Also applies to stores program order after lda.)

In theory, I would assume them to be at least roughly equal in performance if not B is more
performant than A.


> >
> > > c) Comapring at 1) and 2) above, combination of
> > >    ldr [head]; dmb; lda [opposite_tail]:
> > >    looks     like an overkill to me.  Wouldn't just:
> > >    ldr [head]; dmb; ldr[opposite_tail];
> > >    be sufficient here?
> > lda [opposite_tail]: synchronizes with stlr in tail update that happens after
> array update.
> > So, it cannot be changed to ldr.
> 
> Can you explain me a bit more here why it is not possible?
> From here:
> https://developer.arm.com/documentation/dui0802/b/A32-and-T32-
> Instructions/LDA-and-STL
> "There is no requirement that a load-acquire and store-release be paired."
> Do I misinterpret this statement somehow?
> 
> > lda can be replaced with ldapr (LDA with release consistency -
> > processor consistency) which is more performant as lda is allowed to
> > rise above stlr. Can be done with -mcpu=+rcpc
> >
> > --wathsala
> >


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: rte_ring move head  question for machines with relaxed MO (arm/ppc)
  2024-10-08 15:56     ` Konstantin Ananyev
@ 2024-10-09 17:27       ` Wathsala Wathawana Vithanage
  2024-10-10 16:54         ` Konstantin Ananyev
  0 siblings, 1 reply; 13+ messages in thread
From: Wathsala Wathawana Vithanage @ 2024-10-09 17:27 UTC (permalink / raw)
  To: Konstantin Ananyev, dev; +Cc: Honnappa Nagarahalli, jerinj, drc, nd, nd

> >
> > > > 1. rte_ring_generic_pvt.h:
> > > > =====================
> > > >
> > > > pseudo-c-code                                      //        related armv8 instructions
> > > > --------------------                                                 --------------------------------------
> > > >  head.load()                                          //        ldr [head]
> > > >  rte_smp_rmb()                                    //        dmb ishld
> > > >  opposite_tail.load()                            //        ldr [opposite_tail]
> > > >  ...
> > > >  rte_atomic32_cmpset(head, ...)      //        ldrex[head];... stlex[head]
> > > >
> > > >
> > > > 2. rte_ring_c11_pvt.h
> > > > =====================
> > > >
> > > > pseudo-c-code                                       //        related armv8 instructions
> > > > --------------------                                                 --------------------------------------
> > > > head.atomic_load(relaxed)                 //        ldr[head]
> > > > atomic_thread_fence(acquire)           //        dmb ish
> > > > opposite_tail.atomic_load(acquire)   //        lda[opposite_tail]
> > > > ...
> > > > head.atomic_cas(..., relaxed)              //        ldrex[haed]; ... strex[head]
> > > >
> > > >
> > > > 3.   rte_ring_hts_elem_pvt.h
> > > > ==========================
> > > >
> > > > pseudo-c-code                                       //        related armv8 instructions
> > > > --------------------                                                 --------------------------------------
> > > > head.atomic_load(acquire)                //        lda [head]
> > > > opposite_tail.load()                             //        ldr [opposite_tail]
> > > > ...
> > > > head.atomic_cas(..., acquire)            //         ldaex[head]; ... strex[head]
> > > >
> > > > The questions that arose from these observations:
> > > > a) are all 3 approaches equivalent in terms of functionality?
> > > Different, lda (Load with acquire semantics) and ldr (load) are different.
> >
> > I understand that, my question was:
> > lda {head]; ldr[tail]
> > vs
> > ldr [head]; dmb ishld; ldr [tail];
> >
> > Is there any difference in terms of functionality (memory ops
> ordering/observability)?
> 
> To be more precise:
> 
> lda {head]; ldr[tail]
> vs
> ldr [head]; dmb ishld; ldr [tail];
> vs
> ldr [head]; dmb ishld; lda [tail];
> 
> what would be the difference between these 3 cases?

Case A: lda {head]; ldr[tail]
load of the head will be observed by the memory subsystem
before the load of the tail.

Case B: ldr [head]; dmb ishld; ldr [tail];
load of the head will be observed by the memory subsystem
Before the load of the tail.

Case C: ldr [head]; dmb ishld; lda [tail];
load of the head will be observed by the memory subsystem
before the load of the tail. In addition, any load or store program
order after lda[tail] will not be observed by the memory subsystem
before the load of the tail.

Essentially both cases A and B are the same.
They preserve following program orders.
LOAD-LOAD
LOAD-STORE


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: rte_ring move head  question for machines with relaxed MO (arm/ppc)
  2024-10-09 17:27       ` Wathsala Wathawana Vithanage
@ 2024-10-10 16:54         ` Konstantin Ananyev
  2024-10-11  0:11           ` Wathsala Wathawana Vithanage
  0 siblings, 1 reply; 13+ messages in thread
From: Konstantin Ananyev @ 2024-10-10 16:54 UTC (permalink / raw)
  To: Wathsala Wathawana Vithanage, dev
  Cc: Honnappa Nagarahalli, jerinj, drc, nd, nd



> > > > > 1. rte_ring_generic_pvt.h:
> > > > > =====================
> > > > >
> > > > > pseudo-c-code                                      //        related armv8 instructions
> > > > > --------------------                                                 --------------------------------------
> > > > >  head.load()                                          //        ldr [head]
> > > > >  rte_smp_rmb()                                    //        dmb ishld
> > > > >  opposite_tail.load()                            //        ldr [opposite_tail]
> > > > >  ...
> > > > >  rte_atomic32_cmpset(head, ...)      //        ldrex[head];... stlex[head]
> > > > >
> > > > >
> > > > > 2. rte_ring_c11_pvt.h
> > > > > =====================
> > > > >
> > > > > pseudo-c-code                                       //        related armv8 instructions
> > > > > --------------------                                                 --------------------------------------
> > > > > head.atomic_load(relaxed)                 //        ldr[head]
> > > > > atomic_thread_fence(acquire)           //        dmb ish
> > > > > opposite_tail.atomic_load(acquire)   //        lda[opposite_tail]
> > > > > ...
> > > > > head.atomic_cas(..., relaxed)              //        ldrex[haed]; ... strex[head]
> > > > >
> > > > >
> > > > > 3.   rte_ring_hts_elem_pvt.h
> > > > > ==========================
> > > > >
> > > > > pseudo-c-code                                       //        related armv8 instructions
> > > > > --------------------                                                 --------------------------------------
> > > > > head.atomic_load(acquire)                //        lda [head]
> > > > > opposite_tail.load()                             //        ldr [opposite_tail]
> > > > > ...
> > > > > head.atomic_cas(..., acquire)            //         ldaex[head]; ... strex[head]
> > > > >
> > > > > The questions that arose from these observations:
> > > > > a) are all 3 approaches equivalent in terms of functionality?
> > > > Different, lda (Load with acquire semantics) and ldr (load) are different.
> > >
> > > I understand that, my question was:
> > > lda {head]; ldr[tail]
> > > vs
> > > ldr [head]; dmb ishld; ldr [tail];
> > >
> > > Is there any difference in terms of functionality (memory ops
> > ordering/observability)?
> >
> > To be more precise:
> >
> > lda {head]; ldr[tail]
> > vs
> > ldr [head]; dmb ishld; ldr [tail];
> > vs
> > ldr [head]; dmb ishld; lda [tail];
> >
> > what would be the difference between these 3 cases?
> 
> Case A: lda {head]; ldr[tail]
> load of the head will be observed by the memory subsystem
> before the load of the tail.
> 
> Case B: ldr [head]; dmb ishld; ldr [tail];
> load of the head will be observed by the memory subsystem
> Before the load of the tail.
> 
> 
> Essentially both cases A and B are the same.
> They preserve following program orders.
> LOAD-LOAD
> LOAD-STORE

Ok, that is crystal clear, thanks for explanation.
 

> Case C: ldr [head]; dmb ishld; lda [tail];
> load of the head will be observed by the memory subsystem
> before the load of the tail. 

Ok.

> In addition, any load or store program
> order after lda[tail] will not be observed by the memory subsystem
> before the load of the tail. 

Ok... the question is why we need that extra hoisting barrier here?
From what unwanted  re-orderings we are protecting here?
Does it mean that without it, ldrex/strex (CAS) can be reordered with load[cons.tail]?

Actually, we probably need to look at whole picture:

in rte_ring_generic_pvt.h
=====================

ldr [prod.head]
dmb ishld
ldr [cons.tail]
...
/* cas */
ldrex [prod.head]
stlex [prod.head]   /* sink barrier */

in rte_ring_c11_pvt.h
=====================

ldr [prod.head]
dmb ishld
lda [cons.tail]          /* exrea hoist */
...
/* cas */
ldrex [prod.head]
strex [prod.head]  

So, in _genereic_ we don't have that extra hoist barrier after load[con.tail],
but we have extra sink barrier at cas(prod.tail).

If that's correct observation, can we change _c11_ implementation to match
_generic_ one by:

 atomic_load(prod.head, releaxed);
 atomic_thread_fence(acquire);
 atomic_load(cons.tail, releaxed);
....
atomic_cas(prod.head, release, relaxed);
?
  
From my understanding that should help to make these 2 implantations
Identical, and then hopefully we can get rid of rte_ring_generic_pvt.h.
  

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: rte_ring move head  question for machines with relaxed MO (arm/ppc)
  2024-10-10 16:54         ` Konstantin Ananyev
@ 2024-10-11  0:11           ` Wathsala Wathawana Vithanage
  2024-10-11 14:08             ` Konstantin Ananyev
  0 siblings, 1 reply; 13+ messages in thread
From: Wathsala Wathawana Vithanage @ 2024-10-11  0:11 UTC (permalink / raw)
  To: Konstantin Ananyev, dev; +Cc: Honnappa Nagarahalli, jerinj, drc, nd, nd



> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Thursday, October 10, 2024 11:54 AM
> To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>;
> dev@dpdk.org
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> jerinj@marvell.com; drc@linux.ibm.com; nd <nd@arm.com>; nd
> <nd@arm.com>
> Subject: RE: rte_ring move head question for machines with relaxed MO
> (arm/ppc)
> 
> 
> 
> > > > > > 1. rte_ring_generic_pvt.h:
> > > > > > =====================
> > > > > >
> > > > > > pseudo-c-code                                      //        related armv8 instructions
> > > > > > --------------------                                                 ----------------------------------
> ----
> > > > > >  head.load()                                          //        ldr [head]
> > > > > >  rte_smp_rmb()                                    //        dmb ishld
> > > > > >  opposite_tail.load()                            //        ldr [opposite_tail]
> > > > > >  ...
> > > > > >  rte_atomic32_cmpset(head, ...)      //        ldrex[head];... stlex[head]
> > > > > >
> > > > > >
> > > > > > 2. rte_ring_c11_pvt.h
> > > > > > =====================
> > > > > >
> > > > > > pseudo-c-code                                       //        related armv8 instructions
> > > > > > --------------------                                                 ----------------------------------
> ----
> > > > > > head.atomic_load(relaxed)                 //        ldr[head]
> > > > > > atomic_thread_fence(acquire)           //        dmb ish
> > > > > > opposite_tail.atomic_load(acquire)   //        lda[opposite_tail]
> > > > > > ...
> > > > > > head.atomic_cas(..., relaxed)              //        ldrex[haed]; ... strex[head]
> > > > > >
> > > > > >
> > > > > > 3.   rte_ring_hts_elem_pvt.h
> > > > > > ==========================
> > > > > >
> > > > > > pseudo-c-code                                       //        related armv8 instructions
> > > > > > --------------------                                                 ----------------------------------
> ----
> > > > > > head.atomic_load(acquire)                //        lda [head]
> > > > > > opposite_tail.load()                             //        ldr [opposite_tail]
> > > > > > ...
> > > > > > head.atomic_cas(..., acquire)            //         ldaex[head]; ... strex[head]
> > > > > >
> > > > > > The questions that arose from these observations:
> > > > > > a) are all 3 approaches equivalent in terms of functionality?
> > > > > Different, lda (Load with acquire semantics) and ldr (load) are different.
> > > >
> > > > I understand that, my question was:
> > > > lda {head]; ldr[tail]
> > > > vs
> > > > ldr [head]; dmb ishld; ldr [tail];
> > > >
> > > > Is there any difference in terms of functionality (memory ops
> > > ordering/observability)?
> > >
> > > To be more precise:
> > >
> > > lda {head]; ldr[tail]
> > > vs
> > > ldr [head]; dmb ishld; ldr [tail];
> > > vs
> > > ldr [head]; dmb ishld; lda [tail];
> > >
> > > what would be the difference between these 3 cases?
> >
> > Case A: lda {head]; ldr[tail]
> > load of the head will be observed by the memory subsystem before the
> > load of the tail.
> >
> > Case B: ldr [head]; dmb ishld; ldr [tail]; load of the head will be
> > observed by the memory subsystem Before the load of the tail.
> >
> >
> > Essentially both cases A and B are the same.
> > They preserve following program orders.
> > LOAD-LOAD
> > LOAD-STORE
> 
> Ok, that is crystal clear, thanks for explanation.
> 
> 
> > Case C: ldr [head]; dmb ishld; lda [tail]; load of the head will be
> > observed by the memory subsystem before the load of the tail.
> 
> Ok.
> 
> > In addition, any load or store program order after lda[tail] will not
> > be observed by the memory subsystem before the load of the tail.
> 
> Ok... the question is why we need that extra hoisting barrier here?
> From what unwanted  re-orderings we are protecting here?
> Does it mean that without it, ldrex/strex (CAS) can be reordered with
> load[cons.tail]?
> 
> Actually, we probably need to look at whole picture:
> 
> in rte_ring_generic_pvt.h
> =====================
> 
> ldr [prod.head]
> dmb ishld
> ldr [cons.tail]
> ...
> /* cas */
> ldrex [prod.head]
> stlex [prod.head]   /* sink barrier */
> 
> in rte_ring_c11_pvt.h
> =====================
> 
> ldr [prod.head]
> dmb ishld
> lda [cons.tail]          /* exrea hoist */
> ...
> /* cas */
> ldrex [prod.head]
> strex [prod.head]

Minor thing, ldrex and strex is how Arm 32 way of doing CAS.
Aaarch64 has a cas instruction.
Same code in aarch64 armv9-a https://godbolt.org/z/TMvWx6v4n


> 
> So, in _genereic_ we don't have that extra hoist barrier after load[con.tail], but
> we have extra sink barrier at cas(prod.tail).
> 

You are right, technically, that lda[cons.tail] is not required because due to the 
dependency chain up until CAS a memory reordering is not possible.
For that reason, it has no issue synchronizing with the strl[prod.tail] (in tail-update).
C11 standard calls it consume-memory-order (__ATOMIC_CONSUME in GCC).

So, ideally one could have written something like...
__atomic_load_n(prod.head, __ATOMIC_CONSUME);
instead of 
__atomic_load_n(prod.head, __ATOMIC_ACQUIRE);

The compiler is then supposed to figure out if there are any dependencies in
the code path to ensure that load of the prod.head is observed before any
load/store that's program order after the load of the prod.head.
If not, the compiler is supposed to add required barrier to ensure that order is 
preserved.

However, it's easier said than done. No, compiler implements it and C11 standard
discourages use of memory-order-consume for that reason.

This brings us to the next caveat. As per C11 standard, there cannot be a free standing
Store with release semantics (stlr)  that isn't paired with a load with acquire or consume
semantics. Since we can't use __ATOMIC_CONSUME (which would have resulted in 
ldr[prod.head]), we are forced to use __ATOMIC_ACQUIRE (which results in lda[prod.head]).


> If that's correct observation, can we change _c11_ implementation to match
> _generic_ one by:
> 
>  atomic_load(prod.head, releaxed);
>  atomic_thread_fence(acquire);
>  atomic_load(cons.tail, releaxed);
> ....
> atomic_cas(prod.head, release, relaxed); ?
> 
> From my understanding that should help to make these 2 implantations
> Identical, and then hopefully we can get rid of rte_ring_generic_pvt.h.
> 

They both are functionally correct.
_generic is correct but does not comply with the C11 specification. 
_c11 is correct and fully compliant with the C11 specification.

Replacing atomic_load(cons.tail, acquire) with load(cons.tail, relaxed) in
_c11 would make it non-compliant with C11 the spec.

 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: rte_ring move head  question for machines with relaxed MO (arm/ppc)
  2024-10-11  0:11           ` Wathsala Wathawana Vithanage
@ 2024-10-11 14:08             ` Konstantin Ananyev
  2024-10-11 15:48               ` Wathsala Wathawana Vithanage
  0 siblings, 1 reply; 13+ messages in thread
From: Konstantin Ananyev @ 2024-10-11 14:08 UTC (permalink / raw)
  To: Wathsala Wathawana Vithanage, dev; +Cc: Honnappa Nagarahalli, jerinj, drc



> >
> > > > > > > 1. rte_ring_generic_pvt.h:
> > > > > > > =====================
> > > > > > >
> > > > > > > pseudo-c-code                                      //        related armv8 instructions
> > > > > > > --------------------                                                 ----------------------------------
> > ----
> > > > > > >  head.load()                                          //        ldr [head]
> > > > > > >  rte_smp_rmb()                                    //        dmb ishld
> > > > > > >  opposite_tail.load()                            //        ldr [opposite_tail]
> > > > > > >  ...
> > > > > > >  rte_atomic32_cmpset(head, ...)      //        ldrex[head];... stlex[head]
> > > > > > >
> > > > > > >
> > > > > > > 2. rte_ring_c11_pvt.h
> > > > > > > =====================
> > > > > > >
> > > > > > > pseudo-c-code                                       //        related armv8 instructions
> > > > > > > --------------------                                                 ----------------------------------
> > ----
> > > > > > > head.atomic_load(relaxed)                 //        ldr[head]
> > > > > > > atomic_thread_fence(acquire)           //        dmb ish
> > > > > > > opposite_tail.atomic_load(acquire)   //        lda[opposite_tail]
> > > > > > > ...
> > > > > > > head.atomic_cas(..., relaxed)              //        ldrex[haed]; ... strex[head]
> > > > > > >
> > > > > > >
> > > > > > > 3.   rte_ring_hts_elem_pvt.h
> > > > > > > ==========================
> > > > > > >
> > > > > > > pseudo-c-code                                       //        related armv8 instructions
> > > > > > > --------------------                                                 ----------------------------------
> > ----
> > > > > > > head.atomic_load(acquire)                //        lda [head]
> > > > > > > opposite_tail.load()                             //        ldr [opposite_tail]
> > > > > > > ...
> > > > > > > head.atomic_cas(..., acquire)            //         ldaex[head]; ... strex[head]
> > > > > > >
> > > > > > > The questions that arose from these observations:
> > > > > > > a) are all 3 approaches equivalent in terms of functionality?
> > > > > > Different, lda (Load with acquire semantics) and ldr (load) are different.
> > > > >
> > > > > I understand that, my question was:
> > > > > lda {head]; ldr[tail]
> > > > > vs
> > > > > ldr [head]; dmb ishld; ldr [tail];
> > > > >
> > > > > Is there any difference in terms of functionality (memory ops
> > > > ordering/observability)?
> > > >
> > > > To be more precise:
> > > >
> > > > lda {head]; ldr[tail]
> > > > vs
> > > > ldr [head]; dmb ishld; ldr [tail];
> > > > vs
> > > > ldr [head]; dmb ishld; lda [tail];
> > > >
> > > > what would be the difference between these 3 cases?
> > >
> > > Case A: lda {head]; ldr[tail]
> > > load of the head will be observed by the memory subsystem before the
> > > load of the tail.
> > >
> > > Case B: ldr [head]; dmb ishld; ldr [tail]; load of the head will be
> > > observed by the memory subsystem Before the load of the tail.
> > >
> > >
> > > Essentially both cases A and B are the same.
> > > They preserve following program orders.
> > > LOAD-LOAD
> > > LOAD-STORE
> >
> > Ok, that is crystal clear, thanks for explanation.
> >
> >
> > > Case C: ldr [head]; dmb ishld; lda [tail]; load of the head will be
> > > observed by the memory subsystem before the load of the tail.
> >
> > Ok.
> >
> > > In addition, any load or store program order after lda[tail] will not
> > > be observed by the memory subsystem before the load of the tail.
> >
> > Ok... the question is why we need that extra hoisting barrier here?
> > From what unwanted  re-orderings we are protecting here?
> > Does it mean that without it, ldrex/strex (CAS) can be reordered with
> > load[cons.tail]?
> >
> > Actually, we probably need to look at whole picture:
> >
> > in rte_ring_generic_pvt.h
> > =====================
> >
> > ldr [prod.head]
> > dmb ishld
> > ldr [cons.tail]
> > ...
> > /* cas */
> > ldrex [prod.head]
> > stlex [prod.head]   /* sink barrier */
> >
> > in rte_ring_c11_pvt.h
> > =====================
> >
> > ldr [prod.head]
> > dmb ishld
> > lda [cons.tail]          /* exrea hoist */
> > ...
> > /* cas */
> > ldrex [prod.head]
> > strex [prod.head]
> 
> Minor thing, ldrex and strex is how Arm 32 way of doing CAS.
> Aaarch64 has a cas instruction.
> Same code in aarch64 armv9-a https://godbolt.org/z/TMvWx6v4n

Cool, thanks.

> 
> >
> > So, in _genereic_ we don't have that extra hoist barrier after load[con.tail], but
> > we have extra sink barrier at cas(prod.tail).
> >
> 
> You are right, technically, that lda[cons.tail] is not required because due to the
> dependency chain up until CAS a memory reordering is not possible.
> For that reason, it has no issue synchronizing with the strl[prod.tail] (in tail-update).
> C11 standard calls it consume-memory-order (__ATOMIC_CONSUME in GCC).
> 
> So, ideally one could have written something like...
> __atomic_load_n(prod.head, __ATOMIC_CONSUME);
> instead of
> __atomic_load_n(prod.head, __ATOMIC_ACQUIRE);
> 
> The compiler is then supposed to figure out if there are any dependencies in
> the code path to ensure that load of the prod.head is observed before any
> load/store that's program order after the load of the prod.head.
> If not, the compiler is supposed to add required barrier to ensure that order is
> preserved.
> 
> However, it's easier said than done. No, compiler implements it and C11 standard
> discourages use of memory-order-consume for that reason.
> 
> This brings us to the next caveat. As per C11 standard, there cannot be a free standing
> Store with release semantics (stlr)  that isn't paired with a load with acquire or consume
> semantics. Since we can't use __ATOMIC_CONSUME (which would have resulted in
> ldr[prod.head]), we are forced to use __ATOMIC_ACQUIRE (which results in lda[prod.head]).
> 
> 
> > If that's correct observation, can we change _c11_ implementation to match
> > _generic_ one by:
> >
> >  atomic_load(prod.head, releaxed);
> >  atomic_thread_fence(acquire);
> >  atomic_load(cons.tail, releaxed);
> > ....
> > atomic_cas(prod.head, release, relaxed); ?
> >
> > From my understanding that should help to make these 2 implantations
> > Identical, and then hopefully we can get rid of rte_ring_generic_pvt.h.
> >
> 
> They both are functionally correct.
> _generic is correct but does not comply with the C11 specification.
> _c11 is correct and fully compliant with the C11 specification.

So far, I didn't say they are 'functionally incorrect'.
The problem is - we have two implementations for exactly the same thing
that generate different code-sequence for the same platform.
By default It is switched on/off depending on the platform
('off' - at  arm/thunderx, ppc, x86, 'on' on other arms and riscv ).
So for some platforms _c11_ is probably never used (and less tested),
while on others _generic_ is not used and under tested.
Obviously it is not a good situation and better be addressed.
As I remember the reason while we ended up with 2 code-paths here -
on some platforms _c11_ one was slower.
Honnappa, please correct me if I am wrong here.
So ideal solution seems be - make _c11_ generate exactly the same
code as current _generic_, so we can deprecate and remove _generic_.        

> Replacing atomic_load(cons.tail, acquire) with load(cons.tail, relaxed) in
> _c11 would make it non-compliant with C11 the spec.

Hmm... where this constraint comes from exactly?
AFAIK, inside DPDK, we have several places where we do use
'atomic_store(var, release);' in conjunction with 'var.atomic_load(relaxed);' 

Again, if that really that strict - why in _c11_ move_head() it is ok  to
use 'head.load(acquire)' in conjunction with 'head.cas(relaxed);'?
Following that logic, it should always be 'head.cas(release);', no?




^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: rte_ring move head  question for machines with relaxed MO (arm/ppc)
  2024-10-11 14:08             ` Konstantin Ananyev
@ 2024-10-11 15:48               ` Wathsala Wathawana Vithanage
  2024-10-15 15:11                 ` Konstantin Ananyev
  0 siblings, 1 reply; 13+ messages in thread
From: Wathsala Wathawana Vithanage @ 2024-10-11 15:48 UTC (permalink / raw)
  To: Konstantin Ananyev, dev; +Cc: Honnappa Nagarahalli, jerinj, drc, nd



> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Friday, October 11, 2024 9:09 AM
> To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>;
> dev@dpdk.org
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> jerinj@marvell.com; drc@linux.ibm.com
> Subject: RE: rte_ring move head question for machines with relaxed MO
> (arm/ppc)
> 
> 
> 
> > >
> > > > > > > > 1. rte_ring_generic_pvt.h:
> > > > > > > > =====================
> > > > > > > >
> > > > > > > > pseudo-c-code                                      //        related armv8 instructions
> > > > > > > > --------------------                                                 ------------------------------
> ----
> > > ----
> > > > > > > >  head.load()                                          //        ldr [head]
> > > > > > > >  rte_smp_rmb()                                    //        dmb ishld
> > > > > > > >  opposite_tail.load()                            //        ldr [opposite_tail]
> > > > > > > >  ...
> > > > > > > >  rte_atomic32_cmpset(head, ...)      //        ldrex[head];...
> stlex[head]
> > > > > > > >
> > > > > > > >
> > > > > > > > 2. rte_ring_c11_pvt.h
> > > > > > > > =====================
> > > > > > > >
> > > > > > > > pseudo-c-code                                       //        related armv8
> instructions
> > > > > > > > --------------------                                                 ------------------------------
> ----
> > > ----
> > > > > > > > head.atomic_load(relaxed)                 //        ldr[head]
> > > > > > > > atomic_thread_fence(acquire)           //        dmb ish
> > > > > > > > opposite_tail.atomic_load(acquire)   //        lda[opposite_tail]
> > > > > > > > ...
> > > > > > > > head.atomic_cas(..., relaxed)              //        ldrex[haed]; ...
> strex[head]
> > > > > > > >
> > > > > > > >
> > > > > > > > 3.   rte_ring_hts_elem_pvt.h
> > > > > > > > ==========================
> > > > > > > >
> > > > > > > > pseudo-c-code                                       //        related armv8
> instructions
> > > > > > > > --------------------                                                 ------------------------------
> ----
> > > ----
> > > > > > > > head.atomic_load(acquire)                //        lda [head]
> > > > > > > > opposite_tail.load()                             //        ldr [opposite_tail]
> > > > > > > > ...
> > > > > > > > head.atomic_cas(..., acquire)            //         ldaex[head]; ...
> strex[head]
> > > > > > > >
> > > > > > > > The questions that arose from these observations:
> > > > > > > > a) are all 3 approaches equivalent in terms of functionality?
> > > > > > > Different, lda (Load with acquire semantics) and ldr (load) are
> different.
> > > > > >
> > > > > > I understand that, my question was:
> > > > > > lda {head]; ldr[tail]
> > > > > > vs
> > > > > > ldr [head]; dmb ishld; ldr [tail];
> > > > > >
> > > > > > Is there any difference in terms of functionality (memory ops
> > > > > ordering/observability)?
> > > > >
> > > > > To be more precise:
> > > > >
> > > > > lda {head]; ldr[tail]
> > > > > vs
> > > > > ldr [head]; dmb ishld; ldr [tail]; vs ldr [head]; dmb ishld; lda
> > > > > [tail];
> > > > >
> > > > > what would be the difference between these 3 cases?
> > > >
> > > > Case A: lda {head]; ldr[tail]
> > > > load of the head will be observed by the memory subsystem before
> > > > the load of the tail.
> > > >
> > > > Case B: ldr [head]; dmb ishld; ldr [tail]; load of the head will
> > > > be observed by the memory subsystem Before the load of the tail.
> > > >
> > > >
> > > > Essentially both cases A and B are the same.
> > > > They preserve following program orders.
> > > > LOAD-LOAD
> > > > LOAD-STORE
> > >
> > > Ok, that is crystal clear, thanks for explanation.
> > >
> > >
> > > > Case C: ldr [head]; dmb ishld; lda [tail]; load of the head will
> > > > be observed by the memory subsystem before the load of the tail.
> > >
> > > Ok.
> > >
> > > > In addition, any load or store program order after lda[tail] will
> > > > not be observed by the memory subsystem before the load of the tail.
> > >
> > > Ok... the question is why we need that extra hoisting barrier here?
> > > From what unwanted  re-orderings we are protecting here?
> > > Does it mean that without it, ldrex/strex (CAS) can be reordered
> > > with load[cons.tail]?
> > >
> > > Actually, we probably need to look at whole picture:
> > >
> > > in rte_ring_generic_pvt.h
> > > =====================
> > >
> > > ldr [prod.head]
> > > dmb ishld
> > > ldr [cons.tail]
> > > ...
> > > /* cas */
> > > ldrex [prod.head]
> > > stlex [prod.head]   /* sink barrier */
> > >
> > > in rte_ring_c11_pvt.h
> > > =====================
> > >
> > > ldr [prod.head]
> > > dmb ishld
> > > lda [cons.tail]          /* exrea hoist */
> > > ...
> > > /* cas */
> > > ldrex [prod.head]
> > > strex [prod.head]
> >
> > Minor thing, ldrex and strex is how Arm 32 way of doing CAS.
> > Aaarch64 has a cas instruction.
> > Same code in aarch64 armv9-a https://godbolt.org/z/TMvWx6v4n
> 
> Cool, thanks.
> 
> >
> > >
> > > So, in _genereic_ we don't have that extra hoist barrier after
> > > load[con.tail], but we have extra sink barrier at cas(prod.tail).
> > >
> >
> > You are right, technically, that lda[cons.tail] is not required
> > because due to the dependency chain up until CAS a memory reordering is
> not possible.
> > For that reason, it has no issue synchronizing with the strl[prod.tail] (in tail-
> update).
> > C11 standard calls it consume-memory-order (__ATOMIC_CONSUME in
> GCC).
> >
> > So, ideally one could have written something like...
> > __atomic_load_n(prod.head, __ATOMIC_CONSUME); instead of
> > __atomic_load_n(prod.head, __ATOMIC_ACQUIRE);
> >
> > The compiler is then supposed to figure out if there are any
> > dependencies in the code path to ensure that load of the prod.head is
> > observed before any load/store that's program order after the load of the
> prod.head.
> > If not, the compiler is supposed to add required barrier to ensure
> > that order is preserved.
> >
> > However, it's easier said than done. No, compiler implements it and
> > C11 standard discourages use of memory-order-consume for that reason.
> >
> > This brings us to the next caveat. As per C11 standard, there cannot
> > be a free standing Store with release semantics (stlr)  that isn't
> > paired with a load with acquire or consume semantics. Since we can't
> > use __ATOMIC_CONSUME (which would have resulted in ldr[prod.head]),
> we are forced to use __ATOMIC_ACQUIRE (which results in lda[prod.head]).
> >
> >
> > > If that's correct observation, can we change _c11_ implementation to
> > > match _generic_ one by:
> > >
> > >  atomic_load(prod.head, releaxed);
> > >  atomic_thread_fence(acquire);
> > >  atomic_load(cons.tail, releaxed);
> > > ....
> > > atomic_cas(prod.head, release, relaxed); ?
> > >
> > > From my understanding that should help to make these 2 implantations
> > > Identical, and then hopefully we can get rid of rte_ring_generic_pvt.h.
> > >
> >
> > They both are functionally correct.
> > _generic is correct but does not comply with the C11 specification.
> > _c11 is correct and fully compliant with the C11 specification.
> 
> So far, I didn't say they are 'functionally incorrect'.
> The problem is - we have two implementations for exactly the same thing that
> generate different code-sequence for the same platform.
> By default It is switched on/off depending on the platform ('off' - at
> arm/thunderx, ppc, x86, 'on' on other arms and riscv ).
> So for some platforms _c11_ is probably never used (and less tested), while on

C11 is guaranteed to work on all platforms if built with a C11 compliant compiler.

> others _generic_ is not used and under tested.

_generic_ by chance sets a dependency chain up to the point array operations are
done, so there is an implicit head.load(consume). It's possible that there can be a
platform/compiler combination that does not do that.

> Obviously it is not a good situation and better be addressed.
> As I remember the reason while we ended up with 2 code-paths here - on
> some platforms _c11_ one was slower.
> Honnappa, please correct me if I am wrong here.
> So ideal solution seems be - make _c11_ generate exactly the same
> code as current _generic_, so we can deprecate and remove _generic_.
> 

Ideal solution would be to make _generic_ generate the same code _c11_
generates for the above reasons. But, at least for Arm, when compiled with
GCC we know that there is a dependency chain that guarantees the program 
order (implicit consume), so it's safe to use and slightly more performant. 

> > Replacing atomic_load(cons.tail, acquire) with load(cons.tail,
> > relaxed) in
> > _c11 would make it non-compliant with C11 the spec.
> 
> Hmm... where this constraint comes from exactly?
> AFAIK, inside DPDK, we have several places where we do use
> 'atomic_store(var, release);' in conjunction with 'var.atomic_load(relaxed);'

That's ok, such code can exist just like in _generic ring.
As long as they are correct on all supported platforms and no one states it's C11 compliant, there is no issue, no?

> 
> Again, if that really that strict - why in _c11_ move_head() it is ok  to use
> 'head.load(acquire)' in conjunction with 'head.cas(relaxed);'?
> Following that logic, it should always be 'head.cas(release);', no?
> 
_c11_move_head() does not head.load(acquire), it's head.load(relaxed).
there is a tail.load(acquire) which must synchronize with tail.store(release).

Prod                                                            Cons
------                                                             -------
ProdHead.load(relaxed)                          ConsHead.load(relaxed)
ConsTail.load(acquire)                             ProdTail.load(acquire)
ProdHead.cas(relaxed)                            ConsHead.cas(relaxed)
AddToArray                                               RemoveFromArray
ProdTail.store(release)                            ConsTail.store(release)

Above is how _c11 is implemented. 
ConsTail.load(acquire)  in Prod synchronizes with the ConsTail.store(release)
In Cons.
ProdTail.load(acquire) in Cons synchronizes with the ProdTail.store(release)
In Prod.

If you want, you can switch to ConsTail.load(consume) and ProdTail.load(consume).
But, due to lack of support as explained before that results identical to 
ConsTail.load(acquire) and ProdTail.load(acquire). 
Also, C11 standard highly discourage it's use for the time being.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: rte_ring move head  question for machines with relaxed MO (arm/ppc)
  2024-10-11 15:48               ` Wathsala Wathawana Vithanage
@ 2024-10-15 15:11                 ` Konstantin Ananyev
  0 siblings, 0 replies; 13+ messages in thread
From: Konstantin Ananyev @ 2024-10-15 15:11 UTC (permalink / raw)
  To: Wathsala Wathawana Vithanage, dev; +Cc: Honnappa Nagarahalli, jerinj, drc, nd


> > > > > > > > > 1. rte_ring_generic_pvt.h:
> > > > > > > > > =====================
> > > > > > > > >
> > > > > > > > > pseudo-c-code                                      //        related armv8 instructions
> > > > > > > > > --------------------                                                 ------------------------------
> > ----
> > > > ----
> > > > > > > > >  head.load()                                          //        ldr [head]
> > > > > > > > >  rte_smp_rmb()                                    //        dmb ishld
> > > > > > > > >  opposite_tail.load()                            //        ldr [opposite_tail]
> > > > > > > > >  ...
> > > > > > > > >  rte_atomic32_cmpset(head, ...)      //        ldrex[head];...
> > stlex[head]
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2. rte_ring_c11_pvt.h
> > > > > > > > > =====================
> > > > > > > > >
> > > > > > > > > pseudo-c-code                                       //        related armv8
> > instructions
> > > > > > > > > --------------------                                                 ------------------------------
> > ----
> > > > ----
> > > > > > > > > head.atomic_load(relaxed)                 //        ldr[head]
> > > > > > > > > atomic_thread_fence(acquire)           //        dmb ish
> > > > > > > > > opposite_tail.atomic_load(acquire)   //        lda[opposite_tail]
> > > > > > > > > ...
> > > > > > > > > head.atomic_cas(..., relaxed)              //        ldrex[haed]; ...
> > strex[head]
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 3.   rte_ring_hts_elem_pvt.h
> > > > > > > > > ==========================
> > > > > > > > >
> > > > > > > > > pseudo-c-code                                       //        related armv8
> > instructions
> > > > > > > > > --------------------                                                 ------------------------------
> > ----
> > > > ----
> > > > > > > > > head.atomic_load(acquire)                //        lda [head]
> > > > > > > > > opposite_tail.load()                             //        ldr [opposite_tail]
> > > > > > > > > ...
> > > > > > > > > head.atomic_cas(..., acquire)            //         ldaex[head]; ...
> > strex[head]
> > > > > > > > >
> > > > > > > > > The questions that arose from these observations:
> > > > > > > > > a) are all 3 approaches equivalent in terms of functionality?
> > > > > > > > Different, lda (Load with acquire semantics) and ldr (load) are
> > different.
> > > > > > >
> > > > > > > I understand that, my question was:
> > > > > > > lda {head]; ldr[tail]
> > > > > > > vs
> > > > > > > ldr [head]; dmb ishld; ldr [tail];
> > > > > > >
> > > > > > > Is there any difference in terms of functionality (memory ops
> > > > > > ordering/observability)?
> > > > > >
> > > > > > To be more precise:
> > > > > >
> > > > > > lda {head]; ldr[tail]
> > > > > > vs
> > > > > > ldr [head]; dmb ishld; ldr [tail]; vs ldr [head]; dmb ishld; lda
> > > > > > [tail];
> > > > > >
> > > > > > what would be the difference between these 3 cases?
> > > > >
> > > > > Case A: lda {head]; ldr[tail]
> > > > > load of the head will be observed by the memory subsystem before
> > > > > the load of the tail.
> > > > >
> > > > > Case B: ldr [head]; dmb ishld; ldr [tail]; load of the head will
> > > > > be observed by the memory subsystem Before the load of the tail.
> > > > >
> > > > >
> > > > > Essentially both cases A and B are the same.
> > > > > They preserve following program orders.
> > > > > LOAD-LOAD
> > > > > LOAD-STORE
> > > >
> > > > Ok, that is crystal clear, thanks for explanation.
> > > >
> > > >
> > > > > Case C: ldr [head]; dmb ishld; lda [tail]; load of the head will
> > > > > be observed by the memory subsystem before the load of the tail.
> > > >
> > > > Ok.
> > > >
> > > > > In addition, any load or store program order after lda[tail] will
> > > > > not be observed by the memory subsystem before the load of the tail.
> > > >
> > > > Ok... the question is why we need that extra hoisting barrier here?
> > > > From what unwanted  re-orderings we are protecting here?
> > > > Does it mean that without it, ldrex/strex (CAS) can be reordered
> > > > with load[cons.tail]?
> > > >
> > > > Actually, we probably need to look at whole picture:
> > > >
> > > > in rte_ring_generic_pvt.h
> > > > =====================
> > > >
> > > > ldr [prod.head]
> > > > dmb ishld
> > > > ldr [cons.tail]
> > > > ...
> > > > /* cas */
> > > > ldrex [prod.head]
> > > > stlex [prod.head]   /* sink barrier */
> > > >
> > > > in rte_ring_c11_pvt.h
> > > > =====================
> > > >
> > > > ldr [prod.head]
> > > > dmb ishld
> > > > lda [cons.tail]          /* exrea hoist */
> > > > ...
> > > > /* cas */
> > > > ldrex [prod.head]
> > > > strex [prod.head]
> > >
> > > Minor thing, ldrex and strex is how Arm 32 way of doing CAS.
> > > Aaarch64 has a cas instruction.
> > > Same code in aarch64 armv9-a https://godbolt.org/z/TMvWx6v4n
> >
> > Cool, thanks.
> >
> > >
> > > >
> > > > So, in _genereic_ we don't have that extra hoist barrier after
> > > > load[con.tail], but we have extra sink barrier at cas(prod.tail).
> > > >
> > >
> > > You are right, technically, that lda[cons.tail] is not required
> > > because due to the dependency chain up until CAS a memory reordering is
> > not possible.
> > > For that reason, it has no issue synchronizing with the strl[prod.tail] (in tail-
> > update).
> > > C11 standard calls it consume-memory-order (__ATOMIC_CONSUME in
> > GCC).
> > >
> > > So, ideally one could have written something like...
> > > __atomic_load_n(prod.head, __ATOMIC_CONSUME); instead of
> > > __atomic_load_n(prod.head, __ATOMIC_ACQUIRE);
> > >
> > > The compiler is then supposed to figure out if there are any
> > > dependencies in the code path to ensure that load of the prod.head is
> > > observed before any load/store that's program order after the load of the
> > prod.head.
> > > If not, the compiler is supposed to add required barrier to ensure
> > > that order is preserved.
> > >
> > > However, it's easier said than done. No, compiler implements it and
> > > C11 standard discourages use of memory-order-consume for that reason.
> > >
> > > This brings us to the next caveat. As per C11 standard, there cannot
> > > be a free standing Store with release semantics (stlr)  that isn't
> > > paired with a load with acquire or consume semantics. Since we can't
> > > use __ATOMIC_CONSUME (which would have resulted in ldr[prod.head]),
> > we are forced to use __ATOMIC_ACQUIRE (which results in lda[prod.head]).
> > >
> > >
> > > > If that's correct observation, can we change _c11_ implementation to
> > > > match _generic_ one by:
> > > >
> > > >  atomic_load(prod.head, releaxed);
> > > >  atomic_thread_fence(acquire);
> > > >  atomic_load(cons.tail, releaxed);
> > > > ....
> > > > atomic_cas(prod.head, release, relaxed); ?
> > > >
> > > > From my understanding that should help to make these 2 implantations
> > > > Identical, and then hopefully we can get rid of rte_ring_generic_pvt.h.
> > > >
> > >
> > > They both are functionally correct.
> > > _generic is correct but does not comply with the C11 specification.
> > > _c11 is correct and fully compliant with the C11 specification.
> >
> > So far, I didn't say they are 'functionally incorrect'.
> > The problem is - we have two implementations for exactly the same thing that
> > generate different code-sequence for the same platform.
> > By default It is switched on/off depending on the platform ('off' - at
> > arm/thunderx, ppc, x86, 'on' on other arms and riscv ).
> > So for some platforms _c11_ is probably never used (and less tested), while on
> 
> C11 is guaranteed to work on all platforms if built with a C11 compliant compiler.
> 
> > others _generic_ is not used and under tested.
> 
> _generic_ by chance sets a dependency chain up to the point array operations are
> done, so there is an implicit head.load(consume). It's possible that there can be a
> platform/compiler combination that does not do that.
> 
> > Obviously it is not a good situation and better be addressed.
> > As I remember the reason while we ended up with 2 code-paths here - on
> > some platforms _c11_ one was slower.
> > Honnappa, please correct me if I am wrong here.
> > So ideal solution seems be - make _c11_ generate exactly the same
> > code as current _generic_, so we can deprecate and remove _generic_.
> >
> 
> Ideal solution would be to make _generic_ generate the same code _c11_
> generates for the above reasons. But, at least for Arm, when compiled with
> GCC we know that there is a dependency chain that guarantees the program
> order (implicit consume), so it's safe to use and slightly more performant.
> 
> > > Replacing atomic_load(cons.tail, acquire) with load(cons.tail,
> > > relaxed) in
> > > _c11 would make it non-compliant with C11 the spec.
> >

Could you point me to exact place where sequence:
atomic_store(release); atomic_load(relaxed);
in C11 is marked and ' non-compliant'?
I suppose you are referring to:
"...
Release-Acquire ordering
If an atomic store in thread A is tagged memory_order_release, an atomic load in thread B from the same variable is tagged memory_order_acquire, and the load in thread B reads a value written by the store in thread A, then the store in thread A synchronizes-with the load in thread B.
All memory writes (including non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory. This promise only holds if B actually returns the value that A stored, or a value from later in the release sequence.
The synchronization is established only between the threads releasing and acquiring the same atomic variable. 
..."
From: https://en.cppreference.com/w/c/atomic/memory_order
Or something else? 

> > Hmm... where this constraint comes from exactly?
> > AFAIK, inside DPDK, we have several places where we do use
> > 'atomic_store(var, release);' in conjunction with 'var.atomic_load(relaxed);'
> 
> That's ok, such code can exist just like in _generic ring.
> As long as they are correct on all supported platforms and no one states it's C11 compliant, there is no issue, no?

From my perspective - not really.
We have 2 different implementations for the same functionality.
Both are platform independent, but reasons why one or the other should be
selected on particular platform are not clear. 
That's not ideal situation in terms of support and maintenance.
As I remember, when few years ago _c11_-style atomics were introduced into DPDK
the main agreement was: in all appropriate places  replace our home-grown sync primitives
with _c11_ atomics.
AFAIK, it was done for nearly whole DPDK code - probably ring _generic_ is one of few
remaining ones.
Again, as I remember - main reason for that was that _generic_ is slightly more performant,
then _c11_.
That's why my thought was - can we try to optimize _c11_ code-path?
If you insist that it is not possible - ok, then I still think we should consider to stick with _c11_
only for all platforms from some point.

Another thing - in rte_ring library now we have 4 different sync modes for cons and prod:
ST, MT, HTS, RTS  and user is free to mix and match them in a way he wants.
So we can have move_head() from one implementation and update_tail() 
from different one.
HTS and RTS also use _c11_ style atomics but their code for 'move_head()' differs a bit
from _c11_ ring.  
I particular they use for prod/cons.tail:
atomic_store(release);  /* in update_tail() */ 
atomic_load(relaxed);   /* in move_head() */
Which as I understand, you consider as 'non-compilant with C11',
and not guaranteed to work on all possible platforms, right? 

Again, there is  patch that will probably complicate things even more:
https://patchwork.dpdk.org/project/dpdk/patch/20241015130111.826-5-konstantin.v.ananyev@yandex.ru/

So my thought was: we better have one common principle for all
these different move_head()s. 

> >
> > Again, if that really that strict - why in _c11_ move_head() it is ok  to use
> > 'head.load(acquire)' in conjunction with 'head.cas(relaxed);'?
> > Following that logic, it should always be 'head.cas(release);', no?
> >
> _c11_move_head() does not head.load(acquire), it's head.load(relaxed).
> there is a tail.load(acquire) which must synchronize with tail.store(release).
> 
> Prod                                                            Cons
> ------                                                             -------
> ProdHead.load(relaxed)                          ConsHead.load(relaxed)

Correction: we have 
rte_atomic_thread_fence(rte_memory_order_acquire);
right here, between these 2 loads.
And as I understand:
prod.head.load(relaxed); fence(acquire);
is identical (in terms of synchronization) to prod.head.load(acquire). 

> ConsTail.load(acquire)                             ProdTail.load(acquire)
> ProdHead.cas(relaxed)                            ConsHead.cas(relaxed)

Another question - if head.cas() is relaxed here, what prevents it
to be reordered with relaxed loads/stores in AddToArray?
I presume just dependency order for head value?
Anything else?  

> AddToArray                                               RemoveFromArray
> ProdTail.store(release)                            ConsTail.store(release)
> 
> Above is how _c11 is implemented.
> ConsTail.load(acquire)  in Prod synchronizes with the ConsTail.store(release)
> In Cons.
> ProdTail.load(acquire) in Cons synchronizes with the ProdTail.store(release)
> In Prod.
> 
> If you want, you can switch to ConsTail.load(consume) and ProdTail.load(consume).
> But, due to lack of support as explained before that results identical to
> ConsTail.load(acquire) and ProdTail.load(acquire).
> Also, C11 standard highly discourage it's use for the time being.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-10-15 15:11 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-08 12:58 rte_ring move head question for machines with relaxed MO (arm/ppc) Konstantin Ananyev
2024-10-08 15:09 ` Wathsala Wathawana Vithanage
2024-10-08 15:12   ` Wathsala Wathawana Vithanage
2024-10-08 15:45   ` Konstantin Ananyev
2024-10-08 15:56     ` Konstantin Ananyev
2024-10-09 17:27       ` Wathsala Wathawana Vithanage
2024-10-10 16:54         ` Konstantin Ananyev
2024-10-11  0:11           ` Wathsala Wathawana Vithanage
2024-10-11 14:08             ` Konstantin Ananyev
2024-10-11 15:48               ` Wathsala Wathawana Vithanage
2024-10-15 15:11                 ` Konstantin Ananyev
2024-10-09  1:41     ` Wathsala Wathawana Vithanage
2024-10-09  2:22     ` Wathsala Wathawana Vithanage

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).