From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id ECF2445B42; Tue, 15 Oct 2024 17:11:40 +0200 (CEST) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 8055F400D7; Tue, 15 Oct 2024 17:11:40 +0200 (CEST) Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) by mails.dpdk.org (Postfix) with ESMTP id 455C5400D6 for ; Tue, 15 Oct 2024 17:11:38 +0200 (CEST) Received: from mail.maildlp.com (unknown [172.18.186.31]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4XScrr2NDLz6J6KY; Tue, 15 Oct 2024 23:07:08 +0800 (CST) Received: from frapeml500005.china.huawei.com (unknown [7.182.85.13]) by mail.maildlp.com (Postfix) with ESMTPS id F2E051404F5; Tue, 15 Oct 2024 23:11:36 +0800 (CST) Received: from frapeml500007.china.huawei.com (7.182.85.172) by frapeml500005.china.huawei.com (7.182.85.13) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.1.2507.39; Tue, 15 Oct 2024 17:11:36 +0200 Received: from frapeml500007.china.huawei.com ([7.182.85.172]) by frapeml500007.china.huawei.com ([7.182.85.172]) with mapi id 15.01.2507.039; Tue, 15 Oct 2024 17:11:36 +0200 From: Konstantin Ananyev To: Wathsala Wathawana Vithanage , "dev@dpdk.org" CC: Honnappa Nagarahalli , "jerinj@marvell.com" , "drc@linux.ibm.com" , nd Subject: RE: rte_ring move head question for machines with relaxed MO (arm/ppc) Thread-Topic: rte_ring move head question for machines with relaxed MO (arm/ppc) Thread-Index: AdsZem51pV3bFdnOQj2Lv/oX5wHuJQAE4tywAAJgEaAAAKmmgAA1OCnAAC+O+rAADhtAsAAe4csQAAI+2vAAx6JWoA== Date: Tue, 15 Oct 2024 15:11:36 +0000 Message-ID: <8ab3aad0be784a4a9713d9d09ad80635@huawei.com> References: <8139916ad4814629b8804525bd785d58@huawei.com> <0badc1b8ea524bf3b69d0b7b316bdc8f@huawei.com> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.48.153.33] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org > > > > > > > > > 1. rte_ring_generic_pvt.h: > > > > > > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D > > > > > > > > > > > > > > > > > > pseudo-c-code // = related armv8 instructions > > > > > > > > > -------------------- = ------------------------------ > > ---- > > > > ---- > > > > > > > > > head.load() // = ldr [head] > > > > > > > > > rte_smp_rmb() // = dmb ishld > > > > > > > > > opposite_tail.load() // = ldr [opposite_tail] > > > > > > > > > ... > > > > > > > > > rte_atomic32_cmpset(head, ...) // ldrex[head= ];... > > stlex[head] > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. rte_ring_c11_pvt.h > > > > > > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D > > > > > > > > > > > > > > > > > > pseudo-c-code // = related armv8 > > instructions > > > > > > > > > -------------------- = ------------------------------ > > ---- > > > > ---- > > > > > > > > > head.atomic_load(relaxed) // ldr[h= ead] > > > > > > > > > atomic_thread_fence(acquire) // dmb ish > > > > > > > > > opposite_tail.atomic_load(acquire) // lda[opposi= te_tail] > > > > > > > > > ... > > > > > > > > > head.atomic_cas(..., relaxed) // ldre= x[haed]; ... > > strex[head] > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. rte_ring_hts_elem_pvt.h > > > > > > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D > > > > > > > > > > > > > > > > > > pseudo-c-code // = related armv8 > > instructions > > > > > > > > > -------------------- = ------------------------------ > > ---- > > > > ---- > > > > > > > > > head.atomic_load(acquire) // lda [h= ead] > > > > > > > > > opposite_tail.load() // = ldr [opposite_tail] > > > > > > > > > ... > > > > > > > > > head.atomic_cas(..., acquire) // ldaex= [head]; ... > > strex[head] > > > > > > > > > > > > > > > > > > The questions that arose from these observations: > > > > > > > > > a) are all 3 approaches equivalent in terms of functional= ity? > > > > > > > > Different, lda (Load with acquire semantics) and ldr (load)= are > > different. > > > > > > > > > > > > > > I understand that, my question was: > > > > > > > lda {head]; ldr[tail] > > > > > > > vs > > > > > > > ldr [head]; dmb ishld; ldr [tail]; > > > > > > > > > > > > > > Is there any difference in terms of functionality (memory ops > > > > > > ordering/observability)? > > > > > > > > > > > > To be more precise: > > > > > > > > > > > > lda {head]; ldr[tail] > > > > > > vs > > > > > > ldr [head]; dmb ishld; ldr [tail]; vs ldr [head]; dmb ishld; ld= a > > > > > > [tail]; > > > > > > > > > > > > what would be the difference between these 3 cases? > > > > > > > > > > Case A: lda {head]; ldr[tail] > > > > > load of the head will be observed by the memory subsystem before > > > > > the load of the tail. > > > > > > > > > > Case B: ldr [head]; dmb ishld; ldr [tail]; load of the head will > > > > > be observed by the memory subsystem Before the load of the tail. > > > > > > > > > > > > > > > Essentially both cases A and B are the same. > > > > > They preserve following program orders. > > > > > LOAD-LOAD > > > > > LOAD-STORE > > > > > > > > Ok, that is crystal clear, thanks for explanation. > > > > > > > > > > > > > Case C: ldr [head]; dmb ishld; lda [tail]; load of the head will > > > > > be observed by the memory subsystem before the load of the tail. > > > > > > > > Ok. > > > > > > > > > In addition, any load or store program order after lda[tail] will > > > > > not be observed by the memory subsystem before the load of the ta= il. > > > > > > > > Ok... the question is why we need that extra hoisting barrier here? > > > > From what unwanted re-orderings we are protecting here? > > > > Does it mean that without it, ldrex/strex (CAS) can be reordered > > > > with load[cons.tail]? > > > > > > > > Actually, we probably need to look at whole picture: > > > > > > > > in rte_ring_generic_pvt.h > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > > > > > ldr [prod.head] > > > > dmb ishld > > > > ldr [cons.tail] > > > > ... > > > > /* cas */ > > > > ldrex [prod.head] > > > > stlex [prod.head] /* sink barrier */ > > > > > > > > in rte_ring_c11_pvt.h > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > > > > > ldr [prod.head] > > > > dmb ishld > > > > lda [cons.tail] /* exrea hoist */ > > > > ... > > > > /* cas */ > > > > ldrex [prod.head] > > > > strex [prod.head] > > > > > > Minor thing, ldrex and strex is how Arm 32 way of doing CAS. > > > Aaarch64 has a cas instruction. > > > Same code in aarch64 armv9-a https://godbolt.org/z/TMvWx6v4n > > > > Cool, thanks. > > > > > > > > > > > > > So, in _genereic_ we don't have that extra hoist barrier after > > > > load[con.tail], but we have extra sink barrier at cas(prod.tail). > > > > > > > > > > You are right, technically, that lda[cons.tail] is not required > > > because due to the dependency chain up until CAS a memory reordering = is > > not possible. > > > For that reason, it has no issue synchronizing with the strl[prod.tai= l] (in tail- > > update). > > > C11 standard calls it consume-memory-order (__ATOMIC_CONSUME in > > GCC). > > > > > > So, ideally one could have written something like... > > > __atomic_load_n(prod.head, __ATOMIC_CONSUME); instead of > > > __atomic_load_n(prod.head, __ATOMIC_ACQUIRE); > > > > > > The compiler is then supposed to figure out if there are any > > > dependencies in the code path to ensure that load of the prod.head is > > > observed before any load/store that's program order after the load of= the > > prod.head. > > > If not, the compiler is supposed to add required barrier to ensure > > > that order is preserved. > > > > > > However, it's easier said than done. No, compiler implements it and > > > C11 standard discourages use of memory-order-consume for that reason. > > > > > > This brings us to the next caveat. As per C11 standard, there cannot > > > be a free standing Store with release semantics (stlr) that isn't > > > paired with a load with acquire or consume semantics. Since we can't > > > use __ATOMIC_CONSUME (which would have resulted in ldr[prod.head]), > > we are forced to use __ATOMIC_ACQUIRE (which results in lda[prod.head])= . > > > > > > > > > > If that's correct observation, can we change _c11_ implementation t= o > > > > match _generic_ one by: > > > > > > > > atomic_load(prod.head, releaxed); > > > > atomic_thread_fence(acquire); > > > > atomic_load(cons.tail, releaxed); > > > > .... > > > > atomic_cas(prod.head, release, relaxed); ? > > > > > > > > From my understanding that should help to make these 2 implantation= s > > > > Identical, and then hopefully we can get rid of rte_ring_generic_pv= t.h. > > > > > > > > > > They both are functionally correct. > > > _generic is correct but does not comply with the C11 specification. > > > _c11 is correct and fully compliant with the C11 specification. > > > > So far, I didn't say they are 'functionally incorrect'. > > The problem is - we have two implementations for exactly the same thing= that > > generate different code-sequence for the same platform. > > By default It is switched on/off depending on the platform ('off' - at > > arm/thunderx, ppc, x86, 'on' on other arms and riscv ). > > So for some platforms _c11_ is probably never used (and less tested), w= hile on >=20 > C11 is guaranteed to work on all platforms if built with a C11 compliant = compiler. >=20 > > others _generic_ is not used and under tested. >=20 > _generic_ by chance sets a dependency chain up to the point array operati= ons are > done, so there is an implicit head.load(consume). It's possible that ther= e can be a > platform/compiler combination that does not do that. >=20 > > Obviously it is not a good situation and better be addressed. > > As I remember the reason while we ended up with 2 code-paths here - on > > some platforms _c11_ one was slower. > > Honnappa, please correct me if I am wrong here. > > So ideal solution seems be - make _c11_ generate exactly the same > > code as current _generic_, so we can deprecate and remove _generic_. > > >=20 > Ideal solution would be to make _generic_ generate the same code _c11_ > generates for the above reasons. But, at least for Arm, when compiled wit= h > GCC we know that there is a dependency chain that guarantees the program > order (implicit consume), so it's safe to use and slightly more performan= t. >=20 > > > Replacing atomic_load(cons.tail, acquire) with load(cons.tail, > > > relaxed) in > > > _c11 would make it non-compliant with C11 the spec. > > Could you point me to exact place where sequence: atomic_store(release); atomic_load(relaxed); in C11 is marked and ' non-compliant'? I suppose you are referring to: "... Release-Acquire ordering If an atomic store in thread A is tagged memory_order_release, an atomic lo= ad in thread B from the same variable is tagged memory_order_acquire, and t= he load in thread B reads a value written by the store in thread A, then th= e store in thread A synchronizes-with the load in thread B. All memory writes (including non-atomic and relaxed atomic) that happened-b= efore the atomic store from the point of view of thread A, become visible s= ide-effects in thread B. That is, once the atomic load is completed, thread= B is guaranteed to see everything thread A wrote to memory. This promise o= nly holds if B actually returns the value that A stored, or a value from la= ter in the release sequence. The synchronization is established only between the threads releasing and a= cquiring the same atomic variable.=20 ..." From: https://en.cppreference.com/w/c/atomic/memory_order Or something else?=20 > > Hmm... where this constraint comes from exactly? > > AFAIK, inside DPDK, we have several places where we do use > > 'atomic_store(var, release);' in conjunction with 'var.atomic_load(rela= xed);' >=20 > That's ok, such code can exist just like in _generic ring. > As long as they are correct on all supported platforms and no one states = it's C11 compliant, there is no issue, no? >From my perspective - not really. We have 2 different implementations for the same functionality. Both are platform independent, but reasons why one or the other should be selected on particular platform are not clear.=20 That's not ideal situation in terms of support and maintenance. As I remember, when few years ago _c11_-style atomics were introduced into = DPDK the main agreement was: in all appropriate places replace our home-grown s= ync primitives with _c11_ atomics. AFAIK, it was done for nearly whole DPDK code - probably ring _generic_ is = one of few remaining ones. Again, as I remember - main reason for that was that _generic_ is slightly = more performant, then _c11_. That's why my thought was - can we try to optimize _c11_ code-path? If you insist that it is not possible - ok, then I still think we should co= nsider to stick with _c11_ only for all platforms from some point. Another thing - in rte_ring library now we have 4 different sync modes for = cons and prod: ST, MT, HTS, RTS and user is free to mix and match them in a way he wants. So we can have move_head() from one implementation and update_tail()=20 from different one. HTS and RTS also use _c11_ style atomics but their code for 'move_head()' d= iffers a bit from _c11_ ring. =20 I particular they use for prod/cons.tail: atomic_store(release); /* in update_tail() */=20 atomic_load(relaxed); /* in move_head() */ Which as I understand, you consider as 'non-compilant with C11', and not guaranteed to work on all possible platforms, right?=20 Again, there is patch that will probably complicate things even more: https://patchwork.dpdk.org/project/dpdk/patch/20241015130111.826-5-konstant= in.v.ananyev@yandex.ru/ So my thought was: we better have one common principle for all these different move_head()s.=20 > > > > Again, if that really that strict - why in _c11_ move_head() it is ok = to use > > 'head.load(acquire)' in conjunction with 'head.cas(relaxed);'? > > Following that logic, it should always be 'head.cas(release);', no? > > > _c11_move_head() does not head.load(acquire), it's head.load(relaxed). > there is a tail.load(acquire) which must synchronize with tail.store(rele= ase). >=20 > Prod Cons > ------ ------= - > ProdHead.load(relaxed) ConsHead.load(relaxed) Correction: we have=20 rte_atomic_thread_fence(rte_memory_order_acquire); right here, between these 2 loads. And as I understand: prod.head.load(relaxed); fence(acquire); is identical (in terms of synchronization) to prod.head.load(acquire).=20 > ConsTail.load(acquire) ProdTail.load(acquire) > ProdHead.cas(relaxed) ConsHead.cas(relaxed) Another question - if head.cas() is relaxed here, what prevents it to be reordered with relaxed loads/stores in AddToArray? I presume just dependency order for head value? Anything else? =20 > AddToArray RemoveFromArray > ProdTail.store(release) ConsTail.store(release= ) >=20 > Above is how _c11 is implemented. > ConsTail.load(acquire) in Prod synchronizes with the ConsTail.store(rele= ase) > In Cons. > ProdTail.load(acquire) in Cons synchronizes with the ProdTail.store(relea= se) > In Prod. >=20 > If you want, you can switch to ConsTail.load(consume) and ProdTail.load(c= onsume). > But, due to lack of support as explained before that results identical to > ConsTail.load(acquire) and ProdTail.load(acquire). > Also, C11 standard highly discourage it's use for the time being.