From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id CA17945B13; Fri, 11 Oct 2024 16:08:39 +0200 (CEST) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 9922B402A1; Fri, 11 Oct 2024 16:08:39 +0200 (CEST) Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) by mails.dpdk.org (Postfix) with ESMTP id D1DAE4028B for ; Fri, 11 Oct 2024 16:08:36 +0200 (CEST) Received: from mail.maildlp.com (unknown [172.18.186.231]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4XQ7jX0gkKz6K6Sd; Fri, 11 Oct 2024 22:07:12 +0800 (CST) Received: from frapeml500005.china.huawei.com (unknown [7.182.85.13]) by mail.maildlp.com (Postfix) with ESMTPS id 5EB8B1400CB; Fri, 11 Oct 2024 22:08:36 +0800 (CST) Received: from frapeml500007.china.huawei.com (7.182.85.172) by frapeml500005.china.huawei.com (7.182.85.13) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.1.2507.39; Fri, 11 Oct 2024 16:08:36 +0200 Received: from frapeml500007.china.huawei.com ([7.182.85.172]) by frapeml500007.china.huawei.com ([7.182.85.172]) with mapi id 15.01.2507.039; Fri, 11 Oct 2024 16:08:35 +0200 From: Konstantin Ananyev To: Wathsala Wathawana Vithanage , "dev@dpdk.org" CC: Honnappa Nagarahalli , "jerinj@marvell.com" , "drc@linux.ibm.com" Subject: RE: rte_ring move head question for machines with relaxed MO (arm/ppc) Thread-Topic: rte_ring move head question for machines with relaxed MO (arm/ppc) Thread-Index: AdsZem51pV3bFdnOQj2Lv/oX5wHuJQAE4tywAAJgEaAAAKmmgAA1OCnAAC+O+rAADhtAsAAe4csQ Date: Fri, 11 Oct 2024 14:08:35 +0000 Message-ID: References: <8139916ad4814629b8804525bd785d58@huawei.com> <0badc1b8ea524bf3b69d0b7b316bdc8f@huawei.com> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.206.138.42] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org > > > > > > > > > 1. rte_ring_generic_pvt.h: > > > > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D > > > > > > > > > > > > > > pseudo-c-code // = related armv8 instructions > > > > > > > -------------------- = ---------------------------------- > > ---- > > > > > > > head.load() // = ldr [head] > > > > > > > rte_smp_rmb() // d= mb ishld > > > > > > > opposite_tail.load() // ld= r [opposite_tail] > > > > > > > ... > > > > > > > rte_atomic32_cmpset(head, ...) // ldrex[head];..= . stlex[head] > > > > > > > > > > > > > > > > > > > > > 2. rte_ring_c11_pvt.h > > > > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D > > > > > > > > > > > > > > pseudo-c-code // = related armv8 instructions > > > > > > > -------------------- = ---------------------------------- > > ---- > > > > > > > head.atomic_load(relaxed) // ldr[head] > > > > > > > atomic_thread_fence(acquire) // dmb ish > > > > > > > opposite_tail.atomic_load(acquire) // lda[opposite_t= ail] > > > > > > > ... > > > > > > > head.atomic_cas(..., relaxed) // ldrex[ha= ed]; ... strex[head] > > > > > > > > > > > > > > > > > > > > > 3. rte_ring_hts_elem_pvt.h > > > > > > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D > > > > > > > > > > > > > > pseudo-c-code // = related armv8 instructions > > > > > > > -------------------- = ---------------------------------- > > ---- > > > > > > > head.atomic_load(acquire) // lda [head] > > > > > > > opposite_tail.load() // ld= r [opposite_tail] > > > > > > > ... > > > > > > > head.atomic_cas(..., acquire) // ldaex[hea= d]; ... strex[head] > > > > > > > > > > > > > > The questions that arose from these observations: > > > > > > > a) are all 3 approaches equivalent in terms of functionality? > > > > > > Different, lda (Load with acquire semantics) and ldr (load) are= different. > > > > > > > > > > I understand that, my question was: > > > > > lda {head]; ldr[tail] > > > > > vs > > > > > ldr [head]; dmb ishld; ldr [tail]; > > > > > > > > > > Is there any difference in terms of functionality (memory ops > > > > ordering/observability)? > > > > > > > > To be more precise: > > > > > > > > lda {head]; ldr[tail] > > > > vs > > > > ldr [head]; dmb ishld; ldr [tail]; > > > > vs > > > > ldr [head]; dmb ishld; lda [tail]; > > > > > > > > what would be the difference between these 3 cases? > > > > > > Case A: lda {head]; ldr[tail] > > > load of the head will be observed by the memory subsystem before the > > > load of the tail. > > > > > > Case B: ldr [head]; dmb ishld; ldr [tail]; load of the head will be > > > observed by the memory subsystem Before the load of the tail. > > > > > > > > > Essentially both cases A and B are the same. > > > They preserve following program orders. > > > LOAD-LOAD > > > LOAD-STORE > > > > Ok, that is crystal clear, thanks for explanation. > > > > > > > Case C: ldr [head]; dmb ishld; lda [tail]; load of the head will be > > > observed by the memory subsystem before the load of the tail. > > > > Ok. > > > > > In addition, any load or store program order after lda[tail] will not > > > be observed by the memory subsystem before the load of the tail. > > > > Ok... the question is why we need that extra hoisting barrier here? > > From what unwanted re-orderings we are protecting here? > > Does it mean that without it, ldrex/strex (CAS) can be reordered with > > load[cons.tail]? > > > > Actually, we probably need to look at whole picture: > > > > in rte_ring_generic_pvt.h > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > ldr [prod.head] > > dmb ishld > > ldr [cons.tail] > > ... > > /* cas */ > > ldrex [prod.head] > > stlex [prod.head] /* sink barrier */ > > > > in rte_ring_c11_pvt.h > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > ldr [prod.head] > > dmb ishld > > lda [cons.tail] /* exrea hoist */ > > ... > > /* cas */ > > ldrex [prod.head] > > strex [prod.head] >=20 > Minor thing, ldrex and strex is how Arm 32 way of doing CAS. > Aaarch64 has a cas instruction. > Same code in aarch64 armv9-a https://godbolt.org/z/TMvWx6v4n Cool, thanks. >=20 > > > > So, in _genereic_ we don't have that extra hoist barrier after load[con= .tail], but > > we have extra sink barrier at cas(prod.tail). > > >=20 > You are right, technically, that lda[cons.tail] is not required because d= ue to the > dependency chain up until CAS a memory reordering is not possible. > For that reason, it has no issue synchronizing with the strl[prod.tail] (= in tail-update). > C11 standard calls it consume-memory-order (__ATOMIC_CONSUME in GCC). >=20 > So, ideally one could have written something like... > __atomic_load_n(prod.head, __ATOMIC_CONSUME); > instead of > __atomic_load_n(prod.head, __ATOMIC_ACQUIRE); >=20 > The compiler is then supposed to figure out if there are any dependencies= in > the code path to ensure that load of the prod.head is observed before any > load/store that's program order after the load of the prod.head. > If not, the compiler is supposed to add required barrier to ensure that o= rder is > preserved. >=20 > However, it's easier said than done. No, compiler implements it and C11 s= tandard > discourages use of memory-order-consume for that reason. >=20 > This brings us to the next caveat. As per C11 standard, there cannot be a= free standing > Store with release semantics (stlr) that isn't paired with a load with a= cquire or consume > semantics. Since we can't use __ATOMIC_CONSUME (which would have resulted= in > ldr[prod.head]), we are forced to use __ATOMIC_ACQUIRE (which results in = lda[prod.head]). >=20 >=20 > > If that's correct observation, can we change _c11_ implementation to ma= tch > > _generic_ one by: > > > > atomic_load(prod.head, releaxed); > > atomic_thread_fence(acquire); > > atomic_load(cons.tail, releaxed); > > .... > > atomic_cas(prod.head, release, relaxed); ? > > > > From my understanding that should help to make these 2 implantations > > Identical, and then hopefully we can get rid of rte_ring_generic_pvt.h. > > >=20 > They both are functionally correct. > _generic is correct but does not comply with the C11 specification. > _c11 is correct and fully compliant with the C11 specification. So far, I didn't say they are 'functionally incorrect'. The problem is - we have two implementations for exactly the same thing that generate different code-sequence for the same platform. By default It is switched on/off depending on the platform ('off' - at arm/thunderx, ppc, x86, 'on' on other arms and riscv ). So for some platforms _c11_ is probably never used (and less tested), while on others _generic_ is not used and under tested. Obviously it is not a good situation and better be addressed. As I remember the reason while we ended up with 2 code-paths here - on some platforms _c11_ one was slower. Honnappa, please correct me if I am wrong here. So ideal solution seems be - make _c11_ generate exactly the same code as current _generic_, so we can deprecate and remove _generic_. = =20 > Replacing atomic_load(cons.tail, acquire) with load(cons.tail, relaxed) i= n > _c11 would make it non-compliant with C11 the spec. Hmm... where this constraint comes from exactly? AFAIK, inside DPDK, we have several places where we do use 'atomic_store(var, release);' in conjunction with 'var.atomic_load(relaxed)= ;'=20 Again, if that really that strict - why in _c11_ move_head() it is ok to use 'head.load(acquire)' in conjunction with 'head.cas(relaxed);'? Following that logic, it should always be 'head.cas(release);', no?