From: Konstantin Ananyev
To: Wathsala Wathawana Vithanage, "dev@dpdk.org"
CC: Honnappa Nagarahalli, "jerinj@marvell.com", "drc@linux.ibm.com", nd
Subject: RE: rte_ring move head question for machines with relaxed MO (arm/ppc)
Date: Thu, 10 Oct 2024 16:54:11 +0000

> > > > > 1. rte_ring_generic_pvt.h:
> > > > > =====================
> > > > >
> > > > > pseudo-c-code                        // related armv8 instructions
> > > > > --------------------                 --------------------------------------
> > > > > head.load()                          // ldr [head]
> > > > > rte_smp_rmb()                        // dmb ishld
> > > > > opposite_tail.load()                 // ldr [opposite_tail]
> > > > > ...
> > > > > rte_atomic32_cmpset(head, ...)       // ldrex[head]; ... stlex[head]
> > > > >
> > > > >
> > > > > 2. rte_ring_c11_pvt.h
> > > > > =====================
> > > > >
> > > > > pseudo-c-code                        // related armv8 instructions
> > > > > --------------------                 --------------------------------------
> > > > > head.atomic_load(relaxed)            // ldr[head]
> > > > > atomic_thread_fence(acquire)         // dmb ish
> > > > > opposite_tail.atomic_load(acquire)   // lda[opposite_tail]
> > > > > ...
> > > > > head.atomic_cas(..., relaxed)        // ldrex[head]; ... strex[head]
> > > > >
> > > > >
> > > > > 3. rte_ring_hts_elem_pvt.h
> > > > > ==========================
> > > > >
> > > > > pseudo-c-code                        // related armv8 instructions
> > > > > --------------------                 --------------------------------------
> > > > > head.atomic_load(acquire)            // lda [head]
> > > > > opposite_tail.load()                 // ldr [opposite_tail]
> > > > > ...
> > > > > head.atomic_cas(..., acquire)        // ldaex[head]; ... strex[head]
> > > > >
> > > > > The questions that arose from these observations:
> > > > > a) are all 3 approaches equivalent in terms of functionality?
> > > > Different, lda (Load with acquire semantics) and ldr (load) are different.
> > > >
> > > I understand that, my question was:
> > > lda [head]; ldr [tail]
> > > vs
> > > ldr [head]; dmb ishld; ldr [tail];
> > >
> > > Is there any difference in terms of functionality (memory ops ordering/observability)?
> >
> > To be more precise:
> >
> > lda [head]; ldr [tail]
> > vs
> > ldr [head]; dmb ishld; ldr [tail];
> > vs
> > ldr [head]; dmb ishld; lda [tail];
> >
> > what would be the difference between these 3 cases?
>
> Case A: lda [head]; ldr [tail]
> load of the head will be observed by the memory subsystem
> before the load of the tail.
>
> Case B: ldr [head]; dmb ishld; ldr [tail];
> load of the head will be observed by the memory subsystem
> before the load of the tail.
>
>
> Essentially both cases A and B are the same.
> They preserve following program orders.
> LOAD-LOAD
> LOAD-STORE

Ok, that is crystal clear, thanks for the explanation.

> Case C: ldr [head]; dmb ishld; lda [tail];
> load of the head will be observed by the memory subsystem
> before the load of the tail.

Ok.

> In addition, any load or store program
> order after lda[tail] will not be observed by the memory subsystem
> before the load of the tail.

Ok... the question is: why do we need that extra hoisting barrier here?
Which unwanted re-orderings are we protecting against?
Does it mean that without it, ldrex/strex (CAS) can be reordered with load[cons.tail]?

Actually, we probably need to look at the whole picture:

in rte_ring_generic_pvt.h
=====================
ldr [prod.head]
dmb ishld
ldr [cons.tail]
...
/* cas */
ldrex [prod.head]
stlex [prod.head] /* sink barrier */

in rte_ring_c11_pvt.h
=====================
ldr [prod.head]
dmb ishld
lda [cons.tail] /* extra hoist */
...
/* cas */
ldrex [prod.head]
strex [prod.head]

So, in _generic_ we don't have that extra hoist barrier after load[cons.tail],
but we have an extra sink barrier at cas(prod.head).
If that's a correct observation, can we change the _c11_ implementation to match the _generic_ one by:

atomic_load(prod.head, relaxed);
atomic_thread_fence(acquire);
atomic_load(cons.tail, relaxed);
....
atomic_cas(prod.head, release, relaxed);

?

From my understanding that should help to make these 2 implementations
identical, and then hopefully we can get rid of rte_ring_generic_pvt.h.
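
To make the proposal concrete, here is a minimal C11 sketch of a producer-side move-head loop using the orderings listed above. It is not the DPDK code: struct sketch_ring, sketch_ring_init() and sketch_move_prod_head() are hypothetical names made up for illustration, and the sketch only writes the proposed ordering down. Whether that ordering is sufficient is exactly the open question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the ring's head/tail indices. */
struct sketch_ring {
	uint32_t capacity;          /* number of usable slots */
	_Atomic uint32_t prod_head; /* next slot a producer will reserve */
	_Atomic uint32_t cons_tail; /* slots below this index are free again */
};

static void
sketch_ring_init(struct sketch_ring *r, uint32_t capacity)
{
	assert(capacity != 0);
	r->capacity = capacity;
	atomic_init(&r->prod_head, 0);
	atomic_init(&r->cons_tail, 0);
}

/*
 * Try to reserve n slots for a producer.
 * Returns n and stores the previous head in *old_head on success,
 * returns 0 if there is not enough free space.
 */
static uint32_t
sketch_move_prod_head(struct sketch_ring *r, uint32_t n, uint32_t *old_head)
{
	uint32_t head, tail, free_slots;

	/* ldr [prod.head] */
	head = atomic_load_explicit(&r->prod_head, memory_order_relaxed);
	do {
		/* Intended to keep the load of cons_tail from moving ahead of
		 * the load of prod_head (the dmb ishld in the listings). */
		atomic_thread_fence(memory_order_acquire);

		/* Proposed change: relaxed instead of acquire (ldr, not lda). */
		tail = atomic_load_explicit(&r->cons_tail, memory_order_relaxed);

		/* Unsigned arithmetic keeps this valid across index wrap-around. */
		free_slots = r->capacity + tail - head;
		if (n > free_slots)
			return 0;

		/* Proposed change: release on success, relaxed on failure.
		 * On failure the CAS reloads the current prod_head into head. */
	} while (!atomic_compare_exchange_strong_explicit(&r->prod_head,
			&head, head + n,
			memory_order_release, memory_order_relaxed));

	*old_head = head;
	return n;
}
```

With a ring of capacity 4 and nothing consumed yet, reserving 3 slots succeeds and moves prod_head from 0 to 3. A further request for 2 slots fails, while a request for 1 slot succeeds.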