From: Konstantin Ananyev
To: Wathsala Wathawana Vithanage, "dev@dpdk.org"
CC: Honnappa Nagarahalli, "jerinj@marvell.com", "drc@linux.ibm.com", nd
Subject: RE: rte_ring move head question for machines with relaxed MO (arm/ppc)
Date: Tue, 8 Oct 2024 15:45:57 +0000
Message-ID: <0badc1b8ea524bf3b69d0b7b316bdc8f@huawei.com>
References: <8139916ad4814629b8804525bd785d58@huawei.com>

> > 1. rte_ring_generic_pvt.h:
> > =====================
> >
> > pseudo-c-code                      // related armv8 instructions
> > --------------------               --------------------------------------
> > head.load()                        // ldr [head]
> > rte_smp_rmb()                      // dmb ishld
> > opposite_tail.load()               // ldr [opposite_tail]
> > ...
> > rte_atomic32_cmpset(head, ...)     // ldrex[head];... stlex[head]
> >
> >
> > 2. rte_ring_c11_pvt.h
> > =====================
> >
> > pseudo-c-code                      // related armv8 instructions
> > --------------------               --------------------------------------
> > head.atomic_load(relaxed)          // ldr[head]
> > atomic_thread_fence(acquire)       // dmb ish
> > opposite_tail.atomic_load(acquire) // lda[opposite_tail]
> > ...
> > head.atomic_cas(..., relaxed)      // ldrex[head]; ... strex[head]
> >
> >
> > 3. rte_ring_hts_elem_pvt.h
> > ==========================
> >
> > pseudo-c-code                      // related armv8 instructions
> > --------------------               --------------------------------------
> > head.atomic_load(acquire)          // lda [head]
> > opposite_tail.load()               // ldr [opposite_tail]
> > ...
> > head.atomic_cas(..., acquire)      // ldaex[head]; ... strex[head]
> >
> > The questions that arose from these observations:
> > a) are all 3 approaches equivalent in terms of functionality?
> Different, lda (Load with acquire semantics) and ldr (load) are different.

I understand that, my question was:
lda [head]; ldr [tail]
vs
ldr [head]; dmb ishld; ldr [tail];
Is there any difference in terms of functionality (memory ops ordering/observability)?
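
To make the two sequences in question concrete, here is a minimal C11 sketch. It is not the actual rte_ring code: the struct and function names are made up for illustration, and the armv8 instructions mentioned in the comments are what a compiler would typically be expected to emit, not something verified here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one side (prod or cons) of a ring. */
struct ht_sketch {
	_Atomic uint32_t head;
	_Atomic uint32_t tail;
};

/*
 * Variant A: load-acquire on head, then a relaxed load of the opposite tail.
 * Assumed armv8 shape: lda [head]; ldr [tail]
 */
uint32_t
free_entries_acq_head(struct ht_sketch *prod, struct ht_sketch *cons,
		uint32_t capacity)
{
	assert(capacity != 0);
	uint32_t head = atomic_load_explicit(&prod->head, memory_order_acquire);
	uint32_t tail = atomic_load_explicit(&cons->tail, memory_order_relaxed);
	/* Unsigned arithmetic, so index wrap-around is handled naturally. */
	return capacity + tail - head;
}

/*
 * Variant B: relaxed load of head, acquire fence, relaxed load of the tail.
 * Assumed armv8 shape: ldr [head]; dmb ishld; ldr [tail]
 */
uint32_t
free_entries_fence(struct ht_sketch *prod, struct ht_sketch *cons,
		uint32_t capacity)
{
	assert(capacity != 0);
	uint32_t head = atomic_load_explicit(&prod->head, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);
	uint32_t tail = atomic_load_explicit(&cons->tail, memory_order_relaxed);
	return capacity + tail - head;
}
```

Both variants keep the tail load from being reordered before the head load; the sketch only shows the shape of the two sequences and says nothing about which one is faster.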

>
> > b) if yes, is there any difference in terms of performance between:
> > "ldr; dmb; ldr;" vs "lda; ldr;"
> > ?
> dmb is a full barrier, performance is poor.
> I would assume (haven't measured) ldr; dmb; ldr to be less performant than lda;ldr;

Throughout this mail I am talking about 'dmb ishld', sorry for not being clear upfront.

>
> > c) Comparing 1) and 2) above, combination of
> > ldr [head]; dmb; lda [opposite_tail]:
> > looks like an overkill to me. Wouldn't just:
> > ldr [head]; dmb; ldr[opposite_tail];
> > be sufficient here?
> lda [opposite_tail]: synchronizes with stlr in tail update that happens after array update.
> So, it cannot be changed to ldr.

Can you explain to me in a bit more detail why it is not possible?
From here:
https://developer.arm.com/documentation/dui0802/b/A32-and-T32-Instructions/LDA-and-STL
"There is no requirement that a load-acquire and store-release be paired."
Am I misinterpreting this statement somehow?

> lda can be replaced with ldapr (LDA with release consistency - processor consistency)
> which is more performant as lda is allowed to rise above stlr. Can be done with -mcpu=+rcpc
>
> --wathsala
>
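
For reference, a minimal C11 sketch of the pairing discussed under c): an array update followed by a store-release of the tail on one side, and a load-acquire of that tail followed by an array read on the other. This is an illustration only, assuming a single producer; the names `ring_sketch`, `sketch_publish` and `sketch_peek` are hypothetical and do not exist in rte_ring.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SLOTS 8u	/* hypothetical ring size, power of two */

struct ring_sketch {
	uint32_t slots[SKETCH_SLOTS];
	_Atomic uint32_t prod_tail;
};

/* Single producer assumed: write the slot first, then publish the new tail. */
void
sketch_publish(struct ring_sketch *r, uint32_t value)
{
	uint32_t t = atomic_load_explicit(&r->prod_tail, memory_order_relaxed);

	r->slots[t & (SKETCH_SLOTS - 1)] = value;	/* array update */
	/* store-release (assumed stlr on armv8): array update is ordered before it */
	atomic_store_explicit(&r->prod_tail, t + 1, memory_order_release);
}

/* Returns 1 and fills *out if slot idx has been published, 0 otherwise. */
int
sketch_peek(struct ring_sketch *r, uint32_t idx, uint32_t *out)
{
	assert(out != NULL);
	/* load-acquire (assumed lda on armv8): the slot read below is ordered after it */
	uint32_t t = atomic_load_explicit(&r->prod_tail, memory_order_acquire);

	if ((int32_t)(t - idx) <= 0)
		return 0;	/* tail has not moved past idx yet */
	*out = r->slots[idx & (SKETCH_SLOTS - 1)];
	return 1;
}
```

Whether the acquire on the tail load could be replaced by a relaxed load plus a preceding barrier is exactly the open question above; the sketch only shows what the release/acquire version looks like.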