From: Konstantin Ananyev
To: Wathsala Vithanage, honnappa.nagarahalli@arm.com,
    konstantin.v.ananyev@yandex.ru, thomas@monjalon.net, ruifeng.wang@arm.com
Cc: dev@dpdk.org, nd@arm.com, justin.he@arm.com
Subject: RE: [RFC] ring: further performance improvements with C11
Date: Wed, 2 Aug 2023 09:42:32 +0000
Message-ID: <67a8987cb0d5456b9c99887402ea30af@huawei.com>
In-Reply-To: <20230615201335.919563-2-wathsala.vithanage@arm.com>

> For improved performance over the current C11 based ring implementation,
> the following changes were made.
> (1) Replace the tail store with RELEASE semantics in __rte_ring_update_tail
>     with a RELEASE fence. Replace the loads of the tail with ACQUIRE
>     semantics in __rte_ring_move_prod_head and __rte_ring_move_cons_head
>     with ACQUIRE fences.
> (2) Remove the ACQUIRE fences between the load of the old_head and the load
>     of the cons_tail in __rte_ring_move_prod_head and
>     __rte_ring_move_cons_head.
>     These two fences are not required for the safety of the ring library.

Hmm... with these changes, aren't we re-introducing the old bug fixed by
this commit:

commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654
Author: Jia He
Date:   Fri Nov 10 03:30:42 2017 +0000

    ring: guarantee load/load order in enqueue and dequeue

    We watched a rte panic of mbuf_autotest in our qualcomm arm64 server
    (Amberwing).

    Root cause:
    In __rte_ring_move_cons_head()
    ...
        do {
                /* Restore n as it may change every loop */
                n = max;

                *old_head = r->cons.head;                 //1st load
                const uint32_t prod_tail = r->prod.tail;  //2nd load

    In weak memory order architectures (powerpc, arm), the 2nd load might be
    reordered before the 1st load; that makes *entries bigger than we wanted.
    This nasty reordering messed enqueue/dequeue up.
    ....

?
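To make the concern concrete in isolation, here is a minimal stand-alone
model of those two loads in plain C11 <stdatomic.h> (the rte_* wrappers are
dropped; cons_head and prod_tail are just illustrative stand-ins for
r->cons.head and r->prod.tail, not the actual ring code):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t cons_head;
static _Atomic uint32_t prod_tail;

static uint32_t entries_snapshot(void)
{
        uint32_t old_head, tail;

        old_head = atomic_load_explicit(&cons_head,
                        memory_order_relaxed);          /* 1st load */

        /*
         * The acquire fence is what guarantees the load/load order:
         * without it, a weakly ordered CPU may perform the 2nd load
         * first, pairing a newer tail with an older head, so the
         * computed number of entries is bigger than it should be.
         */
        atomic_thread_fence(memory_order_acquire);

        tail = atomic_load_explicit(&prod_tail,
                        memory_order_relaxed);          /* 2nd load */

        /* Unsigned subtraction, modulo 2^32, as in the ring code. */
        return tail - old_head;
}

With the patch below, that fence is gone and both loads are relaxed; as far
as I can see, nothing in the loop itself would then forbid the 2nd load
from being hoisted above the 1st on arm/powerpc.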
>
> Signed-off-by: Wathsala Vithanage
> Reviewed-by: Honnappa Nagarahalli
> Reviewed-by: Ruifeng Wang
> ---
>  .mailmap                    |  1 +
>  lib/ring/rte_ring_c11_pvt.h | 35 ++++++++++++++++++++---------------
>  2 files changed, 21 insertions(+), 15 deletions(-)
>
> diff --git a/.mailmap b/.mailmap
> index 4018f0fc47..367115d134 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -1430,6 +1430,7 @@ Walter Heymans
>  Wang Sheng-Hui
>  Wangyu (Eric)
>  Waterman Cao
> +Wathsala Vithanage
>  Weichun Chen
>  Wei Dai
>  Weifeng Li
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index f895950df4..63fe58ce9e 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
>  	uint32_t new_val, uint32_t single, uint32_t enqueue)
>  {
>  	RTE_SET_USED(enqueue);
> +	/*
> +	 * Updating of ht->tail cannot happen before elements are added to or
> +	 * removed from the ring, as it could result in data races between
> +	 * producer and consumer threads. Therefore we need a release
> +	 * barrier here.
> +	 */
> +	rte_atomic_thread_fence(__ATOMIC_RELEASE);
>
>  	/*
>  	 * If there are other enqueues/dequeues in progress that preceded us,
> @@ -24,7 +31,7 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
>  	if (!single)
>  		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
>
> -	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> +	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED);
>  }
>
>  /**
> @@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  		/* Reset n to the initial burst count */
>  		n = max;
>
> -		/* Ensure the head is read before tail */
> -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> -
> -		/* load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		cons_tail = __atomic_load_n(&r->cons.tail,
> -					__ATOMIC_ACQUIRE);
> +					__ATOMIC_RELAXED);
>
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
> @@ -100,6 +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  					0, __ATOMIC_RELAXED,
>  					__ATOMIC_RELAXED);
>  	} while (unlikely(success == 0));
> +	/*
> +	 * Ensure that updates to the ring doesn't rise above
> +	 * load of the new_head in SP and MP cases.
> +	 */
> +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
>  	return n;
>  }
>
> @@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  		/* Restore n as it may change every loop */
>  		n = max;
>
> -		/* Ensure the head is read before tail */
> -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> -
> -		/* this load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		prod_tail = __atomic_load_n(&r->prod.tail,
> -					__ATOMIC_ACQUIRE);
> +					__ATOMIC_RELAXED);
>
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
> @@ -175,6 +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  					0, __ATOMIC_RELAXED,
>  					__ATOMIC_RELAXED);
>  	} while (unlikely(success == 0));
> +	/*
> +	 * Ensure that updates to the ring doesn't rise above
> +	 * load of the new_head in SP and MP cases.
> +	 */
> +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
>  	return n;
>  }
>
> --
> 2.25.1
>
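As a side note, the tail-publication half of change (1), taken alone, swaps
one standard C11 publication idiom for another. A stand-alone sketch of the
two forms in plain <stdatomic.h> (illustrative names, not the actual ring
code):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t tail;

/* Current code: the release store itself publishes the elements. */
static void publish_with_release_store(uint32_t new_val)
{
        atomic_store_explicit(&tail, new_val, memory_order_release);
}

/*
 * Patched code: a release fence followed by a relaxed store.
 * Per C11 fence semantics, a reader that observes new_val with an
 * acquire load (or a relaxed load followed by an acquire fence)
 * still synchronizes with everything sequenced before the fence.
 */
static void publish_with_release_fence(uint32_t new_val)
{
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&tail, new_val, memory_order_relaxed);
}

So the question above is really about the other half of the patch: whether
the head/tail snapshot inside the loop can still pair a stale head with a
newer tail once the fence between the two loads is gone.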