From: Konstantin Ananyev
To: Wathsala Vithanage, honnappa.nagarahalli@arm.com,
    konstantin.v.ananyev@yandex.ru, thomas@monjalon.net, ruifeng.wang@arm.com
Cc: dev@dpdk.org, nd@arm.com, justin.he@arm.com
Subject: RE: [RFC] ring: further performance improvements with C11
Date: Wed, 2 Aug 2023 09:42:32 +0000
Message-ID: <67a8987cb0d5456b9c99887402ea30af@huawei.com>
In-Reply-To: <20230615201335.919563-2-wathsala.vithanage@arm.com>

> For improved performance over the current C11 based ring implementation,
> the following changes were made.
> (1) Replace the tail store with RELEASE semantics in __rte_ring_update_tail
>     with a RELEASE fence. Replace the loads of the tail with ACQUIRE
>     semantics in __rte_ring_move_prod_head and __rte_ring_move_cons_head
>     with ACQUIRE fences.
> (2) Remove the ACQUIRE fences between the load of the old_head and the load
>     of the cons_tail in __rte_ring_move_prod_head and
>     __rte_ring_move_cons_head.
>     These two fences are not required for the safety of the ring library.

Hmm... with these changes, aren't we re-introducing the old bug fixed by
this commit:

commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654
Author: Jia He
Date:   Fri Nov 10 03:30:42 2017 +0000

    ring: guarantee load/load order in enqueue and dequeue

    We watched a rte panic of mbuf_autotest in our qualcomm arm64 server
    (Amberwing).

    Root cause:
    In __rte_ring_move_cons_head()
    ...
        do {
                /* Restore n as it may change every loop */
                n = max;

                *old_head = r->cons.head;                 //1st load
                const uint32_t prod_tail = r->prod.tail;  //2nd load

    In weak memory order architectures (powerpc, arm), the 2nd load might be
    reordered before the 1st load; that makes *entries bigger than we wanted.
    This nasty reordering messed enqueue/dequeue up.
    ....

?
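To make the concern concrete in isolation, here is a minimal stand-alone
model of those two loads in plain C11 <stdatomic.h> (the rte_* wrappers are
dropped; cons_head and prod_tail are just illustrative stand-ins for
r->cons.head and r->prod.tail, not the actual ring code):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t cons_head;
static _Atomic uint32_t prod_tail;

static uint32_t entries_snapshot(void)
{
        uint32_t old_head, tail;

        old_head = atomic_load_explicit(&cons_head,
                        memory_order_relaxed);          /* 1st load */

        /*
         * The acquire fence is what guarantees the load/load order:
         * without it, a weakly ordered CPU may perform the 2nd load
         * first, pairing a newer tail with an older head, so the
         * computed number of entries is bigger than it should be.
         */
        atomic_thread_fence(memory_order_acquire);

        tail = atomic_load_explicit(&prod_tail,
                        memory_order_relaxed);          /* 2nd load */

        /* Unsigned subtraction, modulo 2^32, as in the ring code. */
        return tail - old_head;
}

With the patch below, that fence is gone and both loads are relaxed; as far
as I can see, nothing in the loop itself would then forbid the 2nd load
from being hoisted above the 1st on arm/powerpc.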
>
> Signed-off-by: Wathsala Vithanage
> Reviewed-by: Honnappa Nagarahalli
> Reviewed-by: Ruifeng Wang
> ---
>  .mailmap                    |  1 +
>  lib/ring/rte_ring_c11_pvt.h | 35 ++++++++++++++++++++---------------
>  2 files changed, 21 insertions(+), 15 deletions(-)
>
> diff --git a/.mailmap b/.mailmap
> index 4018f0fc47..367115d134 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -1430,6 +1430,7 @@ Walter Heymans
>  Wang Sheng-Hui
>  Wangyu (Eric)
>  Waterman Cao
> +Wathsala Vithanage
>  Weichun Chen
>  Wei Dai
>  Weifeng Li
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index f895950df4..63fe58ce9e 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
>  	uint32_t new_val, uint32_t single, uint32_t enqueue)
>  {
>  	RTE_SET_USED(enqueue);
> +	/*
> +	 * Updating of ht->tail cannot happen before elements are added to or
> +	 * removed from the ring, as it could result in data races between
> +	 * producer and consumer threads. Therefore we need a release
> +	 * barrier here.
> +	 */
> +	rte_atomic_thread_fence(__ATOMIC_RELEASE);
>
>  	/*
>  	 * If there are other enqueues/dequeues in progress that preceded us,
> @@ -24,7 +31,7 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
>  	if (!single)
>  		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
>
> -	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> +	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED);
>  }
>
>  /**
> @@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  		/* Reset n to the initial burst count */
>  		n = max;
>
> -		/* Ensure the head is read before tail */
> -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> -
> -		/* load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		cons_tail = __atomic_load_n(&r->cons.tail,
> -					__ATOMIC_ACQUIRE);
> +					__ATOMIC_RELAXED);
>
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
> @@ -100,6 +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  					0, __ATOMIC_RELAXED,
>  					__ATOMIC_RELAXED);
>  	} while (unlikely(success == 0));
> +	/*
> +	 * Ensure that updates to the ring doesn't rise above
> +	 * load of the new_head in SP and MP cases.
> +	 */
> +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
>  	return n;
>  }
>
> @@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  		/* Restore n as it may change every loop */
>  		n = max;
>
> -		/* Ensure the head is read before tail */
> -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> -
> -		/* this load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		prod_tail = __atomic_load_n(&r->prod.tail,
> -					__ATOMIC_ACQUIRE);
> +					__ATOMIC_RELAXED);
>
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
> @@ -175,6 +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  					0, __ATOMIC_RELAXED,
>  					__ATOMIC_RELAXED);
>  	} while (unlikely(success == 0));
> +	/*
> +	 * Ensure that updates to the ring doesn't rise above
> +	 * load of the new_head in SP and MP cases.
> +	 */
> +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
>  	return n;
>  }
>
> --
> 2.25.1
>
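As a side note, the tail-publication half of change (1), taken alone, swaps
one standard C11 publication idiom for another. A stand-alone sketch of the
two forms in plain <stdatomic.h> (illustrative names, not the actual ring
code):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t tail;

/* Current code: the release store itself publishes the elements. */
static void publish_with_release_store(uint32_t new_val)
{
        atomic_store_explicit(&tail, new_val, memory_order_release);
}

/*
 * Patched code: a release fence followed by a relaxed store.
 * Per C11 fence semantics, a reader that observes new_val with an
 * acquire load (or a relaxed load followed by an acquire fence)
 * still synchronizes with everything sequenced before the fence.
 */
static void publish_with_release_fence(uint32_t new_val)
{
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&tail, new_val, memory_order_relaxed);
}

So the question above is really about the other half of the patch: whether
the head/tail snapshot inside the loop can still pair a stale head with a
newer tail once the fence between the two loads is gone.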