Subject: RE: [RFC] ring: improve ring performance with C11 atomics
Date: Sat, 22 Apr 2023 15:19:35 +0200
From: Morten Brørup
To: Wathsala Vithanage, dev@dpdk.org
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D878AE@smartserver.smartshare.dk>
In-Reply-To: <20230421191642.217011-1-wathsala.vithanage@arm.com>
References: <20230421191642.217011-1-wathsala.vithanage@arm.com>
List-Id: DPDK patches and discussions

> From: Wathsala Vithanage [mailto:wathsala.vithanage@arm.com]
> Sent: Friday, 21 April 2023 21.17
>
> The tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head
> can be changed from __ATOMIC_ACQUIRE to __ATOMIC_RELAXED, because
> __rte_ring_dequeue_elems calculates the addresses of the dequeued
> elements from the old_head updated by the __atomic_compare_exchange_n
> intrinsic used in __rte_ring_move_prod_head.
> This results in an address dependency between the two operations.
> Therefore __rte_ring_dequeue_elems cannot happen before
> __rte_ring_move_prod_head. Similarly, __rte_ring_enqueue_elems and
> __rte_ring_move_cons_head won't be reordered either.

These preconditions should be added as comments in the source code.

>
> Performance on Arm N1
> Gain relative to generic implementation
> +-----------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8 (Arm N1)                       |
> +-----------------------------------------------------------------+
> | Generic             | C11 atomics         | C11 atomics improved |
> +-----------------------------------------------------------------+
> | Total count: 766730 | Total count: 651686 | Total count: 812125 |
> |                     | Gain: -15%          | Gain: 6%            |
> +-----------------------------------------------------------------+
> +-----------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32 (Arm N1)                      |
> +-----------------------------------------------------------------+
> | Generic             | C11 atomics         | C11 atomics improved |
> +-----------------------------------------------------------------+
> | Total count: 816745 | Total count: 646385 | Total count: 830935 |
> |                     | Gain: -21%          | Gain: 2%            |
> +-----------------------------------------------------------------+

Big performance gain compared to the pre-improved C11 atomics! Excellent.
>
> Performance on x86-64 Cascade Lake
> Gain relative to generic implementation
> +-----------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8                                |
> +-----------------------------------------------------------------+
> | Generic             | C11 atomics         | C11 atomics improved |
> +-----------------------------------------------------------------+
> | Total count: 181640 | Total count: 181995 | Total count: 182791 |
> |                     | Gain: 0.2%          | Gain: 0.6%          |
> +-----------------------------------------------------------------+
> +-----------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32                               |
> +-----------------------------------------------------------------+
> | Generic             | C11 atomics         | C11 atomics improved |
> +-----------------------------------------------------------------+
> | Total count: 167495 | Total count: 161536 | Total count: 163190 |
> |                     | Gain: -3.5%         | Gain: -2.6%         |
> +-----------------------------------------------------------------+

I noticed that the larger size (32 objects) had a larger relative drop in
performance than the smaller size (8 objects), so I am wondering what the
performance numbers are for size 512, the default
RTE_MEMPOOL_CACHE_MAX_SIZE. It's probably not going to change anything
regarding the patch acceptance, but I'm curious about the numbers.
>
> Signed-off-by: Wathsala Vithanage
> Reviewed-by: Honnappa Nagarahalli
> Reviewed-by: Feifei Wang
> ---
>  .mailmap                    |  1 +
>  lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
>  2 files changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/.mailmap b/.mailmap
> index 4018f0fc47..367115d134 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -1430,6 +1430,7 @@ Walter Heymans
>  Wang Sheng-Hui
>  Wangyu (Eric)
>  Waterman Cao
> +Wathsala Vithanage
>  Weichun Chen
>  Wei Dai
>  Weifeng Li
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index f895950df4..1895f2bb0e 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -24,6 +24,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
>  	if (!single)
>  		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
>
> +	/*
> +	 * Updating of ht->tail cannot happen before elements are added to or
> +	 * removed from the ring, as it could result in data races between
> +	 * producer and consumer threads. Therefore ht->tail should be updated
> +	 * with release semantics to prevent ring data copy phase from sinking
> +	 * below it.
> +	 */

I think this comment should be clarified as:

Updating of ht->tail SHOULD NOT happen before elements are added to or
removed from the ring, as it WOULD result in data races between producer
and consumer threads. Therefore ht->tail MUST be updated with release
semantics to prevent ring data copy phase from sinking below it.

Don't use capitals; I only used them to highlight my modifications.

NB: English is not my native language, so please ignore me if I am
mistaken!

>  	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
>  }
>
> @@ -69,11 +76,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  	/* Ensure the head is read before tail */
>  	__atomic_thread_fence(__ATOMIC_ACQUIRE);
>
> -	/* load-acquire synchronize with store-release of ht->tail
> -	 * in update_tail.
> -	 */
>  	cons_tail = __atomic_load_n(&r->cons.tail,
> -			__ATOMIC_ACQUIRE);
> +			__ATOMIC_RELAXED);

The __ATOMIC_ACQUIRE had a comment describing its purpose. Please don't
just remove that comment. Please replace it with a new comment describing
why __ATOMIC_RELAXED is valid here.

>
>  	/* The subtraction is done between two unsigned 32bits value
>  	 * (the result is always modulo 32 bits even if we have
> @@ -145,12 +149,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  	/* Ensure the head is read before tail */
>  	__atomic_thread_fence(__ATOMIC_ACQUIRE);
>
> -	/* this load-acquire synchronize with store-release of ht->tail
> -	 * in update_tail.
> -	 */
>  	prod_tail = __atomic_load_n(&r->prod.tail,
> -			__ATOMIC_ACQUIRE);
> -
> +			__ATOMIC_RELAXED);

Same comment as above: Don't just remove the old comment, please replace
it with a new comment.

>  	/* The subtraction is done between two unsigned 32bits value
>  	 * (the result is always modulo 32 bits even if we have
>  	 * cons_head > prod_tail). So 'entries' is always between 0
> --
> 2.25.1

With comments regarding __ATOMIC_RELAXED added to the source code,

Acked-by: Morten Brørup