Subject: RE: [RFC] ring: improve ring performance with C11 atomics
Date: Sat, 22 Apr 2023 15:19:35 +0200
From: Morten Brørup
To: Wathsala Vithanage, dev@dpdk.org
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D878AE@smartserver.smartshare.dk>
In-Reply-To: <20230421191642.217011-1-wathsala.vithanage@arm.com>
References: <20230421191642.217011-1-wathsala.vithanage@arm.com>
List-Id: DPDK patches and discussions

> From: Wathsala Vithanage [mailto:wathsala.vithanage@arm.com]
> Sent: Friday, 21 April 2023 21.17
>
> The tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head
> can be changed from __ATOMIC_ACQUIRE to __ATOMIC_RELAXED, because
> __rte_ring_dequeue_elems calculates the addresses of the dequeued
> elements from the old_head updated by the __atomic_compare_exchange_n
> intrinsic used in __rte_ring_move_prod_head.
> This results in an address dependency between the two operations.
> Therefore __rte_ring_dequeue_elems cannot happen before
> __rte_ring_move_prod_head. Similarly, __rte_ring_enqueue_elems and
> __rte_ring_move_cons_head won't be reordered either.

These preconditions should be added as comments in the source code.

>
> Performance on Arm N1
> Gain relative to generic implementation
> +-----------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8 (Arm N1)                       |
> +-----------------------------------------------------------------+
> | Generic             | C11 atomics         | C11 atomics improved |
> +-----------------------------------------------------------------+
> | Total count: 766730 | Total count: 651686 | Total count: 812125 |
> |                     | Gain: -15%          | Gain: 6%            |
> +-----------------------------------------------------------------+
> +-----------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32 (Arm N1)                      |
> +-----------------------------------------------------------------+
> | Generic             | C11 atomics         | C11 atomics improved |
> +-----------------------------------------------------------------+
> | Total count: 816745 | Total count: 646385 | Total count: 830935 |
> |                     | Gain: -21%          | Gain: 2%            |
> +-----------------------------------------------------------------+

Big performance gain compared to the pre-improved C11 atomics! Excellent.
>
> Performance on x86-64 Cascade Lake
> Gain relative to generic implementation
> +-----------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8                                |
> +-----------------------------------------------------------------+
> | Generic             | C11 atomics         | C11 atomics improved |
> +-----------------------------------------------------------------+
> | Total count: 181640 | Total count: 181995 | Total count: 182791 |
> |                     | Gain: 0.2%          | Gain: 0.6%          |
> +-----------------------------------------------------------------+
> +-----------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32                               |
> +-----------------------------------------------------------------+
> | Generic             | C11 atomics         | C11 atomics improved |
> +-----------------------------------------------------------------+
> | Total count: 167495 | Total count: 161536 | Total count: 163190 |
> |                     | Gain: -3.5%         | Gain: -2.6%         |
> +-----------------------------------------------------------------+

I noticed that the larger size (32 objects) had a larger relative drop in
performance than the smaller size (8 objects), so I am wondering what the
performance numbers are for size 512, the default
RTE_MEMPOOL_CACHE_MAX_SIZE. It's probably not going to change anything
regarding the patch acceptance, but I'm curious about the numbers.
>
> Signed-off-by: Wathsala Vithanage
> Reviewed-by: Honnappa Nagarahalli
> Reviewed-by: Feifei Wang
> ---
>  .mailmap                    |  1 +
>  lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
>  2 files changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/.mailmap b/.mailmap
> index 4018f0fc47..367115d134 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -1430,6 +1430,7 @@ Walter Heymans
>  Wang Sheng-Hui
>  Wangyu (Eric)
>  Waterman Cao
> +Wathsala Vithanage
>  Weichun Chen
>  Wei Dai
>  Weifeng Li
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index f895950df4..1895f2bb0e 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -24,6 +24,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
>  	if (!single)
>  		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
>
> +	/*
> +	 * Updating of ht->tail cannot happen before elements are added to or
> +	 * removed from the ring, as it could result in data races between
> +	 * producer and consumer threads. Therefore ht->tail should be updated
> +	 * with release semantics to prevent ring data copy phase from sinking
> +	 * below it.
> +	 */

I think this comment should be clarified as:

Updating of ht->tail SHOULD NOT happen before elements are added to or
removed from the ring, as it WOULD result in data races between producer
and consumer threads. Therefore ht->tail MUST be updated with release
semantics to prevent ring data copy phase from sinking below it.

Don't use capitals; I only used them to highlight my modifications.

NB: English is not my native language, so please ignore me if I am
mistaken!

>  	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
>  }
>
> @@ -69,11 +76,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  	/* Ensure the head is read before tail */
>  	__atomic_thread_fence(__ATOMIC_ACQUIRE);
>
> -	/* load-acquire synchronize with store-release of ht->tail
> -	 * in update_tail.
> -	 */
>  	cons_tail = __atomic_load_n(&r->cons.tail,
> -			__ATOMIC_ACQUIRE);
> +			__ATOMIC_RELAXED);

The __ATOMIC_ACQUIRE had a comment describing its purpose. Please don't
just remove that comment. Please replace it with a new comment describing
why __ATOMIC_RELAXED is valid here.

>
>  	/* The subtraction is done between two unsigned 32bits value
>  	 * (the result is always modulo 32 bits even if we have
> @@ -145,12 +149,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  	/* Ensure the head is read before tail */
>  	__atomic_thread_fence(__ATOMIC_ACQUIRE);
>
> -	/* this load-acquire synchronize with store-release of ht->tail
> -	 * in update_tail.
> -	 */
>  	prod_tail = __atomic_load_n(&r->prod.tail,
> -			__ATOMIC_ACQUIRE);
> -
> +			__ATOMIC_RELAXED);

Same comment as above: Don't just remove the old comment, please replace
it with a new comment.

>  	/* The subtraction is done between two unsigned 32bits value
>  	 * (the result is always modulo 32 bits even if we have
>  	 * cons_head > prod_tail). So 'entries' is always between 0
> --
> 2.25.1

With comments regarding __ATOMIC_RELAXED added to the source code,

Acked-by: Morten Brørup