From: Konstantin Ananyev
To: Honnappa Nagarahalli, Wathsala Wathawana Vithanage,
 konstantin.v.ananyev@yandex.ru, thomas@monjalon.net, Ruifeng Wang
CC: dev@dpdk.org, nd, Justin He
Subject: RE: [RFC] ring: further performance improvements with C11
Date: Mon, 21 Aug 2023 13:27:21 +0000

> > > > > For improved performance over the current C11 based ring
> > > > > implementation the following changes were made.
> > > > > (1) Replace the tail store with RELEASE semantics in
> > > > > __rte_ring_update_tail with a RELEASE fence. Replace the loads of
> > > > > the tail with ACQUIRE semantics in __rte_ring_move_prod_head and
> > > > > __rte_ring_move_cons_head with ACQUIRE fences.
> > > > > (2) Remove the ACQUIRE fences between the load of the old_head and
> > > > > the load of the cons_tail in __rte_ring_move_prod_head and
> > > > > __rte_ring_move_cons_head.
> > > > > These two fences are not required for the safety of the ring library.
> > > >
> > > > Hmm... with these changes, aren't we re-introducing the old bug
> > > > fixed by this commit:
> > >
> > > The cover letter explains why this barrier does not solve what it intends
> > > to solve and why it should not matter.
> > > https://mails.dpdk.org/archives/dev/2023-June/270874.html
> >
> > Ok, let's consider a case similar to yours (i), but where r->prod.head was
> > moved a distance greater than r->capacity.
> > To be more specific, let's start with the same initial state:
> > capacity = 32
> > r->cons.tail = 5
> > r->cons.head = 5
> > r->prod.head = 10
> > r->prod.tail = 10
> >
> > time 0, thread1:
> > /* re-ordered load */
> > cons_tail = r->cons.tail; // = 5
> >
> > Now, thread1 was stalled for a bit; meanwhile there were a few

> What exactly do you mean by 'stalled'?

I mean: the CPU pipeline the thread was running on got stalled for whatever
reason - memory load latency, TLB miss, etc.

> If you are meaning, thread is preempted,
> then the ring algorithm is not designed for it. There
> are restrictions mentioned in [1].

I am not talking about SW thread preemption here.
With the example I provided, I think it is clear that the problem can happen
even under 'ideal' conditions: each thread runs non-preempted on a separate core.

> However, we do need to handle the re-ordering case.

To be clear: for me right now this patch is bogus; it has to be either
reworked or abandoned.

> [1] https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#known-issues

> > enqueues/dequeues done by other threads, so the current state of the ring is:
> > r->cons.tail = 105
> > r->cons.head = 105
> > r->prod.head = 110
> > r->prod.tail = 110
> >
> > time 1, thread1:
> > old_head = r->prod.head; // 110
> > *free_entries = (capacity + cons_tail - old_head); // = (uint32_t)(32 + 5 - 110)
> > == (uint32_t)-73 == 4294967223
> >
> > So, the free_entries value is way too big, and the comparison:
> >
> > if (unlikely(n > *free_entries))
> >
> > might produce a wrong result.
> >
> > So I still think we do need some sort of _read_fence_ between these two loads.
> > As I said before, that looks exactly like the old bug, fixed a while ago:
> > http://git.dpdk.org/dpdk/commit/?id=9bc2cbb007c0a3335c5582357ae9f6d37ea0b654
> > but now re-introduced for the C11 case.

> Agree that the re-ordering case should be handled.

Either handled, or simply not allowed.

> I am thinking a check (*free_entries > capacity) and restarting the loop might
> suffice (without the barrier)?

I thought about the same thing, and at first glance it seems workable in
principle. Though I still hesitate to remove the ordering completely here:
with both compiler and CPU reordering possible, it probably introduces the
possibility of a sort-of ABA problem: the old cons.tail is read at a very
early stage, after that the current cons.tail value wraps around 2^32 and is
now less than the old cons.tail, and then we read prod.head.
So:
(invalid_free_ent = capacity + cons.tail_old - prod.head) <= capacity
and
(valid_free_ent = capacity + cons.tail_cur - prod.head) <= capacity
are both true, but invalid > valid, and we overestimate the number of free
entries. In the majority of cases the probability of such a situation is
negligible, but for huge values of 'capacity' it will grow.

Can I ask a somewhat different question: as I remember, this series started
as an attempt to improve the C11 ring enqueue/dequeue implementation.
Yet looking at the non-C11 one, there also exists a read fence between these
two loads:
https://elixir.bootlin.com/dpdk/v23.07/source/lib/ring/rte_ring_generic_pvt.h#L73
Which makes me wonder whether it really is the read-ordering that causes the
slowdown of the C11 version. Did anyone try to compare the generated code for
the two cases, to see what the difference is?

Konstantin
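
To make the arithmetic above concrete, here is a small standalone program
(an editor's sketch, not code from the thread or from DPDK) that reproduces
the modulo-2^32 computation of the stale-tail scenario and shows where the
proposed (*free_entries > capacity) check would catch it:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		/* values from the example: cons.tail read (stale) at time 0,
		 * prod.head read at time 1, after the ring has moved on */
		uint32_t capacity  = 32;
		uint32_t cons_tail = 5;	/* stale; the ring is at 105 by now */
		uint32_t old_head  = 110;

		/* same expression as in __rte_ring_move_prod_head() */
		uint32_t free_entries = capacity + cons_tail - old_head;

		/* prints 4294967223, i.e. (uint32_t)-73 */
		printf("free_entries = %u\n", free_entries);

		/* the proposed check: a consistent snapshot can never yield
		 * more free entries than the ring's capacity, so any larger
		 * value means the two loads were torn and we must retry */
		if (free_entries > capacity)
			printf("torn snapshot detected, restart the loop\n");

		return 0;
	}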
> > > > commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654
> > > > Author: Jia He
> > > > Date:   Fri Nov 10 03:30:42 2017 +0000
> > > >
> > > >     ring: guarantee load/load order in enqueue and dequeue
> > > >
> > > >     We watched a rte panic of mbuf_autotest in our qualcomm arm64 server
> > > >     (Amberwing).
> > > >
> > > >     Root cause:
> > > >     In __rte_ring_move_cons_head()
> > > >     ...
> > > >     do {
> > > >             /* Restore n as it may change every loop */
> > > >             n = max;
> > > >
> > > >             *old_head = r->cons.head;                //1st load
> > > >             const uint32_t prod_tail = r->prod.tail; //2nd load
> > > >
> > > >     In weak memory order architectures (powerpc,arm), the 2nd load might be
> > > >     reordered before the 1st load, which makes *entries bigger than we wanted.
> > > >     This nasty reordering messed enqueue/dequeue up.
> > > >     ....
> > > > ?
> > > >
> > > > > Signed-off-by: Wathsala Vithanage
> > > > > Reviewed-by: Honnappa Nagarahalli
> > > > > Reviewed-by: Ruifeng Wang
> > > > > ---
> > > > >  .mailmap                    |  1 +
> > > > >  lib/ring/rte_ring_c11_pvt.h | 35 ++++++++++++++++++++---------------
> > > > >  2 files changed, 21 insertions(+), 15 deletions(-)
> > > > >
> > > > > diff --git a/.mailmap b/.mailmap
> > > > > index 4018f0fc47..367115d134 100644
> > > > > --- a/.mailmap
> > > > > +++ b/.mailmap
> > > > > @@ -1430,6 +1430,7 @@ Walter Heymans
> > > > >  Wang Sheng-Hui
> > > > >  Wangyu (Eric)
> > > > >  Waterman Cao
> > > > > +Wathsala Vithanage
> > > > >  Weichun Chen
> > > > >  Wei Dai
> > > > >  Weifeng Li
> > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> > > > > index f895950df4..63fe58ce9e 100644
> > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > @@ -16,6 +16,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
> > > > >  	uint32_t new_val, uint32_t single, uint32_t enqueue)
> > > > >  {
> > > > >  	RTE_SET_USED(enqueue);
> > > > > +	/*
> > > > > +	 * Updating of ht->tail cannot happen before elements are added to or
> > > > > +	 * removed from the ring, as it could result in data races between
> > > > > +	 * producer and consumer threads. Therefore we need a release
> > > > > +	 * barrier here.
> > > > > +	 */
> > > > > +	rte_atomic_thread_fence(__ATOMIC_RELEASE);
> > > > >
> > > > >  	/*
> > > > >  	 * If there are other enqueues/dequeues in progress that preceded us,
> > > > > @@ -24,7 +31,7 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
> > > > >  	if (!single)
> > > > >  		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
> > > > >
> > > > > -	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > > > > +	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELAXED);
> > > > >  }
> > > > >
> > > > >  /**
> > > > > @@ -66,14 +73,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> > > > >  		/* Reset n to the initial burst count */
> > > > >  		n = max;
> > > > >
> > > > > -		/* Ensure the head is read before tail */
> > > > > -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > > > -
> > > > > -		/* load-acquire synchronize with store-release of ht->tail
> > > > > -		 * in update_tail.
> > > > > -		 */
> > > > >  		cons_tail = __atomic_load_n(&r->cons.tail,
> > > > > -					__ATOMIC_ACQUIRE);
> > > > > +					__ATOMIC_RELAXED);
> > > > >
> > > > >  		/* The subtraction is done between two unsigned 32bits value
> > > > >  		 * (the result is always modulo 32 bits even if we have
> > > > > @@ -100,6 +101,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> > > > >  				0, __ATOMIC_RELAXED,
> > > > >  				__ATOMIC_RELAXED);
> > > > >  	} while (unlikely(success == 0));
> > > > > +	/*
> > > > > +	 * Ensure that updates to the ring doesn't rise above
> > > > > +	 * load of the new_head in SP and MP cases.
> > > > > +	 */
> > > > > +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > > >  	return n;
> > > > >  }
> > > > >
> > > > > @@ -142,14 +148,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> > > > >  		/* Restore n as it may change every loop */
> > > > >  		n = max;
> > > > >
> > > > > -		/* Ensure the head is read before tail */
> > > > > -		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > > > -
> > > > > -		/* this load-acquire synchronize with store-release of ht->tail
> > > > > -		 * in update_tail.
> > > > > -		 */
> > > > >  		prod_tail = __atomic_load_n(&r->prod.tail,
> > > > > -					__ATOMIC_ACQUIRE);
> > > > > +					__ATOMIC_RELAXED);
> > > > >
> > > > >  		/* The subtraction is done between two unsigned 32bits value
> > > > >  		 * (the result is always modulo 32 bits even if we have
> > > > > @@ -175,6 +175,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> > > > >  				0, __ATOMIC_RELAXED,
> > > > >  				__ATOMIC_RELAXED);
> > > > >  	} while (unlikely(success == 0));
> > > > > +	/*
> > > > > +	 * Ensure that updates to the ring doesn't rise above
> > > > > +	 * load of the new_head in SP and MP cases.
> > > > > +	 */
> > > > > +	rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> > > > >  	return n;
> > > > >  }
> > > > >
> > > > > --
> > > > > 2.25.1
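
Stepping back from the quoted patch: the read fence Konstantin argues for
and the capacity check Honnappa proposes both slot into the same head-move
loop. Below is a rough editor's sketch of that loop with the two mitigations
marked; it uses simplified stand-in names (struct ring, prod_head, cons_tail)
and is not DPDK source:

	#include <stdbool.h>
	#include <stdint.h>

	/* editor's sketch, not DPDK code: simplified stand-in for rte_ring */
	struct ring {
		uint32_t capacity;
		uint32_t prod_head;
		uint32_t cons_tail;
	};

	static uint32_t
	move_prod_head(struct ring *r, uint32_t n)
	{
		uint32_t old_head, cons_tail, free_entries;
		bool success;

		do {
			success = false;

			old_head = __atomic_load_n(&r->prod_head, __ATOMIC_RELAXED);

			/* mitigation 1 (keep the read fence): forbids the
			 * cons_tail load below from being reordered before
			 * the old_head load on weakly ordered CPUs */
			__atomic_thread_fence(__ATOMIC_ACQUIRE);

			cons_tail = __atomic_load_n(&r->cons_tail, __ATOMIC_RELAXED);

			/* modulo-2^32 arithmetic, as in the real ring code */
			free_entries = r->capacity + cons_tail - old_head;

			/* mitigation 2 (capacity check): a torn snapshot shows
			 * up as an impossible value, so retry instead of
			 * trusting it; as noted in the thread, this can still
			 * mis-fire if cons_tail wraps a full 2^32 between the
			 * two loads */
			if (free_entries > r->capacity)
				continue;

			if (n > free_entries)
				return 0;	/* not enough room in the ring */

			/* claim the entries by advancing prod_head */
			success = __atomic_compare_exchange_n(&r->prod_head,
					&old_head, old_head + n, 0 /* strong */,
					__ATOMIC_RELAXED, __ATOMIC_RELAXED);
		} while (!success);

		return n;
	}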