From: Konstantin Ananyev
To: Huichao Cai, honnappa.nagarahalli@arm.com, konstantin.v.ananyev@yandex.ru, thomas@monjalon.net
CC: dev@dpdk.org
Subject: RE: [PATCH v2] ring: add the second version of the RTS interface
Date: Wed, 8 Jan 2025 16:49:30 +0000
Message-ID: <42af7788ee224ab2a1824f827c6a9328@huawei.com>
References: <20250105095746.2727-1-chcchc88@163.com> <20250105151345.3314-1-chcchc88@163.com>
In-Reply-To: <20250105151345.3314-1-chcchc88@163.com>
Hi,

> The timing of the update of the RTS enqueues/dequeues tail is
> limited to the last enqueues/dequeues, which reduces concurrency,
> so the RTS interface of the V2 version is added, which makes the tail
> of the enqueues/dequeues not limited to the last enqueues/dequeues
> and thus enables timely updates to increase concurrency.

That description is way too cryptic to me and really just creates more
confusion instead of explaining things: I had to go and read through the
code to understand what you are up to.
In fact, I don't think the approach you used will work properly, due to
race conditions (see below for more details).
But for future reference, when you are introducing a new sync mechanism
for the ring, please do:
1. Explain clearly what particular problem(s) with the existing
   mechanism(s) you are trying to address.
2. Clearly explain the new sync mechanism you are going to introduce and
   why/when you believe it would behave better than the existing ones.
3. In case of performance improvement claims - provide some reproducible
   numbers, either with ring_stress_test, or with some dpdk packet
   processing sample app, or both.

> Add some corresponding test cases.
>
> Signed-off-by: Huichao Cai
> ---
...
> diff --git a/lib/ring/rte_ring.c b/lib/ring/rte_ring.c
> index aebb6d6728..ada1ae88fa 100644
> --- a/lib/ring/rte_ring.c
> +++ b/lib/ring/rte_ring.c
> @@ -43,7 +43,8 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
>  /* mask of all valid flag values to ring_create() */
>  #define RING_F_MASK (RING_F_SP_ENQ | RING_F_SC_DEQ | RING_F_EXACT_SZ | \
>  		     RING_F_MP_RTS_ENQ | RING_F_MC_RTS_DEQ | \
> -		     RING_F_MP_HTS_ENQ | RING_F_MC_HTS_DEQ)
> +		     RING_F_MP_HTS_ENQ | RING_F_MC_HTS_DEQ | \
> +		     RING_F_MP_RTS_V2_ENQ | RING_F_MC_RTS_V2_DEQ)
>
>  /* true if x is a power of 2 */
>  #define POWEROF2(x) ((((x)-1) & (x)) == 0)
> @@ -106,6 +107,7 @@ reset_headtail(void *p)
>  		ht->tail = 0;
>  		break;
>  	case RTE_RING_SYNC_MT_RTS:
> +	case RTE_RING_SYNC_MT_RTS_V2:
>  		ht_rts->head.raw = 0;
>  		ht_rts->tail.raw = 0;
>  		break;
> @@ -135,9 +137,11 @@ get_sync_type(uint32_t flags, enum rte_ring_sync_type *prod_st,
>  	enum rte_ring_sync_type *cons_st)
>  {
>  	static const uint32_t prod_st_flags =
> -		(RING_F_SP_ENQ | RING_F_MP_RTS_ENQ | RING_F_MP_HTS_ENQ);
> +		(RING_F_SP_ENQ | RING_F_MP_RTS_ENQ | RING_F_MP_HTS_ENQ |
> +		RING_F_MP_RTS_V2_ENQ);
>  	static const uint32_t cons_st_flags =
> -		(RING_F_SC_DEQ | RING_F_MC_RTS_DEQ | RING_F_MC_HTS_DEQ);
> +		(RING_F_SC_DEQ | RING_F_MC_RTS_DEQ | RING_F_MC_HTS_DEQ |
> +		RING_F_MC_RTS_V2_DEQ);
>
>  	switch (flags & prod_st_flags) {
>  	case 0:
> @@ -152,6 +156,9 @@ get_sync_type(uint32_t flags, enum rte_ring_sync_type *prod_st,
>  	case RING_F_MP_HTS_ENQ:
>  		*prod_st = RTE_RING_SYNC_MT_HTS;
>  		break;
> +	case RING_F_MP_RTS_V2_ENQ:
> +		*prod_st = RTE_RING_SYNC_MT_RTS_V2;
> +		break;
>  	default:
>  		return -EINVAL;
>  	}
> @@ -169,6 +176,9 @@ get_sync_type(uint32_t flags, enum rte_ring_sync_type *prod_st,
>  	case RING_F_MC_HTS_DEQ:
>  		*cons_st = RTE_RING_SYNC_MT_HTS;
>  		break;
> +	case RING_F_MC_RTS_V2_DEQ:
> +		*cons_st = RTE_RING_SYNC_MT_RTS_V2;
> +		break;
>  	default:
>  		return -EINVAL;
>  	}
> @@ -239,6 +249,28 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned int count,
>  	if (flags & RING_F_MC_RTS_DEQ)
>  		rte_ring_set_cons_htd_max(r, r->capacity / HTD_MAX_DEF);
>
> +	/* set default values for head-tail distance and allocate memory to cache */
> +	if (flags & RING_F_MP_RTS_V2_ENQ) {
> +		rte_ring_set_prod_htd_max(r, r->capacity / HTD_MAX_DEF);
> +		r->rts_prod.rts_cache = (struct rte_ring_rts_cache *)rte_zmalloc(
> +			"RTS_PROD_CACHE", sizeof(struct rte_ring_rts_cache) * r->size, 0);

That doesn't look right at all - rte_ring_init() is not supposed to allocate
extra memory.
It is the caller's responsibility to provide a buffer large enough to hold
the whole ring (including actual data and meta-data).
Ideally, if your sync mechanism needs extra space, it should be reported by
rte_ring_get_memsize_elem().
Though right now it takes only elem size and count...
One approach might be to introduce a new function for that; another, to
introduce a new high-level struct instead of rte_ring.
Though I suppose the first thing that needs to be done is to fix the race
conditions and prove that such an addition is really worthwhile from a
performance perspective.

> +		if (r->rts_prod.rts_cache == NULL) {
> +			RING_LOG(ERR, "Cannot reserve memory for rts prod cache");
> +			return -ENOMEM;
> +		}
> +	}
> +	if (flags & RING_F_MC_RTS_V2_DEQ) {
> +		rte_ring_set_cons_htd_max(r, r->capacity / HTD_MAX_DEF);
> +		r->rts_cons.rts_cache = (struct rte_ring_rts_cache *)rte_zmalloc(
> +			"RTS_CONS_CACHE", sizeof(struct rte_ring_rts_cache) * r->size, 0);
> +		if (r->rts_cons.rts_cache == NULL) {
> +			if (flags & RING_F_MP_RTS_V2_ENQ)
> +				rte_free(r->rts_prod.rts_cache);
> +			RING_LOG(ERR, "Cannot reserve memory for rts cons cache");
> +			return -ENOMEM;
> +		}
> +	}
> +
> 	return 0;
> }

....
> diff --git a/lib/ring/rte_ring_rts_elem_pvt.h b/lib/ring/rte_ring_rts_elem_pvt.h
> index 122650346b..4ce22a93ed 100644
> --- a/lib/ring/rte_ring_rts_elem_pvt.h
> +++ b/lib/ring/rte_ring_rts_elem_pvt.h
> @@ -46,6 +46,92 @@ __rte_ring_rts_update_tail(struct rte_ring_rts_headtail *ht)
>  		rte_memory_order_release, rte_memory_order_acquire) == 0);
>  }
>
> +/**
> + * @file rte_ring_rts_elem_pvt.h
> + * It is not recommended to include this file directly,
> + * include <rte_ring.h> instead.
> + * Contains internal helper functions for Relaxed Tail Sync (RTS) ring mode.
> + * For more information please refer to <rte_ring.h>.
> + */
> +
> +/**
> + * @internal This function updates tail values.
> + */
> +static __rte_always_inline void
> +__rte_ring_rts_v2_update_tail(struct rte_ring_rts_headtail *ht,
> +	uint32_t old_tail, uint32_t num, uint32_t mask)
> +{
> +	union __rte_ring_rts_poscnt ot, nt;
> +
> +	ot.val.cnt = nt.val.cnt = 0;
> +	ot.val.pos = old_tail;
> +	nt.val.pos = old_tail + num;
> +
> +	/*
> +	 * If the tail is equal to the current enqueues/dequeues, update
> +	 * the tail with new value and then continue to try to update the
> +	 * tail until the num of the cache is 0, otherwise write the num of
> +	 * the current enqueues/dequeues to the cache.
> +	 */
> +
> +	if (rte_atomic_compare_exchange_strong_explicit(&ht->tail.raw,
> +			(uint64_t *)(uintptr_t)&ot.raw, nt.raw,
> +			rte_memory_order_release, rte_memory_order_acquire) == 0) {
> +		ot.val.pos = old_tail;
> +
> +		/*
> +		 * Write the num of the current enqueues/dequeues to the
> +		 * corresponding cache.
> +		 */
> +		rte_atomic_store_explicit(&ht->rts_cache[ot.val.pos & mask].num,
> +			num, rte_memory_order_release);
> +
> +		/*
> +		 * There may be competition with another enqueues/dequeues
> +		 * for the update tail. The winner continues to try to update
> +		 * the tail, and the loser exits.
> +		 */
> +		if (rte_atomic_compare_exchange_strong_explicit(&ht->tail.raw,
> +				(uint64_t *)(uintptr_t)&ot.raw, nt.raw,
> +				rte_memory_order_release, rte_memory_order_acquire) == 0)
> +			return;
> +
> +		/*
> +		 * Set the corresponding cache to 0 for next use.
> +		 */

I think there is a race condition between the CAS above and the store below:
After you have updated the tail, other threads are free to re-use these ring
elements. So if some thread gets stalled/preempted here while other threads
continue to enqueue/dequeue on the ring, this rts_cache[] element can
already have been re-used by another thread by the time the given thread
wakes up and overwrites it with zero.
In fact, such a race condition can easily be reproduced on my box with:
echo ring_stress_autotest | dpdk-test --lcores=(2-16)@(3-5) -n 8 --no-pci --no-huge

All workers end up in an infinite loop:
(gdb) bt
#0  _mm_pause () at /usr/lib64/gcc/x86_64-suse-linux/12/include/xmmintrin.h:1335
#1  rte_pause () at ../lib/eal/x86/include/rte_pause.h:18
#2  0x00000000010d008b in __rte_ring_rts_head_wait (h=0x7ff90c3ffca0,
    ht=0x103ed2d40) at ../lib/ring/rte_ring_rts_elem_pvt.h:148
#3  __rte_ring_rts_move_prod_head (free_entries=0x7ff90c3ffcc8,
    old_head=0x7ff90c3ffcc4, behavior=RTE_RING_QUEUE_FIXED, num=33,
    r=0x103ed2cc0) at ../lib/ring/rte_ring_rts_elem_pvt.h:177
#4  __rte_ring_do_rts_v2_enqueue_elem (free_space=0x0,
    behavior=RTE_RING_QUEUE_FIXED, n=33, esize=8,
    obj_table=0x7ff90c3ffec0, r=0x103ed2cc0)
    at ../lib/ring/rte_ring_rts_elem_pvt.h:336
#5  rte_ring_mp_rts_v2_enqueue_bulk_elem (free_space=0x0, n=33, esize=8,
    obj_table=0x7ff90c3ffec0, r=0x103ed2cc0) at ../lib/ring/rte_ring_rts.h:110
#6  rte_ring_mp_rts_v2_enqueue_bulk (free_space=0x0, n=33,
    obj_table=0x7ff90c3ffec0, r=0x103ed2cc0) at ../lib/ring/rte_ring_rts.h:345
#7  _st_ring_enqueue_bulk (r=0x103ed2cc0, obj=0x7ff90c3ffec0, n=33, free=0x0)
    at ../app/test/test_ring_rts_v2_stress.c:18
....
(gdb) print/x r->rts_prod.rts_cache[r->rts_prod.tail.val.pos & r->mask]
$11 = {num = 0x0}
(gdb) print r->rts_prod
$13 = {tail = {raw = 127384228873633792, val = {cnt = 0, pos = 29658952}},
  sync_type = RTE_RING_SYNC_MT_RTS_V2, htd_max = 2047, head = {
    raw = 127393072212139551, val = {cnt = 843295, pos = 29661011}},
  rts_cache = 0x103ec2c40}
(gdb) print 29661011-29658952
$14 = 2059

All in all - I don't think this approach is going to work.
You need some extra stuff to synchronize between cache[] and tail.
If interested, in SORING we solved a similar thing by updating state[]
before updating the tail, plus making sure that only one thread at a time
will update the tail value:
https://patchwork.dpdk.org/project/dpdk/patch/20241206183600.34758-6-konstantin.ananyev@huawei.com/

Another minor thing - why do you re-use the RTS head/tail struct?
From what I can read, you don't use the .cnt part at all.

> +		rte_atomic_store_explicit(&ht->rts_cache[ot.val.pos & mask].num,
> +			0, rte_memory_order_release);
> +	}
> +
> +	/*
> +	 * Try to update the tail until the num of the corresponding cache is 0.
> +	 * Getting here means that the current enqueues/dequeues is trying to
> +	 * update the tail of another enqueues/dequeues.
> +	 */
> +	while (1) {
> +		num = rte_atomic_load_explicit(&ht->rts_cache[nt.val.pos & mask].num,
> +			rte_memory_order_acquire);
> +		if (num == 0)
> +			break;
> +
> +		ot.val.pos = nt.val.pos;
> +		nt.val.pos += num;
> +
> +		/*
> +		 * There may be competition with another enqueues/dequeues
> +		 * for the update tail. The winner continues to try to update
> +		 * the tail, and the loser exits.
> +		 */
> +		if (rte_atomic_compare_exchange_strong_explicit(&ht->tail.raw,
> +				(uint64_t *)(uintptr_t)&ot.raw, nt.raw,
> +				rte_memory_order_release, rte_memory_order_acquire) == 0)
> +			return;
> +
> +		rte_atomic_store_explicit(&ht->rts_cache[ot.val.pos & mask].num,
> +			0, rte_memory_order_release);
> +	};
> +}
> +