From: "Ananyev, Konstantin"
To: Honnappa Nagarahalli, "Richardson, Bruce", "Medvedkin, Vladimir", "olivier.matz@6wind.com"
Cc: "dev@dpdk.org", "stephen@networkplumber.org", "paulmck@linux.ibm.com", "Gavin Hu (Arm Technology China)", Dharmik Thakkar, "Ruifeng Wang (Arm Technology China)", nd
Thread-Topic: [PATCH v3 1/3] lib/lpm: integrate RCU QSBR
Date: Tue, 15 Oct 2019 11:15:05 +0000
Message-ID: <2601191342CEEE43887BDE71AB97725801A8C68CBC@IRSMSX104.ger.corp.intel.com>
References: <20190906094534.36060-1-ruifeng.wang@arm.com> <20191001182857.43867-1-honnappa.nagarahalli@arm.com> <20191001182857.43867-2-honnappa.nagarahalli@arm.com> <2601191342CEEE43887BDE71AB9772580191971F35@irsmsx105.ger.corp.intel.com>
Subject: Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: integrate RCU QSBR
List-Id: DPDK patches and discussions

> >
> > Hi guys,
> I have tried to consolidate design related questions here. If I have missed anything, please add.
>
> > > From: Ruifeng Wang
> > >
> > > Currently, the tbl8 group is freed even though the readers might be
> > > using the tbl8 group entries. The freed tbl8 group can be reallocated
> > > quickly. This results in incorrect lookup results.
> > >
> > > RCU QSBR process is integrated for safe tbl8 group reclaim.
> > > Refer to the RCU documentation to understand various aspects of
> > > integrating the RCU library into other libraries.
> > >
> > > Signed-off-by: Ruifeng Wang
> > > Reviewed-by: Honnappa Nagarahalli
> > > ---
> > >  lib/librte_lpm/Makefile            |   3 +-
> > >  lib/librte_lpm/meson.build         |   2 +
> > >  lib/librte_lpm/rte_lpm.c           | 102 +++++++++++++++++++++++++----
> > >  lib/librte_lpm/rte_lpm.h           |  21 ++++++
> > >  lib/librte_lpm/rte_lpm_version.map |   6 ++
> > >  5 files changed, 122 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/lib/librte_lpm/Makefile b/lib/librte_lpm/Makefile
> > > index a7946a1c5..ca9e16312 100644
> > > --- a/lib/librte_lpm/Makefile
> > > +++ b/lib/librte_lpm/Makefile
> > > @@ -6,9 +6,10 @@ include $(RTE_SDK)/mk/rte.vars.mk
> > >  # library name
> > >  LIB = librte_lpm.a
> > >
> > > +CFLAGS += -DALLOW_EXPERIMENTAL_API
> > >  CFLAGS += -O3
> > >  CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR)
> > > -LDLIBS += -lrte_eal -lrte_hash
> > > +LDLIBS += -lrte_eal -lrte_hash -lrte_rcu
> > >
> > >  EXPORT_MAP := rte_lpm_version.map
> > >
> > > diff --git a/lib/librte_lpm/meson.build b/lib/librte_lpm/meson.build
> > > index a5176d8ae..19a35107f 100644
> > > --- a/lib/librte_lpm/meson.build
> > > +++ b/lib/librte_lpm/meson.build
> > > @@ -2,9 +2,11 @@
> > >  # Copyright(c) 2017 Intel Corporation
> > >
> > >  version = 2
> > > +allow_experimental_apis = true
> > >  sources = files('rte_lpm.c', 'rte_lpm6.c')
> > >  headers = files('rte_lpm.h', 'rte_lpm6.h')
> > >  # since header files have different names, we can install all vector headers
> > >  # without worrying about which architecture we actually need
> > >  headers += files('rte_lpm_altivec.h', 'rte_lpm_neon.h', 'rte_lpm_sse.h')
> > >  deps += ['hash']
> > > +deps += ['rcu']
> > > diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c
> > > index 3a929a1b1..ca58d4b35 100644
> > > --- a/lib/librte_lpm/rte_lpm.c
> > > +++ b/lib/librte_lpm/rte_lpm.c
> > > @@ -1,5 +1,6 @@
> > >  /* SPDX-License-Identifier: BSD-3-Clause
> > >   * Copyright(c) 2010-2014 Intel Corporation
> > > + * Copyright(c) 2019 Arm Limited
> > >   */
> > >
> > >  #include
> > > @@ -381,6 +382,8 @@ rte_lpm_free_v1604(struct rte_lpm *lpm)
> > >
> > >  	rte_mcfg_tailq_write_unlock();
> > >
> > > +	if (lpm->dq)
> > > +		rte_rcu_qsbr_dq_delete(lpm->dq);
> > >  	rte_free(lpm->tbl8);
> > >  	rte_free(lpm->rules_tbl);
> > >  	rte_free(lpm);
> > > @@ -390,6 +393,59 @@ BIND_DEFAULT_SYMBOL(rte_lpm_free, _v1604, 16.04);
> > >  MAP_STATIC_SYMBOL(void rte_lpm_free(struct rte_lpm *lpm),
> > >  		rte_lpm_free_v1604);
> > >
> > > +struct __rte_lpm_rcu_dq_entry {
> > > +	uint32_t tbl8_group_index;
> > > +	uint32_t pad;
> > > +};
> > > +
> > > +static void
> > > +__lpm_rcu_qsbr_free_resource(void *p, void *data)
> > > +{
> > > +	struct rte_lpm_tbl_entry zero_tbl8_entry = {0};
> > > +	struct __rte_lpm_rcu_dq_entry *e =
> > > +			(struct __rte_lpm_rcu_dq_entry *)data;
> > > +	struct rte_lpm_tbl_entry *tbl8 = (struct rte_lpm_tbl_entry *)p;
> > > +
> > > +	/* Set tbl8 group invalid */
> > > +	__atomic_store(&tbl8[e->tbl8_group_index], &zero_tbl8_entry,
> > > +		__ATOMIC_RELAXED);
> > > +}
> > > +
> > > +/* Associate QSBR variable with an LPM object.
> > > + */
> > > +int
> > > +rte_lpm_rcu_qsbr_add(struct rte_lpm *lpm, struct rte_rcu_qsbr *v)
> > > +{
> > > +	char rcu_dq_name[RTE_RCU_QSBR_DQ_NAMESIZE];
> > > +	struct rte_rcu_qsbr_dq_parameters params;
> > > +
> > > +	if ((lpm == NULL) || (v == NULL)) {
> > > +		rte_errno = EINVAL;
> > > +		return 1;
> > > +	}
> > > +
> > > +	if (lpm->dq) {
> > > +		rte_errno = EEXIST;
> > > +		return 1;
> > > +	}
> > > +
> > > +	/* Init QSBR defer queue.
> > > +	 */
> > > +	snprintf(rcu_dq_name, sizeof(rcu_dq_name), "LPM_RCU_%s", lpm->name);
> > > +	params.name = rcu_dq_name;
> > > +	params.size = lpm->number_tbl8s;
> > > +	params.esize = sizeof(struct __rte_lpm_rcu_dq_entry);
> > > +	params.f = __lpm_rcu_qsbr_free_resource;
> > > +	params.p = lpm->tbl8;
> > > +	params.v = v;
> > > +	lpm->dq = rte_rcu_qsbr_dq_create(&params);
> > > +	if (lpm->dq == NULL) {
> > > +		RTE_LOG(ERR, LPM, "LPM QS defer queue creation failed\n");
> > > +		return 1;
> > > +	}
> >
> > Few thoughts about that function:
> Few things to keep in mind: the goal of the design is to make it easy for applications to adopt lock-free algorithms. The reclamation process in the writer is a major portion of the code one has to write for using lock-free algorithms. The current design is such that the writer does not have to change any code or write additional code other than calling 'rte_lpm_rcu_qsbr_add'.
>
> > It is named rcu_qsbr_add(), but in fact it allocates a defer queue for the given rcu var.
> > So first thought - is it always necessary?
> This is part of the design. If the application does not want to use this integrated logic, then it does not have to call this API. It can use the RCU defer APIs to implement its own logic. But, if I ask the question, does this integrated logic address most of the use cases of the LPM library, I think the answer is yes.
>
> > For some use-cases I suppose the user might be ok to wait for the quiescent state
> > change inside tbl8_free()?
> Yes, that is a possibility (for ex: no frequent route changes). But, I think that is very trivial for the application to implement. Though, the LPM library has to separate the 'delete' and 'free' operations.

Exactly. That's why it is not trivial with the current LPM library.
In fact, to do that himself right now, the user would have to implement and support his own version of the LPM code.
Honestly, I don't understand why you consider it a drawback.
From my perspective only a few things need to be changed:

1. Add 2 parameters to rte_lpm_rcu_qsbr_add():
   - number of elems in defer_queue
   - reclaim() threshold value.
If the user doesn't want to provide any values, that's fine: we can use default ones here (as you do it right now).
2. Make rte_lpm_rcu_qsbr_add() return a pointer to the defer_queue.
Again, if the user doesn't want to call reclaim() himself, he can just ignore the return value.

These 2 changes will provide us with the necessary flexibility that would help to cover more use-cases:
- the user can decide how big the defer queue should be
- the user can decide when/how he wants to do reclaim()

Konstantin

> Similar operations are provided in the rte_hash library. IMO, we should follow a consistent approach.
>
> > Another thing: you do allocate a defer queue, but it is internal, so the user can't call
> > reclaim() manually, which looks strange.
> > Why not return the defer_queue pointer to the user, so he can call reclaim()
> > himself at the appropriate time?
> The intention of the design is to take the complexity away from the user of the LPM library. IMO, the current design will address most use cases of the LPM library. If we expose the 2 parameters (when to trigger reclamation and how much to reclaim) in the 'rte_lpm_rcu_qsbr_add' API, it should provide enough flexibility to the application.
>
> > Third thing - you always allocate the defer queue with size equal to the number of
> > tbl8.
> > Though I understand it could be up to 16M tbl8 groups inside the LPM.
> > Do we really need a defer queue that long?
> No, we do not need it to be this long. It is this long today to avoid returning a no-space error on the defer queue.
>
> > Especially considering that the current rcu_defer_queue will start reclamation
> > when 1/8 of the defer_queue becomes full and wouldn't reclaim more than
> > 1/16 of it.
> > Probably better to let the user decide himself how long a defer_queue he needs
> > for that LPM?
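The two changes proposed above, plus the sizing defaults under discussion (queue as long as number_tbl8s, reclamation triggered at 1/8 of the queue), could be sketched as a small stand-alone model. This is a hypothetical illustration, not the DPDK API: `dq_sketch`, `lpm_rcu_qsbr_add_sketch`, and the field names are all invented; the real `rte_lpm_rcu_qsbr_add()` in the patch takes only `lpm` and `v` and returns an int.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_rcu_qsbr_dq: just the sizing
 * knobs being discussed. */
struct dq_sketch {
	uint32_t size;            /* number of elements in the defer queue */
	uint32_t reclaim_thresh;  /* start reclaiming once this many are queued */
};

/*
 * Sketch of the proposed extended add() call:
 *  - the caller may pass an explicit defer-queue size and reclaim
 *    threshold; 0 means "use the defaults", i.e. today's behaviour
 *    (queue sized to number_tbl8s, reclamation at 1/8 of the queue);
 *  - the defer-queue handle is returned so the application can drive
 *    reclaim() itself at a convenient time.
 */
static struct dq_sketch *
lpm_rcu_qsbr_add_sketch(struct dq_sketch *dq, uint32_t dq_size,
			uint32_t reclaim_thresh, uint32_t number_tbl8s)
{
	if (dq == NULL)
		return NULL;
	dq->size = dq_size ? dq_size : number_tbl8s;
	dq->reclaim_thresh = reclaim_thresh ? reclaim_thresh : dq->size / 8;
	return dq;
}
```

An application that never calls reclaim() itself would simply pass zeros and ignore the returned handle, which keeps the "no extra code for the writer" property of the current design.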
> It makes sense to expose it to the user if the writer-writer concurrency is lock-free (no memory allocation allowed to expand the defer queue size when the queue is full). However, LPM is not lock-free on the writer side. If we think the writer could be lock-free in the future, it has to be exposed to the user.
>
> >
> > Konstantin
> Pulling questions/comments from other threads:
> Can we leave reclamation to some other house-keeping thread to do (sort of garbage collector)? Or is such a mode not supported/planned?
>
> [Honnappa] If the reclamation cost is small, the current method provides advantages over having a separate thread to do reclamation. I did not plan to provide such an option. But maybe it makes sense to keep the options open (especially from an ABI perspective). Maybe we should add a flags field which will allow us to implement different methods in the future?
>
> > >
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >  /*
> > >   * Adds a rule to the rule table.
> > >   *
> > > @@ -679,14 +735,15 @@ tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8)
> > >  }
> > >
> > >  static int32_t
> > > -tbl8_alloc_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t number_tbl8s)
> > > +__tbl8_alloc_v1604(struct rte_lpm *lpm)
> > >  {
> > >  	uint32_t group_idx; /* tbl8 group index. */
> > >  	struct rte_lpm_tbl_entry *tbl8_entry;
> > >
> > >  	/* Scan through tbl8 to find a free (i.e. INVALID) tbl8 group. */
> > > -	for (group_idx = 0; group_idx < number_tbl8s; group_idx++) {
> > > -		tbl8_entry = &tbl8[group_idx * RTE_LPM_TBL8_GROUP_NUM_ENTRIES];
> > > +	for (group_idx = 0; group_idx < lpm->number_tbl8s; group_idx++) {
> > > +		tbl8_entry = &lpm->tbl8[group_idx *
> > > +				RTE_LPM_TBL8_GROUP_NUM_ENTRIES];
> > >  		/* If a free tbl8 group is found clean it and set as VALID.
> > >  		 */
> > >  		if (!tbl8_entry->valid_group) {
> > >  			struct rte_lpm_tbl_entry new_tbl8_entry = {
> > > @@ -712,6 +769,21 @@ tbl8_alloc_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t number_tbl8s)
> > >  	return -ENOSPC;
> > >  }
> > >
> > > +static int32_t
> > > +tbl8_alloc_v1604(struct rte_lpm *lpm)
> > > +{
> > > +	int32_t group_idx; /* tbl8 group index. */
> > > +
> > > +	group_idx = __tbl8_alloc_v1604(lpm);
> > > +	if ((group_idx < 0) && (lpm->dq != NULL)) {
> > > +		/* If there are no tbl8 groups try to reclaim some. */
> > > +		if (rte_rcu_qsbr_dq_reclaim(lpm->dq) == 0)
> > > +			group_idx = __tbl8_alloc_v1604(lpm);
> > > +	}
> > > +
> > > +	return group_idx;
> > > +}
> > > +
> > >  static void
> > >  tbl8_free_v20(struct rte_lpm_tbl_entry_v20 *tbl8, uint32_t tbl8_group_start)
> > >  {
> > > @@ -728,13 +800,21 @@ tbl8_free_v20(struct rte_lpm_tbl_entry_v20 *tbl8, uint32_t tbl8_group_start)
> > >  }
> > >
> > >  static void
> > > -tbl8_free_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t tbl8_group_start)
> > > +tbl8_free_v1604(struct rte_lpm *lpm, uint32_t tbl8_group_start)
> > >  {
> > > -	/* Set tbl8 group invalid*/
> > >  	struct rte_lpm_tbl_entry zero_tbl8_entry = {0};
> > > +	struct __rte_lpm_rcu_dq_entry e;
> > >
> > > -	__atomic_store(&tbl8[tbl8_group_start], &zero_tbl8_entry,
> > > -			__ATOMIC_RELAXED);
> > > +	if (lpm->dq != NULL) {
> > > +		e.tbl8_group_index = tbl8_group_start;
> > > +		e.pad = 0;
> > > +		/* Push into QSBR defer queue. */
> > > +		rte_rcu_qsbr_dq_enqueue(lpm->dq, (void *)&e);
> > > +	} else {
> > > +		/* Set tbl8 group invalid*/
> > > +		__atomic_store(&lpm->tbl8[tbl8_group_start], &zero_tbl8_entry,
> > > +				__ATOMIC_RELAXED);
> > > +	}
> > >  }
> > >
> > >  static __rte_noinline int32_t
> > > @@ -1037,7 +1117,7 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
> > >
> > >  	if (!lpm->tbl24[tbl24_index].valid) {
> > >  		/* Search for a free tbl8 group.
> > >  		 */
> > > -		tbl8_group_index = tbl8_alloc_v1604(lpm->tbl8, lpm->number_tbl8s);
> > > +		tbl8_group_index = tbl8_alloc_v1604(lpm);
> > >
> > >  		/* Check tbl8 allocation was successful. */
> > >  		if (tbl8_group_index < 0) {
> > > @@ -1083,7 +1163,7 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
> > >  	}
> > >  	/* If valid entry but not extended calculate the index into Table8. */
> > >  	else if (lpm->tbl24[tbl24_index].valid_group == 0) {
> > >  		/* Search for free tbl8 group. */
> > > -		tbl8_group_index = tbl8_alloc_v1604(lpm->tbl8, lpm->number_tbl8s);
> > > +		tbl8_group_index = tbl8_alloc_v1604(lpm);
> > >
> > >  		if (tbl8_group_index < 0) {
> > >  			return tbl8_group_index;
> > > @@ -1818,7 +1898,7 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
> > >  		 */
> > >  		lpm->tbl24[tbl24_index].valid = 0;
> > >  		__atomic_thread_fence(__ATOMIC_RELEASE);
> > > -		tbl8_free_v1604(lpm->tbl8, tbl8_group_start);
> > > +		tbl8_free_v1604(lpm, tbl8_group_start);
> > >  	} else if (tbl8_recycle_index > -1) {
> > >  		/* Update tbl24 entry.
> > >  		 */
> > >  		struct rte_lpm_tbl_entry new_tbl24_entry = {
> > > @@ -1834,7 +1914,7 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
> > >  		__atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry,
> > >  				__ATOMIC_RELAXED);
> > >  		__atomic_thread_fence(__ATOMIC_RELEASE);
> > > -		tbl8_free_v1604(lpm->tbl8, tbl8_group_start);
> > > +		tbl8_free_v1604(lpm, tbl8_group_start);
> > >  	}
> > >  #undef group_idx
> > >  	return 0;
> > > diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h
> > > index 906ec4483..49c12a68d 100644
> > > --- a/lib/librte_lpm/rte_lpm.h
> > > +++ b/lib/librte_lpm/rte_lpm.h
> > > @@ -1,5 +1,6 @@
> > >  /* SPDX-License-Identifier: BSD-3-Clause
> > >   * Copyright(c) 2010-2014 Intel Corporation
> > > + * Copyright(c) 2019 Arm Limited
> > >   */
> > >
> > >  #ifndef _RTE_LPM_H_
> > > @@ -21,6 +22,7 @@
> > >  #include
> > >  #include
> > >  #include
> > > +#include
> > >
> > >  #ifdef __cplusplus
> > >  extern "C" {
> > > @@ -186,6 +188,7 @@ struct rte_lpm {
> > >  			__rte_cache_aligned; /**< LPM tbl24 table. */
> > >  	struct rte_lpm_tbl_entry *tbl8; /**< LPM tbl8 table. */
> > >  	struct rte_lpm_rule *rules_tbl; /**< LPM rules. */
> > > +	struct rte_rcu_qsbr_dq *dq;	/**< RCU QSBR defer queue.*/
> > >  };
> > >
> > >  /**
> > > @@ -248,6 +251,24 @@ rte_lpm_free_v20(struct rte_lpm_v20 *lpm);
> > >  void
> > >  rte_lpm_free_v1604(struct rte_lpm *lpm);
> > >
> > > +/**
> > > + * Associate RCU QSBR variable with an LPM object.
> > > + *
> > > + * @param lpm
> > > + *   the lpm object to add RCU QSBR
> > > + * @param v
> > > + *   RCU QSBR variable
> > > + * @return
> > > + *   On success - 0
> > > + *   On error - 1 with error code set in rte_errno.
> > > + *   Possible rte_errno codes are:
> > > + *   - EINVAL - invalid pointer
> > > + *   - EEXIST - already added QSBR
> > > + *   - ENOMEM - memory allocation failure
> > > + */
> > > +__rte_experimental
> > > +int rte_lpm_rcu_qsbr_add(struct rte_lpm *lpm, struct rte_rcu_qsbr *v);
> > > +
> > >  /**
> > >   * Add a rule to the LPM table.
> > >   *
> > > diff --git a/lib/librte_lpm/rte_lpm_version.map b/lib/librte_lpm/rte_lpm_version.map
> > > index 90beac853..b353aabd2 100644
> > > --- a/lib/librte_lpm/rte_lpm_version.map
> > > +++ b/lib/librte_lpm/rte_lpm_version.map
> > > @@ -44,3 +44,9 @@ DPDK_17.05 {
> > >  	rte_lpm6_lookup_bulk_func;
> > >
> > >  } DPDK_16.04;
> > > +
> > > +EXPERIMENTAL {
> > > +	global:
> > > +
> > > +	rte_lpm_rcu_qsbr_add;
> > > +};
> > > --
> > > 2.17.1
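The allocation path in the patch (tbl8_alloc_v1604(): scan for a free group, and on ENOSPC reclaim from the defer queue and retry once) can be modeled with a minimal self-contained sketch. The stub pool and queue below are invented for illustration; they stand in for lpm->tbl8 validity flags and the RCU defer queue, and none of these names are the DPDK API.

```c
#include <stdint.h>

#define GROUPS 4

/* Stub state: 1 in valid_group[] means the tbl8 group is in use;
 * 1 in deferred[] means it was freed but is awaiting reclamation. */
static int valid_group[GROUPS];
static int deferred[GROUPS];

/* Scan for a free group, as __tbl8_alloc_v1604() does. */
static int32_t alloc_scan(void)
{
	for (int32_t i = 0; i < GROUPS; i++) {
		if (!valid_group[i]) {
			valid_group[i] = 1;
			return i;
		}
	}
	return -1; /* -ENOSPC in the real code */
}

/* Stub reclaim: move deferred groups back to the free pool.
 * Returns 0 on success, like rte_rcu_qsbr_dq_reclaim() in this patch. */
static int reclaim(void)
{
	int n = 0;
	for (int i = 0; i < GROUPS; i++) {
		if (deferred[i]) {
			deferred[i] = 0;
			valid_group[i] = 0;
			n++;
		}
	}
	return n ? 0 : 1;
}

/* The fallback pattern from tbl8_alloc_v1604(): retry once after reclaim. */
static int32_t tbl8_alloc_sketch(void)
{
	int32_t idx = alloc_scan();
	if (idx < 0 && reclaim() == 0)
		idx = alloc_scan();
	return idx;
}
```

This shows why the add path no longer fails immediately on a full pool: a route add that finds no free tbl8 group first drains whatever the defer queue can safely reclaim, and only then reports ENOSPC.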