From: "Ananyev, Konstantin"
To: Honnappa Nagarahalli, "dev@dpdk.org"
CC: nd, Dharmik Thakkar, Malvika Gupta, "Gavin Hu (Arm Technology China)", nd
Thread-Topic: [dpdk-dev] [RFC 2/3] tqs: add thread quiescent state library
Date: Wed, 28 Nov 2018 15:25:21 +0000
Message-ID: <2601191342CEEE43887BDE71AB977258010CEBD5D4@IRSMSX106.ger.corp.intel.com>
References: <20181122033055.3431-1-honnappa.nagarahalli@arm.com> <20181122033055.3431-3-honnappa.nagarahalli@arm.com> <2601191342CEEE43887BDE71AB977258010CEBBBA9@IRSMSX106.ger.corp.intel.com>
Subject: Re: [dpdk-dev] [RFC 2/3] tqs: add thread quiescent state library

> >
> > Hi Honnappa,
> Thank you for reviewing the patch, appreciate your comments.
>
> > > +
> > > +/* Allocate a new TQS variable with the name *name* in memory. */
> > > +struct rte_tqs * __rte_experimental rte_tqs_alloc(const char *name,
> > > +	int socket_id, uint64_t lcore_mask)
> > > +{
> > > +	char tqs_name[RTE_TQS_NAMESIZE];
> > > +	struct rte_tailq_entry *te, *tmp_te;
> > > +	struct rte_tqs_list *tqs_list;
> > > +	struct rte_tqs *v, *tmp_v;
> > > +	int ret;
> > > +
> > > +	if (name == NULL) {
> > > +		RTE_LOG(ERR, TQS, "Invalid input parameters\n");
> > > +		rte_errno = -EINVAL;
> > > +		return NULL;
> > > +	}
> > > +
> > > +	te = rte_zmalloc("TQS_TAILQ_ENTRY", sizeof(*te), 0);
> > > +	if (te == NULL) {
> > > +		RTE_LOG(ERR, TQS, "Cannot reserve memory for tailq\n");
> > > +		rte_errno = -ENOMEM;
> > > +		return NULL;
> > > +	}
> > > +
> > > +	snprintf(tqs_name, sizeof(tqs_name), "%s", name);
> > > +	v = rte_zmalloc_socket(tqs_name, sizeof(struct rte_tqs),
> > > +		RTE_CACHE_LINE_SIZE, socket_id);
> > > +	if (v == NULL) {
> > > +		RTE_LOG(ERR, TQS, "Cannot reserve memory for TQS variable\n");
> > > +		rte_errno = -ENOMEM;
> > > +		goto alloc_error;
> > > +	}
> > > +
> > > +	ret = snprintf(v->name, sizeof(v->name), "%s", name);
> > > +	if (ret < 0 || ret >= (int)sizeof(v->name)) {
> > > +		rte_errno = -ENAMETOOLONG;
> > > +		goto alloc_error;
> > > +	}
> > > +
> > > +	te->data = (void *) v;
> > > +	v->lcore_mask = lcore_mask;
> > > +
> > > +	rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
> > > +
> > > +	tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
> > > +
> > > +	/* Search if a TQS variable with the same name exists already */
> > > +	TAILQ_FOREACH(tmp_te, tqs_list, next) {
> > > +		tmp_v = (struct rte_tqs *) tmp_te->data;
> > > +		if (strncmp(name, tmp_v->name, RTE_TQS_NAMESIZE) == 0)
> > > +			break;
> > > +	}
> > > +
> > > +	if (tmp_te != NULL) {
> > > +		rte_errno = -EEXIST;
> > > +		goto tqs_exist;
> > > +	}
> > > +
> > > +	TAILQ_INSERT_TAIL(tqs_list, te, next);
> > > +
> > > +	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
> > > +
> > > +	return v;
> > > +
> > > +tqs_exist:
> > > +	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
> > > +
> > > +alloc_error:
> > > +	rte_free(te);
> > > +	rte_free(v);
> > > +	return NULL;
> > > +}
> >
> > That seems quite a heavy-weight function just to allocate a sync variable.
> > As the size of struct rte_tqs is constant and known to the user, it might be better to just
> > provide rte_tqs_init(struct rte_tqs *tqs, ...) and let the user allocate/free the memory
> > for it himself.
> >
> I believe, when you say heavy-weight, you are referring to adding the tqs variable to the TAILQ and allocating the memory for it.

Yes.

> Agree. I also do not expect that there are a whole lot of tqs variables used in an application. Even in rte_tqs_free, there is similar overhead.
>
> The extra part is due to the way the TQS variable will get identified by data plane threads. I am thinking that a data plane thread will use the
> rte_tqs_lookup API to identify a TQS variable. However, it is possible to share this with data plane threads via a simple shared structure as
> well.
>
> Along with not allocating the memory, are you suggesting that we could skip maintaining a list of TQS variables in the TAILQ? This will
> remove the rte_tqs_lookup, rte_tqs_free and rte_tqs_list_dump APIs. I am fine with this approach.
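To make the comparison concrete: an init-style variant that skips the tailq and lets the caller own the memory might look roughly like the sketch below. The struct layout, field names and the TQS_MAX_THREADS bound are assumptions for illustration, not the actual rte_tqs definition.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TQS_MAX_THREADS 64

/* Hypothetical user-allocated TQS variable: no name, no tailq entry. */
struct tqs {
	uint64_t token;                  /* grace-period counter */
	uint64_t thread_mask;            /* registered reader threads */
	uint64_t cnt[TQS_MAX_THREADS];   /* per-reader quiescent counters */
};

/* Initialize in place, like rte_spinlock_init(): the caller owns the
 * memory, so the variable can be static or embedded in another struct,
 * with no rte_zmalloc call and no lookup-by-name machinery. */
static inline void
tqs_init(struct tqs *v)
{
	memset(v, 0, sizeof(*v));
}
```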
Yes, that's what I suggest.
My thought was - it is just another data structure used for synchronization (like spinlock, rwlock, etc.).
So it should be possible to allocate it statically, and we probably don't need the ability to look such a variable up by name via the tailq.

>
> > > +
> > > +/* Add a reader thread, running on an lcore, to the list of threads
> > > + * reporting their quiescent state on a TQS variable.
> > > + */
> > > +int __rte_experimental
> > > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id)
> > > +{
> > > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >= RTE_TQS_MAX_LCORE),
> > > +		-EINVAL);
> >
> > It is not very good practice to make a function return different values and behave
> > in a different way in debug/non-debug mode.
> > I'd say that for slow-path (functions in .c) it is always good to check input
> > parameters.
> > For fast-path (functions in .h) we sometimes skip such checking, but debug
> > mode can probably use RTE_ASSERT() or so.
> Makes sense, I will change this in the next version.
>
> >
> > >	lcore_id >= RTE_TQS_MAX_LCORE
> >
> > Is this limitation really necessary?
> I added this limitation because currently a DPDK application cannot take a mask more than 64 bits wide. Otherwise, this should be as big as
> RTE_MAX_LCORE.
> I see that in the case of the '-lcores' option, the number of lcores can be more than the number of PEs. In this case, we still need a MAX limit
> (but it can be bigger than 64).
>
> > First, it means that only lcores can use that API (at least the data-path part); second,
> > even today many machines have more than 64 cores.
> > I think you can easily avoid such a limitation if, instead of requiring lcore_id as
> > an input parameter, you just make it return the index of the next available entry in w[].
> > Then tqs_update() can take that index as an input parameter.
> I had thought about a similar approach based on IDs.
> I was concerned that an ID would be one more thing to manage for the application. But I
> see the limitations of the current approach now. I will change it to allocation based. This will support even non-EAL pthreads as well.

Yes, with such an approach non-lcore threads will be able to use it also.

> >
> > > +
> > > +	/* Worker thread has to count the quiescent states
> > > +	 * only from the current value of token.
> > > +	 */
> > > +	v->w[lcore_id].cnt = v->token;
> >
> > Wonder what would happen if a new reader calls register() after the writer calls
> > start()?
> > Looks like a race condition.
> > Or is such a pattern not supported?
> The start should be called only after the reference to the entry in the data structure is 'deleted'. Hence the new reader will not get the
> reference to the deleted entry and does not have to increment its counter. When rte_tqs_check is called, it will see that the counter is
> already up to date. (I am missing a load-acquire on the token; I will correct that in the next version.)

Yes, with _acquire_ in place it seems to be good here.

>
> > > +
> > > +	/* Release the store to initial TQS count so that workers
> > > +	 * can use it immediately after this function returns.
> > > +	 */
> > > +	__atomic_fetch_or(&v->lcore_mask, (1UL << lcore_id),
> > > +		__ATOMIC_RELEASE);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Trigger the worker threads to report the quiescent state
> > > + * status.
> > > + *
> > > + * This is implemented as a lock-free function. It is multi-thread
> > > + * safe and can be called from the worker threads as well.
> > > + *
> > > + * @param v
> > > + *   TQS variable
> > > + * @param n
> > > + *   Expected number of times the quiescent state is entered
> > > + * @param t
> > > + *   - If successful, this is the token for this call of the API.
> > > + *     This should be passed to the rte_tqs_check API.
> > > + * @return
> > > + *   - -EINVAL if the parameters are invalid (debug mode compilation only).
> > > + *   - 0 otherwise, and always (non-debug mode compilation).
> > > + */
> > > +static __rte_always_inline int __rte_experimental
> > > +rte_tqs_start(struct rte_tqs *v, unsigned int n, uint32_t *t)
> > > +{
> > > +	TQS_RETURN_IF_TRUE((v == NULL || t == NULL), -EINVAL);
> > > +
> > > +	/* This store release will ensure that changes to any data
> > > +	 * structure are visible to the workers before the token
> > > +	 * update is visible.
> > > +	 */
> > > +	*t = __atomic_add_fetch(&v->token, n, __ATOMIC_RELEASE);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Update quiescent state for the worker thread on an lcore.
> > > + *
> > > + * This is implemented as a lock-free function. It is multi-thread safe.
> > > + * All the worker threads registered to report their quiescent state
> > > + * on the TQS variable must call this API.
> > > + *
> > > + * @param v
> > > + *   TQS variable
> > > + */
> > > +static __rte_always_inline void __rte_experimental
> > > +rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id)
> > > +{
> > > +	uint32_t t;
> > > +
> > > +	TQS_ERR_LOG_IF_TRUE(v == NULL || lcore_id >= RTE_TQS_MAX_LCORE);
> > > +
> > > +	/* Load the token before the worker thread loads any other
> > > +	 * (lock-free) data structure. This ensures that updates
> > > +	 * to the data structures are visible if the update
> > > +	 * to token is visible.
> > > +	 */
> > > +	t = __atomic_load_n(&v->token, __ATOMIC_ACQUIRE);
> >
> > Hmm, I am not very familiar with the C11 model, but it looks like a race condition
> > to me:
> > as I understand, update() is supposed to be called at the end of the reader's critical
> > section, correct?
> Yes, the understanding is correct.
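The token handshake under discussion can be modeled with C11 atomics. The sketch below is a minimal single-threaded model with one reader and made-up names, not the actual rte_tqs code; it only illustrates the release-increment in start() pairing with the acquire-load in update().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t token;
static uint32_t cnt;	/* the (single) reader's quiescent-state counter */

/* Writer: after 'deleting' an entry, release-increment the token so the
 * delete is ordered before any reader that observes the new value. */
static uint32_t
tqs_start(uint32_t n)
{
	return atomic_fetch_add_explicit(&token, n, memory_order_release) + n;
}

/* Reader: at the end of its critical section, acquire-load the token
 * and record one quiescent-state pass. */
static void
tqs_update(void)
{
	uint32_t t = atomic_load_explicit(&token, memory_order_acquire);

	if (cnt != t)
		cnt++;
}

/* Writer: the grace period for token value 't' has elapsed once the
 * reader's counter has caught up with it. */
static int
tqs_check(uint32_t t)
{
	return cnt == t;
}
```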
>
> > But ACQUIRE is only a hoist barrier, which means the compiler and cpu are free to
> > move earlier reads (and writes) after it.
> Yes, your understanding is correct.
>
> > It probably needs to be a full ACQ_REL here.
> >
> The sequence of operations is as follows:
> 1) Writer 'deletes' an entry from a lock-free data structure
> 2) Writer calls rte_tqs_start - This API increments the 'token' and does a store-release. So, any earlier stores would be visible if the store to
> 'token' is visible (to the data plane threads).
> 3) Reader calls rte_tqs_update - This API load-acquires the 'token'.
>  a) If this 'token' is the updated value from 2), then the entry deleted in 1) will not be available for the reader to reference (even if
> that reference is due to earlier reads being moved after the load-acquire of 'token').
>  b) If this 'token' is not the updated value from 2), then the entry deleted in 1) may or may not be available for the reader to
> reference. In this case the w[lcore_id].cnt is not updated; hence the writer will wait to 'free' the deleted entry from 1)

Yes, you are right, it was me being confused.

>
>
> > > +	if (v->w[lcore_id].cnt != t)
> > > +		v->w[lcore_id].cnt++;
> > > +}
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Checks if all the worker threads have entered the quiescent state
> > > + * 'n' number of times. 'n' is provided in the rte_tqs_start API.
> > > + *
> > > + * This is implemented as a lock-free function. It is multi-thread
> > > + * safe and can be called from the worker threads as well.
> > > + *
> > > + * @param v
> > > + *   TQS variable
> > > + * @param t
> > > + *   Token returned by the rte_tqs_start API
> > > + * @param wait
> > > + *   If true, block till all the worker threads have completed entering
> > > + *   the quiescent state 'n' number of times
> > > + * @return
> > > + *   - 0 if all worker threads have NOT passed through the specified number
> > > + *     of quiescent states.
> > > + *   - 1 if all worker threads have passed through the specified number
> > > + *     of quiescent states.
> > > + *   - -EINVAL if the parameters are invalid (debug mode compilation only).
> > > + */
> > > +static __rte_always_inline int __rte_experimental
> > > +rte_tqs_check(struct rte_tqs *v, uint32_t t, bool wait)
> > > +{
> > > +	uint64_t l;
> > > +	uint64_t lcore_mask;
> > > +
> > > +	TQS_RETURN_IF_TRUE((v == NULL), -EINVAL);
> > > +
> > > +	do {
> > > +		/* Load the current lcore_mask before loading the
> > > +		 * worker thread quiescent state counters.
> > > +		 */
> > > +		lcore_mask = __atomic_load_n(&v->lcore_mask,
> > > +			__ATOMIC_ACQUIRE);
> >
> > What would happen if a reader calls unregister() simultaneously with check()
> > and updates lcore_mask straight after that load?
> > As I understand, check() might hang in such a case.
> If the 'lcore_mask' is updated after this load, it will affect only the current iteration of the while loop below. In the next iteration the
> 'lcore_mask' is loaded again.

True, my confusion again.

>
> >
> > > +
> > > +		while (lcore_mask) {
> > > +			l = __builtin_ctz(lcore_mask);
> > > +			if (v->w[l].cnt != t)
> > > +				break;
> >
> > As I understand, that makes control-path function progress dependent on
> > simultaneous invocation of the data-path functions.
> I agree that the control-path function progress (for ex: how long to wait before freeing the memory) depends on invocation of the data-path
> functions.
> The separation of 'start' and 'check', and the option not to block in 'check', provide the flexibility for the control-path to do some other
> work if it chooses to.
>
> > In some cases that might cause the control-path to hang.
> > Let's say the data-path function wouldn't be called, or the user invokes the control-path
> > and data-path functions from the same thread.
> I agree with the case of the data-path function not getting called. I would consider that a programming error. I can document that warning in
> the rte_tqs_check API.

Sure, it can be documented.
Though that means that each data-path thread would have to do an explicit update() call for every tqs it might use.
I just think that it would complicate things and might limit usage of the library quite significantly.

>
> In the case of the same thread calling both control-path and data-path functions, it would depend on the sequence of the calls. The following
> sequence should not cause any hangs:
> Worker thread
> 1) 'deletes' an entry from a lock-free data structure
> 2) rte_tqs_start
> 3) rte_tqs_update
> 4) rte_tqs_check (wait == 1 or wait == 0)
> 5) 'free' the entry deleted in 1)

That's an interesting idea, and that should help, I think.
Probably worth having the {2,3,4} sequence as a new high-level function.

>
> If 3) and 4) are interchanged, then there will be a hang if wait is set to 1. If wait is set to 0, there should not be a hang.
> I can document this as part of the documentation (I do not think API documentation is required for this).
>
> >
> > > +
> > > +			lcore_mask &= ~(1UL << l);
> > > +		}
> > > +
> > > +		if (lcore_mask == 0)
> > > +			return 1;
> > > +
> > > +		rte_pause();
> > > +	} while (wait);
> > > +
> > > +	return 0;
> > > +}
> > > +
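For reference, the {2,3,4} sequence discussed above, wrapped as a single blocking helper, might look roughly like this. All names here are made up, and only the calling thread is modeled, so the wait loop exits immediately; real code would poll every registered reader's counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t token;
static uint32_t self_cnt;	/* the caller's own quiescent counter */

/* One-shot helper for the worker-thread case: start a new grace
 * period, report the caller's own quiescent state, then wait. */
static uint32_t
tqs_synchronize(void)
{
	/* 2) start: release-increment the token */
	uint32_t t = atomic_fetch_add_explicit(&token, 1,
					       memory_order_release) + 1;

	/* 3) update: report the caller's own quiescent state first, so
	 * the wait below cannot deadlock on the caller itself */
	uint32_t cur = atomic_load_explicit(&token, memory_order_acquire);

	if (self_cnt != cur)
		self_cnt++;

	/* 4) check(wait == 1): only the caller is modeled here, so this
	 * is immediately true; real code would scan all counters */
	while (self_cnt != t)
		;

	return t;
}
```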