From mboxrd@z Thu Jan 1 00:00:00 1970
From: Honnappa Nagarahalli
To: "Ananyev, Konstantin", dev@dpdk.org
CC: nd, Dharmik Thakkar, Malvika Gupta,
 "Gavin Hu (Arm Technology China)", Honnappa Nagarahalli, nd
Date: Tue, 27 Nov 2018 21:32:04 +0000
References: <20181122033055.3431-1-honnappa.nagarahalli@arm.com>
 <20181122033055.3431-3-honnappa.nagarahalli@arm.com>
 <2601191342CEEE43887BDE71AB977258010CEBBBA9@IRSMSX106.ger.corp.intel.com>
In-Reply-To: <2601191342CEEE43887BDE71AB977258010CEBBBA9@IRSMSX106.ger.corp.intel.com>
Subject: Re: [dpdk-dev] [RFC 2/3] tqs: add thread quiescent state library
List-Id: DPDK patches and discussions

>
> Hi Honnappa,

Thank you for reviewing the patch; I appreciate your comments.

>
> > +
> > +/* Allocate a new TQS variable with the name *name* in memory.
> > + */
> > +struct rte_tqs * __rte_experimental
> > +rte_tqs_alloc(const char *name, int socket_id, uint64_t lcore_mask) {
> > +	char tqs_name[RTE_TQS_NAMESIZE];
> > +	struct rte_tailq_entry *te, *tmp_te;
> > +	struct rte_tqs_list *tqs_list;
> > +	struct rte_tqs *v, *tmp_v;
> > +	int ret;
> > +
> > +	if (name == NULL) {
> > +		RTE_LOG(ERR, TQS, "Invalid input parameters\n");
> > +		rte_errno = -EINVAL;
> > +		return NULL;
> > +	}
> > +
> > +	te = rte_zmalloc("TQS_TAILQ_ENTRY", sizeof(*te), 0);
> > +	if (te == NULL) {
> > +		RTE_LOG(ERR, TQS, "Cannot reserve memory for tailq\n");
> > +		rte_errno = -ENOMEM;
> > +		return NULL;
> > +	}
> > +
> > +	snprintf(tqs_name, sizeof(tqs_name), "%s", name);
> > +	v = rte_zmalloc_socket(tqs_name, sizeof(struct rte_tqs),
> > +				RTE_CACHE_LINE_SIZE, socket_id);
> > +	if (v == NULL) {
> > +		RTE_LOG(ERR, TQS, "Cannot reserve memory for TQS variable\n");
> > +		rte_errno = -ENOMEM;
> > +		goto alloc_error;
> > +	}
> > +
> > +	ret = snprintf(v->name, sizeof(v->name), "%s", name);
> > +	if (ret < 0 || ret >= (int)sizeof(v->name)) {
> > +		rte_errno = -ENAMETOOLONG;
> > +		goto alloc_error;
> > +	}
> > +
> > +	te->data = (void *) v;
> > +	v->lcore_mask = lcore_mask;
> > +
> > +	rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
> > +
> > +	tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
> > +
> > +	/* Search if a TQS variable with the same name exists already */
> > +	TAILQ_FOREACH(tmp_te, tqs_list, next) {
> > +		tmp_v = (struct rte_tqs *) tmp_te->data;
> > +		if (strncmp(name, tmp_v->name, RTE_TQS_NAMESIZE) == 0)
> > +			break;
> > +	}
> > +
> > +	if (tmp_te != NULL) {
> > +		rte_errno = -EEXIST;
> > +		goto tqs_exist;
> > +	}
> > +
> > +	TAILQ_INSERT_TAIL(tqs_list, te, next);
> > +
> > +	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
> > +
> > +	return v;
> > +
> > +tqs_exist:
> > +	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
> > +
> > +alloc_error:
> > +	rte_free(te);
> > +	rte_free(v);
> > +	return NULL;
> > +}
>
> That seems like quite a heavy-weight function just to allocate a sync variable.
> As the size of struct rte_tqs is constant and known to the user, it might be
> better to just provide rte_tqs_init(struct rte_tqs *tqs, ...) and let the user
> allocate/free the memory for it himself.
>
I believe, when you say heavy-weight, you are referring to adding the tqs
variable to the TAILQ and allocating the memory for it. Agreed. I also do not
expect that a whole lot of tqs variables will be used in an application.
rte_tqs_free has a similar overhead.

The extra part is due to the way the TQS variable gets identified by data plane
threads. I am thinking that a data plane thread will use the rte_tqs_lookup API
to identify a TQS variable. However, it is also possible to share it with the
data plane threads via a simple shared structure.

Along with not allocating the memory, are you suggesting that we could also
skip maintaining a list of TQS variables in the TAILQ? This would remove the
rte_tqs_lookup, rte_tqs_free and rte_tqs_list_dump APIs. I am fine with this
approach.
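To make sure we are talking about the same thing, here is a minimal sketch of
what I understand the proposal to be (the rte_tqs_init signature below is only
a placeholder, it is not part of this RFC):

	/* Application owns the memory; the library only initializes it. */
	static struct rte_tqs app_tqs __rte_cache_aligned;

	static int
	app_setup(uint64_t lcore_mask)
	{
		/* No tailq, no lookup: the application shares &app_tqs with
		 * its data plane threads through its own structures.
		 */
		return rte_tqs_init(&app_tqs, lcore_mask);
	}
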
> > +
> > +/* Add a reader thread, running on an lcore, to the list of threads
> > + * reporting their quiescent state on a TQS variable.
> > + */
> > +int __rte_experimental
> > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
> > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >= RTE_TQS_MAX_LCORE),
> > +				-EINVAL);
>
> It is not very good practice to make a function return different values and
> behave in a different way in debug/non-debug mode.
> I'd say that for slow-path (functions in .c) it is always good to check input
> parameters.
> For fast-path (functions in .h) we sometimes skip such checking, but debug
> mode can probably use RTE_ASSERT() or so.

Makes sense, I will change this in the next version.

>
>
> lcore_id >= RTE_TQS_MAX_LCORE
>
> Is this limitation really necessary?

I added this limitation because currently a DPDK application cannot take a mask
more than 64 bits wide. Otherwise, this should be as big as RTE_MAX_LCORE.
I see that in the case of the '-lcores' option, the number of lcores can be
more than the number of PEs. In this case, we still need a MAX limit (but it
can be bigger than 64).

> First, it means that only lcores can use that API (at least the data-path
> part); second, even today many machines have more than 64 cores.
> I think you can easily avoid such a limitation if, instead of requiring
> lcore_id as an input parameter, you just make it return the index of the next
> available entry in w[].
> Then tqs_update() can take that index as an input parameter.

I had thought about a similar approach based on IDs. I was concerned that the
ID would be one more thing for the application to manage. But I see the
limitations of the current approach now. I will change it to allocation based.
This will also support non-EAL pthreads.

>
> > +
>
> > +	/* Worker thread has to count the quiescent states
> > +	 * only from the current value of token.
> > +	 */
> > +	v->w[lcore_id].cnt = v->token;
>
> Wonder what would happen if a new reader calls register() after the writer
> calls start()?
> Looks like a race condition.
> Or is such a pattern not supported?

The start should be called only after the reference to the entry in the data
structure has been 'deleted'. Hence the new reader will not get a reference to
the deleted entry and does not have to increment its counter. When
rte_tqs_check is called, it will see that the counter is already up to date.
(I am missing a load-acquire on the token; I will correct that in the next
version.)

>
> > +
> > +	/* Release the store to initial TQS count so that workers
> > +	 * can use it immediately after this function returns.
> > +	 */
> > +	__atomic_fetch_or(&v->lcore_mask, (1UL << lcore_id),
> > +			__ATOMIC_RELEASE);
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Trigger the worker threads to report the quiescent state
> > + * status.
> > + *
> > + * This is implemented as a lock-free function. It is multi-thread
> > + * safe and can be called from the worker threads as well.
> > + *
> > + * @param v
> > + *   TQS variable
> > + * @param n
> > + *   Expected number of times the quiescent state is entered
> > + * @param t
> > + *   - If successful, this is the token for this call of the API.
> > + *     This should be passed to rte_tqs_check API.
> > + * @return
> > + *   - -EINVAL if the parameters are invalid (debug mode compilation only).
> > + *   - 0 Otherwise and always (non-debug mode compilation).
> > + */
> > +static __rte_always_inline int __rte_experimental
> > +rte_tqs_start(struct rte_tqs *v, unsigned int n, uint32_t *t) {
> > +	TQS_RETURN_IF_TRUE((v == NULL || t == NULL), -EINVAL);
> > +
> > +	/* This store release will ensure that changes to any data
> > +	 * structure are visible to the workers before the token
> > +	 * update is visible.
> > +	 */
> > +	*t = __atomic_add_fetch(&v->token, n, __ATOMIC_RELEASE);
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Update quiescent state for the worker thread on a lcore.
> > + *
> > + * This is implemented as a lock-free function. It is multi-thread safe.
> > + * All the worker threads registered to report their quiescent state
> > + * on the TQS variable must call this API.
> > + *
> > + * @param v
> > + *   TQS variable
> > + */
> > +static __rte_always_inline void __rte_experimental
> > +rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id) {
> > +	uint32_t t;
> > +
> > +	TQS_ERR_LOG_IF_TRUE(v == NULL || lcore_id >= RTE_TQS_MAX_LCORE);
> > +
> > +	/* Load the token before the worker thread loads any other
> > +	 * (lock-free) data structure. This ensures that updates
> > +	 * to the data structures are visible if the update
> > +	 * to token is visible.
> > +	 */
> > +	t = __atomic_load_n(&v->token, __ATOMIC_ACQUIRE);
>
> Hmm, I am not very familiar with the C11 model, but it looks like a race
> condition to me:
> as I understand, update() is supposed to be called at the end of the reader's
> critical section, correct?

Yes, the understanding is correct.

> But ACQUIRE is only a hoist barrier, which means the compiler and cpu are free
> to move earlier reads (and writes) after it.

Yes, your understanding is correct.

> It probably needs to be a full ACQ_REL here.
>
The sequence of operations is as follows:
1) Writer 'deletes' an entry from a lock-free data structure.
2) Writer calls rte_tqs_start - this API increments the 'token' and does a
   store-release. So, any earlier stores will be visible if the store to
   'token' is visible (to the data plane threads).
3) Reader calls rte_tqs_update - this API load-acquires the 'token'.
   a) If this 'token' is the updated value from 2), then the entry deleted in
      1) will not be available for the reader to reference (even if that
      reference is due to earlier reads being moved after the load-acquire of
      'token').
   b) If this 'token' is not the updated value from 2), then the entry deleted
      in 1) may or may not be available for the reader to reference. In this
      case w[lcore_id].cnt is not updated, hence the writer will wait before
      'freeing' the deleted entry from 1).
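To put that sequence in code form, a rough sketch (only the rte_tqs_* calls are
from this RFC; the app_* table/entry helpers are made-up application code):

	/* Writer (control path) */
	static void
	writer_delete(struct rte_tqs *v, struct app_table *tbl,
			struct app_entry *e)
	{
		uint32_t token;

		app_table_unlink(tbl, e);            /* 1) 'delete' the entry     */
		rte_tqs_start(v, 1, &token);         /* 2) store-release on token */
		while (rte_tqs_check(v, token, false) == 0)
			app_do_other_work();         /* or pass wait == true      */
		app_entry_free(e);                   /* no reader holds it now    */
	}

	/* Reader (data plane), at the end of each critical section */
	static void
	reader_iteration(struct rte_tqs *v, struct app_table *tbl,
			unsigned int lcore_id)
	{
		struct app_entry *e = app_table_lookup(tbl, 0 /* key */);

		app_use_entry(e);
		rte_tqs_update(v, lcore_id);         /* 3) load-acquire on token  */
	}
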
> > + * - 1 if all worker threads have passed through specified number > > + * of quiescent states. > > + * - -EINVAL if the parameters are invalid (debug mode compilation o= nly). > > + */ > > +static __rte_always_inline int __rte_experimental > > +rte_tqs_check(struct rte_tqs *v, uint32_t t, bool wait) { > > + uint64_t l; > > + uint64_t lcore_mask; > > + > > + TQS_RETURN_IF_TRUE((v =3D=3D NULL), -EINVAL); > > + > > + do { > > + /* Load the current lcore_mask before loading the > > + * worker thread quiescent state counters. > > + */ > > + lcore_mask =3D __atomic_load_n(&v->lcore_mask, > __ATOMIC_ACQUIRE); >=20 > What would happen if reader will call unregister() simultaneously with ch= eck() > and will update lcore_mask straight after that load? > As I understand check() might hang in such case. If the 'lcore_mask' is updated after this load, it will affect only the cur= rent iteration of the while loop below. In the next iteration the 'lcore_ma= sk' is loaded again. >=20 > > + > > + while (lcore_mask) { > > + l =3D __builtin_ctz(lcore_mask); > > + if (v->w[l].cnt !=3D t) > > + break; >=20 > As I understand, that makes control-path function progress dependent on > simultaneous invocation of data-path functions. I agree that the control-path function progress (for ex: how long to wait f= or freeing the memory) depends on invocation of the data-path functions. Th= e separation of 'start', 'check' and the option not to block in 'check' pro= vide the flexibility for control-path to do some other work if it chooses t= o. =20 > In some cases that might cause control-path to hang. > Let say if data-path function wouldn't be called, or user invokes control= -path > and data-path functions from the same thread. I agree with the case of data-path function not getting called. I would con= sider that as programming error. I can document that warning in the rte_tqs= _check API. In the case of same thread calling both control-path and data-path function= s, it would depend on the sequence of the calls. The following sequence sho= uld not cause any hangs: Worker thread 1) 'deletes' an entry from a lock-free data structure 2) rte_tqs_start 3) rte_tqs_update=20 4) rte_tqs_check (wait =3D=3D 1 or wait =3D=3D 0) 5) 'free' the entry deleted in 1) If 3) and 4) are interchanged, then there will be a hang if wait is set to = 1. If wait is set to 0, there should not be a hang. I can document this as part of the documentation (I do not think API docume= ntation is required for this). >=20 > > + > > + lcore_mask &=3D ~(1UL << l); > > + } > > + > > + if (lcore_mask =3D=3D 0) > > + return 1; > > + > > + rte_pause(); > > + } while (wait); > > + > > + return 0; > > +} > > +