From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
To: "Ananyev, Konstantin", dev@dpdk.org
CC: nd, Dharmik Thakkar, Malvika Gupta, "Gavin Hu (Arm Technology China)"
Date: Thu, 13 Dec 2018 07:39:30 +0000
Subject: Re: [dpdk-dev] [RFC 2/3] tqs: add thread quiescent state library
In-Reply-To: <2601191342CEEE43887BDE71AB977258010D8B8CEE@IRSMSX106.ger.corp.intel.com>
>
> > > > > > +
> > > > > > +/* Add a reader thread, running on an lcore, to the list of
> > > > > > + * threads reporting their quiescent state on a TQS variable.
> > > > > > + */
> > > > > > +int __rte_experimental
> > > > > > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
> > > > > > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >= RTE_TQS_MAX_LCORE),
> > > > > > +			-EINVAL);
> > > > >
> > > > > It is not very good practice to make function return different
> > > > > values and behave in a different way in debug/non-debug mode.
> > > > > I'd say that for slow-path (functions in .c) it is always good
> > > > > to check input parameters.
> > > > > For fast-path (functions in .h) we sometimes skip such checking,
> > > > > but debug mode can probably use RTE_ASSERT() or so.
> > > > Makes sense, I will change this in the next version.
> > > > >
> > > > > lcore_id >= RTE_TQS_MAX_LCORE
> > > > >
> > > > > Is this limitation really necessary?
> > > > I added this limitation because currently DPDK application cannot
> > > > take a mask more than 64bit wide. Otherwise, this should be as big
> > > > as RTE_MAX_LCORE.
> > > > I see that in the case of '-lcores' option, the number of lcores
> > > > can be more than the number of PEs. In this case, we still need a
> > > > MAX limit (but can be bigger than 64).
> > > > > First it means that only lcores can use that API (at least
> > > > > data-path part), second even today many machines have more than
> > > > > 64 cores.
> > > > > I think you can easily avoid such limitation, if instead of
> > > > > requiring lcore_id as input parameter, you'll just make it return
> > > > > index of next available entry in w[].
> > > > > Then tqs_update() can take that index as input parameter.
> > > > I had thought about a similar approach based on IDs. I was
> > > > concerned that ID will be one more thing to manage for the
> > > > application. But, I see the limitations of the current approach
> > > > now. I will change it to allocation based. This will support even
> > > > non-EAL pthreads as well.
> > >
> > > Yes, with such approach non-lcore threads will be able to use it also.
> > >
> > I realized that rte_tqs_register_lcore/rte_tqs_unregister_lcore need
> > to be efficient as they can be called from the worker's packet
> > processing loop (rte_event_dequeue_burst allows blocking, so the
> > worker thread needs to call rte_tqs_unregister_lcore before calling
> > rte_event_dequeue_burst and rte_tqs_register_lcore before starting
> > packet processing). Allocating the thread ID in these functions will
> > make them more complex.
> >
> > I suggest that we change the argument 'lcore_id' to 'thread_id'. The
> > application could use 'lcore_id' as 'thread_id' if threads are mapped
> > to physical cores 1:1.
> >
> > If the threads are not mapped 1:1 to physical cores, the threads need
> > to use a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not
> > see that DPDK has a thread_id concept. For TQS, the thread IDs are
> > global (i.e. not per TQS variable). I could provide APIs to do the
> > thread ID allocation, but I think the thread ID allocation should not
> > be part of this library. Such thread ID might be useful for other
> > libraries.
>
> I don't think there is any point to introduce new thread_id concept just
> for that library.
Currently, we have the rte_gettid API. It is being used by rte_spinlock.
However, the thread ID returned there is the thread ID as defined by the
OS. The rte_spinlock APIs do not care who defines the thread ID as long as
those IDs are unique per thread. I think, if we have a thread_id concept
that covers non-EAL threads as well, it might help other libraries too.
For ex: [1] talks about the limitation of the per-lcore mempool cache. I
think this limitation can be removed easily if we could have a thread_id
that is in a small, well defined space (rather than an OS defined thread
ID, which may be an arbitrary number). I see similar issues mentioned for
rte_timer.
It might be useful for the dynamic threads Bruce talked about at the
Dublin summit (I am not sure on this one, just speculating).

[1] https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#known-issue-label

> After all we already have a concept of lcore_id which pretty much serves
> the same purpose.
> I still think that we need to either:
> a) make register/unregister to work with any valid lcore_id (<=
> RTE_MAX_LCORE)
I have made this change already, it will be there in the next version.

> b) make register/unregister to return index in w[]
>
> For a) will need mask bigger than 64bits.
> b) would allow to use data-path API by non-lcores threads too, plus w[]
> would occupy less space, and check() might be faster.
> Though yes, as a drawback, for b) register/unregister probably would need
> extra 'while(CAS(...));' loop.
Along with the CAS, we also need to search for an available index in the
array.

> I suppose the question here do you foresee a lot of concurrent
> register/unregister at data-path?
IMO, yes, because of the event dev API being blocking.
We can solve this by providing separate APIs for allocation/freeing of the
IDs. I am just questioning where these APIs should be.
>
> > > > > > +
> > > > > > +	while (lcore_mask) {
> > > > > > +		l = __builtin_ctz(lcore_mask);
> > > > > > +		if (v->w[l].cnt != t)
> > > > > > +			break;
> > > > >
> > > > > As I understand, that makes control-path function progress
> > > > > dependent on simultaneous invocation of data-path functions.
> > > > I agree that the control-path function progress (for ex: how long
> > > > to wait for freeing the memory) depends on invocation of the
> > > > data-path functions. The separation of 'start', 'check' and the
> > > > option not to block in 'check' provide the flexibility for
> > > > control-path to do some other work if it chooses to.
> > > > > In some cases that might cause control-path to hang.
> > > > > Let say if data-path function wouldn't be called, or user
> > > > > invokes control-path and data-path functions from the same
> > > > > thread.
> > > > I agree with the case of data-path function not getting called. I
> > > > would consider that as a programming error. I can document that
> > > > warning in the rte_tqs_check API.
> > >
> > > Sure, it can be documented.
> > > Though that means, that each data-path thread would have to do
> > > explicit update() call for every tqs it might use.
> > > I just think that it would complicate things and might limit usage
> > > of the library quite significantly.
> > Each data path thread has to report its quiescent state. Hence, each
> > data-path thread has to call update() (similar to how
> > rte_timer_manage() has to be called periodically on the worker thread).
>
> I understand that.
> Though that means that each data-path thread has to know explicitly what
> rcu vars it accesses.
Yes. That is correct. It is both good and bad. It is providing flexibility
to reduce the overhead. For ex: in pipeline mode, it may be that a
particular data structure is accessed only by some of the threads in the
application.
In this case, this library allows for per data structure vars, which
reduces the overhead. This applies for service cores as well.

> Would be hard to adopt such API with rcu vars used inside some library.
> But ok, as I understand people do use QSBR approach in their apps and
> find it useful.
It can be adopted in the library with different levels of
assumptions/constraints.
1) With the assumption that the data plane threads will update the
quiescent state. For ex: for the rte_hash library we could ask the user to
pass the TQS variable as input and the rte_hash writer APIs can call the
rte_tqs_start and rte_tqs_check APIs.
2) If the assumption in 1) is not good, updating of the quiescent state
can be hidden in the library, but again with the assumption that the data
plane library API is called on a regular basis. For ex: rte_tqs_update can
be called within the rte_hash_lookup API.
3) If we do not want to assume that the data plane API will be called on a
regular basis, then the rte_tqs_register/unregister APIs need to be used
before and after entering the critical section, along with calling the
rte_tqs_update API. For ex: rte_hash_lookup should have the sequence
rte_tqs_register, ..., rte_tqs_unregister, rte_tqs_update (very similar
to GP).

>
> > Do you have any particular use case in mind where this fails?
>
> Let say it means that library can't be used to add/del RX/TX ethdev
> callbacks in a safe manner.
I need to understand this better. I will look at the rte_ethdev library.

>
> BTW, two side questions:
> 1) As I understand what you propose is very similar to QSBR main concept.
> Wouldn't it be better to name it accordingly to avoid confusion (or at
> least document it somewhere).
> I think someone else already raised that question.
QSBR stands for Quiescent State Based Reclamation. This library already
has 'Thread Quiescent State' in the name. Others have questioned/suggested
why not use RCU instead.
I called it thread quiescent state as this library just helps determine if
all the readers have entered the quiescent state. It does not do anything
else.
However, you are also bringing up an important point: 'will we add other
methods of memory reclamation?' With that in mind, may be we should not
call it RCU. But, may be call it rte_rcu_qsbr_xxx? It will also future
proof the API in case we want to add additional RCU types.

> 2) Would QSBR be the only technique in that lib?
> Any plans to add something similar to GP one too (with MBs at
> reader-side)?
I believe, by GP, you mean general-purpose RCU. In my understanding QSBR
is the one with the least overhead. For DPDK applications, I think
reducing that overhead is important. The GP approach adds additional
overhead on the reader side. I did not see a need to add any additional
ones as of now. But if there are use cases that cannot be achieved with
the proposed APIs, we can definitely expand it.

>
> > > > >
> > > > In the case of the same thread calling both control-path and
> > > > data-path functions, it would depend on the sequence of the calls.
> > > > The following sequence should not cause any hangs:
> > > > Worker thread
> > > > 1) 'deletes' an entry from a lock-free data structure
> > > > 2) rte_tqs_start
> > > > 3) rte_tqs_update
> > > > 4) rte_tqs_check (wait == 1 or wait == 0)
> > > > 5) 'free' the entry deleted in 1)
> > >
> > > That's an interesting idea, and that should help, I think.
> > > Probably worth to have {2,3,4} sequence as a new high level function.
> > >
> > Yes, this is a good idea. Such a function would be applicable only in
> > the worker thread. I would prefer to leave it to the application to
> > take care.
>
> Yes, it would be applicable only to worker thread, but why we can't have
> a function for it?
> Let say it could be 2 different functions: one doing {2,3,4} - for
> worker threads, and second doing just {2,4} - for control threads.
> Or it could be just one function that takes an extra parameter:
> lcore_id/w[] index. If it is some predefined invalid value (-1 or so),
> step #3 will be skipped.
The rte_tqs_start and rte_tqs_check are separated into 2 APIs so that the
writers do not have to spend CPU/memory cycles polling for the readers'
quiescent state. In the context of DPDK, this overhead will be significant
(at least equal to the length of 1 while loop on the worker core). This is
one of the key features of this library. Combining {2,3,4} will defeat
this purpose. For ex: in the rte_hash library, whenever a writer on the
data path calls rte_hash_add, (with {2,3,4} combined) it will wait for the
rest of the readers to enter the quiescent state, i.e. the performance
will come down whenever rte_hash_add is called.

I am trying to understand your concern. If it is that 'programmers may
not use the right sequence', I would prefer to treat that as a programming
error. May be it is better addressed by providing debugging capabilities.

>
> Konstantin