From: "Ananyev, Konstantin"
To: Honnappa Nagarahalli, dev@dpdk.org
CC: nd, Dharmik Thakkar, Malvika Gupta, "Gavin Hu (Arm Technology China)"
Date: Mon, 17 Dec 2018 13:14:38 +0000
Message-ID: <2601191342CEEE43887BDE71AB977258010D8BB372@IRSMSX106.ger.corp.intel.com>
References: <20181122033055.3431-1-honnappa.nagarahalli@arm.com>
 <20181122033055.3431-3-honnappa.nagarahalli@arm.com>
 <2601191342CEEE43887BDE71AB977258010CEBBBA9@IRSMSX106.ger.corp.intel.com>
 <2601191342CEEE43887BDE71AB977258010CEBD5D4@IRSMSX106.ger.corp.intel.com>
 <2601191342CEEE43887BDE71AB977258010D8B8CEE@IRSMSX106.ger.corp.intel.com>
Subject: Re: [dpdk-dev] [RFC 2/3] tqs: add thread quiescent state library

> > > > > > > +
> > > > > > > +/* Add a reader thread, running on an lcore, to the list of
> > > > > > > + * threads reporting their quiescent state on a TQS variable.
> > > > > > > + */
> > > > > > > +int __rte_experimental
> > > > > > > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
> > > > > > > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >= RTE_TQS_MAX_LCORE),
> > > > > > > +		-EINVAL);
> > > > > >
> > > > > > It is not good practice to make a function return different values
> > > > > > and behave differently in debug and non-debug mode.
> > > > > > I'd say that for slow-path (functions in .c) it is always good
> > > > > > to check input parameters.
> > > > > > For fast-path (functions in .h) we sometimes skip such checking,
> > > > > > but debug mode can probably use RTE_ASSERT() or so.
> > > > > Makes sense, I will change this in the next version.
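
To illustrate the split suggested above, a rough sketch (untested; the
rte_tqs names and RTE_TQS_MAX_LCORE follow the RFC, everything else is my
assumption; RTE_ASSERT() compiles to nothing unless RTE_ENABLE_ASSERT is
defined):

	#include <errno.h>
	#include <rte_debug.h>

	struct rte_tqs;	/* as defined in the RFC */

	/* slow-path, lives in the .c file: always validate inputs */
	int __rte_experimental
	rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id)
	{
		if (v == NULL || lcore_id >= RTE_TQS_MAX_LCORE)
			return -EINVAL;
		/* ... rest of the registration as in the RFC ... */
		return 0;
	}

	/* fast-path, lives in the .h file: debug-only assertion */
	static inline void
	rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id)
	{
		RTE_ASSERT(v != NULL && lcore_id < RTE_TQS_MAX_LCORE);
		/* ... update the per-lcore counter as in the RFC ... */
	}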

> > > > > > > lcore_id >= RTE_TQS_MAX_LCORE
> > > > > >
> > > > > > Is this limitation really necessary?
> > > > > I added this limitation because currently a DPDK application cannot
> > > > > take a mask more than 64 bits wide. Otherwise, this should be as
> > > > > big as RTE_MAX_LCORE. I see that in the case of the '-lcores'
> > > > > option, the number of lcores can be more than the number of PEs.
> > > > > In this case, we still need a MAX limit (but it can be bigger
> > > > > than 64).
> > > > > > First, it means that only lcores can use that API (at least the
> > > > > > data-path part); second, even today many machines have more
> > > > > > than 64 cores. I think you can easily avoid such a limitation
> > > > > > if, instead of requiring lcore_id as an input parameter, you
> > > > > > just make it return the index of the next available entry in
> > > > > > w[]. Then tqs_update() can take that index as input parameter.
> > > > > I had thought about a similar approach based on IDs. I was
> > > > > concerned that the ID would be one more thing for the application
> > > > > to manage. But, I see the limitations of the current approach
> > > > > now. I will change it to allocation based. This will support even
> > > > > non-EAL pthreads as well.
> > > >
> > > > Yes, with such an approach non-lcore threads will be able to use it
> > > > also.
> > > >
> > > I realized that rte_tqs_register_lcore/rte_tqs_unregister_lcore need
> > > to be efficient as they can be called from the worker's packet
> > > processing loop (rte_event_dequeue_burst allows blocking, so the
> > > worker thread needs to call rte_tqs_unregister_lcore before calling
> > > rte_event_dequeue_burst and rte_tqs_register_lcore before starting
> > > packet processing). Allocating the thread ID in these functions will
> > > make them more complex.
> > >
> > > I suggest that we change the argument 'lcore_id' to 'thread_id'. The
> > > application could use 'lcore_id' as 'thread_id' if threads are mapped
> > > to physical cores 1:1.
> > >
> > > If the threads are not mapped 1:1 to physical cores, the threads need
> > > to use a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not
> > > see that DPDK has a thread_id concept. For TQS, the thread IDs are
> > > global (i.e. not per TQS variable). I could provide APIs to do the
> > > thread ID allocation, but I think the thread ID allocation should not
> > > be part of this library. Such a thread ID might be useful for other
> > > libraries.
> >
> > I don't think there is any point in introducing a new thread_id
> > concept just for that library.
> Currently, we have the rte_gettid API. It is used by rte_spinlock.
> However, the thread ID returned there is the thread ID as defined by the
> OS. The rte_spinlock APIs do not care who defines the thread ID as long
> as those IDs are unique per thread. I think, if we have a thread_id
> concept that covers non-EAL threads as well, it might help other
> libraries too. For ex: [1] talks about the limitation of the per-lcore
> cache. I think this limitation can be removed easily if we could have a
> thread_id that is in a small, well defined space (rather than the OS
> defined thread ID, which may be an arbitrary number). I see similar
> issues mentioned for rte_timer.

If we just introduce a new ID (let's name it thread_id), then we just
replace one limitation with another. It would still be local_cache[], now
based on some thread_id instead of the current lcore_id. I don't see how
that would be better than the current scheme. To let an arbitrary thread
use the mempool's cache we need something smarter than just a
local_cache[] per ID, but without loss of performance.

> It might be useful in the dynamic threads Bruce talked about at the
> Dublin summit (I am not sure on this one, just speculating).

That's probably about making lcore_id allocation/freeing dynamic.

> [1] https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#known-issue-label
>
> > After all, we already have the concept of lcore_id, which pretty much
> > serves the same purpose.
> > I still think that we need to either:
> > a) make register/unregister work with any valid lcore_id
> >    (<= RTE_MAX_LCORE)
> I have made this change already, it will be there in the next version.

Ok.

> > b) make register/unregister return an index in w[]
> >
> > For a) we will need a mask bigger than 64 bits.
> > b) would allow non-lcore threads to use the data-path API too, plus
> > w[] would occupy less space, and check() might be faster.
> > Though yes, as a drawback, for b) register/unregister would probably
> > need an extra 'while(CAS(...));' loop.
> Along with the CAS, we also need to search for an available index in the
> array.

Sure, but I thought that one is relatively cheap compared to the CAS
itself (probably not, as the cache line with the data will be shared
between cores).
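
For reference, the kind of thing I have in mind for b) - a purely
hypothetical sketch, not RFC code (the slot bitmap, names and layout are
all made up):

	#include <errno.h>
	#include <stdint.h>
	#include <rte_atomic.h>

	#define TQS_MAX_THREADS	128

	struct tqs_var {
		/* one bit per w[] slot, set when the slot is taken */
		uint64_t used[TQS_MAX_THREADS / 64];
		/* struct tqs_cnt w[TQS_MAX_THREADS]; ... */
	};

	/* returns a free index in w[], or -ENOSPC when all slots are taken */
	int
	tqs_register(struct tqs_var *v)
	{
		unsigned int i, bit;
		uint64_t old;

		for (i = 0; i < TQS_MAX_THREADS / 64; i++) {
			do {
				old = v->used[i];
				if (old == UINT64_MAX)
					break;	/* word full, try the next one */
				bit = __builtin_ctzll(~old); /* lowest free bit */
			} while (rte_atomic64_cmpset(&v->used[i], old,
					old | (1ULL << bit)) == 0);
			if (old != UINT64_MAX)
				return i * 64 + bit;
		}
		return -ENOSPC;
	}

So yes, it is a search plus a CAS retry loop, but the search itself is
just a ctz over the bitmap word.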

> > I suppose the question here is: do you foresee a lot of concurrent
> > register/unregister at data-path?
> IMO, yes, because of the eventdev API being blocking.
> We can solve this by providing separate APIs for allocation/freeing of
> the IDs. I am just questioning where these APIs should be.
>
> > > > > > > +
> > > > > > > +	while (lcore_mask) {
> > > > > > > +		l = __builtin_ctz(lcore_mask);
> > > > > > > +		if (v->w[l].cnt != t)
> > > > > > > +			break;
> > > > > >
> > > > > > As I understand, that makes control-path function progress
> > > > > > dependent on simultaneous invocation of data-path functions.
> > > > > I agree that the control-path function progress (for ex: how long
> > > > > to wait before freeing the memory) depends on the invocation of
> > > > > the data-path functions. The separation of 'start' and 'check',
> > > > > and the option not to block in 'check', provide the flexibility
> > > > > for the control-path to do some other work if it chooses to.
> > > > > > In some cases that might cause the control-path to hang.
> > > > > > Let's say the data-path function wouldn't be called, or the
> > > > > > user invokes control-path and data-path functions from the same
> > > > > > thread.
> > > > > I agree with the case of the data-path function not getting
> > > > > called. I would consider that a programming error. I can document
> > > > > that warning in the rte_tqs_check API.
> > > >
> > > > Sure, it can be documented.
> > > > Though that means that each data-path thread would have to do an
> > > > explicit update() call for every tqs it might use.
> > > > I just think that it would complicate things and might limit usage
> > > > of the library quite significantly.
> > > Each data path thread has to report its quiescent state. Hence, each
> > > data-path thread has to call update() (similar to how
> > > rte_timer_manage() has to be called periodically on the worker
> > > thread).
> >
> > I understand that.
> > Though that means that each data-path thread has to know explicitly
> > what rcu vars it accesses.
> Yes. That is correct. It is both good and bad. It provides flexibility
> to reduce the overhead. For ex: in pipeline mode, it may be that a
> particular data structure is accessed only by some of the threads in the
> application. In this case, this library allows for per-data-structure
> vars, which reduces the overhead. This applies for service cores as
> well.
>
> > It would be hard to adopt such an API with rcu vars used inside some
> > library. But ok, as I understand people do use the QSBR approach in
> > their apps and find it useful.
> It can be adopted in a library with different levels of
> assumptions/constraints:
> 1) With the assumption that the data plane threads will update the
> quiescent state. For ex: for the rte_hash library we could ask the user
> to pass the TQS variable as input, and the rte_hash writer APIs can call
> the rte_tqs_start and rte_tqs_check APIs.
> 2) If the assumption in 1) is not good, updating of the quiescent state
> can be hidden in the library, but again with the assumption that the
> data plane library API is called on a regular basis. For ex:
> rte_tqs_update can be called within the rte_hash_lookup API.
> 3) If we do not want to assume that the data plane API will be called on
> a regular basis, then the rte_tqs_register/unregister APIs need to be
> used before and after entering the critical section, along with calling
> the rte_tqs_update API. For ex: rte_hash_lookup should have the sequence
> rte_tqs_register, <critical section>, rte_tqs_unregister,
> rte_tqs_update. (very similar to GP)

#3 is surely possible, but it seems quite expensive.
Anyway, as I said before, people do use the QSBR approach - it has small
overhead for readers and is relatively straightforward.
So let's start with that one, and keep some ability to extend the lib
with new methods in the future.
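
Just to illustrate the per-call cost of 3) - a hypothetical sketch of
what every lookup would then have to do (the rte_tqs_* signatures and the
__rte_hash_lookup helper are my assumptions, not RFC code):

	int
	rte_hash_lookup_tqs(const struct rte_hash *h, const void *key,
			struct rte_tqs *v)
	{
		int id, ret;

		/* join the readers' list: bitmap search + CAS */
		id = rte_tqs_register(v);
		if (id < 0)
			return id;

		/* the critical section: the lookup itself */
		ret = __rte_hash_lookup(h, key);

		/* leave the readers' list and report quiescent state,
		 * per the sequence in 3) above */
		rte_tqs_unregister(v, id);
		rte_tqs_update(v, id);

		return ret;
	}

So each lookup pays for something like two atomic updates on a shared
cache line, which is exactly the reader-side cost QSBR tries to avoid.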

> >
> > > Do you have any particular use case in mind where this fails?
> >
> > Let's say it means that the library can't be used to add/del RX/TX
> > ethdev callbacks in a safe manner.
> I need to understand this better. I will look at the rte_ethdev library.

Ok, you can also have a look at lib/librte_bpf/bpf_pkt.c to check how we
overcome it now.

> > BTW, two side questions:
> > 1) As I understand, what you propose is very similar to the QSBR main
> > concept. Wouldn't it be better to name it accordingly, to avoid
> > confusion (or at least document it somewhere)?
> > I think someone else already raised that question.
> QSBR stands for Quiescent State Based Reclamation. This library already
> has 'Thread Quiescent State' in the name. Others have
> questioned/suggested why not use RCU instead. I called it thread
> quiescent state as this library just helps determine if all the readers
> have entered the quiescent state. It does not do anything else.
>
> However, you are also bringing up an important point: 'will we add
> other methods of memory reclamation?'. With that in mind, maybe we
> should not call it RCU. But maybe call it rte_rcu_qsbr_xxx? It will
> also future-proof the API in case we want to add additional RCU types.

Yep, that sounds like a good approach to me.

> > 2) Would QSBR be the only technique in that lib?
> > Any plans to add something similar to the GP one too (with MBs at the
> > reader side)?
> I believe, by GP, you mean general-purpose RCU.

Yes.

> In my understanding QSBR is the one with the least overhead. For DPDK
> applications, I think reducing that overhead is important. GP adds
> additional overhead on the reader side.

Yes, but it provides better flexibility.

> I did not see a need to add any additional ones as of now.

Let's say, your #3 solution above: I think GP will be cheaper than
register/unregister for each library call.
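
For reference, a GP-style read-side section would look something like the
below - a purely illustrative sketch, nothing to do with the RFC code
(the per-reader state and names are made up):

	#include <stdint.h>
	#include <rte_atomic.h>

	struct gp_reader {
		/* non-zero while inside a read-side critical section;
		 * should be cache-line aligned, one per reader thread */
		volatile uint32_t nesting;
	};

	static inline void
	gp_read_lock(struct gp_reader *r)
	{
		r->nesting++;
		rte_smp_mb();	/* flag visible before the protected loads */
	}

	static inline void
	gp_read_unlock(struct gp_reader *r)
	{
		rte_smp_mb();	/* protected loads done before clearing flag */
		r->nesting--;
	}

The full barrier on every read-side enter/exit is the extra reader
overhead mentioned above; in exchange, readers don't have to report
quiescent states periodically.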

> > > > > In the case of the same thread calling both control-path and
> > > > > data-path functions, it would depend on the sequence of the
> > > > > calls. The following sequence should not cause any hangs:
> > > > > Worker thread
> > > > > 1) 'deletes' an entry from a lock-free data structure
> > > > > 2) rte_tqs_start
> > > > > 3) rte_tqs_update
> > > > > 4) rte_tqs_check (wait == 1 or wait == 0)
> > > > > 5) 'free' the entry deleted in 1)
> > > >
> > > > That's an interesting idea, and it should help, I think.
> > > > Probably worth having the {2,3,4} sequence as a new high-level
> > > > function.
> > > >
> > > Yes, this is a good idea. Such a function would be applicable only
> > > in the worker thread. I would prefer to leave it to the application
> > > to take care of.
> >
> > Yes, it would be applicable only to a worker thread, but why can't we
> > have a function for it?
> > Let's say it could be 2 different functions: one doing {2,3,4} - for
> > worker threads, and a second doing just {2,4} - for control threads.
> > Or it could be just one function that takes an extra parameter:
> > lcore_id/w[] index. If it is some predefined invalid value (-1 or
> > so), step #3 will be skipped.
> The rte_tqs_start and rte_tqs_check are separated into 2 APIs so that
> the writers do not have to spend CPU/memory cycles polling for the
> readers' quiescent state. In the context of DPDK, this overhead will be
> significant (at least equal to the length of 1 while loop on the worker
> core). This is one of the key features of this library. Combining
> 2,[3],4 will defeat this purpose. For ex: in the rte_hash library,
> whenever a writer on the data path calls rte_hash_add, (with 2,3,4
> combined) it will wait for the rest of the readers to enter the
> quiescent state, i.e. the performance will come down whenever
> rte_hash_add is called.

I am not suggesting we replace start+[update+]check with one
mega-function. NP with the currently defined API.
I am talking about an additional function for users where performance is
not the main concern - they just need a function that does things in the
proper way for them.
I think having such an extra function will simplify their life; again,
they can use it as a reference to understand the proper call sequence
they need on their own.
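
E.g., something like the sketch below (my guess at the RFC signatures:
rte_tqs_start() returning a token and rte_tqs_check(v, t, wait) - treat
them as assumptions). One function covers both cases; workers pass their
w[] index, control threads pass the invalid value to skip step #3:

	#include <stdint.h>
	#include <limits.h>

	#define TQS_ID_INVALID	UINT_MAX

	/* do {2,3,4} for workers, {2,4} for control threads */
	static inline int
	rte_tqs_synchronize(struct rte_tqs *v, unsigned int id)
	{
		uint64_t t;

		t = rte_tqs_start(v);		/* step 2 */
		if (id != TQS_ID_INVALID)
			rte_tqs_update(v, id);	/* step 3, workers only */
		return rte_tqs_check(v, t, 1);	/* step 4, wait == 1 */
	}

Again, not a replacement for the existing API - just a convenience
wrapper that documents the proper sequence.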