From: "Ananyev, Konstantin"
To: Honnappa Nagarahalli, dev@dpdk.org
CC: nd, Dharmik Thakkar, Malvika Gupta, "Gavin Hu (Arm Technology China)"
Date: Mon, 17 Dec 2018 13:14:38 +0000
Message-ID: <2601191342CEEE43887BDE71AB977258010D8BB372@IRSMSX106.ger.corp.intel.com>
References: <20181122033055.3431-1-honnappa.nagarahalli@arm.com>
 <20181122033055.3431-3-honnappa.nagarahalli@arm.com>
 <2601191342CEEE43887BDE71AB977258010CEBBBA9@IRSMSX106.ger.corp.intel.com>
 <2601191342CEEE43887BDE71AB977258010CEBD5D4@IRSMSX106.ger.corp.intel.com>
 <2601191342CEEE43887BDE71AB977258010D8B8CEE@IRSMSX106.ger.corp.intel.com>
Subject: Re: [dpdk-dev] [RFC 2/3] tqs: add thread quiescent state library

> > > > > > > +
> > > > > > > +/* Add a reader thread, running on an lcore, to the list of
> > > > > > > + * threads reporting their quiescent state on a TQS variable.
> > > > > > > + */
> > > > > > > +int __rte_experimental
> > > > > > > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
> > > > > > > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >= RTE_TQS_MAX_LCORE),
> > > > > > > +		-EINVAL);
> > > > > >
> > > > > > It is not good practice to make a function return different values
> > > > > > and behave differently in debug and non-debug mode.
> > > > > > I'd say that for slow-path (functions in .c) it is always good
> > > > > > to check input parameters.
> > > > > > For fast-path (functions in .h) we sometimes skip such checking,
> > > > > > but debug mode can probably use RTE_ASSERT() or so.
> > > > > Makes sense, I will change this in the next version.
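
To illustrate the split suggested above, a rough sketch (untested; the
rte_tqs names and RTE_TQS_MAX_LCORE follow the RFC, everything else is my
assumption; RTE_ASSERT() compiles to nothing unless RTE_ENABLE_ASSERT is
defined):

	#include <errno.h>
	#include <rte_debug.h>

	struct rte_tqs;	/* as defined in the RFC */

	/* slow-path, lives in the .c file: always validate inputs */
	int __rte_experimental
	rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id)
	{
		if (v == NULL || lcore_id >= RTE_TQS_MAX_LCORE)
			return -EINVAL;
		/* ... rest of the registration as in the RFC ... */
		return 0;
	}

	/* fast-path, lives in the .h file: debug-only assertion */
	static inline void
	rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id)
	{
		RTE_ASSERT(v != NULL && lcore_id < RTE_TQS_MAX_LCORE);
		/* ... update the per-lcore counter as in the RFC ... */
	}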

> > > > > > > lcore_id >= RTE_TQS_MAX_LCORE
> > > > > >
> > > > > > Is this limitation really necessary?
> > > > > I added this limitation because currently a DPDK application cannot
> > > > > take a mask more than 64 bits wide. Otherwise, this should be as
> > > > > big as RTE_MAX_LCORE. I see that in the case of the '-lcores'
> > > > > option, the number of lcores can be more than the number of PEs.
> > > > > In this case, we still need a MAX limit (but it can be bigger
> > > > > than 64).
> > > > > > First, it means that only lcores can use that API (at least the
> > > > > > data-path part); second, even today many machines have more
> > > > > > than 64 cores. I think you can easily avoid such a limitation
> > > > > > if, instead of requiring lcore_id as an input parameter, you
> > > > > > just make it return the index of the next available entry in
> > > > > > w[]. Then tqs_update() can take that index as input parameter.
> > > > > I had thought about a similar approach based on IDs. I was
> > > > > concerned that the ID would be one more thing for the application
> > > > > to manage. But, I see the limitations of the current approach
> > > > > now. I will change it to allocation based. This will support even
> > > > > non-EAL pthreads as well.
> > > >
> > > > Yes, with such an approach non-lcore threads will be able to use it
> > > > also.
> > > >
> > > I realized that rte_tqs_register_lcore/rte_tqs_unregister_lcore need
> > > to be efficient as they can be called from the worker's packet
> > > processing loop (rte_event_dequeue_burst allows blocking, so the
> > > worker thread needs to call rte_tqs_unregister_lcore before calling
> > > rte_event_dequeue_burst and rte_tqs_register_lcore before starting
> > > packet processing). Allocating the thread ID in these functions will
> > > make them more complex.
> > >
> > > I suggest that we change the argument 'lcore_id' to 'thread_id'. The
> > > application could use 'lcore_id' as 'thread_id' if threads are mapped
> > > to physical cores 1:1.
> > >
> > > If the threads are not mapped 1:1 to physical cores, the threads need
> > > to use a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not
> > > see that DPDK has a thread_id concept. For TQS, the thread IDs are
> > > global (i.e. not per TQS variable). I could provide APIs to do the
> > > thread ID allocation, but I think the thread ID allocation should not
> > > be part of this library. Such a thread ID might be useful for other
> > > libraries.
> >
> > I don't think there is any point in introducing a new thread_id
> > concept just for that library.
> Currently, we have the rte_gettid API. It is used by rte_spinlock.
> However, the thread ID returned there is the thread ID as defined by the
> OS. The rte_spinlock APIs do not care who defines the thread ID as long
> as those IDs are unique per thread. I think, if we have a thread_id
> concept that covers non-EAL threads as well, it might help other
> libraries too. For ex: [1] talks about the limitation of the per-lcore
> cache. I think this limitation can be removed easily if we could have a
> thread_id that is in a small, well defined space (rather than the OS
> defined thread ID, which may be an arbitrary number). I see similar
> issues mentioned for rte_timer.

If we just introduce a new ID (let's name it thread_id), then we just
replace one limitation with another. It would still be local_cache[], now
based on some thread_id instead of the current lcore_id. I don't see how
that would be better than the current scheme. To let an arbitrary thread
use the mempool's cache we need something smarter than just a
local_cache[] per ID, but without loss of performance.

> It might be useful in the dynamic threads Bruce talked about at the
> Dublin summit (I am not sure on this one, just speculating).

That's probably about making lcore_id allocation/freeing dynamic.

> [1] https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#known-issue-label
>
> > After all, we already have the concept of lcore_id, which pretty much
> > serves the same purpose.
> > I still think that we need to either:
> > a) make register/unregister work with any valid lcore_id
> >    (<= RTE_MAX_LCORE)
> I have made this change already, it will be there in the next version.

Ok.

> > b) make register/unregister return an index in w[]
> >
> > For a) we will need a mask bigger than 64 bits.
> > b) would allow non-lcore threads to use the data-path API too, plus
> > w[] would occupy less space, and check() might be faster.
> > Though yes, as a drawback, for b) register/unregister would probably
> > need an extra 'while(CAS(...));' loop.
> Along with the CAS, we also need to search for an available index in the
> array.

Sure, but I thought that one is relatively cheap compared to the CAS
itself (probably not, as the cache line with the data will be shared
between cores).
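
For reference, the kind of thing I have in mind for b) - a purely
hypothetical sketch, not RFC code (the slot bitmap, names and layout are
all made up):

	#include <errno.h>
	#include <stdint.h>
	#include <rte_atomic.h>

	#define TQS_MAX_THREADS	128

	struct tqs_var {
		/* one bit per w[] slot, set when the slot is taken */
		uint64_t used[TQS_MAX_THREADS / 64];
		/* struct tqs_cnt w[TQS_MAX_THREADS]; ... */
	};

	/* returns a free index in w[], or -ENOSPC when all slots are taken */
	int
	tqs_register(struct tqs_var *v)
	{
		unsigned int i, bit;
		uint64_t old;

		for (i = 0; i < TQS_MAX_THREADS / 64; i++) {
			do {
				old = v->used[i];
				if (old == UINT64_MAX)
					break;	/* word full, try the next one */
				bit = __builtin_ctzll(~old); /* lowest free bit */
			} while (rte_atomic64_cmpset(&v->used[i], old,
					old | (1ULL << bit)) == 0);
			if (old != UINT64_MAX)
				return i * 64 + bit;
		}
		return -ENOSPC;
	}

So yes, it is a search plus a CAS retry loop, but the search itself is
just a ctz over the bitmap word.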

> > I suppose the question here is: do you foresee a lot of concurrent
> > register/unregister at data-path?
> IMO, yes, because of the eventdev API being blocking.
> We can solve this by providing separate APIs for allocation/freeing of
> the IDs. I am just questioning where these APIs should be.
>
> > > > > > > +
> > > > > > > +	while (lcore_mask) {
> > > > > > > +		l = __builtin_ctz(lcore_mask);
> > > > > > > +		if (v->w[l].cnt != t)
> > > > > > > +			break;
> > > > > >
> > > > > > As I understand, that makes control-path function progress
> > > > > > dependent on simultaneous invocation of data-path functions.
> > > > > I agree that the control-path function progress (for ex: how long
> > > > > to wait before freeing the memory) depends on the invocation of
> > > > > the data-path functions. The separation of 'start' and 'check',
> > > > > and the option not to block in 'check', provide the flexibility
> > > > > for the control-path to do some other work if it chooses to.
> > > > > > In some cases that might cause the control-path to hang.
> > > > > > Let's say the data-path function wouldn't be called, or the
> > > > > > user invokes control-path and data-path functions from the same
> > > > > > thread.
> > > > > I agree with the case of the data-path function not getting
> > > > > called. I would consider that a programming error. I can document
> > > > > that warning in the rte_tqs_check API.
> > > >
> > > > Sure, it can be documented.
> > > > Though that means that each data-path thread would have to do an
> > > > explicit update() call for every tqs it might use.
> > > > I just think that it would complicate things and might limit usage
> > > > of the library quite significantly.
> > > Each data path thread has to report its quiescent state. Hence, each
> > > data-path thread has to call update() (similar to how
> > > rte_timer_manage() has to be called periodically on the worker
> > > thread).
> >
> > I understand that.
> > Though that means that each data-path thread has to know explicitly
> > what rcu vars it accesses.
> Yes. That is correct. It is both good and bad. It provides flexibility
> to reduce the overhead. For ex: in pipeline mode, it may be that a
> particular data structure is accessed only by some of the threads in the
> application. In this case, this library allows for per-data-structure
> vars, which reduces the overhead. This applies for service cores as
> well.
>
> > It would be hard to adopt such an API with rcu vars used inside some
> > library. But ok, as I understand people do use the QSBR approach in
> > their apps and find it useful.
> It can be adopted in a library with different levels of
> assumptions/constraints:
> 1) With the assumption that the data plane threads will update the
> quiescent state. For ex: for the rte_hash library we could ask the user
> to pass the TQS variable as input, and the rte_hash writer APIs can call
> the rte_tqs_start and rte_tqs_check APIs.
> 2) If the assumption in 1) is not good, updating of the quiescent state
> can be hidden in the library, but again with the assumption that the
> data plane library API is called on a regular basis. For ex:
> rte_tqs_update can be called within the rte_hash_lookup API.
> 3) If we do not want to assume that the data plane API will be called on
> a regular basis, then the rte_tqs_register/unregister APIs need to be
> used before and after entering the critical section, along with calling
> the rte_tqs_update API. For ex: rte_hash_lookup should have the sequence
> rte_tqs_register, <critical section>, rte_tqs_unregister,
> rte_tqs_update. (very similar to GP)

#3 is surely possible, but it seems quite expensive.
Anyway, as I said before, people do use the QSBR approach - it has small
overhead for readers and is relatively straightforward.
So let's start with that one, and keep some ability to extend the lib
with new methods in the future.
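
Just to illustrate the per-call cost of 3) - a hypothetical sketch of
what every lookup would then have to do (the rte_tqs_* signatures and the
__rte_hash_lookup helper are my assumptions, not RFC code):

	int
	rte_hash_lookup_tqs(const struct rte_hash *h, const void *key,
			struct rte_tqs *v)
	{
		int id, ret;

		/* join the readers' list: bitmap search + CAS */
		id = rte_tqs_register(v);
		if (id < 0)
			return id;

		/* the critical section: the lookup itself */
		ret = __rte_hash_lookup(h, key);

		/* leave the readers' list and report quiescent state,
		 * per the sequence in 3) above */
		rte_tqs_unregister(v, id);
		rte_tqs_update(v, id);

		return ret;
	}

So each lookup pays for something like two atomic updates on a shared
cache line, which is exactly the reader-side cost QSBR tries to avoid.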

> >
> > > Do you have any particular use case in mind where this fails?
> >
> > Let's say it means that the library can't be used to add/del RX/TX
> > ethdev callbacks in a safe manner.
> I need to understand this better. I will look at the rte_ethdev library.

Ok, you can also have a look at lib/librte_bpf/bpf_pkt.c to check how we
overcome it now.

> > BTW, two side questions:
> > 1) As I understand, what you propose is very similar to the QSBR main
> > concept. Wouldn't it be better to name it accordingly, to avoid
> > confusion (or at least document it somewhere)?
> > I think someone else already raised that question.
> QSBR stands for Quiescent State Based Reclamation. This library already
> has 'Thread Quiescent State' in the name. Others have
> questioned/suggested why not use RCU instead. I called it thread
> quiescent state as this library just helps determine if all the readers
> have entered the quiescent state. It does not do anything else.
>
> However, you are also bringing up an important point: 'will we add
> other methods of memory reclamation?'. With that in mind, maybe we
> should not call it RCU. But maybe call it rte_rcu_qsbr_xxx? It will
> also future-proof the API in case we want to add additional RCU types.

Yep, that sounds like a good approach to me.

> > 2) Would QSBR be the only technique in that lib?
> > Any plans to add something similar to the GP one too (with MBs at the
> > reader side)?
> I believe, by GP, you mean general-purpose RCU.

Yes.

> In my understanding QSBR is the one with the least overhead. For DPDK
> applications, I think reducing that overhead is important. GP adds
> additional overhead on the reader side.

Yes, but it provides better flexibility.

> I did not see a need to add any additional ones as of now.

Let's say, your #3 solution above: I think GP will be cheaper than
register/unregister for each library call.
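
For reference, a GP-style read-side section would look something like the
below - a purely illustrative sketch, nothing to do with the RFC code
(the per-reader state and names are made up):

	#include <stdint.h>
	#include <rte_atomic.h>

	struct gp_reader {
		/* non-zero while inside a read-side critical section;
		 * should be cache-line aligned, one per reader thread */
		volatile uint32_t nesting;
	};

	static inline void
	gp_read_lock(struct gp_reader *r)
	{
		r->nesting++;
		rte_smp_mb();	/* flag visible before the protected loads */
	}

	static inline void
	gp_read_unlock(struct gp_reader *r)
	{
		rte_smp_mb();	/* protected loads done before clearing flag */
		r->nesting--;
	}

The full barrier on every read-side enter/exit is the extra reader
overhead mentioned above; in exchange, readers don't have to report
quiescent states periodically.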

> > > > > In the case of the same thread calling both control-path and
> > > > > data-path functions, it would depend on the sequence of the
> > > > > calls. The following sequence should not cause any hangs:
> > > > > Worker thread
> > > > > 1) 'deletes' an entry from a lock-free data structure
> > > > > 2) rte_tqs_start
> > > > > 3) rte_tqs_update
> > > > > 4) rte_tqs_check (wait == 1 or wait == 0)
> > > > > 5) 'free' the entry deleted in 1)
> > > >
> > > > That's an interesting idea, and it should help, I think.
> > > > Probably worth having the {2,3,4} sequence as a new high-level
> > > > function.
> > > >
> > > Yes, this is a good idea. Such a function would be applicable only
> > > in the worker thread. I would prefer to leave it to the application
> > > to take care of.
> >
> > Yes, it would be applicable only to a worker thread, but why can't we
> > have a function for it?
> > Let's say it could be 2 different functions: one doing {2,3,4} - for
> > worker threads, and a second doing just {2,4} - for control threads.
> > Or it could be just one function that takes an extra parameter:
> > lcore_id/w[] index. If it is some predefined invalid value (-1 or
> > so), step #3 will be skipped.
> The rte_tqs_start and rte_tqs_check are separated into 2 APIs so that
> the writers do not have to spend CPU/memory cycles polling for the
> readers' quiescent state. In the context of DPDK, this overhead will be
> significant (at least equal to the length of 1 while loop on the worker
> core). This is one of the key features of this library. Combining
> 2,[3],4 will defeat this purpose. For ex: in the rte_hash library,
> whenever a writer on the data path calls rte_hash_add, (with 2,3,4
> combined) it will wait for the rest of the readers to enter the
> quiescent state, i.e. the performance will come down whenever
> rte_hash_add is called.

I am not suggesting we replace start+[update+]check with one
mega-function. NP with the currently defined API.
I am talking about an additional function for users where performance is
not the main concern - they just need a function that does things in the
proper way for them.
I think having such an extra function will simplify their life; again,
they can use it as a reference to understand the proper call sequence
they need on their own.
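
E.g., something like the sketch below (my guess at the RFC signatures:
rte_tqs_start() returning a token and rte_tqs_check(v, t, wait) - treat
them as assumptions). One function covers both cases; workers pass their
w[] index, control threads pass the invalid value to skip step #3:

	#include <stdint.h>
	#include <limits.h>

	#define TQS_ID_INVALID	UINT_MAX

	/* do {2,3,4} for workers, {2,4} for control threads */
	static inline int
	rte_tqs_synchronize(struct rte_tqs *v, unsigned int id)
	{
		uint64_t t;

		t = rte_tqs_start(v);		/* step 2 */
		if (id != TQS_ID_INVALID)
			rte_tqs_update(v, id);	/* step 3, workers only */
		return rte_tqs_check(v, t, 1);	/* step 4, wait == 1 */
	}

Again, not a replacement for the existing API - just a convenience
wrapper that documents the proper sequence.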