From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "dev@dpdk.org"
 <dev@dpdk.org>
CC: nd <nd@arm.com>, Dharmik Thakkar <Dharmik.Thakkar@arm.com>, Malvika Gupta
 <Malvika.Gupta@arm.com>, "Gavin Hu (Arm Technology China)"
 <Gavin.Hu@arm.com>, nd <nd@arm.com>, nd <nd@arm.com>
Date: Thu, 13 Dec 2018 07:39:30 +0000
Message-ID: <AM6PR08MB36727D3C28BB5A6DEEC9B9AA98A00@AM6PR08MB3672.eurprd08.prod.outlook.com>
References: <20181122033055.3431-1-honnappa.nagarahalli@arm.com>
 <20181122033055.3431-3-honnappa.nagarahalli@arm.com>
 <2601191342CEEE43887BDE71AB977258010CEBBBA9@IRSMSX106.ger.corp.intel.com>
 <AM6PR08MB36727419D83871C334E3A78898D00@AM6PR08MB3672.eurprd08.prod.outlook.com>
 <2601191342CEEE43887BDE71AB977258010CEBD5D4@IRSMSX106.ger.corp.intel.com>
 <AM6PR08MB3672057B31512AE4235C004598AA0@AM6PR08MB3672.eurprd08.prod.outlook.com>
 <2601191342CEEE43887BDE71AB977258010D8B8CEE@IRSMSX106.ger.corp.intel.com>
In-Reply-To: <2601191342CEEE43887BDE71AB977258010D8B8CEE@IRSMSX106.ger.corp.intel.com>
Subject: Re: [dpdk-dev] [RFC 2/3] tqs: add thread quiescent state library

>
>
> > > >
> > > > > > +
> > > > > > +/* Add a reader thread, running on an lcore, to the list of
> > > > > > +threads
> > > > > > + * reporting their quiescent state on a TQS variable.
> > > > > > + */
> > > > > > +int __rte_experimental
> > > > > > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
> > > > > > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=
> > > > > RTE_TQS_MAX_LCORE),
> > > > > > +				-EINVAL);
> > > > >
> > > > > It is not very good practice to make function return different
> > > > > values and behave in a different way in debug/non-debug mode.
> > > > > I'd say that for slow-path (functions in .c) it is always good
> > > > > to check input parameters.
> > > > > For fast-path (functions in .h) we sometimes skip such checking,
> > > > > but debug mode can probably use RTE_ASSERT() or so.
> > > > Makes sense, I will change this in the next version.
> > > >
> > > > >
> > > > >
> > > > > lcore_id >= RTE_TQS_MAX_LCORE
> > > > >
> > > > > Is this limitation really necessary?
> > > > I added this limitation because currently DPDK application cannot
> > > > take a mask more than 64bit wide. Otherwise, this should be as big
> > > > as
> > > RTE_MAX_LCORE.
> > > > I see that in the case of '-lcores' option, the number of lcores
> > > > can be more than the number of PEs. In this case, we still need a
> > > > MAX limit (but
> > > can be bigger than 64).
> > > >
> > > > > First it means that only lcores can use that API (at least
> > > > > data-path part), second even today many machines have more than
> > > > > 64 cores.
> > > > > I think you can easily avoid such limitation, if instead of
> > > > > requiring lcore_id as input parameter, you'll just make it
> > > > > return index of
> > > next available entry in w[].
> > > > > Then tqs_update() can take that index as input parameter.
> > > > I had thought about a similar approach based on IDs. I was
> > > > concerned that ID will be one more thing to manage for the
> > > > application. But, I see the
> > > limitations of the current approach now. I will change it to
> > > allocation based.
> > > This will support even non-EAL pthreads as well.
> > >
> > > Yes, with such approach non-lcore threads will be able to use it also.
> > >
> > I realized that rte_tqs_register_lcore/ rte_tqs_unregister_lcore need
> > to be efficient as they can be called from the worker's packet
> > processing loop (rte_event_dequeue_burst allows blocking. So, the
> > worker thread needs to call rte_tqs_unregister_lcore before calling
> rte_event_dequeue_burst and rte_tqs_register_lcore before starting packet
> processing). Allocating the thread ID in these functions will make them
> more complex.
> >
> > I suggest that we change the argument 'lcore_id' to 'thread_id'. The
> > application could use 'lcore_id' as 'thread_id' if threads are mapped
> > to physical cores 1:1.
> >
> > If the threads are not mapped 1:1 to physical cores, the threads need
> > to use a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not
> > see that DPDK has a thread_id concept. For TQS, the thread IDs are
> > global (i.e. not per TQS variable). I could provide APIs to do the
> > thread ID allocation, but I think the thread ID allocation should not
> > be part of this library. Such thread ID might be useful for other
> > libraries.
>
> I don't think there is any point to introduce new thread_id concept just
> for that library.
Currently, we have the rte_gettid API, which is used by rte_spinlock.
However, the thread ID it returns is the one defined by the OS. The
rte_spinlock APIs do not care who defines the thread ID as long as the IDs
are unique per thread. I think a thread_id concept that covers non-EAL
threads as well might help other libraries too. For example, [1] talks
about the limitation of the per-lcore cache. That limitation could be
removed easily if we had a thread_id in a small, well-defined space (rather
than the OS-defined thread ID, which may be an arbitrary number). I see
similar issues mentioned for rte_timer.
It might also be useful for the dynamic threads Bruce talked about at the
Dublin summit (I am not sure about this one, just speculating).

[1] https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#known-issue-label

> After all we already have a concept of lcore_id which pretty much serves
> the same purpose.
> I still think that we need to either:
> a) make register/unregister to work with any valid lcore_id (<=
> RTE_MAX_LCORE)
I have made this change already, it will be there in the next version.

> b) make register/unregister to return index in w[]
>
> For a) will need mask bigger than 64bits.
> b)  would allow to use data-path API by non-lcores threads too, plus w[]
> would occupy less space, and check() might be faster.
> Though yes, as a drawback, for b) register/unregister probably would need
> extra 'while(CAS(...));' loop.
Along with the CAS, we also need to search for an available index in the
array.

> I suppose the question here do you foresee a lot of concurrent
> register/unregister at data-path?
IMO, yes, because of the event dev API being blocking.
We can solve this by providing separate APIs for allocation/freeing of the
IDs. I am just questioning where these APIs should be.

>
> >
> > <snip>
> >
> > >
> > > >
> > > > >
> > > > > > +
> > > > > > +		while (lcore_mask) {
> > > > > > +			l = __builtin_ctz(lcore_mask);
> > > > > > +			if (v->w[l].cnt != t)
> > > > > > +				break;
> > > > >
> > > > > As I understand, that makes control-path function progress
> > > > > dependent on simultaneous invocation of data-path functions.
> > > > I agree that the control-path function progress (for ex: how long
> > > > to wait for freeing the memory) depends on invocation of the
> > > > data-path functions. The separation of 'start', 'check' and the
> > > > option not to block in
> > > 'check' provide the flexibility for control-path to do some other
> > > work if it chooses to.
> > > >
> > > > > In some cases that might cause control-path to hang.
> > > > > Let say if data-path function wouldn't be called, or user
> > > > > invokes control-path and data-path functions from the same thread.
> > > > I agree with the case of data-path function not getting called. I
> > > > would consider that as programming error. I can document that
> > > > warning in
> > > the rte_tqs_check API.
> > >
> > > Sure, it can be documented.
> > > Though that means, that each data-path thread would have to do
> > > explicit
> > > update() call for every tqs it might use.
> > > I just think that it would complicate things and might limit usage
> > > of the library quite significantly.
> > Each data path thread has to report its quiescent state. Hence, each
> > data-path thread has to call update() (similar to how
> > rte_timer_manage() has to be called periodically on the worker thread).
>
> I understand that.
> Though that means that each data-path thread has to know explicitly what
> rcu vars it accesses.
Yes, that is correct. It is both good and bad: it provides the flexibility
to reduce the overhead. For example, in pipeline mode it may be that a
particular data structure is accessed only by some of the threads in the
application. In this case, this library allows per-data-structure TQS
variables, which reduces the overhead. This applies to service cores as
well.

> Would be hard to adopt such API with rcu vars used inside some library.
> But ok, as I understand people do use QSBR approach in their apps and
> find it useful.
It can be adopted in a library with different levels of
assumptions/constraints:
1) With the assumption that the data plane threads will update the
quiescent state. For example, for the rte_hash library we could ask the
user to pass the TQS variable as input, and the rte_hash writer APIs can
call the rte_tqs_start and rte_tqs_check APIs.
2) If the assumption in 1) is not acceptable, updating the quiescent state
can be hidden in the library, but again with the assumption that the data
plane library API is called on a regular basis. For example, rte_tqs_update
can be called within the rte_hash_lookup API.
3) If we do not want to assume that the data plane API will be called on a
regular basis, then the rte_tqs_register/unregister APIs need to be used
before and after entering the critical section, along with calling the
rte_tqs_update API. For example, rte_hash_lookup would have the sequence
rte_tqs_register, <critical section>, rte_tqs_unregister, rte_tqs_update
(very similar to GP).

>
> > Do you have any particular use case in mind where this fails?
>
> Let say it means that library can't be used to add/del RX/TX ethdev
> callbacks in a safe manner.
I need to understand this better. I will look at rte_ethdev library.

>
> BTW, two side questions:
> 1) As I understand what you propose is very similar to QSBR main concept.
> Wouldn't it be better to name it accordingly to avoid confusion (or at
> least document it somewhere).
> I think someone else already raised that question.
QSBR stands for Quiescent State Based Reclamation. This library already has
'Thread Quiescent State' in the name. Others have questioned/suggested why
not use RCU instead. I called it thread quiescent state because this
library just helps determine whether all the readers have entered the
quiescent state; it does not do anything else.

However, you are also bringing up an important point: will we add other
methods of memory reclamation? With that in mind, maybe we should not call
it RCU. But maybe call it rte_rcu_qsbr_xxx? That will also future-proof the
API in case we want to add additional RCU types.

> 2) Would QSBR be the only technique in that lib?
> Any plans to add something similar to GP one too (with MBs at
> reader-side)?
I believe by GP you mean general-purpose RCU. In my understanding QSBR is
the one with the least overhead, and for DPDK applications I think reducing
that overhead is important. GP adds additional overhead on the reader side.
I did not see a need to add any additional ones as of now, but if there are
use cases that cannot be achieved with the proposed APIs, we can definitely
expand it.

>
> >
> > >
> > > >
> > > > In the case of same thread calling both control-path and data-path
> > > > functions, it would depend on the sequence of the calls. The
> > > > following
> > > sequence should not cause any hangs:
> > > > Worker thread
> > > > 1) 'deletes' an entry from a lock-free data structure
> > > > 2) rte_tqs_start
> > > > 3) rte_tqs_update
> > > > 4) rte_tqs_check (wait == 1 or wait == 0)
> > > > 5) 'free' the entry deleted in 1)
> > >
> > > That an interesting idea, and that should help, I think.
> > > Probably worth to have {2,3,4} sequence as a new high level function.
> > >
> > Yes, this is a good idea. Such a function would be applicable only in
> > the worker thread. I would prefer to leave it to the application to
> > take care.
>
> Yes, it would be applicable only to worker thread, but why we can't have
> a function for it?
> Let say it could be 2 different functions: one doing {2,3,4} - for worker
> threads, and second doing just {2,4} - for control threads.
> Or it could be just one function that takes extra parameter: lcore_id/w[]
> index.
> If it is some predefined invalid value (-1 or so), step #3 will be
> skipped.
The rte_tqs_start and rte_tqs_check calls are separated into 2 APIs so that
the writers do not have to spend CPU/memory cycles polling for the readers'
quiescent state. In the context of DPDK, this overhead will be significant
(at least equal to the length of 1 while loop on the worker core). This is
one of the key features of this library. Combining 2, [3] and 4 would
defeat this purpose. For example, in the rte_hash library, whenever a
writer on the data path calls rte_hash_add (with 2, 3, 4 combined), it will
wait for the rest of the readers to enter the quiescent state, i.e. the
performance will come down whenever rte_hash_add is called.

I am trying to understand your concern. If it is that 'programmers may not
use the right sequence', I would prefer to treat that as a programming
error. Maybe it is better addressed by providing debugging capabilities.

>
> Konstantin