From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from dpdk.org (dpdk.org [92.243.14.124]) by dpdk.space (Postfix) with ESMTP id 33925A05D3 for ; Fri, 26 Apr 2019 14:05:06 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id C48C81B122; Fri, 26 Apr 2019 14:05:05 +0200 (CEST) Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) by dpdk.org (Postfix) with ESMTP id 239D569D4 for ; Fri, 26 Apr 2019 14:05:02 +0200 (CEST) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga003.jf.intel.com ([10.7.209.27]) by fmsmga105.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 26 Apr 2019 05:05:02 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.60,397,1549958400"; d="scan'208";a="145942326" Received: from irsmsx154.ger.corp.intel.com ([163.33.192.96]) by orsmga003.jf.intel.com with ESMTP; 26 Apr 2019 05:04:59 -0700 Received: from irsmsx105.ger.corp.intel.com ([169.254.7.31]) by IRSMSX154.ger.corp.intel.com ([169.254.12.101]) with mapi id 14.03.0415.000; Fri, 26 Apr 2019 13:04:58 +0100 From: "Ananyev, Konstantin" To: Honnappa Nagarahalli , "stephen@networkplumber.org" , "paulmck@linux.ibm.com" , "Kovacevic, Marko" , "dev@dpdk.org" CC: "gavin.hu@arm.com" , "dharmik.thakkar@arm.com" , "malvika.gupta@arm.com" Thread-Topic: [PATCH v8 0/4] lib/rcu: add RCU library supporting QSBR mechanism Thread-Index: AQHU++otMwUDBj7tgk6l9944m0u286ZOV9+w Date: Fri, 26 Apr 2019 12:04:58 +0000 Message-ID: <2601191342CEEE43887BDE71AB9772580148A9C434@irsmsx105.ger.corp.intel.com> References: <20181122033055.3431-1-honnappa.nagarahalli@arm.com> <20190426044000.32670-1-honnappa.nagarahalli@arm.com> In-Reply-To: <20190426044000.32670-1-honnappa.nagarahalli@arm.com> Accept-Language: en-IE, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiNGUxZGEwZjgtNDNjNS00MzIwLWFmZDAtOWFiM2I5ZWYwYzY0IiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX05UIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE3LjEwLjE4MDQuNDkiLCJUcnVzdGVkTGFiZWxIYXNoIjoiSElaK0pldmFubjYxY3VsU1ZxcGRZOFwvQlluRXhGRmZrSzl5dHhWRDRDWGJvTUw5NlEwRG95eVUrb2NWS1dSQmcifQ== x-ctpclassification: CTP_NT dlp-product: dlpe-windows dlp-version: 11.0.600.7 dlp-reaction: no-action x-originating-ip: [163.33.239.181] Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Subject: Re: [dpdk-dev] [PATCH v8 0/4] lib/rcu: add RCU library supporting QSBR mechanism X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Message-ID: <20190426120458.BuXWskQ2laoVXerSNHmj5JgFbfsVJ_N8oAS8H37rNoI@z> > -----Original Message----- > From: Honnappa Nagarahalli [mailto:honnappa.nagarahalli@arm.com] > Sent: Friday, April 26, 2019 5:40 AM > To: Ananyev, Konstantin ; stephen@networkpl= umber.org; paulmck@linux.ibm.com; Kovacevic, Marko > ; dev@dpdk.org > Cc: honnappa.nagarahalli@arm.com; gavin.hu@arm.com; dharmik.thakkar@arm.c= om; malvika.gupta@arm.com > Subject: [PATCH v8 0/4] lib/rcu: add RCU library supporting QSBR mechanis= m >=20 > Lock-less data structures provide scalability and determinism. > They enable use cases where locking may not be allowed > (for ex: real-time applications). >=20 > In the following paras, the term 'memory' refers to memory allocated > by typical APIs like malloc or anything that is representative of > memory, for ex: an index of a free element array. >=20 > Since these data structures are lock less, the writers and readers > are accessing the data structures concurrently. Hence, while removing > an element from a data structure, the writers cannot return the memory > to the allocator, without knowing that the readers are not > referencing that element/memory anymore. Hence, it is required to > separate the operation of removing an element into 2 steps: >=20 > Delete: in this step, the writer removes the reference to the element fro= m > the data structure but does not return the associated memory to the > allocator. This will ensure that new readers will not get a reference to > the removed element. Removing the reference is an atomic operation. >=20 > Free(Reclaim): in this step, the writer returns the memory to the > memory allocator, only after knowing that all the readers have stopped > referencing the deleted element. >=20 > This library helps the writer determine when it is safe to free the > memory. >=20 > This library makes use of thread Quiescent State (QS). QS can be > defined as 'any point in the thread execution where the thread does > not hold a reference to shared memory'. It is upto the application to > determine its quiescent state. Let us consider the following diagram: >=20 > Time --------------------------------------------------> >=20 > | | > RT1 $++++****D1****+++***D2*|**+++|+++**D3*****++++$ > | | > RT2 $++++****D1****++|+**D2|***++++++**D3*****++++$ > | | > RT3 $++++****D1****+++***|D2***|++++++**D2*****++++$ > | | > |<--->| > Del | Free > | > Cannot free memory > during this period > (Grace Period) >=20 > RTx - Reader thread > < and > - Start and end of while(1) loop > ***Dx*** - Reader thread is accessing the shared data structure Dx. > i.e. critical section. > +++ - Reader thread is not accessing any shared data structure. > i.e. non critical section or quiescent state. > Del - Point in time when the reference to the entry is removed using > atomic operation. > Free - Point in time when the writer can free the entry. > Grace Period - Time duration between Del and Free, during which memory ca= nnot > be freed. >=20 > As shown, thread RT1 accesses data structures D1, D2 and D3. When it is > accessing D2, if the writer has to remove an element from D2, the > writer cannot free the memory associated with that element immediately. > The writer can return the memory to the allocator only after the reader > stops referencing D2. In other words, reader thread RT1 has to enter > a quiescent state. >=20 > Similarly, since thread RT3 is also accessing D2, writer has to wait till > RT3 enters quiescent state as well. >=20 > However, the writer does not need to wait for RT2 to enter quiescent stat= e. > Thread RT2 was not accessing D2 when the delete operation happened. > So, RT2 will not get a reference to the deleted entry. >=20 > It can be noted that, the critical sections for D2 and D3 are quiescent s= tates > for D1. i.e. for a given data structure Dx, any point in the thread execu= tion > that does not reference Dx is a quiescent state. >=20 > Since memory is not freed immediately, there might be a need for > provisioning of additional memory, depending on the application requireme= nts. >=20 > It is important to make sure that this library keeps the overhead of > identifying the end of grace period and subsequent freeing of memory, > to a minimum. The following paras explain how grace period and critical > section affect this overhead. >=20 > The writer has to poll the readers to identify the end of grace period. > Polling introduces memory accesses and wastes CPU cycles. The memory > is not available for reuse during grace period. Longer grace periods > exasperate these conditions. >=20 > The length of the critical section and the number of reader threads > is proportional to the duration of the grace period. Keeping the critical > sections smaller will keep the grace period smaller. However, keeping the > critical sections smaller requires additional CPU cycles(due to additiona= l > reporting) in the readers. >=20 > Hence, we need the characteristics of small grace period and large critic= al > section. This library addresses this by allowing the writer to do > other work without having to block till the readers report their quiescen= t > state. >=20 > For DPDK applications, the start and end of while(1) loop (where no > references to shared data structures are kept) act as perfect quiescent > states. This will combine all the shared data structure accesses into a > single, large critical section which helps keep the overhead on the > reader side to a minimum. >=20 > DPDK supports pipeline model of packet processing and service cores. > In these use cases, a given data structure may not be used by all the > workers in the application. The writer does not have to wait for all > the workers to report their quiescent state. To provide the required > flexibility, this library has a concept of QS variable. The application > can create one QS variable per data structure to help it track the > end of grace period for each data structure. This helps keep the grace > period to a minimum. >=20 > The application has to allocate memory and initialize a QS variable. >=20 > Application can call rte_rcu_qsbr_get_memsize to calculate the size > of memory to allocate. This API takes maximum number of reader threads, > using this variable, as a parameter. Currently, a maximum of 1024 threads > are supported. >=20 > Further, the application can initialize a QS variable using the API > rte_rcu_qsbr_init. >=20 > Each reader thread is assumed to have a unique thread ID. Currently, the > management of the thread ID (for ex: allocation/free) is left to the > application. The thread ID should be in the range of 0 to > maximum number of threads provided while creating the QS variable. > The application could also use lcore_id as the thread ID where applicable= . >=20 > rte_rcu_qsbr_thread_register API will register a reader thread > to report its quiescent state. This can be called from a reader thread. > A control plane thread can also call this on behalf of a reader thread. > The reader thread must call rte_rcu_qsbr_thread_online API to start repor= ting > its quiescent state. >=20 > Some of the use cases might require the reader threads to make > blocking API calls (for ex: while using eventdev APIs). The writer thread > should not wait for such reader threads to enter quiescent state. > The reader thread must call rte_rcu_qsbr_thread_offline API, before calli= ng > blocking APIs. It can call rte_rcu_qsbr_thread_online API once the blocki= ng > API call returns. >=20 > The writer thread can trigger the reader threads to report their quiescen= t > state by calling the API rte_rcu_qsbr_start. It is possible for multiple > writer threads to query the quiescent state status simultaneously. Hence, > rte_rcu_qsbr_start returns a token to each caller. >=20 > The writer thread has to call rte_rcu_qsbr_check API with the token to ge= t the > current quiescent state status. Option to block till all the reader threa= ds > enter the quiescent state is provided. If this API indicates that all the > reader threads have entered the quiescent state, the application can free= the > deleted entry. >=20 > The APIs rte_rcu_qsbr_start and rte_rcu_qsbr_check are lock free. Hence, = they > can be called concurrently from multiple writers even while running > as worker threads. >=20 > The separation of triggering the reporting from querying the status provi= des > the writer threads flexibility to do useful work instead of blocking for = the > reader threads to enter the quiescent state or go offline. This reduces t= he > memory accesses due to continuous polling for the status. >=20 > rte_rcu_qsbr_synchronize API combines the functionality of rte_rcu_qsbr_s= tart > and blocking rte_rcu_qsbr_check into a single API. This API triggers the = reader > threads to report their quiescent state and polls till all the readers en= ter > the quiescent state or go offline. This API does not allow the writer to > do useful work while waiting and also introduces additional memory access= es > due to continuous polling. >=20 > The reader thread must call rte_rcu_qsbr_thread_offline and > rte_rcu_qsbr_thread_unregister APIs to remove itself from reporting its > quiescent state. The rte_rcu_qsbr_check API will not wait for this reader > thread to report the quiescent state status anymore. >=20 > The reader threads should call rte_rcu_qsbr_update API to indicate that t= hey > entered a quiescent state. This API checks if a writer has triggered a > quiescent state query and update the state accordingly. >=20 > Patch v8: > 1) Library changes > a) Symbols prefixed with '__RTE' or 'rte_' as required (Thomas) > b) Used PRI?64 macros to support 32b compilation (Thomas) > c) Fixed shared library compilation (Thomas) > 2) Test cases > a) Fixed segmentation fault when more than 20 cores are used for testi= ng (Jerin) > b) Used PRI?64 macros to support 32b compilation (Thomas) > c) Testing done on x86, ThunderX2, Octeon TX, BlueField for 32b(x86 on= ly)/64b, > debug/non-debug, shared/static linking, meson/makefile with various > number of cores >=20 > Patch v7: > 1) Library changes > a) Added macro RCU_IS_LOCK_CNT_ZERO > b) Added lock counter validation to rte_rcu_qsbr_thread_online/ > rte_rcu_qsbr_thread_offline/rte_rcu_qsbr_thread_register/ > rte_rcu_qsbr_thread_unregister APIs (Paul) >=20 > Patch v6: > 1) Library changes > a) Fixed and tested meson build on Arm and x86 (Konstantin) > b) Moved rte_rcu_qsbr_synchronize API to rte_rcu_qsbr.c >=20 > Patch v5: > 1) Library changes > a) Removed extra alignment in rte_rcu_qsbr_get_memsize API (Paul) > b) Added rte_rcu_qsbr_lock/rte_rcu_qsbr_unlock APIs (Paul) > c) Clarified the need for 64b counters (Paul) > 2) Test cases > a) Added additional performance test cases to benchmark > __rcu_qsbr_check_all > b) Added rte_rcu_qsbr_lock/rte_rcu_qsbr_unlock calls in various test c= ases > 3) Documentation > a) Added rte_rcu_qsbr_lock/rte_rcu_qsbr_unlock usage description >=20 > Patch v4: > 1) Library changes > a) Fixed the compilation issue on x86 (Konstantin) > b) Rebased with latest master >=20 > Patch v3: > 1) Library changes > a) Moved the registered thread ID array to the end of the > structure (Konstantin) > b) Removed the compile time constant RTE_RCU_MAX_THREADS > c) Added code to keep track of registered number of threads >=20 > Patch v2: > 1) Library changes > a) Corrected the RTE_ASSERT checks (Konstantin) > b) Replaced RTE_ASSERT with 'if' checks for non-datapath APIs (Konstan= tin) > c) Made rte_rcu_qsbr_thread_register/unregister non-datapath critical = APIs > d) Renamed rte_rcu_qsbr_update to rte_rcu_qsbr_quiescent (Ola) > e) Used rte_smp_mb() in rte_rcu_qsbr_thread_online API for x86 (Konsta= ntin) > f) Removed the macro to access the thread QS counters (Konstantin) > 2) Test cases > a) Added additional test cases for removing RTE_ASSERT > 3) Documentation > a) Changed the figure to make it bigger (Marko) > b) Spelling and format corrections (Marko) >=20 > Patch v1: > 1) Library changes > a) Changed the maximum number of reader threads to 1024 > b) Renamed rte_rcu_qsbr_register/unregister_thread to > rte_rcu_qsbr_thread_register/unregister > c) Added rte_rcu_qsbr_thread_online/offline API. These are optimized > version of rte_rcu_qsbr_thread_register/unregister API. These > also provide the flexibility for performance when the requested > maximum number of threads is higher than the current number of > threads. > d) Corrected memory orderings in rte_rcu_qsbr_update > e) Changed the signature of rte_rcu_qsbr_start API to return the token > f) Changed the signature of rte_rcu_qsbr_start API to not take the > expected number of QS states to wait. > g) Added debug logs > h) Added API and programmer guide documentation. >=20 > RFC v3: > 1) Library changes > a) Rebased to latest master > b) Added new API rte_rcu_qsbr_get_memsize > c) Add support for memory allocation for QSBR variable (Konstantin) > d) Fixed a bug in rte_rcu_qsbr_check (Konstantin) > 2) Testcase changes > a) Separated stress tests into a performance test case file > b) Added performance statistics >=20 > RFC v2: > 1) Cover letter changes > a) Explian the parameters that affect the overhead of using RCU > and their effect > b) Explain how this library addresses these effects to keep > the overhead to minimum > 2) Library changes > a) Rename the library to avoid confusion (Matias, Bruce, Konstantin) > b) Simplify the code/remove APIs to keep this library inline with > other synchronisation mechanisms like locks (Konstantin) > c) Change the design to support more than 64 threads (Konstantin) > d) Fixed version map to remove static inline functions > 3) Testcase changes > a) Add boundary and additional functional test cases > b) Add stress test cases (Paul E. McKenney) >=20 > Dharmik Thakkar (1): > test/rcu_qsbr: add API and functional tests >=20 > Honnappa Nagarahalli (3): > rcu: add RCU library supporting QSBR mechanism > doc/rcu: add lib_rcu documentation > doc: added RCU to the release notes >=20 > MAINTAINERS | 5 + > app/test/Makefile | 2 + > app/test/autotest_data.py | 12 + > app/test/meson.build | 7 +- > app/test/test_rcu_qsbr.c | 1014 +++++++++++++++++ > app/test/test_rcu_qsbr_perf.c | 704 ++++++++++++ > config/common_base | 6 + > doc/api/doxy-api-index.md | 3 +- > doc/api/doxy-api.conf.in | 1 + > .../prog_guide/img/rcu_general_info.svg | 509 +++++++++ > doc/guides/prog_guide/index.rst | 1 + > doc/guides/prog_guide/rcu_lib.rst | 185 +++ > doc/guides/rel_notes/release_19_05.rst | 8 + > lib/Makefile | 2 + > lib/librte_rcu/Makefile | 23 + > lib/librte_rcu/meson.build | 7 + > lib/librte_rcu/rte_rcu_qsbr.c | 277 +++++ > lib/librte_rcu/rte_rcu_qsbr.h | 641 +++++++++++ > lib/librte_rcu/rte_rcu_version.map | 13 + > lib/meson.build | 2 +- > mk/rte.app.mk | 1 + > 21 files changed, 3420 insertions(+), 3 deletions(-) > create mode 100644 app/test/test_rcu_qsbr.c > create mode 100644 app/test/test_rcu_qsbr_perf.c > create mode 100644 doc/guides/prog_guide/img/rcu_general_info.svg > create mode 100644 doc/guides/prog_guide/rcu_lib.rst > create mode 100644 lib/librte_rcu/Makefile > create mode 100644 lib/librte_rcu/meson.build > create mode 100644 lib/librte_rcu/rte_rcu_qsbr.c > create mode 100644 lib/librte_rcu/rte_rcu_qsbr.h > create mode 100644 lib/librte_rcu/rte_rcu_version.map >=20 > -- Run UT on my box (SKX) for both x86_64 and i686 over 96 cores. All passed. Tested-by: Konstantin Ananyev > 2.17.1