From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by dpdk.org (Postfix) with ESMTP id 537BC8D9E for ; Wed, 30 Sep 2015 16:30:07 +0200 (CEST) Received: from fmsmga002.fm.intel.com ([10.253.24.26]) by fmsmga103.fm.intel.com with ESMTP; 30 Sep 2015 07:29:53 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.17,612,1437462000"; d="scan'208";a="816265024" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by fmsmga002.fm.intel.com with ESMTP; 30 Sep 2015 07:29:52 -0700 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id t8UETpvH003355; Wed, 30 Sep 2015 15:29:51 +0100 Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id t8UETpFK029149; Wed, 30 Sep 2015 15:29:51 +0100 Received: (from ibetts@localhost) by sivswdev01.ir.intel.com with id t8UETp0g029145; Wed, 30 Sep 2015 15:29:51 +0100 From: ibetts To: dev@dpdk.org Date: Wed, 30 Sep 2015 15:29:44 +0100 Message-Id: <1443623388-29104-2-git-send-email-ian.betts@intel.com> X-Mailer: git-send-email 1.7.4.1 In-Reply-To: <1443623388-29104-1-git-send-email-ian.betts@intel.com> References: <1443623388-29104-1-git-send-email-ian.betts@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Cc: Ian Betts Subject: [dpdk-dev] =?utf-8?q?=5BPATCH_v1_1/5=5D_doc=3A_add_performance-th?= =?utf-8?q?read_sample_application_guide?= X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 Sep 2015 14:30:09 -0000 From: Ian Betts This commit adds documentation for the performance-thread sample application. 
Signed-off-by: Ian Betts --- doc/guides/rel_notes/release_2_2.rst | 6 + doc/guides/sample_app_ug/index.rst | 1 + doc/guides/sample_app_ug/performance_thread.rst | 1221 +++++++++++++++++++++++ 3 files changed, 1228 insertions(+) create mode 100644 doc/guides/sample_app_ug/performance_thread.rst diff --git a/doc/guides/rel_notes/release_2_2.rst b/doc/guides/rel_notes/release_2_2.rst index 5687676..e9772d3 100644 --- a/doc/guides/rel_notes/release_2_2.rst +++ b/doc/guides/rel_notes/release_2_2.rst @@ -52,6 +52,12 @@ Libraries Examples ~~~~~~~~ +* **examples: Introducing a performance thread example** + + This is an l3fwd derivative focused on enabling characterization of + performance with different threading models, including multiple EAL threads + per physical core, and multiple lightweight threads running in an EAL thread. + The example includes a simple cooperative scheduler. Other ~~~~~ diff --git a/doc/guides/sample_app_ug/index.rst b/doc/guides/sample_app_ug/index.rst index 9beedd9..70d4a5c 100644 --- a/doc/guides/sample_app_ug/index.rst +++ b/doc/guides/sample_app_ug/index.rst @@ -73,6 +73,7 @@ Sample Applications User Guide vm_power_management tep_termination proc_info + performance_thread **Figures** diff --git a/doc/guides/sample_app_ug/performance_thread.rst b/doc/guides/sample_app_ug/performance_thread.rst new file mode 100644 index 0000000..497d729 --- /dev/null +++ b/doc/guides/sample_app_ug/performance_thread.rst @@ -0,0 +1,1220 @@ +.. BSD LICENSE + Copyright(c) 2010-2014 Intel Corporation. All rights reserved. + All rights reserved. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions + are met: + + * Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in + the documentation and/or other materials provided with the + distribution. + * Neither the name of Intel Corporation nor the names of its + contributors may be used to endorse or promote products derived + from this software without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + +Performance Thread Sample Application +===================================== + +The performance thread sample application is a derivative of the standard L3 +forwarding application that demonstrates different threading models. + +Overview +-------- +For a general description of the L3 forwarding application's capabilities, +please refer to the documentation of the standard application in +:doc:`l3_forward`. + +The performance thread sample application differs from the standard L3 forward +example in that it divides the RX and TX processing between different threads, +and makes it possible to assign individual threads to different cores. + +Three threading models are considered: + +#. When there is one EAL thread per physical core +#. When there are multiple EAL threads per physical core +#. 
When there are multiple lightweight threads per EAL thread + +Since DPDK release 2.0 it is possible to launch applications using the --lcores +EAL parameter, specifying CPU sets for a physical core. With the performance +thread sample application it is now also possible to assign individual RX +and TX functions to different cores. + +As an alternative to dividing the L3 forwarding work between different EAL +threads, the performance thread sample introduces the possibility to run the +application threads as lightweight threads (L-threads) within one or +more EAL threads. + +In order to facilitate this threading model the example includes a primitive +cooperative scheduler (L-thread) subsystem. More details of the L-thread +subsystem can be found in :ref:`lthread_subsystem`. + +**Note:** While theoretically possible, it is not anticipated that multiple +L-thread schedulers would be run on the same physical core. This mode of +operation should not be expected to yield useful performance and is considered +invalid. + +Compiling the Application +------------------------- +The application is located in the performance-thread folder of the sample +applications directory. + +#. Go to the example application folder: + + .. code-block:: console + + export RTE_SDK=/path/to/rte_sdk + cd ${RTE_SDK}/examples/performance-thread/l3fwd-thread + +#. Set the target (a default target is used if not specified). For example: + + .. code-block:: console + + export RTE_TARGET=x86_64-native-linuxapp-gcc + + See the DPDK Getting Started Guide for possible RTE_TARGET values. + +#. Build the application: + + .. code-block:: console + + make + + +Running the Application +----------------------- + +The application has a number of command line options: + +..
code-block:: console + + ./build/l3fwd-thread [EAL options] -- -p PORTMASK [-P] --rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)] --tx (port,lcore,thread)[,(port,lcore,thread)] [--enable-jumbo [--max-pkt-len PKTLEN]] [--no-numa] [--hash-entry-num] [--ipv6] [--no-lthreads] + +where: + +* -p PORTMASK: Hexadecimal bitmask of ports to configure + +* -P: optional, sets all ports to promiscuous mode so that packets are + accepted regardless of the packet's Ethernet MAC destination address. + Without this option, only packets with the Ethernet MAC destination + address set to the Ethernet address of the port are accepted. + +* --rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]: + the list of NIC RX ports and queues handled by the RX lcores and threads + +* --tx (port,lcore,thread)[,(port,lcore,thread)]: + the list of NIC TX ports handled by the I/O TX lcores and threads. + +* --enable-jumbo: optional, enables jumbo frames + +* --max-pkt-len: optional, maximum packet length in decimal (64-9600) + +* --no-numa: optional, disables NUMA awareness + +* --hash-entry-num: optional, specifies the hash entry number in hex to be set up + +* --ipv6: optional, set this if processing IPv6 packets + +* --no-lthreads: optional, disables the L-thread model and uses the EAL threading model + +The l3fwd-thread application allows you to start packet processing in two threading +models: L-threads (default) and EAL threads (when the "--no-lthreads" parameter is used). +For consistency all parameters are used in the same way in both models. + +* rx parameters + +..
_table_l3fwd_rx_parameters: + ++--------+------------------------------------------------------+ +| port | RX port | ++--------+------------------------------------------------------+ +| queue | RX queue that will be read on the specified RX port | ++--------+------------------------------------------------------+ +| lcore | core to use for the thread | ++--------+------------------------------------------------------+ +| thread | thread id (numbered consecutively from 0 to N) | ++--------+------------------------------------------------------+ + + +* tx parameters + +.. _table_l3fwd_tx_parameters: + ++--------+------------------------------------------------------+ +| port | default port to transmit on if route lookup fails | ++--------+------------------------------------------------------+ +| lcore | core to use for L3 route match and transmit | ++--------+------------------------------------------------------+ +| thread | thread id (numbered consecutively from 0 to N) | ++--------+------------------------------------------------------+ + + + +Running with L-threads +~~~~~~~~~~~~~~~~~~~~~~ + +When the L-thread model is used (the default), the lcore and thread parameters in +--rx/--tx are used to affine threads to the selected scheduler according to these rules: + +**If the lcores are the same, the L-threads are placed on the same scheduler.** + +**If both the lcore and L-thread id are the same, only one L-thread is used and +the queues / rings are polled inside it.** + +For example: + + .. code-block:: console + +    l3fwd-thread -c ff -n 2 -- -P -p 3 \ +        --rx="(0,0,0,0)(1,0,1,1)" \ +        --tx="(0,2,0)(1,3,1)(0,4,2)(1,5,3)(0,6,4)(1,7,5)" + +This places every L-thread on a different lcore. + + .. code-block:: console + +    l3fwd-thread -c ff -n 2 -- -P -p 3 \ +        --rx="(0,0,0,0)(1,0,0,1)" \ +        --tx="(0,1,0)(1,1,1)(0,2,2)(1,2,3)" + +This places the RX L-threads on lcore 0 and the TX L-threads on lcores 1 and 2, + +and so on.
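As an aside, the (port,queue,lcore,thread) tuple syntax accepted by --rx and --tx can be illustrated with a small stand-alone sketch. Python is used here purely for illustration, and parse_tuples is a hypothetical helper, not part of the application:

```python
import re

def parse_tuples(arg):
    """Parse an --rx/--tx style value such as "(0,0,0,0)(1,0,1,1)"
    into a list of integer tuples."""
    return [tuple(int(field) for field in match.group(1).split(","))
            for match in re.finditer(r"\(([^)]*)\)", arg)]

# "(0,0,0,0)(1,0,1,1)" maps port 0 / queue 0 to lcore 0 / thread 0,
# and port 1 / queue 0 to lcore 1 / thread 1.
rx_config = parse_tuples("(0,0,0,0)(1,0,1,1)")
```

Each resulting tuple is one (port, queue, lcore, thread) binding, exactly as listed in the rx parameters table above.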
+ +Running with EAL threads +~~~~~~~~~~~~~~~~~~~~~~~~ + +When the --no-lthreads parameter is used, the L-thread model is turned off and EAL +threads are used for all processing. EAL threads are enumerated in the same way as L-threads, +but the EAL --lcores parameter is used to affine threads to the selected cpu-set (scheduler). + +Thus it is possible to place every RX and TX thread on a different lcore. +If the lcore id is the same, only one EAL thread is used and the queues / rings are +polled inside it. + +For example: + + .. code-block:: console + +    l3fwd-thread -c ff -n 2 -- -P -p 3 \ +        --rx="(0,0,0,0)(1,0,1,1)" \ +        --tx="(0,2,0)(1,3,1)(0,4,2)(1,5,3)(0,6,4)(1,7,5)" \ +        --no-lthreads + +This places every EAL thread on a different lcore. + +To affine two or more EAL threads to one cpu-set, the EAL --lcores parameter is used: + + .. code-block:: console + +    l3fwd-thread -c ff -n 2 --lcores="(0,1)@0,(2,3)@1,(4,5)@2" -- -P -p 3 \ +        --rx="(0,0,0,0)(1,0,1,1)" \ +        --tx="(0,2,0)(1,3,1)(0,4,2)(1,5,3)" \ +        --no-lthreads + +This places the RX EAL threads on lcore 0 and the TX EAL threads on lcores 1 and 2, and so on. + + +Examples +~~~~~~~~ + +For selected scenarios the command line configuration of the application for L-threads +and its corresponding EAL threads command line can be realized as follows: + +a) Start every thread on a different scheduler (1:1): + + .. code-block:: console + +    l3fwd-thread -c ff -n 2 -- -P -p 3 \ +        --rx="(0,0,0,0)(0,1,1,1)(1,0,2,2)(0,1,3,3)" \ +        --tx="(0,4,0)(1,5,1)" + + .. code-block:: console + +    l3fwd-thread -c ff -n 2 -- -P -p 3 \ +        --rx="(0,0,0,0)(0,1,1,1)(1,0,2,2)(0,1,3,3)" \ +        --tx="(0,4,0)(1,5,1)" \ +        --no-lthreads + +b) Start all threads on one scheduler (N:1): + + .. code-block:: console + +    l3fwd-thread -c ff -n 2 -- -P -p 3 \ +        --rx="(0,0,0,0)(0,1,0,1)(1,0,0,2)(0,1,0,3)" \ +        --tx="(0,0,0)(1,0,1)" + +The example above starts 6 L-threads on lcore 0. + + ..
code-block:: console + +    l3fwd-thread -c ff -n 2 --lcores="(0-5)@0" -- -P -p 3 \ +        --rx="(0,0,0,0)(0,1,1,1)(1,0,2,2)(0,1,3,3)" \ +        --tx="(0,4,0)(1,5,1)" \ +        --no-lthreads + +The example above starts 6 EAL threads on cpu-set 0. + + +c) Start threads on different schedulers (N:M): + + .. code-block:: console + +    l3fwd-thread -c ff -n 2 -- -P -p 3 \ +        --rx="(0,0,0,0)(0,1,0,1)(1,0,0,2)(0,1,0,3)" \ +        --tx="(0,1,0)(1,1,1)" + +The example above starts 4 L-threads (0,1,2,3) for RX on lcore 0, and 2 L-threads +for TX on lcore 1. + + .. code-block:: console + +    l3fwd-thread -c ff -n 2 --lcores="(0-3)@0,(4,5)@1" -- -P -p 3 \ +        --rx="(0,0,0,0)(0,1,1,1)(1,0,2,2)(0,1,3,3)" \ +        --tx="(0,4,0)(1,5,1)" \ +        --no-lthreads + +The example above starts 4 EAL threads (0,1,2,3) for RX on cpu-set 0, and +2 EAL threads for TX on cpu-set 1. + + +Explanation +----------- + +The sample application differs little from the standard L3 +forwarding application, and readers are advised to familiarize themselves with the +material covered in the :doc:`l3_forward` documentation before proceeding. + +The following explanation is focused on the way threading is handled in the +performance thread example. + + +Mode of operation with EAL threads +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The performance thread sample application has split the RX and TX functionality +into two different threads, and the pairs of RX and TX threads are +interconnected via software rings. With respect to these rings the RX threads +are producers and the TX threads are consumers. + +On initialization the RX and TX threads are started according to the command +line parameters. + +The RX threads poll the network interface queues and post received packets to a +TX thread via the corresponding software ring. + +The TX threads poll software rings, perform the L3 forwarding hash/LPM match, +and assemble packet bursts before performing burst transmit on the network +interface.
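The RX/TX interconnect described above is, in essence, a single-producer, single-consumer ring. The application itself uses DPDK's rte_ring; the following stand-alone Python sketch only illustrates the head/tail mechanics of such a ring, and is not the application's code:

```python
class SpscRing:
    """Minimal single-producer/single-consumer ring buffer sketch.

    The producer (RX thread) advances head on enqueue; the consumer
    (TX thread) advances tail on dequeue. One slot is kept free to
    distinguish full from empty, so the ring holds size - 1 items.
    """

    def __init__(self, size):
        self.buf = [None] * size
        self.head = 0  # next free slot (producer side)
        self.tail = 0  # next item to consume (consumer side)

    def enqueue(self, pkt):
        nxt = (self.head + 1) % len(self.buf)
        if nxt == self.tail:        # ring full: caller drops or retries
            return False
        self.buf[self.head] = pkt
        self.head = nxt
        return True

    def dequeue(self):
        if self.tail == self.head:  # ring empty
            return None
        pkt = self.buf[self.tail]
        self.tail = (self.tail + 1) % len(self.buf)
        return pkt


# The RX thread posts a burst; the TX thread drains it for transmission.
ring = SpscRing(4)
for pkt in ("pkt0", "pkt1", "pkt2"):
    ring.enqueue(pkt)
burst = []
pkt = ring.dequeue()
while pkt is not None:
    burst.append(pkt)
    pkt = ring.dequeue()
```

Because one thread only ever writes head and the other only ever writes tail, this structure needs no lock between a single producer and a single consumer, which is the property the RX/TX pairing relies on.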
+ +As with the standard L3 forward application, burst draining of residual packets +is performed periodically with the period calculated from elapsed time using +the timestamp counter. + + +Mode of operation with L-threads +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Like the EAL thread configuration, the application has split the RX and TX +functionality into different threads, and the pairs of RX and TX threads are +interconnected via software rings. + +On initialization an L-thread scheduler is started on every EAL thread. On all +but the master EAL thread only a dummy L-thread is initially started. +The L-thread started on the master EAL thread then spawns other L-threads on +different L-thread schedulers according to the command line parameters. + +The RX threads poll the network interface queues and post received packets +to a TX thread via the corresponding software ring. + +The ring interface is augmented by means of an L-thread condition variable that +enables the TX thread to be suspended when the TX ring is empty. The RX thread +signals the condition whenever it posts to the TX ring, causing the TX thread +to be resumed. + +Additionally the TX L-thread spawns a worker L-thread to take care of +polling the software rings, whilst it handles burst draining of the transmit +buffer. + +The worker threads poll the software rings, perform L3 route lookup and +assemble packet bursts. If the TX ring is empty the worker thread suspends +itself by waiting on the condition variable associated with the ring. + +Burst draining of residual packets is performed by the TX thread which sleeps +(using an L-thread sleep function) and resumes periodically to flush the TX +buffer. + +This design means that L-threads that have no work can yield the CPU to other +L-threads and avoid having to constantly poll the software rings. + + +..
_lthread_subsystem: + +The L-thread subsystem +---------------------- +The L-thread subsystem resides in the examples/performance-thread/common +directory and is built and linked automatically when building the l3fwd-lthread +example. + +The subsystem provides a simple cooperative scheduler to enable arbitrary +functions to run as cooperative threads within a single EAL thread. +The subsystem provides a pthread-like API that is intended to assist in +reuse of legacy code written for POSIX pthreads. + +The following sections provide some detail on the features, constraints, +performance and porting considerations when using L-threads. + +.. _comparison_between_lthreads_and_pthreads: + +Comparison between L-threads and POSIX pthreads +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The fundamental difference between the L-thread and pthread models is the +way in which threads are scheduled. The simplest way to think about this is to +consider the case of a processor with a single CPU. To run multiple threads +on a single CPU, the scheduler must frequently switch between the threads, +in order that each thread is able to make timely progress. +This is the basis of any multitasking operating system. + +This section explores the differences between the pthread model and the +L-thread model as implemented in the provided L-thread subsystem. If needed, a +theoretical discussion of preemptive vs. cooperative multithreading can be +found in any good text on operating system design. + +Scheduling and context switching +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +The POSIX pthread library provides an application programming interface to +create and synchronize threads. Scheduling policy is determined by the host OS, +and may be configurable.
The OS may use sophisticated rules to determine which +thread should be run next, threads may suspend themselves or make other threads +ready, and the scheduler may employ a time slice giving each thread a maximum +time quantum after which it will be preempted in favor of another thread that +is ready to run. To complicate matters further, threads may be assigned +different scheduling priorities. + +By contrast the L-thread subsystem is considerably simpler. Logically the +L-thread scheduler performs the same multiplexing function for L-threads +within a single pthread as the OS scheduler does for pthreads within an +application process. The L-thread scheduler is simply the main loop of a +pthread, and in so far as the host OS is concerned it is just a regular +pthread like any other. The host OS is oblivious to the existence of +L-threads and is not at all involved in their scheduling. + +The other and most significant difference between the two models is that +L-threads are scheduled cooperatively. L-threads cannot preempt each +other, nor can the L-thread scheduler preempt a running L-thread (i.e. +there is no time slicing). The consequence is that programs implemented with +L-threads must possess frequent rescheduling points, meaning that they must +explicitly and of their own volition return to the scheduler at frequent +intervals, in order to allow other L-threads an opportunity to proceed. + +In both models switching between threads requires that the current CPU +context is saved and a new context (belonging to the next thread ready to run) +is restored. With pthreads this context switching is handled transparently +and the set of CPU registers that must be preserved between context switches +is as per an interrupt handler. + +An L-thread context switch is achieved by the thread itself making a function +call to the L-thread scheduler. Thus it is only necessary to preserve the +callee-saved registers.
The caller is responsible for saving and restoring any other +registers it is using around a function call, +and this is handled by the compiler. For X86_64 on both Linux and BSD the +System V calling convention is used; this defines registers RBX, RSP, RBP, and +R12-R15 as callee-save registers (for a more detailed discussion a good reference +can be found at https://en.wikipedia.org/wiki/X86_calling_conventions). Taking +advantage of this, and due to the absence of preemption, an L-thread context +switch is achieved with fewer than 20 load/store instructions. + +The scheduling policy for L-threads is fixed: there is no prioritization of +L-threads; all L-threads are equal and scheduling is based on a FIFO +ready queue. + +An L-thread is a struct containing the CPU context of the thread +(saved on context switch) and other useful items. The ready queue contains +pointers to threads that are ready to run. The L-thread scheduler is a simple +loop that polls the ready queue and reads from it the next thread ready to run, +which it resumes by saving the current context (the current position in the +scheduler loop) and restoring the context of the next thread from its thread +struct. Thus an L-thread is always resumed at the last place it yielded. + +A well-behaved L-thread will call the context switch regularly (at least once +in its main loop), thus returning to the scheduler's own main loop. Yielding +inserts the current thread at the back of the ready queue, and the process of +servicing the ready queue is repeated; thus the system runs by flipping back +and forth between L-threads and the scheduler loop. + +In the case of pthreads, the preemptive scheduling, time slicing, and support +for thread prioritization mean that progress is normally possible for any +thread that is ready to run. This comes at the price of a relatively heavier +context switch and scheduling overhead.
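The ready-queue mechanics described above can be mimicked with Python generators, where yield stands in for the voluntary context switch. This is an analogy only: the real L-thread scheduler swaps CPU register contexts, not generator frames:

```python
from collections import deque

def scheduler(threads):
    """FIFO ready queue: pop the next thread, resume it until it
    yields, then re-insert it at the back of the queue."""
    ready = deque(threads)
    trace = []
    while ready:
        t = ready.popleft()
        try:
            trace.append(next(t))   # resume thread until its next yield
            ready.append(t)         # yield puts it at the back of the queue
        except StopIteration:
            pass                    # thread exited: drop it
    return trace

def lthread(name, iterations):
    """A cooperative 'thread': each yield is a voluntary reschedule point."""
    for i in range(iterations):
        yield f"{name}:{i}"

# Two cooperative threads interleave strictly round-robin.
trace = scheduler([lthread("a", 2), lthread("b", 2)])
```

Note how each thread resumes exactly where it last yielded, and a thread that never yielded would monopolize the scheduler, which is the hazard discussed below.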
+ +With L-threads the progress of any particular thread is determined by the +frequency of rescheduling opportunities in the other L-threads. This means that +an errant L-thread monopolizing the CPU might cause scheduling of other threads +to be stalled. Due to the lower cost of context switching, however, voluntary +rescheduling to ensure progress of other threads, if managed sensibly, is not +a prohibitive overhead, and overall performance can exceed that of an +application using pthreads. + +Mutual exclusion +^^^^^^^^^^^^^^^^ +With pthreads, preemption means that threads which share data must observe +some form of mutual exclusion protocol. + +The fact that L-threads cannot preempt each other means that mutual exclusion +devices can be completely avoided. + +Locking to protect shared data can be a significant bottleneck in +multi-threaded applications, so a carefully designed cooperatively scheduled +program can enjoy significant performance advantages. + +So far we have considered only the simplistic case of a single core CPU; +when multiple CPUs are considered, things are somewhat more complex. + +First of all it is inevitable that there must be multiple L-thread schedulers, +one on each EAL thread. So long as these schedulers remain isolated from each +other the above assertions about the potential advantages of cooperative +scheduling hold true. + +A configuration with isolated cooperative schedulers is less flexible than the +pthread model where threads can be affined to run on any CPU. With isolated +schedulers, scaling of applications to utilize fewer or more CPUs according to +system demand is very difficult to achieve. + +The L-thread subsystem makes it possible for L-threads to migrate between +schedulers running on different CPUs. Needless to say, if the migration means +that threads that share data end up running on different CPUs, then this will +introduce the need for some kind of mutual exclusion device.
+ +Of course rte_ring s/w rings can always be used to interconnect threads running +on different cores; however, to protect other kinds of shared data structures, +lock-free constructs or else explicit locking will be required. This is a +consideration for the application design. + +In support of this extended functionality, the L-thread subsystem implements +thread-safe mutexes and condition variables. + +The cost of affining and of condition variable signaling is significantly +lower than the equivalent pthread operations, and so applications using +these features will see a performance benefit. + + +Thread local storage +^^^^^^^^^^^^^^^^^^^^ + +As with applications written for pthreads, an application written for L-threads +can take advantage of thread local storage, in this case local to an L-thread. +An application may save and retrieve a single pointer to application data in +the L-thread struct. + +For legacy and backward compatibility reasons two alternative methods are also +offered: the first is modelled directly on the pthread get/set specific APIs; +the second approach is modelled on the RTE_PER_LCORE macros, whereby PER_LTHREAD +macros are introduced. In both cases the storage is local to the L-thread. + + +.. _constraints_and_performance_implications: + +Constraints and performance implications when using L-threads +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + + +.. _API_compatibility: + +API compatibility +^^^^^^^^^^^^^^^^^ + +The L-thread subsystem provides a set of functions that are logically equivalent +to the corresponding functions offered by the POSIX pthread library; however, not +all pthread functions have a corresponding L-thread equivalent, and not all +features available to pthreads are implemented for L-threads. + +The pthread library offers considerable flexibility via programmable attributes +that can be associated with threads, mutexes, and condition variables.
+ +By contrast the L-thread subsystem has fixed functionality: the scheduler policy +cannot be varied, and L-threads cannot be prioritized. There are no variable +attributes associated with any L-thread objects. L-threads, mutexes, and +condition variables all have fixed functionality. (Note: reserved parameters +are included in the APIs to facilitate possible future support for attributes.) + +The table below lists the pthread and equivalent L-thread APIs with notes on +differences and/or constraints. Where there is no L-thread entry in the table, +the L-thread subsystem provides no equivalent function. + +.. _table_lthread_pthread: + ++-----------------------------+-----------------------------+--------------------+ +| **Pthread function** | **L-thread function** | **Notes** | ++=============================+=============================+====================+ +| pthread_barrier_destroy | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_barrier_init | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_barrier_wait | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_cond_broadcast | lthread_cond_broadcast | See note 1 | ++-----------------------------+-----------------------------+--------------------+ +| pthread_cond_destroy | lthread_cond_destroy | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_cond_init | lthread_cond_init | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_cond_signal | lthread_cond_signal | See note 1 | ++-----------------------------+-----------------------------+--------------------+ +| pthread_cond_timedwait | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_cond_wait | lthread_cond_wait | See note 5 | 
++-----------------------------+-----------------------------+--------------------+ +| pthread_create | lthread_create | See notes 2, 3 | ++-----------------------------+-----------------------------+--------------------+ +| pthread_detach | lthread_detach | See note 4 | ++-----------------------------+-----------------------------+--------------------+ +| pthread_equal | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_exit | lthread_exit | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_getspecific | lthread_getspecific | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_getcpuclockid | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_join | lthread_join | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_key_create | lthread_key_create | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_key_delete | lthread_key_delete | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_mutex_destroy | lthread_mutex_destroy | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_mutex_init | lthread_mutex_init | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_mutex_lock | lthread_mutex_lock | See note 6 | ++-----------------------------+-----------------------------+--------------------+ +| pthread_mutex_trylock | lthread_mutex_trylock | See note 6 | ++-----------------------------+-----------------------------+--------------------+ +| pthread_mutex_timedlock | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_mutex_unlock | lthread_mutex_unlock | | 
++-----------------------------+-----------------------------+--------------------+ +| pthread_once | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_rwlock_destroy | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_rwlock_init | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_rwlock_rdlock | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_rwlock_timedrdlock | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_rwlock_timedwrlock | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_rwlock_tryrdlock | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_rwlock_trywrlock | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_rwlock_unlock | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_rwlock_wrlock | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_self | lthread_current | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_setspecific | lthread_setspecific | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_spin_init | | See note 10 | ++-----------------------------+-----------------------------+--------------------+ +| pthread_spin_destroy | | See note 10 | ++-----------------------------+-----------------------------+--------------------+ +| pthread_spin_lock | | See note 10 | ++-----------------------------+-----------------------------+--------------------+ +| pthread_spin_trylock | | See note 10 | ++-----------------------------+-----------------------------+--------------------+ +| 
pthread_spin_unlock | | See note 10 | ++-----------------------------+-----------------------------+--------------------+ +| pthread_cancel | lthread_cancel | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_setcancelstate | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_setcanceltype | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_testcancel | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_getschedparam | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_setschedparam | | | ++-----------------------------+-----------------------------+--------------------+ +| pthread_yield | lthread_yield | See note 7 | ++-----------------------------+-----------------------------+--------------------+ +| pthread_setaffinity_np | lthread_set_affinity | See notes 2, 3, 8 | ++-----------------------------+-----------------------------+--------------------+ +| | lthread_sleep | See note 9 | ++-----------------------------+-----------------------------+--------------------+ +| | lthread_sleep_clks | See note 9 | ++-----------------------------+-----------------------------+--------------------+ + + +Note 1: + +neither lthread_cond_signal nor lthread_cond_broadcast may be called concurrently +by L-threads running on different schedulers, although multiple L-threads running in the +same scheduler may freely perform signal or broadcast operations. L-threads +running on the same or different schedulers may always safely wait on a condition +variable. + + +Note 2: + +pthread attributes may be used to affine a pthread with a cpu-set. The L-thread +subsystem does not support a cpu-set. An L-thread may be affined only to a +single CPU at any time.
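The mutex-free signaling semantics of note 1 above can be visualized with a small cooperative-scheduler sketch. Python is used purely for illustration and this is not the L-thread implementation: because cooperative threads can never be preempted mid-update, wait() can simply park the current thread on a waiter queue, and signal() moves one waiter back to the ready queue, with no mutex associated with the wait:

```python
from collections import deque

class CoopCondVar:
    """Condition variable for cooperative threads.

    Waiters are parked on a waiter queue (removed from the ready
    queue) and moved back to the ready queue on signal/broadcast.
    No mutex protects these queues: only one cooperative thread can
    run at a time, so the updates cannot interleave.
    """

    def __init__(self):
        self.waiters = deque()

    def wait(self, ready, thread):
        self.waiters.append(thread)      # park: thread is no longer ready

    def signal(self, ready):
        if self.waiters:                 # wake exactly one waiter, if any
            ready.append(self.waiters.popleft())

    def broadcast(self, ready):
        while self.waiters:              # wake every waiter
            ready.append(self.waiters.popleft())


# A TX thread waits on an empty ring; the RX thread signals after
# posting a packet, re-inserting the TX thread into the ready queue.
ready, cond = deque(), CoopCondVar()
cond.wait(ready, "tx-thread")
cond.signal(ready)
```

This mirrors the design described earlier, where the TX L-thread suspends on the ring's condition variable and the RX L-thread's signal makes it runnable again.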
+
+
+Note 3:
+
+If an L-thread is intended to run on a different NUMA node than the node that
+creates it, then when calling lthread_create() it is advantageous to specify
+the destination core as a parameter of lthread_create().
+See :ref:`memory_allocation_and_NUMA_awareness` for details.
+
+
+Note 4:
+
+An L-thread can only detach itself, and cannot detach other L-threads.
+
+
+Note 5:
+
+A wait operation on a pthread condition variable is always associated with and
+protected by a mutex which must be owned by the thread at the time it invokes
+pthread_cond_wait(). By contrast L-thread condition variables are thread safe
+(for waiters) and do not use an associated mutex. Multiple L-threads (including
+L-threads running on other schedulers) can safely wait on an L-thread condition
+variable. As a consequence the performance of an L-thread condition variable is
+typically an order of magnitude faster than its pthread counterpart.
+
+
+Note 6:
+
+Recursive locking is not supported with L-threads; attempts to take a lock
+recursively will be detected and rejected.
+
+
+Note 7:
+
+lthread_yield() will save the current context, insert the current thread at the
+back of the ready queue, and resume the next ready thread. Yielding increases
+ready queue backlog; see :ref:`ready_queue_backlog` for more details about the
+implications of this.
+
+
+N.B. The context switch time as measured from immediately before the call to
+lthread_yield() to the point at which the next ready thread is resumed can be
+an order of magnitude faster than the same measurement for pthread_yield().
+
+
+Note 8:
+
+lthread_set_affinity() is similar to a yield, apart from the fact that the
+yielding thread is inserted into a peer ready queue of another scheduler.
+The peer ready queue is actually a separate thread safe queue, which means that
+threads appearing in the peer ready queue can jump any backlog in the local
+ready queue on the destination scheduler.
+
+The context switch time as measured from the time just before the call to
+lthread_set_affinity() to just after the same thread is resumed on the new
+scheduler can be orders of magnitude faster than the same measurement for
+pthread_setaffinity_np().
+
+
+Note 9:
+
+Although there is no pthread_sleep() function, lthread_sleep() and
+lthread_sleep_clks() can be used wherever sleep(), usleep() or nanosleep()
+might ordinarily be used. The L-thread sleep functions suspend the current
+thread, start an rte_timer and resume the thread when the timer matures.
+The rte_timer_manage() entry point is called on every pass of the scheduler
+loop. This means that the worst case jitter on timer expiry is determined by
+the longest period between context switches of any running L-threads.
+
+In a synthetic test with many threads sleeping and resuming, the measured
+jitter is typically orders of magnitude lower than the same measurement made
+for nanosleep().
+
+
+Note 10:
+
+Spin locks are not provided because they are problematic in a cooperative
+environment; see :ref:`porting_locks_and_spinlocks` for a more detailed
+discussion on how to avoid spin locks.
+
+.. _Thread_local_storage_performance:
+
+Thread local storage
+^^^^^^^^^^^^^^^^^^^^
+
+Of the three L-thread local storage options the simplest and most efficient is
+storing a single application data pointer in the L-thread struct.
+
+The PER_LTHREAD macros involve a run time computation to obtain the address
+of the variable being saved/retrieved, and also require that the accesses are
+de-referenced via a pointer. This means that code which used the RTE_PER_LCORE
+macros may need some slight adjustment when being ported to L-threads
+(see :ref:`porting_thread_local_storage` for hints about porting code that
+makes use of thread local storage).
+
+The get/set specific APIs are consistent with their pthread counterparts both
+in use and in performance.
+
+.. _memory_allocation_and_NUMA_awareness:
+
+Memory allocation and NUMA awareness
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+All memory allocation is from DPDK huge pages, and is NUMA aware. Each
+scheduler maintains its own caches of objects: lthreads, their stacks, TLS,
+mutexes and condition variables. These caches are implemented as unbounded lock
+free MPSC queues. When objects are created they are always allocated from the
+caches on the local core (current EAL thread).
+
+If an L-thread has been affined to a different scheduler, then it can always
+safely free resources to the caches from which they originated (because the
+caches are MPSC queues).
+
+If the L-thread has been affined to a different NUMA node then the memory
+resources associated with it may incur longer access latency.
+
+The commonly used pattern of setting affinity on entry to a thread after it has
+started means that memory allocation for both the stack and TLS will have been
+made from caches on the NUMA node on which the thread's creator is running.
+This has the side effect that access latency will be sub-optimal after
+affining.
+
+This side effect can be mitigated to some extent (although not completely) by
+specifying the destination CPU as a parameter of lthread_create(). This causes
+the L-thread’s stack and TLS to be allocated when it is first scheduled on the
+destination scheduler; if the destination is on another NUMA node this results
+in a more optimal memory allocation.
+
+Note that the lthread struct itself remains allocated from memory on the
+creating node; this is unavoidable because an L-thread is known everywhere by
+the address of this struct.
+
+.. _object_cache_sizing:
+
+Object cache sizing
+^^^^^^^^^^^^^^^^^^^
+
+The per lcore object caches pre-allocate objects in bulk whenever a request to
+allocate an object finds a cache empty. By default 100 objects are
+pre-allocated; this is defined by LTHREAD_PREALLOC in the public API header
+file lthread_api.h.
+This means that the caches constantly grow to meet system demand.
+
+In the present implementation there is no mechanism to reduce the cache sizes
+if system demand reduces. Thus the caches will remain at their maximum extent
+indefinitely.
+
+A consequence of the bulk pre-allocation of objects is that every 100
+(default value) additional new object create operations results in a call to
+rte_malloc. For creation of objects such as L-threads, which trigger the
+allocation of even more objects (i.e. their stacks and TLS), this can
+cause outliers in scheduling performance.
+
+If this is a problem the simplest mitigation strategy is to dimension the
+system by setting the bulk object pre-allocation size to some large number
+that you do not expect to be exceeded. This means the caches will be populated
+once only, the very first time a thread is created.
+
+.. _Ready_queue_backlog:
+
+Ready queue backlog
+^^^^^^^^^^^^^^^^^^^
+
+One of the more subtle performance considerations is managing the ready queue
+backlog. The fewer threads that are waiting in the ready queue, the faster any
+particular thread will get serviced.
+
+In a naive L-thread application with N L-threads simply looping and yielding,
+this backlog will always be equal to the number of L-threads, thus the cost of
+a yield to a particular L-thread will be N times the context switch time.
+
+This side effect can be mitigated by arranging for threads to be suspended and
+waiting to be resumed, rather than polling for work by constantly yielding.
+Blocking on a mutex or condition variable, or even more obviously having a
+thread sleep if it has a low frequency workload, are all mechanisms by which a
+thread can be excluded from the ready queue until it really does need to be
+running. This can have a significant positive impact on performance.
+
+.. _Initialization_and_shutdown_dependencies:
+
+Initialization, shutdown and dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The L-thread subsystem depends on DPDK for huge page allocation and depends on
+the rte_timer subsystem. The DPDK EAL initialization and
+rte_timer_subsystem_init() MUST be completed before the L-thread subsystem
+can be used.
+
+Thereafter initialization of the L-thread subsystem is largely transparent to
+the application. Constructor functions ensure that global variables are properly
+initialized. Other than these global variables, each scheduler is initialized
+independently the first time that an L-thread is created by a particular EAL
+thread.
+
+If the schedulers are to be run as isolated and independent schedulers, with
+no intention that L-threads running on different schedulers will migrate between
+schedulers or synchronize with L-threads running on other schedulers, then
+initialization consists simply of creating an L-thread, and then running the
+L-thread scheduler.
+
+If there will be interaction between L-threads running on different schedulers,
+then it is important that the starting of schedulers on different EAL threads
+is synchronized.
+
+To achieve this an additional initialization step is necessary: set the number
+of schedulers by calling the API function lthread_num_schedulers_set(n), where
+n is the number of EAL threads that will run L-thread schedulers. Setting the
+number of schedulers to a number greater than 0 will cause all schedulers to
+wait until the others have started before beginning to schedule L-threads.
+
+The L-thread scheduler is started by calling the function
+lthread_scheduler_run(); it should be called from the EAL thread and thus
+becomes the main loop of the EAL thread.
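The synchronized start-up sequence described above might be sketched as follows. This is a hedged outline, not the sample application's actual code: the EAL calls are standard DPDK, but the lthread_create() parameters (in particular the use of -1 to mean the local scheduler) are assumptions that should be checked against lthread_api.h.

```c
/* Sketch of synchronized L-thread scheduler start-up (assumed API,
 * error handling omitted for brevity). */
#include <rte_eal.h>
#include <rte_launch.h>
#include <rte_lcore.h>
#include <rte_timer.h>
#include <lthread_api.h>

static void initial_lthread(void *arg);   /* application entry L-thread */

static int sched_main(void *arg)
{
    struct lthread *lt;

    /* Create the first L-thread on this lcore (assumed: -1 = local)... */
    lthread_create(&lt, -1, initial_lthread, NULL);

    /* ...then turn this EAL thread into an L-thread scheduler. This
     * becomes the EAL thread's main loop and only returns after
     * lthread_scheduler_shutdown() and all local L-threads exiting. */
    lthread_scheduler_run();
    return 0;
}

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)   /* EAL MUST be initialized first */
        return -1;
    rte_timer_subsystem_init();         /* required by the L-thread sleep APIs */

    /* Make the schedulers wait for each other before scheduling. */
    lthread_num_schedulers_set(rte_lcore_count());

    rte_eal_mp_remote_launch(sched_main, NULL, CALL_MASTER);
    rte_eal_mp_wait_lcore();
    return 0;
}
```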
+
+The function lthread_scheduler_run() will not return until all threads running
+on the scheduler have exited, and the scheduler has been explicitly stopped by
+calling lthread_scheduler_shutdown(lcore) or lthread_scheduler_shutdown_all().
+
+All these functions do is tell the scheduler that it can exit when there are no
+longer any running L-threads; neither function forces any running L-thread to
+terminate. Any desired application shutdown behavior must be designed and
+built into the application to ensure that L-threads complete in a timely
+manner.
+
+**Important Note:** It is assumed when the scheduler exits that the application
+is terminating for good. The scheduler does not free resources before exiting,
+and running the scheduler a subsequent time will result in undefined behavior.
+
+.. _porting_legacy_code_to_run_on_lthreads:
+
+Porting legacy code to run on L-threads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Legacy code originally written for a pthread environment may be ported to
+L-threads if the considerations about differences in scheduling policy, and
+the constraints discussed in the previous sections, can be accommodated.
+
+This section looks in more detail at some of the issues that may have to be
+resolved when porting code.
+
+.. _pthread_API_compatibility:
+
+pthread API compatibility
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The first step is to establish exactly which pthread APIs the legacy
+application uses, and to understand the requirements of those APIs. If there
+are corresponding L-thread APIs, and where the default pthread functionality
+is used by the application, then, notwithstanding the other issues discussed
+here, it should be feasible to run the application with L-threads. If the
+legacy code modifies the default behavior using attributes then it may be
+necessary to make some adjustments to eliminate those requirements.
+
+.. _blocking_system_calls:
+
+Blocking system API calls
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It is important to understand what other system services the application may be
+using, bearing in mind that in a cooperatively scheduled environment a thread
+cannot block without stalling the scheduler, and with it all other cooperative
+threads. Any kind of blocking system call, for example file or socket IO, is a
+potential problem; a good tool to analyze the application for this purpose is
+the “strace” utility.
+
+There are many strategies to resolve these kinds of issues, each with its
+merits. Possible solutions include:
+
+* Adopting a polled mode of the system API concerned (if available).
+
+* Arranging for another core to perform the function and synchronizing with
+  that core via constructs that will not block the L-thread.
+
+* Affining the thread to another scheduler devoted (as a matter of policy) to
+  handling threads wishing to make blocking calls, and then back again when
+  finished.
+
+
+.. _porting_locks_and_spinlocks:
+
+Locks and spinlocks
+^^^^^^^^^^^^^^^^^^^
+
+Locks and spinlocks are another source of blocking behavior that, for the same
+reasons as system calls, will need to be addressed.
+
+If the application design ensures that the contending L-threads will always
+run on the same scheduler, then it is probably safe to remove locks and spin
+locks completely. The only exception to this rule is if for some reason the
+code performs any kind of context switch whilst holding the lock; this will
+need to be determined before deciding to eliminate a lock.
+
+If a lock cannot be eliminated then an L-thread mutex can be substituted for
+either kind of lock.
+
+An L-thread blocking on an L-thread mutex will be suspended and will cause
+another ready L-thread to be resumed, thus not blocking the scheduler. When
+default behaviour is required, it can be used as a direct replacement for a
+pthread mutex lock.
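As an illustration, substituting an L-thread mutex for a pthread mutex might look like the sketch below. The lthread_mutex_* names and prototypes are assumptions based on lthread_api.h and should be verified against that header; error handling is omitted.

```c
/* Hypothetical sketch: an L-thread mutex as a drop-in for a pthread
 * mutex. Prototypes assumed from lthread_api.h. */
#include <lthread_api.h>

static struct lthread_mutex *lock;

static void setup(void *arg)
{
    /* Assumed prototype: lthread_mutex_init(name, &mutex, attr). */
    lthread_mutex_init("shared-lock", &lock, NULL);
}

static void worker(void *arg)
{
    /* Blocking here suspends only this L-thread; the scheduler simply
     * resumes another ready L-thread instead of stalling. */
    lthread_mutex_lock(lock);
    /* ... critical section on shared state ... */
    lthread_mutex_unlock(lock);
}
```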
+
+Spin locks are typically used when lock contention is likely to be rare and
+where the period during which the lock is held is relatively short. When the
+contending L-threads are running on the same scheduler, an L-thread
+blocking on a spin lock will enter an infinite loop stopping the scheduler
+completely (see :ref:`porting_infinite_loops` below).
+
+If the application design ensures that contending L-threads will always run
+on different schedulers, then it might be reasonable to leave a short spin lock
+that rarely experiences contention in place.
+
+If after all considerations it appears that a spin lock can neither be
+eliminated completely, nor replaced with an L-thread mutex, nor left in place
+as is, then an alternative is to loop on a flag, with a call to lthread_yield()
+inside the loop (n.b. if the contending L-threads might ever run on different
+schedulers the flag will need to be manipulated atomically).
+
+Spinning and yielding is the least preferred solution since it introduces
+ready queue backlog (see also :ref:`ready_queue_backlog`).
+
+.. _porting_sleeps_and_delays:
+
+Sleeps and delays
+^^^^^^^^^^^^^^^^^
+
+Yet another kind of blocking behavior (albeit momentary) are delay functions
+like sleep(), usleep(), nanosleep() etc. All will have the consequence of
+stalling the L-thread scheduler, and unless the delay is very short (e.g. a
+very short nanosleep) calls to these functions will need to be eliminated.
+
+The simplest mitigation strategy is to use the L-thread sleep API functions,
+of which two variants exist: lthread_sleep() and lthread_sleep_clks().
+These functions start an rte_timer against the L-thread, suspend the L-thread
+and cause another ready L-thread to be resumed. The suspended L-thread is
+resumed when the rte_timer matures.
+
+.. _porting_infinite_loops:
+
+Infinite loops
+^^^^^^^^^^^^^^
+
+Some applications have threads with loops that contain no inherent
+rescheduling opportunity, and rely solely on the OS time slicing to share
+the CPU. In a cooperative environment this will stop everything dead. These
+kinds of loops are not hard to identify; in a debug session you will find the
+debugger is always stopping in the same loop.
+
+The simplest solution to this kind of problem is to insert an explicit
+lthread_yield() or lthread_sleep() into the loop. Another solution might be
+to include the function performed by the loop into the execution path of some
+other loop that does in fact yield, if this is possible.
+
+.. _porting_thread_local_storage:
+
+Thread local storage
+^^^^^^^^^^^^^^^^^^^^
+
+If the application uses thread local storage, the use case should be
+studied carefully.
+
+In a legacy pthread application either or both of the __thread prefix and the
+pthread set/get specific APIs may have been used to define storage local
+to a pthread.
+
+In some applications it may be a reasonable assumption that the data could,
+or in fact most likely should, be placed in L-thread local storage.
+
+If the application (like many DPDK applications) has assumed a certain
+relationship between a pthread and the CPU to which it is affined, there is
+a risk that thread local storage may have been used to save some data items
+that are correctly logically associated with the CPU, and other items which
+relate to application context for the thread. Only a good understanding of
+the application will reveal such cases.
+
+If the application requires that an L-thread is able to move between
+schedulers then care should be taken to separate these kinds of data into per
+lcore and per L-thread storage. In this way a migrating thread will bring with
+it the local data it needs, and pick up the new logical core specific values
+from pthread local storage at its new home.
+
+.. _pthread_shim:
+
+Pthread shim
+~~~~~~~~~~~~
+
+A convenient way to get something working with legacy code can be to use a
+shim that adapts pthread API calls to the corresponding L-thread ones.
+This approach will not mitigate any of the porting considerations mentioned
+in the previous sections, but it will reduce the amount of code churn that
+would otherwise be involved. It is a reasonable approach to evaluate
+L-threads, before investing effort in porting to the native L-thread APIs.
+
+Overview
+^^^^^^^^
+The L-thread subsystem includes an example pthread shim. This is a partial
+implementation but does contain the API stubs needed to get basic applications
+running. There is a simple “hello world” application that demonstrates the
+use of the pthread shim.
+
+A subtlety of working with a shim is that the application will still need
+to make use of the genuine pthread library functions, at the very least in
+order to create the EAL threads in which the L-thread schedulers will run.
+This is the case with DPDK initialization, and exit.
+
+To deal with the initialization and shutdown scenarios, the shim is capable
+of switching its adaptor functionality on or off; an application can control
+this behavior by calling the function pt_override_set(). The default state
+is disabled.
+
+(Note: bearing in mind the preceding discussions about the impact of making
+blocking system API calls in a cooperative environment, switching the
+shim in and out on the fly is something that should typically be avoided.)
+
+The pthread shim uses the dynamic linker loader and saves the loaded addresses
+of the genuine pthread API functions in an internal table; when the shim
+functionality is enabled it performs the adaptor function, and when disabled it
+invokes the genuine pthread function.
+
+The function pthread_exit() has additional special handling. The standard
+system header file pthread.h declares pthread_exit()
+with __attribute__((noreturn)). This is an optimization that is possible
+because the pthread is terminating, and it enables the compiler to omit the
+normal handling of the stack and protection of registers, since the function
+is not expected to return and in fact the thread is being destroyed.
+These optimizations are applied in both the callee and the caller of the
+pthread_exit() function.
+
+In our cooperative scheduling environment this behavior is inadmissible.
+The pthread is the L-thread scheduler thread, and, although an L-thread is
+terminating, there must be a return to the scheduler in order that the system
+can continue to run. Further, returning from a function with attribute
+noreturn is invalid and may result in undefined behavior.
+
+The solution is to redefine the pthread_exit function with a macro, causing it
+to be mapped to a stub function in the shim that does not have the (noreturn)
+attribute. This macro is defined in the file pthread_shim.h. The stub function
+is otherwise no different than any of the other stub functions in the shim,
+and will switch between the real pthread_exit() function or the lthread_exit()
+function as required. The only difference is the mapping to the stub by
+macro substitution.
+
+A consequence of this is that the file pthread_shim.h must be included in
+legacy code wishing to make use of the shim. It also means that dynamic linkage
+of a pre-compiled binary that did not include pthread_shim.h is not supported.
+
+Given the requirements for porting legacy code outlined in
+:ref:`porting_legacy_code_to_run_on_lthreads` most applications will require at
+least some minimal adjustment and recompilation to run on L-threads, so
+pre-compiled binaries are unlikely to be encountered in practice.
+
+In summary the shim approach adds some overhead but can be a useful tool to help
+establish the feasibility of a code reuse project.
It is also a fairly
+straightforward task to extend the shim if necessary.
+
+**Note:** Bearing in mind the preceding discussions about the impact of making
+blocking calls, switching the shim in and out on the fly to invoke any
+pthread API which might block is something that should typically be avoided.
+
+
+Building and running the pthread shim
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The shim example application is located in the performance-thread folder of
+the sample applications.
+
+To build and run the pthread shim example:
+
+#. Go to the example applications folder
+
+   .. code-block:: console
+
+       export RTE_SDK=/path/to/rte_sdk
+       cd ${RTE_SDK}/examples/performance-thread/pthread_shim
+
+
+#. Set the target (a default target is used if not specified). For example:
+
+   .. code-block:: console
+
+       export RTE_TARGET=x86_64-native-linuxapp-gcc
+
+   See the DPDK Getting Started Guide for possible RTE_TARGET values.
+
+#. Build the application:
+
+   .. code-block:: console
+
+       make
+
+#. To run the pthread_shim example:
+
+   .. code-block:: console
+
+       lthread-pthread-shim -c <core mask> -n
+
+.. _lthread_diagnostics:
+
+L-thread Diagnostics
+~~~~~~~~~~~~~~~~~~~~
+
+When debugging you must take account of the fact that the L-threads are run in
+a single pthread. The current scheduler is defined by
+RTE_PER_LCORE(this_sched), and the current lthread is stored at
+RTE_PER_LCORE(this_sched)->current_lthread.
+Thus on a breakpoint in a GDB session the current lthread can be obtained by
+displaying the pthread local variable "per_lcore_this_sched->current_lthread".
+
+Another useful diagnostic feature is the possibility to trace significant
+events in the life of an L-thread; this feature is enabled by changing the
+value of LTHREAD_DIAG from 0 to 1 in the file lthread_diag_api.h.
+
+Tracing of events can be individually masked, and the mask may be programmed at
+run time.
+An unmasked event results in a callback that provides information
+about the event. The default callback simply prints trace information.
+The default mask is 0 (all events off); the mask can be modified by calling the
+function lthread_diagnostic_set_mask().
+
+It is possible to register a user callback to implement more sophisticated
+diagnostic functions.
+Object creation events (lthread, mutex, and condition variable) accept, and
+store in the created object, a user supplied reference value from the callback
+function.
+
+The reference value is passed back in all subsequent event callbacks pertaining
+to the object, enabling a user to monitor or count specific events on specific
+objects, for example to monitor for a specific thread signalling a specific
+condition variable, or to monitor all timer events; the possibilities and
+combinations are endless.
+
+The callback can be set by calling the function lthread_diagnostic_enable()
+supplying a callback and an event mask.
+
+Setting LTHREAD_DIAG also enables counting of statistics about cache and
+queue usage, and these statistics can be displayed by calling the function
+lthread_diag_stats_display(). This function also performs a consistency check
+on the caches and queues. The function should only be called from the master
+EAL thread after all slave threads have stopped and returned to the C main
+program, otherwise the consistency check will fail.
-- 
1.9.3