DPDK patches and discussions
* [dpdk-dev] [PATCH v4 1/2] examples: add performance thread sample application
@ 2015-12-02 14:56 ibetts
  2015-12-02 14:56 ` [dpdk-dev] [PATCH v4 2/2] examples: add pthread-shim in performance-thread sample app ibetts
  0 siblings, 1 reply; 2+ messages in thread
From: ibetts @ 2015-12-02 14:56 UTC
  To: dev; +Cc: Ian Betts

From: Ian Betts <ian.betts@intel.com>

This example comprises a layer 3 forwarding derivative intended to
facilitate characterization of performance with different
threading models, specifically:-

1. EAL threads running on different physical cores
2. EAL threads running on the same physical core
3. Lightweight threads running in an EAL thread

Purpose and justification

Since DPDK 2.0 it has been possible to assign multiple EAL threads to
a physical core (case 2 above).
Currently no example application has focused on demonstrating the
performance constraints of differing threading models.

Whilst purpose built applications that fully comprehend the DPDK
single threaded programming model will always yield superior
performance, the desire to preserve ROI in legacy code written for
multithreaded operating environments makes lightweight threads
(case 3 above) worthy of consideration.

As well as aiding with legacy code reuse, it is anticipated that
lightweight threads will make it possible to scale a multithreaded
application with fine granularity allowing an application to more
easily take advantage of headroom on EAL cores, or conversely occupy
more cores, as dictated by system load.

To explore performance with lightweight threads a simple cooperative
scheduler subsystem is being included in this example application.
If the expected benefits and use cases prove to be of value, it is
anticipated that this lightweight thread subsystem would become a
library in some future DPDK release.

Changes in this version:-
  * Copyright updated for 2015
  * fix TLS destructor handling

Signed-off-by: Ian Betts <ian.betts@intel.com>
---
 config/common_linuxapp                             |    1 +
 config/defconfig_i686-native-linuxapp-gcc          |    1 +
 config/defconfig_i686-native-linuxapp-icc          |    1 +
 config/defconfig_x86_64-native-linuxapp-gcc        |    3 +
 config/defconfig_x86_64-native-linuxapp-icc        |    3 +
 doc/guides/sample_app_ug/performance_thread.rst    | 1149 ++++++
 examples/Makefile                                  |    2 +
 examples/performance-thread/Makefile               |   45 +
 .../performance-thread/common/arch/x86/atomic.h    |   59 +
 examples/performance-thread/common/arch/x86/ctx.c  |   93 +
 examples/performance-thread/common/arch/x86/ctx.h  |   57 +
 examples/performance-thread/common/common.mk       |   40 +
 examples/performance-thread/common/lthread.c       |  530 +++
 examples/performance-thread/common/lthread.h       |   99 +
 examples/performance-thread/common/lthread_api.h   |  829 +++++
 examples/performance-thread/common/lthread_cond.c  |  241 ++
 examples/performance-thread/common/lthread_cond.h  |   77 +
 examples/performance-thread/common/lthread_diag.c  |  321 ++
 examples/performance-thread/common/lthread_diag.h  |  129 +
 .../performance-thread/common/lthread_diag_api.h   |  319 ++
 examples/performance-thread/common/lthread_int.h   |  212 ++
 examples/performance-thread/common/lthread_mutex.c |  256 ++
 examples/performance-thread/common/lthread_mutex.h |   52 +
 .../performance-thread/common/lthread_objcache.h   |  160 +
 examples/performance-thread/common/lthread_pool.h  |  333 ++
 examples/performance-thread/common/lthread_queue.h |  303 ++
 examples/performance-thread/common/lthread_sched.c |  600 ++++
 examples/performance-thread/common/lthread_sched.h |  152 +
 examples/performance-thread/common/lthread_timer.h |   47 +
 examples/performance-thread/common/lthread_tls.c   |  254 ++
 examples/performance-thread/common/lthread_tls.h   |   57 +
 examples/performance-thread/l3fwd-thread/Makefile  |   57 +
 examples/performance-thread/l3fwd-thread/main.c    | 3641 ++++++++++++++++++++
 33 files changed, 10123 insertions(+)
 create mode 100644 doc/guides/sample_app_ug/performance_thread.rst
 create mode 100644 examples/performance-thread/Makefile
 create mode 100644 examples/performance-thread/common/arch/x86/atomic.h
 create mode 100644 examples/performance-thread/common/arch/x86/ctx.c
 create mode 100644 examples/performance-thread/common/arch/x86/ctx.h
 create mode 100644 examples/performance-thread/common/common.mk
 create mode 100644 examples/performance-thread/common/lthread.c
 create mode 100644 examples/performance-thread/common/lthread.h
 create mode 100644 examples/performance-thread/common/lthread_api.h
 create mode 100644 examples/performance-thread/common/lthread_cond.c
 create mode 100644 examples/performance-thread/common/lthread_cond.h
 create mode 100644 examples/performance-thread/common/lthread_diag.c
 create mode 100644 examples/performance-thread/common/lthread_diag.h
 create mode 100644 examples/performance-thread/common/lthread_diag_api.h
 create mode 100644 examples/performance-thread/common/lthread_int.h
 create mode 100644 examples/performance-thread/common/lthread_mutex.c
 create mode 100644 examples/performance-thread/common/lthread_mutex.h
 create mode 100644 examples/performance-thread/common/lthread_objcache.h
 create mode 100644 examples/performance-thread/common/lthread_pool.h
 create mode 100644 examples/performance-thread/common/lthread_queue.h
 create mode 100644 examples/performance-thread/common/lthread_sched.c
 create mode 100644 examples/performance-thread/common/lthread_sched.h
 create mode 100644 examples/performance-thread/common/lthread_timer.h
 create mode 100644 examples/performance-thread/common/lthread_tls.c
 create mode 100644 examples/performance-thread/common/lthread_tls.h
 create mode 100644 examples/performance-thread/l3fwd-thread/Makefile
 create mode 100644 examples/performance-thread/l3fwd-thread/main.c

diff --git a/config/common_linuxapp b/config/common_linuxapp
index 2866986..95da485 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -526,3 +526,4 @@ CONFIG_RTE_APP_TEST=y
 CONFIG_RTE_TEST_PMD=y
 CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=n
 CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=n
+
diff --git a/config/defconfig_i686-native-linuxapp-gcc b/config/defconfig_i686-native-linuxapp-gcc
index a90de9b..c101942 100644
--- a/config/defconfig_i686-native-linuxapp-gcc
+++ b/config/defconfig_i686-native-linuxapp-gcc
@@ -49,3 +49,4 @@ CONFIG_RTE_LIBRTE_KNI=n
 # Vectorized PMD is not supported on 32-bit
 #
 CONFIG_RTE_IXGBE_INC_VECTOR=n
+
diff --git a/config/defconfig_i686-native-linuxapp-icc b/config/defconfig_i686-native-linuxapp-icc
index c021321..915bfa0 100644
--- a/config/defconfig_i686-native-linuxapp-icc
+++ b/config/defconfig_i686-native-linuxapp-icc
@@ -49,3 +49,4 @@ CONFIG_RTE_LIBRTE_KNI=n
 # Vectorized PMD is not supported on 32-bit
 #
 CONFIG_RTE_IXGBE_INC_VECTOR=n
+
diff --git a/config/defconfig_x86_64-native-linuxapp-gcc b/config/defconfig_x86_64-native-linuxapp-gcc
index 60baf5b..76d6b10 100644
--- a/config/defconfig_x86_64-native-linuxapp-gcc
+++ b/config/defconfig_x86_64-native-linuxapp-gcc
@@ -40,3 +40,6 @@ CONFIG_RTE_ARCH_64=y
 
 CONFIG_RTE_TOOLCHAIN="gcc"
 CONFIG_RTE_TOOLCHAIN_GCC=y
+
+CONFIG_RTE_PERFORMANCE_THREAD=y
+
diff --git a/config/defconfig_x86_64-native-linuxapp-icc b/config/defconfig_x86_64-native-linuxapp-icc
index 71d1e28..58a1e09 100644
--- a/config/defconfig_x86_64-native-linuxapp-icc
+++ b/config/defconfig_x86_64-native-linuxapp-icc
@@ -40,3 +40,6 @@ CONFIG_RTE_ARCH_64=y
 
 CONFIG_RTE_TOOLCHAIN="icc"
 CONFIG_RTE_TOOLCHAIN_ICC=y
+
+CONFIG_RTE_PERFORMANCE_THREAD=y
+
diff --git a/doc/guides/sample_app_ug/performance_thread.rst b/doc/guides/sample_app_ug/performance_thread.rst
new file mode 100644
index 0000000..6ea83cc
--- /dev/null
+++ b/doc/guides/sample_app_ug/performance_thread.rst
@@ -0,0 +1,1149 @@
+..  BSD LICENSE
+    Copyright(c) 2015 Intel Corporation. All rights reserved.
+    All rights reserved.
+
+    Redistribution and use in source and binary forms, with or without
+    modification, are permitted provided that the following conditions
+    are met:
+
+    * Re-distributions of source code must retain the above copyright
+    notice, this list of conditions and the following disclaimer.
+    * Redistributions in binary form must reproduce the above copyright
+    notice, this list of conditions and the following disclaimer in
+    the documentation and/or other materials provided with the
+    distribution.
+    * Neither the name of Intel Corporation nor the names of its
+    contributors may be used to endorse or promote products derived
+    from this software without specific prior written permission.
+
+    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+Performance Thread Sample Application
+=====================================
+
+The performance thread sample application is a derivative of the standard L3
+forwarding application that demonstrates different threading models.
+
+Overview
+--------
+For a general description of the L3 forwarding application's capabilities
+please refer to the documentation of the standard application in
+:doc:`l3_forward`.
+
+The performance thread sample application differs from the standard L3
+forwarding example in that it divides the TX and RX processing between
+different threads, and makes it possible to assign individual threads to
+different cores.
+
+Three threading models are considered:
+
+#. When there is one EAL thread per physical core.
+#. When there are multiple EAL threads per physical core.
+#. When there are multiple lightweight threads per EAL thread.
+
+Since DPDK release 2.0 it is possible to launch applications using the
+``--lcores`` EAL parameter, specifying cpu-sets for a physical core. With the
+performance thread sample application it is now also possible to assign
+individual RX and TX functions to different cores.
+
+As an alternative to dividing the L3 forwarding work between different EAL
+threads the performance thread sample introduces the possibility to run the
+application threads as lightweight threads (L-threads) within one or
+more EAL threads.
+
+In order to facilitate this threading model the example includes a primitive
+cooperative scheduler (L-thread) subsystem. More details of the L-thread
+subsystem can be found in :ref:`lthread_subsystem`.
+
+**Note:** Whilst theoretically possible, it is not anticipated that multiple
+L-thread schedulers would be run on the same physical core. This mode of
+operation should not be expected to yield useful performance and is considered
+invalid.
+
+Compiling the Application
+-------------------------
+The application is located in the ``performance-thread`` folder within the
+sample applications folder.
+
+#.  Go to the example applications folder
+
+    .. code-block:: console
+
+       export RTE_SDK=/path/to/rte_sdk
+       cd ${RTE_SDK}/examples/performance-thread/l3fwd-thread
+
+#.  Set the target (a default target is used if not specified). For example:
+
+    .. code-block:: console
+
+       export RTE_TARGET=x86_64-native-linuxapp-gcc
+
+    See the *DPDK Linux Getting Started Guide* for possible RTE_TARGET values.
+
+#.  Build the application::
+
+        make
+
+
+Running the Application
+-----------------------
+
+The application has a number of command line options::
+
+    ./build/l3fwd-thread [EAL options] --
+        -p PORTMASK [-P]
+        --rx(port,queue,lcore,thread)[,(port,queue,lcore,thread)]
+        --tx(lcore,thread)[,(lcore,thread)]
+        [--enable-jumbo [--max-pkt-len PKTLEN]]  [--no-numa]
+        [--hash-entry-num] [--ipv6] [--no-lthreads] [--stat-lcore lcore]
+
+Where:
+
+* ``-p PORTMASK``: Hexadecimal bitmask of ports to configure.
+
+* ``-P``: optional, sets all ports to promiscuous mode so that packets are
+  accepted regardless of the packet's Ethernet MAC destination address.
+  Without this option, only packets with the Ethernet MAC destination address
+  set to the Ethernet address of the port are accepted.
+
+* ``--rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]``: the list of
+  NIC RX ports and queues handled by the RX lcores and threads. The parameters
+  are explained below.
+
+* ``--tx (lcore,thread)[,(lcore,thread)]``: the list of TX threads identifying
+  the lcore the thread runs on, and the id of the RX thread with which it is
+  associated. The parameters are explained below.
+
+* ``--enable-jumbo``: optional, enables jumbo frames.
+
+* ``--max-pkt-len``: optional, maximum packet length in decimal (64-9600).
+
+* ``--no-numa``: optional, disables NUMA awareness.
+
+* ``--hash-entry-num``: optional, specifies the number of hash entries, in
+  hex, to be set up.
+
+* ``--ipv6``: optional, set it when forwarding IPv6 packets.
+
+* ``--no-lthreads``: optional, disables the L-thread model and uses the EAL
+  threading model. See below.
+
+* ``--stat-lcore``: optional, run CPU load stats collector on the specified
+  lcore.
+
+The parameters of the ``--rx`` and ``--tx`` options are:
+
+* ``--rx`` parameters
+
+   .. _table_l3fwd_rx_parameters:
+
+   +--------+------------------------------------------------------+
+   | port   | RX port                                              |
+   +--------+------------------------------------------------------+
+   | queue  | RX queue that will be read on the specified RX port  |
+   +--------+------------------------------------------------------+
+   | lcore  | Core to use for the thread                           |
+   +--------+------------------------------------------------------+
+   | thread | Thread id (continuously from 0 to N)                 |
+   +--------+------------------------------------------------------+
+
+
+* ``--tx`` parameters
+
+   .. _table_l3fwd_tx_parameters:
+
+   +--------+------------------------------------------------------+
+   | lcore  | Core to use for L3 route match and transmit          |
+   +--------+------------------------------------------------------+
+   | thread | Id of RX thread to be associated with this TX thread |
+   +--------+------------------------------------------------------+
+
+The ``l3fwd-thread`` application allows you to start packet processing in two
+threading models: L-Threads (default) and EAL Threads (when the
+``--no-lthreads`` parameter is used). For consistency all parameters are used
+in the same way for both models.
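+
+To illustrate the tuple format taken by ``--rx``, the stand-alone sketch below
+parses a string of ``(port,queue,lcore,thread)`` tuples. It is an illustration
+only: ``parse_rx`` and ``struct rx_param`` are hypothetical names, and this is
+not the parser used by the application.
+
+.. code-block:: c
+
+    #include <stdio.h>
+    #include <string.h>
+
+    /* Hypothetical holder for one (port,queue,lcore,thread) tuple. */
+    struct rx_param {
+        unsigned port, queue, lcore, thread;
+    };
+
+    /* Parse "(p,q,l,t)(p,q,l,t)..." into out[].
+     * Returns the number of tuples, or -1 on a malformed or excess tuple. */
+    static int parse_rx(const char *arg, struct rx_param *out, int max)
+    {
+        const char *p = arg;
+        int n = 0;
+
+        while ((p = strchr(p, '(')) != NULL) {
+            if (n == max)
+                return -1;
+            if (sscanf(p, "(%u,%u,%u,%u)", &out[n].port, &out[n].queue,
+                       &out[n].lcore, &out[n].thread) != 4)
+                return -1;
+            n++;
+            p++; /* step past this '(' and look for the next tuple */
+        }
+        return n;
+    }
+
+    int main(void)
+    {
+        struct rx_param rx[8];
+        int n = parse_rx("(0,0,0,0)(1,0,1,1)", rx, 8);
+
+        for (int i = 0; i < n; i++)
+            printf("rx thread %u: port %u queue %u on lcore %u\n",
+                   rx[i].thread, rx[i].port, rx[i].queue, rx[i].lcore);
+        return n == 2 ? 0 : 1;
+    }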
+
+
+Running with L-threads
+~~~~~~~~~~~~~~~~~~~~~~
+
+When the L-thread model is used (default option), lcore and thread parameters
+in ``--rx/--tx`` are used to affinitize threads to the selected scheduler.
+
+For example, the following places each L-thread on a different lcore::
+
+   l3fwd-thread -c ff -n 2 -- -P -p 3 \
+                --rx="(0,0,0,0)(1,0,1,1)" \
+                --tx="(2,0)(3,1)"
+
+The following places both RX L-threads on lcore 0 and the TX L-threads on
+lcores 1 and 2::
+
+   l3fwd-thread -c ff -n 2 -- -P -p 3 \
+                --rx="(0,0,0,0)(1,0,0,1)" \
+                --tx="(1,0)(2,1)"
+
+
+Running with EAL threads
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+When the ``--no-lthreads`` parameter is used, the L-threading model is turned
+off and EAL threads are used for all processing. EAL threads are enumerated in
+the same way as L-threads, but the ``--lcores`` EAL parameter is used to
+affinitize threads to the selected cpu-set (scheduler). Thus it is possible to
+place each RX and TX thread on a different lcore.
+
+For example, the following places each EAL thread on a different lcore::
+
+   l3fwd-thread -c ff -n 2 -- -P -p 3 \
+                --rx="(0,0,0,0)(1,0,1,1)" \
+                --tx="(2,0)(3,1)" \
+                --no-lthreads
+
+
+To affinitize two or more EAL threads to one cpu-set, the EAL ``--lcores``
+parameter is used.
+
+The following places the RX EAL threads (lcores 0 and 1) on cpu-set 0 and the
+TX EAL threads (lcores 2 and 3) on cpu-set 1::
+
+   l3fwd-thread -c ff -n 2 --lcores="(0,1)@0,(2,3)@1" -- -P -p 3 \
+                --rx="(0,0,0,0)(1,0,1,1)" \
+                --tx="(2,0)(3,1)" \
+                --no-lthreads
+
+
+Examples
+~~~~~~~~
+
+For selected scenarios, the L-thread command line and its corresponding EAL
+thread command line are as follows:
+
+a) Start every thread on a different scheduler (1:1)::
+
+      l3fwd-thread -c ff -n 2 -- -P -p 3 \
+                   --rx="(0,0,0,0)(1,0,1,1)" \
+                   --tx="(2,0)(3,1)"
+
+   EAL thread equivalent::
+
+      l3fwd-thread -c ff -n 2 -- -P -p 3 \
+                   --rx="(0,0,0,0)(1,0,1,1)" \
+                   --tx="(2,0)(3,1)" \
+                   --no-lthreads
+
+b) Start all threads on one core (N:1).
+
+   Start 4 L-threads on lcore 0::
+
+      l3fwd-thread -c ff -n 2 -- -P -p 3 \
+                   --rx="(0,0,0,0)(1,0,0,1)" \
+                   --tx="(0,0)(0,1)"
+
+   Start 4 EAL threads on cpu-set 0::
+
+      l3fwd-thread -c ff -n 2 --lcores="(0-3)@0" -- -P -p 3 \
+                   --rx="(0,0,0,0)(1,0,1,1)" \
+                   --tx="(2,0)(3,1)" \
+                   --no-lthreads
+
+c) Start threads on different cores (N:M).
+
+   Start 2 L-threads for RX on lcore 0, and 2 L-threads for TX on lcore 1::
+
+      l3fwd-thread -c ff -n 2 -- -P -p 3 \
+                   --rx="(0,0,0,0)(1,0,0,1)" \
+                   --tx="(1,0)(1,1)"
+
+   Start 2 EAL threads for RX on cpu-set 0, and 2 EAL threads for TX on
+   cpu-set 1::
+
+      l3fwd-thread -c ff -n 2 --lcores="(0-1)@0,(2-3)@1" -- -P -p 3 \
+                   --rx="(0,0,0,0)(1,0,1,1)" \
+                   --tx="(2,0)(3,1)" \
+                   --no-lthreads
+
+Explanation
+-----------
+
+To a great extent the sample application differs little from the standard L3
+forwarding application, and readers are advised to familiarize themselves with
+the material covered in the :doc:`l3_forward` documentation before proceeding.
+
+The following explanation is focused on the way threading is handled in the
+performance thread example.
+
+
+Mode of operation with EAL threads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The performance thread sample application has split the RX and TX functionality
+into two different threads, and the RX and TX threads are
+interconnected via software rings. With respect to these rings the RX threads
+are producers and the TX threads are consumers.
+
+On initialization the TX and RX threads are started according to the command
+line parameters.
+
+The RX threads poll the network interface queues and post received packets to a
+TX thread via a corresponding software ring.
+
+The TX threads poll software rings, perform the L3 forwarding hash/LPM match,
+and assemble packet bursts before performing burst transmit on the network
+interface.
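+
+As a rough illustration of the producer/consumer relationship described above,
+the following is a minimal single-producer/single-consumer ring in plain C11.
+It is only a sketch: the application uses DPDK software rings rather than this
+code, and the names ``demo_ring``, ``demo_enqueue`` and ``demo_dequeue`` are
+hypothetical.
+
+.. code-block:: c
+
+    #include <stdatomic.h>
+    #include <stdbool.h>
+    #include <stdio.h>
+
+    #define DEMO_RING_SIZE 8 /* must be a power of two */
+
+    struct demo_ring {
+        void *slots[DEMO_RING_SIZE];
+        atomic_uint head; /* advanced only by the producer (RX side) */
+        atomic_uint tail; /* advanced only by the consumer (TX side) */
+    };
+
+    /* Producer: returns false if the ring is full. */
+    static bool demo_enqueue(struct demo_ring *r, void *obj)
+    {
+        unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
+        unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
+
+        if (head - tail == DEMO_RING_SIZE)
+            return false;
+        r->slots[head % DEMO_RING_SIZE] = obj;
+        /* Publish the slot before making it visible to the consumer. */
+        atomic_store_explicit(&r->head, head + 1, memory_order_release);
+        return true;
+    }
+
+    /* Consumer: returns false if the ring is empty. */
+    static bool demo_dequeue(struct demo_ring *r, void **obj)
+    {
+        unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
+        unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
+
+        if (head == tail)
+            return false;
+        *obj = r->slots[tail % DEMO_RING_SIZE];
+        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
+        return true;
+    }
+
+    int main(void)
+    {
+        static struct demo_ring ring; /* static storage: zero-initialized */
+        int pkts[3] = { 10, 11, 12 }; /* stand-ins for received packets */
+        void *p;
+
+        for (int i = 0; i < 3; i++)   /* RX side posts to the ring */
+            demo_enqueue(&ring, &pkts[i]);
+        while (demo_dequeue(&ring, &p)) /* TX side drains it in FIFO order */
+            printf("tx got packet %d\n", *(int *)p);
+        return 0;
+    }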
+
+As with the standard L3 forward application, burst draining of residual packets
+is performed periodically with the period calculated from elapsed time using
+the timestamp counter.
+
+The diagram below illustrates a case with two RX threads and three TX threads.
+
+.. _figure_performance_thread_1:
+
+.. figure:: img/performance_thread_1.*
+
+
+Mode of operation with L-threads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Like the EAL thread configuration the application has split the RX and TX
+functionality into different threads, and the pairs of RX and TX threads are
+interconnected via software rings.
+
+On initialization an L-thread scheduler is started on every EAL thread. On all
+but the master EAL thread only a dummy L-thread is initially started.
+The L-thread started on the master EAL thread then spawns other L-threads on
+different L-thread schedulers according to the command line parameters.
+
+The RX threads poll the network interface queues and post received packets
+to a TX thread via the corresponding software ring.
+
+The ring interface is augmented by means of an L-thread condition variable that
+enables the TX thread to be suspended when the TX ring is empty. The RX thread
+signals the condition whenever it posts to the TX ring, causing the TX thread
+to be resumed.
+
+Additionally the TX L-thread spawns a worker L-thread to take care of
+polling the software rings, whilst it handles burst draining of the transmit
+buffer.
+
+The worker threads poll the software rings, perform L3 route lookup and
+assemble packet bursts. If the TX ring is empty the worker thread suspends
+itself by waiting on the condition variable associated with the ring.
+
+Burst draining of residual packets, less than the burst size, is performed by
+the TX thread which sleeps (using an L-thread sleep function) and resumes
+periodically to flush the TX buffer.
+
+This design means that L-threads that have no work can yield the CPU to other
+L-threads and avoid having to constantly poll the software rings.
+
+The diagram below illustrates a case with two RX threads and three TX functions
+(each comprising a thread that processes forwarding and a thread that
+periodically drains the output buffer of residual packets).
+
+.. _figure_performance_thread_2:
+
+.. figure:: img/performance_thread_2.*
+
+
+CPU load statistics
+~~~~~~~~~~~~~~~~~~~
+
+It is possible to display statistics showing estimated CPU load on each core.
+The statistics indicate the percentage of CPU time spent: processing
+received packets (forwarding), polling queues/rings (waiting for work),
+and doing any other processing (context switch and other overhead).
+
+When enabled, statistics are gathered by having the application threads set and
+clear flags when they enter and exit pertinent code sections. The flags are
+then sampled in real time by a statistics collector thread running on another
+core. This thread displays the data in real time on the console.
+
+This feature is enabled by designating a statistics collector core, using the
+``--stat-lcore`` parameter.
+
+
+.. _lthread_subsystem:
+
+The L-thread subsystem
+----------------------
+
+The L-thread subsystem resides in the examples/performance-thread/common
+directory and is built and linked automatically when building the
+``l3fwd-thread`` example.
+
+The subsystem provides a simple cooperative scheduler to enable arbitrary
+functions to run as cooperative threads within a single EAL thread.
+The subsystem provides a pthread-like API that is intended to assist in
+reuse of legacy code written for POSIX pthreads.
+
+The following sections provide some detail on the features, constraints,
+performance and porting considerations when using L-threads.
+
+
+.. _comparison_between_lthreads_and_pthreads:
+
+Comparison between L-threads and POSIX pthreads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The fundamental difference between the L-thread and pthread models is the
+way in which threads are scheduled. The simplest way to think about this is to
+consider the case of a processor with a single CPU. To run multiple threads
+on a single CPU, the scheduler must frequently switch between the threads,
+in order that each thread is able to make timely progress.
+This is the basis of any multitasking operating system.
+
+This section explores the differences between the pthread model and the
+L-thread model as implemented in the provided L-thread subsystem. If needed a
+theoretical discussion of preemptive vs cooperative multi-threading can be
+found in any good text on operating system design.
+
+
+Scheduling and context switching
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The POSIX pthread library provides an application programming interface to
+create and synchronize threads. Scheduling policy is determined by the host OS,
+and may be configurable. The OS may use sophisticated rules to determine which
+thread should be run next. Threads may suspend themselves or make other threads
+ready, and the scheduler may employ a time slice giving each thread a maximum
+time quantum after which it will be preempted in favor of another thread that
+is ready to run. To complicate matters further, threads may be assigned
+different scheduling priorities.
+
+By contrast the L-thread subsystem is considerably simpler. Logically the
+L-thread scheduler performs the same multiplexing function for L-threads
+within a single pthread as the OS scheduler does for pthreads within an
+application process. The L-thread scheduler is simply the main loop of a
+pthread, and in so far as the host OS is concerned it is a regular pthread
+just like any other. The host OS is oblivious to the existence of, and
+not at all involved in the scheduling of, L-threads.
+
+The other and most significant difference between the two models is that
+L-threads are scheduled cooperatively. L-threads cannot preempt each
+other, nor can the L-thread scheduler preempt a running L-thread (i.e.
+there is no time slicing). The consequence is that programs implemented with
+L-threads must possess frequent rescheduling points, meaning that they must
+explicitly and of their own volition return to the scheduler at frequent
+intervals, in order to allow other L-threads an opportunity to proceed.
+
+In both models switching between threads requires that the current CPU
+context is saved and a new context (belonging to the next thread ready to run)
+is restored. With pthreads this context switching is handled transparently
+and the set of CPU registers that must be preserved between context switches
+is as per an interrupt handler.
+
+An L-thread context switch is achieved by the thread itself making a function
+call to the L-thread scheduler. Thus it is only necessary to preserve the
+callee-saved registers. The caller is responsible for saving any other
+registers it is using before a function call and restoring them on return,
+and this is handled by the compiler. For ``X86_64`` on both Linux and BSD the
+System V calling convention is used. This defines registers RBX, RSP, RBP, and
+R12-R15 as callee-save registers (for more detailed discussion a good reference
+is `X86 Calling Conventions <https://en.wikipedia.org/wiki/X86_calling_conventions>`_).
+
+Taking advantage of this, and due to the absence of preemption, an L-thread
+context switch is achieved with less than 20 load/store instructions.
+
+The scheduling policy for L-threads is fixed: there is no prioritization of
+L-threads, all L-threads are equal, and scheduling is based on a FIFO
+ready queue.
+
+An L-thread is a struct containing the CPU context of the thread
+(saved on context switch) and other useful items. The ready queue contains
+pointers to threads that are ready to run. The L-thread scheduler is a simple
+loop that polls the ready queue, reads from it the next thread ready to run,
+which it resumes by saving the current context (the current position in the
+scheduler loop) and restoring the context of the next thread from its thread
+struct. Thus an L-thread is always resumed at the last place it yielded.
+
+A well behaved L-thread will call the context switch regularly (at least once
+in its main loop) thus returning to the scheduler's own main loop. Yielding
+inserts the current thread at the back of the ready queue, and the process of
+servicing the ready queue is repeated, thus the system runs by flipping back
+and forth between L-threads and the scheduler loop.
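+
+The scheduler loop and FIFO ready queue described above can be sketched with
+the POSIX ``ucontext`` functions. This is an illustration under stated
+assumptions, not the L-thread implementation: the L-thread subsystem uses its
+own context switch, and the names ``struct coro``, ``coro_spawn``,
+``coro_yield`` and ``run_scheduler`` are hypothetical.
+
+.. code-block:: c
+
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <ucontext.h>
+
+    #define MAX_COROS  4
+    #define STACK_SIZE (64 * 1024)
+
+    struct coro {
+        ucontext_t ctx; /* saved CPU context of this thread */
+        char *stack;
+        int id;
+        int finished;
+    };
+
+    static ucontext_t sched_ctx;                /* position in the scheduler loop */
+    static struct coro *ready[MAX_COROS + 1];   /* circular FIFO ready queue */
+    static unsigned head, tail;
+    static struct coro *current;
+
+    static void ready_push(struct coro *c)
+    {
+        ready[tail] = c;
+        tail = (tail + 1) % (MAX_COROS + 1);
+    }
+
+    static struct coro *ready_pop(void)
+    {
+        struct coro *c;
+
+        if (head == tail)
+            return NULL;
+        c = ready[head];
+        head = (head + 1) % (MAX_COROS + 1);
+        return c;
+    }
+
+    /* Rescheduling point: requeue at the back, then return to the scheduler. */
+    static void coro_yield(void)
+    {
+        ready_push(current);
+        swapcontext(&current->ctx, &sched_ctx);
+    }
+
+    static void worker(void)
+    {
+        for (int i = 0; i < 3; i++) {
+            printf("thread %d step %d\n", current->id, i);
+            coro_yield();
+        }
+        current->finished = 1; /* returning resumes sched_ctx via uc_link */
+    }
+
+    static void coro_spawn(struct coro *c, int id)
+    {
+        getcontext(&c->ctx);
+        c->stack = malloc(STACK_SIZE);
+        c->ctx.uc_stack.ss_sp = c->stack;
+        c->ctx.uc_stack.ss_size = STACK_SIZE;
+        c->ctx.uc_link = &sched_ctx;
+        c->id = id;
+        c->finished = 0;
+        makecontext(&c->ctx, worker, 0);
+        ready_push(c);
+    }
+
+    /* Scheduler loop: resume the next ready thread until none remain.
+     * Returns the number of times a thread was dispatched. */
+    static int run_scheduler(void)
+    {
+        int dispatched = 0;
+
+        while ((current = ready_pop()) != NULL) {
+            swapcontext(&sched_ctx, &current->ctx);
+            dispatched++;
+            if (current->finished)
+                free(current->stack);
+        }
+        return dispatched;
+    }
+
+    int main(void)
+    {
+        static struct coro coros[2];
+
+        coro_spawn(&coros[0], 0);
+        coro_spawn(&coros[1], 1);
+        run_scheduler();
+        return 0;
+    }
+
+Because each thread yields once per loop iteration, the steps of the two
+threads alternate. Note that ``swapcontext`` also saves and restores the signal
+mask, so it is heavier than the callee-saved register switch described earlier.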
+
+In the case of pthreads, the preemptive scheduling, time slicing, and support
+for thread prioritization means that progress is normally possible for any
+thread that is ready to run. This comes at the price of a relatively heavier
+context switch and scheduling overhead.
+
+With L-threads the progress of any particular thread is determined by the
+frequency of rescheduling opportunities in the other L-threads. This means that
+an errant L-thread monopolizing the CPU might cause scheduling of other threads
+to be stalled. Due to the lower cost of context switching, however, voluntary
+rescheduling to ensure progress of other threads, if managed sensibly, is not
+a prohibitive overhead, and overall performance can exceed that of an
+application using pthreads.
+
+
+Mutual exclusion
+^^^^^^^^^^^^^^^^
+
+With pthreads preemption means that threads that share data must observe
+some form of mutual exclusion protocol.
+
+The fact that L-threads cannot preempt each other means that in many cases
+mutual exclusion devices can be completely avoided.
+
+Locking to protect shared data can be a significant bottleneck in
+multi-threaded applications so a carefully designed cooperatively scheduled
+program can enjoy significant performance advantages.
+
+So far we have considered only the simplistic case of a single core CPU.
+When multiple CPUs are considered things are somewhat more complex.
+
+First of all it is inevitable that there must be multiple L-thread schedulers,
+one running on each EAL thread. So long as these schedulers remain isolated
+from each other the above assertions about the potential advantages of
+cooperative scheduling hold true.
+
+A configuration with isolated cooperative schedulers is less flexible than the
+pthread model where threads can be affinitized to run on any CPU. With isolated
+schedulers scaling of applications to utilize fewer or more CPUs according to
+system demand is very difficult to achieve.
+
+The L-thread subsystem makes it possible for L-threads to migrate between
+schedulers running on different CPUs. Needless to say if the migration means
+that threads that share data end up running on different CPUs then this will
+introduce the need for some kind of mutual exclusion system.
+
+Of course ``rte_ring`` software rings can always be used to interconnect
+threads running on different cores, however to protect other kinds of shared
+data structures, lock free constructs or else explicit locking will be
+required. This is a consideration for the application design.
+
+In support of this extended functionality, the L-thread subsystem implements
+thread safe mutexes and condition variables.
+
+The cost of affinitizing and of condition variable signaling is significantly
+lower than the equivalent pthread operations, and so applications using these
+features will see a performance benefit.
+
+
+Thread local storage
+^^^^^^^^^^^^^^^^^^^^
+
+As with applications written for pthreads an application written for L-threads
+can take advantage of thread local storage, in this case local to an L-thread.
+An application may save and retrieve a single pointer to application data in
+the L-thread struct.
+
+For legacy and backward compatibility reasons two alternative methods are also
+offered. The first is modelled directly on the pthread get/set specific APIs.
+The second approach is modelled on the ``RTE_PER_LCORE`` macros, whereby
+``PER_LTHREAD`` macros are introduced. In both cases the storage is local to
+the L-thread.
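+
+For reference, the pthread get/set specific APIs on which the first
+alternative is modelled behave as in the sketch below. It uses standard POSIX
+calls only. The L-thread equivalents are not shown, and their exact signatures
+should be taken from ``lthread_api.h``.
+
+.. code-block:: c
+
+    #include <pthread.h>
+    #include <stdio.h>
+
+    static pthread_key_t key; /* one key, a separate value per thread */
+
+    static void *worker(void *arg)
+    {
+        /* The value stored here is visible only to this thread. */
+        pthread_setspecific(key, arg);
+        printf("worker sees %s\n", (const char *)pthread_getspecific(key));
+        return NULL;
+    }
+
+    int main(void)
+    {
+        pthread_t t;
+
+        pthread_key_create(&key, NULL); /* NULL: no destructor */
+        pthread_setspecific(key, "main data");
+
+        pthread_create(&t, NULL, worker, "worker data");
+        pthread_join(t, NULL);
+
+        /* The main thread's value is unaffected by the worker's store. */
+        printf("main sees %s\n", (const char *)pthread_getspecific(key));
+        pthread_key_delete(key);
+        return 0;
+    }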
+
+
+.. _constraints_and_performance_implications:
+
+Constraints and performance implications when using L-threads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+
+.. _API_compatibility:
+
+API compatibility
+^^^^^^^^^^^^^^^^^
+
+The L-thread subsystem provides a set of functions that are logically equivalent
+to the corresponding functions offered by the POSIX pthread library. However, not
+all pthread functions have a corresponding L-thread equivalent, and not all
+features available to pthreads are implemented for L-threads.
+
+The pthread library offers considerable flexibility via programmable attributes
+that can be associated with threads, mutexes, and condition variables.
+
+By contrast the L-thread subsystem has fixed functionality: the scheduler policy
+cannot be varied, and L-threads cannot be prioritized. There are no variable
+attributes associated with any L-thread objects. L-threads, mutexes and
+condition variables all have fixed functionality. (Note: reserved parameters
+are included in the APIs to facilitate possible future support for attributes).
+
+The table below lists the pthread and equivalent L-thread APIs with notes on
+differences and/or constraints. Where there is no L-thread entry in the table,
+then the L-thread subsystem provides no equivalent function.
+
+.. _table_lthread_pthread:
+
+.. table:: Pthread and equivalent L-thread APIs.
+
+   +----------------------------+------------------------+-------------------+
+   | **Pthread function**       | **L-thread function**  | **Notes**         |
+   +============================+========================+===================+
+   | pthread_barrier_destroy    |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_barrier_init       |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_barrier_wait       |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_cond_broadcast     | lthread_cond_broadcast | See note 1        |
+   +----------------------------+------------------------+-------------------+
+   | pthread_cond_destroy       | lthread_cond_destroy   |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_cond_init          | lthread_cond_init      |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_cond_signal        | lthread_cond_signal    | See note 1        |
+   +----------------------------+------------------------+-------------------+
+   | pthread_cond_timedwait     |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_cond_wait          | lthread_cond_wait      | See note 5        |
+   +----------------------------+------------------------+-------------------+
+   | pthread_create             | lthread_create         | See notes 2, 3    |
+   +----------------------------+------------------------+-------------------+
+   | pthread_detach             | lthread_detach         | See note 4        |
+   +----------------------------+------------------------+-------------------+
+   | pthread_equal              |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_exit               | lthread_exit           |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_getspecific        | lthread_getspecific    |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_getcpuclockid      |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_join               | lthread_join           |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_key_create         | lthread_key_create     |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_key_delete         | lthread_key_delete     |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_mutex_destroy      | lthread_mutex_destroy  |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_mutex_init         | lthread_mutex_init     |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_mutex_lock         | lthread_mutex_lock     | See note 6        |
+   +----------------------------+------------------------+-------------------+
+   | pthread_mutex_trylock      | lthread_mutex_trylock  | See note 6        |
+   +----------------------------+------------------------+-------------------+
+   | pthread_mutex_timedlock    |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_mutex_unlock       | lthread_mutex_unlock   |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_once               |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_rwlock_destroy     |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_rwlock_init        |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_rwlock_rdlock      |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_rwlock_timedrdlock |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_rwlock_timedwrlock |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_rwlock_tryrdlock   |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_rwlock_trywrlock   |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_rwlock_unlock      |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_rwlock_wrlock      |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_self               | lthread_current        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_setspecific        | lthread_setspecific    |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_spin_init          |                        | See note 10       |
+   +----------------------------+------------------------+-------------------+
+   | pthread_spin_destroy       |                        | See note 10       |
+   +----------------------------+------------------------+-------------------+
+   | pthread_spin_lock          |                        | See note 10       |
+   +----------------------------+------------------------+-------------------+
+   | pthread_spin_trylock       |                        | See note 10       |
+   +----------------------------+------------------------+-------------------+
+   | pthread_spin_unlock        |                        | See note 10       |
+   +----------------------------+------------------------+-------------------+
+   | pthread_cancel             | lthread_cancel         |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_setcancelstate     |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_setcanceltype      |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_testcancel         |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_getschedparam      |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_setschedparam      |                        |                   |
+   +----------------------------+------------------------+-------------------+
+   | pthread_yield              | lthread_yield          | See note 7        |
+   +----------------------------+------------------------+-------------------+
+   | pthread_setaffinity_np     | lthread_set_affinity   | See notes 2, 3, 8 |
+   +----------------------------+------------------------+-------------------+
+   |                            | lthread_sleep          | See note 9        |
+   +----------------------------+------------------------+-------------------+
+   |                            | lthread_sleep_clks     | See note 9        |
+   +----------------------------+------------------------+-------------------+
+
+
+**Note 1**:
+
+Neither lthread signal nor broadcast may be called concurrently by L-threads
+running on different schedulers, although multiple L-threads running in the
+same scheduler may freely perform signal or broadcast operations. L-threads
+running on the same or different schedulers may always safely wait on a
+condition variable.
+
+
+**Note 2**:
+
+Pthread attributes may be used to affinitize a pthread with a cpu-set. The
+L-thread subsystem does not support a cpu-set. An L-thread may be affinitized
+only with a single CPU at any time.
+
+
+**Note 3**:
+
+If an L-thread is intended to run on a different NUMA node than the node that
+creates the thread then, when calling ``lthread_create()`` it is advantageous
+to specify the destination core as a parameter of ``lthread_create()``. See
+:ref:`memory_allocation_and_NUMA_awareness` for details.
+
+
+**Note 4**:
+
+An L-thread can only detach itself, and cannot detach other L-threads.
+
+
+**Note 5**:
+
+A wait operation on a pthread condition variable is always associated with and
+protected by a mutex which must be owned by the thread at the time it invokes
+``pthread_cond_wait()``. By contrast L-thread condition variables are thread
+safe (for waiters) and do not use an associated mutex. Multiple L-threads
+(including L-threads running on other schedulers) can safely wait on an
+L-thread condition variable. As a consequence an L-thread condition variable
+is typically an order of magnitude faster than its pthread counterpart.
+
+
+**Note 6**:
+
+Recursive locking is not supported with L-threads; attempts to take a lock
+recursively will be detected and rejected.
+
+
+**Note 7**:
+
+``lthread_yield()`` will save the current context, insert the current thread
+at the back of the ready queue, and resume the next ready thread. Yielding
+increases ready queue backlog; see :ref:`ready_queue_backlog` for more details
+about the implications of this.
+
+
+N.B. The context switch time as measured from immediately before the call to
+``lthread_yield()`` to the point at which the next ready thread is resumed,
+can be an order of magnitude faster than the same measurement for
+``pthread_yield()``.
+
+
+**Note 8**:
+
+``lthread_set_affinity()`` is similar to a yield apart from the fact that the
+yielding thread is inserted into a peer ready queue of another scheduler.
+The peer ready queue is actually a separate thread safe queue, which means that
+threads appearing in the peer ready queue can jump any backlog in the local
+ready queue on the destination scheduler.
+
+The context switch time as measured from the time just before the call to
+``lthread_set_affinity()`` to just after the same thread is resumed on the new
+scheduler can be orders of magnitude faster than the same measurement for
+``pthread_setaffinity_np()``.
+
+
+**Note 9**:
+
+Although there is no ``pthread_sleep()`` function, ``lthread_sleep()`` and
+``lthread_sleep_clks()`` can be used wherever ``sleep()``, ``usleep()`` or
+``nanosleep()`` might ordinarily be used. The L-thread sleep functions suspend
+the current thread, start an ``rte_timer`` and resume the thread when the
+timer matures. The ``rte_timer_manage()`` entry point is called on every pass
+of the scheduler loop. This means that the worst case jitter on timer expiry
+is determined by the longest period between context switches of any running
+L-threads.
+
+In a synthetic test with many threads sleeping and resuming, the measured
+jitter is typically orders of magnitude lower than the same measurement made
+for ``nanosleep()``.
+
+
+**Note 10**:
+
+Spin locks are not provided because they are problematical in a cooperative
+environment, see :ref:`porting_locks_and_spinlocks` for a more detailed
+discussion on how to avoid spin locks.
+
+
+.. _Thread_local_storage_performance:
+
+Thread local storage
+^^^^^^^^^^^^^^^^^^^^
+
+Of the three L-thread local storage options the simplest and most efficient is
+storing a single application data pointer in the L-thread struct.
+
+The ``PER_LTHREAD`` macros involve a run time computation to obtain the address
+of the variable being saved/retrieved and also require that the accesses are
+de-referenced via a pointer. This means that code that has used
+``RTE_PER_LCORE`` macros and is being ported to L-threads might need some
+slight adjustment (see :ref:`porting_thread_local_storage` for hints about
+porting code that makes use of thread local storage).
+
+The get/set specific APIs are consistent with their pthread counterparts both
+in use and in performance.
+
+
+.. _memory_allocation_and_NUMA_awareness:
+
+Memory allocation and NUMA awareness
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+All memory allocation is from DPDK huge pages, and is NUMA aware. Each
+scheduler maintains its own caches of objects: lthreads, their stacks, TLS,
+mutexes and condition variables. These caches are implemented as unbounded lock
+free MPSC queues. When objects are created they are always allocated from the
+caches on the local core (current EAL thread).
+
+If an L-thread has been affinitized to a different scheduler, then it can
+always safely free resources to the caches from which they originated (because
+the caches are MPSC queues).
+
+If the L-thread has been affinitized to a different NUMA node then the memory
+resources associated with it may incur longer access latency.
+
+The commonly used pattern of setting affinity on entry to a thread after it has
+started means that memory allocation for both the stack and TLS will have been
+made from caches on the NUMA node on which the thread's creator is running.
+This has the side effect that access latency will be sub-optimal after
+affinitizing.
+
+This side effect can be mitigated to some extent (although not completely) by
+specifying the destination CPU as a parameter of ``lthread_create()``. This
+causes the L-thread's stack and TLS to be allocated when it is first scheduled
+on the destination scheduler. If the destination is on another NUMA node, this
+results in a more optimal memory allocation.
+
+Note that the lthread struct itself remains allocated from memory on the
+creating node. This is unavoidable because an L-thread is known everywhere by
+the address of this struct.
+
+
+.. _object_cache_sizing:
+
+Object cache sizing
+^^^^^^^^^^^^^^^^^^^
+
+The per lcore object caches pre-allocate objects in bulk whenever a request to
+allocate an object finds a cache empty. By default 100 objects are
+pre-allocated; this is defined by ``LTHREAD_PREALLOC`` in the public API
+header file ``lthread_api.h``. This means that the caches constantly grow to
+meet system demand.
+
+In the present implementation there is no mechanism to reduce the cache sizes
+if system demand reduces. Thus the caches will remain at their maximum extent
+indefinitely.
+
+A consequence of the bulk pre-allocation of objects is that every 100 (default
+value) additional new object create operations result in a call to
+``rte_malloc()``. For objects such as L-threads, whose creation triggers the
+allocation of even more objects (i.e. their stacks and TLS), this can
+cause outliers in scheduling performance.
+
+If this is a problem the simplest mitigation strategy is to dimension the
+system, by setting the bulk object pre-allocation size to some large number
+that you do not expect to be exceeded. This means the caches will be populated
+once only, the very first time a thread is created.
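+
+The effect of the pre-allocation size can be illustrated with a minimal model.
+The sketch below is not the L-thread cache implementation: ``struct obj_cache``,
+``cache_alloc()`` and ``refills_for()`` are hypothetical names, and a refill
+counter stands in for the call to ``rte_malloc()``. It only shows how the
+number of bulk refills depends on the pre-allocation size.
+

```c
#include <stddef.h>

/* Hypothetical model of a per-lcore object cache. A refill stands in for
 * the bulk rte_malloc() call made when the real cache is found empty. */
struct obj_cache {
    size_t prealloc;   /* objects added per refill (cf. LTHREAD_PREALLOC) */
    size_t available;  /* objects currently held in the cache */
    size_t refills;    /* number of bulk refills performed so far */
};

/* Take one object from the cache, refilling in bulk if it is empty. */
static void cache_alloc(struct obj_cache *c)
{
    if (c->available == 0) {
        c->available = c->prealloc;
        c->refills++;
    }
    c->available--;
}

/* Count the refills caused by `creates` allocations with no frees. */
static size_t refills_for(size_t prealloc, size_t creates)
{
    struct obj_cache c = { .prealloc = prealloc };

    if (prealloc == 0)
        return 0; /* a zero-sized refill is not meaningful in this model */

    for (size_t i = 0; i < creates; i++)
        cache_alloc(&c);
    return c.refills;
}
```

+
+In this model 1000 creations with a pre-allocation size of 100 cause 10
+refills, whereas a pre-allocation size of at least 1000 causes exactly one,
+on the first creation.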
+
+
+.. _Ready_queue_backlog:
+
+Ready queue backlog
+^^^^^^^^^^^^^^^^^^^
+
+One of the more subtle performance considerations is managing the ready queue
+backlog. The fewer threads that are waiting in the ready queue, the faster
+any particular thread will get serviced.
+
+In a naive L-thread application with N L-threads simply looping and yielding,
+this backlog will always be equal to the number of L-threads, thus the cost of
+a yield to a particular L-thread will be N times the context switch time.
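+
+As a rough worked example of this relationship, the following sketch computes
+the estimated delay before a yielding L-thread runs again. The function names
+and any numbers fed to them are illustrative assumptions, not measurements or
+part of the L-thread API.
+

```c
#include <stdint.h>

/* Number of L-threads left in the ready queue when `suspended` of the
 * `total` L-threads are blocked (sleeping, or waiting on a mutex or
 * condition variable) instead of polling by yielding. */
static uint64_t ready_backlog(uint64_t total, uint64_t suspended)
{
    return suspended >= total ? 0 : total - suspended;
}

/* Estimated time, in nanoseconds, before a yielding L-thread is resumed:
 * every L-thread in the backlog costs one context switch. */
static uint64_t yield_round_trip_ns(uint64_t backlog, uint64_t ctx_switch_ns)
{
    return backlog * ctx_switch_ns;
}
```

+
+With an assumed context switch time of 100 ns, 1000 L-threads that all loop
+and yield give an estimate of 100000 ns per round trip. If 990 of them are
+suspended instead, the estimate drops to 1000 ns.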
+
+This side effect can be mitigated by arranging for threads to be suspended and
+wait to be resumed, rather than polling for work by constantly yielding.
+Blocking on a mutex or condition variable or even more obviously having a
+thread sleep if it has a low frequency workload are all mechanisms by which a
+thread can be excluded from the ready queue until it really does need to be
+run. This can have a significant positive impact on performance.
+
+
+.. _Initialization_and_shutdown_dependencies:
+
+Initialization, shutdown and dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The L-thread subsystem depends on DPDK for huge page allocation and depends on
+the ``rte_timer`` subsystem. The DPDK EAL initialization and
+``rte_timer_subsystem_init()`` **MUST** be completed before the L-thread
+subsystem can be used.
+
+Thereafter initialization of the L-thread subsystem is largely transparent to
+the application. Constructor functions ensure that global variables are properly
+initialized. Other than global variables each scheduler is initialized
+independently the first time that an L-thread is created by a particular EAL
+thread.
+
+If the schedulers are to be run as isolated and independent schedulers, with
+no intention that L-threads running on different schedulers will migrate between
+schedulers or synchronize with L-threads running on other schedulers, then
+initialization consists simply of creating an L-thread, and then running the
+L-thread scheduler.
+
+If there will be interaction between L-threads running on different schedulers,
+then it is important that the starting of schedulers on different EAL threads
+is synchronized.
+
+To achieve this an additional initialization step is necessary: set the number
+of schedulers by calling the API function
+``lthread_num_schedulers_set(n)``, where ``n`` is the number of EAL threads
+that will run L-thread schedulers. Setting the number of schedulers to a
+number greater than 0 will cause all schedulers to wait until the others have
+started before beginning to schedule L-threads.
+
+The L-thread scheduler is started by calling the function ``lthread_run()``,
+which should be called from the EAL thread and thus becomes the main loop of
+the EAL thread.
+
+The function ``lthread_run()`` will not return until all threads running on
+the scheduler have exited, and the scheduler has been explicitly stopped by
+calling ``lthread_scheduler_shutdown(lcore)`` or
+``lthread_scheduler_shutdown_all()``.
+
+All these functions do is tell the scheduler that it can exit when there are no
+longer any running L-threads; neither function forces any running L-thread to
+terminate. Any desired application shutdown behavior must be designed and
+built into the application to ensure that L-threads complete in a timely
+manner.
+
+**Important Note:** It is assumed when the scheduler exits that the application
+is terminating for good. The scheduler does not free resources before exiting,
+and running the scheduler a subsequent time will result in undefined behavior.
+
+
+.. _porting_legacy_code_to_run_on_lthreads:
+
+Porting legacy code to run on L-threads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Legacy code originally written for a pthread environment may be ported to
+L-threads if the considerations about differences in scheduling policy, and
+constraints discussed in the previous sections can be accommodated.
+
+This section looks in more detail at some of the issues that may have to be
+resolved when porting code.
+
+
+.. _pthread_API_compatibility:
+
+pthread API compatibility
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The first step is to establish exactly which pthread APIs the legacy
+application uses, and to understand the requirements of those APIs. If there
+are corresponding L-thread APIs, and the application uses only the default
+pthread functionality then, notwithstanding the other issues discussed
+here, it should be feasible to run the application with L-threads. If the
+legacy code modifies the default behavior using attributes then it may be
+necessary to make some adjustments to eliminate those requirements.
+
+
+.. _blocking_system_calls:
+
+Blocking system API calls
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It is important to understand what other system services the application may be
+using, bearing in mind that in a cooperatively scheduled environment a thread
+cannot block without stalling the scheduler and with it all other cooperative
+threads. Any kind of blocking system call, for example file or socket IO, is a
+potential problem. A good tool to analyze the application for this purpose is
+the ``strace`` utility.
+
+There are many strategies to resolve these kinds of issues, each with its
+merits. Possible solutions include:
+
+* Adopting a polled mode of the system API concerned (if available).
+
+* Arranging for another core to perform the function and synchronizing with
+  that core via constructs that will not block the L-thread.
+
+* Affinitizing the thread to another scheduler devoted (as a matter of policy)
+  to handling threads wishing to make blocking calls, and then back again when
+  finished.
+
+
+.. _porting_locks_and_spinlocks:
+
+Locks and spinlocks
+^^^^^^^^^^^^^^^^^^^
+
+Locks and spinlocks are another source of blocking behavior that for the same
+reasons as system calls will need to be addressed.
+
+If the application design ensures that the contending L-threads will always
+run on the same scheduler then it is probably safe to remove locks and spin
+locks completely.
+
+The only exception to the above rule is if for some reason the
+code performs any kind of context switch whilst holding the lock
+(e.g. yield, sleep, or block on a different lock, or on a condition variable).
+This will need to be determined before deciding to eliminate a lock.
+
+If a lock cannot be eliminated then an L-thread mutex can be substituted for
+either kind of lock.
+
+An L-thread blocking on an L-thread mutex will be suspended and will cause
+another ready L-thread to be resumed, thus not blocking the scheduler. When
+default behavior is required, it can be used as a direct replacement for a
+pthread mutex lock.
+
+Spin locks are typically used when lock contention is likely to be rare and
+where the period during which the lock may be held is relatively short.
+When the contending L-threads are running on the same scheduler then an
+L-thread blocking on a spin lock will enter an infinite loop stopping the
+scheduler completely (see :ref:`porting_infinite_loops` below).
+
+If the application design ensures that contending L-threads will always run
+on different schedulers then it might be reasonable to leave a short spin lock
+that rarely experiences contention in place.
+
+If after all considerations it appears that a spin lock can neither be
+eliminated completely, replaced with an L-thread mutex, nor left in place as
+is, then an alternative is to loop on a flag, with a call to
+``lthread_yield()`` inside the loop (n.b. if the contending L-threads might
+ever run on different schedulers the flag will need to be manipulated
+atomically).
+
+Spinning and yielding is the least preferred solution since it introduces
+ready queue backlog (see also :ref:`ready_queue_backlog`).
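+
+The spin and yield pattern can be sketched as follows. This is a minimal
+illustration, not code from the L-thread subsystem: ``yield_lock_t`` and its
+functions are hypothetical names, the flag uses C11 atomics so that it remains
+safe if the contenders run on different schedulers, and the yield function is
+passed in as a callback. An application would pass a wrapper around
+``lthread_yield()``; the counting stub here only lets the sketch be exercised
+without DPDK.
+

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for lthread_yield() so the sketch builds without DPDK.
 * It only counts how often the caller had to give up the CPU. */
static unsigned long yield_calls;
static void yield_stub(void) { yield_calls++; }

/* Hypothetical replacement for a spin lock: a single atomic flag. */
typedef struct { atomic_flag held; } yield_lock_t;
#define YIELD_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once to take the lock; returns true if this caller now owns it. */
static bool yield_lock_try(yield_lock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Loop on the flag, yielding to other ready threads between attempts
 * instead of spinning and stalling the scheduler. */
static void yield_lock_acquire(yield_lock_t *l, void (*yield)(void))
{
    while (!yield_lock_try(l))
        yield();
}

/* Release the lock so that a yielding contender can take it. */
static void yield_lock_release(yield_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

+
+Each failed attempt puts the caller back on the ready queue, which is why this
+approach adds to the backlog and an L-thread mutex is preferable where it can
+be used.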
+
+
+.. _porting_sleeps_and_delays:
+
+Sleeps and delays
+^^^^^^^^^^^^^^^^^
+
+Yet another kind of blocking behavior (albeit momentary) is the use of delay
+functions like ``sleep()``, ``usleep()``, ``nanosleep()`` etc. All will have
+the consequence of stalling the L-thread scheduler and unless the delay is very
+short (e.g. a very short nanosleep) calls to these functions will need to be
+eliminated.
+
+The simplest mitigation strategy is to use the L-thread sleep API functions,
+of which two variants exist, ``lthread_sleep()`` and ``lthread_sleep_clks()``.
+These functions start an rte_timer against the L-thread, suspend the L-thread
+and cause another ready L-thread to be resumed. The suspended L-thread is
+resumed when the rte_timer matures.
+
+
+.. _porting_infinite_loops:
+
+Infinite loops
+^^^^^^^^^^^^^^
+
+Some applications have threads with loops that contain no inherent
+rescheduling opportunity, and rely solely on the OS time slicing to share
+the CPU. In a cooperative environment this will stop everything dead. These
+kinds of loops are not hard to identify: in a debug session you will find the
+debugger is always stopping in the same loop.
+
+The simplest solution to this kind of problem is to insert an explicit
+``lthread_yield()`` or ``lthread_sleep()`` into the loop. Another solution
+might be to include the function performed by the loop into the execution path
+of some other loop that does in fact yield, if this is possible.
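+
+A minimal sketch of the first solution is shown below, assuming a long running
+loop that can tolerate being interrupted every ``batch`` iterations. The
+function and variable names are hypothetical, and the yield function is passed
+in as a callback; an application would pass a wrapper around
+``lthread_yield()`` rather than the counting stub used here.
+

```c
#include <stddef.h>

/* Stand-in for lthread_yield() so the sketch builds without DPDK. */
static size_t yields_seen;
static void count_yield(void) { yields_seen++; }

/* Sum `n` values, giving other ready threads a chance to run after every
 * `batch` elements. A batch of 0 disables yielding altogether. */
static unsigned long sum_with_yield(const unsigned *v, size_t n,
                                    size_t batch, void (*yield)(void))
{
    unsigned long sum = 0;

    for (size_t i = 0; i < n; i++) {
        sum += v[i];
        if (batch != 0 && (i + 1) % batch == 0)
            yield();
    }
    return sum;
}
```

+
+The batch size trades scheduling latency for the other L-threads against the
+context switch overhead added to the loop.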
+
+
+.. _porting_thread_local_storage:
+
+Thread local storage
+^^^^^^^^^^^^^^^^^^^^
+
+If the application uses thread local storage, the use case should be
+studied carefully.
+
+In a legacy pthread application either or both the ``__thread`` prefix, or the
+pthread set/get specific APIs may have been used to define storage local to a
+pthread.
+
+In some applications it may be a reasonable assumption that the data could
+or in fact most likely should be placed in L-thread local storage.
+
+If the application (like many DPDK applications) has assumed a certain
+relationship between a pthread and the CPU to which it is affinitized, there
+is a risk that thread local storage may have been used to save some data items
+that are correctly logically associated with the CPU, and other items which
+relate to application context for the thread. Only a good understanding of the
+application will reveal such cases.
+
+If the application requires that an L-thread be able to move between
+schedulers then care should be taken to separate these kinds of data into per
+lcore and per L-thread storage. In this way a migrating thread will bring with
+it the local data it needs, and pick up the new logical core specific values
+from pthread local storage at its new home.
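+
+The separation can be sketched as below. This is an illustration only: all
+names are hypothetical, C11 ``_Thread_local`` stands in for per lcore storage
+(a DPDK application would typically use the ``RTE_PER_LCORE`` macros), and an
+explicit context pointer stands in for whichever L-thread local storage option
+the application chooses.
+

```c
#include <stddef.h>

/* Data that belongs to the core: it stays behind when a thread migrates. */
struct core_stats {
    unsigned long packets;   /* packets handled on this core */
};
static _Thread_local struct core_stats this_core_stats;

/* Data that belongs to the task: it travels with the migrating thread. */
struct task_ctx {
    unsigned flow_id;        /* application context of this task */
    unsigned long handled;   /* packets handled by this task */
};

/* Update both kinds of state. The task context comes from the caller,
 * the core state from whichever core is currently executing. */
static void task_handle_packet(struct task_ctx *ctx)
{
    ctx->handled++;
    this_core_stats.packets++;
}
```

+
+After a migration the same ``struct task_ctx`` is still passed in, while
+``this_core_stats`` resolves to the storage of the new core.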
+
+
+.. _lthread_diagnostics:
+
+L-thread Diagnostics
+~~~~~~~~~~~~~~~~~~~~
+
+When debugging you must take account of the fact that the L-threads are run in
+a single pthread. The current scheduler is defined by
+``RTE_PER_LCORE(this_sched)``, and the current lthread is stored at
+``RTE_PER_LCORE(this_sched)->current_lthread``. Thus on a breakpoint in a GDB
+session the current lthread can be obtained by displaying the pthread local
+variable ``per_lcore_this_sched->current_lthread``.
+
+Another useful diagnostic feature is the possibility to trace significant
+events in the life of an L-thread. This feature is enabled by changing the
+value of ``LTHREAD_DIAG`` from 0 to 1 in the file ``lthread_diag_api.h``.
+
+Tracing of events can be individually masked, and the mask may be programmed
+at run time. An unmasked event results in a callback that provides information
+about the event. The default callback simply prints trace information. The
+default mask is 0 (all events off); the mask can be modified by calling the
+function ``lthread_diagnostic_set_mask()``.
+
+It is possible to register a user callback function to implement more
+sophisticated diagnostic functions.
+Object creation events (lthread, mutex, and condition variable) accept, and
+store in the created object, a user supplied reference value returned by the
+callback function.
+
+The lthread reference value is passed back in all subsequent event callbacks,
+and APIs are provided to retrieve the reference value from mutexes and
+condition variables. This enables a user to monitor, count, or filter for
+specific events on specific objects, for example to monitor for a specific
+thread signalling a specific condition variable, or to monitor all timer
+events. The possibilities and combinations are endless.
+
+The callback function can be set by calling the function
+``lthread_diagnostic_enable()`` supplying a callback function pointer and an
+event mask.
+
+Setting ``LTHREAD_DIAG`` also enables counting of statistics about cache and
+queue usage, and these statistics can be displayed by calling the function
+``lthread_diag_stats_display()``. This function also performs a consistency
+check on the caches and queues. The function should only be called from the
+master EAL thread after all slave threads have stopped and returned to the C
+main program, otherwise the consistency check will fail.
diff --git a/examples/Makefile b/examples/Makefile
index 5dd2c53..aeb10f3 100644
--- a/examples/Makefile
+++ b/examples/Makefile
@@ -77,5 +77,7 @@ DIRS-y += vmdq
 DIRS-y += vmdq_dcb
 DIRS-$(CONFIG_RTE_LIBRTE_POWER) += vm_power_manager
 DIRS-$(CONFIG_RTE_LIBRTE_CRYPTODEV) += l2fwd-crypto
+DIRS-$(CONFIG_RTE_PERFORMANCE_THREAD) += performance-thread
+
 
 include $(RTE_SDK)/mk/rte.extsubdir.mk
diff --git a/examples/performance-thread/Makefile b/examples/performance-thread/Makefile
new file mode 100644
index 0000000..fce4f79
--- /dev/null
+++ b/examples/performance-thread/Makefile
@@ -0,0 +1,45 @@
+#   BSD LICENSE
+#
+#   Copyright(c) 2015 Intel Corporation. All rights reserved.
+#   All rights reserved.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Intel Corporation nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+ifeq ($(RTE_SDK),)
+$(error "Please define RTE_SDK environment variable")
+endif
+
+# Default target, can be overridden by command line or environment
+RTE_TARGET ?= x86_64-native-linuxapp-gcc
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+DIRS-$(CONFIG_RTE_PERFORMANCE_THREAD) += l3fwd-thread
+
+
+
+include $(RTE_SDK)/mk/rte.extsubdir.mk
diff --git a/examples/performance-thread/common/arch/x86/atomic.h b/examples/performance-thread/common/arch/x86/atomic.h
new file mode 100644
index 0000000..968dbe2
--- /dev/null
+++ b/examples/performance-thread/common/arch/x86/atomic.h
@@ -0,0 +1,59 @@
+/*
+ *-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef ATOMIC_H_
+#define ATOMIC_H_
+
+#include <stdint.h>
+
+/*
+ * Atomically set a value and return the old value
+ */
+static inline uint64_t
+atomic64_xchg(uint64_t *ptr, uint64_t val) __attribute__ ((always_inline));
+static inline uint64_t
+atomic64_xchg(uint64_t *ptr, uint64_t val)
+{
+	asm volatile (
+				"lock;"
+				"xchgq %0,%1;"
+				 : "=r" ((uint64_t) val)
+				 : "m" (*(uint64_t *) ptr), "0" (val)
+				 : "memory");
+
+	return val;
+}
+
+
+#endif /* ATOMIC_H_ */
diff --git a/examples/performance-thread/common/arch/x86/ctx.c b/examples/performance-thread/common/arch/x86/ctx.c
new file mode 100644
index 0000000..ccf1683
--- /dev/null
+++ b/examples/performance-thread/common/arch/x86/ctx.c
@@ -0,0 +1,93 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * https://github.com/halayli/lthread which carries the following license.
+ *
+ * Copyright (C) 2012, Hasan Alayli <halayli@gmail.com>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+
+
+#if defined(__x86_64__)
+__asm__ (
+".text\n"
+".p2align 4,,15\n"
+".globl ctx_switch\n"
+".globl _ctx_switch\n"
+"ctx_switch:\n"
+"_ctx_switch:\n"
+"	movq %rsp, 0(%rsi)	# save stack_pointer\n"
+"	movq %rbp, 8(%rsi)	# save frame_pointer\n"
+"	movq (%rsp), %rax	# save insn_pointer\n"
+"	movq %rax, 16(%rsi)\n"
+"	movq %rbx, 24(%rsi)	# save rbx,r12-r15\n"
+"	movq 24(%rdi), %rbx\n"
+"	movq %r15, 56(%rsi)\n"
+"	movq %r14, 48(%rsi)\n"
+"	movq 48(%rdi), %r14\n"
+"	movq 56(%rdi), %r15\n"
+"	movq %r13, 40(%rsi)\n"
+"	movq %r12, 32(%rsi)\n"
+"	movq 32(%rdi), %r12\n"
+"	movq 40(%rdi), %r13\n"
+"	movq 0(%rdi), %rsp	# restore stack_pointer\n"
+"	movq 16(%rdi), %rax	# restore insn_pointer\n"
+"	movq 8(%rdi), %rbp	# restore frame_pointer\n"
+"	movq %rax, (%rsp)\n"
+"	ret\n"
+	);
+#else
+#pragma GCC error "__x86_64__ is not defined"
+#endif
diff --git a/examples/performance-thread/common/arch/x86/ctx.h b/examples/performance-thread/common/arch/x86/ctx.h
new file mode 100644
index 0000000..d0a626d
--- /dev/null
+++ b/examples/performance-thread/common/arch/x86/ctx.h
@@ -0,0 +1,57 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+
+#ifndef CTX_H
+#define CTX_H
+
+/*
+ * CPU context registers
+ */
+struct ctx {
+	void	*rsp;		/* 0  */
+	void	*rbp;		/* 8  */
+	void	*rip;		/* 16 */
+	void	*rbx;		/* 24 */
+	void	*r12;		/* 32 */
+	void	*r13;		/* 40 */
+	void	*r14;		/* 48 */
+	void	*r15;		/* 56 */
+};
+
+
+void
+ctx_switch(struct ctx *new_ctx, struct ctx *curr_ctx);
+
+
+#endif /* CTX_H */
diff --git a/examples/performance-thread/common/common.mk b/examples/performance-thread/common/common.mk
new file mode 100644
index 0000000..6e0bc87
--- /dev/null
+++ b/examples/performance-thread/common/common.mk
@@ -0,0 +1,40 @@
+#   BSD LICENSE
+#
+#   Copyright(c) 2014 6WIND S.A.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of 6WIND S.A. nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+# list the C files belonging to the lthread subsystem; these are common to all lthread apps
+SRCS-y +=	../common/lthread.c \
+			../common/lthread_sched.c \
+			../common/lthread_cond.c \
+			../common/lthread_tls.c \
+			../common/lthread_mutex.c \
+			../common/lthread_diag.c \
+			../common/arch/x86/ctx.c
+
+INCLUDES += -I$(RTE_SDK)/examples/performance-thread/common/ -I$(RTE_SDK)/examples/performance-thread/common/arch/x86/
diff --git a/examples/performance-thread/common/lthread.c b/examples/performance-thread/common/lthread.c
new file mode 100644
index 0000000..7b67b52
--- /dev/null
+++ b/examples/performance-thread/common/lthread.c
@@ -0,0 +1,530 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * Some portions of this software are derived from
+ * https://github.com/halayli/lthread, which carries the following license.
+ *
+ * Copyright (C) 2012, Hasan Alayli <halayli@gmail.com>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#define RTE_MEM 1
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <limits.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <sys/time.h>
+#include <sys/mman.h>
+
+#include <rte_config.h>
+#include <rte_log.h>
+#include <ctx.h>
+
+#include "lthread_api.h"
+#include "lthread.h"
+#include "lthread_timer.h"
+#include "lthread_tls.h"
+#include "lthread_objcache.h"
+#include "lthread_diag.h"
+
+
+/*
+ * This function gets called after an lthread function has returned.
+ */
+void _lthread_exit_handler(struct lthread *lt)
+{
+
+	lt->state |= BIT(ST_LT_EXITED);
+
+	if (!(lt->state & BIT(ST_LT_DETACH))) {
+		/* this thread is not explicitly detached
+		 * it must be joinable, so we call lthread_exit().
+		 */
+		lthread_exit(NULL);
+	}
+
+	/* if we get here the thread is detached so we can reschedule it,
+	 * allowing the scheduler to free it
+	 */
+	_reschedule();
+}
+
+
+/*
+ * Free resources allocated to an lthread
+ */
+void _lthread_free(struct lthread *lt)
+{
+
+	DIAG_EVENT(lt, LT_DIAG_LTHREAD_FREE, lt, 0);
+
+	/* invoke any user TLS destructor functions */
+	_lthread_tls_destroy(lt);
+
+	/* free memory allocated for TLS defined using RTE_PER_LTHREAD macros */
+	if (sizeof(void *) < (uint64_t)RTE_PER_LTHREAD_SECTION_SIZE)
+		_lthread_objcache_free(lt->tls->root_sched->per_lthread_cache,
+					lt->per_lthread_data);
+
+	/* free pthread style TLS memory */
+	_lthread_objcache_free(lt->tls->root_sched->tls_cache, lt->tls);
+
+	/* free the stack */
+	_lthread_objcache_free(lt->stack_container->root_sched->stack_cache,
+				lt->stack_container);
+
+	/* now free the thread */
+	_lthread_objcache_free(lt->root_sched->lthread_cache, lt);
+
+}
+
+/*
+ * Allocate a stack and maintain a cache of stacks
+ */
+struct lthread_stack *_stack_alloc(void)
+{
+	struct lthread_stack *s;
+
+	s = _lthread_objcache_alloc((THIS_SCHED)->stack_cache);
+	LTHREAD_ASSERT(s != NULL);
+
+	s->root_sched = THIS_SCHED;
+	s->stack_size = LTHREAD_MAX_STACK_SIZE;
+	return s;
+}
+
+/*
+ * Execute a ctx by invoking the start function
+ * On return call an exit handler if the user has provided one
+ */
+static void _lthread_exec(void *arg)
+{
+	struct lthread *lt = (struct lthread *)arg;
+
+	/* invoke the context's function */
+	lt->fun(lt->arg);
+	/* do exit handling */
+	if (lt->exit_handler != NULL)
+		lt->exit_handler(lt);
+}
+
+/*
+ *	Initialize an lthread
+ *	Set its function, args, and exit handler
+ */
+void
+_lthread_init(struct lthread *lt,
+	lthread_func_t fun, void *arg, lthread_exit_func exit_handler)
+{
+
+	/* set ctx func and args */
+	lt->fun = fun;
+	lt->arg = arg;
+	lt->exit_handler = exit_handler;
+
+	/* set initial state */
+	lt->birth = _sched_now();
+	lt->state = BIT(ST_LT_INIT);
+	lt->join = LT_JOIN_INITIAL;
+}
+
+/*
+ *	set the lthread stack
+ */
+void _lthread_set_stack(struct lthread *lt, void *stack, size_t stack_size)
+{
+	char *stack_top = (char *)stack + stack_size;
+	void **s = (void **)stack_top;
+
+	/* set stack */
+	lt->stack = stack;
+	lt->stack_size = stack_size;
+
+	/* set initial context */
+	s[-3] = NULL;
+	s[-2] = (void *)lt;
+	lt->ctx.rsp = (void *)(stack_top - (4 * sizeof(void *)));
+	lt->ctx.rbp = (void *)(stack_top - (3 * sizeof(void *)));
+	lt->ctx.rip = (void *)_lthread_exec;
+}
+
+/*
+ * Create an lthread on the current scheduler
+ * If there is no current scheduler on this pthread then first create one
+ */
+int
+lthread_create(struct lthread **new_lt, int lcore_id,
+		lthread_func_t fun, void *arg)
+{
+	if ((new_lt == NULL) || (fun == NULL))
+		return POSIX_ERRNO(EINVAL);
+
+	if (lcore_id < 0)
+		lcore_id = rte_lcore_id();
+	else if (lcore_id > LTHREAD_MAX_LCORES)
+		return POSIX_ERRNO(EINVAL);
+
+	struct lthread *lt = NULL;
+
+	if (THIS_SCHED == NULL) {
+		THIS_SCHED = _lthread_sched_create(0);
+		if (THIS_SCHED == NULL) {
+			perror("Failed to create scheduler");
+			return POSIX_ERRNO(EAGAIN);
+		}
+	}
+
+	/* allocate a thread structure */
+	lt = _lthread_objcache_alloc((THIS_SCHED)->lthread_cache);
+	if (lt == NULL)
+		return POSIX_ERRNO(EAGAIN);
+
+	bzero(lt, sizeof(struct lthread));
+	lt->root_sched = THIS_SCHED;
+
+	/* set the function, args and exit handler */
+	_lthread_init(lt, fun, arg, _lthread_exit_handler);
+
+	/* return the new lthread handle to the caller */
+	*new_lt = lt;
+
+	if (lcore_id < 0)
+		lcore_id = rte_lcore_id();
+
+	DIAG_CREATE_EVENT(lt, LT_DIAG_LTHREAD_CREATE);
+
+	rte_wmb();
+	_ready_queue_insert(_lthread_sched_get(lcore_id), lt);
+	return 0;
+}
+
+/*
+ * Schedules lthread to sleep for `nsecs`,
+ * setting the ST_LT_SLEEPING bit in the lthread state.
+ * The bit is cleared upon resumption or expiry.
+ */
+static inline void _lthread_sched_sleep(struct lthread *lt, uint64_t nsecs)
+{
+	uint64_t state = lt->state;
+	uint64_t clks = _ns_to_clks(nsecs);
+
+	if (clks) {
+		_timer_start(lt, clks);
+		lt->state = state | BIT(ST_LT_SLEEPING);
+	}
+	DIAG_EVENT(lt, LT_DIAG_LTHREAD_SLEEP, clks, 0);
+	_suspend();
+}
+
+
+
+/*
+ * Cancels any running timer.
+ * This can be called multiple times on the same lthread, regardless of
+ * whether it was sleeping.
+ */
+int _lthread_desched_sleep(struct lthread *lt)
+{
+	uint64_t state = lt->state;
+
+	if (state & BIT(ST_LT_SLEEPING)) {
+		_timer_stop(lt);
+		state &= (CLEARBIT(ST_LT_SLEEPING) & CLEARBIT(ST_LT_EXPIRED));
+		lt->state = state | BIT(ST_LT_READY);
+		return 1;
+	}
+	return 0;
+}
+
+/*
+ * set user data pointer in an lthread
+ */
+void lthread_set_data(void *data)
+{
+	if (sizeof(void *) == RTE_PER_LTHREAD_SECTION_SIZE)
+		THIS_LTHREAD->per_lthread_data = data;
+}
+
+/*
+ * Retrieve user data pointer from an lthread
+ */
+void *lthread_get_data(void)
+{
+	return THIS_LTHREAD->per_lthread_data;
+}
+
+/*
+ * Return the current lthread handle
+ */
+struct lthread *lthread_current(void)
+{
+	struct lthread_sched *sched = THIS_SCHED;
+
+	if (sched)
+		return sched->current_lthread;
+	return NULL;
+}
+
+
+
+/*
+ * Tasklet to cancel a thread
+ */
+static void
+_cancel(void *arg)
+{
+	struct lthread *lt = (struct lthread *) arg;
+
+	lt->state |= BIT(ST_LT_CANCELLED);
+	lthread_detach();
+}
+
+
+/*
+ * Mark the specified lthread as cancelled
+ */
+int lthread_cancel(struct lthread *cancel_lt)
+{
+	struct lthread *lt;
+
+	if ((cancel_lt == NULL) || (cancel_lt == THIS_LTHREAD))
+		return POSIX_ERRNO(EINVAL);
+
+	DIAG_EVENT(cancel_lt, LT_DIAG_LTHREAD_CANCEL, cancel_lt, 0);
+
+	if (cancel_lt->sched != THIS_SCHED) {
+
+		/* spawn task-let to cancel the thread */
+		lthread_create(&lt,
+				cancel_lt->sched->lcore_id,
+				_cancel,
+				cancel_lt);
+		return 0;
+	}
+	cancel_lt->state |= BIT(ST_LT_CANCELLED);
+	return 0;
+}
+
+/*
+ * Suspend the current lthread for the specified time in nanoseconds
+ */
+void lthread_sleep(uint64_t nsecs)
+{
+	struct lthread *lt = THIS_LTHREAD;
+
+	_lthread_sched_sleep(lt, nsecs);
+
+}
+
+/*
+ * Suspend the current lthread for the specified number of clock ticks
+ */
+void lthread_sleep_clks(uint64_t clks)
+{
+	struct lthread *lt = THIS_LTHREAD;
+	uint64_t state = lt->state;
+
+	if (clks) {
+		_timer_start(lt, clks);
+		lt->state = state | BIT(ST_LT_SLEEPING);
+	}
+	DIAG_EVENT(lt, LT_DIAG_LTHREAD_SLEEP, clks, 0);
+	_suspend();
+}
+
+/*
+ * Requeue the current thread to the back of the ready queue
+ */
+void lthread_yield(void)
+{
+	struct lthread *lt = THIS_LTHREAD;
+
+	DIAG_EVENT(lt, LT_DIAG_LTHREAD_YIELD, 0, 0);
+
+	_ready_queue_insert(THIS_SCHED, lt);
+	ctx_switch(&(THIS_SCHED)->ctx, &lt->ctx);
+}
+
+/*
+ * Exit the current lthread
+ * If a thread is joining pass the user pointer to it
+ */
+void lthread_exit(void *ptr)
+{
+	struct lthread *lt = THIS_LTHREAD;
+
+	/* if thread is detached (this is not valid) just exit */
+	if (lt->state & BIT(ST_LT_DETACH))
+		return;
+
+	/* There is a race between lthread_join() and lthread_exit()
+	 *  - if exit before join then we suspend and resume on join
+	 *  - if join before exit then we resume the joining thread
+	 */
+	if ((lt->join == LT_JOIN_INITIAL)
+	    && rte_atomic64_cmpset(&lt->join, LT_JOIN_INITIAL,
+				   LT_JOIN_EXITING)) {
+
+		DIAG_EVENT(lt, LT_DIAG_LTHREAD_EXIT, 1, 0);
+		_suspend();
+		/* set the exit value */
+		if ((ptr != NULL) && (lt->lt_join->lt_exit_ptr != NULL))
+			*(lt->lt_join->lt_exit_ptr) = ptr;
+
+		/* let the joining thread know we have set the exit value */
+		lt->join = LT_JOIN_EXIT_VAL_SET;
+	} else {
+
+		DIAG_EVENT(lt, LT_DIAG_LTHREAD_EXIT, 0, 0);
+		/* set the exit value */
+		if ((ptr != NULL) && (lt->lt_join->lt_exit_ptr != NULL))
+			*(lt->lt_join->lt_exit_ptr) = ptr;
+		/* let the joining thread know we have set the exit value */
+		lt->join = LT_JOIN_EXIT_VAL_SET;
+		_ready_queue_insert(lt->lt_join->sched,
+				    (struct lthread *)lt->lt_join);
+	}
+
+
+	/* wait until the joining thread has collected the exit value */
+	while (lt->join != LT_JOIN_EXIT_VAL_READ)
+		_reschedule();
+
+	/* reset join state */
+	lt->join = LT_JOIN_INITIAL;
+
+	/* detach it so its resources can be released */
+	lt->state |= (BIT(ST_LT_DETACH) | BIT(ST_LT_EXITED));
+}
+
+/*
+ * Join an lthread
+ * Suspend until the joined thread returns
+ */
+int lthread_join(struct lthread *lt, void **ptr)
+{
+	if (lt == NULL)
+		return POSIX_ERRNO(EINVAL);
+
+	struct lthread *current = THIS_LTHREAD;
+	uint64_t lt_state = lt->state;
+
+	/* invalid to join a detached thread, or one already being joined */
+	if ((lt_state & BIT(ST_LT_DETACH)) || (lt->join == LT_JOIN_THREAD_SET))
+		return POSIX_ERRNO(EINVAL);
+	/* pointer to the joining thread and a pointer to return a value */
+	lt->lt_join = current;
+	current->lt_exit_ptr = ptr;
+	/* There is a race between lthread_join() and lthread_exit()
+	 *  - if join before exit we suspend and will resume when exit is called
+	 *  - if exit before join we resume the exiting thread
+	 */
+	if ((lt->join == LT_JOIN_INITIAL)
+	    && rte_atomic64_cmpset(&lt->join, LT_JOIN_INITIAL,
+				   LT_JOIN_THREAD_SET)) {
+
+		DIAG_EVENT(current, LT_DIAG_LTHREAD_JOIN, lt, 1);
+		_suspend();
+	} else {
+		DIAG_EVENT(current, LT_DIAG_LTHREAD_JOIN, lt, 0);
+		_ready_queue_insert(lt->sched, lt);
+	}
+
+	/* wait for exiting thread to set return value */
+	while (lt->join != LT_JOIN_EXIT_VAL_SET)
+		_reschedule();
+
+	/* collect the return value */
+	if (ptr != NULL)
+		*ptr = *current->lt_exit_ptr;
+
+	/* let the exiting thread proceed to exit */
+	lt->join = LT_JOIN_EXIT_VAL_READ;
+	return 0;
+}
+
+
+/*
+ * Detach current lthread
+ * A detached thread cannot be joined
+ */
+void lthread_detach(void)
+{
+	struct lthread *lt = THIS_LTHREAD;
+
+	DIAG_EVENT(lt, LT_DIAG_LTHREAD_DETACH, 0, 0);
+
+	uint64_t state = lt->state;
+
+	lt->state = state | BIT(ST_LT_DETACH);
+}
+
+/*
+ * Set function name of an lthread
+ * this is a debug aid
+ */
+void lthread_set_funcname(const char *f)
+{
+	struct lthread *lt = THIS_LTHREAD;
+
+	strncpy(lt->funcname, f, sizeof(lt->funcname));
+	lt->funcname[sizeof(lt->funcname)-1] = 0;
+}
diff --git a/examples/performance-thread/common/lthread.h b/examples/performance-thread/common/lthread.h
new file mode 100644
index 0000000..571c289
--- /dev/null
+++ b/examples/performance-thread/common/lthread.h
@@ -0,0 +1,99 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * Some portions of this software are derived from
+ * https://github.com/halayli/lthread, which carries the following license.
+ *
+ * Copyright (C) 2012, Hasan Alayli <halayli@gmail.com>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#ifndef LTHREAD_H_
+#define LTHREAD_H_
+
+#include <rte_per_lcore.h>
+
+#include "lthread_api.h"
+#include "lthread_diag.h"
+
+struct lthread;
+struct lthread_sched;
+
+/* function to be called when a context function returns */
+typedef void (*lthread_exit_func) (struct lthread *);
+
+void _lthread_exit_handler(struct lthread *lt);
+
+void lthread_set_funcname(const char *f);
+
+void _lthread_sched_busy_sleep(struct lthread *lt, uint64_t nsecs);
+
+int _lthread_desched_sleep(struct lthread *lt);
+
+void _lthread_free(struct lthread *lt);
+
+struct lthread_sched *_lthread_sched_get(int lcore_id);
+
+struct lthread_stack *_stack_alloc(void);
+
+struct lthread_sched *
+_lthread_sched_create(size_t stack_size);
+
+void
+_lthread_init(struct lthread *lt,
+	      lthread_func_t fun, void *arg, lthread_exit_func exit_handler);
+
+void _lthread_set_stack(struct lthread *lt, void *stack, size_t stack_size);
+
+#endif				/* LTHREAD_H_ */
diff --git a/examples/performance-thread/common/lthread_api.h b/examples/performance-thread/common/lthread_api.h
new file mode 100644
index 0000000..6d3a6da
--- /dev/null
+++ b/examples/performance-thread/common/lthread_api.h
@@ -0,0 +1,829 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * Some portions of this software may have been derived from the
+ * https://github.com/halayli/lthread which carries the following license.
+ *
+ * Copyright (C) 2012, Hasan Alayli <halayli@gmail.com>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/**
+ *  @file
+ *
+ * This file contains the public API for the L-thread subsystem.
+ *
+ * The L-thread subsystem provides a simple cooperative scheduler to
+ * enable arbitrary functions to run as cooperative threads within a
+ * single P-thread.
+ *
+ * The subsystem provides a P-thread like API that is intended to assist in
+ * reuse of legacy code written for POSIX pthreads.
+ *
+ * The L-thread subsystem relies on cooperative multitasking, as such
+ * an L-thread must possess frequent rescheduling points. Often these
+ * rescheduling points are provided transparently when the application
+ * invokes an L-thread API.
+ *
+ * In some applications the program may enter a loop whose exit condition
+ * depends on the action of another thread or a response from hardware.
+ * In such a case it is necessary to yield the thread
+ * periodically in the loop body, to allow other threads an opportunity to
+ * run. This can be done by inserting a call to lthread_yield() or
+ * lthread_sleep(n) in the body of the loop.
+ *
+ * If the application makes expensive / blocking system calls or does other
+ * work that would take an inordinate amount of time to complete, this will
+ * stall the cooperative scheduler resulting in very poor performance.
+ *
+ * In such cases an L-thread can be migrated temporarily to another scheduler
+ * running in a different P-thread on another core. When the expensive or
+ * blocking operation is completed it can be migrated back to the original
+ * scheduler.  In this way other threads can continue to run on the original
+ * scheduler and will be completely unaffected by the blocking behaviour.
+ * To migrate an L-thread to another scheduler the API lthread_set_affinity()
+ * is provided.
+ *
+ * If L-threads that share data are running on the same core it is possible
+ * to design programs where mutual exclusion mechanisms to protect shared data
+ * can be avoided. This is due to the fact that the cooperative threads cannot
+ * preempt each other.
+ *
+ * There are two cases where mutual exclusion mechanisms are necessary.
+ *
+ *  a) Where the L-threads sharing data are running on different cores.
+ *  b) Where code must yield while updating data shared with another thread.
+ *
+ * The L-thread subsystem provides a set of mutex APIs to help with such
+ * scenarios; however, excessive reliance on these will impact performance
+ * and is best avoided if possible.
+ *
+ * L-threads can synchronise using a fast condition variable implementation
+ * that supports signal and broadcast. An L-thread running on any core can
+ * wait on a condition.
+ *
+ * L-threads can have L-thread local storage with an API modelled on either the
+ * P-thread get/set specific API or using PER_LTHREAD macros modelled on the
+ * RTE_PER_LCORE macros. Alternatively a simple user data pointer may be set
+ * and retrieved from a thread.
+ */
+#ifndef LTHREAD_H
+#define LTHREAD_H
+
+#include <stdint.h>
+#include <sys/socket.h>
+#include <fcntl.h>
+#include <netinet/in.h>
+
+#include <rte_cycles.h>
+
+
+struct lthread;
+struct lthread_cond;
+struct lthread_mutex;
+
+struct lthread_condattr;
+struct lthread_mutexattr;
+
+typedef void (*lthread_func_t) (void *);
+
+/*
+ * Define the stack size for an lthread.
+ * This is the size that will be allocated on lthread creation.
+ * The size is fixed and the stack will not grow.
+ */
+#define LTHREAD_MAX_STACK_SIZE (1024*64)
+
+/**
+ * Define the maximum number of TLS keys that can be created
+ *
+ */
+#define LTHREAD_MAX_KEYS 1024
+
+/**
+ * Define the maximum number of attempts to destroy an lthread's
+ * TLS data on thread exit
+ */
+#define LTHREAD_DESTRUCTOR_ITERATIONS 4
+
+
+/**
+ * Define the maximum number of lcores that will support lthreads
+ */
+#define LTHREAD_MAX_LCORES RTE_MAX_LCORE
+
+/**
+ * How many lthread objects to pre-allocate as the system grows.
+ * Applies to lthreads + stacks, TLS, mutexes, cond vars.
+ *
+ * @see _lthread_alloc()
+ * @see _cond_alloc()
+ * @see _mutex_alloc()
+ *
+ */
+#define LTHREAD_PREALLOC 100
+
+/**
+ * Set the number of schedulers in the system.
+ *
+ * This function may optionally be called before starting schedulers.
+ *
+ * If the number of schedulers is not set, or set to 0 then each scheduler
+ * will begin scheduling lthreads as soon as it is started.
+ *
+ * If the number of schedulers is set to greater than 0, then each scheduler
+ * will wait until all schedulers have started before beginning to schedule
+ * lthreads.
+ *
+ * If an application wishes to have threads migrate between cores using
+ * lthread_set_affinity(), or join threads running on other cores using
+ * lthread_join(), then it is prudent to set the number of schedulers to ensure
+ * that all schedulers are initialised beforehand.
+ *
+ * @param num
+ *  the number of schedulers in the system
+ * @return
+ * the number of schedulers in the system
+ */
+int lthread_num_schedulers_set(int num);
+
+/**
+ * Return the number of schedulers currently running
+ * @return
+ *  the number of schedulers currently running
+ */
+int lthread_active_schedulers(void);
+
+/**
+  * Shutdown the specified scheduler
+  *
+  *  This function tells the specified scheduler to
+  *  exit if/when there is no more work to do.
+  *
+  *  Note that although the scheduler will stop,
+  *  resources are not freed.
+  *
+  * @param lcore
+  *	The lcore of the scheduler to shutdown
+  *
+  * @return
+  *  none
+  */
+void lthread_scheduler_shutdown(unsigned lcore);
+
+/**
+  * Shutdown all schedulers
+  *
+  *  This function tells all schedulers, including the current scheduler, to
+  *  exit if/when there is no more work to do.
+  *
+  *  Note that although the schedulers will stop,
+  *  resources are not freed.
+  *
+  * @return
+  *  none
+  */
+void lthread_scheduler_shutdown_all(void);
+
+/**
+  * Run the lthread scheduler
+  *
+  *  Runs the lthread scheduler.
+  *  This function returns only if/when all lthreads have exited.
+  *  This function must be the main loop of an EAL thread.
+  *
+  * @return
+  *	 none
+  */
+
+void lthread_run(void);
+
+/**
+  * Create an lthread
+  *
+  *  Creates an lthread and places it in the ready queue on a particular
+  *  lcore.
+  *
+  *  If no scheduler exists yet on the current lcore then one is created.
+  *
+  * @param new_lt
+  *  Pointer to an lthread pointer that will be initialized
+  * @param lcore
+  *  the lcore the thread should be started on:
+  *    -1 the current lcore
+  *    0 - LTHREAD_MAX_LCORES any other lcore
+  * @param func
+  *  Pointer to the function for the thread to run
+  * @param arg
+  *  Pointer to args that will be passed to the thread
+  *
+  * @return
+  *	 0    success
+  *	 EAGAIN  no resources available
+  *	 EINVAL  NULL thread or function pointer, or lcore_id out of range
+  */
+int
+lthread_create(struct lthread **new_lt,
+		int lcore, lthread_func_t func, void *arg);
+
+/**
+  * Cancel an lthread
+  *
+  *  Cancels an lthread and causes it to be terminated
+  *  If the lthread is detached it will be freed immediately
+  *  otherwise its resources will not be released until it is joined.
+  *
+  * @param lt
+  *  Pointer to an lthread that will be cancelled
+  *
+  * @return
+  *	 0    success
+  *	 EINVAL  thread was NULL
+  */
+int lthread_cancel(struct lthread *lt);
+
+/**
+  * Join an lthread
+  *
+  *  Joins the current thread with the specified lthread, and waits for that
+  *  thread to exit.
+  *  Passes an optional pointer to collect returned data.
+  *
+  * @param lt
+  *  Pointer to the lthread to be joined
+  * @param ptr
+  *  Pointer to pointer to collect returned data
+  *
+  * @return
+  *  0    success
+  *  EINVAL lthread could not be joined.
+  */
+int lthread_join(struct lthread *lt, void **ptr);
+
+/**
+  * Detach an lthread
+  *
+  * Detaches the current thread
+  * On exit a detached lthread will be freed immediately and will not wait
+  * to be joined. The default state for a thread is not detached.
+  *
+  * @return
+  *  none
+  */
+void lthread_detach(void);
+
+/**
+  *  Exit an lthread
+  *
+  * Terminate the current thread, optionally return data.
+  * The data may be collected by lthread_join()
+  *
+  * After calling this function the lthread will be suspended until it is
+  * joined. After it is joined then its resources will be freed.
+  *
+  * @param val
+  *  Pointer to data to be returned
+  *
+  * @return
+  *  none
+  */
+void lthread_exit(void *val);
+
+/**
+  * Cause the current lthread to sleep for n nanoseconds
+  *
+  * The current thread will be suspended until the specified time has elapsed
+  * or has been exceeded.
+  *
+  * Execution will switch to the next lthread that is ready to run
+  *
+  * @param nsecs
+  *  Number of nanoseconds to sleep
+  *
+  * @return
+  *  none
+  */
+void lthread_sleep(uint64_t nsecs);
+
+/**
+  * Cause the current lthread to sleep for n cpu clock ticks
+  *
+  *  The current thread will be suspended until the specified time has elapsed
+  *  or has been exceeded.
+  *
+  *  Execution will switch to the next lthread that is ready to run
+  *
+  * @param clks
+  *  Number of clock ticks to sleep
+  *
+  * @return
+  *  none
+  */
+void lthread_sleep_clks(uint64_t clks);
+
+/**
+  * Yield the current lthread
+  *
+  *  The current thread will yield and execution will switch to the
+  *  next lthread that is ready to run
+  *
+  * @return
+  *  none
+  */
+void lthread_yield(void);
+
+/**
+  * Migrate the current thread to another scheduler
+  *
+  *  This function migrates the current thread to another scheduler.
+  *  Execution will switch to the next lthread that is ready to run on the
+  *  current scheduler. The current thread will be resumed on the new scheduler.
+  *
+  * @param lcore
+  *	The lcore to migrate to
+  *
+  * @return
+  *  0   success we are now running on the specified core
+  *  EINVAL the destination lcore was not valid
+  */
+int lthread_set_affinity(unsigned lcore);
+
+/**
+  * Return the current lthread
+  *
+  *  Returns the current lthread
+  *
+  * @return
+  *  pointer to the current lthread
+  */
+struct lthread
+*lthread_current(void);
+
+/**
+  * Associate user data with an lthread
+  *
+  *  This function sets a user data pointer in the current lthread
+  *  The pointer can be retrieved with lthread_get_data()
+  *  It is the user's responsibility to allocate and free any data referenced
+  *  by the user pointer.
+  *
+  * @param data
+  *  pointer to user data
+  *
+  * @return
+  *  none
+  */
+void lthread_set_data(void *data);
+
+/**
+  * Get user data for the current lthread
+  *
+  *  This function returns a user data pointer for the current lthread
+  *  The pointer must first be set with lthread_set_data()
+  *  It is the user's responsibility to allocate and free any data referenced
+  *  by the user pointer.
+  *
+  * @return
+  *  pointer to user data
+  */
+void
+*lthread_get_data(void);
+
+struct lthread_key;
+typedef void (*tls_destructor_func) (void *);
+
+/**
+  * Create a key for lthread TLS
+  *
+  *  This function is modelled on pthread_key_create
+  *  It creates a thread-specific data key visible to all lthreads on the
+  *  current scheduler.
+  *
+  *  Key values may be used to locate thread-specific data.
+  *  The same key value may be used by different threads, the values bound
+  *  to the key by lthread_setspecific() are maintained on a per-thread
+  *  basis and persist for the life of the calling thread.
+  *
+  *  An optional destructor function may be associated with each key value.
+  *  At thread exit, if a key value has a non-NULL destructor pointer, and the
+  *  thread has a non-NULL value associated with the key, the function pointed
+  *  to is called with the current associated value as its sole argument.
+  *
+  * @param key
+  *   Pointer to the key to be created
+  * @param destructor
+  *   Pointer to destructor function
+  *
+  * @return
+  *  0 success
+  *  EINVAL the key ptr was NULL
+  *  EAGAIN no resources available
+  */
+int lthread_key_create(unsigned int *key, tls_destructor_func destructor);
+
+/**
+  * Delete key for lthread TLS
+  *
+  *  This function is modelled on pthread_key_delete().
+  *  It deletes a thread-specific data key previously returned by
+  *  lthread_key_create().
+  *  The thread-specific data values associated with the key need not be NULL
+  *  at the time that lthread_key_delete is called.
+  *  It is the responsibility of the application to free any application
+  *  storage or perform any cleanup actions for data structures related to the
+  *  deleted key. This cleanup can be done either before or after
+  * lthread_key_delete is called.
+  *
+  * @param key
+  *  The key to be deleted
+  *
+  * @return
+  *  0 Success
+  *  EINVAL the key was invalid
+  */
+int lthread_key_delete(unsigned int key);
+
+/**
+  * Get lthread TLS
+  *
+  *  This function is modelled on pthread_getspecific().
+  *  It returns the value currently bound to the specified key on behalf of the
+  *  calling thread. Calling lthread_getspecific() with a key value not
+  *  obtained from lthread_key_create() or after key has been deleted with
+  *  lthread_key_delete() will result in undefined behaviour.
+  *  lthread_getspecific() may be called from a thread-specific data destructor
+  *  function.
+  *
+  * @param key
+  *  The key for which data is requested
+  *
+  * @return
+  *  Pointer to the thread specific data associated with that key
+  *  or NULL if no data has been set.
+  */
+void
+*lthread_getspecific(unsigned int key);
+
+/**
+  * Set lthread TLS
+  *
+  *  This function is modelled on pthread_setspecific().
+  *  It associates a thread-specific value with a key obtained via a previous
+  *  call to lthread_key_create().
+  *  Different threads may bind different values to the same key. These values
+  *  are typically pointers to dynamically allocated memory that have been
+  *  reserved by the calling thread. Calling lthread_setspecific with a key
+  *  value not obtained from lthread_key_create or after the key has been
+  *  deleted with lthread_key_delete will result in undefined behaviour.
+  *
+  * @param key
+  *  The key for which data is to be set
+  * @param value
+  *  Pointer to the user data
+  *
+  * @return
+  *  0 success
+  *  EINVAL the key was invalid
+  */
+
+int lthread_setspecific(unsigned int key, const void *value);
+
+/**
+ * The macros below provide an alternative mechanism to access lthread local
+ *  storage.
+ *
+ * The macros can be used to declare, define and access per lthread local
+ * storage in a similar way to the RTE_PER_LCORE macros which control storage
+ * local to an lcore.
+ *
+ * Memory for per lthread variables declared in this way is allocated when the
+ * lthread is created and a pointer to this memory is stored in the lthread.
+ * The per lthread variables are accessed via the pointer + the offset of the
+ * particular variable.
+ *
+ * The total size of per lthread storage, and the variable offsets are found by
+ * defining the variables in a unique global memory section, the start and end
+ * of which is known. This global memory section is used only in the
+ * computation of the addresses of the lthread variables, and is never actually
+ * used to store any data.
+ *
+ * Due to the fact that variables declared this way may be scattered across
+ * many files, the start and end of the section and variable offsets are only
+ * known after linking, thus the computation of section size and variable
+ * addresses is performed at run time.
+ *
+ * These macros are primarily provided to aid porting of code that makes use
+ * of the existing RTE_PER_LCORE macros. In principle it would be more efficient
+ * to gather all lthread local variables into a single structure and
+ * set/retrieve a pointer to that struct using the alternative
+ * lthread_set_data()/lthread_get_data() APIs.
+ *
+ * These macros are mutually exclusive with the lthread_set_data() and
+ * lthread_get_data() APIs. If you define storage using these macros then
+ * those APIs will not perform as expected: lthread_set_data() does nothing,
+ * and lthread_get_data() returns the start of the global section.
+ *
+ */
+/* start and end of per lthread section */
+extern char __start_per_lt;
+extern char __stop_per_lt;
+
+
+#define RTE_DEFINE_PER_LTHREAD(type, name)                      \
+__typeof__(type)__attribute((section("per_lt"))) per_lt_##name
+
+/**
+ * Macro to declare an extern per lthread variable "name" of type "type"
+ */
+#define RTE_DECLARE_PER_LTHREAD(type, name)                     \
+extern __typeof__(type)__attribute((section("per_lt"))) per_lt_##name
+
+/**
+ * Access the per-lthread variable; the macro yields a pointer to it
+ */
+#define RTE_PER_LTHREAD(name) ((__typeof__(per_lt_##name) *)\
+((char *)lthread_get_data() +\
+((char *) &per_lt_##name - &__start_per_lt)))
+
+/**
+  * Initialize a mutex
+  *
+  *  This function provides a mutual exclusion device, the need for which
+  *  can normally be avoided in a cooperative multitasking environment.
+  *  It is provided to aid porting of legacy code originally written for
+  *  preemptive multitasking environments such as pthreads.
+  *
+  *  A mutex may be unlocked (not owned by any thread), or locked (owned by
+  *  one thread).
+  *
+  *  A mutex can never be owned by more than one thread simultaneously.
+  *  A thread attempting to lock a mutex that is already locked by another
+  *  thread is suspended until the owning thread unlocks the mutex.
+  *
+  *  lthread_mutex_init() initializes the mutex object pointed to by mutex.
+  *  Optional mutex attributes specified in attr are reserved for future
+  *  use and are currently ignored.
+  *
+  *  If a thread calls lthread_mutex_lock() on the mutex, then if the mutex
+  *  is currently unlocked, it becomes locked and owned by the calling
+  *  thread, and lthread_mutex_lock returns immediately. If the mutex is
+  *  already locked by another thread, lthread_mutex_lock suspends the calling
+  *  thread until the mutex is unlocked.
+  *
+  *  lthread_mutex_trylock behaves identically to lthread_mutex_lock, except
+  *  that it does not block the calling thread if the mutex is already locked
+  *  by another thread.
+  *
+  *  lthread_mutex_unlock() unlocks the specified mutex. The mutex is assumed
+  *  to be locked and owned by the calling thread.
+  *
+  *  lthread_mutex_destroy() destroys a mutex object, freeing its resources.
+  *  The mutex must be unlocked with nothing blocked on it before calling
+  *  lthread_mutex_destroy.
+  *
+  * @param name
+  *  Optional pointer to string describing the mutex
+  * @param mutex
+  *  Pointer to pointer to the mutex to be initialized
+  * @param attr
+  *  Pointer to attribute - unused reserved
+  *
+  * @return
+  *  0 success
+  *  EINVAL mutex was not a valid pointer
+  *  EAGAIN insufficient resources
+  */
+
+int
+lthread_mutex_init(char *name, struct lthread_mutex **mutex,
+		   const struct lthread_mutexattr *attr);
+
+/**
+  * Destroy a mutex
+  *
+  *  This function destroys the specified mutex freeing its resources.
+  *  The mutex must be unlocked before calling lthread_mutex_destroy.
+  *
+  * @see lthread_mutex_init()
+  *
+  * @param mutex
+  *  Pointer to the mutex to be destroyed
+  *
+  * @return
+  *  0 success
+  *  EINVAL mutex was not an initialized mutex
+  *  EBUSY mutex was still in use
+  */
+int lthread_mutex_destroy(struct lthread_mutex *mutex);
+
+/**
+  * Lock a mutex
+  *
+  *  This function attempts to lock a mutex.
+  *  If a thread calls lthread_mutex_lock() on the mutex, then if the mutex
+  *  is currently unlocked, it becomes locked and owned by the calling
+  *  thread, and lthread_mutex_lock returns immediately. If the mutex is
+  *  already locked by another thread, lthread_mutex_lock suspends the calling
+  *  thread until the mutex is unlocked.
+  *
+  * @see lthread_mutex_init()
+  *
+  * @param mutex
+  *  Pointer to the mutex to be locked
+  *
+  * @return
+  *  0 success
+  *  EINVAL mutex was not an initialized mutex
+  *  EDEADLK the mutex was already owned by the calling thread
+  */
+
+int lthread_mutex_lock(struct lthread_mutex *mutex);
+
+/**
+  * Try to lock a mutex
+  *
+  *  This function attempts to lock a mutex.
+  *  lthread_mutex_trylock behaves identically to lthread_mutex_lock, except
+  *  that it does not block the calling thread if the mutex is already locked
+  *  by another thread.
+  *
+  *
+  * @see lthread_mutex_init()
+  *
+  * @param mutex
+  *  Pointer to the mutex to be locked
+  *
+  * @return
+  * 0 success
+  * EINVAL mutex was not an initialized mutex
+  * EBUSY the mutex was already locked by another thread
+  */
+int lthread_mutex_trylock(struct lthread_mutex *mutex);
+
+/**
+  * Unlock a mutex
+  *
+  * This function attempts to unlock the specified mutex. The mutex is assumed
+  * to be locked and owned by the calling thread.
+  *
+  * The oldest of any threads blocked on the mutex is made ready and may
+  * compete with any other running thread to gain the mutex; if it fails it
+  * will be blocked again.
+  *
+  * @param mutex
+  *  Pointer to the mutex to be unlocked
+  *
+  * @return
+  *  0 mutex was unlocked
+  *  EINVAL mutex was not an initialized mutex
+  *  EPERM the mutex was not owned by the calling thread
+  */
+
+int lthread_mutex_unlock(struct lthread_mutex *mutex);
+
+/**
+  * Initialize a condition variable
+  *
+  *  This function initializes a condition variable.
+  *
+  *  Condition variables can be used to communicate changes in the state of data
+  *  shared between threads.
+  *
+  * @see lthread_cond_wait()
+  *
+  * @param name
+  *  Pointer to optional string describing the condition variable
+  * @param c
+  *  Pointer to pointer to the condition variable to be initialized
+  * @param attr
+  *  Pointer to optional attribute reserved for future use, currently ignored
+  *
+  * @return
+  *  0 success
+  *  EINVAL cond was not a valid pointer
+  *  EAGAIN insufficient resources
+  */
+int
+lthread_cond_init(char *name, struct lthread_cond **c,
+		  const struct lthread_condattr *attr);
+
+/**
+  * Destroy a condition variable
+  *
+  *  This function destroys a condition variable that was created with
+  *  lthread_cond_init() and releases its resources.
+  *
+  * @param cond
+  *  Pointer to the condition variable to be destroyed
+  *
+  * @return
+  *  0 Success
+  *  EBUSY condition variable was still in use
+  *  EINVAL cond was not an initialised condition variable
+  */
+int lthread_cond_destroy(struct lthread_cond *cond);
+
+/**
+  * Wait on a condition variable
+  *
+  *  The function blocks the current thread waiting on the condition variable
+  *  specified by c. The waiting thread unblocks only after another thread
+  *  calls lthread_cond_signal, or lthread_cond_broadcast, specifying the
+  *  same condition variable.
+  *
+  * @param c
+  *  Pointer to the condition variable to be waited on
+  *
+  * @param reserved
+  *  reserved for future use
+  *
+  * @return
+  *  0 The condition was signalled ( Success )
+  *  EINVAL c was not an initialised condition variable
+  */
+int lthread_cond_wait(struct lthread_cond *c, uint64_t reserved);
+
+/**
+  * Signal a condition variable
+  *
+  *  The function unblocks one thread waiting for the condition variable c.
+  *  If no threads are waiting on c, the lthread_cond_signal() function
+  *  has no effect.
+  *
+  * @param c
+  *  Pointer to the condition variable to be signalled
+  *
+  * @return
+  *  0 The condition was signalled ( Success )
+  *  EINVAL c was not an initialised condition variable
+  */
+int lthread_cond_signal(struct lthread_cond *c);
+
+/**
+  * Broadcast a condition variable
+  *
+  *  The function unblocks all threads waiting for the condition variable c.
+  *  If no threads are waiting on c, the lthread_cond_broadcast()
+  *  function has no effect.
+  *
+  * @param c
+  *  Pointer to the condition variable to be broadcast
+  *
+  * @return
+  *  0 The condition was broadcast ( Success )
+  *  EINVAL c was not an initialised condition variable
+  */
+int lthread_cond_broadcast(struct lthread_cond *c);
+
+#endif				/* LTHREAD_H */
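The per-lthread storage macros in lthread_api.h compute each variable's
address as "private block base + offset of the variable within a dedicated
linker section". A minimal stand-alone sketch of that technique follows. It
is not the patch's code: the section name demo_lt, the DEMO_* macros and the
demo_* helpers are hypothetical, the private block is passed in explicitly
rather than fetched with lthread_get_data(), and the sketch assumes a GNU
toolchain where the linker emits __start_<section> and __stop_<section>
symbols for sections whose names are valid C identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: GNU ld provides these bounds for the "demo_lt" section. */
extern char __start_demo_lt[];
extern char __stop_demo_lt[];

/*
 * Template variable: it lives in the section only so that the linker
 * assigns it an offset. Threads never store data in it.
 */
#define DEMO_DEFINE_PER_THREAD(type, name) \
	__typeof__(type) __attribute__((section("demo_lt"), used)) demo_lt_##name

/* Map the template's offset in the section onto a private block. */
#define DEMO_PER_THREAD(base, name) \
	((__typeof__(demo_lt_##name) *)((char *)(base) + \
	((char *)&demo_lt_##name - __start_demo_lt)))

DEMO_DEFINE_PER_THREAD(int, counter);
DEMO_DEFINE_PER_THREAD(double, ratio);

/* Size of one private block; only known once the program is linked. */
static size_t demo_per_thread_size(void)
{
	return (size_t)(__stop_demo_lt - __start_demo_lt);
}

/* Allocate one zeroed private block, as would happen at thread creation. */
static void *demo_per_thread_alloc(void)
{
	return calloc(1, demo_per_thread_size());
}

/* Increment the counter held in the given private block. */
static int demo_bump_counter(void *base)
{
	assert(base != NULL);
	return ++*DEMO_PER_THREAD(base, counter);
}
```

Two blocks obtained from demo_per_thread_alloc() hold independent copies of
counter and ratio, while the template variables in the section stay
untouched. As with RTE_PER_LTHREAD, the accessor yields a pointer, so callers
dereference it to read or write the value.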
diff --git a/examples/performance-thread/common/lthread_cond.c b/examples/performance-thread/common/lthread_cond.c
new file mode 100644
index 0000000..b90f22f
--- /dev/null
+++ b/examples/performance-thread/common/lthread_cond.c
@@ -0,0 +1,241 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * Some portions of this software may have been derived from the
+ * https://github.com/halayli/lthread which carries the following license.
+ *
+ * Copyright (C) 2012, Hasan Alayli <halayli@gmail.com>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <limits.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <sys/time.h>
+#include <sys/mman.h>
+#include <errno.h>
+
+#include <rte_config.h>
+#include <rte_log.h>
+#include <rte_common.h>
+
+#include "lthread_api.h"
+#include "lthread_diag_api.h"
+#include "lthread_diag.h"
+#include "lthread_int.h"
+#include "lthread_sched.h"
+#include "lthread_queue.h"
+#include "lthread_objcache.h"
+#include "lthread_timer.h"
+#include "lthread_mutex.h"
+#include "lthread_cond.h"
+
+/*
+ * Create a condition variable
+ */
+int
+lthread_cond_init(char *name, struct lthread_cond **cond,
+		  __rte_unused const struct lthread_condattr *attr)
+{
+	struct lthread_cond *c;
+
+	if (cond == NULL)
+		return POSIX_ERRNO(EINVAL);
+
+	/* allocate a condition variable from cache */
+	c = _lthread_objcache_alloc((THIS_SCHED)->cond_cache);
+
+	if (c == NULL)
+		return POSIX_ERRNO(EAGAIN);
+
+	c->blocked = _lthread_queue_create("blocked");
+	if (c->blocked == NULL) {
+		_lthread_objcache_free((THIS_SCHED)->cond_cache, (void *)c);
+		return POSIX_ERRNO(EAGAIN);
+	}
+
+	if (name == NULL)
+		strncpy(c->name, "no name", sizeof(c->name));
+	else
+		strncpy(c->name, name, sizeof(c->name));
+	c->name[sizeof(c->name)-1] = 0;
+
+	c->root_sched = THIS_SCHED;
+
+	(*cond) = c;
+	DIAG_CREATE_EVENT((*cond), LT_DIAG_COND_CREATE);
+	return 0;
+}
+
+/*
+ * Destroy a condition variable
+ */
+int lthread_cond_destroy(struct lthread_cond *c)
+{
+	if (c == NULL) {
+		DIAG_EVENT(c, LT_DIAG_COND_DESTROY, c, POSIX_ERRNO(EINVAL));
+		return POSIX_ERRNO(EINVAL);
+	}
+
+	/* try to free it */
+	if (_lthread_queue_destroy(c->blocked) < 0) {
+		/* queue in use */
+		DIAG_EVENT(c, LT_DIAG_COND_DESTROY, c, POSIX_ERRNO(EBUSY));
+		return POSIX_ERRNO(EBUSY);
+	}
+
+	/* okay free it, recording the event while c is still owned here */
+	DIAG_EVENT(c, LT_DIAG_COND_DESTROY, c, 0);
+	_lthread_objcache_free(c->root_sched->cond_cache, c);
+	return 0;
+}
+
+/*
+ * Wait on a condition variable
+ */
+int lthread_cond_wait(struct lthread_cond *c, __rte_unused uint64_t reserved)
+{
+	struct lthread *lt = THIS_LTHREAD;
+
+	if (c == NULL) {
+		DIAG_EVENT(c, LT_DIAG_COND_WAIT, c, POSIX_ERRNO(EINVAL));
+		return POSIX_ERRNO(EINVAL);
+	}
+
+
+	DIAG_EVENT(c, LT_DIAG_COND_WAIT, c, 0);
+
+	/* Queue the current thread on the blocked queue.
+	 * The queue write is deferred until we return to the scheduler,
+	 * which ensures that the current thread context is saved
+	 * before any signal could result in it being dequeued and
+	 * resumed.
+	 */
+	lt->pending_wr_queue = c->blocked;
+	_suspend();
+
+	/* the condition happened */
+	return 0;
+}
+
+/*
+ * Signal a condition variable
+ * attempt to resume any blocked thread
+ */
+int lthread_cond_signal(struct lthread_cond *c)
+{
+	struct lthread *lt;
+
+	if (c == NULL) {
+		DIAG_EVENT(c, LT_DIAG_COND_SIGNAL, c, POSIX_ERRNO(EINVAL));
+		return POSIX_ERRNO(EINVAL);
+	}
+
+	lt = _lthread_queue_remove(c->blocked);
+
+	if (lt != NULL) {
+		/* okay wake up this thread */
+		DIAG_EVENT(c, LT_DIAG_COND_SIGNAL, c, lt);
+		_ready_queue_insert((struct lthread_sched *)lt->sched, lt);
+	}
+	return 0;
+}
+
+/*
+ * Broadcast a condition variable
+ */
+int lthread_cond_broadcast(struct lthread_cond *c)
+{
+	struct lthread *lt;
+
+	if (c == NULL) {
+		DIAG_EVENT(c, LT_DIAG_COND_BROADCAST, c, POSIX_ERRNO(EINVAL));
+		return POSIX_ERRNO(EINVAL);
+	}
+
+	DIAG_EVENT(c, LT_DIAG_COND_BROADCAST, c, 0);
+	do {
+		/* drain the queue waking everybody */
+		lt = _lthread_queue_remove(c->blocked);
+
+		if (lt != NULL) {
+			DIAG_EVENT(c, LT_DIAG_COND_BROADCAST, c, lt);
+			/* wake up */
+			_ready_queue_insert((struct lthread_sched *)lt->sched,
+					    lt);
+		}
+	} while (!_lthread_queue_empty(c->blocked));
+	_reschedule();
+	DIAG_EVENT(c, LT_DIAG_COND_BROADCAST, c, 0);
+	return 0;
+}
+
+/*
+ * return the diagnostic ref val stored in a condition var
+ */
+uint64_t
+lthread_cond_diag_ref(struct lthread_cond *c)
+{
+	if (c == NULL)
+		return 0;
+	return c->diag_ref;
+}
+
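The signal and broadcast semantics of lthread_cond.c above can be
illustrated with a standalone toy model. This is only a sketch and is not
part of the patch: the toy_* names and the fixed-size FIFO are hypothetical
stand-ins for the blocked queue, and no scheduler or context switch is
modelled. It assumes waiter ids are non-negative.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_WAITERS 8	/* hypothetical capacity, power of two */

/* toy stand-in for the "blocked" queue: a FIFO of waiter ids */
struct toy_cond {
	int waiters[TOY_MAX_WAITERS];
	size_t head;		/* next waiter to wake */
	size_t tail;		/* next free slot; head == tail means empty */
};

static void toy_cond_init(struct toy_cond *c)
{
	c->head = 0;
	c->tail = 0;
}

/* wait: record the caller as blocked; returns -1 if the FIFO is full */
static int toy_cond_wait(struct toy_cond *c, int id)
{
	assert(id >= 0);
	if (c->tail - c->head == TOY_MAX_WAITERS)
		return -1;
	c->waiters[c->tail++ % TOY_MAX_WAITERS] = id;
	return 0;
}

/* signal: wake at most one waiter; returns its id, or -1 if none */
static int toy_cond_signal(struct toy_cond *c)
{
	if (c->head == c->tail)
		return -1;
	return c->waiters[c->head++ % TOY_MAX_WAITERS];
}

/* broadcast: drain the FIFO; returns the number of waiters woken */
static size_t toy_cond_broadcast(struct toy_cond *c)
{
	size_t n = 0;

	while (toy_cond_signal(c) >= 0)
		n++;
	return n;
}
```

As in the patch, signalling an empty condition variable is not an error:
it simply wakes nobody.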
diff --git a/examples/performance-thread/common/lthread_cond.h b/examples/performance-thread/common/lthread_cond.h
new file mode 100644
index 0000000..9341df3
--- /dev/null
+++ b/examples/performance-thread/common/lthread_cond.h
@@ -0,0 +1,77 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * Some portions of this software may have been derived from the
+ * https://github.com/halayli/lthread which carries the following license.
+ *
+ * Copyright (C) 2012, Hasan Alayli <halayli@gmail.com>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#ifndef LTHREAD_COND_H_
+#define LTHREAD_COND_H_
+
+#include "lthread_queue.h"
+
+#define MAX_COND_NAME_SIZE 64
+
+struct lthread_cond {
+	struct lthread_queue *blocked;
+	struct lthread_sched *root_sched;
+	int count;
+	char name[MAX_COND_NAME_SIZE];
+	uint64_t diag_ref;	/* optional ref to user diag data */
+} __rte_cache_aligned;
+
+#endif				/* LTHREAD_COND_H_ */
diff --git a/examples/performance-thread/common/lthread_diag.c b/examples/performance-thread/common/lthread_diag.c
new file mode 100644
index 0000000..9b23704
--- /dev/null
+++ b/examples/performance-thread/common/lthread_diag.c
@@ -0,0 +1,321 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_config.h>
+#include <rte_log.h>
+#include <rte_common.h>
+
+#include "lthread_diag.h"
+#include "lthread_queue.h"
+#include "lthread_pool.h"
+#include "lthread_objcache.h"
+#include "lthread_sched.h"
+#include "lthread_diag_api.h"
+
+
+/* dummy ref value of default diagnostic callback */
+static uint64_t dummy_ref;
+
+#define DIAG_SCHED_STATS_FORMAT \
+"core %d\n%33s %12s %12s %12s %12s\n"
+
+#define DIAG_CACHE_STATS_FORMAT \
+"%20s %12lu %12lu %12lu %12lu %12lu\n"
+
+#define DIAG_QUEUE_STATS_FORMAT \
+"%20s %12lu %12lu %12lu\n"
+
+
+/*
+ * texts used in diagnostic events,
+ * corresponding diagnostic mask bit positions are given as comment
+ */
+const char *diag_event_text[] = {
+	"LTHREAD_CREATE     ",	/* 00 */
+	"LTHREAD_EXIT       ",	/* 01 */
+	"LTHREAD_JOIN       ",	/* 02 */
+	"LTHREAD_CANCEL     ",	/* 03 */
+	"LTHREAD_DETACH     ",	/* 04 */
+	"LTHREAD_FREE       ",	/* 05 */
+	"LTHREAD_SUSPENDED  ",	/* 06 */
+	"LTHREAD_YIELD      ",	/* 07 */
+	"LTHREAD_RESCHEDULED",	/* 08 */
+	"LTHREAD_SLEEP      ",	/* 09 */
+	"LTHREAD_RESUMED    ",	/* 10 */
+	"LTHREAD_AFFINITY   ",	/* 11 */
+	"LTHREAD_TMR_START  ",	/* 12 */
+	"LTHREAD_TMR_DELETE ",	/* 13 */
+	"LTHREAD_TMR_EXPIRED",	/* 14 */
+	"COND_CREATE        ",	/* 15 */
+	"COND_DESTROY       ",	/* 16 */
+	"COND_WAIT          ",	/* 17 */
+	"COND_SIGNAL        ",	/* 18 */
+	"COND_BROADCAST     ",	/* 19 */
+	"MUTEX_CREATE       ",	/* 20 */
+	"MUTEX_DESTROY      ",	/* 21 */
+	"MUTEX_LOCK         ",	/* 22 */
+	"MUTEX_TRYLOCK      ",	/* 23 */
+	"MUTEX_BLOCKED      ",	/* 24 */
+	"MUTEX_UNLOCKED     ",	/* 25 */
+	"SCHED_CREATE       ",	/* 26 */
+	"SCHED_SHUTDOWN     "	/* 27 */
+};
+
+
+/*
+ * set diagnostic mask
+ */
+void lthread_diagnostic_set_mask(DIAG_USED uint64_t mask)
+{
+#if LTHREAD_DIAG
+	diag_mask = mask;
+#else
+	RTE_LOG(INFO, LTHREAD,
+		"LTHREAD_DIAG is not set, see lthread_diag_api.h\n");
+#endif
+}
+
+
+/*
+ * Check consistency of the scheduler stats
+ * Only sensible to run after the schedulers are stopped
+ * Count the number of objects lying in caches and queues
+ * and available in the qnode pool.
+ * This should be equal to the total capacity of all
+ * qnode pools.
+ */
+void
+_sched_stats_consistency_check(void);
+void
+_sched_stats_consistency_check(void)
+{
+#if LTHREAD_DIAG
+	int i;
+	struct lthread_sched *sched;
+	uint64_t count = 0;
+	uint64_t capacity = 0;
+
+	for (i = 0; i < LTHREAD_MAX_LCORES; i++) {
+		sched = schedcore[i];
+		if (sched == NULL)
+			continue;
+
+		/* each of these queues consumes a stub node */
+		count += 8;
+		count += DIAG_COUNT(sched->ready, size);
+		count += DIAG_COUNT(sched->pready, size);
+		count += DIAG_COUNT(sched->lthread_cache, available);
+		count += DIAG_COUNT(sched->stack_cache, available);
+		count += DIAG_COUNT(sched->tls_cache, available);
+		count += DIAG_COUNT(sched->per_lthread_cache, available);
+		count += DIAG_COUNT(sched->cond_cache, available);
+		count += DIAG_COUNT(sched->mutex_cache, available);
+
+		/* the node pool does not consume a stub node */
+		if (sched->qnode_pool->fast_alloc != NULL)
+			count++;
+		count += DIAG_COUNT(sched->qnode_pool, available);
+
+		capacity += DIAG_COUNT(sched->qnode_pool, capacity);
+	}
+	if (count != capacity) {
+		RTE_LOG(CRIT, LTHREAD,
+			"Scheduler caches are inconsistent\n");
+	} else {
+		RTE_LOG(INFO, LTHREAD,
+			"Scheduler caches are ok\n");
+	}
+#endif
+}
+
+/*
+ * Display node pool stats
+ */
+static inline void
+_qnode_pool_display(DIAG_USED struct qnode_pool *p)
+{
+#if LTHREAD_DIAG
+
+	printf(DIAG_CACHE_STATS_FORMAT,
+			p->name,
+			DIAG_COUNT(p, rd),
+			DIAG_COUNT(p, wr),
+			DIAG_COUNT(p, available),
+			DIAG_COUNT(p, prealloc),
+			DIAG_COUNT(p, capacity));
+	fflush(stdout);
+#endif
+}
+
+
+/*
+ * Display queue stats
+ */
+static inline void
+_lthread_queue_display(DIAG_USED struct lthread_queue *q)
+{
+#if LTHREAD_DIAG
+
+	printf(DIAG_QUEUE_STATS_FORMAT,
+			q->name,
+			DIAG_COUNT(q, rd),
+			DIAG_COUNT(q, wr),
+			DIAG_COUNT(q, size));
+	fflush(stdout);
+#endif
+}
+
+/*
+ * Display objcache stats
+ */
+static inline void
+_objcache_display(DIAG_USED struct lthread_objcache *c)
+{
+#if LTHREAD_DIAG
+
+	printf(DIAG_CACHE_STATS_FORMAT,
+			c->name,
+			DIAG_COUNT(c, rd),
+			DIAG_COUNT(c, wr),
+			DIAG_COUNT(c, available),
+			DIAG_COUNT(c, prealloc),
+			DIAG_COUNT(c, capacity));
+#if DISPLAY_OBCACHE_QUEUES
+	_lthread_queue_display(c->q);
+#endif
+	fflush(stdout);
+#endif
+}
+
+
+/*
+ * Display sched stats
+ */
+void
+lthread_sched_stats_display(void)
+{
+#if LTHREAD_DIAG
+	int i;
+	struct lthread_sched *sched;
+
+	for (i = 0; i < LTHREAD_MAX_LCORES; i++) {
+		sched = schedcore[i];
+		if (sched != NULL) {
+			printf(DIAG_SCHED_STATS_FORMAT,
+					sched->lcore_id,
+					"rd",
+					"wr",
+					"present",
+					"nb preallocs",
+					"capacity");
+			_lthread_queue_display(sched->ready);
+			_lthread_queue_display(sched->pready);
+			_qnode_pool_display(sched->qnode_pool);
+			_objcache_display(sched->lthread_cache);
+			_objcache_display(sched->stack_cache);
+			_objcache_display(sched->tls_cache);
+			_objcache_display(sched->per_lthread_cache);
+			_objcache_display(sched->cond_cache);
+			_objcache_display(sched->mutex_cache);
+			fflush(stdout);
+		}
+	}
+	_sched_stats_consistency_check();
+#else
+	RTE_LOG(INFO, LTHREAD,
+		"lthread diagnostics disabled\n"
+		"hint - set LTHREAD_DIAG in lthread_diag_api.h\n");
+#endif
+}
+
+/*
+ * Default diagnostic callback
+ */
+static uint64_t
+_lthread_diag_default_cb(uint64_t time, struct lthread *lt, int diag_event,
+		uint64_t diag_ref, const char *text, uint64_t p1, uint64_t p2)
+{
+	uint64_t _p2;
+	int lcore = (int) rte_lcore_id();
+
+	switch (diag_event) {
+	case LT_DIAG_LTHREAD_CREATE:
+	case LT_DIAG_MUTEX_CREATE:
+	case LT_DIAG_COND_CREATE:
+		_p2 = dummy_ref;
+		break;
+	default:
+		_p2 = p2;
+		break;
+	}
+
+	printf("%"PRIu64" %d %8.8lx %8.8lx %s %8.8lx %8.8lx\n",
+		time,
+		lcore,
+		(uint64_t) lt,
+		diag_ref,
+		text,
+		p1,
+		_p2);
+
+	return dummy_ref++;
+}
+
+/*
+ * plug in default diag callback with mask off
+ */
+void _lthread_diag_ctor(void)__attribute__((constructor));
+void _lthread_diag_ctor(void)
+{
+	diag_cb = _lthread_diag_default_cb;
+	diag_mask = 0;
+}
+
+
+/*
+ * enable diagnostics
+ */
+void lthread_diagnostic_enable(DIAG_USED diag_callback cb,
+				DIAG_USED uint64_t mask)
+{
+#if LTHREAD_DIAG
+	if (cb == NULL)
+		diag_cb = _lthread_diag_default_cb;
+	else
+		diag_cb = cb;
+	diag_mask = mask;
+#else
+	RTE_LOG(INFO, LTHREAD,
+		"LTHREAD_DIAG is not set, see lthread_diag_api.h\n");
+#endif
+}
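For illustration, a user callback suitable for lthread_diagnostic_enable()
might look like the standalone sketch below. It is not part of the patch:
the sketch_* names are hypothetical, SKETCH_EVENT_MAX is assumed to equal
LT_DIAG_EVENT_MAX (28), and the create events are assumed to have ids 0, 15
and 20 as listed in diag_event_text[]. It counts events per id and returns a
fresh ref value for create events only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct lthread;			/* opaque here; defined by the subsystem */

#define SKETCH_EVENT_MAX 28	/* assumed equal to LT_DIAG_EVENT_MAX */

static uint64_t sketch_ev_count[SKETCH_EVENT_MAX];
static uint64_t sketch_next_ref = 1;

/* same parameter list as the diag_callback typedef */
static uint64_t
sketch_counting_cb(uint64_t tsc, struct lthread *lt, int diag_event,
		uint64_t diag_ref, const char *text, uint64_t p1, uint64_t p2)
{
	(void)tsc; (void)lt; (void)diag_ref; (void)text; (void)p1; (void)p2;

	/* the DIAG_* macros always pass an entry of diag_event_text[] */
	assert(text != NULL);

	if (diag_event < 0 || diag_event >= SKETCH_EVENT_MAX)
		return 0;
	sketch_ev_count[diag_event]++;

	/* only create events have their return value stored in the object */
	if (diag_event == 0 || diag_event == 15 || diag_event == 20)
		return sketch_next_ref++;
	return 0;
}
```

Such a function would be registered together with a mask. Note that with
LTHREAD_DIAG set to 0 the enable call only logs a message.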
diff --git a/examples/performance-thread/common/lthread_diag.h b/examples/performance-thread/common/lthread_diag.h
new file mode 100644
index 0000000..7b2f35b
--- /dev/null
+++ b/examples/performance-thread/common/lthread_diag.h
@@ -0,0 +1,129 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef LTHREAD_DIAG_H_
+#define LTHREAD_DIAG_H_
+
+#include <stdint.h>
+#include <inttypes.h>
+
+#include <rte_log.h>
+#include <rte_common.h>
+
+#include "lthread_api.h"
+#include "lthread_diag_api.h"
+
+extern diag_callback diag_cb;
+
+extern const char *diag_event_text[];
+extern uint64_t diag_mask;
+
+/* max size of name strings */
+#define LT_MAX_NAME_SIZE 64
+
+#if LTHREAD_DIAG
+
+/*
+ * Generate a diagnostic trace or event in the case where an object is created.
+ *
+ * The value returned by the callback is stored in the object.
+ *
+ * @ param obj
+ *  pointer to the object that was created
+ * @ param ev
+ *  the event code
+ *
+ */
+#define DIAG_CREATE_EVENT(obj, ev) do {					\
+	struct lthread *ct = RTE_PER_LCORE(this_sched)->current_lthread;\
+	if ((BIT(ev) & diag_mask) && (ev < LT_DIAG_EVENT_MAX)) {	\
+		(obj)->diag_ref = (diag_cb)(rte_rdtsc(),		\
+					ct,				\
+					(ev),				\
+					0,				\
+					diag_event_text[(ev)],		\
+					(uint64_t)(obj),		\
+					0);				\
+	}								\
+} while (0)
+
+/*
+ * Generate a diagnostic trace event.
+ *
+ * @ param obj
+ *  pointer to the lthread, cond or mutex object
+ * @ param ev
+ *  the event code
+ * @ param p1
+ *  object specific value ( see lthread_diag_api.h )
+ * @ param p2
+ *  object specific value ( see lthread_diag_api.h )
+ */
+#define DIAG_EVENT(obj, ev, p1, p2) do {				\
+	struct lthread *ct = RTE_PER_LCORE(this_sched)->current_lthread;\
+	if ((BIT(ev) & diag_mask) && (ev < LT_DIAG_EVENT_MAX)) {	\
+		(diag_cb)(rte_rdtsc(),					\
+				ct,					\
+				ev,					\
+				(obj)->diag_ref,			\
+				diag_event_text[(ev)],			\
+				(uint64_t)(p1),				\
+				(uint64_t)(p2));			\
+	}								\
+} while (0)
+
+#define DIAG_COUNT_DEFINE(x) rte_atomic64_t count_##x
+#define DIAG_COUNT_INIT(o, x) rte_atomic64_init(&((o)->count_##x))
+#define DIAG_COUNT_INC(o, x) rte_atomic64_inc(&((o)->count_##x))
+#define DIAG_COUNT_DEC(o, x) rte_atomic64_dec(&((o)->count_##x))
+#define DIAG_COUNT(o, x) rte_atomic64_read(&((o)->count_##x))
+
+#define DIAG_USED
+
+#else
+
+/* no diagnostics configured */
+
+#define DIAG_CREATE_EVENT(obj, ev)
+#define DIAG_EVENT(obj, ev, p1, p2)
+
+#define DIAG_COUNT_DEFINE(x)
+#define DIAG_COUNT_INIT(o, x) do {} while (0)
+#define DIAG_COUNT_INC(o, x) do {} while (0)
+#define DIAG_COUNT_DEC(o, x) do {} while (0)
+#define DIAG_COUNT(o, x) 0
+
+#define DIAG_USED __rte_unused
+
+#endif				/* LTHREAD_DIAG */
+#endif				/* LTHREAD_DIAG_H_ */
diff --git a/examples/performance-thread/common/lthread_diag_api.h b/examples/performance-thread/common/lthread_diag_api.h
new file mode 100644
index 0000000..b7f4f10
--- /dev/null
+++ b/examples/performance-thread/common/lthread_diag_api.h
@@ -0,0 +1,319 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#ifndef LTHREAD_DIAG_API_H_
+#define LTHREAD_DIAG_API_H_
+
+#include <stdint.h>
+#include <inttypes.h>
+
+/*
+ * Enable diagnostics
+ * 0 = conditionally compiled out
+ * 1 = compiled in and maskable at run time, see below for details
+ */
+#define LTHREAD_DIAG 0
+
+/**
+ * lthread diagnostic interface
+ *
+ * If enabled via configuration file option ( tbd ) the lthread subsystem
+ * can generate selected trace information, either RTE_LOG  (INFO) messages,
+ * or else invoke a user supplied callback function when any of the events
+ * listed below occur.
+ *
+ * Reporting of events can be selectively masked, the bit position in the
+ * mask is determined by the corresponding event identifier listed below.
+ *
+ * Diagnostics are enabled by registering the callback function and mask
+ * using the API lthread_diagnostic_enable().
+ *
+ * Various interesting parameters are passed to the callback, including the
+ * time in cpu clks, the lthread id, the diagnostic event id, a user ref value,
+ * event text string, object being traced, and two context dependent parameters
+ * (p1 and p2). The meaning of the two parameters p1 and p2 depends on
+ * the specific event.
+ *
+ * The events LT_DIAG_LTHREAD_CREATE, LT_DIAG_MUTEX_CREATE and
+ * LT_DIAG_COND_CREATE are implicitly enabled if the event mask includes any of
+ * the LT_DIAG_LTHREAD_XXX, LT_DIAG_MUTEX_XXX or LT_DIAG_COND_XXX events
+ * respectively.
+ *
+ * These create events may also be included in the mask discretely if it is
+ * desired to monitor only create events.
+ *
+ * @param  time
+ *  The time in cpu clks at which the event occurred
+ *
+ * @param  lthread
+ *  The current lthread
+ *
+ * @param diag_event
+ *  The diagnostic event id (bit position in the mask)
+ *
+ * @param  diag_ref
+ *
+ * For LT_DIAG_LTHREAD_CREATE, LT_DIAG_MUTEX_CREATE or LT_DIAG_COND_CREATE
+ * this parameter is not used and set to 0.
+ * For all other events diag_ref contains the user ref value returned by the
+ * callback function when the lthread is created.
+ *
+ * The diag_ref values assigned to mutex and cond var can be retrieved
+ * using the APIs lthread_mutex_diag_ref(), and lthread_cond_diag_ref()
+ * respectively.
+ *
+ * @param p1
+ *  see below
+ *
+ * @param p2
+ *  see below
+ *
+ * @returns
+ * For LT_DIAG_LTHREAD_CREATE, LT_DIAG_MUTEX_CREATE or LT_DIAG_COND_CREATE
+ * expects a user diagnostic ref value that will be saved in the lthread, mutex
+ * or cond var.
+ *
+ * For all other events return value is ignored.
+ *
+ *	LT_DIAG_SCHED_CREATE - Invoked when a scheduler is created
+ *		p1 = the scheduler that was created
+ *		p2 = not used
+ *		return value will be ignored
+ *
+ *	LT_DIAG_SCHED_SHUTDOWN - Invoked when a shutdown request is received
+ *		p1 = the scheduler to be shutdown
+ *		p2 = not used
+ *		return value will be ignored
+ *
+ *	LT_DIAG_LTHREAD_CREATE - Invoked when a thread is created
+ *		p1 = the lthread that was created
+ *		p2 = not used
+ *		return value will be stored in the lthread
+ *
+ *	LT_DIAG_LTHREAD_EXIT - Invoked when an lthread exits
+ *		p2 = 0 if the thread was already joined
+ *		p2 = 1 if the thread was not already joined
+ *		return val ignored
+ *
+ *	LT_DIAG_LTHREAD_JOIN - Invoked when an lthread is joined
+ *		p1 = the lthread that is being joined
+ *		p2 = 0 if the thread was already exited
+ *		p2 = 1 if the thread was not already exited
+ *		return val ignored
+ *
+ *	LT_DIAG_LTHREAD_CANCEL - Invoked when an lthread is cancelled
+ *		p1 = not used
+ *		p2 = not used
+ *		return val ignored
+ *
+ *	LT_DIAG_LTHREAD_DETACH - Invoked when an lthread is detached
+ *		p1 = not used
+ *		p2 = not used
+ *		return val ignored
+ *
+ *	LT_DIAG_LTHREAD_FREE - Invoked when an lthread is freed
+ *		p1 = not used
+ *		p2 = not used
+ *		return val ignored
+ *
+ *	LT_DIAG_LTHREAD_SUSPENDED - Invoked when an lthread is suspended
+ *		p1 = not used
+ *		p2 = not used
+ *		return val ignored
+ *
+ *	LT_DIAG_LTHREAD_YIELD - Invoked when an lthread explicitly yields
+ *		p1 = not used
+ *		p2 = not used
+ *		return val ignored
+ *
+ *	LT_DIAG_LTHREAD_RESCHEDULED - Invoked when an lthread is rescheduled
+ *		p1 = not used
+ *		p2 = not used
+ *		return val ignored
+ *
+ *	LT_DIAG_LTHREAD_RESUMED - Invoked when an lthread is resumed
+ *		p1 = not used
+ *		p2 = not used
+ *		return val ignored
+ *
+ *	LT_DIAG_LTHREAD_AFFINITY - Invoked when an lthread is affinitised
+ *		p1 = the destination lcore_id
+ *		p2 = not used
+ *		return val ignored
+ *
+ *	LT_DIAG_LTHREAD_TMR_START - Invoked when an lthread starts a timer
+ *		p1 = address of timer node
+ *		p2 = the timeout value
+ *		return val ignored
+ *
+ *	LT_DIAG_LTHREAD_TMR_DELETE - Invoked when an lthread deletes a timer
+ *		p1 = address of the timer node
+ *		p2 = 0 if the timer was successfully deleted
+ *		p2 = not used
+ *		return val ignored
+ *
+ *	LT_DIAG_LTHREAD_TMR_EXPIRED - Invoked when an lthread timer expires
+ *		p1 = address of scheduler the timer expired on
+ *		p2 = the thread associated with the timer
+ *		return val ignored
+ *
+ *	LT_DIAG_COND_CREATE - Invoked when a condition variable is created
+ *		p1 = address of cond var that was created
+ *		p2 = not used
+ *		return diag ref value will be stored in the condition variable
+ *
+ *	LT_DIAG_COND_DESTROY - Invoked when a condition variable is destroyed
+ *		p1 = not used
+ *		p2 = not used
+ *		return val ignored
+ *
+ *	LT_DIAG_COND_WAIT - Invoked when an lthread waits on a cond var
+ *		p1 = the address of the condition variable
+ *		p2 = not used
+ *		return val ignored
+ *
+ *	LT_DIAG_COND_SIGNAL - Invoked when an lthread signals a cond var
+ *		p1 = the address of the cond var
+ *		p2 = the lthread that was signalled, or error code
+ *		return val ignored
+ *
+ *	LT_DIAG_COND_BROADCAST - Invoked when an lthread broadcasts a cond var
+ *		p1 = the address of the condition variable
+ *		p2 = the lthread(s) that are signalled, or error code
+ *
+ *	LT_DIAG_MUTEX_CREATE - Invoked when a mutex is created
+ *		p1 = address of mutex
+ *		p2 = not used
+ *		return diag ref value will be stored in the mutex variable
+ *
+ *	LT_DIAG_MUTEX_DESTROY - Invoked when a mutex is destroyed
+ *		p1 = address of mutex
+ *		p2 = not used
+ *		return val ignored
+ *
+ *	LT_DIAG_MUTEX_LOCK - Invoked when a mutex lock is obtained
+ *		p1 = address of mutex
+ *		p2 = function return value
+ *		return val ignored
+ *
+ *	LT_DIAG_MUTEX_BLOCKED  - Invoked when an lthread blocks on a mutex
+ *		p1 = address of mutex
+ *		p2 = function return value
+ *		return val ignored
+ *
+ *	LT_DIAG_MUTEX_TRYLOCK - Invoked when a mutex try lock is attempted
+ *		p1 = address of mutex
+ *		p2 = the function return value
+ *		return val ignored
+ *
+ *	LT_DIAG_MUTEX_UNLOCKED - Invoked when a mutex is unlocked
+ *		p1 = address of mutex
+ *		p2 = the thread that was unlocked, or error code
+ *		return val ignored
+ */
+typedef uint64_t (*diag_callback) (uint64_t time, struct lthread *lt,
+				  int diag_event, uint64_t diag_ref,
+				const char *text, uint64_t p1, uint64_t p2);
+
+/*
+ * Set user diagnostic callback and mask
+ * If the callback function pointer is NULL the default
+ * callback handler will be restored.
+ */
+void lthread_diagnostic_enable(diag_callback cb, uint64_t diag_mask);
+
+/*
+ * Set diagnostic mask
+ */
+void lthread_diagnostic_set_mask(uint64_t mask);
+
+/*
+ * lthread diagnostic event identifiers
+ */
+enum lthread_diag_ev {
+	/* bits 0 - 14 lthread flag group */
+	LT_DIAG_LTHREAD_CREATE,		/* 00 mask 0x00000001 */
+	LT_DIAG_LTHREAD_EXIT,		/* 01 mask 0x00000002 */
+	LT_DIAG_LTHREAD_JOIN,		/* 02 mask 0x00000004 */
+	LT_DIAG_LTHREAD_CANCEL,		/* 03 mask 0x00000008 */
+	LT_DIAG_LTHREAD_DETACH,		/* 04 mask 0x00000010 */
+	LT_DIAG_LTHREAD_FREE,		/* 05 mask 0x00000020 */
+	LT_DIAG_LTHREAD_SUSPENDED,	/* 06 mask 0x00000040 */
+	LT_DIAG_LTHREAD_YIELD,		/* 07 mask 0x00000080 */
+	LT_DIAG_LTHREAD_RESCHEDULED,	/* 08 mask 0x00000100 */
+	LT_DIAG_LTHREAD_SLEEP,		/* 09 mask 0x00000200 */
+	LT_DIAG_LTHREAD_RESUMED,	/* 10 mask 0x00000400 */
+	LT_DIAG_LTHREAD_AFFINITY,	/* 11 mask 0x00000800 */
+	LT_DIAG_LTHREAD_TMR_START,	/* 12 mask 0x00001000 */
+	LT_DIAG_LTHREAD_TMR_DELETE,	/* 13 mask 0x00002000 */
+	LT_DIAG_LTHREAD_TMR_EXPIRED,	/* 14 mask 0x00004000 */
+	/* bits 15 - 19 conditional variable flag group */
+	LT_DIAG_COND_CREATE,		/* 15 mask 0x00008000 */
+	LT_DIAG_COND_DESTROY,		/* 16 mask 0x00010000 */
+	LT_DIAG_COND_WAIT,		/* 17 mask 0x00020000 */
+	LT_DIAG_COND_SIGNAL,		/* 18 mask 0x00040000 */
+	LT_DIAG_COND_BROADCAST,		/* 19 mask 0x00080000 */
+	/* bits 20 - 25 mutex flag group */
+	LT_DIAG_MUTEX_CREATE,		/* 20 mask 0x00100000 */
+	LT_DIAG_MUTEX_DESTROY,		/* 21 mask 0x00200000 */
+	LT_DIAG_MUTEX_LOCK,		/* 22 mask 0x00400000 */
+	LT_DIAG_MUTEX_TRYLOCK,		/* 23 mask 0x00800000 */
+	LT_DIAG_MUTEX_BLOCKED,		/* 24 mask 0x01000000 */
+	LT_DIAG_MUTEX_UNLOCKED,		/* 25 mask 0x02000000 */
+	/* bits 26 - 27 scheduler flag group - 8 bits */
+	LT_DIAG_SCHED_CREATE,		/* 26 mask 0x04000000 */
+	LT_DIAG_SCHED_SHUTDOWN,		/* 27 mask 0x08000000 */
+	LT_DIAG_EVENT_MAX
+};
+
+#define LT_DIAG_ALL 0xffffffffffffffff
+
+
+/*
+ * Display scheduler stats
+ */
+void
+lthread_sched_stats_display(void);
+
+/*
+ * return the diagnostic ref val stored in a condition var
+ */
+uint64_t
+lthread_cond_diag_ref(struct lthread_cond *c);
+
+/*
+ * return the diagnostic ref val stored in a mutex
+ */
+uint64_t
+lthread_mutex_diag_ref(struct lthread_mutex *m);
+
+#endif				/* LTHREAD_DIAG_API_H_ */
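The mask layout documented in lthread_diag_api.h above can be illustrated
with a standalone sketch. It is not part of the patch: the enum is a local
copy trimmed to the condition variable group (bits 15 - 19), and diag_bit()
and cond_trace_mask() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* local, trimmed copy of the event ids; values follow the header comments */
enum {
	LT_DIAG_COND_CREATE = 15,	/* mask 0x00008000 */
	LT_DIAG_COND_DESTROY,		/* mask 0x00010000 */
	LT_DIAG_COND_WAIT,		/* mask 0x00020000 */
	LT_DIAG_COND_SIGNAL,		/* mask 0x00040000 */
	LT_DIAG_COND_BROADCAST,		/* mask 0x00080000 */
	LT_DIAG_EVENT_MAX = 28
};

/* hypothetical helper: the mask bit for one event id */
static inline uint64_t diag_bit(int ev)
{
	assert(ev >= 0 && ev < LT_DIAG_EVENT_MAX);
	return UINT64_C(1) << ev;
}

/* hypothetical helper: trace only waits and signals on cond vars */
static inline uint64_t cond_trace_mask(void)
{
	return diag_bit(LT_DIAG_COND_WAIT) | diag_bit(LT_DIAG_COND_SIGNAL);
}
```

With the real header the resulting value would be handed to
lthread_diagnostic_set_mask() or lthread_diagnostic_enable(). A 64 bit
constant is used for the shift so that the expression stays well defined
for every bit position of the uint64_t mask.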
diff --git a/examples/performance-thread/common/lthread_int.h b/examples/performance-thread/common/lthread_int.h
new file mode 100644
index 0000000..60ec289
--- /dev/null
+++ b/examples/performance-thread/common/lthread_int.h
@@ -0,0 +1,212 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * Some portions of this software may have been derived from
+ * https://github.com/halayli/lthread, which carries the following license.
+ *
+ * Copyright (C) 2012, Hasan Alayli <halayli@gmail.com>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#ifndef LTHREAD_INT_H
+#include <lthread_api.h>
+#define LTHREAD_INT_H
+
+#include <stdint.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <errno.h>
+#include <pthread.h>
+#include <time.h>
+
+#include <rte_cycles.h>
+#include <rte_per_lcore.h>
+#include <rte_timer.h>
+#include <rte_ring.h>
+#include <rte_atomic_64.h>
+#include <rte_spinlock.h>
+#include <ctx.h>
+
+#include <lthread_api.h>
+#include "lthread.h"
+#include "lthread_diag.h"
+#include "lthread_tls.h"
+
+struct lthread;
+struct lthread_sched;
+struct lthread_cond;
+struct lthread_mutex;
+struct lthread_key;
+
+struct key_pool;
+struct qnode;
+struct qnode_pool;
+struct lthread_sched;
+struct lthread_tls;
+
+
+#define BIT(x) (1 << (x))
+#define CLEARBIT(x) (~(1 << (x)))
+
+#define POSIX_ERRNO(x)  (x)
+
+#define MAX_LTHREAD_NAME_SIZE 64
+
+#define RTE_LOGTYPE_LTHREAD RTE_LOGTYPE_USER1
+
+
+/* define some shorthand for current scheduler and current thread */
+#define THIS_SCHED RTE_PER_LCORE(this_sched)
+#define THIS_LTHREAD RTE_PER_LCORE(this_sched)->current_lthread
+
+/*
+ * Definition of a scheduler struct
+ */
+struct lthread_sched {
+	struct ctx ctx;					/* cpu context */
+	uint64_t birth;					/* time created */
+	struct lthread *current_lthread;		/* running thread */
+	unsigned lcore_id;				/* this sched lcore */
+	int run_flag;					/* sched shutdown */
+	uint64_t nb_blocked_threads;	/* blocked threads */
+	struct lthread_queue *ready;			/* local ready queue */
+	struct lthread_queue *pready;			/* peer ready queue */
+	struct lthread_objcache *lthread_cache;		/* free lthreads */
+	struct lthread_objcache *stack_cache;		/* free stacks */
+	struct lthread_objcache *per_lthread_cache;	/* free per lthread */
+	struct lthread_objcache *tls_cache;		/* free TLS */
+	struct lthread_objcache *cond_cache;		/* free cond vars */
+	struct lthread_objcache *mutex_cache;		/* free mutexes */
+	struct qnode_pool *qnode_pool;		/* pool of queue nodes */
+	struct key_pool *key_pool;		/* pool of free TLS keys */
+	size_t stack_size;
+	uint64_t diag_ref;				/* diag ref */
+} __rte_cache_aligned;
+
+RTE_DECLARE_PER_LCORE(struct lthread_sched *, this_sched);
+
+
+/*
+ * State for an lthread
+ */
+enum lthread_st {
+	ST_LT_INIT,		/* initial state */
+	ST_LT_READY,		/* lthread is ready to run */
+	ST_LT_SLEEPING,		/* lthread is sleeping */
+	ST_LT_EXPIRED,		/* lthread timeout has expired  */
+	ST_LT_EXITED,		/* lthread has exited and needs cleanup */
+	ST_LT_DETACH,		/* lthread frees on exit */
+	ST_LT_CANCELLED,	/* lthread has been cancelled */
+};
+
+/*
+ * lthread sub states for exit/join
+ */
+enum join_st {
+	LT_JOIN_INITIAL,	/* initial state */
+	LT_JOIN_EXITING,	/* thread is exiting */
+	LT_JOIN_THREAD_SET,	/* joining thread has been set */
+	LT_JOIN_EXIT_VAL_SET,	/* exiting thread has set ret val */
+	LT_JOIN_EXIT_VAL_READ,	/* joining thread has collected ret val */
+};
+
+/* definition of an lthread stack object */
+struct lthread_stack {
+	uint8_t stack[LTHREAD_MAX_STACK_SIZE];
+	size_t stack_size;
+	struct lthread_sched *root_sched;
+} __rte_cache_aligned;
+
+/*
+ * Definition of an lthread
+ */
+struct lthread {
+	struct ctx ctx;				/* cpu context */
+
+	uint64_t state;				/* current lthread state */
+
+	struct lthread_sched *sched;		/* current scheduler */
+	void *stack;				/* ptr to actual stack */
+	size_t stack_size;			/* current stack_size */
+	size_t last_stack_size;			/* last yield  stack_size */
+	lthread_func_t fun;			/* func ctx is running */
+	void *arg;				/* func args passed to func */
+	void *per_lthread_data;			/* per lthread user data */
+	lthread_exit_func exit_handler;		/* called when thread exits */
+	uint64_t birth;				/* time lthread was born */
+	struct lthread_queue *pending_wr_queue;	/* deferred  queue to write */
+	struct lthread *lt_join;		/* lthread to join on */
+	uint64_t join;				/* state for joining */
+	void **lt_exit_ptr;			/* exit ptr for lthread_join */
+	struct lthread_sched *root_sched;	/* thread was created here */
+	struct queue_node *qnode;		/* node when in a queue */
+	struct rte_timer tim;			/* sleep timer */
+	struct lthread_tls *tls;		/* keys in use by the thread */
+	struct lthread_stack *stack_container;	/* stack */
+	char funcname[MAX_LTHREAD_NAME_SIZE];	/* thread func name */
+	uint64_t diag_ref;			/* ref to user diag data */
+} __rte_cache_aligned;
+
+/*
+ * Assert
+ */
+#if LTHREAD_DIAG
+#define LTHREAD_ASSERT(expr) do {					\
+	if (!(expr))							\
+		rte_panic("line%d\tassert \"" #expr "\" failed\n", __LINE__);\
+} while (0)
+#else
+#define LTHREAD_ASSERT(expr) do {} while (0)
+#endif
+
+#endif				/* LTHREAD_INT_H */
diff --git a/examples/performance-thread/common/lthread_mutex.c b/examples/performance-thread/common/lthread_mutex.c
new file mode 100644
index 0000000..df7938c
--- /dev/null
+++ b/examples/performance-thread/common/lthread_mutex.c
@@ -0,0 +1,256 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <limits.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <sys/time.h>
+#include <sys/mman.h>
+
+#include <rte_config.h>
+#include <rte_per_lcore.h>
+#include <rte_log.h>
+#include <rte_spinlock.h>
+#include <rte_common.h>
+
+#include "lthread_api.h"
+#include "lthread_int.h"
+#include "lthread_mutex.h"
+#include "lthread_sched.h"
+#include "lthread_queue.h"
+#include "lthread_objcache.h"
+#include "lthread_diag.h"
+
+/*
+ * Create a mutex
+ */
+int
+lthread_mutex_init(char *name, struct lthread_mutex **mutex,
+		   __rte_unused const struct lthread_mutexattr *attr)
+{
+	struct lthread_mutex *m;
+
+	if (mutex == NULL)
+		return POSIX_ERRNO(EINVAL);
+
+
+	m = _lthread_objcache_alloc((THIS_SCHED)->mutex_cache);
+	if (m == NULL)
+		return POSIX_ERRNO(EAGAIN);
+
+	m->blocked = _lthread_queue_create("blocked queue");
+	if (m->blocked == NULL) {
+		_lthread_objcache_free((THIS_SCHED)->mutex_cache, m);
+		return POSIX_ERRNO(EAGAIN);
+	}
+
+	if (name == NULL)
+		strncpy(m->name, "no name", sizeof(m->name));
+	else
+		strncpy(m->name, name, sizeof(m->name));
+	m->name[sizeof(m->name)-1] = 0;
+
+	m->root_sched = THIS_SCHED;
+	m->owner = NULL;
+
+	rte_atomic64_init(&m->count);
+
+	DIAG_CREATE_EVENT(m, LT_DIAG_MUTEX_CREATE);
+	/* success */
+	(*mutex) = m;
+	return 0;
+}
+
+/*
+ * Destroy a mutex
+ */
+int lthread_mutex_destroy(struct lthread_mutex *m)
+{
+	if ((m == NULL) || (m->blocked == NULL)) {
+		DIAG_EVENT(m, LT_DIAG_MUTEX_DESTROY, m, POSIX_ERRNO(EINVAL));
+		return POSIX_ERRNO(EINVAL);
+	}
+
+	if (m->owner == NULL) {
+		/* try to delete the blocked queue */
+		if (_lthread_queue_destroy(m->blocked) < 0) {
+			DIAG_EVENT(m, LT_DIAG_MUTEX_DESTROY,
+					m, POSIX_ERRNO(EBUSY));
+			return POSIX_ERRNO(EBUSY);
+		}
+
+		/* free the mutex to cache */
+		_lthread_objcache_free(m->root_sched->mutex_cache, m);
+		DIAG_EVENT(m, LT_DIAG_MUTEX_DESTROY, m, 0);
+		return 0;
+	}
+	/* cannot destroy: the mutex is still owned */
+	DIAG_EVENT(m, LT_DIAG_MUTEX_DESTROY, m, POSIX_ERRNO(EBUSY));
+	return POSIX_ERRNO(EBUSY);
+}
+
+/*
+ * Try to obtain a mutex
+ */
+int lthread_mutex_lock(struct lthread_mutex *m)
+{
+	struct lthread *lt = THIS_LTHREAD;
+
+	if ((m == NULL) || (m->blocked == NULL)) {
+		DIAG_EVENT(m, LT_DIAG_MUTEX_LOCK, m, POSIX_ERRNO(EINVAL));
+		return POSIX_ERRNO(EINVAL);
+	}
+
+	/* allow no recursion */
+	if (m->owner == lt) {
+		DIAG_EVENT(m, LT_DIAG_MUTEX_LOCK, m, POSIX_ERRNO(EDEADLK));
+		return POSIX_ERRNO(EDEADLK);
+	}
+
+	for (;;) {
+		rte_atomic64_inc(&m->count);
+		do {
+			if (rte_atomic64_cmpset
+			    ((uint64_t *) &m->owner, 0, (uint64_t) lt)) {
+				/* happy days, we got the lock */
+				DIAG_EVENT(m, LT_DIAG_MUTEX_LOCK, m, 0);
+				return 0;
+			}
+			/* spin due to race with unlock when
+			 * nothing was blocked
+			 */
+		} while ((rte_atomic64_read(&m->count) == 1) &&
+				(m->owner == NULL));
+
+		/* queue the current thread in the blocked queue
+		 * we defer this to after we return to the scheduler
+		 * to ensure that the current thread context is saved
+		 * before unlock could result in it being dequeued and
+		 * resumed
+		 */
+		DIAG_EVENT(m, LT_DIAG_MUTEX_BLOCKED, m, lt);
+		lt->pending_wr_queue = m->blocked;
+		/* now relinquish cpu */
+		_suspend();
+		/* resumed, must loop and compete for the lock again */
+	}
+	LTHREAD_ASSERT(0);
+	return 0;
+}
+
+/* try to lock a mutex but don't block */
+int lthread_mutex_trylock(struct lthread_mutex *m)
+{
+	struct lthread *lt = THIS_LTHREAD;
+
+	if ((m == NULL) || (m->blocked == NULL)) {
+		DIAG_EVENT(m, LT_DIAG_MUTEX_TRYLOCK, m, POSIX_ERRNO(EINVAL));
+		return POSIX_ERRNO(EINVAL);
+	}
+
+	if (m->owner == lt) {
+		/* no recursion */
+		DIAG_EVENT(m, LT_DIAG_MUTEX_TRYLOCK, m, POSIX_ERRNO(EDEADLK));
+		return POSIX_ERRNO(EDEADLK);
+	}
+
+	rte_atomic64_inc(&m->count);
+	if (rte_atomic64_cmpset
+	    ((uint64_t *) &m->owner, (uint64_t) NULL, (uint64_t) lt)) {
+		/* got the lock */
+		DIAG_EVENT(m, LT_DIAG_MUTEX_TRYLOCK, m, 0);
+		return 0;
+	}
+
+	/* failed so return busy */
+	rte_atomic64_dec(&m->count);
+	DIAG_EVENT(m, LT_DIAG_MUTEX_TRYLOCK, m, POSIX_ERRNO(EBUSY));
+	return POSIX_ERRNO(EBUSY);
+}
+
+/*
+ * Unlock a mutex
+ */
+int lthread_mutex_unlock(struct lthread_mutex *m)
+{
+	struct lthread *lt = THIS_LTHREAD;
+	struct lthread *unblocked;
+
+	if ((m == NULL) || (m->blocked == NULL)) {
+		DIAG_EVENT(m, LT_DIAG_MUTEX_UNLOCKED, m, POSIX_ERRNO(EINVAL));
+		return POSIX_ERRNO(EINVAL);
+	}
+
+	/* fail if the calling lthread does not own the mutex */
+	if (m->owner != lt || m->owner == NULL) {
+		DIAG_EVENT(m, LT_DIAG_MUTEX_UNLOCKED, m, POSIX_ERRNO(EPERM));
+		return POSIX_ERRNO(EPERM);
+	}
+
+	rte_atomic64_dec(&m->count);
+	/* if there are blocked threads then make one ready */
+	while (rte_atomic64_read(&m->count) > 0) {
+		unblocked = _lthread_queue_remove(m->blocked);
+
+		if (unblocked != NULL) {
+			rte_atomic64_dec(&m->count);
+			DIAG_EVENT(m, LT_DIAG_MUTEX_UNLOCKED, m, unblocked);
+			LTHREAD_ASSERT(unblocked->sched != NULL);
+			_ready_queue_insert((struct lthread_sched *)
+					    unblocked->sched, unblocked);
+			break;
+		}
+	}
+	/* release the lock */
+	m->owner = NULL;
+	return 0;
+}
+
+/*
+ * return the diagnostic ref val stored in a mutex
+ */
+uint64_t
+lthread_mutex_diag_ref(struct lthread_mutex *m)
+{
+	if (m == NULL)
+		return 0;
+	return m->diag_ref;
+}
+
diff --git a/examples/performance-thread/common/lthread_mutex.h b/examples/performance-thread/common/lthread_mutex.h
new file mode 100644
index 0000000..aebe77b
--- /dev/null
+++ b/examples/performance-thread/common/lthread_mutex.h
@@ -0,0 +1,52 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+
+#ifndef LTHREAD_MUTEX_H_
+#define LTHREAD_MUTEX_H_
+
+#include "lthread_queue.h"
+
+
+#define MAX_MUTEX_NAME_SIZE 64
+
+struct lthread_mutex {
+	struct lthread *owner;
+	rte_atomic64_t	count;
+	struct lthread_queue *blocked __rte_cache_aligned;
+	struct lthread_sched *root_sched;
+	char			name[MAX_MUTEX_NAME_SIZE];
+	uint64_t		diag_ref; /* optional ref to user diag data */
+} __rte_cache_aligned;
+
+#endif /* LTHREAD_MUTEX_H_ */
diff --git a/examples/performance-thread/common/lthread_objcache.h b/examples/performance-thread/common/lthread_objcache.h
new file mode 100644
index 0000000..2101ad2
--- /dev/null
+++ b/examples/performance-thread/common/lthread_objcache.h
@@ -0,0 +1,160 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#ifndef LTHREAD_OBJCACHE_H_
+#define LTHREAD_OBJCACHE_H_
+
+#include <string.h>
+
+#include <rte_per_lcore.h>
+#include <rte_malloc.h>
+#include <rte_memory.h>
+
+#include "lthread_int.h"
+#include "lthread_diag.h"
+#include "lthread_queue.h"
+
+
+#define DISPLAY_OBCACHE_QUEUES 0
+
+RTE_DECLARE_PER_LCORE(struct lthread_sched *, this_sched);
+
+struct lthread_objcache {
+	struct lthread_queue *q;
+	size_t obj_size;
+	int prealloc_size;
+	char name[LT_MAX_NAME_SIZE];
+
+	DIAG_COUNT_DEFINE(rd);
+	DIAG_COUNT_DEFINE(wr);
+	DIAG_COUNT_DEFINE(prealloc);
+	DIAG_COUNT_DEFINE(capacity);
+	DIAG_COUNT_DEFINE(available);
+};
+
+/*
+ * Create a cache
+ */
+static inline struct
+lthread_objcache *_lthread_objcache_create(const char *name,
+					size_t obj_size,
+					int prealloc_size)
+{
+	struct lthread_objcache *c =
+	    rte_malloc_socket(NULL, sizeof(struct lthread_objcache),
+				RTE_CACHE_LINE_SIZE,
+				rte_socket_id());
+	if (c == NULL)
+		return NULL;
+
+	c->q = _lthread_queue_create("cache queue");
+	if (c->q == NULL) {
+		rte_free(c);
+		return NULL;
+	}
+	c->obj_size = obj_size;
+	c->prealloc_size = prealloc_size;
+
+	/* rte_malloc does not clear memory, so always initialise the name */
+	strncpy(c->name, name != NULL ? name : "no name", LT_MAX_NAME_SIZE);
+	c->name[sizeof(c->name)-1] = 0;
+
+	DIAG_COUNT_INIT(c, rd);
+	DIAG_COUNT_INIT(c, wr);
+	DIAG_COUNT_INIT(c, prealloc);
+	DIAG_COUNT_INIT(c, capacity);
+	DIAG_COUNT_INIT(c, available);
+	return c;
+}
+
+/*
+ * Destroy an objcache
+ */
+static inline int
+_lthread_objcache_destroy(struct lthread_objcache *c)
+{
+	if (_lthread_queue_destroy(c->q) == 0) {
+		rte_free(c);
+		return 0;
+	}
+	return -1;
+}
+
+/*
+ * Allocate an object from an object cache
+ */
+static inline void *
+_lthread_objcache_alloc(struct lthread_objcache *c)
+{
+	int i;
+	void *data;
+	struct lthread_queue *q = c->q;
+	size_t obj_size = c->obj_size;
+	int prealloc_size = c->prealloc_size;
+
+	data = _lthread_queue_remove(q);
+
+	if (data == NULL) {
+		DIAG_COUNT_INC(c, prealloc);
+		for (i = 0; i < prealloc_size; i++) {
+			data =
+			    rte_zmalloc_socket(NULL, obj_size,
+					RTE_CACHE_LINE_SIZE,
+					rte_socket_id());
+			if (data == NULL)
+				return NULL;
+
+			DIAG_COUNT_INC(c, available);
+			DIAG_COUNT_INC(c, capacity);
+			_lthread_queue_insert_mp(q, data);
+		}
+		data = _lthread_queue_remove(q);
+	}
+	DIAG_COUNT_INC(c, rd);
+	DIAG_COUNT_DEC(c, available);
+	return data;
+}
+
+/*
+ * free an object to a cache
+ */
+static inline void
+_lthread_objcache_free(struct lthread_objcache *c, void *obj)
+{
+	DIAG_COUNT_INC(c, wr);
+	DIAG_COUNT_INC(c, available);
+	_lthread_queue_insert_mp(c->q, obj);
+}
+
+
+
+#endif				/* LTHREAD_OBJCACHE_H_ */
diff --git a/examples/performance-thread/common/lthread_pool.h b/examples/performance-thread/common/lthread_pool.h
new file mode 100644
index 0000000..7cbccb6
--- /dev/null
+++ b/examples/performance-thread/common/lthread_pool.h
@@ -0,0 +1,333 @@
+/*
+ *-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * Some portions of this software are derived from the producer
+ * consumer queues described by Dmitry Vyukov and published here:
+ * http://www.1024cores.net
+ *
+ * Copyright (c) 2010-2011 Dmitry Vyukov. All rights reserved.
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met
+ *
+ * 1. Redistributions of source code must retain the above copyright notice,
+ * this list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY DMITRY VYUKOV "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL DMITRY VYUKOV OR CONTRIBUTORS
+ * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
+ * OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
+ * OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
+ * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+ * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
+ * OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
+ * EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * The views and conclusions contained in the software and documentation are
+ * those of the authors and should not be interpreted as representing official
+ * policies, either expressed or implied, of Dmitry Vyukov.
+ */
+
+#ifndef LTHREAD_POOL_H_
+#define LTHREAD_POOL_H_
+
+#include <rte_malloc.h>
+#include <rte_per_lcore.h>
+#include <rte_log.h>
+
+#include "lthread_int.h"
+#include "lthread_diag.h"
+#include "atomic.h"
+
+/*
+ * This file implements a pool of queue nodes used by the queue implemented
+ * in lthread_queue.h.
+ *
+ * The pool is an intrusive lock-free MPSC queue.
+ *
+ * The pool is created empty and populated lazily, i.e. on the first attempt
+ * to allocate from the pool.
+ *
+ * Whenever the pool is empty, more nodes are added to the pool.
+ * The number of nodes preallocated in this way is a parameter of
+ * _qnode_pool_create. Freeing an object returns it to the pool.
+ *
+ * Each lthread scheduler maintains its own pool of nodes. L-threads must always
+ * allocate from this local pool (because it is a single consumer queue).
+ * L-threads can free nodes to any pool (because it is a multi producer queue).
+ * This enables threads that have affined to a different scheduler to free
+ * nodes safely.
+ */
+
+struct qnode;
+struct qnode_cache;
+
+/*
+ * define intermediate node
+ */
+struct qnode {
+	struct qnode *next;
+	void *data;
+	struct qnode_pool *pool;
+} __rte_cache_aligned;
+
+/*
+ * a pool structure
+ */
+struct qnode_pool {
+	struct qnode *head;
+	struct qnode *stub;
+	struct qnode *fast_alloc;
+	struct qnode *tail __rte_cache_aligned;
+	int pre_alloc;
+	char name[LT_MAX_NAME_SIZE];
+
+	DIAG_COUNT_DEFINE(rd);
+	DIAG_COUNT_DEFINE(wr);
+	DIAG_COUNT_DEFINE(available);
+	DIAG_COUNT_DEFINE(prealloc);
+	DIAG_COUNT_DEFINE(capacity);
+} __rte_cache_aligned;
+
+/*
+ * Create a pool of qnodes
+ */
+
+static inline struct qnode_pool *
+_qnode_pool_create(const char *name, int prealloc_size) {
+
+	struct qnode_pool *p = rte_malloc_socket(NULL,
+					sizeof(struct qnode_pool),
+					RTE_CACHE_LINE_SIZE,
+					rte_socket_id());
+
+	LTHREAD_ASSERT(p);
+
+	p->stub = rte_malloc_socket(NULL,
+				sizeof(struct qnode),
+				RTE_CACHE_LINE_SIZE,
+				rte_socket_id());
+
+	LTHREAD_ASSERT(p->stub);
+
+	/* rte_malloc does not clear memory, so always initialise the name */
+	strncpy(p->name, name != NULL ? name : "no name", LT_MAX_NAME_SIZE);
+	p->name[sizeof(p->name)-1] = 0;
+
+	p->stub->pool = p;
+	p->stub->next = NULL;
+	p->tail = p->stub;
+	p->head = p->stub;
+	p->pre_alloc = prealloc_size;
+	p->fast_alloc = NULL;	/* rte_malloc does not clear memory */
+	DIAG_COUNT_INIT(p, rd);
+	DIAG_COUNT_INIT(p, wr);
+	DIAG_COUNT_INIT(p, available);
+	DIAG_COUNT_INIT(p, prealloc);
+	DIAG_COUNT_INIT(p, capacity);
+
+	return p;
+}
+
+
+/*
+ * Insert a node into the pool
+ */
+static inline void __attribute__ ((always_inline))
+_qnode_pool_insert(struct qnode_pool *p, struct qnode *n)
+{
+	n->next = NULL;
+	struct qnode *prev = n;
+	/* We insert at the head */
+	prev = (struct qnode *) atomic64_xchg((uint64_t *)&p->head,
+						(uint64_t) prev);
+	/* there is a window of inconsistency until prev next is set */
+	/* which is why remove must retry */
+	prev->next = (n);
+}
+
+/*
+ * Remove a node from the pool
+ *
+ * There is a race with _qnode_pool_insert() whereby the queue could appear
+ * empty during a concurrent insert, this is handled by retrying
+ *
+ * The queue uses a stub node, which must be swung as the queue becomes
+ * empty, this requires an insert of the stub, which means that removing the
+ * last item from the queue incurs the penalty of an atomic exchange. Since the
+ * pool is maintained with a bulk pre-allocation the cost of this is amortised.
+ */
+static inline struct qnode *__attribute__ ((always_inline))
+_pool_remove(struct qnode_pool *p)
+{
+	struct qnode *head;
+	struct qnode *tail = p->tail;
+	struct qnode *next = tail->next;
+
+	/* we remove from the tail */
+	if (tail == p->stub) {
+		if (next == NULL)
+			return NULL;
+		/* advance the tail */
+		p->tail = next;
+		tail = next;
+		next = next->next;
+	}
+	if (likely(next != NULL)) {
+		p->tail = next;
+		return tail;
+	}
+
+	head = p->head;
+	if (tail == head)
+		return NULL;
+
+	/* swing stub node */
+	_qnode_pool_insert(p, p->stub);
+
+	next = tail->next;
+	if (next) {
+		p->tail = next;
+		return tail;
+	}
+	return NULL;
+}
+
+
+/*
+ * This adds a retry to the _pool_remove function
+ * defined above
+ */
+static inline struct qnode *__attribute__ ((always_inline))
+_qnode_pool_remove(struct qnode_pool *p)
+{
+	struct qnode *n;
+
+	do {
+		n = _pool_remove(p);
+		if (likely(n != NULL))
+			return n;
+
+		rte_compiler_barrier();
+	}  while ((p->head != p->tail) &&
+			(p->tail != p->stub));
+	return NULL;
+}
+
+/*
+ * Allocate a node from the pool
+ * If the pool is empty, add more nodes
+ */
+static inline struct qnode *__attribute__ ((always_inline))
+_qnode_alloc(void)
+{
+	struct qnode_pool *p = (THIS_SCHED)->qnode_pool;
+	int prealloc_size = p->pre_alloc;
+	struct qnode *n;
+	int i;
+
+	if (likely(p->fast_alloc != NULL)) {
+		n = p->fast_alloc;
+		p->fast_alloc = NULL;
+		return n;
+	}
+
+	n = _qnode_pool_remove(p);
+
+	if (unlikely(n == NULL)) {
+		DIAG_COUNT_INC(p, prealloc);
+		for (i = 0; i < prealloc_size; i++) {
+			n = rte_malloc_socket(NULL,
+					sizeof(struct qnode),
+					RTE_CACHE_LINE_SIZE,
+					rte_socket_id());
+			if (n == NULL)
+				return NULL;
+
+			DIAG_COUNT_INC(p, available);
+			DIAG_COUNT_INC(p, capacity);
+
+			n->pool = p;
+			_qnode_pool_insert(p, n);
+		}
+		n = _qnode_pool_remove(p);
+	}
+	n->pool = p;
+	DIAG_COUNT_INC(p, rd);
+	DIAG_COUNT_DEC(p, available);
+	return n;
+}
+
+
+
+/*
+ * Free a queue node to the per-scheduler pool from which it came
+ */
+static inline void __attribute__ ((always_inline))
+_qnode_free(struct qnode *n)
+{
+	struct qnode_pool *p = n->pool;
+
+
+	if (unlikely(p->fast_alloc != NULL) ||
+			unlikely(n->pool != (THIS_SCHED)->qnode_pool)) {
+		DIAG_COUNT_INC(p, wr);
+		DIAG_COUNT_INC(p, available);
+		_qnode_pool_insert(p, n);
+		return;
+	}
+	p->fast_alloc = n;
+}
+
+/*
+ * Destroy a qnode pool
+ * queue must be empty when this is called
+ */
+static inline int
+_qnode_pool_destroy(struct qnode_pool *p)
+{
+	rte_free(p->stub);
+	rte_free(p);
+	return 0;
+}
+
+
+#endif				/* LTHREAD_POOL_H_ */
diff --git a/examples/performance-thread/common/lthread_queue.h b/examples/performance-thread/common/lthread_queue.h
new file mode 100644
index 0000000..0092862
--- /dev/null
+++ b/examples/performance-thread/common/lthread_queue.h
@@ -0,0 +1,303 @@
+/*
+ *-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * Some portions of this software are derived from the producer
+ * consumer queues described by Dmitry Vyukov and published here
+ * http://www.1024cores.net
+ *
+ * Copyright (c) 2010-2011 Dmitry Vyukov. All rights reserved.
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice,
+ * this list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY DMITRY VYUKOV "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL DMITRY VYUKOV OR CONTRIBUTORS
+ * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
+ * OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
+ * OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
+ * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+ * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
+ * OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
+ * EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * The views and conclusions contained in the software and documentation are
+ * those of the authors and should not be interpreted as representing official
+ * policies, either expressed or implied, of Dmitry Vyukov.
+ */
+
+#ifndef LTHREAD_QUEUE_H_
+#define LTHREAD_QUEUE_H_
+
+#include <string.h>
+
+#include <rte_prefetch.h>
+#include <rte_per_lcore.h>
+
+#include "lthread_int.h"
+#include "lthread.h"
+#include "lthread_diag.h"
+#include "lthread_pool.h"
+#include "atomic.h"
+
+struct lthread_queue;
+
+/*
+ * This file implements an unbounded FIFO queue based on a lock free
+ * linked list.
+ *
+ * The queue is non-intrusive in that it uses intermediate nodes, and does
+ * not require these nodes to be inserted into the object being placed
+ * in the queue.
+ *
+ * This is slightly more efficient than the very similar queue in lthread_pool
+ * in that it does not have to swing a stub node as the queue becomes empty.
+ *
+ * The queue access functions allocate and free intermediate nodes
+ * transparently from/to a per scheduler pool ( see lthread_pool.h ).
+ *
+ * The queue provides both MPSC and SPSC insert methods
+ */
+
+/*
+ * define a queue of lthread nodes
+ */
+struct lthread_queue {
+	struct qnode *head;
+	struct qnode *tail __rte_cache_aligned;
+	struct lthread_queue *p;
+	char name[LT_MAX_NAME_SIZE];
+
+	DIAG_COUNT_DEFINE(rd);
+	DIAG_COUNT_DEFINE(wr);
+	DIAG_COUNT_DEFINE(size);
+
+} __rte_cache_aligned;
+
+
+
+static inline struct lthread_queue *
+_lthread_queue_create(const char *name)
+{
+	struct qnode *stub;
+	struct lthread_queue *new_queue;
+
+	new_queue = rte_malloc_socket(NULL, sizeof(struct lthread_queue),
+					RTE_CACHE_LINE_SIZE,
+					rte_socket_id());
+	if (new_queue == NULL)
+		return NULL;
+
+	/* allocate stub node */
+	stub = _qnode_alloc();
+	LTHREAD_ASSERT(stub);
+
+	if (name != NULL)
+		strncpy(new_queue->name, name, sizeof(new_queue->name));
+	new_queue->name[sizeof(new_queue->name)-1] = 0;
+
+	/* initialize queue as empty */
+	stub->next = NULL;
+	new_queue->head = stub;
+	new_queue->tail = stub;
+
+	DIAG_COUNT_INIT(new_queue, rd);
+	DIAG_COUNT_INIT(new_queue, wr);
+	DIAG_COUNT_INIT(new_queue, size);
+
+	return new_queue;
+}
+
+/**
+ * Return true if the queue is empty
+ */
+static inline int __attribute__ ((always_inline))
+_lthread_queue_empty(struct lthread_queue *q)
+{
+	return (q->tail == q->head);
+}
+
+
+
+/**
+ * Destroy a queue
+ * fail if queue is not empty
+ */
+static inline int _lthread_queue_destroy(struct lthread_queue *q)
+{
+	if (q == NULL)
+		return -1;
+
+	if (!_lthread_queue_empty(q))
+		return -1;
+
+	_qnode_free(q->head);
+	rte_free(q);
+	return 0;
+}
+
+RTE_DECLARE_PER_LCORE(struct lthread_sched *, this_sched);
+
+/*
+ * Insert a node into a queue
+ * this implementation is multi producer safe
+ */
+static inline struct qnode *__attribute__ ((always_inline))
+_lthread_queue_insert_mp(struct lthread_queue
+							  *q, void *data)
+{
+	struct qnode *prev;
+	struct qnode *n = _qnode_alloc();
+
+	if (n == NULL)
+		return NULL;
+
+	/* set object in node */
+	n->data = data;
+	n->next = NULL;
+
+	/* this is an MPSC method, perform a locked update */
+	prev = n;
+	prev =
+	    (struct qnode *)atomic64_xchg((uint64_t *) &(q)->head,
+					       (uint64_t) prev);
+	/* there is a window of inconsistency until prev->next is set,
+	 * which is why remove must retry
+	 */
+	prev->next = n;
+
+	DIAG_COUNT_INC(q, wr);
+	DIAG_COUNT_INC(q, size);
+
+	return n;
+}
+
+/*
+ * Insert a node into a queue in single producer mode
+ * this implementation is NOT multi producer safe
+ */
+static inline struct qnode *__attribute__ ((always_inline))
+_lthread_queue_insert_sp(struct lthread_queue
+							  *q, void *data)
+{
+	/* allocate a queue node */
+	struct qnode *prev;
+	struct qnode *n = _qnode_alloc();
+
+	if (n == NULL)
+		return NULL;
+
+	/* set data in node */
+	n->data = data;
+	n->next = NULL;
+
+	/* this is an SPSC method, no need for locked exchange operation */
+	prev = q->head;
+	prev->next = q->head = n;
+
+	DIAG_COUNT_INC(q, wr);
+	DIAG_COUNT_INC(q, size);
+
+	return n;
+}
+
+/*
+ * Remove a node from a queue if one is visible, without retrying
+ */
+static inline void *__attribute__ ((always_inline))
+_lthread_queue_poll(struct lthread_queue *q)
+{
+	void *data = NULL;
+	struct qnode *tail = q->tail;
+	struct qnode *next = (struct qnode *)tail->next;
+	/*
+	 * There is a small window of inconsistency between producer and
+	 * consumer whereby the queue may appear empty if consumer and
+	 * producer access it at the same time.
+	 * The consumer must handle this by retrying
+	 */
+
+	if (likely(next != NULL)) {
+		q->tail = next;
+		tail->data = next->data;
+		data = tail->data;
+
+		/* free the node */
+		_qnode_free(tail);
+
+		DIAG_COUNT_INC(q, rd);
+		DIAG_COUNT_DEC(q, size);
+		return data;
+	}
+	return NULL;
+}
+
+/*
+ * Remove a node from a queue, retrying while the queue is not empty
+ */
+static inline void *__attribute__ ((always_inline))
+_lthread_queue_remove(struct lthread_queue *q)
+{
+	void *data = NULL;
+
+	/*
+	 * There is a small window of inconsistency between producer and
+	 * consumer whereby the queue may appear empty if consumer and
+	 * producer access it at the same time. We handle this by retrying
+	 */
+	do {
+		data = _lthread_queue_poll(q);
+
+		if (likely(data != NULL)) {
+			/* the rd and size diag counters were already
+			 * updated by _lthread_queue_poll()
+			 */
+			return data;
+		}
+		rte_compiler_barrier();
+	} while (unlikely(!_lthread_queue_empty(q)));
+	return NULL;
+}
+
+
+#endif				/* LTHREAD_QUEUE_H_ */
diff --git a/examples/performance-thread/common/lthread_sched.c b/examples/performance-thread/common/lthread_sched.c
new file mode 100644
index 0000000..f1eb926
--- /dev/null
+++ b/examples/performance-thread/common/lthread_sched.c
@@ -0,0 +1,600 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * Some portions of this software are derived from
+ * https://github.com/halayli/lthread which carries the following license.
+ *
+ * Copyright (C) 2012, Hasan Alayli <halayli@gmail.com>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+
+#define RTE_MEM 1
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <limits.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <sys/time.h>
+#include <sys/mman.h>
+#include <sched.h>
+
+#include <rte_config.h>
+#include <rte_prefetch.h>
+#include <rte_per_lcore.h>
+#include <rte_atomic.h>
+#include <rte_atomic_64.h>
+#include <rte_log.h>
+#include <rte_common.h>
+#include <rte_branch_prediction.h>
+
+#include "lthread_api.h"
+#include "lthread_int.h"
+#include "lthread_sched.h"
+#include "lthread_objcache.h"
+#include "lthread_timer.h"
+#include "lthread_mutex.h"
+#include "lthread_cond.h"
+#include "lthread_tls.h"
+#include "lthread_diag.h"
+
+/*
+ * This file implements the lthread scheduler
+ * The scheduler is the function lthread_run()
+ * This must be run as the main loop of an EAL thread.
+ *
+ * Currently once a scheduler is created it cannot be destroyed
+ * When a scheduler shuts down it is assumed that the application is terminating
+ */
+
+static rte_atomic16_t num_schedulers;
+static rte_atomic16_t active_schedulers;
+
+/* one scheduler per lcore */
+RTE_DEFINE_PER_LCORE(struct lthread_sched *, this_sched) = NULL;
+
+struct lthread_sched *schedcore[LTHREAD_MAX_LCORES];
+
+diag_callback diag_cb;
+
+uint64_t diag_mask;
+
+
+/* constructor */
+void lthread_sched_ctor(void) __attribute__ ((constructor));
+void lthread_sched_ctor(void)
+{
+	memset(schedcore, 0, sizeof(schedcore));
+	rte_atomic16_init(&num_schedulers);
+	rte_atomic16_set(&num_schedulers, 1);
+	rte_atomic16_init(&active_schedulers);
+	rte_atomic16_set(&active_schedulers, 0);
+	diag_cb = NULL;
+}
+
+
+enum sched_alloc_phase {
+	SCHED_ALLOC_OK,
+	SCHED_ALLOC_QNODE_POOL,
+	SCHED_ALLOC_READY_QUEUE,
+	SCHED_ALLOC_PREADY_QUEUE,
+	SCHED_ALLOC_LTHREAD_CACHE,
+	SCHED_ALLOC_STACK_CACHE,
+	SCHED_ALLOC_PERLT_CACHE,
+	SCHED_ALLOC_TLS_CACHE,
+	SCHED_ALLOC_COND_CACHE,
+	SCHED_ALLOC_MUTEX_CACHE,
+};
+
+static int
+_lthread_sched_alloc_resources(struct lthread_sched *new_sched)
+{
+	int alloc_status;
+
+	do {
+		/* Initialize per scheduler queue node pool */
+		alloc_status = SCHED_ALLOC_QNODE_POOL;
+		new_sched->qnode_pool =
+			_qnode_pool_create("qnode pool", LTHREAD_PREALLOC);
+		if (new_sched->qnode_pool == NULL)
+			break;
+
+		/* Initialize per scheduler local ready queue */
+		alloc_status = SCHED_ALLOC_READY_QUEUE;
+		new_sched->ready = _lthread_queue_create("ready queue");
+		if (new_sched->ready == NULL)
+			break;
+
+		/* Initialize per scheduler local peer ready queue */
+		alloc_status = SCHED_ALLOC_PREADY_QUEUE;
+		new_sched->pready = _lthread_queue_create("pready queue");
+		if (new_sched->pready == NULL)
+			break;
+
+		/* Initialize per scheduler local free lthread cache */
+		alloc_status = SCHED_ALLOC_LTHREAD_CACHE;
+		new_sched->lthread_cache =
+			_lthread_objcache_create("lthread cache",
+						sizeof(struct lthread),
+						LTHREAD_PREALLOC);
+		if (new_sched->lthread_cache == NULL)
+			break;
+
+		/* Initialize per scheduler local free stack cache */
+		alloc_status = SCHED_ALLOC_STACK_CACHE;
+		new_sched->stack_cache =
+			_lthread_objcache_create("stack_cache",
+						sizeof(struct lthread_stack),
+						LTHREAD_PREALLOC);
+		if (new_sched->stack_cache == NULL)
+			break;
+
+		/* Initialize per scheduler local free per lthread data cache */
+		alloc_status = SCHED_ALLOC_PERLT_CACHE;
+		new_sched->per_lthread_cache =
+			_lthread_objcache_create("per_lt cache",
+						RTE_PER_LTHREAD_SECTION_SIZE,
+						LTHREAD_PREALLOC);
+		if (new_sched->per_lthread_cache == NULL)
+			break;
+
+		/* Initialize per scheduler local free tls cache */
+		alloc_status = SCHED_ALLOC_TLS_CACHE;
+		new_sched->tls_cache =
+			_lthread_objcache_create("TLS cache",
+						sizeof(struct lthread_tls),
+						LTHREAD_PREALLOC);
+		if (new_sched->tls_cache == NULL)
+			break;
+
+		/* Initialize per scheduler local free cond var cache */
+		alloc_status = SCHED_ALLOC_COND_CACHE;
+		new_sched->cond_cache =
+			_lthread_objcache_create("cond cache",
+						sizeof(struct lthread_cond),
+						LTHREAD_PREALLOC);
+		if (new_sched->cond_cache == NULL)
+			break;
+
+		/* Initialize per scheduler local free mutex cache */
+		alloc_status = SCHED_ALLOC_MUTEX_CACHE;
+		new_sched->mutex_cache =
+			_lthread_objcache_create("mutex cache",
+						sizeof(struct lthread_mutex),
+						LTHREAD_PREALLOC);
+		if (new_sched->mutex_cache == NULL)
+			break;
+
+		alloc_status = SCHED_ALLOC_OK;
+	} while (0);
+
+	/* roll back on any failure */
+	switch (alloc_status) {
+	case SCHED_ALLOC_MUTEX_CACHE:
+		_lthread_objcache_destroy(new_sched->cond_cache);
+		/* fall through */
+	case SCHED_ALLOC_COND_CACHE:
+		_lthread_objcache_destroy(new_sched->tls_cache);
+		/* fall through */
+	case SCHED_ALLOC_TLS_CACHE:
+		_lthread_objcache_destroy(new_sched->per_lthread_cache);
+		/* fall through */
+	case SCHED_ALLOC_PERLT_CACHE:
+		_lthread_objcache_destroy(new_sched->stack_cache);
+		/* fall through */
+	case SCHED_ALLOC_STACK_CACHE:
+		_lthread_objcache_destroy(new_sched->lthread_cache);
+		/* fall through */
+	case SCHED_ALLOC_LTHREAD_CACHE:
+		_lthread_queue_destroy(new_sched->pready);
+		/* fall through */
+	case SCHED_ALLOC_PREADY_QUEUE:
+		_lthread_queue_destroy(new_sched->ready);
+		/* fall through */
+	case SCHED_ALLOC_READY_QUEUE:
+		_qnode_pool_destroy(new_sched->qnode_pool);
+		/* fall through */
+	case SCHED_ALLOC_QNODE_POOL:
+		/* fall through */
+	case SCHED_ALLOC_OK:
+		break;
+	}
+	return alloc_status;
+}
+
+
+/*
+ * Create a scheduler on the current lcore
+ */
+struct lthread_sched *_lthread_sched_create(size_t stack_size)
+{
+	int status;
+	struct lthread_sched *new_sched;
+	unsigned lcoreid = rte_lcore_id();
+
+	LTHREAD_ASSERT(stack_size <= LTHREAD_MAX_STACK_SIZE);
+
+	if (stack_size == 0)
+		stack_size = LTHREAD_MAX_STACK_SIZE;
+
+	new_sched =
+	     rte_calloc_socket(NULL, 1, sizeof(struct lthread_sched),
+				RTE_CACHE_LINE_SIZE,
+				rte_socket_id());
+	if (new_sched == NULL) {
+		RTE_LOG(CRIT, LTHREAD,
+			"Failed to allocate memory for scheduler\n");
+		return NULL;
+	}
+
+	_lthread_key_pool_init();
+
+	new_sched->stack_size = stack_size;
+	new_sched->birth = rte_rdtsc();
+	THIS_SCHED = new_sched;
+
+	status = _lthread_sched_alloc_resources(new_sched);
+	if (status != SCHED_ALLOC_OK) {
+		RTE_LOG(CRIT, LTHREAD,
+			"Failed to allocate resources for scheduler code = %d\n",
+			status);
+		rte_free(new_sched);
+		return NULL;
+	}
+
+	memset(&new_sched->ctx, 0, sizeof(struct ctx));
+
+	new_sched->lcore_id = lcoreid;
+
+	schedcore[lcoreid] = new_sched;
+
+	new_sched->run_flag = 1;
+
+	DIAG_EVENT(new_sched, LT_DIAG_SCHED_CREATE, rte_lcore_id(), 0);
+
+	rte_wmb();
+	return new_sched;
+}
+
+/*
+ * Set the number of schedulers in the system
+ */
+int lthread_num_schedulers_set(int num)
+{
+	rte_atomic16_set(&num_schedulers, num);
+	return (int)rte_atomic16_read(&num_schedulers);
+}
+
+/*
+ * Return the number of schedulers active
+ */
+int lthread_active_schedulers(void)
+{
+	return (int)rte_atomic16_read(&active_schedulers);
+}
+
+
+/**
+ * Shut down the scheduler running on the specified lcore
+ */
+void lthread_scheduler_shutdown(unsigned lcoreid)
+{
+	uint64_t coreid = (uint64_t) lcoreid;
+
+	if (coreid < LTHREAD_MAX_LCORES) {
+		if (schedcore[coreid] != NULL)
+			schedcore[coreid]->run_flag = 0;
+	}
+}
+
+/**
+ * Shut down all schedulers
+ */
+void lthread_scheduler_shutdown_all(void)
+{
+	uint64_t i;
+
+	/*
+	 * give time for all schedulers to have started
+	 * Note we use sched_yield() rather than pthread_yield() to allow
+	 * for the possibility of a pthread wrapper on lthread_yield(),
+	 * something that is not possible unless the scheduler is running.
+	 */
+	while (rte_atomic16_read(&active_schedulers) <
+	       rte_atomic16_read(&num_schedulers))
+		sched_yield();
+
+	for (i = 0; i < LTHREAD_MAX_LCORES; i++) {
+		if (schedcore[i] != NULL)
+			schedcore[i]->run_flag = 0;
+	}
+}
+
+/*
+ * Resume a suspended lthread
+ */
+static inline void
+_lthread_resume(struct lthread *lt) __attribute__ ((always_inline));
+static inline void _lthread_resume(struct lthread *lt)
+{
+	struct lthread_sched *sched = THIS_SCHED;
+	struct lthread_stack *s;
+	uint64_t state = lt->state;
+#if LTHREAD_DIAG
+	int init = 0;
+#endif
+
+	sched->current_lthread = lt;
+
+	if (state & (BIT(ST_LT_CANCELLED) | BIT(ST_LT_EXITED))) {
+		/* if detached we can free the thread now */
+		if (state & BIT(ST_LT_DETACH)) {
+			_lthread_free(lt);
+			sched->current_lthread = NULL;
+			return;
+		}
+	}
+
+	if (state & BIT(ST_LT_INIT)) {
+		/* first time this thread has been run */
+		/* assign thread to this scheduler */
+		lt->sched = THIS_SCHED;
+
+		/* allocate stack */
+		s = _stack_alloc();
+
+		lt->stack_container = s;
+		_lthread_set_stack(lt, s->stack, s->stack_size);
+
+		/* allocate memory for TLS used by this thread */
+		_lthread_tls_alloc(lt);
+
+		lt->state = BIT(ST_LT_READY);
+#if LTHREAD_DIAG
+		init = 1;
+#endif
+	}
+
+	DIAG_EVENT(lt, LT_DIAG_LTHREAD_RESUMED, init, lt);
+
+	/* switch to the new thread */
+	ctx_switch(&lt->ctx, &sched->ctx);
+
+	/* If posting to a queue that could be read by another lcore
+	 * we defer the queue write till now to ensure the context has been
+	 * saved before the other core tries to resume it
+	 * This applies to blocking on mutex, cond, and to set_affinity
+	 */
+	if (lt->pending_wr_queue != NULL) {
+		struct lthread_queue *dest = lt->pending_wr_queue;
+
+		lt->pending_wr_queue = NULL;
+
+		/* queue the current thread to the specified queue */
+		_lthread_queue_insert_mp(dest, lt);
+	}
+
+	sched->current_lthread = NULL;
+}
+
+/*
+ * Handle sleep timer expiry
+ */
+void
+_sched_timer_cb(struct rte_timer *tim, void *arg)
+{
+	struct lthread *lt = (struct lthread *) arg;
+	uint64_t state = lt->state;
+
+	DIAG_EVENT(lt, LT_DIAG_LTHREAD_TMR_EXPIRED, &lt->tim, 0);
+
+	rte_timer_stop(tim);
+
+	if (lt->state & BIT(ST_LT_CANCELLED))
+		(THIS_SCHED)->nb_blocked_threads--;
+
+	lt->state = state | BIT(ST_LT_EXPIRED);
+	_lthread_resume(lt);
+	lt->state = state & CLEARBIT(ST_LT_EXPIRED);
+}
+
+
+
+/*
+ * Returns 0 if there is a pending job in scheduler or 1 if done and can exit.
+ */
+static inline int _lthread_sched_isdone(struct lthread_sched *sched)
+{
+	return ((sched->run_flag == 0) &&
+			(_lthread_queue_empty(sched->ready)) &&
+			(_lthread_queue_empty(sched->pready)) &&
+			(sched->nb_blocked_threads == 0));
+}
+
+/*
+ * Wait for all schedulers to start
+ */
+static inline void _lthread_schedulers_sync_start(void)
+{
+	rte_atomic16_inc(&active_schedulers);
+
+	/* wait for lthread schedulers
+	 * Note we use sched_yield() rather than pthread_yield() to allow
+	 * for the possibility of a pthread wrapper on lthread_yield(),
+	 * something that is not possible unless the scheduler is running.
+	 */
+	while (rte_atomic16_read(&active_schedulers) <
+	       rte_atomic16_read(&num_schedulers))
+		sched_yield();
+
+}
+
+/*
+ * Wait for all schedulers to stop
+ */
+static inline void _lthread_schedulers_sync_stop(void)
+{
+	rte_atomic16_dec(&active_schedulers);
+	rte_atomic16_dec(&num_schedulers);
+
+	/* wait for schedulers
+	 * Note we use sched_yield() rather than pthread_yield() to allow
+	 * for the possibility of a pthread wrapper on lthread_yield(),
+	 * something that is not possible unless the scheduler is running.
+	 */
+	while (rte_atomic16_read(&active_schedulers) > 0)
+		sched_yield();
+
+}
+
+
+/*
+ * Run the lthread scheduler
+ * This loop is the heart of the system
+ */
+void lthread_run(void)
+{
+
+	struct lthread_sched *sched = THIS_SCHED;
+	struct lthread *lt = NULL;
+
+	RTE_LOG(INFO, LTHREAD,
+		"starting scheduler %p on lcore %u phys core %u\n",
+		sched, rte_lcore_id(),
+		rte_lcore_index(rte_lcore_id()));
+
+	/* if more than one, wait for all schedulers to start */
+	_lthread_schedulers_sync_start();
+
+
+	/*
+	 * This is the main scheduling loop
+	 * So long as there are tasks in existence we run this loop.
+	 * We check for:-
+	 *   expired timers,
+	 *   the local ready queue,
+	 *   and the peer ready queue,
+	 *
+	 * and resume lthreads ad infinitum.
+	 */
+	while (!_lthread_sched_isdone(sched)) {
+
+		rte_timer_manage();
+
+		lt = _lthread_queue_poll(sched->ready);
+		if (lt != NULL)
+			_lthread_resume(lt);
+		lt = _lthread_queue_poll(sched->pready);
+		if (lt != NULL)
+			_lthread_resume(lt);
+	}
+
+
+	/* if more than one, wait for all schedulers to stop */
+	_lthread_schedulers_sync_stop();
+
+	(THIS_SCHED) = NULL;
+
+	RTE_LOG(INFO, LTHREAD,
+		"stopping scheduler %p on lcore %u phys core %u\n",
+		sched, rte_lcore_id(),
+		rte_lcore_index(rte_lcore_id()));
+	fflush(stdout);
+}
+
+/*
+ * Return the scheduler for the specified lcore
+ *
+ */
+struct lthread_sched *_lthread_sched_get(int lcore_id)
+{
+	if (lcore_id < 0 || lcore_id >= LTHREAD_MAX_LCORES)
+		return NULL;
+	return schedcore[lcore_id];
+}
+
+/*
+ * migrate the current thread to another scheduler running
+ * on the specified lcore.
+ */
+int lthread_set_affinity(unsigned lcoreid)
+{
+	struct lthread *lt = THIS_LTHREAD;
+	struct lthread_sched *dest_sched;
+
+	if (unlikely(lcoreid >= LTHREAD_MAX_LCORES))
+		return POSIX_ERRNO(EINVAL);
+
+
+	DIAG_EVENT(lt, LT_DIAG_LTHREAD_AFFINITY, lcoreid, 0);
+
+	dest_sched = schedcore[lcoreid];
+
+	if (unlikely(dest_sched == NULL))
+		return POSIX_ERRNO(EINVAL);
+
+	if (likely(dest_sched != THIS_SCHED)) {
+		lt->sched = dest_sched;
+		lt->pending_wr_queue = dest_sched->pready;
+		_affinitize();
+		return 0;
+	}
+	return 0;
+}
diff --git a/examples/performance-thread/common/lthread_sched.h b/examples/performance-thread/common/lthread_sched.h
new file mode 100644
index 0000000..f23264c
--- /dev/null
+++ b/examples/performance-thread/common/lthread_sched.h
@@ -0,0 +1,152 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * Some portions of this software are derived from
+ * https://github.com/halayli/lthread which carries the following license.
+ *
+ * Copyright (C) 2012, Hasan Alayli <halayli@gmail.com>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#ifndef LTHREAD_SCHED_H_
+#define LTHREAD_SCHED_H_
+
+#include "lthread_int.h"
+#include "lthread_queue.h"
+#include "lthread_objcache.h"
+#include "lthread_diag.h"
+#include "ctx.h"
+
+/*
+ * insert an lthread into a queue
+ */
+static inline void
+_ready_queue_insert(struct lthread_sched *sched, struct lthread *lt)
+{
+	if (sched == THIS_SCHED)
+		_lthread_queue_insert_sp((THIS_SCHED)->ready, lt);
+	else
+		_lthread_queue_insert_mp(sched->pready, lt);
+}
+
+/*
+ * remove an lthread from a queue
+ */
+static inline struct lthread *_ready_queue_remove(struct lthread_queue *q)
+{
+	return _lthread_queue_remove(q);
+}
+
+/**
+ * Return true if the ready queue is empty
+ */
+static inline int _ready_queue_empty(struct lthread_queue *q)
+{
+	return _lthread_queue_empty(q);
+}
+
+static inline uint64_t _sched_now(void)
+{
+	uint64_t now = rte_rdtsc();
+
+	if (now > (THIS_SCHED)->birth)
+		return now - (THIS_SCHED)->birth;
+	if (now < (THIS_SCHED)->birth)
+		return (THIS_SCHED)->birth - now;
+	/* never return 0 because this means sleep forever */
+	return 1;
+}
+
+static inline void
+_affinitize(void) __attribute__ ((always_inline));
+static inline void
+_affinitize(void)
+{
+	struct lthread *lt = THIS_LTHREAD;
+
+	DIAG_EVENT(lt, LT_DIAG_LTHREAD_SUSPENDED, 0, 0);
+	ctx_switch(&(THIS_SCHED)->ctx, &lt->ctx);
+}
+
+static inline void
+_suspend(void) __attribute__ ((always_inline));
+static inline void
+_suspend(void)
+{
+	struct lthread *lt = THIS_LTHREAD;
+
+	(THIS_SCHED)->nb_blocked_threads++;
+	DIAG_EVENT(lt, LT_DIAG_LTHREAD_SUSPENDED, 0, 0);
+	ctx_switch(&(THIS_SCHED)->ctx, &lt->ctx);
+	(THIS_SCHED)->nb_blocked_threads--;
+}
+
+static inline void
+_reschedule(void) __attribute__ ((always_inline));
+static inline void
+_reschedule(void)
+{
+	struct lthread *lt = THIS_LTHREAD;
+
+	DIAG_EVENT(lt, LT_DIAG_LTHREAD_RESCHEDULED, 0, 0);
+	_ready_queue_insert(THIS_SCHED, lt);
+	ctx_switch(&(THIS_SCHED)->ctx, &lt->ctx);
+}
+
+extern struct lthread_sched *schedcore[];
+void _sched_timer_cb(struct rte_timer *tim, void *arg);
+void _sched_shutdown(__rte_unused void *arg);
+
+
+#endif				/* LTHREAD_SCHED_H_ */
diff --git a/examples/performance-thread/common/lthread_timer.h b/examples/performance-thread/common/lthread_timer.h
new file mode 100644
index 0000000..7616694
--- /dev/null
+++ b/examples/performance-thread/common/lthread_timer.h
@@ -0,0 +1,47 @@
+/* <COPYRIGHT_TAG>
+  */
+#ifndef LTHREAD_TIMER_H_
+#define LTHREAD_TIMER_H_
+
+#include "lthread_int.h"
+#include "lthread_sched.h"
+
+
+static inline uint64_t
+_ns_to_clks(uint64_t ns)
+{
+	unsigned __int128 clkns = rte_get_tsc_hz();
+
+	clkns *= ns;
+	clkns /= 1000000000;
+	return (uint64_t) clkns;
+}
+
+
+static inline void
+_timer_start(struct lthread *lt, uint64_t clks)
+{
+	if (clks > 0) {
+		DIAG_EVENT(lt, LT_DIAG_LTHREAD_TMR_START, &lt->tim, clks);
+		rte_timer_init(&lt->tim);
+		rte_timer_reset(&lt->tim,
+				clks,
+				SINGLE,
+				rte_lcore_id(),
+				_sched_timer_cb,
+				(void *)lt);
+	}
+}
+
+
+static inline void
+_timer_stop(struct lthread *lt)
+{
+	if (lt != NULL) {
+		DIAG_EVENT(lt, LT_DIAG_LTHREAD_TMR_DELETE, &lt->tim, 0);
+		rte_timer_stop(&lt->tim);
+	}
+}
+
+
+#endif /* LTHREAD_TIMER_H_ */
diff --git a/examples/performance-thread/common/lthread_tls.c b/examples/performance-thread/common/lthread_tls.c
new file mode 100644
index 0000000..5d6f0fd
--- /dev/null
+++ b/examples/performance-thread/common/lthread_tls.c
@@ -0,0 +1,254 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <limits.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <sys/time.h>
+#include <sys/mman.h>
+#include <execinfo.h>
+#include <sched.h>
+
+#include <rte_config.h>
+#include <rte_malloc.h>
+#include <rte_log.h>
+#include <rte_ring.h>
+#include <rte_atomic_64.h>
+
+#include "lthread_tls.h"
+#include "lthread_queue.h"
+#include "lthread_objcache.h"
+#include "lthread_sched.h"
+
+static struct rte_ring *key_pool;
+static uint64_t key_pool_init;
+
+/* needed to cause section start and end to be defined */
+RTE_DEFINE_PER_LTHREAD(void *, dummy);
+
+static struct lthread_key key_table[LTHREAD_MAX_KEYS];
+
+void lthread_tls_ctor(void) __attribute__((constructor));
+
+void lthread_tls_ctor(void)
+{
+	key_pool = NULL;
+	key_pool_init = 0;
+}
+
+/*
+ * Initialize a pool of keys
+ * These are unique tokens that can be obtained by threads
+ * calling lthread_key_create()
+ */
+void _lthread_key_pool_init(void)
+{
+	static struct rte_ring *pool;
+	struct lthread_key *new_key;
+	char name[MAX_LTHREAD_NAME_SIZE];
+
+	/* only one lcore should do this */
+	if (rte_atomic64_cmpset(&key_pool_init, 0, 1)) {
+
+		bzero(key_table, sizeof(key_table));
+
+		snprintf(name,
+			MAX_LTHREAD_NAME_SIZE,
+			"lthread_key_pool_%d",
+			getpid());
+
+		pool = rte_ring_create(name,
+					LTHREAD_MAX_KEYS, 0, 0);
+		LTHREAD_ASSERT(pool);
+
+		int i;
+
+		for (i = 1; i < LTHREAD_MAX_KEYS; i++) {
+			new_key = &key_table[i];
+			rte_ring_mp_enqueue((struct rte_ring *)pool,
+						(void *)new_key);
+		}
+		key_pool = pool;
+	}
+	/* other lcores wait here till done */
+	while (key_pool == NULL) {
+		rte_compiler_barrier();
+		sched_yield();
+	}
+}
+
+/*
+ * Create a key
+ * this means getting a key from the pool
+ */
+int lthread_key_create(unsigned int *key, tls_destructor_func destructor)
+{
+	if (key == NULL)
+		return POSIX_ERRNO(EINVAL);
+
+	struct lthread_key *new_key;
+
+	if (rte_ring_mc_dequeue((struct rte_ring *)key_pool, (void **)&new_key)
+	    == 0) {
+		new_key->destructor = destructor;
+		*key = (new_key - key_table);
+
+		return 0;
+	}
+	return POSIX_ERRNO(EAGAIN);
+}
+
+
+/*
+ * Delete a key
+ */
+int lthread_key_delete(unsigned int k)
+{
+	struct lthread_key *key;
+
+	if (k >= LTHREAD_MAX_KEYS)
+		return POSIX_ERRNO(EINVAL);
+
+	key = &key_table[k];
+
+	key->destructor = NULL;
+	rte_ring_mp_enqueue((struct rte_ring *)key_pool,
+					(void *)key);
+	return 0;
+}
+
+
+
+/*
+ * Break association for all keys in use by this thread
+ * invoke the destructor if available.
+ * Since a destructor can create keys we could enter an infinite loop
+ * therefore we give up after LTHREAD_DESTRUCTOR_ITERATIONS
+ * the behavior is modelled on pthread
+ */
+void _lthread_tls_destroy(struct lthread *lt)
+{
+	int i, k;
+	int nb_keys;
+	void *data;
+
+	for (i = 0; i < LTHREAD_DESTRUCTOR_ITERATIONS; i++) {
+
+		for (k = 1; k < LTHREAD_MAX_KEYS; k++) {
+
+			/* no keys in use ? */
+			nb_keys = lt->tls->nb_keys_inuse;
+			if (nb_keys == 0)
+				return;
+
+			/* this key not in use ? */
+			if (lt->tls->data[k] == NULL)
+				continue;
+
+			/* remove this key */
+			data = lt->tls->data[k];
+			lt->tls->data[k] = NULL;
+			lt->tls->nb_keys_inuse = nb_keys-1;
+
+			/* invoke destructor */
+			if (key_table[k].destructor != NULL)
+				key_table[k].destructor(data);
+		}
+	}
+}
+
+/*
+ * Return the pointer associated with a key
+ * If the key is no longer valid return NULL
+ */
+void
+*lthread_getspecific(unsigned int k)
+{
+
+	if (k >= LTHREAD_MAX_KEYS)
+		return NULL;
+
+	return THIS_LTHREAD->tls->data[k];
+}
+
+/*
+ * Set a value against a key
+ * If the key is no longer valid return an error
+ * when storing value
+ */
+int lthread_setspecific(unsigned int k, const void *data)
+{
+	if (k >= LTHREAD_MAX_KEYS)
+		return POSIX_ERRNO(EINVAL);
+
+	int n = THIS_LTHREAD->tls->nb_keys_inuse;
+
+	/* discard const qualifier */
+	char *p = (char *) (uintptr_t) data;
+
+
+	if (data != NULL) {
+		if (THIS_LTHREAD->tls->data[k] == NULL)
+			THIS_LTHREAD->tls->nb_keys_inuse = n+1;
+	}
+
+	THIS_LTHREAD->tls->data[k] = (void *) p;
+	return 0;
+}
+
+/*
+ * Allocate data for TLS cache
+ */
+void _lthread_tls_alloc(struct lthread *lt)
+{
+	struct lthread_tls *tls;
+
+	tls = _lthread_objcache_alloc((THIS_SCHED)->tls_cache);
+
+	LTHREAD_ASSERT(tls != NULL);
+
+	tls->root_sched = (THIS_SCHED);
+	lt->tls = tls;
+
+	/* allocate data for TLS variables using RTE_PER_LTHREAD macros */
+	if (sizeof(void *) < (uint64_t)RTE_PER_LTHREAD_SECTION_SIZE) {
+		lt->per_lthread_data =
+		    _lthread_objcache_alloc((THIS_SCHED)->per_lthread_cache);
+	}
+}
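
Reviewer note on _lthread_tls_destroy() above: the bounded outer loop exists because a destructor may store a new value against a key while it runs. The following is a minimal standalone model of that behavior, not part of the patch. MAX_KEYS, DESTRUCTOR_ITERATIONS, struct tls, tls_set, tls_destroy and the demo_* helpers are all hypothetical stand-ins for the lthread definitions. Unlike lthread_setspecific() in the patch, this tls_set also decrements the in-use count when a value is cleared; that is a choice made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_KEYS 8		/* stand-in for LTHREAD_MAX_KEYS */
#define DESTRUCTOR_ITERATIONS 4	/* stand-in for LTHREAD_DESTRUCTOR_ITERATIONS */

typedef void (*destructor_fn)(void *);

struct tls {
	void *data[MAX_KEYS];
	int nb_keys_inuse;
};

static destructor_fn destructors[MAX_KEYS];

/* Store a value against key k; valid keys are 0 .. MAX_KEYS-1 */
static int
tls_set(struct tls *t, unsigned int k, void *data)
{
	if (k >= MAX_KEYS)
		return -1;
	if (data != NULL && t->data[k] == NULL)
		t->nb_keys_inuse++;
	else if (data == NULL && t->data[k] != NULL)
		t->nb_keys_inuse--;
	t->data[k] = data;
	return 0;
}

/* Run destructors, giving up after DESTRUCTOR_ITERATIONS passes */
static void
tls_destroy(struct tls *t)
{
	int i;
	unsigned int k;

	for (i = 0; i < DESTRUCTOR_ITERATIONS; i++) {
		for (k = 1; k < MAX_KEYS; k++) {
			void *data;

			if (t->nb_keys_inuse == 0)
				return;
			if (t->data[k] == NULL)
				continue;
			/* detach the value before calling the destructor */
			data = t->data[k];
			t->data[k] = NULL;
			t->nb_keys_inuse--;
			if (destructors[k] != NULL)
				destructors[k](data);
		}
	}
}

/* Demo destructor that only counts its invocations */
static int demo_count_calls;
static void
demo_count(void *data)
{
	(void)data;
	demo_count_calls++;
}

/* Demo destructor that re-arms its own key on every call */
static struct tls demo_tls;
static int demo_respawn_calls;
static void
demo_respawn(void *data)
{
	demo_respawn_calls++;
	tls_set(&demo_tls, 1, data);
}
```

A well-behaved destructor is called once and the loop exits early. One that keeps re-arming its key is called once per pass and the value is left in place when the iteration limit is reached.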
diff --git a/examples/performance-thread/common/lthread_tls.h b/examples/performance-thread/common/lthread_tls.h
new file mode 100644
index 0000000..7243234
--- /dev/null
+++ b/examples/performance-thread/common/lthread_tls.h
@@ -0,0 +1,57 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef LTHREAD_TLS_H_
+#define LTHREAD_TLS_H_
+
+#include "lthread_api.h"
+
+#define RTE_PER_LTHREAD_SECTION_SIZE \
+(&__stop_per_lt - &__start_per_lt)
+
+struct lthread_key {
+	tls_destructor_func destructor;
+};
+
+struct lthread_tls {
+	void *data[LTHREAD_MAX_KEYS];
+	int  nb_keys_inuse;
+	struct lthread_sched *root_sched;
+};
+
+void _lthread_tls_destroy(struct lthread *lt);
+void _lthread_key_pool_init(void);
+void _lthread_tls_alloc(struct lthread *lt);
+
+
+#endif				/* LTHREAD_TLS_H_ */
diff --git a/examples/performance-thread/l3fwd-thread/Makefile b/examples/performance-thread/l3fwd-thread/Makefile
new file mode 100644
index 0000000..d8fe5e6
--- /dev/null
+++ b/examples/performance-thread/l3fwd-thread/Makefile
@@ -0,0 +1,57 @@
+#   BSD LICENSE
+#
+#   Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+#   All rights reserved.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Intel Corporation nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+ifeq ($(RTE_SDK),)
+$(error "Please define RTE_SDK environment variable")
+endif
+
+# Default target, can be overridden by command line or environment
+RTE_TARGET ?= x86_64-native-linuxapp-gcc
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# binary name
+APP = l3fwd-thread
+
+# all sources are stored in SRCS-y
+SRCS-y := main.c
+
+include $(RTE_SDK)/examples/performance-thread/common/common.mk
+
+CFLAGS += -O3 -g $(USER_FLAGS) $(INCLUDES) $(WERROR_FLAGS)
+
+# workaround for a gcc bug with noreturn attribute
+# http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12603
+#ifeq ($(CONFIG_RTE_TOOLCHAIN_GCC),y)
+CFLAGS_main.o += -Wno-return-type
+#endif
+
+include $(RTE_SDK)/mk/rte.extapp.mk
diff --git a/examples/performance-thread/l3fwd-thread/main.c b/examples/performance-thread/l3fwd-thread/main.c
new file mode 100644
index 0000000..db6cb64
--- /dev/null
+++ b/examples/performance-thread/l3fwd-thread/main.c
@@ -0,0 +1,3641 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <inttypes.h>
+#include <sys/types.h>
+#include <string.h>
+#include <sys/queue.h>
+#include <stdarg.h>
+#include <errno.h>
+#include <getopt.h>
+
+#include <rte_common.h>
+#include <rte_vect.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memcpy.h>
+#include <rte_memzone.h>
+#include <rte_eal.h>
+#include <rte_per_lcore.h>
+#include <rte_launch.h>
+#include <rte_atomic.h>
+#include <rte_cycles.h>
+#include <rte_prefetch.h>
+#include <rte_lcore.h>
+#include <rte_per_lcore.h>
+#include <rte_branch_prediction.h>
+#include <rte_interrupts.h>
+#include <rte_pci.h>
+#include <rte_random.h>
+#include <rte_debug.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_ring.h>
+#include <rte_mempool.h>
+#include <rte_mbuf.h>
+#include <rte_ip.h>
+#include <rte_tcp.h>
+#include <rte_udp.h>
+#include <rte_string_fns.h>
+
+#include <cmdline_parse.h>
+#include <cmdline_parse_etheraddr.h>
+
+#include <lthread_api.h>
+
+#define APP_LOOKUP_EXACT_MATCH          0
+#define APP_LOOKUP_LPM                  1
+#define DO_RFC_1812_CHECKS
+
+/* Enable cpu-load stats 0-off, 1-on */
+#define APP_CPU_LOAD                 1
+
+#ifndef APP_LOOKUP_METHOD
+#define APP_LOOKUP_METHOD             APP_LOOKUP_LPM
+#endif
+
+/*
+ *  When set to zero, simple forwarding path is enabled.
+ *  When set to one, optimized forwarding path is enabled.
+ *  Note that LPM optimisation path uses SSE4.1 instructions.
+ */
+#if ((APP_LOOKUP_METHOD == APP_LOOKUP_LPM) && !defined(__SSE4_1__))
+#define ENABLE_MULTI_BUFFER_OPTIMIZE	0
+#else
+#define ENABLE_MULTI_BUFFER_OPTIMIZE	1
+#endif
+
+#if (APP_LOOKUP_METHOD == APP_LOOKUP_EXACT_MATCH)
+#include <rte_hash.h>
+#elif (APP_LOOKUP_METHOD == APP_LOOKUP_LPM)
+#include <rte_lpm.h>
+#include <rte_lpm6.h>
+#else
+#error "APP_LOOKUP_METHOD set to incorrect value"
+#endif
+
+#define RTE_LOGTYPE_L3FWD RTE_LOGTYPE_USER1
+
+#define MAX_JUMBO_PKT_LEN  9600
+
+#define IPV6_ADDR_LEN 16
+
+#define MEMPOOL_CACHE_SIZE 256
+
+/*
+ * This expression is used to calculate the number of mbufs needed depending on
+ * user input, taking into account memory for rx and tx hardware rings, cache
+ * per lcore and mtable per port per lcore. RTE_MAX is used to ensure that
+ * NB_MBUF never goes below a minimum value of 8192
+ */
+
+#define NB_MBUF RTE_MAX(\
+		(nb_ports*nb_rx_queue*RTE_TEST_RX_DESC_DEFAULT +       \
+		nb_ports*nb_lcores*MAX_PKT_BURST +                     \
+		nb_ports*n_tx_queue*RTE_TEST_TX_DESC_DEFAULT +         \
+		nb_lcores*MEMPOOL_CACHE_SIZE),                         \
+		(unsigned)8192)
+
+#define MAX_PKT_BURST     32
+#define BURST_TX_DRAIN_US 100 /* TX drain every ~100us */
+
+/*
+ * Try to avoid TX buffering if we have at least MAX_TX_BURST packets to send.
+ */
+#define	MAX_TX_BURST  (MAX_PKT_BURST / 2)
+#define BURST_SIZE    MAX_TX_BURST
+
+#define NB_SOCKETS 8
+
+/* Configure how many packets ahead to prefetch, when reading packets */
+#define PREFETCH_OFFSET	3
+
+/* Used to mark destination port as 'invalid'. */
+#define	BAD_PORT	((uint16_t)-1)
+
+#define FWDSTEP	4
+
+/*
+ * Configurable number of RX/TX ring descriptors
+ */
+#define RTE_TEST_RX_DESC_DEFAULT 128
+#define RTE_TEST_TX_DESC_DEFAULT 128
+static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT;
+static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT;
+
+/* ethernet addresses of ports */
+static uint64_t dest_eth_addr[RTE_MAX_ETHPORTS];
+static struct ether_addr ports_eth_addr[RTE_MAX_ETHPORTS];
+
+static __m128i val_eth[RTE_MAX_ETHPORTS];
+
+/* replace first 12B of the ethernet header. */
+#define	MASK_ETH 0x3f
+
+/* mask of enabled ports */
+static uint32_t enabled_port_mask;
+static int promiscuous_on; /**< Promiscuous mode is off by default. */
+static int numa_on = 1;    /**< NUMA is enabled by default. */
+
+#if (APP_LOOKUP_METHOD == APP_LOOKUP_EXACT_MATCH)
+static int ipv6;           /**< ipv6 is false by default. */
+#endif
+
+#if (APP_CPU_LOAD == 1)
+
+#define MAX_CPU RTE_MAX_LCORE
+#define CPU_LOAD_TIMEOUT_US (5 * 1000 * 1000)  /**< Timeout for collecting 5s */
+
+#define CPU_PROCESS     0
+#define CPU_POLL        1
+#define MAX_CPU_COUNTER 2
+
+struct cpu_load {
+	uint16_t       n_cpu;
+	uint64_t       counter;
+	uint64_t       hits[MAX_CPU_COUNTER][MAX_CPU];
+} __rte_cache_aligned;
+
+static struct cpu_load cpu_load;
+static int cpu_load_lcore_id = -1;
+
+#define SET_CPU_BUSY(thread, counter) \
+		thread->conf.busy[counter] = 1
+
+#define SET_CPU_IDLE(thread, counter) \
+		thread->conf.busy[counter] = 0
+
+#define IS_CPU_BUSY(thread, counter) \
+		(thread->conf.busy[counter] > 0)
+
+#else
+
+#define SET_CPU_BUSY(thread, counter)
+#define SET_CPU_IDLE(thread, counter)
+#define IS_CPU_BUSY(thread, counter) 0
+
+#endif
+
+struct mbuf_table {
+	uint16_t len;
+	struct rte_mbuf *m_table[MAX_PKT_BURST];
+};
+
+struct lcore_rx_queue {
+	uint8_t port_id;
+	uint8_t queue_id;
+} __rte_cache_aligned;
+
+#define MAX_RX_QUEUE_PER_LCORE 16
+#define MAX_TX_QUEUE_PER_PORT  RTE_MAX_ETHPORTS
+#define MAX_RX_QUEUE_PER_PORT  128
+
+#define MAX_LCORE_PARAMS       1024
+struct rx_thread_params {
+	uint8_t port_id;
+	uint8_t queue_id;
+	uint8_t lcore_id;
+	uint8_t thread_id;
+} __rte_cache_aligned;
+
+static struct rx_thread_params rx_thread_params_array[MAX_LCORE_PARAMS];
+static struct rx_thread_params rx_thread_params_array_default[] = {
+	{0, 0, 2, 0},
+	{0, 1, 2, 1},
+	{0, 2, 2, 2},
+	{1, 0, 2, 3},
+	{1, 1, 2, 4},
+	{1, 2, 2, 5},
+	{2, 0, 2, 6},
+	{3, 0, 3, 7},
+	{3, 1, 3, 8},
+};
+
+static struct rx_thread_params *rx_thread_params =
+		rx_thread_params_array_default;
+static uint16_t nb_rx_thread_params = RTE_DIM(rx_thread_params_array_default);
+
+struct tx_thread_params {
+	uint8_t lcore_id;
+	uint8_t thread_id;
+} __rte_cache_aligned;
+
+static struct tx_thread_params tx_thread_params_array[MAX_LCORE_PARAMS];
+static struct tx_thread_params tx_thread_params_array_default[] = {
+	{4, 0},
+	{5, 1},
+	{6, 2},
+	{7, 3},
+	{8, 4},
+	{9, 5},
+	{10, 6},
+	{11, 7},
+	{12, 8},
+};
+
+static struct tx_thread_params *tx_thread_params =
+		tx_thread_params_array_default;
+static uint16_t nb_tx_thread_params = RTE_DIM(tx_thread_params_array_default);
+
+static struct rte_eth_conf port_conf = {
+	.rxmode = {
+		.mq_mode = ETH_MQ_RX_RSS,
+		.max_rx_pkt_len = ETHER_MAX_LEN,
+		.split_hdr_size = 0,
+		.header_split   = 0, /**< Header Split disabled */
+		.hw_ip_checksum = 1, /**< IP checksum offload enabled */
+		.hw_vlan_filter = 0, /**< VLAN filtering disabled */
+		.jumbo_frame    = 0, /**< Jumbo Frame Support disabled */
+		.hw_strip_crc   = 0, /**< CRC stripping by hardware disabled */
+	},
+	.rx_adv_conf = {
+		.rss_conf = {
+			.rss_key = NULL,
+			.rss_hf = ETH_RSS_TCP,
+		},
+	},
+	.txmode = {
+		.mq_mode = ETH_MQ_TX_NONE,
+	},
+};
+
+static struct rte_mempool *pktmbuf_pool[NB_SOCKETS];
+
+#if (APP_LOOKUP_METHOD == APP_LOOKUP_EXACT_MATCH)
+
+#ifdef RTE_MACHINE_CPUFLAG_SSE4_2
+#include <rte_hash_crc.h>
+#define DEFAULT_HASH_FUNC       rte_hash_crc
+#else
+#include <rte_jhash.h>
+#define DEFAULT_HASH_FUNC       rte_jhash
+#endif
+
+struct ipv4_5tuple {
+	uint32_t ip_dst;
+	uint32_t ip_src;
+	uint16_t port_dst;
+	uint16_t port_src;
+	uint8_t  proto;
+} __attribute__((__packed__));
+
+union ipv4_5tuple_host {
+	struct {
+		uint8_t  pad0;
+		uint8_t  proto;
+		uint16_t pad1;
+		uint32_t ip_src;
+		uint32_t ip_dst;
+		uint16_t port_src;
+		uint16_t port_dst;
+	};
+	__m128i xmm;
+};
+
+#define XMM_NUM_IN_IPV6_5TUPLE 3
+
+struct ipv6_5tuple {
+	uint8_t  ip_dst[IPV6_ADDR_LEN];
+	uint8_t  ip_src[IPV6_ADDR_LEN];
+	uint16_t port_dst;
+	uint16_t port_src;
+	uint8_t  proto;
+} __attribute__((__packed__));
+
+union ipv6_5tuple_host {
+	struct {
+		uint16_t pad0;
+		uint8_t  proto;
+		uint8_t  pad1;
+		uint8_t  ip_src[IPV6_ADDR_LEN];
+		uint8_t  ip_dst[IPV6_ADDR_LEN];
+		uint16_t port_src;
+		uint16_t port_dst;
+		uint64_t reserve;
+	};
+	__m128i xmm[XMM_NUM_IN_IPV6_5TUPLE];
+};
+
+struct ipv4_l3fwd_route {
+	struct ipv4_5tuple key;
+	uint8_t if_out;
+};
+
+struct ipv6_l3fwd_route {
+	struct ipv6_5tuple key;
+	uint8_t if_out;
+};
+
+static struct ipv4_l3fwd_route ipv4_l3fwd_route_array[] = {
+	{{IPv4(101, 0, 0, 0), IPv4(100, 10, 0, 1),  101, 11, IPPROTO_TCP}, 0},
+	{{IPv4(201, 0, 0, 0), IPv4(200, 20, 0, 1),  102, 12, IPPROTO_TCP}, 1},
+	{{IPv4(111, 0, 0, 0), IPv4(100, 30, 0, 1),  101, 11, IPPROTO_TCP}, 2},
+	{{IPv4(211, 0, 0, 0), IPv4(200, 40, 0, 1),  102, 12, IPPROTO_TCP}, 3},
+};
+
+static struct ipv6_l3fwd_route ipv6_l3fwd_route_array[] = {
+	{{
+	{0xfe, 0x80, 0, 0, 0, 0, 0, 0, 0x02, 0x1e, 0x67, 0xff, 0xfe, 0, 0, 0},
+	{0xfe, 0x80, 0, 0, 0, 0, 0, 0, 0x02, 0x1b, 0x21, 0xff, 0xfe, 0x91, 0x38,
+			0x05},
+	101, 11, IPPROTO_TCP}, 0},
+
+	{{
+	{0xfe, 0x90, 0, 0, 0, 0, 0, 0, 0x02, 0x1e, 0x67, 0xff, 0xfe, 0, 0, 0},
+	{0xfe, 0x90, 0, 0, 0, 0, 0, 0, 0x02, 0x1b, 0x21, 0xff, 0xfe, 0x91, 0x38,
+			0x05},
+	102, 12, IPPROTO_TCP}, 1},
+
+	{{
+	{0xfe, 0xa0, 0, 0, 0, 0, 0, 0, 0x02, 0x1e, 0x67, 0xff, 0xfe, 0, 0, 0},
+	{0xfe, 0xa0, 0, 0, 0, 0, 0, 0, 0x02, 0x1b, 0x21, 0xff, 0xfe, 0x91, 0x38,
+			0x05},
+	101, 11, IPPROTO_TCP}, 2},
+
+	{{
+	{0xfe, 0xb0, 0, 0, 0, 0, 0, 0, 0x02, 0x1e, 0x67, 0xff, 0xfe, 0, 0, 0},
+	{0xfe, 0xb0, 0, 0, 0, 0, 0, 0, 0x02, 0x1b, 0x21, 0xff, 0xfe, 0x91, 0x38,
+			0x05},
+	102, 12, IPPROTO_TCP}, 3},
+};
+
+typedef struct rte_hash lookup_struct_t;
+static lookup_struct_t *ipv4_l3fwd_lookup_struct[NB_SOCKETS];
+static lookup_struct_t *ipv6_l3fwd_lookup_struct[NB_SOCKETS];
+
+#ifdef RTE_ARCH_X86_64
+/* default to 4 million hash entries (approx) */
+#define L3FWD_HASH_ENTRIES (1024*1024*4)
+#else
+/* 32-bit has less address-space for hugepage memory, limit to 1M entries */
+#define L3FWD_HASH_ENTRIES (1024*1024*1)
+#endif
+#define HASH_ENTRY_NUMBER_DEFAULT 4
+
+static uint32_t hash_entry_number = HASH_ENTRY_NUMBER_DEFAULT;
+
+static inline uint32_t
+ipv4_hash_crc(const void *data, __rte_unused uint32_t data_len,
+		uint32_t init_val)
+{
+	const union ipv4_5tuple_host *k;
+	uint32_t t;
+	const uint32_t *p;
+
+	k = data;
+	t = k->proto;
+	p = (const uint32_t *)&k->port_src;
+
+#ifdef RTE_MACHINE_CPUFLAG_SSE4_2
+	init_val = rte_hash_crc_4byte(t, init_val);
+	init_val = rte_hash_crc_4byte(k->ip_src, init_val);
+	init_val = rte_hash_crc_4byte(k->ip_dst, init_val);
+	init_val = rte_hash_crc_4byte(*p, init_val);
+#else /* RTE_MACHINE_CPUFLAG_SSE4_2 */
+	init_val = rte_jhash_1word(t, init_val);
+	init_val = rte_jhash_1word(k->ip_src, init_val);
+	init_val = rte_jhash_1word(k->ip_dst, init_val);
+	init_val = rte_jhash_1word(*p, init_val);
+#endif /* RTE_MACHINE_CPUFLAG_SSE4_2 */
+	return init_val;
+}
+
+static inline uint32_t
+ipv6_hash_crc(const void *data, __rte_unused uint32_t data_len,
+		uint32_t init_val)
+{
+	const union ipv6_5tuple_host *k;
+	uint32_t t;
+	const uint32_t *p;
+#ifdef RTE_MACHINE_CPUFLAG_SSE4_2
+	const uint32_t *ip_src0, *ip_src1, *ip_src2, *ip_src3;
+	const uint32_t *ip_dst0, *ip_dst1, *ip_dst2, *ip_dst3;
+#endif /* RTE_MACHINE_CPUFLAG_SSE4_2 */
+
+	k = data;
+	t = k->proto;
+	p = (const uint32_t *)&k->port_src;
+
+#ifdef RTE_MACHINE_CPUFLAG_SSE4_2
+	ip_src0 = (const uint32_t *) k->ip_src;
+	ip_src1 = (const uint32_t *)(k->ip_src + 4);
+	ip_src2 = (const uint32_t *)(k->ip_src + 8);
+	ip_src3 = (const uint32_t *)(k->ip_src + 12);
+	ip_dst0 = (const uint32_t *) k->ip_dst;
+	ip_dst1 = (const uint32_t *)(k->ip_dst + 4);
+	ip_dst2 = (const uint32_t *)(k->ip_dst + 8);
+	ip_dst3 = (const uint32_t *)(k->ip_dst + 12);
+	init_val = rte_hash_crc_4byte(t, init_val);
+	init_val = rte_hash_crc_4byte(*ip_src0, init_val);
+	init_val = rte_hash_crc_4byte(*ip_src1, init_val);
+	init_val = rte_hash_crc_4byte(*ip_src2, init_val);
+	init_val = rte_hash_crc_4byte(*ip_src3, init_val);
+	init_val = rte_hash_crc_4byte(*ip_dst0, init_val);
+	init_val = rte_hash_crc_4byte(*ip_dst1, init_val);
+	init_val = rte_hash_crc_4byte(*ip_dst2, init_val);
+	init_val = rte_hash_crc_4byte(*ip_dst3, init_val);
+	init_val = rte_hash_crc_4byte(*p, init_val);
+#else /* RTE_MACHINE_CPUFLAG_SSE4_2 */
+	init_val = rte_jhash_1word(t, init_val);
+	init_val = rte_jhash(k->ip_src, sizeof(uint8_t) * IPV6_ADDR_LEN, init_val);
+	init_val = rte_jhash(k->ip_dst, sizeof(uint8_t) * IPV6_ADDR_LEN, init_val);
+	init_val = rte_jhash_1word(*p, init_val);
+#endif /* RTE_MACHINE_CPUFLAG_SSE4_2 */
+	return init_val;
+}
+
+#define IPV4_L3FWD_NUM_ROUTES RTE_DIM(ipv4_l3fwd_route_array)
+#define IPV6_L3FWD_NUM_ROUTES RTE_DIM(ipv6_l3fwd_route_array)
+
+static uint8_t ipv4_l3fwd_out_if[L3FWD_HASH_ENTRIES] __rte_cache_aligned;
+static uint8_t ipv6_l3fwd_out_if[L3FWD_HASH_ENTRIES] __rte_cache_aligned;
+
+#endif
+
+#if (APP_LOOKUP_METHOD == APP_LOOKUP_LPM)
+struct ipv4_l3fwd_route {
+	uint32_t ip;
+	uint8_t  depth;
+	uint8_t  if_out;
+};
+
+struct ipv6_l3fwd_route {
+	uint8_t ip[16];
+	uint8_t depth;
+	uint8_t if_out;
+};
+
+static struct ipv4_l3fwd_route ipv4_l3fwd_route_array[] = {
+	{IPv4(1, 1, 1, 0), 24, 0},
+	{IPv4(2, 1, 1, 0), 24, 1},
+	{IPv4(3, 1, 1, 0), 24, 2},
+	{IPv4(4, 1, 1, 0), 24, 3},
+	{IPv4(5, 1, 1, 0), 24, 4},
+	{IPv4(6, 1, 1, 0), 24, 5},
+	{IPv4(7, 1, 1, 0), 24, 6},
+	{IPv4(8, 1, 1, 0), 24, 7},
+};
+
+static struct ipv6_l3fwd_route ipv6_l3fwd_route_array[] = {
+	{{1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}, 48, 0},
+	{{2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}, 48, 1},
+	{{3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}, 48, 2},
+	{{4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}, 48, 3},
+	{{5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}, 48, 4},
+	{{6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}, 48, 5},
+	{{7, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}, 48, 6},
+	{{8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}, 48, 7},
+};
+
+#define IPV4_L3FWD_NUM_ROUTES RTE_DIM(ipv4_l3fwd_route_array)
+#define IPV6_L3FWD_NUM_ROUTES RTE_DIM(ipv6_l3fwd_route_array)
+
+#define IPV4_L3FWD_LPM_MAX_RULES         1024
+#define IPV6_L3FWD_LPM_MAX_RULES         1024
+#define IPV6_L3FWD_LPM_NUMBER_TBL8S (1 << 16)
+
+typedef struct rte_lpm lookup_struct_t;
+typedef struct rte_lpm6 lookup6_struct_t;
+static lookup_struct_t *ipv4_l3fwd_lookup_struct[NB_SOCKETS];
+static lookup6_struct_t *ipv6_l3fwd_lookup_struct[NB_SOCKETS];
+#endif
+
+struct lcore_conf {
+	lookup_struct_t *ipv4_lookup_struct;
+#if (APP_LOOKUP_METHOD == APP_LOOKUP_LPM)
+	lookup6_struct_t *ipv6_lookup_struct;
+#else
+	lookup_struct_t *ipv6_lookup_struct;
+#endif
+	void *data;
+} __rte_cache_aligned;
+
+static struct lcore_conf lcore_conf[RTE_MAX_LCORE];
+RTE_DEFINE_PER_LCORE(struct lcore_conf *, lcore_conf);
+
+#define MAX_RX_QUEUE_PER_THREAD 16
+#define MAX_TX_PORT_PER_THREAD  RTE_MAX_ETHPORTS
+#define MAX_TX_QUEUE_PER_PORT   RTE_MAX_ETHPORTS
+#define MAX_RX_QUEUE_PER_PORT   128
+
+#define MAX_RX_THREAD 1024
+#define MAX_TX_THREAD 1024
+#define MAX_THREAD    (MAX_RX_THREAD + MAX_TX_THREAD)
+
+/**
+ * Producer and consumer thread configuration
+ */
+static int lthreads_on = 1; /**< Use lthreads for processing */
+
+rte_atomic16_t rx_counter;  /**< Number of spawned rx threads */
+rte_atomic16_t tx_counter;  /**< Number of spawned tx threads */
+
+struct thread_conf {
+	uint16_t lcore_id;      /**< Initial lcore for rx thread */
+	uint16_t cpu_id;        /**< Cpu id for cpu load stats counter */
+	uint16_t thread_id;     /**< Thread ID */
+
+#if (APP_CPU_LOAD > 0)
+	int busy[MAX_CPU_COUNTER];
+#endif
+};
+
+struct thread_rx_conf {
+	struct thread_conf conf;
+
+	uint16_t n_rx_queue;
+	struct lcore_rx_queue rx_queue_list[MAX_RX_QUEUE_PER_LCORE];
+
+	uint16_t n_ring;        /**< Number of output rings */
+	struct rte_ring *ring[RTE_MAX_LCORE];
+	struct lthread_cond *ready[RTE_MAX_LCORE];
+
+#if (APP_CPU_LOAD > 0)
+	int busy[MAX_CPU_COUNTER];
+#endif
+} __rte_cache_aligned;
+
+uint16_t n_rx_thread;
+struct thread_rx_conf rx_thread[MAX_RX_THREAD];
+
+struct thread_tx_conf {
+	struct thread_conf conf;
+
+	uint16_t tx_queue_id[RTE_MAX_LCORE];
+	struct mbuf_table tx_mbufs[RTE_MAX_LCORE];
+
+	struct rte_ring *ring;
+	struct lthread_cond **ready;
+
+} __rte_cache_aligned;
+
+uint16_t n_tx_thread;
+struct thread_tx_conf tx_thread[MAX_TX_THREAD];
+
+/* Send burst of packets on an output interface */
+static inline int
+send_burst(struct thread_tx_conf *qconf, uint16_t n, uint8_t port)
+{
+	struct rte_mbuf **m_table;
+	int ret;
+	uint16_t queueid;
+
+	queueid = qconf->tx_queue_id[port];
+	m_table = (struct rte_mbuf **)qconf->tx_mbufs[port].m_table;
+
+	ret = rte_eth_tx_burst(port, queueid, m_table, n);
+	if (unlikely(ret < n)) {
+		do {
+			rte_pktmbuf_free(m_table[ret]);
+		} while (++ret < n);
+	}
+
+	return 0;
+}
+
+/* Enqueue a single packet, and send burst if queue is filled */
+static inline int
+send_single_packet(struct rte_mbuf *m, uint8_t port)
+{
+	uint16_t len;
+	struct thread_tx_conf *qconf;
+
+	if (lthreads_on)
+		qconf = (struct thread_tx_conf *)lthread_get_data();
+	else
+		qconf = (struct thread_tx_conf *)RTE_PER_LCORE(lcore_conf)->data;
+
+	len = qconf->tx_mbufs[port].len;
+	qconf->tx_mbufs[port].m_table[len] = m;
+	len++;
+
+	/* enough pkts to be sent */
+	if (unlikely(len == MAX_PKT_BURST)) {
+		send_burst(qconf, MAX_PKT_BURST, port);
+		len = 0;
+	}
+
+	qconf->tx_mbufs[port].len = len;
+	return 0;
+}
+
+#if (APP_LOOKUP_METHOD == APP_LOOKUP_LPM)
+static inline __attribute__((always_inline)) void
+send_packetsx4(uint8_t port,
+	struct rte_mbuf *m[], uint32_t num)
+{
+	uint32_t len, j, n;
+	struct thread_tx_conf *qconf;
+
+	if (lthreads_on)
+		qconf = (struct thread_tx_conf *)lthread_get_data();
+	else
+		qconf = (struct thread_tx_conf *)RTE_PER_LCORE(lcore_conf)->data;
+
+	len = qconf->tx_mbufs[port].len;
+
+	/*
+	 * If TX buffer for that queue is empty, and we have enough packets,
+	 * then send them straightway.
+	 */
+	if (num >= MAX_TX_BURST && len == 0) {
+		n = rte_eth_tx_burst(port, qconf->tx_queue_id[port], m, num);
+		if (unlikely(n < num)) {
+			do {
+				rte_pktmbuf_free(m[n]);
+			} while (++n < num);
+		}
+		return;
+	}
+
+	/*
+	 * Put packets into TX buffer for that queue.
+	 */
+
+	n = len + num;
+	n = (n > MAX_PKT_BURST) ? MAX_PKT_BURST - len : num;
+
+	j = 0;
+	switch (n % FWDSTEP) {
+	while (j < n) {
+	case 0:
+		qconf->tx_mbufs[port].m_table[len + j] = m[j];
+		j++;
+	case 3:
+		qconf->tx_mbufs[port].m_table[len + j] = m[j];
+		j++;
+	case 2:
+		qconf->tx_mbufs[port].m_table[len + j] = m[j];
+		j++;
+	case 1:
+		qconf->tx_mbufs[port].m_table[len + j] = m[j];
+		j++;
+	}
+	}
+
+	len += n;
+
+	/* enough pkts to be sent */
+	if (unlikely(len == MAX_PKT_BURST)) {
+
+		send_burst(qconf, MAX_PKT_BURST, port);
+
+		/* copy rest of the packets into the TX buffer. */
+		len = num - n;
+		j = 0;
+		switch (len % FWDSTEP) {
+		while (j < len) {
+		case 0:
+			qconf->tx_mbufs[port].m_table[j] = m[n + j];
+			j++;
+		case 3:
+			qconf->tx_mbufs[port].m_table[j] = m[n + j];
+			j++;
+		case 2:
+			qconf->tx_mbufs[port].m_table[j] = m[n + j];
+			j++;
+		case 1:
+			qconf->tx_mbufs[port].m_table[j] = m[n + j];
+			j++;
+		}
+		}
+	}
+
+	qconf->tx_mbufs[port].len = len;
+}
+#endif /* APP_LOOKUP_LPM */
+
+#ifdef DO_RFC_1812_CHECKS
+static inline int
+is_valid_ipv4_pkt(struct ipv4_hdr *pkt, uint32_t link_len)
+{
+	/* From http://www.rfc-editor.org/rfc/rfc1812.txt section 5.2.2 */
+	/*
+	 * 1. The packet length reported by the Link Layer must be large
+	 * enough to hold the minimum length legal IP datagram (20 bytes).
+	 */
+	if (link_len < sizeof(struct ipv4_hdr))
+		return -1;
+
+	/* 2. The IP checksum must be correct. */
+	/* this is checked in H/W */
+
+	/*
+	 * 3. The IP version number must be 4. If the version number is not 4
+	 * then the packet may be another version of IP, such as IPng or
+	 * ST-II.
+	 */
+	if (((pkt->version_ihl) >> 4) != 4)
+		return -3;
+	/*
+	 * 4. The IP header length field must be large enough to hold the
+	 * minimum length legal IP datagram (20 bytes = 5 words).
+	 */
+	if ((pkt->version_ihl & 0xf) < 5)
+		return -4;
+
+	/*
+	 * 5. The IP total length field must be large enough to hold the IP
+	 * datagram header, whose length is specified in the IP header length
+	 * field.
+	 */
+	if (rte_be_to_cpu_16(pkt->total_length) < sizeof(struct ipv4_hdr))
+		return -5;
+
+	return 0;
+}
+#endif
+
+#if (APP_LOOKUP_METHOD == APP_LOOKUP_EXACT_MATCH)
+
+static __m128i mask0;
+static __m128i mask1;
+static __m128i mask2;
+static inline uint8_t
+get_ipv4_dst_port(void *ipv4_hdr, uint8_t portid,
+		lookup_struct_t *ipv4_l3fwd_lookup_struct)
+{
+	int ret = 0;
+	union ipv4_5tuple_host key;
+
+	ipv4_hdr = (uint8_t *)ipv4_hdr + offsetof(struct ipv4_hdr, time_to_live);
+	__m128i data = _mm_loadu_si128((__m128i *)(ipv4_hdr));
+	/* Get 5 tuple: dst port, src port, dst IP address, src IP address and
+	   protocol */
+	key.xmm = _mm_and_si128(data, mask0);
+	/* Find destination port */
+	ret = rte_hash_lookup(ipv4_l3fwd_lookup_struct, (const void *)&key);
+	return (uint8_t)((ret < 0) ? portid : ipv4_l3fwd_out_if[ret]);
+}
+
+static inline uint8_t
+get_ipv6_dst_port(void *ipv6_hdr, uint8_t portid,
+		lookup_struct_t *ipv6_l3fwd_lookup_struct)
+{
+	int ret = 0;
+	union ipv6_5tuple_host key;
+
+	ipv6_hdr = (uint8_t *)ipv6_hdr + offsetof(struct ipv6_hdr, payload_len);
+	__m128i data0 = _mm_loadu_si128((__m128i *)(ipv6_hdr));
+	__m128i data1 = _mm_loadu_si128((__m128i *)(((uint8_t *)ipv6_hdr) +
+			sizeof(__m128i)));
+	__m128i data2 = _mm_loadu_si128((__m128i *)(((uint8_t *)ipv6_hdr) +
+			sizeof(__m128i) + sizeof(__m128i)));
+	/* Get part of 5 tuple: src IP address lower 96 bits and protocol */
+	key.xmm[0] = _mm_and_si128(data0, mask1);
+	/* Get part of 5 tuple: dst IP address lower 96 bits and src IP address
+	   higher 32 bits */
+	key.xmm[1] = data1;
+	/* Get part of 5 tuple: dst port and src port and dst IP address higher
+	   32 bits */
+	key.xmm[2] = _mm_and_si128(data2, mask2);
+
+	/* Find destination port */
+	ret = rte_hash_lookup(ipv6_l3fwd_lookup_struct, (const void *)&key);
+	return (uint8_t)((ret < 0) ? portid : ipv6_l3fwd_out_if[ret]);
+}
+#endif
+
+#if (APP_LOOKUP_METHOD == APP_LOOKUP_LPM)
+
+static inline uint8_t
+get_ipv4_dst_port(void *ipv4_hdr, uint8_t portid,
+		lookup_struct_t *ipv4_l3fwd_lookup_struct)
+{
+	uint8_t next_hop;
+
+	return (uint8_t)((rte_lpm_lookup(ipv4_l3fwd_lookup_struct,
+		rte_be_to_cpu_32(((struct ipv4_hdr *)ipv4_hdr)->dst_addr),
+		&next_hop) == 0) ? next_hop : portid);
+}
+
+static inline uint8_t
+get_ipv6_dst_port(void *ipv6_hdr,  uint8_t portid,
+		lookup6_struct_t *ipv6_l3fwd_lookup_struct)
+{
+	uint8_t next_hop;
+
+	return (uint8_t) ((rte_lpm6_lookup(ipv6_l3fwd_lookup_struct,
+			((struct ipv6_hdr *)ipv6_hdr)->dst_addr, &next_hop) == 0) ?
+			next_hop : portid);
+}
+#endif
+
+static inline void l3fwd_simple_forward(struct rte_mbuf *m, uint8_t portid)
+		__attribute__((unused));
+
+#if ((APP_LOOKUP_METHOD == APP_LOOKUP_EXACT_MATCH) && \
+	(ENABLE_MULTI_BUFFER_OPTIMIZE == 1))
+
+#define MASK_ALL_PKTS   0xff
+#define EXCLUDE_1ST_PKT 0xfe
+#define EXCLUDE_2ND_PKT 0xfd
+#define EXCLUDE_3RD_PKT 0xfb
+#define EXCLUDE_4TH_PKT 0xf7
+#define EXCLUDE_5TH_PKT 0xef
+#define EXCLUDE_6TH_PKT 0xdf
+#define EXCLUDE_7TH_PKT 0xbf
+#define EXCLUDE_8TH_PKT 0x7f
+
+static inline void
+simple_ipv4_fwd_8pkts(struct rte_mbuf *m[8], uint8_t portid)
+{
+	struct ether_hdr *eth_hdr[8];
+	struct ipv4_hdr *ipv4_hdr[8];
+	uint8_t dst_port[8];
+	int32_t ret[8];
+	union ipv4_5tuple_host key[8];
+	__m128i data[8];
+
+	eth_hdr[0] = rte_pktmbuf_mtod(m[0], struct ether_hdr *);
+	eth_hdr[1] = rte_pktmbuf_mtod(m[1], struct ether_hdr *);
+	eth_hdr[2] = rte_pktmbuf_mtod(m[2], struct ether_hdr *);
+	eth_hdr[3] = rte_pktmbuf_mtod(m[3], struct ether_hdr *);
+	eth_hdr[4] = rte_pktmbuf_mtod(m[4], struct ether_hdr *);
+	eth_hdr[5] = rte_pktmbuf_mtod(m[5], struct ether_hdr *);
+	eth_hdr[6] = rte_pktmbuf_mtod(m[6], struct ether_hdr *);
+	eth_hdr[7] = rte_pktmbuf_mtod(m[7], struct ether_hdr *);
+
+	/* Handle IPv4 headers.*/
+	ipv4_hdr[0] = rte_pktmbuf_mtod_offset(m[0], struct ipv4_hdr *,
+			sizeof(struct ether_hdr));
+	ipv4_hdr[1] = rte_pktmbuf_mtod_offset(m[1], struct ipv4_hdr *,
+			sizeof(struct ether_hdr));
+	ipv4_hdr[2] = rte_pktmbuf_mtod_offset(m[2], struct ipv4_hdr *,
+			sizeof(struct ether_hdr));
+	ipv4_hdr[3] = rte_pktmbuf_mtod_offset(m[3], struct ipv4_hdr *,
+			sizeof(struct ether_hdr));
+	ipv4_hdr[4] = rte_pktmbuf_mtod_offset(m[4], struct ipv4_hdr *,
+			sizeof(struct ether_hdr));
+	ipv4_hdr[5] = rte_pktmbuf_mtod_offset(m[5], struct ipv4_hdr *,
+			sizeof(struct ether_hdr));
+	ipv4_hdr[6] = rte_pktmbuf_mtod_offset(m[6], struct ipv4_hdr *,
+			sizeof(struct ether_hdr));
+	ipv4_hdr[7] = rte_pktmbuf_mtod_offset(m[7], struct ipv4_hdr *,
+			sizeof(struct ether_hdr));
+
+#ifdef DO_RFC_1812_CHECKS
+	/* Check to make sure the packet is valid (RFC1812) */
+	uint8_t valid_mask = MASK_ALL_PKTS;
+
+	if (is_valid_ipv4_pkt(ipv4_hdr[0], m[0]->pkt_len) < 0) {
+		rte_pktmbuf_free(m[0]);
+		valid_mask &= EXCLUDE_1ST_PKT;
+	}
+	if (is_valid_ipv4_pkt(ipv4_hdr[1], m[1]->pkt_len) < 0) {
+		rte_pktmbuf_free(m[1]);
+		valid_mask &= EXCLUDE_2ND_PKT;
+	}
+	if (is_valid_ipv4_pkt(ipv4_hdr[2], m[2]->pkt_len) < 0) {
+		rte_pktmbuf_free(m[2]);
+		valid_mask &= EXCLUDE_3RD_PKT;
+	}
+	if (is_valid_ipv4_pkt(ipv4_hdr[3], m[3]->pkt_len) < 0) {
+		rte_pktmbuf_free(m[3]);
+		valid_mask &= EXCLUDE_4TH_PKT;
+	}
+	if (is_valid_ipv4_pkt(ipv4_hdr[4], m[4]->pkt_len) < 0) {
+		rte_pktmbuf_free(m[4]);
+		valid_mask &= EXCLUDE_5TH_PKT;
+	}
+	if (is_valid_ipv4_pkt(ipv4_hdr[5], m[5]->pkt_len) < 0) {
+		rte_pktmbuf_free(m[5]);
+		valid_mask &= EXCLUDE_6TH_PKT;
+	}
+	if (is_valid_ipv4_pkt(ipv4_hdr[6], m[6]->pkt_len) < 0) {
+		rte_pktmbuf_free(m[6]);
+		valid_mask &= EXCLUDE_7TH_PKT;
+	}
+	if (is_valid_ipv4_pkt(ipv4_hdr[7], m[7]->pkt_len) < 0) {
+		rte_pktmbuf_free(m[7]);
+		valid_mask &= EXCLUDE_8TH_PKT;
+	}
+	if (unlikely(valid_mask != MASK_ALL_PKTS)) {
+		uint8_t i;
+
+		/* Forward the valid packets one by one; the rest are freed. */
+		for (i = 0; i < 8; i++)
+			if ((0x1 << i) & valid_mask)
+				l3fwd_simple_forward(m[i], portid);
+
+		return;
+	}
+#endif /* End of #ifdef DO_RFC_1812_CHECKS */
+
+	data[0] = _mm_loadu_si128(rte_pktmbuf_mtod_offset(m[0], __m128i *,
+			sizeof(struct ether_hdr) +
+			offsetof(struct ipv4_hdr, time_to_live)));
+	data[1] = _mm_loadu_si128(rte_pktmbuf_mtod_offset(m[1], __m128i *,
+			sizeof(struct ether_hdr) +
+			offsetof(struct ipv4_hdr, time_to_live)));
+	data[2] = _mm_loadu_si128(rte_pktmbuf_mtod_offset(m[2], __m128i *,
+			sizeof(struct ether_hdr) +
+			offsetof(struct ipv4_hdr, time_to_live)));
+	data[3] = _mm_loadu_si128(rte_pktmbuf_mtod_offset(m[3], __m128i *,
+			sizeof(struct ether_hdr) +
+			offsetof(struct ipv4_hdr, time_to_live)));
+	data[4] = _mm_loadu_si128(rte_pktmbuf_mtod_offset(m[4], __m128i *,
+			sizeof(struct ether_hdr) +
+			offsetof(struct ipv4_hdr, time_to_live)));
+	data[5] = _mm_loadu_si128(rte_pktmbuf_mtod_offset(m[5], __m128i *,
+			sizeof(struct ether_hdr) +
+			offsetof(struct ipv4_hdr, time_to_live)));
+	data[6] = _mm_loadu_si128(rte_pktmbuf_mtod_offset(m[6], __m128i *,
+			sizeof(struct ether_hdr) +
+			offsetof(struct ipv4_hdr, time_to_live)));
+	data[7] = _mm_loadu_si128(rte_pktmbuf_mtod_offset(m[7], __m128i *,
+			sizeof(struct ether_hdr) +
+			offsetof(struct ipv4_hdr, time_to_live)));
+
+	key[0].xmm = _mm_and_si128(data[0], mask0);
+	key[1].xmm = _mm_and_si128(data[1], mask0);
+	key[2].xmm = _mm_and_si128(data[2], mask0);
+	key[3].xmm = _mm_and_si128(data[3], mask0);
+	key[4].xmm = _mm_and_si128(data[4], mask0);
+	key[5].xmm = _mm_and_si128(data[5], mask0);
+	key[6].xmm = _mm_and_si128(data[6], mask0);
+	key[7].xmm = _mm_and_si128(data[7], mask0);
+
+	const void *key_array[8] = {&key[0], &key[1], &key[2], &key[3],
+			&key[4], &key[5], &key[6], &key[7]};
+
+	rte_hash_lookup_multi(RTE_PER_LCORE(lcore_conf)->ipv4_lookup_struct,
+			&key_array[0], 8, ret);
+	dst_port[0] = (uint8_t) ((ret[0] < 0) ? portid : ipv4_l3fwd_out_if[ret[0]]);
+	dst_port[1] = (uint8_t) ((ret[1] < 0) ? portid : ipv4_l3fwd_out_if[ret[1]]);
+	dst_port[2] = (uint8_t) ((ret[2] < 0) ? portid : ipv4_l3fwd_out_if[ret[2]]);
+	dst_port[3] = (uint8_t) ((ret[3] < 0) ? portid : ipv4_l3fwd_out_if[ret[3]]);
+	dst_port[4] = (uint8_t) ((ret[4] < 0) ? portid : ipv4_l3fwd_out_if[ret[4]]);
+	dst_port[5] = (uint8_t) ((ret[5] < 0) ? portid : ipv4_l3fwd_out_if[ret[5]]);
+	dst_port[6] = (uint8_t) ((ret[6] < 0) ? portid : ipv4_l3fwd_out_if[ret[6]]);
+	dst_port[7] = (uint8_t) ((ret[7] < 0) ? portid : ipv4_l3fwd_out_if[ret[7]]);
+
+	if (dst_port[0] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[0]) == 0)
+		dst_port[0] = portid;
+	if (dst_port[1] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[1]) == 0)
+		dst_port[1] = portid;
+	if (dst_port[2] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[2]) == 0)
+		dst_port[2] = portid;
+	if (dst_port[3] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[3]) == 0)
+		dst_port[3] = portid;
+	if (dst_port[4] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[4]) == 0)
+		dst_port[4] = portid;
+	if (dst_port[5] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[5]) == 0)
+		dst_port[5] = portid;
+	if (dst_port[6] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[6]) == 0)
+		dst_port[6] = portid;
+	if (dst_port[7] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[7]) == 0)
+		dst_port[7] = portid;
+
+#ifdef DO_RFC_1812_CHECKS
+	/* Update time to live and header checksum */
+	--(ipv4_hdr[0]->time_to_live);
+	--(ipv4_hdr[1]->time_to_live);
+	--(ipv4_hdr[2]->time_to_live);
+	--(ipv4_hdr[3]->time_to_live);
+	++(ipv4_hdr[0]->hdr_checksum);
+	++(ipv4_hdr[1]->hdr_checksum);
+	++(ipv4_hdr[2]->hdr_checksum);
+	++(ipv4_hdr[3]->hdr_checksum);
+	--(ipv4_hdr[4]->time_to_live);
+	--(ipv4_hdr[5]->time_to_live);
+	--(ipv4_hdr[6]->time_to_live);
+	--(ipv4_hdr[7]->time_to_live);
+	++(ipv4_hdr[4]->hdr_checksum);
+	++(ipv4_hdr[5]->hdr_checksum);
+	++(ipv4_hdr[6]->hdr_checksum);
+	++(ipv4_hdr[7]->hdr_checksum);
+#endif
+
+	/* dst addr */
+	*(uint64_t *)&eth_hdr[0]->d_addr = dest_eth_addr[dst_port[0]];
+	*(uint64_t *)&eth_hdr[1]->d_addr = dest_eth_addr[dst_port[1]];
+	*(uint64_t *)&eth_hdr[2]->d_addr = dest_eth_addr[dst_port[2]];
+	*(uint64_t *)&eth_hdr[3]->d_addr = dest_eth_addr[dst_port[3]];
+	*(uint64_t *)&eth_hdr[4]->d_addr = dest_eth_addr[dst_port[4]];
+	*(uint64_t *)&eth_hdr[5]->d_addr = dest_eth_addr[dst_port[5]];
+	*(uint64_t *)&eth_hdr[6]->d_addr = dest_eth_addr[dst_port[6]];
+	*(uint64_t *)&eth_hdr[7]->d_addr = dest_eth_addr[dst_port[7]];
+
+	/* src addr */
+	ether_addr_copy(&ports_eth_addr[dst_port[0]], &eth_hdr[0]->s_addr);
+	ether_addr_copy(&ports_eth_addr[dst_port[1]], &eth_hdr[1]->s_addr);
+	ether_addr_copy(&ports_eth_addr[dst_port[2]], &eth_hdr[2]->s_addr);
+	ether_addr_copy(&ports_eth_addr[dst_port[3]], &eth_hdr[3]->s_addr);
+	ether_addr_copy(&ports_eth_addr[dst_port[4]], &eth_hdr[4]->s_addr);
+	ether_addr_copy(&ports_eth_addr[dst_port[5]], &eth_hdr[5]->s_addr);
+	ether_addr_copy(&ports_eth_addr[dst_port[6]], &eth_hdr[6]->s_addr);
+	ether_addr_copy(&ports_eth_addr[dst_port[7]], &eth_hdr[7]->s_addr);
+
+	send_single_packet(m[0], (uint8_t)dst_port[0]);
+	send_single_packet(m[1], (uint8_t)dst_port[1]);
+	send_single_packet(m[2], (uint8_t)dst_port[2]);
+	send_single_packet(m[3], (uint8_t)dst_port[3]);
+	send_single_packet(m[4], (uint8_t)dst_port[4]);
+	send_single_packet(m[5], (uint8_t)dst_port[5]);
+	send_single_packet(m[6], (uint8_t)dst_port[6]);
+	send_single_packet(m[7], (uint8_t)dst_port[7]);
+
+}
+
+static inline void get_ipv6_5tuple(struct rte_mbuf *m0, __m128i mask0,
+		__m128i mask1, union ipv6_5tuple_host *key)
+{
+	__m128i tmpdata0 = _mm_loadu_si128(rte_pktmbuf_mtod_offset(m0,
+			__m128i *, sizeof(struct ether_hdr) +
+			offsetof(struct ipv6_hdr, payload_len)));
+	__m128i tmpdata1 = _mm_loadu_si128(rte_pktmbuf_mtod_offset(m0,
+			__m128i *, sizeof(struct ether_hdr) +
+			offsetof(struct ipv6_hdr, payload_len) + sizeof(__m128i)));
+	__m128i tmpdata2 = _mm_loadu_si128(rte_pktmbuf_mtod_offset(m0,
+			__m128i *, sizeof(struct ether_hdr) +
+			offsetof(struct ipv6_hdr, payload_len) + sizeof(__m128i) +
+			sizeof(__m128i)));
+	key->xmm[0] = _mm_and_si128(tmpdata0, mask0);
+	key->xmm[1] = tmpdata1;
+	key->xmm[2] = _mm_and_si128(tmpdata2, mask1);
+}
+
+static inline void
+simple_ipv6_fwd_8pkts(struct rte_mbuf *m[8], uint8_t portid)
+{
+	int32_t ret[8];
+	uint8_t dst_port[8];
+	struct ether_hdr *eth_hdr[8];
+	union ipv6_5tuple_host key[8];
+
+	__attribute__((unused)) struct ipv6_hdr *ipv6_hdr[8];
+
+	eth_hdr[0] = rte_pktmbuf_mtod(m[0], struct ether_hdr *);
+	eth_hdr[1] = rte_pktmbuf_mtod(m[1], struct ether_hdr *);
+	eth_hdr[2] = rte_pktmbuf_mtod(m[2], struct ether_hdr *);
+	eth_hdr[3] = rte_pktmbuf_mtod(m[3], struct ether_hdr *);
+	eth_hdr[4] = rte_pktmbuf_mtod(m[4], struct ether_hdr *);
+	eth_hdr[5] = rte_pktmbuf_mtod(m[5], struct ether_hdr *);
+	eth_hdr[6] = rte_pktmbuf_mtod(m[6], struct ether_hdr *);
+	eth_hdr[7] = rte_pktmbuf_mtod(m[7], struct ether_hdr *);
+
+	/* Handle IPv6 headers.*/
+	ipv6_hdr[0] = rte_pktmbuf_mtod_offset(m[0], struct ipv6_hdr *,
+			sizeof(struct ether_hdr));
+	ipv6_hdr[1] = rte_pktmbuf_mtod_offset(m[1], struct ipv6_hdr *,
+			sizeof(struct ether_hdr));
+	ipv6_hdr[2] = rte_pktmbuf_mtod_offset(m[2], struct ipv6_hdr *,
+			sizeof(struct ether_hdr));
+	ipv6_hdr[3] = rte_pktmbuf_mtod_offset(m[3], struct ipv6_hdr *,
+			sizeof(struct ether_hdr));
+	ipv6_hdr[4] = rte_pktmbuf_mtod_offset(m[4], struct ipv6_hdr *,
+			sizeof(struct ether_hdr));
+	ipv6_hdr[5] = rte_pktmbuf_mtod_offset(m[5], struct ipv6_hdr *,
+			sizeof(struct ether_hdr));
+	ipv6_hdr[6] = rte_pktmbuf_mtod_offset(m[6], struct ipv6_hdr *,
+			sizeof(struct ether_hdr));
+	ipv6_hdr[7] = rte_pktmbuf_mtod_offset(m[7], struct ipv6_hdr *,
+			sizeof(struct ether_hdr));
+
+	get_ipv6_5tuple(m[0], mask1, mask2, &key[0]);
+	get_ipv6_5tuple(m[1], mask1, mask2, &key[1]);
+	get_ipv6_5tuple(m[2], mask1, mask2, &key[2]);
+	get_ipv6_5tuple(m[3], mask1, mask2, &key[3]);
+	get_ipv6_5tuple(m[4], mask1, mask2, &key[4]);
+	get_ipv6_5tuple(m[5], mask1, mask2, &key[5]);
+	get_ipv6_5tuple(m[6], mask1, mask2, &key[6]);
+	get_ipv6_5tuple(m[7], mask1, mask2, &key[7]);
+
+	const void *key_array[8] = {&key[0], &key[1], &key[2], &key[3],
+			&key[4], &key[5], &key[6], &key[7]};
+
+	rte_hash_lookup_multi(RTE_PER_LCORE(lcore_conf)->ipv6_lookup_struct,
+			&key_array[0], 8, ret);
+	dst_port[0] = (uint8_t) ((ret[0] < 0) ? portid : ipv6_l3fwd_out_if[ret[0]]);
+	dst_port[1] = (uint8_t) ((ret[1] < 0) ? portid : ipv6_l3fwd_out_if[ret[1]]);
+	dst_port[2] = (uint8_t) ((ret[2] < 0) ? portid : ipv6_l3fwd_out_if[ret[2]]);
+	dst_port[3] = (uint8_t) ((ret[3] < 0) ? portid : ipv6_l3fwd_out_if[ret[3]]);
+	dst_port[4] = (uint8_t) ((ret[4] < 0) ? portid : ipv6_l3fwd_out_if[ret[4]]);
+	dst_port[5] = (uint8_t) ((ret[5] < 0) ? portid : ipv6_l3fwd_out_if[ret[5]]);
+	dst_port[6] = (uint8_t) ((ret[6] < 0) ? portid : ipv6_l3fwd_out_if[ret[6]]);
+	dst_port[7] = (uint8_t) ((ret[7] < 0) ? portid : ipv6_l3fwd_out_if[ret[7]]);
+
+	if (dst_port[0] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[0]) == 0)
+		dst_port[0] = portid;
+	if (dst_port[1] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[1]) == 0)
+		dst_port[1] = portid;
+	if (dst_port[2] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[2]) == 0)
+		dst_port[2] = portid;
+	if (dst_port[3] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[3]) == 0)
+		dst_port[3] = portid;
+	if (dst_port[4] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[4]) == 0)
+		dst_port[4] = portid;
+	if (dst_port[5] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[5]) == 0)
+		dst_port[5] = portid;
+	if (dst_port[6] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[6]) == 0)
+		dst_port[6] = portid;
+	if (dst_port[7] >= RTE_MAX_ETHPORTS ||
+			(enabled_port_mask & 1 << dst_port[7]) == 0)
+		dst_port[7] = portid;
+
+	/* dst addr */
+	*(uint64_t *)&eth_hdr[0]->d_addr = dest_eth_addr[dst_port[0]];
+	*(uint64_t *)&eth_hdr[1]->d_addr = dest_eth_addr[dst_port[1]];
+	*(uint64_t *)&eth_hdr[2]->d_addr = dest_eth_addr[dst_port[2]];
+	*(uint64_t *)&eth_hdr[3]->d_addr = dest_eth_addr[dst_port[3]];
+	*(uint64_t *)&eth_hdr[4]->d_addr = dest_eth_addr[dst_port[4]];
+	*(uint64_t *)&eth_hdr[5]->d_addr = dest_eth_addr[dst_port[5]];
+	*(uint64_t *)&eth_hdr[6]->d_addr = dest_eth_addr[dst_port[6]];
+	*(uint64_t *)&eth_hdr[7]->d_addr = dest_eth_addr[dst_port[7]];
+
+	/* src addr */
+	ether_addr_copy(&ports_eth_addr[dst_port[0]], &eth_hdr[0]->s_addr);
+	ether_addr_copy(&ports_eth_addr[dst_port[1]], &eth_hdr[1]->s_addr);
+	ether_addr_copy(&ports_eth_addr[dst_port[2]], &eth_hdr[2]->s_addr);
+	ether_addr_copy(&ports_eth_addr[dst_port[3]], &eth_hdr[3]->s_addr);
+	ether_addr_copy(&ports_eth_addr[dst_port[4]], &eth_hdr[4]->s_addr);
+	ether_addr_copy(&ports_eth_addr[dst_port[5]], &eth_hdr[5]->s_addr);
+	ether_addr_copy(&ports_eth_addr[dst_port[6]], &eth_hdr[6]->s_addr);
+	ether_addr_copy(&ports_eth_addr[dst_port[7]], &eth_hdr[7]->s_addr);
+
+	send_single_packet(m[0], (uint8_t)dst_port[0]);
+	send_single_packet(m[1], (uint8_t)dst_port[1]);
+	send_single_packet(m[2], (uint8_t)dst_port[2]);
+	send_single_packet(m[3], (uint8_t)dst_port[3]);
+	send_single_packet(m[4], (uint8_t)dst_port[4]);
+	send_single_packet(m[5], (uint8_t)dst_port[5]);
+	send_single_packet(m[6], (uint8_t)dst_port[6]);
+	send_single_packet(m[7], (uint8_t)dst_port[7]);
+
+}
+#endif /* APP_LOOKUP_METHOD */
+
+static inline __attribute__((always_inline)) void
+l3fwd_simple_forward(struct rte_mbuf *m, uint8_t portid)
+{
+	struct ether_hdr *eth_hdr;
+	struct ipv4_hdr *ipv4_hdr;
+	uint8_t dst_port;
+
+	eth_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
+
+	if (RTE_ETH_IS_IPV4_HDR(m->packet_type)) {
+		/* Handle IPv4 headers.*/
+		ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
+				sizeof(struct ether_hdr));
+
+#ifdef DO_RFC_1812_CHECKS
+		/* Check to make sure the packet is valid (RFC1812) */
+		if (is_valid_ipv4_pkt(ipv4_hdr, m->pkt_len) < 0) {
+			rte_pktmbuf_free(m);
+			return;
+		}
+#endif
+
+		dst_port = get_ipv4_dst_port(ipv4_hdr, portid,
+			RTE_PER_LCORE(lcore_conf)->ipv4_lookup_struct);
+		if (dst_port >= RTE_MAX_ETHPORTS ||
+				(enabled_port_mask & 1 << dst_port) == 0)
+			dst_port = portid;
+
+#ifdef DO_RFC_1812_CHECKS
+		/* Update time to live and header checksum */
+		--(ipv4_hdr->time_to_live);
+		++(ipv4_hdr->hdr_checksum);
+#endif
+		/* dst addr */
+		*(uint64_t *)&eth_hdr->d_addr = dest_eth_addr[dst_port];
+
+		/* src addr */
+		ether_addr_copy(&ports_eth_addr[dst_port], &eth_hdr->s_addr);
+
+		send_single_packet(m, dst_port);
+	} else if (RTE_ETH_IS_IPV6_HDR(m->packet_type)) {
+		/* Handle IPv6 headers.*/
+		struct ipv6_hdr *ipv6_hdr;
+
+		ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
+				sizeof(struct ether_hdr));
+
+		dst_port = get_ipv6_dst_port(ipv6_hdr, portid,
+				RTE_PER_LCORE(lcore_conf)->ipv6_lookup_struct);
+
+		if (dst_port >= RTE_MAX_ETHPORTS ||
+				(enabled_port_mask & 1 << dst_port) == 0)
+			dst_port = portid;
+
+		/* dst addr */
+		*(uint64_t *)&eth_hdr->d_addr = dest_eth_addr[dst_port];
+
+		/* src addr */
+		ether_addr_copy(&ports_eth_addr[dst_port], &eth_hdr->s_addr);
+
+		send_single_packet(m, dst_port);
+	} else
+		/* Free the mbuf that contains non-IPV4/IPV6 packet */
+		rte_pktmbuf_free(m);
+}
+
+#if ((APP_LOOKUP_METHOD == APP_LOOKUP_LPM) && \
+	(ENABLE_MULTI_BUFFER_OPTIMIZE == 1))
+#ifdef DO_RFC_1812_CHECKS
+
+#define	IPV4_MIN_VER_IHL	0x45
+#define	IPV4_MAX_VER_IHL	0x4f
+#define	IPV4_MAX_VER_IHL_DIFF	(IPV4_MAX_VER_IHL - IPV4_MIN_VER_IHL)
+
+/* Minimum value of IPV4 total length (20B) in network byte order. */
+#define	IPV4_MIN_LEN_BE	(sizeof(struct ipv4_hdr) << 8)
+
+/*
+ * From http://www.rfc-editor.org/rfc/rfc1812.txt section 5.2.2:
+ * - The IP version number must be 4.
+ * - The IP header length field must be large enough to hold the
+ *    minimum length legal IP datagram (20 bytes = 5 words).
+ * - The IP total length field must be large enough to hold the IP
+ *   datagram header, whose length is specified in the IP header length
+ *   field.
+ * If we encounter an invalid IPV4 packet, then set its destination port
+ * to the BAD_PORT value.
+ */
+static inline __attribute__((always_inline)) void
+rfc1812_process(struct ipv4_hdr *ipv4_hdr, uint16_t *dp, uint32_t ptype)
+{
+	uint8_t ihl;
+
+	if (RTE_ETH_IS_IPV4_HDR(ptype)) {
+		ihl = ipv4_hdr->version_ihl - IPV4_MIN_VER_IHL;
+
+		ipv4_hdr->time_to_live--;
+		ipv4_hdr->hdr_checksum++;
+
+		if (ihl > IPV4_MAX_VER_IHL_DIFF ||
+				((uint8_t)ipv4_hdr->total_length == 0 &&
+				ipv4_hdr->total_length < IPV4_MIN_LEN_BE)) {
+			dp[0] = BAD_PORT;
+		}
+	}
+}
+
+#else
+#define	rfc1812_process(mb, dp, ptype)	do { } while (0)
+#endif /* DO_RFC_1812_CHECKS */
+#endif /* APP_LOOKUP_LPM && ENABLE_MULTI_BUFFER_OPTIMIZE */
+
+
+#if ((APP_LOOKUP_METHOD == APP_LOOKUP_LPM) && \
+	(ENABLE_MULTI_BUFFER_OPTIMIZE == 1))
+
+static inline __attribute__((always_inline)) uint16_t
+get_dst_port(struct rte_mbuf *pkt, uint32_t dst_ipv4, uint8_t portid)
+{
+	uint8_t next_hop;
+	struct ipv6_hdr *ipv6_hdr;
+	struct ether_hdr *eth_hdr;
+
+	if (RTE_ETH_IS_IPV4_HDR(pkt->packet_type)) {
+		if (rte_lpm_lookup(RTE_PER_LCORE(lcore_conf)->ipv4_lookup_struct,
+				dst_ipv4, &next_hop) != 0)
+			next_hop = portid;
+	} else if (RTE_ETH_IS_IPV6_HDR(pkt->packet_type)) {
+		eth_hdr = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
+		ipv6_hdr = (struct ipv6_hdr *)(eth_hdr + 1);
+		if (rte_lpm6_lookup(RTE_PER_LCORE(lcore_conf)->ipv6_lookup_struct,
+				ipv6_hdr->dst_addr, &next_hop) != 0)
+			next_hop = portid;
+	} else {
+		next_hop = portid;
+	}
+
+	return next_hop;
+}
+
+static inline void
+process_packet(struct rte_mbuf *pkt, uint16_t *dst_port, uint8_t portid)
+{
+	struct ether_hdr *eth_hdr;
+	struct ipv4_hdr *ipv4_hdr;
+	uint32_t dst_ipv4;
+	uint16_t dp;
+	__m128i te, ve;
+
+	eth_hdr = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
+	ipv4_hdr = (struct ipv4_hdr *)(eth_hdr + 1);
+
+	dst_ipv4 = ipv4_hdr->dst_addr;
+	dst_ipv4 = rte_be_to_cpu_32(dst_ipv4);
+	dp = get_dst_port(pkt, dst_ipv4, portid);
+
+	te = _mm_load_si128((__m128i *)eth_hdr);
+	ve = val_eth[dp];
+
+	dst_port[0] = dp;
+	rfc1812_process(ipv4_hdr, dst_port, pkt->packet_type);
+
+	te =  _mm_blend_epi16(te, ve, MASK_ETH);
+	_mm_store_si128((__m128i *)eth_hdr, te);
+}
+
+/*
+ * Read packet_type and destination IPV4 addresses from 4 mbufs.
+ */
+static inline void
+processx4_step1(struct rte_mbuf *pkt[FWDSTEP],
+		__m128i *dip,
+		uint32_t *ipv4_flag)
+{
+	struct ipv4_hdr *ipv4_hdr;
+	struct ether_hdr *eth_hdr;
+	uint32_t x0, x1, x2, x3;
+
+	eth_hdr = rte_pktmbuf_mtod(pkt[0], struct ether_hdr *);
+	ipv4_hdr = (struct ipv4_hdr *)(eth_hdr + 1);
+	x0 = ipv4_hdr->dst_addr;
+	ipv4_flag[0] = pkt[0]->packet_type & RTE_PTYPE_L3_IPV4;
+
+	eth_hdr = rte_pktmbuf_mtod(pkt[1], struct ether_hdr *);
+	ipv4_hdr = (struct ipv4_hdr *)(eth_hdr + 1);
+	x1 = ipv4_hdr->dst_addr;
+	ipv4_flag[0] &= pkt[1]->packet_type;
+
+	eth_hdr = rte_pktmbuf_mtod(pkt[2], struct ether_hdr *);
+	ipv4_hdr = (struct ipv4_hdr *)(eth_hdr + 1);
+	x2 = ipv4_hdr->dst_addr;
+	ipv4_flag[0] &= pkt[2]->packet_type;
+
+	eth_hdr = rte_pktmbuf_mtod(pkt[3], struct ether_hdr *);
+	ipv4_hdr = (struct ipv4_hdr *)(eth_hdr + 1);
+	x3 = ipv4_hdr->dst_addr;
+	ipv4_flag[0] &= pkt[3]->packet_type;
+
+	dip[0] = _mm_set_epi32(x3, x2, x1, x0);
+}
+
+/*
+ * Lookup into LPM for destination port.
+ * If lookup fails, use incoming port (portid) as destination port.
+ */
+static inline void
+processx4_step2(__m128i dip,
+		uint32_t ipv4_flag,
+		uint8_t portid,
+		struct rte_mbuf *pkt[FWDSTEP],
+		uint16_t dprt[FWDSTEP])
+{
+	rte_xmm_t dst;
+	const __m128i bswap_mask = _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11,
+			4, 5, 6, 7, 0, 1, 2, 3);
+
+	/* Byte swap 4 IPV4 addresses. */
+	dip = _mm_shuffle_epi8(dip, bswap_mask);
+
+	/* if all 4 packets are IPV4. */
+	if (likely(ipv4_flag)) {
+		rte_lpm_lookupx4(RTE_PER_LCORE(lcore_conf)->ipv4_lookup_struct, dip,
+				dprt, portid);
+	} else {
+		dst.x = dip;
+		dprt[0] = get_dst_port(pkt[0], dst.u32[0], portid);
+		dprt[1] = get_dst_port(pkt[1], dst.u32[1], portid);
+		dprt[2] = get_dst_port(pkt[2], dst.u32[2], portid);
+		dprt[3] = get_dst_port(pkt[3], dst.u32[3], portid);
+	}
+}
+
+/*
+ * Update source and destination MAC addresses in the ethernet header.
+ * Perform RFC1812 checks and updates for IPV4 packets.
+ */
+static inline void
+processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
+{
+	__m128i te[FWDSTEP];
+	__m128i ve[FWDSTEP];
+	__m128i *p[FWDSTEP];
+
+	p[0] = rte_pktmbuf_mtod(pkt[0], __m128i *);
+	p[1] = rte_pktmbuf_mtod(pkt[1], __m128i *);
+	p[2] = rte_pktmbuf_mtod(pkt[2], __m128i *);
+	p[3] = rte_pktmbuf_mtod(pkt[3], __m128i *);
+
+	ve[0] = val_eth[dst_port[0]];
+	te[0] = _mm_load_si128(p[0]);
+
+	ve[1] = val_eth[dst_port[1]];
+	te[1] = _mm_load_si128(p[1]);
+
+	ve[2] = val_eth[dst_port[2]];
+	te[2] = _mm_load_si128(p[2]);
+
+	ve[3] = val_eth[dst_port[3]];
+	te[3] = _mm_load_si128(p[3]);
+
+	/* Update the first 12 bytes, keep the remaining bytes intact. */
+	te[0] =  _mm_blend_epi16(te[0], ve[0], MASK_ETH);
+	te[1] =  _mm_blend_epi16(te[1], ve[1], MASK_ETH);
+	te[2] =  _mm_blend_epi16(te[2], ve[2], MASK_ETH);
+	te[3] =  _mm_blend_epi16(te[3], ve[3], MASK_ETH);
+
+	_mm_store_si128(p[0], te[0]);
+	_mm_store_si128(p[1], te[1]);
+	_mm_store_si128(p[2], te[2]);
+	_mm_store_si128(p[3], te[3]);
+
+	rfc1812_process((struct ipv4_hdr *)((struct ether_hdr *)p[0] + 1),
+			&dst_port[0], pkt[0]->packet_type);
+	rfc1812_process((struct ipv4_hdr *)((struct ether_hdr *)p[1] + 1),
+			&dst_port[1], pkt[1]->packet_type);
+	rfc1812_process((struct ipv4_hdr *)((struct ether_hdr *)p[2] + 1),
+			&dst_port[2], pkt[2]->packet_type);
+	rfc1812_process((struct ipv4_hdr *)((struct ether_hdr *)p[3] + 1),
+			&dst_port[3], pkt[3]->packet_type);
+}
+
+/*
+ * We group consecutive packets with the same destination port into one burst.
+ * To avoid extra latency this is done together with some other packet
+ * processing, but after we made a final decision about packet's destination.
+ * To do this we maintain:
+ * pnum - array of number of consecutive packets with the same dest port for
+ * each packet in the input burst.
+ * lp - pointer to the last updated element in the pnum.
+ * dlp - dest port value lp corresponds to.
+ */
+
+#define	GRPSZ	(1 << FWDSTEP)
+#define	GRPMSK	(GRPSZ - 1)
+
+#define GROUP_PORT_STEP(dlp, dcp, lp, pn, idx)	do { \
+	if (likely((dlp) == (dcp)[(idx)])) {         \
+		(lp)[0]++;                           \
+	} else {                                     \
+		(dlp) = (dcp)[idx];                  \
+		(lp) = (pn) + (idx);                 \
+		(lp)[0] = 1;                         \
+	}                                            \
+} while (0)
+
+/*
+ * Group consecutive packets with the same destination port in bursts of 4.
+ * Suppose we have an array of destination ports:
+ * dst_port[] = {a, b, c, d, e, ... }
+ * dp1 should contain: <a, b, c, d>, dp2: <b, c, d, e>.
+ * We do 4 comparisons at once and the result is a 4-bit mask.
+ * This mask is used as an index into a prebuilt array of pnum values.
+ */
+static inline uint16_t *
+port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
+{
+	static const struct {
+		uint64_t pnum; /* prebuilt 4 values for pnum[]. */
+		int32_t  idx;  /* index for new last updated element. */
+		uint16_t lpv;  /* add value to the last updated element. */
+	} gptbl[GRPSZ] = {
+	{
+		/* 0: a != b, b != c, c != d, d != e */
+		.pnum = UINT64_C(0x0001000100010001),
+		.idx = 4,
+		.lpv = 0,
+	},
+	{
+		/* 1: a == b, b != c, c != d, d != e */
+		.pnum = UINT64_C(0x0001000100010002),
+		.idx = 4,
+		.lpv = 1,
+	},
+	{
+		/* 2: a != b, b == c, c != d, d != e */
+		.pnum = UINT64_C(0x0001000100020001),
+		.idx = 4,
+		.lpv = 0,
+	},
+	{
+		/* 3: a == b, b == c, c != d, d != e */
+		.pnum = UINT64_C(0x0001000100020003),
+		.idx = 4,
+		.lpv = 2,
+	},
+	{
+		/* 4: a != b, b != c, c == d, d != e */
+		.pnum = UINT64_C(0x0001000200010001),
+		.idx = 4,
+		.lpv = 0,
+	},
+	{
+		/* 5: a == b, b != c, c == d, d != e */
+		.pnum = UINT64_C(0x0001000200010002),
+		.idx = 4,
+		.lpv = 1,
+	},
+	{
+		/* 6: a != b, b == c, c == d, d != e */
+		.pnum = UINT64_C(0x0001000200030001),
+		.idx = 4,
+		.lpv = 0,
+	},
+	{
+		/* 7: a == b, b == c, c == d, d != e */
+		.pnum = UINT64_C(0x0001000200030004),
+		.idx = 4,
+		.lpv = 3,
+	},
+	{
+		/* 8: a != b, b != c, c != d, d == e */
+		.pnum = UINT64_C(0x0002000100010001),
+		.idx = 3,
+		.lpv = 0,
+	},
+	{
+		/* 9: a == b, b != c, c != d, d == e */
+		.pnum = UINT64_C(0x0002000100010002),
+		.idx = 3,
+		.lpv = 1,
+	},
+	{
+		/* 0xa: a != b, b == c, c != d, d == e */
+		.pnum = UINT64_C(0x0002000100020001),
+		.idx = 3,
+		.lpv = 0,
+	},
+	{
+		/* 0xb: a == b, b == c, c != d, d == e */
+		.pnum = UINT64_C(0x0002000100020003),
+		.idx = 3,
+		.lpv = 2,
+	},
+	{
+		/* 0xc: a != b, b != c, c == d, d == e */
+		.pnum = UINT64_C(0x0002000300010001),
+		.idx = 2,
+		.lpv = 0,
+	},
+	{
+		/* 0xd: a == b, b != c, c == d, d == e */
+		.pnum = UINT64_C(0x0002000300010002),
+		.idx = 2,
+		.lpv = 1,
+	},
+	{
+		/* 0xe: a != b, b == c, c == d, d == e */
+		.pnum = UINT64_C(0x0002000300040001),
+		.idx = 1,
+		.lpv = 0,
+	},
+	{
+		/* 0xf: a == b, b == c, c == d, d == e */
+		.pnum = UINT64_C(0x0002000300040005),
+		.idx = 0,
+		.lpv = 4,
+	},
+	};
+
+	union {
+		uint16_t u16[FWDSTEP + 1];
+		uint64_t u64;
+	} *pnum = (void *)pn;
+
+	int32_t v;
+
+	dp1 = _mm_cmpeq_epi16(dp1, dp2);
+	dp1 = _mm_unpacklo_epi16(dp1, dp1);
+	v = _mm_movemask_ps((__m128)dp1);
+
+	/* update last port counter. */
+	lp[0] += gptbl[v].lpv;
+
+	/* if dest port value has changed. */
+	if (v != GRPMSK) {
+		lp = pnum->u16 + gptbl[v].idx;
+		lp[0] = 1;
+		pnum->u64 = gptbl[v].pnum;
+	}
+
+	return lp;
+}
+
+#endif /* APP_LOOKUP_METHOD */
+
+static void
+process_burst(struct rte_mbuf *pkts_burst[MAX_PKT_BURST], int nb_rx,
+		uint8_t portid) {
+
+	int j;
+
+#if ((APP_LOOKUP_METHOD == APP_LOOKUP_LPM) && \
+	(ENABLE_MULTI_BUFFER_OPTIMIZE == 1))
+	int32_t k;
+	uint16_t dlp;
+	uint16_t *lp;
+	uint16_t dst_port[MAX_PKT_BURST];
+	__m128i dip[MAX_PKT_BURST / FWDSTEP];
+	uint32_t ipv4_flag[MAX_PKT_BURST / FWDSTEP];
+	uint16_t pnum[MAX_PKT_BURST + 1];
+#endif
+
+
+#if (ENABLE_MULTI_BUFFER_OPTIMIZE == 1)
+#if (APP_LOOKUP_METHOD == APP_LOOKUP_EXACT_MATCH)
+	{
+		/*
+		 * Send nb_rx - nb_rx%8 packets
+		 * in groups of 8.
+		 */
+		int32_t n = RTE_ALIGN_FLOOR(nb_rx, 8);
+
+		for (j = 0; j < n; j += 8) {
+			uint32_t pkt_type =
+				pkts_burst[j]->packet_type &
+				pkts_burst[j+1]->packet_type &
+				pkts_burst[j+2]->packet_type &
+				pkts_burst[j+3]->packet_type &
+				pkts_burst[j+4]->packet_type &
+				pkts_burst[j+5]->packet_type &
+				pkts_burst[j+6]->packet_type &
+				pkts_burst[j+7]->packet_type;
+			if (pkt_type & RTE_PTYPE_L3_IPV4) {
+				simple_ipv4_fwd_8pkts(&pkts_burst[j], portid);
+			} else if (pkt_type &
+				RTE_PTYPE_L3_IPV6) {
+				simple_ipv6_fwd_8pkts(&pkts_burst[j], portid);
+			} else {
+				l3fwd_simple_forward(pkts_burst[j], portid);
+				l3fwd_simple_forward(pkts_burst[j+1], portid);
+				l3fwd_simple_forward(pkts_burst[j+2], portid);
+				l3fwd_simple_forward(pkts_burst[j+3], portid);
+				l3fwd_simple_forward(pkts_burst[j+4], portid);
+				l3fwd_simple_forward(pkts_burst[j+5], portid);
+				l3fwd_simple_forward(pkts_burst[j+6], portid);
+				l3fwd_simple_forward(pkts_burst[j+7], portid);
+			}
+		}
+		for (; j < nb_rx ; j++)
+			l3fwd_simple_forward(pkts_burst[j], portid);
+	}
+#elif (APP_LOOKUP_METHOD == APP_LOOKUP_LPM)
+
+	k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
+	for (j = 0; j != k; j += FWDSTEP)
+		processx4_step1(&pkts_burst[j], &dip[j / FWDSTEP],
+				&ipv4_flag[j / FWDSTEP]);
+
+	k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
+	for (j = 0; j != k; j += FWDSTEP)
+		processx4_step2(dip[j / FWDSTEP], ipv4_flag[j / FWDSTEP],
+				portid, &pkts_burst[j], &dst_port[j]);
+
+	/*
+	 * Finish packet processing and group consecutive
+	 * packets with the same destination port.
+	 */
+	k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
+	if (k != 0) {
+		__m128i dp1, dp2;
+
+		lp = pnum;
+		lp[0] = 1;
+
+		processx4_step3(pkts_burst, dst_port);
+
+		/* dp1: <d[0], d[1], d[2], d[3], ... > */
+		dp1 = _mm_loadu_si128((__m128i *)dst_port);
+
+		for (j = FWDSTEP; j != k; j += FWDSTEP) {
+			processx4_step3(&pkts_burst[j], &dst_port[j]);
+
+			/*
+			 * dp2:
+			 * <d[j-3], d[j-2], d[j-1], d[j], ... >
+			 */
+			dp2 = _mm_loadu_si128(
+					(__m128i *)&dst_port[j - FWDSTEP + 1]);
+			lp  = port_groupx4(&pnum[j - FWDSTEP], lp, dp1, dp2);
+
+			/*
+			 * dp1:
+			 * <d[j], d[j+1], d[j+2], d[j+3], ... >
+			 */
+			dp1 = _mm_srli_si128(dp2, (FWDSTEP - 1) *
+					sizeof(dst_port[0]));
+		}
+
+		/*
+		 * dp2: <d[j-3], d[j-2], d[j-1], d[j-1], ... >
+		 */
+		dp2 = _mm_shufflelo_epi16(dp1, 0xf9);
+		lp  = port_groupx4(&pnum[j - FWDSTEP], lp, dp1, dp2);
+
+		/*
+		 * remove values added by the last repeated
+		 * dst port.
+		 */
+		lp[0]--;
+		dlp = dst_port[j - 1];
+	} else {
+		/* set dlp and lp to never-used values. */
+		dlp = BAD_PORT - 1;
+		lp = pnum + MAX_PKT_BURST;
+	}
+
+	/* Process up to last 3 packets one by one. */
+	switch (nb_rx % FWDSTEP) {
+	case 3:
+		process_packet(pkts_burst[j], dst_port + j, portid);
+		GROUP_PORT_STEP(dlp, dst_port, lp, pnum, j);
+		j++;
+	case 2:
+		process_packet(pkts_burst[j], dst_port + j, portid);
+		GROUP_PORT_STEP(dlp, dst_port, lp, pnum, j);
+		j++;
+	case 1:
+		process_packet(pkts_burst[j], dst_port + j, portid);
+		GROUP_PORT_STEP(dlp, dst_port, lp, pnum, j);
+		j++;
+	}
+
+	/*
+	 * Send packets out, through destination port.
+	 * Consecutive packets with the same destination port
+	 * are already grouped together.
+	 * If destination port for the packet equals BAD_PORT,
+	 * then free the packet without sending it out.
+	 */
+	for (j = 0; j < nb_rx; j += k) {
+
+		int32_t m;
+		uint16_t pn;
+
+		pn = dst_port[j];
+		k = pnum[j];
+
+		if (likely(pn != BAD_PORT))
+			send_packetsx4(pn, pkts_burst + j, k);
+		else
+			for (m = j; m != j + k; m++)
+				rte_pktmbuf_free(pkts_burst[m]);
+
+	}
+
+#endif /* APP_LOOKUP_METHOD */
+#else /* ENABLE_MULTI_BUFFER_OPTIMIZE == 0 */
+
+	/* Prefetch first packets */
+	for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
+		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *));
+
+	/* Prefetch and forward already prefetched packets */
+	for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
+				j + PREFETCH_OFFSET], void *));
+		l3fwd_simple_forward(pkts_burst[j], portid);
+	}
+
+	/* Forward remaining prefetched packets */
+	for (; j < nb_rx; j++)
+		l3fwd_simple_forward(pkts_burst[j], portid);
+
+#endif /* ENABLE_MULTI_BUFFER_OPTIMIZE */
+
+}
+
+#if (APP_CPU_LOAD > 0)
+
+/*
+ * CPU-load stats collector
+ */
+static int
+cpu_load_collector(__rte_unused void *arg) {
+	unsigned i, j, k;
+	uint64_t hits;
+	uint64_t prev_tsc, diff_tsc, cur_tsc;
+	uint64_t total[MAX_CPU] = { 0 };
+	unsigned min_cpu = MAX_CPU;
+	unsigned max_cpu = 0;
+	unsigned cpu_id;
+	int busy_total = 0;
+	int busy_flag = 0;
+
+	unsigned int n_thread_per_cpu[MAX_CPU] = { 0 };
+	struct thread_conf *thread_per_cpu[MAX_CPU][MAX_THREAD];
+
+	struct thread_conf *thread_conf;
+
+	const uint64_t interval_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
+		US_PER_S * CPU_LOAD_TIMEOUT_US;
+
+	prev_tsc = 0;
+	/*
+	 * Wait for all threads
+	 */
+
+	printf("Waiting for %d rx threads and %d tx threads\n", n_rx_thread,
+			n_tx_thread);
+
+	while (rte_atomic16_read(&rx_counter) < n_rx_thread)
+		rte_pause();
+
+	while (rte_atomic16_read(&tx_counter) < n_tx_thread)
+		rte_pause();
+
+	for (i = 0; i < n_rx_thread; i++) {
+
+		thread_conf = &rx_thread[i].conf;
+		cpu_id = thread_conf->cpu_id;
+		thread_per_cpu[cpu_id][n_thread_per_cpu[cpu_id]++] = thread_conf;
+
+		if (cpu_id > max_cpu)
+			max_cpu = cpu_id;
+		if (cpu_id < min_cpu)
+			min_cpu = cpu_id;
+	}
+	for (i = 0; i < n_tx_thread; i++) {
+
+		thread_conf = &tx_thread[i].conf;
+		cpu_id = thread_conf->cpu_id;
+		thread_per_cpu[cpu_id][n_thread_per_cpu[cpu_id]++] = thread_conf;
+
+		if (thread_conf->cpu_id > max_cpu)
+			max_cpu = thread_conf->cpu_id;
+		if (thread_conf->cpu_id < min_cpu)
+			min_cpu = thread_conf->cpu_id;
+	}
+
+	while (1) {
+
+		cpu_load.counter++;
+		for (i = min_cpu; i <= max_cpu; i++) {
+			for (j = 0; j < MAX_CPU_COUNTER; j++) {
+				for (k = 0; k < n_thread_per_cpu[i]; k++)
+					if (thread_per_cpu[i][k]->busy[j]) {
+						busy_flag = 1;
+						break;
+					}
+				if (busy_flag) {
+					cpu_load.hits[j][i]++;
+					busy_total = 1;
+					busy_flag = 0;
+				}
+			}
+
+			if (busy_total) {
+				total[i]++;
+				busy_total = 0;
+			}
+		}
+
+		cur_tsc = rte_rdtsc();
+
+		diff_tsc = cur_tsc - prev_tsc;
+		if (unlikely(diff_tsc > interval_tsc)) {
+
+			printf("\033c");
+
+			printf("Cpu usage for %d rx threads and %d tx threads:\n\n",
+					n_rx_thread, n_tx_thread);
+
+			printf("cpu#     proc%%  poll%%  overhead%%\n\n");
+
+			for (i = min_cpu; i <= max_cpu; i++) {
+				hits = 0;
+				printf("CPU %d:", i);
+				for (j = 0; j < MAX_CPU_COUNTER; j++) {
+					printf("%7" PRIu64 "",
+							cpu_load.hits[j][i] * 100 / cpu_load.counter);
+					hits += cpu_load.hits[j][i];
+					cpu_load.hits[j][i] = 0;
+				}
+				printf("%7" PRIu64 "\n",
+						100 - total[i] * 100 / cpu_load.counter);
+				total[i] = 0;
+			}
+			cpu_load.counter = 0;
+
+			prev_tsc = cur_tsc;
+		}
+
+	}
+}
+#endif /* APP_CPU_LOAD */
+
+/*
+ * Null processing lthread loop
+ *
+ * This loop is used to start an empty scheduler on an lcore.
+ */
+static void
+lthread_null(__rte_unused void *args)
+{
+	int lcore_id = rte_lcore_id();
+
+	RTE_LOG(INFO, L3FWD, "Starting scheduler on lcore %d.\n", lcore_id);
+	lthread_exit(NULL);
+}
+
+/* main processing loop */
+static void
+lthread_tx_per_ring(void *dummy)
+{
+	int nb_rx;
+	uint8_t portid;
+	struct rte_ring *ring;
+	struct thread_tx_conf *tx_conf;
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	struct lthread_cond *ready;
+
+	tx_conf = (struct thread_tx_conf *)dummy;
+	ring = tx_conf->ring;
+	ready = *tx_conf->ready;
+
+	lthread_set_data((void *)tx_conf);
+
+	/*
+	 * Move this lthread to lcore
+	 */
+	lthread_set_affinity(tx_conf->conf.lcore_id);
+
+	RTE_LOG(INFO, L3FWD, "entering main tx loop on lcore %u\n", rte_lcore_id());
+
+	nb_rx = 0;
+	rte_atomic16_inc(&tx_counter);
+	while (1) {
+
+		/*
+		 * Read packet from ring
+		 */
+		SET_CPU_BUSY(tx_conf, CPU_POLL);
+		nb_rx = rte_ring_sc_dequeue_burst(ring, (void **)pkts_burst,
+				MAX_PKT_BURST);
+		SET_CPU_IDLE(tx_conf, CPU_POLL);
+
+		if (nb_rx > 0) {
+			SET_CPU_BUSY(tx_conf, CPU_PROCESS);
+			portid = pkts_burst[0]->port;
+			process_burst(pkts_burst, nb_rx, portid);
+			SET_CPU_IDLE(tx_conf, CPU_PROCESS);
+			lthread_yield();
+		} else
+			lthread_cond_wait(ready, 0);
+
+	}
+}
+
+/*
+ * Main tx-lthreads spawner lthread.
+ *
+ * This lthread is used to spawn one new lthread per ring from producers.
+ *
+ */
+static void
+lthread_tx(void *args)
+{
+	struct lthread *lt;
+
+	unsigned lcore_id;
+	uint8_t portid;
+	struct thread_tx_conf *tx_conf;
+
+	tx_conf = (struct thread_tx_conf *)args;
+	lthread_set_data((void *)tx_conf);
+
+	/*
+	 * Move this lthread to the selected lcore
+	 */
+	lthread_set_affinity(tx_conf->conf.lcore_id);
+
+	/*
+	 * Spawn tx readers (one per input ring)
+	 */
+	lthread_create(&lt, tx_conf->conf.lcore_id, lthread_tx_per_ring,
+			(void *)tx_conf);
+
+	lcore_id = rte_lcore_id();
+
+	RTE_LOG(INFO, L3FWD, "Entering Tx main loop on lcore %u\n", lcore_id);
+
+	tx_conf->conf.cpu_id = sched_getcpu();
+	while (1) {
+
+		lthread_sleep(BURST_TX_DRAIN_US * 1000);
+
+		/*
+		 * TX burst queue drain
+		 */
+		for (portid = 0; portid < RTE_MAX_ETHPORTS; portid++) {
+			if (tx_conf->tx_mbufs[portid].len == 0)
+				continue;
+			SET_CPU_BUSY(tx_conf, CPU_PROCESS);
+			send_burst(tx_conf, tx_conf->tx_mbufs[portid].len, portid);
+			SET_CPU_IDLE(tx_conf, CPU_PROCESS);
+			tx_conf->tx_mbufs[portid].len = 0;
+		}
+
+	}
+}
+
+static void
+lthread_rx(void *dummy)
+{
+	int ret;
+	uint16_t nb_rx;
+	int i;
+	uint8_t portid, queueid;
+	int worker_id;
+	int len[RTE_MAX_LCORE] = { 0 };
+	int old_len, new_len;
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	struct thread_rx_conf *rx_conf;
+
+	rx_conf = (struct thread_rx_conf *)dummy;
+	lthread_set_data((void *)rx_conf);
+
+	/*
+	 * Move this lthread to lcore
+	 */
+	lthread_set_affinity(rx_conf->conf.lcore_id);
+
+	if (rx_conf->n_rx_queue == 0) {
+		RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", rte_lcore_id());
+		return;
+	}
+
+	RTE_LOG(INFO, L3FWD, "Entering main Rx loop on lcore %u\n", rte_lcore_id());
+
+	for (i = 0; i < rx_conf->n_rx_queue; i++) {
+
+		portid = rx_conf->rx_queue_list[i].port_id;
+		queueid = rx_conf->rx_queue_list[i].queue_id;
+		RTE_LOG(INFO, L3FWD, " -- lcoreid=%u portid=%hhu rxqueueid=%hhu\n",
+				rte_lcore_id(), portid, queueid);
+	}
+
+	/*
+	 * Init all condition variables (one per rx thread)
+	 */
+	for (i = 0; i < rx_conf->n_rx_queue; i++)
+		lthread_cond_init(NULL, &rx_conf->ready[i], NULL);
+
+	worker_id = 0;
+
+	rx_conf->conf.cpu_id = sched_getcpu();
+	rte_atomic16_inc(&rx_counter);
+	while (1) {
+
+		/*
+		 * Read packet from RX queues
+		 */
+		for (i = 0; i < rx_conf->n_rx_queue; ++i) {
+			portid = rx_conf->rx_queue_list[i].port_id;
+			queueid = rx_conf->rx_queue_list[i].queue_id;
+
+			SET_CPU_BUSY(rx_conf, CPU_POLL);
+			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
+				MAX_PKT_BURST);
+			SET_CPU_IDLE(rx_conf, CPU_POLL);
+
+			if (nb_rx != 0) {
+				worker_id = (worker_id + 1) % rx_conf->n_ring;
+				old_len = len[worker_id];
+
+				SET_CPU_BUSY(rx_conf, CPU_PROCESS);
+				ret = rte_ring_sp_enqueue_burst(
+						rx_conf->ring[worker_id],
+						(void **) pkts_burst,
+						nb_rx);
+
+				new_len = old_len + ret;
+
+				if (new_len >= BURST_SIZE) {
+					lthread_cond_signal(rx_conf->ready[worker_id]);
+					new_len = 0;
+				}
+
+				len[worker_id] = new_len;
+
+				if (unlikely(ret < nb_rx)) {
+					uint32_t k;
+
+					for (k = ret; k < nb_rx; k++) {
+						struct rte_mbuf *m = pkts_burst[k];
+
+						rte_pktmbuf_free(m);
+					}
+				}
+				SET_CPU_IDLE(rx_conf, CPU_PROCESS);
+			}
+
+			lthread_yield();
+		}
+	}
+}
+
+/*
+ * Start scheduler with initial lthread on lcore
+ *
+ * This lthread loop spawns all rx and tx lthreads on master lcore
+ */
+
+static void
+lthread_spawner(__rte_unused void *arg) {
+	struct lthread *lt[MAX_THREAD];
+	int i;
+	int n_thread = 0;
+
+	printf("Entering lthread_spawner\n");
+
+	/*
+	 * Create producers (rx threads) on default lcore
+	 */
+	for (i = 0; i < n_rx_thread; i++) {
+		rx_thread[i].conf.thread_id = i;
+		lthread_create(&lt[n_thread], -1, lthread_rx,
+				(void *)&rx_thread[i]);
+		n_thread++;
+	}
+
+	/*
+	 * Wait for all producers. Some producers may be started on the same
+	 * scheduler as this lthread, so it must sleep to let them run and
+	 * prevent deadlock here.
+	 */
+	while (rte_atomic16_read(&rx_counter) < n_rx_thread)
+		lthread_sleep(100000);
+
+	/*
+	 * Create consumers (tx threads) on default lcore_id
+	 */
+	for (i = 0; i < n_tx_thread; i++) {
+		tx_thread[i].conf.thread_id = i;
+		lthread_create(&lt[n_thread], -1, lthread_tx,
+				(void *)&tx_thread[i]);
+		n_thread++;
+	}
+
+	/*
+	 * Wait for all threads to finish
+	 */
+	for (i = 0; i < n_thread; i++)
+		lthread_join(lt[i], NULL);
+
+}
+
+/*
+ * Start master scheduler with initial lthread spawning rx and tx lthreads
+ * (main_lthread_master).
+ */
+static int
+lthread_master_spawner(__rte_unused void *arg) {
+	struct lthread *lt;
+	int lcore_id = rte_lcore_id();
+
+	RTE_PER_LCORE(lcore_conf) = &lcore_conf[lcore_id];
+	lthread_create(&lt, -1, lthread_spawner, NULL);
+	lthread_run();
+
+	return 0;
+}
+
+/*
+ * Start scheduler on lcore.
+ */
+static int
+sched_spawner(__rte_unused void *arg) {
+	struct lthread *lt;
+	int lcore_id = rte_lcore_id();
+
+#if (APP_CPU_LOAD)
+	if (lcore_id == cpu_load_lcore_id) {
+		cpu_load_collector(arg);
+		return 0;
+	}
+#endif /* APP_CPU_LOAD */
+
+	RTE_PER_LCORE(lcore_conf) = &lcore_conf[lcore_id];
+	lthread_create(&lt, -1, lthread_null, NULL);
+	lthread_run();
+
+	return 0;
+}
+
+/* main processing loop */
+static int
+pthread_tx(void *dummy)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	uint64_t prev_tsc, diff_tsc, cur_tsc;
+	int nb_rx;
+	uint8_t portid;
+	struct thread_tx_conf *tx_conf;
+
+	const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
+		US_PER_S * BURST_TX_DRAIN_US;
+
+	prev_tsc = 0;
+
+	tx_conf = (struct thread_tx_conf *)dummy;
+
+	RTE_LOG(INFO, L3FWD, "Entering main Tx loop on lcore %u\n", rte_lcore_id());
+
+	tx_conf->conf.cpu_id = sched_getcpu();
+	rte_atomic16_inc(&tx_counter);
+	while (1) {
+
+		cur_tsc = rte_rdtsc();
+
+		/*
+		 * TX burst queue drain
+		 */
+		diff_tsc = cur_tsc - prev_tsc;
+		if (unlikely(diff_tsc > drain_tsc)) {
+
+			/*
+			 * This could be optimized (use queueid instead of
+			 * portid), but it is not called so often
+			 */
+			SET_CPU_BUSY(tx_conf, CPU_PROCESS);
+			for (portid = 0; portid < RTE_MAX_ETHPORTS; portid++) {
+				if (tx_conf->tx_mbufs[portid].len == 0)
+					continue;
+				send_burst(tx_conf, tx_conf->tx_mbufs[portid].len, portid);
+				tx_conf->tx_mbufs[portid].len = 0;
+			}
+			SET_CPU_IDLE(tx_conf, CPU_PROCESS);
+
+			prev_tsc = cur_tsc;
+		}
+
+		/*
+		 * Read packet from ring
+		 */
+		SET_CPU_BUSY(tx_conf, CPU_POLL);
+		nb_rx = rte_ring_sc_dequeue_burst(tx_conf->ring,
+				(void **)pkts_burst, MAX_PKT_BURST);
+		SET_CPU_IDLE(tx_conf, CPU_POLL);
+
+		if (unlikely(nb_rx == 0)) {
+			sched_yield();
+			continue;
+		}
+
+		SET_CPU_BUSY(tx_conf, CPU_PROCESS);
+		portid = pkts_burst[0]->port;
+		process_burst(pkts_burst, nb_rx, portid);
+		SET_CPU_IDLE(tx_conf, CPU_PROCESS);
+
+	}
+}
+
+static int
+pthread_rx(void *dummy)
+{
+	int i;
+	int worker_id;
+	uint32_t n;
+	uint32_t nb_rx;
+	unsigned lcore_id;
+	uint8_t portid, queueid;
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+
+	struct thread_rx_conf *rx_conf;
+
+	lcore_id = rte_lcore_id();
+	rx_conf = (struct thread_rx_conf *)dummy;
+
+	if (rx_conf->n_rx_queue == 0) {
+		RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", lcore_id);
+		return 0;
+	}
+
+	RTE_LOG(INFO, L3FWD, "entering main rx loop on lcore %u\n", lcore_id);
+
+	for (i = 0; i < rx_conf->n_rx_queue; i++) {
+
+		portid = rx_conf->rx_queue_list[i].port_id;
+		queueid = rx_conf->rx_queue_list[i].queue_id;
+		RTE_LOG(INFO, L3FWD, " -- lcoreid=%u portid=%hhu rxqueueid=%hhu\n",
+				lcore_id, portid, queueid);
+	}
+
+	worker_id = 0;
+	rx_conf->conf.cpu_id = sched_getcpu();
+	rte_atomic16_inc(&rx_counter);
+	while (1) {
+
+		/*
+		 * Read packet from RX queues
+		 */
+		for (i = 0; i < rx_conf->n_rx_queue; ++i) {
+			portid = rx_conf->rx_queue_list[i].port_id;
+			queueid = rx_conf->rx_queue_list[i].queue_id;
+
+			SET_CPU_BUSY(rx_conf, CPU_POLL);
+			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
+				MAX_PKT_BURST);
+			SET_CPU_IDLE(rx_conf, CPU_POLL);
+
+			if (nb_rx == 0) {
+				sched_yield();
+				continue;
+			}
+
+			SET_CPU_BUSY(rx_conf, CPU_PROCESS);
+			worker_id = (worker_id + 1) % rx_conf->n_ring;
+			n = rte_ring_sp_enqueue_burst(rx_conf->ring[worker_id],
+					(void **)pkts_burst, nb_rx);
+
+			if (unlikely(n != nb_rx)) {
+				uint32_t k;
+
+				for (k = n; k < nb_rx; k++) {
+					struct rte_mbuf *m = pkts_burst[k];
+
+					rte_pktmbuf_free(m);
+				}
+			}
+
+			SET_CPU_IDLE(rx_conf, CPU_PROCESS);
+
+		}
+	}
+}
+
+/*
+ * P-Thread spawner.
+ */
+static int
+pthread_run(__rte_unused void *arg) {
+	int lcore_id = rte_lcore_id();
+	int i;
+
+	for (i = 0; i < n_rx_thread; i++)
+		if (rx_thread[i].conf.lcore_id == lcore_id) {
+			printf("Start rx thread on %d...\n", lcore_id);
+			RTE_PER_LCORE(lcore_conf) = &lcore_conf[lcore_id];
+			RTE_PER_LCORE(lcore_conf)->data = (void *)&rx_thread[i];
+			pthread_rx((void *)&rx_thread[i]);
+			return 0;
+		}
+
+	for (i = 0; i < n_tx_thread; i++)
+		if (tx_thread[i].conf.lcore_id == lcore_id) {
+			printf("Start tx thread on %d...\n", lcore_id);
+			RTE_PER_LCORE(lcore_conf) = &lcore_conf[lcore_id];
+			RTE_PER_LCORE(lcore_conf)->data = (void *)&tx_thread[i];
+			pthread_tx((void *)&tx_thread[i]);
+			return 0;
+		}
+
+#if (APP_CPU_LOAD)
+	if (lcore_id == cpu_load_lcore_id)
+		cpu_load_collector(arg);
+#endif /* APP_CPU_LOAD */
+
+	return 0;
+}
+
+static int
+check_lcore_params(void)
+{
+	uint8_t queue, lcore;
+	uint16_t i;
+	int socketid;
+
+	for (i = 0; i < nb_rx_thread_params; ++i) {
+		queue = rx_thread_params[i].queue_id;
+		if (queue >= MAX_RX_QUEUE_PER_PORT) {
+			printf("invalid queue number: %hhu\n", queue);
+			return -1;
+		}
+		lcore = rx_thread_params[i].lcore_id;
+		if (!rte_lcore_is_enabled(lcore)) {
+			printf("error: lcore %hhu is not enabled in lcore mask\n", lcore);
+			return -1;
+		}
+		socketid = rte_lcore_to_socket_id(lcore);
+		if ((socketid != 0) && (numa_on == 0))
+			printf("warning: lcore %hhu is on socket %d with numa off\n",
+				lcore, socketid);
+	}
+	return 0;
+}
+
+static int
+check_port_config(const unsigned nb_ports)
+{
+	unsigned portid;
+	uint16_t i;
+
+	for (i = 0; i < nb_rx_thread_params; ++i) {
+		portid = rx_thread_params[i].port_id;
+		if ((enabled_port_mask & (1 << portid)) == 0) {
+			printf("port %u is not enabled in port mask\n", portid);
+			return -1;
+		}
+		if (portid >= nb_ports) {
+			printf("port %u is not present on the board\n", portid);
+			return -1;
+		}
+	}
+	return 0;
+}
+
+static uint8_t
+get_port_n_rx_queues(const uint8_t port)
+{
+	int queue = -1;
+	uint16_t i;
+
+	for (i = 0; i < nb_rx_thread_params; ++i)
+		if (rx_thread_params[i].port_id == port &&
+				rx_thread_params[i].queue_id > queue)
+			queue = rx_thread_params[i].queue_id;
+
+	return (uint8_t)(++queue);
+}
+
+static int
+init_rx_rings(void)
+{
+	unsigned socket_io;
+	struct thread_rx_conf *rx_conf;
+	struct thread_tx_conf *tx_conf;
+	unsigned rx_thread_id, tx_thread_id;
+	char name[256];
+	struct rte_ring *ring = NULL;
+
+	for (tx_thread_id = 0; tx_thread_id < n_tx_thread; tx_thread_id++) {
+
+		tx_conf = &tx_thread[tx_thread_id];
+
+		printf("Connecting tx-thread %d with rx-thread %d\n", tx_thread_id,
+				tx_conf->conf.thread_id);
+
+		rx_thread_id = tx_conf->conf.thread_id;
+		if (rx_thread_id >= n_rx_thread) {
+			printf("connection from tx-thread %u to rx-thread %u fails "
+					"(rx-thread not defined)\n", tx_thread_id, rx_thread_id);
+			return -1;
+		}
+
+		rx_conf = &rx_thread[rx_thread_id];
+		socket_io = rte_lcore_to_socket_id(rx_conf->conf.lcore_id);
+
+		snprintf(name, sizeof(name), "app_ring_s%u_rx%u_tx%u",
+				socket_io, rx_thread_id, tx_thread_id);
+
+		ring = rte_ring_create(name, 1024 * 4, socket_io,
+				RING_F_SP_ENQ | RING_F_SC_DEQ);
+
+		if (ring == NULL) {
+			rte_panic("Cannot create ring to connect rx-thread %u "
+					"with tx-thread %u\n", rx_thread_id, tx_thread_id);
+		}
+
+		rx_conf->ring[rx_conf->n_ring] = ring;
+
+		tx_conf->ring = ring;
+		tx_conf->ready = &rx_conf->ready[rx_conf->n_ring];
+
+		rx_conf->n_ring++;
+	}
+	return 0;
+}
+
+static int
+init_rx_queues(void)
+{
+	uint16_t i, nb_rx_queue;
+	uint8_t thread;
+
+	n_rx_thread = 0;
+
+	for (i = 0; i < nb_rx_thread_params; ++i) {
+		thread = rx_thread_params[i].thread_id;
+		nb_rx_queue = rx_thread[thread].n_rx_queue;
+
+		if (nb_rx_queue >= MAX_RX_QUEUE_PER_LCORE) {
+			printf("error: too many queues (%u) for thread: %u\n",
+				(unsigned)nb_rx_queue + 1, (unsigned)thread);
+			return -1;
+		}
+
+		rx_thread[thread].conf.thread_id = thread;
+		rx_thread[thread].conf.lcore_id = rx_thread_params[i].lcore_id;
+		rx_thread[thread].rx_queue_list[nb_rx_queue].port_id =
+			rx_thread_params[i].port_id;
+		rx_thread[thread].rx_queue_list[nb_rx_queue].queue_id =
+			rx_thread_params[i].queue_id;
+		rx_thread[thread].n_rx_queue++;
+
+		if (thread >= n_rx_thread)
+			n_rx_thread = thread + 1;
+
+	}
+	return 0;
+}
+
+static int
+init_tx_threads(void)
+{
+	int i;
+
+	n_tx_thread = 0;
+	for (i = 0; i < nb_tx_thread_params; ++i) {
+		tx_thread[n_tx_thread].conf.thread_id = tx_thread_params[i].thread_id;
+		tx_thread[n_tx_thread].conf.lcore_id = tx_thread_params[i].lcore_id;
+		n_tx_thread++;
+	}
+	return 0;
+}
+
+/* display usage */
+static void
+print_usage(const char *prgname)
+{
+	printf("%s [EAL options] -- -p PORTMASK -P"
+		"  [--rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]]"
+		"  [--tx (lcore,thread)[,(lcore,thread)]]"
+		"  [--enable-jumbo [--max-pkt-len PKTLEN]]\n"
+		"  -p PORTMASK: hexadecimal bitmask of ports to configure\n"
+		"  -P : enable promiscuous mode\n"
+		"  --rx (port,queue,lcore,thread): rx queues configuration\n"
+		"  --tx (lcore,thread): tx threads configuration\n"
+		"  --stat-lcore LCORE: use lcore for stat collector\n"
+		"  --eth-dest=X,MM:MM:MM:MM:MM:MM: optional, ethernet destination for port X\n"
+		"  --no-numa: optional, disable numa awareness\n"
+		"  --ipv6: optional, specify it if running ipv6 packets\n"
+		"  --enable-jumbo: enable jumbo frames,"
+		" with max packet length PKTLEN in decimal (64-9600)\n"
+		"  --hash-entry-num: specify the number of hash entries to set up, in hexadecimal\n"
+		"  --no-lthreads: turn off lthread model\n",
+		prgname);
+}
+
+static int parse_max_pkt_len(const char *pktlen)
+{
+	char *end = NULL;
+	unsigned long len;
+
+	/* parse decimal string */
+	len = strtoul(pktlen, &end, 10);
+	if ((pktlen[0] == '\0') || (end == NULL) || (*end != '\0'))
+		return -1;
+
+	if (len == 0)
+		return -1;
+
+	return len;
+}
+
+static int
+parse_portmask(const char *portmask)
+{
+	char *end = NULL;
+	unsigned long pm;
+
+	/* parse hexadecimal string */
+	pm = strtoul(portmask, &end, 16);
+	if ((portmask[0] == '\0') || (end == NULL) || (*end != '\0'))
+		return -1;
+
+	if (pm == 0)
+		return -1;
+
+	return pm;
+}
+
+#if (APP_LOOKUP_METHOD == APP_LOOKUP_EXACT_MATCH)
+static int
+parse_hash_entry_number(const char *hash_entry_num)
+{
+	char *end = NULL;
+	unsigned long hash_en;
+
+	/* parse hexadecimal string */
+	hash_en = strtoul(hash_entry_num, &end, 16);
+	if ((hash_entry_num[0] == '\0') || (end == NULL) || (*end != '\0'))
+		return -1;
+
+	if (hash_en == 0)
+		return -1;
+
+	return hash_en;
+}
+#endif
+
+static int
+parse_rx_config(const char *q_arg)
+{
+	char s[256];
+	const char *p, *p0 = q_arg;
+	char *end;
+	enum fieldnames {
+		FLD_PORT = 0,
+		FLD_QUEUE,
+		FLD_LCORE,
+		FLD_THREAD,
+		_NUM_FLD
+	};
+	unsigned long int_fld[_NUM_FLD];
+	char *str_fld[_NUM_FLD];
+	int i;
+	unsigned size;
+
+	nb_rx_thread_params = 0;
+
+	while ((p = strchr(p0, '(')) != NULL) {
+		++p;
+		p0 = strchr(p, ')');
+		if (p0 == NULL)
+			return -1;
+
+		size = p0 - p;
+		if (size >= sizeof(s))
+			return -1;
+
+		snprintf(s, sizeof(s), "%.*s", size, p);
+		if (rte_strsplit(s, sizeof(s), str_fld, _NUM_FLD, ',') != _NUM_FLD)
+			return -1;
+		for (i = 0; i < _NUM_FLD; i++) {
+			errno = 0;
+			int_fld[i] = strtoul(str_fld[i], &end, 0);
+			if (errno != 0 || end == str_fld[i] || int_fld[i] > 255)
+				return -1;
+		}
+		if (nb_rx_thread_params >= MAX_LCORE_PARAMS) {
+			printf("exceeded max number of rx params: %hu\n",
+					nb_rx_thread_params);
+			return -1;
+		}
+		rx_thread_params_array[nb_rx_thread_params].port_id =
+				(uint8_t)int_fld[FLD_PORT];
+		rx_thread_params_array[nb_rx_thread_params].queue_id =
+				(uint8_t)int_fld[FLD_QUEUE];
+		rx_thread_params_array[nb_rx_thread_params].lcore_id =
+				(uint8_t)int_fld[FLD_LCORE];
+		rx_thread_params_array[nb_rx_thread_params].thread_id =
+				(uint8_t)int_fld[FLD_THREAD];
+		++nb_rx_thread_params;
+	}
+	rx_thread_params = rx_thread_params_array;
+	return 0;
+}
+
+static int
+parse_tx_config(const char *q_arg)
+{
+	char s[256];
+	const char *p, *p0 = q_arg;
+	char *end;
+	enum fieldnames {
+		FLD_LCORE = 0,
+		FLD_THREAD,
+		_NUM_FLD
+	};
+	unsigned long int_fld[_NUM_FLD];
+	char *str_fld[_NUM_FLD];
+	int i;
+	unsigned size;
+
+	nb_tx_thread_params = 0;
+
+	while ((p = strchr(p0, '(')) != NULL) {
+		++p;
+		p0 = strchr(p, ')');
+		if (p0 == NULL)
+			return -1;
+
+		size = p0 - p;
+		if (size >= sizeof(s))
+			return -1;
+
+		snprintf(s, sizeof(s), "%.*s", size, p);
+		if (rte_strsplit(s, sizeof(s), str_fld, _NUM_FLD, ',') != _NUM_FLD)
+			return -1;
+		for (i = 0; i < _NUM_FLD; i++) {
+			errno = 0;
+			int_fld[i] = strtoul(str_fld[i], &end, 0);
+			if (errno != 0 || end == str_fld[i] || int_fld[i] > 255)
+				return -1;
+		}
+		if (nb_tx_thread_params >= MAX_LCORE_PARAMS) {
+			printf("exceeded max number of tx params: %hu\n",
+				nb_tx_thread_params);
+			return -1;
+		}
+		tx_thread_params_array[nb_tx_thread_params].lcore_id =
+				(uint8_t)int_fld[FLD_LCORE];
+		tx_thread_params_array[nb_tx_thread_params].thread_id =
+				(uint8_t)int_fld[FLD_THREAD];
+		++nb_tx_thread_params;
+	}
+	tx_thread_params = tx_thread_params_array;
+
+	return 0;
+}
+
+#if (APP_CPU_LOAD > 0)
+static int
+parse_stat_lcore(const char *stat_lcore)
+{
+	char *end = NULL;
+	unsigned long lcore_id;
+
+	lcore_id = strtoul(stat_lcore, &end, 10);
+	if ((stat_lcore[0] == '\0') || (end == NULL) || (*end != '\0'))
+		return -1;
+
+	return lcore_id;
+}
+#endif
+
+static void
+parse_eth_dest(const char *optarg)
+{
+	uint8_t portid;
+	char *port_end;
+	uint8_t c, *dest, peer_addr[6];
+
+	errno = 0;
+	portid = strtoul(optarg, &port_end, 10);
+	if (errno != 0 || port_end == optarg || *port_end++ != ',')
+		rte_exit(EXIT_FAILURE,
+		"Invalid eth-dest: %s\n", optarg);
+	if (portid >= RTE_MAX_ETHPORTS)
+		rte_exit(EXIT_FAILURE,
+		"eth-dest: port %d >= RTE_MAX_ETHPORTS(%d)\n",
+		portid, RTE_MAX_ETHPORTS);
+
+	if (cmdline_parse_etheraddr(NULL, port_end,
+		&peer_addr, sizeof(peer_addr)) < 0)
+		rte_exit(EXIT_FAILURE,
+		"Invalid ethernet address: %s\n",
+		port_end);
+	dest = (uint8_t *)&dest_eth_addr[portid];
+	for (c = 0; c < 6; c++)
+		dest[c] = peer_addr[c];
+	*(uint64_t *)(val_eth + portid) = dest_eth_addr[portid];
+}
+
+#define CMD_LINE_OPT_RX_CONFIG "rx"
+#define CMD_LINE_OPT_TX_CONFIG "tx"
+#define CMD_LINE_OPT_STAT_LCORE "stat-lcore"
+#define CMD_LINE_OPT_ETH_DEST "eth-dest"
+#define CMD_LINE_OPT_NO_NUMA "no-numa"
+#define CMD_LINE_OPT_IPV6 "ipv6"
+#define CMD_LINE_OPT_ENABLE_JUMBO "enable-jumbo"
+#define CMD_LINE_OPT_HASH_ENTRY_NUM "hash-entry-num"
+#define CMD_LINE_OPT_NO_LTHREADS "no-lthreads"
+
+/* Parse the argument given in the command line of the application */
+static int
+parse_args(int argc, char **argv)
+{
+	int opt, ret;
+	char **argvopt;
+	int option_index;
+	char *prgname = argv[0];
+	static struct option lgopts[] = {
+		{CMD_LINE_OPT_RX_CONFIG, 1, 0, 0},
+		{CMD_LINE_OPT_TX_CONFIG, 1, 0, 0},
+		{CMD_LINE_OPT_STAT_LCORE, 1, 0, 0},
+		{CMD_LINE_OPT_ETH_DEST, 1, 0, 0},
+		{CMD_LINE_OPT_NO_NUMA, 0, 0, 0},
+		{CMD_LINE_OPT_IPV6, 0, 0, 0},
+		{CMD_LINE_OPT_ENABLE_JUMBO, 0, 0, 0},
+		{CMD_LINE_OPT_HASH_ENTRY_NUM, 1, 0, 0},
+		{CMD_LINE_OPT_NO_LTHREADS, 0, 0, 0},
+		{NULL, 0, 0, 0}
+	};
+
+	argvopt = argv;
+
+	while ((opt = getopt_long(argc, argvopt, "p:P",
+				lgopts, &option_index)) != EOF) {
+
+		switch (opt) {
+		/* portmask */
+		case 'p':
+			enabled_port_mask = parse_portmask(optarg);
+			if (enabled_port_mask == 0) {
+				printf("invalid portmask\n");
+				print_usage(prgname);
+				return -1;
+			}
+			break;
+		case 'P':
+			printf("Promiscuous mode selected\n");
+			promiscuous_on = 1;
+			break;
+
+		/* long options */
+		case 0:
+			if (!strncmp(lgopts[option_index].name, CMD_LINE_OPT_RX_CONFIG,
+				sizeof(CMD_LINE_OPT_RX_CONFIG))) {
+				ret = parse_rx_config(optarg);
+				if (ret) {
+					printf("invalid rx-config\n");
+					print_usage(prgname);
+					return -1;
+				}
+			}
+
+			if (!strncmp(lgopts[option_index].name, CMD_LINE_OPT_TX_CONFIG,
+				sizeof(CMD_LINE_OPT_TX_CONFIG))) {
+				ret = parse_tx_config(optarg);
+				if (ret) {
+					printf("invalid tx-config\n");
+					print_usage(prgname);
+					return -1;
+				}
+			}
+
+#if (APP_CPU_LOAD > 0)
+			if (!strncmp(lgopts[option_index].name, CMD_LINE_OPT_STAT_LCORE,
+					sizeof(CMD_LINE_OPT_STAT_LCORE))) {
+				cpu_load_lcore_id = parse_stat_lcore(optarg);
+			}
+#endif
+
+			if (!strncmp(lgopts[option_index].name, CMD_LINE_OPT_ETH_DEST,
+				sizeof(CMD_LINE_OPT_ETH_DEST)))
+					parse_eth_dest(optarg);
+
+			if (!strncmp(lgopts[option_index].name, CMD_LINE_OPT_NO_NUMA,
+				sizeof(CMD_LINE_OPT_NO_NUMA))) {
+				printf("numa is disabled\n");
+				numa_on = 0;
+			}
+
+#if (APP_LOOKUP_METHOD == APP_LOOKUP_EXACT_MATCH)
+			if (!strncmp(lgopts[option_index].name, CMD_LINE_OPT_IPV6,
+				sizeof(CMD_LINE_OPT_IPV6))) {
+				printf("ipv6 is specified\n");
+				ipv6 = 1;
+			}
+#endif
+
+			if (!strncmp(lgopts[option_index].name, CMD_LINE_OPT_NO_LTHREADS,
+					sizeof(CMD_LINE_OPT_NO_LTHREADS))) {
+				printf("l-threads model is disabled\n");
+				lthreads_on = 0;
+			}
+
+			if (!strncmp(lgopts[option_index].name, CMD_LINE_OPT_ENABLE_JUMBO,
+				sizeof(CMD_LINE_OPT_ENABLE_JUMBO))) {
+				struct option lenopts = {"max-pkt-len", required_argument, 0,
+						0};
+
+				printf("jumbo frame is enabled - disabling simple TX path\n");
+				port_conf.rxmode.jumbo_frame = 1;
+
+				/* if no max-pkt-len set, use the default value ETHER_MAX_LEN */
+				if (0 == getopt_long(argc, argvopt, "", &lenopts,
+						&option_index)) {
+
+					ret = parse_max_pkt_len(optarg);
+					if ((ret < 64) || (ret > MAX_JUMBO_PKT_LEN)) {
+						printf("invalid packet length\n");
+						print_usage(prgname);
+						return -1;
+					}
+					port_conf.rxmode.max_rx_pkt_len = ret;
+				}
+				printf("set jumbo frame max packet length to %u\n",
+						(unsigned int)port_conf.rxmode.max_rx_pkt_len);
+			}
+#if (APP_LOOKUP_METHOD == APP_LOOKUP_EXACT_MATCH)
+			if (!strncmp(lgopts[option_index].name, CMD_LINE_OPT_HASH_ENTRY_NUM,
+				sizeof(CMD_LINE_OPT_HASH_ENTRY_NUM))) {
+				ret = parse_hash_entry_number(optarg);
+				if ((ret > 0) && (ret <= L3FWD_HASH_ENTRIES)) {
+					hash_entry_number = ret;
+				} else {
+					printf("invalid hash entry number\n");
+					print_usage(prgname);
+					return -1;
+				}
+			}
+#endif
+			break;
+
+		default:
+			print_usage(prgname);
+			return -1;
+		}
+	}
+
+	if (optind >= 0)
+		argv[optind-1] = prgname;
+
+	ret = optind-1;
+	optind = 0; /* reset getopt lib */
+	return ret;
+}
+
+static void
+print_ethaddr(const char *name, const struct ether_addr *eth_addr)
+{
+	char buf[ETHER_ADDR_FMT_SIZE];
+
+	ether_format_addr(buf, ETHER_ADDR_FMT_SIZE, eth_addr);
+	printf("%s%s", name, buf);
+}
+
+#if (APP_LOOKUP_METHOD == APP_LOOKUP_EXACT_MATCH)
+
+static void convert_ipv4_5tuple(struct ipv4_5tuple *key1,
+		union ipv4_5tuple_host *key2)
+{
+	key2->ip_dst = rte_cpu_to_be_32(key1->ip_dst);
+	key2->ip_src = rte_cpu_to_be_32(key1->ip_src);
+	key2->port_dst = rte_cpu_to_be_16(key1->port_dst);
+	key2->port_src = rte_cpu_to_be_16(key1->port_src);
+	key2->proto = key1->proto;
+	key2->pad0 = 0;
+	key2->pad1 = 0;
+}
+
+static void convert_ipv6_5tuple(struct ipv6_5tuple *key1,
+		union ipv6_5tuple_host *key2)
+{
+	uint32_t i;
+
+	for (i = 0; i < 16; i++) {
+		key2->ip_dst[i] = key1->ip_dst[i];
+		key2->ip_src[i] = key1->ip_src[i];
+	}
+	key2->port_dst = rte_cpu_to_be_16(key1->port_dst);
+	key2->port_src = rte_cpu_to_be_16(key1->port_src);
+	key2->proto = key1->proto;
+	key2->pad0 = 0;
+	key2->pad1 = 0;
+	key2->reserve = 0;
+}
+
+#define BYTE_VALUE_MAX 256
+#define ALL_32_BITS 0xffffffff
+#define BIT_8_TO_15 0x0000ff00
+static inline void
+populate_ipv4_few_flow_into_table(const struct rte_hash *h)
+{
+	uint32_t i;
+	int32_t ret;
+	uint32_t array_len = RTE_DIM(ipv4_l3fwd_route_array);
+
+	mask0 = _mm_set_epi32(ALL_32_BITS, ALL_32_BITS, ALL_32_BITS, BIT_8_TO_15);
+	for (i = 0; i < array_len; i++) {
+		struct ipv4_l3fwd_route  entry;
+		union ipv4_5tuple_host newkey;
+
+		entry = ipv4_l3fwd_route_array[i];
+		convert_ipv4_5tuple(&entry.key, &newkey);
+		ret = rte_hash_add_key(h, (void *)&newkey);
+		if (ret < 0) {
+			rte_exit(EXIT_FAILURE, "Unable to add entry %" PRIu32
+				" to the l3fwd hash.\n", i);
+		}
+		ipv4_l3fwd_out_if[ret] = entry.if_out;
+	}
+	printf("Hash: Adding 0x%" PRIx32 " keys\n", array_len);
+}
+
+#define BIT_16_TO_23 0x00ff0000
+static inline void
+populate_ipv6_few_flow_into_table(const struct rte_hash *h)
+{
+	uint32_t i;
+	int32_t ret;
+	uint32_t array_len = RTE_DIM(ipv6_l3fwd_route_array);
+
+	mask1 = _mm_set_epi32(ALL_32_BITS, ALL_32_BITS, ALL_32_BITS, BIT_16_TO_23);
+	mask2 = _mm_set_epi32(0, 0, ALL_32_BITS, ALL_32_BITS);
+	for (i = 0; i < array_len; i++) {
+		struct ipv6_l3fwd_route entry;
+		union ipv6_5tuple_host newkey;
+
+		entry = ipv6_l3fwd_route_array[i];
+		convert_ipv6_5tuple(&entry.key, &newkey);
+		ret = rte_hash_add_key(h, (void *)&newkey);
+		if (ret < 0) {
+			rte_exit(EXIT_FAILURE, "Unable to add entry %" PRIu32
+				" to the l3fwd hash.\n", i);
+		}
+		ipv6_l3fwd_out_if[ret] = entry.if_out;
+	}
+	printf("Hash: Adding 0x%" PRIx32 " keys\n", array_len);
+}
+
+#define NUMBER_PORT_USED 4
+static inline void
+populate_ipv4_many_flow_into_table(const struct rte_hash *h,
+		unsigned int nr_flow)
+{
+	unsigned i;
+
+	mask0 = _mm_set_epi32(ALL_32_BITS, ALL_32_BITS, ALL_32_BITS, BIT_8_TO_15);
+
+	for (i = 0; i < nr_flow; i++) {
+		struct ipv4_l3fwd_route entry;
+		union ipv4_5tuple_host newkey;
+		uint8_t a = (uint8_t)((i / NUMBER_PORT_USED) % BYTE_VALUE_MAX);
+		uint8_t b = (uint8_t)(((i / NUMBER_PORT_USED) / BYTE_VALUE_MAX) %
+				BYTE_VALUE_MAX);
+		uint8_t c = (uint8_t)((i / NUMBER_PORT_USED) / (BYTE_VALUE_MAX *
+				BYTE_VALUE_MAX));
+		/* Create the ipv4 exact match flow */
+		memset(&entry, 0, sizeof(entry));
+		switch (i & (NUMBER_PORT_USED - 1)) {
+		case 0:
+			entry = ipv4_l3fwd_route_array[0];
+			entry.key.ip_dst = IPv4(101, c, b, a);
+			break;
+		case 1:
+			entry = ipv4_l3fwd_route_array[1];
+			entry.key.ip_dst = IPv4(201, c, b, a);
+			break;
+		case 2:
+			entry = ipv4_l3fwd_route_array[2];
+			entry.key.ip_dst = IPv4(111, c, b, a);
+			break;
+		case 3:
+			entry = ipv4_l3fwd_route_array[3];
+			entry.key.ip_dst = IPv4(211, c, b, a);
+			break;
+		};
+		convert_ipv4_5tuple(&entry.key, &newkey);
+		int32_t ret = rte_hash_add_key(h, (void *)&newkey);
+
+		if (ret < 0)
+			rte_exit(EXIT_FAILURE, "Unable to add entry %u\n", i);
+
+		ipv4_l3fwd_out_if[ret] = (uint8_t)entry.if_out;
+
+	}
+	printf("Hash: Adding 0x%x keys\n", nr_flow);
+}
+
+static inline void
+populate_ipv6_many_flow_into_table(const struct rte_hash *h,
+		unsigned int nr_flow)
+{
+	unsigned i;
+
+	mask1 = _mm_set_epi32(ALL_32_BITS, ALL_32_BITS, ALL_32_BITS, BIT_16_TO_23);
+	mask2 = _mm_set_epi32(0, 0, ALL_32_BITS, ALL_32_BITS);
+	for (i = 0; i < nr_flow; i++) {
+		struct ipv6_l3fwd_route entry;
+		union ipv6_5tuple_host newkey;
+
+		uint8_t a = (uint8_t) ((i / NUMBER_PORT_USED) % BYTE_VALUE_MAX);
+		uint8_t b = (uint8_t) (((i / NUMBER_PORT_USED) / BYTE_VALUE_MAX) %
+				BYTE_VALUE_MAX);
+		uint8_t c = (uint8_t) ((i / NUMBER_PORT_USED) / (BYTE_VALUE_MAX *
+				BYTE_VALUE_MAX));
+
+		/* Create the ipv6 exact match flow */
+		memset(&entry, 0, sizeof(entry));
+		switch (i & (NUMBER_PORT_USED - 1)) {
+		case 0:
+			entry = ipv6_l3fwd_route_array[0];
+			break;
+		case 1:
+			entry = ipv6_l3fwd_route_array[1];
+			break;
+		case 2:
+			entry = ipv6_l3fwd_route_array[2];
+			break;
+		case 3:
+			entry = ipv6_l3fwd_route_array[3];
+			break;
+		};
+		entry.key.ip_dst[13] = c;
+		entry.key.ip_dst[14] = b;
+		entry.key.ip_dst[15] = a;
+		convert_ipv6_5tuple(&entry.key, &newkey);
+		int32_t ret = rte_hash_add_key(h, (void *)&newkey);
+
+		if (ret < 0)
+			rte_exit(EXIT_FAILURE, "Unable to add entry %u\n", i);
+
+		ipv6_l3fwd_out_if[ret] = (uint8_t) entry.if_out;
+
+	}
+	printf("Hash: Adding 0x%x keys\n", nr_flow);
+}
+
+static void
+setup_hash(int socketid)
+{
+	struct rte_hash_parameters ipv4_l3fwd_hash_params = {
+		.name = NULL,
+		.entries = L3FWD_HASH_ENTRIES,
+		.key_len = sizeof(union ipv4_5tuple_host),
+		.hash_func = ipv4_hash_crc,
+		.hash_func_init_val = 0,
+	};
+
+	struct rte_hash_parameters ipv6_l3fwd_hash_params = {
+		.name = NULL,
+		.entries = L3FWD_HASH_ENTRIES,
+		.key_len = sizeof(union ipv6_5tuple_host),
+		.hash_func = ipv6_hash_crc,
+		.hash_func_init_val = 0,
+	};
+
+	char s[64];
+
+	/* create ipv4 hash */
+	snprintf(s, sizeof(s), "ipv4_l3fwd_hash_%d", socketid);
+	ipv4_l3fwd_hash_params.name = s;
+	ipv4_l3fwd_hash_params.socket_id = socketid;
+	ipv4_l3fwd_lookup_struct[socketid] =
+			rte_hash_create(&ipv4_l3fwd_hash_params);
+	if (ipv4_l3fwd_lookup_struct[socketid] == NULL)
+		rte_exit(EXIT_FAILURE, "Unable to create the l3fwd hash on "
+				"socket %d\n", socketid);
+
+	/* create ipv6 hash */
+	snprintf(s, sizeof(s), "ipv6_l3fwd_hash_%d", socketid);
+	ipv6_l3fwd_hash_params.name = s;
+	ipv6_l3fwd_hash_params.socket_id = socketid;
+	ipv6_l3fwd_lookup_struct[socketid] =
+			rte_hash_create(&ipv6_l3fwd_hash_params);
+	if (ipv6_l3fwd_lookup_struct[socketid] == NULL)
+		rte_exit(EXIT_FAILURE, "Unable to create the l3fwd hash on "
+				"socket %d\n", socketid);
+
+	if (hash_entry_number != HASH_ENTRY_NUMBER_DEFAULT) {
+		/* For testing hash matching with a large number of flows we
+		 * generate millions of IP 5-tuples with an incremented dst
+		 * address to initialize the hash table. */
+		if (ipv6 == 0) {
+			/* populate the ipv4 hash */
+			populate_ipv4_many_flow_into_table(
+				ipv4_l3fwd_lookup_struct[socketid], hash_entry_number);
+		} else {
+			/* populate the ipv6 hash */
+			populate_ipv6_many_flow_into_table(
+				ipv6_l3fwd_lookup_struct[socketid], hash_entry_number);
+		}
+	} else {
+		/* Use data in ipv4/ipv6 l3fwd lookup table directly to initialize
+		 * the hash table */
+		if (ipv6 == 0) {
+			/* populate the ipv4 hash */
+			populate_ipv4_few_flow_into_table(
+					ipv4_l3fwd_lookup_struct[socketid]);
+		} else {
+			/* populate the ipv6 hash */
+			populate_ipv6_few_flow_into_table(
+					ipv6_l3fwd_lookup_struct[socketid]);
+		}
+	}
+}
+#endif
+
+#if (APP_LOOKUP_METHOD == APP_LOOKUP_LPM)
+static void
+setup_lpm(int socketid)
+{
+	struct rte_lpm6_config config;
+	unsigned i;
+	int ret;
+	char s[64];
+
+	/* create the LPM table */
+	snprintf(s, sizeof(s), "IPV4_L3FWD_LPM_%d", socketid);
+	ipv4_l3fwd_lookup_struct[socketid] = rte_lpm_create(s, socketid,
+				IPV4_L3FWD_LPM_MAX_RULES, 0);
+	if (ipv4_l3fwd_lookup_struct[socketid] == NULL)
+		rte_exit(EXIT_FAILURE, "Unable to create the l3fwd LPM table"
+				" on socket %d\n", socketid);
+
+	/* populate the LPM table */
+	for (i = 0; i < IPV4_L3FWD_NUM_ROUTES; i++) {
+
+		/* skip unused ports */
+		if ((1 << ipv4_l3fwd_route_array[i].if_out &
+				enabled_port_mask) == 0)
+			continue;
+
+		ret = rte_lpm_add(ipv4_l3fwd_lookup_struct[socketid],
+			ipv4_l3fwd_route_array[i].ip,
+			ipv4_l3fwd_route_array[i].depth,
+			ipv4_l3fwd_route_array[i].if_out);
+
+		if (ret < 0) {
+			rte_exit(EXIT_FAILURE, "Unable to add entry %u to the "
+				"l3fwd LPM table on socket %d\n",
+				i, socketid);
+		}
+
+		printf("LPM: Adding route 0x%08x / %d (%d)\n",
+			(unsigned)ipv4_l3fwd_route_array[i].ip,
+			ipv4_l3fwd_route_array[i].depth,
+			ipv4_l3fwd_route_array[i].if_out);
+	}
+
+	/* create the LPM6 table */
+	snprintf(s, sizeof(s), "IPV6_L3FWD_LPM_%d", socketid);
+
+	config.max_rules = IPV6_L3FWD_LPM_MAX_RULES;
+	config.number_tbl8s = IPV6_L3FWD_LPM_NUMBER_TBL8S;
+	config.flags = 0;
+	ipv6_l3fwd_lookup_struct[socketid] = rte_lpm6_create(s, socketid,
+				&config);
+	if (ipv6_l3fwd_lookup_struct[socketid] == NULL)
+		rte_exit(EXIT_FAILURE, "Unable to create the l3fwd LPM6 table"
+				" on socket %d\n", socketid);
+
+	/* populate the LPM6 table */
+	for (i = 0; i < IPV6_L3FWD_NUM_ROUTES; i++) {
+
+		/* skip unused ports */
+		if ((1 << ipv6_l3fwd_route_array[i].if_out &
+				enabled_port_mask) == 0)
+			continue;
+
+		ret = rte_lpm6_add(ipv6_l3fwd_lookup_struct[socketid],
+			ipv6_l3fwd_route_array[i].ip,
+			ipv6_l3fwd_route_array[i].depth,
+			ipv6_l3fwd_route_array[i].if_out);
+
+		if (ret < 0) {
+			rte_exit(EXIT_FAILURE, "Unable to add entry %u to the "
+				"l3fwd LPM table on socket %d\n",
+				i, socketid);
+		}
+
+		printf("LPM: Adding route %s / %d (%d)\n",
+			"IPV6",
+			ipv6_l3fwd_route_array[i].depth,
+			ipv6_l3fwd_route_array[i].if_out);
+	}
+}
+#endif
+
+static int
+init_mem(unsigned nb_mbuf)
+{
+	struct lcore_conf *qconf;
+	int socketid;
+	unsigned lcore_id;
+	char s[64];
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		if (rte_lcore_is_enabled(lcore_id) == 0)
+			continue;
+
+		if (numa_on)
+			socketid = rte_lcore_to_socket_id(lcore_id);
+		else
+			socketid = 0;
+
+		if (socketid >= NB_SOCKETS) {
+			rte_exit(EXIT_FAILURE, "Socket %d of lcore %u is out of range %d\n",
+				socketid, lcore_id, NB_SOCKETS);
+		}
+		if (pktmbuf_pool[socketid] == NULL) {
+			snprintf(s, sizeof(s), "mbuf_pool_%d", socketid);
+			pktmbuf_pool[socketid] =
+				rte_pktmbuf_pool_create(s, nb_mbuf,
+					MEMPOOL_CACHE_SIZE, 0,
+					RTE_MBUF_DEFAULT_BUF_SIZE, socketid);
+			if (pktmbuf_pool[socketid] == NULL)
+				rte_exit(EXIT_FAILURE,
+						"Cannot init mbuf pool on socket %d\n", socketid);
+			else
+				printf("Allocated mbuf pool on socket %d\n", socketid);
+
+#if (APP_LOOKUP_METHOD == APP_LOOKUP_LPM)
+			setup_lpm(socketid);
+#else
+			setup_hash(socketid);
+#endif
+		}
+		qconf = &lcore_conf[lcore_id];
+		qconf->ipv4_lookup_struct = ipv4_l3fwd_lookup_struct[socketid];
+		qconf->ipv6_lookup_struct = ipv6_l3fwd_lookup_struct[socketid];
+	}
+	return 0;
+}
+
+/* Poll the link status of all ports for up to 9s, then print the final status */
+static void
+check_all_ports_link_status(uint8_t port_num, uint32_t port_mask)
+{
+#define CHECK_INTERVAL 100 /* 100ms */
+#define MAX_CHECK_TIME 90 /* 9s (90 * 100ms) in total */
+	uint8_t portid, count, all_ports_up, print_flag = 0;
+	struct rte_eth_link link;
+
+	printf("\nChecking link status");
+	fflush(stdout);
+	for (count = 0; count <= MAX_CHECK_TIME; count++) {
+		all_ports_up = 1;
+		for (portid = 0; portid < port_num; portid++) {
+			if ((port_mask & (1 << portid)) == 0)
+				continue;
+			memset(&link, 0, sizeof(link));
+			rte_eth_link_get_nowait(portid, &link);
+			/* print link status if flag set */
+			if (print_flag == 1) {
+				if (link.link_status)
+					printf("Port %d Link Up - speed %u "
+						"Mbps - %s\n", (uint8_t)portid,
+						(unsigned)link.link_speed,
+				(link.link_duplex == ETH_LINK_FULL_DUPLEX) ?
+					("full-duplex") : ("half-duplex"));
+				else
+					printf("Port %d Link Down\n",
+						(uint8_t)portid);
+				continue;
+			}
+			/* clear all_ports_up flag if any link down */
+			if (link.link_status == 0) {
+				all_ports_up = 0;
+				break;
+			}
+		}
+		/* after finally printing all link status, get out */
+		if (print_flag == 1)
+			break;
+
+		if (all_ports_up == 0) {
+			printf(".");
+			fflush(stdout);
+			rte_delay_ms(CHECK_INTERVAL);
+		}
+
+		/* set the print_flag if all ports up or timeout */
+		if (all_ports_up == 1 || count == (MAX_CHECK_TIME - 1)) {
+			print_flag = 1;
+			printf("done\n");
+		}
+	}
+}
+
+int
+main(int argc, char **argv)
+{
+	struct rte_eth_dev_info dev_info;
+	struct rte_eth_txconf *txconf;
+	int ret;
+	int i;
+	unsigned nb_ports;
+	uint16_t queueid;
+	unsigned lcore_id;
+	uint32_t n_tx_queue, nb_lcores;
+	uint8_t portid, nb_rx_queue, queue, socketid;
+
+	/* init EAL */
+	ret = rte_eal_init(argc, argv);
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE, "Invalid EAL parameters\n");
+	argc -= ret;
+	argv += ret;
+
+	/* pre-init dst MACs for all ports to 02:00:00:00:00:xx */
+	for (portid = 0; portid < RTE_MAX_ETHPORTS; portid++) {
+		dest_eth_addr[portid] = ETHER_LOCAL_ADMIN_ADDR +
+				((uint64_t)portid << 40);
+		*(uint64_t *)(val_eth + portid) = dest_eth_addr[portid];
+	}
+
+	/* parse application arguments (after the EAL ones) */
+	ret = parse_args(argc, argv);
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE, "Invalid L3FWD parameters\n");
+
+	if (check_lcore_params() < 0)
+		rte_exit(EXIT_FAILURE, "check_lcore_params failed\n");
+
+	printf("Initializing rx-queues...\n");
+	ret = init_rx_queues();
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE, "init_rx_queues failed\n");
+
+	printf("Initializing tx-threads...\n");
+	ret = init_tx_threads();
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE, "init_tx_threads failed\n");
+
+	printf("Initializing rings...\n");
+	ret = init_rx_rings();
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE, "init_rx_rings failed\n");
+
+	nb_ports = rte_eth_dev_count();
+	if (nb_ports > RTE_MAX_ETHPORTS)
+		nb_ports = RTE_MAX_ETHPORTS;
+
+	if (check_port_config(nb_ports) < 0)
+		rte_exit(EXIT_FAILURE, "check_port_config failed\n");
+
+	nb_lcores = rte_lcore_count();
+
+	/* initialize all ports */
+	for (portid = 0; portid < nb_ports; portid++) {
+		/* skip ports that are not enabled */
+		if ((enabled_port_mask & (1 << portid)) == 0) {
+			printf("\nSkipping disabled port %d\n", portid);
+			continue;
+		}
+
+		/* init port */
+		printf("Initializing port %d ... ", portid);
+		fflush(stdout);
+
+		nb_rx_queue = get_port_n_rx_queues(portid);
+		n_tx_queue = nb_lcores;
+		if (n_tx_queue > MAX_TX_QUEUE_PER_PORT)
+			n_tx_queue = MAX_TX_QUEUE_PER_PORT;
+		printf("Creating queues: nb_rxq=%d nb_txq=%u... ",
+			nb_rx_queue, (unsigned)n_tx_queue);
+		ret = rte_eth_dev_configure(portid, nb_rx_queue,
+					(uint16_t)n_tx_queue, &port_conf);
+		if (ret < 0)
+			rte_exit(EXIT_FAILURE, "Cannot configure device: err=%d, port=%d\n",
+				ret, portid);
+
+		rte_eth_macaddr_get(portid, &ports_eth_addr[portid]);
+		print_ethaddr(" Address:", &ports_eth_addr[portid]);
+		printf(", ");
+		print_ethaddr("Destination:",
+			(const struct ether_addr *)&dest_eth_addr[portid]);
+		printf(", ");
+
+		/*
+		 * prepare src MACs for each port.
+		 */
+		ether_addr_copy(&ports_eth_addr[portid],
+			(struct ether_addr *)(val_eth + portid) + 1);
+
+		/* init memory */
+		ret = init_mem(NB_MBUF);
+		if (ret < 0)
+			rte_exit(EXIT_FAILURE, "init_mem failed\n");
+
+		/* init one TX queue per couple (lcore,port) */
+		queueid = 0;
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+
+			if (numa_on)
+				socketid = (uint8_t)rte_lcore_to_socket_id(lcore_id);
+			else
+				socketid = 0;
+
+			printf("txq=%u,%d,%d ", lcore_id, queueid, socketid);
+			fflush(stdout);
+
+			rte_eth_dev_info_get(portid, &dev_info);
+			txconf = &dev_info.default_txconf;
+			if (port_conf.rxmode.jumbo_frame)
+				txconf->txq_flags = 0;
+			ret = rte_eth_tx_queue_setup(portid, queueid, nb_txd,
+						     socketid, txconf);
+			if (ret < 0)
+				rte_exit(EXIT_FAILURE, "rte_eth_tx_queue_setup: err=%d, "
+					"port=%d\n", ret, portid);
+
+			tx_thread[lcore_id].tx_queue_id[portid] = queueid;
+			queueid++;
+		}
+		printf("\n");
+	}
+
+	for (i = 0; i < n_rx_thread; i++) {
+		lcore_id = rx_thread[i].conf.lcore_id;
+
+		if (rte_lcore_is_enabled(lcore_id) == 0) {
+			rte_exit(EXIT_FAILURE,
+					"Cannot start Rx thread on lcore %u: lcore disabled\n",
+					lcore_id
+				);
+		}
+
+		printf("\nInitializing rx queues for Rx thread %d on lcore %u ... ",
+				i, lcore_id);
+		fflush(stdout);
+
+		/* init RX queues */
+		for (queue = 0; queue < rx_thread[i].n_rx_queue; ++queue) {
+			portid = rx_thread[i].rx_queue_list[queue].port_id;
+			queueid = rx_thread[i].rx_queue_list[queue].queue_id;
+
+			if (numa_on)
+				socketid = (uint8_t)rte_lcore_to_socket_id(lcore_id);
+			else
+				socketid = 0;
+
+			printf("rxq=%d,%d,%d ", portid, queueid, socketid);
+			fflush(stdout);
+
+			ret = rte_eth_rx_queue_setup(portid, queueid, nb_rxd,
+					socketid,
+					NULL,
+					pktmbuf_pool[socketid]);
+			if (ret < 0)
+				rte_exit(EXIT_FAILURE, "rte_eth_rx_queue_setup: err=%d, "
+						"port=%d\n", ret, portid);
+		}
+	}
+
+	printf("\n");
+
+	/* start ports */
+	for (portid = 0; portid < nb_ports; portid++) {
+		if ((enabled_port_mask & (1 << portid)) == 0)
+			continue;
+
+		/* Start device */
+		ret = rte_eth_dev_start(portid);
+		if (ret < 0)
+			rte_exit(EXIT_FAILURE, "rte_eth_dev_start: err=%d, port=%d\n",
+				ret, portid);
+
+		/*
+		 * If enabled, put device in promiscuous mode.
+		 * This allows IO forwarding mode to forward packets
+		 * to itself through 2 cross-connected ports of the
+		 * target machine.
+		 */
+		if (promiscuous_on)
+			rte_eth_promiscuous_enable(portid);
+	}
+
+	check_all_ports_link_status((uint8_t)nb_ports, enabled_port_mask);
+
+	if (lthreads_on) {
+		printf("Starting L-Threading Model\n");
+
+#if (APP_CPU_LOAD > 0)
+		if (cpu_load_lcore_id > 0)
+			/* Use one lcore for cpu load collector */
+			nb_lcores--;
+#endif
+
+		lthread_num_schedulers_set(nb_lcores);
+		rte_eal_mp_remote_launch(sched_spawner, NULL, SKIP_MASTER);
+		lthread_master_spawner(NULL);
+
+	} else {
+		printf("Starting P-Threading Model\n");
+		/* launch per-lcore init on every lcore */
+		rte_eal_mp_remote_launch(pthread_run, NULL, CALL_MASTER);
+		RTE_LCORE_FOREACH_SLAVE(lcore_id) {
+			if (rte_eal_wait_lcore(lcore_id) < 0)
+				return -1;
+		}
+	}
+
+	return 0;
+}
-- 
2.1.4

* [dpdk-dev] [PATCH v4 2/2] examples: add pthread-shim in performance-thread sample app
  2015-12-02 14:56 [dpdk-dev] [PATCH v4 1/2] examples: add performance thread sample application ibetts
@ 2015-12-02 14:56 ` ibetts
  0 siblings, 0 replies; 2+ messages in thread
From: ibetts @ 2015-12-02 14:56 UTC (permalink / raw)
  To: dev; +Cc: Ian Betts

From: Ian Betts <ian.betts@intel.com>

This commit adds a simple pthread_shim example for the
cooperative scheduler included with this patchset.

The shim demonstrates a way in which legacy code written for
pthreads could be adapted to lightweight threads.

Signed-off-by: Ian Betts <ian.betts@intel.com>
---
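For orientation, the adaptor mechanism documented in this patch (a table
of genuine function addresses plus a run-time override switch) can be
sketched as below. This is a minimal illustration under stated
assumptions, not the code in pthread_shim.c: every name (genuine,
shim_override_set, shim_yield, fake_lthread_yield, adapted_calls) is
hypothetical, sched_yield() stands in for a pthread call, and the table
entry is filled with the function's address directly rather than through
the dynamic loader, so that the sketch builds without extra link flags.

```c
#include <sched.h>

/* All names below are hypothetical; none are taken from pthread_shim.c. */

typedef int (*yield_fn_t)(void);

/*
 * Table of genuine entry points. The documentation describes filling
 * such a table through the dynamic loader; here the address is taken
 * directly (an assumption made to keep the sketch self-contained).
 */
static struct {
	yield_fn_t yield;
} genuine = { .yield = sched_yield };

static int override_enabled;	/* default state: disabled */
static unsigned adapted_calls;	/* calls routed to the stand-in */

/* Stand-in for the L-thread call that a real adaptor would make. */
static int fake_lthread_yield(void)
{
	adapted_calls++;
	return 0;
}

/* Switch the adaptor behaviour on (non-zero) or off (zero). */
void shim_override_set(int state)
{
	override_enabled = (state != 0);
}

/* Adaptor stub: stand-in path when enabled, genuine function otherwise. */
int shim_yield(void)
{
	if (override_enabled)
		return fake_lthread_yield();
	return genuine.yield();
}
```

With the default (disabled) state shim_yield() reaches sched_yield();
after shim_override_set(1) it reaches the stand-in instead. This mirrors
the way the example application enables the override only once it is
running inside an L-thread.
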
 doc/guides/sample_app_ug/performance_thread.rst    | 114 ++++
 examples/performance-thread/Makefile               |   2 +
 examples/performance-thread/pthread_shim/Makefile  |  60 ++
 examples/performance-thread/pthread_shim/main.c    | 284 ++++++++
 .../performance-thread/pthread_shim/pthread_shim.c | 714 +++++++++++++++++++++
 .../performance-thread/pthread_shim/pthread_shim.h | 113 ++++
 6 files changed, 1287 insertions(+)
 create mode 100644 examples/performance-thread/pthread_shim/Makefile
 create mode 100644 examples/performance-thread/pthread_shim/main.c
 create mode 100644 examples/performance-thread/pthread_shim/pthread_shim.c
 create mode 100644 examples/performance-thread/pthread_shim/pthread_shim.h

diff --git a/doc/guides/sample_app_ug/performance_thread.rst b/doc/guides/sample_app_ug/performance_thread.rst
index 6ea83cc..d71bb84 100644
--- a/doc/guides/sample_app_ug/performance_thread.rst
+++ b/doc/guides/sample_app_ug/performance_thread.rst
@@ -1102,6 +1102,120 @@ it the local data it needs, and pick up the new logical core specific values
 from pthread local storage at its new home.
 
 
+.. _pthread_shim:
+
+Pthread shim
+~~~~~~~~~~~~
+
+A convenient way to get something working with legacy code can be to use a
+shim that adapts pthread API calls to the corresponding L-thread ones.
+This approach will not mitigate any of the porting considerations mentioned
+in the previous sections, but it will reduce the amount of code churn that
+would otherwise have been involved. It is a reasonable approach to evaluate
+L-threads, before investing effort in porting to the native L-thread APIs.
+
+
+Overview
+^^^^^^^^
+The L-thread subsystem includes an example pthread shim. This is a partial
+implementation but does contain the API stubs needed to get basic applications
+running. There is a simple "hello world" application that demonstrates the
+use of the pthread shim.
+
+A subtlety of working with a shim is that the application will still need
+to make use of the genuine pthread library functions, at the very least in
+order to create the EAL threads in which the L-thread schedulers will run.
+This is the case during DPDK initialization and exit.
+
+To deal with the initialization and shutdown scenarios, the shim is capable of
+switching its adaptor functionality on or off. An application can control this
+behavior by calling the function ``pt_override_set()``. The default state
+is disabled.
+
+The pthread shim uses the dynamic linker/loader and saves the loaded addresses
+of the genuine pthread API functions in an internal table. When the shim
+functionality is enabled it performs the adaptor function; when disabled it
+invokes the genuine pthread function.
+
+The function ``pthread_exit()`` has additional special handling. The standard
+system header file pthread.h declares ``pthread_exit()`` with
+``__attribute__((noreturn))``. This is an optimization that is possible because
+the pthread is terminating: it enables the compiler to omit the normal
+handling of stack and protection of registers, since the function is not
+expected to return and in fact the thread is being destroyed. These
+optimizations are applied in both the callee and the caller of the
+``pthread_exit()`` function.
+
+In our cooperative scheduling environment this behavior is inadmissible. The
+pthread is the L-thread scheduler thread, and, although an L-thread is
+terminating, there must be a return to the scheduler in order that the system
+can continue to run. Further, returning from a function with attribute
+``noreturn`` is invalid and may result in undefined behavior.
+
+The solution is to redefine the ``pthread_exit`` function with a macro,
+causing it to be mapped to a stub function in the shim that does not have the
+``noreturn`` attribute. This macro is defined in the file
+``pthread_shim.h``. The stub function is otherwise no different from any of
+the other stub functions in the shim, and will switch between the real
+``pthread_exit()`` function and the ``lthread_exit()`` function as
+required. The only difference is that the mapping to the stub is done by
+macro substitution.
+
+A consequence of this is that the file ``pthread_shim.h`` must be included in
+legacy code wishing to make use of the shim. It also means that dynamic
+linkage of a pre-compiled binary that did not include ``pthread_shim.h`` is
+not supported.
+
+Given the requirements for porting legacy code outlined in
+:ref:`porting_legacy_code_to_run_on_lthreads`, most applications will require
+at least some minimal adjustment and recompilation to run on L-threads, so
+pre-compiled binaries are unlikely to be met in practice.
+
+In summary, the shim approach adds some overhead but can be a useful tool to
+help establish the feasibility of a code reuse project. It is also a fairly
+straightforward task to extend the shim if necessary.
+
+**Note:** Bearing in mind the preceding discussions about the impact of making
+blocking calls, switching the shim in and out on the fly to invoke any
+pthread API that might block is something that should typically be avoided.
+
+
+Building and running the pthread shim
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The shim example application is located in the ``pthread_shim`` subfolder of
+the performance-thread sample application folder.
+
+To build and run the pthread shim example:
+
+#. Go to the example applications folder:
+
+   .. code-block:: console
+
+       export RTE_SDK=/path/to/rte_sdk
+       cd ${RTE_SDK}/examples/performance-thread/pthread_shim
+
+
+#. Set the target (a default target is used if not specified). For example:
+
+   .. code-block:: console
+
+       export RTE_TARGET=x86_64-native-linuxapp-gcc
+
+   See the DPDK Getting Started Guide for possible RTE_TARGET values.
+
+#. Build the application:
+
+   .. code-block:: console
+
+       make
+
+#. Run the pthread_shim example:
+
+   .. code-block:: console
+
+       lthread_pthread_shim -c core_mask -n number_of_channels
+
 .. _lthread_diagnostics:
 
 L-thread Diagnostics
diff --git a/examples/performance-thread/Makefile b/examples/performance-thread/Makefile
index fce4f79..1be67e9 100644
--- a/examples/performance-thread/Makefile
+++ b/examples/performance-thread/Makefile
@@ -39,6 +39,8 @@ RTE_TARGET ?= x86_64-native-linuxapp-gcc
 include $(RTE_SDK)/mk/rte.vars.mk
 
 DIRS-$(CONFIG_RTE_PERFORMANCE_THREAD) += l3fwd-thread
+DIRS-$(CONFIG_RTE_PERFORMANCE_THREAD) += pthread_shim
+
 
 
 
diff --git a/examples/performance-thread/pthread_shim/Makefile b/examples/performance-thread/pthread_shim/Makefile
new file mode 100644
index 0000000..9cf32e3
--- /dev/null
+++ b/examples/performance-thread/pthread_shim/Makefile
@@ -0,0 +1,60 @@
+#   BSD LICENSE
+#
+#   Copyright(c) 2015 Intel Corporation. All rights reserved.
+#   All rights reserved.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Intel Corporation nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+ifeq ($(RTE_SDK),)
+$(error "Please define RTE_SDK environment variable")
+endif
+
+# Default target, can be overridden by command line or environment
+RTE_TARGET ?= x86_64-native-linuxapp-gcc
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# binary name
+APP = lthread_pthread_shim
+
+# all source are stored in SRCS-y
+SRCS-y := main.c  pthread_shim.c
+INCLUDES := -I$(RTE_SDK)/$(RTE_TARGET)/include -I$(SRCDIR)
+include $(RTE_SDK)/examples/performance-thread/common/common.mk
+
+CFLAGS=    -g -O3 $(USER_FLAGS) $(INCLUDES)
+CFLAGS += $(WERROR_FLAGS)
+
+LDFLAGS += -lpthread
+
+# workaround for a gcc bug with noreturn attribute
+# http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12603
+ifeq ($(CONFIG_RTE_TOOLCHAIN_GCC),y)
+CFLAGS_main.o += -Wno-return-type
+endif
+
+include $(RTE_SDK)/mk/rte.extapp.mk
diff --git a/examples/performance-thread/pthread_shim/main.c b/examples/performance-thread/pthread_shim/main.c
new file mode 100644
index 0000000..2f67c1b
--- /dev/null
+++ b/examples/performance-thread/pthread_shim/main.c
@@ -0,0 +1,284 @@
+
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2015 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <inttypes.h>
+#include <sys/types.h>
+#include <string.h>
+#include <sys/queue.h>
+#include <stdarg.h>
+#include <errno.h>
+#include <getopt.h>
+#include <unistd.h>
+#include <sched.h>
+#include <pthread.h>
+
+#include <rte_config.h>
+#include <rte_common.h>
+#include <rte_lcore.h>
+#include <rte_per_lcore.h>
+#include <rte_timer.h>
+
+#include "lthread_api.h"
+#include "lthread_diag_api.h"
+#include "pthread_shim.h"
+
+#define DEBUG_APP 0
+#define HELLOW_WORLD_MAX_LTHREADS 10
+
+__thread int print_count;
+__thread pthread_mutex_t print_lock;
+
+__thread pthread_mutex_t exit_lock;
+__thread pthread_cond_t exit_cond;
+
+/*
+ * A simple thread that demonstrates use of a mutex, a condition
+ * variable, thread local storage, explicit yield, and thread exit.
+ *
+ * The thread uses a mutex to protect a shared counter, which it increments,
+ * and then it waits on a condition variable before exiting.
+ *
+ * The thread argument is stored in and retrieved from TLS, using
+ * the pthread key create, get and set specific APIs.
+ *
+ * The thread yields while holding the mutex, to provide opportunity
+ * for other threads to contend.
+ *
+ * All of the pthread API functions used by this thread are actually
+ * resolved to corresponding lthread functions by the pthread shim
+ * implemented in pthread_shim.c
+ */
+void *helloworld_pthread(void *arg);
+void *helloworld_pthread(void *arg)
+{
+	pthread_key_t key;
+
+	/* create a key for TLS */
+	pthread_key_create(&key, NULL);
+
+	/* store the arg in TLS */
+	pthread_setspecific(key, arg);
+
+	/* grab lock and increment shared counter */
+	pthread_mutex_lock(&print_lock);
+	print_count++;
+
+	/* yield thread to give opportunity for lock contention */
+	pthread_yield();
+
+	/* retrieve arg from TLS */
+	uint64_t thread_no = (uint64_t) pthread_getspecific(key);
+
+	printf("Hello - lcore = %d count = %d thread_no = %d thread_id = %p\n",
+			sched_getcpu(),
+			print_count,
+			(int) thread_no,
+			(void *)pthread_self());
+
+	/* release the lock */
+	pthread_mutex_unlock(&print_lock);
+
+	/*
+	 * wait on condition variable
+	 * before exiting
+	 */
+	pthread_mutex_lock(&exit_lock);
+	pthread_cond_wait(&exit_cond, &exit_lock);
+	pthread_mutex_unlock(&exit_lock);
+
+	/* exit */
+	pthread_exit((void *) thread_no);
+}
+
+
+/*
+ * This is the initial thread
+ *
+ * It demonstrates pthread, mutex and condition variable creation,
+ * broadcast and pthread join APIs.
+ *
+ * This initial thread must always start life as an lthread.
+ *
+ * This thread creates many more threads then waits a short time
+ * before signalling them to exit using a broadcast.
+ *
+ * All of the pthread API functions used by this thread are actually
+ * resolved to corresponding lthread functions by the pthread shim
+ * implemented in pthread_shim.c
+ *
+ * After all threads have finished the lthread scheduler is shut down
+ * and normal pthread operation is restored.
+ */
+__thread pthread_t tid[HELLOW_WORLD_MAX_LTHREADS];
+
+static void initial_lthread(void *args);
+static void initial_lthread(void *args __attribute__((unused)))
+{
+	int lcore = (int) rte_lcore_id();
+	/*
+	 *
+	 * We can now enable pthread API override
+	 * and start to use the pthread APIs
+	 */
+	pthread_override_set(1);
+
+	uint64_t i;
+
+	/* initialize mutex for shared counter */
+	print_count = 0;
+	pthread_mutex_init(&print_lock, NULL);
+
+	/* initialize mutex and condition variable controlling thread exit */
+	pthread_mutex_init(&exit_lock, NULL);
+	pthread_cond_init(&exit_cond, NULL);
+
+	/* spawn a number of threads */
+	for (i = 0; i < HELLOW_WORLD_MAX_LTHREADS; i++) {
+
+		/*
+		 * Not strictly necessary but
+		 * for the sake of this example
+		 * use an attribute to pass the desired lcore
+		 */
+		pthread_attr_t attr;
+		cpu_set_t cpuset;
+
+		CPU_ZERO(&cpuset);
+		CPU_SET(lcore, &cpuset);
+		pthread_attr_init(&attr);
+		pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpuset);
+
+		/* create the thread */
+		pthread_create(&tid[i], &attr, helloworld_pthread, (void *) i);
+	}
+
+	/* wait for 1s to allow threads
+	 * to block on the condition variable
+	 * N.B. nanosleep() is resolved to lthread_sleep()
+	 * by the shim.
+	 */
+	struct timespec time;
+
+	time.tv_sec = 1;
+	time.tv_nsec = 0;
+	nanosleep(&time, NULL);
+
+	/* wake up all the threads */
+	pthread_cond_broadcast(&exit_cond);
+
+	/* wait for them to finish */
+	for (i = 0; i < HELLOW_WORLD_MAX_LTHREADS; i++) {
+
+		uint64_t thread_no = 0;
+
+		pthread_join(tid[i], (void **) &thread_no);
+		if (thread_no != i)
+			printf("error on thread exit\n");
+	}
+
+	/* shutdown the lthread scheduler */
+	lthread_scheduler_shutdown(rte_lcore_id());
+	lthread_detach();
+}
+
+
+
+/* This thread creates a single initial lthread
+ * and then runs the scheduler.
+ * An instance of this thread is created on each lcore
+ * in the core mask
+ */
+static int
+lthread_scheduler(void *args);
+static int
+lthread_scheduler(void *args __attribute__((unused)))
+{
+	/* create initial thread  */
+	struct lthread *lt;
+
+	lthread_create(&lt, -1, initial_lthread, (void *) NULL);
+
+	/* run the lthread scheduler */
+	lthread_run();
+
+	/* restore genuine pthread operation */
+	pthread_override_set(0);
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	int num_sched = 0;
+
+	/* basic DPDK initialization is all that is necessary to run lthreads */
+	int ret = rte_eal_init(argc, argv);
+
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE, "Invalid EAL parameters\n");
+
+	/* enable timer subsystem */
+	rte_timer_subsystem_init();
+
+#if DEBUG_APP
+	lthread_diagnostic_set_mask(LT_DIAG_ALL);
+#endif
+
+	/* create a scheduler on every core in the core mask
+	 * and launch an initial lthread that will spawn many more.
+	 */
+	unsigned lcore_id;
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		if (rte_lcore_is_enabled(lcore_id))
+			num_sched++;
+	}
+
+	/* set the number of schedulers; this forces all schedulers to
+	 * synchronize before entering their main loop
+	 */
+	lthread_num_schedulers_set(num_sched);
+
+	/* launch all threads */
+	rte_eal_mp_remote_launch(lthread_scheduler, (void *)NULL, CALL_MASTER);
+
+	/* wait for threads to stop */
+	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
+		rte_eal_wait_lcore(lcore_id);
+	}
+	return 0;
+}
diff --git a/examples/performance-thread/pthread_shim/pthread_shim.c b/examples/performance-thread/pthread_shim/pthread_shim.c
new file mode 100644
index 0000000..e55077e
--- /dev/null
+++ b/examples/performance-thread/pthread_shim/pthread_shim.c
@@ -0,0 +1,714 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2015 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <errno.h>
+#define __USE_GNU
+#include <sched.h>
+#include <dlfcn.h>
+
+#include <rte_config.h>
+#include <rte_log.h>
+
+#include "lthread_api.h"
+#include "pthread_shim.h"
+
+#define RTE_LOGTYPE_PTHREAD_SHIM RTE_LOGTYPE_USER3
+
+#define POSIX_ERRNO(x)  (x)
+
+/*
+ * this flag determines at run time if we override pthread
+ * calls and map them to equivalent lthread calls
+ * or if we call the standard pthread function
+ */
+static __thread int override;
+
+
+/*
+ * this structure contains function pointers that will be
+ * initialised to the loaded address of the real
+ * pthread library API functions
+ */
+struct pthread_lib_funcs {
+int (*f_pthread_barrier_destroy)
+	(pthread_barrier_t *);
+int (*f_pthread_barrier_init)
+	(pthread_barrier_t *, const pthread_barrierattr_t *, unsigned);
+int (*f_pthread_barrier_wait)
+	(pthread_barrier_t *);
+int (*f_pthread_cond_broadcast)
+	(pthread_cond_t *);
+int (*f_pthread_cond_destroy)
+	(pthread_cond_t *);
+int (*f_pthread_cond_init)
+	(pthread_cond_t *, const pthread_condattr_t *);
+int (*f_pthread_cond_signal)
+	(pthread_cond_t *);
+int (*f_pthread_cond_timedwait)
+	(pthread_cond_t *, pthread_mutex_t *, const struct timespec *);
+int (*f_pthread_cond_wait)
+	(pthread_cond_t *, pthread_mutex_t *);
+int (*f_pthread_create)
+	(pthread_t *, const pthread_attr_t *, void *(*)(void *), void *);
+int (*f_pthread_detach)
+	(pthread_t);
+int (*f_pthread_equal)
+	(pthread_t, pthread_t);
+void (*f_pthread_exit)
+	(void *);
+void * (*f_pthread_getspecific)
+	(pthread_key_t);
+int (*f_pthread_getcpuclockid)
+	(pthread_t, clockid_t *);
+int (*f_pthread_join)
+	(pthread_t, void **);
+int (*f_pthread_key_create)
+	(pthread_key_t *, void (*) (void *));
+int (*f_pthread_key_delete)
+	(pthread_key_t);
+int (*f_pthread_mutex_destroy)
+	(pthread_mutex_t *__mutex);
+int (*f_pthread_mutex_init)
+	(pthread_mutex_t *__mutex, const pthread_mutexattr_t *);
+int (*f_pthread_mutex_lock)
+	(pthread_mutex_t *__mutex);
+int (*f_pthread_mutex_trylock)
+	(pthread_mutex_t *__mutex);
+int (*f_pthread_mutex_timedlock)
+	(pthread_mutex_t *__mutex, const struct timespec *);
+int (*f_pthread_mutex_unlock)
+	(pthread_mutex_t *__mutex);
+int (*f_pthread_once)
+	(pthread_once_t *, void (*) (void));
+int (*f_pthread_rwlock_destroy)
+	(pthread_rwlock_t *__rwlock);
+int (*f_pthread_rwlock_init)
+	(pthread_rwlock_t *__rwlock, const pthread_rwlockattr_t *);
+int (*f_pthread_rwlock_rdlock)
+	(pthread_rwlock_t *__rwlock);
+int (*f_pthread_rwlock_timedrdlock)
+	(pthread_rwlock_t *__rwlock, const struct timespec *);
+int (*f_pthread_rwlock_timedwrlock)
+	(pthread_rwlock_t *__rwlock, const struct timespec *);
+int (*f_pthread_rwlock_tryrdlock)
+	(pthread_rwlock_t *__rwlock);
+int (*f_pthread_rwlock_trywrlock)
+	(pthread_rwlock_t *__rwlock);
+int (*f_pthread_rwlock_unlock)
+	(pthread_rwlock_t *__rwlock);
+int (*f_pthread_rwlock_wrlock)
+	(pthread_rwlock_t *__rwlock);
+pthread_t (*f_pthread_self)
+	(void);
+int (*f_pthread_setspecific)
+	(pthread_key_t, const void *);
+int (*f_pthread_spin_init)
+	(pthread_spinlock_t *__spin, int);
+int (*f_pthread_spin_destroy)
+	(pthread_spinlock_t *__spin);
+int (*f_pthread_spin_lock)
+	(pthread_spinlock_t *__spin);
+int (*f_pthread_spin_trylock)
+	(pthread_spinlock_t *__spin);
+int (*f_pthread_spin_unlock)
+	(pthread_spinlock_t *__spin);
+int (*f_pthread_cancel)
+	(pthread_t);
+int (*f_pthread_setcancelstate)
+	(int, int *);
+int (*f_pthread_setcanceltype)
+	(int, int *);
+void (*f_pthread_testcancel)
+	(void);
+int (*f_pthread_getschedparam)
+	(pthread_t pthread, int *, struct sched_param *);
+int (*f_pthread_setschedparam)
+	(pthread_t, int, const struct sched_param *);
+int (*f_pthread_yield)
+	(void);
+int (*f_pthread_setaffinity_np)
+	(pthread_t thread, size_t cpusetsize, const cpu_set_t *cpuset);
+int (*f_nanosleep)
+	(const struct timespec *req, struct timespec *rem);
+} _sys_pthread_funcs = {
+	.f_pthread_barrier_destroy = NULL,
+};
+
+
+/*
+ * this macro obtains the loaded address of a library function
+ * and saves it.
+ */
+static void *__libc_dl_handle = RTLD_NEXT;
+
+#define get_addr_of_loaded_symbol(name) do {				\
+	char *error_str;						\
+	_sys_pthread_funcs.f_##name = dlsym(__libc_dl_handle, (#name));	\
+	error_str = dlerror();						\
+	if (error_str != NULL) {					\
+		fprintf(stderr, "%s\n", error_str);			\
+		abort();						\
+	}								\
+} while (0)
+
+
+/*
+ * The constructor function initialises the
+ * function pointers for pthread library functions
+ */
+void
+pthread_intercept_ctor(void)__attribute__((constructor));
+void
+pthread_intercept_ctor(void)
+{
+	override = 0;
+	/*
+	 * Get the original functions
+	 */
+	get_addr_of_loaded_symbol(pthread_barrier_destroy);
+	get_addr_of_loaded_symbol(pthread_barrier_init);
+	get_addr_of_loaded_symbol(pthread_barrier_wait);
+	get_addr_of_loaded_symbol(pthread_cond_broadcast);
+	get_addr_of_loaded_symbol(pthread_cond_destroy);
+	get_addr_of_loaded_symbol(pthread_cond_init);
+	get_addr_of_loaded_symbol(pthread_cond_signal);
+	get_addr_of_loaded_symbol(pthread_cond_timedwait);
+	get_addr_of_loaded_symbol(pthread_cond_wait);
+	get_addr_of_loaded_symbol(pthread_create);
+	get_addr_of_loaded_symbol(pthread_detach);
+	get_addr_of_loaded_symbol(pthread_equal);
+	get_addr_of_loaded_symbol(pthread_exit);
+	get_addr_of_loaded_symbol(pthread_getspecific);
+	get_addr_of_loaded_symbol(pthread_getcpuclockid);
+	get_addr_of_loaded_symbol(pthread_join);
+	get_addr_of_loaded_symbol(pthread_key_create);
+	get_addr_of_loaded_symbol(pthread_key_delete);
+	get_addr_of_loaded_symbol(pthread_mutex_destroy);
+	get_addr_of_loaded_symbol(pthread_mutex_init);
+	get_addr_of_loaded_symbol(pthread_mutex_lock);
+	get_addr_of_loaded_symbol(pthread_mutex_trylock);
+	get_addr_of_loaded_symbol(pthread_mutex_timedlock);
+	get_addr_of_loaded_symbol(pthread_mutex_unlock);
+	get_addr_of_loaded_symbol(pthread_once);
+	get_addr_of_loaded_symbol(pthread_rwlock_destroy);
+	get_addr_of_loaded_symbol(pthread_rwlock_init);
+	get_addr_of_loaded_symbol(pthread_rwlock_rdlock);
+	get_addr_of_loaded_symbol(pthread_rwlock_timedrdlock);
+	get_addr_of_loaded_symbol(pthread_rwlock_timedwrlock);
+	get_addr_of_loaded_symbol(pthread_rwlock_tryrdlock);
+	get_addr_of_loaded_symbol(pthread_rwlock_trywrlock);
+	get_addr_of_loaded_symbol(pthread_rwlock_unlock);
+	get_addr_of_loaded_symbol(pthread_rwlock_wrlock);
+	get_addr_of_loaded_symbol(pthread_self);
+	get_addr_of_loaded_symbol(pthread_setspecific);
+	get_addr_of_loaded_symbol(pthread_spin_init);
+	get_addr_of_loaded_symbol(pthread_spin_destroy);
+	get_addr_of_loaded_symbol(pthread_spin_lock);
+	get_addr_of_loaded_symbol(pthread_spin_trylock);
+	get_addr_of_loaded_symbol(pthread_spin_unlock);
+	get_addr_of_loaded_symbol(pthread_cancel);
+	get_addr_of_loaded_symbol(pthread_setcancelstate);
+	get_addr_of_loaded_symbol(pthread_setcanceltype);
+	get_addr_of_loaded_symbol(pthread_testcancel);
+	get_addr_of_loaded_symbol(pthread_getschedparam);
+	get_addr_of_loaded_symbol(pthread_setschedparam);
+	get_addr_of_loaded_symbol(pthread_yield);
+	get_addr_of_loaded_symbol(pthread_setaffinity_np);
+	get_addr_of_loaded_symbol(nanosleep);
+}
+
+
+/*
+ * Enable/Disable pthread override
+ * state
+ *  0 disable
+ *  1 enable
+ */
+void pthread_override_set(int state)
+{
+	override = state;
+}
+
+
+/*
+ * Return pthread override state
+ * return
+ *  0 disable
+ *  1 enable
+ */
+int pthread_override_get(void)
+{
+	return override;
+}
+
+/*
+ * This macro is used to catch and log
+ * invocation of stubs for unimplemented pthread
+ * API functions.
+ */
+#define NOT_IMPLEMENTED do {				\
+	if (override) {					\
+		RTE_LOG(WARNING,			\
+			PTHREAD_SHIM,			\
+			"WARNING %s NOT IMPLEMENTED\n",	\
+			__func__);			\
+	}						\
+} while (0)
+
+/*
+ * pthread API override functions follow
+ * Note in this example code only a subset of functions are
+ * implemented.
+ *
+ * The stub functions provided will issue a warning log
+ * message if an unimplemented function is invoked
+ *
+ */
+
+int pthread_barrier_destroy(pthread_barrier_t *a)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_barrier_destroy(a);
+}
+
+int
+pthread_barrier_init(pthread_barrier_t *a,
+		     const pthread_barrierattr_t *b, unsigned c)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_barrier_init(a, b, c);
+}
+
+int pthread_barrier_wait(pthread_barrier_t *a)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_barrier_wait(a);
+}
+
+int pthread_cond_broadcast(pthread_cond_t *cond)
+{
+	if (override) {
+
+		lthread_cond_broadcast(*(struct lthread_cond **)cond);
+		return 0;
+	}
+	return _sys_pthread_funcs.f_pthread_cond_broadcast(cond);
+}
+
+int pthread_cond_destroy(pthread_cond_t *cond)
+{
+	if (override)
+		return -lthread_cond_destroy(*(struct lthread_cond **)cond);
+	return _sys_pthread_funcs.f_pthread_cond_destroy(cond);
+}
+
+int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr)
+{
+	if (override)
+		return -lthread_cond_init(NULL,
+				(struct lthread_cond **)cond,
+				(const struct lthread_condattr *) attr);
+	return _sys_pthread_funcs.f_pthread_cond_init(cond, attr);
+}
+
+int pthread_cond_signal(pthread_cond_t *cond)
+{
+	if (override) {
+		lthread_cond_signal(*(struct lthread_cond **)cond);
+		return 0;
+	}
+	return _sys_pthread_funcs.f_pthread_cond_signal(cond);
+}
+
+int
+pthread_cond_timedwait(pthread_cond_t *__restrict cond,
+		       pthread_mutex_t *__restrict mutex,
+		       const struct timespec *__restrict time)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_cond_timedwait(cond, mutex, time);
+}
+
+int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
+{
+	if (override) {
+		pthread_mutex_unlock(mutex);
+		int rv = lthread_cond_wait(*(struct lthread_cond **)cond, 0);
+
+		pthread_mutex_lock(mutex);
+		return rv;
+	}
+	return _sys_pthread_funcs.f_pthread_cond_wait(cond, mutex);
+}
+
+int
+pthread_create(pthread_t *__restrict tid,
+		const pthread_attr_t *__restrict attr,
+		void *(func) (void *),
+	       void *__restrict arg)
+{
+	if (override) {
+		int lcore = -1;
+
+		if (attr != NULL) {
+			/* determine CPU being requested */
+			cpu_set_t cpuset;
+
+			CPU_ZERO(&cpuset);
+			pthread_attr_getaffinity_np(attr,
+						sizeof(cpu_set_t),
+						&cpuset);
+
+			if (CPU_COUNT(&cpuset) != 1)
+				return POSIX_ERRNO(EINVAL);
+
+			for (lcore = 0; lcore < LTHREAD_MAX_LCORES; lcore++) {
+				if (!CPU_ISSET(lcore, &cpuset))
+					continue;
+				break;
+			}
+		}
+		return lthread_create((struct lthread **)tid, lcore,
+				      (void (*)(void *))func, arg);
+	}
+	return _sys_pthread_funcs.f_pthread_create(tid, attr, func, arg);
+}
+
+int pthread_detach(pthread_t tid)
+{
+	if (override) {
+		/* the shim only supports an lthread detaching itself */
+		if ((struct lthread *)tid == lthread_current()) {
+			lthread_detach();
+			return 0;
+		}
+		NOT_IMPLEMENTED;
+	}
+	return _sys_pthread_funcs.f_pthread_detach(tid);
+}
+
+int pthread_equal(pthread_t a, pthread_t b)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_equal(a, b);
+}
+
+void pthread_exit_override(void *v)
+{
+	if (override) {
+		lthread_exit(v);
+		return;
+	}
+	_sys_pthread_funcs.f_pthread_exit(v);
+}
+
+void
+*pthread_getspecific(pthread_key_t key)
+{
+	if (override)
+		return lthread_getspecific((unsigned int) key);
+	return _sys_pthread_funcs.f_pthread_getspecific(key);
+}
+
+int pthread_getcpuclockid(pthread_t a, clockid_t *b)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_getcpuclockid(a, b);
+}
+
+int pthread_join(pthread_t tid, void **val)
+{
+	if (override)
+		return lthread_join((struct lthread *)tid, val);
+	return _sys_pthread_funcs.f_pthread_join(tid, val);
+}
+
+int pthread_key_create(pthread_key_t *keyptr, void (*dtor) (void *))
+{
+	if (override)
+		return lthread_key_create((unsigned int *)keyptr, dtor);
+	return _sys_pthread_funcs.f_pthread_key_create(keyptr, dtor);
+}
+
+int pthread_key_delete(pthread_key_t key)
+{
+	if (override) {
+		lthread_key_delete((unsigned int) key);
+		return 0;
+	}
+	return _sys_pthread_funcs.f_pthread_key_delete(key);
+}
+
+
+int
+pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *attr)
+{
+	if (override)
+		return lthread_mutex_init(NULL,
+				(struct lthread_mutex **)mutex,
+				(const struct lthread_mutexattr *)attr);
+	return _sys_pthread_funcs.f_pthread_mutex_init(mutex, attr);
+}
+
+int pthread_mutex_lock(pthread_mutex_t *mutex)
+{
+	if (override)
+		return lthread_mutex_lock(*(struct lthread_mutex **)mutex);
+	return _sys_pthread_funcs.f_pthread_mutex_lock(mutex);
+}
+
+int pthread_mutex_trylock(pthread_mutex_t *mutex)
+{
+	if (override)
+		return lthread_mutex_trylock(*(struct lthread_mutex **)mutex);
+	return _sys_pthread_funcs.f_pthread_mutex_trylock(mutex);
+}
+
+int pthread_mutex_timedlock(pthread_mutex_t *mutex, const struct timespec *b)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_mutex_timedlock(mutex, b);
+}
+
+int pthread_mutex_unlock(pthread_mutex_t *mutex)
+{
+	if (override)
+		return lthread_mutex_unlock(*(struct lthread_mutex **)mutex);
+	return _sys_pthread_funcs.f_pthread_mutex_unlock(mutex);
+}
+
+int pthread_once(pthread_once_t *a, void (b) (void))
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_once(a, b);
+}
+
+int pthread_rwlock_destroy(pthread_rwlock_t *a)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_rwlock_destroy(a);
+}
+
+int pthread_rwlock_init(pthread_rwlock_t *a, const pthread_rwlockattr_t *b)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_rwlock_init(a, b);
+}
+
+int pthread_rwlock_rdlock(pthread_rwlock_t *a)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_rwlock_rdlock(a);
+}
+
+int pthread_rwlock_timedrdlock(pthread_rwlock_t *a, const struct timespec *b)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_rwlock_timedrdlock(a, b);
+}
+
+int pthread_rwlock_timedwrlock(pthread_rwlock_t *a, const struct timespec *b)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_rwlock_timedwrlock(a, b);
+}
+
+int pthread_rwlock_tryrdlock(pthread_rwlock_t *a)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_rwlock_tryrdlock(a);
+}
+
+int pthread_rwlock_trywrlock(pthread_rwlock_t *a)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_rwlock_trywrlock(a);
+}
+
+int pthread_rwlock_unlock(pthread_rwlock_t *a)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_rwlock_unlock(a);
+}
+
+int pthread_rwlock_wrlock(pthread_rwlock_t *a)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_rwlock_wrlock(a);
+}
+
+int pthread_yield(void)
+{
+	if (override) {
+		lthread_yield();
+		return 0;
+	}
+	return _sys_pthread_funcs.f_pthread_yield();
+
+}
+
+pthread_t pthread_self(void)
+{
+	if (override)
+		return (pthread_t) lthread_current();
+	return _sys_pthread_funcs.f_pthread_self();
+}
+
+int pthread_setspecific(pthread_key_t key, const void *data)
+{
+	if (override) {
+		int rv =  lthread_setspecific((unsigned int)key, data);
+		return rv;
+	}
+	return _sys_pthread_funcs.f_pthread_setspecific(key, data);
+}
+
+int pthread_spin_init(pthread_spinlock_t *a, int b)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_spin_init(a, b);
+}
+
+int pthread_spin_destroy(pthread_spinlock_t *a)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_spin_destroy(a);
+}
+
+int pthread_spin_lock(pthread_spinlock_t *a)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_spin_lock(a);
+}
+
+int pthread_spin_trylock(pthread_spinlock_t *a)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_spin_trylock(a);
+}
+
+int pthread_spin_unlock(pthread_spinlock_t *a)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_spin_unlock(a);
+}
+
+int pthread_cancel(pthread_t tid)
+{
+	if (override) {
+		lthread_cancel((struct lthread *)tid);
+		return 0;
+	}
+	return _sys_pthread_funcs.f_pthread_cancel(tid);
+}
+
+int pthread_setcancelstate(int a, int *b)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_setcancelstate(a, b);
+}
+
+int pthread_setcanceltype(int a, int *b)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_setcanceltype(a, b);
+}
+
+void pthread_testcancel(void)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_testcancel();
+}
+
+
+int pthread_getschedparam(pthread_t tid, int *a, struct sched_param *b)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_getschedparam(tid, a, b);
+}
+
+int pthread_setschedparam(pthread_t a, int b, const struct sched_param *c)
+{
+	NOT_IMPLEMENTED;
+	return _sys_pthread_funcs.f_pthread_setschedparam(a, b, c);
+}
+
+
+int nanosleep(const struct timespec *req, struct timespec *rem)
+{
+	if (override) {
+		uint64_t ns = 1000000000ULL * req->tv_sec + req->tv_nsec;
+
+		lthread_sleep(ns);
+		return 0;
+	}
+	return _sys_pthread_funcs.f_nanosleep(req, rem);
+}
+
+int
+pthread_setaffinity_np(pthread_t thread, size_t cpusetsize,
+		       const cpu_set_t *cpuset)
+{
+	if (override) {
+		/* we only allow affinity with a single CPU */
+		if (CPU_COUNT(cpuset) != 1)
+			return POSIX_ERRNO(EINVAL);
+
+		/* we only allow the current thread to set its own affinity */
+		struct lthread *lt = (struct lthread *)thread;
+
+		if (lthread_current() != lt)
+			return POSIX_ERRNO(EINVAL);
+
+		/* determine the CPU being requested */
+		int i;
+
+		for (i = 0; i < LTHREAD_MAX_LCORES; i++) {
+			if (!CPU_ISSET(i, cpuset))
+				continue;
+			break;
+		}
+		/* check requested core is allowed */
+		if (i == LTHREAD_MAX_LCORES)
+			return POSIX_ERRNO(EINVAL);
+
+		/* finally we can set affinity to the requested lcore */
+		lthread_set_affinity(i);
+		return 0;
+	}
+	return _sys_pthread_funcs.f_pthread_setaffinity_np(thread, cpusetsize,
+							   cpuset);
+}
diff --git a/examples/performance-thread/pthread_shim/pthread_shim.h b/examples/performance-thread/pthread_shim/pthread_shim.h
new file mode 100644
index 0000000..78bbb5a
--- /dev/null
+++ b/examples/performance-thread/pthread_shim/pthread_shim.h
@@ -0,0 +1,113 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2015 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _PTHREAD_SHIM_H_
+#define _PTHREAD_SHIM_H_
+#include <pthread.h>
+
+/*
+ * This pthread shim is an example that demonstrates how legacy code
+ * that makes use of POSIX pthread services can make use of lthreads
+ * with reduced porting effort.
+ *
+ * N.B. The example is not a complete implementation, only a subset of
+ * pthread APIs sufficient to demonstrate the principle of operation
+ * are implemented.
+ *
+ * In general pthread attribute objects do not have equivalent functions
+ * in lthreads, and are ignored.
+ *
+ * There is one exception and that is the use of attr to specify a
+ * core affinity in calls to pthread_create.
+ *
+ * The shim operates as follows:-
+ *
+ * On initialisation a constructor function uses dlsym to obtain and
+ * save the loaded address of the full set of pthread APIs that will
+ * be overridden.
+ *
+ * For each function there is a stub provided that will invoke either
+ * the genuine pthread library function saved by the constructor,
+ * or else the corresponding equivalent lthread function.
+ *
+ * The stub functions are implemented in pthread_shim.c
+ *
+ * The stub will take care of adapting parameters, and police
+ * any constraints where lthread functionality differs.
+ *
+ * The initial thread must always be a pure lthread.
+ *
+ * The decision whether to invoke the real library function or the lthread
+ * function is controlled by a per-pthread flag that can be switched
+ * on or off by the pthread_override_set() API described below. Typically
+ * this should be done as the first action of the initial lthread.
+ *
+ * N.B. In general it would be poor practice to revert to invoking a real
+ * pthread function when running as an lthread, since these may block and
+ * effectively stall the lthread scheduler.
+ *
+ */
+
+
+/*
+ * An exiting lthread must not terminate the pthread it is running in
+ * since this would mean terminating the lthread scheduler.
+ * We override pthread_exit() with a macro because it is typically declared with
+ * __attribute__((noreturn))
+ */
+void pthread_exit_override(void *v);
+
+#define pthread_exit(v) do { \
+	pthread_exit_override((v));	\
+	return NULL;	\
+} while (0)
+
+/*
+ * Enable/Disable pthread override
+ * state
+ * 0 disable
+ * 1 enable
+ */
+void pthread_override_set(int state);
+
+
+/*
+ * Return pthread override state
+ * return
+ * 0 disable
+ * 1 enable
+ */
+int pthread_override_get(void);
+
+
+#endif /* _PTHREAD_SHIM_H_ */
-- 
2.1.4
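
For readers who want to try the interception technique used by
pthread_shim.c in isolation, here is a minimal sketch. It is not code from
the patch: it wraps only nanosleep(), all shim_* names are hypothetical,
and a counter stands in for the call to lthread_sleep(). It assumes glibc
and a compiler that supports __thread and constructor attributes; older
glibc versions may also need -ldl at link time.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* glibc only exposes RTLD_NEXT with this */
#endif
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <time.h>

/* All shim_* names are hypothetical; they do not appear in the patch. */
static __thread int shim_override;	/* per-thread switch, off by default */
static int shim_intercepted;		/* calls absorbed by the shim */
static int (*real_nanosleep)(const struct timespec *, struct timespec *);

/* Runs before main(): look up the next definition of nanosleep(). */
__attribute__((constructor))
static void shim_ctor(void)
{
	/* POSIX-recommended idiom for storing a dlsym() result */
	*(void **)(&real_nanosleep) = dlsym(RTLD_NEXT, "nanosleep");
}

void shim_override_set(int state)
{
	shim_override = state;
}

int shim_intercepted_count(void)
{
	return shim_intercepted;
}

/* Same signature as libc, so callers bind to this definition. */
int nanosleep(const struct timespec *req, struct timespec *rem)
{
	if (shim_override) {
		/* a real shim would hand off to its own scheduler here */
		shim_intercepted++;
		return 0;
	}
	assert(real_nanosleep != NULL);
	return real_nanosleep(req, rem);
}
```

With the override enabled, a call to nanosleep() returns immediately and
only bumps the counter; with it disabled, the call reaches the libc
implementation through the saved pointer. The switch is per thread, so
other threads in the process keep the genuine behaviour.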

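
Both pthread_create() and pthread_setaffinity_np() in the shim accept a CPU
set only if it names exactly one CPU, and then scan for that CPU. The
helper below sketches that check as a single function, including the
upper-bound test that pthread_setaffinity_np() performs. The function name
and the max_cpus parameter are assumptions made for illustration (the patch
uses LTHREAD_MAX_LCORES as the bound), and it relies on the glibc CPU_*
macros.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for cpu_set_t and the CPU_* macros */
#endif
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/*
 * Return the index of the single CPU in 'set', or -1 if the set does not
 * hold exactly one CPU below 'max_cpus'. Hypothetical helper, not part of
 * the patch.
 */
int single_cpu_in_set(const cpu_set_t *set, int max_cpus)
{
	int cpu;

	assert(set != NULL);

	/* reject empty sets and sets naming more than one CPU */
	if (CPU_COUNT(set) != 1)
		return -1;

	for (cpu = 0; cpu < max_cpus; cpu++) {
		if (CPU_ISSET(cpu, set))
			return cpu;
	}
	return -1;	/* the one CPU in the set is at or beyond max_cpus */
}
```

A caller would map the -1 result to EINVAL, as the shim does, and pass any
other result on as the target lcore.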