DPDK patches and discussions
* [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
@ 2014-08-07 18:31 Konstantin Ananyev
  2014-08-07 20:11 ` Neil Horman
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Konstantin Ananyev @ 2014-08-07 18:31 UTC (permalink / raw)
  To: dev

Make the ACL library build/work on the 'default' architecture:
- make rte_acl_classify_scalar() truly scalar
  (make sure it doesn't use SSE4 intrinsics through resolve_priority()).
- Provide two versions of the rte_acl_classify code path:
  rte_acl_classify_sse() - can be built and used only on systems with SSE4.2
  and above; returns -ENOTSUP on lower archs.
  rte_acl_classify_scalar() - a slower version, but can be built and used
  on all systems.
- Keep the common code shared between these two code paths.

v2 changes:
 Run-time selection of the most appropriate code path for the given ISA.
 By default the highest supported one is selected.
 The user can still override that selection by manually assigning a new value
 to the global function pointer rte_acl_default_classify.
 rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
 points to.
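
An illustrative usage sketch (the wrapper name below is hypothetical; only
the rte_acl_* symbols come from this patch), assuming an already built
context and caller-provided data/results buffers, as in app/test-acl:

	#include <rte_acl.h>

	static int
	classify_force_scalar(const struct rte_acl_ctx *ctx,
		const uint8_t **data, uint32_t *results,
		uint32_t num, uint32_t categories)
	{
		/* override the run-time selected code path with the scalar one */
		rte_acl_default_classify = rte_acl_classify_scalar;

		/* rte_acl_classify() now expands to a call through that pointer */
		return rte_acl_classify(ctx, data, results, num, categories);
	}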


Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test-acl/main.c                |  13 +-
 lib/librte_acl/Makefile            |   5 +-
 lib/librte_acl/acl_bld.c           |   5 +-
 lib/librte_acl/acl_match_check.def |  92 ++++
 lib/librte_acl/acl_run.c           | 944 -------------------------------------
 lib/librte_acl/acl_run.h           | 220 +++++++++
 lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
 lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
 lib/librte_acl/rte_acl.c           |  15 +
 lib/librte_acl/rte_acl.h           |  24 +-
 10 files changed, 1189 insertions(+), 956 deletions(-)
 create mode 100644 lib/librte_acl/acl_match_check.def
 delete mode 100644 lib/librte_acl/acl_run.c
 create mode 100644 lib/librte_acl/acl_run.h
 create mode 100644 lib/librte_acl/acl_run_scalar.c
 create mode 100644 lib/librte_acl/acl_run_sse.c

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index d654409..45c6fa6 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -787,6 +787,10 @@ acx_init(void)
 	/* perform build. */
 	ret = rte_acl_build(config.acx, &cfg);
 
+	/* setup default rte_acl_classify */
+	if (config.scalar)
+		rte_acl_default_classify = rte_acl_classify_scalar;
+
 	dump_verbose(DUMP_NONE, stdout,
 		"rte_acl_build(%u) finished with %d\n",
 		config.bld_categories, ret);
@@ -815,13 +819,8 @@ search_ip5tuples_once(uint32_t categories, uint32_t step, int scalar)
 			v += config.trace_sz;
 		}
 
-		if (scalar != 0)
-			ret = rte_acl_classify_scalar(config.acx, data,
-				results, n, categories);
-
-		else
-			ret = rte_acl_classify(config.acx, data,
-				results, n, categories);
+		ret = rte_acl_classify(config.acx, data, results,
+			n, categories);
 
 		if (ret != 0)
 			rte_exit(ret, "classify for ipv%c_5tuples returns %d\n",
diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
index 4fe4593..65e566d 100644
--- a/lib/librte_acl/Makefile
+++ b/lib/librte_acl/Makefile
@@ -43,7 +43,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += rte_acl.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
-SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run.c
+SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
+SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
+
+CFLAGS_acl_run_sse.o += -msse4.1
 
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include := rte_acl_osdep.h
diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index 873447b..09d58ea 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -31,7 +31,6 @@
  *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
-#include <nmmintrin.h>
 #include <rte_acl.h>
 #include "tb_mem.h"
 #include "acl.h"
@@ -1480,8 +1479,8 @@ acl_calc_wildness(struct rte_acl_build_rule *head,
 
 			switch (rule->config->defs[n].type) {
 			case RTE_ACL_FIELD_TYPE_BITMASK:
-				wild = (size -
-					_mm_popcnt_u32(fld->mask_range.u8)) /
+				wild = (size - __builtin_popcount(
+					fld->mask_range.u8)) /
 					size;
 				break;
 
diff --git a/lib/librte_acl/acl_match_check.def b/lib/librte_acl/acl_match_check.def
new file mode 100644
index 0000000..8ff4ec3
--- /dev/null
+++ b/lib/librte_acl/acl_match_check.def
@@ -0,0 +1,92 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * Creates a definition of the '__func_match_check__' function.
+ * '__func_resolve_priority__' should point to an already defined function.
+ */
+
+#ifndef __func_match_check__
+#error __func_match_check__ undefined
+#endif
+
+#ifndef __func_resolve_priority__
+#error __func_resolve_priority__ undefined
+#endif
+
+
+/*
+ * Detect matches. If a match node transition is found, then this trie
+ * traversal is complete; fill the slot with the next trie
+ * to be processed.
+ */
+static inline uint64_t
+__func_match_check__(uint64_t transition, int slot,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows)
+{
+	const struct rte_acl_match_results *p;
+
+	p = (const struct rte_acl_match_results *)
+		(flows->trans + ctx->match_index);
+
+	if (transition & RTE_ACL_NODE_MATCH) {
+
+		/* Remove flags from index and decrement active traversals */
+		transition &= RTE_ACL_NODE_INDEX;
+		flows->started--;
+
+		/* Resolve priorities for this trie and running results */
+		if (flows->categories == 1)
+			resolve_single_priority(transition, slot, ctx,
+				parms, p);
+		else
+			__func_resolve_priority__(transition, slot, ctx, parms,
+				p, flows->categories);
+
+		/* Count down completed tries for this search request */
+		parms[slot].cmplt->count--;
+
+		/* Fill the slot with the next trie or idle trie */
+		transition = acl_start_next_trie(flows, parms, slot, ctx);
+
+	} else if (transition == ctx->idle) {
+		/* reset indirection table for idle slots */
+		parms[slot].data_index = idle;
+	}
+
+	return transition;
+}
+
+#undef __func_match_check__
+#undef __func_resolve_priority__
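
For context: the two new run-time files instantiate this template by
defining both macros before including it, e.g. acl_run_scalar.c further
down in this patch does:

	#define __func_resolve_priority__	resolve_priority_scalar
	#define __func_match_check__		acl_match_check_scalar
	#include "acl_match_check.def"

which generates acl_match_check_scalar() bound to the scalar
priority-resolution routine.
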
diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
deleted file mode 100644
index e3d9fc1..0000000
--- a/lib/librte_acl/acl_run.c
+++ /dev/null
@@ -1,944 +0,0 @@
-/*-
- *   BSD LICENSE
- *
- *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
- *   All rights reserved.
- *
- *   Redistribution and use in source and binary forms, with or without
- *   modification, are permitted provided that the following conditions
- *   are met:
- *
- *     * Redistributions of source code must retain the above copyright
- *       notice, this list of conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright
- *       notice, this list of conditions and the following disclaimer in
- *       the documentation and/or other materials provided with the
- *       distribution.
- *     * Neither the name of Intel Corporation nor the names of its
- *       contributors may be used to endorse or promote products derived
- *       from this software without specific prior written permission.
- *
- *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
- *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
- *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
- *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#include <rte_acl.h>
-#include "acl_vect.h"
-#include "acl.h"
-
-#define MAX_SEARCHES_SSE8	8
-#define MAX_SEARCHES_SSE4	4
-#define MAX_SEARCHES_SSE2	2
-#define MAX_SEARCHES_SCALAR	2
-
-#define GET_NEXT_4BYTES(prm, idx)	\
-	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
-
-
-#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
-
-#define	SCALAR_QRANGE_MULT	0x01010101
-#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
-#define	SCALAR_QRANGE_MIN	0x80808080
-
-enum {
-	SHUFFLE32_SLOT1 = 0xe5,
-	SHUFFLE32_SLOT2 = 0xe6,
-	SHUFFLE32_SLOT3 = 0xe7,
-	SHUFFLE32_SWAP64 = 0x4e,
-};
-
-/*
- * Structure to manage N parallel trie traversals.
- * The runtime trie traversal routines can process 8, 4, or 2 tries
- * in parallel. Each packet may require multiple trie traversals (up to 4).
- * This structure is used to fill the slots (0 to n-1) for parallel processing
- * with the trie traversals needed for each packet.
- */
-struct acl_flow_data {
-	uint32_t            num_packets;
-	/* number of packets processed */
-	uint32_t            started;
-	/* number of trie traversals in progress */
-	uint32_t            trie;
-	/* current trie index (0 to N-1) */
-	uint32_t            cmplt_size;
-	uint32_t            total_packets;
-	uint32_t            categories;
-	/* number of result categories per packet. */
-	/* maximum number of packets to process */
-	const uint64_t     *trans;
-	const uint8_t     **data;
-	uint32_t           *results;
-	struct completion  *last_cmplt;
-	struct completion  *cmplt_array;
-};
-
-/*
- * Structure to maintain running results for
- * a single packet (up to 4 tries).
- */
-struct completion {
-	uint32_t *results;                          /* running results. */
-	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
-	uint32_t  count;                            /* num of remaining tries */
-	/* true for allocated struct */
-} __attribute__((aligned(XMM_SIZE)));
-
-/*
- * One parms structure for each slot in the search engine.
- */
-struct parms {
-	const uint8_t              *data;
-	/* input data for this packet */
-	const uint32_t             *data_index;
-	/* data indirection for this trie */
-	struct completion          *cmplt;
-	/* completion data for this packet */
-};
-
-/*
- * Define an global idle node for unused engine slots
- */
-static const uint32_t idle[UINT8_MAX + 1];
-
-static const rte_xmm_t mm_type_quad_range = {
-	.u32 = {
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-	},
-};
-
-static const rte_xmm_t mm_type_quad_range64 = {
-	.u32 = {
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		0,
-		0,
-	},
-};
-
-static const rte_xmm_t mm_shuffle_input = {
-	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
-};
-
-static const rte_xmm_t mm_shuffle_input64 = {
-	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
-};
-
-static const rte_xmm_t mm_ones_16 = {
-	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
-};
-
-static const rte_xmm_t mm_bytes = {
-	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
-};
-
-static const rte_xmm_t mm_bytes64 = {
-	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
-};
-
-static const rte_xmm_t mm_match_mask = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-	},
-};
-
-static const rte_xmm_t mm_match_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		0,
-		RTE_ACL_NODE_MATCH,
-		0,
-	},
-};
-
-static const rte_xmm_t mm_index_mask = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-	},
-};
-
-static const rte_xmm_t mm_index_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		0,
-		0,
-	},
-};
-
-/*
- * Allocate a completion structure to manage the tries for a packet.
- */
-static inline struct completion *
-alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
-	uint32_t *results)
-{
-	uint32_t n;
-
-	for (n = 0; n < size; n++) {
-
-		if (p[n].count == 0) {
-
-			/* mark as allocated and set number of tries. */
-			p[n].count = tries;
-			p[n].results = results;
-			return &(p[n]);
-		}
-	}
-
-	/* should never get here */
-	return NULL;
-}
-
-/*
- * Resolve priority for a single result trie.
- */
-static inline void
-resolve_single_priority(uint64_t transition, int n,
-	const struct rte_acl_ctx *ctx, struct parms *parms,
-	const struct rte_acl_match_results *p)
-{
-	if (parms[n].cmplt->count == ctx->num_tries ||
-			parms[n].cmplt->priority[0] <=
-			p[transition].priority[0]) {
-
-		parms[n].cmplt->priority[0] = p[transition].priority[0];
-		parms[n].cmplt->results[0] = p[transition].results[0];
-	}
-
-	parms[n].cmplt->count--;
-}
-
-/*
- * Resolve priority for multiple results. This consists comparing
- * the priority of the current traversal with the running set of
- * results for the packet. For each result, keep a running array of
- * the result (rule number) and its priority for each category.
- */
-static inline void
-resolve_priority(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
-	struct parms *parms, const struct rte_acl_match_results *p,
-	uint32_t categories)
-{
-	uint32_t x;
-	xmm_t results, priority, results1, priority1, selector;
-	xmm_t *saved_results, *saved_priority;
-
-	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
-
-		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
-		saved_priority =
-			(xmm_t *)(&parms[n].cmplt->priority[x]);
-
-		/* get results and priorities for completed trie */
-		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
-		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
-
-		/* if this is not the first completed trie */
-		if (parms[n].cmplt->count != ctx->num_tries) {
-
-			/* get running best results and their priorities */
-			results1 = MM_LOADU(saved_results);
-			priority1 = MM_LOADU(saved_priority);
-
-			/* select results that are highest priority */
-			selector = MM_CMPGT32(priority1, priority);
-			results = MM_BLENDV8(results, results1, selector);
-			priority = MM_BLENDV8(priority, priority1, selector);
-		}
-
-		/* save running best results and their priorities */
-		MM_STOREU(saved_results, results);
-		MM_STOREU(saved_priority, priority);
-	}
-
-	/* Count down completed tries for this search request */
-	parms[n].cmplt->count--;
-}
-
-/*
- * Routine to fill a slot in the parallel trie traversal array (parms) from
- * the list of packets (flows).
- */
-static inline uint64_t
-acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
-	const struct rte_acl_ctx *ctx)
-{
-	uint64_t transition;
-
-	/* if there are any more packets to process */
-	if (flows->num_packets < flows->total_packets) {
-		parms[n].data = flows->data[flows->num_packets];
-		parms[n].data_index = ctx->trie[flows->trie].data_index;
-
-		/* if this is the first trie for this packet */
-		if (flows->trie == 0) {
-			flows->last_cmplt = alloc_completion(flows->cmplt_array,
-				flows->cmplt_size, ctx->num_tries,
-				flows->results +
-				flows->num_packets * flows->categories);
-		}
-
-		/* set completion parameters and starting index for this slot */
-		parms[n].cmplt = flows->last_cmplt;
-		transition =
-			flows->trans[parms[n].data[*parms[n].data_index++] +
-			ctx->trie[flows->trie].root_index];
-
-		/*
-		 * if this is the last trie for this packet,
-		 * then setup next packet.
-		 */
-		flows->trie++;
-		if (flows->trie >= ctx->num_tries) {
-			flows->trie = 0;
-			flows->num_packets++;
-		}
-
-		/* keep track of number of active trie traversals */
-		flows->started++;
-
-	/* no more tries to process, set slot to an idle position */
-	} else {
-		transition = ctx->idle;
-		parms[n].data = (const uint8_t *)idle;
-		parms[n].data_index = idle;
-	}
-	return transition;
-}
-
-/*
- * Detect matches. If a match node transition is found, then this trie
- * traversal is complete and fill the slot with the next trie
- * to be processed.
- */
-static inline uint64_t
-acl_match_check_transition(uint64_t transition, int slot,
-	const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows)
-{
-	const struct rte_acl_match_results *p;
-
-	p = (const struct rte_acl_match_results *)
-		(flows->trans + ctx->match_index);
-
-	if (transition & RTE_ACL_NODE_MATCH) {
-
-		/* Remove flags from index and decrement active traversals */
-		transition &= RTE_ACL_NODE_INDEX;
-		flows->started--;
-
-		/* Resolve priorities for this trie and running results */
-		if (flows->categories == 1)
-			resolve_single_priority(transition, slot, ctx,
-				parms, p);
-		else
-			resolve_priority(transition, slot, ctx, parms, p,
-				flows->categories);
-
-		/* Fill the slot with the next trie or idle trie */
-		transition = acl_start_next_trie(flows, parms, slot, ctx);
-
-	} else if (transition == ctx->idle) {
-		/* reset indirection table for idle slots */
-		parms[slot].data_index = idle;
-	}
-
-	return transition;
-}
-
-/*
- * Extract transitions from an XMM register and check for any matches
- */
-static void
-acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
-	struct parms *parms, struct acl_flow_data *flows)
-{
-	uint64_t transition1, transition2;
-
-	/* extract transition from low 64 bits. */
-	transition1 = MM_CVT64(*indicies);
-
-	/* extract transition from high 64 bits. */
-	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
-	transition2 = MM_CVT64(*indicies);
-
-	transition1 = acl_match_check_transition(transition1, slot, ctx,
-		parms, flows);
-	transition2 = acl_match_check_transition(transition2, slot + 1, ctx,
-		parms, flows);
-
-	/* update indicies with new transitions. */
-	*indicies = MM_SET64(transition2, transition1);
-}
-
-/*
- * Check for a match in 2 transitions (contained in SSE register)
- */
-static inline void
-acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
-{
-	xmm_t temp;
-
-	temp = MM_AND(match_mask, *indicies);
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indicies, slot, ctx, parms, flows);
-		temp = MM_AND(match_mask, *indicies);
-	}
-}
-
-/*
- * Check for any match in 4 transitions (contained in 2 SSE registers)
- */
-static inline void
-acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
-	xmm_t match_mask)
-{
-	xmm_t temp;
-
-	/* put low 32 bits of each transition into one register */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
-		0x88);
-	/* test for match node */
-	temp = MM_AND(match_mask, temp);
-
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indicies1, slot, ctx, parms, flows);
-		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
-
-		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
-					(__m128)*indicies2,
-					0x88);
-		temp = MM_AND(match_mask, temp);
-	}
-}
-
-/*
- * Calculate the address of the next transition for
- * all types of nodes. Note that only DFA nodes and range
- * nodes actually transition to another node. Match
- * nodes don't move.
- */
-static inline xmm_t
-acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	xmm_t *indicies1, xmm_t *indicies2)
-{
-	xmm_t addr, node_types, temp;
-
-	/*
-	 * Note that no transition is done for a match
-	 * node and therefore a stream freezes when
-	 * it reaches a match.
-	 */
-
-	/* Shuffle low 32 into temp and high 32 into indicies2 */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
-		0x88);
-	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
-		(__m128)*indicies2, 0xdd);
-
-	/* Calc node type and node addr */
-	node_types = MM_ANDNOT(index_mask, temp);
-	addr = MM_AND(index_mask, temp);
-
-	/*
-	 * Calc addr for DFAs - addr = dfa_index + input_byte
-	 */
-
-	/* mask for DFA type (0) nodes */
-	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
-
-	/* add input byte to DFA position */
-	temp = MM_AND(temp, bytes);
-	temp = MM_AND(temp, next_input);
-	addr = MM_ADD32(addr, temp);
-
-	/*
-	 * Calc addr for Range nodes -> range_index + range(input)
-	 */
-	node_types = MM_CMPEQ32(node_types, type_quad_range);
-
-	/*
-	 * Calculate number of range boundaries that are less than the
-	 * input value. Range boundaries for each node are in signed 8 bit,
-	 * ordered from -128 to 127 in the indicies2 register.
-	 * This is effectively a popcnt of bytes that are greater than the
-	 * input byte.
-	 */
-
-	/* shuffle input byte to all 4 positions of 32 bit value */
-	temp = MM_SHUFFLE8(next_input, shuffle_input);
-
-	/* check ranges */
-	temp = MM_CMPGT8(temp, *indicies2);
-
-	/* convert -1 to 1 (bytes greater than input byte */
-	temp = MM_SIGN8(temp, temp);
-
-	/* horizontal add pairs of bytes into words */
-	temp = MM_MADD8(temp, temp);
-
-	/* horizontal add pairs of words into dwords */
-	temp = MM_MADD16(temp, ones_16);
-
-	/* mask to range type nodes */
-	temp = MM_AND(temp, node_types);
-
-	/* add index into node position */
-	return MM_ADD32(addr, temp);
-}
-
-/*
- * Process 4 transitions (in 2 SIMD registers) in parallel
- */
-static inline xmm_t
-transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
-{
-	xmm_t addr;
-	uint64_t trans0, trans2;
-
-	 /* Calculate the address (array index) for all 4 transitions. */
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		bytes, type_quad_range, indicies1, indicies2);
-
-	 /* Gather 64 bit transitions and pack back into 2 registers. */
-
-	trans0 = trans[MM_CVT32(addr)];
-
-	/* get slot 2 */
-
-	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
-	trans2 = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-
-	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
-
-	/* get slot 3 */
-
-	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
-	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
-
-	return MM_SRL32(next_input, 8);
-}
-
-static inline void
-acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
-	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
-	uint32_t data_num, uint32_t categories, const uint64_t *trans)
-{
-	flows->num_packets = 0;
-	flows->started = 0;
-	flows->trie = 0;
-	flows->last_cmplt = NULL;
-	flows->cmplt_array = cmplt;
-	flows->total_packets = data_num;
-	flows->categories = categories;
-	flows->cmplt_size = cmplt_size;
-	flows->data = data;
-	flows->results = results;
-	flows->trans = trans;
-}
-
-/*
- * Execute trie traversal with 8 traversals in parallel
- */
-static inline void
-search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE8];
-	struct completion cmplt[MAX_SEARCHES_SSE8];
-	struct parms parms[MAX_SEARCHES_SSE8];
-	xmm_t input0, input1;
-	xmm_t indicies1, indicies2, indicies3, indicies4;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	/*
-	 * indicies1 contains index_array[0,1]
-	 * indicies2 contains index_array[2,3]
-	 * indicies3 contains index_array[4,5]
-	 * indicies4 contains index_array[6,7]
-	 */
-
-	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
-	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
-
-	 /* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indicies1, &indicies2, mm_match_mask.m);
-	acl_match_check_x4(4, ctx, parms, &flows,
-		&indicies3, &indicies4, mm_match_mask.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
-			0);
-		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
-			0);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
-
-		 /* Process the 4 bytes of input on each stream. */
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		 /* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indicies1, &indicies2, mm_match_mask.m);
-		acl_match_check_x4(4, ctx, parms, &flows,
-			&indicies3, &indicies4, mm_match_mask.m);
-	}
-}
-
-/*
- * Execute trie traversal with 4 traversals in parallel
- */
-static inline void
-search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	 uint32_t *results, int total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE4];
-	struct completion cmplt[MAX_SEARCHES_SSE4];
-	struct parms parms[MAX_SEARCHES_SSE4];
-	xmm_t input, indicies1, indicies2;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	/* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indicies1, &indicies2, mm_match_mask.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
-
-		/* Process the 4 bytes of input on each stream. */
-		input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		/* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indicies1, &indicies2, mm_match_mask.m);
-	}
-}
-
-static inline xmm_t
-transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	const uint64_t *trans, xmm_t *indicies1)
-{
-	uint64_t t;
-	xmm_t addr, indicies2;
-
-	indicies2 = MM_XOR(ones_16, ones_16);
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		bytes, type_quad_range, indicies1, &indicies2);
-
-	/* Gather 64 bit transitions and pack 2 per register. */
-
-	t = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
-
-	return MM_SRL32(next_input, 8);
-}
-
-/*
- * Execute trie traversal with 2 traversals in parallel.
- */
-static inline void
-search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE2];
-	struct completion cmplt[MAX_SEARCHES_SSE2];
-	struct parms parms[MAX_SEARCHES_SSE2];
-	xmm_t input, indicies;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indicies = MM_LOADU((xmm_t *) &index_array[0]);
-
-	/* Check for any matches. */
-	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-
-		/* Process the 4 bytes of input on each stream. */
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		/* Check for any matches. */
-		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
-			mm_match_mask64.m);
-	}
-}
-
-/*
- * When processing the transition, rather than using if/else
- * construct, the offset is calculated for DFA and QRANGE and
- * then conditionally added to the address based on node type.
- * This is done to avoid branch mis-predictions. Since the
- * offset is rather simple calculation it is more efficient
- * to do the calculation and do a condition move rather than
- * a conditional branch to determine which calculation to do.
- */
-static inline uint32_t
-scan_forward(uint32_t input, uint32_t max)
-{
-	return (input == 0) ? max : rte_bsf32(input);
-}
-
-static inline uint64_t
-scalar_transition(const uint64_t *trans_table, uint64_t transition,
-	uint8_t input)
-{
-	uint32_t addr, index, ranges, x, a, b, c;
-
-	/* break transition into component parts */
-	ranges = transition >> (sizeof(index) * CHAR_BIT);
-
-	/* calc address for a QRANGE node */
-	c = input * SCALAR_QRANGE_MULT;
-	a = ranges | SCALAR_QRANGE_MIN;
-	index = transition & ~RTE_ACL_NODE_INDEX;
-	a -= (c & SCALAR_QRANGE_MASK);
-	b = c & SCALAR_QRANGE_MIN;
-	addr = transition ^ index;
-	a &= SCALAR_QRANGE_MIN;
-	a ^= (ranges ^ b) & (a ^ b);
-	x = scan_forward(a, 32) >> 3;
-	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
-
-	/* pickup next transition */
-	transition = *(trans_table + addr);
-	return transition;
-}
-
-int
-rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories)
-{
-	int n;
-	uint64_t transition0, transition1;
-	uint32_t input0, input1;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SCALAR];
-	struct completion cmplt[MAX_SEARCHES_SCALAR];
-	struct parms parms[MAX_SEARCHES_SCALAR];
-
-	if (categories != 1 &&
-		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
-		return -EINVAL;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
-		categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	transition0 = index_array[0];
-	transition1 = index_array[1];
-
-	while (flows.started > 0) {
-
-		input0 = GET_NEXT_4BYTES(parms, 0);
-		input1 = GET_NEXT_4BYTES(parms, 1);
-
-		for (n = 0; n < 4; n++) {
-			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
-				transition0 = scalar_transition(flows.trans,
-					transition0, (uint8_t)input0);
-
-			input0 >>= CHAR_BIT;
-
-			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
-				transition1 = scalar_transition(flows.trans,
-					transition1, (uint8_t)input1);
-
-			input1 >>= CHAR_BIT;
-
-		}
-		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
-			transition0 = acl_match_check_transition(transition0,
-				0, ctx, parms, &flows);
-			transition1 = acl_match_check_transition(transition1,
-				1, ctx, parms, &flows);
-
-		}
-	}
-	return 0;
-}
-
-int
-rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories)
-{
-	if (categories != 1 &&
-		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
-		return -EINVAL;
-
-	if (likely(num >= MAX_SEARCHES_SSE8))
-		search_sse_8(ctx, data, results, num, categories);
-	else if (num >= MAX_SEARCHES_SSE4)
-		search_sse_4(ctx, data, results, num, categories);
-	else
-		search_sse_2(ctx, data, results, num, categories);
-
-	return 0;
-}
diff --git a/lib/librte_acl/acl_run.h b/lib/librte_acl/acl_run.h
new file mode 100644
index 0000000..c39650e
--- /dev/null
+++ b/lib/librte_acl/acl_run.h
@@ -0,0 +1,220 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef	_ACL_RUN_H_
+#define	_ACL_RUN_H_
+
+#include <rte_acl.h>
+#include "acl_vect.h"
+#include "acl.h"
+
+#define MAX_SEARCHES_SSE8	8
+#define MAX_SEARCHES_SSE4	4
+#define MAX_SEARCHES_SSE2	2
+#define MAX_SEARCHES_SCALAR	2
+
+#define GET_NEXT_4BYTES(prm, idx)	\
+	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
+
+
+#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
+
+#define	SCALAR_QRANGE_MULT	0x01010101
+#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
+#define	SCALAR_QRANGE_MIN	0x80808080
+
+/*
+ * Structure to manage N parallel trie traversals.
+ * The runtime trie traversal routines can process 8, 4, or 2 tries
+ * in parallel. Each packet may require multiple trie traversals (up to 4).
+ * This structure is used to fill the slots (0 to n-1) for parallel processing
+ * with the trie traversals needed for each packet.
+ */
+struct acl_flow_data {
+	uint32_t            num_packets;
+	/* number of packets processed */
+	uint32_t            started;
+	/* number of trie traversals in progress */
+	uint32_t            trie;
+	/* current trie index (0 to N-1) */
+	uint32_t            cmplt_size;
+	uint32_t            total_packets;
+	uint32_t            categories;
+	/* number of result categories per packet. */
+	/* maximum number of packets to process */
+	const uint64_t     *trans;
+	const uint8_t     **data;
+	uint32_t           *results;
+	struct completion  *last_cmplt;
+	struct completion  *cmplt_array;
+};
+
+/*
+ * Structure to maintain running results for
+ * a single packet (up to 4 tries).
+ */
+struct completion {
+	uint32_t *results;                          /* running results. */
+	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
+	uint32_t  count;                            /* num of remaining tries */
+	/* true for allocated struct */
+} __attribute__((aligned(XMM_SIZE)));
+
+/*
+ * One parms structure for each slot in the search engine.
+ */
+struct parms {
+	const uint8_t              *data;
+	/* input data for this packet */
+	const uint32_t             *data_index;
+	/* data indirection for this trie */
+	struct completion          *cmplt;
+	/* completion data for this packet */
+};
+
+/*
+ * Define a global idle node for unused engine slots
+ */
+static const uint32_t idle[UINT8_MAX + 1];
+
+/*
+ * Allocate a completion structure to manage the tries for a packet.
+ */
+static inline struct completion *
+alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
+	uint32_t *results)
+{
+	uint32_t n;
+
+	for (n = 0; n < size; n++) {
+
+		if (p[n].count == 0) {
+
+			/* mark as allocated and set number of tries. */
+			p[n].count = tries;
+			p[n].results = results;
+			return &(p[n]);
+		}
+	}
+
+	/* should never get here */
+	return NULL;
+}
+
+/*
+ * Resolve priority for a single result trie.
+ */
+static inline void
+resolve_single_priority(uint64_t transition, int n,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	const struct rte_acl_match_results *p)
+{
+	if (parms[n].cmplt->count == ctx->num_tries ||
+			parms[n].cmplt->priority[0] <=
+			p[transition].priority[0]) {
+
+		parms[n].cmplt->priority[0] = p[transition].priority[0];
+		parms[n].cmplt->results[0] = p[transition].results[0];
+	}
+}
+
+/*
+ * Routine to fill a slot in the parallel trie traversal array (parms) from
+ * the list of packets (flows).
+ */
+static inline uint64_t
+acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
+	const struct rte_acl_ctx *ctx)
+{
+	uint64_t transition;
+
+	/* if there are any more packets to process */
+	if (flows->num_packets < flows->total_packets) {
+		parms[n].data = flows->data[flows->num_packets];
+		parms[n].data_index = ctx->trie[flows->trie].data_index;
+
+		/* if this is the first trie for this packet */
+		if (flows->trie == 0) {
+			flows->last_cmplt = alloc_completion(flows->cmplt_array,
+				flows->cmplt_size, ctx->num_tries,
+				flows->results +
+				flows->num_packets * flows->categories);
+		}
+
+		/* set completion parameters and starting index for this slot */
+		parms[n].cmplt = flows->last_cmplt;
+		transition =
+			flows->trans[parms[n].data[*parms[n].data_index++] +
+			ctx->trie[flows->trie].root_index];
+
+		/*
+		 * if this is the last trie for this packet,
+		 * then setup next packet.
+		 */
+		flows->trie++;
+		if (flows->trie >= ctx->num_tries) {
+			flows->trie = 0;
+			flows->num_packets++;
+		}
+
+		/* keep track of number of active trie traversals */
+		flows->started++;
+
+	/* no more tries to process, set slot to an idle position */
+	} else {
+		transition = ctx->idle;
+		parms[n].data = (const uint8_t *)idle;
+		parms[n].data_index = idle;
+	}
+	return transition;
+}
+
+static inline void
+acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
+	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
+	uint32_t data_num, uint32_t categories, const uint64_t *trans)
+{
+	flows->num_packets = 0;
+	flows->started = 0;
+	flows->trie = 0;
+	flows->last_cmplt = NULL;
+	flows->cmplt_array = cmplt;
+	flows->total_packets = data_num;
+	flows->categories = categories;
+	flows->cmplt_size = cmplt_size;
+	flows->data = data;
+	flows->results = results;
+	flows->trans = trans;
+}
+
+#endif /* _ACL_RUN_H_ */
diff --git a/lib/librte_acl/acl_run_scalar.c b/lib/librte_acl/acl_run_scalar.c
new file mode 100644
index 0000000..b6d8b40
--- /dev/null
+++ b/lib/librte_acl/acl_run_scalar.c
@@ -0,0 +1,197 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+
+/*
+ * Resolve priority for multiple results (scalar version).
+ * This consists of comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_scalar(uint64_t transition, int n,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	const struct rte_acl_match_results *p, uint32_t categories)
+{
+	uint32_t i;
+	int32_t *saved_priority;
+	uint32_t *saved_results;
+	const int32_t *priority;
+	const uint32_t *results;
+
+	saved_results = parms[n].cmplt->results;
+	saved_priority = parms[n].cmplt->priority;
+
+	/* results and priorities for completed trie */
+	results = p[transition].results;
+	priority = p[transition].priority;
+
+	/* if this is not the first completed trie */
+	if (parms[n].cmplt->count != ctx->num_tries) {
+		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
+
+			if (saved_priority[i] <= priority[i]) {
+				saved_priority[i] = priority[i];
+				saved_results[i] = results[i];
+			}
+			if (saved_priority[i + 1] <= priority[i + 1]) {
+				saved_priority[i + 1] = priority[i + 1];
+				saved_results[i + 1] = results[i + 1];
+			}
+			if (saved_priority[i + 2] <= priority[i + 2]) {
+				saved_priority[i + 2] = priority[i + 2];
+				saved_results[i + 2] = results[i + 2];
+			}
+			if (saved_priority[i + 3] <= priority[i + 3]) {
+				saved_priority[i + 3] = priority[i + 3];
+				saved_results[i + 3] = results[i + 3];
+			}
+		}
+	} else {
+		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
+			saved_priority[i] = priority[i];
+			saved_priority[i + 1] = priority[i + 1];
+			saved_priority[i + 2] = priority[i + 2];
+			saved_priority[i + 3] = priority[i + 3];
+
+			saved_results[i] = results[i];
+			saved_results[i + 1] = results[i + 1];
+			saved_results[i + 2] = results[i + 2];
+			saved_results[i + 3] = results[i + 3];
+		}
+	}
+}
+
+#define	__func_resolve_priority__	resolve_priority_scalar
+#define	__func_match_check__		acl_match_check_scalar
+#include "acl_match_check.def"
+
+/*
+ * When processing the transition, rather than using an if/else
+ * construct, the offset is calculated for DFA and QRANGE and
+ * then conditionally added to the address based on node type.
+ * This is done to avoid branch mis-predictions. Since the
+ * offset is a rather simple calculation, it is more efficient
+ * to do the calculation and a conditional move rather than
+ * a conditional branch to determine which calculation to do.
+ */
+static inline uint32_t
+scan_forward(uint32_t input, uint32_t max)
+{
+	return (input == 0) ? max : rte_bsf32(input);
+}
+
+static inline uint64_t
+scalar_transition(const uint64_t *trans_table, uint64_t transition,
+	uint8_t input)
+{
+	uint32_t addr, index, ranges, x, a, b, c;
+
+	/* break transition into component parts */
+	ranges = transition >> (sizeof(index) * CHAR_BIT);
+
+	/* calc address for a QRANGE node */
+	c = input * SCALAR_QRANGE_MULT;
+	a = ranges | SCALAR_QRANGE_MIN;
+	index = transition & ~RTE_ACL_NODE_INDEX;
+	a -= (c & SCALAR_QRANGE_MASK);
+	b = c & SCALAR_QRANGE_MIN;
+	addr = transition ^ index;
+	a &= SCALAR_QRANGE_MIN;
+	a ^= (ranges ^ b) & (a ^ b);
+	x = scan_forward(a, 32) >> 3;
+	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
+
+	/* pickup next transition */
+	transition = *(trans_table + addr);
+	return transition;
+}
+
+int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	int n;
+	uint64_t transition0, transition1;
+	uint32_t input0, input1;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SCALAR];
+	struct completion cmplt[MAX_SEARCHES_SCALAR];
+	struct parms parms[MAX_SEARCHES_SCALAR];
+
+	if (categories != 1 &&
+		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
+		return -EINVAL;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
+		categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	transition0 = index_array[0];
+	transition1 = index_array[1];
+
+	while (flows.started > 0) {
+
+		input0 = GET_NEXT_4BYTES(parms, 0);
+		input1 = GET_NEXT_4BYTES(parms, 1);
+
+		for (n = 0; n < 4; n++) {
+			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
+				transition0 = scalar_transition(flows.trans,
+					transition0, (uint8_t)input0);
+
+			input0 >>= CHAR_BIT;
+
+			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
+				transition1 = scalar_transition(flows.trans,
+					transition1, (uint8_t)input1);
+
+			input1 >>= CHAR_BIT;
+
+		}
+		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
+			transition0 = acl_match_check_scalar(transition0,
+				0, ctx, parms, &flows);
+			transition1 = acl_match_check_scalar(transition1,
+				1, ctx, parms, &flows);
+
+		}
+	}
+	return 0;
+}
diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
new file mode 100644
index 0000000..104053f
--- /dev/null
+++ b/lib/librte_acl/acl_run_sse.c
@@ -0,0 +1,630 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+
+enum {
+	SHUFFLE32_SLOT1 = 0xe5,
+	SHUFFLE32_SLOT2 = 0xe6,
+	SHUFFLE32_SLOT3 = 0xe7,
+	SHUFFLE32_SWAP64 = 0x4e,
+};
+
+static const rte_xmm_t mm_type_quad_range = {
+	.u32 = {
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+	},
+};
+
+static const rte_xmm_t mm_type_quad_range64 = {
+	.u32 = {
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		0,
+		0,
+	},
+};
+
+static const rte_xmm_t mm_shuffle_input = {
+	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
+};
+
+static const rte_xmm_t mm_shuffle_input64 = {
+	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
+};
+
+static const rte_xmm_t mm_ones_16 = {
+	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
+};
+
+static const rte_xmm_t mm_bytes = {
+	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
+};
+
+static const rte_xmm_t mm_bytes64 = {
+	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
+};
+
+static const rte_xmm_t mm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_xmm_t mm_match_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		0,
+		RTE_ACL_NODE_MATCH,
+		0,
+	},
+};
+
+static const rte_xmm_t mm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_xmm_t mm_index_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		0,
+		0,
+	},
+};
+
+
+/*
+ * Resolve priority for multiple results (sse version).
+ * This consists of comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_sse(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
+	struct parms *parms, const struct rte_acl_match_results *p,
+	uint32_t categories)
+{
+	uint32_t x;
+	xmm_t results, priority, results1, priority1, selector;
+	xmm_t *saved_results, *saved_priority;
+
+	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
+
+		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
+		saved_priority =
+			(xmm_t *)(&parms[n].cmplt->priority[x]);
+
+		/* get results and priorities for completed trie */
+		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
+		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
+
+		/* if this is not the first completed trie */
+		if (parms[n].cmplt->count != ctx->num_tries) {
+
+			/* get running best results and their priorities */
+			results1 = MM_LOADU(saved_results);
+			priority1 = MM_LOADU(saved_priority);
+
+			/* select results that are highest priority */
+			selector = MM_CMPGT32(priority1, priority);
+			results = MM_BLENDV8(results, results1, selector);
+			priority = MM_BLENDV8(priority, priority1, selector);
+		}
+
+		/* save running best results and their priorities */
+		MM_STOREU(saved_results, results);
+		MM_STOREU(saved_priority, priority);
+	}
+}
+
+#define	__func_resolve_priority__	resolve_priority_sse
+#define	__func_match_check__		acl_match_check_sse
+#include "acl_match_check.def"
+
+/*
+ * Extract transitions from an XMM register and check for any matches
+ */
+static void
+acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
+	struct parms *parms, struct acl_flow_data *flows)
+{
+	uint64_t transition1, transition2;
+
+	/* extract transition from low 64 bits. */
+	transition1 = MM_CVT64(*indicies);
+
+	/* extract transition from high 64 bits. */
+	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
+	transition2 = MM_CVT64(*indicies);
+
+	transition1 = acl_match_check_sse(transition1, slot, ctx,
+		parms, flows);
+	transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
+		parms, flows);
+
+	/* update indicies with new transitions. */
+	*indicies = MM_SET64(transition2, transition1);
+}
+
+/*
+ * Check for a match in 2 transitions (contained in SSE register)
+ */
+static inline void
+acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
+{
+	xmm_t temp;
+
+	temp = MM_AND(match_mask, *indicies);
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indicies, slot, ctx, parms, flows);
+		temp = MM_AND(match_mask, *indicies);
+	}
+}
+
+/*
+ * Check for any match in 4 transitions (contained in 2 SSE registers)
+ */
+static inline void
+acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
+	xmm_t match_mask)
+{
+	xmm_t temp;
+
+	/* put low 32 bits of each transition into one register */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
+		0x88);
+	/* test for match node */
+	temp = MM_AND(match_mask, temp);
+
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indicies1, slot, ctx, parms, flows);
+		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
+
+		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
+					(__m128)*indicies2,
+					0x88);
+		temp = MM_AND(match_mask, temp);
+	}
+}
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes don't move.
+ */
+static inline xmm_t
+acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	xmm_t *indicies1, xmm_t *indicies2)
+{
+	xmm_t addr, node_types, temp;
+
+	/*
+	 * Note that no transition is done for a match
+	 * node and therefore a stream freezes when
+	 * it reaches a match.
+	 */
+
+	/* Shuffle low 32 into temp and high 32 into indicies2 */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
+		0x88);
+	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
+		(__m128)*indicies2, 0xdd);
+
+	/* Calc node type and node addr */
+	node_types = MM_ANDNOT(index_mask, temp);
+	addr = MM_AND(index_mask, temp);
+
+	/*
+	 * Calc addr for DFAs - addr = dfa_index + input_byte
+	 */
+
+	/* mask for DFA type (0) nodes */
+	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
+
+	/* add input byte to DFA position */
+	temp = MM_AND(temp, bytes);
+	temp = MM_AND(temp, next_input);
+	addr = MM_ADD32(addr, temp);
+
+	/*
+	 * Calc addr for Range nodes -> range_index + range(input)
+	 */
+	node_types = MM_CMPEQ32(node_types, type_quad_range);
+
+	/*
+	 * Calculate number of range boundaries that are less than the
+	 * input value. Range boundaries for each node are in signed 8 bit,
+	 * ordered from -128 to 127 in the indicies2 register.
+	 * This is effectively a popcnt of the range boundaries that are less
+	 * than the input byte.
+	 */
+
+	/* shuffle input byte to all 4 positions of 32 bit value */
+	temp = MM_SHUFFLE8(next_input, shuffle_input);
+
+	/* check ranges */
+	temp = MM_CMPGT8(temp, *indicies2);
+
+	/* convert -1 to 1 (bytes greater than input byte) */
+	temp = MM_SIGN8(temp, temp);
+
+	/* horizontal add pairs of bytes into words */
+	temp = MM_MADD8(temp, temp);
+
+	/* horizontal add pairs of words into dwords */
+	temp = MM_MADD16(temp, ones_16);
+
+	/* mask to range type nodes */
+	temp = MM_AND(temp, node_types);
+
+	/* add index into node position */
+	return MM_ADD32(addr, temp);
+}
+
+/*
+ * Process 4 transitions (in 2 SIMD registers) in parallel
+ */
+static inline xmm_t
+transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
+{
+	xmm_t addr;
+	uint64_t trans0, trans2;
+
+	 /* Calculate the address (array index) for all 4 transitions. */
+
+	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
+		bytes, type_quad_range, indicies1, indicies2);
+
+	 /* Gather 64 bit transitions and pack back into 2 registers. */
+
+	trans0 = trans[MM_CVT32(addr)];
+
+	/* get slot 2 */
+
+	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
+	trans2 = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+
+	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
+
+	/* get slot 3 */
+
+	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
+	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
+
+	return MM_SRL32(next_input, 8);
+}
+
+/*
+ * Execute trie traversal with 8 traversals in parallel
+ */
+static inline int
+search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE8];
+	struct completion cmplt[MAX_SEARCHES_SSE8];
+	struct parms parms[MAX_SEARCHES_SSE8];
+	xmm_t input0, input1;
+	xmm_t indicies1, indicies2, indicies3, indicies4;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	/*
+	 * indicies1 contains index_array[0,1]
+	 * indicies2 contains index_array[2,3]
+	 * indicies3 contains index_array[4,5]
+	 * indicies4 contains index_array[6,7]
+	 */
+
+	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
+	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
+
+	 /* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indicies1, &indicies2, mm_match_mask.m);
+	acl_match_check_x4(4, ctx, parms, &flows,
+		&indicies3, &indicies4, mm_match_mask.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
+			0);
+		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
+			0);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
+
+		 /* Process the 4 bytes of input on each stream. */
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		 /* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indicies1, &indicies2, mm_match_mask.m);
+		acl_match_check_x4(4, ctx, parms, &flows,
+			&indicies3, &indicies4, mm_match_mask.m);
+	}
+
+	return 0;
+}
+
+/*
+ * Execute trie traversal with 4 traversals in parallel
+ */
+static inline int
+search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	 uint32_t *results, int total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE4];
+	struct completion cmplt[MAX_SEARCHES_SSE4];
+	struct parms parms[MAX_SEARCHES_SSE4];
+	xmm_t input, indicies1, indicies2;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	/* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indicies1, &indicies2, mm_match_mask.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
+
+		/* Process the 4 bytes of input on each stream. */
+		input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		 input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		 input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		 input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		/* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indicies1, &indicies2, mm_match_mask.m);
+	}
+
+	return 0;
+}
+
+static inline xmm_t
+transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	const uint64_t *trans, xmm_t *indicies1)
+{
+	uint64_t t;
+	xmm_t addr, indicies2;
+
+	indicies2 = MM_XOR(ones_16, ones_16);
+
+	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
+		bytes, type_quad_range, indicies1, &indicies2);
+
+	/* Gather 64 bit transitions and pack 2 per register. */
+
+	t = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
+
+	return MM_SRL32(next_input, 8);
+}
+
+/*
+ * Execute trie traversal with 2 traversals in parallel.
+ */
+static inline int
+search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE2];
+	struct completion cmplt[MAX_SEARCHES_SSE2];
+	struct parms parms[MAX_SEARCHES_SSE2];
+	xmm_t input, indicies;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indicies = MM_LOADU((xmm_t *) &index_array[0]);
+
+	/* Check for any matches. */
+	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+
+		/* Process the 4 bytes of input on each stream. */
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		/* Check for any matches. */
+		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
+			mm_match_mask64.m);
+	}
+
+	return 0;
+}
+
+int
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	if (categories != 1 &&
+		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
+		return -EINVAL;
+
+	if (likely(num >= MAX_SEARCHES_SSE8))
+		return search_sse_8(ctx, data, results, num, categories);
+	else if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+	else
+		return search_sse_2(ctx, data, results, num, categories);
+}
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 7c288bd..0cde07e 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -38,6 +38,21 @@
 
 TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
 
+/* by default, use the always-available scalar code path. */
+rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
+
+void __attribute__((constructor(INT16_MAX)))
+rte_acl_select_classify(void)
+{
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
+		/* SSE version requires SSE4.1 */
+		rte_acl_default_classify = rte_acl_classify_sse;
+	} else {
+		/* reset to scalar version. */
+		rte_acl_default_classify = rte_acl_classify_scalar;
+	}
+}
+
 struct rte_acl_ctx *
 rte_acl_find_existing(const char *name)
 {
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index afc0f69..e6a4472 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -267,6 +267,9 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
  * RTE_ACL_RESULTS_MULTIPLIER and can't be bigger than RTE_ACL_MAX_CATEGORIES.
  * If more than one rule is applicable for given input buffer and
  * given category, then rule with highest priority will be returned as a match.
+ * Note that this function can be run only on CPUs with SSE4.1 support.
+ * It is up to the caller to make sure that it is only invoked on
+ * a machine that supports the SSE4.1 ISA.
  * Note, that it is a caller responsibility to ensure that input parameters
  * are valid and point to correct memory locations.
  *
@@ -286,9 +289,10 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
  * @return
  *   zero on successful completion.
  *   -EINVAL for incorrect arguments.
+ *   -ENOTSUP for unsupported platforms.
  */
 int
-rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
 
 /**
@@ -327,6 +331,24 @@ int
 rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
 
+typedef int (*rte_acl_classify_t)
+(const struct rte_acl_ctx *, const uint8_t **, uint32_t *, uint32_t, uint32_t);
+
+/**
+ * Points to the default rte_acl_classify implementation.
+ */
+extern rte_acl_classify_t rte_acl_default_classify;
+
+#define	rte_acl_classify(ctx, data, results, num, categories)	\
+	(*rte_acl_default_classify)(ctx, data, results, num, categories)
+
+/**
+ * Analyzes the ISA of the current CPU and points rte_acl_default_classify
+ * to the highest applicable version of the classify function.
+ */
+void
+rte_acl_select_classify(void);
+
 /**
  * Dump an ACL context structure to the console.
  *
-- 
1.8.5.3

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  2014-08-07 18:31 [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target Konstantin Ananyev
@ 2014-08-07 20:11 ` Neil Horman
  2014-08-07 20:58   ` Vincent JARDIN
  2014-08-08 11:49   ` Ananyev, Konstantin
  2014-08-21 20:15 ` [dpdk-dev] [PATCHv3] " Neil Horman
  2014-08-28 20:38 ` [dpdk-dev] [PATCHv4] " Neil Horman
  2 siblings, 2 replies; 21+ messages in thread
From: Neil Horman @ 2014-08-07 20:11 UTC (permalink / raw)
  To: Konstantin Ananyev; +Cc: dev

On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> Make ACL library to build/work on 'default' architecture:
> - make rte_acl_classify_scalar really scalar
>  (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> - Provide two versions of rte_acl_classify code path:
>   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
>   and upper, return -ENOTSUP on lower arch.
>   rte_acl_classify_scalar() - a slower version, but could be build and used
>   on all systems.
> - keep common code shared between these two codepaths.
> 
> v2 chages:
>  run-time selection of most appropriate code-path for given ISA.
>  By default the highest supprted one is selected.
>  User can still override that selection by manually assigning new value to 
>  the global function pointer rte_acl_default_classify.
>  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
>  points to.
> 
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

This is a lot better, thank you.  A few remaining issues.

> ---
>  app/test-acl/main.c                |  13 +-
>  lib/librte_acl/Makefile            |   5 +-
>  lib/librte_acl/acl_bld.c           |   5 +-
>  lib/librte_acl/acl_match_check.def |  92 ++++
>  lib/librte_acl/acl_run.c           | 944 -------------------------------------
>  lib/librte_acl/acl_run.h           | 220 +++++++++
>  lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
>  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
>  lib/librte_acl/rte_acl.c           |  15 +
>  lib/librte_acl/rte_acl.h           |  24 +-
>  10 files changed, 1189 insertions(+), 956 deletions(-)
>  create mode 100644 lib/librte_acl/acl_match_check.def
>  delete mode 100644 lib/librte_acl/acl_run.c
>  create mode 100644 lib/librte_acl/acl_run.h
>  create mode 100644 lib/librte_acl/acl_run_scalar.c
>  create mode 100644 lib/librte_acl/acl_run_sse.c
> 
> diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> index d654409..45c6fa6 100644
> --- a/app/test-acl/main.c
> +++ b/app/test-acl/main.c
> @@ -787,6 +787,10 @@ acx_init(void)
>  	/* perform build. */
>  	ret = rte_acl_build(config.acx, &cfg);
>  
> +	/* setup default rte_acl_classify */
> +	if (config.scalar)
> +		rte_acl_default_classify = rte_acl_classify_scalar;
> +
Exporting this variable as part of the ABI is a bad idea.  If the prototype of
the function changes, you have to update all your applications.  Make the
pointer an internal symbol and set it using a get/set routine with an enum to
represent the path to choose.  That will help isolate the ABI from the
internal implementation.  It will also let you prevent things like selecting a
run-time path that is incompatible with the running system, and prevent path
switching during searches, which may produce unexpected results.
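
For illustration, a minimal sketch of that kind of interface (the enum and
setter names are made up here, not part of the patch; it reuses the
rte_acl_classify_t typedef and the cpuflag check the patch already adds):

#include <errno.h>
#include <rte_acl.h>
#include <rte_cpuflags.h>

enum rte_acl_classify_alg {
	RTE_ACL_CLASSIFY_SCALAR,
	RTE_ACL_CLASSIFY_SSE,
};

/* internal to rte_acl.c, never exposed through the ABI */
static rte_acl_classify_t acl_classify_fn = rte_acl_classify_scalar;

int
rte_acl_set_classify_alg(enum rte_acl_classify_alg alg)
{
	switch (alg) {
	case RTE_ACL_CLASSIFY_SSE:
		/* refuse a path the running cpu cannot execute */
		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
			return -ENOTSUP;
		acl_classify_fn = rte_acl_classify_sse;
		return 0;
	case RTE_ACL_CLASSIFY_SCALAR:
		acl_classify_fn = rte_acl_classify_scalar;
		return 0;
	default:
		return -EINVAL;
	}
}

An application like test-acl would then call the setter once at init time
instead of assigning to a global function pointer.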

><snip>
> diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> deleted file mode 100644
> index e3d9fc1..0000000
> --- a/lib/librte_acl/acl_run.c
> +++ /dev/null
> @@ -1,944 +0,0 @@
> -/*-
> - *   BSD LICENSE
> - *
> - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> - *   All rights reserved.
> - *
> - *   Redistribution and use in source and binary forms, with or without
> - *   modification, are permitted provided that the following conditions
><snip>
> +
> +#define	__func_resolve_priority__	resolve_priority_scalar
> +#define	__func_match_check__		acl_match_check_scalar
> +#include "acl_match_check.def"
> +
I get that this lets you make some more code common, but it's just unpleasant
to trace through.  Looking at the definition of __func_match_check__ I don't
see anything particularly performance sensitive there.  What if instead you
simply redefined __func_match_check__ in a common internal header as
acl_match_check (a generic function), and had it accept a priority resolution
function as an argument?  That would still give you all the performance
enhancements without having to include C files in the middle of other C files,
and would make the code a bit easier to parse.
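
Roughly, a sketch of that shape (signatures taken from the patch, the body
elided since it would just be what acl_match_check.def expands to today):

typedef void (*resolve_priority_t)(uint64_t, int, const struct rte_acl_ctx *,
	struct parms *, const struct rte_acl_match_results *, uint32_t);

/* one shared definition in acl_run.h instead of a per-flavour #include */
static inline uint64_t
acl_match_check(uint64_t transition, int slot, const struct rte_acl_ctx *ctx,
	struct parms *parms, struct acl_flow_data *flows,
	resolve_priority_t resolve_priority)
{
	/*
	 * ... same match handling as the .def file, calling
	 * resolve_priority(...) through the argument instead of the macro ...
	 */
	return transition;
}

and then, e.g. in the sse code path:

	transition1 = acl_match_check(transition1, slot, ctx, parms, flows,
		resolve_priority_sse);

Since the helper is static inline and the resolver argument is a compile-time
constant at each call site, both flavours should still get their specialised
resolver inlined.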

> +/*
> + * When processing the transition, rather than using if/else
> + * construct, the offset is calculated for DFA and QRANGE and
> + * then conditionally added to the address based on node type.
> + * This is done to avoid branch mis-predictions. Since the
> + * offset is rather simple calculation it is more efficient
> + * to do the calculation and do a condition move rather than
> + * a conditional branch to determine which calculation to do.
> + */
> +static inline uint32_t
> +scan_forward(uint32_t input, uint32_t max)
> +{
> +	return (input == 0) ? max : rte_bsf32(input);
> +}
> +	}
> +}
><snip>
> +
> +#define	__func_resolve_priority__	resolve_priority_sse
> +#define	__func_match_check__		acl_match_check_sse
> +#include "acl_match_check.def"
> +
Same deal as above.

> +/*
> + * Extract transitions from an XMM register and check for any matches
> + */
> +static void
> +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> +	struct parms *parms, struct acl_flow_data *flows)
> +{
> +	uint64_t transition1, transition2;
> +
> +	/* extract transition from low 64 bits. */
> +	transition1 = MM_CVT64(*indicies);
> +
> +	/* extract transition from high 64 bits. */
> +	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> +	transition2 = MM_CVT64(*indicies);
> +
> +	transition1 = acl_match_check_sse(transition1, slot, ctx,
> +		parms, flows);
> +	transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> +		parms, flows);
> +
> +	/* update indicies with new transitions. */
> +	*indicies = MM_SET64(transition2, transition1);
> +}
> +
> +/*
> + * Check for a match in 2 transitions (contained in SSE register)
> + */
> +static inline void
> +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> +	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> +{
> +	xmm_t temp;
> +
> +	temp = MM_AND(match_mask, *indicies);
> +	while (!MM_TESTZ(temp, temp)) {
> +		acl_process_matches(indicies, slot, ctx, parms, flows);
> +		temp = MM_AND(match_mask, *indicies);
> +	}
> +}
> +
> +/*
> + * Check for any match in 4 transitions (contained in 2 SSE registers)
> + */
> +static inline void
> +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> +	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> +	xmm_t match_mask)
> +{
> +	xmm_t temp;
> +
> +	/* put low 32 bits of each transition into one register */
> +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> +		0x88);
> +	/* test for match node */
> +	temp = MM_AND(match_mask, temp);
> +
> +	while (!MM_TESTZ(temp, temp)) {
> +		acl_process_matches(indicies1, slot, ctx, parms, flows);
> +		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> +
> +		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> +					(__m128)*indicies2,
> +					0x88);
> +		temp = MM_AND(match_mask, temp);
> +	}
> +}
> +
> +/*
> + * Calculate the address of the next transition for
> + * all types of nodes. Note that only DFA nodes and range
> + * nodes actually transition to another node. Match
> + * nodes don't move.
> + */
> +static inline xmm_t
> +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> +	xmm_t *indicies1, xmm_t *indicies2)
> +{
> +	xmm_t addr, node_types, temp;
> +
> +	/*
> +	 * Note that no transition is done for a match
> +	 * node and therefore a stream freezes when
> +	 * it reaches a match.
> +	 */
> +
> +	/* Shuffle low 32 into temp and high 32 into indicies2 */
> +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> +		0x88);
> +	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> +		(__m128)*indicies2, 0xdd);
> +
> +	/* Calc node type and node addr */
> +	node_types = MM_ANDNOT(index_mask, temp);
> +	addr = MM_AND(index_mask, temp);
> +
> +	/*
> +	 * Calc addr for DFAs - addr = dfa_index + input_byte
> +	 */
> +
> +	/* mask for DFA type (0) nodes */
> +	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> +
> +	/* add input byte to DFA position */
> +	temp = MM_AND(temp, bytes);
> +	temp = MM_AND(temp, next_input);
> +	addr = MM_ADD32(addr, temp);
> +
> +	/*
> +	 * Calc addr for Range nodes -> range_index + range(input)
> +	 */
> +	node_types = MM_CMPEQ32(node_types, type_quad_range);
> +
> +	/*
> +	 * Calculate number of range boundaries that are less than the
> +	 * input value. Range boundaries for each node are in signed 8 bit,
> +	 * ordered from -128 to 127 in the indicies2 register.
> +	 * This is effectively a popcnt of bytes that are greater than the
> +	 * input byte.
> +	 */
> +
> +	/* shuffle input byte to all 4 positions of 32 bit value */
> +	temp = MM_SHUFFLE8(next_input, shuffle_input);
> +
> +	/* check ranges */
> +	temp = MM_CMPGT8(temp, *indicies2);
> +
> +	/* convert -1 to 1 (bytes greater than input byte */
> +	temp = MM_SIGN8(temp, temp);
> +
> +	/* horizontal add pairs of bytes into words */
> +	temp = MM_MADD8(temp, temp);
> +
> +	/* horizontal add pairs of words into dwords */
> +	temp = MM_MADD16(temp, ones_16);
> +
> +	/* mask to range type nodes */
> +	temp = MM_AND(temp, node_types);
> +
> +	/* add index into node position */
> +	return MM_ADD32(addr, temp);
> +}
> +
> +/*
> + * Process 4 transitions (in 2 SIMD registers) in parallel
> + */
> +static inline xmm_t
> +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> +	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> +{
> +	xmm_t addr;
> +	uint64_t trans0, trans2;
> +
> +	 /* Calculate the address (array index) for all 4 transitions. */
> +
> +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> +		bytes, type_quad_range, indicies1, indicies2);
> +
> +	 /* Gather 64 bit transitions and pack back into 2 registers. */
> +
> +	trans0 = trans[MM_CVT32(addr)];
> +
> +	/* get slot 2 */
> +
> +	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> +	trans2 = trans[MM_CVT32(addr)];
> +
> +	/* get slot 1 */
> +
> +	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> +
> +	/* get slot 3 */
> +
> +	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> +	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> +
> +	return MM_SRL32(next_input, 8);
> +}
> +
> +/*
> + * Execute trie traversal with 8 traversals in parallel
> + */
> +static inline int
> +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> +{
> +	int n;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SSE8];
> +	struct completion cmplt[MAX_SEARCHES_SSE8];
> +	struct parms parms[MAX_SEARCHES_SSE8];
> +	xmm_t input0, input1;
> +	xmm_t indicies1, indicies2, indicies3, indicies4;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> +		total_packets, categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	/*
> +	 * indicies1 contains index_array[0,1]
> +	 * indicies2 contains index_array[2,3]
> +	 * indicies3 contains index_array[4,5]
> +	 * indicies4 contains index_array[6,7]
> +	 */
> +
> +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> +
> +	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> +	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> +
> +	 /* Check for any matches. */
> +	acl_match_check_x4(0, ctx, parms, &flows,
> +		&indicies1, &indicies2, mm_match_mask.m);
> +	acl_match_check_x4(4, ctx, parms, &flows,
> +		&indicies3, &indicies4, mm_match_mask.m);
> +
> +	while (flows.started > 0) {
> +
> +		/* Gather 4 bytes of input data for each stream. */
> +		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> +			0);
> +		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> +			0);
> +
> +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> +
> +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> +
> +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> +
> +		 /* Process the 4 bytes of input on each stream. */
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		 /* Check for any matches. */
> +		acl_match_check_x4(0, ctx, parms, &flows,
> +			&indicies1, &indicies2, mm_match_mask.m);
> +		acl_match_check_x4(4, ctx, parms, &flows,
> +			&indicies3, &indicies4, mm_match_mask.m);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Execute trie traversal with 4 traversals in parallel
> + */
> +static inline int
> +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	 uint32_t *results, int total_packets, uint32_t categories)
> +{
> +	int n;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SSE4];
> +	struct completion cmplt[MAX_SEARCHES_SSE4];
> +	struct parms parms[MAX_SEARCHES_SSE4];
> +	xmm_t input, indicies1, indicies2;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> +		total_packets, categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> +
> +	/* Check for any matches. */
> +	acl_match_check_x4(0, ctx, parms, &flows,
> +		&indicies1, &indicies2, mm_match_mask.m);
> +
> +	while (flows.started > 0) {
> +
> +		/* Gather 4 bytes of input data for each stream. */
> +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> +
> +		/* Process the 4 bytes of input on each stream. */
> +		input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		 input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		 input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		 input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		/* Check for any matches. */
> +		acl_match_check_x4(0, ctx, parms, &flows,
> +			&indicies1, &indicies2, mm_match_mask.m);
> +	}
> +
> +	return 0;
> +}
> +
> +static inline xmm_t
> +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> +	const uint64_t *trans, xmm_t *indicies1)
> +{
> +	uint64_t t;
> +	xmm_t addr, indicies2;
> +
> +	indicies2 = MM_XOR(ones_16, ones_16);
> +
> +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> +		bytes, type_quad_range, indicies1, &indicies2);
> +
> +	/* Gather 64 bit transitions and pack 2 per register. */
> +
> +	t = trans[MM_CVT32(addr)];
> +
> +	/* get slot 1 */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> +
> +	return MM_SRL32(next_input, 8);
> +}
> +
> +/*
> + * Execute trie traversal with 2 traversals in parallel.
> + */
> +static inline int
> +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> +{
> +	int n;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SSE2];
> +	struct completion cmplt[MAX_SEARCHES_SSE2];
> +	struct parms parms[MAX_SEARCHES_SSE2];
> +	xmm_t input, indicies;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> +		total_packets, categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	indicies = MM_LOADU((xmm_t *) &index_array[0]);
> +
> +	/* Check for any matches. */
> +	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> +
> +	while (flows.started > 0) {
> +
> +		/* Gather 4 bytes of input data for each stream. */
> +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> +
> +		/* Process the 4 bytes of input on each stream. */
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		/* Check for any matches. */
> +		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> +			mm_match_mask64.m);
> +	}
> +
> +	return 0;
> +}
> +
> +int
> +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t num, uint32_t categories)
> +{
> +	if (categories != 1 &&
> +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> +		return -EINVAL;
> +
> +	if (likely(num >= MAX_SEARCHES_SSE8))
> +		return search_sse_8(ctx, data, results, num, categories);
> +	else if (num >= MAX_SEARCHES_SSE4)
> +		return search_sse_4(ctx, data, results, num, categories);
> +	else
> +		return search_sse_2(ctx, data, results, num, categories);
> +}
> diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> index 7c288bd..0cde07e 100644
> --- a/lib/librte_acl/rte_acl.c
> +++ b/lib/librte_acl/rte_acl.c
> @@ -38,6 +38,21 @@
>  
>  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
>  
> +/* by default, use always avaialbe scalar code path. */
> +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> +
Make this static; the outside world shouldn't need to see it.

> +void __attribute__((constructor(INT16_MAX)))
> +rte_acl_select_classify(void)
Make it static; the outside world doesn't need to call this.

> +{
> +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> +		/* SSE version requires SSE4.1 */
> +		rte_acl_default_classify = rte_acl_classify_sse;
> +	} else {
> +		/* reset to scalar version. */
> +		rte_acl_default_classify = rte_acl_classify_scalar;
Don't need the else clause here; the static initializer has you covered.
> +	}
> +}
> +
> +
> +/**
> + * Invokes default rte_acl_classify function.
> + */
> +extern rte_acl_classify_t rte_acl_default_classify;
> +
Doesn't need to be extern.
> +#define	rte_acl_classify(ctx, data, results, num, categories)	\
> +	(*rte_acl_default_classify)(ctx, data, results, num, categories)
> +
Not sure why you need this either.  The rte_acl_classify_t should be enough, no?
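
One possible shape, just to illustrate (only rte_acl_classify() stays in the
header with its existing prototype; the other names here are illustrative):

/* rte_acl.c -- pointer and selector stay internal */
static rte_acl_classify_t acl_classify_fn = rte_acl_classify_scalar;

static void __attribute__((constructor))
rte_acl_init_classify(void)
{
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
		acl_classify_fn = rte_acl_classify_sse;
}

int
rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	return acl_classify_fn(ctx, data, results, num, categories);
}

That keeps rte_acl_classify() a normal exported function, and callers never
see the pointer or the macro.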

Regards
Neil

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  2014-08-07 20:11 ` Neil Horman
@ 2014-08-07 20:58   ` Vincent JARDIN
  2014-08-07 21:28     ` Chris Wright
  2014-08-08 11:49   ` Ananyev, Konstantin
  1 sibling, 1 reply; 21+ messages in thread
From: Vincent JARDIN @ 2014-08-07 20:58 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

What about using the function versioning attributes too?

https://gcc.gnu.org/wiki/FunctionMultiVersioning
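
For reference, a sketch of what that would look like (gcc specific, and the
wiki above describes it as a C++ feature, so this is just to illustrate the
idea rather than something the C library can adopt as-is):

__attribute__((target("default")))
int rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	return rte_acl_classify_scalar(ctx, data, results, num, categories);
}

__attribute__((target("sse4.1")))
int rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	return rte_acl_classify_sse(ctx, data, results, num, categories);
}

The compiler then emits an ifunc resolver that picks the matching version at
load time, which is essentially the same dispatch the patch does by hand.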

On 7 August 2014 at 22:11, "Neil Horman" <nhorman@tuxdriver.com> wrote:
>
> On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > Make ACL library to build/work on 'default' architecture:
> > - make rte_acl_classify_scalar really scalar
> >  (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> > - Provide two versions of rte_acl_classify code path:
> >   rte_acl_classify_sse() - could be build and used only on systems with
sse4.2
> >   and upper, return -ENOTSUP on lower arch.
> >   rte_acl_classify_scalar() - a slower version, but could be build and
used
> >   on all systems.
> > - keep common code shared between these two codepaths.
> >
> > v2 chages:
> >  run-time selection of most appropriate code-path for given ISA.
> >  By default the highest supprted one is selected.
> >  User can still override that selection by manually assigning new value
to
> >  the global function pointer rte_acl_default_classify.
> >  rte_acl_classify() becomes a macro calling whatever
rte_acl_default_classify
> >  points to.
> >
> >
> > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>
> This is alot better thank you.  A few remaining issues.
>
> > ---
> >  app/test-acl/main.c                |  13 +-
> >  lib/librte_acl/Makefile            |   5 +-
> >  lib/librte_acl/acl_bld.c           |   5 +-
> >  lib/librte_acl/acl_match_check.def |  92 ++++
> >  lib/librte_acl/acl_run.c           | 944
-------------------------------------
> >  lib/librte_acl/acl_run.h           | 220 +++++++++
> >  lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
> >  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
> >  lib/librte_acl/rte_acl.c           |  15 +
> >  lib/librte_acl/rte_acl.h           |  24 +-
> >  10 files changed, 1189 insertions(+), 956 deletions(-)
> >  create mode 100644 lib/librte_acl/acl_match_check.def
> >  delete mode 100644 lib/librte_acl/acl_run.c
> >  create mode 100644 lib/librte_acl/acl_run.h
> >  create mode 100644 lib/librte_acl/acl_run_scalar.c
> >  create mode 100644 lib/librte_acl/acl_run_sse.c
> >
> > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > index d654409..45c6fa6 100644
> > --- a/app/test-acl/main.c
> > +++ b/app/test-acl/main.c
> > @@ -787,6 +787,10 @@ acx_init(void)
> >       /* perform build. */
> >       ret = rte_acl_build(config.acx, &cfg);
> >
> > +     /* setup default rte_acl_classify */
> > +     if (config.scalar)
> > +             rte_acl_default_classify = rte_acl_classify_scalar;
> > +
> Exporting this variable as part of the ABI is a bad idea.  If the
prototype of
> the function changes you have to update all your applications.  Make the
pointer
> an internal symbol and set it using a get/set routine with an enum to
represent
> the path to choose.  That will help isolate the ABI from the internal
> implementation.  It will also let you prevent things like selecting a run
time
> path that is incompatible with the running system, and prevent path
switching
> during searches, which may produce unexpected results.
>
> ><snip>
> > diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> > deleted file mode 100644
> > index e3d9fc1..0000000
> > --- a/lib/librte_acl/acl_run.c
> > +++ /dev/null
> > @@ -1,944 +0,0 @@
> > -/*-
> > - *   BSD LICENSE
> > - *
> > - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > - *   All rights reserved.
> > - *
> > - *   Redistribution and use in source and binary forms, with or without
> > - *   modification, are permitted provided that the following conditions
> ><snip>
> > +
> > +#define      __func_resolve_priority__       resolve_priority_scalar
> > +#define      __func_match_check__            acl_match_check_scalar
> > +#include "acl_match_check.def"
> > +
> I get this lets you make some more code common, but its just unpleasant
to trace
> through.  Looking at the defintion of __func_match_check__ I don't see
anything
> particularly performance sensitive there.  What if instead you simply
redefined
> __func_match_check__ in a common internal header as acl_match_check (a
generic
> function), and had it accept priority resolution function as an argument?
 That
> would still give you all the performance enhancements without having to
include
> c files in the middle of other c files, and would make the code a bit more
> parseable.
>
> > +/*
> > + * When processing the transition, rather than using if/else
> > + * construct, the offset is calculated for DFA and QRANGE and
> > + * then conditionally added to the address based on node type.
> > + * This is done to avoid branch mis-predictions. Since the
> > + * offset is rather simple calculation it is more efficient
> > + * to do the calculation and do a condition move rather than
> > + * a conditional branch to determine which calculation to do.
> > + */
> > +static inline uint32_t
> > +scan_forward(uint32_t input, uint32_t max)
> > +{
> > +     return (input == 0) ? max : rte_bsf32(input);
> > +}
> > +     }
> > +}
> ><snip>
> > +
> > +#define      __func_resolve_priority__       resolve_priority_sse
> > +#define      __func_match_check__            acl_match_check_sse
> > +#include "acl_match_check.def"
> > +
> Same deal as above.
>
> > +/*
> > + * Extract transitions from an XMM register and check for any matches
> > + */
> > +static void
> > +acl_process_matches(xmm_t *indicies, int slot, const struct
rte_acl_ctx *ctx,
> > +     struct parms *parms, struct acl_flow_data *flows)
> > +{
> > +     uint64_t transition1, transition2;
> > +
> > +     /* extract transition from low 64 bits. */
> > +     transition1 = MM_CVT64(*indicies);
> > +
> > +     /* extract transition from high 64 bits. */
> > +     *indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> > +     transition2 = MM_CVT64(*indicies);
> > +
> > +     transition1 = acl_match_check_sse(transition1, slot, ctx,
> > +             parms, flows);
> > +     transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> > +             parms, flows);
> > +
> > +     /* update indicies with new transitions. */
> > +     *indicies = MM_SET64(transition2, transition1);
> > +}
> > +
> > +/*
> > + * Check for a match in 2 transitions (contained in SSE register)
> > + */
> > +static inline void
> > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct
parms *parms,
> > +     struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> > +{
> > +     xmm_t temp;
> > +
> > +     temp = MM_AND(match_mask, *indicies);
> > +     while (!MM_TESTZ(temp, temp)) {
> > +             acl_process_matches(indicies, slot, ctx, parms, flows);
> > +             temp = MM_AND(match_mask, *indicies);
> > +     }
> > +}
> > +
> > +/*
> > + * Check for any match in 4 transitions (contained in 2 SSE registers)
> > + */
> > +static inline void
> > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct
parms *parms,
> > +     struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> > +     xmm_t match_mask)
> > +{
> > +     xmm_t temp;
> > +
> > +     /* put low 32 bits of each transition into one register */
> > +     temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > +             0x88);
> > +     /* test for match node */
> > +     temp = MM_AND(match_mask, temp);
> > +
> > +     while (!MM_TESTZ(temp, temp)) {
> > +             acl_process_matches(indicies1, slot, ctx, parms, flows);
> > +             acl_process_matches(indicies2, slot + 2, ctx, parms,
flows);
> > +
> > +             temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > +                                     (__m128)*indicies2,
> > +                                     0x88);
> > +             temp = MM_AND(match_mask, temp);
> > +     }
> > +}
> > +
> > +/*
> > + * Calculate the address of the next transition for
> > + * all types of nodes. Note that only DFA nodes and range
> > + * nodes actually transition to another node. Match
> > + * nodes don't move.
> > + */
> > +static inline xmm_t
> > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > +     xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > +     xmm_t *indicies1, xmm_t *indicies2)
> > +{
> > +     xmm_t addr, node_types, temp;
> > +
> > +     /*
> > +      * Note that no transition is done for a match
> > +      * node and therefore a stream freezes when
> > +      * it reaches a match.
> > +      */
> > +
> > +     /* Shuffle low 32 into temp and high 32 into indicies2 */
> > +     temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > +             0x88);
> > +     *indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > +             (__m128)*indicies2, 0xdd);
> > +
> > +     /* Calc node type and node addr */
> > +     node_types = MM_ANDNOT(index_mask, temp);
> > +     addr = MM_AND(index_mask, temp);
> > +
> > +     /*
> > +      * Calc addr for DFAs - addr = dfa_index + input_byte
> > +      */
> > +
> > +     /* mask for DFA type (0) nodes */
> > +     temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> > +
> > +     /* add input byte to DFA position */
> > +     temp = MM_AND(temp, bytes);
> > +     temp = MM_AND(temp, next_input);
> > +     addr = MM_ADD32(addr, temp);
> > +
> > +     /*
> > +      * Calc addr for Range nodes -> range_index + range(input)
> > +      */
> > +     node_types = MM_CMPEQ32(node_types, type_quad_range);
> > +
> > +     /*
> > +      * Calculate number of range boundaries that are less than the
> > +      * input value. Range boundaries for each node are in signed 8
bit,
> > +      * ordered from -128 to 127 in the indicies2 register.
> > +      * This is effectively a popcnt of bytes that are greater than the
> > +      * input byte.
> > +      */
> > +
> > +     /* shuffle input byte to all 4 positions of 32 bit value */
> > +     temp = MM_SHUFFLE8(next_input, shuffle_input);
> > +
> > +     /* check ranges */
> > +     temp = MM_CMPGT8(temp, *indicies2);
> > +
> > +     /* convert -1 to 1 (bytes greater than input byte */
> > +     temp = MM_SIGN8(temp, temp);
> > +
> > +     /* horizontal add pairs of bytes into words */
> > +     temp = MM_MADD8(temp, temp);
> > +
> > +     /* horizontal add pairs of words into dwords */
> > +     temp = MM_MADD16(temp, ones_16);
> > +
> > +     /* mask to range type nodes */
> > +     temp = MM_AND(temp, node_types);
> > +
> > +     /* add index into node position */
> > +     return MM_ADD32(addr, temp);
> > +}
> > +
> > +/*
> > + * Process 4 transitions (in 2 SIMD registers) in parallel
> > + */
> > +static inline xmm_t
> > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > +     xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > +     const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> > +{
> > +     xmm_t addr;
> > +     uint64_t trans0, trans2;
> > +
> > +      /* Calculate the address (array index) for all 4 transitions. */
> > +
> > +     addr = acl_calc_addr(index_mask, next_input, shuffle_input,
ones_16,
> > +             bytes, type_quad_range, indicies1, indicies2);
> > +
> > +      /* Gather 64 bit transitions and pack back into 2 registers. */
> > +
> > +     trans0 = trans[MM_CVT32(addr)];
> > +
> > +     /* get slot 2 */
> > +
> > +     /* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> > +     addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> > +     trans2 = trans[MM_CVT32(addr)];
> > +
> > +     /* get slot 1 */
> > +
> > +     /* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> > +     addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > +     *indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> > +
> > +     /* get slot 3 */
> > +
> > +     /* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> > +     addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> > +     *indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> > +
> > +     return MM_SRL32(next_input, 8);
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 8 traversals in parallel
> > + */
> > +static inline int
> > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +     uint32_t *results, uint32_t total_packets, uint32_t categories)
> > +{
> > +     int n;
> > +     struct acl_flow_data flows;
> > +     uint64_t index_array[MAX_SEARCHES_SSE8];
> > +     struct completion cmplt[MAX_SEARCHES_SSE8];
> > +     struct parms parms[MAX_SEARCHES_SSE8];
> > +     xmm_t input0, input1;
> > +     xmm_t indicies1, indicies2, indicies3, indicies4;
> > +
> > +     acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > +             total_packets, categories, ctx->trans_table);
> > +
> > +     for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> > +             cmplt[n].count = 0;
> > +             index_array[n] = acl_start_next_trie(&flows, parms, n,
ctx);
> > +     }
> > +
> > +     /*
> > +      * indicies1 contains index_array[0,1]
> > +      * indicies2 contains index_array[2,3]
> > +      * indicies3 contains index_array[4,5]
> > +      * indicies4 contains index_array[6,7]
> > +      */
> > +
> > +     indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > +     indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > +
> > +     indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> > +     indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> > +
> > +      /* Check for any matches. */
> > +     acl_match_check_x4(0, ctx, parms, &flows,
> > +             &indicies1, &indicies2, mm_match_mask.m);
> > +     acl_match_check_x4(4, ctx, parms, &flows,
> > +             &indicies3, &indicies4, mm_match_mask.m);
> > +
> > +     while (flows.started > 0) {
> > +
> > +             /* Gather 4 bytes of input data for each stream. */
> > +             input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms,
0),
> > +                     0);
> > +             input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms,
4),
> > +                     0);
> > +
> > +             input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1),
1);
> > +             input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5),
1);
> > +
> > +             input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2),
2);
> > +             input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6),
2);
> > +
> > +             input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3),
3);
> > +             input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7),
3);
> > +
> > +              /* Process the 4 bytes of input on each stream. */
> > +
> > +             input0 = transition4(mm_index_mask.m, input0,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +             input1 = transition4(mm_index_mask.m, input1,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies3, &indicies4);
> > +
> > +             input0 = transition4(mm_index_mask.m, input0,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +             input1 = transition4(mm_index_mask.m, input1,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies3, &indicies4);
> > +
> > +             input0 = transition4(mm_index_mask.m, input0,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +             input1 = transition4(mm_index_mask.m, input1,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies3, &indicies4);
> > +
> > +             input0 = transition4(mm_index_mask.m, input0,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +             input1 = transition4(mm_index_mask.m, input1,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies3, &indicies4);
> > +
> > +              /* Check for any matches. */
> > +             acl_match_check_x4(0, ctx, parms, &flows,
> > +                     &indicies1, &indicies2, mm_match_mask.m);
> > +             acl_match_check_x4(4, ctx, parms, &flows,
> > +                     &indicies3, &indicies4, mm_match_mask.m);
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 4 traversals in parallel
> > + */
> > +static inline int
> > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +      uint32_t *results, int total_packets, uint32_t categories)
> > +{
> > +     int n;
> > +     struct acl_flow_data flows;
> > +     uint64_t index_array[MAX_SEARCHES_SSE4];
> > +     struct completion cmplt[MAX_SEARCHES_SSE4];
> > +     struct parms parms[MAX_SEARCHES_SSE4];
> > +     xmm_t input, indicies1, indicies2;
> > +
> > +     acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > +             total_packets, categories, ctx->trans_table);
> > +
> > +     for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> > +             cmplt[n].count = 0;
> > +             index_array[n] = acl_start_next_trie(&flows, parms, n,
ctx);
> > +     }
> > +
> > +     indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > +     indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > +
> > +     /* Check for any matches. */
> > +     acl_match_check_x4(0, ctx, parms, &flows,
> > +             &indicies1, &indicies2, mm_match_mask.m);
> > +
> > +     while (flows.started > 0) {
> > +
> > +             /* Gather 4 bytes of input data for each stream. */
> > +             input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms,
0), 0);
> > +             input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > +             input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> > +             input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> > +
> > +             /* Process the 4 bytes of input on each stream. */
> > +             input = transition4(mm_index_mask.m, input,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +              input = transition4(mm_index_mask.m, input,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +              input = transition4(mm_index_mask.m, input,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +              input = transition4(mm_index_mask.m, input,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +             /* Check for any matches. */
> > +             acl_match_check_x4(0, ctx, parms, &flows,
> > +                     &indicies1, &indicies2, mm_match_mask.m);
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static inline xmm_t
> > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > +     xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > +     const uint64_t *trans, xmm_t *indicies1)
> > +{
> > +     uint64_t t;
> > +     xmm_t addr, indicies2;
> > +
> > +     indicies2 = MM_XOR(ones_16, ones_16);
> > +
> > +     addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > +             bytes, type_quad_range, indicies1, &indicies2);
> > +
> > +     /* Gather 64 bit transitions and pack 2 per register. */
> > +
> > +     t = trans[MM_CVT32(addr)];
> > +
> > +     /* get slot 1 */
> > +     addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > +     *indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> > +
> > +     return MM_SRL32(next_input, 8);
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 2 traversals in parallel.
> > + */
> > +static inline int
> > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +     uint32_t *results, uint32_t total_packets, uint32_t categories)
> > +{
> > +     int n;
> > +     struct acl_flow_data flows;
> > +     uint64_t index_array[MAX_SEARCHES_SSE2];
> > +     struct completion cmplt[MAX_SEARCHES_SSE2];
> > +     struct parms parms[MAX_SEARCHES_SSE2];
> > +     xmm_t input, indicies;
> > +
> > +     acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > +             total_packets, categories, ctx->trans_table);
> > +
> > +     for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> > +             cmplt[n].count = 0;
> > +             index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > +     }
> > +
> > +     indicies = MM_LOADU((xmm_t *) &index_array[0]);
> > +
> > +     /* Check for any matches. */
> > +     acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> > +
> > +     while (flows.started > 0) {
> > +
> > +             /* Gather 4 bytes of input data for each stream. */
> > +             input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > +             input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > +
> > +             /* Process the 4 bytes of input on each stream. */
> > +
> > +             input = transition2(mm_index_mask64.m, input,
> > +                     mm_shuffle_input64.m, mm_ones_16.m,
> > +                     mm_bytes64.m, mm_type_quad_range64.m,
> > +                     flows.trans, &indicies);
> > +
> > +             input = transition2(mm_index_mask64.m, input,
> > +                     mm_shuffle_input64.m, mm_ones_16.m,
> > +                     mm_bytes64.m, mm_type_quad_range64.m,
> > +                     flows.trans, &indicies);
> > +
> > +             input = transition2(mm_index_mask64.m, input,
> > +                     mm_shuffle_input64.m, mm_ones_16.m,
> > +                     mm_bytes64.m, mm_type_quad_range64.m,
> > +                     flows.trans, &indicies);
> > +
> > +             input = transition2(mm_index_mask64.m, input,
> > +                     mm_shuffle_input64.m, mm_ones_16.m,
> > +                     mm_bytes64.m, mm_type_quad_range64.m,
> > +                     flows.trans, &indicies);
> > +
> > +             /* Check for any matches. */
> > +             acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> > +                     mm_match_mask64.m);
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +int
> > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +     uint32_t *results, uint32_t num, uint32_t categories)
> > +{
> > +     if (categories != 1 &&
> > +             ((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> > +             return -EINVAL;
> > +
> > +     if (likely(num >= MAX_SEARCHES_SSE8))
> > +             return search_sse_8(ctx, data, results, num, categories);
> > +     else if (num >= MAX_SEARCHES_SSE4)
> > +             return search_sse_4(ctx, data, results, num, categories);
> > +     else
> > +             return search_sse_2(ctx, data, results, num, categories);
> > +}
> > diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> > index 7c288bd..0cde07e 100644
> > --- a/lib/librte_acl/rte_acl.c
> > +++ b/lib/librte_acl/rte_acl.c
> > @@ -38,6 +38,21 @@
> >
> >  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> >
> > +/* by default, use always avaialbe scalar code path. */
> > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > +
> make this static, the outside world shouldn't need to see it.
>
> > +void __attribute__((constructor(INT16_MAX)))
> > +rte_acl_select_classify(void)
> Make it static, The outside world doesn't need to call this.
>
> > +{
> > +     if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > +             /* SSE version requires SSE4.1 */
> > +             rte_acl_default_classify = rte_acl_classify_sse;
> > +     } else {
> > +             /* reset to scalar version. */
> > +             rte_acl_default_classify = rte_acl_classify_scalar;
> Don't need the else clause here, the static initalizer has you covered.
> > +     }
> > +}
> > +
> > +
> > +/**
> > + * Invokes default rte_acl_classify function.
> > + */
> > +extern rte_acl_classify_t rte_acl_default_classify;
> > +
> Doesn't need to be extern.
> > +#define      rte_acl_classify(ctx, data, results, num, categories)   \
> > +     (*rte_acl_default_classify)(ctx, data, results, num, categories)
> > +
> Not sure why you need this either.  The rte_acl_classify_t should be
enough, no?
>
> Regards
> Neil
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  2014-08-07 20:58   ` Vincent JARDIN
@ 2014-08-07 21:28     ` Chris Wright
  2014-08-08  2:07       ` Neil Horman
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Wright @ 2014-08-07 21:28 UTC (permalink / raw)
  To: Vincent JARDIN; +Cc: dev

* Vincent JARDIN (vincent.jardin@6wind.com) wrote:
> What's about using function versioning attributes too:
> 
> https://gcc.gnu.org/wiki/FunctionMultiVersioning
> 
> ?

Neat, hadn't seen that gcc feature before, but:

  "This support is available in GCC 4.8 and later. Support is only
   available in C++ for i386 targets."

I have 4.8.2: gcc errors out, g++ works.  And it's entirely unclear re:
icc (which was a supported compiler).  Seems like it's not
really an option.
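
For reference, the wiki example boils down to roughly this (illustrative
only, not our code); g++ builds it and dispatches through an ifunc
resolver, while gcc rejects the second definition:

    __attribute__ ((target ("default")))
    int foo (void)
    {
            /* generic fallback version */
            return 0;
    }

    __attribute__ ((target ("sse4.2")))
    int foo (void)
    {
            /* selected at run time on SSE4.2-capable cpus */
            return 1;
    }

    int main (void)
    {
            return foo ();  /* call goes through the ifunc dispatcher */
    }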

thanks,
-chris

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  2014-08-07 21:28     ` Chris Wright
@ 2014-08-08  2:07       ` Neil Horman
  0 siblings, 0 replies; 21+ messages in thread
From: Neil Horman @ 2014-08-08  2:07 UTC (permalink / raw)
  To: Chris Wright; +Cc: dev

On Thu, Aug 07, 2014 at 02:28:47PM -0700, Chris Wright wrote:
> * Vincent JARDIN (vincent.jardin@6wind.com) wrote:
> > What's about using function versioning attributes too:
> > 
> > https://gcc.gnu.org/wiki/FunctionMultiVersioning
> > 
> > ?
> 
> Neat, hadn't seen that gcc feature before, but:
> 
>   "This support is available in GCC 4.8 and later. Support is only
>    available in C++ for i386 targets."
> 
> I have 4.8.2, gcc errors, g++ works.  And entirely unclear re:
> icc (which was a supported compiler).  Seems like it's not
> really an option.
> 
I agree, it's a nice idea, but probably not pragmatic until it's supported in all
generally available compilers (from distributions).  It's also a bit questionable as
to how this would work on alternate arches.

Neil

> thanks,
> -chris
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  2014-08-07 20:11 ` Neil Horman
  2014-08-07 20:58   ` Vincent JARDIN
@ 2014-08-08 11:49   ` Ananyev, Konstantin
  2014-08-08 12:25     ` Neil Horman
  1 sibling, 1 reply; 21+ messages in thread
From: Ananyev, Konstantin @ 2014-08-08 11:49 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Thursday, August 07, 2014 9:12 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> 
> On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > Make ACL library to build/work on 'default' architecture:
> > - make rte_acl_classify_scalar really scalar
> >  (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> > - Provide two versions of rte_acl_classify code path:
> >   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> >   and upper, return -ENOTSUP on lower arch.
> >   rte_acl_classify_scalar() - a slower version, but could be build and used
> >   on all systems.
> > - keep common code shared between these two codepaths.
> >
> > v2 chages:
> >  run-time selection of most appropriate code-path for given ISA.
> >  By default the highest supprted one is selected.
> >  User can still override that selection by manually assigning new value to
> >  the global function pointer rte_acl_default_classify.
> >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> >  points to.
> >
> >
> > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> 
> This is alot better thank you.  A few remaining issues.

My comments inline too.
Thanks
Konstantin

> 
> > ---
> >  app/test-acl/main.c                |  13 +-
> >  lib/librte_acl/Makefile            |   5 +-
> >  lib/librte_acl/acl_bld.c           |   5 +-
> >  lib/librte_acl/acl_match_check.def |  92 ++++
> >  lib/librte_acl/acl_run.c           | 944 -------------------------------------
> >  lib/librte_acl/acl_run.h           | 220 +++++++++
> >  lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
> >  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
> >  lib/librte_acl/rte_acl.c           |  15 +
> >  lib/librte_acl/rte_acl.h           |  24 +-
> >  10 files changed, 1189 insertions(+), 956 deletions(-)
> >  create mode 100644 lib/librte_acl/acl_match_check.def
> >  delete mode 100644 lib/librte_acl/acl_run.c
> >  create mode 100644 lib/librte_acl/acl_run.h
> >  create mode 100644 lib/librte_acl/acl_run_scalar.c
> >  create mode 100644 lib/librte_acl/acl_run_sse.c
> >
> > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > index d654409..45c6fa6 100644
> > --- a/app/test-acl/main.c
> > +++ b/app/test-acl/main.c
> > @@ -787,6 +787,10 @@ acx_init(void)
> >  	/* perform build. */
> >  	ret = rte_acl_build(config.acx, &cfg);
> >
> > +	/* setup default rte_acl_classify */
> > +	if (config.scalar)
> > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > +
> Exporting this variable as part of the ABI is a bad idea.  If the prototype of
> the function changes you have to update all your applications.

If the prototype of rte_acl_classify changes, most likely you'll have to update the code that uses it anyway. 

>  Make the pointer
> an internal symbol and set it using a get/set routine with an enum to represent
> the path to choose.  That will help isolate the ABI from the internal
> implementation. 

That was my first intention too.
But then I realised that if we make it internal, we'll need to make rte_acl_classify() a proper function,
and it will cost us an extra call (or jump).
Also I think the user should have the ability to change the default classify code path without modifying/rebuilding the acl library.
For example: a bug is discovered in an optimised code path, or the user may want to implement and use his own version of classify().  
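For illustration, from the application side that could look roughly like
this (my_classify/install_my_classify are invented names; it just wraps the
scalar path, assuming rte_acl_classify_scalar is exported as in this patch):

    #include <rte_acl.h>

    static int
    my_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
            uint32_t *results, uint32_t num, uint32_t categories)
    {
            /* user-supplied variant: e.g. add tracing, or avoid a buggy
             * optimised path by falling back to the scalar one */
            return rte_acl_classify_scalar(ctx, data, results, num, categories);
    }

    void
    install_my_classify(void)
    {
            /* every subsequent rte_acl_classify() call now goes through it */
            rte_acl_default_classify = my_classify;
    }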

> It will also let you prevent things like selecting a run time
> path that is incompatible with the running system

If the user is going to update rte_acl_default_classify, he is probably smart enough to know what he is doing.
On the other hand, the user can hit the same problem by simply calling rte_acl_classify_sse() directly.

> and prevent path switching
> during searches, which may produce unexpected results.

Not that I am advertising it, but it should be safe to update rte_acl_default_classify during searches:
all versions of classify should produce exactly the same result for each input packet and treat the acl context as read-only.

> 
> ><snip>
> > diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> > deleted file mode 100644
> > index e3d9fc1..0000000
> > --- a/lib/librte_acl/acl_run.c
> > +++ /dev/null
> > @@ -1,944 +0,0 @@
> > -/*-
> > - *   BSD LICENSE
> > - *
> > - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > - *   All rights reserved.
> > - *
> > - *   Redistribution and use in source and binary forms, with or without
> > - *   modification, are permitted provided that the following conditions
> ><snip>
> > +
> > +#define	__func_resolve_priority__	resolve_priority_scalar
> > +#define	__func_match_check__		acl_match_check_scalar
> > +#include "acl_match_check.def"
> > +
> I get this lets you make some more code common, but its just unpleasant to trace
> through.  Looking at the defintion of __func_match_check__ I don't see anything
> particularly performance sensitive there.  What if instead you simply redefined
> __func_match_check__ in a common internal header as acl_match_check (a generic
> function), and had it accept priority resolution function as an argument?  That
> would still give you all the performance enhancements without having to include
> c files in the middle of other c files, and would make the code a bit more
> parseable.

Yes, that way it would look much better.
And it seems that with '-findirect-inlining' gcc is able to inline them via pointers properly.
Will change as you suggested. 
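Just to confirm we mean the same thing, here is a minimal sketch of that
shape (simplified stand-in types, not the real ACL structures; only the
acl_match_check name comes from your suggestion):

    #include <stdint.h>

    typedef uint32_t (*resolve_priority_t)(uint32_t current, uint32_t candidate);

    /* stand-in resolver; the real scalar/sse resolvers would go here */
    static inline uint32_t
    resolve_priority_scalar(uint32_t current, uint32_t candidate)
    {
            return candidate > current ? candidate : current;
    }

    /*
     * One shared definition of the match-check logic, taking the priority
     * resolution routine as a parameter.  With -findirect-inlining gcc can
     * still inline the call through the pointer in each code path.
     */
    static inline uint32_t
    acl_match_check(uint32_t current, uint32_t candidate,
            resolve_priority_t resolve)
    {
            return resolve(current, candidate);
    }

    uint32_t
    acl_match_check_scalar(uint32_t current, uint32_t candidate)
    {
            return acl_match_check(current, candidate, resolve_priority_scalar);
    }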

> 
> > +/*
> > + * When processing the transition, rather than using if/else
> > + * construct, the offset is calculated for DFA and QRANGE and
> > + * then conditionally added to the address based on node type.
> > + * This is done to avoid branch mis-predictions. Since the
> > + * offset is rather simple calculation it is more efficient
> > + * to do the calculation and do a condition move rather than
> > + * a conditional branch to determine which calculation to do.
> > + */
> > +static inline uint32_t
> > +scan_forward(uint32_t input, uint32_t max)
> > +{
> > +	return (input == 0) ? max : rte_bsf32(input);
> > +}
> > +	}
> > +}
> ><snip>
> > +
> > +#define	__func_resolve_priority__	resolve_priority_sse
> > +#define	__func_match_check__		acl_match_check_sse
> > +#include "acl_match_check.def"
> > +
> Same deal as above.
> 
> > +/*
> > + * Extract transitions from an XMM register and check for any matches
> > + */
> > +static void
> > +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> > +	struct parms *parms, struct acl_flow_data *flows)
> > +{
> > +	uint64_t transition1, transition2;
> > +
> > +	/* extract transition from low 64 bits. */
> > +	transition1 = MM_CVT64(*indicies);
> > +
> > +	/* extract transition from high 64 bits. */
> > +	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> > +	transition2 = MM_CVT64(*indicies);
> > +
> > +	transition1 = acl_match_check_sse(transition1, slot, ctx,
> > +		parms, flows);
> > +	transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> > +		parms, flows);
> > +
> > +	/* update indicies with new transitions. */
> > +	*indicies = MM_SET64(transition2, transition1);
> > +}
> > +
> > +/*
> > + * Check for a match in 2 transitions (contained in SSE register)
> > + */
> > +static inline void
> > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > +	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> > +{
> > +	xmm_t temp;
> > +
> > +	temp = MM_AND(match_mask, *indicies);
> > +	while (!MM_TESTZ(temp, temp)) {
> > +		acl_process_matches(indicies, slot, ctx, parms, flows);
> > +		temp = MM_AND(match_mask, *indicies);
> > +	}
> > +}
> > +
> > +/*
> > + * Check for any match in 4 transitions (contained in 2 SSE registers)
> > + */
> > +static inline void
> > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > +	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> > +	xmm_t match_mask)
> > +{
> > +	xmm_t temp;
> > +
> > +	/* put low 32 bits of each transition into one register */
> > +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > +		0x88);
> > +	/* test for match node */
> > +	temp = MM_AND(match_mask, temp);
> > +
> > +	while (!MM_TESTZ(temp, temp)) {
> > +		acl_process_matches(indicies1, slot, ctx, parms, flows);
> > +		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> > +
> > +		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > +					(__m128)*indicies2,
> > +					0x88);
> > +		temp = MM_AND(match_mask, temp);
> > +	}
> > +}
> > +
> > +/*
> > + * Calculate the address of the next transition for
> > + * all types of nodes. Note that only DFA nodes and range
> > + * nodes actually transition to another node. Match
> > + * nodes don't move.
> > + */
> > +static inline xmm_t
> > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > +	xmm_t *indicies1, xmm_t *indicies2)
> > +{
> > +	xmm_t addr, node_types, temp;
> > +
> > +	/*
> > +	 * Note that no transition is done for a match
> > +	 * node and therefore a stream freezes when
> > +	 * it reaches a match.
> > +	 */
> > +
> > +	/* Shuffle low 32 into temp and high 32 into indicies2 */
> > +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > +		0x88);
> > +	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > +		(__m128)*indicies2, 0xdd);
> > +
> > +	/* Calc node type and node addr */
> > +	node_types = MM_ANDNOT(index_mask, temp);
> > +	addr = MM_AND(index_mask, temp);
> > +
> > +	/*
> > +	 * Calc addr for DFAs - addr = dfa_index + input_byte
> > +	 */
> > +
> > +	/* mask for DFA type (0) nodes */
> > +	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> > +
> > +	/* add input byte to DFA position */
> > +	temp = MM_AND(temp, bytes);
> > +	temp = MM_AND(temp, next_input);
> > +	addr = MM_ADD32(addr, temp);
> > +
> > +	/*
> > +	 * Calc addr for Range nodes -> range_index + range(input)
> > +	 */
> > +	node_types = MM_CMPEQ32(node_types, type_quad_range);
> > +
> > +	/*
> > +	 * Calculate number of range boundaries that are less than the
> > +	 * input value. Range boundaries for each node are in signed 8 bit,
> > +	 * ordered from -128 to 127 in the indicies2 register.
> > +	 * This is effectively a popcnt of bytes that are greater than the
> > +	 * input byte.
> > +	 */
> > +
> > +	/* shuffle input byte to all 4 positions of 32 bit value */
> > +	temp = MM_SHUFFLE8(next_input, shuffle_input);
> > +
> > +	/* check ranges */
> > +	temp = MM_CMPGT8(temp, *indicies2);
> > +
> > +	/* convert -1 to 1 (bytes greater than input byte */
> > +	temp = MM_SIGN8(temp, temp);
> > +
> > +	/* horizontal add pairs of bytes into words */
> > +	temp = MM_MADD8(temp, temp);
> > +
> > +	/* horizontal add pairs of words into dwords */
> > +	temp = MM_MADD16(temp, ones_16);
> > +
> > +	/* mask to range type nodes */
> > +	temp = MM_AND(temp, node_types);
> > +
> > +	/* add index into node position */
> > +	return MM_ADD32(addr, temp);
> > +}
> > +
> > +/*
> > + * Process 4 transitions (in 2 SIMD registers) in parallel
> > + */
> > +static inline xmm_t
> > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > +	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> > +{
> > +	xmm_t addr;
> > +	uint64_t trans0, trans2;
> > +
> > +	 /* Calculate the address (array index) for all 4 transitions. */
> > +
> > +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > +		bytes, type_quad_range, indicies1, indicies2);
> > +
> > +	 /* Gather 64 bit transitions and pack back into 2 registers. */
> > +
> > +	trans0 = trans[MM_CVT32(addr)];
> > +
> > +	/* get slot 2 */
> > +
> > +	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> > +	trans2 = trans[MM_CVT32(addr)];
> > +
> > +	/* get slot 1 */
> > +
> > +	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> > +
> > +	/* get slot 3 */
> > +
> > +	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> > +	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> > +
> > +	return MM_SRL32(next_input, 8);
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 8 traversals in parallel
> > + */
> > +static inline int
> > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> > +{
> > +	int n;
> > +	struct acl_flow_data flows;
> > +	uint64_t index_array[MAX_SEARCHES_SSE8];
> > +	struct completion cmplt[MAX_SEARCHES_SSE8];
> > +	struct parms parms[MAX_SEARCHES_SSE8];
> > +	xmm_t input0, input1;
> > +	xmm_t indicies1, indicies2, indicies3, indicies4;
> > +
> > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > +		total_packets, categories, ctx->trans_table);
> > +
> > +	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> > +		cmplt[n].count = 0;
> > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > +	}
> > +
> > +	/*
> > +	 * indicies1 contains index_array[0,1]
> > +	 * indicies2 contains index_array[2,3]
> > +	 * indicies3 contains index_array[4,5]
> > +	 * indicies4 contains index_array[6,7]
> > +	 */
> > +
> > +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > +
> > +	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> > +	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> > +
> > +	 /* Check for any matches. */
> > +	acl_match_check_x4(0, ctx, parms, &flows,
> > +		&indicies1, &indicies2, mm_match_mask.m);
> > +	acl_match_check_x4(4, ctx, parms, &flows,
> > +		&indicies3, &indicies4, mm_match_mask.m);
> > +
> > +	while (flows.started > 0) {
> > +
> > +		/* Gather 4 bytes of input data for each stream. */
> > +		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> > +			0);
> > +		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> > +			0);
> > +
> > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> > +
> > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> > +
> > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> > +
> > +		 /* Process the 4 bytes of input on each stream. */
> > +
> > +		input0 = transition4(mm_index_mask.m, input0,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		input1 = transition4(mm_index_mask.m, input1,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies3, &indicies4);
> > +
> > +		input0 = transition4(mm_index_mask.m, input0,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		input1 = transition4(mm_index_mask.m, input1,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies3, &indicies4);
> > +
> > +		input0 = transition4(mm_index_mask.m, input0,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		input1 = transition4(mm_index_mask.m, input1,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies3, &indicies4);
> > +
> > +		input0 = transition4(mm_index_mask.m, input0,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		input1 = transition4(mm_index_mask.m, input1,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies3, &indicies4);
> > +
> > +		 /* Check for any matches. */
> > +		acl_match_check_x4(0, ctx, parms, &flows,
> > +			&indicies1, &indicies2, mm_match_mask.m);
> > +		acl_match_check_x4(4, ctx, parms, &flows,
> > +			&indicies3, &indicies4, mm_match_mask.m);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 4 traversals in parallel
> > + */
> > +static inline int
> > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +	 uint32_t *results, int total_packets, uint32_t categories)
> > +{
> > +	int n;
> > +	struct acl_flow_data flows;
> > +	uint64_t index_array[MAX_SEARCHES_SSE4];
> > +	struct completion cmplt[MAX_SEARCHES_SSE4];
> > +	struct parms parms[MAX_SEARCHES_SSE4];
> > +	xmm_t input, indicies1, indicies2;
> > +
> > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > +		total_packets, categories, ctx->trans_table);
> > +
> > +	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> > +		cmplt[n].count = 0;
> > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > +	}
> > +
> > +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > +
> > +	/* Check for any matches. */
> > +	acl_match_check_x4(0, ctx, parms, &flows,
> > +		&indicies1, &indicies2, mm_match_mask.m);
> > +
> > +	while (flows.started > 0) {
> > +
> > +		/* Gather 4 bytes of input data for each stream. */
> > +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> > +
> > +		/* Process the 4 bytes of input on each stream. */
> > +		input = transition4(mm_index_mask.m, input,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		 input = transition4(mm_index_mask.m, input,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		 input = transition4(mm_index_mask.m, input,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		 input = transition4(mm_index_mask.m, input,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		/* Check for any matches. */
> > +		acl_match_check_x4(0, ctx, parms, &flows,
> > +			&indicies1, &indicies2, mm_match_mask.m);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static inline xmm_t
> > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > +	const uint64_t *trans, xmm_t *indicies1)
> > +{
> > +	uint64_t t;
> > +	xmm_t addr, indicies2;
> > +
> > +	indicies2 = MM_XOR(ones_16, ones_16);
> > +
> > +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > +		bytes, type_quad_range, indicies1, &indicies2);
> > +
> > +	/* Gather 64 bit transitions and pack 2 per register. */
> > +
> > +	t = trans[MM_CVT32(addr)];
> > +
> > +	/* get slot 1 */
> > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> > +
> > +	return MM_SRL32(next_input, 8);
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 2 traversals in parallel.
> > + */
> > +static inline int
> > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> > +{
> > +	int n;
> > +	struct acl_flow_data flows;
> > +	uint64_t index_array[MAX_SEARCHES_SSE2];
> > +	struct completion cmplt[MAX_SEARCHES_SSE2];
> > +	struct parms parms[MAX_SEARCHES_SSE2];
> > +	xmm_t input, indicies;
> > +
> > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > +		total_packets, categories, ctx->trans_table);
> > +
> > +	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> > +		cmplt[n].count = 0;
> > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > +	}
> > +
> > +	indicies = MM_LOADU((xmm_t *) &index_array[0]);
> > +
> > +	/* Check for any matches. */
> > +	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> > +
> > +	while (flows.started > 0) {
> > +
> > +		/* Gather 4 bytes of input data for each stream. */
> > +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > +
> > +		/* Process the 4 bytes of input on each stream. */
> > +
> > +		input = transition2(mm_index_mask64.m, input,
> > +			mm_shuffle_input64.m, mm_ones_16.m,
> > +			mm_bytes64.m, mm_type_quad_range64.m,
> > +			flows.trans, &indicies);
> > +
> > +		input = transition2(mm_index_mask64.m, input,
> > +			mm_shuffle_input64.m, mm_ones_16.m,
> > +			mm_bytes64.m, mm_type_quad_range64.m,
> > +			flows.trans, &indicies);
> > +
> > +		input = transition2(mm_index_mask64.m, input,
> > +			mm_shuffle_input64.m, mm_ones_16.m,
> > +			mm_bytes64.m, mm_type_quad_range64.m,
> > +			flows.trans, &indicies);
> > +
> > +		input = transition2(mm_index_mask64.m, input,
> > +			mm_shuffle_input64.m, mm_ones_16.m,
> > +			mm_bytes64.m, mm_type_quad_range64.m,
> > +			flows.trans, &indicies);
> > +
> > +		/* Check for any matches. */
> > +		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> > +			mm_match_mask64.m);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +int
> > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +	uint32_t *results, uint32_t num, uint32_t categories)
> > +{
> > +	if (categories != 1 &&
> > +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> > +		return -EINVAL;
> > +
> > +	if (likely(num >= MAX_SEARCHES_SSE8))
> > +		return search_sse_8(ctx, data, results, num, categories);
> > +	else if (num >= MAX_SEARCHES_SSE4)
> > +		return search_sse_4(ctx, data, results, num, categories);
> > +	else
> > +		return search_sse_2(ctx, data, results, num, categories);
> > +}
> > diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> > index 7c288bd..0cde07e 100644
> > --- a/lib/librte_acl/rte_acl.c
> > +++ b/lib/librte_acl/rte_acl.c
> > @@ -38,6 +38,21 @@
> >
> >  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> >
> > +/* by default, use always avaialbe scalar code path. */
> > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > +
> make this static, the outside world shouldn't need to see it.

As I said above, I think it more plausible to keep it globally visible.

> 
> > +void __attribute__((constructor(INT16_MAX)))
> > +rte_acl_select_classify(void)
> Make it static, The outside world doesn't need to call this.

See above; I would like the user to have the ability to call it manually if needed.

> 
> > +{
> > +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > +		/* SSE version requires SSE4.1 */
> > +		rte_acl_default_classify = rte_acl_classify_sse;
> > +	} else {
> > +		/* reset to scalar version. */
> > +		rte_acl_default_classify = rte_acl_classify_scalar;
> Don't need the else clause here, the static initalizer has you covered.

I think we'd better keep it like that, in case the user calls it manually:
we always reset rte_acl_default_classify to the 'best proper' value.

> > +	}
> > +}
> > +
> > +
> > +/**
> > + * Invokes default rte_acl_classify function.
> > + */
> > +extern rte_acl_classify_t rte_acl_default_classify;
> > +
> Doesn't need to be extern.
> > +#define	rte_acl_classify(ctx, data, results, num, categories)	\
> > +	(*rte_acl_default_classify)(ctx, data, results, num, categories)
> > +
> Not sure why you need this either.  The rte_acl_classify_t should be enough, no?

We preserve the existing rte_acl_classify() API, so users don't need to modify their code.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  2014-08-08 11:49   ` Ananyev, Konstantin
@ 2014-08-08 12:25     ` Neil Horman
  2014-08-08 13:09       ` Ananyev, Konstantin
  0 siblings, 1 reply; 21+ messages in thread
From: Neil Horman @ 2014-08-08 12:25 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

On Fri, Aug 08, 2014 at 11:49:58AM +0000, Ananyev, Konstantin wrote:
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Thursday, August 07, 2014 9:12 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> > 
> > On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > > Make ACL library to build/work on 'default' architecture:
> > > - make rte_acl_classify_scalar really scalar
> > >  (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> > > - Provide two versions of rte_acl_classify code path:
> > >   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> > >   and upper, return -ENOTSUP on lower arch.
> > >   rte_acl_classify_scalar() - a slower version, but could be build and used
> > >   on all systems.
> > > - keep common code shared between these two codepaths.
> > >
> > > v2 chages:
> > >  run-time selection of most appropriate code-path for given ISA.
> > >  By default the highest supprted one is selected.
> > >  User can still override that selection by manually assigning new value to
> > >  the global function pointer rte_acl_default_classify.
> > >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > >  points to.
> > >
> > >
> > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > 
> > This is alot better thank you.  A few remaining issues.
> 
> My comments inline too.
> Thanks
> Konstantin
> 
> > 
> > > ---
> > >  app/test-acl/main.c                |  13 +-
> > >  lib/librte_acl/Makefile            |   5 +-
> > >  lib/librte_acl/acl_bld.c           |   5 +-
> > >  lib/librte_acl/acl_match_check.def |  92 ++++
> > >  lib/librte_acl/acl_run.c           | 944 -------------------------------------
> > >  lib/librte_acl/acl_run.h           | 220 +++++++++
> > >  lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
> > >  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
> > >  lib/librte_acl/rte_acl.c           |  15 +
> > >  lib/librte_acl/rte_acl.h           |  24 +-
> > >  10 files changed, 1189 insertions(+), 956 deletions(-)
> > >  create mode 100644 lib/librte_acl/acl_match_check.def
> > >  delete mode 100644 lib/librte_acl/acl_run.c
> > >  create mode 100644 lib/librte_acl/acl_run.h
> > >  create mode 100644 lib/librte_acl/acl_run_scalar.c
> > >  create mode 100644 lib/librte_acl/acl_run_sse.c
> > >
> > > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > > index d654409..45c6fa6 100644
> > > --- a/app/test-acl/main.c
> > > +++ b/app/test-acl/main.c
> > > @@ -787,6 +787,10 @@ acx_init(void)
> > >  	/* perform build. */
> > >  	ret = rte_acl_build(config.acx, &cfg);
> > >
> > > +	/* setup default rte_acl_classify */
> > > +	if (config.scalar)
> > > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > > +
> > Exporting this variable as part of the ABI is a bad idea.  If the prototype of
> > the function changes you have to update all your applications.
> 
> If the prototype of rte_acl_classify will change, most likely you'll have to update code that uses it anyway. 
> 
Why?  If you hide this from the application, changes to the internal
implementation will also be invisible.  When building as a DSO, an application
will be able to transition between libraries without the need for a rebuild.

> >  Make the pointer
> > an internal symbol and set it using a get/set routine with an enum to represent
> > the path to choose.  That will help isolate the ABI from the internal
> > implementation. 
> 
> That's was my first intention too.
> But then I realised that if we'll make it internal, then we'll need to make rte_acl_classify() a proper function
> and it will cost us extra call (or jump).
That's true, but I don't see that as a problem.  We're not talking about a hot
code path here, it's a setup function.  Or do you think that an application will
be switching between classification functions on every classify operation?


> Also I think user should have an ability to change default classify code path without modifying/rebuilding acl library.
I agree, but both of the methods we are advocating allow that.  It's really just
a question of exposing the mechanism as data or text in the binary.  Exposing it
as data comes with implicit ABI constraints that are less prevalent when done
as code entry points.
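
To be concrete, something roughly like this is what I have in mind (all
names below are made up for illustration, not a final API proposal):

    /* rte_acl.h */
    enum rte_acl_classify_alg {
            RTE_ACL_CLASSIFY_DEFAULT = 0,
            RTE_ACL_CLASSIFY_SCALAR,
            RTE_ACL_CLASSIFY_SSE,
    };

    int rte_acl_set_default_classify(enum rte_acl_classify_alg alg);

    /* rte_acl.c: the pointer stays internal to the library */
    #include <errno.h>
    #include <rte_acl.h>
    #include <rte_cpuflags.h>

    static rte_acl_classify_t classify_fn = rte_acl_classify_scalar;

    int
    rte_acl_set_default_classify(enum rte_acl_classify_alg alg)
    {
            switch (alg) {
            case RTE_ACL_CLASSIFY_SSE:
                    /* refuse a code path the running cpu can't execute */
                    if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
                            return -ENOTSUP;
                    classify_fn = rte_acl_classify_sse;
                    return 0;
            case RTE_ACL_CLASSIFY_DEFAULT:
            case RTE_ACL_CLASSIFY_SCALAR:
                    classify_fn = rte_acl_classify_scalar;
                    return 0;
            default:
                    return -EINVAL;
            }
    }

    int
    rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
            uint32_t *results, uint32_t num, uint32_t categories)
    {
            return classify_fn(ctx, data, results, num, categories);
    }

That way the only things exported are the set routine and rte_acl_classify()
itself, and the function pointer never becomes part of the ABI.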

> For example: a bug in an optimised code path is discovered, or user may want to implement and use his own version of classify().  
In the case of a bug in the optimized path, you just fix the bug.  If you want
to provide your own classification function, that's fine I suppose, but that
seems completely outside the scope of what we're trying to do here.  It's not
advantageous to just throw that in there.  If you want to be able to provide
your own classifier function, let's at least take some time to make sure that the
function prototype is sufficiently capable of accepting all the data you might want
to pass it in the future, before we go exposing it.  Otherwise you'll have to
break the ABI in future versions, which is something we've been discussing
trying to avoid.

> > It will also let you prevent things like selecting a run time
> > path that is incompatible with the running system
> 
> If the user going to update rte_acl_default_classify he is probably smart enough to know what he is doing.
That really seems like poor design to me.  I don't see why you wouldn't at least
want to warn the developer of an application if they were, at run time, to assign
a default classifier method that is incompatible with the running system.  Yes,
they're likely smart enough to know what they're doing, but smart people make
mistakes, and appreciate being told when they do, especially if the
method of telling is something a bit more civil than a machine check that
might occur well after the application has been initialized.

> From other hand - user can hit same problem by simply calling rte_acl_classify_sse() directly.
Not if the function is statically declared and not exposed to the application,
they can't :)

> 
> > and prevent path switching
> > during searches, which may produce unexpected results.
> 
> Not that I am advertising it, but  it should be safe to update rte_acl_default_classify during searches:
> All versions of classify should produce exactly the same result for each input packet and treat acl context as read-only.
> 
Fair enough.

> > 
> > ><snip>
> > > diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> > > deleted file mode 100644
> > > index e3d9fc1..0000000
> > > --- a/lib/librte_acl/acl_run.c
> > > +++ /dev/null
> > > @@ -1,944 +0,0 @@
> > > -/*-
> > > - *   BSD LICENSE
> > > - *
> > > - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > - *   All rights reserved.
> > > - *
> > > - *   Redistribution and use in source and binary forms, with or without
> > > - *   modification, are permitted provided that the following conditions
> > ><snip>
> > > +
> > > +#define	__func_resolve_priority__	resolve_priority_scalar
> > > +#define	__func_match_check__		acl_match_check_scalar
> > > +#include "acl_match_check.def"
> > > +
> > I get this lets you make some more code common, but its just unpleasant to trace
> > through.  Looking at the defintion of __func_match_check__ I don't see anything
> > particularly performance sensitive there.  What if instead you simply redefined
> > __func_match_check__ in a common internal header as acl_match_check (a generic
> > function), and had it accept priority resolution function as an argument?  That
> > would still give you all the performance enhancements without having to include
> > c files in the middle of other c files, and would make the code a bit more
> > parseable.
> 
> Yes, that way it would look much better.
> And it seems that with '-findirect-inlining' gcc is able to inline them via pointers properly.
> Will change as you suggested. 
> 
Thank you
Neil

> > 
> > > +/*
> > > + * When processing the transition, rather than using if/else
> > > + * construct, the offset is calculated for DFA and QRANGE and
> > > + * then conditionally added to the address based on node type.
> > > + * This is done to avoid branch mis-predictions. Since the
> > > + * offset is rather simple calculation it is more efficient
> > > + * to do the calculation and do a condition move rather than
> > > + * a conditional branch to determine which calculation to do.
> > > + */
> > > +static inline uint32_t
> > > +scan_forward(uint32_t input, uint32_t max)
> > > +{
> > > +	return (input == 0) ? max : rte_bsf32(input);
> > > +}
> > > +	}
> > > +}
> > ><snip>
> > > +
> > > +#define	__func_resolve_priority__	resolve_priority_sse
> > > +#define	__func_match_check__		acl_match_check_sse
> > > +#include "acl_match_check.def"
> > > +
> > Same deal as above.
> > 
> > > +/*
> > > + * Extract transitions from an XMM register and check for any matches
> > > + */
> > > +static void
> > > +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> > > +	struct parms *parms, struct acl_flow_data *flows)
> > > +{
> > > +	uint64_t transition1, transition2;
> > > +
> > > +	/* extract transition from low 64 bits. */
> > > +	transition1 = MM_CVT64(*indicies);
> > > +
> > > +	/* extract transition from high 64 bits. */
> > > +	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> > > +	transition2 = MM_CVT64(*indicies);
> > > +
> > > +	transition1 = acl_match_check_sse(transition1, slot, ctx,
> > > +		parms, flows);
> > > +	transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> > > +		parms, flows);
> > > +
> > > +	/* update indicies with new transitions. */
> > > +	*indicies = MM_SET64(transition2, transition1);
> > > +}
> > > +
> > > +/*
> > > + * Check for a match in 2 transitions (contained in SSE register)
> > > + */
> > > +static inline void
> > > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > > +	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> > > +{
> > > +	xmm_t temp;
> > > +
> > > +	temp = MM_AND(match_mask, *indicies);
> > > +	while (!MM_TESTZ(temp, temp)) {
> > > +		acl_process_matches(indicies, slot, ctx, parms, flows);
> > > +		temp = MM_AND(match_mask, *indicies);
> > > +	}
> > > +}
> > > +
> > > +/*
> > > + * Check for any match in 4 transitions (contained in 2 SSE registers)
> > > + */
> > > +static inline void
> > > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > > +	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> > > +	xmm_t match_mask)
> > > +{
> > > +	xmm_t temp;
> > > +
> > > +	/* put low 32 bits of each transition into one register */
> > > +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > > +		0x88);
> > > +	/* test for match node */
> > > +	temp = MM_AND(match_mask, temp);
> > > +
> > > +	while (!MM_TESTZ(temp, temp)) {
> > > +		acl_process_matches(indicies1, slot, ctx, parms, flows);
> > > +		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> > > +
> > > +		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > > +					(__m128)*indicies2,
> > > +					0x88);
> > > +		temp = MM_AND(match_mask, temp);
> > > +	}
> > > +}
> > > +
> > > +/*
> > > + * Calculate the address of the next transition for
> > > + * all types of nodes. Note that only DFA nodes and range
> > > + * nodes actually transition to another node. Match
> > > + * nodes don't move.
> > > + */
> > > +static inline xmm_t
> > > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > +	xmm_t *indicies1, xmm_t *indicies2)
> > > +{
> > > +	xmm_t addr, node_types, temp;
> > > +
> > > +	/*
> > > +	 * Note that no transition is done for a match
> > > +	 * node and therefore a stream freezes when
> > > +	 * it reaches a match.
> > > +	 */
> > > +
> > > +	/* Shuffle low 32 into temp and high 32 into indicies2 */
> > > +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > > +		0x88);
> > > +	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > > +		(__m128)*indicies2, 0xdd);
> > > +
> > > +	/* Calc node type and node addr */
> > > +	node_types = MM_ANDNOT(index_mask, temp);
> > > +	addr = MM_AND(index_mask, temp);
> > > +
> > > +	/*
> > > +	 * Calc addr for DFAs - addr = dfa_index + input_byte
> > > +	 */
> > > +
> > > +	/* mask for DFA type (0) nodes */
> > > +	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> > > +
> > > +	/* add input byte to DFA position */
> > > +	temp = MM_AND(temp, bytes);
> > > +	temp = MM_AND(temp, next_input);
> > > +	addr = MM_ADD32(addr, temp);
> > > +
> > > +	/*
> > > +	 * Calc addr for Range nodes -> range_index + range(input)
> > > +	 */
> > > +	node_types = MM_CMPEQ32(node_types, type_quad_range);
> > > +
> > > +	/*
> > > +	 * Calculate number of range boundaries that are less than the
> > > +	 * input value. Range boundaries for each node are in signed 8 bit,
> > > +	 * ordered from -128 to 127 in the indicies2 register.
> > > +	 * This is effectively a popcnt of bytes that are greater than the
> > > +	 * input byte.
> > > +	 */
> > > +
> > > +	/* shuffle input byte to all 4 positions of 32 bit value */
> > > +	temp = MM_SHUFFLE8(next_input, shuffle_input);
> > > +
> > > +	/* check ranges */
> > > +	temp = MM_CMPGT8(temp, *indicies2);
> > > +
> > > +	/* convert -1 to 1 (bytes greater than input byte */
> > > +	temp = MM_SIGN8(temp, temp);
> > > +
> > > +	/* horizontal add pairs of bytes into words */
> > > +	temp = MM_MADD8(temp, temp);
> > > +
> > > +	/* horizontal add pairs of words into dwords */
> > > +	temp = MM_MADD16(temp, ones_16);
> > > +
> > > +	/* mask to range type nodes */
> > > +	temp = MM_AND(temp, node_types);
> > > +
> > > +	/* add index into node position */
> > > +	return MM_ADD32(addr, temp);
> > > +}
> > > +
> > > +/*
> > > + * Process 4 transitions (in 2 SIMD registers) in parallel
> > > + */
> > > +static inline xmm_t
> > > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > +	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> > > +{
> > > +	xmm_t addr;
> > > +	uint64_t trans0, trans2;
> > > +
> > > +	 /* Calculate the address (array index) for all 4 transitions. */
> > > +
> > > +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > > +		bytes, type_quad_range, indicies1, indicies2);
> > > +
> > > +	 /* Gather 64 bit transitions and pack back into 2 registers. */
> > > +
> > > +	trans0 = trans[MM_CVT32(addr)];
> > > +
> > > +	/* get slot 2 */
> > > +
> > > +	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> > > +	trans2 = trans[MM_CVT32(addr)];
> > > +
> > > +	/* get slot 1 */
> > > +
> > > +	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > > +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> > > +
> > > +	/* get slot 3 */
> > > +
> > > +	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> > > +	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> > > +
> > > +	return MM_SRL32(next_input, 8);
> > > +}
> > > +
> > > +/*
> > > + * Execute trie traversal with 8 traversals in parallel
> > > + */
> > > +static inline int
> > > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> > > +{
> > > +	int n;
> > > +	struct acl_flow_data flows;
> > > +	uint64_t index_array[MAX_SEARCHES_SSE8];
> > > +	struct completion cmplt[MAX_SEARCHES_SSE8];
> > > +	struct parms parms[MAX_SEARCHES_SSE8];
> > > +	xmm_t input0, input1;
> > > +	xmm_t indicies1, indicies2, indicies3, indicies4;
> > > +
> > > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > +		total_packets, categories, ctx->trans_table);
> > > +
> > > +	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> > > +		cmplt[n].count = 0;
> > > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > +	}
> > > +
> > > +	/*
> > > +	 * indicies1 contains index_array[0,1]
> > > +	 * indicies2 contains index_array[2,3]
> > > +	 * indicies3 contains index_array[4,5]
> > > +	 * indicies4 contains index_array[6,7]
> > > +	 */
> > > +
> > > +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > > +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > > +
> > > +	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> > > +	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> > > +
> > > +	 /* Check for any matches. */
> > > +	acl_match_check_x4(0, ctx, parms, &flows,
> > > +		&indicies1, &indicies2, mm_match_mask.m);
> > > +	acl_match_check_x4(4, ctx, parms, &flows,
> > > +		&indicies3, &indicies4, mm_match_mask.m);
> > > +
> > > +	while (flows.started > 0) {
> > > +
> > > +		/* Gather 4 bytes of input data for each stream. */
> > > +		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> > > +			0);
> > > +		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> > > +			0);
> > > +
> > > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> > > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> > > +
> > > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> > > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> > > +
> > > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> > > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> > > +
> > > +		 /* Process the 4 bytes of input on each stream. */
> > > +
> > > +		input0 = transition4(mm_index_mask.m, input0,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		input1 = transition4(mm_index_mask.m, input1,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies3, &indicies4);
> > > +
> > > +		input0 = transition4(mm_index_mask.m, input0,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		input1 = transition4(mm_index_mask.m, input1,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies3, &indicies4);
> > > +
> > > +		input0 = transition4(mm_index_mask.m, input0,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		input1 = transition4(mm_index_mask.m, input1,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies3, &indicies4);
> > > +
> > > +		input0 = transition4(mm_index_mask.m, input0,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		input1 = transition4(mm_index_mask.m, input1,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies3, &indicies4);
> > > +
> > > +		 /* Check for any matches. */
> > > +		acl_match_check_x4(0, ctx, parms, &flows,
> > > +			&indicies1, &indicies2, mm_match_mask.m);
> > > +		acl_match_check_x4(4, ctx, parms, &flows,
> > > +			&indicies3, &indicies4, mm_match_mask.m);
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * Execute trie traversal with 4 traversals in parallel
> > > + */
> > > +static inline int
> > > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > +	 uint32_t *results, int total_packets, uint32_t categories)
> > > +{
> > > +	int n;
> > > +	struct acl_flow_data flows;
> > > +	uint64_t index_array[MAX_SEARCHES_SSE4];
> > > +	struct completion cmplt[MAX_SEARCHES_SSE4];
> > > +	struct parms parms[MAX_SEARCHES_SSE4];
> > > +	xmm_t input, indicies1, indicies2;
> > > +
> > > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > +		total_packets, categories, ctx->trans_table);
> > > +
> > > +	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> > > +		cmplt[n].count = 0;
> > > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > +	}
> > > +
> > > +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > > +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > > +
> > > +	/* Check for any matches. */
> > > +	acl_match_check_x4(0, ctx, parms, &flows,
> > > +		&indicies1, &indicies2, mm_match_mask.m);
> > > +
> > > +	while (flows.started > 0) {
> > > +
> > > +		/* Gather 4 bytes of input data for each stream. */
> > > +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> > > +
> > > +		/* Process the 4 bytes of input on each stream. */
> > > +		input = transition4(mm_index_mask.m, input,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		 input = transition4(mm_index_mask.m, input,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		 input = transition4(mm_index_mask.m, input,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		 input = transition4(mm_index_mask.m, input,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		/* Check for any matches. */
> > > +		acl_match_check_x4(0, ctx, parms, &flows,
> > > +			&indicies1, &indicies2, mm_match_mask.m);
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static inline xmm_t
> > > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > +	const uint64_t *trans, xmm_t *indicies1)
> > > +{
> > > +	uint64_t t;
> > > +	xmm_t addr, indicies2;
> > > +
> > > +	indicies2 = MM_XOR(ones_16, ones_16);
> > > +
> > > +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > > +		bytes, type_quad_range, indicies1, &indicies2);
> > > +
> > > +	/* Gather 64 bit transitions and pack 2 per register. */
> > > +
> > > +	t = trans[MM_CVT32(addr)];
> > > +
> > > +	/* get slot 1 */
> > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > > +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> > > +
> > > +	return MM_SRL32(next_input, 8);
> > > +}
> > > +
> > > +/*
> > > + * Execute trie traversal with 2 traversals in parallel.
> > > + */
> > > +static inline int
> > > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> > > +{
> > > +	int n;
> > > +	struct acl_flow_data flows;
> > > +	uint64_t index_array[MAX_SEARCHES_SSE2];
> > > +	struct completion cmplt[MAX_SEARCHES_SSE2];
> > > +	struct parms parms[MAX_SEARCHES_SSE2];
> > > +	xmm_t input, indicies;
> > > +
> > > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > +		total_packets, categories, ctx->trans_table);
> > > +
> > > +	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> > > +		cmplt[n].count = 0;
> > > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > +	}
> > > +
> > > +	indicies = MM_LOADU((xmm_t *) &index_array[0]);
> > > +
> > > +	/* Check for any matches. */
> > > +	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> > > +
> > > +	while (flows.started > 0) {
> > > +
> > > +		/* Gather 4 bytes of input data for each stream. */
> > > +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > > +
> > > +		/* Process the 4 bytes of input on each stream. */
> > > +
> > > +		input = transition2(mm_index_mask64.m, input,
> > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > +			flows.trans, &indicies);
> > > +
> > > +		input = transition2(mm_index_mask64.m, input,
> > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > +			flows.trans, &indicies);
> > > +
> > > +		input = transition2(mm_index_mask64.m, input,
> > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > +			flows.trans, &indicies);
> > > +
> > > +		input = transition2(mm_index_mask64.m, input,
> > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > +			flows.trans, &indicies);
> > > +
> > > +		/* Check for any matches. */
> > > +		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> > > +			mm_match_mask64.m);
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +int
> > > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > +	uint32_t *results, uint32_t num, uint32_t categories)
> > > +{
> > > +	if (categories != 1 &&
> > > +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> > > +		return -EINVAL;
> > > +
> > > +	if (likely(num >= MAX_SEARCHES_SSE8))
> > > +		return search_sse_8(ctx, data, results, num, categories);
> > > +	else if (num >= MAX_SEARCHES_SSE4)
> > > +		return search_sse_4(ctx, data, results, num, categories);
> > > +	else
> > > +		return search_sse_2(ctx, data, results, num, categories);
> > > +}
> > > diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> > > index 7c288bd..0cde07e 100644
> > > --- a/lib/librte_acl/rte_acl.c
> > > +++ b/lib/librte_acl/rte_acl.c
> > > @@ -38,6 +38,21 @@
> > >
> > >  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> > >
> > > +/* by default, use always avaialbe scalar code path. */
> > > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > > +
> > make this static, the outside world shouldn't need to see it.
> 
> As I said above, I think it more plausible to keep it globally visible.
> 
> > 
> > > +void __attribute__((constructor(INT16_MAX)))
> > > +rte_acl_select_classify(void)
> > Make it static, The outside world doesn't need to call this.
> 
> See above, would like user to have an ability to call it manually if needed.
> 
> > 
> > > +{
> > > +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > > +		/* SSE version requires SSE4.1 */
> > > +		rte_acl_default_classify = rte_acl_classify_sse;
> > > +	} else {
> > > +		/* reset to scalar version. */
> > > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > Don't need the else clause here, the static initalizer has you covered.
> 
> I think we better keep it like that - in case user calls it manually.
> We always reset  rte_acl_default_classify to the 'best proper' value.
> 
> > > +	}
> > > +}
> > > +
> > > +
> > > +/**
> > > + * Invokes default rte_acl_classify function.
> > > + */
> > > +extern rte_acl_classify_t rte_acl_default_classify;
> > > +
> > Doesn't need to be extern.
> > > +#define	rte_acl_classify(ctx, data, results, num, categories)	\
> > > +	(*rte_acl_default_classify)(ctx, data, results, num, categories)
> > > +
> > Not sure why you need this either.  The rte_acl_classify_t should be enough, no?
> 
> We preserve existing rte_acl_classify() API, so users don't need to modify their code.
> 
This would be a great candidate for versioning (Bruce and I have been discussing
this).

Neil

> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  2014-08-08 12:25     ` Neil Horman
@ 2014-08-08 13:09       ` Ananyev, Konstantin
  2014-08-08 14:30         ` Neil Horman
  0 siblings, 1 reply; 21+ messages in thread
From: Ananyev, Konstantin @ 2014-08-08 13:09 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev


> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Friday, August 08, 2014 1:25 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> 
> On Fri, Aug 08, 2014 at 11:49:58AM +0000, Ananyev, Konstantin wrote:
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Thursday, August 07, 2014 9:12 PM
> > > To: Ananyev, Konstantin
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> > >
> > > On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > > > Make ACL library to build/work on 'default' architecture:
> > > > - make rte_acl_classify_scalar really scalar
> > > >  (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> > > > - Provide two versions of rte_acl_classify code path:
> > > >   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> > > >   and upper, return -ENOTSUP on lower arch.
> > > >   rte_acl_classify_scalar() - a slower version, but could be build and used
> > > >   on all systems.
> > > > - keep common code shared between these two codepaths.
> > > >
> > > > v2 chages:
> > > >  run-time selection of most appropriate code-path for given ISA.
> > > >  By default the highest supprted one is selected.
> > > >  User can still override that selection by manually assigning new value to
> > > >  the global function pointer rte_acl_default_classify.
> > > >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > >  points to.
> > > >
> > > >
> > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > >
> > > This is alot better thank you.  A few remaining issues.
> >
> > My comments inline too.
> > Thanks
> > Konstantin
> >
> > >
> > > > ---
> > > >  app/test-acl/main.c                |  13 +-
> > > >  lib/librte_acl/Makefile            |   5 +-
> > > >  lib/librte_acl/acl_bld.c           |   5 +-
> > > >  lib/librte_acl/acl_match_check.def |  92 ++++
> > > >  lib/librte_acl/acl_run.c           | 944 -------------------------------------
> > > >  lib/librte_acl/acl_run.h           | 220 +++++++++
> > > >  lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
> > > >  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
> > > >  lib/librte_acl/rte_acl.c           |  15 +
> > > >  lib/librte_acl/rte_acl.h           |  24 +-
> > > >  10 files changed, 1189 insertions(+), 956 deletions(-)
> > > >  create mode 100644 lib/librte_acl/acl_match_check.def
> > > >  delete mode 100644 lib/librte_acl/acl_run.c
> > > >  create mode 100644 lib/librte_acl/acl_run.h
> > > >  create mode 100644 lib/librte_acl/acl_run_scalar.c
> > > >  create mode 100644 lib/librte_acl/acl_run_sse.c
> > > >
> > > > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > > > index d654409..45c6fa6 100644
> > > > --- a/app/test-acl/main.c
> > > > +++ b/app/test-acl/main.c
> > > > @@ -787,6 +787,10 @@ acx_init(void)
> > > >  	/* perform build. */
> > > >  	ret = rte_acl_build(config.acx, &cfg);
> > > >
> > > > +	/* setup default rte_acl_classify */
> > > > +	if (config.scalar)
> > > > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > > > +
> > > Exporting this variable as part of the ABI is a bad idea.  If the prototype of
> > > the function changes you have to update all your applications.
> >
> > If the prototype of rte_acl_classify will change, most likely you'll have to update code that uses it anyway.
> >
> Why?  If you hide this from the application, changes to the internal
> implementation will also be invisible.  When building as a DSO, an application
> will be able to transition between libraries without the need for a rebuild.

Because rte_acl_classify() is part of the ACL API that users use.
If we add or modify its parameters and/or return value, users will have to change their apps anyway.
 
> > >  Make the pointer
> > > an internal symbol and set it using a get/set routine with an enum to represent
> > > the path to choose.  That will help isolate the ABI from the internal
> > > implementation.
> >
> > That's was my first intention too.
> > But then I realised that if we'll make it internal, then we'll need to make rte_acl_classify() a proper function
> > and it will cost us extra call (or jump).
> Thats true, but I don't see that as a problem.  We're not talking about a hot
> code path here, its a setup function.

I am not talking about rte_acl_select_classify() but about rte_acl_classify() itself (not the code-path selection).
If I make rte_acl_default_classify static, then rte_acl_classify() would need to become a real function, and it would look something like this:

-> call rte_acl_classify
---> load rte_acl_default_classify value into a reg
---> jmp (*reg)
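
In C terms, that wrapper would be roughly the following (just a sketch, using the same prototype as the other classify functions and the declarations from rte_acl.h):

int
rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	/*
	 * One extra call/ret pair per invocation compared to the macro;
	 * the indirect jump through the pointer happens in both cases.
	 */
	return rte_acl_default_classify(ctx, data, results, num, categories);
}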

>  Or do you think that an application will
> be switching between classification functions on every classify operation?

God no.

> > Also I think user should have an ability to change default classify code path without modifying/rebuilding acl library.
> I agree, but both the methods we are advocating for allow that.  Its really just
> a question of exposing the mechanism as data or text in the binary.  Exposing it
> as data comes with implicit ABI constraints that are less prevalanet when done
> as code entry points.
 
> > For example: a bug in an optimised code path is discovered, or user may want to implement and use his own version of classify().

> In the case of a bug in the optimized path, you just fix the bug. 

It is not about me. It is about a user who gets librte_acl as part of a binary distribution.
Of course, he will probably report it and we will probably fix it sooner or later.
But with such an ability he can switch to the safe implementation immediately,
without touching the library, and then wait for the fix.

>  If you want
> to provide your own classification function, thats fine I suppose, but that
> seems completely outside the scope of what we're trying to do here.  Its not
> adventageous to just throw that in there.  If you want to be able to provide
> your own classifier function, lets at least take some time to make sure that the
> function prototype is sufficiently capable to accept all the data you might want
> to pass it in the future, before we go exposing it.  Otherwise you'll have to
> break the ABI in future versions, whcih is something we've been discussing
> trying to avoid.

rte_acl_classify() is already exposed (part of the API), same as rte_acl_classify_scalar().
If in the future we change these functions' prototypes, that will break the ABI anyway.

> 
> > > It will also let you prevent things like selecting a run time
> > > path that is incompatible with the running system
> >
> > If the user going to update rte_acl_default_classify he is probably smart enough to know what he is doing.
> That really seems like poor design to me.  I don't see why you wouldn't at least
> want to warn the developer of an application if they were at run time to assign
> a default classifier method that was incompatible with a running system.  Yes,
> they're likely smart enough to know what their doing, but smart people make
> mistakes, and appreciate being told when they're doing so, especially if the
> method of telling is something a bit more civil than a machine check that
> might occur well after the application has been initilized.

I have no problem providing rte_acl_check_classify(flags_required, classify_ptr) that would do the checking and emit a warning.
Though, as I said above, I'd prefer not to hide rte_acl_default_classify, as that would cause extra overhead for rte_acl_classify().
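
For that check helper, something along these lines, purely as a sketch (the name, the single-flag parameter and the return convention are just an illustration of the idea; they are not part of this patch):

static int
rte_acl_check_classify(enum rte_cpu_flag_t flag_required,
	rte_acl_classify_t clf)
{
	/* nothing to check for the always-available scalar version */
	if (clf == rte_acl_classify_scalar)
		return 0;

	/* warn/refuse if the required ISA feature is missing on this CPU */
	if (!rte_cpu_get_flag_enabled(flag_required))
		return -ENOTSUP;

	return 0;
}

So, for example, rte_acl_check_classify(RTE_CPUFLAG_SSE4_1, rte_acl_classify_sse) could be called before assigning the pointer.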

> 
> > From other hand - user can hit same problem by simply calling rte_acl_classify_sse() directly.
> Not if the function is statically declared and not exposed to the application
> they cant :)

I don't really want to hide rte_acl_classify_sse()/rte_acl_classify_scalar().
They should be available directly, I think.
In the future we might introduce new versions for more sophisticated ISAs (rte_acl_classify_avx() or something).
Users should have the ability to downgrade their classify() function if they like.

> >
> > > and prevent path switching
> > > during searches, which may produce unexpected results.
> >
> > Not that I am advertising it, but  it should be safe to update rte_acl_default_classify during searches:
> > All versions of classify should produce exactly the same result for each input packet and treat acl context as read-only.
> >
> Fair enough.
> 
> > >
> > > ><snip>
> > > > diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> > > > deleted file mode 100644
> > > > index e3d9fc1..0000000
> > > > --- a/lib/librte_acl/acl_run.c
> > > > +++ /dev/null
> > > > @@ -1,944 +0,0 @@
> > > > -/*-
> > > > - *   BSD LICENSE
> > > > - *
> > > > - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > - *   All rights reserved.
> > > > - *
> > > > - *   Redistribution and use in source and binary forms, with or without
> > > > - *   modification, are permitted provided that the following conditions
> > > ><snip>
> > > > +
> > > > +#define	__func_resolve_priority__	resolve_priority_scalar
> > > > +#define	__func_match_check__		acl_match_check_scalar
> > > > +#include "acl_match_check.def"
> > > > +
> > > I get this lets you make some more code common, but its just unpleasant to trace
> > > through.  Looking at the defintion of __func_match_check__ I don't see anything
> > > particularly performance sensitive there.  What if instead you simply redefined
> > > __func_match_check__ in a common internal header as acl_match_check (a generic
> > > function), and had it accept priority resolution function as an argument?  That
> > > would still give you all the performance enhancements without having to include
> > > c files in the middle of other c files, and would make the code a bit more
> > > parseable.
> >
> > Yes, that way it would look much better.
> > And it seems that with '-findirect-inlining' gcc is able to inline them via pointers properly.
> > Will change as you suggested.
> >
> Thank you
> Neil
> 
> > >
> > > > +/*
> > > > + * When processing the transition, rather than using if/else
> > > > + * construct, the offset is calculated for DFA and QRANGE and
> > > > + * then conditionally added to the address based on node type.
> > > > + * This is done to avoid branch mis-predictions. Since the
> > > > + * offset is rather simple calculation it is more efficient
> > > > + * to do the calculation and do a condition move rather than
> > > > + * a conditional branch to determine which calculation to do.
> > > > + */
> > > > +static inline uint32_t
> > > > +scan_forward(uint32_t input, uint32_t max)
> > > > +{
> > > > +	return (input == 0) ? max : rte_bsf32(input);
> > > > +}
> > > > +	}
> > > > +}
> > > ><snip>
> > > > +
> > > > +#define	__func_resolve_priority__	resolve_priority_sse
> > > > +#define	__func_match_check__		acl_match_check_sse
> > > > +#include "acl_match_check.def"
> > > > +
> > > Same deal as above.
> > >
> > > > +/*
> > > > + * Extract transitions from an XMM register and check for any matches
> > > > + */
> > > > +static void
> > > > +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> > > > +	struct parms *parms, struct acl_flow_data *flows)
> > > > +{
> > > > +	uint64_t transition1, transition2;
> > > > +
> > > > +	/* extract transition from low 64 bits. */
> > > > +	transition1 = MM_CVT64(*indicies);
> > > > +
> > > > +	/* extract transition from high 64 bits. */
> > > > +	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> > > > +	transition2 = MM_CVT64(*indicies);
> > > > +
> > > > +	transition1 = acl_match_check_sse(transition1, slot, ctx,
> > > > +		parms, flows);
> > > > +	transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> > > > +		parms, flows);
> > > > +
> > > > +	/* update indicies with new transitions. */
> > > > +	*indicies = MM_SET64(transition2, transition1);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Check for a match in 2 transitions (contained in SSE register)
> > > > + */
> > > > +static inline void
> > > > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > > > +	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> > > > +{
> > > > +	xmm_t temp;
> > > > +
> > > > +	temp = MM_AND(match_mask, *indicies);
> > > > +	while (!MM_TESTZ(temp, temp)) {
> > > > +		acl_process_matches(indicies, slot, ctx, parms, flows);
> > > > +		temp = MM_AND(match_mask, *indicies);
> > > > +	}
> > > > +}
> > > > +
> > > > +/*
> > > > + * Check for any match in 4 transitions (contained in 2 SSE registers)
> > > > + */
> > > > +static inline void
> > > > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > > > +	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> > > > +	xmm_t match_mask)
> > > > +{
> > > > +	xmm_t temp;
> > > > +
> > > > +	/* put low 32 bits of each transition into one register */
> > > > +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > > > +		0x88);
> > > > +	/* test for match node */
> > > > +	temp = MM_AND(match_mask, temp);
> > > > +
> > > > +	while (!MM_TESTZ(temp, temp)) {
> > > > +		acl_process_matches(indicies1, slot, ctx, parms, flows);
> > > > +		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> > > > +
> > > > +		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > > > +					(__m128)*indicies2,
> > > > +					0x88);
> > > > +		temp = MM_AND(match_mask, temp);
> > > > +	}
> > > > +}
> > > > +
> > > > +/*
> > > > + * Calculate the address of the next transition for
> > > > + * all types of nodes. Note that only DFA nodes and range
> > > > + * nodes actually transition to another node. Match
> > > > + * nodes don't move.
> > > > + */
> > > > +static inline xmm_t
> > > > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > > +	xmm_t *indicies1, xmm_t *indicies2)
> > > > +{
> > > > +	xmm_t addr, node_types, temp;
> > > > +
> > > > +	/*
> > > > +	 * Note that no transition is done for a match
> > > > +	 * node and therefore a stream freezes when
> > > > +	 * it reaches a match.
> > > > +	 */
> > > > +
> > > > +	/* Shuffle low 32 into temp and high 32 into indicies2 */
> > > > +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > > > +		0x88);
> > > > +	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > > > +		(__m128)*indicies2, 0xdd);
> > > > +
> > > > +	/* Calc node type and node addr */
> > > > +	node_types = MM_ANDNOT(index_mask, temp);
> > > > +	addr = MM_AND(index_mask, temp);
> > > > +
> > > > +	/*
> > > > +	 * Calc addr for DFAs - addr = dfa_index + input_byte
> > > > +	 */
> > > > +
> > > > +	/* mask for DFA type (0) nodes */
> > > > +	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> > > > +
> > > > +	/* add input byte to DFA position */
> > > > +	temp = MM_AND(temp, bytes);
> > > > +	temp = MM_AND(temp, next_input);
> > > > +	addr = MM_ADD32(addr, temp);
> > > > +
> > > > +	/*
> > > > +	 * Calc addr for Range nodes -> range_index + range(input)
> > > > +	 */
> > > > +	node_types = MM_CMPEQ32(node_types, type_quad_range);
> > > > +
> > > > +	/*
> > > > +	 * Calculate number of range boundaries that are less than the
> > > > +	 * input value. Range boundaries for each node are in signed 8 bit,
> > > > +	 * ordered from -128 to 127 in the indicies2 register.
> > > > +	 * This is effectively a popcnt of bytes that are greater than the
> > > > +	 * input byte.
> > > > +	 */
> > > > +
> > > > +	/* shuffle input byte to all 4 positions of 32 bit value */
> > > > +	temp = MM_SHUFFLE8(next_input, shuffle_input);
> > > > +
> > > > +	/* check ranges */
> > > > +	temp = MM_CMPGT8(temp, *indicies2);
> > > > +
> > > > +	/* convert -1 to 1 (bytes greater than input byte */
> > > > +	temp = MM_SIGN8(temp, temp);
> > > > +
> > > > +	/* horizontal add pairs of bytes into words */
> > > > +	temp = MM_MADD8(temp, temp);
> > > > +
> > > > +	/* horizontal add pairs of words into dwords */
> > > > +	temp = MM_MADD16(temp, ones_16);
> > > > +
> > > > +	/* mask to range type nodes */
> > > > +	temp = MM_AND(temp, node_types);
> > > > +
> > > > +	/* add index into node position */
> > > > +	return MM_ADD32(addr, temp);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Process 4 transitions (in 2 SIMD registers) in parallel
> > > > + */
> > > > +static inline xmm_t
> > > > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > > +	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> > > > +{
> > > > +	xmm_t addr;
> > > > +	uint64_t trans0, trans2;
> > > > +
> > > > +	 /* Calculate the address (array index) for all 4 transitions. */
> > > > +
> > > > +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > > > +		bytes, type_quad_range, indicies1, indicies2);
> > > > +
> > > > +	 /* Gather 64 bit transitions and pack back into 2 registers. */
> > > > +
> > > > +	trans0 = trans[MM_CVT32(addr)];
> > > > +
> > > > +	/* get slot 2 */
> > > > +
> > > > +	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> > > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> > > > +	trans2 = trans[MM_CVT32(addr)];
> > > > +
> > > > +	/* get slot 1 */
> > > > +
> > > > +	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> > > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > > > +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> > > > +
> > > > +	/* get slot 3 */
> > > > +
> > > > +	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> > > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> > > > +	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> > > > +
> > > > +	return MM_SRL32(next_input, 8);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Execute trie traversal with 8 traversals in parallel
> > > > + */
> > > > +static inline int
> > > > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> > > > +{
> > > > +	int n;
> > > > +	struct acl_flow_data flows;
> > > > +	uint64_t index_array[MAX_SEARCHES_SSE8];
> > > > +	struct completion cmplt[MAX_SEARCHES_SSE8];
> > > > +	struct parms parms[MAX_SEARCHES_SSE8];
> > > > +	xmm_t input0, input1;
> > > > +	xmm_t indicies1, indicies2, indicies3, indicies4;
> > > > +
> > > > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > > +		total_packets, categories, ctx->trans_table);
> > > > +
> > > > +	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> > > > +		cmplt[n].count = 0;
> > > > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * indicies1 contains index_array[0,1]
> > > > +	 * indicies2 contains index_array[2,3]
> > > > +	 * indicies3 contains index_array[4,5]
> > > > +	 * indicies4 contains index_array[6,7]
> > > > +	 */
> > > > +
> > > > +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > > > +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > > > +
> > > > +	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> > > > +	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> > > > +
> > > > +	 /* Check for any matches. */
> > > > +	acl_match_check_x4(0, ctx, parms, &flows,
> > > > +		&indicies1, &indicies2, mm_match_mask.m);
> > > > +	acl_match_check_x4(4, ctx, parms, &flows,
> > > > +		&indicies3, &indicies4, mm_match_mask.m);
> > > > +
> > > > +	while (flows.started > 0) {
> > > > +
> > > > +		/* Gather 4 bytes of input data for each stream. */
> > > > +		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> > > > +			0);
> > > > +		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> > > > +			0);
> > > > +
> > > > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> > > > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> > > > +
> > > > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> > > > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> > > > +
> > > > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> > > > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> > > > +
> > > > +		 /* Process the 4 bytes of input on each stream. */
> > > > +
> > > > +		input0 = transition4(mm_index_mask.m, input0,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		input1 = transition4(mm_index_mask.m, input1,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies3, &indicies4);
> > > > +
> > > > +		input0 = transition4(mm_index_mask.m, input0,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		input1 = transition4(mm_index_mask.m, input1,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies3, &indicies4);
> > > > +
> > > > +		input0 = transition4(mm_index_mask.m, input0,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		input1 = transition4(mm_index_mask.m, input1,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies3, &indicies4);
> > > > +
> > > > +		input0 = transition4(mm_index_mask.m, input0,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		input1 = transition4(mm_index_mask.m, input1,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies3, &indicies4);
> > > > +
> > > > +		 /* Check for any matches. */
> > > > +		acl_match_check_x4(0, ctx, parms, &flows,
> > > > +			&indicies1, &indicies2, mm_match_mask.m);
> > > > +		acl_match_check_x4(4, ctx, parms, &flows,
> > > > +			&indicies3, &indicies4, mm_match_mask.m);
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Execute trie traversal with 4 traversals in parallel
> > > > + */
> > > > +static inline int
> > > > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > +	 uint32_t *results, int total_packets, uint32_t categories)
> > > > +{
> > > > +	int n;
> > > > +	struct acl_flow_data flows;
> > > > +	uint64_t index_array[MAX_SEARCHES_SSE4];
> > > > +	struct completion cmplt[MAX_SEARCHES_SSE4];
> > > > +	struct parms parms[MAX_SEARCHES_SSE4];
> > > > +	xmm_t input, indicies1, indicies2;
> > > > +
> > > > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > > +		total_packets, categories, ctx->trans_table);
> > > > +
> > > > +	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> > > > +		cmplt[n].count = 0;
> > > > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > > +	}
> > > > +
> > > > +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > > > +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > > > +
> > > > +	/* Check for any matches. */
> > > > +	acl_match_check_x4(0, ctx, parms, &flows,
> > > > +		&indicies1, &indicies2, mm_match_mask.m);
> > > > +
> > > > +	while (flows.started > 0) {
> > > > +
> > > > +		/* Gather 4 bytes of input data for each stream. */
> > > > +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> > > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> > > > +
> > > > +		/* Process the 4 bytes of input on each stream. */
> > > > +		input = transition4(mm_index_mask.m, input,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		 input = transition4(mm_index_mask.m, input,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		 input = transition4(mm_index_mask.m, input,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		 input = transition4(mm_index_mask.m, input,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		/* Check for any matches. */
> > > > +		acl_match_check_x4(0, ctx, parms, &flows,
> > > > +			&indicies1, &indicies2, mm_match_mask.m);
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static inline xmm_t
> > > > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > > +	const uint64_t *trans, xmm_t *indicies1)
> > > > +{
> > > > +	uint64_t t;
> > > > +	xmm_t addr, indicies2;
> > > > +
> > > > +	indicies2 = MM_XOR(ones_16, ones_16);
> > > > +
> > > > +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > > > +		bytes, type_quad_range, indicies1, &indicies2);
> > > > +
> > > > +	/* Gather 64 bit transitions and pack 2 per register. */
> > > > +
> > > > +	t = trans[MM_CVT32(addr)];
> > > > +
> > > > +	/* get slot 1 */
> > > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > > > +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> > > > +
> > > > +	return MM_SRL32(next_input, 8);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Execute trie traversal with 2 traversals in parallel.
> > > > + */
> > > > +static inline int
> > > > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> > > > +{
> > > > +	int n;
> > > > +	struct acl_flow_data flows;
> > > > +	uint64_t index_array[MAX_SEARCHES_SSE2];
> > > > +	struct completion cmplt[MAX_SEARCHES_SSE2];
> > > > +	struct parms parms[MAX_SEARCHES_SSE2];
> > > > +	xmm_t input, indicies;
> > > > +
> > > > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > > +		total_packets, categories, ctx->trans_table);
> > > > +
> > > > +	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> > > > +		cmplt[n].count = 0;
> > > > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > > +	}
> > > > +
> > > > +	indicies = MM_LOADU((xmm_t *) &index_array[0]);
> > > > +
> > > > +	/* Check for any matches. */
> > > > +	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> > > > +
> > > > +	while (flows.started > 0) {
> > > > +
> > > > +		/* Gather 4 bytes of input data for each stream. */
> > > > +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > > > +
> > > > +		/* Process the 4 bytes of input on each stream. */
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		/* Check for any matches. */
> > > > +		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> > > > +			mm_match_mask64.m);
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +int
> > > > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > +	uint32_t *results, uint32_t num, uint32_t categories)
> > > > +{
> > > > +	if (categories != 1 &&
> > > > +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (likely(num >= MAX_SEARCHES_SSE8))
> > > > +		return search_sse_8(ctx, data, results, num, categories);
> > > > +	else if (num >= MAX_SEARCHES_SSE4)
> > > > +		return search_sse_4(ctx, data, results, num, categories);
> > > > +	else
> > > > +		return search_sse_2(ctx, data, results, num, categories);
> > > > +}
> > > > diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> > > > index 7c288bd..0cde07e 100644
> > > > --- a/lib/librte_acl/rte_acl.c
> > > > +++ b/lib/librte_acl/rte_acl.c
> > > > @@ -38,6 +38,21 @@
> > > >
> > > >  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> > > >
> > > > +/* by default, use always avaialbe scalar code path. */
> > > > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > > > +
> > > make this static, the outside world shouldn't need to see it.
> >
> > As I said above, I think it more plausible to keep it globally visible.
> >
> > >
> > > > +void __attribute__((constructor(INT16_MAX)))
> > > > +rte_acl_select_classify(void)
> > > Make it static, The outside world doesn't need to call this.
> >
> > See above, would like user to have an ability to call it manually if needed.
> >
> > >
> > > > +{
> > > > +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > > > +		/* SSE version requires SSE4.1 */
> > > > +		rte_acl_default_classify = rte_acl_classify_sse;
> > > > +	} else {
> > > > +		/* reset to scalar version. */
> > > > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > > Don't need the else clause here, the static initalizer has you covered.
> >
> > I think we better keep it like that - in case user calls it manually.
> > We always reset  rte_acl_default_classify to the 'best proper' value.
> >
> > > > +	}
> > > > +}
> > > > +
> > > > +
> > > > +/**
> > > > + * Invokes default rte_acl_classify function.
> > > > + */
> > > > +extern rte_acl_classify_t rte_acl_default_classify;
> > > > +
> > > Doesn't need to be extern.
> > > > +#define	rte_acl_classify(ctx, data, results, num, categories)	\
> > > > +	(*rte_acl_default_classify)(ctx, data, results, num, categories)
> > > > +
> > > Not sure why you need this either.  The rte_acl_classify_t should be enough, no?
> >
> > We preserve existing rte_acl_classify() API, so users don't need to modify their code.
> >
> This would be a great candidate for versioning (Bruce and have been discussing
> this).
> 
> Neil
> 
> >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  2014-08-08 13:09       ` Ananyev, Konstantin
@ 2014-08-08 14:30         ` Neil Horman
  2014-08-11 22:23           ` Thomas Monjalon
  0 siblings, 1 reply; 21+ messages in thread
From: Neil Horman @ 2014-08-08 14:30 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

On Fri, Aug 08, 2014 at 01:09:34PM +0000, Ananyev, Konstantin wrote:
> 
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Friday, August 08, 2014 1:25 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> > 
> > On Fri, Aug 08, 2014 at 11:49:58AM +0000, Ananyev, Konstantin wrote:
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Thursday, August 07, 2014 9:12 PM
> > > > To: Ananyev, Konstantin
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> > > >
> > > > On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > > > > Make ACL library to build/work on 'default' architecture:
> > > > > - make rte_acl_classify_scalar really scalar
> > > > >  (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> > > > > - Provide two versions of rte_acl_classify code path:
> > > > >   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> > > > >   and upper, return -ENOTSUP on lower arch.
> > > > >   rte_acl_classify_scalar() - a slower version, but could be build and used
> > > > >   on all systems.
> > > > > - keep common code shared between these two codepaths.
> > > > >
> > > > > v2 chages:
> > > > >  run-time selection of most appropriate code-path for given ISA.
> > > > >  By default the highest supprted one is selected.
> > > > >  User can still override that selection by manually assigning new value to
> > > > >  the global function pointer rte_acl_default_classify.
> > > > >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > > >  points to.
> > > > >
> > > > >
> > > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > >
> > > > This is alot better thank you.  A few remaining issues.
> > >
> > > My comments inline too.
> > > Thanks
> > > Konstantin
> > >
> > > >
> > > > > ---
> > > > >  app/test-acl/main.c                |  13 +-
> > > > >  lib/librte_acl/Makefile            |   5 +-
> > > > >  lib/librte_acl/acl_bld.c           |   5 +-
> > > > >  lib/librte_acl/acl_match_check.def |  92 ++++
> > > > >  lib/librte_acl/acl_run.c           | 944 -------------------------------------
> > > > >  lib/librte_acl/acl_run.h           | 220 +++++++++
> > > > >  lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
> > > > >  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
> > > > >  lib/librte_acl/rte_acl.c           |  15 +
> > > > >  lib/librte_acl/rte_acl.h           |  24 +-
> > > > >  10 files changed, 1189 insertions(+), 956 deletions(-)
> > > > >  create mode 100644 lib/librte_acl/acl_match_check.def
> > > > >  delete mode 100644 lib/librte_acl/acl_run.c
> > > > >  create mode 100644 lib/librte_acl/acl_run.h
> > > > >  create mode 100644 lib/librte_acl/acl_run_scalar.c
> > > > >  create mode 100644 lib/librte_acl/acl_run_sse.c
> > > > >
> > > > > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > > > > index d654409..45c6fa6 100644
> > > > > --- a/app/test-acl/main.c
> > > > > +++ b/app/test-acl/main.c
> > > > > @@ -787,6 +787,10 @@ acx_init(void)
> > > > >  	/* perform build. */
> > > > >  	ret = rte_acl_build(config.acx, &cfg);
> > > > >
> > > > > +	/* setup default rte_acl_classify */
> > > > > +	if (config.scalar)
> > > > > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > > > > +
> > > > Exporting this variable as part of the ABI is a bad idea.  If the prototype of
> > > > the function changes you have to update all your applications.
> > >
> > > If the prototype of rte_acl_classify will change, most likely you'll have to update code that uses it anyway.
> > >
> > Why?  If you hide this from the application, changes to the internal
> > implementation will also be invisible.  When building as a DSO, an application
> > will be able to transition between libraries without the need for a rebuild.
> 
> Because rte_acl_classify() is part of the ACL API that users use.
> If we'll add/modify its parameters and/or return value - users would have to change their apps anyway.   
>  
That's not at all true.  With API versioning scripts you can make several
versions of the same function with different prototypes as future needs dictate.
Hiding the internal implementation just makes that easier.
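
As a rough illustration only (the version node names below are made up; nothing like this exists in dpdk today):

/* keep the old binary interface alive for already-built applications */
int
rte_acl_classify_v1(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	return rte_acl_default_classify(ctx, data, results, num, categories);
}
__asm__(".symver rte_acl_classify_v1, rte_acl_classify@ACL_1");

Newly built applications would then bind to whatever prototype gets exported as the default version (rte_acl_classify@@ACL_2 or similar), while old binaries keep working without a rebuild.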

> > > >  Make the pointer
> > > > an internal symbol and set it using a get/set routine with an enum to represent
> > > > the path to choose.  That will help isolate the ABI from the internal
> > > > implementation.
> > >
> > > That's was my first intention too.
> > > But then I realised that if we'll make it internal, then we'll need to make rte_acl_classify() a proper function
> > > and it will cost us extra call (or jump).
> > Thats true, but I don't see that as a problem.  We're not talking about a hot
> > code path here, its a setup function.
> 
> I am talking not about rte_acl_select_classify() but about rte_acl_classify() itself (not code path).
> If I'll make rte_acl_default_classify statitc, the rte_acl_classiy() would need to become a real function and it'll be something like that:
> 
> ->call rte_acl_acl_classify
> ---> load rte_acl_calssify_default value into the reg 
> --->  jmp (*reg)
> 
Ah, yes, for the actual classification path you will need an extra call
instruction there.  I would say if that's the case, then you should either make
rte_acl_classify a macro or a real function based on whether you're building as a
shared library or a static library.
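
i.e. something along these lines (the exact config guard is just a sketch; the macro body is the one from your patch):

#ifdef RTE_BUILD_SHARED_LIB
/* shared build: a real function, so the exported symbol can be versioned */
int rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories);
#else
/* static build: keep the zero-overhead macro */
#define rte_acl_classify(ctx, data, results, num, categories)	\
	(*rte_acl_default_classify)(ctx, data, results, num, categories)
#endif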

> >  Or do you think that an application will
> > be switching between classification functions on every classify operation?
> 
> God no.
> 
> > > Also I think user should have an ability to change default classify code path without modifying/rebuilding acl library.
> > I agree, but both the methods we are advocating for allow that.  Its really just
> > a question of exposing the mechanism as data or text in the binary.  Exposing it
> > as data comes with implicit ABI constraints that are less prevalanet when done
> > as code entry points.
>  
> > > For example: a bug in an optimised code path is discovered, or user may want to implement and use his own version of classify().
> 
> > In the case of a bug in the optimized path, you just fix the bug. 
> 
> It is not about me. It is about a user who get librte_acl as part of binary distribution.
Yes, those are my users :)

> Of course, he probably will report about it and we probably fix it sooner or later.
> But with such ability he can switch to the safe implementation immediately
> without touching the library and then wait for the fix.
> 
That's not how users of a binary package from a distribution operate.  If they're
using a binary package they either:

1) Don't want to rebuild anything themselves, in which case they file the bug,
and wait for the developers to fix the issue.

or 

2) Have a staff to help them work around the issue, which will be done by
rebuilding/fixing the library, not the application.

With (2), what I am saying is that, if a 3rd party finds a bug in the classifier
code within dpdk which is built as a shared library within a distribution, and
they need it fixed immediately, they have a choice of what to do: they can
either (a) write a custom classifier function and point the dpdk library to it,
or (b) just fix the bug in the library directly.  Given that, if they can
accomplish (a), they by all rights can also accomplish (b); the only decision
they need to make is which one makes the most sense for them.  The answer is
(b), because that's where the functionality lives.  I.e. when the fix occurs
upstream and a new release gets issued, you can go back to using the library-
maintained version, and you don't have to clean up what has become vestigial
unused code.
 
> >  If you want
> > to provide your own classification function, thats fine I suppose, but that
> > seems completely outside the scope of what we're trying to do here.  Its not
> > adventageous to just throw that in there.  If you want to be able to provide
> > your own classifier function, lets at least take some time to make sure that the
> > function prototype is sufficiently capable to accept all the data you might want
> > to pass it in the future, before we go exposing it.  Otherwise you'll have to
> > break the ABI in future versions, whcih is something we've been discussing
> > trying to avoid.
> 
> rte_acl_classify() it is already exposed (PART of API), same as rte_acl_classify_scalar().
> If in future, we'll change these functions prototypes will break ABI anyway.
> 
Well, at the moment that's fine, because you don't make any ABI promises anyway;
I've been working to change that, so distributions can have greater dpdk
adoption.

> > 
> > > > It will also let you prevent things like selecting a run time
> > > > path that is incompatible with the running system
> > >
> > > If the user going to update rte_acl_default_classify he is probably smart enough to know what he is doing.
> > That really seems like poor design to me.  I don't see why you wouldn't at least
> > want to warn the developer of an application if they were at run time to assign
> > a default classifier method that was incompatible with a running system.  Yes,
> > they're likely smart enough to know what their doing, but smart people make
> > mistakes, and appreciate being told when they're doing so, especially if the
> > method of telling is something a bit more civil than a machine check that
> > might occur well after the application has been initilized.
> 
> I have no problem providing rte_acl_check_classify(flags_required, classify_ptr) that would do checking and emit the warning.
> Though as I said above, I'll prefer not to hide rte_acl_default_classify it will cause extra overhead for rte_acl_classify().
> 
> > 
> > > From other hand - user can hit same problem by simply calling rte_acl_classify_sse() directly.
> > Not if the function is statically declared and not exposed to the application
> > they cant :)
> 
> I don't really want to hide  rte_acl_classify_sse/rte_acl_classify_scalar().
> Should be available directly I think.   
> In future we might introduce new versions for more sophisticated ISAs (rte_acl_classify_avx() or something).
> Users should have an ability to downgrade their classify() function if they like.  
What in your mind is the reasoning behind being able to do so?  What is
advantageous about that?  Aside possibly from debugging, that is (for which I
can see a use).  But in normal production operation, why would you choose not to
use the sse classifier over the scalar classifier?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  2014-08-08 14:30         ` Neil Horman
@ 2014-08-11 22:23           ` Thomas Monjalon
  0 siblings, 0 replies; 21+ messages in thread
From: Thomas Monjalon @ 2014-08-11 22:23 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

Hi all,

2014-08-08 10:30, Neil Horman:
> On Fri, Aug 08, 2014 at 01:09:34PM +0000, Ananyev, Konstantin wrote:

> > > > Also I think user should have an ability to change default classify code path without modifying/rebuilding acl library.
> > > I agree, but both the methods we are advocating for allow that.  Its really just
> > > a question of exposing the mechanism as data or text in the binary.  Exposing it
> > > as data comes with implicit ABI constraints that are less prevalanet when done
> > > as code entry points.
> >  
> > > > For example: a bug in an optimised code path is discovered, or user may want to implement and use his own version of classify().
> 
> > Of course, he probably will report about it and we probably fix it sooner or later.
> > But with such ability he can switch to the safe implementation immediately
> > without touching the library and then wait for the fix.
> 
> Thats not how users of a binary pacakge from a distribution operate.  If their
> using a binary package they either:
> 
> 1) Don't want to rebuild anything themselves, in which case they file the bug,
> and wait for the developers to fix the issue.
> 
> or 
> 
> 2) Have a staff to help them work around the issue, which will be done by
> rebuilding/fixing the library, not the application.
> 
> With (2), what I am saying is that, if a 3rd party finds a bug in the classifier
> code within dpdk which is built as a shared library within a distribution, and
> they need it fixed immediately, they have a choice of what to do, they can
> either (a), write a custom classifier function and point the dpdk library to it,
> or (b), just fix the bug in the library directly.  Given that, if they can
> accomplish (a), they by all rights can also accompilsh (b), the only decision
> they need to make is one which makes the most sense for them.  The answer is
> (b), because thats where the functionality lives.  i.e. when the fix occurs
> upstream and a new release gets issued, you can go back to using the library
> maintained version, and you don't have to clean up what has become vestigial
> unused code.

I think it's even simpler: designing the API to allow behaviour changes without
rebuilding is not sane. Should we then expose all functions?

Please try to reduce the API as much as possible.
Thanks
-- 
Thomas

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
  2014-08-07 18:31 [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target Konstantin Ananyev
  2014-08-07 20:11 ` Neil Horman
@ 2014-08-21 20:15 ` Neil Horman
  2014-08-25 16:30   ` Ananyev, Konstantin
  2014-08-28 20:38 ` [dpdk-dev] [PATCHv4] " Neil Horman
  2 siblings, 1 reply; 21+ messages in thread
From: Neil Horman @ 2014-08-21 20:15 UTC (permalink / raw)
  To: dev

Make the ACL library build/work on the 'default' architecture:
- make rte_acl_classify_scalar really scalar
 (make sure it doesn't use sse4 intrinsics through resolve_priority()).
- Provide two versions of the rte_acl_classify code path:
  rte_acl_classify_sse() - can be built and used only on systems with sse4.2
  and above; returns -ENOTSUP on a lower arch.
  rte_acl_classify_scalar() - a slower version, but can be built and used
  on all systems.
- keep common code shared between these two code paths.

v2 changes:
 run-time selection of the most appropriate code path for a given ISA.
 By default the highest supported one is selected.
 The user can still override that selection by manually assigning a new value to
 the global function pointer rte_acl_default_classify.
 rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
 points to.

V3 Changes
 Updated the classify pointer to be a function so as to better preserve ABI.
 Removed macro definitions for match check functions to make them static inline.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
---
 app/test-acl/main.c              |  13 +-
 app/test/test_acl.c              |  12 +-
 lib/librte_acl/Makefile          |   5 +-
 lib/librte_acl/acl_bld.c         |   5 +-
 lib/librte_acl/acl_match_check.h |  83 ++++
 lib/librte_acl/acl_run.c         | 944 ---------------------------------------
 lib/librte_acl/acl_run.h         | 220 +++++++++
 lib/librte_acl/acl_run_scalar.c  | 198 ++++++++
 lib/librte_acl/acl_run_sse.c     | 627 ++++++++++++++++++++++++++
 lib/librte_acl/rte_acl.c         |  46 ++
 lib/librte_acl/rte_acl.h         |  26 +-
 11 files changed, 1216 insertions(+), 963 deletions(-)
 create mode 100644 lib/librte_acl/acl_match_check.h
 delete mode 100644 lib/librte_acl/acl_run.c
 create mode 100644 lib/librte_acl/acl_run.h
 create mode 100644 lib/librte_acl/acl_run_scalar.c
 create mode 100644 lib/librte_acl/acl_run_sse.c

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index d654409..a77f47d 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -787,6 +787,10 @@ acx_init(void)
 	/* perform build. */
 	ret = rte_acl_build(config.acx, &cfg);
 
+	/* setup default rte_acl_classify */
+	if (config.scalar)
+		rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
+
 	dump_verbose(DUMP_NONE, stdout,
 		"rte_acl_build(%u) finished with %d\n",
 		config.bld_categories, ret);
@@ -815,13 +819,8 @@ search_ip5tuples_once(uint32_t categories, uint32_t step, int scalar)
 			v += config.trace_sz;
 		}
 
-		if (scalar != 0)
-			ret = rte_acl_classify_scalar(config.acx, data,
-				results, n, categories);
-
-		else
-			ret = rte_acl_classify(config.acx, data,
-				results, n, categories);
+		ret = rte_acl_classify(config.acx, data, results,
+			n, categories);
 
 		if (ret != 0)
 			rte_exit(ret, "classify for ipv%c_5tuples returns %d\n",
diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 869f6d3..2fcef6e 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -148,7 +148,8 @@ test_classify_run(struct rte_acl_ctx *acx)
 	}
 
 	/* make a quick check for scalar */
-	ret = rte_acl_classify_scalar(acx, data, results,
+	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
+	ret = rte_acl_classify(acx, data, results,
 			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
 	if (ret != 0) {
 		printf("Line %i: SSE classify failed!\n", __LINE__);
@@ -362,7 +363,8 @@ test_invalid_layout(void)
 	}
 
 	/* classify tuples (scalar) */
-	ret = rte_acl_classify_scalar(acx, data, results,
+	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
+	ret = rte_acl_classify(acx, data, results,
 			RTE_DIM(results), 1);
 	if (ret != 0) {
 		printf("Line %i: Scalar classify failed!\n", __LINE__);
@@ -850,7 +852,8 @@ test_invalid_parameters(void)
 	/* scalar classify test */
 
 	/* cover zero categories in classify (should not fail) */
-	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 0);
+	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
+	result = rte_acl_classify(acx, NULL, NULL, 0, 0);
 	if (result != 0) {
 		printf("Line %i: Scalar classify with zero categories "
 				"failed!\n", __LINE__);
@@ -859,7 +862,8 @@ test_invalid_parameters(void)
 	}
 
 	/* cover invalid but positive categories in classify */
-	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 3);
+	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
+	result = rte_acl_classify(acx, NULL, NULL, 0, 3);
 	if (result == 0) {
 		printf("Line %i: Scalar classify with 3 categories "
 				"should have failed!\n", __LINE__);
diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
index 4fe4593..65e566d 100644
--- a/lib/librte_acl/Makefile
+++ b/lib/librte_acl/Makefile
@@ -43,7 +43,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += rte_acl.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
-SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run.c
+SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
+SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
+
+CFLAGS_acl_run_sse.o += -msse4.1
 
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include := rte_acl_osdep.h
diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index 873447b..09d58ea 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -31,7 +31,6 @@
  *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
-#include <nmmintrin.h>
 #include <rte_acl.h>
 #include "tb_mem.h"
 #include "acl.h"
@@ -1480,8 +1479,8 @@ acl_calc_wildness(struct rte_acl_build_rule *head,
 
 			switch (rule->config->defs[n].type) {
 			case RTE_ACL_FIELD_TYPE_BITMASK:
-				wild = (size -
-					_mm_popcnt_u32(fld->mask_range.u8)) /
+				wild = (size - __builtin_popcount(
+					fld->mask_range.u8)) /
 					size;
 				break;
 
diff --git a/lib/librte_acl/acl_match_check.h b/lib/librte_acl/acl_match_check.h
new file mode 100644
index 0000000..4dc1982
--- /dev/null
+++ b/lib/librte_acl/acl_match_check.h
@@ -0,0 +1,83 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _ACL_MATCH_CHECK_H_
+#define _ACL_MATCH_CHECK_H_
+
+/*
+ * Detect matches. If a match node transition is found, then this trie
+ * traversal is complete and the slot is refilled with the next trie
+ * to be processed.
+ */
+static inline uint64_t
+acl_match_check(uint64_t transition, int slot,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, void (*resolve_priority)(
+	uint64_t transition, int n, const struct rte_acl_ctx *ctx,
+	struct parms *parms, const struct rte_acl_match_results *p,
+	uint32_t categories))
+{
+	const struct rte_acl_match_results *p;
+
+	p = (const struct rte_acl_match_results *)
+		(flows->trans + ctx->match_index);
+
+	if (transition & RTE_ACL_NODE_MATCH) {
+
+		/* Remove flags from index and decrement active traversals */
+		transition &= RTE_ACL_NODE_INDEX;
+		flows->started--;
+
+		/* Resolve priorities for this trie and running results */
+		if (flows->categories == 1)
+			resolve_single_priority(transition, slot, ctx,
+				parms, p);
+		else
+			resolve_priority(transition, slot, ctx, parms,
+				p, flows->categories);
+
+		/* Count down completed tries for this search request */
+		parms[slot].cmplt->count--;
+
+		/* Fill the slot with the next trie or idle trie */
+		transition = acl_start_next_trie(flows, parms, slot, ctx);
+
+	} else if (transition == ctx->idle) {
+		/* reset indirection table for idle slots */
+		parms[slot].data_index = idle;
+	}
+
+	return transition;
+}
+
+#endif
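Factoring acl_match_check() into this header, with the priority resolver passed
as a parameter to a static inline function, lets each translation unit inline its
own resolver rather than paying for an indirect call. A minimal sketch of how a
code path is expected to instantiate it, using the scalar resolver defined further
down in acl_run_scalar.c (the surrounding variables are the usual per-slot
traversal state):

	/* once a slot's transition has the match bit set */
	transition0 = acl_match_check(transition0, 0, ctx, parms, &flows,
		resolve_priority_scalar);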
diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
deleted file mode 100644
index e3d9fc1..0000000
--- a/lib/librte_acl/acl_run.c
+++ /dev/null
@@ -1,944 +0,0 @@
-/*-
- *   BSD LICENSE
- *
- *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
- *   All rights reserved.
- *
- *   Redistribution and use in source and binary forms, with or without
- *   modification, are permitted provided that the following conditions
- *   are met:
- *
- *     * Redistributions of source code must retain the above copyright
- *       notice, this list of conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright
- *       notice, this list of conditions and the following disclaimer in
- *       the documentation and/or other materials provided with the
- *       distribution.
- *     * Neither the name of Intel Corporation nor the names of its
- *       contributors may be used to endorse or promote products derived
- *       from this software without specific prior written permission.
- *
- *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
- *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
- *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
- *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#include <rte_acl.h>
-#include "acl_vect.h"
-#include "acl.h"
-
-#define MAX_SEARCHES_SSE8	8
-#define MAX_SEARCHES_SSE4	4
-#define MAX_SEARCHES_SSE2	2
-#define MAX_SEARCHES_SCALAR	2
-
-#define GET_NEXT_4BYTES(prm, idx)	\
-	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
-
-
-#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
-
-#define	SCALAR_QRANGE_MULT	0x01010101
-#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
-#define	SCALAR_QRANGE_MIN	0x80808080
-
-enum {
-	SHUFFLE32_SLOT1 = 0xe5,
-	SHUFFLE32_SLOT2 = 0xe6,
-	SHUFFLE32_SLOT3 = 0xe7,
-	SHUFFLE32_SWAP64 = 0x4e,
-};
-
-/*
- * Structure to manage N parallel trie traversals.
- * The runtime trie traversal routines can process 8, 4, or 2 tries
- * in parallel. Each packet may require multiple trie traversals (up to 4).
- * This structure is used to fill the slots (0 to n-1) for parallel processing
- * with the trie traversals needed for each packet.
- */
-struct acl_flow_data {
-	uint32_t            num_packets;
-	/* number of packets processed */
-	uint32_t            started;
-	/* number of trie traversals in progress */
-	uint32_t            trie;
-	/* current trie index (0 to N-1) */
-	uint32_t            cmplt_size;
-	uint32_t            total_packets;
-	uint32_t            categories;
-	/* number of result categories per packet. */
-	/* maximum number of packets to process */
-	const uint64_t     *trans;
-	const uint8_t     **data;
-	uint32_t           *results;
-	struct completion  *last_cmplt;
-	struct completion  *cmplt_array;
-};
-
-/*
- * Structure to maintain running results for
- * a single packet (up to 4 tries).
- */
-struct completion {
-	uint32_t *results;                          /* running results. */
-	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
-	uint32_t  count;                            /* num of remaining tries */
-	/* true for allocated struct */
-} __attribute__((aligned(XMM_SIZE)));
-
-/*
- * One parms structure for each slot in the search engine.
- */
-struct parms {
-	const uint8_t              *data;
-	/* input data for this packet */
-	const uint32_t             *data_index;
-	/* data indirection for this trie */
-	struct completion          *cmplt;
-	/* completion data for this packet */
-};
-
-/*
- * Define an global idle node for unused engine slots
- */
-static const uint32_t idle[UINT8_MAX + 1];
-
-static const rte_xmm_t mm_type_quad_range = {
-	.u32 = {
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-	},
-};
-
-static const rte_xmm_t mm_type_quad_range64 = {
-	.u32 = {
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		0,
-		0,
-	},
-};
-
-static const rte_xmm_t mm_shuffle_input = {
-	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
-};
-
-static const rte_xmm_t mm_shuffle_input64 = {
-	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
-};
-
-static const rte_xmm_t mm_ones_16 = {
-	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
-};
-
-static const rte_xmm_t mm_bytes = {
-	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
-};
-
-static const rte_xmm_t mm_bytes64 = {
-	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
-};
-
-static const rte_xmm_t mm_match_mask = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-	},
-};
-
-static const rte_xmm_t mm_match_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		0,
-		RTE_ACL_NODE_MATCH,
-		0,
-	},
-};
-
-static const rte_xmm_t mm_index_mask = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-	},
-};
-
-static const rte_xmm_t mm_index_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		0,
-		0,
-	},
-};
-
-/*
- * Allocate a completion structure to manage the tries for a packet.
- */
-static inline struct completion *
-alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
-	uint32_t *results)
-{
-	uint32_t n;
-
-	for (n = 0; n < size; n++) {
-
-		if (p[n].count == 0) {
-
-			/* mark as allocated and set number of tries. */
-			p[n].count = tries;
-			p[n].results = results;
-			return &(p[n]);
-		}
-	}
-
-	/* should never get here */
-	return NULL;
-}
-
-/*
- * Resolve priority for a single result trie.
- */
-static inline void
-resolve_single_priority(uint64_t transition, int n,
-	const struct rte_acl_ctx *ctx, struct parms *parms,
-	const struct rte_acl_match_results *p)
-{
-	if (parms[n].cmplt->count == ctx->num_tries ||
-			parms[n].cmplt->priority[0] <=
-			p[transition].priority[0]) {
-
-		parms[n].cmplt->priority[0] = p[transition].priority[0];
-		parms[n].cmplt->results[0] = p[transition].results[0];
-	}
-
-	parms[n].cmplt->count--;
-}
-
-/*
- * Resolve priority for multiple results. This consists comparing
- * the priority of the current traversal with the running set of
- * results for the packet. For each result, keep a running array of
- * the result (rule number) and its priority for each category.
- */
-static inline void
-resolve_priority(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
-	struct parms *parms, const struct rte_acl_match_results *p,
-	uint32_t categories)
-{
-	uint32_t x;
-	xmm_t results, priority, results1, priority1, selector;
-	xmm_t *saved_results, *saved_priority;
-
-	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
-
-		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
-		saved_priority =
-			(xmm_t *)(&parms[n].cmplt->priority[x]);
-
-		/* get results and priorities for completed trie */
-		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
-		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
-
-		/* if this is not the first completed trie */
-		if (parms[n].cmplt->count != ctx->num_tries) {
-
-			/* get running best results and their priorities */
-			results1 = MM_LOADU(saved_results);
-			priority1 = MM_LOADU(saved_priority);
-
-			/* select results that are highest priority */
-			selector = MM_CMPGT32(priority1, priority);
-			results = MM_BLENDV8(results, results1, selector);
-			priority = MM_BLENDV8(priority, priority1, selector);
-		}
-
-		/* save running best results and their priorities */
-		MM_STOREU(saved_results, results);
-		MM_STOREU(saved_priority, priority);
-	}
-
-	/* Count down completed tries for this search request */
-	parms[n].cmplt->count--;
-}
-
-/*
- * Routine to fill a slot in the parallel trie traversal array (parms) from
- * the list of packets (flows).
- */
-static inline uint64_t
-acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
-	const struct rte_acl_ctx *ctx)
-{
-	uint64_t transition;
-
-	/* if there are any more packets to process */
-	if (flows->num_packets < flows->total_packets) {
-		parms[n].data = flows->data[flows->num_packets];
-		parms[n].data_index = ctx->trie[flows->trie].data_index;
-
-		/* if this is the first trie for this packet */
-		if (flows->trie == 0) {
-			flows->last_cmplt = alloc_completion(flows->cmplt_array,
-				flows->cmplt_size, ctx->num_tries,
-				flows->results +
-				flows->num_packets * flows->categories);
-		}
-
-		/* set completion parameters and starting index for this slot */
-		parms[n].cmplt = flows->last_cmplt;
-		transition =
-			flows->trans[parms[n].data[*parms[n].data_index++] +
-			ctx->trie[flows->trie].root_index];
-
-		/*
-		 * if this is the last trie for this packet,
-		 * then setup next packet.
-		 */
-		flows->trie++;
-		if (flows->trie >= ctx->num_tries) {
-			flows->trie = 0;
-			flows->num_packets++;
-		}
-
-		/* keep track of number of active trie traversals */
-		flows->started++;
-
-	/* no more tries to process, set slot to an idle position */
-	} else {
-		transition = ctx->idle;
-		parms[n].data = (const uint8_t *)idle;
-		parms[n].data_index = idle;
-	}
-	return transition;
-}
-
-/*
- * Detect matches. If a match node transition is found, then this trie
- * traversal is complete and fill the slot with the next trie
- * to be processed.
- */
-static inline uint64_t
-acl_match_check_transition(uint64_t transition, int slot,
-	const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows)
-{
-	const struct rte_acl_match_results *p;
-
-	p = (const struct rte_acl_match_results *)
-		(flows->trans + ctx->match_index);
-
-	if (transition & RTE_ACL_NODE_MATCH) {
-
-		/* Remove flags from index and decrement active traversals */
-		transition &= RTE_ACL_NODE_INDEX;
-		flows->started--;
-
-		/* Resolve priorities for this trie and running results */
-		if (flows->categories == 1)
-			resolve_single_priority(transition, slot, ctx,
-				parms, p);
-		else
-			resolve_priority(transition, slot, ctx, parms, p,
-				flows->categories);
-
-		/* Fill the slot with the next trie or idle trie */
-		transition = acl_start_next_trie(flows, parms, slot, ctx);
-
-	} else if (transition == ctx->idle) {
-		/* reset indirection table for idle slots */
-		parms[slot].data_index = idle;
-	}
-
-	return transition;
-}
-
-/*
- * Extract transitions from an XMM register and check for any matches
- */
-static void
-acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
-	struct parms *parms, struct acl_flow_data *flows)
-{
-	uint64_t transition1, transition2;
-
-	/* extract transition from low 64 bits. */
-	transition1 = MM_CVT64(*indicies);
-
-	/* extract transition from high 64 bits. */
-	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
-	transition2 = MM_CVT64(*indicies);
-
-	transition1 = acl_match_check_transition(transition1, slot, ctx,
-		parms, flows);
-	transition2 = acl_match_check_transition(transition2, slot + 1, ctx,
-		parms, flows);
-
-	/* update indicies with new transitions. */
-	*indicies = MM_SET64(transition2, transition1);
-}
-
-/*
- * Check for a match in 2 transitions (contained in SSE register)
- */
-static inline void
-acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
-{
-	xmm_t temp;
-
-	temp = MM_AND(match_mask, *indicies);
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indicies, slot, ctx, parms, flows);
-		temp = MM_AND(match_mask, *indicies);
-	}
-}
-
-/*
- * Check for any match in 4 transitions (contained in 2 SSE registers)
- */
-static inline void
-acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
-	xmm_t match_mask)
-{
-	xmm_t temp;
-
-	/* put low 32 bits of each transition into one register */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
-		0x88);
-	/* test for match node */
-	temp = MM_AND(match_mask, temp);
-
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indicies1, slot, ctx, parms, flows);
-		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
-
-		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
-					(__m128)*indicies2,
-					0x88);
-		temp = MM_AND(match_mask, temp);
-	}
-}
-
-/*
- * Calculate the address of the next transition for
- * all types of nodes. Note that only DFA nodes and range
- * nodes actually transition to another node. Match
- * nodes don't move.
- */
-static inline xmm_t
-acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	xmm_t *indicies1, xmm_t *indicies2)
-{
-	xmm_t addr, node_types, temp;
-
-	/*
-	 * Note that no transition is done for a match
-	 * node and therefore a stream freezes when
-	 * it reaches a match.
-	 */
-
-	/* Shuffle low 32 into temp and high 32 into indicies2 */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
-		0x88);
-	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
-		(__m128)*indicies2, 0xdd);
-
-	/* Calc node type and node addr */
-	node_types = MM_ANDNOT(index_mask, temp);
-	addr = MM_AND(index_mask, temp);
-
-	/*
-	 * Calc addr for DFAs - addr = dfa_index + input_byte
-	 */
-
-	/* mask for DFA type (0) nodes */
-	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
-
-	/* add input byte to DFA position */
-	temp = MM_AND(temp, bytes);
-	temp = MM_AND(temp, next_input);
-	addr = MM_ADD32(addr, temp);
-
-	/*
-	 * Calc addr for Range nodes -> range_index + range(input)
-	 */
-	node_types = MM_CMPEQ32(node_types, type_quad_range);
-
-	/*
-	 * Calculate number of range boundaries that are less than the
-	 * input value. Range boundaries for each node are in signed 8 bit,
-	 * ordered from -128 to 127 in the indicies2 register.
-	 * This is effectively a popcnt of bytes that are greater than the
-	 * input byte.
-	 */
-
-	/* shuffle input byte to all 4 positions of 32 bit value */
-	temp = MM_SHUFFLE8(next_input, shuffle_input);
-
-	/* check ranges */
-	temp = MM_CMPGT8(temp, *indicies2);
-
-	/* convert -1 to 1 (bytes greater than input byte */
-	temp = MM_SIGN8(temp, temp);
-
-	/* horizontal add pairs of bytes into words */
-	temp = MM_MADD8(temp, temp);
-
-	/* horizontal add pairs of words into dwords */
-	temp = MM_MADD16(temp, ones_16);
-
-	/* mask to range type nodes */
-	temp = MM_AND(temp, node_types);
-
-	/* add index into node position */
-	return MM_ADD32(addr, temp);
-}
-
-/*
- * Process 4 transitions (in 2 SIMD registers) in parallel
- */
-static inline xmm_t
-transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
-{
-	xmm_t addr;
-	uint64_t trans0, trans2;
-
-	 /* Calculate the address (array index) for all 4 transitions. */
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		bytes, type_quad_range, indicies1, indicies2);
-
-	 /* Gather 64 bit transitions and pack back into 2 registers. */
-
-	trans0 = trans[MM_CVT32(addr)];
-
-	/* get slot 2 */
-
-	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
-	trans2 = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-
-	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
-
-	/* get slot 3 */
-
-	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
-	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
-
-	return MM_SRL32(next_input, 8);
-}
-
-static inline void
-acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
-	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
-	uint32_t data_num, uint32_t categories, const uint64_t *trans)
-{
-	flows->num_packets = 0;
-	flows->started = 0;
-	flows->trie = 0;
-	flows->last_cmplt = NULL;
-	flows->cmplt_array = cmplt;
-	flows->total_packets = data_num;
-	flows->categories = categories;
-	flows->cmplt_size = cmplt_size;
-	flows->data = data;
-	flows->results = results;
-	flows->trans = trans;
-}
-
-/*
- * Execute trie traversal with 8 traversals in parallel
- */
-static inline void
-search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE8];
-	struct completion cmplt[MAX_SEARCHES_SSE8];
-	struct parms parms[MAX_SEARCHES_SSE8];
-	xmm_t input0, input1;
-	xmm_t indicies1, indicies2, indicies3, indicies4;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	/*
-	 * indicies1 contains index_array[0,1]
-	 * indicies2 contains index_array[2,3]
-	 * indicies3 contains index_array[4,5]
-	 * indicies4 contains index_array[6,7]
-	 */
-
-	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
-	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
-
-	 /* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indicies1, &indicies2, mm_match_mask.m);
-	acl_match_check_x4(4, ctx, parms, &flows,
-		&indicies3, &indicies4, mm_match_mask.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
-			0);
-		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
-			0);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
-
-		 /* Process the 4 bytes of input on each stream. */
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		 /* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indicies1, &indicies2, mm_match_mask.m);
-		acl_match_check_x4(4, ctx, parms, &flows,
-			&indicies3, &indicies4, mm_match_mask.m);
-	}
-}
-
-/*
- * Execute trie traversal with 4 traversals in parallel
- */
-static inline void
-search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	 uint32_t *results, int total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE4];
-	struct completion cmplt[MAX_SEARCHES_SSE4];
-	struct parms parms[MAX_SEARCHES_SSE4];
-	xmm_t input, indicies1, indicies2;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	/* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indicies1, &indicies2, mm_match_mask.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
-
-		/* Process the 4 bytes of input on each stream. */
-		input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		/* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indicies1, &indicies2, mm_match_mask.m);
-	}
-}
-
-static inline xmm_t
-transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	const uint64_t *trans, xmm_t *indicies1)
-{
-	uint64_t t;
-	xmm_t addr, indicies2;
-
-	indicies2 = MM_XOR(ones_16, ones_16);
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		bytes, type_quad_range, indicies1, &indicies2);
-
-	/* Gather 64 bit transitions and pack 2 per register. */
-
-	t = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
-
-	return MM_SRL32(next_input, 8);
-}
-
-/*
- * Execute trie traversal with 2 traversals in parallel.
- */
-static inline void
-search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE2];
-	struct completion cmplt[MAX_SEARCHES_SSE2];
-	struct parms parms[MAX_SEARCHES_SSE2];
-	xmm_t input, indicies;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indicies = MM_LOADU((xmm_t *) &index_array[0]);
-
-	/* Check for any matches. */
-	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-
-		/* Process the 4 bytes of input on each stream. */
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		/* Check for any matches. */
-		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
-			mm_match_mask64.m);
-	}
-}
-
-/*
- * When processing the transition, rather than using if/else
- * construct, the offset is calculated for DFA and QRANGE and
- * then conditionally added to the address based on node type.
- * This is done to avoid branch mis-predictions. Since the
- * offset is rather simple calculation it is more efficient
- * to do the calculation and do a condition move rather than
- * a conditional branch to determine which calculation to do.
- */
-static inline uint32_t
-scan_forward(uint32_t input, uint32_t max)
-{
-	return (input == 0) ? max : rte_bsf32(input);
-}
-
-static inline uint64_t
-scalar_transition(const uint64_t *trans_table, uint64_t transition,
-	uint8_t input)
-{
-	uint32_t addr, index, ranges, x, a, b, c;
-
-	/* break transition into component parts */
-	ranges = transition >> (sizeof(index) * CHAR_BIT);
-
-	/* calc address for a QRANGE node */
-	c = input * SCALAR_QRANGE_MULT;
-	a = ranges | SCALAR_QRANGE_MIN;
-	index = transition & ~RTE_ACL_NODE_INDEX;
-	a -= (c & SCALAR_QRANGE_MASK);
-	b = c & SCALAR_QRANGE_MIN;
-	addr = transition ^ index;
-	a &= SCALAR_QRANGE_MIN;
-	a ^= (ranges ^ b) & (a ^ b);
-	x = scan_forward(a, 32) >> 3;
-	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
-
-	/* pickup next transition */
-	transition = *(trans_table + addr);
-	return transition;
-}
-
-int
-rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories)
-{
-	int n;
-	uint64_t transition0, transition1;
-	uint32_t input0, input1;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SCALAR];
-	struct completion cmplt[MAX_SEARCHES_SCALAR];
-	struct parms parms[MAX_SEARCHES_SCALAR];
-
-	if (categories != 1 &&
-		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
-		return -EINVAL;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
-		categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	transition0 = index_array[0];
-	transition1 = index_array[1];
-
-	while (flows.started > 0) {
-
-		input0 = GET_NEXT_4BYTES(parms, 0);
-		input1 = GET_NEXT_4BYTES(parms, 1);
-
-		for (n = 0; n < 4; n++) {
-			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
-				transition0 = scalar_transition(flows.trans,
-					transition0, (uint8_t)input0);
-
-			input0 >>= CHAR_BIT;
-
-			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
-				transition1 = scalar_transition(flows.trans,
-					transition1, (uint8_t)input1);
-
-			input1 >>= CHAR_BIT;
-
-		}
-		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
-			transition0 = acl_match_check_transition(transition0,
-				0, ctx, parms, &flows);
-			transition1 = acl_match_check_transition(transition1,
-				1, ctx, parms, &flows);
-
-		}
-	}
-	return 0;
-}
-
-int
-rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories)
-{
-	if (categories != 1 &&
-		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
-		return -EINVAL;
-
-	if (likely(num >= MAX_SEARCHES_SSE8))
-		search_sse_8(ctx, data, results, num, categories);
-	else if (num >= MAX_SEARCHES_SSE4)
-		search_sse_4(ctx, data, results, num, categories);
-	else
-		search_sse_2(ctx, data, results, num, categories);
-
-	return 0;
-}
diff --git a/lib/librte_acl/acl_run.h b/lib/librte_acl/acl_run.h
new file mode 100644
index 0000000..c39650e
--- /dev/null
+++ b/lib/librte_acl/acl_run.h
@@ -0,0 +1,220 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef	_ACL_RUN_H_
+#define	_ACL_RUN_H_
+
+#include <rte_acl.h>
+#include "acl_vect.h"
+#include "acl.h"
+
+#define MAX_SEARCHES_SSE8	8
+#define MAX_SEARCHES_SSE4	4
+#define MAX_SEARCHES_SSE2	2
+#define MAX_SEARCHES_SCALAR	2
+
+#define GET_NEXT_4BYTES(prm, idx)	\
+	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
+
+
+#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
+
+#define	SCALAR_QRANGE_MULT	0x01010101
+#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
+#define	SCALAR_QRANGE_MIN	0x80808080
+
+/*
+ * Structure to manage N parallel trie traversals.
+ * The runtime trie traversal routines can process 8, 4, or 2 tries
+ * in parallel. Each packet may require multiple trie traversals (up to 4).
+ * This structure is used to fill the slots (0 to n-1) for parallel processing
+ * with the trie traversals needed for each packet.
+ */
+struct acl_flow_data {
+	uint32_t            num_packets;
+	/* number of packets processed */
+	uint32_t            started;
+	/* number of trie traversals in progress */
+	uint32_t            trie;
+	/* current trie index (0 to N-1) */
+	uint32_t            cmplt_size;
+	uint32_t            total_packets;
+	/* maximum number of packets to process */
+	uint32_t            categories;
+	/* number of result categories per packet. */
+	const uint64_t     *trans;
+	const uint8_t     **data;
+	uint32_t           *results;
+	struct completion  *last_cmplt;
+	struct completion  *cmplt_array;
+};
+
+/*
+ * Structure to maintain running results for
+ * a single packet (up to 4 tries).
+ */
+struct completion {
+	uint32_t *results;                          /* running results. */
+	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
+	uint32_t  count;                            /* num of remaining tries */
+	/* true for allocated struct */
+} __attribute__((aligned(XMM_SIZE)));
+
+/*
+ * One parms structure for each slot in the search engine.
+ */
+struct parms {
+	const uint8_t              *data;
+	/* input data for this packet */
+	const uint32_t             *data_index;
+	/* data indirection for this trie */
+	struct completion          *cmplt;
+	/* completion data for this packet */
+};
+
+/*
+ * Define a global idle node for unused engine slots
+ */
+static const uint32_t idle[UINT8_MAX + 1];
+
+/*
+ * Allocate a completion structure to manage the tries for a packet.
+ */
+static inline struct completion *
+alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
+	uint32_t *results)
+{
+	uint32_t n;
+
+	for (n = 0; n < size; n++) {
+
+		if (p[n].count == 0) {
+
+			/* mark as allocated and set number of tries. */
+			p[n].count = tries;
+			p[n].results = results;
+			return &(p[n]);
+		}
+	}
+
+	/* should never get here */
+	return NULL;
+}
+
+/*
+ * Resolve priority for a single result trie.
+ */
+static inline void
+resolve_single_priority(uint64_t transition, int n,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	const struct rte_acl_match_results *p)
+{
+	if (parms[n].cmplt->count == ctx->num_tries ||
+			parms[n].cmplt->priority[0] <=
+			p[transition].priority[0]) {
+
+		parms[n].cmplt->priority[0] = p[transition].priority[0];
+		parms[n].cmplt->results[0] = p[transition].results[0];
+	}
+}
+
+/*
+ * Routine to fill a slot in the parallel trie traversal array (parms) from
+ * the list of packets (flows).
+ */
+static inline uint64_t
+acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
+	const struct rte_acl_ctx *ctx)
+{
+	uint64_t transition;
+
+	/* if there are any more packets to process */
+	if (flows->num_packets < flows->total_packets) {
+		parms[n].data = flows->data[flows->num_packets];
+		parms[n].data_index = ctx->trie[flows->trie].data_index;
+
+		/* if this is the first trie for this packet */
+		if (flows->trie == 0) {
+			flows->last_cmplt = alloc_completion(flows->cmplt_array,
+				flows->cmplt_size, ctx->num_tries,
+				flows->results +
+				flows->num_packets * flows->categories);
+		}
+
+		/* set completion parameters and starting index for this slot */
+		parms[n].cmplt = flows->last_cmplt;
+		transition =
+			flows->trans[parms[n].data[*parms[n].data_index++] +
+			ctx->trie[flows->trie].root_index];
+
+		/*
+		 * if this is the last trie for this packet,
+		 * then setup next packet.
+		 */
+		flows->trie++;
+		if (flows->trie >= ctx->num_tries) {
+			flows->trie = 0;
+			flows->num_packets++;
+		}
+
+		/* keep track of number of active trie traversals */
+		flows->started++;
+
+	/* no more tries to process, set slot to an idle position */
+	} else {
+		transition = ctx->idle;
+		parms[n].data = (const uint8_t *)idle;
+		parms[n].data_index = idle;
+	}
+	return transition;
+}
+
+static inline void
+acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
+	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
+	uint32_t data_num, uint32_t categories, const uint64_t *trans)
+{
+	flows->num_packets = 0;
+	flows->started = 0;
+	flows->trie = 0;
+	flows->last_cmplt = NULL;
+	flows->cmplt_array = cmplt;
+	flows->total_packets = data_num;
+	flows->categories = categories;
+	flows->cmplt_size = cmplt_size;
+	flows->data = data;
+	flows->results = results;
+	flows->trans = trans;
+}
+
+#endif /* _ACL_RUN_H_ */
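All the runners built on top of acl_run.h (scalar, 4-wide and 8-wide SSE) follow
the same shape; a hedged skeleton of that common structure, with the per-ISA inner
loop elided (classify_skeleton and its two-slot size are purely illustrative):

static int
classify_skeleton(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	struct acl_flow_data flows;
	struct completion cmplt[MAX_SEARCHES_SCALAR];
	struct parms parms[MAX_SEARCHES_SCALAR];
	uint64_t index_array[MAX_SEARCHES_SCALAR];
	uint32_t n;

	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
		categories, ctx->trans_table);

	/* seed each engine slot with the first trie of the first packets */
	for (n = 0; n != RTE_DIM(cmplt); n++) {
		cmplt[n].count = 0;
		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
	}

	/*
	 * Main loop (elided): while (flows.started > 0), fetch input bytes
	 * per slot with GET_NEXT_4BYTES(), follow transitions through
	 * flows.trans, and call acl_match_check() from acl_match_check.h
	 * on any slot whose transition has RTE_ACL_NODE_MATCH set; the
	 * index_array[] values seeded above are the starting transitions.
	 */
	return 0;
}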
diff --git a/lib/librte_acl/acl_run_scalar.c b/lib/librte_acl/acl_run_scalar.c
new file mode 100644
index 0000000..a59ff17
--- /dev/null
+++ b/lib/librte_acl/acl_run_scalar.c
@@ -0,0 +1,198 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+#include "acl_match_check.h"
+
+int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories);
+
+/*
+ * Resolve priority for multiple results (scalar version).
+ * This consists of comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_scalar(uint64_t transition, int n,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	const struct rte_acl_match_results *p, uint32_t categories)
+{
+	uint32_t i;
+	int32_t *saved_priority;
+	uint32_t *saved_results;
+	const int32_t *priority;
+	const uint32_t *results;
+
+	saved_results = parms[n].cmplt->results;
+	saved_priority = parms[n].cmplt->priority;
+
+	/* results and priorities for completed trie */
+	results = p[transition].results;
+	priority = p[transition].priority;
+
+	/* if this is not the first completed trie */
+	if (parms[n].cmplt->count != ctx->num_tries) {
+		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
+
+			if (saved_priority[i] <= priority[i]) {
+				saved_priority[i] = priority[i];
+				saved_results[i] = results[i];
+			}
+			if (saved_priority[i + 1] <= priority[i + 1]) {
+				saved_priority[i + 1] = priority[i + 1];
+				saved_results[i + 1] = results[i + 1];
+			}
+			if (saved_priority[i + 2] <= priority[i + 2]) {
+				saved_priority[i + 2] = priority[i + 2];
+				saved_results[i + 2] = results[i + 2];
+			}
+			if (saved_priority[i + 3] <= priority[i + 3]) {
+				saved_priority[i + 3] = priority[i + 3];
+				saved_results[i + 3] = results[i + 3];
+			}
+		}
+	} else {
+		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
+			saved_priority[i] = priority[i];
+			saved_priority[i + 1] = priority[i + 1];
+			saved_priority[i + 2] = priority[i + 2];
+			saved_priority[i + 3] = priority[i + 3];
+
+			saved_results[i] = results[i];
+			saved_results[i + 1] = results[i + 1];
+			saved_results[i + 2] = results[i + 2];
+			saved_results[i + 3] = results[i + 3];
+		}
+	}
+}
+
+/*
+ * When processing the transition, rather than using an if/else
+ * construct, the offset is calculated for both DFA and QRANGE and
+ * then conditionally added to the address based on node type.
+ * This is done to avoid branch mis-predictions. Since the
+ * offset is a rather simple calculation, it is more efficient
+ * to do the calculation and a conditional move than to take
+ * a conditional branch to determine which calculation to do.
+ */
+static inline uint32_t
+scan_forward(uint32_t input, uint32_t max)
+{
+	return (input == 0) ? max : rte_bsf32(input);
+}
+
+static inline uint64_t
+scalar_transition(const uint64_t *trans_table, uint64_t transition,
+	uint8_t input)
+{
+	uint32_t addr, index, ranges, x, a, b, c;
+
+	/* break transition into component parts */
+	ranges = transition >> (sizeof(index) * CHAR_BIT);
+
+	/* calc address for a QRANGE node */
+	c = input * SCALAR_QRANGE_MULT;
+	a = ranges | SCALAR_QRANGE_MIN;
+	index = transition & ~RTE_ACL_NODE_INDEX;
+	a -= (c & SCALAR_QRANGE_MASK);
+	b = c & SCALAR_QRANGE_MIN;
+	addr = transition ^ index;
+	a &= SCALAR_QRANGE_MIN;
+	a ^= (ranges ^ b) & (a ^ b);
+	x = scan_forward(a, 32) >> 3;
+	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
+
+	/* pickup next transition */
+	transition = *(trans_table + addr);
+	return transition;
+}
+
+int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	int n;
+	uint64_t transition0, transition1;
+	uint32_t input0, input1;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SCALAR];
+	struct completion cmplt[MAX_SEARCHES_SCALAR];
+	struct parms parms[MAX_SEARCHES_SCALAR];
+
+	if (categories != 1 &&
+		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
+		return -EINVAL;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
+		categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	transition0 = index_array[0];
+	transition1 = index_array[1];
+
+	while (flows.started > 0) {
+
+		input0 = GET_NEXT_4BYTES(parms, 0);
+		input1 = GET_NEXT_4BYTES(parms, 1);
+
+		for (n = 0; n < 4; n++) {
+			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
+				transition0 = scalar_transition(flows.trans,
+					transition0, (uint8_t)input0);
+
+			input0 >>= CHAR_BIT;
+
+			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
+				transition1 = scalar_transition(flows.trans,
+					transition1, (uint8_t)input1);
+
+			input1 >>= CHAR_BIT;
+
+		}
+		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
+			transition0 = acl_match_check(transition0,
+				0, ctx, parms, &flows, resolve_priority_scalar);
+			transition1 = acl_match_check(transition1,
+				1, ctx, parms, &flows, resolve_priority_scalar);
+
+		}
+	}
+	return 0;
+}
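On targets without SSE4.1, or for debugging and validation, this scalar path can
be forced at run time exactly the way the unit test above does. A small usage
sketch (classify_with_scalar is an illustrative wrapper, not part of the patch;
error handling of the selection call is omitted):

static int
classify_with_scalar(struct rte_acl_ctx *acx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	/* force the portable code path, then classify as usual */
	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
	return rte_acl_classify(acx, data, results, num, categories);
}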
diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
new file mode 100644
index 0000000..3f5c721
--- /dev/null
+++ b/lib/librte_acl/acl_run_sse.c
@@ -0,0 +1,627 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+#include "acl_match_check.h"
+
+enum {
+	SHUFFLE32_SLOT1 = 0xe5,
+	SHUFFLE32_SLOT2 = 0xe6,
+	SHUFFLE32_SLOT3 = 0xe7,
+	SHUFFLE32_SWAP64 = 0x4e,
+};
+
+static const rte_xmm_t mm_type_quad_range = {
+	.u32 = {
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+	},
+};
+
+static const rte_xmm_t mm_type_quad_range64 = {
+	.u32 = {
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		0,
+		0,
+	},
+};
+
+static const rte_xmm_t mm_shuffle_input = {
+	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
+};
+
+static const rte_xmm_t mm_shuffle_input64 = {
+	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
+};
+
+static const rte_xmm_t mm_ones_16 = {
+	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
+};
+
+static const rte_xmm_t mm_bytes = {
+	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
+};
+
+static const rte_xmm_t mm_bytes64 = {
+	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
+};
+
+static const rte_xmm_t mm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_xmm_t mm_match_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		0,
+		RTE_ACL_NODE_MATCH,
+		0,
+	},
+};
+
+static const rte_xmm_t mm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_xmm_t mm_index_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		0,
+		0,
+	},
+};
+
+
+/*
+ * Resolve priority for multiple results (sse version).
+ * This consists of comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_sse(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
+	struct parms *parms, const struct rte_acl_match_results *p,
+	uint32_t categories)
+{
+	uint32_t x;
+	xmm_t results, priority, results1, priority1, selector;
+	xmm_t *saved_results, *saved_priority;
+
+	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
+
+		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
+		saved_priority =
+			(xmm_t *)(&parms[n].cmplt->priority[x]);
+
+		/* get results and priorities for completed trie */
+		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
+		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
+
+		/* if this is not the first completed trie */
+		if (parms[n].cmplt->count != ctx->num_tries) {
+
+			/* get running best results and their priorities */
+			results1 = MM_LOADU(saved_results);
+			priority1 = MM_LOADU(saved_priority);
+
+			/* select results that are highest priority */
+			selector = MM_CMPGT32(priority1, priority);
+			results = MM_BLENDV8(results, results1, selector);
+			priority = MM_BLENDV8(priority, priority1, selector);
+		}
+
+		/* save running best results and their priorities */
+		MM_STOREU(saved_results, results);
+		MM_STOREU(saved_priority, priority);
+	}
+}
+
+/*
+ * Extract transitions from an XMM register and check for any matches
+ */
+static void
+acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
+	struct parms *parms, struct acl_flow_data *flows)
+{
+	uint64_t transition1, transition2;
+
+	/* extract transition from low 64 bits. */
+	transition1 = MM_CVT64(*indicies);
+
+	/* extract transition from high 64 bits. */
+	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
+	transition2 = MM_CVT64(*indicies);
+
+	transition1 = acl_match_check(transition1, slot, ctx,
+		parms, flows, resolve_priority_sse);
+	transition2 = acl_match_check(transition2, slot + 1, ctx,
+		parms, flows, resolve_priority_sse);
+
+	/* update indicies with new transitions. */
+	*indicies = MM_SET64(transition2, transition1);
+}
+
+/*
+ * Check for a match in 2 transitions (contained in SSE register)
+ */
+static inline void
+acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
+{
+	xmm_t temp;
+
+	temp = MM_AND(match_mask, *indicies);
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indicies, slot, ctx, parms, flows);
+		temp = MM_AND(match_mask, *indicies);
+	}
+}
+
+/*
+ * Check for any match in 4 transitions (contained in 2 SSE registers)
+ */
+static inline void
+acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
+	xmm_t match_mask)
+{
+	xmm_t temp;
+
+	/* put low 32 bits of each transition into one register */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
+		0x88);
+	/* test for match node */
+	temp = MM_AND(match_mask, temp);
+
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indicies1, slot, ctx, parms, flows);
+		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
+
+		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
+					(__m128)*indicies2,
+					0x88);
+		temp = MM_AND(match_mask, temp);
+	}
+}
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes don't move.
+ */
+static inline xmm_t
+acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	xmm_t *indicies1, xmm_t *indicies2)
+{
+	xmm_t addr, node_types, temp;
+
+	/*
+	 * Note that no transition is done for a match
+	 * node and therefore a stream freezes when
+	 * it reaches a match.
+	 */
+
+	/* Shuffle low 32 into temp and high 32 into indicies2 */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
+		0x88);
+	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
+		(__m128)*indicies2, 0xdd);
+
+	/* Calc node type and node addr */
+	node_types = MM_ANDNOT(index_mask, temp);
+	addr = MM_AND(index_mask, temp);
+
+	/*
+	 * Calc addr for DFAs - addr = dfa_index + input_byte
+	 */
+
+	/* mask for DFA type (0) nodes */
+	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
+
+	/* add input byte to DFA position */
+	temp = MM_AND(temp, bytes);
+	temp = MM_AND(temp, next_input);
+	addr = MM_ADD32(addr, temp);
+
+	/*
+	 * Calc addr for Range nodes -> range_index + range(input)
+	 */
+	node_types = MM_CMPEQ32(node_types, type_quad_range);
+
+	/*
+	 * Calculate number of range boundaries that are less than the
+	 * input value. Range boundaries for each node are in signed 8 bit,
+	 * ordered from -128 to 127 in the indicies2 register.
+	 * This is effectively a popcnt of bytes that are greater than the
+	 * input byte.
+	 */
+
+	/* shuffle input byte to all 4 positions of 32 bit value */
+	temp = MM_SHUFFLE8(next_input, shuffle_input);
+
+	/* check ranges */
+	temp = MM_CMPGT8(temp, *indicies2);
+
+	/* convert -1 to 1 (bytes greater than input byte) */
+	temp = MM_SIGN8(temp, temp);
+
+	/* horizontal add pairs of bytes into words */
+	temp = MM_MADD8(temp, temp);
+
+	/* horizontal add pairs of words into dwords */
+	temp = MM_MADD16(temp, ones_16);
+
+	/* mask to range type nodes */
+	temp = MM_AND(temp, node_types);
+
+	/* add index into node position */
+	return MM_ADD32(addr, temp);
+}
+
+/*
+ * Process 4 transitions (in 2 SIMD registers) in parallel
+ */
+static inline xmm_t
+transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
+{
+	xmm_t addr;
+	uint64_t trans0, trans2;
+
+	 /* Calculate the address (array index) for all 4 transitions. */
+
+	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
+		bytes, type_quad_range, indicies1, indicies2);
+
+	 /* Gather 64 bit transitions and pack back into 2 registers. */
+
+	trans0 = trans[MM_CVT32(addr)];
+
+	/* get slot 2 */
+
+	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
+	trans2 = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+
+	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
+
+	/* get slot 3 */
+
+	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
+	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
+
+	return MM_SRL32(next_input, 8);
+}
+
+/*
+ * Execute trie traversal with 8 traversals in parallel
+ */
+static inline int
+search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE8];
+	struct completion cmplt[MAX_SEARCHES_SSE8];
+	struct parms parms[MAX_SEARCHES_SSE8];
+	xmm_t input0, input1;
+	xmm_t indicies1, indicies2, indicies3, indicies4;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	/*
+	 * indicies1 contains index_array[0,1]
+	 * indicies2 contains index_array[2,3]
+	 * indicies3 contains index_array[4,5]
+	 * indicies4 contains index_array[6,7]
+	 */
+
+	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
+	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
+
+	 /* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indicies1, &indicies2, mm_match_mask.m);
+	acl_match_check_x4(4, ctx, parms, &flows,
+		&indicies3, &indicies4, mm_match_mask.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
+			0);
+		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
+			0);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
+
+		 /* Process the 4 bytes of input on each stream. */
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		 /* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indicies1, &indicies2, mm_match_mask.m);
+		acl_match_check_x4(4, ctx, parms, &flows,
+			&indicies3, &indicies4, mm_match_mask.m);
+	}
+
+	return 0;
+}
+
+/*
+ * Execute trie traversal with 4 traversals in parallel
+ */
+static inline int
+search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	 uint32_t *results, int total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE4];
+	struct completion cmplt[MAX_SEARCHES_SSE4];
+	struct parms parms[MAX_SEARCHES_SSE4];
+	xmm_t input, indicies1, indicies2;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	/* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indicies1, &indicies2, mm_match_mask.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
+
+		/* Process the 4 bytes of input on each stream. */
+		input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		 input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		 input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		 input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		/* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indicies1, &indicies2, mm_match_mask.m);
+	}
+
+	return 0;
+}
+
+static inline xmm_t
+transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	const uint64_t *trans, xmm_t *indicies1)
+{
+	uint64_t t;
+	xmm_t addr, indicies2;
+
+	indicies2 = MM_XOR(ones_16, ones_16);
+
+	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
+		bytes, type_quad_range, indicies1, &indicies2);
+
+	/* Gather 64 bit transitions and pack 2 per register. */
+
+	t = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
+
+	return MM_SRL32(next_input, 8);
+}
+
+/*
+ * Execute trie traversal with 2 traversals in parallel.
+ */
+static inline int
+search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE2];
+	struct completion cmplt[MAX_SEARCHES_SSE2];
+	struct parms parms[MAX_SEARCHES_SSE2];
+	xmm_t input, indicies;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indicies = MM_LOADU((xmm_t *) &index_array[0]);
+
+	/* Check for any matches. */
+	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+
+		/* Process the 4 bytes of input on each stream. */
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		/* Check for any matches. */
+		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
+			mm_match_mask64.m);
+	}
+
+	return 0;
+}
+
+int
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	if (categories != 1 &&
+		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
+		return -EINVAL;
+
+	if (likely(num >= MAX_SEARCHES_SSE8))
+		return search_sse_8(ctx, data, results, num, categories);
+	else if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+	else
+		return search_sse_2(ctx, data, results, num, categories);
+}
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 7c288bd..b9173c1 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -38,6 +38,52 @@
 
 TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
 
+typedef int (*rte_acl_classify_t)
+(const struct rte_acl_ctx *, const uint8_t **, uint32_t *, uint32_t, uint32_t);
+
+extern int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+        uint32_t *results, uint32_t num, uint32_t categories);
+
+/* by default, use the always available scalar code path. */
+rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
+
+void rte_acl_select_classify(enum acl_classify_alg alg)
+{
+
+	switch(alg)
+	{
+		case ACL_CLASSIFY_DEFAULT:
+		case ACL_CLASSIFY_SCALAR:
+			rte_acl_default_classify = rte_acl_classify_scalar;
+			break;
+		case ACL_CLASSIFY_SSE:
+			rte_acl_default_classify = rte_acl_classify_sse;
+			break;
+	}
+
+}
+
+static void __attribute__((constructor))
+rte_acl_init(void)
+{
+	enum acl_classify_alg alg = ACL_CLASSIFY_DEFAULT;
+
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+		alg = ACL_CLASSIFY_SSE;
+
+	rte_acl_select_classify(alg);
+}
+
+inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
+                            const uint8_t **data,
+                            uint32_t *results, uint32_t num,
+                            uint32_t categories)
+{
+	return rte_acl_default_classify(ctx, data, results, num, categories);
+}
+
+
 struct rte_acl_ctx *
 rte_acl_find_existing(const char *name)
 {
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index afc0f69..650b306 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -267,6 +267,9 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
  * RTE_ACL_RESULTS_MULTIPLIER and can't be bigger than RTE_ACL_MAX_CATEGORIES.
  * If more than one rule is applicable for given input buffer and
  * given category, then rule with highest priority will be returned as a match.
+ * Note, that this function could be run only on CPUs with SSE4.1 support.
+ * It is up to the caller to make sure that this function is only invoked on
+ * a machine that supports SSE4.1 ISA.
  * Note, that it is a caller responsibility to ensure that input parameters
  * are valid and point to correct memory locations.
  *
@@ -286,9 +289,10 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
  * @return
  *   zero on successful completion.
  *   -EINVAL for incorrect arguments.
+ *   -ENOTSUP for unsupported platforms.
  */
 int
-rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
 
 /**
@@ -323,9 +327,23 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
  *   zero on successful completion.
  *   -EINVAL for incorrect arguments.
  */
-int
-rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories);
+
+enum acl_classify_alg {
+	ACL_CLASSIFY_DEFAULT = 0,
+	ACL_CLASSIFY_SCALAR = 1,
+	ACL_CLASSIFY_SSE = 2,
+};
+
+extern inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
+				   const uint8_t **data,
+				   uint32_t *results, uint32_t num,
+				   uint32_t categories);
+/**
+ * Analyze the ISA of the current CPU and point rte_acl_default_classify
+ * to the highest applicable version of the classify function.
+ */
+extern void
+rte_acl_select_classify(enum acl_classify_alg alg);
 
 /**
  * Dump an ACL context structure to the console.
-- 
1.9.3

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
  2014-08-21 20:15 ` [dpdk-dev] [PATCHv3] " Neil Horman
@ 2014-08-25 16:30   ` Ananyev, Konstantin
  2014-08-26 17:44     ` Neil Horman
  0 siblings, 1 reply; 21+ messages in thread
From: Ananyev, Konstantin @ 2014-08-25 16:30 UTC (permalink / raw)
  To: Neil Horman, dev

Hi Neil,

> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Thursday, August 21, 2014 9:15 PM
> To: dev@dpdk.org
> Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com; Neil Horman
> Subject: [PATCHv3] librte_acl make it build/work for 'default' target
> 
> Make ACL library to build/work on 'default' architecture:
> - make rte_acl_classify_scalar really scalar
>  (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> - Provide two versions of rte_acl_classify code path:
>   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
>   and upper, return -ENOTSUP on lower arch.
>   rte_acl_classify_scalar() - a slower version, but could be build and used
>   on all systems.
> - keep common code shared between these two codepaths.
> 
> v2 chages:
>  run-time selection of most appropriate code-path for given ISA.
>  By default the highest supprted one is selected.
>  User can still override that selection by manually assigning new value to
>  the global function pointer rte_acl_default_classify.
>  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
>  points to.
> 

I see you decided not to wait for me and fix everything by yourself :)

> V3 Changes
>  Updated classify pointer to be a function so as to better preserve ABI

As I said in my previous mail, it generates an extra jump...
Though from the numbers I got, the performance impact is negligible: < 1%.
So I suppose I don't have a good enough reason to object :)

Though I still think we'd better keep rte_acl_classify_scalar() publicly available (same as we do for rte_acl_classify_sse()):
First of all, rte_acl_classify_scalar() is already part of our public API.
Also, as I remember, one of the customers explicitly asked for the scalar version and planned to call it directly.
Plus, using rte_acl_select_classify() to switch between implementations is not always handy:
- it is global, which means that we can't simultaneously use classify_scalar() and classify_sse() for 2 different ACL contexts.
- to properly support such switching we would then need something like this (see app/test/test_acl.c below and the sketch after it):
  old_alg = rte_acl_get_classify();
  rte_acl_select_classify(new_alg);
  ...
  rte_acl_select_classify(old_alg); 
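
For the scalar check in test_classify_run() that would look roughly like the sketch below (a rough sketch only; rte_acl_get_classify() doesn't exist yet and would have to be added as a new accessor returning the currently selected enum acl_classify_alg):

  enum acl_classify_alg old_alg;

  old_alg = rte_acl_get_classify();             /* hypothetical accessor */
  rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
  ret = rte_acl_classify(acx, data, results,
          RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
  rte_acl_select_classify(old_alg);             /* restore previous selection */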
  
>  REmoved macro definitions for match check functions to make them static inline

More comments inlined below.

Thanks
Konstantin

> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> ---
>  app/test-acl/main.c              |  13 +-
>  app/test/test_acl.c              |  12 +-
>  lib/librte_acl/Makefile          |   5 +-
>  lib/librte_acl/acl_bld.c         |   5 +-
>  lib/librte_acl/acl_match_check.h |  83 ++++
>  lib/librte_acl/acl_run.c         | 944 ---------------------------------------
>  lib/librte_acl/acl_run.h         | 220 +++++++++
>  lib/librte_acl/acl_run_scalar.c  | 198 ++++++++
>  lib/librte_acl/acl_run_sse.c     | 627 ++++++++++++++++++++++++++
>  lib/librte_acl/rte_acl.c         |  46 ++
>  lib/librte_acl/rte_acl.h         |  26 +-
>  11 files changed, 1216 insertions(+), 963 deletions(-)
>  create mode 100644 lib/librte_acl/acl_match_check.h
>  delete mode 100644 lib/librte_acl/acl_run.c
>  create mode 100644 lib/librte_acl/acl_run.h
>  create mode 100644 lib/librte_acl/acl_run_scalar.c
>  create mode 100644 lib/librte_acl/acl_run_sse.c
> 
> diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> index d654409..a77f47d 100644
> --- a/app/test-acl/main.c
> +++ b/app/test-acl/main.c
> @@ -787,6 +787,10 @@ acx_init(void)
>  	/* perform build. */
>  	ret = rte_acl_build(config.acx, &cfg);
> 
> +	/* setup default rte_acl_classify */
> +	if (config.scalar)
> +		rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> +
>  	dump_verbose(DUMP_NONE, stdout,
>  		"rte_acl_build(%u) finished with %d\n",
>  		config.bld_categories, ret);
> @@ -815,13 +819,8 @@ search_ip5tuples_once(uint32_t categories, uint32_t step, int scalar)
>  			v += config.trace_sz;
>  		}
> 
> -		if (scalar != 0)
> -			ret = rte_acl_classify_scalar(config.acx, data,
> -				results, n, categories);
> -
> -		else
> -			ret = rte_acl_classify(config.acx, data,
> -				results, n, categories);
> +		ret = rte_acl_classify(config.acx, data, results,
> +			n, categories);
> 
>  		if (ret != 0)
>  			rte_exit(ret, "classify for ipv%c_5tuples returns %d\n",
> diff --git a/app/test/test_acl.c b/app/test/test_acl.c
> index 869f6d3..2fcef6e 100644
> --- a/app/test/test_acl.c
> +++ b/app/test/test_acl.c
> @@ -148,7 +148,8 @@ test_classify_run(struct rte_acl_ctx *acx)
>  	}
> 
>  	/* make a quick check for scalar */
> -	ret = rte_acl_classify_scalar(acx, data, results,
> +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> +	ret = rte_acl_classify(acx, data, results,
>  			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);


As I said above, that doesn't seem correct: we set rte_acl_default_classify = rte_acl_classify_scalar and never restore it back to the original value.
To support it properly, we need to:
old_alg = rte_acl_get_classify();
rte_acl_select_classify(new_alg);
...
rte_acl_select_classify(old_alg);

Doing all this just to keep the UT valid seems like a big hassle to me.
As I said above, it's probably better to just let it call rte_acl_classify_scalar() directly.
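
I.e. keep the quick scalar check in the UT as a direct call, without touching the global selection - a rough sketch of what I mean:

  /* default (run-time selected) code path */
  ret = rte_acl_classify(acx, data, results,
          RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);

  /* quick check for the scalar code path, global selection left untouched */
  ret = rte_acl_classify_scalar(acx, data, results,
          RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);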

>  	if (ret != 0) {
>  		printf("Line %i: SSE classify failed!\n", __LINE__);
> @@ -362,7 +363,8 @@ test_invalid_layout(void)
>  	}
> 
>  	/* classify tuples (scalar) */
> -	ret = rte_acl_classify_scalar(acx, data, results,
> +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> +	ret = rte_acl_classify(acx, data, results,
>  			RTE_DIM(results), 1);
>  	if (ret != 0) {
>  		printf("Line %i: Scalar classify failed!\n", __LINE__);
> @@ -850,7 +852,8 @@ test_invalid_parameters(void)
>  	/* scalar classify test */
> 
>  	/* cover zero categories in classify (should not fail) */
> -	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 0);
> +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> +	result = rte_acl_classify(acx, NULL, NULL, 0, 0);
>  	if (result != 0) {
>  		printf("Line %i: Scalar classify with zero categories "
>  				"failed!\n", __LINE__);
> @@ -859,7 +862,8 @@ test_invalid_parameters(void)
>  	}
> 
>  	/* cover invalid but positive categories in classify */
> -	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 3);
> +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> +	result = rte_acl_classify(acx, NULL, NULL, 0, 3);
>  	if (result == 0) {
>  		printf("Line %i: Scalar classify with 3 categories "
>  				"should have failed!\n", __LINE__);
> diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
> index 4fe4593..65e566d 100644
> --- a/lib/librte_acl/Makefile
> +++ b/lib/librte_acl/Makefile
> @@ -43,7 +43,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
>  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += rte_acl.c
>  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
>  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
> -SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run.c
> +SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
> +SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
> +
> +CFLAGS_acl_run_sse.o += -msse4.1
> 
>  # install this header file
>  SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include := rte_acl_osdep.h
> diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
> index 873447b..09d58ea 100644
> --- a/lib/librte_acl/acl_bld.c
> +++ b/lib/librte_acl/acl_bld.c
> @@ -31,7 +31,6 @@
>   *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
>   */
> 
> -#include <nmmintrin.h>
>  #include <rte_acl.h>
>  #include "tb_mem.h"
>  #include "acl.h"
> @@ -1480,8 +1479,8 @@ acl_calc_wildness(struct rte_acl_build_rule *head,
> 
>  			switch (rule->config->defs[n].type) {
>  			case RTE_ACL_FIELD_TYPE_BITMASK:
> -				wild = (size -
> -					_mm_popcnt_u32(fld->mask_range.u8)) /
> +				wild = (size - __builtin_popcount(
> +					fld->mask_range.u8)) /
>  					size;
>  				break;
> 
> diff --git a/lib/librte_acl/acl_match_check.h b/lib/librte_acl/acl_match_check.h
> new file mode 100644
> index 0000000..4dc1982
> --- /dev/null
> +++ b/lib/librte_acl/acl_match_check.h

As a nit: we probably don't need a special header just for one function and can place it inside acl_run.h.

> @@ -0,0 +1,83 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _ACL_MATCH_CHECK_H_
> +#define _ACL_MATCH_CHECK_H_
> +
> +/*
> + * Detect matches. If a match node transition is found, then this trie
> + * traversal is complete and fill the slot with the next trie
> + * to be processed.
> + */
> +static inline uint64_t
> +acl_match_check(uint64_t transition, int slot,
> +	const struct rte_acl_ctx *ctx, struct parms *parms,
> +	struct acl_flow_data *flows, void (*resolve_priority)(
> +	uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> +	struct parms *parms, const struct rte_acl_match_results *p,
> +	uint32_t categories))

Ugh, that's really hard to read.
Can we create a typedef for the resolve_priority function type:
typedef void (*resolve_priority_t)(uint64_t, int,
        const struct rte_acl_ctx *ctx, struct parms *,
        const struct rte_acl_match_results *, uint32_t);
and use it here?
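
With that typedef the declaration above would shrink to something like this (just a sketch):

  static inline uint64_t
  acl_match_check(uint64_t transition, int slot,
          const struct rte_acl_ctx *ctx, struct parms *parms,
          struct acl_flow_data *flows, resolve_priority_t resolve_priority);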

> +{
> +	const struct rte_acl_match_results *p;
> +
> +	p = (const struct rte_acl_match_results *)
> +		(flows->trans + ctx->match_index);
> +
> +	if (transition & RTE_ACL_NODE_MATCH) {
> +
> +		/* Remove flags from index and decrement active traversals */
> +		transition &= RTE_ACL_NODE_INDEX;
> +		flows->started--;
> +
> +		/* Resolve priorities for this trie and running results */
> +		if (flows->categories == 1)
> +			resolve_single_priority(transition, slot, ctx,
> +				parms, p);
> +		else
> +			resolve_priority(transition, slot, ctx, parms,
> +				p, flows->categories);
> +
> +		/* Count down completed tries for this search request */
> +		parms[slot].cmplt->count--;
> +
> +		/* Fill the slot with the next trie or idle trie */
> +		transition = acl_start_next_trie(flows, parms, slot, ctx);
> +
> +	} else if (transition == ctx->idle) {
> +		/* reset indirection table for idle slots */
> +		parms[slot].data_index = idle;
> +	}
> +
> +	return transition;
> +}
> +
> +#endif
> diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> deleted file mode 100644
> index e3d9fc1..0000000
> --- a/lib/librte_acl/acl_run.c
> +++ /dev/null
> @@ -1,944 +0,0 @@
> -/*-
> - *   BSD LICENSE
> - *
> - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> - *   All rights reserved.
> - *
> - *   Redistribution and use in source and binary forms, with or without
> - *   modification, are permitted provided that the following conditions
> - *   are met:
> - *
> - *     * Redistributions of source code must retain the above copyright
> - *       notice, this list of conditions and the following disclaimer.
> - *     * Redistributions in binary form must reproduce the above copyright
> - *       notice, this list of conditions and the following disclaimer in
> - *       the documentation and/or other materials provided with the
> - *       distribution.
> - *     * Neither the name of Intel Corporation nor the names of its
> - *       contributors may be used to endorse or promote products derived
> - *       from this software without specific prior written permission.
> - *
> - *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> - *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> - *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> - *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> - *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> - *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> - *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> - *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> - *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> - *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> - *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> - */
> -
> -#include <rte_acl.h>
> -#include "acl_vect.h"
> -#include "acl.h"
> -
> -#define MAX_SEARCHES_SSE8	8
> -#define MAX_SEARCHES_SSE4	4
> -#define MAX_SEARCHES_SSE2	2
> -#define MAX_SEARCHES_SCALAR	2
> -
> -#define GET_NEXT_4BYTES(prm, idx)	\
> -	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
> -
> -
> -#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
> -
> -#define	SCALAR_QRANGE_MULT	0x01010101
> -#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
> -#define	SCALAR_QRANGE_MIN	0x80808080
> -
> -enum {
> -	SHUFFLE32_SLOT1 = 0xe5,
> -	SHUFFLE32_SLOT2 = 0xe6,
> -	SHUFFLE32_SLOT3 = 0xe7,
> -	SHUFFLE32_SWAP64 = 0x4e,
> -};
> -
> -/*
> - * Structure to manage N parallel trie traversals.
> - * The runtime trie traversal routines can process 8, 4, or 2 tries
> - * in parallel. Each packet may require multiple trie traversals (up to 4).
> - * This structure is used to fill the slots (0 to n-1) for parallel processing
> - * with the trie traversals needed for each packet.
> - */
> -struct acl_flow_data {
> -	uint32_t            num_packets;
> -	/* number of packets processed */
> -	uint32_t            started;
> -	/* number of trie traversals in progress */
> -	uint32_t            trie;
> -	/* current trie index (0 to N-1) */
> -	uint32_t            cmplt_size;
> -	uint32_t            total_packets;
> -	uint32_t            categories;
> -	/* number of result categories per packet. */
> -	/* maximum number of packets to process */
> -	const uint64_t     *trans;
> -	const uint8_t     **data;
> -	uint32_t           *results;
> -	struct completion  *last_cmplt;
> -	struct completion  *cmplt_array;
> -};
> -
> -/*
> - * Structure to maintain running results for
> - * a single packet (up to 4 tries).
> - */
> -struct completion {
> -	uint32_t *results;                          /* running results. */
> -	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
> -	uint32_t  count;                            /* num of remaining tries */
> -	/* true for allocated struct */
> -} __attribute__((aligned(XMM_SIZE)));
> -
> -/*
> - * One parms structure for each slot in the search engine.
> - */
> -struct parms {
> -	const uint8_t              *data;
> -	/* input data for this packet */
> -	const uint32_t             *data_index;
> -	/* data indirection for this trie */
> -	struct completion          *cmplt;
> -	/* completion data for this packet */
> -};
> -
> -/*
> - * Define an global idle node for unused engine slots
> - */
> -static const uint32_t idle[UINT8_MAX + 1];
> -
> -static const rte_xmm_t mm_type_quad_range = {
> -	.u32 = {
> -		RTE_ACL_NODE_QRANGE,
> -		RTE_ACL_NODE_QRANGE,
> -		RTE_ACL_NODE_QRANGE,
> -		RTE_ACL_NODE_QRANGE,
> -	},
> -};
> -
> -static const rte_xmm_t mm_type_quad_range64 = {
> -	.u32 = {
> -		RTE_ACL_NODE_QRANGE,
> -		RTE_ACL_NODE_QRANGE,
> -		0,
> -		0,
> -	},
> -};
> -
> -static const rte_xmm_t mm_shuffle_input = {
> -	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
> -};
> -
> -static const rte_xmm_t mm_shuffle_input64 = {
> -	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
> -};
> -
> -static const rte_xmm_t mm_ones_16 = {
> -	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
> -};
> -
> -static const rte_xmm_t mm_bytes = {
> -	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
> -};
> -
> -static const rte_xmm_t mm_bytes64 = {
> -	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
> -};
> -
> -static const rte_xmm_t mm_match_mask = {
> -	.u32 = {
> -		RTE_ACL_NODE_MATCH,
> -		RTE_ACL_NODE_MATCH,
> -		RTE_ACL_NODE_MATCH,
> -		RTE_ACL_NODE_MATCH,
> -	},
> -};
> -
> -static const rte_xmm_t mm_match_mask64 = {
> -	.u32 = {
> -		RTE_ACL_NODE_MATCH,
> -		0,
> -		RTE_ACL_NODE_MATCH,
> -		0,
> -	},
> -};
> -
> -static const rte_xmm_t mm_index_mask = {
> -	.u32 = {
> -		RTE_ACL_NODE_INDEX,
> -		RTE_ACL_NODE_INDEX,
> -		RTE_ACL_NODE_INDEX,
> -		RTE_ACL_NODE_INDEX,
> -	},
> -};
> -
> -static const rte_xmm_t mm_index_mask64 = {
> -	.u32 = {
> -		RTE_ACL_NODE_INDEX,
> -		RTE_ACL_NODE_INDEX,
> -		0,
> -		0,
> -	},
> -};
> -
> -/*
> - * Allocate a completion structure to manage the tries for a packet.
> - */
> -static inline struct completion *
> -alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
> -	uint32_t *results)
> -{
> -	uint32_t n;
> -
> -	for (n = 0; n < size; n++) {
> -
> -		if (p[n].count == 0) {
> -
> -			/* mark as allocated and set number of tries. */
> -			p[n].count = tries;
> -			p[n].results = results;
> -			return &(p[n]);
> -		}
> -	}
> -
> -	/* should never get here */
> -	return NULL;
> -}
> -
> -/*
> - * Resolve priority for a single result trie.
> - */
> -static inline void
> -resolve_single_priority(uint64_t transition, int n,
> -	const struct rte_acl_ctx *ctx, struct parms *parms,
> -	const struct rte_acl_match_results *p)
> -{
> -	if (parms[n].cmplt->count == ctx->num_tries ||
> -			parms[n].cmplt->priority[0] <=
> -			p[transition].priority[0]) {
> -
> -		parms[n].cmplt->priority[0] = p[transition].priority[0];
> -		parms[n].cmplt->results[0] = p[transition].results[0];
> -	}
> -
> -	parms[n].cmplt->count--;
> -}
> -
> -/*
> - * Resolve priority for multiple results. This consists comparing
> - * the priority of the current traversal with the running set of
> - * results for the packet. For each result, keep a running array of
> - * the result (rule number) and its priority for each category.
> - */
> -static inline void
> -resolve_priority(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> -	struct parms *parms, const struct rte_acl_match_results *p,
> -	uint32_t categories)
> -{
> -	uint32_t x;
> -	xmm_t results, priority, results1, priority1, selector;
> -	xmm_t *saved_results, *saved_priority;
> -
> -	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
> -
> -		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
> -		saved_priority =
> -			(xmm_t *)(&parms[n].cmplt->priority[x]);
> -
> -		/* get results and priorities for completed trie */
> -		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
> -		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
> -
> -		/* if this is not the first completed trie */
> -		if (parms[n].cmplt->count != ctx->num_tries) {
> -
> -			/* get running best results and their priorities */
> -			results1 = MM_LOADU(saved_results);
> -			priority1 = MM_LOADU(saved_priority);
> -
> -			/* select results that are highest priority */
> -			selector = MM_CMPGT32(priority1, priority);
> -			results = MM_BLENDV8(results, results1, selector);
> -			priority = MM_BLENDV8(priority, priority1, selector);
> -		}
> -
> -		/* save running best results and their priorities */
> -		MM_STOREU(saved_results, results);
> -		MM_STOREU(saved_priority, priority);
> -	}
> -
> -	/* Count down completed tries for this search request */
> -	parms[n].cmplt->count--;
> -}
> -
> -/*
> - * Routine to fill a slot in the parallel trie traversal array (parms) from
> - * the list of packets (flows).
> - */
> -static inline uint64_t
> -acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
> -	const struct rte_acl_ctx *ctx)
> -{
> -	uint64_t transition;
> -
> -	/* if there are any more packets to process */
> -	if (flows->num_packets < flows->total_packets) {
> -		parms[n].data = flows->data[flows->num_packets];
> -		parms[n].data_index = ctx->trie[flows->trie].data_index;
> -
> -		/* if this is the first trie for this packet */
> -		if (flows->trie == 0) {
> -			flows->last_cmplt = alloc_completion(flows->cmplt_array,
> -				flows->cmplt_size, ctx->num_tries,
> -				flows->results +
> -				flows->num_packets * flows->categories);
> -		}
> -
> -		/* set completion parameters and starting index for this slot */
> -		parms[n].cmplt = flows->last_cmplt;
> -		transition =
> -			flows->trans[parms[n].data[*parms[n].data_index++] +
> -			ctx->trie[flows->trie].root_index];
> -
> -		/*
> -		 * if this is the last trie for this packet,
> -		 * then setup next packet.
> -		 */
> -		flows->trie++;
> -		if (flows->trie >= ctx->num_tries) {
> -			flows->trie = 0;
> -			flows->num_packets++;
> -		}
> -
> -		/* keep track of number of active trie traversals */
> -		flows->started++;
> -
> -	/* no more tries to process, set slot to an idle position */
> -	} else {
> -		transition = ctx->idle;
> -		parms[n].data = (const uint8_t *)idle;
> -		parms[n].data_index = idle;
> -	}
> -	return transition;
> -}
> -
> -/*
> - * Detect matches. If a match node transition is found, then this trie
> - * traversal is complete and fill the slot with the next trie
> - * to be processed.
> - */
> -static inline uint64_t
> -acl_match_check_transition(uint64_t transition, int slot,
> -	const struct rte_acl_ctx *ctx, struct parms *parms,
> -	struct acl_flow_data *flows)
> -{
> -	const struct rte_acl_match_results *p;
> -
> -	p = (const struct rte_acl_match_results *)
> -		(flows->trans + ctx->match_index);
> -
> -	if (transition & RTE_ACL_NODE_MATCH) {
> -
> -		/* Remove flags from index and decrement active traversals */
> -		transition &= RTE_ACL_NODE_INDEX;
> -		flows->started--;
> -
> -		/* Resolve priorities for this trie and running results */
> -		if (flows->categories == 1)
> -			resolve_single_priority(transition, slot, ctx,
> -				parms, p);
> -		else
> -			resolve_priority(transition, slot, ctx, parms, p,
> -				flows->categories);
> -
> -		/* Fill the slot with the next trie or idle trie */
> -		transition = acl_start_next_trie(flows, parms, slot, ctx);
> -
> -	} else if (transition == ctx->idle) {
> -		/* reset indirection table for idle slots */
> -		parms[slot].data_index = idle;
> -	}
> -
> -	return transition;
> -}
> -
> -/*
> - * Extract transitions from an XMM register and check for any matches
> - */
> -static void
> -acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> -	struct parms *parms, struct acl_flow_data *flows)
> -{
> -	uint64_t transition1, transition2;
> -
> -	/* extract transition from low 64 bits. */
> -	transition1 = MM_CVT64(*indicies);
> -
> -	/* extract transition from high 64 bits. */
> -	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> -	transition2 = MM_CVT64(*indicies);
> -
> -	transition1 = acl_match_check_transition(transition1, slot, ctx,
> -		parms, flows);
> -	transition2 = acl_match_check_transition(transition2, slot + 1, ctx,
> -		parms, flows);
> -
> -	/* update indicies with new transitions. */
> -	*indicies = MM_SET64(transition2, transition1);
> -}
> -
> -/*
> - * Check for a match in 2 transitions (contained in SSE register)
> - */
> -static inline void
> -acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> -	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> -{
> -	xmm_t temp;
> -
> -	temp = MM_AND(match_mask, *indicies);
> -	while (!MM_TESTZ(temp, temp)) {
> -		acl_process_matches(indicies, slot, ctx, parms, flows);
> -		temp = MM_AND(match_mask, *indicies);
> -	}
> -}
> -
> -/*
> - * Check for any match in 4 transitions (contained in 2 SSE registers)
> - */
> -static inline void
> -acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> -	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> -	xmm_t match_mask)
> -{
> -	xmm_t temp;
> -
> -	/* put low 32 bits of each transition into one register */
> -	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> -		0x88);
> -	/* test for match node */
> -	temp = MM_AND(match_mask, temp);
> -
> -	while (!MM_TESTZ(temp, temp)) {
> -		acl_process_matches(indicies1, slot, ctx, parms, flows);
> -		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> -
> -		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> -					(__m128)*indicies2,
> -					0x88);
> -		temp = MM_AND(match_mask, temp);
> -	}
> -}
> -
> -/*
> - * Calculate the address of the next transition for
> - * all types of nodes. Note that only DFA nodes and range
> - * nodes actually transition to another node. Match
> - * nodes don't move.
> - */
> -static inline xmm_t
> -acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> -	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> -	xmm_t *indicies1, xmm_t *indicies2)
> -{
> -	xmm_t addr, node_types, temp;
> -
> -	/*
> -	 * Note that no transition is done for a match
> -	 * node and therefore a stream freezes when
> -	 * it reaches a match.
> -	 */
> -
> -	/* Shuffle low 32 into temp and high 32 into indicies2 */
> -	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> -		0x88);
> -	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> -		(__m128)*indicies2, 0xdd);
> -
> -	/* Calc node type and node addr */
> -	node_types = MM_ANDNOT(index_mask, temp);
> -	addr = MM_AND(index_mask, temp);
> -
> -	/*
> -	 * Calc addr for DFAs - addr = dfa_index + input_byte
> -	 */
> -
> -	/* mask for DFA type (0) nodes */
> -	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> -
> -	/* add input byte to DFA position */
> -	temp = MM_AND(temp, bytes);
> -	temp = MM_AND(temp, next_input);
> -	addr = MM_ADD32(addr, temp);
> -
> -	/*
> -	 * Calc addr for Range nodes -> range_index + range(input)
> -	 */
> -	node_types = MM_CMPEQ32(node_types, type_quad_range);
> -
> -	/*
> -	 * Calculate number of range boundaries that are less than the
> -	 * input value. Range boundaries for each node are in signed 8 bit,
> -	 * ordered from -128 to 127 in the indicies2 register.
> -	 * This is effectively a popcnt of bytes that are greater than the
> -	 * input byte.
> -	 */
> -
> -	/* shuffle input byte to all 4 positions of 32 bit value */
> -	temp = MM_SHUFFLE8(next_input, shuffle_input);
> -
> -	/* check ranges */
> -	temp = MM_CMPGT8(temp, *indicies2);
> -
> -	/* convert -1 to 1 (bytes greater than input byte */
> -	temp = MM_SIGN8(temp, temp);
> -
> -	/* horizontal add pairs of bytes into words */
> -	temp = MM_MADD8(temp, temp);
> -
> -	/* horizontal add pairs of words into dwords */
> -	temp = MM_MADD16(temp, ones_16);
> -
> -	/* mask to range type nodes */
> -	temp = MM_AND(temp, node_types);
> -
> -	/* add index into node position */
> -	return MM_ADD32(addr, temp);
> -}
> -
> -/*
> - * Process 4 transitions (in 2 SIMD registers) in parallel
> - */
> -static inline xmm_t
> -transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> -	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> -	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> -{
> -	xmm_t addr;
> -	uint64_t trans0, trans2;
> -
> -	 /* Calculate the address (array index) for all 4 transitions. */
> -
> -	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> -		bytes, type_quad_range, indicies1, indicies2);
> -
> -	 /* Gather 64 bit transitions and pack back into 2 registers. */
> -
> -	trans0 = trans[MM_CVT32(addr)];
> -
> -	/* get slot 2 */
> -
> -	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> -	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> -	trans2 = trans[MM_CVT32(addr)];
> -
> -	/* get slot 1 */
> -
> -	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> -	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> -	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> -
> -	/* get slot 3 */
> -
> -	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> -	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> -	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> -
> -	return MM_SRL32(next_input, 8);
> -}
> -
> -static inline void
> -acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
> -	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
> -	uint32_t data_num, uint32_t categories, const uint64_t *trans)
> -{
> -	flows->num_packets = 0;
> -	flows->started = 0;
> -	flows->trie = 0;
> -	flows->last_cmplt = NULL;
> -	flows->cmplt_array = cmplt;
> -	flows->total_packets = data_num;
> -	flows->categories = categories;
> -	flows->cmplt_size = cmplt_size;
> -	flows->data = data;
> -	flows->results = results;
> -	flows->trans = trans;
> -}
> -
> -/*
> - * Execute trie traversal with 8 traversals in parallel
> - */
> -static inline void
> -search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t total_packets, uint32_t categories)
> -{
> -	int n;
> -	struct acl_flow_data flows;
> -	uint64_t index_array[MAX_SEARCHES_SSE8];
> -	struct completion cmplt[MAX_SEARCHES_SSE8];
> -	struct parms parms[MAX_SEARCHES_SSE8];
> -	xmm_t input0, input1;
> -	xmm_t indicies1, indicies2, indicies3, indicies4;
> -
> -	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> -		total_packets, categories, ctx->trans_table);
> -
> -	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> -		cmplt[n].count = 0;
> -		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> -	}
> -
> -	/*
> -	 * indicies1 contains index_array[0,1]
> -	 * indicies2 contains index_array[2,3]
> -	 * indicies3 contains index_array[4,5]
> -	 * indicies4 contains index_array[6,7]
> -	 */
> -
> -	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> -	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> -
> -	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> -	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> -
> -	 /* Check for any matches. */
> -	acl_match_check_x4(0, ctx, parms, &flows,
> -		&indicies1, &indicies2, mm_match_mask.m);
> -	acl_match_check_x4(4, ctx, parms, &flows,
> -		&indicies3, &indicies4, mm_match_mask.m);
> -
> -	while (flows.started > 0) {
> -
> -		/* Gather 4 bytes of input data for each stream. */
> -		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> -			0);
> -		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> -			0);
> -
> -		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> -		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> -
> -		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> -		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> -
> -		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> -		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> -
> -		 /* Process the 4 bytes of input on each stream. */
> -
> -		input0 = transition4(mm_index_mask.m, input0,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		input1 = transition4(mm_index_mask.m, input1,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies3, &indicies4);
> -
> -		input0 = transition4(mm_index_mask.m, input0,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		input1 = transition4(mm_index_mask.m, input1,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies3, &indicies4);
> -
> -		input0 = transition4(mm_index_mask.m, input0,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		input1 = transition4(mm_index_mask.m, input1,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies3, &indicies4);
> -
> -		input0 = transition4(mm_index_mask.m, input0,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		input1 = transition4(mm_index_mask.m, input1,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies3, &indicies4);
> -
> -		 /* Check for any matches. */
> -		acl_match_check_x4(0, ctx, parms, &flows,
> -			&indicies1, &indicies2, mm_match_mask.m);
> -		acl_match_check_x4(4, ctx, parms, &flows,
> -			&indicies3, &indicies4, mm_match_mask.m);
> -	}
> -}
> -
> -/*
> - * Execute trie traversal with 4 traversals in parallel
> - */
> -static inline void
> -search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	 uint32_t *results, int total_packets, uint32_t categories)
> -{
> -	int n;
> -	struct acl_flow_data flows;
> -	uint64_t index_array[MAX_SEARCHES_SSE4];
> -	struct completion cmplt[MAX_SEARCHES_SSE4];
> -	struct parms parms[MAX_SEARCHES_SSE4];
> -	xmm_t input, indicies1, indicies2;
> -
> -	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> -		total_packets, categories, ctx->trans_table);
> -
> -	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> -		cmplt[n].count = 0;
> -		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> -	}
> -
> -	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> -	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> -
> -	/* Check for any matches. */
> -	acl_match_check_x4(0, ctx, parms, &flows,
> -		&indicies1, &indicies2, mm_match_mask.m);
> -
> -	while (flows.started > 0) {
> -
> -		/* Gather 4 bytes of input data for each stream. */
> -		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> -		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> -		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> -		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> -
> -		/* Process the 4 bytes of input on each stream. */
> -		input = transition4(mm_index_mask.m, input,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		 input = transition4(mm_index_mask.m, input,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		 input = transition4(mm_index_mask.m, input,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		 input = transition4(mm_index_mask.m, input,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		/* Check for any matches. */
> -		acl_match_check_x4(0, ctx, parms, &flows,
> -			&indicies1, &indicies2, mm_match_mask.m);
> -	}
> -}
> -
> -static inline xmm_t
> -transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> -	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> -	const uint64_t *trans, xmm_t *indicies1)
> -{
> -	uint64_t t;
> -	xmm_t addr, indicies2;
> -
> -	indicies2 = MM_XOR(ones_16, ones_16);
> -
> -	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> -		bytes, type_quad_range, indicies1, &indicies2);
> -
> -	/* Gather 64 bit transitions and pack 2 per register. */
> -
> -	t = trans[MM_CVT32(addr)];
> -
> -	/* get slot 1 */
> -	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> -	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> -
> -	return MM_SRL32(next_input, 8);
> -}
> -
> -/*
> - * Execute trie traversal with 2 traversals in parallel.
> - */
> -static inline void
> -search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t total_packets, uint32_t categories)
> -{
> -	int n;
> -	struct acl_flow_data flows;
> -	uint64_t index_array[MAX_SEARCHES_SSE2];
> -	struct completion cmplt[MAX_SEARCHES_SSE2];
> -	struct parms parms[MAX_SEARCHES_SSE2];
> -	xmm_t input, indicies;
> -
> -	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> -		total_packets, categories, ctx->trans_table);
> -
> -	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> -		cmplt[n].count = 0;
> -		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> -	}
> -
> -	indicies = MM_LOADU((xmm_t *) &index_array[0]);
> -
> -	/* Check for any matches. */
> -	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> -
> -	while (flows.started > 0) {
> -
> -		/* Gather 4 bytes of input data for each stream. */
> -		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> -		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> -
> -		/* Process the 4 bytes of input on each stream. */
> -
> -		input = transition2(mm_index_mask64.m, input,
> -			mm_shuffle_input64.m, mm_ones_16.m,
> -			mm_bytes64.m, mm_type_quad_range64.m,
> -			flows.trans, &indicies);
> -
> -		input = transition2(mm_index_mask64.m, input,
> -			mm_shuffle_input64.m, mm_ones_16.m,
> -			mm_bytes64.m, mm_type_quad_range64.m,
> -			flows.trans, &indicies);
> -
> -		input = transition2(mm_index_mask64.m, input,
> -			mm_shuffle_input64.m, mm_ones_16.m,
> -			mm_bytes64.m, mm_type_quad_range64.m,
> -			flows.trans, &indicies);
> -
> -		input = transition2(mm_index_mask64.m, input,
> -			mm_shuffle_input64.m, mm_ones_16.m,
> -			mm_bytes64.m, mm_type_quad_range64.m,
> -			flows.trans, &indicies);
> -
> -		/* Check for any matches. */
> -		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> -			mm_match_mask64.m);
> -	}
> -}
> -
> -/*
> - * When processing the transition, rather than using if/else
> - * construct, the offset is calculated for DFA and QRANGE and
> - * then conditionally added to the address based on node type.
> - * This is done to avoid branch mis-predictions. Since the
> - * offset is rather simple calculation it is more efficient
> - * to do the calculation and do a condition move rather than
> - * a conditional branch to determine which calculation to do.
> - */
> -static inline uint32_t
> -scan_forward(uint32_t input, uint32_t max)
> -{
> -	return (input == 0) ? max : rte_bsf32(input);
> -}
> -
> -static inline uint64_t
> -scalar_transition(const uint64_t *trans_table, uint64_t transition,
> -	uint8_t input)
> -{
> -	uint32_t addr, index, ranges, x, a, b, c;
> -
> -	/* break transition into component parts */
> -	ranges = transition >> (sizeof(index) * CHAR_BIT);
> -
> -	/* calc address for a QRANGE node */
> -	c = input * SCALAR_QRANGE_MULT;
> -	a = ranges | SCALAR_QRANGE_MIN;
> -	index = transition & ~RTE_ACL_NODE_INDEX;
> -	a -= (c & SCALAR_QRANGE_MASK);
> -	b = c & SCALAR_QRANGE_MIN;
> -	addr = transition ^ index;
> -	a &= SCALAR_QRANGE_MIN;
> -	a ^= (ranges ^ b) & (a ^ b);
> -	x = scan_forward(a, 32) >> 3;
> -	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
> -
> -	/* pickup next transition */
> -	transition = *(trans_table + addr);
> -	return transition;
> -}
> -
> -int
> -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t num, uint32_t categories)
> -{
> -	int n;
> -	uint64_t transition0, transition1;
> -	uint32_t input0, input1;
> -	struct acl_flow_data flows;
> -	uint64_t index_array[MAX_SEARCHES_SCALAR];
> -	struct completion cmplt[MAX_SEARCHES_SCALAR];
> -	struct parms parms[MAX_SEARCHES_SCALAR];
> -
> -	if (categories != 1 &&
> -		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> -		return -EINVAL;
> -
> -	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
> -		categories, ctx->trans_table);
> -
> -	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
> -		cmplt[n].count = 0;
> -		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> -	}
> -
> -	transition0 = index_array[0];
> -	transition1 = index_array[1];
> -
> -	while (flows.started > 0) {
> -
> -		input0 = GET_NEXT_4BYTES(parms, 0);
> -		input1 = GET_NEXT_4BYTES(parms, 1);
> -
> -		for (n = 0; n < 4; n++) {
> -			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
> -				transition0 = scalar_transition(flows.trans,
> -					transition0, (uint8_t)input0);
> -
> -			input0 >>= CHAR_BIT;
> -
> -			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
> -				transition1 = scalar_transition(flows.trans,
> -					transition1, (uint8_t)input1);
> -
> -			input1 >>= CHAR_BIT;
> -
> -		}
> -		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
> -			transition0 = acl_match_check_transition(transition0,
> -				0, ctx, parms, &flows);
> -			transition1 = acl_match_check_transition(transition1,
> -				1, ctx, parms, &flows);
> -
> -		}
> -	}
> -	return 0;
> -}
> -
> -int
> -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t num, uint32_t categories)
> -{
> -	if (categories != 1 &&
> -		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> -		return -EINVAL;
> -
> -	if (likely(num >= MAX_SEARCHES_SSE8))
> -		search_sse_8(ctx, data, results, num, categories);
> -	else if (num >= MAX_SEARCHES_SSE4)
> -		search_sse_4(ctx, data, results, num, categories);
> -	else
> -		search_sse_2(ctx, data, results, num, categories);
> -
> -	return 0;
> -}
> diff --git a/lib/librte_acl/acl_run.h b/lib/librte_acl/acl_run.h
> new file mode 100644
> index 0000000..c39650e
> --- /dev/null
> +++ b/lib/librte_acl/acl_run.h
> @@ -0,0 +1,220 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef	_ACL_RUN_H_
> +#define	_ACL_RUN_H_
> +
> +#include <rte_acl.h>
> +#include "acl_vect.h"
> +#include "acl.h"
> +
> +#define MAX_SEARCHES_SSE8	8
> +#define MAX_SEARCHES_SSE4	4
> +#define MAX_SEARCHES_SSE2	2
> +#define MAX_SEARCHES_SCALAR	2
> +
> +#define GET_NEXT_4BYTES(prm, idx)	\
> +	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
> +
> +
> +#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
> +
> +#define	SCALAR_QRANGE_MULT	0x01010101
> +#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
> +#define	SCALAR_QRANGE_MIN	0x80808080
> +
> +/*
> + * Structure to manage N parallel trie traversals.
> + * The runtime trie traversal routines can process 8, 4, or 2 tries
> + * in parallel. Each packet may require multiple trie traversals (up to 4).
> + * This structure is used to fill the slots (0 to n-1) for parallel processing
> + * with the trie traversals needed for each packet.
> + */
> +struct acl_flow_data {
> +	uint32_t            num_packets;
> +	/* number of packets processed */
> +	uint32_t            started;
> +	/* number of trie traversals in progress */
> +	uint32_t            trie;
> +	/* current trie index (0 to N-1) */
> +	uint32_t            cmplt_size;
> +	uint32_t            total_packets;
> +	uint32_t            categories;
> +	/* number of result categories per packet. */
> +	/* maximum number of packets to process */
> +	const uint64_t     *trans;
> +	const uint8_t     **data;
> +	uint32_t           *results;
> +	struct completion  *last_cmplt;
> +	struct completion  *cmplt_array;
> +};
> +
> +/*
> + * Structure to maintain running results for
> + * a single packet (up to 4 tries).
> + */
> +struct completion {
> +	uint32_t *results;                          /* running results. */
> +	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
> +	uint32_t  count;                            /* num of remaining tries */
> +	/* true for allocated struct */
> +} __attribute__((aligned(XMM_SIZE)));
> +
> +/*
> + * One parms structure for each slot in the search engine.
> + */
> +struct parms {
> +	const uint8_t              *data;
> +	/* input data for this packet */
> +	const uint32_t             *data_index;
> +	/* data indirection for this trie */
> +	struct completion          *cmplt;
> +	/* completion data for this packet */
> +};
> +
> +/*
> + * Define an global idle node for unused engine slots
> + */
> +static const uint32_t idle[UINT8_MAX + 1];
> +
> +/*
> + * Allocate a completion structure to manage the tries for a packet.
> + */
> +static inline struct completion *
> +alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
> +	uint32_t *results)
> +{
> +	uint32_t n;
> +
> +	for (n = 0; n < size; n++) {
> +
> +		if (p[n].count == 0) {
> +
> +			/* mark as allocated and set number of tries. */
> +			p[n].count = tries;
> +			p[n].results = results;
> +			return &(p[n]);
> +		}
> +	}
> +
> +	/* should never get here */
> +	return NULL;
> +}
> +
> +/*
> + * Resolve priority for a single result trie.
> + */
> +static inline void
> +resolve_single_priority(uint64_t transition, int n,
> +	const struct rte_acl_ctx *ctx, struct parms *parms,
> +	const struct rte_acl_match_results *p)
> +{
> +	if (parms[n].cmplt->count == ctx->num_tries ||
> +			parms[n].cmplt->priority[0] <=
> +			p[transition].priority[0]) {
> +
> +		parms[n].cmplt->priority[0] = p[transition].priority[0];
> +		parms[n].cmplt->results[0] = p[transition].results[0];
> +	}
> +}
> +
> +/*
> + * Routine to fill a slot in the parallel trie traversal array (parms) from
> + * the list of packets (flows).
> + */
> +static inline uint64_t
> +acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
> +	const struct rte_acl_ctx *ctx)
> +{
> +	uint64_t transition;
> +
> +	/* if there are any more packets to process */
> +	if (flows->num_packets < flows->total_packets) {
> +		parms[n].data = flows->data[flows->num_packets];
> +		parms[n].data_index = ctx->trie[flows->trie].data_index;
> +
> +		/* if this is the first trie for this packet */
> +		if (flows->trie == 0) {
> +			flows->last_cmplt = alloc_completion(flows->cmplt_array,
> +				flows->cmplt_size, ctx->num_tries,
> +				flows->results +
> +				flows->num_packets * flows->categories);
> +		}
> +
> +		/* set completion parameters and starting index for this slot */
> +		parms[n].cmplt = flows->last_cmplt;
> +		transition =
> +			flows->trans[parms[n].data[*parms[n].data_index++] +
> +			ctx->trie[flows->trie].root_index];
> +
> +		/*
> +		 * if this is the last trie for this packet,
> +		 * then setup next packet.
> +		 */
> +		flows->trie++;
> +		if (flows->trie >= ctx->num_tries) {
> +			flows->trie = 0;
> +			flows->num_packets++;
> +		}
> +
> +		/* keep track of number of active trie traversals */
> +		flows->started++;
> +
> +	/* no more tries to process, set slot to an idle position */
> +	} else {
> +		transition = ctx->idle;
> +		parms[n].data = (const uint8_t *)idle;
> +		parms[n].data_index = idle;
> +	}
> +	return transition;
> +}
> +
> +static inline void
> +acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
> +	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
> +	uint32_t data_num, uint32_t categories, const uint64_t *trans)
> +{
> +	flows->num_packets = 0;
> +	flows->started = 0;
> +	flows->trie = 0;
> +	flows->last_cmplt = NULL;
> +	flows->cmplt_array = cmplt;
> +	flows->total_packets = data_num;
> +	flows->categories = categories;
> +	flows->cmplt_size = cmplt_size;
> +	flows->data = data;
> +	flows->results = results;
> +	flows->trans = trans;
> +}
> +
> +#endif /* _ACL_RUN_H_ */
> diff --git a/lib/librte_acl/acl_run_scalar.c b/lib/librte_acl/acl_run_scalar.c
> new file mode 100644
> index 0000000..a59ff17
> --- /dev/null
> +++ b/lib/librte_acl/acl_run_scalar.c
> @@ -0,0 +1,198 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include "acl_run.h"
> +#include "acl_match_check.h"
> +
> +int
> +rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);
> +
> +/*
> + * Resolve priority for multiple results (scalar version).
> + * This consists of comparing the priority of the current traversal with the
> + * running set of results for the packet.
> + * For each result, keep a running array of the result (rule number) and
> + * its priority for each category.
> + */
> +static inline void
> +resolve_priority_scalar(uint64_t transition, int n,
> +	const struct rte_acl_ctx *ctx, struct parms *parms,
> +	const struct rte_acl_match_results *p, uint32_t categories)
> +{
> +	uint32_t i;
> +	int32_t *saved_priority;
> +	uint32_t *saved_results;
> +	const int32_t *priority;
> +	const uint32_t *results;
> +
> +	saved_results = parms[n].cmplt->results;
> +	saved_priority = parms[n].cmplt->priority;
> +
> +	/* results and priorities for completed trie */
> +	results = p[transition].results;
> +	priority = p[transition].priority;
> +
> +	/* if this is not the first completed trie */
> +	if (parms[n].cmplt->count != ctx->num_tries) {
> +		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
> +
> +			if (saved_priority[i] <= priority[i]) {
> +				saved_priority[i] = priority[i];
> +				saved_results[i] = results[i];
> +			}
> +			if (saved_priority[i + 1] <= priority[i + 1]) {
> +				saved_priority[i + 1] = priority[i + 1];
> +				saved_results[i + 1] = results[i + 1];
> +			}
> +			if (saved_priority[i + 2] <= priority[i + 2]) {
> +				saved_priority[i + 2] = priority[i + 2];
> +				saved_results[i + 2] = results[i + 2];
> +			}
> +			if (saved_priority[i + 3] <= priority[i + 3]) {
> +				saved_priority[i + 3] = priority[i + 3];
> +				saved_results[i + 3] = results[i + 3];
> +			}
> +		}
> +	} else {
> +		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
> +			saved_priority[i] = priority[i];
> +			saved_priority[i + 1] = priority[i + 1];
> +			saved_priority[i + 2] = priority[i + 2];
> +			saved_priority[i + 3] = priority[i + 3];
> +
> +			saved_results[i] = results[i];
> +			saved_results[i + 1] = results[i + 1];
> +			saved_results[i + 2] = results[i + 2];
> +			saved_results[i + 3] = results[i + 3];
> +		}
> +	}
> +}
> +
> +/*
> + * When processing the transition, rather than using an if/else
> + * construct, the offset is calculated for both DFA and QRANGE and
> + * then conditionally added to the address based on node type.
> + * This is done to avoid branch mis-predictions. Since the
> + * offset is a rather simple calculation, it is more efficient
> + * to do the calculation and a conditional move than to take
> + * a conditional branch to determine which calculation to do.
> + */
> +static inline uint32_t
> +scan_forward(uint32_t input, uint32_t max)
> +{
> +	return (input == 0) ? max : rte_bsf32(input);
> +}
> +
> +static inline uint64_t
> +scalar_transition(const uint64_t *trans_table, uint64_t transition,
> +	uint8_t input)
> +{
> +	uint32_t addr, index, ranges, x, a, b, c;
> +
> +	/* break transition into component parts */
> +	ranges = transition >> (sizeof(index) * CHAR_BIT);
> +
> +	/* calc address for a QRANGE node */
> +	c = input * SCALAR_QRANGE_MULT;
> +	a = ranges | SCALAR_QRANGE_MIN;
> +	index = transition & ~RTE_ACL_NODE_INDEX;
> +	a -= (c & SCALAR_QRANGE_MASK);
> +	b = c & SCALAR_QRANGE_MIN;
> +	addr = transition ^ index;
> +	a &= SCALAR_QRANGE_MIN;
> +	a ^= (ranges ^ b) & (a ^ b);
> +	x = scan_forward(a, 32) >> 3;
> +	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
> +
> +	/* pickup next transition */
> +	transition = *(trans_table + addr);
> +	return transition;
> +}
> +
> +int
> +rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t num, uint32_t categories)
> +{
> +	int n;
> +	uint64_t transition0, transition1;
> +	uint32_t input0, input1;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SCALAR];
> +	struct completion cmplt[MAX_SEARCHES_SCALAR];
> +	struct parms parms[MAX_SEARCHES_SCALAR];
> +
> +	if (categories != 1 &&
> +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> +		return -EINVAL;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
> +		categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	transition0 = index_array[0];
> +	transition1 = index_array[1];
> +
> +	while (flows.started > 0) {
> +
> +		input0 = GET_NEXT_4BYTES(parms, 0);
> +		input1 = GET_NEXT_4BYTES(parms, 1);
> +
> +		for (n = 0; n < 4; n++) {
> +			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
> +				transition0 = scalar_transition(flows.trans,
> +					transition0, (uint8_t)input0);
> +
> +			input0 >>= CHAR_BIT;
> +
> +			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
> +				transition1 = scalar_transition(flows.trans,
> +					transition1, (uint8_t)input1);
> +
> +			input1 >>= CHAR_BIT;
> +
> +		}
> +		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
> +			transition0 = acl_match_check(transition0,
> +				0, ctx, parms, &flows, resolve_priority_scalar);
> +			transition1 = acl_match_check(transition1,
> +				1, ctx, parms, &flows, resolve_priority_scalar);
> +
> +		}
> +	}
> +	return 0;
> +}
> diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
> new file mode 100644
> index 0000000..3f5c721
> --- /dev/null
> +++ b/lib/librte_acl/acl_run_sse.c
> @@ -0,0 +1,627 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include "acl_run.h"
> +#include "acl_match_check.h"
> +
> +enum {
> +	SHUFFLE32_SLOT1 = 0xe5,
> +	SHUFFLE32_SLOT2 = 0xe6,
> +	SHUFFLE32_SLOT3 = 0xe7,
> +	SHUFFLE32_SWAP64 = 0x4e,
> +};
> +
> +static const rte_xmm_t mm_type_quad_range = {
> +	.u32 = {
> +		RTE_ACL_NODE_QRANGE,
> +		RTE_ACL_NODE_QRANGE,
> +		RTE_ACL_NODE_QRANGE,
> +		RTE_ACL_NODE_QRANGE,
> +	},
> +};
> +
> +static const rte_xmm_t mm_type_quad_range64 = {
> +	.u32 = {
> +		RTE_ACL_NODE_QRANGE,
> +		RTE_ACL_NODE_QRANGE,
> +		0,
> +		0,
> +	},
> +};
> +
> +static const rte_xmm_t mm_shuffle_input = {
> +	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
> +};
> +
> +static const rte_xmm_t mm_shuffle_input64 = {
> +	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
> +};
> +
> +static const rte_xmm_t mm_ones_16 = {
> +	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
> +};
> +
> +static const rte_xmm_t mm_bytes = {
> +	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
> +};
> +
> +static const rte_xmm_t mm_bytes64 = {
> +	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
> +};
> +
> +static const rte_xmm_t mm_match_mask = {
> +	.u32 = {
> +		RTE_ACL_NODE_MATCH,
> +		RTE_ACL_NODE_MATCH,
> +		RTE_ACL_NODE_MATCH,
> +		RTE_ACL_NODE_MATCH,
> +	},
> +};
> +
> +static const rte_xmm_t mm_match_mask64 = {
> +	.u32 = {
> +		RTE_ACL_NODE_MATCH,
> +		0,
> +		RTE_ACL_NODE_MATCH,
> +		0,
> +	},
> +};
> +
> +static const rte_xmm_t mm_index_mask = {
> +	.u32 = {
> +		RTE_ACL_NODE_INDEX,
> +		RTE_ACL_NODE_INDEX,
> +		RTE_ACL_NODE_INDEX,
> +		RTE_ACL_NODE_INDEX,
> +	},
> +};
> +
> +static const rte_xmm_t mm_index_mask64 = {
> +	.u32 = {
> +		RTE_ACL_NODE_INDEX,
> +		RTE_ACL_NODE_INDEX,
> +		0,
> +		0,
> +	},
> +};
> +
> +
> +/*
> + * Resolve priority for multiple results (sse version).
> + * This consists of comparing the priority of the current traversal with the
> + * running set of results for the packet.
> + * For each result, keep a running array of the result (rule number) and
> + * its priority for each category.
> + */
> +static inline void
> +resolve_priority_sse(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> +	struct parms *parms, const struct rte_acl_match_results *p,
> +	uint32_t categories)
> +{
> +	uint32_t x;
> +	xmm_t results, priority, results1, priority1, selector;
> +	xmm_t *saved_results, *saved_priority;
> +
> +	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
> +
> +		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
> +		saved_priority =
> +			(xmm_t *)(&parms[n].cmplt->priority[x]);
> +
> +		/* get results and priorities for completed trie */
> +		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
> +		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
> +
> +		/* if this is not the first completed trie */
> +		if (parms[n].cmplt->count != ctx->num_tries) {
> +
> +			/* get running best results and their priorities */
> +			results1 = MM_LOADU(saved_results);
> +			priority1 = MM_LOADU(saved_priority);
> +
> +			/* select results that are highest priority */
> +			selector = MM_CMPGT32(priority1, priority);
> +			results = MM_BLENDV8(results, results1, selector);
> +			priority = MM_BLENDV8(priority, priority1, selector);
> +		}
> +
> +		/* save running best results and their priorities */
> +		MM_STOREU(saved_results, results);
> +		MM_STOREU(saved_priority, priority);
> +	}
> +}
> +
> +/*
> + * Extract transitions from an XMM register and check for any matches
> + */
> +static void
> +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> +	struct parms *parms, struct acl_flow_data *flows)
> +{
> +	uint64_t transition1, transition2;
> +
> +	/* extract transition from low 64 bits. */
> +	transition1 = MM_CVT64(*indicies);
> +
> +	/* extract transition from high 64 bits. */
> +	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> +	transition2 = MM_CVT64(*indicies);
> +
> +	transition1 = acl_match_check(transition1, slot, ctx,
> +		parms, flows, resolve_priority_sse);
> +	transition2 = acl_match_check(transition2, slot + 1, ctx,
> +		parms, flows, resolve_priority_sse);
> +
> +	/* update indicies with new transitions. */
> +	*indicies = MM_SET64(transition2, transition1);
> +}
> +
> +/*
> + * Check for a match in 2 transitions (contained in SSE register)
> + */
> +static inline void
> +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> +	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> +{
> +	xmm_t temp;
> +
> +	temp = MM_AND(match_mask, *indicies);
> +	while (!MM_TESTZ(temp, temp)) {
> +		acl_process_matches(indicies, slot, ctx, parms, flows);
> +		temp = MM_AND(match_mask, *indicies);
> +	}
> +}
> +
> +/*
> + * Check for any match in 4 transitions (contained in 2 SSE registers)
> + */
> +static inline void
> +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> +	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> +	xmm_t match_mask)
> +{
> +	xmm_t temp;
> +
> +	/* put low 32 bits of each transition into one register */
> +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> +		0x88);
> +	/* test for match node */
> +	temp = MM_AND(match_mask, temp);
> +
> +	while (!MM_TESTZ(temp, temp)) {
> +		acl_process_matches(indicies1, slot, ctx, parms, flows);
> +		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> +
> +		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> +					(__m128)*indicies2,
> +					0x88);
> +		temp = MM_AND(match_mask, temp);
> +	}
> +}
> +
> +/*
> + * Calculate the address of the next transition for
> + * all types of nodes. Note that only DFA nodes and range
> + * nodes actually transition to another node. Match
> + * nodes don't move.
> + */
> +static inline xmm_t
> +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> +	xmm_t *indicies1, xmm_t *indicies2)
> +{
> +	xmm_t addr, node_types, temp;
> +
> +	/*
> +	 * Note that no transition is done for a match
> +	 * node and therefore a stream freezes when
> +	 * it reaches a match.
> +	 */
> +
> +	/* Shuffle low 32 into temp and high 32 into indicies2 */
> +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> +		0x88);
> +	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> +		(__m128)*indicies2, 0xdd);
> +
> +	/* Calc node type and node addr */
> +	node_types = MM_ANDNOT(index_mask, temp);
> +	addr = MM_AND(index_mask, temp);
> +
> +	/*
> +	 * Calc addr for DFAs - addr = dfa_index + input_byte
> +	 */
> +
> +	/* mask for DFA type (0) nodes */
> +	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> +
> +	/* add input byte to DFA position */
> +	temp = MM_AND(temp, bytes);
> +	temp = MM_AND(temp, next_input);
> +	addr = MM_ADD32(addr, temp);
> +
> +	/*
> +	 * Calc addr for Range nodes -> range_index + range(input)
> +	 */
> +	node_types = MM_CMPEQ32(node_types, type_quad_range);
> +
> +	/*
> +	 * Calculate number of range boundaries that are less than the
> +	 * input value. Range boundaries for each node are in signed 8 bit,
> +	 * ordered from -128 to 127 in the indicies2 register.
> +	 * This is effectively a popcnt of bytes that are greater than the
> +	 * input byte.
> +	 */
> +
> +	/* shuffle input byte to all 4 positions of 32 bit value */
> +	temp = MM_SHUFFLE8(next_input, shuffle_input);
> +
> +	/* check ranges */
> +	temp = MM_CMPGT8(temp, *indicies2);
> +
> +	/* convert -1 to 1 (bytes greater than input byte) */
> +	temp = MM_SIGN8(temp, temp);
> +
> +	/* horizontal add pairs of bytes into words */
> +	temp = MM_MADD8(temp, temp);
> +
> +	/* horizontal add pairs of words into dwords */
> +	temp = MM_MADD16(temp, ones_16);
> +
> +	/* mask to range type nodes */
> +	temp = MM_AND(temp, node_types);
> +
> +	/* add index into node position */
> +	return MM_ADD32(addr, temp);
> +}
> +
> +/*
> + * Process 4 transitions (in 2 SIMD registers) in parallel
> + */
> +static inline xmm_t
> +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> +	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> +{
> +	xmm_t addr;
> +	uint64_t trans0, trans2;
> +
> +	 /* Calculate the address (array index) for all 4 transitions. */
> +
> +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> +		bytes, type_quad_range, indicies1, indicies2);
> +
> +	 /* Gather 64 bit transitions and pack back into 2 registers. */
> +
> +	trans0 = trans[MM_CVT32(addr)];
> +
> +	/* get slot 2 */
> +
> +	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> +	trans2 = trans[MM_CVT32(addr)];
> +
> +	/* get slot 1 */
> +
> +	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> +
> +	/* get slot 3 */
> +
> +	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> +	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> +
> +	return MM_SRL32(next_input, 8);
> +}
> +
> +/*
> + * Execute trie traversal with 8 traversals in parallel
> + */
> +static inline int
> +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> +{
> +	int n;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SSE8];
> +	struct completion cmplt[MAX_SEARCHES_SSE8];
> +	struct parms parms[MAX_SEARCHES_SSE8];
> +	xmm_t input0, input1;
> +	xmm_t indicies1, indicies2, indicies3, indicies4;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> +		total_packets, categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	/*
> +	 * indicies1 contains index_array[0,1]
> +	 * indicies2 contains index_array[2,3]
> +	 * indicies3 contains index_array[4,5]
> +	 * indicies4 contains index_array[6,7]
> +	 */
> +
> +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> +
> +	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> +	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> +
> +	 /* Check for any matches. */
> +	acl_match_check_x4(0, ctx, parms, &flows,
> +		&indicies1, &indicies2, mm_match_mask.m);
> +	acl_match_check_x4(4, ctx, parms, &flows,
> +		&indicies3, &indicies4, mm_match_mask.m);
> +
> +	while (flows.started > 0) {
> +
> +		/* Gather 4 bytes of input data for each stream. */
> +		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> +			0);
> +		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> +			0);
> +
> +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> +
> +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> +
> +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> +
> +		 /* Process the 4 bytes of input on each stream. */
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		 /* Check for any matches. */
> +		acl_match_check_x4(0, ctx, parms, &flows,
> +			&indicies1, &indicies2, mm_match_mask.m);
> +		acl_match_check_x4(4, ctx, parms, &flows,
> +			&indicies3, &indicies4, mm_match_mask.m);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Execute trie traversal with 4 traversals in parallel
> + */
> +static inline int
> +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	 uint32_t *results, int total_packets, uint32_t categories)
> +{
> +	int n;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SSE4];
> +	struct completion cmplt[MAX_SEARCHES_SSE4];
> +	struct parms parms[MAX_SEARCHES_SSE4];
> +	xmm_t input, indicies1, indicies2;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> +		total_packets, categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> +
> +	/* Check for any matches. */
> +	acl_match_check_x4(0, ctx, parms, &flows,
> +		&indicies1, &indicies2, mm_match_mask.m);
> +
> +	while (flows.started > 0) {
> +
> +		/* Gather 4 bytes of input data for each stream. */
> +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> +
> +		/* Process the 4 bytes of input on each stream. */
> +		input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		 input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		 input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		 input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		/* Check for any matches. */
> +		acl_match_check_x4(0, ctx, parms, &flows,
> +			&indicies1, &indicies2, mm_match_mask.m);
> +	}
> +
> +	return 0;
> +}
> +
> +static inline xmm_t
> +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> +	const uint64_t *trans, xmm_t *indicies1)
> +{
> +	uint64_t t;
> +	xmm_t addr, indicies2;
> +
> +	indicies2 = MM_XOR(ones_16, ones_16);
> +
> +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> +		bytes, type_quad_range, indicies1, &indicies2);
> +
> +	/* Gather 64 bit transitions and pack 2 per register. */
> +
> +	t = trans[MM_CVT32(addr)];
> +
> +	/* get slot 1 */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> +
> +	return MM_SRL32(next_input, 8);
> +}
> +
> +/*
> + * Execute trie traversal with 2 traversals in parallel.
> + */
> +static inline int
> +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> +{
> +	int n;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SSE2];
> +	struct completion cmplt[MAX_SEARCHES_SSE2];
> +	struct parms parms[MAX_SEARCHES_SSE2];
> +	xmm_t input, indicies;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> +		total_packets, categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	indicies = MM_LOADU((xmm_t *) &index_array[0]);
> +
> +	/* Check for any matches. */
> +	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> +
> +	while (flows.started > 0) {
> +
> +		/* Gather 4 bytes of input data for each stream. */
> +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> +
> +		/* Process the 4 bytes of input on each stream. */
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		/* Check for any matches. */
> +		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> +			mm_match_mask64.m);
> +	}
> +
> +	return 0;
> +}
> +
> +int
> +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t num, uint32_t categories)
> +{
> +	if (categories != 1 &&
> +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> +		return -EINVAL;
> +
> +	if (likely(num >= MAX_SEARCHES_SSE8))
> +		return search_sse_8(ctx, data, results, num, categories);
> +	else if (num >= MAX_SEARCHES_SSE4)
> +		return search_sse_4(ctx, data, results, num, categories);
> +	else
> +		return search_sse_2(ctx, data, results, num, categories);
> +}
> diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> index 7c288bd..b9173c1 100644
> --- a/lib/librte_acl/rte_acl.c
> +++ b/lib/librte_acl/rte_acl.c
> @@ -38,6 +38,52 @@
> 
>  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> 
> +typedef int (*rte_acl_classify_t)
> +(const struct rte_acl_ctx *, const uint8_t **, uint32_t *, uint32_t, uint32_t);
> +
> +extern int
> +rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);
> +
> +/* by default, use the always available scalar code path. */
> +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;

Why not 'static'?
I thought you'd like to hide it from the external world.
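I.e. presumably just (sketch):

static rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;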

> +
> +void rte_acl_select_classify(enum acl_classify_alg alg)
> +{
> +
> +	switch(alg)
> +	{
> +		case ACL_CLASSIFY_DEFAULT:
> +		case ACL_CLASSIFY_SCALAR:
> +			rte_acl_default_classify = rte_acl_classify_scalar;
> +			break;
> +		case ACL_CLASSIFY_SSE:
> +			rte_acl_default_classify = rte_acl_classify_sse;
> +			break;
> +	}
> +
> +}

As this is an init-phase function, I suppose we can add a check that alg has a valid (supported) value, and return some error code if it does not.
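E.g. a minimal sketch of what I mean (the exact error code is just my assumption):

int rte_acl_select_classify(enum acl_classify_alg alg)
{
	switch (alg) {
	case ACL_CLASSIFY_DEFAULT:
	case ACL_CLASSIFY_SCALAR:
		rte_acl_default_classify = rte_acl_classify_scalar;
		return 0;
	case ACL_CLASSIFY_SSE:
		rte_acl_default_classify = rte_acl_classify_sse;
		return 0;
	default:
		/* unknown or unsupported algorithm requested */
		return -EINVAL;
	}
}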

> +
> +static void __attribute__((constructor))
> +rte_acl_init(void)
> +{
> +	enum acl_classify_alg alg = ACL_CLASSIFY_DEFAULT;
> +
> +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> +		alg = ACL_CLASSIFY_SSE;
> +
> +	rte_acl_select_classify(alg);
> +}
> +
> +inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
> +                            const uint8_t **data,
> +                            uint32_t *results, uint32_t num,
> +                            uint32_t categories)
> +{
> +	return rte_acl_default_classify(ctx, data, results, num, categories);
> +}
> +
> +
>  struct rte_acl_ctx *
>  rte_acl_find_existing(const char *name)
>  {
> diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
> index afc0f69..650b306 100644
> --- a/lib/librte_acl/rte_acl.h
> +++ b/lib/librte_acl/rte_acl.h
> @@ -267,6 +267,9 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
>   * RTE_ACL_RESULTS_MULTIPLIER and can't be bigger than RTE_ACL_MAX_CATEGORIES.
>   * If more than one rule is applicable for given input buffer and
>   * given category, then rule with highest priority will be returned as a match.
> + * Note, that this function could be run only on CPUs with SSE4.1 support.
> + * It is up to the caller to make sure that this function is only invoked on
> + * a machine that supports SSE4.1 ISA.
>   * Note, that it is a caller responsibility to ensure that input parameters
>   * are valid and point to correct memory locations.
>   *
> @@ -286,9 +289,10 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
>   * @return
>   *   zero on successful completion.
>   *   -EINVAL for incorrect arguments.
> + *   -ENOTSUP for unsupported platforms.

Please remove the line above: current implementation doesn't return ENOTSUP
(I think that was left from v1).

>   */
>  int
> -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
>  	uint32_t *results, uint32_t num, uint32_t categories);
> 
>  /**
> @@ -323,9 +327,23 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
>   *   zero on successful completion.
>   *   -EINVAL for incorrect arguments.
>   */
> -int
> -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t num, uint32_t categories);


As I said above we'd better keep it.  

> +
> +enum acl_classify_alg {
> +	ACL_CLASSIFY_DEFAULT = 0,
> +	ACL_CLASSIFY_SCALAR = 1,
> +	ACL_CLASSIFY_SSE = 2,
> +};

As a nit: as this enum is part of the public API, I think it is better to add the rte_ prefix: enum rte_acl_classify_alg

> +
> +extern inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
> +				   const uint8_t **data,
> +				   uint32_t *results, uint32_t num,
> +				   uint32_t categories);

Again as a nit: here and everywhere can we keep the same style throughout the whole DPDK - function name starting on a new line:
extern int
rte_acl_classify(...);

> +/**
> + * Analyze the ISA of the current CPU and point rte_acl_default_classify
> + * to the highest applicable version of the classify function.
> + */
> +extern void
> +rte_acl_select_classify(enum acl_classify_alg alg);
> 
>  /**
>   * Dump an ACL context structure to the console.
> --
> 1.9.3

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
  2014-08-25 16:30   ` Ananyev, Konstantin
@ 2014-08-26 17:44     ` Neil Horman
  2014-08-27 11:25       ` Ananyev, Konstantin
  0 siblings, 1 reply; 21+ messages in thread
From: Neil Horman @ 2014-08-26 17:44 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

On Mon, Aug 25, 2014 at 04:30:05PM +0000, Ananyev, Konstantin wrote:
> Hi Neil,
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Thursday, August 21, 2014 9:15 PM
> > To: dev@dpdk.org
> > Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com; Neil Horman
> > Subject: [PATCHv3] librte_acl make it build/work for 'default' target
> > 
> > Make ACL library to build/work on 'default' architecture:
> > - make rte_acl_classify_scalar really scalar
> >  (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> > - Provide two versions of rte_acl_classify code path:
> >   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> >   and upper, return -ENOTSUP on lower arch.
> >   rte_acl_classify_scalar() - a slower version, but could be build and used
> >   on all systems.
> > - keep common code shared between these two codepaths.
> > 
> > v2 chages:
> >  run-time selection of most appropriate code-path for given ISA.
> >  By default the highest supprted one is selected.
> >  User can still override that selection by manually assigning new value to
> >  the global function pointer rte_acl_default_classify.
> >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> >  points to.
> > 
> 
> I see you decided not to wait for me and fix everything by yourself :)
> 
Yeah, sorry, I'm getting pinged about enabling these features in Fedora, and it
had been about 2 weeks, so I figured I'd just take care of it.

> > V3 Changes
> >  Updated classify pointer to be a function so as to better preserve ABI
> 
> As I said in my previous mail it generates extra jump...
> Though from numbers I got the performance impact is negligible: < 1%.
> So I suppose, I don't have a good enough reason to object :)
> 
Yeah, I just don't see a way around it.  I was hoping that the compiler would
have been smart enough to see that the rte_acl_classify function was small and
in-linable, but apparently it won't do that.  As you note however the
performance change is minor (I'm guessing within a standard deviation of your
results).

> Though I still think we had better keep rte_acl_classify_scalar() publicly available (same as we do for rte_acl_classify_sse()):
> First of all, rte_acl_classify_scalar() is already part of our public API.
> Also, as I remember, one of the customers explicitly asked for scalar version and they planned to call it directly.
> Plus using rte_acl_select_classify() to always switch between implementations is not always handy:

I'm not exactly opposed to this, though it seems odd to me that a user might
want to call a particular version of the classifier directly.  But I certainly
can't predict everything a consumer wants to do.  If we really need to keep it
public then, it begs the question, is providing a generic entry point even
worthwhile?  Is it just as easy to expose the scalar/sse and any future versions
directly so the application can just embody the intelligence to select the best
path?  That saves us having to maintain another API point.  I can go with
consensus on that.

> -  it is global, which means that we can't simultaneously use classify_scalar() and classify_sse() for 2 different ACL contexts.  
> - to properly support such switching we then will need to support something like (see app/test/test_acl.c below):
>   old_alg = rte_acl_get_classify();
>   rte_acl_select_classify(new_alg);
>   ...
>   rte_acl_select_classify(old_alg); 
>   
We could attach the classification method to the acl context, so each
rte_acl_ctx can point to whatever classifier function it wants to.  That would
remove the global issues you point out above.  Or alternatively we can just not
provide a generic entry point and let each user select a specific function.
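For the first option, something like the following is what I have in mind (very rough sketch; the field and setter names are made up):

/* new member inside struct rte_acl_ctx */
	rte_acl_classify_t  classify;   /* per-context classify method */

/* hypothetical per-context setter, replacing the global switch */
int
rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum acl_classify_alg alg);

/* the generic entry point then just dispatches through the context */
int
rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	return ctx->classify(ctx, data, results, num, categories);
}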


> >  REmoved macro definitions for match check functions to make them static inline
> 
> More comments inlined below.
>snip> 
> > 
> >  	/* make a quick check for scalar */
> > -	ret = rte_acl_classify_scalar(acx, data, results,
> > +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> > +	ret = rte_acl_classify(acx, data, results,
> >  			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
> 
> 
> As I said above, that doesn't seem correct: we set rte_acl_default_classify = rte_acl_classify_scalar and never restore it back to the original value.
> To support it properly, we need to:
> old_alg = rte_acl_get_classify();
>  rte_acl_select_classify(new_alg);
>  ...
>  rte_acl_select_classify(old_alg);
> 
So, for the purposes of this test application, I don't see that as being needed.
Every call to rte_acl_classify is preceded by a setting of the classifier
function, so you're safe.  If you're concerned about other processes using the
dpdk library at the same time, you're still safe, as despite being a global
variable, data pages in a DSO are Copy on Write, so each process gets their own
copy of the global variable.

Multiple threads within the same process are problematic, I agree, and that's
solvable with the per-acl-context mechanism that I described above, though that
shouldn't be needed here as this seems to be a single threaded program.

> Make all this just to keep UT valid seems like a big hassle to me.
> So I said above - probably better just leave it to call rte_acl_classify_scalar() directly.
> 
That works for me too, though the per-context mechanism seems kind of nice to
me.  Let me know what you prefer.

><snip>
> > 
> > diff --git a/lib/librte_acl/acl_match_check.h b/lib/librte_acl/acl_match_check.h
> > new file mode 100644
> > index 0000000..4dc1982
> > --- /dev/null
> > +++ b/lib/librte_acl/acl_match_check.h
> 
> As a nit: we probably don't need a special header just for one function and can place it inside acl_run.h.
> 
Agreed, I can move that to acl_run.h.

><snip>
> > + */
> > +static inline uint64_t
> > +acl_match_check(uint64_t transition, int slot,
> > +	const struct rte_acl_ctx *ctx, struct parms *parms,
> > +	struct acl_flow_data *flows, void (*resolve_priority)(
> > +	uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> > +	struct parms *parms, const struct rte_acl_match_results *p,
> > +	uint32_t categories))
> 
> Ugh, that's really hard to read.
> Can we create a typedef for resolve_priority function type:
> typedef void (*resolve_priority_t)(uint64_t, int,
>         const struct rte_acl_ctx *ctx, struct parms *,
>         const struct rte_acl_match_results *, uint32_t);
> And use it here?
> 
Sure, I'm fine with doing that.
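With that typedef in place, the prototype would presumably collapse to something like:

typedef void (*resolve_priority_t)(uint64_t, int,
	const struct rte_acl_ctx *, struct parms *,
	const struct rte_acl_match_results *, uint32_t);

static inline uint64_t
acl_match_check(uint64_t transition, int slot,
	const struct rte_acl_ctx *ctx, struct parms *parms,
	struct acl_flow_data *flows, resolve_priority_t resolve_priority);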

><snip>
> > +
> > +/* by default, use always avaialbe scalar code path. */
> > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> 
> Why not 'static'?
> I thought you'd like to hide it  from external world.
> 
Doh!  I didn't do the one thing that I really meant to do.  I removed it from
the header file but I forgot to declare the variable static.  I'll fix that.

> > +
> > +void rte_acl_select_classify(enum acl_classify_alg alg)
> > +{
> > +
> > +	switch(alg)
> > +	{
> > +		case ACL_CLASSIFY_DEFAULT:
> > +		case ACL_CLASSIFY_SCALAR:
> > +			rte_acl_default_classify = rte_acl_classify_scalar;
> > +			break;
> > +		case ACL_CLASSIFY_SSE:
> > +			rte_acl_default_classify = rte_acl_classify_sse;
> > +			break;
> > +	}
> > +
> > +}
> 
> As this is init phase function, I suppose we can add check that alg has a valid(supported) value, and return some error as return value, if not.  
> 
Not sure I follow what you're saying above, are you suggesting that we add a
rte_cpu_get_flag_enabled check to rte_acl_select_classify above?

><snip>
> >   *
> > @@ -286,9 +289,10 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
> >   * @return
> >   *   zero on successful completion.
> >   *   -EINVAL for incorrect arguments.
> > + *   -ENOTSUP for unsupported platforms.
> 
> Please remove the line above: current implementation doesn't return ENOTSUP
> (I think that was left from v1).
> 
Yup, probably was.  I'll remove it.

> >   */
> >  int
> > -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> >  	uint32_t *results, uint32_t num, uint32_t categories);
> > 
> >  /**
> > @@ -323,9 +327,23 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> >   *   zero on successful completion.
> >   *   -EINVAL for incorrect arguments.
> >   */
> > -int
> > -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > -	uint32_t *results, uint32_t num, uint32_t categories);
> 
> 
> As I said above we'd better keep it.  
> 
Ok, can do.

> > +
> > +enum acl_classify_alg {
> > +	ACL_CLASSIFY_DEFAULT = 0,
> > +	ACL_CLASSIFY_SCALAR = 1,
> > +	ACL_CLASSIFY_SSE = 2,
> > +};
> 
> As a nit: as this emum is part of public API, I think it is better to add rte_ prefix: enum rte_acl_classify_alg
> 
Sure, done.

> > +
> > +extern inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
> > +				   const uint8_t **data,
> > +				   uint32_t *results, uint32_t num,
> > +				   uint32_t categories);
> 
> Again as a nit: here and everywhere can we keep same style through the whole DPDK - function name from the new line:
> extern nt
> rte_acl_classify(...);
> 
Ok

I'll produce another version based on your feedback regarding the
per-context-classifier method vs. just removing the generic classifier.

Regards
Neil

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
  2014-08-26 17:44     ` Neil Horman
@ 2014-08-27 11:25       ` Ananyev, Konstantin
  2014-08-27 18:56         ` Neil Horman
  0 siblings, 1 reply; 21+ messages in thread
From: Ananyev, Konstantin @ 2014-08-27 11:25 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Tuesday, August 26, 2014 6:45 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> 
> On Mon, Aug 25, 2014 at 04:30:05PM +0000, Ananyev, Konstantin wrote:
> > Hi Neil,
> >
> > > -----Original Message-----
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Thursday, August 21, 2014 9:15 PM
> > > To: dev@dpdk.org
> > > Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com; Neil Horman
> > > Subject: [PATCHv3] librte_acl make it build/work for 'default' target
> > >
> > > Make ACL library to build/work on 'default' architecture:
> > > - make rte_acl_classify_scalar really scalar
> > >  (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> > > - Provide two versions of rte_acl_classify code path:
> > >   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> > >   and upper, return -ENOTSUP on lower arch.
> > >   rte_acl_classify_scalar() - a slower version, but could be build and used
> > >   on all systems.
> > > - keep common code shared between these two codepaths.
> > >
> > > v2 chages:
> > >  run-time selection of most appropriate code-path for given ISA.
> > >  By default the highest supprted one is selected.
> > >  User can still override that selection by manually assigning new value to
> > >  the global function pointer rte_acl_default_classify.
> > >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > >  points to.
> > >
> >
> > I see you decided not to wait for me and fix everything by yourself :)
> >
> Yeah, sorry, I'm getting pinged about enabling these features in Fedora, and it
> had been about 2 weeks, so I figured I'd just take care of it.

No worries. I admit that it was a long delay from my side.

> 
> > > V3 Changes
> > >  Updated classify pointer to be a function so as to better preserve ABI
> >
> > As I said in my previous mail it generates extra jump...
> > Though from numbers I got the performance impact is negligible: < 1%.
> > So I suppose, I don't have a good enough reason to object :)
> >
> Yeah, I just don't see a way around it.  I was hoping that the compiler would
> have been smart enough to see that the rte_acl_classify function was small and
> in-linable, but apparently it won't do that.  As you note however the
> performance change is minor (I'm guessing within a standard deviation of your
> results).
> 
> > Though I still think we better keep  rte_acl_classify_scalar() publically available (same as we do for rte acl_classify_sse()):
> > First of all keep  rte_acl_classify_scalar() is already part of our public API.
> > Also, as I remember, one of the customers explicitly asked for scalar version and they planned to call it directly.
> > Plus using rte_acl_select_classify() to always switch between implementations is not always handy:
> 
> I'm not exactly opposed to this, though it seems odd to me that a user might
> want to call a particular version of the classifier directly.  But I certainly
> can't predict everything a consumer wants to do.  If we really need to keep it
> public then, it begs the question, is providing a generic entry point even
> worthwhile?  Is it just as easy to expose the scalar/sse and any future versions
> directly so the application can just embody the intellegence to select the best
> path?  That saves us having to maintain another API point.  I can go with
> consensus on that.
> 
> > -  it is global, which means that we can't simultaneously use classify_scalar() and classify_sse() for 2 different ACL contexts.
> > - to properly support such switching we then will need to support something like (see app/test/test_acl.c below):
> >   old_alg = rte_acl_get_classify();
> >   rte_acl_select_classify(new_alg);
> >   ...
> >   rte_acl_select_classify(old_alg);
> >
> We could attach the classification method to the acl context, so each
> rte_acl_ctx can point to whatever classifier funtion it wants to.  That would
> remove the global issues you point out above.

I thought about that approach too.
But there is one implication with the DPDK MP model:
the same ACL context can be shared by different DPDK processes,
while acl_classify() could be loaded at different addresses in each of them.
Of course we can overcome it by creating a global table of function pointers indexed by classify_alg and
storing the alg inside the ACL ctx instead of an actual function pointer.
But that means an extra overhead of at least two loads per classify() call.
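Roughly something like that (sketch only, to show where the two loads come from; the ctx->alg field is hypothetical):

/* one global table, same contents in every process, indexed by algorithm */
static const rte_acl_classify_t classify_fns[] = {
	[ACL_CLASSIFY_DEFAULT] = rte_acl_classify_scalar,
	[ACL_CLASSIFY_SCALAR]  = rte_acl_classify_scalar,
	[ACL_CLASSIFY_SSE]     = rte_acl_classify_sse,
};

/* ctx would store only the enum value, which is valid in any process */
return classify_fns[ctx->alg](ctx, data, results, num, categories);
/* load #1: ctx->alg; load #2: classify_fns[ctx->alg] */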

>  Or alternatively we can just not
> provide a generic entry point and let each user select a specific function.

I wonder whether we can have a sort of mixed approach:
1. provide a generic entry point that would be set to the best (from our knowledge) available classify function.
2. Let each user use a specific function if he wants to.

i.e: 
- keep classify_scalar/classify_sse/classify_... public.
- keep your current implementation of rte_acl_classify()
BTW in that way, we probably can make acl_select_classify() static. 

So most users would just use the generic entry point and wouldn't need to write their own code wrappers around it.
Users who need a particular classify() version can call it directly.
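I.e. from an application point of view it would just be:

/* common case: use whatever implementation the library selected at init */
ret = rte_acl_classify(acx, data, results, num, categories);

/* special case: the caller explicitly wants the scalar code path */
ret = rte_acl_classify_scalar(acx, data, results, num, categories);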

> 
> 
> > >  REmoved macro definitions for match check functions to make them static inline
> >
> > More comments inlined below.
> >snip>
> > >
> > >  	/* make a quick check for scalar */
> > > -	ret = rte_acl_classify_scalar(acx, data, results,
> > > +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> > > +	ret = rte_acl_classify(acx, data, results,
> > >  			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
> >
> >
> > As I said above, that doesn't seem correct: we set rte_acl_default_classify = rte_acl_classify_scalar and never restore it back to the
> original value.
> > To support it properly, we need to:
> > old_alg = rte_acl_get_classify();
> >  rte_acl_select_classify(new_alg);
> >  ...
> >  rte_acl_select_classify(old_alg);
> >
> So, for the purposes of this test application, I don't see that as being needed.
> Every call to rte_acl_classify is preceded by a setting of the classifier
> function, so you're safe.

Not every call is, and that's the problem.
As I can see, in test/test_acl.c you replaced
rte_acl_classify_scalar();
with
rte_acl_select_classify(SCALAR);
rte_acl_classify();

And never restore the previous value of rte_acl_default_classify.
Right now rte_acl_default_classify is global, so after the first:
rte_acl_select_classify(SCALAR);
all subsequent rte_acl_classify() calls will actually use the scalar version.

>  If you're concerned about other processes using the
> dpdk library at the same time, you're still safe, as despite being a global
> variable, data pages in a DSO are Copy on Write, so each process gets their own
> copy of the global variable.

No, my concern here is only about app/test.

> 
> Multiple threads within the same process are problematic, I agree, and thats
> solvable with the per-acl-context mechanism that I described above, though that
> shouldn't be needed here as this seems to be a single threaded program.
> 
> > Make all this just to keep UT valid seems like a big hassle to me.
> > So I said above - probably better just leave it to call rte_acl_classify_scalar() directly.
> >
> That works for me too, though the per-context mechanism seems kind of nice to
> me.  Let me know what you prefer.
> 
> ><snip>
> > >
> > > diff --git a/lib/librte_acl/acl_match_check.h b/lib/librte_acl/acl_match_check.h
> > > new file mode 100644
> > > index 0000000..4dc1982
> > > --- /dev/null
> > > +++ b/lib/librte_acl/acl_match_check.h
> >
> > As a nit: we probably don't need a special header just for one function and can place it inside acl_run.h.
> >
> Agreed, I can move that to acl_run.h.
> 
> ><snip>
> > > + */
> > > +static inline uint64_t
> > > +acl_match_check(uint64_t transition, int slot,
> > > +	const struct rte_acl_ctx *ctx, struct parms *parms,
> > > +	struct acl_flow_data *flows, void (*resolve_priority)(
> > > +	uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> > > +	struct parms *parms, const struct rte_acl_match_results *p,
> > > +	uint32_t categories))
> >
> > Ugh, that's really hard to read.
> > Can we create a typedef for resolve_priority function type:
> > typedef void (*resolve_priority_t)(uint64_t, int,
> >         const struct rte_acl_ctx *ctx, struct parms *,
> >         const struct rte_acl_match_results *, uint32_t);
> > And use it here?
> >
> Sure, I'm fine with doing that.
> 
> ><snip>
> > > +
> > > +/* by default, use always avaialbe scalar code path. */
> > > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> >
> > Why not 'static'?
> > I thought you'd like to hide it  from external world.
> >
> Doh!  I didn't do the one thing that I really meant to do.  I removed it from
> the header file but I forgot to declare the variable static.  I'll fix that.
> 
> > > +
> > > +void rte_acl_select_classify(enum acl_classify_alg alg)
> > > +{
> > > +
> > > +	switch(alg)
> > > +	{
> > > +		case ACL_CLASSIFY_DEFAULT:
> > > +		case ACL_CLASSIFY_SCALAR:
> > > +			rte_acl_default_classify = rte_acl_classify_scalar;
> > > +			break;
> > > +		case ACL_CLASSIFY_SSE:
> > > +			rte_acl_default_classify = rte_acl_classify_sse;
> > > +			break;
> > > +	}
> > > +
> > > +}
> >
> > As this is init phase function, I suppose we can add check that alg has a valid(supported) value, and return some error as return
> value, if not.
> >
> Not sure I follow what you're saying above, are you suggesting that we add a
> rte_cpu_get_flag_enabled check to rte_acl_select_classify above?
> 
> ><snip>
> > >   *
> > > @@ -286,9 +289,10 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
> > >   * @return
> > >   *   zero on successful completion.
> > >   *   -EINVAL for incorrect arguments.
> > > + *   -ENOTSUP for unsupported platforms.
> >
> > Please remove the line above: current implementation doesn't return ENOTSUP
> > (I think that was left from v1).
> >
> Yup, probably was.  I'll remove it.
> 
> > >   */
> > >  int
> > > -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > >  	uint32_t *results, uint32_t num, uint32_t categories);
> > >
> > >  /**
> > > @@ -323,9 +327,23 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > >   *   zero on successful completion.
> > >   *   -EINVAL for incorrect arguments.
> > >   */
> > > -int
> > > -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > -	uint32_t *results, uint32_t num, uint32_t categories);
> >
> >
> > As I said above we'd better keep it.
> >
> Ok, can do.
> 
> > > +
> > > +enum acl_classify_alg {
> > > +	ACL_CLASSIFY_DEFAULT = 0,
> > > +	ACL_CLASSIFY_SCALAR = 1,
> > > +	ACL_CLASSIFY_SSE = 2,
> > > +};
> >
> > As a nit: as this emum is part of public API, I think it is better to add rte_ prefix: enum rte_acl_classify_alg
> >
> Sure, done.
> 
> > > +
> > > +extern inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
> > > +				   const uint8_t **data,
> > > +				   uint32_t *results, uint32_t num,
> > > +				   uint32_t categories);
> >
> > Again as a nit: here and everywhere can we keep same style through the whole DPDK - function name from the new line:
> > extern nt
> > rte_acl_classify(...);
> >
> Ok
> 
> I'll produce another version based on your feedback regarding the
> per-context-calssifier method vs. just removing the generic classifier.
> 
> Regards
> Neil

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
  2014-08-27 11:25       ` Ananyev, Konstantin
@ 2014-08-27 18:56         ` Neil Horman
  2014-08-27 19:18           ` Ananyev, Konstantin
  0 siblings, 1 reply; 21+ messages in thread
From: Neil Horman @ 2014-08-27 18:56 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

On Wed, Aug 27, 2014 at 11:25:04AM +0000, Ananyev, Konstantin wrote:
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Tuesday, August 26, 2014 6:45 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> > Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> > 
> > On Mon, Aug 25, 2014 at 04:30:05PM +0000, Ananyev, Konstantin wrote:
> > > Hi Neil,
> > >
> > > > -----Original Message-----
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Thursday, August 21, 2014 9:15 PM
> > > > To: dev@dpdk.org
> > > > Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com; Neil Horman
> > > > Subject: [PATCHv3] librte_acl make it build/work for 'default' target
> > > >
> > > > Make ACL library to build/work on 'default' architecture:
> > > > - make rte_acl_classify_scalar really scalar
> > > >  (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> > > > - Provide two versions of rte_acl_classify code path:
> > > >   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> > > >   and upper, return -ENOTSUP on lower arch.
> > > >   rte_acl_classify_scalar() - a slower version, but could be build and used
> > > >   on all systems.
> > > > - keep common code shared between these two codepaths.
> > > >
> > > > v2 chages:
> > > >  run-time selection of most appropriate code-path for given ISA.
> > > >  By default the highest supprted one is selected.
> > > >  User can still override that selection by manually assigning new value to
> > > >  the global function pointer rte_acl_default_classify.
> > > >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > >  points to.
> > > >
> > >
> > > I see you decided not to wait for me and fix everything by yourself :)
> > >
> > Yeah, sorry, I'm getting pinged about enabling these features in Fedora, and it
> > had been about 2 weeks, so I figured I'd just take care of it.
> 
> No worries. I admit that it was a long delay from my side.
> 
> > 
> > > > V3 Changes
> > > >  Updated classify pointer to be a function so as to better preserve ABI
> > >
> > > As I said in my previous mail it generates extra jump...
> > > Though from numbers I got the performance impact is negligible: < 1%.
> > > So I suppose, I don't have a good enough reason to object :)
> > >
> > Yeah, I just don't see a way around it.  I was hoping that the compiler would
> > have been smart enough to see that the rte_acl_classify function was small and
> > in-linable, but apparently it won't do that.  As you note however the
> > performance change is minor (I'm guessing within a standard deviation of your
> > results).
> > 
> > > Though I still think we better keep  rte_acl_classify_scalar() publically available (same as we do for rte acl_classify_sse()):
> > > First of all keep  rte_acl_classify_scalar() is already part of our public API.
> > > Also, as I remember, one of the customers explicitly asked for scalar version and they planned to call it directly.
> > > Plus using rte_acl_select_classify() to always switch between implementations is not always handy:
> > 
> > I'm not exactly opposed to this, though it seems odd to me that a user might
> > want to call a particular version of the classifier directly.  But I certainly
> > can't predict everything a consumer wants to do.  If we really need to keep it
> > public then, it begs the question, is providing a generic entry point even
> > worthwhile?  Is it just as easy to expose the scalar/sse and any future versions
> > directly so the application can just embody the intellegence to select the best
> > path?  That saves us having to maintain another API point.  I can go with
> > consensus on that.
> > 
> > > -  it is global, which means that we can't simultaneously use classify_scalar() and classify_sse() for 2 different ACL contexts.
> > > - to properly support such switching we then will need to support something like (see app/test/test_acl.c below):
> > >   old_alg = rte_acl_get_classify();
> > >   rte_acl_select_classify(new_alg);
> > >   ...
> > >   rte_acl_select_classify(old_alg);
> > >
> > We could attach the classification method to the acl context, so each
> > rte_acl_ctx can point to whatever classifier funtion it wants to.  That would
> > remove the global issues you point out above.
> 
> I thought about that approach too.
> But there is one implication with DPDK MP model: 
> Same ACL context can be shared by different DPDK processes,
> while acl_classify() could be loaded to the different addresses.
> Of course we can overcome it by creating a global table of function pointers indexed by calssify_alg and
> store inside ACL ctx alg instead of actual function pointer.
> But that means extra overhead of at least two loads per classify() call.
> 
Hmm, how is the context shared around between processes?  Is it just shared as a
common COW data page resulting from a fork?  If so, then we should be good,
because the DSO text will be at the same address (i.e. the pointer will still be
valid).  If you do some sort of message passing, then, yes, that's a problem.


> >  Or alternatively we can just not
> > provide a generic entry point and let each user select a specific function.
> 
> I wonder can we have sort of mixed approach:
> 1. provide a generic entry point that would be set to the best (from our knowledge) available classify function.
> 2. Let each user use a specific function if he wants too.
> 
> i.e: 
> - keep classify_scalar/classify_sse/classify_... public.
> - keep your current implementation of rte_acl_classify()
> BTW in that way, we probably can make acl_select_classify() static. 
> 
Agreed, depending on your answer above, this might be the best solution.

> So most users would just use generic entry point and wouldn't need to write their own code wrappers around it.
> For users who need to use a particular classify()  version - they can call it directly.
> 
It does seem reasonable.  Let me know what the ctx sharing mechanism is from
above, and we can settle this.

> > 
> > 
> > > >  REmoved macro definitions for match check functions to make them static inline
> > >
> > > More comments inlined below.
> > >snip>
> > > >
> > > >  	/* make a quick check for scalar */
> > > > -	ret = rte_acl_classify_scalar(acx, data, results,
> > > > +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> > > > +	ret = rte_acl_classify(acx, data, results,
> > > >  			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
> > >
> > >
> > > As I said above, that doesn't seem correct: we set rte_acl_default_classify = rte_acl_classify_scalar and never restore it back to the
> > original value.
> > > To support it properly, we need to:
> > > old_alg = rte_acl_get_classify();
> > >  rte_acl_select_classify(new_alg);
> > >  ...
> > >  rte_acl_select_classify(old_alg);
> > >
> > So, for the purposes of this test application, I don't see that as being needed.
> > Every call to rte_acl_classify is preceded by a setting of the classifier
> > function, so you're safe.
> 
> Not every, that's a problem.
> As I can see, in test/test_acl.c you replaced
> rte_acl_classify_scalar();
> with
> rte_acl_select_classify(SCALAR);
> rte_acl_classify();
> 
> And never restore previous value of rte_acl_default_classify.
> Right now rte_acl_default_classify is global, so after first:
> rte_acl_select_classify(SCALAR);
> all subsequent rte_acl_classify() will actually use scalar version.
> 
Hmm, ok, I'll take a closer look at it.

> >  If you're concerned about other processes using the
> > dpdk library at the same time, you're still safe, as despite being a global
> > variable, data pages in a DSO are Copy on Write, so each process gets their own
> > copy of the global variable.
> 
> No, my concern here is only about  app/test here. 
> 
> > 
> > Multiple threads within the same process are problematic, I agree, and thats
> > solvable with the per-acl-context mechanism that I described above, though that
> > shouldn't be needed here as this seems to be a single threaded program.
> > 
> > > Make all this just to keep UT valid seems like a big hassle to me.
> > > So I said above - probably better just leave it to call rte_acl_classify_scalar() directly.
> > >
> > That works for me too, though the per-context mechanism seems kind of nice to
> > me.  Let me know what you prefer.
> > 
> > ><snip>
> > > >
> > > > diff --git a/lib/librte_acl/acl_match_check.h b/lib/librte_acl/acl_match_check.h
> > > > new file mode 100644
> > > > index 0000000..4dc1982
> > > > --- /dev/null
> > > > +++ b/lib/librte_acl/acl_match_check.h
> > >
> > > As a nit: we probably don't need a special header just for one function and can place it inside acl_run.h.
> > >
> > Agreed, I can move that to acl_run.h.
> > 
> > ><snip>
> > > > + */
> > > > +static inline uint64_t
> > > > +acl_match_check(uint64_t transition, int slot,
> > > > +	const struct rte_acl_ctx *ctx, struct parms *parms,
> > > > +	struct acl_flow_data *flows, void (*resolve_priority)(
> > > > +	uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> > > > +	struct parms *parms, const struct rte_acl_match_results *p,
> > > > +	uint32_t categories))
> > >
> > > Ugh, that's really hard to read.
> > > Can we create a typedef for resolve_priority function type:
> > > typedef void (*resolve_priority_t)(uint64_t, int,
> > >         const struct rte_acl_ctx *ctx, struct parms *,
> > >         const struct rte_acl_match_results *, uint32_t);
> > > And use it here?
> > >
> > Sure, I'm fine with doing that.
> > 
> > ><snip>
> > > > +
> > > > +/* by default, use always avaialbe scalar code path. */
> > > > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > >
> > > Why not 'static'?
> > > I thought you'd like to hide it  from external world.
> > >
> > Doh!  I didn't do the one thing that I really meant to do.  I removed it from
> > the header file but I forgot to declare the variable static.  I'll fix that.
> > 
> > > > +
> > > > +void rte_acl_select_classify(enum acl_classify_alg alg)
> > > > +{
> > > > +
> > > > +	switch(alg)
> > > > +	{
> > > > +		case ACL_CLASSIFY_DEFAULT:
> > > > +		case ACL_CLASSIFY_SCALAR:
> > > > +			rte_acl_default_classify = rte_acl_classify_scalar;
> > > > +			break;
> > > > +		case ACL_CLASSIFY_SSE:
> > > > +			rte_acl_default_classify = rte_acl_classify_sse;
> > > > +			break;
> > > > +	}
> > > > +
> > > > +}
> > >
> > > As this is init phase function, I suppose we can add check that alg has a valid(supported) value, and return some error as return
> > value, if not.
> > >
> > Not sure I follow what you're saying above, are you suggesting that we add a
> > rte_cpu_get_flag_enabled check to rte_acl_select_classify above?
> > 
> > ><snip>
> > > >   *
> > > > @@ -286,9 +289,10 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
> > > >   * @return
> > > >   *   zero on successful completion.
> > > >   *   -EINVAL for incorrect arguments.
> > > > + *   -ENOTSUP for unsupported platforms.
> > >
> > > Please remove the line above: current implementation doesn't return ENOTSUP
> > > (I think that was left from v1).
> > >
> > Yup, probably was.  I'll remove it.
> > 
> > > >   */
> > > >  int
> > > > -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > >  	uint32_t *results, uint32_t num, uint32_t categories);
> > > >
> > > >  /**
> > > > @@ -323,9 +327,23 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > >   *   zero on successful completion.
> > > >   *   -EINVAL for incorrect arguments.
> > > >   */
> > > > -int
> > > > -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > -	uint32_t *results, uint32_t num, uint32_t categories);
> > >
> > >
> > > As I said above we'd better keep it.
> > >
> > Ok, can do.
> > 
> > > > +
> > > > +enum acl_classify_alg {
> > > > +	ACL_CLASSIFY_DEFAULT = 0,
> > > > +	ACL_CLASSIFY_SCALAR = 1,
> > > > +	ACL_CLASSIFY_SSE = 2,
> > > > +};
> > >
> > > As a nit: as this emum is part of public API, I think it is better to add rte_ prefix: enum rte_acl_classify_alg
> > >
> > Sure, done.
> > 
> > > > +
> > > > +extern inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
> > > > +				   const uint8_t **data,
> > > > +				   uint32_t *results, uint32_t num,
> > > > +				   uint32_t categories);
> > >
> > > Again as a nit: here and everywhere can we keep same style through the whole DPDK - function name from the new line:
> > > extern nt
> > > rte_acl_classify(...);
> > >
> > Ok
> > 
> > I'll produce another version based on your feedback regarding the
> > per-context-calssifier method vs. just removing the generic classifier.
> > 
> > Regards
> > Neil
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
  2014-08-27 18:56         ` Neil Horman
@ 2014-08-27 19:18           ` Ananyev, Konstantin
  2014-08-28  9:02             ` Richardson, Bruce
  2014-08-28 15:55             ` Neil Horman
  0 siblings, 2 replies; 21+ messages in thread
From: Ananyev, Konstantin @ 2014-08-27 19:18 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev



> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Wednesday, August 27, 2014 7:57 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> 
> On Wed, Aug 27, 2014 at 11:25:04AM +0000, Ananyev, Konstantin wrote:
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Tuesday, August 26, 2014 6:45 PM
> > > To: Ananyev, Konstantin
> > > Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> > > Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> > >
> > > On Mon, Aug 25, 2014 at 04:30:05PM +0000, Ananyev, Konstantin wrote:
> > > > Hi Neil,
> > > >
> > > > > -----Original Message-----
> > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > Sent: Thursday, August 21, 2014 9:15 PM
> > > > > To: dev@dpdk.org
> > > > > Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com; Neil Horman
> > > > > Subject: [PATCHv3] librte_acl make it build/work for 'default' target
> > > > >
> > > > > Make ACL library to build/work on 'default' architecture:
> > > > > - make rte_acl_classify_scalar really scalar
> > > > >  (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> > > > > - Provide two versions of rte_acl_classify code path:
> > > > >   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> > > > >   and upper, return -ENOTSUP on lower arch.
> > > > >   rte_acl_classify_scalar() - a slower version, but could be build and used
> > > > >   on all systems.
> > > > > - keep common code shared between these two codepaths.
> > > > >
> > > > > v2 chages:
> > > > >  run-time selection of most appropriate code-path for given ISA.
> > > > >  By default the highest supprted one is selected.
> > > > >  User can still override that selection by manually assigning new value to
> > > > >  the global function pointer rte_acl_default_classify.
> > > > >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > > >  points to.
> > > > >
> > > >
> > > > I see you decided not to wait for me and fix everything by yourself :)
> > > >
> > > Yeah, sorry, I'm getting pinged about enabling these features in Fedora, and it
> > > had been about 2 weeks, so I figured I'd just take care of it.
> >
> > No worries. I admit that it was a long delay from my side.
> >
> > >
> > > > > V3 Changes
> > > > >  Updated classify pointer to be a function so as to better preserve ABI
> > > >
> > > > As I said in my previous mail it generates extra jump...
> > > > Though from numbers I got the performance impact is negligible: < 1%.
> > > > So I suppose, I don't have a good enough reason to object :)
> > > >
> > > Yeah, I just don't see a way around it.  I was hoping that the compiler would
> > > have been smart enough to see that the rte_acl_classify function was small and
> > > in-linable, but apparently it won't do that.  As you note however the
> > > performance change is minor (I'm guessing within a standard deviation of your
> > > results).
> > >
> > > > Though I still think we better keep  rte_acl_classify_scalar() publically available (same as we do for rte acl_classify_sse()):
> > > > First of all keep  rte_acl_classify_scalar() is already part of our public API.
> > > > Also, as I remember, one of the customers explicitly asked for scalar version and they planned to call it directly.
> > > > Plus using rte_acl_select_classify() to always switch between implementations is not always handy:
> > >
> > > I'm not exactly opposed to this, though it seems odd to me that a user might
> > > want to call a particular version of the classifier directly.  But I certainly
> > > can't predict everything a consumer wants to do.  If we really need to keep it
> > > public then, it begs the question, is providing a generic entry point even
> > > worthwhile?  Is it just as easy to expose the scalar/sse and any future versions
> > > directly so the application can just embody the intellegence to select the best
> > > path?  That saves us having to maintain another API point.  I can go with
> > > consensus on that.
> > >
> > > > -  it is global, which means that we can't simultaneously use classify_scalar() and classify_sse() for 2 different ACL contexts.
> > > > - to properly support such switching we then will need to support something like (see app/test/test_acl.c below):
> > > >   old_alg = rte_acl_get_classify();
> > > >   rte_acl_select_classify(new_alg);
> > > >   ...
> > > >   rte_acl_select_classify(old_alg);
> > > >
> > > We could attach the classification method to the acl context, so each
> > > rte_acl_ctx can point to whatever classifier funtion it wants to.  That would
> > > remove the global issues you point out above.
> >
> > I thought about that approach too.
> > But there is one implication with DPDK MP model:
> > Same ACL context can be shared by different DPDK processes,
> > while acl_classify() could be loaded to the different addresses.
> > Of course we can overcome it by creating a global table of function pointers indexed by calssify_alg and
> > store inside ACL ctx alg instead of actual function pointer.
> > But that means extra overhead of at least two loads per classify() call.
> >
> Hmm, how is the context shared around between processes?  Is it just shared as a
> common cow data page resulting from a fork?  If so, then we should be good
> because the DSO text will be at the same address (i.e. the pointer will still be
> valid).  If you do some sort of message passing, then, yes, thats a problem.
> 

No, it is not a parent-child relationship.
There can be a group of independently spawned processes.
One of them must be the 'primary' (it starts first); the others are 'secondaries'.
All hugepage memory pages mapped by the primary process are supposed to be mapped to the same VAs by each secondary.
So everything that is allocated from hugepage memory is shared between all processes in the group.
A more detailed description: http://dpdk.org/doc/intel/dpdk-prog-guide-1.7.0.pdf, section 23.
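
To make this concrete, here is a minimal sketch (illustrative names only, not the actual librte_acl code) of why a raw function pointer cannot live in the shared context, and how storing an enum that each process resolves through its own local table avoids the problem:

/* Illustrative sketch; not the actual librte_acl implementation. */
enum classify_alg { ALG_SCALAR, ALG_SSE, ALG_NUM };

struct shared_ctx {
	/*
	 * Lives in hugepage memory that primary and secondary processes
	 * map at the same VA.  A function pointer stored here would point
	 * into the DSO text, which may sit at a different address in each
	 * process, so only the enum is stored.
	 */
	enum classify_alg alg;
};

static int classify_scalar_impl(const struct shared_ctx *ctx) { (void)ctx; return 0; }
static int classify_sse_impl(const struct shared_ctx *ctx) { (void)ctx; return 0; }

/* Per-process table: each process resolves the enum to its own local address. */
static int (*const classify_fns[ALG_NUM])(const struct shared_ctx *) = {
	[ALG_SCALAR] = classify_scalar_impl,
	[ALG_SSE] = classify_sse_impl,
};

static inline int
classify(const struct shared_ctx *ctx)
{
	/* the extra cost is one indexed load per call */
	return classify_fns[ctx->alg](ctx);
}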

> 
> > >  Or alternatively we can just not
> > > provide a generic entry point and let each user select a specific function.
> >
> > I wonder can we have sort of mixed approach:
> > 1. provide a generic entry point that would be set to the best (from our knowledge) available classify function.
> > 2. Let each user use a specific function if he wants too.
> >
> > i.e:
> > - keep classify_scalar/classify_sse/classify_... public.
> > - keep your current implementation of rte_acl_classify()
> > BTW in that way, we probably can make acl_select_classify() static.
> >
> Agreed, depending on your answer above, this might be the best solution.
> 
> > So most users would just use generic entry point and wouldn't need to write their own code wrappers around it.
> > For users who need to use a particular classify()  version - they can call it directly.
> >
> It does seem reasonable.  Let me know what the ctx sharing mechanism is from
> above, and we can settle this.
> 
> > >
> > >
> > > > >  REmoved macro definitions for match check functions to make them static inline
> > > >
> > > > More comments inlined below.
> > > >snip>
> > > > >
> > > > >  	/* make a quick check for scalar */
> > > > > -	ret = rte_acl_classify_scalar(acx, data, results,
> > > > > +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> > > > > +	ret = rte_acl_classify(acx, data, results,
> > > > >  			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
> > > >
> > > >
> > > > As I said above, that doesn't seem correct: we set rte_acl_default_classify = rte_acl_classify_scalar and never restore it back to
> the
> > > original value.
> > > > To support it properly, we need to:
> > > > old_alg = rte_acl_get_classify();
> > > >  rte_acl_select_classify(new_alg);
> > > >  ...
> > > >  rte_acl_select_classify(old_alg);
> > > >
> > > So, for the purposes of this test application, I don't see that as being needed.
> > > Every call to rte_acl_classify is preceded by a setting of the classifier
> > > function, so you're safe.
> >
> > Not every, that's a problem.
> > As I can see, in test/test_acl.c you replaced
> > rte_acl_classify_scalar();
> > with
> > rte_acl_select_classify(SCALAR);
> > rte_acl_classify();
> >
> > And never restore previous value of rte_acl_default_classify.
> > Right now rte_acl_default_classify is global, so after first:
> > rte_acl_select_classify(SCALAR);
> > all subsequent rte_acl_classify() will actually use scalar version.
> >
> Hmm, ok, I'll take a closer look at it.
> 
> > >  If you're concerned about other processes using the
> > > dpdk library at the same time, you're still safe, as despite being a global
> > > variable, data pages in a DSO are Copy on Write, so each process gets their own
> > > copy of the global variable.
> >
> > No, my concern here is only about  app/test here.
> >
> > >
> > > Multiple threads within the same process are problematic, I agree, and thats
> > > solvable with the per-acl-context mechanism that I described above, though that
> > > shouldn't be needed here as this seems to be a single threaded program.
> > >
> > > > Make all this just to keep UT valid seems like a big hassle to me.
> > > > So I said above - probably better just leave it to call rte_acl_classify_scalar() directly.
> > > >
> > > That works for me too, though the per-context mechanism seems kind of nice to
> > > me.  Let me know what you prefer.
> > >
> > > ><snip>
> > > > >
> > > > > diff --git a/lib/librte_acl/acl_match_check.h b/lib/librte_acl/acl_match_check.h
> > > > > new file mode 100644
> > > > > index 0000000..4dc1982
> > > > > --- /dev/null
> > > > > +++ b/lib/librte_acl/acl_match_check.h
> > > >
> > > > As a nit: we probably don't need a special header just for one function and can place it inside acl_run.h.
> > > >
> > > Agreed, I can move that to acl_run.h.
> > >
> > > ><snip>
> > > > > + */
> > > > > +static inline uint64_t
> > > > > +acl_match_check(uint64_t transition, int slot,
> > > > > +	const struct rte_acl_ctx *ctx, struct parms *parms,
> > > > > +	struct acl_flow_data *flows, void (*resolve_priority)(
> > > > > +	uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> > > > > +	struct parms *parms, const struct rte_acl_match_results *p,
> > > > > +	uint32_t categories))
> > > >
> > > > Ugh, that's really hard to read.
> > > > Can we create a typedef for resolve_priority function type:
> > > > typedef void (*resolve_priority_t)(uint64_t, int,
> > > >         const struct rte_acl_ctx *ctx, struct parms *,
> > > >         const struct rte_acl_match_results *, uint32_t);
> > > > And use it here?
> > > >
> > > Sure, I'm fine with doing that.
> > >
> > > ><snip>
> > > > > +
> > > > > +/* by default, use always avaialbe scalar code path. */
> > > > > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > > >
> > > > Why not 'static'?
> > > > I thought you'd like to hide it  from external world.
> > > >
> > > Doh!  I didn't do the one thing that I really meant to do.  I removed it from
> > > the header file but I forgot to declare the variable static.  I'll fix that.
> > >
> > > > > +
> > > > > +void rte_acl_select_classify(enum acl_classify_alg alg)
> > > > > +{
> > > > > +
> > > > > +	switch(alg)
> > > > > +	{
> > > > > +		case ACL_CLASSIFY_DEFAULT:
> > > > > +		case ACL_CLASSIFY_SCALAR:
> > > > > +			rte_acl_default_classify = rte_acl_classify_scalar;
> > > > > +			break;
> > > > > +		case ACL_CLASSIFY_SSE:
> > > > > +			rte_acl_default_classify = rte_acl_classify_sse;
> > > > > +			break;
> > > > > +	}
> > > > > +
> > > > > +}
> > > >
> > > > As this is init phase function, I suppose we can add check that alg has a valid(supported) value, and return some error as return
> > > value, if not.
> > > >
> > > Not sure I follow what you're saying above, are you suggesting that we add a
> > > rte_cpu_get_flag_enabled check to rte_acl_select_classify above?
> > >
> > > ><snip>
> > > > >   *
> > > > > @@ -286,9 +289,10 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
> > > > >   * @return
> > > > >   *   zero on successful completion.
> > > > >   *   -EINVAL for incorrect arguments.
> > > > > + *   -ENOTSUP for unsupported platforms.
> > > >
> > > > Please remove the line above: current implementation doesn't return ENOTSUP
> > > > (I think that was left from v1).
> > > >
> > > Yup, probably was.  I'll remove it.
> > >
> > > > >   */
> > > > >  int
> > > > > -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > >  	uint32_t *results, uint32_t num, uint32_t categories);
> > > > >
> > > > >  /**
> > > > > @@ -323,9 +327,23 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > >   *   zero on successful completion.
> > > > >   *   -EINVAL for incorrect arguments.
> > > > >   */
> > > > > -int
> > > > > -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > > -	uint32_t *results, uint32_t num, uint32_t categories);
> > > >
> > > >
> > > > As I said above we'd better keep it.
> > > >
> > > Ok, can do.
> > >
> > > > > +
> > > > > +enum acl_classify_alg {
> > > > > +	ACL_CLASSIFY_DEFAULT = 0,
> > > > > +	ACL_CLASSIFY_SCALAR = 1,
> > > > > +	ACL_CLASSIFY_SSE = 2,
> > > > > +};
> > > >
> > > > As a nit: as this emum is part of public API, I think it is better to add rte_ prefix: enum rte_acl_classify_alg
> > > >
> > > Sure, done.
> > >
> > > > > +
> > > > > +extern inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
> > > > > +				   const uint8_t **data,
> > > > > +				   uint32_t *results, uint32_t num,
> > > > > +				   uint32_t categories);
> > > >
> > > > Again as a nit: here and everywhere can we keep same style through the whole DPDK - function name from the new line:
> > > > extern nt
> > > > rte_acl_classify(...);
> > > >
> > > Ok
> > >
> > > I'll produce another version based on your feedback regarding the
> > > per-context-calssifier method vs. just removing the generic classifier.
> > >
> > > Regards
> > > Neil
> >
> >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
  2014-08-27 19:18           ` Ananyev, Konstantin
@ 2014-08-28  9:02             ` Richardson, Bruce
  2014-08-28 15:55             ` Neil Horman
  1 sibling, 0 replies; 21+ messages in thread
From: Richardson, Bruce @ 2014-08-28  9:02 UTC (permalink / raw)
  To: Ananyev, Konstantin, Neil Horman; +Cc: dev

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin
> Sent: Wednesday, August 27, 2014 8:19 PM
> To: Neil Horman
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default'
> target
> 
> 
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Wednesday, August 27, 2014 7:57 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> > Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> >
> > On Wed, Aug 27, 2014 at 11:25:04AM +0000, Ananyev, Konstantin wrote:
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Tuesday, August 26, 2014 6:45 PM
> > > > To: Ananyev, Konstantin
> > > > Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> > > > Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> > > >
> > > > On Mon, Aug 25, 2014 at 04:30:05PM +0000, Ananyev, Konstantin wrote:
> > > > > Hi Neil,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > > Sent: Thursday, August 21, 2014 9:15 PM
> > > > > > To: dev@dpdk.org
> > > > > > Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com; Neil Horman
> > > > > > Subject: [PATCHv3] librte_acl make it build/work for 'default' target
> > > > > >
> > > > > > Make ACL library to build/work on 'default' architecture:
> > > > > > - make rte_acl_classify_scalar really scalar
> > > > > >  (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> > > > > > - Provide two versions of rte_acl_classify code path:
> > > > > >   rte_acl_classify_sse() - could be build and used only on systems with
> sse4.2
> > > > > >   and upper, return -ENOTSUP on lower arch.
> > > > > >   rte_acl_classify_scalar() - a slower version, but could be build and used
> > > > > >   on all systems.
> > > > > > - keep common code shared between these two codepaths.
> > > > > >
> > > > > > v2 chages:
> > > > > >  run-time selection of most appropriate code-path for given ISA.
> > > > > >  By default the highest supprted one is selected.
> > > > > >  User can still override that selection by manually assigning new value
> to
> > > > > >  the global function pointer rte_acl_default_classify.
> > > > > >  rte_acl_classify() becomes a macro calling whatever
> rte_acl_default_classify
> > > > > >  points to.
> > > > > >
> > > > >
> > > > > I see you decided not to wait for me and fix everything by yourself :)
> > > > >
> > > > Yeah, sorry, I'm getting pinged about enabling these features in Fedora,
> and it
> > > > had been about 2 weeks, so I figured I'd just take care of it.
> > >
> > > No worries. I admit that it was a long delay from my side.
> > >
> > > >
> > > > > > V3 Changes
> > > > > >  Updated classify pointer to be a function so as to better preserve ABI
> > > > >
> > > > > As I said in my previous mail it generates extra jump...
> > > > > Though from numbers I got the performance impact is negligible: < 1%.
> > > > > So I suppose, I don't have a good enough reason to object :)
> > > > >
> > > > Yeah, I just don't see a way around it.  I was hoping that the compiler
> would
> > > > have been smart enough to see that the rte_acl_classify function was small
> and
> > > > in-linable, but apparently it won't do that.  As you note however the
> > > > performance change is minor (I'm guessing within a standard deviation of
> your
> > > > results).
> > > >
> > > > > Though I still think we better keep  rte_acl_classify_scalar() publically
> available (same as we do for rte acl_classify_sse()):
> > > > > First of all keep  rte_acl_classify_scalar() is already part of our public API.
> > > > > Also, as I remember, one of the customers explicitly asked for scalar
> version and they planned to call it directly.
> > > > > Plus using rte_acl_select_classify() to always switch between
> implementations is not always handy:
> > > >
> > > > I'm not exactly opposed to this, though it seems odd to me that a user
> might
> > > > want to call a particular version of the classifier directly.  But I certainly
> > > > can't predict everything a consumer wants to do.  If we really need to keep
> it
> > > > public then, it begs the question, is providing a generic entry point even
> > > > worthwhile?  Is it just as easy to expose the scalar/sse and any future
> versions
> > > > directly so the application can just embody the intellegence to select the
> best
> > > > path?  That saves us having to maintain another API point.  I can go with
> > > > consensus on that.
> > > >
> > > > > -  it is global, which means that we can't simultaneously use
> classify_scalar() and classify_sse() for 2 different ACL contexts.
> > > > > - to properly support such switching we then will need to support
> something like (see app/test/test_acl.c below):
> > > > >   old_alg = rte_acl_get_classify();
> > > > >   rte_acl_select_classify(new_alg);
> > > > >   ...
> > > > >   rte_acl_select_classify(old_alg);
> > > > >
> > > > We could attach the classification method to the acl context, so each
> > > > rte_acl_ctx can point to whatever classifier funtion it wants to.  That would
> > > > remove the global issues you point out above.
> > >
> > > I thought about that approach too.
> > > But there is one implication with DPDK MP model:
> > > Same ACL context can be shared by different DPDK processes,
> > > while acl_classify() could be loaded to the different addresses.
> > > Of course we can overcome it by creating a global table of function pointers
> indexed by calssify_alg and
> > > store inside ACL ctx alg instead of actual function pointer.
> > > But that means extra overhead of at least two loads per classify() call.
> > >
> > Hmm, how is the context shared around between processes?  Is it just shared
> as a
> > common cow data page resulting from a fork?  If so, then we should be good
> > because the DSO text will be at the same address (i.e. the pointer will still be
> > valid).  If you do some sort of message passing, then, yes, thats a problem.
> >
> 
> No, it is not parent-child relationship.
> There could be a group of  independently spawned processes.
> One of them should be 'primary' (starts first), other 'secondary's'.
> All hugepage memory pages mapped by the primary process, supposed to be
> mapped to the same VAs by each secondary.
> So all stuff that is allocated from hugepage memory is shared between all
> processes in the group.
> More  detailed  description: http://dpdk.org/doc/intel/dpdk-prog-guide-
> 1.7.0.pdf, section 23.
> 

Function pointers just don't work easily with multiprocess.  Again some history, since today seems to be my Intel DPDK history day...
For the PMDs, we originally allowed NIC access only by the primary process, but later removed that limitation by having the secondary processes do a driver load and PCI scan on startup, and by splitting the ethdev structure between the function pointer part, which is not shared and is configured independently in the secondary process as part of the PCI scan, and the data part, which is in hugepage memory and is shared across all processes.

For the hash library we needed a different approach. We looked at having tables of functions, but discarded the idea as largely unworkable once we took user-specified functions into account. What we ended up doing was to provide separate APIs to call the add/delete/lookup functions with a pre-computed hash, so that multi-process apps can explicitly call the hash function without using a function pointer and then pass the computed value to the rest of the API calls.
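
As a concrete illustration of that pattern (a sketch only; check the librte_hash headers of your DPDK version for the exact prototypes):

#include <rte_hash.h>

/*
 * The application computes the hash signature itself and passes it to the
 * *_with_hash calls, so no function pointer kept in shared memory is ever
 * dereferenced.
 */
static int32_t
insert_and_find(const struct rte_hash *h, const void *key)
{
	hash_sig_t sig;
	int32_t pos;

	/* compute the signature explicitly in the calling process */
	sig = rte_hash_hash(h, key);

	/* pass the pre-computed signature to the add/lookup calls */
	pos = rte_hash_add_key_with_hash(h, key, sig);
	if (pos < 0)
		return pos;

	return rte_hash_lookup_with_hash(h, key, sig);
}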

Apologies for the digression from the immediate topic at hand, but I think it's something that is good to make people generally aware of when working with DPDK libs.

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
  2014-08-27 19:18           ` Ananyev, Konstantin
  2014-08-28  9:02             ` Richardson, Bruce
@ 2014-08-28 15:55             ` Neil Horman
  1 sibling, 0 replies; 21+ messages in thread
From: Neil Horman @ 2014-08-28 15:55 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

On Wed, Aug 27, 2014 at 07:18:44PM +0000, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Wednesday, August 27, 2014 7:57 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> > Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> > 
> > On Wed, Aug 27, 2014 at 11:25:04AM +0000, Ananyev, Konstantin wrote:
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Tuesday, August 26, 2014 6:45 PM
> > > > To: Ananyev, Konstantin
> > > > Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> > > > Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> > > >
> > > > On Mon, Aug 25, 2014 at 04:30:05PM +0000, Ananyev, Konstantin wrote:
> > > > > Hi Neil,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > > Sent: Thursday, August 21, 2014 9:15 PM
> > > > > > To: dev@dpdk.org
> > > > > > Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com; Neil Horman
> > > > > > Subject: [PATCHv3] librte_acl make it build/work for 'default' target
> > > > > >
> > > > > > Make ACL library to build/work on 'default' architecture:
> > > > > > - make rte_acl_classify_scalar really scalar
> > > > > >  (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> > > > > > - Provide two versions of rte_acl_classify code path:
> > > > > >   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> > > > > >   and upper, return -ENOTSUP on lower arch.
> > > > > >   rte_acl_classify_scalar() - a slower version, but could be build and used
> > > > > >   on all systems.
> > > > > > - keep common code shared between these two codepaths.
> > > > > >
> > > > > > v2 chages:
> > > > > >  run-time selection of most appropriate code-path for given ISA.
> > > > > >  By default the highest supprted one is selected.
> > > > > >  User can still override that selection by manually assigning new value to
> > > > > >  the global function pointer rte_acl_default_classify.
> > > > > >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > > > >  points to.
> > > > > >
> > > > >
> > > > > I see you decided not to wait for me and fix everything by yourself :)
> > > > >
> > > > Yeah, sorry, I'm getting pinged about enabling these features in Fedora, and it
> > > > had been about 2 weeks, so I figured I'd just take care of it.
> > >
> > > No worries. I admit that it was a long delay from my side.
> > >
> > > >
> > > > > > V3 Changes
> > > > > >  Updated classify pointer to be a function so as to better preserve ABI
> > > > >
> > > > > As I said in my previous mail it generates extra jump...
> > > > > Though from numbers I got the performance impact is negligible: < 1%.
> > > > > So I suppose, I don't have a good enough reason to object :)
> > > > >
> > > > Yeah, I just don't see a way around it.  I was hoping that the compiler would
> > > > have been smart enough to see that the rte_acl_classify function was small and
> > > > in-linable, but apparently it won't do that.  As you note however the
> > > > performance change is minor (I'm guessing within a standard deviation of your
> > > > results).
> > > >
> > > > > Though I still think we better keep  rte_acl_classify_scalar() publically available (same as we do for rte acl_classify_sse()):
> > > > > First of all keep  rte_acl_classify_scalar() is already part of our public API.
> > > > > Also, as I remember, one of the customers explicitly asked for scalar version and they planned to call it directly.
> > > > > Plus using rte_acl_select_classify() to always switch between implementations is not always handy:
> > > >
> > > > I'm not exactly opposed to this, though it seems odd to me that a user might
> > > > want to call a particular version of the classifier directly.  But I certainly
> > > > can't predict everything a consumer wants to do.  If we really need to keep it
> > > > public then, it begs the question, is providing a generic entry point even
> > > > worthwhile?  Is it just as easy to expose the scalar/sse and any future versions
> > > > directly so the application can just embody the intellegence to select the best
> > > > path?  That saves us having to maintain another API point.  I can go with
> > > > consensus on that.
> > > >
> > > > > -  it is global, which means that we can't simultaneously use classify_scalar() and classify_sse() for 2 different ACL contexts.
> > > > > - to properly support such switching we then will need to support something like (see app/test/test_acl.c below):
> > > > >   old_alg = rte_acl_get_classify();
> > > > >   rte_acl_select_classify(new_alg);
> > > > >   ...
> > > > >   rte_acl_select_classify(old_alg);
> > > > >
> > > > We could attach the classification method to the acl context, so each
> > > > rte_acl_ctx can point to whatever classifier funtion it wants to.  That would
> > > > remove the global issues you point out above.
> > >
> > > I thought about that approach too.
> > > But there is one implication with DPDK MP model:
> > > Same ACL context can be shared by different DPDK processes,
> > > while acl_classify() could be loaded to the different addresses.
> > > Of course we can overcome it by creating a global table of function pointers indexed by calssify_alg and
> > > store inside ACL ctx alg instead of actual function pointer.
> > > But that means extra overhead of at least two loads per classify() call.
> > >
> > Hmm, how is the context shared around between processes?  Is it just shared as a
> > common cow data page resulting from a fork?  If so, then we should be good
> > because the DSO text will be at the same address (i.e. the pointer will still be
> > valid).  If you do some sort of message passing, then, yes, thats a problem.
> > 
> 
> No, it is not parent-child relationship.
> There could be a group of  independently spawned processes.
> One of them should be 'primary' (starts first), other 'secondary's'.
> All hugepage memory pages mapped by the primary process, supposed to be mapped to the same VAs by each secondary.    
> So all stuff that is allocated from hugepage memory is shared between all processes in the group.
> More  detailed  description: http://dpdk.org/doc/intel/dpdk-prog-guide-1.7.0.pdf, section 23.
> 
Ugh, so because you explicitly share heap memory space across all processes, we
can never guarantee any pointers to statically allocated symbols, like functions
or global data.  Great.  Ok, I'll try to rework this.
Neil

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [dpdk-dev] [PATCHv4] librte_acl make it build/work for 'default' target
  2014-08-07 18:31 [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target Konstantin Ananyev
  2014-08-07 20:11 ` Neil Horman
  2014-08-21 20:15 ` [dpdk-dev] [PATCHv3] " Neil Horman
@ 2014-08-28 20:38 ` Neil Horman
  2014-08-29 17:58   ` Ananyev, Konstantin
  2 siblings, 1 reply; 21+ messages in thread
From: Neil Horman @ 2014-08-28 20:38 UTC (permalink / raw)
  To: dev

Make ACL library to build/work on 'default' architecture:
- make rte_acl_classify_scalar really scalar
 (make sure it wouldn't use sse4 instrincts through resolve_priority()).
- Provide two versions of rte_acl_classify code path:
  rte_acl_classify_sse() - could be build and used only on systems with sse4.2
  and upper, return -ENOTSUP on lower arch.
  rte_acl_classify_scalar() - a slower version, but could be build and used
  on all systems.
- keep common code shared between these two codepaths.

v2 chages:
 run-time selection of most appropriate code-path for given ISA.
 By default the highest supprted one is selected.
 User can still override that selection by manually assigning new value to
 the global function pointer rte_acl_default_classify.
 rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
 points to.

V3 Changes
 Updated classify pointer to be a function so as to better preserve ABI
 Removed macro definitions for match check functions to make them static inline

V4 Changes
 Rewrote classification selection mechanism to use a function table, so that we
can just store the preferred alg in the rte_acl_ctx struct so that multiprocess
access works.  I understand that leaves us with an extra load instruction, but I
think that's ok, because it also allows...

 Addition of a new function rte_acl_classify_alg.  This function lets you
specify an enum value to override the acl context's default algorithm when doing a
classification.  This allows an application to specify a classification
algorithm without needing to publicize each method.  I know there was concern
over keeping those methods public, but we don't have a static ABI at the moment,
so this seems to me a reasonable thing to do, as it gives us less of an ABI
surface to worry about.  A short usage sketch follows this change list.

 Fixed misc missed static declarations

 Removed acl_match_check.h and moved match_check function to acl_run.h

 Typedef'd the function pointer for the match check.
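
To illustrate the rte_acl_classify_alg() interface described above, here is a rough caller-side sketch pieced together from the hunks below (treat the prototypes in the patch itself as authoritative):

#include <rte_acl.h>

static int
classify_example(const struct rte_acl_ctx *acx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	int ret;

	/* force the scalar path as the default (what app/test-acl does) */
	rte_acl_set_default_classify(RTE_ACL_CLASSIFY_SCALAR);

	/* classify with the currently selected algorithm */
	ret = rte_acl_classify(acx, data, results, num, categories);
	if (ret != 0)
		return ret;

	/* or override the algorithm explicitly for a single call */
	return rte_acl_classify_alg(acx, RTE_ACL_CLASSIFY_SCALAR,
		data, results, num, categories);
}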

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: konstantin.ananyev@intel.com
CC: thomas.monjalon@6wind.com
---
 app/test-acl/main.c             |  13 +-
 app/test/test_acl.c             |  10 +-
 lib/librte_acl/Makefile         |   5 +-
 lib/librte_acl/acl.h            |   1 +
 lib/librte_acl/acl_bld.c        |   5 +-
 lib/librte_acl/acl_run.c        | 944 ----------------------------------------
 lib/librte_acl/acl_run.h        | 271 ++++++++++++
 lib/librte_acl/acl_run_scalar.c | 197 +++++++++
 lib/librte_acl/acl_run_sse.c    | 630 +++++++++++++++++++++++++++
 lib/librte_acl/rte_acl.c        |  62 +++
 lib/librte_acl/rte_acl.h        |  66 ++-
 11 files changed, 1208 insertions(+), 996 deletions(-)
 delete mode 100644 lib/librte_acl/acl_run.c
 create mode 100644 lib/librte_acl/acl_run.h
 create mode 100644 lib/librte_acl/acl_run_scalar.c
 create mode 100644 lib/librte_acl/acl_run_sse.c

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index d654409..6551918 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -787,6 +787,10 @@ acx_init(void)
 	/* perform build. */
 	ret = rte_acl_build(config.acx, &cfg);
 
+	/* setup default rte_acl_classify */
+	if (config.scalar)
+		rte_acl_set_default_classify(RTE_ACL_CLASSIFY_SCALAR);
+
 	dump_verbose(DUMP_NONE, stdout,
 		"rte_acl_build(%u) finished with %d\n",
 		config.bld_categories, ret);
@@ -815,13 +819,8 @@ search_ip5tuples_once(uint32_t categories, uint32_t step, int scalar)
 			v += config.trace_sz;
 		}
 
-		if (scalar != 0)
-			ret = rte_acl_classify_scalar(config.acx, data,
-				results, n, categories);
-
-		else
-			ret = rte_acl_classify(config.acx, data,
-				results, n, categories);
+		ret = rte_acl_classify(config.acx, data, results,
+			n, categories);
 
 		if (ret != 0)
 			rte_exit(ret, "classify for ipv%c_5tuples returns %d\n",
diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 869f6d3..2169f59 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -148,7 +148,7 @@ test_classify_run(struct rte_acl_ctx *acx)
 	}
 
 	/* make a quick check for scalar */
-	ret = rte_acl_classify_scalar(acx, data, results,
+	ret = rte_acl_classify_alg(acx, RTE_ACL_CLASSIFY_SCALAR, data, results,
 			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
 	if (ret != 0) {
 		printf("Line %i: SSE classify failed!\n", __LINE__);
@@ -343,7 +343,7 @@ test_invalid_layout(void)
 	}
 
 	/* classify tuples */
-	ret = rte_acl_classify(acx, data, results,
+	ret = rte_acl_classify_alg(acx, RTE_ACL_CLASSIFY_SCALAR, data, results,
 			RTE_DIM(results), 1);
 	if (ret != 0) {
 		printf("Line %i: SSE classify failed!\n", __LINE__);
@@ -362,7 +362,7 @@ test_invalid_layout(void)
 	}
 
 	/* classify tuples (scalar) */
-	ret = rte_acl_classify_scalar(acx, data, results,
+	ret = rte_acl_classify_alg(acx, RTE_ACL_CLASSIFY_SCALAR, data, results,
 			RTE_DIM(results), 1);
 	if (ret != 0) {
 		printf("Line %i: Scalar classify failed!\n", __LINE__);
@@ -850,7 +850,7 @@ test_invalid_parameters(void)
 	/* scalar classify test */
 
 	/* cover zero categories in classify (should not fail) */
-	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 0);
+	result = rte_acl_classify_alg(acx, RTE_ACL_CLASSIFY_SCALAR, NULL, NULL, 0, 0);
 	if (result != 0) {
 		printf("Line %i: Scalar classify with zero categories "
 				"failed!\n", __LINE__);
@@ -859,7 +859,7 @@ test_invalid_parameters(void)
 	}
 
 	/* cover invalid but positive categories in classify */
-	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 3);
+	result = rte_acl_classify(acx, NULL, NULL, 0, 3);
 	if (result == 0) {
 		printf("Line %i: Scalar classify with 3 categories "
 				"should have failed!\n", __LINE__);
diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
index 4fe4593..65e566d 100644
--- a/lib/librte_acl/Makefile
+++ b/lib/librte_acl/Makefile
@@ -43,7 +43,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += rte_acl.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
-SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run.c
+SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
+SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
+
+CFLAGS_acl_run_sse.o += -msse4.1
 
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include := rte_acl_osdep.h
diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index b9d63fd..9236b7b 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -168,6 +168,7 @@ struct rte_acl_ctx {
 	void               *mem;
 	size_t              mem_sz;
 	struct rte_acl_config config; /* copy of build config. */
+	enum rte_acl_classify_alg alg;
 };
 
 int rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index 873447b..09d58ea 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -31,7 +31,6 @@
  *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
-#include <nmmintrin.h>
 #include <rte_acl.h>
 #include "tb_mem.h"
 #include "acl.h"
@@ -1480,8 +1479,8 @@ acl_calc_wildness(struct rte_acl_build_rule *head,
 
 			switch (rule->config->defs[n].type) {
 			case RTE_ACL_FIELD_TYPE_BITMASK:
-				wild = (size -
-					_mm_popcnt_u32(fld->mask_range.u8)) /
+				wild = (size - __builtin_popcount(
+					fld->mask_range.u8)) /
 					size;
 				break;
 
diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
deleted file mode 100644
index e3d9fc1..0000000
--- a/lib/librte_acl/acl_run.c
+++ /dev/null
@@ -1,944 +0,0 @@
-/*-
- *   BSD LICENSE
- *
- *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
- *   All rights reserved.
- *
- *   Redistribution and use in source and binary forms, with or without
- *   modification, are permitted provided that the following conditions
- *   are met:
- *
- *     * Redistributions of source code must retain the above copyright
- *       notice, this list of conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright
- *       notice, this list of conditions and the following disclaimer in
- *       the documentation and/or other materials provided with the
- *       distribution.
- *     * Neither the name of Intel Corporation nor the names of its
- *       contributors may be used to endorse or promote products derived
- *       from this software without specific prior written permission.
- *
- *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
- *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
- *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
- *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#include <rte_acl.h>
-#include "acl_vect.h"
-#include "acl.h"
-
-#define MAX_SEARCHES_SSE8	8
-#define MAX_SEARCHES_SSE4	4
-#define MAX_SEARCHES_SSE2	2
-#define MAX_SEARCHES_SCALAR	2
-
-#define GET_NEXT_4BYTES(prm, idx)	\
-	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
-
-
-#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
-
-#define	SCALAR_QRANGE_MULT	0x01010101
-#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
-#define	SCALAR_QRANGE_MIN	0x80808080
-
-enum {
-	SHUFFLE32_SLOT1 = 0xe5,
-	SHUFFLE32_SLOT2 = 0xe6,
-	SHUFFLE32_SLOT3 = 0xe7,
-	SHUFFLE32_SWAP64 = 0x4e,
-};
-
-/*
- * Structure to manage N parallel trie traversals.
- * The runtime trie traversal routines can process 8, 4, or 2 tries
- * in parallel. Each packet may require multiple trie traversals (up to 4).
- * This structure is used to fill the slots (0 to n-1) for parallel processing
- * with the trie traversals needed for each packet.
- */
-struct acl_flow_data {
-	uint32_t            num_packets;
-	/* number of packets processed */
-	uint32_t            started;
-	/* number of trie traversals in progress */
-	uint32_t            trie;
-	/* current trie index (0 to N-1) */
-	uint32_t            cmplt_size;
-	uint32_t            total_packets;
-	uint32_t            categories;
-	/* number of result categories per packet. */
-	/* maximum number of packets to process */
-	const uint64_t     *trans;
-	const uint8_t     **data;
-	uint32_t           *results;
-	struct completion  *last_cmplt;
-	struct completion  *cmplt_array;
-};
-
-/*
- * Structure to maintain running results for
- * a single packet (up to 4 tries).
- */
-struct completion {
-	uint32_t *results;                          /* running results. */
-	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
-	uint32_t  count;                            /* num of remaining tries */
-	/* true for allocated struct */
-} __attribute__((aligned(XMM_SIZE)));
-
-/*
- * One parms structure for each slot in the search engine.
- */
-struct parms {
-	const uint8_t              *data;
-	/* input data for this packet */
-	const uint32_t             *data_index;
-	/* data indirection for this trie */
-	struct completion          *cmplt;
-	/* completion data for this packet */
-};
-
-/*
- * Define an global idle node for unused engine slots
- */
-static const uint32_t idle[UINT8_MAX + 1];
-
-static const rte_xmm_t mm_type_quad_range = {
-	.u32 = {
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-	},
-};
-
-static const rte_xmm_t mm_type_quad_range64 = {
-	.u32 = {
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		0,
-		0,
-	},
-};
-
-static const rte_xmm_t mm_shuffle_input = {
-	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
-};
-
-static const rte_xmm_t mm_shuffle_input64 = {
-	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
-};
-
-static const rte_xmm_t mm_ones_16 = {
-	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
-};
-
-static const rte_xmm_t mm_bytes = {
-	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
-};
-
-static const rte_xmm_t mm_bytes64 = {
-	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
-};
-
-static const rte_xmm_t mm_match_mask = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-	},
-};
-
-static const rte_xmm_t mm_match_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		0,
-		RTE_ACL_NODE_MATCH,
-		0,
-	},
-};
-
-static const rte_xmm_t mm_index_mask = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-	},
-};
-
-static const rte_xmm_t mm_index_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		0,
-		0,
-	},
-};
-
-/*
- * Allocate a completion structure to manage the tries for a packet.
- */
-static inline struct completion *
-alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
-	uint32_t *results)
-{
-	uint32_t n;
-
-	for (n = 0; n < size; n++) {
-
-		if (p[n].count == 0) {
-
-			/* mark as allocated and set number of tries. */
-			p[n].count = tries;
-			p[n].results = results;
-			return &(p[n]);
-		}
-	}
-
-	/* should never get here */
-	return NULL;
-}
-
-/*
- * Resolve priority for a single result trie.
- */
-static inline void
-resolve_single_priority(uint64_t transition, int n,
-	const struct rte_acl_ctx *ctx, struct parms *parms,
-	const struct rte_acl_match_results *p)
-{
-	if (parms[n].cmplt->count == ctx->num_tries ||
-			parms[n].cmplt->priority[0] <=
-			p[transition].priority[0]) {
-
-		parms[n].cmplt->priority[0] = p[transition].priority[0];
-		parms[n].cmplt->results[0] = p[transition].results[0];
-	}
-
-	parms[n].cmplt->count--;
-}
-
-/*
- * Resolve priority for multiple results. This consists comparing
- * the priority of the current traversal with the running set of
- * results for the packet. For each result, keep a running array of
- * the result (rule number) and its priority for each category.
- */
-static inline void
-resolve_priority(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
-	struct parms *parms, const struct rte_acl_match_results *p,
-	uint32_t categories)
-{
-	uint32_t x;
-	xmm_t results, priority, results1, priority1, selector;
-	xmm_t *saved_results, *saved_priority;
-
-	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
-
-		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
-		saved_priority =
-			(xmm_t *)(&parms[n].cmplt->priority[x]);
-
-		/* get results and priorities for completed trie */
-		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
-		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
-
-		/* if this is not the first completed trie */
-		if (parms[n].cmplt->count != ctx->num_tries) {
-
-			/* get running best results and their priorities */
-			results1 = MM_LOADU(saved_results);
-			priority1 = MM_LOADU(saved_priority);
-
-			/* select results that are highest priority */
-			selector = MM_CMPGT32(priority1, priority);
-			results = MM_BLENDV8(results, results1, selector);
-			priority = MM_BLENDV8(priority, priority1, selector);
-		}
-
-		/* save running best results and their priorities */
-		MM_STOREU(saved_results, results);
-		MM_STOREU(saved_priority, priority);
-	}
-
-	/* Count down completed tries for this search request */
-	parms[n].cmplt->count--;
-}
-
-/*
- * Routine to fill a slot in the parallel trie traversal array (parms) from
- * the list of packets (flows).
- */
-static inline uint64_t
-acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
-	const struct rte_acl_ctx *ctx)
-{
-	uint64_t transition;
-
-	/* if there are any more packets to process */
-	if (flows->num_packets < flows->total_packets) {
-		parms[n].data = flows->data[flows->num_packets];
-		parms[n].data_index = ctx->trie[flows->trie].data_index;
-
-		/* if this is the first trie for this packet */
-		if (flows->trie == 0) {
-			flows->last_cmplt = alloc_completion(flows->cmplt_array,
-				flows->cmplt_size, ctx->num_tries,
-				flows->results +
-				flows->num_packets * flows->categories);
-		}
-
-		/* set completion parameters and starting index for this slot */
-		parms[n].cmplt = flows->last_cmplt;
-		transition =
-			flows->trans[parms[n].data[*parms[n].data_index++] +
-			ctx->trie[flows->trie].root_index];
-
-		/*
-		 * if this is the last trie for this packet,
-		 * then setup next packet.
-		 */
-		flows->trie++;
-		if (flows->trie >= ctx->num_tries) {
-			flows->trie = 0;
-			flows->num_packets++;
-		}
-
-		/* keep track of number of active trie traversals */
-		flows->started++;
-
-	/* no more tries to process, set slot to an idle position */
-	} else {
-		transition = ctx->idle;
-		parms[n].data = (const uint8_t *)idle;
-		parms[n].data_index = idle;
-	}
-	return transition;
-}
-
-/*
- * Detect matches. If a match node transition is found, then this trie
- * traversal is complete and fill the slot with the next trie
- * to be processed.
- */
-static inline uint64_t
-acl_match_check_transition(uint64_t transition, int slot,
-	const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows)
-{
-	const struct rte_acl_match_results *p;
-
-	p = (const struct rte_acl_match_results *)
-		(flows->trans + ctx->match_index);
-
-	if (transition & RTE_ACL_NODE_MATCH) {
-
-		/* Remove flags from index and decrement active traversals */
-		transition &= RTE_ACL_NODE_INDEX;
-		flows->started--;
-
-		/* Resolve priorities for this trie and running results */
-		if (flows->categories == 1)
-			resolve_single_priority(transition, slot, ctx,
-				parms, p);
-		else
-			resolve_priority(transition, slot, ctx, parms, p,
-				flows->categories);
-
-		/* Fill the slot with the next trie or idle trie */
-		transition = acl_start_next_trie(flows, parms, slot, ctx);
-
-	} else if (transition == ctx->idle) {
-		/* reset indirection table for idle slots */
-		parms[slot].data_index = idle;
-	}
-
-	return transition;
-}
-
-/*
- * Extract transitions from an XMM register and check for any matches
- */
-static void
-acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
-	struct parms *parms, struct acl_flow_data *flows)
-{
-	uint64_t transition1, transition2;
-
-	/* extract transition from low 64 bits. */
-	transition1 = MM_CVT64(*indicies);
-
-	/* extract transition from high 64 bits. */
-	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
-	transition2 = MM_CVT64(*indicies);
-
-	transition1 = acl_match_check_transition(transition1, slot, ctx,
-		parms, flows);
-	transition2 = acl_match_check_transition(transition2, slot + 1, ctx,
-		parms, flows);
-
-	/* update indicies with new transitions. */
-	*indicies = MM_SET64(transition2, transition1);
-}
-
-/*
- * Check for a match in 2 transitions (contained in SSE register)
- */
-static inline void
-acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
-{
-	xmm_t temp;
-
-	temp = MM_AND(match_mask, *indicies);
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indicies, slot, ctx, parms, flows);
-		temp = MM_AND(match_mask, *indicies);
-	}
-}
-
-/*
- * Check for any match in 4 transitions (contained in 2 SSE registers)
- */
-static inline void
-acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
-	xmm_t match_mask)
-{
-	xmm_t temp;
-
-	/* put low 32 bits of each transition into one register */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
-		0x88);
-	/* test for match node */
-	temp = MM_AND(match_mask, temp);
-
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indicies1, slot, ctx, parms, flows);
-		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
-
-		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
-					(__m128)*indicies2,
-					0x88);
-		temp = MM_AND(match_mask, temp);
-	}
-}
-
-/*
- * Calculate the address of the next transition for
- * all types of nodes. Note that only DFA nodes and range
- * nodes actually transition to another node. Match
- * nodes don't move.
- */
-static inline xmm_t
-acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	xmm_t *indicies1, xmm_t *indicies2)
-{
-	xmm_t addr, node_types, temp;
-
-	/*
-	 * Note that no transition is done for a match
-	 * node and therefore a stream freezes when
-	 * it reaches a match.
-	 */
-
-	/* Shuffle low 32 into temp and high 32 into indicies2 */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
-		0x88);
-	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
-		(__m128)*indicies2, 0xdd);
-
-	/* Calc node type and node addr */
-	node_types = MM_ANDNOT(index_mask, temp);
-	addr = MM_AND(index_mask, temp);
-
-	/*
-	 * Calc addr for DFAs - addr = dfa_index + input_byte
-	 */
-
-	/* mask for DFA type (0) nodes */
-	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
-
-	/* add input byte to DFA position */
-	temp = MM_AND(temp, bytes);
-	temp = MM_AND(temp, next_input);
-	addr = MM_ADD32(addr, temp);
-
-	/*
-	 * Calc addr for Range nodes -> range_index + range(input)
-	 */
-	node_types = MM_CMPEQ32(node_types, type_quad_range);
-
-	/*
-	 * Calculate number of range boundaries that are less than the
-	 * input value. Range boundaries for each node are in signed 8 bit,
-	 * ordered from -128 to 127 in the indicies2 register.
-	 * This is effectively a popcnt of bytes that are greater than the
-	 * input byte.
-	 */
-
-	/* shuffle input byte to all 4 positions of 32 bit value */
-	temp = MM_SHUFFLE8(next_input, shuffle_input);
-
-	/* check ranges */
-	temp = MM_CMPGT8(temp, *indicies2);
-
-	/* convert -1 to 1 (bytes greater than input byte */
-	temp = MM_SIGN8(temp, temp);
-
-	/* horizontal add pairs of bytes into words */
-	temp = MM_MADD8(temp, temp);
-
-	/* horizontal add pairs of words into dwords */
-	temp = MM_MADD16(temp, ones_16);
-
-	/* mask to range type nodes */
-	temp = MM_AND(temp, node_types);
-
-	/* add index into node position */
-	return MM_ADD32(addr, temp);
-}
-
-/*
- * Process 4 transitions (in 2 SIMD registers) in parallel
- */
-static inline xmm_t
-transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
-{
-	xmm_t addr;
-	uint64_t trans0, trans2;
-
-	 /* Calculate the address (array index) for all 4 transitions. */
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		bytes, type_quad_range, indicies1, indicies2);
-
-	 /* Gather 64 bit transitions and pack back into 2 registers. */
-
-	trans0 = trans[MM_CVT32(addr)];
-
-	/* get slot 2 */
-
-	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
-	trans2 = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-
-	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
-
-	/* get slot 3 */
-
-	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
-	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
-
-	return MM_SRL32(next_input, 8);
-}
-
-static inline void
-acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
-	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
-	uint32_t data_num, uint32_t categories, const uint64_t *trans)
-{
-	flows->num_packets = 0;
-	flows->started = 0;
-	flows->trie = 0;
-	flows->last_cmplt = NULL;
-	flows->cmplt_array = cmplt;
-	flows->total_packets = data_num;
-	flows->categories = categories;
-	flows->cmplt_size = cmplt_size;
-	flows->data = data;
-	flows->results = results;
-	flows->trans = trans;
-}
-
-/*
- * Execute trie traversal with 8 traversals in parallel
- */
-static inline void
-search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE8];
-	struct completion cmplt[MAX_SEARCHES_SSE8];
-	struct parms parms[MAX_SEARCHES_SSE8];
-	xmm_t input0, input1;
-	xmm_t indicies1, indicies2, indicies3, indicies4;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	/*
-	 * indicies1 contains index_array[0,1]
-	 * indicies2 contains index_array[2,3]
-	 * indicies3 contains index_array[4,5]
-	 * indicies4 contains index_array[6,7]
-	 */
-
-	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
-	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
-
-	 /* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indicies1, &indicies2, mm_match_mask.m);
-	acl_match_check_x4(4, ctx, parms, &flows,
-		&indicies3, &indicies4, mm_match_mask.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
-			0);
-		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
-			0);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
-
-		 /* Process the 4 bytes of input on each stream. */
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		 /* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indicies1, &indicies2, mm_match_mask.m);
-		acl_match_check_x4(4, ctx, parms, &flows,
-			&indicies3, &indicies4, mm_match_mask.m);
-	}
-}
-
-/*
- * Execute trie traversal with 4 traversals in parallel
- */
-static inline void
-search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	 uint32_t *results, int total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE4];
-	struct completion cmplt[MAX_SEARCHES_SSE4];
-	struct parms parms[MAX_SEARCHES_SSE4];
-	xmm_t input, indicies1, indicies2;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	/* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indicies1, &indicies2, mm_match_mask.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
-
-		/* Process the 4 bytes of input on each stream. */
-		input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		/* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indicies1, &indicies2, mm_match_mask.m);
-	}
-}
-
-static inline xmm_t
-transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	const uint64_t *trans, xmm_t *indicies1)
-{
-	uint64_t t;
-	xmm_t addr, indicies2;
-
-	indicies2 = MM_XOR(ones_16, ones_16);
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		bytes, type_quad_range, indicies1, &indicies2);
-
-	/* Gather 64 bit transitions and pack 2 per register. */
-
-	t = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
-
-	return MM_SRL32(next_input, 8);
-}
-
-/*
- * Execute trie traversal with 2 traversals in parallel.
- */
-static inline void
-search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE2];
-	struct completion cmplt[MAX_SEARCHES_SSE2];
-	struct parms parms[MAX_SEARCHES_SSE2];
-	xmm_t input, indicies;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indicies = MM_LOADU((xmm_t *) &index_array[0]);
-
-	/* Check for any matches. */
-	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-
-		/* Process the 4 bytes of input on each stream. */
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		/* Check for any matches. */
-		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
-			mm_match_mask64.m);
-	}
-}
-
-/*
- * When processing the transition, rather than using if/else
- * construct, the offset is calculated for DFA and QRANGE and
- * then conditionally added to the address based on node type.
- * This is done to avoid branch mis-predictions. Since the
- * offset is rather simple calculation it is more efficient
- * to do the calculation and do a condition move rather than
- * a conditional branch to determine which calculation to do.
- */
-static inline uint32_t
-scan_forward(uint32_t input, uint32_t max)
-{
-	return (input == 0) ? max : rte_bsf32(input);
-}
-
-static inline uint64_t
-scalar_transition(const uint64_t *trans_table, uint64_t transition,
-	uint8_t input)
-{
-	uint32_t addr, index, ranges, x, a, b, c;
-
-	/* break transition into component parts */
-	ranges = transition >> (sizeof(index) * CHAR_BIT);
-
-	/* calc address for a QRANGE node */
-	c = input * SCALAR_QRANGE_MULT;
-	a = ranges | SCALAR_QRANGE_MIN;
-	index = transition & ~RTE_ACL_NODE_INDEX;
-	a -= (c & SCALAR_QRANGE_MASK);
-	b = c & SCALAR_QRANGE_MIN;
-	addr = transition ^ index;
-	a &= SCALAR_QRANGE_MIN;
-	a ^= (ranges ^ b) & (a ^ b);
-	x = scan_forward(a, 32) >> 3;
-	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
-
-	/* pickup next transition */
-	transition = *(trans_table + addr);
-	return transition;
-}
-
-int
-rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories)
-{
-	int n;
-	uint64_t transition0, transition1;
-	uint32_t input0, input1;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SCALAR];
-	struct completion cmplt[MAX_SEARCHES_SCALAR];
-	struct parms parms[MAX_SEARCHES_SCALAR];
-
-	if (categories != 1 &&
-		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
-		return -EINVAL;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
-		categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	transition0 = index_array[0];
-	transition1 = index_array[1];
-
-	while (flows.started > 0) {
-
-		input0 = GET_NEXT_4BYTES(parms, 0);
-		input1 = GET_NEXT_4BYTES(parms, 1);
-
-		for (n = 0; n < 4; n++) {
-			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
-				transition0 = scalar_transition(flows.trans,
-					transition0, (uint8_t)input0);
-
-			input0 >>= CHAR_BIT;
-
-			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
-				transition1 = scalar_transition(flows.trans,
-					transition1, (uint8_t)input1);
-
-			input1 >>= CHAR_BIT;
-
-		}
-		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
-			transition0 = acl_match_check_transition(transition0,
-				0, ctx, parms, &flows);
-			transition1 = acl_match_check_transition(transition1,
-				1, ctx, parms, &flows);
-
-		}
-	}
-	return 0;
-}
-
-int
-rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories)
-{
-	if (categories != 1 &&
-		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
-		return -EINVAL;
-
-	if (likely(num >= MAX_SEARCHES_SSE8))
-		search_sse_8(ctx, data, results, num, categories);
-	else if (num >= MAX_SEARCHES_SSE4)
-		search_sse_4(ctx, data, results, num, categories);
-	else
-		search_sse_2(ctx, data, results, num, categories);
-
-	return 0;
-}
diff --git a/lib/librte_acl/acl_run.h b/lib/librte_acl/acl_run.h
new file mode 100644
index 0000000..5009188
--- /dev/null
+++ b/lib/librte_acl/acl_run.h
@@ -0,0 +1,271 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef	_ACL_RUN_H_
+#define	_ACL_RUN_H_
+
+#include <rte_acl.h>
+#include "acl_vect.h"
+#include "acl.h"
+
+#define MAX_SEARCHES_SSE8	8
+#define MAX_SEARCHES_SSE4	4
+#define MAX_SEARCHES_SSE2	2
+#define MAX_SEARCHES_SCALAR	2
+
+#define GET_NEXT_4BYTES(prm, idx)	\
+	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
+
+
+#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
+
+#define	SCALAR_QRANGE_MULT	0x01010101
+#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
+#define	SCALAR_QRANGE_MIN	0x80808080
+
+typedef int (*rte_acl_classify_t)
+(const struct rte_acl_ctx *, const uint8_t **, uint32_t *, uint32_t, uint32_t);
+
+/*
+ * Structure to manage N parallel trie traversals.
+ * The runtime trie traversal routines can process 8, 4, or 2 tries
+ * in parallel. Each packet may require multiple trie traversals (up to 4).
+ * This structure is used to fill the slots (0 to n-1) for parallel processing
+ * with the trie traversals needed for each packet.
+ */
+struct acl_flow_data {
+	uint32_t            num_packets;
+	/* number of packets processed */
+	uint32_t            started;
+	/* number of trie traversals in progress */
+	uint32_t            trie;
+	/* current trie index (0 to N-1) */
+	uint32_t            cmplt_size;
+	uint32_t            total_packets;
+	/* maximum number of packets to process */
+	uint32_t            categories;
+	/* number of result categories per packet. */
+	const uint64_t     *trans;
+	const uint8_t     **data;
+	uint32_t           *results;
+	struct completion  *last_cmplt;
+	struct completion  *cmplt_array;
+};
+
+/*
+ * Structure to maintain running results for
+ * a single packet (up to 4 tries).
+ */
+struct completion {
+	uint32_t *results;                          /* running results. */
+	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
+	uint32_t  count;                            /* num of remaining tries */
+	/* count != 0 also marks the struct as allocated */
+} __attribute__((aligned(XMM_SIZE)));
+
+/*
+ * One parms structure for each slot in the search engine.
+ */
+struct parms {
+	const uint8_t              *data;
+	/* input data for this packet */
+	const uint32_t             *data_index;
+	/* data indirection for this trie */
+	struct completion          *cmplt;
+	/* completion data for this packet */
+};
+
+/*
+ * Define an global idle node for unused engine slots
+ */
+static const uint32_t idle[UINT8_MAX + 1];
+
+/*
+ * Allocate a completion structure to manage the tries for a packet.
+ */
+static inline struct completion *
+alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
+	uint32_t *results)
+{
+	uint32_t n;
+
+	for (n = 0; n < size; n++) {
+
+		if (p[n].count == 0) {
+
+			/* mark as allocated and set number of tries. */
+			p[n].count = tries;
+			p[n].results = results;
+			return &(p[n]);
+		}
+	}
+
+	/* should never get here */
+	return NULL;
+}
+
+/*
+ * Resolve priority for a single result trie.
+ */
+static inline void
+resolve_single_priority(uint64_t transition, int n,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	const struct rte_acl_match_results *p)
+{
+	if (parms[n].cmplt->count == ctx->num_tries ||
+			parms[n].cmplt->priority[0] <=
+			p[transition].priority[0]) {
+
+		parms[n].cmplt->priority[0] = p[transition].priority[0];
+		parms[n].cmplt->results[0] = p[transition].results[0];
+	}
+}
+
+/*
+ * Routine to fill a slot in the parallel trie traversal array (parms) from
+ * the list of packets (flows).
+ */
+static inline uint64_t
+acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
+	const struct rte_acl_ctx *ctx)
+{
+	uint64_t transition;
+
+	/* if there are any more packets to process */
+	if (flows->num_packets < flows->total_packets) {
+		parms[n].data = flows->data[flows->num_packets];
+		parms[n].data_index = ctx->trie[flows->trie].data_index;
+
+		/* if this is the first trie for this packet */
+		if (flows->trie == 0) {
+			flows->last_cmplt = alloc_completion(flows->cmplt_array,
+				flows->cmplt_size, ctx->num_tries,
+				flows->results +
+				flows->num_packets * flows->categories);
+		}
+
+		/* set completion parameters and starting index for this slot */
+		parms[n].cmplt = flows->last_cmplt;
+		transition =
+			flows->trans[parms[n].data[*parms[n].data_index++] +
+			ctx->trie[flows->trie].root_index];
+
+		/*
+		 * if this is the last trie for this packet,
+		 * then setup next packet.
+		 */
+		flows->trie++;
+		if (flows->trie >= ctx->num_tries) {
+			flows->trie = 0;
+			flows->num_packets++;
+		}
+
+		/* keep track of number of active trie traversals */
+		flows->started++;
+
+	/* no more tries to process, set slot to an idle position */
+	} else {
+		transition = ctx->idle;
+		parms[n].data = (const uint8_t *)idle;
+		parms[n].data_index = idle;
+	}
+	return transition;
+}
+
+static inline void
+acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
+	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
+	uint32_t data_num, uint32_t categories, const uint64_t *trans)
+{
+	flows->num_packets = 0;
+	flows->started = 0;
+	flows->trie = 0;
+	flows->last_cmplt = NULL;
+	flows->cmplt_array = cmplt;
+	flows->total_packets = data_num;
+	flows->categories = categories;
+	flows->cmplt_size = cmplt_size;
+	flows->data = data;
+	flows->results = results;
+	flows->trans = trans;
+}
+
+typedef void (*resolve_priority_t)
+(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
+        struct parms *parms, const struct rte_acl_match_results *p,
+        uint32_t categories);
+
+/*
+ * Detect matches. If a match node transition is found, then this trie
+ * traversal is complete and the slot is filled with the next trie
+ * to be processed.
+ */
+static inline uint64_t
+acl_match_check(uint64_t transition, int slot,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, resolve_priority_t resolve_priority)
+{
+	const struct rte_acl_match_results *p;
+
+	p = (const struct rte_acl_match_results *)
+		(flows->trans + ctx->match_index);
+
+	if (transition & RTE_ACL_NODE_MATCH) {
+
+		/* Remove flags from index and decrement active traversals */
+		transition &= RTE_ACL_NODE_INDEX;
+		flows->started--;
+
+		/* Resolve priorities for this trie and running results */
+		if (flows->categories == 1)
+			resolve_single_priority(transition, slot, ctx,
+				parms, p);
+		else
+			resolve_priority(transition, slot, ctx, parms,
+				p, flows->categories);
+
+		/* Count down completed tries for this search request */
+		parms[slot].cmplt->count--;
+
+		/* Fill the slot with the next trie or idle trie */
+		transition = acl_start_next_trie(flows, parms, slot, ctx);
+
+	} else if (transition == ctx->idle) {
+		/* reset indirection table for idle slots */
+		parms[slot].data_index = idle;
+	}
+
+	return transition;
+}
+
+#endif /* _ACL_RUN_H_ */
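
The important refactoring in acl_run.h is that acl_match_check() no longer
hard-codes the SSE priority resolution: it takes a resolve_priority_t
callback, so the scalar and SSE runtimes share all of the flow/slot
bookkeeping and only supply their own resolver. As a rough illustration
(not part of the patch; it assumes acl_run.h above is included), a code
path hooks in like this:

#include "acl_run.h"

/*
 * Illustrative resolver only -- the real ones are resolve_priority_scalar()
 * and resolve_priority_sse() in the files below. For each category keep the
 * result of the highest-priority trie seen so far; the first completed trie
 * (count == num_tries) is always taken.
 */
static void
resolve_priority_sketch(uint64_t transition, int n,
	const struct rte_acl_ctx *ctx, struct parms *parms,
	const struct rte_acl_match_results *p, uint32_t categories)
{
	uint32_t i;

	for (i = 0; i != categories; i++) {
		if (parms[n].cmplt->count == ctx->num_tries ||
				parms[n].cmplt->priority[i] <=
				p[transition].priority[i]) {
			parms[n].cmplt->priority[i] = p[transition].priority[i];
			parms[n].cmplt->results[i] = p[transition].results[i];
		}
	}
}

Each runtime then calls acl_match_check(transition, slot, ctx, parms, &flows,
resolve_priority_xxx) whenever a transition has RTE_ACL_NODE_MATCH set,
exactly as acl_run_scalar.c and acl_run_sse.c below do.
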
diff --git a/lib/librte_acl/acl_run_scalar.c b/lib/librte_acl/acl_run_scalar.c
new file mode 100644
index 0000000..4bf58c7
--- /dev/null
+++ b/lib/librte_acl/acl_run_scalar.c
@@ -0,0 +1,197 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+
+int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+        uint32_t *results, uint32_t num, uint32_t categories);
+
+/*
+ * Resolve priority for multiple results (scalar version).
+ * This consists of comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_scalar(uint64_t transition, int n,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	const struct rte_acl_match_results *p, uint32_t categories)
+{
+	uint32_t i;
+	int32_t *saved_priority;
+	uint32_t *saved_results;
+	const int32_t *priority;
+	const uint32_t *results;
+
+	saved_results = parms[n].cmplt->results;
+	saved_priority = parms[n].cmplt->priority;
+
+	/* results and priorities for completed trie */
+	results = p[transition].results;
+	priority = p[transition].priority;
+
+	/* if this is not the first completed trie */
+	if (parms[n].cmplt->count != ctx->num_tries) {
+		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
+
+			if (saved_priority[i] <= priority[i]) {
+				saved_priority[i] = priority[i];
+				saved_results[i] = results[i];
+			}
+			if (saved_priority[i + 1] <= priority[i + 1]) {
+				saved_priority[i + 1] = priority[i + 1];
+				saved_results[i + 1] = results[i + 1];
+			}
+			if (saved_priority[i + 2] <= priority[i + 2]) {
+				saved_priority[i + 2] = priority[i + 2];
+				saved_results[i + 2] = results[i + 2];
+			}
+			if (saved_priority[i + 3] <= priority[i + 3]) {
+				saved_priority[i + 3] = priority[i + 3];
+				saved_results[i + 3] = results[i + 3];
+			}
+		}
+	} else {
+		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
+			saved_priority[i] = priority[i];
+			saved_priority[i + 1] = priority[i + 1];
+			saved_priority[i + 2] = priority[i + 2];
+			saved_priority[i + 3] = priority[i + 3];
+
+			saved_results[i] = results[i];
+			saved_results[i + 1] = results[i + 1];
+			saved_results[i + 2] = results[i + 2];
+			saved_results[i + 3] = results[i + 3];
+		}
+	}
+}
+
+/*
+ * When processing the transition, rather than using if/else
+ * construct, the offset is calculated for DFA and QRANGE and
+ * then conditionally added to the address based on node type.
+ * This is done to avoid branch mis-predictions. Since the
+ * offset is rather simple calculation it is more efficient
+ * to do the calculation and do a condition move rather than
+ * a conditional branch to determine which calculation to do.
+ */
+static inline uint32_t
+scan_forward(uint32_t input, uint32_t max)
+{
+	return (input == 0) ? max : rte_bsf32(input);
+}
+
+static inline uint64_t
+scalar_transition(const uint64_t *trans_table, uint64_t transition,
+	uint8_t input)
+{
+	uint32_t addr, index, ranges, x, a, b, c;
+
+	/* break transition into component parts */
+	ranges = transition >> (sizeof(index) * CHAR_BIT);
+
+	/* calc address for a QRANGE node */
+	c = input * SCALAR_QRANGE_MULT;
+	a = ranges | SCALAR_QRANGE_MIN;
+	index = transition & ~RTE_ACL_NODE_INDEX;
+	a -= (c & SCALAR_QRANGE_MASK);
+	b = c & SCALAR_QRANGE_MIN;
+	addr = transition ^ index;
+	a &= SCALAR_QRANGE_MIN;
+	a ^= (ranges ^ b) & (a ^ b);
+	x = scan_forward(a, 32) >> 3;
+	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
+
+	/* pickup next transition */
+	transition = *(trans_table + addr);
+	return transition;
+}
+
+int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	int n;
+	uint64_t transition0, transition1;
+	uint32_t input0, input1;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SCALAR];
+	struct completion cmplt[MAX_SEARCHES_SCALAR];
+	struct parms parms[MAX_SEARCHES_SCALAR];
+
+	if (categories != 1 &&
+		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
+		return -EINVAL;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
+		categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	transition0 = index_array[0];
+	transition1 = index_array[1];
+
+	while (flows.started > 0) {
+
+		input0 = GET_NEXT_4BYTES(parms, 0);
+		input1 = GET_NEXT_4BYTES(parms, 1);
+
+		for (n = 0; n < 4; n++) {
+			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
+				transition0 = scalar_transition(flows.trans,
+					transition0, (uint8_t)input0);
+
+			input0 >>= CHAR_BIT;
+
+			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
+				transition1 = scalar_transition(flows.trans,
+					transition1, (uint8_t)input1);
+
+			input1 >>= CHAR_BIT;
+
+		}
+		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
+			transition0 = acl_match_check(transition0,
+				0, ctx, parms, &flows, resolve_priority_scalar);
+			transition1 = acl_match_check(transition1,
+				1, ctx, parms, &flows, resolve_priority_scalar);
+
+		}
+	}
+	return 0;
+}
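
scalar_transition() above computes the QRANGE offset without branching; the
underlying idea (the same one the comments around acl_calc_addr() spell out
for the SSE path) is to count how many of a quad-range node's four signed
8-bit boundaries the input byte compares above, and add that count to the
node index. A plain, branchy sketch of that computation, illustrative only
and assuming the boundary layout described in those comments:

#include <stdint.h>

/*
 * Illustrative only: the offset a QRANGE node adds to its base index.
 * A quad-range node carries up to four signed 8-bit range boundaries
 * (held in the upper 32 bits of the transition word); with the boundaries
 * sorted, counting those the input byte exceeds gives the same value that
 * scalar_transition() derives via its branch-free bit arithmetic.
 */
static inline uint32_t
qrange_offset_sketch(const int8_t bounds[4], uint8_t input)
{
	uint32_t i, off = 0;

	for (i = 0; i != 4; i++)
		off += ((int8_t)input > bounds[i]);
	return off;
}
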
diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
new file mode 100644
index 0000000..7ae63dd
--- /dev/null
+++ b/lib/librte_acl/acl_run_sse.c
@@ -0,0 +1,630 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+
+int
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
+        uint32_t *results, uint32_t num, uint32_t categories);
+
+enum {
+	SHUFFLE32_SLOT1 = 0xe5,
+	SHUFFLE32_SLOT2 = 0xe6,
+	SHUFFLE32_SLOT3 = 0xe7,
+	SHUFFLE32_SWAP64 = 0x4e,
+};
+
+static const rte_xmm_t mm_type_quad_range = {
+	.u32 = {
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+	},
+};
+
+static const rte_xmm_t mm_type_quad_range64 = {
+	.u32 = {
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		0,
+		0,
+	},
+};
+
+static const rte_xmm_t mm_shuffle_input = {
+	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
+};
+
+static const rte_xmm_t mm_shuffle_input64 = {
+	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
+};
+
+static const rte_xmm_t mm_ones_16 = {
+	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
+};
+
+static const rte_xmm_t mm_bytes = {
+	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
+};
+
+static const rte_xmm_t mm_bytes64 = {
+	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
+};
+
+static const rte_xmm_t mm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_xmm_t mm_match_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		0,
+		RTE_ACL_NODE_MATCH,
+		0,
+	},
+};
+
+static const rte_xmm_t mm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_xmm_t mm_index_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		0,
+		0,
+	},
+};
+
+
+/*
+ * Resolve priority for multiple results (sse version).
+ * This consists of comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_sse(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
+	struct parms *parms, const struct rte_acl_match_results *p,
+	uint32_t categories)
+{
+	uint32_t x;
+	xmm_t results, priority, results1, priority1, selector;
+	xmm_t *saved_results, *saved_priority;
+
+	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
+
+		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
+		saved_priority =
+			(xmm_t *)(&parms[n].cmplt->priority[x]);
+
+		/* get results and priorities for completed trie */
+		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
+		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
+
+		/* if this is not the first completed trie */
+		if (parms[n].cmplt->count != ctx->num_tries) {
+
+			/* get running best results and their priorities */
+			results1 = MM_LOADU(saved_results);
+			priority1 = MM_LOADU(saved_priority);
+
+			/* select results that are highest priority */
+			selector = MM_CMPGT32(priority1, priority);
+			results = MM_BLENDV8(results, results1, selector);
+			priority = MM_BLENDV8(priority, priority1, selector);
+		}
+
+		/* save running best results and their priorities */
+		MM_STOREU(saved_results, results);
+		MM_STOREU(saved_priority, priority);
+	}
+}
+
+/*
+ * Extract transitions from an XMM register and check for any matches
+ */
+static void
+acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
+	struct parms *parms, struct acl_flow_data *flows)
+{
+	uint64_t transition1, transition2;
+
+	/* extract transition from low 64 bits. */
+	transition1 = MM_CVT64(*indicies);
+
+	/* extract transition from high 64 bits. */
+	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
+	transition2 = MM_CVT64(*indicies);
+
+	transition1 = acl_match_check(transition1, slot, ctx,
+		parms, flows, resolve_priority_sse);
+	transition2 = acl_match_check(transition2, slot + 1, ctx,
+		parms, flows, resolve_priority_sse);
+
+	/* update indicies with new transitions. */
+	*indicies = MM_SET64(transition2, transition1);
+}
+
+/*
+ * Check for a match in 2 transitions (contained in SSE register)
+ */
+static inline void
+acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
+{
+	xmm_t temp;
+
+	temp = MM_AND(match_mask, *indicies);
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indicies, slot, ctx, parms, flows);
+		temp = MM_AND(match_mask, *indicies);
+	}
+}
+
+/*
+ * Check for any match in 4 transitions (contained in 2 SSE registers)
+ */
+static inline void
+acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
+	xmm_t match_mask)
+{
+	xmm_t temp;
+
+	/* put low 32 bits of each transition into one register */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
+		0x88);
+	/* test for match node */
+	temp = MM_AND(match_mask, temp);
+
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indicies1, slot, ctx, parms, flows);
+		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
+
+		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
+					(__m128)*indicies2,
+					0x88);
+		temp = MM_AND(match_mask, temp);
+	}
+}
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes don't move.
+ */
+static inline xmm_t
+acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	xmm_t *indicies1, xmm_t *indicies2)
+{
+	xmm_t addr, node_types, temp;
+
+	/*
+	 * Note that no transition is done for a match
+	 * node and therefore a stream freezes when
+	 * it reaches a match.
+	 */
+
+	/* Shuffle low 32 into temp and high 32 into indicies2 */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
+		0x88);
+	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
+		(__m128)*indicies2, 0xdd);
+
+	/* Calc node type and node addr */
+	node_types = MM_ANDNOT(index_mask, temp);
+	addr = MM_AND(index_mask, temp);
+
+	/*
+	 * Calc addr for DFAs - addr = dfa_index + input_byte
+	 */
+
+	/* mask for DFA type (0) nodes */
+	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
+
+	/* add input byte to DFA position */
+	temp = MM_AND(temp, bytes);
+	temp = MM_AND(temp, next_input);
+	addr = MM_ADD32(addr, temp);
+
+	/*
+	 * Calc addr for Range nodes -> range_index + range(input)
+	 */
+	node_types = MM_CMPEQ32(node_types, type_quad_range);
+
+	/*
+	 * Calculate number of range boundaries that are less than the
+	 * input value. Range boundaries for each node are in signed 8 bit,
+	 * ordered from -128 to 127 in the indicies2 register.
+	 * This is effectively a popcnt of bytes that are greater than the
+	 * input byte.
+	 */
+
+	/* shuffle input byte to all 4 positions of 32 bit value */
+	temp = MM_SHUFFLE8(next_input, shuffle_input);
+
+	/* check ranges */
+	temp = MM_CMPGT8(temp, *indicies2);
+
+	/* convert -1 to 1 (bytes greater than input byte) */
+	temp = MM_SIGN8(temp, temp);
+
+	/* horizontal add pairs of bytes into words */
+	temp = MM_MADD8(temp, temp);
+
+	/* horizontal add pairs of words into dwords */
+	temp = MM_MADD16(temp, ones_16);
+
+	/* mask to range type nodes */
+	temp = MM_AND(temp, node_types);
+
+	/* add index into node position */
+	return MM_ADD32(addr, temp);
+}
+
+/*
+ * Process 4 transitions (in 2 SIMD registers) in parallel
+ */
+static inline xmm_t
+transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
+{
+	xmm_t addr;
+	uint64_t trans0, trans2;
+
+	 /* Calculate the address (array index) for all 4 transitions. */
+
+	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
+		bytes, type_quad_range, indicies1, indicies2);
+
+	 /* Gather 64 bit transitions and pack back into 2 registers. */
+
+	trans0 = trans[MM_CVT32(addr)];
+
+	/* get slot 2 */
+
+	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
+	trans2 = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+
+	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
+
+	/* get slot 3 */
+
+	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
+	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
+
+	return MM_SRL32(next_input, 8);
+}
+
+/*
+ * Execute trie traversal with 8 traversals in parallel
+ */
+static inline int
+search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE8];
+	struct completion cmplt[MAX_SEARCHES_SSE8];
+	struct parms parms[MAX_SEARCHES_SSE8];
+	xmm_t input0, input1;
+	xmm_t indicies1, indicies2, indicies3, indicies4;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	/*
+	 * indicies1 contains index_array[0,1]
+	 * indicies2 contains index_array[2,3]
+	 * indicies3 contains index_array[4,5]
+	 * indicies4 contains index_array[6,7]
+	 */
+
+	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
+	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
+
+	 /* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indicies1, &indicies2, mm_match_mask.m);
+	acl_match_check_x4(4, ctx, parms, &flows,
+		&indicies3, &indicies4, mm_match_mask.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
+			0);
+		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
+			0);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
+
+		 /* Process the 4 bytes of input on each stream. */
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		 /* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indicies1, &indicies2, mm_match_mask.m);
+		acl_match_check_x4(4, ctx, parms, &flows,
+			&indicies3, &indicies4, mm_match_mask.m);
+	}
+
+	return 0;
+}
+
+/*
+ * Execute trie traversal with 4 traversals in parallel
+ */
+static inline int
+search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	 uint32_t *results, int total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE4];
+	struct completion cmplt[MAX_SEARCHES_SSE4];
+	struct parms parms[MAX_SEARCHES_SSE4];
+	xmm_t input, indicies1, indicies2;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	/* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indicies1, &indicies2, mm_match_mask.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
+
+		/* Process the 4 bytes of input on each stream. */
+		input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		 input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		 input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		 input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		/* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indicies1, &indicies2, mm_match_mask.m);
+	}
+
+	return 0;
+}
+
+static inline xmm_t
+transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	const uint64_t *trans, xmm_t *indicies1)
+{
+	uint64_t t;
+	xmm_t addr, indicies2;
+
+	indicies2 = MM_XOR(ones_16, ones_16);
+
+	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
+		bytes, type_quad_range, indicies1, &indicies2);
+
+	/* Gather 64 bit transitions and pack 2 per register. */
+
+	t = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
+
+	return MM_SRL32(next_input, 8);
+}
+
+/*
+ * Execute trie traversal with 2 traversals in parallel.
+ */
+static inline int
+search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE2];
+	struct completion cmplt[MAX_SEARCHES_SSE2];
+	struct parms parms[MAX_SEARCHES_SSE2];
+	xmm_t input, indicies;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indicies = MM_LOADU((xmm_t *) &index_array[0]);
+
+	/* Check for any matches. */
+	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+
+		/* Process the 4 bytes of input on each stream. */
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		/* Check for any matches. */
+		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
+			mm_match_mask64.m);
+	}
+
+	return 0;
+}
+
+int
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	if (categories != 1 &&
+		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
+		return -EINVAL;
+
+	if (likely(num >= MAX_SEARCHES_SSE8))
+		return search_sse_8(ctx, data, results, num, categories);
+	else if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+	else
+		return search_sse_2(ctx, data, results, num, categories);
+}
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 7c288bd..741bed4 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -33,11 +33,72 @@
 
 #include <rte_acl.h>
 #include "acl.h"
+#include "acl_run.h"
 
 #define	BIT_SIZEOF(x)	(sizeof(x) * CHAR_BIT)
 
 TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
 
+extern int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+        uint32_t *results, uint32_t num, uint32_t categories);
+
+extern int
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
+        uint32_t *results, uint32_t num, uint32_t categories);
+
+static rte_acl_classify_t classify_fns[] = {
+	[RTE_ACL_CLASSIFY_DEFAULT] = rte_acl_classify_scalar,
+	[RTE_ACL_CLASSIFY_SCALAR] = rte_acl_classify_scalar,
+	[RTE_ACL_CLASSIFY_SSE] = rte_acl_classify_sse,
+};
+
+
+extern int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+        uint32_t *results, uint32_t num, uint32_t categories);
+
+/* by default, use the always available scalar code path. */
+static enum rte_acl_classify_alg rte_acl_default_classify = RTE_ACL_CLASSIFY_SCALAR;
+
+void rte_acl_set_default_classify(enum rte_acl_classify_alg alg)
+{
+	rte_acl_default_classify = alg;
+}
+
+void rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
+{
+	ctx->alg = alg;
+}
+
+static void __attribute__((constructor))
+rte_acl_init(void)
+{
+	enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_DEFAULT;
+
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+		alg = RTE_ACL_CLASSIFY_SSE;
+
+	rte_acl_set_default_classify(alg);
+}
+
+int rte_acl_classify(const struct rte_acl_ctx *ctx,
+		     const uint8_t **data,
+		     uint32_t *results, uint32_t num,
+		     uint32_t categories)
+{
+	return classify_fns[ctx->alg](ctx, data, results, num, categories);
+}
+
+int rte_acl_classify_alg(const struct rte_acl_ctx *ctx,
+			 enum rte_acl_classify_alg alg,
+			 const uint8_t **data,
+			 uint32_t *results, uint32_t num,
+			 uint32_t categories)
+{
+	return classify_fns[alg](ctx, data, results, num, categories);
+}
+
 struct rte_acl_ctx *
 rte_acl_find_existing(const char *name)
 {
@@ -165,6 +226,7 @@ rte_acl_create(const struct rte_acl_param *param)
 		ctx->max_rules = param->max_rule_num;
 		ctx->rule_sz = param->rule_size;
 		ctx->socket_id = param->socket_id;
+		ctx->alg = rte_acl_default_classify;
 		snprintf(ctx->name, sizeof(ctx->name), "%s", param->name);
 
 		te->data = (void *) ctx;
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index afc0f69..c092a49 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -259,39 +259,6 @@ void
 rte_acl_reset(struct rte_acl_ctx *ctx);
 
 /**
- * Search for a matching ACL rule for each input data buffer.
- * Each input data buffer can have up to *categories* matches.
- * That implies that results array should be big enough to hold
- * (categories * num) elements.
- * Also categories parameter should be either one or multiple of
- * RTE_ACL_RESULTS_MULTIPLIER and can't be bigger than RTE_ACL_MAX_CATEGORIES.
- * If more than one rule is applicable for given input buffer and
- * given category, then rule with highest priority will be returned as a match.
- * Note, that it is a caller responsibility to ensure that input parameters
- * are valid and point to correct memory locations.
- *
- * @param ctx
- *   ACL context to search with.
- * @param data
- *   Array of pointers to input data buffers to perform search.
- *   Note that all fields in input data buffers supposed to be in network
- *   byte order (MSB).
- * @param results
- *   Array of search results, *categories* results per each input data buffer.
- * @param num
- *   Number of elements in the input data buffers array.
- * @param categories
- *   Number of maximum possible matches for each input buffer, one possible
- *   match per category.
- * @return
- *   zero on successful completion.
- *   -EINVAL for incorrect arguments.
- */
-int
-rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories);
-
-/**
  * Perform scalar search for a matching ACL rule for each input data buffer.
  * Note, that while the search itself will avoid explicit use of SSE/AVX
  * intrinsics, code for comparing matching results/priorities still might use
@@ -323,9 +290,36 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
  *   zero on successful completion.
  *   -EINVAL for incorrect arguments.
  */
-int
-rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories);
+
+enum rte_acl_classify_alg {
+	RTE_ACL_CLASSIFY_DEFAULT = 0,
+	RTE_ACL_CLASSIFY_SCALAR = 1,
+	RTE_ACL_CLASSIFY_SSE = 2,
+};
+
+extern int
+rte_acl_classify(const struct rte_acl_ctx *ctx,
+		 const uint8_t **data,
+		 uint32_t *results, uint32_t num,
+		 uint32_t categories);
+
+extern int
+rte_acl_classify_alg(const struct rte_acl_ctx *ctx,
+		 enum rte_acl_classify_alg alg,
+		 const uint8_t **data,
+		 uint32_t *results, uint32_t num,
+		 uint32_t categories);
+/*
+ * Set the default classify algorithm for newly allocated classify contexts
+ */
+extern void
+rte_acl_set_default_classify(enum rte_acl_classify_alg alg);
+
+/*
+ * Override the default classifier function for a given ctx
+ */
+extern void
+rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg);
 
 /**
  * Dump an ACL context structure to the console.
-- 
1.9.3

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv4] librte_acl make it build/work for 'default' target
  2014-08-28 20:38 ` [dpdk-dev] [PATCHv4] " Neil Horman
@ 2014-08-29 17:58   ` Ananyev, Konstantin
  2014-09-01 11:05     ` Thomas Monjalon
  0 siblings, 1 reply; 21+ messages in thread
From: Ananyev, Konstantin @ 2014-08-29 17:58 UTC (permalink / raw)
  To: Neil Horman, dev


> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Thursday, August 28, 2014 9:38 PM
> To: dev@dpdk.org
> Cc: Neil Horman; Ananyev, Konstantin; thomas.monjalon@6wind.com
> Subject: [PATCHv4] librte_acl make it build/work for 'default' target
> 
> Make ACL library to build/work on 'default' architecture:
> - make rte_acl_classify_scalar really scalar
>  (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> - Provide two versions of rte_acl_classify code path:
>   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
>   and upper, return -ENOTSUP on lower arch.
>   rte_acl_classify_scalar() - a slower version, but could be build and used
>   on all systems.
> - keep common code shared between these two codepaths.
> 
> v2 chages:
>  run-time selection of most appropriate code-path for given ISA.
>  By default the highest supprted one is selected.
>  User can still override that selection by manually assigning new value to
>  the global function pointer rte_acl_default_classify.
>  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
>  points to.
> 
> V3 Changes
>  Updated classify pointer to be a function so as to better preserve ABI
>  Removed macro definitions for match check functions to make them static inline
> 
> V4 Changes
>  Rewrote classification selection mechanism to use a function table, so that we
> can just store the preferred alg in the rte_acl_ctx struct so that multiprocess
> access works.  I understand that leaves us with an extra load instruction, but I
> think that's ok, because it also allows...
> 
>  Addition of a new function rte_acl_classify_alg.  This function lets you
> specify an enum value to override the acl context's default algorithm when doing a
> classification.  This allows an application to specify a classification
> algorithm without needing to publicize each method.  I know there was concern
> over keeping those methods public, but we don't have a static ABI at the moment,
> so this seems to me a reasonable thing to do, as it gives us less of an ABI
> surface to worry about.

Good way to overcome the problem.
From what I am seeing it adds a tiny slowdown (as expected) ...
Though it provides good flexibility and I don't have any better ideas.
So I'd say let's stick with that approach.

Below are few technical comments.

Thanks
Konstantin

> 
>  Fixed misc missed static declarations
> 
>  Removed acl_match_check.h and moved match_check function to acl_run.h
> 
>  typdeffed function pointer to match check.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: konstantin.ananyev@intel.com
> CC: thomas.monjalon@6wind.com
> ---
>  app/test-acl/main.c             |  13 +-
>  app/test/test_acl.c             |  10 +-
>  lib/librte_acl/Makefile         |   5 +-
>  lib/librte_acl/acl.h            |   1 +
>  lib/librte_acl/acl_bld.c        |   5 +-
>  lib/librte_acl/acl_run.c        | 944 ----------------------------------------
>  lib/librte_acl/acl_run.h        | 271 ++++++++++++
>  lib/librte_acl/acl_run_scalar.c | 197 +++++++++
>  lib/librte_acl/acl_run_sse.c    | 630 +++++++++++++++++++++++++++
>  lib/librte_acl/rte_acl.c        |  62 +++
>  lib/librte_acl/rte_acl.h        |  66 ++-
>  11 files changed, 1208 insertions(+), 996 deletions(-)
>  delete mode 100644 lib/librte_acl/acl_run.c
>  create mode 100644 lib/librte_acl/acl_run.h
>  create mode 100644 lib/librte_acl/acl_run_scalar.c
>  create mode 100644 lib/librte_acl/acl_run_sse.c
> 


> diff --git a/app/test/test_acl.c b/app/test/test_acl.c
> index 869f6d3..2169f59 100644
> --- a/app/test/test_acl.c
> +++ b/app/test/test_acl.c
> @@ -859,7 +859,7 @@ test_invalid_parameters(void)
>  	}
> 
>  	/* cover invalid but positive categories in classify */
> -	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 3);
> +	result = rte_acl_classify(acx, NULL, NULL, 0, 3);

Typo, should be:
rte_acl_classify_alg(acx, RTE_ACL_CLASSIFY_SCALAR, NULL, NULL, 0, 3); 

> diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
> index b9d63fd..9236b7b 100644
> --- a/lib/librte_acl/acl.h
> +++ b/lib/librte_acl/acl.h
> @@ -168,6 +168,7 @@ struct rte_acl_ctx {
>  	void               *mem;
>  	size_t              mem_sz;
>  	struct rte_acl_config config; /* copy of build config. */
> +	enum rte_acl_classify_alg alg;
>  };

Each rte_acl_build() will reset all fields of rte_acl_ctx starting from num_categories and below.
So we need to move alg somewhere above num_categories:

--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -153,6 +153,7 @@ struct rte_acl_ctx {
        /** Name of the ACL context. */
        int32_t             socket_id;
        /** Socket ID to allocate memory from. */
+       enum rte_acl_classify_alg alg;
        void               *rules;
        uint32_t            max_rules;
        uint32_t            rule_sz;
@@ -168,9 +169,11 @@ struct rte_acl_ctx {
        void               *mem;
        size_t              mem_sz;
        struct rte_acl_config config; /* copy of build config. */
-       enum rte_acl_classify_alg alg;
 };

> diff --git a/lib/librte_acl/acl_run_scalar.c b/lib/librte_acl/acl_run_scalar.c
> new file mode 100644
> index 0000000..4bf58c7
> --- /dev/null
> +++ b/lib/librte_acl/acl_run_scalar.c
> @@ -0,0 +1,197 @@
> +
> +#include "acl_run.h"
> +
> +int
> +rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);

No need to put this declaration here.
I think you can put both rte_acl_classify_sse() and rte_acl_classify_scalar() into acl.h (it is an internal lib header).
And remove the other declarations of these functions from rte_acl.c.
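Something like this in acl.h (just a sketch; it only moves the prototypes that are already in the patch, nothing new):

/* lib/librte_acl/acl.h */

int
rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories);

int
rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories);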

> diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
> new file mode 100644
> index 0000000..7ae63dd
> --- /dev/null
> +++ b/lib/librte_acl/acl_run_sse.c
> +#include "acl_run.h"
> +
> +
> +int
> +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);
> +

Move to acl.h.

> diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> index 7c288bd..741bed4 100644
> --- a/lib/librte_acl/rte_acl.c
> +++ b/lib/librte_acl/rte_acl.c
> @@ -33,11 +33,72 @@
> 
>  #include <rte_acl.h>
>  #include "acl.h"
> +#include "acl_run.h"

acl_run.h contains definitions for a lot of functions and should be included only by acl_run_*.c.
I think it is better to move the typedef int (*rte_acl_classify_t) into acl.h and not include acl_run.h here.
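I.e. something like (a sketch; the parameter list just mirrors the classify functions in the patch):

/* lib/librte_acl/acl.h */
typedef int (*rte_acl_classify_t)(const struct rte_acl_ctx *, const uint8_t **,
	uint32_t *, uint32_t, uint32_t);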

> 
>  #define	BIT_SIZEOF(x)	(sizeof(x) * CHAR_BIT)
> 
>  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> 
> +extern int
> +rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);
> +
> +extern int
> +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);
> +

As above: I think it is safe to move these declarations into acl.h.

> +static rte_acl_classify_t classify_fns[] = {
> +	[RTE_ACL_CLASSIFY_DEFAULT] = rte_acl_classify_scalar,
> +	[RTE_ACL_CLASSIFY_SCALAR] = rte_acl_classify_scalar,
> +	[RTE_ACL_CLASSIFY_SSE] = rte_acl_classify_sse,
> +};

static const rte_acl_classify_t classify_fns[]
?

> +
> +
> +extern int
> +rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);

Duplicate.

> +
> +/* by default, use the always available scalar code path. */
> +static enum rte_acl_classify_alg rte_acl_default_classify = RTE_ACL_CLASSIFY_SCALAR;

Line is longer than 80 chars?
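E.g. it could simply be wrapped (just a formatting suggestion):

static enum rte_acl_classify_alg rte_acl_default_classify =
	RTE_ACL_CLASSIFY_SCALAR;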

> +
> +void rte_acl_set_default_classify(enum rte_acl_classify_alg alg)
> +{
> +	rte_acl_default_classify = alg;
> +}

void
rte_acl_set_default_classify(...)

Though, I am not sure why we need it to be public now.
Users can set up the ALG per context.
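E.g. an application can pick the algorithm per context (just a usage sketch; "acx" and "use_scalar" are placeholders for the application's own context handle and config flag):

	/* acx: ACL context created earlier by rte_acl_create() */
	if (use_scalar)
		rte_acl_set_ctx_classify(acx, RTE_ACL_CLASSIFY_SCALAR);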

> +
> +void rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
> +{
> +	ctx->alg = alg;
> +}

Same as above:
void
rte_acl_set_ctx_classify(...)
Plus, probably add a check that alg is a valid argument:
if ((uint32_t)alg < RTE_DIM(classify_fns)) {ctx->alg = alg; return 0;}
return -EINVAL;
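I.e. roughly (only a sketch; it assumes the return type is changed from void to int so the error can be reported):

int
rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
{
	/* reject unknown algorithms instead of indexing past classify_fns[] */
	if (ctx == NULL || (uint32_t)alg >= RTE_DIM(classify_fns))
		return -EINVAL;

	ctx->alg = alg;
	return 0;
}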

> +
> +static void __attribute__((constructor))
> +rte_acl_init(void)
> +{
> +	enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_DEFAULT;
> +
> +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> +		alg = RTE_ACL_CLASSIFY_SSE;
> +
> +	rte_acl_set_default_classify(alg);
> +}
> +
> +int rte_acl_classify(const struct rte_acl_ctx *ctx,
> +		     const uint8_t **data,
> +		     uint32_t *results, uint32_t num,
> +		     uint32_t categories)
> +{
> +	return classify_fns[ctx->alg](ctx, data, results, num, categories);
> +}
> +
> +int rte_acl_classify_alg(const struct rte_acl_ctx *ctx,
> +			 enum rte_acl_classify_alg alg,
> +			 const uint8_t **data,
> +			 uint32_t *results, uint32_t num,
> +			 uint32_t categories)
> +{
> +	return classify_fns[alg](ctx, data, results, num, categories);
> +}

Can you move the alg argument to be the last one?
That would prevent copying parameters between registers.

Plus, the same comment as above about the function definition style (return type on its own line).
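I.e. the declaration would become something like (just a sketch, keeping the parameter names from the patch):

int
rte_acl_classify_alg(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories,
	enum rte_acl_classify_alg alg);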

> +
>  struct rte_acl_ctx *
>  rte_acl_find_existing(const char *name)
>  {
> @@ -165,6 +226,7 @@ rte_acl_create(const struct rte_acl_param *param)
>  		ctx->max_rules = param->max_rule_num;
>  		ctx->rule_sz = param->rule_size;
>  		ctx->socket_id = param->socket_id;
> +		ctx->alg = rte_acl_default_classify;
>  		snprintf(ctx->name, sizeof(ctx->name), "%s", param->name);
> 
>  		te->data = (void *) ctx;
> diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
> index afc0f69..c092a49 100644
> --- a/lib/librte_acl/rte_acl.h
> +++ b/lib/librte_acl/rte_acl.h
> @@ -259,39 +259,6 @@ void
>  rte_acl_reset(struct rte_acl_ctx *ctx);
> 
>  /**
> - * Search for a matching ACL rule for each input data buffer.
> - * Each input data buffer can have up to *categories* matches.
> - * That implies that results array should be big enough to hold
> - * (categories * num) elements.
> - * Also categories parameter should be either one or multiple of
> - * RTE_ACL_RESULTS_MULTIPLIER and can't be bigger than RTE_ACL_MAX_CATEGORIES.
> - * If more than one rule is applicable for given input buffer and
> - * given category, then rule with highest priority will be returned as a match.
> - * Note, that it is a caller responsibility to ensure that input parameters
> - * are valid and point to correct memory locations.
> - *
> - * @param ctx
> - *   ACL context to search with.
> - * @param data
> - *   Array of pointers to input data buffers to perform search.
> - *   Note that all fields in input data buffers supposed to be in network
> - *   byte order (MSB).
> - * @param results
> - *   Array of search results, *categories* results per each input data buffer.
> - * @param num
> - *   Number of elements in the input data buffers array.
> - * @param categories
> - *   Number of maximum possible matches for each input buffer, one possible
> - *   match per category.
> - * @return
> - *   zero on successful completion.
> - *   -EINVAL for incorrect arguments.
> - */
> -int
> -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t num, uint32_t categories);
> -
> -/**
>   * Perform scalar search for a matching ACL rule for each input data buffer.
>   * Note, that while the search itself will avoid explicit use of SSE/AVX
>   * intrinsics, code for comparing matching results/priorities still might use
> @@ -323,9 +290,36 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
>   *   zero on successful completion.
>   *   -EINVAL for incorrect arguments.
>   */
> -int
> -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t num, uint32_t categories);
> +
> +enum rte_acl_classify_alg {
> +	RTE_ACL_CLASSIFY_DEFAULT = 0,
> +	RTE_ACL_CLASSIFY_SCALAR = 1,
> +	RTE_ACL_CLASSIFY_SSE = 2,
> +};
> +

I think you removed the wrong comment.
All public API function declarations are supposed to be preceded by a formal doxygen-style comment:
brief explanation, parameter and return value descriptions, etc.
Please restore the proper comment for it.
BTW, the two new functions above need formal comments too.
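For rte_acl_classify_alg, for example, something along these lines (only a sketch of the expected style, reusing the wording of the removed rte_acl_classify() comment):

/**
 * Search for a matching ACL rule for each input data buffer,
 * using the given classify algorithm instead of the one stored in the context.
 *
 * @param ctx
 *   ACL context to search with.
 * @param alg
 *   Classify algorithm to use for this call.
 * @param data
 *   Array of pointers to input data buffers to perform search.
 * @param results
 *   Array of search results, *categories* results per each input data buffer.
 * @param num
 *   Number of elements in the input data buffers array.
 * @param categories
 *   Number of maximum possible matches for each input buffer.
 * @return
 *   zero on successful completion.
 *   -EINVAL for incorrect arguments.
 */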

> +extern int
> +rte_acl_classify(const struct rte_acl_ctx *ctx,
> +		 const uint8_t **data,
> +		 uint32_t *results, uint32_t num,
> +		 uint32_t categories);
> +
> +extern int
> +rte_acl_classify_alg(const struct rte_acl_ctx *ctx,
> +		 enum rte_acl_classify_alg alg,
> +		 const uint8_t **data,
> +		 uint32_t *results, uint32_t num,
> +		 uint32_t categories);
> +/*
> + * Set the default classify algorithm for newly allocated classify contexts
> + */
> +extern void
> +rte_acl_set_default_classify(enum rte_acl_classify_alg alg);
> +
> +/*
> + * Override the default classifier function for a given ctx
> + */
> +extern void
> +rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg);
> 
>  /**
>   * Dump an ACL context structure to the console.
> --
> 1.9.3

Also, need to update examples/l3fwd-acl/ (remove rte_acl_classify_scalar() calls).
Something like:

diff --git a/examples/l3fwd-acl/main.c b/examples/l3fwd-acl/main.c
index 9b2c21b..8cbf202 100644
--- a/examples/l3fwd-acl/main.c
+++ b/examples/l3fwd-acl/main.c
@@ -278,15 +278,6 @@ send_single_packet(struct rte_mbuf *m, uint8_t port);
        (in) = end + 1;                                         \
 } while (0)

-#define CLASSIFY(context, data, res, num, cat) do {            \
-       if (scalar)                                             \
-               rte_acl_classify_scalar((context), (data),      \
-               (res), (num), (cat));                           \
-       else                                                    \
-               rte_acl_classify((context), (data),             \
-               (res), (num), (cat));                           \
-} while (0)
-
 /*
   * ACL rules should have higher priorities than route ones to ensure ACL rule
   * always be found when input packets have multi-matches in the database.
@@ -1253,6 +1244,9 @@ app_acl_init(void)

        dump_acl_config();

+       if (parm_config.scalar)
+                rte_acl_set_default_classify(RTE_ACL_CLASSIFY_SCALAR);
+
        /* Load  rules from the input file */
        if (add_rules(parm_config.rule_ipv4_name, &route_base_ipv4,
                        &route_num_ipv4, &acl_base_ipv4, &acl_num_ipv4,
@@ -1436,10 +1430,8 @@ main_loop(__attribute__((unused)) void *dummy)
        int socketid;
        const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1)
                        / US_PER_S * BURST_TX_DRAIN_US;
-       int scalar = parm_config.scalar;

        prev_tsc = 0;
-
        lcore_id = rte_lcore_id();
        qconf = &lcore_conf[lcore_id];
        socketid = rte_lcore_to_socket_id(lcore_id);
@@ -1503,7 +1495,8 @@ main_loop(__attribute__((unused)) void *dummy)
                                        nb_rx);

                                if (acl_search.num_ipv4) {
-                                       CLASSIFY(acl_config.acx_ipv4[socketid],
+                                       rte_acl_classify(
+                                               acl_config.acx_ipv4[socketid],
                                                acl_search.data_ipv4,
                                                acl_search.res_ipv4,
                                                acl_search.num_ipv4,
@@ -1515,7 +1508,8 @@ main_loop(__attribute__((unused)) void *dummy)
                                }

                                if (acl_search.num_ipv6) {
-                                       CLASSIFY(acl_config.acx_ipv6[socketid],
+                                       rte_acl_classify(
+                                               acl_config.acx_ipv6[socketid],
                                                acl_search.data_ipv6,
                                                acl_search.res_ipv6,
                                                acl_search.num_ipv6,

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCHv4] librte_acl make it build/work for 'default' target
  2014-08-29 17:58   ` Ananyev, Konstantin
@ 2014-09-01 11:05     ` Thomas Monjalon
  0 siblings, 0 replies; 21+ messages in thread
From: Thomas Monjalon @ 2014-09-01 11:05 UTC (permalink / raw)
  To: Ananyev, Konstantin, Neil Horman; +Cc: dev

2014-08-29 17:58, Ananyev, Konstantin:
> Good way to overcome the problem.
> From what I am seeing it adds a tiny slowdown (as expected) ... 
> Though it provides good flexibility and I don't have any better ideas.
> So I'd say let's stick with that approach.

Nice work guys.
I'd like to have this patch for release 1.7.1 which must be tagged tomorrow
(September 2nd). Do you think it's possible to have a final version of this
patch?

Thanks
-- 
Thomas

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-09-01 11:00 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-07 18:31 [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target Konstantin Ananyev
2014-08-07 20:11 ` Neil Horman
2014-08-07 20:58   ` Vincent JARDIN
2014-08-07 21:28     ` Chris Wright
2014-08-08  2:07       ` Neil Horman
2014-08-08 11:49   ` Ananyev, Konstantin
2014-08-08 12:25     ` Neil Horman
2014-08-08 13:09       ` Ananyev, Konstantin
2014-08-08 14:30         ` Neil Horman
2014-08-11 22:23           ` Thomas Monjalon
2014-08-21 20:15 ` [dpdk-dev] [PATCHv3] " Neil Horman
2014-08-25 16:30   ` Ananyev, Konstantin
2014-08-26 17:44     ` Neil Horman
2014-08-27 11:25       ` Ananyev, Konstantin
2014-08-27 18:56         ` Neil Horman
2014-08-27 19:18           ` Ananyev, Konstantin
2014-08-28  9:02             ` Richardson, Bruce
2014-08-28 15:55             ` Neil Horman
2014-08-28 20:38 ` [dpdk-dev] [PATCHv4] " Neil Horman
2014-08-29 17:58   ` Ananyev, Konstantin
2014-09-01 11:05     ` Thomas Monjalon
